
Graded Quiz: Advanced Concepts of Transformer Architecture (Generative AI Language Modeling with Transformers, IBM AI Engineering Professional Certificate) Answers 2025

1. How does a GPT-like model generate responses word by word?

✅ Generates one word at a time, conditioning on the prior words in the sequence
❌ Retrieves full response at once
❌ Uses encoder-decoder cross-attention
❌ Looks at future tokens

Explanation:
GPT is an autoregressive decoder: it predicts the next token using only the previous tokens, enforced by causal self-attention.
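A minimal sketch of the word-by-word loop. The bigram lookup table here is a made-up stand-in for GPT's next-token prediction (a real model scores the whole prefix), but the control flow is the same: predict, append, repeat.

```python
# Toy autoregressive generation: each step uses only prior tokens.
# NEXT is a hypothetical "model" mapping the last token to the next one.
NEXT = {
    "<s>": "the",
    "the": "cat",
    "cat": "sat",
    "sat": "</s>",
}

def generate(max_len=10):
    tokens = ["<s>"]
    for _ in range(max_len):
        nxt = NEXT[tokens[-1]]   # predict from prior tokens only
        if nxt == "</s>":        # stop token ends generation
            break
        tokens.append(nxt)
    return tokens[1:]

print(generate())  # ['the', 'cat', 'sat']
```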


2. Which attention should Priya apply for next-word prediction?

❌ Multi-head attention masking
❌ Global attention masking
❌ Local attention masking
✅ Causal attention masking

Explanation:
Causal masks hide future tokens so the model predicts using past tokens only — required for autoregressive models.
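A common way to build such a mask in PyTorch is an upper-triangular boolean matrix, where `True` marks the future positions to block (the sizes here are illustrative):

```python
import torch

# Causal (look-ahead) mask: position i may attend only to positions <= i.
# True marks entries to block, matching masked_fill conventions.
def causal_mask(seq_len: int) -> torch.Tensor:
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

mask = causal_mask(4)
print(mask)  # row 0 blocks positions 1..3, the last row blocks nothing
```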


3. Which architecture is commonly used in PyTorch for causal LM text generation?

❌ CNNs
❌ GRUs
❌ Bi-directional LSTM
✅ Transformer architecture

Explanation:
Transformers with causal masking dominate modern text generation (GPT, LLaMA, etc.).
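PyTorch's Transformer modules ship a built-in helper for exactly this: it produces the additive float mask (0 where attention is allowed, `-inf` where it is blocked), which the softmax then zeroes out.

```python
import torch
import torch.nn as nn

# Built-in helper for the causal mask used by nn.Transformer layers:
# 0.0 on and below the diagonal, -inf above it.
mask = nn.Transformer.generate_square_subsequent_mask(3)
print(mask)
```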


4. Which component generates the target output from encoded input?

❌ Positional encoding
✅ Decoder (it decodes the encoded input sequence to generate the output)
❌ Adds attention scores
❌ Encodes the input

Explanation:
In sequence-to-sequence transformers, the decoder uses encoder outputs + its own attention to produce translations.
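A minimal sketch with `torch.nn.Transformer` (toy dimensions, random tensors standing in for real embeddings): the decoder receives both the encoder output and the masked target-so-far, and cross-attention happens inside the forward pass.

```python
import torch
import torch.nn as nn

# Seq2seq sketch: the decoder attends to its own (causally masked)
# history *and* to the encoder's output via cross-attention.
model = nn.Transformer(d_model=16, nhead=2,
                       num_encoder_layers=1, num_decoder_layers=1)

src = torch.rand(5, 1, 16)   # encoded input: (src_len, batch, d_model)
tgt = torch.rand(3, 1, 16)   # target so far: (tgt_len, batch, d_model)
tgt_mask = nn.Transformer.generate_square_subsequent_mask(3)

out = model(src, tgt, tgt_mask=tgt_mask)
print(out.shape)  # one hidden state per target position
```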


5. How does BERT understand both left and right context?

❌ Causal masks
❌ Predict masked words only
✅ Bidirectional training method
❌ Contextual representation generation

Explanation:
BERT is a bidirectional encoder using masked language modeling, meaning it reads tokens from both directions.
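The masking step of MLM can be sketched in a few lines (the sentence and the masked index are made up for illustration): the model sees `[MASK]` with full context on both sides and must recover the hidden word.

```python
# MLM sketch: hide a token and let the model use *both* sides of the
# sentence to recover it (unlike causal, left-only context).
tokens = ["the", "cat", "sat", "on", "the", "mat"]
masked_pos = 2                 # pretend this index was randomly sampled
inputs = tokens.copy()
inputs[masked_pos] = "[MASK]"  # what the model sees
label = tokens[masked_pos]     # what it must predict

print(inputs)  # ['the', 'cat', '[MASK]', 'on', 'the', 'mat']
print(label)   # sat
```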


6. Why use Next Sentence Prediction (NSP)?

❌ Translate languages
✅ Determine if one sentence logically follows another
❌ Create text from a prompt
❌ Sentiment classification

Explanation:
NSP teaches BERT relationships between sentences — important for tasks like Q&A and inference.
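Concretely, an NSP training example packs two sentences into one input with special tokens plus a binary label (the sentences below are invented for illustration):

```python
# NSP input sketch: two sentences packed as [CLS] A [SEP] B [SEP],
# with a binary label for "B actually follows A".
sent_a = ["he", "went", "to", "the", "store"]
sent_b = ["he", "bought", "milk"]

tokens = ["[CLS]"] + sent_a + ["[SEP]"] + sent_b + ["[SEP]"]
is_next = 1   # positive pair; negatives swap in a random sentence

print(tokens)
```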


7. What embeddings does BERT use?

✅ Token embeddings, position embeddings, and segment embeddings
❌ Token embeddings only
❌ Token + position embeddings
❌ Position + segment embeddings

Explanation:
BERT uses all three to understand word meaning, order, and sentence pairing.
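The three embeddings are simply summed elementwise to form the model's input. A toy sketch (the sizes are small placeholders, not BERT's real dimensions):

```python
import torch
import torch.nn as nn

# BERT-style input embedding: token + position + segment, summed.
vocab, max_len, d = 100, 32, 8      # toy sizes, not BERT's real ones
tok_emb = nn.Embedding(vocab, d)
pos_emb = nn.Embedding(max_len, d)
seg_emb = nn.Embedding(2, d)        # segment 0 = sentence A, 1 = B

ids  = torch.tensor([[5, 17, 3]])   # token ids
pos  = torch.arange(3).unsqueeze(0) # positions 0, 1, 2
segs = torch.tensor([[0, 0, 1]])    # A, A, B

x = tok_emb(ids) + pos_emb(pos) + seg_emb(segs)
print(x.shape)  # (batch, seq_len, d)
```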


8. Best optimizer for fine-tuning BERT?

❌ SGD
✅ Adam
❌ RMSprop
❌ Adagrad

Explanation:
Adam (or AdamW) is standard due to adaptive learning rates + momentum, enabling stable fine-tuning.
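A typical fine-tuning setup looks like this, using `torch.optim.AdamW` with a small learning rate; the tiny linear layer stands in for a pretrained BERT plus classification head:

```python
import torch
import torch.nn as nn

# Fine-tuning sketch: AdamW with the small LR typical for BERT.
model = nn.Linear(768, 2)   # stand-in for BERT + a 2-class head
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)

x = torch.rand(4, 768)
loss = nn.functional.cross_entropy(model(x), torch.tensor([0, 1, 0, 1]))
loss.backward()
optimizer.step()            # adaptive per-parameter updates
```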


9. Which component ensures a decoder uses only earlier tokens?

❌ Linear layer
❌ Normalization
❌ Cross-attention
✅ Masking layer

Explanation:
The causal mask blocks future tokens during self-attention so predictions are autoregressive.
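Inside self-attention, the mask is applied to the score matrix before the softmax: blocked (future) positions are set to `-inf`, so they get exactly zero attention weight. A minimal sketch with random toy scores:

```python
import torch

# Apply a causal mask to toy attention scores: future positions get
# -inf before softmax, so they receive zero attention weight.
scores = torch.rand(4, 4)
mask = torch.triu(torch.ones(4, 4, dtype=torch.bool), diagonal=1)
weights = torch.softmax(scores.masked_fill(mask, float("-inf")), dim=-1)

print(weights[0])  # row 0 can only attend to position 0
```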


10. Which decoder component generates logits?

❌ Multi-head attention
❌ Normalization
❌ Feedforward
✅ Linear layer

Explanation:
The linear layer maps hidden states to vocabulary logits for prediction.
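This final projection (often called the LM head) is just an `nn.Linear` from the model dimension to the vocabulary size; sampling or argmax over the last position's logits picks the next token. Toy sizes below:

```python
import torch
import torch.nn as nn

# Final decoder projection: hidden states -> vocabulary logits.
d_model, vocab_size = 16, 1000      # toy sizes
lm_head = nn.Linear(d_model, vocab_size)

hidden = torch.rand(1, 5, d_model)  # (batch, seq_len, d_model)
logits = lm_head(hidden)            # (batch, seq_len, vocab_size)
next_token = logits[0, -1].argmax() # greedy pick for the next word

print(logits.shape)
```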


🧾 Summary Table

| Q# | Correct Answer | Key Concept |
|----|----------------|-------------|
| 1 | Generate word-by-word using prior tokens | Autoregressive decoding |
| 2 | Causal attention masking | Next-word prediction |
| 3 | Transformer architecture | Causal LM text generation |
| 4 | Decoder generates output | Seq2seq decoding |
| 5 | Bidirectional training | BERT's context understanding |
| 6 | NSP checks if one sentence follows another | BERT pretraining |
| 7 | Token + position + segment embeddings | BERT inputs |
| 8 | Adam | Best optimizer for BERT |
| 9 | Masking layer | Prevents future-token access |
| 10 | Linear layer | Output logits |