Graded Quiz: Advanced Concepts of Transformer Architecture | Generative AI Language Modeling with Transformers (IBM AI Engineering Professional Certificate) Answers 2025
1. How does a GPT-like model generate responses word by word?
✅ It generates one word at a time, conditioning on the prior words in the sequence.
❌ Retrieves full response at once
❌ Uses encoder-decoder cross-attention
❌ Looks at future tokens
Explanation:
GPT is an autoregressive decoder that predicts the next token using only the previous tokens (a causal mechanism).
2. Which attention should Priya apply for next-word prediction?
❌ Multi-head attention masking
❌ Global attention masking
❌ Local attention masking
✅ Causal attention masking
Explanation:
Causal masks hide future tokens so the model predicts using past tokens only — required for autoregressive models.
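As a minimal sketch of what such a mask looks like in PyTorch (the function name `causal_mask` is illustrative, not from the quiz): an upper-triangular boolean matrix where position i can only attend to positions up to i.

```python
import torch

# Illustrative causal (look-ahead) mask: entry [i, j] is True when
# token j lies in the future of token i and must be hidden.
def causal_mask(seq_len: int) -> torch.Tensor:
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

mask = causal_mask(4)
# Row 0 hides tokens 1..3; row 3 hides nothing (it sees all prior tokens).
print(mask)
```

Passing a mask like this to self-attention forces each position to predict using past tokens only.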
3. Which PyTorch technique is commonly used for causal LM text generation?
❌ CNNs
❌ GRUs
❌ Bi-directional LSTM
✅ Transformer architecture
Explanation:
Transformers with causal masking dominate modern text generation (GPT, LLaMA, etc.).
4. Which component generates the target output from encoded input?
❌ Positional encoding
✅ It decodes the encoded input sequence to generate the output.
❌ Adds attention scores
❌ Encodes the input
Explanation:
In sequence-to-sequence transformers, the decoder uses encoder outputs + its own attention to produce translations.
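A hedged sketch of that flow using PyTorch's built-in decoder modules (the dimensions below are illustrative, not anything specified by the quiz): the decoder consumes the encoder output ("memory") through cross-attention while causally masking its own inputs.

```python
import torch
import torch.nn as nn

# Toy sizes for illustration only.
layer = nn.TransformerDecoderLayer(d_model=32, nhead=4, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=2)

memory = torch.randn(1, 5, 32)   # encoder output: (batch, src_len, d_model)
tgt = torch.randn(1, 3, 32)      # decoder inputs generated so far
tgt_mask = nn.Transformer.generate_square_subsequent_mask(3)  # causal mask

out = decoder(tgt, memory, tgt_mask=tgt_mask)
print(out.shape)  # (1, 3, 32)
```

The decoder's cross-attention layers attend over `memory`, which is how encoded input steers the generated output.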
5. How does BERT understand both left and right context?
❌ Causal masks
❌ Predict masked words only
✅ Bidirectional training method
❌ Contextual representation generation
Explanation:
BERT is a bidirectional encoder using masked language modeling, meaning it reads tokens from both directions.
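A rough sketch of the masking step in masked language modeling (the token ids are made up; 103 is BERT's conventional `[MASK]` id, and 15% is the commonly cited masking rate): randomly chosen tokens are replaced, and the model must recover them from context on both sides.

```python
import torch

torch.manual_seed(0)
MASK_ID = 103  # BERT's [MASK] token id (convention)

tokens = torch.tensor([2023, 2003, 1037, 3231, 6251])  # arbitrary example ids
mask = torch.rand(tokens.shape, dtype=torch.float) < 0.15  # ~15% of positions
masked = torch.where(mask, torch.full_like(tokens, MASK_ID), tokens)
print(masked)
```

Because the targets sit in the middle of the sequence, nothing forces a left-to-right order, which is what lets BERT train bidirectionally.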
6. Why use Next Sentence Prediction (NSP)?
❌ Translate languages
✅ Determine if one sentence logically follows another
❌ Create text from a prompt
❌ Sentiment classification
Explanation:
NSP teaches BERT relationships between sentences — important for tasks like Q&A and inference.
7. What embeddings does BERT use?
✅ Token embeddings, position embeddings, and segment embeddings
❌ Token embeddings only
❌ Token + position embeddings
❌ Position + segment embeddings
Explanation:
BERT uses all three to understand word meaning, order, and sentence pairing.
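The three embeddings are simply summed element-wise before the encoder stack. A minimal sketch (vocabulary and model sizes here are toy values, not BERT's real ones):

```python
import torch
import torch.nn as nn

vocab_size, max_len, d_model = 100, 16, 8  # illustrative sizes
tok_emb = nn.Embedding(vocab_size, d_model)  # word identity
pos_emb = nn.Embedding(max_len, d_model)     # word order
seg_emb = nn.Embedding(2, d_model)           # sentence A vs. sentence B

token_ids = torch.tensor([[5, 7, 9]])
positions = torch.arange(token_ids.size(1)).unsqueeze(0)
segments = torch.zeros_like(token_ids)  # all tokens from sentence A

x = tok_emb(token_ids) + pos_emb(positions) + seg_emb(segments)
print(x.shape)  # (1, 3, 8)
```

Each addend carries one of the three signals the explanation lists: meaning, order, and sentence pairing.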
8. Best optimizer for fine-tuning BERT?
❌ SGD
✅ Adam
❌ RMSprop
❌ Adagrad
Explanation:
Adam (or AdamW) is standard due to adaptive learning rates + momentum, enabling stable fine-tuning.
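In PyTorch this is `torch.optim.AdamW`; a small fine-tuning learning rate such as 2e-5 is a common convention for BERT-style models, not a requirement. A sketch with a stand-in model:

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 2)  # stand-in for a BERT classification head

# AdamW decouples weight decay from the adaptive update.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)

loss = model(torch.randn(4, 8)).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```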
9. Which component ensures a decoder uses only earlier tokens?
❌ Linear layer
❌ Normalization
❌ Cross-attention
✅ Masking layer
Explanation:
The causal mask blocks future tokens during self-attention so predictions are autoregressive.
10. Which decoder component generates logits?
❌ Multi-head attention
❌ Normalization
❌ Feedforward
✅ Linear layer
Explanation:
The linear layer maps hidden states to vocabulary logits for prediction.
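A minimal sketch of that projection (sizes are illustrative): the linear head maps each hidden state to a score per vocabulary entry, and greedy decoding just takes the argmax at the last position.

```python
import torch
import torch.nn as nn

d_model, vocab_size = 16, 50  # toy sizes
lm_head = nn.Linear(d_model, vocab_size)

hidden = torch.randn(1, 3, d_model)         # decoder hidden states
logits = lm_head(hidden)                    # (1, 3, vocab_size)
next_token = logits[:, -1].argmax(dim=-1)   # greedy choice of the next token
print(logits.shape, next_token.shape)
```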
🧾 Summary Table
| Q# | Correct Answer | Key Concept |
|---|---|---|
| 1 | Generate word-by-word using prior tokens | Autoregressive decoding |
| 2 | Causal masking | Next-word prediction |
| 3 | Transformer | Causal LM text generation |
| 4 | Decoder generates output | Seq2seq decoding |
| 5 | Bidirectional training | BERT’s context understanding |
| 6 | NSP checks sentence order | BERT pretraining |
| 7 | Token + position + segment embeddings | BERT inputs |
| 8 | Adam | Best optimizer for BERT |
| 9 | Masking layer | Prevent future-token access |
| 10 | Linear layer | Output logits |