Graded Quiz: Advanced Concepts of Transformer Architecture | Generative AI Language Modeling with Transformers (IBM AI Engineering Professional Certificate) Answers 2025
1. How does a GPT-like model generate responses word by word?
✅ It generates one word at a time, conditioning on the prior words in the sequence.
❌ Retrieves full response at once
❌ Uses encoder-decoder cross-attention
❌ Looks at future tokens
Explanation:
GPT is an autoregressive decoder that predicts the next token using only the previous tokens (a causal mechanism).
2. Which attention should Priya apply for next-word prediction?
❌ Multi-head attention masking
❌ Global attention masking
❌ Local attention masking
✅ Causal attention masking
Explanation:
Causal masks hide future tokens so the model predicts using past tokens only — required for autoregressive models.
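As a minimal sketch of what such a mask looks like in PyTorch (the function name `causal_mask` is illustrative, not from the quiz): an upper-triangular boolean matrix where position i can only attend to positions up to i.

```python
import torch

# Illustrative causal (look-ahead) mask: entry [i, j] is True when
# token j lies in the future of token i and must be hidden.
def causal_mask(seq_len: int) -> torch.Tensor:
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

mask = causal_mask(4)
# Row 0 hides tokens 1..3; row 3 hides nothing (it sees all prior tokens).
print(mask)
```

Passing a mask like this to self-attention forces each position to predict using past tokens only.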
3. Which PyTorch technique is commonly used for causal LM text generation?
❌ CNNs
❌ GRUs
❌ Bi-directional LSTM
✅ Transformer architecture
Explanation:
Transformers with causal masking dominate modern text generation (GPT, LLaMA, etc.).
4. Which component generates the target output from encoded input?
❌ Positional encoding
✅ It decodes the encoded input sequence to generate the output.
❌ Adds attention scores
❌ Encodes the input
Explanation:
In sequence-to-sequence transformers, the decoder uses encoder outputs + its own attention to produce translations.
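A hedged sketch of that flow using PyTorch's built-in decoder modules (the dimensions below are illustrative, not anything specified by the quiz): the decoder consumes the encoder output ("memory") through cross-attention while causally masking its own inputs.

```python
import torch
import torch.nn as nn

# Toy sizes for illustration only.
layer = nn.TransformerDecoderLayer(d_model=32, nhead=4, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=2)

memory = torch.randn(1, 5, 32)   # encoder output: (batch, src_len, d_model)
tgt = torch.randn(1, 3, 32)      # decoder inputs generated so far
tgt_mask = nn.Transformer.generate_square_subsequent_mask(3)  # causal mask

out = decoder(tgt, memory, tgt_mask=tgt_mask)
print(out.shape)  # (1, 3, 32)
```

The decoder's cross-attention layers attend over `memory`, which is how encoded input steers the generated output.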
5. How does BERT understand both left and right context?
❌ Causal masks
❌ Predict masked words only
✅ Bidirectional training method
❌ Contextual representation generation
Explanation:
BERT is a bidirectional encoder using masked language modeling, meaning it reads tokens from both directions.
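A rough sketch of the masking step in masked language modeling (the token ids are made up; 103 is BERT's conventional `[MASK]` id, and 15% is the commonly cited masking rate): randomly chosen tokens are replaced, and the model must recover them from context on both sides.

```python
import torch

torch.manual_seed(0)
MASK_ID = 103  # BERT's [MASK] token id (convention)

tokens = torch.tensor([2023, 2003, 1037, 3231, 6251])  # arbitrary example ids
mask = torch.rand(tokens.shape, dtype=torch.float) < 0.15  # ~15% of positions
masked = torch.where(mask, torch.full_like(tokens, MASK_ID), tokens)
print(masked)
```

Because the targets sit in the middle of the sequence, nothing forces a left-to-right order, which is what lets BERT train bidirectionally.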
6. Why use Next Sentence Prediction (NSP)?
❌ Translate languages
✅ Determine if one sentence logically follows another
❌ Create text from a prompt
❌ Sentiment classification
Explanation:
NSP teaches BERT relationships between sentences — important for tasks like Q&A and inference.
7. What embeddings does BERT use?
✅ Token embeddings, position embeddings, and segment embeddings
❌ Token embeddings only
❌ Token + position embeddings
❌ Position + segment embeddings
Explanation:
BERT uses all three to understand word meaning, order, and sentence pairing.
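The three embeddings are simply summed element-wise before the encoder stack. A minimal sketch (vocabulary and model sizes here are toy values, not BERT's real ones):

```python
import torch
import torch.nn as nn

vocab_size, max_len, d_model = 100, 16, 8  # illustrative sizes
tok_emb = nn.Embedding(vocab_size, d_model)  # word identity
pos_emb = nn.Embedding(max_len, d_model)     # word order
seg_emb = nn.Embedding(2, d_model)           # sentence A vs. sentence B

token_ids = torch.tensor([[5, 7, 9]])
positions = torch.arange(token_ids.size(1)).unsqueeze(0)
segments = torch.zeros_like(token_ids)  # all tokens from sentence A

x = tok_emb(token_ids) + pos_emb(positions) + seg_emb(segments)
print(x.shape)  # (1, 3, 8)
```

Each addend carries one of the three signals the explanation lists: meaning, order, and sentence pairing.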
8. Best optimizer for fine-tuning BERT?
❌ SGD
✅ Adam
❌ RMSprop
❌ Adagrad
Explanation:
Adam (or AdamW) is standard due to adaptive learning rates + momentum, enabling stable fine-tuning.
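In PyTorch this is `torch.optim.AdamW`; a small fine-tuning learning rate such as 2e-5 is a common convention for BERT-style models, not a requirement. A sketch with a stand-in model:

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 2)  # stand-in for a BERT classification head

# AdamW decouples weight decay from the adaptive update.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)

loss = model(torch.randn(4, 8)).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```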
9. Which component ensures a decoder uses only earlier tokens?
❌ Linear layer
❌ Normalization
❌ Cross-attention
✅ Masking layer
Explanation:
The causal mask blocks future tokens during self-attention so predictions are autoregressive.
10. Which decoder component generates logits?
❌ Multi-head attention
❌ Normalization
❌ Feedforward
✅ Linear layer
Explanation:
The linear layer maps hidden states to vocabulary logits for prediction.
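A minimal sketch of that projection (sizes are illustrative): the linear head maps each hidden state to a score per vocabulary entry, and greedy decoding just takes the argmax at the last position.

```python
import torch
import torch.nn as nn

d_model, vocab_size = 16, 50  # toy sizes
lm_head = nn.Linear(d_model, vocab_size)

hidden = torch.randn(1, 3, d_model)         # decoder hidden states
logits = lm_head(hidden)                    # (1, 3, vocab_size)
next_token = logits[:, -1].argmax(dim=-1)   # greedy choice of the next token
print(logits.shape, next_token.shape)
```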
🧾 Summary Table
| Q# | Correct Answer | Key Concept |
|---|---|---|
| 1 | Generate word-by-word using prior tokens | Autoregressive decoding |
| 2 | Causal masking | Next-word prediction |
| 3 | Transformer | Causal LM text generation |
| 4 | Decoder generates output | Seq2seq decoding |
| 5 | Bidirectional training | BERT’s context understanding |
| 6 | NSP checks sentence order | BERT pretraining |
| 7 | Token + position + segment embeddings | BERT inputs |
| 8 | Adam | Best optimizer for BERT |
| 9 | Masking layer | Prevent future-token access |
| 10 | Linear layer | Output logits |