Graded Quiz: Fundamental Concepts of Transformer Architecture (Generative AI Language Modeling with Transformers, IBM AI Engineering Professional Certificate) Answers 2025
1. What does self-attention primarily allow a model to do?
❌ Eliminate unimportant words
❌ Generate paraphrases
❌ Identify parts of speech
✅ Represent each word using its surrounding context
Explanation:
Self-attention computes relationships between all tokens, enabling each word to encode meaning based on its context.
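As a minimal sketch of this idea (toy numbers, single head, and the simplifying assumption Q = K = V = X, i.e. no learned projections), each output row below is a softmax-weighted mix of all input rows, so every token's vector absorbs its surrounding context:

```python
import numpy as np

def self_attention(X):
    """Minimal single-head self-attention (toy: Q = K = V = X).

    Each output row is a weighted mix of ALL input rows, so every
    token's representation encodes its surrounding context.
    """
    scores = X @ X.T / np.sqrt(X.shape[-1])           # pairwise similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over tokens
    return weights @ X                                # context-mixed vectors

# Three toy token embeddings (rows); each output row depends on all inputs.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
out = self_attention(X)
print(out.shape)  # (3, 2): same shape, but every row now mixes context
```

A real transformer layer applies learned projection matrices to form Q, K, and V before this step; the mixing mechanism itself is the same.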
2. What parameter influences positional encoding values across embedding dimensions?
❌ Counts input tokens
❌ Tracks where each word appears
❌ Indicates phase offset
✅ Determines the frequency of sine and cosine waves
Explanation:
Sinusoidal positional encoding varies the sine/cosine frequency across embedding dimensions; the denominator term 10000^(2i/d_model) in the formula controls that frequency.
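A short sketch of the standard sinusoidal scheme (sizes here are arbitrary): the denominator 10000^(2i/d_model) gives each dimension pair its own wave frequency, so low dimensions oscillate quickly with position and high dimensions slowly.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding.

    The denominator 10000^(2i/d_model) sets the wave frequency for each
    pair of embedding dimensions: even indices get sine, odd get cosine.
    """
    pos = np.arange(seq_len)[:, None]              # token positions
    i = np.arange(0, d_model, 2)[None, :]          # dimension-pair indices
    angles = pos / np.power(10000.0, i / d_model)  # frequency-scaled angles
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(50, 16)
print(pe.shape)  # (50, 16)
```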
3. How does attention ensure the French word “chat” maps to “cat” in translation?
❌ Randomly pick value vectors
❌ Multiply values and keys only
❌ Replace query with key vector
✅ Match the query vector with the transposed key matrix and retrieve the corresponding value
Explanation:
Attention(Q, K, V) = softmax(QKᵀ/√d_k)·V → the scaled Q·Kᵀ similarity scores determine how strongly each value vector (here, the stored translation) is weighted in the output.
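The lookup can be illustrated with invented toy vectors (all numbers below are made up for the example): a query for “chat” aligns with the key for “cat”, so the softmax weights retrieve the value stored under that key.

```python
import numpy as np

# Toy key/value store. The vectors are invented for illustration only.
keys   = np.array([[1.0, 0.0],    # key for "cat"
                   [0.0, 1.0]])   # key for "dog"
values = np.array([[10.0, 0.0],   # value encoding the translation "cat"
                   [0.0, 10.0]])  # value encoding the translation "dog"

query = np.array([[0.95, 0.05]])  # query for "chat": points at "cat"'s key

scores  = query @ keys.T / np.sqrt(keys.shape[-1])        # scaled QK^T
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
output  = weights @ values        # weighted mix, dominated by "cat"'s value
print(weights)                    # first weight dominates -> "cat" retrieved
```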
4. Role of multi-head attention in summarization?
✅ Apply multiple scaled dot-product attention operations in parallel on different representation subspaces
❌ Mask future tokens everywhere
❌ Apply a single attention mechanism
❌ Multiply vectors without scaling
Explanation:
Multi-head attention lets the model focus on multiple aspects of meaning simultaneously.
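A compact sketch of the mechanism (random weights, toy sizes; a real layer would use separate learned projections per head): the model dimension is split into subspaces, scaled dot-product attention runs independently in each, and the head outputs are concatenated and projected.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """Sketch: project to Q/K/V, split into heads, attend per head
    (each head sees its own representation subspace), then merge."""
    seq, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    outs = []
    for h in range(n_heads):
        s = slice(h * d_head, (h + 1) * d_head)          # this head's subspace
        scores = Q[:, s] @ K[:, s].T / np.sqrt(d_head)   # scaled dot product
        outs.append(softmax(scores) @ V[:, s])
    return np.concatenate(outs, axis=-1) @ Wo            # concat + project

rng = np.random.default_rng(0)
d_model, n_heads = 8, 2
X = rng.normal(size=(4, d_model))
Ws = [rng.normal(size=(d_model, d_model)) for _ in range(4)]
out = multi_head_attention(X, *Ws, n_heads)
print(out.shape)  # (4, 8)
```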
5. What is the correct second step after creating embeddings?
❌ Normalize sentence length
❌ Create token index mapping
❌ Extract word features
✅ Apply positional encoding to embeddings
Explanation:
Transformers have no inherent notion of order, so positional encoding must be added immediately after embeddings.
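The two-step input pipeline can be sketched as follows (vocabulary, sizes, and the random embedding table are all made up for the example): tokens are first looked up in an embedding table, then the positional encoding is added element-wise.

```python
import numpy as np

# Step 1: map tokens to embeddings (toy vocabulary and random table).
vocab = {"the": 0, "cat": 1, "sat": 2}
d_model, seq = 4, 3
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))

token_ids = [vocab[w] for w in ["the", "cat", "sat"]]
embeddings = embedding_table[token_ids]          # (seq, d_model)

# Step 2: add sinusoidal positional encoding so order information survives.
pos = np.arange(seq)[:, None]
i = np.arange(0, d_model, 2)[None, :]
angles = pos / np.power(10000.0, i / d_model)
pe = np.zeros((seq, d_model))
pe[:, 0::2], pe[:, 1::2] = np.sin(angles), np.cos(angles)

x = embeddings + pe                              # input to the first layer
print(x.shape)  # (3, 4)
```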
🧾 Summary Table
| Q# | Correct Answer | Key Concept |
|---|---|---|
| 1 | Represent context | Purpose of self-attention |
| 2 | Frequency of sin/cos | Positional encoding math |
| 3 | QKᵀ → value | How attention selects outputs |
| 4 | Parallel attention heads | Multi-head attention role |
| 5 | Positional encoding | Transformer pipeline step |