Transformers: Sequence Models (Deep Learning Specialization) Answers: 2025
Question 1
A Transformer Network, like RNNs, GRUs, and LSTMs, can process information one word at a time (sequentially).
-
✅ False
-
❌ True
Explanation:
Transformers process all tokens in parallel using self-attention. Unlike RNNs, which are sequential by design, Transformers handle sequences non-sequentially, enabling faster and more efficient training.
Question 2
Transformer Network methodology is taken from: (Check all that apply)
-
❌ Convolutional Neural Network style of processing
-
✅ Attention mechanism
-
❌ Convolutional Neural Network style of architecture
-
❌ None of these
Explanation:
Transformers are built entirely on the self-attention mechanism (“Attention is All You Need”), not CNNs or RNNs. Attention lets each token directly relate to all others.
Question 3
What are the key inputs to computing the attention value for each word?
-
❌ Quotation, knowledge, and value
-
✅ Query, Key, and Value
-
❌ Query, knowledge, and vector
-
❌ Quotation, key, and vector
Explanation:
Self-attention operates on three vectors derived from each word’s embedding — Query (Q), Key (K), and Value (V) — to compute how much focus each word should place on the others.
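As a minimal sketch of where Q, K, and V come from (the shapes and random weights below are illustrative, not values from the course), each token's embedding is multiplied by three learned projection matrices:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_k = 8, 4                      # embedding size and Q/K/V size (illustrative)
x = rng.standard_normal((5, d_model))    # 5 token embeddings

# Learned projection matrices (randomly initialized here purely for illustration)
W_q = rng.standard_normal((d_model, d_k))
W_k = rng.standard_normal((d_model, d_k))
W_v = rng.standard_normal((d_model, d_k))

# One Query, Key, and Value row per token
Q, K, V = x @ W_q, x @ W_k, x @ W_v
print(Q.shape, K.shape, V.shape)         # each is (5, 4)
```

The point is simply that Q, K, and V are three different learned views of the same embeddings, which the attention formula then combines.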
Question 4
Which of the following correctly represents Attention?
-
❌ $Attention(Q,K,V) = \mathrm{softmax}\left(\frac{QV^T}{\sqrt{d_k}}\right)K$
-
❌ $Attention(Q,K,V) = \min\left(\frac{QV^T}{\sqrt{d_k}}\right)K$
-
❌ $Attention(Q,K,V) = \min\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
-
✅ $Attention(Q,K,V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
Explanation:
The Scaled Dot-Product Attention formula is
$Attention(Q,K,V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
where $d_k$ is the dimension of the key vectors; scaling by $\sqrt{d_k}$ keeps the dot products in a range where the softmax gradients remain stable.
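The formula can be sketched directly in numpy (input shapes here are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V, row by row."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # (n_q, n_k) similarity scores
    # Numerically stable softmax over the key axis
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V                           # weighted sum of value vectors

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((5, 4)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (5, 4): one context vector per query
```

Each output row is a mixture of the value vectors, weighted by how well that query matches each key.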
Question 5
Which statement represents Key (K) in self-attention?
-
✅ K = specific representations of words given a Q
-
❌ K = the order of the words in a sentence
-
❌ K = qualities of words given a Q
-
❌ K = interesting questions about the words in a sentence
Explanation:
In self-attention, K (Key) represents information used to determine how relevant each word is to a given query word. It encodes the features each word contributes to matching queries.
Question 6
What does i represent in this multi-head attention computation?
-
❌ Associated with the order of the words
-
❌ Associated with specific representations of words given a Q
-
❌ Associated with the ith word in a sentence
-
✅ Associated with the ith “head” (sequence)
Explanation:
Each attention head in multi-head attention computes attention with its own Q, K, and V projections. The index i denotes the head number, not the word position.
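A rough sketch of that structure (per-head weights are random here and all shapes are illustrative): each head i applies attention with its own projections, and the heads' outputs are concatenated and projected back.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, heads, W_o):
    """heads: list of (W_q_i, W_k_i, W_v_i) tuples; i indexes the head, not the word."""
    outputs = []
    for W_q, W_k, W_v in heads:                  # loop over heads i = 0, 1, ...
        Q, K, V = x @ W_q, x @ W_k, x @ W_v
        d_k = K.shape[-1]
        outputs.append(softmax(Q @ K.T / np.sqrt(d_k)) @ V)
    # Concatenate all heads, then project back to the model dimension
    return np.concatenate(outputs, axis=-1) @ W_o

rng = np.random.default_rng(0)
d_model, d_k, n_heads = 8, 4, 2
x = rng.standard_normal((5, d_model))
heads = [tuple(rng.standard_normal((d_model, d_k)) for _ in range(3))
         for _ in range(n_heads)]
W_o = rng.standard_normal((n_heads * d_k, d_model))
print(multi_head_attention(x, heads, W_o).shape)  # (5, 8)
```

Because each head has its own projections, different heads can attend to different kinds of relationships between the same tokens.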
Question 7
What is NOT necessary for the Decoder’s second block of Multi-Head Attention?
-
❌ V
-
❌ K
-
✅ Q
-
❌ All of the above
Explanation:
In the decoder’s second multi-head attention block, the Key (K) and Value (V) matrices come from the encoder’s output, while the Query (Q) comes from the decoder’s previous (masked) attention block. Q is therefore the one input that is not supplied by the encoder, which is why it is the answer here.
Question 8
What does the output of the encoder block contain?
-
❌ Softmax layer followed by a linear layer
-
✅ Contextual semantic embedding and positional encoding information
-
❌ Linear layer followed by a softmax layer
-
❌ Prediction of the next word
Explanation:
The encoder outputs a context-rich representation of each token — embedding meaning (context) and positional information — which becomes the input for the decoder’s attention layers.
Question 9
Why is positional encoding important in translation? (Check all that apply)
-
✅ Position and word order are essential in sentence construction of any language.
-
✅ It helps to locate every word within a sentence.
-
❌ It is used in CNN and works well there.
-
✅ Providing extra information to our model.
Explanation:
Since Transformers process tokens in parallel (no sequence order), positional encoding injects information about each token’s position, helping the model understand order and grammar relationships.
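The sinusoidal scheme from “Attention Is All You Need” is one way to inject that position information; a minimal sketch (dimensions chosen for illustration):

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    """Sinusoidal positional encoding:
    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    """
    pos = np.arange(n_positions)[:, None]        # (n_positions, 1)
    i = np.arange(d_model // 2)[None, :]         # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions
    return pe

pe = positional_encoding(50, 16)
print(pe.shape)  # (50, 16): one encoding vector per position
```

These vectors are simply added to the token embeddings, so the same word at different positions gets a different overall input representation.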
Question 10
Which is not a good criterion for a good positional encoding algorithm?
-
✅ It should output a common encoding for each time-step (word’s position in a sentence).
-
❌ Distance between any two time-steps should be consistent
-
❌ The algorithm should be able to generalize to longer sentences
-
❌ It must be deterministic
Explanation:
A good positional encoding gives unique encodings for different positions — not common ones. Each position should have a distinct encoding while maintaining consistent distance relationships.
🧾 Summary Table
| Q# | ✅ Correct Answer(s) | 🧠 Key Concept |
|---|---|---|
| 1 | False | Transformers are parallel, not sequential. |
| 2 | Attention mechanism | Transformers built on self-attention. |
| 3 | Query, Key, and Value | Core components of attention. |
| 4 | $\mathrm{softmax}(\frac{QK^T}{\sqrt{d_k}})V$ | Scaled dot-product attention. |
| 5 | K = specific representations of words | Keys determine word relevance to queries. |
| 6 | ith “head” | Each head focuses on different relations. |
| 7 | Q | Decoder’s 2nd block uses encoder’s K, V. |
| 8 | Contextual semantic + positional info | Encoder outputs contextualized embeddings. |
| 9 | Position order + location + extra info | Positional encodings restore sequence order. |
| 10 | Common encoding per position ❌ | Positions must have unique encodings. |