
Transformers: Sequence Models (Deep Learning Specialization) Answers, 2025

Question 1

A Transformer Network, like RNNs, GRUs, and LSTMs, can process information one word at a time (sequentially).

  • False

  • ❌ True

Explanation:
Transformers process all tokens in parallel using self-attention. Unlike RNNs, which are sequential by design, Transformers handle sequences non-sequentially, enabling faster and more efficient training.


Question 2

Transformer Network methodology is taken from: (Check all that apply)

  • ❌ Convolutional Neural Network style of processing

  • Attention mechanism

  • ❌ Convolutional Neural Network style of architecture

  • ❌ None of these

Explanation:
Transformers are built entirely on the self-attention mechanism (“Attention is All You Need”), not CNNs or RNNs. Attention lets each token directly relate to all others.


Question 3

What are the key inputs to computing the attention value for each word?

  • ❌ Quotation, knowledge, and value

  • Query, Key, and Value

  • ❌ Query, knowledge, and vector

  • ❌ Quotation, key, and vector

Explanation:
Self-attention operates on three vectors derived from each word’s embedding — Query (Q), Key (K), and Value (V) — to compute how much focus each word should place on the others.


Question 4

Which of the following correctly represents Attention?

  • ❌ $Attention(Q,K,V) = softmax\left(\frac{QV^T}{\sqrt{d_k}}\right)K$

  • ❌ $Attention(Q,K,V) = min\left(\frac{QV^T}{\sqrt{d_k}}\right)K$

  • ❌ $Attention(Q,K,V) = min\left(\frac{QK^T}{\sqrt{d_k}}\right)V$

  • $Attention(Q,K,V) = softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V$

Explanation:
The Scaled Dot-Product Attention formula is

$Attention(Q,K,V) = softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V$

where $d_k$ is the dimension of the key vectors (for scaling stability).
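The formula above can be sketched directly in NumPy. This is an illustrative single-layer sketch (function name and toy shapes are my own choices, not from the course):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for one attention layer."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # (n_q, n_k) similarity scores
    # Row-wise softmax: each query's weights over all keys sum to 1
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                        # weighted sum of value vectors

# Toy example: 3 tokens, d_k = d_v = 4
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 4)
```

The division by $\sqrt{d_k}$ keeps the dot products from growing with dimension, which would otherwise push the softmax into regions with near-zero gradients.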


Question 5

Which statement represents Key (K) in self-attention?

  • K = specific representations of words given a Q

  • ❌ K = the order of the words in a sentence

  • ❌ K = qualities of words given a Q

  • ❌ K = interesting questions about the words in a sentence

Explanation:
In self-attention, K (Key) represents information used to determine how relevant each word is to a given query word. It encodes the features each word contributes to matching queries.


Question 6

What does $i$ represent in this multi-head attention computation?

  • ❌ Associated with the order of the words

  • ❌ Associated with specific representations of words given a Q

  • ❌ Associated with the $i$-th word in a sentence

  • Associated with the $i$-th “head” (sequence)

Explanation:
Each attention head in multi-head attention computes attention with its own $Q$, $K$, $V$ projections. The index $i$ denotes the head number, not the word position.
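A minimal sketch of how the head index $i$ is used, assuming each head works on its own slice of shared projection matrices (real implementations often use separate per-head weights; all names here are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """Run num_heads attention heads in parallel and concatenate them."""
    n, d_model = X.shape
    d_head = d_model // num_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    heads = []
    for i in range(num_heads):               # i indexes the head, not the word
        s = slice(i * d_head, (i + 1) * d_head)
        scores = Q[:, s] @ K[:, s].T / np.sqrt(d_head)
        heads.append(softmax(scores) @ V[:, s])
    # Concatenate all heads, then apply the output projection W_o
    return np.concatenate(heads, axis=-1) @ W_o

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 8))                  # 5 tokens, d_model = 8
W_q, W_k, W_v, W_o = (rng.normal(size=(8, 8)) for _ in range(4))
out = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads=2)
print(out.shape)  # (5, 8)
```

Each head sees a lower-dimensional projection, so different heads can specialize in different relations between tokens.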


Question 7

What is NOT necessary for the Decoder’s second block of Multi-Head Attention?

  • ❌ V

  • ❌ K

  • Q

  • ❌ All of the above

Explanation:
In the decoder’s second multi-head attention block, K and V come from the encoder’s output, while Q comes from the decoder’s own first (masked) attention block. So Q is the one input this block does not take from the encoder.


Question 8

What does the output of the encoder block contain?

  • ❌ Softmax layer followed by a linear layer

  • Contextual semantic embedding and positional encoding information

  • ❌ Linear layer followed by a softmax layer

  • ❌ Prediction of the next word

Explanation:
The encoder outputs a context-rich representation of each token — embedding meaning (context) and positional information — which becomes the input for the decoder’s attention layers.


Question 9

Why is positional encoding important in translation? (Check all that apply)

  • Position and word order are essential in sentence construction of any language.

  • It helps to locate every word within a sentence.

  • ❌ It is used in CNN and works well there.

  • Providing extra information to our model.

Explanation:
Since Transformers process tokens in parallel (no sequence order), positional encoding injects information about each token’s position, helping the model understand order and grammar relationships.
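The standard sinusoidal scheme from the original Transformer paper is one way to inject this position information; a minimal sketch (function name and shapes are illustrative):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encodings (d_model must be even):
    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    """
    pos = np.arange(max_len)[:, None]             # (max_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]         # (1, d_model/2) dims
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                  # even dims: sine
    pe[:, 1::2] = np.cos(angles)                  # odd dims: cosine
    return pe

pe = positional_encoding(50, 16)
print(pe.shape)       # (50, 16)
print(pe[0, :4])      # position 0 -> [0. 1. 0. 1.]
```

Each position gets a unique vector, and the encodings extend to any position, which matches the criteria discussed in Question 10.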


Question 10

Which is not a good criterion for a good positional encoding algorithm?

  • It should output a common encoding for each time-step (word’s position in a sentence).

  • ❌ Distance between any two time-steps should be consistent

  • ❌ The algorithm should be able to generalize to longer sentences

  • ❌ It must be deterministic

Explanation:
A good positional encoding gives unique encodings for different positions — not common ones. Each position should have a distinct encoding while maintaining consistent distance relationships.


🧾 Summary Table

| Q# | ✅ Correct Answer(s) | 🧠 Key Concept |
|----|----------------------|----------------|
| 1 | False | Transformers are parallel, not sequential. |
| 2 | Attention mechanism | Transformers are built on self-attention. |
| 3 | Query, Key, and Value | Core components of attention. |
| 4 | $softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V$ | Scaled dot-product attention. |
| 5 | K = specific representations of words given a Q | Keys determine word relevance to queries. |
| 6 | The $i$-th “head” | Each head focuses on different relations. |
| 7 | Q | Decoder’s 2nd block takes K and V from the encoder. |
| 8 | Contextual semantic + positional info | Encoder outputs contextualized embeddings. |
| 9 | Position, order, and extra info | Positional encodings restore sequence order. |
| 10 | A common encoding per position (the bad criterion) | Positions must have unique encodings. |