Transformers: Sequence Models (Deep Learning Specialization) Answers: 2025
Question 1
A Transformer Network, like RNNs, GRUs, and LSTMs, can process information one word at a time (sequentially).
-
✅ False
-
❌ True
Explanation:
Transformers process all tokens in parallel using self-attention. Unlike RNNs, which are sequential by design, Transformers handle sequences non-sequentially, enabling faster and more efficient training.
Question 2
Transformer Network methodology is taken from: (Check all that apply)
-
❌ Convolutional Neural Network style of processing
-
✅ Attention mechanism
-
❌ Convolutional Neural Network style of architecture
-
❌ None of these
Explanation:
Transformers are built entirely on the self-attention mechanism (“Attention is All You Need”), not CNNs or RNNs. Attention lets each token directly relate to all others.
Question 3
What are the key inputs to computing the attention value for each word?
-
❌ Quotation, knowledge, and value
-
✅ Query, Key, and Value
-
❌ Query, knowledge, and vector
-
❌ Quotation, key, and vector
Explanation:
Self-attention operates on three vectors derived from each word’s embedding — Query (Q), Key (K), and Value (V) — to compute how much focus each word should place on the others.
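As a minimal sketch of where Q, K, and V come from (the shapes and random weights below are illustrative, not values from the course), each token's embedding is multiplied by three learned projection matrices:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_k = 8, 4                      # embedding size and Q/K/V size (illustrative)
x = rng.standard_normal((5, d_model))    # 5 token embeddings

# Learned projection matrices (randomly initialized here purely for illustration)
W_q = rng.standard_normal((d_model, d_k))
W_k = rng.standard_normal((d_model, d_k))
W_v = rng.standard_normal((d_model, d_k))

# One Query, Key, and Value row per token
Q, K, V = x @ W_q, x @ W_k, x @ W_v
print(Q.shape, K.shape, V.shape)         # each is (5, 4)
```

The point is simply that Q, K, and V are three different learned views of the same embeddings, which the attention formula then combines.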
Question 4
Which of the following correctly represents Attention?
-
❌ $Attention(Q,K,V) = \mathrm{softmax}\left(\frac{QV^T}{\sqrt{d_k}}\right)K$
-
❌ $Attention(Q,K,V) = \min\left(\frac{QV^T}{\sqrt{d_k}}\right)K$
-
❌ $Attention(Q,K,V) = \min\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
-
✅ $Attention(Q,K,V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
Explanation:
The Scaled Dot-Product Attention formula is
$Attention(Q,K,V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
where $d_k$ is the dimension of the key vectors; scaling by $\sqrt{d_k}$ keeps the dot products in a range where the softmax gradients remain stable.
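The formula can be sketched directly in numpy (input shapes here are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V, row by row."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # (n_q, n_k) similarity scores
    # Numerically stable softmax over the key axis
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V                           # weighted sum of value vectors

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((5, 4)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (5, 4): one context vector per query
```

Each output row is a mixture of the value vectors, weighted by how well that query matches each key.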
Question 5
Which statement represents Key (K) in self-attention?
-
✅ K = specific representations of words given a Q
-
❌ K = the order of the words in a sentence
-
❌ K = qualities of words given a Q
-
❌ K = interesting questions about the words in a sentence
Explanation:
In self-attention, K (Key) represents information used to determine how relevant each word is to a given query word. It encodes the features each word contributes to matching queries.
Question 6
What does i represent in this multi-head attention computation?
-
❌ Associated with the order of the words
-
❌ Associated with specific representations of words given a Q
-
❌ Associated with the ith word in a sentence
-
✅ Associated with the ith “head” (sequence)
Explanation:
Each attention head in multi-head attention computes attention with its own Q, K, and V projections. The index i denotes the head number, not the word position.
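A rough sketch of that structure (per-head weights are random here and all shapes are illustrative): each head i applies attention with its own projections, and the heads' outputs are concatenated and projected back.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, heads, W_o):
    """heads: list of (W_q_i, W_k_i, W_v_i) tuples; i indexes the head, not the word."""
    outputs = []
    for W_q, W_k, W_v in heads:                  # loop over heads i = 0, 1, ...
        Q, K, V = x @ W_q, x @ W_k, x @ W_v
        d_k = K.shape[-1]
        outputs.append(softmax(Q @ K.T / np.sqrt(d_k)) @ V)
    # Concatenate all heads, then project back to the model dimension
    return np.concatenate(outputs, axis=-1) @ W_o

rng = np.random.default_rng(0)
d_model, d_k, n_heads = 8, 4, 2
x = rng.standard_normal((5, d_model))
heads = [tuple(rng.standard_normal((d_model, d_k)) for _ in range(3))
         for _ in range(n_heads)]
W_o = rng.standard_normal((n_heads * d_k, d_model))
print(multi_head_attention(x, heads, W_o).shape)  # (5, 8)
```

Because each head has its own projections, different heads can attend to different kinds of relationships between the same tokens.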
Question 7
What is NOT necessary for the Decoder’s second block of Multi-Head Attention?
-
❌ V
-
❌ K
-
✅ Q
-
❌ All of the above
Explanation:
In the decoder’s second multi-head attention block, the Key (K) and Value (V) matrices come from the encoder’s output, while the Query (Q) comes from the decoder’s previous (masked) attention block. Q is therefore the one input that is not supplied by the encoder, which is why it is the answer here.
Question 8
What does the output of the encoder block contain?
-
❌ Softmax layer followed by a linear layer
-
✅ Contextual semantic embedding and positional encoding information
-
❌ Linear layer followed by a softmax layer
-
❌ Prediction of the next word
Explanation:
The encoder outputs a context-rich representation of each token — embedding meaning (context) and positional information — which becomes the input for the decoder’s attention layers.
Question 9
Why is positional encoding important in translation? (Check all that apply)
-
✅ Position and word order are essential in sentence construction of any language.
-
✅ It helps to locate every word within a sentence.
-
❌ It is used in CNN and works well there.
-
✅ Providing extra information to our model.
Explanation:
Since Transformers process tokens in parallel (no sequence order), positional encoding injects information about each token’s position, helping the model understand order and grammar relationships.
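The sinusoidal scheme from “Attention Is All You Need” is one way to inject that position information; a minimal sketch (dimensions chosen for illustration):

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    """Sinusoidal positional encoding:
    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    """
    pos = np.arange(n_positions)[:, None]        # (n_positions, 1)
    i = np.arange(d_model // 2)[None, :]         # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions
    return pe

pe = positional_encoding(50, 16)
print(pe.shape)  # (50, 16): one encoding vector per position
```

These vectors are simply added to the token embeddings, so the same word at different positions gets a different overall input representation.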
Question 10
Which is not a good criterion for a good positional encoding algorithm?
-
✅ It should output a common encoding for each time-step (word’s position in a sentence).
-
❌ Distance between any two time-steps should be consistent
-
❌ The algorithm should be able to generalize to longer sentences
-
❌ It must be deterministic
Explanation:
A good positional encoding gives unique encodings for different positions — not common ones. Each position should have a distinct encoding while maintaining consistent distance relationships.
🧾 Summary Table
| Q# | ✅ Correct Answer(s) | 🧠 Key Concept |
|---|---|---|
| 1 | False | Transformers are parallel, not sequential. |
| 2 | Attention mechanism | Transformers built on self-attention. |
| 3 | Query, Key, and Value | Core components of attention. |
| 4 | $\mathrm{softmax}(\frac{QK^T}{\sqrt{d_k}})V$ | Scaled dot-product attention. |
| 5 | K = specific representations of words | Keys determine word relevance to queries. |
| 6 | ith “head” | Each head focuses on different relations. |
| 7 | Q | Decoder’s 2nd block uses encoder’s K, V. |
| 8 | Contextual semantic + positional info | Encoder outputs contextualized embeddings. |
| 9 | Position order + location + extra info | Positional encodings restore sequence order. |
| 10 | Common encoding per position ❌ | Positions must have unique encodings. |