Sequence Models & Attention Mechanism: Sequence Models (Deep Learning Specialization) Answers, 2025
Question 1
Consider this encoder–decoder model for machine translation.
“This model is a ‘conditional language model’ in the sense that the encoder portion (green) is modeling the probability of the input sentence $x$.” True/False.
- ✅ False
- ❌ True

Explanation: The encoder encodes the input sentence $x$ into a representation (hidden states); it does not model $P(x)$. The overall model is a conditional language model because the decoder models $P(y\mid x)$. The encoder itself is not estimating the probability of $x$.
Question 2
If you decrease the beam width $B$ in beam search, which of the following are true? (Select all that apply.)
- ❌ Beam search will converge after fewer steps.
  Explanation: The number of decoding time steps (the sequence length) does not change with beam width; the per-step branching is smaller, but the search still runs until end tokens are produced, so it does not finish in fewer steps.
- ❌ Beam search will use up more memory.
  Explanation: A smaller $B$ keeps fewer hypotheses and therefore uses less memory.
- ✅ Beam search will run more quickly.
  Explanation: With fewer beams to expand at each step, computation per step is reduced, so it runs faster.
- ❌ Beam search will generally find better solutions.
  Explanation: Reducing the beam width usually reduces search quality (a smaller beam has less chance of finding high-probability sequences).
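The trade-off above can be made concrete with a toy beam-search step (a minimal sketch; the three-token vocabulary and its probabilities are made up for illustration). Each step expands every surviving hypothesis with every token, then keeps only the top $B$, so a smaller $B$ means fewer expansions per step (faster, less memory) but a higher risk of pruning the best sequence:

```python
import math

def beam_search_step(beams, next_log_probs, B):
    """Expand every hypothesis with every token, then keep the top-B.

    beams: list of (tokens, score) hypotheses, score = sum of log probs
    next_log_probs: dict token -> log prob (toy: same distribution at every step)
    """
    candidates = [(tokens + [tok], score + lp)
                  for tokens, score in beams
                  for tok, lp in next_log_probs.items()]
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:B]          # smaller B -> fewer hypotheses kept

dist = {"a": math.log(0.6), "b": math.log(0.3), "c": math.log(0.1)}
beams = [([], 0.0)]
for _ in range(3):                 # three decoding steps
    beams = beam_search_step(beams, dist, B=2)

assert len(beams) == 2             # beam width caps memory per step
assert beams[0][0] == ["a", "a", "a"]
```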
Question 3
Beam search without sentence-length normalization tends to output overly short translations. True/False.
- ❌ False
- ✅ True

Explanation: Sequence probabilities multiply many numbers less than 1, so longer sequences have lower absolute probability. Without length normalization, beam search is biased toward short outputs (which have higher total probability), so it often returns overly short translations.
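A tiny numeric example shows the bias and the usual fix, dividing the summed log probability by $T_y^{\alpha}$ (a sketch; the token probabilities below are made up, and $\alpha=0.7$ is the commonly used softened-normalization exponent):

```python
import math

def seq_log_prob(token_probs):
    """Unnormalized beam-search score: sum of log p(token)."""
    return sum(math.log(p) for p in token_probs)

def length_normalized_score(token_probs, alpha=0.7):
    """Length-normalized score: (1 / T^alpha) * sum of log p(token)."""
    T = len(token_probs)
    return seq_log_prob(token_probs) / (T ** alpha)

short = [0.5, 0.5]     # 2 tokens, mediocre per-token probability
long = [0.7] * 10      # 10 tokens, better per-token probability

# Unnormalized, the short sequence wins purely because it has fewer factors:
assert seq_log_prob(short) > seq_log_prob(long)
# With length normalization, the longer, per-token-better sequence wins:
assert length_normalized_score(long) > length_normalized_score(short)
```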
Question 4
Speech recognition example: the model gives $P(\hat y\mid x)=1.95\times 10^{-7}$ for the bad transcript $\hat y$ and $P(y^*\mid x)=3.42\times 10^{-9}$ for the human-generated transcript $y^*$.
True/False: Trying a different network architecture could help correct this example.
- ✅ True
- ❌ False

Explanation: If the model assigns higher probability to an incorrect transcript than to the correct one, that indicates a modeling error (the learned $P(y\mid x)$ is wrong). Changing or improving the model architecture (or the features, capacity, or training) can change the learned distribution and may fix such errors. (If instead $P(y^*\mid x) > P(\hat y\mid x)$ but search returned $\hat y$, the problem would be the search.)
Question 5
Later you find that for the vast majority of mistaken examples, $P(y^*\mid x) > P(\hat y\mid x)$. Does that suggest focusing on improving the search algorithm?
- ❌ False
- ✅ True

Explanation: If the model assigns higher probability to the true sequence than to the sequence it returned, the model’s distribution is fine but the decoding/search failed to find the higher-probability sequence. That points to improving the search method (e.g. a larger beam width or a better normalization strategy), not the model itself.
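The model-vs-search attribution used in Questions 4 and 5 can be sketched as a small error-analysis helper (function names are illustrative, not from the course):

```python
def attribute_error(p_star, p_hat):
    """Attribute one mistaken example to the search or to the model.

    p_star: model probability P(y* | x) of the correct output
    p_hat:  model probability P(y-hat | x) of the output beam search returned
    """
    if p_star > p_hat:
        # Search returned y-hat even though the model prefers y*:
        # beam search failed to find the higher-probability sequence.
        return "search"
    # The model itself ranks the wrong output higher: a modeling error.
    return "model"

def summarize(examples):
    """Count attributions over (p_star, p_hat) pairs from a dev set."""
    counts = {"search": 0, "model": 0}
    for p_star, p_hat in examples:
        counts[attribute_error(p_star, p_hat)] += 1
    return counts

# Question 4's example: P(y-hat|x)=1.95e-7 > P(y*|x)=3.42e-9 -> model error.
assert attribute_error(3.42e-9, 1.95e-7) == "model"
```

If `summarize` reports mostly `"search"` failures, as in Question 5, the beam/decoding procedure is the thing to improve.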
Question 6
About the attention weights $\alpha_{\langle t,t'\rangle}$: which statements are true? (Check all that apply.)
- ❌ $\sum_{t'} \alpha_{\langle t,t'\rangle} = 0$. (False.)
- ❌ $\sum_{t'} \alpha_{\langle t,t'\rangle} = -1$. (False.)
- ✅ $\alpha_{\langle t,t'\rangle}$ is equal to the amount of attention $y^{\langle t\rangle}$ should pay to $a^{\langle t'\rangle}$. (True.)
- ✅ We expect $\alpha_{\langle t,t'\rangle}$ to be generally larger for values of $a^{\langle t'\rangle}$ that are highly relevant to the value the network should output for $y^{\langle t\rangle}$. (True.)
Explanation: The attention weights for decoder time $t$ are normalized by a softmax across the encoder positions $t'$, so they sum to 1, not 0 or $-1$. $\alpha_{\langle t,t'\rangle}$ indicates how much the decoder at time $t$ attends to the encoder hidden state $a^{\langle t'\rangle}$, and will be larger for encoder positions useful for predicting $y^{\langle t\rangle}$.
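Why the weights sum to 1 (and not 0 or $-1$): they are the output of a softmax over the scores $e_{\langle t,t'\rangle}$. A minimal sketch, with made-up scores for one decoder step over four encoder positions:

```python
import math

def softmax(scores):
    """Softmax over attention scores e -> attention weights alpha."""
    m = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

e = [2.0, -1.0, 0.5, 0.0]   # scores e<t,t'> for one decoder step t
alpha = softmax(e)

# The weights sum to 1 ...
assert abs(sum(alpha) - 1.0) < 1e-9
# ... and every weight is positive, so the sum can never be 0 or -1.
assert all(a > 0 for a in alpha)
# The highest-scoring encoder position receives the most attention.
assert max(alpha) == alpha[0]
```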
Question 7
The network learns where to pay attention via scores $e_{\langle t,t'\rangle}$ computed by a small neural network. Which of the following does the decoder state $s^{\langle t\rangle}$ depend on? (Select all that apply.)
- ✅ $\alpha_{\langle t,t'\rangle}$ (Yes: the context vector formed using $\alpha$ influences the decoder hidden state $s^{\langle t\rangle}$.)
- ❌ $s^{\langle t+1\rangle}$ (No: a future state does not affect the current state.)
- ✅ $e_{\langle t,t'\rangle}$ (Yes: $e$ is used to compute $\alpha$, which contributes to the context used in computing $s^{\langle t\rangle}$.)
- ❌ $s^{\langle t\rangle}$ is independent of $\alpha$ and $e$. (No: it is not independent.)
Explanation: The decoder state $s^{\langle t\rangle}$ is computed from the previous state and the context vector (a weighted sum of encoder states using $\alpha$). Since $\alpha$ is derived from $e$, both $e$ and $\alpha$ influence $s^{\langle t\rangle}$. Future decoder states do not influence the current one.
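The dependency chain $e \to \alpha \to$ context $\to s^{\langle t\rangle}$ can be sketched as follows (the encoder states, scores, and dimensions are made up for illustration; the actual decoder update is a full RNN cell, not shown here):

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

def context_vector(alpha, encoder_states):
    """c<t> = sum over t' of alpha<t,t'> * a<t'> (weighted sum per dimension)."""
    dim = len(encoder_states[0])
    return [sum(w * h[d] for w, h in zip(alpha, encoder_states))
            for d in range(dim)]

# Encoder hidden states a<t'>: 3 positions, 2-dimensional, made-up values.
a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
e = [3.0, 0.0, 0.0]            # scores for the current decoder step t
alpha = softmax(e)             # e -> alpha
c = context_vector(alpha, a)   # alpha -> context c<t>, which feeds into s<t>

# A high score on position 0 pulls the context toward a[0]:
assert c[0] > c[1]
```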
Question 8
Does attention help most when the input length $T_x$ is small or large?
- ❌ The input sequence length $T_x$ is small.
- ✅ The input sequence length $T_x$ is large.

Explanation: Attention provides the biggest advantage when the encoder would otherwise have to compress a long input into a single fixed-size vector, i.e., for large $T_x$. Attention lets the decoder access all encoder positions directly, which helps greatly for long sequences.
Question 9
CTC collapse of `kk_eee____ee_p__eeeeeeee_____rrrrr` → which output does it collapse to?
- ❌ `ke epe r`
- ✅ `keeper`
- ❌ `keper`
- ❌ `kkeeeeepeeeeeeeerrrrr`
Explanation (CTC decoding rule): First collapse repeated consecutive identical characters into one (repeats are kept separate only when a blank sits between them), then remove the blanks. Doing so yields `k_e_e_p_e_r`; removing blanks gives "keeper".
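The two-step rule (collapse consecutive repeats, then drop blanks) is short enough to implement directly; a minimal sketch, with `_` as the blank symbol:

```python
def ctc_collapse(path, blank="_"):
    """CTC decoding: collapse consecutive repeats, then remove blanks."""
    out = []
    prev = None
    for ch in path:
        if ch != prev:          # keep a character only when it changes
            out.append(ch)
        prev = ch
    return "".join(ch for ch in out if ch != blank)

# The quiz's example path collapses to "keeper".
assert ctc_collapse("kk_eee____ee_p__eeeeeeee_____rrrrr") == "keeper"
```

Note how the blank between the two runs of `e` is what preserves the double `e` in "keeper": without it, the runs would merge into a single `e`.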
Question 10
In trigger-word detection, does $x^{\langle t\rangle}$ represent the trigger word being stated for the $t$-th time? True/False.
- ✅ False
- ❌ True

Explanation: In trigger-word detection, $x^{\langle t\rangle}$ denotes the input feature vector (audio frame or window) at time step $t$, not a count of how many times the trigger word has been said. So the statement is incorrect.
🧾 Summary Table
| Q # | Correct answer(s) | Key concept |
|---|---|---|
| 1 | ✅ False | The encoder encodes $x$; the decoder models $P(y\mid x)$; the encoder doesn’t model $P(x)$. |
| 2 | ✅ Runs more quickly (only) | Smaller beam → less computation and memory, but worse search quality. |
| 3 | ✅ True | No length normalization → bias toward short outputs. |
| 4 | ✅ True | Architecture changes can fix modeling errors where the model assigns the wrong probabilities. |
| 5 | ✅ True | If $P(y^*\mid x) > P(\hat y\mid x)$, the search failed; improve search/decoding. |
| 6 | ✅ $\alpha$ equals attention paid; ✅ $\alpha$ larger for relevant encoder states | $\alpha$ sums to 1 and indicates how much the decoder at $t$ attends to encoder position $t'$. |
| 7 | ✅ $\alpha$; ✅ $e$ | The decoder state depends on the context (via $\alpha$), which comes from the $e$ scores. |
| 8 | ✅ Large $T_x$ | Attention helps most with long inputs. |
| 9 | ✅ `keeper` | CTC: collapse repeats, then remove blanks. |
| 10 | ✅ False | $x^{\langle t\rangle}$ is the input frame at time $t$, not the “$t$-th time the trigger was uttered.” |