Sequence Models & Attention Mechanism: Sequence Models (Deep Learning Specialization) Answers, 2025
Question 1
Consider this encoder–decoder model for machine translation.
“This model is a ‘conditional language model’ in the sense that the encoder portion (green) is modeling the probability of the input sentence $x$.” True/False.
- ✅ False
- ❌ True

Explanation: The encoder encodes the input sentence $x$ into a representation (hidden states); it does not model $P(x)$. The overall model is a conditional language model because the decoder models $P(y\mid x)$. The encoder itself is not estimating the probability of $x$.
Question 2
If you decrease the beam width $B$ in beam search, which of the following are true? (Select all that apply.)
- ❌ Beam search will converge after fewer steps.
  Explanation: The number of decoding time steps (the sequence length) does not change with beam width; the per-step branching is smaller, but the search still runs until end tokens are produced, so it does not finish in fewer steps.
- ❌ Beam search will use up more memory.
  Explanation: A smaller $B$ keeps fewer hypotheses and therefore uses less memory.
- ✅ Beam search will run more quickly.
  Explanation: With fewer beams to expand at each step, computation per step is reduced, so it runs faster.
- ❌ Beam search will generally find better solutions.
  Explanation: Reducing the beam width usually reduces search quality (a smaller beam has less chance of finding high-probability sequences).
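The trade-off above can be made concrete with a toy beam-search step (a minimal sketch; the three-token vocabulary and its probabilities are made up for illustration). Each step expands every surviving hypothesis with every token, then keeps only the top $B$, so a smaller $B$ means fewer expansions per step (faster, less memory) but a higher risk of pruning the best sequence:

```python
import math

def beam_search_step(beams, next_log_probs, B):
    """Expand every hypothesis with every token, then keep the top-B.

    beams: list of (tokens, score) hypotheses, score = sum of log probs
    next_log_probs: dict token -> log prob (toy: same distribution at every step)
    """
    candidates = [(tokens + [tok], score + lp)
                  for tokens, score in beams
                  for tok, lp in next_log_probs.items()]
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:B]          # smaller B -> fewer hypotheses kept

dist = {"a": math.log(0.6), "b": math.log(0.3), "c": math.log(0.1)}
beams = [([], 0.0)]
for _ in range(3):                 # three decoding steps
    beams = beam_search_step(beams, dist, B=2)

assert len(beams) == 2             # beam width caps memory per step
assert beams[0][0] == ["a", "a", "a"]
```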
Question 3
Beam search without sentence-length normalization tends to output overly short translations. True/False.
- ❌ False
- ✅ True

Explanation: Sequence probabilities multiply many numbers less than 1, so longer sequences have lower absolute probability. Without length normalization, beam search is biased toward short outputs (which have higher total probability), so it often returns overly short translations.
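A tiny numeric example shows the bias and the usual fix, dividing the summed log probability by $T_y^{\alpha}$ (a sketch; the token probabilities below are made up, and $\alpha=0.7$ is the commonly used softened-normalization exponent):

```python
import math

def seq_log_prob(token_probs):
    """Unnormalized beam-search score: sum of log p(token)."""
    return sum(math.log(p) for p in token_probs)

def length_normalized_score(token_probs, alpha=0.7):
    """Length-normalized score: (1 / T^alpha) * sum of log p(token)."""
    T = len(token_probs)
    return seq_log_prob(token_probs) / (T ** alpha)

short = [0.5, 0.5]     # 2 tokens, mediocre per-token probability
long = [0.7] * 10      # 10 tokens, better per-token probability

# Unnormalized, the short sequence wins purely because it has fewer factors:
assert seq_log_prob(short) > seq_log_prob(long)
# With length normalization, the longer, per-token-better sequence wins:
assert length_normalized_score(long) > length_normalized_score(short)
```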
Question 4
Speech recognition example: the model gives $P(\hat y\mid x)=1.95\times 10^{-7}$ for the bad transcript $\hat y$ and $P(y^*\mid x)=3.42\times 10^{-9}$ for the human-generated transcript $y^*$.
True/False: Trying a different network architecture could help correct this example.
- ✅ True
- ❌ False

Explanation: If the model assigns higher probability to an incorrect transcript than to the correct one, that indicates a modeling error (the learned $P(y\mid x)$ is wrong). Changing or improving the model architecture (or the features, capacity, or training) can change the learned distribution and may fix such errors. (If instead $P(y^*\mid x) > P(\hat y\mid x)$ but search returned $\hat y$, the problem would be the search.)
Question 5
Later you find that for the vast majority of mistaken examples, $P(y^*\mid x) > P(\hat y\mid x)$. Does that suggest focusing on improving the search algorithm?
- ❌ False
- ✅ True

Explanation: If the model assigns higher probability to the true sequence than to the sequence it returned, the model’s distribution is fine but the decoding/search failed to find the higher-probability sequence. That points to improving the search method (e.g. a larger beam width or a better normalization strategy), not the model itself.
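The model-vs-search attribution used in Questions 4 and 5 can be sketched as a small error-analysis helper (function names are illustrative, not from the course):

```python
def attribute_error(p_star, p_hat):
    """Attribute one mistaken example to the search or to the model.

    p_star: model probability P(y* | x) of the correct output
    p_hat:  model probability P(y-hat | x) of the output beam search returned
    """
    if p_star > p_hat:
        # Search returned y-hat even though the model prefers y*:
        # beam search failed to find the higher-probability sequence.
        return "search"
    # The model itself ranks the wrong output higher: a modeling error.
    return "model"

def summarize(examples):
    """Count attributions over (p_star, p_hat) pairs from a dev set."""
    counts = {"search": 0, "model": 0}
    for p_star, p_hat in examples:
        counts[attribute_error(p_star, p_hat)] += 1
    return counts

# Question 4's example: P(y-hat|x)=1.95e-7 > P(y*|x)=3.42e-9 -> model error.
assert attribute_error(3.42e-9, 1.95e-7) == "model"
```

If `summarize` reports mostly `"search"` failures, as in Question 5, the beam/decoding procedure is the thing to improve.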
Question 6
About the attention weights $\alpha_{\langle t,t'\rangle}$: which statements are true? (Check all that apply.)
- ❌ $\sum_{t'} \alpha_{\langle t,t'\rangle} = 0$. (False.)
- ❌ $\sum_{t'} \alpha_{\langle t,t'\rangle} = -1$. (False.)
- ✅ $\alpha_{\langle t,t'\rangle}$ is equal to the amount of attention $y^{\langle t\rangle}$ should pay to $a^{\langle t'\rangle}$. (True.)
- ✅ We expect $\alpha_{\langle t,t'\rangle}$ to be generally larger for values of $a^{\langle t'\rangle}$ that are highly relevant to the value the network should output for $y^{\langle t\rangle}$. (True.)
Explanation: The attention weights for decoder time $t$ are normalized by a softmax across the encoder positions $t'$, so they sum to 1, not 0 or $-1$. $\alpha_{\langle t,t'\rangle}$ indicates how much the decoder at time $t$ attends to the encoder hidden state $a^{\langle t'\rangle}$, and will be larger for encoder positions useful for predicting $y^{\langle t\rangle}$.
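Why the weights sum to 1 (and not 0 or $-1$): they are the output of a softmax over the scores $e_{\langle t,t'\rangle}$. A minimal sketch, with made-up scores for one decoder step over four encoder positions:

```python
import math

def softmax(scores):
    """Softmax over attention scores e -> attention weights alpha."""
    m = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

e = [2.0, -1.0, 0.5, 0.0]   # scores e<t,t'> for one decoder step t
alpha = softmax(e)

# The weights sum to 1 ...
assert abs(sum(alpha) - 1.0) < 1e-9
# ... and every weight is positive, so the sum can never be 0 or -1.
assert all(a > 0 for a in alpha)
# The highest-scoring encoder position receives the most attention.
assert max(alpha) == alpha[0]
```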
Question 7
The network learns where to pay attention via scores $e_{\langle t,t'\rangle}$ computed by a small neural network. Which of the following does the decoder state $s^{\langle t\rangle}$ depend on? (Select all that apply.)
- ✅ $\alpha_{\langle t,t'\rangle}$ (Yes: the context vector formed using $\alpha$ influences the decoder hidden state $s^{\langle t\rangle}$.)
- ❌ $s^{\langle t+1\rangle}$ (No: a future state does not affect the current state.)
- ✅ $e_{\langle t,t'\rangle}$ (Yes: $e$ is used to compute $\alpha$, which contributes to the context used in computing $s^{\langle t\rangle}$.)
- ❌ $s^{\langle t\rangle}$ is independent of $\alpha$ and $e$. (No: it is not independent.)
Explanation: The decoder state $s^{\langle t\rangle}$ is computed from the previous state and the context vector (a weighted sum of encoder states using $\alpha$). Since $\alpha$ is derived from $e$, both $e$ and $\alpha$ influence $s^{\langle t\rangle}$. Future decoder states do not influence the current one.
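The dependency chain $e \to \alpha \to$ context $\to s^{\langle t\rangle}$ can be sketched as follows (the encoder states, scores, and dimensions are made up for illustration; the actual decoder update is a full RNN cell, not shown here):

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

def context_vector(alpha, encoder_states):
    """c<t> = sum over t' of alpha<t,t'> * a<t'> (weighted sum per dimension)."""
    dim = len(encoder_states[0])
    return [sum(w * h[d] for w, h in zip(alpha, encoder_states))
            for d in range(dim)]

# Encoder hidden states a<t'>: 3 positions, 2-dimensional, made-up values.
a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
e = [3.0, 0.0, 0.0]            # scores for the current decoder step t
alpha = softmax(e)             # e -> alpha
c = context_vector(alpha, a)   # alpha -> context c<t>, which feeds into s<t>

# A high score on position 0 pulls the context toward a[0]:
assert c[0] > c[1]
```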
Question 8
Does attention help most when the input length $T_x$ is small or large?
- ❌ The input sequence length $T_x$ is small.
- ✅ The input sequence length $T_x$ is large.

Explanation: Attention provides the biggest advantage when the encoder would otherwise have to compress a long input into a single fixed-size vector, i.e., for large $T_x$. Attention lets the decoder access all encoder positions directly, which helps greatly for long sequences.
Question 9
CTC collapse of `kk_eee____ee_p__eeeeeeee_____rrrrr` → which output does it collapse to?
- ❌ `ke epe r`
- ✅ `keeper`
- ❌ `keper`
- ❌ `kkeeeeepeeeeeeeerrrrr`
Explanation (CTC decoding rule): First collapse repeated consecutive identical characters into one (repeats are kept separate only when a blank sits between them), then remove the blanks. Doing so yields `k_e_e_p_e_r`; removing blanks gives "keeper".
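The two-step rule (collapse consecutive repeats, then drop blanks) is short enough to implement directly; a minimal sketch, with `_` as the blank symbol:

```python
def ctc_collapse(path, blank="_"):
    """CTC decoding: collapse consecutive repeats, then remove blanks."""
    out = []
    prev = None
    for ch in path:
        if ch != prev:          # keep a character only when it changes
            out.append(ch)
        prev = ch
    return "".join(ch for ch in out if ch != blank)

# The quiz's example path collapses to "keeper".
assert ctc_collapse("kk_eee____ee_p__eeeeeeee_____rrrrr") == "keeper"
```

Note how the blank between the two runs of `e` is what preserves the double `e` in "keeper": without it, the runs would merge into a single `e`.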
Question 10
In trigger-word detection, does $x^{\langle t\rangle}$ represent the trigger word being stated for the $t$-th time? True/False.
- ✅ False
- ❌ True

Explanation: In trigger-word detection, $x^{\langle t\rangle}$ denotes the input feature vector (audio frame or window) at time step $t$, not a count of how many times the trigger word has been said. So the statement is incorrect.
🧾 Summary Table
| Q # | Correct answer(s) | Key concept |
|---|---|---|
| 1 | ✅ False | The encoder encodes $x$; the decoder models $P(y\mid x)$; the encoder doesn’t model $P(x)$. |
| 2 | ✅ Runs more quickly (only) | Smaller beam → less computation and memory, but worse search quality. |
| 3 | ✅ True | No length normalization → bias toward short outputs. |
| 4 | ✅ True | Architecture changes can fix modeling errors where the model assigns the wrong probabilities. |
| 5 | ✅ True | If $P(y^*\mid x) > P(\hat y\mid x)$, the search failed; improve search/decoding. |
| 6 | ✅ $\alpha$ equals attention paid; ✅ $\alpha$ larger for relevant encoder states | $\alpha$ sums to 1 and indicates how much the decoder at $t$ attends to encoder position $t'$. |
| 7 | ✅ $\alpha$; ✅ $e$ | The decoder state depends on the context (via $\alpha$), which comes from the $e$ scores. |
| 8 | ✅ Large $T_x$ | Attention helps most with long inputs. |
| 9 | ✅ `keeper` | CTC: collapse repeats, then remove blanks. |
| 10 | ✅ False | $x^{\langle t\rangle}$ is the input frame at time $t$, not the “$t$-th time the trigger was uttered.” |