Sequence Models & Attention Mechanism: Sequence Models (Deep Learning Specialization) Answers, 2025

Question 1

Consider this encoder–decoder model for machine translation.
“This model is a ‘conditional language model’ in the sense that the encoder portion (green) is modeling the probability of the input sentence x.” True/False.

  • False

  • ❌ True

Explanation: The encoder encodes the input sentence x into a representation (hidden states); it does not model P(x). The overall model is a conditional language model because the decoder models P(y∣x). The encoder itself is not estimating the probability of x.


Question 2

If you decrease the beam width B in beam search, which of the following are true? (Select all that apply.)

  • ❌ Beam search will converge after fewer steps.
    Explanation: The number of decoding time steps (output length) does not change with beam width; each step expands fewer candidates, but the search still runs until end tokens are produced, so it does not converge after fewer steps.

  • ❌ Beam search will use up more memory.
    Explanation: A smaller B keeps fewer hypotheses, so it uses less memory, not more.

  • Beam search will run more quickly.
    Explanation: With fewer beams to expand each step, computation per step is reduced, so it runs faster.

  • ❌ Beam search will generally find better solutions.
    Explanation: Reducing beam width usually reduces search quality; a smaller beam explores fewer candidates, so it is less likely to find high-probability sequences.
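
The trade-off above can be sketched with a minimal beam search over a toy bigram model (the vocabulary, probabilities, and function names here are illustrative assumptions, not part of the quiz): per-step work scales with B, so B=1 is greedy and fast while a larger B explores more candidates.

```python
import math

# Toy next-token log-probabilities as a fixed bigram table (assumed for illustration).
LOGP = {
    "<s>": {"a": math.log(0.6), "b": math.log(0.4)},
    "a":   {"a": math.log(0.1), "b": math.log(0.5), "</s>": math.log(0.4)},
    "b":   {"a": math.log(0.3), "b": math.log(0.2), "</s>": math.log(0.5)},
}

def beam_search(B, max_len=5):
    """Keep the B highest-scoring partial sequences at each step."""
    beams = [(["<s>"], 0.0)]          # (tokens, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for toks, score in beams:
            for tok, lp in LOGP[toks[-1]].items():
                if tok == "</s>":
                    finished.append((toks + [tok], score + lp))
                else:
                    candidates.append((toks + [tok], score + lp))
        # Per-step work is proportional to B times the vocabulary size.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:B]
        if not beams:
            break
    return max(finished, key=lambda c: c[1]) if finished else beams[0]

print(beam_search(B=1))   # greedy search: cheapest, may miss good sequences
print(beam_search(B=3))   # wider beam: more work per step, never a worse score
```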


Question 3

Beam search without sentence-length normalization tends to output overly short translations. True/False.

  • ❌ False

  • True

Explanation: Sequence probabilities multiply many numbers < 1, making longer sequences have lower absolute probability. Without length normalization, beam search is biased toward short outputs (higher average token probabilities), so it often returns overly short translations.
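
The short-output bias is easy to see numerically. In this sketch (the per-token probabilities and the α=0.7 normalization exponent are assumed values for illustration), the raw log-probability favors the shorter candidate, while the length-normalized objective prefers the longer one whose per-token probabilities are higher:

```python
import math

# Hypothetical per-token probabilities for two candidate translations.
short_probs = [0.6, 0.6]                  # 2 tokens
long_probs  = [0.8, 0.8, 0.8, 0.8, 0.8]  # 5 tokens

def log_prob(probs):
    """Raw sequence log-probability: sum of per-token log-probs."""
    return sum(math.log(p) for p in probs)

def normalized_score(probs, alpha=0.7):
    """Length-normalized objective: (1 / T^alpha) * sum(log p)."""
    return log_prob(probs) / (len(probs) ** alpha)

# Raw log-probability favors the short candidate...
print(log_prob(short_probs), log_prob(long_probs))
# ...while the normalized score prefers the longer, per-token-stronger one.
print(normalized_score(short_probs), normalized_score(long_probs))
```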


Question 4

Speech recognition example: the model gives P(ŷ∣x) = 1.95×10⁻⁷ for the incorrect transcript and P(y*∣x) = 3.42×10⁻⁹ for the human-provided best transcript.
True/False: Trying a different network architecture could help correct this example.

  • True

  • ❌ False

Explanation: If the model assigns higher probability to an incorrect transcript than to the correct one, that indicates a modeling error: the learned P(y∣x) is wrong. Changing or improving the model architecture (or features, capacity, training) can change the learned distribution and may fix such errors. (If instead P(y*∣x) > P(ŷ∣x) but the search returned ŷ, the problem would be the search.)


Question 5

Later you find that for the vast majority of mistaken examples, P(y*∣x) > P(ŷ∣x). Does that suggest focusing on improving the search algorithm?

  • ❌ False.

  • True.

Explanation: If the model assigns higher probability to the true sequence than the found sequence, then the model’s distribution is fine but the decoding/search failed to find the higher-probability sequence. That points to improving the search/beam method (or beam width / normalization strategy), not the model itself.


Question 6

About attention weights α⟨t,t′⟩. Which statements are true? (Check all that apply.)

  • ∑_{t′} α⟨t,t′⟩ = 0.
    (False.)

  • ∑_{t′} α⟨t,t′⟩ = −1.
    (False.)

  • α⟨t,t′⟩ is equal to the amount of attention y⟨t⟩ should pay to a⟨t′⟩.
    (True.)

  • We expect α⟨t,t′⟩ to be generally larger for values of a⟨t′⟩ that are highly relevant to the value the network should output for y⟨t⟩.
    (True.)

Explanation: Attention weights for decoder time t are normalized (softmax) across encoder positions t′, so they sum to 1, not 0 or −1. α⟨t,t′⟩ indicates how much the decoder at time t attends to encoder hidden state a⟨t′⟩, and is larger for encoder positions useful for predicting y⟨t⟩.
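
A minimal sketch of the normalization step (the scores here are made-up numbers for one decoder step; variable names are illustrative): the softmax turns arbitrary scores e⟨t,t′⟩ into weights α⟨t,t′⟩ that are non-negative and sum to 1.

```python
import math

def softmax(scores):
    """Normalize attention scores e⟨t,t'⟩ into weights α⟨t,t'⟩ over encoder positions."""
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical scores e⟨t,t'⟩ for one decoder step t over four encoder positions.
e_t = [2.0, 0.5, -1.0, 0.1]
alpha_t = softmax(e_t)
print(alpha_t, sum(alpha_t))              # the weights sum to 1, not 0 or -1
```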


Question 7

The network learns where to pay attention via scores e⟨t,t′⟩ computed by a small neural network. Which of the following does s⟨t⟩ depend on? (Select all that apply.)

  • α⟨t,t′⟩
    (Yes: the context vector formed using α influences the decoder hidden state s⟨t⟩.)

  • s⟨t+1⟩
    (No: the future state does not affect the current state.)

  • e⟨t,t′⟩
    (Yes: e is used to compute α, which contributes to the context used in computing s⟨t⟩.)

  • s⟨t⟩ is independent of α and e.
    (No: it is not independent.)

Explanation: The decoder state s⟨t⟩ is computed from the previous state and the context vector (a weighted sum of encoder states using α). α is derived from e, so both e and α influence s⟨t⟩. Future decoder states don’t influence the current one.
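
The dependency chain e → α → context → s⟨t⟩ can be sketched with the context-vector computation itself (the weights and encoder states below are toy numbers chosen for illustration):

```python
# Context vector for decoder step t: a weighted sum of encoder hidden states a⟨t'⟩
# using attention weights α⟨t,t'⟩ (toy values; all names are illustrative).
alpha_t = [0.7, 0.2, 0.1]                      # attention weights, summing to 1
a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]       # encoder hidden states a⟨t'⟩

# Weighted sum over encoder positions, dimension by dimension.
context = [sum(w * h[d] for w, h in zip(alpha_t, a)) for d in range(2)]
print(context)  # the decoder combines this context with s⟨t-1⟩ to produce s⟨t⟩
```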


Question 8

Attention helps most when the input length Tx is small or large?

  • ❌ The input sequence length Tx is small.

  • The input sequence length Tx is large.

Explanation: Attention provides the biggest advantage when the encoder must compress long inputs into a single fixed-size vector — i.e., for large TxT_x. Attention lets the decoder access all encoder positions directly, which helps greatly for long sequences.


Question 9

CTC collapse of kk_eee____ee_p__eeeeeeee_____rrrrr → which does it collapse to?

  • ke epe r

  • keeper

  • keper

  • kkeeeeepeeeeeeeerrrrr

Explanation (CTC decoding rule): first collapse runs of repeated identical characters into one (repeats separated by a blank remain distinct), then remove the blanks. The sequence collapses to k _ e _ e _ p _ e _ r; removing blanks gives “keeper”.
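
The two-step rule is short enough to implement directly; a minimal sketch (the function name and blank symbol `_` are conventions assumed here):

```python
def ctc_collapse(seq, blank="_"):
    """CTC decoding rule: collapse consecutive repeats, then remove blanks."""
    collapsed = []
    for ch in seq:
        if not collapsed or ch != collapsed[-1]:   # drop consecutive repeats
            collapsed.append(ch)
    return "".join(ch for ch in collapsed if ch != blank)  # strip blanks

print(ctc_collapse("kk_eee____ee_p__eeeeeeee_____rrrrr"))  # -> keeper
```

Note that the blank between the two `e` runs is what keeps them as two separate `e`s in the output.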


Question 10

In trigger-word detection, does x⟨t⟩ represent the trigger word x being stated for the t-th time? True/False.

  • False

  • ❌ True

Explanation: In trigger-word detection, x⟨t⟩ usually denotes the input feature vector (audio frame or window) at time step t, not the t-th occurrence count of the trigger word. So the statement is incorrect.


🧾 Summary Table

| Q # | Correct Answer(s) | Key concept |
| --- | --- | --- |
| 1 | ✅ False | Encoder encodes x; decoder models P(y∣x); the encoder doesn’t model P(x). |
| 2 | ✅ Beam runs more quickly (only) | Smaller beam → less computation and memory, but worse search quality. |
| 3 | ✅ True | No length normalization → bias toward short outputs. |
| 4 | ✅ True | Architecture changes can fix modeling errors where the model assigns wrong probabilities. |
| 5 | ✅ True | If P(y*∣x) > P(ŷ∣x), the search failed; improve search/decoding. |
| 6 | ✅ α equals attention; ✅ α larger for relevant encoder states | α sums to 1; α indicates how much the decoder at t attends to encoder position t′. |
| 7 | ✅ α; ✅ e | Decoder state depends on the context (α), which comes from the e scores. |
| 8 | ✅ Large Tx | Attention helps most with long inputs. |
| 9 | ✅ keeper | CTC: collapse repeats, then remove blanks. |
| 10 | ✅ False | x⟨t⟩ is the input frame at time t, not the “t-th time the trigger was uttered.” |