Recurrent Neural Networks: Sequence Models (Deep Learning Specialization) Quiz Answers, 2025
Question 1
Which expression refers to the s-th word in the r-th training example?
- ❌ x^(s)<r>
- ❌ x^<r>(s)
- ✅ x^(r)<s>
- ❌ x^<s>(r)
Explanation: The course notation is x^{(r)<s>}: the parenthesized index (r) selects the r-th training example, and the angle-bracket index <s> selects the s-th word (time step) within that example. Option x^(r)<s> matches that ordering: example index first, then time/word index.
Question 2
The pictured RNN architecture is appropriate when:
- ✅ T_x = T_y
- ❌ T_x < T_y
- ❌ T_x > T_y
- ❌ T_x = 1
Explanation: The pictured architecture is a same-length many-to-many RNN: it produces one output at each input timestep (e.g., sequence labeling and other time-aligned tasks). That is appropriate when input and output sequence lengths match, i.e., T_x = T_y.
Question 3
Which tasks use a many-to-one RNN architecture? (Choose all that apply.)
- ❌ Image classification (input an image, output a label)
- ✅ Music genre recognition
- ✅ Language recognition from speech (audio → single language label)
- ❌ Speech recognition (input audio → transcript)
Explanation: A many-to-one architecture maps a variable-length input sequence to a single output label. Music genre recognition and language recognition from an audio clip both fit this pattern. Speech recognition outputs a sequence (many-to-many), and image classification takes a single fixed-size input rather than a sequence (one-to-one).
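A minimal sketch of the many-to-one pattern in NumPy: the RNN consumes the whole input sequence but reads out a single label distribution from the final hidden state. All sizes and weight initializations here are illustrative assumptions, not values from the course.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 50 timesteps of 13 audio features -> 10 genre classes
T_x, n_x, n_a, n_classes = 50, 13, 64, 10
Wax = rng.normal(0, 0.01, (n_a, n_x))
Waa = rng.normal(0, 0.01, (n_a, n_a))
Wya = rng.normal(0, 0.01, (n_classes, n_a))
ba, by = np.zeros(n_a), np.zeros(n_classes)

def many_to_one(x_seq):
    """Consume an entire sequence, emit ONE label distribution at the end."""
    a = np.zeros(n_a)
    for t in range(x_seq.shape[0]):      # one state update per timestep
        a = np.tanh(Wax @ x_seq[t] + Waa @ a + ba)
    logits = Wya @ a + by                # single readout from the final state
    p = np.exp(logits - logits.max())    # softmax
    return p / p.sum()

probs = many_to_one(rng.normal(size=(T_x, n_x)))
print(probs.shape)  # (10,) -- one distribution for the whole clip
```

Contrast with many-to-many: there, the `Wya @ a + by` readout would run inside the loop, once per timestep.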
Question 4
True/False: At time t the RNN is estimating P(y^<t> | y^<1>, …, y^<t−1>).
- ✅ True
- ❌ False
Explanation: An RNN language model predicts the next token at time t conditioned on all previous tokens; it models exactly that conditional distribution.
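These per-step conditionals multiply together (the chain rule) to give the probability of a whole sequence. A toy check with hypothetical per-step probabilities, not numbers from the course:

```python
import numpy as np

# Hypothetical per-step conditionals P(y^<t> | y^<1..t-1>): the probability the
# model assigned to the token actually emitted at each of 3 steps.
step_probs = [0.5, 0.25, 0.1]

# Chain rule: P(y^<1>, y^<2>, y^<3>) = prod_t P(y^<t> | y^<1..t-1>)
seq_prob = np.prod(step_probs)
seq_logprob = np.sum(np.log(step_probs))  # in practice, sum log-probs for numerical stability
print(round(seq_prob, 6))                 # 0.0125 (= 0.5 * 0.25 * 0.1)
print(np.isclose(np.exp(seq_logprob), seq_prob))  # True
```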
Question 5
True/False: When sampling from your trained language-model RNN, at each step t you sample a word from the RNN's output probabilities and feed it as the input at step t+1.
- ✅ True
- ❌ False
Explanation: That is the standard ancestral sampling procedure for RNN language models: sample from the output distribution at each step and feed the chosen token forward.
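The ancestral sampling loop can be sketched as follows. Weights here are random (so the output is gibberish); the point is the control flow: sample from the softmax, then one-hot encode the sampled token as the next input. Vocabulary size and dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab, n_a = 27, 50                       # e.g. character-level model; sizes illustrative
Wax = rng.normal(0, 0.01, (n_a, vocab))
Waa = rng.normal(0, 0.01, (n_a, n_a))
Wya = rng.normal(0, 0.01, (vocab, n_a))

def sample(max_len=20, eos=0):
    a = np.zeros(n_a)
    x = np.zeros(vocab)                   # step 1 input: zero vector (no previous token)
    out = []
    for _ in range(max_len):
        a = np.tanh(Wax @ x + Waa @ a)
        z = Wya @ a
        p = np.exp(z - z.max()); p /= p.sum()
        tok = rng.choice(vocab, p=p)      # SAMPLE from the distribution (not argmax)
        if tok == eos:                    # stop on end-of-sequence token
            break
        out.append(int(tok))
        x = np.zeros(vocab); x[tok] = 1.0 # feed the sampled token forward as the next input
    return out

print(sample())
```

Taking the argmax instead of sampling would produce the single most likely sequence greedily; sampling gives diverse outputs, which is the point of generation.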
Question 6
True/False: If during RNN training your weights/activations become NaN, you have an exploding gradient problem.
- ✅ True
- ❌ False
Explanation: NaNs during RNN training typically result from numerical overflow caused by exploding gradients: parameter updates become extremely large and push weights and activations out of range. (NaNs can also arise from invalid operations such as log(0), but exploding gradients are the classic cause, and gradient clipping is the standard remedy.)
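A minimal sketch of the standard remedy, global-norm gradient clipping: if the combined L2 norm of all gradients exceeds a threshold, rescale them all by the same factor. The function name and threshold are illustrative choices.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale all gradients in-place-style if their global L2 norm exceeds max_norm."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total > max_norm:
        scale = max_norm / total          # same factor for every gradient: direction preserved
        grads = [g * scale for g in grads]
    return grads

# A "exploded" gradient set with huge norm gets pulled back to exactly max_norm:
big = [np.full((3, 3), 100.0), np.full((3,), 100.0)]
clipped = clip_by_global_norm(big, max_norm=5.0)
norm = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(round(norm, 6))  # 5.0
```

Clipping by global norm (rather than clipping each element independently) preserves the gradient's direction while bounding the update size.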
Question 7
An LSTM has a 10,000-word vocabulary and 100-dimensional activations a^<t>. What is the dimension of the update gate Γu at each time step?
- ❌ 1
- ✅ 100
- ❌ 300
- ❌ 10000
Explanation: Gate vectors such as Γu in an LSTM/GRU have the same dimensionality as the hidden state a^<t>, because they gate it elementwise. With 100-dimensional activations, Γu is a 100-dimensional vector.
Question 8
Alice simplifies the GRU by removing the update gate Γu (fixing it to 0). Betty simplifies it by removing the reset gate Γr (fixing it to 1). Which model is more likely to train without vanishing gradients on very long sequences?
- ❌ Alice’s model (removing Γu), because if Γr ≈ 0 the gradient can flow.
- ❌ Alice’s model (removing Γu), because if Γr ≈ 1 the gradient can flow.
- ❌ Betty’s model (removing Γr), because if Γu ≈ 0 the gradient can flow.
- ✅ Betty’s model (removing Γr), because if Γu ≈ 1 the gradient can propagate back through that timestep without much decay.
Explanation: With the update rule h_t = Γu ⊙ h_{t−1} + (1 − Γu) ⊙ h̃_t, the update gate Γu controls an identity-like path from h_{t−1} to h_t. When Γu can saturate near 1, the state (and hence the gradient) carries through many steps almost unchanged, which is exactly what prevents vanishing. Removing the reset gate Γr leaves this path intact, whereas fixing Γu = 0 destroys it. Betty’s variant, which keeps Γu, is therefore preferable for long-range dependencies.
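The decay can be quantified with a back-of-the-envelope calculation. Under h_t = Γu ⊙ h_{t−1} + (1 − Γu) ⊙ h̃_t, the Jacobian ∂h_t/∂h_{t−1} has Γu on its diagonal (ignoring the dependence of h̃_t on h_{t−1}), so the influence of h_0 on h_T, and the backward gradient, scales roughly like Γu^T per component. The gate values below are illustrative:

```python
# Fraction of the t=0 signal (and gradient) surviving after T steps, ~ gamma_u ** T
T = 200
print(f"{0.99 ** T:.3f}")   # 0.134  -> gate saturated near 1: signal still flows
print(f"{0.50 ** T:.1e}")   # 6.2e-61 -> mid-range gate: effectively vanished
```

This is why the ability of Γu to sit near 1 matters so much more than the reset gate for long-range credit assignment.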
Question 9
True/False: The LSTM’s forget and input gates play roles analogous to the GRU’s Γu and 1 − Γu.
- ✅ True
- ❌ False
Explanation: In the GRU, h_t = Γu ⊙ h_{t−1} + (1 − Γu) ⊙ h̃_t. In the LSTM, c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t. So the forget gate f_t corresponds to Γu and the input gate i_t corresponds to 1 − Γu (up to naming conventions): both pairs decide how much old state to keep versus how much new candidate to write.
Question 10
You have 365 days of weather data x^<1>, …, x^<365> and daily moods y^<1>, …, y^<365>. Should you use a unidirectional RNN or a bidirectional RNN for this problem?
- ❌ Unidirectional RNN, because y^<t> depends only on x^<t>.
- ❌ Bidirectional RNN, because this allows backpropagation to compute more accurate gradients.
- ✅ Bidirectional RNN, because this allows the prediction of mood on day t to take into account more information.
- ❌ Unidirectional RNN, because y^<t> depends only on past x but not future x.
Explanation: Since the full year of data is available (an offline setting), a bidirectional RNN can use context from both before and after day t to predict y^<t>, which typically improves accuracy. A unidirectional RNN would be required only if causality forbade using future inputs at prediction time (an online setting), which is not the case here.
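A sketch of the bidirectional idea: run one RNN left-to-right and an independent one right-to-left, then concatenate the two hidden states at each day so the prediction sees both past and future. Feature counts and dimensions are illustrative assumptions; biases and the output layer are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(4)
T, n_x, n_a = 365, 3, 8                     # 3 weather features/day; sizes illustrative

Wf = rng.normal(0, 0.1, (n_a, n_a + n_x))   # forward-direction weights
Wb = rng.normal(0, 0.1, (n_a, n_a + n_x))   # backward-direction weights
x = rng.normal(size=(T, n_x))

def scan(W, xs):
    """Run a simple tanh RNN over xs, returning the hidden state at every step."""
    a, out = np.zeros(n_a), []
    for xt in xs:
        a = np.tanh(W @ np.concatenate([a, xt]))
        out.append(a)
    return out

fwd = scan(Wf, x)                           # left-to-right: summarizes days 1..t
bwd = scan(Wb, x[::-1])[::-1]               # right-to-left: summarizes days t..365
# Prediction for day t = 101 sees both directions:
a_t = np.concatenate([fwd[100], bwd[100]])
print(a_t.shape)  # (16,) -- past AND future information available for that day
```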
🧾 Summary Table
| Q # | Correct answer(s) | Key concept |
|---|---|---|
| 1 | ✅ x^(r)<s> | Notation: example index first, then word/time index. |
| 2 | ✅ T_x = T_y | Same-length many-to-many architecture. |
| 3 | ✅ Music genre recognition; ✅ Language recognition | Many-to-one: sequence → single label. |
| 4 | ✅ True | RNN models the conditional next-token distribution P(y^<t> given y^<1..t−1>). |
| 5 | ✅ True | Sampling: sample a token at step t, feed it as input at t+1. |
| 6 | ✅ True | NaNs commonly indicate exploding gradients. |
| 7 | ✅ 100 | Gate vectors match the hidden-state dimensionality. |
| 8 | ✅ Betty’s model (remove Γr) | The update gate preserves an identity path → gradient flow. |
| 9 | ✅ True | GRU Γu ↔ LSTM forget gate; 1 − Γu ↔ input gate. |
| 10 | ✅ Bidirectional RNN | Uses both past and future context (offline prediction). |