
Recurrent Neural Networks: Sequence Models (Deep Learning Specialization) Answers: 2025

Question 1

Which expression refers to the s-th word in the r-th training example?

  • ❌ x^{(s)<r>}

  • ❌ x^{<r>(s)}

  • x^{(r)<s>}

  • ❌ x^{<s>(r)}

Explanation: The standard notation puts the training-example index in a parenthesized superscript and the word/time index in angle brackets, so the s-th word of the r-th example is x^{(r)<s>}: example index first, then time/word index.
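As a toy illustration in code (hypothetical data): if a dataset is stored as a list of training examples, each a list of words, then x^{(r)<s>} means indexing the example first, then the word.

```python
# Hypothetical toy dataset: a list of training examples,
# each example a list of word tokens.
dataset = [
    ["the", "cat", "sat"],           # example r = 1
    ["dogs", "bark", "at", "night"], # example r = 2
]

def word(r, s):
    """Return x^{(r)<s>}: the s-th word of the r-th example (1-indexed)."""
    return dataset[r - 1][s - 1]

print(word(2, 3))  # "at": the 3rd word of the 2nd example
```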


Question 2

The pictured RNN architecture is appropriate when:

  • T_x = T_y

  • ❌ T_x < T_y

  • ❌ T_x > T_y

  • ❌ T_x = 1

Explanation: The shown architecture (a same-length many-to-many RNN) is used when input and output sequence lengths match and an output is produced at each input timestep (e.g., sequence labeling and other time-aligned tasks). That corresponds to T_x = T_y.


Question 3

Which tasks use a many-to-one RNN architecture? (Choose all that apply.)

  • ❌ Image classification (input an image and output a label)

  • Music genre recognition

  • Language recognition from speech (audio → single language label)

  • ❌ Speech recognition (input audio → transcript)

Explanation: Many-to-one maps a variable-length input sequence to a single output label. Music-genre recognition and language recognition from an audio clip both consume an audio sequence and emit one label. Speech recognition outputs a transcript, i.e., a sequence (many-to-many), and image classification maps a single image to a label (one-to-one), not a sequence to a label.


Question 4

True/False: At time t the RNN is estimating P(y^{<t>} ∣ y^{<1>}, …, y^{<t−1>}).

  • True

  • ❌ False

Explanation: An RNN language model predicts the next token at time t conditioned on previous tokens; it models exactly that conditional distribution.


Question 5

True/False: When sampling from your trained language-model RNN, at each step t you sample a word from the RNN’s output probabilities and feed it as input for step t+1.

  • True

  • ❌ False

Explanation: That is the standard ancestral sampling procedure for RNN language models: sample from the output distribution at each step and feed the chosen token forward.
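The loop can be sketched in NumPy. The `rnn_step` below is a placeholder stand-in for a trained cell (random, untrained outputs); only the sampling loop mirrors the real procedure: sample y^{<t>} from the output distribution, then feed it back in at step t+1.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = ["<eos>", "hello", "world", "again"]  # toy 4-word vocabulary

def rnn_step(prev_token_id, h):
    """Placeholder for a trained RNN cell: returns the next hidden
    state and a probability distribution over the vocabulary."""
    h = np.tanh(h + prev_token_id)                 # toy dynamics, not learned
    logits = rng.normal(size=len(VOCAB))
    probs = np.exp(logits) / np.exp(logits).sum()  # softmax
    return h, probs

def sample_sequence(max_len=10):
    h, token_id, out = 0.0, 0, []                  # start from <eos>, zero state
    for _ in range(max_len):
        h, probs = rnn_step(token_id, h)
        token_id = rng.choice(len(VOCAB), p=probs)  # sample y^<t>
        if VOCAB[token_id] == "<eos>":
            break
        out.append(VOCAB[token_id])                # fed back in at step t+1
    return out

print(sample_sequence())
```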


Question 6

True/False: If during RNN training your weights/activations become NaN, you have an exploding gradient problem.

  • True

  • ❌ False

Explanation: NaNs typically result from numerical overflow caused by exploding gradients (extremely large parameter and activation updates); this is the classic symptom. NaNs can also come from invalid operations such as log(0) or division by zero, but in RNN training exploding gradients are the common cause.
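A standard remedy is gradient norm clipping: if the global L2 norm of the gradients exceeds a threshold, rescale them. A minimal NumPy sketch (the threshold 5.0 is an arbitrary illustrative choice):

```python
import numpy as np

def clip_gradients(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so their global L2 norm
    does not exceed max_norm (standard norm clipping)."""
    total = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
    if total > max_norm:
        grads = [g * (max_norm / total) for g in grads]
    return grads

# A deliberately "exploded" gradient with norm 300:
g = clip_gradients([np.full((3, 3), 100.0)])
print(np.sqrt((g[0] ** 2).sum()))  # ≈ 5.0, back within the threshold
```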


Question 7

An LSTM has a 10,000-word vocabulary and 100-dimensional activations a^{<t>}. What is the dimension of Γ_u at each time step?

  • ❌ 1

  • 100

  • ❌ 300

  • ❌ 10000

Explanation: Gate vectors (like Γ_u) in an LSTM or GRU have the same dimensionality as the hidden-state/activation vector a^{<t>}. With 100-dimensional activations, Γ_u is a 100-dimensional vector.
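A quick shape check in NumPy, assuming the course-style gate computation Γ_u = σ(W_u [a^{<t−1>}; x^{<t>}] + b_u) with zero-placeholder weights (the values are meaningless; only the shapes matter):

```python
import numpy as np

n_a, n_x = 100, 10_000            # hidden size, one-hot vocabulary size

W_u = np.zeros((n_a, n_a + n_x))  # update-gate weights (placeholder)
b_u = np.zeros((n_a, 1))
a_prev = np.zeros((n_a, 1))       # a^<t-1>
x_t = np.zeros((n_x, 1))          # x^<t> (one-hot word)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

gamma_u = sigmoid(W_u @ np.vstack([a_prev, x_t]) + b_u)
print(gamma_u.shape)  # (100, 1): one gate value per hidden unit
```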


Question 8

Alice removes the GRU update gate Γ_u (sets it to 0). Betty removes the reset gate Γ_r (sets it to 1). Which model is more likely to work without vanishing gradients on very long sequences?

  • ❌ Alice’s model (removing Γ_u), because if Γ_r ≈ 0 the gradient can flow.

  • ❌ Alice’s model (removing Γ_u), because if Γ_r ≈ 1 the gradient can flow.

  • ❌ Betty’s model (removing Γ_r), because if Γ_u ≈ 0 the gradient can flow.

  • Betty’s model (removing Γ_r), because if Γ_u ≈ 1 the gradient can propagate back through that timestep without much decay.

Explanation: The GRU update gate Γ_u (a.k.a. z) controls the strength of the identity pathway from h_{t−1} to h_t. When Γ_u can be close to 1, the state carries through many steps nearly unchanged, so gradients flow without vanishing. Removing the reset gate Γ_r is therefore far less harmful than disabling the update gate, and Betty’s variant (which keeps Γ_u) is preferable for long-range gradient flow.


Question 9

True/False: The LSTM forget and input gates play roles similar to the GRU’s Γ_u and 1 − Γ_u, respectively.

  • True

  • ❌ False

Explanation: In the GRU (using this document’s convention): h_t = Γ_u ⊙ h_{t−1} + (1 − Γ_u) ⊙ h̃_t. In the LSTM: c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t. So f_t (forget) ↔ Γ_u and i_t (input) ↔ 1 − Γ_u (up to naming conventions); they play analogous roles in gating old versus new information.
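A tiny NumPy check of that correspondence, using the convention above in which Γ_u multiplies the old state: choosing f = Γ_u and i = 1 − Γ_u makes the GRU and LSTM blends identical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
h_prev = rng.normal(size=n)      # old state
h_tilde = rng.normal(size=n)     # candidate state
gamma_u = rng.uniform(size=n)    # update-gate values in (0, 1)

# GRU blend (convention used above): keep the old state by gamma_u.
h_gru = gamma_u * h_prev + (1 - gamma_u) * h_tilde

# LSTM blend with forget f = gamma_u and input i = 1 - gamma_u.
f, i = gamma_u, 1 - gamma_u
c_lstm = f * h_prev + i * h_tilde

print(np.allclose(h_gru, c_lstm))  # True
```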


Question 10

You have 365 days of weather x^{<1>}, …, x^{<365>} and moods y^{<1>}, …, y^{<365>}. Should you use a unidirectional RNN or a bidirectional RNN?

  • ❌ Unidirectional RNN, because y^{<t>} depends only on x^{<t>}.

  • ❌ Bidirectional RNN, because this allows backpropagation to compute more accurate gradients.

  • Bidirectional RNN, because this allows the prediction of mood on day t to take into account more information.

  • ❌ Unidirectional RNN, because y^{<t>} depends only on past x but not future x.

Explanation: If the label y^{<t>} plausibly depends on both past and future weather (mood influenced by the days surrounding day t), a bidirectional RNN can draw on context from both directions to predict y^{<t>}. A unidirectional RNN would be required only if causality forbade using future data at prediction time (an online setting); here the full year of data is available, so bidirectional processing provides more context and typically better accuracy for offline prediction.
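A minimal NumPy sketch of the bidirectional idea (toy zero weights and made-up sizes, shapes only): one recurrent pass runs left-to-right, a second runs right-to-left, and the prediction for day t can read both.

```python
import numpy as np

T, n_x, n_a = 5, 3, 4            # toy sizes: timesteps, input dim, hidden dim
X = np.zeros((T, n_x))           # x^<1..T> (placeholder weather features)
Wf = np.zeros((n_a, n_a + n_x))  # forward-cell weights (placeholder)
Wb = np.zeros((n_a, n_a + n_x))  # backward-cell weights (placeholder)

def cell(W, a, x):
    return np.tanh(W @ np.concatenate([a, x]))

a_fwd = np.zeros((T, n_a))
a = np.zeros(n_a)
for t in range(T):               # left-to-right pass (past context)
    a = cell(Wf, a, X[t])
    a_fwd[t] = a

a_bwd = np.zeros((T, n_a))
a = np.zeros(n_a)
for t in reversed(range(T)):     # right-to-left pass (future context)
    a = cell(Wb, a, X[t])
    a_bwd[t] = a

# The prediction at day t sees both a_fwd[t] (past) and a_bwd[t] (future).
context = np.concatenate([a_fwd, a_bwd], axis=1)
print(context.shape)  # (5, 8)
```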


🧾 Summary Table

| Q # | Correct Answer(s) | Key concept |
| --- | --- | --- |
| 1 | x^{(r)<s>} | Notation: example index, then word/time index. |
| 2 | T_x = T_y | Architecture matches equal-length input/output. |
| 3 | Music genre recognition; language recognition | Many-to-one: sequence → single label. |
| 4 | True | RNN models the conditional P(y_t ∣ y_{<t}). |
| 5 | True | Sampling: sample a token at step t, feed it to step t+1. |
| 6 | True | NaNs commonly indicate exploding gradients. |
| 7 | 100 | Gate vectors match the hidden-state dimensionality. |
| 8 | Betty’s model (remove Γ_r) | Update gate preserves the identity path → gradient flow. |
| 9 | True | GRU update ↔ LSTM forget; (1 − update) ↔ input gate. |
| 10 | Bidirectional RNN | Uses both past and future context at each step. |