Recurrent Neural Networks: Sequence Models (Deep Learning Specialization) Quiz Answers, 2025
Question 1
Which expression refers to the s-th word in the r-th training example?
- ❌ x^(s)<r>
- ❌ x^<r>(s)
- ✅ x^(r)<s>
- ❌ x^<s>(r)
Explanation: The course notation is x^{(r)<s>}: the parenthesized index (r) selects the r-th training example, and the angle-bracket index <s> selects the s-th word (time step) within that example. Option x^(r)<s> matches that ordering: example index first, then time/word index.
Question 2
The pictured RNN architecture is appropriate when:
- ✅ T_x = T_y
- ❌ T_x < T_y
- ❌ T_x > T_y
- ❌ T_x = 1
Explanation: The pictured architecture is a same-length many-to-many RNN: it produces one output at each input timestep (e.g., sequence labeling and other time-aligned tasks). That is appropriate when input and output sequence lengths match, i.e., T_x = T_y.
Question 3
Which tasks use a many-to-one RNN architecture? (Choose all that apply.)
- ❌ Image classification (input an image, output a label)
- ✅ Music genre recognition
- ✅ Language recognition from speech (audio → single language label)
- ❌ Speech recognition (input audio → transcript)
Explanation: A many-to-one architecture maps a variable-length input sequence to a single output label. Music genre recognition and language recognition from an audio clip both fit this pattern. Speech recognition outputs a sequence (many-to-many), and image classification takes a single fixed-size input rather than a sequence (one-to-one).
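A minimal sketch of the many-to-one pattern in NumPy: the RNN consumes the whole input sequence but reads out a single label distribution from the final hidden state. All sizes and weight initializations here are illustrative assumptions, not values from the course.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 50 timesteps of 13 audio features -> 10 genre classes
T_x, n_x, n_a, n_classes = 50, 13, 64, 10
Wax = rng.normal(0, 0.01, (n_a, n_x))
Waa = rng.normal(0, 0.01, (n_a, n_a))
Wya = rng.normal(0, 0.01, (n_classes, n_a))
ba, by = np.zeros(n_a), np.zeros(n_classes)

def many_to_one(x_seq):
    """Consume an entire sequence, emit ONE label distribution at the end."""
    a = np.zeros(n_a)
    for t in range(x_seq.shape[0]):      # one state update per timestep
        a = np.tanh(Wax @ x_seq[t] + Waa @ a + ba)
    logits = Wya @ a + by                # single readout from the final state
    p = np.exp(logits - logits.max())    # softmax
    return p / p.sum()

probs = many_to_one(rng.normal(size=(T_x, n_x)))
print(probs.shape)  # (10,) -- one distribution for the whole clip
```

Contrast with many-to-many: there, the `Wya @ a + by` readout would run inside the loop, once per timestep.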
Question 4
True/False: At time t the RNN is estimating P(y^<t> | y^<1>, …, y^<t−1>).
- ✅ True
- ❌ False
Explanation: An RNN language model predicts the next token at time t conditioned on all previous tokens; it models exactly that conditional distribution.
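These per-step conditionals multiply together (the chain rule) to give the probability of a whole sequence. A toy check with hypothetical per-step probabilities, not numbers from the course:

```python
import numpy as np

# Hypothetical per-step conditionals P(y^<t> | y^<1..t-1>): the probability the
# model assigned to the token actually emitted at each of 3 steps.
step_probs = [0.5, 0.25, 0.1]

# Chain rule: P(y^<1>, y^<2>, y^<3>) = prod_t P(y^<t> | y^<1..t-1>)
seq_prob = np.prod(step_probs)
seq_logprob = np.sum(np.log(step_probs))  # in practice, sum log-probs for numerical stability
print(round(seq_prob, 6))                 # 0.0125 (= 0.5 * 0.25 * 0.1)
print(np.isclose(np.exp(seq_logprob), seq_prob))  # True
```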
Question 5
True/False: When sampling from your trained language-model RNN, at each step t you sample a word from the RNN's output probabilities and feed it as the input at step t+1.
- ✅ True
- ❌ False
Explanation: That is the standard ancestral sampling procedure for RNN language models: sample from the output distribution at each step and feed the chosen token forward.
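The ancestral sampling loop can be sketched as follows. Weights here are random (so the output is gibberish); the point is the control flow: sample from the softmax, then one-hot encode the sampled token as the next input. Vocabulary size and dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab, n_a = 27, 50                       # e.g. character-level model; sizes illustrative
Wax = rng.normal(0, 0.01, (n_a, vocab))
Waa = rng.normal(0, 0.01, (n_a, n_a))
Wya = rng.normal(0, 0.01, (vocab, n_a))

def sample(max_len=20, eos=0):
    a = np.zeros(n_a)
    x = np.zeros(vocab)                   # step 1 input: zero vector (no previous token)
    out = []
    for _ in range(max_len):
        a = np.tanh(Wax @ x + Waa @ a)
        z = Wya @ a
        p = np.exp(z - z.max()); p /= p.sum()
        tok = rng.choice(vocab, p=p)      # SAMPLE from the distribution (not argmax)
        if tok == eos:                    # stop on end-of-sequence token
            break
        out.append(int(tok))
        x = np.zeros(vocab); x[tok] = 1.0 # feed the sampled token forward as the next input
    return out

print(sample())
```

Taking the argmax instead of sampling would produce the single most likely sequence greedily; sampling gives diverse outputs, which is the point of generation.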
Question 6
True/False: If during RNN training your weights/activations become NaN, you have an exploding gradient problem.
- ✅ True
- ❌ False
Explanation: NaNs during RNN training typically result from numerical overflow caused by exploding gradients: parameter updates become extremely large and push weights and activations out of range. (NaNs can also arise from invalid operations such as log(0), but exploding gradients are the classic cause, and gradient clipping is the standard remedy.)
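A minimal sketch of the standard remedy, global-norm gradient clipping: if the combined L2 norm of all gradients exceeds a threshold, rescale them all by the same factor. The function name and threshold are illustrative choices.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale all gradients in-place-style if their global L2 norm exceeds max_norm."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total > max_norm:
        scale = max_norm / total          # same factor for every gradient: direction preserved
        grads = [g * scale for g in grads]
    return grads

# A "exploded" gradient set with huge norm gets pulled back to exactly max_norm:
big = [np.full((3, 3), 100.0), np.full((3,), 100.0)]
clipped = clip_by_global_norm(big, max_norm=5.0)
norm = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(round(norm, 6))  # 5.0
```

Clipping by global norm (rather than clipping each element independently) preserves the gradient's direction while bounding the update size.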
Question 7
An LSTM has a 10,000-word vocabulary and 100-dimensional activations a^<t>. What is the dimension of the update gate Γu at each time step?
- ❌ 1
- ✅ 100
- ❌ 300
- ❌ 10000
Explanation: Gate vectors such as Γu in an LSTM/GRU have the same dimensionality as the hidden state a^<t>, because they gate it elementwise. With 100-dimensional activations, Γu is a 100-dimensional vector.
Question 8
Alice simplifies the GRU by removing the update gate Γu (fixing it to 0). Betty simplifies it by removing the reset gate Γr (fixing it to 1). Which model is more likely to train without vanishing gradients on very long sequences?
- ❌ Alice’s model (removing Γu), because if Γr ≈ 0 the gradient can flow.
- ❌ Alice’s model (removing Γu), because if Γr ≈ 1 the gradient can flow.
- ❌ Betty’s model (removing Γr), because if Γu ≈ 0 the gradient can flow.
- ✅ Betty’s model (removing Γr), because if Γu ≈ 1 the gradient can propagate back through that timestep without much decay.
Explanation: With the update rule h_t = Γu ⊙ h_{t−1} + (1 − Γu) ⊙ h̃_t, the update gate Γu controls an identity-like path from h_{t−1} to h_t. When Γu can saturate near 1, the state (and hence the gradient) carries through many steps almost unchanged, which is exactly what prevents vanishing. Removing the reset gate Γr leaves this path intact, whereas fixing Γu = 0 destroys it. Betty’s variant, which keeps Γu, is therefore preferable for long-range dependencies.
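The decay can be quantified with a back-of-the-envelope calculation. Under h_t = Γu ⊙ h_{t−1} + (1 − Γu) ⊙ h̃_t, the Jacobian ∂h_t/∂h_{t−1} has Γu on its diagonal (ignoring the dependence of h̃_t on h_{t−1}), so the influence of h_0 on h_T, and the backward gradient, scales roughly like Γu^T per component. The gate values below are illustrative:

```python
# Fraction of the t=0 signal (and gradient) surviving after T steps, ~ gamma_u ** T
T = 200
print(f"{0.99 ** T:.3f}")   # 0.134  -> gate saturated near 1: signal still flows
print(f"{0.50 ** T:.1e}")   # 6.2e-61 -> mid-range gate: effectively vanished
```

This is why the ability of Γu to sit near 1 matters so much more than the reset gate for long-range credit assignment.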
Question 9
True/False: The LSTM’s forget and input gates play roles analogous to the GRU’s Γu and 1 − Γu.
- ✅ True
- ❌ False
Explanation: In the GRU, h_t = Γu ⊙ h_{t−1} + (1 − Γu) ⊙ h̃_t. In the LSTM, c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t. So the forget gate f_t corresponds to Γu and the input gate i_t corresponds to 1 − Γu (up to naming conventions): both pairs decide how much old state to keep versus how much new candidate to write.
Question 10
You have 365 days of weather data x^<1>, …, x^<365> and daily moods y^<1>, …, y^<365>. Should you use a unidirectional RNN or a bidirectional RNN for this problem?
- ❌ Unidirectional RNN, because y^<t> depends only on x^<t>.
- ❌ Bidirectional RNN, because this allows backpropagation to compute more accurate gradients.
- ✅ Bidirectional RNN, because this allows the prediction of mood on day t to take into account more information.
- ❌ Unidirectional RNN, because y^<t> depends only on past x but not future x.
Explanation: Since the full year of data is available (an offline setting), a bidirectional RNN can use context from both before and after day t to predict y^<t>, which typically improves accuracy. A unidirectional RNN would be required only if causality forbade using future inputs at prediction time (an online setting), which is not the case here.
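A sketch of the bidirectional idea: run one RNN left-to-right and an independent one right-to-left, then concatenate the two hidden states at each day so the prediction sees both past and future. Feature counts and dimensions are illustrative assumptions; biases and the output layer are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(4)
T, n_x, n_a = 365, 3, 8                     # 3 weather features/day; sizes illustrative

Wf = rng.normal(0, 0.1, (n_a, n_a + n_x))   # forward-direction weights
Wb = rng.normal(0, 0.1, (n_a, n_a + n_x))   # backward-direction weights
x = rng.normal(size=(T, n_x))

def scan(W, xs):
    """Run a simple tanh RNN over xs, returning the hidden state at every step."""
    a, out = np.zeros(n_a), []
    for xt in xs:
        a = np.tanh(W @ np.concatenate([a, xt]))
        out.append(a)
    return out

fwd = scan(Wf, x)                           # left-to-right: summarizes days 1..t
bwd = scan(Wb, x[::-1])[::-1]               # right-to-left: summarizes days t..365
# Prediction for day t = 101 sees both directions:
a_t = np.concatenate([fwd[100], bwd[100]])
print(a_t.shape)  # (16,) -- past AND future information available for that day
```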
🧾 Summary Table
| Q # | Correct answer(s) | Key concept |
|---|---|---|
| 1 | ✅ x^(r)<s> | Notation: example index first, then word/time index. |
| 2 | ✅ T_x = T_y | Same-length many-to-many architecture. |
| 3 | ✅ Music genre recognition; ✅ Language recognition | Many-to-one: sequence → single label. |
| 4 | ✅ True | RNN models the conditional next-token distribution P(y^<t> given y^<1..t−1>). |
| 5 | ✅ True | Sampling: sample a token at step t, feed it as input at t+1. |
| 6 | ✅ True | NaNs commonly indicate exploding gradients. |
| 7 | ✅ 100 | Gate vectors match the hidden-state dimensionality. |
| 8 | ✅ Betty’s model (remove Γr) | The update gate preserves an identity path → gradient flow. |
| 9 | ✅ True | GRU Γu ↔ LSTM forget gate; 1 − Γu ↔ input gate. |
| 10 | ✅ Bidirectional RNN | Uses both past and future context (offline prediction). |