Shallow Neural Networks: Neural Networks and Deep Learning (Deep Learning Specialization) Answers 2025
Question 1 — Which of the following are true? (Check all that apply.)
❌ w₃[4] is the column vector of parameters of the fourth layer and third neuron.
❌ a denotes the activation vector of the second layer for the third example.
❌ w₃[4] is the row vector of parameters of the fourth layer and third neuron.
✅ a[2] denotes the activation vector of the second layer.
❌ a₃[2] denotes the activation vector of the second layer for the third example.
✅ w₃[4] is the column vector of parameters of the third layer and fourth neuron.
Explanation:
Using standard DL notation, a[l] (or A[l]) is the activation vector of layer l, so a[2] denotes the activation vector of layer 2 (true).
w₃[4] is the column vector of parameters of a single neuron, here neuron 4 of layer 3 (true). The other statements either swap the layer and neuron indices or confuse a layer index with an example index, so they are false.
Question 2
In which case is the linear (identity) activation function most likely used?
❌ For binary classification problems.
❌ The linear activation function is never used.
✅ When working with regression problems.
❌ As activation function in the hidden layers.
Explanation:
A linear activation is appropriate for regression output (predicting continuous values). For classification we use sigmoid/softmax; hidden layers normally use non-linear activations.
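As a quick illustration (a minimal sketch with made-up sizes, not taken from the quiz), the output activation is what changes with the task: an identity output for regression versus a sigmoid output for binary classification.

```python
import numpy as np

np.random.seed(0)
x = np.random.randn(3, 1)          # one example with 3 features (assumed sizes)
W = np.random.randn(1, 3) * 0.01   # output-layer weights
b = np.zeros((1, 1))

z = W @ x + b

# Regression: identity (linear) activation, the output can be any real value
y_hat_regression = z

# Binary classification: sigmoid squashes z into (0, 1)
y_hat_classification = 1 / (1 + np.exp(-z))

print(y_hat_regression, y_hat_classification)
```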
Question 3
Which is a correct vectorized forward propagation implementation for layer l?
❌ (option using W[l-1] A[l] + b[l-1])
✅ Z[l] = W[l] A[l-1] + b[l], A[l] = g[l](Z[l])
❌ (other incorrect indexing options)
Explanation:
Vectorized forward propagation multiplies the weights of layer l by the activations of the previous layer: Z[l] = W[l] A[l-1] + b[l], then applies the layer's activation: A[l] = g[l](Z[l]).
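A minimal NumPy sketch of this step, assuming arbitrary layer sizes and tanh as the layer's activation g[l]:

```python
import numpy as np

np.random.seed(1)
n_prev, n_l, m = 4, 3, 5                 # assumed sizes: previous layer, layer l, batch

A_prev = np.random.randn(n_prev, m)      # A[l-1], shape (n_prev, m)
W = np.random.randn(n_l, n_prev) * 0.01  # W[l], shape (n_l, n_prev)
b = np.zeros((n_l, 1))                   # b[l], broadcast across the m columns

Z = W @ A_prev + b                       # Z[l] = W[l] A[l-1] + b[l]
A = np.tanh(Z)                           # A[l] = g[l](Z[l])

print(Z.shape, A.shape)                  # both (3, 5)
```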
Question 4
The ReLU function has no derivative at c = 0 — so its use is becoming more rare. True/False?
❌ True
✅ False
Explanation:
ReLU is not differentiable exactly at 0, but that has not made it rare; it remains one of the most widely used activation functions because it is simple and trains effectively. In practice the derivative at 0 is simply assigned a value (commonly 0), so the single non-differentiable point causes no problems.
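A sketch of one common convention (assumed here, not prescribed by the quiz): treat the derivative at 0 as 0, which is effectively what most implementations do.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def relu_grad(z):
    # Convention: use 0 at z == 0; any value in [0, 1] works in practice
    return (z > 0).astype(float)

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))       # [0. 0. 3.]
print(relu_grad(z))  # [0. 0. 1.]
```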
Question 5
Given x = np.random.rand(4,5) and y = np.sum(x, axis=1), what is y.shape?
❌ (1, 5)
❌ (5,)
✅ (4,)
❌ (4, 1)
Explanation:
Summing over axis=1 collapses the second dimension (columns), producing one sum per row. With 4 rows → shape (4,).
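This is easy to verify directly in NumPy:

```python
import numpy as np

x = np.random.rand(4, 5)
y = np.sum(x, axis=1)    # sums across columns, producing one value per row
print(y.shape)           # (4,)

# keepdims=True would preserve the collapsed axis instead, giving shape (4, 1)
print(np.sum(x, axis=1, keepdims=True).shape)
```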
Question 6
What is the best option to initialize the weights of a network with tanh hidden layers?
❌ Initialize the weights to large random numbers.
❌ Initialize all weights to a single number chosen randomly.
❌ Initialize all weights to 0.
✅ Initialize the weights to small random numbers.
Explanation:
Small random initialization breaks symmetry and avoids saturation. Zero or identical initialization prevents learning; huge values saturate activations.
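A minimal sketch of this initialization, assuming one tanh hidden layer, made-up layer sizes, and a small scale such as 0.01:

```python
import numpy as np

n_x, n_h, n_y = 4, 2, 1                  # assumed layer sizes

W1 = np.random.randn(n_h, n_x) * 0.01    # small random numbers break symmetry
b1 = np.zeros((n_h, 1))                  # biases can safely start at zero
W2 = np.random.randn(n_y, n_h) * 0.01
b2 = np.zeros((n_y, 1))

print(W1)                                # small values keep tanh out of saturation
```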
Question 7
Using linear activations in hidden layers of a multilayer NN is equivalent to using a single layer. True/False?
❌ False
✅ True
Explanation:
A composition of linear functions is still linear — stacking linear hidden layers is mathematically equivalent to one linear transformation (no extra representational power).
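A quick numerical check (with made-up shapes): two stacked linear layers compute exactly the same function as a single linear layer whose weight matrix is the product of the two.

```python
import numpy as np

np.random.seed(2)
x = np.random.randn(4, 1)
W1 = np.random.randn(3, 4); b1 = np.random.randn(3, 1)
W2 = np.random.randn(2, 3); b2 = np.random.randn(2, 1)

two_layers = W2 @ (W1 @ x + b1) + b2          # linear hidden layer + linear output
one_layer  = (W2 @ W1) @ x + (W2 @ b1 + b2)   # equivalent single linear layer

print(np.allclose(two_layers, one_layer))     # True
```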
Question 8
Which is true about the tanh function?
✅ For large values the slope is close to zero.
❌ For large values the slope is larger.
❌ The derivative at c = 0 is not well defined.
❌ The slope is zero for negative values.
Explanation:
tanh(x) saturates to ±1 for large |x|, so its derivative approaches 0 there. The derivative at 0 is well defined (it equals 1).
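The derivative of tanh is 1 - tanh²(x), which equals 1 at 0 and shrinks toward 0 as |x| grows; a quick check:

```python
import numpy as np

def tanh_grad(x):
    return 1 - np.tanh(x) ** 2   # d/dx tanh(x) = 1 - tanh(x)^2

for x in [0.0, 2.0, 5.0, -5.0]:
    print(x, tanh_grad(x))
# 0.0 -> 1.0, 2.0 -> ~0.07, ±5.0 -> ~1.8e-4 (slope near zero once saturated)
```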
Question 9 — 1-hidden-layer NN: Which statements are true? (Check all that apply)
❌ W[1] will have shape (4, 2)
✅ b[1] will have shape (2, 1)
❌ W[2] will have shape (1, 4)
✅ W[1] will have shape (2, 4)
✅ b[2] will have shape (1, 1)
❌ b[1] will have shape (4, 1)
❌ b[2] will have shape (4, 1)
❌ W[2] will have shape (4, 1)
Explanation:
Assuming the conventional sizes (input size = 4, hidden layer size = 2, output size = 1):
- W[1] shape = (n_hidden, n_input) = (2, 4)
- b[1] shape = (2, 1)
- W[2] shape = (n_output, n_hidden) = (1, 2)
- b[2] shape = (1, 1)
So the three ✅ statements above are the correct ones.
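A minimal sketch that checks these shapes, assuming input size 4, hidden size 2, and output size 1 as above:

```python
import numpy as np

n_x, n_h, n_y = 4, 2, 1

W1 = np.random.randn(n_h, n_x) * 0.01   # (2, 4)
b1 = np.zeros((n_h, 1))                 # (2, 1)
W2 = np.random.randn(n_y, n_h) * 0.01   # (1, 2)
b2 = np.zeros((n_y, 1))                 # (1, 1)

print(W1.shape, b1.shape, W2.shape, b2.shape)
# (2, 4) (2, 1) (1, 2) (1, 1)
```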
Question 10
What are the dimensions of Z[1] and A[1]?
❌ (4, 1)
✅ (2, m)
❌ (4, m)
❌ (2, 1)
Explanation:
Z[1] and A[1] have shape (n[1], m), where n[1] is the number of units in hidden layer 1 (here 2) and m is the number of examples, giving (2, m).
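Continuing with the same assumed sizes, and a batch of m examples stacked as columns of X (m = 10 here, chosen arbitrarily):

```python
import numpy as np

n_x, n_h, m = 4, 2, 10                   # assumed sizes; m chosen arbitrarily

X  = np.random.randn(n_x, m)             # each column is one training example
W1 = np.random.randn(n_h, n_x) * 0.01
b1 = np.zeros((n_h, 1))

Z1 = W1 @ X + b1                         # shape (2, m)
A1 = np.tanh(Z1)                         # shape (2, m)
print(Z1.shape, A1.shape)                # (2, 10) (2, 10)
```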
🧾 Summary Table
| Q# | ✅ Correct Answer | Key Concept |
|---|---|---|
| 1 | a[2]; w₃[4] is the layer-3, neuron-4 column vector | Notation for activations & weight vectors |
| 2 | When working with regression problems | Linear activation used for regression outputs |
| 3 | Z[l] = W[l] A[l-1] + b[l]; A[l]=g(Z[l]) | Vectorized forward propagation |
| 4 | False | ReLU nondifferentiability at 0 is not a practical problem |
| 5 | (4,) | sum(..., axis=1) collapses columns → one per row |
| 6 | Initialize small random numbers | Good weight init for tanh (avoid saturation) |
| 7 | True | Stacking linear layers is still linear |
| 8 | For large values slope ≈ 0 | tanh saturates → small derivative far from 0 |
| 9 | b[1] (2,1); W[1] (2,4); b[2] (1,1) | Shapes derived from (input=4, hidden=2, output=1) |
| 10 | (2, m) | Activations per layer: (units, #examples) |