
Shallow Neural Networks: Neural Networks and Deep Learning (Deep Learning Specialization) Answers, 2025

Question 1 — Which of the following are true? (Check all that apply.)

✅ w₃[4] is the column vector of parameters of the fourth layer and third neuron.
❌ a denotes the activation vector of the second layer for the third example.
❌ w₃[4] is the row vector of parameters of the fourth layer and third neuron.
✅ a[2] denotes the activation vector of the second layer.
❌ a₃[2] denotes the activation vector of the second layer for the third example.
❌ w₃[4] is the column vector of parameters of the third layer and fourth neuron.

Explanation:
Using standard DL notation, a^[l] is the activation vector of layer l, so a[2] denotes the activation vector of layer 2 (true).
Weights carry the layer in the bracketed superscript and the neuron in the subscript: w_j^[l] is the column vector of parameters of the j-th neuron in layer l, so w₃[4] is the column vector of parameters of the fourth layer and third neuron (true). The remaining statements swap the layer/neuron indices or confuse a neuron index with an example index (false).
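To make the notation concrete, here is a small NumPy sketch (the layer sizes 5 and 4 are made up for illustration): the per-neuron parameter vectors w_i^[l] are the rows of W^[l], read off as column vectors.

```python
import numpy as np

# Illustration only: suppose layer 3 has 5 units and layer 4 has 4 units.
n_prev, n_curr = 5, 4

# W^[4] has shape (n^[4], n^[3]); row i holds the weights of neuron i of layer 4.
W4 = np.random.randn(n_curr, n_prev)

# w_3^[4]: column vector of parameters of layer 4, neuron 3 (1-indexed).
w3_4 = W4[2, :].reshape(-1, 1)
print(w3_4.shape)  # (5, 1) -> one weight per unit of the previous layer
```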


Question 2

In which case is the linear (identity) activation function most likely used?

❌ For binary classification problems.
❌ The linear activation function is never used.
✅ When working with regression problems.
❌ As activation function in the hidden layers.

Explanation:
A linear activation is appropriate for regression output (predicting continuous values). For classification we use sigmoid/softmax; hidden layers normally use non-linear activations.
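As a minimal sketch (the layer sizes and the tanh hidden activation are assumptions for illustration, not part of the question), a regression network keeps non-linear hidden units but leaves the output unit linear so the prediction can be any real value:

```python
import numpy as np

np.random.seed(0)
n_x, n_h = 3, 4                       # made-up sizes for illustration
x = np.random.randn(n_x, 1)           # one input example

W1, b1 = np.random.randn(n_h, n_x), np.zeros((n_h, 1))
W2, b2 = np.random.randn(1, n_h), np.zeros((1, 1))

a1 = np.tanh(W1 @ x + b1)             # non-linear hidden layer
y_hat = W2 @ a1 + b2                  # linear (identity) output: unbounded real value
print(y_hat)                          # suitable for a regression target
```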


Question 3

Which is a correct vectorized forward propagation implementation for layer l?

❌ Z[l] = W[l-1] A[l] + b[l-1]   (incorrect layer indexing)
✅ Z[l] = W[l] A[l-1] + b[l]
    A[l] = g^[l](Z[l])
❌ (other options with incorrect layer indexing)

Explanation:
Vectorized forward prop multiplies the weights of layer l by the activations of the previous layer, A[l-1]: Z[l] = W[l] A[l-1] + b[l], then A[l] = g^[l](Z[l]).
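The correct option translates directly into NumPy. Below is a minimal sketch; the layer sizes, the number of examples, and the choice of tanh as g^[l] are assumptions for illustration.

```python
import numpy as np

np.random.seed(1)
m = 10                                  # number of examples (assumed)
n_prev, n_l = 4, 3                      # sizes of layer l-1 and layer l (assumed)

A_prev = np.random.randn(n_prev, m)     # A[l-1], shape (n[l-1], m)
W = np.random.randn(n_l, n_prev)        # W[l],   shape (n[l], n[l-1])
b = np.zeros((n_l, 1))                  # b[l],   broadcast across the m columns

Z = W @ A_prev + b                      # Z[l] = W[l] A[l-1] + b[l]
A = np.tanh(Z)                          # A[l] = g[l](Z[l]); tanh used as an example g
print(Z.shape, A.shape)                 # both (3, 10) = (n[l], m)
```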


Question 4

The ReLU function has no derivative at x = 0, so its use is becoming more rare. True/False?

❌ True
✅ False

Explanation:
While ReLU is not differentiable exactly at x = 0, this does not make it rare in practice; ReLU is extremely common because it is simple, cheap to compute, and trains well. Implementations simply assign a derivative (typically 0) at that single point, so the non-differentiability is not a practical issue.
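A minimal sketch of that convention (choosing derivative 0 at z = 0 is the common implementation choice, not the only valid one):

```python
import numpy as np

def relu(z):
    """ReLU activation: max(0, z), applied element-wise."""
    return np.maximum(0, z)

def relu_grad(z):
    """Common convention: derivative is 1 for z > 0, else 0 (including at z = 0)."""
    return (z > 0).astype(float)

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))       # [0. 0. 3.]
print(relu_grad(z))  # [0. 0. 1.]  -> the point z = 0 simply gets gradient 0
```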


Question 5

Given x = np.random.rand(4,5) and y = np.sum(x, axis=1), what is y.shape?

❌ (1, 5)
❌ (5,)
✅ (4,)
❌ (4, 1)

Explanation:
Summing over axis=1 collapses the second dimension (columns), producing one sum per row. With 4 rows → shape (4,).
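You can verify the shape directly:

```python
import numpy as np

x = np.random.rand(4, 5)          # 4 rows, 5 columns
y = np.sum(x, axis=1)             # sum across columns -> one value per row
print(y.shape)                    # (4,)

# keepdims=True would preserve the collapsed axis instead:
print(np.sum(x, axis=1, keepdims=True).shape)  # (4, 1)
```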


Question 6

Best option to initialize weights for a network with tanh hidden layer?

❌ Initialize the weights to large random numbers.
❌ Initialize all weights to a single number chosen randomly.
❌ Initialize all weights to 0.
✅ Initialize the weights to small random numbers.

Explanation:
Small random initialization breaks symmetry and avoids saturation. Zero or identical initialization prevents learning; huge values saturate activations.
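A minimal initialization sketch in the course's usual style (the 0.01 scaling factor is the common convention from the lectures, and the layer sizes here are just an example):

```python
import numpy as np

def initialize_parameters(n_x, n_h, n_y):
    """Small random weights break symmetry; zero biases are fine."""
    W1 = np.random.randn(n_h, n_x) * 0.01   # small values keep tanh out of saturation
    b1 = np.zeros((n_h, 1))
    W2 = np.random.randn(n_y, n_h) * 0.01
    b2 = np.zeros((n_y, 1))
    return {"W1": W1, "b1": b1, "W2": W2, "b2": b2}

params = initialize_parameters(n_x=4, n_h=2, n_y=1)
print({k: v.shape for k, v in params.items()})
```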


Question 7

Using linear activations in hidden layers of a multilayer NN is equivalent to using a single layer. True/False?

❌ False
✅ True

Explanation:
A composition of linear functions is still linear — stacking linear hidden layers is mathematically equivalent to one linear transformation (no extra representational power).
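A quick numerical check (the shapes are chosen arbitrarily for illustration) shows two stacked linear layers collapsing into one linear transformation:

```python
import numpy as np

np.random.seed(2)
x = np.random.randn(4, 1)

W1, b1 = np.random.randn(3, 4), np.random.randn(3, 1)
W2, b2 = np.random.randn(1, 3), np.random.randn(1, 1)

# Two linear layers...
two_layers = W2 @ (W1 @ x + b1) + b2

# ...equal one linear layer with W = W2 W1 and b = W2 b1 + b2.
one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)
print(np.allclose(two_layers, one_layer))  # True
```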


Question 8

Which is true about the tanh function?

✅ For large values the slope is close to zero.
❌ For large values the slope is larger.
❌ The derivative at x = 0 is not well defined.
❌ The slope is zero for negative values.

Explanation:
tanh(x) saturates to ±1 for large |x| so derivative there → near 0. The derivative at 0 is well-defined (it’s 1).
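Since the derivative of tanh is 1 - tanh²(x), the slope is easy to inspect numerically:

```python
import numpy as np

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)**2."""
    return 1.0 - np.tanh(x) ** 2

for x in [0.0, 2.0, 10.0]:
    print(x, tanh_grad(x))
# 0.0  -> 1.0        (well defined, maximal slope)
# 2.0  -> ~0.07
# 10.0 -> ~8e-9      (saturated: slope close to zero)
```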


Question 9 — 1-hidden-layer NN: Which statements are true? (Check all that apply)

❌ W[1] will have shape (4, 2)
✅ b[1] will have shape (2, 1)
❌ W[2] will have shape (1, 4)
✅ W[1] will have shape (2, 4)
✅ b[2] will have shape (1, 1)
❌ b[1] will have shape (4, 1)
❌ b[2] will have shape (4, 1)
❌ W[2] will have shape (4, 1)

Explanation:
Assuming the sizes from the question's network (input size n_x = 4, hidden layer size n_h = 2, output size n_y = 1):

  • W[1] shape = (n_hidden, n_input) = (2,4).

  • b[1] shape = (2,1).

  • W[2] shape = (n_output, n_hidden) = (1,2).

  • b[2] shape = (1,1).
    So the three ✅ statements above are the correct ones.
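Under the same assumption (n_x = 4, n_h = 2, n_y = 1), the parameter shapes can be checked mechanically with a small sketch:

```python
import numpy as np

n_x, n_h, n_y = 4, 2, 1          # input, hidden, output sizes assumed above

W1 = np.random.randn(n_h, n_x)   # (2, 4)
b1 = np.zeros((n_h, 1))          # (2, 1)
W2 = np.random.randn(n_y, n_h)   # (1, 2)
b2 = np.zeros((n_y, 1))          # (1, 1)

print(W1.shape, b1.shape, W2.shape, b2.shape)  # (2, 4) (2, 1) (1, 2) (1, 1)
```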


Question 10

What are the dimensions of Z[1] and A[1]?

❌ (4, 1)
✅ (2, m)
❌ (4, m)
❌ (2, 1)

Explanation:
Z[1] and A[1] have shape (n^{[1]}, m) where n^{[1]} is the number of units in hidden layer 1 (here 2) and m is number of examples → (2, m).
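A forward pass through the hidden layer confirms the shape (m = 7 examples is an arbitrary choice for this sketch, and tanh is assumed as the hidden activation):

```python
import numpy as np

n_x, n_h, m = 4, 2, 7            # input size, hidden units, number of examples (assumed)

X = np.random.randn(n_x, m)      # (4, 7): one column per example
W1 = np.random.randn(n_h, n_x)   # (2, 4)
b1 = np.zeros((n_h, 1))          # (2, 1), broadcast across the example columns

Z1 = W1 @ X + b1                 # (2, 7) = (n[1], m)
A1 = np.tanh(Z1)                 # same shape as Z1
print(Z1.shape, A1.shape)        # (2, 7) (2, 7)
```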


🧾 Summary Table

| Q# | ✅ Correct Answer | Key Concept |
|----|------------------|-------------|
| 1 | a[2]; w₃[4] is the column vector of layer 4, neuron 3 | Notation for activations & weight vectors |
| 2 | When working with regression problems | Linear activation is used for regression outputs |
| 3 | Z[l] = W[l] A[l-1] + b[l]; A[l] = g^[l](Z[l]) | Vectorized forward propagation |
| 4 | False | ReLU non-differentiability at 0 is not a practical problem |
| 5 | (4,) | np.sum(..., axis=1) collapses columns, one sum per row |
| 6 | Initialize the weights to small random numbers | Good weight init for tanh (break symmetry, avoid saturation) |
| 7 | True | Stacking linear layers is still linear |
| 8 | For large values the slope is close to zero | tanh saturates, so the derivative is small far from 0 |
| 9 | b[1] (2, 1); W[1] (2, 4); b[2] (1, 1) | Shapes derived from (input = 4, hidden = 2, output = 1) |
| 10 | (2, m) | Activations per layer have shape (units, #examples) |