Shallow Neural Networks: Neural Networks and Deep Learning (Deep Learning Specialization) Answers 2025
Question 1 — Which of the following are true? (Check all that apply.)
❌ w₃[4] is the column vector of parameters of the fourth layer and third neuron.
❌ a denotes the activation vector of the second layer for the third example.
❌ w₃[4] is the row vector of parameters of the fourth layer and third neuron.
✅ a[2] denotes the activation vector of the second layer.
❌ a₃[2] denotes the activation vector of the second layer for the third example.
✅ w₃[4] is the column vector of parameters of the third layer and fourth neuron.
Explanation:
Using standard DL notation, a[l] (or A[l]) is the activation vector of layer l, so a[2] denotes the activation vector of layer 2 (true).
w₃[4] is the column vector of parameters of a single neuron, here neuron 4 of layer 3 (true). The other statements either swap the layer and neuron indices or confuse a layer index with an example index, so they are false.
Question 2
In which case is the linear (identity) activation function most likely used?
❌ For binary classification problems.
❌ The linear activation function is never used.
✅ When working with regression problems.
❌ As activation function in the hidden layers.
Explanation:
A linear activation is appropriate for regression output (predicting continuous values). For classification we use sigmoid/softmax; hidden layers normally use non-linear activations.
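As a quick illustration (a minimal sketch with made-up sizes, not taken from the quiz), the output activation is what changes with the task: an identity output for regression versus a sigmoid output for binary classification.

```python
import numpy as np

np.random.seed(0)
x = np.random.randn(3, 1)          # one example with 3 features (assumed sizes)
W = np.random.randn(1, 3) * 0.01   # output-layer weights
b = np.zeros((1, 1))

z = W @ x + b

# Regression: identity (linear) activation, the output can be any real value
y_hat_regression = z

# Binary classification: sigmoid squashes z into (0, 1)
y_hat_classification = 1 / (1 + np.exp(-z))

print(y_hat_regression, y_hat_classification)
```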
Question 3
Which is a correct vectorized forward propagation implementation for layer l?
❌ (option using W[l-1] A[l] + b[l-1])
✅ Z[l] = W[l] A[l-1] + b[l], A[l] = g[l](Z[l])
❌ (other incorrect indexing options)
Explanation:
Vectorized forward propagation multiplies the weights of layer l by the activations of the previous layer: Z[l] = W[l] A[l-1] + b[l], then applies the layer's activation: A[l] = g[l](Z[l]).
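A minimal NumPy sketch of this step, assuming arbitrary layer sizes and tanh as the layer's activation g[l]:

```python
import numpy as np

np.random.seed(1)
n_prev, n_l, m = 4, 3, 5                 # assumed sizes: previous layer, layer l, batch

A_prev = np.random.randn(n_prev, m)      # A[l-1], shape (n_prev, m)
W = np.random.randn(n_l, n_prev) * 0.01  # W[l], shape (n_l, n_prev)
b = np.zeros((n_l, 1))                   # b[l], broadcast across the m columns

Z = W @ A_prev + b                       # Z[l] = W[l] A[l-1] + b[l]
A = np.tanh(Z)                           # A[l] = g[l](Z[l])

print(Z.shape, A.shape)                  # both (3, 5)
```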
Question 4
The ReLU function has no derivative at c = 0 — so its use is becoming more rare. True/False?
❌ True
✅ False
Explanation:
ReLU is not differentiable exactly at 0, but that has not made it rare; it remains one of the most widely used activation functions because it is simple and trains effectively. In practice the derivative at 0 is simply assigned a value (commonly 0), so the single non-differentiable point causes no problems.
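A sketch of one common convention (assumed here, not prescribed by the quiz): treat the derivative at 0 as 0, which is effectively what most implementations do.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def relu_grad(z):
    # Convention: use 0 at z == 0; any value in [0, 1] works in practice
    return (z > 0).astype(float)

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))       # [0. 0. 3.]
print(relu_grad(z))  # [0. 0. 1.]
```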
Question 5
Given x = np.random.rand(4,5) and y = np.sum(x, axis=1), what is y.shape?
❌ (1, 5)
❌ (5,)
✅ (4,)
❌ (4, 1)
Explanation:
Summing over axis=1 collapses the second dimension (columns), producing one sum per row. With 4 rows → shape (4,).
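This is easy to verify directly in NumPy:

```python
import numpy as np

x = np.random.rand(4, 5)
y = np.sum(x, axis=1)    # sums across columns, producing one value per row
print(y.shape)           # (4,)

# keepdims=True would preserve the collapsed axis instead, giving shape (4, 1)
print(np.sum(x, axis=1, keepdims=True).shape)
```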
Question 6
What is the best option to initialize the weights of a network with tanh hidden layers?
❌ Initialize the weights to large random numbers.
❌ Initialize all weights to a single number chosen randomly.
❌ Initialize all weights to 0.
✅ Initialize the weights to small random numbers.
Explanation:
Small random initialization breaks symmetry and avoids saturation. Zero or identical initialization prevents learning; huge values saturate activations.
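A minimal sketch of this initialization, assuming one tanh hidden layer, made-up layer sizes, and a small scale such as 0.01:

```python
import numpy as np

n_x, n_h, n_y = 4, 2, 1                  # assumed layer sizes

W1 = np.random.randn(n_h, n_x) * 0.01    # small random numbers break symmetry
b1 = np.zeros((n_h, 1))                  # biases can safely start at zero
W2 = np.random.randn(n_y, n_h) * 0.01
b2 = np.zeros((n_y, 1))

print(W1)                                # small values keep tanh out of saturation
```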
Question 7
Using linear activations in hidden layers of a multilayer NN is equivalent to using a single layer. True/False?
❌ False
✅ True
Explanation:
A composition of linear functions is still linear — stacking linear hidden layers is mathematically equivalent to one linear transformation (no extra representational power).
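A quick numerical check (with made-up shapes): two stacked linear layers compute exactly the same function as a single linear layer whose weight matrix is the product of the two.

```python
import numpy as np

np.random.seed(2)
x = np.random.randn(4, 1)
W1 = np.random.randn(3, 4); b1 = np.random.randn(3, 1)
W2 = np.random.randn(2, 3); b2 = np.random.randn(2, 1)

two_layers = W2 @ (W1 @ x + b1) + b2          # linear hidden layer + linear output
one_layer  = (W2 @ W1) @ x + (W2 @ b1 + b2)   # equivalent single linear layer

print(np.allclose(two_layers, one_layer))     # True
```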
Question 8
Which is true about the tanh function?
✅ For large values the slope is close to zero.
❌ For large values the slope is larger.
❌ The derivative at c = 0 is not well defined.
❌ The slope is zero for negative values.
Explanation:
tanh(x) saturates to ±1 for large |x|, so its derivative approaches 0 there. The derivative at 0 is well defined (it equals 1).
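The derivative of tanh is 1 - tanh²(x), which equals 1 at 0 and shrinks toward 0 as |x| grows; a quick check:

```python
import numpy as np

def tanh_grad(x):
    return 1 - np.tanh(x) ** 2   # d/dx tanh(x) = 1 - tanh(x)^2

for x in [0.0, 2.0, 5.0, -5.0]:
    print(x, tanh_grad(x))
# 0.0 -> 1.0, 2.0 -> ~0.07, ±5.0 -> ~1.8e-4 (slope near zero once saturated)
```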
Question 9 — 1-hidden-layer NN: Which statements are true? (Check all that apply)
❌ W[1] will have shape (4, 2)
✅ b[1] will have shape (2, 1)
❌ W[2] will have shape (1, 4)
✅ W[1] will have shape (2, 4)
✅ b[2] will have shape (1, 1)
❌ b[1] will have shape (4, 1)
❌ b[2] will have shape (4, 1)
❌ W[2] will have shape (4, 1)
Explanation:
Assuming the conventional sizes (input size = 4, hidden layer size = 2, output size = 1):
- W[1] shape = (n_hidden, n_input) = (2, 4)
- b[1] shape = (2, 1)
- W[2] shape = (n_output, n_hidden) = (1, 2)
- b[2] shape = (1, 1)
So the three ✅ statements above are the correct ones.
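A minimal sketch that checks these shapes, assuming input size 4, hidden size 2, and output size 1 as above:

```python
import numpy as np

n_x, n_h, n_y = 4, 2, 1

W1 = np.random.randn(n_h, n_x) * 0.01   # (2, 4)
b1 = np.zeros((n_h, 1))                 # (2, 1)
W2 = np.random.randn(n_y, n_h) * 0.01   # (1, 2)
b2 = np.zeros((n_y, 1))                 # (1, 1)

print(W1.shape, b1.shape, W2.shape, b2.shape)
# (2, 4) (2, 1) (1, 2) (1, 1)
```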
Question 10
What are the dimensions of Z[1] and A[1]?
❌ (4, 1)
✅ (2, m)
❌ (4, m)
❌ (2, 1)
Explanation:
Z[1] and A[1] have shape (n[1], m), where n[1] is the number of units in hidden layer 1 (here 2) and m is the number of examples, giving (2, m).
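Continuing with the same assumed sizes, and a batch of m examples stacked as columns of X (m = 10 here, chosen arbitrarily):

```python
import numpy as np

n_x, n_h, m = 4, 2, 10                   # assumed sizes; m chosen arbitrarily

X  = np.random.randn(n_x, m)             # each column is one training example
W1 = np.random.randn(n_h, n_x) * 0.01
b1 = np.zeros((n_h, 1))

Z1 = W1 @ X + b1                         # shape (2, m)
A1 = np.tanh(Z1)                         # shape (2, m)
print(Z1.shape, A1.shape)                # (2, 10) (2, 10)
```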
🧾 Summary Table
| Q# | ✅ Correct Answer | Key Concept |
|---|---|---|
| 1 | a[2]; w₃[4] is the layer-3, neuron-4 column vector | Notation for activations & weight vectors |
| 2 | When working with regression problems | Linear activation used for regression outputs |
| 3 | Z[l] = W[l] A[l-1] + b[l]; A[l]=g(Z[l]) | Vectorized forward propagation |
| 4 | False | ReLU nondifferentiability at 0 is not a practical problem |
| 5 | (4,) | sum(..., axis=1) collapses columns → one per row |
| 6 | Initialize small random numbers | Good weight init for tanh (avoid saturation) |
| 7 | True | Stacking linear layers is still linear |
| 8 | For large values slope ≈ 0 | tanh saturates → small derivative far from 0 |
| 9 | b[1] (2,1); W[1] (2,4); b[2] (1,1) | Shapes derived from (input=4, hidden=2, output=1) |
| 10 | (2, m) | Activations per layer: (units, #examples) |