Optimization Algorithms: Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization (Deep Learning Specialization) Answers, 2025
Question 1
Which notation would you use to denote the 3rd layer’s activations when the input is the 7th example from the 8th minibatch?
✅ a[3]{8}(7)
❌ a[8]{7}(3)
❌ a[3]{7}(8)
❌ a[8]{3}(7)
Explanation:
Notation format:
- [l] → layer number
- {k} → mini-batch index
- (i) → training example index
Hence, the activations for layer 3, 7th example, 8th minibatch → a[3]{8}(7).
Question 2
Which statements about mini-batch gradient descent do you agree with?
✅ Training one epoch using mini-batch GD is faster than using batch GD.
✅ When mini-batch size = training size, it becomes batch GD.
❌ You should implement it without a loop so all mini-batches are processed at once.
Explanation:
Mini-batch GD computes each update on a small subset of the data, so one pass through the training set yields many cheap parameter updates.
When the mini-batch size equals m (all training examples), it is exactly batch GD.
You still have to loop over the mini-batches; they cannot all be processed in a single step (the sketch below shows this explicit loop).
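For reference, here is a minimal NumPy sketch of one epoch of mini-batch gradient descent on a toy linear-regression problem; the data, the examples-as-columns layout, and the hyperparameter values are illustrative assumptions, not taken from the quiz.

```python
import numpy as np

# Minimal sketch: one epoch of mini-batch gradient descent on a toy
# linear-regression problem, with training examples stored as columns.
rng = np.random.default_rng(0)
X = rng.normal(size=(2, 1000))               # 2 features, m = 1000 examples
Y = 3.0 * X[0:1] - 2.0 * X[1:2] + 1.0        # toy targets: w = [3, -2], b = 1
w, b = np.zeros((1, 2)), 0.0
alpha, batch_size = 0.1, 64

# One epoch = one explicit loop over all the mini-batches.
for start in range(0, X.shape[1], batch_size):
    Xb = X[:, start:start + batch_size]
    Yb = Y[:, start:start + batch_size]
    m_b = Xb.shape[1]
    Y_hat = w @ Xb + b                       # forward pass on this mini-batch only
    dZ = Y_hat - Yb
    dw = (dZ @ Xb.T) / m_b                   # gradients from this mini-batch
    db = dZ.mean()
    w -= alpha * dw                          # one parameter update per mini-batch
    b -= alpha * db

print("after one epoch:", w.round(2), round(b, 2))  # already heading toward [3, -2] and 1
```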
Question 3
We usually choose a mini-batch size greater than 1 and less than m because it balances vectorization efficiency and performance.
✅ True
❌ False
Explanation:
Mini-batch GD (1 < batch size < m) keeps the speed of vectorized computation within each mini-batch while making faster progress than waiting for a full pass over the training set before every update.
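As a rough sketch of how such mini-batches are typically built, here is one way to shuffle and partition a training set with examples stored as columns (the helper name and batch size are illustrative, not the course's implementation):

```python
import numpy as np

def random_mini_batches(X, Y, batch_size=64, seed=0):
    """Shuffle the m examples (columns) and split them into mini-batches."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    perm = rng.permutation(m)                      # reshuffle each epoch in practice
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]
    batches = []
    for start in range(0, m, batch_size):          # last mini-batch may be smaller
        batches.append((X_shuf[:, start:start + batch_size],
                        Y_shuf[:, start:start + batch_size]))
    return batches

X = np.random.randn(5, 1000)                       # 5 features, m = 1000 examples
Y = np.random.randn(1, 1000)
print(len(random_mini_batches(X, Y)))              # 16 mini-batches: 15 of size 64, 1 of size 40
```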
Question 4
While using mini-batch gradient descent, the cost function J shows some oscillation — which statement is correct?
✅ If you are using mini-batch GD, this looks acceptable. But if using batch GD, something is wrong.
❌ If you’re using either, it’s fine.
❌ No matter what, something is wrong.
❌ If you’re using batch GD, it’s acceptable.
Explanation:
Mini-batch updates use only part of the data, so the cost measured on successive mini-batches fluctuates.
Batch GD computes the cost on the full training set, so it should decrease on every iteration; oscillation there signals a problem, such as a learning rate that is too large. Oscillation is therefore expected only with mini-batch GD.
Question 5
Temperature example: β=0.5, θ₁=10°C, θ₂=10°C, v₀=0
Compute v₂ (uncorrected) and v₂_corrected.
✅ v₂ = 7.5, v₂_corrected = 10
❌ v₂ = 10, v₂_corrected = 7.5
❌ v₂ = 7.5, v₂_corrected = 7.5
❌ v₂ = 10, v₂_corrected = 10
Explanation:
Using vₜ = β·vₜ₋₁ + (1−β)·θₜ:
v₁ = 0.5×0 + 0.5×10 = 5
v₂ = 0.5×5 + 0.5×10 = 7.5
Bias correction: v₂_corrected = v₂ / (1−β²) = 7.5 / 0.75 = 10.
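The same arithmetic as a short script, using the values from the question:

```python
# v_t = beta * v_{t-1} + (1 - beta) * theta_t, with bias correction v_t / (1 - beta**t)
beta = 0.5
thetas = [10.0, 10.0]          # theta_1 = theta_2 = 10 degrees C
v = 0.0                        # v_0 = 0
for t, theta in enumerate(thetas, start=1):
    v = beta * v + (1 - beta) * theta
    v_corrected = v / (1 - beta ** t)
    print(f"t={t}: v={v}, v_corrected={v_corrected}")
# t=1: v=5.0, v_corrected=10.0
# t=2: v=7.5, v_corrected=10.0
```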
Question 6
Which statement is true about learning rate decay?
✅ For later epochs, parameters are closer to minimum, so smaller steps prevent oscillations.
❌ It reduces model variance.
❌ It increases step size later.
❌ It increases steps each iteration.
Explanation:
As we near the minimum, smaller learning rates prevent overshooting. Learning rate decay reduces α gradually over epochs.
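A minimal sketch of one common decay schedule, α = α₀ / (1 + decay_rate · epoch); the α₀ and decay_rate values below are illustrative assumptions:

```python
# Learning rate shrinks as the epoch number grows.
alpha0, decay_rate = 0.2, 1.0
for epoch in range(5):
    alpha = alpha0 / (1 + decay_rate * epoch)
    print(f"epoch {epoch}: alpha = {alpha:.3f}")
# epoch 0: alpha = 0.200
# epoch 1: alpha = 0.100
# epoch 2: alpha = 0.067
# epoch 3: alpha = 0.050
# epoch 4: alpha = 0.040
```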
Question 7
You use exponentially weighted averages with β₁ and β₂. Which is true?
✅ β₁ > β₂
❌ β₁ = β₂
❌ β₁ < β₂
❌ β₁ = 0, β₂ > 0
Explanation:
Larger β (e.g., 0.9) → smoother curve, because it averages over more past values.
Smaller β (e.g., 0.5) → more responsive but noisier curve.
The smoother (less responsive) line therefore corresponds to the larger β, which is why β₁ > β₂.
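A quick numerical illustration of that trade-off (the synthetic signal and β values are made up for this sketch, not from the quiz): the larger-β average lags behind a sudden change in the data, while the smaller-β average adapts almost immediately.

```python
import numpy as np

def ewma(xs, beta):
    """Exponentially weighted average without bias correction."""
    v, out = 0.0, []
    for x in xs:
        v = beta * v + (1 - beta) * x
        out.append(v)
    return np.array(out)

rng = np.random.default_rng(1)
signal = np.concatenate([np.full(50, 10.0), np.full(50, 20.0)])  # step change at t = 50
signal += rng.normal(0.0, 1.0, size=signal.shape)                # add noise

smooth = ewma(signal, beta=0.9)   # large beta: smooth but slow to react
quick = ewma(signal, beta=0.5)    # small beta: noisier but adapts quickly
print("shortly after the step:", round(smooth[55], 1), "vs", round(quick[55], 1))
```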
Question 8
Gradient descent and momentum comparison:
✅ (1) is GD, (2) is momentum with small β, (3) is momentum with large β.
❌ (1) small β, (2) GD, (3) large β
❌ (1) small β, (2) small β, (3) GD
❌ (1) GD, (2) large β, (3) small β
Explanation:
- (1) Gradient descent: slow, zigzagging path.
- (2) Momentum with small β: smoother, but still some zigzag.
- (3) Momentum with large β: fastest convergence with the smoothest trajectory.
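For reference, a minimal sketch of the momentum update in its common form, v = β·v + (1−β)·grad followed by w = w − α·v; the toy gradients below are made up to mimic a direction that oscillates in sign:

```python
import numpy as np

def momentum_step(w, v, grad, alpha=0.1, beta=0.9):
    """One momentum update: average the gradients, then step along the average."""
    v = beta * v + (1 - beta) * grad
    w = w - alpha * v
    return w, v

w = np.array([1.0, 1.0])
v = np.zeros_like(w)
# First component is consistent; second component flips sign (the "zigzag" direction).
grads = [np.array([0.8, -1.0]), np.array([0.7, 1.0]), np.array([0.9, -1.0])]
for g in grads:
    w, v = momentum_step(w, v, g)
print(v)  # the consistent direction accumulates; the oscillating one largely cancels
```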
Question 9
Batch GD is slow — which methods can help minimize J faster? (Check all that apply.)
✅ Normalize the input data
✅ Try mini-batch GD
✅ Try using Adam
❌ Try initializing weights at zero
Explanation:
Normalization improves optimization speed. Mini-batch + Adam make training efficient.
Initializing weights at zero prevents symmetry breaking and kills learning.
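A minimal sketch of input normalization, assuming the examples-as-columns layout used above (the synthetic data and feature scales are made up):

```python
import numpy as np

# Zero-mean, unit-variance normalization per feature, using training-set statistics.
X_train = np.random.randn(3, 500) * np.array([[10.0], [0.1], [3.0]]) + 5.0  # very different scales
mu = X_train.mean(axis=1, keepdims=True)
sigma = X_train.std(axis=1, keepdims=True) + 1e-8    # small epsilon avoids division by zero
X_norm = (X_train - mu) / sigma                      # reuse mu and sigma on test data
print(X_norm.mean(axis=1).round(3), X_norm.std(axis=1).round(3))  # ~[0 0 0] and ~[1 1 1]
```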
Question 10
Which of the following are true about Adam?
❌ Adam automatically tunes α
❌ Adam can only be used with batch GD
✅ Adam combines advantages of RMSProp and momentum
❌ ε is most important hyperparameter to tune
Explanation:
Adam = Adaptive Moment Estimation combines momentum and RMSProp.
It doesn’t auto-tune α; α must be chosen. ε is for numerical stability (not major tuning target).
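A minimal sketch of a single Adam step, combining a momentum-style first moment with an RMSProp-style second moment plus bias correction; the hyperparameter values are the commonly used defaults, and the toy quadratic cost is an assumption for illustration:

```python
import numpy as np

def adam_step(w, grad, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    v = beta1 * v + (1 - beta1) * grad            # momentum-style first moment
    s = beta2 * s + (1 - beta2) * grad ** 2       # RMSProp-style second moment
    v_hat = v / (1 - beta1 ** t)                  # bias correction
    s_hat = s / (1 - beta2 ** t)
    w = w - alpha * v_hat / (np.sqrt(s_hat) + eps)
    return w, v, s

w = np.array([0.5, -0.3])
v, s = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 4):                             # three illustrative steps
    grad = 2 * w                                  # gradient of the toy cost |w|^2
    w, v, s = adam_step(w, grad, v, s, t)
print(w)  # both components move toward 0 at roughly the rate alpha, regardless of gradient scale
```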
🧾 Summary Table
| Q# | ✅ Correct Answer | Key Concept |
|---|---|---|
| 1 | a[3]{8}(7) | Activation notation (layer, minibatch, example) |
| 2 | Mini-batch faster; equal size = batch GD | Mini-batch GD properties |
| 3 | True | Choose 1 < batch size < m for efficiency |
| 4 | Acceptable only for mini-batch GD | Cost oscillation behavior |
| 5 | v₂=7.5, v₂_corrected=10 | Bias correction in exponential averaging |
| 6 | Smaller steps near minimum | Learning rate decay intuition |
| 7 | β₁ > β₂ | Smoother curve = larger β |
| 8 | (1)=GD, (2)=small β, (3)=large β | Effect of momentum in GD |
| 9 | Normalize, Mini-batch, Adam | Ways to speed up optimization |
| 10 | Adam = RMSProp + Momentum | Adam combines both benefits |