
Optimization Algorithms: Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization (Deep Learning Specialization) Answers 2025

Question 1

Which notation would you use to denote the 3rd layer’s activations when the input is the 7th example from the 8th minibatch?

a[3]{8}(7)
❌ a[8]{7}(3)
❌ a[3]{7}(8)
❌ a[8]{3}(7)

Explanation:
Notation format:

  • [l] → layer number

  • {k} → minibatch index

  • (i) → training example index
    Hence, the activations for layer 3, example 7 of minibatch 8, are written a[3]{8}(7).


Question 2

Which statements about mini-batch gradient descent do you agree with?

One iteration of mini-batch GD (computing on a single mini-batch) is faster than one iteration of batch GD.
When mini-batch size = training size, it becomes batch GD.
❌ You should implement it without a loop so all mini-batches are processed at once.

Explanation:
Mini-batch GD computes each update on a small subset of the data, so a single iteration is much cheaper than a full-batch iteration.
If batch size = m (all examples), it is equivalent to batch GD.
You must iterate through the mini-batches in a loop; they cannot all be processed at once (see the sketch below).
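As a minimal illustration, here is one epoch of mini-batch GD on a toy linear model with a squared-error cost (the model is chosen only to keep the sketch self-contained); the point is that mini-batches are processed one at a time in a loop:

```python
import numpy as np

def train_one_epoch(X, Y, W, b, mini_batch_size=64, alpha=0.01):
    """One epoch of mini-batch gradient descent on a toy linear model:
    mini-batches are processed sequentially, one parameter update each."""
    m = X.shape[1]                                # number of examples (columns)
    for t in range(0, m, mini_batch_size):
        X_t = X[:, t:t + mini_batch_size]         # current mini-batch of inputs
        Y_t = Y[:, t:t + mini_batch_size]
        m_t = X_t.shape[1]
        A = W @ X_t + b                           # forward pass
        dZ = A - Y_t                              # gradient of the squared-error cost
        dW = (1.0 / m_t) * dZ @ X_t.T             # gradients from this mini-batch only
        db = (1.0 / m_t) * np.sum(dZ, axis=1, keepdims=True)
        W, b = W - alpha * dW, b - alpha * db     # one update per mini-batch
    return W, b
```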


Question 3

We usually choose a mini-batch size greater than 1 and less than m because it balances vectorization efficiency and performance.

True
❌ False

Explanation:
Mini-batch GD (1 < batch < m) allows vectorized computation and faster convergence without the slow full-batch computation.
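A sketch of how the training set might be shuffled and partitioned into mini-batches of size 64 (a typical power-of-two choice); the function name is illustrative, not any particular library's API:

```python
import numpy as np

def make_mini_batches(X, Y, mini_batch_size=64, seed=0):
    """Shuffle the m examples, then split them into mini-batches with
    1 < mini_batch_size < m; the last mini-batch may be smaller."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    perm = rng.permutation(m)                     # random shuffle of example indices
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]
    return [(X_shuf[:, t:t + mini_batch_size],
             Y_shuf[:, t:t + mini_batch_size])
            for t in range(0, m, mini_batch_size)]
```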


Question 4

While using mini-batch gradient descent, the cost function J shows some oscillation — which statement is correct?

If you are using mini-batch GD, this looks acceptable. But if using batch GD, something is wrong.
❌ If you’re using either, it’s fine.
❌ No matter what, something is wrong.
❌ If you’re using batch GD, it’s acceptable.

Explanation:
Mini-batch updates use partial data → cost curve fluctuates.
Batch GD uses all data → smooth cost curve. Oscillation is expected only in mini-batch GD.


Question 5

Temperature example: β=0.5, θ₁=10°C, θ₂=10°C, v₀=0
Compute v₂ (uncorrected) and v₂_corrected.

v₂ = 7.5, v₂_corrected = 10
❌ v₂ = 10, v₂_corrected = 7.5
❌ v₂ = 7.5, v₂_corrected = 7.5
❌ v₂ = 10, v₂_corrected = 10

Explanation:
v₁ = 0.5×0 + 0.5×10 = 5
v₂ = 0.5×5 + 0.5×10 = 7.5
Bias correction: v₂_corrected = v₂ / (1−β²) = 7.5 / 0.75 = 10.
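The same arithmetic in a few lines of Python, just reproducing the numbers above:

```python
beta, v = 0.5, 0.0
for t, theta in enumerate([10.0, 10.0], start=1):   # theta_1 = theta_2 = 10 °C
    v = beta * v + (1 - beta) * theta                # exponentially weighted average
    v_corrected = v / (1 - beta ** t)                # bias correction
    print(t, v, v_corrected)                         # t=1: 5.0 10.0   t=2: 7.5 10.0
```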


Question 6

Which statement is true about learning rate decay?

For later epochs, the parameters are closer to the minimum, so smaller steps prevent oscillations.
❌ It reduces model variance.
❌ It increases step size later.
❌ It increases steps each iteration.

Explanation:
As we near the minimum, smaller learning rates prevent overshooting. Learning rate decay reduces α gradually over epochs.
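One common decay schedule (the 1 / (1 + decay_rate · epoch) form discussed in the course) looks like this; the numbers in the comment are just an example with α₀ = 0.2 and decay_rate = 1:

```python
def decayed_learning_rate(alpha0, decay_rate, epoch_num):
    """Learning rate decay: alpha shrinks as the epoch number grows,
    so later updates take smaller steps near the minimum."""
    return alpha0 / (1 + decay_rate * epoch_num)

# alpha0 = 0.2, decay_rate = 1:
# epoch 0 -> 0.200, epoch 1 -> 0.100, epoch 2 -> 0.067, epoch 3 -> 0.050
```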


Question 7

You compute two exponentially weighted averages of the same data, one with β₁ and one with β₂; the β₁ curve is the smoother (less responsive) one. Which is true?

β₁ > β₂
❌ β₁ = β₂
❌ β₁ < β₂
❌ β₁ = 0, β₂ > 0

Explanation:
Larger β (e.g., 0.9) → smoother curve (more memory).
Smaller β (e.g., 0.5) → more responsive, noisier curve.
Thus, the smoother (less responsive) line corresponds to larger β.
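A small sketch comparing the two averages on made-up noisy temperature data (the data and values of β here are assumptions, only to show the smoothness difference):

```python
import numpy as np

rng = np.random.default_rng(0)
temps = 20 + rng.normal(0, 3, size=100)   # made-up noisy daily temperatures

def ewa(data, beta):
    """Exponentially weighted average: larger beta remembers more history."""
    v, out = 0.0, []
    for theta in data:
        v = beta * v + (1 - beta) * theta
        out.append(v)
    return np.array(out)

smooth = ewa(temps, beta=0.9)   # smoother curve, lags behind the data more
noisy  = ewa(temps, beta=0.5)   # more responsive curve, fluctuates more
```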


Question 8

Three optimization trajectories, labeled (1), (2), and (3), are plotted for gradient descent with and without momentum. Which matching of curves to algorithms is correct?

(1) is GD, (2) is momentum with small β, (3) is momentum with large β.
❌ (1) small β, (2) GD, (3) large β
❌ (1) small β, (2) small β, (3) GD
❌ (1) GD, (2) large β, (3) small β

Explanation:

  • GD: slow zigzag path

  • Momentum (small β): smoother but still zigzag

  • Momentum (large β): faster convergence with a smoother trajectory (see the momentum sketch after this list).
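A minimal sketch of the momentum update for a single parameter array w with gradient dw (β = 0.9 is a common default):

```python
def momentum_step(w, dw, v, alpha=0.01, beta=0.9):
    """Gradient descent with momentum: v is an exponentially weighted
    average of past gradients; larger beta gives a smoother, less
    zigzagging trajectory."""
    v = beta * v + (1 - beta) * dw    # update the velocity
    w = w - alpha * v                 # step along the smoothed gradient
    return w, v
```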


Question 9

Batch GD is slow — which methods can help minimize J faster? (Check all that apply.)

Normalize the input data
Try mini-batch GD
Try using Adam
❌ Try initializing weights at zero

Explanation:
Normalization improves optimization speed. Mini-batch + Adam make training efficient.
Initializing weights at zero prevents symmetry breaking and kills learning.
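A sketch of input normalization (zero mean, unit variance per feature); μ and σ computed on the training set should be reused at test time:

```python
import numpy as np

def normalize_inputs(X):
    """Zero-center each feature and scale it to unit variance so the cost
    surface is better conditioned and gradient descent converges faster."""
    mu = np.mean(X, axis=1, keepdims=True)
    sigma = np.std(X, axis=1, keepdims=True) + 1e-8   # avoid division by zero
    return (X - mu) / sigma, mu, sigma
```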


Question 10

Which of the following are true about Adam?

❌ Adam automatically tunes α
❌ Adam can only be used with batch GD
Adam combines advantages of RMSProp and momentum
❌ ε is the most important hyperparameter to tune

Explanation:
Adam = Adaptive Moment Estimation combines momentum and RMSProp.
It does not auto-tune α; the learning rate must still be chosen. ε exists only for numerical stability and is not a major tuning target.
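A sketch of one Adam update for a single parameter array, using the standard formulas; the defaults β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸ are the usual recommendations, while α still has to be chosen:

```python
import numpy as np

def adam_step(w, dw, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam = momentum (first moment v) + RMSProp (second moment s),
    both bias-corrected by the step counter t (t starts at 1)."""
    v = beta1 * v + (1 - beta1) * dw               # momentum-style average of gradients
    s = beta2 * s + (1 - beta2) * (dw ** 2)        # RMSProp-style average of squared gradients
    v_hat = v / (1 - beta1 ** t)                   # bias correction
    s_hat = s / (1 - beta2 ** t)
    w = w - alpha * v_hat / (np.sqrt(s_hat) + eps)  # eps only for numerical stability
    return w, v, s
```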


🧾 Summary Table

| Q# | ✅ Correct Answer | Key Concept |
|----|-------------------|-------------|
| 1 | a[3]{8}(7) | Activation notation (layer, minibatch, example) |
| 2 | One mini-batch iteration is faster; batch size = m → batch GD | Mini-batch GD properties |
| 3 | True | Choose 1 < batch size < m for efficiency |
| 4 | Acceptable only for mini-batch GD | Cost oscillation behavior |
| 5 | v₂ = 7.5, v₂_corrected = 10 | Bias correction in exponential averaging |
| 6 | Smaller steps near the minimum | Learning rate decay intuition |
| 7 | β₁ > β₂ | Smoother curve = larger β |
| 8 | (1) = GD, (2) = small β, (3) = large β | Effect of momentum in GD |
| 9 | Normalize, mini-batch, Adam | Ways to speed up optimization |
| 10 | Adam = RMSProp + momentum | Adam combines both benefits |