Optimization Algorithms: Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization (Deep Learning Specialization) Answers, 2025
Question 1
Which notation would you use to denote the 3rd layer’s activations when the input is the 7th example from the 8th minibatch?
✅ a[3]{8}(7)
❌ a[8]{7}(3)
❌ a[3]{7}(8)
❌ a[8]{3}(7)
Explanation:
Notation format:
- [l] → layer number
- {k} → mini-batch index
- (i) → training example index
Hence, the activations for layer 3, 7th example, 8th minibatch → a[3]{8}(7).
Question 2
Which statements about mini-batch gradient descent do you agree with?
✅ Training one epoch using mini-batch GD is faster than using batch GD.
✅ When mini-batch size = training size, it becomes batch GD.
❌ You should implement it without a loop so all mini-batches are processed at once.
Explanation:
Mini-batch GD computes each update on a small subset of the data, so one pass through the training set yields many cheap parameter updates.
When the mini-batch size equals m (all training examples), it is exactly batch GD.
You still have to loop over the mini-batches; they cannot all be processed in a single step (the sketch below shows this explicit loop).
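For reference, here is a minimal NumPy sketch of one epoch of mini-batch gradient descent on a toy linear-regression problem; the data, the examples-as-columns layout, and the hyperparameter values are illustrative assumptions, not taken from the quiz.

```python
import numpy as np

# Minimal sketch: one epoch of mini-batch gradient descent on a toy
# linear-regression problem, with training examples stored as columns.
rng = np.random.default_rng(0)
X = rng.normal(size=(2, 1000))               # 2 features, m = 1000 examples
Y = 3.0 * X[0:1] - 2.0 * X[1:2] + 1.0        # toy targets: w = [3, -2], b = 1
w, b = np.zeros((1, 2)), 0.0
alpha, batch_size = 0.1, 64

# One epoch = one explicit loop over all the mini-batches.
for start in range(0, X.shape[1], batch_size):
    Xb = X[:, start:start + batch_size]
    Yb = Y[:, start:start + batch_size]
    m_b = Xb.shape[1]
    Y_hat = w @ Xb + b                       # forward pass on this mini-batch only
    dZ = Y_hat - Yb
    dw = (dZ @ Xb.T) / m_b                   # gradients from this mini-batch
    db = dZ.mean()
    w -= alpha * dw                          # one parameter update per mini-batch
    b -= alpha * db

print("after one epoch:", w.round(2), round(b, 2))  # already heading toward [3, -2] and 1
```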
Question 3
We usually choose a mini-batch size greater than 1 and less than m because it balances vectorization efficiency and performance.
✅ True
❌ False
Explanation:
Mini-batch GD (1 < batch size < m) keeps the speed of vectorized computation within each mini-batch while making faster progress than waiting for a full pass over the training set before every update.
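As a rough sketch of how such mini-batches are typically built, here is one way to shuffle and partition a training set with examples stored as columns (the helper name and batch size are illustrative, not the course's implementation):

```python
import numpy as np

def random_mini_batches(X, Y, batch_size=64, seed=0):
    """Shuffle the m examples (columns) and split them into mini-batches."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    perm = rng.permutation(m)                      # reshuffle each epoch in practice
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]
    batches = []
    for start in range(0, m, batch_size):          # last mini-batch may be smaller
        batches.append((X_shuf[:, start:start + batch_size],
                        Y_shuf[:, start:start + batch_size]))
    return batches

X = np.random.randn(5, 1000)                       # 5 features, m = 1000 examples
Y = np.random.randn(1, 1000)
print(len(random_mini_batches(X, Y)))              # 16 mini-batches: 15 of size 64, 1 of size 40
```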
Question 4
While using mini-batch gradient descent, the cost function J shows some oscillation — which statement is correct?
✅ If you are using mini-batch GD, this looks acceptable. But if using batch GD, something is wrong.
❌ If you’re using either, it’s fine.
❌ No matter what, something is wrong.
❌ If you’re using batch GD, it’s acceptable.
Explanation:
Mini-batch updates use only part of the data, so the cost measured on successive mini-batches fluctuates.
Batch GD computes the cost on the full training set, so it should decrease on every iteration; oscillation there signals a problem, such as a learning rate that is too large. Oscillation is therefore expected only with mini-batch GD.
Question 5
Temperature example: β=0.5, θ₁=10°C, θ₂=10°C, v₀=0
Compute v₂ (uncorrected) and v₂_corrected.
✅ v₂ = 7.5, v₂_corrected = 10
❌ v₂ = 10, v₂_corrected = 7.5
❌ v₂ = 7.5, v₂_corrected = 7.5
❌ v₂ = 10, v₂_corrected = 10
Explanation:
Using vₜ = β·vₜ₋₁ + (1−β)·θₜ:
v₁ = 0.5×0 + 0.5×10 = 5
v₂ = 0.5×5 + 0.5×10 = 7.5
Bias correction: v₂_corrected = v₂ / (1−β²) = 7.5 / 0.75 = 10.
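The same arithmetic as a short script, using the values from the question:

```python
# v_t = beta * v_{t-1} + (1 - beta) * theta_t, with bias correction v_t / (1 - beta**t)
beta = 0.5
thetas = [10.0, 10.0]          # theta_1 = theta_2 = 10 degrees C
v = 0.0                        # v_0 = 0
for t, theta in enumerate(thetas, start=1):
    v = beta * v + (1 - beta) * theta
    v_corrected = v / (1 - beta ** t)
    print(f"t={t}: v={v}, v_corrected={v_corrected}")
# t=1: v=5.0, v_corrected=10.0
# t=2: v=7.5, v_corrected=10.0
```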
Question 6
Which statement is true about learning rate decay?
✅ For later epochs, parameters are closer to minimum, so smaller steps prevent oscillations.
❌ It reduces model variance.
❌ It increases step size later.
❌ It increases steps each iteration.
Explanation:
As we near the minimum, smaller learning rates prevent overshooting. Learning rate decay reduces α gradually over epochs.
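A minimal sketch of one common decay schedule, α = α₀ / (1 + decay_rate · epoch); the α₀ and decay_rate values below are illustrative assumptions:

```python
# Learning rate shrinks as the epoch number grows.
alpha0, decay_rate = 0.2, 1.0
for epoch in range(5):
    alpha = alpha0 / (1 + decay_rate * epoch)
    print(f"epoch {epoch}: alpha = {alpha:.3f}")
# epoch 0: alpha = 0.200
# epoch 1: alpha = 0.100
# epoch 2: alpha = 0.067
# epoch 3: alpha = 0.050
# epoch 4: alpha = 0.040
```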
Question 7
You use exponentially weighted averages with β₁ and β₂. Which is true?
✅ β₁ > β₂
❌ β₁ = β₂
❌ β₁ < β₂
❌ β₁ = 0, β₂ > 0
Explanation:
Larger β (e.g., 0.9) → smoother curve, because it averages over more past values.
Smaller β (e.g., 0.5) → more responsive but noisier curve.
The smoother (less responsive) line therefore corresponds to the larger β, which is why β₁ > β₂.
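A quick numerical illustration of that trade-off (the synthetic signal and β values are made up for this sketch, not from the quiz): the larger-β average lags behind a sudden change in the data, while the smaller-β average adapts almost immediately.

```python
import numpy as np

def ewma(xs, beta):
    """Exponentially weighted average without bias correction."""
    v, out = 0.0, []
    for x in xs:
        v = beta * v + (1 - beta) * x
        out.append(v)
    return np.array(out)

rng = np.random.default_rng(1)
signal = np.concatenate([np.full(50, 10.0), np.full(50, 20.0)])  # step change at t = 50
signal += rng.normal(0.0, 1.0, size=signal.shape)                # add noise

smooth = ewma(signal, beta=0.9)   # large beta: smooth but slow to react
quick = ewma(signal, beta=0.5)    # small beta: noisier but adapts quickly
print("shortly after the step:", round(smooth[55], 1), "vs", round(quick[55], 1))
```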
Question 8
Gradient descent and momentum comparison:
✅ (1) is GD, (2) is momentum with small β, (3) is momentum with large β.
❌ (1) small β, (2) GD, (3) large β
❌ (1) small β, (2) small β, (3) GD
❌ (1) GD, (2) large β, (3) small β
Explanation:
- (1) Gradient descent: slow, zigzagging path.
- (2) Momentum with small β: smoother, but still some zigzag.
- (3) Momentum with large β: fastest convergence with the smoothest trajectory.
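For reference, a minimal sketch of the momentum update in its common form, v = β·v + (1−β)·grad followed by w = w − α·v; the toy gradients below are made up to mimic a direction that oscillates in sign:

```python
import numpy as np

def momentum_step(w, v, grad, alpha=0.1, beta=0.9):
    """One momentum update: average the gradients, then step along the average."""
    v = beta * v + (1 - beta) * grad
    w = w - alpha * v
    return w, v

w = np.array([1.0, 1.0])
v = np.zeros_like(w)
# First component is consistent; second component flips sign (the "zigzag" direction).
grads = [np.array([0.8, -1.0]), np.array([0.7, 1.0]), np.array([0.9, -1.0])]
for g in grads:
    w, v = momentum_step(w, v, g)
print(v)  # the consistent direction accumulates; the oscillating one largely cancels
```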
Question 9
Batch GD is slow — which methods can help minimize J faster? (Check all that apply.)
✅ Normalize the input data
✅ Try mini-batch GD
✅ Try using Adam
❌ Try initializing weights at zero
Explanation:
Normalization improves optimization speed. Mini-batch + Adam make training efficient.
Initializing weights at zero prevents symmetry breaking and kills learning.
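A minimal sketch of input normalization, assuming the examples-as-columns layout used above (the synthetic data and feature scales are made up):

```python
import numpy as np

# Zero-mean, unit-variance normalization per feature, using training-set statistics.
X_train = np.random.randn(3, 500) * np.array([[10.0], [0.1], [3.0]]) + 5.0  # very different scales
mu = X_train.mean(axis=1, keepdims=True)
sigma = X_train.std(axis=1, keepdims=True) + 1e-8    # small epsilon avoids division by zero
X_norm = (X_train - mu) / sigma                      # reuse mu and sigma on test data
print(X_norm.mean(axis=1).round(3), X_norm.std(axis=1).round(3))  # ~[0 0 0] and ~[1 1 1]
```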
Question 10
Which of the following are true about Adam?
❌ Adam automatically tunes α
❌ Adam can only be used with batch GD
✅ Adam combines advantages of RMSProp and momentum
❌ ε is most important hyperparameter to tune
Explanation:
Adam = Adaptive Moment Estimation combines momentum and RMSProp.
It doesn’t auto-tune α; α must be chosen. ε is for numerical stability (not major tuning target).
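A minimal sketch of a single Adam step, combining a momentum-style first moment with an RMSProp-style second moment plus bias correction; the hyperparameter values are the commonly used defaults, and the toy quadratic cost is an assumption for illustration:

```python
import numpy as np

def adam_step(w, grad, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    v = beta1 * v + (1 - beta1) * grad            # momentum-style first moment
    s = beta2 * s + (1 - beta2) * grad ** 2       # RMSProp-style second moment
    v_hat = v / (1 - beta1 ** t)                  # bias correction
    s_hat = s / (1 - beta2 ** t)
    w = w - alpha * v_hat / (np.sqrt(s_hat) + eps)
    return w, v, s

w = np.array([0.5, -0.3])
v, s = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 4):                             # three illustrative steps
    grad = 2 * w                                  # gradient of the toy cost |w|^2
    w, v, s = adam_step(w, grad, v, s, t)
print(w)  # both components move toward 0 at roughly the rate alpha, regardless of gradient scale
```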
🧾 Summary Table
| Q# | ✅ Correct Answer | Key Concept |
|---|---|---|
| 1 | a[3]{8}(7) | Activation notation (layer, minibatch, example) |
| 2 | Mini-batch faster; equal size = batch GD | Mini-batch GD properties |
| 3 | True | Choose 1 < batch size < m for efficiency |
| 4 | Acceptable only for mini-batch GD | Cost oscillation behavior |
| 5 | v₂=7.5, v₂_corrected=10 | Bias correction in exponential averaging |
| 6 | Smaller steps near minimum | Learning rate decay intuition |
| 7 | β₁ > β₂ | Smoother curve = larger β |
| 8 | (1)=GD, (2)=small β, (3)=large β | Effect of momentum in GD |
| 9 | Normalize, Mini-batch, Adam | Ways to speed up optimization |
| 10 | Adam = RMSProp + Momentum | Adam combines both benefits |