Hyperparameter Tuning, Batch Normalization, Programming Frameworks | Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization (Deep Learning Specialization) Answers 2025
Question 1
Which of the following are true about hyperparameter search?
✅ Choosing random values for hyperparameters is convenient since we might not know which are most important.
❌ When using random values they must always be uniformly distributed.
❌ Choosing grid values is better when the number of hyperparameters is high.
❌ When sampling from a grid, the number of values per hyperparameter is larger than when using random values.
Explanation:
Random search is efficient when we don’t know which hyperparameters matter most.
Uniform sampling isn’t always ideal; a log scale is often used for parameters like the learning rate.
Grid search becomes inefficient when there are many hyperparameters, and with the same trial budget it explores fewer distinct values of each hyperparameter than random sampling does (see the sketch below).
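To make the last point concrete, here is a minimal sketch (NumPy only; the two hyperparameters, their ranges, and the trial budget are made up for illustration):

```python
import numpy as np

np.random.seed(0)
budget = 25  # same trial budget for both strategies

# Grid search: a 5 x 5 grid over two hyperparameters
lr_grid = np.linspace(0.001, 0.1, 5)
bs_grid = [32, 64, 128, 256, 512]
grid_trials = [(lr, bs) for lr in lr_grid for bs in bs_grid]

# Random search: 25 independent draws
rand_trials = [(10 ** (-3 * np.random.rand()),   # roughly log-uniform learning rate
                int(np.random.choice(bs_grid)))  # random mini-batch size
               for _ in range(budget)]

print(len({lr for lr, _ in grid_trials}))  # 5 distinct learning rates tried
print(len({lr for lr, _ in rand_trials}))  # 25 distinct learning rates tried
```

With the grid, 20 of the 25 trials reuse a learning rate that has already been tried; random sampling spends every trial on a new value of each hyperparameter.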
Question 2
In a project with limited computational resources, which three hyperparameters would you choose to tune?
❌ ε in Adam
✅ α (learning rate)
✅ mini-batch size
✅ β (momentum parameter)
❌ β₁, β₂ in Adam
Explanation:
The most sensitive hyperparameters are learning rate, mini-batch size, and momentum β.
Adam’s β₁, β₂ and ε usually work well with default values.
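For context, a minimal tf.keras sketch of this split (assuming TensorFlow is installed; the numeric values are placeholders, not recommendations):

```python
import tensorflow as tf

# Hyperparameters usually worth searching first (placeholder values)
learning_rate = 3e-4   # alpha
batch_size = 128       # mini-batch size, later passed to model.fit(..., batch_size=batch_size)
beta = 0.9             # momentum

sgd = tf.keras.optimizers.SGD(learning_rate=learning_rate, momentum=beta)

# Adam's beta_1, beta_2 and epsilon are normally left at their tf.keras defaults
adam = tf.keras.optimizers.Adam(learning_rate=learning_rate)  # beta_1=0.9, beta_2=0.999, epsilon=1e-7
```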
Question 3
Even if enough computational power is available for tuning, it is always better to “babysit” one model (“Panda strategy”). True/False?
✅ False
❌ True
Explanation:
With enough compute it is usually better to train many models in parallel (the “Caviar” strategy) than to babysit a single model (the “Panda” strategy); babysitting is mainly a fallback when resources are scarce.
Question 4
Knowing α ∈ [0.00001, 1.0], which is the recommended way to sample α?
❌ `r = -4*np.random.rand(); α = 10**r`
❌ `r = np.random.rand(); α = 0.00001 + 0.99999*r`
✅ `r = -5*np.random.rand(); α = 10**r`
❌ `r = np.random.rand(); α = 10**r`
Explanation:
The learning rate is best sampled on a log scale because its range spans several orders of magnitude. With α ∈ [10⁻⁵, 1], sample r uniformly from [−5, 0] and set α = 10^r, so every decade of α gets equal probability (see the sketch below); `r = -4*np.random.rand()` would never produce values below 10⁻⁴.
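A minimal NumPy sketch of this sampling scheme (the sample count is arbitrary):

```python
import numpy as np

np.random.seed(1)
n = 10_000

r = -5 * np.random.rand(n)  # r is uniform in (-5, 0]
alpha = 10 ** r             # alpha is log-uniform over [1e-5, 1.0]

# Roughly equal probability mass in each decade:
print(np.mean((alpha >= 1e-5) & (alpha < 1e-4)))  # ~0.2
print(np.mean((alpha >= 1e-1) & (alpha < 1e0)))   # ~0.2
```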
Question 5
Finding good hyperparameters is time-consuming, so you should do it once at the start and never again. True/False?
✅ False
❌ True
Explanation:
Hyperparameters often need retuning when the data, the model architecture, or the problem itself changes.
Question 6
When using batch normalization, it’s OK to drop W[l] from forward propagation. True/False?
✅ False
❌ True
Explanation:
Batch Norm normalizes Z[l] only after it has been computed from Z[l] = W[l]A[l-1] + b[l], so W[l] is still needed and can’t be omitted.
It is the bias b[l] that becomes redundant: subtracting the batch mean cancels any constant offset, and β[l] takes over its role (see the sketch below).
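A minimal NumPy sketch of one batch-norm layer’s forward pass, showing that W[l] is still used while b[l] has no effect once the batch mean is subtracted (the layer sizes and values are made up):

```python
import numpy as np

np.random.seed(2)
A_prev = np.random.randn(4, 32)   # activations from layer l-1: shape (n[l-1], m)
W = 0.1 * np.random.randn(3, 4)   # W[l] is still required
b = np.random.randn(3, 1)         # b[l] is redundant under batch norm
gamma, beta, eps = np.ones((3, 1)), np.zeros((3, 1)), 1e-8

def bn_forward(Z):
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    Z_norm = (Z - mu) / np.sqrt(var + eps)
    return gamma * Z_norm + beta

Z_with_b = np.dot(W, A_prev) + b   # the batch mean absorbs b ...
Z_without_b = np.dot(W, A_prev)    # ... so dropping b changes nothing
print(np.allclose(bn_forward(Z_with_b), bn_forward(Z_without_b)))  # True
```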
Question 7
When using normalization, if σ is very small, normalization may fail due to division by zero. True/False?
✅ True
❌ False
Explanation:
When σ² ≈ 0 (for example, a feature that is nearly constant over the mini-batch), dividing by √σ² is a division by (almost) zero and becomes numerically unstable.
That’s why a small ε (epsilon) is added inside the square root, z_norm = (z − μ) / √(σ² + ε), to keep the denominator away from zero (see the sketch below).
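A small NumPy sketch of the failure mode and the fix (the numbers are arbitrary):

```python
import numpy as np

z = np.array([0.5, 0.5, 0.5, 0.5])  # a feature that is constant over the mini-batch
mu, var = z.mean(), z.var()         # var == 0.0
eps = 1e-8

unsafe = (z - mu) / np.sqrt(var)        # 0/0 -> nan (and a runtime warning)
safe = (z - mu) / np.sqrt(var + eps)    # well-defined: all zeros

print(unsafe)  # [nan nan nan nan]
print(safe)    # [0. 0. 0. 0.]
```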
Question 8
Which of the following are true about batch normalization?
❌ β[l] and γ[l] are hyperparameters tuned by random sampling.
✅ γ[l] and β[l] set the variance and mean of ẑ[l].
❌ z_norm = (z − μ) / σ² (wrong formula: the denominator should be √(σ² + ε), not σ²)
✅ When using batch norm, γ[l] and β[l] are learned (trainable) parameters.
Explanation:
γ[l] and β[l] control scaling and shifting after normalization and are learned by gradient descent, not manually tuned.
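A rough NumPy sketch of γ[l] and β[l] as trainable parameters: they are applied after normalization and updated by gradient descent like any other weight (the upstream gradient here is a random placeholder; the dγ and dβ expressions are the standard batch-norm gradients):

```python
import numpy as np

np.random.seed(3)
Z = np.random.randn(3, 16)   # pre-activations of one layer for one mini-batch
gamma = np.ones((3, 1))      # learned scale, initialized to 1
beta = np.zeros((3, 1))      # learned shift, initialized to 0
eps, lr = 1e-8, 0.1

# Forward: normalize, then scale and shift
mu = Z.mean(axis=1, keepdims=True)
var = Z.var(axis=1, keepdims=True)
Z_norm = (Z - mu) / np.sqrt(var + eps)
Z_tilde = gamma * Z_norm + beta   # per-unit mean beta, standard deviation roughly gamma

# Backward (given a gradient from the rest of the network) and one update step
dZ_tilde = np.random.randn(*Z_tilde.shape)  # placeholder upstream gradient
dgamma = np.sum(dZ_tilde * Z_norm, axis=1, keepdims=True)
dbeta = np.sum(dZ_tilde, axis=1, keepdims=True)
gamma -= lr * dgamma
beta -= lr * dbeta
```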
Question 9
At test time, we turn off Batch Norm to avoid random predictions. True/False?
✅ False
❌ True
Explanation:
At test time, Batch Norm uses running averages (mean & variance) computed during training — it’s not “turned off”.
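A small sketch of that bookkeeping: keep exponentially weighted averages of μ and σ² during training and reuse them, frozen, at test time (the momentum value 0.9 is a typical choice, not prescribed by the course):

```python
import numpy as np

np.random.seed(4)
running_mu, running_var = 0.0, 1.0
momentum, eps = 0.9, 1e-8

# Training: update the running statistics from each mini-batch
for _ in range(100):
    z_batch = 2.0 + 0.5 * np.random.randn(64)  # toy mini-batch of pre-activations
    running_mu = momentum * running_mu + (1 - momentum) * z_batch.mean()
    running_var = momentum * running_var + (1 - momentum) * z_batch.var()

# Test time: normalize even a single example with the stored statistics
z_test = np.array([2.3])
print((z_test - running_mu) / np.sqrt(running_var + eps))
```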
Question 10
Which statements about deep learning programming frameworks are true?
✅ They allow you to code deep learning algorithms with fewer lines of code.
✅ Good governance helps keep open-source frameworks fair and open long-term.
❌ They require cloud-based machines to run.
Explanation:
Frameworks (like TensorFlow, PyTorch, Keras) simplify coding.
They run locally or on cloud — not limited to cloud systems.
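To illustrate the “fewer lines of code” point, a small tf.keras model trains end to end in a handful of lines on a laptop CPU, no cloud machine involved (a generic sketch with toy data, not code from the course):

```python
import numpy as np
import tensorflow as tf

# Toy binary-classification data
X = np.random.randn(256, 20).astype("float32")
y = (X[:, 0] > 0).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.BatchNormalization(),  # gamma/beta handled as trainable parameters
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, batch_size=32, epochs=3, verbose=0)  # runs locally; no cloud required
```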
🧾 Summary Table
| Q# | ✅ Correct Answer | Key Concept |
|---|---|---|
| 1 | Random sampling is convenient; a grid does not explore more values per hyperparameter | Random search efficiency |
| 2 | α, mini-batch size, β (momentum) | Key tunable hyperparameters |
| 3 | False | Better to run multiple models (not babysit one) |
| 4 | `r = -5*np.random.rand(); α = 10**r` | Log-scale sampling for learning rate |
| 5 | False | Hyperparameters must be re-tuned as project evolves |
| 6 | False | W[l] can’t be dropped in batch norm |
| 7 | True | Small σ may cause division instability |
| 8 | γ, β learned; control variance/mean | Batch norm introduces trainable params |
| 9 | False | Batch norm stays active with stored running averages |
| 10 | Frameworks simplify DL, governance matters | Frameworks ease dev; cloud not required |