Graded Quiz: Introduction to Reinforcement Learning with Keras | Deep Learning with Keras and TensorFlow (IBM AI Engineering Professional Certificate) Answers 2025
1. Question 1
Primary objective of Q-learning:
- ✅ To learn a policy that maximizes the cumulative reward over time
- ❌ Minimize immediate reward
- ❌ Ignore future rewards
- ❌ Clustering
Explanation:
Q-learning is all about maximizing long-term cumulative reward.
2. Question 2
Role of Q-value function:
- ❌ Terminal state probability
- ✅ Expected utility of taking action a in state s
- ❌ Record sequence of actions
- ❌ Count steps
Explanation:
Q(s, a) predicts how good an action is in a given state.
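As a quick illustration (the states, actions, and values below are hypothetical, not from the course), a tabular Q-function is just a lookup from (state, action) pairs to expected utility, and the greedy policy picks the action with the highest Q(s, a):

```python
# Hypothetical tiny Q-table: Q[(state, action)] -> expected utility
# of taking `action` in `state`, i.e. Q(s, a).
Q = {
    ("s0", "left"): 0.2,
    ("s0", "right"): 0.8,
    ("s1", "left"): 0.5,
    ("s1", "right"): 0.1,
}

def best_action(state, actions=("left", "right")):
    """Greedy choice: the action with the highest Q(s, a)."""
    return max(actions, key=lambda a: Q[(state, a)])

best_action("s0")  # "right", since Q(s0, right) = 0.8 > Q(s0, left) = 0.2
```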
3. Question 3
Exploration rate (ε) refers to:
- ❌ Speed of Q update
- ❌ Reward discount
- ❌ Reset frequency
- ✅ Probability of selecting a random action
Explanation:
ε controls exploration vs exploitation.
4. Question 4
Why use a neural network for Q-values in large spaces?
- ❌ Simplify action choice
- ❌ No reward needed
- ❌ Increase computation
- ✅ Q-tables become impractical to store in large state spaces, so a neural network approximates Q-values
Explanation:
Deep Q-learning approximates Q(s, a) when table-based methods fail.
5. Question 5
Balancing exploration and exploitation:
- ❌ K-means
- ❌ Backpropagation
- ❌ SGD
- ✅ Epsilon-greedy policy
Explanation:
ε-greedy selects random actions with probability ε.
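A minimal sketch of the ε-greedy rule (function name and signature are my own, not from the course): explore with probability ε, otherwise exploit the current Q-estimates.

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick a random action index (explore);
    otherwise pick the argmax of the Q-values (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])

epsilon_greedy([0.1, 0.9, 0.3], 0.0)  # epsilon = 0 -> always greedy -> index 1
```

Setting ε = 0 gives a purely greedy policy; ε = 1 gives purely random actions. In practice ε is often decayed from a high value toward a small floor during training.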
6. Question 6
Key DQN innovation:
- ❌ Single network
- ❌ Immediate rewards only
- ❌ Continuous action
- ✅ Experience replay + target networks
Explanation:
Replay breaks correlations; target network stabilizes training.
7. Question 7
Purpose of replay buffer:
- ❌ Store Q-values
- ✅ Store experiences for random sampling
- ❌ Reset environment
- ❌ Reduce learning rate
Explanation:
Random sampling avoids correlated updates.
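A minimal replay buffer can be sketched with a bounded deque (class and method names here are illustrative, not from the course material):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done)
    tuples; uniform random sampling breaks temporal correlations."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # old items drop off automatically

    def add(self, experience):
        self.buffer.append(experience)

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

buf = ReplayBuffer(capacity=100)
for t in range(5):
    buf.add((t, 0, 1.0, t + 1, False))  # toy transitions
batch = buf.sample(3)  # 3 transitions drawn uniformly at random
```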
8. Question 8
Target network updates:
- ❌ More frequent
- ❌ Never updated
- ❌ Same frequency
- ✅ Less frequently than the primary network
Explanation:
Updating the target network less often keeps the training targets stable while the primary network changes every step.
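In Keras this is typically done by copying weights periodically with `target_model.set_weights(model.get_weights())`. A dependency-free sketch of the same idea, with weights stood in by plain lists (function name and `sync_every` default are my own):

```python
def maybe_sync(primary_weights, target_weights, step, sync_every=1000):
    """Copy the primary network's weights into the target network only
    every `sync_every` steps; in between, the target stays frozen."""
    if step % sync_every == 0:
        target_weights[:] = primary_weights  # in-place copy
    return target_weights
```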
9. Question 9
Role of Bellman equation:
- ❌ Calculate immediate reward
- ❌ Initialize weights
- ✅ Update Q-values using immediate + discounted future reward
- ❌ Determine action count
Explanation:
Bellman equation defines Q-value updates.
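The tabular form of that update can be sketched in a few lines (helper name and hyperparameter defaults are illustrative):

```python
def q_update(q, s, a, reward, max_next_q, alpha=0.1, gamma=0.99):
    """Bellman update for tabular Q-learning:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))"""
    old = q.get((s, a), 0.0)
    target = reward + gamma * max_next_q  # immediate + discounted future reward
    q[(s, a)] = old + alpha * (target - old)

q = {}
q_update(q, "s0", "a0", reward=1.0, max_next_q=0.0, alpha=0.5, gamma=0.9)
# q[("s0", "a0")] is now 0.5: halfway from 0.0 toward the target 1.0
```

In a DQN, the same target `r + γ · max Q(s′, a′)` is computed with the target network and used as the regression label for the primary network.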
10. Question 10
Significance of discount factor γ:
- ✅ Importance of future rewards
- ❌ Learning rate
- ❌ Normalize Q-values
- ❌ Exploration rate
Explanation:
γ ∈ [0,1] controls how much future rewards matter.
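The effect of γ is easy to see on a discounted return (function name is my own):

```python
def discounted_return(rewards, gamma):
    """G = r_0 + gamma*r_1 + gamma^2*r_2 + ...
    gamma near 0 is myopic; gamma near 1 weighs distant rewards
    almost as heavily as immediate ones."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

discounted_return([1, 1, 1], 0.0)  # 1.0 -> only the immediate reward counts
discounted_return([1, 1, 1], 1.0)  # 3.0 -> all rewards count equally
```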
🧾 Summary Table
| Q# | Correct Answer |
|---|---|
| 1 | Maximize cumulative reward |
| 2 | Expected utility Q(s,a) |
| 3 | Probability of random action (ε) |
| 4 | Replace impractical Q-table |
| 5 | Epsilon-greedy |
| 6 | Replay buffer + target networks |
| 7 | Random sampling of experiences |
| 8 | Update less frequently |
| 9 | Update Q-values (Bellman equation) |
| 10 | Weight future rewards (γ) |