Graded Quiz: Introduction to Reinforcement Learning with Keras | Deep Learning with Keras and TensorFlow (IBM AI Engineering Professional Certificate) Answers 2025
1. Question 1
Primary objective of Q-learning:
- ✅ To learn a policy that maximizes the cumulative reward over time
- ❌ Minimize immediate reward
- ❌ Ignore future rewards
- ❌ Clustering
Explanation:
Q-learning is all about maximizing long-term cumulative reward.
2. Question 2
Role of Q-value function:
- ❌ Terminal state probability
- ✅ Expected utility of taking action a in state s
- ❌ Record sequence of actions
- ❌ Count steps
Explanation:
Q(s, a) predicts how good an action is in a given state.
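As a quick illustration (the states, actions, and values below are hypothetical, not from the course), a tabular Q-function is just a lookup from (state, action) pairs to expected utility, and the greedy policy picks the action with the highest Q(s, a):

```python
# Hypothetical tiny Q-table: Q[(state, action)] -> expected utility
# of taking `action` in `state`, i.e. Q(s, a).
Q = {
    ("s0", "left"): 0.2,
    ("s0", "right"): 0.8,
    ("s1", "left"): 0.5,
    ("s1", "right"): 0.1,
}

def best_action(state, actions=("left", "right")):
    """Greedy choice: the action with the highest Q(s, a)."""
    return max(actions, key=lambda a: Q[(state, a)])

best_action("s0")  # "right", since Q(s0, right) = 0.8 > Q(s0, left) = 0.2
```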
3. Question 3
Exploration rate (ε) refers to:
- ❌ Speed of Q update
- ❌ Reward discount
- ❌ Reset frequency
- ✅ Probability of selecting a random action
Explanation:
ε controls exploration vs exploitation.
4. Question 4
Why use a neural network for Q-values in large spaces?
- ❌ Simplify action choice
- ❌ No reward needed
- ❌ Increase computation
- ✅ Q-tables become impractical to store in large state spaces, so a neural network approximates Q-values
Explanation:
Deep Q-learning approximates Q(s, a) when table-based methods fail.
5. Question 5
Balancing exploration and exploitation:
- ❌ K-means
- ❌ Backpropagation
- ❌ SGD
- ✅ Epsilon-greedy policy
Explanation:
ε-greedy selects random actions with probability ε.
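A minimal sketch of the ε-greedy rule (function name and signature are my own, not from the course): explore with probability ε, otherwise exploit the current Q-estimates.

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick a random action index (explore);
    otherwise pick the argmax of the Q-values (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])

epsilon_greedy([0.1, 0.9, 0.3], 0.0)  # epsilon = 0 -> always greedy -> index 1
```

Setting ε = 0 gives a purely greedy policy; ε = 1 gives purely random actions. In practice ε is often decayed from a high value toward a small floor during training.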
6. Question 6
Key DQN innovation:
- ❌ Single network
- ❌ Immediate rewards only
- ❌ Continuous action
- ✅ Experience replay + target networks
Explanation:
Replay breaks correlations; target network stabilizes training.
7. Question 7
Purpose of replay buffer:
- ❌ Store Q-values
- ✅ Store experiences for random sampling
- ❌ Reset environment
- ❌ Reduce learning rate
Explanation:
Random sampling avoids correlated updates.
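A minimal replay buffer can be sketched with a bounded deque (class and method names here are illustrative, not from the course material):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done)
    tuples; uniform random sampling breaks temporal correlations."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # old items drop off automatically

    def add(self, experience):
        self.buffer.append(experience)

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

buf = ReplayBuffer(capacity=100)
for t in range(5):
    buf.add((t, 0, 1.0, t + 1, False))  # toy transitions
batch = buf.sample(3)  # 3 transitions drawn uniformly at random
```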
8. Question 8
Target network updates:
- ❌ More frequent
- ❌ Never updated
- ❌ Same frequency
- ✅ Less frequently than the primary network
Explanation:
Updating the target network less often keeps the training targets stable while the primary network changes every step.
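In Keras this is typically done by copying weights periodically with `target_model.set_weights(model.get_weights())`. A dependency-free sketch of the same idea, with weights stood in by plain lists (function name and `sync_every` default are my own):

```python
def maybe_sync(primary_weights, target_weights, step, sync_every=1000):
    """Copy the primary network's weights into the target network only
    every `sync_every` steps; in between, the target stays frozen."""
    if step % sync_every == 0:
        target_weights[:] = primary_weights  # in-place copy
    return target_weights
```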
9. Question 9
Role of Bellman equation:
- ❌ Calculate immediate reward
- ❌ Initialize weights
- ✅ Update Q-values using immediate + discounted future reward
- ❌ Determine action count
Explanation:
Bellman equation defines Q-value updates.
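The tabular form of that update can be sketched in a few lines (helper name and hyperparameter defaults are illustrative):

```python
def q_update(q, s, a, reward, max_next_q, alpha=0.1, gamma=0.99):
    """Bellman update for tabular Q-learning:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))"""
    old = q.get((s, a), 0.0)
    target = reward + gamma * max_next_q  # immediate + discounted future reward
    q[(s, a)] = old + alpha * (target - old)

q = {}
q_update(q, "s0", "a0", reward=1.0, max_next_q=0.0, alpha=0.5, gamma=0.9)
# q[("s0", "a0")] is now 0.5: halfway from 0.0 toward the target 1.0
```

In a DQN, the same target `r + γ · max Q(s′, a′)` is computed with the target network and used as the regression label for the primary network.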
10. Question 10
Significance of discount factor γ:
- ✅ Importance of future rewards
- ❌ Learning rate
- ❌ Normalize Q-values
- ❌ Exploration rate
Explanation:
γ ∈ [0,1] controls how much future rewards matter.
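The effect of γ is easy to see on a discounted return (function name is my own):

```python
def discounted_return(rewards, gamma):
    """G = r_0 + gamma*r_1 + gamma^2*r_2 + ...
    gamma near 0 is myopic; gamma near 1 weighs distant rewards
    almost as heavily as immediate ones."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

discounted_return([1, 1, 1], 0.0)  # 1.0 -> only the immediate reward counts
discounted_return([1, 1, 1], 1.0)  # 3.0 -> all rewards count equally
```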
🧾 Summary Table
| Q# | Correct Answer |
|---|---|
| 1 | Maximize cumulative reward |
| 2 | Expected utility Q(s,a) |
| 3 | Probability of random action (ε) |
| 4 | Replace impractical Q-table |
| 5 | Epsilon-greedy |
| 6 | Replay buffer + target networks |
| 7 | Random sampling of experiences |
| 8 | Update less frequently |
| 9 | Update Q-values (Bellman equation) |
| 10 | Weight future rewards (γ) |