Fine-Tuning Causal LLMs with Human Feedback and Direct Preference Optimization: Generative AI Advance Fine-Tuning for LLMs (IBM AI Engineering Professional Certificate) Answers 2025

1. Which parameter reduces repetitive sequences?

❌ Temperature
✅ Repetition penalty
❌ Min/max tokens
❌ Top-K sampling

Explanation:
A higher repetition penalty discourages the model from repeating the same phrases, increasing output diversity.
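
A minimal sketch of how the repetition penalty is passed to Hugging Face `generate()`; the prompt text and penalty value are illustrative.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The meaning of life is", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=40,
    do_sample=True,
    repetition_penalty=1.2,  # values > 1.0 penalize tokens that already appeared
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```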


2. Five different generated outputs from the same prompt are called:

❌ Tokens
❌ Policies
✅ Rollouts
❌ Gradients

Explanation:
In RLHF and PPO frameworks, multiple generated outputs per prompt are known as rollouts.
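
A minimal sketch of collecting several rollouts from one prompt by sampling multiple return sequences; the prompt and counts are illustrative.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Explain reinforcement learning in one line:", return_tensors="pt")
rollouts = model.generate(
    **inputs,
    do_sample=True,
    max_new_tokens=30,
    num_return_sequences=5,  # five sampled completions of the same prompt
)
for r in rollouts:
    print(tokenizer.decode(r, skip_special_tokens=True))
```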


3. Goal of RLHF:

❌ Initialize embeddings
❌ Replace supervision with RL
❌ Pretrain from raw tokens
✅ Adjust model responses based on preferences

Explanation:
RLHF trains a model to prefer outputs aligned with human feedback (ratings, comparisons).
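
For context, human feedback is usually collected as preference comparisons. The record below is a hypothetical example of the chosen/rejected format such datasets use.

```python
# Hypothetical preference record: a prompt plus a human-preferred ("chosen")
# and a dispreferred ("rejected") response.
preference_example = {
    "prompt": "Summarize the article in one sentence.",
    "chosen": "The article explains how RLHF aligns model outputs with human preferences.",
    "rejected": "The article is about stuff and also other stuff and stuff.",
}
```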


4. KL penalty coefficient role:

❌ Increases randomness
✅ Limits divergence between updated and original policies
❌ Maximizes reward log-probability
❌ Boosts advantage function

Explanation:
The KL penalty keeps the fine-tuned model close to the base model to avoid unstable updates.
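
A small numeric sketch of how the KL coefficient shapes the reward in PPO-style RLHF; the log-probabilities and coefficient value are made up for illustration.

```python
import torch

logprobs_policy = torch.tensor([-1.2, -0.8, -2.0])   # updated (fine-tuned) policy
logprobs_ref = torch.tensor([-1.0, -0.9, -1.5])      # frozen reference policy
kl_coef = 0.2                                        # KL penalty coefficient

kl_estimate = (logprobs_policy - logprobs_ref).sum()  # approximate KL on this sample
reward_model_score = torch.tensor(1.0)
shaped_reward = reward_model_score - kl_coef * kl_estimate  # divergence lowers the reward
print(shaped_reward)
```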


5. PPO utility to vary input text length:

❌ input_min_text_length
❌ input_size
✅ LengthSampler
❌ tokenizer.pad_token

Explanation:
LengthSampler randomly samples different input lengths to simulate realistic training conditions.
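
A minimal sketch, assuming a TRL version that exposes trl.core.LengthSampler (as in the course labs): each call draws a random length used to truncate the next prompt.

```python
from trl.core import LengthSampler

input_size_sampler = LengthSampler(4, 16)  # lengths drawn from range(4, 16)

for _ in range(3):
    size = input_size_sampler()
    print(size)  # e.g. 7, 12, 5 -- each prompt would be truncated to `size` tokens
```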


6. Primary role of PPOConfig:

❌ Execute token alignment
❌ Collect logs
❌ Control decoding temperature
✅ Define model settings and learning rate

Explanation:
PPOConfig defines hyperparameters such as learning rate, batch sizes, and model architecture.
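
A minimal sketch of PPOConfig using the older TRL API from the labs; field names and values are illustrative and may differ in newer TRL releases.

```python
from trl import PPOConfig

ppo_config = PPOConfig(
    model_name="gpt2",      # causal LM to fine-tune
    learning_rate=1.41e-5,  # optimizer learning rate
    batch_size=16,          # rollouts processed per PPO step
    mini_batch_size=4,      # size of the gradient-update chunks within a batch
)
```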


7. DPO: role of the partition function when constructing a new distribution:

❌ Compare responses
✅ Confirm probabilities from a valid distribution
❌ Remove irrelevant outputs
❌ Elaborate differences between models

Explanation:
Ensuring the transformed distribution remains valid (sum=1, non-negative) is essential in DPO.
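
A numeric sketch of why normalization matters: reweighting the reference distribution by exp(r/β) only yields a valid distribution after dividing by the partition function. All values are illustrative.

```python
import torch

ref_probs = torch.tensor([0.5, 0.3, 0.2])  # reference policy over 3 candidate responses
rewards = torch.tensor([1.0, 0.2, -0.5])   # hypothetical reward scores
beta = 0.1

unnormalized = ref_probs * torch.exp(rewards / beta)
new_probs = unnormalized / unnormalized.sum()  # divide by the partition function Z(x)
print(new_probs, new_probs.sum())              # non-negative and sums to 1.0
```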


8. Optimal DPO solution:

❌ Duplicate reward model
❌ Use reference-only model
❌ Eliminate partition function
✅ Scale the reference model using the beta parameter

Explanation:
DPO reweights the reference model with β to produce an optimized policy aligned with preferences.
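
For reference, the closed-form optimum behind this answer reweights the reference policy by the β-scaled reward, with the partition function Z(x) keeping it a valid distribution:

\pi^{*}(y \mid x) = \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y \mid x)\,\exp\!\left(\frac{r(x, y)}{\beta}\right)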


9. Why prefer DPO over PPO for preference fine-tuning?

❌ Increase randomness
❌ Use closed vocabulary
❌ Ignore feedback
✅ Reformulate the optimization to avoid numerical instability

Explanation:
DPO avoids PPO’s unstable advantage estimation and KL control by turning preference optimization into a simpler likelihood objective.
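
A minimal sketch of the DPO objective on one preference pair: a log-sigmoid loss over β-scaled log-probability ratios, with no reward rollouts or advantage estimates. The log-probability values are made up for illustration.

```python
import torch
import torch.nn.functional as F

beta = 0.1
# Sequence log-probabilities of the chosen / rejected responses under the
# trainable policy and the frozen reference model (illustrative values).
policy_chosen, policy_rejected = torch.tensor(-12.0), torch.tensor(-15.0)
ref_chosen, ref_rejected = torch.tensor(-13.0), torch.tensor(-14.0)

logits = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
loss = -F.logsigmoid(logits)  # standard supervised-style loss, no RL loop
print(loss)
```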


10. First step when loading and configuring the model:

❌ tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
❌ model_ref = AutoModelForCausalLM.from_pretrained("gpt2")
❌ tokenizer.pad_token = tokenizer.eos_token
✅ model = AutoModelForCausalLM.from_pretrained("gpt2")

Explanation:
The model must be loaded before configuring tokenizer padding or creating the frozen reference copy.
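
Putting the four options in order, a minimal sketch of the full loading sequence looks like this:

```python
from transformers import AutoModelForCausalLM, GPT2Tokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")      # 1. load the policy model first
model_ref = AutoModelForCausalLM.from_pretrained("gpt2")  # 2. frozen reference copy
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")         # 3. matching tokenizer
tokenizer.pad_token = tokenizer.eos_token                 # 4. GPT-2 has no pad token by default
```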


🧾 Summary Table

| Q# | Correct Answer | Key Concept |
|----|----------------|-------------|
| 1 | Repetition penalty | Reduce repetition |
| 2 | Rollouts | Multiple generated samples |
| 3 | Adjust model using preferences | RLHF purpose |
| 4 | Limit divergence | KL penalty |
| 5 | LengthSampler | Variable sequence lengths |
| 6 | Define model & learning rate | PPOConfig |
| 7 | Valid probability check | DPO distribution |
| 8 | β-scaled reference model | DPO optimal form |
| 9 | Avoid PPO instability | DPO advantage |
| 10 | Load model first | Model initialization |