Fine-Tuning Causal LLMs with Human Feedback and Direct Preference Optimization: Generative AI Advance Fine-Tuning for LLMs (IBM AI Engineering Professional Certificate) Answers 2025
1. Which parameter reduces repetitive sequences?
❌ Temperature
✅ Repetition penalty
❌ Min/max tokens
❌ Top-K sampling
Explanation:
A higher repetition penalty discourages the model from repeating the same phrases, increasing output diversity.
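For reference, a minimal sketch (not from the course lab) of passing a repetition penalty to Hugging Face `generate`; the prompt and penalty value are illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The movie was", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=40,
    do_sample=True,
    repetition_penalty=1.2,  # values > 1.0 penalize tokens that already appeared
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```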
2. Five different generated outputs from the same prompt are called:
❌ Tokens
❌ Policies
✅ Rollouts
❌ Gradients
Explanation:
In RLHF and PPO frameworks, multiple generated outputs per prompt are known as rollouts.
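A quick sketch of what rollouts look like in code: sampling several continuations of one prompt (model and values are illustrative). In a PPO loop, each of these samples would then be scored by the reward model.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The movie was", return_tensors="pt")
rollouts = model.generate(
    **inputs,
    max_new_tokens=40,
    do_sample=True,
    top_k=50,
    num_return_sequences=5,  # five independent samples (rollouts) for one prompt
)
for r in rollouts:
    print(tokenizer.decode(r, skip_special_tokens=True))
```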
3. Goal of RLHF:
❌ Initialize embeddings
❌ Replace supervision with RL
❌ Pretrain from raw tokens
✅ Adjust model responses based on preferences
Explanation:
RLHF trains a model to prefer outputs aligned with human feedback (ratings, comparisons).
4. KL penalty coefficient role:
❌ Increases randomness
✅ Limits divergence between updated and original policies
❌ Maximizes reward log-probability
❌ Boosts advantage function
Explanation:
The KL penalty keeps the fine-tuned model close to the base model to avoid unstable updates.
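A toy sketch of how the penalty enters the reward (variable names and numbers are made up for illustration, not TRL internals):

```python
import torch

kl_coef = 0.2                                        # KL penalty coefficient
logprobs_policy = torch.tensor([-1.2, -0.8, -2.1])   # log-probs under the tuned policy
logprobs_ref = torch.tensor([-1.0, -0.9, -1.5])      # log-probs under the base/reference model
reward = torch.tensor([0.0, 0.0, 0.7])               # reward model score on the last token

kl_estimate = logprobs_policy - logprobs_ref         # per-token divergence estimate
shaped_reward = reward - kl_coef * kl_estimate       # larger kl_coef => stay closer to the base model
print(shaped_reward)
```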
5. PPO utility to vary input text length:
❌ input_min_text_length
❌ input_size
✅ LengthSampler
❌ tokenizer.pad_token
Explanation:
LengthSampler randomly samples different input lengths to simulate realistic training conditions.
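In the TRL version used by these labs, `LengthSampler` is imported from `trl.core`; a minimal sketch (the bounds are illustrative):

```python
from trl.core import LengthSampler

input_size = LengthSampler(2, 8)             # draw target lengths between 2 and 8 tokens
print([input_size() for _ in range(5)])      # e.g. [3, 7, 2, 5, 6]
```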
6. Primary role of PPOConfig:
❌ Execute token alignment
❌ Collect logs
❌ Control decoding temperature
✅ Define model settings and learning rate
Explanation:
PPOConfig defines hyperparameters such as learning rate, batch sizes, and model architecture.
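A sketch using the older TRL API (field names may differ in newer releases); the values are illustrative:

```python
from trl import PPOConfig

config = PPOConfig(
    model_name="gpt2",        # which pretrained model to fine-tune
    learning_rate=1.41e-5,    # optimizer learning rate
    batch_size=16,            # rollouts collected per PPO step
    mini_batch_size=4,        # chunk size for gradient updates
)
```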
7. DPO: what is the role of the normalizing (partition) function when constructing a new distribution?
❌ Compare responses
✅ Confirm probabilities from a valid distribution
❌ Remove irrelevant outputs
❌ Elaborate differences between models
Explanation:
The partition function normalizes the reweighted reference distribution so the result is a valid probability distribution (non-negative, summing to 1), which is essential in DPO.
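A toy sketch of why the normalizer matters (numbers are made up): dividing by the partition function turns reweighted scores back into a valid distribution.

```python
import torch

ref_probs = torch.tensor([0.5, 0.3, 0.2])     # reference distribution over 3 candidate responses
rewards = torch.tensor([1.0, 0.2, -0.5])      # hypothetical reward scores
beta = 0.5

unnormalized = ref_probs * torch.exp(rewards / beta)
Z = unnormalized.sum()                        # partition function (normalizing constant)
new_dist = unnormalized / Z
print(new_dist, new_dist.sum())               # non-negative and sums to 1
```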
8. Optimal DPO solution:
❌ Duplicate reward model
❌ Use reference-only model
❌ Eliminate partition function
✅ Scale the reference model using the beta parameter
Explanation:
DPO's closed-form optimum reweights the reference model by the exponentiated reward scaled by β (and renormalizes), producing a policy aligned with preferences.
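In standard DPO notation (not quoted from the quiz), the optimal policy is:

```latex
\pi^{*}(y \mid x) = \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y \mid x)\,
                    \exp\!\left(\frac{1}{\beta}\, r(x, y)\right)
```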
9. Why prefer DPO over PPO for preference fine-tuning?
❌ Increase randomness
❌ Use closed vocabulary
❌ Ignore feedback
✅ Reformulate the optimization to avoid numerical instability
Explanation:
DPO avoids PPO’s unstable advantage estimation and KL control by turning preference optimization into a simpler likelihood objective.
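A short sketch of the resulting objective (the standard DPO loss, not the course's exact code): a logistic loss over log-probability ratios, with no reward model or advantage estimates in the loop.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    chosen_ratio = policy_chosen_logps - ref_chosen_logps        # log(pi/pi_ref) on the preferred response
    rejected_ratio = policy_rejected_logps - ref_rejected_logps  # log(pi/pi_ref) on the rejected response
    # Maximize the margin between the two ratios via a log-sigmoid (logistic) loss.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

loss = dpo_loss(torch.tensor([-10.0]), torch.tensor([-14.0]),
                torch.tensor([-11.0]), torch.tensor([-13.0]))
print(loss)
```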
10. First step when loading & configuring model:
❌ tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
❌ model_ref = AutoModelForCausalLM.from_pretrained("gpt2")
❌ tokenizer.pad_token = tokenizer.eos_token
✅ model = AutoModelForCausalLM.from_pretrained("gpt2")
Explanation:
The base model must be loaded before configuring tokenizer padding or creating a reference copy.
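Putting the four answer options in order, a minimal setup sketch:

```python
from transformers import AutoModelForCausalLM, GPT2Tokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")      # 1. load the policy model first
model_ref = AutoModelForCausalLM.from_pretrained("gpt2")  # 2. load a frozen reference copy
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")         # 3. load the tokenizer
tokenizer.pad_token = tokenizer.eos_token                 # 4. GPT-2 has no pad token, so reuse EOS
```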
🧾 Summary Table
| Q# | Correct Answer | Key Concept |
|---|---|---|
| 1 | Repetition penalty | Reduce repetition |
| 2 | Rollouts | Multiple generated samples |
| 3 | Adjust model using preferences | RLHF purpose |
| 4 | Limit divergence | KL penalty |
| 5 | LengthSampler | Variable sequence lengths |
| 6 | Define model & learning rate | PPOConfig |
| 7 | Valid probability check | DPO distribution |
| 8 | β-scaled reference model | DPO optimal form |
| 9 | Avoid PPO instability | DPO advantage |
| 10 | Load model first | Model initialization |