Fine-Tuning Causal LLMs with Human Feedback and Direct Preference Optimization: Generative AI Advance Fine-Tuning for LLMs (IBM AI Engineering Professional Certificate) Answers 2025

1. Which parameter reduces repetitive sequences?

❌ Temperature
✅ Repetition penalty
❌ Min/max tokens
❌ Top-K sampling

Explanation:
A higher repetition penalty discourages the model from repeating the same phrases, increasing output diversity.
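
A minimal sketch of how the repetition penalty is passed to Hugging Face `generate()`; the prompt text and penalty value are illustrative.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The meaning of life is", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=40,
    do_sample=True,
    repetition_penalty=1.2,  # values > 1.0 penalize tokens that already appeared
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```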


2. Five different generated outputs from the same prompt are called:

❌ Tokens
❌ Policies
✅ Rollouts
❌ Gradients

Explanation:
In RLHF and PPO frameworks, multiple generated outputs per prompt are known as rollouts.
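
A minimal sketch of collecting several rollouts from one prompt by sampling multiple return sequences; the prompt and counts are illustrative.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Explain reinforcement learning in one line:", return_tensors="pt")
rollouts = model.generate(
    **inputs,
    do_sample=True,
    max_new_tokens=30,
    num_return_sequences=5,  # five sampled completions of the same prompt
)
for r in rollouts:
    print(tokenizer.decode(r, skip_special_tokens=True))
```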


3. Goal of RLHF:

❌ Initialize embeddings
❌ Replace supervision with RL
❌ Pretrain from raw tokens
✅ Adjust model responses based on preferences

Explanation:
RLHF trains a model to prefer outputs aligned with human feedback (ratings, comparisons).
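
For context, human feedback is usually collected as preference comparisons. The record below is a hypothetical example of the chosen/rejected format such datasets use.

```python
# Hypothetical preference record: a prompt plus a human-preferred ("chosen")
# and a dispreferred ("rejected") response.
preference_example = {
    "prompt": "Summarize the article in one sentence.",
    "chosen": "The article explains how RLHF aligns model outputs with human preferences.",
    "rejected": "The article is about stuff and also other stuff and stuff.",
}
```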


4. KL penalty coefficient role:

❌ Increases randomness
✅ Limits divergence between updated and original policies
❌ Maximizes reward log-probability
❌ Boosts advantage function

Explanation:
The KL penalty keeps the fine-tuned model close to the base model to avoid unstable updates.
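
A small numeric sketch of how the KL coefficient shapes the reward in PPO-style RLHF; the log-probabilities and coefficient value are made up for illustration.

```python
import torch

logprobs_policy = torch.tensor([-1.2, -0.8, -2.0])   # updated (fine-tuned) policy
logprobs_ref = torch.tensor([-1.0, -0.9, -1.5])      # frozen reference policy
kl_coef = 0.2                                        # KL penalty coefficient

kl_estimate = (logprobs_policy - logprobs_ref).sum()  # approximate KL on this sample
reward_model_score = torch.tensor(1.0)
shaped_reward = reward_model_score - kl_coef * kl_estimate  # divergence lowers the reward
print(shaped_reward)
```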


5. PPO utility to vary input text length:

❌ input_min_text_length
❌ input_size
✅ LengthSampler
❌ tokenizer.pad_token

Explanation:
LengthSampler randomly samples different input lengths to simulate realistic training conditions.
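
A minimal sketch, assuming a TRL version that exposes trl.core.LengthSampler (as in the course labs): each call draws a random length used to truncate the next prompt.

```python
from trl.core import LengthSampler

input_size_sampler = LengthSampler(4, 16)  # lengths drawn from range(4, 16)

for _ in range(3):
    size = input_size_sampler()
    print(size)  # e.g. 7, 12, 5 -- each prompt would be truncated to `size` tokens
```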


6. Primary role of PPOConfig:

❌ Execute token alignment
❌ Collect logs
❌ Control decoding temperature
✅ Define model settings and learning rate

Explanation:
PPOConfig defines hyperparameters such as learning rate, batch sizes, and model architecture.
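
A minimal sketch of PPOConfig using the older TRL API from the labs; field names and values are illustrative and may differ in newer TRL releases.

```python
from trl import PPOConfig

ppo_config = PPOConfig(
    model_name="gpt2",      # causal LM to fine-tune
    learning_rate=1.41e-5,  # optimizer learning rate
    batch_size=16,          # rollouts processed per PPO step
    mini_batch_size=4,      # size of the gradient-update chunks within a batch
)
```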


7. DPO: role of the partition function when constructing a new distribution:

❌ Compare responses
✅ Confirm probabilities from a valid distribution
❌ Remove irrelevant outputs
❌ Elaborate differences between models

Explanation:
Ensuring the transformed distribution remains valid (sum=1, non-negative) is essential in DPO.
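
A numeric sketch of why normalization matters: reweighting the reference distribution by exp(r/β) only yields a valid distribution after dividing by the partition function. All values are illustrative.

```python
import torch

ref_probs = torch.tensor([0.5, 0.3, 0.2])  # reference policy over 3 candidate responses
rewards = torch.tensor([1.0, 0.2, -0.5])   # hypothetical reward scores
beta = 0.1

unnormalized = ref_probs * torch.exp(rewards / beta)
new_probs = unnormalized / unnormalized.sum()  # divide by the partition function Z(x)
print(new_probs, new_probs.sum())              # non-negative and sums to 1.0
```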


8. Optimal DPO solution:

❌ Duplicate reward model
❌ Use reference-only model
❌ Eliminate partition function
✅ Scale the reference model using the beta parameter

Explanation:
DPO reweights the reference model with β to produce an optimized policy aligned with preferences.
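
For reference, the closed-form optimum behind this answer reweights the reference policy by the β-scaled reward, with the partition function Z(x) keeping it a valid distribution:

\pi^{*}(y \mid x) = \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y \mid x)\,\exp\!\left(\frac{r(x, y)}{\beta}\right)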


9. Why prefer DPO over PPO for preference fine-tuning?

❌ Increase randomness
❌ Use closed vocabulary
❌ Ignore feedback
✅ Reformulate the optimization to avoid numerical instability

Explanation:
DPO avoids PPO’s unstable advantage estimation and KL control by turning preference optimization into a simpler likelihood objective.
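
A minimal sketch of the DPO objective on one preference pair: a log-sigmoid loss over β-scaled log-probability ratios, with no reward rollouts or advantage estimates. The log-probability values are made up for illustration.

```python
import torch
import torch.nn.functional as F

beta = 0.1
# Sequence log-probabilities of the chosen / rejected responses under the
# trainable policy and the frozen reference model (illustrative values).
policy_chosen, policy_rejected = torch.tensor(-12.0), torch.tensor(-15.0)
ref_chosen, ref_rejected = torch.tensor(-13.0), torch.tensor(-14.0)

logits = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
loss = -F.logsigmoid(logits)  # standard supervised-style loss, no RL loop
print(loss)
```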


10. First step when loading and configuring the model:

❌ tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
❌ model_ref = AutoModelForCausalLM.from_pretrained("gpt2")
❌ tokenizer.pad_token = tokenizer.eos_token
✅ model = AutoModelForCausalLM.from_pretrained("gpt2")

Explanation:
The model must be loaded before configuring tokenizer padding or creating the frozen reference copy.
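
Putting the four options in order, a minimal sketch of the full loading sequence looks like this:

```python
from transformers import AutoModelForCausalLM, GPT2Tokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")      # 1. load the policy model first
model_ref = AutoModelForCausalLM.from_pretrained("gpt2")  # 2. frozen reference copy
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")         # 3. matching tokenizer
tokenizer.pad_token = tokenizer.eos_token                 # 4. GPT-2 has no pad token by default
```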


🧾 Summary Table

| Q# | Correct Answer | Key Concept |
|----|----------------|-------------|
| 1 | Repetition penalty | Reduce repetition |
| 2 | Rollouts | Multiple generated samples |
| 3 | Adjust model using preferences | RLHF purpose |
| 4 | Limit divergence | KL penalty |
| 5 | LengthSampler | Variable sequence lengths |
| 6 | Define model & learning rate | PPOConfig |
| 7 | Valid probability check | DPO distribution |
| 8 | β-scaled reference model | DPO optimal form |
| 9 | Avoid PPO instability | DPO advantage |
| 10 | Load model first | Model initialization |