Graded Quiz: Parameter-Efficient Fine-Tuning (PEFT) | Generative AI Engineering and Fine-Tuning Transformers (IBM AI Engineering Professional Certificate) Answers 2025
1. What does QLoRA use to minimize memory during fine-tuning?
❌ Zero-shot inference
❌ Few-shot inference
✅ 4-bit quantization
❌ LoRA adaptation
Explanation:
QLoRA quantizes the frozen base model's weights to 4-bit precision, drastically reducing GPU memory usage while the LoRA adapters are fine-tuned on top in higher precision.
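For context, this is roughly how 4-bit loading looks with the HuggingFace `transformers` + `bitsandbytes` stack. A minimal sketch, assuming `transformers`, `bitsandbytes`, and `accelerate` are installed and a CUDA GPU is available; the checkpoint name is just a placeholder, not from the course.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 4-bit quantization: the configuration QLoRA is built on
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store base weights in 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to bf16 for matmuls
)

# Placeholder checkpoint for illustration; any causal LM works the same way
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",
    quantization_config=bnb_config,
    device_map="auto",
)
```

The base model stays frozen in 4-bit; only the LoRA adapters attached afterwards are trained.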
2. Which technique adds low-rank matrices to reduce trainable parameters?
❌ Soft prompts
❌ Full fine-tuning
✅ LoRA
❌ Additive fine-tuning
Explanation:
LoRA uses trainable low-rank matrices to update only a tiny subset of parameters while keeping the original model frozen.
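To make the mechanism concrete, here is a from-scratch PyTorch sketch of a LoRA-wrapped linear layer. The class name `LoRALinear` and the `r`/`alpha` defaults are illustrative, not from the course.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update: y = x W^T + scale * x (B A)^T."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pretrained weight
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)  # r x d_in
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))        # d_out x r, zero-init
        self.scale = alpha / r  # standard LoRA scaling

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Base output plus the low-rank correction; B = 0 at init, so behavior starts unchanged
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scale
```

Wrapping a layer this way trains only r × (d_in + d_out) parameters per layer instead of d_in × d_out.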
3. How does adding low-rank matrices affect parameter efficiency?
❌ Replaces full weight matrices
✅ Adds a very small number of trainable parameters to the existing weights
❌ Tracks original parameter count
❌ Increases trainable parameters with high-rank matrices
Explanation:
LoRA keeps the original weights W frozen and adds a minimal low-rank update BA, so the effective weight is W + BA while only the small factors B and A are trained, which makes it extremely parameter-efficient.
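A quick back-of-the-envelope check makes the savings concrete. The numbers assume a BERT-base-sized 768×768 projection and an illustrative rank of 8:

```python
d = 768   # hidden size of a BERT-base attention projection
r = 8     # illustrative LoRA rank

full_ft = d * d            # params updated when fine-tuning W directly
lora = r * d + d * r       # params in the A (r x d) and B (d x r) factors

print(full_ft, lora, f"{lora / full_ft:.1%}")  # 589824 12288 2.1%
```

Roughly 2% of the original per-layer parameter count is trained, and it shrinks further as the model grows.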
4. Why apply LoRA to a BERT-like model on HuggingFace?
✅ LoRA integrates low-rank matrices into selected modules via PEFT, configured using TrainingArguments.
❌ Tailors all model layers & disables dropout
❌ Trains from scratch
❌ Removes tokenization & transformer blocks
Explanation:
LoRA via HuggingFace PEFT modifies only selected modules (e.g., the attention query/value projections): the adapters are configured with a LoraConfig, applied with get_peft_model, and then trained through the standard Trainer/TrainingArguments workflow, keeping fine-tuning memory-efficient.
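A minimal sketch of that setup, assuming the `peft` and `transformers` packages; `r=8` and `lora_alpha=16` are illustrative hyperparameters, not prescribed values.

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# LoRA itself is configured via LoraConfig; TrainingArguments later drives the training loop
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["query", "value"],  # BERT's attention projection module names
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # reports the tiny trainable fraction
```

The resulting model drops into the usual `Trainer(..., args=TrainingArguments(...))` workflow unchanged.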
5. What mechanism enables fast & memory-efficient LoRA fine-tuning?
❌ Quantization (used by QLoRA; LoRA on its own does not quantize)
❌ Maintains full network from scratch
❌ Small batch engineering
✅ Freezes the main weights and updates only the added low-rank matrices
Explanation:
LoRA’s core idea: freeze the large pretrained model and train only tiny low-rank adapters.
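You can verify the freeze/train split directly. This snippet repeats the Q4 setup so it runs on its own; with `TaskType.SEQ_CLS`, PEFT also keeps the new classification head trainable.

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
model = get_peft_model(
    model,
    LoraConfig(task_type=TaskType.SEQ_CLS, r=8, target_modules=["query", "value"]),
)

# Only the injected lora_A / lora_B tensors (plus the classification head,
# which SEQ_CLS keeps trainable) carry gradients; the pretrained backbone is frozen.
for name, param in model.named_parameters():
    if param.requires_grad:
        print(name)
```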
🧾 Summary Table
| Q# | Correct Answer | Key Concept |
|---|---|---|
| 1 | 4-bit quantization | QLoRA memory savings |
| 2 | LoRA | Low-rank training |
| 3 | Add small low-rank matrices | Parameter efficiency |
| 4 | LoRA via PEFT | Efficient fine-tuning |
| 5 | Freeze weights + train low-rank updates | LoRA mechanism |