Graded Quiz: Data Preparation for LLMs | Generative AI and LLMs: Architecture and Data Preparation (IBM AI Engineering Professional Certificate) Answers 2025
1. Which tokenization method best handles complex morphology with the smallest vocabulary, at the cost of longer input sequences?
❌ Subword-based tokenization
❌ Word-based tokenization
❌ WordPiece tokenization
✅ Character-based tokenization
Explanation:
Character-level tokenization uses the smallest possible vocabulary (just the character set), but sequences become very long, which increases computational cost.
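To make the trade-off concrete, here is a minimal sketch in plain Python (the example string is purely illustrative): the character vocabulary stays tiny, while the token sequence grows far longer than its word-based equivalent.

```python
text = "unbelievably complex morphology"

char_tokens = list(text)    # character-based: one token per character
word_tokens = text.split()  # word-based: one token per whitespace-separated word

print(len(set(char_tokens)), "unique characters in the vocabulary")
print(len(char_tokens), "character tokens vs.", len(word_tokens), "word tokens")
```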
2. How to standardize variable sequence lengths?
❌ Batching
✅ Padding
❌ Iteration
❌ Shuffling
Explanation:
Padding ensures all sequences in a batch have equal length by adding special tokens such as `<pad>`.
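As a minimal sketch, assuming PyTorch is the framework in use (the token IDs below are invented), `torch.nn.utils.rnn.pad_sequence` pads every sequence in a batch to the length of the longest one:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Three tokenized sequences of different lengths (IDs are made up).
seqs = [torch.tensor([5, 12, 7]),
        torch.tensor([3, 9]),
        torch.tensor([8, 2, 11, 4])]

# Pad to the longest sequence; 0 stands in for the <pad> token ID.
batch = pad_sequence(seqs, batch_first=True, padding_value=0)
print(batch)
# tensor([[ 5, 12,  7,  0],
#         [ 3,  9,  0,  0],
#         [ 8,  2, 11,  4]])
```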
3. Which symbol indicates subwords that attach to the previous token (WordPiece)?
✅ `##` symbol
❌ Underscore symbol
❌ `<pad>` token
❌ `<eos>` token
Explanation:
In WordPiece tokenization, `##token` means the token continues the previous word without an intervening space.
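For example, with the Hugging Face `transformers` library (assuming it is available; the model name is just one common WordPiece tokenizer, and the exact split depends on its vocabulary), the `##` prefix is visible directly:

```python
from transformers import AutoTokenizer

# Load a pretrained WordPiece tokenizer (BERT's, as an example).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.tokenize("tokenization"))
# e.g. ['token', '##ization'] - '##ization' attaches to the previous token.
```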
4. What should Sonia prioritize to fix inconsistent formatting, typos, and noise?
❌ Increase batch size
✅ Clean data to remove inconsistencies and noise
❌ Token-level augmentation
❌ Adding more data
Explanation:
Dirty or inconsistent data significantly harms model performance, so data cleaning is the highest priority.
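A minimal cleaning sketch in plain Python (the rules and sample strings are hypothetical; real pipelines tailor them to the corpus) might normalize unicode, strip markup, collapse whitespace, and drop exact duplicates:

```python
import re
import unicodedata

def clean_text(text: str) -> str:
    """Illustrative cleaning steps; adjust the rules to the actual corpus."""
    text = unicodedata.normalize("NFKC", text)  # normalize unicode variants
    text = re.sub(r"<[^>]+>", " ", text)        # strip stray HTML tags
    text = re.sub(r"\s+", " ", text).strip()    # collapse repeated whitespace
    return text

raw_docs = ["  Great   product!!<br>", "Great product!!", "gr8 pr0duct\t\t"]
cleaned = list(dict.fromkeys(clean_text(d) for d in raw_docs))  # de-duplicate
print(cleaned)  # ['Great product!!', 'gr8 pr0duct']
```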
5. Which argument prevents the data loader from feeding samples in the same order every epoch?
✅ The shuffle argument
❌ The dataset
❌ The batch size
❌ The padding value
Explanation:
Setting `shuffle=True` randomizes the order of samples each epoch so the model does not learn unwanted ordering patterns.
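In a PyTorch `DataLoader` (assumed here; the toy tensors are purely illustrative), the flag looks like this:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# A toy dataset of 8 samples (features and labels are made up).
dataset = TensorDataset(torch.arange(8).float().unsqueeze(1), torch.arange(8))

# shuffle=True reshuffles the sample order at the start of every epoch.
loader = DataLoader(dataset, batch_size=4, shuffle=True)

for epoch in range(2):
    for features, labels in loader:
        print(epoch, labels.tolist())  # batch contents differ between epochs
```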
🧾 Summary Table
| Q# | Correct Answer | Key Concept |
|---|---|---|
| 1 | Character-based | Small vocab, high sequence length |
| 2 | Padding | Standardizing sequence length |
| 3 | ## | WordPiece subword continuation |
| 4 | Clean data | Remove noise & inconsistencies |
| 5 | shuffle=True | Randomize sample order |