Graded Quiz: Data Preparation for LLMs | Generative AI and LLMs: Architecture and Data Preparation (IBM AI Engineering Professional Certificate) Answers 2025

1. Which tokenization method handles complex morphology with a small vocabulary, at the cost of longer input sequences?

❌ Subword-based tokenization
❌ Word-based tokenization
❌ WordPiece tokenization
Character-based tokenization

Explanation:
Character-level tokenization has the smallest possible vocabulary (all characters), but sequences become very long — increasing computational cost.
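
For illustration (not part of the quiz), a minimal Python sketch of character-level tokenization; the text and vocabulary here are made up:

```python
# Character-level tokenization: the vocabulary is just the character set,
# but every character becomes its own token, so sequences get long.
text = "unbelievable"

vocab = {ch: i for i, ch in enumerate(sorted(set(text)))}
token_ids = [vocab[ch] for ch in text]

print(vocab)      # small vocabulary: one entry per distinct character
print(token_ids)  # 12 token IDs for a single 12-character word
```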


2. How to standardize variable sequence lengths?

❌ Batching
Padding
❌ Iteration
❌ Shuffling

Explanation:
Padding ensures all sequences in a batch have equal length by adding special tokens like <pad>.
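
A short sketch of padding, assuming PyTorch is available; using 0 as the pad ID is an illustrative choice, not a requirement:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Three token-ID sequences of different lengths.
seqs = [torch.tensor([5, 9, 2]),
        torch.tensor([7, 1]),
        torch.tensor([3, 4, 8, 6])]

# Pad every sequence to the length of the longest one, using 0 as the <pad> ID.
batch = pad_sequence(seqs, batch_first=True, padding_value=0)
print(batch.shape)  # torch.Size([3, 4])
print(batch)
```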


3. Which symbol indicates subwords that attach to the previous token (WordPiece)?

## symbol
❌ Underscore symbol
❌ <pad> token
❌ <eos> token

Explanation:
In WordPiece tokenization, ##token means the token continues the previous word without a space.
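
You can see the ## prefix with a WordPiece tokenizer such as BERT's (assuming the Hugging Face transformers library is installed; the vocab is downloaded on first use):

```python
from transformers import BertTokenizer

# BERT uses a WordPiece tokenizer.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Long or rare words are split into subwords; '##' marks pieces that
# continue the previous token without a space (exact splits depend on the vocab).
print(tokenizer.tokenize("tokenization"))  # ['token', '##ization']
```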


4. What should Sonia prioritize to fix inconsistent formatting, typos, and noise?

❌ Increase batch size
Clean data to remove inconsistencies and noise
❌ Token-level augmentation
❌ Adding more data

Explanation:
Dirty or inconsistent data significantly harms model performance — data cleaning is the highest priority.
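
A minimal cleaning sketch in Python; the specific rules below (lowercasing, stripping markup, collapsing whitespace) are illustrative, not the course's exact pipeline:

```python
import re

def clean_text(text: str) -> str:
    """Illustrative cleaning: lowercase, strip markup, drop stray symbols,
    and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)           # remove HTML-like tags
    text = re.sub(r"[^a-z0-9\s.,!?']", " ", text)  # drop unexpected symbols
    text = re.sub(r"\s+", " ", text).strip()       # collapse repeated spaces
    return text

print(clean_text("  Hello,   <b>WORLD</b>!!  ~~noise~~  "))
# -> "hello, world !! noise"
```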


5. Which argument prevents feeding the data in the same order every epoch?

The shuffle argument
❌ The dataset
❌ The batch size
❌ The padding value

Explanation:
Setting shuffle=True randomizes the order of samples to prevent the model from learning unwanted sequence patterns.
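
A small example with a PyTorch DataLoader (assuming PyTorch; the toy dataset below is made up):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# A toy dataset of 8 samples.
features = torch.arange(8).unsqueeze(1).float()
labels = torch.zeros(8)
dataset = TensorDataset(features, labels)

# shuffle=True reshuffles the sample order at the start of every epoch,
# so the model never sees the data in the same fixed order.
loader = DataLoader(dataset, batch_size=4, shuffle=True)

for epoch in range(2):
    for x, y in loader:
        print(epoch, x.squeeze(1).tolist())
```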


🧾 Summary Table

| Q# | Correct Answer | Key Concept |
|----|----------------|-------------|
| 1 | Character-based tokenization | Small vocabulary, long sequences |
| 2 | Padding | Standardizing sequence length |
| 3 | ## | WordPiece subword continuation |
| 4 | Clean data | Remove noise and inconsistencies |
| 5 | shuffle=True | Randomize sample order |