Graded Quiz: Data Preparation for LLMs | Generative AI and LLMs: Architecture and Data Preparation (IBM AI Engineering Professional Certificate) Answers 2025
1. Which tokenization method best handles complex morphology with the smallest vocabulary, at the cost of longer input sequences?
❌ Subword-based tokenization
❌ Word-based tokenization
❌ WordPiece tokenization
✅ Character-based tokenization
Explanation:
Character-level tokenization uses the smallest possible vocabulary (just the character set), but sequences become very long, which increases computational cost.
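To make the trade-off concrete, here is a minimal sketch in plain Python (the example string is purely illustrative): the character vocabulary stays tiny, while the token sequence grows far longer than its word-based equivalent.

```python
text = "unbelievably complex morphology"

char_tokens = list(text)    # character-based: one token per character
word_tokens = text.split()  # word-based: one token per whitespace-separated word

print(len(set(char_tokens)), "unique characters in the vocabulary")
print(len(char_tokens), "character tokens vs.", len(word_tokens), "word tokens")
```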
2. How to standardize variable sequence lengths?
❌ Batching
✅ Padding
❌ Iteration
❌ Shuffling
Explanation:
Padding ensures all sequences in a batch have equal length by adding special tokens such as `<pad>`.
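As a minimal sketch, assuming PyTorch is the framework in use (the token IDs below are invented), `torch.nn.utils.rnn.pad_sequence` pads every sequence in a batch to the length of the longest one:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Three tokenized sequences of different lengths (IDs are made up).
seqs = [torch.tensor([5, 12, 7]),
        torch.tensor([3, 9]),
        torch.tensor([8, 2, 11, 4])]

# Pad to the longest sequence; 0 stands in for the <pad> token ID.
batch = pad_sequence(seqs, batch_first=True, padding_value=0)
print(batch)
# tensor([[ 5, 12,  7,  0],
#         [ 3,  9,  0,  0],
#         [ 8,  2, 11,  4]])
```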
3. Which symbol indicates subwords that attach to the previous token (WordPiece)?
✅ `##` symbol
❌ Underscore symbol
❌ `<pad>` token
❌ `<eos>` token
Explanation:
In WordPiece tokenization, `##token` means the token continues the previous word without an intervening space.
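For example, with the Hugging Face `transformers` library (assuming it is available; the model name is just one common WordPiece tokenizer, and the exact split depends on its vocabulary), the `##` prefix is visible directly:

```python
from transformers import AutoTokenizer

# Load a pretrained WordPiece tokenizer (BERT's, as an example).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.tokenize("tokenization"))
# e.g. ['token', '##ization'] - '##ization' attaches to the previous token.
```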
4. What should Sonia prioritize to fix inconsistent formatting, typos, and noise?
❌ Increase batch size
✅ Clean data to remove inconsistencies and noise
❌ Token-level augmentation
❌ Adding more data
Explanation:
Dirty or inconsistent data significantly harms model performance, so data cleaning is the highest priority.
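A minimal cleaning sketch in plain Python (the rules and sample strings are hypothetical; real pipelines tailor them to the corpus) might normalize unicode, strip markup, collapse whitespace, and drop exact duplicates:

```python
import re
import unicodedata

def clean_text(text: str) -> str:
    """Illustrative cleaning steps; adjust the rules to the actual corpus."""
    text = unicodedata.normalize("NFKC", text)  # normalize unicode variants
    text = re.sub(r"<[^>]+>", " ", text)        # strip stray HTML tags
    text = re.sub(r"\s+", " ", text).strip()    # collapse repeated whitespace
    return text

raw_docs = ["  Great   product!!<br>", "Great product!!", "gr8 pr0duct\t\t"]
cleaned = list(dict.fromkeys(clean_text(d) for d in raw_docs))  # de-duplicate
print(cleaned)  # ['Great product!!', 'gr8 pr0duct']
```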
5. Which argument prevents the data loader from feeding samples in the same order every epoch?
✅ The shuffle argument
❌ The dataset
❌ The batch size
❌ The padding value
Explanation:
Setting `shuffle=True` randomizes the order of samples each epoch so the model does not learn unwanted ordering patterns.
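In a PyTorch `DataLoader` (assumed here; the toy tensors are purely illustrative), the flag looks like this:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# A toy dataset of 8 samples (features and labels are made up).
dataset = TensorDataset(torch.arange(8).float().unsqueeze(1), torch.arange(8))

# shuffle=True reshuffles the sample order at the start of every epoch.
loader = DataLoader(dataset, batch_size=4, shuffle=True)

for epoch in range(2):
    for features, labels in loader:
        print(epoch, labels.tolist())  # batch contents differ between epochs
```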
🧾 Summary Table
| Q# | Correct Answer | Key Concept |
|---|---|---|
| 1 | Character-based | Small vocab, high sequence length |
| 2 | Padding | Standardizing sequence length |
| 3 | ## | WordPiece subword continuation |
| 4 | Clean data | Remove noise & inconsistencies |
| 5 | shuffle=True | Randomize sample order |