Graded Quiz: Fundamental Concepts of Transformer Architecture (Generative AI Language Modeling with Transformers, IBM AI Engineering Professional Certificate) Answers 2025
1. What does self-attention primarily allow a model to do?
❌ Eliminate unimportant words
❌ Generate paraphrases
❌ Identify parts of speech
✅ Represent each word using its surrounding context
Explanation:
Self-attention computes relationships between all tokens, enabling each word to encode meaning based on its context.
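As a minimal sketch of this idea (toy numbers, single head, and the simplifying assumption Q = K = V = X, i.e. no learned projections), each output row below is a softmax-weighted mix of all input rows, so every token's vector absorbs its surrounding context:

```python
import numpy as np

def self_attention(X):
    """Minimal single-head self-attention (toy: Q = K = V = X).

    Each output row is a weighted mix of ALL input rows, so every
    token's representation encodes its surrounding context.
    """
    scores = X @ X.T / np.sqrt(X.shape[-1])           # pairwise similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over tokens
    return weights @ X                                # context-mixed vectors

# Three toy token embeddings (rows); each output row depends on all inputs.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
out = self_attention(X)
print(out.shape)  # (3, 2): same shape, but every row now mixes context
```

A real transformer layer applies learned projection matrices to form Q, K, and V before this step; the mixing mechanism itself is the same.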
2. What parameter influences positional encoding values across embedding dimensions?
❌ Counts input tokens
❌ Tracks where each word appears
❌ Indicates phase offset
✅ Determines the frequency of sine and cosine waves
Explanation:
Sinusoidal positional encoding varies the sine/cosine frequency across embedding dimensions; the denominator term 10000^(2i/d_model) in the formula controls that frequency.
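A short sketch of the standard sinusoidal scheme (sizes here are arbitrary): the denominator 10000^(2i/d_model) gives each dimension pair its own wave frequency, so low dimensions oscillate quickly with position and high dimensions slowly.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding.

    The denominator 10000^(2i/d_model) sets the wave frequency for each
    pair of embedding dimensions: even indices get sine, odd get cosine.
    """
    pos = np.arange(seq_len)[:, None]              # token positions
    i = np.arange(0, d_model, 2)[None, :]          # dimension-pair indices
    angles = pos / np.power(10000.0, i / d_model)  # frequency-scaled angles
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(50, 16)
print(pe.shape)  # (50, 16)
```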
3. How does attention ensure the French word “chat” maps to “cat” in translation?
❌ Randomly pick value vectors
❌ Multiply values and keys only
❌ Replace query with key vector
✅ Match the query vector with the transposed key matrix and retrieve the corresponding value
Explanation:
Attention(Q, K, V) = softmax(QKᵀ/√d_k)·V → the scaled Q·Kᵀ similarity scores determine how strongly each value vector (here, the stored translation) is weighted in the output.
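The lookup can be illustrated with invented toy vectors (all numbers below are made up for the example): a query for “chat” aligns with the key for “cat”, so the softmax weights retrieve the value stored under that key.

```python
import numpy as np

# Toy key/value store. The vectors are invented for illustration only.
keys   = np.array([[1.0, 0.0],    # key for "cat"
                   [0.0, 1.0]])   # key for "dog"
values = np.array([[10.0, 0.0],   # value encoding the translation "cat"
                   [0.0, 10.0]])  # value encoding the translation "dog"

query = np.array([[0.95, 0.05]])  # query for "chat": points at "cat"'s key

scores  = query @ keys.T / np.sqrt(keys.shape[-1])        # scaled QK^T
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
output  = weights @ values        # weighted mix, dominated by "cat"'s value
print(weights)                    # first weight dominates -> "cat" retrieved
```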
4. Role of multi-head attention in summarization?
✅ Apply multiple scaled dot-product attention operations in parallel on different representation subspaces
❌ Mask future tokens everywhere
❌ Apply a single attention mechanism
❌ Multiply vectors without scaling
Explanation:
Multi-head attention lets the model focus on multiple aspects of meaning simultaneously.
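A compact sketch of the mechanism (random weights, toy sizes; a real layer would use separate learned projections per head): the model dimension is split into subspaces, scaled dot-product attention runs independently in each, and the head outputs are concatenated and projected.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """Sketch: project to Q/K/V, split into heads, attend per head
    (each head sees its own representation subspace), then merge."""
    seq, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    outs = []
    for h in range(n_heads):
        s = slice(h * d_head, (h + 1) * d_head)          # this head's subspace
        scores = Q[:, s] @ K[:, s].T / np.sqrt(d_head)   # scaled dot product
        outs.append(softmax(scores) @ V[:, s])
    return np.concatenate(outs, axis=-1) @ Wo            # concat + project

rng = np.random.default_rng(0)
d_model, n_heads = 8, 2
X = rng.normal(size=(4, d_model))
Ws = [rng.normal(size=(d_model, d_model)) for _ in range(4)]
out = multi_head_attention(X, *Ws, n_heads)
print(out.shape)  # (4, 8)
```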
5. What is the correct second step after creating embeddings?
❌ Normalize sentence length
❌ Create token index mapping
❌ Extract word features
✅ Apply positional encoding to embeddings
Explanation:
Transformers have no inherent notion of order, so positional encoding must be added immediately after embeddings.
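The two-step input pipeline can be sketched as follows (vocabulary, sizes, and the random embedding table are all made up for the example): tokens are first looked up in an embedding table, then the positional encoding is added element-wise.

```python
import numpy as np

# Step 1: map tokens to embeddings (toy vocabulary and random table).
vocab = {"the": 0, "cat": 1, "sat": 2}
d_model, seq = 4, 3
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))

token_ids = [vocab[w] for w in ["the", "cat", "sat"]]
embeddings = embedding_table[token_ids]          # (seq, d_model)

# Step 2: add sinusoidal positional encoding so order information survives.
pos = np.arange(seq)[:, None]
i = np.arange(0, d_model, 2)[None, :]
angles = pos / np.power(10000.0, i / d_model)
pe = np.zeros((seq, d_model))
pe[:, 0::2], pe[:, 1::2] = np.sin(angles), np.cos(angles)

x = embeddings + pe                              # input to the first layer
print(x.shape)  # (3, 4)
```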
🧾 Summary Table
| Q# | Correct Answer | Key Concept |
|---|---|---|
| 1 | Represent context | Purpose of self-attention |
| 2 | Frequency of sin/cos | Positional encoding math |
| 3 | QKᵀ → value | How attention selects outputs |
| 4 | Parallel attention heads | Multi-head attention role |
| 5 | Positional encoding | Transformer pipeline step |