Graded Quiz: Transformers in Keras - Deep Learning with Keras and TensorFlow (IBM AI Engineering Professional Certificate) Answers 2025
1. Question 1
Primary purpose of multi-head self-attention:
- ✅ To process different parts of the input sequence in parallel
- ❌ Sequential processing
- ❌ Reduce training time
- ❌ Ensure equal output size
Explanation:
Multi-head attention lets the model learn different relationships in the input in parallel, with each head attending to a different aspect of the sequence.
2. Question 2
Purpose of feedforward layers in Transformers:
- ❌ Focus on sequence parts
- ❌ Weigh importance of words
- ✅ Transform the data after self-attention (non-linear transformation)
- ❌ Compute attention weights
Explanation:
Feedforward networks refine and project attention outputs.
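A minimal NumPy sketch of the position-wise feedforward network (sizes and random weights are assumptions for illustration): each token vector is independently pushed through a ReLU expansion and a projection back to the model dimension:

```python
import numpy as np

# Illustrative sizes (assumptions, not from the quiz)
d_model, d_ff = 4, 8
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)), np.zeros(d_model)

def feed_forward(x):
    """Position-wise FFN: a non-linear transform applied to every token."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2  # ReLU, then project back

tokens = rng.standard_normal((3, d_model))  # 3 token vectors
out = feed_forward(tokens)
print(out.shape)  # (3, 4)
```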
3. Question 3
How Transformers handle temporal dependencies:
- ❌ Convolutions
- ❌ Recurrent connections
- ✅ Positional encodings maintain order information
- ❌ Zero-mean normalization
Explanation:
Transformers have no recurrence, so order information is injected by adding positional encodings to the token embeddings.
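A sketch of the sinusoidal positional encoding from the original Transformer paper, in plain NumPy (even and odd dimensions get sine and cosine at different frequencies):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal encoding as in 'Attention Is All You Need'."""
    pos = np.arange(seq_len)[:, None]        # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]     # (1, d_model / 2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dims: sine
    pe[:, 1::2] = np.cos(angles)             # odd dims: cosine
    return pe

pe = positional_encoding(seq_len=10, d_model=8)
print(pe.shape)  # (10, 8)
```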
4. Question 4
Loss function in:

```python
model.compile(optimizer='adam', loss='mse')
```

- ❌ MinMaxScaler
- ❌ MultiHeadAttention
- ❌ Adam
- ✅ Mean squared error
Explanation:
`mse` is the Keras string alias for Mean Squared Error; `adam` is the optimizer, not the loss.
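What the `mse` loss computes can be reproduced by hand in a couple of lines (toy values chosen for illustration):

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 2.0, 2.0])

# Mean Squared Error: the average of the squared differences
mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # 0.4166...
```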
5. Question 5
Role of softmax in attention:
- ❌ Dimensionality reduction
- ✅ Normalize attention scores to probabilities
- ❌ Provide non-linearity
- ❌ Compute dot products
Explanation:
Softmax turns raw attention scores into a probability distribution.
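A small NumPy sketch of that normalization (the max-subtraction is a standard trick for numerical stability, not something specific to the quiz):

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over the last axis."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

raw_scores = np.array([2.0, 1.0, 0.1])
weights = softmax(raw_scores)
print(weights.sum())  # 1.0, a valid probability distribution
```

Higher raw scores receive proportionally more weight, but every weight stays in (0, 1) and the row sums to 1.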
6. Question 6
Mechanism used by transformers to convert speech to text:
- ❌ Layers
- ✅ Spectrograms
- ❌ Patches
- ❌ Images
Explanation:
Audio is converted into spectrograms before being fed to transformer models.
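A toy NumPy sketch of how a magnitude spectrogram is built from a waveform (frame length, hop size, and the test tone are illustrative assumptions; real speech pipelines typically also apply a mel filterbank):

```python
import numpy as np

def magnitude_spectrogram(signal, frame_len=256, hop=128):
    """Frame a waveform, window each frame, and take FFT magnitudes."""
    window = np.hanning(frame_len)
    frames = [signal[i:i + frame_len] * window
              for i in range(0, len(signal) - frame_len + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=-1))

# 1 second of a 440 Hz tone sampled at 8 kHz (toy input)
t = np.arange(8000) / 8000.0
spec = magnitude_spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (frames, frequency_bins)
```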
7. Question 7
Which method applies self-attention and combines heads?
- ❌ TransformerBlock class
- ❌ split_heads
- ❌ MultiHeadSelfAttention class
- ✅ call method
Explanation:
The call() method executes the forward pass, including computing and combining heads.
8. Question 8
Converts text to numerical format:
- ❌ lstm.model
- ❌ Sequential
- ✅ TextVectorization
- ❌ Vectorizer
Explanation:
TextVectorization converts raw text → integer sequences.
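Conceptually, the layer maps tokens to integer vocabulary indices. A toy pure-Python illustration of that idea (this is NOT the Keras implementation; the corpus and the `[OOV]` convention are assumptions for the example):

```python
# Toy illustration of text -> integer-sequence vectorization.
corpus = ["the cat sat", "the dog sat"]

# Build a vocabulary; index 0 is reserved for out-of-vocabulary tokens.
vocab = {"[OOV]": 0}
for sentence in corpus:
    for token in sentence.split():
        vocab.setdefault(token, len(vocab))

def vectorize(text):
    """Map each whitespace token to its vocabulary index (0 if unknown)."""
    return [vocab.get(token, 0) for token in text.split()]

print(vectorize("the cat ran"))  # unknown 'ran' maps to 0
```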
9. Question 9
RNN model using SimpleRNN + Dense:
-
✅
model = Sequential([
SimpleRNN(50, activation='relu', input_shape=(time_window, 1)),
Dense(1)
])
-
❌ Using lstm
-
❌ RNN without Dense
-
❌ RNN with Dense but wrong layer type
Explanation:
SimpleRNN is the basic RNN cell.
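The recurrence a SimpleRNN cell computes can be sketched in a few lines of NumPy (sizes and random weights are illustrative; Keras learns W, U, and b during training):

```python
import numpy as np

units, features = 5, 1
rng = np.random.default_rng(1)
W = rng.standard_normal((features, units)) * 0.1  # input weights
U = rng.standard_normal((units, units)) * 0.1     # recurrent weights
b = np.zeros(units)

def simple_rnn(sequence):
    """h_t = relu(x_t W + h_{t-1} U + b), one step per time step."""
    h = np.zeros(units)
    for x_t in sequence:
        h = np.maximum(0.0, x_t @ W + h @ U + b)
    return h  # final hidden state, fed to the Dense(1) head

h_final = simple_rnn(rng.standard_normal((10, features)))
print(h_final.shape)  # (5,)
```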
10. Question 10
Purpose of:

```python
def attention(self, query, key, value):
```

- ✅ Compute attention scores + weighted sum of values
- ❌ Apply self-attention and combine heads
- ❌ Define multi-head mechanism
- ❌ Split into multiple heads
Explanation:
Attention(Q, K, V) = softmax(QKᵀ / √d_k)·V: query-key dot products are scaled, normalized with softmax, and used as weights for a sum of the values.
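That formula fits in a few lines of NumPy (a standalone sketch with toy shapes, not the quiz's class method):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(query, key, value):
    """Scaled dot-product attention: softmax(QKᵀ/√d_k)·V."""
    d_k = query.shape[-1]
    scores = query @ key.T / np.sqrt(d_k)  # attention scores
    weights = softmax(scores)              # normalized per query
    return weights @ value                 # weighted sum of values

rng = np.random.default_rng(2)
Q, K, V = (rng.standard_normal((3, 4)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # (3, 4)
```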
🧾 Summary Table
| Q# | Correct Answer |
|---|---|
| 1 | Parallel processing via multi-head attention |
| 2 | Transform data after self-attention |
| 3 | Positional encoding |
| 4 | mean squared error |
| 5 | Normalize scores to probabilities |
| 6 | Spectrograms |
| 7 | call method |
| 8 | TextVectorization |
| 9 | SimpleRNN + Dense model |
| 10 | Compute attention scores & weighted sum |