Graded Quiz: CNN – Vision Transformer Integration: AI Capstone Project with Deep Learning (IBM AI Engineering Professional Certificate) – Answers 2025
1. What should Sarah’s team at InnovateAI do first when using Keras with vision transformers?
❌ Implement a CNN baseline
✅ Load a pre-trained vision transformer model and prepare the dataset for transfer learning
❌ Collect a huge dataset to train from scratch
❌ Design a custom ViT from scratch
Explanation:
The first step in transfer learning with ViTs is loading a pre-trained transformer model and preparing the dataset for fine-tuning.
2. What should Jamie emphasize about transfer learning with ViTs?
❌ Large datasets are mandatory
❌ Vision transformers rely exclusively on CNN layers
❌ Extensive hyperparameter tuning is required
✅ Pre-trained models can be adapted to new tasks with minimal data
Explanation:
A major advantage of transfer learning is adapting powerful pre-trained models using small, task-specific datasets.
3. What should Alex focus on when applying transfer learning with PyTorch vision transformers?
❌ Train from scratch
❌ Add random layers
❌ Use model without changes
✅ Fine-tune the pre-trained model on the new dataset
Explanation:
Fine-tuning allows the model to adjust pre-learned representations to the new domain.
4. What is the function of self.features inside the ConvNet class?
❌ Implements the transformer encoder
✅ Applies the pre-trained CNN architecture for feature extraction
❌ Loads the dataset
❌ Computes loss
Explanation:
`self.features` typically contains the convolutional layers responsible for generating feature maps.
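A hypothetical `ConvNet` illustrating the pattern (the layer sizes are illustrative, not the course's exact architecture):

```python
import torch
from torch import nn

class ConvNet(nn.Module):
    def __init__(self):
        super().__init__()
        # self.features: stacked conv layers acting as the feature extractor
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )

    def forward(self, x):
        return self.features(x)  # (B, 64, H/4, W/4) feature maps

fmap = ConvNet()(torch.rand(1, 3, 64, 64))
print(fmap.shape)
```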
5. What does the PatchEmbed proj layer do?
❌ Applies pooling
❌ Flattens the raw image
❌ Creates overlapping patches
✅ Projects CNN feature maps to embedding dimension using a 1×1 convolution
Explanation:
A 1×1 convolution transforms spatial feature maps into patch embeddings for the transformer.
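A sketch of such a `PatchEmbed` module (channel and embedding sizes are hypothetical): the 1×1 convolution changes only the channel dimension, so each spatial position of the feature map becomes one patch token.

```python
import torch
from torch import nn

class PatchEmbed(nn.Module):
    def __init__(self, in_channels=64, embed_dim=192):
        super().__init__()
        # proj: 1x1 conv mapping CNN channels to the embedding dimension
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=1)

    def forward(self, x):                      # x: (B, C, H, W) feature maps
        x = self.proj(x)                       # (B, embed_dim, H, W)
        return x.flatten(2).transpose(1, 2)    # (B, H*W patches, embed_dim)

tokens = PatchEmbed()(torch.rand(2, 64, 8, 8))
print(tokens.shape)  # 8*8 = 64 patch tokens
```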
6. What operation is typically used when creating positional encoding?
✅ Adding a learned or fixed vector to each patch embedding
❌ Concatenating zeros
❌ Randomly reshuffling patches
❌ Subtracting mean
Explanation:
Transformers require positional information, added as a learned/fixed vector to each patch.
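The learned variant is simply an element-wise addition of a trainable parameter, broadcast over the batch (sizes below are hypothetical):

```python
import torch
from torch import nn

num_patches, embed_dim = 64, 192
# A learned positional embedding: one vector per patch position.
pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))

patches = torch.rand(2, num_patches, embed_dim)
x = patches + pos_embed  # addition, broadcast across the batch dimension
print(x.shape)
```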
7. Why is a Classification Head included at the end of a ViT model?
❌ Resize the image
❌ Increase regularization
❌ Encode positions
✅ Project final transformer output to the number of target classes
Explanation:
The classification head maps transformer embeddings to class logits for supervised learning.
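A minimal sketch of such a head (the `LayerNorm + Linear` composition and the sizes are illustrative): it takes the transformer's [CLS] token embedding and projects it to one logit per class.

```python
import torch
from torch import nn

embed_dim, num_classes = 192, 10  # hypothetical sizes
head = nn.Sequential(nn.LayerNorm(embed_dim),
                     nn.Linear(embed_dim, num_classes))

encoded = torch.rand(2, 65, embed_dim)  # 64 patch tokens + 1 [CLS] token
logits = head(encoded[:, 0])            # classify from the [CLS] token
print(logits.shape)
```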
🧾 Summary Table
| Q# | Correct Answer | Key Concept |
|---|---|---|
| 1 | Pretrained ViT + dataset prep | Transfer learning start |
| 2 | Adapt pre-trained models | Value of transfer learning |
| 3 | Fine-tuning | Proper application of TL |
| 4 | CNN feature extractor | ConvNet internals |
| 5 | 1×1 projection to embeddings | Patch embedding mechanism |
| 6 | Add positional vectors | Position encoding |
| 7 | Classification projection | Final prediction layer |