Graded Quiz: CNN – Vision Transformer Integration: AI Capstone Project with Deep Learning (IBM AI Engineering Professional Certificate) – Answers 2025
1. What should Sarah’s team at InnovateAI do first when using Keras with vision transformers?
❌ Implement a CNN baseline
✅ Load a pre-trained vision transformer model and prepare the dataset for transfer learning
❌ Collect a huge dataset to train from scratch
❌ Design a custom ViT from scratch
Explanation:
The first step in transfer learning with ViTs is loading a pre-trained transformer model and preparing the dataset for fine-tuning.
2. What should Jamie emphasize about transfer learning with ViTs?
❌ Large datasets are mandatory
❌ Vision transformers rely exclusively on CNN layers
❌ Extensive hyperparameter tuning is required
✅ Pre-trained models can be adapted to new tasks with minimal data
Explanation:
A major advantage of transfer learning is adapting powerful pre-trained models using small, task-specific datasets.
3. What should Alex focus on when applying transfer learning with PyTorch vision transformers?
❌ Train from scratch
❌ Add random layers
❌ Use model without changes
✅ Fine-tune the pre-trained model on the new dataset
Explanation:
Fine-tuning allows the model to adjust pre-learned representations to the new domain.
4. What is the function of self.features inside the ConvNet class?
❌ Implements the transformer encoder
✅ Applies the pre-trained CNN architecture for feature extraction
❌ Loads the dataset
❌ Computes loss
Explanation:
`self.features` typically contains the convolutional layers responsible for generating feature maps.
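A hypothetical `ConvNet` illustrating the pattern (the layer sizes are illustrative, not the course's exact architecture):

```python
import torch
from torch import nn

class ConvNet(nn.Module):
    def __init__(self):
        super().__init__()
        # self.features: stacked conv layers acting as the feature extractor
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )

    def forward(self, x):
        return self.features(x)  # (B, 64, H/4, W/4) feature maps

fmap = ConvNet()(torch.rand(1, 3, 64, 64))
print(fmap.shape)
```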
5. What does the PatchEmbed proj layer do?
❌ Applies pooling
❌ Flattens the raw image
❌ Creates overlapping patches
✅ Projects CNN feature maps to embedding dimension using a 1×1 convolution
Explanation:
A 1×1 convolution transforms spatial feature maps into patch embeddings for the transformer.
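A sketch of such a `PatchEmbed` module (channel and embedding sizes are hypothetical): the 1×1 convolution changes only the channel dimension, so each spatial position of the feature map becomes one patch token.

```python
import torch
from torch import nn

class PatchEmbed(nn.Module):
    def __init__(self, in_channels=64, embed_dim=192):
        super().__init__()
        # proj: 1x1 conv mapping CNN channels to the embedding dimension
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=1)

    def forward(self, x):                      # x: (B, C, H, W) feature maps
        x = self.proj(x)                       # (B, embed_dim, H, W)
        return x.flatten(2).transpose(1, 2)    # (B, H*W patches, embed_dim)

tokens = PatchEmbed()(torch.rand(2, 64, 8, 8))
print(tokens.shape)  # 8*8 = 64 patch tokens
```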
6. What operation is typically used when creating positional encoding?
✅ Adding a learned or fixed vector to each patch embedding
❌ Concatenating zeros
❌ Randomly reshuffling patches
❌ Subtracting mean
Explanation:
Transformers require positional information, added as a learned/fixed vector to each patch.
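The learned variant is simply an element-wise addition of a trainable parameter, broadcast over the batch (sizes below are hypothetical):

```python
import torch
from torch import nn

num_patches, embed_dim = 64, 192
# A learned positional embedding: one vector per patch position.
pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))

patches = torch.rand(2, num_patches, embed_dim)
x = patches + pos_embed  # addition, broadcast across the batch dimension
print(x.shape)
```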
7. Why is a Classification Head included at the end of a ViT model?
❌ Resize the image
❌ Increase regularization
❌ Encode positions
✅ Project final transformer output to the number of target classes
Explanation:
The classification head maps transformer embeddings to class logits for supervised learning.
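A minimal sketch of such a head (the `LayerNorm + Linear` composition and the sizes are illustrative): it takes the transformer's [CLS] token embedding and projects it to one logit per class.

```python
import torch
from torch import nn

embed_dim, num_classes = 192, 10  # hypothetical sizes
head = nn.Sequential(nn.LayerNorm(embed_dim),
                     nn.Linear(embed_dim, num_classes))

encoded = torch.rand(2, 65, embed_dim)  # 64 patch tokens + 1 [CLS] token
logits = head(encoded[:, 0])            # classify from the [CLS] token
print(logits.shape)
```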
🧾 Summary Table
| Q# | Correct Answer | Key Concept |
|---|---|---|
| 1 | Pretrained ViT + dataset prep | Transfer learning start |
| 2 | Adapt pre-trained models | Value of transfer learning |
| 3 | Fine-tuning | Proper application of TL |
| 4 | CNN feature extractor | ConvNet internals |
| 5 | 1×1 projection to embeddings | Patch embedding mechanism |
| 6 | Add positional vectors | Position encoding |
| 7 | Classification projection | Final prediction layer |