
Graded Quiz: CNN – Vision Transformer Integration: AI Capstone Project with Deep Learning (IBM AI Engineering Professional Certificate) Answers 2025

1. What should Sarah’s team at InnovateAI do first when using Keras with vision transformers?

❌ Implement a CNN baseline
✅ Load a pre-trained vision transformer model and prepare the dataset for transfer learning
❌ Collect a huge dataset to train from scratch
❌ Design a custom ViT from scratch

Explanation:
The first step in transfer learning with ViTs is loading a pre-trained transformer model and preparing the dataset for fine-tuning.


2. What should Jamie emphasize about transfer learning with ViTs?

❌ Large datasets are mandatory
❌ Vision transformers rely exclusively on CNN layers
❌ Extensive hyperparameter tuning is required
✅ Pre-trained models can be adapted to new tasks with minimal data

Explanation:
A major advantage of transfer learning is adapting powerful pre-trained models using small, task-specific datasets.


3. What should Alex focus on when applying transfer learning with PyTorch vision transformers?

❌ Train from scratch
❌ Add random layers
❌ Use model without changes
✅ Fine-tune the pre-trained model on the new dataset

Explanation:
Fine-tuning allows the model to adjust pre-learned representations to the new domain.
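A minimal PyTorch sketch of this fine-tuning pattern: freeze the pre-trained layers and train only a new task-specific head. The backbone below is a tiny stand-in (in practice you would load real pre-trained weights, e.g. a torchvision ViT); the 5-class head is an assumption for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a pre-trained backbone; in the capstone this
# would be a real pre-trained model loaded with its weights.
backbone = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)

# Freeze the pre-trained layers so only the new head is updated.
for param in backbone.parameters():
    param.requires_grad = False

# Attach a fresh classification head sized for the new task (assumed 5 classes).
model = nn.Sequential(backbone, nn.Linear(8, 5))

# Only the head's parameters will receive gradient updates during training.
trainable = [p for p in model.parameters() if p.requires_grad]
logits = model(torch.randn(2, 3, 32, 32))
print(logits.shape)    # torch.Size([2, 5])
print(len(trainable))  # 2 (the head's weight and bias)
```

Freezing keeps the pre-learned representations intact while the small head adapts them to the new domain; optionally, the last backbone layers can later be unfrozen for full fine-tuning at a low learning rate.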


4. What is the function of self.features inside the ConvNet class?

❌ Implements the transformer encoder
✅ Applies the pre-trained CNN architecture for feature extraction
❌ Loads the dataset
❌ Computes loss

Explanation:
self.features typically contains convolutional layers responsible for generating feature maps.
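A sketch of what such a class can look like. The layer sizes here are illustrative, not the capstone's exact architecture; the point is that `self.features` holds the convolutional stack that turns an image into feature maps, separate from the classifier.

```python
import torch
import torch.nn as nn

class ConvNet(nn.Module):
    """Illustrative hybrid model: self.features extracts CNN feature maps."""
    def __init__(self, num_classes=10):
        super().__init__()
        # self.features: convolutional layers producing feature maps
        # (in practice this would wrap a pre-trained CNN backbone).
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, x):
        fmap = self.features(x)         # (B, 32, H/4, W/4) feature maps
        pooled = fmap.mean(dim=(2, 3))  # global average pool
        return self.classifier(pooled)

fmap = ConvNet().features(torch.randn(1, 3, 64, 64))
print(fmap.shape)  # torch.Size([1, 32, 16, 16])
```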


5. What does the PatchEmbed proj layer do?

❌ Applies pooling
❌ Flattens the raw image
❌ Creates overlapping patches
✅ Projects CNN feature maps to embedding dimension using a 1×1 convolution

Explanation:
A 1×1 convolution transforms spatial feature maps into patch embeddings for the transformer.
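The mechanism can be sketched in a few lines of PyTorch (channel counts are assumed for illustration): a 1×1 convolution changes only the channel dimension, so each spatial position of the CNN feature map becomes one patch token of the target embedding size.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Projects CNN feature maps to the transformer embedding dimension."""
    def __init__(self, in_channels=32, embed_dim=64):
        super().__init__()
        # 1x1 convolution: each spatial location becomes one patch embedding.
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=1)

    def forward(self, x):                    # x: (B, C, H, W)
        x = self.proj(x)                     # (B, embed_dim, H, W)
        return x.flatten(2).transpose(1, 2)  # (B, H*W patches, embed_dim)

tokens = PatchEmbed()(torch.randn(2, 32, 8, 8))
print(tokens.shape)  # torch.Size([2, 64, 64]) -> 64 patches of dim 64
```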


6. What operation is typically used when creating positional encoding?

✅ Adding a learned or fixed vector to each patch embedding
❌ Concatenating zeros
❌ Randomly reshuffling patches
❌ Subtracting mean

Explanation:
Transformers require positional information, added as a learned/fixed vector to each patch.
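In the learned variant, this is just an element-wise addition of a trainable tensor, broadcast over the batch. A minimal sketch (patch count and embedding size are assumed values):

```python
import torch
import torch.nn as nn

num_patches, embed_dim = 64, 32
patch_embeddings = torch.randn(2, num_patches, embed_dim)

# Learned positional encoding: one trainable vector per patch position,
# broadcast across the batch and added element-wise.
pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
tokens = patch_embeddings + pos_embed

print(tokens.shape)  # torch.Size([2, 64, 32])
```

A fixed (sinusoidal) encoding works the same way; only how `pos_embed` is produced differs, not the addition.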


7. Why is a Classification Head included at the end of a ViT model?

❌ Resize the image
❌ Increase regularization
❌ Encode positions
✅ Project final transformer output to the number of target classes

Explanation:
The classification head maps transformer embeddings to class logits for supervised learning.
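Concretely, the head is usually a single linear layer applied to a summary token of the encoder output, most commonly the [CLS] token. A sketch with assumed sizes:

```python
import torch
import torch.nn as nn

embed_dim, num_classes = 32, 10
# Final transformer output: 64 patch tokens plus 1 [CLS] token per image.
encoder_output = torch.randn(2, 65, embed_dim)

# Classification head: linear projection from embed_dim to class logits.
head = nn.Linear(embed_dim, num_classes)
logits = head(encoder_output[:, 0])  # use the [CLS] token (index 0)

print(logits.shape)  # torch.Size([2, 10])
```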


🧾 Summary Table

| Q# | Correct Answer | Key Concept |
|----|----------------|-------------|
| 1 | Pre-trained ViT + dataset prep | Transfer learning start |
| 2 | Adapt pre-trained models | Value of transfer learning |
| 3 | Fine-tuning | Proper application of TL |
| 4 | CNN feature extractor | ConvNet internals |
| 5 | 1×1 projection to embeddings | Patch embedding mechanism |
| 6 | Add positional vectors | Position encoding |
| 7 | Classification projection | Final prediction layer |