
Deep Convolutional Models: Convolutional Neural Networks (Deep Learning Specialization) Answers, 2025

Question 1

Which of the following do you typically see in a ConvNet?

  • ❌ Multiple FC layers followed by a CONV layer.

  • ❌ Use of multiple POOL layers followed by a CONV layer.

  • ❌ ConvNet makes exclusive use of CONV layers.

  • ✅ Use of FC layers after flattening the volume to generate output classes.

Explanation: Typical ConvNet architectures use convolutional and pooling layers to learn spatial features, then flatten the volume and use fully connected (FC) layers at the end to produce class scores. Placing CONV layers after FC layers is rare, a POOL layer followed by a CONV layer does occur but is not the typical pattern, and ConvNets do include FC layers rather than being exclusively convolutional.
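
For illustration, here is a minimal Keras sketch of that pattern (the layer sizes and the 10-class output are arbitrary, chosen only to show the CONV/POOL, then Flatten, then FC ordering):

```python
# A minimal sketch (arbitrary layer sizes) of the typical pattern:
# CONV/POOL blocks, then Flatten, then FC layers producing class scores.
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(64, 64, 3)),
    layers.Conv2D(8, 3, padding="same", activation="relu"),    # CONV
    layers.MaxPooling2D(2),                                     # POOL
    layers.Conv2D(16, 3, padding="same", activation="relu"),   # CONV
    layers.MaxPooling2D(2),                                     # POOL
    layers.Flatten(),                                           # flatten the volume
    layers.Dense(32, activation="relu"),                        # FC
    layers.Dense(10, activation="softmax"),                     # FC -> class scores
])
model.summary()
```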


Question 2

In LeNet-5 we can see that, as we go deeper into the network, the number of channels increases while the height and width of the volume decrease. True/False?

  • ✅ True

  • ❌ False

Explanation: Standard designs (LeNet, AlexNet, VGG, etc.) stack conv/pool layers so that spatial dimensions shrink (via pooling/strides) while the channel (feature-map) count increases deeper in the network.
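
A quick way to see this (hypothetical filter counts on a LeNet-like 32×32 input) is to print the intermediate shapes:

```python
# Illustrative LeNet-like stack (hypothetical filter counts): the printed shapes
# show height/width shrinking while the channel count grows with depth.
import tensorflow as tf
from tensorflow.keras import layers

x = tf.zeros((1, 32, 32, 3))                              # (batch, H, W, channels)
for filters in (6, 16):
    x = layers.Conv2D(filters, 5, activation="relu")(x)   # 'valid' 5x5 conv shrinks H, W
    x = layers.MaxPooling2D(2)(x)                          # pooling halves H, W
    print(x.shape)
# (1, 14, 14, 6) then (1, 5, 5, 16): channels up, spatial dims down
```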


Question 3

Training a deeper network (adding layers) allows the network to fit more complex functions and thus almost always results in lower training error (for plain networks). True/False?

  • ✅ False

  • ❌ True

Explanation: Although deeper networks have higher representational capacity, plain deep networks often suffer optimization difficulties (vanishing/exploding gradients, degradation), so adding layers does not guarantee lower training error unless architectural fixes (skip connections) or optimization techniques are used.


Question 4

Which equation captures the computation in a ResNet block?

  • ❌ a[l+2] = g(W[l+2] g(W[l+1] a[l] + b[l+1]) + b[l+2] + a[l]) + a[l+1]

  • ❌ a[l+2] = g(W[l+2] g(W[l+1] a[l] + b[l+1]) + b[l+2])

  • ✅ a[l+2] = g( W[l+2] g( W[l+1] a[l] + b[l+1] ) + b[l+2] + a[l] )

  • ❌ a[l+2] = g( W[l+2] g( W[l+1] a[l] + b[l+1] ) + b[l+2] ) + a[l]

Explanation: A typical ResNet block computes intermediate activations and adds the input (identity) as a skip connection before the outer activation: a[l+2] = g(W[l+2] g(W[l+1] a[l] + b[l+1]) + b[l+2] + a[l]).
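
The same computation as a Keras sketch (convolutional layers standing in for the W and b terms; shapes are assumed to match so the shortcut can be added):

```python
# Minimal sketch of the residual computation above. Assumes the shortcut and the
# main path have the same shape so a[l] can be added before the final activation.
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(a_l, channels):
    z1 = layers.Conv2D(channels, 3, padding="same")(a_l)        # W[l+1] a[l] + b[l+1]
    a1 = layers.Activation("relu")(z1)                          # a[l+1] = g(z1)
    z2 = layers.Conv2D(channels, 3, padding="same")(a1)         # W[l+2] a[l+1] + b[l+2]
    return layers.Activation("relu")(layers.Add()([z2, a_l]))   # a[l+2] = g(z2 + a[l])

x = tf.zeros((1, 8, 8, 16))
print(residual_block(x, 16).shape)   # (1, 8, 8, 16)
```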


Question 5

In the best-case scenario, when adding a ResNet block, it will learn to approximate the identity function after a lot of training, helping improve overall performance. True/False?

  • ❌ False

  • ✅ True

Explanation: One intended benefit of residual blocks is that if additional layers aren’t helpful, the residual mapping can approach zero so the block approximates identity, avoiding degradation and not hurting performance.


Question 6

1×1 convolutions are the same as multiplying by a single number. True/False?

  • ❌ True

  • ✅ False

Explanation: A 1×1 convolution at each spatial location computes a linear combination across channels (a weight matrix applied to the channel vector), not multiplication by a single scalar.
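
A quick NumPy check (toy sizes, random weights) makes the point: the 1×1 filter bank is a C_in×C_out matrix applied to the channel vector at every pixel:

```python
# A 1x1 convolution is a per-pixel linear map across channels, not a scalar multiply.
import numpy as np

H, W, C_in, C_out = 4, 4, 3, 5
x = np.random.randn(H, W, C_in)
w = np.random.randn(C_in, C_out)         # the bank of C_out filters, each 1x1xC_in

out = np.einsum("ijc,co->ijo", x, w)     # at each (i, j): mix the channel vector with w
print(out.shape)                         # (4, 4, 5): channels mixed, spatial dims unchanged
```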


Question 7

Which are true about the Inception Network? (Check all that apply)

  • ✅ One problem with simply stacking up several layers is the computational cost.

  • ❌ Making an inception network deeper won’t hurt the training set performance.

  • ✅ Inception blocks allow the use of a combination of 1×1, 3×3, 5×5 convolutions and pooling by stacking up all the activations resulting from each type of layer.

  • ❌ Inception blocks allow the use of a combination of 1×1, 3×3, 5×5 convolutions, and pooling by applying one layer after the other.

Explanation: Inception modules concatenate the outputs of multiple filter sizes and pooling applied in parallel (not sequentially) to capture multi-scale features. Simply stacking layers increases computational cost; making the network deeper can still help, but claiming it "won't hurt" training-set performance is too strong.
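
A sketch of such a block in Keras (filter counts are arbitrary), showing the parallel branches and the channel-wise concatenation:

```python
# Inception-style block: parallel 1x1, 3x3, 5x5 convolutions plus pooling,
# concatenated along the channel axis. The real GoogLeNet module also adds 1x1
# "bottleneck" convs before the 3x3/5x5 branches to reduce computational cost.
import tensorflow as tf
from tensorflow.keras import layers

def inception_block(x):
    b1 = layers.Conv2D(16, 1, padding="same", activation="relu")(x)
    b2 = layers.Conv2D(16, 3, padding="same", activation="relu")(x)
    b3 = layers.Conv2D(16, 5, padding="same", activation="relu")(x)
    b4 = layers.MaxPooling2D(3, strides=1, padding="same")(x)
    return layers.Concatenate()([b1, b2, b3, b4])   # stack all activations channel-wise

x = tf.zeros((1, 28, 28, 8))
print(inception_block(x).shape)   # (1, 28, 28, 56) = 16 + 16 + 16 + 8 channels
```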


Question 8

Models trained for one CV task can't be used directly for another; in most cases we must change the softmax/last layers and re-train. True/False?

  • ❌ False

  • ✅ True

Explanation: In practice you reuse pretrained feature extractors and replace/adapt last layers (softmax or heads) for new tasks, then fine-tune. So you typically do modify last layers to transfer.
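
For example, a minimal Keras transfer-learning sketch (the new class count and the input size are placeholders):

```python
# Reuse a pretrained feature extractor and replace the softmax head for a new task.
import tensorflow as tf
from tensorflow.keras import layers

new_classes = 5                                          # hypothetical number of new classes
base = tf.keras.applications.MobileNetV2(
    input_shape=(160, 160, 3), include_top=False, weights="imagenet")
base.trainable = False                                   # freeze the pretrained features

model = tf.keras.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(new_classes, activation="softmax"),     # new task-specific head
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```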


Question 9

In a Depthwise Separable Convolution you: (check all that apply)

  • ✅ For the "Depthwise" computations each filter convolves with only one corresponding color channel of the input image.

  • ❌ You convolve the input image with a filter of nf x nf x nc where nc acts as the depth of the filter.

  • ✅ Perform two steps of convolution.

  • ❌ The final output is of dimension nout x nout x nc (where nc is number of input channels).

  • ❌ For the “Depthwise” computations each filter convolves with all of the color channels of the input image.

  • ❌ Perform one step of convolution.

  • ✅ You convolve the input image with nc number of nf x nf filters (one per input channel).

  • ✅ The final output is of dimension nout x nout x nc′ where nc′ is the number of filters used in the pointwise convolution step.

Explanation: Depthwise separable conv = (1) depthwise: apply one nf×nf filter per input channel (spatial per-channel), (2) pointwise: 1×1 conv to mix channels (gives nc′ output channels). It’s a two-step factorization that reduces params.
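
A small Keras sketch of the two steps (toy channel counts, no biases), with the parameter counts spelled out in the comments:

```python
# Depthwise separable convolution: a depthwise 3x3 step with one filter per input
# channel, then a pointwise 1x1 step that mixes channels from nc to nc'.
import tensorflow as tf
from tensorflow.keras import layers

nc, nc_prime = 3, 8
inputs = tf.keras.Input(shape=(32, 32, nc))
x = layers.DepthwiseConv2D(3, padding="same", use_bias=False)(inputs)  # step 1: nc filters of 3x3
x = layers.Conv2D(nc_prime, 1, use_bias=False)(x)                      # step 2: pointwise 1x1, nc -> nc'
model = tf.keras.Model(inputs, x)
model.summary()   # depthwise: 3*3*3 = 27 params; pointwise: 1*1*3*8 = 24 params
```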


Question 10

MobileNet v2 bottleneck block: the input is n×n×5, the expansion uses 30 filters, the depthwise step uses 3×3 filters, and the projection uses 20 filters. No biases are used. How many parameters does the complete block have?

  • ✅ 1020

  • ❌ 1101

  • ❌ 80

  • ❌ 8250

Explanation (calculation):

  • Expansion (1×1): 5 → 30 channels: 1×1×5×30 = 150 params.

  • Depthwise (3×3): one 3×3 filter for each of the 30 expanded channels: 3×3×30 = 270 params.

  • Projection (1×1): 30 → 20 channels: 1×1×30×20 = 600 params.
    Total = 150 + 270 + 600 = 1020.
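
The same arithmetic as a quick check in Python:

```python
# Parameter count for the bottleneck block above (no biases).
n_in, expand, k, project = 5, 30, 3, 20

expansion  = 1 * 1 * n_in * expand         # 1x1 conv, 5 -> 30 channels: 150
depthwise  = k * k * expand                # one 3x3 filter per expanded channel: 270
projection = 1 * 1 * expand * project      # 1x1 conv, 30 -> 20 channels: 600
print(expansion + depthwise + projection)  # 1020
```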


🧾 Summary Table

Q # | Correct Answer(s) | Key Concept
1 | ✅ Use FC layers after flattening | ConvNets typically end with FC layers for classification.
2 | ✅ True | Channels ↑ while spatial dims ↓ deeper in classic nets.
3 | ✅ False | Plain deeper nets may worsen optimization; depth ≠ guaranteed lower training error.
4 | ✅ a[l+2] = g(W[l+2] g(W[l+1] a[l] + b[l+1]) + b[l+2] + a[l]) | Residual / skip-connection form.
5 | ✅ True | Residual blocks can learn the identity mapping when helpful.
6 | ✅ False | 1×1 convs are per-pixel channel mixing, not scalar multiplication.
7 | ✅ Statements 1 & 3 | Inception uses parallel kernels and concatenation to get multi-scale features.
8 | ✅ True | Transfer learning typically replaces/retrains the last layer(s).
9 | ✅ Depthwise per-channel; two steps; nc filters; output nout×nout×nc′ | Depthwise separable = depthwise (per-channel) + pointwise (1×1).
10 | ✅ 1020 | Sum of expansion + depthwise + projection parameters.