Special Applications: Face Recognition & Neural Style Transfer: Convolutional Neural Networks (Deep Learning Specialization) Answers: 2025
Question 1
Which of the following do you agree with?
- ❌ Face recognition requires comparing pictures against one person’s face.
- ✅ Face recognition requires K comparisons of a person’s face.
- ❌ Face verification requires K comparisons of a person’s face.
Explanation:
Face recognition (identifying who the person is among $K$ known identities) generally requires comparing the probe face against the $K$ stored identities (or computing a similarity to each of $K$ templates). Face verification (is this person X?) requires just a single comparison against the claimed identity, not $K$.
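For concreteness, here is a minimal NumPy sketch of the comparison counts (the 128-dimensional random embeddings, the names, and the 0.7 threshold are illustrative assumptions, not part of the quiz):

```python
import numpy as np

def distance(e1, e2):
    # Squared Euclidean distance between two face embeddings.
    return np.sum((e1 - e2) ** 2)

# Hypothetical database of K enrolled identities -> embedding vectors.
database = {name: np.random.randn(128) for name in ["ana", "ben", "kian"]}
probe = np.random.randn(128)  # embedding of the new photo

# Verification ("is this ana?"): ONE comparison, regardless of K.
is_ana = distance(probe, database["ana"]) < 0.7

# Recognition ("who is this?"): K comparisons, one per enrolled identity.
best_match = min(database, key=lambda name: distance(probe, database[name]))
```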
Question 2
Workgroup membership detection — which do you agree with?
- ✅ It will be more efficient to learn a function $d(\text{img}_1,\text{img}_2)$ for this task.
- ❌ It is best to build a CNN with a softmax output with as many outputs as members of the group.
- ❌ This can’t be considered a one-shot learning task since there might be many members in the workgroup.
- ✅ This can be considered a one-shot learning task.
Explanation:
A learned similarity/distance function $d(\cdot,\cdot)$ (or an embedding plus nearest-neighbor comparison) is flexible: you can add or remove members without retraining a classifier head. This is exactly the one-shot / few-shot setup: you compare a probe to one (or a few) examples per person rather than training a fixed softmax over all members.
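A hedged sketch of that flexibility, assuming a hypothetical pretrained encoder `embed` (the names, file paths, and threshold are invented):

```python
import numpy as np

def embed(img_path):
    # Stand-in for a trained face-embedding network (hypothetical).
    return np.random.randn(128)

def distance(e1, e2):
    return np.sum((e1 - e2) ** 2)  # squared Euclidean distance

# One reference photo per member is all enrollment needs (the one-shot part):
database = {"ana": embed("ana.jpg"), "kian": embed("kian.jpg")}
database["dana"] = embed("dana.jpg")  # new member joins: no retraining
del database["kian"]                  # member leaves: no retraining

def is_member(probe_path, database, threshold=0.7):
    # The probe belongs to the group iff it is close to ANY enrolled embedding.
    p = embed(probe_path)
    return any(distance(p, e) < threshold for e in database.values())
```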
Question 3
To train with triplet loss, must you collect pictures only from current team members? True/False?
- ✅ False
- ❌ True
Explanation:
Triplet-loss training benefits from many distinct identities to teach the embedding to separate different faces. You do not need to restrict triplet training to only current workgroup members — using many other identities (external data) typically improves embedding quality and generalization.
Question 4
Triplet loss $\max(\|f(A)-f(P)\|^2 - \|f(A)-f(N)\|^2 + \alpha,\ 0)$ is larger in which case?
- ❌ When the encoding of A is closer to the encoding of P than to the encoding of N.
- ✅ When the encoding of A is closer to the encoding of N than to the encoding of P.
- ❌ When $A=P$ and $A=N$.
Explanation:
The loss grows when the positive is farther from the anchor than the negative, i.e., when $\|f(A)-f(P)\|^2$ is large relative to $\|f(A)-f(N)\|^2$. Intuitively, the loss is large when A is closer to N than to P, violating the desired margin. If $A=P$ and $A=N$ (identical points), both distances are 0 and the expression reduces to $\alpha$; the typical failure case, though, is a negative that sits too close to the anchor.
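A quick numeric check in NumPy (the 2-D toy embeddings and $\alpha = 0.2$ are invented for illustration):

```python
import numpy as np

def triplet_loss(a, p, n, alpha=0.2):
    # max(||f(A)-f(P)||^2 - ||f(A)-f(N)||^2 + alpha, 0)
    d_pos = np.sum((a - p) ** 2)
    d_neg = np.sum((a - n) ** 2)
    return max(d_pos - d_neg + alpha, 0.0)

a = np.array([0.0, 0.0])  # anchor embedding
print(triplet_loss(a, np.array([0.1, 0.0]), np.array([2.0, 0.0])))  # 0.0: A close to P, margin satisfied
print(triplet_loss(a, np.array([2.0, 0.0]), np.array([0.1, 0.0])))  # 4.19: A closer to N, large loss
```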
Question 5
Siamese network architecture — which do you agree with most?
- ✅ The upper and lower neural networks depicted have exactly the same parameters, but the outputs are computed independently for each image.
- ❌ This depicts two different neural networks with different architectures.
- ❌ The two networks have the same architecture, but they might have different parameters.
- ❌ The two images are combined in a single volume and pass through a single neural network.
Explanation:
A Siamese network uses shared weights — the same network (same parameters) applied to each input separately to produce embeddings; outputs are computed independently and then compared (e.g., distance).
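A minimal PyTorch sketch of the shared-weights idea (the tiny encoder architecture is an arbitrary stand-in; any CNN works):

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(              # ONE module, hence ONE set of parameters
    nn.Conv2d(3, 8, kernel_size=3, stride=2),
    nn.ReLU(),
    nn.Flatten(),
    nn.LazyLinear(128),               # 128-dim embedding
)

img1 = torch.randn(1, 3, 64, 64)
img2 = torch.randn(1, 3, 64, 64)

# The same module is applied to each image independently (shared parameters),
# and only afterwards are the two embeddings compared.
e1, e2 = encoder(img1), encoder(img2)
d = torch.sum((e1 - e2) ** 2)
```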
Question 6
You’re more likely to find a unit responding strongly to cats in layer 4 than layer 1. True/False?
- ✅ True
- ❌ False
Explanation:
Lower layers (layer 1) learn low-level features (edges, colors); deeper layers (layer 4) capture higher-level, semantically meaningful features (parts or whole-object detectors), so you’re more likely to find a neuron selective for cats in deeper layers.
Question 7
Neural style transfer: which loss terms are used? (choose all that apply)
- ✅ $J_{\text{style}}$ that compares $S$ and $G$.
- ❌ $J_{\text{corr}}$ that compares $C$ and $S$.
- ❌ $T$ that calculates a triplet loss between $S$, $G$, and $C$.
- ✅ $J_{\text{content}}$ that compares $C$ and $G$.
Explanation:
Neural style transfer optimizes a generated image $G$ to minimize a content loss $J_{\text{content}}(C,G)$ (match high-level content features of $C$) and a style loss $J_{\text{style}}(S,G)$ (match Gram-matrix/style statistics of $S$). Triplet or correlation terms are not part of standard NST.
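A sketch of both losses computed from one layer's activations (the feature shapes and the 1.0/100.0 weights are assumptions; real NST typically sums the style loss over several layers):

```python
import torch

def gram(a):
    # a: (channels, height, width) activations from one conv layer.
    c, h, w = a.shape
    f = a.reshape(c, h * w)
    return f @ f.T / (h * w)          # (c, c) Gram matrix of style statistics

def content_loss(a_C, a_G):
    return torch.mean((a_C - a_G) ** 2)               # compares C and G

def style_loss(a_S, a_G):
    return torch.mean((gram(a_S) - gram(a_G)) ** 2)   # compares S and G

# Stand-in activations from one layer of a frozen, pretrained CNN:
a_C, a_S, a_G = (torch.randn(64, 32, 32) for _ in range(3))
J = 1.0 * content_loss(a_C, a_G) + 100.0 * style_loss(a_S, a_G)
```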
Question 8
Content loss $J_{\text{cont}}(G,C) = \|a^{[l]}(C) - a^{[l]}(G)\|^2$. We choose $l$ to be a very high value to use the more abstract activation. True/False?
- ✅ True
- ❌ False
Explanation:
For content similarity you pick a deeper layer $l$ (higher-level activations) because those capture abstract content and structure rather than low-level texture; hence $l$ is chosen among the deeper layers.
Question 9
In neural style transfer, what is updated each iteration?
- ❌ The pixel values of the content image $C$
- ❌ The neural network parameters
- ✅ The pixel values of the generated image $G$
- ❌ The regularization parameters
Explanation:
In NST the pretrained network's parameters are fixed; optimization is performed over the image pixels of $G$ to minimize the combined content + style loss.
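A minimal sketch of that optimization loop in PyTorch; `total_loss` is a trivial stand-in for the content + style objective above, just to show which tensor receives gradients:

```python
import torch

G = torch.rand(1, 3, 256, 256, requires_grad=True)  # only G gets gradients
optimizer = torch.optim.Adam([G], lr=0.02)          # optimize pixels, not weights

def total_loss(G):
    # Stand-in for alpha*J_content(C,G) + beta*J_style(S,G), which would be
    # computed from a frozen, pretrained CNN's activations.
    return torch.mean(G ** 2)

for step in range(100):
    optimizer.zero_grad()
    loss = total_loss(G)
    loss.backward()   # gradients flow to the pixel values of G
    optimizer.step()  # updates the image, never the network weights
```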
Question 10
3D conv: input $32\times32\times32\times3$, 16 filters of size $4\times4\times4$, zero padding and stride 1. What is the output size?
- ❌ $31\times31\times31\times16$
- ✅ $29\times29\times29\times16$
- ❌ $29\times29\times29\times13$
- ❌ $29\times29\times29\times3$
Explanation:
Here "zero padding" means padding $= 0$ (no padding), so each spatial output dimension is $(32-4)/1+1 = 29$ along each of the three spatial axes. The channel dimension equals the number of filters, 16, so the output is $29\times29\times29\times16$.
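A quick shape check with PyTorch (`Conv3d` uses a channels-first layout, so the $32\times32\times32\times3$ volume becomes a `(1, 3, 32, 32, 32)` tensor):

```python
import torch
import torch.nn as nn

conv = nn.Conv3d(in_channels=3, out_channels=16,
                 kernel_size=4, stride=1, padding=0)
x = torch.randn(1, 3, 32, 32, 32)  # batch, channels, depth, height, width
print(conv(x).shape)               # torch.Size([1, 16, 29, 29, 29])
```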
🧾 Summary Table
| Q # | Correct Answer(s) | Key concept |
|---|---|---|
| 1 | ✅ Face recognition requires K comparisons | Recognition = identify among K candidates; verification = single comparison. |
| 2 | ✅ Learn $d(\cdot,\cdot)$; ✅ One-shot task | Embedding + similarity supports flexible group membership; one-shot setup. |
| 3 | ✅ False | Triplet training benefits from many identities; not restricted to current members. |
| 4 | ✅ When A is closer to N than to P | Loss large when negative is too close (violates margin). |
| 5 | ✅ Shared-parameter networks applied independently | Siamese uses same parameters for both branches. |
| 6 | ✅ True | Deeper layers encode higher-level object concepts. |
| 7 | ✅ $J_{\text{style}}(S,G)$; ✅ $J_{\text{content}}(C,G)$ | Style transfer uses content + style losses. |
| 8 | ✅ True | Content loss uses deeper-layer activations for abstract content. |
| 9 | ✅ Pixel values of generated image $G$ | NST optimizes pixels of $G$; network weights fixed. |
| 10 | ✅ $29\times29\times29\times16$ | 3D conv output spatial size $= (32-4)/1+1 = 29$; depth = # filters. |