Special Applications: Face Recognition & Neural Style Transfer (Convolutional Neural Networks, Deep Learning Specialization) Answers (2025)

Question 1

Which of the following do you agree with?

  • ❌ Face recognition requires comparing pictures against one person’s face.

  • ✅ Face recognition requires K comparisons of a person’s face.

  • ❌ Face verification requires K comparisons of a person’s face.

Explanation:
Face recognition (identifying who the person is among $K$ known identities) generally requires comparing the probe face against all $K$ stored identities, i.e., computing a similarity to each of the $K$ templates. Face verification (is this person X?) requires just a single comparison to the claimed identity, not $K$.
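
To make the distinction concrete, here is a minimal sketch (the function names and the threshold `tau` are illustrative, and the face embeddings are assumed precomputed): verification is a single distance check, recognition is a search over all $K$ templates.

```python
import numpy as np

def verify(f_probe, f_claimed, tau=0.7):
    """Verification: is this the claimed person? A single comparison."""
    return np.linalg.norm(f_probe - f_claimed) < tau

def recognize(f_probe, database, tau=0.7):
    """Recognition: who is this? Compare against all K stored templates."""
    name, dist = min(
        ((n, np.linalg.norm(f_probe - e)) for n, e in database.items()),
        key=lambda item: item[1],
    )
    return name if dist < tau else None  # None: not among the K identities
```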


Question 2

Workgroup membership detection — which do you agree with?

  • ✅ It will be more efficient to learn a function $d(\text{img}_1,\text{img}_2)$ for this task.

  • ❌ It is best to build a CNN with a softmax output with as many outputs as members of the group.

  • ❌ This can’t be considered a one-shot learning task since there might be many members in the workgroup.

  • ✅ This can be considered a one-shot learning task.

Explanation:
A learned similarity/distance function $d(\cdot,\cdot)$ (or an embedding plus nearest-neighbor search) is flexible: you can add or remove members without retraining a classifier head. This is exactly the one-shot / few-shot setup: you compare a probe to one (or a few) examples per person rather than training a fixed softmax over all members.
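
A sketch of why the learned $d(\cdot,\cdot)$ is the flexible choice (the embedding network `f` is assumed given; `tau` is an illustrative threshold): membership is a distance test against one stored reference photo per person, so editing the `members` dict adds or removes people with no retraining.

```python
import numpy as np

def is_member(f, probe_img, members, tau=0.7):
    """One-shot membership test: d(probe, ref) = ||f(probe) - f(ref)||."""
    e = f(probe_img)
    return any(np.linalg.norm(e - f(ref)) < tau for ref in members.values())

# Adding a new member is a dict update, not a softmax retrain:
# members["new_hire"] = one_reference_photo
```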


Question 3

To train with triplet loss, must you collect pictures only from current team members? True/False?

  • ✅ False

  • ❌ True

Explanation:
Triplet-loss training benefits from many distinct identities to teach the embedding to separate different faces. You do not need to restrict triplet training to only current workgroup members — using many other identities (external data) typically improves embedding quality and generalization.


Question 4

The triplet loss $\max(\|f(A)-f(P)\|^2 - \|f(A)-f(N)\|^2 + \alpha, 0)$ is larger in which case?

  • ❌ When the encoding of A is closer to the encoding of P than to the encoding of N.

  • ✅ When the encoding of A is closer to the encoding of N than to the encoding of P.

  • ❌ When $A=P$ and $A=N$.

Explanation:
The loss grows when the positive is farther from A than the negative is, i.e., when $\|f(A)-f(P)\|^2$ is large relative to $\|f(A)-f(N)\|^2$. Intuitively, the loss is large when A is closer to N than to P, violating the desired inequality. If $A=P$ and $A=N$ (identical points), both distances are 0 and the expression reduces to $\alpha$; but the typical failure case is a negative that sits too close to the anchor.
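
A tiny numeric check of that intuition, using made-up 2-D "embeddings":

```python
import numpy as np

def triplet_loss(f_A, f_P, f_N, alpha=0.2):
    """max(||f(A)-f(P)||^2 - ||f(A)-f(N)||^2 + alpha, 0)"""
    d_pos = np.sum((f_A - f_P) ** 2)  # anchor-positive squared distance
    d_neg = np.sum((f_A - f_N) ** 2)  # anchor-negative squared distance
    return max(d_pos - d_neg + alpha, 0.0)

A = np.array([0.0, 0.0])
far = np.array([1.0, 0.0])
near = np.array([0.1, 0.0])
print(triplet_loss(A, far, near))  # 1.19: positive far, negative close -> large loss
print(triplet_loss(A, near, far))  # 0.0:  positive close, negative far -> clipped to 0
```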


Question 5

Siamese network architecture — which do you agree with most?

  • ✅ The upper and lower neural networks depicted have exactly the same parameters, but the outputs are computed independently for each image.

  • ❌ This depicts two different neural networks with different architectures.

  • ❌ The two networks have the same architecture, but they might have different parameters.

  • ❌ The two images are combined in a single volume and pass through a single neural network.

Explanation:
A Siamese network uses shared weights: the same network (same parameters) is applied to each input separately to produce embeddings; the outputs are computed independently and then compared, e.g., by a distance function.
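
A minimal PyTorch sketch of the shared-weight idea (the layer sizes are arbitrary): a single `encoder` holds the only copy of the parameters, and `forward` runs it on each image independently before comparing the embeddings.

```python
import torch
import torch.nn as nn

class Siamese(nn.Module):
    def __init__(self):
        super().__init__()
        # One network, one set of parameters, shared by both branches.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 8, 3), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
            nn.Flatten(), nn.Linear(8, 16),
        )

    def forward(self, img1, img2):
        # The same encoder processes each image independently.
        e1, e2 = self.encoder(img1), self.encoder(img2)
        return torch.norm(e1 - e2, dim=1)  # distance between the two embeddings

net = Siamese()
d = net(torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64))  # batch of 2 distances
```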


Question 6

You’re more likely to find a unit responding strongly to cats in layer 4 than layer 1. True/False?

  • ✅ True

  • ❌ False

Explanation:
Lower layers (layer 1) learn low-level features (edges, colors); deeper layers (layer 4) capture higher-level, semantically meaningful features (parts or whole-object detectors), so you’re more likely to find a neuron selective for cats in deeper layers.


Question 7

Neural style transfer: which loss terms are used? (choose all that apply)

  • ✅ $J_{\text{style}}$ that compares $S$ and $G$.

  • ❌ $J_{\text{corr}}$ that compares $C$ and $S$.

  • ❌ $T$ that calculates the triplet loss between $S$, $G$, and $C$.

  • ✅ $J_{\text{content}}$ that compares $C$ and $G$.

Explanation:
Neural style transfer optimizes a generated image $G$ to minimize two terms: a content loss $J_{\text{content}}(C,G)$ that matches the high-level content features of $C$, and a style loss $J_{\text{style}}(S,G)$ that matches the Gram-matrix style statistics of $S$. Triplet and correlation terms are not part of standard NST.
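
A sketch of both losses, assuming each layer's activations have already been reshaped to an $(H \cdot W) \times C$ matrix (the normalization constants and per-layer style weighting used in practice are omitted):

```python
import numpy as np

def gram(a):                 # a: (H*W, C) activations at one layer
    return a.T @ a           # channel-correlation (Gram) matrix: style statistics

def content_loss(a_C, a_G):  # compares C and G
    return np.sum((a_C - a_G) ** 2)

def style_loss(a_S, a_G):    # compares S and G via their Gram matrices
    return np.sum((gram(a_S) - gram(a_G)) ** 2)

def total_loss(a_C, a_S, a_G, alpha=10.0, beta=40.0):
    return alpha * content_loss(a_C, a_G) + beta * style_loss(a_S, a_G)
```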


Question 8

The content loss is $J_{\text{content}}(C,G) = \|a^{[l]}(C) - a^{[l]}(G)\|^2$. We choose $l$ to be a very high value so as to use a more abstract activation. True/False?

  • ✅ True

  • ❌ False

Explanation:
For content similarity you pick a deeper layer $l$ because its higher-level activations capture abstract content and structure rather than low-level texture; hence $l$ is chosen among the deeper layers (though in practice not so deep that the spatial layout of the content is lost).
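
For example, NST implementations built on VGG-19 often take the content activations from a mid-to-deep layer such as conv4_2. A hedged torchvision sketch (index 21 is conv4_2 in `vgg19().features`; the random tensors stand in for real preprocessed images):

```python
import torch
from torchvision.models import vgg19, VGG19_Weights

vgg = vgg19(weights=VGG19_Weights.DEFAULT).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)  # frozen pretrained network

def activations(img, l=21):  # layer index 21 = conv4_2 in VGG-19
    a = img
    for i, layer in enumerate(vgg):
        a = layer(a)
        if i == l:
            return a

C = torch.rand(1, 3, 224, 224)  # content image (random stand-in)
G = torch.rand(1, 3, 224, 224)  # generated image (random stand-in)
J_content = torch.sum((activations(C) - activations(G)) ** 2)
```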


Question 9

In neural style transfer, what is updated each iteration?

  • ❌ The pixel values of the content image $C$

  • ❌ The neural network parameters

  • ✅ The pixel values of the generated image $G$

  • ❌ The regularization parameters

Explanation:
In NST the pretrained network’s parameters are fixed; optimization is performed over the pixels of $G$ to minimize the combined content + style loss.
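
A minimal optimization-loop sketch; the loss below is a trivial placeholder standing in for the real content + style objective, and the key point is that only `G` is handed to the optimizer:

```python
import torch

def total_loss(G):
    # Placeholder for the combined content + style loss of Question 7.
    return torch.sum(G ** 2)

G = torch.rand(1, 3, 224, 224, requires_grad=True)  # only G's pixels are trainable
optimizer = torch.optim.Adam([G], lr=0.02)

for step in range(100):
    optimizer.zero_grad()
    J = total_loss(G)  # the pretrained network inside the real loss stays frozen
    J.backward()       # gradients flow into G's pixels, not into any weights
    optimizer.step()
```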


Question 10

3D convolution: the input is $32\times32\times32\times3$, convolved with 16 filters of size $4\times4\times4$, with zero padding and a stride of 1. What is the output size?

  • ❌ $31\times31\times31\times16$

  • ✅ $29\times29\times29\times16$

  • ❌ $29\times29\times29\times13$

  • ❌ $29\times29\times29\times3$

Explanation:
With zero padding ($p=0$) and stride 1, the output size along each of the three spatial axes is $(32-4)/1+1 = 29$. The channel dimension equals the number of filters, 16. So the output is $29\times29\times29\times16$.
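
A quick shape check with PyTorch's `Conv3d` (PyTorch is channels-first, so the same volume is laid out as $1\times3\times32\times32\times32$):

```python
import torch

x = torch.rand(1, 3, 32, 32, 32)              # one 32x32x32 volume, 3 channels
conv = torch.nn.Conv3d(3, 16, kernel_size=4,  # 16 filters of size 4x4x4
                       stride=1, padding=0)   # "zero padding" = pad of 0
print(conv(x).shape)  # torch.Size([1, 16, 29, 29, 29]) -> 29x29x29x16
```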


🧾 Summary Table

| Q # | Correct Answer(s) | Key concept |
|---|---|---|
| 1 | ✅ Face recognition requires $K$ comparisons | Recognition = identify among $K$ candidates; verification = single comparison. |
| 2 | ✅ Learn $d(\cdot,\cdot)$; ✅ One-shot task | Embedding + similarity supports flexible group membership; one-shot setup. |
| 3 | ✅ False | Triplet training benefits from many identities; not restricted to current members. |
| 4 | ✅ When A is closer to N than to P | Loss is large when the negative is too close (margin violated). |
| 5 | ✅ Shared-parameter networks applied independently | Siamese uses the same parameters for both branches. |
| 6 | ✅ True | Deeper layers encode higher-level object concepts. |
| 7 | ✅ $J_{\text{style}}(S,G)$; ✅ $J_{\text{content}}(C,G)$ | Style transfer uses content + style losses. |
| 8 | ✅ True | Content loss uses deeper-layer activations for abstract content. |
| 9 | ✅ Pixel values of generated image $G$ | NST optimizes the pixels of $G$; network weights stay fixed. |
| 10 | ✅ $29\times29\times29\times16$ | 3D conv spatial size = $(32-4)+1 = 29$; channel depth = number of filters. |