Special Applications: Face Recognition & Neural Style Transfer: Convolutional Neural Networks (Deep Learning Specialization) Answers: 2025
Question 1
Which of the following do you agree with?
- ❌ Face recognition requires comparing pictures against one person’s face.
- ✅ Face recognition requires K comparisons of a person’s face.
- ❌ Face verification requires K comparisons of a person’s face.
Explanation:
Face recognition (identifying who the person is among $K$ known identities) generally requires comparing the probe face against the $K$ stored identities (or computing a similarity to each of $K$ templates). Face verification (is this person X?) requires just a single comparison against the claimed identity, not $K$.
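For concreteness, here is a minimal NumPy sketch of the comparison counts (the 128-dimensional random embeddings, the names, and the 0.7 threshold are illustrative assumptions, not part of the quiz):

```python
import numpy as np

def distance(e1, e2):
    # Squared Euclidean distance between two face embeddings.
    return np.sum((e1 - e2) ** 2)

# Hypothetical database of K enrolled identities -> embedding vectors.
database = {name: np.random.randn(128) for name in ["ana", "ben", "kian"]}
probe = np.random.randn(128)  # embedding of the new photo

# Verification ("is this ana?"): ONE comparison, regardless of K.
is_ana = distance(probe, database["ana"]) < 0.7

# Recognition ("who is this?"): K comparisons, one per enrolled identity.
best_match = min(database, key=lambda name: distance(probe, database[name]))
```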
Question 2
Workgroup membership detection — which do you agree with?
- ✅ It will be more efficient to learn a function $d(\text{img}_1,\text{img}_2)$ for this task.
- ❌ It is best to build a CNN with a softmax output with as many outputs as members of the group.
- ❌ This can’t be considered a one-shot learning task since there might be many members in the workgroup.
- ✅ This can be considered a one-shot learning task.
Explanation:
A learned similarity/distance function $d(\cdot,\cdot)$ (or an embedding plus nearest-neighbor comparison) is flexible: you can add or remove members without retraining a classifier head. This is exactly the one-shot / few-shot setup: you compare a probe to one (or a few) examples per person rather than training a fixed softmax over all members.
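A hedged sketch of that flexibility, assuming a hypothetical pretrained encoder `embed` (the names, file paths, and threshold are invented):

```python
import numpy as np

def embed(img_path):
    # Stand-in for a trained face-embedding network (hypothetical).
    return np.random.randn(128)

def distance(e1, e2):
    return np.sum((e1 - e2) ** 2)  # squared Euclidean distance

# One reference photo per member is all enrollment needs (the one-shot part):
database = {"ana": embed("ana.jpg"), "kian": embed("kian.jpg")}
database["dana"] = embed("dana.jpg")  # new member joins: no retraining
del database["kian"]                  # member leaves: no retraining

def is_member(probe_path, database, threshold=0.7):
    # The probe belongs to the group iff it is close to ANY enrolled embedding.
    p = embed(probe_path)
    return any(distance(p, e) < threshold for e in database.values())
```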
Question 3
To train with triplet loss, must you collect pictures only from current team members? True/False?
- ✅ False
- ❌ True
Explanation:
Triplet-loss training benefits from many distinct identities to teach the embedding to separate different faces. You do not need to restrict triplet training to only current workgroup members — using many other identities (external data) typically improves embedding quality and generalization.
Question 4
Triplet loss $\max(\|f(A)-f(P)\|^2 - \|f(A)-f(N)\|^2 + \alpha,\ 0)$ is larger in which case?
- ❌ When the encoding of A is closer to the encoding of P than to the encoding of N.
- ✅ When the encoding of A is closer to the encoding of N than to the encoding of P.
- ❌ When $A=P$ and $A=N$.
Explanation:
The loss grows when the positive is farther from the anchor than the negative, i.e., when $\|f(A)-f(P)\|^2$ is large relative to $\|f(A)-f(N)\|^2$. Intuitively, the loss is large when A is closer to N than to P, violating the desired margin. If $A=P$ and $A=N$ (identical points), both distances are 0 and the expression reduces to $\alpha$; the typical failure case, though, is a negative that sits too close to the anchor.
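A quick numeric check in NumPy (the 2-D toy embeddings and $\alpha = 0.2$ are invented for illustration):

```python
import numpy as np

def triplet_loss(a, p, n, alpha=0.2):
    # max(||f(A)-f(P)||^2 - ||f(A)-f(N)||^2 + alpha, 0)
    d_pos = np.sum((a - p) ** 2)
    d_neg = np.sum((a - n) ** 2)
    return max(d_pos - d_neg + alpha, 0.0)

a = np.array([0.0, 0.0])  # anchor embedding
print(triplet_loss(a, np.array([0.1, 0.0]), np.array([2.0, 0.0])))  # 0.0: A close to P, margin satisfied
print(triplet_loss(a, np.array([2.0, 0.0]), np.array([0.1, 0.0])))  # 4.19: A closer to N, large loss
```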
Question 5
Siamese network architecture — which do you agree with most?
- ✅ The upper and lower neural networks depicted have exactly the same parameters, but the outputs are computed independently for each image.
- ❌ This depicts two different neural networks with different architectures.
- ❌ The two networks have the same architecture, but they might have different parameters.
- ❌ The two images are combined in a single volume and pass through a single neural network.
Explanation:
A Siamese network uses shared weights — the same network (same parameters) applied to each input separately to produce embeddings; outputs are computed independently and then compared (e.g., distance).
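A minimal PyTorch sketch of the shared-weights idea (the tiny encoder architecture is an arbitrary stand-in; any CNN works):

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(              # ONE module, hence ONE set of parameters
    nn.Conv2d(3, 8, kernel_size=3, stride=2),
    nn.ReLU(),
    nn.Flatten(),
    nn.LazyLinear(128),               # 128-dim embedding
)

img1 = torch.randn(1, 3, 64, 64)
img2 = torch.randn(1, 3, 64, 64)

# The same module is applied to each image independently (shared parameters),
# and only afterwards are the two embeddings compared.
e1, e2 = encoder(img1), encoder(img2)
d = torch.sum((e1 - e2) ** 2)
```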
Question 6
You’re more likely to find a unit responding strongly to cats in layer 4 than layer 1. True/False?
- ✅ True
- ❌ False
Explanation:
Lower layers (layer 1) learn low-level features (edges, colors); deeper layers (layer 4) capture higher-level, semantically meaningful features (parts or whole-object detectors), so you’re more likely to find a neuron selective for cats in deeper layers.
Question 7
Neural style transfer: which loss terms are used? (choose all that apply)
- ✅ $J_{\text{style}}$ that compares $S$ and $G$.
- ❌ $J_{\text{corr}}$ that compares $C$ and $S$.
- ❌ $T$ that calculates a triplet loss between $S$, $G$, and $C$.
- ✅ $J_{\text{content}}$ that compares $C$ and $G$.
Explanation:
Neural style transfer optimizes a generated image $G$ to minimize a content loss $J_{\text{content}}(C,G)$ (match high-level content features of $C$) and a style loss $J_{\text{style}}(S,G)$ (match Gram-matrix/style statistics of $S$). Triplet or correlation terms are not part of standard NST.
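A sketch of both losses computed from one layer's activations (the feature shapes and the 1.0/100.0 weights are assumptions; real NST typically sums the style loss over several layers):

```python
import torch

def gram(a):
    # a: (channels, height, width) activations from one conv layer.
    c, h, w = a.shape
    f = a.reshape(c, h * w)
    return f @ f.T / (h * w)          # (c, c) Gram matrix of style statistics

def content_loss(a_C, a_G):
    return torch.mean((a_C - a_G) ** 2)               # compares C and G

def style_loss(a_S, a_G):
    return torch.mean((gram(a_S) - gram(a_G)) ** 2)   # compares S and G

# Stand-in activations from one layer of a frozen, pretrained CNN:
a_C, a_S, a_G = (torch.randn(64, 32, 32) for _ in range(3))
J = 1.0 * content_loss(a_C, a_G) + 100.0 * style_loss(a_S, a_G)
```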
Question 8
Content loss $J_{\text{cont}}(G,C) = \|a^{[l]}(C) - a^{[l]}(G)\|^2$. We choose $l$ to be a very high value to use the more abstract activation. True/False?
- ✅ True
- ❌ False
Explanation:
For content similarity you pick a deeper layer $l$ (higher-level activations) because those capture abstract content and structure rather than low-level texture; hence $l$ is chosen among the deeper layers.
Question 9
In neural style transfer, what is updated each iteration?
- ❌ The pixel values of the content image $C$
- ❌ The neural network parameters
- ✅ The pixel values of the generated image $G$
- ❌ The regularization parameters
Explanation:
In NST the pretrained network's parameters are fixed; optimization is performed over the image pixels of $G$ to minimize the combined content + style loss.
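A minimal sketch of that optimization loop in PyTorch; `total_loss` is a trivial stand-in for the content + style objective above, just to show which tensor receives gradients:

```python
import torch

G = torch.rand(1, 3, 256, 256, requires_grad=True)  # only G gets gradients
optimizer = torch.optim.Adam([G], lr=0.02)          # optimize pixels, not weights

def total_loss(G):
    # Stand-in for alpha*J_content(C,G) + beta*J_style(S,G), which would be
    # computed from a frozen, pretrained CNN's activations.
    return torch.mean(G ** 2)

for step in range(100):
    optimizer.zero_grad()
    loss = total_loss(G)
    loss.backward()   # gradients flow to the pixel values of G
    optimizer.step()  # updates the image, never the network weights
```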
Question 10
3D conv: input $32\times32\times32\times3$, 16 filters of size $4\times4\times4$, zero padding and stride 1. What is the output size?
- ❌ $31\times31\times31\times16$
- ✅ $29\times29\times29\times16$
- ❌ $29\times29\times29\times13$
- ❌ $29\times29\times29\times3$
Explanation:
Here "zero padding" means padding $= 0$ (no padding), so each spatial output dimension is $(32-4)/1+1 = 29$ along each of the three spatial axes. The channel dimension equals the number of filters, 16, so the output is $29\times29\times29\times16$.
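A quick shape check with PyTorch (`Conv3d` uses a channels-first layout, so the $32\times32\times32\times3$ volume becomes a `(1, 3, 32, 32, 32)` tensor):

```python
import torch
import torch.nn as nn

conv = nn.Conv3d(in_channels=3, out_channels=16,
                 kernel_size=4, stride=1, padding=0)
x = torch.randn(1, 3, 32, 32, 32)  # batch, channels, depth, height, width
print(conv(x).shape)               # torch.Size([1, 16, 29, 29, 29])
```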
🧾 Summary Table
| Q # | Correct Answer(s) | Key concept |
|---|---|---|
| 1 | ✅ Face recognition requires K comparisons | Recognition = identify among K candidates; verification = single comparison. |
| 2 | ✅ Learn $d(\cdot,\cdot)$; ✅ One-shot task | Embedding + similarity supports flexible group membership; one-shot setup. |
| 3 | ✅ False | Triplet training benefits from many identities; not restricted to current members. |
| 4 | ✅ When A is closer to N than to P | Loss large when negative is too close (violates margin). |
| 5 | ✅ Shared-parameter networks applied independently | Siamese uses same parameters for both branches. |
| 6 | ✅ True | Deeper layers encode higher-level object concepts. |
| 7 | ✅ $J_{\text{style}}(S,G)$; ✅ $J_{\text{content}}(C,G)$ | Style transfer uses content + style losses. |
| 8 | ✅ True | Content loss uses deeper-layer activations for abstract content. |
| 9 | ✅ Pixel values of generated image $G$ | NST optimizes pixels of $G$; network weights fixed. |
| 10 | ✅ $29\times29\times29\times16$ | 3D conv output spatial size $= (32-4)/1+1 = 29$; depth = # filters. |