
Detection Algorithms: Convolutional Neural Networks (Deep Learning Specialization) Answers: 2025

Question 1

What should y be for the image below? (? = “don’t care”)

Options:

  • ❌ y = [?, ?, ?, ?, ?, ?, ?, ?]

  • y = [1, ?, ?, ?, ?, ?, 0, 0, 0]

  • ❌ y = [0, ?, ?, ?, ?, ?, ?, ?]

  • ❌ y = [1, ?, ?, ?, ?, ?, ?, ?, ?]

Explanation: The first element p_c indicates whether the image contains an object (1 = object present). Since the image shows an object, p_c = 1. In the marked option the bounding-box entries are “don’t care” and the class bits c_1, c_2, c_3 are explicitly 0, which matches the expected answer for this image in this question set.
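
As a sketch of how the “don’t care” entries are typically handled in training, a squared-error loss can be masked so only the supervised entries contribute. The layout y = [p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3] follows the question; the helper name and numbers are illustrative:

```python
import numpy as np

def masked_loss(y_true, y_pred, care_mask):
    """Squared error summed only over supervised (non-"don't care") entries."""
    return float(np.sum(((y_true - y_pred) ** 2) * care_mask))

# Layout: y = [pc, bx, by, bh, bw, c1, c2, c3]; mask entry 0 marks "don't care"
y_true = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
mask   = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.3, 0.5, 0.5, 0.1, 0.0, 0.0])
print(masked_loss(y_true, y_pred, mask))  # only pc and the class bits contribute
```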


Question 2

The most adequate output for detecting a can in a factory image is y = [p_c, b_x, b_y, b_h, b_w, c_1]. Which statement do you agree with the most?

  • ❌ False, since we only need two values: c_1 for “no can” and c_2 for “can”.

  • True, since this is a localization problem.

  • ❌ False, we don’t need b_h, b_w since the cans are all the same size.

  • ❌ True, p_c indicates presence, b_x, b_y, b_h, b_w give the bounding box, and c_1 gives the class probability; this option restates the previous one and is not the best phrasing.

Explanation: The task requires both presence/absence and bounding-box localization when an object is present, so a localization output vector is appropriate. Even if the can size is constant, including b_h, b_w in the output is harmless; the canonical localization vector is the best match.
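
A minimal sketch of building the target vector y = [p_c, b_x, b_y, b_h, b_w, c_1] for the two cases (the helper name and coordinate values are illustrative):

```python
import numpy as np

def make_target(box=None):
    """Build y = [pc, bx, by, bh, bw, c1]; box = (bx, by, bh, bw) or None."""
    if box is None:
        # No can in the image: only pc = 0 matters; other entries are placeholders
        return np.array([0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
    bx, by, bh, bw = box
    return np.array([1.0, bx, by, bh, bw, 1.0])

print(make_target((0.5, 0.5, 0.2, 0.1)))  # can present with its bounding box
print(make_target(None))                   # no can present
```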


Question 3

If the network outputs N landmarks (each with x and y coordinates) for a single face, how many output units are needed?

  • 2N

  • ❌ N

  • ❌ 3N

  • ❌ N^2

Explanation: Each landmark has two values (x and y), so the total number of output units is 2N.
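
A quick illustration of the 2N count, reshaping a flat output vector into N (x, y) pairs (the sizes are illustrative):

```python
import numpy as np

N = 32                           # number of facial landmarks (illustrative)
raw = np.random.rand(2 * N)      # the network's 2N output units
landmarks = raw.reshape(N, 2)    # one (x, y) pair per landmark
print(landmarks.shape)           # (32, 2)
```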


Question 4

You find many unlabeled internet cat photos. Which is true?

  • ❌ We can’t add the internet images unless they have bounding boxes.

  • We should add the internet images (without bounding boxes) to the train set.

  • ❌ We can’t use internet images because it changes the distribution.

  • ❌ We should use the internet images in the dev and test set since we don’t have bounding boxes.

Explanation: Unlabeled images (no bounding boxes) still provide useful training signal, e.g. as weakly labeled or semi-supervised examples. They are not appropriate for the dev and test sets if those sets should reflect the labeled production distribution.


Question 5

IoU between two boxes: areas 4 and 6, overlap area 1 → IoU = ?

  • 1/9

  • ❌ 1/10

  • ❌ 1/6

  • ❌ None of the above

Explanation: IoU = intersection / union = 1 / (4 + 6 − 1) = 1/9.
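
The arithmetic can be checked directly (the helper name is illustrative):

```python
def iou_from_areas(area_a, area_b, inter):
    """IoU = intersection / union, where union = area_a + area_b - intersection."""
    return inter / (area_a + area_b - inter)

print(iou_from_areas(4, 6, 1))  # 1/9 ≈ 0.111...
```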


Question 6

After NMS with discard threshold ≤ 0.7 and IoU threshold 0.5, only three boxes remain. True/False?

  • False

  • ❌ True

Explanation: The number of boxes that survive NMS depends on the specific scores and pairwise IoUs of the predicted boxes in the quiz figure, which are not reproduced here. Applying the thresholds to that figure does not leave exactly three boxes, so the statement is False.
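
For reference, a minimal greedy NMS sketch using these thresholds; the box coordinates and scores below are made up for illustration, not taken from the quiz figure:

```python
def nms(boxes, scores, score_thresh=0.7, iou_thresh=0.5):
    """Greedy non-max suppression on axis-aligned boxes (x1, y1, x2, y2)."""
    def iou(a, b):
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter) if inter else 0.0

    # Discard boxes with score <= threshold, then keep survivors greedily by score
    cands = sorted(
        [(s, b) for s, b in zip(scores, boxes) if s > score_thresh],
        key=lambda p: -p[0],
    )
    kept = []
    for s, b in cands:
        if all(iou(b, k) < iou_thresh for _, k in kept):
            kept.append((s, b))
    return [b for _, b in kept]

boxes = [(0, 0, 2, 2), (0.1, 0.1, 2, 2), (3, 3, 5, 5)]
scores = [0.9, 0.8, 0.75]
print(nms(boxes, scores))  # the second box overlaps the first and is suppressed
```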


Question 7

About anchor boxes in YOLO — which apply? (check all that apply)

  • Each object is assigned to the grid cell that contains that object’s midpoint.

  • Each object is assigned to an anchor box with the highest IoU inside the assigned cell.

  • ❌ They prevent the bounding box from suffering from drifting.

  • ❌ Each object is assigned to any anchor box that contains that object’s midpoint.

Explanation: In YOLO-style grid/anchor schemes, each object is assigned to the cell containing its midpoint; within that cell the object is matched to the anchor box (prior) with the highest IoU against the ground-truth box. Anchors do not prevent bounding boxes from “drifting” (not a standard claim), and an object is not assigned to every anchor that contains its midpoint; usually only the best-matching anchor in that cell is used.
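
A sketch of this assignment rule, assuming anchor matching compares box shapes centered at a common point (all names, grid sizes, and anchor priors below are illustrative):

```python
def assign(obj_box, grid_size, anchors, iou_fn):
    """Assign a ground-truth box to (cell_row, cell_col, anchor_index).

    obj_box = (cx, cy, h, w) with coordinates normalized to [0, 1];
    anchors = list of (h, w) priors; iou_fn compares shapes at a shared center.
    """
    cx, cy, h, w = obj_box
    row, col = int(cy * grid_size), int(cx * grid_size)   # cell holding the midpoint
    best = max(range(len(anchors)),
               key=lambda i: iou_fn((h, w), anchors[i]))  # best-IoU anchor in that cell
    return row, col, best

def shape_iou(a, b):
    """IoU of two (h, w) boxes aligned at a common center."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)

anchors = [(0.2, 0.1), (0.1, 0.3)]  # tall-thin and short-wide priors
print(assign((0.55, 0.35, 0.18, 0.12), grid_size=3, anchors=anchors,
             iou_fn=shape_iou))
```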


Question 8

Is per-pixel tumor segmentation a localization problem? True/False?

  • True

  • ❌ False

Explanation: Labeling each pixel as tumor or not is semantic segmentation, a pixel-level form of localization, so the statement is True.


Question 9

Transpose convolution with padding = 1 and stride = 2 is applied to the 2×2 input [1 2; 3 4] with a given 3×3 filter. The quiz shows the 6×6 result with unknowns X, Y, Z. Which option gives X, Y, Z?

  • X = 2, Y = -6, Z = -4

  • ❌ X = -2, Y = -6, Z = -4

  • ❌ X = 2, Y = 6, Z = 4

  • ❌ X = 2, Y = -6, Z = 4

Explanation: Upsampling the input by inserting zeros (stride 2) and then convolving with the 3×3 filter under the specified padding yields interior values matching X = 2, Y = −6, Z = −4.
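
A generic scatter-then-crop transposed convolution can be sketched as follows. The 3×3 filter below is illustrative (the quiz's actual filter is not reproduced here), and the cropped output size follows this convention rather than the size shown in the quiz figure:

```python
import numpy as np

def transpose_conv2d(x, k, stride=2, pad=1):
    """Transposed conv: scatter k * x[i, j] at stride offsets, then crop the padding."""
    n, m = x.shape
    kh, kw = k.shape
    out = np.zeros((stride * (n - 1) + kh, stride * (m - 1) + kw))
    for i in range(n):
        for j in range(m):
            out[stride * i:stride * i + kh,
                stride * j:stride * j + kw] += x[i, j] * k
    return out[pad:out.shape[0] - pad, pad:out.shape[1] - pad]

x = np.array([[1., 2.], [3., 4.]])
k = np.array([[1., 0., -1.],
              [1., 0., -1.],
              [1., 0., -1.]])  # illustrative filter, not the quiz's
print(transpose_conv2d(x, k, stride=2, pad=1))
```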


Question 10

U-Net input dimension is h × w × 3. What is the output dimension?

  • h × w × n, where n = number of output classes

  • ❌ h × w × n, where n = number of input channels

  • ❌ h × w × n, where n = number of filters used in the algorithm

  • ❌ h × w × n, where n = number of output channels

Explanation: U-Net is typically used for segmentation; it produces an output map with the same spatial size h × w and depth equal to the number of segmentation classes (per-pixel class logits). So n denotes the number of output classes.
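
The shape relationship can be illustrated with plain arrays (the sizes below are arbitrary, and the logits stand in for a U-Net's final layer):

```python
import numpy as np

h, w, n_classes = 4, 4, 3                  # illustrative sizes
logits = np.random.randn(h, w, n_classes)  # U-Net output: one score per class per pixel
seg_map = logits.argmax(axis=-1)           # per-pixel predicted class
print(logits.shape, seg_map.shape)         # (4, 4, 3) (4, 4)
```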


🧾 Summary Table

| Q # | Correct Answer(s) | Key concept |
| --- | --- | --- |
| 1 | [1, ?, ?, ?, ?, 0, 0, 0] | Mark presence (p_c = 1); other localization/class fields can be don’t-care per the question format. |
| 2 | True (localization output) | Use a localization vector to give presence plus bounding box. |
| 3 | 2N | Each landmark has x and y. |
| 4 | Add internet images to training (even if unlabeled) | Unlabeled images can still help training (weak/semi-supervised). |
| 5 | 1/9 | IoU = intersection / union = 1/9. |
| 6 | False | Cannot conclude the exact post-NMS count without concrete boxes/scores. |
| 7 | Object assigned to midpoint cell; best-IoU anchor chosen | YOLO assigns each object to the cell containing its midpoint and to the anchor with highest IoU. |
| 8 | True | Pixel-wise tumor labeling = segmentation/localization. |
| 9 | X = 2, Y = −6, Z = −4 | Result of stride-2 transpose convolution (upsample + conv). |
| 10 | h × w × n, where n = number of output classes | U-Net outputs per-pixel class logits at the same h × w size. |