
Bird Recognition in the City of Peacetopia (Quiz Case Study): Structuring Machine Learning Projects (Deep Learning Specialization), Answers 2025

Question 1

True or False: You acknowledge that having multiple evaluation metrics may complicate the decision-making process and slow down iteration speed.

  • True

  • ❌ False

Explanation: Multiple metrics can cause conflicting objectives and make decisions harder (which metric to optimize), slow down iteration because trade-offs must be analyzed, and complicate experiment comparisons. When speed of iteration matters, a single clear optimizing metric (and a few satisficing constraints) is usually preferable.


Question 2

Given accuracy, runtime ≤10s, and memory ≤10MB, how would you choose a model?

  • ❌ Take the model with the smallest runtime because that will provide the most overhead to increase accuracy.

  • ❌ Create one metric by combining the three metrics and choose the best performing model.

  • Find the subset of models that meet the runtime and memory criteria. Then, choose the highest accuracy.

  • ❌ Accuracy is an optimizing metric, therefore the most accurate model is the best choice.

Explanation: Runtime and memory here are satisficing constraints (hard limits). First filter models that satisfy those constraints, then optimize the primary objective (accuracy) among that feasible set. Combining metrics into one number is fragile and can hide constraints; choosing only by runtime ignores accuracy.
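As a minimal Python sketch of this selection logic (the model names, accuracies, runtimes, and memory figures below are made up for illustration): filter on the satisficing constraints first, then maximize accuracy over the feasible set.

```python
# Hypothetical (name, accuracy, runtime_s, memory_mb) tuples.
candidates = [
    ("A", 0.97, 1.0, 3.0),
    ("B", 0.99, 13.0, 9.0),   # violates runtime <= 10 s
    ("C", 0.98, 9.0, 9.0),
    ("D", 0.995, 1.0, 26.0),  # violates memory <= 10 MB
]

# Step 1: keep only models that satisfy the hard (satisficing) constraints.
feasible = [m for m in candidates if m[2] <= 10 and m[3] <= 10]

# Step 2: among the feasible set, maximize the optimizing metric (accuracy).
best = max(feasible, key=lambda m: m[1])
print(best[0])  # "C": highest accuracy among models meeting both limits
```

Note that D, the most accurate model overall, never even enters the comparison: constraints are filters, not terms in a weighted score.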


Question 3

True or False: The essential difference between an optimizing metric and satisficing metrics is the priority assigned by the stakeholders.

  • True

  • ❌ False

Explanation: Optimizing metrics (what you maximize/minimize) and satisficing constraints (requirements that must be met) are distinguished by stakeholder priority: the optimizing metric drives improvement, while satisficing metrics are constraints that must be satisfied.


Question 4

With 10,000,000 data points, best option for train/dev/test splits?

  • ❌ train – 60%, dev – 30%, test – 10%

  • ❌ train – 33.3%, dev – 33.3%, test – 33.3%

  • ❌ train – 60%, dev – 10%, test – 30%

  • train – 95%, dev – 2.5%, test – 2.5%

Explanation: With very large datasets you can allocate almost everything to training while keeping dev/test sets that are small as a fraction but still large in absolute terms: 2.5% of 10,000,000 is 250,000 images, plenty for reliable evaluation. This keeps training data abundant while keeping evaluation sets manageable.
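Worked out in integer arithmetic (which sidesteps floating-point rounding), the 95 / 2.5 / 2.5 split gives:

```python
total = 10_000_000

# 95% / 2.5% / 2.5% split, computed with integer math so the counts are exact.
train = total * 95 // 100   # 9,500,000 training examples
dev = total * 25 // 1000    # 250,000 dev examples
test = total * 25 // 1000   # 250,000 test examples
```

A quarter-million examples is more than enough to detect small differences between models, which is all the dev and test sets need to do.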


Question 5

You get 1,000,000 social-media images with a different distribution. Best use?

  • ❌ Add it to the dev set to evaluate how well the model generalizes across a broader set.

  • ❌ Do not use the data. It will change the distribution of any set it is added to.

  • ❌ Split it among train/dev/test equally.

  • Add it to the training set.

Explanation: If the new data distribution is different but you want the model to generalize to it, adding to training (with proper labeling and possible domain adaptation) helps the model learn those new patterns. Avoid contaminating dev/test distributions unintentionally if you care primarily about the original camera distribution.


Question 6

Council wants to add 1,000,000 citizen images to the dev set. Why object? (Choose all that apply)

  • The 1,000,000 citizen data images do not have a consistent input-output relationship with the security camera data.

  • The dev set no longer reflects the distribution of data (security cameras) you most care about.

  • ❌ A bigger test set will slow down the speed of iterating because of the computational expense of evaluating models on the test set.

  • This would cause the dev and test set distributions to become different. This is a bad idea because you’re not aiming where you want to hit.

Explanation: Adding citizen images to the dev set shifts its distribution away from security-camera inputs, so the dev set no longer measures the performance you actually care about, and the dev and test sets would no longer come from the same distribution. If citizen images have different labeling or noise characteristics, the input-output relationship may also be inconsistent. The slowdown option is wrong on two counts: it talks about the test set rather than the dev set, and evaluating on a larger set is computationally cheap compared with training.


Question 7

Human <1% error; training error 5.2%; dev error 7.3%. Best next step?

  • Train a bigger network to reduce the 5.2% training error.

  • ❌ Try an ensemble model to reduce bias and variance.

  • ❌ Get more data or apply regularization to reduce variance.

  • ❌ Validate the human data set with a sample of your data to ensure the images are of sufficient quality.

Explanation: Training error (5.2%) is much higher than human-level (<1%) — this indicates avoidable bias. Increase model capacity (bigger network) or improve model architecture to reduce training error. Ensembles and more data/regularization address variance, not the primary issue here.
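The bias-vs-variance diagnosis above can be expressed as a tiny helper. This is a sketch: the `diagnose` function is illustrative, and treating human error as a stand-in for Bayes error is the usual rough assumption from the course, not an exact quantity.

```python
def diagnose(human_err, train_err, dev_err):
    """Split the error into avoidable bias and variance,
    using human error as a rough proxy for Bayes error."""
    avoidable_bias = train_err - human_err
    variance = dev_err - train_err
    focus = "bias" if avoidable_bias > variance else "variance"
    return focus, avoidable_bias, variance

# Question 7's numbers: human < 1%, training 5.2%, dev 7.3%.
focus, bias, var = diagnose(1.0, 5.2, 7.3)
print(focus)  # "bias": the train-vs-human gap dominates, so shrink training error first
```

With an avoidable-bias gap of about 4.2 points against a variance gap of about 2.1, capacity increases (a bigger network) come before variance fixes.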


Question 8

Multiple human labelers with errors 0.3%, 0.5% (experts), 1.0%, 1.2% (non-experts). Define human-level performance for Bayes error estimate?

  • ❌ 0.3% (lowest expert)

  • ❌ 0.0% (perfect)

  • 0.4% (average of the two experts’ error rates)

  • ❌ 0.75% (average of all four)

Explanation: When estimating Bayes error from human performance, use the best available humans for the task, i.e., the trained experts. Averaging the two expert error rates (0.3% and 0.5%) gives 0.4%.


Question 9

Optimal order of accuracy from worst to best?

  • ❌ Human-level performance -> Bayes error -> the learning algorithm’s performance.

  • The learning algorithm’s performance -> human-level performance -> Bayes error.

  • ❌ The learning algorithm’s performance -> Bayes error -> human-level performance.

  • ❌ Human-level performance -> the learning algorithm’s performance -> Bayes error.

Explanation: The learning algorithm is typically worst (higher error), humans are better, and Bayes error (irreducible) is the best (lowest error). So worst → best: algorithm → human → Bayes.


Question 10

Human 0.1%, training 2.0%, dev 2.1%. Best next step?

  • Prioritize actions to decrease bias by increasing model complexity, as the training error significantly exceeds human-level performance.

  • ❌ Continue tuning until the training set error matches human-level performance, focusing solely on the optimizing metric.

  • ❌ Evaluate the test set to determine the variance.

  • ❌ Deploy the model to target devices to evaluate against satisficing metrics.

Explanation: Training error (2.0%) ≫ human (0.1%) indicates avoidable bias. Increase model capacity or reduce bias. Test/dev evaluation is secondary until bias is reduced.


Question 11

Test error 7.0%, dev 2.1%, train 2.0%. Conclusions? (Choose all that apply)

  • You should try to get a bigger dev set.

  • ❌ You have underfitted to the dev set.

  • ❌ Try decreasing regularization for better generalization with the dev set.

  • You have overfitted to the dev set.

Explanation: Large jump from dev (2.1%) → test (7.0%) suggests the model (or model-selection process) overfit to the dev set (selection bias). Remedies include a larger dev set, ensuring dev/test distributions align, or using a separate validation-selection set. Underfitting and decreasing regularization are not the immediate interpretations.
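Plugging the question's numbers into the two gaps makes the diagnosis mechanical (a sketch; the 10x ratio used as a flag below is an arbitrary illustrative cutoff, not a standard threshold):

```python
train_err, dev_err, test_err = 2.0, 2.1, 7.0

dev_gap = dev_err - train_err    # ~0.1: little variance relative to training
test_gap = test_err - dev_err    # ~4.9: the big jump happens at the test set

# A test gap far larger than the dev gap suggests the model-selection
# process itself overfit the dev set: every tuning decision was scored
# on the same examples.
overfit_to_dev = test_gap > 10 * dev_gap
print(overfit_to_dev)  # True
```

That is why "get a bigger dev set" is a remedy: a larger dev set makes it harder for repeated model selection to exploit its quirks.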


Question 12

After a year: human 0.10%, training 0.05%, dev 0.05%. Which are likely? (Choose all that apply.)

  • ❌ This result is not possible since it should not be possible to surpass human-level performance.

  • Pushing to even higher accuracy will be slow because you will not be able to easily identify sources of bias.

  • The model has recognized complex, emergent features that humans may not readily perceive. (Chess and Go, for example).

  • ❌ There is still avoidable bias.

Explanation: Models can and do surpass average human performance in many tasks. Once you surpass human level, further improvements are slow and diagnosing remaining errors is hard. The model may exploit features humans miss. Since training/dev errors are already below human, avoidable bias is less likely.


Question 13

System accurate but false negatives too high. Best next step?

  • ❌ Look at all the models you’ve developed during the development process and find the one with the lowest false negative error rate.

  • ❌ Expand your model size to account for more corner cases.

  • ❌ Pick false negative rate as the new metric, and use this new metric to drive all further development.

  • Reset your “target” (metric) for the team and tune to it.

Explanation: If false negatives are the stakeholder priority, redefine the target metric (or set of metrics / constraints) to penalize false negatives appropriately (e.g., change loss, threshold, class weighting). Then drive development to that metric. Simply picking among past models or arbitrarily increasing size is suboptimal.
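One concrete way to "reset the target" is a single weighted-error metric that charges false negatives more heavily than false positives. The confusion counts and the 10x weight below are made up for illustration; in practice the weight would come from the stakeholders' actual cost of a missed bird.

```python
def weighted_error(fn, fp, n, fn_weight=10.0):
    """Single metric in which each false negative costs fn_weight
    times as much as a false positive, so FN reductions dominate tuning."""
    return (fn_weight * fn + fp) / n

# Model X: fewer total errors, but many missed birds (false negatives).
# Model Y: more errors overall, but far fewer false negatives.
x = weighted_error(fn=40, fp=10, n=1000)   # (10*40 + 10) / 1000 = 0.41
y = weighted_error(fn=5, fp=30, n=1000)    # (10*5 + 30) / 1000 = 0.08
print(min((x, "X"), (y, "Y"))[1])  # "Y" wins under the new target
```

Under plain error rate X would look better (50 vs 35 mistakes); the reweighted metric reverses the ranking, which is exactly the point of redefining the target.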


Question 14

New bird species appears; only 1,000 images; performance degrades. First action?

  • Put the new species’ images in training data to learn their features.

  • ❌ Augment your data to increase the number of images of the new bird species.

  • ❌ Split them between dev and test and re-tune.

  • ❌ Add pooling layers to downsample features to accommodate the new species.

Explanation: The immediate practical step is to include the few new-species images in training (fine-tune) so the model can learn their features. Augmentation could help later, but primary action is to train/fine-tune on the new-class examples. Changing architecture or shuffling into dev/test first will not directly teach the model to recognize the new species.


Question 15

You have 100,000,000 cat images; training takes ~2 weeks. Which statements do you agree with? (Check all that agree.)

  • You could consider a tradeoff where you use a subset of the cat data to find reasonable performance with reasonable iteration pacing.

  • ❌ With the experience gained from the Bird detector, you are confident to build a good Cat detector on the first try.

  • Given a significant budget for cloud GPUs, you could mitigate the training time.

  • Accuracy should exceed the City Council’s requirements, but the project may take as long as the bird detector because of the two-week training/iteration time.

Explanation: With enormous datasets, using subsets (or curriculum/fine-tuning) speeds iteration. A two-week training loop can be shortened with more compute (costly). Experience helps but rarely guarantees a first-try success. Long training time can slow development even if final accuracy may be high.


🧾 Summary Table

| Q # | Correct answer(s) (marked) | Key concept |
| --- | --- | --- |
| 1 | ✅ True | Multiple metrics complicate decisions / slow iteration |
| 2 | ✅ Filter by runtime & memory, then pick highest accuracy | Satisficing constraints + optimizing metric |
| 3 | ✅ True | Difference is stakeholder priority |
| 4 | ✅ train 95% / dev 2.5% / test 2.5% | Large-data split: maximize training |
| 5 | ✅ Add to training set | Use new-distribution data to improve generalization |
| 6 | ✅ Options 1, 2, 4 | Keep dev distribution aligned with target data |
| 7 | ✅ Train a bigger network | High training error → avoidable bias |
| 8 | ✅ 0.4% (avg of experts) | Human-level = expert average |
| 9 | ✅ algorithm → human → Bayes | Accuracy order: algorithm worst → Bayes best |
| 10 | ✅ Decrease bias (increase complexity) | Training error >> human → increase capacity |
| 11 | ✅ Bigger dev set; overfitted to dev | Large dev→test gap → overfitting to dev |
| 12 | ✅ Slow to improve further; model may find emergent features | Surpassing humans is possible; diminishing returns |
| 13 | ✅ Reset target metric and tune to it | Re-define metric to match stakeholder priority (reduce FN) |
| 14 | ✅ Put new species images into training | Fine-tune/train on new-class examples first |
| 15 | ✅ Use subset; use more GPUs; project may still take long | Tradeoffs between iteration speed, compute cost, final accuracy |