
Quiz 3: Practical Machine Learning (Data Science Specialization) Answers 2025

1. Question 1

CART model (segmentationOriginal data) – predictions for given variable values:

a. PS, b. WS, c. PS, d. Not possible to predict
❌ a. PS, b. Not possible to predict, c. PS, d. Not possible to predict
❌ a. PS, b. WS, c. PS, d. WS
❌ a. PS, b. PS, c. PS, d. Not possible to predict

Explanation:
Using rpart on segmentationOriginal, cases (a), (b), and (c) follow branches that lead to PS, WS, and PS, respectively. Case (d) cannot be predicted because it is missing the variable TotalIntenCh2 (the root split), so there is no complete path through the tree.
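A minimal sketch of the setup, assuming the AppliedPredictiveModeling/caret workflow the quiz describes (seed and subsetting follow the question's instructions):

```r
library(AppliedPredictiveModeling)
library(caret)
data(segmentationOriginal)

# Split by the Case variable, as the question specifies
training <- subset(segmentationOriginal, Case == "Train")
testing  <- subset(segmentationOriginal, Case == "Test")

set.seed(125)
fit <- train(Class ~ ., data = training, method = "rpart")

# Print the fitted tree and trace cases (a)-(d) through the splits by hand
print(fit$finalModel)
```

Tracing each hypothetical case through the printed splits gives PS, WS, PS, and no prediction for (d).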


2. Question 2

K-fold cross-validation: bias & variance tradeoff.

The bias is larger and the variance is smaller. Under leave one out cross validation K is equal to the sample size.
❌ The bias is smaller and the variance is bigger. Under leave one out cross validation K is equal to one.
❌ The bias is smaller and the variance is smaller. Under leave one out cross validation K is equal to the sample size.
❌ The bias is smaller and the variance is smaller. Under leave one out cross validation K is equal to one.

Explanation:
Smaller K → less training data per fold → higher bias, lower variance.
LOOCV → K = n (sample size), lowest bias, highest variance.


3. Question 3

Olive oil dataset — classification tree with Area as outcome.

2.783. It is strange because Area should be a qualitative variable – but the tree is reporting the average value of Area as a numeric variable in the leaf predicted for newdata.
❌ 0.005291005 0 0.994709 0 0 0 0 0 0. The result is strange because Area is numeric variable.
❌ 4.59965. There is no reason why the result is strange.
❌ 0.005291005 0 0.994709 0 0 0 0 0 0. There is no reason why the result is strange.

Explanation:
If Area is not converted to a factor, tree() treats it as numeric and outputs a numeric mean (≈2.783). This is strange, since Area should be categorical.
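A sketch of the quiz's setup, assuming the pgmm olive data and the tree package as described in the question:

```r
library(pgmm)
data(olive)
olive <- olive[, -1]   # drop the first column, as the question instructs

library(tree)
# Area is numeric here, so tree() fits a regression tree, not a classifier
fit <- tree(Area ~ ., data = olive)

# newdata = the column means of the predictors, per the question
newdata <- as.data.frame(t(colMeans(olive)))
predict(fit, newdata)  # the quiz reports a numeric leaf mean of about 2.783
```

Converting Area with `factor(olive$Area)` before fitting would instead yield class probabilities, which is what a categorical outcome calls for.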


4. Question 4

SAheart dataset — Logistic regression (chd ~ age + alcohol + obesity + tobacco + typea + ldl).

Test Set Misclassification: 0.31, Training Set: 0.27
❌ Test Set Misclassification: 0.43, Training Set: 0.31
❌ Test Set Misclassification: 0.27, Training Set: 0.31
❌ Test Set Misclassification: 0.35, Training Set: 0.31

Explanation:
The fitted logistic model achieves about 27% misclassification on the training set and 31% on the test set — typical for this split. Calculated by applying the quiz's missClass() helper to predict(..., type = "response").
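A sketch of the computation, assuming the SAheart data and the missClass() helper given in the question (seeds follow the quiz instructions):

```r
library(ElemStatLearn)   # provides SAheart (package may need to be installed from the archive)
data(SAheart)

set.seed(8484)
train   <- sample(1:dim(SAheart)[1], size = dim(SAheart)[1] / 2, replace = FALSE)
trainSA <- SAheart[train, ]
testSA  <- SAheart[-train, ]

set.seed(13234)
fit <- glm(chd ~ age + alcohol + obesity + tobacco + typea + ldl,
           data = trainSA, family = "binomial")

# Misclassification rate at a 0.5 probability cutoff (helper from the quiz)
missClass <- function(values, prediction) {
  sum(((prediction > 0.5) * 1) != values) / length(values)
}

missClass(trainSA$chd, predict(fit, newdata = trainSA, type = "response"))  # training error
missClass(testSA$chd,  predict(fit, newdata = testSA,  type = "response")) # test error
```

With these seeds the quiz reports roughly 0.27 training and 0.31 test misclassification.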


5. Question 5

Vowel recognition data (vowel.train, vowel.test) — Random Forest model (randomForest(y ~ ., data=vowel.train)), variable importance order:

x.2, x.1, x.5, x.6, x.8, x.4, x.9, x.3, x.7, x.10
❌ x.2, x.1, x.5, x.8, x.6, x.4, x.3, x.9, x.7, x.10
❌ x.10, x.7, x.9, x.5, x.8, x.4, x.6, x.3, x.1, x.2
❌ x.1, x.2, x.3, x.8, x.6, x.4, x.5, x.9, x.7, x.10

Explanation:
Variable importance (mean decrease in the Gini index) ranks the features roughly in that order (x.2 most important → x.10 least). Computed with varImp() on the fitted randomForest model.
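A sketch of the fit and the importance ranking, assuming the ElemStatLearn vowel data and the seed the question specifies:

```r
library(ElemStatLearn)   # provides vowel.train / vowel.test
library(randomForest)
library(caret)

data(vowel.train)
vowel.train$y <- factor(vowel.train$y)  # outcome must be a factor for classification

set.seed(33833)
fit <- randomForest(y ~ ., data = vowel.train)

# Rank predictors by importance (mean decrease in Gini), most important first
imp <- varImp(fit)
rownames(imp)[order(imp$Overall, decreasing = TRUE)]
```

With this seed the ranking starts with x.2 and x.1 and ends with x.7 and x.10, matching the correct option.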


🧾 Summary Table

| Q# | ✅ Correct Answer | Key Concept |
|----|------------------|-------------|
| 1 | a. PS, b. WS, c. PS, d. Not possible | CART decision tree prediction |
| 2 | Bias ↑, variance ↓ when K is small; LOOCV K = n | Cross-validation tradeoff |
| 3 | 2.783; strange because Area is treated as numeric | Tree regression vs. classification |
| 4 | Test 0.31, train 0.27 | Logistic regression accuracy |
| 5 | x.2, x.1, x.5, x.6, x.8, x.4, x.9, x.3, x.7, x.10 | Random forest variable importance |