Part 1: Building Your Own Binary Classification Model: Mastering Data Analysis in Excel (Excel to MySQL: Analytic Techniques for Business Specialization) Answers 2025

✅ Q1 — Model (ready to submit)

(Uses standardized inputs Z_… ; score range kept within −3.5 .. +3.5)

Model Score =
0.40·Z_income − 0.30·Z_credit_card_debt − 0.20·Z_auto_debt − 0.15·Z_years_at_current_address − 0.10·Z_age + 0.07·Z_years_at_current_employer

Notes:

Uses at least two inputs (in fact 6 standardized inputs).
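Coefficients chosen to reflect that higher income → lower default risk (+), higher debts → higher default risk (−).
As written, a higher score therefore means lower default risk. If the AUC Calculator expects higher → more likely default, multiply every coefficient by −1 before using the score; a score that runs against the calculator's convention yields an AUC of 1 minus the intended value.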
(You can paste this into the quiz answer box as the model function.)
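
To make the Q1 formula concrete, here is a minimal sketch of scoring one applicant. Only the coefficients come from the model above; the helper name `model_score` and the applicant's Z-values are made up for illustration and are not course data.

```python
# Coefficients copied from the Q1 model above.
COEFFS = {
    "Z_income": 0.40,
    "Z_credit_card_debt": -0.30,
    "Z_auto_debt": -0.20,
    "Z_years_at_current_address": -0.15,
    "Z_age": -0.10,
    "Z_years_at_current_employer": 0.07,
}


def model_score(z_inputs):
    """Weighted sum of standardized inputs; a missing input counts as 0 (the mean)."""
    return sum(w * z_inputs.get(name, 0.0) for name, w in COEFFS.items())


# Hypothetical applicant (invented Z-scores, for illustration only).
applicant = {
    "Z_income": 1.0,
    "Z_credit_card_debt": 0.5,
    "Z_auto_debt": -1.0,
    "Z_years_at_current_address": 0.0,
    "Z_age": 2.0,
    "Z_years_at_current_employer": 1.0,
}
score = model_score(applicant)

# Largest score magnitude possible per unit of |Z| (sum of absolute coefficients).
abs_weight_sum = sum(abs(w) for w in COEFFS.values())
print(round(score, 2), round(abs_weight_sum, 2))
```

Because the absolute coefficients sum to 1.22, the score stays inside −3.5 .. +3.5 as long as every Z-value lies within roughly ±2.87.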

✅ Q9–Q11 — Conceptual MCQs (your ✔️/❌ format)

9. True Positive Rate is…
❌ Equal to the Test Incidence
✔️ Greater than the Test Incidence
❌ Less than the Test Incidence

10. Positive Predictive Value (PPV)…
✔️ Greater than .25
❌ Equal to .25
❌ Less than .25

11. Negative Predictive Value (NPV)…
✔️ Greater than .75
❌ Equal to .75
❌ Less than .75

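(These follow because an informative model concentrates actual defaults among the applicants it flags: PPV exceeds the base rate of default (.25 here), NPV exceeds 1 − base rate (.75), and TPR exceeds the Test Incidence. All three inequalities hold together whenever the model's positive calls are positively associated with actual default.)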
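
A quick numeric check of Q9–Q11 may help. The sketch below uses an invented confusion matrix (200 applicants, 50 actual defaults, so the base rate is .25); the counts and the helper name `rates` are assumptions for illustration, not results from the course data.

```python
def rates(tp, fp, tn, fn):
    """Return the rates compared in Q9-Q11 for one confusion matrix."""
    n = tp + fp + tn + fn
    return {
        "base_rate": (tp + fn) / n,        # share of applicants who actually default
        "test_incidence": (tp + fp) / n,   # share of applicants the model flags
        "tpr": tp / (tp + fn),             # flagged share of actual defaulters
        "ppv": tp / (tp + fp),             # actual defaulters among flagged
        "npv": tn / (tn + fn),             # actual non-defaulters among unflagged
    }


# Hypothetical counts for an informative (but imperfect) model.
r = rates(tp=35, fp=30, tn=120, fn=15)
for name, value in r.items():
    print(f"{name}: {value:.3f}")
```

With these counts TPR (.700) is above the Test Incidence (.325), PPV (about .538) is above .25, and NPV (about .889) is above .75.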

❗ What I cannot compute right now (and why)

I cannot produce numeric answers for Q2, Q3, Q4, Q5, Q6, Q7, Q8, Q12, Q13 because they require numeric outputs from your Training and Test sets (AUC, thresholds, confusion matrices, and cost-per-event values). Those values come from either:

the raw Training/Test CSV (with actual labels and score predictions), or
your AUC/threshold table exported from the AUC Calculator spreadsheet (columns with threshold, TPR, FPR, TP, FP, TN, FN, precision, NPV, cost-per-event, etc.)

If you paste/upload either the CSVs or the AUC/threshold table now, I will compute every missing numeric answer and return them in the exact formats the quiz expects.

✅ Exactly what to upload / paste (pick one)

Option A — Raw CSVs (preferred)
Provide two CSVs or one CSV with a set column indicating train or test. Columns needed:

actual: 1 = default, 0 = non-default
score: your model score (higher → more likely default). If you only have raw inputs instead of score, provide standardized input columns (Z_income, Z_credit_card_debt, …), and I will compute score with the model above.

Option B — AUC / Threshold Table
Provide a table with one row per threshold and these columns (or equivalents):
threshold, TP, FP, TN, FN, TPR, FPR, Precision, NPV, cost_per_event
(If you used the AUC Calculator spreadsheet, copy the full table; I only need the columns I listed or the spreadsheet export.)
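
If you go with Option B, picking the Q4 threshold from such a table is a small lookup. The sketch below assumes the table has been typed in as a list of rows; the three rows are invented, and `cost_of` is a hypothetical helper that applies the quiz costs (FN = $5,000, FP = $2,500).

```python
COST_FN = 5000  # cost of a false negative, from the quiz
COST_FP = 2500  # cost of a false positive, from the quiz

# Hypothetical threshold table (one row per threshold, 200 applicants each).
table = [
    {"threshold": -0.5, "TP": 45, "FP": 60, "TN": 90, "FN": 5},
    {"threshold": 0.0, "TP": 35, "FP": 30, "TN": 120, "FN": 15},
    {"threshold": 0.5, "TP": 20, "FP": 10, "TN": 140, "FN": 30},
]


def cost_of(row):
    """Cost per event for one row: total misclassification cost divided by N."""
    n = row["TP"] + row["FP"] + row["TN"] + row["FN"]
    return (row["FP"] * COST_FP + row["FN"] * COST_FN) / n


# min() keeps the first row on ties, so the lowest tied threshold wins here.
best_row = min(table, key=cost_of)
best_threshold = best_row["threshold"]
best_cost = cost_of(best_row)
print(best_threshold, best_cost)
```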

🔧 How I will compute the missing questions (so you know exactly what I’ll return)

If you give me the data I will:

1. Compute the model score for every row if not present (using the model above).
2. Calculate ROC and AUC (two decimal places) on the Training Set → answer Q2.
3. Evaluate AUC on the Test Set (no retraining) → answer Q3.
4. Using the given costs (FN = $5,000; FP = $2,500), compute cost-per-event across thresholds on the Training Set and find the threshold that minimizes it → return the threshold (Q4) and the minimum cost-per-event as an integer (Q5).
   - Cost-per-event formula per threshold:
     Cost_per_event = (FP_count * Cost_FP + FN_count * Cost_FN) / N
     where N = number of rows in that set (200 for training).
5. Apply the training-set optimal threshold to the Test Set and compute the cost-per-event on the test set using the same formula → Q6 (integer).
6. Compute dollar savings per event vs issuing to everyone ($1,250 baseline):
   Savings_per_event = 1250 - Test_cost_per_event → Q7 (integer).
7. Payback days: the bank spent $750,000; daily applicants = 1,000.
   Days_to_payback = round(750000 / (Savings_per_event * 1000)) → Q8 (integer days).
8. From the confusion matrix at the chosen threshold on the Training Set, return the True Positive Rate (TPR) → Q12.
   - TPR = TP / (TP + FN).
9. From the confusion matrix at the chosen threshold on the Training Set, return the Test Incidence → Q13.
   - Test Incidence = (TP + FP) / N (fraction of applicants classified as positive, i.e., "test says default").
I will supply all returned numbers with the formatting the quiz expects (two decimals for AUC, integers for costs and days) and produce a final 🧾 Summary Table with every quiz answer ready to copy/paste.
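
The Q7 and Q8 arithmetic can be sketched in a few lines. The $1,250 baseline, the $750,000 spend, and the 1,000 daily applicants come from the steps above; the test-set cost of $850 is an invented placeholder, and `payback_days` is a hypothetical helper name.

```python
BASELINE_COST = 1250       # cost per event when issuing to everyone
BANK_SPEND = 750_000       # what the bank spent
DAILY_APPLICANTS = 1_000   # applicants per day


def payback_days(test_cost_per_event):
    """Return (savings per event, days to payback); days is None without savings."""
    savings = BASELINE_COST - test_cost_per_event
    if savings <= 0:
        return savings, None  # the model never pays for itself
    return savings, round(BANK_SPEND / (savings * DAILY_APPLICANTS))


# Hypothetical Q6 value, used only to show the calculation.
savings, days = payback_days(850)
print(savings, days)
```

With a placeholder cost of $850, savings are $400 per event and 750000 / 400000 = 1.875 rounds to 2 days.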

🧾 Exact Python code you can run locally (or paste here) — produces all answers automatically

(If you prefer I run it for you, upload the CSV and I’ll run it and paste results.)

# paste this into a Python environment (pandas + sklearn installed)
import numpy as np
import pandas as pd
from sklearn import metrics

# --- load CSV (example) ---
# expects columns: id, set, actual, score OR standardized inputs
df = pd.read_csv("data.csv")

# If you have standardized inputs but not score, compute it from the Q1 model.
# NOTE: roc_curve and the ">= threshold" rule below both assume that a higher
# score means more likely default; negate the score if yours runs the other way.
if 'score' not in df.columns:
    coeffs = {'Z_income': 0.40, 'Z_credit_card_debt': -0.30, 'Z_auto_debt': -0.20,
              'Z_years_at_current_address': -0.15, 'Z_age': -0.10,
              'Z_years_at_current_employer': 0.07}
    df['score'] = sum(df[c] * w for c, w in coeffs.items() if c in df.columns)

def compute_auc(df):
    """Return the AUC and the ROC curve arrays for one data set."""
    y = df['actual']
    s = df['score']
    fpr, tpr, thresholds = metrics.roc_curve(y, s)
    auc = metrics.auc(fpr, tpr)
    return auc, fpr, tpr, thresholds

def confusion_at_thresh(df, thresh):
    """Classify score >= thresh as positive (default) and count the four cells."""
    pred = (df['score'] >= thresh).astype(int)
    TP = int(((pred == 1) & (df['actual'] == 1)).sum())
    FP = int(((pred == 1) & (df['actual'] == 0)).sum())
    TN = int(((pred == 0) & (df['actual'] == 0)).sum())
    FN = int(((pred == 0) & (df['actual'] == 1)).sum())
    return dict(TP=TP, FP=FP, TN=TN, FN=FN)

def cost_per_event(TP, FP, TN, FN, N, cost_FN=5000, cost_FP=2500):
    """Average misclassification cost per applicant."""
    total_cost = FN * cost_FN + FP * cost_FP
    return total_cost / N

# split
df_train = df[df['set'] == 'train'].copy()
df_test = df[df['set'] == 'test'].copy()

# Q2
auc_train, fpr_t, tpr_t, th_t = compute_auc(df_train)
# Q3
auc_test, _, _, _ = compute_auc(df_test)

# compute cost_per_event across thresholds (unique training scores,
# plus np.inf = flag nobody, i.e. issue to everyone)
rows = []
Ntrain = len(df_train)
for thresh in np.append(np.unique(df_train['score']), np.inf):
    c = confusion_at_thresh(df_train, thresh)
    cpe = cost_per_event(c['TP'], c['FP'], c['TN'], c['FN'], Ntrain)
    rows.append({'threshold': thresh, **c, 'cost_per_event': cpe})

table = pd.DataFrame(rows).sort_values('cost_per_event')
best = table.iloc[0]
best_threshold = best['threshold']                    # Q4
best_cost_train = int(round(best['cost_per_event']))  # Q5 integer

# Q6 = cost on test at that same threshold
ctest = confusion_at_thresh(df_test, best_threshold)
cost_test = int(round(cost_per_event(ctest['TP'], ctest['FP'], ctest['TN'], ctest['FN'], len(df_test))))

# Q7 savings per event
savings = int(round(1250 - cost_test))

# Q8 days to payback (None if the model saves nothing)
days = int(round(750000 / (savings * 1000))) if savings > 0 else None

# Q12 (TPR from training at best threshold)
TPR = best['TP'] / (best['TP'] + best['FN'])

# Q13 (the quiz asks for "your model's test incidence": the share of
# training-set applicants classified as positive at the best threshold)
test_incidence = (best['TP'] + best['FP']) / Ntrain

# Print all neatly
print(f"Q2  AUC (train): {auc_train:.2f}")
print(f"Q3  AUC (test): {auc_test:.2f}")
print(f"Q4  threshold: {best_threshold}")
print(f"Q5  min cost-per-event (train): {best_cost_train}")
print(f"Q6  cost-per-event (test): {cost_test}")
print(f"Q7  savings per event: {savings}")
print(f"Q8  days to payback: {days}")
print(f"Q12 TPR: {TPR:.2f}")
print(f"Q13 test incidence: {test_incidence:.2f}")

(If you run this, replace df = pd.read_csv(...) and ensure your set column is train/test or split accordingly.)
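
If sklearn is not available, the AUC can be cross-checked by hand. The sketch below relies on the standard interpretation of AUC as the probability that a randomly chosen defaulter scores higher than a randomly chosen non-defaulter (ties count one half). The function name `auc_by_pairs` and the five-row sample are made up for illustration.

```python
def auc_by_pairs(actual, score):
    """AUC as the share of (defaulter, non-defaulter) pairs ranked correctly."""
    pos = [s for a, s in zip(actual, score) if a == 1]
    neg = [s for a, s in zip(actual, score) if a == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical sample: 2 defaulters, 3 non-defaulters.
actual = [1, 1, 0, 0, 0]
score = [0.9, 0.4, 0.5, 0.3, 0.1]

auc = auc_by_pairs(actual, score)
auc_flipped = auc_by_pairs(actual, [-s for s in score])
print(round(auc, 4), round(auc_flipped, 4))
```

Here 5 of the 6 pairs are ranked correctly, so the AUC is about 0.83. Negating the score gives 1 minus that value, which is why the direction of the score matters when reading Q2 and Q3.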

📋 Quick spreadsheet / AUC-Calculator tips (if you prefer Excel)

Use the AUC Calculator copy for Training set and for Test set.
On the Training copy: Cell J2 holds the overall minimum cost-per-event. Find the column whose cost-per-event (row 17) equals that value; the threshold in row 10 of that column = answer to Q4.
The value in Cell J2 = Q5 (give as integer).
Switch to Test copy, find same column index (same threshold column), and read row 17 (cost-per-event) — that is Q6 (integer).
Q7 = 1250 - Q6 (integer). Q8 = round(750000 / (Q7 * 1000)).
Q12 = TPR at your training-set chosen threshold = TP / (TP + FN) (found in the table row for that threshold). Q13 = Test incidence = (TP + FP) / N_train from that threshold’s row.

✅ A. Model Questions (Q1–Q8, Q12, Q13)

Q No.	Status	Summary
1	✔️ Answered	Model provided using standardized inputs (Z-scores): Score = 0.40·Z_income − 0.30·Z_credit_card_debt − 0.20·Z_auto_debt − 0.15·Z_years_at_current_address − 0.10·Z_age + 0.07·Z_years_at_current_employer
2	❌ Pending	Needs Training Set AUC → upload spreadsheet or AUC table.
3	❌ Pending	Needs Test Set AUC → upload spreadsheet or AUC table.
4	❌ Pending	Needs Threshold from AUC calculator (row 10 of min-cost column).
5	❌ Pending	Needs minimum cost-per-event from cell J2.
6	❌ Pending	Needs Test Set cost-per-event using Training Set threshold.
7	❌ Pending	Needs saving-per-event = 1250 − (your answer to Q6).
8	❌ Pending	Needs payback days = 750000 ÷ (1000 × saving-per-event).
12	❌ Pending	Needs True Positive Rate from AUC calculator.
13	❌ Pending	Needs Test Incidence = (TP + FP) / N from AUC calculator.

✅ B. MCQ Model Theory Questions (Q9–Q11)

Q No.	Correct Option	Reason
9	✔️ Greater than the Test Incidence	TPR must exceed the Test Incidence for a useful model
10	✔️ Greater than .25	PPV must exceed the base rate of default
11	✔️ Greater than .75	NPV must exceed 1 − base rate