1. Question 1

Which describes a file with plain text, rows, and columns?

❌ A text file containing key-value pairs
❌ An array of values separated by a comma
❌ A Microsoft Excel spreadsheet
✅ A text file that saves data in tables

Explanation:
Plain text organized into rows and columns is tabular data stored as text, as in CSV or TSV files.

2. Question 2

Library for classification like spam detection?

❌ Fast array processing
❌ Exploratory data analysis
❌ Operations on matrices
✅ Statistical modeling, including regression and classification

Explanation:
This describes scikit-learn.

3. Question 3

Reading CSV from remote server—two important factors?

❌ Encoding scheme and file path
❌ File types and formats
❌ Format and file path
✅ File types and encoding scheme

Explanation:
Correct file type (CSV) + encoding (UTF-8 etc.) matter most.
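As a rough illustration of why both factors matter, the sketch below passes the format (comma-separated) and the encoding to pandas explicitly. The bytes stand in for a file downloaded from a remote server; the sample content and column names are made up for this example. pd.read_csv also accepts a URL in place of the buffer.

```python
import io

import pandas as pd

# Stand-in for bytes fetched from a remote server (hypothetical content).
raw_bytes = "name,city\nZoë,Zürich\n".encode("utf-8")

# State the format (CSV, comma-separated) and the encoding explicitly.
df = pd.read_csv(io.BytesIO(raw_bytes), sep=",", encoding="utf-8")
print(df)
```

Decoding the same bytes with the wrong encoding would garble the accented characters.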

4. Question 4

Why use Python’s DB API?

❌ It autogenerates UIs
❌ It bypasses SQL
❌ It transforms output to JSON
✅ It allows consistent query and connection across SQL systems

Explanation:
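The DB API (PEP 249) defines a common interface for connections, cursors, and queries, so code looks much the same across SQL databases.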
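A minimal sketch of the DB API pattern, using the standard-library sqlite3 driver as one example of a compliant module. The table and values are invented for illustration. Other drivers follow the same connect, cursor, execute, fetch flow, though the placeholder style (here "?") can differ between them.

```python
import sqlite3

# DB API flow: connect -> cursor -> execute -> fetch -> close.
conn = sqlite3.connect(":memory:")  # in-memory database for the example
cur = conn.cursor()

cur.execute("CREATE TABLE cars (make TEXT, price REAL)")
cur.executemany(
    "INSERT INTO cars VALUES (?, ?)",
    [("audi", 13950.0), ("bmw", 16430.0)],
)
conn.commit()

# Parameterized query; sqlite3 uses "?" placeholders.
cur.execute("SELECT make, price FROM cars WHERE price > ?", (14000,))
rows = cur.fetchall()
print(rows)

cur.close()
conn.close()
```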

5. Question 5

Model: ŷ = b₀ + b₁x

✅ Linear regression
❌ Polynomial regression
❌ Multiple linear regression
❌ Exponential regression

Explanation:
Single feature + straight line = simple linear regression.
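To make b₀ and b₁ concrete, here is a small sketch that fits a simple linear regression with scikit-learn. The data points are invented and lie exactly on y = 1 + 2x, so the fitted intercept and slope should recover those values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# One feature, shape (n_samples, 1); toy data on the line y = 1 + 2x.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])

model = LinearRegression().fit(X, y)
b0 = model.intercept_   # intercept b0
b1 = model.coef_[0]     # slope b1 for the single feature

y_hat = model.predict(np.array([[5.0]]))
print(b0, b1, y_hat[0])
```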

6. Question 6

Why unrealistic negative predictions at extreme MPG?

❌ Coefficients are uninterpreted
✅ The model extrapolates beyond realistic data ranges
❌ Regression line always goes up
❌ Low R² values

Explanation:
A linear model keeps following the fitted line outside the range of the training data, so predictions there can be unrealistic, including negative values.
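The effect can be sketched with made-up numbers (these are not the quiz's data): a line fitted on MPG values between 20 and 40 gives a sensible price inside that range and a negative one well outside it.

```python
import numpy as np

# Hypothetical training data: price falls as MPG rises.
mpg = np.array([20.0, 25.0, 30.0, 35.0, 40.0])
price = np.array([30000.0, 24000.0, 18000.0, 12000.0, 6000.0])

# Degree-1 fit; polyfit returns [slope, intercept].
b1, b0 = np.polyfit(mpg, price, 1)

inside = b0 + b1 * 30.0    # within the training range
outside = b0 + b1 * 50.0   # extrapolation beyond the training range
print(inside, outside)
```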

7. Question 7

Challenge with single train-test split?

❌ R² becomes invalid
❌ Model lacks accuracy due to decreased training data
✅ Generalization error may change with each split
❌ Model cannot adapt to hidden features

Explanation:
Each random split puts different rows in the test set, so the estimated generalization error changes from split to split.
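A quick way to see this is to repeat the split with different seeds and compare the test scores. The sketch below uses synthetic data generated for the example; the sample size and noise level are arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: a linear signal plus random noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(60, 1))
y = 3.0 * X[:, 0] + rng.normal(0.0, 2.0, size=60)

scores = []
for seed in range(5):
    # Same data, same model; only the random split changes.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    scores.append(LinearRegression().fit(X_tr, y_tr).score(X_te, y_te))

print("test R^2 per split:", np.round(scores, 3))
```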

8. Question 8

What does the code do?

❌ Calculates the mean only
✅ Fills missing values in “price” with the mean
❌ Replaces with normalized values
❌ Drops rows

Explanation:
Mean imputation.
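The quiz's code is not reproduced here, so the following is only a sketch of what mean imputation on a "price" column typically looks like in pandas. The DataFrame and its values are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing prices.
df = pd.DataFrame({"price": [10000.0, np.nan, 14000.0, np.nan, 12000.0]})

# mean() skips NaN by default.
mean_price = df["price"].mean()

# Replace the missing entries with the column mean.
df["price"] = df["price"].fillna(mean_price)
print(df)
```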

9. Question 9

Purpose of binning?

❌ Randomizes price
❌ Filters values not fitting
✅ Creates labeled segments for price intervals
❌ Equalizes price values

Explanation:
Binning groups continuous values into intervals.
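One possible way to build labeled price intervals is pd.cut with equal-width edges. The prices and the label names below are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical prices.
prices = pd.Series([5000, 9000, 15000, 22000, 30000, 45000])

# 4 equally spaced edges define 3 equal-width bins.
bins = np.linspace(prices.min(), prices.max(), 4)
labels = ["Low", "Medium", "High"]

# include_lowest=True keeps the minimum value inside the first bin.
binned = pd.cut(prices, bins=bins, labels=labels, include_lowest=True)
print(binned.value_counts())
```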

10. Question 10

Method removing mean then dividing by standard deviation?

❌ Simple scaling
✅ Z-score standardization
❌ Feature binning
❌ Min-max scaling

Explanation:
(Standard score) = (x − μ) / σ.
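The formula translates directly into pandas. This sketch uses an invented series; note that pandas' std() uses the sample standard deviation (ddof=1) by default, which is an implementation detail rather than something the question specifies.

```python
import pandas as pd

# Hypothetical feature values.
x = pd.Series([2.0, 4.0, 6.0, 8.0])

# Subtract the mean, then divide by the standard deviation.
z = (x - x.mean()) / x.std()
print(z)
```

After standardization the values have mean 0 and standard deviation 1.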

11. Question 11

Check data types of each column?

❌ dataframe.values()
❌ dataframe.rename()
❌ dataframe.astype("int")
✅ dataframe.dtypes("int") (closest match)

Note: the syntax in this option is wrong. In pandas, dtypes is an attribute, so the correct form is dataframe.dtypes, with no parentheses or arguments.

Explanation:
df.dtypes shows each column’s type.
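A short sketch of the attribute in use, on a made-up DataFrame. It also shows that calling dtypes like a method fails, because the attribute returns a Series, which is not callable.

```python
import pandas as pd

# Hypothetical data with mixed column types.
df = pd.DataFrame(
    {"make": ["audi", "bmw"], "price": [13950.0, 16430.0], "doors": [4, 2]}
)

# dtypes is an attribute: no parentheses.
print(df.dtypes)

# Calling it like a method raises TypeError.
try:
    df.dtypes("int")
    called_ok = True
except TypeError:
    called_ok = False
print("callable:", called_ok)
```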

12. Question 12

What is EDA?

✅ Reviewing key characteristics and uncovering patterns
❌ Segmenting data
❌ Minimizing dimensionality
❌ Training models

Explanation:
EDA helps understand structure and patterns.

13. Question 13

Negative linear relationship means:

❌ Output does not explain input
❌ Output decreases at increasing rate
✅ With increase in input, output decreases at same rate
❌ Output increases

Explanation:
Negative slope → output falls by a constant amount for each unit increase in input.

14. Question 14

Detect outliers in engine size?

❌ Scatter plot
✅ Box plot
❌ Describe for histogram
❌ value_counts

Explanation:
Box plots reveal outliers clearly.
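A box plot usually flags points beyond 1.5 × IQR from the quartiles. The sketch below applies that rule numerically to invented engine sizes, which is roughly what the plot shows visually; the 1.5 factor is the common default, assumed here.

```python
import pandas as pd

# Hypothetical engine sizes with one extreme value.
engine_size = pd.Series([90, 97, 108, 109, 110, 120, 122, 130, 136, 326])

q1 = engine_size.quantile(0.25)
q3 = engine_size.quantile(0.75)
iqr = q3 - q1

# Whisker limits under the 1.5 * IQR rule.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = engine_size[(engine_size < lower) | (engine_size > upper)]
print(outliers.tolist())
```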

15. Question 15

Study average price per drive type:

✅ Group the data using category values
❌ Use numeric filter
❌ Filter rows
❌ Combine datasets

Explanation:
Use groupby().
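A minimal groupby() sketch. The column names "drive-wheels" and "price" and all values are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical car data.
df = pd.DataFrame(
    {
        "drive-wheels": ["fwd", "rwd", "fwd", "4wd", "rwd"],
        "price": [8000.0, 20000.0, 10000.0, 9000.0, 24000.0],
    }
)

# Group rows by category, then average the price within each group.
avg_price = df.groupby("drive-wheels")["price"].mean()
print(avg_price)
```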

16. Question 16

Role of independent variables?

❌ Summarize performance
❌ Define accuracy
❌ Compare models
✅ They serve as inputs to estimate the output

Explanation:
Independent variables predict the target.

17. Question 17

Residuals show curved pattern:

❌ Linear relationship
✅ Model may be inaccurate
❌ Prediction errors low
❌ Residuals random

Explanation:
Curved residuals → need nonlinear model.
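To see what a curved residual pattern looks like numerically, this sketch fits a straight line to data that is actually quadratic (an invented example). The residuals are positive at both ends and negative in the middle, and a degree-2 fit removes the pattern.

```python
import numpy as np

# Toy data with a curved (quadratic) relationship.
x = np.linspace(-3.0, 3.0, 13)
y = x ** 2

# Straight-line fit and its residuals.
line = np.polyfit(x, y, 1)
resid_linear = y - np.polyval(line, x)

# Degree-2 polynomial fit and its residuals.
quad = np.polyfit(x, y, 2)
resid_quad = y - np.polyval(quad, x)

print(np.round(resid_linear, 2))
print(np.round(resid_quad, 2))
```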

18. Question 18

True about noise?

❌ Accounted by parameter
❌ No noise if testing fits well
✅ It is random and cannot be predicted
❌ No noise if training fits well

Explanation:
Noise = randomness.

19. Question 19

Large alpha in ridge regression:

❌ Lower order needed
✅ Model is underfitted
❌ Overfitted
❌ Higher alpha = better fit

Explanation:
Large alpha shrinks coefficients too much → underfitting.
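The shrinkage can be observed directly. This sketch fits Ridge on synthetic data (generated here, not from the quiz) with increasing alpha and records the coefficient norm and training R².

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data with known coefficients plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([4.0, -2.0, 1.0]) + rng.normal(0.0, 0.1, size=100)

results = {}
for alpha in [0.01, 1.0, 100.0, 10000.0]:
    model = Ridge(alpha=alpha).fit(X, y)
    # Store (coefficient norm, training R^2) for each alpha.
    results[alpha] = (np.linalg.norm(model.coef_), model.score(X, y))
    print(alpha, results[alpha])
```

With a very large alpha the coefficients are pushed toward zero and the fit degrades, which is the underfitting the question refers to.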

20. Question 20

Argument required in GridSearchCV?

❌ Dictionary of columns
✅ Dictionary of parameters and values
❌ Dataframe of models
❌ Normalized feature value

Explanation:
GridSearchCV needs param_grid = {'param': [values]}
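As a sketch of how the parameter dictionary is passed, the example below searches over Ridge's alpha on synthetic data. The estimator, the candidate values, and cv=4 are assumptions made for this illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([4.0, -2.0, 1.0]) + rng.normal(0.0, 0.1, size=100)

# Keys are estimator parameter names; values are the candidates to try.
param_grid = {"alpha": [0.01, 0.1, 1, 10, 100]}

search = GridSearchCV(Ridge(), param_grid, cv=4)
search.fit(X, y)
print(search.best_params_)
```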

🧾 SUMMARY TABLE

Q	Answer	Key Concept
1	Text file with tables	File formats
2	Statistical modeling	ML libraries
3	File type + encoding	Data loading
4	Consistent SQL interface	DB API
5	Linear regression	Model type
6	Extrapolation issue	Prediction limits
7	Split variance	Generalization error
8	Fill NaN with mean	Imputation
9	Labeled intervals	Binning
10	Z-score	Standardization
11	dataframe.dtypes	Data types
12	Pattern discovery	EDA
13	Output decreases	Negative linearity
14	Box plot	Outlier detection
15	Group by category	Aggregation
16	Inputs to model	IV role
17	Model inaccurate	Residual patterns
18	Random, unpredictable	Noise
19	Underfitting	Ridge alpha
20	Parameter dictionary	Grid search

Final Exam: Data Analysis with Python (Applied Data Science Specialization) Answers 2025