Final Exam: Data Analysis with Python (Applied Data Science Specialization) Answers 2025
1. Question 1
Which describes a file with plain text, rows, and columns?
-
❌ A text file containing key-value pairs
-
❌ An array of values separated by a comma
-
❌ A Microsoft Excel spreadsheet
-
✅ A text file that saves data in tables
Explanation:
A plain-text file that stores data in rows and columns is a delimited table format such as CSV or TSV.
2. Question 2
Library for classification like spam detection?
-
❌ Fast array processing
-
❌ Exploratory data analysis
-
❌ Operations on matrices
-
✅ Statistical modeling, including regression and classification
Explanation:
Classification tasks such as spam detection are statistical modeling, which is what scikit-learn provides.
3. Question 3
Reading a CSV from a remote server: two important factors?
-
❌ Encoding scheme and file path
-
❌ File types and formats
-
❌ Format and file path
-
✅ File types and encoding scheme
Explanation:
Correct file type (CSV) + encoding (UTF-8 etc.) matter most.
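As a rough illustration of why format and encoding both matter, the sketch below parses CSV bytes the way they might arrive from a server. The payload, its column names, and the Latin-1 encoding are all assumptions made up for this example; pandas' read_csv also accepts a file path or URL in place of the in-memory buffer used here.

```python
import io

import pandas as pd

# Hypothetical payload: CSV text encoded as Latin-1, standing in for
# bytes downloaded from a remote server.
raw_bytes = "make,price\nCitroën,13495\nRenault,9295\n".encode("latin-1")

# The format (comma-separated) and the encoding must both match the data.
# Decoding these bytes as UTF-8 would fail on the "ë" character.
df = pd.read_csv(io.BytesIO(raw_bytes), sep=",", encoding="latin-1")

first_make = df.loc[0, "make"]
```

If the encoding is unknown, checking the server's response headers or the data provider's documentation is usually safer than guessing.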
4. Question 4
Why use Python’s DB API?
-
❌ It autogenerates UIs
-
❌ It bypasses SQL
-
❌ It transforms output to JSON
-
✅ It allows consistent query and connection across SQL systems
Explanation:
DB API provides a unified interface.
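A minimal sketch of the DB API pattern, using the standard-library sqlite3 module as the driver. The table and its rows are hypothetical. Other DB API drivers follow the same connect / cursor / execute / fetch sequence, although connection arguments and the placeholder style can differ between them.

```python
import sqlite3

# Connect: the DB API entry point. ":memory:" is an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical table and data, for illustration only.
cur.execute("CREATE TABLE cars (make TEXT, price REAL)")
cur.executemany(
    "INSERT INTO cars VALUES (?, ?)",
    [("audi", 13950.0), ("bmw", 16430.0)],
)
conn.commit()

# Parameterized query: the driver substitutes the value safely.
cur.execute("SELECT make, price FROM cars WHERE price > ?", (14000,))
rows = cur.fetchall()

conn.close()
```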
5. Question 5
Model: ŷ = b₀ + b₁x
-
✅ Linear regression
-
❌ Polynomial regression
-
❌ Multiple linear regression
-
❌ Exponential regression
Explanation:
Single feature + straight line = simple linear regression.
6. Question 6
Why unrealistic negative predictions at extreme MPG?
-
❌ Coefficients are uninterpreted
-
✅ The model extrapolates beyond realistic data ranges
-
❌ Regression line always goes up
-
❌ Low R² values
Explanation:
A fitted line keeps its slope beyond the range of the training data, so at extreme MPG values it can predict impossible results such as negative values.
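The extrapolation problem can be sketched with a handful of made-up points. The MPG and price values below are hypothetical and chosen to lie exactly on a line, so the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical training data: price falls as highway MPG rises.
mpg = np.array([20.0, 25.0, 30.0, 35.0, 40.0])
price = np.array([30000.0, 24000.0, 18000.0, 12000.0, 6000.0])

# Fit y_hat = b0 + b1 * x (polyfit returns the slope first for deg=1).
slope, intercept = np.polyfit(mpg, price, deg=1)


def predict(x):
    """Predict price from MPG with the fitted line."""
    return intercept + slope * x


inside_range = predict(30.0)   # within the training range: plausible
outside_range = predict(50.0)  # beyond the training range: negative price
```

Inside the 20 to 40 MPG range the predictions are sensible; at 50 MPG the same line predicts a negative price.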
7. Question 7
Challenge with single train-test split?
-
❌ R² becomes invalid
-
❌ Model lacks accuracy due to decreased training data
-
✅ Generalization error may change with each split
-
❌ Model cannot adapt to hidden features
Explanation:
Different splits → different results.
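A small sketch of how the test error moves with the split, using synthetic data and plain NumPy. The data-generating line, the noise level, and the helper name split_mse are assumptions for illustration. Averaging over several splits, as cross-validation does, gives a steadier estimate than any single split.

```python
import numpy as np

# Synthetic data: a noisy line (assumed for illustration).
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=60)
y = 3.0 * x + 5.0 + rng.normal(0, 4.0, size=60)


def split_mse(seed):
    """Fit on a random 75% of the data and return the MSE on the other 25%."""
    order = np.random.default_rng(seed).permutation(len(x))
    test_idx, train_idx = order[:15], order[15:]
    slope, intercept = np.polyfit(x[train_idx], y[train_idx], deg=1)
    residuals = y[test_idx] - (intercept + slope * x[test_idx])
    return float(np.mean(residuals ** 2))


# Same model, same data, five different splits: five different error estimates.
scores = [split_mse(seed) for seed in range(5)]
mean_score = float(np.mean(scores))
```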
8. Question 8
What does the code do?
mean = df["price"].mean()
df["price"].replace(np.nan, mean)
-
❌ Calculates the mean only
-
✅ Fills missing values in “price” with the mean
-
❌ Replaces with normalized values
-
❌ Drops rows
Explanation:
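Mean imputation: NaN values in “price” are replaced with the column mean. As written, replace() returns a new Series, so the result must be assigned back to df["price"] for the change to persist.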
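A short sketch of the same imputation on a hypothetical three-row frame, showing that the fill only sticks once the result is assigned back.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing price.
df = pd.DataFrame({"price": [10000.0, np.nan, 20000.0]})

# mean() skips NaN by default, so this is the mean of the two known prices.
mean = df["price"].mean()

# replace() returns a new Series; without assignment df is unchanged.
df["price"].replace(np.nan, mean)
missing_before = int(df["price"].isna().sum())

# Assign the result back to make the imputation persist.
df["price"] = df["price"].replace(np.nan, mean)
missing_after = int(df["price"].isna().sum())
```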
9. Question 9
Purpose of binning?
-
❌ Randomizes price
-
❌ Filters values not fitting
-
✅ Creates labeled segments for price intervals
-
❌ Equalizes price values
Explanation:
Binning groups continuous values into intervals.
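One common way to bin in pandas is pd.cut. The sketch below uses hypothetical prices, three equal-width bins, and made-up labels.

```python
import numpy as np
import pandas as pd

# Hypothetical prices.
df = pd.DataFrame({"price": [5000, 12000, 18000, 30000, 45000]})

# Four equally spaced edges define three equal-width bins.
bins = np.linspace(df["price"].min(), df["price"].max(), 4)
labels = ["Low", "Medium", "High"]

# include_lowest=True keeps the minimum price inside the first bin.
df["price-binned"] = pd.cut(
    df["price"], bins, labels=labels, include_lowest=True
)

binned = df["price-binned"].astype(str).tolist()
```

Equal-width bins are only one choice; quantile-based bins (pd.qcut) put roughly the same number of rows in each segment instead.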
10. Question 10
Method removing mean then dividing by standard deviation?
-
❌ Simple scaling
-
✅ Z-score standardization
-
❌ Feature binning
-
❌ Min-max scaling
Explanation:
Standard score: z = (x − μ) / σ, where μ is the mean and σ is the standard deviation.
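The formula translates directly into pandas. The column name and values below are hypothetical. Note that pandas' std() uses the sample standard deviation (ddof=1) by default, which differs slightly from the population σ on small data.

```python
import pandas as pd

# Hypothetical feature column.
df = pd.DataFrame({"length": [160.0, 170.0, 180.0]})

mu = df["length"].mean()    # 170.0
sigma = df["length"].std()  # sample standard deviation: 10.0

# Z-score: subtract the mean, then divide by the standard deviation.
df["length_z"] = (df["length"] - mu) / sigma

z_values = df["length_z"].tolist()
```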
11. Question 11
Check data types of each column?
-
❌ dataframe.values()
-
❌ dataframe.rename()
-
❌ dataframe.astype(“int”)
-
✅ dataframe.dtypes(“int”) (closest match)
Explanation:
None of the options is valid pandas as written. dtypes is an attribute, not a method, so the working form is dataframe.dtypes with no parentheses or argument; calling dataframe.dtypes(“int”) raises a TypeError. The quiz expects the dtypes option as the closest match. df.dtypes shows each column’s type.
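A quick check on a hypothetical frame shows the attribute form working and the call form failing.

```python
import pandas as pd

# Hypothetical frame with mixed column types.
df = pd.DataFrame(
    {"make": ["audi", "bmw"], "price": [13950.0, 16430.0], "doors": [4, 2]}
)

# dtypes is an attribute: it returns a Series mapping column name to dtype.
column_types = df.dtypes
price_type = str(column_types["price"])
doors_type = str(column_types["doors"])

# Calling it like a method fails, because the returned Series is not callable.
try:
    df.dtypes("int")
    call_error = None
except TypeError as exc:
    call_error = type(exc).__name__
```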
12. Question 12
What is EDA?
-
✅ Reviewing key characteristics and uncovering patterns
-
❌ Segmenting data
-
❌ Minimizing dimensionality
-
❌ Training models
Explanation:
EDA helps understand structure and patterns.
13. Question 13
Negative linear relationship means:
-
❌ Output does not explain input
-
❌ Output decreases at increasing rate
-
✅ With increase in input, output decreases at same rate
-
❌ Output increases
Explanation:
Negative slope: each unit increase in the input lowers the output by a constant amount.
14. Question 14
Detect outliers in engine size?
-
❌ Scatter plot
-
✅ Box plot
-
❌ Describe for histogram
-
❌ value_counts
Explanation:
Box plots reveal outliers clearly.
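Box plots conventionally flag points that fall more than 1.5 × IQR beyond the quartiles. The sketch below applies that rule numerically to hypothetical engine sizes, without drawing the plot.

```python
import pandas as pd

# Hypothetical engine sizes with one unusually large value.
engine_size = pd.Series([97, 98, 108, 109, 110, 120, 122, 130, 136, 326])

q1 = engine_size.quantile(0.25)
q3 = engine_size.quantile(0.75)
iqr = q3 - q1

# Whisker limits under the usual 1.5 * IQR rule.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Points outside the whiskers are the ones a box plot draws as outliers.
outliers = engine_size[(engine_size < lower) | (engine_size > upper)].tolist()
```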
15. Question 15
Study average price per drive type:
-
✅ Group the data using category values
-
❌ Use numeric filter
-
❌ Filter rows
-
❌ Combine datasets
Explanation:
Use groupby().
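A minimal groupby sketch on hypothetical rows; the column names are assumptions chosen to mirror the question.

```python
import pandas as pd

# Hypothetical cars with a categorical drive type and a numeric price.
df = pd.DataFrame(
    {
        "drive-wheels": ["fwd", "rwd", "fwd", "4wd", "rwd"],
        "price": [10000.0, 20000.0, 12000.0, 9000.0, 24000.0],
    }
)

# Group rows by category, then average the price within each group.
avg_price = df.groupby("drive-wheels")["price"].mean()

avg_by_drive = avg_price.to_dict()
```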
16. Question 16
Role of independent variables?
-
❌ Summarize performance
-
❌ Define accuracy
-
❌ Compare models
-
✅ They serve as inputs to estimate the output
Explanation:
Independent variables predict the target.
17. Question 17
Residuals show curved pattern:
-
❌ Linear relationship
-
✅ Model may be inaccurate
-
❌ Prediction errors low
-
❌ Residuals random
Explanation:
A curved pattern in the residuals means the linear assumption does not hold; a nonlinear (for example, polynomial) model may fit better.
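The pattern is easy to reproduce: fit a straight line to data that is actually quadratic and look at what is left over. The data below is synthetic and noise-free, chosen only to make the curve obvious.

```python
import numpy as np

# Synthetic data that follows a parabola, not a line.
x = np.arange(-5.0, 6.0)
y = x ** 2

# Straight-line fit: residuals are positive at both ends and negative
# in the middle, which is the curved pattern the question describes.
slope, intercept = np.polyfit(x, y, deg=1)
residuals_linear = y - (intercept + slope * x)

# Second-degree fit: the residuals essentially vanish.
coeffs = np.polyfit(x, y, deg=2)
residuals_quadratic = y - np.polyval(coeffs, x)
```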
18. Question 18
True about noise?
-
❌ Accounted by parameter
-
❌ No noise if testing fits well
-
✅ It is random and cannot be predicted
-
❌ No noise if training fits well
Explanation:
Noise = randomness.
19. Question 19
Large alpha in ridge regression:
-
❌ Lower order needed
-
✅ Model is underfitted
-
❌ Overfitted
-
❌ Higher alpha = better fit
Explanation:
Large alpha shrinks coefficients too much → underfitting.
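The shrinkage can be seen with the closed-form ridge solution, w = (XᵀX + αI)⁻¹ Xᵀy. This sketch uses synthetic data and omits the intercept for brevity; the helper name ridge_coefficients is made up for the example and is not a library function.

```python
import numpy as np

# Synthetic data generated from known coefficients [3, -2].
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([3.0, -2.0]) + rng.normal(0, 0.1, size=50)


def ridge_coefficients(X, y, alpha):
    """Closed-form ridge solution without an intercept term."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


# Small alpha: coefficients stay close to the true values.
small_alpha_coefs = ridge_coefficients(X, y, alpha=0.01)

# Very large alpha: coefficients are pushed toward zero, so the model
# can no longer follow the data (underfitting).
large_alpha_coefs = ridge_coefficients(X, y, alpha=10000.0)
```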
20. Question 20
Argument required in GridSearchCV?
-
❌ Dictionary of columns
-
✅ Dictionary of parameters and values
-
❌ Dataframe of models
-
❌ Normalized feature value
Explanation:
GridSearchCV takes an estimator and param_grid, a dictionary of the form {'param_name': [values to try]}.
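A hedged sketch of a grid search over Ridge's alpha with scikit-learn. The synthetic data, the candidate alpha values, and the choice of cv=4 are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (assumed for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.1, size=80)

# The required dictionary: parameter name -> list of values to try.
param_grid = {"alpha": [0.001, 0.1, 10.0, 1000.0]}

# Each candidate is scored with 4-fold cross-validation.
search = GridSearchCV(Ridge(), param_grid, cv=4)
search.fit(X, y)

best_alpha = search.best_params_["alpha"]
```

After fitting, search.best_estimator_ holds the model refit on all the data with the winning parameters.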
🧾 SUMMARY TABLE
| Q | Answer | Key Concept |
|---|---|---|
| 1 | Text file with tables | File formats |
| 2 | Statistical modeling | ML libraries |
| 3 | File type + encoding | Data loading |
| 4 | Consistent SQL interface | DB API |
| 5 | Linear regression | Model type |
| 6 | Extrapolation issue | Prediction limits |
| 7 | Split variance | Generalization error |
| 8 | Fill NaN with mean | Imputation |
| 9 | Labeled intervals | Binning |
| 10 | Z-score | Normalization |
| 11 | dataframe.dtypes | Data types |
| 12 | Pattern discovery | EDA |
| 13 | Output decreases | Negative linearity |
| 14 | Box plot | Outlier detection |
| 15 | Group by category | Aggregation |
| 16 | Inputs to model | IV role |
| 17 | Model inaccurate | Residual patterns |
| 18 | Random, unpredictable | Noise |
| 19 | Underfitting | Ridge alpha |
| 20 | Parameter dictionary | Grid search |