Week 1 Quiz: Getting and Cleaning Data (Data Science Specialization): Answers 2025
Question 1
How many properties are worth $1,000,000 or more?
✅ 53
❌ 24
❌ 31
❌ 2076
Explanation:
Using R:
data <- read.csv("ss06hid.csv")
sum(data$VAL == 24, na.rm = TRUE)
In the codebook, VAL == 24 corresponds to properties worth $1,000,000 or more, and the count is 53.
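As a minimal sketch of why `na.rm = TRUE` matters in that count, the snippet below uses a small made-up data frame. The name `toy` and its values are assumptions for illustration and are not taken from ss06hid.csv.

```r
# Toy stand-in for the survey data; these VAL codes are invented for illustration.
toy <- data.frame(VAL = c(24, 10, NA, 24, 3, NA, 24))

# Comparing against NA yields NA, so the plain sum is NA.
sum(toy$VAL == 24)                # NA

# Dropping missing comparisons gives the actual count of matches.
sum(toy$VAL == 24, na.rm = TRUE)  # 3

# table() shows the count for every code at once; NA is excluded by default.
table(toy$VAL)
```

On the real file, the same pattern is what produces the count reported above.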
Question 2
Consider the variable FES. Which tidy data principle does it violate?
✅ Tidy data has one variable per column.
❌ Tidy data has one observation per row.
❌ Each tidy data table contains information about only one type of observation.
❌ Each variable in a tidy data set has been transformed to be interpretable.
Explanation: FES (family type and employment status) packs two variables into one coded column: the family type (for example, married-couple family or other family) and the labor-force status of its members.
Hence, it violates the rule of “one variable per column”.
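One way to picture the fix is to split a combined value into separate columns. The sketch below uses hypothetical labels (not the real codebook text), and the names `fes`, `parts`, and `tidy` are invented for this example.

```r
# Hypothetical labels that pack two facts into one value (not the real codebook text).
fes <- c("Married-couple family: both in labor force",
         "Other family: householder not in labor force",
         "Married-couple family: neither in labor force")

# Split each label at the ": " separator into two pieces.
parts <- strsplit(fes, ": ", fixed = TRUE)

# Put each piece in its own column, so every column holds one variable.
tidy <- data.frame(
  family_type = sapply(parts, `[`, 1),
  employment  = sapply(parts, `[`, 2),
  stringsAsFactors = FALSE
)
tidy
```

After the split, each column can be filtered or summarized on its own.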
Question 3
What is the value of sum(dat$Zip*dat$Ext, na.rm=T) from the Excel dataset?
✅ 36534720
❌ 154339
❌ 33544718
❌ NA
Explanation:
Code used:
library(readxl)
dat <- read_excel("DATA.gov_NGAP.xlsx", range="R18C7:R23C15")
sum(dat$Zip * dat$Ext, na.rm = TRUE)
Result = 36,534,720
Question 4
How many restaurants have zipcode 21231?
✅ 127
❌ 17
❌ 100
❌ 156
Explanation:
library(XML)
doc <- xmlTreeParse("restaurants.xml", useInternalNodes = TRUE)
root <- xmlRoot(doc)
zip <- xpathSApply(root, "//zipcode", xmlValue)
sum(zip == "21231")
Result = 127
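To try the same XPath approach without downloading the file, here is a small self-contained sketch. The inline XML is made up for illustration; only the `zipcode` element name is taken from the code above, and the XML package is assumed to be installed.

```r
library(XML)

# Tiny made-up document; only the <zipcode> element name mirrors the quiz data.
xml_text <- paste0(
  "<response>",
  "<row><name>A</name><zipcode>21231</zipcode></row>",
  "<row><name>B</name><zipcode>21202</zipcode></row>",
  "<row><name>C</name><zipcode>21231</zipcode></row>",
  "</response>"
)

# Parse the string into an internal document so XPath queries work on it.
doc <- xmlParse(xml_text, asText = TRUE)

# Pull the text of every <zipcode> node, then count the matches.
zip <- xpathSApply(doc, "//zipcode", xmlValue)
sum(zip == "21231")  # 2
```

Note that `xmlValue` returns character strings, which is why the comparison uses "21231" in quotes.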
Question 5
Using the data.table package, which is the fastest way to calculate the average of pwgtp15 by SEX?
✅ DT[, mean(pwgtp15), by=SEX]
❌ rowMeans(DT)[DT$SEX==1]; rowMeans(DT)[DT$SEX==2]
❌ mean(DT$pwgtp15,by=DT$SEX)
❌ mean(DT[DT$SEX==1,]$pwgtp15); mean(DT[DT$SEX==2,]$pwgtp15)
❌ tapply(DT$pwgtp15,DT$SEX,mean)
❌ sapply(split(DT$pwgtp15,DT$SEX),mean)
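Explanation: DT[, mean(pwgtp15), by=SEX] does the grouping and averaging inside data.table's optimized C code, so it typically reports the lowest user time compared with base R approaches such as tapply or split; on a dataset this small, a single system.time() run can be noisy. Note also that mean(DT$pwgtp15,by=DT$SEX) ignores by and returns the overall mean, and the rowMeans option does not compute a per-group mean of pwgtp15 at all.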
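The comparison can be reproduced with a rough sketch like the one below. It uses synthetic data rather than the quiz file: the column names follow the question, but the values, the row count, and the repetition count are assumptions, and timings will vary by machine.

```r
library(data.table)

# Synthetic data; column names follow the quiz, values are random.
set.seed(1)
DT <- data.table(SEX = sample(1:2, 1e5, replace = TRUE),
                 pwgtp15 = rnorm(1e5))

# Grouped mean computed inside data.table.
by_dt <- DT[, mean(pwgtp15), by = SEX]

# Base R equivalent for comparison.
by_tapply <- tapply(DT$pwgtp15, DT$SEX, mean)

# Both approaches should agree once the groups are in the same order.
all.equal(by_dt[order(SEX)]$V1, as.numeric(by_tapply))

# Repeat each call so the difference is measurable; results depend on the machine.
system.time(for (i in 1:100) DT[, mean(pwgtp15), by = SEX])
system.time(for (i in 1:100) tapply(DT$pwgtp15, DT$SEX, mean))

# `by` is silently ignored by mean(), so this is the overall mean, not per group.
overall <- mean(DT$pwgtp15, by = DT$SEX)
length(overall)
```

Repeating each call inside `system.time()` is a deliberate choice here, since a single run on a small table often rounds to zero.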
🧾 Summary Table
| Q# | ✅ Correct Answer | Key Concept |
|---|---|---|
| 1 | 53 | Property count worth ≥ $1,000,000 |
| 2 | One variable per column | Tidy data rule violation |
| 3 | 36,534,720 | Excel subset + sum calculation |
| 4 | 127 | XML parsing + filtering by zipcode |
| 5 | DT[, mean(pwgtp15), by=SEX] | Fastest data.table aggregation |