Week 1 Quiz: Getting and Cleaning Data (Data Science Specialization): Answers 2025

Question 1

How many properties are worth $1,000,000 or more?

✅ 53
❌ 24
❌ 31
❌ 2076

Explanation:
Using R:

# Load the housing survey file.
data <- read.csv("ss06hid.csv")
# Count rows whose VAL code is 24, ignoring missing values.
sum(data$VAL == 24, na.rm = TRUE)

In the codebook, VAL == 24 corresponds to properties worth $1,000,000 or more, and the count is 53.
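
To see why na.rm = TRUE matters in the count above, here is a minimal sketch on a made-up VAL vector. The values are invented for illustration and are not taken from ss06hid.csv.

```r
# Toy stand-in for the VAL column; these values are hypothetical.
val <- c(24, 10, NA, 24, 3, NA, 24)

# Comparing NA with 24 yields NA, so a plain sum() is NA as well.
count_without_na_rm <- sum(val == 24)

# With na.rm = TRUE the NA comparisons are dropped before counting.
count_with_na_rm <- sum(val == 24, na.rm = TRUE)

print(count_without_na_rm)
print(count_with_na_rm)
```

Without na.rm = TRUE, a single missing VAL would turn the whole count into NA.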


Question 2

Consider the variable FES. Which tidy data principle does it violate?

✅ Tidy data has one variable per column.
❌ Tidy data has one observation per row.
❌ Each tidy data table contains information about only one type of observation.
❌ Each variable in a tidy data set has been transformed to be interpretable.

Explanation:
FES (family type and employment status) packs two variables, the family type and the employment status of its members, into a single column.
Hence, it violates the rule of “one variable per column”.
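
As a rough illustration of what fixing this would look like, the sketch below splits a combined FES-style column into two columns using base R. The labels, the "; " separator, and the names combined, tidy, family_type, and employment are all hypothetical; the real codebook uses numeric codes with its own descriptions.

```r
# Hypothetical data: one column mixing family type and employment status.
combined <- data.frame(
  household = 1:3,
  FES = c("Married couple; both in labor force",
          "Male householder; in labor force",
          "Female householder; not in labor force"),
  stringsAsFactors = FALSE
)

# Split each label at the (assumed) "; " separator.
parts <- strsplit(combined$FES, "; ", fixed = TRUE)

# Build a table with one variable per column.
tidy <- data.frame(
  household   = combined$household,
  family_type = vapply(parts, `[`, character(1), 1),
  employment  = vapply(parts, `[`, character(1), 2),
  stringsAsFactors = FALSE
)

print(tidy)
```

After the split, family type and employment status can be filtered or grouped independently.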


Question 3

What is the value of sum(dat$Zip*dat$Ext, na.rm=T) from the Excel dataset?

✅ 36534720
❌ 154339
❌ 33544718
❌ NA

Explanation:
Code used:

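library(readxl)
# Read rows 18-23 and columns 7-15; the first row of the range supplies the column names.
dat <- read_excel("DATA.gov_NGAP.xlsx", range = "R18C7:R23C15")
# Sum the element-wise product of Zip and Ext, skipping missing values.
sum(dat$Zip * dat$Ext, na.rm = TRUE)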

Result = 36,534,720


Question 4

How many restaurants have zipcode 21231?

✅ 127
❌ 17
❌ 100
❌ 156

Explanation:

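library(XML)
# Parse into internal nodes so XPath queries can be run on the document.
doc <- xmlTreeParse("restaurants.xml", useInternalNodes = TRUE)
root <- xmlRoot(doc)
# Collect the text of every <zipcode> element, then count the matches.
zip <- xpathSApply(root, "//zipcode", xmlValue)
sum(zip == "21231")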

Result = 127


Question 5

Using the data.table package, which is the fastest way to calculate the average of pwgtp15 by SEX?

✅ DT[, mean(pwgtp15), by=SEX]
❌ rowMeans(DT)[DT$SEX==1]; rowMeans(DT)[DT$SEX==2]
❌ mean(DT$pwgtp15,by=DT$SEX)
❌ mean(DT[DT$SEX==1,]$pwgtp15); mean(DT[DT$SEX==2,]$pwgtp15)
❌ tapply(DT$pwgtp15,DT$SEX,mean)
❌ sapply(split(DT$pwgtp15,DT$SEX),mean)

Explanation:
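Two of the options do not compute the average by SEX at all: rowMeans(DT) averages across the columns of each row, and mean(DT$pwgtp15,by=DT$SEX) ignores the by argument and returns the overall mean. Among the options that do give the grouped mean, DT[, mean(pwgtp15), by=SEX] performs the grouping inside data.table's optimized C code, which is why it is the expected answer for the fastest user time compared with base R methods like tapply or split. Exact timings vary by machine and data size.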
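
A rough way to check the options yourself is sketched below on made-up data. The data frame df is a hypothetical stand-in for the survey table, the data.table timing runs only if that package is installed, and the timings it prints are indicative at best.

```r
set.seed(1)
# Made-up stand-in for the survey table; only SEX and pwgtp15 are needed.
df <- data.frame(SEX = sample(1:2, 1000, replace = TRUE),
                 pwgtp15 = sample(1:500, 1000, replace = TRUE))

# Three base R forms that do compute the mean per SEX.
by_tapply <- tapply(df$pwgtp15, df$SEX, mean)
by_split  <- sapply(split(df$pwgtp15, df$SEX), mean)
by_subset <- c(mean(df[df$SEX == 1, ]$pwgtp15),
               mean(df[df$SEX == 2, ]$pwgtp15))

# mean() has no `by` argument; it is absorbed by `...`,
# so this returns the overall mean, not a grouped one.
not_grouped <- mean(df$pwgtp15, by = df$SEX)

# Time 100 repetitions of the base R tapply form.
print(system.time(for (i in 1:100) tapply(df$pwgtp15, df$SEX, mean)))

# Time the data.table form only when the package is available.
if (requireNamespace("data.table", quietly = TRUE)) {
  DT <- data.table::as.data.table(df)
  print(system.time(for (i in 1:100) DT[, mean(pwgtp15), by = SEX]))
}
```

On a table this small the differences may be negligible; the gap tends to matter more as the number of rows and groups grows.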


🧾 Summary Table

| Q# | ✅ Correct Answer | Key Concept |
|----|-------------------|-------------|
| 1 | 53 | Property count worth ≥ $1,000,000 |
| 2 | One variable per column | Tidy data rule violation |
| 3 | 36,534,720 | Excel subset + sum calculation |
| 4 | 127 | XML parsing + filtering by zipcode |
| 5 | DT[, mean(pwgtp15), by=SEX] | Fastest data.table aggregation |