Complete all Exercises, and submit answers to Questions on the Coursera platform.

This first quiz covers exploratory data analysis (EDA) of the Ames Housing dataset. EDA is essential when working with any data source and helps inform modeling.

First, let us load the data:

load("ames_train.Rdata")
library(tidyverse)
library(grid)
library(gridExtra)
  1. Which of the following are the three variables with the highest number of missing observations?
    1. Misc.Feature, Fence, Pool.QC
    2. Misc.Feature, Alley, Pool.QC
    3. Pool.QC, Alley, Fence
    4. Fireplace.Qu, Pool.QC, Lot.Frontage
na_count <- colSums(is.na(ames_train))
head(sort(na_count, decreasing = TRUE), 3)
##      Pool.QC Misc.Feature        Alley 
##          997          971          933
  2. How many categorical variables are coded in R as having type int? Change them to factors when conducting your analysis.
    1. 0
    2. 1
    3. 2
    4. 3
str(ames_train)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1000 obs. of  81 variables:
##  $ PID            : int  909176150 905476230 911128020 535377150 534177230 908128060 902135020 528228540 923426010 908186050 ...
##  $ area           : int  856 1049 1001 1039 1665 1922 936 1246 889 1072 ...
##  $ price          : int  126000 139500 124900 114000 227000 198500 93000 187687 137500 140000 ...
##  $ MS.SubClass    : int  30 120 30 70 60 85 20 20 20 180 ...
##  $ MS.Zoning      : Factor w/ 7 levels "A (agr)","C (all)",..: 6 6 2 6 6 6 7 6 6 7 ...
##  $ Lot.Frontage   : int  NA 42 60 80 70 64 60 53 74 35 ...
##  $ Lot.Area       : int  7890 4235 6060 8146 8400 7301 6000 3710 12395 3675 ...
##  $ Street         : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Alley          : Factor w/ 2 levels "Grvl","Pave": NA NA NA NA NA NA 2 NA NA NA ...
##  $ Lot.Shape      : Factor w/ 4 levels "IR1","IR2","IR3",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Land.Contour   : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 1 4 4 4 ...
##  $ Utilities      : Factor w/ 3 levels "AllPub","NoSeWa",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Lot.Config     : Factor w/ 5 levels "Corner","CulDSac",..: 1 5 5 1 5 1 5 5 1 5 ...
##  $ Land.Slope     : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 2 1 1 1 ...
##  $ Neighborhood   : Factor w/ 28 levels "Blmngtn","Blueste",..: 26 8 12 21 20 8 21 1 15 8 ...
##  $ Condition.1    : Factor w/ 9 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Condition.2    : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Bldg.Type      : Factor w/ 5 levels "1Fam","2fmCon",..: 1 5 1 1 1 1 2 1 1 5 ...
##  $ House.Style    : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 3 3 3 6 6 7 3 3 3 7 ...
##  $ Overall.Qual   : int  6 5 5 4 8 7 4 7 5 6 ...
##  $ Overall.Cond   : int  6 5 9 8 6 5 4 5 6 5 ...
##  $ Year.Built     : int  1939 1984 1930 1900 2001 2003 1953 2007 1984 2005 ...
##  $ Year.Remod.Add : int  1950 1984 2007 2003 2001 2003 1953 2008 1984 2005 ...
##  $ Roof.Style     : Factor w/ 6 levels "Flat","Gable",..: 2 2 4 2 2 2 2 2 2 2 ...
##  $ Roof.Matl      : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Exterior.1st   : Factor w/ 16 levels "AsbShng","AsphShn",..: 15 7 9 9 14 7 9 16 7 14 ...
##  $ Exterior.2nd   : Factor w/ 17 levels "AsbShng","AsphShn",..: 16 7 9 9 15 7 9 17 11 15 ...
##  $ Mas.Vnr.Type   : Factor w/ 6 levels "","BrkCmn","BrkFace",..: 5 3 5 5 5 3 5 3 5 6 ...
##  $ Mas.Vnr.Area   : int  0 149 0 0 0 500 0 20 0 76 ...
##  $ Exter.Qual     : Factor w/ 4 levels "Ex","Fa","Gd",..: 4 3 3 3 3 3 2 3 4 4 ...
##  $ Exter.Cond     : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 3 5 5 5 5 5 5 ...
##  $ Foundation     : Factor w/ 6 levels "BrkTil","CBlock",..: 2 2 1 1 3 4 2 3 2 3 ...
##  $ Bsmt.Qual      : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 4 6 3 4 NA 3 4 6 4 ...
##  $ Bsmt.Cond      : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 6 6 6 NA 6 6 6 6 ...
##  $ Bsmt.Exposure  : Factor w/ 5 levels "","Av","Gd","Mn",..: 5 4 5 5 5 NA 5 3 5 3 ...
##  $ BsmtFin.Type.1 : Factor w/ 7 levels "","ALQ","BLQ",..: 6 4 2 7 4 NA 7 7 2 4 ...
##  $ BsmtFin.SF.1   : int  238 552 737 0 643 0 0 0 647 467 ...
##  $ BsmtFin.Type.2 : Factor w/ 7 levels "","ALQ","BLQ",..: 7 2 7 7 7 NA 7 7 7 7 ...
##  $ BsmtFin.SF.2   : int  0 393 0 0 0 0 0 0 0 0 ...
##  $ Bsmt.Unf.SF    : int  618 104 100 405 167 0 936 1146 217 80 ...
##  $ Total.Bsmt.SF  : int  856 1049 837 405 810 0 936 1146 864 547 ...
##  $ Heating        : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Heating.QC     : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 1 3 1 1 5 1 5 1 ...
##  $ Central.Air    : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 1 2 2 2 ...
##  $ Electrical     : Factor w/ 6 levels "","FuseA","FuseF",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ X1st.Flr.SF    : int  856 1049 1001 717 810 495 936 1246 889 1072 ...
##  $ X2nd.Flr.SF    : int  0 0 0 322 855 1427 0 0 0 0 ...
##  $ Low.Qual.Fin.SF: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Bsmt.Full.Bath : int  1 1 0 0 1 0 0 0 0 1 ...
##  $ Bsmt.Half.Bath : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Full.Bath      : int  1 2 1 1 2 3 1 2 1 1 ...
##  $ Half.Bath      : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ Bedroom.AbvGr  : int  2 2 2 2 3 4 2 2 3 2 ...
##  $ Kitchen.AbvGr  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Kitchen.Qual   : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 3 3 5 3 3 5 3 5 3 ...
##  $ TotRms.AbvGrd  : int  4 5 5 6 6 7 4 5 6 5 ...
##  $ Functional     : Factor w/ 8 levels "Maj1","Maj2",..: 8 8 8 8 8 8 4 8 8 8 ...
##  $ Fireplaces     : int  1 0 0 0 0 1 0 1 0 0 ...
##  $ Fireplace.Qu   : Factor w/ 5 levels "Ex","Fa","Gd",..: 3 NA NA NA NA 1 NA 3 NA NA ...
##  $ Garage.Type    : Factor w/ 6 levels "2Types","Attchd",..: 6 2 6 6 2 4 6 2 2 3 ...
##  $ Garage.Yr.Blt  : int  1939 1984 1930 1940 2001 2003 1974 2007 1984 2005 ...
##  $ Garage.Finish  : Factor w/ 4 levels "","Fin","RFn",..: 4 2 4 4 2 3 4 2 4 2 ...
##  $ Garage.Cars    : int  2 1 1 1 2 2 2 2 2 2 ...
##  $ Garage.Area    : int  399 266 216 281 528 672 576 428 484 525 ...
##  $ Garage.Qual    : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ Garage.Cond    : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 5 6 6 6 6 6 6 6 ...
##  $ Paved.Drive    : Factor w/ 3 levels "N","P","Y": 3 3 1 1 3 3 3 3 3 3 ...
##  $ Wood.Deck.SF   : int  0 0 154 0 0 0 0 100 0 0 ...
##  $ Open.Porch.SF  : int  0 105 0 0 45 0 32 24 0 44 ...
##  $ Enclosed.Porch : int  0 0 42 168 0 177 112 0 0 0 ...
##  $ X3Ssn.Porch    : int  0 0 86 0 0 0 0 0 0 0 ...
##  $ Screen.Porch   : int  166 0 0 111 0 0 0 0 0 0 ...
##  $ Pool.Area      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Pool.QC        : Factor w/ 4 levels "Ex","Fa","Gd",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ Fence          : Factor w/ 4 levels "GdPrv","GdWo",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ Misc.Feature   : Factor w/ 5 levels "Elev","Gar2",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ Misc.Val       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Mo.Sold        : int  3 2 11 5 11 7 2 3 4 5 ...
##  $ Yr.Sold        : int  2010 2009 2007 2009 2009 2009 2009 2008 2008 2007 ...
##  $ Sale.Type      : Factor w/ 10 levels "COD","Con","ConLD",..: 10 10 10 10 10 3 10 7 10 10 ...
##  $ Sale.Condition : Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 5 5 5 5 6 5 5 ...
# Overall.Qual and Overall.Cond are ordinal ratings, so store them as ordered factors
ames_train$Overall.Qual <- factor(ames_train$Overall.Qual, ordered = TRUE)
ames_train$Overall.Cond <- factor(ames_train$Overall.Cond, ordered = TRUE)
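One way to shortlist candidate columns is to count the distinct values in each integer column. The sketch below uses a small made-up data frame (`toy` and the cutoff of 10 are assumptions for illustration, not part of the quiz). It only flags candidates: genuine counts such as bedrooms also take few values, so the data dictionary still has to confirm which columns are truly categorical.

```r
# Hypothetical stand-in for ames_train with two integer columns and one factor.
toy <- data.frame(
  area         = as.integer(seq(800, by = 50, length.out = 20)),
  Overall.Qual = rep(c(4L, 5L, 6L, 7L), times = 5),
  Street       = factor(rep(c("Pave", "Grvl"), times = 10))
)

# Keep only the integer columns, then count distinct values in each.
int_cols <- toy[vapply(toy, is.integer, logical(1))]
n_distinct_vals <- vapply(int_cols, function(col) length(unique(col)), integer(1))

# Integer columns with few distinct values are candidates for factors.
candidates <- n_distinct_vals[n_distinct_vals <= 10]
print(candidates)
```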
  3. In terms of price, which neighborhood has the highest standard deviation?
    1. StoneBr
    2. Timber
    3. Veenker
    4. NridgHt
ames_train %>% 
  group_by(Neighborhood) %>% 
  summarise(sdev=sd(price)) %>% 
  arrange(desc(sdev))
## # A tibble: 27 x 2
##    Neighborhood    sdev
##    <fct>          <dbl>
##  1 StoneBr      123459.
##  2 NridgHt      105089.
##  3 Timber        84030.
##  4 Veenker       72545.
##  5 Crawfor       71268.
##  6 GrnHill       70711.
##  7 Somerst       65199.
##  8 Edwards       54852.
##  9 CollgCr       52786.
## 10 SawyerW       48354.
## # ... with 17 more rows
  4. Using scatter plots or other graphical displays, which of the following variables appears to be the best single predictor of price?
    1. Lot.Area
    2. Bedroom.AbvGr
    3. Overall.Qual
    4. Year.Built
# Lot.Area
p1 <- ggplot(ames_train, aes(x = Lot.Area, y = price)) +
  geom_point() +
  stat_smooth(method = 'lm')

# Bedroom.AbvGr
p2 <- ggplot(ames_train, aes(x = Bedroom.AbvGr, y = price)) +
  geom_jitter() +
  stat_smooth(method = 'lm')

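# Overall.Qual (an ordered factor after the conversion above, so group = 1
# is needed for a single fitted line across all levels)
p3 <- ggplot(ames_train, aes(x = Overall.Qual, y = price)) +
  geom_jitter() +
  stat_smooth(aes(group = 1), method = 'lm')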

# Year.Built
p4 <- ggplot(ames_train, aes(x = Year.Built, y = price)) +
  geom_point()+
  stat_smooth(method = 'lm')

grid.arrange(p1, p2, p3, p4, ncol = 2)

  5. Suppose you are examining the relationship between price and area. Which of the following variable transformations makes the relationship appear to be the most linear?
    1. Do not transform either price or area
    2. Log-transform price but not area
    3. Log-transform area but not price
    4. Log-transform both price and area
# No log transform
p51 <- ggplot(ames_train, aes(x = area, y = price)) +
  geom_point() +
  stat_smooth(method = 'lm')

# Log-transform area only
p52 <- ggplot(ames_train, aes(x = log(area), y = price)) +
  geom_point() +
  stat_smooth(method = 'lm')

# Log-transform price only
p53 <- ggplot(ames_train, aes(x = area, y = log(price))) +
  geom_point() +
  stat_smooth(method = 'lm')

# Log transform both
p54 <- ggplot(ames_train, aes(x = log(area), y = log(price))) +
  geom_point() +
  stat_smooth(method = 'lm')

grid.arrange(p51, p52, p53, p54, ncol = 2)

  6. Suppose that your prior for the proportion of houses that have at least one garage is Beta(9, 1). What is your posterior? Assume a beta-binomial model for this proportion.
    1. Beta(954, 46)
    2. Beta(963, 46)
    3. Beta(954, 47)
    4. Beta(963, 47)
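# x: houses with at least one garage space
x <- ames_train %>%
  filter(Garage.Cars >= 1) %>%
  summarize(Sum = n())

# n: houses with a non-missing Garage.Cars value
n <- ames_train %>%
  filter(!is.na(Garage.Cars)) %>%
  summarize(Sum = n())

# Posterior is Beta(9 + x, 1 + n - x); rows with a missing Garage.Cars
# are excluded from both counts
print(paste("Beta(",9+x, "," ,1+n-x,")"))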
## [1] "Beta( 962 , 47 )"
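The update rule used above can be written as a small helper. This is a minimal sketch with made-up counts (the function name `beta_binomial_update` and the 7-out-of-10 data are assumptions for illustration): a Beta(a, b) prior combined with x successes in n trials gives a Beta(a + x, b + n - x) posterior.

```r
# Conjugate beta-binomial update: prior Beta(a, b), x successes in n trials.
# Hypothetical helper, not part of the quiz code.
beta_binomial_update <- function(a, b, x, n) {
  stopifnot(x >= 0, n >= x)
  c(a = a + x, b = b + n - x)
}

# Illustrative numbers only: Beta(9, 1) prior, 7 successes out of 10 trials.
posterior <- beta_binomial_update(a = 9, b = 1, x = 7, n = 10)
print(posterior)
```

With these toy counts the posterior is Beta(16, 4): successes are added to the first parameter and failures to the second.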
  7. Which of the following statements is true about the dataset?
    1. Over 30 percent of houses were built after the year 1999.
    2. The median housing price is greater than the mean housing price.
    3. 21 houses do not have a basement.
    4. 4 houses are located on gravel streets.
#1. Over 30 percent of houses were built after the year 1999?
q7_1 <- ames_train %>%
  filter(Year.Built > 1999) %>%
  summarize(Sum = n())

q7_11 <- ames_train %>%
  filter(!is.na(Year.Built)) %>%
  summarize(Sum = n())

print(paste("% of houses were built after the year 1999 is", q7_1/q7_11*100))
## [1] "% of houses were built after the year 1999 is 27.2"
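# 2. The median housing price is greater than the mean housing price?
# (names chosen so the base functions mean() and median() are not shadowed)
price_mean <- mean(ames_train$price)
price_median <- median(ames_train$price)
c(price_median, price_mean)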
## [1] 159467.0 181190.1
#3. 21 houses do not have a basement?
q73 <- ames_train %>%
  filter(Total.Bsmt.SF==0) %>%
  summarize(Sum = n())
q73
## # A tibble: 1 x 1
##     Sum
##   <int>
## 1    21
#4. 4 houses are located on gravel streets?
q74 <- ames_train %>%
  group_by(Street) %>%
  summarize(Sum = n())
q74
## # A tibble: 2 x 2
##   Street   Sum
##   <fct>  <int>
## 1 Grvl       3
## 2 Pave     997
  8. Test, at the \(\alpha = 0.05\) level, whether homes with a garage have larger square footage than those without a garage.
    1. With a p-value near 0.000, we reject the null hypothesis of no difference.
    2. With a p-value of approximately 0.032, we reject the null hypothesis of no difference.
    3. With a p-value of approximately 0.135, we fail to reject the null hypothesis of no difference.
    4. With a p-value of approximately 0.343, we fail to reject the null hypothesis of no difference.
# Flag homes with a garage (Garage.Area > 0); a missing Garage.Area stays NA
# and that row is dropped by t.test()
ames_train$Has.Garage <- factor(ifelse(ames_train$Garage.Area > 0, 1, 0))
t.test(area ~ Has.Garage, data = ames_train)
## 
##  Welch Two Sample t-test
## 
## data:  area by Has.Garage
## t = -5.134, df = 50.702, p-value = 4.535e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -482.9963 -211.4183
## sample estimates:
## mean in group 0 mean in group 1 
##        1145.043        1492.251
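The question is phrased as a one-sided comparison ("larger"), while `t.test()` runs a two-sided test by default. As a sketch on simulated data (all numbers below are made up for illustration), the `alternative` argument requests the one-sided version. The formula interface takes the difference as the first factor level minus the second, so with levels 0 and 1, `"less"` corresponds to homes with a garage being larger.

```r
set.seed(1)

# Hypothetical areas: 40 homes without a garage (0), 200 homes with one (1).
toy <- data.frame(
  area       = c(rnorm(40, mean = 1150, sd = 400),
                 rnorm(200, mean = 1500, sd = 500)),
  Has.Garage = factor(rep(c(0, 1), times = c(40, 200)))
)

# Tested difference is mean(group 0) - mean(group 1).
one_sided <- t.test(area ~ Has.Garage, data = toy, alternative = "less")
two_sided <- t.test(area ~ Has.Garage, data = toy)

c(one_sided = one_sided$p.value, two_sided = two_sided$p.value)
```

When the t statistic falls on the side named by the alternative, the one-sided p-value is half the two-sided one, so a two-sided p-value near zero leads to the same decision.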
  9. For homes with square footage greater than 2000, assume that the number of bedrooms above ground follows a Poisson distribution with rate \(\lambda\). Your prior on \(\lambda\) follows a Gamma distribution with mean 3 and standard deviation 1. What is your posterior mean and standard deviation for the average number of bedrooms in houses with square footage greater than 2000 square feet?
    1. Mean: 3.61, SD: 0.11
    2. Mean: 3.62, SD: 0.16
    3. Mean: 3.63, SD: 0.09
    4. Mean: 3.63, SD: 0.91
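# Gamma prior with mean 3 and standard deviation 1 (shape a, rate b)
prior_mean <- 3
prior_var <- 1
b <- prior_mean / prior_var   # rate:  mean / variance
a <- prior_mean * b           # shape: mean^2 / variance
# Data: total bedrooms and number of homes with area > 2000
sum_x <- ames_train %>% filter(area>2000) %>% summarise(sum_x=sum(Bedroom.AbvGr))
n <- ames_train %>% filter(area>2000) %>% summarise(n=n())
# Conjugate update: shape adds the total count, rate adds the sample size
a_star <- a + sum_x
b_star <- b + n
lambda_star <- a_star / b_star          # posterior mean
sigma_star <- sqrt(a_star / b_star^2)   # posterior standard deviation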
lambda_star
##      sum_x
## 1 3.617021
sigma_star
##       sum_x
## 1 0.1601644
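The prior parameters in the code come from matching moments. As a short worked derivation, assuming the shape-rate parameterization of the Gamma distribution (the symbols \(a\), \(b\), \(a^*\), \(b^*\) mirror the variable names in the code), a Gamma(\(a\), \(b\)) prior has mean \(a/b\) and variance \(a/b^2\), so a mean of 3 and a standard deviation of 1 give:

\[
\frac{a}{b} = 3, \qquad \frac{a}{b^2} = 1^2 \quad\Longrightarrow\quad b = 3, \; a = 9 .
\]

With \(n\) homes and observed bedroom counts \(x_1, \dots, x_n\), the Gamma-Poisson update is:

\[
a^* = a + \sum_{i=1}^{n} x_i, \qquad b^* = b + n, \qquad
\mathrm{E}[\lambda \mid x] = \frac{a^*}{b^*}, \qquad
\mathrm{SD}[\lambda \mid x] = \frac{\sqrt{a^*}}{b^*} .
\]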
  10. When regressing \(\log\)(price) on \(\log\)(area), there are some outliers. Which of the following do the three most outlying points have in common?
    1. They had abnormal sale conditions.
    2. They have only two bedrooms.
    3. They have an overall quality of less than 3.
    4. They were built before 1930.
fit <- lm(log(price) ~ log(area), data = ames_train)
par(mfrow = c(2,2))
plot(fit)

# Use the MASS package to extract standardized residuals from a linear model
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
par(mfrow = c(1,1))
ames_train$stdres <- stdres(fit)
ames_train$stdres_abs <- abs(stdres(fit))
# Flag observations whose absolute standardized residual exceeds 3
ames_train$stdres_gt3 <- abs(stdres(fit)) > 3
ames_train[which(ames_train$stdres_gt3),
           c('Bedroom.AbvGr','Overall.Qual','Year.Built','Sale.Condition',
             'stdres_abs')] %>% arrange(desc(stdres_abs)) %>% head(n = 3)
## # A tibble: 3 x 5
##   Bedroom.AbvGr Overall.Qual Year.Built Sale.Condition stdres_abs
##           <int> <ord>             <int> <fct>               <dbl>
## 1             2 2                  1923 Abnorml              7.37
## 2             3 4                  1920 Normal               4.83
## 3             3 4                  1910 Abnorml              4.43
# Alternatively, the `broom` package gives the same result
library(broom)
fit_aug <- augment(fit)
ames_train$stdres <- fit_aug$.std.resid
ames_train$stdres_abs <- abs(fit_aug$.std.resid)
# The threshold is 3, so the flag is named stdres_gt3
ames_train$stdres_gt3 <- abs(fit_aug$.std.resid) > 3
ames_train[which(ames_train$stdres_gt3),
           c('Bedroom.AbvGr','Overall.Qual','Year.Built','Sale.Condition',
             'stdres_abs')] %>% arrange(desc(stdres_abs)) %>% head(n=3)
## # A tibble: 3 x 5
##   Bedroom.AbvGr Overall.Qual Year.Built Sale.Condition stdres_abs
##           <int> <ord>             <int> <fct>               <dbl>
## 1             2 2                  1923 Abnorml              7.37
## 2             3 4                  1920 Normal               4.83
## 3             3 4                  1910 Abnorml              4.43
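A third option that needs no extra package is base R's `rstandard()`, which should return the same kind of standardized residuals for an `lm` fit. The sketch below runs on simulated data with one planted outlier (the data, the row index 17, and the name `fit_toy` are assumptions for illustration) and ranks observations by absolute standardized residual.

```r
set.seed(42)

# Simulated log-log relationship standing in for log(price) ~ log(area).
n <- 100
log_area  <- runif(n, min = 6, max = 8)
log_price <- 5 + 0.9 * log_area + rnorm(n, sd = 0.2)

# Plant one obvious outlier so there is something to find (hypothetical).
log_price[17] <- log_price[17] - 2

fit_toy <- lm(log_price ~ log_area)
std_res <- unname(rstandard(fit_toy))   # standardized residuals, base R

# Row indices of the three largest absolute standardized residuals.
top3 <- order(abs(std_res), decreasing = TRUE)[1:3]
data.frame(row = top3, std_res = std_res[top3])
```

Sorting with `order()` avoids fixing a cutoff in advance, which helps when fewer than three points exceed a threshold such as 3.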
  11. Which of the following are reasons to log-transform price if used as a dependent variable in a linear regression?
    1. price is right-skewed.
    2. price cannot take on negative values.
    3. price can only take on integer values.
    4. Both 1 and 2.
ggplot(ames_train, aes(x= price)) +
  geom_histogram(bins = 30)

  12. How many neighborhoods consist of only single-family homes? (i.e., Bldg.Type = 1Fam)
    1. 0
    2. 1
    3. 2
    4. 3
ames_train %>% group_by(Neighborhood) %>% 
  summarise(mean.Bldg.Type = mean(Bldg.Type == "1Fam")) %>% 
  filter(mean.Bldg.Type==1)
## # A tibble: 3 x 2
##   Neighborhood mean.Bldg.Type
##   <fct>                 <dbl>
## 1 ClearCr                   1
## 2 NoRidge                   1
## 3 Timber                    1
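The same check can be phrased as "every home in the neighborhood is 1Fam" rather than "the share of 1Fam homes equals 1". A minimal base-R sketch on a made-up data frame (the neighborhoods A, B, C are hypothetical):

```r
# Hypothetical data: neighborhood B contains a townhouse.
toy <- data.frame(
  Neighborhood = c("A", "A", "B", "B", "C"),
  Bldg.Type    = c("1Fam", "1Fam", "1Fam", "Twnhs", "1Fam")
)

# TRUE for a neighborhood only if all of its homes are single-family.
only_1fam <- tapply(toy$Bldg.Type == "1Fam", toy$Neighborhood, all)

# Names of the neighborhoods that pass the check.
single_family <- names(only_1fam)[as.vector(only_1fam)]
print(single_family)
```

Using `all()` avoids comparing a floating-point mean to exactly 1, although for a mean of logical values both approaches agree.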
  13. Using color, different plotting symbols, conditioning plots, etc., does there appear to be an association between \(\log\)(area) and the number of bedrooms above ground (Bedroom.AbvGr)?
    1. Yes
    2. No
ggplot(ames_train, aes(x = Bedroom.AbvGr, y = log(area))) +
  geom_jitter()

  14. Of the people who have unfinished basements, what is the average square footage of the unfinished basement?
    1. 590.36
    2. 595.25
    3. 614.37
    4. 681.94
ames_train %>%
  filter(!is.na(Bsmt.Unf.SF),Bsmt.Unf.SF > 0) %>%
  summarize(mean = mean(Bsmt.Unf.SF))
## # A tibble: 1 x 1
##    mean
##   <dbl>
## 1  595.