This first lab covers exploratory data analysis (EDA) of the Ames Housing dataset.

Load the necessary packages:

library(devtools)
library(dplyr)
library(statsr)
library(gtools)
library(plotly)

First, let us load the data:

load("ames_train.RData")
  1. Which of the following are the three variables with the highest number of missing observations?
    1. Misc.Feature, Fence, Pool.QC
    2. Misc.Feature, Alley, Pool.QC
    3. Pool.QC, Alley, Fence
    4. Fireplace.Qu, Pool.QC, Lot.Frontage
# type your code for Question 1 here, and Knit
na_count <- sapply(ames_train, function(y) sum(is.na(y)))
head(sort(na_count, decreasing = TRUE), 3)  # three variables with the most NAs
  2. How many categorical variables are coded in R as having type int? Change them to factors when conducting your analysis.
    1. 0
    2. 1
    3. 2
    4. 3
# type your code for Question 2 here, and Knit
str(ames_train)
## tibble [1,000 × 81] (S3: tbl_df/tbl/data.frame)
##  $ PID            : int [1:1000] 909176150 905476230 911128020 535377150 534177230 908128060 902135020 528228540 923426010 908186050 ...
##  $ area           : int [1:1000] 856 1049 1001 1039 1665 1922 936 1246 889 1072 ...
##  $ price          : int [1:1000] 126000 139500 124900 114000 227000 198500 93000 187687 137500 140000 ...
##  $ MS.SubClass    : int [1:1000] 30 120 30 70 60 85 20 20 20 180 ...
##  $ MS.Zoning      : Factor w/ 7 levels "A (agr)","C (all)",..: 6 6 2 6 6 6 7 6 6 7 ...
##  $ Lot.Frontage   : int [1:1000] NA 42 60 80 70 64 60 53 74 35 ...
##  $ Lot.Area       : int [1:1000] 7890 4235 6060 8146 8400 7301 6000 3710 12395 3675 ...
##  $ Street         : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Alley          : Factor w/ 2 levels "Grvl","Pave": NA NA NA NA NA NA 2 NA NA NA ...
##  $ Lot.Shape      : Factor w/ 4 levels "IR1","IR2","IR3",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Land.Contour   : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 1 4 4 4 ...
##  $ Utilities      : Factor w/ 3 levels "AllPub","NoSeWa",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Lot.Config     : Factor w/ 5 levels "Corner","CulDSac",..: 1 5 5 1 5 1 5 5 1 5 ...
##  $ Land.Slope     : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 2 1 1 1 ...
##  $ Neighborhood   : Factor w/ 28 levels "Blmngtn","Blueste",..: 26 8 12 21 20 8 21 1 15 8 ...
##  $ Condition.1    : Factor w/ 9 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Condition.2    : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Bldg.Type      : Factor w/ 5 levels "1Fam","2fmCon",..: 1 5 1 1 1 1 2 1 1 5 ...
##  $ House.Style    : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 3 3 3 6 6 7 3 3 3 7 ...
##  $ Overall.Qual   : int [1:1000] 6 5 5 4 8 7 4 7 5 6 ...
##  $ Overall.Cond   : int [1:1000] 6 5 9 8 6 5 4 5 6 5 ...
##  $ Year.Built     : int [1:1000] 1939 1984 1930 1900 2001 2003 1953 2007 1984 2005 ...
##  $ Year.Remod.Add : int [1:1000] 1950 1984 2007 2003 2001 2003 1953 2008 1984 2005 ...
##  $ Roof.Style     : Factor w/ 6 levels "Flat","Gable",..: 2 2 4 2 2 2 2 2 2 2 ...
##  $ Roof.Matl      : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Exterior.1st   : Factor w/ 16 levels "AsbShng","AsphShn",..: 15 7 9 9 14 7 9 16 7 14 ...
##  $ Exterior.2nd   : Factor w/ 17 levels "AsbShng","AsphShn",..: 16 7 9 9 15 7 9 17 11 15 ...
##  $ Mas.Vnr.Type   : Factor w/ 6 levels "","BrkCmn","BrkFace",..: 5 3 5 5 5 3 5 3 5 6 ...
##  $ Mas.Vnr.Area   : int [1:1000] 0 149 0 0 0 500 0 20 0 76 ...
##  $ Exter.Qual     : Factor w/ 4 levels "Ex","Fa","Gd",..: 4 3 3 3 3 3 2 3 4 4 ...
##  $ Exter.Cond     : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 3 5 5 5 5 5 5 ...
##  $ Foundation     : Factor w/ 6 levels "BrkTil","CBlock",..: 2 2 1 1 3 4 2 3 2 3 ...
##  $ Bsmt.Qual      : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 4 6 3 4 NA 3 4 6 4 ...
##  $ Bsmt.Cond      : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 6 6 6 NA 6 6 6 6 ...
##  $ Bsmt.Exposure  : Factor w/ 5 levels "","Av","Gd","Mn",..: 5 4 5 5 5 NA 5 3 5 3 ...
##  $ BsmtFin.Type.1 : Factor w/ 7 levels "","ALQ","BLQ",..: 6 4 2 7 4 NA 7 7 2 4 ...
##  $ BsmtFin.SF.1   : int [1:1000] 238 552 737 0 643 0 0 0 647 467 ...
##  $ BsmtFin.Type.2 : Factor w/ 7 levels "","ALQ","BLQ",..: 7 2 7 7 7 NA 7 7 7 7 ...
##  $ BsmtFin.SF.2   : int [1:1000] 0 393 0 0 0 0 0 0 0 0 ...
##  $ Bsmt.Unf.SF    : int [1:1000] 618 104 100 405 167 0 936 1146 217 80 ...
##  $ Total.Bsmt.SF  : int [1:1000] 856 1049 837 405 810 0 936 1146 864 547 ...
##  $ Heating        : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Heating.QC     : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 1 3 1 1 5 1 5 1 ...
##  $ Central.Air    : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 1 2 2 2 ...
##  $ Electrical     : Factor w/ 6 levels "","FuseA","FuseF",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ X1st.Flr.SF    : int [1:1000] 856 1049 1001 717 810 495 936 1246 889 1072 ...
##  $ X2nd.Flr.SF    : int [1:1000] 0 0 0 322 855 1427 0 0 0 0 ...
##  $ Low.Qual.Fin.SF: int [1:1000] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Bsmt.Full.Bath : int [1:1000] 1 1 0 0 1 0 0 0 0 1 ...
##  $ Bsmt.Half.Bath : int [1:1000] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Full.Bath      : int [1:1000] 1 2 1 1 2 3 1 2 1 1 ...
##  $ Half.Bath      : int [1:1000] 0 0 0 0 1 0 0 0 0 0 ...
##  $ Bedroom.AbvGr  : int [1:1000] 2 2 2 2 3 4 2 2 3 2 ...
##  $ Kitchen.AbvGr  : int [1:1000] 1 1 1 1 1 1 1 1 1 1 ...
##  $ Kitchen.Qual   : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 3 3 5 3 3 5 3 5 3 ...
##  $ TotRms.AbvGrd  : int [1:1000] 4 5 5 6 6 7 4 5 6 5 ...
##  $ Functional     : Factor w/ 8 levels "Maj1","Maj2",..: 8 8 8 8 8 8 4 8 8 8 ...
##  $ Fireplaces     : int [1:1000] 1 0 0 0 0 1 0 1 0 0 ...
##  $ Fireplace.Qu   : Factor w/ 5 levels "Ex","Fa","Gd",..: 3 NA NA NA NA 1 NA 3 NA NA ...
##  $ Garage.Type    : Factor w/ 6 levels "2Types","Attchd",..: 6 2 6 6 2 4 6 2 2 3 ...
##  $ Garage.Yr.Blt  : int [1:1000] 1939 1984 1930 1940 2001 2003 1974 2007 1984 2005 ...
##  $ Garage.Finish  : Factor w/ 4 levels "","Fin","RFn",..: 4 2 4 4 2 3 4 2 4 2 ...
##  $ Garage.Cars    : int [1:1000] 2 1 1 1 2 2 2 2 2 2 ...
##  $ Garage.Area    : int [1:1000] 399 266 216 281 528 672 576 428 484 525 ...
##  $ Garage.Qual    : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ Garage.Cond    : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 5 6 6 6 6 6 6 6 ...
##  $ Paved.Drive    : Factor w/ 3 levels "N","P","Y": 3 3 1 1 3 3 3 3 3 3 ...
##  $ Wood.Deck.SF   : int [1:1000] 0 0 154 0 0 0 0 100 0 0 ...
##  $ Open.Porch.SF  : int [1:1000] 0 105 0 0 45 0 32 24 0 44 ...
##  $ Enclosed.Porch : int [1:1000] 0 0 42 168 0 177 112 0 0 0 ...
##  $ X3Ssn.Porch    : int [1:1000] 0 0 86 0 0 0 0 0 0 0 ...
##  $ Screen.Porch   : int [1:1000] 166 0 0 111 0 0 0 0 0 0 ...
##  $ Pool.Area      : int [1:1000] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Pool.QC        : Factor w/ 4 levels "Ex","Fa","Gd",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ Fence          : Factor w/ 4 levels "GdPrv","GdWo",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ Misc.Feature   : Factor w/ 5 levels "Elev","Gar2",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ Misc.Val       : int [1:1000] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Mo.Sold        : int [1:1000] 3 2 11 5 11 7 2 3 4 5 ...
##  $ Yr.Sold        : int [1:1000] 2010 2009 2007 2009 2009 2009 2009 2008 2008 2007 ...
##  $ Sale.Type      : Factor w/ 10 levels "COD","Con","ConLD",..: 10 10 10 10 10 3 10 7 10 10 ...
##  $ Sale.Condition : Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 5 5 5 5 6 5 5 ...
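The question also asks that integer-coded categorical variables be changed to factors. The following is a minimal sketch of that conversion, assuming MS.SubClass (whose values are dwelling-type codes rather than quantities) is one such column. The helper `to_factor` and the toy data frame are hypothetical and only illustrate the mechanics.

```r
# Hypothetical helper: convert the named columns of a data frame to factors.
to_factor <- function(df, cols) {
  df[cols] <- lapply(df[cols], factor)
  df
}

# Toy stand-in for ames_train (first five values from the str() output above).
toy <- data.frame(
  MS.SubClass = c(30L, 120L, 30L, 70L, 60L),
  area        = c(856L, 1049L, 1001L, 1039L, 1665L)
)

toy <- to_factor(toy, "MS.SubClass")
str(toy)

# On the real data, the same call would be:
# ames_train <- to_factor(ames_train, "MS.SubClass")
```

Any other columns judged to be categorical can be passed in the same `cols` vector.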
  3. In terms of price, which neighborhood has the highest standard deviation?
    1. StoneBr
    2. Timber
    3. Veenker
    4. NridgHt
# type your code for Question 3 here, and Knit

std_dev_neighborhood <- ames_train %>% group_by(Neighborhood) %>%
  summarise(std_devprice = sd(price)) %>%
  arrange(desc(std_devprice))  # largest standard deviation first
std_dev_neighborhood
  4. Using scatter plots or other graphical displays, which of the following variables appears to be the best single predictor of price?
    1. Lot.Area
    2. Bedroom.AbvGr
    3. Overall.Qual
    4. Year.Built
# type your code for Question 4 here, and Knit

fig1 <- plot_ly(data=ames_train,x=~Lot.Area,y=~price,type="scatter",mode="markers")
fig1
fig2 <- plot_ly(data=ames_train,x=~Bedroom.AbvGr,y=~price,type="scatter",mode="markers")
fig2
fig3 <- plot_ly(data=ames_train,x=~Overall.Qual,y=~price,type="scatter",mode="markers")
fig3
fig4 <- plot_ly(data=ames_train,x=~Year.Built,y=~price,type="scatter",mode="markers")
fig4
  5. Suppose you are examining the relationship between price and area. Which of the following variable transformations makes the relationship appear to be the most linear?
    1. Do not transform either price or area
    2. Log-transform price but not area
    3. Log-transform area but not price
    4. Log-transform both price and area
# type your code for Question 5 here, and Knit

fig1 <- plot_ly(data=ames_train,x=~log(area),y=~log(price),type="scatter",mode="markers")
fig1
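The plot above shows only the log-log case. As a rough numeric complement to the scatter plots, the sketch below fits a simple linear model under each of the four transformation choices and reports R-squared. The helper `compare_transforms` is hypothetical, the data are simulated rather than taken from Ames, and R-squared values on different response scales are only a loose guide to linearity, so the plots remain the primary evidence.

```r
# Hypothetical helper: R-squared of a simple linear fit under each of the
# four transformation choices listed in the question.
compare_transforms <- function(price, area) {
  c(
    none      = summary(lm(price ~ area))$r.squared,
    log_price = summary(lm(log(price) ~ area))$r.squared,
    log_area  = summary(lm(price ~ log(area)))$r.squared,
    log_both  = summary(lm(log(price) ~ log(area)))$r.squared
  )
}

# Simulated stand-in data (not the Ames data): a power-law relationship,
# which is linear on the log-log scale by construction.
set.seed(1)
area  <- runif(200, min = 500, max = 4000)
price <- 100 * area^0.9 * exp(rnorm(200, sd = 0.2))

r2 <- compare_transforms(price, area)
round(r2, 3)

# On the real data:
# compare_transforms(ames_train$price, ames_train$area)
```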
  6. Suppose that your prior for the proportion of houses that have at least one garage is Beta(9, 1). What is your posterior? Assume a beta-binomial model for this proportion.
    1. Beta(954, 46)
    2. Beta(963, 46)
    3. Beta(954, 47)
    4. Beta(963, 47)
# type your code for Question 6 here, and Knit

x <- nrow(ames_train[ames_train$Garage.Area>0,])
n <- nrow(ames_train)  # 1000 observations
alpha <- 9
beta <- 1
alpha+x
## [1] 963
beta+n-x
## [1] 47
  7. Which of the following statements is true about the dataset?
    1. Over 30 percent of houses were built after the year 1999.
    2. The median housing price is greater than the mean housing price.
    3. 21 houses do not have a basement.
    4. 4 houses are located on gravel streets.
# type your code for Question 7 here, and Knit

nrow(ames_train[ames_train$Year.Built>1999,])/nrow(ames_train)  # not > 30%
## [1] 0.272
median(ames_train$price)>mean(ames_train$price)
## [1] FALSE
nrow(ames_train[ames_train$Street=="Grvl",])
## [1] 3
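The checks above cover three of the four statements; the basement statement can be checked in the same way. The sketch below assumes that a house without a basement is recorded with Total.Bsmt.SF equal to 0, which is consistent with the sixth row of the `str()` output (Total.Bsmt.SF is 0 and Bsmt.Qual is NA). The helper `count_no_basement` is hypothetical.

```r
# Hypothetical helper: count houses whose total basement area is zero.
# Assumption: "no basement" is encoded as Total.Bsmt.SF == 0; rows with a
# missing Total.Bsmt.SF are not counted.
count_no_basement <- function(df) {
  sum(df$Total.Bsmt.SF == 0, na.rm = TRUE)
}

# Toy stand-in using the first ten Total.Bsmt.SF values from str() above.
toy <- data.frame(
  Total.Bsmt.SF = c(856L, 1049L, 837L, 405L, 810L, 0L, 936L, 1146L, 864L, 547L)
)
count_no_basement(toy)

# On the real data:
# count_no_basement(ames_train)
```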
  8. Test, at the \(\alpha = 0.05\) level, whether homes with a garage have larger square footage than those without a garage.
    1. With a p-value near 0.000, we reject the null hypothesis of no difference.
    2. With a p-value of approximately 0.032, we reject the null hypothesis of no difference.
    3. With a p-value of approximately 0.135, we fail to reject the null hypothesis of no difference.
    4. With a p-value of approximately 0.343, we fail to reject the null hypothesis of no difference.
# type your code for Question 8 here, and Knit

garage <- ames_train %>% filter(Garage.Area>0)
no_garage <- ames_train %>% filter(Garage.Area==0)
t.test(x=garage$area, y = no_garage$area,
       alternative = "two.sided",
       mu = 0, paired = FALSE, var.equal = FALSE,
       conf.level = 0.95)
## 
##  Welch Two Sample t-test
## 
## data:  garage$area and no_garage$area
## t = 5.134, df = 50.702, p-value = 4.535e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  211.4183 482.9963
## sample estimates:
## mean of x mean of y 
##  1492.251  1145.043
  9. For homes with square footage greater than 2000, assume that the number of bedrooms above ground follows a Poisson distribution with rate \(\lambda\). Your prior on \(\lambda\) follows a Gamma distribution with mean 3 and standard deviation 1. What is your posterior mean and standard deviation for the average number of bedrooms in houses with square footage greater than 2000 square feet?
    1. Mean: 3.61, SD: 0.11
    2. Mean: 3.62, SD: 0.16
    3. Mean: 3.63, SD: 0.09
    4. Mean: 3.63, SD: 0.91
# type your code for Question 9 here, and Knit

# First, find number of homes > 2000 sq ft.

nrow(ames_train[ames_train$area>2000,])
## [1] 138
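The code above only counts the qualifying homes; the posterior still has to be computed. For a Gamma prior with shape \(a\) and rate \(b\), the mean is \(a/b\) and the standard deviation is \(\sqrt{a}/b\), so a mean of 3 and standard deviation of 1 give \(a = 9\) and \(b = 3\). With \(n\) Poisson observations summing to \(s\), the conjugate posterior is Gamma\((a + s,\ b + n)\). The sketch below implements that update; the helper `gamma_poisson_posterior` is hypothetical and the counts are toy values, not the Ames data.

```r
# Hypothetical helper: conjugate Gamma-Poisson update.
# The prior is specified by its mean and standard deviation and converted to
# the shape/rate parameterisation: shape = mean^2 / sd^2, rate = mean / sd^2.
gamma_poisson_posterior <- function(counts, prior_mean, prior_sd) {
  shape0 <- prior_mean^2 / prior_sd^2
  rate0  <- prior_mean / prior_sd^2
  shape1 <- shape0 + sum(counts)     # add the total observed count
  rate1  <- rate0 + length(counts)   # add the number of observations
  c(shape = shape1, rate = rate1,
    mean = shape1 / rate1, sd = sqrt(shape1) / rate1)
}

# Toy counts (not the Ames data).
post <- gamma_poisson_posterior(c(3, 4, 4, 3, 5), prior_mean = 3, prior_sd = 1)
post

# On the real data:
# big <- ames_train$Bedroom.AbvGr[ames_train$area > 2000]
# gamma_poisson_posterior(big, prior_mean = 3, prior_sd = 1)
```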
  10. When regressing \(\log\)(price) on \(\log\)(area), there are some outliers. Which of the following do the three most outlying points have in common?
    1. They had abnormal sale conditions.
    2. They have only two bedrooms.
    3. They have an overall quality of less than 3.
    4. They were built before 1930.
# type your code for Question 10 here, and Knit
model <- lm(log(price)~log(area),data=ames_train)
ames_train$sq_residuals <- (residuals(model))^2
write.csv(ames_train, "data.csv")
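Instead of exporting to CSV and inspecting the file by hand, the rows with the largest squared residuals can be pulled out directly. The sketch below is one way to do that; the helper `top_outliers` is hypothetical, the data are simulated with three planted outliers, and it assumes the model was fitted without dropping any rows, so that residuals line up with the data.

```r
# Hypothetical helper: return the k rows with the largest squared residuals
# from a fitted linear model, alongside the original data.
# Assumption: no rows were dropped when fitting, so residuals align with data.
top_outliers <- function(model, data, k = 3) {
  sq_res <- residuals(model)^2
  idx <- order(sq_res, decreasing = TRUE)[seq_len(k)]
  cbind(data[idx, , drop = FALSE], sq_residuals = sq_res[idx])
}

# Simulated stand-in data (not the Ames data) with three planted outliers.
set.seed(42)
toy <- data.frame(area = runif(100, 500, 4000))
toy$price <- 100 * toy$area^0.9 * exp(rnorm(100, sd = 0.1))
toy$price[c(5, 50, 95)] <- toy$price[c(5, 50, 95)] / 10

fit <- lm(log(price) ~ log(area), data = toy)
out <- top_outliers(fit, toy)
out

# On the real data, refit and inspect the columns named in the answer options:
# top_outliers(model, ames_train)[, c("Sale.Condition", "Bedroom.AbvGr",
#                                     "Overall.Qual", "Year.Built")]
```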
  11. Which of the following are reasons to log-transform price if used as a dependent variable in a linear regression?
    1. price is right-skewed.
    2. price cannot take on negative values.
    3. price can only take on integer values.
    4. Both 1 and 2.
# type your code for Question 11 here, and Knit
fig <- plot_ly(data=ames_train,x=~price,type="histogram",nbinsx=60)
fig
  12. How many neighborhoods consist of only single-family homes (i.e., Bldg.Type = 1Fam)?
    1. 0
    2. 1
    3. 2
    4. 3
# type your code for Question 12 here, and Knit

sf_homes_neighborhood <- ames_train %>% 
  group_by(Neighborhood) %>% summarise(mean(Bldg.Type == "1Fam"))
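The summary above gives the proportion of single-family homes per neighborhood; the answer is the number of neighborhoods where that proportion equals 1. A minimal base-R sketch of that count follows. The helper `count_pure_groups` and the toy data are hypothetical.

```r
# Hypothetical helper: number of groups in which every row has the target
# value. Empty groups (unused factor levels) yield NA and are ignored.
count_pure_groups <- function(value, group, target) {
  prop <- tapply(value == target, group, mean, na.rm = TRUE)
  sum(prop == 1, na.rm = TRUE)
}

# Toy stand-in (not the Ames data): only neighborhood "B" is all "1Fam".
toy <- data.frame(
  Neighborhood = c("A", "A", "B", "B", "C"),
  Bldg.Type    = c("1Fam", "Twnhs", "1Fam", "1Fam", "Duplex")
)
n_pure <- count_pure_groups(toy$Bldg.Type, toy$Neighborhood, "1Fam")
n_pure

# On the real data:
# count_pure_groups(ames_train$Bldg.Type, ames_train$Neighborhood, "1Fam")
```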
  13. Using color, different plotting symbols, conditioning plots, etc., does there appear to be an association between \(\log\)(area) and the number of bedrooms above ground (Bedroom.AbvGr)?
    1. Yes
    2. No
# type your code for Question 13 here, and Knit
fig <- plot_ly(data=ames_train,x=~Bedroom.AbvGr,y=~log(area),type="scatter",
               mode="markers")
fig
cor(ames_train$Bedroom.AbvGr, log(ames_train$area), method = "pearson")
## [1] 0.5457625
  14. Among houses that have an unfinished basement, what is the average square footage of the unfinished basement?
    1. 590.36
    2. 595.25
    3. 614.37
    4. 681.94
# type your code for Question 14 here, and Knit

answer <- ames_train[complete.cases(ames_train$Bsmt.Unf.SF),] %>% 
  filter(Bsmt.Unf.SF!=0) %>% summarise(mean(Bsmt.Unf.SF))
answer
## # A tibble: 1 x 1
##   `mean(Bsmt.Unf.SF)`
##                 <dbl>
## 1                595.