Complete all Exercises, and submit answers to Questions on the Coursera platform.

This first quiz covers exploratory data analysis (EDA) of the Ames Housing dataset. EDA is an essential first step with any data source and helps inform later modeling.

First, let us load the data:

# Load devtools, which provides install_github() for installing packages from GitHub
library(devtools)
## Warning: package 'devtools' was built under R version 3.4.1
install_github("StatsWithR/statsr")
## Skipping install of 'statsr' from a github remote, the SHA1 (6f64cf27) has not changed since last install.
##   Use `force = TRUE` to force installation
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.4.1
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(statsr)
load("ames_train.Rdata")
# Missing-data handling
library(Rcpp)
require(Amelia)
## Loading required package: Amelia
## ## 
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.4, built: 2015-12-05)
## ## Copyright (C) 2005-2017 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
missmap(ames_train, main="Missing Map")
## Warning in if (class(obj) == "amelia") {: the condition has length > 1 and
## only the first element will be used
## Warning: Unknown or uninitialised column: 'arguments'.

## Warning: Unknown or uninitialised column: 'arguments'.
## Warning: Unknown or uninitialised column: 'imputations'.

# Fraction of missing values in each column
sapply(ames_train, function(col) {
  mean(is.na(col))
})
##             PID            area           price     MS.SubClass 
##           0.000           0.000           0.000           0.000 
##       MS.Zoning    Lot.Frontage        Lot.Area          Street 
##           0.000           0.167           0.000           0.000 
##           Alley       Lot.Shape    Land.Contour       Utilities 
##           0.933           0.000           0.000           0.000 
##      Lot.Config      Land.Slope    Neighborhood     Condition.1 
##           0.000           0.000           0.000           0.000 
##     Condition.2       Bldg.Type     House.Style    Overall.Qual 
##           0.000           0.000           0.000           0.000 
##    Overall.Cond      Year.Built  Year.Remod.Add      Roof.Style 
##           0.000           0.000           0.000           0.000 
##       Roof.Matl    Exterior.1st    Exterior.2nd    Mas.Vnr.Type 
##           0.000           0.000           0.000           0.000 
##    Mas.Vnr.Area      Exter.Qual      Exter.Cond      Foundation 
##           0.007           0.000           0.000           0.000 
##       Bsmt.Qual       Bsmt.Cond   Bsmt.Exposure  BsmtFin.Type.1 
##           0.021           0.021           0.021           0.021 
##    BsmtFin.SF.1  BsmtFin.Type.2    BsmtFin.SF.2     Bsmt.Unf.SF 
##           0.001           0.021           0.001           0.001 
##   Total.Bsmt.SF         Heating      Heating.QC     Central.Air 
##           0.001           0.000           0.000           0.000 
##      Electrical     X1st.Flr.SF     X2nd.Flr.SF Low.Qual.Fin.SF 
##           0.000           0.000           0.000           0.000 
##  Bsmt.Full.Bath  Bsmt.Half.Bath       Full.Bath       Half.Bath 
##           0.001           0.001           0.000           0.000 
##   Bedroom.AbvGr   Kitchen.AbvGr    Kitchen.Qual   TotRms.AbvGrd 
##           0.000           0.000           0.000           0.000 
##      Functional      Fireplaces    Fireplace.Qu     Garage.Type 
##           0.000           0.000           0.491           0.046 
##   Garage.Yr.Blt   Garage.Finish     Garage.Cars     Garage.Area 
##           0.048           0.046           0.001           0.001 
##     Garage.Qual     Garage.Cond     Paved.Drive    Wood.Deck.SF 
##           0.047           0.047           0.000           0.000 
##   Open.Porch.SF  Enclosed.Porch     X3Ssn.Porch    Screen.Porch 
##           0.000           0.000           0.000           0.000 
##       Pool.Area         Pool.QC           Fence    Misc.Feature 
##           0.000           0.997           0.798           0.971 
##        Misc.Val         Mo.Sold         Yr.Sold       Sale.Type 
##           0.000           0.000           0.000           0.000 
##  Sale.Condition 
##           0.000
  1. Which of the following are the three variables with the highest number of missing observations?
    1. Misc.Feature, Fence, Pool.QC
    2. Misc.Feature, Alley, Pool.QC
    3. Pool.QC, Alley, Fence
    4. Fireplace.Qu, Pool.QC, Lot.Frontage
# type your code for Question 1 here, and Knit
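The per-column missing fractions can also be ranked directly instead of being read off the full printout. The following is a minimal sketch on a small hypothetical data frame (`toy` and its values are made up for illustration and stand in for `ames_train`):

```r
# Hypothetical stand-in for ames_train; the values are illustrative only.
toy <- data.frame(
  Pool.QC      = c(NA, NA, NA, NA, "Gd"),
  Misc.Feature = c(NA, "Shed", NA, NA, "Shed"),
  Alley        = c(NA, NA, "Grvl", "Pave", "Pave"),
  price        = c(126000, 139500, 124900, 114000, 227000)
)

# is.na() on a data frame returns a logical matrix, so colMeans()
# yields the fraction of missing values in each column.
na_frac <- sort(colMeans(is.na(toy)), decreasing = TRUE)

# The three columns with the largest share of missing values
top3 <- head(na_frac, 3)
print(top3)
```

Replacing `toy` with `ames_train` should apply the same ranking to the real data.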

  2. How many categorical variables are coded in R as having type int? Change them to factors when conducting your analysis.
    1. 0
    2. 1
    3. 2
    4. 3
# type your code for Question 2 here, and Knit
str(ames_train)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1000 obs. of  81 variables:
##  $ PID            : int  909176150 905476230 911128020 535377150 534177230 908128060 902135020 528228540 923426010 908186050 ...
##  $ area           : int  856 1049 1001 1039 1665 1922 936 1246 889 1072 ...
##  $ price          : int  126000 139500 124900 114000 227000 198500 93000 187687 137500 140000 ...
##  $ MS.SubClass    : int  30 120 30 70 60 85 20 20 20 180 ...
##  $ MS.Zoning      : Factor w/ 7 levels "A (agr)","C (all)",..: 6 6 2 6 6 6 7 6 6 7 ...
##  $ Lot.Frontage   : int  NA 42 60 80 70 64 60 53 74 35 ...
##  $ Lot.Area       : int  7890 4235 6060 8146 8400 7301 6000 3710 12395 3675 ...
##  $ Street         : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Alley          : Factor w/ 2 levels "Grvl","Pave": NA NA NA NA NA NA 2 NA NA NA ...
##  $ Lot.Shape      : Factor w/ 4 levels "IR1","IR2","IR3",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Land.Contour   : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 1 4 4 4 ...
##  $ Utilities      : Factor w/ 3 levels "AllPub","NoSeWa",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Lot.Config     : Factor w/ 5 levels "Corner","CulDSac",..: 1 5 5 1 5 1 5 5 1 5 ...
##  $ Land.Slope     : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 2 1 1 1 ...
##  $ Neighborhood   : Factor w/ 28 levels "Blmngtn","Blueste",..: 26 8 12 21 20 8 21 1 15 8 ...
##  $ Condition.1    : Factor w/ 9 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Condition.2    : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Bldg.Type      : Factor w/ 5 levels "1Fam","2fmCon",..: 1 5 1 1 1 1 2 1 1 5 ...
##  $ House.Style    : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 3 3 3 6 6 7 3 3 3 7 ...
##  $ Overall.Qual   : int  6 5 5 4 8 7 4 7 5 6 ...
##  $ Overall.Cond   : int  6 5 9 8 6 5 4 5 6 5 ...
##  $ Year.Built     : int  1939 1984 1930 1900 2001 2003 1953 2007 1984 2005 ...
##  $ Year.Remod.Add : int  1950 1984 2007 2003 2001 2003 1953 2008 1984 2005 ...
##  $ Roof.Style     : Factor w/ 6 levels "Flat","Gable",..: 2 2 4 2 2 2 2 2 2 2 ...
##  $ Roof.Matl      : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Exterior.1st   : Factor w/ 16 levels "AsbShng","AsphShn",..: 15 7 9 9 14 7 9 16 7 14 ...
##  $ Exterior.2nd   : Factor w/ 17 levels "AsbShng","AsphShn",..: 16 7 9 9 15 7 9 17 11 15 ...
##  $ Mas.Vnr.Type   : Factor w/ 6 levels "","BrkCmn","BrkFace",..: 5 3 5 5 5 3 5 3 5 6 ...
##  $ Mas.Vnr.Area   : int  0 149 0 0 0 500 0 20 0 76 ...
##  $ Exter.Qual     : Factor w/ 4 levels "Ex","Fa","Gd",..: 4 3 3 3 3 3 2 3 4 4 ...
##  $ Exter.Cond     : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 3 5 5 5 5 5 5 ...
##  $ Foundation     : Factor w/ 6 levels "BrkTil","CBlock",..: 2 2 1 1 3 4 2 3 2 3 ...
##  $ Bsmt.Qual      : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 4 6 3 4 NA 3 4 6 4 ...
##  $ Bsmt.Cond      : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 6 6 6 NA 6 6 6 6 ...
##  $ Bsmt.Exposure  : Factor w/ 5 levels "","Av","Gd","Mn",..: 5 4 5 5 5 NA 5 3 5 3 ...
##  $ BsmtFin.Type.1 : Factor w/ 7 levels "","ALQ","BLQ",..: 6 4 2 7 4 NA 7 7 2 4 ...
##  $ BsmtFin.SF.1   : int  238 552 737 0 643 0 0 0 647 467 ...
##  $ BsmtFin.Type.2 : Factor w/ 7 levels "","ALQ","BLQ",..: 7 2 7 7 7 NA 7 7 7 7 ...
##  $ BsmtFin.SF.2   : int  0 393 0 0 0 0 0 0 0 0 ...
##  $ Bsmt.Unf.SF    : int  618 104 100 405 167 0 936 1146 217 80 ...
##  $ Total.Bsmt.SF  : int  856 1049 837 405 810 0 936 1146 864 547 ...
##  $ Heating        : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Heating.QC     : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 1 3 1 1 5 1 5 1 ...
##  $ Central.Air    : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 1 2 2 2 ...
##  $ Electrical     : Factor w/ 6 levels "","FuseA","FuseF",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ X1st.Flr.SF    : int  856 1049 1001 717 810 495 936 1246 889 1072 ...
##  $ X2nd.Flr.SF    : int  0 0 0 322 855 1427 0 0 0 0 ...
##  $ Low.Qual.Fin.SF: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Bsmt.Full.Bath : int  1 1 0 0 1 0 0 0 0 1 ...
##  $ Bsmt.Half.Bath : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Full.Bath      : int  1 2 1 1 2 3 1 2 1 1 ...
##  $ Half.Bath      : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ Bedroom.AbvGr  : int  2 2 2 2 3 4 2 2 3 2 ...
##  $ Kitchen.AbvGr  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Kitchen.Qual   : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 3 3 5 3 3 5 3 5 3 ...
##  $ TotRms.AbvGrd  : int  4 5 5 6 6 7 4 5 6 5 ...
##  $ Functional     : Factor w/ 8 levels "Maj1","Maj2",..: 8 8 8 8 8 8 4 8 8 8 ...
##  $ Fireplaces     : int  1 0 0 0 0 1 0 1 0 0 ...
##  $ Fireplace.Qu   : Factor w/ 5 levels "Ex","Fa","Gd",..: 3 NA NA NA NA 1 NA 3 NA NA ...
##  $ Garage.Type    : Factor w/ 6 levels "2Types","Attchd",..: 6 2 6 6 2 4 6 2 2 3 ...
##  $ Garage.Yr.Blt  : int  1939 1984 1930 1940 2001 2003 1974 2007 1984 2005 ...
##  $ Garage.Finish  : Factor w/ 4 levels "","Fin","RFn",..: 4 2 4 4 2 3 4 2 4 2 ...
##  $ Garage.Cars    : int  2 1 1 1 2 2 2 2 2 2 ...
##  $ Garage.Area    : int  399 266 216 281 528 672 576 428 484 525 ...
##  $ Garage.Qual    : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ Garage.Cond    : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 5 6 6 6 6 6 6 6 ...
##  $ Paved.Drive    : Factor w/ 3 levels "N","P","Y": 3 3 1 1 3 3 3 3 3 3 ...
##  $ Wood.Deck.SF   : int  0 0 154 0 0 0 0 100 0 0 ...
##  $ Open.Porch.SF  : int  0 105 0 0 45 0 32 24 0 44 ...
##  $ Enclosed.Porch : int  0 0 42 168 0 177 112 0 0 0 ...
##  $ X3Ssn.Porch    : int  0 0 86 0 0 0 0 0 0 0 ...
##  $ Screen.Porch   : int  166 0 0 111 0 0 0 0 0 0 ...
##  $ Pool.Area      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Pool.QC        : Factor w/ 4 levels "Ex","Fa","Gd",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ Fence          : Factor w/ 4 levels "GdPrv","GdWo",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ Misc.Feature   : Factor w/ 5 levels "Elev","Gar2",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ Misc.Val       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Mo.Sold        : int  3 2 11 5 11 7 2 3 4 5 ...
##  $ Yr.Sold        : int  2010 2009 2007 2009 2009 2009 2009 2008 2008 2007 ...
##  $ Sale.Type      : Factor w/ 10 levels "COD","Con","ConLD",..: 10 10 10 10 10 3 10 7 10 10 ...
##  $ Sale.Condition : Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 5 5 5 5 6 5 5 ...
summary(ames_train)
##       PID                 area          price         MS.SubClass    
##  Min.   :5.263e+08   Min.   : 334   Min.   : 12789   Min.   : 20.00  
##  1st Qu.:5.285e+08   1st Qu.:1092   1st Qu.:129763   1st Qu.: 20.00  
##  Median :5.354e+08   Median :1411   Median :159467   Median : 50.00  
##  Mean   :7.059e+08   Mean   :1477   Mean   :181190   Mean   : 57.15  
##  3rd Qu.:9.071e+08   3rd Qu.:1743   3rd Qu.:213000   3rd Qu.: 70.00  
##  Max.   :1.007e+09   Max.   :4676   Max.   :615000   Max.   :190.00  
##                                                                      
##    MS.Zoning    Lot.Frontage       Lot.Area       Street     Alley    
##  A (agr):  0   Min.   : 21.00   Min.   :  1470   Grvl:  3   Grvl: 33  
##  C (all):  9   1st Qu.: 57.00   1st Qu.:  7314   Pave:997   Pave: 34  
##  FV     : 56   Median : 69.00   Median :  9317              NA's:933  
##  I (all):  1   Mean   : 69.21   Mean   : 10352                        
##  RH     :  7   3rd Qu.: 80.00   3rd Qu.: 11650                        
##  RL     :772   Max.   :313.00   Max.   :215245                        
##  RM     :155   NA's   :167                                            
##  Lot.Shape Land.Contour  Utilities      Lot.Config  Land.Slope
##  IR1:338   Bnk: 33      AllPub:1000   Corner :173   Gtl:962   
##  IR2: 30   HLS: 38      NoSeWa:   0   CulDSac: 76   Mod: 33   
##  IR3:  3   Low: 20      NoSewr:   0   FR2    : 36   Sev:  5   
##  Reg:629   Lvl:909                    FR3    :  5             
##                                       Inside :710             
##                                                               
##                                                               
##   Neighborhood  Condition.1   Condition.2   Bldg.Type    House.Style 
##  NAmes  :155   Norm   :875   Norm   :988   1Fam  :823   1Story :521  
##  CollgCr: 85   Feedr  : 53   Feedr  :  6   2fmCon: 20   2Story :286  
##  Somerst: 74   Artery : 23   Artery :  2   Duplex: 35   1.5Fin : 98  
##  OldTown: 71   RRAn   : 14   PosN   :  2   Twnhs : 38   SLvl   : 41  
##  Sawyer : 61   PosN   : 11   PosA   :  1   TwnhsE: 84   SFoyer : 36  
##  Edwards: 60   RRAe   : 11   RRNn   :  1                2.5Unf : 10  
##  (Other):494   (Other): 13   (Other):  0                (Other):  8  
##   Overall.Qual     Overall.Cond     Year.Built   Year.Remod.Add
##  Min.   : 1.000   Min.   :1.000   Min.   :1872   Min.   :1950  
##  1st Qu.: 5.000   1st Qu.:5.000   1st Qu.:1955   1st Qu.:1966  
##  Median : 6.000   Median :5.000   Median :1975   Median :1992  
##  Mean   : 6.095   Mean   :5.559   Mean   :1972   Mean   :1984  
##  3rd Qu.: 7.000   3rd Qu.:6.000   3rd Qu.:2001   3rd Qu.:2004  
##  Max.   :10.000   Max.   :9.000   Max.   :2010   Max.   :2010  
##                                                                
##    Roof.Style    Roof.Matl    Exterior.1st  Exterior.2nd  Mas.Vnr.Type
##  Flat   :  9   CompShg:984   VinylSd:349   VinylSd:345          :  7  
##  Gable  :775   Tar&Grv: 11   HdBoard:164   HdBoard:150   BrkCmn :  8  
##  Gambrel:  8   WdShake:  2   MetalSd:147   MetalSd:148   BrkFace:317  
##  Hip    :204   WdShngl:  2   Wd Sdng:138   Wd Sdng:130   CBlock :  0  
##  Mansard:  4   Metal  :  1   Plywood: 74   Plywood: 96   None   :593  
##  Shed   :  0   ClyTile:  0   CemntBd: 40   CmentBd: 40   Stone  : 75  
##                (Other):  0   (Other): 88   (Other): 91                
##   Mas.Vnr.Area    Exter.Qual Exter.Cond  Foundation  Bsmt.Qual  Bsmt.Cond 
##  Min.   :   0.0   Ex: 39     Ex:  4     BrkTil:102       :  1       :  1  
##  1st Qu.:   0.0   Fa: 11     Fa: 19     CBlock:430   Ex  : 87   Ex  :  2  
##  Median :   0.0   Gd:337     Gd:116     PConc :453   Fa  : 28   Fa  : 23  
##  Mean   : 104.1   TA:613     Po:  0     Slab  : 12   Gd  :424   Gd  : 44  
##  3rd Qu.: 160.0              TA:861     Stone :  3   Po  :  1   Po  :  1  
##  Max.   :1290.0                         Wood  :  0   TA  :438   TA  :908  
##  NA's   :7                                           NA's: 21   NA's: 21  
##  Bsmt.Exposure BsmtFin.Type.1  BsmtFin.SF.1    BsmtFin.Type.2
##      :  2      GLQ    :294    Min.   :   0.0   Unf    :863   
##  Av  :157      Unf    :279    1st Qu.:   0.0   LwQ    : 31   
##  Gd  : 98      ALQ    :163    Median : 400.0   Rec    : 29   
##  Mn  : 87      Rec    :107    Mean   : 464.1   BLQ    : 24   
##  No  :635      BLQ    : 87    3rd Qu.: 773.0   ALQ    : 20   
##  NA's: 21      (Other): 49    Max.   :2260.0   (Other): 12   
##                NA's   : 21    NA's   :1        NA's   : 21   
##   BsmtFin.SF.2      Bsmt.Unf.SF     Total.Bsmt.SF     Heating   
##  Min.   :   0.00   Min.   :   0.0   Min.   :   0.0   Floor:  0  
##  1st Qu.:   0.00   1st Qu.: 223.5   1st Qu.: 797.5   GasA :988  
##  Median :   0.00   Median : 461.0   Median : 998.0   GasW :  8  
##  Mean   :  48.07   Mean   : 547.0   Mean   :1059.2   Grav :  2  
##  3rd Qu.:   0.00   3rd Qu.: 783.0   3rd Qu.:1301.0   OthW :  1  
##  Max.   :1526.00   Max.   :2336.0   Max.   :3138.0   Wall :  1  
##  NA's   :1         NA's   :1        NA's   :1                   
##  Heating.QC Central.Air Electrical   X1st.Flr.SF      X2nd.Flr.SF    
##  Ex:516     N: 55            :  0   Min.   : 334.0   Min.   :   0.0  
##  Fa: 22     Y:945       FuseA: 54   1st Qu.: 876.2   1st Qu.:   0.0  
##  Gd:157                 FuseF: 12   Median :1080.5   Median :   0.0  
##  Po:  1                 FuseP:  2   Mean   :1157.1   Mean   : 315.2  
##  TA:304                 Mix  :  0   3rd Qu.:1376.2   3rd Qu.: 688.2  
##                         SBrkr:932   Max.   :3138.0   Max.   :1836.0  
##                                                                      
##  Low.Qual.Fin.SF   Bsmt.Full.Bath   Bsmt.Half.Bath      Full.Bath    
##  Min.   :   0.00   Min.   :0.0000   Min.   :0.00000   Min.   :0.000  
##  1st Qu.:   0.00   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:1.000  
##  Median :   0.00   Median :0.0000   Median :0.00000   Median :2.000  
##  Mean   :   4.32   Mean   :0.4474   Mean   :0.06106   Mean   :1.541  
##  3rd Qu.:   0.00   3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:2.000  
##  Max.   :1064.00   Max.   :3.0000   Max.   :2.00000   Max.   :4.000  
##                    NA's   :1        NA's   :1                        
##    Half.Bath     Bedroom.AbvGr   Kitchen.AbvGr   Kitchen.Qual
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Ex: 67      
##  1st Qu.:0.000   1st Qu.:2.000   1st Qu.:1.000   Fa: 20      
##  Median :0.000   Median :3.000   Median :1.000   Gd:403      
##  Mean   :0.378   Mean   :2.806   Mean   :1.039   Po:  1      
##  3rd Qu.:1.000   3rd Qu.:3.000   3rd Qu.:1.000   TA:509      
##  Max.   :2.000   Max.   :6.000   Max.   :2.000               
##                                                              
##  TotRms.AbvGrd     Functional    Fireplaces    Fireplace.Qu  Garage.Type 
##  Min.   : 2.00   Typ    :935   Min.   :0.000   Ex  : 16     2Types : 10  
##  1st Qu.: 5.00   Min2   : 24   1st Qu.:0.000   Fa  : 24     Attchd :610  
##  Median : 6.00   Min1   : 18   Median :1.000   Gd  :232     Basment: 11  
##  Mean   : 6.34   Mod    : 16   Mean   :0.597   Po  : 18     BuiltIn: 56  
##  3rd Qu.: 7.00   Maj1   :  4   3rd Qu.:1.000   TA  :219     CarPort:  1  
##  Max.   :13.00   Maj2   :  2   Max.   :4.000   NA's:491     Detchd :266  
##                  (Other):  1                                NA's   : 46  
##  Garage.Yr.Blt  Garage.Finish  Garage.Cars     Garage.Area     Garage.Qual
##  Min.   :1900       :  2      Min.   :0.000   Min.   :   0.0       :  1   
##  1st Qu.:1961   Fin :247      1st Qu.:1.000   1st Qu.: 312.0   Ex  :  1   
##  Median :1979   RFn :278      Median :2.000   Median : 480.0   Fa  : 37   
##  Mean   :1978   Unf :427      Mean   :1.767   Mean   : 475.4   Gd  :  7   
##  3rd Qu.:2002   NA's: 46      3rd Qu.:2.000   3rd Qu.: 576.0   Po  :  3   
##  Max.   :2010                 Max.   :5.000   Max.   :1390.0   TA  :904   
##  NA's   :48                   NA's   :1       NA's   :1        NA's: 47   
##  Garage.Cond Paved.Drive  Wood.Deck.SF    Open.Porch.SF   
##      :  1    N: 67       Min.   :  0.00   Min.   :  0.00  
##  Ex  :  1    P: 29       1st Qu.:  0.00   1st Qu.:  0.00  
##  Fa  : 21    Y:904       Median :  0.00   Median : 28.00  
##  Gd  :  6                Mean   : 93.84   Mean   : 48.93  
##  Po  :  6                3rd Qu.:168.00   3rd Qu.: 74.00  
##  TA  :918                Max.   :857.00   Max.   :742.00  
##  NA's: 47                                                 
##  Enclosed.Porch    X3Ssn.Porch       Screen.Porch      Pool.Area      
##  Min.   :  0.00   Min.   :  0.000   Min.   :  0.00   Min.   :  0.000  
##  1st Qu.:  0.00   1st Qu.:  0.000   1st Qu.:  0.00   1st Qu.:  0.000  
##  Median :  0.00   Median :  0.000   Median :  0.00   Median :  0.000  
##  Mean   : 23.48   Mean   :  3.118   Mean   : 14.77   Mean   :  1.463  
##  3rd Qu.:  0.00   3rd Qu.:  0.000   3rd Qu.:  0.00   3rd Qu.:  0.000  
##  Max.   :432.00   Max.   :508.000   Max.   :440.00   Max.   :800.000  
##                                                                       
##  Pool.QC      Fence     Misc.Feature    Misc.Val           Mo.Sold      
##  Ex  :  1   GdPrv: 43   Elev:  0     Min.   :    0.00   Min.   : 1.000  
##  Fa  :  1   GdWo : 37   Gar2:  2     1st Qu.:    0.00   1st Qu.: 4.000  
##  Gd  :  1   MnPrv:120   Othr:  1     Median :    0.00   Median : 6.000  
##  TA  :  0   MnWw :  2   Shed: 25     Mean   :   45.81   Mean   : 6.243  
##  NA's:997   NA's :798   TenC:  1     3rd Qu.:    0.00   3rd Qu.: 8.000  
##                         NA's:971     Max.   :15500.00   Max.   :12.000  
##                                                                         
##     Yr.Sold       Sale.Type   Sale.Condition
##  Min.   :2006   WD     :863   Abnorml: 61   
##  1st Qu.:2007   New    : 79   AdjLand:  2   
##  Median :2008   COD    : 27   Alloca :  4   
##  Mean   :2008   ConLD  :  7   Family : 17   
##  3rd Qu.:2009   ConLw  :  6   Normal :834   
##  Max.   :2010   Con    :  5   Partial: 82   
##                 (Other): 13
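Question 2 asks for integer columns that are really categorical to be converted to factors. A minimal sketch of that conversion follows, using a small hypothetical data frame in place of `ames_train`. `MS.SubClass` serves as the example because its integers are dwelling-type codes rather than quantities; which other columns qualify is left to the reader's judgment.

```r
# Hypothetical stand-in for ames_train with two integer columns.
toy <- data.frame(
  MS.SubClass = c(30L, 120L, 30L, 70L, 60L),    # dwelling-type codes
  area        = c(856L, 1049L, 1001L, 1039L, 1665L)  # a genuine quantity
)

# Recode the integer codes as a factor; area stays numeric.
toy$MS.SubClass <- factor(toy$MS.SubClass)

str(toy)
```

After the conversion, modeling functions such as `lm()` treat each code as its own category instead of fitting a single slope across the code values.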

  3. In terms of price, which neighborhood has the highest standard deviation?
    1. StoneBr
    2. Timber
    3. Veenker
    4. NridgHt
# type your code for Question 3 here, and Knit
aggregate(ames_train$price, by=list(ames_train$Neighborhood), FUN=sd)
##    Group.1         x
## 1  Blmngtn  26454.86
## 2  Blueste  10381.23
## 3   BrDale  13337.59
## 4  BrkSide  37309.91
## 5  ClearCr  48068.69
## 6  CollgCr  52786.08
## 7  Crawfor  71267.56
## 8  Edwards  54851.63
## 9  Gilbert  41190.38
## 10  Greens  29063.42
## 11 GrnHill  70710.68
## 12  IDOTRR  31530.44
## 13 MeadowV  18939.78
## 14 Mitchel  39682.94
## 15   NAmes  27267.97
## 16 NoRidge  35888.97
## 17 NPkVill  11958.37
## 18 NridgHt 105088.90
## 19  NWAmes  41340.50
## 20 OldTown  36429.69
## 21  Sawyer  21216.22
## 22 SawyerW  48354.36
## 23 Somerst  65199.49
## 24 StoneBr 123459.10
## 25   SWISU  27375.76
## 26  Timber  84029.57
## 27 Veenker  72545.41

  4. Using scatter plots or other graphical displays, which of the following variables appears to be the best single predictor of price?
    1. Lot.Area
    2. Bedroom.AbvGr
    3. Overall.Qual
    4. Year.Built
# type your code for Question 4 here, and Knit
ames_train$Overall.Qual <- as.numeric(ames_train$Overall.Qual)
library(ggplot2)
ggplot(data = ames_train, aes(x = Lot.Area, y = price)) +
  geom_point()

ggplot(data = ames_train, aes(x = Bedroom.AbvGr, y = price)) +
  geom_point()

ggplot(data = ames_train, aes(x = Overall.Qual, y = price)) +
  geom_point()

ggplot(data = ames_train, aes(x = Year.Built, y = price)) +
  geom_point()

  5. Suppose you are examining the relationship between price and area. Which of the following variable transformations makes the relationship appear to be the most linear?
    1. Do not transform either price or area
    2. Log-transform price but not area
    3. Log-transform area but not price
    4. Log-transform both price and area
# type your code for Question 5 here, and Knit
ggplot(data = ames_train, aes(x = area, y = price)) +
  geom_point()

ggplot(data = ames_train, aes(x = area, y = log(price))) +
  geom_point()

ggplot(data = ames_train, aes(x = log(area), y = price)) +
  geom_point()

ggplot(data = ames_train, aes(x = log(area), y = log(price))) +
  geom_point()

  6. Suppose that your prior for the proportion of houses that have at least one garage is Beta(9, 1). What is your posterior? Assume a beta-binomial model for this proportion.
    1. Beta(954, 46)
    2. Beta(963, 46)
    3. Beta(954, 47)
    4. Beta(963, 47)
# type your code for Question 6 here, and Knit
table(ames_train$Garage.Cars >= 1)
## 
## FALSE  TRUE 
##    46   953
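Note that the table above sums to 999 because `table()` drops the one house whose `Garage.Cars` value is missing. The conjugate update itself is simple arithmetic: a Beta(a, b) prior combined with s successes and f failures gives a Beta(a + s, b + f) posterior. The sketch below takes its counts from the `summary()` output earlier and assumes that the 46 missing `Garage.Type` values mark houses without a garage; that reading is an assumption, not something the data states explicitly.

```r
# Prior Beta(a, b) on the proportion of houses with a garage
prior_a <- 9
prior_b <- 1

# Counts taken from the summary output; ASSUMPTION: the 46 NA values
# in Garage.Type correspond to houses with no garage.
n_total     <- 1000
n_no_garage <- 46
n_garage    <- n_total - n_no_garage

# Beta-binomial conjugate update: add successes to a, failures to b
post_a <- prior_a + n_garage
post_b <- prior_b + n_no_garage

c(post_a = post_a, post_b = post_b)

# Posterior mean of the proportion
post_a / (post_a + post_b)
```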

  7. Which of the following statements is true about the dataset?
    1. Over 30 percent of houses were built after the year 1999.
    2. The median housing price is greater than the mean housing price.
    3. 21 houses do not have a basement.
    4. 4 houses are located on gravel streets.
# type your code for Question 7 here, and Knit
table(ames_train$Year.Built > 1999)
## 
## FALSE  TRUE 
##   728   272
median(ames_train$price) - mean(ames_train$price)
## [1] -21723.08
table(ames_train$Total.Bsmt.SF > 0)
## 
## FALSE  TRUE 
##    21   978
table(ames_train$Street)
## 
## Grvl Pave 
##    3  997

  8. Test, at the \(\alpha = 0.05\) level, whether homes with a garage have larger square footage than those without a garage.
    1. With a p-value near 0.000, we reject the null hypothesis of no difference.
    2. With a p-value of approximately 0.032, we reject the null hypothesis of no difference.
    3. With a p-value of approximately 0.135, we fail to reject the null hypothesis of no difference.
    4. With a p-value of approximately 0.343, we fail to reject the null hypothesis of no difference.
# type your code for Question 8 here, and Knit
with(ames_train, t.test(area[Garage.Cars>0], area[Garage.Cars==0]))
## 
##  Welch Two Sample t-test
## 
## data:  area[Garage.Cars > 0] and area[Garage.Cars == 0]
## t = 5.134, df = 50.702, p-value = 4.535e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  211.4183 482.9963
## sample estimates:
## mean of x mean of y 
##  1492.251  1145.043
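The test above is two-sided, while the question is directional ("larger"). A one-sided Welch test can be requested through the `alternative` argument of `t.test()`. The sketch below uses simulated areas (the group sizes, means, and standard deviations are made up for illustration) to show how the two p-values relate:

```r
set.seed(1)

# Simulated square footage; these numbers are hypothetical, not Ames data.
area_garage    <- rnorm(200, mean = 1500, sd = 450)
area_no_garage <- rnorm(40,  mean = 1150, sd = 400)

# Two-sided Welch test (the default)
two_sided <- t.test(area_garage, area_no_garage)

# One-sided test of H1: mean(area_garage) > mean(area_no_garage)
one_sided <- t.test(area_garage, area_no_garage, alternative = "greater")

c(two_sided = two_sided$p.value, one_sided = one_sided$p.value)
```

When the observed difference points in the hypothesized direction, the one-sided p-value is half the two-sided one, so a two-sided p-value as small as the one reported above would still lead to rejecting the null.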

  9. For homes with square footage greater than 2000, assume that the number of bedrooms above ground follows a Poisson distribution with rate \(\lambda\). Your prior on \(\lambda\) follows a Gamma distribution with mean 3 and standard deviation 1. What is your posterior mean and standard deviation for the average number of bedrooms in houses with square footage greater than 2000 square feet?
    1. Mean: 3.61, SD: 0.11
    2. Mean: 3.62, SD: 0.16
    3. Mean: 3.63, SD: 0.09
    4. Mean: 3.63, SD: 0.91
# type your code for Question 9 here, and Knit
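# Gamma prior with shape k and scale theta:
#   mean = k * theta, sd = sqrt(k) * theta
#   => k = (mean / sd)^2 = 9 and theta = mean / k = 1/3
k <- 9
theta <- 1/3
ames_q9 <- ames_train[ames_train$area > 2000, ]
# Conjugate update for Poisson counts: add the total count to the shape
# and shrink the scale according to the number of observations.
k_star <- k + sum(ames_q9$Bedroom.AbvGr)
theta_star <- theta / (length(ames_q9$Bedroom.AbvGr) * theta + 1)
post_mean <- k_star * theta_star
post_sd <- theta_star * sqrt(k_star)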
post_mean
## [1] 3.617021
post_sd
## [1] 0.1601644
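The same Gamma-Poisson update can be wrapped in a small reusable function. This is a sketch under the shape/scale parameterization used above; the function name `gamma_poisson_update` and the two example counts are hypothetical and chosen so the arithmetic is easy to check by hand.

```r
# Conjugate update of a Gamma(shape = k, scale = theta) prior
# given Poisson counts x. (Hypothetical helper, not part of any package.)
gamma_poisson_update <- function(k, theta, x) {
  k_star     <- k + sum(x)                      # shape gains the total count
  theta_star <- theta / (length(x) * theta + 1) # scale shrinks with n
  list(
    shape = k_star,
    scale = theta_star,
    mean  = k_star * theta_star,
    sd    = sqrt(k_star) * theta_star
  )
}

# Recover shape and scale from the prior mean and standard deviation
prior_mean <- 3
prior_sd   <- 1
k     <- (prior_mean / prior_sd)^2   # 9
theta <- prior_sd^2 / prior_mean     # 1/3

# Two made-up observations: houses with 3 and 4 bedrooms
post <- gamma_poisson_update(k, theta, x = c(3, 4))
post$mean  # 16 * 0.2 = 3.2
post$sd    # sqrt(16) * 0.2 = 0.8
```

Passing `ames_q9$Bedroom.AbvGr` as `x` should reproduce the posterior mean and standard deviation printed above.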

  10. When regressing \(\log\)(price) on \(\log\)(area), there are some outliers. Which of the following do the three most outlying points have in common?
    1. They had abnormal sale conditions.
    2. They have only two bedrooms.
    3. They have an overall quality of less than 3.
    4. They were built before 1930.
# type your code for Question 10 here, and Knit
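lm_q10 <- lm(log(price) ~ log(area), data = ames_train)
# Squared residuals; the three largest mark the most outlying points
res_q10 <- resid(lm_q10)^2
# The 997th smallest of the 1000 squared residuals is the cutoff
threshold <- sort(res_q10, partial = 997)[997]
which(res_q10 > threshold)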
## 206 428 741 
## 206 428 741
ames_train$Sale.Condition[c(206, 428, 741)]
## [1] Abnorml Abnorml Normal 
## Levels: Abnorml AdjLand Alloca Family Normal Partial
ames_train$Bedroom.AbvGr[c(206, 428, 741)]
## [1] 3 2 3
ames_train$Overall.Qual[c(206, 428, 741)]
## [1] 4 2 4
ames_train$Year.Built[c(206, 428, 741)]
## [1] 1910 1923 1920
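An alternative to the partial-sort threshold is to order the absolute residuals and take the first three indices. The sketch below demonstrates this on simulated data with three planted outliers; the data frame `toy`, its coefficients, and the planted row numbers are all made up for illustration.

```r
set.seed(42)

# Simulated log-log relationship between area and price (hypothetical).
toy <- data.frame(area = exp(rnorm(50, mean = 7.2, sd = 0.3)))
toy$price <- exp(5 + 0.9 * log(toy$area) + rnorm(50, sd = 0.1))

# Plant three artificially cheap sales at known rows.
planted <- c(5L, 20L, 35L)
toy$price[planted] <- toy$price[planted] / 10

fit <- lm(log(price) ~ log(area), data = toy)

# Row indices of the three largest absolute residuals
worst <- order(abs(resid(fit)), decreasing = TRUE)[1:3]
sort(worst)

# Inspect the flagged rows, as done for the Ames rows above
toy[worst, ]
```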

  11. Which of the following are reasons to log-transform price if used as a dependent variable in a linear regression?
    1. price is right-skewed.
    2. price cannot take on negative values.
    3. price can only take on integer values.
    4. Both 1 and 2.
# type your code for Question 11 here, and Knit
hist(ames_train$price)

hist(log(ames_train$price))

  12. How many neighborhoods consist of only single-family homes? (e.g. Bldg.Type = 1Fam)
    1. 0
    2. 1
    3. 2
    4. 3
ames_train %>%
  group_by(Neighborhood) %>%
  summarise(flag=mean(Bldg.Type == '1Fam')) %>%
  arrange(-flag)
## # A tibble: 27 x 2
##    Neighborhood      flag
##          <fctr>     <dbl>
##  1      ClearCr 1.0000000
##  2      NoRidge 1.0000000
##  3       Timber 1.0000000
##  4      Gilbert 0.9795918
##  5      BrkSide 0.9756098
##  6       NWAmes 0.9756098
##  7      CollgCr 0.9647059
##  8       IDOTRR 0.9428571
##  9        NAmes 0.9225806
## 10       Sawyer 0.9180328
## # ... with 17 more rows

  13. Using color, different plotting symbols, conditioning plots, etc., does there appear to be an association between \(\log\)(area) and the number of bedrooms above ground (Bedroom.AbvGr)?
    1. Yes
    2. No
# type your code for Question 13 here, and Knit
cor(ames_train$Bedroom.AbvGr, log(ames_train$area))
## [1] 0.5457625
ggplot(ames_train, aes(Bedroom.AbvGr, log(area))) + geom_point()

  14. Of the people who have unfinished basements, what is the average square footage of the unfinished basement?
    1. 590.36
    2. 595.25
    3. 614.37
    4. 681.94
# type your code for Question 14 here, and Knit
ames_train %>%
  filter(Bsmt.Unf.SF > 0) %>%
  summarise(mean(Bsmt.Unf.SF))
## # A tibble: 1 x 1
##   `mean(Bsmt.Unf.SF)`
##                 <dbl>
## 1            595.2527