Complete all Exercises, and submit answers to Questions on the Coursera platform.

This second quiz covers model assumptions, selection, and interpretation. The concepts tested here will be useful for the final peer assessment, which is much more open-ended.

First, let us load the data:

load("ames_train.Rdata")
# Load devtools, which can install packages from GitHub
library(devtools)
install_github("StatsWithR/statsr")
library(dplyr)
library(statsr)
  1. Suppose you are regressing \(\log\)(price) on \(\log\)(area), \(\log\)(Lot.Area), Bedroom.AbvGr, Overall.Qual, and Land.Slope. Which of the following variables are included with stepwise variable selection using AIC but not BIC? Select all that apply.
    1. \(\log\)(area)
    2. \(\log\)(Lot.Area)
    3. Bedroom.AbvGr
    4. Overall.Qual
    5. Land.Slope
# type your code for Question 1 here, and Knit
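
One way to run this comparison is with base R's `step()`, where the penalty `k` is 2 for AIC and `log(n)` for BIC. The sketch below is a minimal illustration on simulated data: the data frame `toy` and its columns are made up so the block runs on its own. The quiz does not prescribe a particular function, so the choice of `step()` is an assumption.

```r
# Simulated stand-in data (hypothetical; not the Ames data)
set.seed(1)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 1 + 2 * toy$x1 + 0.5 * toy$x2 + rnorm(n)  # x3 is pure noise

full <- lm(y ~ x1 + x2 + x3, data = toy)

# Backward stepwise selection; k sets the penalty per parameter
fit_aic <- step(full, k = 2, trace = 0)               # AIC
fit_bic <- step(full, k = log(nrow(toy)), trace = 0)  # BIC

# Terms kept under AIC but dropped under BIC
setdiff(attr(terms(fit_aic), "term.labels"),
        attr(terms(fit_bic), "term.labels"))
```

With the quiz data, `full` would instead be `lm(log(price) ~ log(area) + log(Lot.Area) + Bedroom.AbvGr + Overall.Qual + Land.Slope, data = ames_train)`.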

  2. When regressing \(\log\)(price) on Bedroom.AbvGr, the coefficient for Bedroom.AbvGr is strongly positive. However, once \(\log\)(area) is added to the model, the coefficient for Bedroom.AbvGr becomes strongly negative. Which of the following best explains this phenomenon?
    1. The original model was misspecified, biasing our coefficient estimate for Bedroom.AbvGr
    2. Bedrooms take up proportionally less space in larger houses, which increases property valuation.
    3. Larger houses on average have more bedrooms and sell for higher prices. However, holding constant the size of a house, the number of bedrooms decreases property valuation.
    4. Since the number of bedrooms is a statistically insignificant predictor of housing price, it is unsurprising that the coefficient changes depending on which variables are included.
# type your code for Question 2 here, and Knit
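
A sign flip like this can be reproduced in a small simulation. The sketch below uses made-up data, not the Ames data: bedroom count rises with house size, price rises with size, and at a fixed size each extra bedroom lowers the price. All variable names and coefficient values are assumptions chosen for illustration.

```r
set.seed(42)
n <- 1000
log_area <- rnorm(n, mean = 7.2, sd = 0.3)

# Bedrooms increase with size (simulated relationship)
bedrooms <- pmax(0, round(3 + 4 * (log_area - 7.2) + rnorm(n, sd = 0.7)))

# Price rises with area and falls with bedrooms at a fixed area
log_price <- 5 + 1.0 * log_area - 0.08 * bedrooms + rnorm(n, sd = 0.2)

# Coefficient on bedrooms without and with log_area in the model
b_alone    <- coef(lm(log_price ~ bedrooms))["bedrooms"]
b_adjusted <- coef(lm(log_price ~ bedrooms + log_area))["bedrooms"]
c(alone = unname(b_alone), adjusted = unname(b_adjusted))
```

On the quiz data, the same comparison would use `lm(log(price) ~ Bedroom.AbvGr, data = ames_train)` and the model with `log(area)` added.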

  3. Run a simple linear model for \(\log\)(price), with \(\log\)(area) as the independent variable. Which of the following neighborhoods has the highest average residuals?
    1. OldTown
    2. StoneBr
    3. GrnHill
    4. IDOTRR
# type your code for Question 3 here, and Knit
# Fit log(price) on log(area) and store the residuals
lm3 = lm(log(price) ~ log(area), data = ames_train)
ames_train$resid3 = resid(lm3)
# Mean residual per neighborhood, largest first
ames_train %>%
  group_by(Neighborhood) %>%
  summarise(mean_resid = mean(resid3)) %>%
  arrange(desc(mean_resid))
## # A tibble: 27 x 2
##    Neighborhood mean_resid
##          <fctr>      <dbl>
##  1      GrnHill  0.5086290
##  2      StoneBr  0.3785578
##  3       Greens  0.3784994
##  4      NridgHt  0.3370307
##  5       Timber  0.2667409
##  6      Somerst  0.2049053
##  7      Blmngtn  0.1715470
##  8      Veenker  0.1285235
##  9      NoRidge  0.1259984
## 10      CollgCr  0.1235094
## # ... with 17 more rows

  4. We are interested in determining how well the model fits the data for each neighborhood. The model from Question 3 does the worst at predicting prices in which of the following neighborhoods?
    1. GrnHill
    2. BlueSte
    3. StoneBr
    4. MeadowV
# type your code for Question 4 here, and Knit
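
The question does not say how "worst" should be measured. One reasonable choice, assumed here, is the root-mean-squared residual within each neighborhood; the mean absolute residual would be another. The helper `rmse_by_group` below is a hypothetical name, written in base R and checked on a few made-up residuals.

```r
# Root-mean-squared residual per group, largest first (base R)
rmse_by_group <- function(resid, group) {
  rmse <- tapply(resid, group, function(r) sqrt(mean(r^2)))
  sort(rmse, decreasing = TRUE)  # sort() drops groups with no observations
}

# Toy check with made-up residuals (hypothetical values)
toy_resid <- c(0.1, -0.1, 0.5, -0.7, 0.0, 0.2)
toy_group <- factor(c("A", "A", "B", "B", "C", "C"))
rmse_by_group(toy_resid, toy_group)
```

Applied to the quiz, the call would be `rmse_by_group(resid(lm3), ames_train$Neighborhood)`.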

  5. Suppose you want to model \(\log\)(price) using only the variables in the dataset that pertain to quality: Overall.Qual, Bsmt.Qual, and Garage.Qual. How many observations must be discarded in order to estimate this model?
    1. 0
    2. 46
    3. 64
    4. 924
# type your code for Question 5 here, and Knit
summary(ames_train)
##       PID                 area          price         MS.SubClass    
##  Min.   :5.263e+08   Min.   : 334   Min.   : 12789   Min.   : 20.00  
##  1st Qu.:5.285e+08   1st Qu.:1092   1st Qu.:129763   1st Qu.: 20.00  
##  Median :5.354e+08   Median :1411   Median :159467   Median : 50.00  
##  Mean   :7.059e+08   Mean   :1477   Mean   :181190   Mean   : 57.15  
##  3rd Qu.:9.071e+08   3rd Qu.:1743   3rd Qu.:213000   3rd Qu.: 70.00  
##  Max.   :1.007e+09   Max.   :4676   Max.   :615000   Max.   :190.00  
##                                                                      
##    MS.Zoning    Lot.Frontage       Lot.Area       Street     Alley    
##  A (agr):  0   Min.   : 21.00   Min.   :  1470   Grvl:  3   Grvl: 33  
##  C (all):  9   1st Qu.: 57.00   1st Qu.:  7314   Pave:997   Pave: 34  
##  FV     : 56   Median : 69.00   Median :  9317              NA's:933  
##  I (all):  1   Mean   : 69.21   Mean   : 10352                        
##  RH     :  7   3rd Qu.: 80.00   3rd Qu.: 11650                        
##  RL     :772   Max.   :313.00   Max.   :215245                        
##  RM     :155   NA's   :167                                            
##  Lot.Shape Land.Contour  Utilities      Lot.Config  Land.Slope
##  IR1:338   Bnk: 33      AllPub:1000   Corner :173   Gtl:962   
##  IR2: 30   HLS: 38      NoSeWa:   0   CulDSac: 76   Mod: 33   
##  IR3:  3   Low: 20      NoSewr:   0   FR2    : 36   Sev:  5   
##  Reg:629   Lvl:909                    FR3    :  5             
##                                       Inside :710             
##                                                               
##                                                               
##   Neighborhood  Condition.1   Condition.2   Bldg.Type    House.Style 
##  NAmes  :155   Norm   :875   Norm   :988   1Fam  :823   1Story :521  
##  CollgCr: 85   Feedr  : 53   Feedr  :  6   2fmCon: 20   2Story :286  
##  Somerst: 74   Artery : 23   Artery :  2   Duplex: 35   1.5Fin : 98  
##  OldTown: 71   RRAn   : 14   PosN   :  2   Twnhs : 38   SLvl   : 41  
##  Sawyer : 61   PosN   : 11   PosA   :  1   TwnhsE: 84   SFoyer : 36  
##  Edwards: 60   RRAe   : 11   RRNn   :  1                2.5Unf : 10  
##  (Other):494   (Other): 13   (Other):  0                (Other):  8  
##   Overall.Qual     Overall.Cond     Year.Built   Year.Remod.Add
##  Min.   : 1.000   Min.   :1.000   Min.   :1872   Min.   :1950  
##  1st Qu.: 5.000   1st Qu.:5.000   1st Qu.:1955   1st Qu.:1966  
##  Median : 6.000   Median :5.000   Median :1975   Median :1992  
##  Mean   : 6.095   Mean   :5.559   Mean   :1972   Mean   :1984  
##  3rd Qu.: 7.000   3rd Qu.:6.000   3rd Qu.:2001   3rd Qu.:2004  
##  Max.   :10.000   Max.   :9.000   Max.   :2010   Max.   :2010  
##                                                                
##    Roof.Style    Roof.Matl    Exterior.1st  Exterior.2nd  Mas.Vnr.Type
##  Flat   :  9   CompShg:984   VinylSd:349   VinylSd:345          :  7  
##  Gable  :775   Tar&Grv: 11   HdBoard:164   HdBoard:150   BrkCmn :  8  
##  Gambrel:  8   WdShake:  2   MetalSd:147   MetalSd:148   BrkFace:317  
##  Hip    :204   WdShngl:  2   Wd Sdng:138   Wd Sdng:130   CBlock :  0  
##  Mansard:  4   Metal  :  1   Plywood: 74   Plywood: 96   None   :593  
##  Shed   :  0   ClyTile:  0   CemntBd: 40   CmentBd: 40   Stone  : 75  
##                (Other):  0   (Other): 88   (Other): 91                
##   Mas.Vnr.Area    Exter.Qual Exter.Cond  Foundation  Bsmt.Qual  Bsmt.Cond 
##  Min.   :   0.0   Ex: 39     Ex:  4     BrkTil:102       :  1       :  1  
##  1st Qu.:   0.0   Fa: 11     Fa: 19     CBlock:430   Ex  : 87   Ex  :  2  
##  Median :   0.0   Gd:337     Gd:116     PConc :453   Fa  : 28   Fa  : 23  
##  Mean   : 104.1   TA:613     Po:  0     Slab  : 12   Gd  :424   Gd  : 44  
##  3rd Qu.: 160.0              TA:861     Stone :  3   Po  :  1   Po  :  1  
##  Max.   :1290.0                         Wood  :  0   TA  :438   TA  :908  
##  NA's   :7                                           NA's: 21   NA's: 21  
##  Bsmt.Exposure BsmtFin.Type.1  BsmtFin.SF.1    BsmtFin.Type.2
##      :  2      GLQ    :294    Min.   :   0.0   Unf    :863   
##  Av  :157      Unf    :279    1st Qu.:   0.0   LwQ    : 31   
##  Gd  : 98      ALQ    :163    Median : 400.0   Rec    : 29   
##  Mn  : 87      Rec    :107    Mean   : 464.1   BLQ    : 24   
##  No  :635      BLQ    : 87    3rd Qu.: 773.0   ALQ    : 20   
##  NA's: 21      (Other): 49    Max.   :2260.0   (Other): 12   
##                NA's   : 21    NA's   :1        NA's   : 21   
##   BsmtFin.SF.2      Bsmt.Unf.SF     Total.Bsmt.SF     Heating   
##  Min.   :   0.00   Min.   :   0.0   Min.   :   0.0   Floor:  0  
##  1st Qu.:   0.00   1st Qu.: 223.5   1st Qu.: 797.5   GasA :988  
##  Median :   0.00   Median : 461.0   Median : 998.0   GasW :  8  
##  Mean   :  48.07   Mean   : 547.0   Mean   :1059.2   Grav :  2  
##  3rd Qu.:   0.00   3rd Qu.: 783.0   3rd Qu.:1301.0   OthW :  1  
##  Max.   :1526.00   Max.   :2336.0   Max.   :3138.0   Wall :  1  
##  NA's   :1         NA's   :1        NA's   :1                   
##  Heating.QC Central.Air Electrical   X1st.Flr.SF      X2nd.Flr.SF    
##  Ex:516     N: 55            :  0   Min.   : 334.0   Min.   :   0.0  
##  Fa: 22     Y:945       FuseA: 54   1st Qu.: 876.2   1st Qu.:   0.0  
##  Gd:157                 FuseF: 12   Median :1080.5   Median :   0.0  
##  Po:  1                 FuseP:  2   Mean   :1157.1   Mean   : 315.2  
##  TA:304                 Mix  :  0   3rd Qu.:1376.2   3rd Qu.: 688.2  
##                         SBrkr:932   Max.   :3138.0   Max.   :1836.0  
##                                                                      
##  Low.Qual.Fin.SF   Bsmt.Full.Bath   Bsmt.Half.Bath      Full.Bath    
##  Min.   :   0.00   Min.   :0.0000   Min.   :0.00000   Min.   :0.000  
##  1st Qu.:   0.00   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:1.000  
##  Median :   0.00   Median :0.0000   Median :0.00000   Median :2.000  
##  Mean   :   4.32   Mean   :0.4474   Mean   :0.06106   Mean   :1.541  
##  3rd Qu.:   0.00   3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:2.000  
##  Max.   :1064.00   Max.   :3.0000   Max.   :2.00000   Max.   :4.000  
##                    NA's   :1        NA's   :1                        
##    Half.Bath     Bedroom.AbvGr   Kitchen.AbvGr   Kitchen.Qual
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Ex: 67      
##  1st Qu.:0.000   1st Qu.:2.000   1st Qu.:1.000   Fa: 20      
##  Median :0.000   Median :3.000   Median :1.000   Gd:403      
##  Mean   :0.378   Mean   :2.806   Mean   :1.039   Po:  1      
##  3rd Qu.:1.000   3rd Qu.:3.000   3rd Qu.:1.000   TA:509      
##  Max.   :2.000   Max.   :6.000   Max.   :2.000               
##                                                              
##  TotRms.AbvGrd     Functional    Fireplaces    Fireplace.Qu  Garage.Type 
##  Min.   : 2.00   Typ    :935   Min.   :0.000   Ex  : 16     2Types : 10  
##  1st Qu.: 5.00   Min2   : 24   1st Qu.:0.000   Fa  : 24     Attchd :610  
##  Median : 6.00   Min1   : 18   Median :1.000   Gd  :232     Basment: 11  
##  Mean   : 6.34   Mod    : 16   Mean   :0.597   Po  : 18     BuiltIn: 56  
##  3rd Qu.: 7.00   Maj1   :  4   3rd Qu.:1.000   TA  :219     CarPort:  1  
##  Max.   :13.00   Maj2   :  2   Max.   :4.000   NA's:491     Detchd :266  
##                  (Other):  1                                NA's   : 46  
##  Garage.Yr.Blt  Garage.Finish  Garage.Cars     Garage.Area     Garage.Qual
##  Min.   :1900       :  2      Min.   :0.000   Min.   :   0.0       :  1   
##  1st Qu.:1961   Fin :247      1st Qu.:1.000   1st Qu.: 312.0   Ex  :  1   
##  Median :1979   RFn :278      Median :2.000   Median : 480.0   Fa  : 37   
##  Mean   :1978   Unf :427      Mean   :1.767   Mean   : 475.4   Gd  :  7   
##  3rd Qu.:2002   NA's: 46      3rd Qu.:2.000   3rd Qu.: 576.0   Po  :  3   
##  Max.   :2010                 Max.   :5.000   Max.   :1390.0   TA  :904   
##  NA's   :48                   NA's   :1       NA's   :1        NA's: 47   
##  Garage.Cond Paved.Drive  Wood.Deck.SF    Open.Porch.SF   
##      :  1    N: 67       Min.   :  0.00   Min.   :  0.00  
##  Ex  :  1    P: 29       1st Qu.:  0.00   1st Qu.:  0.00  
##  Fa  : 21    Y:904       Median :  0.00   Median : 28.00  
##  Gd  :  6                Mean   : 93.84   Mean   : 48.93  
##  Po  :  6                3rd Qu.:168.00   3rd Qu.: 74.00  
##  TA  :918                Max.   :857.00   Max.   :742.00  
##  NA's: 47                                                 
##  Enclosed.Porch    X3Ssn.Porch       Screen.Porch      Pool.Area      
##  Min.   :  0.00   Min.   :  0.000   Min.   :  0.00   Min.   :  0.000  
##  1st Qu.:  0.00   1st Qu.:  0.000   1st Qu.:  0.00   1st Qu.:  0.000  
##  Median :  0.00   Median :  0.000   Median :  0.00   Median :  0.000  
##  Mean   : 23.48   Mean   :  3.118   Mean   : 14.77   Mean   :  1.463  
##  3rd Qu.:  0.00   3rd Qu.:  0.000   3rd Qu.:  0.00   3rd Qu.:  0.000  
##  Max.   :432.00   Max.   :508.000   Max.   :440.00   Max.   :800.000  
##                                                                       
##  Pool.QC      Fence     Misc.Feature    Misc.Val           Mo.Sold      
##  Ex  :  1   GdPrv: 43   Elev:  0     Min.   :    0.00   Min.   : 1.000  
##  Fa  :  1   GdWo : 37   Gar2:  2     1st Qu.:    0.00   1st Qu.: 4.000  
##  Gd  :  1   MnPrv:120   Othr:  1     Median :    0.00   Median : 6.000  
##  TA  :  0   MnWw :  2   Shed: 25     Mean   :   45.81   Mean   : 6.243  
##  NA's:997   NA's :798   TenC:  1     3rd Qu.:    0.00   3rd Qu.: 8.000  
##                         NA's:971     Max.   :15500.00   Max.   :12.000  
##                                                                         
##     Yr.Sold       Sale.Type   Sale.Condition     resid3        
##  Min.   :2006   WD     :863   Abnorml: 61    Min.   :-2.08526  
##  1st Qu.:2007   New    : 79   AdjLand:  2    1st Qu.:-0.14145  
##  Median :2008   COD    : 27   Alloca :  4    Median : 0.02048  
##  Mean   :2008   ConLD  :  7   Family : 17    Mean   : 0.00000  
##  3rd Qu.:2009   ConLw  :  6   Normal :834    3rd Qu.: 0.17002  
##  Max.   :2010   Con    :  5   Partial: 82    Max.   : 0.84536  
##                 (Other): 13
str(ames_train)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1000 obs. of  82 variables:
##  $ PID            : int  909176150 905476230 911128020 535377150 534177230 908128060 902135020 528228540 923426010 908186050 ...
##  $ area           : int  856 1049 1001 1039 1665 1922 936 1246 889 1072 ...
##  $ price          : int  126000 139500 124900 114000 227000 198500 93000 187687 137500 140000 ...
##  $ MS.SubClass    : int  30 120 30 70 60 85 20 20 20 180 ...
##  $ MS.Zoning      : Factor w/ 7 levels "A (agr)","C (all)",..: 6 6 2 6 6 6 7 6 6 7 ...
##  $ Lot.Frontage   : int  NA 42 60 80 70 64 60 53 74 35 ...
##  $ Lot.Area       : int  7890 4235 6060 8146 8400 7301 6000 3710 12395 3675 ...
##  $ Street         : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Alley          : Factor w/ 2 levels "Grvl","Pave": NA NA NA NA NA NA 2 NA NA NA ...
##  $ Lot.Shape      : Factor w/ 4 levels "IR1","IR2","IR3",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Land.Contour   : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 1 4 4 4 ...
##  $ Utilities      : Factor w/ 3 levels "AllPub","NoSeWa",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Lot.Config     : Factor w/ 5 levels "Corner","CulDSac",..: 1 5 5 1 5 1 5 5 1 5 ...
##  $ Land.Slope     : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 2 1 1 1 ...
##  $ Neighborhood   : Factor w/ 28 levels "Blmngtn","Blueste",..: 26 8 12 21 20 8 21 1 15 8 ...
##  $ Condition.1    : Factor w/ 9 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Condition.2    : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Bldg.Type      : Factor w/ 5 levels "1Fam","2fmCon",..: 1 5 1 1 1 1 2 1 1 5 ...
##  $ House.Style    : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 3 3 3 6 6 7 3 3 3 7 ...
##  $ Overall.Qual   : int  6 5 5 4 8 7 4 7 5 6 ...
##  $ Overall.Cond   : int  6 5 9 8 6 5 4 5 6 5 ...
##  $ Year.Built     : int  1939 1984 1930 1900 2001 2003 1953 2007 1984 2005 ...
##  $ Year.Remod.Add : int  1950 1984 2007 2003 2001 2003 1953 2008 1984 2005 ...
##  $ Roof.Style     : Factor w/ 6 levels "Flat","Gable",..: 2 2 4 2 2 2 2 2 2 2 ...
##  $ Roof.Matl      : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Exterior.1st   : Factor w/ 16 levels "AsbShng","AsphShn",..: 15 7 9 9 14 7 9 16 7 14 ...
##  $ Exterior.2nd   : Factor w/ 17 levels "AsbShng","AsphShn",..: 16 7 9 9 15 7 9 17 11 15 ...
##  $ Mas.Vnr.Type   : Factor w/ 6 levels "","BrkCmn","BrkFace",..: 5 3 5 5 5 3 5 3 5 6 ...
##  $ Mas.Vnr.Area   : int  0 149 0 0 0 500 0 20 0 76 ...
##  $ Exter.Qual     : Factor w/ 4 levels "Ex","Fa","Gd",..: 4 3 3 3 3 3 2 3 4 4 ...
##  $ Exter.Cond     : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 3 5 5 5 5 5 5 ...
##  $ Foundation     : Factor w/ 6 levels "BrkTil","CBlock",..: 2 2 1 1 3 4 2 3 2 3 ...
##  $ Bsmt.Qual      : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 4 6 3 4 NA 3 4 6 4 ...
##  $ Bsmt.Cond      : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 6 6 6 NA 6 6 6 6 ...
##  $ Bsmt.Exposure  : Factor w/ 5 levels "","Av","Gd","Mn",..: 5 4 5 5 5 NA 5 3 5 3 ...
##  $ BsmtFin.Type.1 : Factor w/ 7 levels "","ALQ","BLQ",..: 6 4 2 7 4 NA 7 7 2 4 ...
##  $ BsmtFin.SF.1   : int  238 552 737 0 643 0 0 0 647 467 ...
##  $ BsmtFin.Type.2 : Factor w/ 7 levels "","ALQ","BLQ",..: 7 2 7 7 7 NA 7 7 7 7 ...
##  $ BsmtFin.SF.2   : int  0 393 0 0 0 0 0 0 0 0 ...
##  $ Bsmt.Unf.SF    : int  618 104 100 405 167 0 936 1146 217 80 ...
##  $ Total.Bsmt.SF  : int  856 1049 837 405 810 0 936 1146 864 547 ...
##  $ Heating        : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Heating.QC     : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 1 3 1 1 5 1 5 1 ...
##  $ Central.Air    : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 1 2 2 2 ...
##  $ Electrical     : Factor w/ 6 levels "","FuseA","FuseF",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ X1st.Flr.SF    : int  856 1049 1001 717 810 495 936 1246 889 1072 ...
##  $ X2nd.Flr.SF    : int  0 0 0 322 855 1427 0 0 0 0 ...
##  $ Low.Qual.Fin.SF: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Bsmt.Full.Bath : int  1 1 0 0 1 0 0 0 0 1 ...
##  $ Bsmt.Half.Bath : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Full.Bath      : int  1 2 1 1 2 3 1 2 1 1 ...
##  $ Half.Bath      : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ Bedroom.AbvGr  : int  2 2 2 2 3 4 2 2 3 2 ...
##  $ Kitchen.AbvGr  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Kitchen.Qual   : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 3 3 5 3 3 5 3 5 3 ...
##  $ TotRms.AbvGrd  : int  4 5 5 6 6 7 4 5 6 5 ...
##  $ Functional     : Factor w/ 8 levels "Maj1","Maj2",..: 8 8 8 8 8 8 4 8 8 8 ...
##  $ Fireplaces     : int  1 0 0 0 0 1 0 1 0 0 ...
##  $ Fireplace.Qu   : Factor w/ 5 levels "Ex","Fa","Gd",..: 3 NA NA NA NA 1 NA 3 NA NA ...
##  $ Garage.Type    : Factor w/ 6 levels "2Types","Attchd",..: 6 2 6 6 2 4 6 2 2 3 ...
##  $ Garage.Yr.Blt  : int  1939 1984 1930 1940 2001 2003 1974 2007 1984 2005 ...
##  $ Garage.Finish  : Factor w/ 4 levels "","Fin","RFn",..: 4 2 4 4 2 3 4 2 4 2 ...
##  $ Garage.Cars    : int  2 1 1 1 2 2 2 2 2 2 ...
##  $ Garage.Area    : int  399 266 216 281 528 672 576 428 484 525 ...
##  $ Garage.Qual    : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ Garage.Cond    : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 5 6 6 6 6 6 6 6 ...
##  $ Paved.Drive    : Factor w/ 3 levels "N","P","Y": 3 3 1 1 3 3 3 3 3 3 ...
##  $ Wood.Deck.SF   : int  0 0 154 0 0 0 0 100 0 0 ...
##  $ Open.Porch.SF  : int  0 105 0 0 45 0 32 24 0 44 ...
##  $ Enclosed.Porch : int  0 0 42 168 0 177 112 0 0 0 ...
##  $ X3Ssn.Porch    : int  0 0 86 0 0 0 0 0 0 0 ...
##  $ Screen.Porch   : int  166 0 0 111 0 0 0 0 0 0 ...
##  $ Pool.Area      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Pool.QC        : Factor w/ 4 levels "Ex","Fa","Gd",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ Fence          : Factor w/ 4 levels "GdPrv","GdWo",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ Misc.Feature   : Factor w/ 5 levels "Elev","Gar2",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ Misc.Val       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Mo.Sold        : int  3 2 11 5 11 7 2 3 4 5 ...
##  $ Yr.Sold        : int  2010 2009 2007 2009 2009 2009 2009 2008 2008 2007 ...
##  $ Sale.Type      : Factor w/ 10 levels "COD","Con","ConLD",..: 10 10 10 10 10 3 10 7 10 10 ...
##  $ Sale.Condition : Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 5 5 5 5 6 5 5 ...
##  $ resid3         : num  0.1762 0.0906 0.0232 -0.1024 0.1517 ...
df2 <- ames_train[,c("price","Garage.Qual", "Bsmt.Qual","Overall.Qual")]
summary(df2)
##      price        Garage.Qual Bsmt.Qual   Overall.Qual   
##  Min.   : 12789       :  1        :  1   Min.   : 1.000  
##  1st Qu.:129763   Ex  :  1    Ex  : 87   1st Qu.: 5.000  
##  Median :159467   Fa  : 37    Fa  : 28   Median : 6.000  
##  Mean   :181190   Gd  :  7    Gd  :424   Mean   : 6.095  
##  3rd Qu.:213000   Po  :  3    Po  :  1   3rd Qu.: 7.000  
##  Max.   :615000   TA  :904    TA  :438   Max.   :10.000  
##                   NA's: 47    NA's: 21

  6. NA values for Bsmt.Qual and Garage.Qual correspond to houses that do not have a basement or a garage, respectively. Which of the following is the best way to deal with these NA values when fitting the linear model with these variables?
    1. Drop all observations with NA values for Bsmt.Qual or Garage.Qual since the model cannot be estimated otherwise.
    2. Recode all NA values as the category TA since we must assume these basements or garages are typical in the absence of all other information.
    3. Recode all NA values as a separate category, since houses without basements or garages are fundamentally different from houses with both basements and garages.
# type your code for Question 6 here, and Knit
nrow(subset(ames_train, is.na(Garage.Qual) | is.na(Bsmt.Qual)))
## [1] 64
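
If NA values are to be treated as their own category, base R's `addNA()` can turn them into an explicit factor level so that `lm()` no longer drops those rows. The sketch below is a minimal illustration on a toy factor; the helper name `na_as_level` and the label `"None"` are arbitrary choices, not part of the quiz.

```r
# Turn NA into an explicit factor level
na_as_level <- function(f, label = "None") {
  f <- addNA(f)                          # append NA as a level
  levels(f)[is.na(levels(f))] <- label   # give that level a name
  f
}

# Toy factor (hypothetical values)
toy <- factor(c("TA", NA, "Gd", "TA", NA))
table(na_as_level(toy))
```

On the quiz data this would be applied to each column in turn, for example `ames_train$Garage.Qual <- na_as_level(ames_train$Garage.Qual)`.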

  7. Run a simple linear model with \(\log\)(price) regressed on Overall.Cond and Overall.Qual. Which of the following subclasses of dwellings (MS.SubClass) has the highest median predicted prices?
    1. 075: 2-1/2 story houses
    2. 060: 2 story, 1946 and Newer
    3. 120: 1 story planned unit development
    4. 090: Duplexes
# type your code for Question 7 here, and Knit
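# Fit the model and back-transform predictions to the dollar scale
lm7 = lm(log(price) ~ Overall.Cond + Overall.Qual, data = ames_train)
pred7 = predict(lm7, newdata = ames_train)
ames_train$pred7 = exp(pred7)
# Note: the column is named median_price but is computed with mean()
ames_train %>% group_by(MS.SubClass) %>% summarise(median_price=mean(pred7)) %>% arrange(desc(median_price))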
## # A tibble: 15 x 2
##    MS.SubClass median_price
##          <int>        <dbl>
##  1         120     223426.4
##  2          60     216394.6
##  3          75     212014.3
##  4          20     176183.6
##  5          70     170761.4
##  6         160     162887.1
##  7          45     162778.7
##  8          80     159430.8
##  9          50     140960.9
## 10          90     139547.1
## 11          85     136612.1
## 12          40     127257.5
## 13         190     126160.3
## 14         180     125505.2
## 15          30     117523.0
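
The question asks for the median predicted price, whereas the chunk above summarises with `mean()`. The two can rank groups differently, as the made-up data below shows. This is a base R sketch; `median_by_group` is a hypothetical helper name, and it has not been run on the Ames data here.

```r
# Median of a numeric vector within each group, largest first (base R)
median_by_group <- function(value, group) {
  sort(tapply(value, group, median), decreasing = TRUE)
}

# Toy data (hypothetical): "b" has the higher mean, "a" the higher median
value <- c(10, 10, 10, 1, 1, 100)
group <- c("a", "a", "a", "b", "b", "b")
median_by_group(value, group)
tapply(value, group, mean)
```

For the quiz, the equivalent call would be `median_by_group(ames_train$pred7, ames_train$MS.SubClass)`, or `summarise(median_price = median(pred7))` in the dplyr pipeline.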

  8. Using the model from Question 7, which observation has the highest leverage or influence on the regression model? Hint: use hatvalues, hat or lm.influence.
    1. 125
    2. 268
    3. 640
    4. 832
# type your code for Question 8 here, and Knit
which.max(lm.influence(lm7)$hat)
## 268 
## 268
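
Leverage and influence are related but distinct: hat values measure how unusual an observation's predictor values are, while Cook's distance also accounts for the size of its residual. The sketch below, on simulated data with one deliberately extreme `x` value, shows that `hatvalues()` and `lm.influence()$hat` return the same quantity. The data are made up for illustration.

```r
set.seed(7)
x <- c(rnorm(30), 8)            # last point sits far from the rest in x
y <- 1 + 0.5 * x + rnorm(31)
fit <- lm(y ~ x)

h <- hatvalues(fit)                                     # leverage
all.equal(unname(h), unname(lm.influence(fit)$hat))     # same quantity
which.max(h)                      # index of the highest-leverage point
which.max(cooks.distance(fit))    # most influential point (may differ)
```

Hat values always sum to the number of estimated coefficients, which gives a quick sanity check: here `sum(h)` is 2.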

  9. Which of the following corresponds to a correct interpretation of the coefficient \(k\) of Bedroom.AbvGr, where \(\log\)(price) is the dependent variable?
    1. Holding constant all other variables in the dataset, on average, an additional bedroom will increase housing price by \(k\) percent.
    2. Holding constant all other variables in the model, on average, an additional bedroom will increase housing price by \(k\) percent.
    3. Holding constant all other variables in the dataset, on average, an additional bedroom will increase housing price by \(k\) dollars.
    4. Holding constant all other variables in the model, on average, an additional bedroom will increase housing price by \(k\) dollars.
# type your code for Question 9 here, and Knit
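
When the response is \(\log\)(price), a one-unit increase in a predictor multiplies the expected price by \(e^k\), which is roughly a \(100k\) percent change when \(k\) is small. The short sketch below compares the approximation with the exact value for a few illustrative coefficients; the values of `k` are arbitrary, not estimates from the Ames data.

```r
# Exact percent change in price for a one-unit increase in a predictor
# when the response is log(price) and the coefficient is k
pct_change <- function(k) 100 * (exp(k) - 1)

k <- c(0.01, 0.05, 0.10, 0.50)
data.frame(k = k,
           approx_pct = 100 * k,         # small-k approximation
           exact_pct  = pct_change(k))
```

The gap between the two columns grows with `k`, so the approximation is best reserved for coefficients close to zero.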

In a linear model, we assume that all observations in the data are generated from the same process. You are concerned that houses sold in abnormal sale conditions may not exhibit the same behavior as houses sold in normal sale conditions. To visualize this, you make the following plot of 1st and 2nd floor square footage versus log(price):

n.Sale.Condition = length(levels(ames_train$Sale.Condition))
par(mar=c(5,4,4,10))
plot(log(price) ~ I(X1st.Flr.SF+X2nd.Flr.SF), 
     data=ames_train, col=Sale.Condition,
     pch=as.numeric(Sale.Condition)+15, main="Training Data")
legend("right", legend=levels(ames_train$Sale.Condition),
       col=1:n.Sale.Condition, pch=15+(1:n.Sale.Condition),
       bty="n", xpd=TRUE, inset=c(-.5,0))

  10. Which of the following sale condition categories shows significant differences from the normal selling condition?
    1. Family
    2. Abnorml
    3. Partial
    4. Abnorml and Partial
# type your code for Question 10 here, and Knit
lm10 = lm(log(price) ~ Sale.Condition, data = ames_train)
summary(lm10)
## 
## Call:
## lm(formula = log(price) ~ Sale.Condition, data = ames_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.28589 -0.23436 -0.02709  0.24463  1.33354 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           11.74223    0.05016 234.091  < 2e-16 ***
## Sale.ConditionAdjLand  0.09490    0.28153   0.337    0.736    
## Sale.ConditionAlloca   0.21674    0.20221   1.072    0.284    
## Sale.ConditionFamily   0.11734    0.10745   1.092    0.275    
## Sale.ConditionNormal   0.25361    0.05196   4.881 1.23e-06 ***
## Sale.ConditionPartial  0.75216    0.06624  11.355  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3918 on 994 degrees of freedom
## Multiple R-squared:  0.1367, Adjusted R-squared:  0.1324 
## F-statistic: 31.49 on 5 and 994 DF,  p-value: < 2.2e-16
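
In the summary above, the baseline is Abnorml (the first factor level), so each coefficient is a contrast with abnormal sales rather than with normal ones. To read off differences from the normal condition directly, the factor can be releveled first. The sketch below shows the idea on simulated data; the data frame `toy` and its group means are made up.

```r
# Toy data (hypothetical): three sale conditions with different mean log prices
set.seed(3)
cond <- factor(rep(c("Abnorml", "Normal", "Partial"), each = 50))
logp <- c(rnorm(50, 11.7, 0.3), rnorm(50, 12.0, 0.3), rnorm(50, 12.5, 0.3))
toy  <- data.frame(logp = logp, cond = cond)

# Make "Normal" the reference level so each coefficient is a contrast with it
toy$cond <- relevel(toy$cond, ref = "Normal")
fit <- lm(logp ~ cond, data = toy)
coef(summary(fit))
```

On the quiz data the equivalent would be `lm(log(price) ~ relevel(Sale.Condition, ref = "Normal"), data = ames_train)`.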

Because houses with non-normal selling conditions exhibit atypical behavior and can disproportionately influence the model, you decide to model housing prices under normal sale conditions only.

  11. Subset ames_train to only include houses sold under normal sale conditions. What percent of the original observations remain?
    1. 81.2%
    2. 83.4%
    3. 87.7%
    4. 91.8%
# type your code for Question 11 here, and Knit
table(ames_train$Sale.Condition)
## 
## Abnorml AdjLand  Alloca  Family  Normal Partial 
##      61       2       4      17     834      82
normals = subset(ames_train, Sale.Condition == 'Normal')
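
The share of observations that remain follows directly from the counts in the table above. The snippet below reproduces that arithmetic with the counts typed in by hand so that it runs without the data file.

```r
# Counts copied from the table() output above
counts <- c(Abnorml = 61, AdjLand = 2, Alloca = 4,
            Family = 17, Normal = 834, Partial = 82)

# Percent of all sales made under normal conditions: 83.4
round(100 * counts["Normal"] / sum(counts), 1)
```

With the data loaded, `100 * nrow(normals) / nrow(ames_train)` gives the same figure.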

  12. Now re-run the simple model from Question 3 on the subsetted data. True or False: Modeling only the normal sales results in a better model fit than modeling all sales (in terms of \(R^2\)).
    1. True, restricting the model to only include observations with normal sale conditions increases the \(R^2\) from 0.547 to 0.575.
    2. True, restricting the model to only include observations with normal sale conditions increases the \(R^2\) from 0.575 to 0.603.
    3. False, restricting the model to only include observations with normal sale conditions decreases the \(R^2\) from 0.575 to 0.547.
    4. False, restricting the model to only include observations with normal sale conditions decreases the \(R^2\) from 0.603 to 0.575.
# type your code for Question 12 here, and Knit
lm12 = lm(log(price) ~ log(area), data = normals)
summary(lm3)
## 
## Call:
## lm(formula = log(price) ~ log(area), data = ames_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.08526 -0.14145  0.02048  0.17002  0.84536 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.34441    0.19262   27.75   <2e-16 ***
## log(area)    0.92167    0.02657   34.69   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2834 on 998 degrees of freedom
## Multiple R-squared:  0.5466, Adjusted R-squared:  0.5461 
## F-statistic:  1203 on 1 and 998 DF,  p-value: < 2.2e-16
summary(lm12)
## 
## Call:
## lm(formula = log(price) ~ log(area), data = normals)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.36369 -0.12269  0.02005  0.14587  0.82373 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.73138    0.18711   30.63   <2e-16 ***
## log(area)    0.86716    0.02587   33.52   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2493 on 832 degrees of freedom
## Multiple R-squared:  0.5745, Adjusted R-squared:  0.574 
## F-statistic:  1123 on 1 and 832 DF,  p-value: < 2.2e-16
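
Rather than reading \(R^2\) off two printed summaries, the values can be pulled out and placed side by side with `summary(fit)$r.squared`. The sketch below does this for a full-sample fit and a subset fit on simulated data; the data frame `toy` and the logical column `keep` are made up for illustration.

```r
# Toy data (hypothetical): y depends on x; keep flags half the rows
set.seed(11)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
toy <- data.frame(x = x, y = y, keep = rep(c(TRUE, FALSE), 50))

fits <- list(
  all    = lm(y ~ x, data = toy),
  subset = lm(y ~ x, data = toy, subset = keep)
)

# R-squared of each fit, side by side
sapply(fits, function(f) summary(f)$r.squared)
```

For the two models above, the call would be `sapply(list(all = lm3, normal = lm12), function(f) summary(f)$r.squared)`.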