Complete all Exercises, and submit answers to Questions on the Coursera platform.
This second quiz will deal with model assumptions, selection, and interpretation. The concepts tested here will prove useful for the final peer assessment, which is much more open-ended.
First, let us load the data:
load("ames_train.Rdata")
# Install the tool to download packages from GitHub
library(devtools)
install_github("StatsWithR/statsr")
## Skipping install of 'statsr' from a github remote, the SHA1 (6f64cf27) has not changed since last install.
## Use `force = TRUE` to force installation
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(statsr)
Suppose you regress \(\log\)(price) on \(\log\)(area), \(\log\)(Lot.Area), Bedroom.AbvGr, Overall.Qual, and Land.Slope. Which of the following variables are included with stepwise variable selection using AIC but not BIC? Select all that apply.
\(\log\)(area)
\(\log\)(Lot.Area)
Bedroom.AbvGr
Overall.Qual
Land.Slope
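As a minimal sketch of how the two criteria can be compared, base R's `step()` performs stepwise selection with penalty `k`: `k = 2` corresponds to AIC and `k = log(n)` to BIC. The data below are simulated stand-ins (the names `sim`, `x1`, `x2`, `x3`, and `y` are hypothetical); for the quiz, the full model would be built from `ames_train` instead.

```r
# Simulated stand-in data: x1 has a strong effect, x2 a weak one, x3 none.
set.seed(1)
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- 1 + 2 * sim$x1 + 0.15 * sim$x2 + rnorm(n)

# Full model; for the quiz this would be something like
# lm(log(price) ~ log(area) + log(Lot.Area) + Bedroom.AbvGr +
#      Overall.Qual + Land.Slope, data = ames_train)
full <- lm(y ~ x1 + x2 + x3, data = sim)

# Backward stepwise selection: k = 2 is the AIC penalty, k = log(n) the BIC penalty
fit_aic <- step(full, direction = "backward", k = 2, trace = 0)
fit_bic <- step(full, direction = "backward", k = log(n), trace = 0)

aic_terms <- attr(terms(fit_aic), "term.labels")
bic_terms <- attr(terms(fit_bic), "term.labels")

# Terms kept under AIC but dropped under BIC
setdiff(aic_terms, bic_terms)
```

Because BIC penalizes each extra parameter more heavily than AIC once `n` exceeds about 7, it tends to keep fewer terms.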
# type your code for Question 1 here, and Knit
When regressing \(\log\)(price) on Bedroom.AbvGr, the coefficient for Bedroom.AbvGr is strongly positive. However, once \(\log\)(area) is added to the model, the coefficient for Bedroom.AbvGr becomes strongly negative. Which of the following best explains this phenomenon?
Bedroom.AbvGr
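A sign change of this kind can be reproduced on simulated data. The sketch below is not the Ames data: all variable names and coefficients are made up, chosen so that bedrooms track area closely while having a negative effect once area is held fixed.

```r
# Simulated illustration: bedrooms are strongly correlated with log(area),
# but lower the price slightly when area is held constant.
set.seed(42)
n <- 500
log_area <- rnorm(n, mean = 7.2, sd = 0.3)
bedrooms <- round(3 + 4 * (log_area - 7.2) + rnorm(n, sd = 0.5))
log_price <- 5 + 1 * log_area - 0.1 * bedrooms + rnorm(n, sd = 0.1)

# Without log_area, bedrooms acts as a proxy for size: positive slope
b_alone <- coef(lm(log_price ~ bedrooms))[["bedrooms"]]

# With log_area in the model, the slope is the effect at fixed size: negative
b_adjusted <- coef(lm(log_price ~ bedrooms + log_area))[["bedrooms"]]

c(alone = b_alone, adjusted = b_adjusted)
```

The first slope mixes the effect of size with the effect of bedrooms; the second compares houses of the same area that differ only in how that area is divided.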
# type your code for Question 2 here, and Knit
Fit a linear model for \(\log\)(price), with \(\log\)(area) as the independent variable. Which of the following neighborhoods has the highest average residual?
OldTown
StoneBr
GrnHill
IDOTRR
# type your code for Question 3 here, and Knit
# Fit the log-log model and store its residuals on the data frame
lm3 = lm(log(price) ~ log(area), data = ames_train)
ames_train$resid3 = resid(lm3)
# Average residual per neighborhood, largest first
ames_train %>% group_by(Neighborhood) %>% summarise(mean_resid = mean(resid3)) %>% arrange(desc(mean_resid))
## # A tibble: 27 x 2
## Neighborhood mean_resid
## <fctr> <dbl>
## 1 GrnHill 0.5086290
## 2 StoneBr 0.3785578
## 3 Greens 0.3784994
## 4 NridgHt 0.3370307
## 5 Timber 0.2667409
## 6 Somerst 0.2049053
## 7 Blmngtn 0.1715470
## 8 Veenker 0.1285235
## 9 NoRidge 0.1259984
## 10 CollgCr 0.1235094
## # ... with 17 more rows
GrnHill
BlueSte
StoneBr
MeadowV
# type your code for Question 4 here, and Knit
Suppose you model \(\log\)(price) using only the variables in the dataset that pertain to quality: Overall.Qual, Basement.Qual, and Garage.Qual. How many observations must be discarded in order to estimate this model?
# type your code for Question 5 here, and Knit
summary(ames_train)
## PID area price MS.SubClass
## Min. :5.263e+08 Min. : 334 Min. : 12789 Min. : 20.00
## 1st Qu.:5.285e+08 1st Qu.:1092 1st Qu.:129763 1st Qu.: 20.00
## Median :5.354e+08 Median :1411 Median :159467 Median : 50.00
## Mean :7.059e+08 Mean :1477 Mean :181190 Mean : 57.15
## 3rd Qu.:9.071e+08 3rd Qu.:1743 3rd Qu.:213000 3rd Qu.: 70.00
## Max. :1.007e+09 Max. :4676 Max. :615000 Max. :190.00
##
## MS.Zoning Lot.Frontage Lot.Area Street Alley
## A (agr): 0 Min. : 21.00 Min. : 1470 Grvl: 3 Grvl: 33
## C (all): 9 1st Qu.: 57.00 1st Qu.: 7314 Pave:997 Pave: 34
## FV : 56 Median : 69.00 Median : 9317 NA's:933
## I (all): 1 Mean : 69.21 Mean : 10352
## RH : 7 3rd Qu.: 80.00 3rd Qu.: 11650
## RL :772 Max. :313.00 Max. :215245
## RM :155 NA's :167
## Lot.Shape Land.Contour Utilities Lot.Config Land.Slope
## IR1:338 Bnk: 33 AllPub:1000 Corner :173 Gtl:962
## IR2: 30 HLS: 38 NoSeWa: 0 CulDSac: 76 Mod: 33
## IR3: 3 Low: 20 NoSewr: 0 FR2 : 36 Sev: 5
## Reg:629 Lvl:909 FR3 : 5
## Inside :710
##
##
## Neighborhood Condition.1 Condition.2 Bldg.Type House.Style
## NAmes :155 Norm :875 Norm :988 1Fam :823 1Story :521
## CollgCr: 85 Feedr : 53 Feedr : 6 2fmCon: 20 2Story :286
## Somerst: 74 Artery : 23 Artery : 2 Duplex: 35 1.5Fin : 98
## OldTown: 71 RRAn : 14 PosN : 2 Twnhs : 38 SLvl : 41
## Sawyer : 61 PosN : 11 PosA : 1 TwnhsE: 84 SFoyer : 36
## Edwards: 60 RRAe : 11 RRNn : 1 2.5Unf : 10
## (Other):494 (Other): 13 (Other): 0 (Other): 8
## Overall.Qual Overall.Cond Year.Built Year.Remod.Add
## Min. : 1.000 Min. :1.000 Min. :1872 Min. :1950
## 1st Qu.: 5.000 1st Qu.:5.000 1st Qu.:1955 1st Qu.:1966
## Median : 6.000 Median :5.000 Median :1975 Median :1992
## Mean : 6.095 Mean :5.559 Mean :1972 Mean :1984
## 3rd Qu.: 7.000 3rd Qu.:6.000 3rd Qu.:2001 3rd Qu.:2004
## Max. :10.000 Max. :9.000 Max. :2010 Max. :2010
##
## Roof.Style Roof.Matl Exterior.1st Exterior.2nd Mas.Vnr.Type
## Flat : 9 CompShg:984 VinylSd:349 VinylSd:345 : 7
## Gable :775 Tar&Grv: 11 HdBoard:164 HdBoard:150 BrkCmn : 8
## Gambrel: 8 WdShake: 2 MetalSd:147 MetalSd:148 BrkFace:317
## Hip :204 WdShngl: 2 Wd Sdng:138 Wd Sdng:130 CBlock : 0
## Mansard: 4 Metal : 1 Plywood: 74 Plywood: 96 None :593
## Shed : 0 ClyTile: 0 CemntBd: 40 CmentBd: 40 Stone : 75
## (Other): 0 (Other): 88 (Other): 91
## Mas.Vnr.Area Exter.Qual Exter.Cond Foundation Bsmt.Qual Bsmt.Cond
## Min. : 0.0 Ex: 39 Ex: 4 BrkTil:102 : 1 : 1
## 1st Qu.: 0.0 Fa: 11 Fa: 19 CBlock:430 Ex : 87 Ex : 2
## Median : 0.0 Gd:337 Gd:116 PConc :453 Fa : 28 Fa : 23
## Mean : 104.1 TA:613 Po: 0 Slab : 12 Gd :424 Gd : 44
## 3rd Qu.: 160.0 TA:861 Stone : 3 Po : 1 Po : 1
## Max. :1290.0 Wood : 0 TA :438 TA :908
## NA's :7 NA's: 21 NA's: 21
## Bsmt.Exposure BsmtFin.Type.1 BsmtFin.SF.1 BsmtFin.Type.2
## : 2 GLQ :294 Min. : 0.0 Unf :863
## Av :157 Unf :279 1st Qu.: 0.0 LwQ : 31
## Gd : 98 ALQ :163 Median : 400.0 Rec : 29
## Mn : 87 Rec :107 Mean : 464.1 BLQ : 24
## No :635 BLQ : 87 3rd Qu.: 773.0 ALQ : 20
## NA's: 21 (Other): 49 Max. :2260.0 (Other): 12
## NA's : 21 NA's :1 NA's : 21
## BsmtFin.SF.2 Bsmt.Unf.SF Total.Bsmt.SF Heating
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Floor: 0
## 1st Qu.: 0.00 1st Qu.: 223.5 1st Qu.: 797.5 GasA :988
## Median : 0.00 Median : 461.0 Median : 998.0 GasW : 8
## Mean : 48.07 Mean : 547.0 Mean :1059.2 Grav : 2
## 3rd Qu.: 0.00 3rd Qu.: 783.0 3rd Qu.:1301.0 OthW : 1
## Max. :1526.00 Max. :2336.0 Max. :3138.0 Wall : 1
## NA's :1 NA's :1 NA's :1
## Heating.QC Central.Air Electrical X1st.Flr.SF X2nd.Flr.SF
## Ex:516 N: 55 : 0 Min. : 334.0 Min. : 0.0
## Fa: 22 Y:945 FuseA: 54 1st Qu.: 876.2 1st Qu.: 0.0
## Gd:157 FuseF: 12 Median :1080.5 Median : 0.0
## Po: 1 FuseP: 2 Mean :1157.1 Mean : 315.2
## TA:304 Mix : 0 3rd Qu.:1376.2 3rd Qu.: 688.2
## SBrkr:932 Max. :3138.0 Max. :1836.0
##
## Low.Qual.Fin.SF Bsmt.Full.Bath Bsmt.Half.Bath Full.Bath
## Min. : 0.00 Min. :0.0000 Min. :0.00000 Min. :0.000
## 1st Qu.: 0.00 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:1.000
## Median : 0.00 Median :0.0000 Median :0.00000 Median :2.000
## Mean : 4.32 Mean :0.4474 Mean :0.06106 Mean :1.541
## 3rd Qu.: 0.00 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:2.000
## Max. :1064.00 Max. :3.0000 Max. :2.00000 Max. :4.000
## NA's :1 NA's :1
## Half.Bath Bedroom.AbvGr Kitchen.AbvGr Kitchen.Qual
## Min. :0.000 Min. :0.000 Min. :0.000 Ex: 67
## 1st Qu.:0.000 1st Qu.:2.000 1st Qu.:1.000 Fa: 20
## Median :0.000 Median :3.000 Median :1.000 Gd:403
## Mean :0.378 Mean :2.806 Mean :1.039 Po: 1
## 3rd Qu.:1.000 3rd Qu.:3.000 3rd Qu.:1.000 TA:509
## Max. :2.000 Max. :6.000 Max. :2.000
##
## TotRms.AbvGrd Functional Fireplaces Fireplace.Qu Garage.Type
## Min. : 2.00 Typ :935 Min. :0.000 Ex : 16 2Types : 10
## 1st Qu.: 5.00 Min2 : 24 1st Qu.:0.000 Fa : 24 Attchd :610
## Median : 6.00 Min1 : 18 Median :1.000 Gd :232 Basment: 11
## Mean : 6.34 Mod : 16 Mean :0.597 Po : 18 BuiltIn: 56
## 3rd Qu.: 7.00 Maj1 : 4 3rd Qu.:1.000 TA :219 CarPort: 1
## Max. :13.00 Maj2 : 2 Max. :4.000 NA's:491 Detchd :266
## (Other): 1 NA's : 46
## Garage.Yr.Blt Garage.Finish Garage.Cars Garage.Area Garage.Qual
## Min. :1900 : 2 Min. :0.000 Min. : 0.0 : 1
## 1st Qu.:1961 Fin :247 1st Qu.:1.000 1st Qu.: 312.0 Ex : 1
## Median :1979 RFn :278 Median :2.000 Median : 480.0 Fa : 37
## Mean :1978 Unf :427 Mean :1.767 Mean : 475.4 Gd : 7
## 3rd Qu.:2002 NA's: 46 3rd Qu.:2.000 3rd Qu.: 576.0 Po : 3
## Max. :2010 Max. :5.000 Max. :1390.0 TA :904
## NA's :48 NA's :1 NA's :1 NA's: 47
## Garage.Cond Paved.Drive Wood.Deck.SF Open.Porch.SF
## : 1 N: 67 Min. : 0.00 Min. : 0.00
## Ex : 1 P: 29 1st Qu.: 0.00 1st Qu.: 0.00
## Fa : 21 Y:904 Median : 0.00 Median : 28.00
## Gd : 6 Mean : 93.84 Mean : 48.93
## Po : 6 3rd Qu.:168.00 3rd Qu.: 74.00
## TA :918 Max. :857.00 Max. :742.00
## NA's: 47
## Enclosed.Porch X3Ssn.Porch Screen.Porch Pool.Area
## Min. : 0.00 Min. : 0.000 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 0.000
## Median : 0.00 Median : 0.000 Median : 0.00 Median : 0.000
## Mean : 23.48 Mean : 3.118 Mean : 14.77 Mean : 1.463
## 3rd Qu.: 0.00 3rd Qu.: 0.000 3rd Qu.: 0.00 3rd Qu.: 0.000
## Max. :432.00 Max. :508.000 Max. :440.00 Max. :800.000
##
## Pool.QC Fence Misc.Feature Misc.Val Mo.Sold
## Ex : 1 GdPrv: 43 Elev: 0 Min. : 0.00 Min. : 1.000
## Fa : 1 GdWo : 37 Gar2: 2 1st Qu.: 0.00 1st Qu.: 4.000
## Gd : 1 MnPrv:120 Othr: 1 Median : 0.00 Median : 6.000
## TA : 0 MnWw : 2 Shed: 25 Mean : 45.81 Mean : 6.243
## NA's:997 NA's :798 TenC: 1 3rd Qu.: 0.00 3rd Qu.: 8.000
## NA's:971 Max. :15500.00 Max. :12.000
##
## Yr.Sold Sale.Type Sale.Condition resid3
## Min. :2006 WD :863 Abnorml: 61 Min. :-2.08526
## 1st Qu.:2007 New : 79 AdjLand: 2 1st Qu.:-0.14145
## Median :2008 COD : 27 Alloca : 4 Median : 0.02048
## Mean :2008 ConLD : 7 Family : 17 Mean : 0.00000
## 3rd Qu.:2009 ConLw : 6 Normal :834 3rd Qu.: 0.17002
## Max. :2010 Con : 5 Partial: 82 Max. : 0.84536
## (Other): 13
str(ames_train)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1000 obs. of 82 variables:
## $ PID : int 909176150 905476230 911128020 535377150 534177230 908128060 902135020 528228540 923426010 908186050 ...
## $ area : int 856 1049 1001 1039 1665 1922 936 1246 889 1072 ...
## $ price : int 126000 139500 124900 114000 227000 198500 93000 187687 137500 140000 ...
## $ MS.SubClass : int 30 120 30 70 60 85 20 20 20 180 ...
## $ MS.Zoning : Factor w/ 7 levels "A (agr)","C (all)",..: 6 6 2 6 6 6 7 6 6 7 ...
## $ Lot.Frontage : int NA 42 60 80 70 64 60 53 74 35 ...
## $ Lot.Area : int 7890 4235 6060 8146 8400 7301 6000 3710 12395 3675 ...
## $ Street : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
## $ Alley : Factor w/ 2 levels "Grvl","Pave": NA NA NA NA NA NA 2 NA NA NA ...
## $ Lot.Shape : Factor w/ 4 levels "IR1","IR2","IR3",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ Land.Contour : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 1 4 4 4 ...
## $ Utilities : Factor w/ 3 levels "AllPub","NoSeWa",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Lot.Config : Factor w/ 5 levels "Corner","CulDSac",..: 1 5 5 1 5 1 5 5 1 5 ...
## $ Land.Slope : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 2 1 1 1 ...
## $ Neighborhood : Factor w/ 28 levels "Blmngtn","Blueste",..: 26 8 12 21 20 8 21 1 15 8 ...
## $ Condition.1 : Factor w/ 9 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Condition.2 : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Bldg.Type : Factor w/ 5 levels "1Fam","2fmCon",..: 1 5 1 1 1 1 2 1 1 5 ...
## $ House.Style : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 3 3 3 6 6 7 3 3 3 7 ...
## $ Overall.Qual : int 6 5 5 4 8 7 4 7 5 6 ...
## $ Overall.Cond : int 6 5 9 8 6 5 4 5 6 5 ...
## $ Year.Built : int 1939 1984 1930 1900 2001 2003 1953 2007 1984 2005 ...
## $ Year.Remod.Add : int 1950 1984 2007 2003 2001 2003 1953 2008 1984 2005 ...
## $ Roof.Style : Factor w/ 6 levels "Flat","Gable",..: 2 2 4 2 2 2 2 2 2 2 ...
## $ Roof.Matl : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Exterior.1st : Factor w/ 16 levels "AsbShng","AsphShn",..: 15 7 9 9 14 7 9 16 7 14 ...
## $ Exterior.2nd : Factor w/ 17 levels "AsbShng","AsphShn",..: 16 7 9 9 15 7 9 17 11 15 ...
## $ Mas.Vnr.Type : Factor w/ 6 levels "","BrkCmn","BrkFace",..: 5 3 5 5 5 3 5 3 5 6 ...
## $ Mas.Vnr.Area : int 0 149 0 0 0 500 0 20 0 76 ...
## $ Exter.Qual : Factor w/ 4 levels "Ex","Fa","Gd",..: 4 3 3 3 3 3 2 3 4 4 ...
## $ Exter.Cond : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 3 5 5 5 5 5 5 ...
## $ Foundation : Factor w/ 6 levels "BrkTil","CBlock",..: 2 2 1 1 3 4 2 3 2 3 ...
## $ Bsmt.Qual : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 4 6 3 4 NA 3 4 6 4 ...
## $ Bsmt.Cond : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 6 6 6 NA 6 6 6 6 ...
## $ Bsmt.Exposure : Factor w/ 5 levels "","Av","Gd","Mn",..: 5 4 5 5 5 NA 5 3 5 3 ...
## $ BsmtFin.Type.1 : Factor w/ 7 levels "","ALQ","BLQ",..: 6 4 2 7 4 NA 7 7 2 4 ...
## $ BsmtFin.SF.1 : int 238 552 737 0 643 0 0 0 647 467 ...
## $ BsmtFin.Type.2 : Factor w/ 7 levels "","ALQ","BLQ",..: 7 2 7 7 7 NA 7 7 7 7 ...
## $ BsmtFin.SF.2 : int 0 393 0 0 0 0 0 0 0 0 ...
## $ Bsmt.Unf.SF : int 618 104 100 405 167 0 936 1146 217 80 ...
## $ Total.Bsmt.SF : int 856 1049 837 405 810 0 936 1146 864 547 ...
## $ Heating : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Heating.QC : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 1 3 1 1 5 1 5 1 ...
## $ Central.Air : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 1 2 2 2 ...
## $ Electrical : Factor w/ 6 levels "","FuseA","FuseF",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ X1st.Flr.SF : int 856 1049 1001 717 810 495 936 1246 889 1072 ...
## $ X2nd.Flr.SF : int 0 0 0 322 855 1427 0 0 0 0 ...
## $ Low.Qual.Fin.SF: int 0 0 0 0 0 0 0 0 0 0 ...
## $ Bsmt.Full.Bath : int 1 1 0 0 1 0 0 0 0 1 ...
## $ Bsmt.Half.Bath : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Full.Bath : int 1 2 1 1 2 3 1 2 1 1 ...
## $ Half.Bath : int 0 0 0 0 1 0 0 0 0 0 ...
## $ Bedroom.AbvGr : int 2 2 2 2 3 4 2 2 3 2 ...
## $ Kitchen.AbvGr : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Kitchen.Qual : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 3 3 5 3 3 5 3 5 3 ...
## $ TotRms.AbvGrd : int 4 5 5 6 6 7 4 5 6 5 ...
## $ Functional : Factor w/ 8 levels "Maj1","Maj2",..: 8 8 8 8 8 8 4 8 8 8 ...
## $ Fireplaces : int 1 0 0 0 0 1 0 1 0 0 ...
## $ Fireplace.Qu : Factor w/ 5 levels "Ex","Fa","Gd",..: 3 NA NA NA NA 1 NA 3 NA NA ...
## $ Garage.Type : Factor w/ 6 levels "2Types","Attchd",..: 6 2 6 6 2 4 6 2 2 3 ...
## $ Garage.Yr.Blt : int 1939 1984 1930 1940 2001 2003 1974 2007 1984 2005 ...
## $ Garage.Finish : Factor w/ 4 levels "","Fin","RFn",..: 4 2 4 4 2 3 4 2 4 2 ...
## $ Garage.Cars : int 2 1 1 1 2 2 2 2 2 2 ...
## $ Garage.Area : int 399 266 216 281 528 672 576 428 484 525 ...
## $ Garage.Qual : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ Garage.Cond : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 5 6 6 6 6 6 6 6 ...
## $ Paved.Drive : Factor w/ 3 levels "N","P","Y": 3 3 1 1 3 3 3 3 3 3 ...
## $ Wood.Deck.SF : int 0 0 154 0 0 0 0 100 0 0 ...
## $ Open.Porch.SF : int 0 105 0 0 45 0 32 24 0 44 ...
## $ Enclosed.Porch : int 0 0 42 168 0 177 112 0 0 0 ...
## $ X3Ssn.Porch : int 0 0 86 0 0 0 0 0 0 0 ...
## $ Screen.Porch : int 166 0 0 111 0 0 0 0 0 0 ...
## $ Pool.Area : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Pool.QC : Factor w/ 4 levels "Ex","Fa","Gd",..: NA NA NA NA NA NA NA NA NA NA ...
## $ Fence : Factor w/ 4 levels "GdPrv","GdWo",..: NA NA NA NA NA NA NA NA NA NA ...
## $ Misc.Feature : Factor w/ 5 levels "Elev","Gar2",..: NA NA NA NA NA NA NA NA NA NA ...
## $ Misc.Val : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Mo.Sold : int 3 2 11 5 11 7 2 3 4 5 ...
## $ Yr.Sold : int 2010 2009 2007 2009 2009 2009 2009 2008 2008 2007 ...
## $ Sale.Type : Factor w/ 10 levels "COD","Con","ConLD",..: 10 10 10 10 10 3 10 7 10 10 ...
## $ Sale.Condition : Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 5 5 5 5 6 5 5 ...
## $ resid3 : num 0.1762 0.0906 0.0232 -0.1024 0.1517 ...
# Keep only the response and the three quality variables
df2 <- ames_train[,c("price","Garage.Qual", "Bsmt.Qual","Overall.Qual")]
summary(df2)
## price Garage.Qual Bsmt.Qual Overall.Qual
## Min. : 12789 : 1 : 1 Min. : 1.000
## 1st Qu.:129763 Ex : 1 Ex : 87 1st Qu.: 5.000
## Median :159467 Fa : 37 Fa : 28 Median : 6.000
## Mean :181190 Gd : 7 Gd :424 Mean : 6.095
## 3rd Qu.:213000 Po : 3 Po : 1 3rd Qu.: 7.000
## Max. :615000 TA :904 TA :438 Max. :10.000
## NA's: 47 NA's: 21
NA values for Basement.Qual and Garage.Qual correspond to houses that do not have a basement or a garage respectively. Which of the following is the best way to deal with these NA values when fitting the linear model with these variables?
Drop all houses with NA values for Basement.Qual or Garage.Qual since the model cannot be estimated otherwise.
Recode all NA values as the category TA since we must assume these basements or garages are typical in the absence of all other information.
Recode all NA values as a separate category, since houses without basements or garages are fundamentally different than houses with both basements and garages.
# type your code for Question 6 here, and Knit
nrow(subset(ames_train, is.na(Garage.Qual) | is.na(Bsmt.Qual)))
## [1] 64
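By default, `lm()` silently drops every row that has an NA in any model variable. The sketch below uses a small made-up data frame (`toy`, `garage_qual`, and the helper `recode_na` are all hypothetical names) to show how to count the rows that would be lost, and one way to recode NA as its own factor level so that no rows are dropped.

```r
# Toy data: two houses have no garage, so their quality rating is NA
toy <- data.frame(
  y = c(10, 12, 9, 14, 11, 13, 8, 15),
  garage_qual = factor(c("TA", "Gd", NA, "TA", NA, "Fa", "TA", "Gd"))
)

# Rows with at least one NA are the ones lm() would discard
n_dropped <- sum(!complete.cases(toy))
fit_dropped <- lm(y ~ garage_qual, data = toy)
nobs(fit_dropped)  # number of rows actually used in the fit

# Hypothetical helper: turn NA into an explicit level (here labelled "None")
recode_na <- function(f, label = "None") {
  values <- as.character(f)
  values[is.na(values)] <- label
  factor(values)
}

toy$garage_qual2 <- recode_na(toy$garage_qual)
fit_all <- lm(y ~ garage_qual2, data = toy)
nobs(fit_all)  # all rows are retained
```

With the recoded factor, "no garage" gets its own coefficient instead of removing those houses from the model.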
Consider \(\log\)(price) regressed on Overall.Cond and Overall.Qual. Which of the following subclasses of dwellings (MS.SubClass) has the highest median predicted prices?
# type your code for Question 7 here, and Knit
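# Predict log(price), then back-transform the predictions to dollars
lm7 = lm(log(price) ~ Overall.Cond + Overall.Qual, data = ames_train)
pred7 = predict(lm7, newdata = ames_train)
ames_train$pred7 = exp(pred7)
# NOTE: the question asks for the median, but this call uses mean(), so the column
# printed below holds group means; use median(pred7) to answer the question as stated
ames_train %>% group_by(MS.SubClass) %>% summarise(median_price=mean(pred7)) %>% arrange(desc(median_price))
## # A tibble: 15 x 2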
## MS.SubClass median_price
## <int> <dbl>
## 1 120 223426.4
## 2 60 216394.6
## 3 75 212014.3
## 4 20 176183.6
## 5 70 170761.4
## 6 160 162887.1
## 7 45 162778.7
## 8 80 159430.8
## 9 50 140960.9
## 10 90 139547.1
## 11 85 136612.1
## 12 40 127257.5
## 13 190 126160.3
## 14 180 125505.2
## 15 30 117523.0
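The question asks for a median per subclass. As a sketch in base R on simulated data (the data frame `toy` and its columns are hypothetical), predictions from a log-scale model can be back-transformed with `exp()` and then summarised by group with `aggregate()` and `median`.

```r
# Simulated stand-in data: three dwelling subclasses with a quality score
set.seed(7)
toy <- data.frame(
  subclass = rep(c(20, 60, 120), each = 30),
  qual = sample(3:9, 90, replace = TRUE)
)
toy$price <- exp(10.5 + 0.2 * toy$qual + rnorm(90, sd = 0.15))

# Fit on the log scale, then back-transform predictions to the price scale
fit <- lm(log(price) ~ qual, data = toy)
toy$pred <- exp(predict(fit, newdata = toy))

# Median predicted price per subclass, largest first
by_class <- aggregate(pred ~ subclass, data = toy, FUN = median)
by_class[order(-by_class$pred), ]
```

The mean and the median of back-transformed predictions can rank groups differently when the predictions are skewed, so the choice of summary function matters here.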
hatvalues, hat or lm.influence.
# type your code for Question 8 here, and Knit
which.max(lm.influence(lm7)$hat)
## 268
## 268
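For reference, `hatvalues(fit)` and `lm.influence(fit)$hat` return the same leverages. The sketch below checks this on simulated data (not the Ames data) in which the last observation is deliberately placed far from the others in `x`.

```r
# Simulated data: observation 50 has an extreme predictor value
set.seed(3)
x <- c(rnorm(49), 10)
y <- 2 + 0.5 * x + rnorm(50)
fit <- lm(y ~ x)

# Leverage is the diagonal of the hat matrix
h <- hatvalues(fit)

# Same values as the hat component of lm.influence()
all.equal(h, lm.influence(fit)$hat)

# Index of the highest-leverage observation
which.max(h)

# Leverages sum to the number of estimated coefficients (2 here)
sum(h)
```

Leverage depends only on the predictors, not on the response, so a high-leverage point is not necessarily an outlier in the residuals.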
Bedroom.AbvGr, where \(\log\)(price) is the dependent variable?
# type your code for Question 9 here, and Knit
In a linear model, we assume that all observations in the data are generated from the same process. You are concerned that houses sold in abnormal sale conditions may not exhibit the same behavior as houses sold in normal sale conditions. To visualize this, you make the following plot of 1st and 2nd floor square footage versus log(price):
# One color and plotting symbol per sale condition
n.Sale.Condition = length(levels(ames_train$Sale.Condition))
par(mar=c(5,4,4,10))  # widen the right margin to make room for the legend
plot(log(price) ~ I(X1st.Flr.SF+X2nd.Flr.SF),
     data=ames_train, col=Sale.Condition,
     pch=as.numeric(Sale.Condition)+15, main="Training Data")
# xpd=TRUE with a negative inset draws the legend in the right margin
legend("right", legend=levels(ames_train$Sale.Condition),
       col=1:n.Sale.Condition, pch=15+(1:n.Sale.Condition),
       bty="n", xpd=TRUE, inset=c(-.5,0))
Family
Abnorm
Partial
Abnorm and Partial
# type your code for Question 10 here, and Knit
lm10 = lm(log(price) ~ Sale.Condition, data = ames_train)
summary(lm10)
##
## Call:
## lm(formula = log(price) ~ Sale.Condition, data = ames_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.28589 -0.23436 -0.02709 0.24463 1.33354
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.74223 0.05016 234.091 < 2e-16 ***
## Sale.ConditionAdjLand 0.09490 0.28153 0.337 0.736
## Sale.ConditionAlloca 0.21674 0.20221 1.072 0.284
## Sale.ConditionFamily 0.11734 0.10745 1.092 0.275
## Sale.ConditionNormal 0.25361 0.05196 4.881 1.23e-06 ***
## Sale.ConditionPartial 0.75216 0.06624 11.355 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3918 on 994 degrees of freedom
## Multiple R-squared: 0.1367, Adjusted R-squared: 0.1324
## F-statistic: 31.49 on 5 and 994 DF, p-value: < 2.2e-16
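In the summary above, Abnorml has no coefficient of its own: it is the baseline level, and every other coefficient is a difference from it. To read the differences relative to normal sales instead, the factor can be releveled. This is a sketch on simulated data, with hypothetical names `cond`, `shift`, and `log_price`.

```r
# Simulated stand-in data with three sale conditions
set.seed(11)
cond <- factor(rep(c("Abnorml", "Normal", "Partial"), times = c(30, 200, 40)))
shift <- c(Abnorml = -0.25, Normal = 0, Partial = 0.5)
log_price <- unname(12 + shift[as.character(cond)] + rnorm(length(cond), sd = 0.3))

# Default baseline is the first level alphabetically (Abnorml)
fit_default <- lm(log_price ~ cond)

# Make Normal the baseline; coefficients are now differences from Normal
cond_ref <- relevel(cond, ref = "Normal")
fit_normal <- lm(log_price ~ cond_ref)
coef(fit_normal)
```

Releveling changes how the coefficients are expressed, not the fitted values: both models give identical predictions.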
Because houses with non-normal selling conditions exhibit atypical behavior and can disproportionately influence the model, you decide to model housing prices under normal sale conditions only.
Subset ames_train to include only houses sold under normal sale conditions. What percent of the original observations remain?
# type your code for Question 11 here, and Knit
table(ames_train$Sale.Condition)
##
## Abnorml AdjLand Alloca Family Normal Partial
## 61 2 4 17 834 82
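The share of observations that remain follows directly from the counts in the table above. A small worked sketch, with the counts typed in by hand rather than read from `ames_train`:

```r
# Counts copied from the table() output above
counts <- c(Abnorml = 61, AdjLand = 2, Alloca = 4,
            Family = 17, Normal = 834, Partial = 82)

# Percent of all sales made under normal conditions
pct_normal <- 100 * counts[["Normal"]] / sum(counts)
pct_normal  # 83.4
```

On the full data frame, the same figure could be obtained with `100 * mean(ames_train$Sale.Condition == "Normal")`.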
normals = subset(ames_train, Sale.Condition == 'Normal')
# type your code for Question 12 here, and Knit
# Refit the log-log model on normal sales only, then compare with lm3 (all sales)
lm12 = lm(log(price) ~ log(area), data = normals)
summary(lm3)
##
## Call:
## lm(formula = log(price) ~ log(area), data = ames_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.08526 -0.14145 0.02048 0.17002 0.84536
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.34441 0.19262 27.75 <2e-16 ***
## log(area) 0.92167 0.02657 34.69 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2834 on 998 degrees of freedom
## Multiple R-squared: 0.5466, Adjusted R-squared: 0.5461
## F-statistic: 1203 on 1 and 998 DF, p-value: < 2.2e-16
summary(lm12)
##
## Call:
## lm(formula = log(price) ~ log(area), data = normals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.36369 -0.12269 0.02005 0.14587 0.82373
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.73138 0.18711 30.63 <2e-16 ***
## log(area) 0.86716 0.02587 33.52 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2493 on 832 degrees of freedom
## Multiple R-squared: 0.5745, Adjusted R-squared: 0.574
## F-statistic: 1123 on 1 and 832 DF, p-value: < 2.2e-16
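Rather than reading the two summaries side by side, the headline statistics can be pulled into one table. The sketch below uses simulated data and a hypothetical helper `fit_stats`; in the simulation the non-normal sales are given noisier prices, which is an assumption made only for illustration.

```r
# Hypothetical helper: sample size, R-squared, and residual standard error
fit_stats <- function(fit) {
  s <- summary(fit)
  c(n = nobs(fit), r_squared = s$r.squared, resid_se = s$sigma)
}

# Simulated stand-in data: 250 "normal" sales and 50 noisier other sales
set.seed(5)
n <- 300
toy <- data.frame(
  log_area = rnorm(n, mean = 7.2, sd = 0.3),
  normal = rep(c(TRUE, FALSE), times = c(250, 50))
)
noise_sd <- ifelse(toy$normal, 0.2, 0.6)
toy$log_price <- 5.5 + 0.9 * toy$log_area + rnorm(n, sd = noise_sd)

# Same model on all rows and on the normal-sale subset
fit_all <- lm(log_price ~ log_area, data = toy)
fit_normal <- lm(log_price ~ log_area, data = subset(toy, normal))

comparison <- rbind(all = fit_stats(fit_all), normal_only = fit_stats(fit_normal))
round(comparison, 4)
```

Applied to `lm3` and `lm12`, the same helper would reproduce the values printed above: R-squared 0.5466 versus 0.5745, and residual standard error 0.2834 versus 0.2493.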