library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.0.0 ✔ purrr 0.2.5
## ✔ tibble 1.4.2 ✔ dplyr 0.7.6
## ✔ tidyr 0.8.1 ✔ stringr 1.3.1
## ✔ readr 1.1.1 ✔ forcats 0.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
load("/Users/haroldnelson/Dropbox/RProjects/Lab 3/ames.RData")
Our goal is to build a regression model that predicts the sale price of a house in Ames, Iowa. This is similar to what Zillow does.
First, let’s look over the data and select a reasonable list of variables that might be useful. Drop variables with a large number of missing values, and also drop variables with little variation.
summary(ames)
## Order PID MS.SubClass MS.Zoning
## Min. : 1.0 Min. :5.263e+08 Min. : 20.00 A (agr): 2
## 1st Qu.: 733.2 1st Qu.:5.285e+08 1st Qu.: 20.00 C (all): 25
## Median :1465.5 Median :5.355e+08 Median : 50.00 FV : 139
## Mean :1465.5 Mean :7.145e+08 Mean : 57.39 I (all): 2
## 3rd Qu.:2197.8 3rd Qu.:9.072e+08 3rd Qu.: 70.00 RH : 27
## Max. :2930.0 Max. :1.007e+09 Max. :190.00 RL :2273
## RM : 462
## Lot.Frontage Lot.Area Street Alley Lot.Shape
## Min. : 21.00 Min. : 1300 Grvl: 12 Grvl: 120 IR1: 979
## 1st Qu.: 58.00 1st Qu.: 7440 Pave:2918 Pave: 78 IR2: 76
## Median : 68.00 Median : 9436 NA's:2732 IR3: 16
## Mean : 69.22 Mean : 10148 Reg:1859
## 3rd Qu.: 80.00 3rd Qu.: 11555
## Max. :313.00 Max. :215245
## NA's :490
## Land.Contour Utilities Lot.Config Land.Slope Neighborhood
## Bnk: 117 AllPub:2927 Corner : 511 Gtl:2789 NAmes : 443
## HLS: 120 NoSeWa: 1 CulDSac: 180 Mod: 125 CollgCr: 267
## Low: 60 NoSewr: 2 FR2 : 85 Sev: 16 OldTown: 239
## Lvl:2633 FR3 : 14 Edwards: 194
## Inside :2140 Somerst: 182
## NridgHt: 166
## (Other):1439
## Condition.1 Condition.2 Bldg.Type House.Style
## Norm :2522 Norm :2900 1Fam :2425 1Story :1481
## Feedr : 164 Feedr : 13 2fmCon: 62 2Story : 873
## Artery : 92 Artery : 5 Duplex: 109 1.5Fin : 314
## RRAn : 50 PosA : 4 Twnhs : 101 SLvl : 128
## PosN : 39 PosN : 4 TwnhsE: 233 SFoyer : 83
## RRAe : 28 RRNn : 2 2.5Unf : 24
## (Other): 35 (Other): 2 (Other): 27
## Overall.Qual Overall.Cond Year.Built Year.Remod.Add
## Min. : 1.000 Min. :1.000 Min. :1872 Min. :1950
## 1st Qu.: 5.000 1st Qu.:5.000 1st Qu.:1954 1st Qu.:1965
## Median : 6.000 Median :5.000 Median :1973 Median :1993
## Mean : 6.095 Mean :5.563 Mean :1971 Mean :1984
## 3rd Qu.: 7.000 3rd Qu.:6.000 3rd Qu.:2001 3rd Qu.:2004
## Max. :10.000 Max. :9.000 Max. :2010 Max. :2010
##
## Roof.Style Roof.Matl Exterior.1st Exterior.2nd
## Flat : 20 CompShg:2887 VinylSd:1026 VinylSd:1015
## Gable :2321 Tar&Grv: 23 MetalSd: 450 MetalSd: 447
## Gambrel: 22 WdShake: 9 HdBoard: 442 HdBoard: 406
## Hip : 551 WdShngl: 7 Wd Sdng: 420 Wd Sdng: 397
## Mansard: 11 ClyTile: 1 Plywood: 221 Plywood: 274
## Shed : 5 Membran: 1 CemntBd: 126 CmentBd: 126
## (Other): 2 (Other): 245 (Other): 265
## Mas.Vnr.Type Mas.Vnr.Area Exter.Qual Exter.Cond Foundation
## : 23 Min. : 0.0 Ex: 107 Ex: 12 BrkTil: 311
## BrkCmn : 25 1st Qu.: 0.0 Fa: 35 Fa: 67 CBlock:1244
## BrkFace: 880 Median : 0.0 Gd: 989 Gd: 299 PConc :1310
## CBlock : 1 Mean : 101.9 TA:1799 Po: 3 Slab : 49
## None :1752 3rd Qu.: 164.0 TA:2549 Stone : 11
## Stone : 249 Max. :1600.0 Wood : 5
## NA's :23
## Bsmt.Qual Bsmt.Cond Bsmt.Exposure BsmtFin.Type.1 BsmtFin.SF.1
## : 1 : 1 : 4 GLQ :859 Min. : 0.0
## Ex : 258 Ex : 3 Av : 418 Unf :851 1st Qu.: 0.0
## Fa : 88 Fa : 104 Gd : 284 ALQ :429 Median : 370.0
## Gd :1219 Gd : 122 Mn : 239 Rec :288 Mean : 442.6
## Po : 2 Po : 5 No :1906 BLQ :269 3rd Qu.: 734.0
## TA :1283 TA :2616 NA's: 79 (Other):155 Max. :5644.0
## NA's: 79 NA's: 79 NA's : 79 NA's :1
## BsmtFin.Type.2 BsmtFin.SF.2 Bsmt.Unf.SF Total.Bsmt.SF
## Unf :2499 Min. : 0.00 Min. : 0.0 Min. : 0
## Rec : 106 1st Qu.: 0.00 1st Qu.: 219.0 1st Qu.: 793
## LwQ : 89 Median : 0.00 Median : 466.0 Median : 990
## BLQ : 68 Mean : 49.72 Mean : 559.3 Mean :1052
## ALQ : 53 3rd Qu.: 0.00 3rd Qu.: 802.0 3rd Qu.:1302
## (Other): 36 Max. :1526.00 Max. :2336.0 Max. :6110
## NA's : 79 NA's :1 NA's :1 NA's :1
## Heating Heating.QC Central.Air Electrical X1st.Flr.SF
## Floor: 1 Ex:1495 N: 196 : 1 Min. : 334.0
## GasA :2885 Fa: 92 Y:2734 FuseA: 188 1st Qu.: 876.2
## GasW : 27 Gd: 476 FuseF: 50 Median :1084.0
## Grav : 9 Po: 3 FuseP: 8 Mean :1159.6
## OthW : 2 TA: 864 Mix : 1 3rd Qu.:1384.0
## Wall : 6 SBrkr:2682 Max. :5095.0
##
## X2nd.Flr.SF Low.Qual.Fin.SF Gr.Liv.Area Bsmt.Full.Bath
## Min. : 0.0 Min. : 0.000 Min. : 334 Min. :0.0000
## 1st Qu.: 0.0 1st Qu.: 0.000 1st Qu.:1126 1st Qu.:0.0000
## Median : 0.0 Median : 0.000 Median :1442 Median :0.0000
## Mean : 335.5 Mean : 4.677 Mean :1500 Mean :0.4314
## 3rd Qu.: 703.8 3rd Qu.: 0.000 3rd Qu.:1743 3rd Qu.:1.0000
## Max. :2065.0 Max. :1064.000 Max. :5642 Max. :3.0000
## NA's :2
## Bsmt.Half.Bath Full.Bath Half.Bath Bedroom.AbvGr
## Min. :0.00000 Min. :0.000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:2.000
## Median :0.00000 Median :2.000 Median :0.0000 Median :3.000
## Mean :0.06113 Mean :1.567 Mean :0.3795 Mean :2.854
## 3rd Qu.:0.00000 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :2.00000 Max. :4.000 Max. :2.0000 Max. :8.000
## NA's :2
## Kitchen.AbvGr Kitchen.Qual TotRms.AbvGrd Functional
## Min. :0.000 Ex: 205 Min. : 2.000 Typ :2728
## 1st Qu.:1.000 Fa: 70 1st Qu.: 5.000 Min2 : 70
## Median :1.000 Gd:1160 Median : 6.000 Min1 : 65
## Mean :1.044 Po: 1 Mean : 6.443 Mod : 35
## 3rd Qu.:1.000 TA:1494 3rd Qu.: 7.000 Maj1 : 19
## Max. :3.000 Max. :15.000 Maj2 : 9
## (Other): 4
## Fireplaces Fireplace.Qu Garage.Type Garage.Yr.Blt Garage.Finish
## Min. :0.0000 Ex : 43 2Types : 23 Min. :1895 : 2
## 1st Qu.:0.0000 Fa : 75 Attchd :1731 1st Qu.:1960 Fin : 728
## Median :1.0000 Gd : 744 Basment: 36 Median :1979 RFn : 812
## Mean :0.5993 Po : 46 BuiltIn: 186 Mean :1978 Unf :1231
## 3rd Qu.:1.0000 TA : 600 CarPort: 15 3rd Qu.:2002 NA's: 157
## Max. :4.0000 NA's:1422 Detchd : 782 Max. :2207
## NA's : 157 NA's :159
## Garage.Cars Garage.Area Garage.Qual Garage.Cond Paved.Drive
## Min. :0.000 Min. : 0.0 : 1 : 1 N: 216
## 1st Qu.:1.000 1st Qu.: 320.0 Ex : 3 Ex : 3 P: 62
## Median :2.000 Median : 480.0 Fa : 124 Fa : 74 Y:2652
## Mean :1.767 Mean : 472.8 Gd : 24 Gd : 15
## 3rd Qu.:2.000 3rd Qu.: 576.0 Po : 5 Po : 14
## Max. :5.000 Max. :1488.0 TA :2615 TA :2665
## NA's :1 NA's :1 NA's: 158 NA's: 158
## Wood.Deck.SF Open.Porch.SF Enclosed.Porch X3Ssn.Porch
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.000
## Median : 0.00 Median : 27.00 Median : 0.00 Median : 0.000
## Mean : 93.75 Mean : 47.53 Mean : 23.01 Mean : 2.592
## 3rd Qu.: 168.00 3rd Qu.: 70.00 3rd Qu.: 0.00 3rd Qu.: 0.000
## Max. :1424.00 Max. :742.00 Max. :1012.00 Max. :508.000
##
## Screen.Porch Pool.Area Pool.QC Fence Misc.Feature
## Min. : 0 Min. : 0.000 Ex : 4 GdPrv: 118 Elev: 1
## 1st Qu.: 0 1st Qu.: 0.000 Fa : 2 GdWo : 112 Gar2: 5
## Median : 0 Median : 0.000 Gd : 4 MnPrv: 330 Othr: 4
## Mean : 16 Mean : 2.243 TA : 3 MnWw : 12 Shed: 95
## 3rd Qu.: 0 3rd Qu.: 0.000 NA's:2917 NA's :2358 TenC: 1
## Max. :576 Max. :800.000 NA's:2824
##
## Misc.Val Mo.Sold Yr.Sold Sale.Type
## Min. : 0.00 Min. : 1.000 Min. :2006 WD :2536
## 1st Qu.: 0.00 1st Qu.: 4.000 1st Qu.:2007 New : 239
## Median : 0.00 Median : 6.000 Median :2008 COD : 87
## Mean : 50.63 Mean : 6.216 Mean :2008 ConLD : 26
## 3rd Qu.: 0.00 3rd Qu.: 8.000 3rd Qu.:2009 CWD : 12
## Max. :17000.00 Max. :12.000 Max. :2010 ConLI : 9
## (Other): 21
## Sale.Condition SalePrice
## Abnorml: 190 Min. : 12789
## AdjLand: 12 1st Qu.:129500
## Alloca : 24 Median :160000
## Family : 46 Mean :180796
## Normal :2413 3rd Qu.:213500
## Partial: 245 Max. :755000
##
#ames = ames %>%
# select()
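The summary above is long, so it can help to compute the two screening criteria directly. The following is a minimal sketch, not part of the original lab: it uses a small made-up data frame (`toy`, with hypothetical values) in place of `ames`, and the names `na_share`, `top_share`, and `screen` are introduced here for illustration only. Which cutoffs count as "too many missing values" or "too little variation" is left as a judgment call.

```r
# Toy data frame standing in for ames (hypothetical values, not the lab data).
toy = data.frame(
  Alley     = c(NA, NA, "Grvl", NA, NA, NA),
  Utilities = c("AllPub", "AllPub", "AllPub", "AllPub", "AllPub", "NoSewr"),
  Lot.Area  = c(7440, 9436, 11555, 1300, 10148, 8000),
  stringsAsFactors = TRUE
)

# Share of missing values in each column.
na_share = colMeans(is.na(toy))

# Share of the non-missing rows taken up by the most common value in each column.
# A value near 1 means the column barely varies.
top_share = sapply(toy, function(x) max(table(x)) / sum(!is.na(x)))

# Put both measures side by side, one row per variable.
screen = data.frame(na_share, top_share)
screen
```

Replacing `toy` with `ames` should give the same two measures for every column. In the summary output, Alley (2732 missing values out of 2930 rows) and Utilities (2927 rows with AllPub) are the kind of variables this screen is meant to flag.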
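Now we want to divide our data into two parts. The first part, called the training dataset, will be the basis of our model. The second part, the test dataset, is held back so we can check how the model performs on data it has not seen. Begin by putting the entire dataset into a randomized order.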
# Get randomized row numbers
set.seed(199)
rows = sample(nrow(ames))
# Use the randomized numbers to randomize ames itself.
ames = ames[rows,]
Now separate ames into the training and testing pieces. Designate a split point. Then use it to split the data.
# Use 70% of the data for training.
split = round(nrow(ames) * .7)
train = ames[1:split,]
test = ames[(split+1):nrow(ames),]
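It is worth confirming that the two pieces cover every row exactly once. The sketch below repeats the same shuffle-and-split steps on the built-in `mtcars` data so that it runs on its own; the names `dat`, `train_demo`, and `test_demo` are stand-ins chosen for this example, not objects from the lab.

```r
# mtcars (built into R) stands in for ames so the sketch is self-contained.
set.seed(199)
dat = mtcars[sample(nrow(mtcars)), ]

# Same 70/30 split as above.
split = round(nrow(dat) * .7)
train_demo = dat[1:split, ]
test_demo  = dat[(split + 1):nrow(dat), ]

# The two pieces together should account for every row ...
nrow(train_demo) + nrow(test_demo) == nrow(dat)

# ... and no row should appear in both.
length(intersect(rownames(train_demo), rownames(test_demo))) == 0
```

With the 2930 rows of ames, the same arithmetic puts 2051 rows in train and 879 in test.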
Now build a model on train. Look at the summary results.
mod1 = lm(SalePrice~X1st.Flr.SF,data = train)
summary(mod1)
##
## Call:
## lm(formula = SalePrice ~ X1st.Flr.SF, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -511869 -37461 -12090 37099 405237
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30820.978 4314.224 7.144 1.25e-12 ***
## X1st.Flr.SF 130.500 3.545 36.816 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 62800 on 2049 degrees of freedom
## Multiple R-squared: 0.3981, Adjusted R-squared: 0.3978
## F-statistic: 1355 on 1 and 2049 DF, p-value: < 2.2e-16
Compute the RMSE for the train data and compare it with the summary output.
PredTrain = predict(mod1,train)
ErrTrain = train$SalePrice - PredTrain
RMSETrain = sqrt(mean(ErrTrain^2))
RMSETrain
## [1] 62769.56
# Note the similarity to the residual standard error in the summary output.
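The two numbers are close but not identical because they divide by different quantities: the RMSE divides the sum of squared errors by the number of rows n, while the residual standard error divides by the residual degrees of freedom, n minus the number of estimated coefficients. A small sketch of that relationship, using the built-in `mtcars` data and illustrative names (`fit`, `rmse_fit`, `rse_fit`) that are not part of the lab:

```r
# A one-predictor model on built-in data, analogous to mod1.
fit = lm(mpg ~ wt, data = mtcars)
res = residuals(fit)
n = length(res)        # number of rows used in the fit
p = length(coef(fit))  # number of estimated coefficients (intercept + slope)

# RMSE divides by n; the residual standard error divides by n - p.
rmse_fit = sqrt(mean(res^2))
rse_fit  = sqrt(sum(res^2) / (n - p))

# sigma() returns the residual standard error reported by summary().
all.equal(rse_fit, sigma(fit))

# The two measures differ only by the factor sqrt(n / (n - p)).
all.equal(rse_fit, rmse_fit * sqrt(n / (n - p)))
```

For mod1, n is 2051 and p is 2, so 62769.56 times sqrt(2051 / 2049) comes to about 62800, matching the summary output.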
Now compute the RMSE for the test data.
PredTest = predict(mod1,test)
ErrTest = test$SalePrice - PredTest
RMSETest = sqrt(mean(ErrTest^2))
RMSETest
## [1] 62184.05
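Note that in this particular split the RMSE for the test data (62184.05) is slightly lower than the RMSE for the train data (62769.56). More often the test RMSE is the higher of the two, because the model was fit to the train data. Either way, the RMSE of the test data is what you should quote to describe the likely performance of your model when it is used on previously unseen data.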
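Since the same three lines are repeated for every model and dataset, one option is to wrap them in a small helper. The function below is a hypothetical convenience, not something the lab defines, and the example again uses the built-in `mtcars` data with stand-in names (`dat`, `train_demo`, `test_demo`, `m1`, `m2`) so that it runs without ames.

```r
# Hypothetical helper: RMSE of a fitted model on a given data frame.
# `outcome` is the name of the response column as a string.
rmse = function(model, data, outcome) {
  err = data[[outcome]] - predict(model, newdata = data)
  sqrt(mean(err^2))
}

# Shuffle and split mtcars the same way the lab splits ames.
set.seed(199)
dat = mtcars[sample(nrow(mtcars)), ]
split = round(nrow(dat) * .7)
train_demo = dat[1:split, ]
test_demo  = dat[(split + 1):nrow(dat), ]

# Two nested models fit on the training piece only.
m1 = lm(mpg ~ wt, data = train_demo)
m2 = lm(mpg ~ wt + hp, data = train_demo)

# Test RMSE for each model, side by side.
c(m1 = rmse(m1, test_demo, "mpg"), m2 = rmse(m2, test_demo, "mpg"))
```

Adding a variable can never raise the train RMSE of a least-squares model, but it can raise the test RMSE, which is why the comparison is made on the test piece.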
Extend the model above to mod2 by adding one additional variable. Compute the RMSE for the test data. By how much did you improve the performance of your model?
# Insert your code here.
Add one more variable to create mod3. By how much were you able to improve your model?
# Insert your code here.
Add as many variables as you want and see how well you can do.
# Insert your code here.