library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.0.0     ✔ purrr   0.2.5
## ✔ tibble  1.4.2     ✔ dplyr   0.7.6
## ✔ tidyr   0.8.1     ✔ stringr 1.3.1
## ✔ readr   1.1.1     ✔ forcats 0.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
load("/Users/haroldnelson/Dropbox/RProjects/Lab 3/ames.RData")

Our goal is to build a regression model that predicts the sale price of a house in Ames, Iowa. This is similar to what Zillow does.

First, let’s look over the data and select a reasonable list of variables that might be useful. Drop variables with a large share of missing values (for example, Pool.QC is NA in 2917 of the 2930 rows). Also drop variables with little variation (for example, Utilities is AllPub in 2927 rows).

summary(ames)
##      Order             PID             MS.SubClass       MS.Zoning   
##  Min.   :   1.0   Min.   :5.263e+08   Min.   : 20.00   A (agr):   2  
##  1st Qu.: 733.2   1st Qu.:5.285e+08   1st Qu.: 20.00   C (all):  25  
##  Median :1465.5   Median :5.355e+08   Median : 50.00   FV     : 139  
##  Mean   :1465.5   Mean   :7.145e+08   Mean   : 57.39   I (all):   2  
##  3rd Qu.:2197.8   3rd Qu.:9.072e+08   3rd Qu.: 70.00   RH     :  27  
##  Max.   :2930.0   Max.   :1.007e+09   Max.   :190.00   RL     :2273  
##                                                        RM     : 462  
##   Lot.Frontage       Lot.Area       Street      Alley      Lot.Shape 
##  Min.   : 21.00   Min.   :  1300   Grvl:  12   Grvl: 120   IR1: 979  
##  1st Qu.: 58.00   1st Qu.:  7440   Pave:2918   Pave:  78   IR2:  76  
##  Median : 68.00   Median :  9436               NA's:2732   IR3:  16  
##  Mean   : 69.22   Mean   : 10148                           Reg:1859  
##  3rd Qu.: 80.00   3rd Qu.: 11555                                     
##  Max.   :313.00   Max.   :215245                                     
##  NA's   :490                                                         
##  Land.Contour  Utilities      Lot.Config   Land.Slope  Neighborhood 
##  Bnk: 117     AllPub:2927   Corner : 511   Gtl:2789   NAmes  : 443  
##  HLS: 120     NoSeWa:   1   CulDSac: 180   Mod: 125   CollgCr: 267  
##  Low:  60     NoSewr:   2   FR2    :  85   Sev:  16   OldTown: 239  
##  Lvl:2633                   FR3    :  14              Edwards: 194  
##                             Inside :2140              Somerst: 182  
##                                                       NridgHt: 166  
##                                                       (Other):1439  
##   Condition.1    Condition.2    Bldg.Type     House.Style  
##  Norm   :2522   Norm   :2900   1Fam  :2425   1Story :1481  
##  Feedr  : 164   Feedr  :  13   2fmCon:  62   2Story : 873  
##  Artery :  92   Artery :   5   Duplex: 109   1.5Fin : 314  
##  RRAn   :  50   PosA   :   4   Twnhs : 101   SLvl   : 128  
##  PosN   :  39   PosN   :   4   TwnhsE: 233   SFoyer :  83  
##  RRAe   :  28   RRNn   :   2                 2.5Unf :  24  
##  (Other):  35   (Other):   2                 (Other):  27  
##   Overall.Qual     Overall.Cond     Year.Built   Year.Remod.Add
##  Min.   : 1.000   Min.   :1.000   Min.   :1872   Min.   :1950  
##  1st Qu.: 5.000   1st Qu.:5.000   1st Qu.:1954   1st Qu.:1965  
##  Median : 6.000   Median :5.000   Median :1973   Median :1993  
##  Mean   : 6.095   Mean   :5.563   Mean   :1971   Mean   :1984  
##  3rd Qu.: 7.000   3rd Qu.:6.000   3rd Qu.:2001   3rd Qu.:2004  
##  Max.   :10.000   Max.   :9.000   Max.   :2010   Max.   :2010  
##                                                                
##    Roof.Style     Roof.Matl     Exterior.1st   Exterior.2nd 
##  Flat   :  20   CompShg:2887   VinylSd:1026   VinylSd:1015  
##  Gable  :2321   Tar&Grv:  23   MetalSd: 450   MetalSd: 447  
##  Gambrel:  22   WdShake:   9   HdBoard: 442   HdBoard: 406  
##  Hip    : 551   WdShngl:   7   Wd Sdng: 420   Wd Sdng: 397  
##  Mansard:  11   ClyTile:   1   Plywood: 221   Plywood: 274  
##  Shed   :   5   Membran:   1   CemntBd: 126   CmentBd: 126  
##                 (Other):   2   (Other): 245   (Other): 265  
##   Mas.Vnr.Type   Mas.Vnr.Area    Exter.Qual Exter.Cond  Foundation  
##         :  23   Min.   :   0.0   Ex: 107    Ex:  12    BrkTil: 311  
##  BrkCmn :  25   1st Qu.:   0.0   Fa:  35    Fa:  67    CBlock:1244  
##  BrkFace: 880   Median :   0.0   Gd: 989    Gd: 299    PConc :1310  
##  CBlock :   1   Mean   : 101.9   TA:1799    Po:   3    Slab  :  49  
##  None   :1752   3rd Qu.: 164.0              TA:2549    Stone :  11  
##  Stone  : 249   Max.   :1600.0                         Wood  :   5  
##                 NA's   :23                                          
##  Bsmt.Qual   Bsmt.Cond   Bsmt.Exposure BsmtFin.Type.1  BsmtFin.SF.1   
##      :   1       :   1       :   4     GLQ    :859    Min.   :   0.0  
##  Ex  : 258   Ex  :   3   Av  : 418     Unf    :851    1st Qu.:   0.0  
##  Fa  :  88   Fa  : 104   Gd  : 284     ALQ    :429    Median : 370.0  
##  Gd  :1219   Gd  : 122   Mn  : 239     Rec    :288    Mean   : 442.6  
##  Po  :   2   Po  :   5   No  :1906     BLQ    :269    3rd Qu.: 734.0  
##  TA  :1283   TA  :2616   NA's:  79     (Other):155    Max.   :5644.0  
##  NA's:  79   NA's:  79                 NA's   : 79    NA's   :1       
##  BsmtFin.Type.2  BsmtFin.SF.2      Bsmt.Unf.SF     Total.Bsmt.SF 
##  Unf    :2499   Min.   :   0.00   Min.   :   0.0   Min.   :   0  
##  Rec    : 106   1st Qu.:   0.00   1st Qu.: 219.0   1st Qu.: 793  
##  LwQ    :  89   Median :   0.00   Median : 466.0   Median : 990  
##  BLQ    :  68   Mean   :  49.72   Mean   : 559.3   Mean   :1052  
##  ALQ    :  53   3rd Qu.:   0.00   3rd Qu.: 802.0   3rd Qu.:1302  
##  (Other):  36   Max.   :1526.00   Max.   :2336.0   Max.   :6110  
##  NA's   :  79   NA's   :1         NA's   :1        NA's   :1     
##   Heating     Heating.QC Central.Air Electrical    X1st.Flr.SF    
##  Floor:   1   Ex:1495    N: 196           :   1   Min.   : 334.0  
##  GasA :2885   Fa:  92    Y:2734      FuseA: 188   1st Qu.: 876.2  
##  GasW :  27   Gd: 476                FuseF:  50   Median :1084.0  
##  Grav :   9   Po:   3                FuseP:   8   Mean   :1159.6  
##  OthW :   2   TA: 864                Mix  :   1   3rd Qu.:1384.0  
##  Wall :   6                          SBrkr:2682   Max.   :5095.0  
##                                                                   
##   X2nd.Flr.SF     Low.Qual.Fin.SF     Gr.Liv.Area   Bsmt.Full.Bath  
##  Min.   :   0.0   Min.   :   0.000   Min.   : 334   Min.   :0.0000  
##  1st Qu.:   0.0   1st Qu.:   0.000   1st Qu.:1126   1st Qu.:0.0000  
##  Median :   0.0   Median :   0.000   Median :1442   Median :0.0000  
##  Mean   : 335.5   Mean   :   4.677   Mean   :1500   Mean   :0.4314  
##  3rd Qu.: 703.8   3rd Qu.:   0.000   3rd Qu.:1743   3rd Qu.:1.0000  
##  Max.   :2065.0   Max.   :1064.000   Max.   :5642   Max.   :3.0000  
##                                                     NA's   :2       
##  Bsmt.Half.Bath      Full.Bath       Half.Bath      Bedroom.AbvGr  
##  Min.   :0.00000   Min.   :0.000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.00000   1st Qu.:1.000   1st Qu.:0.0000   1st Qu.:2.000  
##  Median :0.00000   Median :2.000   Median :0.0000   Median :3.000  
##  Mean   :0.06113   Mean   :1.567   Mean   :0.3795   Mean   :2.854  
##  3rd Qu.:0.00000   3rd Qu.:2.000   3rd Qu.:1.0000   3rd Qu.:3.000  
##  Max.   :2.00000   Max.   :4.000   Max.   :2.0000   Max.   :8.000  
##  NA's   :2                                                         
##  Kitchen.AbvGr   Kitchen.Qual TotRms.AbvGrd      Functional  
##  Min.   :0.000   Ex: 205      Min.   : 2.000   Typ    :2728  
##  1st Qu.:1.000   Fa:  70      1st Qu.: 5.000   Min2   :  70  
##  Median :1.000   Gd:1160      Median : 6.000   Min1   :  65  
##  Mean   :1.044   Po:   1      Mean   : 6.443   Mod    :  35  
##  3rd Qu.:1.000   TA:1494      3rd Qu.: 7.000   Maj1   :  19  
##  Max.   :3.000                Max.   :15.000   Maj2   :   9  
##                                                (Other):   4  
##    Fireplaces     Fireplace.Qu  Garage.Type   Garage.Yr.Blt  Garage.Finish
##  Min.   :0.0000   Ex  :  43    2Types :  23   Min.   :1895       :   2    
##  1st Qu.:0.0000   Fa  :  75    Attchd :1731   1st Qu.:1960   Fin : 728    
##  Median :1.0000   Gd  : 744    Basment:  36   Median :1979   RFn : 812    
##  Mean   :0.5993   Po  :  46    BuiltIn: 186   Mean   :1978   Unf :1231    
##  3rd Qu.:1.0000   TA  : 600    CarPort:  15   3rd Qu.:2002   NA's: 157    
##  Max.   :4.0000   NA's:1422    Detchd : 782   Max.   :2207                
##                                NA's   : 157   NA's   :159                 
##   Garage.Cars     Garage.Area     Garage.Qual Garage.Cond Paved.Drive
##  Min.   :0.000   Min.   :   0.0       :   1       :   1   N: 216     
##  1st Qu.:1.000   1st Qu.: 320.0   Ex  :   3   Ex  :   3   P:  62     
##  Median :2.000   Median : 480.0   Fa  : 124   Fa  :  74   Y:2652     
##  Mean   :1.767   Mean   : 472.8   Gd  :  24   Gd  :  15              
##  3rd Qu.:2.000   3rd Qu.: 576.0   Po  :   5   Po  :  14              
##  Max.   :5.000   Max.   :1488.0   TA  :2615   TA  :2665              
##  NA's   :1       NA's   :1        NA's: 158   NA's: 158              
##   Wood.Deck.SF     Open.Porch.SF    Enclosed.Porch     X3Ssn.Porch     
##  Min.   :   0.00   Min.   :  0.00   Min.   :   0.00   Min.   :  0.000  
##  1st Qu.:   0.00   1st Qu.:  0.00   1st Qu.:   0.00   1st Qu.:  0.000  
##  Median :   0.00   Median : 27.00   Median :   0.00   Median :  0.000  
##  Mean   :  93.75   Mean   : 47.53   Mean   :  23.01   Mean   :  2.592  
##  3rd Qu.: 168.00   3rd Qu.: 70.00   3rd Qu.:   0.00   3rd Qu.:  0.000  
##  Max.   :1424.00   Max.   :742.00   Max.   :1012.00   Max.   :508.000  
##                                                                        
##   Screen.Porch   Pool.Area       Pool.QC       Fence      Misc.Feature
##  Min.   :  0   Min.   :  0.000   Ex  :   4   GdPrv: 118   Elev:   1   
##  1st Qu.:  0   1st Qu.:  0.000   Fa  :   2   GdWo : 112   Gar2:   5   
##  Median :  0   Median :  0.000   Gd  :   4   MnPrv: 330   Othr:   4   
##  Mean   : 16   Mean   :  2.243   TA  :   3   MnWw :  12   Shed:  95   
##  3rd Qu.:  0   3rd Qu.:  0.000   NA's:2917   NA's :2358   TenC:   1   
##  Max.   :576   Max.   :800.000                            NA's:2824   
##                                                                       
##     Misc.Val           Mo.Sold          Yr.Sold       Sale.Type   
##  Min.   :    0.00   Min.   : 1.000   Min.   :2006   WD     :2536  
##  1st Qu.:    0.00   1st Qu.: 4.000   1st Qu.:2007   New    : 239  
##  Median :    0.00   Median : 6.000   Median :2008   COD    :  87  
##  Mean   :   50.63   Mean   : 6.216   Mean   :2008   ConLD  :  26  
##  3rd Qu.:    0.00   3rd Qu.: 8.000   3rd Qu.:2009   CWD    :  12  
##  Max.   :17000.00   Max.   :12.000   Max.   :2010   ConLI  :   9  
##                                                     (Other):  21  
##  Sale.Condition   SalePrice     
##  Abnorml: 190   Min.   : 12789  
##  AdjLand:  12   1st Qu.:129500  
##  Alloca :  24   Median :160000  
##  Family :  46   Mean   :180796  
##  Normal :2413   3rd Qu.:213500  
##  Partial: 245   Max.   :755000  
## 
#ames = ames %>% 
#  select()
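
The screening described above can also be done with counts rather than by eye. The following is a minimal sketch, not the lab's required method: the data frame toy is a made-up stand-in for ames (the column names mimic the real data, the values do not), and the 0.2 and 0.95 cutoffs are arbitrary assumptions you would want to adjust.

```r
# Toy stand-in for ames; the values are invented for illustration.
toy = data.frame(
  SalePrice = c(120000, 150000, 180000, 210000, 250000, 300000),
  Gr.Liv.Area = c(900, 1100, 1400, 1600, 2000, 2400),
  Pool.QC = factor(c(NA, NA, NA, NA, NA, "Ex")),
  Street = factor(c("Pave", "Pave", "Pave", "Pave", "Pave", "Pave"))
)

# Fraction of missing values in each column.
na_frac = sapply(toy, function(x) mean(is.na(x)))

# Share of the non-missing rows taken by the most common value
# (a share of 1 means the column never varies).
top_share = sapply(toy, function(x) max(table(x)) / sum(!is.na(x)))

# Keep columns that pass both (assumed) cutoffs.
keep = names(toy)[na_frac <= 0.2 & top_share <= 0.95]
keep
```

On the toy data this keeps SalePrice and Gr.Liv.Area. The resulting names could then be passed to select() with all_of(keep), or used directly as ames[, keep].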

Now we want to divide our data into two parts. The first part, called the training dataset, will be the basis of our model. The second part, the test dataset, is held back to check the model on data it has not seen. Begin by putting the entire dataset into a randomized order.

# Get randomized row numbers
set.seed(199)
rows = sample(nrow(ames))

# Use the randomized numbers to randomize ames itself. 
ames = ames[rows,]

Now separate ames into the training and testing pieces. Designate a split point. Then use it to split the data.

# Use 70% of the data for training.
split = round(nrow(ames) * .7)
train = ames[1:split,]

test = ames[(split+1):nrow(ames),]
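
A quick sanity check on a split like this is to confirm that the two pieces have the expected sizes and share no rows. The sketch below repeats the shuffle-and-split steps on a ten-row toy data frame; toy, train_toy, test_toy and the id column are made-up names for illustration.

```r
# Ten-row toy data frame with an id column so rows can be tracked.
set.seed(199)
toy = data.frame(id = 1:10, y = rnorm(10))

# Shuffle the rows, then split 70/30 as above.
toy = toy[sample(nrow(toy)), ]
split_toy = round(nrow(toy) * .7)
train_toy = toy[1:split_toy, ]
test_toy = toy[(split_toy + 1):nrow(toy), ]

nrow(train_toy)                                # 7 rows for training
nrow(test_toy)                                 # 3 rows for testing
length(intersect(train_toy$id, test_toy$id))   # 0 rows in common
```

The same three checks can be run on train and test, using Order or PID in place of id.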

Now build a model on train. Look at the summary results.

mod1 = lm(SalePrice~X1st.Flr.SF,data = train)
summary(mod1)
## 
## Call:
## lm(formula = SalePrice ~ X1st.Flr.SF, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -511869  -37461  -12090   37099  405237 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30820.978   4314.224   7.144 1.25e-12 ***
## X1st.Flr.SF   130.500      3.545  36.816  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 62800 on 2049 degrees of freedom
## Multiple R-squared:  0.3981, Adjusted R-squared:  0.3978 
## F-statistic:  1355 on 1 and 2049 DF,  p-value: < 2.2e-16

Compute the RMSE for the train data and compare it with the summary output.

PredTrain = predict(mod1,train)
ErrTrain = train$SalePrice - PredTrain
RMSETrain = sqrt(mean(ErrTrain^2))
RMSETrain
## [1] 62769.56
# Note the similarity to the residual standard error in the summary output.
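
The two numbers are close but not identical because they divide the same sum of squared errors by different counts: the RMSE divides by the number of rows n, while the residual standard error divides by the residual degrees of freedom, n minus the number of estimated coefficients. A small sketch of that relationship, using the built-in mtcars data as a stand-in for train (fit, rmse_fit and rse are illustrative names):

```r
# mtcars ships with R; it stands in for the train data here.
fit = lm(mpg ~ wt, data = mtcars)
n = nrow(mtcars)

# RMSE divides the sum of squared errors by n.
rmse_fit = sqrt(mean(residuals(fit)^2))

# The residual standard error divides by the residual degrees of freedom,
# n minus the number of estimated coefficients (2 here).
rse = summary(fit)$sigma

# The two differ only by the factor sqrt(n / (n - 2)).
all.equal(rse, rmse_fit * sqrt(n / df.residual(fit)))   # TRUE
```

Applied to the numbers above, 62769.56 * sqrt(2051 / 2049) comes to about 62800, which is the residual standard error reported by summary(mod1).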

Now compute the RMSE for the test data.

PredTest = predict(mod1,test)
ErrTest = test$SalePrice - PredTest
RMSETest = sqrt(mean(ErrTest^2))
RMSETest
## [1] 62184.05

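Note that in this run the RMSE for the test data (62184.05) is slightly lower than the RMSE for the train data (62769.56). The test RMSE is usually the higher of the two, because the model was fit to the train data, but with a simple one-variable model and a random split the difference can go either way. The RMSE of the test data is what you should quote to describe the likely performance of your model when it is used on previously unseen data.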
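
Since the same RMSE calculation is repeated for every model and dataset, it may help to wrap it in a function. The helper below is a hypothetical convenience, not part of the lab as written; the name rmse and its arguments are assumptions, and it is demonstrated on the built-in mtcars data rather than on ames.

```r
# Hypothetical helper: RMSE of a fitted model on a data frame.
# response is the name of the outcome column as a string.
rmse = function(model, data, response) {
  err = data[[response]] - predict(model, data)
  sqrt(mean(err^2, na.rm = TRUE))
}

# Demonstration on mtcars (ships with R), shuffled and split 70/30 as above.
set.seed(199)
cars = mtcars[sample(nrow(mtcars)), ]
split_cars = round(nrow(cars) * .7)
cars_train = cars[1:split_cars, ]
cars_test = cars[(split_cars + 1):nrow(cars), ]

fit_cars = lm(mpg ~ wt, data = cars_train)
rmse(fit_cars, cars_train, "mpg")
rmse(fit_cars, cars_test, "mpg")
```

With the lab's objects the calls would be rmse(mod1, train, "SalePrice") and rmse(mod1, test, "SalePrice"). The na.rm = TRUE argument skips rows where predict() returns NA because a predictor is missing. That matters for variables such as Garage.Area, which has one NA in the summary above, but it also means the RMSE is then computed on fewer rows.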

Problem 1

Extend the model above to mod2 by adding one additional variable. Compute the RMSE for the test data. By how much did you improve the performance of your model?

# Insert your code here.

Problem 2

Add one more variable to create mod3. By how much were you able to improve your model?

# Insert your code here.

Problem 3

Add as many variables as you want and see how well you can do.

# Insert your code here.