The Ames Housing dataset was downloaded from Kaggle. It comes from a playground competition, and my task is to predict house prices from house-level features using a multiple linear regression model in R.

Prepare the data

##   Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 1  1         60       RL          65    8450   Pave  <NA>      Reg
## 2  2         20       RL          80    9600   Pave  <NA>      Reg
## 3  3         60       RL          68   11250   Pave  <NA>      IR1
## 4  4         70       RL          60    9550   Pave  <NA>      IR1
## 5  5         60       RL          84   14260   Pave  <NA>      IR1
## 6  6         50       RL          85   14115   Pave  <NA>      IR1
##   LandContour Utilities LotConfig LandSlope Neighborhood Condition1
## 1         Lvl    AllPub    Inside       Gtl      CollgCr       Norm
## 2         Lvl    AllPub       FR2       Gtl      Veenker      Feedr
## 3         Lvl    AllPub    Inside       Gtl      CollgCr       Norm
## 4         Lvl    AllPub    Corner       Gtl      Crawfor       Norm
## 5         Lvl    AllPub       FR2       Gtl      NoRidge       Norm
## 6         Lvl    AllPub    Inside       Gtl      Mitchel       Norm
##   Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt
## 1       Norm     1Fam     2Story           7           5      2003
## 2       Norm     1Fam     1Story           6           8      1976
## 3       Norm     1Fam     2Story           7           5      2001
## 4       Norm     1Fam     2Story           7           5      1915
## 5       Norm     1Fam     2Story           8           5      2000
## 6       Norm     1Fam     1.5Fin           5           5      1993
##   YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
## 1         2003     Gable  CompShg     VinylSd     VinylSd    BrkFace
## 2         1976     Gable  CompShg     MetalSd     MetalSd       None
## 3         2002     Gable  CompShg     VinylSd     VinylSd    BrkFace
## 4         1970     Gable  CompShg     Wd Sdng     Wd Shng       None
## 5         2000     Gable  CompShg     VinylSd     VinylSd    BrkFace
## 6         1995     Gable  CompShg     VinylSd     VinylSd       None
##   MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure
## 1        196        Gd        TA      PConc       Gd       TA           No
## 2          0        TA        TA     CBlock       Gd       TA           Gd
## 3        162        Gd        TA      PConc       Gd       TA           Mn
## 4          0        TA        TA     BrkTil       TA       Gd           No
## 5        350        Gd        TA      PConc       Gd       TA           Av
## 6          0        TA        TA       Wood       Gd       TA           No
##   BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 1          GLQ        706          Unf          0       150         856
## 2          ALQ        978          Unf          0       284        1262
## 3          GLQ        486          Unf          0       434         920
## 4          ALQ        216          Unf          0       540         756
## 5          GLQ        655          Unf          0       490        1145
## 6          GLQ        732          Unf          0        64         796
##   Heating HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF
## 1    GasA        Ex          Y      SBrkr       856       854            0
## 2    GasA        Ex          Y      SBrkr      1262         0            0
## 3    GasA        Ex          Y      SBrkr       920       866            0
## 4    GasA        Gd          Y      SBrkr       961       756            0
## 5    GasA        Ex          Y      SBrkr      1145      1053            0
## 6    GasA        Ex          Y      SBrkr       796       566            0
##   GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr
## 1      1710            1            0        2        1            3
## 2      1262            0            1        2        0            3
## 3      1786            1            0        2        1            3
## 4      1717            1            0        1        0            3
## 5      2198            1            0        2        1            4
## 6      1362            1            0        1        1            1
##   KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu
## 1            1          Gd            8        Typ          0        <NA>
## 2            1          TA            6        Typ          1          TA
## 3            1          Gd            6        Typ          1          TA
## 4            1          Gd            7        Typ          1          Gd
## 5            1          Gd            9        Typ          1          TA
## 6            1          TA            5        Typ          0        <NA>
##   GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## 1     Attchd        2003          RFn          2        548         TA
## 2     Attchd        1976          RFn          2        460         TA
## 3     Attchd        2001          RFn          2        608         TA
## 4     Detchd        1998          Unf          3        642         TA
## 5     Attchd        2000          RFn          3        836         TA
## 6     Attchd        1993          Unf          2        480         TA
##   GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## 1         TA          Y          0          61             0          0
## 2         TA          Y        298           0             0          0
## 3         TA          Y          0          42             0          0
## 4         TA          Y          0          35           272          0
## 5         TA          Y        192          84             0          0
## 6         TA          Y         40          30             0        320
##   ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold
## 1           0        0   <NA>  <NA>        <NA>       0      2   2008
## 2           0        0   <NA>  <NA>        <NA>       0      5   2007
## 3           0        0   <NA>  <NA>        <NA>       0      9   2008
## 4           0        0   <NA>  <NA>        <NA>       0      2   2006
## 5           0        0   <NA>  <NA>        <NA>       0     12   2008
## 6           0        0   <NA> MnPrv        Shed     700     10   2009
##   SaleType SaleCondition SalePrice
## 1       WD        Normal    208500
## 2       WD        Normal    181500
## 3       WD        Normal    223500
## 4       WD       Abnorml    140000
## 5       WD        Normal    250000
## 6       WD        Normal    143000

Next, split the data into a training set and a testing set.

## [1] 1095   81
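
The code that produced this split is not shown. A minimal sketch of one common way to do it follows, assuming a random 75% sample and a full data set of 1460 rows (both are assumptions, chosen because they give 1095 training rows). The `housing` data frame is a random placeholder, not the real data.

```r
# Stand-in for the full data set so the sketch runs on its own.
# (1460 rows is an assumption; the values are random placeholders.)
set.seed(123)
housing <- data.frame(
  Id        = 1:1460,
  SalePrice = round(runif(1460, min = 34900, max = 745000))
)

# Randomly pick 75% of the row indices for training (assumed proportion).
train_idx <- sample(seq_len(nrow(housing)), size = floor(0.75 * nrow(housing)))
train <- housing[train_idx, ]
test  <- housing[-train_idx, ]

dim(train)  # number of rows and columns in the training set
```

Setting the seed makes the split reproducible, which matters when comparing models later.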

The training set contains 1095 observations and 81 variables. To start, I will hypothesize the following subset of the variables as potential predictors.

Construct a new data frame consisting solely of these variables.

##      SalePrice LotArea PoolArea GarageCars TotRmsAbvGrd KitchenAbvGr
## 1350    122000    5250        0          0            8            1
## 784     165500    9101        0          2            4            1
## 685     221000   16770        0          2            7            1
## 421     206300    7060        0          4            8            2
## 1122    212900   10084        0          3            7            1
## 1125    163900    9125        0          2            7            1
##      GrLivArea BedroomAbvGr YearRemodAdd YearBuilt OverallCond
## 1350      2358            4         1987      1872           5
## 784       1110            1         1978      1978           6
## 685       1839            4         1998      1998           5
## 421       1344            2         1998      1997           5
## 1122      1552            3         2006      2005           5
## 1125      1482            3         1992      1992           5
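
Selecting the columns can be sketched as below. The column names are the ones printed above; the small `housing` data frame is typed in by hand from the first three rows of the earlier table so that the example is self-contained, and the name `keep` is only illustrative.

```r
# First three rows of a few columns from the table shown earlier,
# typed in by hand so the sketch is self-contained.
housing <- data.frame(
  Id = 1:3, Street = "Pave",
  SalePrice = c(208500, 181500, 223500), LotArea = c(8450, 9600, 11250),
  PoolArea = 0, GarageCars = 2, TotRmsAbvGrd = c(8, 6, 6), KitchenAbvGr = 1,
  GrLivArea = c(1710, 1262, 1786), BedroomAbvGr = 3,
  YearRemodAdd = c(2003, 1976, 2002), YearBuilt = c(2003, 1976, 2001),
  OverallCond = c(5, 8, 5)
)

# Response first, then the ten candidate predictors.
keep <- c("SalePrice", "LotArea", "PoolArea", "GarageCars", "TotRmsAbvGrd",
          "KitchenAbvGr", "GrLivArea", "BedroomAbvGr", "YearRemodAdd",
          "YearBuilt", "OverallCond")

# Keep only those columns, in that order.
housing_sub <- housing[, keep]
names(housing_sub)
```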

Report variables with missing values.

##    SalePrice      LotArea     PoolArea   GarageCars TotRmsAbvGrd 
##            0            0            0            0            0 
## KitchenAbvGr    GrLivArea BedroomAbvGr YearRemodAdd    YearBuilt 
##            0            0            0            0            0 
##  OverallCond 
##            0
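
A per-column count like the one above can be produced with `colSums(is.na(...))`. Whether the original used exactly this call is an assumption. The toy data frame below borrows a few values from the earlier table and has one `NA` planted on purpose.

```r
# Toy data frame; the NA in SalePrice is inserted deliberately.
toy <- data.frame(
  SalePrice  = c(208500, 181500, NA, 140000),
  LotArea    = c(8450, 9600, 11250, 9550),
  GarageCars = c(2, 2, 2, 3)
)

# is.na() gives a logical matrix; colSums() counts the TRUEs per column.
na_counts <- colSums(is.na(toy))
na_counts

# Names of the columns that have at least one missing value.
names(na_counts)[na_counts > 0]
```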

Summary statistics

##    SalePrice         LotArea          PoolArea         GarageCars   
##  Min.   : 34900   Min.   :  1300   Min.   :  0.000   Min.   :0.000  
##  1st Qu.:129000   1st Qu.:  7500   1st Qu.:  0.000   1st Qu.:1.000  
##  Median :164000   Median :  9452   Median :  0.000   Median :2.000  
##  Mean   :181598   Mean   : 10467   Mean   :  3.679   Mean   :1.764  
##  3rd Qu.:215600   3rd Qu.: 11500   3rd Qu.:  0.000   3rd Qu.:2.000  
##  Max.   :745000   Max.   :215245   Max.   :738.000   Max.   :4.000  
##   TotRmsAbvGrd     KitchenAbvGr     GrLivArea     BedroomAbvGr  
##  Min.   : 2.000   Min.   :0.000   Min.   : 334   Min.   :0.000  
##  1st Qu.: 5.000   1st Qu.:1.000   1st Qu.:1124   1st Qu.:2.000  
##  Median : 6.000   Median :1.000   Median :1458   Median :3.000  
##  Mean   : 6.493   Mean   :1.045   Mean   :1510   Mean   :2.848  
##  3rd Qu.: 7.000   3rd Qu.:1.000   3rd Qu.:1779   3rd Qu.:3.000  
##  Max.   :12.000   Max.   :3.000   Max.   :5642   Max.   :6.000  
##   YearRemodAdd    YearBuilt     OverallCond   
##  Min.   :1950   Min.   :1872   Min.   :1.000  
##  1st Qu.:1967   1st Qu.:1954   1st Qu.:5.000  
##  Median :1994   Median :1974   Median :5.000  
##  Mean   :1985   Mean   :1972   Mean   :5.555  
##  3rd Qu.:2004   3rd Qu.:2001   3rd Qu.:6.000  
##  Max.   :2010   Max.   :2009   Max.   :9.000

Before fitting my regression model I want to investigate how the variables are related to one another.

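We can see that some of the variables are heavily skewed. Linear regression does not strictly require normally distributed variables (the normality assumption applies to the residuals), but strong skew often produces outliers and poorly behaved residuals, so it is worth noting. The predictors should also not be strongly correlated with one another. “GrLivArea” and “TotRmsAbvGrd” clearly have a high correlation, so I will need to deal with them.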
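
Skew can also be put into a number rather than judged by eye. The helper below is a minimal sketch of the moment-based sample skewness. The function name `skewness` and the two example vectors are made up for illustration. The second vector imitates a variable like PoolArea, where the median is 0 and the maximum is 738.

```r
# Sample skewness: third central moment divided by the cubed standard deviation.
skewness <- function(x) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^3) / sd(x)^3
}

# Made-up vectors for illustration only.
symmetric  <- c(1, 2, 3, 4, 5)
right_tail <- c(0, 0, 0, 0, 738)

skewness(symmetric)   # zero for a perfectly symmetric sample
skewness(right_tail)  # positive: long right tail
```

A histogram via `hist()` of each column gives the same information visually.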

Fit the linear model

## 
## Call:
## lm(formula = SalePrice ~ LotArea + PoolArea + GarageCars + TotRmsAbvGrd + 
##     KitchenAbvGr + GrLivArea + BedroomAbvGr + YearRemodAdd + 
##     YearBuilt + OverallCond, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -434633  -21648   -3194   16980  294504 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.823e+06  1.451e+05 -12.569  < 2e-16 ***
## LotArea       7.174e-01  1.317e-01   5.449 6.28e-08 ***
## PoolArea     -1.522e+01  2.777e+01  -0.548   0.5838    
## GarageCars    2.107e+04  2.288e+03   9.207  < 2e-16 ***
## TotRmsAbvGrd  6.789e+03  1.672e+03   4.060 5.26e-05 ***
## KitchenAbvGr -4.215e+04  6.195e+03  -6.805 1.67e-11 ***
## GrLivArea     7.598e+01  4.700e+00  16.165  < 2e-16 ***
## BedroomAbvGr -1.605e+04  2.198e+03  -7.303 5.44e-13 ***
## YearRemodAdd  2.115e+02  8.779e+01   2.410   0.0161 *  
## YearBuilt     7.235e+02  6.875e+01  10.523  < 2e-16 ***
## OverallCond   7.940e+03  1.356e+03   5.853 6.39e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 41240 on 1084 degrees of freedom
## Multiple R-squared:  0.737,  Adjusted R-squared:  0.7346 
## F-statistic: 303.7 on 10 and 1084 DF,  p-value: < 2.2e-16

Interpret the output:

An R-squared of 0.737 tells us that approximately 74% of the variation in sale price is explained by my model.

The F-statistic and its p-value test the overall significance of the model, i.e. whether at least one predictor has a nonzero coefficient.

The residual standard error gives an idea of how far the observed sale prices are from the predicted (fitted) sale prices.

The intercept is the estimated sale price for a house with all the predictors at zero. Since no house has, for example, a build year of zero, it has no meaningful interpretation here.

The slope for “GrLivArea” (7.598e+01) is the effect of above-grade living area (in square feet) on sale price, adjusting or controlling for the other variables; i.e., we associate an increase of 1 square foot in above-grade living area with an increase of $75.98 in sale price, holding the other variables fixed.
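
Several of these numbers can be checked by hand from the printed summary. The sketch below recomputes the adjusted R-squared and the F-statistic from the rounded R-squared, and scales the GrLivArea slope to a 100 square foot change. The 100 square feet is an arbitrary illustrative choice, and small differences from the printed values come from using the rounded R-squared.

```r
r2 <- 0.737   # Multiple R-squared, as printed (rounded)
n  <- 1095    # training observations
p  <- 10      # predictors in the first model

# Adjusted R-squared penalises R-squared for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Overall F-statistic: explained variation per predictor over
# unexplained variation per residual degree of freedom.
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))

# Expected price difference for 100 extra square feet of living area,
# holding the other predictors fixed.
slope_grlivarea <- 75.98
extra_sqft      <- 100
price_change    <- slope_grlivarea * extra_sqft

round(c(adj_r2 = adj_r2, f_stat = f_stat, price_change = price_change), 4)
```

The results should land close to the printed 0.7346 and 303.7, and the price change works out to roughly $7,598.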

Stepwise Procedure

I use backward elimination: remove the predictor with the largest p-value above 0.05, then refit. In this case, I remove “PoolArea” first, then fit the model again.

## 
## Call:
## lm(formula = SalePrice ~ LotArea + GarageCars + TotRmsAbvGrd + 
##     KitchenAbvGr + GrLivArea + BedroomAbvGr + YearRemodAdd + 
##     YearBuilt + OverallCond, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -440086  -21728   -3086   16994  287342 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.826e+06  1.449e+05 -12.601  < 2e-16 ***
## LotArea       7.153e-01  1.316e-01   5.437 6.69e-08 ***
## GarageCars    2.114e+04  2.284e+03   9.255  < 2e-16 ***
## TotRmsAbvGrd  6.888e+03  1.662e+03   4.144 3.67e-05 ***
## KitchenAbvGr -4.212e+04  6.192e+03  -6.802 1.70e-11 ***
## GrLivArea     7.543e+01  4.590e+00  16.433  < 2e-16 ***
## BedroomAbvGr -1.608e+04  2.196e+03  -7.321 4.80e-13 ***
## YearRemodAdd  2.135e+02  8.769e+01   2.434   0.0151 *  
## YearBuilt     7.231e+02  6.873e+01  10.521  < 2e-16 ***
## OverallCond   7.930e+03  1.356e+03   5.849 6.56e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 41230 on 1085 degrees of freedom
## Multiple R-squared:  0.7369, Adjusted R-squared:  0.7347 
## F-statistic: 337.7 on 9 and 1085 DF,  p-value: < 2.2e-16

After eliminating “PoolArea”, R-squared is almost identical (0.7369 vs. 0.737) and adjusted R-squared improves slightly (0.7347 vs. 0.7346). All remaining predictors are significant at the 0.05 level, so I will keep this as my working model.
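
One elimination step can be sketched with `update()`. The data here are simulated (the coefficients, noise level and the `sim` data frame are all made up), with `PoolArea` generated as pure noise so that it plays the role of the weak predictor. This is an illustration of the procedure, not a reproduction of the fit above.

```r
# Simulated stand-in data; PoolArea has no real effect on SalePrice here.
set.seed(1)
n <- 300
sim <- data.frame(
  GrLivArea  = runif(n, 600, 3000),
  GarageCars = sample(0:3, n, replace = TRUE),
  PoolArea   = rexp(n, rate = 1 / 50)
)
sim$SalePrice <- 20000 + 75 * sim$GrLivArea + 20000 * sim$GarageCars +
  rnorm(n, sd = 40000)

fit_full <- lm(SalePrice ~ GrLivArea + GarageCars + PoolArea, data = sim)

# p-values of the slopes (the intercept row is dropped).
pvals <- coef(summary(fit_full))[-1, "Pr(>|t|)"]
worst <- names(which.max(pvals))

# Drop the weakest predictor only if it is not significant at 0.05.
if (pvals[[worst]] > 0.05) {
  fit_reduced <- update(fit_full, as.formula(paste(". ~ . -", worst)))
} else {
  fit_reduced <- fit_full
}

formula(fit_reduced)
```

Repeating this step until every remaining p-value is below 0.05 gives the full backward elimination.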

However, as seen earlier, two variables, “GrLivArea” and “TotRmsAbvGrd”, are highly correlated. This multicollinearity means that we should not directly interpret the GrLivArea slope as the effect of GrLivArea on sale price adjusting for TotRmsAbvGrd. The two effects are somewhat bound together. Their correlation is:

## [1] 0.826969
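
To get a feel for what a correlation of this size means, the sketch below turns it into shared variance and into a pairwise variance inflation factor (VIF). It is the VIF that would apply if these were the only two predictors. In the full model the VIF can only be equal or larger, so treat this as a lower bound rather than the model's actual VIF.

```r
r <- 0.826969   # correlation between GrLivArea and TotRmsAbvGrd, from above

# Share of variance in one variable that is linearly explained by the other.
r_squared_pair <- r^2

# VIF for a model containing only these two predictors: 1 / (1 - R^2).
vif_pair <- 1 / (1 - r_squared_pair)

round(c(shared_variance = r_squared_pair, vif_pair = vif_pair), 3)
```

Roughly two thirds of the variance is shared, which is why the two slopes are hard to separate.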

Create a confidence interval for the model coefficients

##                      2.5 %        97.5 %
## (Intercept)  -2.110407e+06 -1.541712e+06
## LotArea       4.571854e-01  9.734940e-01
## GarageCars    1.665678e+04  2.561958e+04
## TotRmsAbvGrd  3.626750e+03  1.014894e+04
## KitchenAbvGr -5.426898e+04 -2.996842e+04
## GrLivArea     6.642123e+01  8.443367e+01
## BedroomAbvGr -2.038677e+04 -1.176829e+04
## YearRemodAdd  4.141973e+01  3.855420e+02
## YearBuilt     5.882452e+02  8.579526e+02
## OverallCond   5.269715e+03  1.059089e+04

For example, in the second model the estimated slope for GrLivArea is 75.43, and I am 95% confident that the true slope is between 66.42 and 84.43.
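
The interval can be rebuilt by hand from the coefficient table: estimate plus or minus the critical t value times the standard error. The sketch below uses the rounded numbers printed in the second summary, so it matches the interval above only up to rounding.

```r
estimate  <- 75.43   # GrLivArea slope, second model
std_error <- 4.590   # its standard error
df_resid  <- 1085    # residual degrees of freedom

# Two-sided 95% interval uses the 97.5% quantile of the t distribution.
t_crit <- qt(0.975, df = df_resid)

ci <- estimate + c(-1, 1) * t_crit * std_error
round(ci, 2)
```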

Check the diagnostic plots for the model

The relationship between the predictor variables and the outcome variable is approximately linear. There are three extreme cases (outliers).

It looks like I don’t have to be too concerned, although two observations, numbered 524 and 1299, look a little off.

This plot shows the distribution of residuals around the linear model in relation to the sale price. Most of the houses in the data are in the lower and median price range; the higher the price, the fewer the observations.

This plot helps us find influential cases, if any. Not all outliers are influential in linear regression analysis. It looks like none of the outliers in my model are influential.
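
The plots themselves are not reproduced here. A minimal sketch of how such diagnostics are usually drawn in R follows, using simulated data (the `sim` data frame and its coefficients are made up). Calling `plot()` on an `lm` object draws four panels by default: Residuals vs Fitted, Normal Q-Q, Scale-Location and Residuals vs Leverage. That the four paragraphs above correspond to these panels in this order is an assumption.

```r
# Simulated stand-in data and model.
set.seed(2)
n <- 200
sim <- data.frame(GrLivArea = runif(n, 600, 3000))
sim$SalePrice <- 20000 + 75 * sim$GrLivArea + rnorm(n, sd = 40000)
fit <- lm(SalePrice ~ GrLivArea, data = sim)

# Draw the four default diagnostic panels in a 2 x 2 grid,
# then restore the previous graphics settings.
op <- par(mfrow = c(2, 2))
plot(fit)
par(op)

# Cook's distance measures how much each observation moves the fit.
cooks   <- cooks.distance(fit)
flagged <- which(cooks > 4 / n)  # 4/n is a common rule of thumb, not a hard cutoff
flagged
```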

Testing the prediction model

Look at the first few predicted values and compare them to the values of SalePrice in the test data set.

##         4        10        16        20        22        34 
## 176501.95  65626.55 130650.30 135358.79 102336.97 161300.96
## [1] 140000 118000 132000 139000 139400 165500

Finally, calculate R-squared for the model’s predictions on the test data set. R-squared is a common metric for evaluating goodness of fit; higher is better, with 1 being the best.

## [1] 0.7421962
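
How the test R-squared was computed is not shown. One common definition is 1 minus the ratio of the residual sum of squares to the total sum of squares on the test set, sketched below on simulated data. The helper name `r_squared` and the data are illustrative. Another convention is the squared correlation between predicted and actual values, which can give a different number.

```r
# Out-of-sample R-squared: 1 - SSE / SST, with SST taken around the test mean.
r_squared <- function(actual, predicted) {
  sse <- sum((actual - predicted)^2)
  sst <- sum((actual - mean(actual))^2)
  1 - sse / sst
}

# Simulated stand-in data, split into 300 training and 100 test rows.
set.seed(3)
n <- 400
sim <- data.frame(GrLivArea = runif(n, 600, 3000))
sim$SalePrice <- 20000 + 75 * sim$GrLivArea + rnorm(n, sd = 40000)
idx   <- sample(n, size = 300)
train <- sim[idx, ]
test  <- sim[-idx, ]

# Fit on the training rows only, then predict the held-out rows.
fit  <- lm(SalePrice ~ GrLivArea, data = train)
pred <- predict(fit, newdata = test)

head(pred)
r2_test <- r_squared(test$SalePrice, pred)
r2_test
```

With this definition the value can be negative when the model predicts worse than the test-set mean, so it is not bounded below by 0 the way the training R-squared is.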