The Ames Housing dataset was downloaded from Kaggle. It is the dataset of a playground competition, and my task is to predict house prices from house-level features using a multiple linear regression model in R.
## Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 1 1 60 RL 65 8450 Pave <NA> Reg
## 2 2 20 RL 80 9600 Pave <NA> Reg
## 3 3 60 RL 68 11250 Pave <NA> IR1
## 4 4 70 RL 60 9550 Pave <NA> IR1
## 5 5 60 RL 84 14260 Pave <NA> IR1
## 6 6 50 RL 85 14115 Pave <NA> IR1
## LandContour Utilities LotConfig LandSlope Neighborhood Condition1
## 1 Lvl AllPub Inside Gtl CollgCr Norm
## 2 Lvl AllPub FR2 Gtl Veenker Feedr
## 3 Lvl AllPub Inside Gtl CollgCr Norm
## 4 Lvl AllPub Corner Gtl Crawfor Norm
## 5 Lvl AllPub FR2 Gtl NoRidge Norm
## 6 Lvl AllPub Inside Gtl Mitchel Norm
## Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt
## 1 Norm 1Fam 2Story 7 5 2003
## 2 Norm 1Fam 1Story 6 8 1976
## 3 Norm 1Fam 2Story 7 5 2001
## 4 Norm 1Fam 2Story 7 5 1915
## 5 Norm 1Fam 2Story 8 5 2000
## 6 Norm 1Fam 1.5Fin 5 5 1993
## YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
## 1 2003 Gable CompShg VinylSd VinylSd BrkFace
## 2 1976 Gable CompShg MetalSd MetalSd None
## 3 2002 Gable CompShg VinylSd VinylSd BrkFace
## 4 1970 Gable CompShg Wd Sdng Wd Shng None
## 5 2000 Gable CompShg VinylSd VinylSd BrkFace
## 6 1995 Gable CompShg VinylSd VinylSd None
## MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure
## 1 196 Gd TA PConc Gd TA No
## 2 0 TA TA CBlock Gd TA Gd
## 3 162 Gd TA PConc Gd TA Mn
## 4 0 TA TA BrkTil TA Gd No
## 5 350 Gd TA PConc Gd TA Av
## 6 0 TA TA Wood Gd TA No
## BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 1 GLQ 706 Unf 0 150 856
## 2 ALQ 978 Unf 0 284 1262
## 3 GLQ 486 Unf 0 434 920
## 4 ALQ 216 Unf 0 540 756
## 5 GLQ 655 Unf 0 490 1145
## 6 GLQ 732 Unf 0 64 796
## Heating HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF
## 1 GasA Ex Y SBrkr 856 854 0
## 2 GasA Ex Y SBrkr 1262 0 0
## 3 GasA Ex Y SBrkr 920 866 0
## 4 GasA Gd Y SBrkr 961 756 0
## 5 GasA Ex Y SBrkr 1145 1053 0
## 6 GasA Ex Y SBrkr 796 566 0
## GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr
## 1 1710 1 0 2 1 3
## 2 1262 0 1 2 0 3
## 3 1786 1 0 2 1 3
## 4 1717 1 0 1 0 3
## 5 2198 1 0 2 1 4
## 6 1362 1 0 1 1 1
## KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu
## 1 1 Gd 8 Typ 0 <NA>
## 2 1 TA 6 Typ 1 TA
## 3 1 Gd 6 Typ 1 TA
## 4 1 Gd 7 Typ 1 Gd
## 5 1 Gd 9 Typ 1 TA
## 6 1 TA 5 Typ 0 <NA>
## GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## 1 Attchd 2003 RFn 2 548 TA
## 2 Attchd 1976 RFn 2 460 TA
## 3 Attchd 2001 RFn 2 608 TA
## 4 Detchd 1998 Unf 3 642 TA
## 5 Attchd 2000 RFn 3 836 TA
## 6 Attchd 1993 Unf 2 480 TA
## GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## 1 TA Y 0 61 0 0
## 2 TA Y 298 0 0 0
## 3 TA Y 0 42 0 0
## 4 TA Y 0 35 272 0
## 5 TA Y 192 84 0 0
## 6 TA Y 40 30 0 320
## ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold
## 1 0 0 <NA> <NA> <NA> 0 2 2008
## 2 0 0 <NA> <NA> <NA> 0 5 2007
## 3 0 0 <NA> <NA> <NA> 0 9 2008
## 4 0 0 <NA> <NA> <NA> 0 2 2006
## 5 0 0 <NA> <NA> <NA> 0 12 2008
## 6 0 0 <NA> MnPrv Shed 700 10 2009
## SaleType SaleCondition SalePrice
## 1 WD Normal 208500
## 2 WD Normal 181500
## 3 WD Normal 223500
## 4 WD Abnorml 140000
## 5 WD Normal 250000
## 6 WD Normal 143000
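The preview above could be produced along the following lines. This is a minimal sketch, assuming the Kaggle training file has been saved locally as `train.csv`; the file name, the helper `load_housing` and the object name `housing` are assumptions, not taken from the original.

```r
# Hypothetical helper: read the Kaggle CSV into a data frame.
load_housing <- function(path) {
  # stringsAsFactors = TRUE turns text columns such as MSZoning into factors.
  # read.csv() also makes column names syntactic, so "1stFlrSF" becomes
  # "X1stFlrSF", and the string "NA" is read as a missing value.
  read.csv(path, stringsAsFactors = TRUE)
}

# Only runs if the assumed file is present in the working directory.
if (file.exists("train.csv")) {
  housing <- load_housing("train.csv")
  print(head(housing))  # first six rows
}
```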
Next, split the data into a training set and a testing set.
## [1] 1095 81
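The 1095 rows reported above are consistent with a 75/25 random split of the 1460 rows in the Kaggle training file. A sketch of such a split follows; the 75% ratio, the seed and the object name `housing` are assumptions.

```r
set.seed(42)                       # assumed seed, only for reproducibility
n_rows <- 1460                     # rows in the Kaggle training file
train_idx <- sample(n_rows, size = floor(0.75 * n_rows))
test_idx  <- setdiff(seq_len(n_rows), train_idx)

length(train_idx)  # 1095
length(test_idx)   # 365

# With the full data frame loaded as `housing` (assumed name):
# train <- housing[train_idx, ]
# test  <- housing[test_idx, ]
```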
The training set contains 1095 observations and 81 variables. To start, I will hypothesize the following subset of the variables as potential predictors.
SalePrice - the property’s sale price in dollars. This is the target variable that I am trying to predict.
OverallCond - Overall condition rating
YearBuilt - Original construction date
YearRemodAdd - Remodel date
BedroomAbvGr - Number of bedrooms above basement level
GrLivArea - Above grade (ground) living area square feet
KitchenAbvGr - Number of kitchens above grade
TotRmsAbvGrd - Total rooms above grade (does not include bathrooms)
GarageCars - Size of garage in car capacity
PoolArea - Pool area in square feet
LotArea - Lot size in square feet
Construct a new data frame consisting solely of these variables.
## SalePrice LotArea PoolArea GarageCars TotRmsAbvGrd KitchenAbvGr
## 1350 122000 5250 0 0 8 1
## 784 165500 9101 0 2 4 1
## 685 221000 16770 0 2 7 1
## 421 206300 7060 0 4 8 2
## 1122 212900 10084 0 3 7 1
## 1125 163900 9125 0 2 7 1
## GrLivArea BedroomAbvGr YearRemodAdd YearBuilt OverallCond
## 1350 2358 4 1987 1872 5
## 784 1110 1 1978 1978 6
## 685 1839 4 1998 1998 5
## 421 1344 2 1998 1997 5
## 1122 1552 3 2006 2005 5
## 1125 1482 3 1992 1992 5
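Selecting the columns can be sketched as below. The column names come from the list above; `train_full` is a toy two-row stand-in (made-up values) so that the snippet runs on its own.

```r
# Column names taken from the variable list above.
vars <- c("SalePrice", "LotArea", "PoolArea", "GarageCars", "TotRmsAbvGrd",
          "KitchenAbvGr", "GrLivArea", "BedroomAbvGr", "YearRemodAdd",
          "YearBuilt", "OverallCond")

# Toy stand-in for the training data: two rows, plus an extra column to drop.
train_full <- as.data.frame(matrix(1:24, nrow = 2,
                                   dimnames = list(NULL, c(vars, "Id"))))

train <- train_full[, vars]   # keep only the hypothesized variables
names(train)
```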
Report variables with missing values.
## SalePrice LotArea PoolArea GarageCars TotRmsAbvGrd
## 0 0 0 0 0
## KitchenAbvGr GrLivArea BedroomAbvGr YearRemodAdd YearBuilt
## 0 0 0 0 0
## OverallCond
## 0
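A per-column count of missing values like the one above can be obtained with `colSums(is.na(...))`. The small data frame here is illustrative only; the missing `LotArea` entry is made up to show a non-zero count.

```r
# Toy data frame with one deliberately missing entry (illustrative values).
toy <- data.frame(SalePrice = c(208500, 181500, 223500),
                  LotArea   = c(8450, NA, 11250))

# is.na() gives a logical matrix; colSums() counts the TRUEs per column.
na_counts <- colSums(is.na(toy))
na_counts
```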
Summary statistics
## SalePrice LotArea PoolArea GarageCars
## Min. : 34900 Min. : 1300 Min. : 0.000 Min. :0.000
## 1st Qu.:129000 1st Qu.: 7500 1st Qu.: 0.000 1st Qu.:1.000
## Median :164000 Median : 9452 Median : 0.000 Median :2.000
## Mean :181598 Mean : 10467 Mean : 3.679 Mean :1.764
## 3rd Qu.:215600 3rd Qu.: 11500 3rd Qu.: 0.000 3rd Qu.:2.000
## Max. :745000 Max. :215245 Max. :738.000 Max. :4.000
## TotRmsAbvGrd KitchenAbvGr GrLivArea BedroomAbvGr
## Min. : 2.000 Min. :0.000 Min. : 334 Min. :0.000
## 1st Qu.: 5.000 1st Qu.:1.000 1st Qu.:1124 1st Qu.:2.000
## Median : 6.000 Median :1.000 Median :1458 Median :3.000
## Mean : 6.493 Mean :1.045 Mean :1510 Mean :2.848
## 3rd Qu.: 7.000 3rd Qu.:1.000 3rd Qu.:1779 3rd Qu.:3.000
## Max. :12.000 Max. :3.000 Max. :5642 Max. :6.000
## YearRemodAdd YearBuilt OverallCond
## Min. :1950 Min. :1872 Min. :1.000
## 1st Qu.:1967 1st Qu.:1954 1st Qu.:5.000
## Median :1994 Median :1974 Median :5.000
## Mean :1985 Mean :1972 Mean :5.555
## 3rd Qu.:2004 3rd Qu.:2001 3rd Qu.:6.000
## Max. :2010 Max. :2009 Max. :9.000
Before fitting my regression model I want to investigate how the variables are related to one another.
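We can see that some of the variables are heavily skewed. Linear regression does not strictly require normally distributed variables (the normality assumption applies to the residuals), but strongly skewed variables often produce skewed residuals, so this is worth keeping in mind. The predictors should also not be strongly correlated with one another. “GrLivArea” and “TotRmsAbvGrd” clearly have a high correlation, so I will need to deal with these.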
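One way to check the relationships numerically is a correlation matrix. The sketch below uses simulated stand-in data, not the Ames values, and the 0.8 threshold is an arbitrary choice for illustration.

```r
set.seed(1)  # simulated stand-in data, not the Ames values
gr_liv_area <- rnorm(200, mean = 1500, sd = 500)
toy <- data.frame(
  GrLivArea    = gr_liv_area,
  TotRmsAbvGrd = round(gr_liv_area / 230 + rnorm(200)),  # tied to living area
  LotArea      = rnorm(200, mean = 10000, sd = 3000)     # unrelated
)

cor_mat <- round(cor(toy), 2)   # pairwise Pearson correlations
cor_mat

# Pairs whose absolute correlation exceeds the (arbitrary) 0.8 threshold.
high <- which(abs(cor_mat) > 0.8 & upper.tri(cor_mat), arr.ind = TRUE)
data.frame(var1 = rownames(cor_mat)[high[, "row"]],
           var2 = colnames(cor_mat)[high[, "col"]])
```

On the real data, `cor(train)` would give the same kind of matrix, and `pairs(train)` a visual counterpart.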
##
## Call:
## lm(formula = SalePrice ~ LotArea + PoolArea + GarageCars + TotRmsAbvGrd +
## KitchenAbvGr + GrLivArea + BedroomAbvGr + YearRemodAdd +
## YearBuilt + OverallCond, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -434633 -21648 -3194 16980 294504
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.823e+06 1.451e+05 -12.569 < 2e-16 ***
## LotArea 7.174e-01 1.317e-01 5.449 6.28e-08 ***
## PoolArea -1.522e+01 2.777e+01 -0.548 0.5838
## GarageCars 2.107e+04 2.288e+03 9.207 < 2e-16 ***
## TotRmsAbvGrd 6.789e+03 1.672e+03 4.060 5.26e-05 ***
## KitchenAbvGr -4.215e+04 6.195e+03 -6.805 1.67e-11 ***
## GrLivArea 7.598e+01 4.700e+00 16.165 < 2e-16 ***
## BedroomAbvGr -1.605e+04 2.198e+03 -7.303 5.44e-13 ***
## YearRemodAdd 2.115e+02 8.779e+01 2.410 0.0161 *
## YearBuilt 7.235e+02 6.875e+01 10.523 < 2e-16 ***
## OverallCond 7.940e+03 1.356e+03 5.853 6.39e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 41240 on 1084 degrees of freedom
## Multiple R-squared: 0.737, Adjusted R-squared: 0.7346
## F-statistic: 303.7 on 10 and 1084 DF, p-value: < 2.2e-16
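The call shown in the output is a plain `lm()` fit followed by `summary()`. A self-contained sketch of the same pattern on simulated data follows; the data and the coefficients used to generate it are made up.

```r
set.seed(2)  # simulated stand-in data; generating coefficients are made up
n <- 300
toy <- data.frame(GrLivArea  = runif(n, 500, 3000),
                  GarageCars = sample(0:3, n, replace = TRUE))
toy$SalePrice <- 20000 + 75 * toy$GrLivArea + 20000 * toy$GarageCars +
  rnorm(n, sd = 30000)

# Multiple linear regression of price on two predictors.
fit <- lm(SalePrice ~ GrLivArea + GarageCars, data = toy)
summary(fit)   # coefficients, residual standard error, R-squared, F-statistic
```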
Interpreting the output:
An R-squared of 0.737 tells us that approximately 74% of the variation in sale price can be explained by my model.
The F-statistic and its p-value test the overall significance of my model, i.e. whether at least one predictor has a non-zero coefficient.
The residual standard error (41240) gives an idea of how far the observed sale prices are from the predicted (fitted) sale prices.
The intercept is the estimated sale price for a house with all predictors equal to zero. Such a house cannot exist (YearBuilt would be 0), so the intercept has no meaningful interpretation.
The slope for “GrLivArea” (7.598e+01) is the effect of above grade living area in square feet on sale price, adjusting or controlling for the other variables, i.e. we associate an increase of 1 square foot in above grade living area with an increase of $75.98 in sale price, holding the other variables fixed.
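As a worked example of the slope interpretation, consider two houses that differ only in living area. The 100 square foot difference is a hypothetical value chosen for illustration.

```r
slope_gr_liv_area <- 75.98   # GrLivArea estimate from the first model
extra_sqft <- 100            # hypothetical difference between two houses

# Expected difference in sale price, all other predictors held equal.
price_difference <- slope_gr_liv_area * extra_sqft
price_difference             # about 7598 dollars
```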
Next, I use backward elimination: remove the predictor with the largest p-value above 0.05, then fit the model again. In this case, “PoolArea” (p = 0.5838) is removed first.
##
## Call:
## lm(formula = SalePrice ~ LotArea + GarageCars + TotRmsAbvGrd +
## KitchenAbvGr + GrLivArea + BedroomAbvGr + YearRemodAdd +
## YearBuilt + OverallCond, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -440086 -21728 -3086 16994 287342
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.826e+06 1.449e+05 -12.601 < 2e-16 ***
## LotArea 7.153e-01 1.316e-01 5.437 6.69e-08 ***
## GarageCars 2.114e+04 2.284e+03 9.255 < 2e-16 ***
## TotRmsAbvGrd 6.888e+03 1.662e+03 4.144 3.67e-05 ***
## KitchenAbvGr -4.212e+04 6.192e+03 -6.802 1.70e-11 ***
## GrLivArea 7.543e+01 4.590e+00 16.433 < 2e-16 ***
## BedroomAbvGr -1.608e+04 2.196e+03 -7.321 4.80e-13 ***
## YearRemodAdd 2.135e+02 8.769e+01 2.434 0.0151 *
## YearBuilt 7.231e+02 6.873e+01 10.521 < 2e-16 ***
## OverallCond 7.930e+03 1.356e+03 5.849 6.56e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 41230 on 1085 degrees of freedom
## Multiple R-squared: 0.7369, Adjusted R-squared: 0.7347
## F-statistic: 337.7 on 9 and 1085 DF, p-value: < 2.2e-16
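One elimination step can be sketched with `update()`, which refits a model with a modified formula. The data here are simulated so that `PoolArea` has no true effect; whether the original used `update()` or retyped the formula is not stated.

```r
set.seed(3)  # simulated stand-in data; PoolArea has no true effect here
n <- 300
toy <- data.frame(GrLivArea = runif(n, 500, 3000),
                  PoolArea  = rbinom(n, 1, 0.05) * 500)
toy$SalePrice <- 20000 + 75 * toy$GrLivArea + rnorm(n, sd = 30000)

full <- lm(SalePrice ~ GrLivArea + PoolArea, data = toy)
summary(full)$coefficients[, "Pr(>|t|)"]   # inspect the p-values

# Drop PoolArea and refit; ". ~ ." means "same response, same predictors".
reduced <- update(full, . ~ . - PoolArea)

# Compare adjusted R-squared before and after the removal.
c(full    = summary(full)$adj.r.squared,
  reduced = summary(reduced)$adj.r.squared)
```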
After eliminating “PoolArea”, R-squared is almost identical (0.7369 versus 0.737) and adjusted R-squared improves slightly (0.7347 versus 0.7346). At this point, I think I can start building the model.
However, as you have seen earlier, two variables, “GrLivArea” and “TotRmsAbvGrd”, are highly correlated. This multicollinearity means that we should not directly interpret the coefficient of GrLivArea as the effect of GrLivArea on sale price adjusting for TotRmsAbvGrd. The two effects are somewhat bound together.
## [1] 0.826969
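The value above looks like the pairwise correlation between the two variables, which `cor()` returns. A related diagnostic, not used in the original, is the variance inflation factor (VIF). The sketch below computes both by hand on simulated stand-in data.

```r
set.seed(4)  # simulated stand-in data, not the Ames values
n <- 300
gr_liv_area <- runif(n, 500, 3000)
toy <- data.frame(GrLivArea    = gr_liv_area,
                  TotRmsAbvGrd = round(gr_liv_area / 230 + rnorm(n)),
                  GarageCars   = sample(0:3, n, replace = TRUE))

# Pairwise Pearson correlation between the two related predictors.
pair_cor <- cor(toy$GrLivArea, toy$TotRmsAbvGrd)
pair_cor

# VIF for GrLivArea: regress it on the other predictors, then 1 / (1 - R^2).
r2 <- summary(lm(GrLivArea ~ TotRmsAbvGrd + GarageCars, data = toy))$r.squared
vif_gr_liv_area <- 1 / (1 - r2)
vif_gr_liv_area
```

A VIF close to 1 indicates little collinearity; larger values mean the coefficient's variance is inflated by correlation with the other predictors.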
## 2.5 % 97.5 %
## (Intercept) -2.110407e+06 -1.541712e+06
## LotArea 4.571854e-01 9.734940e-01
## GarageCars 1.665678e+04 2.561958e+04
## TotRmsAbvGrd 3.626750e+03 1.014894e+04
## KitchenAbvGr -5.426898e+04 -2.996842e+04
## GrLivArea 6.642123e+01 8.443367e+01
## BedroomAbvGr -2.038677e+04 -1.176829e+04
## YearRemodAdd 4.141973e+01 3.855420e+02
## YearBuilt 5.882452e+02 8.579526e+02
## OverallCond 5.269715e+03 1.059089e+04
For example, in the second model the estimated slope for GrLivArea is 75.43, and I am 95% confident that the true slope is between 66.42 and 84.43.
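The interval can be reproduced by hand from the printed estimate, its standard error and the residual degrees of freedom of the second model. On a fitted model object, `confint()` does the same calculation for every coefficient.

```r
estimate  <- 75.43    # GrLivArea slope, second model
std_error <- 4.590    # its standard error
df_resid  <- 1085     # residual degrees of freedom

t_crit <- qt(0.975, df = df_resid)   # two-sided 95% critical value
ci <- estimate + c(-1, 1) * t_crit * std_error
round(ci, 2)
```

The result agrees with the reported interval up to rounding, because the printed estimate and standard error are themselves rounded.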
The relationship between the predictor variables and the outcome variable is approximately linear. There are three extreme cases (outliers).
It looks like I don’t have to be concerned too much, although two observations numbered as 524 and 1299 look a little off.
This plot shows how the residuals are spread around the linear model in relation to the sale price. Most of the houses in the data are in the lower and median price range; the higher the price, the fewer the observations.
This plot helps us to find influential cases if any. Not all outliers are influential in linear regression analysis. It looks like none of the outliers in my model are influential.
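The four descriptions above most likely correspond to the standard diagnostic plots that `plot()` draws for an `lm` object. The sketch below shows them on simulated data and adds Cook's distance as a numeric check for influential cases; the cut-off of 1 is a common rule of thumb, not a strict test.

```r
set.seed(5)  # simulated stand-in data
n <- 200
toy <- data.frame(GrLivArea = runif(n, 500, 3000))
toy$SalePrice <- 20000 + 75 * toy$GrLivArea + rnorm(n, sd = 30000)
fit <- lm(SalePrice ~ GrLivArea, data = toy)

# Residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage.
old_par <- par(mfrow = c(2, 2))
plot(fit)
par(old_par)

# Numeric counterpart to the leverage plot: Cook's distance per observation.
cooks <- cooks.distance(fit)
which(cooks > 1)   # observations flagged by the rule of thumb, if any
```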
Look at the first few predicted values and compare them to the values of SalePrice in the test data set.
## 4 10 16 20 22 34
## 176501.95 65626.55 130650.30 135358.79 102336.97 161300.96
## [1] 140000 118000 132000 139000 139400 165500
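Predictions for held-out rows come from `predict()` with `newdata`. The sketch below uses simulated data; the 300/100 split and all object names are illustrative.

```r
set.seed(6)  # simulated stand-in data
n <- 400
toy <- data.frame(GrLivArea = runif(n, 500, 3000))
toy$SalePrice <- 20000 + 75 * toy$GrLivArea + rnorm(n, sd = 30000)

train_idx <- sample(n, size = 300)
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

fit <- lm(SalePrice ~ GrLivArea, data = train)
prediction <- predict(fit, newdata = test)

head(prediction)       # predicted prices, named by test-set row
head(test$SalePrice)   # actual prices for the same rows
```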
Finally, I calculate the value of R-squared for the predictions on the test data set. R-squared is a common metric for evaluating the goodness of fit of a model: higher is better, with 1 being the best.
## [1] 0.7421962
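One common definition of R-squared on held-out data is 1 minus the ratio of the residual sum of squares to the total sum of squares. The original does not say which formula was used, so the helper `r_squared` below is an assumption, shown on toy numbers.

```r
# Out-of-sample R-squared: 1 - SSE / SST, with SST taken around the mean
# of the actual values.
r_squared <- function(actual, predicted) {
  sse <- sum((actual - predicted)^2)
  sst <- sum((actual - mean(actual))^2)
  1 - sse / sst
}

# Toy check: SSE = 300, SST = 20000.
actual    <- c(100, 200, 300)
predicted <- c(110, 190, 310)
r_squared(actual, predicted)   # 0.985
```

With the objects from the analysis this would be called as `r_squared(test$SalePrice, prediction)`, where both names are assumed.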