Using R, build a multiple regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?

1 Data Acquisition

Let’s explore a real estate data set to see how certain aspects of a home influence its price. We’ll use multiple regression to estimate what those effects may be.

library(RCurl)
## Loading required package: bitops
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 3.4.4
library(knitr)
## Warning: package 'knitr' was built under R version 3.4.4
data_url <- 'https://raw.githubusercontent.com/niteen11/CUNY_DATA_605/master/dataset/train.csv'
home_data <- read.csv(data_url, stringsAsFactors = FALSE)
kable(head(home_data))
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
1 60 RL 65 8450 Pave NA Reg Lvl AllPub Inside Gtl CollgCr Norm Norm 1Fam 2Story 7 5 2003 2003 Gable CompShg VinylSd VinylSd BrkFace 196 Gd TA PConc Gd TA No GLQ 706 Unf 0 150 856 GasA Ex Y SBrkr 856 854 0 1710 1 0 2 1 3 1 Gd 8 Typ 0 NA Attchd 2003 RFn 2 548 TA TA Y 0 61 0 0 0 0 NA NA NA 0 2 2008 WD Normal 208500
2 20 RL 80 9600 Pave NA Reg Lvl AllPub FR2 Gtl Veenker Feedr Norm 1Fam 1Story 6 8 1976 1976 Gable CompShg MetalSd MetalSd None 0 TA TA CBlock Gd TA Gd ALQ 978 Unf 0 284 1262 GasA Ex Y SBrkr 1262 0 0 1262 0 1 2 0 3 1 TA 6 Typ 1 TA Attchd 1976 RFn 2 460 TA TA Y 298 0 0 0 0 0 NA NA NA 0 5 2007 WD Normal 181500
3 60 RL 68 11250 Pave NA IR1 Lvl AllPub Inside Gtl CollgCr Norm Norm 1Fam 2Story 7 5 2001 2002 Gable CompShg VinylSd VinylSd BrkFace 162 Gd TA PConc Gd TA Mn GLQ 486 Unf 0 434 920 GasA Ex Y SBrkr 920 866 0 1786 1 0 2 1 3 1 Gd 6 Typ 1 TA Attchd 2001 RFn 2 608 TA TA Y 0 42 0 0 0 0 NA NA NA 0 9 2008 WD Normal 223500
4 70 RL 60 9550 Pave NA IR1 Lvl AllPub Corner Gtl Crawfor Norm Norm 1Fam 2Story 7 5 1915 1970 Gable CompShg Wd Sdng Wd Shng None 0 TA TA BrkTil TA Gd No ALQ 216 Unf 0 540 756 GasA Gd Y SBrkr 961 756 0 1717 1 0 1 0 3 1 Gd 7 Typ 1 Gd Detchd 1998 Unf 3 642 TA TA Y 0 35 272 0 0 0 NA NA NA 0 2 2006 WD Abnorml 140000
5 60 RL 84 14260 Pave NA IR1 Lvl AllPub FR2 Gtl NoRidge Norm Norm 1Fam 2Story 8 5 2000 2000 Gable CompShg VinylSd VinylSd BrkFace 350 Gd TA PConc Gd TA Av GLQ 655 Unf 0 490 1145 GasA Ex Y SBrkr 1145 1053 0 2198 1 0 2 1 4 1 Gd 9 Typ 1 TA Attchd 2000 RFn 3 836 TA TA Y 192 84 0 0 0 0 NA NA NA 0 12 2008 WD Normal 250000
6 50 RL 85 14115 Pave NA IR1 Lvl AllPub Inside Gtl Mitchel Norm Norm 1Fam 1.5Fin 5 5 1993 1995 Gable CompShg VinylSd VinylSd None 0 TA TA Wood Gd TA No GLQ 732 Unf 0 64 796 GasA Ex Y SBrkr 796 566 0 1362 1 0 1 1 1 1 TA 5 Typ 0 NA Attchd 1993 Unf 2 480 TA TA Y 40 30 0 320 0 0 NA MnPrv Shed 700 10 2009 WD Normal 143000
ncol(home_data)
## [1] 81
colnames(home_data)
##  [1] "Id"            "MSSubClass"    "MSZoning"      "LotFrontage"  
##  [5] "LotArea"       "Street"        "Alley"         "LotShape"     
##  [9] "LandContour"   "Utilities"     "LotConfig"     "LandSlope"    
## [13] "Neighborhood"  "Condition1"    "Condition2"    "BldgType"     
## [17] "HouseStyle"    "OverallQual"   "OverallCond"   "YearBuilt"    
## [21] "YearRemodAdd"  "RoofStyle"     "RoofMatl"      "Exterior1st"  
## [25] "Exterior2nd"   "MasVnrType"    "MasVnrArea"    "ExterQual"    
## [29] "ExterCond"     "Foundation"    "BsmtQual"      "BsmtCond"     
## [33] "BsmtExposure"  "BsmtFinType1"  "BsmtFinSF1"    "BsmtFinType2" 
## [37] "BsmtFinSF2"    "BsmtUnfSF"     "TotalBsmtSF"   "Heating"      
## [41] "HeatingQC"     "CentralAir"    "Electrical"    "X1stFlrSF"    
## [45] "X2ndFlrSF"     "LowQualFinSF"  "GrLivArea"     "BsmtFullBath" 
## [49] "BsmtHalfBath"  "FullBath"      "HalfBath"      "BedroomAbvGr" 
## [53] "KitchenAbvGr"  "KitchenQual"   "TotRmsAbvGrd"  "Functional"   
## [57] "Fireplaces"    "FireplaceQu"   "GarageType"    "GarageYrBlt"  
## [61] "GarageFinish"  "GarageCars"    "GarageArea"    "GarageQual"   
## [65] "GarageCond"    "PavedDrive"    "WoodDeckSF"    "OpenPorchSF"  
## [69] "EnclosedPorch" "X3SsnPorch"    "ScreenPorch"   "PoolArea"     
## [73] "PoolQC"        "Fence"         "MiscFeature"   "MiscVal"      
## [77] "MoSold"        "YrSold"        "SaleType"      "SaleCondition"
## [81] "SalePrice"

2 Linear Model

When buying a home, let’s assume that LotArea, Overall Quality, Overall Condition, and Year Built are the factors most likely to influence the price. Let’s see if we can build a model that supports (or refutes) this assumption.

attach(home_data)
home.lm <- lm(SalePrice ~ LotArea + OverallQual + OverallCond + YearBuilt)
home.lm
## 
## Call:
## lm(formula = SalePrice ~ LotArea + OverallQual + OverallCond + 
##     YearBuilt)
## 
## Coefficients:
## (Intercept)      LotArea  OverallQual  OverallCond    YearBuilt  
##  -7.984e+05    1.499e+00    4.003e+04    2.736e+03    3.572e+02
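
To make the coefficients concrete, the fitted equation can be evaluated by hand. The sketch below hard-codes the rounded coefficients printed above and applies them to the first home in the data preview (LotArea 8450, OverallQual 7, OverallCond 5, YearBuilt 2003). The names `coefs`, `first_home` and `predicted_price` are illustrative and not part of the original analysis, and because the coefficients are rounded the result is only approximate.

```r
# Rounded coefficients copied from the lm() output above
coefs <- c(intercept = -7.984e+05, LotArea = 1.499e+00,
           OverallQual = 4.003e+04, OverallCond = 2.736e+03,
           YearBuilt = 3.572e+02)

# Predictor values for the first home in the preview (Id 1);
# the leading 1 multiplies the intercept
first_home <- c(1, 8450, 7, 5, 2003)

# Predicted price = intercept + sum of (coefficient * predictor value)
predicted_price <- sum(coefs * first_home)
predicted_price
```

This comes out near 223,600, against a recorded SalePrice of 208500 for that home. With the fitted model in memory, `predict(home.lm, newdata = home_data[1, ])` should give the same kind of estimate from the unrounded coefficients.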

3 Data Visualization

home_factors <- c('SalePrice', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt')
house_plt <- home_data[home_factors]
plot(house_plt)

Let’s take a closer look at how lot area, overall quality, and year built each relate to sale price.

plot(LotArea, SalePrice)

plot(OverallQual, SalePrice)

plot(YearBuilt, SalePrice)

4 Evaluate Model

Now, let us evaluate the model.

summary(home.lm)
## 
## Call:
## lm(formula = SalePrice ~ LotArea + OverallQual + OverallCond + 
##     YearBuilt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -268634  -26234   -3667   20004  393023 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -7.984e+05  1.032e+05  -7.733 1.94e-14 ***
## LotArea      1.500e+00  1.210e-01  12.397  < 2e-16 ***
## OverallQual  4.003e+04  1.079e+03  37.107  < 2e-16 ***
## OverallCond  2.736e+03  1.178e+03   2.323   0.0203 *  
## YearBuilt    3.572e+02  5.279e+01   6.767 1.90e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 45770 on 1455 degrees of freedom
## Multiple R-squared:  0.6689, Adjusted R-squared:  0.668 
## F-statistic:   735 on 4 and 1455 DF,  p-value: < 2.2e-16

All four coefficients have p-values below 0.05, and the overall F-test has a p-value below 2.2e-16, so the predictors are statistically significant. The model has an adjusted R-squared of 0.668; in other words, these 4 variables together explain about 66.8% of the variance in sale price.
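
As a sanity check, the adjusted R-squared can be recomputed from the multiple R-squared and the degrees of freedom in the summary. The sketch below assumes the number of observations is the residual degrees of freedom plus the 4 predictors plus the intercept; the variable names are illustrative.

```r
r_squared <- 0.6689   # Multiple R-squared from summary(home.lm)
resid_df  <- 1455     # residual degrees of freedom from the summary
p         <- 4        # number of predictors
n         <- resid_df + p + 1   # implied number of observations

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)
round(adj_r_squared, 3)
```

The result rounds to 0.668, matching the summary output. The penalty is small here because the sample is large relative to the number of predictors.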

Let’s examine a single-predictor regression model of SalePrice on YearBuilt.

YearBuilt_SalesPrice.lm <- lm(SalePrice ~ YearBuilt)
YearBuilt_SalesPrice.lm
## 
## Call:
## lm(formula = SalePrice ~ YearBuilt)
## 
## Coefficients:
## (Intercept)    YearBuilt  
##    -2530308         1375
summary(YearBuilt_SalesPrice.lm)
## 
## Call:
## lm(formula = SalePrice ~ YearBuilt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -144191  -40999  -15464   22685  542814 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.530e+06  1.158e+05  -21.86   <2e-16 ***
## YearBuilt    1.375e+03  5.872e+01   23.42   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 67740 on 1458 degrees of freedom
## Multiple R-squared:  0.2734, Adjusted R-squared:  0.2729 
## F-statistic: 548.7 on 1 and 1458 DF,  p-value: < 2.2e-16
plot(YearBuilt,SalePrice)
abline(YearBuilt_SalesPrice.lm)

plot(fitted(home.lm),resid(home.lm))

qqnorm(resid(home.lm))
qqline(resid(home.lm))
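
The two plots above can be complemented with a few numeric checks. The sketch below defines a hypothetical helper, `residual_checks()`, which is not part of the original analysis, and runs it on simulated data so that the block works on its own. It uses only base R functions (`resid`, `fitted`, `rstandard`, `shapiro.test`).

```r
# Hypothetical helper: numeric companions to the residual-vs-fitted
# and Q-Q plots for any fitted lm object
residual_checks <- function(model) {
  r <- resid(model)
  f <- fitted(model)
  list(
    mean_resid   = mean(r),                   # near 0 for OLS with an intercept
    spread_trend = cor(abs(r), f),            # far from 0 hints at non-constant variance
    shapiro_p    = shapiro.test(r)$p.value,   # small value hints at non-normal residuals
    n_large      = sum(abs(rstandard(model)) > 3)  # standardized residuals beyond +/-3
  )
}

# Demonstration on simulated data (an assumption, not the housing data)
set.seed(1)
x <- runif(200, 0, 10)
y <- 3 + 2 * x + rnorm(200)
demo.lm <- lm(y ~ x)
checks <- residual_checks(demo.lm)
print(checks)
```

With the models above in memory, the call would be `residual_checks(home.lm)`. These numbers are rough screening aids and do not replace a visual reading of the plots.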

5 Summary

As we can see, YearBuilt is an influential predictor among the 80 candidate variables: on its own, the single-predictor model explains 27.3% of the variance in sale price (adjusted R-squared of 0.2729). The residual plot and the Q-Q plot for the multiple regression model appear acceptable apart from a few outliers. The multiple regression model, with an adjusted R-squared of 0.668, also fits home prices better than the single-predictor model.
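
The comparison between the two models can be made a little more formal with a partial F-test, since the YearBuilt-only model is nested inside the four-predictor model. The sketch below reconstructs the test from the residual standard errors and degrees of freedom reported in the two summaries. Those inputs are rounded, so the statistic is approximate, and the variable names are illustrative.

```r
# Values copied from the two summary() outputs above
rse_single <- 67740; df_single <- 1458   # SalePrice ~ YearBuilt
rse_multi  <- 45770; df_multi  <- 1455   # four-predictor model

# Residual sum of squares = (residual standard error)^2 * residual df
rss_single <- rse_single^2 * df_single
rss_multi  <- rse_multi^2  * df_multi

# Partial F statistic for the predictors added to the single-predictor model
extra_terms <- df_single - df_multi
f_stat  <- ((rss_single - rss_multi) / extra_terms) / (rss_multi / df_multi)
p_value <- pf(f_stat, extra_terms, df_multi, lower.tail = FALSE)

c(F = f_stat, p = p_value)
```

With both models in memory, `anova(YearBuilt_SalesPrice.lm, home.lm)` should perform the same comparison from unrounded values.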