Using R, build a multiple regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?
Let’s take a look at real estate data. We know that certain aspects of a home influence its price. Let’s use multiple regression to estimate what those effects may be.
library(RCurl)
raw.file <- 'https://raw.githubusercontent.com/jcp9010/MSDA/master/CUNY%20605/Week%2011/train.csv'
homes <- read.csv(raw.file, stringsAsFactors = FALSE)
head(homes)
## Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 1 1 60 RL 65 8450 Pave <NA> Reg
## 2 2 20 RL 80 9600 Pave <NA> Reg
## 3 3 60 RL 68 11250 Pave <NA> IR1
## 4 4 70 RL 60 9550 Pave <NA> IR1
## 5 5 60 RL 84 14260 Pave <NA> IR1
## 6 6 50 RL 85 14115 Pave <NA> IR1
## LandContour Utilities LotConfig LandSlope Neighborhood Condition1
## 1 Lvl AllPub Inside Gtl CollgCr Norm
## 2 Lvl AllPub FR2 Gtl Veenker Feedr
## 3 Lvl AllPub Inside Gtl CollgCr Norm
## 4 Lvl AllPub Corner Gtl Crawfor Norm
## 5 Lvl AllPub FR2 Gtl NoRidge Norm
## 6 Lvl AllPub Inside Gtl Mitchel Norm
## Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt
## 1 Norm 1Fam 2Story 7 5 2003
## 2 Norm 1Fam 1Story 6 8 1976
## 3 Norm 1Fam 2Story 7 5 2001
## 4 Norm 1Fam 2Story 7 5 1915
## 5 Norm 1Fam 2Story 8 5 2000
## 6 Norm 1Fam 1.5Fin 5 5 1993
## YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
## 1 2003 Gable CompShg VinylSd VinylSd BrkFace
## 2 1976 Gable CompShg MetalSd MetalSd None
## 3 2002 Gable CompShg VinylSd VinylSd BrkFace
## 4 1970 Gable CompShg Wd Sdng Wd Shng None
## 5 2000 Gable CompShg VinylSd VinylSd BrkFace
## 6 1995 Gable CompShg VinylSd VinylSd None
## MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure
## 1 196 Gd TA PConc Gd TA No
## 2 0 TA TA CBlock Gd TA Gd
## 3 162 Gd TA PConc Gd TA Mn
## 4 0 TA TA BrkTil TA Gd No
## 5 350 Gd TA PConc Gd TA Av
## 6 0 TA TA Wood Gd TA No
## BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 1 GLQ 706 Unf 0 150 856
## 2 ALQ 978 Unf 0 284 1262
## 3 GLQ 486 Unf 0 434 920
## 4 ALQ 216 Unf 0 540 756
## 5 GLQ 655 Unf 0 490 1145
## 6 GLQ 732 Unf 0 64 796
## Heating HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF
## 1 GasA Ex Y SBrkr 856 854 0
## 2 GasA Ex Y SBrkr 1262 0 0
## 3 GasA Ex Y SBrkr 920 866 0
## 4 GasA Gd Y SBrkr 961 756 0
## 5 GasA Ex Y SBrkr 1145 1053 0
## 6 GasA Ex Y SBrkr 796 566 0
## GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr
## 1 1710 1 0 2 1 3
## 2 1262 0 1 2 0 3
## 3 1786 1 0 2 1 3
## 4 1717 1 0 1 0 3
## 5 2198 1 0 2 1 4
## 6 1362 1 0 1 1 1
## KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu
## 1 1 Gd 8 Typ 0 <NA>
## 2 1 TA 6 Typ 1 TA
## 3 1 Gd 6 Typ 1 TA
## 4 1 Gd 7 Typ 1 Gd
## 5 1 Gd 9 Typ 1 TA
## 6 1 TA 5 Typ 0 <NA>
## GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## 1 Attchd 2003 RFn 2 548 TA
## 2 Attchd 1976 RFn 2 460 TA
## 3 Attchd 2001 RFn 2 608 TA
## 4 Detchd 1998 Unf 3 642 TA
## 5 Attchd 2000 RFn 3 836 TA
## 6 Attchd 1993 Unf 2 480 TA
## GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## 1 TA Y 0 61 0 0
## 2 TA Y 298 0 0 0
## 3 TA Y 0 42 0 0
## 4 TA Y 0 35 272 0
## 5 TA Y 192 84 0 0
## 6 TA Y 40 30 0 320
## ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold
## 1 0 0 <NA> <NA> <NA> 0 2 2008
## 2 0 0 <NA> <NA> <NA> 0 5 2007
## 3 0 0 <NA> <NA> <NA> 0 9 2008
## 4 0 0 <NA> <NA> <NA> 0 2 2006
## 5 0 0 <NA> <NA> <NA> 0 12 2008
## 6 0 0 <NA> MnPrv Shed 700 10 2009
## SaleType SaleCondition SalePrice
## 1 WD Normal 208500
## 2 WD Normal 181500
## 3 WD Normal 223500
## 4 WD Abnorml 140000
## 5 WD Normal 250000
## 6 WD Normal 143000
ncol(homes)
## [1] 81
As you can see, there are 81 columns in this dataset: an Id column that only labels the rows, 79 candidate independent variables, and 1 (SalePrice) that is the measured outcome.
When buying a home, my suspicion is that LotArea, OverallQual (overall quality), OverallCond (overall condition), and YearBuilt would most likely influence the price. Let’s see if we can build a model to support (or refute) this suspicion.
attach(homes)
home.lm <- lm(SalePrice ~ LotArea + OverallQual + OverallCond + YearBuilt)
home.lm
##
## Call:
## lm(formula = SalePrice ~ LotArea + OverallQual + OverallCond +
## YearBuilt)
##
## Coefficients:
## (Intercept) LotArea OverallQual OverallCond YearBuilt
## -7.984e+05 1.499e+00 4.003e+04 2.736e+03 3.572e+02
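To make the coefficients concrete, the fitted equation can be applied by hand. The sketch below hard-codes the rounded coefficients printed above and plugs in the first home shown by head(homes) (LotArea 8450, OverallQual 7, OverallCond 5, YearBuilt 2003). The names `coefs`, `first_home`, and `predicted` are illustrative only, and because the coefficients are rounded to four significant digits, the result will differ slightly from what predict(home.lm) would return.

```r
# Rounded coefficients copied from the printed home.lm output (an approximation)
coefs <- c(intercept = -7.984e+05, LotArea = 1.499, OverallQual = 4.003e+04,
           OverallCond = 2.736e+03, YearBuilt = 3.572e+02)

# Predictor values for the first row of head(homes)
first_home <- c(LotArea = 8450, OverallQual = 7, OverallCond = 5, YearBuilt = 2003)

# Fitted value = intercept + sum of (coefficient * predictor value)
predicted <- coefs[["intercept"]] + sum(coefs[names(first_home)] * first_home)
predicted
```

With these rounded inputs the prediction comes out near 223,600, roughly 15,000 above the recorded SalePrice of 208500 for that home.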
How good is this? Well, let’s first visualize the variables involved.
myvars <- c('SalePrice', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt')
Housing <- homes[myvars]
plot(Housing)
We’re interested in the first row of the scatterplot matrix, which shows SalePrice against each of the four predictors. Let’s take a closer look.
plot(LotArea, SalePrice)
plot(OverallQual, SalePrice)
plot(OverallCond, SalePrice)
plot(YearBuilt, SalePrice)
How good is the linear regression model?
summary(home.lm)
##
## Call:
## lm(formula = SalePrice ~ LotArea + OverallQual + OverallCond +
## YearBuilt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -268634 -26234 -3667 20004 393023
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.984e+05 1.032e+05 -7.733 1.94e-14 ***
## LotArea 1.500e+00 1.210e-01 12.397 < 2e-16 ***
## OverallQual 4.003e+04 1.079e+03 37.107 < 2e-16 ***
## OverallCond 2.736e+03 1.178e+03 2.323 0.0203 *
## YearBuilt 3.572e+02 5.279e+01 6.767 1.90e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 45770 on 1455 degrees of freedom
## Multiple R-squared: 0.6689, Adjusted R-squared: 0.668
## F-statistic: 735 on 4 and 1455 DF, p-value: < 2.2e-16
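As a sanity check on the last two lines of this summary, the adjusted R-squared and the overall F-statistic can be recomputed from the reported Multiple R-squared and the degrees of freedom. This is a sketch using only the rounded numbers printed above; the variable names (`r2`, `n`, `p`, `adj_r2`, `f_stat`) are illustrative.

```r
# Values read off the summary(home.lm) output
r2 <- 0.6689   # Multiple R-squared
n  <- 1460     # observations: 1455 residual df + 5 estimated coefficients
p  <- 4        # number of predictors

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Overall F-statistic for the null hypothesis that all four slopes are zero
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))

round(adj_r2, 3)
round(f_stat)
```

Both values agree with the summary output (0.668 and 735) up to rounding.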
All four coefficients have a p-value < 0.05, which makes them statistically significant at the 5% level (OverallCond, at p = 0.0203, is the weakest of the four). This model has an adjusted R-squared of 0.668; in other words, these 4 variables together explain about 66.8% of the variance in SalePrice. Given that this week’s discussion is about a single independent variable, let’s also take YearBuilt and fit a simple linear regression model.
year_price.lm <- lm(SalePrice ~ YearBuilt)
year_price.lm
##
## Call:
## lm(formula = SalePrice ~ YearBuilt)
##
## Coefficients:
## (Intercept) YearBuilt
## -2530308 1375
summary(year_price.lm)
##
## Call:
## lm(formula = SalePrice ~ YearBuilt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -144191 -40999 -15464 22685 542814
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.530e+06 1.158e+05 -21.86 <2e-16 ***
## YearBuilt 1.375e+03 5.872e+01 23.42 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 67740 on 1458 degrees of freedom
## Multiple R-squared: 0.2734, Adjusted R-squared: 0.2729
## F-statistic: 548.7 on 1 and 1458 DF, p-value: < 2.2e-16
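Because the YearBuilt-only model is nested inside the four-variable model, the two fits can be compared with a partial F-test. The sketch below reconstructs that test from the residual standard errors and degrees of freedom printed in the two summaries, so it is approximate (the inputs are rounded); all variable names are illustrative.

```r
# Residual standard errors and residual df copied from the two summaries
rse_simple <- 67740; df_simple <- 1458   # SalePrice ~ YearBuilt
rse_full   <- 45770; df_full   <- 1455   # four-predictor model

# Residual sum of squares = RSE^2 * residual degrees of freedom
rss_simple <- rse_simple^2 * df_simple
rss_full   <- rse_full^2 * df_full

# Partial F-test: does adding the 3 extra predictors reduce the RSS
# by more than chance alone would explain?
f_partial <- ((rss_simple - rss_full) / (df_simple - df_full)) / (rss_full / df_full)
p_value   <- pf(f_partial, df_simple - df_full, df_full, lower.tail = FALSE)

f_partial
p_value
```

With the fitted objects in the session, anova(year_price.lm, home.lm) performs the same comparison without the rounding.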
plot(YearBuilt,SalePrice)
abline(year_price.lm)
plot(fitted(home.lm),resid(home.lm))
qqnorm(resid(home.lm))
qqline(resid(home.lm))
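As you can see, YearBuilt does a fairly good job for a single predictor: the simple model accounts for 27.3% of the variance in SalePrice (adjusted R-squared). The residual and Q-Q plots above are drawn for the multiple regression model, home.lm. They suggest that the right tail of the residuals is heavier than a normal distribution would produce, while the rest of the distribution follows the reference line reasonably well. This agrees with the residual summary, where the largest positive residual (393023) is further from zero than the largest negative one (-268634). Overall, the linear model seems acceptable apart from that upper tail. While the multiple regression model does a better job of explaining home prices than the simple linear regression model (adjusted R-squared of 0.668 versus 0.273), the simple model performed better than I had expected.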
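The visual impression of a heavy right tail can be backed up with numbers. The sketch below uses synthetic data (not the housing data) with deliberately right-skewed noise, so that it runs on its own, and computes a sample skewness and a Shapiro-Wilk test on the residuals. The seed, sample size, and variable names are arbitrary assumptions made for illustration.

```r
set.seed(605)  # arbitrary seed so the sketch is reproducible

# Synthetic stand-in data (NOT the housing data): linear trend + right-skewed noise
n     <- 500
year  <- runif(n, 1900, 2010)
price <- 1000 * (year - 1900) + rlnorm(n, meanlog = 10, sdlog = 0.8)

fit <- lm(price ~ year)
res <- resid(fit)

# Sample skewness: positive values indicate a longer right tail
skewness <- mean((res - mean(res))^3) / sd(res)^3

# Shapiro-Wilk test of normality (accepts 3 to 5000 observations)
sw <- shapiro.test(res)

skewness
sw$p.value
```

The same two checks could be run on resid(home.lm); a clearly positive skewness or a small Shapiro-Wilk p-value there would support the reading of the Q-Q plot given above.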