Using R, build a multiple regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?
Let’s explore a real estate data set and see how certain aspects of a home influence its price. We’ll use multiple regression to estimate what those effects may be.
library(RCurl)
## Loading required package: bitops
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 3.4.4
library(knitr)
## Warning: package 'knitr' was built under R version 3.4.4
data_url <- 'https://raw.githubusercontent.com/niteen11/CUNY_DATA_605/master/dataset/train.csv'
home_data <- read.csv(data_url, stringsAsFactors = FALSE)
kable(head(home_data))
Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | LandSlope | Neighborhood | Condition1 | Condition2 | BldgType | HouseStyle | OverallQual | OverallCond | YearBuilt | YearRemodAdd | RoofStyle | RoofMatl | Exterior1st | Exterior2nd | MasVnrType | MasVnrArea | ExterQual | ExterCond | Foundation | BsmtQual | BsmtCond | BsmtExposure | BsmtFinType1 | BsmtFinSF1 | BsmtFinType2 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | Heating | HeatingQC | CentralAir | Electrical | X1stFlrSF | X2ndFlrSF | LowQualFinSF | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | KitchenQual | TotRmsAbvGrd | Functional | Fireplaces | FireplaceQu | GarageType | GarageYrBlt | GarageFinish | GarageCars | GarageArea | GarageQual | GarageCond | PavedDrive | WoodDeckSF | OpenPorchSF | EnclosedPorch | X3SsnPorch | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 60 | RL | 65 | 8450 | Pave | NA | Reg | Lvl | AllPub | Inside | Gtl | CollgCr | Norm | Norm | 1Fam | 2Story | 7 | 5 | 2003 | 2003 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 196 | Gd | TA | PConc | Gd | TA | No | GLQ | 706 | Unf | 0 | 150 | 856 | GasA | Ex | Y | SBrkr | 856 | 854 | 0 | 1710 | 1 | 0 | 2 | 1 | 3 | 1 | Gd | 8 | Typ | 0 | NA | Attchd | 2003 | RFn | 2 | 548 | TA | TA | Y | 0 | 61 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 2 | 2008 | WD | Normal | 208500 |
2 | 20 | RL | 80 | 9600 | Pave | NA | Reg | Lvl | AllPub | FR2 | Gtl | Veenker | Feedr | Norm | 1Fam | 1Story | 6 | 8 | 1976 | 1976 | Gable | CompShg | MetalSd | MetalSd | None | 0 | TA | TA | CBlock | Gd | TA | Gd | ALQ | 978 | Unf | 0 | 284 | 1262 | GasA | Ex | Y | SBrkr | 1262 | 0 | 0 | 1262 | 0 | 1 | 2 | 0 | 3 | 1 | TA | 6 | Typ | 1 | TA | Attchd | 1976 | RFn | 2 | 460 | TA | TA | Y | 298 | 0 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 5 | 2007 | WD | Normal | 181500 |
3 | 60 | RL | 68 | 11250 | Pave | NA | IR1 | Lvl | AllPub | Inside | Gtl | CollgCr | Norm | Norm | 1Fam | 2Story | 7 | 5 | 2001 | 2002 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 162 | Gd | TA | PConc | Gd | TA | Mn | GLQ | 486 | Unf | 0 | 434 | 920 | GasA | Ex | Y | SBrkr | 920 | 866 | 0 | 1786 | 1 | 0 | 2 | 1 | 3 | 1 | Gd | 6 | Typ | 1 | TA | Attchd | 2001 | RFn | 2 | 608 | TA | TA | Y | 0 | 42 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 9 | 2008 | WD | Normal | 223500 |
4 | 70 | RL | 60 | 9550 | Pave | NA | IR1 | Lvl | AllPub | Corner | Gtl | Crawfor | Norm | Norm | 1Fam | 2Story | 7 | 5 | 1915 | 1970 | Gable | CompShg | Wd Sdng | Wd Shng | None | 0 | TA | TA | BrkTil | TA | Gd | No | ALQ | 216 | Unf | 0 | 540 | 756 | GasA | Gd | Y | SBrkr | 961 | 756 | 0 | 1717 | 1 | 0 | 1 | 0 | 3 | 1 | Gd | 7 | Typ | 1 | Gd | Detchd | 1998 | Unf | 3 | 642 | TA | TA | Y | 0 | 35 | 272 | 0 | 0 | 0 | NA | NA | NA | 0 | 2 | 2006 | WD | Abnorml | 140000 |
5 | 60 | RL | 84 | 14260 | Pave | NA | IR1 | Lvl | AllPub | FR2 | Gtl | NoRidge | Norm | Norm | 1Fam | 2Story | 8 | 5 | 2000 | 2000 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 350 | Gd | TA | PConc | Gd | TA | Av | GLQ | 655 | Unf | 0 | 490 | 1145 | GasA | Ex | Y | SBrkr | 1145 | 1053 | 0 | 2198 | 1 | 0 | 2 | 1 | 4 | 1 | Gd | 9 | Typ | 1 | TA | Attchd | 2000 | RFn | 3 | 836 | TA | TA | Y | 192 | 84 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 12 | 2008 | WD | Normal | 250000 |
6 | 50 | RL | 85 | 14115 | Pave | NA | IR1 | Lvl | AllPub | Inside | Gtl | Mitchel | Norm | Norm | 1Fam | 1.5Fin | 5 | 5 | 1993 | 1995 | Gable | CompShg | VinylSd | VinylSd | None | 0 | TA | TA | Wood | Gd | TA | No | GLQ | 732 | Unf | 0 | 64 | 796 | GasA | Ex | Y | SBrkr | 796 | 566 | 0 | 1362 | 1 | 0 | 1 | 1 | 1 | 1 | TA | 5 | Typ | 0 | NA | Attchd | 1993 | Unf | 2 | 480 | TA | TA | Y | 40 | 30 | 0 | 320 | 0 | 0 | NA | MnPrv | Shed | 700 | 10 | 2009 | WD | Normal | 143000 |
ncol(home_data)
## [1] 81
colnames(home_data)
## [1] "Id" "MSSubClass" "MSZoning" "LotFrontage"
## [5] "LotArea" "Street" "Alley" "LotShape"
## [9] "LandContour" "Utilities" "LotConfig" "LandSlope"
## [13] "Neighborhood" "Condition1" "Condition2" "BldgType"
## [17] "HouseStyle" "OverallQual" "OverallCond" "YearBuilt"
## [21] "YearRemodAdd" "RoofStyle" "RoofMatl" "Exterior1st"
## [25] "Exterior2nd" "MasVnrType" "MasVnrArea" "ExterQual"
## [29] "ExterCond" "Foundation" "BsmtQual" "BsmtCond"
## [33] "BsmtExposure" "BsmtFinType1" "BsmtFinSF1" "BsmtFinType2"
## [37] "BsmtFinSF2" "BsmtUnfSF" "TotalBsmtSF" "Heating"
## [41] "HeatingQC" "CentralAir" "Electrical" "X1stFlrSF"
## [45] "X2ndFlrSF" "LowQualFinSF" "GrLivArea" "BsmtFullBath"
## [49] "BsmtHalfBath" "FullBath" "HalfBath" "BedroomAbvGr"
## [53] "KitchenAbvGr" "KitchenQual" "TotRmsAbvGrd" "Functional"
## [57] "Fireplaces" "FireplaceQu" "GarageType" "GarageYrBlt"
## [61] "GarageFinish" "GarageCars" "GarageArea" "GarageQual"
## [65] "GarageCond" "PavedDrive" "WoodDeckSF" "OpenPorchSF"
## [69] "EnclosedPorch" "X3SsnPorch" "ScreenPorch" "PoolArea"
## [73] "PoolQC" "Fence" "MiscFeature" "MiscVal"
## [77] "MoSold" "YrSold" "SaleType" "SaleCondition"
## [81] "SalePrice"
For a home buyer, let’s assume that LotArea, OverallQual (overall quality), OverallCond (overall condition), and YearBuilt are the variables most likely to influence the home price. Let’s see whether a model supports (or contradicts) this assumption.
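Before fitting, it can help to confirm that the chosen columns are numeric and complete. The following is a minimal sketch, not part of the original analysis: it uses `home_data` if it is already loaded and otherwise falls back to a small synthetic stand-in (an assumption made only so the snippet runs on its own).

```r
# Stand-in data (synthetic, for illustration only) used when home_data is not loaded
if (!exists("home_data")) {
  set.seed(1)
  home_data <- data.frame(
    LotArea     = round(runif(50, 5000, 15000)),
    OverallQual = sample(1:10, 50, replace = TRUE),
    OverallCond = sample(1:9, 50, replace = TRUE),
    YearBuilt   = sample(1900:2010, 50, replace = TRUE)
  )
  home_data$SalePrice <- 1.5 * home_data$LotArea +
    40000 * home_data$OverallQual + rnorm(50, sd = 20000)
}

model_cols <- c("SalePrice", "LotArea", "OverallQual", "OverallCond", "YearBuilt")

# TRUE for each column that lm() can use directly as a numeric variable
is_num <- sapply(home_data[model_cols], is.numeric)

# Missing values per column; by default lm() drops rows with missing values
na_counts <- colSums(is.na(home_data[model_cols]))

print(is_num)
print(na_counts)
```

If any count is non-zero, the residual degrees of freedom reported by `summary()` will be smaller than the number of rows minus the number of coefficients.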
attach(home_data)
home.lm <- lm(SalePrice ~ LotArea + OverallQual + OverallCond + YearBuilt)
home.lm
##
## Call:
## lm(formula = SalePrice ~ LotArea + OverallQual + OverallCond +
## YearBuilt)
##
## Coefficients:
## (Intercept) LotArea OverallQual OverallCond YearBuilt
## -7.984e+05 1.499e+00 4.003e+04 2.736e+03 3.572e+02
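To make the coefficients concrete, here is a small worked sketch that plugs the first home from the `head()` table into the fitted equation. It uses the rounded coefficients printed above, so the result is only approximate and will differ slightly from `fitted(home.lm)[1]`; the variable names `coefs`, `home1`, `fitted_price`, and `residual1` are illustrative.

```r
# Coefficients as printed above (rounded to four significant digits)
coefs <- c(Intercept = -7.984e+05, LotArea = 1.499e+00, OverallQual = 4.003e+04,
           OverallCond = 2.736e+03, YearBuilt = 3.572e+02)

# Predictor values of the first home in the table; 1 multiplies the intercept
home1 <- c(Intercept = 1, LotArea = 8450, OverallQual = 7,
           OverallCond = 5, YearBuilt = 2003)

# Fitted price is the sum of coefficient * value over all terms
fitted_price <- sum(coefs * home1)

# Residual is observed minus fitted; the observed SalePrice for this home is 208500
residual1 <- 208500 - fitted_price

print(fitted_price)
print(residual1)
```

A negative residual here means the model, with these rounded coefficients, prices this home above what it actually sold for.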
home_factors <- c('SalePrice', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt')
house_plt <- home_data[home_factors]
plot(house_plt)
Let’s take a closer look at how lot area, overall quality, and year built each relate to sale price.
plot(LotArea, SalePrice)
plot(OverallQual, SalePrice)
plot(YearBuilt, SalePrice)
Now, let us evaluate the model.
summary(home.lm)
##
## Call:
## lm(formula = SalePrice ~ LotArea + OverallQual + OverallCond +
## YearBuilt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -268634 -26234 -3667 20004 393023
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.984e+05 1.032e+05 -7.733 1.94e-14 ***
## LotArea 1.500e+00 1.210e-01 12.397 < 2e-16 ***
## OverallQual 4.003e+04 1.079e+03 37.107 < 2e-16 ***
## OverallCond 2.736e+03 1.178e+03 2.323 0.0203 *
## YearBuilt 3.572e+02 5.279e+01 6.767 1.90e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 45770 on 1455 degrees of freedom
## Multiple R-squared: 0.6689, Adjusted R-squared: 0.668
## F-statistic: 735 on 4 and 1455 DF, p-value: < 2.2e-16
All four predictors have p-values below 0.05, so each is statistically significant, and the overall F-test is significant as well (p-value < 2.2e-16). The model has an adjusted R-squared of 0.668; in other words, these 4 variables explain about 66.8% of the variance in sale price.
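As a cross-check on how these summary figures relate to each other, the sketch below recomputes the adjusted R-squared and the F-statistic from the reported multiple R-squared and degrees of freedom. It uses only the rounded numbers printed above, so small differences in the last digit are expected; the variable names are illustrative.

```r
r2       <- 0.6689   # Multiple R-squared from the summary
df_resid <- 1455     # residual degrees of freedom from the summary
p        <- 4        # number of predictors
n        <- df_resid + p + 1   # number of observations used in the fit

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Overall F-statistic: explained variance per predictor over
# unexplained variance per residual degree of freedom
f_stat <- (r2 / p) / ((1 - r2) / df_resid)

print(round(adj_r2, 3))
print(round(f_stat))
```

Both values agree with the summary output to the precision shown there.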
Let’s examine a simple (single-predictor) regression model of SalePrice on YearBuilt.
YearBuilt_SalesPrice.lm <- lm(SalePrice ~ YearBuilt)
YearBuilt_SalesPrice.lm
##
## Call:
## lm(formula = SalePrice ~ YearBuilt)
##
## Coefficients:
## (Intercept) YearBuilt
## -2530308 1375
summary(YearBuilt_SalesPrice.lm)
##
## Call:
## lm(formula = SalePrice ~ YearBuilt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -144191 -40999 -15464 22685 542814
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.530e+06 1.158e+05 -21.86 <2e-16 ***
## YearBuilt 1.375e+03 5.872e+01 23.42 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 67740 on 1458 degrees of freedom
## Multiple R-squared: 0.2734, Adjusted R-squared: 0.2729
## F-statistic: 548.7 on 1 and 1458 DF, p-value: < 2.2e-16
plot(YearBuilt,SalePrice)
abline(YearBuilt_SalesPrice.lm)
plot(fitted(home.lm),resid(home.lm))
qqnorm(resid(home.lm))
qqline(resid(home.lm))
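The residual plots above can be supplemented with a couple of numeric checks. This is a hedged sketch rather than the original author's procedure: it counts large standardized residuals and runs a Shapiro-Wilk normality test. If `home.lm` is not in the session, a synthetic stand-in model is fitted purely so the snippet runs on its own.

```r
# Stand-in model (synthetic data, for illustration only) used when home.lm is absent
if (!exists("home.lm")) {
  set.seed(42)
  sim <- data.frame(
    LotArea     = round(runif(200, 5000, 15000)),
    OverallQual = sample(1:10, 200, replace = TRUE),
    OverallCond = sample(1:9, 200, replace = TRUE),
    YearBuilt   = sample(1900:2010, 200, replace = TRUE)
  )
  sim$SalePrice <- 1.5 * sim$LotArea + 40000 * sim$OverallQual +
    350 * (sim$YearBuilt - 1900) + rnorm(200, sd = 30000)
  home.lm <- lm(SalePrice ~ LotArea + OverallQual + OverallCond + YearBuilt,
                data = sim)
}

# Standardized residuals: residuals scaled by their estimated standard deviation
std_res <- rstandard(home.lm)

# Observations more than 3 standard deviations from the fit are candidate outliers
n_large <- sum(abs(std_res) > 3)

# Shapiro-Wilk test of normality (accepts between 3 and 5000 observations)
sw <- shapiro.test(resid(home.lm))

print(n_large)
print(sw$p.value)
```

A small Shapiro-Wilk p-value suggests the residuals depart from normality, which is worth weighing alongside the visual impression from the Q-Q plot.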
As we can see, YearBuilt on its own is an influential predictor among the 80 other columns: the single-predictor model accounts for 27.3% of the variance in sale price (adjusted R-squared). The residuals-versus-fitted plot and the Q-Q plot, both drawn for the multiple regression model, appear acceptable apart from a few outliers, so a linear model seems reasonable for this data. The multiple regression model also fits better than the single-predictor model (adjusted R-squared of 0.668 versus 0.273).
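One way to back up the comparison between the two models is a partial F-test on the nested fits. The sketch below is an illustration, not the original author's code: it refits both models with an explicit `data =` argument, and it falls back to synthetic stand-in data when `home_data` is not loaded.

```r
# Stand-in data (synthetic, for illustration only) used when home_data is not loaded
if (!exists("home_data")) {
  set.seed(7)
  home_data <- data.frame(
    LotArea     = round(runif(200, 5000, 15000)),
    OverallQual = sample(1:10, 200, replace = TRUE),
    OverallCond = sample(1:9, 200, replace = TRUE),
    YearBuilt   = sample(1900:2010, 200, replace = TRUE)
  )
  home_data$SalePrice <- 1.5 * home_data$LotArea +
    40000 * home_data$OverallQual +
    350 * (home_data$YearBuilt - 1900) + rnorm(200, sd = 30000)
}

small.lm <- lm(SalePrice ~ YearBuilt, data = home_data)
full.lm  <- lm(SalePrice ~ LotArea + OverallQual + OverallCond + YearBuilt,
               data = home_data)

# Partial F-test: do the three extra predictors reduce the residual
# sum of squares by more than chance alone would explain?
cmp <- anova(small.lm, full.lm)
print(cmp)

# Adjusted R-squared of both models side by side
adj <- c(small = summary(small.lm)$adj.r.squared,
         full  = summary(full.lm)$adj.r.squared)
print(round(adj, 3))
```

A small p-value in the second row of the `anova()` table indicates that the additional predictors improve the fit beyond what YearBuilt alone provides.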