Using R, build a multiple regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?

Let’s take a look at real estate data. We know that certain aspects of a home influence its price. Let’s use multiple regression to estimate what those effects may be.

library(RCurl)

raw.file <- 'https://raw.githubusercontent.com/jcp9010/MSDA/master/CUNY%20605/Week%2011/train.csv'
homes <- read.csv(raw.file, stringsAsFactors = FALSE)
head(homes)
##   Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 1  1         60       RL          65    8450   Pave  <NA>      Reg
## 2  2         20       RL          80    9600   Pave  <NA>      Reg
## 3  3         60       RL          68   11250   Pave  <NA>      IR1
## 4  4         70       RL          60    9550   Pave  <NA>      IR1
## 5  5         60       RL          84   14260   Pave  <NA>      IR1
## 6  6         50       RL          85   14115   Pave  <NA>      IR1
##   LandContour Utilities LotConfig LandSlope Neighborhood Condition1
## 1         Lvl    AllPub    Inside       Gtl      CollgCr       Norm
## 2         Lvl    AllPub       FR2       Gtl      Veenker      Feedr
## 3         Lvl    AllPub    Inside       Gtl      CollgCr       Norm
## 4         Lvl    AllPub    Corner       Gtl      Crawfor       Norm
## 5         Lvl    AllPub       FR2       Gtl      NoRidge       Norm
## 6         Lvl    AllPub    Inside       Gtl      Mitchel       Norm
##   Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt
## 1       Norm     1Fam     2Story           7           5      2003
## 2       Norm     1Fam     1Story           6           8      1976
## 3       Norm     1Fam     2Story           7           5      2001
## 4       Norm     1Fam     2Story           7           5      1915
## 5       Norm     1Fam     2Story           8           5      2000
## 6       Norm     1Fam     1.5Fin           5           5      1993
##   YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
## 1         2003     Gable  CompShg     VinylSd     VinylSd    BrkFace
## 2         1976     Gable  CompShg     MetalSd     MetalSd       None
## 3         2002     Gable  CompShg     VinylSd     VinylSd    BrkFace
## 4         1970     Gable  CompShg     Wd Sdng     Wd Shng       None
## 5         2000     Gable  CompShg     VinylSd     VinylSd    BrkFace
## 6         1995     Gable  CompShg     VinylSd     VinylSd       None
##   MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure
## 1        196        Gd        TA      PConc       Gd       TA           No
## 2          0        TA        TA     CBlock       Gd       TA           Gd
## 3        162        Gd        TA      PConc       Gd       TA           Mn
## 4          0        TA        TA     BrkTil       TA       Gd           No
## 5        350        Gd        TA      PConc       Gd       TA           Av
## 6          0        TA        TA       Wood       Gd       TA           No
##   BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 1          GLQ        706          Unf          0       150         856
## 2          ALQ        978          Unf          0       284        1262
## 3          GLQ        486          Unf          0       434         920
## 4          ALQ        216          Unf          0       540         756
## 5          GLQ        655          Unf          0       490        1145
## 6          GLQ        732          Unf          0        64         796
##   Heating HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF
## 1    GasA        Ex          Y      SBrkr       856       854            0
## 2    GasA        Ex          Y      SBrkr      1262         0            0
## 3    GasA        Ex          Y      SBrkr       920       866            0
## 4    GasA        Gd          Y      SBrkr       961       756            0
## 5    GasA        Ex          Y      SBrkr      1145      1053            0
## 6    GasA        Ex          Y      SBrkr       796       566            0
##   GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr
## 1      1710            1            0        2        1            3
## 2      1262            0            1        2        0            3
## 3      1786            1            0        2        1            3
## 4      1717            1            0        1        0            3
## 5      2198            1            0        2        1            4
## 6      1362            1            0        1        1            1
##   KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu
## 1            1          Gd            8        Typ          0        <NA>
## 2            1          TA            6        Typ          1          TA
## 3            1          Gd            6        Typ          1          TA
## 4            1          Gd            7        Typ          1          Gd
## 5            1          Gd            9        Typ          1          TA
## 6            1          TA            5        Typ          0        <NA>
##   GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## 1     Attchd        2003          RFn          2        548         TA
## 2     Attchd        1976          RFn          2        460         TA
## 3     Attchd        2001          RFn          2        608         TA
## 4     Detchd        1998          Unf          3        642         TA
## 5     Attchd        2000          RFn          3        836         TA
## 6     Attchd        1993          Unf          2        480         TA
##   GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## 1         TA          Y          0          61             0          0
## 2         TA          Y        298           0             0          0
## 3         TA          Y          0          42             0          0
## 4         TA          Y          0          35           272          0
## 5         TA          Y        192          84             0          0
## 6         TA          Y         40          30             0        320
##   ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold
## 1           0        0   <NA>  <NA>        <NA>       0      2   2008
## 2           0        0   <NA>  <NA>        <NA>       0      5   2007
## 3           0        0   <NA>  <NA>        <NA>       0      9   2008
## 4           0        0   <NA>  <NA>        <NA>       0      2   2006
## 5           0        0   <NA>  <NA>        <NA>       0     12   2008
## 6           0        0   <NA> MnPrv        Shed     700     10   2009
##   SaleType SaleCondition SalePrice
## 1       WD        Normal    208500
## 2       WD        Normal    181500
## 3       WD        Normal    223500
## 4       WD       Abnorml    140000
## 5       WD        Normal    250000
## 6       WD        Normal    143000
ncol(homes)
## [1] 81

As you can see, there are 81 columns in this dataset: one is a row identifier (Id), 79 are candidate independent variables, and one (SalePrice) is the measured outcome.

For someone buying a home, my suspicion is that LotArea, Overall Quality, Overall Condition, and Year Built most likely influence the price. Let’s see if we can build a model to test this.

attach(homes)
home.lm <- lm(SalePrice ~ LotArea + OverallQual + OverallCond + YearBuilt)
home.lm
## 
## Call:
## lm(formula = SalePrice ~ LotArea + OverallQual + OverallCond + 
##     YearBuilt)
## 
## Coefficients:
## (Intercept)      LotArea  OverallQual  OverallCond    YearBuilt  
##  -7.984e+05    1.499e+00    4.003e+04    2.736e+03    3.572e+02

How good is this model? Before answering, let’s visualize the variables involved.

myvars <- c('SalePrice', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt')
Housing <- homes[myvars]
plot(Housing)

We’re interested in the first row of the scatterplot matrix, where SalePrice is on the vertical axis against each predictor. Let’s take a closer look.

plot(LotArea, SalePrice)

plot(OverallQual, SalePrice)

plot(OverallCond, SalePrice)

plot(YearBuilt, SalePrice)
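
The scatterplots can be complemented with a correlation matrix, which puts a number on each pairwise relationship. The sketch below runs on a small simulated data frame (the `toy` data and its values are made up for illustration, not taken from the housing file); with the real data, the same `cor()` call could be applied to `Housing`.

```r
# Simulated stand-in for the Housing data frame (all values are invented).
set.seed(1)
n <- 200
quality <- sample(1:10, n, replace = TRUE)
year <- sample(1900:2010, n, replace = TRUE)
price <- 20000 * quality + 300 * (year - 1900) + rnorm(n, sd = 30000)
toy <- data.frame(SalePrice = price, OverallQual = quality, YearBuilt = year)

# Pairwise Pearson correlations, using only rows with no missing values.
cor_matrix <- cor(toy, use = "complete.obs")

# The SalePrice row pairs the outcome with each predictor,
# mirroring the first row of the scatterplot matrix.
price_cors <- round(cor_matrix["SalePrice", ], 2)
print(price_cors)
```

A correlation only measures linear association between two variables at a time, so it is a quick screen rather than a substitute for the regression itself.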

How good is the linear regression model?

summary(home.lm)
## 
## Call:
## lm(formula = SalePrice ~ LotArea + OverallQual + OverallCond + 
##     YearBuilt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -268634  -26234   -3667   20004  393023 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -7.984e+05  1.032e+05  -7.733 1.94e-14 ***
## LotArea      1.500e+00  1.210e-01  12.397  < 2e-16 ***
## OverallQual  4.003e+04  1.079e+03  37.107  < 2e-16 ***
## OverallCond  2.736e+03  1.178e+03   2.323   0.0203 *  
## YearBuilt    3.572e+02  5.279e+01   6.767 1.90e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 45770 on 1455 degrees of freedom
## Multiple R-squared:  0.6689, Adjusted R-squared:  0.668 
## F-statistic:   735 on 4 and 1455 DF,  p-value: < 2.2e-16
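
The adjusted R-squared in this summary can be reproduced by hand from the multiple R-squared, which shows how the adjustment penalizes the number of predictors. This is a small worked check, assuming n = 1460 observations (1455 residual degrees of freedom plus 5 estimated coefficients) and the rounded R-squared printed above.

```r
r_squared <- 0.6689   # Multiple R-squared as printed by summary(home.lm)
n <- 1460             # observations: 1455 residual df + 5 coefficients
p <- 4                # predictors, excluding the intercept

# Adjusted R-squared = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)
round(adj_r_squared, 3)
```

With only 4 predictors and well over a thousand observations, the penalty is tiny, which is why the two values differ only in the third decimal place.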

All four predictors have a p-value below 0.05, so each is statistically significant at the 5% level (OverallCond is the weakest, with p = 0.0203). The model has an adjusted R-squared of 0.668; in other words, these 4 variables together explain about 66.8% of the variance in SalePrice. Given that this week’s discussion is about a single independent variable, let’s take YearBuilt and fit a simple linear regression model.

year_price.lm <- lm(SalePrice ~ YearBuilt)
year_price.lm
## 
## Call:
## lm(formula = SalePrice ~ YearBuilt)
## 
## Coefficients:
## (Intercept)    YearBuilt  
##    -2530308         1375
summary(year_price.lm)
## 
## Call:
## lm(formula = SalePrice ~ YearBuilt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -144191  -40999  -15464   22685  542814 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.530e+06  1.158e+05  -21.86   <2e-16 ***
## YearBuilt    1.375e+03  5.872e+01   23.42   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 67740 on 1458 degrees of freedom
## Multiple R-squared:  0.2734, Adjusted R-squared:  0.2729 
## F-statistic: 548.7 on 1 and 1458 DF,  p-value: < 2.2e-16
plot(YearBuilt,SalePrice)
abline(year_price.lm)
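
To make the fitted line concrete, the printed coefficients can be plugged in by hand. This is a rough sketch that uses the rounded values shown in the output (intercept -2530308, slope 1375), so the results are approximate; `predict_price` is a hypothetical helper name, not part of the original analysis.

```r
intercept <- -2530308   # rounded intercept from year_price.lm
slope <- 1375           # rounded slope: dollars per year of construction

# Hypothetical helper: predicted SalePrice for a given construction year.
predict_price <- function(year_built) {
  intercept + slope * year_built
}

predict_price(2000)                        # about 219692
predict_price(2000) - predict_price(1990)  # 13750 per decade
```

With the fitted model in memory, `predict(year_price.lm, data.frame(YearBuilt = 2000))` would give the same kind of estimate using the unrounded coefficients.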

plot(fitted(home.lm),resid(home.lm))

qqnorm(resid(home.lm))
qqline(resid(home.lm))

As you can see, YearBuilt does a fairly good job for a single predictor: on its own it accounts for 27.3% of the variance in SalePrice (adjusted R-squared of 0.2729). The residual plot and the Q-Q plot, which are drawn for the multiple regression model, suggest that the right tail of the residuals is heavier than a normal distribution would give, while the residuals to the left of that look okay. A linear model therefore seems reasonable here, with the caveat that the largest positive residuals are not well described by a normal distribution. While the multiple regression model does a better job of explaining home prices than the simple linear regression model, the simple model performed better than I had expected.
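
The same two diagnostic plots can be wrapped in a small helper so that they are easy to repeat for any fitted model, including the single-variable one. The sketch below is illustrative: `residual_plots` is a hypothetical helper, and it is demonstrated on simulated data with deliberately right-skewed errors rather than on the housing data.

```r
# Hypothetical helper: residuals-vs-fitted and normal Q-Q plots side by side.
residual_plots <- function(model) {
  res <- resid(model)
  old_par <- par(mfrow = c(1, 2))
  on.exit(par(old_par))          # restore the plotting layout afterwards
  plot(fitted(model), res,
       xlab = "Fitted values", ylab = "Residuals",
       main = "Residuals vs fitted")
  abline(h = 0, lty = 2)         # reference line at zero
  qqnorm(res)
  qqline(res)
  invisible(res)                 # return residuals for further checks
}

# Simulated example: a linear trend plus right-skewed (exponential) noise.
set.seed(42)
x <- runif(300, 1900, 2010)
y <- 1000 * (x - 1900) + rexp(300, rate = 1 / 20000)
toy.lm <- lm(y ~ x)
toy_res <- residual_plots(toy.lm)

# Right-skewed residuals have a median below their (zero) mean.
median(toy_res)
```

Both model summaries above show the same signature: the median residual is negative and the maximum is larger in magnitude than the minimum, which is consistent with the heavier right tail seen in the Q-Q plot. Calling `residual_plots(year_price.lm)` would show whether the simple model shares that pattern.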