Multiple Regression

Data Exploration

Inspecting the first few rows:

  price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_above basement yr_built yr_renovated zipcode
 221900        3      1.00        1180     5650      1          0    0         3     7       1180        0     1955            0   98178
 538000        3      2.25        2570     7242      2          0    0         3     7       2170        1     1951         1991   98125
 180000        2      1.00         770    10000      1          0    0         3     6        770        0     1933            0   98028
 604000        4      3.00        1960     5000      1          0    0         5     7       1050        1     1965            0   98136
 510000        3      2.00        1680     8080      1          0    0         3     8       1680        0     1987            0   98074
1225000        4      4.50        5420   101930      1          0    0         3    11       3890        1     2001            0   98053

Summary statistics for each variable:

##      price            bedrooms        bathrooms      sqft_living   
##  Min.   :  75000   Min.   : 0.000   Min.   :0.000   Min.   :  290  
##  1st Qu.: 321950   1st Qu.: 3.000   1st Qu.:1.750   1st Qu.: 1427  
##  Median : 450000   Median : 3.000   Median :2.250   Median : 1910  
##  Mean   : 540088   Mean   : 3.371   Mean   :2.115   Mean   : 2080  
##  3rd Qu.: 645000   3rd Qu.: 4.000   3rd Qu.:2.500   3rd Qu.: 2550  
##  Max.   :7700000   Max.   :33.000   Max.   :8.000   Max.   :13540  
##     sqft_lot           floors        waterfront            view       
##  Min.   :    520   Min.   :1.000   Min.   :0.000000   Min.   :0.0000  
##  1st Qu.:   5040   1st Qu.:1.000   1st Qu.:0.000000   1st Qu.:0.0000  
##  Median :   7618   Median :1.500   Median :0.000000   Median :0.0000  
##  Mean   :  15107   Mean   :1.494   Mean   :0.007542   Mean   :0.2343  
##  3rd Qu.:  10688   3rd Qu.:2.000   3rd Qu.:0.000000   3rd Qu.:0.0000  
##  Max.   :1651359   Max.   :3.500   Max.   :1.000000   Max.   :4.0000  
##    condition         grade          sqft_above      basement     
##  Min.   :1.000   Min.   : 1.000   Min.   : 290   Min.   :0.0000  
##  1st Qu.:3.000   1st Qu.: 7.000   1st Qu.:1190   1st Qu.:0.0000  
##  Median :3.000   Median : 7.000   Median :1560   Median :0.0000  
##  Mean   :3.409   Mean   : 7.657   Mean   :1788   Mean   :0.3927  
##  3rd Qu.:4.000   3rd Qu.: 8.000   3rd Qu.:2210   3rd Qu.:1.0000  
##  Max.   :5.000   Max.   :13.000   Max.   :9410   Max.   :1.0000  
##     yr_built     yr_renovated       zipcode     
##  Min.   :1900   Min.   :   0.0   Min.   :98001  
##  1st Qu.:1951   1st Qu.:   0.0   1st Qu.:98033  
##  Median :1975   Median :   0.0   Median :98065  
##  Mean   :1971   Mean   :  84.4   Mean   :98078  
##  3rd Qu.:1997   3rd Qu.:   0.0   3rd Qu.:98118  
##  Max.   :2015   Max.   :2015.0   Max.   :98199
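
Output like the first rows and the per-column summary above typically comes from head() and summary(). The sketch below is only illustrative: the real house_data is not reproduced here, so a small simulated data frame with hypothetical values and only three of the columns stands in for it.

```r
# Simulated stand-in for house_data (hypothetical values, not the real dataset)
set.seed(42)
n <- 100
house_data <- data.frame(
  price       = round(runif(n, 75000, 2000000)),
  bedrooms    = sample(1:6, n, replace = TRUE),
  sqft_living = round(runif(n, 290, 6000))
)

first_rows <- head(house_data)   # first six rows by default
print(first_rows)

# Min, quartiles, mean and max for every numeric column
print(summary(house_data))
```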

Simple Model

The summary below shows a simple model that we will try to improve on with multiple regression. This model uses the sqft_living variable to predict the price of a home. Note the relatively low R-squared value (0.4929). Residuals for this simple model will not be analyzed here.

## 
## Call:
## lm(formula = price ~ sqft_living, data = house_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1476062  -147486   -24043   106182  4362067 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -43580.743   4402.690  -9.899   <2e-16 ***
## sqft_living    280.624      1.936 144.920   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 261500 on 21611 degrees of freedom
## Multiple R-squared:  0.4929, Adjusted R-squared:  0.4928 
## F-statistic: 2.1e+04 on 1 and 21611 DF,  p-value: < 2.2e-16
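
A fit of the form shown above could be reproduced along the following lines. This is a minimal sketch under stated assumptions: the data frame is simulated (its coefficients and noise level are made up), and only the lm() call mirrors the one in the output. The last lines recompute R-squared by hand as 1 - SSE/SST to show what the reported figure measures.

```r
# Simulated stand-in for house_data (hypothetical values, not the real dataset)
set.seed(1)
n <- 200
house_data <- data.frame(sqft_living = round(runif(n, 500, 5000)))
house_data$price <- -40000 + 280 * house_data$sqft_living +
  rnorm(n, sd = 250000)

# Simple linear regression of price on living area
simple_fit  <- lm(price ~ sqft_living, data = house_data)
fit_summary <- summary(simple_fit)
print(fit_summary)

# R-squared as reported by summary()
r2 <- fit_summary$r.squared

# The same quantity by hand: 1 - (residual sum of squares / total sum of squares)
sse <- sum(residuals(simple_fit)^2)
sst <- sum((house_data$price - mean(house_data$price))^2)
r2_manual <- 1 - sse / sst
```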

Multiple Regression Model

The model below uses every feature in the dataset to predict price. It is the base model that we seek to improve by removing features through backwards elimination. R-squared rises immediately, from 0.4929 for the simple model to 0.6524. The summaries that follow drop zipcode, sqft_above, and basement in turn, each being the least significant term at its step.

## 
## Call:
## lm(formula = price ~ ., data = house_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1342233  -109629    -9789    89182  4259710 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.019e+07  3.104e+06   3.283  0.00103 ** 
## bedrooms     -3.887e+04  2.032e+03 -19.133  < 2e-16 ***
## bathrooms     4.472e+04  3.525e+03  12.685  < 2e-16 ***
## sqft_living   1.618e+02  6.587e+00  24.568  < 2e-16 ***
## sqft_lot     -2.584e-01  3.676e-02  -7.029 2.14e-12 ***
## floors        2.526e+04  3.810e+03   6.630 3.43e-11 ***
## waterfront    5.745e+05  1.867e+04  30.778  < 2e-16 ***
## view          4.562e+04  2.267e+03  20.122  < 2e-16 ***
## condition     1.870e+04  2.527e+03   7.402 1.39e-13 ***
## grade         1.239e+05  2.180e+03  56.845  < 2e-16 ***
## sqft_above    9.853e+00  7.257e+00   1.358  0.17452    
## basement      1.224e+04  5.645e+03   2.167  0.03021 *  
## yr_built     -3.587e+03  7.480e+01 -47.956  < 2e-16 ***
## yr_renovated  8.628e+00  3.924e+00   2.199  0.02790 *  
## zipcode      -4.043e+01  3.115e+01  -1.298  0.19434    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 216500 on 21598 degrees of freedom
## Multiple R-squared:  0.6524, Adjusted R-squared:  0.6521 
## F-statistic:  2895 on 14 and 21598 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + sqft_lot + 
##     floors + waterfront + view + condition + grade + sqft_above + 
##     basement + yr_built + yr_renovated, data = house_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1338640  -109489    -9868    89252  4257609 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6.166e+06  1.390e+05  44.374  < 2e-16 ***
## bedrooms     -3.873e+04  2.029e+03 -19.090  < 2e-16 ***
## bathrooms     4.477e+04  3.525e+03  12.701  < 2e-16 ***
## sqft_living   1.621e+02  6.585e+00  24.612  < 2e-16 ***
## sqft_lot     -2.545e-01  3.664e-02  -6.946 3.86e-12 ***
## floors        2.431e+04  3.739e+03   6.502 8.09e-11 ***
## waterfront    5.746e+05  1.867e+04  30.784  < 2e-16 ***
## view          4.537e+04  2.259e+03  20.084  < 2e-16 ***
## condition     1.913e+04  2.506e+03   7.635 2.35e-14 ***
## grade         1.239e+05  2.180e+03  56.831  < 2e-16 ***
## sqft_above    1.008e+01  7.255e+00   1.390   0.1647    
## basement      1.148e+04  5.615e+03   2.045   0.0409 *  
## yr_built     -3.558e+03  7.120e+01 -49.962  < 2e-16 ***
## yr_renovated  8.860e+00  3.920e+00   2.260   0.0238 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 216500 on 21599 degrees of freedom
## Multiple R-squared:  0.6523, Adjusted R-squared:  0.6521 
## F-statistic:  3118 on 13 and 21599 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + sqft_lot + 
##     floors + waterfront + view + condition + grade + basement + 
##     yr_built + yr_renovated, data = house_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1332929  -109501    -9854    89148  4248178 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6.166e+06  1.390e+05  44.371  < 2e-16 ***
## bedrooms     -3.877e+04  2.029e+03 -19.111  < 2e-16 ***
## bathrooms     4.475e+04  3.525e+03  12.693  < 2e-16 ***
## sqft_living   1.700e+02  3.293e+00  51.624  < 2e-16 ***
## sqft_lot     -2.516e-01  3.658e-02  -6.877 6.26e-12 ***
## floors        2.562e+04  3.620e+03   7.077 1.52e-12 ***
## waterfront    5.748e+05  1.867e+04  30.794  < 2e-16 ***
## view          4.497e+04  2.241e+03  20.070  < 2e-16 ***
## condition     1.882e+04  2.496e+03   7.542 4.81e-14 ***
## grade         1.244e+05  2.142e+03  58.091  < 2e-16 ***
## basement      5.327e+03  3.451e+03   1.544   0.1227    
## yr_built     -3.558e+03  7.121e+01 -49.966  < 2e-16 ***
## yr_renovated  8.717e+00  3.919e+00   2.225   0.0261 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 216500 on 21600 degrees of freedom
## Multiple R-squared:  0.6523, Adjusted R-squared:  0.6521 
## F-statistic:  3377 on 12 and 21600 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + sqft_lot + 
##     floors + waterfront + view + condition + grade + yr_built + 
##     yr_renovated, data = house_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1331249  -109660   -10027    89148  4240400 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6.199e+06  1.373e+05  45.144  < 2e-16 ***
## bedrooms     -3.879e+04  2.029e+03 -19.122  < 2e-16 ***
## bathrooms     4.593e+04  3.441e+03  13.345  < 2e-16 ***
## sqft_living   1.705e+02  3.275e+00  52.064  < 2e-16 ***
## sqft_lot     -2.571e-01  3.640e-02  -7.063 1.68e-12 ***
## floors        2.382e+04  3.428e+03   6.949 3.79e-12 ***
## waterfront    5.738e+05  1.866e+04  30.759  < 2e-16 ***
## view          4.533e+04  2.228e+03  20.348  < 2e-16 ***
## condition     1.887e+04  2.496e+03   7.559 4.22e-14 ***
## grade         1.243e+05  2.139e+03  58.094  < 2e-16 ***
## yr_built     -3.573e+03  7.050e+01 -50.685  < 2e-16 ***
## yr_renovated  8.580e+00  3.918e+00   2.190   0.0285 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 216500 on 21601 degrees of freedom
## Multiple R-squared:  0.6523, Adjusted R-squared:  0.6521 
## F-statistic:  3684 on 11 and 21601 DF,  p-value: < 2.2e-16

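The model below is our best model, containing only statistically significant factors. It also drops yr_renovated (p = 0.0285), so every remaining term is significant at the 0.001 level. We note that the R-squared value did not increase through the elimination steps: it slipped from 0.6524 to 0.6522. This is expected, since removing a predictor from a least-squares fit can never raise R-squared. Adjusted R-squared was essentially unchanged (0.6521 to 0.652).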

## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + sqft_lot + 
##     floors + waterfront + view + condition + grade + yr_built, 
##     data = house_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1321450  -109820   -10128    89477  4248362 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.298e+06  1.297e+05  48.571  < 2e-16 ***
## bedrooms    -3.892e+04  2.028e+03 -19.194  < 2e-16 ***
## bathrooms    4.694e+04  3.410e+03  13.763  < 2e-16 ***
## sqft_living  1.704e+02  3.275e+00  52.042  < 2e-16 ***
## sqft_lot    -2.564e-01  3.641e-02  -7.043 1.93e-12 ***
## floors       2.420e+04  3.424e+03   7.066 1.65e-12 ***
## waterfront   5.762e+05  1.863e+04  30.934  < 2e-16 ***
## view         4.544e+04  2.228e+03  20.401  < 2e-16 ***
## condition    1.797e+04  2.462e+03   7.298 3.02e-13 ***
## grade        1.243e+05  2.139e+03  58.114  < 2e-16 ***
## yr_built    -3.623e+03  6.677e+01 -54.261  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 216600 on 21602 degrees of freedom
## Multiple R-squared:  0.6522, Adjusted R-squared:  0.652 
## F-statistic:  4051 on 10 and 21602 DF,  p-value: < 2.2e-16
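
The elimination steps above can be sketched as a loop that refits the model without its least significant term. Several things here are assumptions rather than the procedure used for the output above: the helper name backward_eliminate, the 0.05 cutoff (the removal of yr_renovated at p = 0.0285 suggests a stricter threshold was applied), and the simulated data. The loop also assumes all predictors are numeric, so that coefficient names match term names.

```r
# Simulated stand-in for house_data (hypothetical values, not the real dataset)
set.seed(2)
n <- 300
house_data <- data.frame(
  sqft_living = runif(n, 500, 5000),
  grade       = sample(4:12, n, replace = TRUE),
  zipcode     = sample(98001:98199, n, replace = TRUE)  # unrelated to price here
)
house_data$price <- 150 * house_data$sqft_living +
  100000 * house_data$grade + rnorm(n, sd = 50000)

# Hypothetical helper: refit without the least significant term until every
# remaining term has a p-value below alpha (alpha = 0.05 is an assumption)
backward_eliminate <- function(fit, alpha = 0.05) {
  repeat {
    coefs <- summary(fit)$coefficients
    terms_tab <- coefs[rownames(coefs) != "(Intercept)", , drop = FALSE]
    if (nrow(terms_tab) == 0) break          # nothing left to remove
    pvals <- setNames(terms_tab[, "Pr(>|t|)"], rownames(terms_tab))
    if (max(pvals) < alpha) break            # all remaining terms significant
    worst <- names(pvals)[which.max(pvals)]  # least significant term
    fit <- update(fit, as.formula(paste(". ~ . -", worst)))
  }
  fit
}

full_fit  <- lm(price ~ ., data = house_data)
final_fit <- backward_eliminate(full_fit)
print(summary(final_fit))
```

Removing one term at a time matters because p-values change after each refit, as the sqft_above and basement rows in the summaries above illustrate.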

Adding a quadratic term

Here we refit the same starting model with one addition: a quadratic term, br2, which is the square of the number of bathrooms (an arbitrary choice of feature). With all other features unchanged, R-squared increases only marginally, from 0.6524 to 0.672.

## 
## Call:
## lm(formula = price ~ ., data = house_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2347006  -107460    -8070    87376  3875050 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.875e+07  3.025e+06   6.199 5.79e-10 ***
## bedrooms     -2.959e+04  1.990e+03 -14.864  < 2e-16 ***
## bathrooms    -2.090e+05  7.851e+03 -26.618  < 2e-16 ***
## sqft_living   1.359e+02  6.439e+00  21.102  < 2e-16 ***
## sqft_lot     -2.717e-01  3.571e-02  -7.606 2.93e-14 ***
## floors        3.841e+04  3.719e+03  10.328  < 2e-16 ***
## waterfront    5.655e+05  1.813e+04  31.183  < 2e-16 ***
## view          4.336e+04  2.203e+03  19.683  < 2e-16 ***
## condition     2.754e+04  2.467e+03  11.162  < 2e-16 ***
## grade         1.306e+05  2.126e+03  61.424  < 2e-16 ***
## sqft_above    1.131e+01  7.049e+00   1.604    0.109    
## basement      3.207e+04  5.512e+03   5.819 6.00e-09 ***
## yr_built     -3.102e+03  7.391e+01 -41.974  < 2e-16 ***
## yr_renovated  1.629e+01  3.818e+00   4.268 1.98e-05 ***
## zipcode      -1.356e+02  3.038e+01  -4.466 8.03e-06 ***
## br2           5.350e+04  1.490e+03  35.911  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 210300 on 21597 degrees of freedom
## Multiple R-squared:  0.672,  Adjusted R-squared:  0.6717 
## F-statistic:  2949 on 15 and 21597 DF,  p-value: < 2.2e-16
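
One way to build a term like br2 is to add the squared column to the data frame, so that the formula price ~ . picks it up automatically. The sketch below assumes that approach and uses simulated data with made-up coefficients. Only the column name br2 and its definition as bathrooms squared come from the text.

```r
# Simulated stand-in for house_data (hypothetical values, not the real dataset)
set.seed(3)
n <- 300
house_data <- data.frame(
  bathrooms = sample(seq(0.5, 5, by = 0.25), n, replace = TRUE)
)
house_data$price <- 300000 - 50000 * house_data$bathrooms +
  40000 * house_data$bathrooms^2 + rnorm(n, sd = 50000)

# Baseline: bathrooms enters linearly only
linear_fit <- lm(price ~ bathrooms, data = house_data)

# Add the squared term as its own column so that `price ~ .` includes it
house_data$br2 <- house_data$bathrooms^2
quad_fit <- lm(price ~ ., data = house_data)

# The models are nested, so R-squared cannot decrease; the partial F-test
# from anova() checks whether the added term is worth keeping
r2_linear <- summary(linear_fit)$r.squared
r2_quad   <- summary(quad_fit)$r.squared
print(anova(linear_fit, quad_fit))
```

An equivalent formulation keeps the data frame untouched and writes the term in the formula as I(bathrooms^2).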

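Eliminating sqft_above, the only term that is not significant in this model (p = 0.109), changes little: R-squared is 0.6719 and adjusted R-squared stays at 0.6717.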

## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + sqft_lot + 
##     floors + waterfront + view + condition + grade + basement + 
##     yr_built + yr_renovated + zipcode + br2, data = house_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2362020  -107439    -8184    87313  3873054 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.886e+07  3.024e+06   6.238 4.51e-10 ***
## bedrooms     -2.964e+04  1.990e+03 -14.891  < 2e-16 ***
## bathrooms    -2.089e+05  7.851e+03 -26.613  < 2e-16 ***
## sqft_living   1.448e+02  3.287e+00  44.049  < 2e-16 ***
## sqft_lot     -2.685e-01  3.566e-02  -7.529 5.33e-14 ***
## floors        3.989e+04  3.602e+03  11.076  < 2e-16 ***
## waterfront    5.657e+05  1.813e+04  31.194  < 2e-16 ***
## view          4.292e+04  2.186e+03  19.635  < 2e-16 ***
## condition     2.718e+04  2.457e+03  11.062  < 2e-16 ***
## grade         1.312e+05  2.090e+03  62.791  < 2e-16 ***
## basement      2.519e+04  3.459e+03   7.282 3.40e-13 ***
## yr_built     -3.104e+03  7.391e+01 -41.993  < 2e-16 ***
## yr_renovated  1.612e+01  3.817e+00   4.225 2.40e-05 ***
## zipcode      -1.368e+02  3.037e+01  -4.504 6.69e-06 ***
## br2           5.348e+04  1.490e+03  35.901  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 210400 on 21598 degrees of freedom
## Multiple R-squared:  0.6719, Adjusted R-squared:  0.6717 
## F-statistic:  3160 on 14 and 21598 DF,  p-value: < 2.2e-16

Residual Analysis - Variance

We conduct residual analysis to verify the assumptions that make our linear model valid. We need the model residuals to have constant variance. The residuals are mostly centered about 0, but their spread seems to increase at the higher fitted values. This makes us cautious about the model.
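
A residuals-versus-fitted check of this kind can be sketched as follows. The data is simulated so that the noise grows with the predictor, mimicking the spreading pattern described. The variable names and the spread_ratio summary are illustrative assumptions, not part of the original analysis.

```r
# Simulated data whose noise standard deviation grows with x (hypothetical)
set.seed(4)
n <- 300
x <- runif(n, 500, 5000)
y <- 100 * x + rnorm(n, sd = 20 * x)
fit <- lm(y ~ x)

# Residuals against fitted values: constant variance would show an even band
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs fitted")
abline(h = 0, lty = 2)

# Crude numeric companion to the plot: residual spread in the upper half of
# the fitted values relative to the lower half (about 1 if variance is constant)
upper <- fitted(fit) > median(fitted(fit))
spread_ratio <- sd(resid(fit)[upper]) / sd(resid(fit)[!upper])
print(spread_ratio)
```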

Residual Analysis - Q-Q Plot

We also need to verify that our residuals are normally distributed, and we do so with the help of a Q-Q plot. There is a significant deviation from the line, especially at the tails, which suggests that the model is not valid for the extremes. This makes sense, given that the price of a home can be driven up significantly by a number of factors not considered here.
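
A normal Q-Q plot of the residuals can be drawn with qqnorm() and qqline(). In the sketch below the data is simulated with heavy-tailed noise, so the points bend away from the reference line at both ends, similar to the behaviour described. Using standardized residuals from rstandard() is a choice made here, not necessarily the one behind the original plot.

```r
# Simulated data with heavy-tailed noise (t distribution, 3 degrees of freedom)
set.seed(5)
n <- 500
x <- runif(n, 500, 5000)
y <- 100 * x + 50000 * rt(n, df = 3)
fit <- lm(y ~ x)

# Standardized residuals put the sample quantiles on the same scale as the
# theoretical normal quantiles
std_res <- rstandard(fit)

# qqnorm() draws the plot and invisibly returns the plotted coordinates
qq <- qqnorm(std_res, main = "Normal Q-Q plot of standardized residuals")
qqline(std_res, lty = 2)   # reference line through the first and third quartiles
```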

Conclusion

In conclusion, we were able to increase the R-squared value from 0.4929 for the simple regression model with one feature to 0.6524 by adding multiple features from the dataset. We were able to increase R-squared slightly further, to 0.672, by introducing a quadratic term. However, no gain in R-squared was realised through backwards elimination. The model is also not that reliable: the spread of the residuals may increase with the fitted values and, more importantly, the residuals diverge significantly from normal at the tails.