library(readr)  # read_csv
library(dplyr)  # %>%, select, rename

file <- 'https://raw.githubusercontent.com/maelillien/data/master/kc_house_data.csv'
house_data <- read_csv(file, col_names = TRUE)
# drop some features
house_data <- house_data %>% select(-c(id, date, lat, long, sqft_living15, sqft_lot15))
# turn basement into a dichotomous feature (1 = basement, 0 = no basement)
house_data$sqft_basement[house_data$sqft_basement > 0] <- 1
house_data <- rename(house_data, basement = sqft_basement)

Inspecting the first few rows:
| price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | grade | sqft_above | basement | yr_built | yr_renovated | zipcode |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 221900 | 3 | 1.00 | 1180 | 5650 | 1 | 0 | 0 | 3 | 7 | 1180 | 0 | 1955 | 0 | 98178 |
| 538000 | 3 | 2.25 | 2570 | 7242 | 2 | 0 | 0 | 3 | 7 | 2170 | 1 | 1951 | 1991 | 98125 |
| 180000 | 2 | 1.00 | 770 | 10000 | 1 | 0 | 0 | 3 | 6 | 770 | 0 | 1933 | 0 | 98028 |
| 604000 | 4 | 3.00 | 1960 | 5000 | 1 | 0 | 0 | 5 | 7 | 1050 | 1 | 1965 | 0 | 98136 |
| 510000 | 3 | 2.00 | 1680 | 8080 | 1 | 0 | 0 | 3 | 8 | 1680 | 0 | 1987 | 0 | 98074 |
| 1225000 | 4 | 4.50 | 5420 | 101930 | 1 | 0 | 0 | 3 | 11 | 3890 | 1 | 2001 | 0 | 98053 |

Summary statistics for each feature:
## price bedrooms bathrooms sqft_living
## Min. : 75000 Min. : 0.000 Min. :0.000 Min. : 290
## 1st Qu.: 321950 1st Qu.: 3.000 1st Qu.:1.750 1st Qu.: 1427
## Median : 450000 Median : 3.000 Median :2.250 Median : 1910
## Mean : 540088 Mean : 3.371 Mean :2.115 Mean : 2080
## 3rd Qu.: 645000 3rd Qu.: 4.000 3rd Qu.:2.500 3rd Qu.: 2550
## Max. :7700000 Max. :33.000 Max. :8.000 Max. :13540
## sqft_lot floors waterfront view
## Min. : 520 Min. :1.000 Min. :0.000000 Min. :0.0000
## 1st Qu.: 5040 1st Qu.:1.000 1st Qu.:0.000000 1st Qu.:0.0000
## Median : 7618 Median :1.500 Median :0.000000 Median :0.0000
## Mean : 15107 Mean :1.494 Mean :0.007542 Mean :0.2343
## 3rd Qu.: 10688 3rd Qu.:2.000 3rd Qu.:0.000000 3rd Qu.:0.0000
## Max. :1651359 Max. :3.500 Max. :1.000000 Max. :4.0000
## condition grade sqft_above basement
## Min. :1.000 Min. : 1.000 Min. : 290 Min. :0.0000
## 1st Qu.:3.000 1st Qu.: 7.000 1st Qu.:1190 1st Qu.:0.0000
## Median :3.000 Median : 7.000 Median :1560 Median :0.0000
## Mean :3.409 Mean : 7.657 Mean :1788 Mean :0.3927
## 3rd Qu.:4.000 3rd Qu.: 8.000 3rd Qu.:2210 3rd Qu.:1.0000
## Max. :5.000 Max. :13.000 Max. :9410 Max. :1.0000
## yr_built yr_renovated zipcode
## Min. :1900 Min. : 0.0 Min. :98001
## 1st Qu.:1951 1st Qu.: 0.0 1st Qu.:98033
## Median :1975 Median : 0.0 Median :98065
## Mean :1971 Mean : 84.4 Mean :98078
## 3rd Qu.:1997 3rd Qu.: 0.0 3rd Qu.:98118
## Max. :2015 Max. :2015.0 Max. :98199
The model summary below is a simple baseline that we will try to improve on with multiple regression. It uses only the sqft_living variable to predict the price of a home. Note the low R-squared value of 0.4929: living area alone explains just under half of the variance in price. Residuals for this simple model are not analyzed here.
##
## Call:
## lm(formula = price ~ sqft_living, data = house_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1476062 -147486 -24043 106182 4362067
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -43580.743 4402.690 -9.899 <2e-16 ***
## sqft_living 280.624 1.936 144.920 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 261500 on 21611 degrees of freedom
## Multiple R-squared: 0.4929, Adjusted R-squared: 0.4928
## F-statistic: 2.1e+04 on 1 and 21611 DF, p-value: < 2.2e-16
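As a rough sketch of how a summary like the one above is produced and how its R-squared values can be read programmatically, the block below fits a one-predictor model. It is an illustration only: the built-in `mtcars` data stands in for `house_data` so that the block runs on its own, and the variable names `fit_simple` and `simple_summary` are made up for this example.

```r
# Sketch only: mtcars is a stand-in dataset so the block is self-contained.
# For the housing data the call would be lm(price ~ sqft_living, data = house_data).
fit_simple <- lm(mpg ~ wt, data = mtcars)

# summary() returns an object whose fields can be read directly
simple_summary <- summary(fit_simple)
r_squared     <- simple_summary$r.squared      # "Multiple R-squared" in the printout
adj_r_squared <- simple_summary$adj.r.squared  # "Adjusted R-squared" in the printout

print(c(r_squared = r_squared, adj_r_squared = adj_r_squared))
```

Reading the values from the summary object rather than from the printed text makes it easier to compare several models side by side.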
The model below uses all the features in the dataset to predict price. It is the base model we seek to improve on by removing features via backward elimination. Right away we see that R-squared has increased, from 0.4929 to 0.6524.
##
## Call:
## lm(formula = price ~ ., data = house_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1342233 -109629 -9789 89182 4259710
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.019e+07 3.104e+06 3.283 0.00103 **
## bedrooms -3.887e+04 2.032e+03 -19.133 < 2e-16 ***
## bathrooms 4.472e+04 3.525e+03 12.685 < 2e-16 ***
## sqft_living 1.618e+02 6.587e+00 24.568 < 2e-16 ***
## sqft_lot -2.584e-01 3.676e-02 -7.029 2.14e-12 ***
## floors 2.526e+04 3.810e+03 6.630 3.43e-11 ***
## waterfront 5.745e+05 1.867e+04 30.778 < 2e-16 ***
## view 4.562e+04 2.267e+03 20.122 < 2e-16 ***
## condition 1.870e+04 2.527e+03 7.402 1.39e-13 ***
## grade 1.239e+05 2.180e+03 56.845 < 2e-16 ***
## sqft_above 9.853e+00 7.257e+00 1.358 0.17452
## basement 1.224e+04 5.645e+03 2.167 0.03021 *
## yr_built -3.587e+03 7.480e+01 -47.956 < 2e-16 ***
## yr_renovated 8.628e+00 3.924e+00 2.199 0.02790 *
## zipcode -4.043e+01 3.115e+01 -1.298 0.19434
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 216500 on 21598 degrees of freedom
## Multiple R-squared: 0.6524, Adjusted R-squared: 0.6521
## F-statistic: 2895 on 14 and 21598 DF, p-value: < 2.2e-16
The three summaries that follow show successive elimination steps. Each one refits the model after dropping the feature with the highest p-value in the previous fit: first zipcode (p = 0.194), then sqft_above (p = 0.165), then basement (p = 0.123).
##
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + sqft_lot +
## floors + waterfront + view + condition + grade + sqft_above +
## basement + yr_built + yr_renovated, data = house_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1338640 -109489 -9868 89252 4257609
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.166e+06 1.390e+05 44.374 < 2e-16 ***
## bedrooms -3.873e+04 2.029e+03 -19.090 < 2e-16 ***
## bathrooms 4.477e+04 3.525e+03 12.701 < 2e-16 ***
## sqft_living 1.621e+02 6.585e+00 24.612 < 2e-16 ***
## sqft_lot -2.545e-01 3.664e-02 -6.946 3.86e-12 ***
## floors 2.431e+04 3.739e+03 6.502 8.09e-11 ***
## waterfront 5.746e+05 1.867e+04 30.784 < 2e-16 ***
## view 4.537e+04 2.259e+03 20.084 < 2e-16 ***
## condition 1.913e+04 2.506e+03 7.635 2.35e-14 ***
## grade 1.239e+05 2.180e+03 56.831 < 2e-16 ***
## sqft_above 1.008e+01 7.255e+00 1.390 0.1647
## basement 1.148e+04 5.615e+03 2.045 0.0409 *
## yr_built -3.558e+03 7.120e+01 -49.962 < 2e-16 ***
## yr_renovated 8.860e+00 3.920e+00 2.260 0.0238 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 216500 on 21599 degrees of freedom
## Multiple R-squared: 0.6523, Adjusted R-squared: 0.6521
## F-statistic: 3118 on 13 and 21599 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + sqft_lot +
## floors + waterfront + view + condition + grade + basement +
## yr_built + yr_renovated, data = house_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1332929 -109501 -9854 89148 4248178
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.166e+06 1.390e+05 44.371 < 2e-16 ***
## bedrooms -3.877e+04 2.029e+03 -19.111 < 2e-16 ***
## bathrooms 4.475e+04 3.525e+03 12.693 < 2e-16 ***
## sqft_living 1.700e+02 3.293e+00 51.624 < 2e-16 ***
## sqft_lot -2.516e-01 3.658e-02 -6.877 6.26e-12 ***
## floors 2.562e+04 3.620e+03 7.077 1.52e-12 ***
## waterfront 5.748e+05 1.867e+04 30.794 < 2e-16 ***
## view 4.497e+04 2.241e+03 20.070 < 2e-16 ***
## condition 1.882e+04 2.496e+03 7.542 4.81e-14 ***
## grade 1.244e+05 2.142e+03 58.091 < 2e-16 ***
## basement 5.327e+03 3.451e+03 1.544 0.1227
## yr_built -3.558e+03 7.121e+01 -49.966 < 2e-16 ***
## yr_renovated 8.717e+00 3.919e+00 2.225 0.0261 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 216500 on 21600 degrees of freedom
## Multiple R-squared: 0.6523, Adjusted R-squared: 0.6521
## F-statistic: 3377 on 12 and 21600 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + sqft_lot +
## floors + waterfront + view + condition + grade + yr_built +
## yr_renovated, data = house_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1331249 -109660 -10027 89148 4240400
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.199e+06 1.373e+05 45.144 < 2e-16 ***
## bedrooms -3.879e+04 2.029e+03 -19.122 < 2e-16 ***
## bathrooms 4.593e+04 3.441e+03 13.345 < 2e-16 ***
## sqft_living 1.705e+02 3.275e+00 52.064 < 2e-16 ***
## sqft_lot -2.571e-01 3.640e-02 -7.063 1.68e-12 ***
## floors 2.382e+04 3.428e+03 6.949 3.79e-12 ***
## waterfront 5.738e+05 1.866e+04 30.759 < 2e-16 ***
## view 4.533e+04 2.228e+03 20.348 < 2e-16 ***
## condition 1.887e+04 2.496e+03 7.559 4.22e-14 ***
## grade 1.243e+05 2.139e+03 58.094 < 2e-16 ***
## yr_built -3.573e+03 7.050e+01 -50.685 < 2e-16 ***
## yr_renovated 8.580e+00 3.918e+00 2.190 0.0285 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 216500 on 21601 degrees of freedom
## Multiple R-squared: 0.6523, Adjusted R-squared: 0.6521
## F-statistic: 3684 on 11 and 21601 DF, p-value: < 2.2e-16
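The model below is our best model, containing only highly significant features: once yr_renovated (p = 0.0285) is also dropped, every remaining coefficient has p < 0.001. We note that R-squared did not increase through the elimination steps. It slipped from 0.6524 to 0.6522, which is expected, since removing a predictor can never raise R-squared.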
##
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + sqft_lot +
## floors + waterfront + view + condition + grade + yr_built,
## data = house_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1321450 -109820 -10128 89477 4248362
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.298e+06 1.297e+05 48.571 < 2e-16 ***
## bedrooms -3.892e+04 2.028e+03 -19.194 < 2e-16 ***
## bathrooms 4.694e+04 3.410e+03 13.763 < 2e-16 ***
## sqft_living 1.704e+02 3.275e+00 52.042 < 2e-16 ***
## sqft_lot -2.564e-01 3.641e-02 -7.043 1.93e-12 ***
## floors 2.420e+04 3.424e+03 7.066 1.65e-12 ***
## waterfront 5.762e+05 1.863e+04 30.934 < 2e-16 ***
## view 4.544e+04 2.228e+03 20.401 < 2e-16 ***
## condition 1.797e+04 2.462e+03 7.298 3.02e-13 ***
## grade 1.243e+05 2.139e+03 58.114 < 2e-16 ***
## yr_built -3.623e+03 6.677e+01 -54.261 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 216600 on 21602 degrees of freedom
## Multiple R-squared: 0.6522, Adjusted R-squared: 0.652
## F-statistic: 4051 on 10 and 21602 DF, p-value: < 2.2e-16
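The elimination steps above can be sketched as a loop that repeatedly refits the model without its least significant feature. This is a minimal illustration, not necessarily the procedure used to produce the summaries: `backward_eliminate` is a hypothetical helper, the cutoff `alpha` is an assumed parameter (the text does not state one), `mtcars` stands in for `house_data`, and the helper assumes all predictors are numeric.

```r
# Hypothetical helper: drop the predictor with the largest p-value and refit,
# until every remaining predictor has p < alpha (or only one predictor is left).
# Assumes numeric predictors, so coefficient names match column names.
backward_eliminate <- function(data, response, alpha = 0.05) {
  predictors <- setdiff(names(data), response)
  repeat {
    fit <- lm(reformulate(predictors, response = response), data = data)
    coefs <- summary(fit)$coefficients
    # p-values for everything except the intercept, keyed by predictor name
    p_values <- setNames(coefs[-1, 4], rownames(coefs)[-1])
    worst <- which.max(p_values)
    if (p_values[worst] < alpha || length(predictors) == 1) break
    predictors <- setdiff(predictors, names(p_values)[worst])
  }
  fit
}

# Stand-in data; for the housing data: backward_eliminate(house_data, "price")
final_fit <- backward_eliminate(mtcars, "mpg", alpha = 0.05)
print(summary(final_fit)$coefficients)
```

Selecting on p-values alone is only one possible criterion; base R's `step()` function performs a similar search using AIC instead.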
Here we return to the same starting model but add a quadratic term, br2, which is the square of the number of bathrooms (an arbitrary choice of feature). With this single addition, R-squared increases, though only modestly, from 0.6524 to 0.672.
# introduce a quadratic term
house_data$br2 <- house_data$bathrooms^2
mtpl.lm <- lm(price ~ ., data = house_data)
summary(mtpl.lm)
##
## Call:
## lm(formula = price ~ ., data = house_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2347006 -107460 -8070 87376 3875050
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.875e+07 3.025e+06 6.199 5.79e-10 ***
## bedrooms -2.959e+04 1.990e+03 -14.864 < 2e-16 ***
## bathrooms -2.090e+05 7.851e+03 -26.618 < 2e-16 ***
## sqft_living 1.359e+02 6.439e+00 21.102 < 2e-16 ***
## sqft_lot -2.717e-01 3.571e-02 -7.606 2.93e-14 ***
## floors 3.841e+04 3.719e+03 10.328 < 2e-16 ***
## waterfront 5.655e+05 1.813e+04 31.183 < 2e-16 ***
## view 4.336e+04 2.203e+03 19.683 < 2e-16 ***
## condition 2.754e+04 2.467e+03 11.162 < 2e-16 ***
## grade 1.306e+05 2.126e+03 61.424 < 2e-16 ***
## sqft_above 1.131e+01 7.049e+00 1.604 0.109
## basement 3.207e+04 5.512e+03 5.819 6.00e-09 ***
## yr_built -3.102e+03 7.391e+01 -41.974 < 2e-16 ***
## yr_renovated 1.629e+01 3.818e+00 4.268 1.98e-05 ***
## zipcode -1.356e+02 3.038e+01 -4.466 8.03e-06 ***
## br2 5.350e+04 1.490e+03 35.911 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 210300 on 21597 degrees of freedom
## Multiple R-squared: 0.672, Adjusted R-squared: 0.6717
## F-statistic: 2949 on 15 and 21597 DF, p-value: < 2.2e-16
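Eliminating the sqft_above feature (p = 0.109), the only one that is not significant, changes little: R-squared moves from 0.672 to 0.6719 and adjusted R-squared stays at 0.6717.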
##
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + sqft_lot +
## floors + waterfront + view + condition + grade + basement +
## yr_built + yr_renovated + zipcode + br2, data = house_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2362020 -107439 -8184 87313 3873054
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.886e+07 3.024e+06 6.238 4.51e-10 ***
## bedrooms -2.964e+04 1.990e+03 -14.891 < 2e-16 ***
## bathrooms -2.089e+05 7.851e+03 -26.613 < 2e-16 ***
## sqft_living 1.448e+02 3.287e+00 44.049 < 2e-16 ***
## sqft_lot -2.685e-01 3.566e-02 -7.529 5.33e-14 ***
## floors 3.989e+04 3.602e+03 11.076 < 2e-16 ***
## waterfront 5.657e+05 1.813e+04 31.194 < 2e-16 ***
## view 4.292e+04 2.186e+03 19.635 < 2e-16 ***
## condition 2.718e+04 2.457e+03 11.062 < 2e-16 ***
## grade 1.312e+05 2.090e+03 62.791 < 2e-16 ***
## basement 2.519e+04 3.459e+03 7.282 3.40e-13 ***
## yr_built -3.104e+03 7.391e+01 -41.993 < 2e-16 ***
## yr_renovated 1.612e+01 3.817e+00 4.225 2.40e-05 ***
## zipcode -1.368e+02 3.037e+01 -4.504 6.69e-06 ***
## br2 5.348e+04 1.490e+03 35.901 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 210400 on 21598 degrees of freedom
## Multiple R-squared: 0.6719, Adjusted R-squared: 0.6717
## F-statistic: 3160 on 14 and 21598 DF, p-value: < 2.2e-16
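To make the quadratic term a little more concrete, the sketch below shows two things. First, a squared term can be written inside the formula with `I()` rather than stored as an extra column, which is an alternative to the approach used above and not what the original code does. Second, the bathrooms and br2 coefficients together describe a parabola whose turning point is at -b1 / (2 * b2). The `mtcars` data and the names `fit_linear`, `fit_quad`, `b1` and `b2` are assumptions for illustration; the two coefficient values are copied from the summary above.

```r
# Sketch only: mtcars stands in for house_data, with hp in the role of bathrooms.
fit_linear <- lm(mpg ~ hp + wt, data = mtcars)
fit_quad   <- lm(mpg ~ hp + I(hp^2) + wt, data = mtcars)  # I() protects ^ inside a formula

# Adding a term to a nested model cannot lower R-squared
r2_linear <- summary(fit_linear)$r.squared
r2_quad   <- summary(fit_quad)$r.squared
print(c(linear = r2_linear, quadratic = r2_quad))

# Combined effect of bathrooms in the housing model: b1 * x + b2 * x^2
b1 <- -2.089e+05  # bathrooms coefficient from the summary above
b2 <-  5.348e+04  # br2 coefficient from the summary above
turning_point <- -b1 / (2 * b2)
print(turning_point)
```

With these coefficients the turning point falls at roughly two bathrooms, so the negative sign on bathrooms does not mean that extra bathrooms lower the predicted price across the whole range: above that point the squared term dominates.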
We conduct residual analysis to check the assumptions that make our linear model valid. First, the model residuals should have constant variance. The residuals are mostly centered about 0, but their spread seems to increase at higher fitted values. This makes us cautious about the model.
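A residuals-versus-fitted check of this kind can be sketched as follows. This is an illustration under assumptions: `mtcars` stands in for `house_data`, and the names `fit` and `spread_by_half` are made up here. The numeric comparison at the end is a crude supplement to the plot, not a formal test.

```r
# Sketch only: mtcars is a stand-in; for the housing data, fit the chosen price model instead.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Residuals against fitted values: a funnel shape suggests non-constant variance
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs fitted")
abline(h = 0, lty = 2)

# Crude numeric check: residual standard deviation below vs above the median fitted value
upper_half <- fitted(fit) > median(fitted(fit))
spread_by_half <- tapply(resid(fit), upper_half, sd)
print(spread_by_half)
```

A much larger standard deviation in the upper half would be consistent with the increasing spread described above.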
We also need to check that the residuals are normally distributed, which we do with the help of a Q-Q plot. The residuals deviate substantially from the reference line, especially at the tails, which suggests that the model is not valid for the extremes. This makes sense, given that the price of a home can be driven up significantly by factors not considered here.
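A Q-Q check along these lines can be sketched with base R. As before, this is an illustration rather than the original code: `mtcars` stands in for `house_data`, and `fit`, `std_res` and `tail_share` are names chosen for this example.

```r
# Sketch only: mtcars is a stand-in; for the housing data, use the fitted price model.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Standardized residuals put the points on the same scale as the normal quantiles
std_res <- rstandard(fit)

qqnorm(std_res, main = "Normal Q-Q plot of standardized residuals")
qqline(std_res, lty = 2)

# Share of standardized residuals beyond +/- 3; about 0.3% is expected under normality
tail_share <- mean(abs(std_res) > 3)
print(tail_share)
```

Points that bend away from the dashed line at either end correspond to tails heavier or lighter than a normal distribution would produce.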
In conclusion, we were able to increase R-squared over the simple one-feature regression model (0.4929) by adding multiple features from the dataset (0.6524). We were able to increase R-squared slightly further, to 0.672, by introducing a quadratic term. However, no gain in R-squared was realised through backward elimination. The model is also not that reliable: the residuals may spread out as fitted values increase and, more importantly, they diverge significantly from normal at the tails.