We will attempt to build a multiple linear regression model that predicts housing prices in King County. We’re going to skip the train-test split. Home sale prices are heavily influenced by market forces, so the date of the sale would be a big factor in the price, but we’ll exclude that data for simplicity.
Read the csv as a dataframe.
house_df <- read.csv("kc_house_data.csv", stringsAsFactors = FALSE)
Summarize it.
summary(house_df)
## id date price bedrooms
## Min. :1.000e+06 Length:21613 Min. : 75000 Min. : 0.000
## 1st Qu.:2.123e+09 Class :character 1st Qu.: 321950 1st Qu.: 3.000
## Median :3.905e+09 Mode :character Median : 450000 Median : 3.000
## Mean :4.580e+09 Mean : 540088 Mean : 3.371
## 3rd Qu.:7.309e+09 3rd Qu.: 645000 3rd Qu.: 4.000
## Max. :9.900e+09 Max. :7700000 Max. :33.000
## bathrooms sqft_living sqft_lot floors
## Min. :0.000 Min. : 290 Min. : 520 Min. :1.000
## 1st Qu.:1.750 1st Qu.: 1427 1st Qu.: 5040 1st Qu.:1.000
## Median :2.250 Median : 1910 Median : 7618 Median :1.500
## Mean :2.115 Mean : 2080 Mean : 15107 Mean :1.494
## 3rd Qu.:2.500 3rd Qu.: 2550 3rd Qu.: 10688 3rd Qu.:2.000
## Max. :8.000 Max. :13540 Max. :1651359 Max. :3.500
## waterfront view condition grade
## Min. :0.000000 Min. :0.0000 Min. :1.000 Min. : 1.000
## 1st Qu.:0.000000 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.: 7.000
## Median :0.000000 Median :0.0000 Median :3.000 Median : 7.000
## Mean :0.007542 Mean :0.2343 Mean :3.409 Mean : 7.657
## 3rd Qu.:0.000000 3rd Qu.:0.0000 3rd Qu.:4.000 3rd Qu.: 8.000
## Max. :1.000000 Max. :4.0000 Max. :5.000 Max. :13.000
## sqft_above sqft_basement yr_built yr_renovated
## Min. : 290 Min. : 0.0 Min. :1900 Min. : 0.0
## 1st Qu.:1190 1st Qu.: 0.0 1st Qu.:1951 1st Qu.: 0.0
## Median :1560 Median : 0.0 Median :1975 Median : 0.0
## Mean :1788 Mean : 291.5 Mean :1971 Mean : 84.4
## 3rd Qu.:2210 3rd Qu.: 560.0 3rd Qu.:1997 3rd Qu.: 0.0
## Max. :9410 Max. :4820.0 Max. :2015 Max. :2015.0
## zipcode lat long sqft_living15
## Min. :98001 Min. :47.16 Min. :-122.5 Min. : 399
## 1st Qu.:98033 1st Qu.:47.47 1st Qu.:-122.3 1st Qu.:1490
## Median :98065 Median :47.57 Median :-122.2 Median :1840
## Mean :98078 Mean :47.56 Mean :-122.2 Mean :1987
## 3rd Qu.:98118 3rd Qu.:47.68 3rd Qu.:-122.1 3rd Qu.:2360
## Max. :98199 Max. :47.78 Max. :-121.3 Max. :6210
## sqft_lot15
## Min. : 651
## 1st Qu.: 5100
## Median : 7620
## Mean : 12768
## 3rd Qu.: 10083
## Max. :871200
Preview.
head(house_df)
## id date price bedrooms bathrooms sqft_living sqft_lot
## 1 7129300520 20141013T000000 221900 3 1.00 1180 5650
## 2 6414100192 20141209T000000 538000 3 2.25 2570 7242
## 3 5631500400 20150225T000000 180000 2 1.00 770 10000
## 4 2487200875 20141209T000000 604000 4 3.00 1960 5000
## 5 1954400510 20150218T000000 510000 3 2.00 1680 8080
## 6 7237550310 20140512T000000 1225000 4 4.50 5420 101930
## floors waterfront view condition grade sqft_above sqft_basement yr_built
## 1 1 0 0 3 7 1180 0 1955
## 2 2 0 0 3 7 2170 400 1951
## 3 1 0 0 3 6 770 0 1933
## 4 1 0 0 5 7 1050 910 1965
## 5 1 0 0 3 8 1680 0 1987
## 6 1 0 0 3 11 3890 1530 2001
## yr_renovated zipcode lat long sqft_living15 sqft_lot15
## 1 0 98178 47.5112 -122.257 1340 5650
## 2 1991 98125 47.7210 -122.319 1690 7639
## 3 0 98028 47.7379 -122.233 2720 8062
## 4 0 98136 47.5208 -122.393 1360 5000
## 5 0 98074 47.6168 -122.045 1800 7503
## 6 0 98053 47.6561 -122.005 4760 101930
We’ll follow the book’s example and use backward elimination, though we’ll start with the variables that may need converting to factors removed. Collinearity will clearly be present: views and waterfronts go together, as do the square-footage variables with one another and with bedrooms, bathrooms, etc.
Zip code, latitude, and longitude may be reliable predictors, but zip code will need to be converted to a factor, and other transformations may be necessary to make lat and long usable. While there are certainly cities where, for example, a more northern location means higher property values, treating latitude or longitude as a straight linear predictor could be misleading.
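Before fitting, one quick way to see the collinearity we expect is a correlation matrix of the numeric predictors. The sketch below uses a small made-up data frame (toy_df and its values are hypothetical stand-ins, not the real house_df) so it runs on its own; on the real data the same cor() call would be applied to the numeric columns.

```r
# Toy stand-in for house_df (hypothetical values, not the real data).
set.seed(1)
n <- 200
toy_df <- data.frame(
  sqft_above    = round(runif(n, 500, 3000)),
  sqft_basement = round(runif(n, 0, 1000))
)
toy_df$sqft_living <- toy_df$sqft_above + toy_df$sqft_basement
toy_df$bathrooms   <- round(toy_df$sqft_living / 800 + rnorm(n, sd = 0.5), 2)

# Pairwise Pearson correlations; values near +1 or -1 flag collinear pairs.
cor_mat <- round(cor(toy_df), 2)
print(cor_mat)
```

Pairs with a high absolute correlation are candidates for dropping or combining during backward elimination.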
house_lm <- lm(price ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors + waterfront + view + condition + grade + sqft_above + sqft_basement, data=house_df)
summary(house_lm)
##
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + sqft_lot +
## floors + waterfront + view + condition + grade + sqft_above +
## sqft_basement, data = house_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1175172 -123899 -16809 94252 4628215
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.962e+05 1.740e+04 -40.007 < 2e-16 ***
## bedrooms -3.378e+04 2.157e+03 -15.659 < 2e-16 ***
## bathrooms -1.463e+04 3.486e+03 -4.196 2.72e-05 ***
## sqft_living 2.172e+02 4.820e+00 45.059 < 2e-16 ***
## sqft_lot -3.180e-01 3.900e-02 -8.152 3.76e-16 ***
## floors -2.831e+03 3.941e+03 -0.718 0.473
## waterfront 5.822e+05 1.985e+04 29.333 < 2e-16 ***
## view 6.064e+04 2.385e+03 25.428 < 2e-16 ***
## condition 5.344e+04 2.532e+03 21.108 < 2e-16 ***
## grade 1.032e+05 2.269e+03 45.472 < 2e-16 ***
## sqft_above -2.917e+01 4.718e+00 -6.182 6.42e-10 ***
## sqft_basement NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 230600 on 21602 degrees of freedom
## Multiple R-squared: 0.6055, Adjusted R-squared: 0.6053
## F-statistic: 3315 on 10 and 21602 DF, p-value: < 2.2e-16
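Almost every term is statistically significant; floors is the exception (p = 0.473). The square feet of the lot is significant, but its coefficient is small in practical terms for a typical lot (about -0.32 dollars per square foot). Let’s remove that and try adding zip code as a factor.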
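The NA row for sqft_basement ("1 not defined because of singularities") shows up because, in the rows previewed above, sqft_living equals sqft_above plus sqft_basement, so the third variable carries no new information. A minimal sketch of the same effect, on made-up data (toy_df is hypothetical):

```r
# Toy data where sqft_living is exactly sqft_above + sqft_basement.
set.seed(2)
n <- 100
toy_df <- data.frame(
  sqft_above    = round(runif(n, 500, 3000)),
  sqft_basement = round(runif(n, 0, 1000))
)
toy_df$sqft_living <- toy_df$sqft_above + toy_df$sqft_basement
toy_df$price <- 200 * toy_df$sqft_living + rnorm(n, sd = 1e4)

toy_lm <- lm(price ~ sqft_living + sqft_above + sqft_basement, data = toy_df)

coef(toy_lm)   # the last of the three dependent terms comes back NA
alias(toy_lm)  # reports which term is a linear combination of the others
```

Dropping any one of the three square-footage terms removes the singularity without changing the fitted values.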
house_lm_2 <- lm(price ~ bedrooms + bathrooms + sqft_living + as.factor(zipcode) + floors + waterfront + view + condition + grade + sqft_above + sqft_basement, data=house_df)
summary(house_lm_2)
##
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + as.factor(zipcode) +
## floors + waterfront + view + condition + grade + sqft_above +
## sqft_basement, data = house_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1143285 -71144 -1431 61730 4441328
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.710e+05 1.538e+04 -30.621 < 2e-16 ***
## bedrooms -2.604e+04 1.536e+03 -16.954 < 2e-16 ***
## bathrooms 1.401e+04 2.491e+03 5.624 1.89e-08 ***
## sqft_living 1.373e+02 3.488e+00 39.352 < 2e-16 ***
## as.factor(zipcode)98002 2.812e+04 1.434e+04 1.961 0.049909 *
## as.factor(zipcode)98003 -1.460e+04 1.292e+04 -1.130 0.258329
## as.factor(zipcode)98004 7.896e+05 1.260e+04 62.658 < 2e-16 ***
## as.factor(zipcode)98005 3.121e+05 1.525e+04 20.465 < 2e-16 ***
## as.factor(zipcode)98006 2.742e+05 1.138e+04 24.104 < 2e-16 ***
## as.factor(zipcode)98007 2.518e+05 1.614e+04 15.599 < 2e-16 ***
## as.factor(zipcode)98008 2.546e+05 1.292e+04 19.699 < 2e-16 ***
## as.factor(zipcode)98010 7.543e+04 1.834e+04 4.114 3.90e-05 ***
## as.factor(zipcode)98011 1.267e+05 1.442e+04 8.787 < 2e-16 ***
## as.factor(zipcode)98014 1.120e+05 1.690e+04 6.627 3.51e-11 ***
## as.factor(zipcode)98019 9.403e+04 1.455e+04 6.461 1.06e-10 ***
## as.factor(zipcode)98022 -2.770e+03 1.366e+04 -0.203 0.839310
## as.factor(zipcode)98023 -3.367e+04 1.121e+04 -3.003 0.002675 **
## as.factor(zipcode)98024 1.781e+05 1.996e+04 8.922 < 2e-16 ***
## as.factor(zipcode)98027 1.749e+05 1.174e+04 14.893 < 2e-16 ***
## as.factor(zipcode)98028 1.247e+05 1.288e+04 9.685 < 2e-16 ***
## as.factor(zipcode)98029 2.141e+05 1.253e+04 17.089 < 2e-16 ***
## as.factor(zipcode)98030 4.972e+03 1.325e+04 0.375 0.707428
## as.factor(zipcode)98031 1.468e+04 1.299e+04 1.130 0.258655
## as.factor(zipcode)98032 9.867e+03 1.684e+04 0.586 0.558055
## as.factor(zipcode)98033 3.685e+05 1.159e+04 31.779 < 2e-16 ***
## as.factor(zipcode)98034 2.035e+05 1.101e+04 18.485 < 2e-16 ***
## as.factor(zipcode)98038 3.120e+04 1.086e+04 2.874 0.004059 **
## as.factor(zipcode)98039 1.336e+06 2.466e+04 54.177 < 2e-16 ***
## as.factor(zipcode)98040 5.248e+05 1.308e+04 40.127 < 2e-16 ***
## as.factor(zipcode)98042 3.486e+03 1.099e+04 0.317 0.751185
## as.factor(zipcode)98045 9.560e+04 1.386e+04 6.899 5.38e-12 ***
## as.factor(zipcode)98052 2.304e+05 1.093e+04 21.067 < 2e-16 ***
## as.factor(zipcode)98053 1.891e+05 1.184e+04 15.965 < 2e-16 ***
## as.factor(zipcode)98055 5.395e+04 1.307e+04 4.127 3.68e-05 ***
## as.factor(zipcode)98056 9.591e+04 1.174e+04 8.168 3.32e-16 ***
## as.factor(zipcode)98058 3.162e+04 1.143e+04 2.765 0.005691 **
## as.factor(zipcode)98059 8.267e+04 1.139e+04 7.259 4.03e-13 ***
## as.factor(zipcode)98065 8.287e+04 1.262e+04 6.564 5.35e-11 ***
## as.factor(zipcode)98070 -4.417e+03 1.743e+04 -0.253 0.799978
## as.factor(zipcode)98072 1.573e+05 1.304e+04 12.063 < 2e-16 ***
## as.factor(zipcode)98074 1.756e+05 1.161e+04 15.127 < 2e-16 ***
## as.factor(zipcode)98075 1.682e+05 1.224e+04 13.748 < 2e-16 ***
## as.factor(zipcode)98077 1.280e+05 1.446e+04 8.853 < 2e-16 ***
## as.factor(zipcode)98092 -3.643e+04 1.217e+04 -2.994 0.002754 **
## as.factor(zipcode)98102 5.509e+05 1.811e+04 30.415 < 2e-16 ***
## as.factor(zipcode)98103 3.694e+05 1.092e+04 33.819 < 2e-16 ***
## as.factor(zipcode)98105 5.050e+05 1.376e+04 36.700 < 2e-16 ***
## as.factor(zipcode)98106 1.610e+05 1.234e+04 13.049 < 2e-16 ***
## as.factor(zipcode)98107 3.753e+05 1.323e+04 28.377 < 2e-16 ***
## as.factor(zipcode)98108 1.444e+05 1.466e+04 9.846 < 2e-16 ***
## as.factor(zipcode)98109 5.316e+05 1.780e+04 29.861 < 2e-16 ***
## as.factor(zipcode)98112 6.606e+05 1.319e+04 50.087 < 2e-16 ***
## as.factor(zipcode)98115 3.563e+05 1.091e+04 32.654 < 2e-16 ***
## as.factor(zipcode)98116 3.157e+05 1.244e+04 25.372 < 2e-16 ***
## as.factor(zipcode)98117 3.430e+05 1.103e+04 31.098 < 2e-16 ***
## as.factor(zipcode)98118 1.940e+05 1.119e+04 17.336 < 2e-16 ***
## as.factor(zipcode)98119 5.156e+05 1.480e+04 34.834 < 2e-16 ***
## as.factor(zipcode)98122 3.745e+05 1.288e+04 29.077 < 2e-16 ***
## as.factor(zipcode)98125 2.163e+05 1.172e+04 18.450 < 2e-16 ***
## as.factor(zipcode)98126 2.130e+05 1.218e+04 17.486 < 2e-16 ***
## as.factor(zipcode)98133 1.798e+05 1.125e+04 15.977 < 2e-16 ***
## as.factor(zipcode)98136 2.683e+05 1.322e+04 20.294 < 2e-16 ***
## as.factor(zipcode)98144 3.053e+05 1.231e+04 24.800 < 2e-16 ***
## as.factor(zipcode)98146 1.228e+05 1.284e+04 9.566 < 2e-16 ***
## as.factor(zipcode)98148 7.998e+04 2.312e+04 3.459 0.000544 ***
## as.factor(zipcode)98155 1.575e+05 1.149e+04 13.711 < 2e-16 ***
## as.factor(zipcode)98166 6.739e+04 1.332e+04 5.058 4.28e-07 ***
## as.factor(zipcode)98168 9.179e+04 1.310e+04 7.004 2.55e-12 ***
## as.factor(zipcode)98177 2.319e+05 1.334e+04 17.383 < 2e-16 ***
## as.factor(zipcode)98178 5.176e+04 1.321e+04 3.919 8.92e-05 ***
## as.factor(zipcode)98188 4.379e+04 1.632e+04 2.683 0.007299 **
## as.factor(zipcode)98198 4.193e+03 1.294e+04 0.324 0.745989
## as.factor(zipcode)98199 4.105e+05 1.259e+04 32.613 < 2e-16 ***
## floors -6.021e+04 2.989e+03 -20.143 < 2e-16 ***
## waterfront 6.644e+05 1.421e+04 46.760 < 2e-16 ***
## view 5.924e+04 1.734e+03 34.161 < 2e-16 ***
## condition 2.987e+04 1.833e+03 16.298 < 2e-16 ***
## grade 5.241e+04 1.700e+03 30.833 < 2e-16 ***
## sqft_above 8.644e+01 3.611e+00 23.939 < 2e-16 ***
## sqft_basement NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 162200 on 21534 degrees of freedom
## Multiple R-squared: 0.8056, Adjusted R-squared: 0.8049
## F-statistic: 1144 on 78 and 21534 DF, p-value: < 2.2e-16
That gives us decent performance: about 80.5% of the variation in home prices is explained by our model. Adding zip code as a factor uses up many more degrees of freedom (69 additional coefficients), so there might be an opportunity to simplify this with more regional knowledge.
Does adding latitude and longitude add any predictive power, even given that we know their raw numeric values could be misleading as predictors?
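To see where the extra coefficients come from: a factor with k levels is expanded into k - 1 dummy columns, with the first level absorbed into the intercept. A small sketch with a handful of zip codes (the toy_zip vector is made up for illustration):

```r
# A few zip codes standing in for house_df$zipcode (hypothetical sample).
toy_zip <- c(98001, 98002, 98002, 98004, 98001, 98004, 98039)
zip_factor <- as.factor(toy_zip)

nlevels(zip_factor)                    # number of distinct zip codes

# model.matrix() shows the design columns lm() would build for the factor.
dummies <- model.matrix(~ zip_factor)
colnames(dummies)                      # intercept plus one column per non-reference level
ncol(dummies) - 1                      # dummy coefficients added to the model
```

The first level is the reference, which is why 98001 has no row of its own in the summary above.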
house_lm_3 <- lm(price ~ bedrooms + bathrooms + sqft_living + as.factor(zipcode) + floors + waterfront + view + condition + grade + sqft_above + sqft_basement + lat + long, data=house_df)
summary(house_lm_3)
##
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + as.factor(zipcode) +
## floors + waterfront + view + condition + grade + sqft_above +
## sqft_basement + lat + long, data = house_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1136900 -71322 -1757 61737 4440870
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.563e+07 6.164e+06 -4.157 3.23e-05 ***
## bedrooms -2.606e+04 1.535e+03 -16.976 < 2e-16 ***
## bathrooms 1.400e+04 2.490e+03 5.623 1.90e-08 ***
## sqft_living 1.373e+02 3.487e+00 39.370 < 2e-16 ***
## as.factor(zipcode)98002 3.493e+04 1.457e+04 2.397 0.016554 *
## as.factor(zipcode)98003 -2.074e+04 1.304e+04 -1.590 0.111757
## as.factor(zipcode)98004 7.295e+05 2.367e+04 30.824 < 2e-16 ***
## as.factor(zipcode)98005 2.577e+05 2.530e+04 10.183 < 2e-16 ***
## as.factor(zipcode)98006 2.341e+05 2.067e+04 11.324 < 2e-16 ***
## as.factor(zipcode)98007 2.006e+05 2.612e+04 7.679 1.67e-14 ***
## as.factor(zipcode)98008 2.059e+05 2.481e+04 8.299 < 2e-16 ***
## as.factor(zipcode)98010 1.025e+05 2.223e+04 4.612 4.01e-06 ***
## as.factor(zipcode)98011 3.632e+04 3.229e+04 1.125 0.260692
## as.factor(zipcode)98014 8.239e+04 3.544e+04 2.325 0.020094 *
## as.factor(zipcode)98019 3.714e+04 3.499e+04 1.061 0.288476
## as.factor(zipcode)98022 5.179e+04 1.930e+04 2.684 0.007285 **
## as.factor(zipcode)98023 -4.474e+04 1.199e+04 -3.731 0.000191 ***
## as.factor(zipcode)98024 1.667e+05 3.110e+04 5.360 8.42e-08 ***
## as.factor(zipcode)98027 1.552e+05 2.124e+04 7.307 2.83e-13 ***
## as.factor(zipcode)98028 2.899e+04 3.137e+04 0.924 0.355454
## as.factor(zipcode)98029 1.905e+05 2.424e+04 7.858 4.09e-15 ***
## as.factor(zipcode)98030 1.326e+03 1.433e+04 0.092 0.926312
## as.factor(zipcode)98031 3.477e+03 1.492e+04 0.233 0.815747
## as.factor(zipcode)98032 -5.363e+03 1.733e+04 -0.309 0.757021
## as.factor(zipcode)98033 2.963e+05 2.690e+04 11.014 < 2e-16 ***
## as.factor(zipcode)98034 1.201e+05 2.885e+04 4.163 3.15e-05 ***
## as.factor(zipcode)98038 4.619e+04 1.608e+04 2.873 0.004067 **
## as.factor(zipcode)98039 1.270e+06 3.199e+04 39.697 < 2e-16 ***
## as.factor(zipcode)98040 4.747e+05 2.091e+04 22.699 < 2e-16 ***
## as.factor(zipcode)98042 9.530e+03 1.371e+04 0.695 0.486896
## as.factor(zipcode)98045 1.207e+05 2.972e+04 4.061 4.91e-05 ***
## as.factor(zipcode)98052 1.664e+05 2.745e+04 6.060 1.39e-09 ***
## as.factor(zipcode)98053 1.367e+05 2.942e+04 4.646 3.40e-06 ***
## as.factor(zipcode)98055 2.899e+04 1.661e+04 1.745 0.081017 .
## as.factor(zipcode)98056 6.197e+04 1.806e+04 3.430 0.000604 ***
## as.factor(zipcode)98058 1.625e+04 1.571e+04 1.035 0.300826
## as.factor(zipcode)98059 5.772e+04 1.771e+04 3.260 0.001115 **
## as.factor(zipcode)98065 8.250e+04 2.734e+04 3.017 0.002555 **
## as.factor(zipcode)98070 -5.174e+04 2.051e+04 -2.522 0.011675 *
## as.factor(zipcode)98072 7.614e+04 3.212e+04 2.370 0.017779 *
## as.factor(zipcode)98074 1.331e+05 2.598e+04 5.122 3.05e-07 ***
## as.factor(zipcode)98075 1.364e+05 2.496e+04 5.465 4.68e-08 ***
## as.factor(zipcode)98077 5.647e+04 3.340e+04 1.690 0.090956 .
## as.factor(zipcode)98092 -2.252e+04 1.303e+04 -1.729 0.083910 .
## as.factor(zipcode)98102 4.727e+05 2.757e+04 17.148 < 2e-16 ***
## as.factor(zipcode)98103 2.793e+05 2.590e+04 10.784 < 2e-16 ***
## as.factor(zipcode)98105 4.233e+05 2.656e+04 15.937 < 2e-16 ***
## as.factor(zipcode)98106 1.004e+05 1.924e+04 5.221 1.79e-07 ***
## as.factor(zipcode)98107 2.829e+05 2.674e+04 10.582 < 2e-16 ***
## as.factor(zipcode)98108 8.678e+04 2.123e+04 4.088 4.38e-05 ***
## as.factor(zipcode)98109 4.499e+05 2.748e+04 16.371 < 2e-16 ***
## as.factor(zipcode)98112 5.865e+05 2.433e+04 24.108 < 2e-16 ***
## as.factor(zipcode)98115 2.693e+05 2.637e+04 10.212 < 2e-16 ***
## as.factor(zipcode)98116 2.424e+05 2.144e+04 11.306 < 2e-16 ***
## as.factor(zipcode)98117 2.465e+05 2.669e+04 9.234 < 2e-16 ***
## as.factor(zipcode)98118 1.415e+05 1.871e+04 7.564 4.06e-14 ***
## as.factor(zipcode)98119 4.309e+05 2.594e+04 16.610 < 2e-16 ***
## as.factor(zipcode)98122 3.045e+05 2.312e+04 13.167 < 2e-16 ***
## as.factor(zipcode)98125 1.223e+05 2.854e+04 4.287 1.82e-05 ***
## as.factor(zipcode)98126 1.483e+05 1.969e+04 7.533 5.15e-14 ***
## as.factor(zipcode)98133 7.661e+04 2.947e+04 2.600 0.009336 **
## as.factor(zipcode)98136 2.034e+05 2.020e+04 10.070 < 2e-16 ***
## as.factor(zipcode)98144 2.406e+05 2.156e+04 11.158 < 2e-16 ***
## as.factor(zipcode)98146 6.961e+04 1.806e+04 3.855 0.000116 ***
## as.factor(zipcode)98148 4.459e+04 2.460e+04 1.812 0.069947 .
## as.factor(zipcode)98155 5.494e+04 3.066e+04 1.792 0.073128 .
## as.factor(zipcode)98166 2.544e+04 1.654e+04 1.539 0.123892
## as.factor(zipcode)98168 4.736e+04 1.746e+04 2.713 0.006682 **
## as.factor(zipcode)98177 1.242e+05 3.077e+04 4.037 5.42e-05 ***
## as.factor(zipcode)98178 1.256e+04 1.804e+04 0.697 0.486053
## as.factor(zipcode)98188 1.198e+04 1.854e+04 0.646 0.517998
## as.factor(zipcode)98198 -1.930e+04 1.405e+04 -1.374 0.169569
## as.factor(zipcode)98199 3.203e+05 2.537e+04 12.625 < 2e-16 ***
## floors -6.002e+04 2.990e+03 -20.078 < 2e-16 ***
## waterfront 6.644e+05 1.421e+04 46.750 < 2e-16 ***
## view 5.929e+04 1.734e+03 34.200 < 2e-16 ***
## condition 3.006e+04 1.833e+03 16.397 < 2e-16 ***
## grade 5.218e+04 1.700e+03 30.696 < 2e-16 ***
## sqft_above 8.676e+01 3.611e+00 24.030 < 2e-16 ***
## sqft_basement NA NA NA NA
## lat 2.214e+05 6.391e+04 3.464 0.000534 ***
## long -1.201e+05 4.572e+04 -2.627 0.008623 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 162100 on 21532 degrees of freedom
## Multiple R-squared: 0.8058, Adjusted R-squared: 0.8051
## F-statistic: 1117 on 80 and 21532 DF, p-value: < 2.2e-16
Slightly better (adjusted R-squared moves from 0.8049 to 0.8051), but we’re at risk of overfitting.
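One way to judge whether extra terms earn their keep is to compare the nested models directly rather than eyeballing R-squared. The sketch below does this on made-up data (toy_df, small_lm, and big_lm are hypothetical names); the same anova() and AIC() calls could be applied to house_lm_2 and house_lm_3.

```r
# Toy data: price depends on sqft_living and, more weakly, on lat.
set.seed(3)
n <- 300
toy_df <- data.frame(
  sqft_living = runif(n, 500, 4000),
  lat         = runif(n, 47.2, 47.8)
)
toy_df$price <- 150 * toy_df$sqft_living +
  2e5 * (toy_df$lat - 47.5) + rnorm(n, sd = 5e4)

small_lm <- lm(price ~ sqft_living, data = toy_df)
big_lm   <- lm(price ~ sqft_living + lat, data = toy_df)

# F-test for whether the added term reduces residual error more than chance.
cmp <- anova(small_lm, big_lm)
print(cmp)

# AIC penalizes each extra coefficient; lower is better.
aic_tab <- AIC(small_lm, big_lm)
print(aic_tab)
```

Neither check replaces a held-out test set, but both penalize added terms in a way that plain R-squared does not.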
Let’s check our assumptions on the last model:
library(performance)  # check_model() comes from the performance package
check_model(house_lm_3)
The residual variance is not constant at the extremes, particularly on the high side. Collinearity is present, and that check is dominated by the zip code terms.
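For a numeric view of collinearity, the variance inflation factor for a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. Here is a base-R sketch on made-up data; vif_by_hand and toy_df are hypothetical names, and this simple version only handles numeric columns, not factors like zip code.

```r
# Toy predictors: bathrooms is built to track sqft_living closely.
set.seed(4)
n <- 200
toy_df <- data.frame(sqft_living = runif(n, 500, 4000))
toy_df$bathrooms <- toy_df$sqft_living / 1000 + rnorm(n, sd = 0.3)
toy_df$condition <- sample(1:5, n, replace = TRUE)

# VIF for each column: regress it on the remaining columns, then 1 / (1 - R^2).
vif_by_hand <- function(data) {
  sapply(names(data), function(v) {
    others <- setdiff(names(data), v)
    fit <- lm(reformulate(others, response = v), data = data)
    1 / (1 - summary(fit)$r.squared)
  })
}

vifs <- vif_by_hand(toy_df)
print(round(vifs, 2))
```

Values near 1 mean a predictor is nearly independent of the rest; large values mean its coefficient estimate is unstable.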
We likely need a more sophisticated approach to geospatial analysis to handle the zip codes and latitude and longitude. Here’s an interesting article on why zip codes shouldn’t be used in predictive models: https://towardsdatascience.com/stop-using-zip-codes-for-geospatial-analysis-ceacb6e80c38
We definitely need to capture market fluctuations present in the sales date using a time-series approach.
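A first step toward using the sale date would be parsing it. The date column is stored as text like "20141013T000000", so the first eight characters can be read as a date. This is only a sketch of the parsing step, not a time-series model; sale_dates below copies a few values from the preview above, and the variable names are made up.

```r
# A few values copied from the date column in the preview.
sale_dates <- c("20141013T000000", "20141209T000000", "20150225T000000")

# Keep the YYYYMMDD part and convert it to a Date.
parsed <- as.Date(substr(sale_dates, 1, 8), format = "%Y%m%d")
print(parsed)

# A year-month label is one simple way to group sales by market period.
sale_month <- format(parsed, "%Y-%m")
print(sale_month)
```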
Other variables would likely benefit from transformations. And we’re definitely overfitting.
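We skipped the train-test split, but a held-out set is the most direct way to check the overfitting suspicion. A minimal sketch on made-up data (toy_df, the 80/20 split, and the rmse helper are all assumptions for illustration):

```r
# Toy data standing in for house_df.
set.seed(5)
n <- 500
toy_df <- data.frame(sqft_living = runif(n, 500, 4000))
toy_df$price <- 150 * toy_df$sqft_living + rnorm(n, sd = 5e4)

# Random 80/20 split into training and test rows.
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train_df  <- toy_df[train_idx, ]
test_df   <- toy_df[-train_idx, ]

fit <- lm(price ~ sqft_living, data = train_df)

# Root mean squared error on each set.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
train_rmse <- rmse(train_df$price, fitted(fit))
test_rmse  <- rmse(test_df$price, predict(fit, newdata = test_df))
print(c(train = train_rmse, test = test_rmse))
```

A test error much larger than the training error would support the overfitting claim. With as.factor(zipcode) in the model, every zip code in the test rows also has to appear in the training rows, or predict() will fail on the unseen level.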
Not a good linear model at present.
The discussion prompt says: Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term.
We already have a dichotomous variable in the model: waterfront.
Let’s try to transform one of the variables into a quadratic term - condition, lacking any better ideas or domain knowledge.
house_df$condition_quad <- house_df$condition^2
house_lm_4 <- lm(price ~ bedrooms + bathrooms + sqft_living + as.factor(zipcode) + floors + waterfront + view + condition_quad + grade + sqft_above + sqft_basement + lat + long, data=house_df)
summary(house_lm_4)
##
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + as.factor(zipcode) +
## floors + waterfront + view + condition_quad + grade + sqft_above +
## sqft_basement + lat + long, data = house_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1136347 -71371 -1864 62030 4443406
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.556e+07 6.162e+06 -4.149 3.36e-05 ***
## bedrooms -2.605e+04 1.535e+03 -16.974 < 2e-16 ***
## bathrooms 1.389e+04 2.489e+03 5.582 2.41e-08 ***
## sqft_living 1.371e+02 3.486e+00 39.328 < 2e-16 ***
## as.factor(zipcode)98002 3.508e+04 1.457e+04 2.408 0.016046 *
## as.factor(zipcode)98003 -2.045e+04 1.303e+04 -1.569 0.116642
## as.factor(zipcode)98004 7.298e+05 2.366e+04 30.848 < 2e-16 ***
## as.factor(zipcode)98005 2.578e+05 2.529e+04 10.191 < 2e-16 ***
## as.factor(zipcode)98006 2.338e+05 2.066e+04 11.316 < 2e-16 ***
## as.factor(zipcode)98007 2.011e+05 2.611e+04 7.703 1.39e-14 ***
## as.factor(zipcode)98008 2.063e+05 2.479e+04 8.321 < 2e-16 ***
## as.factor(zipcode)98010 1.025e+05 2.222e+04 4.613 4.00e-06 ***
## as.factor(zipcode)98011 3.653e+04 3.228e+04 1.132 0.257787
## as.factor(zipcode)98014 8.267e+04 3.543e+04 2.333 0.019632 *
## as.factor(zipcode)98019 3.746e+04 3.497e+04 1.071 0.284073
## as.factor(zipcode)98022 5.123e+04 1.929e+04 2.656 0.007921 **
## as.factor(zipcode)98023 -4.459e+04 1.199e+04 -3.719 0.000200 ***
## as.factor(zipcode)98024 1.667e+05 3.108e+04 5.363 8.29e-08 ***
## as.factor(zipcode)98027 1.553e+05 2.123e+04 7.315 2.67e-13 ***
## as.factor(zipcode)98028 2.927e+04 3.135e+04 0.933 0.350608
## as.factor(zipcode)98029 1.906e+05 2.423e+04 7.867 3.81e-15 ***
## as.factor(zipcode)98030 1.605e+03 1.433e+04 0.112 0.910839
## as.factor(zipcode)98031 3.721e+03 1.491e+04 0.250 0.802943
## as.factor(zipcode)98032 -4.988e+03 1.733e+04 -0.288 0.773427
## as.factor(zipcode)98033 2.963e+05 2.689e+04 11.019 < 2e-16 ***
## as.factor(zipcode)98034 1.204e+05 2.883e+04 4.176 2.98e-05 ***
## as.factor(zipcode)98038 4.631e+04 1.607e+04 2.881 0.003964 **
## as.factor(zipcode)98039 1.270e+06 3.198e+04 39.728 < 2e-16 ***
## as.factor(zipcode)98040 4.745e+05 2.090e+04 22.700 < 2e-16 ***
## as.factor(zipcode)98042 9.380e+03 1.370e+04 0.685 0.493627
## as.factor(zipcode)98045 1.207e+05 2.971e+04 4.064 4.83e-05 ***
## as.factor(zipcode)98052 1.667e+05 2.744e+04 6.076 1.25e-09 ***
## as.factor(zipcode)98053 1.370e+05 2.941e+04 4.657 3.23e-06 ***
## as.factor(zipcode)98055 2.910e+04 1.661e+04 1.752 0.079759 .
## as.factor(zipcode)98056 6.146e+04 1.806e+04 3.403 0.000667 ***
## as.factor(zipcode)98058 1.615e+04 1.570e+04 1.029 0.303679
## as.factor(zipcode)98059 5.769e+04 1.770e+04 3.260 0.001117 **
## as.factor(zipcode)98065 8.264e+04 2.733e+04 3.023 0.002503 **
## as.factor(zipcode)98070 -5.209e+04 2.051e+04 -2.540 0.011090 *
## as.factor(zipcode)98072 7.672e+04 3.211e+04 2.389 0.016883 *
## as.factor(zipcode)98074 1.332e+05 2.597e+04 5.127 2.97e-07 ***
## as.factor(zipcode)98075 1.365e+05 2.495e+04 5.472 4.51e-08 ***
## as.factor(zipcode)98077 5.678e+04 3.339e+04 1.701 0.089007 .
## as.factor(zipcode)98092 -2.248e+04 1.302e+04 -1.726 0.084402 .
## as.factor(zipcode)98102 4.725e+05 2.755e+04 17.147 < 2e-16 ***
## as.factor(zipcode)98103 2.789e+05 2.589e+04 10.773 < 2e-16 ***
## as.factor(zipcode)98105 4.229e+05 2.655e+04 15.928 < 2e-16 ***
## as.factor(zipcode)98106 1.004e+05 1.923e+04 5.222 1.79e-07 ***
## as.factor(zipcode)98107 2.827e+05 2.672e+04 10.577 < 2e-16 ***
## as.factor(zipcode)98108 8.650e+04 2.122e+04 4.076 4.61e-05 ***
## as.factor(zipcode)98109 4.494e+05 2.747e+04 16.359 < 2e-16 ***
## as.factor(zipcode)98112 5.858e+05 2.432e+04 24.091 < 2e-16 ***
## as.factor(zipcode)98115 2.690e+05 2.636e+04 10.206 < 2e-16 ***
## as.factor(zipcode)98116 2.417e+05 2.143e+04 11.280 < 2e-16 ***
## as.factor(zipcode)98117 2.464e+05 2.668e+04 9.234 < 2e-16 ***
## as.factor(zipcode)98118 1.412e+05 1.870e+04 7.548 4.61e-14 ***
## as.factor(zipcode)98119 4.306e+05 2.593e+04 16.608 < 2e-16 ***
## as.factor(zipcode)98122 3.039e+05 2.311e+04 13.147 < 2e-16 ***
## as.factor(zipcode)98125 1.223e+05 2.852e+04 4.288 1.81e-05 ***
## as.factor(zipcode)98126 1.479e+05 1.968e+04 7.517 5.81e-14 ***
## as.factor(zipcode)98133 7.674e+04 2.946e+04 2.605 0.009190 **
## as.factor(zipcode)98136 2.032e+05 2.019e+04 10.062 < 2e-16 ***
## as.factor(zipcode)98144 2.402e+05 2.155e+04 11.145 < 2e-16 ***
## as.factor(zipcode)98146 6.937e+04 1.805e+04 3.843 0.000122 ***
## as.factor(zipcode)98148 4.339e+04 2.459e+04 1.765 0.077643 .
## as.factor(zipcode)98155 5.489e+04 3.064e+04 1.791 0.073290 .
## as.factor(zipcode)98166 2.538e+04 1.653e+04 1.536 0.124601
## as.factor(zipcode)98168 4.703e+04 1.745e+04 2.695 0.007050 **
## as.factor(zipcode)98177 1.245e+05 3.076e+04 4.046 5.22e-05 ***
## as.factor(zipcode)98178 1.250e+04 1.803e+04 0.693 0.488278
## as.factor(zipcode)98188 1.194e+04 1.853e+04 0.644 0.519318
## as.factor(zipcode)98198 -1.907e+04 1.405e+04 -1.358 0.174569
## as.factor(zipcode)98199 3.200e+05 2.536e+04 12.619 < 2e-16 ***
## floors -5.988e+04 2.988e+03 -20.042 < 2e-16 ***
## waterfront 6.641e+05 1.421e+04 46.745 < 2e-16 ***
## view 5.928e+04 1.733e+03 34.206 < 2e-16 ***
## condition_quad 4.108e+03 2.431e+02 16.900 < 2e-16 ***
## grade 5.244e+04 1.700e+03 30.842 < 2e-16 ***
## sqft_above 8.679e+01 3.609e+00 24.050 < 2e-16 ***
## sqft_basement NA NA NA NA
## lat 2.206e+05 6.388e+04 3.453 0.000556 ***
## long -1.203e+05 4.570e+04 -2.633 0.008481 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 162000 on 21532 degrees of freedom
## Multiple R-squared: 0.8059, Adjusted R-squared: 0.8052
## F-statistic: 1118 on 80 and 21532 DF, p-value: < 2.2e-16
Hey - it worked…minimally.
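The model above swaps condition for its square. A more conventional quadratic specification keeps the linear term and adds the squared term alongside it, which can be written directly in the formula with I(). A sketch on made-up data (toy_df and quad_lm are hypothetical):

```r
# Toy data with a curved relationship between condition and price.
set.seed(6)
n <- 300
toy_df <- data.frame(condition = sample(1:5, n, replace = TRUE))
toy_df$price <- 3e5 + 1e4 * toy_df$condition +
  4e3 * toy_df$condition^2 + rnorm(n, sd = 2e4)

# I() protects ^ so it means arithmetic squaring inside the formula.
quad_lm <- lm(price ~ condition + I(condition^2), data = toy_df)
print(coef(quad_lm))
```

Writing the square in the formula also avoids adding a helper column to the data frame.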
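Now we add a dichotomous:quantitative interaction variable. Let’s do waterfront:sqft_living. Note that the column built here divides waterfront by sqft_living, so it is a ratio (nonzero only for waterfront homes) rather than the usual product form of an interaction.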
house_df$waterfront_sqftliving = house_df$waterfront/house_df$sqft_living
house_lm_5 <- lm(price ~ bedrooms + bathrooms + sqft_living + as.factor(zipcode) + floors + waterfront + view + condition_quad + grade + sqft_above + sqft_basement + lat + long + waterfront_sqftliving, data=house_df)
summary(house_lm_5)
##
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + as.factor(zipcode) +
## floors + waterfront + view + condition_quad + grade + sqft_above +
## sqft_basement + lat + long + waterfront_sqftliving, data = house_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1214914 -70695 -1918 60642 4479415
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.689e+07 6.072e+06 -4.428 9.57e-06 ***
## bedrooms -2.518e+04 1.513e+03 -16.649 < 2e-16 ***
## bathrooms 1.395e+04 2.453e+03 5.689 1.29e-08 ***
## sqft_living 1.322e+02 3.440e+00 38.428 < 2e-16 ***
## as.factor(zipcode)98002 3.370e+04 1.435e+04 2.348 0.018897 *
## as.factor(zipcode)98003 -2.025e+04 1.284e+04 -1.577 0.114853
## as.factor(zipcode)98004 7.260e+05 2.331e+04 31.144 < 2e-16 ***
## as.factor(zipcode)98005 2.546e+05 2.492e+04 10.215 < 2e-16 ***
## as.factor(zipcode)98006 2.321e+05 2.036e+04 11.398 < 2e-16 ***
## as.factor(zipcode)98007 1.957e+05 2.573e+04 7.607 2.92e-14 ***
## as.factor(zipcode)98008 1.946e+05 2.444e+04 7.963 1.76e-15 ***
## as.factor(zipcode)98010 1.033e+05 2.190e+04 4.716 2.43e-06 ***
## as.factor(zipcode)98011 2.799e+04 3.181e+04 0.880 0.378897
## as.factor(zipcode)98014 7.608e+04 3.491e+04 2.179 0.029326 *
## as.factor(zipcode)98019 2.917e+04 3.446e+04 0.846 0.397367
## as.factor(zipcode)98022 5.320e+04 1.901e+04 2.799 0.005137 **
## as.factor(zipcode)98023 -4.408e+04 1.181e+04 -3.731 0.000191 ***
## as.factor(zipcode)98024 1.632e+05 3.063e+04 5.328 1.00e-07 ***
## as.factor(zipcode)98027 1.523e+05 2.092e+04 7.280 3.45e-13 ***
## as.factor(zipcode)98028 1.966e+04 3.090e+04 0.636 0.524519
## as.factor(zipcode)98029 1.872e+05 2.388e+04 7.840 4.71e-15 ***
## as.factor(zipcode)98030 5.865e+02 1.412e+04 0.042 0.966869
## as.factor(zipcode)98031 1.868e+03 1.470e+04 0.127 0.898852
## as.factor(zipcode)98032 -6.912e+03 1.707e+04 -0.405 0.685593
## as.factor(zipcode)98033 2.877e+05 2.650e+04 10.855 < 2e-16 ***
## as.factor(zipcode)98034 1.098e+05 2.842e+04 3.863 0.000112 ***
## as.factor(zipcode)98038 4.616e+04 1.584e+04 2.915 0.003562 **
## as.factor(zipcode)98039 1.266e+06 3.151e+04 40.181 < 2e-16 ***
## as.factor(zipcode)98040 4.667e+05 2.060e+04 22.655 < 2e-16 ***
## as.factor(zipcode)98042 8.630e+03 1.350e+04 0.639 0.522697
## as.factor(zipcode)98045 1.186e+05 2.927e+04 4.053 5.07e-05 ***
## as.factor(zipcode)98052 1.602e+05 2.704e+04 5.925 3.17e-09 ***
## as.factor(zipcode)98053 1.321e+05 2.898e+04 4.557 5.22e-06 ***
## as.factor(zipcode)98055 2.544e+04 1.636e+04 1.555 0.120057
## as.factor(zipcode)98056 5.614e+04 1.780e+04 3.155 0.001609 **
## as.factor(zipcode)98058 1.398e+04 1.547e+04 0.904 0.366049
## as.factor(zipcode)98059 5.543e+04 1.744e+04 3.178 0.001483 **
## as.factor(zipcode)98065 8.039e+04 2.693e+04 2.985 0.002842 **
## as.factor(zipcode)98070 2.551e+04 2.044e+04 1.248 0.212003
## as.factor(zipcode)98072 6.914e+04 3.164e+04 2.185 0.028877 *
## as.factor(zipcode)98074 1.287e+05 2.559e+04 5.028 4.99e-07 ***
## as.factor(zipcode)98075 1.338e+05 2.458e+04 5.443 5.30e-08 ***
## as.factor(zipcode)98077 5.132e+04 3.290e+04 1.560 0.118814
## as.factor(zipcode)98092 -2.095e+04 1.283e+04 -1.633 0.102572
## as.factor(zipcode)98102 4.667e+05 2.715e+04 17.186 < 2e-16 ***
## as.factor(zipcode)98103 2.696e+05 2.552e+04 10.568 < 2e-16 ***
## as.factor(zipcode)98105 4.128e+05 2.617e+04 15.773 < 2e-16 ***
## as.factor(zipcode)98106 9.376e+04 1.895e+04 4.948 7.56e-07 ***
## as.factor(zipcode)98107 2.733e+05 2.634e+04 10.376 < 2e-16 ***
## as.factor(zipcode)98108 8.048e+04 2.091e+04 3.848 0.000119 ***
## as.factor(zipcode)98109 4.428e+05 2.707e+04 16.357 < 2e-16 ***
## as.factor(zipcode)98112 5.814e+05 2.396e+04 24.263 < 2e-16 ***
## as.factor(zipcode)98115 2.601e+05 2.597e+04 10.015 < 2e-16 ***
## as.factor(zipcode)98116 2.355e+05 2.112e+04 11.151 < 2e-16 ***
## as.factor(zipcode)98117 2.372e+05 2.629e+04 9.020 < 2e-16 ***
## as.factor(zipcode)98118 1.341e+05 1.843e+04 7.278 3.51e-13 ***
## as.factor(zipcode)98119 4.236e+05 2.555e+04 16.578 < 2e-16 ***
## as.factor(zipcode)98122 2.970e+05 2.278e+04 13.037 < 2e-16 ***
## as.factor(zipcode)98125 1.124e+05 2.811e+04 3.997 6.43e-05 ***
## as.factor(zipcode)98126 1.413e+05 1.939e+04 7.285 3.34e-13 ***
## as.factor(zipcode)98133 6.600e+04 2.903e+04 2.274 0.023004 *
## as.factor(zipcode)98136 1.995e+05 1.990e+04 10.028 < 2e-16 ***
## as.factor(zipcode)98144 2.330e+05 2.124e+04 10.969 < 2e-16 ***
## as.factor(zipcode)98146 6.609e+04 1.779e+04 3.716 0.000203 ***
## as.factor(zipcode)98148 3.946e+04 2.423e+04 1.628 0.103459
## as.factor(zipcode)98155 4.229e+04 3.020e+04 1.400 0.161470
## as.factor(zipcode)98166 2.139e+04 1.629e+04 1.313 0.189054
## as.factor(zipcode)98168 4.135e+04 1.720e+04 2.404 0.016219 *
## as.factor(zipcode)98177 1.158e+05 3.031e+04 3.820 0.000134 ***
## as.factor(zipcode)98178 1.165e+04 1.777e+04 0.656 0.512151
## as.factor(zipcode)98188 8.408e+03 1.826e+04 0.460 0.645237
## as.factor(zipcode)98198 -1.753e+04 1.384e+04 -1.266 0.205424
## as.factor(zipcode)98199 3.133e+05 2.499e+04 12.536 < 2e-16 ***
## floors -5.931e+04 2.944e+03 -20.144 < 2e-16 ***
## waterfront 1.148e+06 2.366e+04 48.519 < 2e-16 ***
## view 6.022e+04 1.708e+03 35.257 < 2e-16 ***
## condition_quad 4.151e+03 2.395e+02 17.328 < 2e-16 ***
## grade 5.153e+04 1.676e+03 30.746 < 2e-16 ***
## sqft_above 8.839e+01 3.557e+00 24.850 < 2e-16 ***
## sqft_basement NA NA NA NA
## lat 2.437e+05 6.296e+04 3.871 0.000109 ***
## long -1.222e+05 4.503e+04 -2.714 0.006647 **
## waterfront_sqftliving -1.189e+09 4.686e+07 -25.371 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 159700 on 21531 degrees of freedom
## Multiple R-squared: 0.8116, Adjusted R-squared: 0.8109
## F-statistic: 1145 on 81 and 21531 DF, p-value: < 2.2e-16
Our triumph of slight improvement continues, though it feels like overfitting.
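The waterfront_sqftliving column fitted above is waterfront divided by sqft_living. A conventional dichotomous-by-quantitative interaction is the product of the two, which lm() builds itself when the formula uses `*` or `:`. A sketch on made-up data (toy_df and int_lm are hypothetical; the coefficients here say nothing about the real data):

```r
# Toy data where the price per square foot is higher on the waterfront.
set.seed(7)
n <- 400
toy_df <- data.frame(
  waterfront  = rbinom(n, 1, 0.2),
  sqft_living = runif(n, 500, 4000)
)
toy_df$price <- 1e5 + 150 * toy_df$sqft_living + 2e5 * toy_df$waterfront +
  100 * toy_df$waterfront * toy_df$sqft_living + rnorm(n, sd = 5e4)

# waterfront * sqft_living expands to both main effects plus their product.
int_lm <- lm(price ~ waterfront * sqft_living, data = toy_df)
print(coef(int_lm))
```

The waterfront:sqft_living coefficient is then the extra price per square foot for waterfront homes, on top of the sqft_living slope.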