R Markdown

We will attempt to build a multiple linear regression model that predicts housing prices in King County. We’re going to skip the train-test split. Home sale prices are highly influenced by market forces, so the date of the sale would be a big factor in the price, but we’ll exclude that data for simplicity.

Read the CSV as a data frame.

house_df <- read.csv("kc_house_data.csv", stringsAsFactors = FALSE)

Summarize it.

summary(house_df)
##        id                date               price            bedrooms     
##  Min.   :1.000e+06   Length:21613       Min.   :  75000   Min.   : 0.000  
##  1st Qu.:2.123e+09   Class :character   1st Qu.: 321950   1st Qu.: 3.000  
##  Median :3.905e+09   Mode  :character   Median : 450000   Median : 3.000  
##  Mean   :4.580e+09                      Mean   : 540088   Mean   : 3.371  
##  3rd Qu.:7.309e+09                      3rd Qu.: 645000   3rd Qu.: 4.000  
##  Max.   :9.900e+09                      Max.   :7700000   Max.   :33.000  
##    bathrooms      sqft_living       sqft_lot           floors     
##  Min.   :0.000   Min.   :  290   Min.   :    520   Min.   :1.000  
##  1st Qu.:1.750   1st Qu.: 1427   1st Qu.:   5040   1st Qu.:1.000  
##  Median :2.250   Median : 1910   Median :   7618   Median :1.500  
##  Mean   :2.115   Mean   : 2080   Mean   :  15107   Mean   :1.494  
##  3rd Qu.:2.500   3rd Qu.: 2550   3rd Qu.:  10688   3rd Qu.:2.000  
##  Max.   :8.000   Max.   :13540   Max.   :1651359   Max.   :3.500  
##    waterfront            view          condition         grade       
##  Min.   :0.000000   Min.   :0.0000   Min.   :1.000   Min.   : 1.000  
##  1st Qu.:0.000000   1st Qu.:0.0000   1st Qu.:3.000   1st Qu.: 7.000  
##  Median :0.000000   Median :0.0000   Median :3.000   Median : 7.000  
##  Mean   :0.007542   Mean   :0.2343   Mean   :3.409   Mean   : 7.657  
##  3rd Qu.:0.000000   3rd Qu.:0.0000   3rd Qu.:4.000   3rd Qu.: 8.000  
##  Max.   :1.000000   Max.   :4.0000   Max.   :5.000   Max.   :13.000  
##    sqft_above   sqft_basement       yr_built     yr_renovated   
##  Min.   : 290   Min.   :   0.0   Min.   :1900   Min.   :   0.0  
##  1st Qu.:1190   1st Qu.:   0.0   1st Qu.:1951   1st Qu.:   0.0  
##  Median :1560   Median :   0.0   Median :1975   Median :   0.0  
##  Mean   :1788   Mean   : 291.5   Mean   :1971   Mean   :  84.4  
##  3rd Qu.:2210   3rd Qu.: 560.0   3rd Qu.:1997   3rd Qu.:   0.0  
##  Max.   :9410   Max.   :4820.0   Max.   :2015   Max.   :2015.0  
##     zipcode           lat             long        sqft_living15 
##  Min.   :98001   Min.   :47.16   Min.   :-122.5   Min.   : 399  
##  1st Qu.:98033   1st Qu.:47.47   1st Qu.:-122.3   1st Qu.:1490  
##  Median :98065   Median :47.57   Median :-122.2   Median :1840  
##  Mean   :98078   Mean   :47.56   Mean   :-122.2   Mean   :1987  
##  3rd Qu.:98118   3rd Qu.:47.68   3rd Qu.:-122.1   3rd Qu.:2360  
##  Max.   :98199   Max.   :47.78   Max.   :-121.3   Max.   :6210  
##    sqft_lot15    
##  Min.   :   651  
##  1st Qu.:  5100  
##  Median :  7620  
##  Mean   : 12768  
##  3rd Qu.: 10083  
##  Max.   :871200

Preview.

head(house_df)
##           id            date   price bedrooms bathrooms sqft_living sqft_lot
## 1 7129300520 20141013T000000  221900        3      1.00        1180     5650
## 2 6414100192 20141209T000000  538000        3      2.25        2570     7242
## 3 5631500400 20150225T000000  180000        2      1.00         770    10000
## 4 2487200875 20141209T000000  604000        4      3.00        1960     5000
## 5 1954400510 20150218T000000  510000        3      2.00        1680     8080
## 6 7237550310 20140512T000000 1225000        4      4.50        5420   101930
##   floors waterfront view condition grade sqft_above sqft_basement yr_built
## 1      1          0    0         3     7       1180             0     1955
## 2      2          0    0         3     7       2170           400     1951
## 3      1          0    0         3     6        770             0     1933
## 4      1          0    0         5     7       1050           910     1965
## 5      1          0    0         3     8       1680             0     1987
## 6      1          0    0         3    11       3890          1530     2001
##   yr_renovated zipcode     lat     long sqft_living15 sqft_lot15
## 1            0   98178 47.5112 -122.257          1340       5650
## 2         1991   98125 47.7210 -122.319          1690       7639
## 3            0   98028 47.7379 -122.233          2720       8062
## 4            0   98136 47.5208 -122.393          1360       5000
## 5            0   98074 47.6168 -122.045          1800       7503
## 6            0   98053 47.6561 -122.005          4760     101930

The Model

We’ll follow the book’s example and use backward elimination, though we’ll start with the variables that may need converting to factors removed. Some collinearity is readily apparent: views and waterfronts go together, as do the square-footage variables with one another and with bedrooms, bathrooms, etc.

Zip code, latitude, and longitude may be reliable predictors, but zip code will need to be converted to a factor, and other transformations may be necessary to make lat and long usable. While there are certainly cities where, for example, a more northern location means higher property values, treating latitude or longitude as a plain linear predictor could be misleading.

house_lm <- lm(price ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors + waterfront + view + condition + grade + sqft_above + sqft_basement, data=house_df)
summary(house_lm)
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + sqft_lot + 
##     floors + waterfront + view + condition + grade + sqft_above + 
##     sqft_basement, data = house_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1175172  -123899   -16809    94252  4628215 
## 
## Coefficients: (1 not defined because of singularities)
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -6.962e+05  1.740e+04 -40.007  < 2e-16 ***
## bedrooms      -3.378e+04  2.157e+03 -15.659  < 2e-16 ***
## bathrooms     -1.463e+04  3.486e+03  -4.196 2.72e-05 ***
## sqft_living    2.172e+02  4.820e+00  45.059  < 2e-16 ***
## sqft_lot      -3.180e-01  3.900e-02  -8.152 3.76e-16 ***
## floors        -2.831e+03  3.941e+03  -0.718    0.473    
## waterfront     5.822e+05  1.985e+04  29.333  < 2e-16 ***
## view           6.064e+04  2.385e+03  25.428  < 2e-16 ***
## condition      5.344e+04  2.532e+03  21.108  < 2e-16 ***
## grade          1.032e+05  2.269e+03  45.472  < 2e-16 ***
## sqft_above    -2.917e+01  4.718e+00  -6.182 6.42e-10 ***
## sqft_basement         NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 230600 on 21602 degrees of freedom
## Multiple R-squared:  0.6055, Adjusted R-squared:  0.6053 
## F-statistic:  3315 on 10 and 21602 DF,  p-value: < 2.2e-16
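
The NA row for sqft_basement in the summary is what lm() reports when a predictor is an exact linear combination of other predictors. In every row of the head() preview, sqft_living equals sqft_above + sqft_basement, which would explain it. Here is a minimal sketch of the same behavior on made-up data (the toy data frame and its values are assumptions, not the King County data):

```r
# Toy data (hypothetical values) in which sqft_living is exactly
# sqft_above + sqft_basement, mirroring the pattern in the preview.
set.seed(1)
toy <- data.frame(
  sqft_above    = round(runif(50, 500, 3000)),
  sqft_basement = round(runif(50, 0, 1000))
)
toy$sqft_living <- toy$sqft_above + toy$sqft_basement
toy$price <- 200 * toy$sqft_living + rnorm(50, sd = 10000)

toy_lm <- lm(price ~ sqft_living + sqft_above + sqft_basement, data = toy)

# The last term of the dependent set gets no estimate of its own.
coef(toy_lm)

# alias() spells out which linear dependency lm() detected.
alias(toy_lm)
```

Dropping any one of the three square-footage terms removes the singularity without changing the fitted values.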

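The only predictor without a significant coefficient is floors (p = 0.473). Square feet of the lot is statistically significant, but its estimated effect is tiny (about -0.32 in price per square foot), so it is not a very useful predictor. Let’s remove that and try adding zip code as a factor.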

house_lm_2 <- lm(price ~ bedrooms + bathrooms + sqft_living + as.factor(zipcode) + floors + waterfront + view + condition + grade + sqft_above + sqft_basement, data=house_df)
summary(house_lm_2)
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + as.factor(zipcode) + 
##     floors + waterfront + view + condition + grade + sqft_above + 
##     sqft_basement, data = house_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1143285   -71144    -1431    61730  4441328 
## 
## Coefficients: (1 not defined because of singularities)
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             -4.710e+05  1.538e+04 -30.621  < 2e-16 ***
## bedrooms                -2.604e+04  1.536e+03 -16.954  < 2e-16 ***
## bathrooms                1.401e+04  2.491e+03   5.624 1.89e-08 ***
## sqft_living              1.373e+02  3.488e+00  39.352  < 2e-16 ***
## as.factor(zipcode)98002  2.812e+04  1.434e+04   1.961 0.049909 *  
## as.factor(zipcode)98003 -1.460e+04  1.292e+04  -1.130 0.258329    
## as.factor(zipcode)98004  7.896e+05  1.260e+04  62.658  < 2e-16 ***
## as.factor(zipcode)98005  3.121e+05  1.525e+04  20.465  < 2e-16 ***
## as.factor(zipcode)98006  2.742e+05  1.138e+04  24.104  < 2e-16 ***
## as.factor(zipcode)98007  2.518e+05  1.614e+04  15.599  < 2e-16 ***
## as.factor(zipcode)98008  2.546e+05  1.292e+04  19.699  < 2e-16 ***
## as.factor(zipcode)98010  7.543e+04  1.834e+04   4.114 3.90e-05 ***
## as.factor(zipcode)98011  1.267e+05  1.442e+04   8.787  < 2e-16 ***
## as.factor(zipcode)98014  1.120e+05  1.690e+04   6.627 3.51e-11 ***
## as.factor(zipcode)98019  9.403e+04  1.455e+04   6.461 1.06e-10 ***
## as.factor(zipcode)98022 -2.770e+03  1.366e+04  -0.203 0.839310    
## as.factor(zipcode)98023 -3.367e+04  1.121e+04  -3.003 0.002675 ** 
## as.factor(zipcode)98024  1.781e+05  1.996e+04   8.922  < 2e-16 ***
## as.factor(zipcode)98027  1.749e+05  1.174e+04  14.893  < 2e-16 ***
## as.factor(zipcode)98028  1.247e+05  1.288e+04   9.685  < 2e-16 ***
## as.factor(zipcode)98029  2.141e+05  1.253e+04  17.089  < 2e-16 ***
## as.factor(zipcode)98030  4.972e+03  1.325e+04   0.375 0.707428    
## as.factor(zipcode)98031  1.468e+04  1.299e+04   1.130 0.258655    
## as.factor(zipcode)98032  9.867e+03  1.684e+04   0.586 0.558055    
## as.factor(zipcode)98033  3.685e+05  1.159e+04  31.779  < 2e-16 ***
## as.factor(zipcode)98034  2.035e+05  1.101e+04  18.485  < 2e-16 ***
## as.factor(zipcode)98038  3.120e+04  1.086e+04   2.874 0.004059 ** 
## as.factor(zipcode)98039  1.336e+06  2.466e+04  54.177  < 2e-16 ***
## as.factor(zipcode)98040  5.248e+05  1.308e+04  40.127  < 2e-16 ***
## as.factor(zipcode)98042  3.486e+03  1.099e+04   0.317 0.751185    
## as.factor(zipcode)98045  9.560e+04  1.386e+04   6.899 5.38e-12 ***
## as.factor(zipcode)98052  2.304e+05  1.093e+04  21.067  < 2e-16 ***
## as.factor(zipcode)98053  1.891e+05  1.184e+04  15.965  < 2e-16 ***
## as.factor(zipcode)98055  5.395e+04  1.307e+04   4.127 3.68e-05 ***
## as.factor(zipcode)98056  9.591e+04  1.174e+04   8.168 3.32e-16 ***
## as.factor(zipcode)98058  3.162e+04  1.143e+04   2.765 0.005691 ** 
## as.factor(zipcode)98059  8.267e+04  1.139e+04   7.259 4.03e-13 ***
## as.factor(zipcode)98065  8.287e+04  1.262e+04   6.564 5.35e-11 ***
## as.factor(zipcode)98070 -4.417e+03  1.743e+04  -0.253 0.799978    
## as.factor(zipcode)98072  1.573e+05  1.304e+04  12.063  < 2e-16 ***
## as.factor(zipcode)98074  1.756e+05  1.161e+04  15.127  < 2e-16 ***
## as.factor(zipcode)98075  1.682e+05  1.224e+04  13.748  < 2e-16 ***
## as.factor(zipcode)98077  1.280e+05  1.446e+04   8.853  < 2e-16 ***
## as.factor(zipcode)98092 -3.643e+04  1.217e+04  -2.994 0.002754 ** 
## as.factor(zipcode)98102  5.509e+05  1.811e+04  30.415  < 2e-16 ***
## as.factor(zipcode)98103  3.694e+05  1.092e+04  33.819  < 2e-16 ***
## as.factor(zipcode)98105  5.050e+05  1.376e+04  36.700  < 2e-16 ***
## as.factor(zipcode)98106  1.610e+05  1.234e+04  13.049  < 2e-16 ***
## as.factor(zipcode)98107  3.753e+05  1.323e+04  28.377  < 2e-16 ***
## as.factor(zipcode)98108  1.444e+05  1.466e+04   9.846  < 2e-16 ***
## as.factor(zipcode)98109  5.316e+05  1.780e+04  29.861  < 2e-16 ***
## as.factor(zipcode)98112  6.606e+05  1.319e+04  50.087  < 2e-16 ***
## as.factor(zipcode)98115  3.563e+05  1.091e+04  32.654  < 2e-16 ***
## as.factor(zipcode)98116  3.157e+05  1.244e+04  25.372  < 2e-16 ***
## as.factor(zipcode)98117  3.430e+05  1.103e+04  31.098  < 2e-16 ***
## as.factor(zipcode)98118  1.940e+05  1.119e+04  17.336  < 2e-16 ***
## as.factor(zipcode)98119  5.156e+05  1.480e+04  34.834  < 2e-16 ***
## as.factor(zipcode)98122  3.745e+05  1.288e+04  29.077  < 2e-16 ***
## as.factor(zipcode)98125  2.163e+05  1.172e+04  18.450  < 2e-16 ***
## as.factor(zipcode)98126  2.130e+05  1.218e+04  17.486  < 2e-16 ***
## as.factor(zipcode)98133  1.798e+05  1.125e+04  15.977  < 2e-16 ***
## as.factor(zipcode)98136  2.683e+05  1.322e+04  20.294  < 2e-16 ***
## as.factor(zipcode)98144  3.053e+05  1.231e+04  24.800  < 2e-16 ***
## as.factor(zipcode)98146  1.228e+05  1.284e+04   9.566  < 2e-16 ***
## as.factor(zipcode)98148  7.998e+04  2.312e+04   3.459 0.000544 ***
## as.factor(zipcode)98155  1.575e+05  1.149e+04  13.711  < 2e-16 ***
## as.factor(zipcode)98166  6.739e+04  1.332e+04   5.058 4.28e-07 ***
## as.factor(zipcode)98168  9.179e+04  1.310e+04   7.004 2.55e-12 ***
## as.factor(zipcode)98177  2.319e+05  1.334e+04  17.383  < 2e-16 ***
## as.factor(zipcode)98178  5.176e+04  1.321e+04   3.919 8.92e-05 ***
## as.factor(zipcode)98188  4.379e+04  1.632e+04   2.683 0.007299 ** 
## as.factor(zipcode)98198  4.193e+03  1.294e+04   0.324 0.745989    
## as.factor(zipcode)98199  4.105e+05  1.259e+04  32.613  < 2e-16 ***
## floors                  -6.021e+04  2.989e+03 -20.143  < 2e-16 ***
## waterfront               6.644e+05  1.421e+04  46.760  < 2e-16 ***
## view                     5.924e+04  1.734e+03  34.161  < 2e-16 ***
## condition                2.987e+04  1.833e+03  16.298  < 2e-16 ***
## grade                    5.241e+04  1.700e+03  30.833  < 2e-16 ***
## sqft_above               8.644e+01  3.611e+00  23.939  < 2e-16 ***
## sqft_basement                   NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 162200 on 21534 degrees of freedom
## Multiple R-squared:  0.8056, Adjusted R-squared:  0.8049 
## F-statistic:  1144 on 78 and 21534 DF,  p-value: < 2.2e-16

That gives us decent performance: about 80.5% of the variation in home prices is explained by our model (adjusted R-squared of 0.8049). Adding zip code as a factor adds 69 coefficients to the model, though, so there might be an opportunity to simplify this with more regional knowledge.
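
The long run of as.factor(zipcode) rows comes from how R encodes a factor: each level except the first becomes its own 0/1 indicator column, and the first level (98001 in the fit above, since it has no row of its own) is absorbed into the intercept. Here is a small sketch using the zip codes from the head() preview (the toy data frame is an assumption for illustration):

```r
# Zip codes taken from the head() preview; the data frame itself is a toy.
toy <- data.frame(zipcode = c(98178, 98125, 98028, 98136, 98074, 98053, 98178))

# A factor with k levels becomes k - 1 indicator columns; the first
# level (here 98028) is the baseline folded into the intercept.
zip_factor <- as.factor(toy$zipcode)
dummies <- model.matrix(~ zip_factor)

nlevels(zip_factor)   # number of distinct zip codes
ncol(dummies) - 1     # indicator columns, excluding the intercept
```

Each zip coefficient is therefore a price offset relative to the baseline zip code, holding the other predictors fixed.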

Does adding latitude and longitude add any predictive power, even though we know that using their raw numeric values as linear predictors could be misleading?

house_lm_3 <- lm(price ~ bedrooms + bathrooms + sqft_living + as.factor(zipcode) + floors + waterfront + view + condition + grade + sqft_above + sqft_basement + lat + long, data=house_df)
summary(house_lm_3)
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + as.factor(zipcode) + 
##     floors + waterfront + view + condition + grade + sqft_above + 
##     sqft_basement + lat + long, data = house_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1136900   -71322    -1757    61737  4440870 
## 
## Coefficients: (1 not defined because of singularities)
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             -2.563e+07  6.164e+06  -4.157 3.23e-05 ***
## bedrooms                -2.606e+04  1.535e+03 -16.976  < 2e-16 ***
## bathrooms                1.400e+04  2.490e+03   5.623 1.90e-08 ***
## sqft_living              1.373e+02  3.487e+00  39.370  < 2e-16 ***
## as.factor(zipcode)98002  3.493e+04  1.457e+04   2.397 0.016554 *  
## as.factor(zipcode)98003 -2.074e+04  1.304e+04  -1.590 0.111757    
## as.factor(zipcode)98004  7.295e+05  2.367e+04  30.824  < 2e-16 ***
## as.factor(zipcode)98005  2.577e+05  2.530e+04  10.183  < 2e-16 ***
## as.factor(zipcode)98006  2.341e+05  2.067e+04  11.324  < 2e-16 ***
## as.factor(zipcode)98007  2.006e+05  2.612e+04   7.679 1.67e-14 ***
## as.factor(zipcode)98008  2.059e+05  2.481e+04   8.299  < 2e-16 ***
## as.factor(zipcode)98010  1.025e+05  2.223e+04   4.612 4.01e-06 ***
## as.factor(zipcode)98011  3.632e+04  3.229e+04   1.125 0.260692    
## as.factor(zipcode)98014  8.239e+04  3.544e+04   2.325 0.020094 *  
## as.factor(zipcode)98019  3.714e+04  3.499e+04   1.061 0.288476    
## as.factor(zipcode)98022  5.179e+04  1.930e+04   2.684 0.007285 ** 
## as.factor(zipcode)98023 -4.474e+04  1.199e+04  -3.731 0.000191 ***
## as.factor(zipcode)98024  1.667e+05  3.110e+04   5.360 8.42e-08 ***
## as.factor(zipcode)98027  1.552e+05  2.124e+04   7.307 2.83e-13 ***
## as.factor(zipcode)98028  2.899e+04  3.137e+04   0.924 0.355454    
## as.factor(zipcode)98029  1.905e+05  2.424e+04   7.858 4.09e-15 ***
## as.factor(zipcode)98030  1.326e+03  1.433e+04   0.092 0.926312    
## as.factor(zipcode)98031  3.477e+03  1.492e+04   0.233 0.815747    
## as.factor(zipcode)98032 -5.363e+03  1.733e+04  -0.309 0.757021    
## as.factor(zipcode)98033  2.963e+05  2.690e+04  11.014  < 2e-16 ***
## as.factor(zipcode)98034  1.201e+05  2.885e+04   4.163 3.15e-05 ***
## as.factor(zipcode)98038  4.619e+04  1.608e+04   2.873 0.004067 ** 
## as.factor(zipcode)98039  1.270e+06  3.199e+04  39.697  < 2e-16 ***
## as.factor(zipcode)98040  4.747e+05  2.091e+04  22.699  < 2e-16 ***
## as.factor(zipcode)98042  9.530e+03  1.371e+04   0.695 0.486896    
## as.factor(zipcode)98045  1.207e+05  2.972e+04   4.061 4.91e-05 ***
## as.factor(zipcode)98052  1.664e+05  2.745e+04   6.060 1.39e-09 ***
## as.factor(zipcode)98053  1.367e+05  2.942e+04   4.646 3.40e-06 ***
## as.factor(zipcode)98055  2.899e+04  1.661e+04   1.745 0.081017 .  
## as.factor(zipcode)98056  6.197e+04  1.806e+04   3.430 0.000604 ***
## as.factor(zipcode)98058  1.625e+04  1.571e+04   1.035 0.300826    
## as.factor(zipcode)98059  5.772e+04  1.771e+04   3.260 0.001115 ** 
## as.factor(zipcode)98065  8.250e+04  2.734e+04   3.017 0.002555 ** 
## as.factor(zipcode)98070 -5.174e+04  2.051e+04  -2.522 0.011675 *  
## as.factor(zipcode)98072  7.614e+04  3.212e+04   2.370 0.017779 *  
## as.factor(zipcode)98074  1.331e+05  2.598e+04   5.122 3.05e-07 ***
## as.factor(zipcode)98075  1.364e+05  2.496e+04   5.465 4.68e-08 ***
## as.factor(zipcode)98077  5.647e+04  3.340e+04   1.690 0.090956 .  
## as.factor(zipcode)98092 -2.252e+04  1.303e+04  -1.729 0.083910 .  
## as.factor(zipcode)98102  4.727e+05  2.757e+04  17.148  < 2e-16 ***
## as.factor(zipcode)98103  2.793e+05  2.590e+04  10.784  < 2e-16 ***
## as.factor(zipcode)98105  4.233e+05  2.656e+04  15.937  < 2e-16 ***
## as.factor(zipcode)98106  1.004e+05  1.924e+04   5.221 1.79e-07 ***
## as.factor(zipcode)98107  2.829e+05  2.674e+04  10.582  < 2e-16 ***
## as.factor(zipcode)98108  8.678e+04  2.123e+04   4.088 4.38e-05 ***
## as.factor(zipcode)98109  4.499e+05  2.748e+04  16.371  < 2e-16 ***
## as.factor(zipcode)98112  5.865e+05  2.433e+04  24.108  < 2e-16 ***
## as.factor(zipcode)98115  2.693e+05  2.637e+04  10.212  < 2e-16 ***
## as.factor(zipcode)98116  2.424e+05  2.144e+04  11.306  < 2e-16 ***
## as.factor(zipcode)98117  2.465e+05  2.669e+04   9.234  < 2e-16 ***
## as.factor(zipcode)98118  1.415e+05  1.871e+04   7.564 4.06e-14 ***
## as.factor(zipcode)98119  4.309e+05  2.594e+04  16.610  < 2e-16 ***
## as.factor(zipcode)98122  3.045e+05  2.312e+04  13.167  < 2e-16 ***
## as.factor(zipcode)98125  1.223e+05  2.854e+04   4.287 1.82e-05 ***
## as.factor(zipcode)98126  1.483e+05  1.969e+04   7.533 5.15e-14 ***
## as.factor(zipcode)98133  7.661e+04  2.947e+04   2.600 0.009336 ** 
## as.factor(zipcode)98136  2.034e+05  2.020e+04  10.070  < 2e-16 ***
## as.factor(zipcode)98144  2.406e+05  2.156e+04  11.158  < 2e-16 ***
## as.factor(zipcode)98146  6.961e+04  1.806e+04   3.855 0.000116 ***
## as.factor(zipcode)98148  4.459e+04  2.460e+04   1.812 0.069947 .  
## as.factor(zipcode)98155  5.494e+04  3.066e+04   1.792 0.073128 .  
## as.factor(zipcode)98166  2.544e+04  1.654e+04   1.539 0.123892    
## as.factor(zipcode)98168  4.736e+04  1.746e+04   2.713 0.006682 ** 
## as.factor(zipcode)98177  1.242e+05  3.077e+04   4.037 5.42e-05 ***
## as.factor(zipcode)98178  1.256e+04  1.804e+04   0.697 0.486053    
## as.factor(zipcode)98188  1.198e+04  1.854e+04   0.646 0.517998    
## as.factor(zipcode)98198 -1.930e+04  1.405e+04  -1.374 0.169569    
## as.factor(zipcode)98199  3.203e+05  2.537e+04  12.625  < 2e-16 ***
## floors                  -6.002e+04  2.990e+03 -20.078  < 2e-16 ***
## waterfront               6.644e+05  1.421e+04  46.750  < 2e-16 ***
## view                     5.929e+04  1.734e+03  34.200  < 2e-16 ***
## condition                3.006e+04  1.833e+03  16.397  < 2e-16 ***
## grade                    5.218e+04  1.700e+03  30.696  < 2e-16 ***
## sqft_above               8.676e+01  3.611e+00  24.030  < 2e-16 ***
## sqft_basement                   NA         NA      NA       NA    
## lat                      2.214e+05  6.391e+04   3.464 0.000534 ***
## long                    -1.201e+05  4.572e+04  -2.627 0.008623 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 162100 on 21532 degrees of freedom
## Multiple R-squared:  0.8058, Adjusted R-squared:  0.8051 
## F-statistic:  1117 on 80 and 21532 DF,  p-value: < 2.2e-16

Slightly better (adjusted R-squared goes from 0.8049 to 0.8051), but we’re at risk of overfitting.

Let’s check our assumptions on the last model:

library(performance)  # provides check_model()
check_model(house_lm_3)

We have problems with the variance not being constant at the extremes, particularly on the high side. Collinearity is present, and it is dominated by the zip codes.
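
Collinearity is commonly quantified with variance inflation factors (VIF): regress each predictor on all the others and compute 1 / (1 - R^2). The sketch below does this by hand in base R on simulated predictors. The data, the `vif_one` helper, and the variable relationships are all assumptions made for illustration.

```r
# Toy predictors (hypothetical values): bathrooms is built to track
# sqft_living closely, sqft_lot is generated independently.
set.seed(42)
n <- 200
toy <- data.frame(sqft_living = rnorm(n, mean = 2000, sd = 500))
toy$bathrooms <- toy$sqft_living / 1000 + rnorm(n, sd = 0.2)
toy$sqft_lot  <- rnorm(n, mean = 8000, sd = 2000)

# VIF for one predictor: regress it on the others and use 1 / (1 - R^2).
vif_one <- function(target, data) {
  others <- setdiff(names(data), target)
  r2 <- summary(lm(reformulate(others, response = target), data = data))$r.squared
  1 / (1 - r2)
}

vifs <- sapply(names(toy), vif_one, data = toy)
round(vifs, 2)
```

Values near 1 indicate little shared information with the other predictors; the two deliberately correlated columns should come out noticeably higher than the independent one.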

We likely need a more sophisticated approach to geospatial analysis to handle the zip codes and latitude and longitude. Here’s an interesting article on why zip codes shouldn’t be used in predictive models: https://towardsdatascience.com/stop-using-zip-codes-for-geospatial-analysis-ceacb6e80c38

We definitely need to capture market fluctuations present in the sales date using a time-series approach.
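
As a first step toward using the sale date, the strings shown by head(house_df) (for example "20141013T000000") would need to be parsed. This is a minimal sketch, assuming every value follows that same YYYYMMDD-prefixed pattern. The variable names are hypothetical and nothing here was run against the full data set.

```r
# Sale dates in the format shown by head(house_df).
sale_date_raw <- c("20141013T000000", "20141209T000000", "20150225T000000")

# Keep the first eight characters (YYYYMMDD) and parse them as a Date.
sale_date <- as.Date(substr(sale_date_raw, 1, 8), format = "%Y%m%d")

# A month-level factor is one simple way to let a linear model absorb
# market-wide shifts; a real time-series treatment would go further.
sale_month <- factor(format(sale_date, "%Y-%m"))
sale_month
```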

Other variables would likely benefit from transformations. And we’re definitely overfitting.

Not a good linear model at present.
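
The overfitting concern could be checked with the train-test split we skipped at the start: fit on one part of the data and compare the error on the held-out part. Here is a sketch on simulated data (the data, the 80/20 split, and the `rmse` helper are assumptions for illustration, not results from the housing data):

```r
# Simulated stand-in for the housing data (all values hypothetical).
set.seed(123)
n <- 500
toy <- data.frame(
  sqft_living = runif(n, 500, 4000),
  grade       = sample(5:11, n, replace = TRUE)
)
toy$price <- 150 * toy$sqft_living + 40000 * toy$grade + rnorm(n, sd = 50000)

# Hold out one fifth (20%) of the rows for testing.
test_idx <- sample(n, size = n / 5)
train <- toy[-test_idx, ]
test  <- toy[test_idx, ]

fit <- lm(price ~ sqft_living + grade, data = train)

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
train_rmse <- rmse(train$price, fitted(fit))
test_rmse  <- rmse(test$price, predict(fit, newdata = test))

# A test RMSE far above the training RMSE would point to overfitting.
c(train = train_rmse, test = test_rmse)
```

With zip code as a factor, a split like this would also need every zip code to appear in the training rows, otherwise predict() fails on unseen levels.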

The discussion prompt says: Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term.

We already have a dichotomous variable in the model: waterfront.

Let’s try to turn one of the variables into a quadratic term. Lacking any better ideas or domain knowledge, we’ll use condition, replacing it with its square.

house_df$condition_quad = house_df$condition^2
house_lm_4 <- lm(price ~ bedrooms + bathrooms + sqft_living + as.factor(zipcode) + floors + waterfront + view + condition_quad + grade + sqft_above + sqft_basement + lat + long, data=house_df)
summary(house_lm_4)
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + as.factor(zipcode) + 
##     floors + waterfront + view + condition_quad + grade + sqft_above + 
##     sqft_basement + lat + long, data = house_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1136347   -71371    -1864    62030  4443406 
## 
## Coefficients: (1 not defined because of singularities)
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             -2.556e+07  6.162e+06  -4.149 3.36e-05 ***
## bedrooms                -2.605e+04  1.535e+03 -16.974  < 2e-16 ***
## bathrooms                1.389e+04  2.489e+03   5.582 2.41e-08 ***
## sqft_living              1.371e+02  3.486e+00  39.328  < 2e-16 ***
## as.factor(zipcode)98002  3.508e+04  1.457e+04   2.408 0.016046 *  
## as.factor(zipcode)98003 -2.045e+04  1.303e+04  -1.569 0.116642    
## as.factor(zipcode)98004  7.298e+05  2.366e+04  30.848  < 2e-16 ***
## as.factor(zipcode)98005  2.578e+05  2.529e+04  10.191  < 2e-16 ***
## as.factor(zipcode)98006  2.338e+05  2.066e+04  11.316  < 2e-16 ***
## as.factor(zipcode)98007  2.011e+05  2.611e+04   7.703 1.39e-14 ***
## as.factor(zipcode)98008  2.063e+05  2.479e+04   8.321  < 2e-16 ***
## as.factor(zipcode)98010  1.025e+05  2.222e+04   4.613 4.00e-06 ***
## as.factor(zipcode)98011  3.653e+04  3.228e+04   1.132 0.257787    
## as.factor(zipcode)98014  8.267e+04  3.543e+04   2.333 0.019632 *  
## as.factor(zipcode)98019  3.746e+04  3.497e+04   1.071 0.284073    
## as.factor(zipcode)98022  5.123e+04  1.929e+04   2.656 0.007921 ** 
## as.factor(zipcode)98023 -4.459e+04  1.199e+04  -3.719 0.000200 ***
## as.factor(zipcode)98024  1.667e+05  3.108e+04   5.363 8.29e-08 ***
## as.factor(zipcode)98027  1.553e+05  2.123e+04   7.315 2.67e-13 ***
## as.factor(zipcode)98028  2.927e+04  3.135e+04   0.933 0.350608    
## as.factor(zipcode)98029  1.906e+05  2.423e+04   7.867 3.81e-15 ***
## as.factor(zipcode)98030  1.605e+03  1.433e+04   0.112 0.910839    
## as.factor(zipcode)98031  3.721e+03  1.491e+04   0.250 0.802943    
## as.factor(zipcode)98032 -4.988e+03  1.733e+04  -0.288 0.773427    
## as.factor(zipcode)98033  2.963e+05  2.689e+04  11.019  < 2e-16 ***
## as.factor(zipcode)98034  1.204e+05  2.883e+04   4.176 2.98e-05 ***
## as.factor(zipcode)98038  4.631e+04  1.607e+04   2.881 0.003964 ** 
## as.factor(zipcode)98039  1.270e+06  3.198e+04  39.728  < 2e-16 ***
## as.factor(zipcode)98040  4.745e+05  2.090e+04  22.700  < 2e-16 ***
## as.factor(zipcode)98042  9.380e+03  1.370e+04   0.685 0.493627    
## as.factor(zipcode)98045  1.207e+05  2.971e+04   4.064 4.83e-05 ***
## as.factor(zipcode)98052  1.667e+05  2.744e+04   6.076 1.25e-09 ***
## as.factor(zipcode)98053  1.370e+05  2.941e+04   4.657 3.23e-06 ***
## as.factor(zipcode)98055  2.910e+04  1.661e+04   1.752 0.079759 .  
## as.factor(zipcode)98056  6.146e+04  1.806e+04   3.403 0.000667 ***
## as.factor(zipcode)98058  1.615e+04  1.570e+04   1.029 0.303679    
## as.factor(zipcode)98059  5.769e+04  1.770e+04   3.260 0.001117 ** 
## as.factor(zipcode)98065  8.264e+04  2.733e+04   3.023 0.002503 ** 
## as.factor(zipcode)98070 -5.209e+04  2.051e+04  -2.540 0.011090 *  
## as.factor(zipcode)98072  7.672e+04  3.211e+04   2.389 0.016883 *  
## as.factor(zipcode)98074  1.332e+05  2.597e+04   5.127 2.97e-07 ***
## as.factor(zipcode)98075  1.365e+05  2.495e+04   5.472 4.51e-08 ***
## as.factor(zipcode)98077  5.678e+04  3.339e+04   1.701 0.089007 .  
## as.factor(zipcode)98092 -2.248e+04  1.302e+04  -1.726 0.084402 .  
## as.factor(zipcode)98102  4.725e+05  2.755e+04  17.147  < 2e-16 ***
## as.factor(zipcode)98103  2.789e+05  2.589e+04  10.773  < 2e-16 ***
## as.factor(zipcode)98105  4.229e+05  2.655e+04  15.928  < 2e-16 ***
## as.factor(zipcode)98106  1.004e+05  1.923e+04   5.222 1.79e-07 ***
## as.factor(zipcode)98107  2.827e+05  2.672e+04  10.577  < 2e-16 ***
## as.factor(zipcode)98108  8.650e+04  2.122e+04   4.076 4.61e-05 ***
## as.factor(zipcode)98109  4.494e+05  2.747e+04  16.359  < 2e-16 ***
## as.factor(zipcode)98112  5.858e+05  2.432e+04  24.091  < 2e-16 ***
## as.factor(zipcode)98115  2.690e+05  2.636e+04  10.206  < 2e-16 ***
## as.factor(zipcode)98116  2.417e+05  2.143e+04  11.280  < 2e-16 ***
## as.factor(zipcode)98117  2.464e+05  2.668e+04   9.234  < 2e-16 ***
## as.factor(zipcode)98118  1.412e+05  1.870e+04   7.548 4.61e-14 ***
## as.factor(zipcode)98119  4.306e+05  2.593e+04  16.608  < 2e-16 ***
## as.factor(zipcode)98122  3.039e+05  2.311e+04  13.147  < 2e-16 ***
## as.factor(zipcode)98125  1.223e+05  2.852e+04   4.288 1.81e-05 ***
## as.factor(zipcode)98126  1.479e+05  1.968e+04   7.517 5.81e-14 ***
## as.factor(zipcode)98133  7.674e+04  2.946e+04   2.605 0.009190 ** 
## as.factor(zipcode)98136  2.032e+05  2.019e+04  10.062  < 2e-16 ***
## as.factor(zipcode)98144  2.402e+05  2.155e+04  11.145  < 2e-16 ***
## as.factor(zipcode)98146  6.937e+04  1.805e+04   3.843 0.000122 ***
## as.factor(zipcode)98148  4.339e+04  2.459e+04   1.765 0.077643 .  
## as.factor(zipcode)98155  5.489e+04  3.064e+04   1.791 0.073290 .  
## as.factor(zipcode)98166  2.538e+04  1.653e+04   1.536 0.124601    
## as.factor(zipcode)98168  4.703e+04  1.745e+04   2.695 0.007050 ** 
## as.factor(zipcode)98177  1.245e+05  3.076e+04   4.046 5.22e-05 ***
## as.factor(zipcode)98178  1.250e+04  1.803e+04   0.693 0.488278    
## as.factor(zipcode)98188  1.194e+04  1.853e+04   0.644 0.519318    
## as.factor(zipcode)98198 -1.907e+04  1.405e+04  -1.358 0.174569    
## as.factor(zipcode)98199  3.200e+05  2.536e+04  12.619  < 2e-16 ***
## floors                  -5.988e+04  2.988e+03 -20.042  < 2e-16 ***
## waterfront               6.641e+05  1.421e+04  46.745  < 2e-16 ***
## view                     5.928e+04  1.733e+03  34.206  < 2e-16 ***
## condition_quad           4.108e+03  2.431e+02  16.900  < 2e-16 ***
## grade                    5.244e+04  1.700e+03  30.842  < 2e-16 ***
## sqft_above               8.679e+01  3.609e+00  24.050  < 2e-16 ***
## sqft_basement                   NA         NA      NA       NA    
## lat                      2.206e+05  6.388e+04   3.453 0.000556 ***
## long                    -1.203e+05  4.570e+04  -2.633 0.008481 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 162000 on 21532 degrees of freedom
## Multiple R-squared:  0.8059, Adjusted R-squared:  0.8052 
## F-statistic:  1118 on 80 and 21532 DF,  p-value: < 2.2e-16

Hey - it worked…minimally.
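
A more conventional quadratic specification keeps the linear term and adds the square on top of it, rather than swapping one for the other. In R this can be written directly in the formula with I(). The sketch below uses simulated data, so the coefficients and the `quad_lm` name are purely illustrative.

```r
# Toy data (hypothetical): price curves upward with condition.
set.seed(7)
toy <- data.frame(condition = sample(1:5, 300, replace = TRUE))
toy$price <- 300000 + 10000 * toy$condition + 5000 * toy$condition^2 +
  rnorm(300, sd = 20000)

# I() protects ^ from formula syntax, so this fits both the linear
# and the squared term without adding a column to the data frame.
quad_lm <- lm(price ~ condition + I(condition^2), data = toy)
names(coef(quad_lm))
```

Keeping both terms lets the data decide how much curvature there is, instead of forcing the effect through condition^2 alone.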

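Now we add a dichotomous:quantitative interaction variable. Let’s do waterfront:sqft_living. Note that the derived column below is the ratio waterfront/sqft_living rather than the usual product of the two variables, which helps explain why its coefficient is so large in magnitude.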

house_df$waterfront_sqftliving = house_df$waterfront/house_df$sqft_living
house_lm_5 <- lm(price ~ bedrooms + bathrooms + sqft_living + as.factor(zipcode) + floors + waterfront + view + condition_quad + grade + sqft_above + sqft_basement + lat + long + waterfront_sqftliving, data=house_df)
summary(house_lm_5)
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + as.factor(zipcode) + 
##     floors + waterfront + view + condition_quad + grade + sqft_above + 
##     sqft_basement + lat + long + waterfront_sqftliving, data = house_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1214914   -70695    -1918    60642  4479415 
## 
## Coefficients: (1 not defined because of singularities)
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             -2.689e+07  6.072e+06  -4.428 9.57e-06 ***
## bedrooms                -2.518e+04  1.513e+03 -16.649  < 2e-16 ***
## bathrooms                1.395e+04  2.453e+03   5.689 1.29e-08 ***
## sqft_living              1.322e+02  3.440e+00  38.428  < 2e-16 ***
## as.factor(zipcode)98002  3.370e+04  1.435e+04   2.348 0.018897 *  
## as.factor(zipcode)98003 -2.025e+04  1.284e+04  -1.577 0.114853    
## as.factor(zipcode)98004  7.260e+05  2.331e+04  31.144  < 2e-16 ***
## as.factor(zipcode)98005  2.546e+05  2.492e+04  10.215  < 2e-16 ***
## as.factor(zipcode)98006  2.321e+05  2.036e+04  11.398  < 2e-16 ***
## as.factor(zipcode)98007  1.957e+05  2.573e+04   7.607 2.92e-14 ***
## as.factor(zipcode)98008  1.946e+05  2.444e+04   7.963 1.76e-15 ***
## as.factor(zipcode)98010  1.033e+05  2.190e+04   4.716 2.43e-06 ***
## as.factor(zipcode)98011  2.799e+04  3.181e+04   0.880 0.378897    
## as.factor(zipcode)98014  7.608e+04  3.491e+04   2.179 0.029326 *  
## as.factor(zipcode)98019  2.917e+04  3.446e+04   0.846 0.397367    
## as.factor(zipcode)98022  5.320e+04  1.901e+04   2.799 0.005137 ** 
## as.factor(zipcode)98023 -4.408e+04  1.181e+04  -3.731 0.000191 ***
## as.factor(zipcode)98024  1.632e+05  3.063e+04   5.328 1.00e-07 ***
## as.factor(zipcode)98027  1.523e+05  2.092e+04   7.280 3.45e-13 ***
## as.factor(zipcode)98028  1.966e+04  3.090e+04   0.636 0.524519    
## as.factor(zipcode)98029  1.872e+05  2.388e+04   7.840 4.71e-15 ***
## as.factor(zipcode)98030  5.865e+02  1.412e+04   0.042 0.966869    
## as.factor(zipcode)98031  1.868e+03  1.470e+04   0.127 0.898852    
## as.factor(zipcode)98032 -6.912e+03  1.707e+04  -0.405 0.685593    
## as.factor(zipcode)98033  2.877e+05  2.650e+04  10.855  < 2e-16 ***
## as.factor(zipcode)98034  1.098e+05  2.842e+04   3.863 0.000112 ***
## as.factor(zipcode)98038  4.616e+04  1.584e+04   2.915 0.003562 ** 
## as.factor(zipcode)98039  1.266e+06  3.151e+04  40.181  < 2e-16 ***
## as.factor(zipcode)98040  4.667e+05  2.060e+04  22.655  < 2e-16 ***
## as.factor(zipcode)98042  8.630e+03  1.350e+04   0.639 0.522697    
## as.factor(zipcode)98045  1.186e+05  2.927e+04   4.053 5.07e-05 ***
## as.factor(zipcode)98052  1.602e+05  2.704e+04   5.925 3.17e-09 ***
## as.factor(zipcode)98053  1.321e+05  2.898e+04   4.557 5.22e-06 ***
## as.factor(zipcode)98055  2.544e+04  1.636e+04   1.555 0.120057    
## as.factor(zipcode)98056  5.614e+04  1.780e+04   3.155 0.001609 ** 
## as.factor(zipcode)98058  1.398e+04  1.547e+04   0.904 0.366049    
## as.factor(zipcode)98059  5.543e+04  1.744e+04   3.178 0.001483 ** 
## as.factor(zipcode)98065  8.039e+04  2.693e+04   2.985 0.002842 ** 
## as.factor(zipcode)98070  2.551e+04  2.044e+04   1.248 0.212003    
## as.factor(zipcode)98072  6.914e+04  3.164e+04   2.185 0.028877 *  
## as.factor(zipcode)98074  1.287e+05  2.559e+04   5.028 4.99e-07 ***
## as.factor(zipcode)98075  1.338e+05  2.458e+04   5.443 5.30e-08 ***
## as.factor(zipcode)98077  5.132e+04  3.290e+04   1.560 0.118814    
## as.factor(zipcode)98092 -2.095e+04  1.283e+04  -1.633 0.102572    
## as.factor(zipcode)98102  4.667e+05  2.715e+04  17.186  < 2e-16 ***
## as.factor(zipcode)98103  2.696e+05  2.552e+04  10.568  < 2e-16 ***
## as.factor(zipcode)98105  4.128e+05  2.617e+04  15.773  < 2e-16 ***
## as.factor(zipcode)98106  9.376e+04  1.895e+04   4.948 7.56e-07 ***
## as.factor(zipcode)98107  2.733e+05  2.634e+04  10.376  < 2e-16 ***
## as.factor(zipcode)98108  8.048e+04  2.091e+04   3.848 0.000119 ***
## as.factor(zipcode)98109  4.428e+05  2.707e+04  16.357  < 2e-16 ***
## as.factor(zipcode)98112  5.814e+05  2.396e+04  24.263  < 2e-16 ***
## as.factor(zipcode)98115  2.601e+05  2.597e+04  10.015  < 2e-16 ***
## as.factor(zipcode)98116  2.355e+05  2.112e+04  11.151  < 2e-16 ***
## as.factor(zipcode)98117  2.372e+05  2.629e+04   9.020  < 2e-16 ***
## as.factor(zipcode)98118  1.341e+05  1.843e+04   7.278 3.51e-13 ***
## as.factor(zipcode)98119  4.236e+05  2.555e+04  16.578  < 2e-16 ***
## as.factor(zipcode)98122  2.970e+05  2.278e+04  13.037  < 2e-16 ***
## as.factor(zipcode)98125  1.124e+05  2.811e+04   3.997 6.43e-05 ***
## as.factor(zipcode)98126  1.413e+05  1.939e+04   7.285 3.34e-13 ***
## as.factor(zipcode)98133  6.600e+04  2.903e+04   2.274 0.023004 *  
## as.factor(zipcode)98136  1.995e+05  1.990e+04  10.028  < 2e-16 ***
## as.factor(zipcode)98144  2.330e+05  2.124e+04  10.969  < 2e-16 ***
## as.factor(zipcode)98146  6.609e+04  1.779e+04   3.716 0.000203 ***
## as.factor(zipcode)98148  3.946e+04  2.423e+04   1.628 0.103459    
## as.factor(zipcode)98155  4.229e+04  3.020e+04   1.400 0.161470    
## as.factor(zipcode)98166  2.139e+04  1.629e+04   1.313 0.189054    
## as.factor(zipcode)98168  4.135e+04  1.720e+04   2.404 0.016219 *  
## as.factor(zipcode)98177  1.158e+05  3.031e+04   3.820 0.000134 ***
## as.factor(zipcode)98178  1.165e+04  1.777e+04   0.656 0.512151    
## as.factor(zipcode)98188  8.408e+03  1.826e+04   0.460 0.645237    
## as.factor(zipcode)98198 -1.753e+04  1.384e+04  -1.266 0.205424    
## as.factor(zipcode)98199  3.133e+05  2.499e+04  12.536  < 2e-16 ***
## floors                  -5.931e+04  2.944e+03 -20.144  < 2e-16 ***
## waterfront               1.148e+06  2.366e+04  48.519  < 2e-16 ***
## view                     6.022e+04  1.708e+03  35.257  < 2e-16 ***
## condition_quad           4.151e+03  2.395e+02  17.328  < 2e-16 ***
## grade                    5.153e+04  1.676e+03  30.746  < 2e-16 ***
## sqft_above               8.839e+01  3.557e+00  24.850  < 2e-16 ***
## sqft_basement                   NA         NA      NA       NA    
## lat                      2.437e+05  6.296e+04   3.871 0.000109 ***
## long                    -1.222e+05  4.503e+04  -2.714 0.006647 ** 
## waterfront_sqftliving   -1.189e+09  4.686e+07 -25.371  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 159700 on 21531 degrees of freedom
## Multiple R-squared:  0.8116, Adjusted R-squared:  0.8109 
## F-statistic:  1145 on 81 and 21531 DF,  p-value: < 2.2e-16

Our triumph of slight improvement continues, though it feels like overfitting.
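
A standard dichotomous-by-quantitative interaction is the product of the two variables, and R's formula syntax builds it without a derived column. Here is a sketch on simulated data (all values and the `int_lm` name are assumptions; this was not refit on the housing data):

```r
# Toy data (hypothetical values): waterfront homes gain more per square foot.
set.seed(11)
n <- 400
toy <- data.frame(
  waterfront  = rbinom(n, 1, 0.3),
  sqft_living = runif(n, 500, 4000)
)
toy$price <- 100000 + 150 * toy$sqft_living + 200000 * toy$waterfront +
  100 * toy$waterfront * toy$sqft_living + rnorm(n, sd = 30000)

# waterfront * sqft_living expands to both main effects plus the
# product term waterfront:sqft_living.
int_lm <- lm(price ~ waterfront * sqft_living, data = toy)
coef(int_lm)

# The interaction column in the design matrix is the elementwise product.
X <- model.matrix(int_lm)
all.equal(unname(X[, "waterfront:sqft_living"]), toy$waterfront * toy$sqft_living)
```

The interaction coefficient then reads as the extra price per square foot for waterfront homes, on top of the sqft_living slope for everyone else.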