King County Home Sales: Analysis and the limitations of a multiple regression model

The data for these sales comes from the official public records of home sales in the King County area, Washington State. The data sets contains 21613 rows. Each represents a home sold from May 2014 through May 2015. Below is a breakdown of the variables involved:

http://your.kingcounty.gov/assessor/eRealProperty/ResGlossaryOfTerms.html.

I previously posted a heat map of price per sq ft in King County, using the same dataset.

id - Unique ID for each home sold

date - Date of the home sale

price - Price of each home sold

bedrooms - Number of bedrooms

bathrooms - Number of bathrooms, where .5 accounts for a room with a toilet but no shower

sqft_living - Square footage of the apartments interior living space

sqft_lot - Square footage of the land space

floors - Number of floors

waterfront - A dummy variable for whether the apartment was overlooking the waterfront or not

view - An index from 0 to 4 of how good the view of the property was

condition - An index from 1 to 5 on the condition of the apartment,

grade - An index from 1 to 13, where 1-3 falls short of building construction and design, 7 has an average level of construction and design, and 11-13 have a high quality level of construction and design.

sqft_above - The square footage of the interior housing space that is above ground level

sqft_basement - The square footage of the interior housing space that is below ground level

yr_built - The year the house was initially built

yr_renovated - The year of the house’s last renovation

zipcode - What zipcode area the house is in

lat - Lattitude

long - Longitude

sqft_living15 - The square footage of interior housing living space for the nearest 15 neighbors

sqft_lot15 - The square footage of the land lots of the nearest 15 neighbors

First, we carry out an exploratory analysis of the data. Early in the process, I worked together with Greg Merrell. We checked for the normality of the explanatory variables. You can see his work at http://rpubs.com/grmerrell/154461. The remaining work on this page is entirely mine.

Next we compute a few new values from the variables in the data set. For instance, we’re interested in the price per sq foot of the living space, of the lot, and of the surrounding living spaces and lots. We plot those four values against time, and get a very similar pattern (graphs not provided). So we take the mean of those four variables, and plot them against time. We call that variable “Average price per sq foot.” Similarly, we compute the mean selling price per month, and plot it against time.

First the code…

# collapses sales in one month to the first day of the month
homes$date <- paste(substr(homes$date, 1,6), "01", sep = "")
homes$date <- ymd(homes$date)

# stores the median price of homes for each month
median_price_by_month <- homes %>% 
                            group_by(date) %>% 
                            summarise(median_price_month = median(price))

# calculates the price of sq ft for each sqft related variable
homes$price_sqft_living <- homes$price/homes$sqft_living
homes$price_sqft_lot <- homes$price/homes$sqft_lot
homes$price_sqft_lot15 <- homes$price/homes$sqft_lot15
homes$price_sqft_living15 <- homes$price/homes$sqft_living15

# computes the average price of sq ft 
homes <- homes %>% 
  mutate(price_sqft_avg = 
           (price_sqft_living + price_sqft_lot + price_sqft_lot15 + price_sqft_living15)/4)

tmp <- homes %>% 
  group_by(date) %>% 
  summarise(mean_price = median(price_sqft_avg)) %>% as.data.frame()

Then the graphs… There is a pattern to both graphs. Prices dip some time around January and February 2015. I proceeded to split the data into two data sets. Homes sold up to January 2015, and homes sold from February 2015 on. For the sake of siplicity I will only include the data the earlier set (5/14 - 1/15).

I did a variety of transformations to the independent variables I thought were important. The goal was to achieve somewhat normally distributed variables that could successfully describe the variability in price, the independent variable.

First I split the set into a train set and a test set.

set.seed(3456)
trainIndex <- createDataPartition(homes$price, p = .8, list = FALSE, times = 1)
train <- homes[trainIndex,]
test <- homes[-trainIndex,]

Then I transform the data..

normalize <- function(x){(x-mean(x, na.rm = TRUE)/(max(x, na.rm = TRUE)-min(x, na.rm = TRUE)))}

train$x1 <- train$sqft_above +  train$sqft_basement
train$x1 <- normalize(train$x1)
train$x2 <- (train$sqft_living + train$sqft_lot + train$sqft_lot15 + train$sqft_living15)/4
train$x2 <- normalize(train$x2)
train$x3 <- train$grade
train$x3 <- normalize(train$x3)
x<-train$yr_built - (min(train$yr_built)-2)
y <- train$zipcode - 98000
train$x4 <- sqrt(y)*log(x)     
train$x4 <- normalize(train$x4)
train$x5 <- (train$bathrooms + 1)/(train$bedrooms + 1 + train$floors)
train$x5 <- normalize(train$x5)
train$nm_price <- normalize(train$price)
train$fact_date <- as.factor(train$date)

test$x1 <- test$sqft_above +  test$sqft_basement
test$x1 <- normalize(test$x1)
test$x2 <- (test$sqft_living + test$sqft_lot + test$sqft_lot15 + test$sqft_living15)/4
test$x2 <- normalize(test$x2)
test$x3 <- test$grade
test$x3 <- normalize(test$x3)
x<-test$yr_built - (min(test$yr_built)-2)
y <- test$zipcode - 98000
test$x4 <- sqrt(y)*log(x)     
test$x4 <- normalize(test$x4)
test$x5 <- (test$bathrooms + 1)/(test$bedrooms + 1 + test$floors)
test$x5 <- normalize(test$x5)
test$nm_price <- normalize(test$price)
test$fact_date <- as.factor(test$date)

Then we proceed to carry out three separate models.

Model 1 is a spline regression model. Splines are piecewise polynomial functions. Splines could be considered a form of polynomial regression. This model includes the variables x1 through x5 and fact_date. There is a total of 23 coefficients estimated. The R-squared is 0.6181.

model1 <- lm(price ~ bs(x1) + bs(x2) + bs(x3) + bs(x4) + bs(x5) + fact_date, data = train)
summary(model1)

## 
## Call:
## lm(formula = price ~ bs(x1) + bs(x2) + bs(x3) + bs(x4) + bs(x5) + 
##     fact_date, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2932532  -124429   -25925    90593  3372882 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           650169      69261   9.387  < 2e-16 ***
## bs(x1)1              -173113      46007  -3.763 0.000169 ***
## bs(x1)2              2734009     103474  26.422  < 2e-16 ***
## bs(x1)3              4046201     186122  21.740  < 2e-16 ***
## bs(x2)1              -594433      56372 -10.545  < 2e-16 ***
## bs(x2)2               673745     193910   3.475 0.000513 ***
## bs(x2)3              -506409     191543  -2.644 0.008204 ** 
## bs(x3)1               240774      92444   2.605 0.009208 ** 
## bs(x3)2               127982      37742   3.391 0.000698 ***
## bs(x3)3              1439051      75122  19.156  < 2e-16 ***
## bs(x4)1               -95498      28593  -3.340 0.000840 ***
## bs(x4)2                76640      17936   4.273 1.94e-05 ***
## bs(x4)3              -128205      22398  -5.724 1.06e-08 ***
## bs(x5)1             -1022179     114514  -8.926  < 2e-16 ***
## bs(x5)2              -123295      40324  -3.058 0.002234 ** 
## bs(x5)3              -371990     108142  -3.440 0.000583 ***
## fact_date2014-06-01     5537       8225   0.673 0.500879    
## fact_date2014-07-01    -5125       8197  -0.625 0.531834    
## fact_date2014-08-01    -9444       8442  -1.119 0.263297    
## fact_date2014-09-01    -4744       8652  -0.548 0.583447    
## fact_date2014-10-01    -3458       8491  -0.407 0.683794    
## fact_date2014-11-01    -3736       9177  -0.407 0.683913    
## fact_date2014-12-01   -12762       9066  -1.408 0.159258    
## fact_date2015-01-01   -15187      10284  -1.477 0.139784    
## fact_date2015-02-01   -11328       9541  -1.187 0.235126    
## fact_date2015-03-01    26814       8570   3.129 0.001758 ** 
## fact_date2015-04-01    33986       8203   4.143 3.44e-05 ***
## fact_date2015-05-01    37201      11969   3.108 0.001887 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 230300 on 17264 degrees of freedom
## Multiple R-squared:  0.6078, Adjusted R-squared:  0.6072 
## F-statistic: 991.1 on 27 and 17264 DF,  p-value: < 2.2e-16

Model 2 is a multiple linear regression model that includes the same variables as in Model 1 and all their interactions. A total of 287 coefficients are estimated. The R-squared is 0.6526. A higher value than Model 1, but also depends on many more coefficients. Personally, I prefer to depend on a relativly low number of predictor variables.

model2 <- lm(price ~ bs(x1) + bs(x2) + bs(x3) + bs(x4) + bs(x5) + fact_date, data = train)
summary(model2)

## 
## Call:
## lm(formula = price ~ bs(x1) + bs(x2) + bs(x3) + bs(x4) + bs(x5) + 
##     fact_date, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2932532  -124429   -25925    90593  3372882 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           650169      69261   9.387  < 2e-16 ***
## bs(x1)1              -173113      46007  -3.763 0.000169 ***
## bs(x1)2              2734009     103474  26.422  < 2e-16 ***
## bs(x1)3              4046201     186122  21.740  < 2e-16 ***
## bs(x2)1              -594433      56372 -10.545  < 2e-16 ***
## bs(x2)2               673745     193910   3.475 0.000513 ***
## bs(x2)3              -506409     191543  -2.644 0.008204 ** 
## bs(x3)1               240774      92444   2.605 0.009208 ** 
## bs(x3)2               127982      37742   3.391 0.000698 ***
## bs(x3)3              1439051      75122  19.156  < 2e-16 ***
## bs(x4)1               -95498      28593  -3.340 0.000840 ***
## bs(x4)2                76640      17936   4.273 1.94e-05 ***
## bs(x4)3              -128205      22398  -5.724 1.06e-08 ***
## bs(x5)1             -1022179     114514  -8.926  < 2e-16 ***
## bs(x5)2              -123295      40324  -3.058 0.002234 ** 
## bs(x5)3              -371990     108142  -3.440 0.000583 ***
## fact_date2014-06-01     5537       8225   0.673 0.500879    
## fact_date2014-07-01    -5125       8197  -0.625 0.531834    
## fact_date2014-08-01    -9444       8442  -1.119 0.263297    
## fact_date2014-09-01    -4744       8652  -0.548 0.583447    
## fact_date2014-10-01    -3458       8491  -0.407 0.683794    
## fact_date2014-11-01    -3736       9177  -0.407 0.683913    
## fact_date2014-12-01   -12762       9066  -1.408 0.159258    
## fact_date2015-01-01   -15187      10284  -1.477 0.139784    
## fact_date2015-02-01   -11328       9541  -1.187 0.235126    
## fact_date2015-03-01    26814       8570   3.129 0.001758 ** 
## fact_date2015-04-01    33986       8203   4.143 3.44e-05 ***
## fact_date2015-05-01    37201      11969   3.108 0.001887 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 230300 on 17264 degrees of freedom
## Multiple R-squared:  0.6078, Adjusted R-squared:  0.6072 
## F-statistic: 991.1 on 27 and 17264 DF,  p-value: < 2.2e-16

Model 3 is also a multiple linear regression model that includes variables x1 through x5 and fact_date. No interactions includes. R-squared = 0.552.

model3 <- lm(price ~ bs(x1) + bs(x2) + bs(x3) + bs(x4) + bs(x5) + fact_date, data = train)
summary(model3)

## 
## Call:
## lm(formula = price ~ bs(x1) + bs(x2) + bs(x3) + bs(x4) + bs(x5) + 
##     fact_date, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2932532  -124429   -25925    90593  3372882 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           650169      69261   9.387  < 2e-16 ***
## bs(x1)1              -173113      46007  -3.763 0.000169 ***
## bs(x1)2              2734009     103474  26.422  < 2e-16 ***
## bs(x1)3              4046201     186122  21.740  < 2e-16 ***
## bs(x2)1              -594433      56372 -10.545  < 2e-16 ***
## bs(x2)2               673745     193910   3.475 0.000513 ***
## bs(x2)3              -506409     191543  -2.644 0.008204 ** 
## bs(x3)1               240774      92444   2.605 0.009208 ** 
## bs(x3)2               127982      37742   3.391 0.000698 ***
## bs(x3)3              1439051      75122  19.156  < 2e-16 ***
## bs(x4)1               -95498      28593  -3.340 0.000840 ***
## bs(x4)2                76640      17936   4.273 1.94e-05 ***
## bs(x4)3              -128205      22398  -5.724 1.06e-08 ***
## bs(x5)1             -1022179     114514  -8.926  < 2e-16 ***
## bs(x5)2              -123295      40324  -3.058 0.002234 ** 
## bs(x5)3              -371990     108142  -3.440 0.000583 ***
## fact_date2014-06-01     5537       8225   0.673 0.500879    
## fact_date2014-07-01    -5125       8197  -0.625 0.531834    
## fact_date2014-08-01    -9444       8442  -1.119 0.263297    
## fact_date2014-09-01    -4744       8652  -0.548 0.583447    
## fact_date2014-10-01    -3458       8491  -0.407 0.683794    
## fact_date2014-11-01    -3736       9177  -0.407 0.683913    
## fact_date2014-12-01   -12762       9066  -1.408 0.159258    
## fact_date2015-01-01   -15187      10284  -1.477 0.139784    
## fact_date2015-02-01   -11328       9541  -1.187 0.235126    
## fact_date2015-03-01    26814       8570   3.129 0.001758 ** 
## fact_date2015-04-01    33986       8203   4.143 3.44e-05 ***
## fact_date2015-05-01    37201      11969   3.108 0.001887 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 230300 on 17264 degrees of freedom
## Multiple R-squared:  0.6078, Adjusted R-squared:  0.6072 
## F-statistic: 991.1 on 27 and 17264 DF,  p-value: < 2.2e-16

Finally, I plot the predicted values of model 2 versus the actual prices of the homes sold for that period of time. Included in the scatter plot is a line “y = x” which indicates if the prediction is above or below the actual price. Dots below the line represent homes that sold for less of what model 2 predicted, dots above the line represent homes that sold for more than waht the model predicted.

pred1 <- predict(object = model1, newdata = test)

## Warning in bs(x1, degree = 3L, knots = numeric(0), Boundary.knots =
## c(369.842028472135, : some 'x' values beyond boundary knots may cause ill-
## conditioned bases

## Warning in bs(x3, degree = 3L, knots = numeric(0), Boundary.knots =
## c(2.23448415452232, : some 'x' values beyond boundary knots may cause ill-
## conditioned bases

## Warning in bs(x5, degree = 3L, knots = numeric(0), Boundary.knots =
## c(-0.412752596462541, : some 'x' values beyond boundary knots may cause
## ill-conditioned bases

res1<-cbind(test$price,pred1)
colnames(res1) <- c("actual", "predicted2")
res1 <- as.data.frame(res1)

pred3 <- predict(object = model3, newdata = test)

## Warning in bs(x1, degree = 3L, knots = numeric(0), Boundary.knots =
## c(369.842028472135, : some 'x' values beyond boundary knots may cause ill-
## conditioned bases

## Warning in bs(x3, degree = 3L, knots = numeric(0), Boundary.knots =
## c(2.23448415452232, : some 'x' values beyond boundary knots may cause ill-
## conditioned bases

## Warning in bs(x5, degree = 3L, knots = numeric(0), Boundary.knots =
## c(-0.412752596462541, : some 'x' values beyond boundary knots may cause
## ill-conditioned bases

res3<-cbind(test$price,pred3)
colnames(res3) <- c("actual", "predicted2")
res3 <- as.data.frame(res3)

pred2 <- predict(object = model2, newdata = test)

## Warning in bs(x1, degree = 3L, knots = numeric(0), Boundary.knots =
## c(369.842028472135, : some 'x' values beyond boundary knots may cause ill-
## conditioned bases

## Warning in bs(x3, degree = 3L, knots = numeric(0), Boundary.knots =
## c(2.23448415452232, : some 'x' values beyond boundary knots may cause ill-
## conditioned bases

## Warning in bs(x5, degree = 3L, knots = numeric(0), Boundary.knots =
## c(-0.412752596462541, : some 'x' values beyond boundary knots may cause
## ill-conditioned bases

res2<-cbind(test$price,pred2)
colnames(res2) <- c("actual", "predicted2")
res2 <- as.data.frame(res2)

plot(actual~predicted2, data=res2, xlim=c(0,1000000), ylim=c(0,1000000),
     main = "Actual prices vs predicted prices by Model 2")
abline(a = 0, b =1)

I now include a peek at a data frame that includes the actual prices and the predictions of the three models. Although the models did not serve as great descriptors of the variability of price in the housing market, it sparked my interest in other models and algorithms for continuous dependent variables.

res <- merge(res1,res2, by=intersect(colnames(res1),colnames(res2)))
res <- merge(res,res3, by=intersect(colnames(res3),colnames(res)))
res %>% head(10)

##     actual predicted2
## 1  1010000   743794.7
## 2  1010000   863398.2
## 3  1010000   911998.8
## 4  1020000  1175328.8
## 5  1020000   754705.9
## 6  1020000   794971.6
## 7  1030000  1512836.2
## 8  1030000   707579.0
## 9  1030000   767277.2
## 10 1030000   786719.9

res %>% tail(10)

##      actual predicted2
## 4312  9e+05   506644.4
## 4313  9e+05   567693.9
## 4314  9e+05   601572.7
## 4315  9e+05   616798.5
## 4316  9e+05   624143.9
## 4317  9e+05   687180.8
## 4318  9e+05   703320.3
## 4319  9e+05   747234.7
## 4320  9e+05   817009.4
## 4321  9e+05   950250.7

King County Home Sales: Analysis and the limitations of a multiple regression model

JuanPablo Murillo

February 23, 2016