The data for these sales comes from the official public records of home sales in the King County area, Washington State. The data sets contains 21613 rows. Each represents a home sold from May 2014 through May 2015. Below is a breakdown of the variables involved:
http://your.kingcounty.gov/assessor/eRealProperty/ResGlossaryOfTerms.html.
I previously posted a heat map of price per sq ft in King County, using the same dataset.
id - Unique ID for each home sold
date - Date of the home sale
price - Price of each home sold
bedrooms - Number of bedrooms
bathrooms - Number of bathrooms, where .5 accounts for a room with a toilet but no shower
sqft_living - Square footage of the apartments interior living space
sqft_lot - Square footage of the land space
floors - Number of floors
waterfront - A dummy variable for whether the apartment was overlooking the waterfront or not
view - An index from 0 to 4 of how good the view of the property was
condition - An index from 1 to 5 on the condition of the apartment,
grade - An index from 1 to 13, where 1-3 falls short of building construction and design, 7 has an average level of construction and design, and 11-13 have a high quality level of construction and design.
sqft_above - The square footage of the interior housing space that is above ground level
sqft_basement - The square footage of the interior housing space that is below ground level
yr_built - The year the house was initially built
yr_renovated - The year of the house’s last renovation
zipcode - What zipcode area the house is in
lat - Lattitude
long - Longitude
sqft_living15 - The square footage of interior housing living space for the nearest 15 neighbors
sqft_lot15 - The square footage of the land lots of the nearest 15 neighbors
First, we carry out an exploratory analysis of the data. Early in the process, I worked together with Greg Merrell. We checked for the normality of the explanatory variables. You can see his work at http://rpubs.com/grmerrell/154461. The remaining work on this page is entirely mine.
Next we compute a few new values from the variables in the data set. For instance, we’re interested in the price per sq foot of the living space, of the lot, and of the surrounding living spaces and lots. We plot those four values against time, and get a very similar pattern (graphs not provided). So we take the mean of those four variables, and plot them against time. We call that variable “Average price per sq foot.” Similarly, we compute the mean selling price per month, and plot it against time.
First the code…
# collapses sales in one month to the first day of the month
homes$date <- paste(substr(homes$date, 1,6), "01", sep = "")
homes$date <- ymd(homes$date)
# stores the median price of homes for each month
median_price_by_month <- homes %>%
group_by(date) %>%
summarise(median_price_month = median(price))
# calculates the price of sq ft for each sqft related variable
homes$price_sqft_living <- homes$price/homes$sqft_living
homes$price_sqft_lot <- homes$price/homes$sqft_lot
homes$price_sqft_lot15 <- homes$price/homes$sqft_lot15
homes$price_sqft_living15 <- homes$price/homes$sqft_living15
# computes the average price of sq ft
homes <- homes %>%
mutate(price_sqft_avg =
(price_sqft_living + price_sqft_lot + price_sqft_lot15 + price_sqft_living15)/4)
tmp <- homes %>%
group_by(date) %>%
summarise(mean_price = median(price_sqft_avg)) %>% as.data.frame()
Then the graphs… There is a pattern to both graphs. Prices dip some time around January and February 2015. I proceeded to split the data into two data sets. Homes sold up to January 2015, and homes sold from February 2015 on. For the sake of siplicity I will only include the data the earlier set (5/14 - 1/15).
I did a variety of transformations to the independent variables I thought were important. The goal was to achieve somewhat normally distributed variables that could successfully describe the variability in price, the independent variable.
First I split the set into a train set and a test set.
set.seed(3456)
trainIndex <- createDataPartition(homes$price, p = .8, list = FALSE, times = 1)
train <- homes[trainIndex,]
test <- homes[-trainIndex,]
Then I transform the data..
normalize <- function(x){(x-mean(x, na.rm = TRUE)/(max(x, na.rm = TRUE)-min(x, na.rm = TRUE)))}
train$x1 <- train$sqft_above + train$sqft_basement
train$x1 <- normalize(train$x1)
train$x2 <- (train$sqft_living + train$sqft_lot + train$sqft_lot15 + train$sqft_living15)/4
train$x2 <- normalize(train$x2)
train$x3 <- train$grade
train$x3 <- normalize(train$x3)
x<-train$yr_built - (min(train$yr_built)-2)
y <- train$zipcode - 98000
train$x4 <- sqrt(y)*log(x)
train$x4 <- normalize(train$x4)
train$x5 <- (train$bathrooms + 1)/(train$bedrooms + 1 + train$floors)
train$x5 <- normalize(train$x5)
train$nm_price <- normalize(train$price)
train$fact_date <- as.factor(train$date)
test$x1 <- test$sqft_above + test$sqft_basement
test$x1 <- normalize(test$x1)
test$x2 <- (test$sqft_living + test$sqft_lot + test$sqft_lot15 + test$sqft_living15)/4
test$x2 <- normalize(test$x2)
test$x3 <- test$grade
test$x3 <- normalize(test$x3)
x<-test$yr_built - (min(test$yr_built)-2)
y <- test$zipcode - 98000
test$x4 <- sqrt(y)*log(x)
test$x4 <- normalize(test$x4)
test$x5 <- (test$bathrooms + 1)/(test$bedrooms + 1 + test$floors)
test$x5 <- normalize(test$x5)
test$nm_price <- normalize(test$price)
test$fact_date <- as.factor(test$date)
Then we proceed to carry out three separate models.
Model 1 is a spline regression model. Splines are piecewise polynomial functions. Splines could be considered a form of polynomial regression. This model includes the variables x1 through x5 and fact_date. There is a total of 23 coefficients estimated. The R-squared is 0.6181.
model1 <- lm(price ~ bs(x1) + bs(x2) + bs(x3) + bs(x4) + bs(x5) + fact_date, data = train)
summary(model1)
##
## Call:
## lm(formula = price ~ bs(x1) + bs(x2) + bs(x3) + bs(x4) + bs(x5) +
## fact_date, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2932532 -124429 -25925 90593 3372882
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 650169 69261 9.387 < 2e-16 ***
## bs(x1)1 -173113 46007 -3.763 0.000169 ***
## bs(x1)2 2734009 103474 26.422 < 2e-16 ***
## bs(x1)3 4046201 186122 21.740 < 2e-16 ***
## bs(x2)1 -594433 56372 -10.545 < 2e-16 ***
## bs(x2)2 673745 193910 3.475 0.000513 ***
## bs(x2)3 -506409 191543 -2.644 0.008204 **
## bs(x3)1 240774 92444 2.605 0.009208 **
## bs(x3)2 127982 37742 3.391 0.000698 ***
## bs(x3)3 1439051 75122 19.156 < 2e-16 ***
## bs(x4)1 -95498 28593 -3.340 0.000840 ***
## bs(x4)2 76640 17936 4.273 1.94e-05 ***
## bs(x4)3 -128205 22398 -5.724 1.06e-08 ***
## bs(x5)1 -1022179 114514 -8.926 < 2e-16 ***
## bs(x5)2 -123295 40324 -3.058 0.002234 **
## bs(x5)3 -371990 108142 -3.440 0.000583 ***
## fact_date2014-06-01 5537 8225 0.673 0.500879
## fact_date2014-07-01 -5125 8197 -0.625 0.531834
## fact_date2014-08-01 -9444 8442 -1.119 0.263297
## fact_date2014-09-01 -4744 8652 -0.548 0.583447
## fact_date2014-10-01 -3458 8491 -0.407 0.683794
## fact_date2014-11-01 -3736 9177 -0.407 0.683913
## fact_date2014-12-01 -12762 9066 -1.408 0.159258
## fact_date2015-01-01 -15187 10284 -1.477 0.139784
## fact_date2015-02-01 -11328 9541 -1.187 0.235126
## fact_date2015-03-01 26814 8570 3.129 0.001758 **
## fact_date2015-04-01 33986 8203 4.143 3.44e-05 ***
## fact_date2015-05-01 37201 11969 3.108 0.001887 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 230300 on 17264 degrees of freedom
## Multiple R-squared: 0.6078, Adjusted R-squared: 0.6072
## F-statistic: 991.1 on 27 and 17264 DF, p-value: < 2.2e-16
Model 2 is a multiple linear regression model that includes the same variables as in Model 1 and all their interactions. A total of 287 coefficients are estimated. The R-squared is 0.6526. A higher value than Model 1, but also depends on many more coefficients. Personally, I prefer to depend on a relativly low number of predictor variables.
model2 <- lm(price ~ bs(x1) + bs(x2) + bs(x3) + bs(x4) + bs(x5) + fact_date, data = train)
summary(model2)
##
## Call:
## lm(formula = price ~ bs(x1) + bs(x2) + bs(x3) + bs(x4) + bs(x5) +
## fact_date, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2932532 -124429 -25925 90593 3372882
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 650169 69261 9.387 < 2e-16 ***
## bs(x1)1 -173113 46007 -3.763 0.000169 ***
## bs(x1)2 2734009 103474 26.422 < 2e-16 ***
## bs(x1)3 4046201 186122 21.740 < 2e-16 ***
## bs(x2)1 -594433 56372 -10.545 < 2e-16 ***
## bs(x2)2 673745 193910 3.475 0.000513 ***
## bs(x2)3 -506409 191543 -2.644 0.008204 **
## bs(x3)1 240774 92444 2.605 0.009208 **
## bs(x3)2 127982 37742 3.391 0.000698 ***
## bs(x3)3 1439051 75122 19.156 < 2e-16 ***
## bs(x4)1 -95498 28593 -3.340 0.000840 ***
## bs(x4)2 76640 17936 4.273 1.94e-05 ***
## bs(x4)3 -128205 22398 -5.724 1.06e-08 ***
## bs(x5)1 -1022179 114514 -8.926 < 2e-16 ***
## bs(x5)2 -123295 40324 -3.058 0.002234 **
## bs(x5)3 -371990 108142 -3.440 0.000583 ***
## fact_date2014-06-01 5537 8225 0.673 0.500879
## fact_date2014-07-01 -5125 8197 -0.625 0.531834
## fact_date2014-08-01 -9444 8442 -1.119 0.263297
## fact_date2014-09-01 -4744 8652 -0.548 0.583447
## fact_date2014-10-01 -3458 8491 -0.407 0.683794
## fact_date2014-11-01 -3736 9177 -0.407 0.683913
## fact_date2014-12-01 -12762 9066 -1.408 0.159258
## fact_date2015-01-01 -15187 10284 -1.477 0.139784
## fact_date2015-02-01 -11328 9541 -1.187 0.235126
## fact_date2015-03-01 26814 8570 3.129 0.001758 **
## fact_date2015-04-01 33986 8203 4.143 3.44e-05 ***
## fact_date2015-05-01 37201 11969 3.108 0.001887 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 230300 on 17264 degrees of freedom
## Multiple R-squared: 0.6078, Adjusted R-squared: 0.6072
## F-statistic: 991.1 on 27 and 17264 DF, p-value: < 2.2e-16
Model 3 is also a multiple linear regression model that includes variables x1 through x5 and fact_date. No interactions includes. R-squared = 0.552.
model3 <- lm(price ~ bs(x1) + bs(x2) + bs(x3) + bs(x4) + bs(x5) + fact_date, data = train)
summary(model3)
##
## Call:
## lm(formula = price ~ bs(x1) + bs(x2) + bs(x3) + bs(x4) + bs(x5) +
## fact_date, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2932532 -124429 -25925 90593 3372882
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 650169 69261 9.387 < 2e-16 ***
## bs(x1)1 -173113 46007 -3.763 0.000169 ***
## bs(x1)2 2734009 103474 26.422 < 2e-16 ***
## bs(x1)3 4046201 186122 21.740 < 2e-16 ***
## bs(x2)1 -594433 56372 -10.545 < 2e-16 ***
## bs(x2)2 673745 193910 3.475 0.000513 ***
## bs(x2)3 -506409 191543 -2.644 0.008204 **
## bs(x3)1 240774 92444 2.605 0.009208 **
## bs(x3)2 127982 37742 3.391 0.000698 ***
## bs(x3)3 1439051 75122 19.156 < 2e-16 ***
## bs(x4)1 -95498 28593 -3.340 0.000840 ***
## bs(x4)2 76640 17936 4.273 1.94e-05 ***
## bs(x4)3 -128205 22398 -5.724 1.06e-08 ***
## bs(x5)1 -1022179 114514 -8.926 < 2e-16 ***
## bs(x5)2 -123295 40324 -3.058 0.002234 **
## bs(x5)3 -371990 108142 -3.440 0.000583 ***
## fact_date2014-06-01 5537 8225 0.673 0.500879
## fact_date2014-07-01 -5125 8197 -0.625 0.531834
## fact_date2014-08-01 -9444 8442 -1.119 0.263297
## fact_date2014-09-01 -4744 8652 -0.548 0.583447
## fact_date2014-10-01 -3458 8491 -0.407 0.683794
## fact_date2014-11-01 -3736 9177 -0.407 0.683913
## fact_date2014-12-01 -12762 9066 -1.408 0.159258
## fact_date2015-01-01 -15187 10284 -1.477 0.139784
## fact_date2015-02-01 -11328 9541 -1.187 0.235126
## fact_date2015-03-01 26814 8570 3.129 0.001758 **
## fact_date2015-04-01 33986 8203 4.143 3.44e-05 ***
## fact_date2015-05-01 37201 11969 3.108 0.001887 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 230300 on 17264 degrees of freedom
## Multiple R-squared: 0.6078, Adjusted R-squared: 0.6072
## F-statistic: 991.1 on 27 and 17264 DF, p-value: < 2.2e-16
Finally, I plot the predicted values of model 2 versus the actual prices of the homes sold for that period of time. Included in the scatter plot is a line “y = x” which indicates if the prediction is above or below the actual price. Dots below the line represent homes that sold for less of what model 2 predicted, dots above the line represent homes that sold for more than waht the model predicted.
pred1 <- predict(object = model1, newdata = test)
## Warning in bs(x1, degree = 3L, knots = numeric(0), Boundary.knots =
## c(369.842028472135, : some 'x' values beyond boundary knots may cause ill-
## conditioned bases
## Warning in bs(x3, degree = 3L, knots = numeric(0), Boundary.knots =
## c(2.23448415452232, : some 'x' values beyond boundary knots may cause ill-
## conditioned bases
## Warning in bs(x5, degree = 3L, knots = numeric(0), Boundary.knots =
## c(-0.412752596462541, : some 'x' values beyond boundary knots may cause
## ill-conditioned bases
res1<-cbind(test$price,pred1)
colnames(res1) <- c("actual", "predicted2")
res1 <- as.data.frame(res1)
pred3 <- predict(object = model3, newdata = test)
## Warning in bs(x1, degree = 3L, knots = numeric(0), Boundary.knots =
## c(369.842028472135, : some 'x' values beyond boundary knots may cause ill-
## conditioned bases
## Warning in bs(x3, degree = 3L, knots = numeric(0), Boundary.knots =
## c(2.23448415452232, : some 'x' values beyond boundary knots may cause ill-
## conditioned bases
## Warning in bs(x5, degree = 3L, knots = numeric(0), Boundary.knots =
## c(-0.412752596462541, : some 'x' values beyond boundary knots may cause
## ill-conditioned bases
res3<-cbind(test$price,pred3)
colnames(res3) <- c("actual", "predicted2")
res3 <- as.data.frame(res3)
pred2 <- predict(object = model2, newdata = test)
## Warning in bs(x1, degree = 3L, knots = numeric(0), Boundary.knots =
## c(369.842028472135, : some 'x' values beyond boundary knots may cause ill-
## conditioned bases
## Warning in bs(x3, degree = 3L, knots = numeric(0), Boundary.knots =
## c(2.23448415452232, : some 'x' values beyond boundary knots may cause ill-
## conditioned bases
## Warning in bs(x5, degree = 3L, knots = numeric(0), Boundary.knots =
## c(-0.412752596462541, : some 'x' values beyond boundary knots may cause
## ill-conditioned bases
res2<-cbind(test$price,pred2)
colnames(res2) <- c("actual", "predicted2")
res2 <- as.data.frame(res2)
plot(actual~predicted2, data=res2, xlim=c(0,1000000), ylim=c(0,1000000),
main = "Actual prices vs predicted prices by Model 2")
abline(a = 0, b =1)
I now include a peek at a data frame that includes the actual prices and the predictions of the three models. Although the models did not serve as great descriptors of the variability of price in the housing market, it sparked my interest in other models and algorithms for continuous dependent variables.
res <- merge(res1,res2, by=intersect(colnames(res1),colnames(res2)))
res <- merge(res,res3, by=intersect(colnames(res3),colnames(res)))
res %>% head(10)
## actual predicted2
## 1 1010000 743794.7
## 2 1010000 863398.2
## 3 1010000 911998.8
## 4 1020000 1175328.8
## 5 1020000 754705.9
## 6 1020000 794971.6
## 7 1030000 1512836.2
## 8 1030000 707579.0
## 9 1030000 767277.2
## 10 1030000 786719.9
res %>% tail(10)
## actual predicted2
## 4312 9e+05 506644.4
## 4313 9e+05 567693.9
## 4314 9e+05 601572.7
## 4315 9e+05 616798.5
## 4316 9e+05 624143.9
## 4317 9e+05 687180.8
## 4318 9e+05 703320.3
## 4319 9e+05 747234.7
## 4320 9e+05 817009.4
## 4321 9e+05 950250.7