In May of 2017, the Zillow Group (Zillow) announced that it would hold a $1 million contest to improve its “Zestimate” (predicted home value) algorithm (https://www.zillow.com/research/announcing-zillow-prize-15380/). In the spirit of this competition, though with a much smaller data set, we set out to develop an algorithm to predict home values in Boston, MA.
For Zillow, the importance of this exercise cannot be overstated: it is integral to its business model. For a company like Zillow, the best model to predict home values could mean the difference between prosperity and irrelevance. Less obvious is how the ability to predict home values can expand opportunities for planners, developers, realtors and homeowners alike; home values play an important part in value capture schemes that finance infrastructure, in developer site selection, in commissions on home sales and in retirement income.
However, generating useful predictions is not easy. Home values are a function of many elements: the attributes of the home (number of rooms, how modern it is, etc.), accessible amenities (schools, retail shops, etc.), and even undesirable characteristics (proximity to crime, proportion of vacant homes, etc.). Another important consideration is the tendency of objects that are close to one another in space to be more similar than objects that are farther apart (spatial autocorrelation).
The final model explains approximately 83% of the variance in Boston home sale prices. It is based on 16 significant predictors that attempt to account for all of the categories mentioned above.
In search of data to power the model, open data resources were key. First, MassGIS (https://www.mass.gov/service-details/massgis-data-layers) provided many statewide shapefiles that could be pared down for analysis at the city level (e.g., recreational spaces, post-secondary education institutions). Analyze Boston (https://data.boston.gov/), unsurprisingly, provides many datasets limited to Boston city boundaries (e.g., incidences of crime, tree locations). Social Explorer (https://www.socialexplorer.com/) provided census data from the 2015 American Community Survey five-year estimates (e.g., percent married, poverty rate), and the Massachusetts Department of Education (http://profiles.doe.mass.edu/) provided information on the availability of pre-kindergarten programs, SAT scores and class sizes in local high schools. Finally, Google’s My Maps (https://www.google.com/maps/about/mymaps/) allowed us to incorporate Google search location listings of amenities (e.g., Whole Foods locations). Home sale prices and characteristics come from the Suffolk County, Massachusetts Assessor’s Office.
The goal, as described above, is to capture the complexity of several factors that may simultaneously contribute to a home’s value. In addition, the model incorporates a variable that seeks to capture clustering not reflected in the other variables (see Figure 8). Looking at relevant home attributes and home sale prices, we identified nine clusters. To make the regression results easier to interpret, these groups were ordered (ascending) by average home sale price.
Though 46 different variables were initially explored, only 16 warranted inclusion based on their predictive strength.
Since we are looking to predict home prices and are restricted (for the purposes of this assignment) to an OLS (ordinary least squares) regression, we must investigate the distribution of the variable we want to predict (SalePrice). “Intrinsic value” is not something we consider. Instead, we look at recent home sales to approximate a home’s value (as evidenced by a buyer’s recent willingness to pay). Figure 1 shows that the home sale prices in our data are not normally distributed (OLS inference assumes normally distributed errors, which a heavily skewed outcome makes unlikely). In Figure 2, we log-transform the home values and get a more normal distribution. Even so, the distribution is still not perfectly normal: many very high home values skew it to the right. Depending on the use case it may make sense to remove these, but we choose not to.
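A minimal sketch of this check, assuming the sales records sit in a data frame named `homes` with a `SalePrice` column (both names are ours for illustration):

```r
library(ggplot2)

# Raw sale prices: heavily right-skewed (cf. Figure 1)
ggplot(homes, aes(x = SalePrice)) + geom_histogram(bins = 50)

# Log-transformed prices: much closer to normal (cf. Figure 2)
homes$LogSalePrice <- log(homes$SalePrice)
ggplot(homes, aes(x = LogSalePrice)) + geom_histogram(bins = 50)
```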
Table 1 and Table 2, respectively, provide data definitions and descriptive statistics of the dataset.
##
## Summary Statistics
## ==================================================================
## Statistic N Mean St. Dev. Min Max
## ------------------------------------------------------------------
## d_Trees12nn 1,339 0.0004 0.0002 0.0001 0.001
## d_PoliceSt3nn 1,339 0.026 0.010 0.005 0.051
## d_Crime20nn 1,339 0.001 0.001 0.0002 0.006
## d_PosSecEd5nn 1,339 0.033 0.011 0.003 0.066
## d_CommRailStopsnn 1,339 0.026 0.011 0.012 0.081
## d_mbta5nn 1,339 0.027 0.019 0.003 0.075
## d_hydro10nn 1,339 0.017 0.007 0.004 0.033
## d_pubPreK5nn 1,339 0.013 0.005 0.004 0.032
## d_openSpace5nn 1,339 0.004 0.002 0.0003 0.010
## d_wholefoods3nn 1,339 0.047 0.021 0.007 0.101
## SalePrice 1,339 643,058.900 563,698.800 200,000 11,600,000
## LAND_SF 1,339 4,563.483 3,380.199 498 63,941
## GROSS_AREA 1,339 3,588.543 1,458.633 0 11,919
## LIVING_ARE 1,339 2,260.084 1,014.265 0 9,908
## NUM_FLOORS 1,339 2.164 0.651 0 4
## Is_R_EXT_F 1,339 0.093 0.290 0 1
## Is_R_EXT_1 1,339 0.007 0.082 0 1
## R_FULL_BTH 1,339 1.974 0.912 0 8
## R_HALF_BTH 1,339 0.361 0.547 0 3
## R_KITCH 1,339 1.678 0.836 0 4
## R_AC_Y 1,339 0.145 0.352 0 1
## R_FPLACE 1,339 0.403 0.755 0 5
## pov_rate 1,339 0.138 0.092 0.003 0.622
## pct_white 1,339 0.529 0.323 0.000 1.000
## pct_nev_ma 1,339 0.458 0.133 0.196 0.946
## pct_bach_m 1,339 0.393 0.227 0.006 0.973
## vacancy_ra 1,339 0.078 0.062 0.000 0.300
## pct_1car 1,339 0.449 0.131 0.000 0.874
## pct_2car 1,339 0.240 0.155 0.000 0.800
## Is_South_E 1,339 0.013 0.115 0 1
## Is_West_Ro 1,339 0.159 0.366 0 1
## pct_3car 1,339 0.048 0.070 0.000 0.488
## ------------------------------------------------------------------
A pairwise correlation matrix is a great tool for getting a visual feel for the relationships between variables (a sketch of the computation follows the figure). In Figure 3, dark blue areas signify a strong positive relationship and red areas signify a strong negative relationship. White areas in the grid indicate an insignificant relationship (if any). For example, as the total number of kitchens in a home increases, the number of bedrooms is likely to increase as well (indicated by blue). As the share of the population that has never married goes up, however, the distance to police stations moves in the opposite direction (slightly red). The goal is for each variable to contribute somewhat different explanatory power to the model. Whether positively or negatively, then, we want to avoid variables that reflect similar movement (i.e., are collinear).
Pairwise Correlation Matrix of All Variables (some excluded in final model)
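A sketch of how such a matrix can be produced with the corrplot package, assuming a hypothetical `homes_numeric` data frame holding the numeric candidate variables:

```r
library(corrplot)

# Pairwise correlations among the numeric candidate variables
M <- cor(homes_numeric, use = "pairwise.complete.obs")
corrplot(M, method = "color", type = "lower", tl.cex = 0.6)
```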
Next, we consider a number of variables we explored and the logic for initial inclusion.
Figure 4 shows the distribution of home values (represented by a sample of 2015 sales data). What is immediately apparent is clustering of colors, for which the model will need to account. The prices in this map are grouped into quintiles: the bottom 20%, 21-40%, 41-60%, 61-80% and 81-100%. Clusters are clear within every grouping.
Sample of 1,339 Homes Sold in 2015
Figure 5 considers proximity to Whole Foods. Since we had already noted several extremely high home values in the data, we looked for a variable that might track those areas. As it turns out, Whole Foods locations in Boston generally cluster in an area that houses some of the most luxurious real estate in the city (a sketch of the distance calculation follows the figure).
Distance to 3 nearest Whole Foods locations
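One way to build this kind of feature is an average distance to the k nearest amenities. A sketch using the FNN package, where `homes_xy` and `wholefoods_xy` are assumed to be two-column coordinate matrices in a common projected coordinate system:

```r
library(FNN)

# Average distance from each home to its 3 nearest Whole Foods locations
knn <- get.knnx(data = wholefoods_xy, query = homes_xy, k = 3)
homes$d_wholefoods3nn <- rowMeans(knn$nn.dist)
```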
In Figure 6, we explored the importance of colleges and universities in Boston, MA. The area is known, in part, for its density of colleges and universities. If professors prefer to live near campus, for example, their above-median incomes could bid up the prices of nearby homes. On the other hand, if college students are lousy neighbors, we might see the opposite effect.
Distance to 5 nearest post-secondary educational institutions
In Figure 7, based on an initial look at the predictive power of the percent of the population that was never married, we thought it would be interesting to see whether proximity to publicly available pre-kindergarten options would be significant. It is no secret that new families are often eager homebuyers. Further, Boston is a commuter city, and it is reasonable to believe that proximity to this type of amenity has value. Since private options are also available (presumably at a higher cost), we thought this variable could have predictive power in either direction.
Distance to 5 nearest public pre-Kindergarten offerings
As mentioned in our discussion of spatial autocorrelation (the tendency of objects that are close to one another in space to be more similar than objects that are farther apart), Figure 8 identifies groups of this type through K-means clustering. This analysis was done in ArcGIS (an R sketch follows the figure). In this way, we hope to account for some of the observed clustering. Because K-means assigns arbitrary numeric labels to the clusters it finds, the clusters were reordered (ascending by average sale price) to make the regression results easier to interpret.
K-means of 12 nearest neighbors
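Our clustering was performed in ArcGIS, but the idea can be sketched in R with base `kmeans`, assuming a hypothetical `homes_attrs` data frame of the relevant home attributes:

```r
# Nine clusters from scaled home attributes and sale price
set.seed(42)
km <- kmeans(scale(homes_attrs), centers = 9)

# Relabel clusters 1 through 9 in ascending order of mean sale price
ord <- order(tapply(homes$SalePrice, km$cluster, mean))
homes$priceCluster <- match(km$cluster, ord)
```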
As discussed above and shown in Figure 1 and Figure 2, we began our analysis by exploring the variable we would predict using an OLS (ordinary least squares) regression. We log-transformed the sale price variable to get a distribution suitable for this type of regression. Non-numeric variables with substantial influence, along with all neighborhood variables, were transformed into zeros and ones (binary categories) to ease interpretation.
The approach was to perform a stepwise analysis in both directions, beginning with all of the home attribute data. As each new variable was collected, it was added to the regression to assess its predictive power relative to the home attributes that had already proved significant. This was done categorically: after home attributes were considered, each demographic element was considered, then environmental elements (e.g., open spaces, distance to bodies of water), then all amenities, and so on until all variables had been added. Once this was completed, the same process was performed in reverse, beginning with all variables and systematically removing those without significant predictive power.
The final model includes only significant variables.
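A minimal sketch of the stepwise step, assuming `train` is the 75% training sample described below and contains `LogSalePrice` plus only the candidate predictors:

```r
# Stepwise selection in both directions from the full specification
full <- lm(LogSalePrice ~ ., data = train)
step_model <- step(full, direction = "both", trace = FALSE)
summary(step_model)   # the surviving predictors
```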
The model was estimated on a random sample of 75% of the data (often referred to as training data). The remaining 25% (often referred to as test data) was used to evaluate the model’s predictive accuracy.
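A sketch of that split, again using the assumed `homes` data frame:

```r
# Reproducible 75/25 split into training and test sets
set.seed(123)
train_idx <- sample(nrow(homes), size = floor(0.75 * nrow(homes)))
train <- homes[train_idx, ]
test  <- homes[-train_idx, ]
```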
The results of the final model can be seen in Table 3; it accounts for approximately 83% of the variance observed in our data set.
To get a sense of the relative importance each variable contributes to the predictive power of the model, we “standardize” the coefficients, rescaling each by the standard deviation of its variable relative to the standard deviation of the outcome. This is especially useful when the variables are measured in different units (as is the case in our analysis). The standardized coefficients can then be compared directly (a sketch follows the figure below).
Comparison of relative influence of variables
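A sketch of the standardization, using the familiar formula (standardized beta = b · sd(x) / sd(y)) and assuming every predictor in `step_model` is a numeric column of `train`:

```r
# Standardized coefficient: beta_j = b_j * sd(x_j) / sd(y)
b  <- coef(step_model)[-1]              # drop the intercept
sx <- sapply(train[names(b)], sd)       # assumes numeric predictor columns
sy <- sd(train$LogSalePrice)
beta_std <- b * sx / sy

# Bar chart of absolute standardized influence, smallest to largest
barplot(sort(abs(beta_std)), horiz = TRUE, las = 1, cex.names = 0.6)
```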
Table 4 and Table 5 show several goodness-of-fit measures for the training and test data, respectively. What is clear is that the model performs better on the training data than on the test data. In particular, the root mean squared error (RMSE) is more than three times larger and the mean absolute percent error (MAPE) rises by roughly two-thirds when moving from the training data to the test data.
Table 6, however, shows better results: after cross-validation, the model performs appreciably better, and the standard deviation of R-squared across folds is 7.4%. Figure 11 provides a clue. The MAE is generally around $120,000, which is unsurprising given the presence of outliers: approximately 9% of the data set (119 of 1,339 homes) sold for over $1,000,000. The maximum home sale price in the data set is $11,600,000, while the minimum is $200,000.
## Adj_R-Squared RMSE MAE MAPE
## 1 0.8220246 44138.33 34542.71 0.08878709
## Adj_R-Squared RMSE MAE MAPE
## 1 0.8183397 156646.3 92240.89 0.147554
## RMSE R-SQUARED MAE
## 1 223414.1 0.8354654 119332.1
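A sketch of the cross-validation step with the caret package; the fold count (10), the formula object `fmla`, and refitting on the raw SalePrice scale (so that RMSE and MAE are in dollars) are our assumptions, not taken from the original analysis:

```r
library(caret)

# 10-fold cross-validation of the selected specification
set.seed(123)
ctrl   <- trainControl(method = "cv", number = 10)
cv_fit <- caret::train(fmla, data = homes, method = "lm", trControl = ctrl)
cv_fit$results   # RMSE, Rsquared, MAE (plus their SDs across folds)
```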
Figure 12 and Figure 13 show the errors in prediction versus the observed and predicted values, respectively. As mentioned above, the relationships in the two plots appear similar (which suggests the model may be generalizable), though Figure 13 shows that low values produce slightly higher errors.
Residuals Distribution Overlaying Boston, MA
A positive and statistically significant Moran’s I statistic implies spatial clustering of the residuals. This confirms the clustering we observed and discussed above. The model attempts to address this clustering with the K-means variable, neighborhood fixed effects and census data (among others).
##
## Moran I test under randomisation
##
## data: regTrain$residuals
## weights: nb2listw(spatialWeights, style = "W")
##
## Moran I statistic standard deviate = 4.3141, p-value = 0.000008013
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.0883680090 -0.0009950249 0.0004290804
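The printed output shows the test was run on `regTrain$residuals` with weights built by `nb2listw(spatialWeights, style = "W")`. A sketch of how those pieces fit together with the spdep package, where `trainCoords` (a coordinate matrix for the training-set homes) and the neighbor count k = 5 are assumptions:

```r
library(spdep)

# k-nearest-neighbor spatial weights for the training-set homes
spatialWeights <- knn2nb(knearneigh(trainCoords, k = 5))
moran.test(regTrain$residuals, nb2listw(spatialWeights, style = "W"))
```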
Looking at the training set residuals by neighborhood in Figure 15, we see that the residuals still appear somewhat clustered.
We chose one high-income, one middle-income and one low-income neighborhood. For each, the goal was to hold that neighborhood’s sale price data out as a test set and evaluate the model on predictions made from the remaining data. As a result, three separate tests were performed (a sketch of the routine follows below). Table 8 shows the results.
From training data to test data, changes in MAPE were similar for the low-income and middle-income neighborhoods. The high-income neighborhood, however, improved by over 50%. Assuming that the highest-priced outliers sit in high-income neighborhoods, this is understandable: removing those values from the training set caused the model to perform with more error on average, but better when predicting on held-out data that (presumably) contained more of these outliers.
MAPE by Neighborhood
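A sketch of this hold-one-neighborhood-out routine; the `Neighborhood` column and the three labels are placeholders, and `fmla` is the assumed model formula with `LogSalePrice` as the outcome:

```r
# Mean absolute percent error helper
mape <- function(actual, pred) mean(abs(pred - actual) / actual)

# Placeholder neighborhood labels, for illustration only
for (nbhd in c("HighIncome", "MiddleIncome", "LowIncome")) {
  tr  <- subset(homes, Neighborhood != nbhd)
  te  <- subset(homes, Neighborhood == nbhd)
  fit <- lm(fmla, data = tr)
  pred <- exp(predict(fit, newdata = te))   # back-transform to dollars
  cat(nbhd, "held-out MAPE:", mape(te$SalePrice, pred), "\n")
}
```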
The model is promising, but could be improved. As Eugene Brusilovskiy notes, a statistician should have a compelling reason (ideally resulting from specific domain knowledge) to remove outliers from analyses. In this case, however, given a use case of predicting home prices where the vast majority are under $600,000, a compelling case can be made to remove the outliers.
It would be interesting to repeat the same method with outliers removed; perhaps more variables would prove significant. The percent of the population who had never married was an interesting predictor, and digging into complementary metrics (e.g., percent divorced) could also be worthwhile.
Overall, the error in the model’s predictions was reasonably good. In fact, a 14% average error (this model’s MAPE) is around the error Zillow cited when it first started out with the in-house algorithm that became the “Zestimate” (https://www.zillow.com/research/announcing-zillow-prize-15380/). Unfortunately, the model appears to be strongly influenced by outliers, which may also dilute the predictive weight of the variables included to account for spatial autocorrelation.
As seen in Figure 16, we did not account for spatial autocorrelation as much as we would have liked. There is still substantial variation by neighborhood.
I would not yet recommend this model to Zillow; more work is required. While I’m confident that removing outliers would substantially improve the model, I think it could already be useful for other stakeholders (like those mentioned at the outset). Though I am not completely sure of a “Zestimate’s” all-in costs, it would be interesting to compare the projected cost of Zestimates (plus their ~5% error) with my free (for now) model and its approximately 14% error.
The model might also improve with additional data, but it must first be retested after removing outliers. At present, the goodness-of-fit measures are more variable than we would like, but they are also a promising start.
Thanks for reading!