Introduction
The state of Tennessee has seen substantial population growth in the past several years, with much of this growth driven by its capital, Nashville, located in Davidson County. According to data from the 2010 U.S. Census and the 2012-2016 American Community Survey 5-year estimates, Davidson County’s population grew by nearly 10 percent between 2010 and 2016, and this growth is expected to continue in the coming years. The increase in population has come with shifts in demographic and socioeconomic characteristics. These shifts no doubt affect the real estate market in Nashville; how they do so, however, remains to be seen.
The purpose of this project is to develop a machine learning model to predict home prices in Nashville, Tennessee, using a variety of metrics collected from the United States Census and American Community Survey as well as Nashville’s Open Data website. Such a model could help the city develop housing and amenities sustainably and manage real estate boom and bust cycles proactively.
This is a difficult task for numerous reasons. First and foremost, countless interrelated factors influence home price, so deciding which metrics to include in the model is challenging. Equally difficult is determining how to engineer the relevant independent variables, that is, how to transform each variable so that it contributes the most predictive power.
Our strategy for developing the machine learning model was backwards stepwise selection. We started with a kitchen sink model that included any variable that could even remotely affect housing prices, then worked backwards, weeding out less predictive variables as we went along.
We focused data collection on the physical characteristics of each home, both internal and external, and on geospatial context, meaning the distance to a variety of amenities and disamenities. We felt these factors, rather than those based on census tracts or other arbitrary geographies, would provide more predictive power in the model.
The model we developed had an R-squared of 0.49, indicating that it accounted for 49% of the variation in home prices in Nashville. When used to predict, the model had a mean absolute error (MAE) of $123,539 and a mean absolute percentage error (MAPE) of 0.66. In other words, on average the model’s predictions were off by about $123,539, or 66%.
Data Wrangling and Cleaning
The map below presents all home sales in the city over the last few years by price quartile, with home prices ranging from $1,500 to upwards of $6.8 million. Home sales in Nashville are largely spread throughout the southern portion of the city, with far greater densities clustered around the river and the southeastern part of Nashville. As the map indicates, sale prices vary throughout the city, with higher-priced homes clustered in the southwestern quadrant. This area, south of the river, is proximate to Vanderbilt University, the state capitol, and other cultural destinations, all likely amenities for the area. Lower-priced homes are clustered in the southeastern quadrant of the city, near the airport, a likely disamenity. In the northern portions of the city, above the river, prices are somewhat less clustered, with some concentration of higher prices closest to the river and downtown Nashville.
Nashville Home Prices by Quartile
Given the variation in the price of homes in Nashville, collecting, wrangling, and cleaning the data to be included in the machine learning model was a major challenge that required a variety of sources. To develop our model, we collected data from the United States Census and American Community Survey, Open Data Nashville, and Google Earth. We pulled information on demographics, income, and commutes from the American Community Survey (2012-2016). We then took these data and, for each variable, created a binary column indicating whether each census tract was above or below the county average. Ultimately, these datasets did not improve the predictive power of our model and were removed. In addition, spatial data from the Census, including highways and water bodies in Nashville, were imported into R using the Census API. Both the highway and water datasets were used to build distance variables for the final model, measuring the distance to water and to highways for each home sale in the dataset. These variables were created under the assumption that proximity to certain geographic and physical features would positively or negatively affect home prices.
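As an illustration, the above/below-average recoding might look roughly like the following, where the data frame acs_tracts and the column med_income are assumed names rather than the ones actually used:

library(dplyr)

# flag tracts whose median income is above the county-wide average (assumed column name)
acs_tracts <- acs_tracts %>%
  mutate(above_avg_income = ifelse(med_income > mean(med_income, na.rm = TRUE), 1, 0))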
Additional distance measures were created using data from Google Earth. From Google Earth, we collected the locations of certain cultural, historical, and political destinations in the city, as well as the locations of highly ranked public schools, Vanderbilt University, and check cashing stores. As with the highway and water datasets, distances were calculated from each home sale to the nearest of these destinations. Again, these distance variables were created under the assumption that proximity to some of these places would positively affect home prices, while proximity to others would do so negatively.
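A sketch of how such nearest-destination distances can be computed with the sf package, assuming sales_sf (home sale points) and amenities_sf (destination points) are sf objects in the same projected coordinate system (object names are assumptions):

library(sf)

# pairwise distances from every sale to every destination, then keep the nearest one
dist_matrix <- st_distance(sales_sf, amenities_sf)
sales_sf$dist_nearest_amenity <- apply(dist_matrix, 1, min)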
In addition to the Google Earth and Census data, data on police incidents, police calls, and park locations were collected from Nashville’s open data website. These datasets included XY coordinate information, so the data could be mapped. In ArcGIS, the individual incidents and calls were joined to census tracts, providing a count of incidents and calls in each tract and thus a measure of the relative density of crime across the city. Like the census data, these data did not enhance the predictive power of the model and were excluded from the final model.
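The tract-level counts were produced in ArcGIS; an equivalent step in R might look like the sketch below, assuming incidents_sf (incident points) and tracts_sf (census tract polygons) are sf objects in the same coordinate system:

library(sf)

# count the incident points falling inside each census tract polygon
tracts_sf$incident_count <- lengths(st_intersects(tracts_sf, incidents_sf))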
The majority of variables included in the final model were built from variables in the original home sales dataset, covering both the physical characteristics of the homes and the governmentally designated geographies they fall within. To make these variables more powerful predictors, they were transformed into binaries or classified into bins, allowing us to treat them as factors or quasi-factors. The following variables from the dataset were transformed:
Building Type
Exterior Wall Type
Assessor Zone
Lot Acreage
Home Square Footage
Full Baths
Half Baths
Physical Depreciation
Finished Basement
Year Built
Foundation Type
For example, rather than creating a binary variable for every year in which homes were built, we broke that variable down by decade, beginning with homes built before 1978 (the year lead paint legislation was enacted). We followed a similar binning process for square footage and acreage. The remaining variables were built into simpler binary variables. Each building type, exterior wall type, physical depreciation level, foundation type, and assessor zone was made into its own binary variable. Finished basement was made into a binary variable based on whether or not a house had a finished basement. Baths and half baths were broken down into binaries for ‘many baths’ (more than two) and ‘many half baths’ (more than one). We also wanted to be sure that we were comparing apples to apples, or houses to houses, rather than land, so we created a binary variable called “not a house,” which drew on land use and square footage information to determine whether the parcel contained a home or was vacant land. Note that while numerous binary variables were created, not all were included in the final model, as some did not improve its predictive power.
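A simplified sketch of this recoding, assuming a sales data frame called homes with columns year_built, full_baths, and half_baths (the actual column names in the assessor data may differ):

library(dplyr)

homes <- homes %>%
  mutate(
    built_pre_1978  = ifelse(year_built < 1978, 1, 0),   # before lead paint legislation
    many_baths      = ifelse(full_baths > 2, 1, 0),      # 'many baths': more than two
    many_half_baths = ifelse(half_baths > 1, 1, 0)       # 'many half baths': more than one
  )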
Finally, using the home sales variable in the dataset, we calculated average nearest neighbor home price. We developed three versions of nearest neighbors, using 5, 10, and 20 nearest neighbors. These variables were developed under the assumption that the best predictor of home price is the price of nearby homes.
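A sketch of how the nearest-neighbor price features can be built, assuming coords is a matrix of projected XY coordinates for the sales and homes$sale_price holds the corresponding prices (both names are assumptions); FNN's get.knn excludes each point from its own neighbor set:

library(FNN)

# indices of the 5 nearest sales to each sale, then the mean of their prices
nn5 <- get.knn(coords, k = 5)$nn.index
homes$knn5_avg_price <- apply(nn5, 1, function(i) mean(homes$sale_price[i]))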
A full list of the variables included in the model is provided in the table below. The variables are organized into several categories: internal home characteristics, external home characteristics, distance to amenities and disamenities, and geospatial categories.
Variable Characteristics
While the above tables provide some additional information on the variables included in the model, it is also helpful to see the variables distributed spatially. The following three maps show three of the more interesting variables included in the analysis: the average home price of the five nearest neighbors, the number of bathrooms in homes, and lot size (in acres).
Average KNN Home Prices (5)
The above map illustrates the average home prices of the five nearest homes for each home sale in Nashville. The average prices on this map, unsurprisingly, mirror the patterns of prices on the sale price map at the beginning of this section. Average neighbor home prices are highest in the southwestern quadrant of the city.
The below map shows the number of full baths in homes in Nashville, with the lighter blue color indicating homes that have three or more baths and the darker blue color indicating homes that have one or two baths. Given the pattern of home prices in the city, with the most expensive homes in the southwestern quadrant, it is unsurprising to see a cluster of homes with three or more bathrooms in the southwest.
Bathrooms in Nashville Homes
Land Acreage in Nashville
To further understand the variables, and especially the relationship among the variables included in the model, we developed a correlation matrix, which denotes which independent variables are correlated with each other. As the figure below shows, those variables that are more correlated with each other have darker values than those variables that are not correlated. In addition, red numbers indicate negative correlation, that is, when one variable increases, the other will decrease, while blue indicates positive correlation.
Correlation Plot
As the correlation matrix shows, most of the variables included in the model are relatively weakly correlated with one another. The two average nearest neighbor variables have a strong correlation, indicating that they are collinear. Despite their collinearity, however, both variables were included in the model because of their statistically significant and strong predictive power.
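For reference, a correlation plot of this kind can be produced roughly as follows, assuming model_vars is a data frame restricted to the numeric variables considered for the model (the name is an assumption):

library(corrplot)

cor_matrix <- cor(model_vars, use = "pairwise.complete.obs")
corrplot(cor_matrix, method = "number", type = "lower", number.cex = 0.6)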
Ultimately this process of data cleaning and wrangling was key to model development. Understanding the variables as well as engineering them effectively is imperative for developing a good predictive model.
Methods
Developing the machine learning model involved a variety of steps, much of the work focused on the data collection and wrangling described in the section above. Once all of our data were collected and cleaned, model building could begin. The first step was developing training and test datasets, with 75% of the home sales going to the training set and 25% to the test set. Using a training and a test set enabled us to evaluate the model’s predictive power: the model was built on the training set and then evaluated on the test set, data it had not seen during fitting.
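A minimal sketch of that split in base R, assuming the cleaned sales data frame is called homes:

set.seed(1234)                      # for a reproducible split
train_idx <- sample(seq_len(nrow(homes)), size = floor(0.75 * nrow(homes)))
train <- homes[train_idx, ]
test  <- homes[-train_idx, ]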
We used the kitchen sink method of regression building, where we started with all potentially relevant variables and ran successive models, whittling the independent variables down to only the most powerful predictors. With this method of model building, the quality of each successive iteration is judged on several different factors. Most important are the adjusted R-squared, which indicates the percent of variation in sale prices that the model accounts for, as well as the mean absolute error and mean absolute percentage error, which are calculated from the model’s predictions. These error metrics compare the predicted home sale price (from the model) to the actual home sale price. Powerful regression models have high R-squared values and low mean absolute errors. In addition to these key metrics, which judge the predictive power of the model, each independent variable was judged on its statistical significance, with statistically insignificant variables removed from the model. Further, for engineered and correlated independent variables, different transformations were tried across models to ensure the most powerful predictor was included in the final model. Note that within the model building process, the dependent variable, sale price, was log-transformed. Logging made the variable more normally distributed and thus improved the predictive power of the model. Detailed results of the model are included in the results section below.
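A sketch of the logged-price specification and its test-set predictions, using the train and test sets from the split above; the column name SalePrice and the handful of predictors shown (dist_highway, knn5_avg_price, many_baths) are illustrative stand-ins for the actual final variable list:

reg <- lm(log(SalePrice) ~ dist_highway + knn5_avg_price + many_baths, data = train)
summary(reg)$adj.r.squared          # adjusted R-squared of the training fit

# predictions are made on the log scale and exponentiated back into dollars
test$pred_price <- exp(predict(reg, newdata = test))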
Results
The results of the kitchen sink regression model for the training set, or 75% of the full dataset, can be found in the table below. As the bottom of the table states, the adjusted R-squared of the model is 0.49, indicating that 49% of the variation in home prices is accounted for by the regression model. In addition, the majority of variables in the model are highly statistically significant, indicating that they are important predictors of home price.
Regression Results
Although the model accounts for nearly half of the variation in home prices in Nashville, it is not a perfect predictor. Predicting on the test set, made up of 25 percent of the full dataset, allows the accuracy of the model to be assessed. The table below provides a sample of observed (actual) home prices, predicted home prices, and error for the test set. Error is calculated as the difference between the predicted and actual home prices; absolute error is the absolute value of that error; and absolute percent error is the absolute error divided by the observed home price.
R Squared, MAE, and MAPE
When aggregated to the full test set, the mean absolute error (MAE) and mean absolute percentage error (MAPE) are calculated by averaging the absolute errors and the absolute percentage errors, respectively. As the table above shows, the MAE for the test set was $123,539, indicating that on average the model’s predictions were off by about $124,000. The MAPE for the test set was 66%, indicating that on average predicted home prices were off by about 66% of the observed price.
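Using the test-set predictions from the sketch above, these error metrics reduce to a few lines:

test$abs_error     <- abs(test$pred_price - test$SalePrice)
test$abs_pct_error <- test$abs_error / test$SalePrice

mae  <- mean(test$abs_error)        # mean absolute error, in dollars
mape <- mean(test$abs_pct_error)    # mean absolute percentage error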
To gain a better understanding of the generalizability of the model, we performed k-fold cross-validation, in which the model is refit on numerous random samples of the data, 100 in our case. If the model is generalizable, the goodness-of-fit metrics, such as the R-squared values and MAEs of the held-out folds, should be relatively similar across folds. For our model, the mean MAE across folds was $128,345 with a standard deviation of $23,532.84, comparable to the MAE of the original test set. The mean R-squared value was 0.46 with a standard deviation of 0.143, which, like the mean MAE, is similar to the R-squared of our original model.
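One way to run this kind of repeated validation is with the caret package, sketched below with the same illustrative formula as before; this may differ from the exact implementation used here. Note that with a logged outcome, caret's fold-level MAE is reported on the log scale, so the dollar-denominated MAE quoted above implies the errors were computed on back-transformed predictions.

library(caret)

ctrl   <- trainControl(method = "cv", number = 100)
cv_fit <- train(log(SalePrice) ~ dist_highway + knn5_avg_price + many_baths,
                data = homes, method = "lm", trControl = ctrl)

cv_fit$resample                     # R-squared, RMSE, and MAE for each of the 100 folds
mean(cv_fit$resample$Rsquared)
sd(cv_fit$resample$MAE)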
To further illustrate the generalizability of the model based on the results of cross-validation, the R-squared values can be plotted as a histogram, as shown in the figure below. If the histogram of R-squared values is normally distributed, then the model is generalizable. As the figure shows, there is variation, but the R-squared values are somewhat normally distributed.
R Squared Histogram
To further understand how effective the model is at predicting home prices, it is useful to plot the predicted home prices as a function of observed home prices, as is presented in the figure below.
Prediction as a function of Observed
In the figure, the red line shows perfect prediction (predicted price equal to observed price), while the blue line shows the model’s actual predictions. As the graph shows, the model overpredicted the lowest-priced homes and underpredicted most others. The chart illustrates what the MAE and MAPE show: predictions are not perfect and the model has relatively high error. Had the model been a better predictor, the blue line would have had a slope much closer to that of the red line.
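A plot of this form can be drawn with ggplot2 using the test-set predictions from earlier (a sketch; the styling will differ from the published figure):

library(ggplot2)

ggplot(test, aes(x = SalePrice, y = pred_price)) +
  geom_point(alpha = 0.3) +
  geom_abline(slope = 1, intercept = 0, color = "red") +    # perfect prediction
  geom_smooth(method = "lm", se = FALSE, color = "blue") +  # fitted trend of the predictions
  labs(x = "Observed sale price", y = "Predicted sale price")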
While the above analyses focus on how well the model predicts home prices overall, the following analysis adds a spatial component. The map below shows the residuals, or errors, of the predictions for the 25% test set. As the map shows, the residuals are relatively well dispersed throughout the city, which is good. However, some clustering of error exists. For example, in the northeastern quadrant of the city there is a cluster of particularly high residuals. This is one of the wealthier areas of the city, suggesting that the model may not be generalizable and may not predict well across a broad spectrum of prices.
Residual Map
To further examine the spatial autocorrelation of the model’s errors, we used Moran’s I, a summary statistic whose null hypothesis is that there is no clustering, or spatial autocorrelation, in the residuals. The results of the Moran’s I test are presented in the code block below. With a p-value of less than 0.05, the null hypothesis can be rejected, indicating that there is statistically significant clustering of error in the model and, in turn, that the model is missing key variables that would predict home prices in Nashville.
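A call of roughly the following form produces the output shown below; the k-nearest-neighbor construction of spatialWeights and the coordinate matrix coords are assumptions, as only the nb2listw() call is visible in the printed output:

library(spdep)

# neighbor list from the 5 nearest sales (assumed definition), for the same
# observations whose residuals are being tested
spatialWeights <- knn2nb(knearneigh(coords, 5))
moran.test(reg3$residuals, nb2listw(spatialWeights, style = "W"))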
Moran's I test under randomisation
data: reg3$residuals
weights: nb2listw(spatialWeights, style = "W")
Moran I statistic standard deviate = 10.939, p-value < 2.2e-16
alternative hypothesis: greater
sample estimates:
Moran I statistic Expectation Variance
0.1510090173 -0.0004450378 0.0001916756
Analysis
This model is not particularly effective at predicting home prices in Nashville: it accounts for only 49% of the variation in home prices, indicating that it lacks sufficiently powerful predictive variables. In addition, the MAE and MAPE of the model are both high. In the 25% test set, the MAE was $123,539 and the MAPE was 66%. This means that, on average, the model was off by over $100,000 when predicting home prices, or, put another way, its predictions were off by 66%.
Examining the regression output, it becomes clear that certain variables are more powerful predictors than others. The distance variables had particularly strong impacts on home price, with the exception of distance to schools. As the regression output shows, home prices were positively associated with distance to parks and highways, but negatively associated with distance to check cashing locations, Whole Foods grocery stores, and water bodies. Although highly statistically significant, the average nearest neighbor price variables had small coefficients, indicating that they affect predicted price only minimally. This is particularly surprising given that, in the real world, home prices are largely determined by the prices of nearby comparables.
While this model is not the most effective at predicting home prices, the results of the cross-validation do indicate that it is somewhat generalizable. Comparing the mean R-squared and mean MAE of the 100 cross-validation folds to the R-squared and MAE of the original model shows limited variation: the mean R-squared of the cross-validation folds was 0.46, compared to 0.49 in the original model, and the mean MAE was $128,345, compared to $123,539 for the original test set. The results of cross-validation thus indicate some generalizability; however, when the residuals of the original test set are mapped, it is clear that the model is not completely generalizable across space. The residual map included in the section above shows somewhat well-dispersed residuals, but there is still clustering, especially in the northeastern portion of the city. This indicates that the model does not completely account for spatial variation in home prices.
MAPE by Zipcode
To further understand the generalizability of the model, predictions were aggregated to the zip code level. The map below shows the MAPE for each zip code. As the map shows, the MAPE is relatively consistent across zip codes, providing further indication that the model is generalizable across space. The accompanying figure plots MAPE by zip code as a function of average home price by zip code, providing further evidence that the model is somewhat generalizable.
MAPE as a function of Average Home Price by Zip
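A sketch of the zip-code aggregation behind these two figures, assuming the test set carries a zipcode column alongside the predictions and errors computed earlier:

library(dplyr)

mape_by_zip <- test %>%
  group_by(zipcode) %>%
  summarize(mean_price = mean(SalePrice),
            MAPE       = mean(abs_pct_error))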
Conclusion
While our model does predict home prices, we would not recommend it to Zillow. With an R-squared under 0.50 and a mean absolute percentage error of 66%, this model is not reliable enough to predict home prices for Zillow. Significant improvements would be needed before this model could be used to predict home prices accurately. Given the relatively low R-squared and high error values, it is clear that certain key variables are missing from the model, likely including well-engineered demographic and socioeconomic data as well as additional data on housing comparables. In addition, the feature engineering of the chosen variables could be improved to make them more powerful predictors. For example, the binned variables could have been binned differently to get better results. Similarly, the distance variables could have been engineered into binaries indicating homes within a specified distance of a given destination.