1 Introduction & Summary

An accurate home price prediction algorithm can reduce volatility in the housing market and take into account existing factors that may not be reflected in a home’s previous selling prices (e.g., a new roof or a new shopping center nearby). However, predictive algorithms can also be exceedingly difficult to perfect. A falsely high average estimate in a neighborhood might lead home sellers to list their homes at too high an asking price and drag out the process of selling, thereby introducing friction into the housing market. A falsely low estimate may depress the value of what is oftentimes a homeowner’s most valuable asset.

This project attempts to predict housing prices in metropolitan Miami by taking into consideration a home’s unique features (e.g., fence, patio) as well as local amenities and external features like schools, parks, and access to major roads. One interesting finding from this process is that a home’s location in a particular middle school zone shows a positive relationship with home prices, but its location in an elementary or high school zone does not.

To create our model, we converted our features of interest into variables that can be fed into an OLS regression model. We tested each feature for correlation with home sale prices and fine-tuned our model until we were able to minimize error.

After testing and rejecting several features that did not reduce our prediction errors (e.g., distances to nearest park, major road, and middle school), we ultimately settled on the features (independent variables) listed below.

Location/External Features
- Property.City: Miami or Miami Beach
- GEOID: Census tract code
- Shore1: distance from shoreline (feet)
- MedRent: median rent by census tract
- pctWhite: % residents who identify as White
- pctPoverty: % residents below poverty line
- 9 binary variables for middle school area: Brownsville, Citrus Grove, Jose de Diego, Georgia Jones-Ayer, Kinloch Park, Madison, Nautilus, Shenandoah, West Miami

Internal Features
- LotSize: lot square footage
- Age: years since home was built
- Bed: number of bedrooms
- Bath: number of bathrooms
- Stories: number of stories
- Pool: extra pool feature (0/1)
- Fence: extra fence feature (0/1)
- Patio: extra patio feature (0/1)
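
The nine middle school variables listed under Location/External Features are binary (0/1) indicators. As a minimal sketch of how such indicators can be built, assuming each home carries a single zone label (the `one_hot` helper and the sample labels are hypothetical, not taken from the project code):

```python
def one_hot(labels, categories):
    """Return one dict of 0/1 indicator columns per input label."""
    return [
        {f"{cat}.MS": int(label == cat) for cat in categories}
        for label in labels
    ]

# Hypothetical zone labels for three homes
zones = ["Shenandoah", "Nautilus", "Shenandoah"]
categories = ["Nautilus", "Shenandoah"]
rows = one_hot(zones, categories)
```

Each home gets a 1 in the column for its own zone and 0 elsewhere, which is the form a regression model can consume.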

2 Data

2.1 Dependent Variable Map: Home Prices

The map below shows the spatial distribution of home prices in Miami and Miami Beach. Darker points represent more expensive homes, with the deepest purple shade representing any home 1 million dollars or higher. Given the extreme range of home prices in Miami (max approx. 27 million dollars), we felt it necessary to collapse the outlier homes into the highest tier of home prices.
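
Collapsing outliers into a top tier can be sketched as a simple binning step. The breakpoints below are assumptions chosen for illustration; only the open-ended tier at 1 million dollars comes from the text above.

```python
from bisect import bisect_right

# Hypothetical tier breakpoints; only the $1M top tier is stated in the report
BREAKS = [250_000, 500_000, 750_000, 1_000_000]
LABELS = ["under 250k", "250k to 500k", "500k to 750k", "750k to 1M", "1M or higher"]

def price_tier(price):
    """Map a sale price to its map tier; everything at or above $1M shares one tier."""
    return LABELS[bisect_right(BREAKS, price)]
```

With this scheme a 27 million dollar home and a 1 million dollar home land in the same tier, so a handful of extreme sales do not stretch the color scale.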

2.2 Independent Variable Map #2: Distance from Shore

Here we see the spatial relationship between home sale prices and distance from the shore. Unsurprisingly, as we move farther inland, home prices decrease.

2.3 Independent Variable Map #3: Middle Schools

This map shows the relationship between middle school attendance zones and home sale prices.

2.4 Independent Variable Map #4: Percent White Residents

Below is a map of the percent of White residents in each Census tract in Miami and Miami Beach. Although a higher percentage of White residents does not appear to be closely correlated with home price, the absence of White residents is clearly tied to lower home prices.

2.5 Excluded Variable of Interest: Major Roads

Several other features such as distance from major roads, parks, and location within elementary and high school attendance zones were tested, but did not prove to be relevant. As shown below, there appears to be little relationship between distance from major roads and home price.

2.6 Summary Statistics of Variables/Features

**Summary Statistics**

Statistic	N	Mean	St. Dev.	Min	Max

SalePrice	2,066	405,476.400	199,741.700	12,500	1,000,000
LotSize	2,066	6,360.875	1,721.617	1,250	17,620
Age	2,066	70.954	18.186	-1	115
Stories	2,066	1.073	0.265	0	3
Bed	2,066	2.692	0.794	0	8
Bath	2,066	1.611	0.700	0	6
Pool	2,066	0.108	0.310	0	1
Fence	2,066	0.738	0.440	0	1
Patio	2,066	0.499	0.500	0	1
Shore1	2,066	7,047.549	5,248.614	88.597	26,528.540
MedRent	2,040	1,042.535	311.133	246.000	2,297.000
pctWhite	2,062	0.703	0.320	0.057	0.989
pctPoverty	2,062	0.217	0.108	0.052	0.556
Brownsville.MS	1,588	0.098	0.298	0.000	1.000
CitrusGrove.MS	1,588	0.115	0.319	0.000	1.000
JosedeDiego.MS	1,588	0.129	0.335	0.000	1.000
GeorgiaJA.MS	1,588	0.133	0.340	0.000	1.000
KinlochPk.MS	1,588	0.196	0.397	0.000	1.000
Madison.MS	1,588	0.001	0.035	0.000	1.000
Nautilus.MS	1,588	0.061	0.240	0.000	1.000
Shenandoah.MS	1,588	0.243	0.429	0.000	1.000
WestMiami.MS	1,588	0.024	0.153	0.000	1.000

2.7 Correlation

Below is a correlation matrix, showing the relatedness of each numeric variable to every other. The red-bounded box shows each variable’s correlation with sale price, our dependent variable.
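
Each cell of a correlation matrix like this one is typically a Pearson correlation coefficient. A minimal sketch of that calculation, assuming Pearson correlation is the measure used (the report does not say) and using made-up numbers:

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

# Made-up example: price falls as distance from shore grows
distance = [100, 2_000, 8_000, 20_000]
price = [900_000, 600_000, 400_000, 250_000]
r = pearson(distance, price)  # negative for this toy data
```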

2.8 Scatterplots

The plots below show the linear relationship between four independent variables and home prices. Actual square footage is most strongly and positively correlated with home price. Median rent in a home’s area also has a small positive relationship, and distance from the shore and age are, as expected, negatively correlated with price (i.e., as the age of a home increases, home price decreases).

3 Methods

Our goal was to build a model that most accurately predicted Miami home prices. This was not a black-and-white procedure. We had three main considerations: 1) explain the variance in home prices, 2) minimize errors in predictions, and 3) be generalizable. Our first models used a simple ordinary least squares (OLS) regression. The results allowed us to exclude some census variables and remove overlapping variables (like square feet). In the next step we split our dataset 60/40, trained a model on 60% of the data, and tested it on the remaining 40%. This process gave us some insight into the model’s generalizability; it also allowed us to remove some variables. For example, this stage revealed that census tracts were our most influential neighborhood categorization, instead of property or mailing zip codes.
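
The 60/40 split described above can be sketched as a random partition of the rows. This is an illustration only; the function name, the seed, and the use of a simple shuffle are assumptions, and the original analysis may have used a stratified split.

```python
import random

def train_test_split(rows, train_frac=0.6, seed=0):
    """Shuffle row indices and cut them into training and test partitions."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    cut = round(train_frac * len(rows))
    train = [rows[i] for i in indices[:cut]]
    test = [rows[i] for i in indices[cut:]]
    return train, test

train, test = train_test_split(list(range(10)))
```

The model is fit on `train` only, and its errors are then measured on `test`, which it has never seen.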

The last stage involved a machine learning method known as “k-fold cross validation”. We split our dataset into ten subsets (folds); each fold in turn served as the test set while the model was trained on the remaining folds. We measured the average performance across these folds, which was the best way to measure generalizability. We tried different combinations of features until we settled on a model that had the lowest errors relative to actual sale price.
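
A minimal sketch of k-fold cross validation in its standard form, where each fold is held out once while the model is trained on the rest. The helper names and the toy "predict the training mean" model are assumptions for illustration; the report's actual model is the OLS regression.

```python
import random
from statistics import mean

def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k nearly equal folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate_mae(x, y, fit, predict, k=10, seed=0):
    """Return one mean absolute error per held-out fold."""
    maes = []
    for held_out in k_fold_indices(len(x), k, seed):
        held = set(held_out)
        train = [i for i in range(len(x)) if i not in held]
        model = fit([x[i] for i in train], [y[i] for i in train])
        errors = [abs(y[i] - predict(model, x[i])) for i in held_out]
        maes.append(mean(errors))
    return maes

# Toy model (hypothetical): always predict the mean training price
def fit_mean(xs, ys):
    return mean(ys)

def predict_mean(model, xi):
    return model
```

Averaging the returned list gives the cross-validated MAE, and its spread across folds indicates how stable the model is from one subset to the next.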

**Training Set LM Results**

	Dependent variable:

	SalePrice
	(1)	(2)

Folio		0.00000
		(0.00000)

Property.CityMiami Beach		220,589.400^**
		(102,729.500)

LotSize		17.974^***
		(1.660)

Bed		8,653.610^*
		(4,483.327)

Bath		4,613.343
		(5,440.150)

Stories		13,854.920
		(11,214.380)

Pool		77,281.650^***
		(9,820.892)

Fence		-149.050
		(5,646.349)

Patio		4,073.120
		(5,115.939)

ActualSqFt		67.033^***
		(6.231)

Age		-698.975^***
		(147.148)

Shore1	-5.745^***	-3.369^***
	(1.229)	(1.030)

MedHHInc	1.370^***	1.091^***
	(0.234)	(0.193)

TotalPop	4.453^**	3.400^**
	(1.797)	(1.481)

MedRent	8.011	10.111
	(16.927)	(13.970)

pctWhite	87,738.750^***	72,584.590^***
	(21,213.910)	(18,038.400)

pctPoverty	-65,160.580	-28,329.320
	(46,955.060)	(38,669.420)

Brownsville.MS	-74,321.900^**	-19,595.140
	(35,665.230)	(29,848.970)

CitrusGrove.MS	-63,085.380^*	-16,536.970
	(33,757.230)	(27,908.010)

JosedeDiego.MS	-24,574.030	41,689.690
	(36,087.170)	(30,171.430)

GeorgiaJA.MS	-83,659.540^**	-21,745.790
	(33,968.440)	(28,569.970)

KinlochPk.MS	-20,898.120	7,151.547
	(23,871.550)	(19,733.870)

Madison.MS	-102,752.000	-24,901.940
	(89,066.670)	(73,254.900)

Nautilus.MS	229,234.000^***
	(37,774.810)

Shenandoah.MS	99,482.560^***	122,292.100^***
	(31,097.140)	(25,991.120)

WestMiami.MS


Constant	274,559.400^***	-5,246.435
	(50,967.110)	(142,697.300)


Observations	1,584	1,584
R²	0.603	0.736
Adjusted R²	0.599	0.732
Residual Std. Error	113,746.600 (df = 1569)	93,070.580 (df = 1559)
F Statistic	170.158^*** (df = 14; 1569)	180.949^*** (df = 24; 1559)

Note:	^*p<0.1; ^**p<0.05; ^***p<0.01
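
The adjusted R² values in the table can be cross-checked from the reported R², the number of observations, and the number of predictors (14 and 24, read from the F statistic's degrees of freedom). The sketch below uses the standard adjustment formula; the function name is ours.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Standard adjusted R-squared: penalizes R-squared for each added predictor."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

model_1 = round(adjusted_r2(0.603, 1584, 14), 3)  # 0.599, matches column (1)
model_2 = round(adjusted_r2(0.736, 1584, 24), 3)  # 0.732, matches column (2)
```

Because the penalty is small relative to the gain in R², the second model's improvement is not just an artifact of adding ten more predictors.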

4 Results

Our first regression was based on our feature engineering. We included census variables, shoreline distance, and middle school zones, but excluded household details. This resulted in several statistically significant features, an adjusted R² of 0.599, and an overall model F-statistic that implies statistical significance. The intercept, $274,600, is the estimated mean sale price when all of our independent variables are zero. For every additional foot from the shore, the predicted sale price drops by about $5.75, holding the other variables constant. Our second model, shown in column (2) of the regression table, includes all of the census variables, household variables, and some of our feature engineering. The adjusted R² improves to 0.732.
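
As a worked example of reading the shoreline coefficient, the sketch below compares two otherwise identical homes that differ only in distance from the shore. The coefficients are taken from the regression table; the 1,000-foot gap is a made-up illustration.

```python
# Shore1 coefficients from the regression table (dollars per foot)
SHORE_COEF_MODEL_1 = -5.745
SHORE_COEF_MODEL_2 = -3.369

def shore_effect(extra_feet, coef):
    """Predicted price change for a home that is extra_feet farther inland."""
    return coef * extra_feet

change_1 = shore_effect(1_000, SHORE_COEF_MODEL_1)  # about -5,745 dollars
change_2 = shore_effect(1_000, SHORE_COEF_MODEL_2)  # about -3,369 dollars
```

The effect shrinks in the second model because household features such as lot size and square footage absorb part of the variation that shoreline distance explained on its own.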

This table shows our mean absolute error and mean absolute percent error for a single test set. Clearly more work needed to be done to improve our model.

Test	Values
Test MAE:	195610.35526
Test MAPE:	61.69651
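
For reference, a minimal sketch of how these two error measures are commonly computed. The function names and sample prices are illustrative, and we assume MAPE is expressed in percent, which the table does not state.

```python
def mae(actual, predicted):
    """Mean absolute error, in the same units as the prices."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mape(actual, predicted):
    """Mean absolute percent error; assumes no actual value is zero."""
    ratios = [abs(a - p) / a for a, p in zip(actual, predicted)]
    return 100 * sum(ratios) / len(ratios)

# Made-up prices for two homes
actual = [100_000, 200_000]
predicted = [110_000, 180_000]
```

Here `mae(actual, predicted)` is 15,000 dollars and `mape(actual, predicted)` is 10 percent. MAPE weights a given dollar miss more heavily on a cheap home than on an expensive one.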

These are the high-level results of our final model, based on the features discussed in the introduction. We were able to explain about 75% of the variance in sale prices, while our predictions missed the actual sale price by about $66,000 on average.

4.1 Results of Cross Validation Test

intercept	RMSE	Rsquared	MAE	RMSESD	RsquaredSD	MAESD
TRUE	88416.28	0.7556264	65775.12	25971.98	0.1477249	16997.15
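
The sketch below shows one way the columns in this table could be produced, assuming RMSE is averaged across folds and the columns ending in SD are standard deviations of the per-fold values. That reading of the column names is an assumption, and the helper names are ours.

```python
from math import sqrt
from statistics import mean, stdev

def rmse(actual, predicted):
    """Root mean squared error; penalizes large misses more than MAE does."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared) / len(squared))

def summarize_folds(per_fold_values):
    """Return the mean and standard deviation of a metric across folds."""
    return mean(per_fold_values), stdev(per_fold_values)

# Made-up per-fold RMSE values
fold_mean, fold_sd = summarize_folds([1.0, 2.0, 3.0])
```

A large SD relative to the mean signals that performance varies a lot from fold to fold.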

4.2 Histogram of Cross Validation Errors (MAE)

The histogram below examines our final cross validation model run on 100 folds. The MAE, or Mean Absolute Error, measures the average absolute difference between actual and predicted sale prices in each test subset. Our errors clustered between $50k and $75k. From this plot alone, it is difficult to say whether our model is generalizable.

4.3 Scatterplot of Predicted and Observed Sale Prices

Our final model was much more accurate at predicting home prices at lower values. Removing outliers above $1M improves our overall mean errors, but limits our predictive power at higher home values.

The following two charts show our predicted home prices based on our final model in comparison to all the actual home prices used in the training dataset. Our model struggles to capture the higher-priced homes in dark purple.

4.4 Spatial Errors

According to our model, home prices are spatially autocorrelated with those of neighboring homes. In addition, the bottom chart shows that errors are also spatially correlated with neighbors' errors.

Our observed Moran’s I, denoted by the vertical orange line, is significantly higher than the values of I obtained from random permutations; thus, spatial autocorrelation exists in our final model.
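
A minimal sketch of Moran’s I with a permutation test, to illustrate the comparison described above. The weights matrix here is a toy chain of six neighbors; the spatial weights actually used in the analysis are not stated, so treat the setup as an assumption.

```python
import random

def morans_i(values, weights):
    """Moran's I for values with spatial weights[i][j] (zero on the diagonal)."""
    n = len(values)
    avg = sum(values) / n
    dev = [v - avg for v in values]
    weight_total = sum(sum(row) for row in weights)
    cross = sum(weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    return (n / weight_total) * (cross / sum(d * d for d in dev))

def permutation_test(values, weights, n_perm=999, seed=0):
    """Compare the observed I with I computed on randomly shuffled values."""
    rng = random.Random(seed)
    observed = morans_i(values, weights)
    shuffled = list(values)
    at_least_as_large = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if morans_i(shuffled, weights) >= observed:
            at_least_as_large += 1
    return observed, (at_least_as_large + 1) / (n_perm + 1)

# Toy data: six locations in a row, each neighboring the next
chain = [[1 if abs(i - j) == 1 else 0 for j in range(6)] for i in range(6)]
errors = [1, 1, 1, 5, 5, 5]  # similar values sit next to each other
observed_i, p_value = permutation_test(errors, chain)
```

For this toy data the observed I is 0.6. A small p-value means few shuffled arrangements cluster as strongly as the observed one.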

4.5 Combined Total Sale Prices (Training and Test)

The chart below looks at all 3503 observations from the dataset, including those where no sale price is known. It applies our final model predictions.

4.6 Errors by Neighborhood

This map shows mean test set errors by neighborhood.

This scatterplot looks at our errors in relation to price by neighborhood. Our errors grow larger for more expensive homes.

4.7 Generalizability by Income

These results show that our model's performance differs across income groups, so the model struggles with generalizability.

intercept	RMSE	Rsquared	MAE	RMSESD	RsquaredSD	MAESD
TRUE	94712.13	0.7717844	73717.92	101456	0.259679	66067.11

intercept	RMSE	Rsquared	MAE	RMSESD	RsquaredSD	MAESD
TRUE	100906.8	0.7273539	83003.1	51307.63	0.2727718	40206.13
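
A comparison like the one above can be sketched by computing the error separately for each group. The group labels and prices below are made up, and the helper name is ours; the report does not state how the income groups were defined.

```python
from collections import defaultdict

def mae_by_group(actual, predicted, groups):
    """Mean absolute error computed separately for each group label."""
    errors = defaultdict(list)
    for a, p, g in zip(actual, predicted, groups):
        errors[g].append(abs(a - p))
    return {g: sum(e) / len(e) for g, e in errors.items()}

# Made-up homes tagged with a hypothetical income group
result = mae_by_group(
    [100, 200, 300, 400],
    [110, 190, 330, 430],
    ["low", "low", "high", "high"],
)
```

A model that generalizes well would show similar errors in every group; a wide gap suggests it fits some contexts better than others.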

MUSA508 Midterm: Miami Home Price Prediction Markdown

Julian Hartwell & Juliana Zhou

October 20, 2020