For this study, I am trying to determine which factors influence the median home value of a tract of land, and whether median value can be predicted. This could help prospective home buyers narrow their search by finding undervalued areas that fulfill their buying requirements. The data also has clear business potential: it could be used by a real estate developer looking to buy into undervalued areas, or to determine which of their assets are overvalued. We will be evaluating the viability of such a model.
The data was obtained from http://biostat.mc.vanderbilt.edu/DataSets
The data was originally used for the paper “Hedonic Housing Prices and the Demand for Clean Air”, which is over 40 years old. It is somewhat unclear how the data was collected, but the authors cite several people who contributed to the data collection. It originated from many sources, such as city records and environmental research. Each observation indicates a tract of land in the Boston area and there are 506 total observations. Every variable in the dataset will be considered, but several are of particular interest.
Variables of interest, with descriptions from the dataset provider:
- CRIME: per capita crime rate by town
- RIVER: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX: nitric oxides concentration (parts per 10 million)
- RM: average number of rooms per dwelling
- AGE: proportion of owner-occupied units built prior to 1940
- DIS: weighted distances to five Boston employment centres
- RAD: index of accessibility to radial highways
- TAX: full-value property-tax rate per $10,000
- PTRATIO: pupil-teacher ratio by town
- LSTAT: % lower status of the population
- MEDV: median value of owner-occupied homes in $1000's
Though the data is quite old, results from this analysis will still be valuable. Many of the features haven't changed much or at all, such as the size of houses and the relative location of cities. It still wouldn't be prudent to use the model to predict housing prices today, but it can tell us which variables are important and whether a model built on current data would be viable. This was an observational study because there was no randomization and no splitting of the observations into treatment/control groups. The observations are not independent because they all share the same metropolitan area: every observation is within commuting distance of Boston and has a high population density. As such, the results cannot be generalized to other parts of the state, and especially not to other states.
Even though some of these variables might be good predictors of value, a causal link cannot be established; establishing causation would require treatment and control groups. There is an intuitive reason for this. Predictors like rooms and crime rate are indirectly measuring neighborhood affluence. A neighborhood full of large but run-down homes near a polluted factory would not have higher home values than a neighborhood of medium-sized but new and well-maintained homes near a top school system. Even though the neighborhood with more rooms will generally have the higher value, that would not be the case here. Affluence is the driver, and the other variables are just correlated with it.
This dataset has a fair amount of variables, so we will start by getting an overall sense of it, allowing us to narrow the focus.
library(Hmisc)
library(corrplot)
library(nlme)
library(tidyverse)
library(car)
library(pander)
getHdata(boston)  # download the Boston housing data from the Vanderbilt biostatistics site
boston_subset <- boston %>%
  select(-(1:2), -cmedv, -black)  # drop the two ID columns, the corrected median value, and the black variable
pander(summary(boston_subset))
| longitude | latitude | value | crime |
|---|---|---|---|
| Min. :-71.29 | Min. :42.03 | Min. : 5.00 | Min. : 0.00632 |
| 1st Qu.:-71.09 | 1st Qu.:42.18 | 1st Qu.:17.02 | 1st Qu.: 0.08205 |
| Median :-71.05 | Median :42.22 | Median :21.20 | Median : 0.25651 |
| Mean :-71.06 | Mean :42.22 | Mean :22.53 | Mean : 3.61352 |
| 3rd Qu.:-71.02 | 3rd Qu.:42.25 | 3rd Qu.:25.00 | 3rd Qu.: 3.67708 |
| Max. :-70.81 | Max. :42.38 | Max. :50.00 | Max. :88.97620 |
| residential | industrial | river | nox | rooms |
|---|---|---|---|---|
| Min. : 0.00 | Min. : 0.46 | no :471 | Min. :0.3850 | Min. :3.561 |
| 1st Qu.: 0.00 | 1st Qu.: 5.19 | yes: 35 | 1st Qu.:0.4490 | 1st Qu.:5.886 |
| Median : 0.00 | Median : 9.69 | NA | Median :0.5380 | Median :6.208 |
| Mean : 11.36 | Mean :11.14 | NA | Mean :0.5547 | Mean :6.285 |
| 3rd Qu.: 12.50 | 3rd Qu.:18.10 | NA | 3rd Qu.:0.6240 | 3rd Qu.:6.623 |
| Max. :100.00 | Max. :27.74 | NA | Max. :0.8710 | Max. :8.780 |
| older | distance | highway | tax |
|---|---|---|---|
| Min. : 2.90 | Min. : 1.130 | Min. : 1.000 | Min. :187.0 |
| 1st Qu.: 45.02 | 1st Qu.: 2.100 | 1st Qu.: 4.000 | 1st Qu.:279.0 |
| Median : 77.50 | Median : 3.207 | Median : 5.000 | Median :330.0 |
| Mean : 68.57 | Mean : 3.795 | Mean : 9.549 | Mean :408.2 |
| 3rd Qu.: 94.08 | 3rd Qu.: 5.188 | 3rd Qu.:24.000 | 3rd Qu.:666.0 |
| Max. :100.00 | Max. :12.127 | Max. :24.000 | Max. :711.0 |
| ptratio | lstat |
|---|---|
| Min. :12.60 | Min. : 1.73 |
| 1st Qu.:17.40 | 1st Qu.: 6.95 |
| Median :19.05 | Median :11.36 |
| Mean :18.46 | Mean :12.65 |
| 3rd Qu.:20.20 | 3rd Qu.:16.95 |
| Max. :22.00 | Max. :37.97 |
Home values are capped at $50,000. Crime jumps out as not having a symmetric distribution, with a mean of about 3.6 but a maximum of nearly 89.
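As a quick numeric check (a minimal sketch, not part of the original summary), we can count how many tracts sit exactly at the cap and look at the mean-to-median ratio that signals crime's skew:

# Tracts whose median value is recorded at the $50k cap
sum(boston_subset$value == 50)
# A mean far above the median is a symptom of crime's heavy right skew
mean(boston_subset$crime) / median(boston_subset$crime)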
mat <- cor(select(boston_subset, -river))
Correlation Plot (Darker Colors Indicate Correlation)
corrplot(mat, type = "upper")
Rooms and lstat have the strongest correlations with value. Crime has a somewhat strong correlation, while distance may not be a strong predictor, possibly because of affluent suburbs located somewhat far from Boston. Many of the possible predictor variables appear to be related to one another, which concerns us for a couple of reasons. First, the p-value, one of the measures that tells us whether a variable offers utility, could be compromised. Second, adding data or removing variables could significantly change the model. We will attempt to remedy this by including as few variables as possible in the final model.
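To make the plot's message concrete, the correlations with value can be pulled straight out of the matrix computed above (a small supplementary check):

# Correlation of each numeric variable with value, from most negative to most positive
round(sort(mat[, "value"]), 2)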
describe(boston$value)
## boston$value : Median value of owner-occupied homes / $1000
## n missing distinct Info Mean Gmd .05 .10
## 506 0 229 1 22.53 9.778 10.20 12.75
## .25 .50 .75 .90 .95
## 17.03 21.20 25.00 34.80 43.40
##
## lowest : 5.0 5.6 6.3 7.0 7.2, highest: 46.7 48.3 48.5 48.8 50.0
boston_subset %>% ggplot() + geom_histogram(aes(x = value), color = "black", fill = "darkblue", binwidth = 1)
Home value is somewhat right skewed, with a fairly long tail, which means that predictions at the extremes could suffer.
describe(boston_subset$rooms)
## boston_subset$rooms : Average no. rooms per dwelling
## n missing distinct Info Mean Gmd .05 .10
## 506 0 446 1 6.285 0.7515 5.314 5.593
## .25 .50 .75 .90 .95
## 5.886 6.209 6.623 7.151 7.588
##
## lowest : 3.561 3.863 4.138 4.368 4.519, highest: 8.375 8.398 8.704 8.725 8.780
boston_subset %>% ggplot() + geom_histogram(aes(x = rooms), color = "black", fill = "darkgreen") + labs(title = "Distribution of Rooms")
boston_subset %>% ggplot() + geom_point(aes(x = rooms, y = value)) + labs(title = "Rooms vs Value")
This will likely be the best predictor variable. It has a good linear relationship with home value, though that relationship deteriorates somewhat at the extremes.
describe(boston_subset$distance)
## boston_subset$distance : Weighted distances to five Boston employment centers
## n missing distinct Info Mean Gmd .05 .10
## 506 0 412 1 3.795 2.298 1.462 1.628
## .25 .50 .75 .90 .95
## 2.100 3.207 5.188 6.817 7.828
##
## lowest : 1.1296 1.1370 1.1691 1.1742 1.1781
## highest: 9.2203 9.2229 10.5857 10.7103 12.1265
boston_subset %>% ggplot() + geom_histogram(aes(x = distance), color = "black", fill = "purple") + labs(title = "Distribution of Distance")
boston_subset$distance_ln <- log(boston$distance)
boston_subset %>% ggplot() + geom_histogram(aes(x = distance_ln), color = "black", fill = "purple") + labs(title = "Distribution of Log Distance")
boston_subset %>% ggplot() + geom_point(aes(x = distance_ln, y = value)) + labs(title = "Distance vs Value")
Performing the log transform, taking the natural logarithm of distance, has made the distribution more symmetric, which is very desirable. There is a noticeable relationship with value, especially after the transform.
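To quantify this rather than judge it from the histograms alone, a rough skewness helper (a standardized third moment, defined here for illustration and not part of any loaded package) can be applied before and after the transform:

# Rough skewness: standardized third moment; values near 0 indicate symmetry
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3
c(distance = skew(boston_subset$distance), distance_ln = skew(boston_subset$distance_ln))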
describe(boston_subset$lstat)
## boston_subset$lstat : % Population in lower socio-economic status
## n missing distinct Info Mean Gmd .05 .10
## 506 0 455 1 12.65 7.881 3.708 4.680
## .25 .50 .75 .90 .95
## 6.950 11.360 16.955 23.035 26.807
##
## lowest : 1.73 1.92 1.98 2.47 2.87, highest: 34.37 34.41 34.77 36.98 37.97
boston_subset %>% ggplot() + geom_histogram(aes(x = lstat), color = "black", fill = "yellow") + labs(title = "Distribution of Lstat")
boston_subset$lstat_ln <- log(boston$lstat)
boston_subset %>% ggplot() + geom_histogram(aes(x = lstat_ln), color = "black", fill = "yellow") + labs(title = "Distribution of Log Lstat")
boston_subset %>% ggplot() + geom_point(aes(x = lstat_ln, y = value)) + labs(title = "Log Lstat vs Value")
Again, we have a skewed variable where the transformation helps. This relationship is almost as strong as that of room count. There is additional indication here that our predictions at higher value levels will suffer.
describe(boston_subset$tax)
## boston_subset$tax : Full-value property tax rate per $10,000
## n missing distinct Info Mean Gmd .05 .10
## 506 0 66 0.981 408.2 181.7 222 233
## .25 .50 .75 .90 .95
## 279 330 666 666 666
##
## lowest : 187 188 193 198 216, highest: 432 437 469 666 711
boston_subset %>% ggplot() + geom_histogram(aes(x = tax), color = "black", fill = "brown") + labs(title = "Distribution of Tax")
boston_subset %>% ggplot() + geom_point(aes(x = tax, y = value)) + labs(title = "Tax vs Value")
Tax does not have many distinct values. There is a point in the top right corner that could influence the model's predictions, so this should be kept in mind.
describe(boston_subset$crime)
## boston_subset$crime : Per capita crime rate
## n missing distinct Info Mean Gmd .05 .10
## 506 0 504 1 3.614 5.794 0.02791 0.03820
## .25 .50 .75 .90 .95
## 0.08205 0.25651 3.67708 10.75300 15.78915
##
## lowest : 0.00632 0.00906 0.01096 0.01301 0.01311
## highest: 45.74610 51.13580 67.92080 73.53410 88.97620
boston_subset %>% ggplot() + geom_histogram(aes(x = crime), color = "black", fill = "orange") + labs(title = "Distribution of Crime")
boston_subset %>% ggplot() + geom_point(aes(x = crime, y = value)) + labs(title = "Crime vs Value")
Crime is extremely skewed. It will have to be transformed, but I don’t believe the log transform we previously used to be the most appropriate. Instead, we will take the cube root of crime.
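As a rough check on that choice (a sketch using a nonparametric skew measure, so the two scales are comparable), the log and cube-root transforms can be compared directly; the value closer to 0 indicates the more symmetric result:

# (mean - median) / sd is a simple, scale-free symmetry measure
log_crime  <- log(boston$crime)
cbrt_crime <- boston$crime^(1/3)
c(log = (mean(log_crime) - median(log_crime)) / sd(log_crime),
  cube_root = (mean(cbrt_crime) - median(cbrt_crime)) / sd(cbrt_crime))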
boston_subset$crime_cbrt <- boston$crime^(1/3)
boston_subset %>% ggplot() + geom_histogram(aes(x = crime_cbrt), color = "black", fill = "orange") + labs(title = "Distribution of Crime Transformed")
boston_subset %>% ggplot() + geom_point(aes(x = crime_cbrt, y = value)) + labs(title = "Crime Transformed vs Value")
boston_transform <- select(boston_subset, -crime, -distance, -lstat)
Taking the cube root helped with the skew and gives us a more linear relationship with value, one of the assumptions the model must satisfy.
We will start with every variable and remove predictors based on their p-values. As mentioned earlier, because many of the variables are related, it is desirable to cut as many as possible, so we won't recalculate after each variable is removed.
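For comparison only (this is not the approach used below), an automated alternative that does refit after each removal is backward elimination with step(); a minimal sketch:

# AIC-based backward elimination; step() refits and re-evaluates after every removal,
# unlike the single-pass p-value screen used in this analysis
step_model <- step(lm(value ~ ., data = boston_transform), direction = "backward", trace = FALSE)
formula(step_model)  # the specification step() settles on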
full_model <- lm(value ~ ., data = boston_transform)
pander(summary(full_model))
|   | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | -501.8 | 264.7 | -1.896 | 0.05861 |
| longitude | -5.117 | 2.961 | -1.728 | 0.08464 |
| latitude | 4.785 | 3.248 | 1.473 | 0.1414 |
| residential | 0.009398 | 0.01146 | 0.8204 | 0.4124 |
| industrial | -0.02392 | 0.0555 | -0.431 | 0.6666 |
| riveryes | 2.154 | 0.7744 | 2.781 | 0.005628 |
| nox | -16.41 | 3.682 | -4.457 | 1.029e-05 |
| rooms | 2.386 | 0.3884 | 6.143 | 1.671e-09 |
| older | 0.02149 | 0.01242 | 1.731 | 0.08404 |
| highway | 0.3977 | 0.06939 | 5.731 | 1.742e-08 |
| tax | -0.01313 | 0.003329 | -3.944 | 9.192e-05 |
| ptratio | -0.7966 | 0.1225 | -6.503 | 1.939e-10 |
| distance_ln | -6.306 | 0.81 | -7.785 | 4.169e-14 |
| lstat_ln | -9.237 | 0.5653 | -16.34 | 3.046e-48 |
| crime_cbrt | -2.895 | 0.6012 | -4.815 | 1.964e-06 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 506 | 4.2 | 0.7973 | 0.7915 |
It looks like industrial, residential, longitude, latitude, and older can be removed. These aren't necessarily bad predictors, but because they are correlated with other predictors, they don't add much to the model. For instance, take the one-predictor model with older:
older <- lm(value ~ older, boston)
pander(summary(older))
|   | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 30.98 | 0.9991 | 31.01 | 6.814e-119 |
| older | -0.1232 | 0.01348 | -9.137 | 1.57e-18 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 506 | 8.527 | 0.1421 | 0.1404 |
This variable is easily a significant predictor when taken alone.
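Variance inflation factors give another view of this redundancy. Since car is already loaded, a quick supplementary check on the full model (not part of the original write-up) is:

# Variance inflation factors; large values (a common rule of thumb is above ~5)
# indicate a predictor that is mostly explained by the other predictors
vif(full_model)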
m2 <- lm(value ~ ., data = select(boston_transform, -older, - industrial, - latitude, - longitude, -residential))
pander(summary(m2))
|   | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 65.96 | 4.687 | 14.07 | 4.326e-38 |
| riveryes | 2.376 | 0.7592 | 3.129 | 0.001856 |
| nox | -18.34 | 3.337 | -5.495 | 6.242e-08 |
| rooms | 2.632 | 0.3702 | 7.11 | 4.08e-12 |
| highway | 0.3918 | 0.06633 | 5.906 | 6.504e-09 |
| tax | -0.01293 | 0.00295 | -4.384 | 1.421e-05 |
| ptratio | -0.8774 | 0.1085 | -8.09 | 4.637e-15 |
| distance_ln | -6.977 | 0.6637 | -10.51 | 1.811e-23 |
| lstat_ln | -8.868 | 0.518 | -17.12 | 5.656e-52 |
| crime_cbrt | -2.886 | 0.6004 | -4.807 | 2.034e-06 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 506 | 4.21 | 0.7942 | 0.7904 |
boston_final <- boston_transform %>%
select(-older, - industrial, - latitude, - longitude, -residential)
mat2 <- cor(select(boston_final, -river))
The second model looks quite good. We have eliminated a few variables without affecting R squared, our measure of prediction quality. The t value gives a rough ranking of the best predictors: the further it is from 0, the better, so the top predictors are transformed lstat, transformed distance, ptratio, and rooms.
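That ranking can be extracted directly from the model summary rather than read off the table (a small sketch):

# Predictors in m2 ranked by the absolute value of their t statistic
coefs <- summary(m2)$coefficients
sort(abs(coefs[rownames(coefs) != "(Intercept)", "t value"]), decreasing = TRUE)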
Correlation Plot (Darker Colors Indicate Correlation)
corrplot(mat2, type = "upper")
Based on the correlation plot, correlation among the predictors isn't as much of a problem as before.
We will next determine whether the model meets the assumptions required to trust the results.
diagnostics <- data.frame(m2$model, residuals = m2$residuals, fitted_values = m2$fitted.values, residuals_abs = abs(m2$residuals))
ggplot(diagnostics) + geom_point(aes(x = fitted_values, y = residuals)) + geom_hline(yintercept = 0)
qqnorm(diagnostics$residuals)
qqline(diagnostics$residuals)
ggplot(diagnostics) + geom_histogram(aes(x = residuals), color = "black", fill = "yellow")
ggplot(diagnostics) + geom_point(aes(x = fitted_values, y = residuals_abs))
ggplot(diagnostics) + geom_boxplot(aes(y = residuals, x = river))
resid_df <- data.frame(residuals = m2$residuals,town = boston$town)
test_model <- lm(residuals ~ town, resid_df)
summary(test_model)$fstatistic
## value numdf dendf
## 4.437858 91.000000 414.000000
We have already covered the first assumption, a linear relationship between the predictors and value, with the exploratory analysis.
The residuals, the differences between the predicted and actual values, should follow a symmetric distribution. They deviate somewhat from this, but it isn't a major problem because the overall model and the variable estimates are easily significant.
The residual plot should also exhibit constant spread, which it does not. At predicted values above 30, there aren’t as many points, but the difference between the highest and lowest points increases. This means, as we suspected earlier, that the model will become more inefficient at predicting higher home values.
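As a supplementary check (not part of the original diagnostics), car's ncvTest gives a formal test of non-constant error variance against the fitted values:

# Score test for non-constant variance; a small p-value is consistent with
# the fanning visible in the residual plot
ncvTest(m2)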
Lastly, the residuals shouldn't be related to one another. The most likely relation would be among residuals from the same town, so we tested whether town was a good predictor of residual size. Town did turn out to be a significant predictor, so this assumption was violated. However, this assumption matters most for data with a time-related element, e.g. stock prices. In our case, we simply missed out on a possible predictor variable. We will still not include town in the model because it would add too many variables, with each town becoming its own indicator.
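The p-value behind that conclusion can be recovered from the F statistic reported above:

# Convert the reported F statistic (residuals ~ town) into a p-value
fstat <- summary(test_model)$fstatistic
pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)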
There are some minor shortcomings, but nothing invalidates our results. The assumption we care most about, in this case, is the second, so we will attempt a couple remedial measures.
gamma <- glm(value ~ ., family = "Gamma",data = select(boston_transform, -older, - industrial, - latitude, - longitude, -residential))
diagnostics_w <-data.frame(residuals = gamma$residuals, fitted_values = gamma$fitted)
ggplot(diagnostics_w) + geom_point(aes(x = fitted_values, y = residuals)) + geom_hline(yintercept = 0)
boston_log <- boston_transform
boston_log$log_value <- log(boston_subset$value)
model_log <- lm(log_value ~ ., data = select(boston_log, -value, -older, - industrial, - latitude, - longitude, -residential))
diagnostics_log <- data.frame(model_log$model, residuals = model_log$residuals, fitted_values = model_log$fitted.values, residuals_abs = abs(model_log$residuals))
ggplot(diagnostics_log) + geom_point(aes(x = fitted_values, y = residuals)) + geom_hline(yintercept = 0)
qqnorm(diagnostics_log$residuals)
qqline(diagnostics_log$residuals)
ggplot(diagnostics_log) + geom_histogram(aes(x = residuals), color = "black", fill = "lightblue")
The first new model does appear to decrease the spread of the residuals. With this model the spread doesn't need to be constant, so the fact that the residuals "fan" to some degree isn't a problem. However, this model is more difficult to interpret, so it will be discarded. The second model doesn't look any better than the original; the spread simply increases at lower values instead.
For completeness’ sake, the final model was:
\(Value = 65.96 + 2.38 \cdot River - 18.34 \cdot Nox + 2.63 \cdot Rooms + 0.39 \cdot Highway - 0.0129 \cdot Tax - 0.8774 \cdot Ptratio - 6.98 \cdot \ln(Distance) - 8.87 \cdot \ln(Lstat) - 2.89 \cdot \sqrt[3]{Crime}\)
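As a usage sketch, the fitted model can produce a prediction for a hypothetical tract (all of the predictor values below are made up for illustration):

# A hypothetical tract; river must use the factor levels from the data ("no"/"yes")
new_tract <- data.frame(river = factor("no", levels = c("no", "yes")),
                        nox = 0.5, rooms = 6, highway = 5, tax = 300, ptratio = 18,
                        distance_ln = log(4), lstat_ln = log(10), crime_cbrt = 0.5^(1/3))
predict(m2, newdata = new_tract, interval = "prediction")  # point estimate and prediction interval, in $1000s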
However, the purpose of this analysis was to determine the quality and feasibility of a housing price model. We will, therefore, turn our attention to evaluation and possible improvements.
Within this dataset, the predictor variables did an excellent overall job of predicting the response variable, value, explaining about four-fifths of its variance. The model only faltered somewhat at the extremes. The best predictors fall into two groups: measures of neighborhood affluence, e.g. crime rate, and measures of the neighborhood's closeness to Boston, e.g. distance.
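One way to check how well that figure holds up out of sample (not done above, sketched here as a possible improvement) is a simple 10-fold cross-validation of the final specification:

# 10-fold cross-validation of the m2 specification; reports mean RMSE in $1000s
set.seed(42)
folds <- sample(rep(1:10, length.out = nrow(boston_final)))
cv_rmse <- sapply(1:10, function(k) {
  fit  <- lm(value ~ ., data = boston_final[folds != k, ])
  pred <- predict(fit, newdata = boston_final[folds == k, ])
  sqrt(mean((boston_final$value[folds == k] - pred)^2))
})
mean(cv_rmse)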
The fact that the residuals were not independent of town indicates that there is possibly something not easily measurable that contributes to property values. Perhaps it is the residents' wealth, as mentioned above, or perhaps it is something different: a characteristic that makes certain towns unique and enjoyable places to live.
If we wanted to create a current and more generalizable dataset, we would start by sampling from various areas around the country. Population density would be a good variable to add, as would an indication of whether the town is rural, suburban, or urban.