For this study, I am trying to determine which factors influence the median home value of a tract of land, and whether median value can be predicted. This could help prospective home buyers narrow their search by finding undervalued areas that fulfill their buying requirements. The data also has clear business potential: it could be used by a real estate developer looking to buy into undervalued areas, or to determine which of their assets are overvalued. We will be evaluating the viability of such a model.
The data was obtained from http://biostat.mc.vanderbilt.edu/DataSets
The data was originally used for the paper “Hedonic Housing Prices and the Demand for Clean Air”, which is over 40 years old. It is somewhat unclear how the data was collected, but the authors cite several people who contributed to the data collection. It originated from many sources, such as city records and environmental research. Each observation indicates a tract of land in the Boston area and there are 506 total observations. Every variable in the dataset will be considered, but several are of particular interest.
Variables of interest, with descriptions from the dataset provider:
- CRIME: per capita crime rate by town
- RIVER: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX: nitric oxides concentration (parts per 10 million)
- RM: average number of rooms per dwelling
- AGE: proportion of owner-occupied units built prior to 1940
- DIS: weighted distances to five Boston employment centres
- RAD: index of accessibility to radial highways
- TAX: full-value property-tax rate per $10,000
- PTRATIO: pupil-teacher ratio by town
- LSTAT: % lower status of the population
- MEDV: median value of owner-occupied homes in $1000's
Though the data is quite old, results from this analysis will still be valuable. Many of the features haven't changed much or at all, such as the size of houses and the relative location of cities. It still wouldn't be prudent to use the model to predict housing prices today, but it can tell us which variables are important and whether a model built on current data would be viable. This was an observational study because there was no randomization and no splitting of the observations into treatment/control groups. The observations are not independent because they all share the same metropolitan area: every observation is within commuting distance of Boston and has a high population density. As such, the results cannot be generalized to other parts of the state, and especially not to other states.
Even though some of these variables might be good predictors of value, a causal link cannot be established; establishing causation would require treatment and control groups. There is an intuitive reason for this. Predictors like rooms and crime rate are indirectly measuring neighborhood affluence. A neighborhood full of large but run-down homes near a polluted factory would not have higher home values than a neighborhood of medium-sized but new and well-maintained homes near a top school system. Even though the neighborhood with more rooms will generally have the higher value, that would not be the case here. Affluence is the driver, and the other variables are just correlated with it.
This dataset has a fair amount of variables, so we will start by getting an overall sense of it, allowing us to narrow the focus.
library(Hmisc)
library(corrplot)
library(nlme)
library(tidyverse)
library(car)
library(pander)
getHdata(boston)  # download the Boston housing data from the Vanderbilt biostatistics site
boston_subset <- boston %>%
  select(-(1:2), -cmedv, -black)  # drop the two ID columns, the corrected median value, and the black variable
pander(summary(boston_subset))
| longitude | latitude | value | crime |
|---|---|---|---|
| Min. :-71.29 | Min. :42.03 | Min. : 5.00 | Min. : 0.00632 |
| 1st Qu.:-71.09 | 1st Qu.:42.18 | 1st Qu.:17.02 | 1st Qu.: 0.08205 |
| Median :-71.05 | Median :42.22 | Median :21.20 | Median : 0.25651 |
| Mean :-71.06 | Mean :42.22 | Mean :22.53 | Mean : 3.61352 |
| 3rd Qu.:-71.02 | 3rd Qu.:42.25 | 3rd Qu.:25.00 | 3rd Qu.: 3.67708 |
| Max. :-70.81 | Max. :42.38 | Max. :50.00 | Max. :88.97620 |
| residential | industrial | river | nox | rooms |
|---|---|---|---|---|
| Min. : 0.00 | Min. : 0.46 | no :471 | Min. :0.3850 | Min. :3.561 |
| 1st Qu.: 0.00 | 1st Qu.: 5.19 | yes: 35 | 1st Qu.:0.4490 | 1st Qu.:5.886 |
| Median : 0.00 | Median : 9.69 | NA | Median :0.5380 | Median :6.208 |
| Mean : 11.36 | Mean :11.14 | NA | Mean :0.5547 | Mean :6.285 |
| 3rd Qu.: 12.50 | 3rd Qu.:18.10 | NA | 3rd Qu.:0.6240 | 3rd Qu.:6.623 |
| Max. :100.00 | Max. :27.74 | NA | Max. :0.8710 | Max. :8.780 |
| older | distance | highway | tax |
|---|---|---|---|
| Min. : 2.90 | Min. : 1.130 | Min. : 1.000 | Min. :187.0 |
| 1st Qu.: 45.02 | 1st Qu.: 2.100 | 1st Qu.: 4.000 | 1st Qu.:279.0 |
| Median : 77.50 | Median : 3.207 | Median : 5.000 | Median :330.0 |
| Mean : 68.57 | Mean : 3.795 | Mean : 9.549 | Mean :408.2 |
| 3rd Qu.: 94.08 | 3rd Qu.: 5.188 | 3rd Qu.:24.000 | 3rd Qu.:666.0 |
| Max. :100.00 | Max. :12.127 | Max. :24.000 | Max. :711.0 |
| ptratio | lstat |
|---|---|
| Min. :12.60 | Min. : 1.73 |
| 1st Qu.:17.40 | 1st Qu.: 6.95 |
| Median :19.05 | Median :11.36 |
| Mean :18.46 | Mean :12.65 |
| 3rd Qu.:20.20 | 3rd Qu.:16.95 |
| Max. :22.00 | Max. :37.97 |
Home values are capped at $50,000. Crime jumps out as not having a symmetric distribution, with a mean of about 3.6 but a maximum of nearly 89.
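As a quick numeric check (a minimal sketch, not part of the original summary), we can count how many tracts sit exactly at the cap and look at the mean-to-median ratio that signals crime's skew:

# Tracts whose median value is recorded at the $50k cap
sum(boston_subset$value == 50)
# A mean far above the median is a symptom of crime's heavy right skew
mean(boston_subset$crime) / median(boston_subset$crime)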
mat <- cor(select(boston_subset, -river))
Correlation Plot (Darker Colors Indicate Correlation)
corrplot(mat, type = "upper")
Rooms and lstat have the strongest correlations with value. Crime has a somewhat strong correlation, while distance may not be a strong predictor, possibly because of affluent suburbs located somewhat far from Boston. Many of the possible predictor variables appear to be related to one another, which concerns us for a couple of reasons. First, the p-value, one of the measures that tells us whether a variable offers utility, could be compromised. Second, adding data or removing variables could significantly change the model. We will attempt to remedy this by including as few variables as possible in the final model.
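To make the plot's message concrete, the correlations with value can be pulled straight out of the matrix computed above (a small supplementary check):

# Correlation of each numeric variable with value, from most negative to most positive
round(sort(mat[, "value"]), 2)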
describe(boston$value)
## boston$value : Median value of owner-occupied homes / $1000
## n missing distinct Info Mean Gmd .05 .10
## 506 0 229 1 22.53 9.778 10.20 12.75
## .25 .50 .75 .90 .95
## 17.03 21.20 25.00 34.80 43.40
##
## lowest : 5.0 5.6 6.3 7.0 7.2, highest: 46.7 48.3 48.5 48.8 50.0
boston_subset %>% ggplot() + geom_histogram(aes(x = value), color = "black", fill = "darkblue", binwidth = 1)
Home value is somewhat right skewed, with a fairly long tail, which means that predictions at the extremes could suffer.
describe(boston_subset$rooms)
## boston_subset$rooms : Average no. rooms per dwelling
## n missing distinct Info Mean Gmd .05 .10
## 506 0 446 1 6.285 0.7515 5.314 5.593
## .25 .50 .75 .90 .95
## 5.886 6.209 6.623 7.151 7.588
##
## lowest : 3.561 3.863 4.138 4.368 4.519, highest: 8.375 8.398 8.704 8.725 8.780
boston_subset %>% ggplot() + geom_histogram(aes(x = rooms), color = "black", fill = "darkgreen") + labs(title = "Distribution of Rooms")
boston_subset %>% ggplot() + geom_point(aes(x = rooms, y = value)) + labs(title = "Rooms vs Value")
This will likely be the best predictor variable. It has a good linear relationship with home value, though that relationship deteriorates somewhat at the extremes.
describe(boston_subset$distance)
## boston_subset$distance : Weighted distances to five Boston employment centers
## n missing distinct Info Mean Gmd .05 .10
## 506 0 412 1 3.795 2.298 1.462 1.628
## .25 .50 .75 .90 .95
## 2.100 3.207 5.188 6.817 7.828
##
## lowest : 1.1296 1.1370 1.1691 1.1742 1.1781
## highest: 9.2203 9.2229 10.5857 10.7103 12.1265
boston_subset %>% ggplot() + geom_histogram(aes(x = distance), color = "black", fill = "purple") + labs(title = "Distribution of Distance")
boston_subset$distance_ln <- log(boston$distance)
boston_subset %>% ggplot() + geom_histogram(aes(x = distance_ln), color = "black", fill = "purple") + labs(title = "Distribution of Log Distance")
boston_subset %>% ggplot() + geom_point(aes(x = distance_ln, y = value)) + labs(title = "Distance vs Value")
Performing the log transform, taking the natural logarithm of distance, has made the distribution more symmetric, which is very desirable. There is a noticeable relationship with value, especially after the transform.
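To quantify this rather than judge it from the histograms alone, a rough skewness helper (a standardized third moment, defined here for illustration and not part of any loaded package) can be applied before and after the transform:

# Rough skewness: standardized third moment; values near 0 indicate symmetry
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3
c(distance = skew(boston_subset$distance), distance_ln = skew(boston_subset$distance_ln))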
describe(boston_subset$lstat)
## boston_subset$lstat : % Population in lower socio-economic status
## n missing distinct Info Mean Gmd .05 .10
## 506 0 455 1 12.65 7.881 3.708 4.680
## .25 .50 .75 .90 .95
## 6.950 11.360 16.955 23.035 26.807
##
## lowest : 1.73 1.92 1.98 2.47 2.87, highest: 34.37 34.41 34.77 36.98 37.97
boston_subset %>% ggplot() + geom_histogram(aes(x = lstat), color = "black", fill = "yellow") + labs(title = "Distribution of Lstat")
boston_subset$lstat_ln <- log(boston$lstat)
boston_subset %>% ggplot() + geom_histogram(aes(x = lstat_ln), color = "black", fill = "yellow") + labs(title = "Distribution of Log Lstat")
boston_subset %>% ggplot() + geom_point(aes(x = lstat_ln, y = value)) + labs(title = "Log Lstat vs Value")
Again, we have a skewed variable where the transformation helps. This relationship is almost as strong as that of room count. There is additional indication here that our predictions at higher value levels will suffer.
describe(boston_subset$tax)
## boston_subset$tax : Full-value property tax rate per $10,000
## n missing distinct Info Mean Gmd .05 .10
## 506 0 66 0.981 408.2 181.7 222 233
## .25 .50 .75 .90 .95
## 279 330 666 666 666
##
## lowest : 187 188 193 198 216, highest: 432 437 469 666 711
boston_subset %>% ggplot() + geom_histogram(aes(x = tax), color = "black", fill = "brown") + labs(title = "Distribution of Tax")
boston_subset %>% ggplot() + geom_point(aes(x = tax, y = value)) + labs(title = "Tax vs Value")
Tax does not have many distinct values. There is a point in the top right corner that could influence the model's predictions, so this should be kept in mind.
describe(boston_subset$crime)
## boston_subset$crime : Per capita crime rate
## n missing distinct Info Mean Gmd .05 .10
## 506 0 504 1 3.614 5.794 0.02791 0.03820
## .25 .50 .75 .90 .95
## 0.08205 0.25651 3.67708 10.75300 15.78915
##
## lowest : 0.00632 0.00906 0.01096 0.01301 0.01311
## highest: 45.74610 51.13580 67.92080 73.53410 88.97620
boston_subset %>% ggplot() + geom_histogram(aes(x = crime), color = "black", fill = "orange") + labs(title = "Distribution of Crime")
boston_subset %>% ggplot() + geom_point(aes(x = crime, y = value)) + labs(title = "Crime vs Value")
Crime is extremely skewed. It will have to be transformed, but I don’t believe the log transform we previously used to be the most appropriate. Instead, we will take the cube root of crime.
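As a rough check on that choice (a sketch using a nonparametric skew measure, so the two scales are comparable), the log and cube-root transforms can be compared directly; the value closer to 0 indicates the more symmetric result:

# (mean - median) / sd is a simple, scale-free symmetry measure
log_crime  <- log(boston$crime)
cbrt_crime <- boston$crime^(1/3)
c(log = (mean(log_crime) - median(log_crime)) / sd(log_crime),
  cube_root = (mean(cbrt_crime) - median(cbrt_crime)) / sd(cbrt_crime))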
boston_subset$crime_cbrt <- boston$crime^(1/3)
boston_subset %>% ggplot() + geom_histogram(aes(x = crime_cbrt), color = "black", fill = "orange") + labs(title = "Distribution of Crime Transformed")
boston_subset %>% ggplot() + geom_point(aes(x = crime_cbrt, y = value)) + labs(title = "Crime Transformed vs Value")
boston_transform <- select(boston_subset, -crime, -distance, -lstat)
Taking the cube root helped with the skew and gives us a more linear relationship with value, one of the assumptions the model must satisfy.
We will start with every variable and remove predictors based on their p-values. As mentioned earlier, because many of the variables are related, it is desirable to cut as many as possible, so we won't recalculate after each variable is removed.
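For comparison only (this is not the approach used below), an automated alternative that does refit after each removal is backward elimination with step(); a minimal sketch:

# AIC-based backward elimination; step() refits and re-evaluates after every removal,
# unlike the single-pass p-value screen used in this analysis
step_model <- step(lm(value ~ ., data = boston_transform), direction = "backward", trace = FALSE)
formula(step_model)  # the specification step() settles on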
full_model <- lm(value ~ ., data = boston_transform)
pander(summary(full_model))
|   | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | -501.8 | 264.7 | -1.896 | 0.05861 |
| longitude | -5.117 | 2.961 | -1.728 | 0.08464 |
| latitude | 4.785 | 3.248 | 1.473 | 0.1414 |
| residential | 0.009398 | 0.01146 | 0.8204 | 0.4124 |
| industrial | -0.02392 | 0.0555 | -0.431 | 0.6666 |
| riveryes | 2.154 | 0.7744 | 2.781 | 0.005628 |
| nox | -16.41 | 3.682 | -4.457 | 1.029e-05 |
| rooms | 2.386 | 0.3884 | 6.143 | 1.671e-09 |
| older | 0.02149 | 0.01242 | 1.731 | 0.08404 |
| highway | 0.3977 | 0.06939 | 5.731 | 1.742e-08 |
| tax | -0.01313 | 0.003329 | -3.944 | 9.192e-05 |
| ptratio | -0.7966 | 0.1225 | -6.503 | 1.939e-10 |
| distance_ln | -6.306 | 0.81 | -7.785 | 4.169e-14 |
| lstat_ln | -9.237 | 0.5653 | -16.34 | 3.046e-48 |
| crime_cbrt | -2.895 | 0.6012 | -4.815 | 1.964e-06 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 506 | 4.2 | 0.7973 | 0.7915 |
It looks like industrial, residential, longitude, latitude, and older can be removed. These aren't necessarily bad predictors, but because they are correlated with other predictors, they don't add much to the model. For instance, take the one-predictor model with older:
older <- lm(value ~ older, boston)
pander(summary(older))
|   | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 30.98 | 0.9991 | 31.01 | 6.814e-119 |
| older | -0.1232 | 0.01348 | -9.137 | 1.57e-18 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 506 | 8.527 | 0.1421 | 0.1404 |
This variable is easily a significant predictor when taken alone.
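Variance inflation factors give another view of this redundancy. Since car is already loaded, a quick supplementary check on the full model (not part of the original write-up) is:

# Variance inflation factors; large values (a common rule of thumb is above ~5)
# indicate a predictor that is mostly explained by the other predictors
vif(full_model)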
m2 <- lm(value ~ ., data = select(boston_transform, -older, - industrial, - latitude, - longitude, -residential))
pander(summary(m2))
|   | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 65.96 | 4.687 | 14.07 | 4.326e-38 |
| riveryes | 2.376 | 0.7592 | 3.129 | 0.001856 |
| nox | -18.34 | 3.337 | -5.495 | 6.242e-08 |
| rooms | 2.632 | 0.3702 | 7.11 | 4.08e-12 |
| highway | 0.3918 | 0.06633 | 5.906 | 6.504e-09 |
| tax | -0.01293 | 0.00295 | -4.384 | 1.421e-05 |
| ptratio | -0.8774 | 0.1085 | -8.09 | 4.637e-15 |
| distance_ln | -6.977 | 0.6637 | -10.51 | 1.811e-23 |
| lstat_ln | -8.868 | 0.518 | -17.12 | 5.656e-52 |
| crime_cbrt | -2.886 | 0.6004 | -4.807 | 2.034e-06 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 506 | 4.21 | 0.7942 | 0.7904 |
boston_final <- boston_transform %>%
select(-older, - industrial, - latitude, - longitude, -residential)
mat2 <- cor(select(boston_final, -river))
The second model looks quite good. We have eliminated a few variables without affecting R squared, our measure of prediction quality. The t value gives a rough ranking of the best predictors: the further it is from 0, the better, so the top predictors are transformed lstat, transformed distance, ptratio, and rooms.
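That ranking can be extracted directly from the model summary rather than read off the table (a small sketch):

# Predictors in m2 ranked by the absolute value of their t statistic
coefs <- summary(m2)$coefficients
sort(abs(coefs[rownames(coefs) != "(Intercept)", "t value"]), decreasing = TRUE)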
Correlation Plot (Darker Colors Indicate Correlation)
corrplot(mat2, type = "upper")
Based on the correlation plot, correlation among the predictors isn't as much of a problem as before.
We will next determine whether the model meets the assumptions required to trust the results.
diagnostics <- data.frame(m2$model, residuals = m2$residuals, fitted_values = m2$fitted.values, residuals_abs = abs(m2$residuals))
ggplot(diagnostics) + geom_point(aes(x = fitted_values, y = residuals)) + geom_hline(yintercept = 0)
qqnorm(diagnostics$residuals)
qqline(diagnostics$residuals)
ggplot(diagnostics) + geom_histogram(aes(x = residuals), color = "black", fill = "yellow")
ggplot(diagnostics) + geom_point(aes(x = fitted_values, y = residuals_abs))
ggplot(diagnostics) + geom_boxplot(aes(y = residuals, x = river))
resid_df <- data.frame(residuals = m2$residuals,town = boston$town)
test_model <- lm(residuals ~ town, resid_df)
summary(test_model)$fstatistic
## value numdf dendf
## 4.437858 91.000000 414.000000
We have already covered the first assumption, a linear relationship between the predictors and value, with the exploratory analysis.
The residuals, the differences between the predicted and actual values, should follow a symmetric distribution. They deviate somewhat from this, but it isn't a major problem because the overall model and the variable estimates are easily significant.
The residual plot should also exhibit constant spread, which it does not. At predicted values above 30, there aren’t as many points, but the difference between the highest and lowest points increases. This means, as we suspected earlier, that the model will become more inefficient at predicting higher home values.
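As a supplementary check (not part of the original diagnostics), car's ncvTest gives a formal test of non-constant error variance against the fitted values:

# Score test for non-constant variance; a small p-value is consistent with
# the fanning visible in the residual plot
ncvTest(m2)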
Lastly, the residuals shouldn't be related to one another. The most likely relation would be among residuals from the same town, so we tested whether town was a good predictor of residual size. Town did turn out to be a significant predictor, so this assumption was violated. However, this assumption matters most for data with a time-related element, e.g. stock prices. In our case, we simply missed out on a possible predictor variable. We will still not include town in the model because it would add too many variables, with each town becoming its own indicator.
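The p-value behind that conclusion can be recovered from the F statistic reported above:

# Convert the reported F statistic (residuals ~ town) into a p-value
fstat <- summary(test_model)$fstatistic
pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)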
There are some minor shortcomings, but nothing invalidates our results. The assumption we care most about, in this case, is the second, so we will attempt a couple remedial measures.
gamma <- glm(value ~ ., family = "Gamma",data = select(boston_transform, -older, - industrial, - latitude, - longitude, -residential))
diagnostics_w <-data.frame(residuals = gamma$residuals, fitted_values = gamma$fitted)
ggplot(diagnostics_w) + geom_point(aes(x = fitted_values, y = residuals)) + geom_hline(yintercept = 0)
boston_log <- boston_transform
boston_log$log_value <- log(boston_subset$value)
model_log <- lm(log_value ~ ., data = select(boston_log, -value, -older, - industrial, - latitude, - longitude, -residential))
diagnostics_log <- data.frame(model_log$model, residuals = model_log$residuals, fitted_values = model_log$fitted.values, residuals_abs = abs(model_log$residuals))
ggplot(diagnostics_log) + geom_point(aes(x = fitted_values, y = residuals)) + geom_hline(yintercept = 0)
qqnorm(diagnostics_log$residuals)
qqline(diagnostics_log$residuals)
ggplot(diagnostics_log) + geom_histogram(aes(x = residuals), color = "black", fill = "lightblue")
The first new model does appear to decrease the spread of the residuals. With this model the spread doesn't need to be constant, so the fact that the residuals "fan" to some degree isn't a problem. However, this model is more difficult to interpret, so it will be discarded. The second model doesn't look any better than the original; the spread simply increases at lower values instead.
For completeness’ sake, the final model was:
\(Value = 65.96 + 2.38 \cdot River - 18.34 \cdot Nox + 2.63 \cdot Rooms + 0.39 \cdot Highway - 0.0129 \cdot Tax - 0.8774 \cdot Ptratio - 6.98 \cdot \ln(Distance) - 8.87 \cdot \ln(Lstat) - 2.89 \cdot \sqrt[3]{Crime}\)
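As a usage sketch, the fitted model can produce a prediction for a hypothetical tract (all of the predictor values below are made up for illustration):

# A hypothetical tract; river must use the factor levels from the data ("no"/"yes")
new_tract <- data.frame(river = factor("no", levels = c("no", "yes")),
                        nox = 0.5, rooms = 6, highway = 5, tax = 300, ptratio = 18,
                        distance_ln = log(4), lstat_ln = log(10), crime_cbrt = 0.5^(1/3))
predict(m2, newdata = new_tract, interval = "prediction")  # point estimate and prediction interval, in $1000s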
However, the purpose of this analysis was to determine the quality and feasibility of a housing price model. We will, therefore, turn our attention to evaluation and possible improvements.
Within this dataset, the predictor variables did an excellent overall job of predicting the response variable, value, explaining about four-fifths of its variance. The model only faltered somewhat at the extremes. The best predictors fall into two groups: measures of neighborhood affluence, e.g. crime rate, and measures of the neighborhood's closeness to Boston, e.g. distance.
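One way to check how well that figure holds up out of sample (not done above, sketched here as a possible improvement) is a simple 10-fold cross-validation of the final specification:

# 10-fold cross-validation of the m2 specification; reports mean RMSE in $1000s
set.seed(42)
folds <- sample(rep(1:10, length.out = nrow(boston_final)))
cv_rmse <- sapply(1:10, function(k) {
  fit  <- lm(value ~ ., data = boston_final[folds != k, ])
  pred <- predict(fit, newdata = boston_final[folds == k, ])
  sqrt(mean((boston_final$value[folds == k] - pred)^2))
})
mean(cv_rmse)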
The fact that the residuals were not independent of town indicates that there is possibly something not easily measurable that contributes to property values. Perhaps it is the residents' wealth, as mentioned above, or perhaps it is something different: a characteristic that makes certain towns unique and enjoyable places to live.
If we wanted to create a current and more generalizable dataset, we would start by sampling from various areas around the country. Population density would be a good variable to add, as would an indication of whether the town is rural, suburban, or urban.