The data to be analyzed were collected by Harrison and Rubinfeld in 1978 to investigate whether clean air influenced the value of houses in Boston. Their results are documented in the paper Hedonic prices and the demand for clean air, published in the Journal of Environmental Economics and Management 5, 81–102.
This report examines the influence of several neighborhood attributes on housing prices, in an attempt to identify the most suitable explanatory variables. The specific attributes considered are proximity to the Charles River, distance to the main employment centers, pupil-teacher ratio in schools, and levels of crime. Whereas the original study focused on air pollution, using nitrogen oxide concentrations as an explanatory variable, this report examines whether there are other, better explanatory variables for the median value of houses in Boston. The R programming language is used to conduct this analysis.
The data consist of 506 observations on 14 non-constant variables, listed here along with their meanings:
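- crim: per capita crime rate by town
- zn: proportion of residential land zoned for lots over 25,000 sq. ft.
- indus: proportion of non-retail business acres per town
- chas: Charles River dummy variable (1 if the tract bounds the river; 0 otherwise)
- nox: nitrogen oxides concentration (parts per 10 million)
- rm: average number of rooms per dwelling
- age: proportion of owner-occupied units built prior to 1940
- dis: weighted distances to five Boston employment centers
- rad: index of accessibility to radial highways
- tax: full-value property-tax rate per $10,000
- ptratio: pupil-teacher ratio by town
- black: 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town
- lstat: percentage of the population of lower socioeconomic status
- medv: median value of owner-occupied homes in $1000s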
Of these, medv is the response variable while the other 13 variables are possible predictors. The goal of this analysis is to fit a regression model that best explains the variation in medv. First, the individual variables will be analyzed to assess whether the data are skewed in ways that would affect the accuracy of the model; some variables may require transformations to fit better. Next, a full regression model will be fitted with all 13 predictors, applying transformations as appropriate. Some predictors may explain this variation well, while others may not. Furthermore, some predictors may be highly correlated, which causes multicollinearity: when predictors are correlated, it is unclear which of them is most responsible for variation in the response. To address this, variance inflation factors will be computed to quantify the effect of multicollinearity, and predictors will be eliminated as necessary, keeping in mind the ultimate goal of finding the best predictors.
After eliminating weak predictors, a residual analysis will be performed using quantile plots, scatterplots, and histograms to check that the model satisfies the assumptions of linear regression, namely that the errors are normally distributed with constant variance. This will also identify potential outliers that affect the model's accuracy. Of particular interest are the outliers that exert a large influence on the regression; these influential observations will be identified using Cook's distance and removed. From the model thus obtained, inferences can be drawn about the data and conclusions made about the most important factors explaining the variation in housing prices.
First, the data are loaded and summarized to obtain a broad overview of the variables.
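The data are available as the Boston data frame in the MASS package, consistent with the data = Boston calls in the model output later in this report; a minimal loading step:

library(MASS)    # provides the Boston data frame
summary(Boston)  # five-number summary and mean for each variable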
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
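A sketch of how the pairwise correlations discussed next can be computed (the report's original call is not shown):

round(cor(Boston), 2)        # full correlation matrix of all 14 variables
sort(cor(Boston)[, "medv"])  # correlations of each variable with medv, sorted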
The highest correlations are those between indus and nox, between tax and rad, and between tax and indus. It makes sense that nitrogen oxide levels, as well as tax levels, are highest near industrial areas. These are possible sources of multicollinearity, since correlated predictors can end up explaining the same portion of the variation in medv.
As for medv itself, the average number of rooms (rm) has the highest positive correlation with it, while the pupil-teacher ratio (ptratio) and lstat have the strongest negative correlations.
Now it is necessary to examine the data graphically to understand how they are distributed and what transformations can be applied to improve the model.
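The report's figures are not reproduced here, but histograms along the following lines reveal the skew discussed below:

par(mfrow = c(2, 3))  # arrange six panels in a 2 x 3 grid
for (v in c("medv", "crim", "dis", "nox", "zn", "ptratio")) {
  hist(Boston[[v]], main = v, xlab = v)
}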
The right-skewed distribution of medv suggests that a log transformation would be appropriate. Similarly, the variables crim, dis, nox, and zn are right skewed, making log transformations appropriate for them as well. The left-skewed distribution of ptratio suggests that squaring it could produce a better fit.
For the initial regression fit, the log transformation will be applied only to medv, to determine whether it does indeed provide a better fit.
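The model is fitted as follows, matching the Call shown in the summary below:

fit <- lm(log(medv) ~ ., data = Boston)  # all 13 predictors, log-transformed response
summary(fit)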
##
## Call:
## lm(formula = log(medv) ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.73361 -0.09747 -0.01657 0.09629 0.86435
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.1020423 0.2042726 20.081 < 2e-16 ***
## crim -0.0102715 0.0013155 -7.808 3.52e-14 ***
## zn 0.0011725 0.0005495 2.134 0.033349 *
## indus 0.0024668 0.0024614 1.002 0.316755
## chas 0.1008876 0.0344859 2.925 0.003598 **
## nox -0.7783993 0.1528902 -5.091 5.07e-07 ***
## rm 0.0908331 0.0167280 5.430 8.87e-08 ***
## age 0.0002106 0.0005287 0.398 0.690567
## dis -0.0490873 0.0079834 -6.149 1.62e-09 ***
## rad 0.0142673 0.0026556 5.373 1.20e-07 ***
## tax -0.0006258 0.0001505 -4.157 3.80e-05 ***
## ptratio -0.0382715 0.0052365 -7.309 1.10e-12 ***
## black 0.0004136 0.0001075 3.847 0.000135 ***
## lstat -0.0290355 0.0020299 -14.304 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1899 on 492 degrees of freedom
## Multiple R-squared: 0.7896, Adjusted R-squared: 0.7841
## F-statistic: 142.1 on 13 and 492 DF, p-value: < 2.2e-16
Without the log transformation, an adjusted R-squared of only 0.7338 is obtained:
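The comparison fit, matching the Call below:

summary(lm(medv ~ ., data = Boston))  # same predictors, untransformed response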
##
## Call:
## lm(formula = medv ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.595 -2.730 -0.518 1.777 26.199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***
## crim -1.080e-01 3.286e-02 -3.287 0.001087 **
## zn 4.642e-02 1.373e-02 3.382 0.000778 ***
## indus 2.056e-02 6.150e-02 0.334 0.738288
## chas 2.687e+00 8.616e-01 3.118 0.001925 **
## nox -1.777e+01 3.820e+00 -4.651 4.25e-06 ***
## rm 3.810e+00 4.179e-01 9.116 < 2e-16 ***
## age 6.922e-04 1.321e-02 0.052 0.958229
## dis -1.476e+00 1.995e-01 -7.398 6.01e-13 ***
## rad 3.060e-01 6.635e-02 4.613 5.07e-06 ***
## tax -1.233e-02 3.760e-03 -3.280 0.001112 **
## ptratio -9.527e-01 1.308e-01 -7.283 1.31e-12 ***
## black 9.312e-03 2.686e-03 3.467 0.000573 ***
## lstat -5.248e-01 5.072e-02 -10.347 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
## F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
Thus, the log transformation of medv is indeed appropriate. Next, predictors that contribute little can be removed from the model: indus and age are not statistically significant, and zn is only marginally so. Moreover, for the purposes of this report, the variables tax and rad are not of interest, since they are highly correlated with indus, which is itself not significant.
Adjusting the model appropriately:
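The reduced model is refitted, matching the Call in the summary below (the object is named fit here, as it is reused by vif() later):

fit <- lm(log(medv) ~ crim + chas + nox + rm + dis + ptratio + black + lstat,
          data = Boston)
summary(fit)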
##
## Call:
## lm(formula = log(medv) ~ crim + chas + nox + rm + dis + ptratio +
## black + lstat, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.70579 -0.09972 -0.01652 0.09180 0.90262
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.7964008 0.1965807 19.312 < 2e-16 ***
## crim -0.0080244 0.0012107 -6.628 8.89e-11 ***
## chas 0.1186049 0.0350353 3.385 0.000767 ***
## nox -0.6880284 0.1304358 -5.275 1.99e-07 ***
## rm 0.1100190 0.0162770 6.759 3.90e-11 ***
## dis -0.0426965 0.0066075 -6.462 2.47e-10 ***
## ptratio -0.0356303 0.0045174 -7.887 1.98e-14 ***
## black 0.0003696 0.0001084 3.408 0.000708 ***
## lstat -0.0286950 0.0019509 -14.709 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.195 on 497 degrees of freedom
## Multiple R-squared: 0.776, Adjusted R-squared: 0.7724
## F-statistic: 215.2 on 8 and 497 DF, p-value: < 2.2e-16
Now the variance inflation factors (VIFs) are computed to assess multicollinearity.
library(car)  # provides vif()
vif(fit)      # one VIF per remaining predictor
## crim chas nox rm dis ptratio black lstat
## 1.440228 1.051599 3.033782 1.736921 2.570795 1.270191 1.301781 2.577443
As all values are below the conventional rule-of-thumb threshold of 5, multicollinearity is not a concern for this model.
Next, it would be helpful to check for outliers and influential observations.
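The outlierTest function from the car package (already loaded above) tests the studentized residuals with a Bonferroni correction, producing the output below:

outlierTest(fit)  # Bonferroni test for outliers based on studentized residuals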
## rstudent unadjusted p-value Bonferonni p
## 413 4.859758 1.5792e-06 0.00079908
## 372 4.454721 1.0390e-05 0.00525750
## 369 4.122466 4.3937e-05 0.02223200
## 373 4.099549 4.8363e-05 0.02447100
Four outliers are found using outlierTest. Cook's distance is now used to find the observations that have a large influence on the regression:
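A sketch of the Cook's distance computation; the object name cook matches the subset argument in the refit that follows:

cook <- cooks.distance(fit)                       # one distance per observation
plot(cook, type = "h", ylab = "Cook's distance")  # tall spikes mark influential points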
Several influential observations stand out. The most extreme, the observation with the largest Cook's distance, is removed and the model refitted:
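The refit, matching the Call shown below, excludes the observation with the largest Cook's distance:

fit <- lm(log(medv) ~ crim + chas + nox + rm + dis + ptratio + black + lstat,
          data = Boston, subset = cook < max(cook))
summary(fit)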
##
## Call:
## lm(formula = log(medv) ~ crim + chas + nox + rm + dis + ptratio +
## black + lstat, data = Boston, subset = cook < max(cook))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.71852 -0.09763 -0.01779 0.08913 0.89921
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.7978969 0.1952989 19.447 < 2e-16 ***
## crim -0.0067158 0.0012937 -5.191 3.05e-07 ***
## chas 0.1190275 0.0348071 3.420 0.000679 ***
## nox -0.6943724 0.1296054 -5.358 1.29e-07 ***
## rm 0.1092422 0.0161733 6.754 4.03e-11 ***
## dis -0.0424294 0.0065651 -6.463 2.46e-10 ***
## ptratio -0.0361419 0.0044918 -8.046 6.35e-15 ***
## black 0.0004091 0.0001087 3.764 0.000187 ***
## lstat -0.0288853 0.0019394 -14.894 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1937 on 496 degrees of freedom
## Multiple R-squared: 0.7739, Adjusted R-squared: 0.7703
## F-statistic: 212.2 on 8 and 496 DF, p-value: < 2.2e-16
Now, residual diagnostics can be performed to check that the errors have constant variance, are normally distributed, and are independent. For this purpose, quantile plots, histograms, and scatterplots will be used.
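The normality checks can be produced along these lines (the report's figures are not reproduced here):

par(mfrow = c(1, 2))
qqnorm(residuals(fit))  # points should fall near a straight line under normality
qqline(residuals(fit))
hist(residuals(fit), main = "", xlab = "Residuals")  # should look roughly bell-shaped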
The errors appear to be normally distributed.
Checking for constant variance using a Fitted vs Residual Plot:
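A sketch of such a plot:

plot(fitted(fit), residuals(fit),
     xlab = "Fitted values", ylab = "Residuals")  # spread should be even across x
abline(h = 0, lty = 2)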
The constant variance assumption appears to be satisfied. Removing the outliers and insignificant predictors improved the model, as did the log transformation: before the transformation, there was excessive variance towards the right of the plot, which is no longer present. Further transformations could be applied to individual predictors as described earlier, but a good fit has already been obtained and interesting insights can be drawn from the data. Finally, backward selection by AIC is performed to check whether a better model can be found.
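Backward selection uses step from base R, producing the trace below:

step(fit, direction = "backward")  # drop terms one at a time while AIC improves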
## Start: AIC=-1648.76
## log(medv) ~ crim + chas + nox + rm + dis + ptratio + black +
## lstat
##
## Df Sum of Sq RSS AIC
## <none> 18.616 -1648.8
## - chas 1 0.4389 19.055 -1639.0
## - black 1 0.5317 19.148 -1636.5
## - crim 1 1.0115 19.628 -1624.0
## - nox 1 1.0773 19.694 -1622.3
## - dis 1 1.5677 20.184 -1609.9
## - rm 1 1.7124 20.329 -1606.3
## - ptratio 1 2.4299 21.046 -1588.8
## - lstat 1 8.3258 26.942 -1464.1
As shown above, dropping any further predictor would increase the AIC, so no more terms can be removed without sacrificing model quality.
From the model, the first thing that can be interpreted is that the average number of rooms is positively associated with house price, which makes sense and is to be expected. Second, bounding the Charles River is also associated with higher prices; it is reasonable that more people would want to live close to the river for the view on offer, and that this should raise house prices. Similarly, the negative associations with crime rate and pupil-teacher ratio are to be expected: people prefer to live in areas with less crime and schools with low pupil-teacher ratios, and the increased demand for houses in such areas drives prices up. What is most interesting for the purposes of this report is how distance to the main employment centers and nitrogen oxide levels influence house prices. Both associations are negative, and this is explored further in the conclusion.
The goal of this report was to determine the neighborhood attributes that best explain variation in house prices. Various statistical techniques were used to eliminate predictors and extraneous observations. Examining the final model, one finds, quite reasonably, that house prices are higher in areas with lower crime and lower pupil-teacher ratios. House prices also tend to be higher closer to the Charles River, and houses with more rooms are pricier; since this report is concerned with neighborhood attributes, however, the number of rooms is not of primary interest. The most interesting factors to consider are nitrogen oxide levels and distance to the main employment centers. On the one hand, people would want to live close to their place of employment; yet it is reasonable to suggest that pollution levels are higher closer to these employment centers, and not just nitrogen oxide levels but noise pollution as well. The fitted regression model shows that higher levels of pollution decrease house prices to a greater extent than distance to the employment centers does. This suggests that people would prefer to live further from their place of employment if it meant lower levels of pollution, which is an interesting point to consider. On a concluding note, the data for this report were collected several decades ago; pollution levels have no doubt risen in the years since, and it would be interesting to examine how that affects house prices in Boston today.
Harrison, D. and Rubinfeld, D.L. (1978). Hedonic prices and the demand for clean air. Journal of Environmental Economics and Management 5, 81–102.
Belsley, D.A., Kuh, E. and Welsch, R.E. (1980). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. New York: Wiley.
Faraway, J. Linear Models with R. Chapman & Hall/CRC Texts in Statistical Science.