Boston Housing

This data set concerns the task of predicting housing values in areas of Boston. The used variables and their meaning are the following:

  1. CRIM per capita crime rate by town
  2. ZN proportion of residential land zoned for lots over 25,000 sq.ft.
  3. INDUS proportion of non-retail business acres per town
  4. CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
  5. NOX nitric oxides concentration (parts per 10 million)
  6. RM average number of rooms per dwelling
  7. AGE proportion of owner-occupied units built prior to 1940
  8. DIS weighted distances to five Boston employment centres
  9. RAD index of accessibility to radial highways
  10. TAX full-value property-tax rate per $10,000
  11. PTRATIO pupil-teacher ratio by town
  12. B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
  13. LSTAT % lower status of the population
  14. MEDV Median value of owner-occupied homes in $1000’s
housing <- read.table("./housing.csv", sep=",", header=TRUE)
names(housing)
##  [1] "X"       "CRIM"    "ZN"      "INDUS"   "CHAS"    "NOX"     "RM"     
##  [8] "AGE"     "DIS"     "RAD"     "TAX"     "PTRATIO" "B"       "LSTAT"  
## [15] "MEDV"

The dataset contains 506 datapoints. The goal is to predict the response variable MEDV given the other 13 predictor variables.

Exploratory Data Analysis

Prior to fitting a model, we inspect the variables to see if the assumptions required for a linear regression model are valid. First, we plot the histogram for each variable to check if the data is skewed. If the data is skewed, transformations on the variables may be necessary.

## No id variables; using all as measure variables

We notice that:

  1. For the binary predictor variable CHAS, 471 of 506 have value 0, and 35 of 506 have value 1. There is a high level of skew towards houses not bounding the Charles River, and as a result, this variable may not have predicting power.

  2. B is quadratic transform on the variable Bk. The histogram of B shows the data to be skewed towards high values, which is an indication of a non-ideal data transformation. Since it is impossible to recover the Bk variable from the quadratic transform, we find this predictor variable hard to use, and are likely to not consider it in our regression model.

  3. The crime rate in Boston is generally low (below 10 percent). A logarithmic transformation on CRIM may be appropriate.

  4. Similarly, the proportion of residential land zoned for lots over 25,000 sq.ft. is generally small, and a logarithmic transformation on ZN may be appropriate.

We do a logarithmic transform on CRIM and ZN to obtain a more symmetric distribution on both variables.

housing$ZNlog <- log1p(housing$ZN)
hist(housing$ZNlog)

housing$CRIMlog <- log1p(housing$CRIM)
hist(housing$CRIMlog)

Next, we explore the relationship between each pair of variables to identify general patterns.

With 13 predictor variables, it is still feasible to use a plot matrix.

Upon inspecting the plot matrix, we find that:

  1. There is a non-linear relationship between the nitric oxide concentration level NOX, with the distance from five Boston employment centres DIS. The closer the proximity to the employment centres, the higher the nitric oxide concentration. This is logical, and we take note of this relationship later on during variable selection.

  2. The number of rooms per dwelling RM and the % lower status of the population LSTAT are strongly linearly correlated to the response variable MEDV.

  3. NOX, RM, AGE and log(CRIM) are all highly linearly correlated, and they are also highly linearly correlated with the response variable MEDV. Correlation in predictor variables will need to be accounted for during variable selection.

To further investigate if there are potential issues concerning collinearity, we find the VIF values for each predictor variable, and compute the eigenvalues of the correlation matrix.

mask <- names(housing) %in% c("CRIM", "ZN") # Use log versions
lmfit <- lm(MEDV ~ ., data=housing[!mask])
vif(lmfit)
##         X     INDUS      CHAS       NOX        RM       AGE       DIS 
##  2.087994  3.990133  1.075679  4.482631  1.934743  3.163476  4.090739 
##       RAD       TAX   PTRATIO         B     LSTAT     ZNlog   CRIMlog 
## 11.825133  8.924478  1.944074  1.389093  3.021489  2.591774  8.417859

We see a large VIF value of 11.8 for RAD, which indicates some collinearity.

mask <- names(housing) %in% c("CRIM", "ZN", "MEDV")
cor_matrix <- cor(housing[!mask])
eigen(cor_matrix)$values
##  [1] 6.86635180 1.57124078 1.32818408 0.89006094 0.83778410 0.69008169
##  [7] 0.44631048 0.34499270 0.30044753 0.21956438 0.18109564 0.17118193
## [13] 0.10103788 0.05166608
print(Reduce(`+`, lapply(eigen(cor_matrix)$values, function(x) 1 / x)))
## [1] 58.9393

On the other hand, analysis of the eigenvalues of the correlation matrix suggests that collinearity may not be an issue, since the sum of the reciprocals of the correlation matrix (58.9) is less than 5 times the number of predictor variables (13).

Fitting the Full Model

mask <- names(housing) %in% c("X", "CRIM", "ZN", "B")
housing <- housing[!mask]
full_model <- lm(MEDV~., data=housing)
summary(full_model)
## 
## Call:
## lm(formula = MEDV ~ ., data = housing)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.4561  -2.7975  -0.7198   1.8191  26.7751 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  40.105593   5.036065   7.964 1.16e-14 ***
## INDUS         0.017484   0.063088   0.277 0.781792    
## CHAS          2.949057   0.884008   3.336 0.000914 ***
## NOX         -17.011999   3.949465  -4.307 1.99e-05 ***
## RM            3.750957   0.425668   8.812  < 2e-16 ***
## AGE           0.001539   0.013471   0.114 0.909073    
## DIS          -1.372566   0.207449  -6.616 9.61e-11 ***
## RAD           0.297385   0.083182   3.575 0.000385 ***
## TAX          -0.010612   0.003779  -2.808 0.005185 ** 
## PTRATIO      -0.973743   0.139515  -6.979 9.60e-12 ***
## LSTAT        -0.556683   0.052334 -10.637  < 2e-16 ***
## ZNlog         0.446508   0.214422   2.082 0.037823 *  
## CRIMlog      -1.135993   0.599132  -1.896 0.058536 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.87 on 493 degrees of freedom
## Multiple R-squared:  0.7262, Adjusted R-squared:  0.7196 
## F-statistic:   109 on 12 and 493 DF,  p-value: < 2.2e-16

Variable Selection

We use the R leaps package to do variable selection. We vary the number of predictor variables, and keep the best subset of variables for each of them. We choose the best subset of variables to be the subset with the lowest Mallow’s Cp metric.

y_mask <- names(housing) %in% c("MEDV")
y <- housing$MEDV
x <- model.matrix(MEDV~ . - 1, data=housing)
best_mods <- leaps(x, y, nbest=1)
best_mods
## $which
##        1     2     3     4     5     6     7     8     9    A     B     C
## 1  FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
## 2  FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
## 3  FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE TRUE FALSE FALSE
## 4  FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE TRUE FALSE FALSE
## 5  FALSE FALSE  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE TRUE FALSE FALSE
## 6  FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE TRUE FALSE FALSE
## 7  FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE TRUE  TRUE FALSE
## 8  FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE TRUE FALSE FALSE
## 9  FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE TRUE  TRUE FALSE
## 10 FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE TRUE  TRUE  TRUE
## 11  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE TRUE  TRUE  TRUE
## 12  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE TRUE  TRUE  TRUE
## 
## $label
##  [1] "(Intercept)" "1"           "2"           "3"           "4"          
##  [6] "5"           "6"           "7"           "8"           "9"          
## [11] "A"           "B"           "C"          
## 
## $size
##  [1]  2  3  4  5  6  7  8  9 10 11 12 13
## 
## $Cp
##  [1] 318.930543 150.901409  80.754196  61.713424  31.688605  19.848457
##  [7]  17.182177  12.804450  10.653929   9.090095  11.013057  13.000000

The model with the lowest Mallow’s Cp metric is one with the following 10 predictor variables: CHAS, NOX,RM, DIS, RAD, TAX, PTRATIO, LSTAT, ZNlog, CRIMlog.

We fit this model:

reduced_best_model <- lm(MEDV ~ CHAS + NOX + RM + DIS + RAD + TAX + PTRATIO + LSTAT + ZNlog + CRIMlog, data=housing)
summary(reduced_best_model)
## 
## Call:
## lm(formula = MEDV ~ CHAS + NOX + RM + DIS + RAD + TAX + PTRATIO + 
##     LSTAT + ZNlog + CRIMlog, data = housing)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.4691  -2.8075  -0.7518   1.8227  26.8442 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  39.984191   5.006758   7.986 9.83e-15 ***
## CHAS          2.979747   0.876323   3.400 0.000728 ***
## NOX         -16.607619   3.665109  -4.531 7.36e-06 ***
## RM            3.749926   0.415017   9.036  < 2e-16 ***
## DIS          -1.391137   0.192947  -7.210 2.11e-12 ***
## RAD           0.291202   0.080415   3.621 0.000323 ***
## TAX          -0.010153   0.003397  -2.988 0.002944 ** 
## PTRATIO      -0.967787   0.137725  -7.027 7.02e-12 ***
## LSTAT        -0.553724   0.049326 -11.226  < 2e-16 ***
## ZNlog         0.437923   0.212035   2.065 0.039412 *  
## CRIMlog      -1.130336   0.597598  -1.891 0.059146 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.861 on 495 degrees of freedom
## Multiple R-squared:  0.7262, Adjusted R-squared:  0.7207 
## F-statistic: 131.3 on 10 and 495 DF,  p-value: < 2.2e-16