This data set concerns the task of predicting housing values in areas of Boston. The used variables and their meaning are the following:
housing <- read.table("./housing.csv", sep=",", header=TRUE)
names(housing)
## [1] "X" "CRIM" "ZN" "INDUS" "CHAS" "NOX" "RM"
## [8] "AGE" "DIS" "RAD" "TAX" "PTRATIO" "B" "LSTAT"
## [15] "MEDV"
The dataset contains 506 datapoints. The goal is to predict the response variable MEDV given the other 13 predictor variables.
Prior to fitting a model, we inspect the variables to see if the assumptions required for a linear regression model are valid. First, we plot the histogram for each variable to check if the data is skewed. If the data is skewed, transformations on the variables may be necessary.
## No id variables; using all as measure variables
We notice that:
For the binary predictor variable CHAS, 471 of 506 have value 0, and 35 of 506 have value 1. There is a high level of skew towards houses not bounding the Charles River, and as a result, this variable may not have predicting power.
B is quadratic transform on the variable Bk. The histogram of B shows the data to be skewed towards high values, which is an indication of a non-ideal data transformation. Since it is impossible to recover the Bk variable from the quadratic transform, we find this predictor variable hard to use, and are likely to not consider it in our regression model.
The crime rate in Boston is generally low (below 10 percent). A logarithmic transformation on CRIM may be appropriate.
Similarly, the proportion of residential land zoned for lots over 25,000 sq.ft. is generally small, and a logarithmic transformation on ZN may be appropriate.
We do a logarithmic transform on CRIM and ZN to obtain a more symmetric distribution on both variables.
housing$ZNlog <- log1p(housing$ZN)
hist(housing$ZNlog)
housing$CRIMlog <- log1p(housing$CRIM)
hist(housing$CRIMlog)
Next, we explore the relationship between each pair of variables to identify general patterns.
With 13 predictor variables, it is still feasible to use a plot matrix.
Upon inspecting the plot matrix, we find that:
There is a non-linear relationship between the nitric oxide concentration level NOX, with the distance from five Boston employment centres DIS. The closer the proximity to the employment centres, the higher the nitric oxide concentration. This is logical, and we take note of this relationship later on during variable selection.
The number of rooms per dwelling RM and the % lower status of the population LSTAT are strongly linearly correlated to the response variable MEDV.
NOX, RM, AGE and log(CRIM) are all highly linearly correlated, and they are also highly linearly correlated with the response variable MEDV. Correlation in predictor variables will need to be accounted for during variable selection.
To further investigate if there are potential issues concerning collinearity, we find the VIF values for each predictor variable, and compute the eigenvalues of the correlation matrix.
mask <- names(housing) %in% c("CRIM", "ZN") # Use log versions
lmfit <- lm(MEDV ~ ., data=housing[!mask])
vif(lmfit)
## X INDUS CHAS NOX RM AGE DIS
## 2.087994 3.990133 1.075679 4.482631 1.934743 3.163476 4.090739
## RAD TAX PTRATIO B LSTAT ZNlog CRIMlog
## 11.825133 8.924478 1.944074 1.389093 3.021489 2.591774 8.417859
We see a large VIF value of 11.8 for RAD, which indicates some collinearity.
mask <- names(housing) %in% c("CRIM", "ZN", "MEDV")
cor_matrix <- cor(housing[!mask])
eigen(cor_matrix)$values
## [1] 6.86635180 1.57124078 1.32818408 0.89006094 0.83778410 0.69008169
## [7] 0.44631048 0.34499270 0.30044753 0.21956438 0.18109564 0.17118193
## [13] 0.10103788 0.05166608
print(Reduce(`+`, lapply(eigen(cor_matrix)$values, function(x) 1 / x)))
## [1] 58.9393
On the other hand, analysis of the eigenvalues of the correlation matrix suggests that collinearity may not be an issue, since the sum of the reciprocals of the correlation matrix (58.9) is less than 5 times the number of predictor variables (13).
mask <- names(housing) %in% c("X", "CRIM", "ZN", "B")
housing <- housing[!mask]
full_model <- lm(MEDV~., data=housing)
summary(full_model)
##
## Call:
## lm(formula = MEDV ~ ., data = housing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.4561 -2.7975 -0.7198 1.8191 26.7751
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 40.105593 5.036065 7.964 1.16e-14 ***
## INDUS 0.017484 0.063088 0.277 0.781792
## CHAS 2.949057 0.884008 3.336 0.000914 ***
## NOX -17.011999 3.949465 -4.307 1.99e-05 ***
## RM 3.750957 0.425668 8.812 < 2e-16 ***
## AGE 0.001539 0.013471 0.114 0.909073
## DIS -1.372566 0.207449 -6.616 9.61e-11 ***
## RAD 0.297385 0.083182 3.575 0.000385 ***
## TAX -0.010612 0.003779 -2.808 0.005185 **
## PTRATIO -0.973743 0.139515 -6.979 9.60e-12 ***
## LSTAT -0.556683 0.052334 -10.637 < 2e-16 ***
## ZNlog 0.446508 0.214422 2.082 0.037823 *
## CRIMlog -1.135993 0.599132 -1.896 0.058536 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.87 on 493 degrees of freedom
## Multiple R-squared: 0.7262, Adjusted R-squared: 0.7196
## F-statistic: 109 on 12 and 493 DF, p-value: < 2.2e-16
We use the R leaps package to do variable selection. We vary the number of predictor variables, and keep the best subset of variables for each of them. We choose the best subset of variables to be the subset with the lowest Mallow’s Cp metric.
y_mask <- names(housing) %in% c("MEDV")
y <- housing$MEDV
x <- model.matrix(MEDV~ . - 1, data=housing)
best_mods <- leaps(x, y, nbest=1)
best_mods
## $which
## 1 2 3 4 5 6 7 8 9 A B C
## 1 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
## 2 FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
## 3 FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE
## 4 FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE TRUE TRUE FALSE FALSE
## 5 FALSE FALSE TRUE TRUE FALSE TRUE FALSE FALSE TRUE TRUE FALSE FALSE
## 6 FALSE TRUE TRUE TRUE FALSE TRUE FALSE FALSE TRUE TRUE FALSE FALSE
## 7 FALSE TRUE TRUE TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE FALSE
## 8 FALSE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE
## 9 FALSE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
## 10 FALSE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## 11 TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## 12 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##
## $label
## [1] "(Intercept)" "1" "2" "3" "4"
## [6] "5" "6" "7" "8" "9"
## [11] "A" "B" "C"
##
## $size
## [1] 2 3 4 5 6 7 8 9 10 11 12 13
##
## $Cp
## [1] 318.930543 150.901409 80.754196 61.713424 31.688605 19.848457
## [7] 17.182177 12.804450 10.653929 9.090095 11.013057 13.000000
The model with the lowest Mallow’s Cp metric is one with the following 10 predictor variables: CHAS, NOX,RM, DIS, RAD, TAX, PTRATIO, LSTAT, ZNlog, CRIMlog.
We fit this model:
reduced_best_model <- lm(MEDV ~ CHAS + NOX + RM + DIS + RAD + TAX + PTRATIO + LSTAT + ZNlog + CRIMlog, data=housing)
summary(reduced_best_model)
##
## Call:
## lm(formula = MEDV ~ CHAS + NOX + RM + DIS + RAD + TAX + PTRATIO +
## LSTAT + ZNlog + CRIMlog, data = housing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.4691 -2.8075 -0.7518 1.8227 26.8442
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.984191 5.006758 7.986 9.83e-15 ***
## CHAS 2.979747 0.876323 3.400 0.000728 ***
## NOX -16.607619 3.665109 -4.531 7.36e-06 ***
## RM 3.749926 0.415017 9.036 < 2e-16 ***
## DIS -1.391137 0.192947 -7.210 2.11e-12 ***
## RAD 0.291202 0.080415 3.621 0.000323 ***
## TAX -0.010153 0.003397 -2.988 0.002944 **
## PTRATIO -0.967787 0.137725 -7.027 7.02e-12 ***
## LSTAT -0.553724 0.049326 -11.226 < 2e-16 ***
## ZNlog 0.437923 0.212035 2.065 0.039412 *
## CRIMlog -1.130336 0.597598 -1.891 0.059146 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.861 on 495 degrees of freedom
## Multiple R-squared: 0.7262, Adjusted R-squared: 0.7207
## F-statistic: 131.3 on 10 and 495 DF, p-value: < 2.2e-16