Boston Housing Dataset

The Boston housing dataset contains the median value of owner-occupied homes along with 13 other variables that could potentially be related to housing prices. These include factors such as socio-economic conditions, environmental conditions and educational facilities. There are 506 observations on 14 variables, including the median home price. Of the 13 predictors, 12 are numerical and one (chas, the Charles River dummy) is categorical. The aim of this project is to build a linear regression model to estimate the median price of owner-occupied homes in Boston.

Summary Statistics

The first step before performing any modeling is to look at a high-level summary of the dataset, understand the variables, and look for any abnormalities or hidden patterns in the data. We examine the summary statistics and the distribution of the observations for each variable. There are no missing values in the data.

library(glmnet)
library(MASS)
library(dplyr)
library(GGally)
data(Boston)
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
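
As a quick check on the claim that there are no missing values, we can count the NAs in each column (a minimal sketch):

colSums(is.na(Boston))  # number of missing values per column; all zeros for this dataset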

Let’s look at the variation in the values of the variables in the dataset. This is achieved by plotting a boxplot of all the variables.

boxplot(Boston)

We see that there are outliers in the variables crim, zn and black. The highest variability is observed in tax, the full-value property tax rate.

We are interested in answering questions such as whether a higher property tax implies a higher median home price. We plot the correlation matrix to be able to answer such questions.

ggcorr(Boston, nbreaks=6, palette='RdGy', label=TRUE, label_size=5, label_color='white')

There is a strong negative correlation between the percentage of lower-status population (lstat) and the median house price, and a strong positive correlation between the average number of rooms (rm) and the median price. The plot also suggests that the higher the property tax, the lower the median price of the house.

Building a Regression Model

Let’s perform a linear regression analysis on the dataset. The variable of interest here is medv; our main objective is to predict medv from the other independent variables in the dataset. We start with a multiple linear regression model using all the variables.

##Regression model
set.seed(12366915)
index <- sample(nrow(Boston),nrow(Boston)*0.80)
train <- Boston[index,]
test <- Boston[-index,]
mlr <- lm(medv~.,data=train)
summary(mlr)
## 
## Call:
## lm(formula = medv ~ ., data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.665  -2.827  -0.567   2.018  26.141 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  38.310626   5.663422   6.765 4.91e-11 ***
## crim         -0.103719   0.035930  -2.887 0.004109 ** 
## zn            0.045214   0.015293   2.957 0.003300 ** 
## indus        -0.006526   0.069470  -0.094 0.925211    
## chas          2.759550   0.948693   2.909 0.003836 ** 
## nox         -15.961795   4.243101  -3.762 0.000195 ***
## rm            3.676306   0.455704   8.067 8.96e-15 ***
## age          -0.004209   0.014083  -0.299 0.765180    
## dis          -1.532915   0.214933  -7.132 4.84e-12 ***
## rad           0.267926   0.073362   3.652 0.000296 ***
## tax          -0.010923   0.004309  -2.535 0.011640 *  
## ptratio      -0.993846   0.145964  -6.809 3.73e-11 ***
## black         0.008142   0.003067   2.655 0.008254 ** 
## lstat        -0.520933   0.053917  -9.662  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.653 on 390 degrees of freedom
## Multiple R-squared:  0.7548, Adjusted R-squared:  0.7466 
## F-statistic: 92.33 on 13 and 390 DF,  p-value: < 2.2e-16

The summary suggests that the variables indus and age should be dropped, since their high p-values indicate they are not significant. The variables lstat and rm are the most significant.
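
As a quick sanity check (separate from the selection procedures below), we could refit the model without indus and age and compare the adjusted R-square; the object name mlr_reduced is just illustrative:

mlr_reduced <- lm(medv ~ . - indus - age, data = train)  # drop the two insignificant predictors
summary(mlr_reduced)$adj.r.squared                       # compare with the full model's 0.7466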

Variable Selection

Now we want to select which variables should be present in the final model. We perform variable selection using several techniques.

Best Subset Selection

Here we apply the best subset selection approach to the Boston data. We wish to predict the median value of a house (medv) from the other variables in the dataset.

We decide on the number of variables required in the model using BIC, adjusted R-squared and Mallows’ Cp.

library(leaps)
regs_model <- regsubsets(medv~.,train,nvmax = 13 )
regs_summ <- summary(regs_model)
regs_summ$bic
##  [1] -311.5275 -406.4929 -456.8011 -468.6637 -487.8519 -492.3544 -494.4685
##  [8] -494.4398 -492.9798 -494.5191 -495.7083 -489.7991 -483.8069
regs_summ$adjr2
##  [1] 0.5499215 0.6485755 0.6935325 0.7060538 0.7231279 0.7295542 0.7342633
##  [8] 0.7375002 0.7397757 0.7439378 0.7478134 0.7472261 0.7465837
regs_summ$cp
##  [1] 313.96975 158.08587  87.73760  68.81369  42.83822  33.67827  27.25237
##  [8]  23.15843  20.58486  15.10326  10.09791  12.00882  14.00000

We can see that the model with 11 variables has the lowest BIC and Cp as well as the highest adjusted R-square value. Best subset selection therefore also suggests dropping the variables indus and age.

Adjusted R-square value: 0.7478
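
To read the best model size directly from the criteria rather than by eye, we can ask which entry optimizes each one and extract the coefficients of that model, using the regs_summ and regs_model objects above:

which.min(regs_summ$bic)    # size with the lowest BIC
which.min(regs_summ$cp)     # size with the lowest Cp
which.max(regs_summ$adjr2)  # size with the highest adjusted R-squared
coef(regs_model, 11)        # coefficients of the 11-variable model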

Forward Selection

Forward stepwise selection is a computationally efficient alternative to best subset selection. While best subset selection considers all 2^p possible models containing subsets of the p predictors, forward stepwise selection considers a much smaller set, adding the single most useful predictor at each step (fitting about 1 + p(p+1)/2 models in total).

nullmodel <- lm(medv~1, data = train)
fullmodel <- lm(medv~., data = train)
model.step.f<- step(nullmodel, scope=list(lower=nullmodel, upper=fullmodel), direction='forward')
summary(model.step.f)
## 
## Call:
## lm(formula = medv ~ lstat + rm + ptratio + dis + nox + chas + 
##     black + zn + crim + rad + tax, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.6654  -2.8045  -0.5478   2.0366  25.9836 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  38.486454   5.613851   6.856 2.77e-11 ***
## lstat        -0.526805   0.050377 -10.457  < 2e-16 ***
## rm            3.649630   0.442425   8.249 2.45e-15 ***
## ptratio      -0.997898   0.144439  -6.909 1.99e-11 ***
## dis          -1.511341   0.202103  -7.478 4.99e-13 ***
## nox         -16.380386   3.929948  -4.168 3.78e-05 ***
## chas          2.735509   0.939021   2.913  0.00378 ** 
## black         0.008107   0.003056   2.653  0.00830 ** 
## zn            0.046075   0.014929   3.086  0.00217 ** 
## crim         -0.103586   0.035834  -2.891  0.00406 ** 
## rad           0.271856   0.068768   3.953 9.15e-05 ***
## tax          -0.011179   0.003755  -2.977  0.00309 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.641 on 392 degrees of freedom
## Multiple R-squared:  0.7547, Adjusted R-squared:  0.7478 
## F-statistic: 109.6 on 11 and 392 DF,  p-value: < 2.2e-16

This method also suggests that we drop the variables indus and age. Adjusted R-square value: 0.7478

Backward Elimination

Like forward stepwise selection, backward stepwise selection provides an efficient alternative to best subset selection. However, unlike forward stepwise selection, it begins with the full least squares model containing all p predictors, and then iteratively removes the least useful predictor, one-at-a-time.

model.step.b<- step(fullmodel,direction='backward')
summary(model.step.b)
## 
## Call:
## lm(formula = medv ~ crim + zn + chas + nox + rm + dis + rad + 
##     tax + ptratio + black + lstat, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.6654  -2.8045  -0.5478   2.0366  25.9836 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  38.486454   5.613851   6.856 2.77e-11 ***
## crim         -0.103586   0.035834  -2.891  0.00406 ** 
## zn            0.046075   0.014929   3.086  0.00217 ** 
## chas          2.735509   0.939021   2.913  0.00378 ** 
## nox         -16.380386   3.929948  -4.168 3.78e-05 ***
## rm            3.649630   0.442425   8.249 2.45e-15 ***
## dis          -1.511341   0.202103  -7.478 4.99e-13 ***
## rad           0.271856   0.068768   3.953 9.15e-05 ***
## tax          -0.011179   0.003755  -2.977  0.00309 ** 
## ptratio      -0.997898   0.144439  -6.909 1.99e-11 ***
## black         0.008107   0.003056   2.653  0.00830 ** 
## lstat        -0.526805   0.050377 -10.457  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.641 on 392 degrees of freedom
## Multiple R-squared:  0.7547, Adjusted R-squared:  0.7478 
## F-statistic: 109.6 on 11 and 392 DF,  p-value: < 2.2e-16

This method also suggests that we drop indus and age. Adjusted R-square value: 0.7478

Stepwise Selection

Stepwise selection has the flexibility to both add and remove variables in the process of selecting the best regression model.

model.step.s<- step(nullmodel, scope=list(lower=nullmodel, upper=fullmodel), direction='both')
summary(model.step.s)
## 
## Call:
## lm(formula = medv ~ lstat + rm + ptratio + dis + nox + chas + 
##     black + zn + crim + rad + tax, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.6654  -2.8045  -0.5478   2.0366  25.9836 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  38.486454   5.613851   6.856 2.77e-11 ***
## lstat        -0.526805   0.050377 -10.457  < 2e-16 ***
## rm            3.649630   0.442425   8.249 2.45e-15 ***
## ptratio      -0.997898   0.144439  -6.909 1.99e-11 ***
## dis          -1.511341   0.202103  -7.478 4.99e-13 ***
## nox         -16.380386   3.929948  -4.168 3.78e-05 ***
## chas          2.735509   0.939021   2.913  0.00378 ** 
## black         0.008107   0.003056   2.653  0.00830 ** 
## zn            0.046075   0.014929   3.086  0.00217 ** 
## crim         -0.103586   0.035834  -2.891  0.00406 ** 
## rad           0.271856   0.068768   3.953 9.15e-05 ***
## tax          -0.011179   0.003755  -2.977  0.00309 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.641 on 392 degrees of freedom
## Multiple R-squared:  0.7547, Adjusted R-squared:  0.7478 
## F-statistic: 109.6 on 11 and 392 DF,  p-value: < 2.2e-16

Using stepwise selection as well, the suggested model remains the same, dropping variables indus and age. Adjusted R-square value: 0.7478

Lasso Regression

We now use the shrinkage method lasso to build a model. Before performing the lasso, we need to standardize the variables.

Boston.X.std<- scale(select(Boston, -medv))
X.train<- as.matrix(Boston.X.std)[index,]
X.test<-  as.matrix(Boston.X.std)[-index,]
Y.train<- Boston[index, "medv"]
Y.test<- Boston[-index, "medv"]

To use the lasso, we first need to find a value for the tuning parameter lambda. Using cross-validation, we examine the error-versus-lambda plot and record both the lambda with the lowest cross-validation error (lambda.min) and the largest lambda whose error is within one standard error of that minimum (lambda.1se).

cv.lasso<- cv.glmnet(x=X.train, y=Y.train, family = "gaussian", alpha = 1, nfolds = 10)
plot(cv.lasso)

min <- cv.lasso$lambda.min
se <- cv.lasso$lambda.1se
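
The two candidate values of lambda can be printed for reference:

c(lambda.min = min, lambda.1se = se)  # the minimum-error and one-standard-error choices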

We then build the model using lambda.1se, since it gives a less complex model at the cost of at most one standard error of additional cross-validation error.

lasso.fit<- glmnet(x=X.train, y=Y.train, family = "gaussian", alpha = 1)
plot(lasso.fit, xvar = "lambda")

coef(lasso.fit,s=se)
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                      1
## (Intercept) 22.7688737
## crim        -0.2984214
## zn           0.2243296
## indus       -0.1666811
## chas         0.6019586
## nox         -0.8193901
## rm           2.8285363
## age          .        
## dis         -1.5187074
## rad          .        
## tax          .        
## ptratio     -1.9292898
## black        0.5784803
## lstat       -3.6979675

We can see that the lasso suggests using 10 variables, dropping age, rad and tax. We refit a linear model with these 10 variables and check the adjusted R-square value.

newm <- lm(medv~crim+zn+indus+chas+nox+rm+dis+ptratio+black+lstat,data=train)
summary(newm)
## 
## Call:
## lm(formula = medv ~ crim + zn + indus + chas + nox + rm + dis + 
##     ptratio + black + lstat, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.1158  -2.9325  -0.6984   1.7042  27.4448 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  32.618961   5.442690   5.993 4.66e-09 ***
## crim         -0.059905   0.032799  -1.826 0.068546 .  
## zn            0.039223   0.014522   2.701 0.007215 ** 
## indus        -0.076108   0.061309  -1.241 0.215208    
## chas          3.092902   0.952847   3.246 0.001271 ** 
## nox         -13.701480   3.953674  -3.466 0.000588 ***
## rm            3.861382   0.446823   8.642  < 2e-16 ***
## dis          -1.487419   0.209155  -7.112 5.46e-12 ***
## ptratio      -0.890015   0.135657  -6.561 1.69e-10 ***
## black         0.007265   0.003086   2.354 0.019069 *  
## lstat        -0.523339   0.051319 -10.198  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.718 on 393 degrees of freedom
## Multiple R-squared:  0.7459, Adjusted R-squared:  0.7394 
## F-statistic: 115.4 on 10 and 393 DF,  p-value: < 2.2e-16

Adjusted R-square value: 0.7394

Final model

We select the final model suggested by the lasso method. It gives a less complex model at the expense of a slight decrease in the adjusted R-square value. Hence, the final model contains the following 10 variables: crim, zn, indus, chas, nox, rm, dis, ptratio, black and lstat.

Residual Diagnostics

Residual diagnostics need to be performed to check the validity of the assumptions made in building the regression model. We draw the residual plots for this purpose.

par(mfrow = c(2, 2))  # arrange the four diagnostic plots in one grid
plot(newm)

We find the following:

From the residual plots, it can be inferred that i) the variance is not constant, but a transformation of the response could address this; ii) the Q-Q plot suggests roughly normal residuals with a skewed tail; iii) there does not appear to be any autocorrelation; and iv) there are no influential outliers.
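
As an illustration of such a transformation (a sketch only, not part of the final model), we could model log(medv) instead and inspect the diagnostics again; the name newm_log is just illustrative, and its adjusted R-square is on the log scale, so it is not directly comparable to the earlier values:

newm_log <- lm(log(medv) ~ crim + zn + indus + chas + nox + rm + dis +
                 ptratio + black + lstat, data = train)  # same predictors, log-transformed response
plot(newm_log)  # recheck the residual plots for more constant variance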

Model Validation

Now we fit the selected model on the test data set and gauge its performance, again using the adjusted R-square value.

newm.test <- lm(medv~crim+zn+indus+chas+nox+rm+dis+ptratio+black+lstat,data=test)
summary(newm.test)
## 
## Call:
## lm(formula = medv ~ crim + zn + indus + chas + nox + rm + dis + 
##     ptratio + black + lstat, data = test)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.5884  -2.2590  -0.3518   1.6487  25.9339 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  18.896375  11.885410   1.590  0.11533    
## crim         -0.082773   0.083491  -0.991  0.32412    
## zn            0.054759   0.034765   1.575  0.11870    
## indus         0.096901   0.139405   0.695  0.48876    
## chas          2.319858   2.179154   1.065  0.28989    
## nox         -18.060296   8.206593  -2.201  0.03029 *  
## rm            5.015898   1.101839   4.552 1.64e-05 ***
## dis          -1.492888   0.517085  -2.887  0.00486 ** 
## ptratio      -0.576319   0.291111  -1.980  0.05076 .  
## black         0.009574   0.006255   1.531  0.12934    
## lstat        -0.600352   0.150205  -3.997  0.00013 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.314 on 91 degrees of freedom
## Multiple R-squared:  0.6894, Adjusted R-squared:  0.6553 
## F-statistic:  20.2 on 10 and 91 DF,  p-value: < 2.2e-16

Adjusted R-square value: 0.6553
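
A complementary way to gauge out-of-sample performance is to keep the model fit on the training data and predict the held-out test observations; a minimal sketch, with pred_test and mspe as illustrative names:

pred_test <- predict(newm, newdata = test)  # predictions from the training-data fit
mspe <- mean((test$medv - pred_test)^2)     # mean squared prediction error on the test set
mspe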