The Boston housing dataset contains the median value of owner-occupied homes along with 13 other variables that could potentially be related to housing prices, such as socio-economic conditions, environmental conditions, and educational facilities. There are 506 observations on 14 variables, including the median house price. Of the 13 predictors, 12 are numerical and 1 (chas) is categorical. The aim of this project is to build a linear regression model to estimate the median price of owner-occupied homes in Boston.
The first step before any modeling is to look at a high-level summary of the dataset, understand the variables, and check for abnormalities or hidden patterns. We examine summary statistics and the distribution of each variable’s observations. There are no missing values in the data.
library(glmnet)  # lasso and cross-validation
library(MASS)    # Boston dataset
library(dplyr)   # data manipulation (select)
library(GGally)  # ggcorr correlation plot
data(Boston)
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
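As a quick sanity check, the claim that there are no missing values can be verified directly:
# count NA cells across the entire dataset; a result of 0 confirms none are missing
sum(is.na(Boston))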
Let’s look at the variation in the values of the variables in the dataset by plotting a boxplot of all the variables.
boxplot(Boston)
We see that there are outliers in the variables crim, zn, and black. The highest variability is observed in tax, the full-value property tax rate.
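Because the variables are on very different scales, the raw boxplot is dominated by tax and black. A variant worth considering (not part of the original analysis) is to standardize the columns first so the spreads are directly comparable:
# standardized boxplot: puts all variables on a common scale for comparison
boxplot(scale(Boston), las = 2)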
We are interested in answering questions such as: does a higher property tax imply a higher median home price? We plot the correlation matrix to answer such questions.
ggcorr(Boston, nbreaks=6, palette='RdGy', label=TRUE, label_size=5, label_color='white')
There is a strong negative correlation between the percentage of lower-status population (lstat) and the median house price, and a strong positive correlation between the number of rooms (rm) and the median house price. The matrix also suggests that the higher the property tax, the lower the median house price.
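The individual coefficients behind these statements can also be pulled out directly (a quick check; the exact values are not printed in the original):
# pairwise correlations of selected predictors with the response
cor(Boston[, c("lstat", "rm", "tax")], Boston$medv)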
Let’s perform a linear regression analysis on the dataset. The variable of interest here is medv; our main objective is to predict the value of medv from the other independent variables in the dataset. Here we fit a multiple linear regression model using all the variables.
## Regression model
set.seed(12366915)                                   # reproducible split
index <- sample(nrow(Boston), nrow(Boston) * 0.80)   # 80/20 train-test split
train <- Boston[index, ]
test <- Boston[-index, ]
mlr <- lm(medv ~ ., data = train)                    # full model, all predictors
summary(mlr)
##
## Call:
## lm(formula = medv ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.665 -2.827 -0.567 2.018 26.141
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.310626 5.663422 6.765 4.91e-11 ***
## crim -0.103719 0.035930 -2.887 0.004109 **
## zn 0.045214 0.015293 2.957 0.003300 **
## indus -0.006526 0.069470 -0.094 0.925211
## chas 2.759550 0.948693 2.909 0.003836 **
## nox -15.961795 4.243101 -3.762 0.000195 ***
## rm 3.676306 0.455704 8.067 8.96e-15 ***
## age -0.004209 0.014083 -0.299 0.765180
## dis -1.532915 0.214933 -7.132 4.84e-12 ***
## rad 0.267926 0.073362 3.652 0.000296 ***
## tax -0.010923 0.004309 -2.535 0.011640 *
## ptratio -0.993846 0.145964 -6.809 3.73e-11 ***
## black 0.008142 0.003067 2.655 0.008254 **
## lstat -0.520933 0.053917 -9.662 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.653 on 390 degrees of freedom
## Multiple R-squared: 0.7548, Adjusted R-squared: 0.7466
## F-statistic: 92.33 on 13 and 390 DF, p-value: < 2.2e-16
The summary suggests that the variables indus and age should be dropped, since they are insignificant given their high p-values. The variables lstat and rm have the highest significance.
Now we want to select which variables should be present in the final model. We perform variable selection using multiple techniques.
First, we apply the best subset selection approach to the Boston data. We wish to predict the median house value from the other variables in the dataset.
We decide on the number of variables required in the model using BIC, adjusted R-squared, and Mallows’ Cp.
library(leaps)
regs_model <- regsubsets(medv ~ ., data = train, nvmax = 13)
regs_summ <- summary(regs_model)
regs_summ$bic
## [1] -311.5275 -406.4929 -456.8011 -468.6637 -487.8519 -492.3544 -494.4685
## [8] -494.4398 -492.9798 -494.5191 -495.7083 -489.7991 -483.8069
regs_summ$adjr2
## [1] 0.5499215 0.6485755 0.6935325 0.7060538 0.7231279 0.7295542 0.7342633
## [8] 0.7375002 0.7397757 0.7439378 0.7478134 0.7472261 0.7465837
regs_summ$cp
## [1] 313.96975 158.08587 87.73760 68.81369 42.83822 33.67827 27.25237
## [8] 23.15843 20.58486 15.10326 10.09791 12.00882 14.00000
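Rather than scanning these vectors by eye, the optimum of each criterion, and the variables in the corresponding model, can be located directly:
# model size favoured by each criterion (all three point to 11 variables)
which.min(regs_summ$bic)
which.max(regs_summ$adjr2)
which.min(regs_summ$cp)
# coefficients of the best 11-variable model
coef(regs_model, 11)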
We can see that the model with 11 variables has the lowest BIC and Cp as well as the highest adjusted R-square value. Best subset selection also suggests dropping the variables indus and age.
Adjusted R-square value: 0.7478
Forward stepwise selection is a computationally efficient alternative to best subset selection. While the best subset selection procedure considers all 2^p possible models containing subsets of the p predictors, forward stepwise considers a much smaller set of models.
nullmodel <- lm(medv ~ 1, data = train)
fullmodel <- lm(medv ~ ., data = train)
model.step.f <- step(nullmodel, scope = list(lower = nullmodel, upper = fullmodel), direction = 'forward')
summary(model.step.f)
##
## Call:
## lm(formula = medv ~ lstat + rm + ptratio + dis + nox + chas +
## black + zn + crim + rad + tax, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.6654 -2.8045 -0.5478 2.0366 25.9836
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.486454 5.613851 6.856 2.77e-11 ***
## lstat -0.526805 0.050377 -10.457 < 2e-16 ***
## rm 3.649630 0.442425 8.249 2.45e-15 ***
## ptratio -0.997898 0.144439 -6.909 1.99e-11 ***
## dis -1.511341 0.202103 -7.478 4.99e-13 ***
## nox -16.380386 3.929948 -4.168 3.78e-05 ***
## chas 2.735509 0.939021 2.913 0.00378 **
## black 0.008107 0.003056 2.653 0.00830 **
## zn 0.046075 0.014929 3.086 0.00217 **
## crim -0.103586 0.035834 -2.891 0.00406 **
## rad 0.271856 0.068768 3.953 9.15e-05 ***
## tax -0.011179 0.003755 -2.977 0.00309 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.641 on 392 degrees of freedom
## Multiple R-squared: 0.7547, Adjusted R-squared: 0.7478
## F-statistic: 109.6 on 11 and 392 DF, p-value: < 2.2e-16
This method also suggests that we drop the variables indus and age. Adjusted R-square value: 0.7478
Like forward stepwise selection, backward stepwise selection provides an efficient alternative to best subset selection. However, unlike forward stepwise selection, it begins with the full least squares model containing all p predictors and then iteratively removes the least useful predictor, one at a time.
model.step.b <- step(fullmodel, direction = 'backward')
summary(model.step.b)
##
## Call:
## lm(formula = medv ~ crim + zn + chas + nox + rm + dis + rad +
## tax + ptratio + black + lstat, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.6654 -2.8045 -0.5478 2.0366 25.9836
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.486454 5.613851 6.856 2.77e-11 ***
## crim -0.103586 0.035834 -2.891 0.00406 **
## zn 0.046075 0.014929 3.086 0.00217 **
## chas 2.735509 0.939021 2.913 0.00378 **
## nox -16.380386 3.929948 -4.168 3.78e-05 ***
## rm 3.649630 0.442425 8.249 2.45e-15 ***
## dis -1.511341 0.202103 -7.478 4.99e-13 ***
## rad 0.271856 0.068768 3.953 9.15e-05 ***
## tax -0.011179 0.003755 -2.977 0.00309 **
## ptratio -0.997898 0.144439 -6.909 1.99e-11 ***
## black 0.008107 0.003056 2.653 0.00830 **
## lstat -0.526805 0.050377 -10.457 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.641 on 392 degrees of freedom
## Multiple R-squared: 0.7547, Adjusted R-squared: 0.7478
## F-statistic: 109.6 on 11 and 392 DF, p-value: < 2.2e-16
This method also suggests that we drop indus and age. Adjusted R-square value: 0.7478
Stepwise selection has the flexibility to both add and remove variables in the process of selecting the best regression model.
model.step.s <- step(nullmodel, scope = list(lower = nullmodel, upper = fullmodel), direction = 'both')
summary(model.step.s)
##
## Call:
## lm(formula = medv ~ lstat + rm + ptratio + dis + nox + chas +
## black + zn + crim + rad + tax, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.6654 -2.8045 -0.5478 2.0366 25.9836
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.486454 5.613851 6.856 2.77e-11 ***
## lstat -0.526805 0.050377 -10.457 < 2e-16 ***
## rm 3.649630 0.442425 8.249 2.45e-15 ***
## ptratio -0.997898 0.144439 -6.909 1.99e-11 ***
## dis -1.511341 0.202103 -7.478 4.99e-13 ***
## nox -16.380386 3.929948 -4.168 3.78e-05 ***
## chas 2.735509 0.939021 2.913 0.00378 **
## black 0.008107 0.003056 2.653 0.00830 **
## zn 0.046075 0.014929 3.086 0.00217 **
## crim -0.103586 0.035834 -2.891 0.00406 **
## rad 0.271856 0.068768 3.953 9.15e-05 ***
## tax -0.011179 0.003755 -2.977 0.00309 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.641 on 392 degrees of freedom
## Multiple R-squared: 0.7547, Adjusted R-squared: 0.7478
## F-statistic: 109.6 on 11 and 392 DF, p-value: < 2.2e-16
Using stepwise selection as well, the suggested model remains the same, dropping variables indus and age. Adjusted R-square value: 0.7478
We now use the lasso, a shrinkage method, to build a model. Before performing the lasso, we need to standardize the variables.
Boston.X.std <- scale(select(Boston, -medv))   # standardize all predictors
X.train <- as.matrix(Boston.X.std)[index, ]    # same train/test split as before
X.test <- as.matrix(Boston.X.std)[-index, ]
Y.train <- Boston[index, "medv"]
Y.test <- Boston[-index, "medv"]
To use the lasso, we first need to find the value of the tuning parameter lambda. Using cross-validation, we pick an appropriate lambda from the error-versus-lambda plot. We consider both the value with the lowest cross-validated error (lambda.min) and the largest value whose error is within one standard error of that minimum (lambda.1se).
cv.lasso <- cv.glmnet(x = X.train, y = Y.train, family = "gaussian", alpha = 1, nfolds = 10)
plot(cv.lasso)
min <- cv.lasso$lambda.min  # lambda giving the lowest CV error
se <- cv.lasso$lambda.1se   # largest lambda within one SE of the minimum
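The two candidate penalty values can be printed for reference (their exact values depend on the random fold assignment):
# compare the two candidate penalties side by side
c(lambda.min = min, lambda.1se = se)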
We then build the model based on lambda.1se, as it reduces the complexity of the model at the expense of at most one standard error of cross-validated error.
lasso.fit <- glmnet(x = X.train, y = Y.train, family = "gaussian", alpha = 1)
plot(lasso.fit, xvar = "lambda")
coef(lasso.fit, s = se)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 22.7688737
## crim -0.2984214
## zn 0.2243296
## indus -0.1666811
## chas 0.6019586
## nox -0.8193901
## rm 2.8285363
## age .
## dis -1.5187074
## rad .
## tax .
## ptratio -1.9292898
## black 0.5784803
## lstat -3.6979675
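To list the retained predictors programmatically instead of reading the sparse matrix by eye (a small convenience snippet):
# variables with non-zero coefficients at lambda.1se (intercept included)
b <- as.matrix(coef(lasso.fit, s = se))
rownames(b)[b != 0]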
We can see that the lasso suggests using 10 variables, dropping age, rad, and tax. We create a new model with these 10 variables and check its adjusted R-square value.
newm <- lm(medv ~ crim + zn + indus + chas + nox + rm + dis + ptratio + black + lstat, data = train)
summary(newm)
##
## Call:
## lm(formula = medv ~ crim + zn + indus + chas + nox + rm + dis +
## ptratio + black + lstat, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.1158 -2.9325 -0.6984 1.7042 27.4448
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32.618961 5.442690 5.993 4.66e-09 ***
## crim -0.059905 0.032799 -1.826 0.068546 .
## zn 0.039223 0.014522 2.701 0.007215 **
## indus -0.076108 0.061309 -1.241 0.215208
## chas 3.092902 0.952847 3.246 0.001271 **
## nox -13.701480 3.953674 -3.466 0.000588 ***
## rm 3.861382 0.446823 8.642 < 2e-16 ***
## dis -1.487419 0.209155 -7.112 5.46e-12 ***
## ptratio -0.890015 0.135657 -6.561 1.69e-10 ***
## black 0.007265 0.003086 2.354 0.019069 *
## lstat -0.523339 0.051319 -10.198 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.718 on 393 degrees of freedom
## Multiple R-squared: 0.7459, Adjusted R-squared: 0.7394
## F-statistic: 115.4 on 10 and 393 DF, p-value: < 2.2e-16
Adjusted R-square value: 0.7394
We select the final model suggested by the lasso method. It gives a less complex model at the expense of a slight decrease in the adjusted R-square value. Hence, the final model contains the following 10 variables: crim, zn, indus, chas, nox, rm, dis, ptratio, black, lstat.
Residual diagnostics need to be performed to check the validity of the assumptions underlying the regression model. We draw the residual plots for this purpose.
par(mfrow = c(2, 2))  # display all four diagnostic plots together
plot(newm)
From the residual plots, we find the following:
i) The variance is non-constant, but a transformation could be made to address this.
ii) The Q-Q plot suggests approximately normal residuals, with a skewed tail.
iii) It does not look like any autocorrelation is present.
iv) There are no influential outliers.
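As a sketch of the transformation alluded to in point i) (illustrative only; not fitted in the original analysis), a log-transformed response is a common variance-stabilizing choice for house prices:
# illustrative variance-stabilising refit with a log-transformed response
newm.log <- lm(log(medv) ~ crim + zn + indus + chas + nox + rm + dis +
                 ptratio + black + lstat, data = train)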
Now we gauge the model’s performance on the test data set by refitting the selected specification there and examining the adjusted R-square value.
newm.test <- lm(medv ~ crim + zn + indus + chas + nox + rm + dis + ptratio + black + lstat, data = test)
summary(newm.test)
##
## Call:
## lm(formula = medv ~ crim + zn + indus + chas + nox + rm + dis +
## ptratio + black + lstat, data = test)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.5884 -2.2590 -0.3518 1.6487 25.9339
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.896375 11.885410 1.590 0.11533
## crim -0.082773 0.083491 -0.991 0.32412
## zn 0.054759 0.034765 1.575 0.11870
## indus 0.096901 0.139405 0.695 0.48876
## chas 2.319858 2.179154 1.065 0.28989
## nox -18.060296 8.206593 -2.201 0.03029 *
## rm 5.015898 1.101839 4.552 1.64e-05 ***
## dis -1.492888 0.517085 -2.887 0.00486 **
## ptratio -0.576319 0.291111 -1.980 0.05076 .
## black 0.009574 0.006255 1.531 0.12934
## lstat -0.600352 0.150205 -3.997 0.00013 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.314 on 91 degrees of freedom
## Multiple R-squared: 0.6894, Adjusted R-squared: 0.6553
## F-statistic: 20.2 on 10 and 91 DF, p-value: < 2.2e-16
Adjusted R-square value: 0.6553
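As a complementary out-of-sample check (a minimal sketch, not part of the original analysis), one can instead keep the fit from the training data and predict the held-out observations, summarizing accuracy with the mean squared prediction error:
# predict the test set with the model fitted on the training data
pred.test <- predict(newm, newdata = test)
mean((test$medv - pred.test)^2)  # mean squared prediction error (MSPE)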