The Boston housing dataset contains the median value of owner-occupied homes along with 13 other variables that could potentially be related to housing prices, covering factors such as socio-economic conditions, environmental conditions and educational facilities. There are 506 observations on 14 variables, including the median home price. Twelve of the variables are numerical and one is categorical. The aim of this project is to build a linear regression model to estimate the median price of owner-occupied homes in Boston.
The summary statistics of the dataset were first examined to understand the range of values of each variable. From the correlation matrix we notice that housing value has a strong positive correlation with rm (the average number of rooms); this is expected, as a more spacious house with more rooms would command a higher valuation. medv also has a strong negative correlation with lstat, meaning that a house in an area with lower socioeconomic status tends to have a lower value.
Multiple linear regression was then performed without any standardization to generate results that served as a benchmark. These results were later compared with those from the other regression techniques to assess the accuracy of the predictions.
The methods used for variable selection were best subset selection, forward selection, backward elimination, stepwise selection and LASSO regression. The results from these methods were compared to arrive at the best-fitting model for predicting median housing values in Boston. The models were very similar, with no considerable difference between them: LASSO suggested a 10-variable model, whereas all the other techniques suggested an 11-variable model.
Finally, residual diagnostics were performed on the best fit model and the assumptions of linear regression modeling were checked.
library(MASS)  # provides the Boston housing dataset
data(Boston)
#random 80/20 train-test split
index <- sample(nrow(Boston), nrow(Boston) * 0.80)
train <- Boston[index, ]
test <- Boston[-index, ]
Let’s take a look at the summary statistics and the distribution of observations in each variable. There are no missing values in the data, but the summary statistics suggest that there may be outliers in some of the variables.
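As a quick check (a minimal sketch, not part of the original code), the absence of missing values can be confirmed by counting NAs per column:
#count missing values per column; all zeros confirms there are no NAs
colSums(is.na(Boston))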
#summary statistics
summary(train)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08354 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25372 Median : 0.00 Median : 8.56 Median :0.00000
## Mean : 3.68856 Mean : 11.38 Mean :11.14 Mean :0.06931
## 3rd Qu.: 3.54343 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.385 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.449 1st Qu.:5.883 1st Qu.: 43.62 1st Qu.: 2.106
## Median :0.532 Median :6.189 Median : 79.05 Median : 3.158
## Mean :0.554 Mean :6.273 Mean : 68.74 Mean : 3.819
## 3rd Qu.:0.624 3rd Qu.:6.597 3rd Qu.: 94.35 3rd Qu.: 5.287
## Max. :0.871 Max. :8.725 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:280.5 1st Qu.:17.23 1st Qu.:376.09
## Median : 5.000 Median :330.0 Median :19.10 Median :391.44
## Mean : 9.483 Mean :406.7 Mean :18.46 Mean :358.84
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.25
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.730 Min. : 5.00
## 1st Qu.: 7.135 1st Qu.:17.10
## Median :11.570 Median :21.20
## Mean :12.820 Mean :22.63
## 3rd Qu.:17.225 3rd Qu.:25.02
## Max. :37.970 Max. :50.00
Now, let’s look at the variation in the values of various variables present in the dataset. This is achieved by plotting a boxplot of all the variables.
boxplot(train)
From the boxplot, it is evident that there are outliers in the variables ‘crim’, ‘zn’ and ‘black’ (a rough count is sketched below). The highest variability is observed in the full-value property-tax rate. We are interested in answering questions such as ‘does a higher tax rate imply a higher median price?’, so we compute pairwise correlations.
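As a rough check on the outlier claim, the usual 1.5 × IQR rule gives a per-variable outlier count (a minimal sketch, not part of the original analysis):
#count observations outside 1.5*IQR of the quartiles, per variable
sapply(train, function(x) {
  q <- quantile(x, c(0.25, 0.75))
  sum(x < q[1] - 1.5 * diff(q) | x > q[2] + 1.5 * diff(q))
})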
library(ggcorrplot)  # for plotting the correlation matrix

corr <- round(cor(train), 1)
ggcorrplot(corr, hc.order = TRUE, type = "lower", lab = TRUE,
           outline.col = "white",
           ggtheme = ggplot2::theme_gray,
           colors = c("#6D9EC1", "white", "#E46726"))
A fairly strong negative correlation is observed between ‘tax’ and ‘medv’, suggesting that higher property-tax rates go with lower, not higher, median prices. The variable with the weakest correlation to medv is proximity to the Charles River (chas). In some cases we observe near-zero correlations between pairs of variables.
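These observations can be verified by ranking each variable’s correlation with medv (a quick sketch using the training data):
#correlation of every variable with medv, from most negative to most positive
sort(cor(train)[, "medv"])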
Let’s perform linear regression on the dataset. The variable of our interest here is ‘medv’. Our main objective is to predict the value of ‘medv’ based on other independent variables present in the dataset. We will first perform multiple linear regression using all the variables.
fit <- lm(medv~.,data = train)
sum.fit <- summary(fit)
From the model summary we observe that lstat is the most significant predictor, with the lowest p-value, followed by rm. The coefficient estimate for lstat is negative, indicating that the response variable medv decreases as the percentage of the population with lower socioeconomic status increases. We also observe that the variables age and indus are not significant in determining the median value of homes in Boston.
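The ordering of the predictors by significance can be read off the coefficient table directly, for instance by sorting the p-values (a small sketch based on the fitted model above):
#p-values of the full model, most significant first
sort(sum.fit$coefficients[, "Pr(>|t|)"])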
Model Stats
Checking the model stats, using MSE, R-squared, adjusted R-squared, Test MSPE, AIC and BIC as metrics:
fit.mse <- (sum.fit$sigma)^2
fit.rsq <- sum.fit$r.squared
fit.arsq <- sum.fit$adj.r.squared
test.pred.fit <- predict(fit, newdata=test)
fit.mpse <- mean((test$medv-test.pred.fit)^2)
fit.aic <- AIC(fit)
fit.bic <- BIC(fit)
stats.fit <- c("FULL", fit.mse, fit.rsq, fit.arsq, fit.mpse, fit.aic, fit.bic)
comparison_table <- c("model type", "MSE", "R-Squared", "Adjusted R-Squared", "Test MSPE", "AIC", "BIC")
data.frame(cbind(comparison_table, stats.fit))
## comparison_table stats.fit
## 1 model type FULL
## 2 MSE 21.6469957573516
## 3 R-Squared 0.754758430353452
## 4 Adjusted R-Squared 0.746583711365233
## 5 Test MSPE 26.5009972929696
## 6 AIC 2404.50014575205
## 7 BIC 2464.52136892146
We will now apply different techniques of variable selection to decide the best fit model.
Here we apply the best subset selection approach to the Boston data. We wish to predict the median value of a house from the available predictors, and we decide on the number of variables to include in the model using BIC, R-squared and adjusted R-squared.
library(leaps)  # for regsubsets()
regs_model <- regsubsets(medv ~ ., train, nvmax = 13)
regs_summ <- summary(regs_model)
cbind(regs_summ$which,bic=regs_summ$bic,adj.r2=regs_summ$adjr2)
## (Intercept) crim zn indus chas nox rm age dis rad tax ptratio black
## 1 1 0 0 0 0 0 0 0 0 0 0 0 0
## 2 1 0 0 0 0 0 1 0 0 0 0 0 0
## 3 1 0 0 0 0 0 1 0 0 0 0 1 0
## 4 1 0 0 0 0 0 1 0 1 0 0 1 0
## 5 1 0 0 0 0 1 1 0 1 0 0 1 0
## 6 1 0 0 0 1 1 1 0 1 0 0 1 0
## 7 1 0 0 0 1 1 1 0 1 0 0 1 1
## 8 1 0 1 0 1 1 1 0 1 0 0 1 1
## 9 1 1 0 0 1 1 1 0 1 1 0 1 1
## 10 1 1 1 0 1 1 1 0 1 1 1 1 0
## 11 1 1 1 0 1 1 1 0 1 1 1 1 1
## 12 1 1 1 0 1 1 1 1 1 1 1 1 1
## 13 1 1 1 1 1 1 1 1 1 1 1 1 1
## lstat bic adj.r2
## 1 1 -311.5275 0.5499215
## 2 1 -406.4929 0.6485755
## 3 1 -456.8011 0.6935325
## 4 1 -468.6637 0.7060538
## 5 1 -487.8519 0.7231279
## 6 1 -492.3544 0.7295542
## 7 1 -494.4685 0.7342633
## 8 1 -494.4398 0.7375002
## 9 1 -492.9798 0.7397757
## 10 1 -494.5191 0.7439378
## 11 1 -495.7083 0.7478134
## 12 1 -489.7991 0.7472261
## 13 1 -483.8069 0.7465837
The BIC and adjusted R-squared values in the table above suggest that the model with 11 variables is the best. We arrive at medv ~ crim + zn + chas + nox + rm + dis + rad + tax + ptratio + black + lstat as our best model, with 11 predictors.
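The same choice can be made programmatically by asking which model size minimizes BIC and which maximizes adjusted R-squared (a short sketch; based on the table above, both should return 11):
#model size preferred by each criterion
which.min(regs_summ$bic)
which.max(regs_summ$adjr2)
#predictors in the chosen 11-variable model
coef(regs_model, 11)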
Model Stats
Let us compare this model’s stats with those obtained from the full model in the previous section.
model.ss <- lm(medv ~ . -indus -age, data=train)
sum.model.ss <- summary(model.ss)
sum.model.ss
##
## Call:
## lm(formula = medv ~ . - indus - age, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.6654 -2.8045 -0.5478 2.0366 25.9836
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.486454 5.613851 6.856 2.77e-11 ***
## crim -0.103586 0.035834 -2.891 0.00406 **
## zn 0.046075 0.014929 3.086 0.00217 **
## chas 2.735509 0.939021 2.913 0.00378 **
## nox -16.380386 3.929948 -4.168 3.78e-05 ***
## rm 3.649630 0.442425 8.249 2.45e-15 ***
## dis -1.511341 0.202103 -7.478 4.99e-13 ***
## rad 0.271856 0.068768 3.953 9.15e-05 ***
## tax -0.011179 0.003755 -2.977 0.00309 **
## ptratio -0.997898 0.144439 -6.909 1.99e-11 ***
## black 0.008107 0.003056 2.653 0.00830 **
## lstat -0.526805 0.050377 -10.457 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.641 on 392 degrees of freedom
## Multiple R-squared: 0.7547, Adjusted R-squared: 0.7478
## F-statistic: 109.6 on 11 and 392 DF, p-value: < 2.2e-16
model.ss.mse <- (sum.model.ss$sigma)^2
model.ss.rsq <- sum.model.ss$r.squared
model.ss.arsq <- sum.model.ss$adj.r.squared
test.pred.model.ss <- predict(model.ss, newdata=test)
model.ss.mpse <- mean((test$medv-test.pred.model.ss)^2)
modelss.aic <- AIC(model.ss)
modelss.bic <- BIC(model.ss)
stats.model.ss <- c("best_subset", model.ss.mse, model.ss.rsq, model.ss.arsq, model.ss.mpse, modelss.aic, modelss.bic)
data.frame(cbind(comparison_table, stats.fit, stats.model.ss))
## comparison_table stats.fit stats.model.ss
## 1 model type FULL best_subset
## 2 MSE 21.6469957573516 21.5419588675389
## 3 R-Squared 0.754758430353452 0.75469686001213
## 4 Adjusted R-Squared 0.746583711365233 0.747813353532878
## 5 Test MSPE 26.5009972929696 26.4299999958324
## 6 AIC 2404.50014575205 2400.60156125011
## 7 BIC 2464.52136892146 2452.6199546636
Forward stepwise selection is a computationally efficient alternative to best subset selection. While the best subset selection procedure considers all 2^p possible models containing subsets of the p predictors, forward stepwise considers a much smaller set of models.
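To make the computational saving concrete: with p = 13 predictors, best subset selection examines 2^13 = 8192 candidate models, while forward stepwise fits at most 1 + p(p + 1)/2 = 92 models (a quick illustration):
p <- 13
2^p                  #models considered by best subset selection: 8192
1 + p * (p + 1) / 2  #models fit by forward stepwise selection: 92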
Best model selected by forward selection method is given below.
nullmodel <- lm(medv~1, data = train)
fullmodel <- lm(medv~., data = train)
model.step.f<- step(nullmodel, scope=list(lower=nullmodel, upper=fullmodel), direction='forward', trace=FALSE)
summary(model.step.f)
##
## Call:
## lm(formula = medv ~ lstat + rm + ptratio + dis + nox + chas +
## black + zn + crim + rad + tax, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.6654 -2.8045 -0.5478 2.0366 25.9836
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.486454 5.613851 6.856 2.77e-11 ***
## lstat -0.526805 0.050377 -10.457 < 2e-16 ***
## rm 3.649630 0.442425 8.249 2.45e-15 ***
## ptratio -0.997898 0.144439 -6.909 1.99e-11 ***
## dis -1.511341 0.202103 -7.478 4.99e-13 ***
## nox -16.380386 3.929948 -4.168 3.78e-05 ***
## chas 2.735509 0.939021 2.913 0.00378 **
## black 0.008107 0.003056 2.653 0.00830 **
## zn 0.046075 0.014929 3.086 0.00217 **
## crim -0.103586 0.035834 -2.891 0.00406 **
## rad 0.271856 0.068768 3.953 9.15e-05 ***
## tax -0.011179 0.003755 -2.977 0.00309 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.641 on 392 degrees of freedom
## Multiple R-squared: 0.7547, Adjusted R-squared: 0.7478
## F-statistic: 109.6 on 11 and 392 DF, p-value: < 2.2e-16
We observe that the model is the same as the one selected by the best subset selection method; hence, the model stats also remain the same.
## comparison_table stats.fit stats.model.ss stats.model.f
## 1 model type FULL best_subset forward_selection
## 2 MSE 21.6469957573516 21.5419588675389 21.5419588675389
## 3 R-Squared 0.754758430353452 0.75469686001213 0.75469686001213
## 4 Adjusted R-Squared 0.746583711365233 0.747813353532878 0.747813353532879
## 5 Test MSPE 26.5009972929696 26.4299999958324 26.4299999958324
## 6 AIC 2404.50014575205 2400.60156125011 2400.60156125011
## 7 BIC 2464.52136892146 2452.6199546636 2452.6199546636
Like forward stepwise selection, backward stepwise selection provides an efficient alternative to best subset selection. However, unlike forward stepwise selection, it begins with the full least squares model containing all p predictors, and then iteratively removes the least useful predictor, one-at-a-time.
Best model selected by backward elimination method is given below.
#backward elimination
model.step.b<- step(fullmodel,direction='backward',trace=FALSE)
summary(model.step.b)
##
## Call:
## lm(formula = medv ~ crim + zn + chas + nox + rm + dis + rad +
## tax + ptratio + black + lstat, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.6654 -2.8045 -0.5478 2.0366 25.9836
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.486454 5.613851 6.856 2.77e-11 ***
## crim -0.103586 0.035834 -2.891 0.00406 **
## zn 0.046075 0.014929 3.086 0.00217 **
## chas 2.735509 0.939021 2.913 0.00378 **
## nox -16.380386 3.929948 -4.168 3.78e-05 ***
## rm 3.649630 0.442425 8.249 2.45e-15 ***
## dis -1.511341 0.202103 -7.478 4.99e-13 ***
## rad 0.271856 0.068768 3.953 9.15e-05 ***
## tax -0.011179 0.003755 -2.977 0.00309 **
## ptratio -0.997898 0.144439 -6.909 1.99e-11 ***
## black 0.008107 0.003056 2.653 0.00830 **
## lstat -0.526805 0.050377 -10.457 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.641 on 392 degrees of freedom
## Multiple R-squared: 0.7547, Adjusted R-squared: 0.7478
## F-statistic: 109.6 on 11 and 392 DF, p-value: < 2.2e-16
Stepwise selection has the flexibility to both add and remove variables in the process of selecting the best regression model. The best model selected by the stepwise selection method is given below.
#stepwise(mixed)
model.step.s<- step(nullmodel, scope=list(lower=nullmodel, upper=fullmodel), direction='both',trace=FALSE)
summary(model.step.s)
##
## Call:
## lm(formula = medv ~ lstat + rm + ptratio + dis + nox + chas +
## black + zn + crim + rad + tax, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.6654 -2.8045 -0.5478 2.0366 25.9836
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.486454 5.613851 6.856 2.77e-11 ***
## lstat -0.526805 0.050377 -10.457 < 2e-16 ***
## rm 3.649630 0.442425 8.249 2.45e-15 ***
## ptratio -0.997898 0.144439 -6.909 1.99e-11 ***
## dis -1.511341 0.202103 -7.478 4.99e-13 ***
## nox -16.380386 3.929948 -4.168 3.78e-05 ***
## chas 2.735509 0.939021 2.913 0.00378 **
## black 0.008107 0.003056 2.653 0.00830 **
## zn 0.046075 0.014929 3.086 0.00217 **
## crim -0.103586 0.035834 -2.891 0.00406 **
## rad 0.271856 0.068768 3.953 9.15e-05 ***
## tax -0.011179 0.003755 -2.977 0.00309 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.641 on 392 degrees of freedom
## Multiple R-squared: 0.7547, Adjusted R-squared: 0.7478
## F-statistic: 109.6 on 11 and 392 DF, p-value: < 2.2e-16
Hence, we observe that stepwise regression in either direction results in the same best model as before, in which medv is regressed on all variables except indus and age.
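This can be confirmed by checking that all three procedures selected the same set of predictors (a small check on the fitted objects; term order in the formulas may differ):
#TRUE if forward, backward and stepwise selection keep the same predictors
identical(sort(names(coef(model.step.f))), sort(names(coef(model.step.b))))
identical(sort(names(coef(model.step.f))), sort(names(coef(model.step.s))))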
We will now use LASSO regression to build the model.
#standardize covariates
Boston.X.std<- scale(dplyr::select(Boston, -medv))
X.train<- as.matrix(Boston.X.std)[index,]
X.test<- as.matrix(Boston.X.std)[-index,]
Y.train<- Boston[index, "medv"]
Y.test<- Boston[-index, "medv"]
library(glmnet)  # for glmnet() and cv.glmnet()

#fit model
lasso.fit <- glmnet(x = X.train, y = Y.train, family = "gaussian", alpha = 1)
plot(lasso.fit, xvar = "lambda")
**Cross-validation to get the optimal lambda**
Using cross-validation, we find an appropriate lambda value from the error-versus-lambda plot. We consider both the lambda with the lowest cross-validation error (lambda.min) and the largest lambda whose error is within one standard error of that minimum (lambda.1se), and build models based on both. For the larger lambda, the number of variables selected decreases.
cv.lasso<- cv.glmnet(x=X.train, y=Y.train, family = "gaussian", alpha = 1, nfolds = 10)
plot(cv.lasso)
cv.lasso$lambda.min
## [1] 0.01345128
cv.lasso$lambda.1se
## [1] 0.2405962
We find lambda.min to be about 0.0135 and lambda.1se to be about 0.241.
We fit the model using both lambda.1se and lambda.min. We observe that lambda.1se yields a sparser model than lambda.min: it drops age, rad and tax, leaving 10 predictors instead of 13.
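The difference in sparsity can be confirmed by counting the non-zero coefficients at each value of lambda (a brief sketch; the counts match the listings below):
#number of non-zero coefficients, excluding the intercept
sum(coef(lasso.fit, s = cv.lasso$lambda.1se)[-1, ] != 0)  #10 predictors
sum(coef(lasso.fit, s = cv.lasso$lambda.min)[-1, ] != 0)  #13 predictors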
**Coefficients for lambda.1se**
coef(lasso.fit, s=cv.lasso$lambda.1se)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 22.7688737
## crim -0.2984214
## zn 0.2243296
## indus -0.1666811
## chas 0.6019586
## nox -0.8193901
## rm 2.8285363
## age .
## dis -1.5187074
## rad .
## tax .
## ptratio -1.9292898
## black 0.5784803
## lstat -3.6979675
**Coefficients for lambda.min**
coef(lasso.fit, s=cv.lasso$lambda.min)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 22.78140868
## crim -0.84940440
## zn 0.99725880
## indus -0.08037496
## chas 0.69929510
## nox -1.79191071
## rm 2.59897984
## age -0.09602244
## dis -3.12359093
## rad 2.11532641
## tax -1.64535068
## ptratio -2.13760198
## black 0.73231750
## lstat -3.72307947
We will select lambda.1se for a less complex model.
We select the best subset selection model as our final model: medv ~ crim + zn + chas + nox + rm + dis + rad + tax + ptratio + black + lstat
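The LASSO column in the comparison table below can be reproduced along these lines (a hedged sketch, since the exact code used to build the table is not shown; the reported in-sample MSE may additionally use a degrees-of-freedom adjustment):
#training and test error for the lambda.1se LASSO fit
pred.train.lasso <- predict(lasso.fit, newx = X.train, s = cv.lasso$lambda.1se)
pred.test.lasso <- predict(lasso.fit, newx = X.test, s = cv.lasso$lambda.1se)
lasso.mse <- mean((Y.train - pred.train.lasso)^2)   #in-sample MSE
lasso.mspe <- mean((Y.test - pred.test.lasso)^2)    #test MSPE
lasso.rsq <- 1 - sum((Y.train - pred.train.lasso)^2) / sum((Y.train - mean(Y.train))^2)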
## comparison_table stats.fit stats.model.ss
## 1 model type FULL best_subset
## 2 MSE 21.6469957573516 21.5419588675389
## 3 R-Squared 0.754758430353452 0.75469686001213
## 4 Adjusted R-Squared 0.746583711365233 0.747813353532878
## 5 Test MSPE 26.5009972929696 26.4299999958324
## 6 AIC 2404.50014575205 2400.60156125011
## 7 BIC 2464.52136892146 2452.6199546636
## stats.model.lasso.1se
## 1 LASSO
## 2 23.2090350619489
## 3 0.735039287433406
## 4 0.728297284569116
## 5 29.147140630975
## 6 NA
## 7 NA
Residual diagnostics need to be performed to check the validity of the assumptions behind the regression model. We draw residual plots for this purpose. From the residual plots, we can infer that: i) the variance of the residuals is not constant, though a transformation of the response could address this (a sketch of this idea follows the plots below); ii) the Q-Q plot suggests approximately normal residuals but with a skewed tail; iii) there does not appear to be any autocorrelation; iv) there are no outliers.
par(mfrow=c(2,2))
plot(model.ss)
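As noted above, the non-constant variance could be addressed with a transformation of the response. A minimal sketch of that idea (not part of the original analysis) is to refit the selected 11-predictor model on log(medv) and re-examine the residual plots:
#refit the best subset model with a log-transformed response
model.log <- lm(log(medv) ~ crim + zn + chas + nox + rm + dis + rad + tax +
                  ptratio + black + lstat, data = train)
par(mfrow = c(2, 2))
plot(model.log)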