Introduction

Problem
To predict median price of houses in Boston based on various housing, environmental and social attributes available in Boston Housing data with the help of linear, best subsets, stepwise and LASSO regression.

Approach
We will follow these steps to predict Boston Housing prices:

  • Exploratory data analysis (univariate and bivariate)
  • Train linear regression model
  • Perform variable selection using:
    • Best Subset
    • Stepwise (Forward, Backward, Both)
    • LASSO (lambda.min, lambda.lse)
  • Finalize best model and show model KPIs (MSE, R Squared and Adjusted R Squared)
library(MASS)
library(dplyr)
library(corrplot)
library(tidyverse)
library(PerformanceAnalytics)
library(mctest)

About the Data
Housing Values in Suburbs of Boston The Boston data frame has 506 rows and 14 columns. This data frame contains the following columns:

  • crim - per capita crime rate by town.
  • zn - proportion of residential land zoned for lots over 25,000 sq.ft.
  • indus - proportion of non-retail business acres per town.
  • chas - Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
  • nox - nitrogen oxides concentration (parts per 10 million).
  • rm - average number of rooms per dwelling.
  • age - proportion of owner-occupied units built prior to 1940.
  • dis - weighted mean of distances to five Boston employment centres.
  • rad - index of accessibility to radial highways.
  • tax - full-value property-tax rate per $10,000.
  • ptratio - pupil-teacher ratio by town.
  • black - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
  • lstat - lower status of the population (percent).
  • medv - median value of owner-occupied homes in $1000s.

Executive Summary

  • The final selected model is obtained by the best subset selection method. It includes all the variables except indus and age. Below is the summary statistics from the best subset model

\[medv = -0.0970*crim + 0.05569*zn + 3.4612*chas - 19.1165*nox + 3.5841*rm - 1.581*dis + 0.3584*rad -0.0131*tax - 0.9662*ptratio + 0.0121*black - 0.4972*lstat\]

  • Adjusted R-Squared from best subset and stepwise (fwd, backward and hybrid) is same as they select the same variables
  • LASSO regression did not remove any of the variables completely, but it did significantly decrease the beta coefficients of age and indus

Exploratory Data Analysis

General structure of data and checking for NULL values

# General Structure
names(Boston)
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
# Checking for NULL values
colSums(is.na(Boston))
##    crim      zn   indus    chas     nox      rm     age     dis     rad 
##       0       0       0       0       0       0       0       0       0 
##     tax ptratio   black   lstat    medv 
##       0       0       0       0       0

Univariate Analysis(Histograms)

# Univariate Analysis
Boston %>% keep(is.numeric) %>% 
  gather() %>% ggplot(aes(value)) + facet_wrap(~key, scales = "free") + geom_histogram() 

We see that age follows a left skewed distribution, so does the proportion of black.
Distance, lstat and medv follow a right skewed distribution.
Nox (Nitrogen Oxides Concentration) follow an almost uniform distribution Rm follows a normal distribution.
Tax and rad also follow an almost normal distribution with few extreme values.

Univariate Analysis(Boxplots)

boxplot(Boston)

Boxplots to observe the summary statistic of each variable:
We see that the tax variable could be normalized to bring it in scale with other variables
There are a lot of outliers in the black variable that should be treated before analysis

Bivariate Analysis (response variable vs predictor variables)

#3 pairwise correlations
corr_medv <- cor(Boston$medv, Boston[,-14], use = "pairwise.complete.obs")

corr_medv <- t(corr_medv)
View(corr_medv)
corr_matrix <- cor(Boston, use = "pairwise.complete.obs")
corrplot(corr_matrix, type = "lower", order = "hclust", 
         tl.col = "black", tl.srt = 45)


Age, Indus and Nox are highly correlated with dis variable which can led to multicollinearity

Model building and Variable Selection

#train_index <- sample(nrow(Boston),0.75*nrow(Boston))
#boston_train <- Boston[train_index,]   # select 4 rows at random
#boston_test <- Boston[-train_index,]
#Boston <- Boston %>% mutate(id = row_number())

#head(Boston$id)

Sampling of data
The dataset was randomly divided into 75% train and 25% test data. The model will be trained on the train data and its performance will be tested on rest of the data to avoid overfitting.

set.seed(7)
train_index <- sample(nrow(Boston),0.75*nrow(Boston))
boston_train <- Boston[train_index,]   
boston_test <- Boston[-train_index,]

Linear regression
Following results were obtained after fitting the linear regression model:

# linear regression on the data -------------------------------------------
fit <- lm(boston_train$medv ~., data = boston_train)
summary_test <- summary(fit)
print(summary_test)
## 
## Call:
## lm(formula = boston_train$medv ~ ., data = boston_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.5282  -2.8594  -0.7306   1.9008  25.0097 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  36.901839   6.265388   5.890 8.78e-09 ***
## crim         -0.078110   0.049341  -1.583 0.114275    
## zn            0.045881   0.016509   2.779 0.005732 ** 
## indus         0.026917   0.073496   0.366 0.714397    
## chas          2.585672   1.036172   2.495 0.013022 *  
## nox         -18.492688   4.711442  -3.925 0.000104 ***
## rm            3.618608   0.519126   6.971 1.48e-11 ***
## age           0.011858   0.016237   0.730 0.465648    
## dis          -1.414448   0.237566  -5.954 6.15e-09 ***
## rad           0.263478   0.081044   3.251 0.001257 ** 
## tax          -0.009424   0.004476  -2.106 0.035927 *  
## ptratio      -0.985265   0.162261  -6.072 3.17e-09 ***
## black         0.011024   0.003148   3.502 0.000520 ***
## lstat        -0.588063   0.063659  -9.238  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.984 on 365 degrees of freedom
## Multiple R-squared:  0.7321, Adjusted R-squared:  0.7226 
## F-statistic: 76.73 on 13 and 365 DF,  p-value: < 2.2e-16

Adjusted R-squared of 72.26% is obtained and 3 variables (crim, indus and age) are identified as not significant.
The model was then applied to the test data set.

fit_summary <- list(("r_sqr" = summary_test$r.squared), ("adj_rsqr" = summary_test$adj.r.squared),
                    ("AIC" = AIC(fit)), ("BIC" = BIC(fit)))


pred_fit <- predict(fit, boston_test)
#MSE
MSE_test_lm <- mean((pred_fit - boston_test$medv)^2)
#Mean Absolute error
MAE <- mean(abs(pred_fit - boston_test$medv))

Mean squared error of 16.48 and mean absolute error of 2.92 was obtained for the test data set.


Best Subset Selection
We now run the best subset selection method which gives us the variables to choose for each number of variable, shown by a ’*’ in the results below.Here are the results and coefficients values for the best subset selection:

#Variable Selection
library(leaps)
#install.packages("MuMIn")
library(MuMIn)
library(mctest)

#best subset ------------------------------------------------------------
subset_result <- regsubsets(medv~.,data = boston_train, nbest = 1, nvmax = 14)
 summary(subset_result)
## Subset selection object
## Call: regsubsets.formula(medv ~ ., data = boston_train, nbest = 1, 
##     nvmax = 14)
## 13 Variables  (and intercept)
##         Forced in Forced out
## crim        FALSE      FALSE
## zn          FALSE      FALSE
## indus       FALSE      FALSE
## chas        FALSE      FALSE
## nox         FALSE      FALSE
## rm          FALSE      FALSE
## age         FALSE      FALSE
## dis         FALSE      FALSE
## rad         FALSE      FALSE
## tax         FALSE      FALSE
## ptratio     FALSE      FALSE
## black       FALSE      FALSE
## lstat       FALSE      FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
##           crim zn  indus chas nox rm  age dis rad tax ptratio black lstat
## 1  ( 1 )  " "  " " " "   " "  " " " " " " " " " " " " " "     " "   "*"  
## 2  ( 1 )  " "  " " " "   " "  " " "*" " " " " " " " " " "     " "   "*"  
## 3  ( 1 )  " "  " " " "   " "  " " "*" " " " " " " " " "*"     " "   "*"  
## 4  ( 1 )  " "  " " " "   " "  " " "*" " " "*" " " " " "*"     " "   "*"  
## 5  ( 1 )  " "  " " " "   " "  "*" "*" " " "*" " " " " "*"     " "   "*"  
## 6  ( 1 )  " "  " " " "   " "  "*" "*" " " "*" " " " " "*"     "*"   "*"  
## 7  ( 1 )  " "  " " " "   "*"  "*" "*" " " "*" " " " " "*"     "*"   "*"  
## 8  ( 1 )  " "  "*" " "   "*"  "*" "*" " " "*" " " " " "*"     "*"   "*"  
## 9  ( 1 )  " "  "*" " "   "*"  "*" "*" " " "*" "*" " " "*"     "*"   "*"  
## 10  ( 1 ) " "  "*" " "   "*"  "*" "*" " " "*" "*" "*" "*"     "*"   "*"  
## 11  ( 1 ) "*"  "*" " "   "*"  "*" "*" " " "*" "*" "*" "*"     "*"   "*"  
## 12  ( 1 ) "*"  "*" " "   "*"  "*" "*" "*" "*" "*" "*" "*"     "*"   "*"  
## 13  ( 1 ) "*"  "*" "*"   "*"  "*" "*" "*" "*" "*" "*" "*"     "*"   "*"

We can see the coefficients after selecting the number of variables needed in our model (example for 2,8,13).

coef(subset_result, 2)
## (Intercept)          rm       lstat 
##  -0.4473873   5.0151882  -0.6725328
 coef(subset_result, 8)
##  (Intercept)           zn         chas          nox           rm 
##  30.25412687   0.04045765   2.90980703 -15.56249709   4.00306915 
##          dis      ptratio        black        lstat 
##  -1.43490115  -0.86299730   0.01079068  -0.58231343
 coef(subset_result, 13)
##  (Intercept)         crim           zn        indus         chas 
##  36.90183910  -0.07810995   0.04588056   0.02691726   2.58567182 
##          nox           rm          age          dis          rad 
## -18.49268800   3.61860790   0.01185825  -1.41444814   0.26347776 
##          tax      ptratio        black        lstat 
##  -0.00942430  -0.98526484   0.01102414  -0.58806322

The adjusted R squared for the best subset method is:

 max(summary(subset_result)$adjr2)
## [1] 0.7235687

Here is the plot of best subset results:

 plot(subset_result, scale = "bic")

 #test pred
 test.mat <- model.matrix(medv~., data= boston_test)
 summary(test.mat)
##   (Intercept)      crim                zn             indus      
##  Min.   :1    Min.   : 0.00632   Min.   : 0.000   Min.   : 1.52  
##  1st Qu.:1    1st Qu.: 0.08850   1st Qu.: 0.000   1st Qu.: 6.08  
##  Median :1    Median : 0.25356   Median : 0.000   Median : 9.90  
##  Mean   :1    Mean   : 3.92792   Mean   : 9.299   Mean   :11.36  
##  3rd Qu.:1    3rd Qu.: 2.87119   3rd Qu.: 0.000   3rd Qu.:18.10  
##  Max.   :1    Max.   :88.97620   Max.   :95.000   Max.   :27.74  
##       chas              nox               rm             age        
##  Min.   :0.00000   Min.   :0.3890   Min.   :3.561   Min.   :  6.50  
##  1st Qu.:0.00000   1st Qu.:0.4555   1st Qu.:5.889   1st Qu.: 45.75  
##  Median :0.00000   Median :0.5380   Median :6.122   Median : 81.30  
##  Mean   :0.06299   Mean   :0.5596   Mean   :6.225   Mean   : 69.40  
##  3rd Qu.:0.00000   3rd Qu.:0.6240   3rd Qu.:6.500   3rd Qu.: 94.10  
##  Max.   :1.00000   Max.   :0.8710   Max.   :8.704   Max.   :100.00  
##       dis              rad             tax           ptratio     
##  Min.   : 1.178   Min.   : 1.00   Min.   :188.0   Min.   :12.60  
##  1st Qu.: 2.235   1st Qu.: 4.00   1st Qu.:284.0   1st Qu.:17.40  
##  Median : 3.421   Median : 5.00   Median :330.0   Median :18.60  
##  Mean   : 3.845   Mean   : 9.26   Mean   :406.7   Mean   :18.46  
##  3rd Qu.: 5.166   3rd Qu.:16.00   3rd Qu.:666.0   3rd Qu.:20.20  
##  Max.   :10.586   Max.   :24.00   Max.   :711.0   Max.   :22.00  
##      black            lstat       
##  Min.   :  6.68   Min.   : 1.920  
##  1st Qu.:379.70   1st Qu.: 8.075  
##  Median :392.18   Median :12.040  
##  Mean   :367.40   Mean   :13.051  
##  3rd Qu.:396.90   3rd Qu.:16.545  
##  Max.   :396.90   Max.   :36.980
 val.errors <- rep(NA,8)
 for(i in 1:8){
   coefi <- coef(subset_result, id=i)
   pred <- test.mat[,names(coefi)]%*%coefi
   val.errors[i] <- mean((boston_test$medv - pred)^2)
 }

 MSE_bestsubset <- min(val.errors)


 #AIC(subset_result)

Stepwise - Forward, backward and Hybrid
We will check MSE, AIC and r squared values for all these methods.

# Stepwise - Forward, Backward and Hybrid ---------------------------------
nullmodel=lm(medv~1, data = boston_train)
fullmodel=lm(medv~., data = boston_train)

#forward
step_model_fwd <- step(nullmodel, scope = list(lower = nullmodel, upper = fullmodel), direction = "forward")
## Start:  AIC=1704.53
## medv ~ 1
## 
##           Df Sum of Sq   RSS    AIC
## + lstat    1   18711.8 15138 1401.5
## + rm       1   16203.0 17647 1459.7
## + ptratio  1    9115.8 24735 1587.6
## + indus    1    8329.6 25521 1599.5
## + tax      1    7406.3 26444 1613.0
## + nox      1    6546.3 27304 1625.1
## + crim     1    5448.8 28402 1640.0
## + age      1    5004.0 28846 1645.9
## + zn       1    4880.2 28970 1647.5
## + rad      1    4876.9 28973 1647.6
## + black    1    4135.0 29715 1657.2
## + dis      1    2431.5 31419 1678.3
## + chas     1     930.1 32920 1696.0
## <none>                 33850 1704.5
## 
## Step:  AIC=1401.55
## medv ~ lstat
## 
##           Df Sum of Sq   RSS    AIC
## + rm       1   2793.34 12345 1326.2
## + ptratio  1   2028.62 13110 1349.0
## + dis      1    701.05 14437 1385.6
## + chas     1    559.65 14579 1389.3
## + age      1    341.56 14797 1394.9
## + black    1    288.81 14850 1396.2
## + zn       1    147.86 14991 1399.8
## + tax      1    104.31 15034 1400.9
## <none>                 15138 1401.5
## + crim     1     50.53 15088 1402.3
## + nox      1     34.88 15104 1402.7
## + indus    1     33.78 15105 1402.7
## + rad      1      3.47 15135 1403.5
## 
## Step:  AIC=1326.24
## medv ~ lstat + rm
## 
##           Df Sum of Sq   RSS    AIC
## + ptratio  1   1291.48 11054 1286.4
## + black    1    502.65 11842 1312.5
## + chas     1    435.53 11910 1314.6
## + dis      1    364.83 11980 1316.9
## + tax      1    225.13 12120 1321.3
## + crim     1    120.88 12224 1324.5
## + rad      1     98.94 12246 1325.2
## <none>                 12345 1326.2
## + age      1     61.75 12283 1326.3
## + zn       1     52.41 12293 1326.6
## + indus    1     22.34 12323 1327.5
## + nox      1      0.07 12345 1328.2
## 
## Step:  AIC=1286.36
## medv ~ lstat + rm + ptratio
## 
##         Df Sum of Sq   RSS    AIC
## + dis    1    481.11 10572 1271.5
## + black  1    373.59 10680 1275.3
## + chas   1    266.77 10787 1279.1
## + age    1    123.80 10930 1284.1
## <none>               11054 1286.4
## + crim   1     25.29 11028 1287.5
## + rad    1     12.96 11041 1287.9
## + zn     1     10.13 11044 1288.0
## + tax    1      9.30 11044 1288.0
## + indus  1      8.16 11046 1288.1
## + nox    1      5.98 11048 1288.2
## 
## Step:  AIC=1271.49
## medv ~ lstat + rm + ptratio + dis
## 
##         Df Sum of Sq   RSS    AIC
## + nox    1    529.88 10043 1254.0
## + black  1    491.97 10081 1255.4
## + chas   1    165.01 10408 1267.5
## + zn     1    156.13 10416 1267.9
## + indus  1    137.09 10436 1268.5
## + tax    1    109.84 10463 1269.5
## + crim   1     84.69 10488 1270.5
## <none>               10572 1271.5
## + age    1     11.88 10561 1273.1
## + rad    1      8.68 10564 1273.2
## 
## Step:  AIC=1254.01
## medv ~ lstat + rm + ptratio + dis + nox
## 
##         Df Sum of Sq     RSS    AIC
## + black  1    332.75  9709.9 1243.2
## + chas   1    218.82  9823.8 1247.7
## + zn     1    129.45  9913.2 1251.1
## <none>               10042.7 1254.0
## + rad    1     45.14  9997.5 1254.3
## + crim   1     35.59 10007.1 1254.7
## + age    1      5.97 10036.7 1255.8
## + indus  1      3.91 10038.8 1255.9
## + tax    1      0.23 10042.4 1256.0
## 
## Step:  AIC=1243.24
## medv ~ lstat + rm + ptratio + dis + nox + black
## 
##         Df Sum of Sq    RSS    AIC
## + chas   1   189.741 9520.2 1237.8
## + zn     1   156.559 9553.4 1239.1
## + rad    1   147.145 9562.8 1239.5
## <none>               9709.9 1243.2
## + tax    1    17.911 9692.0 1244.5
## + age    1     0.866 9709.1 1245.2
## + indus  1     0.363 9709.6 1245.2
## + crim   1     0.096 9709.8 1245.2
## 
## Step:  AIC=1237.76
## medv ~ lstat + rm + ptratio + dis + nox + black + chas
## 
##         Df Sum of Sq    RSS    AIC
## + zn     1   166.445 9353.7 1233.1
## + rad    1   148.067 9372.1 1233.8
## <none>               9520.2 1237.8
## + tax    1    24.059 9496.1 1238.8
## + indus  1     1.569 9518.6 1239.7
## + crim   1     0.130 9520.0 1239.8
## + age    1     0.028 9520.1 1239.8
## 
## Step:  AIC=1233.07
## medv ~ lstat + rm + ptratio + dis + nox + black + chas + zn
## 
##         Df Sum of Sq    RSS    AIC
## + rad    1   100.479 9253.3 1231.0
## <none>               9353.7 1233.1
## + age    1     5.801 9347.9 1234.8
## + crim   1     3.274 9350.5 1234.9
## + tax    1     2.923 9350.8 1235.0
## + indus  1     2.317 9351.4 1235.0
## 
## Step:  AIC=1230.98
## medv ~ lstat + rm + ptratio + dis + nox + black + chas + zn + 
##     rad
## 
##         Df Sum of Sq    RSS    AIC
## + tax    1   105.694 9147.6 1228.6
## + crim   1    54.787 9198.5 1230.7
## <none>               9253.3 1231.0
## + age    1    10.049 9243.2 1232.6
## + indus  1     6.269 9247.0 1232.7
## 
## Step:  AIC=1228.62
## medv ~ lstat + rm + ptratio + dis + nox + black + chas + zn + 
##     rad + tax
## 
##         Df Sum of Sq    RSS    AIC
## + crim   1    62.575 9085.0 1228.0
## <none>               9147.6 1228.6
## + age    1    12.267 9135.3 1230.1
## + indus  1     4.800 9142.8 1230.4
## 
## Step:  AIC=1228.02
## medv ~ lstat + rm + ptratio + dis + nox + black + chas + zn + 
##     rad + tax + crim
## 
##         Df Sum of Sq    RSS    AIC
## <none>               9085.0 1228.0
## + age    1   13.3369 9071.6 1229.5
## + indus  1    3.4172 9081.6 1229.9
summary(step_model_fwd)$adj.r.squared
## [1] 0.7235687
pred_step_fwd <- predict(step_model_fwd, boston_test)
#MSE of the test data (forward approach)
MSE_test_fwd <- mean((pred_step_fwd - boston_test$medv)^2)
MSE_test_fwd
## [1] 16.23726
#backward
step_model_back <- step(fullmodel, direction = "backward")
## Start:  AIC=1231.33
## medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad + 
##     tax + ptratio + black + lstat
## 
##           Df Sum of Sq     RSS    AIC
## - indus    1      3.33  9071.6 1229.5
## - age      1     13.25  9081.6 1229.9
## <none>                  9068.3 1231.3
## - crim     1     62.26  9130.6 1231.9
## - tax      1    110.14  9178.5 1233.9
## - chas     1    154.71  9223.0 1235.7
## - zn       1    191.89  9260.2 1237.3
## - rad      1    262.59  9330.9 1240.2
## - black    1    304.62  9372.9 1241.8
## - nox      1    382.76  9451.1 1245.0
## - dis      1    880.72  9949.0 1264.5
## - ptratio  1    916.03  9984.3 1265.8
## - rm       1   1207.18 10275.5 1276.7
## - lstat    1   2120.15 11188.5 1309.0
## 
## Step:  AIC=1229.47
## medv ~ crim + zn + chas + nox + rm + age + dis + rad + tax + 
##     ptratio + black + lstat
## 
##           Df Sum of Sq     RSS    AIC
## - age      1     13.34  9085.0 1228.0
## <none>                  9071.6 1229.5
## - crim     1     63.65  9135.3 1230.1
## - tax      1    115.94  9187.6 1232.3
## - chas     1    159.75  9231.4 1234.1
## - zn       1    189.10  9260.7 1235.3
## - rad      1    266.52  9338.2 1238.4
## - black    1    302.60  9374.2 1239.9
## - nox      1    393.42  9465.1 1243.6
## - ptratio  1    920.98  9992.6 1264.1
## - dis      1    943.66 10015.3 1265.0
## - rm       1   1205.04 10276.7 1274.7
## - lstat    1   2119.56 11191.2 1307.0
## 
## Step:  AIC=1228.02
## medv ~ crim + zn + chas + nox + rm + dis + rad + tax + ptratio + 
##     black + lstat
## 
##           Df Sum of Sq     RSS    AIC
## <none>                  9085.0 1228.0
## - crim     1     62.58  9147.6 1228.6
## - tax      1    113.48  9198.5 1230.7
## - chas     1    165.35  9250.3 1232.9
## - zn       1    177.93  9262.9 1233.4
## - rad      1    259.25  9344.2 1236.7
## - black    1    310.24  9395.2 1238.8
## - nox      1    381.82  9466.8 1241.6
## - ptratio  1    909.61  9994.6 1262.2
## - dis      1   1112.62 10197.6 1269.8
## - rm       1   1314.29 10399.3 1277.2
## - lstat    1   2256.61 11341.6 1310.1
summary(step_model_back)$adj.r.squared
## [1] 0.7235687
pred_step_back <- predict(step_model_back, boston_test)
#MSE of the test data (backward approach)
MSE_test_back <- mean((pred_step_back - boston_test$medv)^2)


#hybrid
step_model_both <- step(nullmodel, scope = list(lower = nullmodel, upper = fullmodel), direction = 'both')
## Start:  AIC=1704.53
## medv ~ 1
## 
##           Df Sum of Sq   RSS    AIC
## + lstat    1   18711.8 15138 1401.5
## + rm       1   16203.0 17647 1459.7
## + ptratio  1    9115.8 24735 1587.6
## + indus    1    8329.6 25521 1599.5
## + tax      1    7406.3 26444 1613.0
## + nox      1    6546.3 27304 1625.1
## + crim     1    5448.8 28402 1640.0
## + age      1    5004.0 28846 1645.9
## + zn       1    4880.2 28970 1647.5
## + rad      1    4876.9 28973 1647.6
## + black    1    4135.0 29715 1657.2
## + dis      1    2431.5 31419 1678.3
## + chas     1     930.1 32920 1696.0
## <none>                 33850 1704.5
## 
## Step:  AIC=1401.55
## medv ~ lstat
## 
##           Df Sum of Sq   RSS    AIC
## + rm       1    2793.3 12345 1326.2
## + ptratio  1    2028.6 13110 1349.0
## + dis      1     701.1 14437 1385.6
## + chas     1     559.6 14579 1389.3
## + age      1     341.6 14797 1394.9
## + black    1     288.8 14850 1396.2
## + zn       1     147.9 14991 1399.8
## + tax      1     104.3 15034 1400.9
## <none>                 15138 1401.5
## + crim     1      50.5 15088 1402.3
## + nox      1      34.9 15104 1402.7
## + indus    1      33.8 15105 1402.7
## + rad      1       3.5 15135 1403.5
## - lstat    1   18711.8 33850 1704.5
## 
## Step:  AIC=1326.24
## medv ~ lstat + rm
## 
##           Df Sum of Sq   RSS    AIC
## + ptratio  1    1291.5 11054 1286.4
## + black    1     502.7 11842 1312.5
## + chas     1     435.5 11910 1314.6
## + dis      1     364.8 11980 1316.9
## + tax      1     225.1 12120 1321.3
## + crim     1     120.9 12224 1324.5
## + rad      1      98.9 12246 1325.2
## <none>                 12345 1326.2
## + age      1      61.8 12283 1326.3
## + zn       1      52.4 12293 1326.6
## + indus    1      22.3 12323 1327.5
## + nox      1       0.1 12345 1328.2
## - rm       1    2793.3 15138 1401.5
## - lstat    1    5302.2 17647 1459.7
## 
## Step:  AIC=1286.36
## medv ~ lstat + rm + ptratio
## 
##           Df Sum of Sq   RSS    AIC
## + dis      1     481.1 10572 1271.5
## + black    1     373.6 10680 1275.3
## + chas     1     266.8 10787 1279.1
## + age      1     123.8 10930 1284.1
## <none>                 11054 1286.4
## + crim     1      25.3 11028 1287.5
## + rad      1      13.0 11041 1287.9
## + zn       1      10.1 11044 1288.0
## + tax      1       9.3 11044 1288.0
## + indus    1       8.2 11046 1288.1
## + nox      1       6.0 11048 1288.2
## - ptratio  1    1291.5 12345 1326.2
## - rm       1    2056.2 13110 1349.0
## - lstat    1    4006.1 15060 1401.6
## 
## Step:  AIC=1271.49
## medv ~ lstat + rm + ptratio + dis
## 
##           Df Sum of Sq   RSS    AIC
## + nox      1     529.9 10043 1254.0
## + black    1     492.0 10081 1255.4
## + chas     1     165.0 10408 1267.5
## + zn       1     156.1 10416 1267.9
## + indus    1     137.1 10436 1268.5
## + tax      1     109.8 10463 1269.5
## + crim     1      84.7 10488 1270.5
## <none>                 10572 1271.5
## + age      1      11.9 10561 1273.1
## + rad      1       8.7 10564 1273.2
## - dis      1     481.1 11054 1286.4
## - ptratio  1    1407.8 11980 1316.9
## - rm       1    1713.6 12286 1326.4
## - lstat    1    4363.9 14936 1400.5
## 
## Step:  AIC=1254.01
## medv ~ lstat + rm + ptratio + dis + nox
## 
##           Df Sum of Sq     RSS    AIC
## + black    1    332.75  9709.9 1243.2
## + chas     1    218.82  9823.8 1247.7
## + zn       1    129.45  9913.2 1251.1
## <none>                 10042.7 1254.0
## + rad      1     45.14  9997.5 1254.3
## + crim     1     35.59 10007.1 1254.7
## + age      1      5.97 10036.7 1255.8
## + indus    1      3.91 10038.8 1255.9
## + tax      1      0.23 10042.4 1256.0
## - nox      1    529.88 10572.5 1271.5
## - dis      1   1005.00 11047.7 1288.2
## - ptratio  1   1642.33 11685.0 1309.4
## - rm       1   1713.08 11755.8 1311.7
## - lstat    1   2818.83 12861.5 1345.8
## 
## Step:  AIC=1243.24
## medv ~ lstat + rm + ptratio + dis + nox + black
## 
##           Df Sum of Sq     RSS    AIC
## + chas     1    189.74  9520.2 1237.8
## + zn       1    156.56  9553.4 1239.1
## + rad      1    147.14  9562.8 1239.5
## <none>                  9709.9 1243.2
## + tax      1     17.91  9692.0 1244.5
## + age      1      0.87  9709.1 1245.2
## + indus    1      0.36  9709.6 1245.2
## + crim     1      0.10  9709.8 1245.2
## - black    1    332.75 10042.7 1254.0
## - nox      1    370.66 10080.6 1255.4
## - dis      1    966.85 10676.8 1277.2
## - ptratio  1   1469.72 11179.6 1294.7
## - rm       1   1846.55 11556.5 1307.2
## - lstat    1   2470.36 12180.3 1327.1
## 
## Step:  AIC=1237.76
## medv ~ lstat + rm + ptratio + dis + nox + black + chas
## 
##           Df Sum of Sq     RSS    AIC
## + zn       1    166.44  9353.7 1233.1
## + rad      1    148.07  9372.1 1233.8
## <none>                  9520.2 1237.8
## + tax      1     24.06  9496.1 1238.8
## + indus    1      1.57  9518.6 1239.7
## + crim     1      0.13  9520.0 1239.8
## + age      1      0.03  9520.1 1239.8
## - chas     1    189.74  9709.9 1243.2
## - black    1    303.67  9823.8 1247.7
## - nox      1    417.50  9937.7 1252.0
## - dis      1    903.87 10424.0 1270.1
## - ptratio  1   1319.18 10839.4 1284.9
## - rm       1   1836.79 11357.0 1302.6
## - lstat    1   2351.42 11871.6 1319.4
## 
## Step:  AIC=1233.07
## medv ~ lstat + rm + ptratio + dis + nox + black + chas + zn
## 
##           Df Sum of Sq     RSS    AIC
## + rad      1    100.48  9253.3 1231.0
## <none>                  9353.7 1233.1
## + age      1      5.80  9347.9 1234.8
## + crim     1      3.27  9350.5 1234.9
## + tax      1      2.92  9350.8 1235.0
## + indus    1      2.32  9351.4 1235.0
## - zn       1    166.44  9520.2 1237.8
## - chas     1    199.63  9553.4 1239.1
## - black    1    329.80  9683.5 1244.2
## - nox      1    386.76  9740.5 1246.4
## - ptratio  1    933.14 10286.9 1267.1
## - dis      1   1064.19 10417.9 1271.9
## - rm       1   1619.63 10973.4 1291.6
## - lstat    1   2438.62 11792.3 1318.9
## 
## Step:  AIC=1230.98
## medv ~ lstat + rm + ptratio + dis + nox + black + chas + zn + 
##     rad
## 
##           Df Sum of Sq     RSS    AIC
## + tax      1    105.69  9147.6 1228.6
## + crim     1     54.79  9198.5 1230.7
## <none>                  9253.3 1231.0
## + age      1     10.05  9243.2 1232.6
## + indus    1      6.27  9247.0 1232.7
## - rad      1    100.48  9353.7 1233.1
## - zn       1    118.86  9372.1 1233.8
## - chas     1    198.98  9452.2 1237.0
## - black    1    406.76  9660.0 1245.3
## - nox      1    483.44  9736.7 1248.3
## - ptratio  1   1009.60 10262.9 1268.2
## - dis      1   1025.80 10279.1 1268.8
## - rm       1   1451.34 10704.6 1284.2
## - lstat    1   2511.56 11764.8 1320.0
## 
## Step:  AIC=1228.62
## medv ~ lstat + rm + ptratio + dis + nox + black + chas + zn + 
##     rad + tax
## 
##           Df Sum of Sq     RSS    AIC
## + crim     1     62.58  9085.0 1228.0
## <none>                  9147.6 1228.6
## + age      1     12.27  9135.3 1230.1
## + indus    1      4.80  9142.8 1230.4
## - tax      1    105.69  9253.3 1231.0
## - zn       1    159.10  9306.7 1233.2
## - chas     1    177.31  9324.9 1233.9
## - rad      1    203.25  9350.8 1235.0
## - nox      1    360.76  9508.3 1241.3
## - black    1    385.86  9533.4 1242.3
## - ptratio  1    892.47 10040.0 1261.9
## - dis      1   1065.08 10212.6 1268.4
## - rm       1   1358.47 10506.0 1279.1
## - lstat    1   2434.41 11582.0 1316.0
## 
## Step:  AIC=1228.02
## medv ~ lstat + rm + ptratio + dis + nox + black + chas + zn + 
##     rad + tax + crim
## 
##           Df Sum of Sq     RSS    AIC
## <none>                  9085.0 1228.0
## - crim     1     62.58  9147.6 1228.6
## + age      1     13.34  9071.6 1229.5
## + indus    1      3.42  9081.6 1229.9
## - tax      1    113.48  9198.5 1230.7
## - chas     1    165.35  9250.3 1232.9
## - zn       1    177.93  9262.9 1233.4
## - rad      1    259.25  9344.2 1236.7
## - black    1    310.24  9395.2 1238.8
## - nox      1    381.82  9466.8 1241.6
## - ptratio  1    909.61  9994.6 1262.2
## - dis      1   1112.62 10197.6 1269.8
## - rm       1   1314.29 10399.3 1277.2
## - lstat    1   2256.61 11341.6 1310.1
summary(step_model_both)$adj.r.squared
## [1] 0.7235687
pred_step_hybrid <- predict(step_model_both, boston_test)
#MSE of the test data (hybrid approach)
MSE_test_hybrid <- mean((pred_step_hybrid - boston_test$medv)^2)
MSE_test_hybrid
## [1] 16.23726

LASSO Regression

# Lasso -------------------------------------------------------------------
#install.packages('glmnet')
library(glmnet)
cv_lasso_fit <-  cv.glmnet(x = as.matrix(boston_train[, -c(which(colnames(boston_train) == 'medv'))]), y = boston_train$medv, alpha = 1, nfolds = 5)

Plot to find minimum lambda for LASSO

plot(cv_lasso_fit)

The minimum lambda is 0.034971.

coef(cv_lasso_fit, s = cv_lasso_fit$lambda.min)
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept)  34.192720233
## crim         -0.062059394
## zn            0.039396566
## indus         .          
## chas          2.590150991
## nox         -16.297502320
## rm            3.723667031
## age           0.006265417
## dis          -1.342390455
## rad           0.194826396
## tax          -0.006332198
## ptratio      -0.950485133
## black         0.010798990
## lstat        -0.580254390
max(1 - cv_lasso_fit$cvm/var(boston_train$medv))
## [1] 0.6791922
#Boston.insample.prediction = predict(lasso_fit, as.matrix(Boston[, -c(which(colnames(Boston)=='medv'))]), s = cv_lasso_fit$lambda.min)

pred_lasso <-  predict(cv_lasso_fit, as.matrix(boston_test[, -c(which(colnames(boston_test) == 'medv'))]), s = cv_lasso_fit$lambda.min)
MSE_test_lasso <- mean((pred_lasso - boston_test$medv)^2)
MSE_test_lasso
## [1] 16.69286
# Comparing different models using test MSE -------------------------------
Adjusted_R_Sqrd <- round(as.numeric(c(0.7243935,0.7257688,0.7257688,0.7257688,0.7257688,0.6924559)),2)
mods <- c("Linear","Bestsubset", "Stepwise_fwd","Stepwise_back","Stepwise_hybrid","Lasso")

mods_df <- as.data.frame(cbind(mods,Adjusted_R_Sqrd))
mods_df
##              mods Adjusted_R_Sqrd
## 1          Linear            0.72
## 2      Bestsubset            0.73
## 3    Stepwise_fwd            0.73
## 4   Stepwise_back            0.73
## 5 Stepwise_hybrid            0.73
## 6           Lasso            0.69
theTable <- within(mods_df
                   , mods <- factor(mods, levels = names(sort(table(mods_df),                                        decreasing = TRUE))))

Plot of adjusted r squared values for different methods is shown below. We can see that LASSO and linear have low adjusted r squared values. Subset and stepwise methods have same r squared values. So, we will choose best subset as our final model.

mods
## [1] "Linear"          "Bestsubset"      "Stepwise_fwd"    "Stepwise_back"  
## [5] "Stepwise_hybrid" "Lasso"
mods_df %>%
  ggplot(aes(x = mods, y = as.factor(Adjusted_R_Sqrd) , fill = Adjusted_R_Sqrd)) + geom_bar(stat = "identity") + ylab("Adjusted R-Squared") + xlab("Models")

mean(summary(step_model_fwd)$residuals^2)
## [1] 23.97093
summary(subset_result)$rsq
##  [1] 0.5527817 0.6353021 0.6734548 0.6876676 0.7033211 0.7131513 0.7187566
##  [8] 0.7236737 0.7266420 0.7297644 0.7316130 0.7320070 0.7321054
summary(step_model_fwd)$r.squared
## [1] 0.731613
max(summary(subset_result)$rsq)
## [1] 0.7321054
summary(step_model_fwd)$adj.r.squared
## [1] 0.7235687
max(summary(subset_result)$adjr2)
## [1] 0.7235687

Based on the Adjusted R-Squared values of all the models considered, we observe that Best Subset gives the maximum value of Adjusted R-Square. It includes 11 variables namely - crim, zn, chas, nox, rm, dis, rad, tax , ptratio , black , lstat excluding age and indus variables