Problem
To predict median price of houses in Boston based on various housing, environmental and social attributes available in Boston Housing data with the help of linear, best subsets, stepwise and LASSO regression.
Approach
We will follow these steps to predict Boston Housing prices:
library(MASS)
library(dplyr)
library(corrplot)
library(tidyverse)
library(PerformanceAnalytics)
library(mctest)
About the Data
Housing Values in Suburbs of Boston The Boston data frame has 506 rows and 14 columns. This data frame contains the following columns:
\[medv = -0.0970*crim + 0.05569*zn + 3.4612*chas - 19.1165*nox + 3.5841*rm - 1.581*dis + 0.3584*rad -0.0131*tax - 0.9662*ptratio + 0.0121*black - 0.4972*lstat\]
General structure of data and checking for NULL values
# General Structure
names(Boston)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
# Checking for NULL values
colSums(is.na(Boston))
## crim zn indus chas nox rm age dis rad
## 0 0 0 0 0 0 0 0 0
## tax ptratio black lstat medv
## 0 0 0 0 0
Univariate Analysis(Histograms)
# Univariate Analysis
Boston %>% keep(is.numeric) %>%
gather() %>% ggplot(aes(value)) + facet_wrap(~key, scales = "free") + geom_histogram()
We see that age follows a left skewed distribution, so does the proportion of black.
Distance, lstat and medv follow a right skewed distribution.
Nox (Nitrogen Oxides Concentration) follow an almost uniform distribution Rm follows a normal distribution.
Tax and rad also follow an almost normal distribution with few extreme values.
Univariate Analysis(Boxplots)
boxplot(Boston)
Boxplots to observe the summary statistic of each variable:
We see that the tax variable could be normalized to bring it in scale with other variables
There are a lot of outliers in the black variable that should be treated before analysis
Bivariate Analysis (response variable vs predictor variables)
#3 pairwise correlations
corr_medv <- cor(Boston$medv, Boston[,-14], use = "pairwise.complete.obs")
corr_medv <- t(corr_medv)
View(corr_medv)
corr_matrix <- cor(Boston, use = "pairwise.complete.obs")
corrplot(corr_matrix, type = "lower", order = "hclust",
tl.col = "black", tl.srt = 45)
Age, Indus and Nox are highly correlated with dis variable which can led to multicollinearity
#train_index <- sample(nrow(Boston),0.75*nrow(Boston))
#boston_train <- Boston[train_index,] # select 4 rows at random
#boston_test <- Boston[-train_index,]
#Boston <- Boston %>% mutate(id = row_number())
#head(Boston$id)
Sampling of data
The dataset was randomly divided into 75% train and 25% test data. The model will be trained on the train data and its performance will be tested on rest of the data to avoid overfitting.
set.seed(7)
train_index <- sample(nrow(Boston),0.75*nrow(Boston))
boston_train <- Boston[train_index,]
boston_test <- Boston[-train_index,]
Linear regression
Following results were obtained after fitting the linear regression model:
# linear regression on the data -------------------------------------------
fit <- lm(boston_train$medv ~., data = boston_train)
summary_test <- summary(fit)
print(summary_test)
##
## Call:
## lm(formula = boston_train$medv ~ ., data = boston_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.5282 -2.8594 -0.7306 1.9008 25.0097
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.901839 6.265388 5.890 8.78e-09 ***
## crim -0.078110 0.049341 -1.583 0.114275
## zn 0.045881 0.016509 2.779 0.005732 **
## indus 0.026917 0.073496 0.366 0.714397
## chas 2.585672 1.036172 2.495 0.013022 *
## nox -18.492688 4.711442 -3.925 0.000104 ***
## rm 3.618608 0.519126 6.971 1.48e-11 ***
## age 0.011858 0.016237 0.730 0.465648
## dis -1.414448 0.237566 -5.954 6.15e-09 ***
## rad 0.263478 0.081044 3.251 0.001257 **
## tax -0.009424 0.004476 -2.106 0.035927 *
## ptratio -0.985265 0.162261 -6.072 3.17e-09 ***
## black 0.011024 0.003148 3.502 0.000520 ***
## lstat -0.588063 0.063659 -9.238 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.984 on 365 degrees of freedom
## Multiple R-squared: 0.7321, Adjusted R-squared: 0.7226
## F-statistic: 76.73 on 13 and 365 DF, p-value: < 2.2e-16
Adjusted R-squared of 72.26% is obtained and 3 variables (crim, indus and age) are identified as not significant.
The model was then applied to the test data set.
fit_summary <- list(("r_sqr" = summary_test$r.squared), ("adj_rsqr" = summary_test$adj.r.squared),
("AIC" = AIC(fit)), ("BIC" = BIC(fit)))
pred_fit <- predict(fit, boston_test)
#MSE
MSE_test_lm <- mean((pred_fit - boston_test$medv)^2)
#Mean Absolute error
MAE <- mean(abs(pred_fit - boston_test$medv))
Mean squared error of 16.48 and mean absolute error of 2.92 was obtained for the test data set.
Best Subset Selection
We now run the best subset selection method which gives us the variables to choose for each number of variable, shown by a ’*’ in the results below.Here are the results and coefficients values for the best subset selection:
#Variable Selection
library(leaps)
#install.packages("MuMIn")
library(MuMIn)
library(mctest)
#best subset ------------------------------------------------------------
subset_result <- regsubsets(medv~.,data = boston_train, nbest = 1, nvmax = 14)
summary(subset_result)
## Subset selection object
## Call: regsubsets.formula(medv ~ ., data = boston_train, nbest = 1,
## nvmax = 14)
## 13 Variables (and intercept)
## Forced in Forced out
## crim FALSE FALSE
## zn FALSE FALSE
## indus FALSE FALSE
## chas FALSE FALSE
## nox FALSE FALSE
## rm FALSE FALSE
## age FALSE FALSE
## dis FALSE FALSE
## rad FALSE FALSE
## tax FALSE FALSE
## ptratio FALSE FALSE
## black FALSE FALSE
## lstat FALSE FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " " "*"
## 2 ( 1 ) " " " " " " " " " " "*" " " " " " " " " " " " " "*"
## 3 ( 1 ) " " " " " " " " " " "*" " " " " " " " " "*" " " "*"
## 4 ( 1 ) " " " " " " " " " " "*" " " "*" " " " " "*" " " "*"
## 5 ( 1 ) " " " " " " " " "*" "*" " " "*" " " " " "*" " " "*"
## 6 ( 1 ) " " " " " " " " "*" "*" " " "*" " " " " "*" "*" "*"
## 7 ( 1 ) " " " " " " "*" "*" "*" " " "*" " " " " "*" "*" "*"
## 8 ( 1 ) " " "*" " " "*" "*" "*" " " "*" " " " " "*" "*" "*"
## 9 ( 1 ) " " "*" " " "*" "*" "*" " " "*" "*" " " "*" "*" "*"
## 10 ( 1 ) " " "*" " " "*" "*" "*" " " "*" "*" "*" "*" "*" "*"
## 11 ( 1 ) "*" "*" " " "*" "*" "*" " " "*" "*" "*" "*" "*" "*"
## 12 ( 1 ) "*" "*" " " "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
We can see the coefficients after selecting the number of variables needed in our model (example for 2,8,13).
coef(subset_result, 2)
## (Intercept) rm lstat
## -0.4473873 5.0151882 -0.6725328
coef(subset_result, 8)
## (Intercept) zn chas nox rm
## 30.25412687 0.04045765 2.90980703 -15.56249709 4.00306915
## dis ptratio black lstat
## -1.43490115 -0.86299730 0.01079068 -0.58231343
coef(subset_result, 13)
## (Intercept) crim zn indus chas
## 36.90183910 -0.07810995 0.04588056 0.02691726 2.58567182
## nox rm age dis rad
## -18.49268800 3.61860790 0.01185825 -1.41444814 0.26347776
## tax ptratio black lstat
## -0.00942430 -0.98526484 0.01102414 -0.58806322
The adjusted R squared for the best subset method is:
max(summary(subset_result)$adjr2)
## [1] 0.7235687
Here is the plot of best subset results:
plot(subset_result, scale = "bic")
#test pred
test.mat <- model.matrix(medv~., data= boston_test)
summary(test.mat)
## (Intercept) crim zn indus
## Min. :1 Min. : 0.00632 Min. : 0.000 Min. : 1.52
## 1st Qu.:1 1st Qu.: 0.08850 1st Qu.: 0.000 1st Qu.: 6.08
## Median :1 Median : 0.25356 Median : 0.000 Median : 9.90
## Mean :1 Mean : 3.92792 Mean : 9.299 Mean :11.36
## 3rd Qu.:1 3rd Qu.: 2.87119 3rd Qu.: 0.000 3rd Qu.:18.10
## Max. :1 Max. :88.97620 Max. :95.000 Max. :27.74
## chas nox rm age
## Min. :0.00000 Min. :0.3890 Min. :3.561 Min. : 6.50
## 1st Qu.:0.00000 1st Qu.:0.4555 1st Qu.:5.889 1st Qu.: 45.75
## Median :0.00000 Median :0.5380 Median :6.122 Median : 81.30
## Mean :0.06299 Mean :0.5596 Mean :6.225 Mean : 69.40
## 3rd Qu.:0.00000 3rd Qu.:0.6240 3rd Qu.:6.500 3rd Qu.: 94.10
## Max. :1.00000 Max. :0.8710 Max. :8.704 Max. :100.00
## dis rad tax ptratio
## Min. : 1.178 Min. : 1.00 Min. :188.0 Min. :12.60
## 1st Qu.: 2.235 1st Qu.: 4.00 1st Qu.:284.0 1st Qu.:17.40
## Median : 3.421 Median : 5.00 Median :330.0 Median :18.60
## Mean : 3.845 Mean : 9.26 Mean :406.7 Mean :18.46
## 3rd Qu.: 5.166 3rd Qu.:16.00 3rd Qu.:666.0 3rd Qu.:20.20
## Max. :10.586 Max. :24.00 Max. :711.0 Max. :22.00
## black lstat
## Min. : 6.68 Min. : 1.920
## 1st Qu.:379.70 1st Qu.: 8.075
## Median :392.18 Median :12.040
## Mean :367.40 Mean :13.051
## 3rd Qu.:396.90 3rd Qu.:16.545
## Max. :396.90 Max. :36.980
val.errors <- rep(NA,8)
for(i in 1:8){
coefi <- coef(subset_result, id=i)
pred <- test.mat[,names(coefi)]%*%coefi
val.errors[i] <- mean((boston_test$medv - pred)^2)
}
MSE_bestsubset <- min(val.errors)
#AIC(subset_result)
Stepwise - Forward, backward and Hybrid
We will check MSE, AIC and r squared values for all these methods.
# Stepwise - Forward, Backward and Hybrid ---------------------------------
nullmodel=lm(medv~1, data = boston_train)
fullmodel=lm(medv~., data = boston_train)
#forward
step_model_fwd <- step(nullmodel, scope = list(lower = nullmodel, upper = fullmodel), direction = "forward")
## Start: AIC=1704.53
## medv ~ 1
##
## Df Sum of Sq RSS AIC
## + lstat 1 18711.8 15138 1401.5
## + rm 1 16203.0 17647 1459.7
## + ptratio 1 9115.8 24735 1587.6
## + indus 1 8329.6 25521 1599.5
## + tax 1 7406.3 26444 1613.0
## + nox 1 6546.3 27304 1625.1
## + crim 1 5448.8 28402 1640.0
## + age 1 5004.0 28846 1645.9
## + zn 1 4880.2 28970 1647.5
## + rad 1 4876.9 28973 1647.6
## + black 1 4135.0 29715 1657.2
## + dis 1 2431.5 31419 1678.3
## + chas 1 930.1 32920 1696.0
## <none> 33850 1704.5
##
## Step: AIC=1401.55
## medv ~ lstat
##
## Df Sum of Sq RSS AIC
## + rm 1 2793.34 12345 1326.2
## + ptratio 1 2028.62 13110 1349.0
## + dis 1 701.05 14437 1385.6
## + chas 1 559.65 14579 1389.3
## + age 1 341.56 14797 1394.9
## + black 1 288.81 14850 1396.2
## + zn 1 147.86 14991 1399.8
## + tax 1 104.31 15034 1400.9
## <none> 15138 1401.5
## + crim 1 50.53 15088 1402.3
## + nox 1 34.88 15104 1402.7
## + indus 1 33.78 15105 1402.7
## + rad 1 3.47 15135 1403.5
##
## Step: AIC=1326.24
## medv ~ lstat + rm
##
## Df Sum of Sq RSS AIC
## + ptratio 1 1291.48 11054 1286.4
## + black 1 502.65 11842 1312.5
## + chas 1 435.53 11910 1314.6
## + dis 1 364.83 11980 1316.9
## + tax 1 225.13 12120 1321.3
## + crim 1 120.88 12224 1324.5
## + rad 1 98.94 12246 1325.2
## <none> 12345 1326.2
## + age 1 61.75 12283 1326.3
## + zn 1 52.41 12293 1326.6
## + indus 1 22.34 12323 1327.5
## + nox 1 0.07 12345 1328.2
##
## Step: AIC=1286.36
## medv ~ lstat + rm + ptratio
##
## Df Sum of Sq RSS AIC
## + dis 1 481.11 10572 1271.5
## + black 1 373.59 10680 1275.3
## + chas 1 266.77 10787 1279.1
## + age 1 123.80 10930 1284.1
## <none> 11054 1286.4
## + crim 1 25.29 11028 1287.5
## + rad 1 12.96 11041 1287.9
## + zn 1 10.13 11044 1288.0
## + tax 1 9.30 11044 1288.0
## + indus 1 8.16 11046 1288.1
## + nox 1 5.98 11048 1288.2
##
## Step: AIC=1271.49
## medv ~ lstat + rm + ptratio + dis
##
## Df Sum of Sq RSS AIC
## + nox 1 529.88 10043 1254.0
## + black 1 491.97 10081 1255.4
## + chas 1 165.01 10408 1267.5
## + zn 1 156.13 10416 1267.9
## + indus 1 137.09 10436 1268.5
## + tax 1 109.84 10463 1269.5
## + crim 1 84.69 10488 1270.5
## <none> 10572 1271.5
## + age 1 11.88 10561 1273.1
## + rad 1 8.68 10564 1273.2
##
## Step: AIC=1254.01
## medv ~ lstat + rm + ptratio + dis + nox
##
## Df Sum of Sq RSS AIC
## + black 1 332.75 9709.9 1243.2
## + chas 1 218.82 9823.8 1247.7
## + zn 1 129.45 9913.2 1251.1
## <none> 10042.7 1254.0
## + rad 1 45.14 9997.5 1254.3
## + crim 1 35.59 10007.1 1254.7
## + age 1 5.97 10036.7 1255.8
## + indus 1 3.91 10038.8 1255.9
## + tax 1 0.23 10042.4 1256.0
##
## Step: AIC=1243.24
## medv ~ lstat + rm + ptratio + dis + nox + black
##
## Df Sum of Sq RSS AIC
## + chas 1 189.741 9520.2 1237.8
## + zn 1 156.559 9553.4 1239.1
## + rad 1 147.145 9562.8 1239.5
## <none> 9709.9 1243.2
## + tax 1 17.911 9692.0 1244.5
## + age 1 0.866 9709.1 1245.2
## + indus 1 0.363 9709.6 1245.2
## + crim 1 0.096 9709.8 1245.2
##
## Step: AIC=1237.76
## medv ~ lstat + rm + ptratio + dis + nox + black + chas
##
## Df Sum of Sq RSS AIC
## + zn 1 166.445 9353.7 1233.1
## + rad 1 148.067 9372.1 1233.8
## <none> 9520.2 1237.8
## + tax 1 24.059 9496.1 1238.8
## + indus 1 1.569 9518.6 1239.7
## + crim 1 0.130 9520.0 1239.8
## + age 1 0.028 9520.1 1239.8
##
## Step: AIC=1233.07
## medv ~ lstat + rm + ptratio + dis + nox + black + chas + zn
##
## Df Sum of Sq RSS AIC
## + rad 1 100.479 9253.3 1231.0
## <none> 9353.7 1233.1
## + age 1 5.801 9347.9 1234.8
## + crim 1 3.274 9350.5 1234.9
## + tax 1 2.923 9350.8 1235.0
## + indus 1 2.317 9351.4 1235.0
##
## Step: AIC=1230.98
## medv ~ lstat + rm + ptratio + dis + nox + black + chas + zn +
## rad
##
## Df Sum of Sq RSS AIC
## + tax 1 105.694 9147.6 1228.6
## + crim 1 54.787 9198.5 1230.7
## <none> 9253.3 1231.0
## + age 1 10.049 9243.2 1232.6
## + indus 1 6.269 9247.0 1232.7
##
## Step: AIC=1228.62
## medv ~ lstat + rm + ptratio + dis + nox + black + chas + zn +
## rad + tax
##
## Df Sum of Sq RSS AIC
## + crim 1 62.575 9085.0 1228.0
## <none> 9147.6 1228.6
## + age 1 12.267 9135.3 1230.1
## + indus 1 4.800 9142.8 1230.4
##
## Step: AIC=1228.02
## medv ~ lstat + rm + ptratio + dis + nox + black + chas + zn +
## rad + tax + crim
##
## Df Sum of Sq RSS AIC
## <none> 9085.0 1228.0
## + age 1 13.3369 9071.6 1229.5
## + indus 1 3.4172 9081.6 1229.9
summary(step_model_fwd)$adj.r.squared
## [1] 0.7235687
pred_step_fwd <- predict(step_model_fwd, boston_test)
#MSE of the test data (forward approach)
MSE_test_fwd <- mean((pred_step_fwd - boston_test$medv)^2)
MSE_test_fwd
## [1] 16.23726
#backward
step_model_back <- step(fullmodel, direction = "backward")
## Start: AIC=1231.33
## medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad +
## tax + ptratio + black + lstat
##
## Df Sum of Sq RSS AIC
## - indus 1 3.33 9071.6 1229.5
## - age 1 13.25 9081.6 1229.9
## <none> 9068.3 1231.3
## - crim 1 62.26 9130.6 1231.9
## - tax 1 110.14 9178.5 1233.9
## - chas 1 154.71 9223.0 1235.7
## - zn 1 191.89 9260.2 1237.3
## - rad 1 262.59 9330.9 1240.2
## - black 1 304.62 9372.9 1241.8
## - nox 1 382.76 9451.1 1245.0
## - dis 1 880.72 9949.0 1264.5
## - ptratio 1 916.03 9984.3 1265.8
## - rm 1 1207.18 10275.5 1276.7
## - lstat 1 2120.15 11188.5 1309.0
##
## Step: AIC=1229.47
## medv ~ crim + zn + chas + nox + rm + age + dis + rad + tax +
## ptratio + black + lstat
##
## Df Sum of Sq RSS AIC
## - age 1 13.34 9085.0 1228.0
## <none> 9071.6 1229.5
## - crim 1 63.65 9135.3 1230.1
## - tax 1 115.94 9187.6 1232.3
## - chas 1 159.75 9231.4 1234.1
## - zn 1 189.10 9260.7 1235.3
## - rad 1 266.52 9338.2 1238.4
## - black 1 302.60 9374.2 1239.9
## - nox 1 393.42 9465.1 1243.6
## - ptratio 1 920.98 9992.6 1264.1
## - dis 1 943.66 10015.3 1265.0
## - rm 1 1205.04 10276.7 1274.7
## - lstat 1 2119.56 11191.2 1307.0
##
## Step: AIC=1228.02
## medv ~ crim + zn + chas + nox + rm + dis + rad + tax + ptratio +
## black + lstat
##
## Df Sum of Sq RSS AIC
## <none> 9085.0 1228.0
## - crim 1 62.58 9147.6 1228.6
## - tax 1 113.48 9198.5 1230.7
## - chas 1 165.35 9250.3 1232.9
## - zn 1 177.93 9262.9 1233.4
## - rad 1 259.25 9344.2 1236.7
## - black 1 310.24 9395.2 1238.8
## - nox 1 381.82 9466.8 1241.6
## - ptratio 1 909.61 9994.6 1262.2
## - dis 1 1112.62 10197.6 1269.8
## - rm 1 1314.29 10399.3 1277.2
## - lstat 1 2256.61 11341.6 1310.1
summary(step_model_back)$adj.r.squared
## [1] 0.7235687
pred_step_back <- predict(step_model_back, boston_test)
#MSE of the test data (backward approach)
MSE_test_back <- mean((pred_step_back - boston_test$medv)^2)
#hybrid
step_model_both <- step(nullmodel, scope = list(lower = nullmodel, upper = fullmodel), direction = 'both')
## Start: AIC=1704.53
## medv ~ 1
##
## Df Sum of Sq RSS AIC
## + lstat 1 18711.8 15138 1401.5
## + rm 1 16203.0 17647 1459.7
## + ptratio 1 9115.8 24735 1587.6
## + indus 1 8329.6 25521 1599.5
## + tax 1 7406.3 26444 1613.0
## + nox 1 6546.3 27304 1625.1
## + crim 1 5448.8 28402 1640.0
## + age 1 5004.0 28846 1645.9
## + zn 1 4880.2 28970 1647.5
## + rad 1 4876.9 28973 1647.6
## + black 1 4135.0 29715 1657.2
## + dis 1 2431.5 31419 1678.3
## + chas 1 930.1 32920 1696.0
## <none> 33850 1704.5
##
## Step: AIC=1401.55
## medv ~ lstat
##
## Df Sum of Sq RSS AIC
## + rm 1 2793.3 12345 1326.2
## + ptratio 1 2028.6 13110 1349.0
## + dis 1 701.1 14437 1385.6
## + chas 1 559.6 14579 1389.3
## + age 1 341.6 14797 1394.9
## + black 1 288.8 14850 1396.2
## + zn 1 147.9 14991 1399.8
## + tax 1 104.3 15034 1400.9
## <none> 15138 1401.5
## + crim 1 50.5 15088 1402.3
## + nox 1 34.9 15104 1402.7
## + indus 1 33.8 15105 1402.7
## + rad 1 3.5 15135 1403.5
## - lstat 1 18711.8 33850 1704.5
##
## Step: AIC=1326.24
## medv ~ lstat + rm
##
## Df Sum of Sq RSS AIC
## + ptratio 1 1291.5 11054 1286.4
## + black 1 502.7 11842 1312.5
## + chas 1 435.5 11910 1314.6
## + dis 1 364.8 11980 1316.9
## + tax 1 225.1 12120 1321.3
## + crim 1 120.9 12224 1324.5
## + rad 1 98.9 12246 1325.2
## <none> 12345 1326.2
## + age 1 61.8 12283 1326.3
## + zn 1 52.4 12293 1326.6
## + indus 1 22.3 12323 1327.5
## + nox 1 0.1 12345 1328.2
## - rm 1 2793.3 15138 1401.5
## - lstat 1 5302.2 17647 1459.7
##
## Step: AIC=1286.36
## medv ~ lstat + rm + ptratio
##
## Df Sum of Sq RSS AIC
## + dis 1 481.1 10572 1271.5
## + black 1 373.6 10680 1275.3
## + chas 1 266.8 10787 1279.1
## + age 1 123.8 10930 1284.1
## <none> 11054 1286.4
## + crim 1 25.3 11028 1287.5
## + rad 1 13.0 11041 1287.9
## + zn 1 10.1 11044 1288.0
## + tax 1 9.3 11044 1288.0
## + indus 1 8.2 11046 1288.1
## + nox 1 6.0 11048 1288.2
## - ptratio 1 1291.5 12345 1326.2
## - rm 1 2056.2 13110 1349.0
## - lstat 1 4006.1 15060 1401.6
##
## Step: AIC=1271.49
## medv ~ lstat + rm + ptratio + dis
##
## Df Sum of Sq RSS AIC
## + nox 1 529.9 10043 1254.0
## + black 1 492.0 10081 1255.4
## + chas 1 165.0 10408 1267.5
## + zn 1 156.1 10416 1267.9
## + indus 1 137.1 10436 1268.5
## + tax 1 109.8 10463 1269.5
## + crim 1 84.7 10488 1270.5
## <none> 10572 1271.5
## + age 1 11.9 10561 1273.1
## + rad 1 8.7 10564 1273.2
## - dis 1 481.1 11054 1286.4
## - ptratio 1 1407.8 11980 1316.9
## - rm 1 1713.6 12286 1326.4
## - lstat 1 4363.9 14936 1400.5
##
## Step: AIC=1254.01
## medv ~ lstat + rm + ptratio + dis + nox
##
## Df Sum of Sq RSS AIC
## + black 1 332.75 9709.9 1243.2
## + chas 1 218.82 9823.8 1247.7
## + zn 1 129.45 9913.2 1251.1
## <none> 10042.7 1254.0
## + rad 1 45.14 9997.5 1254.3
## + crim 1 35.59 10007.1 1254.7
## + age 1 5.97 10036.7 1255.8
## + indus 1 3.91 10038.8 1255.9
## + tax 1 0.23 10042.4 1256.0
## - nox 1 529.88 10572.5 1271.5
## - dis 1 1005.00 11047.7 1288.2
## - ptratio 1 1642.33 11685.0 1309.4
## - rm 1 1713.08 11755.8 1311.7
## - lstat 1 2818.83 12861.5 1345.8
##
## Step: AIC=1243.24
## medv ~ lstat + rm + ptratio + dis + nox + black
##
## Df Sum of Sq RSS AIC
## + chas 1 189.74 9520.2 1237.8
## + zn 1 156.56 9553.4 1239.1
## + rad 1 147.14 9562.8 1239.5
## <none> 9709.9 1243.2
## + tax 1 17.91 9692.0 1244.5
## + age 1 0.87 9709.1 1245.2
## + indus 1 0.36 9709.6 1245.2
## + crim 1 0.10 9709.8 1245.2
## - black 1 332.75 10042.7 1254.0
## - nox 1 370.66 10080.6 1255.4
## - dis 1 966.85 10676.8 1277.2
## - ptratio 1 1469.72 11179.6 1294.7
## - rm 1 1846.55 11556.5 1307.2
## - lstat 1 2470.36 12180.3 1327.1
##
## Step: AIC=1237.76
## medv ~ lstat + rm + ptratio + dis + nox + black + chas
##
## Df Sum of Sq RSS AIC
## + zn 1 166.44 9353.7 1233.1
## + rad 1 148.07 9372.1 1233.8
## <none> 9520.2 1237.8
## + tax 1 24.06 9496.1 1238.8
## + indus 1 1.57 9518.6 1239.7
## + crim 1 0.13 9520.0 1239.8
## + age 1 0.03 9520.1 1239.8
## - chas 1 189.74 9709.9 1243.2
## - black 1 303.67 9823.8 1247.7
## - nox 1 417.50 9937.7 1252.0
## - dis 1 903.87 10424.0 1270.1
## - ptratio 1 1319.18 10839.4 1284.9
## - rm 1 1836.79 11357.0 1302.6
## - lstat 1 2351.42 11871.6 1319.4
##
## Step: AIC=1233.07
## medv ~ lstat + rm + ptratio + dis + nox + black + chas + zn
##
## Df Sum of Sq RSS AIC
## + rad 1 100.48 9253.3 1231.0
## <none> 9353.7 1233.1
## + age 1 5.80 9347.9 1234.8
## + crim 1 3.27 9350.5 1234.9
## + tax 1 2.92 9350.8 1235.0
## + indus 1 2.32 9351.4 1235.0
## - zn 1 166.44 9520.2 1237.8
## - chas 1 199.63 9553.4 1239.1
## - black 1 329.80 9683.5 1244.2
## - nox 1 386.76 9740.5 1246.4
## - ptratio 1 933.14 10286.9 1267.1
## - dis 1 1064.19 10417.9 1271.9
## - rm 1 1619.63 10973.4 1291.6
## - lstat 1 2438.62 11792.3 1318.9
##
## Step: AIC=1230.98
## medv ~ lstat + rm + ptratio + dis + nox + black + chas + zn +
## rad
##
## Df Sum of Sq RSS AIC
## + tax 1 105.69 9147.6 1228.6
## + crim 1 54.79 9198.5 1230.7
## <none> 9253.3 1231.0
## + age 1 10.05 9243.2 1232.6
## + indus 1 6.27 9247.0 1232.7
## - rad 1 100.48 9353.7 1233.1
## - zn 1 118.86 9372.1 1233.8
## - chas 1 198.98 9452.2 1237.0
## - black 1 406.76 9660.0 1245.3
## - nox 1 483.44 9736.7 1248.3
## - ptratio 1 1009.60 10262.9 1268.2
## - dis 1 1025.80 10279.1 1268.8
## - rm 1 1451.34 10704.6 1284.2
## - lstat 1 2511.56 11764.8 1320.0
##
## Step: AIC=1228.62
## medv ~ lstat + rm + ptratio + dis + nox + black + chas + zn +
## rad + tax
##
## Df Sum of Sq RSS AIC
## + crim 1 62.58 9085.0 1228.0
## <none> 9147.6 1228.6
## + age 1 12.27 9135.3 1230.1
## + indus 1 4.80 9142.8 1230.4
## - tax 1 105.69 9253.3 1231.0
## - zn 1 159.10 9306.7 1233.2
## - chas 1 177.31 9324.9 1233.9
## - rad 1 203.25 9350.8 1235.0
## - nox 1 360.76 9508.3 1241.3
## - black 1 385.86 9533.4 1242.3
## - ptratio 1 892.47 10040.0 1261.9
## - dis 1 1065.08 10212.6 1268.4
## - rm 1 1358.47 10506.0 1279.1
## - lstat 1 2434.41 11582.0 1316.0
##
## Step: AIC=1228.02
## medv ~ lstat + rm + ptratio + dis + nox + black + chas + zn +
## rad + tax + crim
##
## Df Sum of Sq RSS AIC
## <none> 9085.0 1228.0
## - crim 1 62.58 9147.6 1228.6
## + age 1 13.34 9071.6 1229.5
## + indus 1 3.42 9081.6 1229.9
## - tax 1 113.48 9198.5 1230.7
## - chas 1 165.35 9250.3 1232.9
## - zn 1 177.93 9262.9 1233.4
## - rad 1 259.25 9344.2 1236.7
## - black 1 310.24 9395.2 1238.8
## - nox 1 381.82 9466.8 1241.6
## - ptratio 1 909.61 9994.6 1262.2
## - dis 1 1112.62 10197.6 1269.8
## - rm 1 1314.29 10399.3 1277.2
## - lstat 1 2256.61 11341.6 1310.1
summary(step_model_both)$adj.r.squared
## [1] 0.7235687
pred_step_hybrid <- predict(step_model_both, boston_test)
#MSE of the test data (hybrid approach)
MSE_test_hybrid <- mean((pred_step_hybrid - boston_test$medv)^2)
MSE_test_hybrid
## [1] 16.23726
LASSO Regression
# Lasso -------------------------------------------------------------------
#install.packages('glmnet')
library(glmnet)
cv_lasso_fit <- cv.glmnet(x = as.matrix(boston_train[, -c(which(colnames(boston_train) == 'medv'))]), y = boston_train$medv, alpha = 1, nfolds = 5)
Plot to find minimum lambda for LASSO
plot(cv_lasso_fit)
The minimum lambda is 0.034971.
coef(cv_lasso_fit, s = cv_lasso_fit$lambda.min)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 34.192720233
## crim -0.062059394
## zn 0.039396566
## indus .
## chas 2.590150991
## nox -16.297502320
## rm 3.723667031
## age 0.006265417
## dis -1.342390455
## rad 0.194826396
## tax -0.006332198
## ptratio -0.950485133
## black 0.010798990
## lstat -0.580254390
max(1 - cv_lasso_fit$cvm/var(boston_train$medv))
## [1] 0.6791922
#Boston.insample.prediction = predict(lasso_fit, as.matrix(Boston[, -c(which(colnames(Boston)=='medv'))]), s = cv_lasso_fit$lambda.min)
pred_lasso <- predict(cv_lasso_fit, as.matrix(boston_test[, -c(which(colnames(boston_test) == 'medv'))]), s = cv_lasso_fit$lambda.min)
MSE_test_lasso <- mean((pred_lasso - boston_test$medv)^2)
MSE_test_lasso
## [1] 16.69286
# Comparing different models using test MSE -------------------------------
Adjusted_R_Sqrd <- round(as.numeric(c(0.7243935,0.7257688,0.7257688,0.7257688,0.7257688,0.6924559)),2)
mods <- c("Linear","Bestsubset", "Stepwise_fwd","Stepwise_back","Stepwise_hybrid","Lasso")
mods_df <- as.data.frame(cbind(mods,Adjusted_R_Sqrd))
mods_df
## mods Adjusted_R_Sqrd
## 1 Linear 0.72
## 2 Bestsubset 0.73
## 3 Stepwise_fwd 0.73
## 4 Stepwise_back 0.73
## 5 Stepwise_hybrid 0.73
## 6 Lasso 0.69
theTable <- within(mods_df
, mods <- factor(mods, levels = names(sort(table(mods_df), decreasing = TRUE))))
Plot of adjusted r squared values for different methods is shown below. We can see that LASSO and linear have low adjusted r squared values. Subset and stepwise methods have same r squared values. So, we will choose best subset as our final model.
mods
## [1] "Linear" "Bestsubset" "Stepwise_fwd" "Stepwise_back"
## [5] "Stepwise_hybrid" "Lasso"
mods_df %>%
ggplot(aes(x = mods, y = as.factor(Adjusted_R_Sqrd) , fill = Adjusted_R_Sqrd)) + geom_bar(stat = "identity") + ylab("Adjusted R-Squared") + xlab("Models")
mean(summary(step_model_fwd)$residuals^2)
## [1] 23.97093
summary(subset_result)$rsq
## [1] 0.5527817 0.6353021 0.6734548 0.6876676 0.7033211 0.7131513 0.7187566
## [8] 0.7236737 0.7266420 0.7297644 0.7316130 0.7320070 0.7321054
summary(step_model_fwd)$r.squared
## [1] 0.731613
max(summary(subset_result)$rsq)
## [1] 0.7321054
summary(step_model_fwd)$adj.r.squared
## [1] 0.7235687
max(summary(subset_result)$adjr2)
## [1] 0.7235687
Based on the Adjusted R-Squared values of all the models considered, we observe that Best Subset gives the maximum value of Adjusted R-Square. It includes 11 variables namely - crim, zn, chas, nox, rm, dis, rad, tax , ptratio , black , lstat excluding age and indus variables