#The data contains information collected by the US Bureau of the Census concerning housing in the area of Boston, Massachusetts. Let's use the code below to display the data.
D= MASS::Boston
#The goal of this preject is to predict the median house price in new tracts based on information such as crime rate, pollution, and number of rooms.
1.Partition the data into a training set and a validation set. Set the seed to the number “3.14” by issuing the code set.seed(3.14).
#Before we partition the data into a training and a validation set, it is important to know why we divide the data. We separate data into training and validation sets because it gives the model something to predict over which it was not shown before. The training set is used to create the model and validation is to predict the accuracy/performance of the model.
set.seed(3.14)
#Here we split the dataset into 80% 'training' and 20% 'validation'.
ind<-sample(2, nrow(D), replace=TRUE, prob=c(0.8, 0.2))
training_data<-D[ind==1,]
validation_data <- D[ind==2,]
#Training set is used to create the model.
head(training_data)
crim zn indus chas nox rm age dis rad tax ptratio black lstat
1 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
3 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
4 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
5 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
6 0.02985 0.0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
7 0.08829 12.5 7.87 0 0.524 6.012 66.6 5.5605 5 311 15.2 395.60 12.43
medv
1 24.0
3 34.7
4 33.4
5 36.2
6 28.7
7 22.9
#As we can see from the table, what ever data is missing in tdata has gone to vdata. Validation predicts the accuracy of model.
head(validation_data)
crim zn indus chas nox rm age dis rad tax ptratio black lstat
2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
15 0.63796 0 8.14 0 0.538 6.096 84.5 4.4619 4 307 21.0 380.02 10.26
16 0.62739 0 8.14 0 0.538 5.834 56.5 4.4986 4 307 21.0 395.62 8.47
19 0.80271 0 8.14 0 0.538 5.456 36.6 3.7965 4 307 21.0 288.99 11.69
28 0.95577 0 8.14 0 0.538 6.047 88.8 4.4534 4 307 21.0 306.38 17.28
37 0.09744 0 5.96 0 0.499 5.841 61.4 3.3779 5 279 19.2 377.56 11.41
medv
2 21.6
15 18.2
16 19.9
19 20.2
28 14.8
37 20.0
#Thing to remember: We develop multiple regression model based on training dataset and we test that model using validation data which is vdata in our case. If the predictions are good, that means our model is doing fine.
2.Use the training data to do stepwise regression with the three options (backward, forward, both). Choose the top model from each regression. Then, use each of these models to make prediction for MEDV based on the validation data. Compare the models in terms of RMSE. If the three models are the same, then no comparison is needed and just report the RMSE of the common model.
m0 = lm(medv ~ 1, data = D) # Fit the intercept-only model (no predictor)
summary(m0)
Call:
lm(formula = medv ~ 1, data = D)
Residuals:
Min 1Q Median 3Q Max
-17.533 -5.508 -1.333 2.467 27.467
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 22.5328 0.4089 55.11 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9.197 on 505 degrees of freedom
m = lm(medv ~ ., data = D) # Fit the model with all predictors in
summary(m)
Call:
lm(formula = medv ~ ., data = D)
Residuals:
Min 1Q Median 3Q Max
-15.595 -2.730 -0.518 1.777 26.199
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***
crim -1.080e-01 3.286e-02 -3.287 0.001087 **
zn 4.642e-02 1.373e-02 3.382 0.000778 ***
indus 2.056e-02 6.150e-02 0.334 0.738288
chas 2.687e+00 8.616e-01 3.118 0.001925 **
nox -1.777e+01 3.820e+00 -4.651 4.25e-06 ***
rm 3.810e+00 4.179e-01 9.116 < 2e-16 ***
age 6.922e-04 1.321e-02 0.052 0.958229
dis -1.476e+00 1.995e-01 -7.398 6.01e-13 ***
rad 3.060e-01 6.635e-02 4.613 5.07e-06 ***
tax -1.233e-02 3.760e-03 -3.280 0.001112 **
ptratio -9.527e-01 1.308e-01 -7.283 1.31e-12 ***
black 9.312e-03 2.686e-03 3.467 0.000573 ***
lstat -5.248e-01 5.072e-02 -10.347 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.745 on 492 degrees of freedom
Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
#To select a subset of predictors that maximize a criterion (such as R-square) or minimize a criterion , we can use the training data to do stepwise regression with the three options (backward, forward, both). Here, the procedure is provided by the package MASS.
suppressMessages(library(MASS))
m<-stepAIC(m, direction = "backward")
Start: AIC=1589.64
medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad +
tax + ptratio + black + lstat
Df Sum of Sq RSS AIC
- age 1 0.06 11079 1587.7
- indus 1 2.52 11081 1587.8
<none> 11079 1589.6
- chas 1 218.97 11298 1597.5
- tax 1 242.26 11321 1598.6
- crim 1 243.22 11322 1598.6
- zn 1 257.49 11336 1599.3
- black 1 270.63 11349 1599.8
- rad 1 479.15 11558 1609.1
- nox 1 487.16 11566 1609.4
- ptratio 1 1194.23 12273 1639.4
- dis 1 1232.41 12311 1641.0
- rm 1 1871.32 12950 1666.6
- lstat 1 2410.84 13490 1687.3
Step: AIC=1587.65
medv ~ crim + zn + indus + chas + nox + rm + dis + rad + tax +
ptratio + black + lstat
Df Sum of Sq RSS AIC
- indus 1 2.52 11081 1585.8
<none> 11079 1587.7
- chas 1 219.91 11299 1595.6
- tax 1 242.24 11321 1596.6
- crim 1 243.20 11322 1596.6
- zn 1 260.32 11339 1597.4
- black 1 272.26 11351 1597.9
- rad 1 481.09 11560 1607.2
- nox 1 520.87 11600 1608.9
- ptratio 1 1200.23 12279 1637.7
- dis 1 1352.26 12431 1643.9
- rm 1 1959.55 13038 1668.0
- lstat 1 2718.88 13798 1696.7
Step: AIC=1585.76
medv ~ crim + zn + chas + nox + rm + dis + rad + tax + ptratio +
black + lstat
Df Sum of Sq RSS AIC
<none> 11081 1585.8
- chas 1 227.21 11309 1594.0
- crim 1 245.37 11327 1594.8
- zn 1 257.82 11339 1595.4
- black 1 270.82 11352 1596.0
- tax 1 273.62 11355 1596.1
- rad 1 500.92 11582 1606.1
- nox 1 541.91 11623 1607.9
- ptratio 1 1206.45 12288 1636.0
- dis 1 1448.94 12530 1645.9
- rm 1 1963.66 13045 1666.3
- lstat 1 2723.48 13805 1695.0
m
Call:
lm(formula = medv ~ crim + zn + chas + nox + rm + dis + rad +
tax + ptratio + black + lstat, data = D)
Coefficients:
(Intercept) crim zn chas nox rm
36.341145 -0.108413 0.045845 2.718716 -17.376023 3.801579
dis rad tax ptratio black lstat
-1.492711 0.299608 -0.011778 -0.946525 0.009291 -0.522553
n<-stepAIC(m0, direction = "forward", scope = list(lower = m0, upper = m))
Start: AIC=2246.51
medv ~ 1
Df Sum of Sq RSS AIC
+ lstat 1 23243.9 19472 1851.0
+ rm 1 20654.4 22062 1914.2
+ ptratio 1 11014.3 31702 2097.6
+ tax 1 9377.3 33339 2123.1
+ nox 1 7800.1 34916 2146.5
+ crim 1 6440.8 36276 2165.8
+ rad 1 6221.1 36495 2168.9
+ zn 1 5549.7 37167 2178.1
+ black 1 4749.9 37966 2188.9
+ dis 1 2668.2 40048 2215.9
+ chas 1 1312.1 41404 2232.7
<none> 42716 2246.5
Step: AIC=1851.01
medv ~ lstat
Df Sum of Sq RSS AIC
+ rm 1 4033.1 15439 1735.6
+ ptratio 1 2670.1 16802 1778.4
+ chas 1 786.3 18686 1832.2
+ dis 1 772.4 18700 1832.5
+ tax 1 274.4 19198 1845.8
+ black 1 198.3 19274 1847.8
+ zn 1 160.3 19312 1848.8
+ crim 1 146.9 19325 1849.2
<none> 19472 1851.0
+ rad 1 25.1 19447 1852.4
+ nox 1 4.8 19468 1852.9
Step: AIC=1735.58
medv ~ lstat + rm
Df Sum of Sq RSS AIC
+ ptratio 1 1711.32 13728 1678.1
+ chas 1 548.53 14891 1719.3
+ black 1 512.31 14927 1720.5
+ tax 1 425.16 15014 1723.5
+ dis 1 351.15 15088 1725.9
+ crim 1 311.42 15128 1727.3
+ rad 1 180.45 15259 1731.6
<none> 15439 1735.6
+ zn 1 56.56 15383 1735.7
+ nox 1 14.90 15424 1737.1
Step: AIC=1678.13
medv ~ lstat + rm + ptratio
Df Sum of Sq RSS AIC
+ dis 1 499.08 13229 1661.4
+ black 1 389.68 13338 1665.6
+ chas 1 377.96 13350 1666.0
+ crim 1 122.52 13606 1675.6
<none> 13728 1678.1
+ tax 1 44.36 13684 1678.5
+ nox 1 24.81 13703 1679.2
+ zn 1 14.96 13713 1679.6
+ rad 1 6.07 13722 1679.9
Step: AIC=1661.39
medv ~ lstat + rm + ptratio + dis
Df Sum of Sq RSS AIC
+ nox 1 759.56 12469 1633.5
+ black 1 502.64 12726 1643.8
+ chas 1 267.43 12962 1653.1
+ tax 1 240.34 12989 1654.1
+ crim 1 233.54 12995 1654.4
+ zn 1 144.81 13084 1657.8
<none> 13229 1661.4
+ rad 1 22.40 13206 1662.5
Step: AIC=1633.47
medv ~ lstat + rm + ptratio + dis + nox
Df Sum of Sq RSS AIC
+ chas 1 328.27 12141 1622.0
+ black 1 311.83 12158 1622.7
+ zn 1 151.71 12318 1629.3
+ crim 1 141.43 12328 1629.7
+ rad 1 53.48 12416 1633.3
<none> 12469 1633.5
+ tax 1 10.50 12459 1635.0
Step: AIC=1621.97
medv ~ lstat + rm + ptratio + dis + nox + chas
Df Sum of Sq RSS AIC
+ black 1 272.837 11868 1612.5
+ zn 1 164.406 11977 1617.1
+ crim 1 116.330 12025 1619.1
+ rad 1 58.556 12082 1621.5
<none> 12141 1622.0
+ tax 1 4.187 12137 1623.8
Step: AIC=1612.47
medv ~ lstat + rm + ptratio + dis + nox + chas + black
Df Sum of Sq RSS AIC
+ zn 1 189.936 11678 1606.3
+ rad 1 144.320 11724 1608.3
+ crim 1 55.633 11813 1612.1
<none> 11868 1612.5
+ tax 1 2.703 11866 1614.4
Step: AIC=1606.31
medv ~ lstat + rm + ptratio + dis + nox + chas + black + zn
Df Sum of Sq RSS AIC
+ crim 1 94.712 11584 1604.2
+ rad 1 93.614 11585 1604.2
<none> 11678 1606.3
+ tax 1 3.952 11674 1608.1
Step: AIC=1604.19
medv ~ lstat + rm + ptratio + dis + nox + chas + black + zn +
crim
Df Sum of Sq RSS AIC
+ rad 1 228.604 11355 1596.1
<none> 11584 1604.2
+ tax 1 1.305 11582 1606.1
Step: AIC=1596.1
medv ~ lstat + rm + ptratio + dis + nox + chas + black + zn +
crim + rad
Df Sum of Sq RSS AIC
+ tax 1 273.62 11081 1585.8
<none> 11355 1596.1
Step: AIC=1585.76
medv ~ lstat + rm + ptratio + dis + nox + chas + black + zn +
crim + rad + tax
n
Call:
lm(formula = medv ~ lstat + rm + ptratio + dis + nox + chas +
black + zn + crim + rad + tax, data = D)
Coefficients:
(Intercept) lstat rm ptratio dis nox
36.341145 -0.522553 3.801579 -0.946525 -1.492711 -17.376023
chas black zn crim rad tax
2.718716 0.009291 0.045845 -0.108413 0.299608 -0.011778
mn<-stepAIC(m, direction = "both", scope = list(lower = m0, upper = m))
Start: AIC=1585.76
medv ~ crim + zn + chas + nox + rm + dis + rad + tax + ptratio +
black + lstat
Df Sum of Sq RSS AIC
<none> 11081 1585.8
- chas 1 227.21 11309 1594.0
- crim 1 245.37 11327 1594.8
- zn 1 257.82 11339 1595.4
- black 1 270.82 11352 1596.0
- tax 1 273.62 11355 1596.1
- rad 1 500.92 11582 1606.1
- nox 1 541.91 11623 1607.9
- ptratio 1 1206.45 12288 1636.0
- dis 1 1448.94 12530 1645.9
- rm 1 1963.66 13045 1666.3
- lstat 1 2723.48 13805 1695.0
mn
Call:
lm(formula = medv ~ crim + zn + chas + nox + rm + dis + rad +
tax + ptratio + black + lstat, data = D)
Coefficients:
(Intercept) crim zn chas nox rm
36.341145 -0.108413 0.045845 2.718716 -17.376023 3.801579
dis rad tax ptratio black lstat
-1.492711 0.299608 -0.011778 -0.946525 0.009291 -0.522553
summary(m)
Call:
lm(formula = medv ~ crim + zn + chas + nox + rm + dis + rad +
tax + ptratio + black + lstat, data = D)
Residuals:
Min 1Q Median 3Q Max
-15.5984 -2.7386 -0.5046 1.7273 26.2373
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 36.341145 5.067492 7.171 2.73e-12 ***
crim -0.108413 0.032779 -3.307 0.001010 **
zn 0.045845 0.013523 3.390 0.000754 ***
chas 2.718716 0.854240 3.183 0.001551 **
nox -17.376023 3.535243 -4.915 1.21e-06 ***
rm 3.801579 0.406316 9.356 < 2e-16 ***
dis -1.492711 0.185731 -8.037 6.84e-15 ***
rad 0.299608 0.063402 4.726 3.00e-06 ***
tax -0.011778 0.003372 -3.493 0.000521 ***
ptratio -0.946525 0.129066 -7.334 9.24e-13 ***
black 0.009291 0.002674 3.475 0.000557 ***
lstat -0.522553 0.047424 -11.019 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.736 on 494 degrees of freedom
Multiple R-squared: 0.7406, Adjusted R-squared: 0.7348
F-statistic: 128.2 on 11 and 494 DF, p-value: < 2.2e-16
summary(n)
Call:
lm(formula = medv ~ lstat + rm + ptratio + dis + nox + chas +
black + zn + crim + rad + tax, data = D)
Residuals:
Min 1Q Median 3Q Max
-15.5984 -2.7386 -0.5046 1.7273 26.2373
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 36.341145 5.067492 7.171 2.73e-12 ***
lstat -0.522553 0.047424 -11.019 < 2e-16 ***
rm 3.801579 0.406316 9.356 < 2e-16 ***
ptratio -0.946525 0.129066 -7.334 9.24e-13 ***
dis -1.492711 0.185731 -8.037 6.84e-15 ***
nox -17.376023 3.535243 -4.915 1.21e-06 ***
chas 2.718716 0.854240 3.183 0.001551 **
black 0.009291 0.002674 3.475 0.000557 ***
zn 0.045845 0.013523 3.390 0.000754 ***
crim -0.108413 0.032779 -3.307 0.001010 **
rad 0.299608 0.063402 4.726 3.00e-06 ***
tax -0.011778 0.003372 -3.493 0.000521 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.736 on 494 degrees of freedom
Multiple R-squared: 0.7406, Adjusted R-squared: 0.7348
F-statistic: 128.2 on 11 and 494 DF, p-value: < 2.2e-16
summary(mn)
Call:
lm(formula = medv ~ crim + zn + chas + nox + rm + dis + rad +
tax + ptratio + black + lstat, data = D)
Residuals:
Min 1Q Median 3Q Max
-15.5984 -2.7386 -0.5046 1.7273 26.2373
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 36.341145 5.067492 7.171 2.73e-12 ***
crim -0.108413 0.032779 -3.307 0.001010 **
zn 0.045845 0.013523 3.390 0.000754 ***
chas 2.718716 0.854240 3.183 0.001551 **
nox -17.376023 3.535243 -4.915 1.21e-06 ***
rm 3.801579 0.406316 9.356 < 2e-16 ***
dis -1.492711 0.185731 -8.037 6.84e-15 ***
rad 0.299608 0.063402 4.726 3.00e-06 ***
tax -0.011778 0.003372 -3.493 0.000521 ***
ptratio -0.946525 0.129066 -7.334 9.24e-13 ***
black 0.009291 0.002674 3.475 0.000557 ***
lstat -0.522553 0.047424 -11.019 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.736 on 494 degrees of freedom
Multiple R-squared: 0.7406, Adjusted R-squared: 0.7348
F-statistic: 128.2 on 11 and 494 DF, p-value: < 2.2e-16
#Findings:
#We can see from the summary that the three models are the same, therefore, there is no need of comparison and we just report the RMSE of the common model. We have the value of multiple R^2 as 0.4505 and RMSE = 5.21.
3.Create a lift chart and a decile-wise lift chart for the best model obtained based on the model using the backward variable selection procedure.
library(gains)
# regression model based on all numerical predictors
reg <- lm(formula = medv ~ crim + zn + chas + nox + rm + dis + rad +
tax + ptratio + black + lstat, data = D)
# predictions
#pred_v <- predict(reg, newdata = D[validation_data])
#gain <- gains(D[validation_data,]$medv[!is.na(m)], m[!is.na(m)])
#plot(c(0,gain$cume.pct.of.total*sum(medv))~c(0,gain$cume.obs),
#xlab="#", ylab="Cumulative medv", main="Lift Chart", type="l")
# Decile-wise lift chart
#barplot(gain$mean.resp/mean(price), names.arg = gain$depth,
#xlab = "Percentile", ylab = "Mean Response", main = "Decile-wise lift chart")
4.Fit ridge regression, lasso regression, and elastic net regression with the training data and then compare the performance of the 3 regressions in terms of RMSE.
# Setup a grid range of lambda values:
lambda <- 10^seq(-3, 3, length = 100)
# 1. # Build the model based on ridge regression
ridge <- train(
medv ~.,
data = training_data,
method = "glmnet",
tuneGrid = expand.grid(alpha = 0, lambda = lambda)
)
# Model coefficients
coef(ridge$finalModel, ridge$bestTune$lambda)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 3.223772e+01
## crim -1.080828e-01
## zn 3.193412e-02
## indus -4.999724e-02
## chas 3.038451e+00
## nox -1.423678e+01
## rm 3.651674e+00
## age 8.950671e-04
## dis -1.215940e+00
## rad 1.874167e-01
## tax -5.197185e-03
## ptratio -8.635587e-01
## black 8.894530e-03
## lstat -5.223572e-01
# Make predictions
predictions <- ridge %>% predict(newdata = validation_data)
# Model prediction performance
data.frame(
RMSE = RMSE(predictions, validation_data$medv),
Rsquare = R2(predictions, validation_data$medv)
)
## RMSE Rsquare
## 1 4.152015 0.7678127
# 2. Build the model based on lasso regression
lasso <- train(
medv ~.,
data = training_data,
method = "glmnet",
tuneGrid = expand.grid(alpha = 1, lambda = lambda)
)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
# Model coefficients
coef(lasso$finalModel, lasso$bestTune$lambda)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 41.797118782
## crim -0.128772396
## zn 0.045335521
## indus 0.016465095
## chas 2.745495132
## nox -21.149205375
## rm 3.408370045
## age 0.009485832
## dis -1.574481940
## rad 0.351625818
## tax -0.012180005
## ptratio -0.972779321
## black 0.008672929
## lstat -0.589218413
# Make predictions
predictions <- lasso %>% predict(newdata = validation_data)
# Model prediction performance
data.frame(
RMSE = RMSE(predictions, validation_data$medv),
Rsquare = R2(predictions, validation_data$medv)
)
## RMSE Rsquare
## 1 4.332493 0.7514415
# 3. Build the model based on Elastic net:
elastic <- train(
medv ~.,
data = training_data,
method = "glmnet"
)
# Model coefficients
coef(elastic$finalModel, elastic$bestTune$lambda)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 38.811194493
## crim -0.120488959
## zn 0.040045129
## indus -0.003379220
## chas 2.840609422
## nox -19.091811691
## rm 3.504208726
## age 0.005605836
## dis -1.472310961
## rad 0.288202219
## tax -0.009268021
## ptratio -0.941962764
## black 0.008690953
## lstat -0.572782193
# Make predictions
predictions <- elastic %>% predict(newdata = validation_data)
# Model prediction performance
data.frame(
RMSE = RMSE(predictions, validation_data$medv),
Rsquare = R2(predictions, validation_data$medv)
)
## RMSE Rsquare
## 1 4.26269 0.7574006
# 4. Comparing performances of the models:
models <- list(ridge = ridge, lasso = lasso, elastic = elastic)
resamples(models) %>% summary( metric = "RMSE")
##
## Call:
## summary.resamples(object = ., metric = "RMSE")
##
## Models: ridge, lasso, elastic
## Number of resamples: 25
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## ridge 3.903173 4.527245 5.245453 5.069848 5.571126 6.098857 0
## lasso 4.371040 4.800932 4.923307 5.117049 5.462554 6.028990 0
## elastic 4.427861 4.723254 5.046577 5.100148 5.322544 6.130705 0
#Here we look at median to compare performance. We pick a small value of median to find the best regression model. From our findings, we can see that lasso has the least median, so it is the best model.