library(tidyverse)
library(reshape2)
library(caret)
library(glmnet)
library(forecast)
library(gains)
We use BostonHousing.csv, which contains information on 506 US Census housing tracts in the Boston area.
Boston <- read.csv("/Users/Simbo/Desktop/School/STAT415/DMBA-R-datasets/BostonHousing.csv")
There are two types of modeling: explanatory and predictive. The goal of predictive modeling is to predict new data accurately, so we partition the data into training and validation sets. To build a model that predicts new data well, we hold out some of the data when fitting the model and then use that held-out data to test it. The held-out rows form the "validation set", while the rows used to build the model form the "training set": the training set is used to fit the model, and the validation set is used to assess its predictive performance.
An explanatory model using all observations:
MEDV = -28.81 - 0.26(CRIM) + 3.76(CHAS) + 8.28(RM)
selected.housing.vars <- c("MEDV", "CRIM", "CHAS", "RM")
# Explanatory model parameters
lm(MEDV ~ ., data = Boston[, selected.housing.vars])
##
## Call:
## lm(formula = MEDV ~ ., data = Boston[, selected.housing.vars])
##
## Coefficients:
## (Intercept) CRIM CHAS RM
## -28.8107 -0.2607 3.7630 8.2782
A predictive (training) model using ~75% of the observations:
MEDV = -30.75 - 0.27(CRIM) + 3.55(CHAS) + 8.54(RM)
Note that the betas (intercept and slope estimates) in the training model are similar to those in the explanatory model fit on all of the rows (the two sets of coefficients are compared side by side in a short sketch after the training-model summary below).
The training model may not explain the full data set quite as well, but it can be better suited to predicting new data.
# Partition data into training and validation sets
set.seed(100)
sample.housing <- sample(c(1:506), 380) # ~75% of the data
training.housing <- Boston[sample.housing, selected.housing.vars]
validation.housing <- Boston[-sample.housing, selected.housing.vars]
# Fit Multiple linear regression model, Median house price (MEDV) by the other 3 variables (.)
housing.lm <- lm(MEDV ~ ., data = training.housing)
summary(housing.lm)
##
## Call:
## lm(formula = MEDV ~ ., data = training.housing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.710 -2.897 -0.182 2.342 39.642
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -30.75453 3.07439 -10.003 < 2e-16 ***
## CRIM -0.27386 0.03332 -8.218 3.39e-15 ***
## CHAS 3.54500 1.29196 2.744 0.00636 **
## RM 8.54200 0.48528 17.602 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.87 on 376 degrees of freedom
## Multiple R-squared: 0.5484, Adjusted R-squared: 0.5448
## F-statistic: 152.2 on 3 and 376 DF, p-value: < 2.2e-16
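As noted above, the training-set coefficients are close to those of the explanatory model fit on all 506 rows. Here is a minimal side-by-side comparison, refitting the all-rows model since it was not stored earlier:
# Refit the explanatory (all-rows) model and compare its coefficients with the training fit
explanatory.lm <- lm(MEDV ~ ., data = Boston[, selected.housing.vars])
round(cbind(All_506_rows = coef(explanatory.lm), Training_380_rows = coef(housing.lm)), 2)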
The predicted median house price for a tract that does not bound the Charles River, has a crime rate of 0.1, and averages 6 rooms per dwelling is:
MEDV = -30.75 - 0.27(0.1) + 3.55(0) + 8.54(6) = 20.463, or about $20,463 (MEDV is measured in $1000s)
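As a quick sanity check, the same prediction can be obtained from the fitted model directly (a minimal sketch; the tract values below are the hypothetical ones from the sentence above):
# Hypothetical tract: crime rate 0.1, not on the Charles River, 6 rooms on average
new.tract <- data.frame(CRIM = 0.1, CHAS = 0, RM = 6)
predict(housing.lm, new.tract) # should be close to the hand-calculated 20.46 (the coefficients above are rounded)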
Here is how accurate the model is on the validation set: the ME and RMSE are not extremely high, and the MAPE of about 22% implies the predictions are off by roughly 22% on average, i.e., the model is about 78% accurate on new observations.
housing.lm.predictions <- predict(housing.lm, validation.housing)
accuracy(housing.lm.predictions, validation.housing$MEDV)
## ME RMSE MAE MPE MAPE
## Test set 1.39906 7.039879 4.60643 1.185454 21.8375
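As a sanity check, these error measures can be computed directly from the validation residuals (a minimal sketch; accuracy() from the forecast package reports the same quantities):
# Compute the error measures by hand from the validation residuals
valid.errors <- validation.housing$MEDV - housing.lm.predictions
mean(valid.errors)                                      # ME
sqrt(mean(valid.errors^2))                              # RMSE
mean(abs(valid.errors))                                 # MAE
mean(abs(valid.errors / validation.housing$MEDV)) * 100 # MAPE (%)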
Among all of the predictors:
MEDV and CAT.MEDV should be highly correlated because CAT.MEDV is simply 1 if MEDV is above $30,000 (i.e., MEDV > 30) and 0 otherwise.
AGE (proportion of pre-1940 buildings) and LSTAT (percentage of lower-status population) might measure similar things, since older buildings may be more "affordable" for lower-status populations.
INDUS (proportion of non-retail business acres), NOX (nitric oxides concentration), and TAX (property tax rate) are likely to be positively correlated, as all three tend to be high in commercial and industrial areas.
Looking at the correlation matrix:
correlation_matrix <- cor(na.omit(Boston))
round(correlation_matrix, 2)
## CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO
## CRIM 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58 0.29
## ZN -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31 -0.39
## INDUS 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72 0.38
## CHAS -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04 -0.12
## NOX 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67 0.19
## RM -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29 -0.36
## AGE 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51 0.26
## DIS -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53 -0.23
## RAD 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91 0.46
## TAX 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00 0.46
## PTRATIO 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46 1.00
## LSTAT 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54 0.37
## MEDV -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47 -0.51
## CAT..MEDV -0.15 0.37 -0.37 0.11 -0.23 0.64 -0.19 0.12 -0.20 -0.27 -0.44
## LSTAT MEDV CAT..MEDV
## CRIM 0.46 -0.39 -0.15
## ZN -0.41 0.36 0.37
## INDUS 0.60 -0.48 -0.37
## CHAS -0.05 0.18 0.11
## NOX 0.59 -0.43 -0.23
## RM -0.61 0.70 0.64
## AGE 0.60 -0.38 -0.19
## DIS -0.50 0.25 0.12
## RAD 0.49 -0.38 -0.20
## TAX 0.54 -0.47 -0.27
## PTRATIO 0.37 -0.51 -0.44
## LSTAT 1.00 -0.74 -0.47
## MEDV -0.74 1.00 0.79
## CAT..MEDV -0.47 0.79 1.00
ggplot(data = melt(correlation_matrix),
aes(x = Var1, y = Var2, fill = value)) +
geom_tile(color = "white") +
scale_fill_gradient2(limit = c(-1, 1), midpoint = 0,
low = "dark blue", high = "dark red", mid = "white")
Boston.subset <- c("CHAS", "MEDV", "INDUS", "LSTAT", "RAD", "PTRATIO", "ZN")
From this table, I would keep CHAS, MEDV, INDUS, LSTAT, RAD, PTRATIO, and ZN to perform stepwise regression on.
Stepwise regression is a way of selecting some, but not all, of the variables in a dataset with many predictors. It aims to explain most of the variability without using too few or too many predictors by penalizing models that add predictors without improving the fit enough. In the textbook, this is referred to as the bias-variance tradeoff. There are several criteria that implement this penalty, including AIC and BIC.
With the variables kept in the previous step, we will run stepwise regression (using the AIC criterion) to select the best model.
# Split into training and validation using the same sample.housing indices from the earlier set.seed()/sample() split
training.subset <- Boston[sample.housing, Boston.subset]
validation.subset <- Boston[-sample.housing, Boston.subset]
subset_model <- lm(MEDV ~ ., data = training.subset) # full model with all 6 predictors
small_model <- lm(MEDV ~ 1, data = training.subset)  # intercept-only model
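The AIC values printed by step() below are computed with extractAIC(); as a reference point, here they are for the two boundary models, with a BIC-style alternative left commented out (a minimal sketch, not part of the original analysis):
extractAIC(small_model)  # intercept-only model; its AIC matches the "Start" value of the forward search
extractAIC(subset_model) # full 6-predictor model; its AIC matches the "Start" value of the backward search
# step(subset_model, direction = "both", k = log(nrow(training.subset))) # BIC-flavored search (penalty log(n) instead of 2)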
Forward stepwise regression starts with no predictors and adds one predictor at a time until no remaining predictor improves (lowers) the AIC.
step_forward <- step(small_model, scope = list(lower = small_model, upper = subset_model), direction = "forward")
## Start: AIC=1645.14
## MEDV ~ 1
##
## Df Sum of Sq RSS AIC
## + LSTAT 1 16979.1 11709 1306.6
## + PTRATIO 1 7761.7 20926 1527.3
## + INDUS 1 7646.6 21041 1529.3
## + ZN 1 4672.6 24015 1579.6
## + RAD 1 4487.6 24200 1582.5
## + CHAS 1 560.9 28127 1639.6
## <none> 28688 1645.1
##
## Step: AIC=1306.62
## MEDV ~ LSTAT
##
## Df Sum of Sq RSS AIC
## + PTRATIO 1 2004.85 9704.1 1237.2
## + CHAS 1 537.62 11171.3 1290.8
## + ZN 1 155.82 11553.1 1303.5
## <none> 11708.9 1306.6
## + INDUS 1 16.62 11692.3 1308.1
## + RAD 1 0.88 11708.1 1308.6
##
## Step: AIC=1237.25
## MEDV ~ LSTAT + PTRATIO
##
## Df Sum of Sq RSS AIC
## + CHAS 1 298.329 9405.8 1227.4
## + RAD 1 231.599 9472.5 1230.1
## <none> 9704.1 1237.2
## + INDUS 1 44.338 9659.7 1237.5
## + ZN 1 0.123 9704.0 1239.2
##
## Step: AIC=1227.38
## MEDV ~ LSTAT + PTRATIO + CHAS
##
## Df Sum of Sq RSS AIC
## + RAD 1 207.216 9198.5 1220.9
## <none> 9405.8 1227.4
## + INDUS 1 23.570 9382.2 1228.4
## + ZN 1 8.789 9397.0 1229.0
##
## Step: AIC=1220.92
## MEDV ~ LSTAT + PTRATIO + CHAS + RAD
##
## Df Sum of Sq RSS AIC
## <none> 9198.5 1220.9
## + ZN 1 8.4474 9190.1 1222.6
## + INDUS 1 0.0003 9198.5 1222.9
Backward stepwise regression starts with all predictors and removes one at a time until no further removal improves (lowers) the AIC.
step_back <- step(subset_model, direction = "backward")
## Start: AIC=1224.53
## MEDV ~ CHAS + INDUS + LSTAT + RAD + PTRATIO + ZN
##
## Df Sum of Sq RSS AIC
## - INDUS 1 0.9 9190.1 1222.6
## - ZN 1 9.4 9198.5 1222.9
## <none> 9189.2 1224.5
## - RAD 1 171.5 9360.7 1229.6
## - CHAS 1 280.0 9469.1 1233.9
## - PTRATIO 1 1755.8 10945.0 1289.0
## - LSTAT 1 7842.8 17032.0 1457.0
##
## Step: AIC=1222.57
## MEDV ~ CHAS + LSTAT + RAD + PTRATIO + ZN
##
## Df Sum of Sq RSS AIC
## - ZN 1 8.4 9198.5 1220.9
## <none> 9190.1 1222.6
## - RAD 1 206.9 9397.0 1229.0
## - CHAS 1 282.2 9472.3 1232.1
## - PTRATIO 1 1755.4 10945.5 1287.0
## - LSTAT 1 9328.5 18518.6 1486.8
##
## Step: AIC=1220.92
## MEDV ~ CHAS + LSTAT + RAD + PTRATIO
##
## Df Sum of Sq RSS AIC
## <none> 9198.5 1220.9
## - RAD 1 207.2 9405.8 1227.4
## - CHAS 1 273.9 9472.5 1230.1
## - PTRATIO 1 1971.9 11170.5 1292.7
## - LSTAT 1 10604.4 19802.9 1510.3
"Both" (bidirectional) stepwise regression considers adding or removing a predictor at each step, keeping whichever change most improves the AIC.
step_both <- step(subset_model, direction = "both")
## Start: AIC=1224.53
## MEDV ~ CHAS + INDUS + LSTAT + RAD + PTRATIO + ZN
##
## Df Sum of Sq RSS AIC
## - INDUS 1 0.9 9190.1 1222.6
## - ZN 1 9.4 9198.5 1222.9
## <none> 9189.2 1224.5
## - RAD 1 171.5 9360.7 1229.6
## - CHAS 1 280.0 9469.1 1233.9
## - PTRATIO 1 1755.8 10945.0 1289.0
## - LSTAT 1 7842.8 17032.0 1457.0
##
## Step: AIC=1222.57
## MEDV ~ CHAS + LSTAT + RAD + PTRATIO + ZN
##
## Df Sum of Sq RSS AIC
## - ZN 1 8.4 9198.5 1220.9
## <none> 9190.1 1222.6
## + INDUS 1 0.9 9189.2 1224.5
## - RAD 1 206.9 9397.0 1229.0
## - CHAS 1 282.2 9472.3 1232.1
## - PTRATIO 1 1755.4 10945.5 1287.0
## - LSTAT 1 9328.5 18518.6 1486.8
##
## Step: AIC=1220.92
## MEDV ~ CHAS + LSTAT + RAD + PTRATIO
##
## Df Sum of Sq RSS AIC
## <none> 9198.5 1220.9
## + ZN 1 8.4 9190.1 1222.6
## + INDUS 1 0.0 9198.5 1222.9
## - RAD 1 207.2 9405.8 1227.4
## - CHAS 1 273.9 9472.5 1230.1
## - PTRATIO 1 1971.9 11170.5 1292.7
## - LSTAT 1 10604.4 19802.9 1510.3
All 3 stepwise regression methods arrived at exactly the same model, containing 4 predictors:
MEDV = 55.40 - 0.91(LSTAT) - 1.24(PTRATIO) + 3.67(CHAS) + 0.10(RAD)
This model predicts higher median house prices for tracts that bound the Charles River, have better accessibility to radial highways, have a smaller share of lower-status residents, and have a lower pupil-teacher ratio.
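As a quick illustration of that interpretation, the fitted stepwise model can be applied to a hypothetical tract (the predictor values below are made up for illustration):
# Hypothetical tract on the Charles River with LSTAT = 10, PTRATIO = 18, RAD = 5
predict(step_forward, data.frame(CHAS = 1, LSTAT = 10, PTRATIO = 18, RAD = 5))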
summary(step_forward)
##
## Call:
## lm(formula = MEDV ~ LSTAT + PTRATIO + CHAS + RAD, data = training.subset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.1039 -2.9201 -0.6114 1.9998 20.2015
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 55.39791 2.44011 22.703 < 2e-16 ***
## LSTAT -0.90531 0.04354 -20.792 < 2e-16 ***
## PTRATIO -1.24401 0.13875 -8.966 < 2e-16 ***
## CHAS 3.67386 1.09934 3.342 0.000916 ***
## RAD 0.10337 0.03557 2.906 0.003872 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.953 on 375 degrees of freedom
## Multiple R-squared: 0.6794, Adjusted R-squared: 0.6759
## F-statistic: 198.6 on 4 and 375 DF, p-value: < 2.2e-16
summary(step_back)
##
## Call:
## lm(formula = MEDV ~ CHAS + LSTAT + RAD + PTRATIO, data = training.subset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.1039 -2.9201 -0.6114 1.9998 20.2015
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 55.39791 2.44011 22.703 < 2e-16 ***
## CHAS 3.67386 1.09934 3.342 0.000916 ***
## LSTAT -0.90531 0.04354 -20.792 < 2e-16 ***
## RAD 0.10337 0.03557 2.906 0.003872 **
## PTRATIO -1.24401 0.13875 -8.966 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.953 on 375 degrees of freedom
## Multiple R-squared: 0.6794, Adjusted R-squared: 0.6759
## F-statistic: 198.6 on 4 and 375 DF, p-value: < 2.2e-16
summary(step_both)
##
## Call:
## lm(formula = MEDV ~ CHAS + LSTAT + RAD + PTRATIO, data = training.subset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.1039 -2.9201 -0.6114 1.9998 20.2015
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 55.39791 2.44011 22.703 < 2e-16 ***
## CHAS 3.67386 1.09934 3.342 0.000916 ***
## LSTAT -0.90531 0.04354 -20.792 < 2e-16 ***
## RAD 0.10337 0.03557 2.906 0.003872 **
## PTRATIO -1.24401 0.13875 -8.966 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.953 on 375 degrees of freedom
## Multiple R-squared: 0.6794, Adjusted R-squared: 0.6759
## F-statistic: 198.6 on 4 and 375 DF, p-value: < 2.2e-16
Because all three stepwise methods chose the same model, their accuracy measures (ME, RMSE, etc.) are identical.
Compared to the model from before (using CRIM, CHAS, and RM), the stepwise model has higher error values on every measure except the ME.
On this validation set, the model from before appears to be the better choice for predicting new data.
step_for.prediction <- predict(step_forward, validation.subset)
step_back.prediction <- predict(step_back, validation.subset)
step_both.prediction <- predict(step_both, validation.subset)
accuracy(step_for.prediction, validation.subset$MEDV)
## ME RMSE MAE MPE MAPE
## Test set 1.23483 7.48106 5.526723 1.253665 24.49312
accuracy(step_back.prediction, validation.subset$MEDV)
## ME RMSE MAE MPE MAPE
## Test set 1.23483 7.48106 5.526723 1.253665 24.49312
accuracy(step_both.prediction, validation.subset$MEDV)
## ME RMSE MAE MPE MAPE
## Test set 1.23483 7.48106 5.526723 1.253665 24.49312
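For easier comparison, the two validation-set error profiles can be stacked into one table (a minimal sketch; the first row is the CRIM/CHAS/RM model, the second the 4-predictor stepwise model):
# Validation-set accuracy of the earlier model and the stepwise model, side by side
round(rbind(accuracy(housing.lm.predictions, validation.housing$MEDV),
            accuracy(step_for.prediction, validation.subset$MEDV)), 3)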
Ridge, lasso, and elastic-net regression are similar in that they shrink (regularize) the coefficient estimates to avoid overfitting: ridge penalizes the sum of squared coefficients, lasso penalizes the sum of absolute coefficients (and can shrink some of them all the way to zero), and elastic net uses a weighted mix of the two penalties. The glmnet() function from the glmnet package makes these methods easy to fit. It takes x, y, alpha, and lambda arguments: alpha is set to 0 for ridge regression, 1 for lasso regression, and a value in between for elastic-net regression, while a good lambda can be chosen with the cv.glmnet() function.
Before we begin the regression methods, we need to split our data into a training and validation set. In addition, we need to specify our x (predictors) and y (MEDV) values:
# Partition data into training and validation sets
set.seed(100)
sample.housing <- Boston$MEDV %>% createDataPartition(p = 0.75, list = F)
train.housing <- Boston[sample.housing, ]
test.housing <- Boston[-sample.housing, ]
# Because glmnet() doesn't accept the generic y ~ x formula interface, we need to split into x and y
# x: Predictors
x <- model.matrix(MEDV ~ ., train.housing)
# y: Median House Price (MEDV)
y <- train.housing$MEDV
Let’s start with ridge regression (alpha = 0, lambda specified by cv.glmnet())
set.seed(100)
cv <- cv.glmnet(x = x, y = y, alpha = 0)
# Lambda: 0.704
cv$lambda.min
## [1] 0.7041333
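cv.glmnet() also reports lambda.1se, a more conservative choice (the largest lambda whose cross-validated error is within one standard error of the minimum), which some analysts prefer:
# A more conservative regularization strength than lambda.min
cv$lambda.1se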
ridge_model <- glmnet(x = x, y = y, alpha = 0, lambda = cv$lambda.min)
coef(ridge_model)
## 15 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 33.989958051
## (Intercept) .
## CRIM -0.095189550
## ZN -0.006998894
## INDUS 0.053124566
## CHAS 2.932989998
## NOX -9.797127860
## RM 1.536356840
## AGE -0.005143717
## DIS -0.481190272
## RAD 0.075969327
## TAX -0.003859152
## PTRATIO -0.494638221
## LSTAT -0.446374738
## CAT..MEDV 10.517543795
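The second "(Intercept)" row above (shown as ".") appears because model.matrix() already adds an intercept column while glmnet() fits its own intercept; dropping that column avoids the duplication (a minimal sketch, not used in the fits below):
# Drop model.matrix()'s intercept column so only glmnet()'s intercept remains
x.no.intercept <- model.matrix(MEDV ~ ., train.housing)[, -1]
head(colnames(x.no.intercept))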
x.test <- model.matrix(MEDV ~ ., test.housing)
predictions <- predict(ridge_model, x.test) %>% as.vector()
ridge_performance <- data.frame(
RMSE_Ridge = RMSE(predictions, test.housing$MEDV),
R2_Ridge = R2(predictions, test.housing$MEDV)
)
Now, lasso regression (alpha = 1, lambda specified by cv.glmnet()). Unlike ridge, lasso can shrink coefficients exactly to zero; note in the output below that AGE is dropped ("." in the coefficient column).
set.seed(100)
cv <- cv.glmnet(x = x, y = y, alpha = 1)
# Lambda: 0.020
cv$lambda.min
## [1] 0.02005397
lasso_model <- glmnet(x = x, y = y, alpha = 1, lambda = cv$lambda.min)
coef(lasso_model)
## 15 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 39.664950709
## (Intercept) .
## CRIM -0.101771949
## ZN -0.004528387
## INDUS 0.105570997
## CHAS 2.854117644
## NOX -12.645648668
## RM 0.994761360
## AGE .
## DIS -0.561979114
## RAD 0.134420750
## TAX -0.006270727
## PTRATIO -0.509880236
## LSTAT -0.508404961
## CAT..MEDV 11.556642427
predictions <- predict(lasso_model, x.test) %>% as.vector()
lasso_performance <- data.frame(
RMSE_Lasso = RMSE(predictions, test.housing$MEDV),
R2_Lasso = R2(predictions, test.housing$MEDV)
)
Now, elastic-net regression (alpha somewhere between 0 and 1, with both alpha and lambda chosen by cross-validation via caret's train()):
set.seed(100)
# tuneLength = 10 asks caret to try 10 candidate alpha values (each with 10 lambdas); trainControl("cv", number = 10) uses 10-fold cross-validation
elastic_model <- train(MEDV ~., data = train.housing, method = "glmnet",
trControl = trainControl("cv", number = 10), tuneLength = 10)
# Best alpha/lambda value: 0.1, 0.09
elastic_model$bestTune
## alpha lambda
## 5 0.1 0.09265453
coef(elastic_model$finalModel, elastic_model$bestTune$lambda)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 39.340729302
## CRIM -0.102717551
## ZN -0.005007817
## INDUS 0.104230351
## CHAS 2.868251553
## NOX -12.699999589
## RM 1.068836528
## AGE .
## DIS -0.558677745
## RAD 0.135319773
## TAX -0.006319288
## PTRATIO -0.518322370
## LSTAT -0.501344572
## CAT..MEDV 11.419733740
predictions <- predict(elastic_model, x.test) %>% as.vector()
elastic_performance <- data.frame(
RMSE_Elastic = RMSE(predictions, test.housing$MEDV),
R2_Elastic = R2(predictions, test.housing$MEDV)
)
As we can see, all 3 models have comparable R^2 values (~84%). Lasso and Elastic-Net have very similar RMSE values, although Lasso has the lowest RMSE.
ridge_performance
## RMSE_Ridge R2_Ridge
## 1 3.896098 0.8440277
lasso_performance
## RMSE_Lasso R2_Lasso
## 1 3.818446 0.8458754
elastic_performance
## RMSE_Elastic R2_Elastic
## 1 3.818529 0.8461946
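Finally, the three performance tables can be combined into a single side-by-side view (a minimal sketch using the data frames created above):
# All three regularized models evaluated on the same test split
cbind(ridge_performance, lasso_performance, elastic_performance)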