For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.
i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
(a) The lasso: statement iii is correct. The lasso is less flexible than least squares because it can shrink coefficients to exactly zero, performing variable selection in the process. This reduction in flexibility increases bias but in turn decreases variance.
(b) Ridge regression: statement iii also correctly describes ridge regression relative to least squares. Like the lasso, ridge shrinks coefficients towards zero, but it never sets any coefficient exactly to zero. The same trade-off applies: the shrinkage increases bias and decreases variance.
(c) Non-linear methods: statement ii is correct. Non-linear methods such as decision trees and random forests can capture relationships more complex than a purely linear fit, which makes them more flexible and therefore less biased; the cost is higher variance, so they improve prediction accuracy only when the increase in variance is smaller than the decrease in bias.
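All three answers follow from the standard decomposition of the expected test error at a point x0 (a textbook identity, not something fit from the data in this assignment):

$$
E\left[\left(y_0 - \hat{f}(x_0)\right)^2\right] = \mathrm{Var}\!\left(\hat{f}(x_0)\right) + \left[\mathrm{Bias}\!\left(\hat{f}(x_0)\right)\right]^2 + \mathrm{Var}(\varepsilon)
$$

A change in flexibility only improves test accuracy when the term it shrinks falls by more than the term it inflates.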
college <- read.csv('College.csv')
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
set.seed(42)
train_index <- createDataPartition(college$Apps, p = 0.8, list = FALSE)
train_data <- college[train_index, ]
test_data <- college[-train_index, ]
train_data <- subset(train_data, select = -X)
test_data <- subset(test_data, select = -X)
linear_model <- lm(Apps ~ ., data = train_data)
summary(linear_model)
##
## Call:
## lm(formula = Apps ~ ., data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3881.0 -437.1 -19.5 316.7 6967.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -561.34089 461.02508 -1.218 0.22385
## PrivateYes -409.16975 158.55701 -2.581 0.01010 *
## Accept 1.62834 0.04476 36.382 < 2e-16 ***
## Enroll -0.85903 0.20563 -4.177 3.38e-05 ***
## Top10perc 51.31276 6.42525 7.986 7.03e-15 ***
## Top25perc -14.67648 5.16650 -2.841 0.00465 **
## F.Undergrad 0.05127 0.03625 1.415 0.15769
## P.Undergrad 0.04916 0.03535 1.391 0.16483
## Outstate -0.09202 0.02262 -4.067 5.38e-05 ***
## Room.Board 0.13611 0.05628 2.418 0.01588 *
## Books 0.10960 0.27217 0.403 0.68732
## Personal 0.01803 0.07008 0.257 0.79711
## PhD -8.96660 5.28199 -1.698 0.09010 .
## Terminal -4.40975 5.78320 -0.763 0.44605
## S.F.Ratio 21.41479 14.51439 1.475 0.14062
## perc.alumni 3.87434 4.69013 0.826 0.40909
## Expend 0.08916 0.01484 6.009 3.22e-09 ***
## Grad.Rate 7.23133 3.32272 2.176 0.02992 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1075 on 606 degrees of freedom
## Multiple R-squared: 0.9268, Adjusted R-squared: 0.9247
## F-statistic: 451.1 on 17 and 606 DF, p-value: < 2.2e-16
predictions <- predict(linear_model, newdata = test_data) # Get predictions
actuals <- test_data$Apps # Actual values
rmse <- sqrt(mean((actuals - predictions)^2)) # Compute RMSE
print(rmse)
## [1] 928.8176
The RMSE from using linear regression is 928.81.
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-8
x_train <- model.matrix(Apps ~ ., train_data)[, -1]
y_train <- train_data$Apps
x_test <- model.matrix(Apps ~ ., test_data)[, -1]
y_test <- test_data$Apps
set.seed(1)
#grid <- 10^seq(10, -2, length = 100)
cv.out <- cv.glmnet(x_train, y_train, alpha = 0)
plot(cv.out)
bestlam <- cv.out$lambda.min
bestlam
## [1] 367.6145
ridge_model <- glmnet(x_train, y_train, alpha = 0, lambda = bestlam)
ridge_model
##
## Call: glmnet(x = x_train, y = y_train, alpha = 0, lambda = bestlam)
##
## Df %Dev Lambda
## 1 17 90.66 367.6
ridge.pred <- predict(ridge_model, s = bestlam, newx = x_test)
sqrt(mean((ridge.pred - y_test)^2))
## [1] 879.621
The ridge model resulted in an RMSE of 879.621 on the test set, a clear improvement over the plain linear regression model.
set.seed(1)
cv.out <- cv.glmnet(x_train, y_train, alpha = 1)
plot(cv.out)
bestlam <- cv.out$lambda.min
bestlam
## [1] 13.84045
lasso_model <- glmnet(x_train, y_train, alpha = 1, lambda = bestlam)
lasso_model
##
## Call: glmnet(x = x_train, y = y_train, alpha = 1, lambda = bestlam)
##
## Df %Dev Lambda
## 1 14 92.55 13.84
lasso.pred <- predict(lasso_model, s = bestlam, newx = x_test)
sqrt(mean((lasso.pred - y_test)^2))
## [1] 922.3743
The lasso model resulted in an RMSE of 922.3743, which is worse than the ridge model but still slightly better than the linear regression model.
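The Df column above shows that the lasso kept 14 of the 17 predictors, which reflects the variable-selection behavior discussed earlier. As a quick check (a sketch on top of the fit above, not part of the original analysis), the coefficients that were shrunk exactly to zero can be listed directly:

# Coefficients at the chosen lambda; entries printed as "." were set to zero by the lasso penalty
predict(lasso_model, type = "coefficients", s = bestlam)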
library(pls)
##
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
##
## R2
## The following object is masked from 'package:stats':
##
## loadings
set.seed(1)
pcr.fit <- pcr(Apps ~ ., data = train_data, scale = TRUE, validation = "CV")
summary(pcr.fit)
## Data: X dimension: 624 17
## Y dimension: 624 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3922 3896 2109 2112 1992 1715 1692
## adjCV 3922 3895 2106 2110 1946 1704 1686
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1682 1651 1631 1633 1639 1635 1644
## adjCV 1682 1642 1625 1627 1633 1629 1638
## 14 comps 15 comps 16 comps 17 comps
## CV 1644 1494 1246 1206
## adjCV 1638 1471 1236 1197
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 32.199 57.51 64.56 70.11 75.57 80.57 84.18 87.63
## Apps 2.383 72.02 72.04 79.00 82.65 83.03 83.22 84.07
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 90.55 92.94 95.07 96.79 97.85 98.69 99.36
## Apps 84.56 84.82 84.83 84.95 84.97 84.99 90.89
## 16 comps 17 comps
## X 99.82 100.00
## Apps 92.31 92.68
Looking at the results of the cross-validated PCR fit, the percentage of variance explained in both the predictors (80.57%) and the response (83.03%) exceeds 80% once 6 components are included. We will go with 6 components: we want a model that captures most of the signal, but using all 17 components would simply reproduce the full least-squares fit and defeat the purpose of dimension reduction.
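As a cross-check on that eyeballed choice (a sketch only, not used for the final model below), the pls package also provides selectNcomp(), which picks a component count from the cross-validation curve automatically and may not agree with the manual choice of 6:

# Suggest a component count from the CV errors using the one-standard-error heuristic
selectNcomp(pcr.fit, method = "onesigma", plot = TRUE)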
validationplot(pcr.fit, val.type = "MSEP")
pcr_model <- pcr(Apps ~ ., data = train_data, scale = TRUE, ncomp = 6)
summary(pcr_model)
## Data: X dimension: 624 17
## Y dimension: 624 1
## Fit method: svdpc
## Number of components considered: 6
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## X 32.199 57.51 64.56 70.11 75.57 80.57
## Apps 2.383 72.02 72.04 79.00 82.65 83.03
pcr.pred <- predict(pcr.fit, test_data, ncomp = 6)
sqrt(mean((pcr.pred - test_data$Apps)^2))
## [1] 1161.317
After re-running the PCR model with the number of components set to 6, as chosen in the previous step, we get a test RMSE of 1161.317, which is substantially worse than the previous models.
set.seed(1)
pls.fit <- plsr(Apps ~ ., data = train_data, scale = TRUE, validation = "CV")
summary(pls.fit)
## Data: X dimension: 624 17
## Y dimension: 624 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3922 1941 1673 1543 1462 1283 1234
## adjCV 3922 1936 1672 1535 1441 1266 1224
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1218 1214 1211 1209 1209 1207 1207
## adjCV 1209 1205 1202 1201 1200 1199 1198
## 14 comps 15 comps 16 comps 17 comps
## CV 1206 1206 1206 1206
## adjCV 1198 1197 1197 1197
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 25.52 42.51 62.72 64.87 67.51 73.11 76.72 80.11
## Apps 76.94 83.71 87.00 90.75 92.28 92.44 92.52 92.57
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 82.40 85.13 88.99 90.91 92.68 95.49 97.00
## Apps 92.63 92.65 92.66 92.67 92.68 92.68 92.68
## 16 comps 17 comps
## X 98.87 100.00
## Apps 92.68 92.68
Similar to the PCR model, for PLS we first fit a model using cross-validation to see what the best number of components would be. At 8 components, at least 80% of the variance is explained in both the predictors and the response, so we will set ncomp to 8 for the predictions below.
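Instead of reading thresholds off the summary table, the cross-validated RMSEP values can also be pulled out numerically (a sketch using the pls accessor functions; the "(Intercept)" entry is the zero-component model):

# Extract the CV RMSEP for each number of components as a plain named vector
drop(RMSEP(pls.fit, estimate = "CV")$val)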
validationplot(pls.fit, val.type = "MSEP")
pls.pred <- predict(pls.fit, test_data, ncomp = 8)
sqrt(mean((pls.pred - test_data$Apps)^2))
## [1] 924.1843
After making predictions on the test set from the PLS fit with ncomp set to 8, I got an RMSE of 924.1843, which is much better than the PCR model.
The RMSEs of all of the models are as follows:
Linear Model (Least Square): 928.8176
Ridge Regression: 879.621
Lasso: 922.3743
PCR: 1161.317
PLS: 924.1843
From these results it can be seen that the ridge regression model performed the best, while the PCR model performed the worst by a significant margin.
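For a compact side-by-side view, the test RMSEs reported above can be collected into a single table (the values are simply copied from the results already computed; nothing new is fit here):

# Summary table of the test RMSEs obtained above, sorted from best to worst
results <- data.frame(
  Model = c("Least squares", "Ridge", "Lasso", "PCR (6 comps)", "PLS (8 comps)"),
  RMSE  = c(928.8176, 879.621, 922.3743, 1161.317, 924.1843)
)
results[order(results$RMSE), ]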
library(ISLR2)
library(leaps)
Boston = Boston
dim(Boston)
## [1] 506 13
set.seed(42)
train_index <- createDataPartition(Boston$crim, p = 0.7, list = FALSE)
boston_train <- Boston[train_index, ]
boston_test <- Boston[-train_index, ]
regfit.full <- regsubsets(crim ~ ., data = boston_train, nvmax = 12)
reg.summary = summary(regfit.full)
reg.summary
## Subset selection object
## Call: regsubsets.formula(crim ~ ., data = boston_train, nvmax = 12)
## 12 Variables (and intercept)
## Forced in Forced out
## zn FALSE FALSE
## indus FALSE FALSE
## chas FALSE FALSE
## nox FALSE FALSE
## rm FALSE FALSE
## age FALSE FALSE
## dis FALSE FALSE
## rad FALSE FALSE
## tax FALSE FALSE
## ptratio FALSE FALSE
## lstat FALSE FALSE
## medv FALSE FALSE
## 1 subsets of each size up to 12
## Selection Algorithm: exhaustive
## zn indus chas nox rm age dis rad tax ptratio lstat medv
## 1 ( 1 ) " " " " " " " " " " " " " " "*" " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " " "*" " " " " "*" " "
## 3 ( 1 ) " " " " " " " " " " " " " " "*" " " " " "*" "*"
## 4 ( 1 ) "*" " " " " " " " " " " "*" "*" " " " " " " "*"
## 5 ( 1 ) "*" "*" " " " " " " " " "*" "*" " " " " " " "*"
## 6 ( 1 ) "*" " " " " "*" " " " " "*" "*" " " "*" " " "*"
## 7 ( 1 ) "*" " " " " "*" " " " " "*" "*" " " "*" "*" "*"
## 8 ( 1 ) "*" " " " " "*" "*" " " "*" "*" " " "*" "*" "*"
## 9 ( 1 ) "*" " " "*" "*" "*" " " "*" "*" " " "*" "*" "*"
## 10 ( 1 ) "*" " " "*" "*" "*" " " "*" "*" "*" "*" "*" "*"
## 11 ( 1 ) "*" " " "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## 12 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
reg.summary$adjr2
## [1] 0.3834393 0.4034368 0.4048850 0.4099491 0.4119745 0.4140309 0.4147092
## [8] 0.4156044 0.4152502 0.4149416 0.4137607 0.4122893
which.max(reg.summary$adjr2)
## [1] 8
plot(reg.summary$adjr2, xlab = "Number of Variables",
ylab = "Adjusted RSq", type = "l")
points(8, reg.summary$adjr2[8], col = "red", cex = 2, pch = 20)
coef(regfit.full, 8)
## (Intercept) zn nox rm dis rad
## 10.64234133 0.03865907 -11.15120476 0.84593198 -0.78348686 0.55150069
## ptratio lstat medv
## -0.32765249 0.12170597 -0.19666350
plot(regfit.full, scale = "adjr2")
I ran the regsubsets() function on the full set of predictors and then looked at the adjusted R-squared values. Because there are 12 predictors, there were 12 values. The highest adjusted R-squared occurred with 8 variables in the model. The plot above, together with the coef() output, shows that the specific variables included in the 8-variable model are zn, nox, rm, dis, rad, ptratio, lstat, and medv.
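Adjusted R-squared is only one of the criteria regsubsets() reports; a quick cross-check is to see which model sizes Cp and BIC favor on this split (a sketch, without asserting that they agree with the 8-variable choice):

# Model sizes favored by the other selection criteria
which.min(reg.summary$cp)   # Mallows' Cp
which.min(reg.summary$bic)  # BIC, which typically prefers smaller models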
linear_model <- lm(crim ~ zn + nox + rm + dis + rad + ptratio + lstat + medv, data = boston_train)
summary(linear_model)
##
## Call:
## lm(formula = crim ~ zn + nox + rm + dis + rad + ptratio + lstat +
## medv, data = boston_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.210 -1.909 -0.262 0.891 74.365
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.64234 8.18336 1.300 0.19430
## zn 0.03866 0.02160 1.789 0.07442 .
## nox -11.15120 5.69954 -1.957 0.05121 .
## rm 0.84593 0.68322 1.238 0.21649
## dis -0.78349 0.31377 -2.497 0.01299 *
## rad 0.55150 0.05721 9.640 < 2e-16 ***
## ptratio -0.32765 0.21984 -1.490 0.13703
## lstat 0.12171 0.08466 1.438 0.15144
## medv -0.19666 0.06737 -2.919 0.00374 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.446 on 347 degrees of freedom
## Multiple R-squared: 0.4288, Adjusted R-squared: 0.4156
## F-statistic: 32.56 on 8 and 347 DF, p-value: < 2.2e-16
predictions <- predict(linear_model, newdata = boston_test) # Get predictions
actuals <- boston_test$crim # Actual values
rmse <- sqrt(mean((actuals - predictions)^2)) # Compute RMSE
print(rmse)
## [1] 6.493153
After refitting a linear regression model using the variables chosen from the regsubsets() function, I used the model to make predictions on a test set and got an RMSE of 6.493153.
x_train <- model.matrix(crim ~ ., boston_train)[, -1]
y_train <- boston_train$crim
x_test <- model.matrix(crim ~ ., boston_test)[, -1]
y_test <- boston_test$crim
set.seed(1)
cv.out <- cv.glmnet(x_train, y_train, alpha = 0)
plot(cv.out)
bestlam <- cv.out$lambda.min
bestlam
## [1] 0.5225794
ridge_model <- glmnet(x_train, y_train, alpha = 0, lambda = bestlam)
ridge_model
##
## Call: glmnet(x = x_train, y = y_train, alpha = 0, lambda = bestlam)
##
## Df %Dev Lambda
## 1 12 42.63 0.5226
ridge.pred <- predict(ridge_model, s = bestlam, newx = x_test)
sqrt(mean((ridge.pred - y_test)^2))
## [1] 6.553395
I fit a ridge model on the training split and then made predictions on the test split getting an RMSE of 6.553395.
set.seed(1)
cv.out <- cv.glmnet(x_train, y_train, alpha = 1)
plot(cv.out)
bestlam <- cv.out$lambda.min
bestlam
## [1] 0.03132781
lasso_model <- glmnet(x_train, y_train, alpha = 1, lambda = bestlam)
lasso_model
##
## Call: glmnet(x = x_train, y = y_train, alpha = 1, lambda = bestlam)
##
## Df %Dev Lambda
## 1 12 43.13 0.03133
lasso.pred <- predict(lasso_model, s = bestlam, newx = x_test)
sqrt(mean((lasso.pred - y_test)^2))
## [1] 6.512083
I fit a lasso model on the training split and then made predictions on the test split getting an RMSE of 6.512083.
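At lambda.min the lasso kept all 12 predictors (Df = 12 above). If a sparser model were preferred, cv.glmnet also stores the larger one-standard-error lambda; a sketch, without claiming it improves the test RMSE here:

# The one-standard-error lambda and the coefficients at that penalty; some may now be exactly zero
cv.out$lambda.1se
coef(cv.out, s = "lambda.1se")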
set.seed(1)
pcr.fit <- pcr(crim ~ ., data = boston_train, scale = TRUE, validation = "CV")
summary(pcr.fit)
## Data: X dimension: 356 12
## Y dimension: 356 1
## Fit method: svdpc
## Number of components considered: 12
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.444 7.220 7.230 6.800 6.807 6.765 6.753
## adjCV 8.444 7.218 7.227 6.792 6.806 6.761 6.748
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps
## CV 6.632 6.665 6.653 6.667 6.641 6.548
## adjCV 6.625 6.658 6.645 6.659 6.631 6.537
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 49.52 63.44 72.52 79.80 86.34 89.86 92.69 94.83
## crim 27.33 27.36 36.31 36.32 37.18 37.63 40.07 40.21
## 9 comps 10 comps 11 comps 12 comps
## X 96.71 98.27 99.42 100.00
## crim 40.49 40.49 41.60 43.22
validationplot(pcr.fit, val.type = "MSEP")
Similar to problem 9, I began by fitting a PCR model with cross-validation to see what the best number of components would be. Looking at the summary, the cross-validation errors reach a first local minimum at 3 components, so I chose 3 as the number of components for the model.
pcr_model <- pcr(crim ~ ., data = boston_train, scale = TRUE, ncomp = 3)
summary(pcr_model)
## Data: X dimension: 356 12
## Y dimension: 356 1
## Fit method: svdpc
## Number of components considered: 3
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps
## X 49.52 63.44 72.52
## crim 27.33 27.36 36.31
pcr.pred <- predict(pcr.fit, boston_test, ncomp = 3)
sqrt(mean((pcr.pred - boston_test$crim)^2))
## [1] 6.976105
After re-running the PCR model with the number of components set to 3, as chosen in the previous step, we get a test RMSE of 6.976105, which is again noticeably worse than the previous models.
set.seed(1)
pls.fit <- plsr(crim ~ ., data = boston_train, scale = TRUE, validation = "CV")
summary(pls.fit)
## Data: X dimension: 356 12
## Y dimension: 356 1
## Fit method: kernelpls
## Number of components considered: 12
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.444 7.050 6.665 6.578 6.606 6.580 6.571
## adjCV 8.444 7.048 6.659 6.569 6.596 6.569 6.559
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps
## CV 6.551 6.546 6.546 6.547 6.548 6.548
## adjCV 6.540 6.535 6.535 6.537 6.537 6.537
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 49.04 58.22 62.61 73.39 79.60 82.91 85.46 89.73
## crim 31.09 39.69 41.87 42.31 42.74 43.10 43.18 43.21
## 9 comps 10 comps 11 comps 12 comps
## X 94.34 96.37 98.42 100.00
## crim 43.21 43.22 43.22 43.22
validationplot(pls.fit, val.type = "MSEP")
After looking at the summary of the PLS model fit with cross-validation, I again saw a local minimum in the cross-validation errors at 3 components. This again made me choose 3 as the number of components for my final predictions.
pls.pred <- predict(pls.fit, boston_test, ncomp = 3)
sqrt(mean((pls.pred - boston_test$crim)^2))
## [1] 6.600582
After re-running the PLS model with only 3 components, I ran predictions on the test set and got an RMSE of 6.600582.
The RMSEs of all of the models are as follows:
Linear Model (regsubsets()): 6.493153
Ridge Regression: 6.553395
Lasso: 6.512083
PCR: 6.976105
PLS: 6.600582
This time it was the linear regression model, using the variables chosen through the regsubsets() function, that produced the best predictions on the test data. Similar to question 9, the PCR model performed noticeably worse than the other models.
The linear regression model did not end up involving all of the features in the data set because, before fitting the model, I ran the data through the regsubsets() function. As mentioned, I used adjusted R-squared as the selection criterion and found that the combination of predictors that maximizes it is zn, nox, rm, dis, rad, ptratio, lstat, and medv.