Exercises from the ISLR2 book, starting at PDF page 294.
# Load dependencies
pacman::p_load(ISLR2, dplyr, caret, glmnet, pls, leaps)
For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.
The lasso, relative to least squares, is:
i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
The correct answer is iii.: less flexible, and hence the lasso will give improved prediction accuracy when its increase in bias is less than its decrease in variance. The lasso applies an \(L_1\) penalty to the regression coefficients, shrinking some exactly to zero and thereby performing variable selection. This makes the model less flexible than least squares, which increases bias but can reduce variance; prediction accuracy improves when the increase in bias is smaller than the decrease in variance.
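For reference, the lasso chooses its coefficients by minimizing the residual sum of squares plus an \(L_1\) penalty, and it is this absolute-value penalty that can force coefficients exactly to zero:

\[
\sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j|
\]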
Repeat (a) for ridge regression relative to least squares.
The correct answer is again iii.: less flexible, and hence ridge regression will give improved prediction accuracy when its increase in bias is less than its decrease in variance. Ridge regression is less flexible than least squares because it adds an \(L_2\) penalty on the size of the coefficients, shrinking them toward zero (though never exactly to zero). This reduced flexibility increases bias but can reduce variance, and accuracy improves when the increase in bias is smaller than the decrease in variance.
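Ridge regression minimizes the same residual sum of squares but with an \(L_2\) penalty, which shrinks coefficients smoothly toward zero without eliminating any:

\[
\sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2
\]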
Repeat (a) for non-linear methods relative to least squares.
The correct answer is ii.: more flexible, and hence non-linear methods will give improved prediction accuracy when their increase in variance is less than their decrease in bias. Non-linear methods do not assume a linear functional form, allowing the model to adapt to complex patterns. This added flexibility decreases bias but can increase variance, and accuracy improves when the increase in variance is smaller than the decrease in bias; the toy simulation below illustrates this.
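As a quick illustration of this trade-off (a toy simulation on made-up data, not part of the exercise), when the true signal is non-linear, a flexible polynomial fit beats a straight line on held-out data because its reduction in bias outweighs its added variance:

# Toy simulation: the truth is non-linear, so the flexible fit wins out of sample
set.seed(1)
sim = data.frame(x = runif(200, -2, 2))
sim$y = sin(2 * sim$x) + rnorm(200, sd= 0.3)
idx = sample(200, 100)
lin = lm(y ~ x, data= sim[idx, ])
flex = lm(y ~ poly(x, 5), data= sim[idx, ])
# Held-out RMSE for each fit; the flexible model should come out lower
c(linear = sqrt(mean((sim$y[-idx] - predict(lin, sim[-idx, ]))^2)),
  flexible = sqrt(mean((sim$y[-idx] - predict(flex, sim[-idx, ]))^2)))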
In this exercise, we will predict the number of applications received using the other variables in the College data set.
Split the data set into a training set and a test set.
# Read in the data from ISLR2 package
d1 = College
glimpse(d1)
## Rows: 777
## Columns: 18
## $ Private <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes…
## $ Apps <dbl> 1660, 2186, 1428, 417, 193, 587, 353, 1899, 1038, 582, 173…
## $ Accept <dbl> 1232, 1924, 1097, 349, 146, 479, 340, 1720, 839, 498, 1425…
## $ Enroll <dbl> 721, 512, 336, 137, 55, 158, 103, 489, 227, 172, 472, 484,…
## $ Top10perc <dbl> 23, 16, 22, 60, 16, 38, 17, 37, 30, 21, 37, 44, 38, 44, 23…
## $ Top25perc <dbl> 52, 29, 50, 89, 44, 62, 45, 68, 63, 44, 75, 77, 64, 73, 46…
## $ F.Undergrad <dbl> 2885, 2683, 1036, 510, 249, 678, 416, 1594, 973, 799, 1830…
## $ P.Undergrad <dbl> 537, 1227, 99, 63, 869, 41, 230, 32, 306, 78, 110, 44, 638…
## $ Outstate <dbl> 7440, 12280, 11250, 12960, 7560, 13500, 13290, 13868, 1559…
## $ Room.Board <dbl> 3300, 6450, 3750, 5450, 4120, 3335, 5720, 4826, 4400, 3380…
## $ Books <dbl> 450, 750, 400, 450, 800, 500, 500, 450, 300, 660, 500, 400…
## $ Personal <dbl> 2200, 1500, 1165, 875, 1500, 675, 1500, 850, 500, 1800, 60…
## $ PhD <dbl> 70, 29, 53, 92, 76, 67, 90, 89, 79, 40, 82, 73, 60, 79, 36…
## $ Terminal <dbl> 78, 30, 66, 97, 72, 73, 93, 100, 84, 41, 88, 91, 84, 87, 6…
## $ S.F.Ratio <dbl> 18.1, 12.2, 12.9, 7.7, 11.9, 9.4, 11.5, 13.7, 11.3, 11.5, …
## $ perc.alumni <dbl> 12, 16, 30, 37, 2, 11, 26, 37, 23, 15, 31, 41, 21, 32, 26,…
## $ Expend <dbl> 7041, 10527, 8735, 19016, 10922, 9727, 8861, 11487, 11644,…
## $ Grad.Rate <dbl> 60, 56, 54, 59, 15, 55, 63, 73, 80, 52, 73, 76, 74, 68, 55…
set.seed(42)
# Split data into train and test
train_index = createDataPartition(d1$Apps, p= 0.8, list= FALSE)
train = d1[train_index, ]
test = d1[-train_index, ]
Fit a linear model using least squares on the training set, and report the test error obtained.
The RMSE obtained was 929, meaning that on average the model's predictions are off by about 929 applications from the actual values in the test set. Given that the mean number of applications in the test set is 3,005, the RMSE represents an error of approximately 30.9% of the mean. The number of applications in the test set ranges from 141 to 21,804, so the model's errors are relatively small compared to the spread of the data.
# Linear Regression Model
lm_model = lm(Apps ~ ., data= train)
summary(lm_model)
##
## Call:
## lm(formula = Apps ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3881.0 -437.1 -19.5 316.7 6967.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -561.34089 461.02508 -1.218 0.22385
## PrivateYes -409.16975 158.55701 -2.581 0.01010 *
## Accept 1.62834 0.04476 36.382 < 2e-16 ***
## Enroll -0.85903 0.20563 -4.177 3.38e-05 ***
## Top10perc 51.31276 6.42525 7.986 7.03e-15 ***
## Top25perc -14.67648 5.16650 -2.841 0.00465 **
## F.Undergrad 0.05127 0.03625 1.415 0.15769
## P.Undergrad 0.04916 0.03535 1.391 0.16483
## Outstate -0.09202 0.02262 -4.067 5.38e-05 ***
## Room.Board 0.13611 0.05628 2.418 0.01588 *
## Books 0.10960 0.27217 0.403 0.68732
## Personal 0.01803 0.07008 0.257 0.79711
## PhD -8.96660 5.28199 -1.698 0.09010 .
## Terminal -4.40975 5.78320 -0.763 0.44605
## S.F.Ratio 21.41479 14.51439 1.475 0.14062
## perc.alumni 3.87434 4.69013 0.826 0.40909
## Expend 0.08916 0.01484 6.009 3.22e-09 ***
## Grad.Rate 7.23133 3.32272 2.176 0.02992 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1075 on 606 degrees of freedom
## Multiple R-squared: 0.9268, Adjusted R-squared: 0.9247
## F-statistic: 451.1 on 17 and 606 DF, p-value: < 2.2e-16
# Make predictions on the test set
predictions = predict(lm_model, test)
# Calculate the Root Mean Squared Error
mse = mean((test$Apps - predictions)^2)
sqrt(mse)
## [1] 928.8176
# Mean and range of test dataset
mean(test$Apps)
## [1] 3004.856
range(test$Apps)
## [1] 141 21804
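For completeness, the relative error quoted above can be computed directly from these quantities:

# Test RMSE as a fraction of the mean response (~0.309, i.e. ~30.9%)
sqrt(mse) / mean(test$Apps)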
Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
The RMSE obtained was 880, meaning that on average the model's predictions are off by about 880 applications from the actual values in the test set. Given that the mean number of applications in the test set is 3,005, the RMSE represents an error of approximately 29.3% of the mean. The number of applications in the test set ranges from 141 to 21,804, so the model's errors are relatively small compared to the spread of the data.
# Convert x to a matrix of predictor variables
x_train = model.matrix(Apps ~ ., data= train)[,-1]
x_test = model.matrix(Apps ~ ., data= test)[,-1]
# Convert y to a numeric vector
y_train = train$Apps
y_test = test$Apps
set.seed(1)
# Fit ridge with cross-validation
lam = cv.glmnet(x_train, y_train, alpha= 0)
# Ridge Regression model with best lambda
ridge_model = glmnet(x_train, y_train, alpha= 0, lambda= lam$lambda.min)
ridge_model
##
## Call: glmnet(x = x_train, y = y_train, alpha = 0, lambda = lam$lambda.min)
##
## Df %Dev Lambda
## 1 17 90.66 367.6
# Make predictions on the test set
predictions = predict(ridge_model, x_test)
# Calculate the Root Mean Squared Error
sqrt(mean((y_test - predictions)^2))
## [1] 879.621
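As a sanity check that ridge performs no variable selection (the Df column above shows all 17 predictors retained), the estimates at the chosen λ can be printed; output omitted here:

# Ridge shrinks coefficients toward zero but never exactly to zero
coef(ridge_model)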
Fit a lasso model on the training set, with λ chosen by cross validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
The RMSE obtained was 922, meaning that on average the model's predictions are off by about 922 applications from the actual values in the test set, approximately 30.7% of the mean of 3,005. At the cross-validated λ of 13.84, the lasso retains 14 non-zero coefficient estimates out of the 17 predictors, as shown in the Df column of the output below. The number of applications in the test set ranges from 141 to 21,804, so the errors remain relatively small compared to the spread of the data.
set.seed(1)
# Fit lasso with cross-validation
lam = cv.glmnet(x_train, y_train, alpha= 1)
# Lasso model with best lambda
lasso_model = glmnet(x_train, y_train, alpha= 1, lambda= lam$lambda.min)
lasso_model
##
## Call: glmnet(x = x_train, y = y_train, alpha = 1, lambda = lam$lambda.min)
##
## Df %Dev Lambda
## 1 14 92.55 13.84
# Make predictions on the test set
predictions = predict(lasso_model, x_test)
# Calculate the Root Mean Squared Error
sqrt(mean((y_test - predictions)^2))
## [1] 922.3743
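To report the non-zero coefficient count explicitly rather than reading it off the Df column, it can be computed from the fitted object (this should return 14, matching the output above):

# Count non-zero coefficient estimates, excluding the intercept
sum(coef(lasso_model)[-1] != 0)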
Fit a PCR model on the training set, with M chosen by cross validation. Report the test error obtained, along with the value of M selected by cross-validation.
The RMSE obtained was 1161 with M = 6, meaning that on average the model's predictions are off by about 1,161 applications from the actual values in the test set, approximately 38.6% of the mean of 3,005. The number of applications in the test set ranges from 141 to 21,804, so the errors are still small compared to the spread of the data, though clearly larger than those of the other methods.
set.seed(1)
# PCR model
pcr_model = pcr(Apps ~ ., data= train, scale= TRUE, validation= 'CV')
summary(pcr_model)
## Data: X dimension: 624 17
## Y dimension: 624 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3922 3896 2109 2112 1992 1715 1692
## adjCV 3922 3895 2106 2110 1946 1704 1686
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1682 1651 1631 1633 1639 1635 1644
## adjCV 1682 1642 1625 1627 1633 1629 1638
## 14 comps 15 comps 16 comps 17 comps
## CV 1644 1494 1246 1206
## adjCV 1638 1471 1236 1197
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 32.199 57.51 64.56 70.11 75.57 80.57 84.18 87.63
## Apps 2.383 72.02 72.04 79.00 82.65 83.03 83.22 84.07
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 90.55 92.94 95.07 96.79 97.85 98.69 99.36
## Apps 84.56 84.82 84.83 84.95 84.97 84.99 90.89
## 16 comps 17 comps
## X 99.82 100.00
## Apps 92.31 92.68
# Plot cross-validation scores
validationplot(pcr_model, val.type = "MSEP")
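Note that the strict CV minimum above sits at 17 components, which would simply reproduce least squares, so M = 6 is taken from the elbow of the validation plot, where the error has largely leveled off. If desired, the minimizer can also be located programmatically (a sketch using the pls package's RMSEP accessor on the cross-validated fit):

# Component count minimizing CV RMSEP; here 17, i.e. the full least squares fit
which.min(RMSEP(pcr_model, estimate = "CV")$val[1, 1, -1])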
# Fit PCR model using best M
pcr_model = pcr(Apps ~ ., data= train, scale = TRUE, ncomp= 6)
summary(pcr_model)
## Data: X dimension: 624 17
## Y dimension: 624 1
## Fit method: svdpc
## Number of components considered: 6
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## X 32.199 57.51 64.56 70.11 75.57 80.57
## Apps 2.383 72.02 72.04 79.00 82.65 83.03
# Make predictions on the test set
predictions = predict(pcr_model, test, ncomp= 6)
# Calculate the Root Mean Squared Error
sqrt(mean((test$Apps - predictions)^2))
## [1] 1161.317
Fit a PLS model on the training set, with M chosen by cross validation. Report the test error obtained, along with the value of M selected by cross-validation.
The RMSE obtained was 924 with M = 8, meaning that on average the model's predictions are off by about 924 applications from the actual values in the test set, approximately 30.8% of the mean of 3,005. The number of applications in the test set ranges from 141 to 21,804, so the model's errors are relatively small compared to the spread of the data.
set.seed(1)
# PLS Model
pls_model = plsr(Apps ~ ., data= train, scale= TRUE, validation= 'CV')
summary(pls_model)
## Data: X dimension: 624 17
## Y dimension: 624 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3922 1941 1673 1543 1462 1283 1234
## adjCV 3922 1936 1672 1535 1441 1266 1224
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1218 1214 1211 1209 1209 1207 1207
## adjCV 1209 1205 1202 1201 1200 1199 1198
## 14 comps 15 comps 16 comps 17 comps
## CV 1206 1206 1206 1206
## adjCV 1198 1197 1197 1197
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 25.52 42.51 62.72 64.87 67.51 73.11 76.72 80.11
## Apps 76.94 83.71 87.00 90.75 92.28 92.44 92.52 92.57
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 82.40 85.13 88.99 90.91 92.68 95.49 97.00
## Apps 92.63 92.65 92.66 92.67 92.68 92.68 92.68
## 16 comps 17 comps
## X 98.87 100.00
## Apps 92.68 92.68
# Plot cross-validation scores
validationplot(pls_model, val.type = "MSEP")
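Rather than eyeballing the plot, the pls package also offers selectNcomp(), which applies a one-standard-error rule to the CV curve (a hedged alternative; the M = 8 used below was read from the plot, and the two choices need not agree):

# Smallest M whose CV error is within one standard error of the minimum
selectNcomp(pls_model, method = "onesigma")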
# Make predictions on the test set
predictions = predict(pls_model, test, ncomp= 8)
# Calculate the Root Mean Squared Error
sqrt(mean((test$Apps - predictions)^2))
## [1] 924.1843
Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
The test RMSEs of the five models are as follows: ridge regression 880, lasso 922, PLS (M = 8) 924, least squares 929, and PCR (M = 6) 1161.
The ridge regression model achieved the lowest RMSE, suggesting the best predictive accuracy for estimating the number of college applications received, being off by 880 on average. Least squares, the lasso, and partial least squares (PLS) produced similar RMSE values, ranging from 922 to 929, indicating comparable predictive ability. In contrast, principal components regression (PCR) performed noticeably worse, with a test RMSE of 1161.
We will now try to predict per capita crime rate in the Boston dataset.
Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
# Read in the data from ISLR2 package
boston = Boston
head(boston)
## crim zn indus chas nox rm age dis rad tax ptratio lstat medv
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 4.98 24.0
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 9.14 21.6
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 4.03 34.7
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 2.94 33.4
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 5.33 36.2
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 5.21 28.7
set.seed(42)
# Split data into train and test
train_index = createDataPartition(boston$crim, p= 0.7, list= FALSE)
train = boston[train_index, ]
test = boston[-train_index, ]
For best subset selection (the eight-variable least squares model chosen below), the RMSE obtained was 6.49, meaning that on average the model's predictions are off by 6.49 units from the actual values in the test set. Given that the mean per capita crime rate in the test set is 3.83, the RMSE is about 1.7 times the mean. The crime rate in the test set ranges from 0.006 to 67.92; relative to this wide, skewed range the errors are modest, but they are large relative to typical low-crime towns, suggesting the model performs better in higher-crime areas and less accurately in low-crime towns.
# Regsubsets Model
reg_model = regsubsets(crim ~ ., data= train, nvmax= 12)
reg_summary = summary(reg_model)
reg_summary
## Subset selection object
## Call: regsubsets.formula(crim ~ ., data = train, nvmax = 12)
## 12 Variables (and intercept)
## Forced in Forced out
## zn FALSE FALSE
## indus FALSE FALSE
## chas FALSE FALSE
## nox FALSE FALSE
## rm FALSE FALSE
## age FALSE FALSE
## dis FALSE FALSE
## rad FALSE FALSE
## tax FALSE FALSE
## ptratio FALSE FALSE
## lstat FALSE FALSE
## medv FALSE FALSE
## 1 subsets of each size up to 12
## Selection Algorithm: exhaustive
## zn indus chas nox rm age dis rad tax ptratio lstat medv
## 1 ( 1 ) " " " " " " " " " " " " " " "*" " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " " "*" " " " " "*" " "
## 3 ( 1 ) " " " " " " " " " " " " " " "*" " " " " "*" "*"
## 4 ( 1 ) "*" " " " " " " " " " " "*" "*" " " " " " " "*"
## 5 ( 1 ) "*" "*" " " " " " " " " "*" "*" " " " " " " "*"
## 6 ( 1 ) "*" " " " " "*" " " " " "*" "*" " " "*" " " "*"
## 7 ( 1 ) "*" " " " " "*" " " " " "*" "*" " " "*" "*" "*"
## 8 ( 1 ) "*" " " " " "*" "*" " " "*" "*" " " "*" "*" "*"
## 9 ( 1 ) "*" " " "*" "*" "*" " " "*" "*" " " "*" "*" "*"
## 10 ( 1 ) "*" " " "*" "*" "*" " " "*" "*" "*" "*" "*" "*"
## 11 ( 1 ) "*" " " "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## 12 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
# R^2 Values
reg_summary$adjr2
## [1] 0.3834393 0.4034368 0.4048850 0.4099491 0.4119745 0.4140309 0.4147092
## [8] 0.4156044 0.4152502 0.4149416 0.4137607 0.4122893
which.max(reg_summary$adjr2)
## [1] 8
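Adjusted \(R^2\) peaks at eight variables. Since Cp and BIC penalize model size differently, it is worth checking whether they agree before committing to that subset (they often favor smaller models):

# Subset sizes preferred by Cp and BIC, for comparison
which.min(reg_summary$cp)
which.min(reg_summary$bic)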
# Adjusted R^2 Plot
plot(reg_summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted RSq", type = "l")
# Point for the highest value
points(8, reg_summary$adjr2[8], col = "red", cex = 2, pch = 20)
plot(reg_model, scale = "r2")
plot(reg_model, scale = "adjr2")
plot(reg_model, scale = "Cp")
plot(reg_model, scale = "bic")
# Coefficients of the best eight-variable model
coef(reg_model, 8)
## (Intercept) zn nox rm dis rad
## 10.64234133 0.03865907 -11.15120476 0.84593198 -0.78348686 0.55150069
## ptratio lstat medv
## -0.32765249 0.12170597 -0.19666350
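As an alternative to refitting with lm() below, note that regsubsets has no built-in predict() method; a small helper in the style of the ISLR labs (predict_regsubsets is defined here, not part of leaps) scores any subset size directly on the test set and is equivalent to the lm refit:

# Helper to generate predictions from a regsubsets fit at a given subset size
predict_regsubsets = function(object, newdata, id) {
  form = as.formula(object$call[[2]])
  mat = model.matrix(form, newdata)
  coefi = coef(object, id = id)
  drop(mat[, names(coefi)] %*% coefi)
}
# Test RMSE of the eight-variable model, matching the lm approach below
sqrt(mean((test$crim - predict_regsubsets(reg_model, test, 8))^2))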
# Linear Regression Model
lm_model = lm(crim ~ zn + nox + rm + dis + rad + ptratio + lstat + medv, data= train)
summary(lm_model)
##
## Call:
## lm(formula = crim ~ zn + nox + rm + dis + rad + ptratio + lstat +
## medv, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.210 -1.909 -0.262 0.891 74.365
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.64234 8.18336 1.300 0.19430
## zn 0.03866 0.02160 1.789 0.07442 .
## nox -11.15120 5.69954 -1.957 0.05121 .
## rm 0.84593 0.68322 1.238 0.21649
## dis -0.78349 0.31377 -2.497 0.01299 *
## rad 0.55150 0.05721 9.640 < 2e-16 ***
## ptratio -0.32765 0.21984 -1.490 0.13703
## lstat 0.12171 0.08466 1.438 0.15144
## medv -0.19666 0.06737 -2.919 0.00374 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.446 on 347 degrees of freedom
## Multiple R-squared: 0.4288, Adjusted R-squared: 0.4156
## F-statistic: 32.56 on 8 and 347 DF, p-value: < 2.2e-16
# Make predictions on the test set
predictions = predict(lm_model, test)
# Calculate the Root Mean Squared Error
mse = mean((test$crim - predictions)^2)
sqrt(mse)
## [1] 6.493153
# Mean and range of test dataset
mean(test$crim)
## [1] 3.827136
range(test$crim)
## [1] 0.00632 67.92080
For ridge regression, the RMSE obtained was 6.55, meaning that on average the model's predictions are off by 6.55 units from the actual values in the test set, about 1.71 times the mean crime rate of 3.83. As before, given the test-set range of 0.006 to 67.92, the model is likely less accurate in relative terms for low-crime towns.
# Convert x to a matrix of predictor variables
x_train = model.matrix(crim ~ ., data= train)[,-1]
x_test = model.matrix(crim ~ ., data= test)[,-1]
# Convert y to a numeric vector
y_train = train$crim
y_test = test$crim
set.seed(1)
# Fit ridge with cross-validation
lam = cv.glmnet(x_train, y_train, alpha= 0)
# Ridge Regression model with best lambda
ridge_model = glmnet(x_train, y_train, alpha= 0, lambda= lam$lambda.min)
ridge_model
##
## Call: glmnet(x = x_train, y = y_train, alpha = 0, lambda = lam$lambda.min)
##
## Df %Dev Lambda
## 1 12 42.63 0.5226
# Make predictions on the test set
predictions = predict(ridge_model, x_test)
# Calculate the Root Mean Squared Error
sqrt(mean((y_test - predictions)^2))
## [1] 6.553395
For the lasso, the RMSE obtained was 6.51, meaning that on average the model's predictions are off by 6.51 units from the actual values in the test set, about 1.7 times the mean crime rate of 3.83. As before, given the test-set range of 0.006 to 67.92, the model is likely less accurate in relative terms for low-crime towns.
set.seed(1)
# Fit lasso with cross-validation
lam = cv.glmnet(x_train, y_train, alpha= 1)
# Lasso model with best lambda
lasso_model = glmnet(x_train, y_train, alpha= 1, lambda= lam$lambda.min)
lasso_model
##
## Call: glmnet(x = x_train, y = y_train, alpha = 1, lambda = lam$lambda.min)
##
## Df %Dev Lambda
## 1 12 43.13 0.03133
# Make predictions on the test set
predictions = predict(lasso_model, x_test)
# Calculate the Root Mean Squared Error
sqrt(mean((y_test - predictions)^2))
## [1] 6.512083
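Note that at this small λ (0.031) the lasso drops nothing: the Df column above shows all 12 predictors retained, so it behaves much like least squares here. A quick count confirms this (it should return 12):

# Count non-zero coefficient estimates, excluding the intercept
sum(coef(lasso_model)[-1] != 0)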
For PCR with M = 3 (chosen from the elbow of the validation plot; the strict CV minimum sits at 12 components, essentially least squares), the RMSE obtained was 6.98, meaning that on average the model's predictions are off by 6.98 units from the actual values in the test set, about 1.82 times the mean crime rate of 3.83. As before, given the test-set range of 0.006 to 67.92, the model is likely less accurate in relative terms for low-crime towns.
set.seed(1)
# PCR model
pcr_model = pcr(crim ~ ., data= train, scale= TRUE, validation= 'CV')
summary(pcr_model)
## Data: X dimension: 356 12
## Y dimension: 356 1
## Fit method: svdpc
## Number of components considered: 12
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.444 7.220 7.230 6.800 6.807 6.765 6.753
## adjCV 8.444 7.218 7.227 6.792 6.806 6.761 6.748
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps
## CV 6.632 6.665 6.653 6.667 6.641 6.548
## adjCV 6.625 6.658 6.645 6.659 6.631 6.537
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 49.52 63.44 72.52 79.80 86.34 89.86 92.69 94.83
## crim 27.33 27.36 36.31 36.32 37.18 37.63 40.07 40.21
## 9 comps 10 comps 11 comps 12 comps
## X 96.71 98.27 99.42 100.00
## crim 40.49 40.49 41.60 43.22
# Plot cross-validation scores
validationplot(pcr_model, val.type = "MSEP")
# Fit PCR model using best M
pcr_model = pcr(crim ~ ., data= train, scale = TRUE, ncomp= 3)
summary(pcr_model)
## Data: X dimension: 356 12
## Y dimension: 356 1
## Fit method: svdpc
## Number of components considered: 3
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps
## X 49.52 63.44 72.52
## crim 27.33 27.36 36.31
# Make predictions on the test set
predictions = predict(pcr_model, test, ncomp= 3)
# Calculate the Root Mean Squared Error
sqrt(mean((test$crim - predictions)^2))
## [1] 6.976105
For PLS, the RMSE obtained was 6.60 with M = 3, meaning that on average the model's predictions are off by 6.60 units from the actual values in the test set, about 1.72 times the mean crime rate of 3.83. As before, given the test-set range of 0.006 to 67.92, the model is likely less accurate in relative terms for low-crime towns.
set.seed(1)
# PLS Model
pls_model = plsr(crim ~ ., data= train, scale= TRUE, validation= 'CV')
summary(pls_model)
## Data: X dimension: 356 12
## Y dimension: 356 1
## Fit method: kernelpls
## Number of components considered: 12
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.444 7.050 6.665 6.578 6.606 6.580 6.571
## adjCV 8.444 7.048 6.659 6.569 6.596 6.569 6.559
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps
## CV 6.551 6.546 6.546 6.547 6.548 6.548
## adjCV 6.540 6.535 6.535 6.537 6.537 6.537
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 49.04 58.22 62.61 73.39 79.60 82.91 85.46 89.73
## crim 31.09 39.69 41.87 42.31 42.74 43.10 43.18 43.21
## 9 comps 10 comps 11 comps 12 comps
## X 94.34 96.37 98.42 100.00
## crim 43.21 43.22 43.22 43.22
# Plot cross-validation scores
validationplot(pls_model, val.type = "MSEP")
# Make predictions on the test set
predictions = predict(pls_model, test, ncomp= 3)
# Calculate the Root Mean Squared Error
sqrt(mean((test$crim - predictions)^2))
## [1] 6.600582
Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross validation, or some other reasonable alternative, as opposed to using training error.
The test RMSEs of the five models are as follows: best subset least squares (eight variables) 6.49, lasso 6.51, ridge regression 6.55, PLS (M = 3) 6.60, and PCR (M = 3) 6.98.
The five models produced test RMSE values that are fairly close, indicating similar predictive performance when estimating per capita crime rate by town. The lowest RMSE, 6.49, was achieved by the eight-variable least squares model selected by best subset selection. That subset was chosen by maximizing adjusted \(R^2\), which peaked at 0.4156, so the model explains only about 41.6% of the variance in crime rate. All models were evaluated on the same 70/30 train/test split.
Given that the crime rate in the test set ranges from 0.006 to 67.92, the relatively high RMSE suggests that the model may perform more reliably in towns with higher crime rates, while exhibiting reduced accuracy in low-crime areas.
Does your chosen model involve all of the features in the data set? Why or why not?
The chosen model includes only eight of the twelve predictors: zn, nox, rm, dis, rad, ptratio, lstat, and medv. Best subset selection was tuned to maximize adjusted \(R^2\), and that criterion peaked at the eight-variable model; the remaining predictors (indus, chas, age, and tax) improved the fit too little to offset the adjusted-\(R^2\) penalty for additional variables, so they were excluded.