Exercises from the ISLR2 book, starting at PDF page 294.

# Load dependencies
pacman::p_load(ISLR2, dplyr, caret, glmnet, pls, leaps)

Question 2

For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.

Part a)

The lasso, relative to least squares, is:

i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.


The lasso, relative to least squares, is iii.: less flexible, and hence it will give improved prediction accuracy when its increase in bias is less than its decrease in variance. The lasso applies an \(L_1\) penalty to the regression coefficients, shrinking some of them exactly to zero and thereby performing variable selection. This makes the model less flexible than least squares, which increases bias but can reduce variance; prediction accuracy improves whenever the increase in bias is smaller than the decrease in variance.
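
To make the variable-selection behaviour concrete, here is a minimal illustrative sketch on simulated data (the design, coefficients, and object names are made up for illustration, not part of the exercise): with only two true signals among ten predictors, the lasso sets most of the noise coefficients exactly to zero.

# Illustrative sketch: the lasso's L1 penalty zeroes out weak coefficients
set.seed(1)
x_sim = matrix(rnorm(100 * 10), 100, 10)
y_sim = drop(x_sim %*% c(3, -2, rep(0, 8))) + rnorm(100)   # two true signals
lasso_cv = cv.glmnet(x_sim, y_sim, alpha= 1)               # alpha = 1 -> lasso
coef(lasso_cv, s= "lambda.min")                            # noise coefs mostly 0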

Part b)

Repeat (a) for ridge regression relative to least squares.


Ridge regression, relative to least squares, is also iii.: less flexible, and hence it will give improved prediction accuracy when its increase in bias is less than its decrease in variance. Ridge regression adds an \(L_2\) penalty on the size of the coefficients, shrinking them toward zero (but, unlike the lasso, never exactly to zero). This reduced flexibility increases bias but can reduce variance; prediction accuracy improves whenever the increase in bias is smaller than the decrease in variance.
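
The contrast with the lasso can be seen on the same kind of simulated data (again purely illustrative): ridge shrinks every coefficient toward zero, but none lands exactly on zero, so no variables are dropped.

# Illustrative sketch: the L2 penalty shrinks but never zeroes coefficients
set.seed(1)
x_sim = matrix(rnorm(100 * 10), 100, 10)
y_sim = drop(x_sim %*% c(3, -2, rep(0, 8))) + rnorm(100)
ridge_cv = cv.glmnet(x_sim, y_sim, alpha= 0)               # alpha = 0 -> ridge
coef(ridge_cv, s= "lambda.min")                            # all shrunken, none 0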

Part c)

Repeat (a) for non-linear methods relative to least squares.


Non-linear methods, relative to least squares, are ii.: more flexible, and hence they will give improved prediction accuracy when their increase in variance is less than their decrease in bias. Non-linear methods do not assume a linear functional form, allowing the model to adapt to complex patterns. This added flexibility decreases bias but can increase variance; prediction accuracy improves whenever the increase in variance is smaller than the decrease in bias.
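
A small simulated example (illustrative only) of the flexibility gap: a smoothing spline adapts to a sinusoidal signal that a straight-line least squares fit cannot capture, trading a little variance for a large drop in bias.

# Illustrative sketch: flexible fit vs. least squares on a non-linear signal
set.seed(1)
x_sim = runif(200, 0, 10)
y_sim = sin(x_sim) + rnorm(200, sd= 0.3)
linear_fit = lm(y_sim ~ x_sim)                    # rigid: badly biased here
spline_fit = smooth.spline(x_sim, y_sim)          # flexible: tracks the curve

# Training RMSE of each fit (illustration only)
c(linear_rmse= sqrt(mean(resid(linear_fit)^2)),
  spline_rmse= sqrt(mean((y_sim - predict(spline_fit, x_sim)$y)^2)))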

Question 9

In this exercise, we will predict the number of applications received using the other variables in the College data set.

Part a)

Split the data set into a training set and a test set.

# Read in the data from ISLR2 package
d1 = College
glimpse(d1)
## Rows: 777
## Columns: 18
## $ Private     <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes…
## $ Apps        <dbl> 1660, 2186, 1428, 417, 193, 587, 353, 1899, 1038, 582, 173…
## $ Accept      <dbl> 1232, 1924, 1097, 349, 146, 479, 340, 1720, 839, 498, 1425…
## $ Enroll      <dbl> 721, 512, 336, 137, 55, 158, 103, 489, 227, 172, 472, 484,…
## $ Top10perc   <dbl> 23, 16, 22, 60, 16, 38, 17, 37, 30, 21, 37, 44, 38, 44, 23…
## $ Top25perc   <dbl> 52, 29, 50, 89, 44, 62, 45, 68, 63, 44, 75, 77, 64, 73, 46…
## $ F.Undergrad <dbl> 2885, 2683, 1036, 510, 249, 678, 416, 1594, 973, 799, 1830…
## $ P.Undergrad <dbl> 537, 1227, 99, 63, 869, 41, 230, 32, 306, 78, 110, 44, 638…
## $ Outstate    <dbl> 7440, 12280, 11250, 12960, 7560, 13500, 13290, 13868, 1559…
## $ Room.Board  <dbl> 3300, 6450, 3750, 5450, 4120, 3335, 5720, 4826, 4400, 3380…
## $ Books       <dbl> 450, 750, 400, 450, 800, 500, 500, 450, 300, 660, 500, 400…
## $ Personal    <dbl> 2200, 1500, 1165, 875, 1500, 675, 1500, 850, 500, 1800, 60…
## $ PhD         <dbl> 70, 29, 53, 92, 76, 67, 90, 89, 79, 40, 82, 73, 60, 79, 36…
## $ Terminal    <dbl> 78, 30, 66, 97, 72, 73, 93, 100, 84, 41, 88, 91, 84, 87, 6…
## $ S.F.Ratio   <dbl> 18.1, 12.2, 12.9, 7.7, 11.9, 9.4, 11.5, 13.7, 11.3, 11.5, …
## $ perc.alumni <dbl> 12, 16, 30, 37, 2, 11, 26, 37, 23, 15, 31, 41, 21, 32, 26,…
## $ Expend      <dbl> 7041, 10527, 8735, 19016, 10922, 9727, 8861, 11487, 11644,…
## $ Grad.Rate   <dbl> 60, 56, 54, 59, 15, 55, 63, 73, 80, 52, 73, 76, 74, 68, 55…
set.seed(42)

# Split data into train and test
train_index = createDataPartition(d1$Apps, p= 0.8, list= FALSE)

train = d1[train_index, ]
test = d1[-train_index, ]

Part b)

Fit a linear model using least squares on the training set, and report the test error obtained.


The RMSE obtained was 929, meaning that on average the model's predictions are off by 929 applications from the actual values in the test set. Given that the mean number of applications in the test set is 3,005, the RMSE represents an error of approximately 30.9% of the mean. The range of applications in the test set spans from 141 to 21,804, so the model's errors are relatively small compared to the spread of the data.

# Linear Regression Model
lm_model = lm(Apps ~ ., data= train)

summary(lm_model)
## 
## Call:
## lm(formula = Apps ~ ., data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3881.0  -437.1   -19.5   316.7  6967.8 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -561.34089  461.02508  -1.218  0.22385    
## PrivateYes  -409.16975  158.55701  -2.581  0.01010 *  
## Accept         1.62834    0.04476  36.382  < 2e-16 ***
## Enroll        -0.85903    0.20563  -4.177 3.38e-05 ***
## Top10perc     51.31276    6.42525   7.986 7.03e-15 ***
## Top25perc    -14.67648    5.16650  -2.841  0.00465 ** 
## F.Undergrad    0.05127    0.03625   1.415  0.15769    
## P.Undergrad    0.04916    0.03535   1.391  0.16483    
## Outstate      -0.09202    0.02262  -4.067 5.38e-05 ***
## Room.Board     0.13611    0.05628   2.418  0.01588 *  
## Books          0.10960    0.27217   0.403  0.68732    
## Personal       0.01803    0.07008   0.257  0.79711    
## PhD           -8.96660    5.28199  -1.698  0.09010 .  
## Terminal      -4.40975    5.78320  -0.763  0.44605    
## S.F.Ratio     21.41479   14.51439   1.475  0.14062    
## perc.alumni    3.87434    4.69013   0.826  0.40909    
## Expend         0.08916    0.01484   6.009 3.22e-09 ***
## Grad.Rate      7.23133    3.32272   2.176  0.02992 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1075 on 606 degrees of freedom
## Multiple R-squared:  0.9268, Adjusted R-squared:  0.9247 
## F-statistic: 451.1 on 17 and 606 DF,  p-value: < 2.2e-16
# Make predictions on the test set
predictions = predict(lm_model, test)

# Calculate the Root Mean Squared Error
mse = mean((test$Apps - predictions)^2)
sqrt(mse)
## [1] 928.8176
# Mean and range of test dataset
mean(test$Apps)
## [1] 3004.856
range(test$Apps)
## [1]   141 21804

Part c)

Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.


The RMSE obtained was 880, meaning that on average the model's predictions are off by 880 applications from the actual values in the test set. Given that the mean number of applications in the test set is 3,005, the RMSE represents an error of approximately 29.3% of the mean. The range of applications in the test set spans from 141 to 21,804, so the model's errors are relatively small compared to the spread of the data.

# Convert x to a matrix of predictor variables
x_train = model.matrix(Apps ~ ., data= train)[,-1]
x_test =  model.matrix(Apps ~ ., data= test)[,-1]

# Convert y to a numeric vector
y_train = train$Apps
y_test = test$Apps
set.seed(1)

# Fit ridge with cross-validation
lam = cv.glmnet(x_train, y_train, alpha= 0)

# Ridge Regression model with best lambda
ridge_model = glmnet(x_train, y_train, alpha= 0, lambda= lam$lambda.min)
ridge_model
## 
## Call:  glmnet(x = x_train, y = y_train, alpha = 0, lambda = lam$lambda.min) 
## 
##   Df  %Dev Lambda
## 1 17 90.66  367.6
# Make predictions on the test set
predictions = predict(ridge_model, x_test)

# Calculate the Root Mean Squared Error
sqrt(mean((y_test - predictions)^2))
## [1] 879.621

Part d)

Fit a lasso model on the training set, with λ chosen by cross validation. Report the test error obtained, along with the number of non-zero coefficient estimates.


The RMSE obtained was 922, meaning that on average the model's predictions are off by 922 applications from the actual values in the test set. Given that the mean number of applications in the test set is 3,005, the RMSE represents an error of approximately 30.7% of the mean. The lasso retained 14 non-zero coefficients out of the 17 predictors (Df = 14 in the output below). The range of applications in the test set spans from 141 to 21,804, so the model's errors are relatively small compared to the spread of the data.

set.seed(1)

# Fit lasso with cross-validation
lam = cv.glmnet(x_train, y_train, alpha= 1)

# Lasso model with best lambda
lasso_model = glmnet(x_train, y_train, alpha= 1, lambda= lam$lambda.min)
lasso_model
## 
## Call:  glmnet(x = x_train, y = y_train, alpha = 1, lambda = lam$lambda.min) 
## 
##   Df  %Dev Lambda
## 1 14 92.55  13.84
# Make predictions on the test set
predictions = predict(lasso_model, x_test)

# Calculate the Root Mean Squared Error
sqrt(mean((y_test - predictions)^2))
## [1] 922.3743

Part e)

Fit a PCR model on the training set, with M chosen by cross validation. Report the test error obtained, along with the value of M selected by cross-validation.


The RMSE obtained was 1161 with M = 6, meaning that on average the model's predictions are off by 1,161 applications from the actual values in the test set. Given that the mean number of applications in the test set is 3,005, the RMSE represents an error of approximately 38.6% of the mean. M = 6 sits at the elbow of the validation plot; the CV error keeps falling slowly all the way out to 17 components, but at M = 17 PCR is equivalent to least squares. The range of applications in the test set spans from 141 to 21,804, so the model's errors are still modest compared to the spread of the data.

set.seed(1)

# PCR model
pcr_model = pcr(Apps ~ ., data= train, scale= TRUE, validation= 'CV')
summary(pcr_model)
## Data:    X dimension: 624 17 
##  Y dimension: 624 1
## Fit method: svdpc
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3922     3896     2109     2112     1992     1715     1692
## adjCV         3922     3895     2106     2110     1946     1704     1686
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1682     1651     1631      1633      1639      1635      1644
## adjCV     1682     1642     1625      1627      1633      1629      1638
##        14 comps  15 comps  16 comps  17 comps
## CV         1644      1494      1246      1206
## adjCV      1638      1471      1236      1197
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X      32.199    57.51    64.56    70.11    75.57    80.57    84.18    87.63
## Apps    2.383    72.02    72.04    79.00    82.65    83.03    83.22    84.07
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       90.55     92.94     95.07     96.79     97.85     98.69     99.36
## Apps    84.56     84.82     84.83     84.95     84.97     84.99     90.89
##       16 comps  17 comps
## X        99.82    100.00
## Apps     92.31     92.68
# Plot cross-validation scores
validationplot(pcr_model, val.type = "MSEP")

# Refit PCR using the chosen M = 6
pcr_model = pcr(Apps ~ ., data= train, scale = TRUE, ncomp= 6)
summary(pcr_model)
## Data:    X dimension: 624 17 
##  Y dimension: 624 1
## Fit method: svdpc
## Number of components considered: 6
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## X      32.199    57.51    64.56    70.11    75.57    80.57
## Apps    2.383    72.02    72.04    79.00    82.65    83.03
# Make predictions on the test set
predictions = predict(pcr_model, test, ncomp= 6)

# Calculate the Root Mean Squared Error
sqrt(mean((test$Apps - predictions)^2))
## [1] 1161.317
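
As a cross-check on the choice of M, the pls package provides selectNcomp(), which picks M automatically; the sketch below refits with CV (since pcr_model above was overwritten with the ncomp = 6 fit) and applies the one-sigma rule, i.e. the smallest M whose CV error is within one standard error of the minimum. The M it picks may differ from the elbow read off the plot.

set.seed(1)

# Refit with CV and let the one-sigma rule pick the smallest adequate M
pcr_cv = pcr(Apps ~ ., data= train, scale= TRUE, validation= 'CV')
selectNcomp(pcr_cv, method= "onesigma", plot= FALSE)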

Part f)

Fit a PLS model on the training set, with M chosen by cross validation. Report the test error obtained, along with the value of M selected by cross-validation.


The RMSE obtained was 924 with M = 8, meaning that on average the model's predictions are off by 924 applications from the actual values in the test set. Given that the mean number of applications in the test set is 3,005, the RMSE represents an error of approximately 30.8% of the mean. M = 8 was chosen because the CV error is essentially flat beyond eight components. The range of applications in the test set spans from 141 to 21,804, so the model's errors are relatively small compared to the spread of the data.

set.seed(1)

# PLS Model
pls_model = plsr(Apps ~ ., data= train, scale= TRUE, validation= 'CV')
summary(pls_model)
## Data:    X dimension: 624 17 
##  Y dimension: 624 1
## Fit method: kernelpls
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3922     1941     1673     1543     1462     1283     1234
## adjCV         3922     1936     1672     1535     1441     1266     1224
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1218     1214     1211      1209      1209      1207      1207
## adjCV     1209     1205     1202      1201      1200      1199      1198
##        14 comps  15 comps  16 comps  17 comps
## CV         1206      1206      1206      1206
## adjCV      1198      1197      1197      1197
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       25.52    42.51    62.72    64.87    67.51    73.11    76.72    80.11
## Apps    76.94    83.71    87.00    90.75    92.28    92.44    92.52    92.57
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       82.40     85.13     88.99     90.91     92.68     95.49     97.00
## Apps    92.63     92.65     92.66     92.67     92.68     92.68     92.68
##       16 comps  17 comps
## X        98.87    100.00
## Apps     92.68     92.68
# Plot cross-validation scores
validationplot(pls_model, val.type = "MSEP")

# Make predictions on the test set
predictions = predict(pls_model, test, ncomp= 8)

# Calculate the Root Mean Squared Error
sqrt(mean((test$Apps - predictions)^2))
## [1] 924.1843

Part g)

Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?


The test RMSEs of the five fitted models are as follows:

  • Linear Regression: 929
  • Ridge Regression: 880
  • Lasso: 922
  • PCR: 1161
  • PLS: 924

The ridge regression model achieved the lowest RMSE (880), suggesting it had the best predictive accuracy for estimating the number of college applications received. Linear regression, the lasso, and partial least squares (PLS) produced similar RMSE values, ranging from 922 to 929, indicating comparable predictive performance. In contrast, principal components regression (PCR) performed noticeably worse, with a test RMSE of 1,161.
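
For reference, a small sketch that tabulates the five test RMSEs side by side, reusing the objects fitted above (the rmse() helper is ad hoc, not from the text):

# Ad hoc helper to compute test RMSE
rmse = function(actual, pred) sqrt(mean((actual - pred)^2))

data.frame(
  model = c("Linear", "Ridge", "Lasso", "PCR", "PLS"),
  test_rmse = c(rmse(test$Apps, predict(lm_model, test)),
                rmse(y_test, predict(ridge_model, x_test)),
                rmse(y_test, predict(lasso_model, x_test)),
                rmse(test$Apps, predict(pcr_model, test, ncomp= 6)),
                rmse(test$Apps, predict(pls_model, test, ncomp= 8))))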

Question 11

We will now try to predict the per capita crime rate in the Boston data set.

Part a)

Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.

# Read in the data from ISLR2 package
boston = Boston
head(boston)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio lstat medv
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3  4.98 24.0
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8  9.14 21.6
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8  4.03 34.7
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7  2.94 33.4
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7  5.33 36.2
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7  5.21 28.7
set.seed(42)

# Split data into train and test
train_index = createDataPartition(boston$crim, p= 0.7, list= FALSE)

train = boston[train_index, ]
test = boston[-train_index, ]

Best Subset Selection (regsubsets)

The RMSE obtained was 6.49, meaning that on average the model's predictions are off by 6.49 units from the actual values in the test set. Given that the mean per capita crime rate in the test set is 3.83, the RMSE is approximately 1.7 times the mean. The crime rate in the test set ranges from 0.006 to 67.92, so an error of this size is small relative to the high-crime towns but large relative to the low-crime towns; in relative terms, the model likely performs better in higher-crime areas and less accurately in low-crime towns.

# Regsubsets Model
reg_model = regsubsets(crim ~ ., data= train, nvmax= 12)
reg_summary = summary(reg_model)
reg_summary
## Subset selection object
## Call: regsubsets.formula(crim ~ ., data = train, nvmax = 12)
## 12 Variables  (and intercept)
##         Forced in Forced out
## zn          FALSE      FALSE
## indus       FALSE      FALSE
## chas        FALSE      FALSE
## nox         FALSE      FALSE
## rm          FALSE      FALSE
## age         FALSE      FALSE
## dis         FALSE      FALSE
## rad         FALSE      FALSE
## tax         FALSE      FALSE
## ptratio     FALSE      FALSE
## lstat       FALSE      FALSE
## medv        FALSE      FALSE
## 1 subsets of each size up to 12
## Selection Algorithm: exhaustive
##           zn  indus chas nox rm  age dis rad tax ptratio lstat medv
## 1  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     " "   " " 
## 2  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     "*"   " " 
## 3  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     "*"   "*" 
## 4  ( 1 )  "*" " "   " "  " " " " " " "*" "*" " " " "     " "   "*" 
## 5  ( 1 )  "*" "*"   " "  " " " " " " "*" "*" " " " "     " "   "*" 
## 6  ( 1 )  "*" " "   " "  "*" " " " " "*" "*" " " "*"     " "   "*" 
## 7  ( 1 )  "*" " "   " "  "*" " " " " "*" "*" " " "*"     "*"   "*" 
## 8  ( 1 )  "*" " "   " "  "*" "*" " " "*" "*" " " "*"     "*"   "*" 
## 9  ( 1 )  "*" " "   "*"  "*" "*" " " "*" "*" " " "*"     "*"   "*" 
## 10  ( 1 ) "*" " "   "*"  "*" "*" " " "*" "*" "*" "*"     "*"   "*" 
## 11  ( 1 ) "*" " "   "*"  "*" "*" "*" "*" "*" "*" "*"     "*"   "*" 
## 12  ( 1 ) "*" "*"   "*"  "*" "*" "*" "*" "*" "*" "*"     "*"   "*"
# R^2 Values
reg_summary$adjr2
##  [1] 0.3834393 0.4034368 0.4048850 0.4099491 0.4119745 0.4140309 0.4147092
##  [8] 0.4156044 0.4152502 0.4149416 0.4137607 0.4122893
which.max(reg_summary$adjr2)
## [1] 8
# Adjusted R^2 Plot
plot(reg_summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted RSq", type = "l")

# Point for the highest value
points(8, reg_summary$adjr2[8], col = "red", cex = 2, pch = 20)

plot(reg_model, scale = "r2")

plot(reg_model, scale = "adjr2")

plot(reg_model, scale = "Cp")

plot(reg_model, scale = "bic")

# Coefficients of the best eight-variable model
coef(reg_model, 8)
##  (Intercept)           zn          nox           rm          dis          rad 
##  10.64234133   0.03865907 -11.15120476   0.84593198  -0.78348686   0.55150069 
##      ptratio        lstat         medv 
##  -0.32765249   0.12170597  -0.19666350
# Linear Regression Model
lm_model = lm(crim ~ zn + nox + rm + dis + rad + ptratio + lstat + medv, data= train)

summary(lm_model)
## 
## Call:
## lm(formula = crim ~ zn + nox + rm + dis + rad + ptratio + lstat + 
##     medv, data = train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.210 -1.909 -0.262  0.891 74.365 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  10.64234    8.18336   1.300  0.19430    
## zn            0.03866    0.02160   1.789  0.07442 .  
## nox         -11.15120    5.69954  -1.957  0.05121 .  
## rm            0.84593    0.68322   1.238  0.21649    
## dis          -0.78349    0.31377  -2.497  0.01299 *  
## rad           0.55150    0.05721   9.640  < 2e-16 ***
## ptratio      -0.32765    0.21984  -1.490  0.13703    
## lstat         0.12171    0.08466   1.438  0.15144    
## medv         -0.19666    0.06737  -2.919  0.00374 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.446 on 347 degrees of freedom
## Multiple R-squared:  0.4288, Adjusted R-squared:  0.4156 
## F-statistic: 32.56 on 8 and 347 DF,  p-value: < 2.2e-16
# Make predictions on the test set
predictions = predict(lm_model, test)

# Calculate the Root Mean Squared Error
mse = mean((test$crim - predictions)^2)
sqrt(mse)
## [1] 6.493153
# Mean and range of test dataset
mean(test$crim)
## [1] 3.827136
range(test$crim)
## [1]  0.00632 67.92080
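
Note that the size-8 subset above was chosen by adjusted \(R^2\), which is computed on the training data. A validation-based cross-check is sketched below using the usual ISLR-style helper (regsubsets has no predict() method; predict_regsubsets is an ad hoc name):

# Ad hoc predict helper for regsubsets: build the model matrix for newdata
# and multiply by the coefficients of the size-id model
predict_regsubsets = function(object, newdata, id) {
  form = as.formula(object$call[[2]])
  mat = model.matrix(form, newdata)
  coefi = coef(object, id= id)
  drop(mat[, names(coefi)] %*% coefi)
}

# Validation-set RMSE at each subset size; compare with the adjusted-R^2 pick
val_rmse = sapply(1:12, function(i)
  sqrt(mean((test$crim - predict_regsubsets(reg_model, test, i))^2)))
which.min(val_rmse)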

Ridge Regression

The RMSE obtained was 6.55, meaning that on average the model's predictions are off by 6.55 units from the actual values in the test set. Given that the mean per capita crime rate in the test set is 3.83, the RMSE is approximately 1.71 times the mean. The crime rate in the test set ranges from 0.006 to 67.92, so in relative terms the model likely performs better in higher-crime areas and less accurately in low-crime towns.

# Convert x to a matrix of predictor variables
x_train = model.matrix(crim ~ ., data= train)[,-1]
x_test =  model.matrix(crim ~ ., data= test)[,-1]

# Convert y to a numeric vector
y_train = train$crim
y_test = test$crim
set.seed(1)

# Fit ridge with cross-validation
lam = cv.glmnet(x_train, y_train, alpha= 0)

# Ridge Regression model with best lambda
ridge_model = glmnet(x_train, y_train, alpha= 0, lambda= lam$lambda.min)
ridge_model
## 
## Call:  glmnet(x = x_train, y = y_train, alpha = 0, lambda = lam$lambda.min) 
## 
##   Df  %Dev Lambda
## 1 12 42.63 0.5226
# Make predictions on the test set
predictions = predict(ridge_model, x_test)

# Calculate the Root Mean Squared Error
sqrt(mean((y_test - predictions)^2))
## [1] 6.553395

Lasso

The RMSE obtained was 6.51, meaning that on average the model's predictions are off by 6.51 units from the actual values in the test set. Given that the mean per capita crime rate in the test set is 3.83, the RMSE is approximately 1.7 times the mean. The crime rate in the test set ranges from 0.006 to 67.92, so in relative terms the model likely performs better in higher-crime areas and less accurately in low-crime towns.

set.seed(1)

# Fit lasso with cross-validation
lam = cv.glmnet(x_train, y_train, alpha= 1)

# Lasso model with best lambda
lasso_model = glmnet(x_train, y_train, alpha= 1, lambda= lam$lambda.min)
lasso_model
## 
## Call:  glmnet(x = x_train, y = y_train, alpha = 1, lambda = lam$lambda.min) 
## 
##   Df  %Dev  Lambda
## 1 12 43.13 0.03133
# Make predictions on the test set
predictions = predict(lasso_model, x_test)

# Calculate the Root Mean Squared Error
sqrt(mean((y_test - predictions)^2))
## [1] 6.512083

PCR

The RMSE obtained was 6.98 with M = 3, meaning that on average the model's predictions are off by 6.98 units from the actual values in the test set. Given that the mean per capita crime rate in the test set is 3.83, the RMSE is approximately 1.82 times the mean. M = 3 sits at the elbow of the validation plot, although the CV error continues to decline slightly out to 12 components. The crime rate in the test set ranges from 0.006 to 67.92, so in relative terms the model likely performs better in higher-crime areas and less accurately in low-crime towns.

set.seed(1)

# PCR model
pcr_model = pcr(crim ~ ., data= train, scale= TRUE, validation= 'CV')
summary(pcr_model)
## Data:    X dimension: 356 12 
##  Y dimension: 356 1
## Fit method: svdpc
## Number of components considered: 12
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           8.444    7.220    7.230    6.800    6.807    6.765    6.753
## adjCV        8.444    7.218    7.227    6.792    6.806    6.761    6.748
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps
## CV       6.632    6.665    6.653     6.667     6.641     6.548
## adjCV    6.625    6.658    6.645     6.659     6.631     6.537
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       49.52    63.44    72.52    79.80    86.34    89.86    92.69    94.83
## crim    27.33    27.36    36.31    36.32    37.18    37.63    40.07    40.21
##       9 comps  10 comps  11 comps  12 comps
## X       96.71     98.27     99.42    100.00
## crim    40.49     40.49     41.60     43.22
# Plot cross-validation scores
validationplot(pcr_model, val.type = "MSEP")

# Refit PCR using the chosen M = 3
pcr_model = pcr(crim ~ ., data= train, scale = TRUE, ncomp= 3)
summary(pcr_model)
## Data:    X dimension: 356 12 
##  Y dimension: 356 1
## Fit method: svdpc
## Number of components considered: 3
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps
## X       49.52    63.44    72.52
## crim    27.33    27.36    36.31
# Make predictions on the test set
predictions = predict(pcr_model, test, ncomp= 3)

# Calculate the Root Mean Squared Error
sqrt(mean((test$crim - predictions)^2))
## [1] 6.976105

PLS Regression

The RMSE obtained was 6.60 with M = 3, meaning that on average the model's predictions are off by 6.60 units from the actual values in the test set. Given that the mean per capita crime rate in the test set is 3.83, the RMSE is approximately 1.72 times the mean. M = 3 was chosen from the elbow of the validation plot, with the CV error essentially flat beyond that point. The crime rate in the test set ranges from 0.006 to 67.92, so in relative terms the model likely performs better in higher-crime areas and less accurately in low-crime towns.

set.seed(1)

# PLS Model
pls_model = plsr(crim ~ ., data= train, scale= TRUE, validation= 'CV')
summary(pls_model)
## Data:    X dimension: 356 12 
##  Y dimension: 356 1
## Fit method: kernelpls
## Number of components considered: 12
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           8.444    7.050    6.665    6.578    6.606    6.580    6.571
## adjCV        8.444    7.048    6.659    6.569    6.596    6.569    6.559
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps
## CV       6.551    6.546    6.546     6.547     6.548     6.548
## adjCV    6.540    6.535    6.535     6.537     6.537     6.537
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       49.04    58.22    62.61    73.39    79.60    82.91    85.46    89.73
## crim    31.09    39.69    41.87    42.31    42.74    43.10    43.18    43.21
##       9 comps  10 comps  11 comps  12 comps
## X       94.34     96.37     98.42    100.00
## crim    43.21     43.22     43.22     43.22
# Plot cross-validation scores
validationplot(pls_model, val.type = "MSEP")

# Make predictions on the test set
predictions = predict(pls_model, test, ncomp= 3)

# Calculate the Root Mean Squared Error
sqrt(mean((test$crim - predictions)^2))
## [1] 6.600582

Part b)

Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross validation, or some other reasonable alternative, as opposed to using training error.


The test RMSEs of the five fitted models are as follows:

  • Linear Regression (Using best subset selection): 6.49
  • Ridge Regression: 6.55
  • Lasso: 6.51
  • PCR: 6.98
  • PLS Regression: 6.6

The five models produced test RMSE values that are fairly close, indicating similar predictive performance when estimating per capita crime rate by town. The lowest RMSE, 6.49, was achieved by the linear regression fit on the eight predictors chosen by best subset selection. That subset was selected by maximizing adjusted \(R^2\) (0.4156), meaning the model explains only about 41.6% of the variance in crime rate. For all models, the data were split 70/30 into training and test sets, and performance was evaluated on the held-out test set.

Given that the crime rate in the test set ranges from 0.006 to 67.92, the relatively high RMSE suggests that the model may perform more reliably, in relative terms, in towns with higher crime rates, while exhibiting reduced accuracy in low-crime areas. One way to probe that claim is shown in the sketch below.
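
A sketch of that check, splitting the test towns at the median crime rate and using the best-subset linear model from above (absolute RMSE will naturally be larger in the high-crime half, so the error relative to each group's mean is the fairer comparison):

# Split the test set at the median crime rate and compare errors per half
best_preds = predict(lm_model, test)
low = test$crim <= median(test$crim)
group_rmse = c(low = sqrt(mean((test$crim[low] - best_preds[low])^2)),
               high = sqrt(mean((test$crim[!low] - best_preds[!low])^2)))

# RMSE as a multiple of each group's mean crime rate (relative error)
group_rmse / c(mean(test$crim[low]), mean(test$crim[!low]))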

Part c)

Does your chosen model involve all of the features in the data set? Why or why not?


The chosen model includes only eight of the twelve predictors: zn, nox, rm, dis, rad, ptratio, lstat, and medv. Best subset selection was optimized for the highest adjusted \(R^2\), which penalizes predictors that do not improve the fit enough to justify their inclusion; the remaining four variables (indus, chas, age, and tax) added too little explanatory power, so they were dropped.