Problem 3

library(ISLR)
data("Auto")
# Auto
?Auto

(a) Which of the predictors are quantitative, and which are qualitative?

Quantitative: mpg, cylinders, displacement, horsepower, weight, acceleration, year

Qualitative: origin, name

(b) What is the range of each quantitative predictor? You can answer this using the range() function.

range(Auto$mpg)

## [1]  9.0 46.6

range(Auto$cylinders)

## [1] 3 8

range(Auto$displacement)

## [1]  68 455

range(Auto$horsepower)

## [1]  46 230

range(Auto$weight)

## [1] 1613 5140

range(Auto$acceleration)

## [1]  8.0 24.8

range(Auto$year)

## [1] 70 82

(c) What is the mean and standard deviation of each quantitative predictor?

mean(Auto$mpg)

## [1] 23.44592

mean(Auto$cylinders)

## [1] 5.471939

mean(Auto$displacement)

## [1] 194.412

mean(Auto$horsepower)

## [1] 104.4694

mean(Auto$weight)

## [1] 2977.584

mean(Auto$acceleration)

## [1] 15.54133

mean(Auto$year)

## [1] 75.97959

sd(Auto$mpg)

## [1] 7.805007

sd(Auto$cylinders)

## [1] 1.705783

sd(Auto$displacement)

## [1] 104.644

sd(Auto$horsepower)

## [1] 38.49116

sd(Auto$weight)

## [1] 849.4026

sd(Auto$acceleration)

## [1] 2.758864

sd(Auto$year)

## [1] 3.683737

(d) Now remove the 20th through 80th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?

Auto_removed <- Auto[-(20:80),]

range(Auto_removed$mpg)

## [1] 11.0 46.6

range(Auto_removed$cylinders)

## [1] 3 8

range(Auto_removed$displacement)

## [1]  68 455

range(Auto_removed$horsepower)

## [1]  46 230

range(Auto_removed$weight)

## [1] 1649 4997

range(Auto_removed$acceleration)

## [1]  8.0 24.8

range(Auto_removed$year)

## [1] 70 82

mean(Auto_removed$mpg)

## [1] 24.21692

mean(Auto_removed$cylinders)

## [1] 5.401813

mean(Auto_removed$displacement)

## [1] 189.4411

mean(Auto_removed$horsepower)

## [1] 101.858

mean(Auto_removed$weight)

## [1] 2935.015

mean(Auto_removed$acceleration)

## [1] 15.6139

mean(Auto_removed$year)

## [1] 76.85498

sd(Auto_removed$mpg)

## [1] 7.823191

sd(Auto_removed$cylinders)

## [1] 1.665659

sd(Auto_removed$displacement)

## [1] 101.7607

sd(Auto_removed$horsepower)

## [1] 36.60106

sd(Auto_removed$weight)

## [1] 805.4751

sd(Auto_removed$acceleration)

## [1] 2.758818

sd(Auto_removed$year)

## [1] 3.323488

(e) Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.

par(mfrow = c(2,3))
plot(Auto$weight, Auto$horsepower)
plot(Auto$displacement, Auto$weight)
plot(Auto$displacement, Auto$horsepower)

plot(Auto$horsepower, Auto$mpg)
plot(Auto$weight, Auto$mpg)
plot(Auto$displacement, Auto$mpg)

plot(Auto$year, Auto$mpg)

There are clear positive correlations between weight, displacement, and horsepower, indicating that larger engines tend to be more powerful and are found in heavier cars. Additionally, mpg is negatively associated with weight, horsepower, and displacement, suggesting that cars with larger engines and higher weight tend to have lower fuel efficiency. There also seems to be a positive relationship between year and mpg, suggesting that newer cars tend to have higher fuel efficiency.

(f) Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting mpg? Justify your answer.

Based on the plots in part (e), weight, horsepower, and displacement show clear negative relationships with mpg, while year exhibits a positive relationship with mpg. These may be useful in predicting mpg in a regression model.

Problem 4

library(ISLR)
data(College)
# data basic information
head(College)

##                              Private Apps Accept Enroll Top10perc Top25perc
## Abilene Christian University     Yes 1660   1232    721        23        52
## Adelphi University               Yes 2186   1924    512        16        29
## Adrian College                   Yes 1428   1097    336        22        50
## Agnes Scott College              Yes  417    349    137        60        89
## Alaska Pacific University        Yes  193    146     55        16        44
## Albertson College                Yes  587    479    158        38        62
##                              F.Undergrad P.Undergrad Outstate Room.Board Books
## Abilene Christian University        2885         537     7440       3300   450
## Adelphi University                  2683        1227    12280       6450   750
## Adrian College                      1036          99    11250       3750   400
## Agnes Scott College                  510          63    12960       5450   450
## Alaska Pacific University            249         869     7560       4120   800
## Albertson College                    678          41    13500       3335   500
##                              Personal PhD Terminal S.F.Ratio perc.alumni Expend
## Abilene Christian University     2200  70       78      18.1          12   7041
## Adelphi University               1500  29       30      12.2          16  10527
## Adrian College                   1165  53       66      12.9          30   8735
## Agnes Scott College               875  92       97       7.7          37  19016
## Alaska Pacific University        1500  76       72      11.9           2  10922
## Albertson College                 675  67       73       9.4          11   9727
##                              Grad.Rate
## Abilene Christian University        60
## Adelphi University                  56
## Adrian College                      54
## Agnes Scott College                 59
## Alaska Pacific University           15
## Albertson College                   55

dim(College)

## [1] 777  18

# The column Apps is the response variable, and others may be treated as predictors.
# For instance, for linear regression model in R, you may use lm(Apps~.,data=College)

(a) Appropriately split the data set into a training set (80%) and a test set (20%).

library(caret)

## Loading required package: ggplot2

## Loading required package: lattice

hist(College$Apps) #right skew, use createDataPartition

set.seed(123)

trainIndex <- createDataPartition(College$Apps, p = 0.8, list = FALSE)

train <- College[trainIndex, ]
test  <- College[-trainIndex, ]

(b) Fit a support vector machine (SVM) model with the radial basis function function model using least squares on the training set. Clearly report the test error obtained.

set.seed(123)

svmTune <- train(Apps ~ ., data = train,
                 method = "svmRadial",
                 preProcess = c("center", "scale"),
                 tuneLength = 14)
svmTune

## Support Vector Machines with Radial Basis Function Kernel 
## 
## 624 samples
##  17 predictor
## 
## Pre-processing: centered (17), scaled (17) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 624, 624, 624, 624, 624, 624, ... 
## Resampling results across tuning parameters:
## 
##   C        RMSE      Rsquared   MAE     
##      0.25  2696.222  0.6647106  996.7393
##      0.50  2484.534  0.7015293  909.7269
##      1.00  2340.618  0.7266354  862.2084
##      2.00  2261.375  0.7399688  847.6386
##      4.00  2239.103  0.7434481  853.8454
##      8.00  2239.729  0.7422595  858.9893
##     16.00  2243.750  0.7413120  861.2244
##     32.00  2243.750  0.7413120  861.2244
##     64.00  2243.750  0.7413120  861.2244
##    128.00  2243.750  0.7413120  861.2244
##    256.00  2243.750  0.7413120  861.2244
##    512.00  2243.750  0.7413120  861.2244
##   1024.00  2243.750  0.7413120  861.2244
##   2048.00  2243.750  0.7413120  861.2244
## 
## Tuning parameter 'sigma' was held constant at a value of 0.06196904
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.06196904 and C = 4.

library(Metrics)

## 
## Attaching package: 'Metrics'

## The following objects are masked from 'package:caret':
## 
##     precision, recall

test$SVM <- predict(svmTune, newdata = test)
rmse(test$SVM, test$Apps)

## [1] 851.5477

The test RMSE is 851.5477.

(c) Fit a neural network model on the training set by creating an appropriate grid for tuning parameters. Clearly report the test error obtained.

indx <- createFolds(train$Apps, returnTrain = TRUE)
ctrl <- trainControl(method = "cv", index = indx)

nnetGrid <- expand.grid(decay = c(0, 0.01, 0.1),
                        size = c(1, 3, 5, 7),
                        bag = FALSE)

set.seed(123)

nnetTune <- train(Apps ~ ., data = train,
                  method = "avNNet", 
                  tuneGrid = nnetGrid, 
                  trControl = ctrl, 
                  preProcess = c("center", "scale"), 
                  linout = TRUE, 
                  trace = FALSE, 
                  MaxNWts = 10000,
                  maxit = 1000,
                  allowParallel = FALSE)
nnetTune

## Model Averaged Neural Network 
## 
## 624 samples
##  17 predictor
## 
## Pre-processing: centered (17), scaled (17) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 561, 561, 562, 562, 563, 562, ... 
## Resampling results across tuning parameters:
## 
##   decay  size  RMSE      Rsquared   MAE      
##   0.00   1     2857.832  0.5721190  1647.7129
##   0.00   3     2096.280  0.7723060  1149.0270
##   0.00   5     1980.808  0.7808697  1093.2574
##   0.00   7     1780.631  0.8195406   974.2936
##   0.01   1     1843.555  0.8037330  1039.7315
##   0.01   3     1691.100  0.8410310   911.1555
##   0.01   5     1664.568  0.8432421   880.8077
##   0.01   7     1661.861  0.8425060   934.4921
##   0.10   1     1404.843  0.8970443   780.3200
##   0.10   3     1615.845  0.8541443   875.5145
##   0.10   5     1695.660  0.8327359   920.9192
##   0.10   7     1601.724  0.8550899   894.2645
## 
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 1, decay = 0.1 and bag = FALSE.

test$NNet <- predict(nnetTune, newdata = test)
rmse(test$NNet, test$Apps)

## [1] 1095.994

The test RMSE is 1095.994.

(d) Fit a Multivariate Adaptive Regression Splines (MARS) model on the training set with tuning parameters chosen by cross-validation. Clearly report the test error obtained, along with the importance of each predictor. estimates.

library(earth)

## Loading required package: Formula

## Loading required package: plotmo

## Loading required package: plotrix

marsTune <- train(Apps ~ ., data = train,
                  method = "earth",
                  tuneGrid = expand.grid(degree = 1, nprune = 2:38),
                  trControl = ctrl)
marsTune

## Multivariate Adaptive Regression Spline 
## 
## 624 samples
##  17 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 561, 561, 562, 562, 563, 562, ... 
## Resampling results across tuning parameters:
## 
##   nprune  RMSE      Rsquared   MAE     
##    2      1578.070  0.8557717  731.3260
##    3      1562.095  0.8728635  643.0014
##    4      1210.672  0.9289767  536.7915
##    5      1201.513  0.9285990  536.7608
##    6      1183.260  0.9291783  543.9305
##    7      1171.882  0.9301260  547.0502
##    8      1177.996  0.9284477  554.3384
##    9      1183.033  0.9271378  567.8108
##   10      1188.996  0.9258380  575.2853
##   11      1207.093  0.9231389  584.4312
##   12      1208.557  0.9229228  584.7849
##   13      1217.859  0.9220090  590.2080
##   14      1217.859  0.9220090  590.2080
##   15      1217.859  0.9220090  590.2080
##   16      1217.859  0.9220090  590.2080
##   17      1217.859  0.9220090  590.2080
##   18      1217.859  0.9220090  590.2080
##   19      1217.859  0.9220090  590.2080
##   20      1217.859  0.9220090  590.2080
##   21      1217.859  0.9220090  590.2080
##   22      1217.859  0.9220090  590.2080
##   23      1217.859  0.9220090  590.2080
##   24      1217.859  0.9220090  590.2080
##   25      1217.859  0.9220090  590.2080
##   26      1217.859  0.9220090  590.2080
##   27      1217.859  0.9220090  590.2080
##   28      1217.859  0.9220090  590.2080
##   29      1217.859  0.9220090  590.2080
##   30      1217.859  0.9220090  590.2080
##   31      1217.859  0.9220090  590.2080
##   32      1217.859  0.9220090  590.2080
##   33      1217.859  0.9220090  590.2080
##   34      1217.859  0.9220090  590.2080
##   35      1217.859  0.9220090  590.2080
##   36      1217.859  0.9220090  590.2080
##   37      1217.859  0.9220090  590.2080
##   38      1217.859  0.9220090  590.2080
## 
## Tuning parameter 'degree' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 7 and degree = 1.

test$MARS <- predict(marsTune, newdata = test)
rmse(test$MARS, test$Apps)

## [1] 895.8359

The test RMSE is 895.8359.

(e) Comment on the results obtained. Is there much difference among the test errors resulting from these three nonparametric approaches?

methods = c(rep('SVM', 153), rep('NNet', 153), rep('MARS', 153))
values = c(test$SVM, test$NNet, test$MARS)
aov = aov(values~ methods)
summary(aov)

##              Df    Sum Sq Mean Sq F value Pr(>F)
## methods       2 4.552e+06 2276169   0.338  0.713
## Residuals   456 3.072e+09 6736363

The SVM, neural network, and MARS models produced test RMSE values of 851.5477, 1095.994, and 895.8359, respectively. The SVM model achieved the lowest test error, while the neural network had the highest error. The differences in performance, however, are not statistically significant (p = 0.713 > 0.05).

Problem 5

library(AppliedPredictiveModeling)
data("ChemicalManufacturingProcess")

df <- ChemicalManufacturingProcess
yield <- df$Yield
processPredictors <- df[, -which(names(df) == "Yield")]

(a) A small percentage of cells in the predictor set contain missing values. Use an appropriate imputation function to fill in these missing values.

library(mice)

## 
## Attaching package: 'mice'

## The following object is masked from 'package:stats':
## 
##     filter

## The following objects are masked from 'package:base':
## 
##     cbind, rbind

imputedData <- mice(processPredictors, m = 1, method = 'pmm', seed = 123)

## 
##  iter imp variable
##   1   1  ManufacturingProcess01  ManufacturingProcess02  ManufacturingProcess03  ManufacturingProcess04  ManufacturingProcess05  ManufacturingProcess06  ManufacturingProcess07  ManufacturingProcess08  ManufacturingProcess10  ManufacturingProcess11  ManufacturingProcess12  ManufacturingProcess14  ManufacturingProcess22  ManufacturingProcess23  ManufacturingProcess24  ManufacturingProcess25  ManufacturingProcess26  ManufacturingProcess27  ManufacturingProcess28  ManufacturingProcess29  ManufacturingProcess30  ManufacturingProcess31  ManufacturingProcess33  ManufacturingProcess34  ManufacturingProcess35  ManufacturingProcess36  ManufacturingProcess40  ManufacturingProcess41
##   2   1  ManufacturingProcess01  ManufacturingProcess02  ManufacturingProcess03  ManufacturingProcess04  ManufacturingProcess05  ManufacturingProcess06  ManufacturingProcess07  ManufacturingProcess08  ManufacturingProcess10  ManufacturingProcess11  ManufacturingProcess12  ManufacturingProcess14  ManufacturingProcess22  ManufacturingProcess23  ManufacturingProcess24  ManufacturingProcess25  ManufacturingProcess26  ManufacturingProcess27  ManufacturingProcess28  ManufacturingProcess29  ManufacturingProcess30  ManufacturingProcess31  ManufacturingProcess33  ManufacturingProcess34  ManufacturingProcess35  ManufacturingProcess36  ManufacturingProcess40  ManufacturingProcess41
##   3   1  ManufacturingProcess01  ManufacturingProcess02  ManufacturingProcess03  ManufacturingProcess04  ManufacturingProcess05  ManufacturingProcess06  ManufacturingProcess07  ManufacturingProcess08  ManufacturingProcess10  ManufacturingProcess11  ManufacturingProcess12  ManufacturingProcess14  ManufacturingProcess22  ManufacturingProcess23  ManufacturingProcess24  ManufacturingProcess25  ManufacturingProcess26  ManufacturingProcess27  ManufacturingProcess28  ManufacturingProcess29  ManufacturingProcess30  ManufacturingProcess31  ManufacturingProcess33  ManufacturingProcess34  ManufacturingProcess35  ManufacturingProcess36  ManufacturingProcess40  ManufacturingProcess41
##   4   1  ManufacturingProcess01  ManufacturingProcess02  ManufacturingProcess03  ManufacturingProcess04  ManufacturingProcess05  ManufacturingProcess06  ManufacturingProcess07  ManufacturingProcess08  ManufacturingProcess10  ManufacturingProcess11  ManufacturingProcess12  ManufacturingProcess14  ManufacturingProcess22  ManufacturingProcess23  ManufacturingProcess24  ManufacturingProcess25  ManufacturingProcess26  ManufacturingProcess27  ManufacturingProcess28  ManufacturingProcess29  ManufacturingProcess30  ManufacturingProcess31  ManufacturingProcess33  ManufacturingProcess34  ManufacturingProcess35  ManufacturingProcess36  ManufacturingProcess40  ManufacturingProcess41
##   5   1  ManufacturingProcess01  ManufacturingProcess02  ManufacturingProcess03  ManufacturingProcess04  ManufacturingProcess05  ManufacturingProcess06  ManufacturingProcess07  ManufacturingProcess08  ManufacturingProcess10  ManufacturingProcess11  ManufacturingProcess12  ManufacturingProcess14  ManufacturingProcess22  ManufacturingProcess23  ManufacturingProcess24  ManufacturingProcess25  ManufacturingProcess26  ManufacturingProcess27  ManufacturingProcess28  ManufacturingProcess29  ManufacturingProcess30  ManufacturingProcess31  ManufacturingProcess33  ManufacturingProcess34  ManufacturingProcess35  ManufacturingProcess36  ManufacturingProcess40  ManufacturingProcess41

## Warning: Number of logged events: 135

predictorsImputed <- complete(imputedData, 1)

(b) Split the data into a training and a test set, pre-process the data, and build at least four different models from Chapter 6. For those models with tuning parameters (e.g., ENET), what are the optimal values of the tuning parameter(s)?

# Split data
set.seed(123)
trainIndex2 <- createDataPartition(yield, p = 0.8, list = FALSE)

x_train <- predictorsImputed[trainIndex2, ]
x_test  <- predictorsImputed[-trainIndex2, ]
y_train <- yield[trainIndex2]
y_test  <- yield[-trainIndex2]


# Pre-process
nzv <- nearZeroVar(x_train)
x_train <- x_train[, -nzv]
x_test  <- x_test[, -nzv]

preProc <- preProcess(x_train, method = c("center", "scale"))
x_train <- predict(preProc, x_train)
x_test  <- predict(preProc, x_test)


# PLS
indx2 <- createFolds(x_train, returnTrain = TRUE)
ctrl2 <- trainControl(method = "cv", index = indx2)

set.seed(100)
plsTune <- train(x = x_train, y = y_train,
                 method = "pls",
                 tuneGrid = expand.grid(ncomp = 1:50),
                 trControl = ctrl2)

## Warning: model fit failed for Fold04: ncomp=50 Error in pls::mvr(.outcome ~ ., data = dat, ncomp = ncomp, method = "oscorespls") : 
##   Invalid number of components, ncomp

## Warning: model fit failed for Fold07: ncomp=50 Error in pls::mvr(.outcome ~ ., data = dat, ncomp = ncomp, method = "oscorespls") : 
##   Invalid number of components, ncomp

## Warning: model fit failed for Fold08: ncomp=50 Error in pls::mvr(.outcome ~ ., data = dat, ncomp = ncomp, method = "oscorespls") : 
##   Invalid number of components, ncomp

## Warning: model fit failed for Fold09: ncomp=50 Error in pls::mvr(.outcome ~ ., data = dat, ncomp = ncomp, method = "oscorespls") : 
##   Invalid number of components, ncomp

## Warning: model fit failed for Fold10: ncomp=50 Error in pls::mvr(.outcome ~ ., data = dat, ncomp = ncomp, method = "oscorespls") : 
##   Invalid number of components, ncomp

## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.

testResults <- data.frame(obs = y_test,
                          PLS = predict(plsTune, newdata = x_test))


# PCR
set.seed(100)
pcrTune <- train(x = x_train, y = y_train,
                 method = "pcr",
                 tuneGrid = expand.grid(ncomp = 1:50),
                 trControl = ctrl2)

## Warning: model fit failed for Fold04: ncomp=50 Error in pls::mvr(.outcome ~ ., data = dat, ncomp = ncomp, method = "svdpc") : 
##   Invalid number of components, ncomp

## Warning: model fit failed for Fold07: ncomp=50 Error in pls::mvr(.outcome ~ ., data = dat, ncomp = ncomp, method = "svdpc") : 
##   Invalid number of components, ncomp

## Warning: model fit failed for Fold08: ncomp=50 Error in pls::mvr(.outcome ~ ., data = dat, ncomp = ncomp, method = "svdpc") : 
##   Invalid number of components, ncomp

## Warning: model fit failed for Fold09: ncomp=50 Error in pls::mvr(.outcome ~ ., data = dat, ncomp = ncomp, method = "svdpc") : 
##   Invalid number of components, ncomp

## Warning: model fit failed for Fold10: ncomp=50 Error in pls::mvr(.outcome ~ ., data = dat, ncomp = ncomp, method = "svdpc") : 
##   Invalid number of components, ncomp

## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.

testResults$PCR <- predict(pcrTune, newdata = x_test)


# Ridge Regression
ridgeGrid <- expand.grid(lambda = seq(0, .1, length = 10))

set.seed(100)
ridgeTune <- train(x = x_train, y = y_train,
                   method = "ridge",
                   tuneGrid = ridgeGrid,
                   trControl = ctrl2,
                   preProc = c("BoxCox"))
testResults$Ridge <- predict(ridgeTune, newdata = x_test)


# ENET
library(elasticnet)

## Loading required package: lars

## Loaded lars 1.3

enetGrid <- expand.grid(lambda = c(0.0001, .001, 0.01, 0.1),
                        fraction = seq(.05, 1, length = 20))

set.seed(100)
enetTune <- train(x = x_train, y = y_train,
                  method = "enet",
                  tuneGrid = enetGrid,
                  trControl = ctrl2,
                  preProc = c("BoxCox"))
testResults$ENET <- predict(enetTune, newdata = x_test)


# Best params
plsTune$bestTune

##   ncomp
## 1     1

pcrTune$bestTune

##   ncomp
## 8     8

ridgeTune$bestTune

##    lambda
## 10    0.1

enetTune$bestTune

##    fraction lambda
## 62      0.1    0.1

The optimal tuning parameters selected via cross-validation were:

PLS: ncomp = 1
PCR: ncomp = 8
Ridge: lambda = 0.1
Elastic Net: fraction = 0.1, lambda = 0.1

(c) Which model has the best predictive ability? Is any model significantly better or worse than the others? You need to conduct a hypothesis testing to justify your choice if necessary.

data.frame(rbind(PLS = postResample(pred = testResults$PLS, obs = y_test),
                 PCR = postResample(pred = testResults$PCR, obs = y_test),
                 Ridge = postResample(pred = testResults$Ridge, obs = y_test),
                 ENET = postResample(pred = testResults$ENET, obs = y_test)))

##           RMSE  Rsquared      MAE
## PLS   1.443262 0.3718903 1.188921
## PCR   1.378808 0.4247325 1.135053
## Ridge 1.367931 0.4974944 1.176098
## ENET  1.473129 0.6171580 1.213936

resamps <- resamples(list(
  PLS = plsTune,
  PCR = pcrTune,
  Ridge = ridgeTune,
  ENET = enetTune
))

diffs <- diff(resamps)
summary(diffs)

## 
## Call:
## summary.diff.resamples(object = diffs)
## 
## p-value adjustment: bonferroni 
## Upper diagonal: estimates of the difference
## Lower diagonal: p-value for H0: difference = 0
## 
## MAE 
##       PLS       PCR       Ridge     ENET    
## PLS              0.01867  -0.97713  -0.52495
## PCR   0.9780062           -0.99580  -0.54362
## Ridge 0.0009615 0.0006877            0.38553
## ENET  0.0004226 0.0009247 0.0245615         
## 
## RMSE 
##       PLS       PCR       Ridge     ENET    
## PLS              0.03063  -3.02581  -0.53132
## PCR   0.4076167           -3.05644  -0.56195
## Ridge 0.0003462 0.0003714            2.24829
## ENET  0.0008961 0.0014629 0.0003873         
## 
## Rsquared 
##       PLS      PCR       Ridge     ENET     
## PLS            -0.049987  0.191048  0.003871
## PCR   0.007344            0.241035  0.053858
## Ridge 0.000330 2.449e-05           -0.179140
## ENET  1.000000 0.398353  0.011166

Among the models evaluated, Ridge Regression achieved the lowest RMSE (1.368), Elastic Net achieved the highest R2 (0.617), and PCR achieved the lowest MAE (1.135). Based on the resampling comparisons, Ridge Regression shows a statistically significant superior performance in all three metrics compared to the other models. Although ENET achieved the highest R2, its differences are less consistent, with some cases not being statistically significant. Therefore, Ridge Regression has the best predictive ability in this problem.

(d) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list

ridgeImp <- varImp(ridgeTune, scale = FALSE) 
plot(ridgeImp, top = 25)

The top 25 most important predictors are shown above. There are more manufacturing predictors than biological represented (15 vs 10), suggesting that manufacturing processes may contribute more to the final yield than the quality of the raw materials.

(e) Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

par(mfrow = c(2,3))
plot(x_train[, "ManufacturingProcess32"], y_train)
plot(x_train[, "BiologicalMaterial05"], y_train)
plot(x_train[, "ManufacturingProcess36"], y_train)
plot(x_train[, "BiologicalMaterial03"], y_train)
plot(x_train[, "ManufacturingProcess13"], y_train)

Based on the plots of the top 5 most important predictors, ManufacturingProcess32 and BiologicalMaterial03 show a clear positive linear relationship with yield, so optimizing these factors further during future runs could improve yield. On the other hand, ManufacturingProcess13 and ManufacturingProcess36 have a negative relationship with yield, so limiting these processes in future runs can help mitigate reductions in yield.

Midterm