A

library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
data(tecator)

str(absorp)
##  num [1:215, 1:100] 2.62 2.83 2.58 2.82 2.79 ...
str(endpoints)
##  num [1:215, 1:3] 60.5 46 71 72.8 58.3 44 44 69.3 61.4 61.4 ...
moisture <- endpoints[, 1]
fat <- endpoints[, 2]
protein <- endpoints[, 3]

The tecator data were loaded from the caret package. The absorp matrix has 215 observations and 100 predictor variables, which represent absorbance at different infrared frequencies. The endpoints matrix has 215 observations and 3 response variables: moisture, fat, and protein. In this exercise, moisture is used as the response variable.

B

pca_model <- prcomp(absorp, center = TRUE, scale. = TRUE)

summary(pca_model)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     9.9311 0.9847 0.52851 0.33827 0.08038 0.05123 0.02681
## Proportion of Variance 0.9863 0.0097 0.00279 0.00114 0.00006 0.00003 0.00001
## Cumulative Proportion  0.9863 0.9960 0.99875 0.99990 0.99996 0.99999 0.99999
##                            PC8      PC9     PC10     PC11     PC12     PC13
## Standard deviation     0.01961 0.008564 0.006739 0.004442 0.003361 0.001867
## Proportion of Variance 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000
## Cumulative Proportion  1.00000 1.000000 1.000000 1.000000 1.000000 1.000000
##                            PC14      PC15      PC16      PC17      PC18
## Standard deviation     0.001377 0.0009449 0.0008641 0.0007558 0.0006977
## Proportion of Variance 0.000000 0.0000000 0.0000000 0.0000000 0.0000000
## Cumulative Proportion  1.000000 1.0000000 1.0000000 1.0000000 1.0000000
##                             PC19      PC20      PC21      PC22      PC23
## Standard deviation     0.0005884 0.0004628 0.0003897 0.0003341 0.0003123
## Proportion of Variance 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## Cumulative Proportion  1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
##                             PC24      PC25     PC26      PC27      PC28
## Standard deviation     0.0002721 0.0002616 0.000211 0.0001954 0.0001857
## Proportion of Variance 0.0000000 0.0000000 0.000000 0.0000000 0.0000000
## Cumulative Proportion  1.0000000 1.0000000 1.000000 1.0000000 1.0000000
##                             PC29      PC30      PC31      PC32      PC33
## Standard deviation     0.0001729 0.0001656 0.0001539 0.0001473 0.0001392
## Proportion of Variance 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## Cumulative Proportion  1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
##                             PC34      PC35      PC36     PC37     PC38
## Standard deviation     0.0001339 0.0001269 0.0001082 0.000104 9.98e-05
## Proportion of Variance 0.0000000 0.0000000 0.0000000 0.000000 0.00e+00
## Cumulative Proportion  1.0000000 1.0000000 1.0000000 1.000000 1.00e+00
##                             PC39      PC40      PC41      PC42     PC43
## Standard deviation     9.081e-05 8.668e-05 8.026e-05 7.762e-05 7.36e-05
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.00e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.00e+00
##                             PC44      PC45     PC46      PC47      PC48
## Standard deviation     6.808e-05 6.541e-05 6.44e-05 5.897e-05 5.422e-05
## Proportion of Variance 0.000e+00 0.000e+00 0.00e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.00e+00 1.000e+00 1.000e+00
##                             PC49      PC50      PC51      PC52      PC53
## Standard deviation     5.027e-05 4.893e-05 4.608e-05 4.419e-05 4.037e-05
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                             PC54    PC55     PC56      PC57      PC58      PC59
## Standard deviation     3.854e-05 3.8e-05 3.64e-05 3.497e-05 3.443e-05 3.264e-05
## Proportion of Variance 0.000e+00 0.0e+00 0.00e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.0e+00 1.00e+00 1.000e+00 1.000e+00 1.000e+00
##                             PC60     PC61      PC62      PC63      PC64
## Standard deviation     3.104e-05 3.04e-05 2.959e-05 2.844e-05 2.699e-05
## Proportion of Variance 0.000e+00 0.00e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.00e+00 1.000e+00 1.000e+00 1.000e+00
##                             PC65      PC66      PC67      PC68      PC69
## Standard deviation     2.586e-05 2.388e-05 2.364e-05 2.284e-05 2.173e-05
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                             PC70      PC71     PC72      PC73      PC74
## Standard deviation     2.058e-05 1.997e-05 1.93e-05 1.854e-05 1.807e-05
## Proportion of Variance 0.000e+00 0.000e+00 0.00e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.00e+00 1.000e+00 1.000e+00
##                             PC75      PC76      PC77      PC78      PC79
## Standard deviation     1.728e-05 1.693e-05 1.612e-05 1.569e-05 1.516e-05
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                             PC80      PC81      PC82      PC83      PC84
## Standard deviation     1.445e-05 1.408e-05 1.356e-05 1.275e-05 1.224e-05
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                             PC85     PC86      PC87      PC88      PC89
## Standard deviation     1.178e-05 1.09e-05 1.045e-05 1.009e-05 9.396e-06
## Proportion of Variance 0.000e+00 0.00e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.00e+00 1.000e+00 1.000e+00 1.000e+00
##                             PC90     PC91      PC92     PC93      PC94
## Standard deviation     8.728e-06 8.27e-06 7.613e-06 6.83e-06 6.383e-06
## Proportion of Variance 0.000e+00 0.00e+00 0.000e+00 0.00e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.00e+00 1.000e+00 1.00e+00 1.000e+00
##                             PC95      PC96      PC97      PC98      PC99
## Standard deviation     5.946e-06 5.478e-06 4.826e-06 4.521e-06 4.164e-06
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                            PC100
## Standard deviation     4.122e-06
## Proportion of Variance 0.000e+00
## Cumulative Proportion  1.000e+00
pve <- pca_model$sdev^2 / sum(pca_model$sdev^2)
cum_pve <- cumsum(pve)

which(cum_pve >= 0.95)[1]
## [1] 1
which(cum_pve >= 0.99)[1]
## [1] 2

PCA was applied to the absorbance predictors after centering and scaling. The results show that the first principal component already explains about 98.63% of the total variance. The first two components explain about 99.60%, and the first three explain about 99.88%. This means the predictors are highly correlated, and the effective dimension of the data is much smaller than the original 100 variables.

C

set.seed(123)

train_index <- createDataPartition(moisture, p = 0.75, list = FALSE)

x_train <- absorp[train_index, ]
x_test  <- absorp[-train_index, ]

y_train <- moisture[train_index]
y_test  <- moisture[-train_index]

The data were split into training and test sets using createDataPartition, with 75% of the observations used for training and the remaining 25% used for testing. The training set was used to fit the models, and the test set was used to evaluate their performance.

ols_fit <- lm(y_train ~ ., data = as.data.frame(x_train))

ols_pred <- predict(ols_fit, newdata = as.data.frame(x_test))

ols_rmse <- sqrt(mean((y_test - ols_pred)^2))
ols_rmse
## [1] 4.147104

An OLS model was first fitted using the training data. The model was then applied to the test set, and the RMSE was about 4.15. This result will be used later to compare with other models.

colnames(x_train) <- paste0("V", 1:ncol(x_train))
colnames(x_test)  <- paste0("V", 1:ncol(x_test))
set.seed(123)

pcr_fit <- train(
  x_train, y_train,
  method = "pcr",
  tuneLength = 20,
  trControl = trainControl(method = "cv")
)

pcr_pred <- predict(pcr_fit, x_test)

pcr_rmse <- sqrt(mean((y_test - pcr_pred)^2))
pcr_rmse
## [1] 2.919432

A PCR model was fitted using cross-validation to select the number of components. When applied to the test set, the RMSE was about 2.92, which is much lower than the OLS result.

set.seed(123)

pls_fit <- train(
  x_train, y_train,
  method = "pls",
  tuneLength = 20,
  trControl = trainControl(method = "cv")
)

pls_pred <- predict(pls_fit, x_test)

pls_rmse <- sqrt(mean((y_test - pls_pred)^2))
pls_rmse
## [1] 2.329229

A PLS model was also fitted using cross-validation. On the test set, the RMSE was about 2.33, which is lower than both OLS and PCR.

set.seed(123)

ridge_fit <- train(
  x_train, y_train,
  method = "glmnet",
  tuneGrid = expand.grid(alpha = 0, lambda = seq(0.0001, 1, length = 50)),
  trControl = trainControl(method = "cv")
)

ridge_pred <- predict(ridge_fit, x_test)

ridge_rmse <- sqrt(mean((y_test - ridge_pred)^2))
ridge_rmse
## [1] 4.280928

A Ridge regression model was also fitted with cross-validation. Its RMSE on the test set was about 4.28, which is higher than PCR and PLS, and even slightly worse than OLS.

set.seed(123)

enet_fit <- train(
  x_train, y_train,
  method = "glmnet",
  tuneLength = 20,
  trControl = trainControl(method = "cv")
)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
enet_pred <- predict(enet_fit, x_test)

enet_rmse <- sqrt(mean((y_test - enet_pred)^2))
enet_rmse
## [1] 2.962891

An elastic net model was also fitted using cross-validation. Its RMSE on the test set was about 2.96, which is better than OLS and Ridge, but slightly worse than PCR and PLS.

D Comparing all models, PLS performed the best with the lowest RMSE (about 2.33). PCR and elastic net also performed reasonably well, while OLS and Ridge had much higher errors.

E Based on these results, I would choose the PLS model because it gives the lowest test RMSE and performs the best among all models.