A
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
data(tecator)
str(absorp)
## num [1:215, 1:100] 2.62 2.83 2.58 2.82 2.79 ...
str(endpoints)
## num [1:215, 1:3] 60.5 46 71 72.8 58.3 44 44 69.3 61.4 61.4 ...
moisture <- endpoints[, 1]
fat <- endpoints[, 2]
protein <- endpoints[, 3]
The tecator data were loaded from the caret package. The absorp matrix has 215 observations and 100 predictor variables, which represent absorbance at different infrared frequencies. The endpoints matrix has 215 observations and 3 response variables: moisture, fat, and protein. In this exercise, moisture is used as the response variable.
B
pca_model <- prcomp(absorp, center = TRUE, scale. = TRUE)
summary(pca_model)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 9.9311 0.9847 0.52851 0.33827 0.08038 0.05123 0.02681
## Proportion of Variance 0.9863 0.0097 0.00279 0.00114 0.00006 0.00003 0.00001
## Cumulative Proportion 0.9863 0.9960 0.99875 0.99990 0.99996 0.99999 0.99999
## PC8 PC9 PC10 PC11 PC12 PC13
## Standard deviation 0.01961 0.008564 0.006739 0.004442 0.003361 0.001867
## Proportion of Variance 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000
## Cumulative Proportion 1.00000 1.000000 1.000000 1.000000 1.000000 1.000000
## PC14 PC15 PC16 PC17 PC18
## Standard deviation 0.001377 0.0009449 0.0008641 0.0007558 0.0006977
## Proportion of Variance 0.000000 0.0000000 0.0000000 0.0000000 0.0000000
## Cumulative Proportion 1.000000 1.0000000 1.0000000 1.0000000 1.0000000
## PC19 PC20 PC21 PC22 PC23
## Standard deviation 0.0005884 0.0004628 0.0003897 0.0003341 0.0003123
## Proportion of Variance 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## Cumulative Proportion 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
## PC24 PC25 PC26 PC27 PC28
## Standard deviation 0.0002721 0.0002616 0.000211 0.0001954 0.0001857
## Proportion of Variance 0.0000000 0.0000000 0.000000 0.0000000 0.0000000
## Cumulative Proportion 1.0000000 1.0000000 1.000000 1.0000000 1.0000000
## PC29 PC30 PC31 PC32 PC33
## Standard deviation 0.0001729 0.0001656 0.0001539 0.0001473 0.0001392
## Proportion of Variance 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## Cumulative Proportion 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
## PC34 PC35 PC36 PC37 PC38
## Standard deviation 0.0001339 0.0001269 0.0001082 0.000104 9.98e-05
## Proportion of Variance 0.0000000 0.0000000 0.0000000 0.000000 0.00e+00
## Cumulative Proportion 1.0000000 1.0000000 1.0000000 1.000000 1.00e+00
## PC39 PC40 PC41 PC42 PC43
## Standard deviation 9.081e-05 8.668e-05 8.026e-05 7.762e-05 7.36e-05
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.00e+00
## Cumulative Proportion 1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.00e+00
## PC44 PC45 PC46 PC47 PC48
## Standard deviation 6.808e-05 6.541e-05 6.44e-05 5.897e-05 5.422e-05
## Proportion of Variance 0.000e+00 0.000e+00 0.00e+00 0.000e+00 0.000e+00
## Cumulative Proportion 1.000e+00 1.000e+00 1.00e+00 1.000e+00 1.000e+00
## PC49 PC50 PC51 PC52 PC53
## Standard deviation 5.027e-05 4.893e-05 4.608e-05 4.419e-05 4.037e-05
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion 1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
## PC54 PC55 PC56 PC57 PC58 PC59
## Standard deviation 3.854e-05 3.8e-05 3.64e-05 3.497e-05 3.443e-05 3.264e-05
## Proportion of Variance 0.000e+00 0.0e+00 0.00e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion 1.000e+00 1.0e+00 1.00e+00 1.000e+00 1.000e+00 1.000e+00
## PC60 PC61 PC62 PC63 PC64
## Standard deviation 3.104e-05 3.04e-05 2.959e-05 2.844e-05 2.699e-05
## Proportion of Variance 0.000e+00 0.00e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion 1.000e+00 1.00e+00 1.000e+00 1.000e+00 1.000e+00
## PC65 PC66 PC67 PC68 PC69
## Standard deviation 2.586e-05 2.388e-05 2.364e-05 2.284e-05 2.173e-05
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion 1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
## PC70 PC71 PC72 PC73 PC74
## Standard deviation 2.058e-05 1.997e-05 1.93e-05 1.854e-05 1.807e-05
## Proportion of Variance 0.000e+00 0.000e+00 0.00e+00 0.000e+00 0.000e+00
## Cumulative Proportion 1.000e+00 1.000e+00 1.00e+00 1.000e+00 1.000e+00
## PC75 PC76 PC77 PC78 PC79
## Standard deviation 1.728e-05 1.693e-05 1.612e-05 1.569e-05 1.516e-05
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion 1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
## PC80 PC81 PC82 PC83 PC84
## Standard deviation 1.445e-05 1.408e-05 1.356e-05 1.275e-05 1.224e-05
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion 1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
## PC85 PC86 PC87 PC88 PC89
## Standard deviation 1.178e-05 1.09e-05 1.045e-05 1.009e-05 9.396e-06
## Proportion of Variance 0.000e+00 0.00e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion 1.000e+00 1.00e+00 1.000e+00 1.000e+00 1.000e+00
## PC90 PC91 PC92 PC93 PC94
## Standard deviation 8.728e-06 8.27e-06 7.613e-06 6.83e-06 6.383e-06
## Proportion of Variance 0.000e+00 0.00e+00 0.000e+00 0.00e+00 0.000e+00
## Cumulative Proportion 1.000e+00 1.00e+00 1.000e+00 1.00e+00 1.000e+00
## PC95 PC96 PC97 PC98 PC99
## Standard deviation 5.946e-06 5.478e-06 4.826e-06 4.521e-06 4.164e-06
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion 1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
## PC100
## Standard deviation 4.122e-06
## Proportion of Variance 0.000e+00
## Cumulative Proportion 1.000e+00
pve <- pca_model$sdev^2 / sum(pca_model$sdev^2)
cum_pve <- cumsum(pve)
which(cum_pve >= 0.95)[1]
## [1] 1
which(cum_pve >= 0.99)[1]
## [1] 2
PCA was applied to the absorbance predictors after centering and scaling. The results show that the first principal component already explains about 98.63% of the total variance. The first two components explain about 99.60%, and the first three explain about 99.88%. This means the predictors are highly correlated, and the effective dimension of the data is much smaller than the original 100 variables.
C
set.seed(123)
train_index <- createDataPartition(moisture, p = 0.75, list = FALSE)
x_train <- absorp[train_index, ]
x_test <- absorp[-train_index, ]
y_train <- moisture[train_index]
y_test <- moisture[-train_index]
The data were split into training and test sets using createDataPartition, with 75% of the observations used for training and the remaining 25% used for testing. The training set was used to fit the models, and the test set was used to evaluate their performance.
ols_fit <- lm(y_train ~ ., data = as.data.frame(x_train))
ols_pred <- predict(ols_fit, newdata = as.data.frame(x_test))
ols_rmse <- sqrt(mean((y_test - ols_pred)^2))
ols_rmse
## [1] 4.147104
An OLS model was first fitted using the training data. The model was then applied to the test set, and the RMSE was about 4.15. This result will be used later to compare with other models.
colnames(x_train) <- paste0("V", 1:ncol(x_train))
colnames(x_test) <- paste0("V", 1:ncol(x_test))
set.seed(123)
pcr_fit <- train(
x_train, y_train,
method = "pcr",
tuneLength = 20,
trControl = trainControl(method = "cv")
)
pcr_pred <- predict(pcr_fit, x_test)
pcr_rmse <- sqrt(mean((y_test - pcr_pred)^2))
pcr_rmse
## [1] 2.919432
A PCR model was fitted using cross-validation to select the number of components. When applied to the test set, the RMSE was about 2.92, which is much lower than the OLS result.
set.seed(123)
pls_fit <- train(
x_train, y_train,
method = "pls",
tuneLength = 20,
trControl = trainControl(method = "cv")
)
pls_pred <- predict(pls_fit, x_test)
pls_rmse <- sqrt(mean((y_test - pls_pred)^2))
pls_rmse
## [1] 2.329229
A PLS model was also fitted using cross-validation. On the test set, the RMSE was about 2.33, which is lower than both OLS and PCR.
set.seed(123)
ridge_fit <- train(
x_train, y_train,
method = "glmnet",
tuneGrid = expand.grid(alpha = 0, lambda = seq(0.0001, 1, length = 50)),
trControl = trainControl(method = "cv")
)
ridge_pred <- predict(ridge_fit, x_test)
ridge_rmse <- sqrt(mean((y_test - ridge_pred)^2))
ridge_rmse
## [1] 4.280928
A Ridge regression model was also fitted with cross-validation. Its RMSE on the test set was about 4.28, which is higher than PCR and PLS, and even slightly worse than OLS.
set.seed(123)
enet_fit <- train(
x_train, y_train,
method = "glmnet",
tuneLength = 20,
trControl = trainControl(method = "cv")
)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
enet_pred <- predict(enet_fit, x_test)
enet_rmse <- sqrt(mean((y_test - enet_pred)^2))
enet_rmse
## [1] 2.962891
An elastic net model was also fitted using cross-validation. Its RMSE on the test set was about 2.96, which is better than OLS and Ridge, but slightly worse than PCR and PLS.
D Comparing all models, PLS performed the best with the lowest RMSE (about 2.33). PCR and elastic net also performed reasonably well, while OLS and Ridge had much higher errors.
E Based on these results, I would choose the PLS model because it gives the lowest test RMSE and performs the best among all models.