In Kuhn and Johnson, do problems 6.2 and 6.3. There are only two problems, but they consist of many parts. Please submit a link to your RPubs page and submit the .Rmd file as well.
Problem 6.2
a. Start R and use these commands to load the data:
library(AppliedPredictiveModeling)
library(caret)
data(permeability)
b. The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?
head(permeability)
## permeability
## 1 12.520
## 2 1.120
## 3 19.405
## 4 1.730
## 5 1.680
## 6 0.510
dim(fingerprints)
## [1] 165 1107
# Filter predictors with low frequencies
nzv <- nearZeroVar(fingerprints)
filtered_fingerprints <- fingerprints[, -nzv]
dim(filtered_fingerprints)
## [1] 165 388
After removing the near-zero-variance predictors, only 388 of the original 1,107 features remain for modeling.
num_remaining_predictors <- ncol(filtered_fingerprints)
print(paste("Number of predictors left:", num_remaining_predictors))
## [1] "Number of predictors left: 388"
c. Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R^2?
set.seed(1)
inTrain <- createDataPartition(permeability, p = 0.8, list = FALSE)
train_fingerprints <- filtered_fingerprints[inTrain,]
test_fingerprints <- filtered_fingerprints[-inTrain,]
train_permeability <- permeability[inTrain]
test_permeability <- permeability[-inTrain]
ctrl <- trainControl(method = "cv", number = 10)
pls_model <- train(train_fingerprints, train_permeability,
                   method = "pls",
                   tuneLength = 15,
                   trControl = ctrl,
                   preProc = c("center", "scale"))
pls_model
## Partial Least Squares
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 121, 119, 120, 120, 120, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 12.69853 0.3419188 9.510461
## 2 11.26461 0.4862419 7.961848
## 3 11.23192 0.5020667 8.443290
## 4 11.34704 0.5151263 8.525333
## 5 11.36740 0.4843217 8.217022
## 6 11.17569 0.5067142 8.111659
## 7 10.70101 0.5296468 7.822282
## 8 10.54862 0.5419219 7.788303
## 9 10.68495 0.5356735 7.877981
## 10 10.45770 0.5534189 7.634547
## 11 10.61471 0.5495913 7.802781
## 12 10.60305 0.5554128 7.829622
## 13 10.53495 0.5554981 7.761453
## 14 10.57287 0.5551807 7.813457
## 15 10.82956 0.5352628 7.981578
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 10.
print(paste("Optimal number of latent variables:", pls_model$bestTune$ncomp))
## [1] "Optimal number of latent variables: 10"
print(paste("Corresponding resampled R2 estimate:", pls_model$results[pls_model$results$ncomp == pls_model$bestTune$ncomp, "Rsquared"]))
## [1] "Corresponding resampled R2 estimate: 0.553418884040966"
d. Predict the response for the test set and calculate the test set R^2.
pls_predictions <- predict(pls_model, test_fingerprints)
pls_r2 <- cor(pls_predictions, test_permeability)^2
pls_r2
## [1] 0.3274115
The test set estimate of R^2 is 0.327, noticeably lower than the resampled estimate of 0.553, so the cross-validated estimate was somewhat optimistic.
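As a cross-check on the correlation-based R^2 above, caret's postResample returns RMSE, R^2, and MAE for a set of predictions in one call:
# Test set RMSE, R^2, and MAE together
postResample(pred = pls_predictions, obs = test_permeability)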
e. Try building other models discussed in the chapter. Do any others have better performance?
pcr_model <- train(train_fingerprints, train_permeability,
                   method = "pcr",
                   tuneLength = 15,
                   trControl = ctrl,
                   preProc = c("center", "scale"))
pcr_predictions <- predict(pcr_model, test_fingerprints)
pcr_r2 <- cor(pcr_predictions, test_permeability)^2
pcr_r2
## [1] 0.2860616
glmnet_model <- train(train_fingerprints, train_permeability,
                      method = "glmnet",
                      tuneLength = 10,
                      trControl = ctrl,
                      preProc = c("center", "scale"))
glmnet_predictions <- predict(glmnet_model, test_fingerprints)
glmnet_r2 <- cor(glmnet_predictions, test_permeability)^2
glmnet_r2
## [1] 0.4001524
The elastic net (glmnet) model has the best test set performance, with R^2 = 0.400 versus 0.327 for PLS and 0.286 for PCR.
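Since the holdout here is only about 32 compounds, the test set R^2 values are noisy; the models can also be compared on their resampled estimates with caret's resamples. (For a strictly fair comparison, every train call should use the same CV folds, e.g. by setting the same seed before each call or passing an index to trainControl.) A sketch:
# Compare cross-validated performance of the three models side by side
model_comparison <- resamples(list(PLS = pls_model, PCR = pcr_model, ENet = glmnet_model))
summary(model_comparison)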
f. Would you recommend any of your models to replace the permeability lab experiment?
Based on test set performance, the elastic net model, which achieved the highest R^2 (0.400), would be the one recommended to replace the permeability lab experiment.
Problem 6.3
a. Load the data:
library(dplyr)
library(corrplot)
data("ChemicalManufacturingProcess")
data <- ChemicalManufacturingProcess
dim(data)
## [1] 176 58
b. A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values.
# Yield is the first column, so exclude it when fitting the imputation model;
# note that caret's knnImpute also centers and scales the predictors
preProcValues <- preProcess(data[, -1], method = "knnImpute")
data_imputed <- predict(preProcValues, data)
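A quick before-and-after check confirms that the predictor set no longer contains missing cells:
# Count missing cells before and after imputation
sum(is.na(ChemicalManufacturingProcess))
sum(is.na(data_imputed))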
c. Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
set.seed(123)
trainIndex <- createDataPartition(data_imputed$Yield, p = 0.8, list = FALSE)
trainData <- data_imputed[trainIndex, ]
testData <- data_imputed[-trainIndex, ]
# Again exclude Yield (column 1) when computing the centering/scaling parameters
preProcValues <- preProcess(trainData[, -1], method = c("center", "scale"))
trainTransformed <- predict(preProcValues, trainData)
testTransformed <- predict(preProcValues, testData)
ctrl <- trainControl(method = "cv", number = 10)
set.seed(123)
pcrTune <- train(Yield ~ ., data = trainTransformed, method = "pcr",
                 trControl = ctrl, tuneLength = 20)
print(pcrTune)
## Principal Component Analysis
##
## 144 samples
## 57 predictors
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 128, 129, 129, 130, 128, 131, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 0.8155519 0.3601336 0.6634448
## 2 0.8131761 0.3609466 0.6631559
## 3 0.7927148 0.3984094 0.6347908
## 4 0.8204937 0.3590981 0.6518306
## 5 0.7891178 0.4018028 0.6411547
## 6 0.7826775 0.4133684 0.6373802
## 7 0.7868886 0.4059173 0.6368692
## 8 0.7659817 0.4426340 0.6186877
## 9 0.7397818 0.4927881 0.6096364
## 10 0.6667528 0.5956882 0.5452838
## 11 0.6547198 0.6097344 0.5367515
## 12 0.6407526 0.6261281 0.5221095
## 13 0.6274506 0.6359631 0.5114676
## 14 0.6303158 0.6369425 0.5163531
## 15 0.6321744 0.6363527 0.5188768
## 16 0.6299262 0.6449908 0.5127557
## 17 0.6307375 0.6407938 0.5130477
## 18 0.6333827 0.6350424 0.5148542
## 19 0.6410366 0.6299998 0.5197339
## 20 0.6440515 0.6255382 0.5197736
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 13.
plot(pcrTune)
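As in problem 6.2, the chosen tuning value and its resampled metrics can be pulled from the train object; from the table above, ncomp = 13 gives the best resampled RMSE (about 0.63, with R^2 of about 0.64):
# Report the chosen number of components and its resampled performance
print(paste("Optimal number of components:", pcrTune$bestTune$ncomp))
pcrTune$results[pcrTune$results$ncomp == pcrTune$bestTune$ncomp, c("RMSE", "Rsquared", "MAE")]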
d. Predict the response for the test set. What is the value of the performance metric, and how does it compare to the resampled performance metric on the training set?
predictions <- predict(pcrTune, testTransformed)
rmse <- sqrt(mean((testTransformed$Yield - predictions)^2))
print(paste("RMSE on Test Set:", rmse))
## [1] "RMSE on Test Set: 0.724455236535656"
e. Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
varImp_pcr <- varImp(pcrTune)
print(varImp_pcr)
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 100.00
## BiologicalMaterial06 94.06
## BiologicalMaterial03 81.27
## ManufacturingProcess13 80.63
## ManufacturingProcess36 78.31
## BiologicalMaterial02 76.04
## ManufacturingProcess17 75.92
## ManufacturingProcess31 74.96
## ManufacturingProcess09 73.04
## BiologicalMaterial12 69.48
## ManufacturingProcess06 66.28
## BiologicalMaterial11 59.72
## ManufacturingProcess33 57.41
## ManufacturingProcess29 54.55
## BiologicalMaterial04 53.93
## ManufacturingProcess11 49.55
## BiologicalMaterial01 45.62
## BiologicalMaterial08 44.93
## ManufacturingProcess30 41.08
## BiologicalMaterial09 40.88
plot(varImp_pcr, top = 20)
The most important predictors are ManufacturingProcess32 and 13 and BiologicalMaterial06 and 03. Process predictors hold a slight edge (11 of the top 20 versus 9 biological), but neither type clearly dominates the list.
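That tally can be computed directly from the importance scores; a sketch:
# Tally biological vs. process predictors among the top 20 by importance
imp <- varImp_pcr$importance
top20 <- rownames(imp)[order(imp$Overall, decreasing = TRUE)][1:20]
table(ifelse(grepl("^Biological", top20), "Biological", "Process"))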
f. Explore the relationships between the top predictors and the response.
selected_variables <- data_imputed %>%
  select(ManufacturingProcess32,
         ManufacturingProcess13,
         BiologicalMaterial06,
         BiologicalMaterial03,
         ManufacturingProcess36,
         Yield)
correlations <- cor(selected_variables, use = "pairwise.complete.obs")
corrplot(correlations, method = "circle", type = "lower", diag = FALSE)
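The plot can also be read off numerically; the Yield row of the correlation matrix gives each selected predictor's correlation with the response:
# Correlation of each selected predictor with Yield, rounded for readability
round(correlations["Yield", ], 2)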
The correlation plot shows noticeable relationships between the leading predictors and the response, consistent with their high importance scores. Because several of the top predictors are manufacturing process variables, they represent potential levers: processes that correlate with Yield, positively or negatively, indicate where adjustments could influence yield for better or worse.