Homework Prompt
In Kuhn and Johnson, do problems 6.2 and 6.3. There are only two, but they consist of many parts. Please submit a link to your RPubs page and submit the .Rmd file as well.
Import Libraries
library(caret)
library(dplyr)
library(RANN)
library(DataExplorer)

Question 6.2
(a)
Start R and use these commands to load the data:
library(AppliedPredictiveModeling)
data(permeability)

(b)
The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?
dim(fingerprints)
## [1] 165 1107
The original fingerprints matrix has 1,107 features.
head(permeability)
## permeability
## 1 12.520
## 2 1.120
## 3 19.405
## 4 1.730
## 5 1.680
## 6 0.510
sparse <- nearZeroVar(fingerprints)
not_sparse <- fingerprints[, -sparse]
dim(not_sparse)
## [1] 165 388
After removing the near-zero-variance (sparse) predictors, 388 of the original 1,107 features remain for modeling. That is a substantial reduction.
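To see why so many columns were dropped, nearZeroVar can return the screening metrics it uses: a predictor is flagged when the ratio of its most common value to its second most common value is large and it has few unique values (by default, freqRatio above 95/5 and percentUnique below 10).

# Inspect the near-zero-variance screening metrics for each predictor
nzv_metrics <- nearZeroVar(fingerprints, saveMetrics = TRUE)
head(nzv_metrics[nzv_metrics$nzv, ])  # a few of the flagged predictors
table(nzv_metrics$nzv)                # how many were flagged vs. kept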
(c)
Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?
# Combine the fingerprints and the permeability response into one data frame
df <- as.data.frame(fingerprints)
df <- df %>%
  mutate(target = permeability)

# Removing sparse values again
sparse <- nearZeroVar(df)
df <- df[, -sparse]
dim(df)
## [1] 165 389

# Train/test split
in_train <- createDataPartition(df$target, times = 1, p = 0.8, list = FALSE)
train <- df[in_train, ]
test <- df[-in_train, ]

# Train a PLS model with 10-fold cross-validation
pls_model <- train(target ~ ., data = train, method = "pls",
                   center = TRUE,
                   trControl = trainControl("cv", number = 10),
                   tuneLength = 25)
pls_model
## Partial Least Squares
##
## 133 samples
## 388 predictors
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 119, 119, 120, 120, 118, 121, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 13.43887 0.2807872 10.310102
## 2 12.16259 0.4135066 8.644289
## 3 11.97550 0.4330134 8.689783
## 4 11.87352 0.4533428 8.877753
## 5 11.94558 0.4595854 8.894107
## 6 11.64578 0.4760110 8.491113
## 7 11.63339 0.4783361 8.446789
## 8 11.84429 0.4636580 8.721235
## 9 12.28092 0.4428324 8.949703
## 10 12.79312 0.4154918 9.350185
## 11 13.63087 0.3729934 9.906356
## 12 13.98357 0.3577800 10.176955
## 13 13.97490 0.3550552 9.978092
## 14 14.32846 0.3299257 10.225343
## 15 15.05270 0.2956616 10.799766
## 16 15.49455 0.2893243 11.190199
## 17 16.05262 0.2657860 11.696030
## 18 16.33156 0.2657834 11.879754
## 19 16.17543 0.2733573 11.883028
## 20 16.35467 0.2647619 11.970685
## 21 16.51559 0.2622998 12.068258
## 22 16.82899 0.2579296 12.345354
## 23 17.17223 0.2513465 12.678490
## 24 17.40175 0.2479309 12.858202
## 25 17.59773 0.2492653 13.043419
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 7.
The optimal number of latent variables is 7, with a corresponding resampled R2 of 0.478.
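The tuning profile is easier to read as a plot than as a 25-row table; caret's plot method for train objects draws resampled performance against the tuning parameter.

# Cross-validated RMSE versus number of PLS components: the curve
# bottoms out at ncomp = 7, then degrades as extra components overfit
plot(pls_model, metric = "RMSE")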
(d)
Predict the response for the test set. What is the test set estimate of R2?
predict_test <- predict(pls_model, test)
check <- data.frame(obs = test$target, pred = predict_test)
colnames(check) <- c("obs", "pred")
defaultSummary(check)
## RMSE Rsquared MAE
## 11.7000646 0.4381358 7.9152826
The test-set estimate of R2 is 0.438, slightly below the resampled estimate of 0.478.
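As a sanity check, the Rsquared that defaultSummary reports is the squared Pearson correlation between observed and predicted values, so it can be reproduced directly:

# Should match the Rsquared from defaultSummary above (about 0.438)
cor(check$obs, check$pred)^2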
(e)
Try building other models discussed in the chapter. Do any others have better performance?
# Trying PCR
pcr_model <- train(target ~ ., data = train, method = "pcr",
                   center = TRUE,
                   trControl = trainControl("cv", number = 10),
                   tuneLength = 25)

predict_pcr <- predict(pcr_model, test)
check <- data.frame(obs = test$target, pred = predict_pcr)
colnames(check) <- c("obs", "pred")
defaultSummary(check)
## RMSE Rsquared MAE
## 11.6035276 0.4373711 7.2376847
# Trying an elastic net over a small grid of lambda and fraction values
enetGrid <- expand.grid(.lambda = c(0, 0.01, .1),
                        .fraction = seq(.05, 1, length = 20))
elasticNet_model <- train(target ~ ., data = train, method = "enet",
                          tuneGrid = enetGrid,
                          trControl = trainControl(method = "cv", number = 10),
                          preProc = c("center", "scale"))

elastic_predictions <- predict(elasticNet_model, test)
check <- data.frame(obs = test$target, pred = elastic_predictions)
colnames(check) <- c("obs", "pred")
defaultSummary(check)
## RMSE Rsquared MAE
## 12.179052 0.375328 7.460705
(f)
Would you recommend any of your models to replace the permeability lab experiment?
Of the three models I tested, PLS has the best test-set R2 (0.438), with PCR close behind (0.437) and the elastic net trailing (0.375). That being said, all three R2 values are fairly similar, and none explains even half of the variance in permeability, so I would not recommend any of these models as an outright replacement for the permeability lab experiment; at best they could serve as an inexpensive pre-screen for deciding which compounds to assay.
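A complementary way to compare the models is on their cross-validation folds rather than the single 20% test split, using caret's resamples(). This is a minimal sketch; for a strictly fair comparison all three models should be trained on identical folds (e.g., by setting the same seed, or a shared index in trainControl).

# Side-by-side resampled Rsquared distributions for the three models
model_comp <- resamples(list(PLS = pls_model, PCR = pcr_model,
                             ENET = elasticNet_model))
summary(model_comp)$statistics$Rsquared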
Question 6.3
(a)
Load Data
data("ChemicalManufacturingProcess")(b)
- A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values
pre_process <- preProcess(ChemicalManufacturingProcess, method = c("knnImpute"))
imputation <- predict(pre_process, ChemicalManufacturingProcess)
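A quick check that the imputation did its job: count missing cells before and after, and (since DataExplorer is already loaded) visualize per-column missingness. Note that caret's knnImpute method also centers and scales every column, which is why Yield in imputation ends up in standardized units.

sum(is.na(ChemicalManufacturingProcess))    # missing cells before imputation
sum(is.na(imputation))                      # should be 0 after knnImpute
plot_missing(ChemicalManufacturingProcess)  # NA percentage per column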
(c)
Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
head(imputation)

# We will use PCR
in_train <- createDataPartition(imputation$Yield, p = 0.8, list = FALSE)
train <- imputation[in_train, ]
test <- imputation[-in_train, ]

pcr_model <- train(Yield ~ ., data = train, method = "pcr",
                   center = TRUE,
                   trControl = trainControl("cv", number = 10),
                   tuneLength = 25)
pcr_model
## Principal Component Analysis
##
## 144 samples
## 57 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 130, 129, 130, 130, 130, 128, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 0.8271417 0.3187043 0.6773371
## 2 0.8200378 0.3216131 0.6674740
## 3 0.8184112 0.3315931 0.6674409
## 4 0.7543537 0.4230570 0.6227664
## 5 0.7486611 0.4224589 0.6184060
## 6 0.7433816 0.4327349 0.6149055
## 7 0.7536037 0.4177028 0.6276285
## 8 0.7418806 0.4306847 0.6175178
## 9 0.6975096 0.4916661 0.5771032
## 10 0.6317396 0.5964149 0.5106660
## 11 0.6273906 0.5973501 0.5064139
## 12 0.6204565 0.5998459 0.5080446
## 13 0.6208840 0.5994272 0.5078420
## 14 0.6287856 0.5899123 0.5129712
## 15 0.6295647 0.5906916 0.5158021
## 16 0.6293586 0.5877584 0.5183647
## 17 0.6312621 0.5862648 0.5222900
## 18 0.6329184 0.5864504 0.5228635
## 19 0.6353719 0.5835816 0.5206903
## 20 0.6241498 0.5980949 0.5145283
## 21 0.6224285 0.5962560 0.5123372
## 22 0.6344329 0.5820579 0.5215298
## 23 0.6270184 0.5872833 0.5158966
## 24 0.6092522 0.5954673 0.5024058
## 25 0.5891262 0.6182670 0.4854013
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 25.
The optimal model uses ncomp = 25 components, with a resampled RMSE of 0.589 and R2 of 0.618. Note that 25 is the edge of the tuning grid (tuneLength = 25), so a larger grid might find a slightly better model.
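The selected tuning value and its resampled metrics can also be pulled straight from the train object instead of being read off the table:

# Extract the chosen number of components and its cross-validated metrics
pcr_model$bestTune
subset(pcr_model$results, ncomp == pcr_model$bestTune$ncomp)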
(d)
Predict the response for the test set. What is the value of the performance metric, and how does it compare to the resampled performance metric on the training set?
predict_pcr <- predict(pcr_model, test)
check <- data.frame(obs = test$Yield, pred = predict_pcr)
colnames(check) <- c("obs","pred")
defaultSummary(check)
## RMSE Rsquared MAE
## 0.7275442 0.5789689 0.6004247
The RMSE on the test set is 0.728, higher than the resampled training RMSE of 0.589; the test-set R2 of 0.579 is likewise somewhat below the resampled 0.618.
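An observed-versus-predicted plot makes the fit quality concrete: points hugging the identity line indicate accurate predictions.

# Observed vs. predicted yield (standardized units) on the test set
plot(check$obs, check$pred,
     xlab = "Observed Yield", ylab = "Predicted Yield")
abline(0, 1, lty = 2)  # identity line: perfect predictions fall here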
(e)
Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
important <- varImp(pcr_model)
important
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## BiologicalMaterial06 100.00
## ManufacturingProcess32 99.58
## ManufacturingProcess13 93.89
## BiologicalMaterial03 89.61
## ManufacturingProcess31 84.83
## BiologicalMaterial12 79.15
## ManufacturingProcess09 77.05
## ManufacturingProcess36 71.60
## ManufacturingProcess17 69.75
## BiologicalMaterial02 63.24
## ManufacturingProcess11 60.68
## ManufacturingProcess29 57.69
## ManufacturingProcess06 57.66
## BiologicalMaterial09 54.31
## BiologicalMaterial11 53.41
## BiologicalMaterial04 48.41
## ManufacturingProcess33 45.50
## BiologicalMaterial08 43.71
## ManufacturingProcess30 42.19
## BiologicalMaterial01 42.02
According to the model's feature importance, the list is a mix of Manufacturing and Biological features. Neither type dominates outright, but the manufacturing predictors hold a slight edge, taking 11 of the top 20 spots and 3 of the top 5.
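To tally that mix programmatically instead of counting by eye, the importance scores stored in the varImp object can be ranked and grouped by predictor type:

# Rank all predictors by importance, take the top 20, and tally by type
imp <- important$importance
top20 <- rownames(imp)[order(imp$Overall, decreasing = TRUE)][1:20]
table(ifelse(grepl("^Manufacturing", top20), "Process", "Biological"))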
(f)
Explore the relationships between the top predictors and the response.
correlations <- imputation %>%
  select(ManufacturingProcess32, ManufacturingProcess13, BiologicalMaterial06,
         ManufacturingProcess36, ManufacturingProcess17, ManufacturingProcess09,
         BiologicalMaterial03, BiologicalMaterial02, BiologicalMaterial12,
         ManufacturingProcess06, ManufacturingProcess31, ManufacturingProcess11,
         ManufacturingProcess33, BiologicalMaterial11, BiologicalMaterial08,
         BiologicalMaterial04)
plot_correlation(correlations)
Not surprisingly, most of the ManufacturingProcess features are correlated with each other, and the BiologicalMaterial features are likewise correlated with each other. A few features are negatively correlated with others from the same domain, which is interesting to note.
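Since this part of the exercise is ultimately about how the top predictors relate to yield, a compact follow-up is to correlate each of them directly with the response; predictors with large positive (or negative) correlations suggest process settings that could be adjusted to improve yield. A minimal sketch using the imputed, standardized data:

# Correlation of each top predictor with Yield, strongest positive first
sort(cor(correlations, imputation$Yield)[, 1], decreasing = TRUE)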