Homework Prompt

In Kuhn and Johnson, do problems 6.2 and 6.3. There are only two, but they consist of many parts. Please submit a link to your RPubs page and submit the .Rmd file as well.

Load Libraries

library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.1.2
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(RANN)
library(DataExplorer)

Question 6.2

(a)

Start R and use these commands to load the data:

library(AppliedPredictiveModeling)
data(permeability)

(b)

The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

dim(fingerprints)
## [1]  165 1107

The original fingerprints matrix contains 165 compounds and 1,107 predictors.

head(permeability)
##   permeability
## 1       12.520
## 2        1.120
## 3       19.405
## 4        1.730
## 5        1.680
## 6        0.510
sparse <- nearZeroVar(fingerprints)
not_sparse <- fingerprints[, -sparse]
dim(not_sparse)
## [1] 165 388

After removing the near-zero-variance (sparse) predictors, only 388 of the original 1,107 features remain, a substantial reduction.
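For anyone curious why a predictor gets flagged, nearZeroVar can return its diagnostics instead of just the column indices; a minimal sketch:

# Inspect the frequency-ratio and percent-unique diagnostics behind the filter
nzv_stats <- nearZeroVar(fingerprints, saveMetrics = TRUE)
head(nzv_stats[nzv_stats$nzv, ])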

(c)

Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?

# Combine the predictors and the response into one data frame

df <- as.data.frame(fingerprints)

df <- df %>%
  mutate(target = as.vector(permeability)) # flatten the one-column matrix into a plain vector
# Remove the near-zero-variance predictors again (the retained target column brings the total to 389)

sparse <- nearZeroVar(df)
df <- df[, -sparse]
dim(df)
## [1] 165 389
# Train Test Split

in_train <- createDataPartition(df$target, times = 1, p = 0.8, list = FALSE)
train <- df[in_train, ]
test <- df[-in_train, ]
# Train a PLS model with 10-fold CV. Note: center = TRUE is passed through to
# the underlying pls fit, so caret itself reports "No pre-processing" below.

pls_model <- train(target ~ ., data = train, method = "pls",
  center = TRUE, trControl = trainControl("cv", number = 10),
  tuneLength = 25
)
pls_model
## Partial Least Squares 
## 
## 133 samples
## 388 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 119, 119, 120, 120, 118, 121, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     13.43887  0.2807872  10.310102
##    2     12.16259  0.4135066   8.644289
##    3     11.97550  0.4330134   8.689783
##    4     11.87352  0.4533428   8.877753
##    5     11.94558  0.4595854   8.894107
##    6     11.64578  0.4760110   8.491113
##    7     11.63339  0.4783361   8.446789
##    8     11.84429  0.4636580   8.721235
##    9     12.28092  0.4428324   8.949703
##   10     12.79312  0.4154918   9.350185
##   11     13.63087  0.3729934   9.906356
##   12     13.98357  0.3577800  10.176955
##   13     13.97490  0.3550552   9.978092
##   14     14.32846  0.3299257  10.225343
##   15     15.05270  0.2956616  10.799766
##   16     15.49455  0.2893243  11.190199
##   17     16.05262  0.2657860  11.696030
##   18     16.33156  0.2657834  11.879754
##   19     16.17543  0.2733573  11.883028
##   20     16.35467  0.2647619  11.970685
##   21     16.51559  0.2622998  12.068258
##   22     16.82899  0.2579296  12.345354
##   23     17.17223  0.2513465  12.678490
##   24     17.40175  0.2479309  12.858202
##   25     17.59773  0.2492653  13.043419
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 7.

The optimal number of latent variables is 7, with a corresponding resampled R2 of about 0.48.
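The selection can also be read programmatically from the fitted object rather than from the printed table; a quick sketch:

# Extract the chosen tuning value and its cross-validated metrics
pls_model$bestTune
subset(pls_model$results, ncomp == pls_model$bestTune$ncomp)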

(d)

Predict the response for the test set. What is the test set estimate of R2?

predict_test <- predict(pls_model, test)
check <- data.frame(obs = test$target, pred = predict_test)
colnames(check) <- c("obs","pred")
defaultSummary(check)
##       RMSE   Rsquared        MAE 
## 11.7000646  0.4381358  7.9152826

The R2 estimate on the test set is about 0.44, somewhat lower than the resampled estimate of 0.48.
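As a sanity check, caret's Rsquared is the squared correlation between the observed and predicted values, which can be verified directly:

# Should match the Rsquared reported by defaultSummary
cor(check$obs, check$pred)^2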

(e)

Try building other models discussed in the chapter. Do any others have better performance?

# Trying PCR

pcr_model <- train(target ~ ., data = train, method = "pcr",
  center = TRUE,
  trControl = trainControl("cv", number = 10),
  tuneLength = 25
)

predict_pcr <- predict(pcr_model, test)

check <- data.frame(obs = test$target, pred = predict_pcr)
colnames(check) <- c("obs","pred")
defaultSummary(check)
##       RMSE   Rsquared        MAE 
## 11.6035276  0.4373711  7.2376847
# Elastic net: tune over the ridge penalty (lambda) and the lasso fraction
enetGrid <- expand.grid(.lambda = c(0, 0.01, .1), .fraction = seq(.05, 1, length = 20))

elasticNet_model <- train(target ~ ., data = train, method = "enet",
  tuneGrid = enetGrid,
  trControl = trainControl(method = "cv", number = 10),
  preProc = c("center", "scale")
)

elastic_predictions = predict(elasticNet_model, test)

check <- data.frame(obs = test$target, pred = elastic_predictions)
colnames(check) <- c("obs","pred")
defaultSummary(check)
##      RMSE  Rsquared       MAE 
## 12.179052  0.375328  7.460705
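Since all three models were tuned with 10-fold cross-validation, their resampling distributions can also be compared side by side with caret's resamples helper; a minimal sketch:

# Collect and summarize the CV results of all three models
model_comparison <- resamples(list(PLS = pls_model, PCR = pcr_model, ENet = elasticNet_model))
summary(model_comparison)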

(f)

Would you recommend any of your models to replace the permeability lab experiment?

Of the three models I tested, PLS and PCR perform almost identically on the test set (R2 of roughly 0.44 each), while the elastic net trails slightly (R2 of roughly 0.38).

Given that none of the models explains even half of the variance in permeability, I would not recommend any of them as a full replacement for the lab experiment, though they could serve as a cheap screen before running the assay.

Question 6.3

(a)

Load Data

data("ChemicalManufacturingProcess")

(b)

A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values.

pre_process <- preProcess(ChemicalManufacturingProcess, method = c("knnImpute"))

imputation <- predict(pre_process, ChemicalManufacturingProcess)
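A quick check that the imputation actually filled everything in. (Note that caret adds centering and scaling whenever knnImpute is requested, which is why the imputed variables, including Yield, end up on standardized scales.)

# Count missing cells before and after imputation (should drop to zero)
sum(is.na(ChemicalManufacturingProcess))
sum(is.na(imputation))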

(c)

Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

head(imputation)
# We will use PCR

in_train <- createDataPartition(imputation$Yield, p=0.8, list=FALSE)
train <- imputation[in_train, ]
test <- imputation[-in_train, ]
pcr_model <- train(Yield ~ ., data = train, method = "pcr",
  center = TRUE,
  trControl = trainControl("cv", number = 10),
  tuneLength = 25
)
pcr_model
## Principal Component Analysis 
## 
## 144 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 130, 129, 130, 130, 130, 128, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE       Rsquared   MAE      
##    1     0.8271417  0.3187043  0.6773371
##    2     0.8200378  0.3216131  0.6674740
##    3     0.8184112  0.3315931  0.6674409
##    4     0.7543537  0.4230570  0.6227664
##    5     0.7486611  0.4224589  0.6184060
##    6     0.7433816  0.4327349  0.6149055
##    7     0.7536037  0.4177028  0.6276285
##    8     0.7418806  0.4306847  0.6175178
##    9     0.6975096  0.4916661  0.5771032
##   10     0.6317396  0.5964149  0.5106660
##   11     0.6273906  0.5973501  0.5064139
##   12     0.6204565  0.5998459  0.5080446
##   13     0.6208840  0.5994272  0.5078420
##   14     0.6287856  0.5899123  0.5129712
##   15     0.6295647  0.5906916  0.5158021
##   16     0.6293586  0.5877584  0.5183647
##   17     0.6312621  0.5862648  0.5222900
##   18     0.6329184  0.5864504  0.5228635
##   19     0.6353719  0.5835816  0.5206903
##   20     0.6241498  0.5980949  0.5145283
##   21     0.6224285  0.5962560  0.5123372
##   22     0.6344329  0.5820579  0.5215298
##   23     0.6270184  0.5872833  0.5158966
##   24     0.6092522  0.5954673  0.5024058
##   25     0.5891262  0.6182670  0.4854013
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 25.

The optimal model uses 25 components, with a resampled RMSE of 0.59 and an R2 of about 0.62.
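Since ncomp = 25 sits on the boundary of the tuning grid (tuneLength = 25), the true optimum may lie beyond it; a hypothetical wider search, just as a sketch:

# Re-tune with an explicit, wider grid of components (hypothetical follow-up)
pcr_model_wide <- train(Yield ~ ., data = train, method = "pcr",
  trControl = trainControl("cv", number = 10),
  tuneGrid = data.frame(ncomp = 1:40)
)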

(d)

Predict the response for the test set. What is the value of the performance metric, and how does this compare to the resampled performance metric on the training set?

predict_pcr <- predict(pcr_model, test)

check <- data.frame(obs = test$Yield, pred = predict_pcr)
colnames(check) <- c("obs","pred")
defaultSummary(check)
##      RMSE  Rsquared       MAE 
## 0.7275442 0.5789689 0.6004247

The RMSE on the test set is 0.73, higher than the resampled training RMSE of 0.59; the test-set R2 is 0.58, lower than the resampled 0.62.

(e)

Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

important <- varImp(pcr_model)
important
## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## BiologicalMaterial06    100.00
## ManufacturingProcess32   99.58
## ManufacturingProcess13   93.89
## BiologicalMaterial03     89.61
## ManufacturingProcess31   84.83
## BiologicalMaterial12     79.15
## ManufacturingProcess09   77.05
## ManufacturingProcess36   71.60
## ManufacturingProcess17   69.75
## BiologicalMaterial02     63.24
## ManufacturingProcess11   60.68
## ManufacturingProcess29   57.69
## ManufacturingProcess06   57.66
## BiologicalMaterial09     54.31
## BiologicalMaterial11     53.41
## BiologicalMaterial04     48.41
## ManufacturingProcess33   45.50
## BiologicalMaterial08     43.71
## ManufacturingProcess30   42.19
## BiologicalMaterial01     42.02

According to the model's feature-importance output, both manufacturing and biological predictors appear throughout the list. That said, the process predictors dominate slightly: they account for 6 of the top 9 features and 11 of the top 20.
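That tally can be computed directly from the importance scores rather than by eye; a quick sketch:

# Count how many of the 20 most important predictors come from each domain
imp <- important$importance
top20 <- rownames(imp)[order(imp$Overall, decreasing = TRUE)][1:20]
table(ifelse(grepl("^Manufacturing", top20), "Process", "Biological"))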

(f)

Explore the relationships between the top predictors.

correlations <- imputation %>%
  select(ManufacturingProcess32, ManufacturingProcess13, BiologicalMaterial06,
         ManufacturingProcess36, ManufacturingProcess17, ManufacturingProcess09,
         BiologicalMaterial03, BiologicalMaterial02, BiologicalMaterial12,
         ManufacturingProcess06, ManufacturingProcess31, ManufacturingProcess11,
         ManufacturingProcess33, BiologicalMaterial11, BiologicalMaterial08,
         BiologicalMaterial04)

plot_correlation(correlations)

Not surprisingly, most of the ManufacturingProcess features are correlated with each other, and the BiologicalMaterial features are likewise correlated with each other. A few features are negatively correlated with others from the same domain, which is interesting to note.
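Beyond the predictor-to-predictor structure, it is worth checking how each of these top predictors relates to the response itself; a minimal sketch:

# Correlation of each top predictor with (standardized) Yield
sort(cor(correlations, imputation$Yield)[, 1], decreasing = TRUE)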