In Kuhn and Johnson do problems 6.2 and 6.3. There are only two but they consist of many parts. Please submit a link to your Rpubs and submit the .rmd file as well.

6.2

a. Start R and use these commands to load the data:

data(permeability)

b. The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

head(permeability)
##   permeability
## 1       12.520
## 2        1.120
## 3       19.405
## 4        1.730
## 5        1.680
## 6        0.510
dim(fingerprints)
## [1]  165 1107

Filter predictors with low frequencies

nzv <- nearZeroVar(fingerprints)

filtered_fingerprints <- fingerprints[, -nzv]

dim(filtered_fingerprints)
## [1] 165 388

After removing sparse values, there are only 388 features left.

num_remaining_predictors <- ncol(filtered_fingerprints)
print(paste("Number of predictors left:", num_remaining_predictors))
## [1] "Number of predictors left: 388"

c. Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?

set.seed(1)

inTrain <- createDataPartition(permeability, p = 0.8, list = FALSE)

train_fingerprints <- filtered_fingerprints[inTrain,]

test_fingerprints <- filtered_fingerprints[-inTrain,]

train_permeability <- permeability[inTrain]

test_permeability <- permeability[-inTrain]

ctrl <- trainControl(method = "cv", number = 10)

pls_model <- train(train_fingerprints, train_permeability,

                   method = "pls",

                   tuneLength = 15,

                   trControl = ctrl,

                   preProc = c("center", "scale"))

pls_model 
## Partial Least Squares 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 121, 119, 120, 120, 120, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE     
##    1     12.69853  0.3419188  9.510461
##    2     11.26461  0.4862419  7.961848
##    3     11.23192  0.5020667  8.443290
##    4     11.34704  0.5151263  8.525333
##    5     11.36740  0.4843217  8.217022
##    6     11.17569  0.5067142  8.111659
##    7     10.70101  0.5296468  7.822282
##    8     10.54862  0.5419219  7.788303
##    9     10.68495  0.5356735  7.877981
##   10     10.45770  0.5534189  7.634547
##   11     10.61471  0.5495913  7.802781
##   12     10.60305  0.5554128  7.829622
##   13     10.53495  0.5554981  7.761453
##   14     10.57287  0.5551807  7.813457
##   15     10.82956  0.5352628  7.981578
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 10.
print(paste("Optimal number of latent variables:", pls_model$bestTune$ncomp))
## [1] "Optimal number of latent variables: 10"
print(paste("Corresponding resampled R2 estimate:", pls_model$results[pls_model$results$ncomp == pls_model$bestTune$ncomp, "Rsquared"]))
## [1] "Corresponding resampled R2 estimate: 0.553418884040966"

d. Predict the response for the test set and calculate test set R2

pls_predictions <- predict(pls_model, test_fingerprints)

pls_r2 <- cor(pls_predictions, test_permeability)^2

pls_r2
## [1] 0.3274115

Test set estimate of R^2 is 0.3274115

e. Try building other models discussed in the chapter. Do any others have better performance?

pcr_model <- train(train_fingerprints, train_permeability,

                   method = "pcr",

                   tuneLength = 15,

                   trControl = ctrl,

                   preProc = c("center", "scale"))

pcr_predictions <- predict(pcr_model, test_fingerprints)

pcr_r2 <- cor(pcr_predictions, test_permeability)^2

pcr_r2
## [1] 0.2860616
glmnet_model <- train(train_fingerprints, train_permeability,

                      method = "glmnet",

                      tuneLength = 10,

                      trControl = ctrl,

                      preProc = c("center", "scale"))

glmnet_predictions <- predict(glmnet_model, test_fingerprints)

glmnet_r2 <- cor(glmnet_predictions, test_permeability)^2

glmnet_r2
## [1] 0.4001524

Elastic Net has better performance based on the test set R^2.

f. Would you recommend any of your models to replace the permeability lab experiment?

The Elastic Net model, with a highest test set R2 would be recommended to replace the permeability lab experiment based on performance.

6.3

a. Load Data

data("ChemicalManufacturingProcess")

data <- ChemicalManufacturingProcess
dim(data)
## [1] 176  58

b. A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values

preProcValues <- preProcess(data[, -ncol(data)], method = "knnImpute")
data_imputed <- predict(preProcValues, data)

c. Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

set.seed(123)

trainIndex <- createDataPartition(data_imputed$Yield, p = 0.8, list = FALSE)

trainData <- data_imputed[trainIndex, ]

testData <- data_imputed[-trainIndex, ]
preProcValues <- preProcess(trainData[, -ncol(trainData)], method = c("center", "scale"))
trainTransformed <- predict(preProcValues, trainData)
testTransformed <- predict(preProcValues, testData)
ctrl <- trainControl(method = "cv", number = 10)
set.seed(123)
pcrTune <- train(Yield ~ ., data = trainTransformed, method = "pcr",

                  trControl = ctrl, tuneLength = 20)
print(pcrTune)
## Principal Component Analysis 
## 
## 144 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 128, 129, 129, 130, 128, 131, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE       Rsquared   MAE      
##    1     0.8155519  0.3601336  0.6634448
##    2     0.8131761  0.3609466  0.6631559
##    3     0.7927148  0.3984094  0.6347908
##    4     0.8204937  0.3590981  0.6518306
##    5     0.7891178  0.4018028  0.6411547
##    6     0.7826775  0.4133684  0.6373802
##    7     0.7868886  0.4059173  0.6368692
##    8     0.7659817  0.4426340  0.6186877
##    9     0.7397818  0.4927881  0.6096364
##   10     0.6667528  0.5956882  0.5452838
##   11     0.6547198  0.6097344  0.5367515
##   12     0.6407526  0.6261281  0.5221095
##   13     0.6274506  0.6359631  0.5114676
##   14     0.6303158  0.6369425  0.5163531
##   15     0.6321744  0.6363527  0.5188768
##   16     0.6299262  0.6449908  0.5127557
##   17     0.6307375  0.6407938  0.5130477
##   18     0.6333827  0.6350424  0.5148542
##   19     0.6410366  0.6299998  0.5197339
##   20     0.6440515  0.6255382  0.5197736
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 13.
plot(pcrTune)

d. Predict response for test set. What is the value of the performance metric and how does this compare to resampled performance metric of training set?

predictions <- predict(pcrTune, testTransformed)
rmse <- sqrt(mean((testTransformed$Yield - predictions)^2))
print(paste("RMSE on Test Set:", rmse))
## [1] "RMSE on Test Set: 0.724455236535656"

e. Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

varImp_pcr <- varImp(pcrTune)
print(varImp_pcr)
## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess32  100.00
## BiologicalMaterial06     94.06
## BiologicalMaterial03     81.27
## ManufacturingProcess13   80.63
## ManufacturingProcess36   78.31
## BiologicalMaterial02     76.04
## ManufacturingProcess17   75.92
## ManufacturingProcess31   74.96
## ManufacturingProcess09   73.04
## BiologicalMaterial12     69.48
## ManufacturingProcess06   66.28
## BiologicalMaterial11     59.72
## ManufacturingProcess33   57.41
## ManufacturingProcess29   54.55
## BiologicalMaterial04     53.93
## ManufacturingProcess11   49.55
## BiologicalMaterial01     45.62
## BiologicalMaterial08     44.93
## ManufacturingProcess30   41.08
## BiologicalMaterial09     40.88
plot(varImp_pcr, top = 20)

The most important Predictors are Manifacturing Process 32, 13, and the Biological Material 06 and 03.

f. Explore relationship between top predictors

selected_variables <- data_imputed %>%
select(ManufacturingProcess32,
         ManufacturingProcess13, 
         BiologicalMaterial06, 
         BiologicalMaterial03,
         ManufacturingProcess36, 
         Yield)

correlations <- cor(selected_variables, use = "pairwise.complete.obs")
corrplot(correlations, method = "circle", type = "lower", diag = FALSE)

There’s a noticeable relationship between the leading predictors and the response. These predictors can enhance model accuracy and are likely to hold more statistical importance. This also indicates how certain processes can influence yield, whether for better or worse.