Exercises from Chapter 6 of the textbook Applied Predictive Modeling by Kuhn & Johnson
library(AppliedPredictiveModeling)
library(caret)
data(permeability)
fingerprints <- as.data.frame(fingerprints)
After creating a new dataframe, fingerprints_thinned, by dropping the near-zero-variance predictors flagged by nearZeroVar(), 388 predictor variables remain.
#remove near-zero-variance predictors
fingerprints_thinned <- fingerprints[, -nearZeroVar(fingerprints)]
#check new predictor count
ncol(fingerprints_thinned)
## [1] 388
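For reference, nearZeroVar() flags a predictor when the ratio of its most common value to its second most common exceeds freqCut (default 95/5) and its percentage of unique values falls below uniqueCut (default 10). A minimal hand-rolled sketch of that logic (is_nzv is my own illustrative helper, not a caret function):
# hand-rolled version of nearZeroVar's two default criteria for one predictor
is_nzv <- function(x, freqCut = 95/5, uniqueCut = 10) {
  tab <- sort(table(x), decreasing = TRUE)
  freq_ratio <- if (length(tab) > 1) tab[[1]] / tab[[2]] else Inf
  pct_unique <- 100 * length(unique(x)) / length(x)
  freq_ratio > freqCut && pct_unique < uniqueCut
}
is_nzv(c(rep(0, 99), 1)) # TRUE: a fingerprint bit that is almost all zeros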
#add permeability target variable
dataset <- cbind(data.frame(permeability), fingerprints_thinned)
#split train/test
set.seed(3190)
sample_set <- sample(nrow(dataset), round(nrow(dataset)*0.75), replace = FALSE)
fingerprints_train <- dataset[sample_set, ]
fingerprints_test <- dataset[-sample_set, ]
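As an aside, caret's createDataPartition() could be used here instead of sample(); it stratifies the split on the outcome so the train and test permeability distributions stay similar. A sketch of that alternative (not used for the results below; variable names are my own):
set.seed(3190)
in_train <- createDataPartition(dataset$permeability, p = 0.75, list = FALSE)
strat_train <- dataset[in_train, ]
strat_test <- dataset[-in_train, ]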
Judging from the RMSE plot below, 7 is the optimal number of latent variables.
# build the PLS model
pls_model <- train(fingerprints_train[,-1],
                   fingerprints_train$permeability,
                   method = "pls",
                   tuneLength = 10,
                   trControl = trainControl(method = "cv"))
#plot cross-validated RMSE against ncomp
plot(pls_model$results$RMSE)
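The optimum can also be read directly off the fitted object rather than the plot:
# the ncomp caret selected (lowest cross-validated RMSE)
pls_model$bestTune
# equivalently, locate it in the results table
pls_model$results$ncomp[which.min(pls_model$results$RMSE)]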
Looking at the full results, I see the lowest RMSE of 10.142 at ncomp 7, as identified above, and the associated R-squared is 0.608.
pls_model$results
## ncomp RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 1 13.48118 0.3324390 10.387461 3.542885 0.2538493 2.924443
## 2 2 11.73656 0.5044877 8.620059 4.070957 0.2451729 3.145744
## 3 3 11.49670 0.5014269 8.615089 3.146592 0.1555225 2.359770
## 4 4 10.95246 0.5648879 8.680918 2.863782 0.1592836 2.471589
## 5 5 10.59649 0.5851518 8.197448 2.996456 0.1732325 2.315941
## 6 6 10.49329 0.5884128 8.126480 2.637389 0.1555512 2.048783
## 7 7 10.14244 0.6077336 7.894780 2.702364 0.1571095 2.179632
## 8 8 10.15322 0.6075330 7.889303 2.464001 0.1463963 2.160220
## 9 9 10.42128 0.5844003 8.044052 2.673910 0.1578883 2.336455
## 10 10 10.70107 0.5738913 8.121887 2.922505 0.1771802 2.430940
Comparing the predictions from this model against the test set, I see an R-squared of 0.355, meaning the model can account for 35.5% of the variance in the test data. The RMSE is 14.131.
pred <- predict(pls_model, fingerprints_test)
postResample(pred = pred, obs = fingerprints_test$permeability)
## RMSE Rsquared MAE
## 14.1307091 0.3547684 10.3893889
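As a sanity check, the same two headline metrics can be computed by hand; caret's Rsquared here is the squared correlation between predictions and observations (obs is my own shorthand):
obs <- fingerprints_test$permeability
sqrt(mean((pred - obs)^2)) # RMSE
cor(pred, obs)^2           # Rsquared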
I’ll try a PCR model first. This has a worse R-squared, which means the model can only account for 28.2% of the variance in the dataset. The RMSE is lower than the PLS model’s, though.
pcr_model <- train(fingerprints_train[,-1],
                   fingerprints_train$permeability,
                   method = "pcr",
                   tuneLength = 10,
                   trControl = trainControl(method = "cv"))
pred <- predict(pcr_model, fingerprints_test)
postResample(pred = pred, obs = fingerprints_test$permeability)
## RMSE Rsquared MAE
## 12.1046996 0.2816046 8.5982847
Finally, I try an elastic net model, whose R-squared of 33.9% is better than PCR’s but worse than PLS’s. Its RMSE of 13.958 also falls between the two previous models.
enet_model <- train(fingerprints_train[,-1],
                    fingerprints_train$permeability,
                    method = "enet",
                    tuneLength = 10,
                    trControl = trainControl(method = "cv"))
pred <- predict(enet_model, fingerprints_test)
postResample(pred = pred, obs = fingerprints_test$permeability)
## RMSE Rsquared MAE
## 13.9583366 0.3391313 9.7450271
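Unlike PLS and PCR, enet tunes two hyperparameters (the ridge penalty lambda and the lasso fraction); the winning pair can be inspected with:
# cross-validation's chosen (fraction, lambda) combination
enet_model$bestTune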
This is a little tricky: while the PLS model has the best (highest) R-squared, it has the worst RMSE. The PCR model has the lowest RMSE but quite a bad R-squared at 28.2%. I believe I would choose the original PLS model, as the elastic net model has only a slightly lower RMSE but loses about 1.5 percentage points of R-squared.
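caret's resamples() can also line the three models up on their cross-validation folds. Since each train() call above generated its own folds, the comparison is only indicative, but it summarizes the resampled RMSE, R-squared, and MAE side by side:
# collect and summarize the cross-validation metrics of all three models
cv_compare <- resamples(list(PLS = pls_model, PCR = pcr_model, ENET = enet_model))
summary(cv_compare)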
data(ChemicalManufacturingProcess)
chem <- as.data.frame(ChemicalManufacturingProcess)
I use the RANN library (which backs caret’s knnImpute method) to impute the missing values. When I do a skim of the new dataframe I see no missing values; note that knnImpute also centers and scales the data, which is why every column below shows mean 0 and standard deviation 1.
library(skimr)
knn_imp <- preProcess(chem, "knnImpute")
chem_imp <- predict(knn_imp, chem)
skim(chem_imp)
Name | chem_imp |
Number of rows | 176 |
Number of columns | 58 |
_______________________ | |
Column type frequency: | |
numeric | 58 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Yield | 0 | 1 | 0.00 | 1.00 | -2.67 | -0.77 | -0.11 | 0.70 | 3.34 | ▁▇▇▃▁ |
BiologicalMaterial01 | 0 | 1 | 0.00 | 1.00 | -2.57 | -0.61 | -0.15 | 0.64 | 3.36 | ▂▇▇▂▁ |
BiologicalMaterial02 | 0 | 1 | 0.00 | 1.00 | -2.19 | -0.75 | -0.15 | 0.76 | 2.25 | ▂▇▆▅▃ |
BiologicalMaterial03 | 0 | 1 | 0.00 | 1.00 | -2.68 | -0.68 | -0.12 | 0.68 | 2.64 | ▂▅▇▆▁ |
BiologicalMaterial04 | 0 | 1 | 0.00 | 1.00 | -1.67 | -0.62 | -0.14 | 0.49 | 6.05 | ▇▆▁▁▁ |
BiologicalMaterial05 | 0 | 1 | 0.00 | 1.00 | -2.91 | -0.74 | -0.06 | 0.71 | 3.39 | ▁▅▇▃▁ |
BiologicalMaterial06 | 0 | 1 | 0.00 | 1.00 | -2.22 | -0.76 | -0.12 | 0.65 | 2.79 | ▂▇▆▅▁ |
BiologicalMaterial07 | 0 | 1 | 0.00 | 1.00 | -0.13 | -0.13 | -0.13 | -0.13 | 7.57 | ▇▁▁▁▁ |
BiologicalMaterial08 | 0 | 1 | 0.00 | 1.00 | -2.39 | -0.64 | 0.02 | 0.57 | 2.43 | ▁▅▇▃▂ |
BiologicalMaterial09 | 0 | 1 | 0.00 | 1.00 | -3.40 | -0.60 | -0.04 | 0.67 | 2.96 | ▁▃▇▅▁ |
BiologicalMaterial10 | 0 | 1 | 0.00 | 1.00 | -1.72 | -0.57 | -0.15 | 0.32 | 6.79 | ▇▅▁▁▁ |
BiologicalMaterial11 | 0 | 1 | 0.00 | 1.00 | -2.31 | -0.65 | -0.18 | 0.55 | 2.44 | ▂▆▇▃▂ |
BiologicalMaterial12 | 0 | 1 | 0.00 | 1.00 | -2.39 | -0.61 | -0.10 | 0.71 | 2.60 | ▂▆▇▃▂ |
ManufacturingProcess01 | 0 | 1 | 0.00 | 1.00 | -6.15 | -0.22 | 0.11 | 0.50 | 1.59 | ▁▁▁▅▇ |
ManufacturingProcess02 | 0 | 1 | 0.01 | 0.99 | -1.97 | 0.31 | 0.51 | 0.57 | 0.69 | ▂▁▁▁▇ |
ManufacturingProcess03 | 0 | 1 | 0.04 | 0.97 | -3.11 | -0.43 | 0.38 | 0.47 | 2.70 | ▁▂▆▇▁ |
ManufacturingProcess04 | 0 | 1 | 0.00 | 1.00 | -3.32 | -0.61 | 0.34 | 0.66 | 2.25 | ▁▂▃▇▂ |
ManufacturingProcess05 | 0 | 1 | 0.00 | 1.00 | -2.58 | -0.49 | -0.09 | 0.23 | 5.69 | ▁▇▁▁▁ |
ManufacturingProcess06 | 0 | 1 | -0.01 | 1.00 | -1.63 | -0.63 | -0.22 | 0.48 | 7.41 | ▇▃▁▁▁ |
ManufacturingProcess07 | 0 | 1 | 0.00 | 1.00 | -0.96 | -0.96 | -0.96 | 1.04 | 1.04 | ▇▁▁▁▇ |
ManufacturingProcess08 | 0 | 1 | 0.00 | 1.00 | -1.11 | -1.11 | 0.89 | 0.89 | 0.89 | ▆▁▁▁▇ |
ManufacturingProcess09 | 0 | 1 | 0.00 | 1.00 | -4.38 | -0.50 | 0.05 | 0.55 | 2.39 | ▁▁▅▇▂ |
ManufacturingProcess10 | 0 | 1 | 0.02 | 0.99 | -2.19 | -0.62 | -0.10 | 0.55 | 3.16 | ▂▇▇▂▁ |
ManufacturingProcess11 | 0 | 1 | 0.03 | 1.00 | -2.63 | -0.54 | 0.02 | 0.72 | 2.95 | ▁▇▇▆▁ |
ManufacturingProcess12 | 0 | 1 | 0.00 | 1.00 | -0.48 | -0.48 | -0.48 | -0.48 | 2.07 | ▇▁▁▁▂ |
ManufacturingProcess13 | 0 | 1 | 0.00 | 1.00 | -2.37 | -0.60 | 0.09 | 0.68 | 4.03 | ▃▆▇▁▁ |
ManufacturingProcess14 | 0 | 1 | 0.00 | 1.00 | -2.80 | -0.49 | 0.03 | 0.52 | 3.69 | ▁▅▇▂▁ |
ManufacturingProcess15 | 0 | 1 | 0.00 | 1.00 | -2.31 | -0.50 | -0.13 | 0.38 | 3.33 | ▂▇▆▂▁ |
ManufacturingProcess16 | 0 | 1 | 0.00 | 1.00 | -12.98 | -0.01 | 0.06 | 0.15 | 0.81 | ▁▁▁▁▇ |
ManufacturingProcess17 | 0 | 1 | 0.00 | 1.00 | -2.44 | -0.68 | 0.05 | 0.61 | 4.53 | ▂▇▆▁▁ |
ManufacturingProcess18 | 0 | 1 | 0.00 | 1.00 | -13.09 | 0.01 | 0.07 | 0.14 | 0.44 | ▁▁▁▁▇ |
ManufacturingProcess19 | 0 | 1 | 0.00 | 1.00 | -3.03 | -0.60 | -0.14 | 0.48 | 2.58 | ▁▃▇▃▂ |
ManufacturingProcess20 | 0 | 1 | 0.00 | 1.00 | -13.06 | -0.01 | 0.07 | 0.15 | 0.58 | ▁▁▁▁▇ |
ManufacturingProcess21 | 0 | 1 | 0.00 | 1.00 | -2.10 | -0.56 | -0.17 | 0.21 | 4.84 | ▂▇▂▁▁ |
ManufacturingProcess22 | 0 | 1 | 0.00 | 1.00 | -1.62 | -0.72 | -0.12 | 0.78 | 1.98 | ▇▇▇▅▅ |
ManufacturingProcess23 | 0 | 1 | 0.00 | 1.00 | -1.81 | -0.61 | -0.01 | 0.59 | 1.79 | ▇▆▇▆▇ |
ManufacturingProcess24 | 0 | 1 | 0.01 | 1.00 | -1.52 | -0.83 | -0.14 | 0.89 | 2.44 | ▇▇▅▆▁ |
ManufacturingProcess25 | 0 | 1 | 0.00 | 0.99 | -12.93 | 0.00 | 0.07 | 0.13 | 0.43 | ▁▁▁▁▇ |
ManufacturingProcess26 | 0 | 1 | 0.00 | 0.99 | -12.94 | 0.01 | 0.06 | 0.12 | 0.31 | ▁▁▁▁▇ |
ManufacturingProcess27 | 0 | 1 | 0.00 | 0.99 | -12.89 | 0.00 | 0.07 | 0.13 | 0.42 | ▁▁▁▁▇ |
ManufacturingProcess28 | 0 | 1 | -0.03 | 1.00 | -1.26 | -1.26 | 0.73 | 0.78 | 0.94 | ▅▁▁▁▇ |
ManufacturingProcess29 | 0 | 1 | 0.00 | 0.99 | -12.03 | -0.19 | -0.07 | 0.23 | 1.20 | ▁▁▁▁▇ |
ManufacturingProcess30 | 0 | 1 | 0.01 | 0.99 | -9.39 | -0.37 | 0.04 | 0.55 | 2.09 | ▁▁▁▅▇ |
ManufacturingProcess31 | 0 | 1 | 0.00 | 0.99 | -12.63 | -0.02 | 0.11 | 0.22 | 0.42 | ▁▁▁▁▇ |
ManufacturingProcess32 | 0 | 1 | 0.00 | 1.00 | -2.87 | -0.64 | -0.09 | 0.65 | 2.69 | ▁▃▇▃▁ |
ManufacturingProcess33 | 0 | 1 | -0.01 | 0.99 | -3.04 | -0.62 | 0.18 | 0.59 | 2.60 | ▁▃▇▅▁ |
ManufacturingProcess34 | 0 | 1 | -0.01 | 0.99 | -3.56 | 0.12 | 0.12 | 0.12 | 1.96 | ▁▂▁▇▁ |
ManufacturingProcess35 | 0 | 1 | -0.02 | 1.00 | -3.01 | -0.52 | -0.06 | 0.50 | 2.44 | ▁▃▇▅▂ |
ManufacturingProcess36 | 0 | 1 | -0.01 | 0.99 | -2.94 | -0.66 | -0.08 | 0.49 | 2.78 | ▂▇▁▇▃ |
ManufacturingProcess37 | 0 | 1 | 0.00 | 1.00 | -2.28 | -0.70 | -0.03 | 0.64 | 2.89 | ▂▇▇▃▁ |
ManufacturingProcess38 | 0 | 1 | 0.00 | 1.00 | -3.90 | -0.82 | 0.72 | 0.72 | 0.72 | ▁▁▁▅▇ |
ManufacturingProcess39 | 0 | 1 | 0.00 | 1.00 | -4.55 | 0.17 | 0.23 | 0.30 | 0.43 | ▁▁▁▁▇ |
ManufacturingProcess40 | 0 | 1 | 0.00 | 1.00 | -0.46 | -0.46 | -0.46 | -0.46 | 2.15 | ▇▁▁▁▂ |
ManufacturingProcess41 | 0 | 1 | 0.00 | 1.00 | -0.44 | -0.44 | -0.44 | -0.44 | 3.28 | ▇▁▁▁▁ |
ManufacturingProcess42 | 0 | 1 | 0.00 | 1.00 | -5.77 | 0.10 | 0.20 | 0.25 | 0.46 | ▁▁▁▁▇ |
ManufacturingProcess43 | 0 | 1 | 0.00 | 1.00 | -1.05 | -0.36 | -0.13 | 0.13 | 11.62 | ▇▁▁▁▁ |
ManufacturingProcess44 | 0 | 1 | 0.00 | 1.00 | -5.61 | -0.02 | 0.29 | 0.29 | 0.92 | ▁▁▁▁▇ |
ManufacturingProcess45 | 0 | 1 | 0.00 | 1.00 | -5.25 | -0.09 | 0.15 | 0.40 | 1.14 | ▁▁▁▂▇ |
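Beyond skim(), a one-line check confirms the imputation left nothing behind:
# count remaining NAs after imputation (expected: 0)
sum(is.na(chem_imp))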
#split train/test
set.seed(3190)
sample_set <- sample(nrow(chem_imp), round(nrow(chem_imp)*0.75), replace = FALSE)
chem_train <- chem_imp[sample_set, ]
chem_test <- chem_imp[-sample_set, ]
I’ll build a PLS model for this dataset as well. The optimal value of ncomp is 3 in this case.
# build the PLS model
pls_model <- train(chem_train[,-1],
                   chem_train$Yield,
                   method = "pls",
                   tuneLength = 10,
                   trControl = trainControl(method = "cv"))
#plot cross-validated RMSE against ncomp
plot(pls_model$results$RMSE)
Further, at ncomp 3 the RMSE is 0.604, and the R-squared indicates the model can account for 63.46% of the variance in the dataset.
pls_model$results
## ncomp RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 1 0.7001241 0.5177166 0.5793990 0.11519441 0.1728657 0.10464902
## 2 2 0.6174564 0.5998932 0.5162735 0.11499137 0.1512413 0.09839459
## 3 3 0.6035627 0.6345528 0.5088571 0.08866118 0.1364007 0.08299434
## 4 4 0.6105764 0.6445618 0.5086705 0.08645226 0.1436748 0.08094136
## 5 5 0.6241008 0.6388256 0.5235059 0.09492814 0.1416651 0.09419574
## 6 6 0.6323922 0.6353155 0.5195763 0.08685005 0.1340769 0.09081020
## 7 7 0.6442984 0.6372570 0.5272632 0.08708980 0.1325100 0.08681575
## 8 8 0.6558119 0.6356937 0.5391171 0.08788075 0.1367980 0.08201953
## 9 9 0.6646960 0.6300397 0.5510012 0.09263265 0.1394335 0.08702770
## 10 10 0.6796563 0.6148485 0.5540910 0.09946885 0.1462734 0.09034939
Now considering the test data, the RMSE is 0.73 with a lower R-squared of 48.6%.
pred <- predict(pls_model, chem_test)
postResample(pred = pred, obs = chem_test$Yield)
## RMSE Rsquared MAE
## 0.7298442 0.4862441 0.5645561
I found an interesting function, varImp(), that can be used to show the importance of variables across different model types. For PLS, the score is calculated as follows: “the variable importance measure here is based on weighted sums of the absolute regression coefficients. The weights are a function of the reduction of the sums of squares across the number of PLS components and are computed separately for each outcome. Therefore, the contribution of the coefficients are weighted proportionally to the reduction in the sums of squares.”
More here: https://topepo.github.io/caret/variable-importance.html
In my case, the top five predictors are all process variables, followed by two biological materials and then another three process variables, so the process measurements dominate this model.
varImp(pls_model)
## pls variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess09 88.55
## ManufacturingProcess36 86.82
## ManufacturingProcess13 80.74
## ManufacturingProcess17 78.58
## BiologicalMaterial02 68.25
## BiologicalMaterial06 66.27
## ManufacturingProcess11 65.49
## ManufacturingProcess06 60.47
## ManufacturingProcess33 60.40
## BiologicalMaterial08 58.44
## BiologicalMaterial12 58.34
## BiologicalMaterial03 55.59
## BiologicalMaterial04 53.11
## BiologicalMaterial11 53.03
## BiologicalMaterial01 50.44
## ManufacturingProcess15 47.97
## ManufacturingProcess12 47.42
## ManufacturingProcess04 46.70
## ManufacturingProcess34 44.71
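These scores are easier to compare visually; varImp objects have a plot method:
# dotplot of the 20 highest-scoring predictors
plot(varImp(pls_model), top = 20)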
Going back to the imputed dataset and selecting just the top 5 predictors from above, I build a correlation plot. There appear to be strong positive correlations between Yield and ManufacturingProcess32, and negative correlations between Yield and ManufacturingProcess36, ManufacturingProcess13, and ManufacturingProcess17.
library(dplyr)
library(corrplot)
chem_top5 <- chem_imp %>%
  select(ManufacturingProcess32, ManufacturingProcess09, ManufacturingProcess36,
         ManufacturingProcess13, ManufacturingProcess17, Yield)
correlation <- cor(chem_top5)
# plot correlations
corrplot.mixed(correlation, tl.col = 'black', tl.pos = 'lt',
               number.cex = .9, tl.cex = .7)
There are many methods for combining variables that could be used to leverage this information; further, the entire dataset could be checked for high correlations among the predictor variables, with redundant variables removed.
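caret's findCorrelation() implements exactly that check. A sketch, using an arbitrary 0.90 cutoff of my own choosing:
# columns to drop so no pairwise |correlation| among predictors exceeds 0.90
predictors <- chem_imp[, -1]
high_corr <- findCorrelation(cor(predictors), cutoff = 0.90)
if (length(high_corr) > 0) chem_reduced <- predictors[, -high_corr]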