Note: Model descriptions are from Applied Predictive Modeling by Max Kuhn and Kjell Johnson. Solutions are modifications of those posted by Max Kuhn on his public GitHub page. Function descriptions are from the RDocumentation website.
A chemical manufacturing process for a pharmaceutical product [is] discussed [below].
This data set contains information about a chemical manufacturing process, in which the goal is to understand the relationship between the process and the resulting final product yield. Raw material in this process is put through a sequence of 27 steps to make the final pharmaceutical product. The starting material is generated from a biological unit and has a range of quality and characteristics. The objective in this project was to develop a model to predict percent yield of the manufacturing process. The data set consisted of 177 samples of biological material for which 57 characteristics were measured. Of the 57 characteristics, there were 12 measurements of the biological starting material and 45 measurements of the manufacturing process. The process variables included measurements such as temperature, drying time, washing time, and concentrations of by-products at various steps. Some of the process measurements can be controlled, while others are observed. Predictors are continuous, count, categorical; some are correlated, and some contain missing values. Samples are not independent because sets of samples come from the same batch of biological starting material.
- Dimensions: 177 Samples, 57 Predictors.
- Response: Continuous, Balanced/symmetric.
- Predictors: Continuous, Count, Categorical, Correlated/associated, Different scales, Missing values.
In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:
if (!require("pls")) install.packages("pls")
if (!require("fma")) install.packages("fma")
if (!require("VIM")) install.packages("VIM")
if (!require("RANN")) install.packages("RANN")
if (!require("mice")) install.packages("mice")
if (!require("caret")) install.packages("caret")
if (!require("AppliedPredictiveModeling")) install.packages("AppliedPredictiveModeling")
library(pls)
library(fma)
library(VIM)
library(RANN)
library(mice)
library(caret)
library(AppliedPredictiveModeling)
Start R and use these commands to load the data:
data(package = "AppliedPredictiveModeling")$results[, "Item"]
## [1] "ChemicalManufacturingProcess" "abalone"
## [3] "bio (hepatic)" "cars2010 (FuelEconomy)"
## [5] "cars2011 (FuelEconomy)" "cars2012 (FuelEconomy)"
## [7] "chem (hepatic)" "classes (twoClassData)"
## [9] "concrete" "diagnosis (AlzheimerDisease)"
## [11] "fingerprints (permeability)" "injury (hepatic)"
## [13] "logisticCreditPredictions" "mixtures (concrete)"
## [15] "permeability" "predictors (AlzheimerDisease)"
## [17] "predictors (twoClassData)" "schedulingData"
## [19] "segmentationOriginal" "solTestX (solubility)"
## [21] "solTestXtrans (solubility)" "solTestY (solubility)"
## [23] "solTrainX (solubility)" "solTrainXtrans (solubility)"
## [25] "solTrainY (solubility)"
data(ChemicalManufacturingProcess)
[Create a] matrix processPredictors
[that] contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. [Create the vector] yield
[that] contains the percent yield for each run.
The ChemicalManufacturingProcess
dataset is loaded and then separated into a matrix containing the predictors and a vector with the target variable. Next, predictors with degenerate distributions are identified. Degenerate distributions are those where the predictor variable has a single unique value (zero variance) or a handful of unique values that occur with very low frequencies (near-zero variance). The nearZeroVar()
function from the caret
library examines the uniqueness of the data and, by default, returns the positions of the predictors flagged as having zero or near-zero variance (with saveMetrics=TRUE it instead returns a table of diagnostics for each variable). Any degenerate distributions are then plotted with a smoothed color density representation of a scatterplot using the smoothScatter()
function. This is followed by the identification and plotting of highly correlated variables using the findCorrelation()
function and a Scatterplot Matrix. The Scatterplot Matrix is a grid with a scatterplot for each pair of predictors. The diagonal of the matrix indicates which variable spans across the row and column. The last operation on the predictors is checking if a transformation is appropriate by applying the BoxCoxTrans()
function to each predictor. For the response variable, a density plot with an overlaid normal distribution curve is plotted for comparison.
typeof(ChemicalManufacturingProcess)
## [1] "list"
CMP <- as.matrix(ChemicalManufacturingProcess)
rm(ChemicalManufacturingProcess) # list not needed
predictors <- subset(CMP, select=-Yield)
nzv <- nearZeroVar(predictors)
smoothScatter(predictors[, nzv])
table(predictors[, nzv])
##
## 100 100.83
## 173 3
corr <- findCorrelation(cor(predictors, use='complete.obs'))
pairs(predictors[, corr])
cols <- setdiff(1:ncol(predictors), c(nzv, corr))
df <- data.frame(Predictor=character(0), Lambda=numeric(0), stringsAsFactors=F)
k <- 0
for (i in cols) {
  pred <- colnames(predictors)[i]
  nonulls <- na.omit(predictors[, i])
  lambda <- BoxCoxTrans(nonulls)$lambda
  k <- k + 1
  df[k, ] <- c(pred, round(lambda, 1))
}
df[!is.na(df$Lambda), ]
## Predictor Lambda
## 1 BiologicalMaterial01 0.3
## 2 BiologicalMaterial05 0
## 3 BiologicalMaterial06 -1.1
## 4 BiologicalMaterial08 -0.9
## 5 BiologicalMaterial09 2
## 6 BiologicalMaterial10 -1.1
## 7 BiologicalMaterial11 -2
## 10 ManufacturingProcess03 2
## 11 ManufacturingProcess04 2
## 12 ManufacturingProcess05 -2
## 13 ManufacturingProcess06 -2
## 16 ManufacturingProcess09 2
## 17 ManufacturingProcess10 -1.5
## 18 ManufacturingProcess11 1.1
## 20 ManufacturingProcess13 -2
## 21 ManufacturingProcess14 1.3
## 22 ManufacturingProcess15 -2
## 24 ManufacturingProcess17 -2
## 26 ManufacturingProcess19 -2
## 35 ManufacturingProcess32 -1
## 36 ManufacturingProcess33 2
## 37 ManufacturingProcess34 2
## 38 ManufacturingProcess35 2
## 39 ManufacturingProcess36 -0.1
yield <- subset(CMP, select=Yield)
plot(density(yield), main="Density")
polygon(density(yield), col="steelblue")
curve(dnorm(x, mean=mean(yield), sd=sd(yield)), col="red", lwd=2, add=TRUE)
One predictor does have a degenerate distribution with near-zero variance: it contains only a handful of unique values that occur with very low frequencies, specifically the values 100 and 100.83, with the latter appearing just three times. There are several highly correlated predictors with correlations exceeding the \(r=0.9\) cutoff (default). The Scatterplot Matrix shows each pair of highly correlated predictors and their relationships. The BoxCoxTrans()
function indicates that about half of the remaining 47 variables would benefit from a power transformation. Plotting the density of the target variable against a Gaussian curve with the same mean and standard deviation shows that the target's distribution closely matches a normal distribution.
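As a quick, hedged check of that normality claim (not part of the original analysis), a Shapiro-Wilk test and a normal Q-Q plot can be applied to the yield values created above:
# Extra check (assumes the `yield` object from the chunk above is still in scope).
# A large Shapiro-Wilk p-value and a roughly linear Q-Q plot are consistent with
# the near-normal shape seen in the density plot.
shapiro.test(as.vector(yield))
qqnorm(as.vector(yield)); qqline(as.vector(yield), col = "red", lwd = 2)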
A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see here).
The aggr
function in the VIM
package plots and calculates the amount of missing values in each variable. The mice()
function in the mice
package conducts Multivariate Imputation by Chained Equations (MICE) on multivariate datasets with missing values. The function offers over 20 imputation methods that can be applied to the data. The one used with these data is the predictive mean matching method, which is currently the most popular choice in online discussions. After the imputations are made, a complete dataset is created using the complete()
function. The aggr
function from the VIM
package is then rerun for comparison.
data(ChemicalManufacturingProcess)
CMP <- as.matrix(ChemicalManufacturingProcess)
rm(ChemicalManufacturingProcess) # list not needed
aggr(CMP, prop = c(T, T), bars=T, numbers=T)
MICE <- mice(CMP, method="pmm", printFlag=F, seed=624)
aggr(complete(MICE), prop = c(T, T), bars=T, numbers=T)
The visualizations produced by the aggr
function in the VIM
package show a bar chart with the proportion of missing data per variable as well as a grid with the proportion of missing data for combinations of variables. The bar chart shows that several predictor variables have missing values. The grid shows that the combination with no missing values covers 86% of the samples. The remainder of the grid shows missing data for variable combinations, with each row highlighting the missing values for the group of variables listed on the x-axis. Most predictors with missing data have under 5% of their values missing. Imputation is done with Multivariate Imputation by Chained Equations (MICE). MICE assumes values are missing at random and is implemented by imputing missing data for all variables with a simple method, removing the imputations for one variable, re-imputing the removed data using regression, repeating the remove-regress imputation for every other imputed variable, and then continuing the remove-regress loop over the whole dataset \(m\) times. The simple imputation method used here is Predictive Mean Matching (PMM), which “imputes missing values by means of the nearest-neighbor donor with distance based on the expected values of the missing variables conditional on the observed covariates.” After imputing with the mice()
function, no missing values remain in the data.
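A brief follow-up sketch (assuming the MICE object created above is still in scope; ManufacturingProcess03 is picked only as an example of a predictor with missing values) can confirm that claim and compare imputed against observed values:
library(lattice)  # lattice generic used by mice's densityplot method
sum(is.na(CMP))                  # missing cells before imputation
sum(is.na(complete(MICE)))       # should be zero after imputation
densityplot(MICE, ~ ManufacturingProcess03)  # observed vs. imputed densities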
Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
Preprocessing in this section is done entirely with functions from the caret
package. The createDataPartition()
function helps create a series of test and training partitions. The p
parameter sets the percentage of data that goes to training. Declaring list=FALSE
returns a matrix. The preProcess()
function is used to complete several pre-processing steps at once. The method
parameter outlines the preprocessing steps. Possible values are “BoxCox”, “YeoJohnson”, “expoTrans”, “center”, “scale”, “range”, “knnImpute”, “bagImpute”, “medianImpute”, “pca”, “ica”, “spatialSign”, “corr”, “zv”, “nzv”, and “conditionalX”. The ‘BoxCox’ method is a simple, computationally efficient, and effective method for estimating power transformations. The center
method subtracts the mean of the predictor’s data from the predictor values while scale
method divides values by the standard deviation of the predictor. The nzv
method identifies numeric predictor columns having very few unique values by applying nearZeroVar()
and then excludes any “near zero-variance” predictors. The corr
method seeks to filter out highly correlated predictors. A k-nearest neighbor imputation with knnImpute
is carried out by finding the k closest samples (Euclidean distance) in the training set. Using bagImpute
fits a bagged tree model for each predictor and has much higher computational cost. Using medianImpute
is fast, but may be inaccurate. The operations are applied in this order: zero-variance filter, near-zero variance filter, correlation filter, Box-Cox/Yeo-Johnson/exponential transformation, centering, scaling, range, imputation, PCA, ICA then spatial sign. The predict()
function can be used to apply the preprocessing to the training set, but the preprocessing steps can also be specified in the train()
function. The train()
function sets up a grid of tuning parameters for a number of classification and regression routines, fits each model, and calculates a resampling-based performance measure. The function then reports the optimal number of components based on the resampling results.
set.seed(624)
data(ChemicalManufacturingProcess)
CMP <- as.matrix(ChemicalManufacturingProcess)
rm(ChemicalManufacturingProcess) # list not needed
rows_train <- createDataPartition(CMP[, 1], p=0.75, list=F)
CMP_train_X <- CMP[rows_train, -1]
CMP_train_Y <- CMP[rows_train, 1]
nearZeroVar(CMP_train_X, names = T)
## [1] "BiologicalMaterial07"
findCorrelation(cor(CMP_train_X, use="complete.obs"), names=T)
## [1] "BiologicalMaterial02" "BiologicalMaterial06"
## [3] "ManufacturingProcess26" "BiologicalMaterial12"
## [5] "BiologicalMaterial04" "ManufacturingProcess11"
## [7] "ManufacturingProcess20" "ManufacturingProcess40"
preMethods <- c("nzv", "corr", "YeoJohnson", "center", "scale", "knnImpute")
ctrl <- trainControl(method = "boot", number = 25)
(tune <- train(x=CMP_train_X, y=CMP_train_Y,
method="pls", preProcess= preMethods, tuneLength=15, trControl=ctrl))
## Partial Least Squares
##
## 132 samples
## 57 predictor
##
## Pre-processing: Yeo-Johnson transformation (48), centered (48),
## scaled (48), nearest neighbor imputation (48), remove (9)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 1.367509 0.3789625 1.129168
## 2 1.352929 0.4205792 1.084536
## 3 1.343566 0.4452650 1.068323
## 4 1.467395 0.4165241 1.110620
## 5 1.593476 0.3915152 1.144274
## 6 1.654527 0.3839493 1.175995
## 7 1.795043 0.3672247 1.204510
## 8 1.907951 0.3567526 1.240288
## 9 2.014495 0.3477422 1.270906
## 10 2.102822 0.3387973 1.294336
## 11 2.144276 0.3367535 1.305348
## 12 2.147608 0.3337261 1.312182
## 13 2.156899 0.3343572 1.318573
## 14 2.151630 0.3307468 1.333819
## 15 2.126877 0.3264620 1.341226
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.
plot(tune)
As noted in Exercise 6.3.a, there is one near-zero variance predictor and several highly correlated predictors. The near-zero variance predictor has a degenerate distribution containing only a handful of unique values that occur with very low frequencies: specifically, the values 100 and 100.83, with the latter appearing just three times. The highly correlated predictors have correlations exceeding the \(r=0.9\) cutoff (default). When transforming, Box-Cox power transformations work only for strictly positive numbers, \(\mathbb{R}^+\). Although all the variables are non-negative, many include zero. Therefore the YeoJohnson
method, which also accommodates zero and negative values, is used instead. The full pre-processing also performed centering and scaling. The regression model chosen is in accordance with Kuhn (2013), who states that:
Partial Least Squares (PLS) is essentially a supervised version of Principal Component Analysis (PCA). PCA is unsupervised in that it does not consider any aspects of the response when it selects its components. Instead, it simply chases the variability present throughout the predictor space. If that variability happens to be related to the response variability, then Principal Component Regression (PCR) has a good chance to identify a predictive relationship. If, however, the variability in the predictor space is not related to the variability of the response, then PCR can have difficulty identifying a predictive relationship when one might actually exist. Because of this inherent problem with PCR, it is recommended that PLS be used when there are correlated predictors and a linear regression-type solution is desired.
Tuning this pls
model on the preprocessed data indicates that a Partial Least Squares Regression (PLSR) model with three components leads to the optimal value in the RMSE performance metric.
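The selected tuning result can also be read directly from the fitted caret object, and, to illustrate the PCR-versus-PLS argument quoted above, a PCR model can be tuned under the same settings. This is an illustrative sketch rather than part of the original solution; pcr_tune is a hypothetical name, and CMP_train_X, CMP_train_Y, preMethods, and ctrl from the chunk above are assumed to still be in scope.
tune$bestTune                                      # component count chosen by RMSE
subset(tune$results, ncomp == tune$bestTune$ncomp) # its resampled RMSE, R-squared, MAE
# Hypothetical comparison: principal component regression with identical
# preprocessing, tuning length, and bootstrap control.
set.seed(624)
pcr_tune <- train(x=CMP_train_X, y=CMP_train_Y, method="pcr",
                  preProcess=preMethods, tuneLength=15, trControl=ctrl)
pcr_tune$results[which.min(pcr_tune$results$RMSE), ]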
Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
Pre-processing in this section is done entirely with functions from the caret
package. The createDataPartition()
function helps create a series of test and training partitions. The p
parameter sets the percentage of data that goes to training. Declaring list=FALSE
returns a matrix. The preProcess()
function is used to complete several pre-processing steps at once. The method
parameter outlines the preprocessing steps. Possible values are “BoxCox”, “YeoJohnson”, “expoTrans”, “center”, “scale”, “range”, “knnImpute”, “bagImpute”, “medianImpute”, “pca”, “ica”, “spatialSign”, “corr”, “zv”, “nzv”, and “conditionalX”. The ‘BoxCox’ method is a simple, computationally efficient, and effective method for estimating power transformations. The center
method subtracts the mean of the predictor’s data from the predictor values while scale
method divides values by the standard deviation of the predictor. The nzv
method identifies numeric predictor columns having very few unique values by applying nearZeroVar()
and then excludes any “near zero-variance” predictors. The corr
method seeks to filter out highly correlated predictors. A k-nearest neighbor imputation with knnImpute
is carried out by finding the k closest samples (Euclidean distance) in the training set. Using bagImpute
fits a bagged tree model for each predictor and has much higher computational cost. Using medianImpute
is fast, but may be inaccurate. The operations are applied in this order: zero-variance filter, near-zero variance filter, correlation filter, Box-Cox/Yeo-Johnson/exponential transformation, centering, scaling, range, imputation, PCA, ICA then spatial sign. The plsr()
function fits a Partial Least Squares Regression (PLSR) model with the number of components specified in the ncomp
argument. The defaultSummary()
function calculates the performance of the model across resamples.
set.seed(624)
data(ChemicalManufacturingProcess)
CMP <- as.matrix(ChemicalManufacturingProcess)
rm(ChemicalManufacturingProcess) # list not needed
rows_train <- createDataPartition(CMP[, 1], p=0.75, list=F)
CMP_train_X <- CMP[rows_train, -1]
CMP_train_Y <- CMP[rows_train, 1]
CMP_test_X <- CMP[-rows_train, -1]
CMP_test_Y <- CMP[-rows_train, 1]
preMethods <- c("nzv", "corr", "YeoJohnson", "center", "scale", "knnImpute")
preprocessed <- preProcess(CMP_train_X, verbose=T, method = preMethods)
## 1 near-zero variance predictors were removed.
## 8 highly correlated predictors were removed.
## all of the transformations failed
## Calculating 48 means for centering
## Calculating 48 standard deviations for scaling
CMP_train_X_pp <- predict(preprocessed, CMP_train_X)
CMP_test_X_pp <- predict(preprocessed, CMP_test_X)
fit <- plsr(CMP_train_Y ~ CMP_train_X_pp)
# predict() on a plsr fit returns a 3D array; flatten it to a vector.
predictions <- as.vector(predict(fit, CMP_test_X_pp, ncomp = 3))
obs_vs_pred <- data.frame(obs = CMP_test_Y, pred = predictions)
defaultSummary(obs_vs_pred)
## RMSE Rsquared MAE
## 2.0537798 0.3507019 1.2277052
As noted in Exercise 6.3.a, there is one near-zero variance predictor and several highly correlated predictors. The near-zero variance predictor has a degenerate distribution containing only a handful of unique values that occur with very low frequencies: specifically, the values 100 and 100.83, with the latter appearing just three times. The highly correlated predictors have correlations exceeding the \(r=0.9\) cutoff (default). When transforming, Box-Cox power transformations work only for strictly positive numbers, \(\mathbb{R}^+\). Although all the variables are non-negative, many include zero. Therefore the YeoJohnson
method, which also accommodates zero and negative values, is used instead. The full pre-processing also performed centering and scaling. In Exercise 6.3.c, it was determined that a Partial Least Squares Regression (PLSR) model with three components leads to the optimal value of the RMSE performance metric. The RMSE for this three-component PLSR model on the test set is 2.0537798, compared to 1.343566 for the resampled RMSE on the training set. Eight other candidate models have resampled RMSE values below this test-set value, but resampled performance estimates tend to be optimistic because they are computed from resamples of the same data used to fit the model.
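The comparison described above can be computed directly. A minimal sketch, assuming the tune object from Exercise 6.3.c and the obs_vs_pred data frame created above are still in the session:
train_rmse <- tune$results$RMSE[tune$results$ncomp == 3]   # resampled estimate
test_rmse  <- unname(defaultSummary(obs_vs_pred)["RMSE"])  # held-out estimate
c(resampled = train_rmse, test = test_rmse, optimism = test_rmse - train_rmse)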
Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
Pre-processing in this section is done entirely with functions from the caret
package. The createDataPartition()
function helps create a series of test and training partitions. The p
parameter sets the percentage of data that goes to training. Declaring list=FALSE
returns a matrix. The preProcess()
function is used to complete several pre-processing steps at once. The method
parameter outlines the preprocessing steps. Possible values are “BoxCox”, “YeoJohnson”, “expoTrans”, “center”, “scale”, “range”, “knnImpute”, “bagImpute”, “medianImpute”, “pca”, “ica”, “spatialSign”, “corr”, “zv”, “nzv”, and “conditionalX”. The ‘BoxCox’ method is a simple, computationally efficient, and effective method for estimating power transformations. The center
method subtracts the mean of the predictor’s data from the predictor values while scale
method divides values by the standard deviation of the predictor. The nzv
method identifies numeric predictor columns having very few unique values by applying nearZeroVar()
and then excludes any “near zero-variance” predictors. The corr
method seeks to filter out highly correlated predictors. A k-nearest neighbor imputation with knnImpute
is carried out by finding the k closest samples (Euclidean distance) in the training set. Using bagImpute
fits a bagged tree model for each predictor and has much higher computational cost. Using medianImpute
is fast, but may be inaccurate. The operations are applied in this order: zero-variance filter, near-zero variance filter, correlation filter, Box-Cox/Yeo-Johnson/exponential transformation, centering, scaling, range, imputation, PCA, ICA then spatial sign. The predict()
function applies the results to a set of data. The train()
function sets up a grid of tuning parameters for a number of classification and regression routines, fits each model, and calculates a resampling based performance measure. The varImp()
function calculates the variable importance for a given model. The dotPlot()
function creates a dot plot of variable importance values.
data(ChemicalManufacturingProcess)
CMP <- as.matrix(ChemicalManufacturingProcess)
rm(ChemicalManufacturingProcess) # list not needed
set.seed(624)
rows_train <- createDataPartition(CMP[, 1], p=0.75, list=F)
CMP_train_X <- CMP[rows_train, -1]
CMP_train_Y <- CMP[rows_train, 1]
set.seed(624)
preMethods <- c("nzv", "corr", "YeoJohnson", "center", "scale", "knnImpute")
ctrl <- trainControl(method = "boot", number = 25)
tune <- train(x=CMP_train_X, y=CMP_train_Y, method="pls",
preProcess= preMethods, tuneLength=15, trControl=ctrl)
(varimp <- varImp(tune))
## pls variable importance
##
## only 20 most important variables shown (out of 48)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess36 76.83
## ManufacturingProcess13 69.05
## ManufacturingProcess33 65.92
## ManufacturingProcess09 65.82
## BiologicalMaterial06 65.44
## ManufacturingProcess17 60.56
## BiologicalMaterial03 58.60
## ManufacturingProcess06 55.35
## BiologicalMaterial08 53.03
## BiologicalMaterial01 51.87
## ManufacturingProcess31 51.73
## BiologicalMaterial12 50.87
## BiologicalMaterial11 49.74
## ManufacturingProcess29 46.60
## ManufacturingProcess02 44.34
## ManufacturingProcess04 43.81
## ManufacturingProcess28 42.04
## ManufacturingProcess12 39.82
## ManufacturingProcess30 37.60
dotPlot(varimp, top=15)
As noted in Exercise 6.3.a, there is one near-zero variance predictor and several highly correlated predictors. The near-zero variance predictor has a degenerate distribution containing only a handful of unique values that occur with very low frequencies: specifically, the values 100 and 100.83, with the latter appearing just three times. The highly correlated predictors have correlations exceeding the \(r=0.9\) cutoff (default). When transforming, Box-Cox power transformations work only for strictly positive numbers, \(\mathbb{R}^+\). Although all the variables are non-negative, many include zero. Therefore the YeoJohnson
method, which also accommodates zero and negative values, is used instead. The full pre-processing also performed centering and scaling. In Exercise 6.3.c, it was determined that a Partial Least Squares Regression (PLSR) model with three components leads to the optimal value of the RMSE performance metric. In a PLS model, predictor importance is proportional to the normalized weights and to the amount of response variation explained by each component. The varImp()
function output and its corresponding dotPlot()
show that the manufacturing process predictors are the most important and dominate the list.
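A short sketch (helper code not in the original) makes that dominance explicit by tallying predictor types among the 20 most important variables in the varimp object created above:
imp <- varimp$importance                                        # data frame with an Overall column
top20 <- rownames(imp)[order(imp$Overall, decreasing = TRUE)][1:20]
table(ifelse(grepl("^ManufacturingProcess", top20), "Process", "Biological"))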
Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?
Pre-processing in this section is done entirely with functions from the caret
package. The createDataPartition()
function helps create a series of test and training partitions. The p
parameter sets the percentage of data that goes to training. Declaring list=FALSE
returns a matrix. The preProcess()
function is used to complete several pre-processing steps at once. The method
parameter outlines the preprocessing steps. Possible values are “BoxCox”, “YeoJohnson”, “expoTrans”, “center”, “scale”, “range”, “knnImpute”, “bagImpute”, “medianImpute”, “pca”, “ica”, “spatialSign”, “corr”, “zv”, “nzv”, and “conditionalX”. The ‘BoxCox’ method is a simple, computationally efficient, and effective method for estimating power transformations. The center
method subtracts the mean of the predictor’s data from the predictor values while scale
method divides values by the standard deviation of the predictor. The nzv
method identifies numeric predictor columns having very few unique values by applying nearZeroVar()
and then excludes any “near zero-variance” predictors. The corr
method seeks to filter out highly correlated predictors. A k-nearest neighbor imputation with knnImpute
is carried out by finding the k closest samples (Euclidean distance) in the training set. Using bagImpute
fits a bagged tree model for each predictor and has much higher computational cost. Using medianImpute
is fast, but may be inaccurate. The operations are applied in this order: zero-variance filter, near-zero variance filter, correlation filter, Box-Cox/Yeo-Johnson/exponential transformation, centering, scaling, range, imputation, PCA, ICA then spatial sign. The predict()
function applies the results to a set of data. The train()
function sets up a grid of tuning parameters for a number of classification and regression routines, fits each model, and calculates a resampling based performance measure. The varImp()
function calculates the variable importance for a given model. Examining and understanding the correlations between the top predictors and the response variable can help improve the yield in future runs of the manufacturing process.
data(ChemicalManufacturingProcess)
CMP <- as.matrix(ChemicalManufacturingProcess)
rm(ChemicalManufacturingProcess) # list not needed
set.seed(624)
rows_train <- createDataPartition(CMP[, 1], p=0.75, list=F)
CMP_train_X <- CMP[rows_train, -1]
CMP_train_Y <- CMP[rows_train, 1]
set.seed(624)
preMethods <- c("nzv", "corr", "YeoJohnson", "center", "scale", "knnImpute")
ctrl <- trainControl(method = "boot", number = 25)
tune <- train(x=CMP_train_X, y=CMP_train_Y, method="pls",
preProcess= preMethods, tuneLength=15, trControl=ctrl)
varImp(tune)
## pls variable importance
##
## only 20 most important variables shown (out of 48)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess36 76.83
## ManufacturingProcess13 69.05
## ManufacturingProcess33 65.92
## ManufacturingProcess09 65.82
## BiologicalMaterial06 65.44
## ManufacturingProcess17 60.56
## BiologicalMaterial03 58.60
## ManufacturingProcess06 55.35
## BiologicalMaterial08 53.03
## BiologicalMaterial01 51.87
## ManufacturingProcess31 51.73
## BiologicalMaterial12 50.87
## BiologicalMaterial11 49.74
## ManufacturingProcess29 46.60
## ManufacturingProcess02 44.34
## ManufacturingProcess04 43.81
## ManufacturingProcess28 42.04
## ManufacturingProcess12 39.82
## ManufacturingProcess30 37.60
important <- c('ManufacturingProcess32',
'ManufacturingProcess36',
'ManufacturingProcess13',
'ManufacturingProcess33',
'ManufacturingProcess09')
cor(CMP_train_X[, important], CMP_train_Y, use="complete.obs")
## [,1]
## ManufacturingProcess32 0.6238960
## ManufacturingProcess36 -0.5259015
## ManufacturingProcess13 -0.4394630
## ManufacturingProcess33 0.4745657
## ManufacturingProcess09 0.4325716
As noted in Exercise 6.3.a, there is one near-zero variance predictor and several highly correlated predictors. The near-zero variance predictor has a degenerate distribution containing only a handful of unique values that occur with very low frequencies: specifically, the values 100 and 100.83, with the latter appearing just three times. The highly correlated predictors have correlations exceeding the \(r=0.9\) cutoff (default). When transforming, Box-Cox power transformations work only for strictly positive numbers, \(\mathbb{R}^+\). Although all the variables are non-negative, many include zero. Therefore the YeoJohnson
method, which also accommodates zero and negative values, is used instead. The full pre-processing also performed centering and scaling. In Exercise 6.3.c, it was determined that a Partial Least Squares Regression (PLSR) model with three components leads to the optimal value of the RMSE performance metric. In a PLS model, predictor importance is proportional to the normalized weights and to the amount of response variation explained by each component. In Exercise 6.3.e, it was determined that the manufacturing process predictors are the most important. The top five predictors show moderate correlations with the response variable; some are negative and others are positive. Understanding the strength and direction of these relationships can help improve yield in future runs of the manufacturing process by revealing the expected impact of adjustments to the controllable process steps.
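As an illustrative follow-up (not part of the original write-up), these relationships can be visualized with caret's featurePlot(), using the important vector and training data defined above; the loess smoother makes the direction of each association visible.
featurePlot(x = CMP_train_X[, important], y = CMP_train_Y,
            plot = "scatter", type = c("p", "smooth"), layout = c(3, 2))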
https://rpubs.com/josezuniga/358601
https://rpubs.com/josezuniga/358606
https://rpubs.com/josezuniga/358605
http://appliedpredictivemodeling.com/
https://topepo.github.io/caret/pre-processing.html
https://cran.r-project.org/web/packages/caret/caret.pdf
http://appliedpredictivemodeling.com/blog/2014/11/12/solutions-on-github
http://ftp.uni-bayreuth.de/math/statlib/R/CRAN/doc/vignettes/caret/caretVarImp.pdf