This homework assignment contains seven questions from the book 'Applied Predictive Modeling' by Max Kuhn and Kjell Johnson:
6.3 // 7.2 // 7.5 // 8.1 // 8.2 // 8.3 // 8.7 and the Market Basket Analysis / Recommender Systems
Question 6.3:
A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1 % will boost revenue by approximately one hundred thousand dollars per batch:
(a) Start R and use these commands to load the data:
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
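The loading commands themselves are not reproduced in this excerpt; presumably the data were loaded as below (the ChemicalManufacturingProcess data frame ships with the AppliedPredictiveModeling package, and code later in this document calls data(ChemicalManufacturingProcess) the same way):
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)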
(b) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).
library(caret)
library(RANN)
preP <- preProcess(ChemicalManufacturingProcess, method = c("knnImpute"))
df <- predict(preP, ChemicalManufacturingProcess)
df$Yield = ChemicalManufacturingProcess$Yield
any(is.na(df))
## [1] FALSE
In the above code, I created a pre-processing object named preP using the preProcess function from the caret package, specifying k-nearest-neighbor imputation (knnImpute) to fill in the missing values in ChemicalManufacturingProcess. The object preP stores the transformations to be applied to the data. I then applied preP to the original data frame with the predict function, which imputes the missing values (and applies any other transformations specified in preP); the result is assigned to df. Next, I copied the Yield column from the original data frame back into df so that the response keeps its original scale. Finally, I checked whether any missing values (NA) remain in df; because the result is FALSE, all missing values were successfully imputed.
(c) Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
preP <- preProcess(ChemicalManufacturingProcess, method = c("BoxCox", "knnImpute"))
df <- predict(preP, ChemicalManufacturingProcess)
df$Yield = ChemicalManufacturingProcess$Yield
trainRows <- createDataPartition(df$Yield, p = .80, list = FALSE)
df.train <- df[trainRows,]
df.test <- df[-trainRows,]
colYield <-
which(colnames(ChemicalManufacturingProcess) == "Yield")
ctrl <- trainControl(method = "cv", number = 5)
mdl <- train(
x = df.train[, -colYield],
y = df.train$Yield,
method = "pls",
tuneLength = 20,
trControl = ctrl
)
mdl
## Partial Least Squares
##
## 144 samples
## 57 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 116, 115, 115, 115, 115
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 1.420512 0.4551606 1.1552156
## 2 1.246287 0.5569653 1.0301604
## 3 1.202620 0.5960259 0.9988938
## 4 1.205222 0.5959634 0.9949146
## 5 1.241749 0.5819981 1.0217219
## 6 1.268362 0.5757635 1.0371681
## 7 1.282577 0.5725722 1.0234249
## 8 1.304858 0.5666444 1.0457709
## 9 1.312207 0.5716937 1.0559618
## 10 1.323910 0.5725190 1.0639483
## 11 1.327343 0.5738286 1.0622661
## 12 1.336390 0.5751951 1.0665269
## 13 1.345094 0.5709718 1.0724646
## 14 1.344407 0.5691308 1.0683000
## 15 1.340058 0.5769834 1.0567306
## 16 1.360139 0.5718526 1.0592496
## 17 1.372095 0.5669017 1.0660666
## 18 1.381210 0.5634525 1.0730942
## 19 1.381617 0.5643265 1.0752356
## 20 1.397123 0.5578352 1.0811591
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.
In the above code, I created a preprocessing object named preP with the preProcess function from the caret package, this time specifying two methods: BoxCox, which applies a power transformation to make the predictors more symmetric, and knnImpute, which imputes missing values using k-nearest neighbors. Applying preP to ChemicalManufacturingProcess with predict produces the transformed data set df, and the Yield column from the original data is copied back into df.
I then used createDataPartition on the Yield column to split the data, assigning 80% of the rows (trainRows) to the training set df.train and the remaining 20% to the test set df.test.
Finally, I set up 5-fold cross-validation with trainControl and trained a partial least squares (PLS) regression model with the train function, using all columns of df.train except Yield as predictors (x) and Yield as the response (y). The method is specified as "pls", tuneLength = 20 tells train to evaluate up to 20 PLS components, and trControl = ctrl supplies the cross-validation settings.
library(glue)
mdl.ncomp <- mdl[["bestTune"]][["ncomp"]]
glue("The optimal value (for the number of PLS components) with respect to RMSE performance metric is {mdl.ncomp}.")## The optimal value (for the number of PLS components) with respect to RMSE performance metric is 3.
In the above code, I extracted the optimal number of Partial Least Squares (PLS) components from the trained model mdl. In the caret package, the bestTune element of a trained model object stores the optimal parameter values selected during the tuning process. In this case, ncomp represents the number of PLS components selected as optimal. The value is assigned to the variable mdl.ncomp. I then used the glue function from the “glue” package to interpolate the value of mdl.ncomp into a string. It creates a formatted message stating the optimal value of the number of PLS components selected based on the Root Mean Square Error (RMSE) performance metric.
(d) Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
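The prediction and evaluation code is not shown in this excerpt; a minimal sketch of what likely produced the output below, using predict, postResample, and getTrainPerf from caret (plsPred is a hypothetical name):
# Predict Yield for the held-out test set and compute test-set metrics
plsPred <- predict(mdl, newdata = df.test[, -colYield])
postResample(pred = plsPred, obs = df.test$Yield)
# Resampled (cross-validated) performance on the training set, for comparison
getTrainPerf(mdl)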
## RMSE Rsquared MAE
## 1.1145997 0.4914468 0.8739560
## TrainRMSE TrainRsquared TrainMAE method
## 1 1.20262 0.5960259 0.9988938 pls
In the above code, I evaluated the performance of the trained model (mdl) on the test set (df.test) by comparing the predicted values with the actual Yield values, and I retrieved the resampled performance of the model on the training set with the getTrainPerf function. The test-set root mean squared error (RMSE of about 1.11) is slightly lower than the resampled RMSE on the training set (about 1.20), so the model generalizes at least as well as the cross-validation results suggested.
(e) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
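The importance calculation itself is not reproduced here; a sketch consistent with the description below, using caret's varImp (plsImp is a hypothetical name):
# Variable importance for the tuned PLS model; plot the top 15 predictors
plsImp <- varImp(mdl)
plot(plsImp, top = 15)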
In the above code, I calculated variable importance scores for the predictors in the trained model and plotted the top 15 predictors by importance. The plot shows the relative influence of each predictor on the model and helps identify which variables matter most for its performance. The seven most important predictors in the model I trained are all manufacturing process variables, so the process predictors dominate the list rather than the biological ones.
(f) Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?
library(dplyr)
library(tidyverse)
correlation <- cor(select(df, 'ManufacturingProcess32','ManufacturingProcess36','BiologicalMaterial08','Yield'))
corrplot::corrplot(correlation, method='square', type="upper")In the above code, I calculated the correlation coefficients between selected variables in the dataset. I then uses the corrplot package to create a correlation plot, visualizing the relationships between these variables. The resulting plot provides an overview of the pairwise correlations and can help identify patterns and dependencies between the variables.The correlation plots above illustrated that several predictor variables possess substantial positive or negative associations with the response variable, with the most significant variables exhibiting the strongest correlations. However, a few variables that the model identified as important do not exhibit such strong correlations with the response variable. Comprehensive understanding of the variables exhibiting positive or negative correlations with yield may facilitate improvement in the manufacturing process by enabling necessary adjustments to increase yield.
Question 7.2: Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data: y = 10 sin(π x1 x2) + 20 (x3 − 0.5)^2 + 10 x4 + 5 x5 + N(0, σ^2), where the x values are random variables uniformly distributed on [0, 1] (there are also 5 other non-informative variables created in the simulation). The package mlbench contains a function called mlbench.friedman1 that simulates these data:
library(mlbench)
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
trainingData$x <- data.frame(trainingData$x)
featurePlot(trainingData$x, trainingData$y)
Tune several models on these data. For example:
library(caret)
knnModel <- train(x = trainingData$x,
y = trainingData$y,
method = "knn",
preProc = c("center", "scale"),
tuneLength = 10)
knnModel
## k-Nearest Neighbors
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 3.466085 0.5121775 2.816838
## 7 3.349428 0.5452823 2.727410
## 9 3.264276 0.5785990 2.660026
## 11 3.214216 0.6024244 2.603767
## 13 3.196510 0.6176570 2.591935
## 15 3.184173 0.6305506 2.577482
## 17 3.183130 0.6425367 2.567787
## 19 3.198752 0.6483184 2.592683
## 21 3.188993 0.6611428 2.588787
## 23 3.200458 0.6638353 2.604529
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.
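The test set testData used below is not created in the code shown above; presumably it was simulated following the book's setup, along these lines:
# Simulate a large test set to estimate the true error rate with good precision
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)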
knnPred <- predict(knnModel, newdata = testData$x)
## The function 'postResample' can be used to get the test set
## performance values
postResample(pred = knnPred, obs = testData$y)
## RMSE Rsquared MAE
## 3.2040595 0.6819919 2.5683461
Which models appear to give the best performance? Does MARS select the informative predictors (those named X1–X5)?
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
marsTune <- train(trainingData$x, trainingData$y,
method = "earth",
tuneGrid = marsGrid,
trControl = trainControl(method = "cv"))
marsTune
## Multivariate Adaptive Regression Spline
##
## 200 samples
## 10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 4.462296 0.2176253 3.697979
## 1 3 3.720663 0.4673821 2.949121
## 1 4 2.680039 0.7094916 2.123848
## 1 5 2.333538 0.7781559 1.856629
## 1 6 2.367933 0.7754329 1.901509
## 1 7 1.809983 0.8656526 1.414985
## 1 8 1.692656 0.8838936 1.333678
## 1 9 1.704958 0.8845683 1.339517
## 1 10 1.688559 0.8842495 1.309838
## 1 11 1.669043 0.8886165 1.296522
## 1 12 1.645066 0.8892796 1.271981
## 1 13 1.655570 0.8886896 1.271232
## 1 14 1.666354 0.8879143 1.285545
## 1 15 1.666354 0.8879143 1.285545
## 1 16 1.666354 0.8879143 1.285545
## 1 17 1.666354 0.8879143 1.285545
## 1 18 1.666354 0.8879143 1.285545
## 1 19 1.666354 0.8879143 1.285545
## 1 20 1.666354 0.8879143 1.285545
## 1 21 1.666354 0.8879143 1.285545
## 1 22 1.666354 0.8879143 1.285545
## 1 23 1.666354 0.8879143 1.285545
## 1 24 1.666354 0.8879143 1.285545
## 1 25 1.666354 0.8879143 1.285545
## 1 26 1.666354 0.8879143 1.285545
## 1 27 1.666354 0.8879143 1.285545
## 1 28 1.666354 0.8879143 1.285545
## 1 29 1.666354 0.8879143 1.285545
## 1 30 1.666354 0.8879143 1.285545
## 1 31 1.666354 0.8879143 1.285545
## 1 32 1.666354 0.8879143 1.285545
## 1 33 1.666354 0.8879143 1.285545
## 1 34 1.666354 0.8879143 1.285545
## 1 35 1.666354 0.8879143 1.285545
## 1 36 1.666354 0.8879143 1.285545
## 1 37 1.666354 0.8879143 1.285545
## 1 38 1.666354 0.8879143 1.285545
## 2 2 4.440854 0.2204755 3.686796
## 2 3 3.697203 0.4714312 2.938566
## 2 4 2.664266 0.7149235 2.119566
## 2 5 2.313371 0.7837374 1.852172
## 2 6 2.335796 0.7875253 1.841919
## 2 7 1.833780 0.8622906 1.462210
## 2 8 1.688673 0.8883137 1.325754
## 2 9 1.557314 0.9002634 1.234207
## 2 10 1.463018 0.9133897 1.174354
## 2 11 1.350247 0.9265882 1.099432
## 2 12 1.305955 0.9344683 1.049853
## 2 13 1.261130 0.9357469 1.017123
## 2 14 1.286463 0.9315381 1.040156
## 2 15 1.337104 0.9297651 1.069602
## 2 16 1.337560 0.9294593 1.067973
## 2 17 1.318152 0.9322833 1.054436
## 2 18 1.324331 0.9319631 1.055971
## 2 19 1.324331 0.9319631 1.055971
## 2 20 1.324331 0.9319631 1.055971
## 2 21 1.324331 0.9319631 1.055971
## 2 22 1.324331 0.9319631 1.055971
## 2 23 1.324331 0.9319631 1.055971
## 2 24 1.324331 0.9319631 1.055971
## 2 25 1.324331 0.9319631 1.055971
## 2 26 1.324331 0.9319631 1.055971
## 2 27 1.324331 0.9319631 1.055971
## 2 28 1.324331 0.9319631 1.055971
## 2 29 1.324331 0.9319631 1.055971
## 2 30 1.324331 0.9319631 1.055971
## 2 31 1.324331 0.9319631 1.055971
## 2 32 1.324331 0.9319631 1.055971
## 2 33 1.324331 0.9319631 1.055971
## 2 34 1.324331 0.9319631 1.055971
## 2 35 1.324331 0.9319631 1.055971
## 2 36 1.324331 0.9319631 1.055971
## 2 37 1.324331 0.9319631 1.055971
## 2 38 1.324331 0.9319631 1.055971
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 13 and degree = 2.
In the above code, I created a grid of tuning parameters for the MARS model, performed model training using the “train” function, and stored the trained model in marsTune. The grid search over the parameter combinations allows for finding the optimal values for degree and nprune using cross-validation.
svmRTune <- train(trainingData$x, trainingData$y,
method = "svmRadial",
preProc = c("center", "scale"),
tuneLength = 14,
trControl = trainControl(method = "cv"))
svmRTune
## Support Vector Machines with Radial Basis Function Kernel
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 2.475865 0.7988016 1.992183
## 0.50 2.214599 0.8166629 1.774775
## 1.00 2.046869 0.8398392 1.623489
## 2.00 1.953012 0.8519284 1.552263
## 4.00 1.891812 0.8587464 1.523110
## 8.00 1.875676 0.8604855 1.532308
## 16.00 1.879154 0.8595207 1.542176
## 32.00 1.879022 0.8595391 1.542105
## 64.00 1.879022 0.8595391 1.542105
## 128.00 1.879022 0.8595391 1.542105
## 256.00 1.879022 0.8595391 1.542105
## 512.00 1.879022 0.8595391 1.542105
## 1024.00 1.879022 0.8595391 1.542105
## 2048.00 1.879022 0.8595391 1.542105
##
## Tuning parameter 'sigma' was held constant at a value of 0.06437208
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.06437208 and C = 8.
In the above code, I trained an SVM model with a radial kernel on the training data. It applied centering and scaling as preprocessing steps, explored different combinations of tuning parameters, and performed cross-validation to evaluate model performance. The resulting model is stored in svmRTune.
tooHigh <- findCorrelation(cor(trainingData$x), cutoff = .75)
nnetGrid <- expand.grid(.decay = c(0, 0.01, .1),
.size = c(1:10))
ctrl <- trainControl(method = "cv", number = 10)
nnetTune <- train(trainingData$x, trainingData$y,
method = "nnet",
tuneGrid = nnetGrid,
trControl = ctrl,
preProc = c("center", "scale"),
linout = TRUE,
trace = FALSE,
MaxNWts = 10 * (ncol(trainingData$x) + 1) + 10 + 1,
maxit = 500)
nnetTune
## Neural Network
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## decay size RMSE Rsquared MAE
## 0.00 1 2.580768 0.7284673 2.035490
## 0.00 2 2.680611 0.7349533 2.163387
## 0.00 3 2.320043 0.7890496 1.862655
## 0.00 4 2.516530 0.7604025 2.036872
## 0.00 5 3.429278 0.6352466 2.435958
## 0.00 6 3.591351 0.6611452 2.572580
## 0.00 7 4.451186 0.6202449 2.893162
## 0.00 8 5.809870 0.4546762 3.506995
## 0.00 9 12.995439 0.3434619 5.442580
## 0.00 10 6.073681 0.5401981 3.646143
## 0.01 1 2.548256 0.7510589 1.995186
## 0.01 2 2.655808 0.7368365 2.091566
## 0.01 3 2.456279 0.7668390 2.041680
## 0.01 4 2.467229 0.7750489 1.881964
## 0.01 5 2.821335 0.7068607 2.234548
## 0.01 6 2.963347 0.6910225 2.264453
## 0.01 7 3.135325 0.6545929 2.486553
## 0.01 8 3.069100 0.6838123 2.386589
## 0.01 9 3.540203 0.6290499 2.849426
## 0.01 10 3.505342 0.5919074 2.773600
## 0.10 1 2.454455 0.7727431 1.912887
## 0.10 2 2.636746 0.7358911 2.118765
## 0.10 3 2.168554 0.8146493 1.697511
## 0.10 4 2.285694 0.8014563 1.834426
## 0.10 5 2.643277 0.7389948 2.161553
## 0.10 6 2.557881 0.7709643 2.071306
## 0.10 7 2.710028 0.7427555 2.184003
## 0.10 8 3.038225 0.6609506 2.491278
## 0.10 9 2.990543 0.7005036 2.372746
## 0.10 10 3.021560 0.6767638 2.445791
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 3 and decay = 0.1.
In the above code, I identified highly correlated predictors (tooHigh, although no columns are actually dropped in this block), created a grid of tuning parameters for the neural network, trained the network with 10-fold cross-validation, and stored the trained model in nnetTune. The tuning process finds the best combination of size and decay, while centering and scaling are applied to the data as preprocessing.
After looking at the models:
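The importance listing below was presumably produced with caret's varImp on the tuned MARS model, e.g.:
varImp(marsTune)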
## earth variable importance
##
## Overall
## X1 100.00
## X4 75.33
## X2 48.88
## X5 15.63
## X3 0.00
MARS appears to give the best performance since it has the lowest RMSE and MAE and the highest R². The second best is the radial SVM. MARS does select the informative predictors X1–X5; however, X3 receives essentially zero importance.
Question 7.5 Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.
(a) Which nonlinear regression model gives the optimal resampling and test set performance?
set.seed(42)
index <- createDataPartition(df$Yield, p = .8, list = FALSE)
train_x <- df[index, -1]
train_y <- df[index, 1]
test_x <- df[-index, -1]
test_y <- df[-index, 1]
In the above code, I split the data frame df into training and test sets, creating separate objects for the predictor variables (train_x and test_x) and the response (train_y and test_y). The partition is based on the Yield column, with 80% of the rows allocated to the training set and the remaining 20% to the test set.
set.seed(42)
knnModel <- train(train_x, train_y,
method = "knn",
preProc = c("center", "scale"),
tuneLength = 10)
tooHigh <- findCorrelation(cor(train_x), cutoff = .75)
train_x_nnet <- train_x[, -tooHigh]
test_x_nnet <- test_x[, -tooHigh]
nnetGrid <- expand.grid(.decay = c(0, 0.01, .1),
.size = c(1:10))
ctrl <- trainControl(method = "cv", number = 10)
nnetTune <- train(train_x_nnet, train_y,
method = "nnet",
tuneGrid = nnetGrid,
trControl = ctrl,
preProc = c("center", "scale"),
linout = TRUE,
trace = FALSE,
MaxNWts = 10 * (ncol(train_x_nnet) + 1) + 10 + 1,
maxit = 500)
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
marsTune <- train(train_x, train_y,
method = "earth",
tuneGrid = marsGrid,
trControl = trainControl(method = "cv"))
svmRTune <- train(train_x, train_y,
method = "svmRadial",
preProc = c("center", "scale"),
tuneLength = 14,
trControl = trainControl(method = "cv"))In the above code I ran several different nonlinear regression model to see which gives the optimal resampling and test set performance.
knnPred <- predict(knnModel, test_x)
nnPred <- predict(nnetTune, test_x_nnet)
marsPred <- predict(marsTune, test_x)
svmRPred <- predict(svmRTune, test_x)
rbind(knn = postResample(knnPred, test_y),
nn = postResample(nnPred, test_y),
mars = postResample(marsPred, test_y),
svmR = postResample(svmRPred, test_y))
## RMSE Rsquared MAE
## knn 1.269575 0.6703412 0.9156250
## nn 1.311375 0.5749934 1.0732701
## mars 1.389522 0.5682379 1.0328214
## svmR 1.250826 0.6282103 0.9345619
KNN gives the best overall test-set performance, with the highest R² and the lowest MAE, although the radial SVM achieves a slightly lower RMSE (1.251 vs. 1.270).
(b) Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?
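The importance code is not shown; the "loess r-squared variable importance" header below is what caret's varImp reports when a model (such as KNN) has no model-specific importance measure, so the listing was presumably produced by something like:
varImp(knnModel)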
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess13 94.69
## ManufacturingProcess09 90.80
## ManufacturingProcess17 88.88
## BiologicalMaterial06 84.11
## BiologicalMaterial03 80.13
## ManufacturingProcess36 74.71
## BiologicalMaterial12 73.94
## ManufacturingProcess06 70.12
## ManufacturingProcess11 62.92
## ManufacturingProcess31 56.87
## BiologicalMaterial02 51.24
## BiologicalMaterial11 48.43
## BiologicalMaterial09 45.48
## ManufacturingProcess30 42.23
## BiologicalMaterial08 40.10
## ManufacturingProcess33 39.16
## ManufacturingProcess29 38.87
## BiologicalMaterial04 38.44
## ManufacturingProcess25 37.15
The list of important predictors is again dominated by manufacturing process variables rather than biological material variables, just as in the optimal linear model.
(c) Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?
correlation <- cor(select(df, 'ManufacturingProcess32','ManufacturingProcess36','ManufacturingProcess13','ManufacturingProcess17','ManufacturingProcess09', 'ManufacturingProcess33','ManufacturingProcess06', 'ManufacturingProcess12','BiologicalMaterial02','BiologicalMaterial06','Yield'))
corrplot::corrplot(correlation, method='square', type="upper")Based on the correlation plot, it can be observed that the ManufacturingProcess32 exhibits the strongest positive correlation with the Yield. Conversely, three out of the top ten variables display a negative correlation with Yield. It is evident that the biological predictors demonstrate a positive association with Yield. In contrast, the relationship between the manufacturing processes and Yield displays variability.
Question 8.1 Recreate the simulated data from Exercise 7.2:
set.seed(200)
simulated = mlbench.friedman1(200, sd = 1)
simulated = cbind(simulated$x, simulated$y)
simulated = as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] = "y"(a) Fit a random forest model to all of the predictors, then estimate the variable importance scores:
library(randomForest)
model_rf = randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rf_Imp = varImp(model_rf, scale = FALSE)
Did the random forest model significantly use the uninformative predictors (V6–V10)?
## Overall
## V1 8.62743275
## V2 6.27437240
## V3 0.72305459
## V4 7.50258584
## V5 2.13575650
## V6 0.12395003
## V7 0.02927888
## V8 -0.11724317
## V9 -0.10344797
## V10 0.04312556
The random forest model did not make significant use of the uninformative predictors: the importance scores for V6–V10 are all close to zero, and some are slightly negative.
(b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
set.seed(42)
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9474738
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
set.seed(42)
model_rf2 <- randomForest(y ~ ., data= simulated, importance= TRUE, ntree= 1000)
rf_imp2 <- varImp(model_rf2, scale=FALSE)
rf_imp2
## Overall
## V1 5.901016854
## V2 6.304789379
## V3 0.644534789
## V4 6.586485711
## V5 1.932969674
## V6 0.234129323
## V7 0.004037023
## V8 -0.112333624
## V9 -0.071834629
## V10 -0.191667743
## duplicate1 4.223128144
The importance of V1 dropped (from about 8.6 to 5.9) after adding the highly correlated predictor duplicate1 (correlation of roughly 0.95), because the importance is now shared between V1 and duplicate1, which itself receives a score of about 4.2. Adding another predictor that is also highly correlated with V1 would be expected to dilute V1's importance further; with model_rf2, more predictors than necessary may appear relevant, and the variable importance rankings are distorted.
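A sketch of how one could check the effect of a second correlated copy of V1 (sim2, duplicate2, and model_rf3 are hypothetical names; working on a copy keeps the simulated data used by later chunks unchanged):
# Add a second predictor highly correlated with V1 and refit on a copy of the data
set.seed(42)
sim2 <- simulated
sim2$duplicate2 <- sim2$V1 + rnorm(200) * .1
model_rf3 <- randomForest(y ~ ., data = sim2, importance = TRUE, ntree = 1000)
varImp(model_rf3, scale = FALSE)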
(c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
library(partykit)
set.seed(42)
mod <- cforest(y ~ ., data = simulated)
cfImp_cond <- varimp(mod, conditional = TRUE)
cfImp_uncond <- varimp(mod, conditional = FALSE)
barplot(sort(cfImp_cond), horiz = TRUE, main = 'Conditional')
In the above code, I fit a conditional inference random forest (cforest) and computed variable importance with varimp, both conditionally and unconditionally, visualizing the results with bar plots. The conditional importance takes the correlation between V1 and duplicate1 into account and adjusts their scores accordingly, whereas the unconditional approach treats V1 and duplicate1 as roughly equally important. The resulting pattern does not follow the same pattern as the traditional random forest model.
(d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
library(gbm)
library(Cubist)
set.seed(42)
gbmGrid = expand.grid(interaction.depth = seq(1,5, by=2),
n.trees = seq(100, 1000, by = 100),
shrinkage = 0.1,
n.minobsinnode = 5)
model_gbm = train(y ~ ., data = simulated,
tuneGrid = gbmGrid, verbose = FALSE,
method = 'gbm')
gbm_imp<-varImp(model_gbm)
model_cubist <- cubist(simulated[,-11], simulated[, 11])
cubist_imp<-varImp(model_cubist)
df = as.data.frame(cbind(gbm_imp$importance, cubist_imp))
names(df) = c("boosted", "cubist")
df
## boosted cubist
## V1 78.110425 50
## V2 80.636364 50
## V3 33.137535 50
## V4 100.000000 50
## V5 37.742657 50
## V6 3.427648 0
## V7 3.319279 0
## V8 0.487909 0
## V9 0.000000 0
## V10 1.036546 0
## duplicate1 27.429750 0
In the above code, a grid of hyperparameters is created for the gradient boosting model, which is then fit with the train function, and variable importance measures are calculated for it. The Cubist package is loaded and a Cubist model is fit to the same data, with its variable importance calculated as well. The two sets of importance scores are combined into a single data frame so that variable importance can be compared across the two modeling approaches.
The importance rankings of V6–V10 remain low in both the boosted tree and Cubist models, as observed previously. Similarly, V4 remains the most influential predictor in the boosted model, and the overall pattern is unchanged.
Question 8.2 Use a simulation to show tree bias with different granularities.
library(rpart)
set.seed(42)
n <- 1000
x <- runif(n, -10, 10)
y <- 2 * sin(x) + rnorm(n, 0, 0.5)
simulated <- data.frame(x, y)
depths <- c(1, 2, 5, 10)
trees <- list()
for (depth in depths) {
tree <- rpart(y ~ x, data = simulated, control = rpart.control(maxdepth = depth))
trees[[as.character(depth)]] <- tree
}
plot(x, y, pch = 16, col = "black", main = "Tree Bias with Different Depths")
curve(2 * sin(x), -10, 10, add = TRUE, col = "blue", lwd = 2)
colors <- c("red", "green", "purple", "orange")
x_grid <- seq(-10, 10, by = 0.1)
for (i in seq_along(depths)) {
  depth <- depths[i]
  pred <- predict(trees[[as.character(depth)]], data.frame(x = x_grid))
  lines(x_grid, pred, col = colors[i], lwd = 2)
}
# Draw the legend once so the depth labels do not overplot each other
legend("topleft", legend = paste("Depth", depths), col = colors, lwd = 2, bty = "n")
To demonstrate tree bias with different granularities, we simulate a data set with a known underlying relationship and fit decision trees of varying depths. The relationship is defined as y = 2 * sin(x) + epsilon, where epsilon is normally distributed noise, and the simulated x and y values are stored in the data frame simulated. A vector depths lists the tree depths to explore; for each depth we fit a tree with the rpart function, controlling depth through the maxdepth parameter of rpart.control.
The simulated points are plotted with plot, and the true underlying relationship is drawn with curve in blue. We then iterate over the tree depths and overlay each tree's predictions, obtained by calling predict on a fine grid of x values, using lines with a different color per depth. Finally, a single legend identifies the depth associated with each fitted curve.
Question 8.3
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
(a) Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?
The model on the right uses a higher bagging fraction and learning rate, meaning a larger fraction of the training data is used to build each tree and each tree contributes more strongly. As the bagging fraction approaches 1, the bootstrap samples become nearly identical, which lets a few strong predictors dominate the importance scores. In contrast, the model on the left uses a low bagging fraction and a low learning rate; the lower learning rate produces a less aggressive model that spreads importance across more predictors.
(b) Which model do you think would be more predictive of other samples?
The model with the higher learning rate and bagging fraction is more likely to overfit because it concentrates on fewer variables, whereas the model on the left has greater potential to generalize to other samples and make reliable predictions.
(c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?
Increasing the interaction depth spreads variable importance across a wider range of predictors, since more variables are considered for tree splits; this would flatten the slope of the predictor importance plot for either model.
Question 8.7 Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
(a) Which tree-based regression model gives the optimal resampling and test set performance?
data(ChemicalManufacturingProcess)
set.seed(42)
imputed_data <- preProcess(ChemicalManufacturingProcess, method = c("knnImpute","nzv", "corr"))
full_data <- predict(imputed_data, ChemicalManufacturingProcess)
index_chem <- createDataPartition(full_data$Yield , p=.8, list=F)
train_chem <- full_data[index_chem,]
test_chem <- full_data[-index_chem,]
train_predictors <- train_chem[-c(1)]
test_predictors <- test_chem[-c(1)]
The code above loads the data set, performs preprocessing (k-nearest-neighbor imputation, removal of near-zero-variance predictors, and removal of highly correlated predictors), and splits the data into training and test sets. It also extracts the predictor variables from each set for use in model training and evaluation.
set.seed(42)
rf_model <- randomForest(train_predictors, train_chem$Yield, importance = TRUE, ntree = 1000)
rfPred <- predict(rf_model, newdata = test_predictors)
postResample(pred = rfPred, obs = test_chem$Yield)
## RMSE Rsquared MAE
## 0.6443612 0.7484369 0.4572252
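The code that fit gbm_model is missing from the excerpt; the training trace below suggests a call roughly like the following (the Gaussian distribution, shrinkage = 0.001, n.trees = 100, and verbose = TRUE are inferred from the printed trace, not confirmed):
library(gbm)
set.seed(42)
# Boosted trees; tuning values below are assumptions read off the trace output
gbm_model <- gbm(Yield ~ ., data = train_chem, distribution = "gaussian",
n.trees = 100, shrinkage = 0.001, verbose = TRUE)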
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.9530 nan 0.0010 0.0007
## 2 0.9524 nan 0.0010 0.0004
## 3 0.9517 nan 0.0010 0.0007
## 4 0.9509 nan 0.0010 0.0007
## 5 0.9500 nan 0.0010 0.0007
## 6 0.9493 nan 0.0010 0.0007
## 7 0.9487 nan 0.0010 0.0006
## 8 0.9480 nan 0.0010 0.0005
## 9 0.9473 nan 0.0010 0.0007
## 10 0.9467 nan 0.0010 0.0006
## 20 0.9402 nan 0.0010 0.0007
## 40 0.9273 nan 0.0010 0.0005
## 60 0.9146 nan 0.0010 0.0006
## 80 0.9027 nan 0.0010 0.0003
## 100 0.8905 nan 0.0010 0.0006
gbmPred <- predict(gbm_model, newdata = test_predictors)
postResample(pred = gbmPred, obs = test_chem$Yield)
## RMSE Rsquared MAE
## 1.0508638 0.6123073 0.8184981
cube_model <- cubist(train_predictors, train_chem$Yield)
cubePred <- predict(cube_model, newdata = test_predictors)
postResample(pred = cubePred, obs = test_chem$Yield)
## RMSE Rsquared MAE
## 0.6383669 0.6827951 0.4634283
The code above fits three tree-based regression models (random forest, boosted trees, and Cubist) and evaluates each on the held-out test set.
rbind(Random_Forest = postResample(pred = rfPred, obs = test_chem$Yield),
Boosted_Trees = postResample(pred = gbmPred, obs = test_chem$Yield),
Cubist = postResample(pred = cubePred, obs = test_chem$Yield))
## RMSE Rsquared MAE
## Random_Forest 0.6443612 0.7484369 0.4572252
## Boosted_Trees 1.0508638 0.6123073 0.8184981
## Cubist 0.6383669 0.6827951 0.4634283
The random forest model gives the best overall performance, with the highest R² (0.75) and the lowest MAE, although Cubist achieves a marginally lower RMSE (0.638 vs. 0.644).
(b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
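The importance code is not shown; the table below (unsorted, first rows only) was presumably produced with varImp on the random forest model, e.g.:
head(varImp(rf_model), 10)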
## Overall
## BiologicalMaterial01 6.050412
## BiologicalMaterial03 12.602229
## BiologicalMaterial05 5.129726
## BiologicalMaterial06 9.213824
## BiologicalMaterial08 6.095621
## BiologicalMaterial09 5.301006
## BiologicalMaterial10 4.272144
## BiologicalMaterial11 10.702325
## ManufacturingProcess01 4.907436
## ManufacturingProcess02 3.520197
The listing above is dominated by biological material predictors (8 of the 10 shown), with the remainder being manufacturing process predictors; note, however, that the table is printed in the original column order rather than sorted by importance score. This differs from the optimal linear and nonlinear models, where manufacturing process predictors dominated the list.
(c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
library(rpart)
library(rpart.plot)
set.seed(42)
rpart_tree <- rpart(Yield ~., data = train_chem)
rpart.plot(rpart_tree)
The code above fits a decision tree model using the rpart function. The formula Yield ~ . specifies that Yield is the response variable and that all other variables in the train_chem data set should be used as predictors; the fitted model is stored in rpart_tree. I then plotted the tree with the rpart.plot function, which generates a visual representation of the decision tree, with nodes representing decision rules and branches representing the splitting criteria.
Plotting the optimal single tree with the distribution of yield in the terminal nodes provides insight into the important predictors and their relationships with yield, and can reveal nonlinear or interaction effects. Examining the distribution of yield in the terminal nodes can also surface patterns or clusters in the data, which can inform further analysis or modeling.
Market Basket Analysis / Recommender Systems (a simple problem)
Imagine 10000 receipts sitting on your table. Each receipt represents a transaction with items that were purchased. The receipt is a representation of stuff that went into a customer’s basket – and therefore ‘Market Basket Analysis’.
That is exactly what the Groceries Data Set contains: a collection of receipts with each line representing 1 receipt and the items purchased. Each line is called a transaction and each column in a row represents an item.
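The code that reads the receipts is not included in this excerpt; a sketch, assuming the transactions were read into a sparse transactions object named grocery with the arules package (the file name is hypothetical). Printing the object and its summary produces the output that follows:
library(arules)
# Read the receipts into a sparse transactions object (file name is an assumption)
grocery <- read.transactions("GroceryDataSet.csv", sep = ",")
grocery
summary(grocery)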
## transactions in sparse format with
## 9835 transactions (rows) and
## 169 items (columns)
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 169 columns (items) and a density of 0.02609146
##
## most frequent items:
## whole milk other vegetables rolls/buns soda
## 2513 1903 1809 1715
## yogurt (Other)
## 1372 34055
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 2159 1643 1299 1005 855 645 545 438 350 246 182 117 78 77 55 46
## 17 18 19 20 21 22 23 24 26 27 28 29 32
## 29 14 14 9 11 4 6 1 1 1 1 3 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 4.409 6.000 32.000
##
## includes extended item information - examples:
## labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3 baby cosmetics
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.55 0.1 1 none FALSE TRUE 5 0.009 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 88
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [93 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [10 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
The arules package uses the apriori function to create association rules. The support parameter measures how frequently an itemset appears in the data set: for instance, if the itemset {milk, bread} appears in 2 out of 5 transactions, its support is 0.4 (40%). When I attempted to use a support greater than 0.009, I encountered an error with the apriori function, so I kept 0.009. The confidence parameter was set to 0.55, which, in the example {milk, bread} => {butter}, means that at least 55% of transactions containing milk and bread also include butter. To ensure that the LHS (antecedent) is not empty, minlen was set to 2 instead of the default value of 1. With support = 0.009 and confidence = 0.55, 10 rules were generated; the output above shows these parameters and the rules mined.
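The rule-generation and inspection code is not reproduced here; based on the mining info printed below, it was presumably along these lines (basket_model is the name the text uses for the rule set):
basket_model <- apriori(data = grocery,
parameter = list(support = 0.009, confidence = 0.55, minlen = 2))
summary(basket_model)
inspect(sort(basket_model, by = "lift"))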
## set of 10 rules
##
## rule length distribution (lhs + rhs):sizes
## 3
## 10
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3 3 3 3 3 3
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.009354 Min. :0.5525 Min. :0.01464 Min. :2.162
## 1st Qu.:0.009914 1st Qu.:0.5648 1st Qu.:0.01721 1st Qu.:2.210
## Median :0.010930 Median :0.5738 Median :0.01886 Median :2.246
## Mean :0.011174 Mean :0.5779 Mean :0.01941 Mean :2.408
## 3rd Qu.:0.012227 3rd Qu.:0.5840 3rd Qu.:0.02105 3rd Qu.:2.445
## Max. :0.014540 Max. :0.6389 Max. :0.02583 Max. :3.030
## count
## Min. : 92.0
## 1st Qu.: 97.5
## Median :107.5
## Mean :109.9
## 3rd Qu.:120.2
## Max. :143.0
##
## mining info:
## data ntransactions support confidence
## grocery 9835 0.009 0.55
## call
## apriori(data = grocery, parameter = list(support = 0.009, confidence = 0.55, minlen = 2))
## lhs rhs support
## [1] {citrus fruit, root vegetables} => {other vegetables} 0.010371124
## [2] {root vegetables, tropical fruit} => {other vegetables} 0.012302999
## [3] {butter, yogurt} => {whole milk} 0.009354347
## [4] {curd, yogurt} => {whole milk} 0.010066090
## [5] {curd, other vegetables} => {whole milk} 0.009862735
## [6] {butter, other vegetables} => {whole milk} 0.011489578
## [7] {root vegetables, tropical fruit} => {whole milk} 0.011997966
## [8] {root vegetables, yogurt} => {whole milk} 0.014539908
## [9] {root vegetables, whipped/sour cream} => {whole milk} 0.009456024
## [10] {domestic eggs, other vegetables} => {whole milk} 0.012302999
## confidence coverage lift count
## [1] 0.5862069 0.01769192 3.029608 102
## [2] 0.5845411 0.02104728 3.020999 121
## [3] 0.6388889 0.01464159 2.500387 92
## [4] 0.5823529 0.01728521 2.279125 99
## [5] 0.5739645 0.01718353 2.246296 97
## [6] 0.5736041 0.02003050 2.244885 113
## [7] 0.5700483 0.02104728 2.230969 118
## [8] 0.5629921 0.02582613 2.203354 143
## [9] 0.5535714 0.01708185 2.166484 93
## [10] 0.5525114 0.02226741 2.162336 121
In cases where there are numerous association rules that meet the support and confidence criteria, lift can be utilized to filter or rank the rules further. Higher lift values are indicative of stronger associations. The association with the greatest lift value is {citrus fruit, root vegetables} ==> {other vegetables}, while the one with the lowest lift value is {domestic eggs, other vegetables} ==> {whole milk}.
The final step takes the generated association rules and selects the top 10 rules based on their lift value. The lift measure indicates how much more often the antecedent and consequent of a rule occur together than would be expected if they were independent, with a value of 1 indicating independence and higher values indicating stronger association.
The top 10 rules are selected with the head() function applied to the basket_model object containing the association rules, with the by parameter set to "lift" so that the rules are ordered by their lift values.
The resulting rules are then plotted using the plot() function with the "graph" method, producing a visual representation of the relationship between the antecedents and consequents of the rules (see the sketch below). This plot can help identify patterns or clusters of related items in the data, as well as indicate the strength of the association between the antecedent and consequent of each rule.
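A sketch of the plotting code described above, assuming the arulesViz package supplies the graph-based plot method for rules:
library(arulesViz)
# Select the 10 rules with the highest lift and draw them as a graph
top_rules <- head(basket_model, n = 10, by = "lift")
plot(top_rules, method = "graph")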
grocery_df <- as.data.frame(as.matrix(grocery@data))
row.names(grocery_df) <- grocery@itemInfo$labels
dim(grocery_df)
## [1] 169 9835
The code above converts the transactions object from the arules package into a data frame and sets the row names based on the item labels.
set.seed(42)
k_cluster <- kmeans(grocery_df, centers=100, nstart=50, iter.max=10)
table(k_cluster$cluster)
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
## 1 1 1 68 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
## 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
## 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1
The code above uses k-means clustering to group the rows of grocery_df (one row per item) into 100 clusters and then tabulates how many items fall in each cluster; nearly every cluster contains a single item, while cluster 24 contains 68 items.
Below are the 68 grocery items that ended up clustered together in a single group (cluster 24) when using 100 centers:
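The listing that follows shows the named cluster assignments for those items; a sketch of how it could be produced (the cluster number 24 is read off the table above):
k_cluster$cluster[k_cluster$cluster == 24]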
## abrasive cleaner artif. sweetener baby cosmetics
## 24 24 24
## baby food bags bathroom cleaner
## 24 24 24
## brandy canned fruit cereals
## 24 24 24
## cleaner cocoa drinks cooking chocolate
## 24 24 24
## cookware cream curd cheese
## 24 24 24
## decalcifier dental care female sanitary products
## 24 24 24
## finished products fish flower soil/fertilizer
## 24 24 24
## frozen chicken frozen fruits hair spray
## 24 24 24
## honey instant coffee jam
## 24 24 24
## ketchup kitchen towels kitchen utensil
## 24 24 24
## light bulbs liqueur liquor (appetizer)
## 24 24 24
## liver loaf make up remover male cosmetics
## 24 24 24
## meat spreads nut snack nuts/prunes
## 24 24 24
## organic products organic sausage popcorn
## 24 24 24
## potato products preservation products prosecco
## 24 24 24
## pudding powder ready soups rubbing alcohol
## 24 24 24
## rum salad dressing sauces
## 24 24 24
## skin care snack products soap
## 24 24 24
## softener sound storage medium soups
## 24 24 24
## sparkling wine specialty fat specialty vegetables
## 24 24 24
## spices syrup tea
## 24 24 24
## tidbits toilet cleaner vinegar
## 24 24 24
## whisky zwieback
## 24 24