Data624HW2

Mathew Katz

2023-06-28


This homework assignment covers seven exercises from the book ‘Applied Predictive Modeling’ by Max Kuhn and Kjell Johnson:

6.3 // 7.2 // 7.5 // 8.1 // 8.2 // 8.3 // 8.7 and the Market Basket Analysis / Recommender Systems

Question 6.3:

A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1 % will boost revenue by approximately one hundred thousand dollars per batch:

(a) Start R and use these commands to load the data:

library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

(b) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).

library(caret)
library(RANN)
preP <- preProcess(ChemicalManufacturingProcess, method = c("knnImpute"))
df <- predict(preP, ChemicalManufacturingProcess)
df$Yield = ChemicalManufacturingProcess$Yield
any(is.na(df))
## [1] FALSE

In the above code, I created a pre-processing object named preP using the preProcess function from the “caret” package. It performs k-nearest-neighbor imputation (knnImpute) to fill in the missing values in the data frame ChemicalManufacturingProcess, and preP stores the information about the transformations to be applied. I then applied preP to the original data frame ChemicalManufacturingProcess using the predict function, which imputes the missing values and applies any other transformations specified in preP; the resulting data frame is assigned to df. Next, I copied the original “Yield” column from ChemicalManufacturingProcess back into df so the response stays on its original scale. Lastly, I checked whether any missing values (NA) remain in df. Because the result is FALSE, all of the missing values were successfully imputed.
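
For context, a quick base-R check (a sketch, not part of the original output) shows how sparse the missingness is before imputation:

# Sketch: count missing cells per predictor before imputation
na_counts <- colSums(is.na(ChemicalManufacturingProcess))
sum(na_counts)                              # total number of missing cells
head(sort(na_counts, decreasing = TRUE))    # predictors with the most NAs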

(c) Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

preP <- preProcess(ChemicalManufacturingProcess, method = c("BoxCox", "knnImpute"))
df <- predict(preP, ChemicalManufacturingProcess)
df$Yield = ChemicalManufacturingProcess$Yield
trainRows <- createDataPartition(df$Yield, p = .80, list = FALSE)
df.train <- df[trainRows,]
df.test <- df[-trainRows,]
colYield <-
  which(colnames(ChemicalManufacturingProcess) == "Yield")
ctrl <- trainControl(method = "cv", number = 5)
mdl <- train(
  x = df.train[, -colYield],
  y = df.train$Yield,
  method = "pls",
  tuneLength = 20,
  trControl = ctrl
)
mdl
## Partial Least Squares 
## 
## 144 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 116, 115, 115, 115, 115 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     1.420512  0.4551606  1.1552156
##    2     1.246287  0.5569653  1.0301604
##    3     1.202620  0.5960259  0.9988938
##    4     1.205222  0.5959634  0.9949146
##    5     1.241749  0.5819981  1.0217219
##    6     1.268362  0.5757635  1.0371681
##    7     1.282577  0.5725722  1.0234249
##    8     1.304858  0.5666444  1.0457709
##    9     1.312207  0.5716937  1.0559618
##   10     1.323910  0.5725190  1.0639483
##   11     1.327343  0.5738286  1.0622661
##   12     1.336390  0.5751951  1.0665269
##   13     1.345094  0.5709718  1.0724646
##   14     1.344407  0.5691308  1.0683000
##   15     1.340058  0.5769834  1.0567306
##   16     1.360139  0.5718526  1.0592496
##   17     1.372095  0.5669017  1.0660666
##   18     1.381210  0.5634525  1.0730942
##   19     1.381617  0.5643265  1.0752356
##   20     1.397123  0.5578352  1.0811591
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.

In the above code, I created a preprocessing object named preP. The function preProcess from the “caret” package is used to perform various preprocessing steps on the ChemicalManufacturingProcess dataset. In this case, two methods are specified: “BoxCox” and “knnImpute”. “BoxCox” performs a power transformation to make the data more symmetric, and “knnImpute” imputes missing values using k-nearest neighbors. I then applied the preprocessing object preP to the original dataset ChemicalManufacturingProcess using the predict function. It transforms the dataset based on the preprocessing steps specified in preP and assigns the transformed dataset to df. Then I assigned the “Yield” column from the original dataset ChemicalManufacturingProcess to the corresponding column in the processed dataset df. I then created a partition of the data for training purposes. It splits the data based on the “Yield” column, with 80% of the data assigned to trainRows for training and the remaining 20% for testing. I then subset the processed dataset df using the trainRows indices, selecting only the rows designated for training. The resulting subset is assigned to df.train. Then I subset the processed dataset df using the complement of the trainRows indices, selecting the rows not included in training. The resulting subset is assigned to df.test. Then I created a train control object named ctrl using the trainControl function from the “caret” package. It specifies the method as “cv” (cross-validation) and the number of folds as 5. Then I trained a Partial Least Squares (PLS) regression model using the train function from the “caret” package. It uses the df.train dataset, excluding the “Yield” column (-colYield) as the predictor variables (x), and the “Yield” column from df.train as the response variable (y). The method is specified as “pls”, and tuneLength is set to 20, indicating the number of tuning parameter values to explore. The trControl parameter is set to ctrl, which specifies the train control object created earlier.

library(glue)

mdl.ncomp <- mdl[["bestTune"]][["ncomp"]]

glue("The optimal value (for the number of PLS components) with respect to RMSE performance metric is {mdl.ncomp}.")
## The optimal value (for the number of PLS components) with respect to RMSE performance metric is 3.

In the above code, I extracted the optimal number of Partial Least Squares (PLS) components from the trained model mdl. In the caret package, the bestTune element of a trained model object stores the optimal parameter values selected during the tuning process. In this case, ncomp represents the number of PLS components selected as optimal. The value is assigned to the variable mdl.ncomp. I then used the glue function from the “glue” package to interpolate the value of mdl.ncomp into a string. It creates a formatted message stating the optimal value of the number of PLS components selected based on the Root Mean Square Error (RMSE) performance metric.

(d) Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

postResample(predict(mdl, newdata = df.test, ncomp = mdl.ncomp), df.test$Yield)
##      RMSE  Rsquared       MAE 
## 1.1145997 0.4914468 0.8739560
getTrainPerf(mdl)
##   TrainRMSE TrainRsquared  TrainMAE method
## 1   1.20262     0.5960259 0.9988938    pls

In the above code, I evaluated the performance of the trained model (mdl) on the test dataset (df.test) by comparing the predicted values with the actual “Yield” values. Additionally, I retrieved the performance metrics of the model on the training dataset using the getTrainPerf function. The root mean squared error (RMSE) on the test set (about 1.11) is slightly lower than the resampled RMSE on the training set (about 1.20).
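
As an optional visual check (a sketch, output not shown in the original), plotting observed against predicted test-set yields makes the fit easier to judge than the summary statistics alone:

# Sketch: observed vs. predicted yield on the test set
pred_test <- predict(mdl, newdata = df.test)
plot(df.test$Yield, pred_test,
     xlab = "Observed Yield (test set)", ylab = "Predicted Yield")
abline(0, 1, lty = 2)   # reference line for perfect predictions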

(e) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

mdl.vip <- varImp(mdl)
plot(mdl.vip, top = 15)

In the above code, I calculated the variable importance scores for the predictors in the trained model and then plotted the top 15 predictors by importance. This plot provides insight into the relative importance of the predictors and helps identify which variables have the most influence on the model. The seven most important predictors in the model I trained are manufacturing process variables; the process predictors dominate the list as opposed to the biological ones.
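
The same conclusion can be checked without reading the plot by pulling the highest-scoring names directly from the importance object (a small sketch, output not shown):

# Sketch: extract the highest-scoring predictor names from the importance object
imp_df <- mdl.vip$importance
head(rownames(imp_df)[order(imp_df$Overall, decreasing = TRUE)], 7)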

(f) Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

library(dplyr)
library(tidyverse)
correlation <- cor(select(df, 'ManufacturingProcess32','ManufacturingProcess36','BiologicalMaterial08','Yield'))
corrplot::corrplot(correlation, method='square', type="upper")

In the above code, I calculated the correlation coefficients between selected variables in the dataset and then used the corrplot package to create a correlation plot visualizing the relationships between them. The resulting plot provides an overview of the pairwise correlations and can help identify patterns and dependencies between the variables. The plot illustrates that several predictor variables have substantial positive or negative associations with the response, with the most important variables exhibiting the strongest correlations; a few variables that the model identified as important do not show such strong correlations with the response. A clear understanding of which variables are positively or negatively correlated with yield can help improve the manufacturing process by indicating which adjustments are likely to increase yield.
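
Beyond the correlation matrix, direct scatterplots of the top predictors against yield (a sketch using caret's featurePlot, output not shown) display the shape of each relationship:

# Sketch: scatterplots (with smoothers) of top predictors against Yield
featurePlot(x = df[, c("ManufacturingProcess32", "ManufacturingProcess36",
                       "BiologicalMaterial08")],
            y = df$Yield,
            plot = "scatter", type = c("p", "smooth"))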

Question 7.2: Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data: y = 10 sin(π x1 x2) + 20 (x3 − 0.5)^2 + 10 x4 + 5 x5 + N(0, σ^2), where the x values are random variables uniformly distributed on [0, 1] (there are also 5 other non-informative variables created in the simulation). The package mlbench contains a function called mlbench.friedman1 that simulates these data:

library(mlbench)
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
trainingData$x <- data.frame(trainingData$x)
featurePlot(trainingData$x, trainingData$y)

testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)

Tune several models on these data. For example:

library(caret)
knnModel <- train(x = trainingData$x, 
                  y = trainingData$y,
                  method = "knn",
                  preProc = c("center", "scale"),
                  tuneLength = 10)

knnModel
## k-Nearest Neighbors 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  3.466085  0.5121775  2.816838
##    7  3.349428  0.5452823  2.727410
##    9  3.264276  0.5785990  2.660026
##   11  3.214216  0.6024244  2.603767
##   13  3.196510  0.6176570  2.591935
##   15  3.184173  0.6305506  2.577482
##   17  3.183130  0.6425367  2.567787
##   19  3.198752  0.6483184  2.592683
##   21  3.188993  0.6611428  2.588787
##   23  3.200458  0.6638353  2.604529
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.
knnPred <- predict(knnModel, newdata = testData$x)
## The function 'postResample' can be used to get the test set
## performance values
postResample(pred = knnPred, obs = testData$y)
##      RMSE  Rsquared       MAE 
## 3.2040595 0.6819919 2.5683461

Which models appear to give the best performance? Does MARS select the informative predictors (those named X1–X5)?

marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)

marsTune <- train(trainingData$x, trainingData$y,
                  method = "earth",
                  tuneGrid = marsGrid,
                  trControl = trainControl(method = "cv"))

marsTune
## Multivariate Adaptive Regression Spline 
## 
## 200 samples
##  10 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   degree  nprune  RMSE      Rsquared   MAE     
##   1        2      4.462296  0.2176253  3.697979
##   1        3      3.720663  0.4673821  2.949121
##   1        4      2.680039  0.7094916  2.123848
##   1        5      2.333538  0.7781559  1.856629
##   1        6      2.367933  0.7754329  1.901509
##   1        7      1.809983  0.8656526  1.414985
##   1        8      1.692656  0.8838936  1.333678
##   1        9      1.704958  0.8845683  1.339517
##   1       10      1.688559  0.8842495  1.309838
##   1       11      1.669043  0.8886165  1.296522
##   1       12      1.645066  0.8892796  1.271981
##   1       13      1.655570  0.8886896  1.271232
##   1       14      1.666354  0.8879143  1.285545
##   1       15      1.666354  0.8879143  1.285545
##   1       16      1.666354  0.8879143  1.285545
##   1       17      1.666354  0.8879143  1.285545
##   1       18      1.666354  0.8879143  1.285545
##   1       19      1.666354  0.8879143  1.285545
##   1       20      1.666354  0.8879143  1.285545
##   1       21      1.666354  0.8879143  1.285545
##   1       22      1.666354  0.8879143  1.285545
##   1       23      1.666354  0.8879143  1.285545
##   1       24      1.666354  0.8879143  1.285545
##   1       25      1.666354  0.8879143  1.285545
##   1       26      1.666354  0.8879143  1.285545
##   1       27      1.666354  0.8879143  1.285545
##   1       28      1.666354  0.8879143  1.285545
##   1       29      1.666354  0.8879143  1.285545
##   1       30      1.666354  0.8879143  1.285545
##   1       31      1.666354  0.8879143  1.285545
##   1       32      1.666354  0.8879143  1.285545
##   1       33      1.666354  0.8879143  1.285545
##   1       34      1.666354  0.8879143  1.285545
##   1       35      1.666354  0.8879143  1.285545
##   1       36      1.666354  0.8879143  1.285545
##   1       37      1.666354  0.8879143  1.285545
##   1       38      1.666354  0.8879143  1.285545
##   2        2      4.440854  0.2204755  3.686796
##   2        3      3.697203  0.4714312  2.938566
##   2        4      2.664266  0.7149235  2.119566
##   2        5      2.313371  0.7837374  1.852172
##   2        6      2.335796  0.7875253  1.841919
##   2        7      1.833780  0.8622906  1.462210
##   2        8      1.688673  0.8883137  1.325754
##   2        9      1.557314  0.9002634  1.234207
##   2       10      1.463018  0.9133897  1.174354
##   2       11      1.350247  0.9265882  1.099432
##   2       12      1.305955  0.9344683  1.049853
##   2       13      1.261130  0.9357469  1.017123
##   2       14      1.286463  0.9315381  1.040156
##   2       15      1.337104  0.9297651  1.069602
##   2       16      1.337560  0.9294593  1.067973
##   2       17      1.318152  0.9322833  1.054436
##   2       18      1.324331  0.9319631  1.055971
##   2       19      1.324331  0.9319631  1.055971
##   2       20      1.324331  0.9319631  1.055971
##   2       21      1.324331  0.9319631  1.055971
##   2       22      1.324331  0.9319631  1.055971
##   2       23      1.324331  0.9319631  1.055971
##   2       24      1.324331  0.9319631  1.055971
##   2       25      1.324331  0.9319631  1.055971
##   2       26      1.324331  0.9319631  1.055971
##   2       27      1.324331  0.9319631  1.055971
##   2       28      1.324331  0.9319631  1.055971
##   2       29      1.324331  0.9319631  1.055971
##   2       30      1.324331  0.9319631  1.055971
##   2       31      1.324331  0.9319631  1.055971
##   2       32      1.324331  0.9319631  1.055971
##   2       33      1.324331  0.9319631  1.055971
##   2       34      1.324331  0.9319631  1.055971
##   2       35      1.324331  0.9319631  1.055971
##   2       36      1.324331  0.9319631  1.055971
##   2       37      1.324331  0.9319631  1.055971
##   2       38      1.324331  0.9319631  1.055971
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 13 and degree = 2.

In the above code, I created a grid of tuning parameters for the MARS model, performed model training using the “train” function, and stored the trained model in marsTune. The grid search over the parameter combinations allows for finding the optimal values for degree and nprune using cross-validation.

svmRTune <- train(trainingData$x, trainingData$y,
                  method = "svmRadial",
                  preProc = c("center", "scale"),
                  tuneLength = 14,
                  trControl = trainControl(method = "cv"))

svmRTune
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   C        RMSE      Rsquared   MAE     
##      0.25  2.475865  0.7988016  1.992183
##      0.50  2.214599  0.8166629  1.774775
##      1.00  2.046869  0.8398392  1.623489
##      2.00  1.953012  0.8519284  1.552263
##      4.00  1.891812  0.8587464  1.523110
##      8.00  1.875676  0.8604855  1.532308
##     16.00  1.879154  0.8595207  1.542176
##     32.00  1.879022  0.8595391  1.542105
##     64.00  1.879022  0.8595391  1.542105
##    128.00  1.879022  0.8595391  1.542105
##    256.00  1.879022  0.8595391  1.542105
##    512.00  1.879022  0.8595391  1.542105
##   1024.00  1.879022  0.8595391  1.542105
##   2048.00  1.879022  0.8595391  1.542105
## 
## Tuning parameter 'sigma' was held constant at a value of 0.06437208
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.06437208 and C = 8.

In the above code, I trained an SVM model with a radial kernel on the training data. It applied centering and scaling as preprocessing steps, explored different combinations of tuning parameters, and performed cross-validation to evaluate model performance. The resulting model is stored in svmRTune.

tooHigh <- findCorrelation(cor(trainingData$x), cutoff = .75)
nnetGrid <- expand.grid(.decay = c(0, 0.01, .1),
                        .size = c(1:10))


ctrl <- trainControl(method = "cv", number = 10)

nnetTune <- train(trainingData$x, trainingData$y,
                  method = "nnet",
                  tuneGrid = nnetGrid,
                  trControl = ctrl,
                  preProc = c("center", "scale"),
                  linout = TRUE,
                  trace = FALSE,
                  MaxNWts = 10 * (ncol(trainingData$x) + 1) + 10 + 1,
                  maxit = 500)

nnetTune
## Neural Network 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   decay  size  RMSE       Rsquared   MAE     
##   0.00    1     2.580768  0.7284673  2.035490
##   0.00    2     2.680611  0.7349533  2.163387
##   0.00    3     2.320043  0.7890496  1.862655
##   0.00    4     2.516530  0.7604025  2.036872
##   0.00    5     3.429278  0.6352466  2.435958
##   0.00    6     3.591351  0.6611452  2.572580
##   0.00    7     4.451186  0.6202449  2.893162
##   0.00    8     5.809870  0.4546762  3.506995
##   0.00    9    12.995439  0.3434619  5.442580
##   0.00   10     6.073681  0.5401981  3.646143
##   0.01    1     2.548256  0.7510589  1.995186
##   0.01    2     2.655808  0.7368365  2.091566
##   0.01    3     2.456279  0.7668390  2.041680
##   0.01    4     2.467229  0.7750489  1.881964
##   0.01    5     2.821335  0.7068607  2.234548
##   0.01    6     2.963347  0.6910225  2.264453
##   0.01    7     3.135325  0.6545929  2.486553
##   0.01    8     3.069100  0.6838123  2.386589
##   0.01    9     3.540203  0.6290499  2.849426
##   0.01   10     3.505342  0.5919074  2.773600
##   0.10    1     2.454455  0.7727431  1.912887
##   0.10    2     2.636746  0.7358911  2.118765
##   0.10    3     2.168554  0.8146493  1.697511
##   0.10    4     2.285694  0.8014563  1.834426
##   0.10    5     2.643277  0.7389948  2.161553
##   0.10    6     2.557881  0.7709643  2.071306
##   0.10    7     2.710028  0.7427555  2.184003
##   0.10    8     3.038225  0.6609506  2.491278
##   0.10    9     2.990543  0.7005036  2.372746
##   0.10   10     3.021560  0.6767638  2.445791
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 3 and decay = 0.1.

In the above code, I checked for highly correlated predictors (for these simulated data none exceed the 0.75 cutoff, so nothing is removed), created a grid of tuning parameters for the neural network model, trained the network using cross-validation, and stored the trained model in nnetTune. The tuning process searches for good values of the size and decay parameters, while centering and scaling are applied as preprocessing.

After looking at the models:

varImp(marsTune)
## earth variable importance
## 
##    Overall
## X1  100.00
## X4   75.33
## X2   48.88
## X5   15.63
## X3    0.00

MARS appears to give the best performance, since it has the lowest RMSE and MAE and the highest R2; the radial SVM is second best. MARS does select the informative variables X1–X5; however, X3 receives an importance of zero, so it contributes little to the final model.
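
The ranking above is based on resampled performance; a test-set comparison of all four models (a sketch, output not shown) could be added for completeness:

# Sketch: compare the four 7.2 models on the 5000-sample test set
rbind(knn  = postResample(predict(knnModel, testData$x), testData$y),
      mars = postResample(predict(marsTune, testData$x), testData$y),
      svmR = postResample(predict(svmRTune, testData$x), testData$y),
      nnet = postResample(predict(nnetTune, testData$x), testData$y))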

Question 7.5: Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.

(a) Which nonlinear regression model gives the optimal resampling and test set performance?

set.seed(42)
index <- createDataPartition(df$Yield, p = .8, list = FALSE)
train_x <- df[index, -1]
train_y <- df[index, 1]
test_x <- df[-index, -1]
test_y <- df[-index, 1]

In the above code, I split the data frame df into training and test sets. I created separate data sets for the predictor variables (train_x and test_x) and the response variable (train_y and test_y). The data partition is performed based on the “Yield” column, and the proportion of data allocated to the training set is set to 80%.

set.seed(42)
knnModel <- train(train_x, train_y,
                  method = "knn",
                  preProc = c("center", "scale"),
                  tuneLength = 10)

tooHigh <- findCorrelation(cor(train_x), cutoff = .75)

train_x_nnet <- train_x[, -tooHigh]
test_x_nnet <- test_x[, -tooHigh]

nnetGrid <- expand.grid(.decay = c(0, 0.01, .1),
                        .size = c(1:10))

ctrl <- trainControl(method = "cv", number = 10)

nnetTune <- train(train_x_nnet, train_y,
                  method = "nnet",
                  tuneGrid = nnetGrid,
                  trControl = ctrl,
                  preProc = c("center", "scale"),
                  linout = TRUE,
                  trace = FALSE,
                  MaxNWts = 10 * (ncol(train_x_nnet) + 1) + 10 + 1,
                  maxit = 500)

marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)

marsTune <- train(train_x, train_y,
                  method = "earth",
                  tuneGrid = marsGrid,
                  trControl = trainControl(method = "cv"))

svmRTune <- train(train_x, train_y,
                  method = "svmRadial",
                  preProc = c("center", "scale"),
                  tuneLength = 14,
                  trControl = trainControl(method = "cv"))

In the above code, I trained several different nonlinear regression models (KNN, a neural network, MARS, and a radial SVM) to see which gives the optimal resampling and test set performance.

knnPred <- predict(knnModel, test_x)
nnPred <- predict(nnetTune, test_x_nnet)
marsPred <- predict(marsTune, test_x)
svmRPred <- predict(svmRTune, test_x)

rbind(knn = postResample(knnPred, test_y),
      nn = postResample(nnPred, test_y),
      mars = postResample(marsPred, test_y),
      svmR = postResample(svmRPred, test_y))
##          RMSE  Rsquared       MAE
## knn  1.269575 0.6703412 0.9156250
## nn   1.311375 0.5749934 1.0732701
## mars 1.389522 0.5682379 1.0328214
## svmR 1.250826 0.6282103 0.9345619

On the test set, the radial SVM has the lowest RMSE, while KNN has the highest R2 and the lowest MAE; the two are close. The radial SVM is treated as the optimal nonlinear model and is used for the variable importance analysis in part (b).

(b) Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?

set.seed(42)
varImp(svmRTune)
## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess32  100.00
## ManufacturingProcess13   94.69
## ManufacturingProcess09   90.80
## ManufacturingProcess17   88.88
## BiologicalMaterial06     84.11
## BiologicalMaterial03     80.13
## ManufacturingProcess36   74.71
## BiologicalMaterial12     73.94
## ManufacturingProcess06   70.12
## ManufacturingProcess11   62.92
## ManufacturingProcess31   56.87
## BiologicalMaterial02     51.24
## BiologicalMaterial11     48.43
## BiologicalMaterial09     45.48
## ManufacturingProcess30   42.23
## BiologicalMaterial08     40.10
## ManufacturingProcess33   39.16
## ManufacturingProcess29   38.87
## BiologicalMaterial04     38.44
## ManufacturingProcess25   37.15
plot(varImp(svmRTune), top = 10) 

plot(mdl.vip, top = 10)

The list is again dominated by the manufacturing process predictors rather than the biological material predictors, just as in the optimal linear model.
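
One quick way (a sketch, output not shown) to compare the two lists directly is to extract the top-10 names from each model's importance object and intersect them:

# Sketch: overlap between the top-10 predictors of the linear (PLS) and
# nonlinear (radial SVM) models
top10_names <- function(fit) {
  imp <- varImp(fit)$importance
  head(rownames(imp)[order(imp$Overall, decreasing = TRUE)], 10)
}
intersect(top10_names(mdl), top10_names(svmRTune))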

(c) Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?

correlation <- cor(select(df, 'ManufacturingProcess32','ManufacturingProcess36','ManufacturingProcess13','ManufacturingProcess17','ManufacturingProcess09', 'ManufacturingProcess33','ManufacturingProcess06', 'ManufacturingProcess12','BiologicalMaterial02','BiologicalMaterial06','Yield'))
corrplot::corrplot(correlation, method='square', type="upper")

Based on the correlation plot, it can be observed that the ManufacturingProcess32 exhibits the strongest positive correlation with the Yield. Conversely, three out of the top ten variables display a negative correlation with Yield. It is evident that the biological predictors demonstrate a positive association with Yield. In contrast, the relationship between the manufacturing processes and Yield displays variability.

Question 8.1 Recreate the simulated data from Exercise 7.2:

set.seed(200)
simulated = mlbench.friedman1(200, sd = 1)
simulated = cbind(simulated$x, simulated$y)
simulated = as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] = "y"

(a) Fit a random forest model to all of the predictors, then estimate the variable importance scores:

library(randomForest)
model_rf = randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rf_Imp = varImp(model_rf, scale = FALSE)

Did the random forest model significantly use the uninformative predictors (V6 – V10)?

set.seed(42)
rf_Imp
##         Overall
## V1   8.62743275
## V2   6.27437240
## V3   0.72305459
## V4   7.50258584
## V5   2.13575650
## V6   0.12395003
## V7   0.02927888
## V8  -0.11724317
## V9  -0.10344797
## V10  0.04312556

The random forest model did not significantly use these variables.

(b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

set.seed(42)
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9474738

Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

set.seed(42)
model_rf2 <- randomForest(y ~ ., data= simulated, importance= TRUE, ntree= 1000)
rf_imp2 <- varImp(model_rf2, scale=FALSE)
rf_imp2
##                 Overall
## V1          5.901016854
## V2          6.304789379
## V3          0.644534789
## V4          6.586485711
## V5          1.932969674
## V6          0.234129323
## V7          0.004037023
## V8         -0.112333624
## V9         -0.071834629
## V10        -0.191667743
## duplicate1  4.223128144

The importance score for V1 dropped (from about 8.6 to 5.9) after adding the new predictor; the correlation between V1 and duplicate1 is very high at 0.95. With model_rf2, the importance that belonged to V1 is now split between V1 and duplicate1, so more predictors look moderately important than necessary. Adding a second predictor that is also highly correlated with V1 would be expected to dilute V1's importance even further, as in the sketch below.
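
A hedged sketch of that last step (run on a copy of the data so the objects used in parts (c) and (d) are unchanged; output not shown):

# Sketch: add a second near-copy of V1 and refit; V1's importance should be
# spread across the correlated copies
sim2 <- simulated
set.seed(43)
sim2$duplicate2 <- sim2$V1 + rnorm(200) * .1
model_rf3 <- randomForest(y ~ ., data = sim2, importance = TRUE, ntree = 1000)
varImp(model_rf3, scale = FALSE)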

(c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

library(partykit)
set.seed(42)
mod <- cforest(y ~ ., data = simulated)
cfImp_cond <- varimp(mod, conditional = TRUE)
cfImp_uncond <- varimp(mod, conditional = FALSE)
barplot(sort(cfImp_cond),horiz = TRUE, main = 'Conditional')

barplot(sort(cfImp_uncond),horiz = TRUE, main = 'Unconditional')

In the above code, I fit a conditional inference random forest using partykit's cforest (an updated implementation of the party package's function) and calculated and visualized the variable importance measures for both the conditional and unconditional cases using bar plots. The conditional importance accounts for the correlation between V1 and duplicate1 and adjusts their scores accordingly, whereas the unconditional approach lets V1 and duplicate1 share importance in much the same way as the traditional random forest. Overall, the conditional importances do not follow the same pattern as the traditional random forest model.

(d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

library(gbm)
library(Cubist)
set.seed(42)
gbmGrid = expand.grid(interaction.depth = seq(1,5, by=2), 
                      n.trees = seq(100, 1000, by = 100), 
                      shrinkage = 0.1, 
                      n.minobsinnode = 5)
model_gbm = train(y ~ ., data = simulated, 
                  tuneGrid = gbmGrid, verbose = FALSE, 
                  method = 'gbm')
gbm_imp<-varImp(model_gbm)

model_cubist <- cubist(simulated[,-11], simulated[, 11])
cubist_imp<-varImp(model_cubist)

df = as.data.frame(cbind(gbm_imp$importance, cubist_imp))
names(df) = c("boosted", "cubist")
df
##               boosted cubist
## V1          78.110425     50
## V2          80.636364     50
## V3          33.137535     50
## V4         100.000000     50
## V5          37.742657     50
## V6           3.427648      0
## V7           3.319279      0
## V8           0.487909      0
## V9           0.000000      0
## V10          1.036546      0
## duplicate1  27.429750      0

In the above code, a grid of hyperparameters is created for the gradient boosting model, which is then fit with the train function, and its variable importance measures are calculated. The code then fits a Cubist model on the same data and calculates its variable importance as well. The importance measures from both models are combined into a data frame so that variable importance can be compared across the two modeling approaches. As before, the importance rankings of V6–V10 remain low in both the boosted tree and Cubist models, V4 is again the most influential predictor in the boosted model (the Cubist model weights V1–V5 equally), and the overall pattern is broadly unchanged.

Question 8.2 Use a simulation to show tree bias with different granularities.

library(rpart)
set.seed(42)

n <- 1000
x <- runif(n, -10, 10)
y <- 2 * sin(x) + rnorm(n, 0, 0.5)

simulated <- data.frame(x, y)

depths <- c(1, 2, 5, 10)

trees <- list()
for (depth in depths) {
  tree <- rpart(y ~ x, data = simulated, control = rpart.control(maxdepth = depth))
  trees[[as.character(depth)]] <- tree
}

plot(x, y, pch = 16, col = "black", main = "Tree Bias with Different Depths")
curve(2 * sin(x), -10, 10, add = TRUE, col = "blue", lwd = 2)

colors <- c("red", "green", "purple", "orange")
grid_x <- seq(-10, 10, by = 0.1)
for (i in seq_along(depths)) {
  depth <- depths[i]
  pred <- predict(trees[[as.character(depth)]], data.frame(x = grid_x))
  lines(grid_x, pred, col = colors[i], lwd = 2)
}
# one legend for all depths (calling legend() inside the loop would overplot
# every entry at the same position)
legend("topleft", legend = paste("Depth", depths), col = colors, lwd = 2, bty = "n")

To demonstrate tree bias at different granularities, we can simulate a dataset with a known underlying relationship and fit decision trees of varying depth. In this code, we start by simulating a dataset with a nonlinear relationship between x and y, defined as y = 2 * sin(x) + epsilon, where epsilon is normally distributed noise, and store it in a data frame called simulated. Next, we define a vector depths with the different tree depths we want to explore, and for each depth we fit a decision tree with the rpart function, using the maxdepth control parameter to limit how deep the tree can grow. We then plot the simulated data points with plot and overlay the true underlying relationship with curve in blue. After that, we iterate over the tree depths and plot each tree's fitted predictions over a fine grid of x values using lines, with a different color for each depth; shallow trees produce coarse step functions that miss much of the sine curve, while deeper trees track it more closely. Finally, a single legend identifies the depth associated with each fitted curve.

Question 8.3

In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

Fig. 8.24: Variable importance plots for stochastic gradient boosting on the solubility data, with bagging fraction and learning rate both set to 0.1 (left) and both set to 0.9 (right).

(a) Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?

The model on the right uses a higher bagging fraction and learning rate, which means a larger fraction of the training data is used to build each tree. As the bagging fraction approaches 1, the sampled subsets become nearly identical to one another, so the same few strong predictors are chosen early in every tree and come to dominate the importance scores; the large learning rate compounds this by letting each tree commit strongly to those predictors. In contrast, the model on the left uses a low bagging fraction and a low learning rate, producing a less aggressive, more varied ensemble that is likely to recognize more predictors as significant.
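
A hedged sketch of how the two extremes could be fit and compared on the solubility data (assuming the solTrainXtrans / solTrainY objects from the AppliedPredictiveModeling package; output not shown):

library(gbm)
library(AppliedPredictiveModeling)
data(solubility)
solTrain <- solTrainXtrans
solTrain$Solubility <- solTrainY

# Two extreme settings from Fig. 8.24: bagging fraction and shrinkage both
# 0.1 vs. both 0.9
set.seed(100)
gbm_small <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                 n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1)
gbm_large <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                 n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9)

# Relative influence: the 0.9/0.9 fit should concentrate on a handful of
# predictors, while the 0.1/0.1 fit spreads importance more evenly
head(summary(gbm_small, plotit = FALSE), 10)
head(summary(gbm_large, plotit = FALSE), 10)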

(b) Which model do you think would be more predictive of other samples?

The model with the higher learning rate and bagging fraction may overfit, since it concentrates on fewer variables, whereas the model on the left has greater potential to generalize to other samples and make useful predictions.

(c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

Increasing the interaction (tree) depth lets each tree split on more variables, so importance is expected to be distributed across a wider range of predictors and the slope of the predictor importance profile should become flatter for both models.
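
Continuing the sketch above (reusing solTrain; a hedged illustration only, output not shown), increasing interaction.depth on the aggressive fit should lengthen the tail of the relative-influence ranking:

# Sketch: deeper trees can use more predictors per tree, flattening the
# importance profile
gbm_deep <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9,
                interaction.depth = 10)
head(summary(gbm_deep, plotit = FALSE), 10)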

Question 8.7 Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

(a) Which tree-based regression model gives the optimal resampling and test set performance?

data(ChemicalManufacturingProcess)
set.seed(42)
imputed_data <- preProcess(ChemicalManufacturingProcess, method = c("knnImpute","nzv", "corr"))
full_data <- predict(imputed_data, ChemicalManufacturingProcess)

index_chem <- createDataPartition(full_data$Yield , p=.8, list=F)

train_chem <-  full_data[index_chem,] 
test_chem <- full_data[-index_chem,]

train_predictors <- train_chem[-c(1)]
test_predictors <-  test_chem[-c(1)]

The code above loads the dataset, performs preprocessing (k-nearest-neighbor imputation, removal of near-zero-variance predictors, and removal of highly correlated predictors), and splits the data into training and testing sets. It also extracts the predictor variables from each set for use in model training and evaluation.

set.seed(42)
rf_model <- randomForest(train_predictors, train_chem$Yield, importance = TRUE, ntree = 1000)
rfPred <- predict(rf_model, newdata = test_predictors)
postResample(pred = rfPred, obs = test_chem$Yield)
##      RMSE  Rsquared       MAE 
## 0.6443612 0.7484369 0.4572252
gbm_model <- gbm.fit(train_predictors, train_chem$Yield, distribution = "gaussian")
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        0.9530             nan     0.0010    0.0007
##      2        0.9524             nan     0.0010    0.0004
##      3        0.9517             nan     0.0010    0.0007
##      4        0.9509             nan     0.0010    0.0007
##      5        0.9500             nan     0.0010    0.0007
##      6        0.9493             nan     0.0010    0.0007
##      7        0.9487             nan     0.0010    0.0006
##      8        0.9480             nan     0.0010    0.0005
##      9        0.9473             nan     0.0010    0.0007
##     10        0.9467             nan     0.0010    0.0006
##     20        0.9402             nan     0.0010    0.0007
##     40        0.9273             nan     0.0010    0.0005
##     60        0.9146             nan     0.0010    0.0006
##     80        0.9027             nan     0.0010    0.0003
##    100        0.8905             nan     0.0010    0.0006
gbmPred <- predict(gbm_model, newdata = test_predictors)
postResample(pred = gbmPred, obs = test_chem$Yield)
##      RMSE  Rsquared       MAE 
## 1.0508638 0.6123073 0.8184981
cube_model <- cubist(train_predictors, train_chem$Yield)
cubePred <- predict(cube_model, newdata = test_predictors)
postResample(pred = cubePred, obs = test_chem$Yield)
##      RMSE  Rsquared       MAE 
## 0.6383669 0.6827951 0.4634283

The code above fits three tree-based regression models (a random forest, boosted trees via gbm, and Cubist) and evaluates each on the test set.

rbind(Random_Forest = postResample(pred = rfPred, obs = test_chem$Yield),
      Boosted_Trees = postResample(pred = gbmPred, obs = test_chem$Yield),
      Cubist = postResample(pred = cubePred, obs = test_chem$Yield))
##                    RMSE  Rsquared       MAE
## Random_Forest 0.6443612 0.7484369 0.4572252
## Boosted_Trees 1.0508638 0.6123073 0.8184981
## Cubist        0.6383669 0.6827951 0.4634283

The random forest model gives the best overall test-set metrics, with the highest R2 (0.75) and the lowest MAE; its RMSE (0.644) is essentially tied with that of the Cubist model (0.638).
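
The exercise also asks about resampling performance; one way to get a comparable resampling summary (a sketch, output not shown) is to refit each model through caret with a common cross-validation setup and compare the resampling distributions:

# Sketch: 5-fold CV resampling comparison of the three tree-based models
set.seed(42)
ctrl_chem <- trainControl(method = "cv", number = 5)
rf_cv     <- train(train_predictors, train_chem$Yield, method = "rf",
                   trControl = ctrl_chem)
gbm_cv    <- train(train_predictors, train_chem$Yield, method = "gbm",
                   trControl = ctrl_chem, verbose = FALSE)
cubist_cv <- train(train_predictors, train_chem$Yield, method = "cubist",
                   trControl = ctrl_chem)
summary(resamples(list(rf = rf_cv, gbm = gbm_cv, cubist = cubist_cv)))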

(b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

head(varImp(rf_model),10)
##                          Overall
## BiologicalMaterial01    6.050412
## BiologicalMaterial03   12.602229
## BiologicalMaterial05    5.129726
## BiologicalMaterial06    9.213824
## BiologicalMaterial08    6.095621
## BiologicalMaterial09    5.301006
## BiologicalMaterial10    4.272144
## BiologicalMaterial11   10.702325
## ManufacturingProcess01  4.907436
## ManufacturingProcess02  3.520197

As listed above, the biological material predictors appear to dominate, filling 8 of the 10 rows shown. Note, however, that head(varImp(rf_model), 10) returns the first ten rows of the importance table (which are in column order), not the ten highest scores, so the true top 10 should be read from the sorted scores, as in the sketch below. Taken at face value, the prominence of the biological materials differs from the optimal linear and nonlinear models, where the manufacturing process predictors dominated.
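
A small sketch (output not shown) of how to list the ten predictors with the largest random forest importance scores:

# Sketch: sort the importance scores and show the ten largest
rf_imp   <- varImp(rf_model)
rf_top10 <- rf_imp[order(rf_imp$Overall, decreasing = TRUE), , drop = FALSE]
head(rf_top10, 10)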

(c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

library(rpart)
library(rpart.plot)
set.seed(42)
rpart_tree <- rpart(Yield ~., data = train_chem)
rpart.plot(rpart_tree)

The code above fits a decision tree model using the rpart function. The formula Yield ~ . specifies that Yield is the response variable and that all other variables in the train_chem dataset should be used as predictors; the resulting model is stored in the rpart_tree object. I then plotted the tree with the rpart.plot function, which generates a visual representation with nodes representing decision rules and branches representing the splitting criteria. Plotting the optimal tree with the distribution of yield in the terminal nodes provides insight into the important predictors and their relationships with yield, as well as any nonlinear or interaction effects that may be present. Additionally, examining the distribution of yield in the terminal nodes can help identify patterns or clusters in the data, which can inform further analysis or modeling.

Market Basket Analysis / Recommender Systems (a simple problem)

Imagine 10000 receipts sitting on your table. Each receipt represents a transaction with items that were purchased. The receipt is a representation of stuff that went into a customer’s basket – and therefore ‘Market Basket Analysis’.

That is exactly what the Groceries Data Set contains: a collection of receipts with each line representing 1 receipt and the items purchased. Each line is called a transaction and each column in a row represents an item.

library(arules)
(grocery <- read.transactions("GroceryDataSet.csv", sep = ","))
## transactions in sparse format with
##  9835 transactions (rows) and
##  169 items (columns)
summary(grocery)
## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46 
##   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics
library(arules)
itemFrequencyPlot(grocery, topN=20)

basket_model <- apriori(grocery, parameter = list(support=.009, confidence=0.55 , minlen=2))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.55    0.1    1 none FALSE            TRUE       5   0.009      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 88 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [93 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [10 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

The arules package uses the apriori function to create association rules. The support parameter measures how frequently an itemset appears in the dataset: if an itemset such as {milk, bread} appears in 2 out of 5 transactions, its support is 0.4 (40%). When I attempted to use a support threshold greater than 0.009, I encountered an error with the apriori call, so 0.009 was used. The confidence parameter was set to 0.55; in the example {milk, bread} => {butter}, a confidence of 0.55 would mean that 55% of the transactions containing milk and bread also include butter. To ensure that the LHS (antecedent) is not empty, minlen was set to 2 instead of the default value of 1. With a support of 0.009 and a confidence of 0.55, 10 rules were generated, as shown in the output above.
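
To make the support and confidence definitions concrete, here is a tiny hand-made example (a sketch with hypothetical transactions, separate from the grocery data; output not shown):

# Sketch: five toy transactions; {milk, bread} appears in 2 of 5 (support 0.4),
# and 1 of those 2 also contains butter, so {milk, bread} => {butter} has
# confidence 0.5
toy <- list(c("milk", "bread", "butter"),
            c("milk", "bread"),
            c("milk", "eggs"),
            c("bread", "eggs"),
            c("butter"))
toy_trans <- as(toy, "transactions")

# support of {milk, bread} computed by hand
mean(sapply(toy, function(t) all(c("milk", "bread") %in% t)))   # 0.4

# the corresponding rule mined by apriori
toy_rules <- apriori(toy_trans,
                     parameter = list(support = 0.2, confidence = 0.5, minlen = 2))
inspect(subset(toy_rules, lhs %ain% c("milk", "bread") & rhs %in% "butter"))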

summary(basket_model)
## set of 10 rules
## 
## rule length distribution (lhs + rhs):sizes
##  3 
## 10 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       3       3       3       3       3       3 
## 
## summary of quality measures:
##     support           confidence        coverage            lift      
##  Min.   :0.009354   Min.   :0.5525   Min.   :0.01464   Min.   :2.162  
##  1st Qu.:0.009914   1st Qu.:0.5648   1st Qu.:0.01721   1st Qu.:2.210  
##  Median :0.010930   Median :0.5738   Median :0.01886   Median :2.246  
##  Mean   :0.011174   Mean   :0.5779   Mean   :0.01941   Mean   :2.408  
##  3rd Qu.:0.012227   3rd Qu.:0.5840   3rd Qu.:0.02105   3rd Qu.:2.445  
##  Max.   :0.014540   Max.   :0.6389   Max.   :0.02583   Max.   :3.030  
##      count      
##  Min.   : 92.0  
##  1st Qu.: 97.5  
##  Median :107.5  
##  Mean   :109.9  
##  3rd Qu.:120.2  
##  Max.   :143.0  
## 
## mining info:
##     data ntransactions support confidence
##  grocery          9835   0.009       0.55
##                                                                                       call
##  apriori(data = grocery, parameter = list(support = 0.009, confidence = 0.55, minlen = 2))
inspect(sort(basket_model, by="lift")[1:10])
##      lhs                                      rhs                support    
## [1]  {citrus fruit, root vegetables}       => {other vegetables} 0.010371124
## [2]  {root vegetables, tropical fruit}     => {other vegetables} 0.012302999
## [3]  {butter, yogurt}                      => {whole milk}       0.009354347
## [4]  {curd, yogurt}                        => {whole milk}       0.010066090
## [5]  {curd, other vegetables}              => {whole milk}       0.009862735
## [6]  {butter, other vegetables}            => {whole milk}       0.011489578
## [7]  {root vegetables, tropical fruit}     => {whole milk}       0.011997966
## [8]  {root vegetables, yogurt}             => {whole milk}       0.014539908
## [9]  {root vegetables, whipped/sour cream} => {whole milk}       0.009456024
## [10] {domestic eggs, other vegetables}     => {whole milk}       0.012302999
##      confidence coverage   lift     count
## [1]  0.5862069  0.01769192 3.029608 102  
## [2]  0.5845411  0.02104728 3.020999 121  
## [3]  0.6388889  0.01464159 2.500387  92  
## [4]  0.5823529  0.01728521 2.279125  99  
## [5]  0.5739645  0.01718353 2.246296  97  
## [6]  0.5736041  0.02003050 2.244885 113  
## [7]  0.5700483  0.02104728 2.230969 118  
## [8]  0.5629921  0.02582613 2.203354 143  
## [9]  0.5535714  0.01708185 2.166484  93  
## [10] 0.5525114  0.02226741 2.162336 121

In cases where there are numerous association rules that meet the support and confidence criteria, lift can be utilized to filter or rank the rules further. Higher lift values are indicative of stronger associations. The association with the greatest lift value is {citrus fruit, root vegetables} ==> {other vegetables}, while the one with the lowest lift value is {domestic eggs, other vegetables} ==> {whole milk}.
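
Lift can also be reproduced from the numbers already shown: it is the rule's confidence divided by the support of the right-hand side (a quick check using the counts from summary(grocery) above):

# Lift = confidence / support(RHS)
conf_rule <- 0.5862069     # confidence of {citrus fruit, root vegetables} => {other vegetables}
supp_rhs  <- 1903 / 9835   # support of {other vegetables}
conf_rule / supp_rhs       # about 3.03, matching the reported lift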

library(arulesViz)
rules <- head(basket_model, n = 10, by = "lift")
plot(rules, method = "graph")

The above code involves generating association rules and then selecting the top 10 rules based on their lift value. The lift measure indicates how much more often the antecedent and consequent of a rule occur together than would be expected if they were independent, with a value of 1 indicating independence and higher values indicating stronger association.

The code begins by selecting the top 10 rules using the head() function, with the “basket_model” data containing the association rules. The by parameter is set to “lift”, indicating that the rules should be ordered by their lift values.

The resulting rules are then plotted using the plot() function with the “graph” method, producing a visual representation of the relationship between the antecedents and consequents of the rules. This plot can help identify patterns or clusters of related items in the data, as well as indicate the strength of the association between the antecedent and consequents of each rule.

grocery_df <- as.data.frame(as.matrix(grocery@data)) 
row.names(grocery_df) <- grocery@itemInfo$labels
dim(grocery_df)
## [1]  169 9835

The code above converts the transactions object from the arules package into a data frame (items as rows, transactions as columns) and sets the row names based on the item labels.

set.seed(42)
k_cluster <- kmeans(grocery_df, centers=100, nstart=50, iter.max=10)
table(k_cluster$cluster)
## 
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
##  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40 
##   1   1   1  68   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
##  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
##  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80 
##   1   1   1   2   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
##  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 
##   1   1   1   1   1   1   1   1   2   1   1   1   1   1   1   1   1   1   1   1

The code above uses k-means clustering to group the rows (items) of the grocery_df data frame into 100 clusters and then displays a table of the number of items in each cluster.

Below are the 68 grocery items that fall into the single large cluster (cluster 24) when using 100 centers.

k_cluster$cluster[k_cluster$cluster == 24]
##         abrasive cleaner         artif. sweetener           baby cosmetics 
##                       24                       24                       24 
##                baby food                     bags         bathroom cleaner 
##                       24                       24                       24 
##                   brandy             canned fruit                  cereals 
##                       24                       24                       24 
##                  cleaner             cocoa drinks        cooking chocolate 
##                       24                       24                       24 
##                 cookware                    cream              curd cheese 
##                       24                       24                       24 
##              decalcifier              dental care female sanitary products 
##                       24                       24                       24 
##        finished products                     fish   flower soil/fertilizer 
##                       24                       24                       24 
##           frozen chicken            frozen fruits               hair spray 
##                       24                       24                       24 
##                    honey           instant coffee                      jam 
##                       24                       24                       24 
##                  ketchup           kitchen towels          kitchen utensil 
##                       24                       24                       24 
##              light bulbs                  liqueur       liquor (appetizer) 
##                       24                       24                       24 
##               liver loaf          make up remover           male cosmetics 
##                       24                       24                       24 
##             meat spreads                nut snack              nuts/prunes 
##                       24                       24                       24 
##         organic products          organic sausage                  popcorn 
##                       24                       24                       24 
##          potato products    preservation products                 prosecco 
##                       24                       24                       24 
##           pudding powder              ready soups          rubbing alcohol 
##                       24                       24                       24 
##                      rum           salad dressing                   sauces 
##                       24                       24                       24 
##                skin care           snack products                     soap 
##                       24                       24                       24 
##                 softener     sound storage medium                    soups 
##                       24                       24                       24 
##           sparkling wine            specialty fat     specialty vegetables 
##                       24                       24                       24 
##                   spices                    syrup                      tea 
##                       24                       24                       24 
##                  tidbits           toilet cleaner                  vinegar 
##                       24                       24                       24 
##                   whisky                 zwieback 
##                       24                       24