Chapter 6 - Linear Regression and Its Cousins

6.3

A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch.

Start R and use these commands to load the data:

Question

\((a)\) The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

library(AppliedPredictiveModeling)
library(mice) # impute NAs
library(visdat) # missing value
library(e1071)
library(caret)
library(dplyr)
library(tidyr)
library(mlbench) # chp 7 and 8
library(kableExtra)
library(randomForest)
library(party)
library(gbm)
library(Cubist)
library(rpart)
library(RANN)
Response
data("ChemicalManufacturingProcess")
Pharma_data <- ChemicalManufacturingProcess
str(Pharma_data)
## 'data.frame':    176 obs. of  58 variables:
##  $ Yield                 : num  38 42.4 42 41.4 42.5 ...
##  $ BiologicalMaterial01  : num  6.25 8.01 8.01 8.01 7.47 6.12 7.48 6.94 6.94 6.94 ...
##  $ BiologicalMaterial02  : num  49.6 61 61 61 63.3 ...
##  $ BiologicalMaterial03  : num  57 67.5 67.5 67.5 72.2 ...
##  $ BiologicalMaterial04  : num  12.7 14.7 14.7 14.7 14 ...
##  $ BiologicalMaterial05  : num  19.5 19.4 19.4 19.4 17.9 ...
##  $ BiologicalMaterial06  : num  43.7 53.1 53.1 53.1 54.7 ...
##  $ BiologicalMaterial07  : num  100 100 100 100 100 100 100 100 100 100 ...
##  $ BiologicalMaterial08  : num  16.7 19 19 19 18.2 ...
##  $ BiologicalMaterial09  : num  11.4 12.6 12.6 12.6 12.8 ...
##  $ BiologicalMaterial10  : num  3.46 3.46 3.46 3.46 3.05 3.78 3.04 3.85 3.85 3.85 ...
##  $ BiologicalMaterial11  : num  138 154 154 154 148 ...
##  $ BiologicalMaterial12  : num  18.8 21.1 21.1 21.1 21.1 ...
##  $ ManufacturingProcess01: num  NA 0 0 0 10.7 12 11.5 12 12 12 ...
##  $ ManufacturingProcess02: num  NA 0 0 0 0 0 0 0 0 0 ...
##  $ ManufacturingProcess03: num  NA NA NA NA NA NA 1.56 1.55 1.56 1.55 ...
##  $ ManufacturingProcess04: num  NA 917 912 911 918 924 933 929 928 938 ...
##  $ ManufacturingProcess05: num  NA 1032 1004 1015 1028 ...
##  $ ManufacturingProcess06: num  NA 210 207 213 206 ...
##  $ ManufacturingProcess07: num  NA 177 178 177 178 178 177 178 177 177 ...
##  $ ManufacturingProcess08: num  NA 178 178 177 178 178 178 178 177 177 ...
##  $ ManufacturingProcess09: num  43 46.6 45.1 44.9 45 ...
##  $ ManufacturingProcess10: num  NA NA NA NA NA NA 11.6 10.2 9.7 10.1 ...
##  $ ManufacturingProcess11: num  NA NA NA NA NA NA 11.5 11.3 11.1 10.2 ...
##  $ ManufacturingProcess12: num  NA 0 0 0 0 0 0 0 0 0 ...
##  $ ManufacturingProcess13: num  35.5 34 34.8 34.8 34.6 34 32.4 33.6 33.9 34.3 ...
##  $ ManufacturingProcess14: num  4898 4869 4878 4897 4992 ...
##  $ ManufacturingProcess15: num  6108 6095 6087 6102 6233 ...
##  $ ManufacturingProcess16: num  4682 4617 4617 4635 4733 ...
##  $ ManufacturingProcess17: num  35.5 34 34.8 34.8 33.9 33.4 33.8 33.6 33.9 35.3 ...
##  $ ManufacturingProcess18: num  4865 4867 4877 4872 4886 ...
##  $ ManufacturingProcess19: num  6049 6097 6078 6073 6102 ...
##  $ ManufacturingProcess20: num  4665 4621 4621 4611 4659 ...
##  $ ManufacturingProcess21: num  0 0 0 0 -0.7 -0.6 1.4 0 0 1 ...
##  $ ManufacturingProcess22: num  NA 3 4 5 8 9 1 2 3 4 ...
##  $ ManufacturingProcess23: num  NA 0 1 2 4 1 1 2 3 1 ...
##  $ ManufacturingProcess24: num  NA 3 4 5 18 1 1 2 3 4 ...
##  $ ManufacturingProcess25: num  4873 4869 4897 4892 4930 ...
##  $ ManufacturingProcess26: num  6074 6107 6116 6111 6151 ...
##  $ ManufacturingProcess27: num  4685 4630 4637 4630 4684 ...
##  $ ManufacturingProcess28: num  10.7 11.2 11.1 11.1 11.3 11.4 11.2 11.1 11.3 11.4 ...
##  $ ManufacturingProcess29: num  21 21.4 21.3 21.3 21.6 21.7 21.2 21.2 21.5 21.7 ...
##  $ ManufacturingProcess30: num  9.9 9.9 9.4 9.4 9 10.1 11.2 10.9 10.5 9.8 ...
##  $ ManufacturingProcess31: num  69.1 68.7 69.3 69.3 69.4 68.2 67.6 67.9 68 68.5 ...
##  $ ManufacturingProcess32: num  156 169 173 171 171 173 159 161 160 164 ...
##  $ ManufacturingProcess33: num  66 66 66 68 70 70 65 65 65 66 ...
##  $ ManufacturingProcess34: num  2.4 2.6 2.6 2.5 2.5 2.5 2.5 2.5 2.5 2.5 ...
##  $ ManufacturingProcess35: num  486 508 509 496 468 490 475 478 491 488 ...
##  $ ManufacturingProcess36: num  0.019 0.019 0.018 0.018 0.017 0.018 0.019 0.019 0.019 0.019 ...
##  $ ManufacturingProcess37: num  0.5 2 0.7 1.2 0.2 0.4 0.8 1 1.2 1.8 ...
##  $ ManufacturingProcess38: num  3 2 2 2 2 2 2 2 3 3 ...
##  $ ManufacturingProcess39: num  7.2 7.2 7.2 7.2 7.3 7.2 7.3 7.3 7.4 7.1 ...
##  $ ManufacturingProcess40: num  NA 0.1 0 0 0 0 0 0 0 0 ...
##  $ ManufacturingProcess41: num  NA 0.15 0 0 0 0 0 0 0 0 ...
##  $ ManufacturingProcess42: num  11.6 11.1 12 10.6 11 11.5 11.7 11.4 11.4 11.3 ...
##  $ ManufacturingProcess43: num  3 0.9 1 1.1 1.1 2.2 0.7 0.8 0.9 0.8 ...
##  $ ManufacturingProcess44: num  1.8 1.9 1.8 1.8 1.7 1.8 2 2 1.9 1.9 ...
##  $ ManufacturingProcess45: num  2.4 2.2 2.3 2.1 2.1 2 2.2 2.2 2.1 2.4 ...
Question

\((b)\) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values.

Response

A total of 24 rows contain NAs. We impute the missing values with the preProcess() function, using method = c("BoxCox", "knnImpute") as the imputation method.

visdat::vis_miss(Pharma_data, sort_miss = TRUE)

prepro_2 <- preProcess(subset(Pharma_data, select = - Yield),
  method=c("BoxCox","knnImpute"))
pharma_imput <- predict(prepro_2, Pharma_data[ , -1])

# pharma_imput <- mice(data = Pharma_data, m = 1, method = "pmm", printFlag=F, seed = 123)
# pharma_imput <- mice::complete(pharma_imput, 1)

visdat::vis_miss(pharma_imput)

Question

\((c)\) Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

Response
skewness <- apply(pharma_imput, 2, skewness)
# order by skewness magnitude in descending order
skew_ord <- round(skewness[order(-abs(skewness))], 3)
skew_ord %>% head(20)
## ManufacturingProcess26 ManufacturingProcess25 ManufacturingProcess18 
##                -12.858                -12.818                -12.736 
## ManufacturingProcess27 ManufacturingProcess20 ManufacturingProcess16 
##                -12.705                -12.638                -12.420 
## ManufacturingProcess31 ManufacturingProcess29 ManufacturingProcess43 
##                -11.967                -10.221                  9.055 
##   BiologicalMaterial07 ManufacturingProcess42 ManufacturingProcess44 
##                  7.399                 -5.450                 -4.970 
## ManufacturingProcess30 ManufacturingProcess39 ManufacturingProcess45 
##                 -4.822                 -4.269                 -4.078 
## ManufacturingProcess01 ManufacturingProcess06 ManufacturingProcess41 
##                 -3.934                  2.531                  2.176 
## ManufacturingProcess21 ManufacturingProcess05 
##                  1.729                  1.710

There is significant skewness in several columns: some are strongly right skewed and others are left skewed. We apply centering and scaling pre-processing to the dataset, since we plan to use modeling algorithms that require normalized predictors. We then split the data into a training set containing 80% of the data and a test set containing the remaining 20%.

# apply scaling
pharma_imput_scaled <- caret::preProcess(pharma_imput, method = c("center", "scale"))

pharma_imput <- predict(pharma_imput_scaled, pharma_imput)

# Restore the response variable values to original
pharma_imput$Yield = Pharma_data$Yield
#for reproducibility
set.seed(123) 

# Split the data into a training and a test set
trainRows <- createDataPartition(pharma_imput$Yield, p = .80, list = FALSE)
training <- pharma_imput[trainRows,]
test <- pharma_imput[-trainRows,]

The training set contains 144 rows and 58 columns; the test set contains 32 rows and 58 columns. The target (dependent) variable is Yield. We select PLS as our starting model to predict Yield.

# Tune PLS model
set.seed(123)

ctrl <- trainControl(method = "cv", number = 5)
pslfit <- train(
  x = training[, -58],
  y = training$Yield,
  method = "pls",
  tuneLength = 20,
  trControl = ctrl
)
pslfit
## Partial Least Squares 
## 
## 144 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 116, 116, 115, 116, 113 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     1.375077  0.4669179  1.1241613
##    2     1.180882  0.5839246  0.9516711
##    3     1.193040  0.5968404  0.9695887
##    4     1.189887  0.6117807  0.9736692
##    5     1.205546  0.6031753  0.9860156
##    6     1.236514  0.5896363  0.9972034
##    7     1.260458  0.5805530  1.0108745
##    8     1.333692  0.5532908  1.0479009
##    9     1.383997  0.5422293  1.0779654
##   10     1.486458  0.5069164  1.1352094
##   11     1.579043  0.4850630  1.1784240
##   12     1.686175  0.4620947  1.2163896
##   13     1.809331  0.4410962  1.2507273
##   14     1.968989  0.4181512  1.2937506
##   15     2.064291  0.4118259  1.3157159
##   16     2.145573  0.4017640  1.3408959
##   17     2.224777  0.4002886  1.3575358
##   18     2.282051  0.4003697  1.3655592
##   19     2.349766  0.3978435  1.3702997
##   20     2.442304  0.3936885  1.3885161
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 2.
plot(pslfit)

The optimal number of PLS components with respect to the RMSE performance metric is 2.

Question

\((d)\) Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

Response
# Prediction on test 
pred_test <- predict(pslfit, newdata = test, ncomp = 2)
# test matrices
test_perf <- postResample(pred_test, test$Yield)
# Train matrices
train_perf <- getTrainPerf(pslfit)
# create data frame
perf_df <- data.frame(c(train_perf[1:3], test_perf[1:3]))
# Rename column
names(perf_df) <- c("Training RMSE", "Training Rsquared", "Training MAE","Test RMSE", "Test Rsquared", "Test MAE")
# Train and Test matrices
perf_df
plot(residuals(pslfit, ncomp = 2))
abline(h = 0)

The test-set RMSE and R-squared turned out to be slightly lower than the corresponding resampled values on the training set. The model also appears to fit well: the residual plot shows residuals scattered randomly and uniformly around zero.

Question

\((e)\) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

Response
plot(caret::varImp(pslfit), top = 20)

The graph above shows the importance of the predictors in this model: the top five most important predictors all come from the manufacturing process. Moreover, among the top 20, more than half are manufacturing process predictors as opposed to biological material predictors.
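To make this count explicit, a minimal sketch (assuming the pslfit object from above) tallies how many of the top 20 predictors come from each group:

# Count manufacturing process vs. biological material predictors among the top 20
imp <- varImp(pslfit)$importance
top20 <- rownames(imp)[order(-imp$Overall)][1:20]
table(ifelse(grepl("^Manufacturing", top20), "Process", "Biological"))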

Question

\((f)\) Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

Response
imp_predictors <- c('ManufacturingProcess32', 'ManufacturingProcess13', 'ManufacturingProcess09', 'ManufacturingProcess17', 'ManufacturingProcess36', 'Yield')

pharma_imput %>% select(all_of(imp_predictors)) %>% gather(var, val, -Yield) %>%
  ggplot(aes(x = val, y = Yield, color = var)) + geom_point() + facet_wrap(~var)

Among the top five most important predictors, ManufacturingProcess09, ManufacturingProcess13, ManufacturingProcess17, and ManufacturingProcess32 show some relationship with Yield, while ManufacturingProcess36 does not show any clear relationship with the response. The sub-plots above illustrate both positive and negative associations with the response variable. To improve yield in future runs, the manufacturing processes with positive coefficients should be increased, while the processes with negative coefficients should be reduced.
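As a rough check on the direction of each relationship, a minimal sketch (assuming pharma_imput and imp_predictors from above) computes the correlation of each top predictor with Yield; the sign suggests whether a process setting should be increased or decreased:

# Correlation of each top process predictor with Yield
# (positive: increase the setting, negative: reduce it)
top_vars <- setdiff(imp_predictors, "Yield")
round(cor(pharma_imput[, top_vars], pharma_imput$Yield), 2)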

Chapter 7 Nonlinear Regression Models

Problem 7.2

Question

Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data:

\(y = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + N(0, \sigma^2)\)

where the x values are random variables uniformly distributed on [0, 1] (there are also 5 other non-informative variables created in the simulation). The package mlbench contains a function called mlbench.friedman1 that simulates these data:

set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
# We convert the 'x' data from a matrix to a data frame
## One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)
## Look at the data using
featurePlot(trainingData$x, trainingData$y)

This creates a list with a vector ‘y’ and a matrix of predictors ‘x’. Also simulate a large test set to estimate the true error rate with good precision:

testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)

Tune several models on these data. For example:

knnModel <- train(x = trainingData$x,
                  y = trainingData$y,
                  method = "knn",
                  preProc = c("center", "scale"),
                  tuneLength = 10)

knnModel
## k-Nearest Neighbors 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  3.466085  0.5121775  2.816838
##    7  3.349428  0.5452823  2.727410
##    9  3.264276  0.5785990  2.660026
##   11  3.214216  0.6024244  2.603767
##   13  3.196510  0.6176570  2.591935
##   15  3.184173  0.6305506  2.577482
##   17  3.183130  0.6425367  2.567787
##   19  3.198752  0.6483184  2.592683
##   21  3.188993  0.6611428  2.588787
##   23  3.200458  0.6638353  2.604529
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.
knnPred <- predict(knnModel, newdata = testData$x)
# performance values
knn_results <- postResample(pred = knnPred, obs = testData$y)
knn_results
##      RMSE  Rsquared       MAE 
## 3.2040595 0.6819919 2.5683461
Question

Which models appear to give the best performance? Does MARS select the informative predictors (those named X1–X5)?

Response

The question above already fits a KNN model. We will apply three more models and compare their performance metrics; based on those metrics we will decide on the final model for these data.

Model 1: MARS (mars_model)

MARS stands for Multivariate Adaptive Regression Splines. Like neural networks and partial least squares, MARS (Friedman 1991) uses surrogate features instead of the original predictors.

set.seed(100)
multi_var <- expand.grid(.degree=1:2, .nprune=2:38)

# model fit
mars_model <- train(x=trainingData$x, y=trainingData$y,
                  method="earth",
                  preProcess = c("center", "scale"),
                  tuneGrid=multi_var,
                  trControl = ctrl)
# model predict
mars_predict <- predict(mars_model, newdata=testData$x)
mars_results <- postResample(pred=mars_predict, obs=testData$y)
mars_results
##      RMSE  Rsquared       MAE 
## 1.2779993 0.9338365 1.0147070

Applying varImp() to the MARS model shows which variables carry the most importance in the simulation. The graph below shows that X1 has the highest importance (100).

plot(varImp(mars_model))
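To confirm whether MARS selects only the informative predictors (X1-X5), we can also inspect the terms retained by the final earth fit (a sketch, assuming the mars_model object from above):

# Terms kept in the final MARS (earth) model; for the Friedman1 simulation
# the retained predictors are expected to come from X1-X5
summary(mars_model$finalModel)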

Model 2: SVM

SVMs are a class of powerful, highly flexible modeling techniques. The theory behind SVMs was originally developed in the context of classification models.

svm <- train(x=trainingData$x, y=trainingData$y,
                  method="svmRadial",
                  preProcess=c("center", "scale"),
                  tuneLength=20)

svm_predict <- predict(svm, newdata=testData$x)
svm_results<- postResample(pred=svm_predict, obs=testData$y)
svm_results
##      RMSE  Rsquared       MAE 
## 2.0624702 0.8274948 1.5670719

Model 3: Neural Networks

Neural networks are powerful nonlinear regression techniques inspired by theories about how the brain works. Like partial least squares, the outcome is modeled by an intermediary set of unobserved variables (called hidden variables or hidden units here). The model has an input layer, one or more hidden layers, and an output layer.

set.seed(100)
neural_model <- expand.grid(.decay = c(0, 0.01, .1),
                        .size = c(1:10),
                        .bag = FALSE)

ctrl <- trainControl(method = "cv")

avNNet_model <- train(trainingData$x, trainingData$y,
                 method = "avNNet",
                 tuneGrid = neural_model,
                 trControl = ctrl,
                 preProcess = c("center", "scale"),
                 linout = TRUE,
                 trace = FALSE,
                 maxit = 500,
                 MaxNWts = 10 * (ncol(trainingData$x) + 1) + 10 + 1)

avNNet_predict <- predict(avNNet_model, newdata = testData$x)
avNNet_results <- postResample(pred = avNNet_predict, obs = testData$y)
avNNet_results
##      RMSE  Rsquared       MAE 
## 2.1328254 0.8228765 1.5911867

Model comparison:

Collect the performance metrics of all four models in a table and sort by RMSE:

model_comparison <- data.frame(rbind(avNNet_results, mars_results, svm_results, knn_results))
#sort by RMSE
model_comparison[order(model_comparison$RMSE),] 

The MARS model outperforms all the other models based on RMSE and MAE.

Problem 7.5

Question

Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.

Response

In this exercise we work with the ChemicalManufacturingProcess dataset using the caret library. We apply the same imputation, data splitting, and pre-processing steps that we have used in exercise 6.3:

- Imputation: fill in missing values with the preProcess() function using the BoxCox and knnImpute methods.
- Pre-processing: remove near-zero-variance predictors (only one variable) and highly correlated predictors, and center and scale all remaining predictors.
- Data splitting: 80/20 partition into training and test sets.

We then fit and evaluate four models using the following algorithms and tuning parameters:

- Neural network (NNet): "nnet" method from the nnet library; tune size (number of hidden units) and decay (weight decay).
- Multivariate adaptive regression spline (MARS): "earth" method from the earth library; tune nprune (number of terms) and degree (product degree).
- Support vector machine with radial basis function kernel (SVM): "svmRadial" method from the kernlab library; tune C (cost), with a default estimate of sigma.
- K-nearest neighbors (KNN): "knn" method; tune the k parameter.

We specify a common tuning methodology for all models within the train() function: tuning parameters are chosen to minimize RMSE on the resampling holdout samples. For each model, we show the resampling output below and then evaluate performance on the test set.
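A sketch of such a shared resampling control (the name cmp_ctrl is introduced here for illustration; the individual train() calls below declare their controls inline, and the neural network uses caret's default bootstrap resampling):

# 10-fold cross-validation; RMSE, the regression default, selects the tuning parameters
cmp_ctrl <- trainControl(method = "cv", number = 10)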

Data imputation

Impute the data using the BoxCox and knnImpute methods for ChemicalManufacturingProcess. Missing values make up about 1% of the entire dataset.

visdat::vis_miss(ChemicalManufacturingProcess, sort_miss = TRUE)

prepro_3 <- preProcess(subset(ChemicalManufacturingProcess, select = - Yield),
  method=c("BoxCox","knnImpute"))
data <- predict(prepro_3, ChemicalManufacturingProcess[ , -1])

visdat::vis_miss(data)

Pre-processing:

To pre-process the data, we center and scale the predictors.

data_preprocess <- preProcess(data,
                    method = c("center", "scale"))

# Need to obtain new transformed values
data_trans <- predict(data_preprocess, data)

# Restore the response variable values to original
data_trans$Yield = ChemicalManufacturingProcess$Yield

If there are strong linear correlations between predictors, we remove the offending predictors because they can cause collinearity.

We also look for predictors that have near-zero variance; these predictors will be removed as well.

data_trans <- data_trans[,-findCorrelation(cor(data_trans), cutoff=0.90)]

Ten columns with pairwise correlations of 0.90 or more are removed. Next, we remove the near-zero-variance predictors using the nearZeroVar() function.

data_trans <- data_trans[,-nearZeroVar(data_trans)]
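A quick sanity check (a sketch) confirms that no highly correlated or near-zero-variance predictors remain after these two filtering steps:

# Both counts should be zero after filtering
length(findCorrelation(cor(data_trans), cutoff = 0.90))
length(nearZeroVar(data_trans))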
Split train & test

Split the entire dataset into training and test sets: the training set receives 80% of the data and the test set receives 20%.

set.seed(123)
n <- nrow(data_trans)
trainIndex <- sample(1:n, size = round(0.80*n), replace=FALSE)
CMPtrain <- data_trans[trainIndex ,]
CMPtest <- data_trans[-trainIndex ,]
Model Building

We build four models: K-nearest neighbors, multivariate adaptive regression splines, neural networks, and support vector machines.

K-Nearest Neighbors (knn)
set.seed(123)
knnModel <- train(x = CMPtrain[,-48], # Yield is located in the column 48
                 y = CMPtrain$Yield,  
                 method = "knn",
                 tuneGrid = data.frame(.k = 1:20),
                 trControl = trainControl(method = "cv"))
knnModel
## k-Nearest Neighbors 
## 
## 141 samples
##  47 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 126, 127, 126, 126, 126, 128, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE      
##    1  1.516714  0.4058245  1.1845736
##    2  1.249970  0.5181932  0.9933675
##    3  1.255281  0.5185962  1.0128123
##    4  1.257593  0.5057254  1.0186190
##    5  1.254299  0.5069969  1.0250140
##    6  1.296684  0.4844623  1.0724337
##    7  1.280787  0.4941769  1.0444561
##    8  1.294564  0.4977758  1.0610015
##    9  1.301398  0.4935757  1.0678843
##   10  1.338288  0.4747992  1.0845992
##   11  1.347469  0.4720066  1.0976453
##   12  1.363837  0.4538867  1.1043217
##   13  1.368538  0.4554942  1.1135750
##   14  1.368101  0.4611611  1.1128507
##   15  1.376482  0.4550788  1.1256116
##   16  1.377454  0.4599080  1.1261199
##   17  1.372638  0.4639051  1.1167364
##   18  1.378720  0.4673412  1.1197606
##   19  1.387137  0.4658122  1.1235132
##   20  1.389994  0.4707557  1.1236569
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 2.

Let’s test the above model with the testing data.

knn_pred <- predict(knnModel, newdata = CMPtest[,-48])
knn <- postResample(pred = knn_pred, obs = CMPtest$Yield)
knn
##      RMSE  Rsquared       MAE 
## 1.3765446 0.5774289 1.0707143

The best tuning parameter for the KNN model is k = 2, which produced the smallest cross-validated RMSE.

Neural Network Model
set.seed(123)
nnetModel <- train(x = CMPtrain[,-48], 
                   y = CMPtrain$Yield,  
                  method = "nnet",
                  tuneLength = 10,
                  linout = TRUE,
                  trace = FALSE,
                  MaxNWts = 10 * (ncol(CMPtrain[,-48]) + 1) + 10 + 1,
                  maxit = 500)
nnetModel
## Neural Network 
## 
## 141 samples
##  47 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 141, 141, 141, 141, 141, 141, ... 
## Resampling results across tuning parameters:
## 
##   size  decay         RMSE       Rsquared    MAE     
##    1    0.0000000000   1.812885  0.09851196  1.465758
##    1    0.0001000000   1.677030  0.26096198  1.349153
##    1    0.0002371374   1.693386  0.28176683  1.357400
##    1    0.0005623413   2.059416  0.25235909  1.585649
##    1    0.0013335214   1.826111  0.26231962  1.456429
##    1    0.0031622777   1.917697  0.24758079  1.530184
##    1    0.0074989421   1.952680  0.22713439  1.521909
##    1    0.0177827941   1.959688  0.25325548  1.530635
##    1    0.0421696503   1.777377  0.28785384  1.402350
##    1    0.1000000000   1.872226  0.27702021  1.431847
##    3    0.0000000000   4.032461  0.15280343  3.088418
##    3    0.0001000000   3.161482  0.20847673  2.401343
##    3    0.0002371374   3.525626  0.18165791  2.652646
##    3    0.0005623413   3.448209  0.13487962  2.611443
##    3    0.0013335214   2.846306  0.17021341  2.177920
##    3    0.0031622777   2.915725  0.16987455  2.221842
##    3    0.0074989421   3.087645  0.16681344  2.313872
##    3    0.0177827941   2.911343  0.16279825  2.187210
##    3    0.0421696503   3.078800  0.14029968  2.244149
##    3    0.1000000000   2.987379  0.14446871  2.204174
##    5    0.0000000000   4.631692  0.07881013  3.456920
##    5    0.0001000000   3.256342  0.15866572  2.512243
##    5    0.0002371374   3.377737  0.14117837  2.551482
##    5    0.0005623413   3.273891  0.15334058  2.503177
##    5    0.0013335214   3.290361  0.13262904  2.613471
##    5    0.0031622777   3.085075  0.15203721  2.401712
##    5    0.0074989421   3.121706  0.15164557  2.408969
##    5    0.0177827941   3.287562  0.11159204  2.471967
##    5    0.0421696503   2.946769  0.15963408  2.232218
##    5    0.1000000000   2.719887  0.17600712  2.022433
##    7    0.0000000000   7.166252  0.05745757  4.899185
##    7    0.0001000000   4.866668  0.10064229  3.679719
##    7    0.0002371374   4.468772  0.11599484  3.202812
##    7    0.0005623413   4.361669  0.11545413  3.305941
##    7    0.0013335214   3.725697  0.14246711  2.835150
##    7    0.0031622777   3.259566  0.11677015  2.473299
##    7    0.0074989421   3.318786  0.08217356  2.522008
##    7    0.0177827941   3.043056  0.12343661  2.366573
##    7    0.0421696503   2.736960  0.19870993  2.144898
##    7    0.1000000000   2.647512  0.20456166  1.997333
##    9    0.0000000000  11.847590  0.04623609  7.697505
##    9    0.0001000000   9.357004  0.09710093  5.585527
##    9    0.0002371374   7.199934  0.07981913  4.482027
##    9    0.0005623413   5.161094  0.13062187  3.715648
##    9    0.0013335214   4.424005  0.15792303  3.357217
##    9    0.0031622777   3.760788  0.15556803  2.869292
##    9    0.0074989421   3.493355  0.13743185  2.687504
##    9    0.0177827941   3.248757  0.14007802  2.564664
##    9    0.0421696503   3.031497  0.11817704  2.374082
##    9    0.1000000000   2.809427  0.19221014  2.192595
##   11    0.0000000000        NaN         NaN       NaN
##   11    0.0001000000        NaN         NaN       NaN
##   11    0.0002371374        NaN         NaN       NaN
##   11    0.0005623413        NaN         NaN       NaN
##   11    0.0013335214        NaN         NaN       NaN
##   11    0.0031622777        NaN         NaN       NaN
##   11    0.0074989421        NaN         NaN       NaN
##   11    0.0177827941        NaN         NaN       NaN
##   11    0.0421696503        NaN         NaN       NaN
##   11    0.1000000000        NaN         NaN       NaN
##   13    0.0000000000        NaN         NaN       NaN
##   13    0.0001000000        NaN         NaN       NaN
##   13    0.0002371374        NaN         NaN       NaN
##   13    0.0005623413        NaN         NaN       NaN
##   13    0.0013335214        NaN         NaN       NaN
##   13    0.0031622777        NaN         NaN       NaN
##   13    0.0074989421        NaN         NaN       NaN
##   13    0.0177827941        NaN         NaN       NaN
##   13    0.0421696503        NaN         NaN       NaN
##   13    0.1000000000        NaN         NaN       NaN
##   15    0.0000000000        NaN         NaN       NaN
##   15    0.0001000000        NaN         NaN       NaN
##   15    0.0002371374        NaN         NaN       NaN
##   15    0.0005623413        NaN         NaN       NaN
##   15    0.0013335214        NaN         NaN       NaN
##   15    0.0031622777        NaN         NaN       NaN
##   15    0.0074989421        NaN         NaN       NaN
##   15    0.0177827941        NaN         NaN       NaN
##   15    0.0421696503        NaN         NaN       NaN
##   15    0.1000000000        NaN         NaN       NaN
##   17    0.0000000000        NaN         NaN       NaN
##   17    0.0001000000        NaN         NaN       NaN
##   17    0.0002371374        NaN         NaN       NaN
##   17    0.0005623413        NaN         NaN       NaN
##   17    0.0013335214        NaN         NaN       NaN
##   17    0.0031622777        NaN         NaN       NaN
##   17    0.0074989421        NaN         NaN       NaN
##   17    0.0177827941        NaN         NaN       NaN
##   17    0.0421696503        NaN         NaN       NaN
##   17    0.1000000000        NaN         NaN       NaN
##   19    0.0000000000        NaN         NaN       NaN
##   19    0.0001000000        NaN         NaN       NaN
##   19    0.0002371374        NaN         NaN       NaN
##   19    0.0005623413        NaN         NaN       NaN
##   19    0.0013335214        NaN         NaN       NaN
##   19    0.0031622777        NaN         NaN       NaN
##   19    0.0074989421        NaN         NaN       NaN
##   19    0.0177827941        NaN         NaN       NaN
##   19    0.0421696503        NaN         NaN       NaN
##   19    0.1000000000        NaN         NaN       NaN
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 1 and decay = 1e-04.

Let’s test the above model with the testing data.

nnets_pred <- predict(nnetModel, newdata = CMPtest[,-48])
nnets <- postResample(pred = nnets_pred, obs = CMPtest$Yield)
nnets
##      RMSE  Rsquared       MAE 
## 1.6652510 0.3813292 1.3301745
Support Vector Machines Model
set.seed(100)
svmModel <- train(x = CMPtrain[,-48], # Yield is located in column 48
                  y = CMPtrain$Yield,
                  method = "svmRadial",
                  tuneLength = 14,
                  trControl = trainControl(method = "cv"))
svmModel
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 141 samples
##  47 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 128, 127, 126, 128, 126, 127, ... 
## Resampling results across tuning parameters:
## 
##   C        RMSE      Rsquared   MAE      
##      0.25  1.388162  0.4693727  1.1356413
##      0.50  1.280608  0.5222836  1.0424002
##      1.00  1.214869  0.5728213  0.9805095
##      2.00  1.176706  0.6078421  0.9557710
##      4.00  1.136574  0.6438766  0.9294317
##      8.00  1.103243  0.6679311  0.9126026
##     16.00  1.100820  0.6688793  0.9109972
##     32.00  1.100820  0.6688793  0.9109972
##     64.00  1.100820  0.6688793  0.9109972
##    128.00  1.100820  0.6688793  0.9109972
##    256.00  1.100820  0.6688793  0.9109972
##    512.00  1.100820  0.6688793  0.9109972
##   1024.00  1.100820  0.6688793  0.9109972
##   2048.00  1.100820  0.6688793  0.9109972
## 
## Tuning parameter 'sigma' was held constant at a value of 0.01737454
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01737454 and C = 16.

Let’s test the above model with the testing data.

svm_pred <- predict(svmModel, newdata = CMPtest[,-48])
svm <- postResample(pred = svm_pred, obs = CMPtest$Yield)
svm
##      RMSE  Rsquared       MAE 
## 1.1598393 0.7684599 0.8681171
Multivariate Adaptive Regression Splines Model
set.seed(123)
# Define the candidate models to test
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)

# Fix the seed so that the results can be reproduced
marsModel <- train(x = CMPtrain[,-48],
                  y = CMPtrain$Yield,  
                  method = "earth",
                  tuneGrid = marsGrid,
                  trControl = trainControl(method = "cv"))
marsModel
## Multivariate Adaptive Regression Spline 
## 
## 141 samples
##  47 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 126, 127, 126, 126, 126, 128, ... 
## Resampling results across tuning parameters:
## 
##   degree  nprune  RMSE      Rsquared   MAE      
##   1        2      1.401524  0.4337912  1.0976114
##   1        3      1.227718  0.5383539  1.0085592
##   1        4      1.259802  0.5201858  1.0097998
##   1        5      1.277350  0.5090982  1.0250165
##   1        6      1.249772  0.5282677  1.0047732
##   1        7      1.245838  0.5225228  0.9837999
##   1        8      1.269691  0.5133311  1.0186896
##   1        9      1.268687  0.5061341  1.0258303
##   1       10      1.253142  0.5217915  1.0108639
##   1       11      1.268615  0.5111838  1.0411777
##   1       12      1.246591  0.5239903  1.0006034
##   1       13      1.216811  0.5462984  0.9787978
##   1       14      1.210860  0.5538109  0.9780245
##   1       15      1.220026  0.5461268  0.9928672
##   1       16      1.220026  0.5461268  0.9928672
##   1       17      1.220026  0.5461268  0.9928672
##   1       18      1.220026  0.5461268  0.9928672
##   1       19      1.220026  0.5461268  0.9928672
##   1       20      1.220026  0.5461268  0.9928672
##   1       21      1.220026  0.5461268  0.9928672
##   1       22      1.220026  0.5461268  0.9928672
##   1       23      1.220026  0.5461268  0.9928672
##   1       24      1.220026  0.5461268  0.9928672
##   1       25      1.220026  0.5461268  0.9928672
##   1       26      1.220026  0.5461268  0.9928672
##   1       27      1.220026  0.5461268  0.9928672
##   1       28      1.220026  0.5461268  0.9928672
##   1       29      1.220026  0.5461268  0.9928672
##   1       30      1.220026  0.5461268  0.9928672
##   1       31      1.220026  0.5461268  0.9928672
##   1       32      1.220026  0.5461268  0.9928672
##   1       33      1.220026  0.5461268  0.9928672
##   1       34      1.220026  0.5461268  0.9928672
##   1       35      1.220026  0.5461268  0.9928672
##   1       36      1.220026  0.5461268  0.9928672
##   1       37      1.220026  0.5461268  0.9928672
##   1       38      1.220026  0.5461268  0.9928672
##   2        2      1.401524  0.4337912  1.0976114
##   2        3      1.202229  0.5627575  0.9687881
##   2        4      1.185578  0.5510958  0.9670450
##   2        5      1.197972  0.5346498  0.9597890
##   2        6      1.195972  0.5396585  0.9818698
##   2        7      1.151683  0.5726105  0.9539994
##   2        8      1.267274  0.5302324  1.0015602
##   2        9      1.164125  0.5648688  0.9415893
##   2       10      1.199317  0.5722273  0.9562336
##   2       11      1.185495  0.5825754  0.9473839
##   2       12      1.216935  0.5675808  0.9636410
##   2       13      1.240276  0.5436361  0.9977262
##   2       14      1.263276  0.5338953  0.9833144
##   2       15      1.213467  0.5711692  0.9279406
##   2       16      1.202555  0.5781057  0.9242256
##   2       17      1.225621  0.5754002  0.9483430
##   2       18      1.249723  0.5636869  0.9716605
##   2       19      1.263617  0.5560603  0.9755511
##   2       20      1.262320  0.5558744  0.9746078
##   2       21      1.243833  0.5656689  0.9665508
##   2       22      1.261830  0.5516708  0.9604946
##   2       23      1.278468  0.5448387  0.9867787
##   2       24      1.275174  0.5475677  0.9814726
##   2       25      1.275174  0.5475677  0.9814726
##   2       26      1.275174  0.5475677  0.9814726
##   2       27      1.275174  0.5475677  0.9814726
##   2       28      1.275174  0.5475677  0.9814726
##   2       29      1.275174  0.5475677  0.9814726
##   2       30      1.275174  0.5475677  0.9814726
##   2       31      1.275174  0.5475677  0.9814726
##   2       32      1.275174  0.5475677  0.9814726
##   2       33      1.275174  0.5475677  0.9814726
##   2       34      1.275174  0.5475677  0.9814726
##   2       35      1.275174  0.5475677  0.9814726
##   2       36      1.275174  0.5475677  0.9814726
##   2       37      1.275174  0.5475677  0.9814726
##   2       38      1.275174  0.5475677  0.9814726
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 7 and degree = 2.

Let’s test the above model with the testing data.

mars_pred <- predict(marsModel, newdata = CMPtest[,-48])
mars <- postResample(pred = mars_pred, obs = CMPtest$Yield)
mars
##      RMSE  Rsquared       MAE 
## 1.2142177 0.6607828 0.9711687
Question

\((a)\) Which nonlinear regression model gives the optimal resampling and test set performance?

Response

Let’s compare the returned values from the postResample function:

train_knn <- getTrainPerf(knnModel)[1:3]
train_nnets <- getTrainPerf(nnetModel)[1:3]
train_svm <- getTrainPerf(svmModel)[1:3]
train_mars <- getTrainPerf(marsModel)[1:3]

train_model_com <- data.frame(rbind(train_knn, train_nnets, train_svm, train_mars))
rownames(train_model_com) <- c("KNN", "Neural Nets", "SVM", "MARS")

train_model_com <- train_model_com[order(train_model_com$TrainRMSE),]
train_model_com

In effect, the SVM model returned the lowest RMSE and MAE, as seen in the table above. Let's now compare the metrics obtained on the test data.

model_comparison_2 <- data.frame(rbind(knn, nnets, svm, mars))
#sort by RMSE
model_comparison_2[order(model_comparison_2$RMSE),] 

On the test data, the SVM model also performs better than the other models.

Question

\((b)\) Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?

Response

Which predictors are most important in the optimal nonlinear regression model?

The optimal nonlinear model is the SVM model. Applying varImp() identifies the most important predictors: ManufacturingProcess32, ManufacturingProcess06, ManufacturingProcess13, ManufacturingProcess17, ManufacturingProcess09, BiologicalMaterial03, etc.

ggplot(varImp(svmModel), top = 10) + 
  labs(title = paste0("Variable importance: ", svmModel$modelInfo$label))

Do either the biological or process variables dominate the list?

From the plot above, we notice that 7 of the top 10 predictors are manufacturing process variables and 3 are biological predictors, so the process variables dominate the list.

How do the top ten important predictors compare to the top ten predictors from the optimal linear model?

Recall that the optimal linear model is PLS and the optimal nonlinear model is SVM.

ggplot(varImp(pslfit), top = 10) +
  labs(title = paste0("Variable importance: ", pslfit$modelInfo$label))

ggplot(varImp(svmModel), top = 10) +
  labs(title = paste0("Variable importance: ", svmModel$modelInfo$label))

ManufacturingProcess32 shows as highly important in both models. Within the top 10, PLS ranks BiologicalMaterial02 lowest, while SVM ranks BiologicalMaterial04 lowest.

Question

\((c)\) Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?

Response

We now take the top 10 predictors, order them by importance, and draw a feature plot to explore their relationships with the response.

# predictors importance
vimp <- varImp(svmModel)$importance
# top 10 predictors
top10_vars <- head(rownames(vimp)[order(-vimp$Overall)], 10)

X <- ChemicalManufacturingProcess[,top10_vars]
Y <- ChemicalManufacturingProcess$Yield

featurePlot(X,Y)

Among the top 10 most important predictors, the feature plot above shows positive, negative, and negligible associations with the response. To improve yield in future runs, the manufacturing processes with positive relationships should be increased, while those with negative relationships should be reduced.
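As a numeric complement to the feature plot, a minimal sketch (assuming the X and Y objects from above) reports each top predictor's correlation with Yield; the sign suggests the direction of adjustment for the controllable process variables:

# Pairwise-complete correlations, since the raw data still contain missing values
round(cor(X, Y, use = "pairwise.complete.obs"), 2)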

Chapter 8 : Regression Trees and Rule-Based Models

8.1

Question

Recreate the simulated data from Exercise 7.2:

set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
Question

Fit a random forest model to all of the predictors, then estimate the variable importance scores:

model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)

\((a)\) Did the random forest model significantly use the uninformative predictors (V6 – V10)?

Response

According to the results below, the random forest model does not significantly use the uninformative predictors (V6-V10).

varImp1 <- varImp(model1, scale = FALSE)
varImp1
Question

Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9396216

\((b)\) Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

Response

After adding another predictor that is highly correlated with V1, the importance score of V1 is reduced: the importance is now split between V1 and the highly correlated duplicate1.

model2 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
varImp2 <- varImp(model2, scale = FALSE)
varImp2
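The question also asks what happens when yet another predictor highly correlated with V1 is added. A sketch of that step (the names sim2, duplicate2, and model2b are introduced here for illustration; a copy of the data is used so the later models are unaffected):

sim2 <- simulated
sim2$duplicate2 <- sim2$V1 + rnorm(200) * .1
model2b <- randomForest(y ~ ., data = sim2, importance = TRUE, ntree = 1000)
# The importance of V1 is expected to be diluted further, now shared among
# V1, duplicate1, and duplicate2
varImp(model2b, scale = FALSE)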
Question

\((c)\) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

Response

The conditional-inference importances show a somewhat different pattern from the traditional random forest model. Even so, V2 and V4 are among the most important variables in all three cases, and V6-V10 remain low in all three cases.

model3 <- cforest(y ~ ., data = simulated)
# conditional = TRUE
ctrue <- varimp(model3, conditional = TRUE)
# conditional = FALSE
cfalse <- varimp(model3, conditional = FALSE)

cbind(model2=varImp2, cforest_con=ctrue,cforest_uncon=cfalse )
Question

(d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

Response

Below we fit two more tree-based models: boosted trees and Cubist.

Gradient boosting model

We use the gbm library for the gradient boosting model.

set.seed(123)

# boosting regression trees via stochastic gradient boosting machines
gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
                       n.trees = seq(100, 1000, by = 50),
                       shrinkage = 0.1,
                       n.minobsinnode = 5)

gbmModel <- train(y ~ ., data = simulated,
                  method = "gbm",
                  tuneGrid = gbmGrid,
                  verbose = FALSE)

gbm_imp <- varImp(gbmModel)
gbm_imp
## gbm variable importance
## 
##             Overall
## V4         100.0000
## V2          80.2407
## V1          54.7824
## V5          40.4073
## V3          35.7511
## duplicate1  30.8668
## V6           2.0953
## V7           1.7374
## V8           1.4529
## V10          0.3425
## V9           0.0000
Cubist Model

We use the Cubist library for the Cubist model.

set.seed(123)
# cubist
cubistModel <- train(y ~ ., data = simulated, method = 'cubist')
cubist_imp <- varImp(cubistModel)
cubist_imp
## cubist variable importance
## 
##            Overall
## V1         100.000
## V2          75.694
## V4          68.750
## V3          58.333
## V5          56.944
## V6          13.889
## duplicate1   2.083
## V8           0.000
## V10          0.000
## V7           0.000
## V9           0.000

In the cforest model, V1, V2, and V4 rank highest. Comparing these results with gradient boosting and Cubist, the uninformative predictors V6-V10 still rank lowest. V4 ranks highest for the boosted trees, while V1 ranks highest for Cubist.

Question

8.2

Use a simulation to show tree bias with different granularities.

Response

We run a simulation with three predictors of different granularities (different standard deviations). The response variable is a function of two of these predictors plus some noise.

Below, a regression tree is fitted using rpart, and the variable importance is calculated using varImp.

set.seed(123)

x1 <- rnorm(300, 30, 1)
x2 <- rnorm(300, 30, 2)
x3 <- rnorm(300, 30, 3)

z <- (.4 * x1) + (.1 * x3) + rnorm(300, 0, sqrt(1- (.16 + .04 + .01)))
y <- (1.5 * z) + 10

granular_simulated <- data.frame(x1 = x1, x2 = x2, x3 = x3, y=y)

rpartfit <- rpart(y ~., data = granular_simulated)

varimp_sim <- varImp(rpartfit)

varimp_sim

As we can see, the tree splits mostly on x1 and x3; x2 is unrelated to y, so its importance score is low. In this simulation, x1 has the smallest standard deviation but the largest coefficient, and it comes out as the strongest predictor.
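A quick check (a sketch, assuming the rpartfit and granular_simulated objects from above) shows which variables the fitted tree actually splits on and the simulated granularity (standard deviation) of each predictor:

# Split variables used in the tree (leaves are labelled "<leaf>")
table(rpartfit$frame$var[rpartfit$frame$var != "<leaf>"])
# Standard deviation of each predictor, i.e. the simulated granularity
sapply(granular_simulated[, c("x1", "x2", "x3")], sd)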

8.3

Question

In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

(a) Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?

Response

Two factors are at work here: the bagging fraction and the learning rate. In the gradient boosting algorithm, the learner at each stage is designed to fit the gradient as closely as possible. This greedy strategy can cause over-fitting and reduce generalization power. The learning rate counters over-fitting by adding only a fraction of each stage's predicted value to the accumulated prediction; a higher learning rate increases the greediness and focuses importance on fewer variables. Likewise, a higher bagging fraction uses most of the training data at every iteration, so the trees are more similar to one another and importance concentrates on the same few predictors.
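To see this effect concretely, a sketch (assuming the solubility data that ships with the AppliedPredictiveModeling package, i.e. the solTrainXtrans and solTrainY objects) fits gbm with settings loosely following the two extremes in Fig. 8.24 and compares the relative influence tables:

data(solubility)  # provides solTrainXtrans and solTrainY
solTrain <- cbind(solTrainXtrans, Solubility = solTrainY)
gbm_low  <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1)
gbm_high <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9)
# Importance is expected to be spread over more predictors for gbm_low and
# concentrated on a handful of predictors for gbm_high
head(summary(gbm_low,  plotit = FALSE), 10)
head(summary(gbm_high, plotit = FALSE), 10)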

Question

(b) Which model do you think would be more predictive of other samples?

Response

Greedy models are less likely to find the globally optimal model and are prone to over-fitting, while stochastic models reduce prediction variance. Therefore, the less greedy model on the left (with a 0.1 learning rate) that is also more random (each tree is proposed using only 10% of the training observations, i.e., a 0.1 bagging fraction) would be more predictive of other samples.

Question

(c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

Response

Interaction depth specifies the tree depth and hence the number of node splits. Deeper trees can involve more predictors in each tree, so increasing the interaction depth would spread variable importance across more predictors for either model, flattening the slope of the predictor importance plot.

8.7

Question

Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

Response

We work with several tree-based models: Random Forest (RF), Stochastic Gradient Boosting (SGB), Rule-Based Cubist (CUBE), Classification and Regression Tree (CART), and Conditional Inference Tree (CIT).

We follow the same steps as in Problems 6.3 and 7.5: data imputation, removal of near-zero-variance and highly correlated predictors, pre-processing, and data splitting (80% training, 20% test).

set.seed(123)
CMP_data <- ChemicalManufacturingProcess
rows_train <- createDataPartition(CMP_data$Yield, p=0.80, list=FALSE)
CMP_train <- CMP_data[rows_train, ]
CMP_test <- CMP_data[-rows_train, ]
prepro <- preProcess(subset(CMP_train, select = - Yield),
  method=c("BoxCox", "center", "scale", "knnImpute"))
CMP_train_X <- predict(prepro, CMP_train[ , -1])
nzv <- nearZeroVar(CMP_train_X)
mcl <- findCorrelation(cor(CMP_train_X))
CMP_train_X <- CMP_train_X[ ,-c(nzv, mcl)]
Random Forest

RF is implemented in the train() function using the rf method, which performs random forest regression via the randomForest package.

set.seed(123)
ctrl1 <- trainControl(method = "boot", number = 25)
tg1 <- expand.grid(mtry=seq(2,38,by=3))
rfModel <- train(x = CMP_train_X, y = CMP_train$Yield, method = "rf",
  metric = "Rsquared", tuneGrid = tg1, trControl = ctrl1)
Stochastic Gradient Boosting (SGB)

SGB is implemented in the train() function using the gbm method, which fits generalized boosted regression models. The tuning grid evaluates several variable interaction depths (interaction.depth), numbers of fitted trees (n.trees), and learning rates (shrinkage), with a minimum of 10 observations in the trees' terminal nodes (n.minobsinnode).

set.seed(123)
ctrl2 <- trainControl(method = "boot", number = 25)
tg2 <- expand.grid(interaction.depth=seq(1,6,by=1), n.trees=c(25,50,100,200),
  shrinkage=c(0.01,0.05,0.1,0.2), n.minobsinnode=10)
sgbModel <- train(x = CMP_train_X, y = CMP_train$Yield, method = "gbm",
  metric = "Rsquared", tuneGrid = tg2, trControl = ctrl2, verbose=F)
Rule-Based Cubist (CUBE)

CUBE is implemented in the train() function using the cubist method, which fits a rule-based M5-style model with additional corrections based on nearest neighbors in the training set. The committees parameter controls the number of boosting-like committee models, and neighbors controls the number of training instances used to adjust the rule-based predictions.

set.seed(123)
ctrl3 <- trainControl(method = "boot", number = 25)
tg3 <- expand.grid(committees = c(1,5,10,20,50,100), neighbors = c(0,1,3,5,7))
cubeModel <- train(x = CMP_train_X, y = CMP_train$Yield, method = "cubist",
  metric = "Rsquared", tuneGrid = tg3, trControl = ctrl3)
Classification and Regression Tree

CART is implemented in the train() function using the rpart2 method, which tunes the model over the maximum tree depth.

set.seed(123)
ctrl4 <- trainControl(method = "boot", number = 25)
tg4 <- expand.grid(maxdepth= seq(1,10,by=1))
cartModel <- train(x = CMP_train_X, y = CMP_train$Yield, method = "rpart2",
  metric = "Rsquared", tuneGrid = tg4, trControl = ctrl4)
Conditional Inference Tree

CIT is implemented in the train() function using the ctree2 method, which tunes the maximum tree depth (maxdepth) and the splitting criterion threshold (mincriterion).

set.seed(123)
ctrl5 <- trainControl(method = "cv")
tg5 <- expand.grid(maxdepth= seq(1,10,by=1), mincriterion=1-c(0.01, 0.05, 0.1))
citModel <- train(x = CMP_train_X, y = CMP_train$Yield, method = "ctree2",
  metric = "Rsquared", tuneGrid = tg5, trControl = ctrl5)
Question

(a) Which tree-based regression model gives the optimal resampling and test set performance?

Response

Training Set Resampling

metrics <- function(models) {
  RMSE = min(models$results$RMSE)
  Rsquared = max(models$results$Rsquared)
  MAE = min(models$results$MAE)
  return(cbind(RMSE, Rsquared, MAE)) }

data.frame(rbind(metrics(rfModel), metrics(sgbModel),
  metrics(cubeModel), metrics(cartModel), metrics(citModel)),
  row.names = c("RF","SGB","CUBE","CART","CIT"))

Predict Test Set

CMP_test_X <- predict(prepro, CMP_test[ , -1])
CMP_test_X <- CMP_test_X[ ,-c(nzv, mcl)]
# Column 46 contains -Inf values after the transformations; replace them with the
# smallest finite value observed in that column
val <- min(CMP_test_X[is.finite(CMP_test_X[,46]), 46])
CMP_test_X[CMP_test_X[,46] == -Inf, 46] <- val
fcastRf <- predict(rfModel, newdata = CMP_test_X)
fcastSgb <- predict(sgbModel, newdata = CMP_test_X)
fcastCube <- predict(cubeModel, newdata = CMP_test_X)
fcastCart <- predict(cartModel, newdata = CMP_test_X)
fcastCit <- predict(citModel, newdata = CMP_test_X)

data.frame(rbind(
  postResample(pred = fcastRf, obs = CMP_test$Yield),
  postResample(pred = fcastSgb, obs = CMP_test$Yield),
  postResample(pred = fcastCube, obs = CMP_test$Yield),
  postResample(pred = fcastCart, obs = CMP_test$Yield),
  postResample(pred = fcastCit, obs = CMP_test$Yield)
),
  row.names = c("RF","SGB","CUBE","CART","CIT")
)

Among all these tree-based models, the CUBE model gives the best resampling and test-set performance.

Question

(b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

Response

The optimal tree-based model is the rule-based Cubist model; we can see its most important predictors using the varImp() function.

plot(varImp(cubeModel), top=20)

Manufacturing process variables dominate the list: of the 10 most important predictors, 2 are biological and the rest are manufacturing process variables.

PLS is the optimal linear model, SVM is the optimal nonlinear model, and Cubist is the optimal tree-based model.

All models also agree that ManufacturingProcess32 is a very important predictor. ManufacturingProcess36 was in the top 10 for PLS, but it is not in the top 10 of the SVM or Cubist models.
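A sketch (assuming the pslfit, svmModel, and cubeModel objects fitted in the earlier sections) puts the three top-10 lists side by side for easier comparison:

# Top 10 predictors of the optimal linear, nonlinear, and tree-based models
top10 <- function(fit) {
  imp <- varImp(fit)$importance
  head(rownames(imp)[order(-imp$Overall)], 10)
}
data.frame(PLS = top10(pslfit), SVM = top10(svmModel), Cubist = top10(cubeModel))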

Question

(c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

Response

The optimal single-tree model based on the RMSE, R2, and MAE resampling metrics is the CIT model. Its decision tree shows two splits: one on ManufacturingProcess32 at 0.162 and one on ManufacturingProcess09 at 0.143. Observations with ManufacturingProcess32 below 0.162 are routed to the ManufacturingProcess09 split at 0.143; values below 0.143 are associated with lower yields, while values above 0.143 are associated with higher yields, as represented by the boxplot distributions in the terminal nodes. Similarly, ManufacturingProcess32 values above 0.162 are associated with higher yields. The visualization of the tree does help in understanding the relationship between ManufacturingProcess32, ManufacturingProcess09, and yield.

plot(citModel$finalModel)

Market Basket Analysis

Question

The Groceries Data Set contains a collection of receipts, with each line representing one receipt and the items purchased. Each line is called a transaction, and each column in a row represents an item. The dataset is in GroceryDataSet.csv (a comma-separated file). Your assignment is to use R to mine the data for association rules. You should report support, confidence, and lift, and give your top 10 rules by lift.

Response

Import all necessary libraries

library(arules)
library(pander)
library(arulesViz)
library(fpp2)
library(RColorBrewer)

As described above, each line of the Groceries data set is a transaction and each column in a row represents an item. We load the CSV file into R and begin the exploratory data analysis (EDA).

The purpose of market basket analysis is to let retailers and businesses analyze which items customers buy together and use that information to make profitable decisions.

Load the data from the CSV file using read.transactions() from the arules package.

grocery_df <- read.transactions('https://raw.githubusercontent.com/SubhalaxmiRout002/DATA624/main/Week4/GroceryDataSet.csv', sep = ",", format = "basket")

grocery_df
## transactions in sparse format with
##  9835 transactions (rows) and
##  169 items (columns)
Exploratory Data Analysis

Use summary() to get an overview of the data.

summary(grocery_df)
## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46 
##   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics

The summary gives the number of rows and columns in the data. It shows the most frequently purchased items: whole milk is at the top, other vegetables is second, rolls/buns is third, and so on. Below this is the basket-size (transaction length) distribution: 2159 transactions contain a single item, 1643 contain two items, and so on, up to a maximum basket size of 32 items. Adding up these counts gives the total number of transactions, 9835. The mean basket size is about 4.4, which means that on average there are roughly 4 items per basket.
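As a quick check of this reading, the basket sizes can be tabulated directly with size() from arules (a small sketch):

basket_sizes <- size(grocery_df)   # number of items in each transaction
head(table(basket_sizes))          # 2159 baskets of size 1, 1643 of size 2, ...
sum(table(basket_sizes))           # adds up to the 9835 transactions
mean(basket_sizes)                 # roughly 4.4 items per basket on average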

In itemFrequencyPlot(grocery_df, topN=10, type="absolute"), the first argument is the transaction object to be plotted, grocery_df. topN plots the topN highest-frequency items. type can be "absolute" or "relative": "absolute" plots the raw count of transactions containing each item, while "relative" plots each item's frequency as a proportion of all transactions.

itemFrequencyPlot(grocery_df, topN = 10, type="absolute", col=brewer.pal(8,'Pastel2'), main = 'Top 10 items purchased')

The first five bars of the plot above match the top items reported by summary().
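For comparison, the relative version mentioned above plots each item's share of all transactions instead of raw counts; a minimal sketch:

itemFrequencyPlot(grocery_df, topN = 10, type = "relative",
                  col = brewer.pal(8, 'Pastel2'),
                  main = 'Top 10 items purchased (relative frequency)')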

bottom_10 <- head(sort(itemFrequency(grocery_df, type="absolute"), decreasing=FALSE), n=10)
par(mar=c(10.5,3,2, 0.3))
barplot(bottom_10, ylab = "Frequency", main = "Bottom 10 items purchased", col=brewer.pal(8,'Pastel2'), las = 2)

The above plot shows the Bottom 10 items purchased.

Distribution of items per basket

hist(size(grocery_df), breaks = 0:35, xaxt="n", ylim=c(0,2200), 
     main = "Number of items in particular baskets", xlab = "Items", col = brewer.pal(8,'Pastel2'))
axis(1, at=seq(0,33,by=1), cex.axis=0.8)

We can see that the number of baskets decreases as the number of items per basket increases.

Model Building

In this section we generate rules with the Apriori algorithm and interpret how it works.

Generating Rules

The next step is to mine the rules using the Apriori algorithm. The apriori() function comes from the arules package.

# Min Support as 0.001, confidence as 0.8.
association_rules <- apriori(grocery_df, parameter = list(supp=0.001, conf=0.8,maxlen=10), control=list(verbose=F))

apriori() takes the transaction object on which the mining is to be applied (here grocery_df) as its first argument. The parameter list allows us to set the minimum support and minimum confidence. The default values are a minimum support of 0.1, a minimum confidence of 0.8, and a maximum of 10 items per rule (maxlen).
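These thresholds directly control how many rules are returned. As a rough illustration (the exact count would need to be run), raising the minimum support shrinks the rule set considerably:

# Sketch: a stricter support threshold should return far fewer rules than the 410 found below
stricter_rules <- apriori(grocery_df,
                          parameter = list(supp = 0.005, conf = 0.8, maxlen = 10),
                          control = list(verbose = FALSE))
length(stricter_rules)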

summary(association_rules)
## set of 410 rules
## 
## rule length distribution (lhs + rhs):sizes
##   3   4   5   6 
##  29 229 140  12 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   4.000   4.000   4.329   5.000   6.000 
## 
## summary of quality measures:
##     support           confidence        coverage             lift       
##  Min.   :0.001017   Min.   :0.8000   Min.   :0.001017   Min.   : 3.131  
##  1st Qu.:0.001017   1st Qu.:0.8333   1st Qu.:0.001220   1st Qu.: 3.312  
##  Median :0.001220   Median :0.8462   Median :0.001322   Median : 3.588  
##  Mean   :0.001247   Mean   :0.8663   Mean   :0.001449   Mean   : 3.951  
##  3rd Qu.:0.001322   3rd Qu.:0.9091   3rd Qu.:0.001627   3rd Qu.: 4.341  
##  Max.   :0.003152   Max.   :1.0000   Max.   :0.003559   Max.   :11.235  
##      count      
##  Min.   :10.00  
##  1st Qu.:10.00  
##  Median :12.00  
##  Mean   :12.27  
##  3rd Qu.:13.00  
##  Max.   :31.00  
## 
## mining info:
##        data ntransactions support confidence
##  grocery_df          9835   0.001        0.8

The summary() output above shows the following:

Parameter specification: min_sup = 0.001 and min_confidence = 0.8, with a maximum of 10 items per rule.

Total number of rules: a set of 410 rules.

Distribution of rule lengths: rules of length 4 are the most common (229), and rules of length 6 are the least common (12).

Summary of quality measures: the minimum and maximum values for support, confidence, coverage, and lift.

Information used for creating rules: The data, support, and confidence we provided to the algorithm.
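The same quantities can also be pulled out of the rule set programmatically; a small sketch:

length(association_rules)             # total number of rules (410)
table(size(association_rules))        # rule length distribution (lhs + rhs)
summary(quality(association_rules))   # support, confidence, coverage, lift, count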

Since there are 410 rules, let's print only the first 10:

inspect(association_rules[1:10])
##      lhs                         rhs                    support confidence    coverage      lift count
## [1]  {liquor,                                                                                         
##       red/blush wine}         => {bottled beer}     0.001931876  0.9047619 0.002135231 11.235269    19
## [2]  {cereals,                                                                                        
##       curd}                   => {whole milk}       0.001016777  0.9090909 0.001118454  3.557863    10
## [3]  {cereals,                                                                                        
##       yogurt}                 => {whole milk}       0.001728521  0.8095238 0.002135231  3.168192    17
## [4]  {butter,                                                                                         
##       jam}                    => {whole milk}       0.001016777  0.8333333 0.001220132  3.261374    10
## [5]  {bottled beer,                                                                                   
##       soups}                  => {whole milk}       0.001118454  0.9166667 0.001220132  3.587512    11
## [6]  {house keeping products,                                                                         
##       napkins}                => {whole milk}       0.001321810  0.8125000 0.001626843  3.179840    13
## [7]  {house keeping products,                                                                         
##       whipped/sour cream}     => {whole milk}       0.001220132  0.9230769 0.001321810  3.612599    12
## [8]  {pastry,                                                                                         
##       sweet spreads}          => {whole milk}       0.001016777  0.9090909 0.001118454  3.557863    10
## [9]  {curd,                                                                                           
##       turkey}                 => {other vegetables} 0.001220132  0.8000000 0.001525165  4.134524    12
## [10] {rice,                                                                                           
##       sugar}                  => {whole milk}       0.001220132  1.0000000 0.001220132  3.913649    12

The rule table above shows lhs, rhs, support, confidence, coverage, lift, and count. Let's review what these terms mean.

  • lhs: items present in the basket
  • rhs: item more likely bought with lhs
  • support: Fraction of transactions that contain the item-set
  • confidence: for a rule A=>B, confidence is the proportion of transactions containing A that also contain B.
  • lift: for a rule A=>B, lift measures how much more often A and B occur together than would be expected if they were independent; a lift greater than 1 indicates a positive association.
  • count: Frequency of occurrence of an item-set

Using the above output, we can make observations such as:

  • 100% of the customers who bought {rice, sugar} also bought {whole milk} (verified by hand in the sketch after this list).
  • 92% of the customers who bought {house keeping products,whipped/sour cream} also bought {whole milk}.
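As a sanity check on these definitions, the measures for the {rice, sugar} => {whole milk} rule can be recomputed by hand from the raw item matrix; a minimal sketch (the data set is small enough to densify here):

# Recompute support, confidence, and lift for {rice, sugar} => {whole milk}
m <- as(grocery_df, "matrix")                       # transactions x items indicator matrix
n_total <- nrow(m)                                  # 9835 transactions
n_lhs   <- sum(m[, "rice"] & m[, "sugar"])          # baskets containing rice AND sugar
n_both  <- sum(m[, "rice"] & m[, "sugar"] & m[, "whole milk"])

n_both / n_total                                    # support of the rule
n_both / n_lhs                                      # confidence: 1, i.e. 100%
(n_both / n_lhs) / mean(m[, "whole milk"])          # lift vs. the baseline rate of whole milk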
Removing redundant rules

We can remove rules that are subsets of larger rules. Use the code below to remove such rules:

# get subset rules in vector
subset_rules <- which(colSums(is.subset(association_rules, association_rules)) > 1) 
length(subset_rules)
## [1] 91