6.3. A chemical manufacturing process for a pharmaceutical product was discussed in Sect.1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
data <- ChemicalManufacturingProcess
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
There are 24 rows in total that contain NAs.
sum(!complete.cases(data))
## [1] 24
Missing Yield & Biological Predictors
Let's first review the yield and biological predictors. The function vis_dat (from the visdat package) is an easy way to visualize missing data. The y-axis represents the observations, with row 1 at the top and the row number increasing as you move down. Each column of data is represented on the x-axis. If a value is missing, it is highlighted in grey; otherwise it is pink.
Here there are no values missing from any of the biological or yield columns.
library(visdat)
yield_and_bio <- data[,c(1:13)]
vis_dat(yield_and_bio)
Now let's review whether any values are missing from the manufacturing process predictors. We see more grey rectangles, which indicates missing (NA) values.
process <- data[,c(14:58)]
vis_dat(process)
The first observation is missing values for:
* ManufacturingProcess01 to ManufacturingProcess08
* ManufacturingProcess10 to ManufacturingProcess12
* ManufacturingProcess22 to ManufacturingProcess24
* ManufacturingProcess40 to ManufacturingProcess41
Many of the observations in rows 1-25 are missing values for:
* ManufacturingProcess03, ManufacturingProcess10, & ManufacturingProcess11
Observations 170-176 are missing the most values:
* ManufacturingProcess25 to ManufacturingProcess36 with the exception of ManufacturingProcess32
In order to complete our analysis, we are going to impute these missing values using the preProcess() function with method = "knnImpute" from the caret package.
If we view the first row, which we previously saw was missing many values, we can see the missing values have now been filled in.
library(caret)
imp <- preProcess(data, method = "knnImpute")
imputed <- predict(imp, data)
imputed[c(1),]
Here we are going to start by using 80% of the data set to train our model. The training set contains 140 observations and the test set contains the remaining 36; after removing the target there are 57 predictor columns. The target, or dependent variable, is Yield.
We will use the dataframe with imputed values, imputed. To do this we randomly sample 80% of the observations based on their row index.
#for reproducibility
set.seed(123)
train_index <- sample(nrow(imputed),nrow(imputed)*0.8)
#get subset of ids
train_pred <- imputed[train_index,-1]
train_output <- imputed[train_index,1]
test_pred <- imputed[-train_index,-1]
test_output <- imputed[-train_index,1]
For this problem, we will use a ridge regression model. Ridge regression is useful when the predictors in the data set are highly correlated: it adds a penalty on the size of the coefficients to the sum of squared errors. The tuning parameter, lambda, controls the strength of that penalty.
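For reference, the ridge estimate minimizes the usual sum of squared errors plus an L2 penalty on the coefficients (intercept omitted for simplicity); the lambda tuned below is the \(\lambda\) in this objective:
\(\hat{\beta}^{ridge} = \underset{\beta}{\arg\min} \; \sum_{i=1}^{n}\Big(y_{i} - \sum_{j=1}^{P} x_{ij}\beta_{j}\Big)^{2} + \lambda \sum_{j=1}^{P}\beta_{j}^{2}\)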
Next we are going to determine which value of lambda returns the best model. We know which lambda is best because it will produce the model with the lowest value of RMSE.
ctrl <- trainControl(method = "cv", number = 10)
#creating grid of possible values for lambda
ridgeGrid <- data.frame(.lambda = seq(0, .1, length = 15))
#for reproducibility
set.seed(123)
ridgeRegFit <- train(train_pred,train_output,
method='ridge',
#Our different lambda penalty values
tuneGrid=ridgeGrid,
trControl = ctrl,
#center & get pred variables on same scale
preProc=c("center","scale"))
ridgeRegFit
## Ridge Regression
##
## 140 samples
## 57 predictor
##
## Pre-processing: centered (57), scaled (57)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 126, 126, 125, 125, 125, 127, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.000000000 1.0221525 0.3674877 0.7492669
## 0.007142857 0.7394789 0.5192863 0.6083392
## 0.014285714 0.7067733 0.5427204 0.5836053
## 0.021428571 0.7004064 0.5455998 0.5818266
## 0.028571429 0.6990414 0.5459039 0.5802543
## 0.035714286 0.6989843 0.5459852 0.5790184
## 0.042857143 0.6993876 0.5461214 0.5776078
## 0.050000000 0.7000090 0.5463097 0.5763720
## 0.057142857 0.7007663 0.5465197 0.5761176
## 0.064285714 0.7016265 0.5467293 0.5763379
## 0.071428571 0.7025735 0.5469253 0.5767611
## 0.078571429 0.7035971 0.5471011 0.5771880
## 0.085714286 0.7046902 0.5472540 0.5775532
## 0.092857143 0.7058465 0.5473833 0.5778765
## 0.100000000 0.7070609 0.5474898 0.5784133
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.03571429.
This plot shows the change in RMSE as lambda increases.
plot(ridgeRegFit)
Our best model uses lambda equal to 0.0357. This model has an R^2 of about 0.55 and an RMSE of about 0.7.
best_ridge_model = ridgeRegFit$results[ridgeRegFit$results$RMSE == min(ridgeRegFit$results$RMSE),]
best_ridge_model
Here we use our best ridge regression model to see how well it predicts on the test data. The final model is an enet object, so we call predict() on ridgeRegFit$finalModel with s = 1 and mode = "fraction" in order to get the pure ridge regression fit.
ridgePred_Test <- predict(ridgeRegFit$finalModel, newx = as.matrix(test_pred), s = 1, mode = "fraction", type = "fit")
ridgeCoeff_Test <- predict(ridgeRegFit$finalModel, newx = as.matrix(test_pred), s = 1, mode = "fraction", type = "coefficients")
Now we can compare how our model performed on our train data versus our test data.
# test performance
model_values <- data.frame(obs = test_output, pred = ridgePred_Test$fit)
test_performance <- defaultSummary(model_values)
# train matrices
train_performance <- best_ridge_model[,c('RMSE','Rsquared','MAE')]
# create data frame of both train + test performances
perf_df <- rbind(t(data.frame(test_performance)), data.frame(train_performance))
# Rename column
names(perf_df) <- c("RMSE", "Rsquared", "MAE")
rownames(perf_df) <- c("Test","Train")
# Train and Test matrices
perf_df
The RMSE value for the test data is 3.3 and the RMSE value for the training set is 0.7. The lower the RMSE value, the better the model. The R^2 value for the test set is lower than for the training set, which indicates the model accounts for the variability in the training data better than in the test data. These metrics tell us that our model did not perform as well on the test data as it did on the training data, which isn't too surprising.
variable_importance <- varImp(ridgeRegFit,scale=F)
plot(variable_importance, top = 20)
ManufacturingProcess32 is the most important variable, followed by BiologicalMaterial06. The manufacturing predictors appear more frequently in our list and appear to be more important in our model.
vi_df <- variable_importance$importance %>%
data.frame() %>%
top_n(9)
vi_df$var <- rownames(vi_df)
df_top_x <- data[,names(imputed) %in% vi_df$var] %>%
bind_cols(Yield = data$Yield)
df_top_x %>%
gather(variable, value, -Yield) %>%
ggplot(aes(x = value, y = Yield,color=variable)) +
geom_point() +
facet_wrap(~variable, scales = 'free_x') +
labs(x = element_blank())
The sub-plots above illustrate positive and negative correlations with the response variable. To improve the yield in future runs, the manufacturing processes with positive coefficients should be increased, while the processes with negative coefficients should be reduced. The majority of the important variables have a positive relationship, except ManufacturingProcess17 and ManufacturingProcess13. By knowing which variables affect yield and the relationship of each variable with yield, we can determine how to adjust each material or process to maximize results.
7.2. Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data, where the x values are uniformly distributed on [0, 1]:
\(y = 10\sin(\pi x_{1}x_{2}) + 20(x_{3} - 0.5)^{2} + 10x_{4} + 5x_{5} + N(0, \sigma^{2})\)
library(mlbench)
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
## We convert the 'x' data from a matrix to a data frame. One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)
## Look at the data using
obj <- featurePlot(trainingData$x, trainingData$y)
print(obj)
## or other methods.
## This creates a list with a vector 'y' and a matrix of predictors 'x'. Also simulate a large test set to
## estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
Tune several models on these data. For example:
library(caret)
set.seed(123)
knnModel <- train(x = trainingData$x,
y = trainingData$y,
## note: no 'method' is supplied here, so caret fits its default
## model (random forest); the output below comes from that fit, not kNN
preProc = c("center", "scale"),
tuneLength = 10)
## note: only 9 unique complexity parameters in default grid. Truncating the grid to 9 .
knnModel
## Random Forest
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 2.942963 0.7786600 2.409166
## 3 2.757348 0.7780577 2.257721
## 4 2.668106 0.7709207 2.184489
## 5 2.638548 0.7580862 2.165022
## 6 2.623661 0.7511115 2.152904
## 7 2.626736 0.7401869 2.155715
## 8 2.640345 0.7321122 2.162547
## 9 2.657499 0.7241281 2.176910
## 10 2.674946 0.7177772 2.190956
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 6.
knnPred <- predict(knnModel, newdata = testData$x)
## The function 'postResample' can be used to get the test set
## performance values
kNN_Results <- postResample(pred = knnPred, obs = testData$y)
kNN_Results
## RMSE Rsquared MAE
## 2.4381466 0.7992296 1.9308194
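As noted above, the train() call did not specify a method, so caret fit its default random forest, which is what the printed output shows. A minimal sketch of the kNN fit the text intends (the object name knnOnly is ours) would be:
set.seed(123)
knnOnly <- train(x = trainingData$x,
y = trainingData$y,
method = "knn", #explicitly request k-nearest neighbors
preProc = c("center", "scale"),
tuneLength = 10)
knnOnly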
Now we are going to fit a MARS model using the earth package. MARS stands for Multivariate Adaptive Regression Splines. Like neural networks and partial least squares, MARS (Friedman 1991) uses surrogate features instead of the original predictors.
When running a MARS model, you can set nprune, the maximum number of terms allowed in the final model. Here we've chosen 20 since that is more than the number of X columns, which allows all of the terms to be included and leaves room for interactions. You can also set degree, the maximum number of hinge functions that are allowed to interact in a single term. We'll allow degrees up to 3.
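To make the hinge functions concrete, each MARS term is a piecewise-linear hinge of a single predictor, and a higher degree allows products of hinges:
\(h(x - c) = \max(0, x - c), \qquad h(c - x) = \max(0, c - x)\)
so a degree-2 term has the form \(h(x_{1} - c_{1}) \cdot h(c_{2} - x_{2})\).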
library(earth)
set.seed(123)
ctrl <- trainControl(method = "cv", number = 10)
#set values for nprune and degree
parameter_Grid <- expand.grid(nprune = 1:20, degree = 1:3)
#tune model
MARSModel <- train(trainingData$x,
trainingData$y,
method='earth',
metric='RMSE',
tuneGrid=parameter_Grid,
trControl = ctrl)
summary(MARSModel)
## Call: earth(x=data.frame[200,10], y=c(18.46,16.1,17...), keepxy=TRUE, degree=2,
## nprune=14)
##
## coefficients
## (Intercept) 21.905319
## h(0.621722-X1) -15.726181
## h(X1-0.621722) 9.234027
## h(0.601063-X2) -18.253527
## h(X2-0.601063) 10.448545
## h(0.447442-X3) 9.700589
## h(X3-0.606015) 12.674694
## h(0.734892-X4) -9.863526
## h(X4-0.734892) 10.297964
## h(0.850094-X5) -5.324175
## h(0.218266-X1) * h(X2-0.601063) -55.358726
## h(X1-0.218266) * h(X2-0.601063) -29.100250
## h(X1-0.621722) * h(X2-0.295997) -26.833129
## h(0.649253-X1) * h(0.601063-X2) 27.120721
##
## Selected 14 of 18 terms, and 5 of 10 predictors (nprune=14)
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6-unused, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 9 4
## GCV 1.62945 RSS 225.8601 GRSq 0.9338437 RSq 0.953688
Here we visualize the cross-validated RMSE for each combination of degree and number of terms. Consistent with the model summary above, the best model uses degree = 2 with nprune = 14 terms.
ggplot(MARSModel)
Now we will predict using the test data using our best MARS model and review the model results.
mars_predictions <- predict(MARSModel$finalModel,testData$x)
MARS_Results <- postResample(pred = mars_predictions, obs = testData$y)
MARS_Results
## RMSE Rsquared MAE
## 1.1722635 0.9448890 0.9324923
Neural networks are powerful nonlinear regression techniques inspired by theories about how the brain works. Like partial least squares, the outcome is modeled by an intermediary set of unobserved variables (called hidden variables or hidden units here). The model has an input layer, a hidden layer, and an output layer.
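As a reminder of the structure being fit, each hidden unit is a sigmoidal transformation of a linear combination of the predictors, and the prediction is a linear combination of the hidden units:
\(h_{k}(x) = g\Big(\beta_{0k} + \sum_{j=1}^{P} x_{j}\beta_{jk}\Big), \quad g(u) = \frac{1}{1 + e^{-u}}, \quad \hat{y} = \gamma_{0} + \sum_{k=1}^{H} \gamma_{k}\, h_{k}(x)\)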
Now we will tune the model using the train function as we have before. The tuning parameters that take on multiple values are the number of hidden units and the amount of weight decay.
neural_Grid <- expand.grid(.decay = c(0,0.01,0.1),
.size = c(1:10),
.bag = FALSE)
set.seed(123)
nnetModel <- train(trainingData$x,trainingData$y,
method = "avNNet",
tuneGrid = neural_Grid,
trControl = ctrl,
preProc = c("center","scale"), #center & scale the data
linout = TRUE,
trace = FALSE,
#MaxNWts = 10 * (ncol(trainingData$x) + 1) + 10 + 1,
maxit = 500)
plot(nnetModel)
The best model used a weight decay of 0.1 and 10 hidden units. Now we will predict on the test data using our neural network model and review the results.
nn_predictions <- predict(nnetModel$finalModel,testData$x)
nn_Results <- postResample(pred = nn_predictions, obs = testData$y)
nn_Results
## RMSE Rsquared MAE
## 5.5708330 0.6800083 4.5994262
It looks like the MARS model outperforms the kNN and neural network models since its RMSE and MAE are lower and its R^2 is higher.
results <- data.frame(rbind(nn_Results,kNN_Results,MARS_Results))
results
The informative predictors are X1-X5; the uninformative predictors X6-X10 are unused by the MARS model.
varImp(MARSModel$finalModel)
This code is from the week 6 homework problems (Exercise 6.3); it imputes missing values and splits the data into training and testing sets.
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
data <- ChemicalManufacturingProcess
imp <- preProcess(data, method = "knnImpute")
imputed <- predict(imp,data)
set.seed(123)
ids <- sample(nrow(imputed),nrow(imputed)*0.8)
train_pred <- imputed[ids,-1]
train_output <- imputed[ids,1]
test_pred <- imputed[-ids,-1]
test_output <- imputed[-ids,1]
Now we will use several nonlinear regression models to see which performs best on our data.
As stated previously, neural networks are powerful nonlinear regression techniques inspired by theories about how the brain works. Like partial least squares, the outcome is modeled by an intermediary set of unobserved variables (hidden units). The model has an input layer, a hidden layer, and an output layer.
neural_Grid <- expand.grid(.decay = c(0,0.01,0.1),
.size = c(1:10),
.bag = FALSE)
set.seed(123)
nnetTune <- train(train_pred,train_output,
method = "avNNet",
tuneGrid = neural_Grid,
trControl = ctrl,
#center & scale the data
preProc = c("center","scale"),
linout = TRUE,
trace = FALSE,
maxit = 500)
nn_predictions <- predict(nnetTune$finalModel,test_pred)
nn_results <- postResample(pred = nn_predictions, obs = test_output)
nn_results
## RMSE Rsquared MAE
## 0.6350121 0.6818911 0.4776292
SVMs are a class of powerful, highly flexible modeling techniques. The theory behind SVMs was originally developed in the context of classification models.
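The svmRadial method used below fits a support vector regression with a radial basis (Gaussian) kernel; caret's grid varies the cost parameter C while a single value of the kernel parameter \(\sigma\) is estimated from the data:
\(K(x, x') = \exp\big(-\sigma \lVert x - x' \rVert^{2}\big)\)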
library(kernlab)
svmTune <- train(train_pred, train_output,
method = "svmRadial",
preProc = c("center", "scale"),
tuneLength = 14,
trControl = trainControl(method = "cv"))
svm_predictions <- predict(svmTune$finalModel,test_pred)
SVM_results <- postResample(pred = svm_predictions, obs = test_output)
SVM_results
## RMSE Rsquared MAE
## 0.6522754 0.7663851 0.4791971
Now we will fit another model, centering and scaling the data. Note that, as in 7.2 above, no method is specified in the train() call, so caret fits its default random forest rather than kNN; the output below reflects that.
library(caret)
knnTune <- train(x = train_pred,
y = train_output,
## no 'method' supplied, so caret's default (random forest) is used
preProc = c("center", "scale"),
tuneLength = 10)
knnTune
## Random Forest
##
## 140 samples
## 57 predictor
##
## Pre-processing: centered (57), scaled (57)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 140, 140, 140, 140, 140, 140, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 0.7235649 0.5378986 0.5712462
## 8 0.6891294 0.5566480 0.5384496
## 14 0.6857843 0.5507213 0.5341980
## 20 0.6866935 0.5436037 0.5332588
## 26 0.6881685 0.5367398 0.5337969
## 32 0.6899401 0.5308883 0.5350922
## 38 0.6957190 0.5192489 0.5385073
## 44 0.7004385 0.5107343 0.5428744
## 50 0.7048170 0.5027933 0.5469711
## 57 0.7104203 0.4917456 0.5531747
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 14.
knnPred <- predict(knnTune, newdata = test_pred)
knn_Results <- postResample(pred = knnPred, obs = test_output)
knn_Results
## RMSE Rsquared MAE
## 0.6466251 0.7747683 0.5041527
As previously stated, MARS (Friedman 1991), like neural networks and partial least squares, uses surrogate features instead of the original predictors.
ctrl <- trainControl(method = "cv", number = 10)
parameter_Grid <- expand.grid(nprune = 1:20, degree = 1:3)
MARS_Fit <- train(train_pred,
train_output,
method='earth',
metric='RMSE',
tuneGrid=parameter_Grid,
trControl = ctrl)
mars_predictions <- predict(MARS_Fit$finalModel,test_pred)
MARS_results <- postResample(pred = mars_predictions, obs = test_output)
MARS_results
## RMSE Rsquared MAE
## 0.6441535 0.6703288 0.5029883
Comparing the test-set results, the knnTune model (which, because no method was specified, is actually a random forest) has the highest R^2, while the neural network has slightly lower RMSE and MAE values. Based on R^2, the knnTune model performs best.
results <- data.frame(rbind(knn_Results,MARS_results,SVM_results,nn_results))
results
In the nonlinear model's importance ranking, the manufacturing process predictors appear slightly more often than the biological predictors, but not by much.
nl_variable_importance <- varImp(knnTune,scale=F)
topPred_knn <- as.data.frame(varImp(knnTune$finalModel))
topPred_knn <- data.frame(overall = topPred_knn$Overall,
names = rownames(topPred_knn))
non_linear_top_predictors <- topPred_knn[order(topPred_knn$overall,decreasing = T),]
head(non_linear_top_predictors,10)
In both the linear and nonlinear regression models, 9 of the top 10 variables are the same. BiologicalMaterial04 appears in the nonlinear model's top 10 but not the linear model's, while BiologicalMaterial02 appears in the linear model's top 10 but not the nonlinear model's.
topPred_nonlinear <- head(topPred_knn[order(topPred_knn$overall,decreasing = T),],10)$names
topPred_linear <- c('ManufacturingProcess32','BiologicalMaterial06','ManufacturingProcess13',
'BiologicalMaterial03','ManufacturingProcess17','BiologicalMaterial12',
'BiologicalMaterial02','ManufacturingProcess09','ManufacturingProcess31','ManufacturingProcess06')
topPreds_in_Both <- as.data.frame(cbind(topPred_nonlinear,topPred_linear))
topPreds_in_Both
x <- 9
vi_df <- nl_variable_importance$importance %>%
data.frame() %>%
top_n(x)
vi_df$var <- rownames(vi_df)
df_top_x <- data[,names(imputed) %in% vi_df$var] %>%
bind_cols(Yield = data$Yield)
df_top_x %>%
gather(variable, value, -Yield) %>%
ggplot(aes(x = value, y = Yield,color=variable)) +
geom_point() +
facet_wrap(~variable, scales = 'free_x') +
labs(x = element_blank())
The predictor ManufacturingProcess32 is the top predictor in both the linear and nonlinear regression models. This predictor takes discrete, whole-number values. There is a strong linear relationship with Yield, with a correlation of about 0.61. Measurements of ManufacturingProcess32 would therefore be extremely helpful in trying to maximize the yield.
Looking at the results above, we see few differences from what was reported in 6.3. The features are strongly correlated with yield. For BiologicalMaterial 03, 06, and 12 as well as ManufacturingProcess 06, 09, and 32, increasing these values increases the yield. Conversely, the remaining top variables are negatively correlated, so minimizing them should increase the yield.
8.1. Recreate the simulated data from Exercise 7.2:
library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
library(randomForest)
library(caret)
set.seed(123)
model1 <- randomForest(y ~ ., data = simulated, importance = TRUE,ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
The predictors V6-V10 have variable importance scores near 0, which tells me these predictors were not significant in the model.
rfImp1
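The importance tables that follow also reference duplicate1 and duplicate2, two predictors that are highly correlated with V1; the chunk that created them is not shown above. A sketch of how such duplicates are typically added (the 0.1 noise level is an assumption on our part):
set.seed(200)
#add two copies of V1 plus a small amount of noise (assumed sd of 0.1)
simulated$duplicate1 <- simulated$V1 + rnorm(200) * 0.1
simulated$duplicate2 <- simulated$V1 + rnorm(200) * 0.1
cor(simulated$duplicate1, simulated$V1)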
library(party)
bagCtrl <- cforest_control(mtry = ncol(simulated) - 1)
baggedTree <- cforest(y ~ ., data = simulated, controls = bagCtrl)
Yes, the conditional inference trees show the same pattern. The importance of V1 is now similar to that of duplicate1 and duplicate2, each about one third of V1's original importance score, which was about 8.
Also, variables V6-V10 do not appear to be significant in the model which matches the traditional random forest model.
cfImp <- varImp(baggedTree)
cfImp
Let's first create a boosted tree model using the train function. When running a boosted tree model, we are able to tune the interaction depth, the number of trees, and the shrinkage.
The interaction depth is the maximum number of splits the tree can have or the depth of the trees.
The number of trees is how many trees will be fit or number of iterations.
The shrinkage is the learning rate.
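Putting these together, each boosting iteration adds a shrunken version of the newly fitted tree to the current model, so the shrinkage (learning rate) \(\lambda\) scales every update:
\(f_{m}(x) = f_{m-1}(x) + \lambda \, T_{m}(x)\)
where \(T_{m}\) is the tree fit to the current residuals at iteration m.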
set.seed(123)
gbmTune <- train(simulated[,-11],simulated[,c('y')],
method = "gbm",
## Dont print tons of notes to screen
verbose = FALSE)
Looking at the importance scores of each predictor from the boosted tree model, we see some differences. In general, V1-V5 are still more important than V6-V10, and V4 is still the most important predictor. The importance scores are scaled differently here, with the top predictor close to 100, while in the other models the unimportant predictors' scores were much closer to 0.
#variable importance for the boosted tree; the explicit caret:: prefix avoids masking by other attached packages
gbm_importance <- caret::varImp(gbmTune)
gbm_importance
Now let’s create a cubist model using the function train and method = cubist.
library(Cubist)
cubistMod <- train(simulated[,-11],
simulated[,c('y')],
method = "cubist")
Again, with the cubist model we see variables V1-V5 as the most important predictors. V1 takes the top spot, followed closely by V2, rather than V4 as in some of the other models. Again we see predictors V6-V10 with much lower importance scores that are close to 0.
varImp(cubistMod)
## cubist variable importance
##
## Overall
## V1 100.0000
## V2 90.0000
## V4 69.2857
## V3 58.5714
## V5 45.7143
## duplicate1 5.7143
## V6 3.5714
## duplicate2 0.7143
## V9 0.0000
## V8 0.0000
## V10 0.0000
## V7 0.0000
We will run a simulation with three predictors that have different granularities, generated with different standard deviations. The response variable is a function of two of the predictors plus some noise.
Below, a regression tree is fitted using rpart, and the variable importance is calculated using varImp.
library(rpart)
set.seed(123)
x1 <- rnorm(300, 30, 1)
x2 <- rnorm(300, 30, 2)
x3 <- rnorm(300, 30, 3)
z <- (.4 * x1) + (.1 * x3) + rnorm(300, 0, sqrt(1- (.16 + .04 + .01)))
y <- (1.5 * z) + 10
granular_simulated <- data.frame(x1 = x1, x2 = x2, x3 = x3, y=y)
rpartfit <- rpart(y ~., data = granular_simulated)
varimp_sim <- varImp(rpartfit)
varimp_sim
As we can see, the tree mostly uses x1 and x3 to split; x2 is not related to y, so its importance score is low. In this simulation x1 has the largest coefficient and the smallest standard deviation, so it is the strongest predictor.
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
This behavior is driven by two factors: the bagging fraction and the learning rate. In the gradient boosting algorithm, each stage's learner is fit to optimally follow the gradient. This greedy strategy can cause over-fitting and reduce generalization power. The learning rate guards against over-fitting by adding only a fraction of each stage's predicted value to the previously accumulated predictions. A higher learning rate increases the greediness and concentrates the importance on fewer variables.
Greedy models are less likely to select the optimal global model and are prone to over-fitting. Stochastic models reduce prediction variance. Therefore, the less greedy model on the left (with a 0.1 learning rate) that is also more random (due to only selecting 0.1 of the training set observations to propose the next tree in the expansion) would be more predictive of other samples.
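A quick way to see this effect is to fit gbm directly with the two extreme settings and compare the relative influence. This is only a minimal sketch, using the Friedman simulated data from above as a stand-in for the solubility data (the object names and the smaller n.minobsinnode are ours):
library(gbm)
set.seed(123)
#both tuning parameters set to 0.1 (left-hand plot of Fig. 8.24)
gbm_low <- gbm(y ~ ., data = simulated, distribution = "gaussian",
n.trees = 1000, shrinkage = 0.1, bag.fraction = 0.1, n.minobsinnode = 5)
#both tuning parameters set to 0.9 (right-hand plot of Fig. 8.24)
gbm_high <- gbm(y ~ ., data = simulated, distribution = "gaussian",
n.trees = 1000, shrinkage = 0.9, bag.fraction = 0.9, n.minobsinnode = 5)
#relative influence is spread across more predictors in the 0.1/0.1 model
summary(gbm_low, plotit = FALSE)
summary(gbm_high, plotit = FALSE)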
Interaction depth specifies the tree depth, i.e., the number of node splits allowed in each tree. Models with a lower learning rate see greater improvement from increasing interaction depth than models with a higher learning rate. Deeper trees allow more predictors to enter the model, which affects the model's greediness and in turn its predictive ability (RMSE). Therefore, increasing the interaction depth brings more predictors into the model and spreads the variable importance across more predictors, flattening the importance slope.
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
We decided to work with a few tree-based models: Random Forest (RF), Stochastic Gradient Boosting (SGB), Rule-Based Cubist (CUBE), Classification and Regression Tree (CART), and Conditional Inference Tree (CIT).
We use the same data as in problems 6.3 and 7.5, which already went through data imputation, pre-processing, and data splitting (training data 80%, test data 20%).
RF is implemented in the train() function using the rf method which implements Random Forest classification and regression using the randomForest package.
set.seed(123)
ctrl1 <- trainControl(method = "boot", number = 25)
tg1 <- expand.grid(mtry=seq(2,38,by=3))
rfModel <- train(x = train_pred, y = train_output, method = "rf",
metric = "Rsquared", tuneGrid = tg1, trControl = ctrl1)
SGB is implemented in the train() function using the gbm method, which fits generalized boosted regression models. The tuning parameters declared in the tuning grid are the variable interaction depth (interaction.depth), the number of fitted trees (n.trees), the learning rate (shrinkage), and a minimum of 10 observations in the trees' terminal nodes (n.minobsinnode).
set.seed(123)
ctrl2 <- trainControl(method = "boot", number = 25)
tg2 <- expand.grid(interaction.depth=seq(1,6,by=1), n.trees=c(25,50,100,200),
shrinkage=c(0.01,0.05,0.1,0.2), n.minobsinnode=10)
sgbModel <- train(x = train_pred, y = train_output, method = "gbm",
metric = "Rsquared", tuneGrid = tg2, trControl = ctrl2, verbose=F)
CUBE is implemented in the train() function using the cubist method, which fits a rule-based M5 model with additional corrections based on nearest neighbors in the training set. The committees parameter controls the boosting-like iterations and neighbors controls the number of training-set instances used to adjust the predictions.
set.seed(123)
ctrl3 <- trainControl(method = "boot", number = 25)
tg3 <- expand.grid(committees = c(1,5,10,20,50,100), neighbors = c(0,1,3,5,7))
cubeModel <- train(x = train_pred, y = train_output, method = "cubist",
metric = "Rsquared", tuneGrid = tg3, trControl = ctrl3)
CART is implemented in the train() function using the rpart2 method which tunes the model over the maximum depth (of the tree).
set.seed(123)
ctrl4 <- trainControl(method = "boot", number = 25)
tg4 <- expand.grid(maxdepth= seq(1,10,by=1))
cartModel <- train(x = train_pred, y = train_output, method = "rpart2",
metric = "Rsquared", tuneGrid = tg4, trControl = ctrl4)
CIT is implemented in the train() function using the ctree2 method, which tunes over the maximum tree depth (maxdepth) and the threshold of the splitting criterion (mincriterion).
set.seed(123)
ctrl5 <- trainControl(method = "cv")
tg5 <- expand.grid(maxdepth= seq(1,10,by=1), mincriterion=1-c(0.01, 0.05, 0.1))
citModel <- train(x = train_pred, y = train_output, method = "ctree2",
metric = "Rsquared", tuneGrid = tg5, trControl = ctrl5)
metrics <- function(models) {
RMSE = min(models$results$RMSE)
Rsquared = max(models$results$Rsquared)
MAE = min(models$results$MAE)
return(cbind(RMSE, Rsquared, MAE)) }
data.frame(rbind(metrics(rfModel), metrics(sgbModel),
metrics(cubeModel), metrics(cartModel), metrics(citModel)),
row.names = c("RF","SGB","CUBE","CART","CIT"))
fcastRf <- predict(rfModel, newdata = test_pred)
fcastSgb <- predict(sgbModel, newdata = test_pred)
fcastCube <- predict(cubeModel, newdata = test_pred)
fcastCart <- predict(cartModel, newdata = test_pred)
fcastCit <- predict(citModel, newdata = test_pred)
data.frame(rbind(
postResample(pred = fcastRf, obs = test_output),
postResample(pred = fcastSgb, obs = test_output),
postResample(pred = fcastCube, obs = test_output),
postResample(pred = fcastCart, obs = test_output),
postResample(pred = fcastCit, obs = test_output)
),
row.names = c("RF","SGB","CUBE","CART","CIT")
)
The random forest performed the best since it had the highest R^2 value and lower RMSE and MAE values.
(b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
The optimal tree-based model is the random forest; we can see its most important predictors by using the varImp function.
plot(varImp(rfModel), top=20)
Manufacturing variables dominate the list: of the 20 most important predictors, 8 are biological and the rest are manufacturing process variables.
All models agree that ManufacturingProcess32 is the most important predictor. BiologicalMaterial03 and BiologicalMaterial06 were very important in all three models.
topPred_nonlinear <- head(topPred_knn[order(topPred_knn$overall,decreasing = T),],10)$names
topPred_linear <- c('ManufacturingProcess32','BiologicalMaterial06','ManufacturingProcess13',
'BiologicalMaterial03','ManufacturingProcess17','BiologicalMaterial12',
'BiologicalMaterial02','ManufacturingProcess09','ManufacturingProcess31','ManufacturingProcess06')
topTree <- c('ManufacturingProcess32','BiologicalMaterial03','BiologicalMaterial06',
'ManufacturingProcess13','BiologicalMaterial12','ManufacturingProcess31',
'ManufacturingProcess06','ManufacturingProcess09','BiologicalMaterial02','BiologicalMaterial04')
topPreds_in_Both <- as.data.frame(cbind(topPred_nonlinear,topPred_linear,topTree))
topPreds_in_Both
getTree(rfModel$finalModel,labelVar=TRUE)
We worked on the recommender system in Python and R and are going to compare the results from each.
The Groceries Data Set contains a collection of receipts, with each line representing one receipt and the items purchased. Each line is called a transaction and each column in a row represents an item. The data set is in GroceryDataSet.csv (a comma-separated file). Your assignment is to use R to mine the data for association rules. You should report support, confidence, and lift, and your top 10 rules by lift.
We'll start by importing the libraries necessary for our analysis.
library(arules)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
## Attaching package: 'arules'
## The following object is masked from 'package:modeltools':
##
## info
## The following object is masked from 'package:kernlab':
##
## size
## The following object is masked from 'package:dplyr':
##
## recode
## The following objects are masked from 'package:base':
##
## abbreviate, write
library(pander)
library(arulesViz)
library(fpp2)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
## ── Attaching packages ────────────────────────────────────────────── fpp2 2.4 ──
## ✓ forecast 8.15 ✓ expsmooth 2.3
## ✓ fma 2.4
## ── Conflicts ───────────────────────────────────────────────── fpp2_conflicts ──
## x kernlab::alpha() masks ggplot2::alpha()
## x kernlab::cross() masks purrr::cross()
## x tidyr::extract() masks magrittr::extract()
## x purrr::lift() masks caret::lift()
## x randomForest::margin() masks ggplot2::margin()
## x purrr::set_names() masks magrittr::set_names()
library(RColorBrewer)
Load the CSV file into R and let's get into the EDA.
The purpose of market basket analysis is to let retailers and businesses analyze which items customers buy together and use that information to make profitable decisions.
Load the data from csv using read.transactions() from arules package.
grocery_df <- read.transactions('https://raw.githubusercontent.com/SubhalaxmiRout002/DATA624/main/Week4/GroceryDataSet.csv', sep = ",", format = "basket")
grocery_df
## transactions in sparse format with
## 9835 transactions (rows) and
## 169 items (columns)
Use summary() to get an overview of the data.
summary(grocery_df)
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 169 columns (items) and a density of 0.02609146
##
## most frequent items:
## whole milk other vegetables rolls/buns soda
## 2513 1903 1809 1715
## yogurt (Other)
## 1372 34055
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 2159 1643 1299 1005 855 645 545 438 350 246 182 117 78 77 55 46
## 17 18 19 20 21 22 23 24 26 27 28 29 32
## 29 14 14 9 11 4 6 1 1 1 1 3 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 4.409 6.000 32.000
##
## includes extended item information - examples:
## labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3 baby cosmetics
The summary gives the number of rows and columns in the data. It shows the most frequently purchased items: whole milk is at the top, other vegetables is second, rolls/buns is third, and so on. Below that is the basket-size (transaction length) distribution: 2,159 transactions contain exactly 1 item, 1,643 contain 2 items, and so on, up to a maximum basket size of 32 items. Adding up these frequencies gives the total number of transactions, 9,835. The mean basket size is about 4.4, which means on average there are roughly 4 items per basket.
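As a quick sanity check of those numbers, arules::size() returns the number of items in each transaction, so we can recompute the distribution, the total transaction count, and the mean basket size (the object name is ours):
basket_sizes <- size(grocery_df) #items per transaction
sum(table(basket_sizes)) #9835 transactions in total
mean(basket_sizes) #about 4.4 items per basket on average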
In itemFrequencyPlot(grocery_df, topN = 10, type = "absolute"), the first argument is the transaction object to be plotted, grocery_df. topN plots the top N highest-frequency items. type can be "absolute" or "relative": absolute plots the raw purchase counts of each item independently, while relative plots how often each item appears compared to the others.
itemFrequencyPlot(grocery_df, topN = 10, type="absolute", col=brewer.pal(8,'Pastel2'), main = 'Top 10 items purchased')
The above plot shows the same top 5 items as we get from summary():
1 whole milk
2 other vegetables
3 rolls/buns
4 soda
5 yogurt
bottom_10 <- head(sort(itemFrequency(grocery_df, type="absolute"), decreasing=FALSE), n=10)
par(mar=c(10.5,3,2, 0.3))
barplot(bottom_10, ylab = "Frequency", main = "Bottom 10 items purchased", col=brewer.pal(8,'Pastel2'), las = 2)
The above plot shows the Bottom 10 items purchased.
Items distribution in basket
hist(size(grocery_df), breaks = 0:35, xaxt="n", ylim=c(0,2200),
main = "Number of items in particular baskets", xlab = "Items", col = brewer.pal(8,'Pastel2'))
axis(1, at=seq(0,33,by=1), cex.axis=0.8)
We can see that the number of baskets decreases as the number of items per basket increases.
In this section we mine association rules with the APRIORI algorithm and interpret how it works.
The next step is to mine the rules using the apriori() function from the arules package.
# Min Support as 0.001, confidence as 0.8.
association_rules <- apriori(grocery_df, parameter = list(supp=0.001, conf=0.8,maxlen=10), control=list(verbose=F))
The apriori() function takes the transaction object on which mining is to be applied. The parameter argument sets the minimum support and minimum confidence; the defaults are a minimum support of 0.1, a minimum confidence of 0.8, and a maximum of 10 items per rule (maxlen).
summary(association_rules)
## set of 410 rules
##
## rule length distribution (lhs + rhs):sizes
## 3 4 5 6
## 29 229 140 12
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 4.000 4.000 4.329 5.000 6.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.001017 Min. :0.8000 Min. :0.001017 Min. : 3.131
## 1st Qu.:0.001017 1st Qu.:0.8333 1st Qu.:0.001220 1st Qu.: 3.312
## Median :0.001220 Median :0.8462 Median :0.001322 Median : 3.588
## Mean :0.001247 Mean :0.8663 Mean :0.001449 Mean : 3.951
## 3rd Qu.:0.001322 3rd Qu.:0.9091 3rd Qu.:0.001627 3rd Qu.: 4.341
## Max. :0.003152 Max. :1.0000 Max. :0.003559 Max. :11.235
## count
## Min. :10.00
## 1st Qu.:10.00
## Median :12.00
## Mean :12.27
## 3rd Qu.:13.00
## Max. :31.00
##
## mining info:
## data ntransactions support confidence
## grocery_df 9835 0.001 0.8
The above summary() shows the following:
* Parameter specification: min_sup = 0.001 and min_confidence = 0.8, with a maximum of 10 items per rule.
* Total number of rules: a set of 410 rules.
* Distribution of rule length: rules of 4 items are the most common (229) and rules of 6 items the least common (12).
* Summary of quality measures: minimum and maximum values for support, confidence, and lift.
* Mining info: the data, support, and confidence we provided to the algorithm.
Since there are 410 rules, let’s print only top 10:
inspect(association_rules[1:10])
## lhs rhs support confidence coverage lift count
## [1] {liquor,
## red/blush wine} => {bottled beer} 0.001931876 0.9047619 0.002135231 11.235269 19
## [2] {cereals,
## curd} => {whole milk} 0.001016777 0.9090909 0.001118454 3.557863 10
## [3] {cereals,
## yogurt} => {whole milk} 0.001728521 0.8095238 0.002135231 3.168192 17
## [4] {butter,
## jam} => {whole milk} 0.001016777 0.8333333 0.001220132 3.261374 10
## [5] {bottled beer,
## soups} => {whole milk} 0.001118454 0.9166667 0.001220132 3.587512 11
## [6] {house keeping products,
## napkins} => {whole milk} 0.001321810 0.8125000 0.001626843 3.179840 13
## [7] {house keeping products,
## whipped/sour cream} => {whole milk} 0.001220132 0.9230769 0.001321810 3.612599 12
## [8] {pastry,
## sweet spreads} => {whole milk} 0.001016777 0.9090909 0.001118454 3.557863 10
## [9] {curd,
## turkey} => {other vegetables} 0.001220132 0.8000000 0.001525165 4.134524 12
## [10] {rice,
## sugar} => {whole milk} 0.001220132 1.0000000 0.001220132 3.913649 12
The rule table above shows lhs, rhs, support, confidence, coverage, lift, and count. Let's review what these terms mean (formulas are given after the list):
* lhs: the items already present in the basket
* rhs: the item more likely to be bought together with the lhs
* support: the fraction of transactions that contain the item-set
* confidence: for a rule A => B, the proportion of transactions containing A that also contain B
* lift: for a rule A => B, how much more often A and B occur together than would be expected if they were independent (lift > 1 indicates a positive association)
* count: the number of transactions containing the item-set
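For a rule A => B mined from N transactions, these measures can be written as:
\(\text{support}(A \Rightarrow B) = \frac{\text{count}(A \cup B)}{N}, \quad \text{confidence}(A \Rightarrow B) = \frac{\text{support}(A \cup B)}{\text{support}(A)}, \quad \text{lift}(A \Rightarrow B) = \frac{\text{confidence}(A \Rightarrow B)}{\text{support}(B)}\)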
Using the above output, we can make observations such as:
* 100% of the customers who bought {rice, sugar} also bought {whole milk}.
* 92% of the customers who bought {house keeping products,whipped/sour cream} also bought {whole milk}.
We can identify rules that are subsets of larger rules (redundant rules). Use the code below to find such rules:
# get subset rules in vector
subset_rules <- which(colSums(is.subset(association_rules, association_rules)) > 1)
length(subset_rules)
## [1] 91
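To actually drop these redundant rules from the rule set, we can index them out (a small sketch; pruned_rules is our object name):
pruned_rules <- association_rules[-subset_rules] #keep only the non-redundant rules
length(pruned_rules) #410 - 91 = 319 rules remain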
A straightforward visualization of association rules is a scatter plot produced with plot() from the arulesViz package. It puts support and confidence on the axes; a third measure, lift, is used by default to color the points (grey levels).
# Filter rules with confidence greater than 0.4 or 40%
subRules<-association_rules[quality(association_rules)$confidence>0.4]
#Plot SubRules
plot(subRules)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
The plot above shows that rules with high lift tend to have low support. There is one rule in the middle with moderate support and high lift. The grey points have higher support and lower lift.
Another way to visualize the rules is the individual rule representation, obtained by passing method = "two-key plot". The two-key plot uses support and confidence on the x and y axes respectively and colors the points by order, the number of items in the rule.
plot(subRules,method="two-key plot")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
As we saw in the rule summary, rules with 4 items are the most common and rules with 6 items the least common, which we can see in the plot.
Graph plots are a great way to visualize rules but tend to become congested as the number of rules increases, so it is better to visualize a smaller number of rules with graph-based visualizations. Here we will look at the top 10 rules from subRules with the highest confidence.
Each interactive graph has a dropdown where you can select specific rules and view the corresponding subgraph.
top10subRules <- head(subRules, n = 10, by = "confidence")
Below we plot an interactive graph, using the engine = "htmlwidget" parameter in plot().
set.seed(123)
plot(top10subRules, method = "graph", engine = "htmlwidget")
In the graph above, a darker red color indicates higher lift and higher confidence. Rule 4 has a large node with a light red color, which suggests that someone who buys {flour, root vegetables, whipped/sour cream} may also buy {whole milk}. Rule 6 has high confidence and high lift, meaning there is a high chance that a person who buys {citrus fruit, root vegetables, soft cheese} will also buy {other vegetables}.
Below is the graph for rules with whole milk on the RHS.
set.seed(123)
#the subset step below was not shown in the original code; this reconstruction is an assumption
whmilk_association_rules <- subset(association_rules, subset = rhs %in% "whole milk")
plot(head(whmilk_association_rules, n = 10), method = "graph", engine = "htmlwidget")
Looking at Rule 6 and Rule 8, both rules have high lift and high confidence. Rule 2 and Rule 9 have somewhat lower confidence and lift.
Below is the graph for rules with other vegetables on the RHS.
set.seed(123)
#as above, this subset step is an assumed reconstruction
otherveg_association_rules <- subset(association_rules, subset = rhs %in% "other vegetables")
plot(head(otherveg_association_rules, n = 10), method = "graph", engine = "htmlwidget")
In the plot above, Rule 5 and Rule 7 have higher confidence and higher lift than the other rules. Considering Rule 7, a person who buys {hard cheese, oil} is more likely to also buy {other vegetables}.
This representation is also called a parallel coordinates plot. It is useful for visualizing which items bought together lead to which purchases. As mentioned above, the RHS is the consequent, or the item we propose the customer will buy; the positions on the LHS show the antecedent items, where 2 is the most recent addition to our basket and 1 is the item we previously had.
set.seed(123)
subRules2<-head(subRules, n=10, by="lift")
plot(subRules2, method="paracoord")
In the plot above, the wide dark arrow means that a person who buys {liquor, red/blush wine} is very likely to also buy {bottled beer}. The line next to the dark one indicates that a person who buys {grapes, fruits/vegetables, citrus fruits} is less strongly associated with buying {pip fruit}.
Since association rules are very useful for setting the store's strategy, we draw a few conclusions from the above analysis.
Concerning the least-sold products (baby food, sound storage medium, preservation products, kitchen utensil, bags, frozen chicken, baby cosmetics, toilet cleaner, and salad dressing), the business should consider either withdrawing them from sale or setting up a sales promotion that combines these least-sold products with more popular ones (e.g., vegetables with whole milk, soda, yogurt, etc.).
Concerning the association rules, the grocery store can place other vegetables near fruits, root vegetables, cheese and place whole milk near rice, cereal, yogurt, and sour cream.