6.3. A chemical manufacturing process for a pharmaceutical product was discussed in Sect.1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
data <- ChemicalManufacturingProcess
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
There are 24 rows in total that contain NAs.
sum(!complete.cases(data))
## [1] 24
Missing Yield & Biological Predictors
Let's first review the yield and biological predictors. The function vis_dat (from the visdat package) is an easy way to visualize missing data. The y-axis represents the observations, with row 1 at the top and the row number increasing as you move down. Each column of data is represented on the x-axis. If a value is missing, it is highlighted in grey; otherwise it is pink.
Here there are no values missing from any of the biological or yield columns.
library(visdat)
yield_and_bio <- data[,c(1:13)]
vis_dat(yield_and_bio)
Now let's review whether any values are missing from the manufacturing process predictors. We see more grey rectangles, which indicates missing (NA) values.
process <- data[,c(14:58)]
vis_dat(process)
The first observation is missing values for:
* ManufacturingProcess01 to ManufacturingProcess08
* ManufacturingProcess10 to ManufacturingProcess12
* ManufacturingProcess22 to ManufacturingProcess24
* ManufacturingProcess40 to ManufacturingProcess41
Many of the observations in rows 1-25 are missing values for:
* ManufacturingProcess03, ManufacturingProcess10, & ManufacturingProcess11
Observations 170-176 are missing the most values:
* ManufacturingProcess25 to ManufacturingProcess36 with the exception of ManufacturingProcess32
In order to complete our analysis, we are going to impute these missing values using the preProcess() function with method = "knnImpute" from the caret package.
If we view the first row, which we previously saw was missing many values, we can see the missing values have now been filled in.
library(caret)
imp <- preProcess(data, method = "knnImpute")
imputed <- predict(imp, data)
imputed[c(1),]
Here we are going to start by using 80% of the data set to train our model. The training set contains 140 observations and the test set contains the remaining 36; after removing the target there are 57 predictor columns. The target, or dependent variable, is Yield.
We will use the dataframe with imputed values, imputed. To do this we randomly sample 80% of the observations based on their row index.
#for reproducibility
set.seed(123)
train_index <- sample(nrow(imputed),nrow(imputed)*0.8)
#get subset of ids
train_pred <- imputed[train_index,-1]
train_output <- imputed[train_index,1]
test_pred <- imputed[-train_index,-1]
test_output <- imputed[-train_index,1]
For this problem, we will use a ridge regression model. Ridge regression is useful when the predictors in the data set are highly correlated: it adds a penalty on the size of the coefficients to the sum of squared errors. The tuning parameter, lambda, controls the strength of that penalty.
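For reference, the ridge estimate minimizes the usual sum of squared errors plus an L2 penalty on the coefficients (intercept omitted for simplicity); the lambda tuned below is the \(\lambda\) in this objective:
\(\hat{\beta}^{ridge} = \underset{\beta}{\arg\min} \; \sum_{i=1}^{n}\Big(y_{i} - \sum_{j=1}^{P} x_{ij}\beta_{j}\Big)^{2} + \lambda \sum_{j=1}^{P}\beta_{j}^{2}\)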
Next we are going to determine which value of lambda returns the best model. We know which lambda is best because it will produce the model with the lowest value of RMSE.
ctrl <- trainControl(method = "cv", number = 10)
#creating grid of possible values for lambda
ridgeGrid <- data.frame(.lambda = seq(0, .1, length = 15))
#for reproducibility
set.seed(123)
ridgeRegFit <- train(train_pred,train_output,
method='ridge',
#Our different lambda penalty values
tuneGrid=ridgeGrid,
trControl = ctrl,
#center & get pred variables on same scale
preProc=c("center","scale"))
ridgeRegFit
## Ridge Regression
##
## 140 samples
## 57 predictor
##
## Pre-processing: centered (57), scaled (57)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 126, 126, 125, 125, 125, 127, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.000000000 1.0221525 0.3674877 0.7492669
## 0.007142857 0.7394789 0.5192863 0.6083392
## 0.014285714 0.7067733 0.5427204 0.5836053
## 0.021428571 0.7004064 0.5455998 0.5818266
## 0.028571429 0.6990414 0.5459039 0.5802543
## 0.035714286 0.6989843 0.5459852 0.5790184
## 0.042857143 0.6993876 0.5461214 0.5776078
## 0.050000000 0.7000090 0.5463097 0.5763720
## 0.057142857 0.7007663 0.5465197 0.5761176
## 0.064285714 0.7016265 0.5467293 0.5763379
## 0.071428571 0.7025735 0.5469253 0.5767611
## 0.078571429 0.7035971 0.5471011 0.5771880
## 0.085714286 0.7046902 0.5472540 0.5775532
## 0.092857143 0.7058465 0.5473833 0.5778765
## 0.100000000 0.7070609 0.5474898 0.5784133
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.03571429.
This plot shows the change in RMSE as lambda increases.
plot(ridgeRegFit)
Our best model uses lambda equal to 0.0357. This model has an R^2 of about 0.55 and an RMSE of about 0.7.
best_ridge_model = ridgeRegFit$results[ridgeRegFit$results$RMSE == min(ridgeRegFit$results$RMSE),]
best_ridge_model
Here we use our best ridge regression model to see how well it predicts on the test data. The final model is an enet object, so we call predict() on ridgeRegFit$finalModel with s = 1 and mode = "fraction" in order to get the pure ridge regression fit.
ridgePred_Test <- predict(ridgeRegFit$finalModel, newx = as.matrix(test_pred), s = 1, mode = "fraction", type = "fit")
ridgeCoeff_Test <- predict(ridgeRegFit$finalModel, newx = as.matrix(test_pred), s = 1, mode = "fraction", type = "coefficients")
Now we can compare how our model performed on our train data versus our test data.
# test performance
model_values <- data.frame(obs = test_output, pred = ridgePred_Test$fit)
test_performance <- defaultSummary(model_values)
# train matrices
train_performance <- best_ridge_model[,c('RMSE','Rsquared','MAE')]
# create data frame of both train + test performances
perf_df <- rbind(t(data.frame(test_performance)), data.frame(train_performance))
# Rename column
names(perf_df) <- c("RMSE", "Rsquared", "MAE")
rownames(perf_df) <- c("Test","Train")
# Train and Test matrices
perf_df
The RMSE value for the test data is 3.3 and the RMSE value for the training set is 0.7. The lower the RMSE value, the better the model. The R^2 value for the test set is lower than for the training set, which indicates the model accounts for the variability in the training data better than in the test data. These metrics tell us that our model did not perform as well on the test data as it did on the training data, which isn't too surprising.
variable_importance <- varImp(ridgeRegFit,scale=F)
plot(variable_importance, top = 20)
ManufacturingProcess32 is the most important variable, followed by BiologicalMaterial06. The manufacturing predictors appear more frequently in our list and appear to be more important in our model.
vi_df <- variable_importance$importance %>%
data.frame() %>%
top_n(9)
vi_df$var <- rownames(vi_df)
df_top_x <- data[,names(imputed) %in% vi_df$var] %>%
bind_cols(Yield = data$Yield)
df_top_x %>%
gather(variable, value, -Yield) %>%
ggplot(aes(x = value, y = Yield,color=variable)) +
geom_point() +
facet_wrap(~variable, scales = 'free_x') +
labs(x = element_blank())
The sub-plots above illustrate positive and negative correlations with the response variable. To improve the yield in future runs, the manufacturing processes with positive coefficients should be increased, while the processes with negative coefficients should be reduced. The majority of the important variables have a positive relationship, except ManufacturingProcess17 and ManufacturingProcess13. By knowing which variables affect yield and the relationship of each variable with yield, we can determine how to adjust each material or process to maximize results.
7.2. Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data, where the x values are uniformly distributed on [0, 1]:
\(y = 10\sin(\pi x_{1}x_{2}) + 20(x_{3} - 0.5)^{2} + 10x_{4} + 5x_{5} + N(0, \sigma^{2})\)
library(mlbench)
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
## We convert the 'x' data from a matrix to a data frame. One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)
## Look at the data using
obj <- featurePlot(trainingData$x, trainingData$y)
print(obj)
## or other methods.
## This creates a list with a vector 'y' and a matrix of predictors 'x'. Also simulate a large test set to
## estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
Tune several models on these data. For example:
library(caret)
set.seed(123)
knnModel <- train(x = trainingData$x,
y = trainingData$y,
## note: no 'method' is supplied here, so caret fits its default
## model (random forest); the output below comes from that fit, not kNN
preProc = c("center", "scale"),
tuneLength = 10)
## note: only 9 unique complexity parameters in default grid. Truncating the grid to 9 .
knnModel
## Random Forest
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 2.942963 0.7786600 2.409166
## 3 2.757348 0.7780577 2.257721
## 4 2.668106 0.7709207 2.184489
## 5 2.638548 0.7580862 2.165022
## 6 2.623661 0.7511115 2.152904
## 7 2.626736 0.7401869 2.155715
## 8 2.640345 0.7321122 2.162547
## 9 2.657499 0.7241281 2.176910
## 10 2.674946 0.7177772 2.190956
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 6.
knnPred <- predict(knnModel, newdata = testData$x)
## The function 'postResample' can be used to get the test set
## performance values
kNN_Results <- postResample(pred = knnPred, obs = testData$y)
kNN_Results
## RMSE Rsquared MAE
## 2.4381466 0.7992296 1.9308194
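As noted above, the train() call did not specify a method, so caret fit its default random forest, which is what the printed output shows. A minimal sketch of the kNN fit the text intends (the object name knnOnly is ours) would be:
set.seed(123)
knnOnly <- train(x = trainingData$x,
y = trainingData$y,
method = "knn", #explicitly request k-nearest neighbors
preProc = c("center", "scale"),
tuneLength = 10)
knnOnly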
Now we are going to fit a MARS model using the earth package. MARS stands for Multivariate Adaptive Regression Splines. Like neural networks and partial least squares, MARS (Friedman 1991) uses surrogate features instead of the original predictors.
When running a MARS model, you can set nprune, the maximum number of terms allowed in the final model. Here we've chosen 20 since that is more than the number of X columns, which allows all of the terms to be included and leaves room for interactions. You can also set degree, the maximum number of hinge functions that are allowed to interact in a single term. We'll allow degrees up to 3.
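To make the hinge functions concrete, each MARS term is a piecewise-linear hinge of a single predictor, and a higher degree allows products of hinges:
\(h(x - c) = \max(0, x - c), \qquad h(c - x) = \max(0, c - x)\)
so a degree-2 term has the form \(h(x_{1} - c_{1}) \cdot h(c_{2} - x_{2})\).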
library(earth)
set.seed(123)
ctrl <- trainControl(method = "cv", number = 10)
#set values for nprune and degree
parameter_Grid <- expand.grid(nprune = 1:20, degree = 1:3)
#tune model
MARSModel <- train(trainingData$x,
trainingData$y,
method='earth',
metric='RMSE',
tuneGrid=parameter_Grid,
trControl = ctrl)
summary(MARSModel)
## Call: earth(x=data.frame[200,10], y=c(18.46,16.1,17...), keepxy=TRUE, degree=2,
## nprune=14)
##
## coefficients
## (Intercept) 21.905319
## h(0.621722-X1) -15.726181
## h(X1-0.621722) 9.234027
## h(0.601063-X2) -18.253527
## h(X2-0.601063) 10.448545
## h(0.447442-X3) 9.700589
## h(X3-0.606015) 12.674694
## h(0.734892-X4) -9.863526
## h(X4-0.734892) 10.297964
## h(0.850094-X5) -5.324175
## h(0.218266-X1) * h(X2-0.601063) -55.358726
## h(X1-0.218266) * h(X2-0.601063) -29.100250
## h(X1-0.621722) * h(X2-0.295997) -26.833129
## h(0.649253-X1) * h(0.601063-X2) 27.120721
##
## Selected 14 of 18 terms, and 5 of 10 predictors (nprune=14)
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6-unused, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 9 4
## GCV 1.62945 RSS 225.8601 GRSq 0.9338437 RSq 0.953688
Here we visualize the cross-validated RMSE for each combination of degree and number of terms. Consistent with the model summary above, the best model uses degree = 2 with nprune = 14 terms.
ggplot(MARSModel)
Now we will predict using the test data using our best MARS model and review the model results.
mars_predictions <- predict(MARSModel$finalModel,testData$x)
MARS_Results <- postResample(pred = mars_predictions, obs = testData$y)
MARS_Results
## RMSE Rsquared MAE
## 1.1722635 0.9448890 0.9324923
Neural networks are powerful nonlinear regression techniques inspired by theories about how the brain works. Like partial least squares, the outcome is modeled by an intermediary set of unobserved variables (called hidden variables or hidden units here). The model has an input layer, a hidden layer, and an output layer.
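As a reminder of the structure being fit, each hidden unit is a sigmoidal transformation of a linear combination of the predictors, and the prediction is a linear combination of the hidden units:
\(h_{k}(x) = g\Big(\beta_{0k} + \sum_{j=1}^{P} x_{j}\beta_{jk}\Big), \quad g(u) = \frac{1}{1 + e^{-u}}, \quad \hat{y} = \gamma_{0} + \sum_{k=1}^{H} \gamma_{k}\, h_{k}(x)\)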
Now we will tune the model using the train function as we have before. The tuning parameters that take on multiple values are the number of hidden units and the amount of weight decay.
neural_Grid <- expand.grid(.decay = c(0,0.01,0.1),
.size = c(1:10),
.bag = FALSE)
set.seed(123)
nnetModel <- train(trainingData$x,trainingData$y,
method = "avNNet",
tuneGrid = neural_Grid,
trControl = ctrl,
preProc = c("center","scale"), #center & scale the data
linout = TRUE,
trace = FALSE,
#MaxNWts = 10 * (ncol(trainingData$x) + 1) + 10 + 1,
maxit = 500)
plot(nnetModel)
The best model used a weight decay of 0.1 and 10 hidden units. Now we will predict on the test data using our neural network model and review the results.
nn_predictions <- predict(nnetModel$finalModel,testData$x)
nn_Results <- postResample(pred = nn_predictions, obs = testData$y)
nn_Results
## RMSE Rsquared MAE
## 5.5708330 0.6800083 4.5994262
It looks like the MARS model outperforms the kNN and neural network models since its RMSE and MAE are lower and its R^2 is higher.
results <- data.frame(rbind(nn_Results,kNN_Results,MARS_Results))
results
The informative predictors are X1-X5; the uninformative predictors X6-X10 are unused by the MARS model.
varImp(MARSModel$finalModel)
This code is from the week 6 homework problems (Exercise 6.3); it imputes missing values and splits the data into training and testing sets.
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
data <- ChemicalManufacturingProcess
imp <- preProcess(data, method = "knnImpute")
imputed <- predict(imp,data)
set.seed(123)
ids <- sample(nrow(imputed),nrow(imputed)*0.8)
train_pred <- imputed[ids,-1]
train_output <- imputed[ids,1]
test_pred <- imputed[-ids,-1]
test_output <- imputed[-ids,1]
Now we will use several nonlinear regression models to see which performs best on our data.
As stated previously, neural networks are powerful nonlinear regression techniques inspired by theories about how the brain works. Like partial least squares, the outcome is modeled by an intermediary set of unobserved variables (hidden units). The model has an input layer, a hidden layer, and an output layer.
neural_Grid <- expand.grid(.decay = c(0,0.01,0.1),
.size = c(1:10),
.bag = FALSE)
set.seed(123)
nnetTune <- train(train_pred,train_output,
method = "avNNet",
tuneGrid = neural_Grid,
trControl = ctrl,
#center & scale the data
preProc = c("center","scale"),
linout = TRUE,
trace = FALSE,
maxit = 500)
nn_predictions <- predict(nnetTune$finalModel,test_pred)
nn_results <- postResample(pred = nn_predictions, obs = test_output)
nn_results
## RMSE Rsquared MAE
## 0.6350121 0.6818911 0.4776292
SVMs are a class of powerful, highly flexible modeling techniques. The theory behind SVMs was originally developed in the context of classification models.
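The svmRadial method used below fits a support vector regression with a radial basis (Gaussian) kernel; caret's grid varies the cost parameter C while a single value of the kernel parameter \(\sigma\) is estimated from the data:
\(K(x, x') = \exp\big(-\sigma \lVert x - x' \rVert^{2}\big)\)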
library(kernlab)
svmTune <- train(train_pred, train_output,
method = "svmRadial",
preProc = c("center", "scale"),
tuneLength = 14,
trControl = trainControl(method = "cv"))
svm_predictions <- predict(svmTune$finalModel,test_pred)
SVM_results <- postResample(pred = svm_predictions, obs = test_output)
SVM_results
## RMSE Rsquared MAE
## 0.6522754 0.7663851 0.4791971
Now we will fit another model, centering and scaling the data. Note that, as in 7.2 above, no method is specified in the train() call, so caret fits its default random forest rather than kNN; the output below reflects that.
library(caret)
knnTune <- train(x = train_pred,
y = train_output,
## no 'method' supplied, so caret's default (random forest) is used
preProc = c("center", "scale"),
tuneLength = 10)
knnTune
## Random Forest
##
## 140 samples
## 57 predictor
##
## Pre-processing: centered (57), scaled (57)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 140, 140, 140, 140, 140, 140, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 0.7235649 0.5378986 0.5712462
## 8 0.6891294 0.5566480 0.5384496
## 14 0.6857843 0.5507213 0.5341980
## 20 0.6866935 0.5436037 0.5332588
## 26 0.6881685 0.5367398 0.5337969
## 32 0.6899401 0.5308883 0.5350922
## 38 0.6957190 0.5192489 0.5385073
## 44 0.7004385 0.5107343 0.5428744
## 50 0.7048170 0.5027933 0.5469711
## 57 0.7104203 0.4917456 0.5531747
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 14.
knnPred <- predict(knnTune, newdata = test_pred)
knn_Results <- postResample(pred = knnPred, obs = test_output)
knn_Results
## RMSE Rsquared MAE
## 0.6466251 0.7747683 0.5041527
As previously stated, MARS (Friedman 1991), like neural networks and partial least squares, uses surrogate features instead of the original predictors.
ctrl <- trainControl(method = "cv", number = 10)
parameter_Grid <- expand.grid(nprune = 1:20, degree = 1:3)
MARS_Fit <- train(train_pred,
train_output,
method='earth',
metric='RMSE',
tuneGrid=parameter_Grid,
trControl = ctrl)
mars_predictions <- predict(MARS_Fit$finalModel,test_pred)
MARS_results <- postResample(pred = mars_predictions, obs = test_output)
MARS_results
## RMSE Rsquared MAE
## 0.6441535 0.6703288 0.5029883
Comparing the test-set results, the knnTune model (which, because no method was specified, is actually a random forest) has the highest R^2, while the neural network has slightly lower RMSE and MAE values. Based on R^2, the knnTune model performs best.
results <- data.frame(rbind(knn_Results,MARS_results,SVM_results,nn_results))
results
In the nonlinear model's importance ranking, the manufacturing process predictors appear slightly more often than the biological predictors, but not by much.
nl_variable_importance <- varImp(knnTune,scale=F)
topPred_knn <- as.data.frame(varImp(knnTune$finalModel))
topPred_knn <- data.frame(overall = topPred_knn$Overall,
names = rownames(topPred_knn))
non_linear_top_predictors <- topPred_knn[order(topPred_knn$overall,decreasing = T),]
head(non_linear_top_predictors,10)
In both the linear and nonlinear regression models, 9 of the top 10 variables are the same. BiologicalMaterial04 appears in the nonlinear model's top 10 but not the linear model's, while BiologicalMaterial02 appears in the linear model's top 10 but not the nonlinear model's.
topPred_nonlinear <- head(topPred_knn[order(topPred_knn$overall,decreasing = T),],10)$names
topPred_linear <- c('ManufacturingProcess32','BiologicalMaterial06','ManufacturingProcess13',
'BiologicalMaterial03','ManufacturingProcess17','BiologicalMaterial12',
'BiologicalMaterial02','ManufacturingProcess09','ManufacturingProcess31','ManufacturingProcess06')
topPreds_in_Both <- as.data.frame(cbind(topPred_nonlinear,topPred_linear))
topPreds_in_Both
x <- 9
vi_df <- nl_variable_importance$importance %>%
data.frame() %>%
top_n(x)
vi_df$var <- rownames(vi_df)
df_top_x <- data[,names(imputed) %in% vi_df$var] %>%
bind_cols(Yield = data$Yield)
df_top_x %>%
gather(variable, value, -Yield) %>%
ggplot(aes(x = value, y = Yield,color=variable)) +
geom_point() +
facet_wrap(~variable, scales = 'free_x') +
labs(x = element_blank())
The predictor ManufacturingProcess32 is the top predictor in both the linear and nonlinear regression models. This predictor takes discrete, whole-number values. There is a strong linear relationship with Yield, with a correlation of about 0.61. Measurements of ManufacturingProcess32 would therefore be extremely helpful in trying to maximize the yield.
Looking at the results above, we see few differences from what was reported in 6.3. The features are strongly correlated with yield. For BiologicalMaterial 03, 06, and 12 as well as ManufacturingProcess 06, 09, and 32, increasing these values increases the yield. Conversely, the remaining top variables are negatively correlated, so minimizing them should increase the yield.
8.1. Recreate the simulated data from Exercise 7.2:
library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
library(randomForest)
library(caret)
set.seed(123)
model1 <- randomForest(y ~ ., data = simulated, importance = TRUE,ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
The predictors V6-V10 have variable importance scores near 0, which tells me these predictors were not significant in the model.
rfImp1
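The importance tables that follow also reference duplicate1 and duplicate2, two predictors that are highly correlated with V1; the chunk that created them is not shown above. A sketch of how such duplicates are typically added (the 0.1 noise level is an assumption on our part):
set.seed(200)
#add two copies of V1 plus a small amount of noise (assumed sd of 0.1)
simulated$duplicate1 <- simulated$V1 + rnorm(200) * 0.1
simulated$duplicate2 <- simulated$V1 + rnorm(200) * 0.1
cor(simulated$duplicate1, simulated$V1)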
library(party)
bagCtrl <- cforest_control(mtry = ncol(simulated) - 1)
baggedTree <- cforest(y ~ ., data = simulated, controls = bagCtrl)
Yes, the conditional inference trees show the same pattern. The importance of V1 is now similar to that of duplicate1 and duplicate2, each about one third of V1's original importance score, which was about 8.
Also, variables V6-V10 do not appear to be significant in the model which matches the traditional random forest model.
cfImp <- varImp(baggedTree)
cfImp
Let's first create a boosted tree model using the train function. When running a boosted tree model, we are able to tune the interaction depth, the number of trees, and the shrinkage.
The interaction depth is the maximum number of splits the tree can have or the depth of the trees.
The number of trees is how many trees will be fit or number of iterations.
The shrinkage is the learning rate.
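Putting these together, each boosting iteration adds a shrunken version of the newly fitted tree to the current model, so the shrinkage (learning rate) \(\lambda\) scales every update:
\(f_{m}(x) = f_{m-1}(x) + \lambda \, T_{m}(x)\)
where \(T_{m}\) is the tree fit to the current residuals at iteration m.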
set.seed(123)
gbmTune <- train(simulated[,-11],simulated[,c('y')],
method = "gbm",
## Dont print tons of notes to screen
verbose = FALSE)
Looking at the importance scores of each predictor from the boosted tree model, we see some differences. In general, V1-V5 are still more important than V6-V10, and V4 is still the most important predictor. The importance scores are scaled differently here, with the top predictor close to 100, while in the other models the unimportant predictors' scores were much closer to 0.
#variable importance for the boosted tree; the explicit caret:: prefix avoids masking by other attached packages
gbm_importance <- caret::varImp(gbmTune)
gbm_importance
Now let’s create a cubist model using the function train and method = cubist.
library(Cubist)
cubistMod <- train(simulated[,-11],
simulated[,c('y')],
method = "cubist")
Again, with the cubist model we see variables V1-V5 as the most important predictors. V1 takes the top spot, followed closely by V2, rather than V4 as in some of the other models. Again we see predictors V6-V10 with much lower importance scores that are close to 0.
varImp(cubistMod)
## cubist variable importance
##
## Overall
## V1 100.0000
## V2 90.0000
## V4 69.2857
## V3 58.5714
## V5 45.7143
## duplicate1 5.7143
## V6 3.5714
## duplicate2 0.7143
## V9 0.0000
## V8 0.0000
## V10 0.0000
## V7 0.0000
We will run a simulation with three predictors that have different granularities, generated with different standard deviations. The response variable is a function of two of the predictors plus some noise.
Below, a regression tree is fitted using rpart, and the variable importance is calculated using varImp.
library(rpart)
set.seed(123)
x1 <- rnorm(300, 30, 1)
x2 <- rnorm(300, 30, 2)
x3 <- rnorm(300, 30, 3)
z <- (.4 * x1) + (.1 * x3) + rnorm(300, 0, sqrt(1- (.16 + .04 + .01)))
y <- (1.5 * z) + 10
granular_simulated <- data.frame(x1 = x1, x2 = x2, x3 = x3, y=y)
rpartfit <- rpart(y ~., data = granular_simulated)
varimp_sim <- varImp(rpartfit)
varimp_sim
As we can see, the tree mostly uses x1 and x3 to split; x2 is not related to y, so its importance score is low. In this simulation x1 has the largest coefficient and the smallest standard deviation, so it is the strongest predictor.
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
This behavior is driven by two factors: the bagging fraction and the learning rate. In the gradient boosting algorithm, each stage's learner is fit to optimally follow the gradient. This greedy strategy can cause over-fitting and reduce generalization power. The learning rate guards against over-fitting by adding only a fraction of each stage's predicted value to the previously accumulated predictions. A higher learning rate increases the greediness and concentrates the importance on fewer variables.
Greedy models are less likely to select the optimal global model and are prone to over-fitting. Stochastic models reduce prediction variance. Therefore, the less greedy model on the left (with a 0.1 learning rate) that is also more random (due to only selecting 0.1 of the training set observations to propose the next tree in the expansion) would be more predictive of other samples.
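A quick way to see this effect is to fit gbm directly with the two extreme settings and compare the relative influence. This is only a minimal sketch, using the Friedman simulated data from above as a stand-in for the solubility data (the object names and the smaller n.minobsinnode are ours):
library(gbm)
set.seed(123)
#both tuning parameters set to 0.1 (left-hand plot of Fig. 8.24)
gbm_low <- gbm(y ~ ., data = simulated, distribution = "gaussian",
n.trees = 1000, shrinkage = 0.1, bag.fraction = 0.1, n.minobsinnode = 5)
#both tuning parameters set to 0.9 (right-hand plot of Fig. 8.24)
gbm_high <- gbm(y ~ ., data = simulated, distribution = "gaussian",
n.trees = 1000, shrinkage = 0.9, bag.fraction = 0.9, n.minobsinnode = 5)
#relative influence is spread across more predictors in the 0.1/0.1 model
summary(gbm_low, plotit = FALSE)
summary(gbm_high, plotit = FALSE)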
Interaction depth specifies the tree depth, i.e., the number of node splits allowed in each tree. Models with a lower learning rate see greater improvement from increasing interaction depth than models with a higher learning rate. Deeper trees allow more predictors to enter the model, which affects the model's greediness and in turn its predictive ability (RMSE). Therefore, increasing the interaction depth brings more predictors into the model and spreads the variable importance across more predictors, flattening the importance slope.
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
We decided to work with a few tree-based models: Random Forest (RF), Stochastic Gradient Boosting (SGB), Rule-Based Cubist (CUBE), Classification and Regression Tree (CART), and Conditional Inference Tree (CIT).
We use the same data as in problems 6.3 and 7.5, which already went through data imputation, pre-processing, and data splitting (training data 80%, test data 20%).
RF is implemented in the train() function using the rf method which implements Random Forest classification and regression using the randomForest package.
set.seed(123)
ctrl1 <- trainControl(method = "boot", number = 25)
tg1 <- expand.grid(mtry=seq(2,38,by=3))
rfModel <- train(x = train_pred, y = train_output, method = "rf",
metric = "Rsquared", tuneGrid = tg1, trControl = ctrl1)
SGB is implemented in the train() function using the gbm method, which fits generalized boosted regression models. The tuning parameters declared in the tuning grid are the variable interaction depth (interaction.depth), the number of fitted trees (n.trees), the learning rate (shrinkage), and a minimum of 10 observations in the trees' terminal nodes (n.minobsinnode).
set.seed(123)
ctrl2 <- trainControl(method = "boot", number = 25)
tg2 <- expand.grid(interaction.depth=seq(1,6,by=1), n.trees=c(25,50,100,200),
shrinkage=c(0.01,0.05,0.1,0.2), n.minobsinnode=10)
sgbModel <- train(x = train_pred, y = train_output, method = "gbm",
metric = "Rsquared", tuneGrid = tg2, trControl = ctrl2, verbose=F)
CUBE is implemented in the train() function using the cubist method, which fits a rule-based M5 model with additional corrections based on nearest neighbors in the training set. The committees parameter controls the boosting-like iterations and neighbors controls the number of training-set instances used to adjust the predictions.
set.seed(123)
ctrl3 <- trainControl(method = "boot", number = 25)
tg3 <- expand.grid(committees = c(1,5,10,20,50,100), neighbors = c(0,1,3,5,7))
cubeModel <- train(x = train_pred, y = train_output, method = "cubist",
metric = "Rsquared", tuneGrid = tg3, trControl = ctrl3)
CART is implemented in the train() function using the rpart2 method which tunes the model over the maximum depth (of the tree).
set.seed(123)
ctrl4 <- trainControl(method = "boot", number = 25)
tg4 <- expand.grid(maxdepth= seq(1,10,by=1))
cartModel <- train(x = train_pred, y = train_output, method = "rpart2",
metric = "Rsquared", tuneGrid = tg4, trControl = ctrl4)
CIT is implemented in the train() function using the ctree2 method, which tunes over the maximum tree depth (maxdepth) and the threshold of the splitting criterion (mincriterion).
set.seed(123)
ctrl5 <- trainControl(method = "cv")
tg5 <- expand.grid(maxdepth= seq(1,10,by=1), mincriterion=1-c(0.01, 0.05, 0.1))
citModel <- train(x = train_pred, y = train_output, method = "ctree2",
metric = "Rsquared", tuneGrid = tg5, trControl = ctrl5)
metrics <- function(models) {
RMSE = min(models$results$RMSE)
Rsquared = max(models$results$Rsquared)
MAE = min(models$results$MAE)
return(cbind(RMSE, Rsquared, MAE)) }
data.frame(rbind(metrics(rfModel), metrics(sgbModel),
metrics(cubeModel), metrics(cartModel), metrics(citModel)),
row.names = c("RF","SGB","CUBE","CART","CIT"))
fcastRf <- predict(rfModel, newdata = test_pred)
fcastSgb <- predict(sgbModel, newdata = test_pred)
fcastCube <- predict(cubeModel, newdata = test_pred)
fcastCart <- predict(cartModel, newdata = test_pred)
fcastCit <- predict(citModel, newdata = test_pred)
data.frame(rbind(
postResample(pred = fcastRf, obs = test_output),
postResample(pred = fcastSgb, obs = test_output),
postResample(pred = fcastCube, obs = test_output),
postResample(pred = fcastCart, obs = test_output),
postResample(pred = fcastCit, obs = test_output)
),
row.names = c("RF","SGB","CUBE","CART","CIT")
)
The random forest performed the best since it had the highest R^2 value and lower RMSE and MAE values.
(b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
The optimal tree-based model is the random forest; we can see its most important predictors by using the varImp function.
plot(varImp(rfModel), top=20)
Manufacturing variables dominate the list: of the 20 most important predictors, 8 are biological and the rest are manufacturing process variables.
All models agree that ManufacturingProcess32 is the most important predictor. BiologicalMaterial03 and BiologicalMaterial06 were very important in all three models.
topPred_nonlinear <- head(topPred_knn[order(topPred_knn$overall,decreasing = T),],10)$names
topPred_linear <- c('ManufacturingProcess32','BiologicalMaterial06','ManufacturingProcess13',
'BiologicalMaterial03','ManufacturingProcess17','BiologicalMaterial12',
'BiologicalMaterial02','ManufacturingProcess09','ManufacturingProcess31','ManufacturingProcess06')
topTree <- c('ManufacturingProcess32','BiologicalMaterial03','BiologicalMaterial06',
'ManufacturingProcess13','BiologicalMaterial12','ManufacturingProcess31',
'ManufacturingProcess06','ManufacturingProcess09','BiologicalMaterial02','BiologicalMaterial04')
topPreds_in_Both <- as.data.frame(cbind(topPred_nonlinear,topPred_linear,topTree))
topPreds_in_Both
getTree(rfModel$finalModel,labelVar=TRUE)
We worked on the recommender system in Python and R and are going to compare the results from each.
The Groceries Data Set contains a collection of receipts, with each line representing one receipt and the items purchased. Each line is called a transaction and each column in a row represents an item. The data set is in GroceryDataSet.csv (a comma-separated file). Your assignment is to use R to mine the data for association rules. You should report support, confidence, and lift, and your top 10 rules by lift.
We'll start by importing the libraries necessary for our analysis.
library(arules)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
## Attaching package: 'arules'
## The following object is masked from 'package:modeltools':
##
## info
## The following object is masked from 'package:kernlab':
##
## size
## The following object is masked from 'package:dplyr':
##
## recode
## The following objects are masked from 'package:base':
##
## abbreviate, write
library(pander)
library(arulesViz)
library(fpp2)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
## ── Attaching packages ────────────────────────────────────────────── fpp2 2.4 ──
## ✓ forecast 8.15 ✓ expsmooth 2.3
## ✓ fma 2.4
## ── Conflicts ───────────────────────────────────────────────── fpp2_conflicts ──
## x kernlab::alpha() masks ggplot2::alpha()
## x kernlab::cross() masks purrr::cross()
## x tidyr::extract() masks magrittr::extract()
## x purrr::lift() masks caret::lift()
## x randomForest::margin() masks ggplot2::margin()
## x purrr::set_names() masks magrittr::set_names()
library(RColorBrewer)
Load the CSV file into R and let's get into the EDA.
The purpose of market basket analysis is to let retailers and businesses analyze which items customers buy together and use that information to make profitable decisions.
Load the data from csv using read.transactions() from arules package.
grocery_df <- read.transactions('https://raw.githubusercontent.com/SubhalaxmiRout002/DATA624/main/Week4/GroceryDataSet.csv', sep = ",", format = "basket")
grocery_df
## transactions in sparse format with
## 9835 transactions (rows) and
## 169 items (columns)
Use summary() to get an overview of the data.
summary(grocery_df)
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 169 columns (items) and a density of 0.02609146
##
## most frequent items:
## whole milk other vegetables rolls/buns soda
## 2513 1903 1809 1715
## yogurt (Other)
## 1372 34055
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 2159 1643 1299 1005 855 645 545 438 350 246 182 117 78 77 55 46
## 17 18 19 20 21 22 23 24 26 27 28 29 32
## 29 14 14 9 11 4 6 1 1 1 1 3 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 4.409 6.000 32.000
##
## includes extended item information - examples:
## labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3 baby cosmetics
The summary gives the number of rows and columns in the data. It shows the most frequently purchased items: whole milk is at the top, other vegetables is second, rolls/buns is third, and so on. Below that is the basket-size (transaction length) distribution: 2,159 transactions contain exactly 1 item, 1,643 contain 2 items, and so on, up to a maximum basket size of 32 items. Adding up these frequencies gives the total number of transactions, 9,835. The mean basket size is about 4.4, which means on average there are roughly 4 items per basket.
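As a quick sanity check of those numbers, arules::size() returns the number of items in each transaction, so we can recompute the distribution, the total transaction count, and the mean basket size (the object name is ours):
basket_sizes <- size(grocery_df) #items per transaction
sum(table(basket_sizes)) #9835 transactions in total
mean(basket_sizes) #about 4.4 items per basket on average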
In itemFrequencyPlot(grocery_df, topN = 10, type = "absolute"), the first argument is the transaction object to be plotted, grocery_df. topN plots the top N highest-frequency items. type can be "absolute" or "relative": absolute plots the raw purchase counts of each item independently, while relative plots how often each item appears compared to the others.
itemFrequencyPlot(grocery_df, topN = 10, type="absolute", col=brewer.pal(8,'Pastel2'), main = 'Top 10 items purchased')
The above plot shows the same top 5 items as we get from summary():
1 whole milk
2 other vegetables
3 rolls/buns
4 soda
5 yogurt
bottom_10 <- head(sort(itemFrequency(grocery_df, type="absolute"), decreasing=FALSE), n=10)
par(mar=c(10.5,3,2, 0.3))
barplot(bottom_10, ylab = "Frequency", main = "Bottom 10 items purchased", col=brewer.pal(8,'Pastel2'), las = 2)
The above plot shows the Bottom 10 items purchased.
Items distribution in basket
hist(size(grocery_df), breaks = 0:35, xaxt="n", ylim=c(0,2200),
main = "Number of items in particular baskets", xlab = "Items", col = brewer.pal(8,'Pastel2'))
axis(1, at=seq(0,33,by=1), cex.axis=0.8)
We can see that the number of baskets decreases as the number of items per basket increases.
In this section we mine association rules with the APRIORI algorithm and interpret how it works.
The next step is to mine the rules using the apriori() function from the arules package.
# Min Support as 0.001, confidence as 0.8.
association_rules <- apriori(grocery_df, parameter = list(supp=0.001, conf=0.8,maxlen=10), control=list(verbose=F))
The apriori() function takes the transaction object on which mining is to be applied. The parameter argument sets the minimum support and minimum confidence; the defaults are a minimum support of 0.1, a minimum confidence of 0.8, and a maximum of 10 items per rule (maxlen).
summary(association_rules)
## set of 410 rules
##
## rule length distribution (lhs + rhs):sizes
## 3 4 5 6
## 29 229 140 12
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 4.000 4.000 4.329 5.000 6.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.001017 Min. :0.8000 Min. :0.001017 Min. : 3.131
## 1st Qu.:0.001017 1st Qu.:0.8333 1st Qu.:0.001220 1st Qu.: 3.312
## Median :0.001220 Median :0.8462 Median :0.001322 Median : 3.588
## Mean :0.001247 Mean :0.8663 Mean :0.001449 Mean : 3.951
## 3rd Qu.:0.001322 3rd Qu.:0.9091 3rd Qu.:0.001627 3rd Qu.: 4.341
## Max. :0.003152 Max. :1.0000 Max. :0.003559 Max. :11.235
## count
## Min. :10.00
## 1st Qu.:10.00
## Median :12.00
## Mean :12.27
## 3rd Qu.:13.00
## Max. :31.00
##
## mining info:
## data ntransactions support confidence
## grocery_df 9835 0.001 0.8
The above summary() shows the following:
* Parameter specification: min_sup = 0.001 and min_confidence = 0.8, with a maximum of 10 items per rule.
* Total number of rules: a set of 410 rules.
* Distribution of rule length: rules of 4 items are the most common (229) and rules of 6 items the least common (12).
* Summary of quality measures: minimum and maximum values for support, confidence, and lift.
* Mining info: the data, support, and confidence we provided to the algorithm.
Since there are 410 rules, let’s print only top 10:
inspect(association_rules[1:10])
## lhs rhs support confidence coverage lift count
## [1] {liquor,
## red/blush wine} => {bottled beer} 0.001931876 0.9047619 0.002135231 11.235269 19
## [2] {cereals,
## curd} => {whole milk} 0.001016777 0.9090909 0.001118454 3.557863 10
## [3] {cereals,
## yogurt} => {whole milk} 0.001728521 0.8095238 0.002135231 3.168192 17
## [4] {butter,
## jam} => {whole milk} 0.001016777 0.8333333 0.001220132 3.261374 10
## [5] {bottled beer,
## soups} => {whole milk} 0.001118454 0.9166667 0.001220132 3.587512 11
## [6] {house keeping products,
## napkins} => {whole milk} 0.001321810 0.8125000 0.001626843 3.179840 13
## [7] {house keeping products,
## whipped/sour cream} => {whole milk} 0.001220132 0.9230769 0.001321810 3.612599 12
## [8] {pastry,
## sweet spreads} => {whole milk} 0.001016777 0.9090909 0.001118454 3.557863 10
## [9] {curd,
## turkey} => {other vegetables} 0.001220132 0.8000000 0.001525165 4.134524 12
## [10] {rice,
## sugar} => {whole milk} 0.001220132 1.0000000 0.001220132 3.913649 12
The rule table above shows lhs, rhs, support, confidence, coverage, lift, and count. Let's review what these terms mean (formulas are given after the list):
* lhs: the items already present in the basket
* rhs: the item more likely to be bought together with the lhs
* support: the fraction of transactions that contain the item-set
* confidence: for a rule A => B, the proportion of transactions containing A that also contain B
* lift: for a rule A => B, how much more often A and B occur together than would be expected if they were independent (lift > 1 indicates a positive association)
* count: the number of transactions containing the item-set
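For a rule A => B mined from N transactions, these measures can be written as:
\(\text{support}(A \Rightarrow B) = \frac{\text{count}(A \cup B)}{N}, \quad \text{confidence}(A \Rightarrow B) = \frac{\text{support}(A \cup B)}{\text{support}(A)}, \quad \text{lift}(A \Rightarrow B) = \frac{\text{confidence}(A \Rightarrow B)}{\text{support}(B)}\)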
Using the above output, we can make observations such as:
* 100% of the customers who bought {rice, sugar} also bought {whole milk}.
* 92% of the customers who bought {house keeping products,whipped/sour cream} also bought {whole milk}.
We can identify rules that are subsets of larger rules (redundant rules). Use the code below to find such rules:
# get subset rules in vector
subset_rules <- which(colSums(is.subset(association_rules, association_rules)) > 1)
length(subset_rules)
## [1] 91
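To actually drop these redundant rules from the rule set, we can index them out (a small sketch; pruned_rules is our object name):
pruned_rules <- association_rules[-subset_rules] #keep only the non-redundant rules
length(pruned_rules) #410 - 91 = 319 rules remain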
A straightforward visualization of association rules is a scatter plot produced with plot() from the arulesViz package. It puts support and confidence on the axes; a third measure, lift, is used by default to color the points (grey levels).
# Filter rules with confidence greater than 0.4 or 40%
subRules<-association_rules[quality(association_rules)$confidence>0.4]
#Plot SubRules
plot(subRules)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
The plot above shows that rules with high lift tend to have low support. There is one rule in the middle with moderate support and high lift. The grey points have higher support and lower lift.
Another way to visualize the rules is the individual rule representation, obtained by passing method = "two-key plot". The two-key plot uses support and confidence on the x and y axes respectively and colors the points by order, the number of items in the rule.
plot(subRules,method="two-key plot")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
As we saw in the rule summary, rules with 4 items are the most common and rules with 6 items the least common, which we can see in the plot.
Graph plots are a great way to visualize rules but tend to become congested as the number of rules increases, so it is better to visualize a smaller number of rules with graph-based visualizations. Here we will look at the top 10 rules from subRules with the highest confidence.
Each interactive graph has a dropdown where you can select specific rules and view the corresponding subgraph.
top10subRules <- head(subRules, n = 10, by = "confidence")
Below we plot an interactive graph, using the engine = "htmlwidget" parameter in plot().
set.seed(123)
plot(top10subRules, method = "graph", engine = "htmlwidget")
In the graph above, a darker red color indicates higher lift and higher confidence. Rule 4 has a large node with a light red color, which suggests that someone who buys {flour, root vegetables, whipped/sour cream} may also buy {whole milk}. Rule 6 has high confidence and high lift, meaning there is a high chance that a person who buys {citrus fruit, root vegetables, soft cheese} will also buy {other vegetables}.
Below is the graph for rules with whole milk on the RHS.
set.seed(123)
#the subset step below was not shown in the original code; this reconstruction is an assumption
whmilk_association_rules <- subset(association_rules, subset = rhs %in% "whole milk")
plot(head(whmilk_association_rules, n = 10), method = "graph", engine = "htmlwidget")
Looking at Rule 6 and Rule 8, both rules have high lift and high confidence. Rule 2 and Rule 9 have somewhat lower confidence and lift.
Below is the graph for rules with other vegetables on the RHS.
set.seed(123)
#as above, this subset step is an assumed reconstruction
otherveg_association_rules <- subset(association_rules, subset = rhs %in% "other vegetables")
plot(head(otherveg_association_rules, n = 10), method = "graph", engine = "htmlwidget")
In the plot above, Rule 5 and Rule 7 have higher confidence and higher lift than the other rules. Considering Rule 7, a person who buys {hard cheese, oil} is more likely to also buy {other vegetables}.
This representation is also called a parallel coordinates plot. It is useful for visualizing which items bought together lead to which purchases. As mentioned above, the RHS is the consequent, or the item we propose the customer will buy; the positions on the LHS show the antecedent items, where 2 is the most recent addition to our basket and 1 is the item we previously had.
set.seed(123)
subRules2<-head(subRules, n=10, by="lift")
plot(subRules2, method="paracoord")
In the plot above, the wide dark arrow means that a person who buys {liquor, red/blush wine} is very likely to also buy {bottled beer}. The line next to the dark one indicates that a person who buys {grapes, fruits/vegetables, citrus fruits} is less strongly associated with buying {pip fruit}.
Since association rules are very useful for setting the store's strategy, we draw a few conclusions from the above analysis.
Concerning the least-sold products (baby food, sound storage medium, preservation products, kitchen utensil, bags, frozen chicken, baby cosmetics, toilet cleaner, and salad dressing), the business should consider either withdrawing them from sale or setting up a sales promotion that combines these least-sold products with more popular ones (e.g., vegetables with whole milk, soda, yogurt, etc.).
Concerning the association rules, the grocery store can place other vegetables near fruits, root vegetables, cheese and place whole milk near rice, cereal, yogurt, and sour cream.