Exercise 7.2

7.2. Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data: \(y = 10\sin(\pi x_{1}x_{2}) + 20(x_{3} - 0.5)^2 + 10x_{4} + 5x_{5} + N(0, \sigma^2)\), where the x values are random variables uniformly distributed on [0, 1] (the simulation also creates 5 additional non-informative variables). The package mlbench contains a function called mlbench.friedman1 that simulates these data:

library(mlbench)
library(caret)
## Loading required package: lattice
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
## We convert the 'x' data from a matrix to a data frame
## One reason is that this will give the columns names. 
trainingData$x <- data.frame(trainingData$x)
## Look at the data using
obj <- featurePlot(trainingData$x, trainingData$y)
print(obj)

## or other methods.

## This creates a list with a vector 'y' and a matrix
## of predictors 'x'. Also simulate a large test set to
## estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)  
testData$x <- data.frame(testData$x)

Tune several models on these data. For example:

kNN Model

library(caret)
set.seed(123)
knnModel <- train(x = trainingData$x,
                  y = trainingData$y,
                  # no 'method' argument is given here, so caret's train() falls back on
                  # its default, method = "rf"; the printed summary below is therefore a
                  # random forest. Adding method = "knn" would fit k-nearest neighbors.
                  preProc = c("center", "scale"),
                  tuneLength = 10)
## note: only 9 unique complexity parameters in default grid. Truncating the grid to 9 .
knnModel
## Random Forest 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE     
##    2    2.942963  0.7786600  2.409166
##    3    2.757348  0.7780577  2.257721
##    4    2.668106  0.7709207  2.184489
##    5    2.638548  0.7580862  2.165022
##    6    2.623661  0.7511115  2.152904
##    7    2.626736  0.7401869  2.155715
##    8    2.640345  0.7321122  2.162547
##    9    2.657499  0.7241281  2.176910
##   10    2.674946  0.7177772  2.190956
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 6.
knnPred <- predict(knnModel, newdata = testData$x)
## The function 'postResample' can be used to get the test set
## performance values
kNN_Results <- postResample(pred = knnPred, obs = testData$y)
kNN_Results
##      RMSE  Rsquared       MAE 
## 2.4381466 0.7992296 1.9308194

MARS Model

Now we fit a MARS model using the earth package. When training a MARS model you can set nprune, the maximum number of terms allowed in the final (pruned) model. Here we allow up to 20 terms, which is more than the number of X columns, so every predictor can enter and there is still room for interaction terms. You can also set degree, the maximum degree of interaction, i.e. how many hinge functions may be multiplied together in a single term. We allow degrees 1 through 3.

library(earth)
## Loading required package: Formula
## Loading required package: plotmo
## Loading required package: plotrix
## Loading required package: TeachingDemos
set.seed(123)

ctrl <- trainControl(method = "cv", number = 10)
#set values for nprune and degree
parameter_Grid <- expand.grid(nprune = 1:20, degree = 1:3)

#tune model
MARSModel <- train(trainingData$x,
                  trainingData$y,
                  method='earth',
                  metric='RMSE',
                  tuneGrid=parameter_Grid,
                  trControl = ctrl)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
summary(MARSModel)
## Call: earth(x=data.frame[200,10], y=c(18.46,16.1,17...), keepxy=TRUE, degree=2,
##             nprune=14)
## 
##                                 coefficients
## (Intercept)                        21.905319
## h(0.621722-X1)                    -15.726181
## h(X1-0.621722)                      9.234027
## h(0.601063-X2)                    -18.253527
## h(X2-0.601063)                     10.448545
## h(0.447442-X3)                      9.700589
## h(X3-0.606015)                     12.674694
## h(0.734892-X4)                     -9.863526
## h(X4-0.734892)                     10.297964
## h(0.850094-X5)                     -5.324175
## h(0.218266-X1) * h(X2-0.601063)   -55.358726
## h(X1-0.218266) * h(X2-0.601063)   -29.100250
## h(X1-0.621722) * h(X2-0.295997)   -26.833129
## h(0.649253-X1) * h(0.601063-X2)    27.120721
## 
## Selected 14 of 18 terms, and 5 of 10 predictors (nprune=14)
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6-unused, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 9 4
## GCV 1.62945    RSS 225.8601    GRSq 0.9338437    RSq 0.953688
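
Each h(·) term printed above is a hinge function, h(u) = max(0, u). As a small illustration (a sketch; the knot and coefficient are copied from the first term of the summary above), the contribution of h(0.621722 - X1) can be computed directly:

# hinge function used by earth/MARS: zero on one side of the knot, linear on the other
h <- function(u) pmax(0, u)

# contribution of the term h(0.621722 - X1) over a grid of X1 values
x1 <- seq(0, 1, by = 0.1)
-15.726181 * h(0.621722 - x1)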

Here we visualize the resampling profile of the candidate models from the previous step. The lowest cross-validated RMSE belongs to the model with degree = 2 and 14 terms, which is the model caret selects (and matches the final model shown in the summary above).

ggplot(MARSModel)

Now we predict on the test data using our MARS model and review the results.

mars_predictions <- predict(MARSModel$finalModel,testData$x)

MARS_Results <- postResample(pred = mars_predictions, obs = testData$y)
MARS_Results
##      RMSE  Rsquared       MAE 
## 1.1722635 0.9448890 0.9324923

Neural Network Model

First we check for predictors whose between-predictor correlations are too high (cutoff of 0.75). There are none, so no predictors need to be removed.

high_corr <- findCorrelation(cor(trainingData$x),cutoff = 0.75)
high_corr
## integer(0)

Now we tune the model with the train function, as before. The tuning parameters are the number of hidden units and the amount of weight decay.

neural_Grid <- expand.grid(.decay = c(0,0.01,0.1),
                           .size = c(1:10),
                           .bag = FALSE)

set.seed(123)

nnetModel <- train(trainingData$x,trainingData$y,
                  method = "avNNet",
                  tuneGrid = neural_Grid,
                  trControl = ctrl,
                  #center & scale the data
                  preProc = c("center","scale"),
                  linout = TRUE,
                  trace = FALSE, 
                  #MaxNWts = 10 * (ncol(trainingData$x) + 1) + 10 + 1,
                  maxit = 500)
## Warning: executing %dopar% sequentially: no parallel backend registered
plot(nnetModel)

The best model used a weight decay of 0.1 and 10 hidden units.
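
The selected tuning values can also be pulled directly from the train object instead of reading them off the plot, for example:

# tuning parameter combination (size, decay, bag) chosen by resampling
nnetModel$bestTune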

Now we predict on the test data using the neural network model and review the results.

# note: predicting with the finalModel object bypasses caret's centering/scaling step;
# predict(nnetModel, testData$x) would apply the preProc transformation automatically
nn_predictions <- predict(nnetModel$finalModel,testData$x)

nn_Results <- postResample(pred = nn_predictions, obs = testData$y)
nn_Results
##      RMSE  Rsquared       MAE 
## 5.5708330 0.6800083 4.5994262

Which models appear to give the best performance? Does MARS select the informative predictors (those named X1–X5)?

Model Summary

Comparing the three models, MARS performs best: its RMSE and MAE are the lowest and its R^2 is the highest, well ahead of both kNN and the neural network.

results <- data.frame(rbind(nn_Results,kNN_Results,MARS_Results))
results
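
To make the comparison easier to scan, the table can be sorted so the lowest-RMSE model appears first:

# order the comparison table by test-set RMSE
results[order(results$RMSE), ]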

MARS most important predictors

Yes, MARS selects the informative predictors: the final model uses only X1-X5, while the uninformative X6-X10 are left unused, as the importance ranking below shows.

varImp(MARSModel$finalModel)
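
The earth package also reports its own variable importance measure through evimp(), which can serve as a cross-check on varImp (a sketch; it should show the same X1-X5 ranking):

# earth's built-in importance measure (nsubsets, GCV, and RSS criteria)
evimp(MARSModel$finalModel)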

Exercise 7.5

Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.

This code, carried over from the week 6 homework, imputes the missing values with k-nearest neighbors and splits the data into training and test sets (80/20).

library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
data <- ChemicalManufacturingProcess
imp <- preProcess(data, method = "knnImpute")
imputed <- predict(imp,data)

set.seed(123)
ids <- sample(nrow(imputed),nrow(imputed)*0.8)

train_pred <- imputed[ids,-1]
train_output <- imputed[ids,1]

test_pred <- imputed[-ids,-1]
test_output <- imputed[-ids,1]
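
As a quick sanity check (a sketch), we can confirm that the imputation left no missing values and look at the sizes of the two splits:

# verify the imputation and inspect the 80/20 split
sum(is.na(imputed))
dim(train_pred)
dim(test_pred)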

Now we will use several nonlinear regression models to see which performs best on our data.

Neural Network

neural_Grid <- expand.grid(.decay = c(0,0.01,0.1),
                           .size = c(1:10),
                           .bag = FALSE)

set.seed(123)

nnetTune <- train(train_pred,train_output,
                  method = "avNNet",
                  tuneGrid = neural_Grid,
                  trControl = ctrl,
                  #center & scale the data
                  preProc = c("center","scale"),
                  linout = TRUE,
                  trace = FALSE, 
                  maxit = 500)
nn_predictions <- predict(nnetTune$finalModel,test_pred)

nn_results <- postResample(pred = nn_predictions, obs = test_output)
nn_results
##      RMSE  Rsquared       MAE 
## 0.6350121 0.6818911 0.4776292

SVM

library(kernlab)
## 
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
## 
##     alpha
svmTune <- train(train_pred, train_output, 
             method = "svmRadial",
             preProc = c("center", "scale"), 
             tuneLength = 14,
             trControl = trainControl(method = "cv"))
svm_predictions <- predict(svmTune$finalModel,test_pred)

SVM_results <- postResample(pred = svm_predictions, obs = test_output)
SVM_results
##      RMSE  Rsquared       MAE 
## 0.6522754 0.7663851 0.4791971
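
The kernel parameter and cost selected by cross-validation can be inspected with, for example:

# sigma and C chosen for the radial-basis SVM
svmTune$bestTune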

k-Nearest Neighbors

library(caret)
knnTune <- train(x = train_pred,
                  y = train_output,
                  # as above, no 'method' argument is supplied, so caret defaults to
                  # method = "rf" and the output below describes a random forest
                  preProc = c("center", "scale"),
                  tuneLength = 10)
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
knnTune
## Random Forest 
## 
## 140 samples
##  57 predictor
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 140, 140, 140, 140, 140, 140, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE       Rsquared   MAE      
##    2    0.7235649  0.5378986  0.5712462
##    8    0.6891294  0.5566480  0.5384496
##   14    0.6857843  0.5507213  0.5341980
##   20    0.6866935  0.5436037  0.5332588
##   26    0.6881685  0.5367398  0.5337969
##   32    0.6899401  0.5308883  0.5350922
##   38    0.6957190  0.5192489  0.5385073
##   44    0.7004385  0.5107343  0.5428744
##   50    0.7048170  0.5027933  0.5469711
##   57    0.7104203  0.4917456  0.5531747
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 14.
knnPred <- predict(knnTune, newdata = test_pred)

knn_Results <- postResample(pred = knnPred, obs = test_output)
knn_Results
##      RMSE  Rsquared       MAE 
## 0.6466251 0.7747683 0.5041527

MARS

ctrl <- trainControl(method = "cv", number = 10)
parameter_Grid <- expand.grid(nprune = 1:20, degree = 1:3)

MARS_Fit <- train(train_pred,
                  train_output,
                  method='earth',
                  metric='RMSE',
                  tuneGrid=parameter_Grid,
                  trControl = ctrl)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
mars_predictions <- predict(MARS_Fit$finalModel,test_pred)

MARS_results <- postResample(pred = mars_predictions, obs = test_output)
MARS_results
##      RMSE  Rsquared       MAE 
## 0.6441535 0.6703288 0.5029883
  1. Which nonlinear regression model gives the optimal resampling and test set performance?

The k-nearest neighbors model (knn_Results) performs best on the test set: it has the highest R^2, and its RMSE and MAE are within a few hundredths of the lowest values among the four models.

results <- data.frame(rbind(knn_Results,MARS_results,SVM_results,nn_results))
results
  2. Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?

In the optimal nonlinear model, the manufacturing process predictors appear slightly more often near the top of the importance list than the biological predictors, but neither group clearly dominates (see the tally after the importance table below).

topPred_knn <- as.data.frame(varImp(knnTune$finalModel))
topPred_knn <- data.frame(overall = topPred_knn$Overall,
                          names   = rownames(topPred_knn))
topPred_knn[order(topPred_knn$overall,decreasing = T),]
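
To quantify "seen more but not by much", we can tally how many of the top ten predictors come from each group (a sketch built on the importance table above):

# count biological vs. manufacturing process predictors among the top 10
top10 <- head(topPred_knn[order(topPred_knn$overall, decreasing = TRUE), ], 10)
table(ifelse(grepl("^Biological", top10$names), "Biological", "Process"))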

Nine of the ten top predictors are shared by the linear and nonlinear models (checked programmatically below). BiologicalMaterial04 appears only in the nonlinear model's top ten, while BiologicalMaterial02 appears only in the linear model's top ten.

topPred_nonlinear <- head(topPred_knn[order(topPred_knn$overall,decreasing = T),],10)$names
topPred_linear <- c('ManufacturingProcess32','BiologicalMaterial06','ManufacturingProcess13',
                    'BiologicalMaterial03','ManufacturingProcess17','BiologicalMaterial12',
                    'BiologicalMaterial02','ManufacturingProcess09','ManufacturingProcess31','ManufacturingProcess06')

topPreds_in_Both <- as.data.frame(cbind(topPred_nonlinear,topPred_linear))
topPreds_in_Both
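
The overlap between the two top-ten lists can also be checked directly:

# shared predictors and those unique to each model's top ten
intersect(as.character(topPred_nonlinear), topPred_linear)
setdiff(as.character(topPred_nonlinear), topPred_linear)
setdiff(topPred_linear, as.character(topPred_nonlinear))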
  3. Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?

ManufacturingProcess32 is the top predictor in both the linear and nonlinear models. It takes discrete, whole-number values and has a strong positive linear relationship with Yield (correlation of about 0.61), so its measurements would be very helpful when trying to maximize Yield.

ggplot(data=data,aes(x=ManufacturingProcess32,y=Yield)) + geom_point()

cor(data$ManufacturingProcess32,data$Yield)
## [1] 0.6083321

The linear relationship between BiologicalMaterial03 and Yield is weaker but still present, and there do not appear to be any major outliers in this predictor. It would still be quite useful for predicting Yield.

ggplot(data=data,aes(x=BiologicalMaterial03,y=Yield)) + geom_point()

boxplot(data$BiologicalMaterial03)

Chapter 8

Exercise 8.1

8.1. Recreate the simulated data from Exercise 7.2:

library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1) 
simulated <- cbind(simulated$x, simulated$y) 
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
  1. Fit a random forest model to all of the predictors, then estimate the variable importance scores:
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(caret)
set.seed(123)
model1 <- randomForest(y ~ ., data = simulated, importance = TRUE,ntree = 1000) 
rfImp1 <- varImp(model1, scale = FALSE)

Did the random forest model significantly use the uninformative predictors (V6-V10)?

The importance scores for V6-V10 are near zero, indicating the random forest made little use of these predictors.

rfImp1
  2. Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9496502

Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

After adding the predictor that is highly correlated with V1 and rerunning the model, the importance score for V1 decreased from 8.86 to 5.01, while the new variable, duplicate1, received an importance score of 4.32. This is expected: when a highly correlated copy of a predictor is added, the importance that originally went to that predictor is split between the two, so each ends up with roughly half of the original score.

This behavior is discussed on page 202 of Applied Predictive Modeling by Max Kuhn and Kjell Johnson.

set.seed(123)
model2 <- randomForest(y ~ ., data = simulated, importance = TRUE,ntree = 1000) 
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2

Now we add a second variable that is highly correlated with V1, giving three predictors (V1, duplicate1, and duplicate2) that are all highly correlated with one another.

simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate2, simulated$V1)
## [1] 0.9433552

After fitting a third model that includes both correlated copies, the importance score for V1 drops further: from 8.86 in the original model, to 5.01 with one copy, to 3.87 here. The two duplicates have comparable scores (duplicate1: 2.76, duplicate2: 3.94).

The book notes that when several highly correlated predictors are present, the importance that would have gone to a single predictor is diluted across the group, so with three correlated predictors each receives roughly one third of the original score. Our results follow this pattern fairly closely (a quick check appears after the importance table below).

set.seed(123)
model3 <- randomForest(y ~ ., data = simulated, importance = TRUE,ntree = 1000) 
rfImp3 <- varImp(model3, scale = FALSE)
rfImp3
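
One way to see the dilution (a sketch using the importance tables above): the combined importance of V1 and its two copies in the third model should be roughly comparable to V1's importance in the original model.

# total importance spread across V1 and its two correlated copies
sum(rfImp3[c("V1", "duplicate1", "duplicate2"), "Overall"])
# importance of V1 in the model without any duplicates
rfImp1["V1", "Overall"]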
  3. Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## 
## Attaching package: 'modeltools'
## The following object is masked from 'package:kernlab':
## 
##     prior
## Loading required package: strucchange
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
bagCtrl <- cforest_control(mtry = ncol(simulated) - 1)
baggedTree <- cforest(y ~ ., data = simulated, controls = bagCtrl)

Yes, the conditional inference trees show the same pattern. V1's importance is similar to that of duplicate1 and duplicate2, each roughly one third of V1's original importance of about 8.86.

Variables V6-V10 again have negligible importance, which also matches the traditional random forest model.

cfImp <- varImp(baggedTree)
cfImp
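
Since the question specifically mentions the conditional argument, the importances can also be computed with party's own varimp(), toggling between the traditional and the Strobl et al. (2007) conditional versions (a sketch; the conditional version is noticeably slower):

# traditional permutation importance for the cforest model
party::varimp(baggedTree)
# conditional importance, which adjusts for correlated predictors
party::varimp(baggedTree, conditional = TRUE)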
  4. Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

Let's start by fitting a boosted tree model with the train function. For boosted trees we can tune the interaction depth, the number of trees, and the shrinkage.

The interaction depth is the maximum depth of each tree, which controls how many splits (and therefore how much interaction between predictors) a single tree allows.

The number of trees is how many trees are fit, i.e. the number of boosting iterations.

The shrinkage is the learning rate.

set.seed(123)

gbmTune <- train(simulated[,-11],simulated[,c('y')],
                 method = "gbm",
                ## Dont print tons of notes to screen 
                verbose = FALSE)
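
The call above relies on caret's default tuning grid. An explicit grid over the three parameters described earlier could look like the following sketch (the grid values are illustrative, not the ones used for the results discussed below):

# illustrative tuning grid for interaction depth, number of trees, and shrinkage
gbmGrid <- expand.grid(interaction.depth = c(1, 3, 5),
                       n.trees = seq(100, 1000, by = 300),
                       shrinkage = c(0.01, 0.1),
                       n.minobsinnode = 10)
set.seed(123)
gbmTune2 <- train(simulated[,-11], simulated$y,
                  method = "gbm",
                  tuneGrid = gbmGrid,
                  verbose = FALSE)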

Looking at the importance scores from the boosted tree model we see some differences. In general V1-V5 are still more important than V6-V10, and V4 remains the most important predictor. The scores for this model are scaled to a 0-100 range (varImp's default), whereas the earlier random forest scores were left unscaled, which is why the magnitudes look so different.

gbm_importance <- varImp(gbmTune)
gbm_importance

Now let's create a Cubist model using the train function with method = "cubist".

library(Cubist)

cubistMod <- train(simulated[,-11],
                   simulated[,c('y')],
                   method = "cubist")

The Cubist model again ranks V1-V5 as the most important predictors, though here V1 takes the top spot, followed by V2, so the ordering differs somewhat from the earlier models. Predictors V6-V10 (and the two duplicates) again have importance scores at or near zero.

varImp(cubistMod)
## cubist variable importance
## 
##             Overall
## V1         100.0000
## V2          90.0000
## V4          69.2857
## V3          58.5714
## V5          45.7143
## duplicate1   5.7143
## V6           3.5714
## duplicate2   0.7143
## V9           0.0000
## V8           0.0000
## V10          0.0000
## V7           0.0000

Exercise 8.2

Use a simulation to show tree bias with different granularities.

#
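
One possible simulation (a sketch, not a full answer): generate a response that is unrelated to either predictor, give one predictor many distinct values and the other only a few, fit a single regression tree repeatedly, and record which predictor is chosen for the first split. With no real signal, an unbiased procedure would pick each predictor about half of the time, but the selection tends to favor the high-granularity predictor.

library(rpart)

set.seed(123)
n_sims <- 500
n <- 200
first_split <- character(n_sims)

for (i in seq_len(n_sims)) {
  x_fine   <- rnorm(n)                        # many distinct values
  x_coarse <- sample(1:4, n, replace = TRUE)  # only four distinct values
  y        <- rnorm(n)                        # response unrelated to both predictors
  dat <- data.frame(x_fine, x_coarse, y)
  # force exactly one split so we can see which predictor is selected at the root
  fit <- rpart(y ~ ., data = dat,
               control = rpart.control(cp = 0, maxdepth = 1))
  first_split[i] <- as.character(fit$frame$var[1])
}

table(first_split)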

Exercise 8.3

In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

  1. Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?
#
  2. Which model do you think would be more predictive of other samples?
#
  3. How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?
#

Exercise 8.7

Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

  1. Which tree-based regression model gives the optimal resampling and test set performance?
#
  2. Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
#
  3. Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
#