---
title: 'DATA 624: PREDICTIVE ANALYTICS HW 9'
author: "Gabriel Campos"
date: "Last edited `r format(Sys.time(), '%B %d, %Y')`"
output:
  html_notebook: default
  html_document:
    df_print: paged
  pdf_document:
    latex_engine: xelatex
geometry: left=0.5cm,right=0.5cm,top=1cm,bottom=2cm
urlcolor: blue
---

# Library

```{r, warning=FALSE, message=FALSE}
library(AppliedPredictiveModeling)
library(caret)
library(corrplot)
library(Cubist)
library(dplyr)
library(earth)
library(forecast)
library(fpp3)
library(gbm)
library(ggplot2)
library(MASS)
library(mice)
library(mlbench)
library(party)
library(randomForest)
library(RANN)
library(rpart)
library(rpart.plot)
library(tidyr)
```

# Description

Do problems 8.1, 8.2, 8.3, and 8.7 in Kuhn and Johnson.  Please submit the Rpubs link along with the .rmd file.

# 8.1.

Recreate the simulated data from Exercise 7.2:

```{r, message=FALSE, warning=FALSE}
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
```


## (a) 

Fit a random forest model to all of the predictors, then estimate the variable importance scores:

```{r, message=FALSE, warning=FALSE}
model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
```

```{r}
rownames(rfImp1) <- paste0("V", 1:10)
rfImp1_sorted <- rfImp1[order(-rfImp1$Overall), , drop = FALSE] # 'drop = FALSE' ensures the result is still a data frame
print(rfImp1_sorted)
```


Did the random forest model significantly use the uninformative predictors (`V6` – `V10`)?

No. The importance scores for `V6`–`V10` are very low, and most are even negative, indicating that the random forest made essentially no use of these uninformative predictors.


## (b)

Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

```{r}
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
```

```{r}
simulated2 <- simulated
simulated2$duplicate1 <- simulated2$V1 + rnorm(200) * .1
cor(simulated2$duplicate1, simulated2$V1)
```


Fit another random forest model to these data. Did the importance score for `V1` change? What happens when you add another predictor that is also highly correlated with `V1`?

```{r message=FALSE, warning=FALSE}
model2 <- randomForest(y ~ ., data = simulated2, importance = TRUE, 
                       ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
```


```{r}
rownames(rfImp2) <- c(paste0("V", 1:10), "duplicate1")
rfImp2_sorted <- rfImp2[order(-rfImp2$Overall), , drop = FALSE] # 'drop = FALSE' ensures the result is still a data frame
print(rfImp2_sorted)
```


The highly correlated duplicate rearranged the importance scores for all predictors: `V1` was demoted from most important to second most important, and `duplicate1` itself landed in fourth place. Adding more predictors that are highly correlated with `V1` dilutes its importance further, because the trees split the shared signal across the correlated columns.
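
To answer the second part directly, a minimal sketch (the name `duplicate2` is my own) adds one more predictor correlated with `V1` and refits; `V1`'s score should be diluted further:

```{r message=FALSE, warning=FALSE}
simulated3 <- simulated2
simulated3$duplicate2 <- simulated3$V1 + rnorm(200) * .1  # second correlated copy
model2b <- randomForest(y ~ ., data = simulated3, importance = TRUE, ntree = 1000)
varImp(model2b, scale = FALSE)
```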


## (c)

Use the `cforest` function in the party package to fit a random forest model using conditional inference trees. The party package function `varimp` can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in *Strobl et al. (2007)*. Do these importances show the same pattern as the traditional random forest model?

```{r, warning=FALSE}
# `simulated` still contains duplicate1 from part (b), so there are 11 predictors
model3 <- cforest(y ~ ., data = simulated)

order(-varimp(model3, conditional = FALSE)) # default: conditional = FALSE
order(-varimp(model3, conditional = TRUE))
```

Using `varimp` from the party package, the predictor rankings (by column position, where 11 is `duplicate1`) are:

* `cforest`, `conditional = FALSE`: 4 2 11 1 5 3 7 9 6 8 10
* `cforest`, `conditional = TRUE`: 4 2 1 11 5 6 3 7 8 9 10
* `randomForest` (part a): 1 4 2 5 3 6 7 10 9 8

The orderings differ in detail, but all three agree that `V6`–`V10` have relatively low importance.
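
Since `order()` returns bare column positions, sorting the named scores is easier to read; a small sketch:

```{r, warning=FALSE}
sort(varimp(model3, conditional = FALSE), decreasing = TRUE)
```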

## (d)

Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?


### Boosted

```{r message=FALSE, warning=FALSE}

set.seed(624)

# gbm defaults: 100 trees, shrinkage = 0.1, interaction.depth = 1
model_gbm <- gbm(y ~ ., data = simulated, distribution = "gaussian")

summary(model_gbm)  # relative influence of each predictor
```

### Cubist

```{r message=FALSE, warning=FALSE}

# predictors only (`simulated` still includes duplicate1 from part b)
simulated_x <- subset(simulated, select = -c(y))

cubist_model <- cubist(x = simulated_x, y = simulated$y, committees = 100)

```

```{r}
cube_model_v2 <- varImp(cubist_model)
rownames(cube_model_v2) <- c(paste0("V", 1:10), "duplicate1")
cubist_sorted <- cube_model_v2[order(-cube_model_v2$Overall), , drop = FALSE]
print(cubist_sorted)
```

The Cubist model ranks `V3` as its most important predictor, while the boosted (gbm) model ranks `V4` highest. In both models the uninformative predictors `V6`–`V10` stay near the bottom, so the same overall pattern occurs.


# 8.2.

Use a simulation to show tree bias with different granularities.

```{r message=FALSE, warning=FALSE}
set.seed(624)
# predictors sampled over ranges of increasing width (coarse to fine granularity)
low    <- sample(0:50,   500, replace = TRUE)
medium <- sample(0:500,  500, replace = TRUE)
high   <- sample(0:5000, 500, replace = TRUE)

# response: each predictor enters with the same unit coefficient
y <- low + medium + high + rnorm(500)

# check variance of predictors
var(low)
var(medium)
var(high)
```

```{r message=FALSE, warning=FALSE}
df_sim <- data.frame(low, medium, high, y)

diff_gran_model <- randomForest(y ~., data = df_sim, importance = TRUE, ntree = 1000)

varImp(diff_gran_model, scale=FALSE)
```

The three predictors differ in granularity: `low` is drawn from 0–50 (51 distinct values), `medium` from 0–500, and `high` from 0–5000, so `high` is the most finely grained and, as the variance checks confirm, also has by far the largest variance.

The random forest ranks `high` most important and `low` least important: the trees favor the predictor offering the most distinct values (and hence the most candidate split points), illustrating tree selection bias toward high-granularity predictors even though all three enter the response with the same unit coefficient.
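
A complementary sketch (the variable names are my own) removes all true signal, so that any importance a predictor receives is attributable to granularity alone; the impurity-based column of `importance()` typically still favors the most finely grained predictor:

```{r message=FALSE, warning=FALSE}
set.seed(624)
noise_df <- data.frame(
  coarse = sample(0:1,  500, replace = TRUE),  # 2 distinct values
  mid    = sample(0:50, 500, replace = TRUE),  # 51 distinct values
  fine   = runif(500),                         # essentially continuous
  y      = rnorm(500)                          # independent of every predictor
)
noise_rf <- randomForest(y ~ ., data = noise_df, importance = TRUE, ntree = 1000)
importance(noise_rf)  # compare the IncNodePurity column across predictors
```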

# 8.3.

In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

![](./Images/Img_8_1.png)

[package gbm](https://www.rdocumentation.org/packages/gbm/versions/2.1.8/topics/gbm)

## (a)

Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across
more predictors?

**Left Model**

* With `shrinkage = 0.1` and `bag.fraction = 0.1`, each iteration fits its tree to only 10% of the training observations, and only 10% of that tree's contribution is added to the ensemble.
* Because every step is small and each tree sees a different small subsample, many trees are required and many different predictors get a chance to be selected, so importance is spread across more predictors.
* The `gbm` documentation notes that a "smaller learning rate typically require(s) more trees": smaller `shrinkage` values lead to more trees, larger values to fewer.

**Right Model**

* With `shrinkage = 0.9` and `bag.fraction = 0.9`, each tree sees 90% of the training set and nearly its full contribution is kept, so the model commits to the strongest predictors in a few large steps.
* Importance therefore concentrates on just the first few predictors, and the large `shrinkage` means far fewer trees are needed.
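
The figure can also be approximated directly. A sketch using the solubility data from AppliedPredictiveModeling (the book's `solTrainXtrans`/`solTrainY` objects), fitting `gbm.fit` at the two extreme settings and comparing the top relative influences:

```{r message=FALSE, warning=FALSE}
data(solubility)  # loads solTrainXtrans, solTrainY, among others
set.seed(624)
gbm_left  <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
                     n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1,
                     verbose = FALSE)
gbm_right <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
                     n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9,
                     verbose = FALSE)
head(summary(gbm_left,  plotit = FALSE), 10)   # influence spread across predictors
head(summary(gbm_right, plotit = FALSE), 10)   # influence concentrated in a few
```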


## (b)

Which model do you think would be more predictive of other samples?

The left model (`shrinkage = 0.1`, `bag.fraction = 0.1`) should be more predictive of other samples. Its small learning rate forces it to build many trees, each fit to a different small subsample, which averages out noise and guards against overfitting; the right model's few aggressive trees are far more likely to overfit the training data.


## (c)

How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

According to [pub_journals link](https://www.fs.fed.us/rm/pubs_journals/2015/rmrs_2015_freeman_e001.pdf) Sec 4.2, pg. 15:

* Shrinkage (the learning rate) and interaction depth (tree complexity) are inversely related in boosting: as tree complexity increases, the learning rate should decrease to prevent overfitting and maintain model stability.
* That slower learning process in turn requires more trees.
* Larger datasets better support more complex trees, as they can sustain the detailed structure deeper trees create without quickly overfitting.
* Interaction depth (tree size, or tree complexity) sets the maximum size of each tree, and deeper trees can involve more predictors per tree, so increasing interaction depth would spread importance over more predictors and flatten the slope of both importance plots in Fig. 8.24.
* Given the inverse relationship above, the left model (`shrinkage = 0.1`, `bag.fraction = 0.1`) is the one likely to benefit from a larger interaction depth; the right model may not, since its shrinkage parameter is already so high.



# 8.7.

Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

```{r message=FALSE, warning=FALSE}
#load data
data(ChemicalManufacturingProcess)

set.seed(624)
imputed_data <- preProcess(ChemicalManufacturingProcess, method = c("knnImpute","nzv", "corr"))
df_full <- predict(imputed_data, ChemicalManufacturingProcess)

chem_index <- createDataPartition(df_full$Yield, p = .8, list = FALSE)

chem_train <- df_full[chem_index, ]
chem_test  <- df_full[-chem_index, ]

train_pred <- chem_train[-c(1)] # drop Yield (column 1) to keep predictors only
test_pred  <- chem_test[-c(1)]
```
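
As a quick sanity check (not part of the original steps), the preprocessed frame can be inspected to confirm that the `nzv` and `corr` filters reduced the predictor count and that `knnImpute` left no missing values:

```{r}
dim(df_full)        # rows x columns after the nzv and correlation filters
sum(is.na(df_full)) # should be 0 after knnImpute
```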

## (a)

Which tree-based regression model gives the optimal resampling and test set performance?

### Boosted Trees
```{r message=FALSE, warning=FALSE}
set.seed(624)
# gbm.fit defaults: 100 trees with shrinkage = 0.001
gbm_model <- gbm.fit(train_pred, chem_train$Yield, distribution = "gaussian")
gbm_model
gbm_pred <- predict(gbm_model, newdata = test_pred, n.trees = gbm_model$n.trees)
postResample(pred = gbm_pred, obs = chem_test$Yield)
```

### Cubist
```{r message=FALSE, warning=FALSE}
set.seed(624)
cube_model <- cubist(train_pred, chem_train$Yield)
cube_model
cube_pred <- predict(cube_model, newdata = test_pred)
postResample(pred = cube_pred, obs = chem_test$Yield)
```

### Random Forest
```{r message=FALSE, warning=FALSE}
set.seed(624)
# fit the model (randomForest's argument for the number of trees is `ntree`)
rf_model <- randomForest(train_pred, chem_train$Yield, importance = TRUE, ntree = 1000)
rf_model

rf_pred <- predict(rf_model, newdata = test_pred)
postResample(pred = rf_pred, obs = chem_test$Yield)
```

Random forest gives the optimal test-set metrics, with the lowest RMSE (0.6715) and the highest R-squared (0.7323). Cubist is close behind, while the boosted model lags badly here because `gbm.fit` was left at its defaults (100 trees at `shrinkage = 0.001`), far too few iterations at so small a learning rate.
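
The comparison above rests on a single train/test split. A sketch of the resampling side of the question using caret's `train` (default tuning grids, 10-fold CV; an addition to the original workflow):

```{r message=FALSE, warning=FALSE}
set.seed(624)
ctrl <- trainControl(method = "cv", number = 10)
rf_cv     <- train(train_pred, chem_train$Yield, method = "rf",     trControl = ctrl)
gbm_cv    <- train(train_pred, chem_train$Yield, method = "gbm",    trControl = ctrl,
                   verbose = FALSE)
cubist_cv <- train(train_pred, chem_train$Yield, method = "cubist", trControl = ctrl)
summary(resamples(list(RF = rf_cv, GBM = gbm_cv, Cubist = cubist_cv)))
```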

## (b)

**Top 10:** ManufacturingProcess32, BiologicalMaterial06, BiologicalMaterial03, ManufacturingProcess13, ManufacturingProcess36, BiologicalMaterial11, ManufacturingProcess09, BiologicalMaterial08, ManufacturingProcess28, ManufacturingProcess11

```{r}
rf_model_v2 <- varImp(rf_model)
```


```{r message=FALSE, warning=FALSE}
rf_model_sorted <- rf_model_v2[order(-rf_model_v2$Overall), , drop = FALSE] # 'drop = FALSE' ensures the result is still a data frame
head(rf_model_sorted,10)
```
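
For a visual check, randomForest's built-in importance plot shows the same ranking (one line):

```{r}
varImpPlot(rf_model, n.var = 10, main = "Random forest variable importance")
```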

### i

Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list?

`ManufacturingProcess32` is the most important predictor. Neither group clearly dominates the list: the top ten splits into 6 `ManufacturingProcess` and 4 `BiologicalMaterial` predictors, giving the process variables a slight edge.

### ii

How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

The list is largely consistent with the important variables from the optimal nonlinear model, but overlaps much less with those from the optimal linear model.

## (c)

Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the
biological or process predictors and their relationship with yield?

```{r message=FALSE, warning=FALSE}
rpart_tree <- rpart(Yield ~., data = chem_train)
rpart.plot(rpart_tree)
```
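
`rpart.plot` prints only the mean yield in each terminal node. To see the full distribution of yield per node, one sketch (my addition) groups the training responses by the terminal node each sample falls into, using `rpart_tree$where`:

```{r message=FALSE, warning=FALSE}
# rpart_tree$where holds the terminal-node index for each training row
node_df <- data.frame(node  = factor(rpart_tree$where),
                      Yield = chem_train$Yield)
ggplot(node_df, aes(x = node, y = Yield)) +
  geom_boxplot() +
  labs(x = "Terminal node", y = "Yield (preprocessed scale)",
       title = "Distribution of yield within each terminal node")
```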
