Introduction

In this assignment we explore “the good, the bad and the ugly” of using Decision Trees. We look into the bias and variance issues associated with decision trees and see whether Random Forest provides a way around some of them.

Link to Good, Bad and Ugly Decision Tree Article

Load in Data

I am using a Kaggle data set that I have used before: a ramen ratings data set.

ramenDf <- read.csv("G:/Documents/DATA622_HW1/ramen-ratings.csv")

Remove the review number (an unnecessary key) and cast the star rating as numeric.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
ramenDf <- ramenDf %>%
  select(-c('Review..')) %>%
  mutate(Stars = as.numeric(Stars)) %>%
  relocate(Stars) %>%
  filter(!is.na(Stars))

Here I am generating some new features. I am creating dummy variables for the style (pack, tray, cup, bowl). I am also creating features based on the ramen name: does it include the word “Instant”, “Spicy” or “Curry”? I am creating a dummy variable for the most popular producer, Nissin. Finally, I am creating the target variable for my classification task, ‘east_asia’, indicating whether the product was produced in China, Japan, South Korea or Taiwan.

ramenDf <-
  ramenDf %>% mutate(pack = ifelse(Style == 'Pack', 1, 0),
                     tray = ifelse(Style == 'Tray', 1, 0),
                     cup = ifelse(Style == 'Cup', 1, 0),
                     bowl = ifelse(Style == 'Bowl', 1, 0))

ramenDf <-
  ramenDf %>% mutate(nissin = ifelse(Brand == 'Nissin', 1, 0))

ramenDf <- ramenDf %>% mutate(east_asia = as.numeric(ifelse(Country %in% c('China', 'Japan', 'South Korea', 'Taiwan'), 1, 0)))

ramenDf <- ramenDf %>%
  mutate(spicy = as.numeric(grepl('Spicy', Variety)),
         curry = as.numeric(grepl('Curry', Variety)),
         instant = as.numeric(grepl('Instant', Variety)))

ramenDf <- ramenDf %>% select(Stars, nissin, east_asia, spicy, curry, instant, pack, tray, cup, bowl)
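
Finally, a quick sanity check of the engineered target’s class balance, since the no-information rate in the confusion matrices later on depends on it. A minimal sketch:

# Sketch: class balance of the engineered target
# (0 = produced elsewhere, 1 = produced in East Asia)
table(ramenDf$east_asia)
round(prop.table(table(ramenDf$east_asia)), 2)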

Data Exploration

Here we are using the ‘scan_data’ function from the pointblank package to explore the missingness, distributions and correlations in the data set.

Some interesting correlations are revealed. The target ‘east_asia’ is positively correlated with the number of stars a product received and with the ‘bowl’ style. It is negatively correlated with the producer Nissin and with having the word ‘Instant’ or ‘Curry’ in the name. These negative relationships make sense: Nissin produces ramen for a Western audience, and the adjectives ‘Instant’ and ‘Curry’ are probably most appealing to those same consumers.

Because this dataset came from Kaggle, it is very clean and has almost no missingness, which is rarely the case with real-world data.

pointblank::scan_data(ramenDf)

Overview of ramenDf

Table Overview

Columns: 10
Rows: 2,577
NAs: 0
Duplicate Rows: 2,049 (79.51%)
Column Types: numeric (10)

Reproducibility Information

Scan Build Time: 2024-04-04 11:56:45
pointblank Version: 0.11.4
R Version: R version 4.3.2 (2023-10-31 ucrt) "Eye Holes"
Operating System: x86_64-w64-mingw32

Variables (one row per column of ramenDf, in order)

Column      Distinct   NAs   Inf/-Inf   Mean   Minimum   Maximum
Stars       42         0     0          3.65   0         5
nissin      2          0     0          0.15   0         1
east_asia   2          0     0          0.41   0         1
spicy       2          0     0          0.10   0         1
curry       2          0     0          0.05   0         1
instant     2          0     0          0.18   0         1
pack        2          0     0          0.59   0         1
tray        2          0     0          0.04   0         1
cup         2          0     0          0.17   0         1
bowl        2          0     0          0.19   0         1

(The scan’s Interactions, Correlations and Missing Values panels are plots and are omitted here.)

Sample

Stars nissin east_asia spicy curry instant pack tray cup bowl
1 3.75 0 1 0 0 0 0 0 1 0
2 1.00 0 1 1 0 0 1 0 0 0
3 2.25 1 0 0 0 0 0 0 1 0
4 2.75 0 1 0 0 0 1 0 0 0
5 3.75 0 0 0 1 0 1 0 0 0
6..2572
2573 3.50 0 0 0 0 1 0 0 0 1
2574 1.00 0 0 0 0 1 1 0 0 0
2575 2.00 0 0 0 0 0 1 0 0 0
2576 2.00 0 0 0 0 0 1 0 0 0
2577 0.50 0 0 0 0 0 1 0 0 0
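
As a quick cross-check of the correlations reported by scan_data, the base cor() matrix can be printed directly; at this point every column of ramenDf is still numeric. A minimal sketch:

# Sketch: pairwise Pearson correlations; the east_asia row shows the
# relationships discussed above (positive with Stars and bowl,
# negative with nissin, instant and curry)
round(cor(ramenDf), 2)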

Decision Trees

We have to cast our target ‘east_asia’ to a factor for the classification decision tree to run. Here we create our first classification decision tree.

library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
## ✔ broom        1.0.5      ✔ rsample      1.2.1 
## ✔ dials        1.2.1      ✔ tibble       3.2.1 
## ✔ ggplot2      3.5.0      ✔ tidyr        1.3.1 
## ✔ infer        1.0.7      ✔ tune         1.2.0 
## ✔ modeldata    1.3.0      ✔ workflows    1.1.4 
## ✔ parsnip      1.2.1      ✔ workflowsets 1.1.0 
## ✔ purrr        1.0.2      ✔ yardstick    1.3.1 
## ✔ recipes      1.0.10
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter()  masks stats::filter()
## ✖ dplyr::lag()     masks stats::lag()
## ✖ recipes::step()  masks stats::step()
## • Learn how to get started at https://www.tidymodels.org/start/
ramenDf$east_asia <- as.factor(ramenDf$east_asia )

set.seed(123)
data_split <- initial_split(ramenDf, prop = 0.75)
train_data <- training(data_split)
test_data <- testing(data_split)

tree_spec <- decision_tree(engine = "rpart", mode = "classification", tree_depth = 7)

# Fit the model to the training data
tree_fit1 <- tree_spec %>%
 fit(east_asia ~ ., data = train_data, model=TRUE) 

Our first node is not surprising: if the word “Instant” appears in the ramen name, the product is classified as not from East Asia.

The far right branch says that if the name does not include “Instant”, the ramen is not in a cup, it has more than 3.7 stars and it is in a bowl, then it is classified as from East Asia.

# Load the library
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.3.3
## Loading required package: rpart
## 
## Attaching package: 'rpart'
## The following object is masked from 'package:dials':
## 
##     prune
# Plot the decision tree
rpart.plot(tree_fit1$fit, type = 4, extra = 101, under = TRUE, cex = 0.8, box.palette = "auto", roundint=FALSE)
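
If tracing the diagram is awkward, rpart.plot can also print each leaf as a plain-language rule; a minimal sketch using the rpart object stored inside the parsnip fit:

# Sketch: print the tree's leaves as human-readable rules
rpart.plot::rpart.rules(tree_fit1$fit, roundint = FALSE)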

This model has an accuracy of 66% on the training data.

predictions <- tree_fit1 %>%
 predict(train_data) %>%
 pull(.pred_class)


caret::confusionMatrix(as.factor(predictions), as.factor(train_data[,3]))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 863 365
##          1 292 412
##                                           
##                Accuracy : 0.6599          
##                  95% CI : (0.6383, 0.6811)
##     No Information Rate : 0.5978          
##     P-Value [Acc > NIR] : 1.083e-08       
##                                           
##                   Kappa : 0.2818          
##                                           
##  Mcnemar's Test P-Value : 0.00497         
##                                           
##             Sensitivity : 0.7472          
##             Specificity : 0.5302          
##          Pos Pred Value : 0.7028          
##          Neg Pred Value : 0.5852          
##              Prevalence : 0.5978          
##          Detection Rate : 0.4467          
##    Detection Prevalence : 0.6356          
##       Balanced Accuracy : 0.6387          
##                                           
##        'Positive' Class : 0               
## 

This model has an accuracy of 64% on the test data.

predictions <- tree_fit1 %>%
 predict(test_data) %>%
 pull(.pred_class)


caret::confusionMatrix(as.factor(predictions), as.factor(test_data[,3]))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 273 138
##          1  97 137
##                                           
##                Accuracy : 0.6357          
##                  95% CI : (0.5972, 0.6729)
##     No Information Rate : 0.5736          
##     P-Value [Acc > NIR] : 0.0007718       
##                                           
##                   Kappa : 0.2406          
##                                           
##  Mcnemar's Test P-Value : 0.0090724       
##                                           
##             Sensitivity : 0.7378          
##             Specificity : 0.4982          
##          Pos Pred Value : 0.6642          
##          Neg Pred Value : 0.5855          
##              Prevalence : 0.5736          
##          Detection Rate : 0.4233          
##    Detection Prevalence : 0.6372          
##       Balanced Accuracy : 0.6180          
##                                           
##        'Positive' Class : 0               
## 

Next we want to build a Decision Tree excluding the feature ‘instant’ that was used in the first node.

# Fit the model to the training data


set.seed(123)
data_split <- initial_split(ramenDf %>% select(-instant), prop = 0.75)
train_data <- training(data_split)
test_data <- testing(data_split)

tree_spec <- decision_tree(engine = "rpart", mode = "classification", tree_depth = 7)

tree_fit2 <- tree_spec %>%
 fit(east_asia ~ ., data = train_data, model=TRUE)

With ‘instant’ excluded from the features, ‘cup’ is now the feature used in the first node.

The model has an accuracy of 63% on the training data and 60% on the test data.

Even though the accuracies of the two models are very similar, the underlying sensitivity and specificity change dramatically. Keeping in mind that the positive class in these confusion matrices is 0 (not East Asia), the second decision tree has very high sensitivity but very poor specificity: it predicts ‘not East Asia’ for almost everything, so it catches nearly all of the non-East-Asian ramen while missing most of the East Asian products.

# Load the library
library(rpart.plot)

# Plot the decision tree
rpart.plot(tree_fit2$fit, type = 4, extra = 101, under = TRUE, cex = 0.8, box.palette = "auto", roundint=FALSE)

predictions <- tree_fit2 %>%
 predict(train_data) %>%
 pull(.pred_class)


caret::confusionMatrix(as.factor(predictions), as.factor(train_data[,3]))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1089  641
##          1   66  136
##                                           
##                Accuracy : 0.6341          
##                  95% CI : (0.6121, 0.6556)
##     No Information Rate : 0.5978          
##     P-Value [Acc > NIR] : 0.0005967       
##                                           
##                   Kappa : 0.1341          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9429          
##             Specificity : 0.1750          
##          Pos Pred Value : 0.6295          
##          Neg Pred Value : 0.6733          
##              Prevalence : 0.5978          
##          Detection Rate : 0.5637          
##    Detection Prevalence : 0.8954          
##       Balanced Accuracy : 0.5589          
##                                           
##        'Positive' Class : 0               
## 
predictions <- tree_fit2 %>%
 predict(test_data) %>%
 pull(.pred_class)


caret::confusionMatrix(as.factor(predictions), as.factor(test_data[,3]))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 341 224
##          1  29  51
##                                           
##                Accuracy : 0.6078          
##                  95% CI : (0.5689, 0.6456)
##     No Information Rate : 0.5736          
##     P-Value [Acc > NIR] : 0.04308         
##                                           
##                   Kappa : 0.1178          
##                                           
##  Mcnemar's Test P-Value : < 2e-16         
##                                           
##             Sensitivity : 0.9216          
##             Specificity : 0.1855          
##          Pos Pred Value : 0.6035          
##          Neg Pred Value : 0.6375          
##              Prevalence : 0.5736          
##          Detection Rate : 0.5287          
##    Detection Prevalence : 0.8760          
##       Balanced Accuracy : 0.5535          
##                                           
##        'Positive' Class : 0               
## 
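
As a cross-check of the sensitivity/specificity discussion above, the same numbers can be computed with tidymodels’ own yardstick package. A minimal sketch; the calls are namespaced to avoid caret’s masking of yardstick’s metrics, and event_level = "first" mirrors caret’s choice of 0 as the positive class:

# Sketch: sensitivity and specificity of the second tree on the test set
preds2 <- tree_fit2 %>% predict(test_data) %>% pull(.pred_class)
results2 <- tibble::tibble(truth = test_data$east_asia, estimate = preds2)
yardstick::sens(results2, truth, estimate, event_level = "first")
yardstick::spec(results2, truth, estimate, event_level = "first")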

Random Forest

Here we split the ramen dataset into training and test sets for our random forest.

library(randomForest)
## Warning: package 'randomForest' was built under R version 4.3.3
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
library(datasets)
library(caret)
## Warning: package 'caret' was built under R version 4.3.3
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following objects are masked from 'package:yardstick':
## 
##     precision, recall, sensitivity, specificity
## The following object is masked from 'package:purrr':
## 
##     lift
set.seed(222)
data_split <- initial_split(ramenDf, prop = 0.75)
train_data <- training(data_split)
test_data <- testing(data_split)

Here I train my random forest model on the ramen training data.

rf <- randomForest(east_asia ~., data=train_data,
                       keep.forest=TRUE, importance=TRUE)
print(rf)
## 
## Call:
##  randomForest(formula = east_asia ~ ., data = train_data, keep.forest = TRUE,      importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 35.56%
## Confusion matrix:
##     0   1 class.error
## 0 795 338    0.298323
## 1 349 450    0.436796
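
Because the forest was trained with importance = TRUE, we can also look at which features it leans on; a minimal sketch using randomForest’s built-in importance tools:

# Sketch: rank features by mean decrease in accuracy and in Gini impurity
randomForest::importance(rf)
randomForest::varImpPlot(rf, main = "Variable importance")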

Here I evaluate the model’s accuracy on the test data.

predictions <- rf %>%
 predict(test_data)


caret::confusionMatrix(as.factor(predictions), as.factor(test_data[,3]))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 279 126
##          1 113 127
##                                           
##                Accuracy : 0.6295          
##                  95% CI : (0.5909, 0.6668)
##     No Information Rate : 0.6078          
##     P-Value [Acc > NIR] : 0.1380          
##                                           
##                   Kappa : 0.2157          
##                                           
##  Mcnemar's Test P-Value : 0.4376          
##                                           
##             Sensitivity : 0.7117          
##             Specificity : 0.5020          
##          Pos Pred Value : 0.6889          
##          Neg Pred Value : 0.5292          
##              Prevalence : 0.6078          
##          Detection Rate : 0.4326          
##    Detection Prevalence : 0.6279          
##       Balanced Accuracy : 0.6069          
##                                           
##        'Positive' Class : 0               
## 
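
If we wanted to squeeze more out of the forest, mtry (the number of variables tried at each split, 3 by default here) is the obvious knob, and randomForest ships a simple out-of-bag search for it. A minimal sketch, assuming the same train_data split as above:

# Sketch: search over mtry using the out-of-bag error estimate
set.seed(222)
randomForest::tuneRF(x = train_data %>% dplyr::select(-east_asia),
                     y = train_data$east_asia,
                     ntreeTry = 500, stepFactor = 1.5,
                     improve = 0.01, trace = TRUE)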

Results

The two decision trees had similar performance, but the second tree, with ‘instant’ excluded, did worse on both the training data (66% vs 63% accuracy) and the test data (64% vs 60%). Both models show some variance, losing accuracy when moving from training to test data, the second model slightly more so.

The random forest had about the same test accuracy as the decision trees but lower variance: the out-of-bag training accuracy was 64% and the test accuracy was 63%, a much smaller gap. So the random forest gives us a more generalizable classification model.

Random Forest solves some of the ‘Bad’ parts of Decision Trees that the article talks about. Random Forest abstracts away the ‘Complexity’: you aren’t trying to understand a visual representation of the decision making, you are just given a classification prediction, which may ultimately be all you need to make a decision. You also don’t have to worry about data ‘Evolution’: as you get more data you just retrain your trees and get new predictions.

Random Forest also deals with some of the ‘Ugly’ parts of Decision Trees. A Random Forest model has high ‘Usability’: it isn’t a static diagram like a Decision Tree. It is ‘Mobile’ and ‘Everywhere’: you can use it anywhere, regardless of the device, and it can be ‘Integrated’ into the back end of any application. It gives you good out-of-the-box performance, so you don’t need to be a ‘Coding’ expert to use it, and the accuracy of your models gives you a ‘Measure’ of performance.

You do loose some of the things that make Decision Trees great with Random Forest. Decision trees are easy to ‘Understand’ and ‘Come naturally’ because it is similar to how people thing about complex choices. It as can serve as documentation which to me is very attractive. I often work on long running project and the ability to look at code and see the decision making baked into reduces the cognitive burden of revisiting an old project. I think that there are use cases for both, it just depends on which benefit is most important for a project.