Here we use the pointblank function scan_data() to explore the
missingness, distributions, and correlations in the data set.
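A minimal sketch of that call is shown below, assuming the cleaned data frame is named ramenDf (that name is taken from the modelling code later in this post):
# Interactive HTML report with an overview, per-variable summaries,
# interactions, correlations, missingness, and a data sample
library(pointblank)
scan_data(ramenDf)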
Some interesting correlations are revealed. The target ‘east_asia’ is
positively correlated with the number of stars a product received and
the style ‘bowl’. It is negatively correlated with the producer ‘Nissin’
and having the words ‘Instant’ or ‘Curry’ in the name. These negative
relationships make sense: Nissin produces ramen for a Western audience,
and the words ‘Instant’ and ‘Curry’ are probably most appealing to those
same consumers.
Because this dataset came from Kaggle, it is very clean and has
almost no missingness, which is not something you typically encounter
in the real world.
Decision Trees
We have to cast our target ‘east_asia’ to a factor for the
classification decision trees to run. Here we create our first
classification decision tree.
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
## ✔ broom 1.0.5 ✔ rsample 1.2.1
## ✔ dials 1.2.1 ✔ tibble 3.2.1
## ✔ ggplot2 3.5.0 ✔ tidyr 1.3.1
## ✔ infer 1.0.7 ✔ tune 1.2.0
## ✔ modeldata 1.3.0 ✔ workflows 1.1.4
## ✔ parsnip 1.2.1 ✔ workflowsets 1.1.0
## ✔ purrr 1.0.2 ✔ yardstick 1.3.1
## ✔ recipes 1.0.10
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ recipes::step() masks stats::step()
## • Learn how to get started at https://www.tidymodels.org/start/
ramenDf$east_asia <- as.factor(ramenDf$east_asia)
set.seed(123)
data_split <- initial_split(ramenDf, prop = 0.75)
train_data <- training(data_split)
test_data <- testing(data_split)
tree_spec <- decision_tree(engine = "rpart", mode = "classification", tree_depth = 7)
# Fit the model to the training data
tree_fit1 <- tree_spec %>%
  fit(east_asia ~ ., data = train_data, model = TRUE)
Our first node is not surprising: if the word “Instant” appears in
the ramen name, the ramen is classified as not from East Asia.
The far-right branch says that if the name does not contain “Instant”,
the ramen is not served in a cup, it is rated above 3.7 stars, and its
style is ‘bowl’, then it is classified as coming from East Asia.
# Load the library
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.3.3
## Loading required package: rpart
##
## Attaching package: 'rpart'
## The following object is masked from 'package:dials':
##
## prune
# Plot the decision tree
rpart.plot(tree_fit1$fit, type = 4, extra = 101, under = TRUE, cex = 0.8, box.palette = "auto", roundint=FALSE)

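The same splits can also be printed as plain-text rules, which is a convenient cross-check on the plot above (a small sketch, not part of the original output):
# Print the decision rules of the fitted rpart tree
rpart.plot::rpart.rules(tree_fit1$fit, roundint = FALSE)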
This model has an accuracy of 66% on the training data.
predictions <- tree_fit1 %>%
  predict(train_data) %>%
  pull(.pred_class)
# Column 3 of the data is the east_asia target
caret::confusionMatrix(as.factor(predictions), as.factor(train_data[,3]))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 863 365
## 1 292 412
##
## Accuracy : 0.6599
## 95% CI : (0.6383, 0.6811)
## No Information Rate : 0.5978
## P-Value [Acc > NIR] : 1.083e-08
##
## Kappa : 0.2818
##
## Mcnemar's Test P-Value : 0.00497
##
## Sensitivity : 0.7472
## Specificity : 0.5302
## Pos Pred Value : 0.7028
## Neg Pred Value : 0.5852
## Prevalence : 0.5978
## Detection Rate : 0.4467
## Detection Prevalence : 0.6356
## Balanced Accuracy : 0.6387
##
## 'Positive' Class : 0
##
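Because yardstick is attached with tidymodels, the training-set accuracy can also be cross-checked without caret (a sketch, assuming the objects created above are unchanged):
# Tidymodels-style accuracy check on the training data
tree_fit1 %>%
  predict(train_data) %>%
  bind_cols(train_data %>% select(east_asia)) %>%
  accuracy(truth = east_asia, estimate = .pred_class)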
This model has an accuracy of 64% on the test data.
predictions <- tree_fit1 %>%
  predict(test_data) %>%
  pull(.pred_class)
caret::confusionMatrix(as.factor(predictions), as.factor(test_data[,3]))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 273 138
## 1 97 137
##
## Accuracy : 0.6357
## 95% CI : (0.5972, 0.6729)
## No Information Rate : 0.5736
## P-Value [Acc > NIR] : 0.0007718
##
## Kappa : 0.2406
##
## Mcnemar's Test P-Value : 0.0090724
##
## Sensitivity : 0.7378
## Specificity : 0.4982
## Pos Pred Value : 0.6642
## Neg Pred Value : 0.5855
## Prevalence : 0.5736
## Detection Rate : 0.4233
## Detection Prevalence : 0.6372
## Balanced Accuracy : 0.6180
##
## 'Positive' Class : 0
##
Next we build a decision tree that excludes the ‘instant’ feature
used in the first node of the previous tree.
set.seed(123)
data_split <- initial_split(ramenDf %>% select(-instant), prop = 0.75)
train_data <- training(data_split)
test_data <- testing(data_split)
tree_spec <- decision_tree(engine = "rpart", mode = "classification", tree_depth = 7)
# Fit the model to the training data, this time without the 'instant' feature
tree_fit2 <- tree_spec %>%
  fit(east_asia ~ ., data = train_data, model = TRUE)
With ‘instant’ excluded from the features, ‘cup’ is now the feature
used in the first node.
This model has an accuracy of 63% on the training data and 60% on the
test data.
Even though the accuracies of the two models are very similar, the
underlying sensitivity and specificity change dramatically. The second
decision tree has very high sensitivity but very poor specificity:
because the positive class in these confusion matrices is 0 (not East
Asia), it is very good at catching ramen that is not from East Asia,
but only because it predicts “not East Asia” for almost everything.
# Load the library
library(rpart.plot)
# Plot the decision tree
rpart.plot(tree_fit2$fit, type = 4, extra = 101, under = TRUE, cex = 0.8, box.palette = "auto", roundint=FALSE)

predictions <- tree_fit2 %>%
  predict(train_data) %>%
  pull(.pred_class)
caret::confusionMatrix(as.factor(predictions), as.factor(train_data[,3]))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1089 641
## 1 66 136
##
## Accuracy : 0.6341
## 95% CI : (0.6121, 0.6556)
## No Information Rate : 0.5978
## P-Value [Acc > NIR] : 0.0005967
##
## Kappa : 0.1341
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9429
## Specificity : 0.1750
## Pos Pred Value : 0.6295
## Neg Pred Value : 0.6733
## Prevalence : 0.5978
## Detection Rate : 0.5637
## Detection Prevalence : 0.8954
## Balanced Accuracy : 0.5589
##
## 'Positive' Class : 0
##
predictions <- tree_fit2 %>%
  predict(test_data) %>%
  pull(.pred_class)
caret::confusionMatrix(as.factor(predictions), as.factor(test_data[,3]))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 341 224
## 1 29 51
##
## Accuracy : 0.6078
## 95% CI : (0.5689, 0.6456)
## No Information Rate : 0.5736
## P-Value [Acc > NIR] : 0.04308
##
## Kappa : 0.1178
##
## Mcnemar's Test P-Value : < 2e-16
##
## Sensitivity : 0.9216
## Specificity : 0.1855
## Pos Pred Value : 0.6035
## Neg Pred Value : 0.6375
## Prevalence : 0.5736
## Detection Rate : 0.5287
## Detection Prevalence : 0.8760
## Balanced Accuracy : 0.5535
##
## 'Positive' Class : 0
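The imbalance described above can also be quantified directly with yardstick, which by default treats the first factor level (here 0, matching caret’s positive class) as the event of interest. This is a sketch, not part of the original analysis:
# test_preds is a helper tibble created only for this sketch
test_preds <- tree_fit2 %>%
  predict(test_data) %>%
  bind_cols(test_data %>% select(east_asia))
# Sensitivity and specificity of the second tree on the test data
sens(test_preds, truth = east_asia, estimate = .pred_class)
spec(test_preds, truth = east_asia, estimate = .pred_class)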
##
Random Forest
Here we split the ramen dataset into training and test sets for our
random forest.
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.3.3
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
library(datasets)
library(caret)
## Warning: package 'caret' was built under R version 4.3.3
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following objects are masked from 'package:yardstick':
##
## precision, recall, sensitivity, specificity
## The following object is masked from 'package:purrr':
##
## lift
set.seed(222)
data_split <- initial_split(ramenDf, prop = 0.75)
train_data <- training(data_split)
test_data <- testing(data_split)
Here we train our random forest model on the ramen dataset.
rf <- randomForest(east_asia ~ ., data = train_data,
                   keep.forest = TRUE, importance = TRUE)
print(rf)
##
## Call:
## randomForest(formula = east_asia ~ ., data = train_data, keep.forest = TRUE, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 35.56%
## Confusion matrix:
## 0 1 class.error
## 0 795 338 0.298323
## 1 349 450 0.436796
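Since the forest was trained with importance = TRUE, we can also inspect which features drive its predictions (a sketch; the original analysis does not print these):
# Mean decrease in accuracy and in Gini impurity per feature
importance(rf)
varImpPlot(rf)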
Here we test the accuracy on the held-out test data.
predictions <- rf %>%
  predict(test_data)
caret::confusionMatrix(as.factor(predictions), as.factor(test_data[,3]))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 279 126
## 1 113 127
##
## Accuracy : 0.6295
## 95% CI : (0.5909, 0.6668)
## No Information Rate : 0.6078
## P-Value [Acc > NIR] : 0.1380
##
## Kappa : 0.2157
##
## Mcnemar's Test P-Value : 0.4376
##
## Sensitivity : 0.7117
## Specificity : 0.5020
## Pos Pred Value : 0.6889
## Neg Pred Value : 0.5292
## Prevalence : 0.6078
## Detection Rate : 0.4326
## Detection Prevalence : 0.6279
## Balanced Accuracy : 0.6069
##
## 'Positive' Class : 0
##