DATA 622 HW2

Introduction

This assignment builds two decision trees from a single dataset, downloaded from https://excelbianalytics.com/wp/downloads-18-sample-csv-files-data-sets-for-testing-sales/, to solve a classification problem: predicting whether an order's shipment is delayed. We then train a random forest on the same problem, analyze the results, and consider how our results might change negative perceptions of decision trees. We will use the following packages for this study,

library(tidyverse)
library(lubridate) # for month() and year(); attached by core tidyverse >= 2.0 but loaded explicitly here
library(rpart)
library(rpart.plot)
library(caret)
library(randomForest)
library(ROSE)
library(ggthemes)
library(egg)
library(kableExtra)
library(doParallel)

The Data

We’ll use the 1000 sales records dataset for this assignment.

df <- read.csv('1000 Sales Records.csv')
glimpse(df)
## Rows: 1,000
## Columns: 14
## $ Region         <chr> "Middle East and North Africa", "North America", "Middl…
## $ Country        <chr> "Libya", "Canada", "Libya", "Japan", "Chad", "Armenia",…
## $ Item.Type      <chr> "Cosmetics", "Vegetables", "Baby Food", "Cereal", "Frui…
## $ Sales.Channel  <chr> "Offline", "Online", "Offline", "Offline", "Offline", "…
## $ Order.Priority <chr> "M", "M", "C", "C", "H", "H", "H", "M", "H", "H", "M", …
## $ Order.Date     <chr> "10/18/2014", "11/7/2011", "10/31/2016", "4/10/2010", "…
## $ Order.ID       <int> 686800706, 185941302, 246222341, 161442649, 645713555, …
## $ Ship.Date      <chr> "10/31/2014", "12/8/2011", "12/9/2016", "5/12/2010", "8…
## $ Units.Sold     <int> 8446, 3018, 1517, 3322, 9845, 9528, 2844, 7299, 2428, 4…
## $ Unit.Price     <dbl> 437.20, 154.06, 255.28, 205.70, 9.33, 205.70, 205.70, 1…
## $ Unit.Cost      <dbl> 263.33, 90.93, 159.42, 117.11, 6.92, 117.11, 117.11, 35…
## $ Total.Revenue  <dbl> 3692591.20, 464953.08, 387259.76, 683335.40, 91853.85, …
## $ Total.Cost     <dbl> 2224085.18, 274426.74, 241840.14, 389039.42, 68127.40, …
## $ Total.Profit   <dbl> 1468506.02, 190526.34, 145419.62, 294295.98, 23726.45, …

The dataset contains 1000 observations and 14 dimensions, a mix of numerical and categorical values.

colSums(is.na(df))
##         Region        Country      Item.Type  Sales.Channel Order.Priority 
##              0              0              0              0              0 
##     Order.Date       Order.ID      Ship.Date     Units.Sold     Unit.Price 
##              0              0              0              0              0 
##      Unit.Cost  Total.Revenue     Total.Cost   Total.Profit 
##              0              0              0              0

There are no missing values in the dataset.

Data Prep

We’ll make the following changes to the data,

  • convert column names to lowercase.
  • drop the order.id and country columns.
  • convert order.date and ship.date to Date.
  • convert the remaining character columns to factor.
  • create a column time as the difference in days between order.date and ship.date.
  • extract month and year as factors from order.date.
  • drop order.date and ship.date.
  • drop the dependent columns total.cost, total.profit, and total.revenue.

# convert all column names to lowercase
colnames(df) <- tolower(colnames(df))

# drop columns and convert column types
df <- df |>
  select(-c(order.id, country)) |>
  mutate(order.date = as.Date(order.date, format = "%m/%d/%Y"),
         ship.date = as.Date(ship.date, format = "%m/%d/%Y")) |>
  mutate(time = as.integer(ship.date - order.date),
         month = as.factor(month(order.date)),
         year = as.factor(year(order.date))) |>
  mutate_if(is.character, as.factor) |>
  select(-c(order.date, ship.date, total.cost, total.profit, total.revenue))

glimpse(df)
## Rows: 1,000
## Columns: 10
## $ region         <fct> Middle East and North Africa, North America, Middle Eas…
## $ item.type      <fct> Cosmetics, Vegetables, Baby Food, Cereal, Fruits, Cerea…
## $ sales.channel  <fct> Offline, Online, Offline, Offline, Offline, Online, Onl…
## $ order.priority <fct> M, M, C, C, H, H, H, M, H, H, M, M, C, C, L, H, H, M, C…
## $ units.sold     <int> 8446, 3018, 1517, 3322, 9845, 9528, 2844, 7299, 2428, 4…
## $ unit.price     <dbl> 437.20, 154.06, 255.28, 205.70, 9.33, 205.70, 205.70, 1…
## $ unit.cost      <dbl> 263.33, 90.93, 159.42, 117.11, 6.92, 117.11, 117.11, 35…
## $ time           <int> 13, 31, 39, 32, 15, 34, 44, 42, 37, 26, 21, 19, 35, 16,…
## $ month          <fct> 10, 11, 10, 4, 8, 11, 3, 5, 1, 12, 12, 2, 11, 12, 1, 6,…
## $ year           <fct> 2014, 2011, 2016, 2010, 2011, 2014, 2015, 2012, 2015, 2…

We then aggregate the orders by region, year, month, item type, sales channel, and order priority, summing the units sold and averaging the unit cost, unit price, and turnaround time, renaming the longer columns for brevity,

df <- df |> 
  group_by(region,
           year,
           month,
           item.type,
           sales.channel,
           order.priority) |>
  summarize(sold = sum(units.sold),
            avg.cost = mean(unit.cost),
            avg.price = mean(unit.price),
            avg.time = mean(time)) |>
  rename(item = item.type,
         channel = sales.channel,
         priority = order.priority)

glimpse(df)
## Rows: 991
## Columns: 10
## Groups: region, year, month, item, channel [962]
## $ region    <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
## $ year      <fct> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, …
## $ month     <fct> 2, 3, 3, 3, 4, 4, 4, 5, 7, 7, 7, 7, 7, 9, 9, 10, 10, 10, 11,…
## $ item      <fct> Beverages, Cereal, Fruits, Snacks, Beverages, Cereal, Office…
## $ channel   <fct> Offline, Online, Offline, Offline, Online, Offline, Online, …
## $ priority  <fct> L, C, L, H, C, C, H, C, H, M, M, M, L, C, H, H, H, M, L, C, …
## $ sold      <int> 4571, 5132, 3972, 8929, 1547, 3322, 702, 1237, 8862, 6569, 4…
## $ avg.cost  <dbl> 31.79, 117.11, 6.92, 97.44, 31.79, 117.11, 524.96, 263.33, 6…
## $ avg.price <dbl> 47.45, 205.70, 9.33, 152.58, 47.45, 205.70, 651.21, 437.20, …
## $ avg.time  <dbl> 33, 10, 10, 41, 6, 32, 7, 1, 24, 39, 3, 31, 45, 41, 16, 50, …

We’ve now reduced our dataset to 991 observations and 10 dimensions.

Let’s take a look at the distribution of average turnaround times, as this could be a possible outcome variable,

df |>
  ggplot(aes(x=avg.time)) +
  geom_density(fill="lightblue") +
  theme_few() +
  labs(x="average time")

The distribution of average times is fairly uniform.

summary(df$avg.time)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   13.00   25.00   25.03   38.00   50.00

The mean and median are both approximately 25 days. We create a new binary variable, delayed, flagging all orders that took longer than 30 days to ship:

df <- df |>
  mutate(delayed = as.factor(ifelse(
    avg.time > 30, 
    "yes",
    "no"))) |>
  select(-avg.time)

table(df$delayed)
## 
## yes  no 
## 383 608

We see there is some class imbalance. This could cause problems for our decision trees, so we use ROSE to generate a synthetic, class-balanced sample of the data.

balanced.data <- ROSE(delayed ~ ., data = df, seed = 123)$data

table(balanced.data$delayed)
## 
##  no yes 
## 503 488
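
ROSE generates a synthetic, approximately balanced sample rather than simply duplicating minority-class rows. For comparison, plain random oversampling of the minority class could be done with caret's upSample; a minimal sketch, not used in the rest of this analysis,

# hypothetical alternative: duplicate minority-class rows until the classes are equal
up.data <- upSample(x = dplyr::select(dplyr::ungroup(df), -delayed),
                    y = df$delayed,
                    yname = "delayed")
table(up.data$delayed)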

Data Modeling

We’re going to predict whether an order was delayed, with delayed as our response variable.

First we’ll split the data 80/20 into training and testing sets.

set.seed(1)

# shuffle the dataset
bd <- balanced.data[sample(nrow(balanced.data)), ]
# create split point
split.index <- createDataPartition(bd$delayed, p = .8, list = FALSE)

# split the data
train.set <- bd[split.index, ]
test.set <- bd[-split.index, ]
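
Because createDataPartition samples within each level of the outcome, the class balance should carry over to both splits; a quick check,

# proportion of each class in the training and testing sets
round(rbind(train = prop.table(table(train.set$delayed)),
            test  = prop.table(table(test.set$delayed))), 3)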

Decision Trees

We generate the first decision tree, then, for the second tree, hold out from our formula the feature used at the first split.

We observe that the root node in Decision Tree 1 splits on the item feature, so we hold that out of Decision Tree 2 and get a very different tree.

Decision Tree 1 splits on item at the root and achieves a depth of 8, then splits on two different features depending on the outcome of the root split. In this tree, the most important features are item, year, and month.

Decision Tree 2 splits on the year feature at the root and achieves a depth of 6. In this tree, the most important features are year, month, and region.

Decision Tree 1 is far more complex than Decision Tree 2, indicating low bias, yet it is actually slightly less accurate on the test set, which suggests it is overfitting the training data, a characteristic of high variance.

Decision Tree 1

set.seed(1) 

tree.1 <- rpart(delayed ~ .,
                data = train.set,
                method = "class")

rpart.plot(tree.1, tweak = 1.5)

We check the model against our test set,

set.seed(1)

tree1.preds <- predict(tree.1, test.set, type="class")
t1 <- confusionMatrix(tree1.preds, test.set$delayed)
t1.results <- c(t1$overall['Accuracy'],
                t1$byClass['Sensitivity'],
                t1$byClass['Specificity'],
                t1$byClass['Precision'],
                t1$byClass['F1'])
t1.results
##    Accuracy Sensitivity Specificity   Precision          F1 
##   0.6345178   0.6600000   0.6082474   0.6346154   0.6470588

Decision Tree 2

set.seed(1)

tree.2 <- rpart(delayed ~ . -item,
                data = train.set,
                method = "class")

rpart.plot(tree.2, tweak=1.5) 

We check the model against our test set,

set.seed(1)

tree2.preds <- predict(tree.2, test.set, type="class")
t2 <- confusionMatrix(tree2.preds, test.set$delayed)
t2.results <- c(t2$overall['Accuracy'],
                t2$byClass['Sensitivity'],
                t2$byClass['Specificity'],
                t2$byClass['Precision'],
                t2$byClass['F1'])
t2.results
##    Accuracy Sensitivity Specificity   Precision          F1 
##   0.6649746   0.6800000   0.6494845   0.6666667   0.6732673
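
The depth and feature-importance claims above can be checked directly from the fitted rpart objects. A minimal sketch, assuming tree.1 and tree.2 as fitted above (rpart numbers its nodes in a binary-heap scheme, so floor(log2(node)) gives a node's depth),

# depth and number of splits of each fitted tree
tree.depth  <- function(tree) max(floor(log2(as.numeric(rownames(tree$frame)))))
tree.splits <- function(tree) sum(tree$frame$var != "<leaf>")

data.frame(model  = c("tree.1", "tree.2"),
           depth  = c(tree.depth(tree.1), tree.depth(tree.2)),
           splits = c(tree.splits(tree.1), tree.splits(tree.2)))

# rpart's built-in variable importance scores (largest first)
head(tree.1$variable.importance)
head(tree.2$variable.importance)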

Random Forest

Now we’ll train a random forest classifier on the same problem and analyze the results.

This can be computationally expensive, so we set up additional cores for parallel processing,

num_cores <- 4  # adjust to the machine, e.g. parallel::detectCores() - 1
cl <- makeCluster(num_cores)

Next we’ll train our model using 10-fold cross-validation,

set.seed(1)

registerDoParallel(cl)

ctrl <- trainControl(
  method = "cv",
  number = 10,
  allowParallel = TRUE
)

# note: randomForest's argument is ntree (an ntrees argument would be silently ignored);
# parallelism is handled by the registered cluster and allowParallel = TRUE
rf.model <- train(delayed ~ .,
                  data = train.set,
                  method = "rf",
                  ntree = 1000,
                  trControl = ctrl)

stopCluster(cl)
rf.model
## Random Forest 
## 
## 794 samples
##   9 predictor
##   2 classes: 'no', 'yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 715, 715, 715, 714, 715, 714, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.6664082  0.3317786
##   22    0.6789399  0.3573030
##   42    0.6740032  0.3476397
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 22.
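
By default, caret tuned mtry (the number of predictors sampled at each split) over three values. A custom grid could be supplied instead via tuneGrid; a sketch of such a call, shown for reference and not run here,

# hypothetical explicit tuning grid for mtry, reusing the same ctrl object
rf.tuned <- train(delayed ~ .,
                  data = train.set,
                  method = "rf",
                  ntree = 1000,
                  trControl = ctrl,
                  tuneGrid = expand.grid(mtry = c(2, 5, 10, 22, 42)))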

Now we can check our model against the test data,

set.seed(1)

rf.preds <- predict(rf.model, test.set)
rfm <- confusionMatrix(rf.preds, test.set$delayed)
rf.results <- c(rfm$overall['Accuracy'],
                rfm$byClass['Sensitivity'],
                rfm$byClass['Specificity'],
                rfm$byClass['Precision'],
                rfm$byClass['F1'])
rf.results
##    Accuracy Sensitivity Specificity   Precision          F1 
##   0.7411168   0.7300000   0.7525773   0.7525773   0.7411168

Our model performs better on the testing data than its cross-validated accuracy on the training data, achieving 74% accuracy, which suggests low variance. The sensitivity (true positive rate) is 73%: the positive class here is "no", so the model correctly identifies 73% of orders that were not delayed. The specificity (true negative rate) is a bit higher at 75%, meaning the model correctly identifies 75% of delayed orders.
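
For reference, the raw counts behind these rates come straight from the confusion matrix object computed above,

rfm$table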

We can examine the most important variables in our model:

plot(varImp(rf.model), 10)
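
The same rankings can be pulled out numerically; a small sketch, assuming the default single "Overall" importance column that caret returns for this model,

imp <- varImp(rf.model)$importance                   # data frame of importance scores
head(imp[order(-imp$Overall), , drop = FALSE], 10)   # top 10 predictors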

Interestingly, these are not the variables that our individual decision trees chose. In our random forest, the size of the order (sold) and the average price and cost of the items in the order are by far the biggest predictors of whether an order will ship on time or be delayed. This makes intuitive sense: large orders may require restocks, and the price of an order may indicate its popularity, which may also affect restocking, both of which might contribute to how quickly an order ships.

Comparing results across models,

kable(rbind(tree1 = t1.results,
      tree2 = t2.results,
      rf = rf.results)) |>
  kable_styling()
        Accuracy    Sensitivity  Specificity  Precision   F1
tree1   0.6345178   0.66         0.6082474    0.6346154   0.6470588
tree2   0.6649746   0.68         0.6494845    0.6666667   0.6732673
rf      0.7411168   0.73         0.7525773    0.7525773   0.7411168

Discussion

To summarize this analysis, we feature engineered a dataset containing sales data and created a binary response variable representing whether an order was delayed, resampling the data to achieve class balance in the response. After splitting the data into training and testing sets (80/20), we built a decision tree that identified the items contained in the order as its key predictor, then built a second tree with item removed and observed that it selected specific years as predictors instead. Evaluating both trees on the testing data, we observed that the first tree was more complex and exhibited low bias and high variance, as it was likely overfitting the data, while the second tree was simpler and proved to be more accurate.

This highlights the bias-variance tradeoff. Bias refers to the error introduced by simplifying a problem, for example by making assumptions about the data, whereas variance refers to the model's sensitivity to small changes in the training data. High-bias models may have lower variance, but they may underfit the data. In this case, Decision Tree 2 has higher bias and lower variance than Decision Tree 1, and is more accurate. One way to balance bias and variance is to use an ensemble method such as a random forest, which randomly samples the data and aggregates the predictions of multiple trees; such models are less likely to overfit and more likely to generalize to unseen data, providing a more accurate model.

Our next step illustrates this: we trained a random forest model using 10-fold cross-validation to further reduce variance and achieved a predictive accuracy of 74% on the remaining 20% of our data, correctly identifying 73% of orders that were not delayed. We observed that the most important predictors in the random forest were the number of units sold, the average cost, and the average price, rather than the specific items or years identified by our earlier individual trees. This highlights what the https://decizone.com/blog/the-good-the-bad-the-ugly-of-using-decision-trees article describes as “The BAD”, or the disadvantages of using a decision tree: the individual trees failed to identify the variables that mattered most to the overall predictive model. Our individual decision trees also highlight what the article refers to as “The UGLY”: the trees are complex and hard to interpret.

One way to change this perception when presenting the decision trees we created would be to contextualize them: describe them as part of the exploratory analysis, simplify the labeling so they are more legible, explain to the audience the decision to exclude the key feature from the second tree, compare the final result with the earlier iterations, and explain how the random forest model uses cross-validation to reduce variance. The key takeaway from this assignment is that a random forest, by bootstrapping random samples from the data and aggregating the predictions of multiple trees, is able to capture complex patterns in the data that might not be apparent in any individual tree. The result is a model with less variance than the individual models, one that generalizes better to unseen data.

One way to illustrate this to stakeholders when walking through the concept of a random forest would be to recreate the steps we just took: build a few decision trees with different predictors, note how different they look, and then explain how a random forest does this n times (in our case, 1000) and aggregates their predictions, by majority vote for classification, to make more accurate predictions.