library(tidyverse)
library(skimr)
library(DataExplorer)
library(corrplot)
library(ggfortify)
library(caret)
set.seed(123)
Our objective for this analysis is to build two decision trees that use different sets of variables and a random forest model from a single dataset, to compare how the results change depending on which model we use. For our dataset we will utilize the sample sales dataset from (https://excelbianalytics.com/wp/downloads-18-sample-csv-files-data-sets-for-testing-sales/) containing 1,000,000 records.
Here we load the data from the CSV we downloaded: the sales dataset with 1,000,000 rows, where we can see our 14 columns.
mil_df <- read_csv("1000000 Sales Records.csv", show_col_types = FALSE)
glimpse(mil_df)
## Rows: 1,000,000
## Columns: 14
## $ Region <chr> "Sub-Saharan Africa", "Middle East and North Africa",…
## $ Country <chr> "South Africa", "Morocco", "Papua New Guinea", "Djibo…
## $ `Item Type` <chr> "Fruits", "Clothes", "Meat", "Clothes", "Beverages", …
## $ `Sales Channel` <chr> "Offline", "Online", "Offline", "Offline", "Offline",…
## $ `Order Priority` <chr> "M", "M", "M", "H", "L", "L", "M", "L", "L", "L", "M"…
## $ `Order Date` <chr> "7/27/2012", "9/14/2013", "5/15/2015", "5/17/2017", "…
## $ `Order ID` <dbl> 443368995, 667593514, 940995585, 880811536, 174590194…
## $ `Ship Date` <chr> "7/28/2012", "10/19/2013", "6/4/2015", "7/2/2017", "1…
## $ `Units Sold` <dbl> 1593, 4611, 360, 562, 3973, 1379, 597, 1476, 896, 776…
## $ `Unit Price` <dbl> 9.33, 109.28, 421.89, 109.28, 47.45, 9.33, 47.45, 47.…
## $ `Unit Cost` <dbl> 6.92, 35.84, 364.69, 35.84, 31.79, 6.92, 31.79, 31.79…
## $ `Total Revenue` <dbl> 14862.69, 503890.08, 151880.40, 61415.36, 188518.85, …
## $ `Total Cost` <dbl> 11023.56, 165258.24, 131288.40, 20142.08, 126301.67, …
## $ `Total Profit` <dbl> 3839.13, 338631.84, 20592.00, 41273.28, 62217.18, 332…
First, we must consider the columns that we have and whether they are suitable for fitting a classification model. We do have labels of order priority, item type, and region that could possibly be predicted from the other columns. A good use case for the data we have would be predicting the order priority of new orders, so that whenever a new order comes in it does not have to be labeled manually. Our existing dataset of one million manually classified orders should serve as a good base for making this happen.
However, for building our models we should remove data that we would not realistically know at the time an order is created. In this case we remove the ship date column, as shipping marks the end of the order as a whole, by which point priority should already have been assigned. Additionally, we remove Order ID, since it realistically provides no information about our orders beyond incidental details such as the chronology of an order, which is already covered more thoroughly by order date.
Additionally, our column names are not very syntactically friendly, having spaces between words, so we coerce them into better names.
mil_df <- mil_df |>
select(-`Ship Date`, -`Order ID`)
colnames(mil_df) <- make.names(
colnames(mil_df)
)
colnames(mil_df)
## [1] "Region" "Country" "Item.Type" "Sales.Channel"
## [5] "Order.Priority" "Order.Date" "Units.Sold" "Unit.Price"
## [9] "Unit.Cost" "Total.Revenue" "Total.Cost" "Total.Profit"
Next we set the correct variable type for each column. Order date gets converted from a character string to an actual date type. Order priority becomes an ordered factor, since priority has a set order from low to medium to high to critical. Finally, the remaining character columns are converted to plain factors.
mil_df <- mil_df |>
mutate(
Order.Date = mdy(Order.Date),
Order.Priority = ordered(Order.Priority, levels = c("L","M","H","C"))
) |>
mutate(
across(
where(is.character), as_factor
)
)
str(mil_df)
## tibble [1,000,000 × 12] (S3: tbl_df/tbl/data.frame)
## $ Region : Factor w/ 7 levels "Sub-Saharan Africa",..: 1 2 3 1 4 5 1 1 1 1 ...
## $ Country : Factor w/ 185 levels "South Africa",..: 1 2 3 4 5 6 7 8 9 8 ...
## $ Item.Type : Factor w/ 12 levels "Fruits","Clothes",..: 1 2 3 2 4 1 4 4 5 6 ...
## $ Sales.Channel : Factor w/ 2 levels "Offline","Online": 1 2 1 1 1 2 2 2 2 1 ...
## $ Order.Priority: Ord.factor w/ 4 levels "L"<"M"<"H"<"C": 2 2 2 3 1 1 2 1 1 1 ...
## $ Order.Date : Date[1:1000000], format: "2012-07-27" "2013-09-14" ...
## $ Units.Sold : num [1:1000000] 1593 4611 360 562 3973 ...
## $ Unit.Price : num [1:1000000] 9.33 109.28 421.89 109.28 47.45 ...
## $ Unit.Cost : num [1:1000000] 6.92 35.84 364.69 35.84 31.79 ...
## $ Total.Revenue : num [1:1000000] 14863 503890 151880 61415 188519 ...
## $ Total.Cost : num [1:1000000] 11024 165258 131288 20142 126302 ...
## $ Total.Profit : num [1:1000000] 3839 338632 20592 41273 62217 ...
Having completed basic data preparation that applies to the whole dataset and does not bias the data, we now split off 80% of the data for training, stratified on order priority, with the remaining 20% held out as a test set.
inTrain <- createDataPartition(
y = mil_df$Order.Priority,
p = .80,
list = FALSE
)
mil_df_simp_train <- mil_df[inTrain,]
mil_df_simp_test <- mil_df[-inTrain,]
cat("Number of test rows:",nrow(mil_df_simp_test),"Number of train rows:",nrow(mil_df_simp_train))
## Number of test rows: 199998 Number of train rows: 800002
Our data exploration begins by skimming the training data for immediate summary statistics, data types, and a visualization of distributions with mini histograms.
skim(mil_df_simp_train)
| | |
|---|---|
| Name | mil_df_simp_train |
| Number of rows | 800002 |
| Number of columns | 12 |
| Column type frequency: | |
| Date | 1 |
| factor | 5 |
| numeric | 6 |
| Group variables | None |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| Order.Date | 0 | 1 | 2010-01-01 | 2017-07-29 | 2013-10-16 | 2767 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Region | 0 | 1 | FALSE | 7 | Sub: 207992, Eur: 207436, Asi: 116714, Mid: 99532 |
| Country | 0 | 1 | FALSE | 185 | Egy: 4474, Lib: 4465, Spa: 4460, Pan: 4449 |
| Item.Type | 0 | 1 | FALSE | 12 | Bab: 66915, Off: 66838, Per: 66770, Cos: 66731 |
| Sales.Channel | 0 | 1 | FALSE | 2 | Off: 400170, Onl: 399832 |
| Order.Priority | 0 | 1 | TRUE | 4 | C: 200251, L: 200107, H: 199889, M: 199755 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Units.Sold | 0 | 1 | 4996.83 | 2885.73 | 1.00 | 2501.00 | 4993.00 | 7494.00 | 10000.00 | ▇▇▇▇▇ |
| Unit.Price | 0 | 1 | 266.13 | 217.02 | 9.33 | 81.73 | 154.06 | 437.20 | 668.27 | ▇▇▁▃▃ |
| Unit.Cost | 0 | 1 | 187.60 | 175.69 | 6.92 | 56.67 | 97.44 | 263.33 | 524.96 | ▇▃▂▁▃ |
| Total.Revenue | 0 | 1 | 1329554.94 | 1468678.09 | 9.33 | 277543.02 | 784336.80 | 1823372.20 | 6682700.00 | ▇▂▁▁▁ |
| Total.Cost | 0 | 1 | 937238.60 | 1149024.46 | 6.92 | 161435.42 | 466507.44 | 1197361.51 | 5249600.00 | ▇▂▁▁▁ |
| Total.Profit | 0 | 1 | 392316.33 | 378827.21 | 2.41 | 94952.00 | 281054.88 | 565347.50 | 1738700.00 | ▇▃▂▁▁ |
Our date column tells us the orders run from 2010-01-01 to 2017-07-29, which is useful for knowing the bounds of our dates but, in our opinion, not of consequence to building the model.
The factors mostly appear evenly balanced, with no level count standing out as disproportionate to the total amount of data, except for country. While around 4,000 samples per country is a large amount, it is a tiny proportion of the total sample set of one million. Additionally, having 185 different levels to consider at this dataset size will make it much more computationally heavy to generate a model with country included in our decision trees, and especially in our random forest. For this reason we would like to remove it, but we will keep it for testing one of our decision trees.
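As a rough illustration of that computational concern, we can count the country levels and see how many dummy columns caret's formula interface would expand the predictors into, with and without country (a quick sketch on a small slice of the training data; the exact counts are not important, only the order-of-magnitude difference):
# Number of distinct countries; a formula interface creates one dummy column per level minus one.
nlevels(mil_df_simp_train$Country)
# Model matrix width with and without Country, computed on a small slice to keep this cheap.
# With Country this should be on the order of 200 columns; without it, closer to 25.
ncol(model.matrix(Order.Priority ~ ., data = head(mil_df_simp_train, 1000)))
ncol(model.matrix(Order.Priority ~ . - Country, data = head(mil_df_simp_train, 1000)))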
Note that none of the numerical columns here are normally distributed. For the variables relating to money, that is a natural consequence of a hard lower limit of zero, so monetary distributions typically follow a negative-binomial-like shape, which they do here. Since these variables follow a naturally occurring distribution, we decide not to transform them to push the distributions closer to normal. Our decision trees should also be resilient to the resulting outliers as long as they are not influential, which is unlikely to be a problem given that this kind of skew is typical for monetary data.
Taking a closer look below, we see that unit cost and unit price actually take only a small set of discrete values, which makes for a very odd distribution. Even so, a transformation would not bring us closer to normality.
mil_df_simp_train %>%
plot_density(
ggtheme = theme_minimal(),
nrow = 3,
ncol = 3,
title = "Density of Quantitative Variables"
)
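We can confirm how discrete those two columns really are by counting their distinct values (a quick check; presumably there is roughly one price and one cost per item type):
# How many distinct values do the price columns actually take?
mil_df_simp_train |>
  summarize(
    distinct_unit_prices = n_distinct(Unit.Price),
    distinct_unit_costs  = n_distinct(Unit.Cost)
  )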
Next we explore the distribution of order priority among our categorical variables, excluding those with an unmanageable number of distinct values. Here it seems there is no relationship whatsoever, as the spread is equal within each categorical variable. This hints that we might not need to consider these variables when building the optimal decision tree model.
mil_df_simp_train %>%
select(-Order.Date,-Country) %>%
plot_bar(by = "Order.Priority",
ggtheme = theme_minimal(),
title = "Spread of Order Priority Amongst Discrete Variables")
Next, we look at how order priority breaks down across our quantitative variables. Surprisingly, we once more see that the response variable is almost perfectly evenly distributed, except in the two oddly distributed variables, unit cost and unit price. Even then, the medians are all very close to each other. Thus, our exploration so far points to order priority being assigned essentially at random rather than depending on the variables in our dataset.
plot_boxplot(mil_df_simp_train, by = "Order.Priority",
ggtheme = theme_minimal(),
nrow = 3,
ncol = 3,
title = "Spread of Order Priority Amongst Quantitative Variables")
Now that we have a better idea of our dataset and have split it into training and testing sets, we will build our three models.
We begin our model building by fitting an rpart-based decision tree that uses every variable in the dataset, evaluated with 5-fold cross-validation.
ctrl <- trainControl(method = "cv", number = 5)
dt1 <- train(
Order.Priority ~ .,
data = mil_df_simp_train,
method = "rpart",
trControl = ctrl
)
summary(dt1)
We soon discover that, after taking almost an hour to run, the training errors out. We know that the Country column would expand into 180+ dummy variables once its factor levels are converted, so we remove it for the sake of making it actually possible to train our model.
dt1 <- train(
Order.Priority ~ . - Country,
data = mil_df_simp_train,
method = "rpart",
trControl = ctrl
)
dt1
We end up with a best model whose accuracy is slightly above 25%, which is not very impressive in most respects. However, considering that our classes appear evenly distributed across almost every variable, the extra fraction of a percentage point above 25 is interesting to note.
Afterwards, we save the model to an RDS file, which ends up being 350MB in size. This way we do not have to repeat the almost ten-minute training process if we want to use this model after exiting the session.
saveRDS(dt1,file="dt1.rds")
dt1 <- readRDS("dt1.rds")
dt1
## CART
##
## 800002 samples
## 11 predictor
## 4 classes: 'L', 'M', 'H', 'C'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 640002, 640001, 640002, 640002, 640001
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 8.003321e-05 0.2503019 0.0004039577
## 1.508960e-04 0.2500419 -0.0000070964
## 1.834094e-04 0.2500419 -0.0000070964
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 8.003321e-05.
Investigating how the decision tree weighs the variables, we see that Total Revenue is deemed the most important, but only 5 variables have any importance at all. This goes back to what we saw in data exploration, where these were the only variables that showed any variance in the response class breakdown.
varImp(dt1)$importance |>
filter(Overall > 0) |>
arrange(desc(Overall))
## Overall
## Total.Revenue 100.00000
## Total.Cost 92.20907
## Total.Profit 87.77569
## Order.Date 80.89328
## Units.Sold 76.25268
Next, we take a closer look at the decision tree that was generated and see that all orders beyond a certain date are classified as medium priority. Additionally, there is an interesting redundant node where this order date is filtered twice. From there the tree splits on units sold, total revenue, and order date again, yet most data simply gets classified as critical priority based on the breakdown we have here. So our decision tree will classify most orders as critical if they are in the past, while any new order would automatically be classified as medium. This means our decision tree is not going to be very helpful for future classifications, but at least it is very easy to explain.
rattle::fancyRpartPlot(dt1$finalModel, sub = NULL)
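To sanity-check the claim that a future-dated order would automatically land in the medium bucket, we can score one hypothetical upcoming order (a minimal sketch; every value below is made up purely for illustration):
# A single made-up order dated after the training data ends (2017-07-29).
new_order <- tibble(
  Region        = "Sub-Saharan Africa",
  Country       = "South Africa",   # not used by dt1 (dropped via - Country), included for completeness
  Item.Type     = "Fruits",
  Sales.Channel = "Online",
  Order.Date    = as.Date("2024-01-15"),
  Units.Sold    = 2000,
  Unit.Price    = 9.33,
  Unit.Cost     = 6.92,
  Total.Revenue = 2000 * 9.33,
  Total.Cost    = 2000 * 6.92,
  Total.Profit  = 2000 * (9.33 - 6.92)
)
predict(dt1, new_order)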
Next we fit an rpart-based decision tree that uses only units sold, unit price, and unit cost, the variables that actually showed some variation in the class distribution of order priority, to compare against our first model. We do not use order date or revenue, which were deemed important in the previous decision tree, because new orders will all take place in the future and revenue is just a combination of the variables we are already using.
dt2 <- train(
Order.Priority ~ Units.Sold + Unit.Price + Unit.Cost,
data = mil_df_simp_train,
method = "rpart",
trControl = ctrl
)
dt2
We end up with a best model whose accuracy is slightly below 25%, which is not impressive in any respect; it is marginally worse than picking by chance alone. However, this model does take substantially less time to train because of the smaller number of predictors.
Afterwards, we save the model to an RDS file, which ends up being 150MB in size. This way we do not have to repeat the training process if we want to use this model after exiting the session.
saveRDS(dt2,file="dt2.rds")
dt2 <- readRDS("dt2.rds")
dt2
## CART
##
## 800002 samples
## 3 predictor
## 4 classes: 'L', 'M', 'H', 'C'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 640003, 640001, 640001, 640001, 640002
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 5.835755e-05 0.2482244 -0.002416997
## 1.775737e-04 0.2490631 -0.001520656
## 2.984572e-04 0.2493594 -0.001176291
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.0002984572.
Investigating how this decision tree uses the variables, we find that variable importance cannot be determined, which seems odd at first.
varImp(dt2)$importance
## Overall
## Units.Sold NaN
## Unit.Price NaN
## Unit.Cost NaN
Investigating further, we try to plot the model and get an error telling us that dim(X) must have a positive length. We explain the reason for this in the model evaluation section.
rattle::fancyRpartPlot(dt2$finalModel, sub = NULL)
#Error in apply(model$frame$yval2[, yval2per], 1, function(x) x[1 + x[1]]) :
# dim(X) must have a positive length
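One way to confirm our suspicion ahead of the evaluation section is to inspect the fitted rpart object directly; if the final tree never split, its frame should contain only the single root node:
# The frame of an rpart model has one row per node, so a root-only tree has exactly one row.
nrow(dt2$finalModel$frame)
dt2$finalModel$frame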
Finally, we fit a random forest model that uses only units sold, unit price, and unit cost, to compare directly with our second decision tree. This is another model that runs for close to 2 hours without finishing, while eating up 32GB of RAM in the process.
rf1 <- train(
Order.Priority ~ Units.Sold + Unit.Price + Unit.Cost,
data = mil_df_simp_train,
method = "rf",
trControl = ctrl
)
rf1
We therefore deem the previous attempt at a random forest non-viable and switch to the ranger package for faster training. This unfortunately also takes around an hour, but this time it completes! Yet we are left with worse accuracy than either of our two decision trees. This is likely because we did not properly tune the random forest by manually selecting parameters: ensemble methods can underperform when left untuned, whereas our rpart decision trees had ample opportunity and time to auto-tune their complexity parameter.
rf1 <- train(
Order.Priority ~ Units.Sold + Unit.Price + Unit.Cost,
data = mil_df_simp_train,
method = "ranger",
trControl = ctrl
)
rf1
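If we were to revisit this, caret lets us supply a manual tuning grid for ranger. A hypothetical setup might look like the following sketch (not run here given the training times we have already seen; the grid values and the reduced tree count are arbitrary choices for illustration):
# Hypothetical manual grid over ranger's tuning parameters.
rf_grid <- expand.grid(
  mtry          = 1:3,
  splitrule     = c("gini", "extratrees"),
  min.node.size = c(1, 100, 1000),
  stringsAsFactors = FALSE
)
rf_tuned <- train(
  Order.Priority ~ Units.Sold + Unit.Price + Unit.Cost,
  data = mil_df_simp_train,
  method = "ranger",
  num.trees = 100,   # fewer trees than ranger's default of 500 to cut runtime
  trControl = ctrl,
  tuneGrid = rf_grid
)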
Afterwards, we save the model to an RDS file, which ends up being 15MB in size. This way we do not have to repeat the training process if we want to use this model after exiting the session.
saveRDS(rf1,file="rf1.rds")
rf1 <- readRDS("rf1.rds")
rf1
## Random Forest
##
## 800002 samples
## 3 predictor
## 4 classes: 'L', 'M', 'H', 'C'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 640002, 640002, 640001, 640002, 640001
## Resampling results across tuning parameters:
##
## mtry splitrule Accuracy Kappa
## 2 gini 0.2450681 -0.006631323
## 2 extratrees 0.2460581 -0.005298297
## 3 gini 0.2128607 -0.049519028
## 3 extratrees 0.2128332 -0.049555891
##
## Tuning parameter 'min.node.size' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 2, splitrule = extratrees
## and min.node.size = 1.
Unfortunately, we learn after the fact that we needed to add the argument importance = "impurity" to be able to determine variable importance for a ranger model trained through caret. Although this is our own oversight, it is ironic given that random forests are already the less interpretable model. Additionally, we cannot plot a decision tree for the random forest with the packages we have used, because a random forest is not well explained by any single one of its trees.
varImp(rf1)
# Error in code$varImp(object$finalModel, ...) :
# No importance values available
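For completeness, the fix would simply be to pass the argument through when training; a sketch of what that would look like (not re-run here, since it would mean another hour of training):
# Same model as rf1, but asking ranger to record impurity-based variable importance.
rf1_imp <- train(
  Order.Priority ~ Units.Sold + Unit.Price + Unit.Cost,
  data = mil_df_simp_train,
  method = "ranger",
  trControl = ctrl,
  importance = "impurity"
)
varImp(rf1_imp)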
Finally, we determine which of the models we have created works best on our test data.
For our first decision tree, which uses the majority of the variables, we see that it classifies the correct class 25% of the time on the test data. This is not useful at all as a model: we have 4 classes that we know are roughly evenly distributed, so rolling a four-sided die would be just as accurate. An interesting thing to note is the high sensitivity on class C and the high specificity on all the other classes. This happens because the decision tree tends to corral data into class C, with a detection prevalence of 91%. If we predict almost everything as class C, then it makes sense that we often correctly predict positives of class C and negatives of all the other classes.
dt1 |>
predict(mil_df_simp_test) |>
confusionMatrix(mil_df_simp_test$Order.Priority)
## Confusion Matrix and Statistics
##
## Reference
## Prediction L M H C
## L 497 493 526 494
## M 1999 1954 1967 1937
## H 1884 1869 1875 1965
## C 45646 45622 45604 45666
##
## Overall Statistics
##
## Accuracy : 0.25
## 95% CI : (0.2481, 0.2519)
## No Information Rate : 0.2503
## P-Value [Acc > NIR] : 0.6419
##
## Kappa : -4e-04
##
## Mcnemar's Test P-Value : <2e-16
##
## Statistics by Class:
##
## Class: L Class: M Class: H Class: C
## Sensitivity 0.009935 0.03913 0.037521 0.91219
## Specificity 0.989911 0.96066 0.961887 0.08713
## Pos Pred Value 0.247264 0.24870 0.246938 0.25017
## Neg Pred Value 0.749838 0.75027 0.750022 0.74822
## Prevalence 0.250133 0.24969 0.249862 0.25031
## Detection Rate 0.002485 0.00977 0.009375 0.22833
## Detection Prevalence 0.010050 0.03929 0.037965 0.91270
## Balanced Accuracy 0.499923 0.49990 0.499704 0.49966
For our second decision tree with selected variables, we see that the accuracy is slightly higher at 25.03%. However, the confusion matrix is strange, with only the critical-priority row populated. It seems this decision tree is just a single root node that classifies everything as critical, which is why variable importance could not be generated and why the tree itself could not be plotted. Looking at sensitivity and specificity, the model correctly captures every true positive of class C and every true negative of the other classes, but never predicts anything else correctly. Although this model is technically more accurate, we would want to use it even less than the first decision tree, since order priority is not useful to us if everything is assigned the same priority.
dt2 |>
predict(mil_df_simp_test) |>
confusionMatrix(mil_df_simp_test$Order.Priority)
## Confusion Matrix and Statistics
##
## Reference
## Prediction L M H C
## L 0 0 0 0
## M 0 0 0 0
## H 0 0 0 0
## C 50026 49938 49972 50062
##
## Overall Statistics
##
## Accuracy : 0.2503
## 95% CI : (0.2484, 0.2522)
## No Information Rate : 0.2503
## P-Value [Acc > NIR] : 0.5009
##
## Kappa : 0
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: L Class: M Class: H Class: C
## Sensitivity 0.0000 0.0000 0.0000 1.0000
## Specificity 1.0000 1.0000 1.0000 0.0000
## Pos Pred Value NaN NaN NaN 0.2503
## Neg Pred Value 0.7499 0.7503 0.7501 NaN
## Prevalence 0.2501 0.2497 0.2499 0.2503
## Detection Rate 0.0000 0.0000 0.0000 0.2503
## Detection Prevalence 0.0000 0.0000 0.0000 1.0000
## Balanced Accuracy 0.5000 0.5000 0.5000 0.5000
With our random forest model we have the lowest accuracy, 24.73%, so we are now worse than simply rolling a four-sided die. However, looking at the sensitivity and specificity, we can see that the model is at least attempting to predict every class at a comparable rate, meaning no order priority is minimized or biased against. Thus, if forced to choose one of these three models, we would use this one for its lack of bias toward any particular priority. Yet in the real world it would likely be better to split the typical total profit range per order into four buckets and decide priority based on which bucket an order falls into, as sketched after the confusion matrix below.
rf1 |>
predict(mil_df_simp_test) |>
confusionMatrix(mil_df_simp_test$Order.Priority)
## Confusion Matrix and Statistics
##
## Reference
## Prediction L M H C
## L 13312 13557 13368 13495
## M 10515 10407 10736 10618
## H 10713 10676 10490 10700
## C 15486 15298 15378 15249
##
## Overall Statistics
##
## Accuracy : 0.2473
## 95% CI : (0.2454, 0.2492)
## No Information Rate : 0.2503
## P-Value [Acc > NIR] : 0.9991
##
## Kappa : -0.0037
##
## Mcnemar's Test P-Value : <2e-16
##
## Statistics by Class:
##
## Class: L Class: M Class: H Class: C
## Sensitivity 0.26610 0.20840 0.20992 0.30460
## Specificity 0.73048 0.78762 0.78611 0.69212
## Pos Pred Value 0.24775 0.24617 0.24637 0.24831
## Neg Pred Value 0.74899 0.74936 0.74919 0.74880
## Prevalence 0.25013 0.24969 0.24986 0.25031
## Detection Rate 0.06656 0.05204 0.05245 0.07625
## Detection Prevalence 0.26866 0.21138 0.21290 0.30706
## Balanced Accuracy 0.49829 0.49801 0.49801 0.49836
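For reference, here is a sketch of that simple rule-based alternative: learn quartile cut points for total profit from the training data and map the four buckets onto the four priorities (purely illustrative; in practice the cut points and the bucket-to-priority mapping would come from the business, and the object names below are made up):
# Quartile cut points for total profit, learned from the training data.
profit_breaks <- quantile(mil_df_simp_train$Total.Profit, probs = c(0.25, 0.50, 0.75))
# Assign each test order a priority bucket by its total profit.
rule_based <- mil_df_simp_test |>
  mutate(
    Rule.Priority = cut(
      Total.Profit,
      breaks = c(-Inf, profit_breaks, Inf),   # open-ended so extreme test values still get a bucket
      labels = c("L", "M", "H", "C")
    )
  )
count(rule_based, Rule.Priority)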