Decision Trees Algorithms

Pre-work

Assignment work

  • Based on the latest topics presented, choose a dataset of your choice and create a Decision Tree where you can solve a classification problem and predict the outcome of a particular feature or detail of the data used.
  • Switch variables* to generate 2 decision trees and compare the results. Create a random forest and analyze the results.
  • Based on real cases where desicion trees went wrong, and ‘the bad & ugly’ aspects of decision trees (https://decizone.com/blog/the-good-the-bad-the-ugly-of-using-decision-trees), how can you change this perception when using the decision tree you created to solve a real problem?

Deliverable

  • Essay (minimum 500 word document) Write a short essay explaining your analysis, and how you would address the concerns in the blog (listed in pre-work)
  • Exploratory Analysis using R or Python (submit code + errors + analysis as notebook or copy/paste to document)

Including of the required libraries

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(lubridate) 
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## 
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
## ✔ broom        1.0.5     ✔ rsample      1.2.0
## ✔ dials        1.2.0     ✔ tune         1.1.2
## ✔ infer        1.0.5     ✔ workflows    1.1.3
## ✔ modeldata    1.2.0     ✔ workflowsets 1.0.1
## ✔ parsnip      1.1.1     ✔ yardstick    1.2.0
## ✔ recipes      1.0.8     
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard()        masks purrr::discard()
## ✖ dplyr::filter()          masks stats::filter()
## ✖ recipes::fixed()         masks stringr::fixed()
## ✖ dplyr::lag()             masks stats::lag()
## ✖ caret::lift()            masks purrr::lift()
## ✖ yardstick::precision()   masks caret::precision()
## ✖ yardstick::recall()      masks caret::recall()
## ✖ yardstick::sensitivity() masks caret::sensitivity()
## ✖ yardstick::spec()        masks readr::spec()
## ✖ yardstick::specificity() masks caret::specificity()
## ✖ recipes::step()          masks stats::step()
## • Use suppressPackageStartupMessages() to eliminate package startup messages
library(rpart)
## 
## Attaching package: 'rpart'
## 
## The following object is masked from 'package:dials':
## 
##     prune
library(rpart.plot)
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin

Dataset

For Assignment 2 I will be using the on of the datasets which I used in assignment 1

small_ds <- read_csv("https://raw.githubusercontent.com/petferns/DATA622/main/5000%20Sales%20Records.csv")
## Rows: 5000 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): Region, Country, Item Type, Sales Channel, Order Priority, Order Da...
## dbl (7): Order ID, Units Sold, Unit Price, Unit Cost, Total Revenue, Total C...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Data contains sales order details of products like baby food, personal care products, food items, fruits, etc from across the continents.

head(small_ds)
## # A tibble: 6 × 14
##   Region       Country `Item Type` `Sales Channel` `Order Priority` `Order Date`
##   <chr>        <chr>   <chr>       <chr>           <chr>            <chr>       
## 1 Central Ame… Antigu… Baby Food   Online          M                12/20/2013  
## 2 Central Ame… Panama  Snacks      Offline         C                7/5/2010    
## 3 Europe       Czech … Beverages   Offline         C                9/12/2011   
## 4 Asia         North … Cereal      Offline         L                5/13/2010   
## 5 Asia         Sri La… Snacks      Offline         C                7/20/2015   
## 6 Middle East… Morocco Personal C… Offline         L                11/8/2010   
## # ℹ 8 more variables: `Order ID` <dbl>, `Ship Date` <chr>, `Units Sold` <dbl>,
## #   `Unit Price` <dbl>, `Unit Cost` <dbl>, `Total Revenue` <dbl>,
## #   `Total Cost` <dbl>, `Total Profit` <dbl>

Some columns needs conversion and some other needs to be factored.

glimpse(small_ds)
## Rows: 5,000
## Columns: 14
## $ Region           <chr> "Central America and the Caribbean", "Central America…
## $ Country          <chr> "Antigua and Barbuda", "Panama", "Czech Republic", "N…
## $ `Item Type`      <chr> "Baby Food", "Snacks", "Beverages", "Cereal", "Snacks…
## $ `Sales Channel`  <chr> "Online", "Offline", "Offline", "Offline", "Offline",…
## $ `Order Priority` <chr> "M", "C", "C", "L", "C", "L", "H", "M", "M", "M", "C"…
## $ `Order Date`     <chr> "12/20/2013", "7/5/2010", "9/12/2011", "5/13/2010", "…
## $ `Order ID`       <dbl> 957081544, 301644504, 478051030, 892599952, 571902596…
## $ `Ship Date`      <chr> "1/11/2014", "7/26/2010", "9/29/2011", "6/15/2010", "…
## $ `Units Sold`     <dbl> 552, 2167, 4778, 9016, 7542, 48, 8258, 927, 8841, 981…
## $ `Unit Price`     <dbl> 255.28, 152.58, 47.45, 205.70, 152.58, 81.73, 109.28,…
## $ `Unit Cost`      <dbl> 159.42, 97.44, 31.79, 117.11, 97.44, 56.67, 35.84, 35…
## $ `Total Revenue`  <dbl> 140914.56, 330640.86, 226716.10, 1854591.20, 1150758.…
## $ `Total Cost`     <dbl> 87999.84, 211152.48, 151892.62, 1055863.76, 734892.48…
## $ `Total Profit`   <dbl> 52914.72, 119488.38, 74823.48, 798727.44, 415865.88, …
small_ds[['Order Date']] <- as.Date(small_ds[['Order Date']], "%m/%d/%Y")
small_ds[['Ship Date']] <- as.Date(small_ds[['Ship Date']], "%m/%d/%Y")

small_ds[['Sales Channel']] <- as.factor(small_ds[['Sales Channel']])


small_ds[['Total Profit']] <- as.numeric(small_ds[['Total Profit']])

Modelling

We partition the dataset into training and testing sets in 80:20 proportion

From my models I would like to predict Sales Channel based on variables Region, Item Type, Order Priority and Total Profit.

set.seed(1234)

Btraining.samples <- small_ds$`Sales Channel` %>% 
  createDataPartition(p = 0.8, list=FALSE)

Btrain.data <- small_ds[Btraining.samples,]
Btest.data <- small_ds[-Btraining.samples,]

Deision Tree

Model 1

Model 1 we run the first decision tree model to predict Sales Channel with choosing the variable Region

model1 <- rpart(`Sales Channel` ~ `Region` , data = Btrain.data)

# Plot the complexity parameter table
prp(model1, type = 4, extra = 1, nn = TRUE)

From the below plots we see the R-suqare increased with increase in number of splits and XRelative error decreases drastically as the number of splits increases.

par(mfrow=c(1,2)) 
rsq.rpart(model1) 
## 
## Classification tree:
## rpart(formula = `Sales Channel` ~ Region, data = Btrain.data)
## 
## Variables actually used in tree construction:
## [1] Region
## 
## Root node error: 1997/4001 = 0.49913
## 
## n= 4001 
## 
##         CP nsplit rel error xerror     xstd
## 1 0.030045      0   1.00000 1.0300 0.015831
## 2 0.010000      1   0.96995 1.0025 0.015837
## Warning in rsq.rpart(model1): may not be applicable for this method

Lets calculate the accuracy for Model1

predictions <- predict(model1, Btrain.data, type = "class")

# Compare the predicted values to the actual values
actual_values <- Btrain.data$`Sales Channel`

# Calculate accuracy
correct_predictions <- sum(predictions == actual_values)
total_predictions <- length(predictions)
accuracy <- correct_predictions / total_predictions

# Display the accuracy
accuracy
## [1] 0.515871

We see our Model is able to predict around 51% accurately, this accuracy is low.

Model 2

In Model 2 lets try to include few more variables to our second decision tree and see how it helps in improving the accuracy.

model2 <- rpart(`Sales Channel` ~ `Region` + `Item Type` + `Order Priority` ,
   data= Btrain.data)
summary(model2)
## Call:
## rpart(formula = `Sales Channel` ~ Region + `Item Type` + `Order Priority`, 
##     data = Btrain.data)
##   n= 4001 
## 
##           CP nsplit rel error   xerror       xstd
## 1 0.02754131      0 1.0000000 1.038558 0.01582643
## 2 0.01101652      1 0.9724587 1.026540 0.01583227
## 3 0.01000000      2 0.9614422 1.022033 0.01583387
## 
## Variable importance
## Item Type    Region 
##        53        47 
## 
## Node number 1: 4001 observations,    complexity param=0.02754131
##   predicted class=Offline  expected loss=0.4991252  P(node) =1
##     class counts:  2004  1997
##    probabilities: 0.501 0.499 
##   left son=2 (1050 obs) right son=3 (2951 obs)
##   Primary splits:
##       Item Type      splits as  LRRRRLRRRRRL, improve=2.336891, (0 missing)
##       Region         splits as  RLLLLLR,      improve=2.047659, (0 missing)
##       Order Priority splits as  LRRL,         improve=0.325755, (0 missing)
## 
## Node number 2: 1050 observations
##   predicted class=Offline  expected loss=0.4704762  P(node) =0.2624344
##     class counts:   556   494
##    probabilities: 0.530 0.470 
## 
## Node number 3: 2951 observations,    complexity param=0.01101652
##   predicted class=Online   expected loss=0.4906811  P(node) =0.7375656
##     class counts:  1448  1503
##    probabilities: 0.491 0.509 
##   left son=6 (1756 obs) right son=7 (1195 obs)
##   Primary splits:
##       Region         splits as  RLLLLLR,      improve=2.1060280, (0 missing)
##       Item Type      splits as  -LRLL-LLLLL-, improve=0.4288556, (0 missing)
##       Order Priority splits as  RLLL,         improve=0.2182987, (0 missing)
## 
## Node number 6: 1756 observations
##   predicted class=Offline  expected loss=0.4937358  P(node) =0.4388903
##     class counts:   889   867
##    probabilities: 0.506 0.494 
## 
## Node number 7: 1195 observations
##   predicted class=Online   expected loss=0.4677824  P(node) =0.2986753
##     class counts:   559   636
##    probabilities: 0.468 0.532
prp(model2, type = 4, extra = 1, nn = TRUE)

We see from the the graphs of cross validation results , in Model 2 the R-square increases with increase in Number of splits and also the XRelative error decreases with increase in number of splits.

par(mfrow=c(1,2))
rsq.rpart(model2) 
## 
## Classification tree:
## rpart(formula = `Sales Channel` ~ Region + `Item Type` + `Order Priority`, 
##     data = Btrain.data)
## 
## Variables actually used in tree construction:
## [1] Item Type Region   
## 
## Root node error: 1997/4001 = 0.49913
## 
## n= 4001 
## 
##         CP nsplit rel error xerror     xstd
## 1 0.027541      0   1.00000 1.0386 0.015826
## 2 0.011017      1   0.97246 1.0265 0.015832
## 3 0.010000      2   0.96144 1.0220 0.015834
## Warning in rsq.rpart(model2): may not be applicable for this method

We see that in Model 2 even with adding more variables to our model the accuracy is same.

predictions <- predict(model2, Btrain.data, type = "class")

# Compare the predicted values to the actual values
actual_values <- Btrain.data$`Sales Channel`

# Calculate accuracy
correct_predictions <- sum(predictions == actual_values)
total_predictions <- length(predictions)
accuracy <- correct_predictions / total_predictions

# Display the accuracy
accuracy
## [1] 0.52012

Model 3

We run a Random forest model in our third modeling.

Some of the column had to be renamed as I was getting error while running the randomForest function.

train2 <- Btrain.data
colnames(train2)[3] <- "ItemType"
colnames(train2)[5] <- "OrderPriority"

test2 <- Btest.data
colnames(test2)[3] <- "ItemType"
colnames(test2)[5] <- "OrderPriority"

model3 <- randomForest(`Sales Channel` ~ Region + ItemType + OrderPriority, importance = TRUE, train2)

From the below plot we see the error decreases drastically with increase in the number of trees

plot(model3)

We calculate the accuracy for Randomforest model

predictions <- predict(model3, newdata = test2)

There is a slight improvement in terms of accuracy when we use Random forest model.

confusion_matrix <- table(Actual_Label = test2$`Sales Channel`, Predicted_Label = predictions)
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
accuracy
## [1] 0.5255255

From all of the 3 models above I see the accuracy is almost similar in all the 3 models, I choose Model 3 Randomforest as it performs slightly better when compared to other 2 models.