Data622 - Homework 2

Decision Trees Algorithms

Pre-work

Read this blog: https://decizone.com/blog/the-good-the-bad-the-ugly-of-using-decision-trees which shows some of the issues with decision trees
Choose a dataset from a source in Assignment #1, or another dataset of your choice

Assignment work

Based on the latest topics presented, choose a dataset of your choice and create a Decision Tree where you can solve a classification problem and predict the outcome of a particular feature or detail of the data used.
Switch variables* to generate 2 decision trees and compare the results. Create a random forest and analyze the results.
Based on real cases where desicion trees went wrong, and ‘the bad & ugly’ aspects of decision trees (https://decizone.com/blog/the-good-the-bad-the-ugly-of-using-decision-trees), how can you change this perception when using the decision tree you created to solve a real problem?

Deliverable

Essay (minimum 500 word document) Write a short essay explaining your analysis, and how you would address the concerns in the blog (listed in pre-work)
Exploratory Analysis using R or Python (submit code + errors + analysis as notebook or copy/paste to document)

Including of the required libraries

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
library(lubridate) 
library(pROC)

## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## 
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

library(caret)

## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift

library(tidymodels)

## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
## ✔ broom        1.0.5     ✔ rsample      1.2.0
## ✔ dials        1.2.0     ✔ tune         1.1.2
## ✔ infer        1.0.5     ✔ workflows    1.1.3
## ✔ modeldata    1.2.0     ✔ workflowsets 1.0.1
## ✔ parsnip      1.1.1     ✔ yardstick    1.2.0
## ✔ recipes      1.0.8     
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard()        masks purrr::discard()
## ✖ dplyr::filter()          masks stats::filter()
## ✖ recipes::fixed()         masks stringr::fixed()
## ✖ dplyr::lag()             masks stats::lag()
## ✖ caret::lift()            masks purrr::lift()
## ✖ yardstick::precision()   masks caret::precision()
## ✖ yardstick::recall()      masks caret::recall()
## ✖ yardstick::sensitivity() masks caret::sensitivity()
## ✖ yardstick::spec()        masks readr::spec()
## ✖ yardstick::specificity() masks caret::specificity()
## ✖ recipes::step()          masks stats::step()
## • Use suppressPackageStartupMessages() to eliminate package startup messages

library(rpart)

## 
## Attaching package: 'rpart'
## 
## The following object is masked from 'package:dials':
## 
##     prune

library(rpart.plot)
library(randomForest)

## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin

Dataset

For Assignment 2 I will be using the on of the datasets which I used in assignment 1

small_ds <- read_csv("https://raw.githubusercontent.com/petferns/DATA622/main/5000%20Sales%20Records.csv")

## Rows: 5000 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): Region, Country, Item Type, Sales Channel, Order Priority, Order Da...
## dbl (7): Order ID, Units Sold, Unit Price, Unit Cost, Total Revenue, Total C...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Data contains sales order details of products like baby food, personal care products, food items, fruits, etc from across the continents.

head(small_ds)

## # A tibble: 6 × 14
##   Region       Country `Item Type` `Sales Channel` `Order Priority` `Order Date`
##   <chr>        <chr>   <chr>       <chr>           <chr>            <chr>       
## 1 Central Ame… Antigu… Baby Food   Online          M                12/20/2013  
## 2 Central Ame… Panama  Snacks      Offline         C                7/5/2010    
## 3 Europe       Czech … Beverages   Offline         C                9/12/2011   
## 4 Asia         North … Cereal      Offline         L                5/13/2010   
## 5 Asia         Sri La… Snacks      Offline         C                7/20/2015   
## 6 Middle East… Morocco Personal C… Offline         L                11/8/2010   
## # ℹ 8 more variables: `Order ID` <dbl>, `Ship Date` <chr>, `Units Sold` <dbl>,
## #   `Unit Price` <dbl>, `Unit Cost` <dbl>, `Total Revenue` <dbl>,
## #   `Total Cost` <dbl>, `Total Profit` <dbl>

Some columns needs conversion and some other needs to be factored.

Order Date, Ship Date will be converted to date type
Total Profit will be converted to numeric value
Sales Channel will be factored as it contains values as either ‘Online’ or ‘Offline’

glimpse(small_ds)

## Rows: 5,000
## Columns: 14
## $ Region           <chr> "Central America and the Caribbean", "Central America…
## $ Country          <chr> "Antigua and Barbuda", "Panama", "Czech Republic", "N…
## $ `Item Type`      <chr> "Baby Food", "Snacks", "Beverages", "Cereal", "Snacks…
## $ `Sales Channel`  <chr> "Online", "Offline", "Offline", "Offline", "Offline",…
## $ `Order Priority` <chr> "M", "C", "C", "L", "C", "L", "H", "M", "M", "M", "C"…
## $ `Order Date`     <chr> "12/20/2013", "7/5/2010", "9/12/2011", "5/13/2010", "…
## $ `Order ID`       <dbl> 957081544, 301644504, 478051030, 892599952, 571902596…
## $ `Ship Date`      <chr> "1/11/2014", "7/26/2010", "9/29/2011", "6/15/2010", "…
## $ `Units Sold`     <dbl> 552, 2167, 4778, 9016, 7542, 48, 8258, 927, 8841, 981…
## $ `Unit Price`     <dbl> 255.28, 152.58, 47.45, 205.70, 152.58, 81.73, 109.28,…
## $ `Unit Cost`      <dbl> 159.42, 97.44, 31.79, 117.11, 97.44, 56.67, 35.84, 35…
## $ `Total Revenue`  <dbl> 140914.56, 330640.86, 226716.10, 1854591.20, 1150758.…
## $ `Total Cost`     <dbl> 87999.84, 211152.48, 151892.62, 1055863.76, 734892.48…
## $ `Total Profit`   <dbl> 52914.72, 119488.38, 74823.48, 798727.44, 415865.88, …

small_ds[['Order Date']] <- as.Date(small_ds[['Order Date']], "%m/%d/%Y")
small_ds[['Ship Date']] <- as.Date(small_ds[['Ship Date']], "%m/%d/%Y")

small_ds[['Sales Channel']] <- as.factor(small_ds[['Sales Channel']])


small_ds[['Total Profit']] <- as.numeric(small_ds[['Total Profit']])

Modelling

We partition the dataset into training and testing sets in 80:20 proportion

From my models I would like to predict Sales Channel based on variables Region, Item Type, Order Priority and Total Profit.

set.seed(1234)

Btraining.samples <- small_ds$`Sales Channel` %>% 
  createDataPartition(p = 0.8, list=FALSE)

Btrain.data <- small_ds[Btraining.samples,]
Btest.data <- small_ds[-Btraining.samples,]

Deision Tree

Model 1

Model 1 we run the first decision tree model to predict Sales Channel with choosing the variable Region

model1 <- rpart(`Sales Channel` ~ `Region` , data = Btrain.data)

# Plot the complexity parameter table
prp(model1, type = 4, extra = 1, nn = TRUE)

From the below plots we see the R-suqare increased with increase in number of splits and XRelative error decreases drastically as the number of splits increases.

par(mfrow=c(1,2)) 
rsq.rpart(model1)

## 
## Classification tree:
## rpart(formula = `Sales Channel` ~ Region, data = Btrain.data)
## 
## Variables actually used in tree construction:
## [1] Region
## 
## Root node error: 1997/4001 = 0.49913
## 
## n= 4001 
## 
##         CP nsplit rel error xerror     xstd
## 1 0.030045      0   1.00000 1.0300 0.015831
## 2 0.010000      1   0.96995 1.0025 0.015837

## Warning in rsq.rpart(model1): may not be applicable for this method

Lets calculate the accuracy for Model1

predictions <- predict(model1, Btrain.data, type = "class")

# Compare the predicted values to the actual values
actual_values <- Btrain.data$`Sales Channel`

# Calculate accuracy
correct_predictions <- sum(predictions == actual_values)
total_predictions <- length(predictions)
accuracy <- correct_predictions / total_predictions

# Display the accuracy
accuracy

## [1] 0.515871

We see our Model is able to predict around 51% accurately, this accuracy is low.

Model 2

In Model 2 lets try to include few more variables to our second decision tree and see how it helps in improving the accuracy.

model2 <- rpart(`Sales Channel` ~ `Region` + `Item Type` + `Order Priority` ,
   data= Btrain.data)
summary(model2)

## Call:
## rpart(formula = `Sales Channel` ~ Region + `Item Type` + `Order Priority`, 
##     data = Btrain.data)
##   n= 4001 
## 
##           CP nsplit rel error   xerror       xstd
## 1 0.02754131      0 1.0000000 1.038558 0.01582643
## 2 0.01101652      1 0.9724587 1.026540 0.01583227
## 3 0.01000000      2 0.9614422 1.022033 0.01583387
## 
## Variable importance
## Item Type    Region 
##        53        47 
## 
## Node number 1: 4001 observations,    complexity param=0.02754131
##   predicted class=Offline  expected loss=0.4991252  P(node) =1
##     class counts:  2004  1997
##    probabilities: 0.501 0.499 
##   left son=2 (1050 obs) right son=3 (2951 obs)
##   Primary splits:
##       Item Type      splits as  LRRRRLRRRRRL, improve=2.336891, (0 missing)
##       Region         splits as  RLLLLLR,      improve=2.047659, (0 missing)
##       Order Priority splits as  LRRL,         improve=0.325755, (0 missing)
## 
## Node number 2: 1050 observations
##   predicted class=Offline  expected loss=0.4704762  P(node) =0.2624344
##     class counts:   556   494
##    probabilities: 0.530 0.470 
## 
## Node number 3: 2951 observations,    complexity param=0.01101652
##   predicted class=Online   expected loss=0.4906811  P(node) =0.7375656
##     class counts:  1448  1503
##    probabilities: 0.491 0.509 
##   left son=6 (1756 obs) right son=7 (1195 obs)
##   Primary splits:
##       Region         splits as  RLLLLLR,      improve=2.1060280, (0 missing)
##       Item Type      splits as  -LRLL-LLLLL-, improve=0.4288556, (0 missing)
##       Order Priority splits as  RLLL,         improve=0.2182987, (0 missing)
## 
## Node number 6: 1756 observations
##   predicted class=Offline  expected loss=0.4937358  P(node) =0.4388903
##     class counts:   889   867
##    probabilities: 0.506 0.494 
## 
## Node number 7: 1195 observations
##   predicted class=Online   expected loss=0.4677824  P(node) =0.2986753
##     class counts:   559   636
##    probabilities: 0.468 0.532

prp(model2, type = 4, extra = 1, nn = TRUE)

We see from the the graphs of cross validation results , in Model 2 the R-square increases with increase in Number of splits and also the XRelative error decreases with increase in number of splits.

par(mfrow=c(1,2))
rsq.rpart(model2)

## 
## Classification tree:
## rpart(formula = `Sales Channel` ~ Region + `Item Type` + `Order Priority`, 
##     data = Btrain.data)
## 
## Variables actually used in tree construction:
## [1] Item Type Region   
## 
## Root node error: 1997/4001 = 0.49913
## 
## n= 4001 
## 
##         CP nsplit rel error xerror     xstd
## 1 0.027541      0   1.00000 1.0386 0.015826
## 2 0.011017      1   0.97246 1.0265 0.015832
## 3 0.010000      2   0.96144 1.0220 0.015834

## Warning in rsq.rpart(model2): may not be applicable for this method

We see that in Model 2 even with adding more variables to our model the accuracy is same.

predictions <- predict(model2, Btrain.data, type = "class")

# Compare the predicted values to the actual values
actual_values <- Btrain.data$`Sales Channel`

# Calculate accuracy
correct_predictions <- sum(predictions == actual_values)
total_predictions <- length(predictions)
accuracy <- correct_predictions / total_predictions

# Display the accuracy
accuracy

## [1] 0.52012

Model 3

We run a Random forest model in our third modeling.

Some of the column had to be renamed as I was getting error while running the randomForest function.

train2 <- Btrain.data
colnames(train2)[3] <- "ItemType"
colnames(train2)[5] <- "OrderPriority"

test2 <- Btest.data
colnames(test2)[3] <- "ItemType"
colnames(test2)[5] <- "OrderPriority"

model3 <- randomForest(`Sales Channel` ~ Region + ItemType + OrderPriority, importance = TRUE, train2)

From the below plot we see the error decreases drastically with increase in the number of trees

plot(model3)

We calculate the accuracy for Randomforest model

predictions <- predict(model3, newdata = test2)

There is a slight improvement in terms of accuracy when we use Random forest model.

confusion_matrix <- table(Actual_Label = test2$`Sales Channel`, Predicted_Label = predictions)
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
accuracy

## [1] 0.5255255

From all of the 3 models above I see the accuracy is almost similar in all the 3 models, I choose Model 3 Randomforest as it performs slightly better when compared to other 2 models.