library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(lubridate)
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
##
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
## ✔ broom 1.0.5 ✔ rsample 1.2.0
## ✔ dials 1.2.0 ✔ tune 1.1.2
## ✔ infer 1.0.5 ✔ workflows 1.1.3
## ✔ modeldata 1.2.0 ✔ workflowsets 1.0.1
## ✔ parsnip 1.1.1 ✔ yardstick 1.2.0
## ✔ recipes 1.0.8
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ caret::lift() masks purrr::lift()
## ✖ yardstick::precision() masks caret::precision()
## ✖ yardstick::recall() masks caret::recall()
## ✖ yardstick::sensitivity() masks caret::sensitivity()
## ✖ yardstick::spec() masks readr::spec()
## ✖ yardstick::specificity() masks caret::specificity()
## ✖ recipes::step() masks stats::step()
## • Use suppressPackageStartupMessages() to eliminate package startup messages
library(rpart)
##
## Attaching package: 'rpart'
##
## The following object is masked from 'package:dials':
##
## prune
library(rpart.plot)
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:dplyr':
##
## combine
##
## The following object is masked from 'package:ggplot2':
##
## margin
For Assignment 2 I will be using the on of the datasets which I used in assignment 1
small_ds <- read_csv("https://raw.githubusercontent.com/petferns/DATA622/main/5000%20Sales%20Records.csv")
## Rows: 5000 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): Region, Country, Item Type, Sales Channel, Order Priority, Order Da...
## dbl (7): Order ID, Units Sold, Unit Price, Unit Cost, Total Revenue, Total C...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Data contains sales order details of products like baby food, personal care products, food items, fruits, etc from across the continents.
head(small_ds)
## # A tibble: 6 × 14
## Region Country `Item Type` `Sales Channel` `Order Priority` `Order Date`
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Central Ame… Antigu… Baby Food Online M 12/20/2013
## 2 Central Ame… Panama Snacks Offline C 7/5/2010
## 3 Europe Czech … Beverages Offline C 9/12/2011
## 4 Asia North … Cereal Offline L 5/13/2010
## 5 Asia Sri La… Snacks Offline C 7/20/2015
## 6 Middle East… Morocco Personal C… Offline L 11/8/2010
## # ℹ 8 more variables: `Order ID` <dbl>, `Ship Date` <chr>, `Units Sold` <dbl>,
## # `Unit Price` <dbl>, `Unit Cost` <dbl>, `Total Revenue` <dbl>,
## # `Total Cost` <dbl>, `Total Profit` <dbl>
Some columns needs conversion and some other needs to be factored.
Order Date, Ship Date will be converted to date typeTotal Profit will be converted to numeric valueSales Channel will be factored as it contains values as either ‘Online’ or ‘Offline’glimpse(small_ds)
## Rows: 5,000
## Columns: 14
## $ Region <chr> "Central America and the Caribbean", "Central America…
## $ Country <chr> "Antigua and Barbuda", "Panama", "Czech Republic", "N…
## $ `Item Type` <chr> "Baby Food", "Snacks", "Beverages", "Cereal", "Snacks…
## $ `Sales Channel` <chr> "Online", "Offline", "Offline", "Offline", "Offline",…
## $ `Order Priority` <chr> "M", "C", "C", "L", "C", "L", "H", "M", "M", "M", "C"…
## $ `Order Date` <chr> "12/20/2013", "7/5/2010", "9/12/2011", "5/13/2010", "…
## $ `Order ID` <dbl> 957081544, 301644504, 478051030, 892599952, 571902596…
## $ `Ship Date` <chr> "1/11/2014", "7/26/2010", "9/29/2011", "6/15/2010", "…
## $ `Units Sold` <dbl> 552, 2167, 4778, 9016, 7542, 48, 8258, 927, 8841, 981…
## $ `Unit Price` <dbl> 255.28, 152.58, 47.45, 205.70, 152.58, 81.73, 109.28,…
## $ `Unit Cost` <dbl> 159.42, 97.44, 31.79, 117.11, 97.44, 56.67, 35.84, 35…
## $ `Total Revenue` <dbl> 140914.56, 330640.86, 226716.10, 1854591.20, 1150758.…
## $ `Total Cost` <dbl> 87999.84, 211152.48, 151892.62, 1055863.76, 734892.48…
## $ `Total Profit` <dbl> 52914.72, 119488.38, 74823.48, 798727.44, 415865.88, …
small_ds[['Order Date']] <- as.Date(small_ds[['Order Date']], "%m/%d/%Y")
small_ds[['Ship Date']] <- as.Date(small_ds[['Ship Date']], "%m/%d/%Y")
small_ds[['Sales Channel']] <- as.factor(small_ds[['Sales Channel']])
small_ds[['Total Profit']] <- as.numeric(small_ds[['Total Profit']])
We partition the dataset into training and testing sets in 80:20 proportion
From my models I would like to predict Sales Channel based on variables Region, Item Type, Order Priority and Total Profit.
set.seed(1234)
Btraining.samples <- small_ds$`Sales Channel` %>%
createDataPartition(p = 0.8, list=FALSE)
Btrain.data <- small_ds[Btraining.samples,]
Btest.data <- small_ds[-Btraining.samples,]
Model 1 we run the first decision tree model to predict Sales Channel with choosing the variable Region
model1 <- rpart(`Sales Channel` ~ `Region` , data = Btrain.data)
# Plot the complexity parameter table
prp(model1, type = 4, extra = 1, nn = TRUE)
From the below plots we see the R-suqare increased with increase in number of splits and XRelative error decreases drastically as the number of splits increases.
par(mfrow=c(1,2))
rsq.rpart(model1)
##
## Classification tree:
## rpart(formula = `Sales Channel` ~ Region, data = Btrain.data)
##
## Variables actually used in tree construction:
## [1] Region
##
## Root node error: 1997/4001 = 0.49913
##
## n= 4001
##
## CP nsplit rel error xerror xstd
## 1 0.030045 0 1.00000 1.0300 0.015831
## 2 0.010000 1 0.96995 1.0025 0.015837
## Warning in rsq.rpart(model1): may not be applicable for this method
Lets calculate the accuracy for Model1
predictions <- predict(model1, Btrain.data, type = "class")
# Compare the predicted values to the actual values
actual_values <- Btrain.data$`Sales Channel`
# Calculate accuracy
correct_predictions <- sum(predictions == actual_values)
total_predictions <- length(predictions)
accuracy <- correct_predictions / total_predictions
# Display the accuracy
accuracy
## [1] 0.515871
We see our Model is able to predict around 51% accurately, this accuracy is low.
In Model 2 lets try to include few more variables to our second decision tree and see how it helps in improving the accuracy.
model2 <- rpart(`Sales Channel` ~ `Region` + `Item Type` + `Order Priority` ,
data= Btrain.data)
summary(model2)
## Call:
## rpart(formula = `Sales Channel` ~ Region + `Item Type` + `Order Priority`,
## data = Btrain.data)
## n= 4001
##
## CP nsplit rel error xerror xstd
## 1 0.02754131 0 1.0000000 1.038558 0.01582643
## 2 0.01101652 1 0.9724587 1.026540 0.01583227
## 3 0.01000000 2 0.9614422 1.022033 0.01583387
##
## Variable importance
## Item Type Region
## 53 47
##
## Node number 1: 4001 observations, complexity param=0.02754131
## predicted class=Offline expected loss=0.4991252 P(node) =1
## class counts: 2004 1997
## probabilities: 0.501 0.499
## left son=2 (1050 obs) right son=3 (2951 obs)
## Primary splits:
## Item Type splits as LRRRRLRRRRRL, improve=2.336891, (0 missing)
## Region splits as RLLLLLR, improve=2.047659, (0 missing)
## Order Priority splits as LRRL, improve=0.325755, (0 missing)
##
## Node number 2: 1050 observations
## predicted class=Offline expected loss=0.4704762 P(node) =0.2624344
## class counts: 556 494
## probabilities: 0.530 0.470
##
## Node number 3: 2951 observations, complexity param=0.01101652
## predicted class=Online expected loss=0.4906811 P(node) =0.7375656
## class counts: 1448 1503
## probabilities: 0.491 0.509
## left son=6 (1756 obs) right son=7 (1195 obs)
## Primary splits:
## Region splits as RLLLLLR, improve=2.1060280, (0 missing)
## Item Type splits as -LRLL-LLLLL-, improve=0.4288556, (0 missing)
## Order Priority splits as RLLL, improve=0.2182987, (0 missing)
##
## Node number 6: 1756 observations
## predicted class=Offline expected loss=0.4937358 P(node) =0.4388903
## class counts: 889 867
## probabilities: 0.506 0.494
##
## Node number 7: 1195 observations
## predicted class=Online expected loss=0.4677824 P(node) =0.2986753
## class counts: 559 636
## probabilities: 0.468 0.532
prp(model2, type = 4, extra = 1, nn = TRUE)
We see from the the graphs of cross validation results , in Model 2 the R-square increases with increase in Number of splits and also the XRelative error decreases with increase in number of splits.
par(mfrow=c(1,2))
rsq.rpart(model2)
##
## Classification tree:
## rpart(formula = `Sales Channel` ~ Region + `Item Type` + `Order Priority`,
## data = Btrain.data)
##
## Variables actually used in tree construction:
## [1] Item Type Region
##
## Root node error: 1997/4001 = 0.49913
##
## n= 4001
##
## CP nsplit rel error xerror xstd
## 1 0.027541 0 1.00000 1.0386 0.015826
## 2 0.011017 1 0.97246 1.0265 0.015832
## 3 0.010000 2 0.96144 1.0220 0.015834
## Warning in rsq.rpart(model2): may not be applicable for this method
We see that in Model 2 even with adding more variables to our model the accuracy is same.
predictions <- predict(model2, Btrain.data, type = "class")
# Compare the predicted values to the actual values
actual_values <- Btrain.data$`Sales Channel`
# Calculate accuracy
correct_predictions <- sum(predictions == actual_values)
total_predictions <- length(predictions)
accuracy <- correct_predictions / total_predictions
# Display the accuracy
accuracy
## [1] 0.52012
We run a Random forest model in our third modeling.
Some of the column had to be renamed as I was getting error while running the randomForest function.
train2 <- Btrain.data
colnames(train2)[3] <- "ItemType"
colnames(train2)[5] <- "OrderPriority"
test2 <- Btest.data
colnames(test2)[3] <- "ItemType"
colnames(test2)[5] <- "OrderPriority"
model3 <- randomForest(`Sales Channel` ~ Region + ItemType + OrderPriority, importance = TRUE, train2)
From the below plot we see the error decreases drastically with increase in the number of trees
plot(model3)
We calculate the accuracy for Randomforest model
predictions <- predict(model3, newdata = test2)
There is a slight improvement in terms of accuracy when we use Random forest model.
confusion_matrix <- table(Actual_Label = test2$`Sales Channel`, Predicted_Label = predictions)
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
accuracy
## [1] 0.5255255
From all of the 3 models above I see the accuracy is almost similar in all the 3 models, I choose Model 3 Randomforest as it performs slightly better when compared to other 2 models.