Exploratory analysis and essay

Pre-work

  • Visit the following website and explore the range of dataset sizes available (from 100 to 5 million records): https://excelbianalytics.com/wp/downloads-18-sample-csv-files-data-sets-for-testing-sales/ or (new) https://www.kaggle.com/datasets
  • Based on your computer’s capabilities (memory, CPU), select 2 files you can handle (one small and one large is recommended)
  • Download the files
  • Review the structure and content of the tables, and think about the data sets (structure, size, dependencies, labels, etc.)
  • Consider the similarities and differences in the two data sets you have downloaded
  • Think about how to analyze and predict an outcome based on the datasets available
  • Based on the data you have, consider which two machine learning algorithms presented so far could be used to analyze the data

Deliverable

  • Essay (minimum 500 words): write a short essay explaining your selection of algorithms, how they relate to the data, and what you are trying to do
  • Exploratory analysis using R or Python (submit code + errors + analysis as a notebook, or copy/paste into a document): explore how to analyze and predict an outcome based on the data available. This is an exploratory exercise, so feel free to show errors and warnings that arise during the analysis. Test the code with both selected datasets and compare the results.

Answer questions such as:

  • Are the columns of your data correlated?
  • Are there labels in your data? Did that impact your choice of algorithm?
  • What are the pros and cons of each algorithm you selected?
  • How does your choice of algorithm relate to the datasets (was it influenced by the datasets you chose)?
  • Which result will you trust if you need to make a business decision?
  • Do you think an analysis could be prone to errors when using too much data, or when using as little as possible?
  • How does the analysis between data sets compare?

Loading the required libraries

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(lubridate) 
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## 
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
## ✔ broom        1.0.5     ✔ rsample      1.2.0
## ✔ dials        1.2.0     ✔ tune         1.1.2
## ✔ infer        1.0.5     ✔ workflows    1.1.3
## ✔ modeldata    1.2.0     ✔ workflowsets 1.0.1
## ✔ parsnip      1.1.1     ✔ yardstick    1.2.0
## ✔ recipes      1.0.8     
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard()        masks purrr::discard()
## ✖ dplyr::filter()          masks stats::filter()
## ✖ recipes::fixed()         masks stringr::fixed()
## ✖ dplyr::lag()             masks stats::lag()
## ✖ caret::lift()            masks purrr::lift()
## ✖ yardstick::precision()   masks caret::precision()
## ✖ yardstick::recall()      masks caret::recall()
## ✖ yardstick::sensitivity() masks caret::sensitivity()
## ✖ yardstick::spec()        masks readr::spec()
## ✖ yardstick::specificity() masks caret::specificity()
## ✖ recipes::step()          masks stats::step()
## • Learn how to get started at https://www.tidymodels.org/start/

Loading of the datasets

I selected my datasets from https://excelbianalytics.com/: for the smaller dataset I chose the file with 5,000 records, and for the larger one the file with 100,000. I uploaded both to my GitHub account and loaded them with the read_csv() function.

small_ds <- read_csv("https://raw.githubusercontent.com/petferns/DATA622/main/5000%20Sales%20Records.csv")
## Rows: 5000 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): Region, Country, Item Type, Sales Channel, Order Priority, Order Da...
## dbl (7): Order ID, Units Sold, Unit Price, Unit Cost, Total Revenue, Total C...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
big_ds <- read_csv("https://raw.githubusercontent.com/petferns/DATA622/main/100000%20Sales%20Records.csv")
## Rows: 100000 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): Region, Country, Item Type, Sales Channel, Order Priority, Order Da...
## dbl (7): Order ID, Units Sold, Unit Price, Unit Cost, Total Revenue, Total C...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
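
As the read_csv messages suggest, the column types can be specified up front, or the message silenced; a minimal sketch (not run here, using the same URL as above):

# Silence the column-spec message on future reads (readr option)
small_ds <- read_csv(
  "https://raw.githubusercontent.com/petferns/DATA622/main/5000%20Sales%20Records.csv",
  show_col_types = FALSE
)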

The data contains sales order details for products such as baby food, personal care products, food items, and fruits, from across the continents.

head(small_ds)
## # A tibble: 6 × 14
##   Region       Country `Item Type` `Sales Channel` `Order Priority` `Order Date`
##   <chr>        <chr>   <chr>       <chr>           <chr>            <chr>       
## 1 Central Ame… Antigu… Baby Food   Online          M                12/20/2013  
## 2 Central Ame… Panama  Snacks      Offline         C                7/5/2010    
## 3 Europe       Czech … Beverages   Offline         C                9/12/2011   
## 4 Asia         North … Cereal      Offline         L                5/13/2010   
## 5 Asia         Sri La… Snacks      Offline         C                7/20/2015   
## 6 Middle East… Morocco Personal C… Offline         L                11/8/2010   
## # ℹ 8 more variables: `Order ID` <dbl>, `Ship Date` <chr>, `Units Sold` <dbl>,
## #   `Unit Price` <dbl>, `Unit Cost` <dbl>, `Total Revenue` <dbl>,
## #   `Total Cost` <dbl>, `Total Profit` <dbl>
head(big_ds)
## # A tibble: 6 × 14
##   Region       Country `Item Type` `Sales Channel` `Order Priority` `Order Date`
##   <chr>        <chr>   <chr>       <chr>           <chr>            <chr>       
## 1 Middle East… Azerba… Snacks      Online          C                10/8/2014   
## 2 Central Ame… Panama  Cosmetics   Offline         L                2/22/2015   
## 3 Sub-Saharan… Sao To… Fruits      Offline         M                12/9/2015   
## 4 Sub-Saharan… Sao To… Personal C… Online          M                9/17/2014   
## 5 Central Ame… Belize  Household   Offline         H                2/4/2010    
## 6 Europe       Denmark Clothes     Online          C                2/20/2013   
## # ℹ 8 more variables: `Order ID` <dbl>, `Ship Date` <chr>, `Units Sold` <dbl>,
## #   `Unit Price` <dbl>, `Unit Cost` <dbl>, `Total Revenue` <dbl>,
## #   `Total Cost` <dbl>, `Total Profit` <dbl>

From a glimpse of the data (below) we see that certain columns need type conversion: ‘Order Date’ and ‘Ship Date’ will be converted to the Date type, and ‘Sales Channel’ will be converted to a factor since it contains only Online or Offline. This will be done for both datasets.

glimpse(small_ds)
## Rows: 5,000
## Columns: 14
## $ Region           <chr> "Central America and the Caribbean", "Central America…
## $ Country          <chr> "Antigua and Barbuda", "Panama", "Czech Republic", "N…
## $ `Item Type`      <chr> "Baby Food", "Snacks", "Beverages", "Cereal", "Snacks…
## $ `Sales Channel`  <chr> "Online", "Offline", "Offline", "Offline", "Offline",…
## $ `Order Priority` <chr> "M", "C", "C", "L", "C", "L", "H", "M", "M", "M", "C"…
## $ `Order Date`     <chr> "12/20/2013", "7/5/2010", "9/12/2011", "5/13/2010", "…
## $ `Order ID`       <dbl> 957081544, 301644504, 478051030, 892599952, 571902596…
## $ `Ship Date`      <chr> "1/11/2014", "7/26/2010", "9/29/2011", "6/15/2010", "…
## $ `Units Sold`     <dbl> 552, 2167, 4778, 9016, 7542, 48, 8258, 927, 8841, 981…
## $ `Unit Price`     <dbl> 255.28, 152.58, 47.45, 205.70, 152.58, 81.73, 109.28,…
## $ `Unit Cost`      <dbl> 159.42, 97.44, 31.79, 117.11, 97.44, 56.67, 35.84, 35…
## $ `Total Revenue`  <dbl> 140914.56, 330640.86, 226716.10, 1854591.20, 1150758.…
## $ `Total Cost`     <dbl> 87999.84, 211152.48, 151892.62, 1055863.76, 734892.48…
## $ `Total Profit`   <dbl> 52914.72, 119488.38, 74823.48, 798727.44, 415865.88, …
glimpse(big_ds)
## Rows: 100,000
## Columns: 14
## $ Region           <chr> "Middle East and North Africa", "Central America and …
## $ Country          <chr> "Azerbaijan", "Panama", "Sao Tome and Principe", "Sao…
## $ `Item Type`      <chr> "Snacks", "Cosmetics", "Fruits", "Personal Care", "Ho…
## $ `Sales Channel`  <chr> "Online", "Offline", "Offline", "Online", "Offline", …
## $ `Order Priority` <chr> "C", "L", "M", "M", "H", "C", "M", "C", "H", "H", "C"…
## $ `Order Date`     <chr> "10/8/2014", "2/22/2015", "12/9/2015", "9/17/2014", "…
## $ `Order ID`       <dbl> 535113847, 874708545, 854349935, 892836844, 129280602…
## $ `Ship Date`      <chr> "10/23/2014", "2/27/2015", "1/18/2016", "10/12/2014",…
## $ `Units Sold`     <dbl> 934, 4551, 9986, 9118, 5858, 1149, 7964, 6307, 8217, …
## $ `Unit Price`     <dbl> 152.58, 437.20, 9.33, 81.73, 668.27, 109.28, 437.20, …
## $ `Unit Cost`      <dbl> 97.44, 263.33, 6.92, 56.67, 502.54, 35.84, 263.33, 6.…
## $ `Total Revenue`  <dbl> 142509.72, 1989697.20, 93169.38, 745214.14, 3914725.6…
## $ `Total Cost`     <dbl> 91008.96, 1198414.83, 69103.12, 516717.06, 2943879.32…
## $ `Total Profit`   <dbl> 51500.76, 791282.37, 24066.26, 228497.08, 970846.34, …
# Convert the date columns from character to Date (month/day/year format)
small_ds[['Order Date']] <- as.Date(small_ds[['Order Date']], "%m/%d/%Y")
small_ds[['Ship Date']] <- as.Date(small_ds[['Ship Date']], "%m/%d/%Y")

big_ds[['Order Date']] <- as.Date(big_ds[['Order Date']], "%m/%d/%Y")
big_ds[['Ship Date']] <- as.Date(big_ds[['Ship Date']], "%m/%d/%Y")

# Convert Sales Channel (Online/Offline) to a factor
small_ds[['Sales Channel']] <- as.factor(small_ds[['Sales Channel']])
big_ds[['Sales Channel']] <- as.factor(big_ds[['Sales Channel']])

# Total Profit is already numeric (dbl), so this is a no-op safeguard
small_ds[['Total Profit']] <- as.numeric(small_ds[['Total Profit']])
big_ds[['Total Profit']] <- as.numeric(big_ds[['Total Profit']])
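
The same conversions can be written more compactly in dplyr style; an equivalent sketch (not run here), using a hypothetical helper convert_types():

# Hypothetical helper applying the same conversions to either dataset
convert_types <- function(ds) {
  ds %>%
    mutate(
      `Order Date`    = as.Date(`Order Date`, "%m/%d/%Y"),
      `Ship Date`     = as.Date(`Ship Date`, "%m/%d/%Y"),
      `Sales Channel` = as.factor(`Sales Channel`)
    )
}
# small_ds <- convert_types(small_ds)
# big_ds   <- convert_types(big_ds)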

We see from the summaries that there aren’t any missing values and that the datasets contain order details from 2010 to 2017.

summary(small_ds)
##     Region            Country           Item Type         Sales Channel 
##  Length:5000        Length:5000        Length:5000        Offline:2504  
##  Class :character   Class :character   Class :character   Online :2496  
##  Mode  :character   Mode  :character   Mode  :character                 
##                                                                         
##                                                                         
##                                                                         
##  Order Priority       Order Date            Order ID        
##  Length:5000        Min.   :2010-01-01   Min.   :100090873  
##  Class :character   1st Qu.:2011-12-08   1st Qu.:320104217  
##  Mode  :character   Median :2013-10-23   Median :552314960  
##                     Mean   :2013-10-19   Mean   :548644737  
##                     3rd Qu.:2015-09-08   3rd Qu.:768770944  
##                     Max.   :2017-07-28   Max.   :999879729  
##    Ship Date            Units Sold     Unit Price       Unit Cost     
##  Min.   :2010-01-06   Min.   :   2   Min.   :  9.33   Min.   :  6.92  
##  1st Qu.:2012-01-06   1st Qu.:2453   1st Qu.: 81.73   1st Qu.: 35.84  
##  Median :2013-11-14   Median :5123   Median :154.06   Median : 97.44  
##  Mean   :2013-11-13   Mean   :5031   Mean   :265.75   Mean   :187.49  
##  3rd Qu.:2015-10-03   3rd Qu.:7576   3rd Qu.:437.20   3rd Qu.:263.33  
##  Max.   :2017-08-31   Max.   :9999   Max.   :668.27   Max.   :524.96  
##  Total Revenue       Total Cost       Total Profit      
##  Min.   :     65   Min.   :     48   Min.   :     16.9  
##  1st Qu.: 257417   1st Qu.: 154748   1st Qu.:  85339.3  
##  Median : 779409   Median : 468181   Median : 279095.2  
##  Mean   :1325738   Mean   : 933093   Mean   : 392644.6  
##  3rd Qu.:1839975   3rd Qu.:1189578   3rd Qu.: 565106.4  
##  Max.   :6672676   Max.   :5248025   Max.   :1726007.5
summary(big_ds)
##     Region            Country           Item Type         Sales Channel  
##  Length:100000      Length:100000      Length:100000      Offline:49946  
##  Class :character   Class :character   Class :character   Online :50054  
##  Mode  :character   Mode  :character   Mode  :character                  
##                                                                          
##                                                                          
##                                                                          
##  Order Priority       Order Date            Order ID        
##  Length:100000      Min.   :2010-01-01   Min.   :100008904  
##  Class :character   1st Qu.:2011-11-25   1st Qu.:326046383  
##  Mode  :character   Median :2013-10-15   Median :547718512  
##                     Mean   :2013-10-15   Mean   :550395554  
##                     3rd Qu.:2015-09-07   3rd Qu.:775078534  
##                     Max.   :2017-07-28   Max.   :999996459  
##    Ship Date            Units Sold      Unit Price       Unit Cost     
##  Min.   :2010-01-02   Min.   :    1   Min.   :  9.33   Min.   :  6.92  
##  1st Qu.:2011-12-21   1st Qu.: 2505   1st Qu.:109.28   1st Qu.: 56.67  
##  Median :2013-11-09   Median : 5007   Median :205.70   Median :117.11  
##  Mean   :2013-11-09   Mean   : 5001   Mean   :266.70   Mean   :188.02  
##  3rd Qu.:2015-10-02   3rd Qu.: 7495   3rd Qu.:437.20   3rd Qu.:364.69  
##  Max.   :2017-09-16   Max.   :10000   Max.   :668.27   Max.   :524.96  
##  Total Revenue       Total Cost       Total Profit      
##  Min.   :     19   Min.   :     14   Min.   :      4.8  
##  1st Qu.: 279753   1st Qu.: 162928   1st Qu.:  95900.0  
##  Median : 789892   Median : 467937   Median : 283657.5  
##  Mean   :1336067   Mean   : 941975   Mean   : 394091.2  
##  3rd Qu.:1836490   3rd Qu.:1209475   3rd Qu.: 568384.1  
##  Max.   :6682700   Max.   :5249075   Max.   :1738700.0
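
One of the questions above asks whether the columns are correlated; that can be checked directly on the numeric columns (a minimal sketch, run per dataset; Order ID is excluded since it is just an identifier). Strong correlations among Total Revenue, Total Cost, and Total Profit are expected, since all three are derived from Units Sold and the unit price/cost figures.

# Correlation matrix of the numeric measures (Order ID excluded)
small_ds %>%
  select(where(is.numeric), -`Order ID`) %>%
  cor() %>%
  round(2)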

Visualizing the data by plotting total profit over the years: the plots below show that in both datasets ‘Total Profit’ is roughly constant across years, with a drastic decline in 2017. That drop is an artifact of coverage: the latest order date is in July 2017, so the 2017 bar covers only part of a year.

small_ds_plt <- small_ds %>%
  mutate(Year = year(`Order Date`)) %>%
  group_by(Year) %>%
  summarize(ProfitPerYear = sum(`Total Profit`))

ggplot(small_ds_plt, aes(x = Year, y = ProfitPerYear)) +
  geom_col(fill = "blue") +   # geom_col() is shorthand for geom_bar(stat = "identity")
  labs(title = "Yearly Profit", x = "Year", y = "Total Profit")

big_ds_plt <- big_ds %>%
  mutate(Year = year(`Order Date`)) %>%
  group_by(Year) %>%
  summarize(ProfitPerYear = sum(`Total Profit`))

ggplot(big_ds_plt, aes(x = Year, y = ProfitPerYear)) +
  geom_col(fill = "blue") +   # geom_col() is shorthand for geom_bar(stat = "identity")
  labs(title = "Yearly Profit", x = "Year", y = "Total Profit")
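
The partial-year explanation for the 2017 drop can be confirmed by checking the order-date range; a quick sketch:

# Order dates stop in mid-2017, so the 2017 bars cover only part of a year
range(small_ds$`Order Date`)
range(big_ds$`Order Date`)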

Modelling

Based on the datasets, I chose a decision tree and logistic regression to analyze the data, with Sales Channel as the dependent variable and Region, Item Type, Order Priority, and Total Profit as the independent variables.

I will split each dataset in an 80:20 ratio, with 80% for the training set and 20% for the testing set.

set.seed(2222)

training.samples <- small_ds$`Sales Channel` %>% 
  createDataPartition(p = 0.8, list=FALSE)

train.data <- small_ds[training.samples,]
test.data <- small_ds[-training.samples,]

tree_spec <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("classification")

Dtree_model <- tree_spec %>%
  fit(formula = `Sales Channel` ~ `Region` + `Item Type` + `Order Priority` + `Total Profit`,
      data = train.data)
Dtree_model
## parsnip model object
## 
## n= 4001 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 4001 1997 Offline (0.5008748 0.4991252)  
##   2) Item Type=Baby Food,Beverages,Cosmetics,Household,Office Supplies,Personal Care,Vegetables 2401 1151 Offline (0.5206164 0.4793836) *
##   3) Item Type=Cereal,Clothes,Fruits,Meat,Snacks 1600  754 Online (0.4712500 0.5287500) *
importance <- Dtree_model$fit$variable.importance
importance
##    Item Type Total Profit 
##    4.6799011    0.8277575
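
The fitted tree is shallow enough to inspect visually; a sketch assuming the rpart.plot package (not loaded above) is installed:

# Plot the underlying rpart tree stored in the parsnip fit
library(rpart.plot)
rpart.plot(Dtree_model$fit)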
predictions <- predict(Dtree_model, new_data = test.data)

predictions_combined <- predictions %>% 
  mutate(true_classification = test.data$`Sales Channel`)

head(predictions_combined)
## # A tibble: 6 × 2
##   .pred_class true_classification
##   <fct>       <fct>              
## 1 Offline     Online             
## 2 Offline     Offline            
## 3 Online      Offline            
## 4 Online      Offline            
## 5 Offline     Online             
## 6 Offline     Offline
confusion_matrix <- conf_mat(data = predictions_combined,
                            estimate = .pred_class,
                            truth = true_classification)

confusion_matrix 
##           Truth
## Prediction Offline Online
##    Offline     288    296
##    Online      212    203

Calculating the accuracy from the confusion matrix

# Number of correctly predicted classes (diagonal of the confusion matrix)
correct_predictions <- 288 + 203

# Total number of predictions
all_predictions <- 288 + 203 + 212 + 296

# Calculate the accuracy
accuracy <- correct_predictions / all_predictions
accuracy
## [1] 0.4914915
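
The same figure can be obtained without transcribing the confusion-matrix cells by hand, using yardstick’s accuracy() (already attached via tidymodels); a sketch:

# Accuracy computed directly from the predictions tibble
accuracy(predictions_combined,
         truth = true_classification,
         estimate = .pred_class)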

Applying the same approach to the larger dataset

set.seed(3333)

Btraining.samples <- big_ds$`Sales Channel` %>% 
  createDataPartition(p = 0.8, list=FALSE)

# Note: the next two lines index with `training.samples` (the row indices
# created from the small dataset, likely a typo for `Btraining.samples`),
# so the training set here has only 4,001 rows; hence n= 4001 in the
# model printout below.
Btrain.data <- big_ds[training.samples,]
Btest.data <- big_ds[-training.samples,]

Btree_spec <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("classification")

Btree_model <- Btree_spec %>%
  fit(formula = `Sales Channel` ~ `Region` + `Item Type` + `Order Priority` + `Total Profit`,
      data = Btrain.data)

Btree_model
## parsnip model object
## 
## n= 4001 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 4001 1981 Online (0.4951262 0.5048738) *
# Note: these predictions come from Dtree_model (the tree fit on the small
# dataset) rather than Btree_model; the root-only Btree_model above would
# predict Online for every row.
predictions <- predict(Dtree_model, new_data = Btest.data)

predictions_combined <- predictions %>% 
  mutate(true_classification = Btest.data$`Sales Channel`)

head(predictions_combined)
## # A tibble: 6 × 2
##   .pred_class true_classification
##   <fct>       <fct>              
## 1 Online      Online             
## 2 Online      Offline            
## 3 Offline     Offline            
## 4 Online      Online             
## 5 Offline     Offline            
## 6 Offline     Offline
confusion_matrix <- conf_mat(data = predictions_combined,
                            estimate = .pred_class,
                            truth = true_classification)

confusion_matrix 
##           Truth
## Prediction Offline Online
##    Offline   27950  28111
##    Online    20015  19923
# Calculate the number of correctly predicted classes
correct_predictions <- 27950  + 19923

# Calculate the number of all predicted classes
all_predictions <- 27950  + 19923 + 20015 + 28111

# Calculate and print the accuracy
accuracy <- correct_predictions / all_predictions
accuracy
## [1] 0.4986823

Applying logistic regression. Note that the glm models below are fit on the full datasets with no train/test split, so the accuracies reported are in-sample.

glm.df.small<-glm(`Sales Channel` ~ `Region` + `Item Type` + `Order Priority` + `Total Profit`,data=small_ds, family=binomial)

summary(glm.df.small)
## 
## Call:
## glm(formula = `Sales Channel` ~ Region + `Item Type` + `Order Priority` + 
##     `Total Profit`, family = binomial, data = small_ds)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.342  -1.170  -1.042   1.179   1.343  
## 
## Coefficients:
##                                           Estimate Std. Error z value Pr(>|z|)
## (Intercept)                              1.359e-02  1.371e-01   0.099   0.9210
## RegionAustralia and Oceania             -2.468e-01  1.237e-01  -1.995   0.0460
## RegionCentral America and the Caribbean -1.468e-01  1.146e-01  -1.281   0.2003
## RegionEurope                            -1.171e-01  9.294e-02  -1.260   0.2076
## RegionMiddle East and North Africa      -1.572e-01  1.105e-01  -1.423   0.1547
## RegionNorth America                     -3.334e-01  2.098e-01  -1.589   0.1120
## RegionSub-Saharan Africa                -7.752e-02  9.355e-02  -0.829   0.4073
## `Item Type`Beverages                     1.194e-01  1.409e-01   0.848   0.3966
## `Item Type`Cereal                        3.133e-01  1.399e-01   2.239   0.0251
## `Item Type`Clothes                       2.190e-01  1.400e-01   1.565   0.1177
## `Item Type`Cosmetics                     5.210e-02  1.419e-01   0.367   0.7135
## `Item Type`Fruits                        5.866e-02  1.430e-01   0.410   0.6817
## `Item Type`Household                     1.186e-01  1.403e-01   0.846   0.3978
## `Item Type`Meat                          1.743e-01  1.400e-01   1.244   0.2133
## `Item Type`Office Supplies               7.791e-02  1.372e-01   0.568   0.5703
## `Item Type`Personal Care                -1.124e-02  1.418e-01  -0.079   0.9368
## `Item Type`Snacks                        1.617e-01  1.402e-01   1.153   0.2488
## `Item Type`Vegetables                    7.704e-03  1.384e-01   0.056   0.9556
## `Order Priority`H                        5.721e-02  8.110e-02   0.705   0.4805
## `Order Priority`L                        1.624e-02  8.194e-02   0.198   0.8429
## `Order Priority`M                       -9.006e-03  8.054e-02  -0.112   0.9110
## `Total Profit`                          -6.091e-08  1.038e-07  -0.587   0.5575
##                                          
## (Intercept)                              
## RegionAustralia and Oceania             *
## RegionCentral America and the Caribbean  
## RegionEurope                             
## RegionMiddle East and North Africa       
## RegionNorth America                      
## RegionSub-Saharan Africa                 
## `Item Type`Beverages                     
## `Item Type`Cereal                       *
## `Item Type`Clothes                       
## `Item Type`Cosmetics                     
## `Item Type`Fruits                        
## `Item Type`Household                     
## `Item Type`Meat                          
## `Item Type`Office Supplies               
## `Item Type`Personal Care                 
## `Item Type`Snacks                        
## `Item Type`Vegetables                    
## `Order Priority`H                        
## `Order Priority`L                        
## `Order Priority`M                        
## `Total Profit`                           
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6931.5  on 4999  degrees of freedom
## Residual deviance: 6913.4  on 4978  degrees of freedom
## AIC: 6957.4
## 
## Number of Fisher Scoring iterations: 3
# In-sample accuracy at a 0.5 cutoff
c1 <- confusionMatrix(as.factor(as.integer(fitted(glm.df.small) > .5)), as.factor(glm.df.small$y), positive = "1")
c1$overall[1]
## Accuracy 
##   0.5184
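
Since pROC is loaded above but otherwise unused, the logistic model can also be evaluated with an AUC, which does not depend on the 0.5 cutoff; a sketch for the small-data model (in-sample, for the same reason as the accuracy above):

# ROC curve and AUC for the small-data logistic model
roc_small <- roc(glm.df.small$y, fitted(glm.df.small))
auc(roc_small)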
glm.df.big<-glm(`Sales Channel` ~ `Region` + `Item Type` + `Order Priority` + `Total Profit`,data=big_ds, family=binomial)

summary(glm.df.big)
## 
## Call:
## glm(formula = `Sales Channel` ~ Region + `Item Type` + `Order Priority` + 
##     `Total Profit`, family = binomial, data = big_ds)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.204  -1.180   1.151   1.175   1.209  
## 
## Coefficients:
##                                           Estimate Std. Error z value Pr(>|z|)
## (Intercept)                             -1.476e-03  3.089e-02  -0.048   0.9619
## RegionAustralia and Oceania              1.274e-02  2.772e-02   0.460   0.6457
## RegionCentral America and the Caribbean -5.387e-02  2.546e-02  -2.116   0.0343
## RegionEurope                            -7.645e-03  2.073e-02  -0.369   0.7122
## RegionMiddle East and North Africa      -2.884e-02  2.435e-02  -1.184   0.2363
## RegionNorth America                     -4.650e-02  4.638e-02  -1.002   0.3161
## RegionSub-Saharan Africa                -6.576e-03  2.071e-02  -0.318   0.7508
## `Item Type`Beverages                     1.691e-02  3.234e-02   0.523   0.6009
## `Item Type`Cereal                        3.208e-02  3.085e-02   1.040   0.2983
## `Item Type`Clothes                      -2.413e-03  3.105e-02  -0.078   0.9380
## `Item Type`Cosmetics                     2.747e-03  3.220e-02   0.085   0.9320
## `Item Type`Fruits                        2.179e-02  3.281e-02   0.664   0.5066
## `Item Type`Household                     2.774e-02  3.203e-02   0.866   0.3864
## `Item Type`Meat                         -1.547e-02  3.125e-02  -0.495   0.6205
## `Item Type`Office Supplies               1.384e-02  3.104e-02   0.446   0.6557
## `Item Type`Personal Care                 3.577e-02  3.196e-02   1.119   0.2630
## `Item Type`Snacks                        5.693e-03  3.129e-02   0.182   0.8556
## `Item Type`Vegetables                    2.234e-02  3.120e-02   0.716   0.4740
## `Order Priority`H                        8.953e-04  1.791e-02   0.050   0.9601
## `Order Priority`L                        1.467e-02  1.790e-02   0.820   0.4125
## `Order Priority`M                       -4.338e-03  1.788e-02  -0.243   0.8083
## `Total Profit`                           1.191e-09  2.315e-08   0.051   0.9590
##                                          
## (Intercept)                              
## RegionAustralia and Oceania              
## RegionCentral America and the Caribbean *
## RegionEurope                             
## RegionMiddle East and North Africa       
## RegionNorth America                      
## RegionSub-Saharan Africa                 
## `Item Type`Beverages                     
## `Item Type`Cereal                        
## `Item Type`Clothes                       
## `Item Type`Cosmetics                     
## `Item Type`Fruits                        
## `Item Type`Household                     
## `Item Type`Meat                          
## `Item Type`Office Supplies               
## `Item Type`Personal Care                 
## `Item Type`Snacks                        
## `Item Type`Vegetables                    
## `Order Priority`H                        
## `Order Priority`L                        
## `Order Priority`M                        
## `Total Profit`                           
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 138629  on 99999  degrees of freedom
## Residual deviance: 138614  on 99978  degrees of freedom
## AIC: 138658
## 
## Number of Fisher Scoring iterations: 3
# Same in-sample evaluation at a 0.5 cutoff for the larger dataset
c2 <- confusionMatrix(as.factor(as.integer(fitted(glm.df.big) > .5)), as.factor(glm.df.big$y), positive = "1")
c2$overall[1]
## Accuracy 
##  0.50405

Applying the decision tree, the smaller dataset gives an accuracy of around 49% and the bigger dataset slightly higher, at nearly 50%; in other words, the tree predicts the Sales Channel correctly only about half the time from Region, Item Type, Order Priority, and Total Profit. With logistic regression, the smaller dataset reaches an accuracy of nearly 52% and the bigger dataset around 50%. Neither model does meaningfully better than random guessing on this binary outcome, which is consistent with the mostly insignificant coefficients in the logistic models: these predictors carry little information about the sales channel in either dataset.
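
For a side-by-side view, the four accuracies reported above can be collected into one table (values copied from the outputs above):

# Summary of the accuracies computed in this analysis
results <- tibble(
  model         = c("Decision tree", "Logistic regression"),
  small_dataset = c(0.491, 0.518),
  big_dataset   = c(0.499, 0.504)
)
results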