Introduction

This is the first homework assignment for Data 622. The assignment reads as follows:

Visit the following website and explore the range of sizes of this dataset (from 100 to 5 million records). https://eforexcel.com/wp/downloads-18-sample-csv-files-data-sets-for-testing-sales/

Based on your computer’s capabilities (memory, CPU), select 2 files you can handle (recommended: one small, one large). Review the structure and content of the tables, and think about which two machine learning algorithms presented so far could be used to analyze the data, and how they can be applied in the suggested environment of the datasets.

Write a short essay explaining your selection. Then, select one of the 2 algorithms and explore how to analyze and predict an outcome based on the data available. This will be an exploratory exercise, so feel free to show errors and warnings that arise during the analysis. Test the code with both datasets selected and compare the results. Which result will you trust if you need to make a business decision? Do you think an analysis could be prone to errors when using too much data, or when using the least amount possible? Develop your exploratory analysis of the data and the essay in the following 2 weeks. You’ll have until March 17 to submit both.

Data Exploration

I decided to select the 1000 Sales Records and the 100,000 Sales Records files on the site.

Setup

The following code performs the initial setup of libraries and loading of the files. I kept the files in csv.gz format, since I would be storing them on GitHub, and relied on the tidyverse’s read_csv function, which decompresses gzipped files transparently.

# Load libraries
library(tidymodels)
## Registered S3 method overwritten by 'tune':
##   method                   from   
##   required_pkgs.model_spec parsnip
## -- Attaching packages -------------------------------------- tidymodels 0.1.4 --
## v broom        0.7.12     v recipes      0.2.0 
## v dials        0.1.0      v rsample      0.1.1 
## v dplyr        1.0.8      v tibble       3.1.6 
## v ggplot2      3.3.5      v tidyr        1.2.0 
## v infer        1.0.0      v tune         0.1.6 
## v modeldata    0.1.1      v workflows    0.2.4 
## v parsnip      0.2.0      v workflowsets 0.1.0 
## v purrr        0.3.4      v yardstick    0.0.9
## -- Conflicts ----------------------------------------- tidymodels_conflicts() --
## x purrr::discard() masks scales::discard()
## x dplyr::filter()  masks stats::filter()
## x dplyr::lag()     masks stats::lag()
## x recipes::step()  masks stats::step()
## x tune::tune()     masks parsnip::tune()
## * Use tidymodels_prefer() to resolve common conflicts.
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v readr   2.1.2     v forcats 0.5.1
## v stringr 1.4.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x readr::col_factor() masks scales::col_factor()
## x purrr::discard()    masks scales::discard()
## x dplyr::filter()     masks stats::filter()
## x stringr::fixed()    masks recipes::fixed()
## x dplyr::lag()        masks stats::lag()
## x readr::spec()       masks yardstick::spec()
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following objects are masked from 'package:yardstick':
## 
##     precision, recall, sensitivity, specificity
## The following object is masked from 'package:purrr':
## 
##     lift
library(rpart.plot)
## Loading required package: rpart
## 
## Attaching package: 'rpart'
## The following object is masked from 'package:dials':
## 
##     prune
# Load the 1000 sales file
ksales <- read_csv('https://github.com/logicalschema/spring2022/raw/main/data622/hw1/1000_Sales_Records.csv.gz')
## Rows: 1000 Columns: 14
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (7): Region, Country, Item Type, Sales Channel, Order Priority, Order Da...
## dbl (7): Order ID, Units Sold, Unit Price, Unit Cost, Total Revenue, Total C...
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
bigksales <- read_csv('https://github.com/logicalschema/spring2022/raw/main/data622/hw1/100000_Sales_Records.csv.gz')
## Rows: 100000 Columns: 14
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (7): Region, Country, Item Type, Sales Channel, Order Priority, Order Da...
## dbl (7): Order ID, Units Sold, Unit Price, Unit Cost, Total Revenue, Total C...
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

Analysis of Data

The following is an analysis of the data, which contains these columns:

Region, Country, Item Type, Sales Channel, Order Priority, Order Date, Order ID, Ship Date, Units Sold, Unit Price, Unit Cost, Total Revenue, Total Cost, Total Profit

I decided to convert the date columns to Date datatypes.

# Convert date columns to Date, Order ID to character, Sales Channel to factor
ksales[['Order Date']] <- as.Date(ksales[['Order Date']], "%m/%d/%Y")
ksales[['Ship Date']] <- as.Date(ksales[['Ship Date']], "%m/%d/%Y")
# as.character() keeps one ID per row; toString() would collapse the
# entire column into a single comma-separated string
ksales[['Order ID']] <- as.character(ksales[['Order ID']])
ksales[['Sales Channel']] <- as.factor(ksales[['Sales Channel']])


bigksales[['Order Date']] <- as.Date(bigksales[['Order Date']], "%m/%d/%Y")
bigksales[['Ship Date']] <- as.Date(bigksales[['Ship Date']], "%m/%d/%Y")
bigksales[['Order ID']] <- as.character(bigksales[['Order ID']])
bigksales[['Sales Channel']] <- as.factor(bigksales[['Sales Channel']])
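
as.Date() fails silently, returning NA for any value that does not match the supplied format, so a quick sanity check (a minimal sketch) confirms that no dates were lost in the conversion:

# as.Date() yields NA for strings that do not match "%m/%d/%Y",
# so zeroes here mean every date parsed cleanly
sum(is.na(ksales$`Order Date`)) + sum(is.na(ksales$`Ship Date`))
sum(is.na(bigksales$`Order Date`)) + sum(is.na(bigksales$`Ship Date`))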

The following is a snippet of the data:

# Snippet of the data
head(ksales)
head(bigksales)

The following is a glimpse and summary of the data; a quick missing-value check follows the summaries. There are no missing values.

# glimpse of data
glimpse(ksales)
## Rows: 1,000
## Columns: 14
## $ Region           <chr> "Middle East and North Africa", "North America", "Mid~
## $ Country          <chr> "Libya", "Canada", "Libya", "Japan", "Chad", "Armenia~
## $ `Item Type`      <chr> "Cosmetics", "Vegetables", "Baby Food", "Cereal", "Fr~
## $ `Sales Channel`  <fct> Offline, Online, Offline, Offline, Offline, Online, O~
## $ `Order Priority` <chr> "M", "M", "C", "C", "H", "H", "H", "M", "H", "H", "M"~
## $ `Order Date`     <date> 2014-10-18, 2011-11-07, 2016-10-31, 2010-04-10, 2011~
## $ `Order ID`       <chr> "686800706", "185941302", "246222341", "161442649", "~
## $ `Ship Date`      <date> 2014-10-31, 2011-12-08, 2016-12-09, 2010-05-12, 2011~
## $ `Units Sold`     <dbl> 8446, 3018, 1517, 3322, 9845, 9528, 2844, 7299, 2428,~
## $ `Unit Price`     <dbl> 437.20, 154.06, 255.28, 205.70, 9.33, 205.70, 205.70,~
## $ `Unit Cost`      <dbl> 263.33, 90.93, 159.42, 117.11, 6.92, 117.11, 117.11, ~
## $ `Total Revenue`  <dbl> 3692591.20, 464953.08, 387259.76, 683335.40, 91853.85~
## $ `Total Cost`     <dbl> 2224085.18, 274426.74, 241840.14, 389039.42, 68127.40~
## $ `Total Profit`   <dbl> 1468506.02, 190526.34, 145419.62, 294295.98, 23726.45~
glimpse(bigksales)
## Rows: 100,000
## Columns: 14
## $ Region           <chr> "Middle East and North Africa", "Central America and ~
## $ Country          <chr> "Azerbaijan", "Panama", "Sao Tome and Principe", "Sao~
## $ `Item Type`      <chr> "Snacks", "Cosmetics", "Fruits", "Personal Care", "Ho~
## $ `Sales Channel`  <fct> Online, Offline, Offline, Online, Offline, Online, Of~
## $ `Order Priority` <chr> "C", "L", "M", "M", "H", "C", "M", "C", "H", "H", "C"~
## $ `Order Date`     <date> 2014-10-08, 2015-02-22, 2015-12-09, 2014-09-17, 2010~
## $ `Order ID`       <chr> "535113847", "874708545", "854349935", "892836844", "~
## $ `Ship Date`      <date> 2014-10-23, 2015-02-27, 2016-01-18, 2014-10-12, 2010~
## $ `Units Sold`     <dbl> 934, 4551, 9986, 9118, 5858, 1149, 7964, 6307, 8217, ~
## $ `Unit Price`     <dbl> 152.58, 437.20, 9.33, 81.73, 668.27, 109.28, 437.20, ~
## $ `Unit Cost`      <dbl> 97.44, 263.33, 6.92, 56.67, 502.54, 35.84, 263.33, 6.~
## $ `Total Revenue`  <dbl> 142509.72, 1989697.20, 93169.38, 745214.14, 3914725.6~
## $ `Total Cost`     <dbl> 91008.96, 1198414.83, 69103.12, 516717.06, 2943879.32~
## $ `Total Profit`   <dbl> 51500.76, 791282.37, 24066.26, 228497.08, 970846.34, ~
summary(ksales)
##     Region            Country           Item Type         Sales Channel
##  Length:1000        Length:1000        Length:1000        Offline:520  
##  Class :character   Class :character   Class :character   Online :480  
##  Mode  :character   Mode  :character   Mode  :character                
##                                                                        
##                                                                        
##                                                                        
##  Order Priority       Order Date           Order ID        
##  Length:1000        Min.   :2010-01-01   Length:1000       
##  Class :character   1st Qu.:2011-11-14   Class :character  
##  Mode  :character   Median :2013-09-24   Mode  :character  
##                     Mean   :2013-09-19                     
##                     3rd Qu.:2015-07-03                     
##                     Max.   :2017-07-26                     
##    Ship Date            Units Sold     Unit Price       Unit Cost     
##  Min.   :2010-01-15   Min.   :  13   Min.   :  9.33   Min.   :  6.92  
##  1st Qu.:2011-12-11   1st Qu.:2420   1st Qu.: 81.73   1st Qu.: 56.67  
##  Median :2013-10-12   Median :5184   Median :154.06   Median : 97.44  
##  Mean   :2013-10-14   Mean   :5054   Mean   :262.11   Mean   :184.97  
##  3rd Qu.:2015-07-28   3rd Qu.:7537   3rd Qu.:421.89   3rd Qu.:263.33  
##  Max.   :2017-09-12   Max.   :9998   Max.   :668.27   Max.   :524.96  
##  Total Revenue       Total Cost       Total Profit      
##  Min.   :   2043   Min.   :   1417   Min.   :    532.6  
##  1st Qu.: 281192   1st Qu.: 164932   1st Qu.:  98376.1  
##  Median : 754939   Median : 464726   Median : 277226.0  
##  Mean   :1327322   Mean   : 936119   Mean   : 391202.6  
##  3rd Qu.:1733503   3rd Qu.:1141750   3rd Qu.: 548456.8  
##  Max.   :6617210   Max.   :5204978   Max.   :1726181.4
summary(bigksales)
##     Region            Country           Item Type         Sales Channel  
##  Length:100000      Length:100000      Length:100000      Offline:49946  
##  Class :character   Class :character   Class :character   Online :50054  
##  Mode  :character   Mode  :character   Mode  :character                  
##                                                                          
##                                                                          
##                                                                          
##  Order Priority       Order Date           Order ID        
##  Length:100000      Min.   :2010-01-01   Length:100000     
##  Class :character   1st Qu.:2011-11-25   Class :character  
##  Mode  :character   Median :2013-10-15   Mode  :character  
##                     Mean   :2013-10-15                     
##                     3rd Qu.:2015-09-07                     
##                     Max.   :2017-07-28                     
##    Ship Date            Units Sold      Unit Price       Unit Cost     
##  Min.   :2010-01-02   Min.   :    1   Min.   :  9.33   Min.   :  6.92  
##  1st Qu.:2011-12-21   1st Qu.: 2505   1st Qu.:109.28   1st Qu.: 56.67  
##  Median :2013-11-09   Median : 5007   Median :205.70   Median :117.11  
##  Mean   :2013-11-09   Mean   : 5001   Mean   :266.70   Mean   :188.02  
##  3rd Qu.:2015-10-02   3rd Qu.: 7495   3rd Qu.:437.20   3rd Qu.:364.69  
##  Max.   :2017-09-16   Max.   :10000   Max.   :668.27   Max.   :524.96  
##  Total Revenue       Total Cost       Total Profit      
##  Min.   :     19   Min.   :     14   Min.   :      4.8  
##  1st Qu.: 279753   1st Qu.: 162928   1st Qu.:  95900.0  
##  Median : 789892   Median : 467937   Median : 283657.5  
##  Mean   :1336067   Mean   : 941975   Mean   : 394091.2  
##  3rd Qu.:1836490   3rd Qu.:1209475   3rd Qu.: 568384.1  
##  Max.   :6682700   Max.   :5249075   Max.   :1738700.0
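
As a quick verification of the no-missing-values claim above, a minimal sketch that counts NAs across each data set (both counts should be 0):

# Total number of missing values in each data set
sum(is.na(ksales))
sum(is.na(bigksales))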

Structure and Content of the Data

From a cursory view, the sales data contains records of dated transactions for purchases in various categories (cosmetics, vegetables, baby food, fruits, etc.) from various countries. Additional information includes the priority given to the sale, the region, ship date, unit price, quantity, cost, revenue, and total profit. Fortunately, there is no missing data. For the columns that store currency, a logarithmic transformation could be applied to make their skewed values easier to read; however, I decided not to do this (a sketch of the idea follows).
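
For reference, here is a minimal sketch of the transformation I decided against, using dplyr’s across() to apply log10 to the currency columns (ksales_log is an illustrative name, not used elsewhere):

# Log-transform the currency columns to compress their right skew;
# the result is kept separate so the original data stays untouched
ksales_log <- ksales %>%
  mutate(across(c(`Total Revenue`, `Total Cost`, `Total Profit`), log10))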

Machine Learning Algorithms

For the business that recorded these transactions, the two machine learning algorithms presented so far that could be applied to this data are decision trees and linear regression, both supervised. One purpose would be classification: if the business wanted to increase its online presence, and if that were deemed profitable, ML could help determine whether additional resources should be invested in improving IT infrastructure. Another possibility would be determining which region and which time of year would be best for storing perishable goods, answering business questions such as where storage facilities should be acquired, or which times of year and which regions have proven profitable.

For a continuous prediction such as profit, ML could be used to predict profitability if additional countries in a region were opened up, or whether a particular Item Type could be leveraged for greater profitability. Likewise, if the business wanted to increase its presence in a particular region, ML could indicate which country would most need advertising and investment in additional communication. A minimal regression sketch follows.
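
A hedged sketch of how the linear regression alternative might be specified with the same tidymodels tooling used below, predicting Total Profit (lm_spec and lm_fit are illustrative names, and the predictors are one plausible choice, not a tuned set):

# Specify and fit a linear regression for the continuous target
lm_spec <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

lm_fit <- lm_spec %>%
  fit(`Total Profit` ~ `Units Sold` + `Item Type` + Region, data = ksales)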

Selection of Decision Tree

For this assignment, I decided to attempt a classification problem utilizing a decision tree. I had not worked with decision trees in previous classes, so I decided to explore what I could do with this algorithm. I chose to predict the Sales Channel variable, whose only two possible values are Offline and Online.
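
Since accuracy on a two-class problem is only meaningful relative to the majority-class baseline, a quick check of the class balance (a minimal sketch):

# Proportion of Offline vs Online orders; the larger value (~0.52) is the
# accuracy a trivial always-predict-the-majority-class model would achieve
prop.table(table(ksales$`Sales Channel`))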

Decision Tree

The plan is to split the data into training and testing sets, build a model on the training data, and evaluate its performance on the testing data. The following code sets the seed and partitions the data into training and testing sets; I do this for both the small and large data sets.

Partitioning and Building the Model

The following splits the data for the small data set and builds the model. This is followed by a visualization of the model. The fit looks at Sales Channel in relation to Region, Item Type, Order Priority, and Total Profit.

# Splitting the data 80/20
set.seed(3822)

training.samples <- ksales$`Sales Channel` %>% 
  createDataPartition(p = 0.8, list=FALSE)

train.data <- ksales[training.samples,]
test.data <- ksales[-training.samples,]

tree_spec <- decision_tree() %>% 
  # Set the engine and mode
  set_engine("rpart") %>%
  set_mode("classification")
  
# Train the model
tree_model <- tree_spec %>%
  fit(formula = `Sales Channel` ~ `Region` + `Item Type` + `Order Priority` + `Total Profit`,
      data = train.data)


# Information about the model
tree_model
## parsnip model object
## 
## n= 800 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 800 384 Offline (0.5200000 0.4800000)  
##    2) Item Type=Clothes,Cosmetics,Fruits,Meat,Snacks 302 114 Offline (0.6225166 0.3774834)  
##      4) Total Profit>=7023.945 285 102 Offline (0.6421053 0.3578947)  
##        8) Region=Asia,Australia and Oceania,Central America and the Caribbean,Sub-Saharan Africa 166  50 Offline (0.6987952 0.3012048) *
##        9) Region=Europe,Middle East and North Africa,North America 119  52 Offline (0.5630252 0.4369748)  
##         18) Item Type=Clothes,Fruits,Snacks 74  25 Offline (0.6621622 0.3378378) *
##         19) Item Type=Cosmetics,Meat 45  18 Online (0.4000000 0.6000000) *
##      5) Total Profit< 7023.945 17   5 Online (0.2941176 0.7058824) *
##    3) Item Type=Baby Food,Beverages,Cereal,Household,Office Supplies,Personal Care,Vegetables 498 228 Online (0.4578313 0.5421687)  
##      6) Region=Australia and Oceania,Central America and the Caribbean 80  35 Offline (0.5625000 0.4375000)  
##       12) Order Priority=C,H,L 62  24 Offline (0.6129032 0.3870968) *
##       13) Order Priority=M 18   7 Online (0.3888889 0.6111111) *
##      7) Region=Asia,Europe,Middle East and North Africa,North America,Sub-Saharan Africa 418 183 Online (0.4377990 0.5622010)  
##       14) Order Priority=H,M 195  96 Online (0.4923077 0.5076923)  
##         28) Total Profit>=580851.5 51  20 Offline (0.6078431 0.3921569)  
##           56) Region=Europe,Middle East and North Africa,Sub-Saharan Africa 43  14 Offline (0.6744186 0.3255814) *
##           57) Region=Asia 8   2 Online (0.2500000 0.7500000) *
##         29) Total Profit< 580851.5 144  65 Online (0.4513889 0.5486111) *
##       15) Order Priority=C,L 223  87 Online (0.3901345 0.6098655) *
# Visualization of the model
tree_model$fit %>% rpart.plot(type = 4, extra = 2, roundint=FALSE)

The following splits the data for the large data set and builds the model. This is followed by a visualization of the model. The fit looks at Sales Channel in relation to Region, Item Type, Order Priority, and Total Profit.

# Splitting the data 80/20
set.seed(3822)

Btraining.samples <- bigksales$`Sales Channel` %>% 
  createDataPartition(p = 0.8, list=FALSE)

Btrain.data <- bigksales[Btraining.samples,]
Btest.data <- bigksales[-Btraining.samples,]

Btree_spec <- decision_tree() %>% 
  # Set the engine and mode
  set_engine("rpart") %>%
  set_mode("classification")
  
# Train the model
Btree_model <- Btree_spec %>%
  fit(formula = `Sales Channel` ~ `Region` + `Item Type` + `Order Priority` + `Total Profit`,
      data = Btrain.data)


# Information about the model
Btree_model
## parsnip model object
## 
## n= 800 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 800 398 Offline (0.5025000 0.4975000)  
##    2) Item Type=Baby Food,Beverages,Cereal,Cosmetics,Meat,Snacks 390 173 Offline (0.5564103 0.4435897) *
##    3) Item Type=Clothes,Fruits,Household,Office Supplies,Personal Care,Vegetables 410 185 Online (0.4512195 0.5487805)  
##      6) Total Profit< 1131197 381 178 Online (0.4671916 0.5328084)  
##       12) Order Priority=C,L 211 103 Offline (0.5118483 0.4881517)  
##         24) Total Profit>=19584.87 178  81 Offline (0.5449438 0.4550562) *
##         25) Total Profit< 19584.87 33  11 Online (0.3333333 0.6666667) *
##       13) Order Priority=H,M 170  70 Online (0.4117647 0.5882353)  
##         26) Total Profit< 56292.6 47  21 Offline (0.5531915 0.4468085)  
##           52) Item Type=Office Supplies,Personal Care,Vegetables 15   3 Offline (0.8000000 0.2000000) *
##           53) Item Type=Clothes,Fruits,Household 32  14 Online (0.4375000 0.5625000) *
##         27) Total Profit>=56292.6 123  44 Online (0.3577236 0.6422764) *
##      7) Total Profit>=1131197 29   7 Online (0.2413793 0.7586207) *
# Visualization of the model
Btree_model$fit %>% rpart.plot(type = 4, extra = 2, roundint=FALSE)

Performance of the Models

After constructing the models, I tested their efficacy using the test data and built a confusion matrix for each. After each matrix, the accuracy is given.

\[\text{accuracy} = \frac{\text{correct predictions}}{\text{all predictions}}\]

The following is for the small data set.

# Generate the predictions using the test data
predictions <- predict(tree_model, new_data = test.data)

# Generate a side-by-side of predictions and true classifications
predictions_combined <- predictions %>% 
  mutate(true_classification = test.data$`Sales Channel`)

head(predictions_combined)
# The confusion matrix
confusion_matrix <- conf_mat(data = predictions_combined,
                            estimate = .pred_class,
                            truth = true_classification)

confusion_matrix 
##           Truth
## Prediction Offline Online
##    Offline      49     42
##    Online       55     54
# Calculate the number of correctly predicted classes
correct_predictions <- 49 + 54

# Calculate the number of all predicted classes
all_predictions <- 49 + 54 + 55 + 42

# Calculate and print the accuracy
acc_manual <- correct_predictions / all_predictions
acc_manual
## [1] 0.515
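
Rather than hard-coding the cell counts, yardstick (attached with tidymodels) can compute the same metric directly from the predictions; a minimal sketch, which applies equally to Bpredictions_combined below:

# Equivalent accuracy computed directly from the prediction tibble
accuracy(predictions_combined,
         truth = true_classification,
         estimate = .pred_class)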

The following is for the big data set.

# Generate the predictions using the test data
Bpredictions <- predict(Btree_model, new_data = Btest.data)

# Generate a side-by-side of predictions and true classifications
Bpredictions_combined <- Bpredictions %>% 
  mutate(true_classification = Btest.data$`Sales Channel`)

head(Bpredictions_combined)
# The confusion matrix
Bconfusion_matrix <- conf_mat(data = Bpredictions_combined,
                            estimate = .pred_class,
                            truth = true_classification)

Bconfusion_matrix 
##           Truth
## Prediction Offline Online
##    Offline   35046  35066
##    Online    14498  14590
# Calculate the number of correctly predicted classes
Bcorrect_predictions <- 35046 + 14590

# Calculate the number of all predicted classes
Ball_predictions <- 35046 + 14590 + 14498 + 35066

# Calculate and print the accuracy
Bacc_manual <- Bcorrect_predictions / Ball_predictions
Bacc_manual
## [1] 0.5003629
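
For completeness, a conf_mat object carries more than accuracy; yardstick’s summary() method reports a battery of metrics (sensitivity, specificity, kappa, etc.). A minimal sketch:

# One-line battery of classification metrics from the confusion matrix
summary(Bconfusion_matrix)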

Conclusion

Given the abysmal accuracies of both models (0.515 for the small-data-set model and 0.5003629 for the big-data-set model, essentially no better than guessing), the models need more refinement. In earlier iterations of model building, including Country as a predictor broke prediction: a country such as Cameroon that appeared in the testing set but not in the training set produced a factor level the model had never seen. Decision trees lack flexibility when unseen categorical levels are introduced at prediction time. As discussed before, another limitation is that decision trees can overfit: the model fits the training data closely but fails to generalize to the testing data. A sketch of one mitigation for unseen levels follows.
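
One mitigation for the unseen-level problem, had I used a recipes preprocessing pipeline, is step_novel(), which assigns factor levels absent from the training data to a catch-all level. A hedged sketch (novel_rec is illustrative and was not part of this analysis):

# Map countries unseen at training time to a catch-all "new" level
# instead of letting prediction fail on them
novel_rec <- recipe(`Sales Channel` ~ Country + `Total Profit`,
                    data = train.data) %>%
  step_novel(Country) %>%
  prep()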