This is the first homework assignment for Data 622. The assignment is as follows:
Visit the following website and explore the range of sizes of this dataset (from 100 to 5 million records). https://eforexcel.com/wp/downloads-18-sample-csv-files-data-sets-for-testing-sales/
Based on your computer's capabilities (memory, CPU), select 2 files you can handle (recommended: one small, one large). Review the structure and content of the tables, and consider which two machine learning algorithms presented so far could be used to analyze the data, and how they can be applied in the suggested environment of the datasets.
Write a short essay explaining your selection. Then, select one of the 2 algorithms and explore how to analyze and predict an outcome based on the data available. This will be an exploratory exercise, so feel free to show errors and warnings that arise during the analysis. Test the code with both datasets selected and compare the results. Which result will you trust if you need to make a business decision? Do you think an analysis could be prone to errors when using too much data, or when using the least amount possible? Develop your exploratory analysis of the data and the essay in the following 2 weeks. You'll have until March 17 to submit both.
I decided to select the 1000 Sales Records and the 100,000 Sales Records files on the site.
The following code performs the initial setup of libraries and loads the files. I kept the files in csv.gz format and relied on the tidyverse (readr) read_csv function, which decompresses gzipped files automatically, because I would be storing these files on GitHub.
# Load libraries
library(tidymodels)
## Registered S3 method overwritten by 'tune':
## method from
## required_pkgs.model_spec parsnip
## -- Attaching packages -------------------------------------- tidymodels 0.1.4 --
## v broom 0.7.12 v recipes 0.2.0
## v dials 0.1.0 v rsample 0.1.1
## v dplyr 1.0.8 v tibble 3.1.6
## v ggplot2 3.3.5 v tidyr 1.2.0
## v infer 1.0.0 v tune 0.1.6
## v modeldata 0.1.1 v workflows 0.2.4
## v parsnip 0.2.0 v workflowsets 0.1.0
## v purrr 0.3.4 v yardstick 0.0.9
## -- Conflicts ----------------------------------------- tidymodels_conflicts() --
## x purrr::discard() masks scales::discard()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x recipes::step() masks stats::step()
## x tune::tune() masks parsnip::tune()
## * Use tidymodels_prefer() to resolve common conflicts.
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v readr 2.1.2 v forcats 0.5.1
## v stringr 1.4.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x readr::col_factor() masks scales::col_factor()
## x purrr::discard() masks scales::discard()
## x dplyr::filter() masks stats::filter()
## x stringr::fixed() masks recipes::fixed()
## x dplyr::lag() masks stats::lag()
## x readr::spec() masks yardstick::spec()
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following objects are masked from 'package:yardstick':
##
## precision, recall, sensitivity, specificity
## The following object is masked from 'package:purrr':
##
## lift
library(rpart.plot)
## Loading required package: rpart
##
## Attaching package: 'rpart'
## The following object is masked from 'package:dials':
##
## prune
# Load the 1000 sales file
ksales <- read_csv('https://github.com/logicalschema/spring2022/raw/main/data622/hw1/1000_Sales_Records.csv.gz')
## Rows: 1000 Columns: 14
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (7): Region, Country, Item Type, Sales Channel, Order Priority, Order Da...
## dbl (7): Order ID, Units Sold, Unit Price, Unit Cost, Total Revenue, Total C...
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
bigksales <- read_csv('https://github.com/logicalschema/spring2022/raw/main/data622/hw1/100000_Sales_Records.csv.gz')
## Rows: 100000 Columns: 14
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (7): Region, Country, Item Type, Sales Channel, Order Priority, Order Da...
## dbl (7): Order ID, Units Sold, Unit Price, Unit Cost, Total Revenue, Total C...
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
The following is an initial analysis of the data. The columns are:
Region, Country, Item Type, Sales Channel, Order Priority, Order Date, Order ID, Ship Date, Units Sold, Unit Price, Unit Cost, Total Revenue, Total Cost, Total Profit
I converted the date columns to Date datatypes, cast Order ID as character (it is an identifier, not a quantity), and converted Sales Channel to a factor.
# Convert dates to Date, Order ID to character, Sales Channel to factor
# as.character (not toString) keeps one ID per row; toString would collapse
# the whole column into a single comma-separated string recycled to every row
ksales[['Order Date']] <- as.Date(ksales[['Order Date']], "%m/%d/%Y")
ksales[['Ship Date']] <- as.Date(ksales[['Ship Date']], "%m/%d/%Y")
ksales[['Order ID']] <- as.character(ksales[['Order ID']])
ksales[['Sales Channel']] <- as.factor(ksales[['Sales Channel']])
bigksales[['Order Date']] <- as.Date(bigksales[['Order Date']], "%m/%d/%Y")
bigksales[['Ship Date']] <- as.Date(bigksales[['Ship Date']], "%m/%d/%Y")
bigksales[['Order ID']] <- as.character(bigksales[['Order ID']])
bigksales[['Sales Channel']] <- as.factor(bigksales[['Sales Channel']])
The following is a snippet of the data:
# Snippet of the data
head(ksales)
head(bigksales)
The following is a glimpse and summary of the data. There are no missing values.
# glimpse of data
glimpse(ksales)
## Rows: 1,000
## Columns: 14
## $ Region <chr> "Middle East and North Africa", "North America", "Mid~
## $ Country <chr> "Libya", "Canada", "Libya", "Japan", "Chad", "Armenia~
## $ `Item Type` <chr> "Cosmetics", "Vegetables", "Baby Food", "Cereal", "Fr~
## $ `Sales Channel` <fct> Offline, Online, Offline, Offline, Offline, Online, O~
## $ `Order Priority` <chr> "M", "M", "C", "C", "H", "H", "H", "M", "H", "H", "M"~
## $ `Order Date` <date> 2014-10-18, 2011-11-07, 2016-10-31, 2010-04-10, 2011~
## $ `Order ID`       <chr> "686800706", "185941302", "246222341", "161442649", "~
## $ `Ship Date` <date> 2014-10-31, 2011-12-08, 2016-12-09, 2010-05-12, 2011~
## $ `Units Sold` <dbl> 8446, 3018, 1517, 3322, 9845, 9528, 2844, 7299, 2428,~
## $ `Unit Price` <dbl> 437.20, 154.06, 255.28, 205.70, 9.33, 205.70, 205.70,~
## $ `Unit Cost` <dbl> 263.33, 90.93, 159.42, 117.11, 6.92, 117.11, 117.11, ~
## $ `Total Revenue` <dbl> 3692591.20, 464953.08, 387259.76, 683335.40, 91853.85~
## $ `Total Cost` <dbl> 2224085.18, 274426.74, 241840.14, 389039.42, 68127.40~
## $ `Total Profit` <dbl> 1468506.02, 190526.34, 145419.62, 294295.98, 23726.45~
glimpse(bigksales)
## Rows: 100,000
## Columns: 14
## $ Region <chr> "Middle East and North Africa", "Central America and ~
## $ Country <chr> "Azerbaijan", "Panama", "Sao Tome and Principe", "Sao~
## $ `Item Type` <chr> "Snacks", "Cosmetics", "Fruits", "Personal Care", "Ho~
## $ `Sales Channel` <fct> Online, Offline, Offline, Online, Offline, Online, Of~
## $ `Order Priority` <chr> "C", "L", "M", "M", "H", "C", "M", "C", "H", "H", "C"~
## $ `Order Date` <date> 2014-10-08, 2015-02-22, 2015-12-09, 2014-09-17, 2010~
## $ `Order ID`       <chr> "535113847", "874708545", "854349935", "892836844", "~
## $ `Ship Date` <date> 2014-10-23, 2015-02-27, 2016-01-18, 2014-10-12, 2010~
## $ `Units Sold` <dbl> 934, 4551, 9986, 9118, 5858, 1149, 7964, 6307, 8217, ~
## $ `Unit Price` <dbl> 152.58, 437.20, 9.33, 81.73, 668.27, 109.28, 437.20, ~
## $ `Unit Cost` <dbl> 97.44, 263.33, 6.92, 56.67, 502.54, 35.84, 263.33, 6.~
## $ `Total Revenue` <dbl> 142509.72, 1989697.20, 93169.38, 745214.14, 3914725.6~
## $ `Total Cost` <dbl> 91008.96, 1198414.83, 69103.12, 516717.06, 2943879.32~
## $ `Total Profit` <dbl> 51500.76, 791282.37, 24066.26, 228497.08, 970846.34, ~
summary(ksales)
## Region Country Item Type Sales Channel
## Length:1000 Length:1000 Length:1000 Offline:520
## Class :character Class :character Class :character Online :480
## Mode :character Mode :character Mode :character
##
##
##
## Order Priority Order Date Order ID
## Length:1000 Min. :2010-01-01 Length:1000
## Class :character 1st Qu.:2011-11-14 Class :character
## Mode :character Median :2013-09-24 Mode :character
## Mean :2013-09-19
## 3rd Qu.:2015-07-03
## Max. :2017-07-26
## Ship Date Units Sold Unit Price Unit Cost
## Min. :2010-01-15 Min. : 13 Min. : 9.33 Min. : 6.92
## 1st Qu.:2011-12-11 1st Qu.:2420 1st Qu.: 81.73 1st Qu.: 56.67
## Median :2013-10-12 Median :5184 Median :154.06 Median : 97.44
## Mean :2013-10-14 Mean :5054 Mean :262.11 Mean :184.97
## 3rd Qu.:2015-07-28 3rd Qu.:7537 3rd Qu.:421.89 3rd Qu.:263.33
## Max. :2017-09-12 Max. :9998 Max. :668.27 Max. :524.96
## Total Revenue Total Cost Total Profit
## Min. : 2043 Min. : 1417 Min. : 532.6
## 1st Qu.: 281192 1st Qu.: 164932 1st Qu.: 98376.1
## Median : 754939 Median : 464726 Median : 277226.0
## Mean :1327322 Mean : 936119 Mean : 391202.6
## 3rd Qu.:1733503 3rd Qu.:1141750 3rd Qu.: 548456.8
## Max. :6617210 Max. :5204978 Max. :1726181.4
summary(bigksales)
## Region Country Item Type Sales Channel
## Length:100000 Length:100000 Length:100000 Offline:49946
## Class :character Class :character Class :character Online :50054
## Mode :character Mode :character Mode :character
##
##
##
## Order Priority Order Date Order ID
## Length:100000 Min. :2010-01-01 Length:100000
## Class :character 1st Qu.:2011-11-25 Class :character
## Mode :character Median :2013-10-15 Mode :character
## Mean :2013-10-15
## 3rd Qu.:2015-09-07
## Max. :2017-07-28
## Ship Date Units Sold Unit Price Unit Cost
## Min. :2010-01-02 Min. : 1 Min. : 9.33 Min. : 6.92
## 1st Qu.:2011-12-21 1st Qu.: 2505 1st Qu.:109.28 1st Qu.: 56.67
## Median :2013-11-09 Median : 5007 Median :205.70 Median :117.11
## Mean :2013-11-09 Mean : 5001 Mean :266.70 Mean :188.02
## 3rd Qu.:2015-10-02 3rd Qu.: 7495 3rd Qu.:437.20 3rd Qu.:364.69
## Max. :2017-09-16 Max. :10000 Max. :668.27 Max. :524.96
## Total Revenue Total Cost Total Profit
## Min. : 19 Min. : 14 Min. : 4.8
## 1st Qu.: 279753 1st Qu.: 162928 1st Qu.: 95900.0
## Median : 789892 Median : 467937 Median : 283657.5
## Mean :1336067 Mean : 941975 Mean : 394091.2
## 3rd Qu.:1836490 3rd Qu.:1209475 3rd Qu.: 568384.1
## Max. :6682700 Max. :5249075 Max. :1738700.0
From a cursory view, the data contains dated sales transactions for purchases in various item categories (cosmetics, vegetables, baby food, fruits, etc.) across many countries. Each record also carries the order priority, region, ship date, unit price, unit cost, quantity sold, total revenue, total cost, and total profit. Fortunately, there is no missing data. For the columns that store currency, a logarithmic transformation could be applied to compress their wide ranges and make distributions easier to read; however, I decided not to do this.
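As an illustration only, the following is a minimal sketch (not run in this analysis) of that transformation using dplyr; the ksales_log name is hypothetical:
# Hypothetical sketch: log10-scale the currency columns for readability
ksales_log <- ksales %>%
  mutate(across(c(`Unit Price`, `Unit Cost`, `Total Revenue`,
                  `Total Cost`, `Total Profit`), log10))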
For the business that recorded these transactions, two machine learning algorithms presented so far come to mind: supervised decision trees and linear regression. One use is classification: if increasing the company's online presence were deemed profitable, a classifier could help determine whether additional resources should be invested in IT infrastructure. Another use is deciding which region and time of year are best suited for storing perishable goods, answering business questions such as where storage facilities should be acquired, and which regions and seasons have proven profitable.
For a continuous outcome such as profit, regression could predict profitability if additional countries in a region were opened up, or whether a particular Item Type could be leveraged for greater profit. It could also suggest, if the business wanted to grow its presence in a particular region, which country is most in need of advertising and investment in additional communication.
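To make the regression option concrete, the following is a minimal, hypothetical sketch (not run here) of fitting such a model with parsnip; the choice of predictors is illustrative only:
# Hypothetical sketch: linear regression on Total Profit via parsnip
profit_model <- linear_reg() %>%
  set_engine("lm") %>%
  fit(`Total Profit` ~ `Units Sold` + `Item Type` + Region, data = ksales)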
For this assignment, I decided to attempt a classification problem using a decision tree. I had not worked with decision trees in previous classes, so I wanted to explore what I could do with the algorithm. I chose to predict the Sales Channel variable, whose only two possible values are Offline and Online.
The plan is to split the data into training and testing sets, build a model on the training data, and evaluate its performance on the testing data. The following code sets the seed and partitions the data; I do this for both the small and the large data set.
The following splits the data for the small data set and builds the model, followed by a visualization of the model. The fit relates Sales Channel to Region, Item Type, Order Priority, and Total Profit.
# Splitting the data 80/20
set.seed(3822)
training.samples <- ksales$`Sales Channel` %>%
createDataPartition(p = 0.8, list=FALSE)
train.data <- ksales[training.samples,]
test.data <- ksales[-training.samples,]
tree_spec <- decision_tree() %>%
# Set the engine and mode
set_engine("rpart") %>%
set_mode("classification")
# Train the model
tree_model <- tree_spec %>%
fit(formula = `Sales Channel` ~ `Region` + `Item Type` + `Order Priority` + `Total Profit`,
data = train.data)
# Information about the model
tree_model
## parsnip model object
##
## n= 800
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 800 384 Offline (0.5200000 0.4800000)
## 2) Item Type=Clothes,Cosmetics,Fruits,Meat,Snacks 302 114 Offline (0.6225166 0.3774834)
## 4) Total Profit>=7023.945 285 102 Offline (0.6421053 0.3578947)
## 8) Region=Asia,Australia and Oceania,Central America and the Caribbean,Sub-Saharan Africa 166 50 Offline (0.6987952 0.3012048) *
## 9) Region=Europe,Middle East and North Africa,North America 119 52 Offline (0.5630252 0.4369748)
## 18) Item Type=Clothes,Fruits,Snacks 74 25 Offline (0.6621622 0.3378378) *
## 19) Item Type=Cosmetics,Meat 45 18 Online (0.4000000 0.6000000) *
## 5) Total Profit< 7023.945 17 5 Online (0.2941176 0.7058824) *
## 3) Item Type=Baby Food,Beverages,Cereal,Household,Office Supplies,Personal Care,Vegetables 498 228 Online (0.4578313 0.5421687)
## 6) Region=Australia and Oceania,Central America and the Caribbean 80 35 Offline (0.5625000 0.4375000)
## 12) Order Priority=C,H,L 62 24 Offline (0.6129032 0.3870968) *
## 13) Order Priority=M 18 7 Online (0.3888889 0.6111111) *
## 7) Region=Asia,Europe,Middle East and North Africa,North America,Sub-Saharan Africa 418 183 Online (0.4377990 0.5622010)
## 14) Order Priority=H,M 195 96 Online (0.4923077 0.5076923)
## 28) Total Profit>=580851.5 51 20 Offline (0.6078431 0.3921569)
## 56) Region=Europe,Middle East and North Africa,Sub-Saharan Africa 43 14 Offline (0.6744186 0.3255814) *
## 57) Region=Asia 8 2 Online (0.2500000 0.7500000) *
## 29) Total Profit< 580851.5 144 65 Online (0.4513889 0.5486111) *
## 15) Order Priority=C,L 223 87 Online (0.3901345 0.6098655) *
# Visualization of the model
tree_model$fit %>% rpart.plot(type = 4, extra = 2, roundint=FALSE)
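As a quick sanity check on which predictors drive the splits, the underlying rpart fit records variable importance; a minimal sketch assuming the tree_model object above (rpart omits this component when no informative splits exist):
# Variable importance recorded by the underlying rpart fit
tree_model$fit$variable.importance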
The following splits the data for the large data set and builds the model, followed by a visualization of the model. As before, the fit relates Sales Channel to Region, Item Type, Order Priority, and Total Profit.
# Splitting the data 80/20
set.seed(3822)
Btraining.samples <- bigksales$`Sales Channel` %>%
createDataPartition(p = 0.8, list=FALSE)
Btrain.data <- bigksales[Btraining.samples,]
Btest.data <- bigksales[-Btraining.samples,]
Btree_spec <- decision_tree() %>%
# Set the engine and mode
set_engine("rpart") %>%
set_mode("classification")
# Train the model
Btree_model <- Btree_spec %>%
fit(formula = `Sales Channel` ~ `Region` + `Item Type` + `Order Priority` + `Total Profit`,
data = Btrain.data)
# Information about the model
Btree_model
## parsnip model object
##
## n= 800
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 800 398 Offline (0.5025000 0.4975000)
## 2) Item Type=Baby Food,Beverages,Cereal,Cosmetics,Meat,Snacks 390 173 Offline (0.5564103 0.4435897) *
## 3) Item Type=Clothes,Fruits,Household,Office Supplies,Personal Care,Vegetables 410 185 Online (0.4512195 0.5487805)
## 6) Total Profit< 1131197 381 178 Online (0.4671916 0.5328084)
## 12) Order Priority=C,L 211 103 Offline (0.5118483 0.4881517)
## 24) Total Profit>=19584.87 178 81 Offline (0.5449438 0.4550562) *
## 25) Total Profit< 19584.87 33 11 Online (0.3333333 0.6666667) *
## 13) Order Priority=H,M 170 70 Online (0.4117647 0.5882353)
## 26) Total Profit< 56292.6 47 21 Offline (0.5531915 0.4468085)
## 52) Item Type=Office Supplies,Personal Care,Vegetables 15 3 Offline (0.8000000 0.2000000) *
## 53) Item Type=Clothes,Fruits,Household 32 14 Online (0.4375000 0.5625000) *
## 27) Total Profit>=56292.6 123 44 Online (0.3577236 0.6422764) *
## 7) Total Profit>=1131197 29 7 Online (0.2413793 0.7586207) *
# Visualization of the model
Btree_model$fit %>% rpart.plot(type = 4, extra = 2, roundint=FALSE)
After constructing the models, I tested their efficacy using the test data and built a confusion matrix for each. From each matrix, the accuracy is computed:
\[\text{accuracy} = \frac{\text{correct predictions}}{\text{all predictions}}\]
The following is for the small data set.
# Generate the predictions using the test data
predictions <- predict(tree_model, new_data = test.data)
# Generate a side-by-side of predictions and true classes
predictions_combined <- predictions %>%
mutate(true_classification = test.data$`Sales Channel`)
head(predictions_combined)
# The confusion matrix
confusion_matrix <- conf_mat(data = predictions_combined,
estimate = .pred_class,
truth = true_classification)
confusion_matrix
## Truth
## Prediction Offline Online
## Offline 49 42
## Online 55 54
# Calculate the number of correctly predicted classes
correct_predictions <- 49 + 54
# Calculate the number of all predicted classes
all_predictions <- 49 + 54 + 55 + 42
# Calculate and print the accuracy
acc_manual <- correct_predictions / all_predictions
acc_manual
## [1] 0.515
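Since tidymodels is loaded, the same figure can be computed with yardstick's accuracy() rather than by hand; a sketch assuming the predictions_combined tibble above (the identical call works on Bpredictions_combined below):
# Accuracy via yardstick, equivalent to the manual calculation
accuracy(predictions_combined,
         truth = true_classification,
         estimate = .pred_class)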
The following is for the big data set.
# Generate the predictions using the test data
Bpredictions <- predict(Btree_model, new_data = Btest.data)
# Generate a side-by-side of predictions and true classes
Bpredictions_combined <- Bpredictions %>%
mutate(true_classification = Btest.data$`Sales Channel`)
head(Bpredictions_combined)
# The confusion matrix
Bconfusion_matrix <- conf_mat(data = Bpredictions_combined,
estimate = .pred_class,
truth = true_classification)
Bconfusion_matrix
## Truth
## Prediction Offline Online
## Offline 35046 35066
## Online 14498 14590
# Calculate the number of correctly predicted classes
Bcorrect_predictions <- 35046 + 14590
# Calculate the number of all predicted classes
Ball_predictions <- 35046 + 14590 + 14498 + 35066
# Calculate and print the accuracy
Bacc_manual <- Bcorrect_predictions / Ball_predictions
Bacc_manual
## [1] 0.5003629
From the abysmal accuracies of both models (the small-data model scored 0.515 and the big-data model 0.5003629, barely better than a coin flip given the near 50/50 class balance), the models need more refinement. In earlier iterations of model building, when I included Country as a predictor, the model would break: a country such as Cameroon that appeared in the testing set but not in the training set caused prediction to fail. Decision trees have limitations when unseen factor levels are introduced and thus lack flexibility. As discussed before, another limitation is that decision trees can overfit: a model can fit the training data closely yet fail when applied to the testing data.
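One way to catch that failure mode before fitting, a minimal sketch assuming the train/test split above, is to list factor levels that appear only in the test set:
# Countries in the test set but absent from training would break predict()
setdiff(unique(test.data$Country), unique(train.data$Country))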