Name: Charles Ugiagbe.

Date: 04/16/2024

Assignment Overview

  1. Visit the following website and explore the range of sizes of this dataset (from 100 to 5 million records): https://excelbianalytics.com/wp/downloads-18-sample-csv-files-data-sets-for- testing-sales/ or (new) https://www.kaggle.com/datasets

  2. Select 2 files to download - Based on your computer’s capabilities (memory, CPU), select 2 files you can handle (recommended one small, one large)

  3. Download the files

  4. Review the structure and content of the tables, and think about the data sets (structure, size, dependencies, labels, etc)

  5. Consider the similarities and differences in the two data sets you have downloaded

  6. Think about how to analyze and predict an outcome based on the datasets available

  7. Based on the data you have, think which two machine learning algorithms presented so far could be used to analyze the data

Load the Required packages

library(tidyverse)
library(lubridate)
library(tidyr)
library(rpart)
library(rpart.plot)
library(tidymodels)   
library(gridExtra)
library(skimr)
library(pROC)
library(stringr)
library(knitr)

load the Dataset

# small data = 1000 records
# large data = 100,000 records

small_data <- read.csv("https://raw.githubusercontent.com/omocharly/DATA622/main/Homework1/1000%20Sales%20Records.csv")

large_data <- read.csv("https://raw.githubusercontent.com/omocharly/DATA622/main/Homework1/100000%20Sales%20Records.csv")
head(small_data)
##                         Region Country  Item.Type Sales.Channel Order.Priority
## 1 Middle East and North Africa   Libya  Cosmetics       Offline              M
## 2                North America  Canada Vegetables        Online              M
## 3 Middle East and North Africa   Libya  Baby Food       Offline              C
## 4                         Asia   Japan     Cereal       Offline              C
## 5           Sub-Saharan Africa    Chad     Fruits       Offline              H
## 6                       Europe Armenia     Cereal        Online              H
##   Order.Date  Order.ID  Ship.Date Units.Sold Unit.Price Unit.Cost Total.Revenue
## 1 10/18/2014 686800706 10/31/2014       8446     437.20    263.33    3692591.20
## 2  11/7/2011 185941302  12/8/2011       3018     154.06     90.93     464953.08
## 3 10/31/2016 246222341  12/9/2016       1517     255.28    159.42     387259.76
## 4  4/10/2010 161442649  5/12/2010       3322     205.70    117.11     683335.40
## 5  8/16/2011 645713555  8/31/2011       9845       9.33      6.92      91853.85
## 6 11/24/2014 683458888 12/28/2014       9528     205.70    117.11    1959909.60
##   Total.Cost Total.Profit
## 1  2224085.2   1468506.02
## 2   274426.7    190526.34
## 3   241840.1    145419.62
## 4   389039.4    294295.98
## 5    68127.4     23726.45
## 6  1115824.1    844085.52
head(large_data)
##                              Region               Country     Item.Type
## 1      Middle East and North Africa            Azerbaijan        Snacks
## 2 Central America and the Caribbean                Panama     Cosmetics
## 3                Sub-Saharan Africa Sao Tome and Principe        Fruits
## 4                Sub-Saharan Africa Sao Tome and Principe Personal Care
## 5 Central America and the Caribbean                Belize     Household
## 6                            Europe               Denmark       Clothes
##   Sales.Channel Order.Priority Order.Date  Order.ID  Ship.Date Units.Sold
## 1        Online              C  10/8/2014 535113847 10/23/2014        934
## 2       Offline              L  2/22/2015 874708545  2/27/2015       4551
## 3       Offline              M  12/9/2015 854349935  1/18/2016       9986
## 4        Online              M  9/17/2014 892836844 10/12/2014       9118
## 5       Offline              H   2/4/2010 129280602   3/5/2010       5858
## 6        Online              C  2/20/2013 473105037  2/28/2013       1149
##   Unit.Price Unit.Cost Total.Revenue Total.Cost Total.Profit
## 1     152.58     97.44     142509.72   91008.96     51500.76
## 2     437.20    263.33    1989697.20 1198414.83    791282.37
## 3       9.33      6.92      93169.38   69103.12     24066.26
## 4      81.73     56.67     745214.14  516717.06    228497.08
## 5     668.27    502.54    3914725.66 2943879.32    970846.34
## 6     109.28     35.84     125562.72   41180.16     84382.56
glimpse(small_data)
## Rows: 1,000
## Columns: 14
## $ Region         <chr> "Middle East and North Africa", "North America", "Middl…
## $ Country        <chr> "Libya", "Canada", "Libya", "Japan", "Chad", "Armenia",…
## $ Item.Type      <chr> "Cosmetics", "Vegetables", "Baby Food", "Cereal", "Frui…
## $ Sales.Channel  <chr> "Offline", "Online", "Offline", "Offline", "Offline", "…
## $ Order.Priority <chr> "M", "M", "C", "C", "H", "H", "H", "M", "H", "H", "M", …
## $ Order.Date     <chr> "10/18/2014", "11/7/2011", "10/31/2016", "4/10/2010", "…
## $ Order.ID       <int> 686800706, 185941302, 246222341, 161442649, 645713555, …
## $ Ship.Date      <chr> "10/31/2014", "12/8/2011", "12/9/2016", "5/12/2010", "8…
## $ Units.Sold     <int> 8446, 3018, 1517, 3322, 9845, 9528, 2844, 7299, 2428, 4…
## $ Unit.Price     <dbl> 437.20, 154.06, 255.28, 205.70, 9.33, 205.70, 205.70, 1…
## $ Unit.Cost      <dbl> 263.33, 90.93, 159.42, 117.11, 6.92, 117.11, 117.11, 35…
## $ Total.Revenue  <dbl> 3692591.20, 464953.08, 387259.76, 683335.40, 91853.85, …
## $ Total.Cost     <dbl> 2224085.18, 274426.74, 241840.14, 389039.42, 68127.40, …
## $ Total.Profit   <dbl> 1468506.02, 190526.34, 145419.62, 294295.98, 23726.45, …
glimpse(large_data)
## Rows: 100,000
## Columns: 14
## $ Region         <chr> "Middle East and North Africa", "Central America and th…
## $ Country        <chr> "Azerbaijan", "Panama", "Sao Tome and Principe", "Sao T…
## $ Item.Type      <chr> "Snacks", "Cosmetics", "Fruits", "Personal Care", "Hous…
## $ Sales.Channel  <chr> "Online", "Offline", "Offline", "Online", "Offline", "O…
## $ Order.Priority <chr> "C", "L", "M", "M", "H", "C", "M", "C", "H", "H", "C", …
## $ Order.Date     <chr> "10/8/2014", "2/22/2015", "12/9/2015", "9/17/2014", "2/…
## $ Order.ID       <int> 535113847, 874708545, 854349935, 892836844, 129280602, …
## $ Ship.Date      <chr> "10/23/2014", "2/27/2015", "1/18/2016", "10/12/2014", "…
## $ Units.Sold     <int> 934, 4551, 9986, 9118, 5858, 1149, 7964, 6307, 8217, 27…
## $ Unit.Price     <dbl> 152.58, 437.20, 9.33, 81.73, 668.27, 109.28, 437.20, 9.…
## $ Unit.Cost      <dbl> 97.44, 263.33, 6.92, 56.67, 502.54, 35.84, 263.33, 6.92…
## $ Total.Revenue  <dbl> 142509.72, 1989697.20, 93169.38, 745214.14, 3914725.66,…
## $ Total.Cost     <dbl> 91008.96, 1198414.83, 69103.12, 516717.06, 2943879.32, …
## $ Total.Profit   <dbl> 51500.76, 791282.37, 24066.26, 228497.08, 970846.34, 84…

It appears that Both the small dataset and large datasets contain the same columns, differing only in the number of records present. The sales datasets contain 14 columns in total, and each record describes one sale of goods.

Data Exploration

Let’s explore the data sets; first the small_df data set, using the skimr library we can obtain quick summary statistics beyond the summary(). We notice that we have 14 variables split into 7 character and 7 numeric. There seems to be no missing values, so this will have a simple preparation before we build our models.

Data summary
Name small_data
Number of rows 1000
Number of columns 14
_______________________
Column type frequency:
character 7
numeric 7
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Region 0 1 4 33 0 7 0
Country 0 1 4 32 0 185 0
Item.Type 0 1 4 15 0 12 0
Sales.Channel 0 1 6 7 0 2 0
Order.Priority 0 1 1 1 0 4 0
Order.Date 0 1 8 10 0 841 0
Ship.Date 0 1 8 10 0 835 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Order.ID 0 1 549681324.74 257133358.84 1.02928e+08 328074026.00 556609713.50 769694482.75 995529830.00 ▇▇▇▇▇
Units.Sold 0 1 5053.99 2901.38 1.30000e+01 2420.25 5184.00 7536.75 9998.00 ▇▆▇▇▇
Unit.Price 0 1 262.11 216.02 9.33000e+00 81.73 154.06 421.89 668.27 ▇▇▁▃▃
Unit.Cost 0 1 184.97 175.29 6.92000e+00 56.67 97.44 263.33 524.96 ▇▂▁▁▂
Total.Revenue 0 1 1327321.84 1486514.56 2.04325e+03 281191.90 754939.18 1733502.75 6617209.54 ▇▂▁▁▁
Total.Cost 0 1 936119.23 1162570.75 1.41675e+03 164931.88 464726.06 1141750.09 5204978.40 ▇▁▁▁▁
Total.Profit 0 1 391202.61 383640.19 5.32610e+02 98376.12 277225.98 548456.84 1726181.36 ▇▃▂▁▁

Let’s take a look at the distributions of the numeric variables for the small data set:

Categorical variables visualization for the small dataset:

\(~\)

Now, the large_df dataset is composed of 100,000 values of the same 14 variables as the small data set. It also has 7 character and 7 numeric variables with no missing values.

Data summary
Name large_data
Number of rows 100000
Number of columns 14
_______________________
Column type frequency:
character 7
numeric 7
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Region 0 1 4 33 0 7 0
Country 0 1 4 32 0 185 0
Item.Type 0 1 4 15 0 12 0
Sales.Channel 0 1 6 7 0 2 0
Order.Priority 0 1 1 1 0 4 0
Order.Date 0 1 8 10 0 2766 0
Ship.Date 0 1 8 10 0 2813 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Order.ID 0 1 550395554.18 259321871.37 100008904.00 326046383.25 547718512.50 775078534.50 999996459.00 ▇▇▇▇▇
Units.Sold 0 1 5001.45 2884.58 1.00 2505.00 5007.00 7495.25 10000.00 ▇▇▇▇▇
Unit.Price 0 1 266.70 216.94 9.33 109.28 205.70 437.20 668.27 ▇▇▁▃▃
Unit.Cost 0 1 188.02 175.71 6.92 56.67 117.11 364.69 524.96 ▇▃▂▂▃
Total.Revenue 0 1 1336066.73 1471767.59 18.66 279753.34 789891.57 1836489.60 6682700.00 ▇▂▁▁▁
Total.Cost 0 1 941975.49 1151828.43 13.84 162928.29 467937.41 1209474.69 5249075.04 ▇▂▁▁▁
Total.Profit 0 1 394091.24 379598.60 4.82 95900.00 283657.46 568384.13 1738700.00 ▇▃▂▁▁

\(~\)

Visualizations of the numeric variable distributions of the large dataset:

\(~\)

Now let’s look at the large data categorical variables:

Now let’s look at the categorical variables:

Data Preparation:

Now that I’ve visualized the data it’s time to make some changes to the variables. First, convert the categorical values into as.factor and convert the two columns containing dates to as.Date to be able to manipulate. I’ll drop the Order.ID column as it is not needed with our model. Below are the results:

\(~\)

Small dataset:

Region Country Item.Type Sales.Channel Order.Priority Order.Date Ship.Date Units.Sold Unit.Price Unit.Cost Total.Revenue Total.Cost Total.Profit
Middle East and North Africa Libya Cosmetics Offline M 2014-10-18 2014-10-31 8446 437.20 263.33 3692591.20 2224085.2 1468506.02
North America Canada Vegetables Online M 2011-11-07 2011-12-08 3018 154.06 90.93 464953.08 274426.7 190526.34
Middle East and North Africa Libya Baby Food Offline C 2016-10-31 2016-12-09 1517 255.28 159.42 387259.76 241840.1 145419.62
Asia Japan Cereal Offline C 2010-04-10 2010-05-12 3322 205.70 117.11 683335.40 389039.4 294295.98
Sub-Saharan Africa Chad Fruits Offline H 2011-08-16 2011-08-31 9845 9.33 6.92 91853.85 68127.4 23726.45
Europe Armenia Cereal Online H 2014-11-24 2014-12-28 9528 205.70 117.11 1959909.60 1115824.1 844085.52

\(~\)

Large dataset:

Region Country Item.Type Sales.Channel Order.Priority Order.Date Ship.Date Units.Sold Unit.Price Unit.Cost Total.Revenue Total.Cost Total.Profit
Middle East and North Africa Azerbaijan Snacks Online C 2014-10-08 2014-10-23 934 152.58 97.44 142509.72 91008.96 51500.76
Central America and the Caribbean Panama Cosmetics Offline L 2015-02-22 2015-02-27 4551 437.20 263.33 1989697.20 1198414.83 791282.37
Sub-Saharan Africa Sao Tome and Principe Fruits Offline M 2015-12-09 2016-01-18 9986 9.33 6.92 93169.38 69103.12 24066.26
Sub-Saharan Africa Sao Tome and Principe Personal Care Online M 2014-09-17 2014-10-12 9118 81.73 56.67 745214.14 516717.06 228497.08
Central America and the Caribbean Belize Household Offline H 2010-02-04 2010-03-05 5858 668.27 502.54 3914725.66 2943879.32 970846.34
Europe Denmark Clothes Online C 2013-02-20 2013-02-28 1149 109.28 35.84 125562.72 41180.16 84382.56

\(~\)

Model Building:

First, we start by splitting both datasets into the standard ratio 75:25

Now we can start the decision tree for the small data set using the rpart function and setting Order.Priority as our target variable followed by the rest of the variables. The results are below:

To test the above model I used the small_df testing data to create the prediction table below:

C H L M
C 23 16 11 5
H 26 20 12 10
L 31 14 15 7
M 24 16 12 8

and checking the accuracy of the model using the predicted values alongside the small_test data which is 44%:

Accuracy
x
0.264

\(~\)

Now that the small_df has been completed it’s time to do the large_df. Same as before, create the decision tree with Order.Priority as my target variable. The results are below:

Testing the model against the large_test data:

C H L M
C 0 0 0 6316
H 0 0 0 6195
L 0 0 0 6301
M 0 0 0 6188

and now to check the accuracy of the model which is 25.7%:

Accuracy
x
0.24752

\(~\)

I did not expect the decision tree for the larger dataset to be this small along with the accuracy compared to the small dataset. After some research I found some parameters I could improve on the rpart function to improve the model and it’s accuracy. Below are the results:

Now that I have a better decision tree I test the above model using our large_data_model2 and large_test testing data:

C H L M
C 46 765 7 5498
H 48 724 5 5418
L 44 748 9 5500
M 52 701 11 5424

and finally checking the accuracy of the second model; we see the accuracy is only 25.6% which is less than the first model. There wasn’t much improvement in accuracy but we note the changes in the nodes of the decision trees.

Accuracy
x
0.24812

\(~\)