Visit the following website and explore the range of sizes of this dataset (from 100 to 5 million records): https://excelbianalytics.com/wp/downloads-18-sample-csv-files-data-sets-for- testing-sales/ or (new) https://www.kaggle.com/datasets
Select 2 files to download - Based on your computer’s capabilities (memory, CPU), select 2 files you can handle (recommended one small, one large)
Download the files
Review the structure and content of the tables, and think about the data sets (structure, size, dependencies, labels, etc)
Consider the similarities and differences in the two data sets you have downloaded
Think about how to analyze and predict an outcome based on the datasets available
Based on the data you have, think which two machine learning algorithms presented so far could be used to analyze the data
library(tidyverse)
library(lubridate)
library(tidyr)
library(rpart)
library(rpart.plot)
library(tidymodels)
library(gridExtra)
library(skimr)
library(pROC)
library(stringr)
library(knitr)
# small data = 1000 records
# large data = 100,000 records
small_data <- read.csv("https://raw.githubusercontent.com/omocharly/DATA622/main/Homework1/1000%20Sales%20Records.csv")
large_data <- read.csv("https://raw.githubusercontent.com/omocharly/DATA622/main/Homework1/100000%20Sales%20Records.csv")
head(small_data)
## Region Country Item.Type Sales.Channel Order.Priority
## 1 Middle East and North Africa Libya Cosmetics Offline M
## 2 North America Canada Vegetables Online M
## 3 Middle East and North Africa Libya Baby Food Offline C
## 4 Asia Japan Cereal Offline C
## 5 Sub-Saharan Africa Chad Fruits Offline H
## 6 Europe Armenia Cereal Online H
## Order.Date Order.ID Ship.Date Units.Sold Unit.Price Unit.Cost Total.Revenue
## 1 10/18/2014 686800706 10/31/2014 8446 437.20 263.33 3692591.20
## 2 11/7/2011 185941302 12/8/2011 3018 154.06 90.93 464953.08
## 3 10/31/2016 246222341 12/9/2016 1517 255.28 159.42 387259.76
## 4 4/10/2010 161442649 5/12/2010 3322 205.70 117.11 683335.40
## 5 8/16/2011 645713555 8/31/2011 9845 9.33 6.92 91853.85
## 6 11/24/2014 683458888 12/28/2014 9528 205.70 117.11 1959909.60
## Total.Cost Total.Profit
## 1 2224085.2 1468506.02
## 2 274426.7 190526.34
## 3 241840.1 145419.62
## 4 389039.4 294295.98
## 5 68127.4 23726.45
## 6 1115824.1 844085.52
head(large_data)
## Region Country Item.Type
## 1 Middle East and North Africa Azerbaijan Snacks
## 2 Central America and the Caribbean Panama Cosmetics
## 3 Sub-Saharan Africa Sao Tome and Principe Fruits
## 4 Sub-Saharan Africa Sao Tome and Principe Personal Care
## 5 Central America and the Caribbean Belize Household
## 6 Europe Denmark Clothes
## Sales.Channel Order.Priority Order.Date Order.ID Ship.Date Units.Sold
## 1 Online C 10/8/2014 535113847 10/23/2014 934
## 2 Offline L 2/22/2015 874708545 2/27/2015 4551
## 3 Offline M 12/9/2015 854349935 1/18/2016 9986
## 4 Online M 9/17/2014 892836844 10/12/2014 9118
## 5 Offline H 2/4/2010 129280602 3/5/2010 5858
## 6 Online C 2/20/2013 473105037 2/28/2013 1149
## Unit.Price Unit.Cost Total.Revenue Total.Cost Total.Profit
## 1 152.58 97.44 142509.72 91008.96 51500.76
## 2 437.20 263.33 1989697.20 1198414.83 791282.37
## 3 9.33 6.92 93169.38 69103.12 24066.26
## 4 81.73 56.67 745214.14 516717.06 228497.08
## 5 668.27 502.54 3914725.66 2943879.32 970846.34
## 6 109.28 35.84 125562.72 41180.16 84382.56
glimpse(small_data)
## Rows: 1,000
## Columns: 14
## $ Region <chr> "Middle East and North Africa", "North America", "Middl…
## $ Country <chr> "Libya", "Canada", "Libya", "Japan", "Chad", "Armenia",…
## $ Item.Type <chr> "Cosmetics", "Vegetables", "Baby Food", "Cereal", "Frui…
## $ Sales.Channel <chr> "Offline", "Online", "Offline", "Offline", "Offline", "…
## $ Order.Priority <chr> "M", "M", "C", "C", "H", "H", "H", "M", "H", "H", "M", …
## $ Order.Date <chr> "10/18/2014", "11/7/2011", "10/31/2016", "4/10/2010", "…
## $ Order.ID <int> 686800706, 185941302, 246222341, 161442649, 645713555, …
## $ Ship.Date <chr> "10/31/2014", "12/8/2011", "12/9/2016", "5/12/2010", "8…
## $ Units.Sold <int> 8446, 3018, 1517, 3322, 9845, 9528, 2844, 7299, 2428, 4…
## $ Unit.Price <dbl> 437.20, 154.06, 255.28, 205.70, 9.33, 205.70, 205.70, 1…
## $ Unit.Cost <dbl> 263.33, 90.93, 159.42, 117.11, 6.92, 117.11, 117.11, 35…
## $ Total.Revenue <dbl> 3692591.20, 464953.08, 387259.76, 683335.40, 91853.85, …
## $ Total.Cost <dbl> 2224085.18, 274426.74, 241840.14, 389039.42, 68127.40, …
## $ Total.Profit <dbl> 1468506.02, 190526.34, 145419.62, 294295.98, 23726.45, …
glimpse(large_data)
## Rows: 100,000
## Columns: 14
## $ Region <chr> "Middle East and North Africa", "Central America and th…
## $ Country <chr> "Azerbaijan", "Panama", "Sao Tome and Principe", "Sao T…
## $ Item.Type <chr> "Snacks", "Cosmetics", "Fruits", "Personal Care", "Hous…
## $ Sales.Channel <chr> "Online", "Offline", "Offline", "Online", "Offline", "O…
## $ Order.Priority <chr> "C", "L", "M", "M", "H", "C", "M", "C", "H", "H", "C", …
## $ Order.Date <chr> "10/8/2014", "2/22/2015", "12/9/2015", "9/17/2014", "2/…
## $ Order.ID <int> 535113847, 874708545, 854349935, 892836844, 129280602, …
## $ Ship.Date <chr> "10/23/2014", "2/27/2015", "1/18/2016", "10/12/2014", "…
## $ Units.Sold <int> 934, 4551, 9986, 9118, 5858, 1149, 7964, 6307, 8217, 27…
## $ Unit.Price <dbl> 152.58, 437.20, 9.33, 81.73, 668.27, 109.28, 437.20, 9.…
## $ Unit.Cost <dbl> 97.44, 263.33, 6.92, 56.67, 502.54, 35.84, 263.33, 6.92…
## $ Total.Revenue <dbl> 142509.72, 1989697.20, 93169.38, 745214.14, 3914725.66,…
## $ Total.Cost <dbl> 91008.96, 1198414.83, 69103.12, 516717.06, 2943879.32, …
## $ Total.Profit <dbl> 51500.76, 791282.37, 24066.26, 228497.08, 970846.34, 84…
It appears that Both the small dataset and large datasets contain the same columns, differing only in the number of records present. The sales datasets contain 14 columns in total, and each record describes one sale of goods.
Let’s explore the data sets; first the small_df data
set, using the skimr library we can obtain quick summary
statistics beyond the summary(). We notice that we have 14
variables split into 7 character and 7 numeric. There seems to be no
missing values, so this will have a simple preparation before we build
our models.
| Name | small_data |
| Number of rows | 1000 |
| Number of columns | 14 |
| _______________________ | |
| Column type frequency: | |
| character | 7 |
| numeric | 7 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Region | 0 | 1 | 4 | 33 | 0 | 7 | 0 |
| Country | 0 | 1 | 4 | 32 | 0 | 185 | 0 |
| Item.Type | 0 | 1 | 4 | 15 | 0 | 12 | 0 |
| Sales.Channel | 0 | 1 | 6 | 7 | 0 | 2 | 0 |
| Order.Priority | 0 | 1 | 1 | 1 | 0 | 4 | 0 |
| Order.Date | 0 | 1 | 8 | 10 | 0 | 841 | 0 |
| Ship.Date | 0 | 1 | 8 | 10 | 0 | 835 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Order.ID | 0 | 1 | 549681324.74 | 257133358.84 | 1.02928e+08 | 328074026.00 | 556609713.50 | 769694482.75 | 995529830.00 | ▇▇▇▇▇ |
| Units.Sold | 0 | 1 | 5053.99 | 2901.38 | 1.30000e+01 | 2420.25 | 5184.00 | 7536.75 | 9998.00 | ▇▆▇▇▇ |
| Unit.Price | 0 | 1 | 262.11 | 216.02 | 9.33000e+00 | 81.73 | 154.06 | 421.89 | 668.27 | ▇▇▁▃▃ |
| Unit.Cost | 0 | 1 | 184.97 | 175.29 | 6.92000e+00 | 56.67 | 97.44 | 263.33 | 524.96 | ▇▂▁▁▂ |
| Total.Revenue | 0 | 1 | 1327321.84 | 1486514.56 | 2.04325e+03 | 281191.90 | 754939.18 | 1733502.75 | 6617209.54 | ▇▂▁▁▁ |
| Total.Cost | 0 | 1 | 936119.23 | 1162570.75 | 1.41675e+03 | 164931.88 | 464726.06 | 1141750.09 | 5204978.40 | ▇▁▁▁▁ |
| Total.Profit | 0 | 1 | 391202.61 | 383640.19 | 5.32610e+02 | 98376.12 | 277225.98 | 548456.84 | 1726181.36 | ▇▃▂▁▁ |
\(~\)
Now, the large_df dataset is composed of 100,000
values of the same 14 variables as the small data set. It
also has 7 character and 7 numeric variables with no missing
values.
| Name | large_data |
| Number of rows | 100000 |
| Number of columns | 14 |
| _______________________ | |
| Column type frequency: | |
| character | 7 |
| numeric | 7 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Region | 0 | 1 | 4 | 33 | 0 | 7 | 0 |
| Country | 0 | 1 | 4 | 32 | 0 | 185 | 0 |
| Item.Type | 0 | 1 | 4 | 15 | 0 | 12 | 0 |
| Sales.Channel | 0 | 1 | 6 | 7 | 0 | 2 | 0 |
| Order.Priority | 0 | 1 | 1 | 1 | 0 | 4 | 0 |
| Order.Date | 0 | 1 | 8 | 10 | 0 | 2766 | 0 |
| Ship.Date | 0 | 1 | 8 | 10 | 0 | 2813 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Order.ID | 0 | 1 | 550395554.18 | 259321871.37 | 100008904.00 | 326046383.25 | 547718512.50 | 775078534.50 | 999996459.00 | ▇▇▇▇▇ |
| Units.Sold | 0 | 1 | 5001.45 | 2884.58 | 1.00 | 2505.00 | 5007.00 | 7495.25 | 10000.00 | ▇▇▇▇▇ |
| Unit.Price | 0 | 1 | 266.70 | 216.94 | 9.33 | 109.28 | 205.70 | 437.20 | 668.27 | ▇▇▁▃▃ |
| Unit.Cost | 0 | 1 | 188.02 | 175.71 | 6.92 | 56.67 | 117.11 | 364.69 | 524.96 | ▇▃▂▂▃ |
| Total.Revenue | 0 | 1 | 1336066.73 | 1471767.59 | 18.66 | 279753.34 | 789891.57 | 1836489.60 | 6682700.00 | ▇▂▁▁▁ |
| Total.Cost | 0 | 1 | 941975.49 | 1151828.43 | 13.84 | 162928.29 | 467937.41 | 1209474.69 | 5249075.04 | ▇▂▁▁▁ |
| Total.Profit | 0 | 1 | 394091.24 | 379598.60 | 4.82 | 95900.00 | 283657.46 | 568384.13 | 1738700.00 | ▇▃▂▁▁ |
\(~\)
\(~\)
Now that I’ve visualized the data it’s time to make some changes to
the variables. First, convert the categorical values into
as.factor and convert the two columns containing dates to
as.Date to be able to manipulate. I’ll drop the
Order.ID column as it is not needed with our model. Below
are the results:
\(~\)
Small dataset:
| Region | Country | Item.Type | Sales.Channel | Order.Priority | Order.Date | Ship.Date | Units.Sold | Unit.Price | Unit.Cost | Total.Revenue | Total.Cost | Total.Profit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Middle East and North Africa | Libya | Cosmetics | Offline | M | 2014-10-18 | 2014-10-31 | 8446 | 437.20 | 263.33 | 3692591.20 | 2224085.2 | 1468506.02 |
| North America | Canada | Vegetables | Online | M | 2011-11-07 | 2011-12-08 | 3018 | 154.06 | 90.93 | 464953.08 | 274426.7 | 190526.34 |
| Middle East and North Africa | Libya | Baby Food | Offline | C | 2016-10-31 | 2016-12-09 | 1517 | 255.28 | 159.42 | 387259.76 | 241840.1 | 145419.62 |
| Asia | Japan | Cereal | Offline | C | 2010-04-10 | 2010-05-12 | 3322 | 205.70 | 117.11 | 683335.40 | 389039.4 | 294295.98 |
| Sub-Saharan Africa | Chad | Fruits | Offline | H | 2011-08-16 | 2011-08-31 | 9845 | 9.33 | 6.92 | 91853.85 | 68127.4 | 23726.45 |
| Europe | Armenia | Cereal | Online | H | 2014-11-24 | 2014-12-28 | 9528 | 205.70 | 117.11 | 1959909.60 | 1115824.1 | 844085.52 |
\(~\)
Large dataset:
| Region | Country | Item.Type | Sales.Channel | Order.Priority | Order.Date | Ship.Date | Units.Sold | Unit.Price | Unit.Cost | Total.Revenue | Total.Cost | Total.Profit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Middle East and North Africa | Azerbaijan | Snacks | Online | C | 2014-10-08 | 2014-10-23 | 934 | 152.58 | 97.44 | 142509.72 | 91008.96 | 51500.76 |
| Central America and the Caribbean | Panama | Cosmetics | Offline | L | 2015-02-22 | 2015-02-27 | 4551 | 437.20 | 263.33 | 1989697.20 | 1198414.83 | 791282.37 |
| Sub-Saharan Africa | Sao Tome and Principe | Fruits | Offline | M | 2015-12-09 | 2016-01-18 | 9986 | 9.33 | 6.92 | 93169.38 | 69103.12 | 24066.26 |
| Sub-Saharan Africa | Sao Tome and Principe | Personal Care | Online | M | 2014-09-17 | 2014-10-12 | 9118 | 81.73 | 56.67 | 745214.14 | 516717.06 | 228497.08 |
| Central America and the Caribbean | Belize | Household | Offline | H | 2010-02-04 | 2010-03-05 | 5858 | 668.27 | 502.54 | 3914725.66 | 2943879.32 | 970846.34 |
| Europe | Denmark | Clothes | Online | C | 2013-02-20 | 2013-02-28 | 1149 | 109.28 | 35.84 | 125562.72 | 41180.16 | 84382.56 |
\(~\)
First, we start by splitting both datasets into the standard ratio 75:25
Now we can start the decision tree for the small data set using the
rpart function and setting Order.Priority as
our target variable followed by the rest of the variables. The results
are below:
To test the above model I used the small_df testing data
to create the prediction table below:
| C | H | L | M | |
|---|---|---|---|---|
| C | 23 | 16 | 11 | 5 |
| H | 26 | 20 | 12 | 10 |
| L | 31 | 14 | 15 | 7 |
| M | 24 | 16 | 12 | 8 |
and checking the accuracy of the model using the predicted values
alongside the small_test data which is 44%:
| x |
|---|
| 0.264 |
\(~\)
Now that the small_df has been completed it’s time to do
the large_df. Same as before, create the decision tree with
Order.Priority as my target variable. The results are
below:
Testing the model against the large_test data:
| C | H | L | M | |
|---|---|---|---|---|
| C | 0 | 0 | 0 | 6316 |
| H | 0 | 0 | 0 | 6195 |
| L | 0 | 0 | 0 | 6301 |
| M | 0 | 0 | 0 | 6188 |
and now to check the accuracy of the model which is 25.7%:
| x |
|---|
| 0.24752 |
\(~\)
I did not expect the decision tree for the larger dataset to be this
small along with the accuracy compared to the small dataset. After some
research I found some parameters I could improve on the
rpart function to improve the model and it’s accuracy.
Below are the results:
Now that I have a better decision tree I test the above model using
our large_data_model2 and large_test testing
data:
| C | H | L | M | |
|---|---|---|---|---|
| C | 46 | 765 | 7 | 5498 |
| H | 48 | 724 | 5 | 5418 |
| L | 44 | 748 | 9 | 5500 |
| M | 52 | 701 | 11 | 5424 |
and finally checking the accuracy of the second model; we see the accuracy is only 25.6% which is less than the first model. There wasn’t much improvement in accuracy but we note the changes in the nodes of the decision trees.
| x |
|---|
| 0.24812 |
\(~\)