DATA622_Homework 1

Name: Charles Ugiagbe.

Date: 04/16/2024

Assignment Overview

Visit the following website and explore the range of sizes of this dataset (from 100 to 5 million records): https://excelbianalytics.com/wp/downloads-18-sample-csv-files-data-sets-for- testing-sales/ or (new) https://www.kaggle.com/datasets
Select 2 files to download - Based on your computer’s capabilities (memory, CPU), select 2 files you can handle (recommended one small, one large)
Download the files
Review the structure and content of the tables, and think about the data sets (structure, size, dependencies, labels, etc)
Consider the similarities and differences in the two data sets you have downloaded
Think about how to analyze and predict an outcome based on the datasets available
Based on the data you have, think which two machine learning algorithms presented so far could be used to analyze the data

Load the Required packages

library(tidyverse)
library(lubridate)
library(tidyr)
library(rpart)
library(rpart.plot)
library(tidymodels)   
library(gridExtra)
library(skimr)
library(pROC)
library(stringr)
library(knitr)

load the Dataset

# small data = 1000 records
# large data = 100,000 records

small_data <- read.csv("https://raw.githubusercontent.com/omocharly/DATA622/main/Homework1/1000%20Sales%20Records.csv")

large_data <- read.csv("https://raw.githubusercontent.com/omocharly/DATA622/main/Homework1/100000%20Sales%20Records.csv")

head(small_data)

##                         Region Country  Item.Type Sales.Channel Order.Priority
## 1 Middle East and North Africa   Libya  Cosmetics       Offline              M
## 2                North America  Canada Vegetables        Online              M
## 3 Middle East and North Africa   Libya  Baby Food       Offline              C
## 4                         Asia   Japan     Cereal       Offline              C
## 5           Sub-Saharan Africa    Chad     Fruits       Offline              H
## 6                       Europe Armenia     Cereal        Online              H
##   Order.Date  Order.ID  Ship.Date Units.Sold Unit.Price Unit.Cost Total.Revenue
## 1 10/18/2014 686800706 10/31/2014       8446     437.20    263.33    3692591.20
## 2  11/7/2011 185941302  12/8/2011       3018     154.06     90.93     464953.08
## 3 10/31/2016 246222341  12/9/2016       1517     255.28    159.42     387259.76
## 4  4/10/2010 161442649  5/12/2010       3322     205.70    117.11     683335.40
## 5  8/16/2011 645713555  8/31/2011       9845       9.33      6.92      91853.85
## 6 11/24/2014 683458888 12/28/2014       9528     205.70    117.11    1959909.60
##   Total.Cost Total.Profit
## 1  2224085.2   1468506.02
## 2   274426.7    190526.34
## 3   241840.1    145419.62
## 4   389039.4    294295.98
## 5    68127.4     23726.45
## 6  1115824.1    844085.52

head(large_data)

##                              Region               Country     Item.Type
## 1      Middle East and North Africa            Azerbaijan        Snacks
## 2 Central America and the Caribbean                Panama     Cosmetics
## 3                Sub-Saharan Africa Sao Tome and Principe        Fruits
## 4                Sub-Saharan Africa Sao Tome and Principe Personal Care
## 5 Central America and the Caribbean                Belize     Household
## 6                            Europe               Denmark       Clothes
##   Sales.Channel Order.Priority Order.Date  Order.ID  Ship.Date Units.Sold
## 1        Online              C  10/8/2014 535113847 10/23/2014        934
## 2       Offline              L  2/22/2015 874708545  2/27/2015       4551
## 3       Offline              M  12/9/2015 854349935  1/18/2016       9986
## 4        Online              M  9/17/2014 892836844 10/12/2014       9118
## 5       Offline              H   2/4/2010 129280602   3/5/2010       5858
## 6        Online              C  2/20/2013 473105037  2/28/2013       1149
##   Unit.Price Unit.Cost Total.Revenue Total.Cost Total.Profit
## 1     152.58     97.44     142509.72   91008.96     51500.76
## 2     437.20    263.33    1989697.20 1198414.83    791282.37
## 3       9.33      6.92      93169.38   69103.12     24066.26
## 4      81.73     56.67     745214.14  516717.06    228497.08
## 5     668.27    502.54    3914725.66 2943879.32    970846.34
## 6     109.28     35.84     125562.72   41180.16     84382.56

glimpse(small_data)

## Rows: 1,000
## Columns: 14
## $ Region         <chr> "Middle East and North Africa", "North America", "Middl…
## $ Country        <chr> "Libya", "Canada", "Libya", "Japan", "Chad", "Armenia",…
## $ Item.Type      <chr> "Cosmetics", "Vegetables", "Baby Food", "Cereal", "Frui…
## $ Sales.Channel  <chr> "Offline", "Online", "Offline", "Offline", "Offline", "…
## $ Order.Priority <chr> "M", "M", "C", "C", "H", "H", "H", "M", "H", "H", "M", …
## $ Order.Date     <chr> "10/18/2014", "11/7/2011", "10/31/2016", "4/10/2010", "…
## $ Order.ID       <int> 686800706, 185941302, 246222341, 161442649, 645713555, …
## $ Ship.Date      <chr> "10/31/2014", "12/8/2011", "12/9/2016", "5/12/2010", "8…
## $ Units.Sold     <int> 8446, 3018, 1517, 3322, 9845, 9528, 2844, 7299, 2428, 4…
## $ Unit.Price     <dbl> 437.20, 154.06, 255.28, 205.70, 9.33, 205.70, 205.70, 1…
## $ Unit.Cost      <dbl> 263.33, 90.93, 159.42, 117.11, 6.92, 117.11, 117.11, 35…
## $ Total.Revenue  <dbl> 3692591.20, 464953.08, 387259.76, 683335.40, 91853.85, …
## $ Total.Cost     <dbl> 2224085.18, 274426.74, 241840.14, 389039.42, 68127.40, …
## $ Total.Profit   <dbl> 1468506.02, 190526.34, 145419.62, 294295.98, 23726.45, …

glimpse(large_data)

## Rows: 100,000
## Columns: 14
## $ Region         <chr> "Middle East and North Africa", "Central America and th…
## $ Country        <chr> "Azerbaijan", "Panama", "Sao Tome and Principe", "Sao T…
## $ Item.Type      <chr> "Snacks", "Cosmetics", "Fruits", "Personal Care", "Hous…
## $ Sales.Channel  <chr> "Online", "Offline", "Offline", "Online", "Offline", "O…
## $ Order.Priority <chr> "C", "L", "M", "M", "H", "C", "M", "C", "H", "H", "C", …
## $ Order.Date     <chr> "10/8/2014", "2/22/2015", "12/9/2015", "9/17/2014", "2/…
## $ Order.ID       <int> 535113847, 874708545, 854349935, 892836844, 129280602, …
## $ Ship.Date      <chr> "10/23/2014", "2/27/2015", "1/18/2016", "10/12/2014", "…
## $ Units.Sold     <int> 934, 4551, 9986, 9118, 5858, 1149, 7964, 6307, 8217, 27…
## $ Unit.Price     <dbl> 152.58, 437.20, 9.33, 81.73, 668.27, 109.28, 437.20, 9.…
## $ Unit.Cost      <dbl> 97.44, 263.33, 6.92, 56.67, 502.54, 35.84, 263.33, 6.92…
## $ Total.Revenue  <dbl> 142509.72, 1989697.20, 93169.38, 745214.14, 3914725.66,…
## $ Total.Cost     <dbl> 91008.96, 1198414.83, 69103.12, 516717.06, 2943879.32, …
## $ Total.Profit   <dbl> 51500.76, 791282.37, 24066.26, 228497.08, 970846.34, 84…

It appears that Both the small dataset and large datasets contain the same columns, differing only in the number of records present. The sales datasets contain 14 columns in total, and each record describes one sale of goods.

Data Exploration

Let’s explore the data sets; first the small_df data set, using the skimr library we can obtain quick summary statistics beyond the summary(). We notice that we have 14 variables split into 7 character and 7 numeric. There seems to be no missing values, so this will have a simple preparation before we build our models.

Data summary
Name	small_data
Number of rows	1000
Number of columns	14
_______________________
Column type frequency:
character	7
numeric	7
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
Region	1	4	33	7
Country	1	4	32	185
Item.Type	1	4	15	12
Sales.Channel	1	6	7	2
Order.Priority	1	1	1	4
Order.Date	1	8	10	841
Ship.Date	1	8	10	835

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Order.ID	1	549681324.74	257133358.84	1.02928e+08	328074026.00	556609713.50	769694482.75	995529830.00	▇▇▇▇▇
Units.Sold	1	5053.99	2901.38	1.30000e+01	2420.25	5184.00	7536.75	9998.00	▇▆▇▇▇
Unit.Price	1	262.11	216.02	9.33000e+00	81.73	154.06	421.89	668.27	▇▇▁▃▃
Unit.Cost	1	184.97	175.29	6.92000e+00	56.67	97.44	263.33	524.96	▇▂▁▁▂
Total.Revenue	1	1327321.84	1486514.56	2.04325e+03	281191.90	754939.18	1733502.75	6617209.54	▇▂▁▁▁
Total.Cost	1	936119.23	1162570.75	1.41675e+03	164931.88	464726.06	1141750.09	5204978.40	▇▁▁▁▁
Total.Profit	1	391202.61	383640.19	5.32610e+02	98376.12	277225.98	548456.84	1726181.36	▇▃▂▁▁

Let’s take a look at the distributions of the numeric variables for the small data set:

Categorical variables visualization for the small dataset:

\(~\)

Now, the large_df dataset is composed of 100,000 values of the same 14 variables as the small data set. It also has 7 character and 7 numeric variables with no missing values.

Data summary
Name	large_data
Number of rows	100000
Number of columns	14
_______________________
Column type frequency:
character	7
numeric	7
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
Region	1	4	33	7
Country	1	4	32	185
Item.Type	1	4	15	12
Sales.Channel	1	6	7	2
Order.Priority	1	1	1	4
Order.Date	1	8	10	2766
Ship.Date	1	8	10	2813

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Order.ID	1	550395554.18	259321871.37	100008904.00	326046383.25	547718512.50	775078534.50	999996459.00	▇▇▇▇▇
Units.Sold	1	5001.45	2884.58	1.00	2505.00	5007.00	7495.25	10000.00	▇▇▇▇▇
Unit.Price	1	266.70	216.94	9.33	109.28	205.70	437.20	668.27	▇▇▁▃▃
Unit.Cost	1	188.02	175.71	6.92	56.67	117.11	364.69	524.96	▇▃▂▂▃
Total.Revenue	1	1336066.73	1471767.59	18.66	279753.34	789891.57	1836489.60	6682700.00	▇▂▁▁▁
Total.Cost	1	941975.49	1151828.43	13.84	162928.29	467937.41	1209474.69	5249075.04	▇▂▁▁▁
Total.Profit	1	394091.24	379598.60	4.82	95900.00	283657.46	568384.13	1738700.00	▇▃▂▁▁

\(~\)

Visualizations of the numeric variable distributions of the large dataset:

\(~\)

Now let’s look at the large data categorical variables:

Now let’s look at the categorical variables:

Data Preparation:

Now that I’ve visualized the data it’s time to make some changes to the variables. First, convert the categorical values into as.factor and convert the two columns containing dates to as.Date to be able to manipulate. I’ll drop the Order.ID column as it is not needed with our model. Below are the results:

\(~\)

Small dataset:

Region	Country	Item.Type	Sales.Channel	Order.Priority	Order.Date	Ship.Date	Units.Sold	Unit.Price	Unit.Cost	Total.Revenue	Total.Cost	Total.Profit
Middle East and North Africa	Libya	Cosmetics	Offline	M	2014-10-18	2014-10-31	8446	437.20	263.33	3692591.20	2224085.2	1468506.02
North America	Canada	Vegetables	Online	M	2011-11-07	2011-12-08	3018	154.06	90.93	464953.08	274426.7	190526.34
Middle East and North Africa	Libya	Baby Food	Offline	C	2016-10-31	2016-12-09	1517	255.28	159.42	387259.76	241840.1	145419.62
Asia	Japan	Cereal	Offline	C	2010-04-10	2010-05-12	3322	205.70	117.11	683335.40	389039.4	294295.98
Sub-Saharan Africa	Chad	Fruits	Offline	H	2011-08-16	2011-08-31	9845	9.33	6.92	91853.85	68127.4	23726.45
Europe	Armenia	Cereal	Online	H	2014-11-24	2014-12-28	9528	205.70	117.11	1959909.60	1115824.1	844085.52

\(~\)

Large dataset:

Region	Country	Item.Type	Sales.Channel	Order.Priority	Order.Date	Ship.Date	Units.Sold	Unit.Price	Unit.Cost	Total.Revenue	Total.Cost	Total.Profit
Middle East and North Africa	Azerbaijan	Snacks	Online	C	2014-10-08	2014-10-23	934	152.58	97.44	142509.72	91008.96	51500.76
Central America and the Caribbean	Panama	Cosmetics	Offline	L	2015-02-22	2015-02-27	4551	437.20	263.33	1989697.20	1198414.83	791282.37
Sub-Saharan Africa	Sao Tome and Principe	Fruits	Offline	M	2015-12-09	2016-01-18	9986	9.33	6.92	93169.38	69103.12	24066.26
Sub-Saharan Africa	Sao Tome and Principe	Personal Care	Online	M	2014-09-17	2014-10-12	9118	81.73	56.67	745214.14	516717.06	228497.08
Central America and the Caribbean	Belize	Household	Offline	H	2010-02-04	2010-03-05	5858	668.27	502.54	3914725.66	2943879.32	970846.34
Europe	Denmark	Clothes	Online	C	2013-02-20	2013-02-28	1149	109.28	35.84	125562.72	41180.16	84382.56

\(~\)

Model Building:

First, we start by splitting both datasets into the standard ratio 75:25

Now we can start the decision tree for the small data set using the rpart function and setting Order.Priority as our target variable followed by the rest of the variables. The results are below:

To test the above model I used the small_df testing data to create the prediction table below:

	C	H	L	M
C	23	16	11	5
H	26	20	12	10
L	31	14	15	7
M	24	16	12	8

and checking the accuracy of the model using the predicted values alongside the small_test data which is 44%:

Accuracy
x
0.264

\(~\)

Now that the small_df has been completed it’s time to do the large_df. Same as before, create the decision tree with Order.Priority as my target variable. The results are below:

Testing the model against the large_test data:

	C	H	L	M
C	0	0	0	6316
H	0	0	0	6195
L	0	0	0	6301
M	0	0	0	6188

and now to check the accuracy of the model which is 25.7%:

Accuracy
x
0.24752

\(~\)

I did not expect the decision tree for the larger dataset to be this small along with the accuracy compared to the small dataset. After some research I found some parameters I could improve on the rpart function to improve the model and it’s accuracy. Below are the results:

Now that I have a better decision tree I test the above model using our large_data_model2 and large_test testing data:

	C	H	L	M
C	46	765	7	5498
H	48	724	5	5418
L	44	748	9	5500
M	52	701	11	5424

and finally checking the accuracy of the second model; we see the accuracy is only 25.6% which is less than the first model. There wasn’t much improvement in accuracy but we note the changes in the nodes of the decision trees.

Accuracy
x
0.24812

\(~\)