Libraries
library(tidyverse)
library(dplyr)
library(tidyr)
library(rpart)
library(rpart.plot)
library(lubridate)
library(skimr)
library(stringr)
library(corrplot)
library(ggplot2)
library(fpp3)
library(caret)
Assignment Overview
Visit the following website and explore the range of sizes of this dataset (from 100 to 5 million records): https://excelbianalytics.com/wp/downloads-18-sample-csv-files-data-sets-for-testing-sales/ or (new) https://www.kaggle.com/datasets
Select 2 files to download - Based on your computer’s capabilities (memory, CPU), select 2 files you can handle (recommended one small, one large)
Download the files
Review the structure and content of the tables, and think about the data sets (structure, size, dependencies, labels, etc)
Consider the similarities and differences in the two data sets you have downloaded
Think about how to analyze and predict an outcome based on the datasets available
Based on the data you have, think which two machine learning algorithms presented so far could be used to analyze the data
Load Data
# Load Datasets
# Small Set == 5,000 records
# Large Set == 100,000 records
<- "/Users/alecmccabe/Downloads/5000 Sales Records.csv"
small_path <- "/Users/alecmccabe/Downloads/100000 Sales Records.csv"
big_path
<- read_csv(small_path)
small <- read_csv(big_path) large
Review Structure and Content of Tables
Review the structure and content of the tables, and think about the data sets (structure, size, dependencies, labels, etc)
head(small)
head(large)
It appears that the two datasets are identical in terms of their included features. Let’s review the large dataset in more detail.
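As a quick check on that observation, we can confirm the two data frames share exactly the same column names (a one-line sketch):
# Sketch: confirm both datasets have identical column names.
identical(colnames(small), colnames(large))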
skim(large)
Name | large |
---|---|
Number of rows | 100000 |
Number of columns | 14 |
Column type frequency: character | 7 |
Column type frequency: numeric | 7 |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
Region | 0 | 1 | 4 | 33 | 0 | 7 | 0 |
Country | 0 | 1 | 4 | 32 | 0 | 185 | 0 |
Item Type | 0 | 1 | 4 | 15 | 0 | 12 | 0 |
Sales Channel | 0 | 1 | 6 | 7 | 0 | 2 | 0 |
Order Priority | 0 | 1 | 1 | 1 | 0 | 4 | 0 |
Order Date | 0 | 1 | 8 | 10 | 0 | 2766 | 0 |
Ship Date | 0 | 1 | 8 | 10 | 0 | 2813 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Order ID | 0 | 1 | 550395554.18 | 259321871.37 | 100008904.00 | 326046383.25 | 547718512.50 | 775078534.50 | 999996459.00 | ▇▇▇▇▇ |
Units Sold | 0 | 1 | 5001.45 | 2884.58 | 1.00 | 2505.00 | 5007.00 | 7495.25 | 10000.00 | ▇▇▇▇▇ |
Unit Price | 0 | 1 | 266.70 | 216.94 | 9.33 | 109.28 | 205.70 | 437.20 | 668.27 | ▇▇▁▃▃ |
Unit Cost | 0 | 1 | 188.02 | 175.71 | 6.92 | 56.67 | 117.11 | 364.69 | 524.96 | ▇▃▂▂▃ |
Total Revenue | 0 | 1 | 1336066.73 | 1471767.59 | 18.66 | 279753.34 | 789891.57 | 1836489.60 | 6682700.00 | ▇▂▁▁▁ |
Total Cost | 0 | 1 | 941975.49 | 1151828.43 | 13.84 | 162928.29 | 467937.41 | 1209474.69 | 5249075.04 | ▇▂▁▁▁ |
Total Profit | 0 | 1 | 394091.24 | 379598.60 | 4.82 | 95900.00 | 283657.46 | 568384.13 | 1738700.00 | ▇▃▂▁▁ |
Both datasets contain the same columns, differing only in the number of records. The sales data has 14 columns in total, and each record describes one sale of goods (a quick sanity check of the derived revenue, cost, and profit columns is sketched after this list). Each record provides the following information:
- Region: the region of the sale
- Country: the country of the sale. There are 185 unique country values.
- Item type: category of items sold
- Sales channel: online or offline sale
- Order priority: priority of the order ("H", "M", "L", "C")
- Order date: date of the order
- Order ID: ID of the order
- Ship date: date the order was shipped
- Units sold: number of units sold in the order
- Unit price: price per unit sold
- Unit cost: cost per unit sold
- Total revenue: total revenue of the order
- Total cost: total cost of the order
- Total profit: total profit of the order
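As a quick sanity check on the derived columns (a sketch, assuming the totals are exact products of the unit-level fields; names are backticked because the columns still contain spaces at this point):
# Sanity check (sketch): the derived totals should follow from the unit-level fields,
# so each of these maximum absolute differences should be at or near zero.
large %>%
  summarise(max_revenue_diff = max(abs(`Units Sold` * `Unit Price` - `Total Revenue`)),
            max_cost_diff    = max(abs(`Units Sold` * `Unit Cost` - `Total Cost`)),
            max_profit_diff  = max(abs(`Total Revenue` - `Total Cost` - `Total Profit`)))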
Missing Values
We can also see, from the skim() output above and from the checks below, that there are no missing values in either of our datasets:
sapply(small, function(x) sum(is.na(x)))
## Region Country Item Type Sales Channel Order Priority
## 0 0 0 0 0
## Order Date Order ID Ship Date Units Sold Unit Price
## 0 0 0 0 0
## Unit Cost Total Revenue Total Cost Total Profit
## 0 0 0 0
sapply(large, function(x) sum(is.na(x)))
## Region Country Item Type Sales Channel Order Priority
## 0 0 0 0 0
## Order Date Order ID Ship Date Units Sold Unit Price
## 0 0 0 0 0
## Unit Cost Total Revenue Total Cost Total Profit
## 0 0 0 0
Data Inspection and Cleaning
To make the features easier to work with, we will adjust the column names by replacing spaces with underscores.
colnames(large)
## [1] "Region" "Country" "Item Type" "Sales Channel"
## [5] "Order Priority" "Order Date" "Order ID" "Ship Date"
## [9] "Units Sold" "Unit Price" "Unit Cost" "Total Revenue"
## [13] "Total Cost" "Total Profit"
adjust_cols <- function(col) {
  new <- str_replace(col, " ", "_")
  return(new)
}

new_colnames <- unlist(lapply(colnames(large), adjust_cols))
colnames(large) <- new_colnames
colnames(small) <- new_colnames
colnames(large)
## [1] "Region" "Country" "Item_Type" "Sales_Channel"
## [5] "Order_Priority" "Order_Date" "Order_ID" "Ship_Date"
## [9] "Units_Sold" "Unit_Price" "Unit_Cost" "Total_Revenue"
## [13] "Total_Cost" "Total_Profit"
Now let’s look at the summaries of our columns.
summary(large)
## Region Country Item_Type Sales_Channel
## Length:100000 Length:100000 Length:100000 Length:100000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Order_Priority Order_Date Order_ID Ship_Date
## Length:100000 Length:100000 Min. :100008904 Length:100000
## Class :character Class :character 1st Qu.:326046383 Class :character
## Mode :character Mode :character Median :547718512 Mode :character
## Mean :550395554
## 3rd Qu.:775078534
## Max. :999996459
## Units_Sold Unit_Price Unit_Cost Total_Revenue
## Min. : 1 Min. : 9.33 Min. : 6.92 Min. : 19
## 1st Qu.: 2505 1st Qu.:109.28 1st Qu.: 56.67 1st Qu.: 279753
## Median : 5007 Median :205.70 Median :117.11 Median : 789892
## Mean : 5001 Mean :266.70 Mean :188.02 Mean :1336067
## 3rd Qu.: 7495 3rd Qu.:437.20 3rd Qu.:364.69 3rd Qu.:1836490
## Max. :10000 Max. :668.27 Max. :524.96 Max. :6682700
## Total_Cost Total_Profit
## Min. : 14 Min. : 4.8
## 1st Qu.: 162928 1st Qu.: 95900.0
## Median : 467937 Median : 283657.5
## Mean : 941975 Mean : 394091.2
## 3rd Qu.:1209475 3rd Qu.: 568384.1
## Max. :5249075 Max. :1738700.0
From the above, it appears that the date fields (Order Date and Ship Date) are incorrectly coded as character values; correcting this is a good first preprocessing step. We can also convert the categorical string columns to factors, which will make modeling easier. We can further drop the Country column: it has 185 unique values, which would likely hinder our modeling efforts downstream, and because we keep the parent category Region we do not lose all of the geographic information. Finally, we can drop Order ID, as it has no predictive value.
# updating small values
small$Order_Date <- as.Date(small$Order_Date, "%m/%d/%Y")
small$Ship_Date <- as.Date(small$Ship_Date, "%m/%d/%Y")

small$Region <- as.factor(small$Region)
small$Item_Type <- as.factor(small$Item_Type)
small$Sales_Channel <- as.factor(small$Sales_Channel)
small$Order_Priority <- as.factor(small$Order_Priority)

small <- small[, -c(2, 7)]
# updating large values
large$Order_Date <- as.Date(large$Order_Date, "%m/%d/%Y")
large$Ship_Date <- as.Date(large$Ship_Date, "%m/%d/%Y")

large$Region <- as.factor(large$Region)
large$Item_Type <- as.factor(large$Item_Type)
large$Sales_Channel <- as.factor(large$Sales_Channel)
large$Order_Priority <- as.factor(large$Order_Priority)

large <- large[, -c(2, 7)]
Feature Dependence
We can expect that some of these features will be highly dependent on each other.
- Units Sold and Unit Price will be highly correlated with Total Revenue.
- Units Sold and Unit Cost will be highly correlated with Total Cost.
- Total Revenue and Total Cost will be highly correlated with Total Profit.
- For obvious reasons, Region and Country are entirely dependent on one another.
- We expect profit to depend on the type of item being sold. It will be interesting to see whether certain item types are more highly prized in different regions (a quick check is sketched after this list).
- We expect the sales channel to be correlated with an increase or decrease in expected profit.
- Order Date and Ship Date will naturally be correlated, and additionally affected by the item type.
- We also expect item prices and costs to fluctuate depending on the time of purchase or sale.
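As a first look at the item-type and region interaction mentioned above, a quick aggregation (a sketch, using the raw Total_Profit values before any transformation):
# Sketch: mean profit by region and item type, sorted to surface the most
# profitable combinations.
large %>%
  group_by(Region, Item_Type) %>%
  summarise(mean_profit = mean(Total_Profit), .groups = "drop") %>%
  arrange(desc(mean_profit)) %>%
  head(10)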
Numerical Distributions
We can see from the histogram below that the distribution of Units_Sold is fairly uniform.
hist(large$Units_Sold)
Unit price varies and does not seem to follow any specific distribution. Prices range from under $10 up to roughly $670, and most item types sell for $300 or less.
hist(large$Unit_Price)
Unit cost is similar, ranging from roughly $7 up to about $525.
hist(large$Unit_Cost)
Total Revenue, which depends on Unit Price and Units Sold, has a strong right skew. The values are also so large that the histogram is difficult to read, so a log transform could be beneficial.
hist(large$Total_Revenue)
large$Total_Revenue <- log(large$Total_Revenue)
Similar to Total Revenue, total cost has a strong right skew. We will use a log transformation to rectify this.
hist(large$Total_Cost)
large$Total_Cost <- log(large$Total_Cost)
small$Total_Cost <- log(small$Total_Cost)
Total Profit also has a right skew, though it is not as pronounced as in its upstream features, Total Revenue and Total Cost. For consistency, we will also apply a log transform.
hist(large$Total_Profit)
large$Total_Profit <- log(large$Total_Profit)
small$Total_Profit <- log(small$Total_Profit)
Categorical Distributions
Region is skewed towards sales in Europe and Sub-Saharan Africa.
# Check distribution of predictor values
large %>%
  ggplot(aes(x = Region)) +
  geom_bar() +
  coord_flip()
Item types are evenly distributed, with 12 item types in total.
# Check distribution of predictor values
large %>%
  ggplot(aes(x = Item_Type)) +
  geom_bar() +
  coord_flip()
There are two sales channels, Online and Offline, and they are evenly distributed.
# Check distribution of predictor values
large %>%
  ggplot(aes(x = Sales_Channel)) +
  geom_bar()
There are 4 order priority levels: High, Medium, Low, and "C" (I am not sure what this last one stands for). I assume that order priority is correlated with the gap between Order Date and Ship Date.
# Check distribution of predictor values
large %>%
  ggplot(aes(x = Order_Priority)) +
  geom_bar()
large$time_to_ship <- large$Ship_Date - large$Order_Date
small$time_to_ship <- small$Ship_Date - small$Order_Date

large$time_to_ship <- as.numeric(large$time_to_ship)
small$time_to_ship <- as.numeric(small$time_to_ship)
Interestingly, the time_to_ship for a given order does not appear to be affected by the order priority.
large %>%
  ggplot(aes(x = Order_Priority, y = time_to_ship)) +
  geom_boxplot()
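To back up the boxplot with numbers, a quick summary by priority level (sketch):
# Sketch: average and median days to ship for each priority level.
large %>%
  group_by(Order_Priority) %>%
  summarise(mean_days = mean(time_to_ship),
            median_days = median(time_to_ship),
            .groups = "drop")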
Collinearity
cor(large[c("Unit_Price","Unit_Cost","Units_Sold","Total_Revenue","Total_Cost","Total_Profit")])
## Unit_Price Unit_Cost Units_Sold Total_Revenue Total_Cost
## Unit_Price 1.000000000 0.986029866 0.003453459 0.6515835 0.6942626
## Unit_Cost 0.986029866 1.000000000 0.003166787 0.6168524 0.6754400
## Units_Sold 0.003453459 0.003166787 1.000000000 0.5677437 0.5516501
## Total_Revenue 0.651583527 0.616852380 0.567743711 1.0000000 0.9888694
## Total_Cost 0.694262646 0.675440026 0.551650080 0.9888694 1.0000000
## Total_Profit 0.528434881 0.461276501 0.574129744 0.9662761 0.9209439
## Total_Profit
## Unit_Price 0.5284349
## Unit_Cost 0.4612765
## Units_Sold 0.5741297
## Total_Revenue 0.9662761
## Total_Cost 0.9209439
## Total_Profit 1.0000000
As expected, all of these values are highly correlated with one another. When building our model, we can likely limit the numeric features to a combination of Units_Sold, Unit_Price, and Unit_Cost; including Total_Revenue, Total_Cost, and Total_Profit as well would introduce multicollinearity into the model.
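caret (already loaded) can automate this screening with findCorrelation(), which suggests columns to drop above a chosen correlation cutoff; a sketch with an arbitrary 0.9 cutoff:
# Sketch: flag numeric columns whose pairwise correlation exceeds the cutoff.
num_cols <- c("Unit_Price", "Unit_Cost", "Units_Sold",
              "Total_Revenue", "Total_Cost", "Total_Profit")
findCorrelation(cor(large[num_cols]), cutoff = 0.9, names = TRUE)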
Model Selection
These datasets could be used to predict a wide range of outcomes. We could predict the expected profit from a sale based on the item type, region, and cost. Similarly, we could predict the region of a sale given the generated profit and sales channel (although this would not be very useful).
In terms of business value, I feel that a company would want a model that can accurately classify the correct Order Priority for a given order. This would let the business prioritize its operations, focusing on the most important orders first. The prioritization is likely driven mainly by expected profit, but it will be interesting to determine which other features are significant predictors.
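Before comparing candidate models, it is worth a quick look at how balanced the Order_Priority classes are, since class balance affects both the choice of model and how we read accuracy later (sketch):
# Sketch: class proportions of the target variable.
prop.table(table(large$Order_Priority))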
With this in mind, we will build a multi-class classification model rather than a standard regression model. This leaves us with a number of options to consider:
Multiple Logistic Regression
Using multiple logistic regression (fit as a multinomial model, since Order_Priority has four classes) could be an option for this problem; however, some of its assumptions are not strictly satisfied here. For instance, this approach works best when the continuous predictors are well behaved; we have log-transformed our skewed variables, but they are still far from normally distributed after the transformation. I believe this option could work, but we would need to keep these considerations in mind.
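For illustration only, a multinomial logistic baseline might look like the sketch below. It assumes the nnet package (which is not among the libraries loaded above) and fits on the full large set, since the train/test split happens later in this document.
library(nnet)  # assumption: nnet is not loaded in the Libraries section above

# Sketch: multi-class (multinomial) logistic regression on the current predictors.
priority_logit <- multinom(Order_Priority ~ Region + Item_Type + Sales_Channel +
                             Units_Sold + Unit_Price + Unit_Cost,
                           data = large)
head(predict(priority_logit, type = "class"))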
KNN
K-Nearest Neighbors is an option for this task, but it would require additional preprocessing: KNN cannot handle non-numeric features directly, so we would need to one-hot encode the categorical values (see the sketch below). One nice property of KNN is that it makes essentially no assumptions about the underlying data; you simply normalize the features (preferably to the range 0 to 1) and fit the model. That said, for a growing business looking to model order priority, KNN might not be the best choice. Because it is instance-based, every prediction requires comparing against all stored training points, so prediction becomes slower as the volume of sales data grows, and the model must effectively incorporate every new data point that is added.
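A sketch of the extra preprocessing KNN would need, using dummyVars() and preProcess() from the already-loaded caret package (one-hot encoding plus rescaling to the 0-1 range):
# Sketch: one-hot encode the factor predictors and rescale all features to [0, 1].
dmy <- dummyVars(~ Region + Item_Type + Sales_Channel +
                   Units_Sold + Unit_Price + Unit_Cost, data = large)
knn_x <- as.data.frame(predict(dmy, newdata = large))

range_pp <- preProcess(knn_x, method = "range")  # min-max scaling per column
knn_x_scaled <- predict(range_pp, knn_x)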
Decision Trees
Finally, we can consider decision trees, which are well suited to this type of multi-class classification problem. Decision trees are powerful, non-parametric models that, like KNN, make few assumptions about the underlying data. They can handle all types of features and do not require additional preprocessing at this stage. While we have no missing values here, it is also worth remembering that decision trees can handle missing values inherently, which could be useful if we ever receive records with gaps. One drawback of decision trees is their high variance: given new data in the future, newly trained trees may differ greatly. That is where models like Random Forest shine (but we haven't gotten there yet). We will also need to take future outliers into consideration as we continue to build and optimize the model over time.
Model Building
Large Model
We will move forward with a Decision Tree for both the small and large datasets. We have already performed most of the transformations that we will need. The final step will be to create additional features related to the time of order placement as well as shipping date.
large$order_month <- as.factor(format(large$Order_Date, "%m"))
large$order_day <- as.factor(wday(large$Order_Date))

large$shipping_month <- as.factor(format(large$Ship_Date, "%m"))
large$shipping_day <- as.factor(wday(large$Ship_Date))
Now we can safely remove the order and ship dates. We can also remove time_to_ship, since we saw above that it has no apparent effect on Order Priority.
large <- large[, -c(5, 6, 13)]
head(large)
Now we are ready to subset our data into training and testing sets.
# make this example reproducible
set.seed(42)

# create ID column
large$id <- 1:nrow(large)

# use 70% of dataset as training set and 30% as test set
large_train <- large %>% dplyr::sample_frac(0.70)
large_test <- dplyr::anti_join(large, large_train, by = 'id')

head(large)
Once we have our training set, we can build our first model.
# build model via rpart package
large_model <- rpart(formula = Order_Priority ~ Region + Item_Type + Sales_Channel +
                       Units_Sold + Unit_Price + Unit_Cost + order_month + order_day +
                       shipping_month + shipping_day,
                     data = large_train, method = 'class',
                     control = rpart.control(minsplit = 1, minbucket = 1, cp = 0.001))
#display decision tree
rpart.plot(large_model)
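To see which predictors the fitted tree actually relied on, we can inspect rpart's built-in importance scores (sketch):
# Sketch: variable importance as computed by rpart for the large model.
large_model$variable.importance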
Small Model
We need to perform the same adjustments to the small dataset that we performed to the large one.
small$order_month <- as.factor(format(small$Order_Date, "%m"))
small$order_day <- as.factor(wday(small$Order_Date))

small$shipping_month <- as.factor(format(small$Ship_Date, "%m"))
small$shipping_day <- as.factor(wday(small$Ship_Date))

small <- small[, -c(5, 6, 13)]
head(small)
Now we are ready to subset our data into training and testing sets.
# make this example reproducible
set.seed(42)

# create ID column
small$id <- 1:nrow(small)

# use 70% of dataset as training set and 30% as test set
small_train <- small %>% dplyr::sample_frac(0.70)
small_test <- dplyr::anti_join(small, small_train, by = 'id')
Once we have our training set, we can build our first model.
# build model via rpart package
small_model <- rpart(formula = Order_Priority ~ Region + Item_Type + Sales_Channel +
                       Units_Sold + Unit_Price + Unit_Cost + order_month + order_day +
                       shipping_month + shipping_day,
                     data = small_train, method = 'class',
                     minsplit = 1, minbucket = 2, cp = 0.009)
#display decision tree
rpart.plot(small_model)
Model Evaluation
# results for large model
large_predictions <- data.frame(predicted = predict(large_model, large_test, type = 'class'),
                                actual = large_test$Order_Priority)

confusion_matrix <- confusionMatrix(data = large_predictions$predicted,
                                    reference = large_predictions$actual)
confusion_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction C H L M
## C 2037 2050 2088 2083
## H 3648 3655 3568 3660
## L 714 683 805 694
## M 1096 1055 1055 1109
##
## Overall Statistics
##
## Accuracy : 0.2535
## 95% CI : (0.2486, 0.2585)
## No Information Rate : 0.2515
## P-Value [Acc > NIR] : 0.2141
##
## Kappa : 0.0056
##
## Mcnemar's Test P-Value : <2e-16
##
## Statistics by Class:
##
## Class: C Class: H Class: L Class: M
## Sensitivity 0.2718 0.4911 0.10710 0.14697
## Specificity 0.7236 0.5178 0.90700 0.85722
## Pos Pred Value 0.2467 0.2515 0.27797 0.25701
## Neg Pred Value 0.7490 0.7551 0.75240 0.74939
## Prevalence 0.2498 0.2481 0.25053 0.25153
## Detection Rate 0.0679 0.1218 0.02683 0.03697
## Detection Prevalence 0.2753 0.4844 0.09653 0.14383
## Balanced Accuracy 0.4977 0.5045 0.50705 0.50209
# results for small model
small_predictions <- data.frame(predicted = predict(small_model, small_test, type = 'class'),
                                actual = small_test$Order_Priority)

confusion_matrix <- confusionMatrix(data = small_predictions$predicted,
                                    reference = small_predictions$actual)
confusion_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction C H L M
## C 74 101 93 96
## H 27 15 23 20
## L 38 47 62 47
## M 206 219 206 226
##
## Overall Statistics
##
## Accuracy : 0.2513
## 95% CI : (0.2296, 0.2741)
## No Information Rate : 0.2593
## P-Value [Acc > NIR] : 0.7686
##
## Kappa : -2e-04
##
## Mcnemar's Test P-Value : <2e-16
##
## Statistics by Class:
##
## Class: C Class: H Class: L Class: M
## Sensitivity 0.21449 0.03927 0.16146 0.5810
## Specificity 0.74892 0.93739 0.88172 0.4320
## Pos Pred Value 0.20330 0.17647 0.31959 0.2637
## Neg Pred Value 0.76144 0.74064 0.75345 0.7465
## Prevalence 0.23000 0.25467 0.25600 0.2593
## Detection Rate 0.04933 0.01000 0.04133 0.1507
## Detection Prevalence 0.24267 0.05667 0.12933 0.5713
## Balanced Accuracy 0.48171 0.48833 0.52159 0.5065
Based on the above results, the decision tree gives us a model that is only marginally stronger than random guessing. In general, the more data available during training, the stronger a model tends to become, and we do see a small lift of this kind when comparing the two models: the large model has slightly higher accuracy (0.2535 vs. 0.2513), although neither beats its no-information rate by a meaningful margin.
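A natural next step (sketch) would be to inspect the cross-validated error across complexity-parameter values, to see whether pruning or a different cp improves on this baseline:
# Sketch: complexity-parameter table and cross-validation error plot for the large model.
printcp(large_model)
plotcp(large_model)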