Libraries

library(tidyverse)
library(dplyr)
library(tidyr)
library(rpart)
library(rpart.plot)
library(lubridate)
library(skimr)
library(stringr)
library(corrplot)
library(ggplot2)
library(fpp3)
library(caret)

Assignment Overview

  1. Visit the following website and explore the range of sizes of this dataset (from 100 to 5 million records): https://excelbianalytics.com/wp/downloads-18-sample-csv-files-data-sets-for-testing-sales/ or (new) https://www.kaggle.com/datasets

  2. Select 2 files to download - Based on your computer’s capabilities (memory, CPU), select 2 files you can handle (recommended one small, one large)

  3. Download the files

  4. Review the structure and content of the tables, and think about the data sets (structure, size, dependencies, labels, etc)

  5. Consider the similarities and differences in the two data sets you have downloaded

  6. Think about how to analyze and predict an outcome based on the datasets available

  7. Based on the data you have, think which two machine learning algorithms presented so far could be used to analyze the data

Load Data

# Load Datasets
# Small Set == 5,000 records
# Large Set == 100,000 records

small_path <- "/Users/alecmccabe/Downloads/5000 Sales Records.csv"
big_path <- "/Users/alecmccabe/Downloads/100000 Sales Records.csv"

small <- read_csv(small_path)
large <- read_csv(big_path)

Review Structure and Content of Tables

Review the structure and content of the tables, and think about the data sets (structure, size, dependencies, labels, etc)

head(small)
head(large)

It appears that the two datasets are identical in terms of their included features. Let’s review the large dataset in more detail.

skim(large)
Data summary
Name large
Number of rows 100000
Number of columns 14
_______________________
Column type frequency:
character 7
numeric 7
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Region 0 1 4 33 0 7 0
Country 0 1 4 32 0 185 0
Item Type 0 1 4 15 0 12 0
Sales Channel 0 1 6 7 0 2 0
Order Priority 0 1 1 1 0 4 0
Order Date 0 1 8 10 0 2766 0
Ship Date 0 1 8 10 0 2813 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Order ID 0 1 550395554.18 259321871.37 100008904.00 326046383.25 547718512.50 775078534.50 999996459.00 ▇▇▇▇▇
Units Sold 0 1 5001.45 2884.58 1.00 2505.00 5007.00 7495.25 10000.00 ▇▇▇▇▇
Unit Price 0 1 266.70 216.94 9.33 109.28 205.70 437.20 668.27 ▇▇▁▃▃
Unit Cost 0 1 188.02 175.71 6.92 56.67 117.11 364.69 524.96 ▇▃▂▂▃
Total Revenue 0 1 1336066.73 1471767.59 18.66 279753.34 789891.57 1836489.60 6682700.00 ▇▂▁▁▁
Total Cost 0 1 941975.49 1151828.43 13.84 162928.29 467937.41 1209474.69 5249075.04 ▇▂▁▁▁
Total Profit 0 1 394091.24 379598.60 4.82 95900.00 283657.46 568384.13 1738700.00 ▇▃▂▁▁

Both datasets contain the same columns, differing only in the number of records. Each dataset has 14 columns, and each record describes a single sale of goods: where it occurred (Region, Country), what was sold (Item Type), how it was sold (Sales Channel, Order Priority), when it was ordered and shipped (Order Date, Ship Date), an Order ID, and the quantities and economics of the sale (Units Sold, Unit Price, Unit Cost, Total Revenue, Total Cost, Total Profit).

Missing Values

We can also see, from the skim() output above and the checks below, that there are no missing values in either of our datasets:

sapply(small, function(x) sum(is.na(x)))
##         Region        Country      Item Type  Sales Channel Order Priority 
##              0              0              0              0              0 
##     Order Date       Order ID      Ship Date     Units Sold     Unit Price 
##              0              0              0              0              0 
##      Unit Cost  Total Revenue     Total Cost   Total Profit 
##              0              0              0              0
sapply(large, function(x) sum(is.na(x)))
##         Region        Country      Item Type  Sales Channel Order Priority 
##              0              0              0              0              0 
##     Order Date       Order ID      Ship Date     Units Sold     Unit Price 
##              0              0              0              0              0 
##      Unit Cost  Total Revenue     Total Cost   Total Profit 
##              0              0              0              0

Data Inspection and Cleaning

To make the features easier to work with, we will replace the spaces in the column names with underscores.

colnames(large)
##  [1] "Region"         "Country"        "Item Type"      "Sales Channel" 
##  [5] "Order Priority" "Order Date"     "Order ID"       "Ship Date"     
##  [9] "Units Sold"     "Unit Price"     "Unit Cost"      "Total Revenue" 
## [13] "Total Cost"     "Total Profit"
# replace every space in the column names with an underscore
new_colnames <- str_replace_all(colnames(large), " ", "_")

colnames(large) <- new_colnames
colnames(small) <- new_colnames
colnames(large)
##  [1] "Region"         "Country"        "Item_Type"      "Sales_Channel" 
##  [5] "Order_Priority" "Order_Date"     "Order_ID"       "Ship_Date"     
##  [9] "Units_Sold"     "Unit_Price"     "Unit_Cost"      "Total_Revenue" 
## [13] "Total_Cost"     "Total_Profit"

Now let’s look at the summaries of our columns.

summary(large)
##     Region            Country           Item_Type         Sales_Channel     
##  Length:100000      Length:100000      Length:100000      Length:100000     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  Order_Priority      Order_Date           Order_ID          Ship_Date        
##  Length:100000      Length:100000      Min.   :100008904   Length:100000     
##  Class :character   Class :character   1st Qu.:326046383   Class :character  
##  Mode  :character   Mode  :character   Median :547718512   Mode  :character  
##                                        Mean   :550395554                     
##                                        3rd Qu.:775078534                     
##                                        Max.   :999996459                     
##    Units_Sold      Unit_Price       Unit_Cost      Total_Revenue    
##  Min.   :    1   Min.   :  9.33   Min.   :  6.92   Min.   :     19  
##  1st Qu.: 2505   1st Qu.:109.28   1st Qu.: 56.67   1st Qu.: 279753  
##  Median : 5007   Median :205.70   Median :117.11   Median : 789892  
##  Mean   : 5001   Mean   :266.70   Mean   :188.02   Mean   :1336067  
##  3rd Qu.: 7495   3rd Qu.:437.20   3rd Qu.:364.69   3rd Qu.:1836490  
##  Max.   :10000   Max.   :668.27   Max.   :524.96   Max.   :6682700  
##    Total_Cost       Total_Profit      
##  Min.   :     14   Min.   :      4.8  
##  1st Qu.: 162928   1st Qu.:  95900.0  
##  Median : 467937   Median : 283657.5  
##  Mean   : 941975   Mean   : 394091.2  
##  3rd Qu.:1209475   3rd Qu.: 568384.1  
##  Max.   :5249075   Max.   :1738700.0

From the above, it appears that the date fields (Order_Date and Ship_Date) are coded as character values; correcting this is a good first preprocessing step. We can also convert the categorical string columns to factors, which will make modeling easier. We can drop the Country column: with 185 unique values it would likely hinder our modeling efforts downstream, and because we keep the parent category Region we do not lose all of the geographic information. Finally, we can drop Order_ID, since an arbitrary identifier adds nothing to the model's predictive power.

# updating small values

small$Order_Date <- as.Date(small$Order_Date, "%m/%d/%Y")
small$Ship_Date <- as.Date(small$Ship_Date, "%m/%d/%Y")

small$Region <- as.factor(small$Region)
small$Item_Type <- as.factor(small$Item_Type)
small$Sales_Channel <- as.factor(small$Sales_Channel)
small$Order_Priority <- as.factor(small$Order_Priority)

small <- small[,-c(2,7)]

# updating large values

large$Order_Date <- as.Date(large$Order_Date, "%m/%d/%Y")
large$Ship_Date <- as.Date(large$Ship_Date, "%m/%d/%Y")

large$Region <- as.factor(large$Region)
large$Item_Type <- as.factor(large$Item_Type)
large$Sales_Channel <- as.factor(large$Sales_Channel)
large$Order_Priority <- as.factor(large$Order_Priority)

large <- large[,-c(2,7)]

Feature Dependence

We can expect some of these features to be highly dependent on one another; a quick check of the revenue, cost, and profit identities follows the list below.

  • Units Sold and Unit Price will be highly correlated with Total Revenue
  • Units Sold and Unit Cost will be highly correlated with Total Cost
  • Total Revenue and Total Cost will be highly correlated with Total Profit
  • For obvious reasons, Region and Country are entirely dependent on one another (which is why Country was dropped above).
  • We expect profit to depend on the type of item being sold. It will be interesting to see whether certain item types are more highly valued in some regions than others.
  • We expect the sales channel to be correlated with higher or lower expected profits.
  • Order Date and Ship Date will naturally be correlated, and the gap between them may also be affected by the item type.
  • We also expect item prices and costs to fluctuate depending on the time of purchase or sale.
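
Before moving on, a quick sanity check (a sketch only): if the totals in this synthetic data are simple products and differences of the unit-level fields, the three identities below should each return TRUE.

# Check the expected identities between the unit-level fields and the totals
all.equal(large$Total_Revenue, large$Units_Sold * large$Unit_Price)
all.equal(large$Total_Cost, large$Units_Sold * large$Unit_Cost)
all.equal(large$Total_Profit, large$Total_Revenue - large$Total_Cost)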

Numerical Distributions

We can see from the below histogram that the distribution of units_sold is fairly uniform.

hist(large$Units_Sold)

Unit price varies and does not follow any single familiar distribution. Prices range from about $9 up to roughly $670, and most items are priced at $300 or less.

hist(large$Unit_Price)

Unit cost is similar, ranging from about $7 to roughly $525, with most items falling in the lower bins.

hist(large$Unit_Cost)

Total Revenue, which is derived from Unit Price and Units Sold, has a strong right skew. The values also span several orders of magnitude, which makes the histogram hard to read. A log transform should help on both counts.

hist(large$Total_Revenue)

large$Total_Revenue <- log(large$Total_Revenue)
small$Total_Revenue <- log(small$Total_Revenue)

Similar to Total Revenue, total cost has a strong right skew. We will use a log transformation to rectify this.

hist(large$Total_Cost)

large$Total_Cost <- log(large$Total_Cost)
small$Total_Cost <- log(small$Total_Cost)

Total Profit also has a right skew, though it is not as pronounced as in either of its upstream features, Total Revenue and Total Cost. In the spirit of consistency, we will also apply a log transform.

hist(large$Total_Profit)

large$Total_Profit <- log(large$Total_Profit)
small$Total_Profit <- log(small$Total_Profit)
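
As a quick visual check using base graphics, we can re-plot the three log-transformed totals; the heavy right skew should now be largely gone.

# Re-plot the log-transformed totals side by side to confirm the skew is reduced
par(mfrow = c(1, 3))
hist(large$Total_Revenue, main = "log(Total_Revenue)", xlab = "")
hist(large$Total_Cost, main = "log(Total_Cost)", xlab = "")
hist(large$Total_Profit, main = "log(Total_Profit)", xlab = "")
par(mfrow = c(1, 1))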

Categorical Distributions

Region is skewed towards sales in Europe and Sub-Saharan Africa.

# Check distribution of predictor values

large %>%
  ggplot(aes(x = Region)) + 
  geom_bar() +
  coord_flip()

Item types are evenly distributed, with 12 item types in total.

# Check distribution of predictor values

large %>%
  ggplot(aes(x = Item_Type)) + 
  geom_bar() + 
  coord_flip()

There are two sales channels, Online and Offline, and they are evenly distributed.

# Check distribution of predictor values

large %>%
  ggplot(aes(x = Sales_Channel)) + 
  geom_bar()

There are four order priority levels: H (High), M (Medium), L (Low), and C (likely "Critical", though the data source does not say). I assume these priority levels are correlated with the gap between Order Date and Ship Date.

# Check distribution of predictor values

large %>%
  ggplot(aes(x = Order_Priority)) + 
  geom_bar()

large$time_to_ship <- large$Ship_Date - large$Order_Date
small$time_to_ship <- small$Ship_Date - small$Order_Date

# convert the difftime values to plain numbers of days
large$time_to_ship <- as.numeric(large$time_to_ship)
small$time_to_ship <- as.numeric(small$time_to_ship)

Interestingly, the time_to_ship for a given order does not appear to depend on the order priority.

large %>%
  ggplot(aes(x=Order_Priority, y=time_to_ship)) + 
  geom_boxplot()
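
To back the boxplot up with numbers, a short dplyr summary (a sketch; dplyr is already loaded) of shipping time by priority:

# Mean and median days to ship for each order priority level
large %>%
  group_by(Order_Priority) %>%
  summarise(mean_days = mean(time_to_ship),
            median_days = median(time_to_ship))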

Collinearity

cor(large[c("Unit_Price","Unit_Cost","Units_Sold","Total_Revenue","Total_Cost","Total_Profit")])
##                Unit_Price   Unit_Cost  Units_Sold Total_Revenue Total_Cost
## Unit_Price    1.000000000 0.986029866 0.003453459     0.6515835  0.6942626
## Unit_Cost     0.986029866 1.000000000 0.003166787     0.6168524  0.6754400
## Units_Sold    0.003453459 0.003166787 1.000000000     0.5677437  0.5516501
## Total_Revenue 0.651583527 0.616852380 0.567743711     1.0000000  0.9888694
## Total_Cost    0.694262646 0.675440026 0.551650080     0.9888694  1.0000000
## Total_Profit  0.528434881 0.461276501 0.574129744     0.9662761  0.9209439
##               Total_Profit
## Unit_Price       0.5284349
## Unit_Cost        0.4612765
## Units_Sold       0.5741297
## Total_Revenue    0.9662761
## Total_Cost       0.9209439
## Total_Profit     1.0000000

As expected, the derived totals are highly correlated with their component fields and with each other (and Unit_Price and Unit_Cost are themselves nearly collinear, at about 0.99). When building our model, we can limit the numeric features to Units_Sold, Unit_Price, and Unit_Cost; also including Total_Revenue, Total_Cost, and Total_Profit would introduce multicollinearity into the model.
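
caret (loaded above) can automate this kind of filtering with findCorrelation(); the sketch below flags which of these columns it would drop at a cutoff of 0.9 (the cutoff is an arbitrary choice for illustration).

# Flag numeric columns whose pairwise correlation exceeds the 0.9 cutoff
num_cols <- c("Unit_Price", "Unit_Cost", "Units_Sold",
              "Total_Revenue", "Total_Cost", "Total_Profit")
cor_mat <- cor(large[num_cols])
findCorrelation(cor_mat, cutoff = 0.9, names = TRUE)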

Model Selection

These datasets could be used to predict a wide range of outcomes. We could predict the expected profit from a sale based on the item type, region, and cost. Similarly, we could predict the region of a sale given the generated profit and sales channel (though this would not be very useful).

In terms of business value, I feel the most useful model would be one that accurately predicts, or classifies, the correct Order Priority for a given order. This would let a business prioritize its operations, focusing on the most important orders first. The prioritization is likely driven by expected profit, but it will be interesting to see which other features turn out to be significant predictors.

With this in mind, we will build a multi-class classification model rather than a standard regression model. This leaves us with a number of options to consider:

Multinomial Logistic Regression

Multinomial logistic regression is an option for this problem, though some of its practical assumptions are only partially satisfied here. It assumes a roughly linear relationship between the predictors and the log-odds of each class, and it is sensitive to multicollinearity and to strongly skewed predictors. We have log-transformed our skewed variables, but they are still not normally distributed after the transformation. This approach could work, but we would need to keep these limitations in mind.
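
For reference, a minimal sketch of how such a model could be fit with nnet::multinom (the nnet package is an assumption here, as it is not loaded above). It is fit to the full large table purely to illustrate the call; in practice it would be fit to the training split created in the Model Building section.

# Sketch: multinomial logistic regression for the four-level Order_Priority
library(nnet)
multi_logit <- multinom(Order_Priority ~ Region + Item_Type + Sales_Channel +
                          Units_Sold + Unit_Price + Unit_Cost,
                        data = large, trace = FALSE)
coef(multi_logit)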

KNN

K-Nearest Neighbors is an option for this task, but it would require additional preprocessing. KNN cannot handle non-numeric features directly, so we would need to one-hot encode the categorical variables, and the numeric features should be normalized (preferably to the range 0 to 1) so that no single feature dominates the distance calculation. A nice property of KNN is that it makes essentially no assumptions about the underlying data. That said, for a growing business looking to model order priority, KNN might not be the best choice: it is a lazy, instance-based algorithm, so the entire (and growing) sales history must be stored, and prediction time increases as more data is collected.
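
A sketch of the preprocessing KNN would need, using caret's dummyVars() for one-hot encoding and preProcess(method = "range") to rescale every feature to the 0-1 interval; the object names are illustrative only.

# One-hot encode the categorical predictors, then rescale all features to [0, 1]
dmy <- dummyVars(~ Region + Item_Type + Sales_Channel +
                   Units_Sold + Unit_Price + Unit_Cost, data = large)
knn_features <- as.data.frame(predict(dmy, newdata = large))
range_scaler <- preProcess(knn_features, method = "range")
knn_scaled <- predict(range_scaler, knn_features)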

Decision Trees

Finally, we can consider decision trees, which are well suited to this type of multi-class classification problem. Decision trees are powerful, non-parametric models that, like KNN, make few assumptions about the underlying data. They can handle all of the data types present here and require no additional preprocessing at this stage. While we have no missing values, it is also worth remembering that decision trees (rpart, for instance, via surrogate splits) can handle missing values, which could be useful if future records are incomplete. Their main drawback is high variance: trained on new data in the future, the resulting trees may look quite different. That is where ensemble methods such as Random Forest shine (but we have not covered those yet). We will also need to keep an eye on outliers as we continue to build and optimize the model over time.

Model Building

Large Model

We will move forward with a decision tree for both the small and large datasets. We have already performed most of the transformations we need; the final step is to create additional features describing when each order was placed and shipped.

large$order_month <- as.factor(format(large$Order_Date,"%m"))
large$order_day <- as.factor(wday(large$Order_Date))

large$shipping_month <- as.factor(format(large$Ship_Date,"%m"))
large$shipping_day <- as.factor(wday(large$Ship_Date))

Now we can safely remove Order_Date and Ship_Date (columns 5 and 6), along with time_to_ship (column 13), since the boxplot above suggested it has no clear relationship with Order_Priority.

large <- large[,-c(5,6,13)]
head(large)

Now we are ready to subset our data into training and testing sets.

#make this example reproducible
set.seed(42)

#create ID column
large$id <- 1:nrow(large)

#use 70% of dataset as training set and 30% as test set 
large_train <- large %>% dplyr::sample_frac(0.70)
large_test  <- dplyr::anti_join(large, large_train, by = 'id')
head(large) 
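
A quick sanity check on the split (a sketch): the row counts should be roughly 70/30, and the two sets should share no id values.

# Confirm the 70/30 split and that train and test are disjoint
nrow(large_train)
nrow(large_test)
length(intersect(large_train$id, large_test$id)) # should be 0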

Once we have our training set, we can build our first model.

#build model via rpart package

large_model <- rpart(formula=Order_Priority ~ Region + Item_Type + Sales_Channel + Units_Sold + Unit_Price + Unit_Cost + order_month + order_day + shipping_month + shipping_day, data=large_train, method='class', 
      control=rpart.control(minsplit=1, minbucket=1, cp=0.001))
#display decision tree

rpart.plot(large_model)
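
One possible refinement, not applied here (a sketch): rpart's complexity-parameter table reports the cross-validated error for each value of cp, and the tree can be pruned back to the cp with the lowest xerror.

# Inspect cross-validated error by cp and prune the tree to the best cp
printcp(large_model)
best_cp <- large_model$cptable[which.min(large_model$cptable[, "xerror"]), "CP"]
pruned_large <- prune(large_model, cp = best_cp)
rpart.plot(pruned_large)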

Small Model

We need to perform the same adjustments to the small dataset that we performed to the large one.

small$order_month <- as.factor(format(small$Order_Date,"%m"))
small$order_day <- as.factor(wday(small$Order_Date))

small$shipping_month <- as.factor(format(small$Ship_Date,"%m"))
small$shipping_day <- as.factor(wday(small$Ship_Date))
small <- small[,-c(5,6,13)]
head(small)

Now we are ready to subset our data into training and testing sets.

#make this example reproducible
set.seed(42)

#create ID column
small$id <- 1:nrow(small)

#use 70% of dataset as training set and 30% as test set 
small_train <- small %>% dplyr::sample_frac(0.70)
small_test  <- dplyr::anti_join(small, small_train, by = 'id')

Once we have our training set, we can build our first model.

#build model via rpart package

small_model <- rpart(formula=Order_Priority ~ Region + Item_Type + Sales_Channel + Units_Sold + Unit_Price + Unit_Cost + order_month + order_day + shipping_month + shipping_day, data=small_train, method='class', minsplit=1, minbucket=2, cp=0.009)
#display decision tree

rpart.plot(small_model)

Model Evaluation

# results for large model

large_predictions <- data.frame(predicted = predict(large_model, large_test, type='class'),
                          actual = large_test$Order_Priority)

confusion_matrix <- confusionMatrix(data = large_predictions$predicted,
                            reference = large_predictions$actual)

confusion_matrix 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    C    H    L    M
##          C 2037 2050 2088 2083
##          H 3648 3655 3568 3660
##          L  714  683  805  694
##          M 1096 1055 1055 1109
## 
## Overall Statistics
##                                           
##                Accuracy : 0.2535          
##                  95% CI : (0.2486, 0.2585)
##     No Information Rate : 0.2515          
##     P-Value [Acc > NIR] : 0.2141          
##                                           
##                   Kappa : 0.0056          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
## 
## Statistics by Class:
## 
##                      Class: C Class: H Class: L Class: M
## Sensitivity            0.2718   0.4911  0.10710  0.14697
## Specificity            0.7236   0.5178  0.90700  0.85722
## Pos Pred Value         0.2467   0.2515  0.27797  0.25701
## Neg Pred Value         0.7490   0.7551  0.75240  0.74939
## Prevalence             0.2498   0.2481  0.25053  0.25153
## Detection Rate         0.0679   0.1218  0.02683  0.03697
## Detection Prevalence   0.2753   0.4844  0.09653  0.14383
## Balanced Accuracy      0.4977   0.5045  0.50705  0.50209
# results for small model

small_predictions <- data.frame(predicted = predict(small_model, small_test, type='class'),
                          actual = small_test$Order_Priority)

confusion_matrix <- confusionMatrix(data = small_predictions$predicted,
                            reference = small_predictions$actual)

confusion_matrix 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   C   H   L   M
##          C  74 101  93  96
##          H  27  15  23  20
##          L  38  47  62  47
##          M 206 219 206 226
## 
## Overall Statistics
##                                           
##                Accuracy : 0.2513          
##                  95% CI : (0.2296, 0.2741)
##     No Information Rate : 0.2593          
##     P-Value [Acc > NIR] : 0.7686          
##                                           
##                   Kappa : -2e-04          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
## 
## Statistics by Class:
## 
##                      Class: C Class: H Class: L Class: M
## Sensitivity           0.21449  0.03927  0.16146   0.5810
## Specificity           0.74892  0.93739  0.88172   0.4320
## Pos Pred Value        0.20330  0.17647  0.31959   0.2637
## Neg Pred Value        0.76144  0.74064  0.75345   0.7465
## Prevalence            0.23000  0.25467  0.25600   0.2593
## Detection Rate        0.04933  0.01000  0.04133   0.1507
## Detection Prevalence  0.24267  0.05667  0.12933   0.5713
## Balanced Accuracy     0.48171  0.48833  0.52159   0.5065

Based on the above results, the decision tree gives us a model that is only marginally stronger than random guessing: the large model reaches about 25.4% accuracy and the small model about 25.1%, against no-information rates of roughly 25%-26%. In general, more training data produces a stronger model, and the large model does score slightly higher here, but the lift is very small and neither model is convincingly better than always predicting the most common class.
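
For a direct side-by-side view (a sketch), the overall accuracy and no-information rate can be pulled straight from the two confusion matrices:

# Compare overall accuracy against the no-information rate for both models
large_cm <- confusionMatrix(large_predictions$predicted, large_predictions$actual)
small_cm <- confusionMatrix(small_predictions$predicted, small_predictions$actual)

data.frame(
  model = c("large", "small"),
  accuracy = c(large_cm$overall["Accuracy"], small_cm$overall["Accuracy"]),
  no_info_rate = c(large_cm$overall["AccuracyNull"], small_cm$overall["AccuracyNull"])
)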