Goal

Visit the following website and explore the range of sizes of this dataset (from 100 to 5 million records).

https://eforexcel.com/wp/downloads-18-sample-csv-files-data-sets-for-testing-sales/

  1. Based on your computer’s capabilities (memory, CPU), select 2 files you can handle (recommended: one small, one large).

  2. Review the structure and content of the tables, and consider which two machine learning algorithms presented so far could be used to analyze the data and how they could be applied in the suggested environment of the datasets.

  3. Write a short essay explaining your selection. Then, select one of the 2 algorithms and explore how to analyze and predict an outcome based on the data available. This will be an exploratory exercise, so feel free to show errors and warnings that arise during the analysis. Test the code with both datasets selected and compare the results.

  4. Which result will you trust if you need to make a business decision? Do you think an analysis could be prone to errors when using too much data, or when using the least amount possible?

1. Choose 2 datasets (1 small and 1 large)

For my small dataset I chose the one with 100 records, and for my large dataset I chose the one with 50000 records.

The small dataset is loaded into df_small and the large dataset into df_large:

df_small <- read.csv("100 Sales Records.csv")
df_large <- read.csv("50000 Sales Records.csv")
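
The date parsing and decision tree steps further down rely on a few add-on packages. A minimal setup chunk, assuming lubridate, rpart, and rpart.plot are already installed, would be:

library(lubridate)   # mdy() for parsing the Order.Date and Ship.Date columns
library(rpart)       # decision tree models
library(rpart.plot)  # plotting the fitted trees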

2. Review the loaded data

summary(df_small)
##     Region            Country           Item.Type         Sales.Channel     
##  Length:100         Length:100         Length:100         Length:100        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  Order.Priority      Order.Date           Order.ID          Ship.Date        
##  Length:100         Length:100         Min.   :114606559   Length:100        
##  Class :character   Class :character   1st Qu.:338922488   Class :character  
##  Mode  :character   Mode  :character   Median :557708561   Mode  :character  
##                                        Mean   :555020412                     
##                                        3rd Qu.:790755081                     
##                                        Max.   :994022214                     
##    Units.Sold     Unit.Price       Unit.Cost      Total.Revenue    
##  Min.   : 124   Min.   :  9.33   Min.   :  6.92   Min.   :   4870  
##  1st Qu.:2836   1st Qu.: 81.73   1st Qu.: 35.84   1st Qu.: 268721  
##  Median :5382   Median :179.88   Median :107.28   Median : 752314  
##  Mean   :5129   Mean   :276.76   Mean   :191.05   Mean   :1373488  
##  3rd Qu.:7369   3rd Qu.:437.20   3rd Qu.:263.33   3rd Qu.:2212045  
##  Max.   :9925   Max.   :668.27   Max.   :524.96   Max.   :5997055  
##    Total.Cost       Total.Profit    
##  Min.   :   3612   Min.   :   1258  
##  1st Qu.: 168868   1st Qu.: 121444  
##  Median : 363566   Median : 290768  
##  Mean   : 931806   Mean   : 441682  
##  3rd Qu.:1613870   3rd Qu.: 635829  
##  Max.   :4509794   Max.   :1719922
summary(df_large)
##     Region            Country           Item.Type         Sales.Channel     
##  Length:50000       Length:50000       Length:50000       Length:50000      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  Order.Priority      Order.Date           Order.ID          Ship.Date        
##  Length:50000       Length:50000       Min.   :100013196   Length:50000      
##  Class :character   Class :character   1st Qu.:324007046   Class :character  
##  Mode  :character   Mode  :character   Median :550422394   Mode  :character  
##                                        Mean   :549733027                     
##                                        3rd Qu.:776782381                     
##                                        Max.   :999999463                     
##    Units.Sold      Unit.Price       Unit.Cost      Total.Revenue    
##  Min.   :    1   Min.   :  9.33   Min.   :  6.92   Min.   :     28  
##  1st Qu.: 2498   1st Qu.: 81.73   1st Qu.: 35.84   1st Qu.: 276487  
##  Median : 5018   Median :154.06   Median : 97.44   Median : 781325  
##  Mean   : 5000   Mean   :265.65   Mean   :187.32   Mean   :1323716  
##  3rd Qu.: 7493   3rd Qu.:421.89   3rd Qu.:263.33   3rd Qu.:1808642  
##  Max.   :10000   Max.   :668.27   Max.   :524.96   Max.   :6682032  
##    Total.Cost       Total.Profit      
##  Min.   :     21   Min.   :      7.2  
##  1st Qu.: 160637   1st Qu.:  94150.9  
##  Median : 467104   Median : 279536.4  
##  Mean   : 933157   Mean   : 390558.7  
##  3rd Qu.:1190390   3rd Qu.: 564286.7  
##  Max.   :5249075   Max.   :1738178.4

From the summary statistics it can be seen that both datasets have the same 14 variables: 7 numeric and 7 character. Two of the character variables (Order.Date and Ship.Date) are actually dates, so I convert them to Date variables below. There do not appear to be any missing values.
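
As a quick sanity check on that last point, counting the NA values per column should return all zeros (a minimal sketch; note that read.csv may also leave blanks as empty strings rather than NA):

colSums(is.na(df_small))
colSums(is.na(df_large))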

df_small$Order.Date <- mdy(df_small$Order.Date)
df_large$Order.Date <- mdy(df_large$Order.Date)

df_small$Ship.Date <- mdy(df_small$Ship.Date)
df_large$Ship.Date <- mdy(df_large$Ship.Date)

Now let's look at the summary again.

summary(df_small)
##     Region            Country           Item.Type         Sales.Channel     
##  Length:100         Length:100         Length:100         Length:100        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  Order.Priority       Order.Date            Order.ID        
##  Length:100         Min.   :2010-02-02   Min.   :114606559  
##  Class :character   1st Qu.:2012-02-14   1st Qu.:338922488  
##  Mode  :character   Median :2013-07-12   Median :557708561  
##                     Mean   :2013-09-16   Mean   :555020412  
##                     3rd Qu.:2015-04-07   3rd Qu.:790755081  
##                     Max.   :2017-05-22   Max.   :994022214  
##    Ship.Date            Units.Sold     Unit.Price       Unit.Cost     
##  Min.   :2010-02-25   Min.   : 124   Min.   :  9.33   Min.   :  6.92  
##  1st Qu.:2012-02-24   1st Qu.:2836   1st Qu.: 81.73   1st Qu.: 35.84  
##  Median :2013-08-11   Median :5382   Median :179.88   Median :107.28  
##  Mean   :2013-10-09   Mean   :5129   Mean   :276.76   Mean   :191.05  
##  3rd Qu.:2015-04-28   3rd Qu.:7369   3rd Qu.:437.20   3rd Qu.:263.33  
##  Max.   :2017-06-17   Max.   :9925   Max.   :668.27   Max.   :524.96  
##  Total.Revenue       Total.Cost       Total.Profit    
##  Min.   :   4870   Min.   :   3612   Min.   :   1258  
##  1st Qu.: 268721   1st Qu.: 168868   1st Qu.: 121444  
##  Median : 752314   Median : 363566   Median : 290768  
##  Mean   :1373488   Mean   : 931806   Mean   : 441682  
##  3rd Qu.:2212045   3rd Qu.:1613870   3rd Qu.: 635829  
##  Max.   :5997055   Max.   :4509794   Max.   :1719922
summary(df_large)
##     Region            Country           Item.Type         Sales.Channel     
##  Length:50000       Length:50000       Length:50000       Length:50000      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  Order.Priority       Order.Date            Order.ID        
##  Length:50000       Min.   :2010-01-01   Min.   :100013196  
##  Class :character   1st Qu.:2011-11-15   1st Qu.:324007046  
##  Mode  :character   Median :2013-10-09   Median :550422394  
##                     Mean   :2013-10-11   Mean   :549733027  
##                     3rd Qu.:2015-09-04   3rd Qu.:776782381  
##                     Max.   :2017-07-28   Max.   :999999463  
##    Ship.Date            Units.Sold      Unit.Price       Unit.Cost     
##  Min.   :2010-01-02   Min.   :    1   Min.   :  9.33   Min.   :  6.92  
##  1st Qu.:2011-12-11   1st Qu.: 2498   1st Qu.: 81.73   1st Qu.: 35.84  
##  Median :2013-11-02   Median : 5018   Median :154.06   Median : 97.44  
##  Mean   :2013-11-05   Mean   : 5000   Mean   :265.65   Mean   :187.32  
##  3rd Qu.:2015-09-30   3rd Qu.: 7493   3rd Qu.:421.89   3rd Qu.:263.33  
##  Max.   :2017-09-16   Max.   :10000   Max.   :668.27   Max.   :524.96  
##  Total.Revenue       Total.Cost       Total.Profit      
##  Min.   :     28   Min.   :     21   Min.   :      7.2  
##  1st Qu.: 276487   1st Qu.: 160637   1st Qu.:  94150.9  
##  Median : 781325   Median : 467104   Median : 279536.4  
##  Mean   :1323716   Mean   : 933157   Mean   : 390558.7  
##  3rd Qu.:1808642   3rd Qu.:1190390   3rd Qu.: 564286.7  
##  Max.   :6682032   Max.   :5249075   Max.   :1738178.4

Now it can be seen that both datasets cover orders from 2010 through 2017.

I will be choosing Order.Priority as my target variable: when a new order comes in, the model should be able to assign the right priority to it. There are 4 different priority levels (C, H, L, M) present throughout the dataset.

The variables most likely to affect the target variable Order.Priority are Item.Type, Total.Revenue and Units.Sold.
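
Before modelling, it is worth confirming that the four priority levels are reasonably balanced, for example with a simple tabulation (sketch only, output not shown):

table(df_small$Order.Priority)
prop.table(table(df_large$Order.Priority))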

3. Short essay on algorithm selection

The target variable is categorical, so the number of suitable algorithms is smaller. The KNN algorithm could be used, as it makes no assumptions about the distribution of the data; its predictions are based on the k most similar training examples for a new instance. It is a workable choice here because the dataset has no missing values, which KNN cannot handle. However, KNN can have a tough time with outliers, so I am choosing not to go ahead with it.

My next option is decision trees. Decision trees can handle outliers and work well with categorical data. Having one small and one large dataset does not affect decision trees in any way, and just like KNN, they make no assumptions about the distribution of the data.

I am going to build the decision tree model for the small dataset first, followed by the large dataset.

Small Dataset (100 records)

Below I convert 4 of the 7 character variables to factors. After that I split the dataset into a train and a test set in a 75:25 ratio.

set.seed(123)

df_small$Region <- as.factor(df_small$Region)
df_small$Item.Type <- as.factor(df_small$Item.Type)
df_small$Sales.Channel <- as.factor(df_small$Sales.Channel)
df_small$Order.Priority <- as.factor(df_small$Order.Priority)

df_sample <- sample(nrow(df_small), round(nrow(df_small)*0.75), replace = FALSE)
small_train <- df_small[df_sample, ]
small_test <- df_small[-df_sample, ]
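
With only 100 records, a purely random split can leave some priority classes under-represented in the 25-row test set, so a quick check of the class counts in each partition is worthwhile (sketch, output omitted):

table(small_train$Order.Priority)
table(small_test$Order.Priority)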

Using the rpart package, I build the model and then plot it with rpart.plot.

small_model <- rpart(Order.Priority ~ Region + Item.Type + Sales.Channel + Order.Date + Order.ID +
                       Ship.Date + Units.Sold + Total.Revenue + Total.Cost + Total.Profit,
                     method = "class", data = small_train)

rpart.plot(small_model)

Applying the above model to the test set.

small_pred <- predict(small_model, small_test, type = "class")
small_pred <- table(small_test$Order.Priority, small_pred)
small_pred
##    small_pred
##     C H L M
##   C 0 2 0 1
##   H 1 6 2 2
##   L 0 1 4 2
##   M 0 2 1 1

Accuracy of the model.

sum(diag(small_pred)) / nrow(small_test)
## [1] 0.44
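
Overall accuracy hides how each individual priority class fares. Since the rows of the table above are the actual classes, a per-class recall can be read off directly (a small sketch using the same table):

diag(small_pred) / rowSums(small_pred)   # correct predictions per class divided by actual counts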

Large Dataset (50000 records)

Just like for the small dataset, I convert 4 of the 7 character variables to factors and then split the dataset into a train and a test set in a 75:25 ratio.

set.seed(456)

df_large$Region <- as.factor(df_large$Region)
df_large$Item.Type <- as.factor(df_large$Item.Type)
df_large$Sales.Channel <- as.factor(df_large$Sales.Channel)
df_large$Order.Priority <- as.factor(df_large$Order.Priority)

df_sample <- sample(nrow(df_large), round(nrow(df_large)*0.75), replace = FALSE)
large_train <- df_large[df_sample, ]
large_test <- df_large[-df_sample, ]

Using the rpart package, I build the model and then plot it with rpart.plot. For this larger dataset I also pass custom rpart.control() settings (minsplit, minbucket, cp).

large_model <- rpart(Order.Priority ~ Region + Item.Type + Sales.Channel + Order.Date + Order.ID +
                       Ship.Date + Units.Sold + Total.Revenue + Total.Cost + Total.Profit,
                     method = "class", data = large_train,
                     control = rpart.control(minsplit = 2, minbucket = 3, cp = 0.001))

rpart.plot(large_model)

Applying the above model to the test set.

large_pred <- predict(large_model, large_test, type = "class")
large_pred <- table(large_test$Order.Priority, large_pred)
large_pred
##    large_pred
##        C    H    L    M
##   C  974 1619  394  164
##   H  952 1555  367  170
##   L  967 1626  371  179
##   M  995 1589  398  180

Accuracy of the model.

sum(diag(large_pred)) / nrow(large_test)
## [1] 0.2464
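
With four roughly balanced priority classes, always guessing the most common class would itself score close to 25%, so it is informative to compare against that no-information baseline (sketch, output omitted):

max(table(large_test$Order.Priority)) / nrow(large_test)   # majority-class baseline accuracy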

4. Analysis of the result

The accuracy for the small dataset is 44%, whereas the accuracy for the large dataset is roughly 25%. Neither result is great, but the accuracy on the small dataset is noticeably better than on the large one. The large dataset almost crashed my PC; passing control = rpart.control(minsplit = 2, minbucket = 3, cp = 0.001) is what saved me here.

Based on my findings here, I would not recommend either of these models for making a business decision. When using too much data the errors tend to increase, but that may simply mean I need an algorithm that can handle this data better.
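
One way to decide which result to trust without leaning on a single train/test split is rpart's built-in cross-validation: printcp() and plotcp() report the cross-validated error at each complexity parameter, and the tree can be pruned back to the cp with the lowest xerror. A hedged sketch, not run here:

printcp(large_model)   # cross-validated error at each complexity parameter
plotcp(large_model)

best_cp <- large_model$cptable[which.min(large_model$cptable[, "xerror"]), "CP"]
pruned_model <- prune(large_model, cp = best_cp)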