Visit the following website and explore the range of sizes of this dataset (from 100 to 5 million records).
https://eforexcel.com/wp/downloads-18-sample-csv-files-data-sets-for-testing-sales/
Based on your computer’s capabilities (memory, CPU), select 2 files you can handle (recommended one small, one large)
Review the structure and content of the tables, and think about which two machine learning algorithms presented so far could be used to analyze the data, and how they could be applied in the suggested environment of the datasets.
Write a short essay explaining your selection. Then, select one of the 2 algorithms and explore how to analyze and predict an outcome based on the data available. This will be an exploratory exercise, so feel free to show errors and warnings that arise during the analysis. Test the code with both datasets selected and compare the results.
Which result will you trust if you need to make a business decision? Do you think an analysis could be prone to errors when using too much data, or when using the least amount possible?
For my small dataset I chose the file with 100 records, and for my large dataset I chose the file with 50,000 records.
The small dataset is loaded into df_small and the large dataset into df_large.
df_small <- read.csv("100 Sales Records.csv")
df_large <- read.csv("50000 Sales Records.csv")
summary(df_small)
## Region Country Item.Type Sales.Channel
## Length:100 Length:100 Length:100 Length:100
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Order.Priority Order.Date Order.ID Ship.Date
## Length:100 Length:100 Min. :114606559 Length:100
## Class :character Class :character 1st Qu.:338922488 Class :character
## Mode :character Mode :character Median :557708561 Mode :character
## Mean :555020412
## 3rd Qu.:790755081
## Max. :994022214
## Units.Sold Unit.Price Unit.Cost Total.Revenue
## Min. : 124 Min. : 9.33 Min. : 6.92 Min. : 4870
## 1st Qu.:2836 1st Qu.: 81.73 1st Qu.: 35.84 1st Qu.: 268721
## Median :5382 Median :179.88 Median :107.28 Median : 752314
## Mean :5129 Mean :276.76 Mean :191.05 Mean :1373488
## 3rd Qu.:7369 3rd Qu.:437.20 3rd Qu.:263.33 3rd Qu.:2212045
## Max. :9925 Max. :668.27 Max. :524.96 Max. :5997055
## Total.Cost Total.Profit
## Min. : 3612 Min. : 1258
## 1st Qu.: 168868 1st Qu.: 121444
## Median : 363566 Median : 290768
## Mean : 931806 Mean : 441682
## 3rd Qu.:1613870 3rd Qu.: 635829
## Max. :4509794 Max. :1719922
summary(df_large)
## Region Country Item.Type Sales.Channel
## Length:50000 Length:50000 Length:50000 Length:50000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Order.Priority Order.Date Order.ID Ship.Date
## Length:50000 Length:50000 Min. :100013196 Length:50000
## Class :character Class :character 1st Qu.:324007046 Class :character
## Mode :character Mode :character Median :550422394 Mode :character
## Mean :549733027
## 3rd Qu.:776782381
## Max. :999999463
## Units.Sold Unit.Price Unit.Cost Total.Revenue
## Min. : 1 Min. : 9.33 Min. : 6.92 Min. : 28
## 1st Qu.: 2498 1st Qu.: 81.73 1st Qu.: 35.84 1st Qu.: 276487
## Median : 5018 Median :154.06 Median : 97.44 Median : 781325
## Mean : 5000 Mean :265.65 Mean :187.32 Mean :1323716
## 3rd Qu.: 7493 3rd Qu.:421.89 3rd Qu.:263.33 3rd Qu.:1808642
## Max. :10000 Max. :668.27 Max. :524.96 Max. :6682032
## Total.Cost Total.Profit
## Min. : 21 Min. : 7.2
## 1st Qu.: 160637 1st Qu.: 94150.9
## Median : 467104 Median : 279536.4
## Mean : 933157 Mean : 390558.7
## 3rd Qu.:1190390 3rd Qu.: 564286.7
## Max. :5249075 Max. :1738178.4
From the summary statistics it can be seen that both datasets have the same 14 variables: 7 numeric and 7 character. Two of the character variables are dates, so I convert them to Date variables below. There do not appear to be any missing values.
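A quick check along these lines backs up the class and missing-value observations (my own addition; output not shown):
sapply(df_small, class)     # class of each variable
colSums(is.na(df_small))    # missing values per column, small dataset
colSums(is.na(df_large))    # missing values per column, large dataset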
library(lubridate)  # mdy() comes from the lubridate package
df_small$Order.Date <- mdy(df_small$Order.Date)
df_large$Order.Date <- mdy(df_large$Order.Date)
df_small$Ship.Date <- mdy(df_small$Ship.Date)
df_large$Ship.Date <- mdy(df_large$Ship.Date)
Now let's look at the summary again.
summary(df_small)
## Region Country Item.Type Sales.Channel
## Length:100 Length:100 Length:100 Length:100
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Order.Priority Order.Date Order.ID
## Length:100 Min. :2010-02-02 Min. :114606559
## Class :character 1st Qu.:2012-02-14 1st Qu.:338922488
## Mode :character Median :2013-07-12 Median :557708561
## Mean :2013-09-16 Mean :555020412
## 3rd Qu.:2015-04-07 3rd Qu.:790755081
## Max. :2017-05-22 Max. :994022214
## Ship.Date Units.Sold Unit.Price Unit.Cost
## Min. :2010-02-25 Min. : 124 Min. : 9.33 Min. : 6.92
## 1st Qu.:2012-02-24 1st Qu.:2836 1st Qu.: 81.73 1st Qu.: 35.84
## Median :2013-08-11 Median :5382 Median :179.88 Median :107.28
## Mean :2013-10-09 Mean :5129 Mean :276.76 Mean :191.05
## 3rd Qu.:2015-04-28 3rd Qu.:7369 3rd Qu.:437.20 3rd Qu.:263.33
## Max. :2017-06-17 Max. :9925 Max. :668.27 Max. :524.96
## Total.Revenue Total.Cost Total.Profit
## Min. : 4870 Min. : 3612 Min. : 1258
## 1st Qu.: 268721 1st Qu.: 168868 1st Qu.: 121444
## Median : 752314 Median : 363566 Median : 290768
## Mean :1373488 Mean : 931806 Mean : 441682
## 3rd Qu.:2212045 3rd Qu.:1613870 3rd Qu.: 635829
## Max. :5997055 Max. :4509794 Max. :1719922
summary(df_large)
## Region Country Item.Type Sales.Channel
## Length:50000 Length:50000 Length:50000 Length:50000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Order.Priority Order.Date Order.ID
## Length:50000 Min. :2010-01-01 Min. :100013196
## Class :character 1st Qu.:2011-11-15 1st Qu.:324007046
## Mode :character Median :2013-10-09 Median :550422394
## Mean :2013-10-11 Mean :549733027
## 3rd Qu.:2015-09-04 3rd Qu.:776782381
## Max. :2017-07-28 Max. :999999463
## Ship.Date Units.Sold Unit.Price Unit.Cost
## Min. :2010-01-02 Min. : 1 Min. : 9.33 Min. : 6.92
## 1st Qu.:2011-12-11 1st Qu.: 2498 1st Qu.: 81.73 1st Qu.: 35.84
## Median :2013-11-02 Median : 5018 Median :154.06 Median : 97.44
## Mean :2013-11-05 Mean : 5000 Mean :265.65 Mean :187.32
## 3rd Qu.:2015-09-30 3rd Qu.: 7493 3rd Qu.:421.89 3rd Qu.:263.33
## Max. :2017-09-16 Max. :10000 Max. :668.27 Max. :524.96
## Total.Revenue Total.Cost Total.Profit
## Min. : 28 Min. : 21 Min. : 7.2
## 1st Qu.: 276487 1st Qu.: 160637 1st Qu.: 94150.9
## Median : 781325 Median : 467104 Median : 279536.4
## Mean :1323716 Mean : 933157 Mean : 390558.7
## 3rd Qu.:1808642 3rd Qu.:1190390 3rd Qu.: 564286.7
## Max. :6682032 Max. :5249075 Max. :1738178.4
Now it can be seen that both datasets contain orders from 2010 through 2017.
I will choose Order.Priority as my target variable: when a new order comes in, the model should be able to assign the right priority to it. There are 4 priority levels (C, H, L, M) present throughout the dataset.
The variables that could affect the target variable Order.Priority are Item.Type, Total.Revenue, and Units.Sold.
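A quick exploratory look at whether these candidate predictors separate the priority levels at all (my own addition; output not shown):
boxplot(Total.Revenue ~ Order.Priority, data = df_small,
        main = "Total revenue by order priority (small dataset)")
table(df_small$Item.Type, df_small$Order.Priority)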
The target variable is categorical, so the pool of potential algorithms is smaller. The kNN algorithm could be used, as it makes no assumptions about the distribution of the data; its predictions are based on the k most similar training instances for a new observation. That fits here because there are no missing values, and the kNN algorithm cannot handle missing data. But kNN may have a tough time with outliers, so I am choosing not to go ahead with it.
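For completeness, here is a minimal sketch of how kNN could be applied to the small dataset (not the approach I follow below). It assumes the class package is installed; since kNN works on distances, only numeric predictors are used and they are scaled against the training split.
library(class)
set.seed(123)
num_cols <- c("Units.Sold", "Total.Revenue", "Total.Profit")
idx     <- sample(nrow(df_small), round(nrow(df_small) * 0.75))
train_x <- scale(df_small[idx, num_cols])
test_x  <- scale(df_small[-idx, num_cols],
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))
knn_pred <- knn(train_x, test_x, cl = factor(df_small$Order.Priority[idx]), k = 5)
table(factor(df_small$Order.Priority[-idx]), knn_pred)   # confusion matrix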
My next option is decision trees. Decision trees can handle outliers and work well with categorical data, and having one small and one large dataset does not affect a decision tree in any way. Just like kNN, decision trees make no assumptions about the distribution of the data.
I am going to build the decision tree model for the small dataset first, followed by the large dataset.
Small Dataset (100 records)
Below I convert 4 of the 7 character variables to factors. After that, I split the dataset into training and test sets in a 75:25 ratio.
set.seed(123)
# Convert the categorical predictors and the target variable to factors
df_small$Region <- as.factor(df_small$Region)
df_small$Item.Type <- as.factor(df_small$Item.Type)
df_small$Sales.Channel <- as.factor(df_small$Sales.Channel)
df_small$Order.Priority <- as.factor(df_small$Order.Priority)
# 75:25 split into training and test sets by row index
df_sample <- sample(nrow(df_small), round(nrow(df_small)*0.75), replace = FALSE)
small_train <- df_small[df_sample, ]
small_test <- df_small[-df_sample, ]
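Before fitting, a quick check (my own addition; output not shown) that all four priority levels are represented in the training split:
prop.table(table(small_train$Order.Priority))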
Using the rpart package, I am going to build and plot the model.
library(rpart)
library(rpart.plot)
small_model <- rpart(Order.Priority ~ Region + Item.Type + Sales.Channel + Order.Date + Order.ID + Ship.Date + Units.Sold + Total.Revenue + Total.Cost + Total.Profit, method = "class", data = small_train)
rpart.plot(small_model)
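To see which splits rpart actually kept and how much each variable contributed, the fitted object can be inspected (my own addition; output not shown):
printcp(small_model)               # complexity / cross-validation table
small_model$variable.importance    # rough importance scores (NULL if the tree has no splits)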
Using the above model on the test set.
small_pred <- predict(small_model, small_test, type = "class")
small_pred <- table(small_test$Order.Priority, small_pred)
small_pred
## small_pred
## C H L M
## C 0 2 0 1
## H 1 6 2 2
## L 0 1 4 2
## M 0 2 1 1
Accuracy of the model.
sum(diag(small_pred)) / nrow(small_test)
## [1] 0.44
Large Dataset (50,000 records)
Just like the small dataset, I convert 4 of the 7 character variables to factors. After that, I split the dataset into training and test sets in a 75:25 ratio.
set.seed(456)
df_large$Region <- as.factor(df_large$Region)
df_large$Item.Type <- as.factor(df_large$Item.Type)
df_large$Sales.Channel <- as.factor(df_large$Sales.Channel)
df_large$Order.Priority <- as.factor(df_large$Order.Priority)
df_sample <- sample(nrow(df_large), round(nrow(df_large)*0.75), replace = FALSE)
large_train <- df_large[df_sample, ]
large_test <- df_large[-df_sample, ]
Using the rpart package, I am going to build and plot the model, this time with looser rpart.control settings (minsplit, minbucket, cp).
large_model <- rpart(Order.Priority ~ Region + Item.Type + Sales.Channel + Order.Date + Order.ID + Ship.Date + Units.Sold + Total.Revenue + Total.Cost + Total.Profit, method = "class", data = large_train, control = rpart.control(minsplit = 2, minbucket = 3, cp = 0.001))
rpart.plot(large_model)
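One way to sanity-check the looser cp = 0.001 setting is rpart's built-in cross-validation table, which shows how the error changes as the tree grows (my own addition; output not shown):
printcp(large_model)   # cross-validated error at each cp value
plotcp(large_model)    # visual aid for picking a reasonable cp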
Using the above model on the test set.
large_pred <- predict(large_model, large_test, type = "class")
large_pred <- table(large_test$Order.Priority, large_pred)
large_pred
## large_pred
## C H L M
## C 974 1619 394 164
## H 952 1555 367 170
## L 967 1626 371 179
## M 995 1589 398 180
Accuracy of the model.
sum(diag(large_pred)) / nrow(large_test)
## [1] 0.2464
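For context (my own quick check, not part of the original output), the accuracy of always predicting the most frequent priority in each test set gives a simple baseline to compare both models against:
max(table(small_test$Order.Priority)) / nrow(small_test)
max(table(large_test$Order.Priority)) / nrow(large_test)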
The accuracy for the small dataset is 44%, whereas the accuracy for the large dataset is roughly 25%. These are not great results, but what I noticed is that the accuracy for the small dataset is much better than for the large one. It is worth keeping in mind, though, that the small test set has only 25 records, and with four priority classes an accuracy around 25% is close to what random guessing would give. Building the tree on the large dataset almost crashed my PC; control = rpart.control(minsplit = 2, minbucket = 3, cp = 0.001) saved me here.
Based on my findings, I would not recommend either of these models for making a business decision. When using too much data, errors tend to increase, but that could simply mean I need an algorithm that can handle the errors better.