As the quiz that was part of the original content was discarded, here’s a new assignment:
Visit the following website and explore the range of sizes of this data set (from 100 to 5 million records): https://eforexcel.com/wp/downloads-18-sample-csv-files-data-sets-for-testing-sales/. Based on your computer’s capabilities (memory, CPU), select 2 files you can handle (one small and one large is recommended).
Review the structure and content of the tables, and consider which two machine learning algorithms presented so far could be used to analyze the data, and how they could be applied in the suggested environment of the data sets.
Write a short essay explaining your selection. Then select one of the 2 algorithms and explore how to analyze and predict an outcome based on the data available. This is an exploratory exercise, so feel free to show errors and warnings that arise during the analysis. Test the code with both selected data sets and compare the results. Which result would you trust if you needed to make a business decision? Do you think an analysis could be prone to errors when using too much data, or when using the least amount possible?
Develop your exploratory analysis of the data and the essay over the following 2 weeks. You’ll have until March 17 to submit both.
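The output below comes from loading the smaller file and inspecting it with `summary()` and `names()`. The row counts later in the analysis suggest the 5,000- and 50,000-record files were used. A minimal sketch of the setup (the file name and date format are assumptions, since the source chunks aren't echoed):

```r
# Load the smaller sample file; the exact file name is an assumption.
small <- read.csv("5000 Sales Records.csv", stringsAsFactors = TRUE)

# Parse the date columns (these files use month/day/year formatting).
small$Order.Date <- as.Date(small$Order.Date, format = "%m/%d/%Y")
small$Ship.Date  <- as.Date(small$Ship.Date,  format = "%m/%d/%Y")

summary(small)  # per-column summaries, shown below
names(small)    # the 14 column names
```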
## Region Country
## Asia : 719 Andorra : 40
## Australia and Oceania : 416 San Marino : 40
## Central America and the Caribbean: 534 Ghana : 38
## Europe :1330 Mauritius : 38
## Middle East and North Africa : 610 United States of America: 38
## North America : 106 Tonga : 37
## Sub-Saharan Africa :1285 (Other) :4769
## Item.Type Sales.Channel Order.Priority Order.Date
## Beverages : 447 Offline:2504 C:1174 Min. :2010-01-01
## Fruits : 447 Online :2496 H:1278 1st Qu.:2011-12-08
## Baby Food : 445 L:1227 Median :2013-10-23
## Cosmetics : 424 M:1321 Mean :2013-10-19
## Household : 424 3rd Qu.:2015-09-08
## Office Supplies: 420 Max. :2017-07-28
## (Other) :2393
## Order.ID Ship.Date Units.Sold Unit.Price
## Min. :100090873 Min. :2010-01-01 Min. : 2 Min. : 9.33
## 1st Qu.:320104217 1st Qu.:2011-12-08 1st Qu.:2453 1st Qu.: 81.73
## Median :552314960 Median :2013-10-23 Median :5123 Median :154.06
## Mean :548644737 Mean :2013-10-19 Mean :5031 Mean :265.75
## 3rd Qu.:768770944 3rd Qu.:2015-09-08 3rd Qu.:7576 3rd Qu.:437.20
## Max. :999879729 Max. :2017-07-28 Max. :9999 Max. :668.27
##
## Unit.Cost Total.Revenue Total.Cost Total.Profit
## Min. : 6.92 Min. : 65 Min. : 48 Min. : 16.9
## 1st Qu.: 35.84 1st Qu.: 257417 1st Qu.: 154748 1st Qu.: 85339.3
## Median : 97.44 Median : 779409 Median : 468181 Median : 279095.2
## Mean :187.49 Mean :1325738 Mean : 933093 Mean : 392644.6
## 3rd Qu.:263.33 3rd Qu.:1839975 3rd Qu.:1189578 3rd Qu.: 565106.4
## Max. :524.96 Max. :6672676 Max. :5248025 Max. :1726007.5
##
## [1] "Region" "Country" "Item.Type" "Sales.Channel"
## [5] "Order.Priority" "Order.Date" "Order.ID" "Ship.Date"
## [9] "Units.Sold" "Unit.Price" "Unit.Cost" "Total.Revenue"
## [13] "Total.Cost" "Total.Profit"
Each data set consists of randomly generated observations with the following 14 characteristics: an order ID and order/ship dates, the region and country of the order, the item type and order priority, and whether the order was placed online or offline. There is also a units-sold column, as well as 5 columns dedicated to pricing. The number of units sold varies between 2 and a maximum of 9,999.
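The counts and date ranges below can be reproduced with a few one-liners; a sketch, assuming the data frame is named `small` as above:

```r
length(unique(small$Country))    # distinct countries
length(unique(small$Item.Type))  # distinct item types

min(small$Order.Date)            # earliest order
max(small$Order.Date)            # latest order
difftime(max(small$Order.Date), min(small$Order.Date), units = "days")
as.numeric(max(small$Order.Date) - min(small$Order.Date)) / 365  # in years
```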
## [1] 185
## [1] 12
## [1] "2010-01-01"
## [1] "2017-07-28"
## Time difference of 2765 days
## [1] 7.575342
There are 185 countries included, and 12 item types. The dates range between 2010 and 2017, a span of about 7.6 years.
There don’t appear to be any year-on-year patterns with respect to profit.
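Processing time here means the gap between order date and ship date. A sketch of the check, assuming dplyr (the `summarise`-style output below suggests it, but that's an inference):

```r
library(dplyr)

# Average gap between ordering and shipping, in days.
small %>%
  mutate(processing_time = Ship.Date - Order.Date) %>%
  summarise(mean(processing_time))
```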
## mean(processing_time)
## 1 0 days
Also, it appears that shipping date and order date are always the same.
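The next block of output counts unique prices and costs, then verifies the accounting identities row by row. A sketch, using `dplyr::near()` for tolerant floating-point comparison (the exact test used originally is an assumption):

```r
length(unique(small$Unit.Price))  # one price per item type
length(unique(small$Unit.Cost))   # one cost per item type

# TRUE wherever each derived column matches its accounting identity.
small <- small %>%
  mutate(revenue_test = near(Total.Revenue, Units.Sold * Unit.Price),
         cost_test    = near(Total.Cost,    Units.Sold * Unit.Cost),
         profit_test  = near(Total.Profit,  Total.Revenue - Total.Cost))

small %>% summarise(sum(revenue_test))  # rows passing, out of 5000
small %>% summarise(sum(cost_test))
small %>% summarise(sum(profit_test))
```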
## [1] 12
## [1] 12
## sum(revenue_test)
## 1 4999
## sum(cost_test)
## 1 4996
## sum(profit_test)
## 1 4994
The last 5 columns, related to cost and revenue, are all derived from the units sold and the item type. Unit price and unit cost each take 12 unique values, one per item type. Additionally, the totals for cost, revenue, and profit are derived from these values, so nearly all of these variables can be recalculated with some accounting arithmetic (the checks above pass for all but a handful of rows, likely due to rounding). It’s assumed that these descriptions hold for the larger data set as well.
Finally, let’s look at the distribution of several columns. It appears that the number of units sold, priority, and item type are all uniformly distributed. Meanwhile, total profit, along with cost and revenue, is skewed to the right.
Because many of these variables are derived from each other, a machine learning model should instead try to predict something that can’t already be derived during an exploratory analysis. To illustrate the problem, I’ve used a simple linear model to fit total profit using the small data frame.
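A sketch of the fit (`small_fix` appears in the `lm()` call below and is presumably the cleaned small data frame; the rounded coefficients printed after the summary come from `round(coef(...), 2)`, which is an assumption):

```r
# Interaction model: a separate profit-per-unit slope for each item type.
profit_lm <- lm(Total.Profit ~ Item.Type * Units.Sold, data = small_fix)
summary(profit_lm)
round(coef(profit_lm), 2)  # per-item-type profit-per-unit slopes
```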
##
## Call:
## lm(formula = Total.Profit ~ Item.Type * Units.Sold, data = small_fix)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.425e-07 -2.000e-10 -2.000e-11 5.000e-11 5.024e-07
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.455e-10 1.179e-09 -3.780e-01 0.705473
## Item.TypeBeverages -1.441e-08 1.672e-09 -8.620e+00 < 2e-16
## Item.TypeCereal 1.891e-09 1.749e-09 1.081e+00 0.279698
## Item.TypeClothes 8.216e-09 1.738e-09 4.728e+00 2.33e-06
## Item.TypeCosmetics -6.292e-09 1.709e-09 -3.682e+00 0.000234
## Item.TypeFruits 2.675e-08 1.704e-09 1.570e+01 < 2e-16
## Item.TypeHousehold 2.318e-08 1.697e-09 1.366e+01 < 2e-16
## Item.TypeMeat -8.442e-09 1.658e-09 -5.090e+00 3.71e-07
## Item.TypeOffice Supplies -8.366e-09 1.686e-09 -4.962e+00 7.20e-07
## Item.TypePersonal Care -2.489e-09 1.697e-09 -1.467e+00 0.142537
## Item.TypeSnacks -2.384e-09 1.678e-09 -1.421e+00 0.155479
## Item.TypeVegetables -1.812e-09 1.738e-09 -1.042e+00 0.297332
## Units.Sold 9.586e+01 1.993e-13 4.809e+14 < 2e-16
## Item.TypeBeverages:Units.Sold -8.020e+01 2.881e-13 -2.784e+14 < 2e-16
## Item.TypeCereal:Units.Sold -7.270e+00 2.984e-13 -2.436e+13 < 2e-16
## Item.TypeClothes:Units.Sold -2.242e+01 2.935e-13 -7.639e+13 < 2e-16
## Item.TypeCosmetics:Units.Sold 7.801e+01 2.920e-13 2.672e+14 < 2e-16
## Item.TypeFruits:Units.Sold -9.345e+01 2.867e-13 -3.260e+14 < 2e-16
## Item.TypeHousehold:Units.Sold 6.987e+01 2.925e-13 2.389e+14 < 2e-16
## Item.TypeMeat:Units.Sold -3.866e+01 2.871e-13 -1.347e+14 < 2e-16
## Item.TypeOffice Supplies:Units.Sold 3.039e+01 2.875e-13 1.057e+14 < 2e-16
## Item.TypePersonal Care:Units.Sold -7.080e+01 2.861e-13 -2.475e+14 < 2e-16
## Item.TypeSnacks:Units.Sold -4.072e+01 2.900e-13 -1.404e+14 < 2e-16
## Item.TypeVegetables:Units.Sold -3.273e+01 2.964e-13 -1.104e+14 < 2e-16
##
## (Intercept)
## Item.TypeBeverages ***
## Item.TypeCereal
## Item.TypeClothes ***
## Item.TypeCosmetics ***
## Item.TypeFruits ***
## Item.TypeHousehold ***
## Item.TypeMeat ***
## Item.TypeOffice Supplies ***
## Item.TypePersonal Care
## Item.TypeSnacks
## Item.TypeVegetables
## Units.Sold ***
## Item.TypeBeverages:Units.Sold ***
## Item.TypeCereal:Units.Sold ***
## Item.TypeClothes:Units.Sold ***
## Item.TypeCosmetics:Units.Sold ***
## Item.TypeFruits:Units.Sold ***
## Item.TypeHousehold:Units.Sold ***
## Item.TypeMeat:Units.Sold ***
## Item.TypeOffice Supplies:Units.Sold ***
## Item.TypePersonal Care:Units.Sold ***
## Item.TypeSnacks:Units.Sold ***
## Item.TypeVegetables:Units.Sold ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.25e-08 on 4976 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 2.04e+29 on 23 and 4976 DF, p-value: < 2.2e-16
## (Intercept) Item.TypeBeverages
## 0.00 0.00
## Item.TypeCereal Item.TypeClothes
## 0.00 0.00
## Item.TypeCosmetics Item.TypeFruits
## 0.00 0.00
## Item.TypeHousehold Item.TypeMeat
## 0.00 0.00
## Item.TypeOffice Supplies Item.TypePersonal Care
## 0.00 0.00
## Item.TypeSnacks Item.TypeVegetables
## 0.00 0.00
## Units.Sold Item.TypeBeverages:Units.Sold
## 95.86 -80.20
## Item.TypeCereal:Units.Sold Item.TypeClothes:Units.Sold
## -7.27 -22.42
## Item.TypeCosmetics:Units.Sold Item.TypeFruits:Units.Sold
## 78.01 -93.45
## Item.TypeHousehold:Units.Sold Item.TypeMeat:Units.Sold
## 69.87 -38.66
## Item.TypeOffice Supplies:Units.Sold Item.TypePersonal Care:Units.Sold
## 30.39 -70.80
## Item.TypeSnacks:Units.Sold Item.TypeVegetables:Units.Sold
## -40.72 -32.73
Using all data points, the R-squared is 1: total profit is completely determined by just these two predictors. Next, let’s use that model to predict values in the larger data frame.
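A sketch of scoring the small-data model on the larger data frame (here called `big_fix`, an assumption) and computing the residual sum of squares:

```r
# Predict profit for the large data set using the small-data model.
big_pred <- predict(profit_lm, newdata = big_fix)
sum((big_fix$Total.Profit - big_pred)^2)  # residual sum of squares
```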
## [1] 7.574929e-12
In this case the RSS is essentially zero, indicating that the same deterministic relationship between item type, units sold, and total profit holds in the larger data set.
Instead, it makes more sense to model the number of units sold based on the remaining variables. As already shown, units sold is roughly uniformly distributed between 2 and 9,999. Because of this, it might make more sense to break units sold into a handful of categories spanning the whole range in roughly 2,000-unit increments.
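The bin labels appearing in the output further down correspond to breaks at 0; 2,000; 4,000; 6,000; 9,000; and 10,000 units. A sketch using `cut()`:

```r
# Discretize units sold into the order-size categories used below.
breaks <- c(0, 2000, 4000, 6000, 9000, 10000)
small_fix$unit_cat <- cut(small_fix$Units.Sold, breaks = breaks)
```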
There are a few options for predicting the order-size category. My first thought was to use decision trees, since I had just converted the outcome variable into something categorical. Decision trees are easily understood by laypeople and could offer some insight into the most important determinants of order size. In many cases it is easier to turn a problem like this into a classification task by deciding on an acceptable threshold (e.g., profit greater than $500,000 monthly); for continuous outcomes there is a related technique called a regression tree that could also be applied here, providing an opportunity to predict the value of profit directly.
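A sketch of the tree fit, using the `tree` package named in the output; the 75/25 train/test split and the derivation of `year` and `monthN` from the order date are assumptions consistent with the row counts below:

```r
library(tree)

# year and monthN are assumed to have been derived from Order.Date.
small_fix$year   <- factor(format(small_fix$Order.Date, "%Y"))
small_fix$monthN <- factor(as.integer(format(small_fix$Order.Date, "%m")))

set.seed(1)  # reproducible split; the seed is an assumption
train_idx   <- sample(nrow(small_fix), 0.75 * nrow(small_fix))
small_train <- small_fix[train_idx, ]
small_test  <- small_fix[-train_idx, ]

unit_tree <- tree(unit_cat ~ Region + Item.Type + Sales.Channel +
                    Order.Priority + year + monthN, data = small_train)
summary(unit_tree)
```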
##
## Classification tree:
## tree(formula = unit_cat ~ Region + Item.Type + Sales.Channel +
## Order.Priority + year + monthN, data = small_train)
## Variables actually used in tree construction:
## character(0)
## Number of terminal nodes: 1
## Residual mean deviance: 3.119 = 11690 / 3749
## Misclassification error rate: 0.6989 = 2621 / 3750
Fitting to the training data produces a degenerate tree: no variables are actually used, there is a single terminal node, and the misclassification rate is 0.699. This model does not appear to be appropriate for these data.
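The confusion matrix below compares the true bins (rows) with the tree's predictions (columns) on the held-out data; a sketch:

```r
small_pred <- predict(unit_tree, small_test, type = "class")
table(small_test$unit_cat, small_pred)  # truth vs. prediction
```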
## small_pred
## (0,2e+03] (2e+03,4e+03] (4e+03,6e+03] (6e+03,9e+03]
## (0,2e+03] 0 0 0 261
## (2e+03,4e+03] 0 0 0 225
## (4e+03,6e+03] 0 0 0 244
## (6e+03,9e+03] 0 0 0 395
## (9e+03,1e+04] 0 0 0 125
## small_pred
## (9e+03,1e+04]
## (0,2e+03] 0
## (2e+03,4e+03] 0
## (4e+03,6e+03] 0
## (6e+03,9e+03] 0
## (9e+03,1e+04] 0
On closer inspection, this is because every observation is being assigned to a single bin. For a second machine learning technique, I’ll attempt k-nearest neighbors (kNN), a supervised, instance-based method. This requires creating dummy variables for all categorical predictors, potentially hurting performance by inflating the number of variables.
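One way to build the all-numeric design matrix summarized below: one-hot encode the categorical predictors and standardize the numeric columns so no single scale dominates the distance computation. The underscore-separated column names suggest the `fastDummies` package, but the original tooling is an assumption:

```r
library(fastDummies)

# Drop identifiers and raw dates, which are not summarized below.
small_dummy <- small_fix[, !(names(small_fix) %in%
                  c("Country", "Order.ID", "Order.Date", "Ship.Date"))]

# One-hot encode every categorical predictor...
small_dummy <- dummy_cols(small_dummy,
                          select_columns = c("Region", "Item.Type",
                                             "Sales.Channel", "Order.Priority",
                                             "year", "monthN"),
                          remove_selected_columns = TRUE)

# ...and standardize the numeric columns for the kNN distance metric.
num_cols <- c("Units.Sold", "Unit.Price", "Unit.Cost",
              "Total.Revenue", "Total.Cost", "Total.Profit")
small_dummy[num_cols] <- scale(small_dummy[num_cols])

summary(small_dummy)
```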
## Region_Asia Region_Australia and Oceania
## Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000
## Mean :0.1427 Mean :0.0816
## 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000
## Region_Central America and the Caribbean Region_Europe
## Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000
## Mean :0.1083 Mean :0.2637
## 3rd Qu.:0.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000
## Region_Middle East and North Africa Region_North America
## Min. :0.0000 Min. :0.000
## 1st Qu.:0.0000 1st Qu.:0.000
## Median :0.0000 Median :0.000
## Mean :0.1245 Mean :0.024
## 3rd Qu.:0.0000 3rd Qu.:0.000
## Max. :1.0000 Max. :1.000
## Region_Sub-Saharan Africa Item.Type_Baby Food Item.Type_Beverages
## Min. :0.0000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.0000 Median :0.00000 Median :0.00000
## Mean :0.2552 Mean :0.09013 Mean :0.09227
## 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.00000 Max. :1.00000
## Item.Type_Cereal Item.Type_Clothes Item.Type_Cosmetics Item.Type_Fruits
## Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.0000 Median :0.0000 Median :0.00000 Median :0.00000
## Mean :0.0728 Mean :0.0816 Mean :0.08293 Mean :0.09013
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :1.00000
## Item.Type_Household Item.Type_Meat Item.Type_Office Supplies
## Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.0000 Median :0.00000 Median :0.0000
## Mean :0.0816 Mean :0.07867 Mean :0.0808
## 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.00000 Max. :1.0000
## Item.Type_Snacks Item.Type_Vegetables Sales.Channel_Offline
## Min. :0.00000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.00000 Median :0.0000 Median :1.0000
## Mean :0.07947 Mean :0.0864 Mean :0.5021
## 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:1.0000
## Max. :1.00000 Max. :1.0000 Max. :1.0000
## Sales.Channel_Online Order.Priority_C Order.Priority_H Order.Priority_L
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.4979 Mean :0.2325 Mean :0.2544 Mean :0.2493
## 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## Order.Priority_M Units.Sold Unit.Price Unit.Cost
## Min. :0.0000 Min. :-1.725398 Min. :-1.17236 Min. :-1.02357
## 1st Qu.:0.0000 1st Qu.:-0.886493 1st Qu.:-0.84134 1st Qu.:-0.85964
## Median :0.0000 Median : 0.035444 Median :-0.51064 Median :-0.51046
## Mean :0.2637 Mean : 0.001664 Mean :-0.02192 Mean :-0.02183
## 3rd Qu.:1.0000 3rd Qu.: 0.879838 3rd Qu.: 0.71391 3rd Qu.: 0.42987
## Max. :1.0000 Max. : 1.704675 Max. : 1.84039 Max. : 1.91290
## Total.Revenue Total.Cost Total.Profit year_2010
## Min. :-0.89843 Min. :-0.81071 Min. :-1.02502 Min. :0.0000
## 1st Qu.:-0.73109 1st Qu.:-0.68366 1st Qu.:-0.80811 1st Qu.:0.0000
## Median :-0.37787 Median :-0.41662 Median :-0.30233 Median :0.0000
## Mean :-0.01625 Mean :-0.01641 Mean :-0.01329 Mean :0.1229
## 3rd Qu.: 0.32186 3rd Qu.: 0.19289 3rd Qu.: 0.44623 3rd Qu.:0.0000
## Max. : 3.62412 Max. : 3.74927 Max. : 3.46152 Max. :1.0000
## year_2011 year_2012 year_2013 year_2014
## Min. :0.000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.000 Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.128 Mean :0.1355 Mean :0.1328 Mean :0.1349
## 3rd Qu.:0.000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :1.000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## year_2015 year_2016 year_2017 monthN_1
## Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.0000 Median :0.0000 Median :0.00000 Median :0.00000
## Mean :0.1328 Mean :0.1373 Mean :0.07573 Mean :0.09333
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :1.00000
## monthN_2 monthN_3 monthN_4 monthN_5
## Min. :0.0000 Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.0000 Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.0824 Mean :0.08667 Mean :0.08907 Mean :0.08987
## 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.00000 Max. :1.00000 Max. :1.00000
## monthN_6 monthN_7 monthN_8 monthN_9
## Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.00000 Median :0.0000 Median :0.0000 Median :0.00000
## Mean :0.08773 Mean :0.0872 Mean :0.0824 Mean :0.07733
## 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.0000 Max. :1.0000 Max. :1.00000
## monthN_10 monthN_11 monthN_12
## Min. :0.00 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00 Median :0.00000 Median :0.00000
## Mean :0.08 Mean :0.07573 Mean :0.06827
## 3rd Qu.:0.00 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00 Max. :1.00000 Max. :1.00000
The addition of dummy variables has increased the number of variables present to 56.
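A sketch of the kNN fit itself, using `class::knn()` and reusing the train/test split from the tree model. Which predictor columns were fed to `knn()` originally, and the value of k, are both assumptions; here the outcome bin and the raw `Units.Sold` column are excluded so the answer isn't trivially recoverable:

```r
library(class)

# Predictors: everything except the outcome and raw units sold.
pred_cols <- setdiff(names(small_dummy), c("unit_cat", "Units.Sold"))

small_test_labels <- small_fix$unit_cat[-train_idx]
small_knn <- knn(train = small_dummy[train_idx, pred_cols],
                 test  = small_dummy[-train_idx, pred_cols],
                 cl    = small_fix$unit_cat[train_idx],
                 k     = 61)  # k near sqrt(n) is a common heuristic

table(small_test_labels, small_knn)   # confusion matrix below
mean(small_knn == small_test_labels)  # overall accuracy
```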
## small_knn
## small_test_labels (0,2e+03] (2e+03,4e+03] (4e+03,6e+03] (6e+03,9e+03]
## (0,2e+03] 203 52 5 0
## (2e+03,4e+03] 42 155 35 4
## (4e+03,6e+03] 3 38 159 56
## (6e+03,9e+03] 0 5 48 295
## (9e+03,1e+04] 0 0 1 80
## small_knn
## small_test_labels (9e+03,1e+04]
## (0,2e+03] 0
## (2e+03,4e+03] 0
## (4e+03,6e+03] 0
## (6e+03,9e+03] 17
## (9e+03,1e+04] 52
## [1] 0.6912
Using kNN, units sold could be predicted accurately in about 69% of cases. This is at the lower limit of what’s usually considered effective classification. Looking at the confusion matrix, the most common errors are assigning an observation to an adjacent bin.
Next, let’s do the same thing but with a larger data set.
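The same pipeline applied to the larger file, again as a sketch (the `big_*` names mirror the output below; all are assumptions):

```r
# Repeat the feature engineering on the larger data frame.
big_fix$year     <- factor(format(big_fix$Order.Date, "%Y"))
big_fix$monthN   <- factor(as.integer(format(big_fix$Order.Date, "%m")))
big_fix$unit_cat <- cut(big_fix$Units.Sold, breaks = breaks)

big_dummy <- big_fix[, !(names(big_fix) %in%
                c("Country", "Order.ID", "Order.Date", "Ship.Date"))]
big_dummy <- dummy_cols(big_dummy,
                        select_columns = c("Region", "Item.Type",
                                           "Sales.Channel", "Order.Priority",
                                           "year", "monthN"),
                        remove_selected_columns = TRUE)
big_dummy[num_cols] <- scale(big_dummy[num_cols])

big_idx         <- sample(nrow(big_dummy), 0.75 * nrow(big_dummy))
big_cols        <- setdiff(names(big_dummy), c("unit_cat", "Units.Sold"))
big_test_labels <- big_fix$unit_cat[-big_idx]

big_knn <- knn(train = big_dummy[big_idx, big_cols],
               test  = big_dummy[-big_idx, big_cols],
               cl    = big_fix$unit_cat[big_idx],
               k     = 61)

table(big_test_labels, big_knn)
mean(big_knn == big_test_labels)  # accuracy on the larger sample
```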
## big_knn
## big_test_labels (0,2e+03] (2e+03,4e+03] (4e+03,6e+03] (6e+03,9e+03]
## (0,2e+03] 2046 383 16 0
## (2e+03,4e+03] 312 1886 318 5
## (4e+03,6e+03] 9 288 1853 357
## (6e+03,9e+03] 0 5 278 3369
## (9e+03,1e+04] 0 0 4 634
## big_knn
## big_test_labels (9e+03,1e+04]
## (0,2e+03] 0
## (2e+03,4e+03] 0
## (4e+03,6e+03] 1
## (6e+03,9e+03] 153
## (9e+03,1e+04] 583
## [1] 0.77896
Increasing the sample size tenfold increases the accuracy by only about 9 percentage points, from 0.69 to 0.78.