As the quiz that was part of the original content was discarded, here’s a new assignment:
Visit the following website and explore the range of sizes of this data set (from 100 to 5 million records): https://eforexcel.com/wp/downloads-18-sample-csv-files-data-sets-for-testing-sales/. Based on your computer’s capabilities (memory, CPU), select 2 files you can handle (one small and one large is recommended).
Review the structure and content of the tables, and consider which two machine learning algorithms presented so far could be used to analyze the data, and how they could be applied in the suggested environment of the data sets.
Write a short essay explaining your selection. Then select one of the 2 algorithms and explore how to analyze and predict an outcome based on the data available. This is an exploratory exercise, so feel free to show errors and warnings that arise during the analysis. Test the code with both selected data sets and compare the results. Which result would you trust if you needed to make a business decision? Do you think an analysis could be prone to errors when using too much data, or when using the least amount possible?
Develop your exploratory analysis of the data and the essay over the following 2 weeks. You’ll have until March 17 to submit both.
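The output below comes from loading the smaller file and inspecting it with `summary()` and `names()`. The row counts later in the analysis suggest the 5,000- and 50,000-record files were used. A minimal sketch of the setup (the file name and date format are assumptions, since the source chunks aren't echoed):

```r
# Load the smaller sample file; the exact file name is an assumption.
small <- read.csv("5000 Sales Records.csv", stringsAsFactors = TRUE)

# Parse the date columns (these files use month/day/year formatting).
small$Order.Date <- as.Date(small$Order.Date, format = "%m/%d/%Y")
small$Ship.Date  <- as.Date(small$Ship.Date,  format = "%m/%d/%Y")

summary(small)  # per-column summaries, shown below
names(small)    # the 14 column names
```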
## Region Country
## Asia : 719 Andorra : 40
## Australia and Oceania : 416 San Marino : 40
## Central America and the Caribbean: 534 Ghana : 38
## Europe :1330 Mauritius : 38
## Middle East and North Africa : 610 United States of America: 38
## North America : 106 Tonga : 37
## Sub-Saharan Africa :1285 (Other) :4769
## Item.Type Sales.Channel Order.Priority Order.Date
## Beverages : 447 Offline:2504 C:1174 Min. :2010-01-01
## Fruits : 447 Online :2496 H:1278 1st Qu.:2011-12-08
## Baby Food : 445 L:1227 Median :2013-10-23
## Cosmetics : 424 M:1321 Mean :2013-10-19
## Household : 424 3rd Qu.:2015-09-08
## Office Supplies: 420 Max. :2017-07-28
## (Other) :2393
## Order.ID Ship.Date Units.Sold Unit.Price
## Min. :100090873 Min. :2010-01-01 Min. : 2 Min. : 9.33
## 1st Qu.:320104217 1st Qu.:2011-12-08 1st Qu.:2453 1st Qu.: 81.73
## Median :552314960 Median :2013-10-23 Median :5123 Median :154.06
## Mean :548644737 Mean :2013-10-19 Mean :5031 Mean :265.75
## 3rd Qu.:768770944 3rd Qu.:2015-09-08 3rd Qu.:7576 3rd Qu.:437.20
## Max. :999879729 Max. :2017-07-28 Max. :9999 Max. :668.27
##
## Unit.Cost Total.Revenue Total.Cost Total.Profit
## Min. : 6.92 Min. : 65 Min. : 48 Min. : 16.9
## 1st Qu.: 35.84 1st Qu.: 257417 1st Qu.: 154748 1st Qu.: 85339.3
## Median : 97.44 Median : 779409 Median : 468181 Median : 279095.2
## Mean :187.49 Mean :1325738 Mean : 933093 Mean : 392644.6
## 3rd Qu.:263.33 3rd Qu.:1839975 3rd Qu.:1189578 3rd Qu.: 565106.4
## Max. :524.96 Max. :6672676 Max. :5248025 Max. :1726007.5
##
## [1] "Region" "Country" "Item.Type" "Sales.Channel"
## [5] "Order.Priority" "Order.Date" "Order.ID" "Ship.Date"
## [9] "Units.Sold" "Unit.Price" "Unit.Cost" "Total.Revenue"
## [13] "Total.Cost" "Total.Profit"
Each data set consists of randomly generated observations with the following 14 characteristics: an order ID and order/ship dates, the region and country of the order, the item type and order priority, and whether the order was placed online or offline. There is also a units-sold column, as well as 5 columns dedicated to pricing. The number of units sold varies between 2 and a maximum of 9,999.
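The counts and date ranges below can be reproduced with a few one-liners; a sketch, assuming the data frame is named `small` as above:

```r
length(unique(small$Country))    # distinct countries
length(unique(small$Item.Type))  # distinct item types

min(small$Order.Date)            # earliest order
max(small$Order.Date)            # latest order
difftime(max(small$Order.Date), min(small$Order.Date), units = "days")
as.numeric(max(small$Order.Date) - min(small$Order.Date)) / 365  # in years
```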
## [1] 185
## [1] 12
## [1] "2010-01-01"
## [1] "2017-07-28"
## Time difference of 2765 days
## [1] 7.575342
There are 185 countries included, and 12 item types. The dates range between 2010 and 2017, a span of about 7.6 years.
There don’t appear to be any year-on-year patterns with respect to profit.
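Processing time here means the gap between order date and ship date. A sketch of the check, assuming dplyr (the `summarise`-style output below suggests it, but that's an inference):

```r
library(dplyr)

# Average gap between ordering and shipping, in days.
small %>%
  mutate(processing_time = Ship.Date - Order.Date) %>%
  summarise(mean(processing_time))
```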
## mean(processing_time)
## 1 0 days
Also, it appears that shipping date and order date are always the same.
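The next block of output counts unique prices and costs, then verifies the accounting identities row by row. A sketch, using `dplyr::near()` for tolerant floating-point comparison (the exact test used originally is an assumption):

```r
length(unique(small$Unit.Price))  # one price per item type
length(unique(small$Unit.Cost))   # one cost per item type

# TRUE wherever each derived column matches its accounting identity.
small <- small %>%
  mutate(revenue_test = near(Total.Revenue, Units.Sold * Unit.Price),
         cost_test    = near(Total.Cost,    Units.Sold * Unit.Cost),
         profit_test  = near(Total.Profit,  Total.Revenue - Total.Cost))

small %>% summarise(sum(revenue_test))  # rows passing, out of 5000
small %>% summarise(sum(cost_test))
small %>% summarise(sum(profit_test))
```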
## [1] 12
## [1] 12
## sum(revenue_test)
## 1 4999
## sum(cost_test)
## 1 4996
## sum(profit_test)
## 1 4994
The last 5 columns, related to cost and revenue, are all derived from the units sold and the item type. Unit price and unit cost each take 12 unique values, one per item type. Additionally, the totals for cost, revenue, and profit are derived from these values, so nearly all of these variables can be recalculated with some accounting arithmetic (the checks above pass for all but a handful of rows, likely due to rounding). It’s assumed that these descriptions hold for the larger data set as well.
Finally, let’s look at the distribution of several columns. It appears that the number of units sold, priority, and item type are all uniformly distributed. Meanwhile, total profit, along with cost and revenue, is skewed to the right.
Because many of these variables are derived from each other, a machine learning model should instead try to predict something that can’t already be derived during an exploratory analysis. To illustrate the problem, I’ve used a simple linear model to fit total profit using the small data frame.
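A sketch of the fit (`small_fix` appears in the `lm()` call below and is presumably the cleaned small data frame; the rounded coefficients printed after the summary come from `round(coef(...), 2)`, which is an assumption):

```r
# Interaction model: a separate profit-per-unit slope for each item type.
profit_lm <- lm(Total.Profit ~ Item.Type * Units.Sold, data = small_fix)
summary(profit_lm)
round(coef(profit_lm), 2)  # per-item-type profit-per-unit slopes
```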
##
## Call:
## lm(formula = Total.Profit ~ Item.Type * Units.Sold, data = small_fix)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.425e-07 -2.000e-10 -2.000e-11 5.000e-11 5.024e-07
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.455e-10 1.179e-09 -3.780e-01 0.705473
## Item.TypeBeverages -1.441e-08 1.672e-09 -8.620e+00 < 2e-16
## Item.TypeCereal 1.891e-09 1.749e-09 1.081e+00 0.279698
## Item.TypeClothes 8.216e-09 1.738e-09 4.728e+00 2.33e-06
## Item.TypeCosmetics -6.292e-09 1.709e-09 -3.682e+00 0.000234
## Item.TypeFruits 2.675e-08 1.704e-09 1.570e+01 < 2e-16
## Item.TypeHousehold 2.318e-08 1.697e-09 1.366e+01 < 2e-16
## Item.TypeMeat -8.442e-09 1.658e-09 -5.090e+00 3.71e-07
## Item.TypeOffice Supplies -8.366e-09 1.686e-09 -4.962e+00 7.20e-07
## Item.TypePersonal Care -2.489e-09 1.697e-09 -1.467e+00 0.142537
## Item.TypeSnacks -2.384e-09 1.678e-09 -1.421e+00 0.155479
## Item.TypeVegetables -1.812e-09 1.738e-09 -1.042e+00 0.297332
## Units.Sold 9.586e+01 1.993e-13 4.809e+14 < 2e-16
## Item.TypeBeverages:Units.Sold -8.020e+01 2.881e-13 -2.784e+14 < 2e-16
## Item.TypeCereal:Units.Sold -7.270e+00 2.984e-13 -2.436e+13 < 2e-16
## Item.TypeClothes:Units.Sold -2.242e+01 2.935e-13 -7.639e+13 < 2e-16
## Item.TypeCosmetics:Units.Sold 7.801e+01 2.920e-13 2.672e+14 < 2e-16
## Item.TypeFruits:Units.Sold -9.345e+01 2.867e-13 -3.260e+14 < 2e-16
## Item.TypeHousehold:Units.Sold 6.987e+01 2.925e-13 2.389e+14 < 2e-16
## Item.TypeMeat:Units.Sold -3.866e+01 2.871e-13 -1.347e+14 < 2e-16
## Item.TypeOffice Supplies:Units.Sold 3.039e+01 2.875e-13 1.057e+14 < 2e-16
## Item.TypePersonal Care:Units.Sold -7.080e+01 2.861e-13 -2.475e+14 < 2e-16
## Item.TypeSnacks:Units.Sold -4.072e+01 2.900e-13 -1.404e+14 < 2e-16
## Item.TypeVegetables:Units.Sold -3.273e+01 2.964e-13 -1.104e+14 < 2e-16
##
## (Intercept)
## Item.TypeBeverages ***
## Item.TypeCereal
## Item.TypeClothes ***
## Item.TypeCosmetics ***
## Item.TypeFruits ***
## Item.TypeHousehold ***
## Item.TypeMeat ***
## Item.TypeOffice Supplies ***
## Item.TypePersonal Care
## Item.TypeSnacks
## Item.TypeVegetables
## Units.Sold ***
## Item.TypeBeverages:Units.Sold ***
## Item.TypeCereal:Units.Sold ***
## Item.TypeClothes:Units.Sold ***
## Item.TypeCosmetics:Units.Sold ***
## Item.TypeFruits:Units.Sold ***
## Item.TypeHousehold:Units.Sold ***
## Item.TypeMeat:Units.Sold ***
## Item.TypeOffice Supplies:Units.Sold ***
## Item.TypePersonal Care:Units.Sold ***
## Item.TypeSnacks:Units.Sold ***
## Item.TypeVegetables:Units.Sold ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.25e-08 on 4976 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 2.04e+29 on 23 and 4976 DF, p-value: < 2.2e-16
## (Intercept) Item.TypeBeverages
## 0.00 0.00
## Item.TypeCereal Item.TypeClothes
## 0.00 0.00
## Item.TypeCosmetics Item.TypeFruits
## 0.00 0.00
## Item.TypeHousehold Item.TypeMeat
## 0.00 0.00
## Item.TypeOffice Supplies Item.TypePersonal Care
## 0.00 0.00
## Item.TypeSnacks Item.TypeVegetables
## 0.00 0.00
## Units.Sold Item.TypeBeverages:Units.Sold
## 95.86 -80.20
## Item.TypeCereal:Units.Sold Item.TypeClothes:Units.Sold
## -7.27 -22.42
## Item.TypeCosmetics:Units.Sold Item.TypeFruits:Units.Sold
## 78.01 -93.45
## Item.TypeHousehold:Units.Sold Item.TypeMeat:Units.Sold
## 69.87 -38.66
## Item.TypeOffice Supplies:Units.Sold Item.TypePersonal Care:Units.Sold
## 30.39 -70.80
## Item.TypeSnacks:Units.Sold Item.TypeVegetables:Units.Sold
## -40.72 -32.73
Using all data points, the R-squared is 1: total profit is completely determined by just these two predictors. Next, let’s use that model to predict values in the larger data frame.
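A sketch of scoring the small-data model on the larger data frame (here called `big_fix`, an assumption) and computing the residual sum of squares:

```r
# Predict profit for the large data set using the small-data model.
big_pred <- predict(profit_lm, newdata = big_fix)
sum((big_fix$Total.Profit - big_pred)^2)  # residual sum of squares
```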
## [1] 7.574929e-12
In this case the RSS is essentially zero, indicating that the same deterministic relationship between item type, units sold, and total profit holds in the larger data set.
Instead, it makes more sense to model the number of units sold based on the remaining variables. As already shown, units sold is roughly uniformly distributed between 2 and 9,999. Because of this, it might make more sense to break units sold into a handful of categories spanning the whole range in roughly 2,000-unit increments.
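The bin labels appearing in the output further down correspond to breaks at 0; 2,000; 4,000; 6,000; 9,000; and 10,000 units. A sketch using `cut()`:

```r
# Discretize units sold into the order-size categories used below.
breaks <- c(0, 2000, 4000, 6000, 9000, 10000)
small_fix$unit_cat <- cut(small_fix$Units.Sold, breaks = breaks)
```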
There are a few options for predicting the order-size category. My first thought was to use decision trees, since I had just converted the outcome variable into something categorical. Decision trees are easily understood by laypeople and could offer some insight into the most important determinants of order size. In many cases it is easier to turn a problem like this into a classification task by deciding on an acceptable threshold (e.g., profit greater than $500,000 monthly); for continuous outcomes there is a related technique called a regression tree that could also be applied here, providing an opportunity to predict the value of profit directly.
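A sketch of the tree fit, using the `tree` package named in the output; the 75/25 train/test split and the derivation of `year` and `monthN` from the order date are assumptions consistent with the row counts below:

```r
library(tree)

# year and monthN are assumed to have been derived from Order.Date.
small_fix$year   <- factor(format(small_fix$Order.Date, "%Y"))
small_fix$monthN <- factor(as.integer(format(small_fix$Order.Date, "%m")))

set.seed(1)  # reproducible split; the seed is an assumption
train_idx   <- sample(nrow(small_fix), 0.75 * nrow(small_fix))
small_train <- small_fix[train_idx, ]
small_test  <- small_fix[-train_idx, ]

unit_tree <- tree(unit_cat ~ Region + Item.Type + Sales.Channel +
                    Order.Priority + year + monthN, data = small_train)
summary(unit_tree)
```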
##
## Classification tree:
## tree(formula = unit_cat ~ Region + Item.Type + Sales.Channel +
## Order.Priority + year + monthN, data = small_train)
## Variables actually used in tree construction:
## character(0)
## Number of terminal nodes: 1
## Residual mean deviance: 3.119 = 11690 / 3749
## Misclassification error rate: 0.6989 = 2621 / 3750
Fitting to the training data produces a degenerate tree: no variables are actually used, there is a single terminal node, and the misclassification rate is 0.699. This model does not appear to be appropriate for these data.
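The confusion matrix below compares the true bins (rows) with the tree's predictions (columns) on the held-out data; a sketch:

```r
small_pred <- predict(unit_tree, small_test, type = "class")
table(small_test$unit_cat, small_pred)  # truth vs. prediction
```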
## small_pred
## (0,2e+03] (2e+03,4e+03] (4e+03,6e+03] (6e+03,9e+03]
## (0,2e+03] 0 0 0 261
## (2e+03,4e+03] 0 0 0 225
## (4e+03,6e+03] 0 0 0 244
## (6e+03,9e+03] 0 0 0 395
## (9e+03,1e+04] 0 0 0 125
## small_pred
## (9e+03,1e+04]
## (0,2e+03] 0
## (2e+03,4e+03] 0
## (4e+03,6e+03] 0
## (6e+03,9e+03] 0
## (9e+03,1e+04] 0
On closer inspection, this is because every observation is being assigned to a single bin. For a second machine learning technique, I’ll attempt k-nearest neighbors (kNN), a supervised, instance-based method. This requires creating dummy variables for all categorical predictors, potentially hurting performance by inflating the number of variables.
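One way to build the all-numeric design matrix summarized below: one-hot encode the categorical predictors and standardize the numeric columns so no single scale dominates the distance computation. The underscore-separated column names suggest the `fastDummies` package, but the original tooling is an assumption:

```r
library(fastDummies)

# Drop identifiers and raw dates, which are not summarized below.
small_dummy <- small_fix[, !(names(small_fix) %in%
                  c("Country", "Order.ID", "Order.Date", "Ship.Date"))]

# One-hot encode every categorical predictor...
small_dummy <- dummy_cols(small_dummy,
                          select_columns = c("Region", "Item.Type",
                                             "Sales.Channel", "Order.Priority",
                                             "year", "monthN"),
                          remove_selected_columns = TRUE)

# ...and standardize the numeric columns for the kNN distance metric.
num_cols <- c("Units.Sold", "Unit.Price", "Unit.Cost",
              "Total.Revenue", "Total.Cost", "Total.Profit")
small_dummy[num_cols] <- scale(small_dummy[num_cols])

summary(small_dummy)
```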
## Region_Asia Region_Australia and Oceania
## Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000
## Mean :0.1427 Mean :0.0816
## 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000
## Region_Central America and the Caribbean Region_Europe
## Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000
## Mean :0.1083 Mean :0.2637
## 3rd Qu.:0.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000
## Region_Middle East and North Africa Region_North America
## Min. :0.0000 Min. :0.000
## 1st Qu.:0.0000 1st Qu.:0.000
## Median :0.0000 Median :0.000
## Mean :0.1245 Mean :0.024
## 3rd Qu.:0.0000 3rd Qu.:0.000
## Max. :1.0000 Max. :1.000
## Region_Sub-Saharan Africa Item.Type_Baby Food Item.Type_Beverages
## Min. :0.0000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.0000 Median :0.00000 Median :0.00000
## Mean :0.2552 Mean :0.09013 Mean :0.09227
## 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.00000 Max. :1.00000
## Item.Type_Cereal Item.Type_Clothes Item.Type_Cosmetics Item.Type_Fruits
## Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.0000 Median :0.0000 Median :0.00000 Median :0.00000
## Mean :0.0728 Mean :0.0816 Mean :0.08293 Mean :0.09013
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :1.00000
## Item.Type_Household Item.Type_Meat Item.Type_Office Supplies
## Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.0000 Median :0.00000 Median :0.0000
## Mean :0.0816 Mean :0.07867 Mean :0.0808
## 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.00000 Max. :1.0000
## Item.Type_Snacks Item.Type_Vegetables Sales.Channel_Offline
## Min. :0.00000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.00000 Median :0.0000 Median :1.0000
## Mean :0.07947 Mean :0.0864 Mean :0.5021
## 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:1.0000
## Max. :1.00000 Max. :1.0000 Max. :1.0000
## Sales.Channel_Online Order.Priority_C Order.Priority_H Order.Priority_L
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.4979 Mean :0.2325 Mean :0.2544 Mean :0.2493
## 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## Order.Priority_M Units.Sold Unit.Price Unit.Cost
## Min. :0.0000 Min. :-1.725398 Min. :-1.17236 Min. :-1.02357
## 1st Qu.:0.0000 1st Qu.:-0.886493 1st Qu.:-0.84134 1st Qu.:-0.85964
## Median :0.0000 Median : 0.035444 Median :-0.51064 Median :-0.51046
## Mean :0.2637 Mean : 0.001664 Mean :-0.02192 Mean :-0.02183
## 3rd Qu.:1.0000 3rd Qu.: 0.879838 3rd Qu.: 0.71391 3rd Qu.: 0.42987
## Max. :1.0000 Max. : 1.704675 Max. : 1.84039 Max. : 1.91290
## Total.Revenue Total.Cost Total.Profit year_2010
## Min. :-0.89843 Min. :-0.81071 Min. :-1.02502 Min. :0.0000
## 1st Qu.:-0.73109 1st Qu.:-0.68366 1st Qu.:-0.80811 1st Qu.:0.0000
## Median :-0.37787 Median :-0.41662 Median :-0.30233 Median :0.0000
## Mean :-0.01625 Mean :-0.01641 Mean :-0.01329 Mean :0.1229
## 3rd Qu.: 0.32186 3rd Qu.: 0.19289 3rd Qu.: 0.44623 3rd Qu.:0.0000
## Max. : 3.62412 Max. : 3.74927 Max. : 3.46152 Max. :1.0000
## year_2011 year_2012 year_2013 year_2014
## Min. :0.000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.000 Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.128 Mean :0.1355 Mean :0.1328 Mean :0.1349
## 3rd Qu.:0.000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :1.000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## year_2015 year_2016 year_2017 monthN_1
## Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.0000 Median :0.0000 Median :0.00000 Median :0.00000
## Mean :0.1328 Mean :0.1373 Mean :0.07573 Mean :0.09333
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :1.00000
## monthN_2 monthN_3 monthN_4 monthN_5
## Min. :0.0000 Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.0000 Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.0824 Mean :0.08667 Mean :0.08907 Mean :0.08987
## 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.00000 Max. :1.00000 Max. :1.00000
## monthN_6 monthN_7 monthN_8 monthN_9
## Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.00000 Median :0.0000 Median :0.0000 Median :0.00000
## Mean :0.08773 Mean :0.0872 Mean :0.0824 Mean :0.07733
## 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.0000 Max. :1.0000 Max. :1.00000
## monthN_10 monthN_11 monthN_12
## Min. :0.00 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00 Median :0.00000 Median :0.00000
## Mean :0.08 Mean :0.07573 Mean :0.06827
## 3rd Qu.:0.00 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00 Max. :1.00000 Max. :1.00000
The addition of dummy variables has increased the number of variables present to 56.
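A sketch of the kNN fit itself, using `class::knn()` and reusing the train/test split from the tree model. Which predictor columns were fed to `knn()` originally, and the value of k, are both assumptions; here the outcome bin and the raw `Units.Sold` column are excluded so the answer isn't trivially recoverable:

```r
library(class)

# Predictors: everything except the outcome and raw units sold.
pred_cols <- setdiff(names(small_dummy), c("unit_cat", "Units.Sold"))

small_test_labels <- small_fix$unit_cat[-train_idx]
small_knn <- knn(train = small_dummy[train_idx, pred_cols],
                 test  = small_dummy[-train_idx, pred_cols],
                 cl    = small_fix$unit_cat[train_idx],
                 k     = 61)  # k near sqrt(n) is a common heuristic

table(small_test_labels, small_knn)   # confusion matrix below
mean(small_knn == small_test_labels)  # overall accuracy
```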
## small_knn
## small_test_labels (0,2e+03] (2e+03,4e+03] (4e+03,6e+03] (6e+03,9e+03]
## (0,2e+03] 203 52 5 0
## (2e+03,4e+03] 42 155 35 4
## (4e+03,6e+03] 3 38 159 56
## (6e+03,9e+03] 0 5 48 295
## (9e+03,1e+04] 0 0 1 80
## small_knn
## small_test_labels (9e+03,1e+04]
## (0,2e+03] 0
## (2e+03,4e+03] 0
## (4e+03,6e+03] 0
## (6e+03,9e+03] 17
## (9e+03,1e+04] 52
## [1] 0.6912
Using kNN, units sold could be predicted accurately in about 69% of cases. This is at the lower limit of what’s usually considered effective classification. Looking at the confusion matrix, the most common errors are assigning an observation to an adjacent bin.
Next, let’s do the same thing but with a larger data set.
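The same pipeline applied to the larger file, again as a sketch (the `big_*` names mirror the output below; all are assumptions):

```r
# Repeat the feature engineering on the larger data frame.
big_fix$year     <- factor(format(big_fix$Order.Date, "%Y"))
big_fix$monthN   <- factor(as.integer(format(big_fix$Order.Date, "%m")))
big_fix$unit_cat <- cut(big_fix$Units.Sold, breaks = breaks)

big_dummy <- big_fix[, !(names(big_fix) %in%
                c("Country", "Order.ID", "Order.Date", "Ship.Date"))]
big_dummy <- dummy_cols(big_dummy,
                        select_columns = c("Region", "Item.Type",
                                           "Sales.Channel", "Order.Priority",
                                           "year", "monthN"),
                        remove_selected_columns = TRUE)
big_dummy[num_cols] <- scale(big_dummy[num_cols])

big_idx         <- sample(nrow(big_dummy), 0.75 * nrow(big_dummy))
big_cols        <- setdiff(names(big_dummy), c("unit_cat", "Units.Sold"))
big_test_labels <- big_fix$unit_cat[-big_idx]

big_knn <- knn(train = big_dummy[big_idx, big_cols],
               test  = big_dummy[-big_idx, big_cols],
               cl    = big_fix$unit_cat[big_idx],
               k     = 61)

table(big_test_labels, big_knn)
mean(big_knn == big_test_labels)  # accuracy on the larger sample
```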
## big_knn
## big_test_labels (0,2e+03] (2e+03,4e+03] (4e+03,6e+03] (6e+03,9e+03]
## (0,2e+03] 2046 383 16 0
## (2e+03,4e+03] 312 1886 318 5
## (4e+03,6e+03] 9 288 1853 357
## (6e+03,9e+03] 0 5 278 3369
## (9e+03,1e+04] 0 0 4 634
## big_knn
## big_test_labels (9e+03,1e+04]
## (0,2e+03] 0
## (2e+03,4e+03] 0
## (4e+03,6e+03] 1
## (6e+03,9e+03] 153
## (9e+03,1e+04] 583
## [1] 0.77896
Increasing the sample size tenfold increases the accuracy by only about 9 percentage points, from 0.69 to 0.78.