Homework 1…New

As the quiz that was part of the original content was discarded, here’s a new assignment:

Visit the following website and explore the range of sizes of this dataset (from 100 to 5 million records): https://eforexcel.com/wp/downloads-18-sample-csv-files-data-sets-for-testing-sales/

Based on your computer’s capabilities (memory, CPU), select two files you can handle (recommended: one small, one large). Review the structure and content of the tables, and think about which two machine learning algorithms presented so far could be used to analyze the data and how they can be applied in the suggested environment of the datasets. Write a short essay explaining your selection.

Then, select one of the two algorithms and explore how to analyze and predict an outcome based on the data available. This will be an exploratory exercise, so feel free to show errors and warnings that arise during the analysis.

Test the code with both datasets selected and compare the results. Which result will you trust if you need to make a business decision? Do you think an analysis could be prone to errors when using too much data, or when using the least amount possible?

Data Exploration

I will analyze the datasets containing 10,000 and 50,000 sales records.

Loading Data

##                         Region                          Country       Item.Type
## 1           Sub-Saharan Africa                             Chad Office Supplies
## 2                       Europe                           Latvia       Beverages
## 3 Middle East and North Africa                         Pakistan      Vegetables
## 4           Sub-Saharan Africa Democratic Republic of the Congo       Household
## 5                       Europe                   Czech Republic       Beverages
## 6           Sub-Saharan Africa                     South Africa       Beverages
##   Sales.Channel Order.Priority Order.Date  Order.ID Ship.Date Units.Sold
## 1        Online              L  1/27/2011 292494523 2/12/2011       4484
## 2        Online              C 12/28/2015 361825549 1/23/2016       1075
## 3       Offline              C  1/13/2011 141515767  2/1/2011       6515
## 4        Online              C  9/11/2012 500364005 10/6/2012       7683
## 5        Online              C 10/27/2015 127481591 12/5/2015       3491
## 6       Offline              H  7/10/2012 482292354 8/21/2012       9880
##   Unit.Price Unit.Cost Total.Revenue Total.Cost Total.Profit
## 1     651.21    524.96    2920025.64 2353920.64    566105.00
## 2      47.45     31.79      51008.75   34174.25     16834.50
## 3     154.06     90.93    1003700.90  592408.95    411291.95
## 4     668.27    502.54    5134318.41 3861014.82   1273303.59
## 5      47.45     31.79     165647.95  110978.89     54669.06
## 6      47.45     31.79     468806.00  314085.20    154720.80
##               Region   Country Item.Type Sales.Channel Order.Priority
## 1 Sub-Saharan Africa   Namibia Household       Offline              M
## 2             Europe   Iceland Baby Food        Online              H
## 3             Europe    Russia      Meat        Online              L
## 4             Europe  Moldova       Meat        Online              L
## 5             Europe     Malta    Cereal        Online              M
## 6               Asia Indonesia      Meat        Online              H
##   Order.Date  Order.ID  Ship.Date Units.Sold Unit.Price Unit.Cost Total.Revenue
## 1  8/31/2015 897751939 10/12/2015       3604     668.27    502.54     2408445.1
## 2 11/20/2010 599480426   1/9/2011       8435     255.28    159.42     2153286.8
## 3  6/22/2017 538911855  6/25/2017       4848     421.89    364.69     2045322.7
## 4  2/28/2012 459845054  3/20/2012       7225     421.89    364.69     3048155.2
## 5  8/12/2010 626391351  9/13/2010       1975     205.70    117.11      406257.5
## 6  8/20/2010 472974574  8/27/2010       2542     421.89    364.69     1072444.4
##   Total.Cost Total.Profit
## 1  1811154.2     597290.9
## 2  1344707.7     808579.1
## 3  1768017.1     277305.6
## 4  2634885.2     413270.0
## 5   231292.2     174965.2
## 6   927042.0     145402.4
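
The loading code itself is not shown in this report. A minimal sketch of how previews like the ones above could be produced with base R follows. The file name in the comment is hypothetical, and the assumption that the raw CSV headers contain spaces (which `read.csv` turns into dots, e.g. `Item.Type`) is inferred from the column names printed above. To stay self-contained, the sketch reads three rows copied from the 10K preview instead of the real file.

```r
# Hypothetical file name; the real path depends on where the download was unzipped.
# sales_10k <- read.csv("10000 Sales Records.csv", stringsAsFactors = FALSE)

# Self-contained stand-in: three rows copied from the 10K preview above.
# Assumption: the raw header uses spaces, which read.csv converts to dots.
csv_text <- "Region,Country,Item Type,Sales Channel,Order Priority,Order Date,Order ID,Ship Date,Units Sold,Unit Price,Unit Cost,Total Revenue,Total Cost,Total Profit
Sub-Saharan Africa,Chad,Office Supplies,Online,L,1/27/2011,292494523,2/12/2011,4484,651.21,524.96,2920025.64,2353920.64,566105.00
Europe,Latvia,Beverages,Online,C,12/28/2015,361825549,1/23/2016,1075,47.45,31.79,51008.75,34174.25,16834.50
Middle East and North Africa,Pakistan,Vegetables,Offline,C,1/13/2011,141515767,2/1/2011,6515,154.06,90.93,1003700.90,592408.95,411291.95"

sales <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Preview the first rows, as in the output above.
head(sales)
```

The same call, pointed at each downloaded file, would give one data frame per dataset.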

Data Analysis

##  [1] "Region"         "Country"        "Item.Type"      "Sales.Channel" 
##  [5] "Order.Priority" "Order.Date"     "Order.ID"       "Ship.Date"     
##  [9] "Units.Sold"     "Unit.Price"     "Unit.Cost"      "Total.Revenue" 
## [13] "Total.Cost"     "Total.Profit"
##  [1] "Region"         "Country"        "Item.Type"      "Sales.Channel" 
##  [5] "Order.Priority" "Order.Date"     "Order.ID"       "Ship.Date"     
##  [9] "Units.Sold"     "Unit.Price"     "Unit.Cost"      "Total.Revenue" 
## [13] "Total.Cost"     "Total.Profit"
##     Region            Country           Item.Type         Sales.Channel     
##  Length:10000       Length:10000       Length:10000       Length:10000      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  Order.Priority   Order.Date           Order.ID           Ship.Date         
##  C:2555         Min.   :2010-01-01   Length:10000       Min.   :2010-01-05  
##  H:2503         1st Qu.:2011-12-08   Class :character   1st Qu.:2012-01-04  
##  L:2494         Median :2013-11-02   Mode  :character   Median :2013-11-26  
##  M:2448         Mean   :2013-10-27                      Mean   :2013-11-21  
##                 3rd Qu.:2015-09-11                      3rd Qu.:2015-10-08  
##                 Max.   :2017-07-28                      Max.   :2017-09-10  
##    Units.Sold      Unit.Price       Unit.Cost      Total.Revenue    
##  Min.   :    2   Min.   :  9.33   Min.   :  6.92   Min.   :    168  
##  1st Qu.: 2531   1st Qu.:109.28   1st Qu.: 56.67   1st Qu.: 288551  
##  Median : 4962   Median :205.70   Median :117.11   Median : 800051  
##  Mean   : 5003   Mean   :268.14   Mean   :188.81   Mean   :1333355  
##  3rd Qu.: 7472   3rd Qu.:437.20   3rd Qu.:364.69   3rd Qu.:1819143  
##  Max.   :10000   Max.   :668.27   Max.   :524.96   Max.   :6680027  
##    Total.Cost       Total.Profit      
##  Min.   :    125   Min.   :     43.4  
##  1st Qu.: 164786   1st Qu.:  98329.1  
##  Median : 481606   Median : 289099.0  
##  Mean   : 938266   Mean   : 395089.3  
##  3rd Qu.:1183822   3rd Qu.: 566422.7  
##  Max.   :5241726   Max.   :1738178.4
##     Region            Country           Item.Type         Sales.Channel     
##  Length:50000       Length:50000       Length:50000       Length:50000      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  Order.Priority   Order.Date           Order.ID           Ship.Date         
##  C:12446        Min.   :2010-01-01   Length:50000       Min.   :2010-01-02  
##  H:12471        1st Qu.:2011-11-15   Class :character   1st Qu.:2011-12-11  
##  L:12588        Median :2013-10-09   Mode  :character   Median :2013-11-02  
##  M:12495        Mean   :2013-10-11                      Mean   :2013-11-05  
##                 3rd Qu.:2015-09-04                      3rd Qu.:2015-09-30  
##                 Max.   :2017-07-28                      Max.   :2017-09-16  
##    Units.Sold      Unit.Price       Unit.Cost      Total.Revenue    
##  Min.   :    1   Min.   :  9.33   Min.   :  6.92   Min.   :     28  
##  1st Qu.: 2498   1st Qu.: 81.73   1st Qu.: 35.84   1st Qu.: 276487  
##  Median : 5018   Median :154.06   Median : 97.44   Median : 781325  
##  Mean   : 5000   Mean   :265.65   Mean   :187.32   Mean   :1323716  
##  3rd Qu.: 7493   3rd Qu.:421.89   3rd Qu.:263.33   3rd Qu.:1808642  
##  Max.   :10000   Max.   :668.27   Max.   :524.96   Max.   :6682032  
##    Total.Cost       Total.Profit      
##  Min.   :     21   Min.   :      7.2  
##  1st Qu.: 160637   1st Qu.:  94150.9  
##  Median : 467104   Median : 279536.4  
##  Mean   : 933157   Mean   : 390558.7  
##  3rd Qu.:1190390   3rd Qu.: 564286.7  
##  Max.   :5249075   Max.   :1738178.4
## Rows: 10,000
## Columns: 14
## $ Region         <chr> "Sub-Saharan Africa", "Europe", "Middle East and North ~
## $ Country        <chr> "Chad", "Latvia", "Pakistan", "Democratic Republic of t~
## $ Item.Type      <chr> "Office Supplies", "Beverages", "Vegetables", "Househol~
## $ Sales.Channel  <chr> "Online", "Online", "Offline", "Online", "Online", "Off~
## $ Order.Priority <fct> L, C, C, C, C, H, L, C, L, C, M, M, C, C, C, L, H, L, H~
## $ Order.Date     <date> 2011-01-27, 2015-12-28, 2011-01-13, 2012-09-11, 2015-1~
## $ Order.ID       <chr> "292494523, 361825549, 141515767, 500364005, 127481591,~
## $ Ship.Date      <date> 2011-02-12, 2016-01-23, 2011-02-01, 2012-10-06, 2015-1~
## $ Units.Sold     <int> 4484, 1075, 6515, 7683, 3491, 9880, 4825, 3330, 2431, 6~
## $ Unit.Price     <dbl> 651.21, 47.45, 154.06, 668.27, 47.45, 47.45, 154.06, 25~
## $ Unit.Cost      <dbl> 524.96, 31.79, 90.93, 502.54, 31.79, 31.79, 90.93, 159.~
## $ Total.Revenue  <dbl> 2920025.64, 51008.75, 1003700.90, 5134318.41, 165647.95~
## $ Total.Cost     <dbl> 2353920.64, 34174.25, 592408.95, 3861014.82, 110978.89,~
## $ Total.Profit   <dbl> 566105.00, 16834.50, 411291.95, 1273303.59, 54669.06, 1~
## Rows: 50,000
## Columns: 14
## $ Region         <chr> "Sub-Saharan Africa", "Europe", "Europe", "Europe", "Eu~
## $ Country        <chr> "Namibia", "Iceland", "Russia", "Moldova ", "Malta", "I~
## $ Item.Type      <chr> "Household", "Baby Food", "Meat", "Meat", "Cereal", "Me~
## $ Sales.Channel  <chr> "Offline", "Online", "Online", "Online", "Online", "Onl~
## $ Order.Priority <fct> M, H, L, L, M, H, M, L, M, C, M, L, C, L, L, M, M, M, H~
## $ Order.Date     <date> 2015-08-31, 2010-11-20, 2017-06-22, 2012-02-28, 2010-0~
## $ Order.ID       <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "",~
## $ Ship.Date      <date> 2015-10-12, 2011-01-09, 2017-06-25, 2012-03-20, 2010-0~
## $ Units.Sold     <int> 3604, 8435, 4848, 7225, 1975, 2542, 4398, 49, 4031, 791~
## $ Unit.Price     <dbl> 668.27, 255.28, 421.89, 421.89, 205.70, 421.89, 668.27,~
## $ Unit.Cost      <dbl> 502.54, 159.42, 364.69, 364.69, 117.11, 364.69, 502.54,~
## $ Total.Revenue  <dbl> 2408445.08, 2153286.80, 2045322.72, 3048155.25, 406257.~
## $ Total.Cost     <dbl> 1811154.16, 1344707.70, 1768017.12, 2634885.25, 231292.~
## $ Total.Profit   <dbl> 597290.92, 808579.10, 277305.60, 413270.00, 174965.25, ~
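
The summaries above imply that `Order.Priority` was converted to a factor, the two date columns were parsed as dates, and `Order.ID` was converted to character. The conversion code is not shown, so the following is only a sketch of one way to do it in base R, on a three-row stand-in built from values in the 10K preview. In the `glimpse` output above, `Order.ID` appears as a single collapsed string for the 10K set and as empty strings for the 50K set, which suggests the original conversion may not have behaved as intended; a plain `as.character()` on the column avoids that.

```r
# Stand-in data: a few columns and rows copied from the 10K preview.
sales <- data.frame(
  Order.Priority = c("L", "C", "C"),
  Order.Date     = c("1/27/2011", "12/28/2015", "1/13/2011"),
  Order.ID       = c(292494523L, 361825549L, 141515767L),
  Ship.Date      = c("2/12/2011", "1/23/2016", "2/1/2011"),
  stringsAsFactors = FALSE
)

# Categorical target with the four priority codes seen in the summary.
sales$Order.Priority <- factor(sales$Order.Priority,
                               levels = c("C", "H", "L", "M"))

# Dates are stored as month/day/year text in the raw preview.
sales$Order.Date <- as.Date(sales$Order.Date, format = "%m/%d/%Y")
sales$Ship.Date  <- as.Date(sales$Ship.Date,  format = "%m/%d/%Y")

# Order.ID is an identifier, not a quantity, so keep it as text.
sales$Order.ID <- as.character(sales$Order.ID)

names(sales)
summary(sales)
str(sales)
```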

Machine Learning

The two ML algorithms I will use are:

  • Decision tree
  • Linear regression

In my case, the purpose of ML is classification. ML can be used to determine whether additional resources should be invested in improving IT infrastructure, and to determine which region and time of year would be best for storing perishable goods.

Decision Tree

For the sales transactions, the Order.Priority variable can take only four possible values: C (Critical), H (High), M (Medium), or L (Low).

Building the model
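
The model-fitting code is not included in this report. The printed objects below are parsnip models in `rpart`'s output format, so the sketch here calls `rpart` directly as a stand-in. It runs on simulated data with made-up values, and the 80/20 train/test split and the choice of predictors are assumptions, not the settings used for the results below.

```r
library(rpart)

# Simulated stand-in data (not the real sales records).
set.seed(1)
n <- 1000
toy <- data.frame(
  Order.Priority = factor(sample(c("C", "H", "L", "M"), n, replace = TRUE)),
  Item.Type      = factor(sample(c("Beverages", "Cereal", "Household", "Meat"),
                                 n, replace = TRUE)),
  Sales.Channel  = factor(sample(c("Online", "Offline"), n, replace = TRUE)),
  Units.Sold     = sample(1:10000, n, replace = TRUE)
)

# Assumed 80/20 split into training and test rows.
train_idx <- sample(seq_len(n), size = 0.8 * n)
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Classification tree predicting Order.Priority.
fit <- rpart(Order.Priority ~ Item.Type + Sales.Channel + Units.Sold,
             data = train, method = "class")
print(fit)

# Share of held-out rows whose predicted class matches the true class.
pred <- predict(fit, newdata = test, type = "class")
accuracy <- mean(pred == test$Order.Priority)
accuracy
```

Because the simulated priorities are random, the tree may not split at all; the point is only the shape of the workflow (split, fit, predict, score).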

10K Sales Data Set

## parsnip model object
## 
## Fit time:  90ms 
## n= 8002 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 8002 5958 C (0.2554361 0.2503124 0.2494376 0.2448138)  
##   2) Item.Type=Baby Food,Clothes,Cosmetics,Household,Vegetables 3425 2493 C (0.2721168 0.2475912 0.2332847 0.2470073) *
##   3) Item.Type=Beverages,Cereal,Fruits,Meat,Office Supplies,Personal Care,Snacks 4577 3380 L (0.2429539 0.2523487 0.2615250 0.2431724) *

Model visualization

50K Sales Data Set

## parsnip model object
## 
## Fit time:  90ms 
## n= 8002 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 8002 5939 H (0.2564359 0.2578105 0.2484379 0.2373157)  
##   2) Item.Type=Beverages,Cereal,Clothes,Cosmetics,Vegetables 3326 2396 H (0.2507517 0.2796152 0.2444378 0.2251954) *
##   3) Item.Type=Baby Food,Fruits,Household,Meat,Office Supplies,Personal Care,Snacks 4676 3458 C (0.2604790 0.2423011 0.2512831 0.2459367) *
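
One way to read the two printed trees is to turn the `n` and `loss` columns into error rates. In `rpart` output, `loss` is the number of rows in a node that do not belong to the node's predicted class, so `loss / n` is that node's misclassification rate. The sketch below uses only the figures printed above, exactly as printed (both printouts report n = 8002). The resulting accuracies describe the training rows, not held-out data.

```r
# Node figures copied from the two printed trees above.
nodes <- data.frame(
  dataset = c("10K", "10K", "10K", "50K", "50K", "50K"),
  node    = c("root", "2", "3", "root", "2", "3"),
  n       = c(8002, 3425, 4577, 8002, 3326, 4676),
  loss    = c(5958, 2493, 3380, 5939, 2396, 3458)
)

# Misclassification rate per node.
nodes$error_rate <- nodes$loss / nodes$n

# Baseline: accuracy from always predicting the root's majority class.
roots <- subset(nodes, node == "root")
baseline_acc <- setNames(1 - roots$error_rate, roots$dataset)

# Tree accuracy on the training rows: pool the two leaves of each tree.
leaves <- subset(nodes, node != "root")
train_acc <- 1 - tapply(leaves$loss, leaves$dataset, sum) /
                 tapply(leaves$n,    leaves$dataset, sum)

round(baseline_acc, 4)
round(train_acc, 4)
```

On these numbers, both trees classify roughly 27% of their training rows correctly, against a root baseline of about 26% and a chance level of 25% for four nearly balanced classes. The single split on Item.Type therefore adds very little in either case.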

Model Performance