Because the quiz that was part of the original content was discarded, here is a new assignment.

Visit the following website and explore the range of sizes of this dataset (from 100 to 5 million records): https://eforexcel.com/wp/downloads-18-sample-csv-files-data-sets-for-testing-sales/

Based on your computer’s capabilities (memory, CPU), select two files you can handle (one small and one large is recommended). Review the structure and content of the tables, and consider which two machine learning algorithms presented so far could be used to analyze the data and how they could be applied in the suggested environment of the datasets. Write a short essay explaining your selection.

Then select one of the two algorithms and explore how to analyze and predict an outcome based on the data available. This is an exploratory exercise, so feel free to show errors and warnings that arise during the analysis. Test the code with both selected datasets and compare the results. Which result would you trust if you needed to make a business decision? Do you think an analysis could be prone to errors when using too much data, or when using the least amount possible?
I will analyze the datasets containing 10,000 and 50,000 sales records.
## Region Country Item.Type
## 1 Sub-Saharan Africa Chad Office Supplies
## 2 Europe Latvia Beverages
## 3 Middle East and North Africa Pakistan Vegetables
## 4 Sub-Saharan Africa Democratic Republic of the Congo Household
## 5 Europe Czech Republic Beverages
## 6 Sub-Saharan Africa South Africa Beverages
## Sales.Channel Order.Priority Order.Date Order.ID Ship.Date Units.Sold
## 1 Online L 1/27/2011 292494523 2/12/2011 4484
## 2 Online C 12/28/2015 361825549 1/23/2016 1075
## 3 Offline C 1/13/2011 141515767 2/1/2011 6515
## 4 Online C 9/11/2012 500364005 10/6/2012 7683
## 5 Online C 10/27/2015 127481591 12/5/2015 3491
## 6 Offline H 7/10/2012 482292354 8/21/2012 9880
## Unit.Price Unit.Cost Total.Revenue Total.Cost Total.Profit
## 1 651.21 524.96 2920025.64 2353920.64 566105.00
## 2 47.45 31.79 51008.75 34174.25 16834.50
## 3 154.06 90.93 1003700.90 592408.95 411291.95
## 4 668.27 502.54 5134318.41 3861014.82 1273303.59
## 5 47.45 31.79 165647.95 110978.89 54669.06
## 6 47.45 31.79 468806.00 314085.20 154720.80
## Region Country Item.Type Sales.Channel Order.Priority
## 1 Sub-Saharan Africa Namibia Household Offline M
## 2 Europe Iceland Baby Food Online H
## 3 Europe Russia Meat Online L
## 4 Europe Moldova Meat Online L
## 5 Europe Malta Cereal Online M
## 6 Asia Indonesia Meat Online H
## Order.Date Order.ID Ship.Date Units.Sold Unit.Price Unit.Cost Total.Revenue
## 1 8/31/2015 897751939 10/12/2015 3604 668.27 502.54 2408445.1
## 2 11/20/2010 599480426 1/9/2011 8435 255.28 159.42 2153286.8
## 3 6/22/2017 538911855 6/25/2017 4848 421.89 364.69 2045322.7
## 4 2/28/2012 459845054 3/20/2012 7225 421.89 364.69 3048155.2
## 5 8/12/2010 626391351 9/13/2010 1975 205.70 117.11 406257.5
## 6 8/20/2010 472974574 8/27/2010 2542 421.89 364.69 1072444.4
## Total.Cost Total.Profit
## 1 1811154.2 597290.9
## 2 1344707.7 808579.1
## 3 1768017.1 277305.6
## 4 2634885.2 413270.0
## 5 231292.2 174965.2
## 6 927042.0 145402.4
## [1] "Region" "Country" "Item.Type" "Sales.Channel"
## [5] "Order.Priority" "Order.Date" "Order.ID" "Ship.Date"
## [9] "Units.Sold" "Unit.Price" "Unit.Cost" "Total.Revenue"
## [13] "Total.Cost" "Total.Profit"
## [1] "Region" "Country" "Item.Type" "Sales.Channel"
## [5] "Order.Priority" "Order.Date" "Order.ID" "Ship.Date"
## [9] "Units.Sold" "Unit.Price" "Unit.Cost" "Total.Revenue"
## [13] "Total.Cost" "Total.Profit"
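The previews and column names above are the kind of output produced by `head()` and `names()` on a data frame read with `read.csv()`. The following is a minimal sketch of that step, not the code actually used for this report. To keep it runnable without a download, it reads three rows copied from the 10K preview as inline CSV text. The header spelling with spaces (for example `Item Type`) is an assumption about the source file; with a real file, `text = csv_text` would be replaced by a path on your machine.

```r
# Three rows copied from the 10K preview above, written as CSV text so the
# sketch runs without downloading anything.
# ASSUMPTION: the header row in the source file uses spaces, e.g. "Item Type".
csv_text <- "Region,Country,Item Type,Sales Channel,Order Priority,Order Date,Order ID,Ship Date,Units Sold,Unit Price,Unit Cost,Total Revenue,Total Cost,Total Profit
Sub-Saharan Africa,Chad,Office Supplies,Online,L,1/27/2011,292494523,2/12/2011,4484,651.21,524.96,2920025.64,2353920.64,566105.00
Europe,Latvia,Beverages,Online,C,12/28/2015,361825549,1/23/2016,1075,47.45,31.79,51008.75,34174.25,16834.50
Middle East and North Africa,Pakistan,Vegetables,Offline,C,1/13/2011,141515767,2/1/2011,6515,154.06,90.93,1003700.90,592408.95,411291.95"

# read.csv() uses check.names = TRUE by default, which turns spaces in
# header names into dots ("Item Type" becomes "Item.Type").
sales <- read.csv(text = csv_text, stringsAsFactors = FALSE)

head(sales)   # first rows of the table
names(sales)  # the 14 column names
```

With `stringsAsFactors = FALSE`, the text columns stay as character vectors, which matches the `Class :character` entries in the summaries that follow.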
## Region Country Item.Type Sales.Channel
## Length:10000 Length:10000 Length:10000 Length:10000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Order.Priority Order.Date Order.ID Ship.Date
## C:2555 Min. :2010-01-01 Length:10000 Min. :2010-01-05
## H:2503 1st Qu.:2011-12-08 Class :character 1st Qu.:2012-01-04
## L:2494 Median :2013-11-02 Mode :character Median :2013-11-26
## M:2448 Mean :2013-10-27 Mean :2013-11-21
## 3rd Qu.:2015-09-11 3rd Qu.:2015-10-08
## Max. :2017-07-28 Max. :2017-09-10
## Units.Sold Unit.Price Unit.Cost Total.Revenue
## Min. : 2 Min. : 9.33 Min. : 6.92 Min. : 168
## 1st Qu.: 2531 1st Qu.:109.28 1st Qu.: 56.67 1st Qu.: 288551
## Median : 4962 Median :205.70 Median :117.11 Median : 800051
## Mean : 5003 Mean :268.14 Mean :188.81 Mean :1333355
## 3rd Qu.: 7472 3rd Qu.:437.20 3rd Qu.:364.69 3rd Qu.:1819143
## Max. :10000 Max. :668.27 Max. :524.96 Max. :6680027
## Total.Cost Total.Profit
## Min. : 125 Min. : 43.4
## 1st Qu.: 164786 1st Qu.: 98329.1
## Median : 481606 Median : 289099.0
## Mean : 938266 Mean : 395089.3
## 3rd Qu.:1183822 3rd Qu.: 566422.7
## Max. :5241726 Max. :1738178.4
## Region Country Item.Type Sales.Channel
## Length:50000 Length:50000 Length:50000 Length:50000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Order.Priority Order.Date Order.ID Ship.Date
## C:12446 Min. :2010-01-01 Length:50000 Min. :2010-01-02
## H:12471 1st Qu.:2011-11-15 Class :character 1st Qu.:2011-12-11
## L:12588 Median :2013-10-09 Mode :character Median :2013-11-02
## M:12495 Mean :2013-10-11 Mean :2013-11-05
## 3rd Qu.:2015-09-04 3rd Qu.:2015-09-30
## Max. :2017-07-28 Max. :2017-09-16
## Units.Sold Unit.Price Unit.Cost Total.Revenue
## Min. : 1 Min. : 9.33 Min. : 6.92 Min. : 28
## 1st Qu.: 2498 1st Qu.: 81.73 1st Qu.: 35.84 1st Qu.: 276487
## Median : 5018 Median :154.06 Median : 97.44 Median : 781325
## Mean : 5000 Mean :265.65 Mean :187.32 Mean :1323716
## 3rd Qu.: 7493 3rd Qu.:421.89 3rd Qu.:263.33 3rd Qu.:1808642
## Max. :10000 Max. :668.27 Max. :524.96 Max. :6682032
## Total.Cost Total.Profit
## Min. : 21 Min. : 7.2
## 1st Qu.: 160637 1st Qu.: 94150.9
## Median : 467104 Median : 279536.4
## Mean : 933157 Mean : 390558.7
## 3rd Qu.:1190390 3rd Qu.: 564286.7
## Max. :5249075 Max. :1738178.4
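In the summaries above, `Order.Priority` is reported as a factor, the two date columns as dates, and `Order.ID` as character, so some type conversion took place after loading. A minimal sketch of such a conversion is shown here on a three-row toy data frame built from the 10K preview. The date format string `"%m/%d/%Y"` is an assumption based on how the dates appear in the preview (for example `1/27/2011`).

```r
# Toy data frame with values copied from the 10K preview; only the columns
# whose types change are included.
sales <- data.frame(
  Order.Priority = c("L", "C", "C"),
  Order.Date     = c("1/27/2011", "12/28/2015", "1/13/2011"),
  Order.ID       = c(292494523L, 361825549L, 141515767L),
  Ship.Date      = c("2/12/2011", "1/23/2016", "2/1/2011"),
  stringsAsFactors = FALSE
)

# Treat the priority code as a categorical variable.
sales$Order.Priority <- factor(sales$Order.Priority)

# Parse month/day/year strings into Date objects.
# ASSUMPTION: all dates in the file follow the month/day/year pattern.
sales$Order.Date <- as.Date(sales$Order.Date, format = "%m/%d/%Y")
sales$Ship.Date  <- as.Date(sales$Ship.Date,  format = "%m/%d/%Y")

# Order IDs are identifiers, not quantities, so store them as text.
# as.character() on a vector converts element by element.
sales$Order.ID <- as.character(sales$Order.ID)

summary(sales)
```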
## Rows: 10,000
## Columns: 14
## $ Region <chr> "Sub-Saharan Africa", "Europe", "Middle East and North ~
## $ Country <chr> "Chad", "Latvia", "Pakistan", "Democratic Republic of t~
## $ Item.Type <chr> "Office Supplies", "Beverages", "Vegetables", "Househol~
## $ Sales.Channel <chr> "Online", "Online", "Offline", "Online", "Online", "Off~
## $ Order.Priority <fct> L, C, C, C, C, H, L, C, L, C, M, M, C, C, C, L, H, L, H~
## $ Order.Date <date> 2011-01-27, 2015-12-28, 2011-01-13, 2012-09-11, 2015-1~
## $ Order.ID <chr> "292494523, 361825549, 141515767, 500364005, 127481591,~
## $ Ship.Date <date> 2011-02-12, 2016-01-23, 2011-02-01, 2012-10-06, 2015-1~
## $ Units.Sold <int> 4484, 1075, 6515, 7683, 3491, 9880, 4825, 3330, 2431, 6~
## $ Unit.Price <dbl> 651.21, 47.45, 154.06, 668.27, 47.45, 47.45, 154.06, 25~
## $ Unit.Cost <dbl> 524.96, 31.79, 90.93, 502.54, 31.79, 31.79, 90.93, 159.~
## $ Total.Revenue <dbl> 2920025.64, 51008.75, 1003700.90, 5134318.41, 165647.95~
## $ Total.Cost <dbl> 2353920.64, 34174.25, 592408.95, 3861014.82, 110978.89,~
## $ Total.Profit <dbl> 566105.00, 16834.50, 411291.95, 1273303.59, 54669.06, 1~
## Rows: 50,000
## Columns: 14
## $ Region <chr> "Sub-Saharan Africa", "Europe", "Europe", "Europe", "Eu~
## $ Country <chr> "Namibia", "Iceland", "Russia", "Moldova ", "Malta", "I~
## $ Item.Type <chr> "Household", "Baby Food", "Meat", "Meat", "Cereal", "Me~
## $ Sales.Channel <chr> "Offline", "Online", "Online", "Online", "Online", "Onl~
## $ Order.Priority <fct> M, H, L, L, M, H, M, L, M, C, M, L, C, L, L, M, M, M, H~
## $ Order.Date <date> 2015-08-31, 2010-11-20, 2017-06-22, 2012-02-28, 2010-0~
## $ Order.ID <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "",~
## $ Ship.Date <date> 2015-10-12, 2011-01-09, 2017-06-25, 2012-03-20, 2010-0~
## $ Units.Sold <int> 3604, 8435, 4848, 7225, 1975, 2542, 4398, 49, 4031, 791~
## $ Unit.Price <dbl> 668.27, 255.28, 421.89, 421.89, 205.70, 421.89, 668.27,~
## $ Unit.Cost <dbl> 502.54, 159.42, 364.69, 364.69, 117.11, 364.69, 502.54,~
## $ Total.Revenue <dbl> 2408445.08, 2153286.80, 2045322.72, 3048155.25, 406257.~
## $ Total.Cost <dbl> 1811154.16, 1344707.70, 1768017.12, 2634885.25, 231292.~
## $ Total.Profit <dbl> 597290.92, 808579.10, 277305.60, 413270.00, 174965.25, ~
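The `Order.ID` rows in the two `glimpse()` outputs do not look like the nine-digit IDs shown in the earlier previews, and `Country` contains at least one value with trailing whitespace (`"Moldova "`). A quick check of the cleaned columns may therefore be worthwhile before modelling. The sketch below runs on a hypothetical three-row data frame (the empty ID is made up to exercise the check) and only shows the idea.

```r
# Hypothetical toy rows: one country with trailing whitespace and one
# empty Order.ID, to show what the checks would catch.
sales <- data.frame(
  Country  = c("Namibia", "Moldova ", "Malta"),
  Order.ID = c("897751939", "", "626391351"),
  stringsAsFactors = FALSE
)

# Strip leading and trailing whitespace so "Moldova " and "Moldova"
# are not treated as two different countries.
sales$Country <- trimws(sales$Country)

# Flag IDs that are not made up of digits only (this includes empty strings).
bad_ids <- !grepl("^[0-9]+$", sales$Order.ID)
sum(bad_ids)  # number of suspicious IDs
```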
The two ML algorithms I will use are:
In my case, the purpose of ML is classification. It can be used to decide whether additional resources should be invested in improving IT infrastructure, and to determine which region and time of year would be best for storing perishable goods.
For the sales transactions, the Order.Priority variable has only four possible outcomes: C (Critical), H (High), M (Medium), or L (Low).
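A classification tree for a four-level target such as `Order.Priority` could be fitted roughly as sketched here. This is not the report's code: the tree printouts carry a parsnip header, whereas the sketch calls `rpart` directly (the printed `node), split, n, loss, yval, (yprob)` layout is rpart's). The data are synthetic stand-ins, the choice of predictors is an assumption, and the 80/20 split is an assumption suggested by `n= 8002` out of 10,000 rows.

```r
library(rpart)  # recommended package that ships with most R installations

set.seed(123)   # arbitrary seed, only so this sketch is reproducible

# Synthetic stand-in for the sales table (NOT the real data).
n <- 1000
sales <- data.frame(
  Item.Type      = factor(sample(c("Beverages", "Cereal", "Household", "Meat"),
                                 n, replace = TRUE)),
  Sales.Channel  = factor(sample(c("Online", "Offline"), n, replace = TRUE)),
  Units.Sold     = sample(1:10000, n, replace = TRUE),
  Order.Priority = factor(sample(c("C", "H", "L", "M"), n, replace = TRUE))
)

# ASSUMPTION: a simple random 80/20 train/test split.
train_idx <- sample(seq_len(n), size = 0.8 * n)
train <- sales[train_idx, ]
test  <- sales[-train_idx, ]

# Fit a classification tree; method = "class" requests class predictions.
fit <- rpart(Order.Priority ~ Item.Type + Sales.Channel + Units.Sold,
             data = train, method = "class")
print(fit)

# Predict on held-out rows and compute the share classified correctly.
pred <- predict(fit, newdata = test, type = "class")
accuracy <- mean(pred == test$Order.Priority)
accuracy
```

Because the synthetic labels are random, the tree here may not split at all. The point is the workflow: split, fit, predict, and compare predictions with held-out labels.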
10K Sales Data Set
## parsnip model object
##
## Fit time: 90ms
## n= 8002
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 8002 5958 C (0.2554361 0.2503124 0.2494376 0.2448138)
## 2) Item.Type=Baby Food,Clothes,Cosmetics,Household,Vegetables 3425 2493 C (0.2721168 0.2475912 0.2332847 0.2470073) *
## 3) Item.Type=Beverages,Cereal,Fruits,Meat,Office Supplies,Personal Care,Snacks 4577 3380 L (0.2429539 0.2523487 0.2615250 0.2431724) *
50K Sales Data Set
## parsnip model object
##
## Fit time: 90ms
## n= 8002
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 8002 5939 H (0.2564359 0.2578105 0.2484379 0.2373157)
## 2) Item.Type=Beverages,Cereal,Clothes,Cosmetics,Vegetables 3326 2396 H (0.2507517 0.2796152 0.2444378 0.2251954) *
## 3) Item.Type=Baby Food,Fruits,Household,Meat,Office Supplies,Personal Care,Snacks 4676 3458 C (0.2604790 0.2423011 0.2512831 0.2459367) *
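The `n` and `loss` columns in the two printouts are enough to work out how well each tree fits its training rows. At the root, `n - loss` is the number of rows in the majority class, which gives the accuracy of always predicting that class. At the terminal nodes, `n - loss` is the number of rows the tree classifies correctly. The short sketch below does this arithmetic with the numbers copied from the printouts; the helper name `summarise_tree` is made up for illustration.

```r
# n and loss for the root and the two terminal nodes, copied from the
# printed trees above (values as printed, including n = 8002 for both).
trees <- list(
  "10K" = list(root_n = 8002, root_loss = 5958,
               leaf_n = c(3425, 4577), leaf_loss = c(2493, 3380)),
  "50K" = list(root_n = 8002, root_loss = 5939,
               leaf_n = c(3326, 4676), leaf_loss = c(2396, 3458))
)

# Hypothetical helper: majority-class baseline vs. training accuracy of the tree.
summarise_tree <- function(t) {
  c(baseline = (t$root_n - t$root_loss) / t$root_n,
    tree     = sum(t$leaf_n - t$leaf_loss) / sum(t$leaf_n))
}

result <- round(sapply(trees, summarise_tree), 4)
print(result)
#             10K    50K
# baseline 0.2554 0.2578
# tree     0.2661 0.2684
```

Both trees classify roughly 27% of their training rows correctly, about one percentage point above the majority-class baseline and close to the 25% expected from guessing among four nearly balanced classes. These are training figures, so accuracy on unseen rows could be lower.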