Visit the following website and explore the range of sizes of this dataset (from 100 to 5 million records). https://eforexcel.com/wp/downloads-18-sample-csv-files-data-sets-for-testing-sales/ Based on your computer’s capabilities (memory, CPU), select 2 files you can handle (recommended: one small, one large). Review the structure and content of the tables, and think about which two machine learning algorithms presented so far could be used to analyze the data, and how they can be applied in the suggested environment of the datasets. Write a short essay explaining your selection. Then, select one of the 2 algorithms and explore how to analyze and predict an outcome based on the data available. This will be an exploratory exercise, so feel free to show errors and warnings that arise during the analysis. Test the code with both datasets selected and compare the results. Which result will you trust if you need to make a business decision? Do you think an analysis could be prone to errors when using too much data, or when using the least amount possible? Develop your exploratory analysis of the data and the essay in the following 2 weeks.
Among the given datasets, I chose the 10000 Sales Records and 1000 Sales Records files. The 10000-record file is loaded as tenksales, and the 1000-record file is loaded as oneksales.
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ stringr 1.4.0
## ✓ tidyr 1.2.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(ggplot2)
library(corrplot)
## corrplot 0.92 loaded
library(randomForest)
## randomForest 4.7-1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
##
## Attaching package: 'strucchange'
## The following object is masked from 'package:stringr':
##
## boundary
oneksales <- read_csv('https://raw.githubusercontent.com/nancunjie4560/DATA622/main/1000%20Sales%20Records.csv',col_types = 'ffffffffnnnnnn')
tenksales <- read_csv('https://raw.githubusercontent.com/nancunjie4560/DATA622/main/10000%20Sales%20Records.csv',col_types = 'ffffffffnnnnnn')
# Give both datasets the same short column names
tenksales <- tenksales %>%
  rename(region = Region, country = Country, item = `Item Type`, channel = `Sales Channel`, order = `Order Priority`, date = `Order Date`, ID = `Order ID`, sdate = `Ship Date`, unit = `Units Sold`, price = `Unit Price`, ucost = `Unit Cost`, revenue = `Total Revenue`, tcost = `Total Cost`, profit = `Total Profit`)
oneksales <- oneksales %>%
  rename(region = Region, country = Country, item = `Item Type`, channel = `Sales Channel`, order = `Order Priority`, date = `Order Date`, ID = `Order ID`, sdate = `Ship Date`, unit = `Units Sold`, price = `Unit Price`, ucost = `Unit Cost`, revenue = `Total Revenue`, tcost = `Total Cost`, profit = `Total Profit`)
glimpse(tenksales)
## Rows: 10,000
## Columns: 14
## $ region <fct> Sub-Saharan Africa, Europe, Middle East and North Africa, Sub-…
## $ country <fct> Chad, Latvia, Pakistan, Democratic Republic of the Congo, Czec…
## $ item <fct> Office Supplies, Beverages, Vegetables, Household, Beverages, …
## $ channel <fct> Online, Online, Offline, Online, Online, Offline, Online, Onli…
## $ order <fct> L, C, C, C, C, H, L, C, L, C, M, M, C, C, C, L, H, L, H, H, H,…
## $ date <fct> 1/27/2011, 12/28/2015, 1/13/2011, 9/11/2012, 10/27/2015, 7/10/…
## $ ID <fct> 292494523, 361825549, 141515767, 500364005, 127481591, 4822923…
## $ sdate <fct> 2/12/2011, 1/23/2016, 2/1/2011, 10/6/2012, 12/5/2015, 8/21/201…
## $ unit <dbl> 4484, 1075, 6515, 7683, 3491, 9880, 4825, 3330, 2431, 6197, 72…
## $ price <dbl> 651.21, 47.45, 154.06, 668.27, 47.45, 47.45, 154.06, 255.28, 4…
## $ ucost <dbl> 524.96, 31.79, 90.93, 502.54, 31.79, 31.79, 90.93, 159.42, 364…
## $ revenue <dbl> 2920025.64, 51008.75, 1003700.90, 5134318.41, 165647.95, 46880…
## $ tcost <dbl> 2353920.64, 34174.25, 592408.95, 3861014.82, 110978.89, 314085…
## $ profit <dbl> 566105.00, 16834.50, 411291.95, 1273303.59, 54669.06, 154720.8…
glimpse(oneksales)
## Rows: 1,000
## Columns: 14
## $ region <fct> Middle East and North Africa, North America, Middle East and N…
## $ country <fct> Libya, Canada, Libya, Japan, Chad, Armenia, Eritrea, Montenegr…
## $ item <fct> Cosmetics, Vegetables, Baby Food, Cereal, Fruits, Cereal, Cere…
## $ channel <fct> Offline, Online, Offline, Offline, Offline, Online, Online, Of…
## $ order <fct> M, M, C, C, H, H, H, M, H, H, M, M, C, C, L, H, H, M, C, L, C,…
## $ date <fct> 10/18/2014, 11/7/2011, 10/31/2016, 4/10/2010, 8/16/2011, 11/24…
## $ ID <fct> 686800706, 185941302, 246222341, 161442649, 645713555, 6834588…
## $ sdate <fct> 10/31/2014, 12/8/2011, 12/9/2016, 5/12/2010, 8/31/2011, 12/28/…
## $ unit <dbl> 8446, 3018, 1517, 3322, 9845, 9528, 2844, 7299, 2428, 4800, 30…
## $ price <dbl> 437.20, 154.06, 255.28, 205.70, 9.33, 205.70, 205.70, 109.28, …
## $ ucost <dbl> 263.33, 90.93, 159.42, 117.11, 6.92, 117.11, 117.11, 35.84, 90…
## $ revenue <dbl> 3692591.20, 464953.08, 387259.76, 683335.40, 91853.85, 1959909…
## $ tcost <dbl> 2224085.18, 274426.74, 241840.14, 389039.42, 68127.40, 1115824…
## $ profit <dbl> 1468506.02, 190526.34, 145419.62, 294295.98, 23726.45, 844085.…
head(tenksales)
## # A tibble: 6 × 14
## region country item channel order date ID sdate unit price ucost revenue
## <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Sub-S… Chad Offi… Online L 1/27… 2924… 2/12… 4484 651. 525. 2.92e6
## 2 Europe Latvia Beve… Online C 12/2… 3618… 1/23… 1075 47.4 31.8 5.10e4
## 3 Middl… Pakist… Vege… Offline C 1/13… 1415… 2/1/… 6515 154. 90.9 1.00e6
## 4 Sub-S… Democr… Hous… Online C 9/11… 5003… 10/6… 7683 668. 503. 5.13e6
## 5 Europe Czech … Beve… Online C 10/2… 1274… 12/5… 3491 47.4 31.8 1.66e5
## 6 Sub-S… South … Beve… Offline H 7/10… 4822… 8/21… 9880 47.4 31.8 4.69e5
## # … with 2 more variables: tcost <dbl>, profit <dbl>
head(oneksales)
## # A tibble: 6 × 14
## region country item channel order date ID sdate unit price ucost
## <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <dbl> <dbl> <dbl>
## 1 Middle East… Libya Cosm… Offline M 10/1… 6868… 10/3… 8446 437. 263.
## 2 North Ameri… Canada Vege… Online M 11/7… 1859… 12/8… 3018 154. 90.9
## 3 Middle East… Libya Baby… Offline C 10/3… 2462… 12/9… 1517 255. 159.
## 4 Asia Japan Cere… Offline C 4/10… 1614… 5/12… 3322 206. 117.
## 5 Sub-Saharan… Chad Frui… Offline H 8/16… 6457… 8/31… 9845 9.33 6.92
## 6 Europe Armenia Cere… Online H 11/2… 6834… 12/2… 9528 206. 117.
## # … with 3 more variables: revenue <dbl>, tcost <dbl>, profit <dbl>
summary(tenksales)
## region country
## Sub-Saharan Africa :2603 Lithuania : 72
## Europe :2633 United Kingdom: 72
## Middle East and North Africa :1264 Moldova : 71
## Asia :1469 Seychelles : 70
## Central America and the Caribbean:1019 Croatia : 70
## Australia and Oceania : 797 Montenegro : 69
## North America : 215 (Other) :9576
## item channel order date
## Personal Care : 888 Online :5061 L:2494 1/28/2012 : 13
## Household : 875 Offline:4939 C:2555 3/3/2012 : 12
## Clothes : 872 H:2503 8/16/2014 : 11
## Baby Food : 842 M:2448 7/15/2012 : 11
## Office Supplies: 837 10/28/2016: 11
## Vegetables : 836 7/28/2017 : 10
## (Other) :4850 (Other) :9932
## ID sdate unit price
## 292494523: 1 9/30/2014 : 12 Min. : 2 Min. : 9.33
## 361825549: 1 7/23/2015 : 11 1st Qu.: 2531 1st Qu.:109.28
## 141515767: 1 2/21/2010 : 11 Median : 4962 Median :205.70
## 500364005: 1 3/24/2016 : 11 Mean : 5003 Mean :268.14
## 127481591: 1 10/28/2012: 11 3rd Qu.: 7472 3rd Qu.:437.20
## 482292354: 1 7/24/2011 : 10 Max. :10000 Max. :668.27
## (Other) :9994 (Other) :9934
## ucost revenue tcost profit
## Min. : 6.92 Min. : 168 Min. : 125 Min. : 43.4
## 1st Qu.: 56.67 1st Qu.: 288551 1st Qu.: 164786 1st Qu.: 98329.1
## Median :117.11 Median : 800051 Median : 481606 Median : 289099.0
## Mean :188.81 Mean :1333355 Mean : 938266 Mean : 395089.3
## 3rd Qu.:364.69 3rd Qu.:1819143 3rd Qu.:1183822 3rd Qu.: 566422.7
## Max. :524.96 Max. :6680027 Max. :5241726 Max. :1738178.4
##
summary(oneksales)
## region country
## Middle East and North Africa :138 Cuba : 11
## North America : 19 Malaysia : 10
## Asia :136 Czech Republic: 10
## Sub-Saharan Africa :262 Zimbabwe : 10
## Europe :267 Bahrain : 10
## Central America and the Caribbean: 99 Fiji : 9
## Australia and Oceania : 79 (Other) :940
## item channel order date ID
## Beverages :101 Offline:520 M:242 5/17/2012: 3 686800706: 1
## Vegetables : 97 Online :480 C:262 8/3/2013 : 3 185941302: 1
## Office Supplies: 89 H:228 3/17/2012: 3 246222341: 1
## Baby Food : 87 L:268 6/9/2017 : 3 161442649: 1
## Personal Care : 87 1/14/2013: 3 645713555: 1
## Snacks : 82 3/20/2011: 3 683458888: 1
## (Other) :457 (Other) :982 (Other) :994
## sdate unit price ucost
## 4/17/2015: 3 Min. : 13 Min. : 9.33 Min. : 6.92
## 6/28/2012: 3 1st Qu.:2420 1st Qu.: 81.73 1st Qu.: 56.67
## 2/15/2012: 3 Median :5184 Median :154.06 Median : 97.44
## 6/8/2011 : 3 Mean :5054 Mean :262.11 Mean :184.97
## 8/19/2013: 3 3rd Qu.:7537 3rd Qu.:421.89 3rd Qu.:263.33
## 11/4/2011: 3 Max. :9998 Max. :668.27 Max. :524.96
## (Other) :982
## revenue tcost profit
## Min. : 2043 Min. : 1417 Min. : 532.6
## 1st Qu.: 281192 1st Qu.: 164932 1st Qu.: 98376.1
## Median : 754939 Median : 464726 Median : 277226.0
## Mean :1327322 Mean : 936119 Mean : 391202.6
## 3rd Qu.:1733503 3rd Qu.:1141750 3rd Qu.: 548456.8
## Max. :6617210 Max. :5204978 Max. :1726181.4
##
The 10000 Sales Records dataset has 14 columns and 10,000 observations; the 1000 Sales Records dataset has the same 14 columns but 1,000 observations. The variables are Region, Country, Item Type, Sales Channel, Order Priority, Order Date, Order ID, Ship Date, Units Sold, Unit Price, Unit Cost, Total Revenue, Total Cost, and Total Profit. However, the order and ship dates are month/day/year text that was read in as factors rather than as Date values, and the Order ID, although it looks numeric, is an identifier (read here as a factor). The dates need to be converted first.
Convert dates from MDY text to Date format
oneksales <- oneksales %>%
mutate(order_date = as.Date(date, '%m/%d/%Y'))%>%
mutate(ship_date = as.Date(sdate, '%m/%d/%Y'))
tenksales <- tenksales %>%
mutate(order_date = as.Date(date, '%m/%d/%Y'))%>%
mutate(ship_date = as.Date(sdate, '%m/%d/%Y'))
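As a small illustration of the conversion above, the sketch below parses a few of the month/day/year strings shown in the glimpse() output and computes the number of days between order and shipment. The name ship_lag is an assumption introduced here for illustration; it is not a column in the dataset or part of the original analysis.

```r
# Order and ship dates copied from the first rows of the 10000-record file.
date  <- factor(c("1/27/2011", "12/28/2015", "1/13/2011"))
sdate <- factor(c("2/12/2011", "1/23/2016", "2/1/2011"))

# Convert via as.character() so the factor labels, not the integer codes, are parsed.
order_date <- as.Date(as.character(date),  format = "%m/%d/%Y")
ship_date  <- as.Date(as.character(sdate), format = "%m/%d/%Y")

# ship_lag is a hypothetical derived feature: days from order to shipment.
ship_lag <- as.integer(ship_date - order_date)
print(data.frame(order_date, ship_date, ship_lag))
```

Once the columns are real Date values, differences such as this one become available as numeric features, which is not possible while the dates are stored as factors.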
summary(oneksales)
## region country
## Middle East and North Africa :138 Cuba : 11
## North America : 19 Malaysia : 10
## Asia :136 Czech Republic: 10
## Sub-Saharan Africa :262 Zimbabwe : 10
## Europe :267 Bahrain : 10
## Central America and the Caribbean: 99 Fiji : 9
## Australia and Oceania : 79 (Other) :940
## item channel order date ID
## Beverages :101 Offline:520 M:242 5/17/2012: 3 686800706: 1
## Vegetables : 97 Online :480 C:262 8/3/2013 : 3 185941302: 1
## Office Supplies: 89 H:228 3/17/2012: 3 246222341: 1
## Baby Food : 87 L:268 6/9/2017 : 3 161442649: 1
## Personal Care : 87 1/14/2013: 3 645713555: 1
## Snacks : 82 3/20/2011: 3 683458888: 1
## (Other) :457 (Other) :982 (Other) :994
## sdate unit price ucost
## 4/17/2015: 3 Min. : 13 Min. : 9.33 Min. : 6.92
## 6/28/2012: 3 1st Qu.:2420 1st Qu.: 81.73 1st Qu.: 56.67
## 2/15/2012: 3 Median :5184 Median :154.06 Median : 97.44
## 6/8/2011 : 3 Mean :5054 Mean :262.11 Mean :184.97
## 8/19/2013: 3 3rd Qu.:7537 3rd Qu.:421.89 3rd Qu.:263.33
## 11/4/2011: 3 Max. :9998 Max. :668.27 Max. :524.96
## (Other) :982
## revenue tcost profit order_date
## Min. : 2043 Min. : 1417 Min. : 532.6 Min. :2010-01-01
## 1st Qu.: 281192 1st Qu.: 164932 1st Qu.: 98376.1 1st Qu.:2011-11-14
## Median : 754939 Median : 464726 Median : 277226.0 Median :2013-09-24
## Mean :1327322 Mean : 936119 Mean : 391202.6 Mean :2013-09-19
## 3rd Qu.:1733503 3rd Qu.:1141750 3rd Qu.: 548456.8 3rd Qu.:2015-07-03
## Max. :6617210 Max. :5204978 Max. :1726181.4 Max. :2017-07-26
##
## ship_date
## Min. :2010-01-15
## 1st Qu.:2011-12-11
## Median :2013-10-12
## Mean :2013-10-14
## 3rd Qu.:2015-07-28
## Max. :2017-09-12
##
summary(tenksales)
## region country
## Sub-Saharan Africa :2603 Lithuania : 72
## Europe :2633 United Kingdom: 72
## Middle East and North Africa :1264 Moldova : 71
## Asia :1469 Seychelles : 70
## Central America and the Caribbean:1019 Croatia : 70
## Australia and Oceania : 797 Montenegro : 69
## North America : 215 (Other) :9576
## item channel order date
## Personal Care : 888 Online :5061 L:2494 1/28/2012 : 13
## Household : 875 Offline:4939 C:2555 3/3/2012 : 12
## Clothes : 872 H:2503 8/16/2014 : 11
## Baby Food : 842 M:2448 7/15/2012 : 11
## Office Supplies: 837 10/28/2016: 11
## Vegetables : 836 7/28/2017 : 10
## (Other) :4850 (Other) :9932
## ID sdate unit price
## 292494523: 1 9/30/2014 : 12 Min. : 2 Min. : 9.33
## 361825549: 1 7/23/2015 : 11 1st Qu.: 2531 1st Qu.:109.28
## 141515767: 1 2/21/2010 : 11 Median : 4962 Median :205.70
## 500364005: 1 3/24/2016 : 11 Mean : 5003 Mean :268.14
## 127481591: 1 10/28/2012: 11 3rd Qu.: 7472 3rd Qu.:437.20
## 482292354: 1 7/24/2011 : 10 Max. :10000 Max. :668.27
## (Other) :9994 (Other) :9934
## ucost revenue tcost profit
## Min. : 6.92 Min. : 168 Min. : 125 Min. : 43.4
## 1st Qu.: 56.67 1st Qu.: 288551 1st Qu.: 164786 1st Qu.: 98329.1
## Median :117.11 Median : 800051 Median : 481606 Median : 289099.0
## Mean :188.81 Mean :1333355 Mean : 938266 Mean : 395089.3
## 3rd Qu.:364.69 3rd Qu.:1819143 3rd Qu.:1183822 3rd Qu.: 566422.7
## Max. :524.96 Max. :6680027 Max. :5241726 Max. :1738178.4
##
## order_date ship_date
## Min. :2010-01-01 Min. :2010-01-05
## 1st Qu.:2011-12-08 1st Qu.:2012-01-04
## Median :2013-11-02 Median :2013-11-26
## Mean :2013-10-27 Mean :2013-11-21
## 3rd Qu.:2015-09-11 3rd Qu.:2015-10-08
## Max. :2017-07-28 Max. :2017-09-10
##
hist(oneksales$profit, col = 'blue')
hist(oneksales$profit, col = 'blue')
Most sales have a profit below $600,000.
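ggplot(oneksales, aes(x = region, y = profit, color = channel)) +
  geom_boxplot() + coord_flip()
ggplot(tenksales, aes(x = region, y = profit, color = channel)) +
  geom_boxplot() + coord_flip()
# Correlation among the numeric variables of the 1000-record dataset
# (named num_vars so the data frame does not share a name with the cor() function)
num_vars <- oneksales %>%
  select(unit, price, ucost, revenue, tcost, profit)
corrplot(cor(num_vars), type = 'upper')
As I guessed, the numeric variables are correlated.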
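One likely reason for the correlation is that several numeric columns appear to be computed from the others. The sketch below checks this on the first three rows of the 10000-record file, copied from the head() output earlier. That revenue is units times unit price, and so on, is an assumption being tested on these rows only, not a documented property of the whole file.

```r
# First three rows of the 10000-record file, copied from the output above.
sales <- data.frame(
  unit    = c(4484, 1075, 6515),
  price   = c(651.21, 47.45, 154.06),
  ucost   = c(524.96, 31.79, 90.93),
  revenue = c(2920025.64, 51008.75, 1003700.90),
  tcost   = c(2353920.64, 34174.25, 592408.95),
  profit  = c(566105.00, 16834.50, 411291.95)
)

# Compare each candidate derived column with its assumed definition,
# using a small relative tolerance for floating-point error.
derived_ok <- with(sales, c(
  revenue = isTRUE(all.equal(revenue, unit * price,    tolerance = 1e-8)),
  tcost   = isTRUE(all.equal(tcost,   unit * ucost,    tolerance = 1e-8)),
  profit  = isTRUE(all.equal(profit,  revenue - tcost, tolerance = 1e-8))
))
print(derived_ok)
```

If the same relationships hold across the full files, revenue, tcost, and profit carry no information beyond unit, price, and ucost, which is worth keeping in mind when reading the variable importance tables later.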
The box plots show some outliers in the data, and the histogram shows that profit is right-skewed. According to the box plots, the 10000 Sales Records differ from the 1000 Sales Records in the preference for offline or online purchase. A good sales strategy should be built around the preferred shopping platform and a well-chosen target region. The sales datasets contain records of purchases across product categories from various countries and regions. The key numeric variables (unit, price, ucost, revenue, tcost, profit), together with the order ID, contain no NA values.
Assume the goal of the business is to increase profit. A qualitative analysis would then help the business decide where to focus: which region or country, which item types, and which platform (online or offline) are most likely to increase profit. Two algorithms that could address this classification problem are Naive Bayes and random forest.
According to https://discuss.analyticsvidhya.com/t/how-to-decide-when-to-use-naive-bayes-for-classification/5720, Naive Bayes performs well with multiple classes and with text classification. Its advantage is simplicity: if the conditional independence assumption actually holds, a Naive Bayes classifier converges more quickly than discriminative models such as logistic regression, so it needs less training data. Even when the assumption does not hold, it still requires little training time.
The main difference between Naive Bayes and random forest is model size. A Naive Bayes model is small and stays roughly constant in size as the data grows. It has difficulty representing complex behavior, which also makes it less prone to overfitting. A random forest model, on the other hand, is very large and can overfit. When the data is dynamic and keeps changing, Naive Bayes can adapt quickly to new data, while a random forest has to be rebuilt every time something changes.
Also, according to https://cloudvane.net/big-data-2/machine-learning-101-classification-algorithms-random-forest-and-naive-bayes/, random forest models run efficiently on large datasets, because the computation can be split up and run in parallel. A random forest can handle thousands of input variables without variable deletion. It also computes proximities between pairs of cases, which can be used for clustering, locating outliers, or (by scaling) giving interesting views of the data.
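To make the conditional independence assumption concrete, here is a minimal hand-written Naive Bayes sketch for categorical predictors in base R. It is not the implementation from any package: fit_nb and predict_nb are hypothetical helper names, the add-one (Laplace) smoothing is a design choice made here, and the six training rows are made up for illustration rather than taken from the sales files.

```r
# Toy training data: made-up rows for illustration only, not from the sales files.
train <- data.frame(
  region  = c("Europe", "Europe", "Asia", "Asia", "Europe", "Asia"),
  order   = c("H", "L", "H", "L", "H", "H"),
  channel = c("Online", "Online", "Offline", "Offline", "Online", "Offline"),
  stringsAsFactors = TRUE
)

# Fit: class priors plus one smoothed conditional table per predictor.
fit_nb <- function(data, target, laplace = 1) {
  y <- data[[target]]
  predictors <- setdiff(names(data), target)
  tables <- lapply(predictors, function(p) {
    counts <- table(y, data[[p]]) + laplace  # add-one smoothing avoids zero probabilities
    counts / rowSums(counts)                 # each row becomes P(value | class)
  })
  names(tables) <- predictors
  list(prior = prop.table(table(y)), tables = tables)
}

# Predict: multiply the prior by each conditional probability, then normalise.
# Treating the predictors as independent given the class is the "naive" step.
predict_nb <- function(model, newrow) {
  score <- as.numeric(model$prior)
  names(score) <- names(model$prior)
  for (p in names(model$tables)) {
    score <- score * model$tables[[p]][, as.character(newrow[[p]])]
  }
  score / sum(score)
}

model <- fit_nb(train, "channel")
posterior <- predict_nb(model, list(region = "Europe", order = "L"))
print(round(posterior, 3))
```

The fitted model is only a prior and one small table per predictor, which is why its size stays roughly constant as rows are added, and why new rows can be absorbed by updating counts instead of refitting from scratch.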
For this assignment I build a random forest classification model. Many methods could be used to classify by region, but the model fitted below predicts the sales channel (Online or Offline) from region, item type, order priority, and the numeric variables. It is important for a business to find its focus market and to target sales in the right region and channel to maximize total profit. The data are split into 75% training and 25% testing observations, and the same rule is applied to both the 10000 Sales Records and the 1000 Sales Records datasets.
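The 75/25 split described above can be sketched as a small reusable helper, so that both datasets are split by exactly the same rule. split_rows is a hypothetical name introduced here, and the toy data frame only stands in for the 1000-record file.

```r
# Return sorted row indices for a training set holding `frac` of the rows.
# split_rows is a hypothetical helper, not part of the original analysis.
split_rows <- function(n, frac = 0.75, seed = 1234) {
  set.seed(seed)                          # fixed seed makes the split reproducible
  sort(sample(n, size = floor(n * frac)))
}

# Stand-in data frame with the same row count as the 1000-record file.
toy <- data.frame(id = seq_len(1000))
train_idx <- split_rows(nrow(toy))
train <- toy[train_idx, , drop = FALSE]
test  <- toy[-train_idx, , drop = FALSE]  # every row not sampled for training

print(c(train = nrow(train), test = nrow(test)))
```

Negative indexing guarantees that the two sets are disjoint and together cover every row.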
Split for 1000 Sales records
# Drop the country column (column 2) from both datasets
tenksales <- tenksales[, -2]
oneksales <- oneksales[, -2]
set.seed(1234)
onekdata = sort(sample(nrow(oneksales), nrow(oneksales)*.75))
onektrain<-oneksales[onekdata,]
onektest<-oneksales[-onekdata,]
Random forest model for oneksales.
# na.action = na.omit is the argument randomForest accepts for dropping rows with NA
onekrd <- randomForest(channel ~ region+item+order+unit+price+ucost+revenue+tcost+profit, data = onektrain, importance = TRUE, na.action = na.omit)
onekrd
##
## Call:
## randomForest(formula = channel ~ region + item + order + unit + price + ucost + revenue + tcost + profit, data = onektrain, importance = TRUE, na.action = na.omit)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 46.27%
## Confusion matrix:
## Offline Online class.error
## Offline 244 156 0.3900000
## Online 191 159 0.5457143
varImp(onekrd)
## Offline Online
## region -1.2343119 -1.2343119
## item -0.5570033 -0.5570033
## order -0.2296124 -0.2296124
## unit 0.2640131 0.2640131
## price 0.6657127 0.6657127
## ucost -0.2670300 -0.2670300
## revenue 1.6884005 1.6884005
## tcost 1.3050953 1.3050953
## profit 1.0451576 1.0451576
varImpPlot(onekrd)
Accuracy of oneksales random forest model
onek_pred<- predict(onekrd, newdata = onektest)
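# Confusion matrix: rows are predictions, columns are actual channels
# (named onek_cm so it does not mask base::matrix)
onek_cm <- table(onek_pred, onektest$channel)
onek_cm
##
## onek_pred Offline Online
## Offline 81 74
## Online 39 56
sum(diag(onek_cm))/nrow(onektest)
## [1] 0.548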
Split for 10000 Sales records
set.seed(1234)
tenkdata = sort(sample(nrow(tenksales), nrow(tenksales)*.75))
tenktrain<-tenksales[tenkdata,]
tenktest<-tenksales[-tenkdata,]
Random forest model for tenksales.
# na.action = na.omit is the argument randomForest accepts for dropping rows with NA
tenkrd <- randomForest(channel ~ region+item+order+unit+price+ucost+revenue+tcost+profit, data = tenktrain, importance = TRUE, na.action = na.omit)
tenkrd
##
## Call:
## randomForest(formula = channel ~ region + item + order + unit + price + ucost + revenue + tcost + profit, data = tenktrain, importance = TRUE, na.action = na.omit)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 49.68%
## Confusion matrix:
## Online Offline class.error
## Online 2000 1815 0.4757536
## Offline 1911 1774 0.5185889
varImp(tenkrd)
## Online Offline
## region -1.8774071 -1.8774071
## item 1.3431021 1.3431021
## order 1.6212722 1.6212722
## unit 0.1691929 0.1691929
## price 0.6742841 0.6742841
## ucost 0.9066252 0.9066252
## revenue 0.9932861 0.9932861
## tcost 0.7416452 0.7416452
## profit 0.1425480 0.1425480
varImpPlot(tenkrd)
Accuracy of tenksales random forest model
tenk_pred<- predict(tenkrd, newdata = tenktest)
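# Confusion matrix: rows are predictions, columns are actual channels
# (named tenk_cm so it does not mask base::matrix)
tenk_cm <- table(tenk_pred, tenktest$channel)
tenk_cm
##
## tenk_pred Online Offline
## Online 616 644
## Offline 630 610
accuracy <- sum(diag(tenk_cm))/nrow(tenktest)
accuracy
## [1] 0.4904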
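To put the two accuracy figures in context, the sketch below rebuilds both confusion matrices from the printed output and compares each accuracy with a simple baseline that always predicts the more common actual class. compare_to_baseline is a hypothetical helper introduced for illustration, and the _demo suffix only marks these objects as copies typed in from the output.

```r
# Confusion matrices copied from the output above (rows = predicted, columns = actual).
onek_cm_demo <- matrix(c(81, 39, 74, 56), nrow = 2,
                       dimnames = list(pred   = c("Offline", "Online"),
                                       actual = c("Offline", "Online")))
tenk_cm_demo <- matrix(c(616, 630, 644, 610), nrow = 2,
                       dimnames = list(pred   = c("Online", "Offline"),
                                       actual = c("Online", "Offline")))

# Accuracy is the diagonal share of all cases; the baseline is the share of
# the majority actual class, i.e. what always guessing that class would score.
compare_to_baseline <- function(cm) {
  c(accuracy = sum(diag(cm)) / sum(cm),
    baseline = max(colSums(cm)) / sum(cm))
}

result <- rbind(onek = compare_to_baseline(onek_cm_demo),
                tenk = compare_to_baseline(tenk_cm_demo))
print(round(result, 4))
```

A model is only useful for a business decision if its accuracy clearly exceeds this baseline, so the comparison matters more than the raw accuracy number.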
In summary, the performance of the random forest model is not satisfactory. Accuracy is around 50% on both datasets (54.8% on the 1000-record test set and 49.0% on the 10000-record test set), which is about what guessing would give for a two-class problem this balanced. I believe tuning could improve the accuracy. The Country variable could not be included because it has too many categories; randomForest does not accept categorical predictors with more than 53 levels. If the data keeps accumulating, a Naive Bayes model may be a better choice than a random forest, because a random forest can overfit and has to be rebuilt every time the data changes. I do not recommend the current model to the business, and the two datasets give essentially the same result with the same model.
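On the question of how much data to trust, one rough way to compare the two results is the uncertainty of each accuracy estimate, which shrinks as the test set grows. The sketch below uses a normal-approximation 95% interval; accuracy_ci is a hypothetical helper, and the approximation is an assumption that is reasonable for test sets of this size but not exact.

```r
# Normal-approximation 95% interval for an accuracy estimated on n test rows.
accuracy_ci <- function(acc, n, z = 1.96) {
  se <- sqrt(acc * (1 - acc) / n)   # standard error of a proportion
  c(lower = acc - z * se, upper = acc + z * se)
}

# Accuracies and test-set sizes reported above (250 and 2500 test rows).
ci <- rbind(onek = accuracy_ci(0.548, 250),
            tenk = accuracy_ci(0.4904, 2500))
print(round(ci, 3))
```

Under this approximation both intervals contain 0.5, and the interval for the smaller test set is roughly three times as wide, so the apparently better 54.8% is the less reliable of the two numbers.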