Visit the following website and explore the range of sizes of this dataset (from 100 to 5 million records): https://eforexcel.com/wp/downloads-18-sample-csv-files-data-sets-for-testing-sales/ Based on your computer’s capabilities (memory, CPU), select 2 files you can handle (recommended: one small, one large). Review the structure and content of the tables, and think about which two machine learning algorithms presented so far could be used to analyze the data, and how they can be applied in the suggested environment of the datasets. Write a short essay explaining your selection. Then, select one of the 2 algorithms and explore how to analyze and predict an outcome based on the data available. This will be an exploratory exercise, so feel free to show errors and warnings that arise during the analysis. Test the code with both datasets selected and compare the results. Which result will you trust if you need to make a business decision? Do you think an analysis could be prone to errors when using too much data, or when using the least amount possible? Develop your exploratory analysis of the data and the essay in the following 2 weeks.

Data Exploration

From the available datasets, I chose the 10,000-record and 1,000-record sales files. The 10,000-record dataset is named tenksales, and the 1,000-record dataset is named oneksales.

Libraries

library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ stringr 1.4.0
## ✓ tidyr   1.2.0     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(ggplot2)
library(corrplot)
## corrplot 0.92 loaded
library(randomForest)
## randomForest 4.7-1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
## 
## Attaching package: 'strucchange'
## The following object is masked from 'package:stringr':
## 
##     boundary

Import Data

# Read both files: the first eight columns (including the dates and Order ID)
# are read as factors ('f') and the last six as numbers ('n').
oneksales <- read_csv('https://raw.githubusercontent.com/nancunjie4560/DATA622/main/1000%20Sales%20Records.csv', col_types = 'ffffffffnnnnnn')
tenksales <- read_csv('https://raw.githubusercontent.com/nancunjie4560/DATA622/main/10000%20Sales%20Records.csv', col_types = 'ffffffffnnnnnn')

# Shorten the column names so they are easier to use in model formulas.
tenksales <- tenksales %>%
  rename(region = Region, country = Country, item = `Item Type`, channel = `Sales Channel`, order = `Order Priority`, date = `Order Date`, ID = `Order ID`, sdate = `Ship Date`, unit = `Units Sold`, price = `Unit Price`, ucost = `Unit Cost`, revenue = `Total Revenue`, tcost = `Total Cost`, profit = `Total Profit`)

oneksales <- oneksales %>%
  rename(region = Region, country = Country, item = `Item Type`, channel = `Sales Channel`, order = `Order Priority`, date = `Order Date`, ID = `Order ID`, sdate = `Ship Date`, unit = `Units Sold`, price = `Unit Price`, ucost = `Unit Cost`, revenue = `Total Revenue`, tcost = `Total Cost`, profit = `Total Profit`)

Statistical Summary

glimpse(tenksales)
## Rows: 10,000
## Columns: 14
## $ region  <fct> Sub-Saharan Africa, Europe, Middle East and North Africa, Sub-…
## $ country <fct> Chad, Latvia, Pakistan, Democratic Republic of the Congo, Czec…
## $ item    <fct> Office Supplies, Beverages, Vegetables, Household, Beverages, …
## $ channel <fct> Online, Online, Offline, Online, Online, Offline, Online, Onli…
## $ order   <fct> L, C, C, C, C, H, L, C, L, C, M, M, C, C, C, L, H, L, H, H, H,…
## $ date    <fct> 1/27/2011, 12/28/2015, 1/13/2011, 9/11/2012, 10/27/2015, 7/10/…
## $ ID      <fct> 292494523, 361825549, 141515767, 500364005, 127481591, 4822923…
## $ sdate   <fct> 2/12/2011, 1/23/2016, 2/1/2011, 10/6/2012, 12/5/2015, 8/21/201…
## $ unit    <dbl> 4484, 1075, 6515, 7683, 3491, 9880, 4825, 3330, 2431, 6197, 72…
## $ price   <dbl> 651.21, 47.45, 154.06, 668.27, 47.45, 47.45, 154.06, 255.28, 4…
## $ ucost   <dbl> 524.96, 31.79, 90.93, 502.54, 31.79, 31.79, 90.93, 159.42, 364…
## $ revenue <dbl> 2920025.64, 51008.75, 1003700.90, 5134318.41, 165647.95, 46880…
## $ tcost   <dbl> 2353920.64, 34174.25, 592408.95, 3861014.82, 110978.89, 314085…
## $ profit  <dbl> 566105.00, 16834.50, 411291.95, 1273303.59, 54669.06, 154720.8…
glimpse(oneksales)
## Rows: 1,000
## Columns: 14
## $ region  <fct> Middle East and North Africa, North America, Middle East and N…
## $ country <fct> Libya, Canada, Libya, Japan, Chad, Armenia, Eritrea, Montenegr…
## $ item    <fct> Cosmetics, Vegetables, Baby Food, Cereal, Fruits, Cereal, Cere…
## $ channel <fct> Offline, Online, Offline, Offline, Offline, Online, Online, Of…
## $ order   <fct> M, M, C, C, H, H, H, M, H, H, M, M, C, C, L, H, H, M, C, L, C,…
## $ date    <fct> 10/18/2014, 11/7/2011, 10/31/2016, 4/10/2010, 8/16/2011, 11/24…
## $ ID      <fct> 686800706, 185941302, 246222341, 161442649, 645713555, 6834588…
## $ sdate   <fct> 10/31/2014, 12/8/2011, 12/9/2016, 5/12/2010, 8/31/2011, 12/28/…
## $ unit    <dbl> 8446, 3018, 1517, 3322, 9845, 9528, 2844, 7299, 2428, 4800, 30…
## $ price   <dbl> 437.20, 154.06, 255.28, 205.70, 9.33, 205.70, 205.70, 109.28, …
## $ ucost   <dbl> 263.33, 90.93, 159.42, 117.11, 6.92, 117.11, 117.11, 35.84, 90…
## $ revenue <dbl> 3692591.20, 464953.08, 387259.76, 683335.40, 91853.85, 1959909…
## $ tcost   <dbl> 2224085.18, 274426.74, 241840.14, 389039.42, 68127.40, 1115824…
## $ profit  <dbl> 1468506.02, 190526.34, 145419.62, 294295.98, 23726.45, 844085.…
head(tenksales)
## # A tibble: 6 × 14
##   region country item  channel order date  ID    sdate  unit price ucost revenue
##   <fct>  <fct>   <fct> <fct>   <fct> <fct> <fct> <fct> <dbl> <dbl> <dbl>   <dbl>
## 1 Sub-S… Chad    Offi… Online  L     1/27… 2924… 2/12…  4484 651.  525.   2.92e6
## 2 Europe Latvia  Beve… Online  C     12/2… 3618… 1/23…  1075  47.4  31.8  5.10e4
## 3 Middl… Pakist… Vege… Offline C     1/13… 1415… 2/1/…  6515 154.   90.9  1.00e6
## 4 Sub-S… Democr… Hous… Online  C     9/11… 5003… 10/6…  7683 668.  503.   5.13e6
## 5 Europe Czech … Beve… Online  C     10/2… 1274… 12/5…  3491  47.4  31.8  1.66e5
## 6 Sub-S… South … Beve… Offline H     7/10… 4822… 8/21…  9880  47.4  31.8  4.69e5
## # … with 2 more variables: tcost <dbl>, profit <dbl>
head(oneksales)
## # A tibble: 6 × 14
##   region       country item  channel order date  ID    sdate  unit  price  ucost
##   <fct>        <fct>   <fct> <fct>   <fct> <fct> <fct> <fct> <dbl>  <dbl>  <dbl>
## 1 Middle East… Libya   Cosm… Offline M     10/1… 6868… 10/3…  8446 437.   263.  
## 2 North Ameri… Canada  Vege… Online  M     11/7… 1859… 12/8…  3018 154.    90.9 
## 3 Middle East… Libya   Baby… Offline C     10/3… 2462… 12/9…  1517 255.   159.  
## 4 Asia         Japan   Cere… Offline C     4/10… 1614… 5/12…  3322 206.   117.  
## 5 Sub-Saharan… Chad    Frui… Offline H     8/16… 6457… 8/31…  9845   9.33   6.92
## 6 Europe       Armenia Cere… Online  H     11/2… 6834… 12/2…  9528 206.   117.  
## # … with 3 more variables: revenue <dbl>, tcost <dbl>, profit <dbl>
summary(tenksales)
##                                region               country    
##  Sub-Saharan Africa               :2603   Lithuania     :  72  
##  Europe                           :2633   United Kingdom:  72  
##  Middle East and North Africa     :1264   Moldova       :  71  
##  Asia                             :1469   Seychelles    :  70  
##  Central America and the Caribbean:1019   Croatia       :  70  
##  Australia and Oceania            : 797   Montenegro    :  69  
##  North America                    : 215   (Other)       :9576  
##               item         channel     order            date     
##  Personal Care  : 888   Online :5061   L:2494   1/28/2012 :  13  
##  Household      : 875   Offline:4939   C:2555   3/3/2012  :  12  
##  Clothes        : 872                  H:2503   8/16/2014 :  11  
##  Baby Food      : 842                  M:2448   7/15/2012 :  11  
##  Office Supplies: 837                           10/28/2016:  11  
##  Vegetables     : 836                           7/28/2017 :  10  
##  (Other)        :4850                           (Other)   :9932  
##          ID              sdate           unit           price       
##  292494523:   1   9/30/2014 :  12   Min.   :    2   Min.   :  9.33  
##  361825549:   1   7/23/2015 :  11   1st Qu.: 2531   1st Qu.:109.28  
##  141515767:   1   2/21/2010 :  11   Median : 4962   Median :205.70  
##  500364005:   1   3/24/2016 :  11   Mean   : 5003   Mean   :268.14  
##  127481591:   1   10/28/2012:  11   3rd Qu.: 7472   3rd Qu.:437.20  
##  482292354:   1   7/24/2011 :  10   Max.   :10000   Max.   :668.27  
##  (Other)  :9994   (Other)   :9934                                   
##      ucost           revenue            tcost             profit         
##  Min.   :  6.92   Min.   :    168   Min.   :    125   Min.   :     43.4  
##  1st Qu.: 56.67   1st Qu.: 288551   1st Qu.: 164786   1st Qu.:  98329.1  
##  Median :117.11   Median : 800051   Median : 481606   Median : 289099.0  
##  Mean   :188.81   Mean   :1333355   Mean   : 938266   Mean   : 395089.3  
##  3rd Qu.:364.69   3rd Qu.:1819143   3rd Qu.:1183822   3rd Qu.: 566422.7  
##  Max.   :524.96   Max.   :6680027   Max.   :5241726   Max.   :1738178.4  
## 
summary(oneksales)
##                                region              country   
##  Middle East and North Africa     :138   Cuba          : 11  
##  North America                    : 19   Malaysia      : 10  
##  Asia                             :136   Czech Republic: 10  
##  Sub-Saharan Africa               :262   Zimbabwe      : 10  
##  Europe                           :267   Bahrain       : 10  
##  Central America and the Caribbean: 99   Fiji          :  9  
##  Australia and Oceania            : 79   (Other)       :940  
##               item        channel    order          date             ID     
##  Beverages      :101   Offline:520   M:242   5/17/2012:  3   686800706:  1  
##  Vegetables     : 97   Online :480   C:262   8/3/2013 :  3   185941302:  1  
##  Office Supplies: 89                 H:228   3/17/2012:  3   246222341:  1  
##  Baby Food      : 87                 L:268   6/9/2017 :  3   161442649:  1  
##  Personal Care  : 87                         1/14/2013:  3   645713555:  1  
##  Snacks         : 82                         3/20/2011:  3   683458888:  1  
##  (Other)        :457                         (Other)  :982   (Other)  :994  
##        sdate          unit          price            ucost       
##  4/17/2015:  3   Min.   :  13   Min.   :  9.33   Min.   :  6.92  
##  6/28/2012:  3   1st Qu.:2420   1st Qu.: 81.73   1st Qu.: 56.67  
##  2/15/2012:  3   Median :5184   Median :154.06   Median : 97.44  
##  6/8/2011 :  3   Mean   :5054   Mean   :262.11   Mean   :184.97  
##  8/19/2013:  3   3rd Qu.:7537   3rd Qu.:421.89   3rd Qu.:263.33  
##  11/4/2011:  3   Max.   :9998   Max.   :668.27   Max.   :524.96  
##  (Other)  :982                                                   
##     revenue            tcost             profit         
##  Min.   :   2043   Min.   :   1417   Min.   :    532.6  
##  1st Qu.: 281192   1st Qu.: 164932   1st Qu.:  98376.1  
##  Median : 754939   Median : 464726   Median : 277226.0  
##  Mean   :1327322   Mean   : 936119   Mean   : 391202.6  
##  3rd Qu.:1733503   3rd Qu.:1141750   3rd Qu.: 548456.8  
##  Max.   :6617210   Max.   :5204978   Max.   :1726181.4  
## 

The 10000 Sales Records dataset has 14 columns and 10,000 observations; the 1000 Sales Records dataset also has 14 columns but 1,000 observations. The variables are Region, Country, Item Type, Sales Channel, Order Priority, Order Date, Order ID, Ship Date, Units Sold, Unit Price, Unit Cost, Total Revenue, Total Cost, and Total Profit. However, the two date columns are stored as factors holding month/day/year text rather than as Date values, so they need to be converted first. Order ID looks numeric but is an identifier, so it is read as a factor.
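As a minimal sketch of why the conversion matters, the snippet below parses the first three order and ship dates shown in the glimpse(tenksales) output and computes the shipping delay in days. The data frame name orders and the ship_days column are illustrative assumptions and are not part of the original analysis.

```r
# First three order/ship dates copied from the glimpse(tenksales) output.
# Stored as factors, as they are after the import step.
orders <- data.frame(
  date  = factor(c("1/27/2011", "12/28/2015", "1/13/2011")),
  sdate = factor(c("2/12/2011", "1/23/2016", "2/1/2011"))
)

# Convert the month/day/year text to Date values.
orders$order_date <- as.Date(as.character(orders$date),  format = "%m/%d/%Y")
orders$ship_date  <- as.Date(as.character(orders$sdate), format = "%m/%d/%Y")

# Date arithmetic only works once the columns are real dates.
# ship_days is a hypothetical derived column used for illustration.
orders$ship_days <- as.numeric(orders$ship_date - orders$order_date)
orders
```

With the text left as factors, subtracting the two columns would not give a meaningful number of days, which is why the conversion comes before any date-based analysis.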

Convert MDY Text to Dates

# Parse the month/day/year text into new Date columns (the originals are kept).
oneksales <- oneksales %>%
    mutate(order_date = as.Date(date, '%m/%d/%Y')) %>%
    mutate(ship_date = as.Date(sdate, '%m/%d/%Y'))

tenksales <- tenksales %>%
    mutate(order_date = as.Date(date, '%m/%d/%Y')) %>%
    mutate(ship_date = as.Date(sdate, '%m/%d/%Y'))

summary(oneksales)
##                                region              country   
##  Middle East and North Africa     :138   Cuba          : 11  
##  North America                    : 19   Malaysia      : 10  
##  Asia                             :136   Czech Republic: 10  
##  Sub-Saharan Africa               :262   Zimbabwe      : 10  
##  Europe                           :267   Bahrain       : 10  
##  Central America and the Caribbean: 99   Fiji          :  9  
##  Australia and Oceania            : 79   (Other)       :940  
##               item        channel    order          date             ID     
##  Beverages      :101   Offline:520   M:242   5/17/2012:  3   686800706:  1  
##  Vegetables     : 97   Online :480   C:262   8/3/2013 :  3   185941302:  1  
##  Office Supplies: 89                 H:228   3/17/2012:  3   246222341:  1  
##  Baby Food      : 87                 L:268   6/9/2017 :  3   161442649:  1  
##  Personal Care  : 87                         1/14/2013:  3   645713555:  1  
##  Snacks         : 82                         3/20/2011:  3   683458888:  1  
##  (Other)        :457                         (Other)  :982   (Other)  :994  
##        sdate          unit          price            ucost       
##  4/17/2015:  3   Min.   :  13   Min.   :  9.33   Min.   :  6.92  
##  6/28/2012:  3   1st Qu.:2420   1st Qu.: 81.73   1st Qu.: 56.67  
##  2/15/2012:  3   Median :5184   Median :154.06   Median : 97.44  
##  6/8/2011 :  3   Mean   :5054   Mean   :262.11   Mean   :184.97  
##  8/19/2013:  3   3rd Qu.:7537   3rd Qu.:421.89   3rd Qu.:263.33  
##  11/4/2011:  3   Max.   :9998   Max.   :668.27   Max.   :524.96  
##  (Other)  :982                                                   
##     revenue            tcost             profit            order_date        
##  Min.   :   2043   Min.   :   1417   Min.   :    532.6   Min.   :2010-01-01  
##  1st Qu.: 281192   1st Qu.: 164932   1st Qu.:  98376.1   1st Qu.:2011-11-14  
##  Median : 754939   Median : 464726   Median : 277226.0   Median :2013-09-24  
##  Mean   :1327322   Mean   : 936119   Mean   : 391202.6   Mean   :2013-09-19  
##  3rd Qu.:1733503   3rd Qu.:1141750   3rd Qu.: 548456.8   3rd Qu.:2015-07-03  
##  Max.   :6617210   Max.   :5204978   Max.   :1726181.4   Max.   :2017-07-26  
##                                                                              
##    ship_date         
##  Min.   :2010-01-15  
##  1st Qu.:2011-12-11  
##  Median :2013-10-12  
##  Mean   :2013-10-14  
##  3rd Qu.:2015-07-28  
##  Max.   :2017-09-12  
## 
summary(tenksales)
##                                region               country    
##  Sub-Saharan Africa               :2603   Lithuania     :  72  
##  Europe                           :2633   United Kingdom:  72  
##  Middle East and North Africa     :1264   Moldova       :  71  
##  Asia                             :1469   Seychelles    :  70  
##  Central America and the Caribbean:1019   Croatia       :  70  
##  Australia and Oceania            : 797   Montenegro    :  69  
##  North America                    : 215   (Other)       :9576  
##               item         channel     order            date     
##  Personal Care  : 888   Online :5061   L:2494   1/28/2012 :  13  
##  Household      : 875   Offline:4939   C:2555   3/3/2012  :  12  
##  Clothes        : 872                  H:2503   8/16/2014 :  11  
##  Baby Food      : 842                  M:2448   7/15/2012 :  11  
##  Office Supplies: 837                           10/28/2016:  11  
##  Vegetables     : 836                           7/28/2017 :  10  
##  (Other)        :4850                           (Other)   :9932  
##          ID              sdate           unit           price       
##  292494523:   1   9/30/2014 :  12   Min.   :    2   Min.   :  9.33  
##  361825549:   1   7/23/2015 :  11   1st Qu.: 2531   1st Qu.:109.28  
##  141515767:   1   2/21/2010 :  11   Median : 4962   Median :205.70  
##  500364005:   1   3/24/2016 :  11   Mean   : 5003   Mean   :268.14  
##  127481591:   1   10/28/2012:  11   3rd Qu.: 7472   3rd Qu.:437.20  
##  482292354:   1   7/24/2011 :  10   Max.   :10000   Max.   :668.27  
##  (Other)  :9994   (Other)   :9934                                   
##      ucost           revenue            tcost             profit         
##  Min.   :  6.92   Min.   :    168   Min.   :    125   Min.   :     43.4  
##  1st Qu.: 56.67   1st Qu.: 288551   1st Qu.: 164786   1st Qu.:  98329.1  
##  Median :117.11   Median : 800051   Median : 481606   Median : 289099.0  
##  Mean   :188.81   Mean   :1333355   Mean   : 938266   Mean   : 395089.3  
##  3rd Qu.:364.69   3rd Qu.:1819143   3rd Qu.:1183822   3rd Qu.: 566422.7  
##  Max.   :524.96   Max.   :6680027   Max.   :5241726   Max.   :1738178.4  
##                                                                          
##    order_date           ship_date         
##  Min.   :2010-01-01   Min.   :2010-01-05  
##  1st Qu.:2011-12-08   1st Qu.:2012-01-04  
##  Median :2013-11-02   Median :2013-11-26  
##  Mean   :2013-10-27   Mean   :2013-11-21  
##  3rd Qu.:2015-09-11   3rd Qu.:2015-10-08  
##  Max.   :2017-07-28   Max.   :2017-09-10  
## 

Visualization

hist(oneksales$profit, col = 'blue')
hist(tenksales$profit, col = 'blue')

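Most orders have a profit below $600,000: the third quartile of profit is about $548,000 in the 1000 Sales Records data and about $566,000 in the 10000 Sales Records data.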

ggplot(oneksales, aes(x = region, y = profit, color = channel)) +  
  geom_boxplot()+coord_flip()

ggplot(tenksales, aes(x = region, y = profit, color = channel)) +  
  geom_boxplot()+coord_flip()

Correlations

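# Keep only the numeric columns; avoid naming the object `cor`,
# which would shadow the cor() function.
num_vars <- oneksales %>%
  select(unit, price, ucost, revenue, tcost, profit)
corrplot(cor(num_vars), type = 'upper')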

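As expected, the numeric variables are correlated: revenue, total cost, and profit are all derived from the units sold, the unit price, and the unit cost.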
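The correlation has a simple arithmetic source. As a small check, the sketch below takes the first two rows shown in the glimpse(tenksales) output and confirms that revenue equals unit times price, total cost equals unit times unit cost, and profit equals revenue minus total cost. The data frame name first_rows is an illustrative assumption.

```r
# First two rows of the numeric columns, copied from glimpse(tenksales).
first_rows <- data.frame(
  unit    = c(4484, 1075),
  price   = c(651.21, 47.45),
  ucost   = c(524.96, 31.79),
  revenue = c(2920025.64, 51008.75),
  tcost   = c(2353920.64, 34174.25),
  profit  = c(566105.00, 16834.50)
)

# all.equal() compares with a small tolerance, which suits decimal values.
revenue_ok <- isTRUE(all.equal(first_rows$unit * first_rows$price, first_rows$revenue))
tcost_ok   <- isTRUE(all.equal(first_rows$unit * first_rows$ucost, first_rows$tcost))
profit_ok  <- isTRUE(all.equal(first_rows$revenue - first_rows$tcost, first_rows$profit))

c(revenue = revenue_ok, tcost = tcost_ok, profit = profit_ok)
```

Because three of the six numeric columns are exact functions of the other three, they add little independent information when used together as predictors.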

Data Structure

The box plots show some outliers in the data, and the histogram of profit shows a right-skewed distribution. According to the box plots, the 10000 Sales Records differ from the 1000 Sales Records in how profit is distributed between offline and online purchases within each region. A good sales strategy should be built around the preferred shopping channel and a well-chosen target region. The sales datasets contain records of product categories sold in various countries and regions. The key numeric fields (unit, price, ucost, revenue, tcost, profit), along with the order ID, contain no NA values.
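The outlier and skewness observations can be cross-checked against the summary() output without redrawing the plots. The sketch below copies the profit quartiles from that output and applies the usual 1.5 × IQR rule that box plots use for flagging points. It treats each dataset as a whole, whereas the plots above split by region and channel, so this is only a rough check. The object name profit_summary is an illustrative assumption.

```r
# Profit statistics copied from the summary() output of each dataset.
profit_summary <- data.frame(
  dataset = c("oneksales", "tenksales"),
  q1      = c(98376.1, 98329.1),
  median  = c(277226.0, 289099.0),
  mean    = c(391202.6, 395089.3),
  q3      = c(548456.8, 566422.7),
  max     = c(1726181.4, 1738178.4)
)

# Upper fence of the box plot rule: Q3 + 1.5 * IQR.
profit_summary$upper_fence <- with(profit_summary, q3 + 1.5 * (q3 - q1))

# A maximum above the fence means at least one high outlier exists.
profit_summary$has_high_outliers <- profit_summary$max > profit_summary$upper_fence

# A mean well above the median is a common sign of right skew.
profit_summary$mean_above_median <- profit_summary$mean > profit_summary$median

profit_summary[, c("dataset", "upper_fence", "has_high_outliers", "mean_above_median")]
```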

Algorithms/Short Essay

Assume the goal of the business is to increase profit. An analysis of the categorical variables can then help the business decide where to focus: which region or country, which item types, and which sales channel (online or offline) may increase profit. Two algorithms suited to this kind of classification problem are Naive Bayes and random forest.

According to https://discuss.analyticsvidhya.com/t/how-to-decide-when-to-use-naive-bayes-for-classification/5720, Naive Bayes performs well when there are multiple classes and in text classification. Its main advantage is simplicity: if the conditional independence assumption actually holds, a Naive Bayes classifier converges more quickly than discriminative models such as logistic regression, so it needs less training data. Even when the assumption does not hold, it still needs little training time.

One main difference between Naive Bayes and random forest is model size. A Naive Bayes model is small and stays roughly constant in size as the data grows. It has difficulty representing complex behavior, which also makes it less likely to overfit. A random forest model, on the other hand, is very large and can overfit more easily. When the data is dynamic and keeps changing, Naive Bayes can adapt quickly to new data, whereas a random forest has to be rebuilt every time something changes.

Also, according to https://cloudvane.net/big-data-2/machine-learning-101-classification-algorithms-random-forest-and-naive-bayes/, random forest models run efficiently on large datasets, because the computation can be split up and run in parallel. A random forest can handle thousands of input variables without variable deletion. It also computes proximities between pairs of cases, which can be used for clustering, locating outliers, or (by scaling) giving interesting views of the data.
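To make the "small and simple" point about Naive Bayes concrete, here is a from-scratch sketch in base R for categorical predictors. It is not the API of any package: the function naive_bayes_predict, the toy data frame, and the use of Laplace (add-one) smoothing are all assumptions made for illustration, and the toy rows are invented rather than taken from the sales files.

```r
# Hypothetical toy data (invented, not from the sales files).
toy <- data.frame(
  item    = c("Snacks", "Snacks", "Cereal", "Cereal", "Snacks", "Cereal"),
  order   = c("H", "L", "H", "L", "H", "H"),
  channel = c("Online", "Online", "Offline", "Offline", "Online", "Offline"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: posterior class probabilities for one new row.
naive_bayes_predict <- function(train, new_row, target) {
  classes    <- unique(train[[target]])
  predictors <- setdiff(names(train), target)
  scores <- sapply(classes, function(cl) {
    rows  <- train[train[[target]] == cl, , drop = FALSE]
    prior <- nrow(rows) / nrow(train)
    # Conditional probability of each predictor value given the class,
    # with add-one smoothing so unseen values do not give zero.
    likelihoods <- sapply(predictors, function(p) {
      n_levels <- length(unique(train[[p]]))
      (sum(rows[[p]] == new_row[[p]]) + 1) / (nrow(rows) + n_levels)
    })
    # Naive assumption: predictors are independent given the class.
    prior * prod(likelihoods)
  })
  names(scores) <- classes
  scores / sum(scores)  # normalise so the probabilities sum to 1
}

posterior <- naive_bayes_predict(toy, list(item = "Snacks", order = "H"), "channel")
posterior
```

The whole model is a handful of count tables, which is why its size barely grows with the data and why it can be updated by adding new counts instead of refitting.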

Machine Learning Algorithm Selection

For this assignment I chose the random forest, because there are many ways to set up a classification on these data. I build a random forest classification model that predicts the sales channel (Online or Offline) from region, item type, order priority, and the numeric sales variables. It is important for a business to find its focus market and to target sales at the right region and channel to maximize total profit. The data are split into 75% training and 25% testing observations, and the same rule is applied to both the 10000 Sales Records dataset and the 1000 Sales Records dataset.

Data Split and Modeling

Split for 1000 Sales records

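# Drop the second column (country): it has too many categories for randomForest.
tenksales <- tenksales[, -2]
oneksales <- oneksales[, -2]
# Fix the seed so the 75/25 split is reproducible.
set.seed(1234)
onekdata <- sort(sample(nrow(oneksales), nrow(oneksales) * .75))
onektrain <- oneksales[onekdata, ]
onektest <- oneksales[-onekdata, ]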

Random forest model for oneksales.

# na.action = na.omit is the supported way to drop incomplete rows;
# the data contain no NA values, so it does not change the fit.
onekrd <- randomForest(channel ~ region+item+order+unit+price+ucost+revenue+tcost+profit, data = onektrain, importance = TRUE, na.action = na.omit)
onekrd
## 
## Call:
##  randomForest(formula = channel ~ region + item + order + unit +      price + ucost + revenue + tcost + profit, data = onektrain,      importance = TRUE, na.action = na.omit) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 46.27%
## Confusion matrix:
##         Offline Online class.error
## Offline     244    156   0.3900000
## Online      191    159   0.5457143
varImp(onekrd)
##            Offline     Online
## region  -1.2343119 -1.2343119
## item    -0.5570033 -0.5570033
## order   -0.2296124 -0.2296124
## unit     0.2640131  0.2640131
## price    0.6657127  0.6657127
## ucost   -0.2670300 -0.2670300
## revenue  1.6884005  1.6884005
## tcost    1.3050953  1.3050953
## profit   1.0451576  1.0451576
varImpPlot(onekrd)

Accuracy of oneksales random forest model

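onek_pred <- predict(onekrd, newdata = onektest)
# Rows are predicted channels, columns are actual channels.
# The table is not named `matrix`, which would shadow the base function.
onek_cm <- table(onek_pred, onektest$channel)
onek_cm
##          
## onek_pred Offline Online
##   Offline      81     74
##   Online       39     56
# Accuracy = correct predictions (the diagonal) / all test rows.
sum(diag(onek_cm))/nrow(onektest)
## [1] 0.548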

Split for 10000 Sales records

# Same seed and 75/25 rule as for the 1000 Sales Records split.
set.seed(1234)
tenkdata <- sort(sample(nrow(tenksales), nrow(tenksales) * .75))
tenktrain <- tenksales[tenkdata, ]
tenktest <- tenksales[-tenkdata, ]

Random forest model for tenksales.

# na.action = na.omit is the supported way to drop incomplete rows;
# the data contain no NA values, so it does not change the fit.
tenkrd <- randomForest(channel ~ region+item+order+unit+price+ucost+revenue+tcost+profit, data = tenktrain, importance = TRUE, na.action = na.omit)
tenkrd
## 
## Call:
##  randomForest(formula = channel ~ region + item + order + unit +      price + ucost + revenue + tcost + profit, data = tenktrain,      importance = TRUE, na.action = na.omit) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 49.68%
## Confusion matrix:
##         Online Offline class.error
## Online    2000    1815   0.4757536
## Offline   1911    1774   0.5185889
varImp(tenkrd)
##             Online    Offline
## region  -1.8774071 -1.8774071
## item     1.3431021  1.3431021
## order    1.6212722  1.6212722
## unit     0.1691929  0.1691929
## price    0.6742841  0.6742841
## ucost    0.9066252  0.9066252
## revenue  0.9932861  0.9932861
## tcost    0.7416452  0.7416452
## profit   0.1425480  0.1425480
varImpPlot(tenkrd)

Accuracy of tenksales random forest model

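tenk_pred <- predict(tenkrd, newdata = tenktest)
# Rows are predicted channels, columns are actual channels.
tenk_cm <- table(tenk_pred, tenktest$channel)
tenk_cm
##          
## tenk_pred Online Offline
##   Online     616     644
##   Offline    630     610
# Accuracy = correct predictions (the diagonal) / all test rows.
accuracy <- sum(diag(tenk_cm))/nrow(tenktest)
accuracy
## [1] 0.4904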
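
To judge which result to trust, it helps to put both accuracies next to a naive baseline and a rough uncertainty interval. The sketch below rebuilds the two test-set confusion matrices from the printed output and computes accuracy, the accuracy of always predicting the most common actual channel in the test set, and an approximate 95% interval. The helper summarise_cm, the object names, and the normal approximation for the interval are assumptions made for illustration.

```r
# Confusion matrices copied from the printed test-set output
# (rows = predicted channel, columns = actual channel).
onek_counts <- matrix(c(81, 39, 74, 56), nrow = 2,
                      dimnames = list(predicted = c("Offline", "Online"),
                                      actual    = c("Offline", "Online")))
tenk_counts <- matrix(c(616, 630, 644, 610), nrow = 2,
                      dimnames = list(predicted = c("Online", "Offline"),
                                      actual    = c("Online", "Offline")))

# Hypothetical helper: accuracy, majority-class baseline, and a
# normal-approximation 95% interval for the accuracy.
summarise_cm <- function(cm) {
  n          <- sum(cm)
  acc        <- sum(diag(cm)) / n
  baseline   <- max(colSums(cm)) / n
  half_width <- qnorm(0.975) * sqrt(acc * (1 - acc) / n)
  c(n = n, accuracy = acc, baseline = baseline,
    ci_low = acc - half_width, ci_high = acc + half_width)
}

results <- rbind(oneksales = summarise_cm(onek_counts),
                 tenksales = summarise_cm(tenk_counts))
round(results, 4)
```

Under these assumptions both intervals contain 0.5, and the interval for the larger test set is much narrower, so the smaller file's higher accuracy is more likely to be sampling noise than a real advantage.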

Conclusion

In summary, the performance of the random forest model is not satisfactory. Test accuracy is about 55% on the 1000 Sales Records data and 49% on the 10000 Sales Records data, which is close to what guessing between two nearly balanced classes would give. I believe the accuracy could be improved with tuning. The Country variable could not be included because it has too many categories for randomForest, which accepts at most 53 categories per categorical predictor. If the data keeps accumulating, a Naive Bayes model may be the better choice, because a random forest can overfit and has to be rebuilt every time the data changes. I do not recommend the current model to the business, and the two datasets give no meaningfully different result with the same model.
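
One possible way around the Country limitation is to keep the most frequent categories and merge the rest into a single level before modeling. The sketch below is a base R illustration under that assumption. The helper lump_rare_levels, the label "Other", and the toy factor are hypothetical, and the approach was not tried in the analysis above, so its effect on accuracy is unknown.

```r
# Hypothetical helper: keep the (max_levels - 1) most frequent levels
# and merge everything else into one catch-all level.
lump_rare_levels <- function(x, max_levels = 53, other = "Other") {
  x <- as.factor(x)
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) <= max_levels) {
    return(x)  # already within the limit, nothing to do
  }
  keep   <- names(counts)[seq_len(max_levels - 1)]
  lumped <- ifelse(as.character(x) %in% keep, as.character(x), other)
  factor(lumped)
}

# Toy factor with 60 levels whose frequencies run from 60 down to 1.
toy_country <- factor(rep(paste0("country_", 1:60), times = 60:1))
lumped <- lump_rare_levels(toy_country)

nlevels(toy_country)  # number of levels before lumping
nlevels(lumped)       # number of levels after lumping
```

The trade-off is that rare countries lose their identity, so this only helps if the frequent categories carry most of the signal.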

