library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)   # already attached via tidyverse
library(lubridate) # already attached via tidyverse
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
##
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
## ✔ broom 1.0.5 ✔ rsample 1.2.0
## ✔ dials 1.2.0 ✔ tune 1.1.2
## ✔ infer 1.0.5 ✔ workflows 1.1.3
## ✔ modeldata 1.2.0 ✔ workflowsets 1.0.1
## ✔ parsnip 1.1.1 ✔ yardstick 1.2.0
## ✔ recipes 1.0.8
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ caret::lift() masks purrr::lift()
## ✖ yardstick::precision() masks caret::precision()
## ✖ yardstick::recall() masks caret::recall()
## ✖ yardstick::sensitivity() masks caret::sensitivity()
## ✖ yardstick::spec() masks readr::spec()
## ✖ yardstick::specificity() masks caret::specificity()
## ✖ recipes::step() masks stats::step()
## • Learn how to get started at https://www.tidymodels.org/start/
I selected my datasets from https://excelbianalytics.com/: a smaller dataset of 5,000 sales records and a larger one of 100,000 records. I uploaded both to my GitHub account and load them here with readr's read_csv() function.
small_ds <- read_csv("https://raw.githubusercontent.com/petferns/DATA622/main/5000%20Sales%20Records.csv")
## Rows: 5000 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): Region, Country, Item Type, Sales Channel, Order Priority, Order Da...
## dbl (7): Order ID, Units Sold, Unit Price, Unit Cost, Total Revenue, Total C...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
big_ds <- read_csv("https://raw.githubusercontent.com/petferns/DATA622/main/100000%20Sales%20Records.csv")
## Rows: 100000 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): Region, Country, Item Type, Sales Channel, Order Priority, Order Da...
## dbl (7): Order ID, Units Sold, Unit Price, Unit Cost, Total Revenue, Total C...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
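As the parsing messages suggest, the column types can also be pinned down at read time instead of being converted afterwards. A minimal sketch for the small dataset (the same pattern applies to the large one); this col_types specification is an alternative to the as.Date()/as.factor() conversions performed later:
# Parse dates and the channel factor directly on read; unspecified columns are still guessed
small_ds <- read_csv(
  "https://raw.githubusercontent.com/petferns/DATA622/main/5000%20Sales%20Records.csv",
  col_types = cols(
    `Order Date`    = col_date(format = "%m/%d/%Y"),
    `Ship Date`     = col_date(format = "%m/%d/%Y"),
    `Sales Channel` = col_factor()
  )
)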
The data contains sales order details for products such as baby food, personal care products, food items, and fruits, from countries across the continents.
head(small_ds)
## # A tibble: 6 × 14
## Region Country `Item Type` `Sales Channel` `Order Priority` `Order Date`
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Central Ame… Antigu… Baby Food Online M 12/20/2013
## 2 Central Ame… Panama Snacks Offline C 7/5/2010
## 3 Europe Czech … Beverages Offline C 9/12/2011
## 4 Asia North … Cereal Offline L 5/13/2010
## 5 Asia Sri La… Snacks Offline C 7/20/2015
## 6 Middle East… Morocco Personal C… Offline L 11/8/2010
## # ℹ 8 more variables: `Order ID` <dbl>, `Ship Date` <chr>, `Units Sold` <dbl>,
## # `Unit Price` <dbl>, `Unit Cost` <dbl>, `Total Revenue` <dbl>,
## # `Total Cost` <dbl>, `Total Profit` <dbl>
head(big_ds)
## # A tibble: 6 × 14
## Region Country `Item Type` `Sales Channel` `Order Priority` `Order Date`
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Middle East… Azerba… Snacks Online C 10/8/2014
## 2 Central Ame… Panama Cosmetics Offline L 2/22/2015
## 3 Sub-Saharan… Sao To… Fruits Offline M 12/9/2015
## 4 Sub-Saharan… Sao To… Personal C… Online M 9/17/2014
## 5 Central Ame… Belize Household Offline H 2/4/2010
## 6 Europe Denmark Clothes Online C 2/20/2013
## # ℹ 8 more variables: `Order ID` <dbl>, `Ship Date` <chr>, `Units Sold` <dbl>,
## # `Unit Price` <dbl>, `Unit Cost` <dbl>, `Total Revenue` <dbl>,
## # `Total Cost` <dbl>, `Total Profit` <dbl>
From the glimpse of the data we see that certain columns need type conversion: ‘Order Date’ and ‘Ship Date’ will be converted to Date, and ‘Sales Channel’ will be converted to a factor since it contains only Online or Offline. This is done for both datasets.
glimpse(small_ds)
## Rows: 5,000
## Columns: 14
## $ Region <chr> "Central America and the Caribbean", "Central America…
## $ Country <chr> "Antigua and Barbuda", "Panama", "Czech Republic", "N…
## $ `Item Type` <chr> "Baby Food", "Snacks", "Beverages", "Cereal", "Snacks…
## $ `Sales Channel` <chr> "Online", "Offline", "Offline", "Offline", "Offline",…
## $ `Order Priority` <chr> "M", "C", "C", "L", "C", "L", "H", "M", "M", "M", "C"…
## $ `Order Date` <chr> "12/20/2013", "7/5/2010", "9/12/2011", "5/13/2010", "…
## $ `Order ID` <dbl> 957081544, 301644504, 478051030, 892599952, 571902596…
## $ `Ship Date` <chr> "1/11/2014", "7/26/2010", "9/29/2011", "6/15/2010", "…
## $ `Units Sold` <dbl> 552, 2167, 4778, 9016, 7542, 48, 8258, 927, 8841, 981…
## $ `Unit Price` <dbl> 255.28, 152.58, 47.45, 205.70, 152.58, 81.73, 109.28,…
## $ `Unit Cost` <dbl> 159.42, 97.44, 31.79, 117.11, 97.44, 56.67, 35.84, 35…
## $ `Total Revenue` <dbl> 140914.56, 330640.86, 226716.10, 1854591.20, 1150758.…
## $ `Total Cost` <dbl> 87999.84, 211152.48, 151892.62, 1055863.76, 734892.48…
## $ `Total Profit` <dbl> 52914.72, 119488.38, 74823.48, 798727.44, 415865.88, …
glimpse(big_ds)
## Rows: 100,000
## Columns: 14
## $ Region <chr> "Middle East and North Africa", "Central America and …
## $ Country <chr> "Azerbaijan", "Panama", "Sao Tome and Principe", "Sao…
## $ `Item Type` <chr> "Snacks", "Cosmetics", "Fruits", "Personal Care", "Ho…
## $ `Sales Channel` <chr> "Online", "Offline", "Offline", "Online", "Offline", …
## $ `Order Priority` <chr> "C", "L", "M", "M", "H", "C", "M", "C", "H", "H", "C"…
## $ `Order Date` <chr> "10/8/2014", "2/22/2015", "12/9/2015", "9/17/2014", "…
## $ `Order ID` <dbl> 535113847, 874708545, 854349935, 892836844, 129280602…
## $ `Ship Date` <chr> "10/23/2014", "2/27/2015", "1/18/2016", "10/12/2014",…
## $ `Units Sold` <dbl> 934, 4551, 9986, 9118, 5858, 1149, 7964, 6307, 8217, …
## $ `Unit Price` <dbl> 152.58, 437.20, 9.33, 81.73, 668.27, 109.28, 437.20, …
## $ `Unit Cost` <dbl> 97.44, 263.33, 6.92, 56.67, 502.54, 35.84, 263.33, 6.…
## $ `Total Revenue` <dbl> 142509.72, 1989697.20, 93169.38, 745214.14, 3914725.6…
## $ `Total Cost` <dbl> 91008.96, 1198414.83, 69103.12, 516717.06, 2943879.32…
## $ `Total Profit` <dbl> 51500.76, 791282.37, 24066.26, 228497.08, 970846.34, …
# Parse the date columns and encode Sales Channel as a factor, for both datasets
small_ds[['Order Date']] <- as.Date(small_ds[['Order Date']], "%m/%d/%Y")
small_ds[['Ship Date']]  <- as.Date(small_ds[['Ship Date']], "%m/%d/%Y")
big_ds[['Order Date']]   <- as.Date(big_ds[['Order Date']], "%m/%d/%Y")
big_ds[['Ship Date']]    <- as.Date(big_ds[['Ship Date']], "%m/%d/%Y")
small_ds[['Sales Channel']] <- as.factor(small_ds[['Sales Channel']])
big_ds[['Sales Channel']]   <- as.factor(big_ds[['Sales Channel']])
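The same conversions can also be written once as a small helper and applied to each dataset; a compact dplyr sketch, equivalent to the assignments above (convert_types is an illustrative helper name):
convert_types <- function(df) {
  df %>%
    mutate(
      `Order Date`    = as.Date(`Order Date`, "%m/%d/%Y"),
      `Ship Date`     = as.Date(`Ship Date`, "%m/%d/%Y"),
      `Sales Channel` = as.factor(`Sales Channel`)
    )
}
# small_ds <- convert_types(small_ds)
# big_ds   <- convert_types(big_ds)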
We see from the summaries that there are no missing values and that the datasets contain orders from 2010 to 2017.
summary(small_ds)
## Region Country Item Type Sales Channel
## Length:5000 Length:5000 Length:5000 Offline:2504
## Class :character Class :character Class :character Online :2496
## Mode :character Mode :character Mode :character
##
##
##
## Order Priority Order Date Order ID
## Length:5000 Min. :2010-01-01 Min. :100090873
## Class :character 1st Qu.:2011-12-08 1st Qu.:320104217
## Mode :character Median :2013-10-23 Median :552314960
## Mean :2013-10-19 Mean :548644737
## 3rd Qu.:2015-09-08 3rd Qu.:768770944
## Max. :2017-07-28 Max. :999879729
## Ship Date Units Sold Unit Price Unit Cost
## Min. :2010-01-06 Min. : 2 Min. : 9.33 Min. : 6.92
## 1st Qu.:2012-01-06 1st Qu.:2453 1st Qu.: 81.73 1st Qu.: 35.84
## Median :2013-11-14 Median :5123 Median :154.06 Median : 97.44
## Mean :2013-11-13 Mean :5031 Mean :265.75 Mean :187.49
## 3rd Qu.:2015-10-03 3rd Qu.:7576 3rd Qu.:437.20 3rd Qu.:263.33
## Max. :2017-08-31 Max. :9999 Max. :668.27 Max. :524.96
## Total Revenue Total Cost Total Profit
## Min. : 65 Min. : 48 Min. : 16.9
## 1st Qu.: 257417 1st Qu.: 154748 1st Qu.: 85339.3
## Median : 779409 Median : 468181 Median : 279095.2
## Mean :1325738 Mean : 933093 Mean : 392644.6
## 3rd Qu.:1839975 3rd Qu.:1189578 3rd Qu.: 565106.4
## Max. :6672676 Max. :5248025 Max. :1726007.5
summary(big_ds)
## Region Country Item Type Sales Channel
## Length:100000 Length:100000 Length:100000 Offline:49946
## Class :character Class :character Class :character Online :50054
## Mode :character Mode :character Mode :character
##
##
##
## Order Priority Order Date Order ID
## Length:100000 Min. :2010-01-01 Min. :100008904
## Class :character 1st Qu.:2011-11-25 1st Qu.:326046383
## Mode :character Median :2013-10-15 Median :547718512
## Mean :2013-10-15 Mean :550395554
## 3rd Qu.:2015-09-07 3rd Qu.:775078534
## Max. :2017-07-28 Max. :999996459
## Ship Date Units Sold Unit Price Unit Cost
## Min. :2010-01-02 Min. : 1 Min. : 9.33 Min. : 6.92
## 1st Qu.:2011-12-21 1st Qu.: 2505 1st Qu.:109.28 1st Qu.: 56.67
## Median :2013-11-09 Median : 5007 Median :205.70 Median :117.11
## Mean :2013-11-09 Mean : 5001 Mean :266.70 Mean :188.02
## 3rd Qu.:2015-10-02 3rd Qu.: 7495 3rd Qu.:437.20 3rd Qu.:364.69
## Max. :2017-09-16 Max. :10000 Max. :668.27 Max. :524.96
## Total Revenue Total Cost Total Profit
## Min. : 19 Min. : 14 Min. : 4.8
## 1st Qu.: 279753 1st Qu.: 162928 1st Qu.: 95900.0
## Median : 789892 Median : 467937 Median : 283657.5
## Mean :1336067 Mean : 941975 Mean : 394091.2
## 3rd Qu.:1836490 3rd Qu.:1209475 3rd Qu.: 568384.1
## Max. :6682700 Max. :5249075 Max. :1738700.0
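A direct count backs up the no-missing-values observation; a quick sketch:
sum(is.na(small_ds))  # expected to be 0
sum(is.na(big_ds))    # expected to be 0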
Visualizing the data by plotting total profit over the years: in both datasets ‘Total Profit’ is roughly constant across years, with a sharp drop in 2017. The drop is an artifact rather than a real decline, since the summaries above show the orders run only through mid-2017, so that year is incomplete.
small_ds_plt <- small_ds %>%
  mutate(Year = year(`Order Date`)) %>%
  group_by(Year) %>%
  summarize(ProfitPerYear = sum(`Total Profit`))
ggplot(small_ds_plt, aes(x = Year, y = ProfitPerYear)) +
  geom_col(fill = "blue") +
  labs(title = "Yearly Profit", x = "Year", y = "Total Profit")
big_ds_plt <- big_ds %>%
  mutate(Year = year(`Order Date`)) %>%
  group_by(Year) %>%
  summarize(ProfitPerYear = sum(`Total Profit`))
ggplot(big_ds_plt, aes(x = Year, y = ProfitPerYear)) +
  geom_col(fill = "blue") +
  labs(title = "Yearly Profit", x = "Year", y = "Total Profit")
Given these datasets, I chose a decision tree and logistic regression to analyze the data, with Sales Channel as the dependent variable and Region, Item Type, Order Priority, and Total Profit as the independent variables.
I split each dataset in an 80:20 ratio, with 80% for the training set and 20% for the testing set.
set.seed(2222)
training.samples <- small_ds$`Sales Channel` %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- small_ds[training.samples, ]
test.data  <- small_ds[-training.samples, ]
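createDataPartition() stratifies on the outcome, so both partitions should preserve the roughly 50/50 channel balance seen in the summary; a quick check:
prop.table(table(train.data$`Sales Channel`))
prop.table(table(test.data$`Sales Channel`))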
tree_spec <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("classification")
Dtree_model <- tree_spec %>%
  fit(formula = `Sales Channel` ~ `Region` + `Item Type` + `Order Priority` + `Total Profit`,
      data = train.data)
Dtree_model
## parsnip model object
##
## n= 4001
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 4001 1997 Offline (0.5008748 0.4991252)
## 2) Item Type=Baby Food,Beverages,Cosmetics,Household,Office Supplies,Personal Care,Vegetables 2401 1151 Offline (0.5206164 0.4793836) *
## 3) Item Type=Cereal,Clothes,Fruits,Meat,Snacks 1600 754 Online (0.4712500 0.5287500) *
importance <- Dtree_model$fit$variable.importance
importance
## Item Type Total Profit
## 4.6799011 0.8277575
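For a visual read on the fitted tree, the underlying rpart object can be plotted. A sketch, assuming the rpart.plot package is installed (it is not loaded above):
library(rpart.plot)
# Plot the rpart fit stored inside the parsnip model object
rpart.plot(Dtree_model$fit, roundint = FALSE)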
predictions <- predict(Dtree_model, new_data = test.data)
predictions_combined <- predictions %>%
  mutate(true_classification = test.data$`Sales Channel`)
head(predictions_combined)
## # A tibble: 6 × 2
## .pred_class true_classification
## <fct> <fct>
## 1 Offline Online
## 2 Offline Offline
## 3 Online Offline
## 4 Online Offline
## 5 Offline Online
## 6 Offline Offline
confusion_matrix <- conf_mat(data = predictions_combined,
                             estimate = .pred_class,
                             truth = true_classification)
confusion_matrix
## Truth
## Prediction Offline Online
## Offline 288 296
## Online 212 203
Calculating accuracy from the confusion-matrix counts
correct_predictions <- 288 + 203
all_predictions <- 288 + 203 + 212 + 296
accuracy <- correct_predictions / all_predictions
accuracy
## [1] 0.4914915
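Rather than hard-coding the cell counts, the same figure can be computed straight from the predictions; a sketch using yardstick (attached via tidymodels):
accuracy(predictions_combined, truth = true_classification, estimate = .pred_class)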
Applying the same approach to the bigger dataset
set.seed(3333)
Btraining.samples <- big_ds$`Sales Channel` %>%
  createDataPartition(p = 0.8, list = FALSE)
Btrain.data <- big_ds[Btraining.samples, ]
Btest.data  <- big_ds[-Btraining.samples, ]
Btree_spec <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("classification")
Btree_model <- Btree_spec %>%
  fit(formula = `Sales Channel` ~ `Region` + `Item Type` + `Order Priority` + `Total Profit`,
      data = Btrain.data)
Btree_model
## parsnip model object
##
## n= 4001
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 4001 1981 Online (0.4951262 0.5048738) *
predictions <- predict(Btree_model, new_data = Btest.data)
predictions_combined <- predictions %>%
  mutate(true_classification = Btest.data$`Sales Channel`)
head(predictions_combined)
## # A tibble: 6 × 2
## .pred_class true_classification
## <fct> <fct>
## 1 Online Online
## 2 Online Offline
## 3 Offline Offline
## 4 Online Online
## 5 Offline Offline
## 6 Offline Offline
confusion_matrix <- conf_mat(data = predictions_combined,
                             estimate = .pred_class,
                             truth = true_classification)
confusion_matrix
## Truth
## Prediction Offline Online
## Offline 27950 28111
## Online 20015 19923
# Calculate the number of correctly predicted classes
correct_predictions <- 27950 + 19923
# Calculate the number of all predicted classes
all_predictions <- 27950 + 19923 + 20015 + 28111
# Calculate and print the accuracy
accuracy <- correct_predictions / all_predictions
accuracy
## [1] 0.4986823
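yardstick's conf_mat object also has a summary() method that reports accuracy alongside sensitivity, specificity, and related metrics in one call; a sketch:
summary(confusion_matrix)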
Applying logistic regression. With ‘Sales Channel’ as a two-level factor (Offline, Online), glm() with family = binomial models the probability of the second level, Online.
glm.df.small <- glm(`Sales Channel` ~ `Region` + `Item Type` + `Order Priority` + `Total Profit`,
                    data = small_ds, family = binomial)
summary(glm.df.small)
##
## Call:
## glm(formula = `Sales Channel` ~ Region + `Item Type` + `Order Priority` +
## `Total Profit`, family = binomial, data = small_ds)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.342 -1.170 -1.042 1.179 1.343
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.359e-02 1.371e-01 0.099 0.9210
## RegionAustralia and Oceania -2.468e-01 1.237e-01 -1.995 0.0460
## RegionCentral America and the Caribbean -1.468e-01 1.146e-01 -1.281 0.2003
## RegionEurope -1.171e-01 9.294e-02 -1.260 0.2076
## RegionMiddle East and North Africa -1.572e-01 1.105e-01 -1.423 0.1547
## RegionNorth America -3.334e-01 2.098e-01 -1.589 0.1120
## RegionSub-Saharan Africa -7.752e-02 9.355e-02 -0.829 0.4073
## `Item Type`Beverages 1.194e-01 1.409e-01 0.848 0.3966
## `Item Type`Cereal 3.133e-01 1.399e-01 2.239 0.0251
## `Item Type`Clothes 2.190e-01 1.400e-01 1.565 0.1177
## `Item Type`Cosmetics 5.210e-02 1.419e-01 0.367 0.7135
## `Item Type`Fruits 5.866e-02 1.430e-01 0.410 0.6817
## `Item Type`Household 1.186e-01 1.403e-01 0.846 0.3978
## `Item Type`Meat 1.743e-01 1.400e-01 1.244 0.2133
## `Item Type`Office Supplies 7.791e-02 1.372e-01 0.568 0.5703
## `Item Type`Personal Care -1.124e-02 1.418e-01 -0.079 0.9368
## `Item Type`Snacks 1.617e-01 1.402e-01 1.153 0.2488
## `Item Type`Vegetables 7.704e-03 1.384e-01 0.056 0.9556
## `Order Priority`H 5.721e-02 8.110e-02 0.705 0.4805
## `Order Priority`L 1.624e-02 8.194e-02 0.198 0.8429
## `Order Priority`M -9.006e-03 8.054e-02 -0.112 0.9110
## `Total Profit` -6.091e-08 1.038e-07 -0.587 0.5575
##
## (Intercept)
## RegionAustralia and Oceania *
## RegionCentral America and the Caribbean
## RegionEurope
## RegionMiddle East and North Africa
## RegionNorth America
## RegionSub-Saharan Africa
## `Item Type`Beverages
## `Item Type`Cereal *
## `Item Type`Clothes
## `Item Type`Cosmetics
## `Item Type`Fruits
## `Item Type`Household
## `Item Type`Meat
## `Item Type`Office Supplies
## `Item Type`Personal Care
## `Item Type`Snacks
## `Item Type`Vegetables
## `Order Priority`H
## `Order Priority`L
## `Order Priority`M
## `Total Profit`
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6931.5 on 4999 degrees of freedom
## Residual deviance: 6913.4 on 4978 degrees of freedom
## AIC: 6957.4
##
## Number of Fisher Scoring iterations: 3
c1 <- confusionMatrix(as.factor(as.integer(fitted(glm.df.small) > .5)),
                      as.factor(glm.df.small$y), positive = "1")
c1$overall[1]
## Accuracy
## 0.5184
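The 51.8% figure is in-sample, since this model was fit on all of small_ds. A hedged sketch of a held-out check, refitting on the earlier 80:20 split (glm.train, test.probs, and test.pred are illustrative names):
glm.train <- glm(`Sales Channel` ~ `Region` + `Item Type` + `Order Priority` + `Total Profit`,
                 data = train.data, family = binomial)
# Predicted probability of the second factor level, Online
test.probs <- predict(glm.train, newdata = test.data, type = "response")
test.pred  <- factor(ifelse(test.probs > 0.5, "Online", "Offline"),
                     levels = levels(test.data$`Sales Channel`))
mean(test.pred == test.data$`Sales Channel`)  # held-out accuracy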
glm.df.big <- glm(`Sales Channel` ~ `Region` + `Item Type` + `Order Priority` + `Total Profit`,
                  data = big_ds, family = binomial)
summary(glm.df.big)
##
## Call:
## glm(formula = `Sales Channel` ~ Region + `Item Type` + `Order Priority` +
## `Total Profit`, family = binomial, data = big_ds)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.204 -1.180 1.151 1.175 1.209
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.476e-03 3.089e-02 -0.048 0.9619
## RegionAustralia and Oceania 1.274e-02 2.772e-02 0.460 0.6457
## RegionCentral America and the Caribbean -5.387e-02 2.546e-02 -2.116 0.0343
## RegionEurope -7.645e-03 2.073e-02 -0.369 0.7122
## RegionMiddle East and North Africa -2.884e-02 2.435e-02 -1.184 0.2363
## RegionNorth America -4.650e-02 4.638e-02 -1.002 0.3161
## RegionSub-Saharan Africa -6.576e-03 2.071e-02 -0.318 0.7508
## `Item Type`Beverages 1.691e-02 3.234e-02 0.523 0.6009
## `Item Type`Cereal 3.208e-02 3.085e-02 1.040 0.2983
## `Item Type`Clothes -2.413e-03 3.105e-02 -0.078 0.9380
## `Item Type`Cosmetics 2.747e-03 3.220e-02 0.085 0.9320
## `Item Type`Fruits 2.179e-02 3.281e-02 0.664 0.5066
## `Item Type`Household 2.774e-02 3.203e-02 0.866 0.3864
## `Item Type`Meat -1.547e-02 3.125e-02 -0.495 0.6205
## `Item Type`Office Supplies 1.384e-02 3.104e-02 0.446 0.6557
## `Item Type`Personal Care 3.577e-02 3.196e-02 1.119 0.2630
## `Item Type`Snacks 5.693e-03 3.129e-02 0.182 0.8556
## `Item Type`Vegetables 2.234e-02 3.120e-02 0.716 0.4740
## `Order Priority`H 8.953e-04 1.791e-02 0.050 0.9601
## `Order Priority`L 1.467e-02 1.790e-02 0.820 0.4125
## `Order Priority`M -4.338e-03 1.788e-02 -0.243 0.8083
## `Total Profit` 1.191e-09 2.315e-08 0.051 0.9590
##
## (Intercept)
## RegionAustralia and Oceania
## RegionCentral America and the Caribbean *
## RegionEurope
## RegionMiddle East and North Africa
## RegionNorth America
## RegionSub-Saharan Africa
## `Item Type`Beverages
## `Item Type`Cereal
## `Item Type`Clothes
## `Item Type`Cosmetics
## `Item Type`Fruits
## `Item Type`Household
## `Item Type`Meat
## `Item Type`Office Supplies
## `Item Type`Personal Care
## `Item Type`Snacks
## `Item Type`Vegetables
## `Order Priority`H
## `Order Priority`L
## `Order Priority`M
## `Total Profit`
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 138629 on 99999 degrees of freedom
## Residual deviance: 138614 on 99978 degrees of freedom
## AIC: 138658
##
## Number of Fisher Scoring iterations: 3
c2 <- confusionMatrix(as.factor(as.integer(fitted(glm.df.big) > .5)),
                      as.factor(glm.df.big$y), positive = "1")
c2$overall[1]
## Accuracy
## 0.50405
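Since pROC is attached, AUC offers a threshold-free view of the same models; a minimal sketch:
roc.small <- roc(response = glm.df.small$y, predictor = fitted(glm.df.small))
auc(roc.small)
roc.big <- roc(response = glm.df.big$y, predictor = fitted(glm.df.big))
auc(roc.big)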
We see that with the decision tree the smaller dataset has an accuracy of about 49% and the bigger dataset a slightly higher one of nearly 50%: given Region, Item Type, Order Priority, and Total Profit, the tree predicts the Sales Channel correctly only about half the time. Logistic regression does marginally better, with about 52% accuracy on the smaller dataset and about 50% on the bigger one. On a roughly balanced outcome, both models barely beat random guessing, which suggests these predictors carry little information about whether an order is Online or Offline; the root-only tree fitted on the larger dataset and the largely non-significant regression coefficients point the same way.