library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(lubridate)
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
##
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
## ✔ broom 1.0.5 ✔ rsample 1.2.0
## ✔ dials 1.2.0 ✔ tune 1.1.2
## ✔ infer 1.0.5 ✔ workflows 1.1.3
## ✔ modeldata 1.2.0 ✔ workflowsets 1.0.1
## ✔ parsnip 1.1.1 ✔ yardstick 1.2.0
## ✔ recipes 1.0.8
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ caret::lift() masks purrr::lift()
## ✖ yardstick::precision() masks caret::precision()
## ✖ yardstick::recall() masks caret::recall()
## ✖ yardstick::sensitivity() masks caret::sensitivity()
## ✖ yardstick::spec() masks readr::spec()
## ✖ yardstick::specificity() masks caret::specificity()
## ✖ recipes::step() masks stats::step()
## • Use suppressPackageStartupMessages() to eliminate package startup messages
library(rpart)
##
## Attaching package: 'rpart'
##
## The following object is masked from 'package:dials':
##
## prune
library(rpart.plot)
library(performanceEstimation)
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:dplyr':
##
## combine
##
## The following object is masked from 'package:ggplot2':
##
## margin
library(e1071)
##
## Attaching package: 'e1071'
##
## The following object is masked from 'package:tune':
##
## tune
##
## The following object is masked from 'package:rsample':
##
## permutations
##
## The following object is masked from 'package:parsnip':
##
## tune
library(kableExtra)
##
## Attaching package: 'kableExtra'
##
## The following object is masked from 'package:dplyr':
##
## group_rows
For Assignment 3 I will be using one of the datasets I used in Assignment 2.
small_ds <- read_csv("https://raw.githubusercontent.com/petferns/DATA622/main/5000%20Sales%20Records.csv")
## Rows: 5000 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): Region, Country, Item Type, Sales Channel, Order Priority, Order Da...
## dbl (7): Order ID, Units Sold, Unit Price, Unit Cost, Total Revenue, Total C...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The data contains sales order details for products such as baby food, personal care products, food items, and fruits, from across the continents.
head(small_ds)
## # A tibble: 6 × 14
## Region Country `Item Type` `Sales Channel` `Order Priority` `Order Date`
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Central Ame… Antigu… Baby Food Online M 12/20/2013
## 2 Central Ame… Panama Snacks Offline C 7/5/2010
## 3 Europe Czech … Beverages Offline C 9/12/2011
## 4 Asia North … Cereal Offline L 5/13/2010
## 5 Asia Sri La… Snacks Offline C 7/20/2015
## 6 Middle East… Morocco Personal C… Offline L 11/8/2010
## # ℹ 8 more variables: `Order ID` <dbl>, `Ship Date` <chr>, `Units Sold` <dbl>,
## # `Unit Price` <dbl>, `Unit Cost` <dbl>, `Total Revenue` <dbl>,
## # `Total Cost` <dbl>, `Total Profit` <dbl>
The summary reports no missing values in the numeric columns (no `NA's` row appears for any of them), and the dataset contains order details from 2010 to 2017.
summary(small_ds)
## Region Country Item Type Sales Channel
## Length:5000 Length:5000 Length:5000 Length:5000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Order Priority Order Date Order ID Ship Date
## Length:5000 Length:5000 Min. :100090873 Length:5000
## Class :character Class :character 1st Qu.:320104217 Class :character
## Mode :character Mode :character Median :552314960 Mode :character
## Mean :548644737
## 3rd Qu.:768770944
## Max. :999879729
## Units Sold Unit Price Unit Cost Total Revenue
## Min. : 2 Min. : 9.33 Min. : 6.92 Min. : 65
## 1st Qu.:2453 1st Qu.: 81.73 1st Qu.: 35.84 1st Qu.: 257417
## Median :5123 Median :154.06 Median : 97.44 Median : 779409
## Mean :5031 Mean :265.75 Mean :187.49 Mean :1325738
## 3rd Qu.:7576 3rd Qu.:437.20 3rd Qu.:263.33 3rd Qu.:1839975
## Max. :9999 Max. :668.27 Max. :524.96 Max. :6672676
## Total Cost Total Profit
## Min. : 48 Min. : 16.9
## 1st Qu.: 154748 1st Qu.: 85339.3
## Median : 468181 Median : 279095.2
## Mean : 933093 Mean : 392644.6
## 3rd Qu.:1189578 3rd Qu.: 565106.4
## Max. :5248025 Max. :1726007.5
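Because `summary()` only prints Length, Class, and Mode for character columns, it cannot confirm the absence of missing values there, and it cannot show a date range while the dates are still strings. A minimal sketch of an explicit check follows. The `toy` data frame and its values are hypothetical stand-ins for `small_ds`.

```r
# Toy stand-in for small_ds (hypothetical values, not the real data).
toy <- data.frame(
  Region       = c("Asia", "Europe", NA),
  `Order Date` = c("12/20/2013", "7/5/2010", "9/12/2011"),
  `Units Sold` = c(552, 2167, 4778),
  check.names  = FALSE,
  stringsAsFactors = FALSE
)

# Count missing values per column, regardless of column type.
na_counts <- colSums(is.na(toy))
print(na_counts)

# Parse the date strings before taking the range; a character column
# would otherwise be compared alphabetically.
order_dates <- as.Date(toy[["Order Date"]], format = "%m/%d/%Y")
date_range <- range(order_dates)
print(date_range)
```

The same two calls, applied to `small_ds`, would back up both claims directly.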
The glimpse of the data shows that some columns need conversion:
- ‘Order Date’ and ‘Ship Date’ will be converted to date type.
- ‘Total Profit’ will be converted to a numeric value.
- ‘Sales Channel’ will be converted to a factor, as it contains only ‘Online’ or ‘Offline’.
glimpse(small_ds)
## Rows: 5,000
## Columns: 14
## $ Region <chr> "Central America and the Caribbean", "Central America…
## $ Country <chr> "Antigua and Barbuda", "Panama", "Czech Republic", "N…
## $ `Item Type` <chr> "Baby Food", "Snacks", "Beverages", "Cereal", "Snacks…
## $ `Sales Channel` <chr> "Online", "Offline", "Offline", "Offline", "Offline",…
## $ `Order Priority` <chr> "M", "C", "C", "L", "C", "L", "H", "M", "M", "M", "C"…
## $ `Order Date` <chr> "12/20/2013", "7/5/2010", "9/12/2011", "5/13/2010", "…
## $ `Order ID` <dbl> 957081544, 301644504, 478051030, 892599952, 571902596…
## $ `Ship Date` <chr> "1/11/2014", "7/26/2010", "9/29/2011", "6/15/2010", "…
## $ `Units Sold` <dbl> 552, 2167, 4778, 9016, 7542, 48, 8258, 927, 8841, 981…
## $ `Unit Price` <dbl> 255.28, 152.58, 47.45, 205.70, 152.58, 81.73, 109.28,…
## $ `Unit Cost` <dbl> 159.42, 97.44, 31.79, 117.11, 97.44, 56.67, 35.84, 35…
## $ `Total Revenue` <dbl> 140914.56, 330640.86, 226716.10, 1854591.20, 1150758.…
## $ `Total Cost` <dbl> 87999.84, 211152.48, 151892.62, 1055863.76, 734892.48…
## $ `Total Profit` <dbl> 52914.72, 119488.38, 74823.48, 798727.44, 415865.88, …
small_ds[['Order Date']] <- as.Date(small_ds[['Order Date']], "%m/%d/%Y")
small_ds[['Ship Date']] <- as.Date(small_ds[['Ship Date']], "%m/%d/%Y")
small_ds[['Sales Channel']] <- as.factor(small_ds[['Sales Channel']])
small_ds[['Total Profit']] <- as.numeric(small_ds[['Total Profit']])
We visualize the data by plotting total profit per year. The plot below shows that ‘Total Profit’ stayed roughly constant across years, with a sharp drop in 2017. The drop could simply mean that 2017 is not a full year in the data, so it is worth checking before reading it as a real decline.
small_ds_plt <- small_ds %>%
  mutate(Year = year(`Order Date`)) %>%
  group_by(Year) %>%
  summarize(ProfitPerYear = sum(`Total Profit`))
ggplot(small_ds_plt, aes(x = Year, y = ProfitPerYear)) +
  geom_bar(stat = "identity", fill = "blue") +
  labs(title = "Yearly Profit", x = "Year", y = "Total Profit")
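One way to tell a real decline from a partially covered year is to look at how many months each year contains and when its last order falls. The sketch below uses base R on hypothetical dates. The values in `order_date` are illustrative and are not taken from the dataset.

```r
# Hypothetical order dates; the last year is only partly covered.
order_date <- as.Date(c("2015-02-10", "2015-11-30",
                        "2016-01-15", "2016-12-01",
                        "2017-03-05", "2017-07-28"))

year_of_order <- as.integer(format(order_date, "%Y"))

# Number of distinct calendar months with at least one order, per year.
months_covered <- tapply(format(order_date, "%m"), year_of_order,
                         function(m) length(unique(m)))

# Latest order date seen in each year.
last_order <- sapply(split(order_date, year_of_order),
                     function(d) format(max(d)))

print(months_covered)
print(last_order)
```

If the final year ends well before December, its yearly total is not comparable with the totals of the full years.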
We partition the dataset into training and testing sets in an 80:20 proportion.
With my models I would like to predict Sales Channel from Region, Item Type, Order Priority and Total Profit. Note that the SVM formulas in this assignment use only the first three.
set.seed(1234)
Btraining.samples <- small_ds$`Sales Channel` %>%
  createDataPartition(p = 0.8, list = FALSE)
Btrain.data <- small_ds[Btraining.samples,]
Btest.data <- small_ds[-Btraining.samples,]
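Splitting on the outcome column is meant to keep the class proportions similar in the training and test sets. The sketch below illustrates that idea in base R on a hypothetical outcome vector. It is a simplified stratified split, not caret's exact algorithm, and `channel` is an invented stand-in for `Sales Channel`.

```r
set.seed(1234)

# Hypothetical two-class outcome standing in for `Sales Channel`.
channel <- factor(rep(c("Online", "Offline"), times = c(60, 40)))

# Sample 80% of the row indices within each class (a base-R sketch of a
# stratified split, not caret's exact algorithm).
train_idx <- unlist(lapply(split(seq_along(channel), channel), function(idx) {
  idx[sample.int(length(idx), size = round(0.8 * length(idx)))]
}))

# Class proportions in each partition.
train_prop <- prop.table(table(channel[train_idx]))
test_prop  <- prop.table(table(channel[-train_idx]))
print(rbind(train = train_prop, test = test_prop))
```

Running the same `prop.table(table(...))` check on `Btrain.data` and `Btest.data` would show whether the two partitions have matching Online/Offline shares.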
The primary objective of SVM is to find a hyperplane that best separates data points into different classes, maximizing the margin between the classes. The margin is the distance between the hyperplane and the nearest data point from each class.
Hyperplane: In a two-dimensional space, a hyperplane is a line; in three dimensions it is a plane, and in higher dimensions it is simply called a hyperplane. SVM seeks the hyperplane that best separates the classes.
Margin: The margin is the distance between the hyperplane and the nearest data point from either class. SVM aims to maximize this margin.
Support Vectors: Support vectors are the data points that lie closest to the hyperplane. They define the optimal hyperplane, because the margin is maximized with respect to these points. Deleting a support vector will change the position of the hyperplane, whereas deleting other points will not.
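The three definitions above can be made concrete with a small calculation. The sketch below takes a hand-picked line in two dimensions and four toy points, computes each point's distance to the line, and reads off the margin and the support vectors. The values of `w`, `b`, `X`, and `y` are all hypothetical, and no SVM is fitted here.

```r
# Hypothetical separating line in 2-D: w . x + b = 0
w <- c(1, 1)
b <- -3

# Four toy points, one per row, with class labels -1 / +1.
X <- rbind(c(0, 1), c(1, 1), c(2, 2), c(3, 3))
y <- c(-1, -1, 1, 1)

# Signed distance of each point to the hyperplane.
signed_dist <- as.vector(X %*% w + b) / sqrt(sum(w^2))

# A point is on the correct side when the label and the distance share a sign.
correct_side <- all(y * signed_dist > 0)

# The margin is the smallest absolute distance; the points that attain it
# are the support vectors for this hand-picked hyperplane.
margin <- min(abs(signed_dist))
support_idx <- which(abs(abs(signed_dist) - margin) < 1e-12)

print(signed_dist)
print(margin)
print(support_idx)
```

Only the second and third points touch the margin here, so moving or deleting the first or fourth point would leave this line unchanged.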
SVM algorithms rely on a family of mathematical functions known as kernels. A kernel takes a pair of data points as input and returns a similarity score, which lets the SVM behave as if the data had been transformed into a space where the classes are easier to separate. Common kernel functions are linear, polynomial, radial, and sigmoid.
In short, the kernel determines the kind of decision boundary the SVM can learn. We will fit one model per kernel function and compare them by accuracy on the test set to see which works best for our dataset.
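To make the kernel functions less abstract, the sketch below writes out the four kernels offered by `e1071::svm()` as plain R functions, following the formulas in that function's documentation. The parameter values (`gamma = 1`, `coef0 = 0`, `degree = 3`) and the vectors `u` and `v` are illustrative choices, not the settings used by the fitted models.

```r
# Kernel formulas as given in the e1071::svm() documentation.
# gamma, coef0 and degree values here are illustrative choices.
k_linear <- function(u, v) sum(u * v)
k_polynomial <- function(u, v, gamma = 1, coef0 = 0, degree = 3) {
  (gamma * sum(u * v) + coef0)^degree
}
k_radial <- function(u, v, gamma = 1) exp(-gamma * sum((u - v)^2))
k_sigmoid <- function(u, v, gamma = 1, coef0 = 0) {
  tanh(gamma * sum(u * v) + coef0)
}

# Two hypothetical feature vectors.
u <- c(1, 2)
v <- c(2, 0)

kernel_values <- c(
  linear     = k_linear(u, v),
  polynomial = k_polynomial(u, v),
  radial     = k_radial(u, v),
  sigmoid    = k_sigmoid(u, v)
)
print(kernel_values)
```

Each function scores the same pair of points differently, which is why the choice of kernel changes the decision boundary the SVM ends up with.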
linear <- svm(formula = `Sales Channel` ~ `Region` + `Item Type` + `Order Priority`,
              data = Btrain.data,
              type = 'C-classification',
              kernel = 'linear')
predictions <- predict(linear, newdata = Btest.data)
confusion_matrix <- table(Actual_Label = Btest.data$`Sales Channel`, Predicted_Label = predictions)
accuracy_lin <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
accuracy_lin
## [1] 0.4984985
radial <- svm(formula = `Sales Channel` ~ `Region` + `Item Type` + `Order Priority`,
              data = Btrain.data,
              type = 'C-classification',
              kernel = 'radial')
predictions <- predict(radial, newdata = Btest.data)
confusion_matrix <- table(Actual_Label = Btest.data$`Sales Channel`, Predicted_Label = predictions)
accuracy_rad <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
accuracy_rad
## [1] 0.5025025
polynomial <- svm(formula = `Sales Channel` ~ `Region` + `Item Type` + `Order Priority`,
                  data = Btrain.data,
                  type = 'C-classification',
                  kernel = 'polynomial')
predictions <- predict(polynomial, newdata = Btest.data)
confusion_matrix <- table(Actual_Label = Btest.data$`Sales Channel`, Predicted_Label = predictions)
accuracy_pol <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
accuracy_pol
## [1] 0.5255255
sigmoid <- svm(formula = `Sales Channel` ~ `Region` + `Item Type` + `Order Priority`,
               data = Btrain.data,
               type = 'C-classification',
               kernel = 'sigmoid')
predictions <- predict(sigmoid, newdata = Btest.data)
confusion_matrix <- table(Actual_Label = Btest.data$`Sales Channel`, Predicted_Label = predictions)
accuracy_sig <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
accuracy_sig
## [1] 0.4744745
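The accuracy calculation is repeated identically for each kernel above, and an accuracy figure is easier to judge next to a simple baseline. The sketch below wraps the calculation in a helper and adds the accuracy of always predicting the most frequent class. The helper names and the `actual` and `predicted` vectors are hypothetical and are not the document's predictions.

```r
# Hypothetical actual and predicted labels (not the document's predictions).
actual <- factor(c("Offline", "Online", "Online", "Offline", "Online", "Online"))
predicted <- factor(c("Offline", "Online", "Offline", "Offline", "Online", "Offline"),
                    levels = levels(actual))

# Same calculation as in the four blocks above, wrapped in a function.
accuracy_from_table <- function(actual, predicted) {
  cm <- table(Actual_Label = actual, Predicted_Label = predicted)
  sum(diag(cm)) / sum(cm)
}

# Accuracy obtained by always predicting the most frequent class.
no_information_rate <- function(actual) max(prop.table(table(actual)))

acc <- accuracy_from_table(actual, predicted)
nir <- no_information_rate(actual)
print(c(accuracy = acc, baseline = nir))
```

In this toy case the classifier only matches the baseline, which is the kind of situation the comparison is meant to expose. Giving `predicted` the same factor levels as `actual` keeps the confusion matrix square, so `diag()` picks out the correct cells.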
The table below summarizes the accuracy of the models created in past assignments (decision tree and random forest) alongside the SVM models fitted with different kernel types in this assignment.
The polynomial-kernel SVM has the highest accuracy at 52.55%, narrowly ahead of the random forest (52.35%) and the decision tree (52.01%). All six accuracies lie within a few points of 50%, so if the two sales channels are roughly balanced, none of the models does much better than guessing.
SVM uses the kernel trick to solve non-linear problems, whereas decision trees partition the input space into hyper-rectangles. Decision trees are better suited to categorical data and handle collinearity better than SVM.
# Accuracies of the decision tree and random forest from the earlier assignments (hard-coded)
metrics1 <- c("Accuracy" = 0.52012)
metrics2 <- c("Accuracy" = 0.5235235)
kable(cbind(metrics1, metrics2, accuracy_lin, accuracy_rad, accuracy_pol, accuracy_sig),
      col.names = c("DecisionTree", "RandomForest", "SVM Linear", "SVM Radial", "SVM Polynomial", "SVM Sigmoid")) %>%
  kable_styling(full_width = TRUE)
| | DecisionTree | RandomForest | SVM Linear | SVM Radial | SVM Polynomial | SVM Sigmoid |
|---|---|---|---|---|---|---|
| Accuracy | 0.52012 | 0.5235235 | 0.4984985 | 0.5025025 | 0.5255255 | 0.4744745 |
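With accuracies this close to 0.5, it helps to ask whether the best value is distinguishable from coin-flipping on a test set of this size. The sketch below runs an exact binomial test with base R's `binom.test()`. The counts are an assumption: 525 correct out of 999 reproduces the reported 0.5255255, but the test-set size is not stated in the text, and 0.5 is only the right reference value if the two classes are balanced.

```r
# Counts inferred from the reported accuracy 0.5255255 (= 525 / 999);
# the test-set size of 999 is an assumption, not stated in the text.
n_test    <- 999
n_correct <- 525

# Two-sided exact binomial test against an accuracy of 0.5.
bt <- binom.test(n_correct, n_test, p = 0.5)
print(bt$p.value)
print(bt$conf.int)
```

Under these assumptions the confidence interval for the best model's accuracy includes 0.5, so the ranking of the six models should be read with caution.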
“A Comparative Study of Decision Trees and Support Vector Machines for Classification” by A.K. Jain, R.P.W. Duin, and J. Mao (2000), Pattern Recognition Letters, 21(12): 1157-1165.
This paper compares the performance of decision trees and support vector machines for classification tasks on various datasets. The study concludes that SVMs often outperform decision trees in terms of accuracy, particularly for high-dimensional data and non-linear problems.
“Decision Trees vs. Support Vector Machines for Classification: Which Algorithm is the Best?” by S.B. Kotsiantis, I.D. Zaharakis, and P.E. Pintelas (2006), WSEAS Transactions on Information Science and Applications, 3(6): 988-993.
This paper analyzes the strengths and weaknesses of decision trees and support vector machines for classification problems. The study highlights that decision trees are easier to interpret and require less parameter tuning, while SVMs often achieve higher accuracy but can be computationally expensive and difficult to interpret.
“Support Vector Machines vs. Decision Trees for Credit Card Fraud Detection” by O.S. Duque, M.A.G. Ferreira, J.S. Cardoso, and A.L. Oliveira (2014), Expert Systems with Applications, 41(10): 4995-5004.
This paper compares the effectiveness of decision trees and support vector machines for detecting credit card fraud. The study demonstrates that both algorithms can achieve good accuracy, but SVMs generally outperform decision trees in terms of F-measure and AUC (area under the ROC curve).