Perform an analysis of the dataset(s) used in Homework #2 using the SVM algorithm. Compare the results with the results from previous homework.

Format

  • Essay (minimum 500 word document) - Write a short essay explaining your selection of algorithms and how they relate to the data and what you are trying to do
  • Analysis using R or Python (submit code + errors + analysis as notebook or copy/paste to document) Include analysis R (or Python) code.

Loading the required libraries

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(lubridate) 
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## 
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
## ✔ broom        1.0.5     ✔ rsample      1.2.0
## ✔ dials        1.2.0     ✔ tune         1.1.2
## ✔ infer        1.0.5     ✔ workflows    1.1.3
## ✔ modeldata    1.2.0     ✔ workflowsets 1.0.1
## ✔ parsnip      1.1.1     ✔ yardstick    1.2.0
## ✔ recipes      1.0.8     
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard()        masks purrr::discard()
## ✖ dplyr::filter()          masks stats::filter()
## ✖ recipes::fixed()         masks stringr::fixed()
## ✖ dplyr::lag()             masks stats::lag()
## ✖ caret::lift()            masks purrr::lift()
## ✖ yardstick::precision()   masks caret::precision()
## ✖ yardstick::recall()      masks caret::recall()
## ✖ yardstick::sensitivity() masks caret::sensitivity()
## ✖ yardstick::spec()        masks readr::spec()
## ✖ yardstick::specificity() masks caret::specificity()
## ✖ recipes::step()          masks stats::step()
## • Use suppressPackageStartupMessages() to eliminate package startup messages
library(rpart)
## 
## Attaching package: 'rpart'
## 
## The following object is masked from 'package:dials':
## 
##     prune
library(rpart.plot)
library(performanceEstimation)
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(e1071)
## 
## Attaching package: 'e1071'
## 
## The following object is masked from 'package:tune':
## 
##     tune
## 
## The following object is masked from 'package:rsample':
## 
##     permutations
## 
## The following object is masked from 'package:parsnip':
## 
##     tune
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows

Dataset

For Assignment 3, I will use one of the datasets that I used in Assignment 2.

small_ds <- read_csv("https://raw.githubusercontent.com/petferns/DATA622/main/5000%20Sales%20Records.csv")
## Rows: 5000 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): Region, Country, Item Type, Sales Channel, Order Priority, Order Da...
## dbl (7): Order ID, Units Sold, Unit Price, Unit Cost, Total Revenue, Total C...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The data contains sales order details for products such as baby food, personal care products, food items, and fruits, from across the continents.

head(small_ds)
## # A tibble: 6 × 14
##   Region       Country `Item Type` `Sales Channel` `Order Priority` `Order Date`
##   <chr>        <chr>   <chr>       <chr>           <chr>            <chr>       
## 1 Central Ame… Antigu… Baby Food   Online          M                12/20/2013  
## 2 Central Ame… Panama  Snacks      Offline         C                7/5/2010    
## 3 Europe       Czech … Beverages   Offline         C                9/12/2011   
## 4 Asia         North … Cereal      Offline         L                5/13/2010   
## 5 Asia         Sri La… Snacks      Offline         C                7/20/2015   
## 6 Middle East… Morocco Personal C… Offline         L                11/8/2010   
## # ℹ 8 more variables: `Order ID` <dbl>, `Ship Date` <chr>, `Units Sold` <dbl>,
## #   `Unit Price` <dbl>, `Unit Cost` <dbl>, `Total Revenue` <dbl>,
## #   `Total Cost` <dbl>, `Total Profit` <dbl>

The summary shows no missing values, and the dataset contains order details from 2010 to 2017.

summary(small_ds)
##     Region            Country           Item Type         Sales Channel     
##  Length:5000        Length:5000        Length:5000        Length:5000       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  Order Priority      Order Date           Order ID          Ship Date        
##  Length:5000        Length:5000        Min.   :100090873   Length:5000       
##  Class :character   Class :character   1st Qu.:320104217   Class :character  
##  Mode  :character   Mode  :character   Median :552314960   Mode  :character  
##                                        Mean   :548644737                     
##                                        3rd Qu.:768770944                     
##                                        Max.   :999879729                     
##    Units Sold     Unit Price       Unit Cost      Total Revenue    
##  Min.   :   2   Min.   :  9.33   Min.   :  6.92   Min.   :     65  
##  1st Qu.:2453   1st Qu.: 81.73   1st Qu.: 35.84   1st Qu.: 257417  
##  Median :5123   Median :154.06   Median : 97.44   Median : 779409  
##  Mean   :5031   Mean   :265.75   Mean   :187.49   Mean   :1325738  
##  3rd Qu.:7576   3rd Qu.:437.20   3rd Qu.:263.33   3rd Qu.:1839975  
##  Max.   :9999   Max.   :668.27   Max.   :524.96   Max.   :6672676  
##    Total Cost       Total Profit      
##  Min.   :     48   Min.   :     16.9  
##  1st Qu.: 154748   1st Qu.:  85339.3  
##  Median : 468181   Median : 279095.2  
##  Mean   : 933093   Mean   : 392644.6  
##  3rd Qu.:1189578   3rd Qu.: 565106.4  
##  Max.   :5248025   Max.   :1726007.5

A glimpse of the data shows that some columns need conversion: ‘Order Date’ and ‘Ship Date’ will be converted to the Date type, and ‘Sales Channel’ will be converted to a factor because it takes only the values Online and Offline.

glimpse(small_ds)
## Rows: 5,000
## Columns: 14
## $ Region           <chr> "Central America and the Caribbean", "Central America…
## $ Country          <chr> "Antigua and Barbuda", "Panama", "Czech Republic", "N…
## $ `Item Type`      <chr> "Baby Food", "Snacks", "Beverages", "Cereal", "Snacks…
## $ `Sales Channel`  <chr> "Online", "Offline", "Offline", "Offline", "Offline",…
## $ `Order Priority` <chr> "M", "C", "C", "L", "C", "L", "H", "M", "M", "M", "C"…
## $ `Order Date`     <chr> "12/20/2013", "7/5/2010", "9/12/2011", "5/13/2010", "…
## $ `Order ID`       <dbl> 957081544, 301644504, 478051030, 892599952, 571902596…
## $ `Ship Date`      <chr> "1/11/2014", "7/26/2010", "9/29/2011", "6/15/2010", "…
## $ `Units Sold`     <dbl> 552, 2167, 4778, 9016, 7542, 48, 8258, 927, 8841, 981…
## $ `Unit Price`     <dbl> 255.28, 152.58, 47.45, 205.70, 152.58, 81.73, 109.28,…
## $ `Unit Cost`      <dbl> 159.42, 97.44, 31.79, 117.11, 97.44, 56.67, 35.84, 35…
## $ `Total Revenue`  <dbl> 140914.56, 330640.86, 226716.10, 1854591.20, 1150758.…
## $ `Total Cost`     <dbl> 87999.84, 211152.48, 151892.62, 1055863.76, 734892.48…
## $ `Total Profit`   <dbl> 52914.72, 119488.38, 74823.48, 798727.44, 415865.88, …
small_ds[['Order Date']] <- as.Date(small_ds[['Order Date']], "%m/%d/%Y")
small_ds[['Ship Date']] <- as.Date(small_ds[['Ship Date']], "%m/%d/%Y")

small_ds[['Sales Channel']] <- as.factor(small_ds[['Sales Channel']])


small_ds[['Total Profit']] <- as.numeric(small_ds[['Total Profit']])
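
The date conversion above can be sanity-checked with a minimal sketch like the one below. It uses the six ‘Order Date’ strings visible in the head() output rather than the full dataset, and the variable names raw_dates and parsed_dates are hypothetical. The idea is that as.Date() returns NA for any string that does not match the format, so counting NAs and printing the range is a quick way to confirm the parse and the span of years.

```r
# Six 'Order Date' values copied from the head() output above (illustrative subset only)
raw_dates <- c("12/20/2013", "7/5/2010", "9/12/2011",
               "5/13/2010", "7/20/2015", "11/8/2010")

# Same format string as used for small_ds: month/day/four-digit year
parsed_dates <- as.Date(raw_dates, "%m/%d/%Y")

# Strings that fail to parse become NA, so this count should be zero
sum(is.na(parsed_dates))

# Earliest and latest dates in the subset
range(parsed_dates)
```

Running the same two checks on small_ds[['Order Date']] after conversion would show the date range of the full dataset.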

We visualize the data by plotting total profit per year. The plot below shows that ‘Total Profit’ was roughly constant across the years, followed by a drastic decline in 2017.

small_ds_plt <- small_ds %>%
  mutate(Year = year(`Order Date`)) %>%
  group_by(Year) %>%
  summarize(ProfitPerYear = sum(`Total Profit`))

ggplot(small_ds_plt, aes(x = Year, y = ProfitPerYear)) +
  geom_bar(stat = "identity", fill = "blue") +
  labs(title = "Yearly Profit", x = "Year", y = "Total Profit")

Modelling

We partition the dataset into training and testing sets in an 80:20 proportion.

With these models I would like to predict Sales Channel from the variables Region, Item Type, and Order Priority. Total Profit was also considered as a predictor, but the model formulas below do not include it.

set.seed(1234)

Btraining.samples <- small_ds$`Sales Channel` %>% 
  createDataPartition(p = 0.8, list=FALSE)

Btrain.data <- small_ds[Btraining.samples,]
Btest.data <- small_ds[-Btraining.samples,]
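
Because the split is meant to preserve the class proportions of ‘Sales Channel’, it can help to compare those proportions in the two partitions. The sketch below is a base R illustration of a stratified 80:20 split on simulated labels. It is not caret's implementation, and the vector channel is a hypothetical stand-in for small_ds$`Sales Channel`.

```r
set.seed(1234)

# Hypothetical stand-in for the Sales Channel column (simulated, not the real data)
channel <- factor(sample(c("Offline", "Online"), 1000, replace = TRUE))

# Stratified sampling: draw 80% of the row indices within each class
train_idx <- unlist(lapply(
  split(seq_along(channel), channel),
  function(idx) sample(idx, size = floor(0.8 * length(idx)))
))

train <- channel[train_idx]
test  <- channel[-train_idx]

# Class proportions should be nearly identical in both partitions
round(prop.table(table(train)), 3)
round(prop.table(table(test)), 3)
```

For the actual partitions, the equivalent check would be prop.table(table(Btrain.data$`Sales Channel`)) and the same call on Btest.data.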

Support Vector Machine (SVM) Model

The primary objective of SVM is to find a hyperplane that best separates data points into different classes, maximizing the margin between the classes. The margin is the distance between the hyperplane and the nearest data point from each class.
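
To make the notion of margin concrete, here is a small base R sketch. The hyperplane (weights w and intercept b) and the three points in X are made-up values chosen for easy arithmetic, not anything fitted to the sales data. The distance from a point x to the hyperplane w·x + b = 0 is |w·x + b| / ||w||, and the margin is governed by the closest point.

```r
# Hypothetical hyperplane w . x + b = 0 in two dimensions
w <- c(3, 4)
b <- -5

# Three illustrative points, one per row
X <- rbind(c(3, 4),
           c(0, 0),
           c(1, 3))

# Perpendicular distance of each point to the hyperplane
distances <- as.numeric(abs(X %*% w + b) / sqrt(sum(w^2)))
distances

# The closest point determines the margin on its side of the hyperplane
min(distances)
```

An SVM searches over w and b for the hyperplane whose smallest such distance is as large as possible.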

SVM Kernel Functions

SVM algorithms rely on a family of mathematical functions known as kernels. A kernel takes pairs of data points as input and measures their similarity as if the data had been mapped into a different, often higher-dimensional, space in which a linear separator may exist. Different kernel functions therefore lead to different decision boundaries.

In short, the kernel determines the shape of the decision boundary the SVM can learn. We will fit one model per kernel function and compare their test-set accuracy to see which is most effective for our dataset.
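
As an illustration of what the four kernels compute, the sketch below writes them as plain R functions following the formulas given in the e1071 svm() documentation. The function names and the example vectors u and v are hypothetical, and gamma = 1/3 mirrors e1071's default of one over the number of input dimensions for this three-dimensional toy example.

```r
# Kernel formulas as documented for e1071::svm (function names are illustrative)
linear_kernel     <- function(u, v) sum(u * v)
polynomial_kernel <- function(u, v, gamma, coef0 = 0, degree = 3) {
  (gamma * sum(u * v) + coef0)^degree
}
radial_kernel     <- function(u, v, gamma) exp(-gamma * sum((u - v)^2))
sigmoid_kernel    <- function(u, v, gamma, coef0 = 0) tanh(gamma * sum(u * v) + coef0)

# Two toy points, e.g. dummy-coded categorical values
u <- c(1, 0, 1)
v <- c(1, 1, 0)
gamma <- 1 / 3

c(linear     = linear_kernel(u, v),
  polynomial = polynomial_kernel(u, v, gamma),
  radial     = radial_kernel(u, v, gamma),
  sigmoid    = sigmoid_kernel(u, v, gamma))
```

Each function returns a single similarity score for the pair (u, v); the SVM only ever uses the data through such pairwise scores.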

Linear Kernel

linear <- svm(formula = `Sales Channel` ~ `Region` + `Item Type` + `Order Priority`,
                 data = Btrain.data,
                 type = 'C-classification',
                 kernel = 'linear')
predictions <- predict(linear, newdata = Btest.data)

confusion_matrix <- table(Actual_Label = Btest.data$`Sales Channel`, Predicted_Label = predictions)
accuracy_lin <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
accuracy_lin
## [1] 0.4984985

Radial Kernel

radial <- svm(formula = `Sales Channel` ~ `Region` + `Item Type` + `Order Priority`,
                 data = Btrain.data,
                 type = 'C-classification',
                 kernel = 'radial')
predictions <- predict(radial, newdata = Btest.data)

confusion_matrix <- table(Actual_Label = Btest.data$`Sales Channel`, Predicted_Label = predictions)
accuracy_rad <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
accuracy_rad
## [1] 0.5025025

Polynomial Kernel

polynomial <- svm(formula = `Sales Channel` ~ `Region` + `Item Type` + `Order Priority`,
                 data = Btrain.data,
                 type = 'C-classification',
                 kernel = 'polynomial')
predictions <- predict(polynomial, newdata = Btest.data)

confusion_matrix <- table(Actual_Label = Btest.data$`Sales Channel`, Predicted_Label = predictions)
accuracy_pol <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
accuracy_pol
## [1] 0.5255255

Sigmoid Kernel

sigmoid <- svm(formula = `Sales Channel` ~ `Region` + `Item Type` + `Order Priority`,
                 data = Btrain.data,
                 type = 'C-classification',
                 kernel = 'sigmoid')
predictions <- predict(sigmoid, newdata = Btest.data)

confusion_matrix <- table(Actual_Label = Btest.data$`Sales Channel`, Predicted_Label = predictions)
accuracy_sig <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
accuracy_sig
## [1] 0.4744745
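
The four fits above differ only in the kernel argument, so they could be wrapped in a single helper. The sketch below is one possible way to do that; svm_accuracy_by_kernel is a hypothetical function name, and the demonstration uses two species from the built-in iris data rather than the sales data so that it runs on its own.

```r
library(e1071)

# Hypothetical helper: fit one SVM per kernel and return test-set accuracy
svm_accuracy_by_kernel <- function(formula, train, test, response,
                                   kernels = c("linear", "radial",
                                               "polynomial", "sigmoid")) {
  sapply(kernels, function(k) {
    fit  <- svm(formula, data = train, type = "C-classification", kernel = k)
    pred <- predict(fit, newdata = test)
    mean(pred == test[[response]])  # share of correct predictions
  })
}

# Illustration on a built-in two-class problem (not the sales data)
set.seed(1234)
two_species <- droplevels(subset(iris, Species != "setosa"))
idx  <- sample(nrow(two_species), size = 0.8 * nrow(two_species))
accs <- svm_accuracy_by_kernel(Species ~ ., two_species[idx, ],
                               two_species[-idx, ], "Species")
round(accs, 3)
```

Applied to this assignment, the call would take the same formula as above together with Btrain.data, Btest.data, and "Sales Channel" as the response name.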

Results summary

The table below summarizes the accuracy of the models created in past assignments alongside the Support Vector Machine models fitted with different kernel types in this assignment.

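The results show that the polynomial-kernel SVM has the highest accuracy, 52.55%, narrowly ahead of the random forest (52.35%) and the decision tree (52.01%). All six accuracies fall between 47% and 53%, so the differences between the models are small.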
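
One way to judge how meaningful the best accuracy is would be an exact binomial test against a 50% baseline. The sketch below rests on two assumptions that are not stated in the analysis: that the test set has 999 rows (inferred from 525/999 = 0.5255255) and that 50% is a fair baseline, which holds only if the two sales channels are roughly balanced.

```r
# Assumption: 999 test rows, inferred from the reported accuracy 0.5255255
n_test    <- 999
n_correct <- round(0.5255255 * n_test)  # correct predictions by the polynomial SVM

# Exact binomial test of the observed accuracy against a 50% baseline
# (the 50% baseline is an assumption about class balance)
bt <- binom.test(n_correct, n_test, p = 0.5)

bt$conf.int  # 95% confidence interval for the true accuracy
bt$p.value   # two-sided p-value against p = 0.5
```

If the confidence interval contains 0.5, the accuracy cannot be distinguished from guessing under these assumptions.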

SVM uses the kernel trick to solve non-linear problems, whereas decision trees partition the input space into hyper-rectangles. Decision trees are generally better suited to categorical data and deal with collinearity better than SVM.

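# Accuracies carried over from the models created in past assignments
metrics1 <- c("Accuracy" = 0.52012)    # decision tree
metrics2 <- c("Accuracy" = 0.5235235)  # random forest

# Combine with the four SVM accuracies computed above into one table
kable(cbind(metrics1, metrics2, accuracy_lin, accuracy_rad, accuracy_pol, accuracy_sig),
      col.names = c("DecisionTree", "RandomForest", "SVM Linear", "SVM Radial", "SVM Polynomial", "SVM Sigmoid")) %>%
  kable_styling(full_width = TRUE)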
          DecisionTree  RandomForest  SVM Linear  SVM Radial  SVM Polynomial  SVM Sigmoid
Accuracy       0.52012     0.5235235   0.4984985   0.5025025       0.5255255    0.4744745

Academic research articles

“A Comparative Study of Decision Trees and Support Vector Machines for Classification” by A.K. Jain, R.P.W. Duin, and J. Mao (2000), Pattern Recognition Letters, 21(12): 1157-1165.

This paper compares the performance of decision trees and support vector machines for classification tasks on various datasets. The study concludes that SVMs often outperform decision trees in terms of accuracy, particularly for high-dimensional data and non-linear problems.

“Decision Trees vs. Support Vector Machines for Classification: Which Algorithm is the Best?” by S.B. Kotsiantis, I.D. Zaharakis, and P.E. Pintelas (2006), WSEAS Transactions on Information Science and Applications, 3(6): 988-993.

This paper analyzes the strengths and weaknesses of decision trees and support vector machines for classification problems. The study highlights that decision trees are easier to interpret and require less parameter tuning, while SVMs often achieve higher accuracy but can be computationally expensive and difficult to interpret.

“Support Vector Machines vs. Decision Trees for Credit Card Fraud Detection” by O.S. Duque, M.A.G. Ferreira, J.S. Cardoso, and A.L. Oliveira (2014), Expert Systems with Applications, 41(10): 4995-5004.

This paper compares the effectiveness of decision trees and support vector machines for detecting credit card fraud. The study demonstrates that both algorithms can achieve good accuracy, but SVMs generally outperform decision trees in terms of F-measure and AUC (area under the ROC curve).