library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.2
## Warning: package 'dplyr' was built under R version 4.3.2
## Warning: package 'stringr' was built under R version 4.3.2
## Warning: package 'lubridate' was built under R version 4.3.2
library(skimr)
library(DataExplorer)
library(corrplot)
library(ggfortify)
library(caret)
set.seed(123)
Our objective for this analysis is to fit support vector machine models on a dataset and compare how the results change relative to previously built decision tree models. For our dataset we will utilize the sample sales dataset from (https://excelbianalytics.com/wp/downloads-18-sample-csv-files-data-sets-for-testing-sales/) containing 1,000,000 records.
First we load the sales dataset from the CSV we have downloaded, which contains 1,000,000 rows and 14 columns.
mil_df <- read_csv("1000000 Sales Records.csv", show_col_types = FALSE)
glimpse(mil_df)
## Rows: 1,000,000
## Columns: 14
## $ Region <chr> "Sub-Saharan Africa", "Middle East and North Africa",…
## $ Country <chr> "South Africa", "Morocco", "Papua New Guinea", "Djibo…
## $ `Item Type` <chr> "Fruits", "Clothes", "Meat", "Clothes", "Beverages", …
## $ `Sales Channel` <chr> "Offline", "Online", "Offline", "Offline", "Offline",…
## $ `Order Priority` <chr> "M", "M", "M", "H", "L", "L", "M", "L", "L", "L", "M"…
## $ `Order Date` <chr> "7/27/2012", "9/14/2013", "5/15/2015", "5/17/2017", "…
## $ `Order ID` <dbl> 443368995, 667593514, 940995585, 880811536, 174590194…
## $ `Ship Date` <chr> "7/28/2012", "10/19/2013", "6/4/2015", "7/2/2017", "1…
## $ `Units Sold` <dbl> 1593, 4611, 360, 562, 3973, 1379, 597, 1476, 896, 776…
## $ `Unit Price` <dbl> 9.33, 109.28, 421.89, 109.28, 47.45, 9.33, 47.45, 47.…
## $ `Unit Cost` <dbl> 6.92, 35.84, 364.69, 35.84, 31.79, 6.92, 31.79, 31.79…
## $ `Total Revenue` <dbl> 14862.69, 503890.08, 151880.40, 61415.36, 188518.85, …
## $ `Total Cost` <dbl> 11023.56, 165258.24, 131288.40, 20142.08, 126301.67, …
## $ `Total Profit` <dbl> 3839.13, 338631.84, 20592.00, 41273.28, 62217.18, 332…
Note that these processing steps are identical to the data processing performed when building the decision tree models.
First, we must consider the columns that we have and whether they are applicable for fitting a classification model. We have labels such as order priority, item type, and region that could possibly be predicted from the other columns. A good use case for this data would be predicting the order priority of new orders, so that whenever a new order comes in it does not have to be manually labeled. Our existing dataset of one million manually classified orders should serve as a good base for making this happen.
However, we should remove data that we would not realistically know when an order is initially created. In this case we remove the ship date column, as it marks the end of the order as a whole, by which point a priority should already have been assigned. Additionally, we remove Order ID, since it realistically provides no information about an order besides incidental details such as its chronology, which is already covered more thoroughly by order date.
Additionally, our column names are not very syntactically friendly, having spaces between words, so we coerce them into better names.
mil_df <- mil_df |>
select(-`Ship Date`, -`Order ID`)
colnames(mil_df) <- make.names(
colnames(mil_df)
)
colnames(mil_df)
## [1] "Region" "Country" "Item.Type" "Sales.Channel"
## [5] "Order.Priority" "Order.Date" "Units.Sold" "Unit.Price"
## [9] "Unit.Cost" "Total.Revenue" "Total.Cost" "Total.Profit"
Next we give each column the correct variable type. Order date gets converted from a character string to an actual date type, and order priority becomes an ordered factor since priorities have a set order from low to medium to high to critical. Finally, the remaining character columns are converted to factors.
mil_df <- mil_df |>
mutate(
Order.Date = mdy(Order.Date),
Order.Priority = ordered(Order.Priority, levels = c("L","M","H","C"))
) |>
mutate(
across(
where(is.character), as_factor
)
)
str(mil_df)
## tibble [1,000,000 × 12] (S3: tbl_df/tbl/data.frame)
## $ Region : Factor w/ 7 levels "Sub-Saharan Africa",..: 1 2 3 1 4 5 1 1 1 1 ...
## $ Country : Factor w/ 185 levels "South Africa",..: 1 2 3 4 5 6 7 8 9 8 ...
## $ Item.Type : Factor w/ 12 levels "Fruits","Clothes",..: 1 2 3 2 4 1 4 4 5 6 ...
## $ Sales.Channel : Factor w/ 2 levels "Offline","Online": 1 2 1 1 1 2 2 2 2 1 ...
## $ Order.Priority: Ord.factor w/ 4 levels "L"<"M"<"H"<"C": 2 2 2 3 1 1 2 1 1 1 ...
## $ Order.Date : Date[1:1000000], format: "2012-07-27" "2013-09-14" ...
## $ Units.Sold : num [1:1000000] 1593 4611 360 562 3973 ...
## $ Unit.Price : num [1:1000000] 9.33 109.28 421.89 109.28 47.45 ...
## $ Unit.Cost : num [1:1000000] 6.92 35.84 364.69 35.84 31.79 ...
## $ Total.Revenue : num [1:1000000] 14863 503890 151880 61415 188519 ...
## $ Total.Cost : num [1:1000000] 11024 165258 131288 20142 126302 ...
## $ Total.Profit : num [1:1000000] 3839 338632 20592 41273 62217 ...
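As a quick sanity check before splitting (a minimal sketch, not part of the original workflow), we can count missing values across all columns; for this generated sample dataset we expect zero.
# Count NA values across the entire dataset; expect 0 for this
# generated sample data.
sum(is.na(mil_df))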
Now that we have performed basic data preparation that applies to the whole dataset and does not bias the data, we split off training and test sets, with 80% of the data going to training.
inTrain <- createDataPartition(
y = mil_df$Order.Priority,
p = .80,
list = FALSE
)
mil_df_simp_train <- mil_df[inTrain,]
mil_df_simp_test <- mil_df[-inTrain,]
cat("Number of test rows:",nrow(mil_df_simp_test),"Number of train rows:",nrow(mil_df_simp_train))
## Number of test rows: 199998 Number of train rows: 800002
Now that we have processed our data, we will fit the data into support vector machine models.
We begin our model training with a linear support vector machine. Linear SVMs are optimal when the classes can be neatly separated by straight lines (hyperplanes). Unfortunately, we know our data here is randomly generated and will not separate well with such a method. Still, we utilize five-fold cross-validation in order to find an optimal model with less bias, and we also center and scale our data. It is important to center and scale datasets when building SVMs because all SVM models rely on distance-based calculations to maximize separation.
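To illustrate what the center and scale pre-processing steps do, here is a minimal sketch (separate from the modeling pipeline) applying caret::preProcess to the three numeric predictors; the transformed columns should come out with mean 0 and standard deviation 1.
# Illustrative only: apply the same center/scale transform that
# train() performs internally via preProc.
pp <- preProcess(
  mil_df_simp_train[, c("Units.Sold", "Unit.Price", "Unit.Cost")],
  method = c("center", "scale")
)
scaled_preds <- predict(pp, mil_df_simp_train[, c("Units.Sold", "Unit.Price", "Unit.Cost")])
summary(scaled_preds)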
Unfortunately, support vector machines are not very quick to fit on large datasets. Here, the code below ran for more than two hours without finishing execution.
svml <- train(
Order.Priority ~ Units.Sold + Unit.Price + Unit.Cost,
data = mil_df_simp_train,
method = "svmLinear",
trControl = trainControl(method = "cv", number = 5),
preProc = c("center","scale")
)
svml
Thus, we have to substantially cut down the number of rows we are using for our SVM models. This does mean that we are not doing a one-to-one comparison, as the decision tree models were trained on a much bigger training set.
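One way to gauge a workable training size is to time fits on progressively larger samples before committing. A rough sketch of that experimentation (the sample sizes here are illustrative, not from the original analysis):
# Time svmLinear fits on increasing sample sizes to see how
# training time grows with the data.
for (n in c(500, 1000, 2000)) {
  idx <- sample(nrow(mil_df), n)
  elapsed <- system.time(
    train(
      Order.Priority ~ Units.Sold + Unit.Price + Unit.Cost,
      data = mil_df[idx, ],
      method = "svmLinear",
      trControl = trainControl(method = "cv", number = 5),
      preProc = c("center", "scale")
    )
  )["elapsed"]
  cat(n, "rows:", elapsed, "seconds\n")
}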
After some experimentation, it is determined that to get the model to run in a reasonable amount of time we want our training dataset to have about 1,000 rows.
inTrain <- createDataPartition(
y = mil_df$Order.Priority,
p = .001,
list = FALSE
)
mil_df_simp_train <- mil_df[inTrain,]
mil_df_simp_test <- mil_df[-inTrain,]
cat("Number of test rows:",nrow(mil_df_simp_test),"Number of train rows:",nrow(mil_df_simp_train))
## Number of test rows: 998998 Number of train rows: 1002
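Since the four priority classes are roughly balanced, random guessing sits at about 25% accuracy. A quick check of the class proportions in the new training set confirms that baseline:
# Near-uniform class proportions imply a ~25% no-information baseline.
round(prop.table(table(mil_df_simp_train$Order.Priority)), 3)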
We end up with a cross-validated accuracy of lower than 25% against our training data, which is worse than randomly selecting a class.
svml <- train(
Order.Priority ~ Units.Sold + Unit.Price + Unit.Cost,
data = mil_df_simp_train,
method = "svmLinear",
trControl = trainControl(method = "cv", number = 5),
preProc = c("center","scale")
)
svml
## Support Vector Machines with Linear Kernel
##
## 1002 samples
## 3 predictor
## 4 classes: 'L', 'M', 'H', 'C'
##
## Pre-processing: centered (3), scaled (3)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 802, 801, 802, 802, 801
## Resampling results:
##
## Accuracy Kappa
## 0.2474876 -0.003575299
##
## Tuning parameter 'C' was held constant at a value of 1
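Note that caret held the cost parameter C fixed at its default of 1. If we wanted to tune it as well, a minimal sketch (with an illustrative grid, not from the original analysis) would pass a tuneGrid:
# Illustrative: tune the cost parameter C instead of holding it at 1.
svml_tuned <- train(
  Order.Priority ~ Units.Sold + Unit.Price + Unit.Cost,
  data = mil_df_simp_train,
  method = "svmLinear",
  trControl = trainControl(method = "cv", number = 5),
  preProc = c("center","scale"),
  tuneGrid = expand.grid(C = c(0.25, 0.5, 1, 2, 4))
)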
Next we change SVM kernels so that we can separate our data in a non-linear fashion. Since our data is random, a more flexible non-linear boundary may fit it slightly better than a straight line, so we will utilize a radial basis function (RBF) kernel here.
We end up with a somewhat more accurate model utilizing the radial kernel, with about 27% cross-validated accuracy against the training set.
svmr <- train(
Order.Priority ~ Units.Sold + Unit.Price + Unit.Cost,
data = mil_df_simp_train,
method = "svmRadial",
trControl = trainControl(method = "cv", number = 5),
preProc = c("center","scale")
)
svmr
## Support Vector Machines with Radial Basis Function Kernel
##
## 1002 samples
## 3 predictor
## 4 classes: 'L', 'M', 'H', 'C'
##
## Pre-processing: centered (3), scaled (3)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 801, 802, 802, 801, 802
## Resampling results across tuning parameters:
##
## C Accuracy Kappa
## 0.25 0.2524677 0.003276044
## 0.50 0.2724726 0.029986370
## 1.00 0.2654627 0.020595086
##
## Tuning parameter 'sigma' was held constant at a value of 0.9790247
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.9790247 and C = 0.5.
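Here caret tuned only C and estimated sigma once using kernlab's sigest heuristic. A hedged sketch of searching over both parameters explicitly (grid values are illustrative):
# Illustrative: search over both sigma and C rather than fixing sigma.
svmr_grid <- train(
  Order.Priority ~ Units.Sold + Unit.Price + Unit.Cost,
  data = mil_df_simp_train,
  method = "svmRadial",
  trControl = trainControl(method = "cv", number = 5),
  preProc = c("center","scale"),
  tuneGrid = expand.grid(
    sigma = c(0.25, 0.5, 1),
    C = c(0.25, 0.5, 1, 2)
  )
)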
Finally, we determine which of the models we have created works best on our test data.
For our linear SVM model we get an accuracy of 24.96%, which is very close to random chance. It is also worse than our best decision tree model, which had an accuracy of 25.03%. Moreover, our linear SVM predicts the majority of observations as class L, which does not reflect the equal distribution of classes that should be in the data. That said, in a real-world scenario it would still be a more sensible class distribution to have most orders labeled as low priority.
svml |>
predict(mil_df_simp_test) |>
confusionMatrix(mil_df_simp_test$Order.Priority)
## Confusion Matrix and Statistics
##
## Reference
## Prediction L M H C
## L 112419 112207 112176 112895
## M 19143 19307 19406 18961
## H 79691 79704 79544 80100
## C 38629 38225 38485 38106
##
## Overall Statistics
##
## Accuracy : 0.2496
## 95% CI : (0.2488, 0.2505)
## No Information Rate : 0.2503
## P-Value [Acc > NIR] : 0.9436
##
## Kappa : -6e-04
##
## Mcnemar's Test P-Value : <2e-16
##
## Statistics by Class:
##
## Class: L Class: M Class: H Class: C
## Sensitivity 0.4499 0.07740 0.31867 0.15239
## Specificity 0.5498 0.92327 0.68041 0.84600
## Pos Pred Value 0.2500 0.25134 0.24932 0.24834
## Neg Pred Value 0.7497 0.75044 0.74989 0.74933
## Prevalence 0.2501 0.24969 0.24986 0.25031
## Detection Rate 0.1125 0.01933 0.07962 0.03814
## Detection Prevalence 0.4501 0.07689 0.31936 0.15360
## Balanced Accuracy 0.4998 0.50034 0.49954 0.49919
Our radial SVM has a slightly lower accuracy at 24.92%, but spreads the class predictions more evenly, with the highest detection prevalence at 0.3827 for class L. Between the two SVMs, I would use this model to label order priority purely because of the better spread of classes.
svmr |>
predict(mil_df_simp_test) |>
confusionMatrix(mil_df_simp_test$Order.Priority)
## Confusion Matrix and Statistics
##
## Reference
## Prediction L M H C
## L 95452 95560 95504 95836
## M 35478 35533 35290 35485
## H 61182 61434 61035 61788
## C 57770 56916 57782 56953
##
## Overall Statistics
##
## Accuracy : 0.2492
## 95% CI : (0.2484, 0.2501)
## No Information Rate : 0.2503
## P-Value [Acc > NIR] : 0.9941
##
## Kappa : -0.0011
##
## Mcnemar's Test P-Value : <2e-16
##
## Statistics by Class:
##
## Class: L Class: M Class: H Class: C
## Sensitivity 0.38199 0.14245 0.2445 0.22776
## Specificity 0.61702 0.85825 0.7539 0.76972
## Pos Pred Value 0.24964 0.25061 0.2487 0.24825
## Neg Pred Value 0.74956 0.75046 0.7498 0.74907
## Prevalence 0.25013 0.24969 0.2499 0.25031
## Detection Rate 0.09555 0.03557 0.0611 0.05701
## Detection Prevalence 0.38274 0.14193 0.2457 0.22965
## Balanced Accuracy 0.49950 0.50035 0.4992 0.49874
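As a side-by-side check, caret's resamples() can summarize the cross-validated training accuracy of both SVMs together; a minimal sketch (most meaningful when both models share identical fold indices, e.g. via a common seed or an explicit index in trainControl):
# Compare the cross-validated resampling results of the two SVMs.
resamps <- resamples(list(linear = svml, radial = svmr))
summary(resamps)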
Overall, we end up with one of our CART decision trees having the best accuracy at 25.03%, but with a heavy bias towards labeling most outcomes as critical. SVMs in particular were not suited for predicting this data, as attempting to geometrically separate randomly generated data is a fool's errand. Yet the decision trees didn't do much better either, because no model will be able to predict random outcomes consistently.