Assignment

Perform an analysis of the dataset used in Homework #2 using the SVM algorithm. Compare the results with the results from previous homework.

Based on articles

https://www.hindawi.com/journals/complexity/2021/5550344/  
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8137961/  

Search for academic content (at least 3 articles) that compare the use of decision trees vs SVMs in your current area of expertise. Which algorithm is recommended to get more accurate results? Is it better for classification or regression scenarios? Do you agree with the recommendations? Why?

Format: R file & essay

Data

For Homework #2, I used the “Flavors of Cacao” dataset from Kaggle (https://www.kaggle.com/datasets/techiesid01/flavours-of-cacao?resource=download). In that homework, I built a decision tree to solve a regression problem: predicting the final grade of a cocoa bean.

The following code sets up our libraries and imports the dataset.

library(tidymodels)
## -- Attaching packages -------------------------------------- tidymodels 0.2.0 --
## v broom        0.7.12     v recipes      0.2.0 
## v dials        0.1.1      v rsample      0.1.1 
## v dplyr        1.0.8      v tibble       3.1.6 
## v ggplot2      3.3.5      v tidyr        1.2.0 
## v infer        1.0.0      v tune         0.2.0 
## v modeldata    0.1.1      v workflows    0.2.6 
## v parsnip      0.2.1      v workflowsets 0.2.1 
## v purrr        0.3.4      v yardstick    0.0.9
## -- Conflicts ----------------------------------------- tidymodels_conflicts() --
## x purrr::discard() masks scales::discard()
## x dplyr::filter()  masks stats::filter()
## x dplyr::lag()     masks stats::lag()
## x recipes::step()  masks stats::step()
## * Search for functions across packages at https://www.tidymodels.org/find/
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v readr   2.1.2     v forcats 0.5.1
## v stringr 1.4.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x readr::col_factor() masks scales::col_factor()
## x purrr::discard()    masks scales::discard()
## x dplyr::filter()     masks stats::filter()
## x stringr::fixed()    masks recipes::fixed()
## x dplyr::lag()        masks stats::lag()
## x readr::spec()       masks yardstick::spec()
library(e1071)
## 
## Attaching package: 'e1071'
## The following object is masked from 'package:tune':
## 
##     tune
## The following object is masked from 'package:rsample':
## 
##     permutations
## The following object is masked from 'package:parsnip':
## 
##     tune
library(readr)
library(rpart.plot)
## Loading required package: rpart
## 
## Attaching package: 'rpart'
## The following object is masked from 'package:dials':
## 
##     prune
library(ggplot2)

# Used for Random Forest
library(ranger)

chocolate <- read_csv('https://raw.githubusercontent.com/logicalschema/spring2022/main/data622/hw3/flavors_of_cacao.csv')
## Rows: 1795 Columns: 9
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (6): company, specific_bean_origin_or_bar_name, cocoa_percent, company_l...
## dbl (3): ref, review_date, final_grade
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

This is a view of the imported dataset.

names(chocolate)
## [1] "company"                          "specific_bean_origin_or_bar_name"
## [3] "ref"                              "review_date"                     
## [5] "cocoa_percent"                    "company_location"                
## [7] "final_grade"                      "bean_type"                       
## [9] "broad_bean_origin"
head(chocolate)

Cleaning

The column cocoa_percent is imported with a chr data type, so it is converted to a double. The column ref is removed, and the columns broad_bean_origin and bean_type are converted to factors. Rows with missing values are dropped with drop_na().

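For the SVM model, a new variable, final_grade_floor, takes the floor of final_grade so that the grade is available as an integer value. A second variable, cocoa_percent_floor, takes the floor of cocoa_percent. The model fitted in the SVM section uses final_grade and cocoa_percent directly.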

chocolate <- chocolate %>% drop_na()


# Parses the cocoa_percent and converts it to percentage
chocolate$cocoa_percent <- parse_number(chocolate$cocoa_percent)/100

# Removes the ref column (by name rather than by position)
chocolate <- chocolate %>% select(-ref)

# Convert some columns to factors
chocolate$broad_bean_origin  <- as.factor(chocolate$broad_bean_origin )
chocolate$bean_type  <- as.factor(chocolate$bean_type)

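# Integer-valued versions of the grade and the cocoa percentage
chocolate$final_grade_floor <- floor(chocolate$final_grade)
# Note: cocoa_percent is already a fraction here, so this floor is 0 for every bar below 100%
chocolate$cocoa_percent_floor <- floor(chocolate$cocoa_percent)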
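
The order of the divide and floor steps matters. The following is a minimal base R sketch on made-up values, not part of the original analysis: the vector `raw_percent` is hypothetical, and `sub()` with `as.numeric()` stands in for `parse_number()`. It suggests why a floor taken after dividing by 100 is 0 for everything below 100%.

```r
# Hypothetical raw values in the same "NN%" format as cocoa_percent
raw_percent <- c("63%", "70%", "72.5%", "100%")

# Strip the percent sign (base R stand-in for readr::parse_number)
percent  <- as.numeric(sub("%", "", raw_percent, fixed = TRUE))
fraction <- percent / 100

# Flooring the fraction: everything below 100% collapses to 0
floor_of_fraction <- floor(fraction)

# Flooring the whole-number percentage keeps the information
floor_of_percent <- floor(percent)

print(data.frame(raw_percent, fraction, floor_of_fraction, floor_of_percent))
```

If an integer cocoa percentage were wanted as a predictor, flooring before the division would be one way to get it.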

This is a view of the data after our modification.

head(chocolate)
summary(chocolate)
##    company          specific_bean_origin_or_bar_name  review_date  
##  Length:1793        Length:1793                      Min.   :2006  
##  Class :character   Class :character                 1st Qu.:2010  
##  Mode  :character   Mode  :character                 Median :2013  
##                                                      Mean   :2012  
##                                                      3rd Qu.:2015  
##                                                      Max.   :2017  
##                                                                    
##  cocoa_percent   company_location    final_grade                   bean_type  
##  Min.   :0.420   Length:1793        Min.   :1.000                       :887  
##  1st Qu.:0.700   Class :character   1st Qu.:3.000   Trinitario          :418  
##  Median :0.700   Mode  :character   Median :3.250   Criollo             :153  
##  Mean   :0.717                      Mean   :3.186   Forastero           : 87  
##  3rd Qu.:0.750                      3rd Qu.:3.500   Forastero (Nacional): 52  
##  Max.   :1.000                      Max.   :5.000   Blend               : 41  
##                                                     (Other)             :155  
##           broad_bean_origin final_grade_floor cocoa_percent_floor
##  Venezuela         :214     Min.   :1.000     Min.   :0.00000    
##  Ecuador           :193     1st Qu.:3.000     1st Qu.:0.00000    
##  Peru              :165     Median :3.000     Median :0.00000    
##  Madagascar        :145     Mean   :2.797     Mean   :0.01115    
##  Dominican Republic:141     3rd Qu.:3.000     3rd Qu.:0.00000    
##                    : 73     Max.   :5.000     Max.   :1.00000    
##  (Other)           :862

Partition

The following code splits the data into a training partition (80%) and a test partition (20%), stratified by bean_type.

# Splitting the data 80/20
set.seed(32022)

data_split <- initial_split(chocolate, prop = 0.8, strata = 'bean_type')
chocolate_train <- training(data_split)
chocolate_test <- testing(data_split)
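
To illustrate what stratifying a split means, here is a rough base R sketch on a hypothetical toy data frame (`toy` and its values are made up). It samples 80% of the rows within each group separately. It is only an illustration of the idea and is not a description of how `initial_split()` is implemented.

```r
set.seed(32022)

# Hypothetical toy data: 100 rows with an imbalanced grouping column
toy <- data.frame(
  id = 1:100,
  bean_type = rep(c("Trinitario", "Criollo", "Blend"), times = c(60, 30, 10))
)

# Sample 80% of the row ids within each bean_type separately
train_ids <- unlist(lapply(split(toy$id, toy$bean_type), function(ids) {
  ids[sample.int(length(ids), size = round(0.8 * length(ids)))]
}))

toy_train <- toy[toy$id %in% train_ids, ]
toy_test  <- toy[!toy$id %in% train_ids, ]

# Group proportions are the same in both partitions
print(prop.table(table(toy_train$bean_type)))
print(prop.table(table(toy_test$bean_type)))
```

Sampling within groups keeps rare bean types represented in both partitions, which a purely random split does not guarantee.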

SVM

The following code constructs the SVM model. It predicts final_grade from cocoa_percent and bean_type using a polynomial kernel.

svm_model <- svm(final_grade ~ cocoa_percent + bean_type,
                 data=chocolate_train,
                 kernel="polynomial",
                 scale=FALSE)

svm_model
## 
## Call:
## svm(formula = final_grade ~ cocoa_percent + bean_type, data = chocolate_train, 
##     kernel = "polynomial", scale = FALSE)
## 
## 
## Parameters:
##    SVM-Type:  eps-regression 
##  SVM-Kernel:  polynomial 
##        cost:  1 
##      degree:  3 
##       gamma:  0.02380952 
##      coef.0:  0 
##     epsilon:  0.1 
## 
## 
## Number of Support Vectors:  1282
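
The printed parameters can be read against the polynomial kernel form given in the e1071 documentation, (gamma * u'v + coef0)^degree. The sketch below writes that formula out in base R. The function name `poly_kernel` and the two feature vectors are hypothetical. The gamma value of 1/42 is used because it matches the printed 0.02380952, which would be consistent with a default of 1 / (data dimension), but that reading is an assumption.

```r
# Polynomial kernel in the form documented by e1071: (gamma * u'v + coef0)^degree
poly_kernel <- function(u, v, gamma, coef0 = 0, degree = 3) {
  (gamma * sum(u * v) + coef0)^degree
}

# Hypothetical feature vectors: cocoa fraction plus two dummy-coded bean types
u <- c(0.70, 1, 0)
v <- c(0.75, 1, 0)

k_uv <- poly_kernel(u, v, gamma = 1 / 42)
print(k_uv)
```

With unscaled inputs and a small gamma, kernel values like this one are very small, so trying other values of gamma and cost, or scale = TRUE, may be worth exploring.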

Analysis of Model

Next, we generate predictions on the test data and calculate the RMSE.

chocolate_test$pred <- predict(svm_model, newdata=chocolate_test)

rmse <- chocolate_test %>% 
  mutate(residual = final_grade - pred) %>%
  summarize(rmse = sqrt(mean(residual^2)))

The RMSE is 0.4565467.
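
An RMSE is easier to interpret next to a simple baseline. The sketch below is a minimal base R example on made-up grades and predictions (none of these numbers come from the chocolate data, and `rmse_vec` is a hypothetical helper name). It computes the RMSE by hand and compares it with always predicting the mean.

```r
# Root mean squared error between observed and predicted values
rmse_vec <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Hypothetical grades and predictions, for illustration only
actual    <- c(3.00, 3.50, 2.50, 4.00)
predicted <- c(3.25, 3.25, 2.75, 3.50)

model_rmse <- rmse_vec(actual, predicted)

# Baseline: always predict the mean of the observed grades
baseline_rmse <- rmse_vec(actual, rep(mean(actual), length(actual)))

print(c(model = model_rmse, baseline = baseline_rmse))
```

In a real comparison the baseline mean would be taken from the training partition and evaluated on the test partition.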

Summary

The SVM model had an RMSE of 0.4565467 on the test data, which was better than the random forest model from the previous homework. Because I was building regression models, and based on the reading materials, SVM was the preferred method for this exercise. The relationship did not look linear, so I used a polynomial kernel for the SVM model, which improved the RMSE.

Decision Tree vs SVM stated that SVM uses a “kernel trick to solve non-linear problems whereas decision trees derive hyper-rectangles in input space to solve the problem” and that “decision trees are better for categorical data and it deals with colinearity” better than SVM.
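
The hyper-rectangle description in the quote above can be made concrete with a single regression split. The sketch below is a simplified base R illustration on made-up data, not the rpart algorithm. It searches for the one threshold on `x` that minimizes the squared error when each side is predicted by its own mean. Repeating such axis-aligned cuts is what carves the input space into rectangles.

```r
# Toy data with a jump at x = 0.5 (hypothetical, for illustration only)
x <- c(0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9)
y <- c(1.0, 1.1, 0.9, 1.0, 3.0, 3.1, 2.9, 3.0)

# Sum of squared errors when each side of a threshold is predicted by its mean
split_sse <- function(threshold) {
  left  <- y[x <  threshold]
  right <- y[x >= threshold]
  sum((left - mean(left))^2) + sum((right - mean(right))^2)
}

# Candidate thresholds: midpoints between consecutive sorted x values
xs <- sort(x)
candidates <- (head(xs, -1) + tail(xs, -1)) / 2

# The best single cut is the candidate with the smallest error
best <- candidates[which.min(sapply(candidates, split_sse))]
print(best)
```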

With this homework, I would lean towards SVMs over decision trees for regression.