Perform an analysis of the dataset used in Homework #2 using the SVM algorithm. Compare the results with those from the previous homework.
Based on the articles:
https://www.hindawi.com/journals/complexity/2021/5550344/
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8137961/
Search for academic content (at least 3 articles) that compare the use of decision trees vs SVMs in your current area of expertise. Which algorithm is recommended to get more accurate results? Is it better for classification or regression scenarios? Do you agree with the recommendations? Why?
Format: R file & essay
For homework #2, I used the “Flavors of Cacao” dataset from Kaggle: https://www.kaggle.com/datasets/techiesid01/flavours-of-cacao?resource=download. In that homework, I sought to create a decision tree that would solve a regression problem: predicting the final grade of a cocoa bean.
The following code sets up our libraries and imports the dataset.
library(tidymodels)
## -- Attaching packages -------------------------------------- tidymodels 0.2.0 --
## v broom 0.7.12 v recipes 0.2.0
## v dials 0.1.1 v rsample 0.1.1
## v dplyr 1.0.8 v tibble 3.1.6
## v ggplot2 3.3.5 v tidyr 1.2.0
## v infer 1.0.0 v tune 0.2.0
## v modeldata 0.1.1 v workflows 0.2.6
## v parsnip 0.2.1 v workflowsets 0.2.1
## v purrr 0.3.4 v yardstick 0.0.9
## -- Conflicts ----------------------------------------- tidymodels_conflicts() --
## x purrr::discard() masks scales::discard()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x recipes::step() masks stats::step()
## * Search for functions across packages at https://www.tidymodels.org/find/
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v readr 2.1.2 v forcats 0.5.1
## v stringr 1.4.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x readr::col_factor() masks scales::col_factor()
## x purrr::discard() masks scales::discard()
## x dplyr::filter() masks stats::filter()
## x stringr::fixed() masks recipes::fixed()
## x dplyr::lag() masks stats::lag()
## x readr::spec() masks yardstick::spec()
library(e1071)
##
## Attaching package: 'e1071'
## The following object is masked from 'package:tune':
##
## tune
## The following object is masked from 'package:rsample':
##
## permutations
## The following object is masked from 'package:parsnip':
##
## tune
library(readr)
library(rpart.plot)
## Loading required package: rpart
##
## Attaching package: 'rpart'
## The following object is masked from 'package:dials':
##
## prune
library(ggplot2)
# Used for Random Forest
library(ranger)
chocolate <- read_csv('https://raw.githubusercontent.com/logicalschema/spring2022/main/data622/hw3/flavors_of_cacao.csv')
## Rows: 1795 Columns: 9
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (6): company, specific_bean_origin_or_bar_name, cocoa_percent, company_l...
## dbl (3): ref, review_date, final_grade
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
This is a view of the imported dataset.
names(chocolate)
## [1] "company" "specific_bean_origin_or_bar_name"
## [3] "ref" "review_date"
## [5] "cocoa_percent" "company_location"
## [7] "final_grade" "bean_type"
## [9] "broad_bean_origin"
head(chocolate)
The column cocoa_percent has a chr data type, so it needs to be converted to a numeric (double) value. The column ref was removed, and the columns broad_bean_origin and bean_type were converted to factor variables. I also dropped rows with missing values using drop_na.
For the SVM model, a new variable, final_grade_floor, was created by taking the floor of final_grade. This makes the grade an integer value for later use with the SVM model. A matching cocoa_percent_floor column was created in the same way.
chocolate <- chocolate %>% drop_na()
# Parses cocoa_percent (e.g. "70%") and converts it to a proportion between 0 and 1
chocolate$cocoa_percent <- parse_number(chocolate$cocoa_percent)/100
# Removes the ref column
chocolate <- chocolate[-c(3)]
# Convert some columns to factors
chocolate$broad_bean_origin <- as.factor(chocolate$broad_bean_origin )
chocolate$bean_type <- as.factor(chocolate$bean_type)
# Floored (integer-valued) versions of the grade and cocoa percent for later SVM use
chocolate$final_grade_floor <- floor(chocolate$final_grade)
chocolate$cocoa_percent_floor <- floor(chocolate$cocoa_percent)
This is a view of the data after our modification.
head(chocolate)
summary(chocolate)
## company specific_bean_origin_or_bar_name review_date
## Length:1793 Length:1793 Min. :2006
## Class :character Class :character 1st Qu.:2010
## Mode :character Mode :character Median :2013
## Mean :2012
## 3rd Qu.:2015
## Max. :2017
##
## cocoa_percent company_location final_grade bean_type
## Min. :0.420 Length:1793 Min. :1.000 :887
## 1st Qu.:0.700 Class :character 1st Qu.:3.000 Trinitario :418
## Median :0.700 Mode :character Median :3.250 Criollo :153
## Mean :0.717 Mean :3.186 Forastero : 87
## 3rd Qu.:0.750 3rd Qu.:3.500 Forastero (Nacional): 52
## Max. :1.000 Max. :5.000 Blend : 41
## (Other) :155
## broad_bean_origin final_grade_floor cocoa_percent_floor
## Venezuela :214 Min. :1.000 Min. :0.00000
## Ecuador :193 1st Qu.:3.000 1st Qu.:0.00000
## Peru :165 Median :3.000 Median :0.00000
## Madagascar :145 Mean :2.797 Mean :0.01115
## Dominican Republic:141 3rd Qu.:3.000 3rd Qu.:0.00000
## : 73 Max. :5.000 Max. :1.00000
## (Other) :862
The following code splits the data into training (80%) and testing (20%) partitions, stratified by bean_type.
# Splitting the data 80/20
set.seed(32022)
data_split <- initial_split(chocolate, prop = 0.8, strata = 'bean_type')
chocolate_train <- training(data_split)
chocolate_test <- testing(data_split)
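Because the split is stratified by bean_type, a quick check can confirm the two partitions keep similar bean_type proportions. This is a minimal sketch; the head() calls are only there to keep the output short.
# Sketch: compare bean_type proportions across the training and testing partitions
round(head(prop.table(table(chocolate_train$bean_type))), 3)
round(head(prop.table(table(chocolate_test$bean_type))), 3)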
The following code constructs the SVM model. It predicts final_grade from cocoa_percent and bean_type using a polynomial kernel.
svm_model <- svm(final_grade ~ cocoa_percent + bean_type,
data=chocolate_train,
kernel="polynomial",
scale=FALSE)
svm_model
##
## Call:
## svm(formula = final_grade ~ cocoa_percent + bean_type, data = chocolate_train,
## kernel = "polynomial", scale = FALSE)
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: polynomial
## cost: 1
## degree: 3
## gamma: 0.02380952
## coef.0: 0
## epsilon: 0.1
##
##
## Number of Support Vectors: 1282
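The model above uses the e1071 defaults (cost = 1, degree = 3). A hedged sketch of how these hyperparameters could be tuned with e1071::tune.svm follows; the parameter grid is an illustrative assumption, not the settings actually used in this homework.
# Sketch: grid search over cost and polynomial degree (illustrative values)
tuned <- tune.svm(final_grade ~ cocoa_percent + bean_type,
                  data = chocolate_train,
                  kernel = "polynomial",
                  cost = c(0.5, 1, 2),
                  degree = 2:4)
summary(tuned)
# The best model found by the grid search
tuned$best.model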
Next, we will use the model to predict on the test data and calculate the RMSE.
chocolate_test$pred <- predict(svm_model, newdata=chocolate_test)
rmse <- chocolate_test %>%
mutate(residual = final_grade - pred) %>%
summarize(rmse = sqrt(mean(residual^2)))
The RMSE is 0.4565467.
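As a cross-check, the same metric can be computed with yardstick (loaded as part of tidymodels); a minimal sketch:
# Sketch: yardstick::rmse computes RMSE from the truth and estimate columns
chocolate_test %>%
  yardstick::rmse(truth = final_grade, estimate = pred)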
At 0.4565467, the SVM model had a better RMSE score than the random forest model from the previous homework. Because I was building models for regression, and based on the reading materials, SVM was the preferred method for this exercise. The relationship between the predictors and final_grade did not look linear, so I used a polynomial kernel for the SVM model, which improved the RMSE score.
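To support the choice of kernel, the following hedged sketch refits the same model with a few common kernels and compares the test RMSE; the kernel list is an illustrative assumption, not an exhaustive search.
# Sketch: test RMSE for the same model under different kernels
kernels <- c("linear", "radial", "polynomial")
sapply(kernels, function(k) {
  m <- svm(final_grade ~ cocoa_percent + bean_type,
           data = chocolate_train, kernel = k, scale = FALSE)
  p <- predict(m, newdata = chocolate_test)
  sqrt(mean((chocolate_test$final_grade - p)^2))
})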
The article “Decision Tree vs SVM” stated that SVM uses a “kernel trick to solve non-linear problems whereas decision trees derive hyper-rectangles in input space to solve the problem” and that “decision trees are better for categorical data and it deals with collinearity” better than SVM.
Based on this homework, I would lean towards SVMs over decision trees for regression.
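As a like-for-like check of that conclusion, a minimal sketch fits a regression tree on the same two predictors (using rpart, already loaded via rpart.plot) and computes its test RMSE; the default rpart settings here are an assumption rather than the tuning used in homework #2.
# Sketch: regression tree on the same predictors for a direct RMSE comparison
tree_model <- rpart::rpart(final_grade ~ cocoa_percent + bean_type,
                           data = chocolate_train, method = "anova")
tree_pred <- predict(tree_model, newdata = chocolate_test)
sqrt(mean((chocolate_test$final_grade - tree_pred)^2))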