Load in Data

I am using a Kaggle data set that I have used before: a ramen ratings data set.

ramenDf <- read.csv("G:/Documents/DATA622_HW1/ramen-ratings.csv")

Remove the review number (an unnecessary key) and cast the star rating as numeric.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
ramenDf <- ramenDf %>%
  select(-c('Review..')) %>%
  mutate(Stars = as.numeric(Stars)) %>%
  relocate(Stars) %>%
  filter(!is.na(Stars))

Here I am generating some new features. I am creating dummy variables for the style (pack, tray, cup, bowl), and some features based on the ramen name: does it include the word “Instant”, “Spicy”, or “Curry”? I am creating a dummy variable for the most popular producer, Nissin. I am also creating the target variable for my classification task, ‘east_asia’, indicating whether the product was produced in China, Japan, South Korea, or Taiwan.

ramenDf <-
  ramenDf %>% mutate(pack = ifelse(Style == 'Pack', 1, 0),
                     tray = ifelse(Style == 'Tray', 1, 0),
                     cup = ifelse(Style == 'Cup', 1, 0),
                     bowl = ifelse(Style == 'Bowl', 1, 0))

ramenDf <-
  ramenDf %>% mutate(nissin = ifelse(Brand == 'Nissin', 1, 0))

ramenDf <- ramenDf %>% mutate(east_asia = as.numeric(ifelse(Country %in% c('China', 'Japan', 'South Korea', 'Taiwan'), 1, 0)))

ramenDf <- ramenDf %>%
  mutate(spicy = as.numeric(grepl('Spicy', Variety)),
         curry = as.numeric(grepl('Curry', Variety)),
         instant = as.numeric(grepl('Instant', Variety)))

ramenDf <- ramenDf %>% select(Stars, nissin, east_asia, spicy, curry, instant, pack, tray, cup, bowl)

Data Exploration

Here we are using the pointblank function ‘scan_data’ to explore the missingness, distributions, and correlations in the data set.

Some interesting correlations are revealed. The target ‘east_asia’ is positively correlated with the number of stars a product received and with the ‘bowl’ style. It is negatively correlated with the producer ‘Nissin’ and with having the word ‘Instant’ or ‘Curry’ in the name. These negative relationships make sense: Nissin produces ramen for a Western audience, and the words ‘Instant’ and ‘Curry’ are probably most appealing to those same people.
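The correlation panel itself is rendered graphically in the scan report, but since every remaining column in ramenDf is numeric, these relationships can also be spot-checked with base R's cor(). A minimal sketch:

# correlation of each feature with the target, largest first
round(sort(cor(ramenDf)[, 'east_asia'], decreasing = TRUE), 2)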

Because this data set came from Kaggle, it is very clean and has almost no missingness, which is not something you typically encounter in the real world.

pointblank::scan_data(ramenDf)

Overview of ramenDf

Table Overview

Columns          10
Rows             2,577
NAs              0
Duplicate Rows   2,049 (79.51%)
Column Types     numeric 10

Reproducibility Information

Scan Build Time      2024-05-12 20:22:27
pointblank Version   0.12.1
R Version            R version 4.3.2 (2023-10-31 ucrt)
Operating System     x86_64-w64-mingw32

Variables

Column      Distinct   NAs   Inf/-Inf   Mean   Minimum   Maximum
Stars       42         0     0          3.65   0         5
nissin      2          0     0          0.15   0         1
east_asia   2          0     0          0.41   0         1
spicy       2          0     0          0.1    0         1
curry       2          0     0          0.05   0         1
instant     2          0     0          0.18   0         1
pack        2          0     0          0.59   0         1
tray        2          0     0          0.04   0         1
cup         2          0     0          0.17   0         1
bowl        2          0     0          0.19   0         1

Interactions, Correlations, and Missing Values (graphical panels of the scan, not reproduced here)

Sample

Stars nissin east_asia spicy curry instant pack tray cup bowl
1 3.75 0 1 0 0 0 0 0 1 0
2 1.00 0 1 1 0 0 1 0 0 0
3 2.25 1 0 0 0 0 0 0 1 0
4 2.75 0 1 0 0 0 1 0 0 0
5 3.75 0 0 0 1 0 1 0 0 0
6..2572
2573 3.50 0 0 0 0 1 0 0 0 1
2574 1.00 0 0 0 0 1 1 0 0 0
2575 2.00 0 0 0 0 0 1 0 0 0
2576 2.00 0 0 0 0 0 1 0 0 0
2577 0.50 0 0 0 0 0 1 0 0 0

Create our first SVM model using a ‘linear’ kernel.

library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
## ✔ broom        1.0.5      ✔ rsample      1.2.1 
## ✔ dials        1.2.1      ✔ tibble       3.2.1 
## ✔ ggplot2      3.5.0      ✔ tidyr        1.3.1 
## ✔ infer        1.0.7      ✔ tune         1.2.0 
## ✔ modeldata    1.3.0      ✔ workflows    1.1.4 
## ✔ parsnip      1.2.1      ✔ workflowsets 1.1.0 
## ✔ purrr        1.0.2      ✔ yardstick    1.3.1 
## ✔ recipes      1.0.10
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter()  masks stats::filter()
## ✖ dplyr::lag()     masks stats::lag()
## ✖ recipes::step()  masks stats::step()
## • Learn how to get started at https://www.tidymodels.org/start/
set.seed(123)
data_split <- initial_split(ramenDf, prop = 0.75)
train_data <- training(data_split)
test_data <- testing(data_split)
library(e1071)
## Warning: package 'e1071' was built under R version 4.3.3
## 
## Attaching package: 'e1071'
## The following object is masked from 'package:tune':
## 
##     tune
## The following object is masked from 'package:rsample':
## 
##     permutations
## The following object is masked from 'package:parsnip':
## 
##     tune
# east_asia is numeric (0/1), so svm() fits a regression model here;
# the continuous predictions are rounded back to 0/1 to get class labels
svmfit = svm(east_asia ~ ., data = train_data , cost = 10, kernel = "linear", scale = TRUE)
svmpred = predict(svmfit, test_data[, -3])  # column 3 is east_asia, the target
tab <- table(pred = round(svmpred), true = test_data[,3])

With a linear kernel we get about 63% accuracy.

classAgreement(tab)
## $diag
## [1] 0.627907
## 
## $kappa
## [1] 0.1844906
## 
## $rand
## [1] 0.5319948
## 
## $crand
## [1] 0.05730295
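The ‘diag’ value reported by classAgreement() is the proportion of observations on the main diagonal of the confusion table, i.e. the overall accuracy. A minimal sketch of the same calculation done by hand on tab:

# share of predictions that fall on the diagonal (prediction == true label)
sum(diag(tab)) / sum(tab)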

Next let’s try some different kernels, starting with the ‘radial’ kernel.

svmfit = svm(east_asia ~ ., data = train_data , cost = 10, kernel = "radial", scale = TRUE)
svmpred = predict(svmfit, test_data[, -3])
tab <- table(pred = round(svmpred), true = test_data[,3])

The radial kernel gives us about 65% accuracy.

classAgreement(tab)
## $diag
## [1] 0.6465116
## 
## $kappa
## [1] 0.2725924
## 
## $rand
## [1] 0.5422216
## 
## $crand
## [1] 0.08379865

Next, let’s try the ‘polynomial’ kernel.

svmfit = svm(east_asia ~ ., data = train_data , cost = 10, kernel = "polynomial", scale = TRUE)
svmpred = predict(svmfit, test_data[, -3])
tab <- table(pred = round(svmpred), true = test_data[,3])

The polynomial kernel gives us only about 25% accuracy, far worse than the other kernels.

classAgreement(tab)
## $diag
## [1] 0.2465116
## 
## $kappa
## [1] -0.08100559
## 
## $rand
## [1] 0.5393278
## 
## $crand
## [1] 0.07546902

Next, let’s train the Random Forest model from HW2 on the same data.

# casting east_asia to a factor makes randomForest fit a classification (not regression) forest
rffit <- randomForest::randomForest(as.factor(east_asia) ~ ., data = train_data)
rfpred <- predict(rffit, test_data[, -3])
tab <- table(pred = rfpred, true = as.factor(test_data[,3]))

We get an accuracy of about 64%.

classAgreement(tab)
## $diag
## [1] 0.6418605
## 
## $kappa
## [1] 0.2549691
## 
## $rand
## [1] 0.5395349
## 
## $crand
## [1] 0.07781945

Result

SVM with a radial kernel gave us the highest accuracy, about 65%, with a kappa of 0.27. Random Forest gave us an accuracy of about 64% and a kappa of 0.25.
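For context, only about 41% of products come from East Asia (the ‘east_asia’ mean in the scan above), so always predicting the majority class would already score roughly 59%. A minimal sketch of that baseline on the test set (not run as part of the analysis above):

# accuracy of always guessing the more common class in the test set
max(prop.table(table(test_data$east_asia)))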

Conclusion

Two out of the three papers found that Random Forest outperformed SVM for product-origin prediction. Interestingly, only one of the three papers was outside a binary classification setting, and it still found Random Forest to give better accuracy. I found a very modest performance advantage for SVM. That said, I think the performance of both models is currently below the accuracy I would be happy with in a real application. I would go back and explore more feature engineering to see if I could improve accuracy before choosing a model type.