Here we are using the pointblank function ‘scan_data’ to explore the
missingness, distributions, and correlations in the data set.
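A minimal sketch of that call (assuming the data is already loaded as
‘ramenDf’, the name used in the modeling code below):
library(pointblank)
# Produces an HTML report; the default sections cover an overview, the
# variables, interactions, correlations, missingness, and a sample of rows
scan_data(ramenDf)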
Some interesting correlations are revealed. The target ‘east_asia’ is
positively correlated with the number of stars a product received and
with the style ‘bowl’. It is negatively correlated with the producer
‘Nissin’ and with having the words ‘Instant’ or ‘Curry’ in the product
name. These negative relationships make sense: Nissin produces ramen
for a Western audience, and the words ‘Instant’ and ‘Curry’ are
probably most appealing to those same consumers.
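We can spot-check one of these relationships directly (the star-rating
column name ‘stars’ is an assumption here; adjust it to match the
actual data frame):
cor(ramenDf$east_asia, ramenDf$stars, use = "complete.obs")  # column name assumed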
Because this dataset came from Kaggle, it is very clean and has almost
no missingness, which is not something you often encounter in the real
world.
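A quick way to verify that is to count the NAs in each column:
colSums(is.na(ramenDf))  # expect (almost) all zeros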
Now let’s create our first SVM model, using a ‘linear’ kernel.
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
## ✔ broom 1.0.5 ✔ rsample 1.2.1
## ✔ dials 1.2.1 ✔ tibble 3.2.1
## ✔ ggplot2 3.5.0 ✔ tidyr 1.3.1
## ✔ infer 1.0.7 ✔ tune 1.2.0
## ✔ modeldata 1.3.0 ✔ workflows 1.1.4
## ✔ parsnip 1.2.1 ✔ workflowsets 1.1.0
## ✔ purrr 1.0.2 ✔ yardstick 1.3.1
## ✔ recipes 1.0.10
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ recipes::step() masks stats::step()
## • Learn how to get started at https://www.tidymodels.org/start/
set.seed(123)
# Hold out 25% of the rows for testing
data_split <- initial_split(ramenDf, prop = 0.75)
train_data <- training(data_split)
test_data <- testing(data_split)
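Since we did not pass a strata argument to initial_split(), the split
is a simple random sample, so it is worth checking that the target is
similarly balanced in both pieces:
prop.table(table(train_data$east_asia))  # class balance in the training set
prop.table(table(test_data$east_asia))   # should be roughly the same here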
library(e1071)
## Warning: package 'e1071' was built under R version 4.3.3
##
## Attaching package: 'e1071'
## The following object is masked from 'package:tune':
##
## tune
## The following object is masked from 'package:rsample':
##
## permutations
## The following object is masked from 'package:parsnip':
##
## tune
# east_asia is numeric (0/1), so svm() fits in regression mode; we round
# the predictions back to 0/1 before tabulating
svmfit <- svm(east_asia ~ ., data = train_data, cost = 10, kernel = "linear", scale = TRUE)
svmpred <- predict(svmfit, test_data[, -3])  # column 3 holds the target
tab <- table(pred = round(svmpred), true = test_data[, 3])
With the linear kernel we get about 63% accuracy.
classAgreement(tab)
## $diag
## [1] 0.627907
##
## $kappa
## [1] 0.1844906
##
## $rand
## [1] 0.5319948
##
## $crand
## [1] 0.05730295
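The $diag component is the fraction of the confusion table on the main
diagonal, i.e. plain accuracy, which we could also compute directly:
mean(round(svmpred) == test_data[, 3])  # same quantity as $diag above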
Next, let’s try some different kernels, starting with the ‘radial’
kernel.
svmfit <- svm(east_asia ~ ., data = train_data, cost = 10, kernel = "radial", scale = TRUE)
svmpred <- predict(svmfit, test_data[, -3])
tab <- table(pred = round(svmpred), true = test_data[, 3])
The radial kernel does a little better, at about 65% accuracy.
classAgreement(tab)
## $diag
## [1] 0.6465116
##
## $kappa
## [1] 0.2725924
##
## $rand
## [1] 0.5422216
##
## $crand
## [1] 0.08379865
Next, let’s try the ‘polynomial’ kernel.
svmfit <- svm(east_asia ~ ., data = train_data, cost = 10, kernel = "polynomial", scale = TRUE)
svmpred <- predict(svmfit, test_data[, -3])
tab <- table(pred = round(svmpred), true = test_data[, 3])
The polynomial kernel does far worse, at about 25% accuracy; the
negative kappa below shows the agreement is worse than chance.
classAgreement(tab)
## $diag
## [1] 0.2465116
##
## $kappa
## [1] -0.08100559
##
## $rand
## [1] 0.5393278
##
## $crand
## [1] 0.07546902
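Note that cost = 10 was picked arbitrarily for all three kernels. A
grid search with e1071’s tune() could be used to choose cost (and gamma
for the radial kernel); the grid below is only an illustrative sketch:
# Illustrative grid search; these values are not tuned recommendations.
# Namespaced to e1071::tune to avoid any confusion with the tune package.
tuned <- e1071::tune(svm, east_asia ~ ., data = train_data, kernel = "radial",
                     ranges = list(cost = c(0.1, 1, 10, 100),
                                   gamma = c(0.01, 0.1, 1)))
tuned$best.parameters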
Finally, let’s train the Random Forest model from HW2 on the same data.
# The forest needs a factor response to run in classification mode
rffit <- randomForest::randomForest(as.factor(east_asia) ~ ., data = train_data)
rfpred <- predict(rffit, test_data[, -3])
tab <- table(pred = rfpred, true = as.factor(test_data[, 3]))
We get an accuracy of about 64%, essentially matching the radial-kernel SVM.
classAgreement(tab)
## $diag
## [1] 0.6418605
##
## $kappa
## [1] 0.2549691
##
## $rand
## [1] 0.5395349
##
## $crand
## [1] 0.07781945
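Collecting the $diag values from the classAgreement() outputs above
gives a quick side-by-side comparison:
# Accuracies copied from the outputs above
data.frame(model = c("SVM (linear)", "SVM (radial)", "SVM (polynomial)", "Random Forest"),
           accuracy = c(0.628, 0.647, 0.247, 0.642))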