Load in Data

I am using a Kaggle data set that I have used before: a ramen ratings data set.

ramenDf <- read.csv("G:/Documents/DATA622_HW1/ramen-ratings.csv")

Remove the review number (an unnecessary key) and cast the star rating as numeric.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
ramenDf <- ramenDf %>%
  select(-c('Review..')) %>%
  mutate(Stars = as.numeric(Stars)) %>%
  relocate(Stars) %>%
  filter(!is.na(Stars))

Here I am generating some new features. I am creating dummy variables for the style (pack, tray, cup, bowl), and some features based on the ramen name: does it include the word “Instant”, “Spicy”, or “Curry”? I am creating a dummy variable for the most popular producer, Nissin. I am also creating the target variable for my classification task, ‘east_asia’, indicating whether the product was produced in China, Japan, South Korea, or Taiwan.

ramenDf <-
  ramenDf %>% mutate(pack = ifelse(Style == 'Pack', 1, 0),
                     tray = ifelse(Style == 'Tray', 1, 0),
                     cup = ifelse(Style == 'Cup', 1, 0),
                     bowl = ifelse(Style == 'Bowl', 1, 0))

ramenDf <-
  ramenDf %>% mutate(nissin = ifelse(Brand == 'Nissin', 1, 0))

ramenDf <- ramenDf %>% mutate(east_asia = as.numeric(ifelse(Country %in% c('China', 'Japan', 'South Korea', 'Taiwan'), 1, 0)))

ramenDf <- ramenDf %>%
  mutate(spicy = as.numeric(grepl('Spicy', Variety)),
         curry = as.numeric(grepl('Curry', Variety)),
         instant = as.numeric(grepl('Instant', Variety)))

ramenDf <- ramenDf %>% select(Stars, nissin, east_asia, spicy, curry, instant, pack, tray, cup, bowl)

Data Exploration

Here we are using the pointblank function ‘scan_data’ to explore the missingness, distributions, and correlations in the data set.

Some interesting correlations are revealed. The target ‘east_asia’ is positively correlated with the number of stars a product received and with the ‘bowl’ style. It is negatively correlated with the producer ‘Nissin’ and with having the word ‘Instant’ or ‘Curry’ in the name. These negative relationships make sense: Nissin produces ramen for a Western audience, and the words ‘Instant’ and ‘Curry’ are probably most appealing to those same people.
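The correlation panel itself is rendered graphically in the scan report, but since every remaining column in ramenDf is numeric, these relationships can also be spot-checked with base R's cor(). A minimal sketch:

# correlation of each feature with the target, largest first
round(sort(cor(ramenDf)[, 'east_asia'], decreasing = TRUE), 2)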

Because this data set came from Kaggle, it is very clean and has almost no missingness, which is not something you typically encounter in the real world.

pointblank::scan_data(ramenDf)

Overview of ramenDf

Table Overview

Columns          10
Rows             2,577
NAs              0
Duplicate Rows   2,049 (79.51%)
Column Types     numeric 10

Reproducibility Information

Scan Build Time      2024-05-12 20:22:27
pointblank Version   0.12.1
R Version            R version 4.3.2 (2023-10-31 ucrt)
Operating System     x86_64-w64-mingw32

Variables

Column      Distinct   NAs   Inf/-Inf   Mean   Minimum   Maximum
Stars       42         0     0          3.65   0         5
nissin      2          0     0          0.15   0         1
east_asia   2          0     0          0.41   0         1
spicy       2          0     0          0.1    0         1
curry       2          0     0          0.05   0         1
instant     2          0     0          0.18   0         1
pack        2          0     0          0.59   0         1
tray        2          0     0          0.04   0         1
cup         2          0     0          0.17   0         1
bowl        2          0     0          0.19   0         1

Interactions, Correlations, and Missing Values (graphical panels of the scan, not reproduced here)

Sample

Stars nissin east_asia spicy curry instant pack tray cup bowl
1 3.75 0 1 0 0 0 0 0 1 0
2 1.00 0 1 1 0 0 1 0 0 0
3 2.25 1 0 0 0 0 0 0 1 0
4 2.75 0 1 0 0 0 1 0 0 0
5 3.75 0 0 0 1 0 1 0 0 0
6..2572
2573 3.50 0 0 0 0 1 0 0 0 1
2574 1.00 0 0 0 0 1 1 0 0 0
2575 2.00 0 0 0 0 0 1 0 0 0
2576 2.00 0 0 0 0 0 1 0 0 0
2577 0.50 0 0 0 0 0 1 0 0 0

Create our first SVM model using a ‘linear’ kernel.

library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
## ✔ broom        1.0.5      ✔ rsample      1.2.1 
## ✔ dials        1.2.1      ✔ tibble       3.2.1 
## ✔ ggplot2      3.5.0      ✔ tidyr        1.3.1 
## ✔ infer        1.0.7      ✔ tune         1.2.0 
## ✔ modeldata    1.3.0      ✔ workflows    1.1.4 
## ✔ parsnip      1.2.1      ✔ workflowsets 1.1.0 
## ✔ purrr        1.0.2      ✔ yardstick    1.3.1 
## ✔ recipes      1.0.10
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter()  masks stats::filter()
## ✖ dplyr::lag()     masks stats::lag()
## ✖ recipes::step()  masks stats::step()
## • Learn how to get started at https://www.tidymodels.org/start/
set.seed(123)
data_split <- initial_split(ramenDf, prop = 0.75)
train_data <- training(data_split)
test_data <- testing(data_split)
library(e1071)
## Warning: package 'e1071' was built under R version 4.3.3
## 
## Attaching package: 'e1071'
## The following object is masked from 'package:tune':
## 
##     tune
## The following object is masked from 'package:rsample':
## 
##     permutations
## The following object is masked from 'package:parsnip':
## 
##     tune
# east_asia is numeric (0/1), so svm() fits a regression model here;
# the continuous predictions are rounded back to 0/1 to get class labels
svmfit = svm(east_asia ~ ., data = train_data , cost = 10, kernel = "linear", scale = TRUE)
svmpred = predict(svmfit, test_data[, -3])  # column 3 is east_asia, the target
tab <- table(pred = round(svmpred), true = test_data[,3])

With a linear kernel we get about 63% accuracy.

classAgreement(tab)
## $diag
## [1] 0.627907
## 
## $kappa
## [1] 0.1844906
## 
## $rand
## [1] 0.5319948
## 
## $crand
## [1] 0.05730295
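The ‘diag’ value reported by classAgreement() is the proportion of observations on the main diagonal of the confusion table, i.e. the overall accuracy. A minimal sketch of the same calculation done by hand on tab:

# share of predictions that fall on the diagonal (prediction == true label)
sum(diag(tab)) / sum(tab)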

Next let’s try some different kernels, starting with the ‘radial’ kernel.

svmfit = svm(east_asia ~ ., data = train_data , cost = 10, kernel = "radial", scale = TRUE)
svmpred = predict(svmfit, test_data[, -3])
tab <- table(pred = round(svmpred), true = test_data[,3])

The radial kernel gives us about 65% accuracy.

classAgreement(tab)
## $diag
## [1] 0.6465116
## 
## $kappa
## [1] 0.2725924
## 
## $rand
## [1] 0.5422216
## 
## $crand
## [1] 0.08379865

Next, let’s try the ‘polynomial’ kernel.

svmfit = svm(east_asia ~ ., data = train_data , cost = 10, kernel = "polynomial", scale = TRUE)
svmpred = predict(svmfit, test_data[, -3])
tab <- table(pred = round(svmpred), true = test_data[,3])

The polynomial kernel gives us only about 25% accuracy, far worse than the other kernels.

classAgreement(tab)
## $diag
## [1] 0.2465116
## 
## $kappa
## [1] -0.08100559
## 
## $rand
## [1] 0.5393278
## 
## $crand
## [1] 0.07546902

Next, let’s train the Random Forest model from HW2 on the same data.

# casting east_asia to a factor makes randomForest fit a classification (not regression) forest
rffit <- randomForest::randomForest(as.factor(east_asia) ~ ., data = train_data)
rfpred <- predict(rffit, test_data[, -3])
tab <- table(pred = rfpred, true = as.factor(test_data[,3]))

We get an accuracy of about 64%.

classAgreement(tab)
## $diag
## [1] 0.6418605
## 
## $kappa
## [1] 0.2549691
## 
## $rand
## [1] 0.5395349
## 
## $crand
## [1] 0.07781945

Result

SVM with a radial kernel gave us the highest accuracy, about 65%, with a kappa of 0.27. Random Forest gave us an accuracy of about 64% and a kappa of 0.25.
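For context, only about 41% of products come from East Asia (the ‘east_asia’ mean in the scan above), so always predicting the majority class would already score roughly 59%. A minimal sketch of that baseline on the test set (not run as part of the analysis above):

# accuracy of always guessing the more common class in the test set
max(prop.table(table(test_data$east_asia)))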

Conclusion

Two out of the three papers found that Random Forest outperformed SVM for product-origin prediction. Interestingly, only one of the three papers was outside a binary classification setting, and it still found Random Forest to give better accuracy. I found a very modest performance advantage for SVM. That said, I think the performance of both models is currently below the accuracy I would be happy with in a real application. I would go back and explore more feature engineering to see if I could improve accuracy before choosing a model type.