library(caret)
library(DataExplorer)
library(ggraph)
library(igraph)
library(knitr)
library(Matrix)
library(naivebayes)
library(naniar)
library(RColorBrewer)
library(scales)
library(stopwords)
library(tidytext)
library(tidyverse)
library(widyr)
library(xgboost)
We load two labeled text datasets of very different sizes, which we will use to train models that classify text by sentiment and by emotion.
- In the small dataset, social media messages from Twitter, Facebook, and Instagram have been classified as primarily conveying positive, negative, or neutral sentiment; our response variable for this dataset will be SENTIMENT.
- In the large dataset, Twitter messages have been classified as primarily conveying one of six emotions: sadness, joy, love, anger, fear, and surprise; our response variable for this dataset will be EMOTION.
While the only feature in the large dataset is the text of the message itself, the small dataset has additional features for the date and time each message was sent and for the platform used to send it, since the small dataset covers more social media platforms than the large dataset. For the sake of simplicity and later comparison, we will train models on the text features only.
my_url1 <- "https://raw.githubusercontent.com/geedoubledee/data622_homework1/main/sentiment_analysis.csv"
small_df <- read.csv(my_url1)
cols <- c("YEAR", "MONTH", "DAY", "TIME", "TXT", "SENTIMENT", "PLATFORM")
colnames(small_df) <- cols
char_cols <- colnames(small_df[, sapply(small_df, class) == "character"])
small_df_sub1 <- small_df |>
select(-all_of(char_cols))
small_df_sub2 <- small_df |>
select(all_of(char_cols)) |>
sapply(str_trim) |>
as.data.frame()
small_df <- small_df_sub1 |>
bind_cols(small_df_sub2) |>
rowid_to_column(var = "ID")
my_url2 <- "https://raw.githubusercontent.com/geedoubledee/data622_homework1/main/text_pt_"
x <- seq(1, 9)
y <- rep(".csv", 9)
files <- paste0(x, y)
cols <- c("TXT", "LAB")
large_df <- as.data.frame(matrix(nrow = 0, ncol = 2))
colnames(large_df) <- cols
large_df <- large_df |>
mutate(TXT = as.character(TXT),
LAB = as.integer(LAB))
for (f in files){
new_rows <- read.csv(paste0(my_url2, f))
colnames(new_rows) <- cols
large_df <- large_df |>
bind_rows(new_rows)
}
large_df <- large_df |>
rowid_to_column(var = "ID")
rm(new_rows)
my_url3 <- "https://raw.githubusercontent.com/geedoubledee/data622_homework1/main/text_label_map.csv"
txt_lab_map <- read.csv(my_url3)
cols <- c("KEY", "EMOTION")
colnames(txt_lab_map) <- cols
large_df <- large_df |>
left_join(txt_lab_map, by = join_by(LAB == KEY),
relationship = "many-to-one")
We look at summaries of the datasets to confirm the numbers of observations in each and whether there are any missing values to address.
remove <- c("total_observations", "memory_usage")
reorder <- c("rows", "complete_rows", "columns", "discrete_columns",
"continuous_columns", "all_missing_columns",
"total_missing_values")
introduce <- small_df |>
introduce() |>
select(-all_of(remove))
introduce <- introduce[, reorder]
knitr::kable(t(introduce), format = "simple", caption = "A summary introduction to the small dataset.")
| rows | 499 |
| complete_rows | 499 |
| columns | 8 |
| discrete_columns | 4 |
| continuous_columns | 4 |
| all_missing_columns | 0 |
| total_missing_values | 0 |
introduce <- large_df |>
introduce() |>
select(-all_of(remove))
introduce <- introduce[, reorder]
knitr::kable(t(introduce), format = "simple", caption = "A summary introduction to the large dataset.")
| rows | 416809 |
| complete_rows | 416809 |
| columns | 4 |
| discrete_columns | 2 |
| continuous_columns | 2 |
| all_missing_columns | 0 |
| total_missing_values | 0 |
There are fewer than 500 observations in the small dataset and over 400,000 observations in the large dataset. There are no missing values to address in either dataset.
Next we take a look at the distributions of our response variables:
- SENTIMENT in the small dataset
- EMOTION in the large dataset
cur_theme <- theme_set(theme_classic())
palette1 <- brewer.pal(8, "Dark2")
palette2 <- brewer.pal(11, "RdYlBu")
cols <- palette2[c(2, 6, 10)]
names(cols) <- c("negative", "neutral", "positive")
fils <- cols
obs <- nrow(small_df)
p1 <- small_df |>
ggplot(aes(x = SENTIMENT)) +
geom_histogram(aes(color = SENTIMENT, fill = SENTIMENT), stat = "count") +
geom_text(stat = "count", aes(label = paste0(round(
after_stat(count) / obs * 100, 1), "%")),
size = 4, color = "black", nudge_y = 8) +
scale_color_manual(values = cols) +
scale_fill_manual(values = fils) +
labs(title = "Distribution of SENTIMENT in the Small Dataset",
y = "COUNT") +
theme(legend.position = "none")
p1
In the small dataset, the most frequent SENTIMENT class
is neutral, followed by positive, then negative. There are some slight
class imbalances here, but since we’ll be doing text analysis, our input
variables will primarily be categorical, and using the SMOTE algorithm
to fix unbalanced classification problems is therefore not recommended.
None of the classes is so rare that we should be too worried, but this
issue combined with the small number of observations might affect the
predictive power of our models.
cols <- palette1[1:6]
emotions <- c("sadness", "joy", "love", "anger", "fear", "surprise")
names(cols) <- emotions
fils <- cols
obs <- nrow(large_df)
p2 <- large_df |>
ggplot(aes(x = EMOTION)) +
geom_histogram(aes(color = EMOTION, fill = EMOTION), stat = "count") +
geom_text(stat = "count", aes(label = paste0(round(
after_stat(count) / obs * 100, 1), "%")),
size = 4, color = "white", hjust = 1.1, fontface = "bold") +
scale_color_manual(values = cols) +
scale_fill_manual(values = fils) +
scale_y_continuous(labels = scales::comma) +
labs(title = "Distribution of EMOTION in the Large Dataset",
y = "COUNT") +
coord_flip() +
theme(legend.position = "none")
p2
In the large dataset, we see a worse class imbalance issue. Joy is
the most frequent EMOTION in the large dataset, and it
occurs nearly 10 times as often as the least frequent
EMOTION: surprise. This issue could again affect the
predictive power of our models since we can’t use SMOTE to correct it,
but the sheer number of observations may allow our models to overcome
it.
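We can verify that ratio directly from the class counts; a minimal sketch using the dplyr functions already loaded via tidyverse (the variable name emotion_counts is ours, not part of the original pipeline):
# Class counts and proportions for EMOTION, plus the ratio of the most
# frequent class (joy) to the least frequent (surprise).
emotion_counts <- large_df |>
    count(EMOTION, sort = TRUE) |>
    mutate(PROP = n / sum(n))
emotion_counts
max(emotion_counts$n) / min(emotion_counts$n)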
Now we are ready to tokenize our data for text analysis, creating the word features our models will use to predict sentiment and emotion.
To prepare the data, we first remove some non-text feature variables from the small dataset that we won’t be using to train our models.
remove <- c("YEAR", "MONTH", "DAY", "TIME", "PLATFORM")
small_df <- small_df |>
select(-all_of(remove))
Then we split the text variable in each dataset into its word components, and any punctuation, numbers, or stopwords are removed. We could go further by combining singular and plural versions of the same noun or the various tenses of the same verb into one token. We could also attempt to correct for misspellings. However, the text features as they are now will be sufficient for training our models. We do implement a word frequency cut-off for the large dataset, as its massive number of observations results in nearly 75,000 unique words. Removing words that only occur once in the corpus cuts the number of unique words by a little less than half, making it more manageable. We don’t implement this cut-off for the small dataset, as it is unnecessary.
small_df_tokens <- small_df |>
unnest_tokens(output = WORD, input = TXT, strip_punct = TRUE) |>
anti_join(stop_words, by = join_by(WORD == word)) |>
filter(!grepl('[0-9]', WORD))
small_df_tokens <- small_df_tokens |>
add_column(COUNT = 1)
large_df_tokens <- large_df |>
unnest_tokens(output = WORD, input = TXT, strip_punct = TRUE) |>
anti_join(stop_words, by = join_by(WORD == word)) |>
filter(!grepl('[0-9]', WORD))
large_df_tokens_summary <- large_df_tokens |>
group_by(WORD) |>
summarize(total = n())
low_freq <- large_df_tokens_summary[large_df_tokens_summary$total == 1, 1]
low_freq <- as.character(low_freq$WORD)
large_df_tokens <- large_df_tokens |>
filter(!WORD %in% low_freq)
large_df_tokens <- large_df_tokens |>
add_column(COUNT = 1)
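For reference, had we wanted the stemming step discussed above, one option would be to stem each token after removing stopwords. This is only a sketch of the unused option; it assumes the SnowballC package (not loaded above) is installed, and the variable name small_df_tokens_stemmed is hypothetical.
# Hedged sketch (not applied): collapse inflected forms with Porter stemming.
small_df_tokens_stemmed <- small_df_tokens |>
    mutate(WORD = SnowballC::wordStem(WORD, language = "english"))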
Finally, we pivot the WORD variable we’ve just created
into boolean matrices with all the words recorded in the text as
columns. Within said matrices, values of 1 indicate a word appears in
that text, and values of 0 indicate it does not. For the small dataset,
we could store this data as part of its original dataframe, but for the
large dataset, we have to compress this data into a sparse matrix, or we
run into a memory issue. So we create sparse matrices of the input
features for both datasets. We also separate the response variables into
vectors, as is required when using sparse matrices to build the
particular models we’ll be developing.
small_df_sparse_matrix <- small_df_tokens |>
cast_sparse(ID, WORD, COUNT)
sel <- c("ID", "SENTIMENT")
small_df_labels <- small_df_tokens |>
select(all_of(sel)) |>
distinct() |>
column_to_rownames(var = "ID")
small_df_y <- as.character(small_df_labels$SENTIMENT)
large_df_sparse_matrix <- large_df_tokens |>
cast_sparse(ID, WORD, COUNT)
sel <- c("ID", "EMOTION")
large_df_labels <- large_df_tokens |>
select(all_of(sel)) |>
distinct() |>
column_to_rownames(var = "ID")
large_df_y <- as.character(large_df_labels$EMOTION)
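To get a rough sense of what the sparse representation saves us, we can compare the size of the sparse object to a back-of-the-envelope estimate of the same matrix stored densely (a minimal sketch; figures vary by session, and we assume 8 bytes per dense double-precision cell):
# Size of the sparse matrix in memory vs. an estimate of its dense equivalent.
format(object.size(large_df_sparse_matrix), units = "MB")
prod(dim(large_df_sparse_matrix)) * 8 / 1024^3  # approximate dense size in GB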
The tokenization process has resulted in some observations being removed from both datasets. This is because they were composed entirely of stopwords and/or numbers, and their removal is fine for our purposes.
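A quick way to confirm how many observations were dropped from each dataset (a minimal sketch using the ID columns we created earlier):
# Original observations whose IDs no longer appear among the tokens, i.e. rows
# whose text consisted entirely of stopwords and/or numbers.
length(setdiff(small_df$ID, unique(small_df_tokens$ID)))
length(setdiff(large_df$ID, unique(large_df_tokens$ID)))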
Now that we have all the word features we will be using as our input variables, we can check for pairwise correlations among them. Below is a graph featuring words in the small dataset that occur three or more times and have a pairwise correlation greater than 0.6.
small_df_cor <- small_df_tokens |>
group_by(WORD) |>
filter(n() >= 3) |>
pairwise_cor(WORD, ID, upper = FALSE)
p3 <- small_df_cor |>
filter(correlation > .6) |>
graph_from_data_frame() |>
ggraph(layout = "fr") +
geom_edge_link(edge_color = palette1[8]) +
geom_node_point(color = palette1[3], size = 5) +
geom_node_text(aes(label = name), repel = TRUE) +
labs(title = "Words in Small Dataset Occurring 3+ Times with Pairwise Correlation > 0.6") +
theme_void()
p3
We can see some logical correlations, like “final,” “exam,” and “semester” being correlated because they’re words students would frequently use together.
Next is a graph featuring words in the large dataset that occur 50 or more times and have a pairwise correlation greater than 0.6.
large_df_cor <- large_df_tokens |>
group_by(WORD) |>
filter(n() >= 50) |>
pairwise_cor(WORD, ID, upper = FALSE) |>
filter(correlation > 0.6)
p4 <- large_df_cor |>
graph_from_data_frame() |>
ggraph(layout = "fr") +
geom_edge_link(edge_color = palette1[8]) +
geom_node_point(color = palette1[3], size = 5) +
geom_node_text(aes(label = name), repel = TRUE) +
labs(title = "Words in Large Dataset Occurring 50+ Times with Pairwise Correlation > 0.6") +
theme_void()
p4
The largest cluster here, as well as most of the other correlated words, all have to do with link text and are relatively uninteresting. One stand-out correlated pair is “roller” and “coaster” though.
Now we are ready to split the data into training and test sets.
set.seed(1006)
sample_set <- sample(nrow(small_df_sparse_matrix),
round(nrow(small_df_sparse_matrix) * 0.7),
replace = FALSE)
small_df_train_x <- small_df_sparse_matrix[sample_set, ]
small_df_train_y <- small_df_y[sample_set]
small_df_test_x <- small_df_sparse_matrix[-sample_set, ]
small_df_test_y <- small_df_y[-sample_set]
sample_set <- sample(nrow(large_df_sparse_matrix),
round(nrow(large_df_sparse_matrix) * 0.7),
replace = FALSE)
large_df_train_x <- large_df_sparse_matrix[sample_set, ]
large_df_train_y <- large_df_y[sample_set]
large_df_test_x <- large_df_sparse_matrix[-sample_set, ]
large_df_test_y <- large_df_y[-sample_set]
Finally, we build our models.
First, we train Multinomial Naive Bayes Classifier models on both datasets.
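For reference, with Laplace smoothing set to 1 (as in the calls below), the per-class word probabilities a multinomial naive Bayes classifier reports are, in the standard formulation (we assume the naivebayes package follows this estimator),

$$\hat{P}(w \mid c) = \frac{N_{wc} + 1}{N_c + |V|},$$

where $N_{wc}$ is the number of times word $w$ appears in training documents of class $c$, $N_c$ is the total word count across training documents of class $c$, and $|V|$ is the vocabulary size. A new document with word counts $x_w$ is then assigned the class that maximizes $\log P(c) + \sum_{w} x_w \log \hat{P}(w \mid c)$.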
mnb_mod_small <- multinomial_naive_bayes(small_df_train_x, small_df_train_y,
laplace = 1)
mnb_mod_small
##
## ============================ Multinomial Naive Bayes ============================
##
## Call:
## multinomial_naive_bayes(x = small_df_train_x, y = small_df_train_y,
## laplace = 1)
##
## ---------------------------------------------------------------------------------
##
## Laplace smoothing: 1
##
## ---------------------------------------------------------------------------------
##
## A priori probabilities:
## negative neutral positive
## 0.2680723 0.3584337 0.3734940
##
## ---------------------------------------------------------------------------------
##
## Classes
## Features negative neutral positive
## day 0.0039867110 0.0038834951 0.013017751
## dream 0.0006644518 0.0006472492 0.001775148
## feel 0.0039867110 0.0025889968 0.001775148
## miss 0.0039867110 0.0012944984 0.002366864
## sea 0.0006644518 0.0006472492 0.002366864
## beach 0.0006644518 0.0006472492 0.002366864
## angry 0.0006644518 0.0006472492 0.000591716
## attend 0.0013289037 0.0006472492 0.000591716
## class 0.0026578073 0.0006472492 0.000591716
## listening 0.0013289037 0.0012944984 0.001183432
##
## ---------------------------------------------------------------------------------
##
## # ... and 1072 more features
##
## ---------------------------------------------------------------------------------
mnb_mod_large <- multinomial_naive_bayes(large_df_train_x, large_df_train_y,
laplace = 1)
mnb_mod_large
##
## ============================ Multinomial Naive Bayes ============================
##
## Call:
## multinomial_naive_bayes(x = large_df_train_x, y = large_df_train_y,
## laplace = 1)
##
## ---------------------------------------------------------------------------------
##
## Laplace smoothing: 1
##
## ---------------------------------------------------------------------------------
##
## A priori probabilities:
## anger fear joy love sadness surprise
## 0.13767666 0.11474426 0.33825460 0.08270746 0.29078214 0.03583487
##
## ---------------------------------------------------------------------------------
##
## Classes
## Features anger fear joy love sadness
## feel 8.408473e-02 8.038500e-02 9.890695e-02 7.807355e-02 9.703709e-02
## helpless 1.783167e-04 3.628925e-03 2.602168e-05 3.828777e-05 1.836450e-03
## heavy 1.110274e-04 1.920273e-04 8.384762e-05 1.292212e-04 2.145141e-04
## hearted 4.710252e-05 3.527033e-05 2.457603e-05 1.148633e-04 4.534443e-05
## ive 4.215676e-03 4.146223e-03 4.127327e-03 3.498545e-03 5.919193e-03
## enjoyed 1.514010e-04 1.450002e-04 2.659994e-04 2.584425e-04 1.674256e-04
## slouch 3.364466e-06 3.918925e-06 2.891297e-06 4.785971e-06 3.488033e-06
## relax 1.547654e-04 1.763516e-04 1.749235e-04 7.178957e-05 1.063850e-04
## unwind 1.682233e-05 2.351355e-05 1.590214e-05 1.435791e-05 8.720084e-06
## frankly 1.177563e-04 9.013528e-05 6.794549e-05 2.871583e-05 8.894485e-05
## Classes
## Features surprise
## feel 6.017506e-02
## helpless 5.442424e-05
## heavy 7.256565e-05
## hearted 2.721212e-05
## ive 3.592000e-03
## enjoyed 1.723434e-04
## slouch 9.070706e-06
## relax 7.256565e-05
## unwind 9.070706e-06
## frankly 1.088485e-04
##
## ---------------------------------------------------------------------------------
##
## # ... and 40040 more features
##
## ---------------------------------------------------------------------------------
Next, we train Extreme Gradient Boosting (XGBoost) models on both datasets. The classes of the response variable have to be numeric for these models, so we make those coercions first.
small_df_train_y_num <- as.data.frame(small_df_train_y) |>
rename(SENTIMENT = small_df_train_y) |>
mutate(SENTIMENT = case_when(SENTIMENT == "negative" ~ 0,
SENTIMENT == "neutral" ~ 1,
.default = 2))
small_df_train_y_num <- as.integer(small_df_train_y_num$SENTIMENT)
small_df_test_y_num <- as.data.frame(small_df_test_y) |>
rename(SENTIMENT = small_df_test_y) |>
mutate(SENTIMENT = case_when(SENTIMENT == "negative" ~ 0,
SENTIMENT == "neutral" ~ 1,
.default = 2))
small_df_test_y_num <- as.integer(small_df_test_y_num$SENTIMENT)
xgb_mod_small <- xgboost(small_df_train_x, small_df_train_y_num, nrounds = 100,
objective = "multi:softmax", num_class = 3,
verbose = 0)
xgb_mod_small
## ##### xgb.Booster
## raw: 327.9 Kb
## call:
## xgb.train(params = params, data = dtrain, nrounds = nrounds,
## watchlist = watchlist, verbose = verbose, print_every_n = print_every_n,
## early_stopping_rounds = early_stopping_rounds, maximize = maximize,
## save_period = save_period, save_name = save_name, xgb_model = xgb_model,
## callbacks = callbacks, objective = "multi:softmax", num_class = 3)
## params (as set within xgb.train):
## objective = "multi:softmax", num_class = "3", validate_parameters = "TRUE"
## xgb.attributes:
## niter
## callbacks:
## cb.evaluation.log()
## # of features: 1082
## niter: 100
## nfeatures : 1082
## evaluation_log:
## iter train_mlogloss
## 1 1.034165
## 2 0.990067
## ---
## 99 0.560126
## 100 0.559474
large_df_train_y_num <- as.data.frame(large_df_train_y) |>
rename(EMOTION = large_df_train_y) |>
mutate(EMOTION = case_when(EMOTION == "sadness" ~ 0,
EMOTION == "joy" ~ 1,
EMOTION == "love" ~ 2,
EMOTION == "anger" ~ 3,
EMOTION == "fear" ~ 4,
.default = 5))
large_df_train_y_num <- as.integer(large_df_train_y_num$EMOTION)
large_df_test_y_num <- as.data.frame(large_df_test_y) |>
rename(EMOTION = large_df_test_y) |>
mutate(EMOTION = case_when(EMOTION == "sadness" ~ 0,
EMOTION == "joy" ~ 1,
EMOTION == "love" ~ 2,
EMOTION == "anger" ~ 3,
EMOTION == "fear" ~ 4,
.default = 5))
large_df_test_y_num <- as.integer(large_df_test_y_num$EMOTION)
xgb_mod_large <- xgboost(large_df_train_x, large_df_train_y_num, nrounds = 100,
objective = "multi:softmax", num_class = 6,
verbose = 0)
xgb_mod_large
## ##### xgb.Booster
## raw: 1.3 Mb
## call:
## xgb.train(params = params, data = dtrain, nrounds = nrounds,
## watchlist = watchlist, verbose = verbose, print_every_n = print_every_n,
## early_stopping_rounds = early_stopping_rounds, maximize = maximize,
## save_period = save_period, save_name = save_name, xgb_model = xgb_model,
## callbacks = callbacks, objective = "multi:softmax", num_class = 6)
## params (as set within xgb.train):
## objective = "multi:softmax", num_class = "6", validate_parameters = "TRUE"
## xgb.attributes:
## niter
## callbacks:
## cb.evaluation.log()
## # of features: 40050
## niter: 100
## nfeatures : 40050
## evaluation_log:
## iter train_mlogloss
## 1 1.6426637
## 2 1.5298370
## ---
## 99 0.4188904
## 100 0.4165744
We make predictions on the test data and construct confusion matrices for each of the models in order to calculate their predictive accuracy.
mnb_pred_small <- predict(mnb_mod_small, small_df_test_x, type = "class")
mnb_pred_tbl_small <- table(mnb_pred_small, small_df_test_y)
names(dimnames(mnb_pred_tbl_small)) <- c("Prediction", "Reference")
print("Confusion Matrix: Multinomial Naive Bayes Classifier: Small Test Data")
## [1] "Confusion Matrix: Multinomial Naive Bayes Classifier: Small Test Data"
mnb_pred_tbl_small
## Reference
## Prediction negative neutral positive
## negative 18 8 3
## neutral 17 40 7
## positive 6 16 27
xgb_pred_small <- predict(xgb_mod_small, small_df_test_x)
xgb_pred_tbl_small <- table(xgb_pred_small, small_df_test_y_num)
names(dimnames(xgb_pred_tbl_small)) <- c("Prediction", "Reference")
sent <- c("negative", "neutral", "positive")
dimnames(xgb_pred_tbl_small)$Prediction <- sent
dimnames(xgb_pred_tbl_small)$Reference <- sent
print("Confusion Matrix: XGBoost Model: Small Test Data")
## [1] "Confusion Matrix: XGBoost Model: Small Test Data"
xgb_pred_tbl_small
## Reference
## Prediction negative neutral positive
## negative 8 5 4
## neutral 31 46 15
## positive 2 13 18
mnb_pred_large <- predict(mnb_mod_large, large_df_test_x, type = "class")
mnb_pred_tbl_large <- table(mnb_pred_large, large_df_test_y)
names(dimnames(mnb_pred_tbl_large)) <- c("Prediction", "Reference")
print("Confusion Matrix: Multinomial Naive Bayes Classifier: Large Test Data")
## [1] "Confusion Matrix: Multinomial Naive Bayes Classifier: Large Test Data"
mnb_pred_tbl_large
## Reference
## Prediction anger fear joy love sadness surprise
## anger 14838 555 256 85 607 37
## fear 499 11851 328 63 497 1006
## joy 466 446 38751 2988 1192 790
## love 80 41 1339 6884 157 40
## sadness 1231 1035 1591 399 33834 277
## surprise 17 299 114 7 58 2368
xgb_pred_large <- predict(xgb_mod_large, large_df_test_x)
xgb_pred_tbl_large <- table(xgb_pred_large, large_df_test_y_num)
names(dimnames(xgb_pred_tbl_large)) <- c("Prediction", "Reference")
dimnames(xgb_pred_tbl_large)$Prediction <- emotions
dimnames(xgb_pred_tbl_large)$Reference <- emotions
print("Confusion Matrix: XGBoost Model: Large Test Data")
## [1] "Confusion Matrix: XGBoost Model: Large Test Data"
xgb_pred_tbl_large
## Reference
## Prediction sadness joy love anger fear surprise
## sadness 32704 539 100 931 337 31
## joy 1920 37706 816 584 370 91
## love 237 3033 9394 69 34 18
## anger 530 262 38 14730 198 7
## fear 760 314 48 752 12116 84
## surprise 194 525 30 65 1172 4287
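Overall accuracy can mask per-class weaknesses, especially given the class imbalance noted earlier. Since caret is already loaded, per-class sensitivity (recall) and positive predictive value (precision) can be pulled from any of these tables; a minimal sketch, shown for the Multinomial Naive Bayes predictions on the large test data (the variable name cm_mnb_large is ours):
# caret's confusionMatrix() accepts a table with predictions in the rows and
# reference labels in the columns, which matches how the tables above were built.
cm_mnb_large <- confusionMatrix(mnb_pred_tbl_large)
cm_mnb_large$byClass[, c("Sensitivity", "Pos Pred Value")]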
A summary of the models’ predictive accuracy on each of the test datasets is below.
mnb_pred_acc_small <- sum(diag(mnb_pred_tbl_small)) / nrow(small_df_test_x)
xgb_pred_acc_small <- sum(diag(xgb_pred_tbl_small)) / nrow(small_df_test_x)
mnb_pred_acc_large <- sum(diag(mnb_pred_tbl_large)) / nrow(large_df_test_x)
xgb_pred_acc_large <- sum(diag(xgb_pred_tbl_large)) / nrow(large_df_test_x)
pred_acc_df <- as.data.frame(matrix(data = c(mnb_pred_acc_small,
xgb_pred_acc_small,
mnb_pred_acc_large,
xgb_pred_acc_large),
nrow = 2, ncol = 2))
rownames(pred_acc_df) <- c("Multinomial Naive Bayes Classifier Model",
"Extreme Gradient Boost (XGBoost) Model")
colnames(pred_acc_df) <- c("Small Test Data Predictive Accuracy",
"Large Test Data Predictive Accuracy")
kable(pred_acc_df, format = "simple")
|  | Small Test Data Predictive Accuracy | Large Test Data Predictive Accuracy |
|---|---|---|
| Multinomial Naive Bayes Classifier Model | 0.5985915 | 0.8680275 |
| Extreme Gradient Boost (XGBoost) Model | 0.5070423 | 0.8873114 |
The Multinomial Naive Bayes Classifier has the highest predictive accuracy on the small test data, while the XGBoost Model has the highest predictive accuracy on the large test data.
Neither model performed particularly well on the small test data, but the Multinomial Naive Bayes Classifier was roughly 9 percentage points more accurate than the XGBoost Model, making it the clear winner there.
Unsurprisingly, both models benefited from having more training data. The jump in performance on the large test data was substantial for both models, but the better performer here was the XGBoost Model rather than the Multinomial Naive Bayes Classifier. It beat the Multinomial Naive Bayes Classifier by only roughly 2 percentage points, however, and it was much more computationally expensive to train. So even though the Multinomial Naive Bayes Classifier was outperformed by the XGBoost Model here, we were impressed by its accuracy given its speed.
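To put a rough number on that computational cost difference, we could time both training steps on the large dataset (a minimal sketch; a single wall-clock measurement is illustrative only and will vary with hardware):
# Rough wall-clock comparison of the two training calls used above.
mnb_time <- system.time(
    multinomial_naive_bayes(large_df_train_x, large_df_train_y, laplace = 1)
)
xgb_time <- system.time(
    xgboost(large_df_train_x, large_df_train_y_num, nrounds = 100,
            objective = "multi:softmax", num_class = 6, verbose = 0)
)
rbind(MNB = mnb_time, XGBoost = xgb_time)[, "elapsed", drop = FALSE]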