library(caret)
library(DataExplorer)
library(ggraph)
library(igraph)
library(knitr)
library(Matrix)
library(naivebayes)
library(naniar)
library(RColorBrewer)
library(scales)
library(stopwords)
library(tidytext)
library(tidyverse)
library(widyr)
library(xgboost)
We load two labeled text datasets of very different sizes, which we will use to train models that classify text by sentiment and by emotion.
- In the small dataset, social media messages from Twitter, Facebook, and Instagram have been classified as primarily conveying positive, negative, or neutral sentiment; our response variable for this dataset will be SENTIMENT.
- In the large dataset, Twitter messages have been classified as primarily conveying one of six emotions: sadness, joy, love, anger, fear, and surprise; our response variable for this dataset will be EMOTION.
While the only feature in the large dataset is the text of the message itself, the small dataset has additional features for the date and time each message was sent and for the platform used to send it, since the small dataset covers more social media platforms than the large dataset. For the sake of simplicity and later comparison, we will train models on the text features only.
my_url1 <- "https://raw.githubusercontent.com/geedoubledee/data622_homework1/main/sentiment_analysis.csv"
small_df <- read.csv(my_url1)
cols <- c("YEAR", "MONTH", "DAY", "TIME", "TXT", "SENTIMENT", "PLATFORM")
colnames(small_df) <- cols
char_cols <- colnames(small_df[, sapply(small_df, class) == "character"])
small_df_sub1 <- small_df |>
select(-all_of(char_cols))
small_df_sub2 <- small_df |>
select(all_of(char_cols)) |>
sapply(str_trim) |>
as.data.frame()
small_df <- small_df_sub1 |>
bind_cols(small_df_sub2) |>
rowid_to_column(var = "ID")
my_url2 <- "https://raw.githubusercontent.com/geedoubledee/data622_homework1/main/text_pt_"
x <- seq(1, 9)
y <- rep(".csv", 9)
files <- paste0(x, y)
cols <- c("TXT", "LAB")
large_df <- as.data.frame(matrix(nrow = 0, ncol = 2))
colnames(large_df) <- cols
large_df <- large_df |>
mutate(TXT = as.character(TXT),
LAB = as.integer(LAB))
for (f in files){
new_rows <- read.csv(paste0(my_url2, f))
colnames(new_rows) <- cols
large_df <- large_df |>
bind_rows(new_rows)
}
large_df <- large_df |>
rowid_to_column(var = "ID")
rm(new_rows)
my_url3 <- "https://raw.githubusercontent.com/geedoubledee/data622_homework1/main/text_label_map.csv"
txt_lab_map <- read.csv(my_url3)
cols <- c("KEY", "EMOTION")
colnames(txt_lab_map) <- cols
large_df <- large_df |>
left_join(txt_lab_map, by = join_by(LAB == KEY),
relationship = "many-to-one")
We look at summaries of the datasets to confirm the numbers of observations in each and whether there are any missing values to address.
remove <- c("total_observations", "memory_usage")
reorder <- c("rows", "complete_rows", "columns", "discrete_columns",
"continuous_columns", "all_missing_columns",
"total_missing_values")
introduce <- small_df |>
introduce() |>
select(-all_of(remove))
introduce <- introduce[, reorder]
knitr::kable(t(introduce), format = "simple", caption = "A summary introduction to the small dataset.")
| rows | 499 |
| complete_rows | 499 |
| columns | 8 |
| discrete_columns | 4 |
| continuous_columns | 4 |
| all_missing_columns | 0 |
| total_missing_values | 0 |
introduce <- large_df |>
introduce() |>
select(-all_of(remove))
introduce <- introduce[, reorder]
knitr::kable(t(introduce), format = "simple", caption = "A summary introduction to the large dataset.")
| rows | 416809 |
| complete_rows | 416809 |
| columns | 4 |
| discrete_columns | 2 |
| continuous_columns | 2 |
| all_missing_columns | 0 |
| total_missing_values | 0 |
There are fewer than 500 observations in the small dataset and over 400,000 observations in the large dataset. There are no missing values to address in either dataset.
Next we take a look at the distributions of our response variables:
- SENTIMENT in the small dataset
- EMOTION in the large dataset
cur_theme <- theme_set(theme_classic())
palette1 <- brewer.pal(8, "Dark2")
palette2 <- brewer.pal(11, "RdYlBu")
cols <- palette2[c(2, 6, 10)]
names(cols) <- c("negative", "neutral", "positive")
fils <- cols
obs <- nrow(small_df)
p1 <- small_df |>
ggplot(aes(x = SENTIMENT)) +
geom_histogram(aes(color = SENTIMENT, fill = SENTIMENT), stat = "count") +
geom_text(stat = "count", aes(label = paste0(round(
after_stat(count) / obs * 100, 1), "%")),
size = 4, color = "black", nudge_y = 8) +
scale_color_manual(values = cols) +
scale_fill_manual(values = fils) +
labs(title = "Distribution of SENTIMENT in the Small Dataset",
y = "COUNT") +
theme(legend.position = "none")
p1
In the small dataset, the most frequent SENTIMENT class
is neutral, followed by positive, then negative. There are some slight
class imbalances here, but since we’ll be doing text analysis, our input
variables will primarily be categorical, and using the SMOTE algorithm
to fix unbalanced classification problems is therefore not recommended.
None of the classes is so rare that we should be too worried, but this
issue combined with the small number of observations might affect the
predictive power of our models.
cols <- palette1[1:6]
emotions <- c("sadness", "joy", "love", "anger", "fear", "surprise")
names(cols) <- emotions
fils <- cols
obs <- nrow(large_df)
p2 <- large_df |>
ggplot(aes(x = EMOTION)) +
geom_histogram(aes(color = EMOTION, fill = EMOTION), stat = "count") +
geom_text(stat = "count", aes(label = paste0(round(
after_stat(count) / obs * 100, 1), "%")),
size = 4, color = "white", hjust = 1.1, fontface = "bold") +
scale_color_manual(values = cols) +
scale_fill_manual(values = fils) +
scale_y_continuous(labels = scales::comma) +
labs(title = "Distribution of EMOTION in the Large Dataset",
y = "COUNT") +
coord_flip() +
theme(legend.position = "none")
p2
In the large dataset, we see a worse class imbalance issue. Joy is
the most frequent EMOTION in the large dataset, and it
occurs nearly 10 times as often as the least frequent
EMOTION: surprise. This issue could again affect the
predictive power of our models since we can’t use SMOTE to correct it,
but the sheer number of observations may allow our models to overcome
it.
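We can verify that ratio directly from the class counts; a minimal sketch using the dplyr functions already loaded via tidyverse (the variable name emotion_counts is ours, not part of the original pipeline):
# Class counts and proportions for EMOTION, plus the ratio of the most
# frequent class (joy) to the least frequent (surprise).
emotion_counts <- large_df |>
    count(EMOTION, sort = TRUE) |>
    mutate(PROP = n / sum(n))
emotion_counts
max(emotion_counts$n) / min(emotion_counts$n)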
Now we are ready to tokenize our data for text analysis, creating the word features our models will use to predict sentiment and emotion.
To prepare the data, we first remove some non-text feature variables from the small dataset that we won’t be using to train our models.
remove <- c("YEAR", "MONTH", "DAY", "TIME", "PLATFORM")
small_df <- small_df |>
select(-all_of(remove))
Then we split the text variable in each dataset into its word components, and any punctuation, numbers, or stopwords are removed. We could go further by combining singular and plural versions of the same noun or the various tenses of the same verb into one token. We could also attempt to correct for misspellings. However, the text features as they are now will be sufficient for training our models. We do implement a word frequency cut-off for the large dataset, as its massive number of observations results in nearly 75,000 unique words. Removing words that only occur once in the corpus cuts the number of unique words by a little less than half, making it more manageable. We don’t implement this cut-off for the small dataset, as it is unnecessary.
small_df_tokens <- small_df |>
unnest_tokens(output = WORD, input = TXT, strip_punct = TRUE) |>
anti_join(stop_words, by = join_by(WORD == word)) |>
filter(!grepl('[0-9]', WORD))
small_df_tokens <- small_df_tokens |>
add_column(COUNT = 1)
large_df_tokens <- large_df |>
unnest_tokens(output = WORD, input = TXT, strip_punct = TRUE) |>
anti_join(stop_words, by = join_by(WORD == word)) |>
filter(!grepl('[0-9]', WORD))
large_df_tokens_summary <- large_df_tokens |>
group_by(WORD) |>
summarize(total = n())
low_freq <- large_df_tokens_summary[large_df_tokens_summary$total == 1, 1]
low_freq <- as.character(low_freq$WORD)
large_df_tokens <- large_df_tokens |>
filter(!WORD %in% low_freq)
large_df_tokens <- large_df_tokens |>
add_column(COUNT = 1)
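For reference, had we wanted the stemming step discussed above, one option would be to stem each token after removing stopwords. This is only a sketch of the unused option; it assumes the SnowballC package (not loaded above) is installed, and the variable name small_df_tokens_stemmed is hypothetical.
# Hedged sketch (not applied): collapse inflected forms with Porter stemming.
small_df_tokens_stemmed <- small_df_tokens |>
    mutate(WORD = SnowballC::wordStem(WORD, language = "english"))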
Finally, we pivot the WORD variable we’ve just created
into boolean matrices with all the words recorded in the text as
columns. Within said matrices, values of 1 indicate a word appears in
that text, and values of 0 indicate it does not. For the small dataset,
we could store this data as part of its original dataframe, but for the
large dataset, we have to compress this data into a sparse matrix, or we
run into a memory issue. So we create sparse matrices of the input
features for both datasets. We also separate the response variables into
vectors, as is required when using sparse matrices to build the
particular models we’ll be developing.
small_df_sparse_matrix <- small_df_tokens |>
cast_sparse(ID, WORD, COUNT)
sel <- c("ID", "SENTIMENT")
small_df_labels <- small_df_tokens |>
select(all_of(sel)) |>
distinct() |>
column_to_rownames(var = "ID")
small_df_y <- as.character(small_df_labels$SENTIMENT)
large_df_sparse_matrix <- large_df_tokens |>
cast_sparse(ID, WORD, COUNT)
sel <- c("ID", "EMOTION")
large_df_labels <- large_df_tokens |>
select(all_of(sel)) |>
distinct() |>
column_to_rownames(var = "ID")
large_df_y <- as.character(large_df_labels$EMOTION)
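To get a rough sense of what the sparse representation saves us, we can compare the size of the sparse object to a back-of-the-envelope estimate of the same matrix stored densely (a minimal sketch; figures vary by session, and we assume 8 bytes per dense double-precision cell):
# Size of the sparse matrix in memory vs. an estimate of its dense equivalent.
format(object.size(large_df_sparse_matrix), units = "MB")
prod(dim(large_df_sparse_matrix)) * 8 / 1024^3  # approximate dense size in GB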
The tokenization process has resulted in some observations being removed from both datasets. This is because they were composed entirely of stopwords and/or numbers, and their removal is fine for our purposes.
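A quick way to confirm how many observations were dropped from each dataset (a minimal sketch using the ID columns we created earlier):
# Original observations whose IDs no longer appear among the tokens, i.e. rows
# whose text consisted entirely of stopwords and/or numbers.
length(setdiff(small_df$ID, unique(small_df_tokens$ID)))
length(setdiff(large_df$ID, unique(large_df_tokens$ID)))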
Now that we have all the word features we will be using as our input variables, we can check for pairwise correlations among them. Below is a graph featuring words in the small dataset that occur three or more times and have a pairwise correlation greater than 0.6.
small_df_cor <- small_df_tokens |>
group_by(WORD) |>
filter(n() >= 3) |>
pairwise_cor(WORD, ID, upper = FALSE)
p3 <- small_df_cor |>
filter(correlation > .6) |>
graph_from_data_frame() |>
ggraph(layout = "fr") +
geom_edge_link(edge_color = palette1[8]) +
geom_node_point(color = palette1[3], size = 5) +
geom_node_text(aes(label = name), repel = TRUE) +
labs(title = "Words in Small Dataset Occurring 3+ Times with Pairwise Correlation > 0.6") +
theme_void()
p3
We can see some logical correlations, like “final,” “exam,” and “semester” being correlated because they’re words students would frequently use together.
Next is a graph featuring words in the large dataset that occur 50 or more times and have a pairwise correlation greater than 0.6.
large_df_cor <- large_df_tokens |>
group_by(WORD) |>
filter(n() >= 50) |>
pairwise_cor(WORD, ID, upper = FALSE) |>
filter(correlation > 0.6)
p4 <- large_df_cor |>
graph_from_data_frame() |>
ggraph(layout = "fr") +
geom_edge_link(edge_color = palette1[8]) +
geom_node_point(color = palette1[3], size = 5) +
geom_node_text(aes(label = name), repel = TRUE) +
labs(title = "Words in Large Dataset Occurring 50+ Times with Pairwise Correlation > 0.6") +
theme_void()
p4
The largest cluster here, as well as most of the other correlated words, all have to do with link text and are relatively uninteresting. One stand-out correlated pair is “roller” and “coaster” though.
Now we are ready to split the data into training and test sets.
set.seed(1006)
sample_set <- sample(nrow(small_df_sparse_matrix),
round(nrow(small_df_sparse_matrix) * 0.7),
replace = FALSE)
small_df_train_x <- small_df_sparse_matrix[sample_set, ]
small_df_train_y <- small_df_y[sample_set]
small_df_test_x <- small_df_sparse_matrix[-sample_set, ]
small_df_test_y <- small_df_y[-sample_set]
sample_set <- sample(nrow(large_df_sparse_matrix),
round(nrow(large_df_sparse_matrix) * 0.7),
replace = FALSE)
large_df_train_x <- large_df_sparse_matrix[sample_set, ]
large_df_train_y <- large_df_y[sample_set]
large_df_test_x <- large_df_sparse_matrix[-sample_set, ]
large_df_test_y <- large_df_y[-sample_set]
Finally, we build our models.
First, we train Multinomial Naive Bayes Classifier models on both datasets.
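For reference, with Laplace smoothing set to 1 (as in the calls below), the per-class word probabilities a multinomial naive Bayes classifier reports are, in the standard formulation (we assume the naivebayes package follows this estimator),

$$\hat{P}(w \mid c) = \frac{N_{wc} + 1}{N_c + |V|},$$

where $N_{wc}$ is the number of times word $w$ appears in training documents of class $c$, $N_c$ is the total word count across training documents of class $c$, and $|V|$ is the vocabulary size. A new document with word counts $x_w$ is then assigned the class that maximizes $\log P(c) + \sum_{w} x_w \log \hat{P}(w \mid c)$.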
mnb_mod_small <- multinomial_naive_bayes(small_df_train_x, small_df_train_y,
laplace = 1)
mnb_mod_small
##
## ============================ Multinomial Naive Bayes ============================
##
## Call:
## multinomial_naive_bayes(x = small_df_train_x, y = small_df_train_y,
## laplace = 1)
##
## ---------------------------------------------------------------------------------
##
## Laplace smoothing: 1
##
## ---------------------------------------------------------------------------------
##
## A priori probabilities:
## negative neutral positive
## 0.2680723 0.3584337 0.3734940
##
## ---------------------------------------------------------------------------------
##
## Classes
## Features negative neutral positive
## day 0.0039867110 0.0038834951 0.013017751
## dream 0.0006644518 0.0006472492 0.001775148
## feel 0.0039867110 0.0025889968 0.001775148
## miss 0.0039867110 0.0012944984 0.002366864
## sea 0.0006644518 0.0006472492 0.002366864
## beach 0.0006644518 0.0006472492 0.002366864
## angry 0.0006644518 0.0006472492 0.000591716
## attend 0.0013289037 0.0006472492 0.000591716
## class 0.0026578073 0.0006472492 0.000591716
## listening 0.0013289037 0.0012944984 0.001183432
##
## ---------------------------------------------------------------------------------
##
## # ... and 1072 more features
##
## ---------------------------------------------------------------------------------
mnb_mod_large <- multinomial_naive_bayes(large_df_train_x, large_df_train_y,
laplace = 1)
mnb_mod_large
##
## ============================ Multinomial Naive Bayes ============================
##
## Call:
## multinomial_naive_bayes(x = large_df_train_x, y = large_df_train_y,
## laplace = 1)
##
## ---------------------------------------------------------------------------------
##
## Laplace smoothing: 1
##
## ---------------------------------------------------------------------------------
##
## A priori probabilities:
## anger fear joy love sadness surprise
## 0.13767666 0.11474426 0.33825460 0.08270746 0.29078214 0.03583487
##
## ---------------------------------------------------------------------------------
##
## Classes
## Features anger fear joy love sadness
## feel 8.408473e-02 8.038500e-02 9.890695e-02 7.807355e-02 9.703709e-02
## helpless 1.783167e-04 3.628925e-03 2.602168e-05 3.828777e-05 1.836450e-03
## heavy 1.110274e-04 1.920273e-04 8.384762e-05 1.292212e-04 2.145141e-04
## hearted 4.710252e-05 3.527033e-05 2.457603e-05 1.148633e-04 4.534443e-05
## ive 4.215676e-03 4.146223e-03 4.127327e-03 3.498545e-03 5.919193e-03
## enjoyed 1.514010e-04 1.450002e-04 2.659994e-04 2.584425e-04 1.674256e-04
## slouch 3.364466e-06 3.918925e-06 2.891297e-06 4.785971e-06 3.488033e-06
## relax 1.547654e-04 1.763516e-04 1.749235e-04 7.178957e-05 1.063850e-04
## unwind 1.682233e-05 2.351355e-05 1.590214e-05 1.435791e-05 8.720084e-06
## frankly 1.177563e-04 9.013528e-05 6.794549e-05 2.871583e-05 8.894485e-05
## Classes
## Features surprise
## feel 6.017506e-02
## helpless 5.442424e-05
## heavy 7.256565e-05
## hearted 2.721212e-05
## ive 3.592000e-03
## enjoyed 1.723434e-04
## slouch 9.070706e-06
## relax 7.256565e-05
## unwind 9.070706e-06
## frankly 1.088485e-04
##
## ---------------------------------------------------------------------------------
##
## # ... and 40040 more features
##
## ---------------------------------------------------------------------------------
Next, we train Extreme Gradient Boosting (XGBoost) models on both datasets. The classes of the response variable have to be numeric for these models, so we make those coercions first.
small_df_train_y_num <- as.data.frame(small_df_train_y) |>
rename(SENTIMENT = small_df_train_y) |>
mutate(SENTIMENT = case_when(SENTIMENT == "negative" ~ 0,
SENTIMENT == "neutral" ~ 1,
.default = 2))
small_df_train_y_num <- as.integer(small_df_train_y_num$SENTIMENT)
small_df_test_y_num <- as.data.frame(small_df_test_y) |>
rename(SENTIMENT = small_df_test_y) |>
mutate(SENTIMENT = case_when(SENTIMENT == "negative" ~ 0,
SENTIMENT == "neutral" ~ 1,
.default = 2))
small_df_test_y_num <- as.integer(small_df_test_y_num$SENTIMENT)
xgb_mod_small <- xgboost(small_df_train_x, small_df_train_y_num, nrounds = 100,
objective = "multi:softmax", num_class = 3,
verbose = 0)
xgb_mod_small
## ##### xgb.Booster
## raw: 327.9 Kb
## call:
## xgb.train(params = params, data = dtrain, nrounds = nrounds,
## watchlist = watchlist, verbose = verbose, print_every_n = print_every_n,
## early_stopping_rounds = early_stopping_rounds, maximize = maximize,
## save_period = save_period, save_name = save_name, xgb_model = xgb_model,
## callbacks = callbacks, objective = "multi:softmax", num_class = 3)
## params (as set within xgb.train):
## objective = "multi:softmax", num_class = "3", validate_parameters = "TRUE"
## xgb.attributes:
## niter
## callbacks:
## cb.evaluation.log()
## # of features: 1082
## niter: 100
## nfeatures : 1082
## evaluation_log:
## iter train_mlogloss
## 1 1.034165
## 2 0.990067
## ---
## 99 0.560126
## 100 0.559474
large_df_train_y_num <- as.data.frame(large_df_train_y) |>
rename(EMOTION = large_df_train_y) |>
mutate(EMOTION = case_when(EMOTION == "sadness" ~ 0,
EMOTION == "joy" ~ 1,
EMOTION == "love" ~ 2,
EMOTION == "anger" ~ 3,
EMOTION == "fear" ~ 4,
.default = 5))
large_df_train_y_num <- as.integer(large_df_train_y_num$EMOTION)
large_df_test_y_num <- as.data.frame(large_df_test_y) |>
rename(EMOTION = large_df_test_y) |>
mutate(EMOTION = case_when(EMOTION == "sadness" ~ 0,
EMOTION == "joy" ~ 1,
EMOTION == "love" ~ 2,
EMOTION == "anger" ~ 3,
EMOTION == "fear" ~ 4,
.default = 5))
large_df_test_y_num <- as.integer(large_df_test_y_num$EMOTION)
xgb_mod_large <- xgboost(large_df_train_x, large_df_train_y_num, nrounds = 100,
objective = "multi:softmax", num_class = 6,
verbose = 0)
xgb_mod_large
## ##### xgb.Booster
## raw: 1.3 Mb
## call:
## xgb.train(params = params, data = dtrain, nrounds = nrounds,
## watchlist = watchlist, verbose = verbose, print_every_n = print_every_n,
## early_stopping_rounds = early_stopping_rounds, maximize = maximize,
## save_period = save_period, save_name = save_name, xgb_model = xgb_model,
## callbacks = callbacks, objective = "multi:softmax", num_class = 6)
## params (as set within xgb.train):
## objective = "multi:softmax", num_class = "6", validate_parameters = "TRUE"
## xgb.attributes:
## niter
## callbacks:
## cb.evaluation.log()
## # of features: 40050
## niter: 100
## nfeatures : 40050
## evaluation_log:
## iter train_mlogloss
## 1 1.6426637
## 2 1.5298370
## ---
## 99 0.4188904
## 100 0.4165744
We make predictions on the test data and construct confusion matrices for each of the models in order to calculate their predictive accuracy.
mnb_pred_small <- predict(mnb_mod_small, small_df_test_x, type = "class")
mnb_pred_tbl_small <- table(mnb_pred_small, small_df_test_y)
names(dimnames(mnb_pred_tbl_small)) <- c("Prediction", "Reference")
print("Confusion Matrix: Multinomial Naive Bayes Classifier: Small Test Data")
## [1] "Confusion Matrix: Multinomial Naive Bayes Classifier: Small Test Data"
mnb_pred_tbl_small
## Reference
## Prediction negative neutral positive
## negative 18 8 3
## neutral 17 40 7
## positive 6 16 27
xgb_pred_small <- predict(xgb_mod_small, small_df_test_x)
xgb_pred_tbl_small <- table(xgb_pred_small, small_df_test_y_num)
names(dimnames(xgb_pred_tbl_small)) <- c("Prediction", "Reference")
sent <- c("negative", "neutral", "positive")
dimnames(xgb_pred_tbl_small)$Prediction <- sent
dimnames(xgb_pred_tbl_small)$Reference <- sent
print("Confusion Matrix: XGBoost Model: Small Test Data")
## [1] "Confusion Matrix: XGBoost Model: Small Test Data"
xgb_pred_tbl_small
## Reference
## Prediction negative neutral positive
## negative 8 5 4
## neutral 31 46 15
## positive 2 13 18
mnb_pred_large <- predict(mnb_mod_large, large_df_test_x, type = "class")
mnb_pred_tbl_large <- table(mnb_pred_large, large_df_test_y)
names(dimnames(mnb_pred_tbl_large)) <- c("Prediction", "Reference")
print("Confusion Matrix: Multinomial Naive Bayes Classifier: Large Test Data")
## [1] "Confusion Matrix: Multinomial Naive Bayes Classifier: Large Test Data"
mnb_pred_tbl_large
## Reference
## Prediction anger fear joy love sadness surprise
## anger 14838 555 256 85 607 37
## fear 499 11851 328 63 497 1006
## joy 466 446 38751 2988 1192 790
## love 80 41 1339 6884 157 40
## sadness 1231 1035 1591 399 33834 277
## surprise 17 299 114 7 58 2368
xgb_pred_large <- predict(xgb_mod_large, large_df_test_x)
xgb_pred_tbl_large <- table(xgb_pred_large, large_df_test_y_num)
names(dimnames(xgb_pred_tbl_large)) <- c("Prediction", "Reference")
dimnames(xgb_pred_tbl_large)$Prediction <- emotions
dimnames(xgb_pred_tbl_large)$Reference <- emotions
print("Confusion Matrix: XGBoost Model: Large Test Data")
## [1] "Confusion Matrix: XGBoost Model: Large Test Data"
xgb_pred_tbl_large
## Reference
## Prediction sadness joy love anger fear surprise
## sadness 32704 539 100 931 337 31
## joy 1920 37706 816 584 370 91
## love 237 3033 9394 69 34 18
## anger 530 262 38 14730 198 7
## fear 760 314 48 752 12116 84
## surprise 194 525 30 65 1172 4287
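Overall accuracy can mask per-class weaknesses, especially given the class imbalance noted earlier. Since caret is already loaded, per-class sensitivity (recall) and positive predictive value (precision) can be pulled from any of these tables; a minimal sketch, shown for the Multinomial Naive Bayes predictions on the large test data (the variable name cm_mnb_large is ours):
# caret's confusionMatrix() accepts a table with predictions in the rows and
# reference labels in the columns, which matches how the tables above were built.
cm_mnb_large <- confusionMatrix(mnb_pred_tbl_large)
cm_mnb_large$byClass[, c("Sensitivity", "Pos Pred Value")]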
A summary of the models’ predictive accuracy on each of the test datasets is below.
mnb_pred_acc_small <- sum(diag(mnb_pred_tbl_small)) / nrow(small_df_test_x)
xgb_pred_acc_small <- sum(diag(xgb_pred_tbl_small)) / nrow(small_df_test_x)
mnb_pred_acc_large <- sum(diag(mnb_pred_tbl_large)) / nrow(large_df_test_x)
xgb_pred_acc_large <- sum(diag(xgb_pred_tbl_large)) / nrow(large_df_test_x)
pred_acc_df <- as.data.frame(matrix(data = c(mnb_pred_acc_small,
xgb_pred_acc_small,
mnb_pred_acc_large,
xgb_pred_acc_large),
nrow = 2, ncol = 2))
rownames(pred_acc_df) <- c("Multinomial Naive Bayes Classifier Model",
"Extreme Gradient Boost (XGBoost) Model")
colnames(pred_acc_df) <- c("Small Test Data Predictive Accuracy",
"Large Test Data Predictive Accuracy")
kable(pred_acc_df, format = "simple")
|  | Small Test Data Predictive Accuracy | Large Test Data Predictive Accuracy |
|---|---|---|
| Multinomial Naive Bayes Classifier Model | 0.5985915 | 0.8680275 |
| Extreme Gradient Boost (XGBoost) Model | 0.5070423 | 0.8873114 |
The Multinomial Naive Bayes Classifier has the highest predictive accuracy on the small test data, while the XGBoost Model has the highest predictive accuracy on the large test data.
Neither model performed particularly well on the small test data, but the Multinomial Naive Bayes Classifier was roughly 9 percentage points more accurate than the XGBoost Model, making it the clear winner there.
Unsurprisingly, both models benefited from having more training data. The jump in performance on the large test data was substantial for both models, but the better performer here was the XGBoost Model rather than the Multinomial Naive Bayes Classifier. It beat the Multinomial Naive Bayes Classifier by only roughly 2 percentage points, however, and it was much more computationally expensive to train. So even though the Multinomial Naive Bayes Classifier was outperformed by the XGBoost Model here, we were impressed by its accuracy given its speed.
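To put a rough number on that computational cost difference, we could time both training steps on the large dataset (a minimal sketch; a single wall-clock measurement is illustrative only and will vary with hardware):
# Rough wall-clock comparison of the two training calls used above.
mnb_time <- system.time(
    multinomial_naive_bayes(large_df_train_x, large_df_train_y, laplace = 1)
)
xgb_time <- system.time(
    xgboost(large_df_train_x, large_df_train_y_num, nrounds = 100,
            objective = "multi:softmax", num_class = 6, verbose = 0)
)
rbind(MNB = mnb_time, XGBoost = xgb_time)[, "elapsed", drop = FALSE]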