For my final project, I will investigate the potential use of sentiment analysis as a ham/spam filter.
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.1.0 v dplyr 1.0.5
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(kableExtra)
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
library(tidymodels)
## -- Attaching packages -------------------------------------- tidymodels 0.1.3 --
## v broom 0.7.6 v rsample 0.0.9
## v dials 0.0.9 v tune 0.1.5
## v infer 0.5.4 v workflows 0.2.2
## v modeldata 0.1.0 v workflowsets 0.0.2
## v parsnip 0.1.5 v yardstick 0.0.8
## v recipes 0.1.16
## -- Conflicts ----------------------------------------- tidymodels_conflicts() --
## x scales::discard() masks purrr::discard()
## x dplyr::filter() masks stats::filter()
## x recipes::fixed() masks stringr::fixed()
## x kableExtra::group_rows() masks dplyr::group_rows()
## x dplyr::lag() masks stats::lag()
## x yardstick::spec() masks readr::spec()
## x recipes::step() masks stats::step()
## * Use tidymodels_prefer() to resolve common conflicts.
library(tidytext)
library(textrecipes)
library(wesanderson)
library(discrim)
##
## Attaching package: 'discrim'
## The following object is masked from 'package:dials':
##
## smoothness
I decided to use UCI’s Machine Learning Repository SMS Spam Collection Data Set, which is linked here. The SMS Spam Collection Data Set combines four different sources of Spam and Ham SMS messages.
The first step was to download the file off UCI’s website, save it as a csv, then upload it to Github and re-download it into R. The columns were renamed, and the spam and ham was separated into separate data frames.
data <- read.csv("https://raw.githubusercontent.com/carlisleferguson/DATA607/main/SMSSpamCollection.csv")
data <- data %>% rename(
type = ham,
text = Go.until.jurong.point..crazy...Available.only.in.bugis.n.great.world.la.e.buffet....Cine.there.got.amore.wat...
)
spam <- filter(data, type == "spam")
ham <- filter(data, type == "ham")
head(spam) %>% kbl(caption = "Head of the Spam Data Set") %>%
kable_styling(bootstrap_options = "striped")
| type | text |
|---|---|
| spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C’s apply 08452810075over18’s |
| spam | FreeMsg Hey there darling it’s been 3 week’s now and no word back! I’d like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv |
| spam | WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only. |
| spam | Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030 |
| spam | SIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575. Cost 150p/day, 6days, 16+ TsandCs apply Reply HL 4 info |
| spam | URGENT! You have won a 1 week FREE membership in our £100,000 Prize Jackpot! Txt the word: CLAIM to No: 81010 T&C www.dbuk.net LCCLTD POBOX 4403LDNW1A7RW18 |
head(ham) %>% kbl(caption = "Head of the Ham Data Set") %>%
kable_styling(bootstrap_options = "striped")
| type | text |
|---|---|
| ham | Ok lar… Joking wif u oni… |
| ham | U dun say so early hor… U c already then say… |
| ham | Nah I don’t think he goes to usf, he lives around here though |
| ham | Even my brother is not like to speak with me. They treat me like aids patent. |
| ham | As per your request ‘Melle Melle (Oru Minnaminunginte Nurungu Vettam)’ has been set as your callertune for all Callers. Press *9 to copy your friends Callertune |
| ham | I’m gonna be home soon and i don’t want to talk about this stuff anymore tonight, k? I’ve cried enough today. |
The tidytext library offers several different sentiment analysis lexicons including bing,nrc, and loughran. The bing lexicon provides a simple positive or negative sentiment analysis, while the nrc lexicon provides a more detailed look at anger, anticipation, disgust, fear, joy, sadness, surprise, and trust contained in the text. The loughran lexicon is similar to the nrc lexicon, and looks at constraining, litigious, positive, negative, superfluous, and uncertainty.
spam_b <- spam %>% unnest_tokens(word, text) %>% inner_join(get_sentiments(lexicon = "bing"))
## Joining, by = "word"
spam_b %>%
count(sentiment) %>%
ggplot(aes( x = sentiment, y = n, fill = sentiment)) +
geom_bar(stat = "identity") + scale_fill_manual(values=wes_palette( name="Zissou1", 2, type = "continuous")) + labs(title="Bing Sentiment in SMS Spam Messages")
ham_b <- ham %>% unnest_tokens(word, text) %>% inner_join(get_sentiments(lexicon = "bing"))
## Joining, by = "word"
ham_b %>%
count(sentiment) %>%
ggplot(aes( x = sentiment, y = n, fill = sentiment)) +
geom_bar(stat = "identity") + scale_fill_manual(values=wes_palette( name="Zissou1", 2, type = "continuous")) + labs(title="Bing Sentiment in SMS Ham Messages")
spam_n <- spam %>% unnest_tokens(word, text) %>% inner_join(get_sentiments(lexicon = "nrc"))
## Joining, by = "word"
spam_n %>%
count(sentiment) %>%
ggplot(aes( x = sentiment, y = n, fill = sentiment)) +
geom_bar(stat = "identity") + scale_fill_manual(values=wes_palette( name="Zissou1", 10, type = "continuous")) + labs(title="NRC Sentiment in SMS Spam Messages")
ham_n <- ham %>% unnest_tokens(word, text) %>% inner_join(get_sentiments(lexicon = "nrc"))
## Joining, by = "word"
ham_n %>%
count(sentiment) %>%
ggplot(aes( x = sentiment, y = n, fill = sentiment)) +
geom_bar(stat = "identity") + scale_fill_manual(values=wes_palette( name="Zissou1", 10, type = "continuous")) + labs(title="NRC Sentiment in SMS Ham Messages")
spam_l <- spam %>% unnest_tokens(word, text) %>% inner_join(get_sentiments(lexicon = "loughran"))
## Joining, by = "word"
spam_l %>%
count(sentiment) %>%
ggplot(aes( x = sentiment, y = n, fill = sentiment)) +
geom_bar(stat = "identity") + scale_fill_manual(values=wes_palette( name="Zissou1", 6, type = "continuous")) + labs(title="Loughran Sentiment in SMS Spam Messages")
ham_l <- ham %>% unnest_tokens(word, text) %>% inner_join(get_sentiments(lexicon = "loughran"))
## Joining, by = "word"
ham_l %>%
count(sentiment) %>%
ggplot(aes( x = sentiment, y = n, fill = sentiment)) +
geom_bar(stat = "identity") + scale_fill_manual(values=wes_palette( name="Zissou1", 6, type = "continuous")) + labs(title="Loughran Sentiment in SMS Ham Messages")
Interestingly enough, there are clear visual differences in the sentiments between spam and ham messages that are present across all three lexicons. However, is this difference statistically significant?
The first step is to compile the total count per sentiment for each lexicon.
ham_b_count <- count(ham_b, sentiment)
ham_b_count <- ham_b_count %>% rename("ham_count" = n)
spam_b_count <- count(spam_b, sentiment)
spam_b_count <- spam_b_count %>% rename("spam_count" = n)
ham_n_count <- count(ham_n, sentiment)
ham_n_count <- ham_n_count %>% rename("ham_count" = n)
spam_n_count <- count(spam_n, sentiment)
spam_n_count <- spam_n_count %>% rename("spam_count" = n)
ham_l_count <- count(ham_l, sentiment)
ham_l_count <- ham_l_count %>% rename("ham_count" = n)
spam_l_count <- count(spam_l, sentiment)
spam_l_count <- spam_l_count %>% rename("spam_count" = n)
bing_count <- left_join(ham_b_count, spam_b_count)
## Joining, by = "sentiment"
bing_count %>% kbl(caption = "Bing Counts") %>%
kable_styling(bootstrap_options = "striped")
| sentiment | ham_count | spam_count |
|---|---|---|
| negative | 1588 | 175 |
| positive | 2605 | 924 |
nrc_count <- left_join(ham_n_count, spam_n_count)
## Joining, by = "sentiment"
nrc_count %>% kbl(caption = "NRC Counts") %>%
kable_styling(bootstrap_options = "striped")
| sentiment | ham_count | spam_count |
|---|---|---|
| anger | 690 | 192 |
| anticipation | 2146 | 734 |
| disgust | 534 | 32 |
| fear | 918 | 228 |
| joy | 1879 | 444 |
| negative | 1788 | 298 |
| positive | 3275 | 951 |
| sadness | 936 | 91 |
| surprise | 1075 | 295 |
| trust | 1918 | 439 |
loughran_count <- left_join(ham_l_count, spam_l_count)
## Joining, by = "sentiment"
loughran_count %>% kbl(caption = "Loughran Counts") %>%
kable_styling(bootstrap_options = "striped")
| sentiment | ham_count | spam_count |
|---|---|---|
| constraining | 23 | 1 |
| litigious | 67 | 122 |
| negative | 789 | 126 |
| positive | 758 | 193 |
| superfluous | 1 | NA |
| uncertainty | 320 | 27 |
The next step is to compute the ratio in each category.
ham_b_sum <- sum(ham_b_count["ham_count"])
spam_b_sum <- sum(spam_b_count["spam_count"])
ham_n_sum <- sum(ham_n_count["ham_count"])
spam_n_sum <- sum(spam_n_count["spam_count"])
ham_l_sum <- sum(ham_l_count["ham_count"])
spam_l_sum <- sum(spam_l_count["spam_count"])
bing_count <- bing_count %>% mutate(ham_ratio = ham_count/ham_b_sum)
bing_count <- bing_count %>% mutate(spam_ratio = spam_count/spam_b_sum)
bing_count %>% kbl(caption = "Bing Counts and Ratios") %>%
kable_styling(bootstrap_options = "striped")
| sentiment | ham_count | spam_count | ham_ratio | spam_ratio |
|---|---|---|---|---|
| negative | 1588 | 175 | 0.3787264 | 0.1592357 |
| positive | 2605 | 924 | 0.6212736 | 0.8407643 |
nrc_count <- nrc_count %>% mutate(ham_ratio = ham_count/ham_n_sum)
nrc_count <- nrc_count %>% mutate(spam_ratio = spam_count/spam_n_sum)
nrc_count %>% kbl(caption = "NRC Counts and Ratios") %>%
kable_styling(bootstrap_options = "striped")
| sentiment | ham_count | spam_count | ham_ratio | spam_ratio |
|---|---|---|---|---|
| anger | 690 | 192 | 0.0455175 | 0.0518359 |
| anticipation | 2146 | 734 | 0.1415661 | 0.1981641 |
| disgust | 534 | 32 | 0.0352266 | 0.0086393 |
| fear | 918 | 228 | 0.0605581 | 0.0615551 |
| joy | 1879 | 444 | 0.1239528 | 0.1198704 |
| negative | 1788 | 298 | 0.1179497 | 0.0804536 |
| positive | 3275 | 951 | 0.2160433 | 0.2567495 |
| sadness | 936 | 91 | 0.0617455 | 0.0245680 |
| surprise | 1075 | 295 | 0.0709150 | 0.0796436 |
| trust | 1918 | 439 | 0.1265255 | 0.1185205 |
loughran_count <- loughran_count %>% mutate(ham_ratio = ham_count/ham_l_sum)
loughran_count <- loughran_count %>% mutate(spam_ratio = spam_count/spam_l_sum)
loughran_count %>% kbl(caption = "Loughran Counts and Ratios") %>%
kable_styling(bootstrap_options = "striped")
| sentiment | ham_count | spam_count | ham_ratio | spam_ratio |
|---|---|---|---|---|
| constraining | 23 | 1 | 0.0117467 | 0.0021322 |
| litigious | 67 | 122 | 0.0342186 | 0.2601279 |
| negative | 789 | 126 | 0.4029622 | 0.2686567 |
| positive | 758 | 193 | 0.3871297 | 0.4115139 |
| superfluous | 1 | NA | 0.0005107 | NA |
| uncertainty | 320 | 27 | 0.1634321 | 0.0575693 |
row.names(bing_count) <- c("negative", "positive")
bing_ratio <- bing_count[c("ham_ratio", "spam_ratio")]
bing_only_count <- bing_count[c("ham_count", "spam_count")]
fisher.test(bing_only_count)
##
## Fisher's Exact Test for Count Data
##
## data: bing_only_count
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 2.701286 3.849601
## sample estimates:
## odds ratio
## 3.218312
chisq.test(bing_only_count)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: bing_only_count
## X-squared = 187.84, df = 1, p-value < 2.2e-16
row.names(nrc_count) <- c("anger", "anticipation", "disgust", "fear", "joy", "negative", "positive", "sadness", "surprise", "trust")
nrc_only_count <- nrc_count[c("ham_count", "spam_count")]
nrc_ratio <- nrc_count[c("ham_ratio", "spam_ratio")]
chisq.test(nrc_only_count)
##
## Pearson's Chi-squared test
##
## data: nrc_only_count
## X-squared = 275.65, df = 9, p-value < 2.2e-16
row.names(loughran_count) <- c("contraining", "litigious", "negative", "positive", "superfluous", "uncertainty")
loughran_count[is.na(loughran_count)] = 0
loughran_only_count <- loughran_count[c("ham_count", "spam_count")]
loughran_ratio <- loughran_count[c("ham_ratio", "spam_ratio")]
chisq.test(loughran_only_count)
## Warning in chisq.test(loughran_only_count): Chi-squared approximation may be
## incorrect
##
## Pearson's Chi-squared test
##
## data: loughran_only_count
## X-squared = 300.08, df = 5, p-value < 2.2e-16
The p values for each chi-square test was less than .05, meaning that the difference in sentiment between ham and spam SMS messages was statistically significant.