Introduction

For my final project, I will investigate the potential use of sentiment analysis as a ham/spam filter.

Loading Libraries

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.1.0     v dplyr   1.0.5
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
library(tidymodels)
## -- Attaching packages -------------------------------------- tidymodels 0.1.3 --
## v broom        0.7.6      v rsample      0.0.9 
## v dials        0.0.9      v tune         0.1.5 
## v infer        0.5.4      v workflows    0.2.2 
## v modeldata    0.1.0      v workflowsets 0.0.2 
## v parsnip      0.1.5      v yardstick    0.0.8 
## v recipes      0.1.16
## -- Conflicts ----------------------------------------- tidymodels_conflicts() --
## x scales::discard()        masks purrr::discard()
## x dplyr::filter()          masks stats::filter()
## x recipes::fixed()         masks stringr::fixed()
## x kableExtra::group_rows() masks dplyr::group_rows()
## x dplyr::lag()             masks stats::lag()
## x yardstick::spec()        masks readr::spec()
## x recipes::step()          masks stats::step()
## * Use tidymodels_prefer() to resolve common conflicts.
library(tidytext)
library(textrecipes)
library(wesanderson)
library(discrim)
## 
## Attaching package: 'discrim'
## The following object is masked from 'package:dials':
## 
##     smoothness

Obtaining the Data

I decided to use UCI’s Machine Learning Repository SMS Spam Collection Data Set, which is linked here. The SMS Spam Collection Data Set combines four different sources of Spam and Ham SMS messages.

The first step was to download the file off UCI’s website, save it as a csv, then upload it to Github and re-download it into R. The columns were renamed, and the spam and ham was separated into separate data frames.

data <- read.csv("https://raw.githubusercontent.com/carlisleferguson/DATA607/main/SMSSpamCollection.csv")

data <- data %>% rename(
  type = ham,
  text = Go.until.jurong.point..crazy...Available.only.in.bugis.n.great.world.la.e.buffet....Cine.there.got.amore.wat...
)

spam <- filter(data, type == "spam")
ham <- filter(data, type == "ham")

head(spam) %>% kbl(caption = "Head of the Spam Data Set") %>%
  kable_styling(bootstrap_options = "striped")
Head of the Spam Data Set
type text
spam Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C’s apply 08452810075over18’s
spam FreeMsg Hey there darling it’s been 3 week’s now and no word back! I’d like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv
spam WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.
spam Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030
spam SIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575. Cost 150p/day, 6days, 16+ TsandCs apply Reply HL 4 info
spam URGENT! You have won a 1 week FREE membership in our £100,000 Prize Jackpot! Txt the word: CLAIM to No: 81010 T&C www.dbuk.net LCCLTD POBOX 4403LDNW1A7RW18
head(ham) %>% kbl(caption = "Head of the Ham Data Set") %>%
  kable_styling(bootstrap_options = "striped")
Head of the Ham Data Set
type text
ham Ok lar… Joking wif u oni…
ham U dun say so early hor… U c already then say…
ham Nah I don’t think he goes to usf, he lives around here though
ham Even my brother is not like to speak with me. They treat me like aids patent.
ham As per your request ‘Melle Melle (Oru Minnaminunginte Nurungu Vettam)’ has been set as your callertune for all Callers. Press *9 to copy your friends Callertune
ham I’m gonna be home soon and i don’t want to talk about this stuff anymore tonight, k? I’ve cried enough today.

Applying Initial Sentiment Analysis

The tidytext library offers several different sentiment analysis lexicons including bing,nrc, and loughran. The bing lexicon provides a simple positive or negative sentiment analysis, while the nrc lexicon provides a more detailed look at anger, anticipation, disgust, fear, joy, sadness, surprise, and trust contained in the text. The loughran lexicon is similar to the nrc lexicon, and looks at constraining, litigious, positive, negative, superfluous, and uncertainty.

spam_b <- spam %>% unnest_tokens(word, text) %>% inner_join(get_sentiments(lexicon = "bing"))
## Joining, by = "word"
spam_b %>% 
  count(sentiment) %>%
  ggplot(aes( x = sentiment, y = n, fill = sentiment)) + 
  geom_bar(stat = "identity") + scale_fill_manual(values=wes_palette( name="Zissou1", 2, type = "continuous")) + labs(title="Bing Sentiment in SMS Spam Messages")

ham_b <- ham %>% unnest_tokens(word, text) %>% inner_join(get_sentiments(lexicon = "bing"))
## Joining, by = "word"
ham_b %>% 
  count(sentiment) %>%
  ggplot(aes( x = sentiment, y = n, fill = sentiment)) + 
  geom_bar(stat = "identity") + scale_fill_manual(values=wes_palette( name="Zissou1", 2, type = "continuous")) + labs(title="Bing Sentiment in SMS Ham Messages")

spam_n <- spam %>% unnest_tokens(word, text) %>% inner_join(get_sentiments(lexicon = "nrc"))
## Joining, by = "word"
spam_n %>% 
  count(sentiment) %>%
  ggplot(aes( x = sentiment, y = n, fill = sentiment)) + 
  geom_bar(stat = "identity") + scale_fill_manual(values=wes_palette( name="Zissou1", 10, type = "continuous")) + labs(title="NRC Sentiment in SMS Spam Messages")

ham_n <- ham %>% unnest_tokens(word, text) %>% inner_join(get_sentiments(lexicon = "nrc"))
## Joining, by = "word"
ham_n %>% 
  count(sentiment) %>%
  ggplot(aes( x = sentiment, y = n, fill = sentiment)) + 
  geom_bar(stat = "identity") + scale_fill_manual(values=wes_palette( name="Zissou1", 10, type = "continuous")) + labs(title="NRC Sentiment in SMS Ham Messages")

spam_l <- spam %>% unnest_tokens(word, text) %>% inner_join(get_sentiments(lexicon = "loughran"))
## Joining, by = "word"
spam_l %>% 
  count(sentiment) %>%
  ggplot(aes( x = sentiment, y = n, fill = sentiment)) + 
  geom_bar(stat = "identity") + scale_fill_manual(values=wes_palette( name="Zissou1", 6, type = "continuous")) + labs(title="Loughran Sentiment in SMS Spam Messages")

ham_l <- ham %>% unnest_tokens(word, text) %>% inner_join(get_sentiments(lexicon = "loughran"))
## Joining, by = "word"
ham_l %>% 
  count(sentiment) %>%
  ggplot(aes( x = sentiment, y = n, fill = sentiment)) + 
  geom_bar(stat = "identity") + scale_fill_manual(values=wes_palette( name="Zissou1", 6, type = "continuous")) + labs(title="Loughran Sentiment in SMS Ham Messages")

Interestingly enough, there are clear visual differences in the sentiments between spam and ham messages that are present across all three lexicons. However, is this difference statistically significant?

Determining Statistical Significance

The first step is to compile the total count per sentiment for each lexicon.

ham_b_count <- count(ham_b, sentiment)
ham_b_count <- ham_b_count %>% rename("ham_count" = n)
spam_b_count <- count(spam_b, sentiment)
spam_b_count <- spam_b_count %>% rename("spam_count" = n)
ham_n_count <- count(ham_n, sentiment)
ham_n_count <- ham_n_count %>% rename("ham_count" = n)
spam_n_count <- count(spam_n, sentiment)
spam_n_count <- spam_n_count %>% rename("spam_count" = n)
ham_l_count <- count(ham_l, sentiment)
ham_l_count <- ham_l_count %>% rename("ham_count" = n)
spam_l_count <- count(spam_l, sentiment)
spam_l_count <- spam_l_count %>% rename("spam_count" = n)


bing_count <- left_join(ham_b_count, spam_b_count)
## Joining, by = "sentiment"
bing_count  %>% kbl(caption = "Bing Counts") %>%
  kable_styling(bootstrap_options = "striped")
Bing Counts
sentiment ham_count spam_count
negative 1588 175
positive 2605 924
nrc_count <- left_join(ham_n_count, spam_n_count)
## Joining, by = "sentiment"
nrc_count  %>% kbl(caption = "NRC Counts") %>%
  kable_styling(bootstrap_options = "striped")
NRC Counts
sentiment ham_count spam_count
anger 690 192
anticipation 2146 734
disgust 534 32
fear 918 228
joy 1879 444
negative 1788 298
positive 3275 951
sadness 936 91
surprise 1075 295
trust 1918 439
loughran_count <- left_join(ham_l_count, spam_l_count)
## Joining, by = "sentiment"
loughran_count  %>% kbl(caption = "Loughran Counts") %>%
  kable_styling(bootstrap_options = "striped")
Loughran Counts
sentiment ham_count spam_count
constraining 23 1
litigious 67 122
negative 789 126
positive 758 193
superfluous 1 NA
uncertainty 320 27

The next step is to compute the ratio in each category.

ham_b_sum <- sum(ham_b_count["ham_count"])
spam_b_sum <- sum(spam_b_count["spam_count"])
ham_n_sum <- sum(ham_n_count["ham_count"])
spam_n_sum <- sum(spam_n_count["spam_count"])
ham_l_sum <- sum(ham_l_count["ham_count"])
spam_l_sum <- sum(spam_l_count["spam_count"])



bing_count <- bing_count %>% mutate(ham_ratio = ham_count/ham_b_sum)
bing_count <- bing_count %>% mutate(spam_ratio = spam_count/spam_b_sum)
bing_count %>% kbl(caption = "Bing Counts and Ratios") %>%
  kable_styling(bootstrap_options = "striped")
Bing Counts and Ratios
sentiment ham_count spam_count ham_ratio spam_ratio
negative 1588 175 0.3787264 0.1592357
positive 2605 924 0.6212736 0.8407643
nrc_count <- nrc_count %>% mutate(ham_ratio = ham_count/ham_n_sum)
nrc_count <- nrc_count %>% mutate(spam_ratio = spam_count/spam_n_sum)
nrc_count %>% kbl(caption = "NRC Counts and Ratios") %>%
  kable_styling(bootstrap_options = "striped")
NRC Counts and Ratios
sentiment ham_count spam_count ham_ratio spam_ratio
anger 690 192 0.0455175 0.0518359
anticipation 2146 734 0.1415661 0.1981641
disgust 534 32 0.0352266 0.0086393
fear 918 228 0.0605581 0.0615551
joy 1879 444 0.1239528 0.1198704
negative 1788 298 0.1179497 0.0804536
positive 3275 951 0.2160433 0.2567495
sadness 936 91 0.0617455 0.0245680
surprise 1075 295 0.0709150 0.0796436
trust 1918 439 0.1265255 0.1185205
loughran_count <- loughran_count %>% mutate(ham_ratio = ham_count/ham_l_sum)
loughran_count  <- loughran_count %>% mutate(spam_ratio = spam_count/spam_l_sum)
loughran_count %>% kbl(caption = "Loughran Counts and Ratios") %>%
  kable_styling(bootstrap_options = "striped")
Loughran Counts and Ratios
sentiment ham_count spam_count ham_ratio spam_ratio
constraining 23 1 0.0117467 0.0021322
litigious 67 122 0.0342186 0.2601279
negative 789 126 0.4029622 0.2686567
positive 758 193 0.3871297 0.4115139
superfluous 1 NA 0.0005107 NA
uncertainty 320 27 0.1634321 0.0575693
row.names(bing_count) <- c("negative", "positive")
bing_ratio <- bing_count[c("ham_ratio", "spam_ratio")]
bing_only_count <- bing_count[c("ham_count", "spam_count")]
fisher.test(bing_only_count)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  bing_only_count
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  2.701286 3.849601
## sample estimates:
## odds ratio 
##   3.218312
chisq.test(bing_only_count)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  bing_only_count
## X-squared = 187.84, df = 1, p-value < 2.2e-16
row.names(nrc_count) <- c("anger", "anticipation", "disgust", "fear", "joy", "negative", "positive", "sadness", "surprise", "trust")
nrc_only_count <- nrc_count[c("ham_count", "spam_count")]
nrc_ratio <- nrc_count[c("ham_ratio", "spam_ratio")]
chisq.test(nrc_only_count)
## 
##  Pearson's Chi-squared test
## 
## data:  nrc_only_count
## X-squared = 275.65, df = 9, p-value < 2.2e-16
row.names(loughran_count) <- c("contraining", "litigious", "negative", "positive", "superfluous", "uncertainty")
loughran_count[is.na(loughran_count)] = 0
loughran_only_count <- loughran_count[c("ham_count", "spam_count")]
loughran_ratio <- loughran_count[c("ham_ratio", "spam_ratio")]

chisq.test(loughran_only_count)
## Warning in chisq.test(loughran_only_count): Chi-squared approximation may be
## incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  loughran_only_count
## X-squared = 300.08, df = 5, p-value < 2.2e-16

Conclusions

The p values for each chi-square test was less than .05, meaning that the difference in sentiment between ham and spam SMS messages was statistically significant.