607 Final Project

Introduction

For my final project, I will investigate the potential use of sentiment analysis as a ham/spam filter.

Loading Libraries

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.1.0     v dplyr   1.0.5
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(kableExtra)

## 
## Attaching package: 'kableExtra'

## The following object is masked from 'package:dplyr':
## 
##     group_rows

library(tidymodels)

## -- Attaching packages -------------------------------------- tidymodels 0.1.3 --

## v broom        0.7.6      v rsample      0.0.9 
## v dials        0.0.9      v tune         0.1.5 
## v infer        0.5.4      v workflows    0.2.2 
## v modeldata    0.1.0      v workflowsets 0.0.2 
## v parsnip      0.1.5      v yardstick    0.0.8 
## v recipes      0.1.16

## -- Conflicts ----------------------------------------- tidymodels_conflicts() --
## x scales::discard()        masks purrr::discard()
## x dplyr::filter()          masks stats::filter()
## x recipes::fixed()         masks stringr::fixed()
## x kableExtra::group_rows() masks dplyr::group_rows()
## x dplyr::lag()             masks stats::lag()
## x yardstick::spec()        masks readr::spec()
## x recipes::step()          masks stats::step()
## * Use tidymodels_prefer() to resolve common conflicts.

library(tidytext)
library(textrecipes)
library(wesanderson)
library(discrim)

## 
## Attaching package: 'discrim'

## The following object is masked from 'package:dials':
## 
##     smoothness

Obtaining the Data

I decided to use UCI’s Machine Learning Repository SMS Spam Collection Data Set, which is linked here. The SMS Spam Collection Data Set combines four different sources of Spam and Ham SMS messages.

The first step was to download the file off UCI’s website, save it as a csv, then upload it to Github and re-download it into R. The columns were renamed, and the spam and ham was separated into separate data frames.

data <- read.csv("https://raw.githubusercontent.com/carlisleferguson/DATA607/main/SMSSpamCollection.csv")

data <- data %>% rename(
  type = ham,
  text = Go.until.jurong.point..crazy...Available.only.in.bugis.n.great.world.la.e.buffet....Cine.there.got.amore.wat...
)

spam <- filter(data, type == "spam")
ham <- filter(data, type == "ham")

head(spam) %>% kbl(caption = "Head of the Spam Data Set") %>%
  kable_styling(bootstrap_options = "striped")

Head of the Spam Data Set
type	text
spam	Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C’s apply 08452810075over18’s
spam	FreeMsg Hey there darling it’s been 3 week’s now and no word back! I’d like some fun you up for it still? Tb ok! XxX std chgs to send, Â£1.50 to rcv
spam	WINNER!! As a valued network customer you have been selected to receivea Â£900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.
spam	Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030
spam	SIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575. Cost 150p/day, 6days, 16+ TsandCs apply Reply HL 4 info
spam	URGENT! You have won a 1 week FREE membership in our Â£100,000 Prize Jackpot! Txt the word: CLAIM to No: 81010 T&C www.dbuk.net LCCLTD POBOX 4403LDNW1A7RW18

head(ham) %>% kbl(caption = "Head of the Ham Data Set") %>%
  kable_styling(bootstrap_options = "striped")

Head of the Ham Data Set
type	text
ham	Ok lar… Joking wif u oni…
ham	U dun say so early hor… U c already then say…
ham	Nah I don’t think he goes to usf, he lives around here though
ham	Even my brother is not like to speak with me. They treat me like aids patent.
ham	As per your request ‘Melle Melle (Oru Minnaminunginte Nurungu Vettam)’ has been set as your callertune for all Callers. Press *9 to copy your friends Callertune
ham	I’m gonna be home soon and i don’t want to talk about this stuff anymore tonight, k? I’ve cried enough today.

Applying Initial Sentiment Analysis

The tidytext library offers several different sentiment analysis lexicons including bing,nrc, and loughran. The bing lexicon provides a simple positive or negative sentiment analysis, while the nrc lexicon provides a more detailed look at anger, anticipation, disgust, fear, joy, sadness, surprise, and trust contained in the text. The loughran lexicon is similar to the nrc lexicon, and looks at constraining, litigious, positive, negative, superfluous, and uncertainty.

spam_b <- spam %>% unnest_tokens(word, text) %>% inner_join(get_sentiments(lexicon = "bing"))

## Joining, by = "word"

spam_b %>% 
  count(sentiment) %>%
  ggplot(aes( x = sentiment, y = n, fill = sentiment)) + 
  geom_bar(stat = "identity") + scale_fill_manual(values=wes_palette( name="Zissou1", 2, type = "continuous")) + labs(title="Bing Sentiment in SMS Spam Messages")

ham_b <- ham %>% unnest_tokens(word, text) %>% inner_join(get_sentiments(lexicon = "bing"))

## Joining, by = "word"

ham_b %>% 
  count(sentiment) %>%
  ggplot(aes( x = sentiment, y = n, fill = sentiment)) + 
  geom_bar(stat = "identity") + scale_fill_manual(values=wes_palette( name="Zissou1", 2, type = "continuous")) + labs(title="Bing Sentiment in SMS Ham Messages")

spam_n <- spam %>% unnest_tokens(word, text) %>% inner_join(get_sentiments(lexicon = "nrc"))

## Joining, by = "word"

spam_n %>% 
  count(sentiment) %>%
  ggplot(aes( x = sentiment, y = n, fill = sentiment)) + 
  geom_bar(stat = "identity") + scale_fill_manual(values=wes_palette( name="Zissou1", 10, type = "continuous")) + labs(title="NRC Sentiment in SMS Spam Messages")

ham_n <- ham %>% unnest_tokens(word, text) %>% inner_join(get_sentiments(lexicon = "nrc"))

## Joining, by = "word"

ham_n %>% 
  count(sentiment) %>%
  ggplot(aes( x = sentiment, y = n, fill = sentiment)) + 
  geom_bar(stat = "identity") + scale_fill_manual(values=wes_palette( name="Zissou1", 10, type = "continuous")) + labs(title="NRC Sentiment in SMS Ham Messages")

spam_l <- spam %>% unnest_tokens(word, text) %>% inner_join(get_sentiments(lexicon = "loughran"))

## Joining, by = "word"

spam_l %>% 
  count(sentiment) %>%
  ggplot(aes( x = sentiment, y = n, fill = sentiment)) + 
  geom_bar(stat = "identity") + scale_fill_manual(values=wes_palette( name="Zissou1", 6, type = "continuous")) + labs(title="Loughran Sentiment in SMS Spam Messages")

ham_l <- ham %>% unnest_tokens(word, text) %>% inner_join(get_sentiments(lexicon = "loughran"))

## Joining, by = "word"

ham_l %>% 
  count(sentiment) %>%
  ggplot(aes( x = sentiment, y = n, fill = sentiment)) + 
  geom_bar(stat = "identity") + scale_fill_manual(values=wes_palette( name="Zissou1", 6, type = "continuous")) + labs(title="Loughran Sentiment in SMS Ham Messages")

Interestingly enough, there are clear visual differences in the sentiments between spam and ham messages that are present across all three lexicons. However, is this difference statistically significant?

Determining Statistical Significance

The first step is to compile the total count per sentiment for each lexicon.

ham_b_count <- count(ham_b, sentiment)
ham_b_count <- ham_b_count %>% rename("ham_count" = n)
spam_b_count <- count(spam_b, sentiment)
spam_b_count <- spam_b_count %>% rename("spam_count" = n)
ham_n_count <- count(ham_n, sentiment)
ham_n_count <- ham_n_count %>% rename("ham_count" = n)
spam_n_count <- count(spam_n, sentiment)
spam_n_count <- spam_n_count %>% rename("spam_count" = n)
ham_l_count <- count(ham_l, sentiment)
ham_l_count <- ham_l_count %>% rename("ham_count" = n)
spam_l_count <- count(spam_l, sentiment)
spam_l_count <- spam_l_count %>% rename("spam_count" = n)


bing_count <- left_join(ham_b_count, spam_b_count)

## Joining, by = "sentiment"

bing_count  %>% kbl(caption = "Bing Counts") %>%
  kable_styling(bootstrap_options = "striped")

Bing Counts
sentiment	ham_count	spam_count
negative	1588	175
positive	2605	924

nrc_count <- left_join(ham_n_count, spam_n_count)

## Joining, by = "sentiment"

nrc_count  %>% kbl(caption = "NRC Counts") %>%
  kable_styling(bootstrap_options = "striped")

NRC Counts
sentiment	ham_count	spam_count
anger	690	192
anticipation	2146	734
disgust	534	32
fear	918	228
joy	1879	444
negative	1788	298
positive	3275	951
sadness	936	91
surprise	1075	295
trust	1918	439

loughran_count <- left_join(ham_l_count, spam_l_count)

## Joining, by = "sentiment"

loughran_count  %>% kbl(caption = "Loughran Counts") %>%
  kable_styling(bootstrap_options = "striped")

Loughran Counts
sentiment	ham_count	spam_count
constraining	23	1
litigious	67	122
negative	789	126
positive	758	193
superfluous	1	NA
uncertainty	320	27

The next step is to compute the ratio in each category.

ham_b_sum <- sum(ham_b_count["ham_count"])
spam_b_sum <- sum(spam_b_count["spam_count"])
ham_n_sum <- sum(ham_n_count["ham_count"])
spam_n_sum <- sum(spam_n_count["spam_count"])
ham_l_sum <- sum(ham_l_count["ham_count"])
spam_l_sum <- sum(spam_l_count["spam_count"])



bing_count <- bing_count %>% mutate(ham_ratio = ham_count/ham_b_sum)
bing_count <- bing_count %>% mutate(spam_ratio = spam_count/spam_b_sum)
bing_count %>% kbl(caption = "Bing Counts and Ratios") %>%
  kable_styling(bootstrap_options = "striped")

Bing Counts and Ratios
sentiment	ham_count	spam_count	ham_ratio	spam_ratio
negative	1588	175	0.3787264	0.1592357
positive	2605	924	0.6212736	0.8407643

nrc_count <- nrc_count %>% mutate(ham_ratio = ham_count/ham_n_sum)
nrc_count <- nrc_count %>% mutate(spam_ratio = spam_count/spam_n_sum)
nrc_count %>% kbl(caption = "NRC Counts and Ratios") %>%
  kable_styling(bootstrap_options = "striped")

NRC Counts and Ratios
sentiment	ham_count	spam_count	ham_ratio	spam_ratio
anger	690	192	0.0455175	0.0518359
anticipation	2146	734	0.1415661	0.1981641
disgust	534	32	0.0352266	0.0086393
fear	918	228	0.0605581	0.0615551
joy	1879	444	0.1239528	0.1198704
negative	1788	298	0.1179497	0.0804536
positive	3275	951	0.2160433	0.2567495
sadness	936	91	0.0617455	0.0245680
surprise	1075	295	0.0709150	0.0796436
trust	1918	439	0.1265255	0.1185205

loughran_count <- loughran_count %>% mutate(ham_ratio = ham_count/ham_l_sum)
loughran_count  <- loughran_count %>% mutate(spam_ratio = spam_count/spam_l_sum)
loughran_count %>% kbl(caption = "Loughran Counts and Ratios") %>%
  kable_styling(bootstrap_options = "striped")

Loughran Counts and Ratios
sentiment	ham_count	spam_count	ham_ratio	spam_ratio
constraining	23	1	0.0117467	0.0021322
litigious	67	122	0.0342186	0.2601279
negative	789	126	0.4029622	0.2686567
positive	758	193	0.3871297	0.4115139
superfluous	1	NA	0.0005107	NA
uncertainty	320	27	0.1634321	0.0575693

row.names(bing_count) <- c("negative", "positive")
bing_ratio <- bing_count[c("ham_ratio", "spam_ratio")]
bing_only_count <- bing_count[c("ham_count", "spam_count")]

fisher.test(bing_only_count)

## 
##  Fisher's Exact Test for Count Data
## 
## data:  bing_only_count
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  2.701286 3.849601
## sample estimates:
## odds ratio 
##   3.218312

chisq.test(bing_only_count)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  bing_only_count
## X-squared = 187.84, df = 1, p-value < 2.2e-16

row.names(nrc_count) <- c("anger", "anticipation", "disgust", "fear", "joy", "negative", "positive", "sadness", "surprise", "trust")
nrc_only_count <- nrc_count[c("ham_count", "spam_count")]
nrc_ratio <- nrc_count[c("ham_ratio", "spam_ratio")]
chisq.test(nrc_only_count)

## 
##  Pearson's Chi-squared test
## 
## data:  nrc_only_count
## X-squared = 275.65, df = 9, p-value < 2.2e-16

row.names(loughran_count) <- c("contraining", "litigious", "negative", "positive", "superfluous", "uncertainty")
loughran_count[is.na(loughran_count)] = 0
loughran_only_count <- loughran_count[c("ham_count", "spam_count")]
loughran_ratio <- loughran_count[c("ham_ratio", "spam_ratio")]

chisq.test(loughran_only_count)

## Warning in chisq.test(loughran_only_count): Chi-squared approximation may be
## incorrect

## 
##  Pearson's Chi-squared test
## 
## data:  loughran_only_count
## X-squared = 300.08, df = 5, p-value < 2.2e-16

Conclusions

The p values for each chi-square test was less than .05, meaning that the difference in sentiment between ham and spam SMS messages was statistically significant.