Executive summary

Spam messages have become a real problem. They always find a way into our inboxes and do nothing but harm. As spam grows more sophisticated and increasingly resembles legitimate messages (ham messages), there is a need to identify the traits that distinguish spam from other messages in order to stop the harm it causes.

The main aim is to determine whether there are significant differences between the characteristics of spam and ham messages. If the analysis reveals such differences, recognizing spam messages becomes easier; consequently, the findings would be useful for anyone seeking to avoid being deceived by spam.

I will primarily analyze the data through the lens of sentiment analysis, though other basic analyses, such as frequency analysis, will also be conducted.

The main questions are the following:

- How do the frequency and usage patterns of words differ between spam and ham messages?
- Do spam messages differ from ham messages in terms of the sentiment they appeal to?
- What are the most common themes or topics identified in spam and ham messages?

The prediction is that, since spam messages aim to deceive readers and prompt them into certain actions (for example, clicking a malicious link or leaking personal information), they would actively use positive, gentle, and kind words.

After the analysis, the results did show significant differences between the ham and spam data. The TF-IDF graph highlighted the ten most distinctive 'core' words in spam and in ham messages; examining it made it possible to identify the core messages and main themes each type conveys. Spam messages focused on luring readers with prizes, points, and gifts, whereas ham messages revolved around daily conversation.

For sentiment analysis, the BING and NRC sentiment lexicons were used. The BING word clouds showed the negative and positive words used in both spam and ham messages. Spam messages turned out to use a wider variety of positive words, which corresponded well with the prediction.

The NRC lexicon showed which sentiments the messages evoke. Surprisingly, there was no significant difference between the sentiments conveyed most by ham messages and those conveyed most by spam messages.

Lastly, a bi-gram network was created to find out whether there were any significant word combinations; insight into these combinations also helped uncover the contexts of the messages. The spam bi-gram network contained combinations such as 'dont miss', 'bonus 2000', 'charge extra', and 'lucky day', suggesting that spam messages use enticing phrases aimed at attracting the reader's attention and prompting them to take action.

In contrast, the ham bi-gram network revealed combinations like 'love ya', 'wife didn't', 'stop bus', and 'babe hey', suggesting that ham messages center on personal interactions and everyday communication.

This analysis highlights the distinct linguistic patterns of spam and ham messages, providing insights that help people detect spam. Knowing the common words, phrases, and word pairings used in spam messages makes them easier to identify and filter, and these findings could also contribute to the development of effective spam-filtering algorithms.

The final advice: if a message from an unknown sender says you can get something for free, or offers you a prize for no reason, that message is very likely spam.

Data background

This analysis uses the SMS Spam Collection Dataset provided on kaggle.com. The SMS messages are written in English; the dataset description lists 5,574 messages (5,572 rows load from the CSV). Only two columns carry information: v1 records whether a message is 'spam' or 'ham', and v2 contains the full text of each SMS message.

Data loading, cleaning and preprocessing


First, load the spam.csv dataset into the 'full_data' variable. You can check that the v1 column tells us whether a message is 'spam' or 'ham', and v2 contains the actual text.

# full_data = unprocessed data is saved
# v1 column = 'spam' 'ham' info is saved here
# v2 column = actual text is saved here

full_data <- read_csv("spam.csv", show_col_types = FALSE)
## New names:
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
print(full_data)
## # A tibble: 5,572 × 5
##    v1    v2                                                    ...3  ...4  ...5 
##    <chr> <chr>                                                 <chr> <chr> <chr>
##  1 ham   "Go until jurong point, crazy.. Available only in bu… <NA>  <NA>  <NA> 
##  2 ham   "Ok lar... Joking wif u oni..."                       <NA>  <NA>  <NA> 
##  3 spam  "Free entry in 2 a wkly comp to win FA Cup final tkt… <NA>  <NA>  <NA> 
##  4 ham   "U dun say so early hor... U c already then say..."   <NA>  <NA>  <NA> 
##  5 ham   "Nah I don't think he goes to usf, he lives around h… <NA>  <NA>  <NA> 
##  6 spam  "FreeMsg Hey there darling it's been 3 week's now an… <NA>  <NA>  <NA> 
##  7 ham   "Even my brother is not like to speak with me. They … <NA>  <NA>  <NA> 
##  8 ham   "As per your request 'Melle Melle (Oru Minnaminungin… <NA>  <NA>  <NA> 
##  9 spam  "WINNER!! As a valued network customer you have been… <NA>  <NA>  <NA> 
## 10 spam  "Had your mobile 11 months or more? U R entitled to … <NA>  <NA>  <NA> 
## # ℹ 5,562 more rows
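
The `New names:` message reveals that spam.csv contains three unnamed, almost entirely empty trailing columns, which read_csv() auto-names ...3 to ...5. They carry no information for this analysis; one option (not applied here, so the outputs below match the original run) would be to drop them right away ('full_data_clean' is an illustrative name):

# optional cleanup: keep only the label and text columns
full_data_clean <- full_data %>% select(v1, v2)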

Now, preprocess the data: tokenize the text and remove stop words. Note that unnest_tokens() lowercases the text and strips punctuation by default, which is why the tokens below appear in lowercase. The preprocessed data is stored in the 'data' variable.

# 'data' variable = pre-processed full_data (tokenized, stop words removed)
# when tokenizing, the tokenized words are saved in the 'word' column

data <- full_data %>% 
  unnest_tokens(word, v2) %>%
  anti_join(stop_words, by = "word")

print(data)
## # A tibble: 38,221 × 5
##    v1    ...3  ...4  ...5  word  
##    <chr> <chr> <chr> <chr> <chr> 
##  1 ham   <NA>  <NA>  <NA>  jurong
##  2 ham   <NA>  <NA>  <NA>  crazy 
##  3 ham   <NA>  <NA>  <NA>  bugis 
##  4 ham   <NA>  <NA>  <NA>  world 
##  5 ham   <NA>  <NA>  <NA>  la    
##  6 ham   <NA>  <NA>  <NA>  buffet
##  7 ham   <NA>  <NA>  <NA>  cine  
##  8 ham   <NA>  <NA>  <NA>  amore 
##  9 ham   <NA>  <NA>  <NA>  wat   
## 10 ham   <NA>  <NA>  <NA>  lar   
## # ℹ 38,211 more rows

Text data analysis

Individual analysis and figures

Analysis and Figure 1

# processing data for TF-IDF analysis
# word_counts = word frequency counted per class
word_counts <- data %>%
  count(v1, word, sort = TRUE)

# tf_counts = tf-idf processed data
tf_counts <- word_counts %>%
  bind_tf_idf(term = word,           
              document = v1,  
              n = n) %>%             
  arrange(-tf_idf)

# counting the top 10 TF-IDF words for ham and spam
# top10 = top 10 tf-idf processed words
top10 <- tf_counts %>%
  group_by(v1) %>%
  slice_max(tf_idf, n = 10, with_ties = FALSE)

top10$v1 <- factor(top10$v1, levels = c("ham", "spam"))

# plotting bar graph:
ggplot(top10, aes(x = reorder_within(word, tf_idf, v1),
                  y = tf_idf,
                  fill = v1)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  facet_wrap(~ v1, scales = "free", ncol = 2) +
  scale_x_reordered() +
  labs(title = "Top 10 Words by tf-idf",
       x = NULL,
       y = "tf-idf") +
  scale_fill_manual(values = c("ham" = "lightblue",
                               "spam" = "lightpink")) +
  theme_light()
TF-IDF - top 10

With the preprocessed 'data' variable, a TF-IDF analysis was performed. The ten highest-scoring words for each class were selected and visualized in a bar graph, which suits this task well since the length of each bar maps directly to a word's score.

TF-IDF analysis was conducted not merely to find the most frequent words but to identify the important and meaningful ones. A plain frequency analysis, which only counts how many times each word appears, does not account for how distinctive a word is to one class. TF-IDF therefore highlights words that are not just frequent but also characteristic of spam or of ham.
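
For reference, bind_tf_idf() here treats each class (ham, spam) as one "document": tf is a word's share of all tokens in its class, and idf is the log of the number of documents divided by the number of documents containing the word. Below is a minimal sketch of the same arithmetic, assuming dplyr is attached ('manual_tf_idf' is an illustrative name, not part of the pipeline above):

# manual tf-idf, mirroring what bind_tf_idf() computes
manual_tf_idf <- word_counts %>%
  group_by(v1) %>%
  mutate(tf = n / sum(n)) %>%                # term frequency within each class
  ungroup() %>%
  group_by(word) %>%
  mutate(idf = log(2 / n_distinct(v1))) %>%  # 2 "documents" in total: ham and spam
  ungroup() %>%
  mutate(tf_idf = tf * idf)                  # words appearing in both classes get idf = 0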

Analysis and Figure 2-1

# separate the ham and spam data
# data = pre-processed data (tokenized, stop words removed)

# ham_data = only contains ham messages
ham_data <- data %>%
  filter(v1 == "ham")

# spam_data = only contains spam messages
spam_data <- data %>%
  filter(v1 == "spam")

# processing ham messages with BING (joining explicitly by the 'word' column):
sentiment_ham <- ham_data %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE)

# processing spam messages with BING:
sentiment_spam <- spam_data %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE)
sentiment_ham
## # A tibble: 698 × 3
##    word  sentiment     n
##    <chr> <chr>     <int>
##  1 love  positive    191
##  2 happy positive    106
##  3 miss  negative     69
##  4 free  positive     60
##  5 nice  positive     58
##  6 fine  positive     50
##  7 smile positive     45
##  8 cool  positive     40
##  9 ready positive     35
## 10 sweet positive     35
## # ℹ 688 more rows
sentiment_spam
## # A tibble: 144 × 3
##    word    sentiment     n
##    <chr>   <chr>     <int>
##  1 free    positive    223
##  2 prize   positive     92
##  3 won     positive     73
##  4 urgent  negative     63
##  5 win     positive     60
##  6 awarded positive     38
##  7 award   positive     28
##  8 bonus   positive     21
##  9 top     positive     16
## 10 winner  positive     16
## # ℹ 134 more rows
# plotting comparison word cloud for HAM
# (acast orders sentiment columns alphabetically, so darkred = negative, lightblue = positive)
sentiment_ham %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("darkred", "lightblue"))
BING word cloud: HAM

# plotting comparison word cloud for SPAM
# (same color mapping: darkred = negative, lightblue = positive)
sentiment_spam %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("darkred", "lightblue"))
BING word cloud: SPAM

The word cloud for the spam data is noticeably smaller than the one for the ham data, simply because the dataset contains far fewer spam messages.

Using the BING sentiment lexicon lets us see the positive and negative words used in both spam and ham messages.

Word clouds are an effective way to visualize word usage: larger words indicate higher frequency, and the two halves show directly which positive and negative words were used. Looking at the spam cloud, a greater variety of positive words than negative words was used (more distinct positive words appear in the cloud than negative ones).

An intriguing observation is that the most used positive words in spam messages align closely with the findings from the earlier TF-IDF bar graph, which highlighted the most important and meaningful words.
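
The "greater variety" observation can be quantified directly: each row of sentiment_spam and sentiment_ham is one distinct word, so counting rows per sentiment gives the number of distinct positive and negative words on each side. A quick check (not part of the original pipeline):

# distinct BING words per sentiment in each class
sentiment_spam %>% count(sentiment, name = "distinct_words")
sentiment_ham %>% count(sentiment, name = "distinct_words")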

Analysis and Figure 2-2

# loading NRC sentiment data
# sentiment_list = the set of sentiments the NRC lexicon works with
load("nrc.rda")
sentiment_list <- unique(nrc$sentiment)

print(sentiment_list)
##  [1] "trust"        "fear"         "negative"     "sadness"      "anger"       
##  [6] "surprise"     "positive"     "disgust"      "joy"          "anticipation"
# checking which sentiment appeared most frequently in the words - HAM
# (NRC assigns several sentiments to some words, so a many-to-many join is expected)
sentiment_count_ham <- ham_data %>%
  inner_join(nrc, by = "word", relationship = "many-to-many") %>%
  count(sentiment, sort = TRUE)
# checking which sentiment appeared most frequently in the words - SPAM
sentiment_count_spam <- spam_data %>%
  inner_join(nrc, by = "word", relationship = "many-to-many") %>%
  count(sentiment, sort = TRUE)
# plotting donut chart - HAM
ggplot(sentiment_count_ham, aes(x = 2, y = n, fill = sentiment)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y") +
  geom_text(aes(label = paste0(round(n / sum(n) * 100, 1), "%")), 
            position = position_stack(vjust = 0.5)) +
  labs(title = "Sentiment Distribution in HAM messages") +
  theme_void() +   # keep the donut free of axes (a trailing theme_light() would override theme_void)
  theme(legend.title = element_blank()) +
  xlim(0.5, 2.5)
NRC donut graph: HAM

# plotting donut chart - SPAM
ggplot(sentiment_count_spam, aes(x = 2, y = n, fill = sentiment)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y") +
  geom_text(aes(label = paste0(round(n / sum(n) * 100, 1), "%")), 
            position = position_stack(vjust = 0.5)) +
  labs(title = "Sentiment Distribution in SPAM messages") +
  theme_void() +   # as above, do not override theme_void() with a second theme
  theme(legend.title = element_blank()) +
  xlim(0.5, 2.5)
NRC donut graph: SPAM

This time, a donut chart was plotted to clearly illustrate the proportion of sentiment words used in each set of messages.

Interestingly, the proportions of sentiment differed only minimally between spam and ham messages. Hence, for the question 'Do spam messages differ from ham messages in terms of the sentiment they appeal to?', the answer is that there is no significant difference between spam and ham messages.
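
To make "minimal differences" concrete, the two count tables can be joined and converted to shares. A quick sketch using the objects computed above ('sentiment_share' is an illustrative name):

# side-by-side sentiment shares for ham and spam
sentiment_share <- full_join(
  sentiment_count_ham %>% mutate(ham = n / sum(n)) %>% select(sentiment, ham),
  sentiment_count_spam %>% mutate(spam = n / sum(n)) %>% select(sentiment, spam),
  by = "sentiment"
)
print(sentiment_share)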

Analysis and Figure 3

# full_data was used here because tokenization has to be done differently for bi-gram analysis:
# the 'data' variable is already tokenized into single words, so we go back to full_data.

# processing data for bi-gram - SPAM
spam_data_bigram <- full_data %>%
  filter(v1 == "spam") %>% 
  unnest_tokens(bigram, v2, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram))

# in spam_data_seperated, the text column v2 is split into word1 and word2 columns
spam_data_seperated <- spam_data_bigram %>%
  separate(bigram, c("word1", "word2"), sep = " ")

# removing stop words
spam_bigrams_filtered <- spam_data_seperated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)

spam_bigrams_counted <- spam_bigrams_filtered %>% 
  count(word1, word2, sort = TRUE)

# keep bigrams appearing more than 4 times so the network stays readable
spam_bigrams_graph <- spam_bigrams_counted %>%
  filter(n > 4) %>%
  graph_from_data_frame()

# processing data for bi-gram - HAM
ham_data_bigram <- full_data %>%
  filter(v1 == "ham") %>% 
  unnest_tokens(bigram, v2, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram))

# in ham_data_seperated, the text column v2 is split into word1 and word2 columns
ham_data_seperated <- ham_data_bigram %>%
  separate(bigram, c("word1", "word2"), sep = " ")

# removing stop words
ham_bigrams_filtered <- ham_data_seperated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)

ham_bigrams_counted <- ham_bigrams_filtered %>% 
  count(word1, word2, sort = TRUE)

# keep bigrams appearing more than 4 times, as with spam
ham_bigrams_graph <- ham_bigrams_counted %>%
  filter(n > 4) %>%
  graph_from_data_frame()

# plotting bigram graph - HAM: 
set.seed(2023)

ggraph(ham_bigrams_graph, layout = "fr") +
  geom_edge_link() +
  geom_node_point() +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) + 
  labs(title = "bi-gram: HAM messages") + 
  theme_light()
Bi-gram graph: HAM

# plotting bi-gram graph - SPAM:
set.seed(2023)

ggraph(spam_bigrams_graph, layout = "fr") +
  geom_edge_link() +
  geom_node_point() +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) + 
  labs(title = "bi-gram: SPAM messages") + 
  theme_light()
Bi-gram graph: SPAM

Bi-gram graphs were plotted for the spam and the ham messages. Given the relatively short length of SMS messages, bi-gram graphs were one of the most effective methods for exploring this data.

By analyzing the plotted graphs, it was possible to work out the general context of the messages. Spam messages predominantly centered on topics like prizes, free bonuses, and extra charges, while ham messages again focused on daily conversation. It was also noticeable that the words present in the bi-gram graphs appeared in the previous graphs (the tf-idf graph, the sentiment graphs, etc.) as well.
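
The most frequent pairs can also be read directly off the count tables built above, which is a quick way to verify what the network plots show:

# top raw bigram counts in each class
head(spam_bigrams_counted, 10)
head(ham_bigrams_counted, 10)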

Consistent patterns across all graphs indicated several conclusions:

- There were significant differences in word usage between spam and ham messages.
- Spam messages tended to use more positive words than ham messages.
- These positive words frequently pertained to winning prizes.