Executive summary

After becoming the victim of a scam, Alissa, a data analysis scientist, was motivated to differentiate between spam and authentic SMS texts. Her research used a variety of analytical tools to identify linguistic traits that are specific to spam. She used a log odds ratio analysis to identify words commonly used in spam, such as “claim” and “prize,” demonstrating how spammers craft tempting messages. Sentiment analysis indicated that spam messages often have a more positive tone, consistent with emotional manipulation. Using TF-IDF, she highlighted terms such as “guaranteed” that are common in spam but uncommon in normal correspondence, indicating deceptive emphasis. Furthermore, bi-gram analysis revealed common word combinations in spam that are unusual in real texts, allowing spammers’ techniques to be mapped more clearly.

Alissa used graphics to demonstrate these findings, applying contrasting colors to highlight the distinctions between spam and legitimate texts and improve readability. Her study contributes to the understanding of spam detection, providing useful insights for enhancing spam filtering systems and user awareness. This research not only enhances academic understanding but also helps in designing more effective anti-spam solutions and informs regulatory approaches to digital communication security.

Background

Spamming is the dissemination of large amounts of unwanted information through ads, pornographic websites, fraudulent solicitations, fake news, online employment scams, and other malicious schemes perpetrated by spammers (Adewole et al., 2019). In other words, spamming is the act of sending large quantities of undesirable information by companies or individuals with differing intentions. The rise in popularity of short messaging services has led to a significant increase in spam messages, negatively impacting people’s daily lives, societal stability, and public security (Ning et al., 2019). Understanding the characteristics that separate spam from genuine messages can help create better spam filters. This research examines the SMS Spam Collection Dataset from Kaggle to uncover these distinguishing characteristics.

Data Source:

The SMS Spam Collection Dataset was contributed by the UCI Machine Learning Repository and is available on Kaggle. The dataset was collected in 2011 and created for research purposes, primarily the development and evaluation of spam filtering methods.

Data Structure

The dataset contains 5,574 English SMS messages in two categories, spam and ham (the Kaggle CSV loaded below parses to 5,572 rows). The dataset is organized as follows:

label   message
ham     Go until jurong point, crazy.. Available only in …
ham     Ok lar… Joking wif u oni…
spam    Free entry in 2 a wkly comp to win FA Cup fina…
ham     U dun say so early hor… U c already then say…
spam    Six chances to win CASH! From 100 to 20,000 po…

What the Data Show

  1. Spam texts: contain promotional information, offers, or links intended to persuade the receiver into taking certain actions.

  2. Ham texts: genuine messages containing ordinary conversation, with no malicious intent.

Research Question

What linguistic features are most indicative of spam messages in SMS data, and how can these features be used to improve spam detection systems?

Research Objectives

  1. Identifying Features: Look for terms and patterns that are frequent in spam texts.
  2. Analyzing Sentiment: Compare the emotional tone of spam and authentic communications to discover if there is any difference.
  3. Visualizing the results: Create charts or graphs to illustrate the distinctions between spam and valid messages.

Research Significance

  1. Practical
  2. Academic
  3. Social

Hypothesis

Hypothesis 1: Spam texts will use more specific words and phrases than ham texts, such as advertising buzzwords and urgent language.

Hypothesis 2: The sentiment of spam messages differs from that of ham texts.

Data loading

# Load the dataset
sms_data <- read_csv("spam.csv")
## New names:
## Rows: 5572 Columns: 5
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (5): v1, v2, ...3, ...4, ...5
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
sms_data
## # A tibble: 5,572 × 5
##    v1    v2                                                    ...3  ...4  ...5 
##    <chr> <chr>                                                 <chr> <chr> <chr>
##  1 ham   "Go until jurong point, crazy.. Available only in bu… <NA>  <NA>  <NA> 
##  2 ham   "Ok lar... Joking wif u oni..."                       <NA>  <NA>  <NA> 
##  3 spam  "Free entry in 2 a wkly comp to win FA Cup final tkt… <NA>  <NA>  <NA> 
##  4 ham   "U dun say so early hor... U c already then say..."   <NA>  <NA>  <NA> 
##  5 ham   "Nah I don't think he goes to usf, he lives around h… <NA>  <NA>  <NA> 
##  6 spam  "FreeMsg Hey there darling it's been 3 week's now an… <NA>  <NA>  <NA> 
##  7 ham   "Even my brother is not like to speak with me. They … <NA>  <NA>  <NA> 
##  8 ham   "As per your request 'Melle Melle (Oru Minnaminungin… <NA>  <NA>  <NA> 
##  9 spam  "WINNER!! As a valued network customer you have been… <NA>  <NA>  <NA> 
## 10 spam  "Had your mobile 11 months or more? U R entitled to … <NA>  <NA>  <NA> 
## # ℹ 5,562 more rows

Data Cleaning and Preprocessing

Renaming the data labels

#Renaming the data labels
sms_data <- sms_data %>%
  select(v1, v2) %>%
  rename(spam_or_ham = v1, message = v2)
sms_data
## # A tibble: 5,572 × 2
##    spam_or_ham message                                                          
##    <chr>       <chr>                                                            
##  1 ham         "Go until jurong point, crazy.. Available only in bugis n great …
##  2 ham         "Ok lar... Joking wif u oni..."                                  
##  3 spam        "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2…
##  4 ham         "U dun say so early hor... U c already then say..."              
##  5 ham         "Nah I don't think he goes to usf, he lives around here though"  
##  6 spam        "FreeMsg Hey there darling it's been 3 week's now and no word ba…
##  7 ham         "Even my brother is not like to speak with me. They treat me lik…
##  8 ham         "As per your request 'Melle Melle (Oru Minnaminunginte Nurungu V…
##  9 spam        "WINNER!! As a valued network customer you have been selected to…
## 10 spam        "Had your mobile 11 months or more? U R entitled to Update to th…
## # ℹ 5,562 more rows

Explanation:

I renamed v1 to “spam_or_ham” and v2 to “message” to make the data easier to read and understand. I also used select(v1, v2) to drop the unwanted columns (...3, ...4, ...5) that read_csv produced from the raw file, as can be seen in the output above.

Tokenizing and Removing Stop Words (Both Spam and Ham)

#Tokenizing and Removing Stop Words (Both Spam and Ham)
sms_tidy <- sms_data %>%
  unnest_tokens(word, message) %>%                                    #Tokenizing
  anti_join(stop_words, by = "word") %>%                              #Removing stop words
  mutate(word = str_replace_all(word, "[[:punct:]]", "")) %>%         #Removing punctuations
  mutate(word = tolower(word))                                        #Lower casing 
sms_tidy
## # A tibble: 38,221 × 2
##    spam_or_ham word  
##    <chr>       <chr> 
##  1 ham         jurong
##  2 ham         crazy 
##  3 ham         bugis 
##  4 ham         world 
##  5 ham         la    
##  6 ham         buffet
##  7 ham         cine  
##  8 ham         amore 
##  9 ham         wat   
## 10 ham         lar   
## # ℹ 38,211 more rows

Explanation:

  1. Tokenizing: Breaking messages down into individual words simplifies text analysis. How: The unnest_tokens(word, message) function turns each message into a tidy format, with one row per word.

  2. Removing Stop Words: Stop words (e.g., “the,” “and,” “is”) are common words that are not significant for this research. How: The anti_join(stop_words, by = "word") call removes all stop words listed in the stop_words dataset from the tokenized words.

  3. Removing Punctuation: Removing punctuation marks helps simplify the data in text analysis. How: The mutate(word = str_replace_all(word, "[[:punct:]]", "")) call uses a regular expression to replace all punctuation with an empty string.

  4. Lower Casing Each Word: Converting all words to lowercase ensures that words like “Go” and “go” are treated as the same token, maintaining consistency. How: The mutate(word = tolower(word)) call converts all words to lowercase (unnest_tokens() already lowercases by default, so this step is a safeguard). A toy example of the first two steps follows this list.
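
To make the first two steps concrete, here is a minimal sketch on a made-up message (not from the dataset), assuming the same tidyverse and tidytext packages are loaded as in the chunks above:

#Toy example of tokenizing and stop-word removal (made-up message)
tibble(message = "WINNER!! Claim your FREE prize now") %>%
  unnest_tokens(word, message) %>%     #splits and lowercases: "winner", "claim", ...
  anti_join(stop_words, by = "word")   #drops common words such as "your"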

Filtering Spam

#Filtering Spam
spam_tidy <- sms_tidy %>% 
  filter(spam_or_ham == "spam")
spam_tidy
## # A tibble: 11,518 × 2
##    spam_or_ham word 
##    <chr>       <chr>
##  1 spam        free 
##  2 spam        entry
##  3 spam        2    
##  4 spam        wkly 
##  5 spam        comp 
##  6 spam        win  
##  7 spam        fa   
##  8 spam        cup  
##  9 spam        final
## 10 spam        tkts 
## # ℹ 11,508 more rows

Filtering Ham

#Filtering Ham
ham_tidy <- sms_tidy %>% 
  filter(spam_or_ham == "ham")
ham_tidy
## # A tibble: 26,703 × 2
##    spam_or_ham word  
##    <chr>       <chr> 
##  1 ham         jurong
##  2 ham         crazy 
##  3 ham         bugis 
##  4 ham         world 
##  5 ham         la    
##  6 ham         buffet
##  7 ham         cine  
##  8 ham         amore 
##  9 ham         wat   
## 10 ham         lar   
## # ℹ 26,693 more rows

Explanation:

  1. Filtering Spam Messages: Separating spam messages allows for targeted analysis of the characteristics unique to spam. How: The filter(spam_or_ham == "spam") call selects rows where the spam_or_ham column equals "spam".

  2. Filtering Ham Messages: Separating ham messages allows for targeted analysis of the characteristics unique to legitimate (ham) messages. How: The filter(spam_or_ham == "ham") call selects rows where the spam_or_ham column equals "ham".

Text data analysis

Counting the most common words

#Count the most common words in spam and ham messages
spam_ham_counts <- sms_tidy %>% 
  count(word, spam_or_ham, sort = TRUE)
spam_ham_counts
## # A tibble: 9,206 × 3
##    word  spam_or_ham     n
##    <chr> <chr>       <int>
##  1 call  spam          355
##  2 2     ham           320
##  3 gt    ham           318
##  4 lt    ham           316
##  5 ur    ham           241
##  6 call  ham           231
##  7 free  spam          223
##  8 day   ham           200
##  9 time  ham           198
## 10 love  ham           191
## # ℹ 9,196 more rows
#Count the most common words in spam messages
spam_counts <- spam_tidy %>%
  count(word, sort = TRUE)
spam_counts
## # A tibble: 2,627 × 2
##    word       n
##    <chr>  <int>
##  1 call     355
##  2 free     223
##  3 2        188
##  4 txt      160
##  5 ur       144
##  6 4        129
##  7 mobile   127
##  8 text     125
##  9 stop     121
## 10 claim    113
## # ℹ 2,617 more rows
#Count the most common words in ham messages
ham_counts <- ham_tidy %>%
  count(word, sort = TRUE)
ham_counts
## # A tibble: 6,579 × 2
##    word      n
##    <chr> <int>
##  1 2       320
##  2 gt      318
##  3 lt      316
##  4 ur      241
##  5 call    231
##  6 day     200
##  7 time    198
##  8 love    191
##  9 4       181
## 10 lor     162
## # ℹ 6,569 more rows

Sentiment analysis using the Bing lexicon

#Bing lexicon for spam sentiments 
spam_sentiments <- spam_tidy %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = list(n = 0))
## Joining with `by = join_by(word)`
spam_sentiments
## # A tibble: 144 × 3
##    word       negative positive
##    <chr>         <int>    <int>
##  1 abuse             1        0
##  2 accessible        0        1
##  3 admirer           0       10
##  4 afraid            1        0
##  5 amazing           0        3
##  6 award             0       28
##  7 awarded           0       38
##  8 bad               1        0
##  9 beg               1        0
## 10 benefits          0        1
## # ℹ 134 more rows
#Bing lexicon for ham sentiments 
ham_sentiments <- ham_tidy %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = list(n = 0))
## Joining with `by = join_by(word)`
ham_sentiments
## # A tibble: 698 × 3
##    word         negative positive
##    <chr>           <int>    <int>
##  1 absence             1        0
##  2 ache                4        0
##  3 addicted            4        0
##  4 adjustable          0        1
##  5 adore               0        3
##  6 adoring             0        2
##  7 affection           0        4
##  8 affectionate        0        1
##  9 afford              0        1
## 10 afraid              3        0
## # ℹ 688 more rows
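
To compare the two classes directly (Hypothesis 2), one possible next step is to collapse these word-level counts into a single positive share per class. This is a sketch reusing the spam_sentiments and ham_sentiments objects above; the column name positive_share is illustrative:

#Aggregate the Bing counts into one positive share per class (sketch)
bind_rows(mutate(spam_sentiments, type = "spam"),
          mutate(ham_sentiments, type = "ham")) %>%
  group_by(type) %>%
  summarise(positive = sum(positive),
            negative = sum(negative)) %>%
  mutate(positive_share = positive / (positive + negative))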

Log Odds Ratio

sms_ratios <- sms_tidy %>% 
  count(word, spam_or_ham) %>%
  pivot_wider(names_from = spam_or_ham, values_from = n, values_fill = list(n = 0)) %>%
  mutate(spam_total = sum(spam), 
         ham_total = sum(ham)) %>% 
  mutate(logratio = log((spam + 1) / (spam_total + 1) / ((ham + 1) / (ham_total + 1)))) %>%
  arrange(desc(logratio))
sms_ratios
## # A tibble: 8,428 × 6
##    word         ham  spam spam_total ham_total logratio
##    <chr>      <int> <int>      <int>     <int>    <dbl>
##  1 claim          0   113      11518     26703     5.58
##  2 prize          0    92      11518     26703     5.37
##  3 150p           0    74      11518     26703     5.16
##  4 won            0    73      11518     26703     5.14
##  5 tone           0    59      11518     26703     4.94
##  6 150            0    55      11518     26703     4.87
##  7 guaranteed     0    50      11518     26703     4.77
##  8 18             0    49      11518     26703     4.75
##  9 500            0    45      11518     26703     4.67
## 10 cs             0    44      11518     26703     4.65
## # ℹ 8,418 more rows
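
As a sanity check, the top value can be reproduced by hand. For “claim” (113 spam occurrences, 0 ham occurrences, out of 11,518 spam and 26,703 ham tokens), the +1-smoothed formula from the pipeline above gives:

#Reproduce the logratio for "claim" from the first row of the table above
log(((113 + 1) / (11518 + 1)) / ((0 + 1) / (26703 + 1)))   #about 5.58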

TF-IDF

sms_tf_idf <- spam_ham_counts %>%
  bind_tf_idf(term = word,          
              document = spam_or_ham,  
              n = n) %>%            
  arrange(desc(tf_idf))
sms_tf_idf
## # A tibble: 9,206 × 6
##    word  spam_or_ham     n      tf   idf  tf_idf
##    <chr> <chr>       <int>   <dbl> <dbl>   <dbl>
##  1 gt    ham           318 0.0119  0.693 0.00825
##  2 lt    ham           316 0.0118  0.693 0.00820
##  3 claim spam          113 0.00981 0.693 0.00680
##  4 prize spam           92 0.00799 0.693 0.00554
##  5 150p  spam           74 0.00642 0.693 0.00445
##  6 won   spam           73 0.00634 0.693 0.00439
##  7 lor   ham           162 0.00607 0.693 0.00421
##  8 da    ham           149 0.00558 0.693 0.00387
##  9 tone  spam           59 0.00512 0.693 0.00355
## 10 150   spam           55 0.00478 0.693 0.00331
## # ℹ 9,196 more rows
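
These scores can likewise be reproduced by hand. For “claim” in the spam “document”, the term frequency is its count divided by all spam tokens, and the inverse document frequency is the log of the number of documents over the number of documents containing the word, here log(2/1) because “claim” never appears in ham:

#Reproduce the tf-idf of "claim" in the spam document
tf  <- 113 / 11518   #term count over total tokens in the spam document
idf <- log(2 / 1)    #2 documents (spam, ham); "claim" occurs only in spam
tf * idf             #about 0.0068, matching the tf_idf column above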

Bi-gram Analysis

sms_bigrams <- sms_data %>%
  unnest_tokens(bigram, message, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word, !word2 %in% stop_words$word)
sms_bigrams
## # A tibble: 16,373 × 3
##    spam_or_ham word1  word2 
##    <chr>       <chr>  <chr> 
##  1 ham         world  la    
##  2 ham         buffet cine  
##  3 ham         amore  wat   
##  4 ham         lar    joking
##  5 ham         joking wif   
##  6 spam        free   entry 
##  7 spam        wkly   comp  
##  8 spam        win    fa    
##  9 spam        fa     cup   
## 10 spam        cup    final 
## # ℹ 16,363 more rows
# Count bi-grams for spam
spam_bigram_counts <- sms_bigrams %>%
  filter(spam_or_ham == "spam") %>%
  count(word1, word2, sort = TRUE) %>%
  filter(n>4)
spam_bigram_counts
## # A tibble: 260 × 3
##    word1      word2          n
##    <chr>      <chr>      <int>
##  1 po         box           24
##  2 1000       cash          23
##  3 guaranteed call          23
##  4 prize      guaranteed    22
##  5 national   rate          20
##  6 await      collection    19
##  7 send       stop          19
##  8 land       line          18
##  9 2          claim         17
## 10 customer   service       17
## # ℹ 250 more rows
spam_bigram_graph <- spam_bigram_counts %>%
  filter(n>4) %>%
  graph_from_data_frame()
spam_bigram_graph
## IGRAPH 01196b4 DN-- 253 260 -- 
## + attr: name (v/c), n (e/n)
## + edges from 01196b4 (vertex names):
##  [1] po        ->box        1000      ->cash       guaranteed->call      
##  [4] prize     ->guaranteed national  ->rate       await     ->collection
##  [7] send      ->stop       land      ->line       2         ->claim     
## [10] customer  ->service    valid     ->12hrs      150p      ->msg       
## [13] account   ->statement  call      ->mobileupd8 free      ->entry     
## [16] identifier->code       2lands    ->row        dating    ->service   
## [19] suite342  ->2lands     txt       ->stop       ur        ->mob       
## [22] 2nd       ->attempt    line      ->claim      ur        ->awarded   
## + ... omitted several edges
# Count bi-grams for ham
ham_bigram_counts <- sms_bigrams %>%
  filter(spam_or_ham == "ham") %>%
  count(word1, word2, sort = TRUE) %>%
  filter(n>4) 
ham_bigram_counts
## # A tibble: 88 × 3
##    word1   word2       n
##    <chr>   <chr>   <int>
##  1 lt      gt        276
##  2 <NA>    <NA>       42
##  3 wan     2          26
##  4 decimal gt         23
##  5 lt      decimal    23
##  6 pls     send       22
##  7 wat     time       18
##  8 nice    day        15
##  9 4       dinner     14
## 10 gt      min        13
## # ℹ 78 more rows
ham_bigram_graph <- ham_bigram_counts %>%
  filter(n>4) %>%
  graph_from_data_frame()
## Warning in graph_from_data_frame(.): In `d' `NA' elements were replaced with
## string "NA"
ham_bigram_graph
## IGRAPH f2d341c DN-- 110 88 -- 
## + attr: name (v/c), n (e/n)
## + edges from f2d341c (vertex names):
##  [1] lt      ->gt       NA      ->NA       wan     ->2        decimal ->gt      
##  [5] lt      ->decimal  pls     ->send     wat     ->time     nice    ->day     
##  [9] 4       ->dinner   gt      ->min      gud     ->ni8      dun     ->wan     
## [13] gud     ->mrng     happy   ->birthday wait    ->4        4       ->lunch   
## [17] watching->tv       love    ->ya       sweet   ->dreams   2       ->meet    
## [21] 4       ->ur       gt      ->mins     gt      ->minutes  joy's   ->father  
## [25] pls     ->pls      ur      ->friends  wait    ->till     2       ->watch   
## [29] coming  ->home     gt      ->lt       gud     ->nyt      house   ->maid    
## + ... omitted several edges
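
A note on the “NA NA” pair in ham_bigram_counts and the igraph warning above: messages too short to form a bigram make unnest_tokens() return NA, and the stop-word filter does not remove NA values. One possible fix is to drop them before building the graph; this sketch reuses the objects defined above:

#Drop NA bigrams before building the ham graph (avoids the igraph warning)
ham_bigram_graph <- sms_bigrams %>%
  filter(spam_or_ham == "ham",
         !is.na(word1), !is.na(word2)) %>%   #short messages yield NA bigrams
  count(word1, word2, sort = TRUE) %>%
  filter(n > 4) %>%
  graph_from_data_frame()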

Individual analysis and figures

10 Frequent Words from Spam and Ham

#10 Frequent Words from Spam and Ham
spam_ham_10 <- spam_ham_counts %>%
  anti_join(stop_words, by = "word") %>%
  group_by(spam_or_ham) %>%
  slice_max(n, n = 10) %>%
  ungroup()

ggplot(spam_ham_10, aes(x = fct_reorder(word, n), 
                        y = n, 
                        fill = spam_or_ham)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  facet_wrap(~spam_or_ham, scales = "free_y") +  
  labs(x = NULL, 
       y = "Frequency", 
       title = "Top 10 Frequent Words from Spam and Ham") +
  scale_fill_manual(values = c("ham" = "pink", "spam" = "maroon"))

WordCloud from Spam

#WordCloud from Spam
gradient_maroon <- c("#b30000", "#800000", "#480000") 

spam_tidy %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100, colors = gradient_maroon))
## Joining with `by = join_by(word)`

WordCloud from Ham

#WordCloud from Ham
gradient_pink <- colorRampPalette(c("#ffcccb", "#ff99aa", "#ff6699", "#ff3366"))(100)

ham_tidy %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100, colors = gradient_pink))
## Joining with `by = join_by(word)`

Bing Sentiment Graph

#Bing Sentiment Graph
sentiments_combined <- bind_rows(mutate(spam_sentiments, type = "Spam"),
                                 mutate(ham_sentiments, type = "Ham"))

sentiments_long <- sentiments_combined %>%
  pivot_longer(cols = -c(word, type), names_to = "sentiment", values_to = "count")

ggplot(sentiments_long, aes(x = type, y = count, fill = sentiment)) +
  geom_bar(stat = "identity", position = "dodge") + # Use 'identity' to use counts directly
  scale_fill_manual(values = c("positive" = "pink", "negative" = "maroon")) + # Specify colors
  labs(x = "Type",
       y = "Count",
       fill = "Sentiment",
       title = "Distribution of Sentiment Categories in Spam and Ham Messages") +
  theme_minimal()


Log Odds Ratio Graph Top 10

#Log Odds Ratio Graph Top 10
sms_ratios %>%
  group_by(logratio < 0) %>%
  slice_max(abs(logratio), n = 10) %>% 
  ungroup() %>%
  mutate(word = reorder(word, logratio)) %>%
  ggplot(aes(x = word, y = logratio, fill = logratio < 0)) +
  geom_col(show.legend = T) +
  coord_flip() +
  ylab("log odds ratio (spam/ham)") +
  scale_fill_manual(values = c("TRUE" = "pink", "FALSE" = "maroon")) +
  theme_minimal()

TF-IDF Top 10

#TF-IDF Top 10
top10_sms <- sms_tf_idf %>%
  anti_join(stop_words, by = "word") %>%
  group_by(spam_or_ham) %>%
  slice_max(tf_idf, n = 10, with_ties = FALSE)

top10_sms$spam_or_ham <- factor(top10_sms$spam_or_ham,
                                levels = c("ham", "spam"))

ggplot(top10_sms, aes(x = reorder_within(word, tf_idf, spam_or_ham),
                      y = tf_idf,
                      fill = spam_or_ham)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  facet_wrap(~ spam_or_ham, scales = "free_y", ncol = 2) +
  scale_x_reordered() +
  labs(x = NULL, 
       y = "TF-IDF",
       title = "Top 10 Words in TF-IDF in Spam and Ham Messages") +
  scale_fill_manual(values = c("ham" = "pink", "spam" = "maroon"))

Bi-Gram Spam

#Bi-Gram Spam
set.seed(2023)

ggraph(spam_bigram_graph, layout = "fr") +
  geom_edge_link() +
  geom_node_point() +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1)

a <- grid::arrow(type = "closed", length = unit(.10, "inches"))

ggraph(spam_bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                 arrow = a, end_cap = circle(.03, 'inches')) +
  geom_node_point(color = "maroon", size = 2) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()

Bi-Gram Ham

#Bi-gram ham
set.seed(2023)

ggraph(ham_bigram_graph, layout = "fr") +
  geom_edge_link() +
  geom_node_point() +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1)

a <- grid::arrow(type = "closed", length = unit(.10, "inches"))

ggraph(ham_bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                 arrow = a, end_cap = circle(.03, 'inches')) +
  geom_node_point(color = "pink", size = 2) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()

Analysis and Figure 1

To create the graph above, the SMS data was first tokenized and cleaned to eliminate stop words and punctuation. The data was then aggregated to determine the frequency of each word in spam and ham messages, and the top ten words by frequency in each category were selected. A bar chart was created using ggplot2, with terms on the y-axis, frequency on the x-axis, and colors separating spam (maroon) from ham (pink). The bars were sorted in decreasing order of frequency, and the chart was split into two facets to allow a direct comparison of word frequencies between spam and ham.

The graph above addresses the research question of this study, “What linguistic features are most indicative of spam messages in SMS data, and how can these features be used to improve spam detection systems?”. By comparing the top ten most common terms in spam and ham messages, it highlights key linguistic indicators that are present in spam but not in ham, and vice versa. It also supports the hypotheses of this research:

  • Hypothesis 1: The graph supports the first hypothesis, which states that spam texts contain more specialized terms and phrases, such as advertising buzzwords and urgent language. For example, “claim” and “prize” in spam messages correspond to the urgency and promotional language typical of spam. In contrast, the ham messages are dominated by informal shorthand such as “ur” (your/you’re) and tokens like “gt” and “lt,” which are likely remnants of HTML-escaped “<…>” placeholders in the corpus rather than real words.

Analysis and Figure 2

Before creating the word clouds, the dataset was cleaned and pre-processed. Word frequencies were then calculated separately for spam and ham messages. Using the wordcloud function in R, two separate visualizations were created, exhibiting the most common terms in each group. The word clouds employ gradient colors, maroon for spam and pink for ham, to clearly distinguish between the two types of messages and to visually capture the most common themes and terms in each category.

Word clouds were chosen because they give a quick visual overview of the text data, highlighting the most frequently used terms in a visually appealing manner. This helps to quickly detect important themes or terms that are common in spam and ham communications.

Additionally, these word clouds address the research question by assisting in the identification of linguistic traits that are most suggestive of spam versus ham messages. For example, terms like “free,” “prize,” and “claim” appear often in the spam word cloud and correspond to prevalent spam traits. Moreover, the spam word cloud contains many numbers, reflecting the “prizes” used to deceive recipients. This supports Hypothesis 1, since spam messages employ more specific terms and phrases associated with advertising and urgency.

Analysis and Figure 3

The sentiment distribution graph was created by first computing sentiment scores for each word in spam and ham messages using the Bing lexicon. After counting sentiment occurrences, the pivot_longer() function was used to convert the counts from wide to long format for visualization. The resulting data was then plotted with ggplot2, yielding a bar chart that clearly depicts the comparison of positive and negative sentiment in spam and ham messages.

The Bing lexicon was chosen because it suits a direct comparison of sentiment categories between spam and ham messages. The Bing lexicon categorizes words simply as “positive” or “negative,” providing a straightforward method for analyzing sentiment. This simplicity is effective in highlighting the core emotional tone of messages, which is well suited to comparing the overall moods of spam and ham messages.

By contrast, lexicons like NRC, which group words into several emotions (such as trust, fear, and anticipation), give a more nuanced understanding of the text, but they also complicate the analysis when the primary interest is a binary categorization of sentiment. Similarly, the AFINN lexicon assigns scores ranging from strongly negative to strongly positive, which captures a gradient of sentiment intensity but lacks binary clarity.

However, the graph shows that the difference in sentiment between spam and ham is not very pronounced. This could suggest several interpretations, such as:

  • Spam sophistication: To avoid detection, modern spam may simulate natural conversation by adopting less of the forceful marketing language that is generally classified as “negative.”
  • Language overlap: Common phrases and words may appear in both spam and ham, resulting in comparable sentiment distributions. This frequently occurs with neutral terms whose contextual meanings vary depending on how they are used.

Analysis and Figure 4

To produce the log odds ratio graph, I first counted the frequency of each word in the spam and ham categories. I then calculated the log odds ratio for each term, which indicates how much more probable a word is to appear in spam messages than in ham messages. Words with higher ratios are more characteristic of spam, whereas those with negative values are more characteristic of ham. After sorting these values, I created a bar chart of the top ten terms in each direction, highlighting their association with spam or ham using color coding: maroon for spam and pink for ham.

The bar chart was chosen because it shows both the magnitude and the direction of each word’s association, providing a clear view of which terms are most distinctive of spam or ham messages. This figure directly contributes to the research by emphasizing particular linguistic traits that distinguish spam from ham. It supports the first hypothesis, since the figure clearly shows that specific terms (“claim,” “prize,” “won”) are far more common in spam messages, reflecting the use of advertising buzzwords and urgent language. This lends weight to the hypothesis that spam texts use more specific and forceful words to persuade or deceive recipients.

Analysis and Figure 5

The SMS data was cleaned and preprocessed. Each word was then scored using the bind_tf_idf() function, which calculates its TF-IDF score across the spam and ham categories. The words with the ten highest TF-IDF scores in each category were chosen for visualization.

This graph type was chosen because TF-IDF effectively finds words that are distinctive to a category while discounting words that are common across all messages. This is critical for differentiating between everyday language and phrases that indicate spam. The visualization directly supports the research question by revealing important linguistic elements that distinguish spam from ham, consistent with Hypothesis 1, which states that spam messages use distinctive, frequently manipulative language as opposed to the more conventional language of ham texts.

Analysis and Figure 6

The dataset was cleaned and pre-processed before the bi-gram analysis. Bi-gram counts were then computed separately for spam and ham messages, keeping only pairs appearing more than four times to ensure relevance. The resulting data was used to create network graphs with the ggraph package, which depicts the bi-grams as nodes connected by edges, with the opacity of each edge corresponding to the bi-gram’s frequency.

This figure style was chosen because network graphs visually depict the relationships and structure of word pairs within text data, making it simpler to identify trends or frequent phrases used differently in spam versus ham messages. The visualization supports the research hypothesis by demonstrating how certain connected terms (such as “ur awarded” and “free entry”) are more common in spam, offering a graphical depiction of the linguistic traits associated with spam messages. These findings can be useful for designing more effective spam detection systems, thus addressing the core research goal.

Storyline

Alissa is a dedicated data analysis scientist who one day receives an SMS informing her that she has won a large sum of money. In a state of excitement and unaware of the danger, she follows the directions in the message and is scammed. This personal oversight prompts her to wonder: what linguistic features distinguish spam from legitimate messages? The experience fuels a desire to look deeper into SMS data in order to create solutions that can prevent such frauds.

Exploring the Data Through Alissa’s Lens

Identification of Key Words: - Log Odds Ratio Analysis: Alissa begins by identifying words most frequently associated with spam versus ham messages. The analysis shows distinctive use of certain words in spam messages—words like “claim” and “prize” stand out. This step underscores how spammers craft messages to entice and deceive.

Sentiment Manipulation: - Sentiment Analysis: Next, she examines the emotional tone embedded within these messages. The analysis reveals that there is not much difference between spam and ham messages in overall sentiment; however, spam messages tend to be more positive than ham messages.

Statistical Significance of Words: - TF-IDF Analysis: To understand the unique context in which specific words are used, Alissa employs TF-IDF, which highlights words like “guaranteed” that are crucial in spam but not common in everyday messages. This method helps pinpoint the deceptive significance placed on certain words.

Relational Word Analysis: - Bi-Gram Analysis: Alissa extends her exploration to the relationships between words using bi-gram analysis. This reveals common pairs of words that occur in spam and ham messages, illustrating patterns such as frequent commands or offers found in spam (“free entry”, “guaranteed prize”), as opposed to more mundane language seen in ham (“4 dinner”, “hey babe”). This relational view adds a deeper layer to understanding the linguistic tactics used by spammers and supports the hypothesis about the unique linguistic features of spam.

Narrative Around the Charts

Chart Analysis: Each chart Alissa develops is a piece of the puzzle. For instance, the TF-IDF chart isn’t just a visualization; it’s a tool to discern the stealth tactics of spammers. The log odds ratio graph isn’t just numbers; it’s for identifying words that should raise red flags for readers and spam filters alike. The bi-gram analysis further enriches this narrative by mapping the connections between words, showcasing how certain word combinations are prevalent in spam. This not only illustrates typical spam patterns but also aids in developing more effective spam detection algorithms by highlighting the most suspicious linguistic structures.

She uses pink and maroon because they are her favorite colors and because they contrast nicely with each other and with the background, allowing her to distinguish data points. Moreover, a darker color like maroon signals possible danger or caution (spam), whereas a lighter color signifies safer, more trustworthy information (ham). Lastly, she maintains this color scheme uniformly across all visualizations to keep the charts easy to comprehend.

Concluding Insights

Alissa’s research is motivated by personal experience and a desire to protect others from the dangers of digital communication. Her work not only improves scientific understanding of spam identification, but also acts as a useful reference for everyday mobile users. The story ends with observations on how her findings could inform spam detection technology, user education, and regulation.

Conclusion

This study examined the linguistic traits that distinguish spam messages from ham messages in SMS data, with a particular emphasis on word use, sentiment, term frequency-inverse document frequency (TF-IDF), and bi-gram metrics. The results indicate that spam texts usually contain particular, persuasive terms like “claim,” “prize,” and “guaranteed,” and frequently include numbers to signify “prizes.” These messages are designed to exploit human emotions, such as greed and urgency. The sentiment analysis also revealed that spam messages are often more positive than ham ones, suggesting a tactic of instilling false optimism to fool recipients. The TF-IDF analysis showed that specific terms are used disproportionately in spam relative to their general frequency in the corpus, providing a more nuanced understanding of the distinctiveness of spam vocabulary. Lastly, the bi-gram graphs revealed word-pair patterns that frequently occur in spam messages.

Recommendation

Research Limitation

References

Adewole, K. S., Anuar, N. B., Kamsin, A., & Sangaiah, A. K. (2019). SMSAD: A framework for spam message and spam account detection. Multimedia Tools and Applications, 78(4), 3925–3960. https://doi.org/10.1007/s11042-017-5018-x

Ning, B., Wu, J., & Hu, F. (2019). Spam message classification based on the naïve Bayes classification algorithm.