Research Question

How do the textual features of spam SMS messages differ from legitimate (ham) messages, and how can these differences be used to improve automatic spam detection?

Spam messages are a common nuisance that can clutter users’ inboxes, waste time, and sometimes pose security risks through scams or phishing attempts. Understanding the language and emotional tone that characterize spam messages helps build better spam filters and protects users from unwanted or harmful content. By identifying clear patterns in the text, this analysis aims to contribute to more accurate and efficient spam detection systems, improving communication safety and user experience.

Executive Summary

This report analyzes the textual features that distinguish spam messages from legitimate (ham) SMS messages using the SMS Spam Collection dataset. The dataset contains 5,574 English text messages labeled as either spam or ham, sourced from multiple real-world collections. Our goal is to identify key linguistic and emotional differences that can help improve automatic spam detection.

We apply several text analysis methods including word frequency analysis, term frequency-inverse document frequency (tf-idf), and sentiment analysis. The word frequency analysis reveals the most common words used in spam and ham messages, highlighting distinct vocabularies between the two categories. The tf-idf method helps identify words that are important and unique to each group. Sentiment analysis uncovers emotional tones, showing that spam messages often carry stronger emotional cues, such as urgency or excitement.

The results demonstrate clear patterns that separate spam from legitimate messages. Spam texts tend to contain promotional language, repetitive words, and certain keywords like “free,” “win,” or “call.” Legitimate messages show more personal and conversational language. These insights can be leveraged to develop more effective spam filters and message classification algorithms, ultimately reducing unwanted messages and improving user experience.

This report includes detailed data exploration, visualizations, and code to ensure full reproducibility. The findings contribute to understanding the nature of spam messages and provide practical tools for detecting them in real-world applications.

Background Information and Summary of the Data

The SMS Spam Collection dataset gathers SMS messages from various sources to create a diverse collection of real-world texts. The dataset includes 5,574 messages, each labeled as either “ham” (normal messages) or “spam” (unsolicited or promotional messages). Spam messages come mainly from a UK forum called Grumbletext, where users report spam messages they receive. Normal messages mostly come from the National University of Singapore’s SMS Corpus, collected from volunteers, mainly students. Additional messages come from a PhD thesis dataset and a public SMS spam corpus used in other research.

Each entry in the dataset contains two main columns: a label indicating whether the message is ham or spam, and the raw text message itself. This structure allows for straightforward analysis of differences between legitimate and spam messages. The dataset provides a solid basis for exploring text features that distinguish spam from ham, which can support building better spam filtering models.

v1: the label for each message (spam or ham)
v2: the full text of the SMS message

library(tidyverse)    # data manipulation + ggplot2 plotting
library(tidytext)     # tokenization, sentiment analysis
library(wordcloud)    # word clouds
library(RColorBrewer) # colors for word clouds
library(textdata)     # sentiment lexicons (e.g., NRC)
library(igraph)       # graph data structure
library(tidygraph)    # graph manipulation
library(ggraph)       # graph plotting

Data Loading, Cleaning, and Preprocessing

First, we load the SMS Spam dataset, which contains two main columns: v1, indicating whether the message is spam or ham, and v2, containing the raw text message. The file also has three unnamed trailing columns, which read_csv() renames to ...3, ...4, and ...5; they are not used in the analysis. We then preprocess the data by tokenizing the messages into individual words and removing common stop words (such as “the”, “and”, “is”) that carry little meaning for the analysis.

library(readr)
library(dplyr)
library(tidytext)  # for unnest_tokens and stop_words

# Load dataset
full_data <- read_csv("spam.csv", show_col_types = FALSE)

## New names:
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`

# Tokenize text messages into individual words
sms_data <- full_data %>%
  unnest_tokens(word, v2)

# Remove stop words
sms_data <- sms_data %>%
  anti_join(stop_words, by = "word")

# View a snippet of the cleaned data
print(sms_data)

## # A tibble: 38,221 × 5
##    v1    ...3  ...4  ...5  word  
##    <chr> <chr> <chr> <chr> <chr> 
##  1 ham   <NA>  <NA>  <NA>  jurong
##  2 ham   <NA>  <NA>  <NA>  crazy 
##  3 ham   <NA>  <NA>  <NA>  bugis 
##  4 ham   <NA>  <NA>  <NA>  world 
##  5 ham   <NA>  <NA>  <NA>  la    
##  6 ham   <NA>  <NA>  <NA>  buffet
##  7 ham   <NA>  <NA>  <NA>  cine  
##  8 ham   <NA>  <NA>  <NA>  amore 
##  9 ham   <NA>  <NA>  <NA>  wat   
## 10 ham   <NA>  <NA>  <NA>  lar   
## # ℹ 38,211 more rows

Figure 1: TF-IDF Analysis

To identify the most distinctive words in spam and legitimate (ham) messages, we use the term frequency-inverse document frequency (TF-IDF) metric. TF-IDF rewards words that are frequent within one group and discounts words that appear in every group, helping us find terms unique to spam or ham texts. Here each label (ham or spam) is treated as a single document.

# Load required libraries
library(tidyverse)
library(tidytext)
library(ggplot2)

# Rename columns
full_data_clean <- full_data %>%
  rename(label = v1, text = v2) %>%
  select(label, text)

# Tokenize text
data <- full_data_clean %>%
  unnest_tokens(word, text)

# Remove stop words
data <- data %>%
  anti_join(stop_words, by = "word")

# Filter by label
ham_data <- data %>% filter(label == "ham")
spam_data <- data %>% filter(label == "spam")

# Count word frequencies by label
word_counts <- data %>%
  count(label, word, sort = TRUE)

# Compute TF-IDF
tf_idf <- word_counts %>%
  bind_tf_idf(word, label, n) %>%
  arrange(desc(tf_idf))

# Get top 10 by label (with_ties = FALSE keeps exactly 10 rows per group)
top_tf_idf <- tf_idf %>%
  group_by(label) %>%
  slice_max(tf_idf, n = 10, with_ties = FALSE) %>%
  ungroup()

# Plot with facet for ham/spam
ggplot(top_tf_idf, aes(x = reorder_within(word, tf_idf, label), y = tf_idf, fill = label)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~label, scales = "free") +
  scale_x_reordered() +
  labs(title = "Top 10 Distinctive Words by TF-IDF: Spam vs. Ham Messages",
       x = "Word", y = "TF-IDF") +
  coord_flip() +
  theme_minimal()

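This figure shows the 10 words with the highest TF-IDF scores in each group. Because the analysis treats ham and spam as only two documents, the inverse document frequency is ln(2/1) for a word that occurs in one group and ln(2/2) = 0 for a word that occurs in both. The ranking therefore consists of words found exclusively in one group, ordered by how often they occur there; widely recognized spam cues such as “free”, “win”, and “call” can rank here only if they never appear in ham messages. The spam panel reflects language focused on promotions and urgency, while the ham panel features everyday, conversational vocabulary.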

These unique word patterns can help improve spam detection by focusing on words that signal whether a message is likely spam or not.
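As a minimal sketch of how the score behaves when there are only two groups, the snippet below recomputes TF-IDF by hand in base R. The words and counts are made up for illustration and are not results from the dataset. It assumes the same definition that bind_tf_idf() applies: term frequency multiplied by the natural log of the number of documents divided by the number of documents containing the word.

```r
# Toy word counts per label (hypothetical numbers, not from the dataset)
counts <- data.frame(
  label = c("spam", "spam", "ham", "ham"),
  word  = c("prize", "call", "call", "lunch"),
  n     = c(8, 12, 3, 9),
  stringsAsFactors = FALSE
)

# Each label is one "document"
n_docs <- length(unique(counts$label))

# Term frequency: share of the label's words taken up by this word
counts$tf <- counts$n / ave(counts$n, counts$label, FUN = sum)

# Document frequency: number of labels in which the word occurs
doc_freq <- table(counts$word)
counts$idf <- log(n_docs / as.numeric(doc_freq[counts$word]))

counts$tf_idf <- counts$tf * counts$idf
print(counts)
```

In this toy table "call" is far more frequent in spam than in ham, yet its score is zero in both groups because it occurs in each of the two documents. Only "prize" and "lunch", which are exclusive to one group, receive a positive score.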

Figure 2: Word Clouds with Bing Sentiment

This figure presents word clouds for positive and negative sentiment words found in spam and ham messages. Using the Bing sentiment lexicon, words are classified by sentiment to highlight emotional tones in each message type. Positive words appear in green tones and negative words in red tones. The size of each word reflects its frequency.

library(dplyr)
library(wordcloud)
library(RColorBrewer)
library(tidytext)

# Get Bing sentiment lexicon
bing <- get_sentiments("bing")

# Join data with sentiment labels
data_sentiment <- data %>%
  inner_join(bing, by = "word")

# Filter and count for each group

# Spam Positive
spam_pos <- data_sentiment %>%
  filter(label == "spam", sentiment == "positive") %>%
  count(word, sort = TRUE)

# Spam Negative
spam_neg <- data_sentiment %>%
  filter(label == "spam", sentiment == "negative") %>%
  count(word, sort = TRUE)

# Ham Positive
ham_pos <- data_sentiment %>%
  filter(label == "ham", sentiment == "positive") %>%
  count(word, sort = TRUE)

# Ham Negative
ham_neg <- data_sentiment %>%
  filter(label == "ham", sentiment == "negative") %>%
  count(word, sort = TRUE)

# Helper: draw one word cloud from a word/count table in the given Brewer palette
plot_cloud <- function(counts, palette, main) {
  set.seed(123)  # same seed for every cloud so layouts are reproducible
  wordcloud(words = counts$word,
            freq = counts$n,
            min.freq = 5,
            max.words = 100,
            colors = brewer.pal(8, palette),
            scale = c(3.5, 0.7),
            random.order = FALSE)
  title(main)
}

# Positive clouds in green tones, negative clouds in red tones
plot_cloud(spam_pos, "Greens", "Spam Positive Words")
plot_cloud(spam_neg, "Reds",   "Spam Negative Words")
plot_cloud(ham_pos,  "Greens", "Ham Positive Words")
plot_cloud(ham_neg,  "Reds",   "Ham Negative Words")

The word clouds suggest that negative words in spam messages, such as “urgent,” reflect their often alarming or pressing tone. Positive words in spam include terms like “free” and “win,” commonly used to attract attention. Ham messages show more neutral or positive words related to everyday communication.

These sentiment differences highlight how spam messages use emotional triggers to entice recipients, which can be useful in distinguishing spam from legitimate messages.
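One way such sentiment cues could feed a filter is as per-message counts of positive and negative words. The sketch below is an illustration in base R under stated assumptions: the four-word `lexicon` is a hypothetical stand-in for the Bing lexicon, the two messages are invented, and `sentiment_counts()` is a helper name introduced here, not part of the analysis above.

```r
# Hypothetical mini-lexicon standing in for get_sentiments("bing")
lexicon <- c(free = "positive", win = "positive",
             urgent = "negative", sorry = "negative")

# Invented example messages (not taken from the dataset)
messages <- c("URGENT! You win a free prize",
              "sorry I am late, see you soon")

# Count lexicon hits of each polarity in one message
sentiment_counts <- function(text, lexicon) {
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]
  hits <- lexicon[words[words %in% names(lexicon)]]
  c(positive = sum(hits == "positive"),
    negative = sum(hits == "negative"))
}

# One row per message, one column per polarity
features <- t(sapply(messages, sentiment_counts,
                     lexicon = lexicon, USE.NAMES = FALSE))
print(features)
```

Counts like these are only one possible feature design; dividing by message length or using the full lexicon would be natural variations.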

Figure 3: Emotional Profiles and Sentiment Analysis

This figure uses the NRC sentiment lexicon to compare the emotional tone of ham (legitimate) and spam messages. The NRC lexicon assigns words to eight emotions (anger, anticipation, disgust, fear, joy, sadness, surprise, and trust) as well as two general polarities (positive and negative), and a single word can belong to several categories. We visualize these differences using donut charts showing the proportion of word-category matches in each sentiment category.

textdata::lexicon_nrc()

## # A tibble: 13,872 × 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # ℹ 13,862 more rows

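library(tidytext)

# Load the NRC lexicon once and cache it locally, so later runs can read the
# saved copy instead of fetching it again
if (!file.exists("nrc_sentiment.rds")) {
  saveRDS(get_sentiments("nrc"), "nrc_sentiment.rds")
}
nrc <- readRDS("nrc_sentiment.rds")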

library(tidyverse)
library(tidytext)
library(readr)
library(textdata)
library(ggplot2)

# Load and preprocess the data 
sms_raw <- read_csv("spam.csv", show_col_types = FALSE) %>%
  select(v1, v2) %>%
  rename(label = v1, text = v2)

## New names:
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`

sms_tokens <- sms_raw %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")

# Separate ham and spam tokens
ham_data <- sms_tokens %>% filter(label == "ham")
spam_data <- sms_tokens %>% filter(label == "spam")

# Load NRC lexicon
nrc <- get_sentiments("nrc")

# Count sentiments for ham and spam
# A word can carry several NRC categories and can occur many times in the
# messages, so a many-to-many join is expected here
sentiment_ham <- ham_data %>%
  inner_join(nrc, by = "word", relationship = "many-to-many") %>%
  count(sentiment) %>%
  arrange(desc(n))

sentiment_spam <- spam_data %>%
  inner_join(nrc, by = "word", relationship = "many-to-many") %>%
  count(sentiment) %>%
  arrange(desc(n))

# Plot sentiment distribution for HAM messages (donut chart)
ggplot(sentiment_ham, aes(x = 2, y = n, fill = sentiment)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y") +
  geom_text(aes(label = paste0(round(n / sum(n) * 100, 1), "%")),
            position = position_stack(vjust = 0.5)) +
  labs(title = "Sentiment Distribution in HAM Messages") +
  theme_void() +
  theme(legend.title = element_blank()) +
  xlim(0.5, 2.5)

# Plot sentiment distribution for SPAM messages (donut chart)
ggplot(sentiment_spam, aes(x = 2, y = n, fill = sentiment)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y") +
  geom_text(aes(label = paste0(round(n / sum(n) * 100, 1), "%")),
            position = position_stack(vjust = 0.5)) +
  labs(title = "Sentiment Distribution in SPAM Messages") +
  theme_void() +
  theme(legend.title = element_blank()) +
  xlim(0.5, 2.5)

The donut charts show the distribution of emotional categories across ham and spam messages. Ham messages exhibit a relatively balanced emotional profile, including sentiments like joy and trust. Spam messages show a higher proportion of negative emotions such as fear and anger, reflecting their often urgent or alarming language. Understanding these emotional patterns can help improve spam detection algorithms by incorporating sentiment features.
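Because the ham and spam groups differ in size, raw category counts are not directly comparable; the charts rely on shares instead. The base R sketch below shows that normalization on invented counts (the numbers are assumptions for illustration only, not results from the dataset), placing both groups in one table for side-by-side reading.

```r
# Invented sentiment counts (illustrative only, not results from the dataset)
ham_counts  <- c(joy = 40, trust = 60, fear = 20, anger = 10)
spam_counts <- c(joy = 15, trust = 30, fear = 10, anger = 5)

# Convert counts to within-group shares so groups of different size are comparable
shares <- rbind(ham  = prop.table(ham_counts),
                spam = prop.table(spam_counts))

# Percentages, one row per group
print(round(100 * shares, 1))
```

Each row sums to 100 percent, so a difference between the rows reflects a shift in emotional mix and not simply the number of messages in each group.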

Figure 4: Bigram Network

This figure visualizes common pairs of consecutive words (bigrams) in spam and ham SMS messages using network graphs. Edges represent frequently co-occurring word pairs, highlighting characteristic phrase patterns.

# Load necessary packages
library(tidyverse)
library(tidytext)
library(igraph)
library(ggraph)
library(tidygraph)
library(stringr)

# Step 1: Prepare data
# Rename for clarity and select relevant columns
full_data_clean <- full_data %>%
  rename(label = v1, text = v2) %>%
  select(label, text)

# Step 2: Tokenize into bigrams
bigrams <- full_data_clean %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

# Step 3: Separate bigrams into individual words
bigrams_separated <- bigrams %>%
  separate(bigram, into = c("word1", "word2"), sep = " ")

# Step 4: Remove stop words
data("stop_words")
bigrams_filtered <- bigrams_separated %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word)

# Step 5: Count bigrams by label
bigram_counts <- bigrams_filtered %>%
  count(label, word1, word2, sort = TRUE)

# Step 6: Create and plot SPAM bigram network
spam_bigram_graph <- bigram_counts %>%
  filter(label == "spam", n >= 5) %>%  # Adjust `n` if plot is too sparse
  graph_from_data_frame()

set.seed(123)
spam_plot <- ggraph(spam_bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                 arrow = arrow(length = unit(3, 'mm')),
                 end_cap = circle(2, 'mm')) +
  geom_node_point(color = "red", size = 4) +
  geom_node_text(aes(label = name), vjust = 1.5, hjust = 1.2) +
  theme_void() +
  labs(title = "Bigram Network in Spam Messages")

print(spam_plot)

# Step 7: Create and plot HAM bigram network
# One-word messages yield NA bigrams; drop them so "NA" does not become a node
ham_bigram_graph <- bigram_counts %>%
  filter(label == "ham", n >= 5, !is.na(word1), !is.na(word2)) %>%
  graph_from_data_frame()

set.seed(123)
ham_plot <- ggraph(ham_bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                 arrow = arrow(length = unit(3, 'mm')),
                 end_cap = circle(2, 'mm')) +
  geom_node_point(color = "blue", size = 4) +
  geom_node_text(aes(label = name), vjust = 1.5, hjust = 1.2) +
  theme_void() +
  labs(title = "Bigram Network in Ham Messages")

print(ham_plot)

The networks highlight common word pairs in each message category. For spam, bigrams often include marketing or urgent phrases (e.g., “free call,” “claim prize”), reflecting typical spam tactics. Ham bigrams tend to be more conversational or neutral, indicating everyday language use. Because any pair containing a stop word is removed before counting, familiar phrases such as “see you” or “thank you” do not appear in the ham network, which shows pairs of content words instead. These network visualizations provide intuitive insight into the linguistic patterns that differentiate spam from legitimate SMS messages.
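The bigram steps can be illustrated on a single message. This base R sketch is a simplified stand-in for unnest_tokens() and the stop-word filter, not the code used for the figure: `bigrams_of()` is a hypothetical helper, the message is invented, and `stops` is a tiny assumed list in place of tidytext's stop_words.

```r
# Tiny assumed stop-word list standing in for tidytext's stop_words
stops <- c("you", "your", "see", "a", "to", "the")

# Split a message into consecutive word pairs
bigrams_of <- function(text) {
  words <- strsplit(tolower(text), "[^a-z0-9']+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < 2) {
    # A one-word message has no bigrams
    return(data.frame(word1 = character(0), word2 = character(0),
                      stringsAsFactors = FALSE))
  }
  data.frame(word1 = head(words, -1), word2 = tail(words, -1),
             stringsAsFactors = FALSE)
}

pairs <- bigrams_of("Claim your free prize, see you soon")

# Keep only pairs in which neither word is a stop word
kept <- pairs[!(pairs$word1 %in% stops | pairs$word2 %in% stops), ]
print(kept)
```

Of the six pairs in the invented message, only "free prize" survives the filter, which illustrates why the networks are dominated by content-word pairs.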

Conclusion

This analysis revealed clear differences in the language and emotional tone used in spam versus legitimate SMS messages. Spam messages tend to contain distinctive keywords and phrases focused on urgency, offers, and promotions, as shown by the TF-IDF and bigram network analyses. The sentiment analysis further highlighted that spam texts often carry stronger negative emotions, while legitimate messages are more positive and neutral in tone. These insights are valuable for improving spam detection systems by focusing on the unique linguistic and emotional patterns found in spam. Overall, understanding these differences helps build more effective filters and contributes to reducing unwanted and potentially harmful messages.
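As one possible next step, the word-level differences described above could drive a simple classifier. The sketch below is a toy multinomial naive Bayes model with add-one smoothing in base R. It is an illustration under stated assumptions, not a method used in this report: the four training messages are invented, and `tokenize()` and `classify()` are hypothetical helper names.

```r
# Invented training messages (illustrative only, not taken from the dataset)
train <- data.frame(
  label = c("spam", "spam", "ham", "ham"),
  text  = c("win a free prize now", "free entry call now",
            "are we still meeting for lunch", "call me when you get home"),
  stringsAsFactors = FALSE
)

# Lowercase a message and split it into words
tokenize <- function(text) {
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]
  words[nzchar(words)]
}

# All training tokens, grouped by class
tokens_by_class <- lapply(split(train$text, train$label),
                          function(msgs) unlist(lapply(msgs, tokenize)))
vocab  <- unique(unlist(tokens_by_class))
priors <- table(train$label) / nrow(train)

# Log-probability of each vocabulary word given a class (add-one smoothing)
log_lik <- sapply(tokens_by_class, function(toks) {
  counts <- table(factor(toks, levels = vocab))
  log((as.numeric(counts) + 1) / (length(toks) + length(vocab)))
})
rownames(log_lik) <- vocab

# Pick the class with the highest log prior plus summed word log-likelihoods
classify <- function(text) {
  words <- tokenize(text)
  words <- words[words %in% vocab]  # ignore words never seen in training
  scores <- log(as.numeric(priors[colnames(log_lik)])) +
    colSums(log_lik[words, , drop = FALSE])
  colnames(log_lik)[which.max(scores)]
}

print(classify("free prize call now"))
print(classify("are we meeting for lunch"))
```

A real filter would train on the full dataset with a held-out test set; this toy version only shows how class-specific word frequencies translate into a decision.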

Spam Detection via Text Analysis

Yeonsu Kim

2025-06-08