
Executive summary

What is (are) your main question(s)? What is your story? What does the final graphic show?

This project investigates whether question-type sentences (marked by “?”) appear more frequently in spam messages than in normal (ham) messages. By exploring this pattern, we aim to understand how spammers strategically use questions to grab recipients’ attention and provoke actions. Three figures support this story: a bar chart comparing the ratio of question-type messages, a word cloud highlighting common words in question-type spam, and a network graph showing how key words pair together in context.

Data background

Explain where the data came from, what agency or company made it, how it is structured, what it shows, etc.

The dataset used in this analysis is the SMS Spam Collection, originally compiled by Tiago A. Almeida and José María Gómez Hidalgo, and downloaded from Kaggle. The original collection is documented as 5,574 text messages labeled as either “spam” or “ham” (normal); the Kaggle CSV loaded below contains 5,572 rows. Each message is described by two columns, a label (spam or ham) and the message text, which the file names v1 and v2 and which are renamed during loading. The CSV also carries three unnamed extra columns that this analysis does not use. This dataset is commonly used for text classification and helps analyze linguistic patterns that differentiate spam from legitimate messages.

Data loading, cleaning and preprocessing

Describe and show how you cleaned and reshaped the data

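#Load packages used below: tidyverse (readr, dplyr, stringr, tidyr, ggplot2), tidytext, wordcloud, igraph, ggraph
library(tidyverse); library(tidytext); library(wordcloud); library(igraph); library(ggraph)
#Load SMS spam dataset & rename columns
sms <- read_csv("spam.csv")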

## New names:
## Rows: 5572 Columns: 5
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (5): v1, v2, ...3, ...4, ...5
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`

sms <- sms %>%
  rename(label = 1, text = 2)
sms

## # A tibble: 5,572 × 5
##    label text                                                  ...3  ...4  ...5 
##    <chr> <chr>                                                 <chr> <chr> <chr>
##  1 ham   "Go until jurong point, crazy.. Available only in bu… <NA>  <NA>  <NA> 
##  2 ham   "Ok lar... Joking wif u oni..."                       <NA>  <NA>  <NA> 
##  3 spam  "Free entry in 2 a wkly comp to win FA Cup final tkt… <NA>  <NA>  <NA> 
##  4 ham   "U dun say so early hor... U c already then say..."   <NA>  <NA>  <NA> 
##  5 ham   "Nah I don't think he goes to usf, he lives around h… <NA>  <NA>  <NA> 
##  6 spam  "FreeMsg Hey there darling it's been 3 week's now an… <NA>  <NA>  <NA> 
##  7 ham   "Even my brother is not like to speak with me. They … <NA>  <NA>  <NA> 
##  8 ham   "As per your request 'Melle Melle (Oru Minnaminungin… <NA>  <NA>  <NA> 
##  9 spam  "WINNER!! As a valued network customer you have been… <NA>  <NA>  <NA> 
## 10 spam  "Had your mobile 11 months or more? U R entitled to … <NA>  <NA>  <NA> 
## # ℹ 5,562 more rows

#Detect question-type messages using "?"
sms <- sms %>%
  mutate(is_question = str_detect(text,"\\?"))

#Calculate ratio of question-type messages for each label
question_stats <- sms %>%
  group_by(label, is_question) %>%
  summarise(count = n()) %>%
  group_by(label) %>%
  mutate(ratio = count / sum(count))

## `summarise()` has grouped output by 'label'. You can override using the
## `.groups` argument.

question_stats

## # A tibble: 4 × 4
## # Groups:   label [2]
##   label is_question count ratio
##   <chr> <lgl>       <int> <dbl>
## 1 ham   FALSE        3741 0.775
## 2 ham   TRUE         1084 0.225
## 3 spam  FALSE         614 0.822
## 4 spam  TRUE          133 0.178

We first loaded the SMS Spam dataset from a CSV file and renamed the first two columns to label (spam or ham) and text (message content) for easier reference. We verified that there were no missing values in these two columns; the three unnamed extra columns (...3 to ...5), shown as NA in the preview above, are not used. Then, to support our main question, we created a new variable, is_question, that flags whether each message contains a question mark (“?”). This flag separates question-type messages from the rest.
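The missing-value check mentioned above does not appear in the code shown. A minimal sketch of one way to run it, together with the same `str_detect()` flag, on a hypothetical toy tibble (`toy_sms` and its rows are made up for illustration and are not the real data):

```r
library(dplyr)
library(stringr)

# Hypothetical toy data standing in for the real `sms` tibble
toy_sms <- tibble(
  label = c("ham", "spam", "ham"),
  text  = c("Are you coming?", "WIN a prize now", "Ok lar...")
)

# Count missing values in each column (0 means none are missing)
na_counts <- toy_sms %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

# Flag messages containing a literal question mark;
# "?" is a regex quantifier, so it is escaped as "\\?"
toy_sms <- toy_sms %>%
  mutate(is_question = str_detect(text, "\\?"))

print(na_counts)
print(toy_sms)
```

On the real data, the same `summarise()` call applied to `sms` would report the NA count for every column, including the unused extra ones.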

Text data analysis

Individual analysis and figures

Analysis and Figure 1

Describe and show how you created the first figure. Why did you choose this figure type?

#Bar plot of question-type ratio by label
ggplot(question_stats %>%
         filter(is_question == TRUE),
       aes(x = label, y= ratio, fill = label)) +
  geom_col() +
  labs(title = "Ratio of Question-Type Messages by Label",
       x = "Label (Spam/Ham)", y = "Question Message Ratio")

To explore whether spam messages use question forms more frequently than normal (ham) messages, a simple bar chart was created to compare the ratios side by side. This chart uses the default ggplot2 colors to distinguish spam from ham, and the layout is kept clean for readability. Contrary to the initial expectation, the result shows that question marks appear somewhat less often in spam (17.8%) than in legitimate messages (22.5%), so the data do not support the idea that spammers use questions more than ordinary senders do.
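The ratios plotted in Figure 1 can be recomputed directly from the counts in the `question_stats` output above. The sketch below does that and, as an optional check that is not part of the original analysis, runs a two-sample proportion test with base R's `prop.test()`. The variable names are illustrative.

```r
# Counts copied from the question_stats table above
question_counts <- c(spam = 133, ham = 1084)               # messages containing "?"
total_counts    <- c(spam = 133 + 614, ham = 1084 + 3741)  # all messages per label

# Share of messages with a question mark, per label
ratios <- question_counts / total_counts
print(round(ratios, 3))

# Optional check (an addition, not the author's method):
# test whether the two proportions differ
test_result <- prop.test(question_counts, total_counts)
print(test_result$p.value)
```

For these counts the spam share (about 0.178) is lower than the ham share (about 0.225), which is the comparison the bar chart displays.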

Analysis and Figure 2

#Extract frequent words from question-type spam messages
question_spam_words <- sms %>%
  filter(label == "spam", is_question == TRUE) %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)

#Word cloud of frequent words in question-type spam
with(question_spam_words, wordcloud(word, n, max.words = 100))

A word cloud was created to visualize the most frequent words found in question-type spam messages. Words that appear more frequently are shown in larger fonts, making it intuitive to identify the key persuasive terms at a glance. The clean black text on a white background keeps the focus on word prominence without unnecessary distractions. The result highlights common words such as “free”, “stop”, “call”, and “text”, revealing how spammers use attention-grabbing terms in questions to persuade recipients to respond or click.
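The word frequencies behind the cloud come from tokenizing each message, dropping stop words, and counting what remains. A minimal sketch of that pipeline on two hypothetical messages (`toy_spam` is invented for illustration and is not taken from the dataset):

```r
library(dplyr)
library(tidytext)

# Hypothetical toy messages, not taken from the dataset
toy_spam <- tibble(
  text = c("Want a free call? Text STOP to end", "Free text for you?")
)

toy_word_counts <- toy_spam %>%
  unnest_tokens(word, text) %>%           # lowercases and strips punctuation
  anti_join(stop_words, by = "word") %>%  # drops common words such as "a" and "to"
  count(word, sort = TRUE)

print(toy_word_counts)
```

Because `unnest_tokens()` lowercases its output, "Free" and "free" are counted as the same word, which is why capitalization does not split the counts that size the words in the cloud.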

Analysis and Figure 3

#Extract bigrams from question-type spam messages
bigrams <- sms %>%
  filter(label == "spam", is_question == TRUE) %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

#Remove stopwords and separate bigrams
bigrams_separated <- bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word)

#Count frequent bigrams
bigram_counts <- bigrams_separated %>%
  count(word1, word2, sort = TRUE) %>%
  filter(n >= 5)

#Network graph of common bigrams in question-type spam
bigram_graph <- graph_from_data_frame(bigram_counts)
                                      
set.seed(123)

a <- grid::arrow(type = "closed", length = unit(.15, "inches"))

ggraph(bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n),
                 arrow = a, end_cap = circle(.07, 'inches'),
                 show.legend = FALSE, color = "gray50") +
  geom_node_point(color = "skyblue", size = 5) +
  geom_node_text(aes(label = name), repel = TRUE, size = 3) +
  theme_void() +
  labs(title = "Network of Common Bigrams in Question-Type Spam Messages")

To analyze how key words in question-type spam messages are connected, a network graph was constructed based on frequent bigrams. Each node represents a word, and edges show how pairs of words commonly appear together. The blue nodes and light gray edges make connections easy to follow, while arrows indicate word order for additional context. By removing the background using theme_void(), the focus remains solely on the relationships between words. The graph highlights typical promotional phrases and how spammers link keywords to form persuasive questions, such as “free call” and “100 minutes”.
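The edges of the network are simply counted bigrams. A minimal sketch of the extraction and filtering steps on two hypothetical messages (`toy_spam` is invented for illustration; the real analysis also keeps only bigrams seen at least five times):

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical toy messages, not taken from the dataset
toy_spam <- tibble(
  text = c("Get a free call today?", "Is your free call waiting?")
)

toy_bigram_counts <- toy_spam %>%
  # Split each message into overlapping two-word sequences
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  # One column per word so each can be checked against the stop-word list
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%
  count(word1, word2, sort = TRUE)

print(toy_bigram_counts)
```

Each row of the result is one directed edge from `word1` to `word2`, which is the shape `graph_from_data_frame()` expects; the count column `n` travels along as an edge attribute and drives the edge transparency in the plot.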

In showing the figures that you created, describe why you designed it the way you did. Why did you choose those colors, fonts, and other design elements? Does it convey truth?

Figure 1 uses a simple bar chart to make it easy to compare proportions side by side. The default ggplot2 fill colors differentiate spam from ham, and the axis labels and title are kept simple for readability. It accurately shows the computed ratios: 17.8% of spam messages contain a question mark, compared with 22.5% of normal messages, so question marks are somewhat less common in spam.

Figure 2 uses a word cloud to highlight the most common words visually, with larger sizes indicating higher frequency. A black font on a white background emphasizes word prominence without distractions, and the bold font ensures readability at different sizes. The figure is truthful in that the bigger words are the most frequent ones in question-type spam messages.

Figure 3 shows a network graph to reveal how words pair up in context, displaying common bigram structures. Blue nodes and light gray edges make connections easy to follow, and arrows show word order. The use of theme_void() removes background clutter so that only connections and labels stand out, truthfully showing which words commonly appear together in question-type spam and highlighting persuasive marketing phrases.

Analysis of Question Forms in SMS Spam

LEE HANA

2025-06-08