
Executive summary

What is (are) your main question(s)? What is your story? What does the final graphic show?

This project investigates whether question-type sentences (marked by “?”) appear more frequently in spam messages than in normal (ham) messages. By exploring this pattern, we aim to understand how spammers strategically use questions to grab recipients’ attention and provoke actions. Three figures support this story: a bar chart comparing the ratio of question-type messages, a word cloud highlighting common words in question-type spam, and a network graph showing how key words pair together in context.

Data background

Explain where the data came from, what agency or company made it, how it is structured, what it shows, etc.

The dataset used in this analysis is the SMS Spam Collection, originally compiled by Tiago A. Almeida and José María Gómez Hidalgo, and downloaded from Kaggle. The collection is documented as containing 5,574 text messages labeled as either “spam” or “ham” (normal); the Kaggle CSV loaded below has 5,572 rows. The file has five columns: the first two hold the label (spam or ham) and the text (message content), and the remaining three are unnamed and show only NA in the rows previewed below, so only the first two are used here. This dataset is commonly used for text classification and helps analyze linguistic patterns that differentiate spam from legitimate messages.

Data loading, cleaning and preprocessing

Describe and show how you cleaned and reshaped the data

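#Load packages, then the SMS spam dataset & rename columns
library(tidyverse)  # read_csv, dplyr verbs, stringr, tidyr, ggplot2
library(tidytext)   # unnest_tokens, stop_words
sms <- read_csv("spam.csv")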

## New names:
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## Rows: 5572 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): v1, v2, ...3, ...4, ...5
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

sms <- sms %>%
  rename(label = 1, text = 2)
sms

## # A tibble: 5,572 × 5
##    label text                                                  ...3  ...4  ...5 
##    <chr> <chr>                                                 <chr> <chr> <chr>
##  1 ham   "Go until jurong point, crazy.. Available only in bu… <NA>  <NA>  <NA> 
##  2 ham   "Ok lar... Joking wif u oni..."                       <NA>  <NA>  <NA> 
##  3 spam  "Free entry in 2 a wkly comp to win FA Cup final tkt… <NA>  <NA>  <NA> 
##  4 ham   "U dun say so early hor... U c already then say..."   <NA>  <NA>  <NA> 
##  5 ham   "Nah I don't think he goes to usf, he lives around h… <NA>  <NA>  <NA> 
##  6 spam  "FreeMsg Hey there darling it's been 3 week's now an… <NA>  <NA>  <NA> 
##  7 ham   "Even my brother is not like to speak with me. They … <NA>  <NA>  <NA> 
##  8 ham   "As per your request 'Melle Melle (Oru Minnaminungin… <NA>  <NA>  <NA> 
##  9 spam  "WINNER!! As a valued network customer you have been… <NA>  <NA>  <NA> 
## 10 spam  "Had your mobile 11 months or more? U R entitled to … <NA>  <NA>  <NA> 
## # ℹ 5,562 more rows

#Detect question-type messages using "?"
sms <- sms %>%
  mutate(is_question = str_detect(text,"\\?"))

#Calculate ratio of question-type messages for each label
question_stats <- sms %>%
  group_by(label, is_question) %>%
  summarise(count = n()) %>%
  group_by(label) %>%
  mutate(ratio = count / sum(count))

## `summarise()` has grouped output by 'label'. You can override using the
## `.groups` argument.

question_stats

## # A tibble: 4 × 4
## # Groups:   label [2]
##   label is_question count ratio
##   <chr> <lgl>       <int> <dbl>
## 1 ham   FALSE        3741 0.775
## 2 ham   TRUE         1084 0.225
## 3 spam  FALSE         614 0.822
## 4 spam  TRUE          133 0.178

# We first loaded the SMS Spam dataset from a CSV file and renamed the first two columns to label (spam or ham) and text (message content) for easier reference. We verified that there were no missing values in these two columns. Then, to support our main question analysis, we created a new variable is_question that flags whether each message contains a question mark (“?”). This step separates question-type messages from all other messages.
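The missing-value check mentioned above is not shown in the report. A minimal sketch of how it could be done follows, together with the question-mark flag. The `toy` tibble is a hypothetical stand-in for `sms` (two of its texts are shortened from the preview above), and using `fixed("?")` instead of the regex `"\\?"` is an illustrative alternative, not the report's code.

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical stand-in for the loaded data (not the real dataset)
toy <- tibble(
  label = c("ham", "spam", "ham"),
  text  = c("Ok lar... Joking wif u oni...",
            "Had your mobile 11 months or more?",
            "See you at 5")
)

# Count missing values in every column; all zeros means nothing is missing
na_counts <- toy %>%
  summarise(across(everything(), ~ sum(is.na(.x))))
print(na_counts)

# fixed("?") matches a literal question mark, equivalent to the regex "\\?"
toy <- toy %>%
  mutate(is_question = str_detect(text, fixed("?")))
print(toy)
```

Running the same `summarise()` call on the real `sms` tibble would also report the NA counts for the three unnamed columns.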

Text data analysis

Individual analysis and figures

Analysis and Figure 1

Describe and show how you created the first figure. Why did you choose this figure type?

#Bar plot of question-type ratio by label
ggplot(question_stats %>%
         filter(is_question == TRUE),
       aes(x = label, y= ratio, fill = label)) +
  geom_col() +
  labs(title = "Ratio of Question-Type Messages by Label",
       x = "Label (Spam/Ham)", y = "Question Message Ratio")

# Explanation: This figure tests whether spam messages use question forms more often than normal (ham) messages, which would indicate a linguistic pattern typical of spam.
# Description: According to the question_stats table, 22.5% of ham messages contain a question mark compared with 17.8% of spam messages, so the ham bar is the taller one. This does not support the hypothesis that spam uses questions more often than ham. Questions are still one device spammers use to provoke attention and action, but in this dataset they are not distinctive of spam.
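Whether the gap between the two ratios is larger than chance variation can be checked with a two-sample test of proportions. The sketch below uses only the counts printed in the question_stats output above. The choice of `prop.test()` from base R is an assumption made for illustration and is not part of the original analysis.

```r
# Counts copied from the question_stats output
question_counts <- c(spam = 133, ham = 1084)
total_counts    <- c(spam = 133 + 614, ham = 1084 + 3741)

# Share of messages containing "?" in each class
ratios <- question_counts / total_counts
print(round(ratios, 3))

# Two-sample test of equal proportions (base R, continuity-corrected by default)
test_result <- prop.test(question_counts, total_counts)
print(test_result)
```

The printed estimates show which class has the larger share, and the p-value indicates whether the difference is likely to be noise.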

Analysis and Figure 2

#Extract frequent words from question-type spam messages
question_spam_words <- sms %>%
  filter(label == "spam", is_question == TRUE) %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)

#Word cloud of frequent words in question-type spam
library(wordcloud)  # provides wordcloud()
with(question_spam_words, wordcloud(word, n, max.words = 100))

# Explanation: This figure identifies the most frequently used words in question-type spam messages, revealing the persuasive words that spammers typically combine with question forms.
# Description: The word cloud highlights common keywords such as “free”, “stop”, “call”, and “text”. These words are often used to prompt the recipient to take immediate action.
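A word cloud shows relative size but not exact counts, so a small ranked table can complement it. The sketch below runs the same tokenize, filter and count steps on two hypothetical messages invented for illustration. The `slice_max()` step and the `toy_spam` name are additions, not part of the original code.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical messages, invented for illustration only
toy_spam <- tibble(text = c("Want a FREE call? Text STOP to end",
                            "Free entry? Call now"))

top_words <- toy_spam %>%
  unnest_tokens(word, text) %>%            # lowercases and strips punctuation
  anti_join(stop_words, by = "word") %>%   # drops common function words
  count(word, sort = TRUE) %>%
  slice_max(n, n = 10, with_ties = FALSE)  # keep at most the ten most frequent

print(top_words)
```

Applied to `question_spam_words`, the same `slice_max()` call would list the exact frequencies behind the largest words in the cloud.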

Analysis and Figure 3

#Extract bigrams from question-type spam messages
bigrams <- sms %>%
  filter(label == "spam", is_question == TRUE) %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

#Remove stopwords and separate bigrams
bigrams_separated <- bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word)

#Count frequent bigrams
bigram_counts <- bigrams_separated %>%
  count(word1, word2, sort = TRUE) %>%
  filter(n >= 5)

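#Network graph of common bigrams in question-type spam
library(igraph)  # graph_from_data_frame()
library(ggraph)  # ggraph(), geom_edge_link(), geom_node_*()
bigram_graph <- graph_from_data_frame(bigram_counts)
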
set.seed(123)

a <- grid::arrow(type = "closed", length = unit(.15, "inches"))

ggraph(bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n),
                 arrow = a, end_cap = circle(.07, 'inches'),
                 show.legend = FALSE, color = "gray50") +
  geom_node_point(color = "skyblue", size = 5) +
  geom_node_text(aes(label = name), repel = TRUE, size = 3) +
  theme_void() +
  labs(title = "Network of Common Bigrams in Question-Type Spam Messages")

# Explanation: This figure visualizes how key words in question-type spam messages are connected, which helps identify common word pairs and persuasive phrases used by spammers.
# Description: The network highlights typical bigram connections in question-type spam messages. Clusters show promotional phrases such as phone models (“nokia 3510i”), free offers (“free call”, “100 minutes”), and sales phrases like “half price”. This reveals how spammers combine product ads and persuasive words with question forms to attract attention.
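The edge list behind such a network can be inspected directly before plotting. The sketch below repeats the bigram steps on three hypothetical messages invented for illustration. The `!is.na(bigram)` filter is an added precaution, on the assumption that messages too short to form a bigram produce a missing value. It is not in the original code.

```r
library(dplyr)
library(tidyr)
library(tibble)
library(tidytext)

# Hypothetical messages, invented for illustration only
toy_spam <- tibble(text = c("Half price Nokia 3510i?",
                            "Free Nokia 3510i?",
                            "ok"))

bigram_edges <- toy_spam %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%                      # drop messages with no bigram
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%
  count(word1, word2, sort = TRUE)                # one row per directed edge

print(bigram_edges)
```

Each row is one arrow in the graph, pointing from `word1` to `word2`, with `n` controlling edge transparency in the plot.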

In showing the figures that you created, describe why you designed it the way you did. Why did you choose those colors, fonts, and other design elements? Does it convey truth?

Figure 1 uses a simple bar chart so that the two proportions can be compared side by side. The default ggplot2 fill colors differentiate spam from ham, and the axis labels and title are kept simple for readability. Read against the question_stats table, it shows that a larger share of ham messages (22.5%) than spam messages (17.8%) contain a question mark.

Figure 2 uses a word cloud to highlight the most common words visually, with larger sizes indicating higher frequency. A black font on a white background emphasizes word prominence without distractions, and the bold font keeps words readable at different sizes. The figure is truthful in that the biggest words are the most frequent ones in question-type spam messages.

Figure 3 uses a network graph to reveal how words pair up in context, displaying common bigram structures. Blue nodes and light gray edges make connections easy to follow, and arrows show word order. Using theme_void() removes background clutter so that only connections and labels stand out, showing which words commonly appear together in question-type spam and highlighting persuasive marketing phrases.


Analysis of Question Forms in SMS Spam

LEE HANA

2025-06-08