After falling victim to a scam, Alissa, a data scientist, set out to differentiate between spam and authentic SMS texts. Her research used a variety of analytical tools to identify linguistic traits specific to spam. A log odds ratio analysis identified words commonly used in spam, such as "claim" and "prize," demonstrating how spammers craft tempting messages. Sentiment analysis indicated that spam messages often carry a more positive tone, enabling emotional manipulation. Using TF-IDF, she highlighted terms such as "guaranteed" that are common in spam but rare in normal correspondence, indicating deceptive emphasis. Furthermore, bi-gram analysis revealed word combinations that are common in spam but unusual in genuine texts, mapping spam techniques more precisely.
Alissa used visualizations to present these findings, with contrasting colors highlighting the distinctions between spam and legitimate texts and improving readability. Her study contributes to the understanding of spam detection, providing useful insights for improving spam filtering systems and user awareness. The work not only enhances academic understanding, but also helps in designing more effective anti-spam solutions and informs regulatory approaches to digital communication security.
Spamming is the dissemination of large amounts of unwanted information through ads, promotion of pornographic websites, fraudulent solicitations, fake news, online employment scams, and other malicious objectives perpetrated by spammers (Adewole et al., 2019). In short, spamming is the act of sending large quantities of undesirable information by companies or individuals with various intentions. The rise in popularity of short messaging services has led to a significant increase in spam messages, negatively impacting people's daily lives, societal stability, and public security (Ning et al., 2019). Understanding the characteristics that separate spam from genuine messages can help create better spam filters. This research examines the SMS Spam Collection Dataset from Kaggle to uncover these distinguishing characteristics.
The SMS Spam Collection Dataset was contributed by the UCI Machine Learning Repository and is available on Kaggle. It was collected in 2011 and assembled for research purposes, primarily the development and evaluation of spam filtering methods.
The dataset contains 5,572 English SMS messages as loaded here (the original UCI collection lists 5,574), divided into two categories: spam and ham. The dataset is organized as follows:
| label | message |
|---|---|
| ham | Go until jurong point, crazy.. Available only in … |
| ham | Ok lar… Joking wif u oni… |
| spam | Free entry in 2 a wkly comp to win FA Cup fina… |
| ham | U dun say so early hor… U c already then say… |
| spam | Six chances to win CASH! From 100 to 20,000 po… |
Spam texts: messages containing promotional content, offers, or links intended to persuade the receiver into taking certain actions.
Ham texts: genuine messages containing ordinary conversation, with no malicious intent.
What linguistic features are most indicative of spam messages in SMS data, and how can these features be used to improve spam detection systems?
Hypothesis 1: Spam texts will use more specific words and phrases than ham texts, such as advertising buzzwords and urgent language.
Hypothesis 2: The sentiment of spam messages differs from that of ham texts.
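The code below assumes the following packages are loaded; a minimal setup sketch (the package list is inferred from the function calls used in this report):
library(tidyverse) #read_csv(), dplyr verbs, ggplot2, stringr, forcats
library(tidytext) #unnest_tokens(), stop_words, get_sentiments(), bind_tf_idf(), reorder_within()
library(wordcloud) #wordcloud()
library(igraph) #graph_from_data_frame()
library(ggraph) #ggraph() and the geom_edge_*/geom_node_* layers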
# Load the dataset
sms_data <- read_csv("spam.csv")
## New names:
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## Rows: 5572 Columns: 5
## ── Column specification ────────────────────────────────────────────────
## Delimiter: ","
## chr (5): v1, v2, ...3, ...4, ...5
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sms_data
## # A tibble: 5,572 × 5
## v1 v2 ...3 ...4 ...5
## <chr> <chr> <chr> <chr> <chr>
## 1 ham "Go until jurong point, crazy.. Available only in bu… <NA> <NA> <NA>
## 2 ham "Ok lar... Joking wif u oni..." <NA> <NA> <NA>
## 3 spam "Free entry in 2 a wkly comp to win FA Cup final tkt… <NA> <NA> <NA>
## 4 ham "U dun say so early hor... U c already then say..." <NA> <NA> <NA>
## 5 ham "Nah I don't think he goes to usf, he lives around h… <NA> <NA> <NA>
## 6 spam "FreeMsg Hey there darling it's been 3 week's now an… <NA> <NA> <NA>
## 7 ham "Even my brother is not like to speak with me. They … <NA> <NA> <NA>
## 8 ham "As per your request 'Melle Melle (Oru Minnaminungin… <NA> <NA> <NA>
## 9 spam "WINNER!! As a valued network customer you have been… <NA> <NA> <NA>
## 10 spam "Had your mobile 11 months or more? U R entitled to … <NA> <NA> <NA>
## # ℹ 5,562 more rows
#Renaming the data labels
sms_data <- sms_data %>%
select(v1, v2) %>%
rename(spam_or_ham = v1, message = v2)
sms_data
## # A tibble: 5,572 × 2
## spam_or_ham message
## <chr> <chr>
## 1 ham "Go until jurong point, crazy.. Available only in bugis n great …
## 2 ham "Ok lar... Joking wif u oni..."
## 3 spam "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2…
## 4 ham "U dun say so early hor... U c already then say..."
## 5 ham "Nah I don't think he goes to usf, he lives around here though"
## 6 spam "FreeMsg Hey there darling it's been 3 week's now and no word ba…
## 7 ham "Even my brother is not like to speak with me. They treat me lik…
## 8 ham "As per your request 'Melle Melle (Oru Minnaminunginte Nurungu V…
## 9 spam "WINNER!! As a valued network customer you have been selected to…
## 10 spam "Had your mobile 11 months or more? U R entitled to Update to th…
## # ℹ 5,562 more rows
Explanation:
Renaming v1 to "spam_or_ham" and v2 to "message" makes the data easier to read and understand. I use select(v1, v2) because read_csv() produced unwanted extra columns (...3, ...4, ...5) from stray delimiters, as can be seen in the output above.
#Tokenizing and Removing Stop Words (Both Spam and Ham)
sms_tidy <- sms_data %>%
unnest_tokens(word, message) %>% #Tokenizing
anti_join(stop_words, by = "word") %>% #Removing stop words
mutate(word = str_replace_all(word, "[[:punct:]]", "")) %>% #Removing punctuations
mutate(word = tolower(word)) #Lower casing
sms_tidy
## # A tibble: 38,221 × 2
## spam_or_ham word
## <chr> <chr>
## 1 ham jurong
## 2 ham crazy
## 3 ham bugis
## 4 ham world
## 5 ham la
## 6 ham buffet
## 7 ham cine
## 8 ham amore
## 9 ham wat
## 10 ham lar
## # ℹ 38,211 more rows
Explanation:
Tokenizing: breaking each message into individual words simplifies text analysis. How: the unnest_tokens(word, message) function converts the data into a tidy format, one row per word. (Note that unnest_tokens() also lowercases tokens by default, so the later tolower() step is a safeguard.)
Removing Stop Words: stop words (e.g., "the," "and," "is") are common words that carry little signal for this research. How: the anti_join(stop_words, by = "word") function removes every token listed in the stop_words dataset.
Removing Punctuation: stripping punctuation marks simplifies the text for analysis. How: the mutate(word = str_replace_all(word, "[[:punct:]]", "")) function uses a regular expression to replace all punctuation with an empty string.
Lower Casing Each Word: converting all words to lowercase ensures that words like "Go" and "go" are treated as the same token, maintaining consistency. How: the mutate(word = tolower(word)) function converts all words to lowercase.
#Filtering Spam
spam_tidy <- sms_tidy %>%
filter(spam_or_ham == "spam")
spam_tidy
## # A tibble: 11,518 × 2
## spam_or_ham word
## <chr> <chr>
## 1 spam free
## 2 spam entry
## 3 spam 2
## 4 spam wkly
## 5 spam comp
## 6 spam win
## 7 spam fa
## 8 spam cup
## 9 spam final
## 10 spam tkts
## # ℹ 11,508 more rows
#Filtering Ham
ham_tidy <- sms_tidy %>%
filter(spam_or_ham == "ham")
ham_tidy
## # A tibble: 26,703 × 2
## spam_or_ham word
## <chr> <chr>
## 1 ham jurong
## 2 ham crazy
## 3 ham bugis
## 4 ham world
## 5 ham la
## 6 ham buffet
## 7 ham cine
## 8 ham amore
## 9 ham wat
## 10 ham lar
## # ℹ 26,693 more rows
Explanation:
Filtering Spam Messages: separating spam messages allows focused analysis of the characteristics unique to spam. How: the filter(spam_or_ham == "spam") function keeps rows where the spam_or_ham column equals "spam".
Filtering Ham Messages: separating ham messages allows focused analysis of the characteristics unique to legitimate (ham) messages. How: the filter(spam_or_ham == "ham") function keeps rows where the spam_or_ham column equals "ham".
#Count the most common words in spam and ham messages
spam_ham_counts <- sms_tidy %>%
count(word, spam_or_ham, sort = TRUE)
spam_ham_counts
## # A tibble: 9,206 × 3
## word spam_or_ham n
## <chr> <chr> <int>
## 1 call spam 355
## 2 2 ham 320
## 3 gt ham 318
## 4 lt ham 316
## 5 ur ham 241
## 6 call ham 231
## 7 free spam 223
## 8 day ham 200
## 9 time ham 198
## 10 love ham 191
## # ℹ 9,196 more rows
#Count the most common words in spam messages
spam_counts <- spam_tidy %>%
count(word, sort = TRUE)
spam_counts
## # A tibble: 2,627 × 2
## word n
## <chr> <int>
## 1 call 355
## 2 free 223
## 3 2 188
## 4 txt 160
## 5 ur 144
## 6 4 129
## 7 mobile 127
## 8 text 125
## 9 stop 121
## 10 claim 113
## # ℹ 2,617 more rows
#Count the most common words in ham messages
ham_counts <- ham_tidy %>%
count(word, sort = TRUE)
ham_counts
## # A tibble: 6,579 × 2
## word n
## <chr> <int>
## 1 2 320
## 2 gt 318
## 3 lt 316
## 4 ur 241
## 5 call 231
## 6 day 200
## 7 time 198
## 8 love 191
## 9 4 181
## 10 lor 162
## # ℹ 6,569 more rows
#Bing lexicon for spam sentiments
spam_sentiments <- spam_tidy %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment) %>%
pivot_wider(names_from = sentiment, values_from = n, values_fill = list(n = 0))
## Joining with `by = join_by(word)`
spam_sentiments
## # A tibble: 144 × 3
## word negative positive
## <chr> <int> <int>
## 1 abuse 1 0
## 2 accessible 0 1
## 3 admirer 0 10
## 4 afraid 1 0
## 5 amazing 0 3
## 6 award 0 28
## 7 awarded 0 38
## 8 bad 1 0
## 9 beg 1 0
## 10 benefits 0 1
## # ℹ 134 more rows
#Bing lexicon for ham sentiments
ham_sentiments <- ham_tidy %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment) %>%
pivot_wider(names_from = sentiment, values_from = n, values_fill = list(n = 0))
## Joining with `by = join_by(word)`
ham_sentiments
## # A tibble: 698 × 3
## word negative positive
## <chr> <int> <int>
## 1 absence 1 0
## 2 ache 4 0
## 3 addicted 4 0
## 4 adjustable 0 1
## 5 adore 0 3
## 6 adoring 0 2
## 7 affection 0 4
## 8 affectionate 0 1
## 9 afford 0 1
## 10 afraid 3 0
## # ℹ 688 more rows
#Log odds ratio of each word (spam vs ham), with add-one smoothing
sms_ratios <- sms_tidy %>%
count(word, spam_or_ham) %>%
pivot_wider(names_from = spam_or_ham, values_from = n, values_fill = list(n = 0)) %>%
mutate(spam_total = sum(spam), #Total spam tokens
ham_total = sum(ham)) %>% #Total ham tokens
mutate(logratio = log(((spam + 1) / (spam_total + 1)) / ((ham + 1) / (ham_total + 1)))) %>%
arrange(desc(logratio))
sms_ratios
## # A tibble: 8,428 × 6
## word ham spam spam_total ham_total logratio
## <chr> <int> <int> <int> <int> <dbl>
## 1 claim 0 113 11518 26703 5.58
## 2 prize 0 92 11518 26703 5.37
## 3 150p 0 74 11518 26703 5.16
## 4 won 0 73 11518 26703 5.14
## 5 tone 0 59 11518 26703 4.94
## 6 150 0 55 11518 26703 4.87
## 7 guaranteed 0 50 11518 26703 4.77
## 8 18 0 49 11518 26703 4.75
## 9 500 0 45 11518 26703 4.67
## 10 cs 0 44 11518 26703 4.65
## # ℹ 8,418 more rows
#TF-IDF for each word, treating spam and ham as the two documents
sms_tf_idf <- spam_ham_counts %>%
bind_tf_idf(term = word,
document = spam_or_ham,
n = n) %>%
arrange(desc(tf_idf))
sms_tf_idf
## # A tibble: 9,206 × 6
## word spam_or_ham n tf idf tf_idf
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 gt ham 318 0.0119 0.693 0.00825
## 2 lt ham 316 0.0118 0.693 0.00820
## 3 claim spam 113 0.00981 0.693 0.00680
## 4 prize spam 92 0.00799 0.693 0.00554
## 5 150p spam 74 0.00642 0.693 0.00445
## 6 won spam 73 0.00634 0.693 0.00439
## 7 lor ham 162 0.00607 0.693 0.00421
## 8 da ham 149 0.00558 0.693 0.00387
## 9 tone spam 59 0.00512 0.693 0.00355
## 10 150 spam 55 0.00478 0.693 0.00331
## # ℹ 9,196 more rows
#Tokenizing into bi-grams and removing stop words from both positions
sms_bigrams <- sms_data %>%
unnest_tokens(bigram, message, token = "ngrams", n = 2) %>%
separate(bigram, c("word1", "word2"), sep = " ") %>%
filter(!word1 %in% stop_words$word, !word2 %in% stop_words$word)
sms_bigrams
## # A tibble: 16,373 × 3
## spam_or_ham word1 word2
## <chr> <chr> <chr>
## 1 ham world la
## 2 ham buffet cine
## 3 ham amore wat
## 4 ham lar joking
## 5 ham joking wif
## 6 spam free entry
## 7 spam wkly comp
## 8 spam win fa
## 9 spam fa cup
## 10 spam cup final
## # ℹ 16,363 more rows
# Count bi-grams for spam
spam_bigram_counts <- sms_bigrams %>%
filter(spam_or_ham == "spam") %>%
count(word1, word2, sort = TRUE) %>%
filter(n>4)
spam_bigram_counts
## # A tibble: 260 × 3
## word1 word2 n
## <chr> <chr> <int>
## 1 po box 24
## 2 1000 cash 23
## 3 guaranteed call 23
## 4 prize guaranteed 22
## 5 national rate 20
## 6 await collection 19
## 7 send stop 19
## 8 land line 18
## 9 2 claim 17
## 10 customer service 17
## # ℹ 250 more rows
spam_bigram_graph <- spam_bigram_counts %>%
filter(n>4) %>%
graph_from_data_frame()
spam_bigram_graph
## IGRAPH 01196b4 DN-- 253 260 --
## + attr: name (v/c), n (e/n)
## + edges from 01196b4 (vertex names):
## [1] po ->box 1000 ->cash guaranteed->call
## [4] prize ->guaranteed national ->rate await ->collection
## [7] send ->stop land ->line 2 ->claim
## [10] customer ->service valid ->12hrs 150p ->msg
## [13] account ->statement call ->mobileupd8 free ->entry
## [16] identifier->code 2lands ->row dating ->service
## [19] suite342 ->2lands txt ->stop ur ->mob
## [22] 2nd ->attempt line ->claim ur ->awarded
## + ... omitted several edges
# Count bi-grams for ham
ham_bigram_counts <- sms_bigrams %>%
filter(spam_or_ham == "ham") %>%
count(word1, word2, sort = TRUE) %>%
filter(n>4)
ham_bigram_counts
## # A tibble: 88 × 3
## word1 word2 n
## <chr> <chr> <int>
## 1 lt gt 276
## 2 <NA> <NA> 42
## 3 wan 2 26
## 4 decimal gt 23
## 5 lt decimal 23
## 6 pls send 22
## 7 wat time 18
## 8 nice day 15
## 9 4 dinner 14
## 10 gt min 13
## # ℹ 78 more rows
ham_bigram_graph <- ham_bigram_counts %>%
filter(n>4) %>%
graph_from_data_frame()
## Warning in graph_from_data_frame(.): In `d' `NA' elements were replaced with
## string "NA"
ham_bigram_graph
## IGRAPH f2d341c DN-- 110 88 --
## + attr: name (v/c), n (e/n)
## + edges from f2d341c (vertex names):
## [1] lt ->gt NA ->NA wan ->2 decimal ->gt
## [5] lt ->decimal pls ->send wat ->time nice ->day
## [9] 4 ->dinner gt ->min gud ->ni8 dun ->wan
## [13] gud ->mrng happy ->birthday wait ->4 4 ->lunch
## [17] watching->tv love ->ya sweet ->dreams 2 ->meet
## [21] 4 ->ur gt ->mins gt ->minutes joy's ->father
## [25] pls ->pls ur ->friends wait ->till 2 ->watch
## [29] coming ->home gt ->lt gud ->nyt house ->maid
## + ... omitted several edges
#10 Frequent Words from Spam and Ham
spam_ham_10 <- spam_ham_counts %>%
anti_join(stop_words, by = "word") %>%
group_by(spam_or_ham) %>%
slice_max(n, n = 10) %>%
ungroup()
ggplot(spam_ham_10, aes(x = fct_reorder(word, n),
y = n,
fill = spam_or_ham)) +
geom_col(show.legend = FALSE) +
coord_flip() +
facet_wrap(~spam_or_ham, scales = "free_y") +
labs(x = NULL,
y = "Frequency",
title = "Top 10 Frequent Words from Spam and Ham") +
scale_fill_manual(values = c("ham" = "pink", "spam" = "maroon"))
#WordCloud from Spam
gradient_maroon <- c("#b30000", "#800000", "#480000")
spam_tidy %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 100, colors = gradient_maroon))
## Joining with `by = join_by(word)`
#WordCloud from Ham
gradient_pink <- colorRampPalette(c("#ffcccb", "#ff99aa", "#ff6699", "#ff3366"))(100)
ham_tidy %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 100, colors = gradient_pink))
## Joining with `by = join_by(word)`
#Bing Sentiment Graph
sentiments_combined <- bind_rows(mutate(spam_sentiments, type = "Spam"),
mutate(ham_sentiments, type = "Ham"))
sentiments_long <- sentiments_combined %>%
pivot_longer(cols = -c(word, type), names_to = "sentiment", values_to = "count")
ggplot(sentiments_long, aes(x = type, y = count, fill = sentiment)) +
geom_bar(stat = "identity", position = "dodge") + # Use 'identity' to use counts directly
scale_fill_manual(values = c("positive" = "pink", "negative" = "maroon")) + # Specify colors
labs(x = "Type",
y = "Count",
fill = "Sentiment",
title = "Distribution of Sentiment Categories in Spam and Ham Messages") +
theme_minimal() +
theme(strip.text = element_blank())
#Log Odds Ratio Graph Top 10
sms_ratios %>%
group_by(logratio < 0) %>%
slice_max(abs(logratio), n = 10) %>%
ungroup() %>%
mutate(word = reorder(word, logratio)) %>%
ggplot(aes(x = word, y = logratio, fill = logratio < 0)) +
geom_col(show.legend = TRUE) +
coord_flip() +
ylab("log odds ratio (spam/ham)") +
scale_fill_manual(values = c("TRUE" = "pink", "FALSE" = "maroon")) +
theme_minimal()
#TF-IDF Top 10
top10_sms <- sms_tf_idf %>%
anti_join(stop_words, by = "word") %>%
group_by(spam_or_ham) %>%
slice_max(tf_idf, n = 10, with_ties = FALSE)
top10_sms$spam_or_ham <- factor(top10_sms$spam_or_ham,
levels = c("ham", "spam"))
ggplot(top10_sms, aes(x = reorder_within(word, tf_idf, spam_or_ham),
y = tf_idf,
fill = spam_or_ham)) +
geom_col(show.legend = FALSE) +
coord_flip() +
facet_wrap(~ spam_or_ham, scales = "free_y", ncol = 2) +
scale_x_reordered() +
labs(x = NULL,
y = "TF-IDF",
title = "Top 10 Words in TF-IDF in Spam and Ham Messages") +
scale_fill_manual(values = c("ham" = "pink", "spam" = "maroon"))
#Bi-Gram Spam
set.seed(2023)
ggraph(spam_bigram_graph, layout = "fr") +
geom_edge_link() +
geom_node_point() +
geom_node_text(aes(label = name), vjust = 1, hjust = 1)
a <- grid::arrow(type = "closed", length = unit(.10, "inches"))
ggraph(spam_bigram_graph, layout = "fr") +
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
arrow = a, end_cap = circle(.03, 'inches')) +
geom_node_point(color = "maroon", size = 2) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
theme_void()
#Bi-gram ham
set.seed(2023)
ggraph(ham_bigram_graph, layout = "fr") +
geom_edge_link() +
geom_node_point() +
geom_node_text(aes(label = name), vjust = 1, hjust = 1)
a <- grid::arrow(type = "closed", length = unit(.10, "inches"))
ggraph(ham_bigram_graph, layout = "fr") +
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
arrow = a, end_cap = circle(.03, 'inches')) +
geom_node_point(color = "pink", size = 2) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
theme_void()
To create the graph above, the SMS data was first tokenized and cleaned to remove stop words and punctuation. The words were then counted to determine the frequency of each word in spam and ham messages, and the top ten words by frequency in each category were selected. A bar chart was created using ggplot2, with terms on the y-axis, frequency on the x-axis, and colors separating spam (maroon) from ham (pink). The bars were sorted in decreasing order of frequency, and the chart was split into two facets to allow a direct comparison of word frequencies between spam and ham.
The graph above addresses the research question, "What linguistic features are most indicative of spam messages in SMS data, and how can these features be used to improve spam detection systems?". Comparing the top ten most common terms in spam and ham highlights key linguistic indicators that are present in spam but not in ham, and vice versa. It also supports Hypothesis 1: words such as "call," "free," and "claim" dominate the spam list, consistent with advertising buzzwords and urgent language.
Before creating the word clouds, the dataset was cleaned and pre-processed, and word frequencies were calculated separately for spam and ham messages. Using the wordcloud() function in R, two separate visualizations were created, showing the most common terms in each group. The word clouds use gradient colors, maroon for spam and pink for ham, to clearly distinguish the two types of messages and make the dominant themes and terms in each category easy to grasp at a glance.
Word clouds were chosen because they give a quick visual overview of the text data, highlighting the most frequent terms in a visually appealing manner. This helps to quickly detect the themes and terms that are common in spam and ham messages.
These word clouds also address the research question by helping to identify the linguistic traits most suggestive of spam versus ham. For example, terms like "free," "prize," and "claim" appear often in the spam word cloud and correspond to well-known spam traits. The spam word cloud also contains many numbers, reflecting the "prizes" used to deceive recipients. This supports Hypothesis 1, since spam messages employ specific terms and phrases associated with advertising and urgency.
The sentiment distribution graph was created by first labeling each word in the spam and ham messages using the Bing lexicon. After counting sentiment occurrences, pivot_longer() was used to convert the counts from wide to long format for visualization. The data was then plotted with ggplot2, yielding a bar chart that compares positive and negative sentiment in spam and ham messages.
The Bing lexicon was chosen because it categorizes words simply as "positive" or "negative," providing a straightforward method for analyzing sentiment. This simplicity is well suited to capturing the core emotional tone of messages, which is what Hypothesis 2, about differing sentiment in spam and ham, requires.
By contrast, lexicons like NRC, which group words into several emotions (such as trust, fear, anticipation, and so on), give a more nuanced reading of the text but complicate the analysis when the primary interest is a binary categorization of sentiment. Similarly, the AFINN lexicon assigns scores ranging from strongly negative to strongly positive, which captures a gradient of sentiment intensity but lacks binary clarity.
However, looking at the graph, the difference in sentiment between spam and ham is not very pronounced. This could suggest several interpretations: spam may deliberately mimic the upbeat tone of ordinary conversation, or the binary Bing lexicon may simply be too coarse to capture the emotional manipulation involved (see the limitations below).
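A minimal follow-up sketch (reusing sms_tidy from above) to quantify this: the share of positive versus negative sentiment words in each class.
#Share of positive vs negative words per class (Bing lexicon)
sms_tidy %>%
inner_join(get_sentiments("bing"), by = "word") %>%
count(spam_or_ham, sentiment) %>%
group_by(spam_or_ham) %>%
mutate(share = n / sum(n)) %>%
ungroup()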
To produce the log odds ratio graph, I first counted the frequency of each word in the spam and ham categories. I then calculated the log odds ratio for each word, which indicates how much more probable a word is to appear in spam messages than in ham messages: words with greater ratios are more likely to be spam, whereas those with lower or negative values are more likely to be ham. After sorting these values, I created a bar chart of the top ten words with the most extreme log odds ratios in each direction, using color coding to mark their association, maroon for spam and pink for ham.
The bar chart was chosen because it shows both the magnitude and the direction of each word's association, making clear which terms are most characteristic of spam or ham. This directly contributes to the research by highlighting the particular linguistic traits that distinguish spam from ham. It supports the first hypothesis, since the figure clearly shows that specific terms ("claim," "prize," "won") are far more common in spam messages, reflecting the use of advertising buzzwords and urgent language. This lends weight to the theory that spam texts use more specific and forceful words to persuade or deceive recipients.
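In symbols, the score computed in the mutate() step above is, for a word $w$ with add-one smoothing:

$$\mathrm{logratio}(w) = \log\left(\frac{(n_{\mathrm{spam}}(w) + 1)\,/\,(N_{\mathrm{spam}} + 1)}{(n_{\mathrm{ham}}(w) + 1)\,/\,(N_{\mathrm{ham}} + 1)}\right)$$

where $n_{\mathrm{spam}}(w)$ is the count of $w$ in spam, $N_{\mathrm{spam}}$ is the total number of spam tokens (11,518 here), and likewise for ham (26,703 tokens). The +1 smoothing keeps words that never occur in one class, such as "claim" (0 ham occurrences), from producing infinite ratios.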
The SMS data was cleaned and preprocessed. Each word was then scored with the bind_tf_idf() function, which calculates its TF-IDF value across the spam and ham categories. The words with the top ten highest TF-IDF scores in each category were selected for visualization.
This graph type was chosen because TF-IDF surfaces words that are distinctive to a category while down-weighting words common across all messages. This is critical for differentiating everyday language from phrases that indicate spam. The visualization directly supports the research question by revealing the linguistic elements that distinguish spam from ham, consistent with Hypothesis 1, which states that spam messages use distinctive, frequently manipulative language as opposed to the more conventional language of ham texts.
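For reference, bind_tf_idf() computes, for each word $w$ in document (here, class) $d$:

$$\text{tf-idf}(w, d) = \frac{n_{w,d}}{\sum_{w'} n_{w',d}} \times \ln\frac{N_{\text{docs}}}{\left|\{d' : w \in d'\}\right|}$$

With only two "documents" (spam and ham), a word appearing in just one class gets $\mathrm{idf} = \ln 2 \approx 0.693$, which is exactly the idf value for every word in the table above; a word appearing in both classes gets $\mathrm{idf} = 0$ and drops out of the ranking entirely.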
The dataset was cleaned and pre-processed before the bi-gram analysis. Bi-gram counts were then computed separately for spam and ham messages, keeping only pairs that appear more than four times to ensure relevance. The resulting data was used to create network graphs with the ggraph package, which depicts the bi-grams as nodes connected by edges, with edge opacity corresponding to the bi-gram's frequency.
This figure style was chosen because network graphs visually depict the relationships and structure of word pairs within text data, making it easier to spot patterns and word pairs used differently in spam versus ham. The visualization supports the research hypothesis by demonstrating how specific connected terms (such as "ur awarded" and "free entry") are more common in spam, offering a graphical depiction of the linguistic traits associated with spam messages. These findings could be valuable for designing more effective spam detection systems, addressing the core research goal.
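One caveat visible in the ham counts above: the <NA> pair (and the igraph warning) comes from messages too short to form any bi-gram, for which unnest_tokens() returns NA. A minimal fix, assuming the same pipeline as before:
#Same bi-gram pipeline, dropping messages too short to form a bi-gram
sms_bigrams <- sms_data %>%
unnest_tokens(bigram, message, token = "ngrams", n = 2) %>%
filter(!is.na(bigram)) %>% #NA bi-grams come from one-word messages
separate(bigram, c("word1", "word2"), sep = " ") %>%
filter(!word1 %in% stop_words$word, !word2 %in% stop_words$word)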
Alissa is a dedicated data scientist who one day receives an SMS informing her that she has won a significant amount of money. Excited and unaware of the danger, she follows the instructions in the message and is scammed. This personal oversight prompts her to ask: what linguistic features distinguish spam from legitimate messages? The experience fuels a desire to dig deeper into SMS data in order to create tools that can prevent such frauds.
Exploring the Data Through Alissa’s Lens
Identification of Key Words: - Log Odds Ratio Analysis: Alissa begins by identifying words most frequently associated with spam versus ham messages. The analysis shows distinctive use of certain words in spam messages—words like “claim” and “prize” stand out. This step underscores how spammers craft messages to entice and deceive.
Sentiment Manipulation: - Sentiment Analysis: Next, she examines the emotional tone embedded within these messages. The analysis reveals little overall difference in sentiment between spam and ham messages; however, spam messages tend to be more positive than ham messages.
Statistical Significance of Words: - TF-IDF Analysis: To understand the unique context in which specific words are used, Alissa employs TF-IDF, which highlights words like “guaranteed” that are crucial in spam but not common in everyday messages. This method helps pinpoint the deceptive significance placed on certain words.
Relational Word Analysis: - Bi-Gram Analysis: Alissa extends her exploration to the relationships between words using bi-gram analysis. This reveals common pairs of words that occur in spam and ham messages, illustrating patterns such as frequent commands or offers found in spam (“free entry”, “guaranteed prize”), as opposed to more mundane language seen in ham (“4 dinner”, “hey babe”). This relational view adds a deeper layer to understanding the linguistic tactics used by spammers and supports the hypothesis about the unique linguistic features of spam.
Narrative Around the Charts
Chart Analysis: Each chart Alissa develops is a piece of the puzzle. For instance, the TF-IDF chart isn’t just a visualization; it’s a tool to discern the stealth tactics of spammers. The log odds ratio graph isn’t just numbers; it’s for identifying words that should raise red flags for readers and spam filters alike. The bi-gram analysis further enriches this narrative by mapping the connections between words, showcasing how certain word combinations are prevalent in spam. This not only illustrates typical spam patterns but also aids in developing more effective spam detection algorithms by highlighting the most suspicious linguistic structures.
She uses pink and maroon because they are her favorite colors and because they contrast well with each other and with the background, making data points easy to distinguish. Moreover, the darker maroon signals possible danger or caution (spam), whereas the lighter pink signifies safer, more trustworthy information (ham). Lastly, keeping the palette uniform across all visualizations makes the charts easier to read side by side.
Concluding Insights
Alissa's research is motivated by personal experience and a desire to protect others from the dangers of digital communication. Her work not only improves scientific understanding of spam identification, but also serves as a useful reference for everyday mobile users. The story ends with observations on how her findings could shape spam detection technology, user education, and regulation.
This study examined the linguistic traits that distinguish spam messages from ham messages in SMS data, focusing on word use, sentiment, term frequency-inverse document frequency (TF-IDF), and bi-gram patterns. The results indicate that spam texts typically contain particular, persuasive terms like "claim," "prize," and "guaranteed," and frequently include numbers to signify "prizes". These messages are designed to exploit human emotions such as greed and urgency. The sentiment analysis revealed that spam messages are often somewhat more upbeat than ham ones, suggesting a tactic of instilling false optimism to fool recipients. The TF-IDF analysis showed that specific terms are used disproportionately in spam relative to their general frequency in the corpus, providing a more nuanced picture of the uniqueness of spam vocabulary. Lastly, the bi-gram graphs expose word-pair patterns that frequently occur in spam messages.
Spam Filter Enhancement: incorporate richer language analysis into spam detection systems, beyond typical spam signals. Combining checks for high-TF-IDF terms with sentiment analysis may increase the accuracy of spam detection systems, as the toy sketch below illustrates.
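As a toy illustration only (not part of the analysis above), the high-log-odds words could seed a simple keyword score; spam_terms and spam_score are hypothetical names:
#Toy sketch: count occurrences of spam-indicative terms in each message
spam_terms <- c("claim", "prize", "guaranteed", "won", "tone")
sms_data %>%
mutate(spam_score = str_count(str_to_lower(message), paste(spam_terms, collapse = "|"))) %>%
arrange(desc(spam_score))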
User Education Programs: develop thorough user education programs to inform the public about the characteristics of spam messages. Highlighting the specific phrases and sentiment patterns regularly used in spam can help people recognize and avoid unwanted SMS messages.
Dataset constraints: the study was limited to the linguistic aspects of a single dataset. The findings may not generalize to all forms of digital communication or to other cultures.
Spam dynamics: spam methods change, and spammers continually adapt their efforts to avoid detection. The language traits recognized as spam indicators today may not apply tomorrow.
Binary Sentiment Analysis: the sentiment analysis was confined to positive and negative categories (the Bing lexicon), oversimplifying the spectrum of emotions and intents conveyed in spam messages. A more fine-grained sentiment analysis could yield further insight.
Adewole, K. S., Anuar, N. B., Kamsin, A., & Sangaiah, A. K. (2019). SMSAD: A framework for spam message and spam account detection. Multimedia Tools and Applications, 78(4), 3925–3960. https://doi.org/10.1007/s11042-017-5018-x
Ning, B., Wu, J., & Hu, F. (2019). Spam message classification based on the Naïve Bayes classification algorithm.