Executive summary

The definition and characteristics of mail

Ham mail
Definition: “Ham” mail refers to legitimate, desired emails that a user wants to receive. This includes personal messages, business communications, newsletters the user has subscribed to, and other non-spam emails.
Characteristics:
1) Relevant: contains content that is useful or interesting to the recipient.
2) Opt-in: typically sent with the recipient’s permission.
3) Expected: the recipient knows the sender and expects to receive these emails.
4) Compliant: adheres to laws and regulations such as the CAN-SPAM Act and the GDPR.

Spam mail
Definition: “Spam” mail, also known as junk mail, refers to unsolicited and often irrelevant or inappropriate emails sent in bulk to many recipients without their consent.
Characteristics:
1) Unsolicited: sent without the recipient’s permission.
2) Mass-sent: distributed in large quantities to many recipients.
3) Irrelevant: typically not relevant to the recipient’s interests.
4) Often malicious: can contain scams, phishing attempts, malware, or misleading information.
5) Non-compliant: often violates laws and regulations related to email marketing and privacy.

Through this data analysis, we want to find out what strategies spam mail uses to look like legitimate ham mail and what kind of content it carries when it arrives. Using the given data, we seek answers to two questions. First, ‘Does ham mail use more formal and refined wording than spam mail?’ Since ham mail identifies both sender and recipient and includes official business correspondence, we expected formal vocabulary to be common. Second, ‘Does spam mail use more negative wording than ham mail?’ Spam mail can commit fraud and theft by tricking recipients into opening external links, so we expected it to rely on emotional appeals, such as alarming or threatening language, to make the message seem “real”.

Data background

The “spam.csv” file contains two variables. “v1” is the mail type, a label indicating whether each message is ham or spam. “v2” is the message text, the main material we analyze; we tokenize v2 to examine word frequency in several ways. According to the file’s source description, the corpus combines 425 SMS spam messages manually extracted from the Grumbletext website with 3,375 ham messages drawn from the NUS SMS Corpus (NSC), data collected for research by the Department of Computer Science at the National University of Singapore. The source describes an English SMS data set of 5,574 messages in total; the CSV loaded below contains 5,572 rows.

Data loading, cleaning and preprocessing

library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(stringr)
library(tidytext)
library(ggplot2)
library(widyr)
library(igraph)
## 
## Attaching package: 'igraph'
## The following objects are masked from 'package:dplyr':
## 
##     as_data_frame, groups, union
## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum
## The following object is masked from 'package:base':
## 
##     union
library(ggraph)
library(tidyr)
## 
## Attaching package: 'tidyr'
## The following object is masked from 'package:igraph':
## 
##     crossing
library(magrittr)
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:tidyr':
## 
##     extract
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ purrr     1.0.2
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ lubridate::%--%()       masks igraph::%--%()
## ✖ tibble::as_data_frame() masks igraph::as_data_frame(), dplyr::as_data_frame()
## ✖ purrr::compose()        masks igraph::compose()
## ✖ tidyr::crossing()       masks igraph::crossing()
## ✖ magrittr::extract()     masks tidyr::extract()
## ✖ dplyr::filter()         masks stats::filter()
## ✖ dplyr::lag()            masks stats::lag()
## ✖ purrr::set_names()      masks magrittr::set_names()
## ✖ purrr::simplify()       masks igraph::simplify()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(treemapify)
SPAM_fulldata <- read_csv("spam.csv", show_col_types = FALSE)
## New names:
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
SPAM_fulldata
## # A tibble: 5,572 × 5
##    v1    v2                                                    ...3  ...4  ...5 
##    <chr> <chr>                                                 <chr> <chr> <chr>
##  1 ham   "Go until jurong point, crazy.. Available only in bu… <NA>  <NA>  <NA> 
##  2 ham   "Ok lar... Joking wif u oni..."                       <NA>  <NA>  <NA> 
##  3 spam  "Free entry in 2 a wkly comp to win FA Cup final tkt… <NA>  <NA>  <NA> 
##  4 ham   "U dun say so early hor... U c already then say..."   <NA>  <NA>  <NA> 
##  5 ham   "Nah I don't think he goes to usf, he lives around h… <NA>  <NA>  <NA> 
##  6 spam  "FreeMsg Hey there darling it's been 3 week's now an… <NA>  <NA>  <NA> 
##  7 ham   "Even my brother is not like to speak with me. They … <NA>  <NA>  <NA> 
##  8 ham   "As per your request 'Melle Melle (Oru Minnaminungin… <NA>  <NA>  <NA> 
##  9 spam  "WINNER!! As a valued network customer you have been… <NA>  <NA>  <NA> 
## 10 spam  "Had your mobile 11 months or more? U R entitled to … <NA>  <NA>  <NA> 
## # ℹ 5,562 more rows
ham<-SPAM_fulldata %>% 
  filter(v1== "ham")
ham
## # A tibble: 4,825 × 5
##    v1    v2                                                    ...3  ...4  ...5 
##    <chr> <chr>                                                 <chr> <chr> <chr>
##  1 ham   Go until jurong point, crazy.. Available only in bug… <NA>  <NA>  <NA> 
##  2 ham   Ok lar... Joking wif u oni...                         <NA>  <NA>  <NA> 
##  3 ham   U dun say so early hor... U c already then say...     <NA>  <NA>  <NA> 
##  4 ham   Nah I don't think he goes to usf, he lives around he… <NA>  <NA>  <NA> 
##  5 ham   Even my brother is not like to speak with me. They t… <NA>  <NA>  <NA> 
##  6 ham   As per your request 'Melle Melle (Oru Minnaminungint… <NA>  <NA>  <NA> 
##  7 ham   I'm gonna be home soon and i don't want to talk abou… <NA>  <NA>  <NA> 
##  8 ham   I've been searching for the right words to thank you… <NA>  <NA>  <NA> 
##  9 ham   I HAVE A DATE ON SUNDAY WITH WILL!!                   <NA>  <NA>  <NA> 
## 10 ham   Oh k...i'm watching here:)                            <NA>  <NA>  <NA> 
## # ℹ 4,815 more rows
spam<-SPAM_fulldata %>% 
  filter(v1 =="spam")
spam
## # A tibble: 747 × 5
##    v1    v2                                                    ...3  ...4  ...5 
##    <chr> <chr>                                                 <chr> <chr> <chr>
##  1 spam  "Free entry in 2 a wkly comp to win FA Cup final tkt… <NA>  <NA>  <NA> 
##  2 spam  "FreeMsg Hey there darling it's been 3 week's now an… <NA>  <NA>  <NA> 
##  3 spam  "WINNER!! As a valued network customer you have been… <NA>  <NA>  <NA> 
##  4 spam  "Had your mobile 11 months or more? U R entitled to … <NA>  <NA>  <NA> 
##  5 spam  "SIX chances to win CASH! From 100 to 20,000 pounds … <NA>  <NA>  <NA> 
##  6 spam  "URGENT! You have won a 1 week FREE membership in ou… <NA>  <NA>  <NA> 
##  7 spam  "XXXMobileMovieClub: To use your credit, click the W… <NA>  <NA>  <NA> 
##  8 spam  "England v Macedonia - dont miss the goals/team news… <NA>  <NA>  <NA> 
##  9 spam  "Thanks for your subscription to Ringtone UK your mo… <NA>  <NA>  <NA> 
## 10 spam  "07732584351 - Rodger Burns - MSG = We tried to call… <NA>  <NA>  <NA> 
## # ℹ 737 more rows
SPAM<-SPAM_fulldata %>% 
  unnest_tokens(word, v2) %>% 
  anti_join(stop_words)
## Joining with `by = join_by(word)`
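
The CSV also carries three junk columns (...3, ...4, ...5) that read_csv flagged above. A minimal cleanup sketch, assuming we only need the label and the text (the names type and text are our own choices, not from the source file):

SPAM_clean <- SPAM_fulldata %>%
  select(type = v1, text = v2) %>%  # keep only the label and the message text
  filter(!is.na(text))              # drop any rows without a message, just in case

SPAM_clean %>% count(type)          # class balance: 4,825 ham vs 747 spam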

Analysis and Figure 1, Treemap

tidy_ham <- ham %>%
  unnest_tokens(word, v2) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)

top_words_ham <- tidy_ham %>%
  top_n(15, n) %>%
  mutate(word = factor(word, levels = rev(word)))

ggplot(top_words_ham, aes(area = n, fill = word, label = word)) +
  geom_treemap() +
  geom_treemap_text(colour = "white", place = "centre", grow = TRUE) +
  labs(title = "Top 15 Words in Ham Messages",
       fill = "Word") +
  theme_minimal() +
  theme(legend.position = "none")

tidy_spam <- spam %>%
  unnest_tokens(word, v2) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)

top_words_spam <- tidy_spam %>%
  top_n(15, n) %>%
  mutate(word = factor(word, levels = rev(word)))

ggplot(top_words_spam, aes(area = n, fill = word, label = word)) +
  geom_treemap() +
  geom_treemap_text(colour = "black", place = "centre", grow = TRUE) +
  labs(title = "Top 15 Words in Spam Messages",
       fill = "Word") +
  theme_minimal() +
  theme(legend.position = "none")

Looking at the treemaps, ham mail shows words such as ‘2’, ‘lt(<)’, ‘gt(>)’, ‘call’, ‘ur’, ‘home’, ‘love’, ‘lol’, ‘day’, ‘time’, ‘job’, ‘eat’, ‘thk’, and ‘pls’. Rather than being formal, ham mail turns out to consist mainly of everyday conversation: abbreviations and the casual wording of close relationships appear constantly. Spam mail, on the other hand, shows words such as ‘call’, ‘free’, ‘mobile’, ‘prize’, ‘urgent’, ‘1000’, and ‘150p’. Numbers are frequent, and many messages promise that something can be gained at no cost, centered on the word ‘free’.
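
As a quick check of this impression, a hedged sketch reusing the tidy_ham and tidy_spam count tables built above compares how often ‘free’ appears in each class:

tidy_ham %>% filter(word == "free")   # occurrences of "free" in ham messages
tidy_spam %>% filter(word == "free")  # occurrences of "free" in spam messages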

Analysis and Figure 2, Tf-idf

frequency <- SPAM %>% 
  count(v1, word)

frequency <- frequency %>%
  bind_tf_idf(term = word,
              document = v1, 
              n = n) %>%             
  arrange(tf_idf)

frequency
## # A tibble: 9,260 × 6
##    v1    word      n        tf   idf tf_idf
##    <chr> <chr> <int>     <dbl> <dbl>  <dbl>
##  1 ham   1        61 0.00228       0      0
##  2 ham   1.20      1 0.0000374     0      0
##  3 ham   10       13 0.000487      0      0
##  4 ham   100       1 0.0000374     0      0
##  5 ham   1000s     1 0.0000374     0      0
##  6 ham   11        4 0.000150      0      0
##  7 ham   12        5 0.000187      0      0
##  8 ham   16        1 0.0000374     0      0
##  9 ham   1st      12 0.000449      0      0
## 10 ham   2       320 0.0120        0      0
## # ℹ 9,250 more rows
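A note on the idf values above: bind_tf_idf treats each class (ham, spam) as one document, so there are only two documents and idf can take just two values, the natural log of 2 divided by the number of classes a word appears in:

log(2 / 2)  # 0      -> word appears in both ham and spam
log(2 / 1)  # 0.693  -> word appears in only one class

Words shared by both classes therefore get tf_idf = 0 and drop out of the ranking, which is why the top-10 table below surfaces class-distinctive words.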
top10 <- frequency %>% 
  group_by(v1) %>% 
  slice_max(tf_idf, n=10, with_ties=F)

top10
## # A tibble: 20 × 6
## # Groups:   v1 [2]
##    v1    word           n      tf   idf  tf_idf
##    <chr> <chr>      <int>   <dbl> <dbl>   <dbl>
##  1 ham   gt           318 0.0119  0.693 0.00825
##  2 ham   lt           316 0.0118  0.693 0.00820
##  3 ham   lor          162 0.00607 0.693 0.00421
##  4 ham   da           149 0.00558 0.693 0.00387
##  5 ham   amp           86 0.00322 0.693 0.00223
##  6 ham   morning       78 0.00292 0.693 0.00202
##  7 ham   cos           76 0.00285 0.693 0.00197
##  8 ham   lol           74 0.00277 0.693 0.00192
##  9 ham   feel          62 0.00232 0.693 0.00161
## 10 ham   gonna         58 0.00217 0.693 0.00151
## 11 spam  claim        113 0.00980 0.693 0.00679
## 12 spam  prize         92 0.00798 0.693 0.00553
## 13 spam  won           73 0.00633 0.693 0.00439
## 14 spam  150p          71 0.00616 0.693 0.00427
## 15 spam  tone          59 0.00512 0.693 0.00355
## 16 spam  guaranteed    50 0.00434 0.693 0.00301
## 17 spam  18            49 0.00425 0.693 0.00295
## 18 spam  cs            44 0.00382 0.693 0.00264
## 19 spam  500           43 0.00373 0.693 0.00258
## 20 spam  1000          41 0.00356 0.693 0.00246
ggplot(top10, aes(x = reorder_within(word, tf_idf, v1),
                  y = tf_idf,
                  fill = v1)) +
  geom_col(show.legend = F) +
  coord_flip() +
  facet_wrap(~ v1, scales = "free", ncol = 2) +
  scale_x_reordered() +
  labs(x = NULL)

The tf-idf graph identifies words that one class uses distinctively more often than the other. In ham mail, words such as ‘gt’, ‘lt’, ‘lor’, ‘da’, ‘morning’, ‘cos’, ‘lol’, ‘feel’, and ‘gonna’ rank highest; abbreviations such as ‘gonna’ and ‘lol’, or an everyday word such as ‘morning’, again suggest casual daily conversation, as noted above. In spam mail, the top words are ‘claim’, ‘prize’, ‘won’, ‘150p’, and ‘tone’. Checking how these words are used in the original data shows that they typically carry false claims that the recipient has won a prize or benefit. From the treemaps and the tf-idf graph, it is clear that, contrary to our prediction, ham mail uses ordinary, convenient abbreviations far more than formal, refined expressions.
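
One hedged way to verify that reading is to pull a few original spam messages containing a distinctive word (‘claim’ is taken from the tf-idf table above):

spam %>%
  filter(str_detect(str_to_lower(v2), "claim")) %>%  # spam rows mentioning "claim"
  pull(v2) %>%
  head(3)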

Analysis and Figure 3, Sentiment analysis

tidy_ham<-ham %>% 
  rowid_to_column("linenumber") %>% 
  unnest_tokens(word, v2) %>% 
  anti_join(stop_words)
## Joining with `by = join_by(word)`
tidy_ham
## # A tibble: 26,703 × 6
##    linenumber v1    ...3  ...4  ...5  word  
##         <int> <chr> <chr> <chr> <chr> <chr> 
##  1          1 ham   <NA>  <NA>  <NA>  jurong
##  2          1 ham   <NA>  <NA>  <NA>  crazy 
##  3          1 ham   <NA>  <NA>  <NA>  bugis 
##  4          1 ham   <NA>  <NA>  <NA>  world 
##  5          1 ham   <NA>  <NA>  <NA>  la    
##  6          1 ham   <NA>  <NA>  <NA>  buffet
##  7          1 ham   <NA>  <NA>  <NA>  cine  
##  8          1 ham   <NA>  <NA>  <NA>  amore 
##  9          1 ham   <NA>  <NA>  <NA>  wat   
## 10          2 ham   <NA>  <NA>  <NA>  lar   
## # ℹ 26,693 more rows
bing_ham <- tidy_ham %>% 
  inner_join(get_sentiments("bing")) %>% 
  count(v1, index = linenumber %% 10, sentiment) %>% 
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% 
  mutate(sentiment = positive - negative)
## Joining with `by = join_by(word)`
ggplot(bing_ham, aes(index, sentiment, fill = v1))+
  geom_col(show.legend = F)+
  facet_wrap(~v1, ncol = 2, scales = "free_x")

tidy_spam<-spam %>% 
  rowid_to_column("linenumber") %>% 
  unnest_tokens(word, v2) %>% 
  anti_join(stop_words)
## Joining with `by = join_by(word)`
tidy_spam
## # A tibble: 11,531 × 6
##    linenumber v1    ...3  ...4  ...5  word 
##         <int> <chr> <chr> <chr> <chr> <chr>
##  1          1 spam  <NA>  <NA>  <NA>  free 
##  2          1 spam  <NA>  <NA>  <NA>  entry
##  3          1 spam  <NA>  <NA>  <NA>  2    
##  4          1 spam  <NA>  <NA>  <NA>  wkly 
##  5          1 spam  <NA>  <NA>  <NA>  comp 
##  6          1 spam  <NA>  <NA>  <NA>  win  
##  7          1 spam  <NA>  <NA>  <NA>  fa   
##  8          1 spam  <NA>  <NA>  <NA>  cup  
##  9          1 spam  <NA>  <NA>  <NA>  final
## 10          1 spam  <NA>  <NA>  <NA>  tkts 
## # ℹ 11,521 more rows
bing_spam <- tidy_spam %>% 
  inner_join(get_sentiments("bing")) %>% 
  count(v1, index = linenumber %% 10, sentiment) %>% 
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% 
  mutate(sentiment= positive - negative)
## Joining with `by = join_by(word)`
bing_spam
## # A tibble: 10 × 5
##    v1    index negative positive sentiment
##    <chr> <dbl>    <int>    <int>     <int>
##  1 spam      0       24       75        51
##  2 spam      1       20       89        69
##  3 spam      2       11       97        86
##  4 spam      3       17      101        84
##  5 spam      4       16       85        69
##  6 spam      5       18       75        57
##  7 spam      6       24       86        62
##  8 spam      7       16       74        58
##  9 spam      8        9       82        73
## 10 spam      9       17       73        56
ggplot(bing_spam, aes(index, sentiment, fill = v1))+
  geom_col(show.legend = F)+
  facet_wrap(~v1, ncol = 2, scales = "free_x")

ratio_ham <- tidy_ham %>% 
  inner_join(get_sentiments("bing")) %>% 
  count(v1, index = linenumber %% 10, sentiment) %>% 
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% 
  mutate(total_words = positive + negative, 
         negative_ratio = negative / total_words,
         positive_ratio = positive / total_words)
## Joining with `by = join_by(word)`
negative_ratio_ham <- sum(ratio_ham$negative) / sum(ratio_ham$total_words)
positive_ratio_ham <- sum(ratio_ham$positive) / sum(ratio_ham$total_words)

total_ratios_ham <- data.frame(sentiment = c("Negative", "Positive"),
  ratio = c(negative_ratio_ham, positive_ratio_ham))

ggplot(total_ratios_ham, aes(x = sentiment, y = ratio, fill = sentiment)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  labs(x = "Sentiment", y = "Ratio", title = "Ratio of Negative and Positive Words of HAM") +
  theme_minimal()

ratio_spam <- tidy_spam %>% 
  inner_join(get_sentiments("bing")) %>% 
  count(v1, index = linenumber %% 10, sentiment) %>% 
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% 
  mutate(total_words = positive + negative, 
         negative_ratio = negative / total_words,
         positive_ratio = positive / total_words)
## Joining with `by = join_by(word)`
negative_ratio_spam <- sum(ratio_spam$negative) / sum(ratio_spam$total_words)
positive_ratio_spam <- sum(ratio_spam$positive) / sum(ratio_spam$total_words)

total_ratios_spam <- data.frame(sentiment = c("Negative", "Positive"),
  ratio = c(negative_ratio_spam, positive_ratio_spam))

ggplot(total_ratios_spam, aes(x = sentiment, y = ratio, fill = sentiment)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  labs(x = "Sentiment", y = "Ratio", title = "Ratio of Negative and Positive Words of SPAM") +
  theme_minimal()

Sentiment analysis is conducted with the bing lexicon. Each message receives a linenumber via rowid_to_column(), the messages are binned into ten groups with linenumber %% 10, and the sentiment counts are reshaped with pivot_wider() so that net sentiment (positive minus negative) can be computed per bin. Because the ham and spam subsets differ greatly in size, it may be hard to generalize, but within this data set spam mail uses positive words much more often than ham mail. The ratio plots confirm this: ham mail uses positive and negative words fairly evenly, whereas spam mail is overwhelmingly positive, with more than 80% positive words and less than 20% negative words.
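
The same comparison can be made in a single table; a minimal sketch reusing the tidy_ham and tidy_spam tibbles from this section:

bind_rows(tidy_ham, tidy_spam) %>%
  inner_join(get_sentiments("bing"), by = "word") %>%  # keep sentiment words only
  count(v1, sentiment) %>%
  group_by(v1) %>%
  mutate(ratio = n / sum(n))  # share of positive vs negative words per class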

Conclusion

The three analyses above lead to the following conclusions. The ham mail data consists largely of everyday conversation, alongside business exchanges; the many abbreviations typical of chats between friends show that much of the data is conversation between acquaintances. We expected spam mail to confuse recipients by emphasizing negative situations (for example, claiming that a police investigation is required), but instead it uses a strategy of positive words emphasizing prizes and benefits, misleading recipients into treating the message as real.