For this project, we have been given an email corpus that contains both spam and ham emails, already segregated into spam and ham. The task is to text-mine the data set and find the most common words in the spam and ham emails. The email corpus is available at: https://spamassassin.apache.org/old/publiccorpus/
We downloaded all the zip files and unzipped them. All the ham emails were then placed in a single folder, All-ham, and all the spam emails in another folder, All-spam.
We will read the spam and ham emails into R separately and perform text mining on both data sets.
Loading the required packages:
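The chunk with the library calls is not shown in this write-up; based on the functions used below, the following set of packages is a reasonable assumption (not the original chunk):

# Assumed package list - inferred from the functions used in this report
library(readtext)    # readtext()
library(stringr)     # str_split()
library(dplyr)       # the %>% pipe
library(tm)          # Corpus(), tm_map(), DocumentTermMatrix()
library(SnowballC)   # stemming backend for stemDocument()
library(wordcloud)   # wordcloud()
library(knitr)       # kable()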
Loading the spam data set from the local drive and putting it in a data.frame. Note that we work only on the email body and will strip the email headers before doing the text mining. For this, we have created a function that uses regular expressions to remove the header and keep only the email body text. As part of this function we also remove the HTML tags, since many of the emails are in HTML format and we don't want to count the tags in our word counts. Finally, in the spam data.frame we create a class column; its value for spam is set to 0 (ham will be labelled 1 below).
spam_folder_path <- "D:/MS Data Science/CUNY/CUNY/CUNY MSDS/Fall 2018/DATA 607/DATA 607 Week-10/Data 607 Project-4/spamham/All-spam"
spam_file_list <- list.files(path = spam_folder_path)
spam_file_list <- paste(spam_folder_path, "/", spam_file_list, sep = "")
raw_spam <- lapply(spam_file_list, readtext)
raw_spam <- lapply(raw_spam, FUN = paste, collapse = " ")
### Function to get just the email body - assuming that the body starts just after
### the first pair of consecutive newline characters that follows the string "From:";
### also removes HTML tags
get_email_body <- function(email){
  email_body <- str_split(email, "From:.*\n?(.+\n)*\n") %>% lapply('[[', 2) %>% unlist()
  email_body <- gsub("<.*?>", " ", email_body)   # strip HTML tags
  return(email_body)
}
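As a quick illustration of what get_email_body does, here is a made-up two-line email (both the message and the expected result are illustrative only, not taken from the corpus):

# Hypothetical message used only to sanity-check get_email_body()
toy_email <- "Return-Path: <a@b.com>\nFrom: sender@example.com\nSubject: Hi\n\n<p>Hello <b>there</b></p>\nBye"
get_email_body(toy_email)
# Everything up to the first blank line after "From:" is dropped, and the HTML
# tags are replaced by spaces, leaving roughly " Hello  there  \nBye"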
spam_email_body_list <- lapply(raw_spam, get_email_body)
spam_email_body_df <- as.data.frame(unlist(spam_email_body_list), stringsAsFactors = FALSE)
spam_email_body_df$class <- 0   # class label: 0 for spam
colnames(spam_email_body_df) <- c("email_body", "class")
dim(spam_email_body_df)
## [1] 2397 2
Loading the ham data set from the local drive and putting it in a data.frame. The same function defined above is used to extract the email body and remove the HTML tags.
ham_folder_path <- "D:/MS Data Science/CUNY/CUNY/CUNY MSDS/Fall 2018/DATA 607/DATA 607 Week-10/Data 607 Project-4/spamham/All-ham"
ham_file_list <- list.files(path = ham_folder_path)
ham_file_list <- paste(ham_folder_path, "/", ham_file_list, sep = "")
raw_ham <- lapply(ham_file_list, readtext)
raw_ham <- lapply(raw_ham, FUN = paste, collapse = " ")
ham_email_body_list <- lapply(raw_ham, get_email_body)
ham_email_body_df <- as.data.frame(unlist(ham_email_body_list), stringsAsFactors = FALSE)
ham_email_body_df$class <- 1   # class label: 1 for ham
colnames(ham_email_body_df) <- c("email_body", "class")
dim(ham_email_body_df)
## [1] 6951 2
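With both data frames labelled, they could also be stacked into a single data set keyed by the class column; this is not done in the analysis below, but a minimal sketch would be:

# Combine spam (class 0) and ham (class 1) into one labelled data frame -
# shown only to illustrate the purpose of the class column
all_email_df <- rbind(spam_email_body_df, ham_email_body_df)
table(all_email_df$class)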
Cleaning up the spam data set and creating the Document-Term Matrix, which will then be used to list the top 20 words and to build a word cloud. Using the tm package function tm_map to: 1) convert to lower case 2) remove punctuation 3) strip white space 4) remove stop words 5) remove numbers 6) apply stemming.
Working with the Document-Term Matrix: 1) generating two Document-Term Matrices - one with tf-idf weights and one with raw word counts 2) using the count-based DTM to list the top 20 words and to build a word cloud of the 100 most used words in spam.
spam_email_body_text_corpus <- spam_email_body_df$email_body %>% VectorSource() %>% Corpus()
spam_email_body_text_corpus_cleaned <- spam_email_body_text_corpus %>% tm_map(content_transformer(tolower))
## Warning in tm_map.SimpleCorpus(., content_transformer(tolower)):
## transformation drops documents
spam_email_body_text_corpus_cleaned <- spam_email_body_text_corpus_cleaned %>% tm_map(removePunctuation)
## Warning in tm_map.SimpleCorpus(., removePunctuation): transformation drops
## documents
spam_email_body_text_corpus_cleaned <- spam_email_body_text_corpus_cleaned %>% tm_map(stripWhitespace)
## Warning in tm_map.SimpleCorpus(., stripWhitespace): transformation drops
## documents
spam_email_body_text_corpus_cleaned <- spam_email_body_text_corpus_cleaned %>% tm_map(removeWords, stopwords("english"))
## Warning in tm_map.SimpleCorpus(., removeWords, stopwords("english")):
## transformation drops documents
spam_email_body_text_corpus_cleaned <- spam_email_body_text_corpus_cleaned %>% tm_map(removeNumbers)
## Warning in tm_map.SimpleCorpus(., removeNumbers): transformation drops
## documents
spam_email_body_text_corpus_cleaned <- spam_email_body_text_corpus_cleaned %>% tm_map(stemDocument)
## Warning in tm_map.SimpleCorpus(., stemDocument): transformation drops
## documents
spam_email_body_dtm_tfidf <- DocumentTermMatrix(spam_email_body_text_corpus_cleaned, control = list(weighting = weightTfIdf))
## Warning in weighting(x): empty document(s): 285 290 493 557 559 690 1494
## 2040 2043 2237
spam_email_body_dtm <- DocumentTermMatrix(spam_email_body_text_corpus_cleaned)
spam_email_body_dtm_matrix <- as.matrix(spam_email_body_dtm)
spam_word_frequency <- colSums(spam_email_body_dtm_matrix)
spam_word_frequency <- sort(spam_word_frequency, decreasing = TRUE)
print("Top 20 words in spam emails")
## [1] "Top 20 words in spam emails"
spam_word_frequency[1:20] %>% kable()
|         |    x|
|:--------|----:|
|         | 5108|
|will     | 4033|
|nbsp     | 3383|
|free     | 3075|
|can      | 2544|
|click    | 2456|
|receiv   | 2355|
|list     | 2338|
|pleas    | 2244|
|get      | 2239|
|busi     | 2093|
|         | 2031|
|remov    | 1999|
|order    | 1944|
|address  | 1916|
|money    | 1912|
|inform   | 1762|
|one      | 1743|
|make     | 1644|
|name     | 1578|
spam_words <- names(spam_word_frequency)
wordcloud(spam_words[1:100], spam_word_frequency[1:100])
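The tf-idf weighted DTM created above is not used further in this report. As an aside, it could be used to surface terms that are distinctive rather than merely frequent; a minimal sketch, using the mean tf-idf per term as an (arbitrary) aggregation:

# Average tf-idf of each term across spam documents, then inspect the top terms;
# the choice of colMeans() as the aggregate is our own, not part of the original analysis
spam_tfidf_matrix <- as.matrix(spam_email_body_dtm_tfidf)
spam_mean_tfidf <- sort(colMeans(spam_tfidf_matrix), decreasing = TRUE)
head(spam_mean_tfidf, 20)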
Performing the same steps for the ham email data set:
ham_email_body_text_corpus <- ham_email_body_df$email_body %>% VectorSource() %>% Corpus()
ham_email_body_text_corpus_cleaned <- ham_email_body_text_corpus %>% tm_map(content_transformer(tolower))
## Warning in tm_map.SimpleCorpus(., content_transformer(tolower)):
## transformation drops documents
ham_email_body_text_corpus_cleaned <- ham_email_body_text_corpus_cleaned %>% tm_map(removePunctuation)
## Warning in tm_map.SimpleCorpus(., removePunctuation): transformation drops
## documents
ham_email_body_text_corpus_cleaned <- ham_email_body_text_corpus_cleaned %>% tm_map(stripWhitespace)
## Warning in tm_map.SimpleCorpus(., stripWhitespace): transformation drops
## documents
ham_email_body_text_corpus_cleaned <- ham_email_body_text_corpus_cleaned %>% tm_map(removeWords, stopwords("english"))
## Warning in tm_map.SimpleCorpus(., removeWords, stopwords("english")):
## transformation drops documents
ham_email_body_text_corpus_cleaned <- ham_email_body_text_corpus_cleaned %>% tm_map(removeNumbers)
## Warning in tm_map.SimpleCorpus(., removeNumbers): transformation drops
## documents
ham_email_body_text_corpus_cleaned <- ham_email_body_text_corpus_cleaned %>% tm_map(stemDocument)
## Warning in tm_map.SimpleCorpus(., stemDocument): transformation drops
## documents
ham_email_body_dtm_tfidf <- DocumentTermMatrix(ham_email_body_text_corpus_cleaned, control = list(weighting = weightTfIdf))
## Warning in weighting(x): empty document(s): 327 533 830 911 1237 1544 1549
## 1583 1811 1870 1890 2064 2081 2263 2300 2346 2487 2503 2702 3135 3217 4157
## 4481 4719 4756 5046 5048 5065 5066 5170 5198 5286 5383 5411 5500 5507
ham_email_body_dtm <- DocumentTermMatrix(ham_email_body_text_corpus_cleaned)
ham_email_body_dtm_matrix <- as.matrix(removeSparseTerms(ham_email_body_dtm, 0.999))   # drop terms appearing in fewer than ~0.1% of documents to keep the larger ham matrix manageable
ham_word_frequency <- colSums(ham_email_body_dtm_matrix)
ham_word_frequency <- sort(ham_word_frequency, decreasing = TRUE)
print("Top 20 words in ham emails")
## [1] "Top 20 words in ham emails"
ham_word_frequency[1:20] %>% kable()
|        |    x|
|:-------|----:|
|use     | 6324|
|nbsp    | 5590|
|can     | 4892|
|list    | 4452|
|get     | 4358|
|will    | 4220|
|one     | 4119|
|just    | 3528|
|new     | 3489|
|time    | 3409|
|        | 3343|
|like    | 3338|
|        | 3310|
|messag  | 3051|
|now     | 2943|
|work    | 2862|
|file    | 2774|
|make    | 2657|
|dont    | 2548|
|wrote   | 2472|
ham_words <- names(ham_word_frequency)
wordcloud(ham_words[1:100], ham_word_frequency[1:100])
## Warning in wordcloud(ham_words[1:100], ham_word_frequency[1:100]): list
## could not be fit on page. It will not be plotted.
Conclusion: Despite the many words that the spam and ham word clouds have in common, some words do distinguish the two. Spam has words like click, receive, money, order, credit, save, which suggest advertisement-style emails. Ham emails, on the other hand, have words like problem, still, right, said, like, which read more like ordinary messages conveying something to the recipient. So while a number of frequent words are shared by both classes, each class also has some characteristic words of its own.
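One simple way to make these distinguishing words explicit is to compare the top frequency lists directly; a small sketch using the vectors built above (the cutoff of 100 words is arbitrary):

# Frequent spam terms that are not frequent ham terms, and vice versa
spam_only <- setdiff(spam_words[1:100], ham_words[1:100])
ham_only  <- setdiff(ham_words[1:100], spam_words[1:100])
spam_only
ham_only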