Introduction
For this project, we'll do some text mining and analysis on spam and ham emails. First we'll load the libraries we'll use for this project, then store the file names of our spam and ham emails in two variables.
library(tidytext)
library(tidyverse)
library(tm)
library(stringr)
library(caret)
library(SnowballC)
spam_tr <- list.files("/Users/chesterpoon/Desktop/Data607/Project4/spam_2")
ham_tr <- list.files("/Users/chesterpoon/Desktop/Data607/Project4/easy_ham")
Get our data into a data frame
We'll now read our files into two data frames: one for spam and one for ham.
#Change into the directories where the files are and read them into a dataframe
#spam data frame
#change the working directory to get the right files
setwd("~/Desktop/Data607/Project4/spam_2")
spam <- map_df(spam_tr, ~ tibble(txt = read_lines(.x)) %>%
  mutate(filename = .x, type = "spam"))
#Set an integer ID for each email
spam$ID <- cumsum(!duplicated(spam[2]))
#ham data frame
#change the working directory to get the right files
setwd("~/Desktop/Data607/Project4/easy_ham")
ham <- map_df(ham_tr, ~ tibble(txt = read_lines(.x)) %>%
  mutate(filename = .x, type = "ham"))
#Set an integer ID for each email
ham$ID <- cumsum(!duplicated(ham[2]))
#combining into a single data frame
spam_ham_train <- full_join(spam, ham, by = c("ID","txt","type","filename"))
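One caveat worth flagging: both ID sequences start at 1, so a spam email and a ham email can share the same ID, which would merge two different emails into a single document when we cast a document term matrix later. Below is a minimal sketch of one way to avoid this (not applied here, so the outputs that follow reflect the original IDs); bind_rows is also a more direct way to stack the two frames than a full join.
#stack the two frames and make each email's ID unique by prefixing its type,
#so "spam_1" and "ham_1" stay distinct
spam_ham_train <- bind_rows(spam, ham) %>%
  mutate(ID = paste(type, ID, sep = "_"))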
knitr::kable(head(spam_ham_train), format = "html")
| txt | filename | type | ID |
|---|---|---|---|
| From ilug-admin@linux.ie Tue Aug 6 11:51:02 2002 | 00001.317e78fa8ee2f54cd4890fdc09ba8176 | spam | 1 |
| Return-Path: <ilug-admin@linux.ie>; | 00001.317e78fa8ee2f54cd4890fdc09ba8176 | spam | 1 |
| Delivered-To: yyyy@localhost.netnoteinc.com | 00001.317e78fa8ee2f54cd4890fdc09ba8176 | spam | 1 |
| Received: from localhost (localhost [127.0.0.1]) | 00001.317e78fa8ee2f54cd4890fdc09ba8176 | spam | 1 |
| by phobos.labs.netnoteinc.com (Postfix) with ESMTP id 9E1F5441DD | 00001.317e78fa8ee2f54cd4890fdc09ba8176 | spam | 1 |
| for <jm@localhost>;; Tue, 6 Aug 2002 06:48:09 -0400 (EDT) | 00001.317e78fa8ee2f54cd4890fdc09ba8176 | spam | 1 |
Our data frame has four columns, and each row represents one line from an email. I've assigned an integer ID to each email, which will be useful later when we create our document term matrix, and I labeled each email as ham or spam based on how it was classified in the corpus provided for this project.
Tidy our data
Now our data is ready to be prepared. I'll use the tidytext library to clean up our text so that each row represents one word from an email. We'll take the following steps:
- Use unnest_tokens to create one row per word
- Remove stop words
- Take out numbers/digits
- Reduce each word to its stem
- Add a column counting each word's occurrences by email type
- Remove any whitespace
#get each word into a row and get rid of stop words
spam_ham_cl <- spam_ham_train %>%
  unnest_tokens(word, txt) %>%
  anti_join(stop_words) %>%
  #take out numbers
  filter(!str_detect(word, "^[0-9]*$")) %>%
  #get just the word stem
  mutate(word = wordStem(word)) %>%
  #count each word's occurrences by email type
  count(type, word, sort = TRUE)
## Joining, by = "word"
#take out whitespace
spam_ham_cl$word <- gsub("\\s+", "", spam_ham_cl$word)
knitr::kable(head(spam_ham_cl), format = "html")
| type | word | n |
|---|---|---|
| spam | font | 33911 |
| spam | 3d | 32154 |
| spam | br | 16751 |
| spam | td | 15692 |
| ham | list | 15018 |
| ham | id | 14763 |
TF-IDF Analysis
We'll now use tf-idf to analyze which words are most important to each document type, treating all spam as one document and all ham as another. We'll use the convenient bind_tf_idf function to compute the values and arrange them in descending order to get the top tf-idf scores. Note that with only two documents, idf can take just two values: log(2/2) = 0 for words appearing in both types, and log(2/1) ≈ 0.693 for words appearing in only one.
spam_ham_tfidf <- spam_ham_cl %>%
  bind_tf_idf(word, type, n) %>%
  arrange(desc(tf_idf))
knitr::kable(head(spam_ham_tfidf), format = "html")
| type | word | n | tf | idf | tf_idf |
|---|---|---|---|---|---|
| ham | example.com | 7676 | 0.0100666 | 0.6931472 | 0.0069776 |
| ham | exmh | 4888 | 0.0064103 | 0.6931472 | 0.0044433 |
| ham | freshrpms.net | 3780 | 0.0049572 | 0.6931472 | 0.0034361 |
| spam | helvetica | 3316 | 0.0042733 | 0.6931472 | 0.0029620 |
| ham | zzzlist | 2879 | 0.0037756 | 0.6931472 | 0.0026171 |
| spam | serif | 2906 | 0.0037449 | 0.6931472 | 0.0025958 |
Every word above has the same idf (log 2 ≈ 0.6931472) because each appears in only one of the two document types, and tf_idf is simply tf × idf (e.g., 0.0100666 × 0.6931472 ≈ 0.0069776). Now we'll display the top 15 tf-idf scores for words in our spam and ham emails.
spam_tf_idf <- spam_ham_tfidf %>%
  filter(type == 'spam') %>%
  top_n(15, tf_idf)
ham_tf_idf <- spam_ham_tfidf %>%
  filter(type == 'ham') %>%
  top_n(15, tf_idf)
ggplot(spam_tf_idf, aes(reorder(word, tf_idf), tf_idf)) +
  geom_col(aes(fill = tf_idf)) +
  coord_flip() +
  xlab('Word') +
  ggtitle('Spam') +
  scale_fill_gradient() +
  theme_bw()

ggplot(ham_tf_idf, aes(reorder(word, tf_idf), tf_idf)) +
  geom_col(aes(fill = tf_idf)) +
  coord_flip() +
  xlab('Word') +
  ggtitle('Ham') +
  scale_fill_gradient() +
  theme_bw()
Creating a Document Term Matrix
We'll now create a document term matrix that could feed a machine learning algorithm for classifying emails as spam or ham. We'll repeat the same tidying steps as earlier, but this time we'll count words per email ID rather than per type.
spam_ham_df <- spam_ham_train %>%
  #get each word into a row
  unnest_tokens(word, txt) %>%
  #remove numbers and any NA values
  filter(!str_detect(word, "^[0-9]*$"),
         !is.na(word)) %>%
  #remove stop words
  anti_join(stop_words) %>%
  #get just the word stem
  mutate(word = wordStem(word))
#take out whitespace
spam_ham_df$word <- gsub("\\s+", "", spam_ham_df$word)
#add word counts per email
spam_ham_df <- spam_ham_df %>%
  count(ID, word, type)
#create our dtm with tf-idf weighting
spam_ham_dtm <- spam_ham_df %>%
  cast_dtm(document = ID, term = word, value = n,
           weighting = tm::weightTfIdf)
#drop terms absent from more than 99% of documents
spam_ham_dtm <- removeSparseTerms(spam_ham_dtm, sparse = 0.99)
spam_ham_dtm
## <<DocumentTermMatrix (documents: 2551, terms: 2863)>>
## Non-/sparse entries: 437107/6866406
## Sparsity           : 94%
## Maximal term length: 65
## Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)
Challenges
This project was particularly challenging, and several things did not pan out. I found the tm package difficult to work with. The tidytext package was far easier, but I couldn't figure out how to restrict the analysis to just the email body or subject rather than the full message, headers included. I also attempted to build a predictive model without success; I wasn't clear on where to begin or what steps a model requires. The next step for this project is to research machine learning methods so I can do the modeling.
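For the body-text problem, one possible starting point: in a standard RFC 822 message the headers end at the first blank line, so we can keep only the lines after that point. A minimal sketch (body_only is a hypothetical name, and this ignores any corpus-specific quirks):
#keep only the lines after the first blank line in each file,
#i.e. the message body under standard RFC 822 formatting
body_only <- spam_ham_train %>%
  group_by(filename) %>%
  filter(cumsum(txt == "") > 0) %>%
  ungroup()
For the modeling, one possible starting point uses the caret package loaded earlier. This is a sketch, not a tested pipeline: it assumes the document IDs have been made unique per email as suggested above (so each row of the matrix is one email) and that the glmnet package is installed for the regularized regression caret calls here.
#one label per DTM row, matched by document ID
doc_labels <- spam_ham_df %>% distinct(ID, type)
labels <- factor(doc_labels$type[match(Docs(spam_ham_dtm), doc_labels$ID)])
#hold out 25% of emails for testing
dtm_mat <- as.matrix(spam_ham_dtm)
set.seed(607)
in_train <- createDataPartition(labels, p = 0.75, list = FALSE)
#regularized logistic regression with 5-fold cross-validation
fit <- train(x = dtm_mat[in_train, ], y = labels[in_train],
             method = "glmnet",
             trControl = trainControl(method = "cv", number = 5))
#evaluate on the held-out emails
preds <- predict(fit, dtm_mat[-in_train, ])
confusionMatrix(preds, labels[-in_train])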