It can be useful to be able to classify new “test” documents using already classified “training” documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.
For this project, we are tasked with starting from a spam/ham dataset and then predicting the class of new documents (either withheld from the training dataset or taken from another source, such as your own spam folder). We are provided with the corpus (https://spamassassin.apache.org/old/publiccorpus/) and instructions on how to download the ham and spam files.
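The package startup messages below suggest library calls along these lines. This is a guess reconstructed from the messages themselves (tm pulls in NLP as a dependency, and wordcloud pulls in RColorBrewer); the original chunk was not echoed.
library(tm)         # prints "Loading required package: NLP"
library(tidyverse)
library(wordcloud)  # prints "Loading required package: RColorBrewer"
library(naivebayes) # prints "naivebayes 0.9.6 loaded"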
## Loading required package: NLP
## -- Attaching packages ------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.2.1 v purrr 0.3.2
## v tibble 2.1.3 v dplyr 0.8.3
## v tidyr 0.8.3 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts ---------------------------------------- tidyverse_conflicts() --
## x ggplot2::annotate() masks NLP::annotate()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## Loading required package: RColorBrewer
## naivebayes 0.9.6 loaded
We followed the unzipping process explained in the video and downloaded the “easy_ham” and “spam” folders. Next we load these files into R.
# loading the spam and ham files
spam_directory = "C:/Users/Anil Akyildirim/Desktop/Data Science/MSDS/Data Acquisition and Management/Week 11/Project 4/spam"
easy_ham_directory = "C:/Users/Anil Akyildirim/Desktop/Data Science/MSDS/Data Acquisition and Management/Week 11/Project 4/easy_ham"
spam_files <- list.files(spam_directory)
easy_ham_files <- list.files(easy_ham_directory)
We need to remove the .cmds files from all the files.
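A sketch of that removal step (the original removal code is not shown; grepl-based filtering is one option):
# drop the corpus's command files from each file list
spam_files <- spam_files[!grepl("cmds", spam_files)]
easy_ham_files <- easy_ham_files[!grepl("cmds", easy_ham_files)]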
# easy_ham folder files
easy_ham_corpus <- easy_ham_directory %>%
paste(., list.files(.), sep = "/") %>%
lapply(readLines) %>%
VectorSource() %>%
VCorpus()
easy_ham_corpus
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 2551
# spam folder files
spam_corpus <- spam_directory %>%
paste(., list.files(.), sep = "/") %>%
lapply(readLines) %>%
VectorSource() %>%
VCorpus()
spam_corpus
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 500
In terms of cleaning the corpus for each folder, we will use the tm package and follow the steps below:
1- Remove numbers and punctuation.
2- Remove stopwords such as to, from, and, the, etc.
3- Remove blank spaces.
4- Reduce terms to their stems.
# easy ham emails
easy_ham_corpus <- easy_ham_corpus %>%
tm_map(removeNumbers) %>%
tm_map(removePunctuation) %>%
tm_map(removeWords, stopwords()) %>%
tm_map(stripWhitespace) %>%
tm_map(stemDocument)
easy_ham_corpus
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 2551
#spam emails
spam_corpus <- spam_corpus %>%
tm_map(removeNumbers) %>%
tm_map(removePunctuation) %>%
tm_map(removeWords, stopwords()) %>%
tm_map(stripWhitespace) %>%
tm_map(stemDocument)
spam_corpus
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 500
A look at the corpus for easy_ham and spam reveals that we have 2551 documents in easy_ham and 500 documents in spam. We combine these two corpora.
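The combined corpus and the document-term matrix summarized below were presumably built along these lines; the object names here are illustrative:
# combine the ham and spam corpora, then build a document-term matrix
ham_or_spam_corpus <- c(easy_ham_corpus, spam_corpus)
ham_or_spam_dtm <- DocumentTermMatrix(ham_or_spam_corpus)
ham_or_spam_dtm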
## <<DocumentTermMatrix (documents: 3051, terms: 59852)>>
## Non-/sparse entries: 490433/182118019
## Sparsity : 100%
## Maximal term length: 298
## Weighting : term frequency (tf)
We can use a classification method such as the Naive Bayes classifier, which uses the presence of certain features (words) within each class to predict whether an email is spam or ham.
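Concretely, Naive Bayes assumes the words are conditionally independent given the class and scores each class as

$$P(\text{spam} \mid w_1, \dots, w_n) \propto P(\text{spam}) \prod_{i=1}^{n} P(w_i \mid \text{spam}),$$

with the analogous product for ham; the class with the larger score wins.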
Before we create our training and test datasets, we need to build a combined data frame and label each document in the corpus (ham or spam), as required for a supervised technique.
# build labeled data frames from each corpus, one row per document line
df_ham <- as.data.frame(unlist(easy_ham_corpus), stringsAsFactors = FALSE)
df_ham$type <- "ham"
colnames(df_ham) <- c("text", "email") # the "email" column holds the ham/spam label
df_spam <- as.data.frame(unlist(spam_corpus), stringsAsFactors = FALSE)
df_spam$type <- "spam"
colnames(df_spam) <- c("text", "email")
df_ham_or_spam <- rbind(df_ham, df_spam)
head(df_ham_or_spam)
## text email
## 1 From exmhworkersadminredhatcom Thu Aug ham
## 2 ReturnPath exmhworkersadminexamplecom ham
## 3 DeliveredTo zzzzlocalhostnetnoteinccom ham
## 4 Receiv localhost localhost ham
## 5 phoboslabsnetnoteinccom Postfix ESMTP id DEC ham
## 6 zzzzlocalhost Thu Aug EDT ham
We will split 75% of the data as training data and 25% as the test data.
sample_size <- floor(0.75 * nrow(df_ham_or_spam)) # selecting sample size of 75% of the data for training.
set.seed(123)
train <- sample(seq_len(nrow(df_ham_or_spam)), size = sample_size)
train_ham_or_spam <- df_ham_or_spam[train, ]
test_ham_or_spam <- df_ham_or_spam[-train, ]
head(train_ham_or_spam)
## text email
## 188942 ListArchiv httpwwwgeocrawlercomredirsfphplistrazorus ham
## 134058 To rpmzzzlistfreshrpmsnet ham
## 124022 manifest exmh I figur ask might help track ham
## 160997 Refer DDEFphoboslabsnetnoteinccom ham
## 226318 Receiv dogmaslashnullorg localhost ham
## 124507 XPrioriti ham
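The matching preview of the test split below was presumably produced by:
head(test_ham_or_spam) # presumed call; the row indices below are the held-out rows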
## text email
## 6 zzzzlocalhost Thu Aug EDT ham
## 14 listmanredhatcom Postfix ESMTP id Thu Aug ham
## 15 EDT ham
## 25 intmxcorpredhatcom SMTP id gMBYi ham
## 37 To Chris Garrigu cwgdatedfadDeepEddyCom ham
## 40 InReplyTo TMDAdeepeddyvirciocom ham
# corpus creation
train_corpus <- Corpus(VectorSource(train_ham_or_spam$text)) # corpus training data
test_corpus <- Corpus(VectorSource(test_ham_or_spam$text)) # corpus test data
# corpus cleaning
train_corpus <- train_corpus %>%
tm_map(removeNumbers) %>%
tm_map(removePunctuation) %>%
tm_map(removeWords, stopwords()) %>%
tm_map(stripWhitespace)
## Warning in tm_map.SimpleCorpus(., removeNumbers): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(., removePunctuation): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(., removeWords, stopwords()): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(., stripWhitespace): transformation drops
## documents
test_corpus <- test_corpus %>%
tm_map(removeNumbers) %>%
tm_map(removePunctuation) %>%
tm_map(removeWords, stopwords()) %>%
tm_map(stripWhitespace)
## Warning in tm_map.SimpleCorpus(., removeNumbers): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(., removePunctuation): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(., removeWords, stopwords()): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(., stripWhitespace): transformation drops
## documents
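The document-term matrices summarized below were presumably built from the cleaned corpora; the object names are assumed here and reused further down:
# document-term matrices for the training and test corpora
train_dtm <- DocumentTermMatrix(train_corpus)
test_dtm <- DocumentTermMatrix(test_corpus)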
## <<DocumentTermMatrix (documents: 236142, terms: 50639)>>
## Non-/sparse entries: 573964/11957420774
## Sparsity : 100%
## Maximal term length: 245
## Weighting : term frequency (tf)
## <<DocumentTermMatrix (documents: 78714, terms: 25950)>>
## Non-/sparse entries: 191569/2042436731
## Sparsity : 100%
## Maximal term length: 298
## Weighting : term frequency (tf)
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 236142
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 78714
We need to separate the training data into spam and ham.
If we use every observation in the data, R does not have enough memory to execute the model. So we narrow down the features by selecting words that are used at least 50 times in the training documents.
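A sketch of that filtering step, assuming tm::findFreqTerms and the train_dtm object introduced above:
# keep only terms that occur at least 50 times in the training documents
freq_words <- findFreqTerms(train_dtm, lowfreq = 50)
length(freq_words)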
## [1] 1616
We need to create a classifier that assigns a label to each email.
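The class printouts below suggest the frequent-term matrices were converted to data frames of categorical features before fitting. A minimal sketch, assuming e1071::naiveBayes (the printed class "naiveBayes" matches e1071 rather than the naivebayes package loaded earlier); convert_counts and most object names are illustrative:
library(e1071)
# Naive Bayes expects categorical features, so recode raw term counts
# to presence/absence labels
convert_counts <- function(x) {
  ifelse(x > 0, "Yes", "No")
}
train_dtm_2 <- train_dtm[, freq_words] # keep only the frequent terms
train_df <- as.data.frame(apply(as.matrix(train_dtm_2), 2, convert_counts),
                          stringsAsFactors = TRUE)
classifier <- naiveBayes(train_df, factor(train_ham_or_spam$email))
# same conversion for the test set; intersect() guards against frequent
# training terms that never occur in the test documents, and the name
# test_tdm_3 matches the predict() call quoted below
test_terms <- intersect(freq_words, Terms(test_dtm))
test_tdm_3 <- as.data.frame(apply(as.matrix(test_dtm[, test_terms]), 2, convert_counts),
                            stringsAsFactors = TRUE)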
## [1] "DocumentTermMatrix" "simple_triplet_matrix"
## [1] "data.frame"
## [1] "naiveBayes"
## [1] "DocumentTermMatrix" "simple_triplet_matrix"
## [1] "data.frame"
We can use the predict() function to test the model on new data:
test_pred <- predict(classifier, newdata = test_tdm_3)
We are able to generate predictions of whether an email is ham or spam using a supervised technique (the Naive Bayes method). We can further test the predictions against the raw data and evaluate the model’s performance.
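One way to evaluate the model (a sketch, assuming the test_pred object from the predict() call above):
# confusion matrix: predicted labels vs. true labels of the test split
table(predicted = test_pred, actual = test_ham_or_spam$email)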
**Unfortunately, I ran into a lot of code-efficiency issues on this project. Most of the time I was not able to write efficient code, and when I reviewed the error messages I found that my code was using a lot of memory. For example, I had to change the class types to make the classifier work.**