PROJECT 4: Document Classification
It can be useful to be able to classify new “test” documents using already classified “training” documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.
For this project, you can start with a spam/ham dataset, then predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder). One example corpus: https://spamassassin.apache.org/publiccorpus/
Workspace preparation
Create a vector with all of the needed packages:
load_packages <- c(
"knitr",
"R.utils",
"tm",
"wordcloud",
"topicmodels",
"SnowballC",
"e1071",
"data.table",
"RMySQL",
"tidyverse",
"tidyr",
"dplyr",
"stringr",
"stats"
)
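This vector is then typically used to install whatever is missing and attach everything in one pass; a minimal sketch (the original loading chunk is not shown):
# Install any packages not yet present, then attach them all
missing_packages <- load_packages[!load_packages %in% installed.packages()[, "Package"]]
if (length(missing_packages) > 0) install.packages(missing_packages)
invisible(lapply(load_packages, library, character.only = TRUE))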
Selected datasets
The selected datasets are as follows:
url.spam <- "http://spamassassin.apache.org/old/publiccorpus/"
file.spam <- "20050311_spam_2.tar.bz2"
url.ham <- "http://spamassassin.apache.org/old/publiccorpus/"
file.ham <- "20030228_easy_ham.tar.bz2"
Preparing datasets
Download
Function to download the desired files
downloadTAR <- function(filetype = NULL, myurl = NULL, myrootfile = NULL) {
  tarfile <- paste(filetype, ".tar", sep = "")
  if (!file.exists(tarfile)) {
    myfile <- paste(myurl, myrootfile, sep = "")
    bz2file <- paste(filetype, ".tar.bz2", sep = "")
    download.file(myfile, destfile = bz2file)
    bunzip2(bz2file)  # R.utils: decompresses to <filetype>.tar and removes the .bz2
    untar(tarfile)    # extract the messages into the working directory
  }
  mycompressedfilenames <- untar(tarfile, list = TRUE)
  return(mycompressedfilenames)
}
spamFileNames <- downloadTAR("Spam", url.spam, file.spam)
hamFileNames <- downloadTAR("Ham", url.ham, file.ham)
Obtaining file names
spamfiles <- str_trim(str_replace_all(spamFileNames, "spam_2/", ""))
hamfiles <- str_trim(str_replace_all(hamFileNames, "easy_ham/", ""))
# Message files are named as a numeric index plus a 32-character checksum
# (38 characters in all), so this filter drops directory and non-message entries
spamfiles <- subset(spamfiles, nchar(spamfiles) == 38)
hamfiles <- subset(hamfiles, nchar(hamfiles) == 38)
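A pattern-based filter is a less brittle alternative to the fixed-width test; a sketch using stringr:
# Keep only entries that look like NNNNN.<32-character hex checksum>
spamfiles <- spamfiles[str_detect(spamfiles, "^\\d{5}\\.[0-9a-f]{32}$")]
hamfiles <- hamfiles[str_detect(hamfiles, "^\\d{5}\\.[0-9a-f]{32}$")]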
Read contents
readFileContents <- function(importtype = NULL, filenames = NULL) {
  # Paths are relative to the working directory the archives were extracted into
  if (importtype == "Spam") {
    globalcon <- paste("spam_2/", filenames, sep = "")
  }
  if (importtype == "Ham") {
    globalcon <- paste("easy_ham/", filenames, sep = "")
  }
  mydata <- list()  # one one-row data frame per message
  for (i in 1:length(filenames)) {
    con <- file(globalcon[i], "r", blocking = FALSE)
    temp <- readLines(con)
    close(con)
    # Collapse the message lines into a single string
    temp <- str_c(temp, collapse = "")
    temp <- as.data.frame(temp, stringsAsFactors = FALSE)
    names(temp) <- "Content"
    mydata[[i]] <- temp
  }
  return(mydata)
}
spams <- readFileContents("Spam", spamfiles)
hams <- readFileContents("Ham", hamfiles)
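The later chunks reference emails_df and clean_corpus, whose construction is not shown above. A minimal sketch, assuming each parsed message is a one-row data frame with a Content column and the usual tm cleaning steps:
# Label and combine the parsed messages into a single data frame
emails_df <- rbind(
  data.frame(Content = sapply(spams, `[[`, "Content"), type = "spam",
             stringsAsFactors = FALSE),
  data.frame(Content = sapply(hams, `[[`, "Content"), type = "ham",
             stringsAsFactors = FALSE)
)
# Build a tm corpus and apply standard cleaning transformations
clean_corpus <- VCorpus(VectorSource(emails_df$Content))
clean_corpus <- tm_map(clean_corpus, content_transformer(tolower))
clean_corpus <- tm_map(clean_corpus, removeNumbers)
clean_corpus <- tm_map(clean_corpus, removeWords, stopwords("english"))
clean_corpus <- tm_map(clean_corpus, removePunctuation)
clean_corpus <- tm_map(clean_corpus, stripWhitespace)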
Some results
The total number of known spam emails is 1396.
The total number of known ham emails is 2500.
Grand total of emails: 3896.
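These totals follow directly from the lengths of the parsed lists:
length(spams)                 # 1396
length(hams)                  # 2500
length(spams) + length(hams)  # 3896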
Sample emails
Spam
Ham
Analysis
Length of Email
Spam Summary Statistics
Summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 725 2458 4004 6183 7020 89210
Distribution
Ham Summary Statistics
Summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 355 1644 3081 3364 4039 88590
Distribution
Median Length
This analysis shows that, in our pool of known emails, spam messages tend to have a longer median length than ham messages:
Median length of spam: 4004.
Median length of ham: 3081.
Difference of medians: 923.
Percentage difference: 29.96%.
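A sketch of how these length summaries can be computed from the combined data frame (the original chunk is not shown):
# Character length of each message body, split by class
email_len <- nchar(emails_df$Content)
summary(email_len[emails_df$type == "spam"])
summary(email_len[emails_df$type == "ham"])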
“@” Count Analysis
Spam “@” Counts
Summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.0 9.0 11.0 15.6 19.0 423.0
Distribution
Ham “@” Counts
Summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 9.00 20.00 18.29 23.00 70.00
Distribution
“@” Median Analysis
This analysis shows that, in our pool of known emails, spam messages tend to contain fewer “@” characters than ham messages:
Median “@” count of spam: 11.
Median “@” count of ham: 20.
Difference of medians: -9.
Percentage difference: -45%.
This is plausible: work and personal emails often CC many recipients, while spam campaigns tend to target smaller audiences at first.
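The per-message “@” counts behind these summaries can be reproduced with stringr (assumed; the original chunk is not shown):
# Count "@" occurrences in each message body, split by class
at_count <- str_count(emails_df$Content, "@")
summary(at_count[emails_df$type == "spam"])
summary(at_count[emails_df$type == "ham"])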
Wordclouds
Spam
Ham
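The wordclouds are built from the cleaned corpus; a minimal sketch for the spam cloud, assuming the emails_df and clean_corpus objects sketched earlier (the ham cloud is analogous with type == "ham"):
# Wordcloud of the 100 most frequent terms in the cleaned spam documents
spam_idx <- which(emails_df$type == "spam")
wordcloud(clean_corpus[spam_idx], max.words = 100, random.order = FALSE)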
Training data
Divide corpus into training and test data
Use 75% of the data for training and 25% for testing.
# Randomize the email order; reuse the same shuffled indices for the
# corpus so documents stay aligned with their labels
set.seed(123)  # assumed seed, for reproducibility
shuffle <- sample(nrow(emails_df))
random_emails <- emails_df[shuffle, ]
NEmails <- nrow(random_emails)
NEmailsQ <- round(NEmails / 4 * 3)  # 75% cut-off
random_emails_train <- random_emails[1:NEmailsQ, ]
# Note (NEmailsQ + 1):NEmails, not NEmailsQ + 1:NEmails -- ":" binds tighter than "+"
random_emails_test <- random_emails[(NEmailsQ + 1):NEmails, ]
# Split the cleaned corpus with the same shuffled order
emails_corpus_train <- clean_corpus[shuffle][1:NEmailsQ]
emails_corpus_test <- clean_corpus[shuffle][(NEmailsQ + 1):NEmails]
# Text to Matrix in order to tokenize the corpus; drop terms that appear
# in fewer than roughly 10 documents of the matching corpus
emails_dtm_train <- DocumentTermMatrix(emails_corpus_train)
emails_dtm_train <- removeSparseTerms(emails_dtm_train, 1 - (10 / length(emails_corpus_train)))
emails_dtm_test <- DocumentTermMatrix(emails_corpus_test)
emails_dtm_test <- removeSparseTerms(emails_dtm_test, 1 - (10 / length(emails_corpus_test)))
# Term-document (transposed) versions, kept for exploration
emails_tdm_train <- TermDocumentMatrix(emails_corpus_train)
emails_tdm_train <- removeSparseTerms(emails_tdm_train, 1 - (10 / length(emails_corpus_train)))
emails_tdm_test <- TermDocumentMatrix(emails_corpus_test)
emails_tdm_test <- removeSparseTerms(emails_tdm_test, 1 - (10 / length(emails_corpus_test)))
# Keep only words that appear at least five times in the training data
five_times_words <- findFreqTerms(emails_dtm_train, 5)
Create document-term matrices using frequent words
emails_train <- DocumentTermMatrix(emails_corpus_train, control=list(dictionary = five_times_words))
emails_test <- DocumentTermMatrix(emails_corpus_test, control=list(dictionary = five_times_words))
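As a quick sanity check, both restricted matrices should share the same dictionary of frequent terms:
dim(emails_train)         # documents x dictionary terms
dim(emails_test)
length(five_times_words)  # should match the column count of both matrices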
Convert count information to “Yes”, “No”
Naive Bayes classification works with the presence or absence of each word in a message, but the document-term matrices hold occurrence counts. Convert the counts to “Yes”/“No” factors.
convert_count <- function(x) {
  # Map any positive count to "Yes" and zero to "No"
  y <- ifelse(x > 0, 1, 0)
  y <- factor(y, levels = c(0, 1), labels = c("No", "Yes"))
  y
}
# Apply column-wise over both document-term matrices
emails_train <- apply(emails_train, 2, convert_count)
emails_test <- apply(emails_test, 2, convert_count)
The Naive Bayes function
We’ll use the Naive Bayes classifier provided by the e1071 package.
emails_classifier <- naiveBayes(emails_train, factor(random_emails_train$type))
class(emails_classifier)
## [1] "naiveBayes"
# emails_test_pred <- predict(emails_classifier, newdata=emails_test)
Unfortunately, this step requires more memory than my PC has available and the prediction ran out of memory, so I can’t present the final results.
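For completeness, here is a sketch of how the evaluation would proceed on a smaller test subsample to stay within memory; the subsample size of 500 is an arbitrary assumption:
# Predict only the first n test documents, then cross-tabulate
# predictions against the known labels
n <- 500  # hypothetical subsample size
emails_test_pred <- predict(emails_classifier, newdata = emails_test[1:n, ])
table(Predicted = emails_test_pred, Actual = random_emails_test$type[1:n])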