PROJECT 4: Document Classification

It can be useful to be able to classify new “test” documents using already classified “training” documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.

For this project, you can start with a spam/ham dataset, then predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder). One example corpus: https://spamassassin.apache.org/publiccorpus/

Workspace preparation

Create a vector with all the needed libraries.

 load_packages <- c(
                    "knitr",
                    "R.utils",
                    "tm",
                    "wordcloud",
                    "topicmodels",
                    "SnowballC",
                    "e1071",
                    "data.table",
                    "RMySQL",
                    "tidyverse",
                    "tidyr",
                    "dplyr",
                    "stringr",
                    "stats"
                  )
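
The vector above only lists the package names. A minimal sketch of installing any missing packages and then loading them all (assuming a CRAN mirror is already configured):

# Install any packages that are not yet present, then load everything
new_packages <- load_packages[!(load_packages %in% installed.packages()[, "Package"])]
if (length(new_packages) > 0) install.packages(new_packages)
invisible(lapply(load_packages, library, character.only = TRUE))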

Selected datasets

The selected datasets are as follows:

url.spam <- "http://spamassassin.apache.org/old/publiccorpus/"
file.spam <- "20050311_spam_2.tar.bz2"

url.ham <- "http://spamassassin.apache.org/old/publiccorpus/"
file.ham <- "20030228_easy_ham.tar.bz2"

Preparing datasets

Download

Function to download the desired files

downloadTAR <- function(filetype=NULL, myurl=NULL, myrootfile=NULL){

  tarfile <- paste(filetype, ".tar", sep="")
  
  if(!file.exists(tarfile)){
      myfile <- paste(myurl, myrootfile, sep="")
      destfile <- paste(filetype, ".tar.bz2", sep="")

      download.file(myfile, destfile = destfile)

      # bunzip2() removes the .bz2 by default, leaving filetype.tar behind
      bunzip2(destfile)
      # untar(tarfile)
  }
  
  # List the file names inside the tar archive without extracting it
  mycompressedfilenames <- untar(tarfile, list = TRUE)
  return(mycompressedfilenames)
}

spamFileNames <- downloadTAR("Spam", url.spam, file.spam)
hamFileNames <- downloadTAR("Ham", url.ham, file.ham)

Obtaining file names

# Strip the directory prefix so only the bare file names remain
spamfiles <- str_trim(str_replace_all(spamFileNames, "spam_2/", ""))
hamfiles <- str_trim(str_replace_all(hamFileNames, "easy_ham/", ""))

# Keep only the actual message files, whose names are exactly 38 characters
spamfiles <- subset(spamfiles, nchar(spamfiles) == 38)
hamfiles <- subset(hamfiles, nchar(hamfiles) == 38)

Read contents

readFileContents <- function(importtype=NULL, filenames=NULL){
  
  # Build the full path to each extracted message file
  if (importtype == "Spam") {
    globalcon <- paste("C:/Users/mydvtech/Documents/GitHub/MSDA/Spring-2017/607/Projects/Project4/spam_2/", filenames, sep = "")
  }
  if (importtype == "Ham") {
    globalcon <- paste("C:/Users/mydvtech/Documents/GitHub/MSDA/Spring-2017/607/Projects/Project4/easy_ham/", filenames, sep = "")
  }

  mydata <- list()

  for(i in 1:length(filenames)){
    con <- file(globalcon[i], "r", blocking = FALSE)
    temp <- readLines(con)
    close(con)
    # Collapse the message lines into a single string per email
    temp <- str_c(temp, collapse = "")
    temp <- as.data.frame(temp, stringsAsFactors = FALSE)
    names(temp) <- "Content"
    mydata[[i]] <- temp
  }
  
  return(mydata)
}

spams <- readFileContents("Spam", spamfiles)
hams <- readFileContents("Ham", hamfiles)

Some results

The total number of known spam emails is 1396.

The total number of known ham emails is 2500.

Grand total of emails: 3896.
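
These totals follow directly from the lists built above:

# Recompute the corpus sizes from the lists returned by readFileContents()
length(spams)                 # 1396
length(hams)                  # 2500
length(spams) + length(hams)  # 3896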

Sample emails

Spam

Ham

Analysis

Length of Email
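
The statistics below compare the character length of each message. A sketch of how the length vectors might be computed from the lists built earlier (the names spam_lengths and ham_lengths are assumptions, not taken from the original code):

# Character count of each message body; every list element is a
# one-row data frame with a "Content" column
spam_lengths <- sapply(spams, function(x) nchar(x$Content))
ham_lengths <- sapply(hams, function(x) nchar(x$Content))

summary(spam_lengths)
summary(ham_lengths)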

Spam Summary Statistics

Summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     725    2458    4004    6183    7020   89210

Distribution

Ham Summary Statistics

Summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     355    1644    3081    3364    4039   88590

Distribution

Median Length

This analysis shows that, in our pool of known ham and spam emails, spam emails tend to have a longer median length than ham emails:

Median Length of Spams: 4004.

Median Length of Hams: 3081.

Difference of medians: 923.

Percentage difference: 29.96%.
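
For reference, these figures can be reproduced from the hypothetical length vectors sketched above:

med_spam <- median(spam_lengths)  # 4004
med_ham <- median(ham_lengths)    # 3081
med_spam - med_ham                # 923
round((med_spam - med_ham) / med_ham * 100, 2)  # 29.96 (% difference)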

@ Analysis
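
This section repeats the comparison for the number of “@” characters per message. A sketch of the counting step, using str_count() from stringr (the vector names spam_ats and ham_ats are assumptions):

# Number of "@" characters in each message body
spam_ats <- sapply(spams, function(x) str_count(x$Content, "@"))
ham_ats <- sapply(hams, function(x) str_count(x$Content, "@"))

summary(spam_ats)
summary(ham_ats)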

@ Spams

Summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     3.0     9.0    11.0    15.6    19.0   423.0

Distribution

@ Hams

Summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    9.00   20.00   18.29   23.00   70.00

Distribution

@ Median analysis

This analysis shows that, in our pool of known ham and spam emails, spam emails tend to have a lower median count of “@” characters than ham emails:

Median “@” count of Spams: 11.

Median “@” count of Hams: 20.

Difference of medians: -9.

Percentage difference: -45%.

This result is plausible, since work and personal emails tend to CC many recipients, while spam campaigns initially target smaller audiences.

Wordclouds

Spam

Ham
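
The clouds were drawn with the wordcloud package. A minimal sketch, assuming spam-only and ham-only cleaned corpora named spam_corpus and ham_corpus (hypothetical names, prepared the same way as clean_corpus):

# Hypothetical helper: term frequencies from a cleaned tm corpus, then a cloud
plot_cloud <- function(corpus) {
  tdm <- TermDocumentMatrix(corpus)
  freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  wordcloud(names(freq), freq, max.words = 75, random.order = FALSE)
}

plot_cloud(spam_corpus)  # Spam
plot_cloud(ham_corpus)   # Ham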

Training data

Divide corpus into training and test data

Use 75% training and 25% test.

# Randomize email order
random_emails <- emails_df[sample(nrow(emails_df)),]
NEmails <- dim(random_emails)[1]
NEmailsQ <- round(NEmails / 4 * 3)  # 75% cut-off point

random_emails_train <- random_emails[1:NEmailsQ,]
random_emails_test <- random_emails[(NEmailsQ + 1):NEmails,]

# Split the cleaned corpus at the same cut-off
emails_corpus_train <- clean_corpus[1:NEmailsQ]
emails_corpus_test <- clean_corpus[(NEmailsQ + 1):NEmails]


# Text to matrix in order to tokenize the corpus, dropping sparse terms
emails_dtm_train <- DocumentTermMatrix(emails_corpus_train)
emails_dtm_train <- removeSparseTerms(emails_dtm_train, 1 - (10 / length(clean_corpus)))

emails_dtm_test <- DocumentTermMatrix(emails_corpus_test)
emails_dtm_test <- removeSparseTerms(emails_dtm_test, 1 - (10 / length(clean_corpus)))


emails_tdm_train <- TermDocumentMatrix(emails_corpus_train)
emails_tdm_train <- removeSparseTerms(emails_tdm_train, 1 - (10 / length(clean_corpus)))

emails_tdm_test <- TermDocumentMatrix(emails_corpus_test)
emails_tdm_test <- removeSparseTerms(emails_tdm_test, 1 - (10 / length(clean_corpus)))


# Keep only terms that appear at least five times in the training data
five_times_words <- findFreqTerms(emails_dtm_train, 5)

Create document-term matrices using frequent words

emails_train <- DocumentTermMatrix(emails_corpus_train, control=list(dictionary = five_times_words))
emails_test <- DocumentTermMatrix(emails_corpus_test, control=list(dictionary = five_times_words))

Convert count information to “Yes”, “No”

Naive Bayes classification needs presence/absence information for each word in a message, but our document-term matrices contain occurrence counts, so we convert the counts to a “No”/“Yes” factor.

# Recode counts as a two-level factor: "No" (absent) vs. "Yes" (present)
convert_count <- function(x) {
  y <- ifelse(x > 0, 1, 0)
  y <- factor(y, levels = c(0, 1), labels = c("No", "Yes"))
  y
}

emails_train <- apply(emails_train, 2, convert_count)
emails_test <- apply(emails_test, 2, convert_count)

The Naive Bayes function

We’ll use a Naive Bayes classifier provided in the package e1071.

emails_classifier <- naiveBayes(emails_train, factor(random_emails_train$type))
class(emails_classifier)
## [1] "naiveBayes"
# emails_test_pred <- predict(emails_classifier, newdata=emails_test)

Unfortunately, this step requires a lot of resources and my PC ran out of memory; hence I can't present the final results.
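
One possible workaround, sketched below but not run here, is to score the test set in a small batch and cross-tabulate the predictions against the true labels (the batch size is arbitrary):

# Predict on a small batch of test documents to stay within memory
batch <- 1:200  # arbitrary batch size
batch_pred <- predict(emails_classifier, newdata = emails_test[batch, ])
# Confusion matrix: predicted class vs. actual class
table(predicted = batch_pred, actual = random_emails_test$type[batch])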