It is often useful to classify new “test” documents using already classified “training” documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.

For this project, you can start with a spam/ham dataset, then predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder).

For this project I will be using the ham and spam folders downloaded from the SpamAssassin “Index of /old/publiccorpus” page: https://spamassassin.apache.org/old/publiccorpus/

Because these files are too large to open in raw form on GitHub, they are read in from my local machine; however, I have provided the source files in a zipped folder within this project's repository.
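
The zipped folder in the repository is enough to reproduce this analysis, but the archives can also be pulled directly from that page. Below is a minimal download-and-extract sketch; the archive names and the extraction location are assumptions and should be checked against the index page.

# Hypothetical download-and-extract sketch; confirm the archive names on the index page
base_url <- "https://spamassassin.apache.org/old/publiccorpus/"
archives <- c("20030228_easy_ham.tar.bz2", "20030228_spam_2.tar.bz2")  # assumed file names

for (f in archives) {
  download.file(paste0(base_url, f), destfile = f, mode = "wb")
  untar(f, exdir = ".")  # extract so the paths in the Load Data chunk point at easy_ham/ and spam_2/
}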

Load Data
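
The chunks below call functions from several packages; loading them up front keeps each chunk self-contained. The package list is inferred from the functions used in this document.

library(tm)            # VCorpus, tm_map, DocumentTermMatrix, findFreqTerms
library(dplyr)         # %>%, sample_frac, anti_join
library(naivebayes)    # naive_bayes
library(wordcloud)     # wordcloud
library(RColorBrewer)  # brewer.pal
# stemDocument also requires the SnowballC package to be installed (tm calls it internally)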

# Local paths to the extracted SpamAssassin folders
spam_path <- "C:/Users/keeno/OneDrive/Documents/MSDS/DATA607/Document Classification/spam_2"
ham_path <- "C:/Users/keeno/OneDrive/Documents/MSDS/DATA607/Document Classification/easy_ham"

# List the message files and drop the "cmds" bookkeeping file that ships with each folder
spam_filenames <- list.files(spam_path)
ham_filenames <- list.files(ham_path)
spam_filenames <- spam_filenames[spam_filenames != "cmds"]
ham_filenames <- ham_filenames[ham_filenames != "cmds"]


head(ham_filenames)
## [1] "00001.7c53336b37003a9286aba55d2945844c"
## [2] "00002.9c4069e25e1ef370c078db7ee85ff9ac"
## [3] "00003.860e3c3cee1b42ead714c5c874fe25f7"
## [4] "00004.864220c5b6930b209cc287c361c99af1"
## [5] "00005.bf27cdeaf0b8c4647ecd61b1d09da613"
## [6] "00006.253ea2f9a9cc36fa0b1129b04b806608"
head(spam_filenames)
## [1] "00001.317e78fa8ee2f54cd4890fdc09ba8176"
## [2] "00002.9438920e9a55591b18e60d1ed37d992b"
## [3] "00003.590eff932f8704d8b0fcbe69d023b54d"
## [4] "00004.bdcc075fa4beb5157b5dd6cd41d8887b"
## [5] "00005.ed0aba4d386c5e62bc737cf3f0ed9589"
## [6] "00006.3ca1f399ccda5d897fecb8c57669a283"
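
Before any cleaning, it helps to look at one raw message to see the header-plus-body structure the transformations below will operate on; a quick peek at the first ham file (the file index is arbitrary):

# Inspect the first few lines of one ham message (header fields, then the body)
sample_msg <- readLines(file.path(ham_path, ham_filenames[1]), warn = FALSE)
head(sample_msg, 10)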

# Process the text data with tm corpora

# Build the ham corpus from the filtered file list, then clean it with tm transformations
hamcorpus <- file.path(ham_path, ham_filenames) %>%
  lapply(readLines, warn = FALSE) %>%
  VectorSource() %>%
  VCorpus() %>%
  tm_map(removeWords, stopwords()) %>%
  tm_map(stripWhitespace) %>%
  tm_map(stemDocument) %>%
  tm_map(removeNumbers) %>%
  tm_map(removePunctuation)
  

# Build the spam corpus the same way
spamcorpus <- file.path(spam_path, spam_filenames) %>%
  lapply(readLines, warn = FALSE) %>%
  VectorSource() %>%
  VCorpus() %>%
  tm_map(removeWords, stopwords()) %>%
  tm_map(stripWhitespace) %>%
  tm_map(stemDocument) %>%
  tm_map(removeNumbers) %>%
  tm_map(removePunctuation)
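
One optional refinement, not applied in the results below, is lower-casing the text before stop-word removal, since stopwords() is lower-case and capitalized forms such as "The" or "Free" would otherwise survive. A sketch of that variant pipeline, shown for the ham folder only:

# Variant pipeline: lower-case first so stop-word removal catches capitalized forms
hamcorpus_lower <- file.path(ham_path, ham_filenames) %>%
  lapply(readLines, warn = FALSE) %>%
  VectorSource() %>%
  VCorpus() %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removeWords, stopwords()) %>%
  tm_map(stripWhitespace) %>%
  tm_map(stemDocument) %>%
  tm_map(removeNumbers) %>%
  tm_map(removePunctuation)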

hamspamcorpus <- c(hamcorpus, spamcorpus)
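
The two corpora are concatenated with c(), so a quick length check confirms nothing was lost:

# One document per message file; the combined length should equal the sum of the two
length(hamcorpus)
length(spamcorpus)
length(hamspamcorpus)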

Combine Spam and Ham into One Data Frame

# Ham documents with their class label
df_H <- as.data.frame(unlist(hamcorpus), stringsAsFactors = FALSE)
df_H$type <- "ham"
colnames(df_H) <- c("text", "spam")

# Spam documents with their class label
df_S <- as.data.frame(unlist(spamcorpus), stringsAsFactors = FALSE)
df_S$type <- "spam"
colnames(df_S) <- c("text", "spam")

# Combined data frame with a numeric label: 1 = spam, 0 = ham
df_HS <- rbind(df_H, df_S)
df_HS$spam <- ifelse(df_HS$spam == "spam", 1, 0)
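
A quick look at the class balance is worthwhile here, since naive Bayes builds its prior probabilities directly from these proportions:

# Class balance of the combined data: 0 = ham, 1 = spam
table(df_HS$spam)
prop.table(table(df_HS$spam))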

Training and Testing Data Frames for Prediction

# Add a row id so the held-out rows can be recovered with anti_join()
df_HS$id <- 1:nrow(df_HS)

# Use 70% of the dataset as the training set and the remaining 30% as the testing set
train <- df_HS %>% dplyr::sample_frac(0.7)
test  <- dplyr::anti_join(df_HS, train, by = "id")
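
Because sample_frac() samples at random, the exact rows in each split change from run to run; setting a seed beforehand (the value here is arbitrary) makes the split reproducible, and the row counts confirm the 70/30 proportions:

# Reproducible version of the same 70/30 split; the seed value is arbitrary
set.seed(607)
train <- df_HS %>% dplyr::sample_frac(0.7)
test  <- dplyr::anti_join(df_HS, train, by = "id")

# Confirm the split proportions
nrow(train) / nrow(df_HS)
nrow(test) / nrow(df_HS)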

Corpus and Classifier Creation

# Build document-term matrices for the training and testing text
train_corpus <- Corpus(VectorSource(train$text))
test_corpus <- Corpus(VectorSource(test$text))
train_tdm <- DocumentTermMatrix(train_corpus)
test_tdm <- DocumentTermMatrix(test_corpus)

# Keep only terms appearing at least 50 times in the training data,
# and rebuild both matrices from that shared training vocabulary
freq <- findFreqTerms(train_tdm, 50)
train_tdm_2 <- DocumentTermMatrix(train_corpus, control = list(dictionary = freq))
test_tdm_2 <- DocumentTermMatrix(test_corpus, control = list(dictionary = freq))

# Convert to plain data frames for the classifier
train_tdm_3 <- as.data.frame(as.matrix(train_tdm_2))
test_tdm_3 <- as.data.frame(as.matrix(test_tdm_2))
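
Before training, it is useful to see how much the 50-occurrence cutoff shrank the vocabulary that both matrices now share:

# How many and which terms survived the frequency cutoff
length(freq)
head(freq)

# Both data frames have one column per retained term
dim(train_tdm_3)
dim(test_tdm_3)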

# Prediction

# Train a naive Bayes classifier on the training document-term matrix
nb_model <- naive_bayes(train_tdm_3, factor(train$spam))

# Test the model on the held-out documents and compare against their true labels
test_pred <- predict(nb_model, newdata = test_tdm_3)
table(predicted = test_pred, actual = test$spam)
##          actual
## predicted      0      1
##         0  27718    242
##         1 101883    998
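
Raw counts at this scale are hard to interpret, so summary rates help. Below is a short sketch computing accuracy, precision, and recall from the predictions and the true labels, treating spam (1) as the positive class; the intermediate names (cm, accuracy, precision, recall) are just for illustration.

# Confusion matrix against the true labels, plus summary rates
actual <- factor(test$spam)
cm <- table(predicted = test_pred, actual = actual)

accuracy  <- sum(diag(cm)) / sum(cm)
precision <- cm["1", "1"] / sum(cm["1", ])  # of the messages predicted spam, how many were spam
recall    <- cm["1", "1"] / sum(cm[, "1"])  # of the actual spam, how much was caught
c(accuracy = accuracy, precision = precision, recall = recall)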

Visualization

# Word cloud of the 100 most frequent terms across the combined ham and spam corpus
wordcloud(hamspamcorpus, max.words = 100, random.order = FALSE, rot.per = 0.15, min.freq = 5, colors = brewer.pal(8, "Dark2"))
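
A single cloud mixes both classes, so plotting ham and spam separately makes the class-specific vocabulary easier to compare; a sketch reusing the two cleaned corpora side by side:

# Separate clouds for ham and spam to compare their vocabularies
par(mfrow = c(1, 2))
wordcloud(hamcorpus, max.words = 50, random.order = FALSE, colors = brewer.pal(8, "Dark2"))
wordcloud(spamcorpus, max.words = 50, random.order = FALSE, colors = brewer.pal(8, "Dark2"))
par(mfrow = c(1, 1))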

Conclusion

This was a very challenging project: getting the classification right requires real experience and practice with text corpora and with building and reading confusion matrices. It also involves a large amount of data, so processing takes quite a while. Overall, the model is not the strongest predictor I have built, but it does walk through the full classification workflow from raw messages to evaluation.