It can be useful to be able to classify new “test” documents using already classified “training” documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.
For this project, you can start with a spam/ham dataset, then predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder). One example corpus: https://spamassassin.apache.org/old/publiccorpus/
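library(tm)        # VCorpus, tm_map, DocumentTermMatrix
library(readr)     # read_file
library(tidyr)
library(dplyr)
library(stringr)   # str_remove_all, str_to_lower
library(tidytext)  # unnest_tokens, stop_words
library(caret)     # createDataPartition, confusionMatrix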
First, we load the files from the spam and ham folders into R and store the text of each email as one row of a data frame. There are 2551 ham and 1398 spam messages.
# Directories holding the spam and ham messages
spam_dir <- "/Users/blessinga/Desktop/Masters Data Science/607 /Project 4/spam_2"
ham_dir <- "/Users/blessinga/Desktop/Masters Data Science/607 /Project 4/easy_ham"
# Read every file under `path` into a data frame with one row per email,
# and label each row with `tag`
sh_df <- function(path, tag) {
  files <- list.files(path = path, full.names = TRUE, recursive = TRUE)
  email <- lapply(files, function(x) {
    read_file(x)
  })
  email <- unlist(email)
  data <- as.data.frame(email)
  data$tag <- tag
  return(data)
}
ham_df_file <- sh_df(ham_dir, tag="ham")
spam_df_file <- sh_df(spam_dir, tag="spam")
ham_spam_df <- rbind(ham_df_file, spam_df_file)
table(ham_spam_df$tag)
##
## ham spam
## 2551 1398
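The loading step can be tried without the real corpus. The following is a minimal base-R sketch that assumes a hypothetical two-folder layout created in a temporary directory; `load_emails()` and the sample messages are illustrative names and data, not part of the project, and `readLines()` stands in for `read_file()`.

```r
# Hypothetical stand-in for the corpus: two tiny folders in a temp directory.
root <- tempfile("corpus_")
dir.create(file.path(root, "ham"), recursive = TRUE)
dir.create(file.path(root, "spam"), recursive = TRUE)
writeLines("Meeting moved to 3pm.", file.path(root, "ham", "0001.txt"))
writeLines("Lunch tomorrow?", file.path(root, "ham", "0002.txt"))
writeLines("WIN a FREE prize now!!!", file.path(root, "spam", "0001.txt"))

# Read every file under `path` as one string and label it with `tag`.
load_emails <- function(path, tag) {
  files <- list.files(path, full.names = TRUE, recursive = TRUE)
  email <- vapply(
    files,
    function(f) paste(readLines(f, warn = FALSE), collapse = "\n"),
    character(1),
    USE.NAMES = FALSE
  )
  data.frame(email = email, tag = rep(tag, length(email)),
             stringsAsFactors = FALSE)
}

toy_df <- rbind(
  load_emails(file.path(root, "ham"), "ham"),
  load_emails(file.path(root, "spam"), "spam")
)
table(toy_df$tag)
```

Checking the class counts right after loading, as the report does with table(), is a cheap way to confirm that both folders were read.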
Next, we strip HTML tags, digits, punctuation, and newlines from each email and convert the text to lower case.
ham_spam_df <- ham_spam_df %>%
  mutate(email = str_remove_all(email, pattern = "<.*?>")) %>%      # HTML tags
  mutate(email = str_remove_all(email, pattern = "[:digit:]")) %>%  # digits
  mutate(email = str_remove_all(email, pattern = "[:punct:]")) %>%  # punctuation
  mutate(email = str_remove_all(email, pattern = "[\n]")) %>%       # newlines
  mutate(email = str_to_lower(email)) %>%
  # With newlines removed, each email stays a single "paragraph" row in `text`
  unnest_tokens(output = text, input = email,
                token = "paragraphs",
                format = "text") %>%
  # Drops only rows whose entire text equals a stop word
  anti_join(stop_words, by = c("text" = "word"))
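To see what these cleaning steps do to a single message, here is a small base-R sketch. `clean_text()` and the sample string are hypothetical, and the sketch deviates from the pipeline in one labeled way: it replaces newlines with a space instead of deleting them, so that words on adjacent lines are not glued together.

```r
# Hypothetical helper mirroring the cleaning steps with base-R regex functions.
clean_text <- function(x) {
  x <- gsub("<.*?>", "", x, perl = TRUE)  # drop HTML-style tags (non-greedy)
  x <- gsub("[[:digit:]]", "", x)         # drop digits
  x <- gsub("[[:punct:]]", "", x)         # drop punctuation
  x <- gsub("[[:space:]]+", " ", x)       # assumption: newlines become one space
  trimws(tolower(x))                      # lower-case and trim the ends
}

cleaned <- clean_text("<b>WIN</b> 1000 prizes, now!\nClick here.")
cleaned
# "win prizes now click here"
```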
Then we need to shuffle the spam and ham emails in the data frame.
set.seed(7614)
shuffled <- sample(nrow(ham_spam_df))
ham_spam_df <- ham_spam_df[shuffled, ]
ham_spam_df$tag <- as.factor(ham_spam_df$tag)
Then we clean the corpus with tm_map(): we remove numbers, punctuation, extra whitespace, and English stop words, stem the remaining words, and convert the text to lower case.
n_corp <- VCorpus(VectorSource(ham_spam_df$text))
n_corp <- tm_map(n_corp, removeNumbers)
n_corp <- tm_map(n_corp, removePunctuation)
n_corp <- tm_map(n_corp, stripWhitespace)
n_corp <- tm_map(n_corp, removeWords, stopwords("english"))
n_corp <- tm_map(n_corp, stemDocument)
n_corp <- tm_map(n_corp, content_transformer(stringi::stri_trans_tolower))
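Stop-word removal is the least obvious of these transformations, so here is a base-R sketch of the idea. The stop list below is a hypothetical six-word stand-in (the list returned by stopwords("english") is much longer), `remove_stop_words()` is an illustrative helper, and stemming is not shown.

```r
# Hypothetical mini stop list; the real English list is much longer.
stop_list <- c("the", "a", "to", "is", "and", "of")

# Split a document on whitespace, drop stop words, and glue the rest back.
remove_stop_words <- function(doc, stops) {
  tokens <- strsplit(doc, "[[:space:]]+")[[1]]
  kept <- tokens[nzchar(tokens) & !(tokens %in% stops)]
  paste(kept, collapse = " ")
}

filtered <- remove_stop_words("the offer is free and open to all", stop_list)
filtered
# "offer free open all"
```

Matching is exact and case-sensitive in this sketch, which is why the text should already be lower case before stop words are removed.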
Next, DocumentTermMatrix() constructs a document-term matrix from the corpus, and removeSparseTerms() removes sparse terms. The helper co_count() then converts each term count into a 0/1 indicator of whether the term appears in the email.
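# Build the document-term matrix (one row per email, one column per stemmed term)
ham_spam_tm <- DocumentTermMatrix(n_corp, control = list(stemming = TRUE))
# Drop terms whose sparsity exceeds 0.999, i.e. terms missing from more than 99.9% of emails
ham_spam_tm <- removeSparseTerms(ham_spam_tm, 0.999)
# Convert term counts to presence/absence indicators
co_count <- function(x) {
  y <- ifelse(x > 0, 1, 0)
  y <- factor(y, levels = c(0, 1), labels = c(0, 1))
  y
}
tmp_df <- apply(ham_spam_tm, 2, co_count)
spam_ham_matrix <- as.data.frame(as.matrix(tmp_df))
# Note: this self-assignment is a no-op. `class` is the indicator column for the
# term "class", not the spam/ham label, which lives in ham_spam_df$tag.
spam_ham_matrix$class <- spam_ham_matrix$class
str(spam_ham_matrix$class)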
## chr [1:3949] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" ...
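The idea behind the document-term matrix and the 0/1 conversion can be sketched in base R on three hypothetical documents. This is an illustration of the data structure under stated assumptions, not the tm implementation; the labels are kept in a separate vector named `labels` so they cannot collide with a term column such as class.

```r
# Hypothetical toy documents and their labels.
docs <- c("free prize now", "meeting notes attached", "free free offer now")
labels <- factor(c("spam", "ham", "spam"), levels = c("ham", "spam"))

tokens <- strsplit(docs, "[[:space:]]+")
vocab <- sort(unique(unlist(tokens)))

# Term counts: one row per document, one column per vocabulary term.
counts <- t(vapply(
  tokens,
  function(tk) as.integer(table(factor(tk, levels = vocab))),
  integer(length(vocab))
))
colnames(counts) <- vocab

# Presence/absence indicators: 1 if the term occurs at least once, else 0.
present <- (counts > 0) * 1L
features <- as.data.frame(present)

counts[3, "free"]   # 2: "free" occurs twice in the third document
present[3, "free"]  # 1: the indicator only records that it occurs
```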
We use the caret package to split the data: 70% (p = 0.7) goes to the training data frame and the remaining 30% is held out for testing.
The function createDataPartition() returns the row indices of a stratified random sample; here it stratifies on spam_ham_matrix$class.
set.seed(7312)
pred <- createDataPartition(spam_ham_matrix$class, p=.7, list = FALSE, times = 1)
head(pred)
## Resample1
## [1,] 1
## [2,] 2
## [3,] 5
## [4,] 6
## [5,] 7
## [6,] 10
training <- ham_spam_df[pred,]
testing <- ham_spam_df[-pred,]
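A stratified 70/30 split can be sketched in base R as sampling 70% of the rows within each class. The data below are simulated, `stratified_idx()` is a hypothetical helper, and caret's createDataPartition() may differ in detail. The sketch also keeps the predictors and the labels in separate objects, so the label cannot end up among the predictors by accident.

```r
set.seed(1)
# Simulated data: 20 ham and 10 spam rows with two hypothetical term indicators.
labels <- factor(rep(c("ham", "spam"), times = c(20, 10)))
features <- data.frame(term_a = rbinom(30, 1, 0.5),
                       term_b = rbinom(30, 1, 0.5))

# Sample a fraction p of the row indices within each class, then pool them.
stratified_idx <- function(y, p) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

train_idx <- stratified_idx(labels, p = 0.7)
x_train <- features[train_idx, , drop = FALSE]   # predictors only
y_train <- labels[train_idx]                     # labels kept separately
x_test <- features[-train_idx, , drop = FALSE]
y_test <- labels[-train_idx]

table(y_train)  # 14 ham and 7 spam: the 2:1 class ratio is preserved
```

With this layout, a call of the form randomForest(x = x_train, y = y_train) would see only term indicators as predictors.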
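We use the random forest algorithm, which handles both classification and regression, through the randomForest package.
The classifier is built with 500 trees.
On the test data frame the confusion matrix reports an accuracy of 1, with a 95% confidence interval of (0.9969, 1). Note that training still contains the tag column, so the label itself is among the predictors passed as x, which by itself can explain a perfect score.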
library(randomForest)
classifier <- randomForest(x = training, y = training$tag, ntree = 500)
predicted <- predict(classifier, newdata = testing)
confusionMatrix(table(predicted,testing$tag))
## Confusion Matrix and Statistics
##
##
## predicted ham spam
## ham 737 0
## spam 0 447
##
## Accuracy : 1
## 95% CI : (0.9969, 1)
## No Information Rate : 0.6225
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.6225
## Detection Rate : 0.6225
## Detection Prevalence : 0.6225
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : ham
##
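The headline statistics can be recomputed by hand from the four counts in the confusion matrix reported above. This base-R sketch re-enters those counts manually and treats ham as the positive class, as the output does; the variable names are illustrative.

```r
# Counts copied from the reported confusion matrix (rows: predicted, columns: actual).
cm <- matrix(c(737, 0, 0, 447), nrow = 2,
             dimnames = list(predicted = c("ham", "spam"),
                             actual = c("ham", "spam")))

accuracy <- sum(diag(cm)) / sum(cm)                   # correct / all
sensitivity <- cm["ham", "ham"] / sum(cm[, "ham"])    # ham found among actual ham
specificity <- cm["spam", "spam"] / sum(cm[, "spam"]) # spam found among actual spam
nir <- max(colSums(cm)) / sum(cm)                     # share of the largest class

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, no_information_rate = nir), 4)
```

The no-information rate of 737 / 1184 is the accuracy obtained by always predicting ham, which is the baseline the reported accuracy is tested against.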
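In conclusion, a random forest classifier with 500 trees classified every message in the test set correctly (accuracy of 1, 95% CI (0.9969, 1)). Because ham is the majority class, with a prevalence of 0.6225 in the test set, a new document is more likely to be predicted as ham than as spam.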