It can be useful to be able to classify new “test” documents using already classified “training” documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.
For this project, you can start with a spam/ham dataset, then predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder). One example corpus: https://spamassassin.apache.org/old/publiccorpus/
We will use the download.file() and untar() functions from the utils package to download and extract the spam and ham archives from the link above, so the data is available to anyone who runs the code below for document classification.
#load the packages used throughout the analysis
library(tidyverse)    #readr::read_file, stringr, dplyr
library(tidytext)     #unnest_tokens, stop_words
library(tm)           #VCorpus, tm_map, DocumentTermMatrix
library(caret)        #createDataPartition, confusionMatrix
library(randomForest)

#load ham folder bz2 archive; destfile is the name under which the downloaded file is saved
download.file(url = "http://spamassassin.apache.org/old/publiccorpus/20021010_easy_ham.tar.bz2", destfile = "20021010_easy_ham.tar.bz2")
#extract the tar archive; exdir is the directory to extract files to, compressed selects the form of compression
untar("20021010_easy_ham.tar.bz2", exdir = "project4", compressed = "bzip2")
#load spam folder bz2 archive
download.file(url = "http://spamassassin.apache.org/old/publiccorpus/20050311_spam_2.tar.bz2", destfile = "20050311_spam_2.tar.bz2")
untar("20050311_spam_2.tar.bz2", exdir = "project4", compressed = "bzip2")
First, we collect all the files from the spam and ham folders. Then we read the text of each email and assemble a data frame in which each row holds one email's text, with an additional column "tag" marking whether the email is spam or ham. There are 2551 ham and 1397 spam messages.
spam_dir = "C:/Users/daria/Documents/project4/spam_2/"
ham_dir = "C:/Users/daria/Documents/project4/easy_ham/"
to_df <- function(path, tag){
  files <- list.files(path = path,
                      full.names = TRUE,
                      recursive = TRUE)
  email <- lapply(files, function(x) {
    body <- read_file(x)
  })
  email <- unlist(email)
  data <- as.data.frame(email)
  data$tag <- tag
  return(data)
}
ham_df <- to_df(ham_dir, tag="ham")
spam_df <- to_df(spam_dir, tag="spam")
df <- rbind(ham_df, spam_df)
table(df$tag)
##
## ham spam
## 2551 1397
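As an illustrative check of the structure, we can look at the dimensions of the combined data frame and the start of one email body:
#dimensions of the combined data frame and the first 200 characters of one email
dim(df)
substr(df$email[1], 1, 200)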
Next, we strip the HTML tags, digits, punctuation, and line breaks that we do not need, convert the text to lower case, tokenize the emails into paragraphs, and drop any tokens that match stop words, keeping the cleaned body of each email.
df <- df %>%
  mutate(email = str_remove_all(email, pattern = "<.*?>")) %>%
  mutate(email = str_remove_all(email, pattern = "[:digit:]")) %>%
  mutate(email = str_remove_all(email, pattern = "[:punct:]")) %>%
  mutate(email = str_remove_all(email, pattern = "[\n]")) %>%
  mutate(email = str_to_lower(email)) %>%
  unnest_tokens(output = text, input = email,
                token = "paragraphs",
                format = "text") %>%
  anti_join(stop_words, by = c("text" = "word"))
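To illustrate what the main cleaning patterns remove, here is the same chain of str_remove_all() calls applied to a small made-up string (the string is hypothetical, not taken from the corpus):
example <- "<p>Buy NOW!!! 50% off</p> Click here"
example %>%
  str_remove_all(pattern = "<.*?>") %>%    #drops the html tags
  str_remove_all(pattern = "[:digit:]") %>%    #drops the digits
  str_remove_all(pattern = "[:punct:]") %>%    #drops the punctuation
  str_to_lower()    #lower cases whatever is left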
We need to shuffle the spam and ham emails in the data frame.
set.seed(7614)
shuffled <- sample(nrow(df))
df<-df[shuffled,]
df$tag <- as.factor(df$tag)
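A quick check that the rows are now interleaved rather than all ham followed by all spam:
#the first few tags should be a mix of ham and spam after shuffling
head(df$tag, 10)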
We build a corpus of messages with VCorpus() and then apply transformations with tm_map(): lower-casing the text, removing numbers, punctuation, extra white space, and English stop words, and stemming the remaining words.
v_corp <- VCorpus(VectorSource(df$text))
v_corp <- tm_map(v_corp, content_transformer(stringi::stri_trans_tolower))
v_corp <- tm_map(v_corp, removeNumbers)
v_corp <- tm_map(v_corp, removePunctuation)
v_corp <- tm_map(v_corp, stripWhitespace)
v_corp <- tm_map(v_corp, removeWords, stopwords("english"))
v_corp <- tm_map(v_corp, stemDocument)
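To verify the transformations, we can print the cleaned, stemmed text of a couple of documents (which documents appear first depends on the shuffle above, so this is only an illustrative check):
#show the cleaned, stemmed text of the first two documents in the corpus
lapply(v_corp[1:2], as.character)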
Using DocumentTermMatrix() we construct a document-term matrix from the corpus and remove rarely occurring terms with removeSparseTerms(). After that, we convert the term counts into 0/1 indicators (term absent/present) and turn the matrix back into a data frame.
dtm <- DocumentTermMatrix(v_corp, control = list(stemming = TRUE))
dtm <- removeSparseTerms(dtm, 0.999)
convert_count <- function(x) {
  y <- ifelse(x > 0, 1, 0)
  y <- factor(y, levels = c(0, 1), labels = c(0, 1))
  y
}
tmp <- apply(dtm, 2, convert_count)
df_matrix = as.data.frame(as.matrix(tmp))
df_matrix$class = df_matrix$class
str(df_matrix$class)
## chr [1:3948] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" ...
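A quick look at the resulting dimensions (the exact number of retained terms depends on the 0.999 sparsity threshold):
#documents x terms retained after removing sparse terms
dim(dtm)
dim(df_matrix)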
The training set takes 70% of the data and the remaining 30% is held out for testing. The createDataPartition() function is used to create a series of test/training partition indices.
set.seed(7316)
prediction <- createDataPartition(df_matrix$class, p=.7, list = FALSE, times = 1)
head(prediction)
## Resample1
## [1,] 1
## [2,] 2
## [3,] 3
## [4,] 4
## [5,] 5
## [6,] 6
training <- df[prediction,]
testing <- df[-prediction,]
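The split should keep the ham/spam proportions roughly similar in both sets, which we can check with:
#class proportions in the training and testing sets
prop.table(table(training$tag))
prop.table(table(testing$tag))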
We will try the randomForest classifier with 400 trees. It implements Breiman's random forest algorithm for classification and regression. A random forest averages multiple deep decision trees, each trained on a different part of the same training set, which helps overcome the over-fitting problem of an individual decision tree.
The model shows 100% accuracy on the test data frame, with a 95% confidence interval whose lower bound is 99.69%.
classifier <- randomForest(x = training, y = training$tag, ntree = 400)
predicted <- predict(classifier, newdata = testing)
confusionMatrix(table(predicted,testing$tag))
## Confusion Matrix and Statistics
##
##
## predicted ham spam
## ham 783 0
## spam 0 400
##
## Accuracy : 1
## 95% CI : (0.9969, 1)
## No Information Rate : 0.6619
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.6619
## Detection Rate : 0.6619
## Detection Prevalence : 0.6619
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : ham
##
Detecting spam messages requires a lot of work. The project above helped to learn how to work with tar archives, how to read the text from emails and transform it into a data frame, and, most importantly, how to use that data frame to train a model for predicting spam. Working with the corpus was the biggest challenge and took most of the time. 70% of the data was used to train the model and 30% to test it. Random forest was used as the classifier and achieved 100% accuracy on the held-out test set. There are other models to try that may work as well or better (boosting, Naive Bayes, etc.).
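As one possible follow-up, here is a minimal sketch of how one of those alternatives, Naive Bayes, could be tried on the same binary document-term features. It assumes the e1071 package, reuses the partition indices from above, and the object names (nb_train, nb_model, etc.) are only for illustration; it is a sketch rather than part of the analysis above.
library(e1071)

#reuse the same train/test split on the binary document-term data frame
nb_train <- df_matrix[prediction, ]
nb_test <- df_matrix[-prediction, ]

#fit Naive Bayes on the 0/1 term indicators, with the original tags as labels
nb_model <- naiveBayes(x = nb_train, y = df$tag[prediction])
nb_pred <- predict(nb_model, newdata = nb_test)

#evaluate on the held-out emails
confusionMatrix(table(nb_pred, df$tag[-prediction]))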