PROJECT 4: Document Classification

Task

It can be useful to be able to classify new “test” documents using already classified “training” documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.

For this project, you can start with a spam/ham dataset, then predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder). One example corpus: https://spamassassin.apache.org/old/publiccorpus/

Load Packages

library(tm)
library(readr)
library(tidyr)
library(dplyr)
library(stringr)
library(tidytext)
library(caret)

Load Data and Tidy

First, we load the files from the spam and ham folders into R and store the text of each email as one row of a data frame. There are 2551 ham and 1398 spam messages.

# Directories holding the spam and ham messages
spam_dir <- "/Users/blessinga/Desktop/Masters Data Science/607 /Project 4/spam_2"
ham_dir <-  "/Users/blessinga/Desktop/Masters Data Science/607 /Project 4/easy_ham"

# Read every file under `path` into a data frame with one row per message,
# labelled with `tag`
sh_df <- function(path, tag) {
  files <- list.files(path = path, full.names = TRUE, recursive = TRUE)
  email <- unlist(lapply(files, read_file))
  data <- as.data.frame(email)
  data$tag <- tag
  return(data)
}

ham_df_file <- sh_df(ham_dir, tag="ham") 
spam_df_file <- sh_df(spam_dir, tag="spam")
ham_spam_df <- rbind(ham_df_file, spam_df_file)
table(ham_spam_df$tag)

## 
##  ham spam 
## 2551 1398

Process and Prepare Data

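Next, we strip text between angle brackets (such as HTML tags), digits, punctuation, and line breaks from each message and convert the text to lowercase.

ham_spam_df <- ham_spam_df %>%
  mutate(email = str_remove_all(email, pattern = "<.*?>")) %>%     # <...> spans, e.g. HTML tags
  mutate(email = str_remove_all(email, pattern = "[:digit:]")) %>% # digits
  mutate(email = str_remove_all(email, pattern = "[:punct:]")) %>% # punctuation
  mutate(email = str_remove_all(email, pattern = "[\n]")) %>%      # line breaks
  mutate(email = str_to_lower(email)) %>%
  # With line breaks removed, each message forms a single "paragraph",
  # so this step keeps one row per message and stores it in `text`
  unnest_tokens(output = text, input = email,
                token = "paragraphs",
                format = "text") %>%
  # Drops only rows whose entire text is a single stop word
  anti_join(stop_words, by = c("text" = "word"))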
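To make the effect of these cleaning steps concrete, here is a minimal sketch that applies the same stringr patterns, in the same order, to a single message. The sample string is made up for illustration and is not taken from the corpus.

```r
library(stringr)

# Hypothetical sample message, for illustration only
raw <- "<html><b>WIN</b> 1000 dollars now!!!\nClick here</html>"

clean <- raw %>%
  str_remove_all("<.*?>") %>%       # drop text between angle brackets
  str_remove_all("[:digit:]") %>%   # drop digits
  str_remove_all("[:punct:]") %>%   # drop punctuation
  str_remove_all("[\n]") %>%        # drop line breaks
  str_to_lower()

print(clean)
```

In this sketch, deleting the line break joins "now" and "Click" into a single token. Replacing line breaks with a space instead of removing them would keep such words apart.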

Corpus and Document-Term Matrix

Next, we shuffle the rows of the data frame so that spam and ham emails are interleaved, and convert the tag column to a factor.

set.seed(7614)
shuffled <- sample(nrow(ham_spam_df))
ham_spam_df<-ham_spam_df[shuffled,]
ham_spam_df$tag <- as.factor(ham_spam_df$tag)

We then clean the corpus with tm_map(): we remove numbers, punctuation, extra whitespace, and English stop words, stem the remaining words, and convert everything to lowercase.

n_corp <- VCorpus(VectorSource(ham_spam_df$text))
n_corp <- tm_map(n_corp, removeNumbers)
n_corp <- tm_map(n_corp, removePunctuation)
n_corp <- tm_map(n_corp, stripWhitespace)
n_corp <- tm_map(n_corp, removeWords, stopwords("english")) 
n_corp <- tm_map(n_corp, stemDocument)
n_corp <- tm_map(n_corp, content_transformer(stringi::stri_trans_tolower))

We then use the following functions:
DocumentTermMatrix() constructs a document-term matrix from the corpus,
and removeSparseTerms() removes terms that appear in very few documents.

ham_spam_tm <- DocumentTermMatrix(n_corp, control = list(stemming = TRUE))
ham_spam_tm <- removeSparseTerms(ham_spam_tm, 0.999)
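As a rough illustration of what the sparsity threshold does, the sketch below builds a document-term matrix from four made-up documents and drops the rarer terms. The documents and the 0.6 threshold are hypothetical and chosen only to keep the example small. A term is kept when the share of documents that do not contain it is below the threshold, so with 3949 documents and a threshold of 0.999 a term would need to appear in roughly four or more documents to survive.

```r
library(tm)

# Hypothetical toy documents, for illustration only
docs <- c("apple banana", "apple cherry", "apple banana", "apple date")
dtm  <- DocumentTermMatrix(VCorpus(VectorSource(docs)))

# "apple" is in 4 of 4 documents, "banana" in 2, "cherry" and "date" in 1 each.
# With sparse = 0.6, terms missing from 60% or more of the documents are dropped.
small <- removeSparseTerms(dtm, 0.6)

Terms(dtm)    # all terms
Terms(small)  # terms that survive the threshold
```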

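# Recode term counts as presence/absence: 1 if the term occurs, 0 otherwise
co_count <- function(x) {
  y <- ifelse(x > 0, 1, 0)
  y <- factor(y, levels = c(0, 1), labels = c(0, 1))
  y
}

tmp_df <- apply(ham_spam_tm, 2, co_count)

spam_ham_matrix <- as.data.frame(as.matrix(tmp_df))

# Caution: this line assigns the column to itself, so it changes nothing.
# The class column printed below appears to be the 0/1 indicator for the
# term "class", not the ham/spam label, which is stored in ham_spam_df$tag.
spam_ham_matrix$class <- spam_ham_matrix$class
str(spam_ham_matrix$class)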
str(spam_ham_matrix$class)

##  chr [1:3949] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" ...
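The step of turning counts into presence/absence indicators and attaching the ham/spam label can be sketched on a tiny made-up corpus. This is an illustration under stated assumptions, not the code that produced the output above: docs, labels, and ind are hypothetical names, and the label is stored in a column called tag so that it cannot collide with a term column that happens to be named class.

```r
library(tm)

# Hypothetical toy documents and labels, for illustration only
docs   <- c("free money class", "meeting class notes", "free prize money")
labels <- factor(c("spam", "ham", "spam"))

dtm <- DocumentTermMatrix(VCorpus(VectorSource(docs)))

# Convert counts to presence/absence (1 if the term occurs, else 0)
m   <- as.matrix(dtm)
ind <- as.data.frame((m > 0) * 1L)

# Attach the label under a name that is not also a term
ind$tag <- labels

str(ind)
```

Here the column class holds the indicator for the word "class", while tag holds the label, so the two remain distinct.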

Prediction (and Results)

We use the caret package to split the data: 70% of the rows go to the training set and the remaining 30% are held out for testing.
The function createDataPartition() samples the row indices of the training set, stratified by the vector it is given.

set.seed(7312)  
pred <- createDataPartition(spam_ham_matrix$class, p=.7, list = FALSE, times = 1)
head(pred)

##      Resample1
## [1,]         1
## [2,]         2
## [3,]         5
## [4,]         6
## [5,]         7
## [6,]        10

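# Note: the split is applied to ham_spam_df (columns tag and text),
# not to the term indicator data frame spam_ham_matrix
training <- ham_spam_df[pred,]
testing <- ham_spam_df[-pred,]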
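For comparison, a split stratified on the ham/spam label itself could look like the sketch below. It is a minimal illustration that assumes a made-up label vector with the same class counts as the table above. The names tag, idx, train_tag, and test_tag are hypothetical.

```r
library(caret)

set.seed(1)
# Hypothetical labels with the class counts reported earlier (2551 ham, 1398 spam)
tag <- factor(rep(c("ham", "spam"), times = c(2551, 1398)))

# Sample about 70% of the rows within each class; take the single index column
idx <- createDataPartition(tag, p = 0.7, list = FALSE)[, 1]

train_tag <- tag[idx]
test_tag  <- tag[-idx]

# Class proportions are approximately preserved in both subsets
prop.table(table(train_tag))
prop.table(table(test_tag))
```

Stratifying on the label keeps the spam share nearly identical in the training and test sets, which makes the test-set metrics easier to interpret.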

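We use the random forest algorithm, through the randomForest() function, with 500 trees.
The confusion matrix below reports an accuracy of 1 on the test data frame, with a 95% confidence interval of (0.9969, 1).
Note that the predictors passed as x include the tag column itself, so this figure reflects the label being available to the model rather than performance on the text alone.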
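As a minimal sketch of this kind of call, assuming simulated binary term indicators rather than the real corpus, the block below keeps the predictors and the label in separate objects and passes only the predictors as x. The names feat, tag, and train, and the simulated probabilities, are hypothetical.

```r
library(randomForest)

set.seed(1)
# Hypothetical binary term indicators, for illustration only
n   <- 200
tag <- factor(rep(c("ham", "spam"), each = n / 2))
feat <- data.frame(
  free    = rbinom(n, 1, ifelse(tag == "spam", 0.8, 0.1)),
  meeting = rbinom(n, 1, ifelse(tag == "spam", 0.1, 0.7)),
  click   = rbinom(n, 1, ifelse(tag == "spam", 0.6, 0.2))
)

# Random 70/30 split of the row indices
train <- sample(n, round(0.7 * n))

# x holds only term features; the label is passed separately as y
fit  <- randomForest(x = feat[train, ], y = tag[train], ntree = 500)
pred <- predict(fit, newdata = feat[-train, ])

table(predicted = pred, actual = tag[-train])
mean(pred == tag[-train])  # test-set accuracy
```

Keeping the label out of x means the reported accuracy depends only on the term features.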

library(randomForest)

classifier <-  randomForest(x = training, y = training$tag, ntree = 500) 
predicted <-  predict(classifier, newdata = testing)

confusionMatrix(table(predicted,testing$tag))

## Confusion Matrix and Statistics
## 
##          
## predicted ham spam
##      ham  737    0
##      spam   0  447
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9969, 1)
##     No Information Rate : 0.6225     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.6225     
##          Detection Rate : 0.6225     
##    Detection Prevalence : 0.6225     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : ham        
##
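The headline statistics above can be reproduced from the four cell counts with base R. The sketch below is illustrative only: the matrix cm is typed in by hand from the printed table, and it assumes the interval reported by confusionMatrix() is the exact binomial interval computed by binom.test().

```r
# Cell counts copied by hand from the printed confusion matrix
cm <- matrix(c(737, 0, 0, 447), nrow = 2,
             dimnames = list(predicted = c("ham", "spam"),
                             actual    = c("ham", "spam")))

accuracy <- sum(diag(cm)) / sum(cm)      # share of correct predictions
nir      <- max(colSums(cm)) / sum(cm)   # no-information rate: majority-class share
ci       <- binom.test(sum(diag(cm)), sum(cm))$conf.int

round(c(accuracy = accuracy, nir = nir,
        ci_lower = ci[1], ci_upper = ci[2]), 4)
```

The value 0.9969 is the lower end of the confidence interval, not the accuracy itself. The no-information rate of 0.6225 is the accuracy a model would reach by always predicting ham.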

Conclusion

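In conclusion, a random forest with 500 trees was used as the classifier, and it classified all 1184 test messages correctly (737 ham and 447 spam), for a reported accuracy of 1 with a 95% confidence interval of (0.9969, 1). Because the tag column was among the predictors, this result should not be read as the accuracy achievable from the message text alone. Ham is the majority class in the test set (prevalence 0.6225), so absent other evidence a new document is more likely to be ham than spam.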

PROJECT 4: Document Classification - 607

Blessing Anoroh

April 25, 2024