Problem Statement

It can be useful to be able to classify new “test” documents using already classified “training” documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.

For this project, you can start with a spam/ham dataset, then predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder). One example corpus: https://spamassassin.apache.org/old/publiccorpus/

Getting Started

# loading required packages 
library(tm)
## Loading required package: NLP
library(SnowballC)
library(stringr)
library(knitr)
library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidytext)
library(caret)
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
## Loading required package: lattice
library(e1071)
library(gbm)
## Loaded gbm 2.1.8.1

Loading files

# Load the ham folder bz2 archive
download.file(url = "http://spamassassin.apache.org/old/publiccorpus/20030228_easy_ham.tar.bz2", destfile = "20030228_easy_ham.tar.bz2")
untar("20021010_easy_ham.tar.bz2", exdir = "project4", compressed = "bzip2")
## Warning: untar(compressed=) is deprecated
ham.dir="DATA607Project4\\easy_ham\\"
ham_files = list.files(path = ham.dir,full.names = TRUE)

# Load the spam folder bz2 archive
download.file(url = "http://spamassassin.apache.org/old/publiccorpus/20050311_spam_2.tar.bz2", destfile = "20050311_spam_2.tar.bz2")
untar("20050311_spam_2.tar.bz2", exdir = "project4", compressed = "bzip2")
## Warning: untar(compressed=) is deprecated

Creating Dataframe

Define directory paths for the easy_ham and spam_2 folders, using forward slashes ('/') within the strings, and create a data frame of the email content tagged as 'ham' or 'spam'. The 'ham' and 'spam' data frames are then bound together, and a table of the tag counts in the combined data frame shows 2501 ham and 1397 spam messages.

Machine learning-based text classification has two phases: training and prediction.

Training phase: a supervised machine learning algorithm is fit to the labeled input dataset. At the end of this process, we have a trained model that can be used to obtain predictions (labels) on new and unseen data.

Prediction phase: once the machine learning model is trained, it can be used to predict labels for new and unseen data. In practice this is usually done by deploying the best model from the training phase as an API on a server.
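
The two phases can be sketched with a tiny, self-contained example; the iris data here is only a stand-in for the email features built later in this project (naiveBayes() comes from the e1071 package loaded below):

set.seed(1)
idx <- sample(nrow(iris), 100)                       # labeled training rows
model <- naiveBayes(Species ~ ., data = iris[idx, ]) # training phase: fit on labeled data
head(predict(model, newdata = iris[-idx, ]))         # prediction phase: label unseen rows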

spam_dir <- "DATA607Project4/spam_2"
ham_dir <- "DATA607Project4/easy_ham"

# Read every file in a folder into a data frame of email text plus a class tag
df <- function(path, tag){
  files <- list.files(path = path, full.names = TRUE, recursive = TRUE)
  # Read each email file and collapse its lines into a single string
  Email <- lapply(files, function(x) {
    body <- readLines(x)
    body <- paste(body, collapse = "\n")
    return(body)
  })
  Email <- unlist(Email)
  data <- as.data.frame(Email)
  data$tag <- tag
  return(data)
}

ham_df <- df(ham_dir, tag = "ham") 
spam_df <- df(spam_dir, tag = "spam")
document_df <- rbind(ham_df, spam_df)
table(document_df$tag)
## 
##  ham spam 
## 2501 1397

Cleaning Data and Text Processing with Tidytext

Preprocessing text data is an important step in any natural language processing task. It helps in cleaning and preparing the text data for further processing or analysis.

A text preprocessing pipeline is a series of processing steps that are applied to raw text data in order to prepare it for use in natural language processing tasks.

The steps in a text preprocessing pipeline can vary, but they typically include tasks such as tokenization, stop word removal, stemming, and lemmatization. These steps help reduce the size of the text data and also improve the accuracy of NLP tasks such as text classification and information extraction.
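
For example, stop word removal and stemming can be tried directly with tm and SnowballC (both loaded above); the words below are made up for illustration:

words <- c("the", "runners", "were", "running", "quickly")
kept <- words[!words %in% stopwords("english")]  # stop word removal drops "the", "were"
wordStem(kept, language = "en")                  # stemming reduces each word to its stem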

Text data is difficult to process because it is unstructured and often contains a lot of noise. This noise can be in the form of misspellings, grammatical errors, and non-standard formatting. A text preprocessing pipeline aims to clean up this noise so that the text data can be more easily analyzed.

Figure illustrates the Text Classification Pipeline. https://www.datacamp.com/tutorial/text-classification-python

Clean and process the document_df data frame containing the email text. The 'Email' column is split into paragraphs with the unnest_tokens() function from tidytext, and the resulting paragraphs are stored in a column named 'text'. Common stop words (such as 'and', 'the', 'is') are then removed by an anti-join against the stop_words dataset, keeping only the rows that do not match a stop word.

document_df<-document_df %>%
  mutate(Email = str_remove_all(Email, pattern = "<.*?>")) %>%  #Remove HTML tags using"<.*?>"
  mutate(Email = str_remove_all(Email, pattern = "[:digit:]")) %>% #Remove digits/numbers using [:digit:]
  mutate(Email = str_remove_all(Email, pattern = "[:punct:]")) %>% # Remove punctuation using [:punct:]
  mutate(Email = str_remove_all(Email, pattern = "[\n]")) %>% # Remove Newline Characters using [\n]
  mutate(Email = str_to_lower(Email)) %>%  # Converts the 'email' column to lowercase
  
  #Tokenization
  unnest_tokens(output=text,input=Email,
                token="paragraphs",
                format="text") %>%
  #Remove Stop Words
  anti_join(stop_words, by=c("text"="word"))

Document Term Matrix, Corpus

The corpus is represented as a matrix of weighted term counts. In a Term Document Matrix (TDM), rows correspond to the terms and columns to the documents in the corpus; in a Document Term Matrix (DTM), the orientation is transposed, with rows corresponding to documents and columns to terms. In both cases the cells hold the weights of the terms.
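
A toy corpus makes the two orientations concrete (the documents here are made up; dim() returns rows then columns):

toy <- VCorpus(VectorSource(c("spam spam offer", "meeting agenda")))
dim(DocumentTermMatrix(toy))  # documents x terms
dim(TermDocumentMatrix(toy))  # terms x documents (the transpose)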

set.seed(7614)
shuffled <- sample(nrow(document_df))
document_df<-document_df[shuffled,]
document_df$tag <- as.factor(document_df$tag)

v_corp <- VCorpus(VectorSource(document_df$text))
v_corp <- tm_map(v_corp, content_transformer(stringi::stri_trans_tolower))
v_corp <- tm_map(v_corp, removeNumbers)
v_corp <- tm_map(v_corp, removePunctuation)
v_corp <- tm_map(v_corp, stripWhitespace)
v_corp <- tm_map(v_corp, removeWords, stopwords("english"))
v_corp <- tm_map(v_corp, stemDocument)

corpus_dtm <- DocumentTermMatrix(v_corp, control =
                                 list(stemming = TRUE))
corpus_dtm <- removeSparseTerms(corpus_dtm, 0.999)
inspect(corpus_dtm[1:10, 1:10])
## <<DocumentTermMatrix (documents: 10, terms: 10)>>
## Non-/sparse entries: 0/100
## Sparsity           : 100%
## Maximal term length: 9
## Weighting          : term frequency (tf)
## Sample             :
##     Terms
## Docs aaa aaaa aab aac aae aaf aaff aalib aall aapplecom
##   1    0    0   0   0   0   0    0     0    0         0
##   2    0    0   0   0   0   0    0     0    0         0
##   3    0    0   0   0   0   0    0     0    0         0
##   4    0    0   0   0   0   0    0     0    0         0
##   5    0    0   0   0   0   0    0     0    0         0
##   6    0    0   0   0   0   0    0     0    0         0
##   7    0    0   0   0   0   0    0     0    0         0
##   8    0    0   0   0   0   0    0     0    0         0
##   9    0    0   0   0   0   0    0     0    0         0
##   10   0    0   0   0   0   0    0     0    0         0
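A note on the sparsity threshold: removeSparseTerms(corpus_dtm, 0.999) drops any term that is absent from more than 99.9% of documents, i.e. it keeps only terms appearing in at least roughly 0.1% of the corpus, which sharply reduces the number of columns the classifier must handle.
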
convert_count <- function(x) {
  y <- ifelse(x > 0, 1,0)
  y <- factor(y, levels=c(0,1), labels=c(0,1))
  y
}
tmp <- apply(corpus_dtm, 2, convert_count)

df_matrix = as.data.frame(as.matrix(tmp))

df_matrix$class <- document_df$tag   # attach the ham/spam labels as the class column
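
convert_count() binarizes the term counts: any positive count becomes 1 (term present) and zero stays 0 (term absent), turning the matrix into Bernoulli-style categorical features that suit naiveBayes(). A quick check of the helper on a toy vector:

convert_count(c(0, 3, 1))
## [1] 0 1 1
## Levels: 0 1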

Splitting Data into Training and Testing

The training data frame takes 70% of the data, leaving 30% for testing. The createDataPartition() function is used to create a stratified series of test/training partitions.

set.seed(9999)  
prediction <- createDataPartition(df_matrix$class, p=.7, list = FALSE, times = 1)
head(prediction)
##      Resample1
## [1,]         1
## [2,]         2
## [3,]         3
## [4,]         4
## [5,]         6
## [6,]         7
training <- df_matrix[prediction,]
testing <- df_matrix[-prediction,]

ML-based Classification Model: Naive Bayes

A text classification model such as Naive Bayes is powerful and widely used in NLP: with the help of supervised machine learning, it can automatically categorize, or predict the class of, unseen text documents.
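
Concretely, Naive Bayes applies Bayes' theorem under a conditional-independence assumption over the word features: P(spam | w1, ..., wn) is proportional to P(spam) * P(w1 | spam) * ... * P(wn | spam), and the predicted class is the one with the larger posterior.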

# Exclude the class label from the predictors when training
classifier <- naiveBayes(training[, names(training) != "class"], factor(training$class))
test_prediction <- predict(classifier, newdata = testing[, names(testing) != "class"])

confusionMatrix(table(test_prediction, testing$class))
## Confusion Matrix and Statistics
## 
##                
## test_prediction ham spam
##            ham  760    0
##            spam   0  408
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9968, 1)
##     No Information Rate : 0.6507     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.6507     
##          Detection Rate : 0.6507     
##    Detection Prevalence : 0.6507     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : ham        
## 

Conclusions

We can see from the confusion matrix above that the Naive Bayes model classified zero ham messages as spam and zero spam messages as ham, for an error rate of zero in both directions. The model shows 100% accuracy on the test data frame.
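
As a final illustration of the prediction phase, a genuinely new message can be scored by pushing it through the same preprocessing and restricting its document-term matrix to the training vocabulary. A minimal sketch, assuming corpus_dtm, convert_count(), and classifier from above are still in scope (new_email is a made-up example):

new_email <- "Limited time offer, click now to claim your free prize"
new_corp <- VCorpus(VectorSource(new_email))
new_corp <- tm_map(new_corp, content_transformer(tolower))
new_corp <- tm_map(new_corp, removeNumbers)
new_corp <- tm_map(new_corp, removePunctuation)
new_corp <- tm_map(new_corp, stripWhitespace)
new_corp <- tm_map(new_corp, removeWords, stopwords("english"))
new_corp <- tm_map(new_corp, stemDocument)
# Keep only columns from the training vocabulary so the features line up
new_dtm <- DocumentTermMatrix(new_corp, control = list(dictionary = Terms(corpus_dtm)))
new_features <- as.data.frame(t(as.matrix(apply(new_dtm, 2, convert_count))))
predict(classifier, newdata = new_features)  # predicted tag for the new message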