It can be useful to be able to classify new “test” documents using already classified “training” documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.
For this project, you can start with a spam/ham dataset, then predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder). One example corpus: https://spamassassin.apache.org/old/publiccorpus/
# loading required packages
library(tm)
## Loading required package: NLP
library(SnowballC)
library(stringr)
library(knitr)
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidytext)
library(caret)
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
## Loading required package: lattice
library(e1071)
library(gbm)
## Loaded gbm 2.1.8.1
# Download and extract the ham archive
download.file(url = "http://spamassassin.apache.org/old/publiccorpus/20030228_easy_ham.tar.bz2", destfile = "20030228_easy_ham.tar.bz2")
# Extract the file that was just downloaded; untar() detects the compression itself,
# so the deprecated compressed= argument is not needed
untar("20030228_easy_ham.tar.bz2", exdir = "DATA607Project4")
ham.dir <- "DATA607Project4/easy_ham"
ham_files <- list.files(path = ham.dir, full.names = TRUE)
# Download and extract the spam archive into the same project directory
download.file(url = "http://spamassassin.apache.org/old/publiccorpus/20050311_spam_2.tar.bz2", destfile = "20050311_spam_2.tar.bz2")
untar("20050311_spam_2.tar.bz2", exdir = "DATA607Project4")
The next code chunk defines the directory paths for the easy_ham and spam_2 folders, using forward slashes ‘/’ within the strings, and builds a data frame that holds the content of each email tagged as ‘ham’ or ‘spam’. It then binds the ‘ham’ and ‘spam’ data frames together and prints a table of the counts for each tag: the combined data frame contains 2501 ham and 1397 spam messages.
Machine learning-based text classification has two phases: training and prediction.
Training phase: A supervised machine learning algorithm is trained on the labeled input dataset. At the end of this process, we get a trained model that we can use to obtain predictions (labels) on new and unseen data.
Prediction phase: Once a machine learning model is trained, it can be used to predict labels on new and unseen data. This is usually done by deploying the best model from the training phase as an API on a server.
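The two phases can be sketched on toy data. This is a minimal illustration, not the model used in this report: the features `n_links` and `n_exclaim` and all values are made up, and saving the model with saveRDS() is only a stand-in for deploying it.

```r
library(e1071)

# Training phase: fit a Naive Bayes model on a small labeled data set
# (hypothetical numeric features, invented for illustration)
train <- data.frame(
  n_links   = c(5, 7, 6, 8, 0, 1, 2, 1),
  n_exclaim = c(3, 5, 4, 6, 0, 1, 0, 2),
  label     = factor(rep(c("spam", "ham"), each = 4))
)
model <- naiveBayes(label ~ n_links + n_exclaim, data = train)

# Persist the trained model so that a separate process could load it later
model_path <- tempfile(fileext = ".rds")
saveRDS(model, model_path)

# Prediction phase: load the stored model and label unseen messages
loaded_model <- readRDS(model_path)
new_emails <- data.frame(n_links = c(6, 1), n_exclaim = c(4, 0))
new_pred <- predict(loaded_model, newdata = new_emails)
print(new_pred)
```

Keeping the stored model separate from the training script mirrors the split between the two phases: the prediction step only needs the saved object and the new data.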
spam_dir <- "DATA607Project4/spam_2"
ham_dir <- "DATA607Project4/easy_ham"
# Read every file under `path` into one string per email and label it with `tag`
df <- function(path, tag){
  files <- list.files(path = path, full.names = TRUE, recursive = TRUE)
  Email <- lapply(files, function(x) {
    body <- readLines(x)
    body <- paste(body, collapse = "\n")  # keep the whole message as a single string
    return(body)
  })
  Email <- unlist(Email)
  data <- as.data.frame(Email)
  data$tag <- tag
  return(data)
}
ham_df <- df(ham_dir, tag = "ham")
spam_df <- df(spam_dir, tag = "spam")
document_df <- rbind(ham_df, spam_df)
table(document_df$tag)
##
## ham spam
## 2501 1397
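The same read-and-tag pattern can be tried without downloading the corpus. The sketch below writes two made-up messages into a temporary folder; the helper name `read_folder`, the folder layout and the message text are assumptions for illustration only.

```r
# Hypothetical helper: one row per file, with the file content and a tag
read_folder <- function(path, tag) {
  files <- list.files(path, full.names = TRUE, recursive = TRUE)
  bodies <- vapply(
    files,
    function(f) paste(readLines(f, warn = FALSE), collapse = "\n"),
    character(1),
    USE.NAMES = FALSE
  )
  data.frame(Email = bodies, tag = tag, stringsAsFactors = FALSE)
}

# Build a tiny toy corpus in a temporary directory
root <- file.path(tempdir(), "toy_corpus")
dir.create(file.path(root, "ham"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "spam"), recursive = TRUE, showWarnings = FALSE)
writeLines(c("Subject: lunch", "", "See you at noon."),
           file.path(root, "ham", "0001.txt"))
writeLines(c("Subject: WIN", "", "Claim your prize now!"),
           file.path(root, "spam", "0001.txt"))

toy_df <- rbind(
  read_folder(file.path(root, "ham"), "ham"),
  read_folder(file.path(root, "spam"), "spam")
)
table(toy_df$tag)
```

Using vapply() with `character(1)` makes the helper fail early if a file does not collapse to exactly one string, which is one reasonable safeguard when reading many files.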
Preprocessing text data is an important step in any natural language processing task. It helps in cleaning and preparing the text data for further processing or analysis.
A text preprocessing pipeline is a series of processing steps that are applied to raw text data in order to prepare it for use in natural language processing tasks.
The steps in a text preprocessing pipeline can vary, but they typically include tasks such as tokenization, stop word removal, stemming, and lemmatization. These steps help reduce the size of the text data and also improve the accuracy of NLP tasks such as text classification and information extraction.
Text data is difficult to process because it is unstructured and often contains a lot of noise. This noise can be in the form of misspellings, grammatical errors, and non-standard formatting. A text preprocessing pipeline aims to clean up this noise so that the text data can be more easily analyzed.
Figure: Text classification pipeline. Source: https://www.datacamp.com/tutorial/text-classification-python
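The next chunk cleans the ‘Email’ column of the data frame document_df: it removes HTML tags, digits, punctuation and newline characters, and converts the text to lowercase. The unnest_tokens() function from tidytext then tokenizes the ‘Email’ column with token = "paragraphs" and stores the result in a column named ‘text’. Because the newline characters have already been removed at that point, each email is kept as a single block of text. The final anti-join against the ‘stop_words’ dataset therefore drops only rows whose entire ‘text’ value equals a stop word (like ‘and’, ‘the’, ‘is’, etc.); individual stop words inside the emails are removed later, when the corpus is built.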
document_df <- document_df %>%
  mutate(Email = str_remove_all(Email, pattern = "<.*?>")) %>%     # remove HTML tags
  mutate(Email = str_remove_all(Email, pattern = "[:digit:]")) %>% # remove digits
  mutate(Email = str_remove_all(Email, pattern = "[:punct:]")) %>% # remove punctuation
  mutate(Email = str_remove_all(Email, pattern = "[\n]")) %>%      # remove newline characters
  mutate(Email = str_to_lower(Email)) %>%                          # convert the Email column to lowercase
  # Tokenization
  unnest_tokens(output = text, input = Email,
                token = "paragraphs",
                format = "text") %>%
  # Remove stop words
  anti_join(stop_words, by = c("text" = "word"))
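For comparison, the sketch below shows word-level tokenization, where the anti-join against stop_words removes individual stop words. The toy sentences and the column name `id` are made up; this illustrates the technique and is not the chunk used in this report.

```r
library(dplyr)
library(tidytext)

toy <- data.frame(
  id = 1:2,
  Email = c("The lottery winner is you", "See you at the meeting"),
  stringsAsFactors = FALSE
)

# token = "words" is the default; unnest_tokens() also lowercases the tokens
toy_words <- toy %>%
  unnest_tokens(output = word, input = Email) %>%
  anti_join(stop_words, by = "word")

toy_words
```

With one row per word, the join key is a single word, so common words such as "the" match entries in stop_words and are dropped, while the `id` column still records which email each remaining word came from.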
The corpus is represented as a term-document matrix (TDM), which stores the document vectors in matrix form: the rows correspond to the terms, the columns correspond to the documents in the corpus, and the cells hold the weights of the terms. A document-term matrix (DTM) is the transpose: the rows correspond to the documents in the corpus, the columns correspond to the terms, and the cells again hold the weights of the terms. The code below builds a DTM.
set.seed(7614)
# Shuffle the rows so that ham and spam messages are interleaved
shuffled <- sample(nrow(document_df))
document_df <- document_df[shuffled, ]
document_df$tag <- as.factor(document_df$tag)
# Build a corpus and normalize the text
v_corp <- VCorpus(VectorSource(document_df$text))
v_corp <- tm_map(v_corp, content_transformer(stringi::stri_trans_tolower))
v_corp <- tm_map(v_corp, removeNumbers)
v_corp <- tm_map(v_corp, removePunctuation)
v_corp <- tm_map(v_corp, stripWhitespace)
v_corp <- tm_map(v_corp, removeWords, stopwords("english"))
v_corp <- tm_map(v_corp, stemDocument)
# Build the document-term matrix, then drop terms that are absent from more than 99.9% of the documents
corpus_dtm <- DocumentTermMatrix(v_corp, control = list(stemming = TRUE))
corpus_dtm <- removeSparseTerms(corpus_dtm, 0.999)
inspect(corpus_dtm[1:10, 1:10])
## <<DocumentTermMatrix (documents: 10, terms: 10)>>
## Non-/sparse entries: 0/100
## Sparsity : 100%
## Maximal term length: 9
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs aaa aaaa aab aac aae aaf aaff aalib aall aapplecom
## 1 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0 0
## 7 0 0 0 0 0 0 0 0 0 0
## 8 0 0 0 0 0 0 0 0 0 0
## 9 0 0 0 0 0 0 0 0 0 0
## 10 0 0 0 0 0 0 0 0 0 0
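To make the two layouts concrete, here is a small sketch with three made-up documents. It builds both matrices with tm and checks that one is the transpose of the other; the example sentences are assumptions for illustration.

```r
library(tm)

docs <- c("free prize offer", "meeting agenda attached", "free meeting room")
corp <- VCorpus(VectorSource(docs))

dtm <- DocumentTermMatrix(corp)  # rows = documents, columns = terms
tdm <- TermDocumentMatrix(corp)  # rows = terms, columns = documents

dim(dtm)
dim(tdm)

# The same counts, viewed from either side
as.matrix(dtm)
all(as.matrix(dtm) == t(as.matrix(tdm)))
```

A DTM is usually the more convenient layout for modelling, because each row is one observation (a document) and each column is one feature (a term).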
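# Recode a term count as presence (1) or absence (0)
convert_count <- function(x) {
  y <- ifelse(x > 0, 1, 0)
  y <- factor(y, levels = c(0, 1), labels = c(0, 1))
  y
}
# apply() returns a character matrix here, so the values are stored as "0"/"1" strings
tmp <- apply(corpus_dtm, 2, convert_count)
df_matrix <- as.data.frame(as.matrix(tmp))
# As written, this assigns the column to itself: `class` is a term column from the DTM,
# not the ham/spam label, which remains in document_df$tag
df_matrix$class <- df_matrix$class
str(df_matrix$class)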
## chr [1:3898] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" ...
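The presence/absence recoding can be sketched on a tiny count matrix. The term names, counts and the column name `email_label` below are made up; the sketch uses lapply() over the columns so that the factor type is kept, and stores the label under a name that is unlikely to collide with a term.

```r
# Toy term counts: two emails, three terms (invented values)
counts <- matrix(
  c(2, 0, 1,
    0, 0, 3),
  nrow = 2, byrow = TRUE,
  dimnames = list(NULL, c("free", "meeting", "offer"))
)
labels <- factor(c("spam", "ham"))

to_presence <- function(x) {
  factor(ifelse(x > 0, "yes", "no"), levels = c("no", "yes"))
}

# apply() flattens the factors into a character matrix
is.character(apply(counts, 2, to_presence))

# lapply() over the columns keeps each column as a factor
features <- as.data.frame(lapply(as.data.frame(counts), to_presence))
features$email_label <- labels
str(features)
```

Keeping the features as factors with both levels declared means a term that happens to be absent from every training email still has a "yes" level available at prediction time.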
The training data frame takes 70% of the data and the remaining 30% is held out for testing. The createDataPartition() function from caret is used to create the train/test partition; it returns the row indices of the training set.
set.seed(9999)
prediction <- createDataPartition(df_matrix$class, p=.7, list = FALSE, times = 1)
head(prediction)
## Resample1
## [1,] 1
## [2,] 2
## [3,] 3
## [4,] 4
## [5,] 6
## [6,] 7
training <- document_df[prediction,]
testing <- document_df[-prediction,]
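As a small illustration of how createDataPartition() behaves when it is given the class label, the sketch below splits made-up data and compares the class proportions in both parts. The names `toy_tag` and `has_free` and all values are assumptions for illustration.

```r
library(caret)

set.seed(42)
toy_tag <- factor(rep(c("ham", "spam"), times = c(70, 30)))
toy_features <- data.frame(has_free = rbinom(100, 1, 0.3))  # hypothetical feature

# Sampling is done within each level of the factor, so the split is stratified
train_idx <- createDataPartition(toy_tag, p = 0.7, list = FALSE)

train_x <- toy_features[train_idx, , drop = FALSE]
train_y <- toy_tag[train_idx]
test_x  <- toy_features[-train_idx, , drop = FALSE]
test_y  <- toy_tag[-train_idx]

prop.table(table(train_y))
prop.table(table(test_y))
```

Passing the label as the first argument is what keeps the ham/spam ratio roughly equal in the training and test sets.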
A text classification model such as Naive Bayes is a powerful and widely used NLP technique for automatically categorizing, or predicting the class of, unseen text documents through supervised machine learning.
classifier <- naiveBayes(training, factor(training$tag))
test_prediction <- predict(classifier, newdata=testing)
confusionMatrix(table(test_prediction,testing$tag))
## Confusion Matrix and Statistics
##
##
## test_prediction ham spam
## ham 760 0
## spam 0 408
##
## Accuracy : 1
## 95% CI : (0.9968, 1)
## No Information Rate : 0.6507
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.6507
## Detection Rate : 0.6507
## Detection Prevalence : 0.6507
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : ham
##
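We can see from the confusion matrix above that the Naive Bayes model classified none of the 760 ham messages as spam and none of the 408 spam messages as ham, so the reported accuracy on the test data frame is 100%. This figure should be read with caution: the classifier was fitted on training, a subset of document_df that still contains the tag column, so the label itself is among the predictors, and the document-term features in df_matrix are not used by the model.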
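A minimal sketch of fitting Naive Bayes on term features only, with the label passed separately, is shown below. The data are made up, the feature names `free` and `meeting` are hypothetical, and Laplace smoothing of 1 is an assumption of this sketch, not a setting taken from the report.

```r
library(e1071)

as_presence <- function(x) factor(x, levels = c("no", "yes"))

# Toy training data: presence/absence of two terms plus the label
train <- data.frame(
  free    = as_presence(c("yes", "yes", "yes", "no", "no", "no")),
  meeting = as_presence(c("no", "no", "yes", "yes", "yes", "no")),
  label   = factor(c("spam", "spam", "spam", "ham", "ham", "ham"))
)
test <- data.frame(
  free    = as_presence(c("yes", "no")),
  meeting = as_presence(c("no", "yes"))
)
test_label <- factor(c("spam", "ham"), levels = levels(train$label))

feature_cols <- c("free", "meeting")

# The label column is excluded from x and supplied only as y
nb_model <- naiveBayes(x = train[, feature_cols], y = train$label, laplace = 1)
nb_pred <- predict(nb_model, newdata = test[, feature_cols])

table(predicted = nb_pred, actual = test_label)
```

Selecting the predictor columns by name makes it explicit which variables the model may use, so the label cannot slip in as a feature.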