Introduction

This report focuses on the analysis of a real, multivariate text dataset: the Author Text (Reuter_50_50) dataset. The model prediction was accomplished using Naive Bayes classification.

Naive Bayes:

The technique descended from the work of the 18th century mathematician Thomas Bayes, who developed foundational mathematical principles (now known as Bayesian methods) for describing the probability of events, and how probabilities should be revised in light of additional information (Lantz, 2013).
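As a minimal numeric sketch of that idea (the values below are hypothetical and not drawn from this dataset), Bayes' rule P(A|B) = P(B|A) * P(A) / P(B) revises a prior probability once evidence is observed:

#Hypothetical values illustrating Bayes' rule
prior <- 0.20       #prior probability of the class, P(A)
likelihood <- 0.50  #probability of the evidence given the class, P(B|A)
evidence <- 0.14    #overall probability of the evidence, P(B)

#Posterior probability of the class given the evidence, P(A|B)
likelihood * prior / evidence
## [1] 0.7142857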

Dataset Description:

The dataset is a subset of RCV1 (Lewis, Yang, Rose, & Li, 2004), a text categorization test collection. The subset was prepared by ZhiLiu and donated to the UCI Machine Learning Repository on September 08, 2011 (ZhiLiu, 2011). It covers the top 50 authors with 100 texts each, for a total of 5,000 instances split evenly into two parts: the training corpus consists of 2,500 texts (50 per author) and the test corpus consists of another 2,500 texts (50 per author), non-overlapping with the training texts (ZhiLiu, 2011).

Attribute Information:

The dataset contains two attributes: author name and text. The author attribute takes one of 50 author names, identifying who wrote each text; the text attribute contains the unprocessed text. In total the dataset consists of 5,000 instances.

#Session Information
sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Catalina 10.15.5
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] compiler_3.6.1  magrittr_1.5    tools_3.6.1     htmltools_0.4.0
##  [5] yaml_2.2.1      Rcpp_1.0.4.6    stringi_1.4.6   rmarkdown_2.1  
##  [9] knitr_1.28      stringr_1.4.0   xfun_0.14       digest_0.6.25  
## [13] rlang_0.4.6     evaluate_0.14

Step 1 - Collecting Data

#Downloading Dataset
if(!file.exists("Data/C50.zip")){
        #Creating the data directory if it does not exist yet
        dir.create("Data", showWarnings = FALSE)
        
        #Downloading File
        download.file(url = "http://archive.ics.uci.edu/ml//machine-learning-databases/00217/C50.zip",
                      destfile = "Data/C50.zip")
        
        #Unzipping the File
        unzip("Data/C50.zip", exdir = "Data/C50")
}
#Reading Data
library(readtext)
#Train Data File (one plain-text file per article, in per-author folders;
#paths are relative to the project directory)
Data_train <- readtext("Data/C50/C50train/*",
                       dvsep = "\n")
head(Data_train$text, n = 1)
## [1] "The Internet may be overflowing with new technology but crime in cyberspace is still of the old-fashioned variety.\nThe National Consumers League said Wednesday that the most popular scam on the Internet was the pyramid scheme, in which early investors in a bogus fund are paid off with deposits of later investors.\nThe league, a non-profit consumer advocacy group, tracks web scams through a site it set up on the world wide web in February called Internet Fraud Watch at http://www.fraud.org.\nThe site, which collects reports directly from consumers, has been widely praised by law enforcement agencies.\n\"Consumers who suspect a scam on the Internet have critical information,\" said Jodie Bernstein, director of the Federal Trade Commission's Bureau of Consumer Protection. Internet Fraud Watch \"has been a major help to the FTC in identifying particular scams in their infancy.\"\nIn May, for example, the commission used Internet reports to shut down a site run by Fortuna Alliance that had taken in over $6 million, promising investors they could earn $5,000 a month from an initial deposit of $250. Instead, Fortuna kept most of the money, the commission charged.\nFraud reports from the league's site, which has been visited over 370,000 times, are forwarded to local, state and federal authorities.\nThe second-most-popular Internet scam, the league said, was the sale of bogus Internet services, such as custom designed web sites or Internet access accounts.\nIn third place were crooks who sell computer equipment, such as memory chips or sound boards, over the net and then deliver significantly lower quality goods or nothing at all, the league said.\nOther top scams involve business opportunities. Con artists may offer shares in a business or franchise using unreasonable predictions or misrepresentations. One popular scheme promised to let consumers get rich while working at home.\nThe League also announced Tuesday that NationsBank had donated $100,000 to become a sponsor of the Fraud Watch site."
#Test Data File
Data_test <- readtext("Data/C50/C50test/*",
                      dvsep = "\n")

#Author Names: list.dirs() also returns the parent "C50train" directory,
#so its 50 repetitions are dropped from the front
Authornames <- as.data.frame(rep(basename(list.dirs("Data/C50/C50train")), each = 50))
Authornames <- Authornames[-(1:50),]

#Assigning Author name to Text
Data_test$Author <- Authornames
Data_train$Author <- Authornames

#Dropping the doc_id Column created by readtext
Data_test <- Data_test[-1]
Data_train <- Data_train[-1]

#Converting Author Column to Factor
Data_test$Author <- as.factor(Data_test$Author)
Data_train$Author <- as.factor(Data_train$Author)

#Filtering Data by 4 Authors
library(dplyr)
AaronTrain <- Data_train %>% filter(Author == "AaronPressman")
JaneTrain <- Data_train %>% filter(Author == "JaneMacartney")
SarahTrain <- Data_train %>% filter(Author == "SarahDavison")
WilliamTrain <- Data_train %>% filter(Author == "WilliamKazer")
Data_train <- rbind(AaronTrain, JaneTrain, SarahTrain, WilliamTrain)

AaronTest <- Data_test %>% filter(Author == "AaronPressman")
JaneTest <- Data_test %>% filter(Author == "JaneMacartney")
SarahTest <- Data_test %>% filter(Author == "SarahDavison")
WilliamTest <- Data_test %>% filter(Author == "WilliamKazer")
Data_test <- rbind(AaronTest, JaneTest, SarahTest, WilliamTest)

dim(Data_test)
## [1] 200   2
dim(Data_train)
## [1] 200   2
table(Data_train$Author)
## 
##     AaronPressman        AlanCrosby    AlexanderSmith   BenjaminKangLim 
##                50                 0                 0                 0 
##     BernardHickey       BradDorfman          C50train  DarrenSchuettler 
##                 0                 0                 0                 0 
##       DavidLawder     EdnaFernandes       EricAuchard    FumikoFujisaki 
##                 0                 0                 0                 0 
##    GrahamEarnshaw  HeatherScoffield     JaneMacartney        JanLopatka 
##                 0                 0                50                 0 
##      JimGilchrist          JoeOrtiz      JohnMastrini      JonathanBirt 
##                 0                 0                 0                 0 
##    JoWinterbottom       KarlPenhaul         KeithWeir    KevinDrawbaugh 
##                 0                 0                 0                 0 
##     KevinMorrison     KirstinRidley KouroshKarimkhany         LydiaZajc 
##                 0                 0                 0                 0 
##    LynneO'Donnell   LynnleyBrowning   MarcelMichelson      MarkBendeich 
##                 0                 0                 0                 0 
##        MartinWolk      MatthewBunce     MichaelConnor        MureDickie 
##                 0                 0                 0                 0 
##         NickLouth   PatriciaCommins     PeterHumphrey        PierreTran 
##                 0                 0                 0                 0 
##        RobinSidel      RogerFillion       SamuelPerry      SarahDavison 
##                 0                 0                 0                50 
##       ScottHillis       SimonCowell          TanEeLyn    TheresePoletti 
##                 0                 0                 0                 0 
##        TimFarrand        ToddNissen      WilliamKazer 
##                 0                 0                50

Step 2 - Exploring and Preparing Data

As part of the exploratory data analysis, I found that there are no missing values in the dataset.

#Checking for missing values (both return FALSE)
#any(is.na(Data_train))
#any(is.na(Data_test))

To analyze the text data, the first step is to create a corpus, which is a collection of text documents. The corpus was created from the dataset using the Corpus() function of the “tm” package of R, which builds the R object that stores the text documents. The corpus was then cleaned using the tm_map() function: the text was converted to lower case with tolower; numbers, stop words, and punctuation were removed with removeNumbers, removeWords (supplied with stopwords()), and removePunctuation; and extra white space was collapsed with stripWhitespace. The output below shows the first cleaned document of the training corpus.

library(tm)

#Creating Corpus
suppressWarnings(Data_test_corpus <- Corpus(VectorSource(Data_test$text)))
suppressWarnings(Data_train_corpus <- Corpus(VectorSource(Data_train$text)))

#Corpus Cleaning (suppressWarnings() silences the harmless
#"transformation drops documents" warnings emitted for a SimpleCorpus)
suppressWarnings(Data_test_corpus_clean <- tm_map(Data_test_corpus, tolower))
suppressWarnings(Data_train_corpus_clean <- tm_map(Data_train_corpus, tolower))

suppressWarnings(Data_test_corpus_clean <- tm_map(Data_test_corpus_clean, removeNumbers))
suppressWarnings(Data_train_corpus_clean <- tm_map(Data_train_corpus_clean, removeNumbers))

suppressWarnings(Data_test_corpus_clean <- tm_map(Data_test_corpus_clean, removeWords, stopwords()))
suppressWarnings(Data_train_corpus_clean <- tm_map(Data_train_corpus_clean, removeWords, stopwords()))

suppressWarnings(Data_test_corpus_clean <- tm_map(Data_test_corpus_clean, removePunctuation))
suppressWarnings(Data_train_corpus_clean <- tm_map(Data_train_corpus_clean, removePunctuation))

suppressWarnings(Data_test_corpus_clean <- tm_map(Data_test_corpus_clean, stripWhitespace))
suppressWarnings(Data_train_corpus_clean <- tm_map(Data_train_corpus_clean, stripWhitespace))

suppressWarnings(inspect(Data_train_corpus_clean[1]))
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 1
## 
## [1]  internet may overflowing new technology crime cyberspace still oldfashioned variety national consumers league said wednesday popular scam internet pyramid scheme early investors bogus fund paid deposits later investors league nonprofit consumer advocacy group tracks web scams site set world wide web february called internet fraud watch httpwwwfraudorg site collects reports directly consumers widely praised law enforcement agencies consumers suspect scam internet critical information said jodie bernstein director federal trade commissions bureau consumer protection internet fraud watch major help ftc identifying particular scams infancy may example commission used internet reports shut site run fortuna alliance taken million promising investors earn month initial deposit instead fortuna kept money commission charged fraud reports leagues site visited times forwarded local state federal authorities secondpopular internet scam league said sale bogus internet services custom designed web sites internet access accounts third place crooks sell computer equipment memory chips sound boards net deliver significantly lower quality goods nothing league said top scams involve business opportunities con artists may offer shares business franchise using unreasonable predictions misrepresentations one popular scheme promised let consumers get rich working home league also announced tuesday nationsbank donated become sponsor fraud watch site
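Note that the cleaned document above still contains the residue of a URL (“httpwwwfraudorg”). The “tm” package supports custom transformations through content_transformer(); the following sketch (not applied in this analysis, and the regular expression is only illustrative) would strip URL-like tokens if run before the punctuation removal:

#Sketch of a custom transformation that removes URL-like tokens
removeURLs <- content_transformer(function(x) gsub("http\\S*", "", x))
#Data_train_corpus_clean <- tm_map(Data_train_corpus_clean, removeURLs)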

Using the “wordcloud” package of R, the most frequently used words were visualized. The word cloud below shows every word that occurs at least 40 times in the cleaned test corpus; the same idea can be applied author by author by subsetting the corpus, as in the sketch after the wordcloud() call.

#Word Cloud of Testing Dataset
library(wordcloud)
wordcloud(Data_test_corpus_clean, min.freq = 40, random.order = FALSE)
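A sketch for a single author, assuming the row order of Data_train still matches Data_train_corpus_clean (it does, since the corpus was built from Data_train$text after the filtering step):

#Word Cloud for a single author
aaron_idx <- which(Data_train$Author == "AaronPressman")
wordcloud(Data_train_corpus_clean[aaron_idx], min.freq = 20, random.order = FALSE)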

The cleaned text is then split into individual words, a process commonly known as tokenization. This was achieved using the DocumentTermMatrix() function of the “tm” package, which creates a data structure called a sparse matrix in which each row represents a text and each column represents a word.

#Sparse Matrix
test_dtm <- DocumentTermMatrix(Data_test_corpus_clean)
train_dtm <- DocumentTermMatrix(Data_train_corpus_clean)

inspect(train_dtm)
## <<DocumentTermMatrix (documents: 200, terms: 8765)>>
## Non-/sparse entries: 41629/1711371
## Sparsity           : 98%
## Maximal term length: 36
## Weighting          : term frequency (tf)
## Sample             :
##      Terms
## Docs  china chinese hong kong last one percent said will year
##   126     0       0    3    3    1   1       1   13    3    1
##   127     4       1    2    2    3   3       7    5   11    6
##   133     1       0    3    3    2   2       5    7    3    3
##   137     1       6   12   11    1   5       0   12    3    3
##   139     0       1    2    2    6   1      14   23    9    1
##   140    17       4   27   19    1   6       3   12   15    2
##   51      2       3    0    0    1   5       1   17    0    6
##   53      2      14    0    0    1   4       1    6    0    2
##   55      1       1    0    0    2   0       2   16    2    4
##   60      3       4    0    0    2   5       4    9    0    6

To prepare inputs for the Naive Bayes classifier, the training and testing feature sets are derived from word frequencies: only terms that appear at least five times in the training data are retained, and each retained term becomes an indicator feature.

##### Preparing Training and Testing Datasets #####
### Creating Indicator features for frequent words ###
FreqWords <- findFreqTerms(train_dtm, 5)

#Dictionary() is a small helper standing in for the function of the same
#name in older versions of "tm"; it simply checks that its input is a
#character vector of terms
Dictionary <- function(x) {
        if (is.character(x)) {
                return(x)
        }
        stop("x is not a character vector")
}

data_dict <- Dictionary(FreqWords)
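A quick sanity check is the size of the resulting dictionary:

#Number of distinct terms appearing at least 5 times in the training data
length(data_dict)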

#Building Document-Term Matrices limited to the dictionary terms
data_train <- DocumentTermMatrix(Data_train_corpus_clean, list(dictionary = data_dict))
data_test <- DocumentTermMatrix(Data_test_corpus_clean, list(dictionary = data_dict))

#Converting word counts to categorical No/Yes indicators
convert_counts <- function(x) {
        x <- ifelse(x > 0, 1, 0)
        x <- factor(x, levels = c(0, 1), labels = c("No", "Yes"))
        return(x)
}

#Applying convert_counts() to every column (MARGIN = 2) of both matrices
data_train <- apply(data_train, MARGIN = 2, convert_counts)
data_test <- apply(data_test, MARGIN = 2, convert_counts)
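A toy call shows what convert_counts() does to a single column of counts:

convert_counts(c(0, 1, 3))
## [1] No  Yes Yes
## Levels: No Yes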

Step 3 - Training a Model

#Naive Bayes Classification
library(e1071)
data_classifier <- naiveBayes(data_train, Data_train$Author)
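Note that Author still carries all of its original factor levels after the filtering in Step 1, including the spurious “C50train” level, so naiveBayes() also builds empty probability tables for the unused authors. Those levels have zero prior weight and do not appear among the predictions below, but dropping them keeps the model object tidy; an optional sketch:

#Optional: drop the unused author levels before training
#Data_train$Author <- droplevels(Data_train$Author)
#Data_test$Author <- droplevels(Data_test$Author)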

Step 4 - Evaluating Model Performance

library(gmodels)
data_test_pred <- predict(data_classifier, data_test)
CrossTable(data_test_pred, Data_test$Author,
           prop.chisq = FALSE, prop.t = FALSE,
           dnn = c('predicted', 'actual'))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  200 
## 
##  
##               | actual 
##     predicted | AaronPressman | JaneMacartney |  SarahDavison |  WilliamKazer |     Row Total | 
## --------------|---------------|---------------|---------------|---------------|---------------|
## AaronPressman |            46 |             0 |             0 |             0 |            46 | 
##               |         1.000 |         0.000 |         0.000 |         0.000 |         0.230 | 
##               |         0.920 |         0.000 |         0.000 |         0.000 |               | 
## --------------|---------------|---------------|---------------|---------------|---------------|
## JaneMacartney |             0 |            40 |             2 |            14 |            56 | 
##               |         0.000 |         0.714 |         0.036 |         0.250 |         0.280 | 
##               |         0.000 |         0.800 |         0.040 |         0.280 |               | 
## --------------|---------------|---------------|---------------|---------------|---------------|
##  SarahDavison |             4 |             0 |            47 |             3 |            54 | 
##               |         0.074 |         0.000 |         0.870 |         0.056 |         0.270 | 
##               |         0.080 |         0.000 |         0.940 |         0.060 |               | 
## --------------|---------------|---------------|---------------|---------------|---------------|
##  WilliamKazer |             0 |            10 |             1 |            33 |            44 | 
##               |         0.000 |         0.227 |         0.023 |         0.750 |         0.220 | 
##               |         0.000 |         0.200 |         0.020 |         0.660 |               | 
## --------------|---------------|---------------|---------------|---------------|---------------|
##  Column Total |            50 |            50 |            50 |            50 |           200 | 
##               |         0.250 |         0.250 |         0.250 |         0.250 |               | 
## --------------|---------------|---------------|---------------|---------------|---------------|
## 
## 

From the table we can observe that 166 of the 200 test texts (46 + 40 + 47 + 33) are classified correctly, an accuracy of 83.0%.
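The same figure can be computed directly from the predictions:

#Overall accuracy: proportion of correctly classified test texts
mean(data_test_pred == Data_test$Author)
## [1] 0.83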

Step 5 - Improving Model Performance

Setting laplace = 1 (Laplace smoothing) is one way to try to improve the model's performance: it prevents a word that never occurred with a given author in the training data from forcing that author's estimated probability to zero.
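A small worked example of why the smoothing matters (the counts are hypothetical; e1071 treats each No/Yes indicator as a categorical feature with two levels):

#Suppose an author has 50 training texts, none containing some term t
count <- 0; n <- 50
count / n                  #unsmoothed estimate of P(t = "Yes" | author)
## [1] 0
(count + 1) / (n + 1 * 2)  #laplace = 1 spread over the two levels (No/Yes)
## [1] 0.01923077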

# Setting Laplace = 1
library(e1071)
data_classifier2 <- naiveBayes(data_train, Data_train$Author,laplace = 1)

library(gmodels)
data_test_pred2 <- predict(data_classifier2, data_test)
CrossTable(data_test_pred2, Data_test$Author,
           prop.chisq = FALSE, prop.t = FALSE,
           dnn = c('predicted', 'actual'))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  200 
## 
##  
##               | actual 
##     predicted | AaronPressman | JaneMacartney |  SarahDavison |  WilliamKazer |     Row Total | 
## --------------|---------------|---------------|---------------|---------------|---------------|
## AaronPressman |            49 |             0 |             0 |             0 |            49 | 
##               |         1.000 |         0.000 |         0.000 |         0.000 |         0.245 | 
##               |         0.980 |         0.000 |         0.000 |         0.000 |               | 
## --------------|---------------|---------------|---------------|---------------|---------------|
## JaneMacartney |             0 |            30 |             1 |             9 |            40 | 
##               |         0.000 |         0.750 |         0.025 |         0.225 |         0.200 | 
##               |         0.000 |         0.600 |         0.020 |         0.180 |               | 
## --------------|---------------|---------------|---------------|---------------|---------------|
##  SarahDavison |             0 |             0 |            41 |             0 |            41 | 
##               |         0.000 |         0.000 |         1.000 |         0.000 |         0.205 | 
##               |         0.000 |         0.000 |         0.820 |         0.000 |               | 
## --------------|---------------|---------------|---------------|---------------|---------------|
##  WilliamKazer |             1 |            20 |             8 |            41 |            70 | 
##               |         0.014 |         0.286 |         0.114 |         0.586 |         0.350 | 
##               |         0.020 |         0.400 |         0.160 |         0.820 |               | 
## --------------|---------------|---------------|---------------|---------------|---------------|
##  Column Total |            50 |            50 |            50 |            50 |           200 | 
##               |         0.250 |         0.250 |         0.250 |         0.250 |               | 
## --------------|---------------|---------------|---------------|---------------|---------------|
## 
## 

From the table we can observe that with laplace = 1, 161 of the 200 texts (49 + 30 + 41 + 41) are classified correctly, an accuracy of 80.5%.

Summary

  1. The Naive Bayes algorithm constructs tables of probabilities that are used to estimate the likelihood that new examples belong to the various classes.
  2. The original model is 2.5 percentage points more accurate than the model with laplace = 1.
  3. The base model misclassified 34 of the 200 test texts (17.0%), while the model with laplace = 1 misclassified 39 of 200 (19.5%).
  4. The model optimization technique failed on this dataset, possibly because of the small number of training instances per author.

References

  1. Lantz, B. (2013). Chapter 4: Probabilistic Learning – Classification Using Naive Bayes. In Machine Learning with R (pp. 89-117). Birmingham, UK: Packt Publishing Ltd.
  2. Lewis, D. D., Yang, Y., Rose, T., & Li, F. (2004). RCV1: A New Benchmark Collection for Text Categorization Research. Journal of Machine Learning Research, 5, 361-397.
  3. Parsian, M. (2015). Data Algorithms. Sebastopol, CA: O'Reilly Media, Inc.
  4. ZhiLiu. (2011, September 08). Reuter_50_50 Data Set. Retrieved from UCI Machine Learning Repository: http://archive.ics.uci.edu/ml//datasets/Reuter_50_50