This report focuses on the analysis of a real-time, multivariate dataset: an author text dataset. Model prediction was accomplished using Naive Bayes classification.
The technique descended from the work of the 18th century mathematician Thomas Bayes, who developed foundational mathematical principles (now known as Bayesian methods) for describing the probability of events, and how probabilities should be revised in light of additional information (Lantz, 2013).
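As a brief sketch of the underlying mathematics (standard notation, not drawn from this report's data), Bayes' theorem states how the probability of a hypothesis $A$ should be revised after observing evidence $B$:

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$

In the classification task below, the naive Bayes classifier applies this rule under the simplifying assumption that words occur independently given the author, assigning a document with words $w_1, \dots, w_n$ to the author $c$ that maximizes $P(c) \prod_{i=1}^{n} P(w_i \mid c)$.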
The dataset is a subset of RCV1 (Lewis, Yang, Rose, & Li, 2004), a text categorization test collection; the subset was prepared by ZhiLiu and donated on September 8, 2011 (ZhiLiu, 2011). It covers the top 50 authors, with 100 texts each, for a total of 5,000 instances divided into two parts: the training corpus consists of 2,500 texts (50 per author) and the test corpus includes another 2,500 texts (50 per author), non-overlapping with the training texts (ZhiLiu, 2011).
The dataset contains two attributes: author name and text. The author name takes one of 50 values and identifies who wrote the text; the text attribute contains the unprocessed article text. In total, the dataset consists of 5,000 instances.
#Session Information
sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Catalina 10.15.5
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] compiler_3.6.1 magrittr_1.5 tools_3.6.1 htmltools_0.4.0
## [5] yaml_2.2.1 Rcpp_1.0.4.6 stringi_1.4.6 rmarkdown_2.1
## [9] knitr_1.28 stringr_1.4.0 xfun_0.14 digest_0.6.25
## [13] rlang_0.4.6 evaluate_0.14
#Downloading Dataset
if(!file.exists("Data/C50.zip")){
  #Creating the Data directory if it does not yet exist
  if(!dir.exists("Data")) dir.create("Data")
  #Downloading File
  download.file(url = "http://archive.ics.uci.edu/ml//machine-learning-databases/00217/C50.zip",
                destfile = "Data/C50.zip")
  #Unzipping the File
  unzip("Data/C50.zip", exdir = "Data/C50")
}
#Reading Data
library(readtext)
#Train Data File
#system.file() is intended for files shipped inside packages; for local
#files the glob pattern can be passed to readtext() directly
Data_train <- readtext("Data/C50/C50train/*")
head(Data_train$text, n = 1)
## [1] "The Internet may be overflowing with new technology but crime in cyberspace is still of the old-fashioned variety.\nThe National Consumers League said Wednesday that the most popular scam on the Internet was the pyramid scheme, in which early investors in a bogus fund are paid off with deposits of later investors.\nThe league, a non-profit consumer advocacy group, tracks web scams through a site it set up on the world wide web in February called Internet Fraud Watch at http://www.fraud.org.\nThe site, which collects reports directly from consumers, has been widely praised by law enforcement agencies.\n\"Consumers who suspect a scam on the Internet have critical information,\" said Jodie Bernstein, director of the Federal Trade Commission's Bureau of Consumer Protection. Internet Fraud Watch \"has been a major help to the FTC in identifying particular scams in their infancy.\"\nIn May, for example, the commission used Internet reports to shut down a site run by Fortuna Alliance that had taken in over $6 million, promising investors they could earn $5,000 a month from an initial deposit of $250. Instead, Fortuna kept most of the money, the commission charged.\nFraud reports from the league's site, which has been visited over 370,000 times, are forwarded to local, state and federal authorities.\nThe second-most-popular Internet scam, the league said, was the sale of bogus Internet services, such as custom designed web sites or Internet access accounts.\nIn third place were crooks who sell computer equipment, such as memory chips or sound boards, over the net and then deliver significantly lower quality goods or nothing at all, the league said.\nOther top scams involve business opportunities. Con artists may offer shares in a business or franchise using unreasonable predictions or misrepresentations. One popular scheme promised to let consumers get rich while working at home.\nThe League also announced Tuesday that NationsBank had donated $100,000 to become a sponsor of the Fraud Watch site."
#Test Data File
Data_test <- readtext("Data/C50/C50test/*")
#Author Names: list.dirs() returns the root directory ("C50train") first,
#followed by the 50 author sub-directories, so the 50 rows generated for the
#root are dropped below (it lingers as an unused factor level later on)
Authornames <- as.data.frame(rep(basename(list.dirs("Data/C50/C50train")), each = 50))
Authornames <- Authornames[-(1:50),]
#Assigning Author name to Text
Data_test$Author <- Authornames
Data_train$Author <- Authornames
#Dropping ID Column
Data_test <- Data_test[-1]
Data_train <- Data_train[-1]
#Converting Author Column to Factor
Data_test$Author <- as.factor(Data_test$Author)
Data_train$Author <- as.factor(Data_train$Author)
#Filtering Data by 4 Authors
library(dplyr)
selected_authors <- c("AaronPressman", "JaneMacartney", "SarahDavison", "WilliamKazer")
Data_train <- Data_train %>% filter(Author %in% selected_authors)
Data_test <- Data_test %>% filter(Author %in% selected_authors)
dim(Data_test)
## [1] 200 2
dim(Data_train)
## [1] 200 2
table(Data_train$Author)
##
## AaronPressman AlanCrosby AlexanderSmith BenjaminKangLim
## 50 0 0 0
## BernardHickey BradDorfman C50train DarrenSchuettler
## 0 0 0 0
## DavidLawder EdnaFernandes EricAuchard FumikoFujisaki
## 0 0 0 0
## GrahamEarnshaw HeatherScoffield JaneMacartney JanLopatka
## 0 0 50 0
## JimGilchrist JoeOrtiz JohnMastrini JonathanBirt
## 0 0 0 0
## JoWinterbottom KarlPenhaul KeithWeir KevinDrawbaugh
## 0 0 0 0
## KevinMorrison KirstinRidley KouroshKarimkhany LydiaZajc
## 0 0 0 0
## LynneO'Donnell LynnleyBrowning MarcelMichelson MarkBendeich
## 0 0 0 0
## MartinWolk MatthewBunce MichaelConnor MureDickie
## 0 0 0 0
## NickLouth PatriciaCommins PeterHumphrey PierreTran
## 0 0 0 0
## RobinSidel RogerFillion SamuelPerry SarahDavison
## 0 0 0 50
## ScottHillis SimonCowell TanEeLyn TheresePoletti
## 0 0 0 0
## TimFarrand ToddNissen WilliamKazer
## 0 0 50
As part of the exploratory data analysis, I found that there are no missing values in the dataset.
#Checking for missing values
#any(is.na(Data_train))
#any(is.na(Data_test))
To analyze the text data, the first step is to create a corpus, i.e., a collection of text documents. The corpus was created from the dataset using the Corpus() function of the "tm" package of R, which builds the R object that stores the text documents. The corpus was then cleaned using the tm_map() function: the text was converted to lower case with tolower, and numbers, stop words, punctuation, and extra white space were removed with removeNumbers, removeWords (with stopwords()), removePunctuation, and stripWhitespace. The output below shows the first element of the cleaned training corpus.
library(tm)
#Creating Corpus
Data_test_corpus <- Corpus(VectorSource(Data_test$text))
Data_train_corpus <- Corpus(VectorSource(Data_train$text))
#Corpus Cleaning: lower-case, then remove numbers, stop words,
#punctuation, and extra white space
clean_corpus <- function(corpus) {
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removeWords, stopwords())
  corpus <- tm_map(corpus, removePunctuation)
  tm_map(corpus, stripWhitespace)
}
Data_test_corpus_clean <- suppressWarnings(clean_corpus(Data_test_corpus))
Data_train_corpus_clean <- suppressWarnings(clean_corpus(Data_train_corpus))
inspect(Data_train_corpus_clean[1])
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 1
##
## [1] internet may overflowing new technology crime cyberspace still oldfashioned variety national consumers league said wednesday popular scam internet pyramid scheme early investors bogus fund paid deposits later investors league nonprofit consumer advocacy group tracks web scams site set world wide web february called internet fraud watch httpwwwfraudorg site collects reports directly consumers widely praised law enforcement agencies consumers suspect scam internet critical information said jodie bernstein director federal trade commissions bureau consumer protection internet fraud watch major help ftc identifying particular scams infancy may example commission used internet reports shut site run fortuna alliance taken million promising investors earn month initial deposit instead fortuna kept money commission charged fraud reports leagues site visited times forwarded local state federal authorities secondpopular internet scam league said sale bogus internet services custom designed web sites internet access accounts third place crooks sell computer equipment memory chips sound boards net deliver significantly lower quality goods nothing league said top scams involve business opportunities con artists may offer shares business franchise using unreasonable predictions misrepresentations one popular scheme promised let consumers get rich working home league also announced tuesday nationsbank donated become sponsor fraud watch site
The text from the training dataset was split according to author. Using the "wordcloud" package of R, the most frequently used words can be visualized. The following word cloud shows the most frequently used words in the test corpus.
#Word Cloud of Testing Dataset
library(wordcloud)
wordcloud(Data_test_corpus_clean, min.freq = 40, random.order = FALSE)
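The same call can be repeated on the training corpus to compare the authors' vocabularies; the following is a sketch with identical parameters (this plot was not rendered in the original run):

#Word Cloud of Training Dataset (same parameters as the test plot)
wordcloud(Data_train_corpus_clean, min.freq = 40, random.order = FALSE)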
The cleaned text is then split into individual words, a process commonly known as tokenization. Tokenization was performed with the DocumentTermMatrix() function of the "tm" package, which creates a data structure called a sparse matrix in which each row represents a text and each column represents a word.
#Sparse Matrix
test_dtm <- DocumentTermMatrix(Data_test_corpus_clean)
train_dtm <- DocumentTermMatrix(Data_train_corpus_clean)
inspect(train_dtm)
## <<DocumentTermMatrix (documents: 200, terms: 8765)>>
## Non-/sparse entries: 41629/1711371
## Sparsity : 98%
## Maximal term length: 36
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs china chinese hong kong last one percent said will year
## 126 0 0 3 3 1 1 1 13 3 1
## 127 4 1 2 2 3 3 7 5 11 6
## 133 1 0 3 3 2 2 5 7 3 3
## 137 1 6 12 11 1 5 0 12 3 3
## 139 0 1 2 2 6 1 14 23 9 1
## 140 17 4 27 19 1 6 3 12 15 2
## 51 2 3 0 0 1 5 1 17 0 6
## 53 2 14 0 0 1 4 1 6 0 2
## 55 1 1 0 0 2 0 2 16 2 4
## 60 3 4 0 0 2 5 4 9 0 6
To train and evaluate the Naive Bayes classifier, the training and testing datasets are reduced to indicator features for frequently occurring words.
##### Preparing Training and Testing Datasets #####
### Creating indicator features for frequent words ###
#Terms that occur at least 5 times in the training corpus
FreqWords <- findFreqTerms(train_dtm, 5)
#The Dictionary() helper from older versions of "tm" has been removed, so a
#small replacement that validates and returns a character vector is defined
Dictionary <- function(x) {
  if (is.character(x)) {
    return(x)
  }
  stop("x is not a character vector")
}
data_dict <- Dictionary(FreqWords)
#Building document-term matrices restricted to the dictionary terms
data_train <- DocumentTermMatrix(Data_train_corpus_clean, list(dictionary = data_dict))
data_test <- DocumentTermMatrix(Data_test_corpus_clean, list(dictionary = data_dict))
#Converting word frequencies to binary "No"/"Yes" indicator factors
convert_counts <- function(x) {
  x <- ifelse(x > 0, 1, 0)
  x <- factor(x, levels = c(0, 1), labels = c("No", "Yes"))
  return(x)
}
#Applying convert_counts() to every column (term) of both matrices
data_train <- apply(data_train, MARGIN = 2, convert_counts)
data_test <- apply(data_test, MARGIN = 2, convert_counts)
#Naive Bayes Classification
library(e1071)
data_classifier <- naiveBayes(data_train, Data_train$Author)
library(gmodels)
data_test_pred <- predict(data_classifier, data_test)
CrossTable(data_test_pred, Data_test$Author,
           prop.chisq = FALSE, prop.t = FALSE,
           dnn = c('predicted', 'actual'))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 200
##
##
## | actual
## predicted | AaronPressman | JaneMacartney | SarahDavison | WilliamKazer | Row Total |
## --------------|---------------|---------------|---------------|---------------|---------------|
## AaronPressman | 46 | 0 | 0 | 0 | 46 |
## | 1.000 | 0.000 | 0.000 | 0.000 | 0.230 |
## | 0.920 | 0.000 | 0.000 | 0.000 | |
## --------------|---------------|---------------|---------------|---------------|---------------|
## JaneMacartney | 0 | 40 | 2 | 14 | 56 |
## | 0.000 | 0.714 | 0.036 | 0.250 | 0.280 |
## | 0.000 | 0.800 | 0.040 | 0.280 | |
## --------------|---------------|---------------|---------------|---------------|---------------|
## SarahDavison | 4 | 0 | 47 | 3 | 54 |
## | 0.074 | 0.000 | 0.870 | 0.056 | 0.270 |
## | 0.080 | 0.000 | 0.940 | 0.060 | |
## --------------|---------------|---------------|---------------|---------------|---------------|
## WilliamKazer | 0 | 10 | 1 | 33 | 44 |
## | 0.000 | 0.227 | 0.023 | 0.750 | 0.220 |
## | 0.000 | 0.200 | 0.020 | 0.660 | |
## --------------|---------------|---------------|---------------|---------------|---------------|
## Column Total | 50 | 50 | 50 | 50 | 200 |
## | 0.250 | 0.250 | 0.250 | 0.250 | |
## --------------|---------------|---------------|---------------|---------------|---------------|
##
##
From the table, we can observe that the model correctly classifies 46 + 40 + 47 + 33 = 166 of the 200 test documents, an accuracy of 83.0%.
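The accuracy can also be computed directly from the prediction vector rather than read off the table; a minimal sketch using the objects created above:

#Proportion of test documents whose predicted author matches the actual one
#(46 + 40 + 47 + 33) / 200 = 0.83
mean(data_test_pred == Data_test$Author)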
Setting laplace = 1 (Laplace smoothing) is one way to try to improve the model's performance.
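As a sketch of the reasoning (standard naive Bayes smoothing, not spelled out in the original code): without smoothing, a feature level never observed for an author in the training data receives a zero conditional probability, which zeroes out that author's entire product of likelihoods. With a Laplace estimator $\alpha$ (here $\alpha = 1$), the conditional probability of feature level $x$ given author $c$ is instead estimated roughly as

$$\hat{P}(x \mid c) = \frac{\mathrm{count}(x, c) + \alpha}{\mathrm{count}(c) + \alpha K},$$

where $K$ is the number of levels of the feature (here $K = 2$, for "No" and "Yes"). The naiveBayes() function of e1071 applies this additive smoothing to its class-conditional count tables.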
# Setting Laplace = 1
data_classifier2 <- naiveBayes(data_train, Data_train$Author, laplace = 1)
data_test_pred2 <- predict(data_classifier2, data_test)
CrossTable(data_test_pred2, Data_test$Author,
           prop.chisq = FALSE, prop.t = FALSE,
           dnn = c('predicted', 'actual'))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 200
##
##
## | actual
## predicted | AaronPressman | JaneMacartney | SarahDavison | WilliamKazer | Row Total |
## --------------|---------------|---------------|---------------|---------------|---------------|
## AaronPressman | 49 | 0 | 0 | 0 | 49 |
## | 1.000 | 0.000 | 0.000 | 0.000 | 0.245 |
## | 0.980 | 0.000 | 0.000 | 0.000 | |
## --------------|---------------|---------------|---------------|---------------|---------------|
## JaneMacartney | 0 | 30 | 1 | 9 | 40 |
## | 0.000 | 0.750 | 0.025 | 0.225 | 0.200 |
## | 0.000 | 0.600 | 0.020 | 0.180 | |
## --------------|---------------|---------------|---------------|---------------|---------------|
## SarahDavison | 0 | 0 | 41 | 0 | 41 |
## | 0.000 | 0.000 | 1.000 | 0.000 | 0.205 |
## | 0.000 | 0.000 | 0.820 | 0.000 | |
## --------------|---------------|---------------|---------------|---------------|---------------|
## WilliamKazer | 1 | 20 | 8 | 41 | 70 |
## | 0.014 | 0.286 | 0.114 | 0.586 | 0.350 |
## | 0.020 | 0.400 | 0.160 | 0.820 | |
## --------------|---------------|---------------|---------------|---------------|---------------|
## Column Total | 50 | 50 | 50 | 50 | 200 |
## | 0.250 | 0.250 | 0.250 | 0.250 | |
## --------------|---------------|---------------|---------------|---------------|---------------|
##
##
From the table, we can observe that with laplace = 1 the model correctly classifies 49 + 30 + 41 + 41 = 161 of the 200 test documents, an accuracy of 80.5%, slightly lower than the unsmoothed model.
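For a quick side-by-side comparison, the overall accuracies of the two models can be computed directly from the prediction vectors; a minimal sketch, assuming the objects created above are still in the session:

#Comparing the overall accuracy of the two classifiers
accuracy_default <- mean(data_test_pred == Data_test$Author)   #0.830
accuracy_laplace <- mean(data_test_pred2 == Data_test$Author)  #0.805
data.frame(model = c("laplace = 0", "laplace = 1"),
           accuracy = c(accuracy_default, accuracy_laplace))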