Naive Bayes Using Apache Mahout

Fingertips

Mehul Katara & Sahil Desai

2021-01-18

What is Naive Bayes?

Learn Naive Bayes Algorithm

Mahout Naive Bayes Intro

  • Mahout currently has two flavors of Naive Bayes.
    • The first is standard Multinomial Naive Bayes.
    • The second is an implementation of Transformed Weight-normalized Complement Naive Bayes.

When one class has more training examples than another, Naive Bayes selects poor weights for the decision boundary. This is due to an under-studied bias effect that shrinks weights for classes with few training examples. To balance the amount of training data used per estimate, a "complement class" formulation of Naive Bayes is used.
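As a rough sketch of the complement-class idea (following Rennie et al.'s Weight-normalized Complement Naive Bayes, on which Mahout's CNB implementation is based; the notation below is ours, not Mahout's): word weights for class c are estimated from the documents outside c, log-transformed, normalized, and the class with the smallest weighted sum is chosen.

% Complement estimate: N_{\tilde{c}w} = count of word w in documents NOT in class c,
% N_{\tilde{c}} = total word count outside class c, \alpha = smoothing parameters.
\hat{\theta}_{\tilde{c}w} = \frac{N_{\tilde{c}w} + \alpha_w}{N_{\tilde{c}} + \alpha}
\qquad
w_{cw} = \frac{\log \hat{\theta}_{\tilde{c}w}}{\sum_{w'} \lvert \log \hat{\theta}_{\tilde{c}w'} \rvert}

% Classify a document d with word counts f_w by the smallest complement score:
\hat{c}(d) = \arg\min_{c} \sum_{w} f_w \, w_{cw}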

Steps to be followed in Mahout

  1. Getting the data
  2. Copying text files to HDFS
  3. Convert our dataset into SequenceFiles
  4. Convert SequenceFiles to sparse vector file format
  5. Split the dataset into training and testing sets
  6. Train the model with the training dataset
  7. Test the model with the testing dataset

Step 1 : Getting the 20 Newsgroups dataset

  • The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. http://qwone.com/~jason/20Newsgroups/

Download Dataset

http://qwone.com/~jason/20Newsgroups/20news-bydate.tar.gz
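One way to fetch and unpack the dataset locally (assuming wget and tar are available; the archive extracts into 20news-bydate-train/ and 20news-bydate-test/ directories):

# Download the 20 Newsgroups archive and unpack it in the current directory
wget http://qwone.com/~jason/20Newsgroups/20news-bydate.tar.gz
tar -xzf 20news-bydate.tar.gz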

Step 2 : Copying text files to HDFS

After downloading the text collection locally, we need to copy it to HDFS so that Mahout can work with it.

hadoop fs -mkdir 20newsdata

hadoop fs -copyFromLocal 20news-bydate-train/ 20newsdata
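Optionally, verify the copy with a listing (same paths as above); the 20 newsgroup subdirectories should appear under the training folder:

# List the copied training data on HDFS
hadoop fs -ls 20newsdata/20news-bydate-train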

Step 3 : Convert our dataset into SequenceFiles

  • SequenceFile is a Hadoop flat file format in which each document is represented as a key-value pair.
  • Here, the key is the document ID and the value is its content.
mahout seqdirectory -i 20newsdata/20news-bydate-train -o 20newsdataseq-out

-i : specifying the input directory
-o : specifying the output directory
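To sanity-check the conversion, Mahout's seqdumper utility can print the key-value pairs from the generated SequenceFile to the console (the part-m-00000 file name below is the same one assumed in the next step; it may differ depending on how the job ran):

# Dump a few document-id / content pairs from the SequenceFile
mahout seqdumper -i 20newsdataseq-out/part-m-00000 | head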

Step 4 : Convert SequenceFiles to sparse vector file format

  • Most text-mining algorithms need a numerical representation of text in order to run properly, so we turn the collection of texts from the previous step into numerical feature vectors. Every document is then represented as a vector whose elements are words with their respective weights.
mahout seq2sparse -i 20newsdataseq-out/part-m-00000 -o 20newsdatavec -lnorm -nv -wt tfidf 

-i : specifying the input directory
-o : specifying the output directory
-lnorm : log-normalize the output vectors
-nv : produce named vectors
-wt : weighting scheme to use (here TF-IDF)
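seq2sparse writes several outputs under 20newsdatavec, among them a dictionary file and the tfidf-vectors directory used in the next step; a quick listing confirms the step succeeded:

# List the generated vector folders (dictionary, tf-vectors, tfidf-vectors, ...)
hadoop fs -ls 20newsdatavec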

Step 5 : Split the dataset into training and testing sets

  • Split the dataset into training and testing sets so that we can measure how accurate the model is.
mahout split -i 20newsdatavec/tfidf-vectors --trainingOutput 20newsdatatrain --testOutput 20newsdatatest --randomSelectionPct 70 --overwrite --sequenceFiles -xm sequential  

--trainingOutput : training output directory
--testOutput : testing output directory
--randomSelectionPct : percentage of the data selected for the training split
--overwrite : overwrite the output directories if they already exist
--sequenceFiles : the input is in SequenceFile format
-xm : execution method (sequential or mapreduce)
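After the split, both output directories should contain vector SequenceFiles; listing them is a quick check before training:

# Confirm the training and testing splits exist on HDFS
hadoop fs -ls 20newsdatatrain
hadoop fs -ls 20newsdatatest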

Step 6 : Training the model

  • Train the model with training dataset
mahout trainnb -i 20newsdatatrain -o model -li labelindex -ow -c  

-i : specifying the input directory
-o : specifying the output directory
-li : specifying the path where the label index is written
-ow : overwrite
-c : Complement Naive Bayes
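Training writes the Naive Bayes model into the model directory and the label-to-index mapping into labelindex; both can be checked with a listing:

# Confirm the trained model and label index were written
hadoop fs -ls model
hadoop fs -ls labelindex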

Step 7 : Test the model

  • Now test the model with the testing dataset and check the results
mahout testnb -i 20newsdatatest -m model -l labelindex -ow -o results  

-i : specifying the input directory
-m : specifying the model
-o : specifying the output directory
-l : specifying the label index path
-ow : overwrite
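testnb prints a summary to the console (overall accuracy and a confusion matrix over the 20 newsgroups) and writes per-document classification output to the results directory. That output can be inspected with seqdumper, for example (the part file name may differ depending on execution mode):

# Peek at the classification results written to HDFS
mahout seqdumper -i results/part-m-00000 | head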