When one class has more training examples than another, Naive Bayes selects poor weights for the decision boundary. This is due to an under-studied bias effect that shrinks the weights for classes with few training examples. To balance the amount of training data used per estimate, we introduce a “complement class” formulation of Naive Bayes.
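To make the "complement class" idea concrete, the estimate can be written out as below. This is a sketch of the standard complement Naive Bayes formulation, not a description of Mahout's exact implementation, which may apply further weight normalization. The symbols are assumptions introduced here: N_{\tilde{c}i} is the number of times word i occurs in documents of every class except c, N_{\tilde{c}} is the total word count in those documents, \alpha_i is a smoothing parameter with \alpha = \sum_i \alpha_i, and f_i is the count of word i in the document being classified.

```latex
% Word estimate for class c, computed from all classes other than c
\hat{\theta}_{\tilde{c}i} = \frac{N_{\tilde{c}i} + \alpha_i}{N_{\tilde{c}} + \alpha}

% Classification rule: choose the class whose complement matches the document worst
l(d) = \arg\max_{c} \left[ \log p(\theta_c) - \sum_i f_i \log \hat{\theta}_{\tilde{c}i} \right]
```

Because every estimate draws on the examples of all the other classes, a class with few training examples no longer gets its weights from a small sample.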
Download Dataset
After downloading the text collection locally, copy it to HDFS so that Mahout can process it.
hadoop fs -mkdir 20newsdata
hadoop fs -copyFromLocal 20news-bydate-train/ 20newsdata
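Before or after the copy, it can help to confirm that the local data has the expected layout. The sketch below assumes that 20news-bydate-train contains one subdirectory per newsgroup with one file per message. summarize_corpus is a hypothetical helper written for this check, not part of Hadoop or Mahout.

```shell
#!/bin/sh
# summarize_corpus is a hypothetical helper, not a Hadoop or Mahout command.
# It prints "<categories> <documents>" for a directory laid out as
# <dir>/<category>/<document>.
summarize_corpus() {
    dir="$1"
    # Count immediate subdirectories (one per category).
    categories=$(find "$dir" -mindepth 1 -maxdepth 1 -type d | wc -l | tr -d ' ')
    # Count regular files anywhere below the directory (one per document).
    documents=$(find "$dir" -type f | wc -l | tr -d ' ')
    echo "$categories $documents"
}

# Only run the check when the local copy is present.
if [ -d 20news-bydate-train ]; then
    summarize_corpus 20news-bydate-train
fi
```

For the 20 newsgroups training set, the first number should be 20. Any other value suggests the archive was unpacked into a different layout.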
mahout seqdirectory -i 20newsdata/20news-bydate-train -o 20newsdataseq-out
-i : specifying the input directory
-o : specifying the output directory
mahout seq2sparse -i 20newsdataseq-out/part-m-00000 -o 20newsdatavec -lnorm -nv -wt tfidf
-i : specifying the input directory
-o : specifying the output directory
-lnorm : log-normalize the output vectors
-nv : named vectors
-wt : weighting scheme to use; here we use tfidf
mahout split -i 20newsdatavec/tfidf-vectors --trainingOutput 20newsdatatrain --testOutput 20newsdatatest --randomSelectionPct 70 --overwrite --sequenceFiles -xm sequential
--trainingOutput : training output directory
--testOutput : testing output directory
--randomSelectionPct : percentage of the data randomly selected for the test set; the remainder is used for training
--overwrite : overwrite the output folders if they exist
--sequenceFiles : treat the input files as sequence files
-xm : execution method, sequential or mapreduce
mahout trainnb -i 20newsdatatrain -o model -li lableindex -ow -c
-i : specifying the input directory
-o : specifying the output directory
-li : specifying the label index path (named lableindex here)
-ow : overwrite
-c : Complement Naive Bayes
mahout testnb -i 20newsdatatest -m model -l lableindex -ow -o results
-i : specifying the input directory
-m : specifying the model
-o : specifying the output directory
-l : specifying the label index path (named lableindex here)
-ow : overwrite
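The steps above can be collected into one script that stops at the first failure. This is a minimal sketch: DRY_RUN, run and run_pipeline are hypothetical names introduced here, not Hadoop or Mahout features, and the commands are copied unchanged from the walkthrough. With DRY_RUN=1 (the default in this sketch) the commands are only printed, so the script can be reviewed on a machine without Hadoop or Mahout installed.

```shell
#!/bin/sh
# DRY_RUN, run and run_pipeline are hypothetical helpers for this sketch.
# With DRY_RUN=1 (the default here) each command is printed instead of executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        # Stop at the first failing step so later steps never read partial output.
        "$@" || { echo "step failed: $*" >&2; exit 1; }
    fi
}

run_pipeline() {
    run hadoop fs -mkdir 20newsdata
    run hadoop fs -copyFromLocal 20news-bydate-train/ 20newsdata
    run mahout seqdirectory -i 20newsdata/20news-bydate-train -o 20newsdataseq-out
    run mahout seq2sparse -i 20newsdataseq-out/part-m-00000 -o 20newsdatavec -lnorm -nv -wt tfidf
    run mahout split -i 20newsdatavec/tfidf-vectors --trainingOutput 20newsdatatrain --testOutput 20newsdatatest --randomSelectionPct 70 --overwrite --sequenceFiles -xm sequential
    run mahout trainnb -i 20newsdatatrain -o model -li lableindex -ow -c
    run mahout testnb -i 20newsdatatest -m model -l lableindex -ow -o results
}

run_pipeline
```

To execute the commands for real, set DRY_RUN=0 in the environment before running the script.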