Mehul Katara
2021-01-16
The first step is to get the dataset that will serve as the raw material on which we test our clustering algorithm:
https://github.com/mehulkatara/K-meansApacheMahout/raw/main/data.zip
After downloading the text collection locally, we copy it to HDFS so that Mahout can process it:
hadoop fs -put austen-bronte clustering
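To confirm the upload landed where the next command expects it (the clustering directory under your HDFS home, typically /user/<your-user>), you can list it:
hadoop fs -ls clustering
Next, seqdirectory converts the raw text files into Hadoop SequenceFiles, the format Mahout's vectorization step expects: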
mahout seqdirectory -i clustering/ -o tragedy-seqfiles -c UTF-8 -chunk 5
-i : specifying the input directory
-o : specifying the output directory
-c : specifying the encoding of the input files (UTF-8 here)
-chunk : specifying the size, in MB, of each chunk of data
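If you want to inspect what seqdirectory produced, Mahout's seqdumper utility prints the keys (document paths) and values (document text) of the sequence files to the console:
mahout seqdumper -i tragedy-seqfiles | head
Then seq2sparse tokenizes the documents and turns them into weighted term vectors: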
mahout seq2sparse -nv -i tragedy-seqfiles -o tragedy-vectors
-i : specifying the input directory
-o : specifying the output directory
-nv : an important option that produces named vectors, preserving the file names for later use when displaying the clustering results
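Among the directories seq2sparse creates, you should see tf-vectors, tfidf-vectors, and a dictionary file (dictionary.file-0), all of which we will use below:
hadoop fs -ls tragedy-vectors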
Rather than picking the initial centroids by hand, we let canopy clustering generate them, using the cosine distance to compare documents. For background on both, see:
https://mahout.apache.org/docs/latest/algorithms/clustering/canopy/
https://neo4j.com/docs/graph-algorithms/current/labs-algorithms/cosine/
mahout canopy -i tragedy-vectors/tfidf-vectors -o tragedy-canopy-centroids -dm org.apache.mahout.common.distance.CosineDistanceMeasure -t1 2000 -t2 1500
-i : specifying the input directory
-o : specifying the output directory
-t1 : threshold value 1 (the loose, outer distance; must be greater than t2)
-t2 : threshold value 2 (the tight, inner distance)
-dm : the distance measure class to use
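Depending on the Mahout version, canopy writes its centroids to a clusters-0-final (or clusters-0) subdirectory; it is worth checking before seeding k-means:
hadoop fs -ls tragedy-canopy-centroids
Now we run k-means on the TF-IDF vectors, seeded with the canopy centroids: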
mahout kmeans -i tragedy-vectors/tfidf-vectors -c tragedy-canopy-centroids/clusters-0-final -o tragedy-kmeans-clusters -dm org.apache.mahout.common.distance.CosineDistanceMeasure -cl -cd 0.1 -ow -x 20
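-i : specifying the input directory (the TF-IDF vectors)
-c : specifying the initial centroids, here the ones produced by canopy
-o : specifying the output directory
-cl : running a final clustering pass that assigns each document to a cluster (written to the clusteredPoints subdirectory)
-cd : the convergence delta
-ow : overwriting the output directory if it already exists
-x : the maximum number of iterations
Note that -k is deliberately not passed: when -k is specified, Mahout replaces the contents of the -c directory with k randomly sampled seed points, which would throw away the canopy centroids.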
Finally, clusterdump writes the resulting clusters to a local text file. Thanks to the -nv option used earlier, passing the dictionary and the clustered points lets us see the top terms of each cluster and which file landed in which cluster (replace N with the number of the final iteration directory that k-means produced, e.g. clusters-3-final):
mahout clusterdump --input tragedy-kmeans-clusters/clusters-N-final --pointsDir tragedy-kmeans-clusters/clusteredPoints --dictionary tragedy-vectors/dictionary.file-0 --dictionaryType sequencefile --output kmeans-dump.txt
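The dump is a plain local file, so a quick look is enough to check the result:
head -50 kmeans-dump.txt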