K-means Clustering using Apache Mahout

Mehul Katara
2021-01-16

What is Clustering ?

  • Clustering is one of the most common data analysis technique used to get an intuition about the structure of the data
  • It can be defined as the task of identifying subgroup in the data such that data points in the same subgroup (cluster) are very similar while data points in different cluster are very different.

What is K-means ?

  • Kmeans algorithm is an iterative algorithm that tries to partition the dataset into Kpre-defined distinct non-overlapping subgroup (clusters) where each data point belongs to only one group
  • The 'means' in the K-means refers to averaging of the data; that is ,finding the centroid.

How the K-means algorithm works ?

  • To process the learning data, the K-means algorithm in data mining starts with a first group of randomly selected centroids, which are used as the beginning points for every cluster, and then performs iterative (repetitive) calculations to optimize the positions of the centroids.
  • It halts creating and optimizing clusters when either :
    • The centroids have stabilized. There is no change in their values because the clustering has been successful.
    • The defined number of iterations has been achieved.

Steps to be followed in Mahout

  1. Getting the data
  2. Copping text files to the HDFD
  3. Convert our dataset into SequenceFiles
  4. Convert SequenceFiles to sparse vector file format
  5. Running k-means text clustering algorithm
  6. Interpreting the clustering final result

Steps 1

The first step is to get our dataset that will eventually represent our raw material on which we will test our clustering algorithm

https://github.com/mehulkatara/K-meansApacheMahout/raw/main/data.zip

Steps 2

After downloading our text collections locally, and in order to be able to handle it with Mahout. It's time to copy it to out HDFS.

hadoop fs -put austen-bronte ~/clustering

Steps 3

  • SequenceFile is a hadoop flat file format where each document is represented as a key-value pair.
  • Actually, Key is the document id and value is its content
mahout seqdirectory -i clustering/ -o tragedy-seqfiles -c UTF-8 -chunk 5

-i : specifying the input directory

-o : specifying the output directory

UTF-8 : specifying the encoding of our input files

-chunk : specifying the size of each block of data

Steps 4

  • In order to be able to run properly, most algorithms in text mining require a numerical representation of texts.
  • That's why, we should turn the collections of texts we had in the previous steps into numerical feature vectors.
  • Therefore, every document is represented as a vector where each element of the vector is a word and its weight respectively
mahout seq2sparse -nv -i tragedy-seqfiles -o tragedy-vectors

-i : specifying the input directory
-o : specifying the output directory
-nv : very important option that keeps the files names for later use when displaying the result of text clustering

Steps 5

  • Before passing to action by applying k-means clustering algorithm on our textual data, there is a simple step left.
  • In order to have initial centroids values, we should in the first place, run canopy clustering on our data.

https://mahout.apache.org/docs/latest/algorithms/clustering/canopy/

https://neo4j.com/docs/graph-algorithms/current/labs-algorithms/cosine/

mahout canopy -i tragedy-vectors/tf-vectors -o tragedy-vectors/tragedy-canopy-centroids -dm org.apache.mahout.common.distance.CosineDistanceMeasure -t1 1500 -t2 2000

-o : specifying the output directory

-t1 : threashold value 1

-t2 : threashold value 2

-dm : distance measure

Steps 6

  • Once we have generated initial centroids value we can finally run k-means algorithm on our documents
mahout kmeans -i tragedy-vectors/tfidf-vectors -c tragedy-canopy-centroids -o tragedy-kmeans-clusters -dm org.apache.mahout.common.distance.CosineDistanceMeasure --clustering -cl -cd 0.1 -ow -x 20 -k 10

mahout clusterdump --input tragedy-kmeans-clusters/clusters-0 --output kmeans-dump.text