Have you ever organized your library so that books on the same subject sit on the same shelf or in the same block? You probably have, which means you already know how to group similar objects together. Although the idea is very simple, the number of use cases it touches is enormous. In the machine learning literature this is called clustering: automatically grouping similar objects together.
Clustering is one of the most common exploratory data analysis techniques used to gain insight into the structure of the data. It can be defined as the task of identifying subgroups in the data such that data points in the same subgroup (cluster) are very similar, while data points in different clusters are very different. In this article, we will look at the k-means algorithm, one of the most widely used clustering algorithms thanks to its simplicity.
One of the oldest clustering algorithms, k-means was introduced by J. B. MacQueen in 1967 (MacQueen, 1967).
The k-means clustering algorithm is one of the most widely used algorithms in data mining. It is a clustering algorithm, not a classification algorithm: classification assigns records to classes that are known in advance, whereas clustering algorithms automatically separate the data into smaller subsets without predefined labels. The algorithm puts statistically similar records into the same group, and each element is allowed to belong to only one cluster. Each cluster is represented by a single value at its center.
The letter “k” in the name of the algorithm indicates the number of clusters, which is chosen in advance. The algorithm searches for the assignment of points to clusters that minimizes the squared error function commonly used in error calculation: the given data set of “n” points is placed into “k” clusters in a way that minimizes this error function. For this reason, cluster similarity is measured by how close the values in a cluster are to their mean, which is the cluster’s center of gravity. The value at the center of the cluster is the representative value of the cluster and is called the centroid.
The two most important goals here are:
1- The values within each cluster should be as similar to each other as possible,
2- The clusters themselves should be as dissimilar from each other as possible.
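Both goals can be checked numerically. The following is a minimal sketch on made-up two-dimensional data (the `points` matrix and its two blobs are assumptions for illustration, not part of the image example). It recomputes the squared error by hand from the cluster centers and compares it with the `tot.withinss` value reported by base R's `kmeans()`; `betweenss` measures how far apart the clusters are.

```r
# Hypothetical toy data: two well-separated blobs of 50 points each
set.seed(42)
points <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 5, sd = 0.5), ncol = 2)
)

# nstart = 10 repeats the random initialisation and keeps the best run
fit <- kmeans(points, centers = 2, nstart = 10)

# Squared error by hand: squared distance of every point to its own cluster center
assigned_centers <- fit$centers[fit$cluster, , drop = FALSE]
squared_error <- sum((points - assigned_centers)^2)

# Goal 1: small within-cluster error; goal 2: large between-cluster separation
within_ss  <- fit$tot.withinss
between_ss <- fit$betweenss

print(c(by_hand = squared_error, within = within_ss, between = between_ss))
```

The hand-computed value should agree with `tot.withinss`, and on data like this the between-cluster sum of squares should be much larger than the within-cluster one.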
Here, rather than applying k-means clustering to a tabular dataset, we will replicate an existing tutorial on clustering the colours of an image, to better understand how much we can benefit from k-means clustering.
Importing the image:
library(jpeg)
library(ggplot2)
library(raster)
# Read the JPEG as a multi-band RasterStack (one layer per colour channel)
image <- stack("AustraliaFire.jpg")
Let’s display the image so we can see which areas are burning.
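One way to draw a three-band raster as a colour picture is `plotRGB()` from the raster package. The sketch below is an assumption about how the display step could look, not the author's own code. It builds a small random stand-in raster (`demo_image` and `make_band` are hypothetical names) so that it runs without the photo; with the real data, the same call would be applied to `image`.

```r
library(raster)

# Hypothetical stand-in for the photo: three 20 x 20 bands with random 0-255 values
set.seed(1)
make_band <- function() {
  band <- raster(nrows = 20, ncols = 20)
  setValues(band, sample(0:255, ncell(band), replace = TRUE))
}
demo_image <- stack(make_band(), make_band(), make_band())

# Draw layers 1, 2 and 3 as the red, green and blue channels
plotRGB(demo_image, r = 1, g = 2, b = 3)
```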
Performing k-means clustering on the image values (each pixel)
For 2 cluster centers:
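# image[] returns a matrix with one row per pixel and one column per band
kMeansResult <- kmeans(image[], centers=2)
# Create an empty layer with the same geometry as the first band
result <- raster(image[[1]])
# Fill it with the cluster label of each pixel
result <- setValues(result, kMeansResult$cluster)
plot(result)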
If we compare the result above with the original image, we see that some terrain is mixed in with the burning areas, so a better k should be selected.
For 3 cluster centers:
kMeansResult <- kmeans(image[], centers=3)
result <- raster(image[[1]])
result <- setValues(result, kMeansResult$cluster)
plot(result)
As the plot above shows, k=3 separates the burning areas almost perfectly. However, we need a better visualization here:
plot(result, col=c("darkgreen", "blue","red"))
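The plot above shows the burning areas more clearly. Let’s find out what percentage of the whole image contains fire according to our clustering model. In this run the burning areas ended up in cluster 3; k-means numbers its clusters arbitrarily, so the label can differ between runs.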
mean(kMeansResult$cluster == 3) * 100
## [1] 2.134545
So our model estimates that about 2.13% of the total captured area is burning.
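Because cluster numbers are assigned arbitrarily, hard-coding a label is fragile. The sketch below shows one possible way to pick the fire cluster from the cluster centers instead. It runs on made-up pixel values (the `pixels` matrix, its colour values and the "highest red mean" rule are all assumptions for illustration), not on the actual photo.

```r
# Hypothetical pixel matrix: one row per pixel, columns are red, green, blue
set.seed(7)
n_fire <- 20; n_land <- 480; n_sea <- 500
pixels <- rbind(
  cbind(rnorm(n_fire, 230, 5), rnorm(n_fire, 90, 5),  rnorm(n_fire, 30, 5)),   # orange "fire"
  cbind(rnorm(n_land, 60, 5),  rnorm(n_land, 110, 5), rnorm(n_land, 50, 5)),   # green terrain
  cbind(rnorm(n_sea, 30, 5),   rnorm(n_sea, 60, 5),   rnorm(n_sea, 150, 5))    # blue water
)
colnames(pixels) <- c("red", "green", "blue")

# nstart = 25 repeats the random initialisation and keeps the best run
fit <- kmeans(pixels, centers = 3, nstart = 25)

# Assumed rule: the fire cluster is the one whose center has the highest red value
fire_cluster <- which.max(fit$centers[, "red"])

# Percentage of pixels per cluster, and the share of the fire cluster
shares <- prop.table(table(fit$cluster)) * 100
fire_share <- mean(fit$cluster == fire_cluster) * 100

print(shares)
print(fire_share)
```

On a real image the rule for recognising the fire cluster would need to be checked against the picture, since bright terrain or clouds could also have high red values.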
Aside from its applications to text-based data, k-means clustering is also useful for image segmentation, even in critical cases such as this one. Here, though, we had the advantage of being able to decide visually which k works better.
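When a visual check is not possible, one common heuristic (not used in the original tutorial) is the so-called elbow method: run k-means for several values of k and look for the point after which the total within-cluster squared error stops dropping sharply. The sketch below is a minimal illustration on made-up data; the `points` matrix with three blobs is an assumption.

```r
# Hypothetical data: three blobs in two dimensions, 100 points each
set.seed(123)
points <- rbind(
  matrix(rnorm(200, mean = 0, sd = 0.6), ncol = 2),
  matrix(rnorm(200, mean = 4, sd = 0.6), ncol = 2),
  matrix(rnorm(200, mean = 8, sd = 0.6), ncol = 2)
)

# Total within-cluster squared error for k = 1..6
ks <- 1:6
wss <- sapply(ks, function(k) {
  kmeans(points, centers = k, nstart = 20)$tot.withinss
})

# The "elbow" is the k after which the curve flattens out
plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster squared error")
```

For data like this the curve should drop steeply up to k=3 and flatten afterwards. On real images the elbow is often less clear, so it is best treated as a hint rather than a rule.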
Adapted from: https://www.gis-blog.com/unsupervised-kmeans-classification-of-satellite-imagery-using-r/