Data preparation

Loading the required libraries

library(ggplot2)
library(tidyr)
library(class)
library(gmodels)
library(fpc)
library(rsconnect)
data("iris")
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Checking the dimensions of the data

dim(iris)
## [1] 150   5

Summarizing the data

summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

Exploratory Data Analysis

ggplot(iris,aes(x=Sepal.Length,y=Sepal.Width,color=Species)) + geom_jitter()

Restructuring the data for further visualization

iris_tidy <- iris %>%
  gather(key, Value, -Species) %>%
  separate(key, c("Part", "Measure"), "\\.")
head(iris_tidy)
##   Species  Part Measure Value
## 1  setosa Sepal  Length   5.1
## 2  setosa Sepal  Length   4.9
## 3  setosa Sepal  Length   4.7
## 4  setosa Sepal  Length   4.6
## 5  setosa Sepal  Length   5.0
## 6  setosa Sepal  Length   5.4
ggplot(iris_tidy,aes(x=Species,y=Value,color=Part,shape=Part))+geom_jitter()+facet_grid(. ~ Measure)

Creating a scatter plot of length and width for the 3 species of flowers in our data. Overall, the length values are larger than the width values. Compared to versicolor and virginica, the petal length and width of setosa are smaller.

ggplot(iris_tidy, aes(x=Species,y=Value,color=Part,fill=Part))+geom_col(position = "dodge")+facet_grid(~Measure)

Visualizing the same metrics as a column chart, it's fairly obvious that the length of both petal and sepal is larger than the width.

ggplot(iris, aes(x=Sepal.Width,fill=Species)) +geom_histogram(binwidth =.2,position="dodge")+labs(x='Sepal Width',y="count")

Plotting the sepal width of all 3 species as a histogram. The sepal width of versicolor ranges from roughly 2.0 to 3.4, while setosa's is concentrated between about 3.0 and 4.3. The y-axis shows the count of samples in each bin.

ggplot(iris, aes(x = Petal.Length, y = ..scaled.., fill = Species)) + geom_density(alpha = 0.4)

Plotting the density curve of petal length, scaled so each curve has a maximum of 1.

Supervised Modelling

The KNN algorithm assumes that similar things exist in close proximity; in other words, similar things are near each other. It captures the idea of similarity (sometimes called distance, proximity, or closeness) with mathematics we learned early on: calculating the distance between points on a graph.

There are other ways of calculating distance, and one way might be preferable depending on the problem we are solving. However, the straight-line distance (also called the Euclidean distance) is a popular and familiar choice.
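
For example, we can compute the straight-line (Euclidean) distance between the first two observations in iris by hand and check it against dist(); this is only an illustrative aside, not part of the model below.

# Euclidean distance between the first two iris observations,
# using only the four numeric measurements
x <- as.numeric(iris[1, 1:4])
y <- as.numeric(iris[2, 1:4])
sqrt(sum((x - y)^2))
# dist() computes the same pairwise Euclidean distance
dist(iris[1:2, 1:4])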

We have 150 observations in our dataset, which we will split into training and test sets. The training set will contain 80% of the data (120 observations) and the test set the remaining 20% (30 observations).

set.seed(13383610)
size <- nrow(iris)
shuffled <- iris[sample(size), ]                 # shuffle the rows
train_data <- shuffled[1:(0.8 * size), ]         # first 80% for training
test_data <- shuffled[(0.8 * size + 1):size, ]   # remaining 20% for testing
dim(train_data)
## [1] 120   5
dim(train_data)
## [1] 120   5

Size of training data

dim(test_data)
## [1] 30  5

Size of test data

Training the model

knn_iris <- knn(train = train_data[,-5],test= test_data[,-5],cl=train_data[,5],k=5)

knn_iris
##  [1] versicolor versicolor virginica  virginica  versicolor setosa    
##  [7] versicolor setosa     setosa     setosa     setosa     virginica 
## [13] versicolor setosa     versicolor versicolor virginica  versicolor
## [19] versicolor virginica  virginica  setosa     virginica  setosa    
## [25] virginica  versicolor virginica  setosa     setosa     versicolor
## Levels: setosa versicolor virginica

Checking the confusion matrix to see how the model has performed.

table(test_data[,5],knn_iris,dnn=c("True","Predicted")) 
##             Predicted
## True         setosa versicolor virginica
##   setosa         10          0         0
##   versicolor      0         11         0
##   virginica       0          0         9

It seems our model has performed well.

Checking the accuracy of the model over the 30 test samples, we get (10 + 11 + 9) * 100 / 30 = 100, i.e. 100% accuracy. Calculating the accuracy and misclassification rate in code:

mean(test_data[,5]==knn_iris)
## [1] 1
miserror <- sum(test_data[,5]!=knn_iris)/nrow(test_data)
miserror
## [1] 0

The misclassification rate is 1 - accuracy, which is 0 here. Plotting the distribution of predicted classes:

plot(knn_iris)

CrossTable(x=test_data[,5],y=knn_iris,prop.chisq = FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  30 
## 
##  
##                | knn_iris 
## test_data[, 5] |     setosa | versicolor |  virginica |  Row Total | 
## ---------------|------------|------------|------------|------------|
##         setosa |         10 |          0 |          0 |         10 | 
##                |      1.000 |      0.000 |      0.000 |      0.333 | 
##                |      1.000 |      0.000 |      0.000 |            | 
##                |      0.333 |      0.000 |      0.000 |            | 
## ---------------|------------|------------|------------|------------|
##     versicolor |          0 |         11 |          0 |         11 | 
##                |      0.000 |      1.000 |      0.000 |      0.367 | 
##                |      0.000 |      1.000 |      0.000 |            | 
##                |      0.000 |      0.367 |      0.000 |            | 
## ---------------|------------|------------|------------|------------|
##      virginica |          0 |          0 |          9 |          9 | 
##                |      0.000 |      0.000 |      1.000 |      0.300 | 
##                |      0.000 |      0.000 |      1.000 |            | 
##                |      0.000 |      0.000 |      0.300 |            | 
## ---------------|------------|------------|------------|------------|
##   Column Total |         10 |         11 |          9 |         30 | 
##                |      0.333 |      0.367 |      0.300 |            | 
## ---------------|------------|------------|------------|------------|
## 
## 

Here we can see a more detailed view of the confusion matrix as well as the accuracy in each category. Setosa comprises 33.3% of our test data, versicolor 36.7%, and virginica 30%. The accuracy for setosa is 1, that is 100%, with no misclassifications; the same holds for versicolor and virginica.

The overall accuracy is the sum of the diagonal table proportions: 0.333 + 0.367 + 0.300 = 1.00.

Let us train another model with the parameter k set to 10.

knn_iris2 <- knn(train = train_data[,-5],test= test_data[,-5],cl=train_data[,5],k=10)

knn_iris2
##  [1] versicolor versicolor virginica  versicolor versicolor setosa    
##  [7] versicolor setosa     setosa     setosa     setosa     virginica 
## [13] versicolor setosa     versicolor versicolor virginica  versicolor
## [19] versicolor virginica  virginica  setosa     virginica  setosa    
## [25] virginica  versicolor virginica  setosa     setosa     versicolor
## Levels: setosa versicolor virginica
table(test_data[,5],knn_iris2,dnn=c("True","Predicted")) 
##             Predicted
## True         setosa versicolor virginica
##   setosa         10          0         0
##   versicolor      0         11         0
##   virginica       0          1         8
mean(test_data[,5]==knn_iris2)
## [1] 0.9666667
miserror2 <- sum(test_data[,5]!=knn_iris2)/nrow(test_data)
miserror2
## [1] 0.03333333
CrossTable(x=test_data[,5],y=knn_iris2,prop.chisq = FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  30 
## 
##  
##                | knn_iris2 
## test_data[, 5] |     setosa | versicolor |  virginica |  Row Total | 
## ---------------|------------|------------|------------|------------|
##         setosa |         10 |          0 |          0 |         10 | 
##                |      1.000 |      0.000 |      0.000 |      0.333 | 
##                |      1.000 |      0.000 |      0.000 |            | 
##                |      0.333 |      0.000 |      0.000 |            | 
## ---------------|------------|------------|------------|------------|
##     versicolor |          0 |         11 |          0 |         11 | 
##                |      0.000 |      1.000 |      0.000 |      0.367 | 
##                |      0.000 |      0.917 |      0.000 |            | 
##                |      0.000 |      0.367 |      0.000 |            | 
## ---------------|------------|------------|------------|------------|
##      virginica |          0 |          1 |          8 |          9 | 
##                |      0.000 |      0.111 |      0.889 |      0.300 | 
##                |      0.000 |      0.083 |      1.000 |            | 
##                |      0.000 |      0.033 |      0.267 |            | 
## ---------------|------------|------------|------------|------------|
##   Column Total |         10 |         12 |          8 |         30 | 
##                |      0.333 |      0.400 |      0.267 |            | 
## ---------------|------------|------------|------------|------------|
## 
## 

Here we again have 100% accuracy for setosa and versicolor, but one virginica sample is predicted as versicolor, so virginica's row accuracy drops to 88.9% (the 91.7% in the table is versicolor's column proportion, 11/12).

The overall accuracy is the sum of the diagonal table proportions: 0.333 + 0.367 + 0.267 = 0.967.
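
Rather than picking k by hand, we could loop over a range of values and compare test accuracy on the same split; this is only a rough sketch, and the best k will vary with the random split.

# Evaluate test accuracy for k = 1..15 on the existing train/test split
accuracy_by_k <- sapply(1:15, function(k) {
  pred <- knn(train = train_data[, -5], test = test_data[, -5],
              cl = train_data[, 5], k = k)
  mean(pred == test_data[, 5])
})
plot(1:15, accuracy_by_k, type = "b", xlab = "k", ylab = "Test accuracy")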

K-means Clustering

K-means clustering is an unsupervised machine learning algorithm. It groups similar data points together and discovers underlying patterns by identifying a fixed number (K) of clusters in the dataset. 'Means' refers to the averaging of the data, i.e. finding the cluster centers. K is the number of centroids we need in the dataset; a centroid is an imaginary or real location representing the center of a cluster.

Process:

  1. Start with a first group of randomly selected centroids, which are the starting points.

  2. Perform iterative (repetitive) calculations to optimize the positions of the centroids.

  3. The process stops when the centroids have stabilized, i.e. the values don't change with further iterations, or when the defined number of iterations is reached.

The bigger the value of K, the lower the variance within the groups in the clustering. If K is equal to the number of observations, then each point is its own group and the variance is 0, so it's necessary to find an optimum number of clusters. Variance within a group measures how different the members of the group are; a large variance shows that there's more dissimilarity within the groups.
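
To make the process concrete, here is a rough sketch of a single assignment-and-update iteration on the iris measurements; it is for illustration only, and the kmeans() call below does all of this (repeatedly, with multiple starts) for us.

# One k-means iteration by hand (illustration only)
X <- as.matrix(iris[, 1:4])
set.seed(1)
centroids <- X[sample(nrow(X), 3), ]   # step 1: pick 3 random points as starting centroids

# step 2: assign every observation to its nearest centroid (Euclidean distance)
d <- as.matrix(dist(rbind(centroids, X)))[-(1:3), 1:3]
assignment <- apply(d, 1, which.min)

# step 3: recompute each centroid as the mean of its assigned points
centroids <- apply(X, 2, function(col) tapply(col, assignment, mean))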

set.seed(13383610)
input <- iris[,1:4]
kmeans_fit<-kmeans(input, centers = 3, nstart = 20)
kmeans_fit
## K-means clustering with 3 clusters of sizes 62, 50, 38
## 
## Cluster means:
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     5.901613    2.748387     4.393548    1.433871
## 2     5.006000    3.428000     1.462000    0.246000
## 3     6.850000    3.073684     5.742105    2.071053
## 
## Clustering vector:
##   [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [38] 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [75] 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 3 3 3 3 1 3 3 3 3
## [112] 3 3 1 1 3 3 3 3 1 3 1 3 1 3 3 1 1 3 3 3 3 3 1 3 3 3 3 1 3 3 3 1 3 3 3 1 3
## [149] 3 1
## 
## Within cluster sum of squares by cluster:
## [1] 39.82097 15.15100 23.87947
##  (between_SS / total_SS =  88.4 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

The kmeans() function outputs the results of the clustering. Each observation is allocated to a cluster whose mean is reported, and the percentage (88.4%) represents the compactness of the clustering, i.e. how similar the members within the same group are. If all the observations within a group were at the same exact point in the n-dimensional space, we would achieve 100% compactness.
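
The compactness figure reported above can also be computed directly from components of the returned object, as a quick sanity check.

# between_SS / total_SS, the compactness percentage printed by kmeans()
kmeans_fit$betweenss / kmeans_fit$totss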

plotcluster(input, kmeans_fit$cluster)   # discriminant-coordinate plot of the 3 clusters

table(kmeans_fit$cluster, iris$Species)
##    
##     setosa versicolor virginica
##   1      0         48        14
##   2     50          0         0
##   3      0          2        36

As we can see, the data belonging to the setosa species got grouped into cluster 2, versicolor mostly into cluster 1, and virginica mostly into cluster 3. The algorithm placed 2 data points belonging to versicolor into the virginica cluster and 14 data points belonging to virginica into the versicolor cluster.

Let's plot a chart showing the "within sum of squares" by the number of groups (K value). The within sum of squares is a metric that shows the dissimilarity within members of a group: the greater the sum, the greater the dissimilarity.

wssplot <- function(input, nc = 15, seed = 13383610) {
  # within-group sum of squares for k = 1 is just the total variance
  wss <- (nrow(input) - 1) * sum(apply(input, 2, var))
  for (i in 2:nc) {
    set.seed(seed)
    wss[i] <- sum(kmeans(input, centers = i)$withinss)
  }
  plot(1:nc, wss, type = "b", xlab = "Number of groups",
       ylab = "Sum of squares within a group")
}

wssplot(input, nc = 20)

We can see that going from K = 3 to K = 4 there is still a noticeable decrease in the sum of squares, which means dissimilarity will decrease and compactness will increase if we take K = 4. So let's choose K = 4 and run K-means again.

kmeans_fit2<-kmeans(input, centers = 4, nstart = 20)
kmeans_fit2
## K-means clustering with 4 clusters of sizes 40, 32, 28, 50
## 
## Cluster means:
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     6.252500    2.855000     4.815000    1.625000
## 2     6.912500    3.100000     5.846875    2.131250
## 3     5.532143    2.635714     3.960714    1.228571
## 4     5.006000    3.428000     1.462000    0.246000
## 
## Clustering vector:
##   [1] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
##  [38] 4 4 4 4 4 4 4 4 4 4 4 4 4 1 1 1 3 1 3 1 3 1 3 3 3 3 1 3 1 3 3 1 3 1 3 1 1
##  [75] 1 1 1 1 1 3 3 3 3 1 3 1 1 1 3 3 3 1 3 3 3 3 3 1 3 3 2 1 2 2 2 2 3 2 2 2 1
## [112] 1 2 1 1 2 2 2 2 1 2 1 2 1 2 2 1 1 2 2 2 2 2 1 1 2 2 2 1 2 2 2 1 2 2 2 1 1
## [149] 2 1
## 
## Within cluster sum of squares by cluster:
## [1] 13.624750 18.703437  9.749286 15.151000
##  (between_SS / total_SS =  91.6 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

Using 3 groups (K = 3) we had 88.4% of well-grouped data; using 4 groups (K = 4) that value rises to 91.6%, which is a good value for us.

Hierarchical Clustering

Hierarchical clustering is an alternative approach to clustering that builds a hierarchy from the bottom up and doesn't require us to specify the number of clusters beforehand. There are two types of hierarchical clustering:

  1. Agglomerative: each data point is initially considered a separate cluster, and at each iteration the most similar clusters merge until one cluster, or the desired number of clusters, is formed.

  2. Divisive: the opposite of agglomerative clustering. All data points start in a single cluster and are then split further until we get the desired number of clusters (see the sketch after this item).
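
hclust(), which we use below, implements the agglomerative approach. For the divisive variant, one option is the diana() function from the cluster package; the short sketch below assumes that package is installed and is not used further in this analysis.

library(cluster)
# Divisive hierarchical clustering (DIANA) on the four numeric measurements
diana_fit <- diana(iris[, 1:4])
# convert to an hclust object so we can cut the tree into 3 clusters
diana_cut <- cutree(as.hclust(diana_fit), k = 3)
table(diana_cut, iris$Species)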

Process:

  1. Compute the proximity/dissimilarity/distance matrix. This is the backbone of our clustering: a mathematical expression of how different or distant the data points are from each other.

  2. There are many ways to calculate dissimilarity between clusters; these are the linkage methods (mapped to hclust() arguments in the sketch after this list).
    1. MIN
    2. MAX
    3. Group Average
    4. Ward's method
  3. Let each data point be a cluster.

  4. Merge the 2 closest clusters based on the distances in the distance matrix; as a result the number of clusters decreases by 1.

  5. Update the proximity/distance matrix and repeat step 4 until the desired number of clusters remains.
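
In hclust(), these linkage choices correspond roughly to the method argument: single for MIN, complete for MAX, average for group average, and ward.D2 for Ward's method. For example:

d_iris <- dist(iris[, 1:4])
hc_single   <- hclust(d_iris, method = "single")    # MIN
hc_complete <- hclust(d_iris, method = "complete")  # MAX (the hclust default)
hc_average  <- hclust(d_iris, method = "average")   # group average
hc_ward     <- hclust(d_iris, method = "ward.D2")   # Ward's method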

Let us see how well the hierarchical clustering algorithm performs on our dataset. We will use hclust(), which requires the data in the form of a distance matrix; we create this with dist().

clusters <- hclust(dist(iris[, 1:4]))

In hierarchical clustering, we categorize the objects into a hierarchy similar to a tree-like diagram which is called a dendrogram.

plot(clusters, xlab = "Clusters", ylab = "Height of dendrogram")

We'll cut our dendrogram into 3 clusters and check how it performs.

clusterCut <- cutree(clusters, 3)
table(clusterCut, iris$Species)
##           
## clusterCut setosa versicolor virginica
##          1     50          0         0
##          2      0         23        49
##          3      0         27         1

It looks like the algorithm successfully classified all the flowers of species setosa into cluster 1 and most of virginica into cluster 2, but versicolor is split between clusters 2 and 3.

Let us see if we can do better by using a different linkage method. This time, we will use the average (mean) linkage method.

clusters2 <- hclust(dist(iris[, 1:4]), method = 'average')
plot(clusters2, xlab = "Clusters", ylab = "Height of dendrogram")

Next, we'll cut the dendrogram to create the desired number of clusters. Since we already know there are three species, we choose k = 3 and use the cutree() function.

clusterCut2 <- cutree(clusters2, k = 3)
plot(clusters2, xlab = "Clusters", ylab = "Height of dendrogram")
rect.hclust(clusters2, k = 3, border = 2:6)
abline(h = 3, col = 'red')

table(clusterCut2, iris$Species)
##            
## clusterCut2 setosa versicolor virginica
##           1     50          0         0
##           2      0         50        14
##           3      0          0        36

We can see that this time the algorithm did a little better, but it still has a problem classifying virginica properly: 14 virginica samples fall into the versicolor cluster.