The aim of this document is to introduce the idea of the k-means clustering method. The first part contains a theoretical introduction with the mathematical basis, and the second part consists of coding examples using the R language. To show the application of this method, the Chatterjee-Price Attitude dataset is used.
The k-means algorithm belongs to the realm of unsupervised learning. To define unsupervised learning properly, let us first focus on understanding what supervised learning is. Consider a set of pairs \((x_i, y_i)_{i \in (1..n)}\), where \(x_i\) is a vector of features (explanatory variables) and \(y_i\) is the response variable, the one we try to predict. Such a set is called a learning sequence. It is natural to define two statistical spaces \(\chi_x\) and \(\chi_y\), which are necessary for defining the concept of a random variable understood as a single measurement, e.g. when one collects the pair \((x_1, y_1)\). The aim of building a predictive model is to find the conditional probability distribution of the target vector \(Y\) given the set of features \(X\): \(P(Y|X)\).
When it comes to unsupervised learning, we do not have any information about \(Y\) (the "teacher"); therefore our aim is to find relevant patterns in the set \((x_i)_{i \in (1..n)}\) alone. Unsupervised methods can be split into many categories, e.g. clustering, dimensionality reduction and density estimation; the one relevant for this report is clustering.
Now, in order to understand the main idea of the algorithm introduced below, we need to define some mathematical concepts. The most important one is the metric function.
Definition 1 A function \(F: \chi \times \chi \rightarrow R_+\) is called a metric function if the following conditions are fulfilled for all \(x, y, z \in \chi\): (1) \(F(x,y) = 0\) if and only if \(x = y\); (2) \(F(x,y) = F(y,x)\) (symmetry); (3) \(F(x,z) \le F(x,y) + F(y,z)\) (triangle inequality).
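For example, the Euclidean distance used later in this report satisfies these conditions for d-dimensional vectors: \[F(x, y) = ||x - y|| = \sqrt{\sum_{j=1}^{d}(x_j - y_j)^2}.\]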
For the purpose of better understanding the above definition, one can associate the space \(\chi\) with the learning set \(X\). Now we can proceed with specifying the details of the k-means algorithm. Given a set of observations \((x_1, ..., x_n) \in X\), where each observation is a d-dimensional vector, the k-means algorithm aims to partition the n observations into k (\(\le n\)) clusters (groups) so as to minimize the within-cluster sum of squares. Formally, the objective is to find: \[argmin_S \sum^k_{i=1} \sum_{x \in S_i} ||x-\mu_i ||^2\] where \(S = \{S_1, ..., S_k\}\) is a partition of the observations into k clusters and \(\mu_i\) is the mean of the points within cluster \(S_i\). The term \(||x-\mu_i||\) denotes the distance (the metric function F, here the Euclidean distance) between a single point \(x\) and the mean of all points in its cluster. To sum up, the algorithm seeks a stratification of the feature space that minimizes the within-cluster sum of squares. The whole procedure can be divided into the following steps: (1) choose k initial cluster centres, e.g. by picking k observations at random; (2) assign every observation to its nearest centre; (3) recompute each centre as the mean of the observations assigned to it; (4) repeat steps 2 and 3 until the assignments no longer change.
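As a rough illustration of these steps, the sketch below implements them directly in base R, assuming Euclidean distance and ignoring edge cases such as empty clusters; the function name simple_kmeans is introduced here only for this example, and in practice one should use the kmeans() function from the stats package, as we do later.
simple_kmeans <- function(X, k, max_iter = 100) {
  X <- as.matrix(X)
  # step 1: pick k observations at random as the initial cluster centres
  centers <- X[sample(nrow(X), k), , drop = FALSE]
  assignment <- rep(0, nrow(X))
  for (iter in 1:max_iter) {
    # step 2: assign every observation to its nearest centre (squared Euclidean distance)
    dists <- sapply(1:k, function(j) colSums((t(X) - centers[j, ])^2))
    new_assignment <- max.col(-dists)
    # step 4: stop once the assignments no longer change
    if (all(new_assignment == assignment)) break
    assignment <- new_assignment
    # step 3: recompute each centre as the mean of its assigned points
    centers <- t(sapply(1:k, function(j) colMeans(X[assignment == j, , drop = FALSE])))
  }
  list(cluster = assignment, centers = centers)
}
For the pair of variables analysed later one could call, for instance, simple_kmeans(attitude[, c(3, 4)], 3).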
Now let us see how the algorithm works on a specific dataset. We will use the attitude dataset, which can be found in the datasets package. This dataset is a survey of clerical employees of a large financial organization. The data are aggregated from questionnaires of approximately 35 employees for each of 30 randomly selected departments. The numbers give the percentage of favourable responses to seven questions in each department. Let's read the data and check its structure:
# load the datasets package; install it first if it is missing
if (!require(datasets)) {
  install.packages("datasets")
  library(datasets)
}
head(attitude)
We are dealing with a data frame of 30 observations and 7 columns. Each observation corresponds to one department and records the percentage of favourable responses to each of the seven questions. Now let's check the data summary:
summary(attitude)
## rating complaints privileges learning
## Min. :40.00 Min. :37.0 Min. :30.00 Min. :34.00
## 1st Qu.:58.75 1st Qu.:58.5 1st Qu.:45.00 1st Qu.:47.00
## Median :65.50 Median :65.0 Median :51.50 Median :56.50
## Mean :64.63 Mean :66.6 Mean :53.13 Mean :56.37
## 3rd Qu.:71.75 3rd Qu.:77.0 3rd Qu.:62.50 3rd Qu.:66.75
## Max. :85.00 Max. :90.0 Max. :83.00 Max. :75.00
## raises critical advance
## Min. :43.00 Min. :49.00 Min. :25.00
## 1st Qu.:58.25 1st Qu.:69.25 1st Qu.:35.00
## Median :63.50 Median :77.50 Median :41.00
## Mean :64.63 Mean :74.77 Mean :42.93
## 3rd Qu.:71.00 3rd Qu.:80.00 3rd Qu.:47.75
## Max. :88.00 Max. :92.00 Max. :72.00
As mentioned previously, these data give the percentage of favourable responses for each department. For example, one department had only 30% favourable responses when it came to assessing 'privileges', another had 83% favourable responses on the same question, and the remaining departments fall in between. Before applying a clustering algorithm we should be aware that this problem is multidimensional. We could reduce the number of dimensions, e.g. using principal component analysis or other feature-selection methods suitable when no response variable is given.
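Such a reduction is not pursued in this report, but as a rough sketch it could be done with the prcomp() function from the stats package:
pca <- prcomp(attitude, scale. = TRUE)  # principal component analysis on all seven variables
summary(pca)                            # proportion of variance explained by each component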
Let's suppose that we want to investigate the existence of any structure between variables, i.e. whether single observations tend to group in any interpretable way when considered in a two-dimensional space (spanned by two chosen variables, so that the result can be visualized properly). We can cluster the dataset with respect to the privileges and learning responses. Let's start with a simple scatter plot:
# keep only the privileges (column 3) and learning (column 4) responses
data_of_interest <- attitude[, c(3, 4)]
plot(data_of_interest,
     main = "Responses to privileges and learning",
     col = "black", pch = 20, cex = 2)
Now, in order to find a possible cluster setup, we can use the previously defined k-means algorithm. The kmeans() function implementing it is available in the stats package, which ships with base R. Let's check its usage. One should remember to set the random seed to obtain reproducible results, because the first step of the algorithm selects the cluster centres randomly. We start with three clusters.
set.seed(42)  # fixed seed so that the random initialisation is reproducible
k_init <- kmeans(data_of_interest, 3, nstart = 100)
plot(data_of_interest, col = (k_init$cluster + 1),
     main = "K-Means result with 3 clusters", pch = 20, cex = 2)
And for \(k=2\):
k_init <- kmeans(data_of_interest, 2, nstart = 100)
plot(data_of_interest, col = (k_init$cluster + 1),
     main = "K-Means result with 2 clusters", pch = 20, cex = 2)
One can notice that we chose the number of clusters without any deeper analysis. There is no unique approach to selecting this parameter optimally. One of the most commonly used criteria is the elbow method. It is based on the value of the within-cluster sum of squares which, as mentioned in the theoretical preface, is minimized in consecutive iterations of the algorithm. The method examines the clustering cost as a function of the number of clusters. Below we can see this relationship for the selected pair of variables.
N_OF_CLUSTERS <- 20
# WGSS for k = 1 equals the total sum of squares of the data
wss <- (nrow(data_of_interest) - 1) * sum(apply(data_of_interest, 2, var))
for (i in 2:N_OF_CLUSTERS) {
  wss[i] <- sum(kmeans(data_of_interest, centers = i)$withinss)
}
plot(1:N_OF_CLUSTERS, wss, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares",
main="WGSS vs. number of clusters",
pch=20, cex=2)
As expected, the WGSS tends to decrease as the number of clusters increases. One can notice that there is no substantial difference in WGSS when adding more than 6 clusters; any further improvements are negligible. Hence, we can now check how our space looks when the N_OF_CLUSTERS variable is set to 6.
N_OF_CLUSTERS <- 6
k_init <- kmeans(data_of_interest, N_OF_CLUSTERS, nstart = 100)
plot(data_of_interest, col = (k_init$cluster + 1),
     main = paste("K-Means result with", N_OF_CLUSTERS, "clusters"),
     pch = 20, cex = 2)
We can distinguish a well-defined set of groups of departments that are relatively distinct when it comes to answering favourably about privileges and learning in the survey. In further steps of the analysis one can investigate relationships between the other variables and try to find the ones that explain the rating variable best.
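As a rough sketch of such a next step (not carried out in this report), one could, for instance, relate the rating variable to the remaining questions with a simple linear model:
fit <- lm(rating ~ ., data = attitude)  # rating explained by all other survey questions
summary(fit)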
In this report we introduced the reader to the idea of k-means clustering. Both the mathematical introduction and the provided code snippets should be useful in further exploration of this method.