library(ggplot2) #create plots
library(gridExtra) #plot output
library(purrr) #map functions. Much, much better than apply/sapply
#library(cluster) #neat PCA-based visualization of clusters
library(caret)
library(clue) #contains kmedoids algorithm
library(lpSolve) #dependency of clue above
library(dplyr) #data manipulation. Always load last
theme_set(theme_light()) #set ggplot2 theme globally
Imagine that you are a large retailer interested in understanding your customer base. There may be several “types” of customers, such as those shopping for business with corporate accounts, those shopping for leisure, or debt-strapped grad students. Each of these groups exhibits different behavior and should be treated differently statistically. But how can a customer’s “type” be defined? For customer data sets in the millions, one can imagine how challenging this problem becomes.
Statistically speaking, this would be an example of a heterogeneous or mixture distribution problem. Cluster analysis, or data segmentation, seeks to group a collection of objects into homogeneous subsets. In the customer example, this means that the corporate shoppers would go into bucket “A”, the leisure shoppers into bucket “B”, and so on.
There are several clustering algorithms which aim to accomplish this task.
Kmeans takes continuous data and assigns each observation to one of k clusters, or groups. In two dimensions, this amounts to drawing boundaries around nearby points and giving each resulting group a label.
The kmeans algorithm consists of four steps:
1. Choose k initial cluster centers (for example, k randomly selected observations).
2. Assign each observation to its nearest center.
3. Recompute each center as the mean of the observations assigned to it.
4. Repeat steps 2 and 3 until the assignments no longer change.
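To make these steps concrete, here is a minimal from-scratch sketch of the loop (often called Lloyd’s algorithm). The function name simple_kmeans and its internals are made up for illustration and assume no cluster ever ends up empty; in the rest of this post we use base R’s kmeans().
#minimal sketch of the four steps above; illustration only, not used later
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  #step 1: pick k observations at random as the initial centers
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  for (i in seq_len(max_iter)) {
    #step 2: assign each observation to its nearest center (Euclidean distance)
    d <- as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k]
    cluster <- apply(d, 1, which.min)
    #step 3: recompute each center as the mean of its assigned observations
    #(assumes every cluster keeps at least one observation)
    new_centers <- apply(x, 2, function(col) tapply(col, cluster, mean))
    #step 4: stop once the centers no longer move
    if (all(abs(new_centers - centers) < 1e-8)) break
    centers <- new_centers
  }
  list(cluster = cluster, centers = centers)
}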
Example 1
To begin, we create sample data with two clusters.
set.seed(2)
dummy_data <- data_frame(x = rnorm(n = 50),
                         y = rnorm(n = 50)) %>%
  mutate(x = ifelse(row_number() >= 25, x + 2, x - 4),
         y = ifelse(row_number() >= 10, y - 3, y))
This is a dataframe with the following vectors \(X\) and \(Y\) as columns.
A plot shows that there are two distinct groupings, which is how the data was constructed. In other words, the mean of the data changes for different observation indices, \(1\leq i\leq n\). The mixture distributions are defined from the above code as follows.
For the first column of simulated data,
\[X_i \sim \begin{cases} N(\mu = 2, \sigma = 1), & i \geq 25 \\ N(\mu = -4, \sigma = 1), & i < 25 \end{cases}\]
And for the second column,
\[Y_i \sim \begin{cases} N(\mu = -3, \sigma = 1), & i \geq 10 \\ N(\mu = 0, \sigma = 1), & i < 10 \end{cases}\]
These points can be plotted.
dummy_data %>%
  ggplot(aes(x, y)) +
  geom_point() +
  ggtitle("Example 1: Simulated Data")
The goal of clustering is to find these groups. In this example, we know that there should be 2 clusters, so the input is specified as centers = 2. In more complicated examples, this needs to be determined by trial and error.
km1 <- kmeans(x = as.matrix(dummy_data),
              centers = 2)
dummy_data %>%
  mutate(cluster = as.factor(km1$cluster)) %>%
  ggplot(aes(x, y, color = cluster)) +
  geom_point() +
  ggtitle("Example 1: Simulated Data with Cluster Assignments")
In the next example, the number of clusters is less well defined. Should it be 2 or 3? Let’s see how the algorithm performs.
Example 2
#nested ifelse statements make the indentation a little ugly; bear with it for now
dummy_data <- data_frame(x1 = rnorm(n = 50), x2 = rnorm(n = 50)) %>%
  mutate(x1 = ifelse(row_number() <= 15,
                     yes = x1 + 2,
                     no = ifelse(row_number() <= 25,
                                 yes = x1 + 3,
                                 no = x1 - 2)),
         x2 = ifelse(row_number() >= 10, x2 - 3, x2))
dummy_data %>%
  ggplot(aes(x1, x2)) +
  geom_point()
Because we created the data, we know immediately that k should be 3, though due to the randomness of the simulation, how distinct the groups look can change with each new draw.
km11 <- kmeans(x = as.matrix(dummy_data),
               centers = 2)
p1 <- dummy_data %>%
  mutate(cluster = as.factor(km11$cluster)) %>%
  ggplot(aes(x1, x2, color = cluster)) +
  geom_point() +
  ggtitle("Kmeans when K = 2")
km12 <- kmeans(x = as.matrix(dummy_data),
               centers = 3)
p2 <- dummy_data %>%
  mutate(cluster = as.factor(km12$cluster)) %>%
  ggplot(aes(x1, x2, color = cluster)) +
  geom_point() +
  ggtitle("Kmeans when K = 3")
grid.arrange(p1, p2, nrow = 1)
As is sometimes the case, kmeans can give counter-intuitive results. This is because the algorithm can converge to a local minimum of the within-cluster sum of squares instead of the global minimum. The common remedy is to run the algorithm from many different random starting points and keep the solution with the lowest within-cluster sum of squares.
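In base R this is what the nstart argument of kmeans() does: it runs the algorithm from several random initializations and returns the fit with the lowest total within-cluster sum of squares. A quick sketch on the Example 2 data (km12_multi is just an illustrative name):
#25 random starts; the best fit (lowest tot.withinss) is kept automatically
km12_multi <- kmeans(x = as.matrix(dummy_data), centers = 3, nstart = 25)
km12_multi$tot.withinss #typically at least as low as a single-start run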
Next, we need data with more than two dimensions. The famous iris data set (Fisher’s or Anderson’s) gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.
data(iris)
head(iris)
model_data <- iris %>%
  select(-Species) %>%
  mutate_all(function(x){(x - mean(x))/sd(x)})
Because we are using a distance measure to assign points to clusters, the data needs to be scaled first; otherwise, variables measured on larger scales would dominate the distance calculation. Now that the variables have been scaled by subtracting the mean and dividing by the standard deviation, the clustering can begin.
For very large data sets, it can be impractical to scale (i.e., normalize) using \((X - \mu)/\sigma\), and oftentimes feature scaling is used instead.
How will we determine the optimal number of clusters, k? Trial and error. The objective is to minimize the within-cluster sum of squares (WCSS). A good “rule of thumb” is to choose k at the “elbow” of the WCSS curve, the point after which adding more clusters yields only small reductions. Obviously, if k = n, then the WCSS would be zero, but this would not provide useful information as each point would be in its own cluster!
#try out different values of k from 1 to 10 with kmeans
clustering_models <- 1:10 %>% map(function(x){kmeans(model_data, x)})
clustering_models %>%
  map_dbl("tot.withinss") %>%
  as_data_frame() %>%
  rename(total_ss = value) %>%
  mutate(k = 1:10) %>%
  ggplot(aes(k, total_ss)) +
  geom_point() +
  geom_line(color = "red") +
  ggtitle("Total Sum of Within-Cluster Squared Error")
The above graph suggests that \(k = 3\) or \(k = 4\) would be quite reasonable. Setting k higher would lead to instability and overfitting.
What does the “within cluster sum of squares” look like from a distributional perspective? Recall that in a density plot or histogram, the variance corresponds to the spread along the x-axis: the “skinnier” the distribution, the lower the variance.
We graph the pre-cluster and post-cluster distributions to see directly.
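Before plotting, here is a quick numeric version of the same idea, comparing the overall variance of one column with its variance inside each cluster of a k = 3 fit. (km_check is just an illustrative name, and this is a fresh kmeans() call, so the exact values depend on the random start.)
set.seed(1)
km_check <- kmeans(model_data, centers = 3, nstart = 20)
#overall variance of one scaled column (equal to 1 by construction)
var(model_data$Petal.Length)
#variance of the same column within each cluster -- should be much smaller
tapply(model_data$Petal.Length, km_check$cluster, var)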
Example 3
make_hist <- function(x){
  model_data %>%
    ggplot(aes(x)) +
    geom_histogram()
}
plots <- model_data %>%
  map(make_hist)
title <- grid::textGrob("Pre-cluster distributions")
grid.arrange(plots[[1]],
             plots[[2]],
             plots[[3]],
             plots[[4]],
             top = title)
After splitting into homogeneous groups, the distributions within each cluster look like this:
clustered_model_data <- model_data %>%
  mutate(cluster = as.factor(kmeans(model_data, 3)$cluster)) %>%
  map(unlist) %>%
  as_data_frame()
make_hist <- function(x){
  clustered_model_data %>%
    ggplot(aes(x, fill = cluster)) +
    geom_density(alpha = 0.3)
}
plots <- clustered_model_data %>%
  map(make_hist)
title <- grid::textGrob("Post-cluster distributions")
grid.arrange(plots[[1]],
             plots[[2]],
             plots[[3]],
             plots[[4]],
             top = title)
Generally speaking, the mean is just one type of statistical average. Can we use other types of averages in the “k-average” algorithm? Absolutely. One common choice is the k-medoids algorithm, which is based on medians (more precisely, medoids) rather than means. In simple language, the median is the value that splits the data in half. For example, the median of c(1,1,2,3,3) is 2 because two values lie above it and two lie below it: at least half of the data is \(\leq 2\) and at least half is \(\geq 2\).
For probability distributions, the definition is as follows:
Let \(F(x)\) be the cumulative distribution function of \(X\) and \(m\) its median.
Then
\(\int_{-\infty}^{m} dF(x) \geq 0.5\) and \(\int_{m}^{\infty} dF(x) \geq 0.5\).
Source: Wikipedia
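Both the toy example and the distributional conditions can be checked directly in R:
x <- c(1, 1, 2, 3, 3)
median(x)                   #2
mean(x <= median(x)) >= 0.5 #at least half the mass at or below the median: TRUE
mean(x >= median(x)) >= 0.5 #at least half the mass at or above the median: TRUE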
There are many different ways of scaling, or normalizing, data. The two we will test are standard scaling and what we will call feature scaling.
Standard scaling:
Let \(X\) be a random variable with mean \(\mu\) and standard deviation \(\sigma\).
Then \(\frac{X - \mu}{\sigma}\) is said to be standardized.
Feature scaling:
Let \(X\) be a random variable.
Then \(\frac{X - min(X)}{max(X) - min(X)}\) is feature scaled.
There are other types of scaling as well.
Source: Wikipedia
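As a small sketch, both scalings can be written as one-line helper functions (the names standard_scale and feature_scale are ours, not from any package) and applied with mutate_all() just as before:
#standard (z-score) scaling: subtract the mean, divide by the standard deviation
standard_scale <- function(x){(x - mean(x))/sd(x)}
#feature (min-max) scaling: map the values onto the interval [0, 1]
feature_scale <- function(x){(x - min(x))/(max(x) - min(x))}
iris %>%
  select(-Species) %>%
  mutate_all(feature_scale) %>%
  summary()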
Questions: does this mean that kmedoids, which is based on medians, will handle outliers in the data better than kmeans, which uses the mean? And does it matter whether we use feature scaling or standard scaling?
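A one-line illustration of the intuition behind the first question: a single large outlier drags the mean much further than the median.
v <- c(1, 2, 3, 4, 5)
c(mean = mean(v), median = median(v))                 #both 3
c(mean = mean(c(v, 100)), median = median(c(v, 100))) #mean jumps to 19.17, median only to 3.5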
Example 4
We will “break” the iris data set by turning one value into an extreme outlier and compare kmeans to kmedoids. You might ask, will the outlier be “fixed” by the scaling process? Let’s find out. As an additional test, we will try scaling by standard score (\((X - \mu)/\sigma\) scaling) as well as by feature scaling (i.e., using \(min(.)\) and \(max(.)\)).
We test the following after adding an outlier.
outlier_standard_score_model_data <- iris %>%
  select(-Species) %>%
  mutate(Sepal.Length = ifelse(row_number() == 1,
                               yes = Sepal.Length + 100*rnorm(1),
                               no = Sepal.Length),
         is_outlier = ifelse(row_number() == 1, TRUE, FALSE)) %>%
  mutate_if(is.numeric, function(x){(x - mean(x))/sd(x)}) %>%
  slice(1:50) # makes computation easier

outlier_feature_scale_model_data <- iris %>%
  select(-Species) %>%
  mutate(Sepal.Length = ifelse(row_number() == 1,
                               yes = Sepal.Length + 100*rnorm(1),
                               no = Sepal.Length),
         is_outlier = ifelse(row_number() == 1, TRUE, FALSE)) %>%
  mutate_if(is.numeric, function(x){(x - min(x))/(max(x) - min(x))}) %>%
  slice(1:50) # makes computation easier
Notice that the outlier data sets have a Sepal.Length outlier. Also notice that the scaling is performed after the outlier is added, and that the feature-scaled column is between 0 and 1.
outlier_standard_score_model_data %>%
  select(Sepal.Length) %>%
  summary()
##  Sepal.Length
## Min. :-0.34452
## 1st Qu.:-0.25937
## Median :-0.22531
## Mean : 0.02077
## 3rd Qu.:-0.19125
## Max. :12.04488
outlier_feature_scale_model_data %>%
  select(Sepal.Length) %>%
  summary()
##  Sepal.Length
## Min. :0.000000
## 1st Qu.:0.007432
## Median :0.010405
## Mean :0.030256
## 3rd Qu.:0.013378
## Max. :1.000000
#take only numeric columns and compute distance matrix
dist_matrix1 <- outlier_standard_score_model_data %>%
  select_if(is.numeric) %>%
  dist() %>%
  as.matrix()

#take only numeric columns and compute distance matrix
dist_matrix2 <- outlier_feature_scale_model_data %>%
  select_if(is.numeric) %>%
  dist() %>%
  as.matrix()

#models
kmediods_standard_score <- kmedoids(dist_matrix1, k = 3)
kmediods_feature_scale <- kmedoids(dist_matrix2, k = 3)
kmeans_standard_score <- outlier_standard_score_model_data %>%
  kmeans(., centers = 3, nstart = 20)
kmeans_feature_score <- outlier_feature_scale_model_data %>%
  kmeans(., centers = 3, nstart = 20)
After this computation, we examine the compactness of the distributions. In order to produce density plots that are readable, we remove the outlier that was added artificially.
#take a single column and a cluster assignment and return a density plot;
#mapped over each numeric column of the data frame below
make_hist <- function(input_column, input_cluster){
  input_column %>%
    as_data_frame() %>%
    mutate(cluster = input_cluster) %>%
    ggplot(aes(input_column, fill = cluster)) +
    geom_density(alpha = 0.3)
}
plots <- outlier_standard_score_model_data %>%
  filter(is_outlier == FALSE) %>%
  map_if(., .p = is.numeric, .f = make_hist, as.factor(kmediods_standard_score$cluster[-1]))
title <- grid::textGrob("Post-cluster distributions:\n kmedoids with standard scoring")
grid.arrange(plots[[1]], plots[[2]], plots[[3]], plots[[4]], top = title)
Notice that only two of the three clusters are actually shown. This is because the outlier was assigned to its own cluster but was removed so that the plot window shows a meaningful scale. Also notice that in the upper-right plot there are still two distinct distributions; in the other panels the separation is less obvious. This essentially “wastes” or “throws out” a cluster.
plots <- outlier_feature_scale_model_data %>%
  filter(is_outlier == FALSE) %>%
  map_if(., .p = is.numeric, .f = make_hist, as.factor(kmediods_feature_scale$cluster[-1]))
title <- grid::textGrob("Post-cluster distributions:\n kmedoids with feature scaling")
grid.arrange(plots[[1]], plots[[2]], plots[[3]], plots[[4]], top = title)
The distributions are not obviously separated, but at least all three clusters appear this time, which implies that the outlier was not the only observation in its cluster. This is an improvement over the previous plot.
plots <- outlier_standard_score_model_data %>%
  filter(is_outlier == FALSE) %>%
  map_if(., .p = is.numeric, .f = make_hist, as.factor(kmeans_standard_score$cluster[-1]))
title <- grid::textGrob("Post-cluster distributions:\n kmeans with standard scoring")
grid.arrange(plots[[1]], plots[[2]], plots[[3]], plots[[4]], top = title)
As with the first standard-scoring plot, this is not much of an improvement: only two clusters are represented.
plots <- outlier_feature_scale_model_data %>%
  filter(is_outlier == FALSE) %>%
  map_if(., .p = is.numeric, .f = make_hist, as.factor(kmeans_feature_score$cluster[-1]))
title <- grid::textGrob("Post-cluster distributions:\n kmeans with feature scaling")
grid.arrange(plots[[1]], plots[[2]], plots[[3]], plots[[4]], top = title)
As the similarity of the above density plots demonstrates, the choice of scaling method did not have a substantial impact. The choice of cluster center, mean versus medoid, did have some impact.
What was not shown is that during this analysis, the value of k had a significant impact on how the outlier was handled. When k was set higher, say to 4 or 5, the outlier simply got a cluster of its own, and the algorithm then behaved as usual on the remaining observations. In effect, both kmeans and kmedoids performed a kind of automatic outlier isolation.
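This behaviour was not shown above, but a quick sketch of how one might check it is to re-fit kmeans with k = 4 on the standard-scored outlier data and tabulate the cluster sizes (km_k4 is an illustrative name); the artificial outlier in row 1 typically ends up in a singleton cluster.
km_k4 <- outlier_standard_score_model_data %>%
  select_if(is.numeric) %>%
  kmeans(centers = 4, nstart = 20)
table(km_k4$cluster) #one cluster typically has size 1
km_k4$cluster[1]     #cluster label assigned to the outlier (row 1)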
Analysis loosely based on https://rstudio-pubs-static.s3.amazonaws.com/33876_1d7794d9a86647ca90c4f182df93f0e8.html