This page demonstrates K-Means Clustering with some imaginary data.
library(tidyverse)
library(cluster)
library(factoextra)
library(meanShiftR)
library(ggthemes)
Samlping from a series of single digit integers, with some more common than others. Two columns of data are created, to be displayed in a 2D scatter plot.
# size of test data, 3 = 1,000 rows, 2 = 100
zeroes <- 4
# k means requires you to tell it how many clusters there are
# the sample data will create 4 clusters
clusters <- 4
# randomly select data from 1 - 6 with more 1's and 5's than other numbers
# this will give two clusters in this dimension
var_a <- sample(c(1,1,1,2,3,4,5,5,5,6), 1*10^zeroes, replace = TRUE)
# do the same thing again
# two clusters in this dimension
var_b <- sample(c(1,1,1,1,2,2,3,4,5,5,5,5,6), 1*10^zeroes, replace = TRUE)
# create a dataframe of the sample data
variables <- data.frame(var_a = var_a,
var_b = var_b)
Adding some random noise to the numbers so the are a bit more interesting, and realistic.
# add some noise to the sample data
# add a random number from -1 to 1 to each variable
variables <- variables %>%
mutate(var_a = var_a+runif(1*10^zeroes, min = -1, max = 1),
var_b = var_b+runif(1*10^zeroes, min = -1, max = 1))
Plotting the data to check for clusters, the clusters show up pretty clearly, there are four. K-Means requires you to specify the number of clusters, plotting is a quick way check how many. In this case I knew there would be four because I made the data.
# plotting the pretend data to see how the clusters look
variables %>%
ggplot(aes(x = var_a, y = var_b)) +
geom_point() +
theme_few()
# two clusters in each direction for a total of four
Running the K-Means function and plotting the results, one colour per cluster.
# running the k means function
# one component of the result is a column with a number for the cluster each point
# belongs to
km <- kmeans(variables, clusters)
# adding a column to the test data, with each cluster that the
# point falls in
variables <- variables %>%
mutate(cluster = km$cluster) %>%
# converting to factor, or ggplplot will shade it rather than
# using discrete colours
mutate(cluster = factor(cluster, levels = 1:clusters))
Scatter plot with all the points, coloured by the cluster K-Means assigned them.
# plotting the test data with the colours indicating the cluster
variables %>%
ggplot(aes(x = var_a, y = var_b, col = cluster)) +
geom_point() +
theme_few()