Load Packages and Dataset
Packages used to import and analyse the data include dplyr (the %>% pipe and select() appear below); the clustering functions kmeans(), dist() and hclust() come with base R.
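The actual import step is not shown in these notes; a minimal sketch of what it presumably looks like, with a hypothetical file name:

library(dplyr)
# hypothetical path - the real data source is not given in these notes
pokemon <- read.csv("pokemon.csv")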
A. Key Points
unsupervised learning - finding structures in unlabeled data
two goals
find homogeneous subgroups within a larger group - this is called clustering (e.g. market segmentation)
find patterns in the features of the data - this is dimensionality reduction (used for visualizing high-dimensional data and as pre-processing before supervised learning)
challenges and benefits
no single goal of analysis
requires more creativity
much more unlabeled data available than clean labeled data
Intro to k-means clustering
breaks observations into a pre-defined number of clusters
kmeans() comes with base R: kmeans(x, centers, nstart)
x - a matrix with one observation per row and one feature per column
centers - the predetermined number of groups/clusters
nstart - the number of times the clustering is repeated from new random starting assignments; the best result is kept
kmeans()
Results:
kmeans() produces many results, one of which is cluster membership
to access the cluster assignments of a kmeans model, use the $ operator: model$cluster
Plot:
- plot the data as a scatter plot, using color to label each sample's cluster membership, to interpret the results of k-means (see the sketch below)
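A minimal sketch on simulated data (the matrix x and the choice of 2 centers are illustrative assumptions, not from the notes):

# two noisy 2-D subgroups
set.seed(42)
x <- rbind(matrix(rnorm(100), ncol = 2),
           matrix(rnorm(100, mean = 3), ncol = 2))
# fit k-means and read the cluster membership off the model
km <- kmeans(x, centers = 2, nstart = 20)
km$cluster
# scatter plot colored by cluster membership
plot(x, col = km$cluster, main = "k-means with 2 clusters")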
How k-means works and practical matters
Goals:
see visually how the k-means algorithm is implemented
model selection: determining number of clusters
How k-means works - one iteration of the method, step by step
suppose the data consists of two subgroups (aka clusters)
first, kmeans randomly assigns each point to one of the two clusters - this is the random aspect of the kmeans algorithm
second, calculate the center of each of the two subgroups - the average position of all the points in that subgroup
third, reassign each point to the cluster of the nearest center
steps two and three repeat; the algorithm finishes when no points change cluster assignment
but note there are other stopping criteria that can be chosen as well
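To make the loop concrete, here is a hand-rolled sketch of the three steps above on simulated data (all names and the data are illustrative; base R's kmeans() uses a more refined algorithm and should be preferred in practice):

set.seed(1)
x <- rbind(matrix(rnorm(100), ncol = 2),
           matrix(rnorm(100, mean = 3), ncol = 2))
k <- 2
# step 1: randomly assign each point to one of the k clusters
assignment <- sample(k, nrow(x), replace = TRUE)
repeat {
  # step 2: each center is the average position of its cluster's points
  # (ignores the empty-cluster edge case for brevity)
  centers <- sapply(1:k, function(j) colMeans(x[assignment == j, , drop = FALSE]))
  # step 3: reassign every point to the cluster of the nearest center
  d2 <- sapply(1:k, function(j) colSums((t(x) - centers[, j])^2))
  new.assignment <- max.col(-d2)
  # stop when no points change cluster assignment
  if (all(new.assignment == assignment)) break
  assignment <- new.assignment
}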
Model selection:
minimize the total within-cluster sum of squares (the squared distance from each observation to its cluster center)
the squared distances are summed over all observations to give the total within-cluster sum of squares
with nstart > 1, R repeats the clustering and keeps the best run (lowest total within-cluster sum of squares) automatically
for repeatability, call set.seed() before running kmeans() to guarantee reproducibility
use a scree plot to determine the number of clusters if it is not known beforehand (see the practice below)
B. Practice
- use the iter.max argument to kmeans() - the default of 10 iterations is not enough for convergence in this case
- keep only the six numeric stat columns
pokemon <- pokemon %>%
  select(HitPoints:Speed)
- initialize total within sum of squares error: wss
wss <- 0
- loop over 1 to 15 possible clusters
for (i in 1:15) {
  km.out <- kmeans(pokemon, centers = i, nstart = 20, iter.max = 50)
  wss[i] <- km.out$tot.withinss
}
- produce a scree plot
plot(1:15, wss, type = "b",
     xlab = "Number of Clusters",
     ylab = "Within groups sum of squares")
- select a proper number of clusters based on the plot; here we choose 3
k <- 3
- build the model with k clusters
km.out <- kmeans(pokemon, centers = k, nstart = 20, iter.max = 50)
- view the resulting model
km.out
## K-means clustering with 3 clusters of sizes 355, 270, 175
##
## Cluster means:
## HitPoints Attack Defense SpecialAttack SpecialDefense Speed
## 1 54.68732 56.93239 53.64507 52.02254 53.04789 53.58873
## 2 81.90370 96.15926 77.65556 104.12222 86.87778 94.71111
## 3 79.30857 97.29714 108.93143 66.71429 87.04571 57.29143
##
## Clustering vector:
## [1] 1 1 2 2 1 1 2 2 2 1 1 3 2 1 1 1 1 1 1 2 1 1 2 2 1 1 1 2 1 2 1 2 1 3 1 1 3
## [38] 1 1 2 1 2 1 2 1 1 1 2 1 1 2 1 3 1 2 1 1 1 2 1 2 1 2 1 2 1 1 3 1 2 2 2 1 3
## [75] 3 1 1 2 1 2 1 3 3 1 2 1 3 3 1 2 1 1 2 1 3 1 3 1 3 1 2 2 2 3 1 3 1 3 1 2 1
## [112] 2 1 3 3 3 1 1 3 1 3 1 3 3 3 1 2 1 3 1 2 2 2 2 2 2 3 3 3 1 3 3 3 1 1 2 2 2
## [149] 1 1 3 1 3 2 2 3 2 2 2 1 1 2 2 2 2 2 1 1 3 1 1 2 1 1 3 1 1 1 2 1 1 1 1 2 1
## [186] 2 1 1 1 1 1 1 2 1 1 2 2 3 1 1 3 2 1 1 2 1 1 1 1 1 3 2 3 1 3 2 1 1 2 1 3 1
## [223] 3 3 3 1 3 1 3 3 3 3 3 1 1 3 1 3 1 3 1 1 2 1 2 3 1 2 2 2 1 3 2 2 1 1 3 1 1
## [260] 1 3 2 2 2 3 1 1 3 3 2 2 2 1 1 2 2 1 1 2 2 1 1 3 3 1 1 1 1 1 1 1 1 1 1 1 2
## [297] 1 1 2 1 1 1 3 1 1 2 2 1 1 1 3 1 1 2 1 2 1 1 1 2 1 3 1 3 1 1 1 3 1 3 1 3 3
## [334] 3 1 1 2 1 2 2 1 1 1 1 1 1 3 1 2 2 1 2 1 2 3 3 1 2 1 1 1 2 1 2 1 3 2 2 2 2
## [371] 3 1 3 1 3 1 3 1 3 1 3 1 2 1 3 1 2 2 1 3 3 1 2 2 1 1 2 2 1 1 2 1 3 3 3 1 1
## [408] 3 2 2 1 3 3 2 3 3 3 2 2 2 2 2 2 3 2 2 2 2 2 2 3 2 1 3 3 1 1 2 1 1 2 1 1 2
## [445] 1 1 1 1 1 1 2 1 2 1 3 1 3 1 3 3 3 2 1 3 1 1 2 1 2 1 3 2 1 2 1 2 2 2 2 1 2
## [482] 1 1 2 1 3 1 1 1 1 3 1 1 2 2 1 1 2 2 1 3 1 3 1 2 3 1 2 1 1 2 3 2 2 3 3 3 2
## [519] 2 2 2 3 2 3 2 2 2 2 3 3 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 1
## [556] 1 2 1 1 2 1 1 2 1 1 1 1 3 1 2 1 2 1 2 1 2 1 3 1 1 2 1 2 1 3 3 1 2 1 2 3 3
## [593] 1 3 3 1 1 2 3 3 1 1 2 1 1 2 1 2 1 2 2 1 1 2 1 3 2 2 1 3 1 3 2 1 3 1 3 1 2
## [630] 1 3 1 2 1 2 1 1 3 1 1 2 1 2 1 1 2 1 2 2 1 3 1 3 1 2 3 1 2 1 3 1 3 3 1 1 2
## [667] 1 2 1 1 2 1 3 3 1 3 2 1 2 3 1 2 3 1 3 1 3 3 1 3 1 3 2 3 1 1 2 1 2 2 2 2 2
## [704] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 3 3 1 1 2 1 1 2 1 1 1 1 2 1 1 1 1 2 1 1 2
## [741] 1 2 1 3 2 1 2 2 1 3 2 3 1 3 1 2 1 3 1 3 1 3 1 2 1 2 1 3 1 2 2 2 2 3 1 2 2
## [778] 3 1 3 1 1 1 1 3 3 3 3 1 3 1 2 2 2 3 3 2 2 2 2
##
## Within cluster sum of squares by cluster:
## [1] 812079.9 1018348.0 709020.5
## (between_SS / total_SS = 40.8 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
- plot of Defense vs Speed by cluster membership
plot(pokemon[, c("Defense", "Speed")],
col = km.out$cluster,
main = paste("k-means clustering of Pokemon with", k, "clusters"),
xlab = "Defense", ylab = "Speed")C. Hierarchical clustering
used when the number of clusters is not known ahead of time
there are bottom-up and top-down approaches
bottom-up: starts by assigning each point to its own cluster, then repeatedly joins the two closest clusters into one, iterating until only a single cluster remains
only requires one parameter - the distance between observations
Steps
calculate the distances between observations: dist(x)
pass the distances to hclust() to get a hierarchical clustering model: hclust(dist(x))
Selecting the number of clusters
can create a dendrogram - it visualizes how points are joined into clusters based on their distances
the output of the hclust() function is passed into the plot() function
draw a horizontal line at a specified height (col is optional) - the height of the line means you want clusters that are no further apart than that height
here Euclidean distance is used as the distance measure (the heights seen in the dendrogram)
use tree "cutting" in R: cutree() "cuts" either by height h or by number of clusters k, returning a vector with a numeric cluster assignment for each observation; the whole workflow is sketched below
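A minimal sketch of the workflow on simulated data (the matrix x, the cut height h = 4 and k = 2 are arbitrary illustrative choices):

set.seed(2)
x <- rbind(matrix(rnorm(60), ncol = 2),
           matrix(rnorm(60, mean = 4), ncol = 2))
hc <- hclust(dist(x))       # dist() defaults to Euclidean distance
plot(hc)                    # dendrogram
abline(h = 4, col = "red")  # horizontal line at the chosen height
cutree(hc, h = 4)           # "cut" by height...
cutree(hc, k = 2)           # ...or by a desired number of clusters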
D. Clustering linkage and practical matters
measuring the distance between clusters - 4 methods in R, chosen via the method argument of hclust() (compared in the sketch after this list)
complete: compute all pairwise similarities between the two clusters and use the largest - produces more balanced trees
single: same as above but uses the smallest of the similarities - produces unbalanced trees
average: uses the average of the similarities - produces more balanced trees
centroid: distance between the centroids of cluster 1 and cluster 2 - not used as often as the other methods
whether to choose a balanced or unbalanced tree depends on the context of the problem
balanced trees are essential if you want an even number of observations assigned to each cluster
if you want to detect outliers, an unbalanced tree is more desirable, because cutting an unbalanced tree can assign most observations to one cluster and only a few to the others
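A small sketch comparing the linkage choices side by side (reusing the simulated x from the sketch above):

d <- dist(x)
par(mfrow = c(1, 3))  # three dendrograms in one row
plot(hclust(d, method = "complete"), main = "Complete")
plot(hclust(d, method = "single"),   main = "Single")
plot(hclust(d, method = "average"),  main = "Average")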
Practical matters
data on different scales can cause undesirable results in clustering methods
the solution is to scale the data so that features have the same mean and standard deviation
i.e. normalize the data - subtract the mean of a feature from all its observations
then divide each feature by its standard deviation
normalized features have a mean of zero and a standard deviation of one
always check whether "scaling" is necessary, i.e. check how much the means and standard deviations of the features vary
Some practice
on pokemon data
- check the column means
colMeans(pokemon)
## HitPoints Attack Defense SpecialAttack SpecialDefense
## 69.25875 79.00125 73.84250 72.82000 71.90250
## Speed
## 68.27750
- check the column standard deviations. since we want to apply sd() over columns, set the MARGIN argument to 2 in apply() (1 applies over rows)
apply(pokemon, 2, sd)
## HitPoints Attack Defense SpecialAttack SpecialDefense
## 25.53467 32.45737 31.18350 32.72229 27.82892
## Speed
## 29.06047
- as the results show, the means and standard deviations do vary, so we need to scale the data
pokemon.scaled <- scale(pokemon)
- create a hierarchical clustering model using the complete linkage method
hclust.pokemon <- hclust(dist(pokemon.scaled), method = "complete")
- apply cutree() to hclust.pokemon and assign a cluster membership to each observation, assuming 3 clusters
cut.pokemon <- cutree(hclust.pokemon, k = 3)
- use table() to compare cluster membership between the two clustering methods
table(km.out$cluster, cut.pokemon)
## cut.pokemon
## 1 2 3
## 1 350 5 0
## 2 267 3 0
## 3 171 3 1
- the hierarchical model assigns almost all observations to its first cluster (a very unbalanced cut), while k-means splits them far more evenly