Unsupervised learning in R

Kate C

2022-01-23

Load Packages and Dataset

Packages used to import and analyse the data include dplyr, which supplies the pipe and select() used below; the clustering functions kmeans(), dist(), hclust() and cutree() all come from base R.
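
A minimal setup sketch, assuming dplyr/readr handle the import and that the Pokemon stats live in a local CSV (the file name is a placeholder):

library(dplyr)   # pipe and select()
library(readr)   # read_csv()

# placeholder path - point this at wherever the Pokemon stats file actually lives
pokemon <- read_csv("pokemon.csv")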

A. Key Points

unsupervised learning - finding structures in unlabeled data

two goals

  • find homogeneous subgroups within a larger group - this is called clustering (e.g. market segmentation)

  • finding patterns in the features of the data

    • dimensionality reduction (visualizing high-dimensional data, pre-processing before supervised learning)
  • challenges and benefits

    • no single goal of analysis

    • requires more creativity

    • much more unlabeled data available than clean labeled data

Intro to k-means clustering

  • breaks observations into pre-defined number of clusters

  • kmeans() comes from base R: kmeans(x, centers, nstart)

    • x - the data, as a matrix with one observation per row and one feature per column

    • centers - the predetermined number of groups/clusters

    • nstart - the number of times the algorithm is repeated from different random starts; kmeans() keeps the run with the lowest total within-cluster sum of squares

kmeans()

Results:

  • kmeans() produces many results, one of which is cluster membership

  • to access the cluster assignments of a kmeans model, use the $ operator, e.g. km.out$cluster

Plot:

  • plot the data as a scatter plot and colour each point by its cluster membership to interpret the k-means results (see the sketch below)
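
A minimal sketch pulling these pieces together on simulated data (two obvious subgroups, so centers = 2; the variable names are arbitrary):

set.seed(1)
# 100 points from two subgroups in two dimensions
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2))

km.out <- kmeans(x, centers = 2, nstart = 20)

# cluster membership for every observation
km.out$cluster

# scatter plot coloured by cluster membership
plot(x, col = km.out$cluster,
     main = "k-means with 2 clusters", xlab = "", ylab = "")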

How k-means works and practical matters

Goals:

  • how the k-means algorithm is implemented, illustrated visually

  • model selection: determining number of clusters

How k-means works - below is one iteration of the k-means method

  • suppose the data come from two subgroups (aka clusters)

  • first, kmeans randomly assigns each point to one of the two clusters - this is the random aspect of the kmeans algorithm

  • second, calculate the centers of each of the two subgroups - the average position of all the points in that subgroup

  • third, each point in the data is assigned to the cluster of the nearest center

  • kmeans algorithm will finish when no points change cluster assignment

  • but note there are other stopping criteria that can be chosen as well (a bare-bones sketch of the procedure follows below)
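
A bare-bones sketch of the loop described above, on simulated data - this is just the idea, not how kmeans() is implemented internally, and it assumes no cluster ever becomes empty:

set.seed(1)
# toy data drawn from two subgroups
x <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),
           matrix(rnorm(50, mean = 3), ncol = 2))
k <- 2

# step 1: random initial cluster assignment
assignment <- sample(1:k, nrow(x), replace = TRUE)

repeat {
  # step 2: each center is the average position of the points in its cluster
  centers <- sapply(1:k, function(j) colMeans(x[assignment == j, , drop = FALSE]))
  # step 3: reassign each point to the cluster with the nearest center
  sq.dist <- sapply(1:k, function(j) colSums((t(x) - centers[, j])^2))
  new.assignment <- max.col(-sq.dist)
  # stop when no point changes cluster
  if (all(new.assignment == assignment)) break
  assignment <- new.assignment
}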

Model selection:

  • choose the model with the minimum total within-cluster sum of squares (the squared distance from each observation to its cluster center)

  • summing all of these squared distances gives the total within-cluster sum of squares

  • when nstart > 1, R repeats the random starts and automatically keeps the best model (the one with the lowest total within-cluster sum of squares)

  • for repeatability, call set.seed() before running kmeans() to guarantee reproducibility (see the small sketch below)

  • use a scree plot to determine the number of clusters if it is not known beforehand
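
Because the starting assignments are random, a quick reproducibility sketch (iris is used here only as a convenient built-in numeric dataset; the seed value is arbitrary):

set.seed(2022)   # fix the random starts so repeated runs give identical results
km.out <- kmeans(iris[, 1:4], centers = 3, nstart = 20)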

B. Practice

  • keep only the numeric stat columns; note the iter.max argument in the kmeans() calls below - kmeans' default of 10 iterations is not enough for convergence in this case
pokemon <- pokemon %>% 
  select(HitPoints:Speed)
  • initialize the total within-cluster sum of squares: wss
wss <- 0
  • loop over 1 to 15 possible clusters
for (i in 1:15) {
  km.out <- kmeans(pokemon, centers = i, nstart = 20, iter.max = 50)
  wss[i] <- km.out$tot.withinss
}
  • produce a scree plot
plot(1:15, wss, type = "b",
     xlab = "Number of Clusters", 
     ylab = "Within groups sum of squares")

  • select an appropriate number of centers based on the plot; here we choose 3
k <- 3
  • build the model with k clusters
km.out <- kmeans(pokemon, centers = k, nstart = 20, iter.max = 50)
  • view the resulting model
km.out
## K-means clustering with 3 clusters of sizes 355, 270, 175
## 
## Cluster means:
##   HitPoints   Attack   Defense SpecialAttack SpecialDefense    Speed
## 1  54.68732 56.93239  53.64507      52.02254       53.04789 53.58873
## 2  81.90370 96.15926  77.65556     104.12222       86.87778 94.71111
## 3  79.30857 97.29714 108.93143      66.71429       87.04571 57.29143
## 
## Clustering vector:
##   [1] 1 1 2 2 1 1 2 2 2 1 1 3 2 1 1 1 1 1 1 2 1 1 2 2 1 1 1 2 1 2 1 2 1 3 1 1 3
##  [38] 1 1 2 1 2 1 2 1 1 1 2 1 1 2 1 3 1 2 1 1 1 2 1 2 1 2 1 2 1 1 3 1 2 2 2 1 3
##  [75] 3 1 1 2 1 2 1 3 3 1 2 1 3 3 1 2 1 1 2 1 3 1 3 1 3 1 2 2 2 3 1 3 1 3 1 2 1
## [112] 2 1 3 3 3 1 1 3 1 3 1 3 3 3 1 2 1 3 1 2 2 2 2 2 2 3 3 3 1 3 3 3 1 1 2 2 2
## [149] 1 1 3 1 3 2 2 3 2 2 2 1 1 2 2 2 2 2 1 1 3 1 1 2 1 1 3 1 1 1 2 1 1 1 1 2 1
## [186] 2 1 1 1 1 1 1 2 1 1 2 2 3 1 1 3 2 1 1 2 1 1 1 1 1 3 2 3 1 3 2 1 1 2 1 3 1
## [223] 3 3 3 1 3 1 3 3 3 3 3 1 1 3 1 3 1 3 1 1 2 1 2 3 1 2 2 2 1 3 2 2 1 1 3 1 1
## [260] 1 3 2 2 2 3 1 1 3 3 2 2 2 1 1 2 2 1 1 2 2 1 1 3 3 1 1 1 1 1 1 1 1 1 1 1 2
## [297] 1 1 2 1 1 1 3 1 1 2 2 1 1 1 3 1 1 2 1 2 1 1 1 2 1 3 1 3 1 1 1 3 1 3 1 3 3
## [334] 3 1 1 2 1 2 2 1 1 1 1 1 1 3 1 2 2 1 2 1 2 3 3 1 2 1 1 1 2 1 2 1 3 2 2 2 2
## [371] 3 1 3 1 3 1 3 1 3 1 3 1 2 1 3 1 2 2 1 3 3 1 2 2 1 1 2 2 1 1 2 1 3 3 3 1 1
## [408] 3 2 2 1 3 3 2 3 3 3 2 2 2 2 2 2 3 2 2 2 2 2 2 3 2 1 3 3 1 1 2 1 1 2 1 1 2
## [445] 1 1 1 1 1 1 2 1 2 1 3 1 3 1 3 3 3 2 1 3 1 1 2 1 2 1 3 2 1 2 1 2 2 2 2 1 2
## [482] 1 1 2 1 3 1 1 1 1 3 1 1 2 2 1 1 2 2 1 3 1 3 1 2 3 1 2 1 1 2 3 2 2 3 3 3 2
## [519] 2 2 2 3 2 3 2 2 2 2 3 3 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 1
## [556] 1 2 1 1 2 1 1 2 1 1 1 1 3 1 2 1 2 1 2 1 2 1 3 1 1 2 1 2 1 3 3 1 2 1 2 3 3
## [593] 1 3 3 1 1 2 3 3 1 1 2 1 1 2 1 2 1 2 2 1 1 2 1 3 2 2 1 3 1 3 2 1 3 1 3 1 2
## [630] 1 3 1 2 1 2 1 1 3 1 1 2 1 2 1 1 2 1 2 2 1 3 1 3 1 2 3 1 2 1 3 1 3 3 1 1 2
## [667] 1 2 1 1 2 1 3 3 1 3 2 1 2 3 1 2 3 1 3 1 3 3 1 3 1 3 2 3 1 1 2 1 2 2 2 2 2
## [704] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 3 3 1 1 2 1 1 2 1 1 1 1 2 1 1 1 1 2 1 1 2
## [741] 1 2 1 3 2 1 2 2 1 3 2 3 1 3 1 2 1 3 1 3 1 3 1 2 1 2 1 3 1 2 2 2 2 3 1 2 2
## [778] 3 1 3 1 1 1 1 3 3 3 3 1 3 1 2 2 2 3 3 2 2 2 2
## 
## Within cluster sum of squares by cluster:
## [1]  812079.9 1018348.0  709020.5
##  (between_SS / total_SS =  40.8 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
  • plot of Defense vs Speed by cluster membership
plot(pokemon[, c("Defense", "Speed")],
     col = km.out$cluster,
     main = paste("k-means clustering of Pokemon with", k, "clusters"),
     xlab = "Defense", ylab = "Speed")

C. Hierarchical clustering

  • used when the number of clusters is not known ahead of time

  • bottom-up and top-down approaches

  • bottom-up: starts by assigning each point to its own cluster, then repeatedly joins the two closest clusters into one; this continues iteratively until everything is in a single cluster, at which point it stops

  • only requires one input - the distances between observations

Steps

  • calculate the distance between observations. dist(x)

  • hclust(dist(x)) returns a hierarchical clustering model

Selecting number of clusters

  • create a dendrogram to visualize how points are joined into clusters based on distance

  • the output of hclust() is passed to the plot() function

  • draw a horizontal line at a specified height with abline(h = ...) (col is optional) - specifying the height of the line means you want clusters that are no further apart than that height

  • for measuring and representing distance, Euclidean distance is used here (this is what the heights in the dendrogram show)

  • use tree “cutting” in R: either “cut” at a height h or “cut” into a number of clusters k, producing a vector with a numeric cluster assignment for each observation (using the cutree() function; see the sketch below)
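
A minimal sketch of these steps on simulated data (the cut height of 2 is arbitrary):

set.seed(3)
x <- matrix(rnorm(60), ncol = 2)    # 30 observations, 2 features

d <- dist(x)                        # Euclidean distances between observations
hclust.out <- hclust(d)             # bottom-up hierarchical clustering model

plot(hclust.out)                    # dendrogram
abline(h = 2, col = "red")          # horizontal cut line at height 2

cutree(hclust.out, h = 2)           # clusters no further apart than height 2
cutree(hclust.out, k = 3)           # or ask directly for 3 clusters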

D. Clustering linkage and practical matters

  • measuring the distance between clusters - four common methods in R, chosen with the method parameter of the hclust() function (see the comparison sketch after this list)

    • complete: computes all pairwise similarities between the two clusters and uses the largest - produces more balanced trees

    • single: same as above but uses the smallest of the similarities - tends to produce unbalanced trees

    • average: uses the average of the similarities - produces more balanced trees

    • centroid: distance between the centroids of cluster 1 and 2 - not used as often as other methods

  • whether choosing a balanced or unbalanced tree depends on the context of the problem

    • balanced trees are essential if you want a roughly even number of observations assigned to each cluster

    • if you want to detect outliers, an unbalanced tree is more desirable, because pruning it can leave most observations in one cluster and only a few observations in the other clusters
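
A sketch comparing the four linkage options through hclust()'s method argument, on simulated data - the side-by-side dendrograms make the balanced vs. unbalanced shapes easy to see:

set.seed(4)
x <- matrix(rnorm(60), ncol = 2)
d <- dist(x)

hclust.complete <- hclust(d, method = "complete")
hclust.single   <- hclust(d, method = "single")
hclust.average  <- hclust(d, method = "average")
hclust.centroid <- hclust(d, method = "centroid")

par(mfrow = c(2, 2))
plot(hclust.complete, main = "Complete")
plot(hclust.single,   main = "Single")
plot(hclust.average,  main = "Average")
plot(hclust.centroid, main = "Centroid")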

Practical matters

  • data on different scales can cause undesirable results in clustering methods

  • the solution is to scale the data so that all features have the same mean and standard deviation

    • i.e. normalize the data - subtract the mean of each feature from all of its observations

    • then divide each feature by its standard deviation

    • normalized features have a mean of zero and a standard deviation of one

  • always check whether scaling is necessary, i.e. check how much the means and standard deviations of the features vary (a small sketch of scale() follows below)
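
A short sketch of what scale() does, written out by hand so the normalization is explicit (iris is used only as a convenient built-in numeric dataset):

x <- as.matrix(iris[, 1:4])

# scale() centers each column by its mean and divides by its standard deviation
x.scaled <- scale(x)

# equivalent manual version
x.manual <- sweep(x, 2, colMeans(x), "-")
x.manual <- sweep(x.manual, 2, apply(x, 2, sd), "/")

round(colMeans(x.scaled), 10)   # approximately 0 for every feature
apply(x.scaled, 2, sd)          # 1 for every feature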

Some practice

On the Pokemon data:

  • check the column means
colMeans(pokemon)
##      HitPoints         Attack        Defense  SpecialAttack SpecialDefense 
##       69.25875       79.00125       73.84250       72.82000       71.90250 
##          Speed 
##       68.27750
  • check the column standard deviations; to apply sd() over columns, set the MARGIN argument to 2 in apply() (1 applies over rows)
apply(pokemon, 2, sd)
##      HitPoints         Attack        Defense  SpecialAttack SpecialDefense 
##       25.53467       32.45737       31.18350       32.72229       27.82892 
##          Speed 
##       29.06047
  • as the results show, the means and standard deviations vary quite a bit across features, so the data need to be scaled
pokemon.scaled <- scale(pokemon)
  • create hierarchical clustering model using complete linkage method
hclust.pokemon <- hclust(dist(pokemon.scaled), method = "complete")
  • apply cutree() to hclust.pokemon and assign a cluster membership to each observation, assuming 3 clusters
cut.pokemon <- cutree(hclust.pokemon, k = 3)
  • use table() to compare cluster membership between the two clustering methods; the comparison below shows that the hierarchical model (complete linkage) puts almost every Pokemon into a single cluster, while k-means splits them into three more evenly sized groups
table(km.out$cluster, cut.pokemon)
##    cut.pokemon
##       1   2   3
##   1 350   5   0
##   2 267   3   0
##   3 171   3   1