Load Packages and Dataset
Packages used to import and analyse the data include dplyr (the %>% pipe and select() appear below); the clustering functions kmeans(), dist() and hclust() come with base R.
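The actual import step is not shown in these notes; a minimal sketch of what it presumably looks like, with a hypothetical file name:

library(dplyr)
# hypothetical path - the real data source is not given in these notes
pokemon <- read.csv("pokemon.csv")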
A. Key Points
unsupervised learning - finding structures in unlabeled data
two goals
find homogeneous subgroups within a larger group - this is called clustering (e.g. market segmentation)
find patterns in the features of the data - this is dimensionality reduction (used for visualizing high-dimensional data and as pre-processing before supervised learning)
challenges and benefits
no single goal of analysis
requires more creativity
much more unlabeled data available than clean labeled data
Intro to k-means clustering
breaks observations into a pre-defined number of clusters
kmeans() comes with base R: kmeans(x, centers, nstart)
x - a matrix with one observation per row and one feature per column
centers - the predetermined number of groups/clusters
nstart - the number of times the clustering is repeated from new random starting assignments; the best result is kept
kmeans()
Results:
kmeans() produces many results, one of which is cluster membership
to access the cluster assignments of a kmeans model, use the $ operator: model$cluster
Plot:
- plot the data as a scatter plot, using color to label each sample's cluster membership, to interpret the results of k-means (see the sketch below)
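A minimal sketch on simulated data (the matrix x and the choice of 2 centers are illustrative assumptions, not from the notes):

# two noisy 2-D subgroups
set.seed(42)
x <- rbind(matrix(rnorm(100), ncol = 2),
           matrix(rnorm(100, mean = 3), ncol = 2))
# fit k-means and read the cluster membership off the model
km <- kmeans(x, centers = 2, nstart = 20)
km$cluster
# scatter plot colored by cluster membership
plot(x, col = km$cluster, main = "k-means with 2 clusters")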
How k-means works and practical matters
Goals:
see visually how the k-means algorithm is implemented
model selection: determining number of clusters
How k-means works - one iteration of the method, step by step
suppose the data consists of two subgroups (aka clusters)
first, kmeans randomly assigns each point to one of the two clusters - this is the random aspect of the kmeans algorithm
second, calculate the center of each of the two subgroups - the average position of all the points in that subgroup
third, reassign each point to the cluster of the nearest center
steps two and three repeat; the algorithm finishes when no points change cluster assignment
but note there are other stopping criteria that can be chosen as well
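To make the loop concrete, here is a hand-rolled sketch of the three steps above on simulated data (all names and the data are illustrative; base R's kmeans() uses a more refined algorithm and should be preferred in practice):

set.seed(1)
x <- rbind(matrix(rnorm(100), ncol = 2),
           matrix(rnorm(100, mean = 3), ncol = 2))
k <- 2
# step 1: randomly assign each point to one of the k clusters
assignment <- sample(k, nrow(x), replace = TRUE)
repeat {
  # step 2: each center is the average position of its cluster's points
  # (ignores the empty-cluster edge case for brevity)
  centers <- sapply(1:k, function(j) colMeans(x[assignment == j, , drop = FALSE]))
  # step 3: reassign every point to the cluster of the nearest center
  d2 <- sapply(1:k, function(j) colSums((t(x) - centers[, j])^2))
  new.assignment <- max.col(-d2)
  # stop when no points change cluster assignment
  if (all(new.assignment == assignment)) break
  assignment <- new.assignment
}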
Model selection:
minimize the total within-cluster sum of squares (the squared distance from each observation to its cluster center)
the squared distances are summed over all observations to give the total within-cluster sum of squares
with nstart > 1, R repeats the clustering and keeps the best run (lowest total within-cluster sum of squares) automatically
for repeatability, call set.seed() before running kmeans() to guarantee reproducibility
use a scree plot to determine the number of clusters if it is not known beforehand (see the practice below)
B. Practice
- use the iter.max argument to kmeans() - the default of 10 iterations is not enough for convergence in this case
- keep only the six numeric stat columns
pokemon <- pokemon %>%
  select(HitPoints:Speed)
- initialize total within sum of squares error: wss
wss <- 0
- loop over 1 to 15 possible clusters
for (i in 1:15) {
  km.out <- kmeans(pokemon, centers = i, nstart = 20, iter.max = 50)
  wss[i] <- km.out$tot.withinss
}
- produce a scree plot
plot(1:15, wss, type = "b",
     xlab = "Number of Clusters",
     ylab = "Within groups sum of squares")
- select a proper number of clusters based on the plot; here we choose 3
k <- 3
- build the model with k clusters
km.out <- kmeans(pokemon, centers = k, nstart = 20, iter.max = 50)
- view the resulting model
km.out
## K-means clustering with 3 clusters of sizes 355, 270, 175
##
## Cluster means:
## HitPoints Attack Defense SpecialAttack SpecialDefense Speed
## 1 54.68732 56.93239 53.64507 52.02254 53.04789 53.58873
## 2 81.90370 96.15926 77.65556 104.12222 86.87778 94.71111
## 3 79.30857 97.29714 108.93143 66.71429 87.04571 57.29143
##
## Clustering vector:
## [1] 1 1 2 2 1 1 2 2 2 1 1 3 2 1 1 1 1 1 1 2 1 1 2 2 1 1 1 2 1 2 1 2 1 3 1 1 3
## [38] 1 1 2 1 2 1 2 1 1 1 2 1 1 2 1 3 1 2 1 1 1 2 1 2 1 2 1 2 1 1 3 1 2 2 2 1 3
## [75] 3 1 1 2 1 2 1 3 3 1 2 1 3 3 1 2 1 1 2 1 3 1 3 1 3 1 2 2 2 3 1 3 1 3 1 2 1
## [112] 2 1 3 3 3 1 1 3 1 3 1 3 3 3 1 2 1 3 1 2 2 2 2 2 2 3 3 3 1 3 3 3 1 1 2 2 2
## [149] 1 1 3 1 3 2 2 3 2 2 2 1 1 2 2 2 2 2 1 1 3 1 1 2 1 1 3 1 1 1 2 1 1 1 1 2 1
## [186] 2 1 1 1 1 1 1 2 1 1 2 2 3 1 1 3 2 1 1 2 1 1 1 1 1 3 2 3 1 3 2 1 1 2 1 3 1
## [223] 3 3 3 1 3 1 3 3 3 3 3 1 1 3 1 3 1 3 1 1 2 1 2 3 1 2 2 2 1 3 2 2 1 1 3 1 1
## [260] 1 3 2 2 2 3 1 1 3 3 2 2 2 1 1 2 2 1 1 2 2 1 1 3 3 1 1 1 1 1 1 1 1 1 1 1 2
## [297] 1 1 2 1 1 1 3 1 1 2 2 1 1 1 3 1 1 2 1 2 1 1 1 2 1 3 1 3 1 1 1 3 1 3 1 3 3
## [334] 3 1 1 2 1 2 2 1 1 1 1 1 1 3 1 2 2 1 2 1 2 3 3 1 2 1 1 1 2 1 2 1 3 2 2 2 2
## [371] 3 1 3 1 3 1 3 1 3 1 3 1 2 1 3 1 2 2 1 3 3 1 2 2 1 1 2 2 1 1 2 1 3 3 3 1 1
## [408] 3 2 2 1 3 3 2 3 3 3 2 2 2 2 2 2 3 2 2 2 2 2 2 3 2 1 3 3 1 1 2 1 1 2 1 1 2
## [445] 1 1 1 1 1 1 2 1 2 1 3 1 3 1 3 3 3 2 1 3 1 1 2 1 2 1 3 2 1 2 1 2 2 2 2 1 2
## [482] 1 1 2 1 3 1 1 1 1 3 1 1 2 2 1 1 2 2 1 3 1 3 1 2 3 1 2 1 1 2 3 2 2 3 3 3 2
## [519] 2 2 2 3 2 3 2 2 2 2 3 3 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 1
## [556] 1 2 1 1 2 1 1 2 1 1 1 1 3 1 2 1 2 1 2 1 2 1 3 1 1 2 1 2 1 3 3 1 2 1 2 3 3
## [593] 1 3 3 1 1 2 3 3 1 1 2 1 1 2 1 2 1 2 2 1 1 2 1 3 2 2 1 3 1 3 2 1 3 1 3 1 2
## [630] 1 3 1 2 1 2 1 1 3 1 1 2 1 2 1 1 2 1 2 2 1 3 1 3 1 2 3 1 2 1 3 1 3 3 1 1 2
## [667] 1 2 1 1 2 1 3 3 1 3 2 1 2 3 1 2 3 1 3 1 3 3 1 3 1 3 2 3 1 1 2 1 2 2 2 2 2
## [704] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 3 3 1 1 2 1 1 2 1 1 1 1 2 1 1 1 1 2 1 1 2
## [741] 1 2 1 3 2 1 2 2 1 3 2 3 1 3 1 2 1 3 1 3 1 3 1 2 1 2 1 3 1 2 2 2 2 3 1 2 2
## [778] 3 1 3 1 1 1 1 3 3 3 3 1 3 1 2 2 2 3 3 2 2 2 2
##
## Within cluster sum of squares by cluster:
## [1] 812079.9 1018348.0 709020.5
## (between_SS / total_SS = 40.8 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
- plot of Defense vs Speed by cluster membership
plot(pokemon[, c("Defense", "Speed")],
col = km.out$cluster,
main = paste("k-means clustering of Pokemon with", k, "clusters"),
xlab = "Defense", ylab = "Speed")C. Hierarchical clustering
used when the number of clusters is not known ahead of time
there are bottom-up and top-down approaches
bottom-up: starts by assigning each point to its own cluster, then repeatedly joins the two closest clusters into one, iterating until only a single cluster remains
only requires one parameter - the distance between observations
Steps
calculate the distances between observations: dist(x)
pass the distances to hclust() to get a hierarchical clustering model: hclust(dist(x))
Selecting the number of clusters
can create a dendrogram - it visualizes how points are joined into clusters based on their distances
the output of the hclust() function is passed into the plot() function
draw a horizontal line at a specified height (col is optional) - the height of the line means you want clusters that are no further apart than that height
here Euclidean distance is used as the distance measure (the heights seen in the dendrogram)
use tree "cutting" in R: cutree() "cuts" either by height h or by number of clusters k, returning a vector with a numeric cluster assignment for each observation; the whole workflow is sketched below
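A minimal sketch of the workflow on simulated data (the matrix x, the cut height h = 4 and k = 2 are arbitrary illustrative choices):

set.seed(2)
x <- rbind(matrix(rnorm(60), ncol = 2),
           matrix(rnorm(60, mean = 4), ncol = 2))
hc <- hclust(dist(x))       # dist() defaults to Euclidean distance
plot(hc)                    # dendrogram
abline(h = 4, col = "red")  # horizontal line at the chosen height
cutree(hc, h = 4)           # "cut" by height...
cutree(hc, k = 2)           # ...or by a desired number of clusters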
D. Clustering linkage and practical matters
measuring the distance between clusters - 4 methods in R, chosen via the method argument of hclust() (compared in the sketch after this list)
complete: compute all pairwise similarities between the two clusters and use the largest - produces more balanced trees
single: same as above but uses the smallest of the similarities - produces unbalanced trees
average: uses the average of the similarities - produces more balanced trees
centroid: distance between the centroids of cluster 1 and cluster 2 - not used as often as the other methods
whether to choose a balanced or unbalanced tree depends on the context of the problem
balanced trees are essential if you want an even number of observations assigned to each cluster
if you want to detect outliers, an unbalanced tree is more desirable, because cutting an unbalanced tree can assign most observations to one cluster and only a few to the others
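A small sketch comparing the linkage choices side by side (reusing the simulated x from the sketch above):

d <- dist(x)
par(mfrow = c(1, 3))  # three dendrograms in one row
plot(hclust(d, method = "complete"), main = "Complete")
plot(hclust(d, method = "single"),   main = "Single")
plot(hclust(d, method = "average"),  main = "Average")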
Practical matters
data on different scales can cause undesirable results in clustering methods
the solution is to scale the data so that features have the same mean and standard deviation
i.e. normalize the data - subtract the mean of a feature from all its observations
then divide each feature by its standard deviation
normalized features have a mean of zero and a standard deviation of one
always check whether "scaling" is necessary, i.e. check how much the means and standard deviations of the features vary
Some practice
on pokemon data
- check the column means
colMeans(pokemon)
## HitPoints Attack Defense SpecialAttack SpecialDefense
## 69.25875 79.00125 73.84250 72.82000 71.90250
## Speed
## 68.27750
- check the column standard deviations. since we want to apply sd() over columns, set the MARGIN argument to 2 in apply() (1 applies over rows)
apply(pokemon, 2, sd)
## HitPoints Attack Defense SpecialAttack SpecialDefense
## 25.53467 32.45737 31.18350 32.72229 27.82892
## Speed
## 29.06047
- as the results show, the means and standard deviations do vary, so we need to scale the data
pokemon.scaled <- scale(pokemon)
- create a hierarchical clustering model using the complete linkage method
hclust.pokemon <- hclust(dist(pokemon.scaled), method = "complete")
- apply cutree() to hclust.pokemon and assign a cluster membership to each observation, assuming 3 clusters
cut.pokemon <- cutree(hclust.pokemon, k = 3)
- use table() to compare cluster membership between the two clustering methods
table(km.out$cluster, cut.pokemon)
## cut.pokemon
## 1 2 3
## 1 350 5 0
## 2 267 3 0
## 3 171 3 1
- the hierarchical model assigns almost all observations to its first cluster (a very unbalanced cut), while k-means splits them far more evenly