This document covers the R-Codeline as referenced in the Power Point presentation for this CAS study. Please read/consult the presention regards the intense discussion and furhter input to this topic.
The codeline is based on this writing:
Grundlagen clusteranalytischer Verfahren
Institut fuer Soziologie - Universitaet Duisburg-Essen
Prof. Petra Stein - Sven Vollnhals 1. April 2011
Cluster Analysis
Cluster analysis is a powerful toolkit in the data science workbench. It is used to find groups of observations (clusters) that share similar characteristics. These similarities can inform all kinds of business decisions; for example, in marketing, it is used to identify distinct groups of customers for which advertisements can be tailored.
The following script depicts the cluster analysis based on random genarated data, covering income vs. eductional lavel (yeary in scool)
library(cluster)
Remove all object currently in the R-Studio session.
rm(list=ls())
We generate 300 observations of income and educational level (number of years in school) The data will be laded in data.frame “data”
3 Income ranges (10k, 40k and 70k USD)
inc1 <- rnorm(100, mean = 10000, sd = 5000)
inc2 <- rnorm(100, mean = 40000, sd = 5000)
inc3 <- rnorm(100, mean = 70000, sd = 5000)
inc <- c(inc1, inc2, inc3)
head(inc)
## [1] 7272.759 18141.909 5485.967 3826.970 12069.524 4434.149
tail(inc)
## [1] 66366.13 79551.82 73216.13 63690.96 71449.88 79120.88
summary(inc)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3343 12866 39720 40198 67131 82933
school1 <- round(runif(100, min = 6, max = 10))
school2 <- round(runif(100, min = 12, max = 13))
school3 <- round(runif(100, min = 16, max = 18))
school <- c(school1, school2, school3)
head(school)
## [1] 7 8 8 7 7 7
tail(school)
## [1] 17 17 17 18 18 17
summary(school)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 9.00 13.00 12.53 17.00 18.00
inc_school <- cbind(inc, school)
data <- data.frame(inc_school)
attach(data)
## The following objects are masked _by_ .GlobalEnv:
##
## inc, school
str(data)
## 'data.frame': 300 obs. of 2 variables:
## $ inc : num 7273 18142 5486 3827 12070 ...
## $ school: num 7 8 8 7 7 7 8 7 7 9 ...
Add an income-indicator (low, medium, high) as new column “categories” to data.frame “data”
high.dummy <- ifelse(data$inc > 60000, 3, 0)
mid.dummy <- ifelse(data$inc < 60000 & data$inc > 30000, 2, 0)
low.dummy <- ifelse(data$inc < 30000, 1, 0)
categories_ <- low.dummy + mid.dummy + high.dummy
categories = factor(categories_, labels = c("Low", "Mid", "High"))
data <- cbind(data, categories)
head(categories_,20) # values 1-3
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
head(categories,20) # values Low-Mid-High
## [1] Low Low Low Low Low Low Low Low Low Low Low Low Low Low Low Low Low
## [18] Low Low Low
## Levels: Low Mid High
head(data,10)
str(data)
## 'data.frame': 300 obs. of 3 variables:
## $ inc : num 7273 18142 5486 3827 12070 ...
## $ school : num 7 8 8 7 7 7 8 7 7 9 ...
## $ categories: Factor w/ 3 levels "Low","Mid","High": 1 1 1 1 1 1 1 1 1 1 ...
As expected (based on the data generation) we can observe, that income and education is well partitioned in the low, medium and high income range
par(mfrow=c(1,2)) # display two boxplot in one frame
boxplot(inc~categories,main="Boxplot Income",
ylab="Income in Euro",xlab="Income Groups")
boxplot(school~categories,main="Boxplot Eduction",
ylab="Number of Years in Education",xlab="Income Groups")
par(mfrow=c(1,2))
plot(density(inc),main="Kernel Density Income")
plot(density(school),main="Kernel Density Eduction")
Hierarchical Clustering
The result of hierarchical clustering is a tree-based representation of the objects, which is also known as dendrogram. Observations can be subdivided into groups by cutting the dendrogram at a desired similarity level.
cluster <- data.frame(data$inc,data$school)
Compute all the pairwise dissimilarities (distances) between observations in the data set.
metric: Character string specifying the metric to be used. The currently available options are “euclidean” (the default), “manhattan” and “gower”
stand: if TRUE, then the measurements in x are standardized before calculating the dissimilarities. Measurements are standardized for each variable (column), by subtracting the variable’s mean value and dividing by the variable’s mean absolute deviation.
dist.euclid <- daisy(cluster,metric="euclidean",stand=TRUE)
summary(dist.euclid)
## 44850 dissimilarities, summarized :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000022 0.568750 1.927900 1.940400 2.525500 5.198400
## Metric : euclidean
## Number of objects : 300
str(dist.euclid)
## 'dissimilarity' num [1:44850] 0.59 0.326 0.158 0.22 0.13 ...
## - attr(*, "Size")= int 300
## - attr(*, "Metric")= chr "euclidean"
head(dist.euclid,10)
## [1] 0.58992460 0.32601107 0.15801763 0.21997086 0.13017351 0.35672515
## [7] 0.01692903 0.12424322 0.68127919 1.28189849
Hierarchical cluster analysis on a set of dissimilarities and methods for analyzing it.
method: The agglomeration method to be used.
This should be (an unambiguous abbreviation of) one of “ward.D”, “ward.D2”, “single”,“complete”, “average” (= UPGMA), “mcquitty” (= WPGMA), “median” (= WPGMC) or “centroid” (= UPGMC).
dendogramm <- hclust(dist.euclid,method="average")
str(dendogramm)
## List of 7
## $ merge : int [1:299, 1:2] -48 -210 -16 -175 -27 -108 -176 -148 -45 -106 ...
## $ height : num [1:299] 2.24e-05 1.05e-04 3.55e-04 3.70e-04 3.89e-04 ...
## $ order : int [1:300] 11 95 75 99 33 97 8 1 20 69 ...
## $ labels : NULL
## $ method : chr "average"
## $ call : language hclust(d = dist.euclid, method = "average")
## $ dist.method: NULL
## - attr(*, "class")= chr "hclust"
summary(dendogramm)
## Length Class Mode
## merge 598 -none- numeric
## height 299 -none- numeric
## order 300 -none- numeric
## labels 0 -none- NULL
## method 1 -none- character
## call 3 -none- call
## dist.method 0 -none- NULL
The result of hierarchical clustering is a tree-based representation of the objects, which is also known as dendrogram.Observations can be subdivided into groups by cutting the dendrogram at a desired similarity level. In our example, as expected again, we can see that the dendogram splits the data.frame in three groups (low, medium and high income clusters.
par(mfrow=c(1,1))
plot(dendogramm,xlab="Objects",ylab="Distance",
main="Dendogramm of Clusteranalysis (Average)",labels=FALSE)
Cuts a tree, e.g., as resulting from hclust, into several groups either by specifying the desired number(s) of groups or the cut height(s).
k: an integer scalar or vector with the desired number of groups
cluster.hierarch_3 <- cutree(dendogramm,k=3)
cluster.hierarch_3
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [71] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2
## [106] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [141] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [176] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3
## [211] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [246] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [281] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
str(cluster.hierarch_3)
## int [1:300] 1 1 1 1 1 1 1 1 1 1 ...
summary(cluster.hierarch_3)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 1 2 2 3 3
data<-cbind(data,cluster.hierarch_3)
head(data, 10)
tail(data, 10)
str(data)
## 'data.frame': 300 obs. of 4 variables:
## $ inc : num 7273 18142 5486 3827 12070 ...
## $ school : num 7 8 8 7 7 7 8 7 7 9 ...
## $ categories : Factor w/ 3 levels "Low","Mid","High": 1 1 1 1 1 1 1 1 1 1 ...
## $ cluster.hierarch_3: int 1 1 1 1 1 1 1 1 1 1 ...
By having the three clusters applied to the data.frame “data”, we are able to apply further analysis to the dataset. Eg. Calculating the mean and standard deviation, per cluster and per measure (income, school)
Apply a function to each cell of a ragged array, that is to each (non-empty) group of values given by a unique combination of the levels of certain factors.
tapply(inc,cluster.hierarch_3,mean) # mean(income) by cluster 1-3
## 1 2 3
## 9495.325 40048.850 71048.944
tapply(inc,cluster.hierarch_3, sd) # standard-deviation(income) by cluster 1-3
## 1 2 3
## 5080.087 4834.493 5362.631
tapply(school,cluster.hierarch_3,mean) # mean(school) by cluster 1-3
## 1 2 3
## 8.01 12.55 17.02
tapply(school,cluster.hierarch_3, sd) # standard-deviation(school) by cluster 1-3
## 1 2 3
## 1.0588253 0.5000000 0.6663636
table(cluster.hierarch_3)
## cluster.hierarch_3
## 1 2 3
## 100 100 100
We can observe, that 100 persons by cluster 1-3 have been assigned each
Let’s see how the distribution looks-like in case we build just 2(two) clusters
cluster.hierarch_2<-cutree(dendogramm,k=2)
data<-cbind(data,cluster.hierarch_2)
tapply(inc,cluster.hierarch_2,mean)
## 1 2
## 9495.325 55548.897
tapply(school,cluster.hierarch_2,mean)
## 1 2
## 8.010 14.785
table(cluster.hierarch_2)
## cluster.hierarch_2
## 1 2
## 100 200
Perform a divisive hierarchical cluster analysis with k-means.
K-means represents one of the most popular clustering algorithm. However, it has some limitations: it requires the user to specify the number of clusters in advance and selects initial centroids randomly.
The final k-means clustering solution is very sensitive to this initial random selection of cluster centers.
The result might be (slightly) different each time you compute k-means.
Different scaling of “school” (6 to 18) vs. “inc” (-1289 to 80426) asks for Z-Standarisation.
summary(school)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 9.00 13.00 12.53 17.00 18.00
summary(inc)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3343 12866 39720 40198 67131 82933
Do the scaling:
inc.stand<-(inc-mean(inc))/sd(inc)
school.stand<-(school-mean(school))/sd(school)
head(inc.stand,10)
## [1] -1.2821818 -0.8589092 -1.3517640 -1.4163696 -1.0953835 -1.3927245
## [7] -1.4234743 -1.2678057 -1.3876885 -1.0642435
head(school.stand,10)
## [1] -1.4678379 -1.2022460 -1.2022460 -1.4678379 -1.4678379 -1.4678379
## [7] -1.2022460 -1.4678379 -1.4678379 -0.9366541
cluster.k<-cbind(inc.stand,school.stand)
head(cluster.k,10)
## inc.stand school.stand
## [1,] -1.2821818 -1.4678379
## [2,] -0.8589092 -1.2022460
## [3,] -1.3517640 -1.2022460
## [4,] -1.4163696 -1.4678379
## [5,] -1.0953835 -1.4678379
## [6,] -1.3927245 -1.4678379
## [7,] -1.4234743 -1.2022460
## [8,] -1.2678057 -1.4678379
## [9,] -1.3876885 -1.4678379
## [10,] -1.0642435 -0.9366541
Perform k-means clustering on a data matrix.
centers: Either the number of clusters, say k, or a st of initial (distinct) cluster centres. If a number, a random set of (distinct) rows in x is chosen as the initial centre
clusterzentren <- kmeans(cluster.k, centers = 3)
clusterzentren
## K-means clustering with 3 clusters of sizes 77, 200, 23
##
## Cluster means:
## inc.stand school.stand
## 1 1.2040207 1.1156469
## 2 -0.6007131 -0.5966965
## 3 1.1927402 1.4536730
##
## Clustering vector:
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [36] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [71] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [106] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [141] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [176] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 3 1 1 1 1 1 1 1
## [211] 1 3 1 1 1 3 1 3 1 1 1 1 1 3 3 1 3 1 1 1 3 3 1 1 1 3 1 1 1 1 1 3 3 1 3
## [246] 1 1 1 1 1 3 1 1 1 1 1 1 3 3 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1
## [281] 1 1 1 1 1 1 1 1 1 1 3 3 3 1 1 1 1 3 3 1
##
## Within cluster sum of squares by cluster:
## [1] 4.4890235 160.4397957 0.9036224
## (between_SS / total_SS = 72.3 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
clusterzentren$cluster
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [36] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [71] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [106] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [141] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [176] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 3 1 1 1 1 1 1 1
## [211] 1 3 1 1 1 3 1 3 1 1 1 1 1 3 3 1 3 1 1 1 3 3 1 1 1 3 1 1 1 1 1 3 3 1 3
## [246] 1 1 1 1 1 3 1 1 1 1 1 1 3 3 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1
## [281] 1 1 1 1 1 1 1 1 1 1 3 3 3 1 1 1 1 3 3 1
cluster.k<-data.frame(cbind(cluster.k,clusterzentren$cluster))
head(cluster.k,10)
names(cluster.k)
## [1] "inc.stand" "school.stand" "V3"
sozioeconomical.group=factor(clusterzentren$cluster,
labels=c("Medium","Low","High"))
head(sozioeconomical.group,10)
## [1] Low Low Low Low Low Low Low Low Low Low
## Levels: Medium Low High
data<-cbind(data,sozioeconomical.group)
names(data)
## [1] "inc" "school" "categories"
## [4] "cluster.hierarch_3" "cluster.hierarch_2" "sozioeconomical.group"
head(data,10)
head(data[,c(1,2,5)],10)
tail(data[,c(1,2,5)],10)
str(data)
## 'data.frame': 300 obs. of 6 variables:
## $ inc : num 7273 18142 5486 3827 12070 ...
## $ school : num 7 8 8 7 7 7 8 7 7 9 ...
## $ categories : Factor w/ 3 levels "Low","Mid","High": 1 1 1 1 1 1 1 1 1 1 ...
## $ cluster.hierarch_3 : int 1 1 1 1 1 1 1 1 1 1 ...
## $ cluster.hierarch_2 : int 1 1 1 1 1 1 1 1 1 1 ...
## $ sozioeconomical.group: Factor w/ 3 levels "Medium","Low",..: 2 2 2 2 2 2 2 2 2 2 ...
summary(data)
## inc school categories cluster.hierarch_3
## Min. :-3343 Min. : 6.00 Low :101 Min. :1
## 1st Qu.:12866 1st Qu.: 9.00 Mid : 99 1st Qu.:1
## Median :39720 Median :13.00 High:100 Median :2
## Mean :40198 Mean :12.53 Mean :2
## 3rd Qu.:67131 3rd Qu.:17.00 3rd Qu.:3
## Max. :82933 Max. :18.00 Max. :3
## cluster.hierarch_2 sozioeconomical.group
## Min. :1.000 Medium: 77
## 1st Qu.:1.000 Low :200
## Median :2.000 High : 23
## Mean :1.667
## 3rd Qu.:2.000
## Max. :2.000
tapply(inc, sozioeconomical.group, mean)
## Medium Low High
## 71115.57 24772.09 70825.90
tapply(school, sozioeconomical.group, mean)
## Medium Low High
## 16.72727 10.28000 18.00000