Preface / Reference

This document covers the R code referenced in the PowerPoint presentation for this CAS study. Please consult the presentation for the in-depth discussion and further input on this topic.

The code is based on the following publication:

Grundlagen clusteranalytischer Verfahren
Institut für Soziologie - Universität Duisburg-Essen
Prof. Petra Stein - Sven Vollnhals 1. April 2011

PDF Document (Link)

Introduction

Cluster Analysis

Cluster analysis is a powerful toolkit in the data science workbench. It is used to find groups of observations (clusters) that share similar characteristics. These similarities can inform all kinds of business decisions; for example, in marketing, it is used to identify distinct groups of customers for which advertisements can be tailored.

The following script demonstrates a cluster analysis on randomly generated data, covering income vs. educational level (years in school).

Preparation

Load Libraries

library(cluster)

Cleanup

Remove all objects currently in the RStudio session.

rm(list=ls())

Establish test data

We generate 300 observations of income and educational level (number of years in school). The data will be loaded into the data.frame “data”.

Generate Income: low, medium, high

3 Income ranges (10k, 40k and 70k USD)

inc1 <- rnorm(100, mean = 10000, sd = 5000)
inc2 <- rnorm(100, mean = 40000, sd = 5000)
inc3 <- rnorm(100, mean = 70000, sd = 5000)
inc <- c(inc1, inc2, inc3)

head(inc)
## [1]  7272.759 18141.909  5485.967  3826.970 12069.524  4434.149
tail(inc)
## [1] 66366.13 79551.82 73216.13 63690.96 71449.88 79120.88
summary(inc)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -3343   12866   39720   40198   67131   82933

Generate Education: years in education

school1 <- round(runif(100, min = 6, max = 10))
school2 <- round(runif(100, min = 12, max = 13))
school3 <- round(runif(100, min = 16, max = 18))
school <- c(school1, school2, school3)

head(school)
## [1] 7 8 8 7 7 7
tail(school)
## [1] 17 17 17 18 18 17
summary(school)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00    9.00   13.00   12.53   17.00   18.00

Establish data.frame “data”

inc_school <- cbind(inc, school)
data <- data.frame(inc_school)

attach(data)
## The following objects are masked _by_ .GlobalEnv:
## 
##     inc, school
str(data)
## 'data.frame':    300 obs. of  2 variables:
##  $ inc   : num  7273 18142 5486 3827 12070 ...
##  $ school: num  7 8 8 7 7 7 8 7 7 9 ...

Establish Indicator for Income Ranges

Add an income indicator (low, medium, high) as a new column “categories” to the data.frame “data”.

high.dummy <- ifelse(data$inc > 60000, 3, 0)
mid.dummy <- ifelse(data$inc < 60000 & data$inc > 30000, 2, 0)
low.dummy <- ifelse(data$inc < 30000, 1, 0)
categories_ <- low.dummy + mid.dummy + high.dummy
categories = factor(categories_, labels = c("Low", "Mid", "High"))
data <- cbind(data, categories)

head(categories_,20) # values 1-3
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
head(categories,20) # values Low-Mid-High
##  [1] Low Low Low Low Low Low Low Low Low Low Low Low Low Low Low Low Low
## [18] Low Low Low
## Levels: Low Mid High
head(data,10)
str(data)
## 'data.frame':    300 obs. of  3 variables:
##  $ inc       : num  7273 18142 5486 3827 12070 ...
##  $ school    : num  7 8 8 7 7 7 8 7 7 9 ...
##  $ categories: Factor w/ 3 levels "Low","Mid","High": 1 1 1 1 1 1 1 1 1 1 ...
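The same indicator could also be built in a single step with cut(). A minimal sketch, assuming the same break points (30,000 and 60,000 USD); the name categories.alt is illustrative:

# Sketch: equivalent income indicator built with cut()
# (boundary handling differs from the dummy approach only at the exact break values)
categories.alt <- cut(data$inc, breaks = c(-Inf, 30000, 60000, Inf),
                      labels = c("Low", "Mid", "High"))
table(categories.alt)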

Display income and school distribution by income ranges

As expected (given how the data were generated), we can observe that income and education are well partitioned into the low, medium and high income ranges.

par(mfrow=c(1,2)) # display two boxplot in one frame

boxplot(inc~categories,main="Boxplot Income",
        ylab="Income in USD",xlab="Income Groups")

boxplot(school~categories,main="Boxplot Education",
        ylab="Number of Years in Education",xlab="Income Groups")

par(mfrow=c(1,2))
plot(density(inc),main="Kernel Density Income")
plot(density(school),main="Kernel Density Education")

Do Cluster Analysis

Hierarchical Clustering

The result of hierarchical clustering is a tree-based representation of the objects, also known as a dendrogram. Observations can be subdivided into groups by cutting the dendrogram at a desired similarity level.

Establish data.frame cluster

cluster <- data.frame(data$inc,data$school)

daisy (cluster): Dissimilarity Matrix Calculation

Compute all the pairwise dissimilarities (distances) between observations in the data set.

metric: Character string specifying the metric to be used. The currently available options are “euclidean” (the default), “manhattan” and “gower”.
stand: if TRUE, then the measurements in x are standardized before calculating the dissimilarities. Measurements are standardized for each variable (column), by subtracting the variable’s mean value and dividing by the variable’s mean absolute deviation.

dist.euclid <- daisy(cluster,metric="euclidean",stand=TRUE)

summary(dist.euclid)
## 44850 dissimilarities, summarized :
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.000022 0.568750 1.927900 1.940400 2.525500 5.198400 
## Metric :  euclidean 
## Number of objects : 300
str(dist.euclid)
##  'dissimilarity' num [1:44850] 0.59 0.326 0.158 0.22 0.13 ...
##  - attr(*, "Size")= int 300
##  - attr(*, "Metric")= chr "euclidean"
head(dist.euclid,10)
##  [1] 0.58992460 0.32601107 0.15801763 0.21997086 0.13017351 0.35672515
##  [7] 0.01692903 0.12424322 0.68127919 1.28189849
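As an optional sanity check, the standardization described above (subtract the variable’s mean, divide by its mean absolute deviation) can be reproduced by hand and compared with the daisy() result. A minimal sketch, assuming that equivalence; the names z and dist.manual are illustrative:

# Sketch: manual standardization (mean / mean absolute deviation) plus euclidean distances
z <- apply(cluster, 2, function(x) (x - mean(x)) / mean(abs(x - mean(x))))
dist.manual <- dist(z, method = "euclidean")
all.equal(as.numeric(dist.euclid), as.numeric(dist.manual)) # should be TRUE (up to tolerance)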

hclust (stats): Hierarchical Clustering

Hierarchical cluster analysis on a set of dissimilarities and methods for analyzing it.

method: The agglomeration method to be used.
This should be (an unambiguous abbreviation of) one of “ward.D”, “ward.D2”, “single”,“complete”, “average” (= UPGMA), “mcquitty” (= WPGMA), “median” (= WPGMC) or “centroid” (= UPGMC).

dendogramm <- hclust(dist.euclid,method="average")

str(dendogramm)
## List of 7
##  $ merge      : int [1:299, 1:2] -48 -210 -16 -175 -27 -108 -176 -148 -45 -106 ...
##  $ height     : num [1:299] 2.24e-05 1.05e-04 3.55e-04 3.70e-04 3.89e-04 ...
##  $ order      : int [1:300] 11 95 75 99 33 97 8 1 20 69 ...
##  $ labels     : NULL
##  $ method     : chr "average"
##  $ call       : language hclust(d = dist.euclid, method = "average")
##  $ dist.method: NULL
##  - attr(*, "class")= chr "hclust"
summary(dendogramm)
##             Length Class  Mode     
## merge       598    -none- numeric  
## height      299    -none- numeric  
## order       300    -none- numeric  
## labels        0    -none- NULL     
## method        1    -none- character
## call          3    -none- call     
## dist.method   0    -none- NULL

Display Dendrogram

As described above, observations can be subdivided into groups by cutting the dendrogram at a desired similarity level. In our example, as expected, the dendrogram splits the data.frame into three groups (the low, medium and high income clusters).

par(mfrow=c(1,1)) 
plot(dendogramm,xlab="Objects",ylab="Distance",
     main="Dendrogram of Cluster Analysis (Average)",labels=FALSE)

cutree (stats): Cut A Tree Into Groups Of Data

Cuts a tree, e.g., as resulting from hclust, into several groups either by specifying the desired number(s) of groups or the cut height(s).

k: an integer scalar or vector with the desired number of groups

cluster.hierarch_3 <- cutree(dendogramm,k=3)

cluster.hierarch_3
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [71] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2
## [106] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [141] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [176] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3
## [211] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [246] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [281] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
str(cluster.hierarch_3)
##  int [1:300] 1 1 1 1 1 1 1 1 1 1 ...
summary(cluster.hierarch_3)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1       1       2       2       3       3
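The three-group cut can also be drawn directly on the dendrogram with rect.hclust() from the stats package. A minimal sketch:

# Sketch: visualize the k = 3 cut on the dendrogram
plot(dendogramm, xlab = "Objects", ylab = "Distance",
     main = "Dendrogram with 3-Cluster Cut", labels = FALSE)
rect.hclust(dendogramm, k = 3)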

Bind cluster IDs to the data.frame

data<-cbind(data,cluster.hierarch_3)

head(data, 10)
tail(data, 10)
str(data)
## 'data.frame':    300 obs. of  4 variables:
##  $ inc               : num  7273 18142 5486 3827 12070 ...
##  $ school            : num  7 8 8 7 7 7 8 7 7 9 ...
##  $ categories        : Factor w/ 3 levels "Low","Mid","High": 1 1 1 1 1 1 1 1 1 1 ...
##  $ cluster.hierarch_3: int  1 1 1 1 1 1 1 1 1 1 ...

Review and analyze results

With the three cluster IDs attached to the data.frame “data”, we can run further analyses on the dataset, e.g. calculating the mean and standard deviation per cluster and per measure (income, school).

tapply (base): Apply A Function Over A Ragged Array

Apply a function to each cell of a ragged array, that is to each (non-empty) group of values given by a unique combination of the levels of certain factors.

tapply(inc,cluster.hierarch_3,mean) # mean(income) by cluster 1-3
##         1         2         3 
##  9495.325 40048.850 71048.944
tapply(inc,cluster.hierarch_3, sd) # standard-deviation(income) by cluster 1-3
##        1        2        3 
## 5080.087 4834.493 5362.631
tapply(school,cluster.hierarch_3,mean) # mean(school) by cluster 1-3
##     1     2     3 
##  8.01 12.55 17.02
tapply(school,cluster.hierarch_3, sd) # standard-deviation(school) by cluster 1-3
##         1         2         3 
## 1.0588253 0.5000000 0.6663636

Check the number of persons (observations) in clusters 1-3

table(cluster.hierarch_3)
## cluster.hierarch_3
##   1   2   3 
## 100 100 100

We can observe that 100 persons have been assigned to each of the clusters 1-3.
Let’s see how the distribution looks if we build just two clusters.

cluster.hierarch_2<-cutree(dendogramm,k=2)
data<-cbind(data,cluster.hierarch_2)

tapply(inc,cluster.hierarch_2,mean)
##         1         2 
##  9495.325 55548.897
tapply(school,cluster.hierarch_2,mean)
##      1      2 
##  8.010 14.785
table(cluster.hierarch_2)
## cluster.hierarch_2
##   1   2 
## 100 200
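As an optional check, the cluster assignments can be cross-tabulated against the generated income categories. A minimal sketch:

# Sketch: cross-tabulate the generated income categories
# against the 3-cluster and the 2-cluster solutions
table(data$categories, data$cluster.hierarch_3)
table(data$categories, data$cluster.hierarch_2)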

Build clusters with k-means

In contrast to the hierarchical approach above, we now perform a partitioning cluster analysis with k-means.

K-means is one of the most popular clustering algorithms. However, it has some limitations: it requires the user to specify the number of clusters in advance and it selects the initial centroids randomly.
The final k-means clustering solution is very sensitive to this initial random selection of cluster centers.
The result might be (slightly) different each time you compute k-means.

Standardize data (inc, school)

The different scales of “school” (6 to 18) vs. “inc” (roughly -3,300 to 83,000) call for z-standardization.

summary(school)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00    9.00   13.00   12.53   17.00   18.00
summary(inc)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -3343   12866   39720   40198   67131   82933

Do the scaling:

inc.stand<-(inc-mean(inc))/sd(inc)
school.stand<-(school-mean(school))/sd(school)

head(inc.stand,10)
##  [1] -1.2821818 -0.8589092 -1.3517640 -1.4163696 -1.0953835 -1.3927245
##  [7] -1.4234743 -1.2678057 -1.3876885 -1.0642435
head(school.stand,10)
##  [1] -1.4678379 -1.2022460 -1.2022460 -1.4678379 -1.4678379 -1.4678379
##  [7] -1.2022460 -1.4678379 -1.4678379 -0.9366541
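The built-in scale() function performs the same z-standardization (center by the column mean, divide by the column standard deviation). A minimal sketch; the name cluster.k.alt is illustrative:

# Sketch: z-standardization with scale() instead of the manual formula
cluster.k.alt <- scale(cbind(inc, school))
head(cluster.k.alt, 3)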

Prepare data for k-means clustering

cluster.k<-cbind(inc.stand,school.stand)

head(cluster.k,10)
##        inc.stand school.stand
##  [1,] -1.2821818   -1.4678379
##  [2,] -0.8589092   -1.2022460
##  [3,] -1.3517640   -1.2022460
##  [4,] -1.4163696   -1.4678379
##  [5,] -1.0953835   -1.4678379
##  [6,] -1.3927245   -1.4678379
##  [7,] -1.4234743   -1.2022460
##  [8,] -1.2678057   -1.4678379
##  [9,] -1.3876885   -1.4678379
## [10,] -1.0642435   -0.9366541

kmeans: K-Means Clustering

Perform k-means clustering on a data matrix.

centers: Either the number of clusters, say k, or a set of initial (distinct) cluster centres. If a number, a random set of (distinct) rows in x is chosen as the initial centres.

clusterzentren <- kmeans(cluster.k, centers = 3)
clusterzentren
## K-means clustering with 3 clusters of sizes 77, 200, 23
## 
## Cluster means:
##    inc.stand school.stand
## 1  1.2040207    1.1156469
## 2 -0.6007131   -0.5966965
## 3  1.1927402    1.4536730
## 
## Clustering vector:
##   [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [36] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [71] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [106] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [141] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [176] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 3 1 1 1 1 1 1 1
## [211] 1 3 1 1 1 3 1 3 1 1 1 1 1 3 3 1 3 1 1 1 3 3 1 1 1 3 1 1 1 1 1 3 3 1 3
## [246] 1 1 1 1 1 3 1 1 1 1 1 1 3 3 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1
## [281] 1 1 1 1 1 1 1 1 1 1 3 3 3 1 1 1 1 3 3 1
## 
## Within cluster sum of squares by cluster:
## [1]   4.4890235 160.4397957   0.9036224
##  (between_SS / total_SS =  72.3 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"

Establish data.frame with cluster assignment

clusterzentren$cluster
##   [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [36] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [71] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [106] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [141] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [176] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 3 1 1 1 1 1 1 1
## [211] 1 3 1 1 1 3 1 3 1 1 1 1 1 3 3 1 3 1 1 1 3 3 1 1 1 3 1 1 1 1 1 3 3 1 3
## [246] 1 1 1 1 1 3 1 1 1 1 1 1 3 3 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1
## [281] 1 1 1 1 1 1 1 1 1 1 3 3 3 1 1 1 1 3 3 1
cluster.k<-data.frame(cbind(cluster.k,clusterzentren$cluster))
head(cluster.k,10)
names(cluster.k)
## [1] "inc.stand"    "school.stand" "V3"

Build factor with labels for cluster groups

sozioeconomical.group=factor(clusterzentren$cluster,
                             labels=c("Medium","Low","High"))

head(sozioeconomical.group,10)
##  [1] Low Low Low Low Low Low Low Low Low Low
## Levels: Medium Low High

Bind cluster labels to the originating data.frame

data<-cbind(data,sozioeconomical.group)

names(data)
## [1] "inc"                   "school"                "categories"           
## [4] "cluster.hierarch_3"    "cluster.hierarch_2"    "sozioeconomical.group"
head(data,10)
head(data[,c(1,2,5)],10)
tail(data[,c(1,2,5)],10)
str(data)
## 'data.frame':    300 obs. of  6 variables:
##  $ inc                  : num  7273 18142 5486 3827 12070 ...
##  $ school               : num  7 8 8 7 7 7 8 7 7 9 ...
##  $ categories           : Factor w/ 3 levels "Low","Mid","High": 1 1 1 1 1 1 1 1 1 1 ...
##  $ cluster.hierarch_3   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ cluster.hierarch_2   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ sozioeconomical.group: Factor w/ 3 levels "Medium","Low",..: 2 2 2 2 2 2 2 2 2 2 ...
summary(data)
##       inc            school      categories cluster.hierarch_3
##  Min.   :-3343   Min.   : 6.00   Low :101   Min.   :1         
##  1st Qu.:12866   1st Qu.: 9.00   Mid : 99   1st Qu.:1         
##  Median :39720   Median :13.00   High:100   Median :2         
##  Mean   :40198   Mean   :12.53              Mean   :2         
##  3rd Qu.:67131   3rd Qu.:17.00              3rd Qu.:3         
##  Max.   :82933   Max.   :18.00              Max.   :3         
##  cluster.hierarch_2 sozioeconomical.group
##  Min.   :1.000      Medium: 77           
##  1st Qu.:1.000      Low   :200           
##  Median :2.000      High  : 23           
##  Mean   :1.667                           
##  3rd Qu.:2.000                           
##  Max.   :2.000

Check mean values by clusters

tapply(inc, sozioeconomical.group, mean)
##   Medium      Low     High 
## 71115.57 24772.09 70825.90
tapply(school, sozioeconomical.group, mean)
##   Medium      Low     High 
## 16.72727 10.28000 18.00000
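Note that kmeans() numbers its clusters arbitrarily, so hard-coded labels such as “Medium”, “Low”, “High” may not match the actual income ordering (in the output above, the group labelled “Medium” has the highest mean income). One option is to derive the labels from the ordering of the cluster centers; this only changes how the groups are labelled, not the cluster solution itself. A minimal sketch; the names rank.by.inc and sozioeconomical.group.ranked are illustrative:

# Sketch: label clusters by the rank of their income center instead of their arbitrary ID
rank.by.inc <- rank(clusterzentren$centers[, "inc.stand"]) # 1 = lowest, 3 = highest income center
sozioeconomical.group.ranked <- factor(rank.by.inc[clusterzentren$cluster],
                                       levels = 1:3, labels = c("Low", "Medium", "High"))
tapply(inc, sozioeconomical.group.ranked, mean)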