Remember, for cluster analysis:
1. Rows are observations and columns are variables.
2. Any missing values must be imputed or removed.
3. The data must be standardised.
Let's load the “USArrests” dataset.
data("USArrests")
df <- USArrests
df <- na.omit(df)
df <- scale(df)
head(df,n=3)
##             Murder   Assault   UrbanPop         Rape
## Alabama 1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska  0.50786248 1.1068225 -1.2117642  2.484202941
## Arizona 0.07163341 1.4788032  0.9989801  1.042878388
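To make the standardisation step concrete, here is a minimal sketch that reproduces what scale() does by hand: subtract each column's mean and divide by its standard deviation. The names `x`, `manual` and `auto` are introduced only for this illustration.

```r
data("USArrests")
x <- as.matrix(USArrests)

# Standardise by hand: centre each column on its mean,
# then divide by the column's standard deviation
manual <- sweep(x, 2, colMeans(x), "-")
manual <- sweep(manual, 2, apply(x, 2, sd), "/")

# scale() with its default arguments should give the same values
auto <- scale(x)
all.equal(manual, auto, check.attributes = FALSE)

# After scaling, every column has mean (approximately) 0 and standard deviation 1
round(colMeans(auto), 10)
apply(auto, 2, sd)
```

Standardising matters here because Assault is measured on a much larger scale than Murder, and would otherwise dominate any distance computed from the raw values.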
We will use the packages “cluster” and “factoextra”
Data Preparation
Using US Arrests, we take 15 random rows, then we scale the data
set.seed(123)
ss <- sample(1:50,15) # row indices of 15 randomly selected rows
df <- USArrests[ss,] # subset the rows using these indices
df.scaled <- scale(df) # standardise the variables
Computing Euclidean Distance
dist.eucl <- dist(df.scaled,method="euclidean")
You can see this distance in the form of a matrix
# subset the first three columns and rows and round the values
round(as.matrix(dist.eucl)[1:3,1:3],1)
##            New Mexico Iowa Indiana
## New Mexico        0.0  4.1     2.5
## Iowa              4.1  0.0     1.8
## Indiana           2.5  1.8     0.0
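As a check on what dist() computes, the sketch below works out the Euclidean distance between two states by hand (the square root of the summed squared differences) and compares it with the matching entry of the distance matrix. For simplicity it scales all 50 states rather than the 15-row sample, so the number will not match the 1.8 shown above; `a`, `b`, `by_hand` and `from_dist` are names chosen only for this illustration.

```r
data("USArrests")
x <- scale(USArrests)

# Two observations (rows) of the standardised data
a <- x["Iowa", ]
b <- x["Indiana", ]

# Euclidean distance by hand
by_hand <- sqrt(sum((a - b)^2))

# The same pair looked up in the matrix form of dist()
d <- as.matrix(dist(x, method = "euclidean"))
from_dist <- d["Iowa", "Indiana"]

all.equal(by_hand, from_dist)
```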
Computing correlation-based distances
The correlation can be Pearson, Spearman or Kendall.
# compute the correlation-based distance (Pearson)
library("factoextra")
dist.cor <- get_dist(df.scaled,method="pearson")
# display a subset
round(as.matrix(dist.cor)[1:3,1:3],1)
##            New Mexico Iowa Indiana
## New Mexico        0.0  1.7     2.0
## Iowa              1.7  0.0     0.3
## Indiana           2.0  0.3     0.0
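A correlation-based distance can also be built in base R. The sketch below assumes the usual convention of one minus the correlation between observations, which is consistent with the 0 to 2 range of the values above; it is not a copy of how get_dist() is implemented. The names `r`, `d.pearson` and `d.spearman` are introduced only for this illustration, and all 50 states are used.

```r
data("USArrests")
x <- scale(USArrests)

# cor() correlates columns, so transpose to correlate observations (rows)
r <- cor(t(x), method = "pearson")

# Distance = 1 - correlation:
#   0 for perfectly correlated profiles, 2 for perfectly anti-correlated ones
d.pearson <- as.dist(1 - r)
range(d.pearson)

# Switching to a rank-based correlation only changes the method argument
d.spearman <- as.dist(1 - cor(t(x), method = "spearman"))

round(as.matrix(d.pearson)[1:3, 1:3], 1)
```

Unlike the Euclidean distance, this measure treats two states as close when their profiles across the four variables move together, regardless of the overall level.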
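If the data contain non-numeric columns (for example factors or ordered factors), use Gower’s metric. The daisy() function from “cluster” selects it automatically when some columns are not numeric.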
library(cluster)
# load data
data(flower)
head(flower,3)
##   V1 V2 V3 V4 V5 V6  V7 V8
## 1  0  1  1  4  3 15  25 15
## 2  1  0  0  2  1  3 150 50
## 3  0  1  0  3  3  1 150 50
str(flower)
## 'data.frame': 18 obs. of 8 variables:
## $ V1: Factor w/ 2 levels "0","1": 1 2 1 1 1 1 1 1 2 2 ...
## $ V2: Factor w/ 2 levels "0","1": 2 1 2 1 2 2 1 1 2 2 ...
## $ V3: Factor w/ 2 levels "0","1": 2 1 1 2 1 1 1 2 1 1 ...
## $ V4: Factor w/ 5 levels "1","2","3","4",..: 4 2 3 4 5 4 4 2 3 5 ...
## $ V5: Ord.factor w/ 3 levels "1"<"2"<"3": 3 1 3 2 2 3 3 2 1 2 ...
## $ V6: Ord.factor w/ 18 levels "1"<"2"<"3"<"4"<..: 15 3 1 16 2 12 13 7 4 14 ...
## $ V7: num 25 150 150 125 20 50 40 100 25 100 ...
## $ V8: num 15 50 50 50 15 40 20 15 15 60 ...
# distance Matrix
dd <- daisy(flower)
round(as.matrix(dd)[1:3,1:3],2)
##      1    2    3
## 1 0.00 0.89 0.53
## 2 0.89 0.00 0.51
## 3 0.53 0.51 0.00
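To make the choice of metric explicit instead of relying on daisy()'s automatic selection, the metric can be named in the call. The sketch below compares both calls on the flower data; the names `dd.default` and `dd.gower` are introduced only for this illustration.

```r
library(cluster)
data(flower)

# Default call: daisy() switches to Gower's metric for mixed-type columns
dd.default <- daisy(flower)

# Same computation with the metric stated explicitly
dd.gower <- daisy(flower, metric = "gower")

# Both calls should return the same dissimilarities
all.equal(as.matrix(dd.default), as.matrix(dd.gower))

# Gower dissimilarities lie between 0 (identical) and 1 (maximally different)
range(dd.gower)
```

Naming the metric makes the intent visible to the reader and keeps the code's behaviour obvious if the column types change later.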
*** Visualising Distance Matrix ***
library(factoextra)
fviz_dist(dist.eucl)
Red cells indicate a small distance (similar observations) and blue cells a large distance (dissimilar observations).
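The colour scale and ordering can be set explicitly, which makes the plot easier to read. The sketch below rebuilds the 15-row distance matrix and passes the `order` and `gradient` arguments of fviz_dist(); the values shown are what I understand the defaults to be, so treat them as an assumption to check against your installed version. Note that the sampled rows depend on the R version, because the default sampling algorithm changed in R 3.6.0.

```r
library(factoextra)

data("USArrests")
set.seed(123)
df.scaled <- scale(USArrests[sample(1:50, 15), ])
dist.eucl <- dist(df.scaled, method = "euclidean")

# order = TRUE reorders observations so that similar ones sit next to each other;
# gradient maps small distances to red and large distances to blue
p <- fviz_dist(
  dist.eucl,
  order = TRUE,
  gradient = list(low = "red", mid = "white", high = "blue")
)
print(p)
```

Because fviz_dist() returns a ggplot object, further ggplot2 layers such as a title can be added to `p` in the usual way.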