Clustering Distance Measures

Introduction

Remember, for cluster analysis:

1. Rows are observations and columns are variables.
2. Any missing values must be imputed or removed.
3. The data must be standardised.

Let's load the dataset “USArrests”.

data("USArrests")
df <- USArrests

Remove any missing values that may be present in the data.

df <- na.omit(df)
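Before dropping rows, it can help to check how many missing values there actually are. A minimal sketch in base R (the names na_per_column and rows_kept are only illustrative):

```r
data("USArrests")

# Count the missing values in each column
na_per_column <- colSums(is.na(USArrests))
na_per_column

# Number of rows that na.omit() would keep
rows_kept <- nrow(na.omit(USArrests))
rows_kept
```

USArrests has no missing values, so na.omit() keeps all 50 rows here; the step matters for datasets that do have gaps.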

Scale the data so that every variable has mean 0 and standard deviation 1.

df <- scale(df)
head(df,n=3)

##             Murder   Assault   UrbanPop         Rape
## Alabama 1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska  0.50786248 1.1068225 -1.2117642  2.484202941
## Arizona 0.07163341 1.4788032  0.9989801  1.042878388
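To see what scale() does, the same result can be reproduced by hand: subtract each column's mean and divide by its standard deviation. A small sketch in base R (the name manual is only illustrative):

```r
data("USArrests")
df <- scale(USArrests)

# Manual standardisation: centre each column, then divide by its sd
centred <- sweep(as.matrix(USArrests), 2, colMeans(USArrests), "-")
manual  <- sweep(centred, 2, apply(USArrests, 2, sd), "/")

# Compare with scale(), ignoring the attributes that scale() attaches
same <- isTRUE(all.equal(unclass(df), manual, check.attributes = FALSE))
same

# After scaling, each column has mean 0 and standard deviation 1
round(colMeans(df), 10)
apply(df, 2, sd)
```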

We will use the packages “cluster” and “factoextra”.

Clustering Distance Measures

Distance Matrix Computation

Data Preparation

Using USArrests, we take a random sample of 15 rows and then standardise the data.

set.seed(123)
ss <- sample(1:50,15) # row indices of 15 randomly selected rows
df <- USArrests[ss,] # subset the rows using these indices
df.scaled <- scale(df) # standardise the variables

Computing Euclidean Distance

dist.eucl <- dist(df.scaled,method="euclidean")

You can view these distances in the form of a matrix.

# subset the first three columns and rows and round the values
round(as.matrix(dist.eucl)[1:3,1:3],1)

##            New Mexico Iowa Indiana
## New Mexico        0.0  4.1     2.5
## Iowa              4.1  0.0     1.8
## Indiana           2.5  1.8     0.0
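Each entry of this matrix is the square root of the sum of squared differences between two rows. As a quick check, the sketch below recomputes one entry by hand and compares it with dist() (the names manual and from_dist are only illustrative):

```r
data("USArrests")
set.seed(123)
df.scaled <- scale(USArrests[sample(1:50, 15), ])

# Euclidean distance between the first two sampled rows, by hand
x <- df.scaled[1, ]
y <- df.scaled[2, ]
manual <- sqrt(sum((x - y)^2))

# The same entry taken from dist()
from_dist <- as.matrix(dist(df.scaled, method = "euclidean"))[1, 2]

all.equal(manual, from_dist)
```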

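Computing Correlation-Based Distances

The correlation can be Pearson, Spearman or Kendall. The distance is 1 minus the correlation between two rows, so it ranges from 0 (perfectly correlated) to 2 (perfectly anti-correlated).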

# compute the correlation-based distance
library("factoextra")

dist.cor <- get_dist(df.scaled,method="pearson")

# display a subset
round(as.matrix(dist.cor)[1:3,1:3],1)

##            New Mexico Iowa Indiana
## New Mexico        0.0  1.7     2.0
## Iowa              1.7  0.0     0.3
## Indiana           2.0  0.3     0.0
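If factoextra is not available, a similar matrix can be built in base R. This sketch assumes the correlation distance is defined as 1 minus the Pearson correlation between rows, which is consistent with the values above ranging up to 2; the name dist.cor.manual is only illustrative.

```r
data("USArrests")
set.seed(123)
df.scaled <- scale(USArrests[sample(1:50, 15), ])

# cor() correlates columns, so transpose to correlate observations (rows)
cor.rows <- cor(t(df.scaled), method = "pearson")

# Turn correlations into distances: 0 = perfectly correlated, 2 = opposite
dist.cor.manual <- as.dist(1 - cor.rows)

round(as.matrix(dist.cor.manual)[1:3, 1:3], 1)
```

Replacing "pearson" with "spearman" or "kendall" in cor() gives the rank-based variants.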

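If some columns are not numeric (for example factors or ordered factors), use Gower’s metric. The daisy() function from the cluster package switches to it automatically for such data.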

library(cluster)

# load data
data(flower)
head(flower,3)

##   V1 V2 V3 V4 V5 V6  V7 V8
## 1  0  1  1  4  3 15  25 15
## 2  1  0  0  2  1  3 150 50
## 3  0  1  0  3  3  1 150 50

str(flower)

## 'data.frame':    18 obs. of  8 variables:
##  $ V1: Factor w/ 2 levels "0","1": 1 2 1 1 1 1 1 1 2 2 ...
##  $ V2: Factor w/ 2 levels "0","1": 2 1 2 1 2 2 1 1 2 2 ...
##  $ V3: Factor w/ 2 levels "0","1": 2 1 1 2 1 1 1 2 1 1 ...
##  $ V4: Factor w/ 5 levels "1","2","3","4",..: 4 2 3 4 5 4 4 2 3 5 ...
##  $ V5: Ord.factor w/ 3 levels "1"<"2"<"3": 3 1 3 2 2 3 3 2 1 2 ...
##  $ V6: Ord.factor w/ 18 levels "1"<"2"<"3"<"4"<..: 15 3 1 16 2 12 13 7 4 14 ...
##  $ V7: num  25 150 150 125 20 50 40 100 25 100 ...
##  $ V8: num  15 50 50 50 15 40 20 15 15 60 ...

# distance matrix (Gower's metric is used because not all columns are numeric)
dd <- daisy(flower)
round(as.matrix(dd)[1:3,1:3],2)

##      1    2    3
## 1 0.00 0.89 0.53
## 2 0.89 0.00 0.51
## 3 0.53 0.51 0.00
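To get a feel for how Gower’s metric combines different column types, here is a small sketch on a made-up data frame (toy and gower_pair are hypothetical names, not part of the flower example). It assumes the usual definition: a numeric column contributes the absolute difference divided by the column’s range, a nominal column contributes 0 for a match and 1 for a mismatch, and the contributions are averaged.

```r
library(cluster)

# Hypothetical toy data: one numeric and one nominal column
toy <- data.frame(
  height = c(10, 20, 40),
  colour = factor(c("red", "red", "blue"))
)

# Gower dissimilarity between rows i and j, computed by hand
gower_pair <- function(i, j) {
  d_num <- abs(toy$height[i] - toy$height[j]) / diff(range(toy$height))
  d_cat <- as.numeric(toy$colour[i] != toy$colour[j])
  mean(c(d_num, d_cat))
}

# Pairs in the same order as a dist object: (2,1), (3,1), (3,2)
manual <- c(gower_pair(1, 2), gower_pair(1, 3), gower_pair(2, 3))
from_daisy <- as.vector(daisy(toy, metric = "gower"))

all.equal(manual, from_daisy)
```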

Visualising the Distance Matrix

library(factoextra)
fviz_dist(dist.eucl)

Red indicates high similarity (small distance) and blue indicates low similarity (large distance).
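A comparable picture can be drawn with base R alone. The sketch below is an alternative to fviz_dist(), not a reproduction of it: it uses heatmap() with a red-white-blue palette so that small distances appear red and large distances blue (pal and hm are only illustrative names).

```r
data("USArrests")
set.seed(123)
df.scaled <- scale(USArrests[sample(1:50, 15), ])

# Full symmetric distance matrix
m <- as.matrix(dist(df.scaled, method = "euclidean"))

# Low values map to red, high values to blue
pal <- colorRampPalette(c("red", "white", "blue"))(50)

# symm = TRUE treats the matrix as symmetric and reorders rows and
# columns together, so similar observations end up next to each other
hm <- heatmap(m, symm = TRUE, col = pal, margins = c(8, 8))
```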

Priyank Goyal

22/03/2020
