So far, every pattern-finding technique we have seen involved building some sort of equation to predict a response; in other words, we always had a response variable. It may come as a surprise, but finding patterns in data is not always coupled with having a response recorded against your observations (the people or objects whose characteristics we collect).
What we need to do now is formalize the already intuitive concept of finding “groups” within our data or observations.
Loading R packages used besides base R.
library(knitr)
library(ggplot2)
library(dplyr)
library(dbscan)
We are going to find clusters in the mtcars data, grouping similar cars based on their various characteristics.
data(mtcars)
cars=mtcars
As we discussed, we need to standardize the data before we can begin clustering if we don't want the inherent weights (differences in variable scales) present in the data to dominate the distance calculations.
# robust standardization: center each column at its median and scale by its MAD
medians = apply(cars,2,median)
mads = apply(cars,2,mad)
cars.std=data.frame(scale(cars,center=medians,scale=mads))
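To make the standardization concrete, here is a quick illustrative check (not part of the original analysis) that a standardized column is simply (value - median) / MAD:
# sanity check on one column: standardized mpg should equal (mpg - median(mpg)) / mad(mpg)
manual_mpg = (cars$mpg - median(cars$mpg)) / mad(cars$mpg)
all.equal(cars.std$mpg, manual_mpg)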
Next we will create a distance matrix for all the observations in the data. This is the resource-consuming step, and it becomes painfully slow as the size of the data increases.
cars.dist = dist(cars.std)
Using this distance matrix we can now build a hierarchical clustering model with the function hclust.
cars.hclust = hclust(cars.dist)
plot(cars.hclust,labels=rownames(cars),main='Default from hclust',col="mediumturquoise")
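By default hclust uses complete linkage. As a sketch of how the grouping depends on this choice, you could rebuild the tree from the same distance matrix with a different linkage rule, for example average linkage:
# same distance matrix, different linkage rule (illustrative)
cars.hclust.avg = hclust(cars.dist, method = "average")
plot(cars.hclust.avg, labels = rownames(cars), main = "Average linkage", col = "mediumturquoise")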
You can extract the group membership of each observation for a chosen number of clusters using the function cutree. Here is an example.
groups.3=cutree(cars.hclust,3)
groups.3
## Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive
## 1 1 2 2
## Hornet Sportabout Valiant Duster 360 Merc 240D
## 3 2 1 2
## Merc 230 Merc 280 Merc 280C Merc 450SE
## 2 2 2 3
## Merc 450SL Merc 450SLC Cadillac Fleetwood Lincoln Continental
## 3 3 3 3
## Chrysler Imperial Fiat 128 Honda Civic Toyota Corolla
## 3 2 2 2
## Toyota Corona Dodge Challenger AMC Javelin Camaro Z28
## 2 3 3 1
## Pontiac Firebird Fiat X1-9 Porsche 914-2 Lotus Europa
## 3 2 2 2
## Ford Pantera L Ferrari Dino Maserati Bora Volvo 142E
## 1 1 1 2
table(groups.3)
## groups.3
## 1 2 3
## 7 15 10
rownames(cars)[groups.3==1]
## [1] "Mazda RX4" "Mazda RX4 Wag" "Duster 360" "Camaro Z28"
## [5] "Ford Pantera L" "Ferrari Dino" "Maserati Bora"
Let's prepare data for K-means clustering with natural groups present in it. We'll then run the K-means algorithm to see whether it can identify those groups or not.
n = 100
g = 6
set.seed(g)
d <- data.frame(
x = unlist(lapply(1:g, function(i) rnorm(n/g, runif(1)*i^2))),
y = unlist(lapply(1:g, function(i) rnorm(n/g, runif(1)*i^2))))
plot(d,col="mediumturquoise",pch=16, xlab="Arrival Time Deviations", ylab="Departure time Deviations", main="Scatter Plot: Delays from Schedule in Minutes ")
We are not going to standardize this data; we'll let the inherent weights prevail (also, both variables are on the same scale). Let's run kmeans for various values of K and record the within-group sum of squares (WSS) every time.
mydata <- d
# WSS for k = 1 is just the total sum of squares: (n - 1) * sum of column variances
wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
# for k = 2..15, record the total within-cluster sum of squares reported by kmeans
for (i in 2:15) wss[i] <- sum(kmeans(mydata,centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares",col="mediumseagreen",pch=12)
This suggests there might be 4 groups present in the data, although if you go back and look closely at the data plot, more groups may be possible. How many groups you break your data into is ultimately your call, as we discussed. Another way of deciding the value of K is the Calinski criterion; the idea behind it is sketched below.
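For reference, the Calinski(-Harabasz) criterion compares between-cluster to within-cluster variance: CH = (B/(k-1)) / (W/(n-k)), where B and W are the between- and within-cluster sums of squares, n the number of observations and k the number of clusters. Here is a small hand-rolled sketch of the index; the cascadeKM call below computes it for us over a range of k:
# illustrative implementation of the Calinski-Harabasz index
ch_index = function(data, clusters){
  n = nrow(data)
  k = length(unique(clusters))
  # total sum of squares around the overall centroid
  tss = sum(scale(data, center = colMeans(data), scale = FALSE)^2)
  # within-cluster sum of squares, summed over clusters
  wss = sum(sapply(split(as.data.frame(data), clusters), function(g){
    sum(scale(g, center = colMeans(g), scale = FALSE)^2)
  }))
  bss = tss - wss
  (bss / (k - 1)) / (wss / (n - k))
}
# example use: ch_index(d, kmeans(d, centers = 4)$cluster)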
require(vegan)
fit <- cascadeKM(scale(d, center = TRUE, scale = TRUE), 1, 10, iter =
1000)
plot(fit, sortg = TRUE, grpmts.plot = TRUE)
calinski.best <- as.numeric(which.max(fit$results[2,]))
cat("Calinski criterion optimal number of clusters:", calinski.best, "\n")
## Calinski criterion optimal number of clusters: 5
This gives 5 as the best choice. Again, this is a subjective call. For profiling, you can add the cluster membership back to the data and draw pairwise 2D plots colored by cluster membership. You can look at cluster-wise numerical summaries as well.
fit <- kmeans(mydata,4 )
d$cluster=fit$cluster
library(ggplot2)
ggplot(d,aes(x,y,color=as.factor(cluster)))+geom_point()
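It can also help to overlay the fitted cluster centers on the scatter plot; a small sketch (the centers come straight from fit$centers):
# overlay the k-means centers (black stars) on the clustered scatter plot
centers = as.data.frame(fit$centers)
ggplot(d, aes(x, y, color = as.factor(cluster))) +
  geom_point() +
  geom_point(data = centers, aes(x, y), inherit.aes = FALSE, color = "black", shape = 8, size = 4)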
Next let's apply K-means to the white wine quality data.
wq=read.csv("winequality-white.csv",sep=";")
We select only 2 variables here because we won't be able to properly visualize more than 2 variables, even if they are well clustered.
wq=wq %>% select(alcohol,pH)
wq.std=scale(wq)
Next we will run K-means for different values of k and record the within-cluster sum of squares (SSW).
ssw=numeric(14)
for(k in 2:15){
fit=kmeans(wq.std,centers = k)
ssw[k-1]=fit$tot.withinss  # total within-cluster SS (a single number per fit)
}
qplot(2:15,ssw)
clus.fit=kmeans(wq.std,centers = 4)
wq$cluster=clus.fit$cluster
tapply(wq$pH,wq$cluster,summary)
## $`1`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.210 3.290 3.340 3.368 3.420 3.820
##
## $`2`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.060 3.130 3.112 3.180 3.280
##
## $`3`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.060 3.200 3.250 3.264 3.320 3.590
##
## $`4`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.010 3.070 3.064 3.130 3.220
tapply(wq$pH,wq$cluster,mean)
## 1 2 3 4
## 3.368097 3.111671 3.263967 3.063520
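Since dplyr is already loaded, the same cluster-wise profiling can be written more compactly; a sketch covering both variables and the cluster sizes at once:
# cluster-wise means of both variables, plus cluster sizes
wq %>%
  group_by(cluster) %>%
  summarise(mean_pH = mean(pH), mean_alcohol = mean(alcohol), n = n())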
### pH, alcohol
ggplot(wq,aes(x=alcohol,y=pH,color=factor(cluster)))+
geom_point()
Now let's try the same workflow on the red wine data.
wine=read.csv("winequality-red.csv",sep=";")
We select 4 variables: pH, sulphates, alcohol, and total.sulfur.dioxide.
wine=wine %>%
select(pH,sulphates,alcohol,total.sulfur.dioxide)
# z-score standardization: subtract the column mean and divide by its standard deviation
md=function(x){
return((x-mean(x))/sd(x))
}
wine_std=wine %>%
mutate(pH=md(pH),
sulphates=md(sulphates),
alcohol=md(alcohol),
total.sulfur.dioxide=md(total.sulfur.dioxide))
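As an aside, the md function above is just the usual z-score, so an equivalent (illustrative) shortcut is base R's scale:
# equivalent standardization with scale(); the resulting z-scores match md()
wine_std_alt = as.data.frame(scale(wine))
all.equal(wine_std_alt$pH, wine_std$pH)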
require(vegan)
fit <- cascadeKM(wine_std, 1, 10, iter = 100)
plot(fit, sortg = TRUE, grpmts.plot = TRUE)
calinski.best <- as.numeric(which.max(fit$results[2,]))
cat("Calinski criterion optimal number of clusters:", calinski.best, "\n")
## Calinski criterion optimal number of clusters: 5
We’ll make 5 groups.
fit <- kmeans(wine_std,5)
wine_std$cluster=fit$cluster
#pair wise profiling plots
ggplot(wine_std,aes(pH,alcohol,color=as.factor(cluster)))+geom_point()
ggplot(wine_std,aes(pH,sulphates,color=as.factor(cluster)))+geom_point()
ggplot(wine_std,aes(pH,total.sulfur.dioxide,color=as.factor(cluster)))+geom_point()
ggplot(wine_std,aes(alcohol,sulphates,color=as.factor(cluster)))+geom_point()
ggplot(wine_std,aes(alcohol,total.sulfur.dioxide,color=as.factor(cluster)))+geom_point()
ggplot(wine_std,aes(sulphates,total.sulfur.dioxide,color=as.factor(cluster)))+geom_point()
As you can see, once you have more dimensions, there is going to be overlap in the 2D plots. However, in some plots you'll see better differentiation between certain groups, and you can label them accordingly: for example, the group with high total.sulfur.dioxide (or, if you can associate that with some property of the wine, say a strong scent).
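If the GGally package is available (an assumption; it is not loaded earlier in this post), all pairwise panels can be drawn in one call:
# all pairwise scatter plots at once, colored by cluster (requires GGally)
library(GGally)
ggpairs(wine_std, columns = 1:4, mapping = aes(color = as.factor(cluster)))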
Finally, let's look at a case where K-means struggles: the moons data.
moon_data=read.csv("moon_data.csv",stringsAsFactors = F, sep=",")
As we'll see from the plot, the clusters in this data are not spherical, so k-means won't separate them properly.
moon_data=moon_data[,-3]
glimpse(moon_data)
## Observations: 200
## Variables: 2
## $ X1 <dbl> 1.8495776, -0.4861227, 0.6239769, -0.3723831, 1.4167924, -0.6032...
## $ X2 <dbl> 0.09622161, -0.08064073, -0.09272887, -0.27670058, -0.43537460, ...
ggplot(moon_data,aes(x=X1,y=X2))+geom_point()
fit= kmeans(moon_data,2)
### to find WSS for each k, use fit$tot.withinss
### you can run a for loop over a range of k and examine how WSS behaves
moon_data$cluster=fit$cluster
ggplot(moon_data,aes(X1,X2,color=as.factor(cluster)))+geom_point()
moon_data$cluster=NULL
moon.scan=dbscan(moon_data, eps=0.2, minPts = 3)
moon_data$cluster=moon.scan$cluster
ggplot(moon_data,aes(X1,X2,color=as.factor(cluster)))+geom_point()
We see how DBSCAN can cluster this data correctly. We can try out different values of epsilon and minimum points and compare the results.
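One common way to choose eps (a sketch using the dbscan package's kNNdistplot helper) is to look for a "knee" in the sorted k-nearest-neighbour distances:
# sorted distances to the 3rd nearest neighbour; a sharp bend suggests a good eps
kNNdistplot(moon_data[, c("X1", "X2")], k = 3)
abline(h = 0.2, lty = 2)  # the eps value used above, for reference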
DBSCAN is also useful for anomaly detection: observations that do not fall into any dense region are flagged as noise rather than forced into a cluster.
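Points that DBSCAN cannot assign to any dense cluster get the label 0 (noise), and those are natural candidates for anomalies. A minimal sketch on the moons data:
# noise points (cluster 0) can be treated as potential anomalies
moon_data$noise = moon.scan$cluster == 0
table(moon_data$noise)
ggplot(moon_data, aes(X1, X2, color = noise)) + geom_point()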