So far, every pattern-finding technique we have seen involved building some sort of equation to predict a response; in other words, we always had a response variable. It may come as a surprise, but finding patterns in data is not always coupled with having a response recorded against your observations (the people or objects whose characteristics we collect).
What we need to do now is formalize the already intuitive concept of finding “groups” within our data or observations.
Loading R packages used besides base R.
library(knitr)
library(ggplot2)
library(dplyr)
library(dbscan)
We are going to find clusters in the mtcars data, grouping similar cars based on their various characteristics.
data(mtcars)
cars=mtcars
As we discussed, we need to standardize the data before we can begin clustering if we don't want the inherent weights (differences in variable scales) present in the data to dominate the distance calculations.
# robust standardization: center each column at its median and scale by its MAD
medians = apply(cars,2,median)
mads = apply(cars,2,mad)
cars.std=data.frame(scale(cars,center=medians,scale=mads))
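To make the standardization concrete, here is a quick illustrative check (not part of the original analysis) that a standardized column is simply (value - median) / MAD:
# sanity check on one column: standardized mpg should equal (mpg - median(mpg)) / mad(mpg)
manual_mpg = (cars$mpg - median(cars$mpg)) / mad(cars$mpg)
all.equal(cars.std$mpg, manual_mpg)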
Next we will create a distance matrix for all the observations in the data. This is the resource-consuming step, and it becomes painfully slow as the size of the data increases.
cars.dist = dist(cars.std)
Using this distance matrix we can now build a hierarchical clustering model with the function hclust.
cars.hclust = hclust(cars.dist)
plot(cars.hclust,labels=rownames(cars),main='Default from hclust',col="mediumturquoise")
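By default hclust uses complete linkage. As a sketch of how the grouping depends on this choice, you could rebuild the tree from the same distance matrix with a different linkage rule, for example average linkage:
# same distance matrix, different linkage rule (illustrative)
cars.hclust.avg = hclust(cars.dist, method = "average")
plot(cars.hclust.avg, labels = rownames(cars), main = "Average linkage", col = "mediumturquoise")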
You can extract the group membership of each observation for a chosen number of clusters using the function cutree. Here is an example.
groups.3=cutree(cars.hclust,3)
groups.3
## Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive
## 1 1 2 2
## Hornet Sportabout Valiant Duster 360 Merc 240D
## 3 2 1 2
## Merc 230 Merc 280 Merc 280C Merc 450SE
## 2 2 2 3
## Merc 450SL Merc 450SLC Cadillac Fleetwood Lincoln Continental
## 3 3 3 3
## Chrysler Imperial Fiat 128 Honda Civic Toyota Corolla
## 3 2 2 2
## Toyota Corona Dodge Challenger AMC Javelin Camaro Z28
## 2 3 3 1
## Pontiac Firebird Fiat X1-9 Porsche 914-2 Lotus Europa
## 3 2 2 2
## Ford Pantera L Ferrari Dino Maserati Bora Volvo 142E
## 1 1 1 2
table(groups.3)
## groups.3
## 1 2 3
## 7 15 10
rownames(cars)[groups.3==1]
## [1] "Mazda RX4" "Mazda RX4 Wag" "Duster 360" "Camaro Z28"
## [5] "Ford Pantera L" "Ferrari Dino" "Maserati Bora"
Let's prepare data for K-means clustering with natural groups present in it. We'll then run the K-means algorithm to see whether it can identify those groups or not.
n = 100
g = 6
set.seed(g)
d <- data.frame(
x = unlist(lapply(1:g, function(i) rnorm(n/g, runif(1)*i^2))),
y = unlist(lapply(1:g, function(i) rnorm(n/g, runif(1)*i^2))))
plot(d,col="mediumturquoise",pch=16, xlab="Arrival Time Deviations", ylab="Departure time Deviations", main="Scatter Plot: Delays from Schedule in Minutes ")
We are not going to standardize this data; we'll let the inherent weights prevail (also, both variables are on the same scale). Let's run kmeans for various values of K and record the within-group sum of squares (WSS) every time.
mydata <- d
# WSS for k = 1 is just the total sum of squares: (n - 1) * sum of column variances
wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
# for k = 2..15, record the total within-cluster sum of squares reported by kmeans
for (i in 2:15) wss[i] <- sum(kmeans(mydata,centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares",col="mediumseagreen",pch=12)
This suggests there might be 4 groups present in the data, although if you go back and look closely at the data plot, more groups may be possible. How many groups you break your data into is ultimately your call, as we discussed. Another way of deciding the value of K is the Calinski criterion; the idea behind it is sketched below.
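For reference, the Calinski(-Harabasz) criterion compares between-cluster to within-cluster variance: CH = (B/(k-1)) / (W/(n-k)), where B and W are the between- and within-cluster sums of squares, n the number of observations and k the number of clusters. Here is a small hand-rolled sketch of the index; the cascadeKM call below computes it for us over a range of k:
# illustrative implementation of the Calinski-Harabasz index
ch_index = function(data, clusters){
  n = nrow(data)
  k = length(unique(clusters))
  # total sum of squares around the overall centroid
  tss = sum(scale(data, center = colMeans(data), scale = FALSE)^2)
  # within-cluster sum of squares, summed over clusters
  wss = sum(sapply(split(as.data.frame(data), clusters), function(g){
    sum(scale(g, center = colMeans(g), scale = FALSE)^2)
  }))
  bss = tss - wss
  (bss / (k - 1)) / (wss / (n - k))
}
# example use: ch_index(d, kmeans(d, centers = 4)$cluster)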
require(vegan)
fit <- cascadeKM(scale(d, center = TRUE, scale = TRUE), 1, 10, iter =
1000)
plot(fit, sortg = TRUE, grpmts.plot = TRUE)
calinski.best <- as.numeric(which.max(fit$results[2,]))
cat("Calinski criterion optimal number of clusters:", calinski.best, "\n")
## Calinski criterion optimal number of clusters: 5
This gives 5 as the best choice. Again, this is a subjective call. For profiling, you can add the cluster membership back to the data and draw pairwise 2D plots colored by cluster membership. You can look at cluster-wise numerical summaries as well.
fit <- kmeans(mydata,4 )
d$cluster=fit$cluster
library(ggplot2)
ggplot(d,aes(x,y,color=as.factor(cluster)))+geom_point()
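It can also help to overlay the fitted cluster centers on the scatter plot; a small sketch (the centers come straight from fit$centers):
# overlay the k-means centers (black stars) on the clustered scatter plot
centers = as.data.frame(fit$centers)
ggplot(d, aes(x, y, color = as.factor(cluster))) +
  geom_point() +
  geom_point(data = centers, aes(x, y), inherit.aes = FALSE, color = "black", shape = 8, size = 4)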
Next let's apply K-means to the white wine quality data.
wq=read.csv("winequality-white.csv",sep=";")
We select only 2 variables here because we won't be able to properly visualize more than 2 variables, even if they are well clustered.
wq=wq %>% select(alcohol,pH)
wq.std=scale(wq)
Next we will run K-means for different values of k and record the within-cluster sum of squares (SSW).
ssw=numeric(14)
for(k in 2:15){
fit=kmeans(wq.std,centers = k)
ssw[k-1]=fit$tot.withinss  # total within-cluster SS (a single number per fit)
}
qplot(2:15,ssw)
clus.fit=kmeans(wq.std,centers = 4)
wq$cluster=clus.fit$cluster
tapply(wq$pH,wq$cluster,summary)
## $`1`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.210 3.290 3.340 3.368 3.420 3.820
##
## $`2`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.060 3.130 3.112 3.180 3.280
##
## $`3`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.060 3.200 3.250 3.264 3.320 3.590
##
## $`4`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.010 3.070 3.064 3.130 3.220
tapply(wq$pH,wq$cluster,mean)
## 1 2 3 4
## 3.368097 3.111671 3.263967 3.063520
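Since dplyr is already loaded, the same cluster-wise profiling can be written more compactly; a sketch covering both variables and the cluster sizes at once:
# cluster-wise means of both variables, plus cluster sizes
wq %>%
  group_by(cluster) %>%
  summarise(mean_pH = mean(pH), mean_alcohol = mean(alcohol), n = n())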
### pH, alcohol
ggplot(wq,aes(x=alcohol,y=pH,color=factor(cluster)))+
geom_point()
Now let's try the same workflow on the red wine data.
wine=read.csv("winequality-red.csv",sep=";")
We select 4 variables: pH, sulphates, alcohol, and total.sulfur.dioxide.
wine=wine %>%
select(pH,sulphates,alcohol,total.sulfur.dioxide)
# z-score standardization: subtract the column mean and divide by its standard deviation
md=function(x){
return((x-mean(x))/sd(x))
}
wine_std=wine %>%
mutate(pH=md(pH),
sulphates=md(sulphates),
alcohol=md(alcohol),
total.sulfur.dioxide=md(total.sulfur.dioxide))
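As an aside, the md function above is just the usual z-score, so an equivalent (illustrative) shortcut is base R's scale:
# equivalent standardization with scale(); the resulting z-scores match md()
wine_std_alt = as.data.frame(scale(wine))
all.equal(wine_std_alt$pH, wine_std$pH)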
require(vegan)
fit <- cascadeKM(wine_std, 1, 10, iter = 100)
plot(fit, sortg = TRUE, grpmts.plot = TRUE)
calinski.best <- as.numeric(which.max(fit$results[2,]))
cat("Calinski criterion optimal number of clusters:", calinski.best, "\n")
## Calinski criterion optimal number of clusters: 5
We’ll make 5 groups.
fit <- kmeans(wine_std,5)
wine_std$cluster=fit$cluster
#pair wise profiling plots
ggplot(wine_std,aes(pH,alcohol,color=as.factor(cluster)))+geom_point()
ggplot(wine_std,aes(pH,sulphates,color=as.factor(cluster)))+geom_point()
ggplot(wine_std,aes(pH,total.sulfur.dioxide,color=as.factor(cluster)))+geom_point()
ggplot(wine_std,aes(alcohol,sulphates,color=as.factor(cluster)))+geom_point()
ggplot(wine_std,aes(alcohol,total.sulfur.dioxide,color=as.factor(cluster)))+geom_point()
ggplot(wine_std,aes(sulphates,total.sulfur.dioxide,color=as.factor(cluster)))+geom_point()
As you can see, once you have more dimensions, there is going to be overlap in the 2D plots. However, in some plots you'll see better differentiation between certain groups, and you can label them accordingly: for example, the group with high total.sulfur.dioxide (or, if you can associate that with some property of the wine, say a strong scent).
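If the GGally package is available (an assumption; it is not loaded earlier in this post), all pairwise panels can be drawn in one call:
# all pairwise scatter plots at once, colored by cluster (requires GGally)
library(GGally)
ggpairs(wine_std, columns = 1:4, mapping = aes(color = as.factor(cluster)))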
Finally, let's look at a case where K-means struggles: the moons data.
moon_data=read.csv("moon_data.csv",stringsAsFactors = F, sep=",")
As we'll see from the plot, the clusters in this data are not spherical, so k-means won't separate them properly.
moon_data=moon_data[,-3]
glimpse(moon_data)
## Observations: 200
## Variables: 2
## $ X1 <dbl> 1.8495776, -0.4861227, 0.6239769, -0.3723831, 1.4167924, -0.6032...
## $ X2 <dbl> 0.09622161, -0.08064073, -0.09272887, -0.27670058, -0.43537460, ...
ggplot(moon_data,aes(x=X1,y=X2))+geom_point()
fit= kmeans(moon_data,2)
### to find WSS for each k, use fit$tot.withinss
### you can run a for loop over a range of k and examine how WSS behaves
moon_data$cluster=fit$cluster
ggplot(moon_data,aes(X1,X2,color=as.factor(cluster)))+geom_point()
moon_data$cluster=NULL
moon.scan=dbscan(moon_data, eps=0.2, minPts = 3)
moon_data$cluster=moon.scan$cluster
ggplot(moon_data,aes(X1,X2,color=as.factor(cluster)))+geom_point()
We see how DBSCAN can cluster this data correctly. We can try out different values of epsilon and minimum points and compare the results.
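One common way to choose eps (a sketch using the dbscan package's kNNdistplot helper) is to look for a "knee" in the sorted k-nearest-neighbour distances:
# sorted distances to the 3rd nearest neighbour; a sharp bend suggests a good eps
kNNdistplot(moon_data[, c("X1", "X2")], k = 3)
abline(h = 0.2, lty = 2)  # the eps value used above, for reference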
DBSCAN is also useful for anomaly detection: observations that do not fall into any dense region are flagged as noise rather than forced into a cluster.
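Points that DBSCAN cannot assign to any dense cluster get the label 0 (noise), and those are natural candidates for anomalies. A minimal sketch on the moons data:
# noise points (cluster 0) can be treated as potential anomalies
moon_data$noise = moon.scan$cluster == 0
table(moon_data$noise)
ggplot(moon_data, aes(X1, X2, color = noise)) + geom_point()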