Cluster Analysis example

This example was produced as an auxiliary study for the fourth module of the course “Introdução ao Marketing Analítico”, offered by INSPER on Coursera.org.
It uses a small data set of personal daily expenses (“gastos diários”) on food and clothing to demonstrate some cluster analysis techniques.

Reading data

dt <- read.table("Gastos_Diarios.csv", sep=',', header=TRUE, row.names=1)
names(dt) <- c("comida","roupas")
print(head(dt))

##   comida roupas
## a    2.0      4
## b    8.0      2
## c    9.0      3
## d    1.0      5
## e    8.5      1

Plotting scattergram (XY graph)

Prepare data - standardization

dts <- na.omit(dt) # listwise deletion of rows with missing values
dts <- scale(dts)  # standardize variables (mean 0, standard deviation 1)
print(dts)

##       comida     roupas
## a -0.9569321  0.6324555
## b  0.5948497 -0.6324555
## c  0.8534800  0.0000000
## d -1.2155624  1.2649111
## e  0.7241648 -1.2649111
## attr(,"scaled:center")
## comida roupas 
##    5.7    3.0 
## attr(,"scaled:scale")
##   comida   roupas 
## 3.866523 1.581139
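
As a cross-check, the standardized values can be reproduced by hand: each column is centered on its mean and divided by its sample standard deviation. The sketch below re-enters the data from the rows printed above (an assumption: that these five rows are the whole file) and compares the result with scale().

```r
# Data re-entered from the printed output (assumed to be the full data set)
dt <- data.frame(comida = c(2, 8, 9, 1, 8.5),
                 roupas = c(4, 2, 3, 5, 1),
                 row.names = c("a", "b", "c", "d", "e"))

# z-score by hand: (x - mean) / sample standard deviation, column by column
manual <- sapply(dt, function(x) (x - mean(x)) / sd(x))
rownames(manual) <- rownames(dt)

# Compare with scale(), ignoring the "scaled:center" / "scaled:scale" attributes
same <- isTRUE(all.equal(manual, scale(dt), check.attributes = FALSE))
print(same)
```

The centers (5.7 and 3.0) and scales (3.866523 and 1.581139) reported by scale() are these column means and standard deviations.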

Analysing clusters

Visualizing the Dendrogram

# https://rpubs.com/gaston/dendrograms
hc <- hclust(dist(dt))          # hierarchical clustering on the raw (unscaled) data
plot(hc, xlab="pessoas")        # very simple dendrogram
box(which="figure",lty="solid",col="red",bg="yellow")
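
Besides reading the groups off the dendrogram visually, the tree can be cut programmatically. The following is a minimal sketch using cutree() from base R; the data frame is re-entered from the printed output (assumed to be the full data set), and the choice of k = 2 mirrors the value chosen later in this example.

```r
# Data re-entered from the printed output (assumed to be the full data set)
dt <- data.frame(comida = c(2, 8, 9, 1, 8.5),
                 roupas = c(4, 2, 3, 5, 1),
                 row.names = c("a", "b", "c", "d", "e"))

# Same call as above: Euclidean distance and complete linkage (the defaults)
hc <- hclust(dist(dt))

# Cut the tree into two groups; the result is a named vector of group labels
groups <- cutree(hc, k = 2)
print(groups)
```

People a and d end up in one group and b, c and e in the other, which can then be compared with the k-means solution.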

Choosing the number of clusters

# http://www.statmethods.net/advstats/cluster.html
wss <- (nrow(dts)-1)*sum(apply(dts,2,var))
# nCenters <- nrow(dts)-1   # (maximum allowed)
nCenters <- 2           # (chosen visually, observing dendrogram image)

Partitioning by K-means

for (i in 2:nCenters) 
  wss[i] <- sum(kmeans(dts, centers=i)$withinss)
plot(1:nCenters, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")
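
With nCenters fixed at 2 the plot above has only two points, so it cannot show an "elbow". The sketch below is one way to compute the within-groups sum of squares for every k up to nrow(dts)-1, the maximum mentioned in the commented-out line. The names max_k and k, the seed, and the nstart value are assumptions made for illustration; the data is re-entered from the printed output.

```r
# Data re-entered from the printed output (assumed to be the full data set)
dt <- data.frame(comida = c(2, 8, 9, 1, 8.5),
                 roupas = c(4, 2, 3, 5, 1),
                 row.names = c("a", "b", "c", "d", "e"))
dts <- scale(dt)

set.seed(1)                # k-means starts from random centers
max_k <- nrow(dts) - 1     # largest number of clusters tried here
wss <- numeric(max_k)

# k = 1: a single cluster, so the within-groups SS is the total sum of squares
wss[1] <- (nrow(dts) - 1) * sum(apply(dts, 2, var))

# k >= 2: total within-cluster sum of squares of the best of 10 random starts
for (k in 2:max_k) {
  wss[k] <- kmeans(dts, centers = k, nstart = 10)$tot.withinss
}

plot(1:max_k, wss, type = "b",
     xlab = "Number of Clusters", ylab = "Within groups sum of squares")
```

A sharp drop followed by a flat tail suggests the number of clusters at the bend is a reasonable choice.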

K-Means Cluster Analysis

fit <- kmeans(dts, nCenters) 
# get cluster means
aggregate(dts,by=list(fit$cluster),FUN=mean)

##   Group.1     comida     roupas
## 1       1 -1.0862473  0.9486833
## 2       2  0.7241648 -0.6324555

# append cluster assignment
dt.cluster <- data.frame(dts, fit$cluster)

Resulting Clusters

##       comida     roupas fit.cluster
## a -0.9569321  0.6324555           1
## b  0.5948497 -0.6324555           2
## c  0.8534800  0.0000000           2
## d -1.2155624  1.2649111           1
## e  0.7241648 -1.2649111           2
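
The cluster means reported earlier are in standardized units, which are hard to interpret as spending. A minimal sketch of the same summary in the original units follows; it re-enters the data from the printed output (assumed to be the full data set), and the seed, the nstart value and the name centers_raw are assumptions added for reproducibility and illustration.

```r
# Data re-entered from the printed output (assumed to be the full data set)
dt <- data.frame(comida = c(2, 8, 9, 1, 8.5),
                 roupas = c(4, 2, 3, 5, 1),
                 row.names = c("a", "b", "c", "d", "e"))
dts <- scale(dt)

set.seed(1)                                   # make the random starts repeatable
fit <- kmeans(dts, centers = 2, nstart = 10)  # cluster on the standardized data

# Average the raw (unscaled) spending within each cluster
centers_raw <- aggregate(dt, by = list(cluster = fit$cluster), FUN = mean)
print(centers_raw)
```

The cluster numbers themselves are arbitrary and may swap between runs; only the grouping of people is meaningful.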

Plotting the revised scattergram with the clusters identified
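
One possible way to draw such a plot with base graphics is sketched below. It is not necessarily the code used for the original figure: the colors, the title, the axis limits and the name cols are assumptions, and the data is re-entered from the printed output (assumed to be the full data set).

```r
# Data re-entered from the printed output (assumed to be the full data set)
dt <- data.frame(comida = c(2, 8, 9, 1, 8.5),
                 roupas = c(4, 2, 3, 5, 1),
                 row.names = c("a", "b", "c", "d", "e"))

set.seed(1)
fit <- kmeans(scale(dt), centers = 2, nstart = 10)

# One color per cluster, looked up by each person's cluster number
cols <- c("red", "blue")[fit$cluster]

# Scatter plot in the original units, colored by cluster
plot(dt$comida, dt$roupas, col = cols, pch = 19,
     xlim = c(0, 10), ylim = c(0, 6),
     xlab = "comida", ylab = "roupas",
     main = "K-means clusters (k = 2)")

# Label each point with the person's identifier, drawn above the point
text(dt$comida, dt$roupas, labels = rownames(dt), pos = 3)
```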

The R code for this example is available to view and download from my GitHub: https://github.com/svicente99/cluster_analysis_example