## Robert Sidorowski
K-means is an important unsupervised machine learning algorithm used to find groups of observations that share similar characteristics: data objects within the same cluster are similar to each other and dissimilar to objects in other clusters. The aim of this work is to generate high-quality clusters with high intra-class similarity and low inter-class similarity. I am going to use a German credit data set and cluster the credit applicants.
The original German credit dataset contains 1000 entries described by 20 attributes; the simplified Kaggle version used here has 10 columns. Each entry represents a person who took a credit from a bank and, in the original data, is classified as a good or bad credit risk according to the set of attributes.
First, I load the data and check the dimensions of the dataset.
credit <- read.csv("german_credit_data.csv", sep=",", header=TRUE)
dim(credit)
## [1] 1000 10
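Before dropping anything, it is worth checking the structure of the file (a quick base-R look; the exact column layout depends on the particular Kaggle download):
str(credit)     # column names and types
summary(credit) # basic summaries of each column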
The purpose of my project is to cluster people who took a credit. To do this I worked on a dataset called "German Credit Risk" downloaded from Kaggle. The dataset has 10 columns: an ID column plus nine attributes describing each borrower and their credit.
credit<-credit[-1]
Since the first column only contains an ID for each person, I deleted it, as it is not needed for the clustering analysis. I focused only on numerical variables, so I based my analysis on the variables "Age", "Credit.amount" and "Duration".
creditChoosed<-credit[c(1,7,8)]
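The same selection can be written by column name, which is more robust to column reordering:
creditChoosed <- credit[c("Age", "Credit.amount", "Duration")]  # equivalent to credit[c(1, 7, 8)]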
Before running the k-means algorithm, I took a look at the data using histograms of the variables I will use in the later part of my project.
hist(creditChoosed$Duration)
hist(creditChoosed$Age)
hist(creditChoosed$Credit.amount)
We can notice that the data contain some outliers and the distributions are right-skewed. To obtain better results from the k-means algorithm, I decided to transform the data with a logarithm; log(x + 1) is used below so that the transformation behaves well for small values.
creditPrepared<-log(creditChoosed+1)
Then I check the histograms again:
hist(creditPrepared$Duration)
hist(creditPrepared$Age)
hist(creditPrepared$Credit.amount)
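To quantify how much the log transform helped, one can compare a simple moment-based skewness coefficient before and after the transformation (a minimal base-R sketch; skew() is a helper defined here, not a library function):
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3  # sample skewness
sapply(creditChoosed, skew)  # clearly right-skewed before the transform
sapply(creditPrepared, skew) # much closer to symmetric afterwards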
Before running k-means I also needed to scale the data: k-means relies on Euclidean distances, so variables measured on larger scales (such as Credit.amount) would otherwise dominate the clustering.
creditScaled<-scale(creditPrepared)
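scale() centers every column to mean 0 and rescales it to standard deviation 1. A quick sanity check:
round(colMeans(creditScaled), 10)  # means are numerically zero
apply(creditScaled, 2, sd)         # standard deviations are all 1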
After that, I’ve started to look for appropriate number of clusters: To do that I used different methods:
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
fviz_nbclust(creditScaled, kmeans, method = "silhouette")
fviz_nbclust(creditScaled, kmeans, method = "wss")
fviz_nbclust(creditScaled, kmeans, method = "gap_stat")
From the plots we can see that the suggested number of clusters is 2 for the silhouette method and 3 for both the elbow method and the gap statistic. Based on these results, I decided to choose 3 clusters.
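For reference, the within-cluster sum of squares curve behind the elbow plot can also be reproduced directly with base R (a sketch; nstart = 10 and the seed are my own choices to stabilize the random initialization):
set.seed(42)  # k-means starts from random centroids
wss <- sapply(1:10, function(k) kmeans(creditScaled, centers = k, nstart = 10)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "Number of clusters k", ylab = "Total within-cluster SS")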
Running the k-means algorithm:
creditkmeans<-kmeans(creditScaled,3)
creditkmeans
## K-means clustering with 3 clusters of sizes 257, 365, 378
##
## Cluster means:
## Age Credit.amount Duration
## 1 1.15070308 -0.5235296 -0.6187934
## 2 -0.07752252 0.9543931 0.9420904
## 3 -0.70749992 -0.5656254 -0.4889765
##
## Clustering vector:
## [1] 1 2 1 2 2 2 1 2 1 2 3 2 3 1 3 3 1 2 2 2 1 1 1 1 3 1 1 1 3 2 3 2 2 1 3 2 2
## [38] 1 1 3 2 3 2 2 2 3 2 3 2 3 3 2 3 2 1 3 1 2 3 2 3 1 1 2 2 2 3 3 2 2 2 1 1 2
## [75] 2 1 2 1 2 2 2 1 3 1 1 3 3 2 3 3 1 1 3 2 1 2 1 3 2 2 1 2 3 3 3 2 2 2 2 3 3
## [112] 3 2 2 3 2 2 3 2 3 3 2 3 1 3 3 1 3 3 3 2 2 3 1 2 1 2 1 3 1 3 2 2 3 3 2 1 1
## [149] 2 1 1 1 2 2 2 3 1 1 2 3 2 3 3 1 3 3 3 3 2 3 3 3 2 3 2 2 1 1 3 1 2 2 1 2 3
## [186] 3 1 1 3 2 2 2 2 3 2 3 3 3 2 2 1 3 2 3 1 2 1 3 2 1 3 2 2 1 2 1 3 2 2 1 3 3
## [223] 3 2 2 2 2 1 3 3 2 1 3 3 1 3 3 1 3 2 3 1 2 3 3 3 1 3 3 3 1 1 3 2 3 2 2 3 3
## [260] 1 3 1 2 1 3 1 2 1 2 3 3 1 2 2 2 3 3 1 3 3 3 1 1 1 2 2 2 2 1 1 3 2 2 2 2 2
## [297] 3 1 1 2 1 2 3 1 2 3 2 3 3 3 2 2 2 3 3 2 1 2 3 3 2 3 2 3 3 1 1 3 2 3 1 3 2
## [334] 2 2 1 3 3 2 3 2 2 3 2 1 1 3 2 3 3 3 3 2 2 1 3 1 2 3 2 1 1 3 3 3 3 2 3 2 1
## [371] 2 2 3 2 2 2 3 1 2 1 1 2 3 3 2 3 3 2 3 3 3 3 2 3 3 2 1 2 1 1 1 3 2 3 1 3 1
## [408] 3 2 3 3 2 1 1 3 2 3 2 2 3 3 3 1 3 3 3 2 1 3 1 1 2 1 3 3 3 3 3 2 3 1 3 3 1
## [445] 2 1 2 3 1 1 2 3 3 2 2 2 1 3 3 2 2 3 3 1 3 1 3 2 2 2 3 3 3 1 3 3 2 2 1 1 3
## [482] 3 2 3 1 1 1 1 3 3 3 2 3 1 1 3 2 1 3 3 2 2 1 1 3 3 3 2 3 2 3 2 3 3 1 1 1 2
## [519] 1 1 2 3 2 2 3 2 3 1 2 1 2 3 1 3 2 3 1 2 2 1 3 1 2 3 1 1 2 3 3 2 3 1 2 3 1
## [556] 3 3 2 2 3 1 3 3 2 2 3 3 2 2 2 2 2 2 3 3 3 3 3 2 3 3 1 3 2 1 3 1 3 1 1 1 1
## [593] 3 3 1 3 3 2 3 2 1 3 3 2 3 3 1 2 3 1 3 1 2 2 3 2 2 1 2 1 2 3 2 3 1 3 1 3 2
## [630] 1 2 1 3 3 3 1 3 2 3 2 3 1 2 3 3 2 2 3 1 1 2 1 1 2 2 3 1 2 2 2 3 3 1 3 1 2
## [667] 2 2 1 2 2 2 2 3 1 2 2 2 1 3 1 1 3 2 2 2 3 2 3 3 3 3 2 1 3 1 3 1 3 1 3 2 1
## [704] 2 2 2 2 3 3 1 3 3 1 3 2 2 2 3 2 2 3 3 3 1 3 1 1 3 2 3 2 3 1 2 1 2 2 2 1 2
## [741] 2 3 2 3 2 3 3 1 2 3 1 3 3 2 1 3 1 1 3 3 1 3 3 2 3 1 2 3 3 1 2 2 2 1 1 3 2
## [778] 3 2 1 2 1 3 3 1 3 2 2 2 2 1 2 3 1 2 3 2 3 1 3 1 1 2 3 2 2 3 1 2 3 3 3 2 2
## [815] 2 2 1 3 2 3 3 3 2 1 2 3 2 2 2 2 1 3 2 2 3 1 3 3 1 1 2 2 3 1 1 2 1 3 1 1 3
## [852] 2 3 3 2 3 1 3 3 3 2 3 2 2 3 3 3 1 2 3 2 1 3 3 3 1 3 2 3 2 2 2 2 1 2 1 2 2
## [889] 2 2 2 1 1 2 3 2 3 3 3 2 1 2 2 1 3 3 2 2 1 3 2 2 2 3 2 2 3 1 2 2 3 2 3 3 2
## [926] 1 3 2 1 1 3 3 3 1 3 2 3 3 2 2 1 3 3 1 1 2 2 1 1 2 2 2 2 2 3 1 2 1 2 2 3 1
## [963] 3 2 3 3 3 1 2 1 3 2 3 2 2 1 1 1 1 3 2 2 3 2 3 3 2 1 2 1 1 3 3 2 1 3 2 1 2
## [1000] 2
##
## Within cluster sum of squares by cluster:
## [1] 434.2936 534.4278 460.0009
## (between_SS / total_SS = 52.3 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
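Note that kmeans() starts from random centroids, so repeated runs (including the eclust() call below) can relabel or slightly resize the clusters. For a reproducible run one would typically fix the seed and use several restarts (a sketch; km_repro is a name of my own):
set.seed(123)                                              # any fixed value works
km_repro <- kmeans(creditScaled, centers = 3, nstart = 25) # keeps the best of 25 random starts
km_repro$size                                              # sizes now stable across re-runs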
creditkmeans<-eclust(creditScaled, "kmeans", hc_metric="euclidean",k=3)
We obtained 3 clusters: 368 objects in the first one, 371 in the second and 261 in the third. The sizes differ slightly from the printout above because eclust() re-runs k-means from a different random initialization, which can also relabel the clusters.
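eclust() also stores silhouette information, which gives a per-observation quality check of the assignment:
fviz_silhouette(creditkmeans)  # average silhouette width per cluster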
fviz_cluster(list(data=credit[c("Age","Credit.amount")], cluster=creditkmeans$cluster), ellipse.type="norm", geom="point",
stand=FALSE, palette="jco", ggtheme=theme_classic())
fviz_cluster(list(data=credit[c("Duration","Credit.amount")], cluster=creditkmeans$cluster), ellipse.type="norm", geom="point",
stand=FALSE, palette="jco", ggtheme=theme_classic())
fviz_cluster(list(data=credit[c("Duration","Age")], cluster=creditkmeans$cluster), ellipse.type="norm", geom="point",
stand=FALSE, palette="jco", ggtheme=theme_classic())
Interpretation of these graphs: we obtained 3 groups.
Generally, we can say that the first group contains middle-aged people who borrowed rather large amounts of money for a long period.
The second group represents young people who borrowed small amounts of money for a rather short period.
The third group contains older people with short-duration credits for small amounts of money.
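These descriptions can be cross-checked against the cluster means on the original (untransformed) scale:
aggregate(creditChoosed, by = list(cluster = creditkmeans$cluster), FUN = mean)  # mean Age, Credit.amount, Duration per cluster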
Now it’s time to try hierarchical clustering.
credittree <- hclust(dist(creditScaled), method="ward.D2")
plot(credittree)
rect.hclust(credittree,k=4,border="red")
credittree1<-cutree(credittree,4)
I’ve used function rect.hust to decide how many clusters I should applied. After few changes I’ve decided to implement 4 clusters.
After the classification I created plots, as they will be important for the interpretation.
fviz_cluster(list(data=credit[c("Duration","Age")], cluster=credittree1), ellipse.type="norm", geom="point",
stand=FALSE, palette="jco", ggtheme=theme_classic())
fviz_cluster(list(data=credit[c("Duration","Credit.amount")], cluster=credittree1), ellipse.type="norm", geom="point",
stand=FALSE, palette="jco", ggtheme=theme_classic())
fviz_cluster(list(data=credit[c("Age","Credit.amount")], cluster=credittree1), ellipse.type="norm", geom="point",
stand=FALSE, palette="jco", ggtheme=theme_classic())
We received 4 groups:
The first one represents middle-aged and older people who borrowed small amounts of money for a short period.
The second group contains all kinds of cases: people of every age with different credit types.
In the third group we have young people who borrowed rather small amounts of money, though more than people from the first and fourth groups; the duration of their credits is also longer than in the first and fourth groups.
The fourth group represents young and middle-aged people with low credit amounts and short durations.
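As with k-means, the interpretation can be checked against per-cluster means on the original scale:
aggregate(creditChoosed, by = list(cluster = credittree1), FUN = mean)  # means per hierarchical cluster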
Method’s comparison:
Based on the analysis conducted above, we can say that k-means dealt better with our data: it provided a clearer classification than the hierarchical clustering method, yielding 3 groups that were easy to interpret, while the hierarchical solution included a catch-all group that was difficult to describe.
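A more objective comparison would look at the average silhouette width of both solutions on the same scaled data (a sketch using the cluster package; higher values indicate better-separated clusters):
library(cluster)
d <- dist(creditScaled)
mean(silhouette(creditkmeans$cluster, d)[, 3])  # k-means, k = 3
mean(silhouette(credittree1, d)[, 3])           # Ward hierarchical clustering, k = 4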