The data comes from Kaggle. As the author of the dataset explains, it consists of over 1 million transactions made by more than 800K customers of a bank in India.
The main purpose of this project is therefore to identify different types of transactions. I will use K-means clustering and PAM, choosing an appropriate number of clusters with the variance explained and average silhouette criteria. All of the results will be visualized.
library(dplyr)
library(ggplot2)
library(tidyverse)
library(cluster)
library(factoextra)
library(flexclust)
library(fpc)
library(ClusterR)
library(anytime)
library(hopkins)
dplyr: A library for data manipulation and analysis.
ggplot2: A library for creating data visualizations based on the grammar of graphics.
tidyverse: A meta-package bundling libraries for data manipulation, analysis, and visualization.
cluster: A library for cluster analysis (PAM, silhouettes).
factoextra: A library for extracting and visualizing the results of multivariate and cluster analyses.
flexclust: A library for flexible clustering algorithms and diagnostics.
fpc: A library for cluster validation and assessment.
ClusterR: A framework for cluster analysis.
anytime: A library for converting time and date representations.
hopkins: A library for calculating the Hopkins statistic of clustering tendency.
Read data:
data <- read.csv("bank_transactions.csv")
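A side note on ingestion: with over a million rows, read.csv is noticeably slow. data.table::fread is a common faster alternative (an assumption here: the data.table package is installed, since it is not loaded above):
# Roughly equivalent, much faster read (requires the data.table package):
# data <- data.table::fread("bank_transactions.csv", data.table = FALSE)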
Names of the columns:
colnames(data)
## [1] "TransactionID" "CustomerID"
## [3] "CustomerDOB" "CustGender"
## [5] "CustLocation" "CustAccountBalance"
## [7] "TransactionDate" "TransactionTime"
## [9] "TransactionAmount..INR."
Data types:
str(data)
## 'data.frame': 1048567 obs. of 9 variables:
## $ TransactionID : chr "T1" "T2" "T3" "T4" ...
## $ CustomerID : chr "C5841053" "C2142763" "C4417068" "C5342380" ...
## $ CustomerDOB : chr "10/1/94" "4/4/57" "26/11/96" "14/9/73" ...
## $ CustGender : chr "F" "M" "F" "F" ...
## $ CustLocation : chr "JAMSHEDPUR" "JHAJJAR" "MUMBAI" "MUMBAI" ...
## $ CustAccountBalance : num 17819 2271 17874 866503 6714 ...
## $ TransactionDate : chr "2/8/16" "2/8/16" "2/8/16" "2/8/16" ...
## $ TransactionTime : int 143207 141858 142712 142714 181156 173940 173806 170537 192825 192446 ...
## $ TransactionAmount..INR.: num 25 27999 459 2060 1762 ...
Data summary:
summary(data)
## TransactionID CustomerID CustomerDOB CustGender
## Length:1048567 Length:1048567 Length:1048567 Length:1048567
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## CustLocation CustAccountBalance TransactionDate TransactionTime
## Length:1048567 Min. : 0 Length:1048567 Min. : 0
## Class :character 1st Qu.: 4722 Class :character 1st Qu.:124030
## Mode :character Median : 16792 Mode :character Median :164226
## Mean : 115404 Mean :157088
## 3rd Qu.: 57657 3rd Qu.:200010
## Max. :115035495 Max. :235959
## NA's :2369
## TransactionAmount..INR.
## Min. : 0
## 1st Qu.: 161
## Median : 459
## Mean : 1574
## 3rd Qu.: 1200
## Max. :1560035
##
Change types to factor:
data$CustLocation <- as.factor(data$CustLocation)
data$CustGender <- as.factor(data$CustGender)
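A quick look at the resulting factors, using base R:
nlevels(data$CustLocation)   # number of distinct customer locations
table(data$CustGender)       # distribution of the gender codes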
Prepare two new variables, TransactionHour and TransactionMinute; transactions at certain hours or minutes may share certain attributes. TransactionTime is stored as an integer in HHMMSS form (e.g. 143207 is 14:32:07, and the maximum in the summary above is 235959), so both fields can be extracted with integer arithmetic.
data["TransactionHour"] <- data$TransactionTime %/% 10000           # HHMMSS -> HH
data["TransactionMinute"] <- (data$TransactionTime %/% 100) %% 100  # HHMMSS -> MM
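Since the HHMMSS reading is inferred from the data rather than documented, a quick sanity check (base R only) confirms the derived fields stay in range:
head(data[, c("TransactionTime", "TransactionHour", "TransactionMinute")])
# Hours must fall in 0-23 and minutes in 0-59 if the encoding is right.
stopifnot(all(data$TransactionHour %in% 0:23),
          all(data$TransactionMinute %in% 0:59))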
Choose columns for the final analysis:
final_columns <-
c(
"CustAccountBalance",
"TransactionAmount..INR.",
"TransactionHour",
"TransactionMinute"
)
Due to the size of the dataset and the computational power of my laptop, sampling from the main dataset was needed. Rows with missing values (the 2,369 missing account balances seen in the summary) are dropped first; for clustering I will use a sample of 10,000 rows.
df <- data[, final_columns]
df <- df[complete.cases(df),]
df.sample <- df[sample(nrow(df), 10000), ]
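Two optional additions at this point: seeding the RNG makes the sample reproducible, and the hopkins package loaded above (but otherwise unused) can verify that the sample shows any clustering tendency at all. A minimal sketch, assuming the default behaviour of hopkins::hopkins:
set.seed(42)  # fix the RNG so the 10,000-row sample can be reproduced
df.sample <- df[sample(nrow(df), 10000), ]
# Hopkins statistic on the sampled rows: values near 1 suggest clusterable
# structure, values near 0.5 suggest uniformly random data.
hopkins(as.matrix(df.sample), m = 100)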
Helper functions:
# Fit a flexclust K-centroids model with the given distance measure and
# draw a stripes plot of each point's distance to the cluster centroids.
cluster_and_stripes <- function(dist, k) {
  kcca_model <- cclust(df.sample, k = k, dist = dist)
  stripes(kcca_model)
}
# Fit K-means and draw per-cluster box plots of every variable.
cluster_and_box_plots <- function(k, alpha = 0.05) {
  kmeans_model <- kmeans(df.sample, k)
  groupBWplot(df.sample, kmeans_model$cluster, alpha = alpha)
}
# Fit K-means via factoextra::eclust and plot the silhouette widths.
visualize_results <- function(k) {
  results <- eclust(df.sample, "kmeans", k = k)
  fviz_silhouette(results)
}
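One caveat worth flagging before fitting anything: CustAccountBalance (up to roughly 115 million INR) and TransactionAmount..INR. are orders of magnitude larger than the hour and minute variables, so Euclidean-distance clustering is driven almost entirely by the monetary columns. A standardized copy of the sample would weight all four variables equally; the sketch below is optional and not used in the runs that follow:
# Standardize every column to mean 0 and sd 1 so no single variable
# dominates the Euclidean distances used by K-means and PAM.
df.sample.scaled <- as.data.frame(scale(df.sample))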
Silhouette plots for the K-means solutions. 3 Clusters:
visualize_results(k = 3)
2 Clusters:
visualize_results(k = 2)
Stripes plots of the distances from each point to the cluster centroids. 3 Clusters:
cluster_and_stripes(dist = "euclidean", k = 3)
2 Clusters:
cluster_and_stripes(dist = "euclidean", k = 2)
Per-cluster box plots of the variables. 3 Clusters:
cluster_and_box_plots(3)
## Warning in data.frame(..., check.names = FALSE): row names were found from a
## short variable and have been discarded
2 Clusters:
cluster_and_box_plots(2)
## Warning in data.frame(..., check.names = FALSE): row names were found from a
## short variable and have been discarded
Choose the number of clusters by variance explained (the ClusterR default criterion):
Optimal_Clusters_KMeans(df.sample, max_clusters = 5, plot_clusters = TRUE)
By average silhouette width:
Optimal_Clusters_KMeans(
df.sample,
max_clusters = 5,
plot_clusters = TRUE,
criterion = "silhouette"
)
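As a cross-check on the ClusterR criteria, the average silhouette width can also be computed directly with cluster::silhouette. A full distance matrix for all 10,000 points is memory-hungry (about 50 million pairs), so this sketch works on a 2,000-row subsample:
idx <- sample(nrow(df.sample), 2000)   # smaller subsample to keep dist() cheap
d <- dist(df.sample[idx, ])
for (k in 2:3) {
  km <- kmeans(df.sample[idx, ], centers = k, nstart = 10)
  sil <- silhouette(km$cluster, d)
  cat("k =", k, "mean silhouette width:", mean(sil[, "sil_width"]), "\n")
}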
Using all of the above information, we can conclude that two clusters is the most appropriate choice.
PAM (Partitioning Around Medoids) is typically more robust to outliers than K-means, which matters here given the heavy-tailed balance and amount variables. 3 Clusters:
pam1.results <- eclust(df.sample, "pam", k = 3)
fviz_silhouette(pam1.results)
2 Clusters:
pam2.results <- eclust(df.sample, "pam", k = 2)
fviz_silhouette(pam2.results)
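To quantify how much the two algorithms agree at k = 2, the partitions can be cross-tabulated, and flexclust::randIndex on the resulting table gives the adjusted Rand index (1 means identical partitions). A sketch that refits K-means on the same sample:
km2 <- kmeans(df.sample, centers = 2, nstart = 10)
# Rows: K-means assignments; columns: PAM assignments.
tab <- table(kmeans = km2$cluster, pam = pam2.results$clustering)
tab
randIndex(tab)  # adjusted Rand index of the two partitions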
In conclusion, a bank transaction dataset of over 1 million transactions made by roughly 800K customers in India was analyzed. The aim of the project was to identify different types of transactions using K-means clustering and PAM. The data was read, categorical columns were converted to factors, and two new variables, TransactionHour and TransactionMinute, were derived to capture possible time-of-day patterns. Because of the dataset's size, a sample of 10,000 complete rows was drawn. K-means and PAM were fitted on the sample, the number of clusters was selected using the variance explained and average silhouette criteria, and the results were visualized with silhouette, stripes, and box plots. Both criteria and the silhouette plots indicate that a 2-cluster solution is the most suitable for this dataset.