The data comes from Kaggle. As the author of the dataset explains, it consists of over 1 million transactions made by more than 800K customers of a bank in India.
The main purpose of this project is therefore to identify different types of transactions. I will use K-means clustering and PAM, choosing an appropriate number of clusters with the variance explained and average silhouette criteria. All of the results will be visualized.
library(dplyr)
library(ggplot2)
library(tidyverse)
library(cluster)
library(factoextra)
library(flexclust)
library(fpc)
library(ClusterR)
library(anytime)
library(hopkins)
dplyr: A library for data manipulation and analysis.
ggplot2: A library for creating data visualizations based on the grammar of graphics.
tidyverse: A meta-package bundling libraries for data manipulation, analysis, and visualization.
cluster: A library for cluster analysis (PAM, silhouettes).
factoextra: A library for extracting and visualizing the results of multivariate and cluster analyses.
flexclust: A library for flexible clustering algorithms and diagnostics.
fpc: A library for cluster validation and assessment.
ClusterR: A framework for cluster analysis.
anytime: A library for converting time and date representations.
hopkins: A library for calculating the Hopkins statistic of clustering tendency.
Read data:
data <- read.csv("bank_transactions.csv")
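A side note on ingestion: with over a million rows, read.csv is noticeably slow. data.table::fread is a common faster alternative (an assumption here: the data.table package is installed, since it is not loaded above):
# Roughly equivalent, much faster read (requires the data.table package):
# data <- data.table::fread("bank_transactions.csv", data.table = FALSE)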
Names of the columns:
colnames(data)
## [1] "TransactionID" "CustomerID"
## [3] "CustomerDOB" "CustGender"
## [5] "CustLocation" "CustAccountBalance"
## [7] "TransactionDate" "TransactionTime"
## [9] "TransactionAmount..INR."
Data types:
str(data)
## 'data.frame': 1048567 obs. of 9 variables:
## $ TransactionID : chr "T1" "T2" "T3" "T4" ...
## $ CustomerID : chr "C5841053" "C2142763" "C4417068" "C5342380" ...
## $ CustomerDOB : chr "10/1/94" "4/4/57" "26/11/96" "14/9/73" ...
## $ CustGender : chr "F" "M" "F" "F" ...
## $ CustLocation : chr "JAMSHEDPUR" "JHAJJAR" "MUMBAI" "MUMBAI" ...
## $ CustAccountBalance : num 17819 2271 17874 866503 6714 ...
## $ TransactionDate : chr "2/8/16" "2/8/16" "2/8/16" "2/8/16" ...
## $ TransactionTime : int 143207 141858 142712 142714 181156 173940 173806 170537 192825 192446 ...
## $ TransactionAmount..INR.: num 25 27999 459 2060 1762 ...
Data summary:
summary(data)
## TransactionID CustomerID CustomerDOB CustGender
## Length:1048567 Length:1048567 Length:1048567 Length:1048567
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## CustLocation CustAccountBalance TransactionDate TransactionTime
## Length:1048567 Min. : 0 Length:1048567 Min. : 0
## Class :character 1st Qu.: 4722 Class :character 1st Qu.:124030
## Mode :character Median : 16792 Mode :character Median :164226
## Mean : 115404 Mean :157088
## 3rd Qu.: 57657 3rd Qu.:200010
## Max. :115035495 Max. :235959
## NA's :2369
## TransactionAmount..INR.
## Min. : 0
## 1st Qu.: 161
## Median : 459
## Mean : 1574
## 3rd Qu.: 1200
## Max. :1560035
##
Change types to factor:
data$CustLocation <- as.factor(data$CustLocation)
data$CustGender <- as.factor(data$CustGender)
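A quick look at the resulting factors, using base R:
nlevels(data$CustLocation)   # number of distinct customer locations
table(data$CustGender)       # distribution of the gender codes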
Prepare two new variables, TransactionHour and TransactionMinute; transactions at certain hours or minutes may share certain attributes. TransactionTime is stored as an integer in HHMMSS form (e.g. 143207 is 14:32:07, and the maximum in the summary above is 235959), so both fields can be extracted with integer arithmetic.
data["TransactionHour"] <- data$TransactionTime %/% 10000           # HHMMSS -> HH
data["TransactionMinute"] <- (data$TransactionTime %/% 100) %% 100  # HHMMSS -> MM
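Since the HHMMSS reading is inferred from the data rather than documented, a quick sanity check (base R only) confirms the derived fields stay in range:
head(data[, c("TransactionTime", "TransactionHour", "TransactionMinute")])
# Hours must fall in 0-23 and minutes in 0-59 if the encoding is right.
stopifnot(all(data$TransactionHour %in% 0:23),
          all(data$TransactionMinute %in% 0:59))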
Choose columns for the final analysis:
final_columns <-
c(
"CustAccountBalance",
"TransactionAmount..INR.",
"TransactionHour",
"TransactionMinute"
)
Due to the size of the dataset and the computational power of my laptop, sampling from the main dataset was needed. Rows with missing values (the 2,369 missing account balances seen in the summary) are dropped first; for clustering I will use a sample of 10,000 rows.
df <- data[, final_columns]
df <- df[complete.cases(df),]
df.sample <- df[sample(nrow(df), 10000), ]
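Two optional additions at this point: seeding the RNG makes the sample reproducible, and the hopkins package loaded above (but otherwise unused) can verify that the sample shows any clustering tendency at all. A minimal sketch, assuming the default behaviour of hopkins::hopkins:
set.seed(42)  # fix the RNG so the 10,000-row sample can be reproduced
df.sample <- df[sample(nrow(df), 10000), ]
# Hopkins statistic on the sampled rows: values near 1 suggest clusterable
# structure, values near 0.5 suggest uniformly random data.
hopkins(as.matrix(df.sample), m = 100)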
Helper functions:
# Fit a flexclust K-centroids model with the given distance measure and
# draw a stripes plot of each point's distance to the cluster centroids.
cluster_and_stripes <- function(dist, k) {
  kcca_model <- cclust(df.sample, k = k, dist = dist)
  stripes(kcca_model)
}
# Fit K-means and draw per-cluster box plots of every variable.
cluster_and_box_plots <- function(k, alpha = 0.05) {
  kmeans_model <- kmeans(df.sample, k)
  groupBWplot(df.sample, kmeans_model$cluster, alpha = alpha)
}
# Fit K-means via factoextra::eclust and plot the silhouette widths.
visualize_results <- function(k) {
  results <- eclust(df.sample, "kmeans", k = k)
  fviz_silhouette(results)
}
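One caveat worth flagging before fitting anything: CustAccountBalance (up to roughly 115 million INR) and TransactionAmount..INR. are orders of magnitude larger than the hour and minute variables, so Euclidean-distance clustering is driven almost entirely by the monetary columns. A standardized copy of the sample would weight all four variables equally; the sketch below is optional and not used in the runs that follow:
# Standardize every column to mean 0 and sd 1 so no single variable
# dominates the Euclidean distances used by K-means and PAM.
df.sample.scaled <- as.data.frame(scale(df.sample))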
Silhouette plots for the K-means solutions. 3 Clusters:
visualize_results(k = 3)
2 Clusters:
visualize_results(k = 2)
Stripes plots of the distances from each point to the cluster centroids. 3 Clusters:
cluster_and_stripes(dist = "euclidean", k = 3)
2 Clusters:
cluster_and_stripes(dist = "euclidean", k = 2)
Per-cluster box plots of the variables. 3 Clusters:
cluster_and_box_plots(3)
## Warning in data.frame(..., check.names = FALSE): row names were found from a
## short variable and have been discarded
2 Clusters:
cluster_and_box_plots(2)
## Warning in data.frame(..., check.names = FALSE): row names were found from a
## short variable and have been discarded
Choose the number of clusters by variance explained (the ClusterR default criterion):
Optimal_Clusters_KMeans(df.sample, max_clusters = 5, plot_clusters = TRUE)
By average silhouette width:
Optimal_Clusters_KMeans(
df.sample,
max_clusters = 5,
plot_clusters = TRUE,
criterion = "silhouette"
)
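As a cross-check on the ClusterR criteria, the average silhouette width can also be computed directly with cluster::silhouette. A full distance matrix for all 10,000 points is memory-hungry (about 50 million pairs), so this sketch works on a 2,000-row subsample:
idx <- sample(nrow(df.sample), 2000)   # smaller subsample to keep dist() cheap
d <- dist(df.sample[idx, ])
for (k in 2:3) {
  km <- kmeans(df.sample[idx, ], centers = k, nstart = 10)
  sil <- silhouette(km$cluster, d)
  cat("k =", k, "mean silhouette width:", mean(sil[, "sil_width"]), "\n")
}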
Using all of the above information, we can conclude that two clusters is the most appropriate choice.
PAM (Partitioning Around Medoids) is typically more robust to outliers than K-means, which matters here given the heavy-tailed balance and amount variables. 3 Clusters:
pam1.results <- eclust(df.sample, "pam", k = 3)
fviz_silhouette(pam1.results)
2 Clusters:
pam2.results <- eclust(df.sample, "pam", k = 2)
fviz_silhouette(pam2.results)
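To quantify how much the two algorithms agree at k = 2, the partitions can be cross-tabulated, and flexclust::randIndex on the resulting table gives the adjusted Rand index (1 means identical partitions). A sketch that refits K-means on the same sample:
km2 <- kmeans(df.sample, centers = 2, nstart = 10)
# Rows: K-means assignments; columns: PAM assignments.
tab <- table(kmeans = km2$cluster, pam = pam2.results$clustering)
tab
randIndex(tab)  # adjusted Rand index of the two partitions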
In conclusion, a bank transaction dataset of over 1 million transactions made by roughly 800K customers in India was analyzed. The aim of the project was to identify different types of transactions using K-means clustering and PAM. The data was read, categorical columns were converted to factors, and two new variables, TransactionHour and TransactionMinute, were derived to capture possible time-of-day patterns. Because of the dataset's size, a sample of 10,000 complete rows was drawn. K-means and PAM were fitted on the sample, the number of clusters was selected using the variance explained and average silhouette criteria, and the results were visualized with silhouette, stripes, and box plots. Both criteria and the silhouette plots indicate that a 2-cluster solution is the most suitable for this dataset.