knitr::include_graphics("card.png")
Analyzing credit card data can provide valuable insights for many purposes, including fraud detection, customer behavior analysis, risk management, and business strategy optimization. By segmenting customers based on their credit card usage patterns, businesses can offer more personalized services, design effective marketing strategies, and enhance customer satisfaction. In this project my aim is to cluster a credit card dataset into distinct segments of credit card users and then describe each segment: what the segments are and what defines each of them.
I downloaded the CC GENERAL dataset from Kaggle; you can find it here: https://www.kaggle.com/datasets/arjunbhasin2013/ccdata. The data covers about 9,000 active credit card holders over the last six months, described by 18 behavioral variables. Let's first take a look at these variables.
Data Dictionary:
CUST_ID : Identification of Credit Card holder (Categorical)
BALANCE : Balance amount left in their account to make purchases
BALANCE_FREQUENCY : How frequently the Balance is updated, score between 0 and 1 (1 = frequently updated, 0 = not frequently updated)
PURCHASES : Amount of purchases made from account
ONEOFF_PURCHASES : Maximum purchase amount done in one-go
INSTALLMENTS_PURCHASES : Amount of purchase done in installment
CASH_ADVANCE : Cash in advance given by the user
PURCHASES_FREQUENCY : How frequently the Purchases are being made, score between 0 and 1 (1 = frequently purchased, 0 = not frequently purchased)
ONEOFF_PURCHASES_FREQUENCY : How frequently purchases are happening in one-go (1 = frequently purchased, 0 = not frequently purchased)
PURCHASES_INSTALLMENTS_FREQUENCY : How frequently purchases in installments are being done (1 = frequently done, 0 = not frequently done)
CASH_ADVANCE_FREQUENCY : How frequently the cash advance is being paid
CASH_ADVANCE_TRX : Number of transactions made with "Cash in Advance"
PURCHASES_TRX : Number of purchase transactions made
CREDIT_LIMIT : Limit of Credit Card for user
PAYMENTS : Amount of Payment done by user
MINIMUM_PAYMENTS : Minimum amount of payments made by user
PRC_FULL_PAYMENT : Percent of full payment paid by user
TENURE : Tenure of credit card service for user
To prepare the data, I take the following steps:
. finding NA elements and handling them by replacing with the median of each column;
. removing the variable CUST_ID because it is of class character;
. standardizing the data, so that every feature contributes equally to the analysis;
. reducing the dimensionality.
NA handling: First I read the data, and to make the output easier to read I shorten the variable labels. Then the summary() function gives a statistical overview of the data and shows the type of each variable.
ccgeneral <- read.csv("F:/Warsaw university/unsupervised learning/R_code/project clustering/CC GENERAL.csv")
names(ccgeneral) <- c("CUST_ID","Bal", "Bal_F", "Pur", "1f_Pur", "Ins_Pur", "Cash_Ad", "Pur_F", "1f_Pur_F", "Pur_Ins_F",
"Cash_Ad_F", "Cash_Ad_T", "Pur_T", "Cred_Lim", "pay", "Min_pay", "PRC_Ful_pay", "Ten")
summary(ccgeneral)
## CUST_ID Bal Bal_F Pur
## Length:8950 Min. : 0.0 Min. :0.0000 Min. : 0.00
## Class :character 1st Qu.: 128.3 1st Qu.:0.8889 1st Qu.: 39.63
## Mode :character Median : 873.4 Median :1.0000 Median : 361.28
## Mean : 1564.5 Mean :0.8773 Mean : 1003.20
## 3rd Qu.: 2054.1 3rd Qu.:1.0000 3rd Qu.: 1110.13
## Max. :19043.1 Max. :1.0000 Max. :49039.57
##
## 1f_Pur Ins_Pur Cash_Ad Pur_F
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. :0.00000
## 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.:0.08333
## Median : 38.0 Median : 89.0 Median : 0.0 Median :0.50000
## Mean : 592.4 Mean : 411.1 Mean : 978.9 Mean :0.49035
## 3rd Qu.: 577.4 3rd Qu.: 468.6 3rd Qu.: 1113.8 3rd Qu.:0.91667
## Max. :40761.2 Max. :22500.0 Max. :47137.2 Max. :1.00000
##
## 1f_Pur_F Pur_Ins_F Cash_Ad_F Cash_Ad_T
## Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. : 0.000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 0.000
## Median :0.08333 Median :0.1667 Median :0.0000 Median : 0.000
## Mean :0.20246 Mean :0.3644 Mean :0.1351 Mean : 3.249
## 3rd Qu.:0.30000 3rd Qu.:0.7500 3rd Qu.:0.2222 3rd Qu.: 4.000
## Max. :1.00000 Max. :1.0000 Max. :1.5000 Max. :123.000
##
## Pur_T Cred_Lim pay Min_pay
## Min. : 0.00 Min. : 50 Min. : 0.0 Min. : 0.02
## 1st Qu.: 1.00 1st Qu.: 1600 1st Qu.: 383.3 1st Qu.: 169.12
## Median : 7.00 Median : 3000 Median : 856.9 Median : 312.34
## Mean : 14.71 Mean : 4494 Mean : 1733.1 Mean : 864.21
## 3rd Qu.: 17.00 3rd Qu.: 6500 3rd Qu.: 1901.1 3rd Qu.: 825.49
## Max. :358.00 Max. :30000 Max. :50721.5 Max. :76406.21
## NA's :1 NA's :313
## PRC_Ful_pay Ten
## Min. :0.0000 Min. : 6.00
## 1st Qu.:0.0000 1st Qu.:12.00
## Median :0.0000 Median :12.00
## Mean :0.1537 Mean :11.52
## 3rd Qu.:0.1429 3rd Qu.:12.00
## Max. :1.0000 Max. :12.00
##
We have 313 NAs in minimum payments (Min_pay) and one NA in credit limit (Cred_Lim). I replaced them with the median of each column.
ccgeneral$Min_pay <- ifelse(is.na(ccgeneral$Min_pay), median(ccgeneral$Min_pay, na.rm=TRUE), ccgeneral$Min_pay)
ccgeneral$Cred_Lim <- ifelse(is.na(ccgeneral$Cred_Lim), median(ccgeneral$Cred_Lim, na.rm=TRUE), ccgeneral$Cred_Lim)
Removing the character column and standardization: I remove the first column because it is a character variable, and then standardize the data. This is where data normalization comes in: it ensures that each attribute contributes at the same level, preventing one variable from dominating the others.
ccgenerall <- ccgeneral[2:18]
ccgeneral_z <- as.data.frame(lapply(ccgenerall, scale))
The credit card dataset has 17 numeric variables (after dropping CUST_ID), which is too many dimensions to analyze and describe segments with directly, so I decided to reduce the dimensionality with Principal Component Analysis (PCA) to get a better result and describe the segments more efficiently. While dimension reduction can be beneficial, especially with a large number of features, it is essential to ensure that we do not lose significant information in the process. I applied PCA to understand the variance explained by the different components.
Checking, installing, and loading the required packages:
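The package list is not shown in the original chunk; a minimal sketch, assuming the packages implied by the functions used below (corrplot, factoextra for the fviz_* functions and get_eigenvalue, gridExtra for grid.arrange, ClusterR for Optimal_Clusters_KMeans, and dplyr for the grouped summary), might look like this:
# Packages implied by the function calls in this report (assumed list)
packages <- c("corrplot", "factoextra", "gridExtra", "ClusterR", "dplyr")
# Install anything that is missing, then load everything
to_install <- packages[!packages %in% installed.packages()[, "Package"]]
if (length(to_install) > 0) install.packages(to_install)
invisible(lapply(packages, library, character.only = TRUE))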
First I compute the correlation matrix to see the correlations between variables. An important step in preparing data is finding pairs of highly correlated variables and, for each such pair, keeping one and removing the other. It is sometimes wrongly assumed that dimensionality reduction takes care of highly correlated variables by itself. I want to examine this hypothesis: first I apply PCA to my dataset, then I remove the highly correlated variables, apply PCA to the new dataset one more time, and compare the two PCA outputs.
pca1 <- prcomp(ccgeneral_z)
summary(pca1)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.154 1.8583 1.22403 1.1276 1.02869 0.9878 0.91114
## Proportion of Variance 0.273 0.2031 0.08813 0.0748 0.06225 0.0574 0.04883
## Cumulative Proportion 0.273 0.4761 0.56425 0.6390 0.70129 0.7587 0.80752
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.85491 0.80356 0.7236 0.63505 0.54907 0.49268 0.45484
## Proportion of Variance 0.04299 0.03798 0.0308 0.02372 0.01773 0.01428 0.01217
## Cumulative Proportion 0.85052 0.88850 0.9193 0.94302 0.96075 0.97503 0.98720
## PC15 PC16 PC17
## Standard deviation 0.41491 0.21306 0.003413
## Proportion of Variance 0.01013 0.00267 0.000000
## Cumulative Proportion 0.99733 1.00000 1.000000
There are 17 principal components, PC1 through PC17, which corresponds to the number of variables in the normalized data. The summary shows three statistics for each component: standard deviation, proportion of variance, and cumulative proportion. For choosing the number of components we should concentrate on the third one, the cumulative proportion, and select the number of components that covers at least 2/3 of the variance. In my case PC1 covers 27% of the variance, PC2 explains 20%, and components 1 through 7 together cover about 80% of the variance, which is a good share; based on these seven components I would choose the most important variables.
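To make the coverage rule concrete, the cumulative proportion can also be read off programmatically from the prcomp object; a small sketch:
# Cumulative proportion of variance explained (same numbers as the summary above)
cum_var <- cumsum(pca1$sdev^2) / sum(pca1$sdev^2)
which(cum_var >= 2/3)[1]   # smallest number of components covering at least 2/3 (5 here)
which(cum_var >= 0.80)[1]  # an 0.80 coverage target reproduces the 7 components kept here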
cor.matrix <- cor(ccgeneral_z, method = c("pearson"))
corrplot(cor.matrix, type = "lower", order = "alphabet", tl.cex = 0.6)
correlation_matrix <- cor(ccgeneral_z)
# Get the indices of variables with correlation greater than 0.6
high_correlation_indices <- which(correlation_matrix > 0.6 & correlation_matrix < 1, arr.ind = TRUE)
# Extract variable names
high_correlation_pairs <- rownames(correlation_matrix)[high_correlation_indices[, 1]]
high_correlation_pairs <- cbind(high_correlation_pairs, colnames(correlation_matrix)[high_correlation_indices[, 2]])
# Print the pairs of variables with correlation greater than 0.6
print(high_correlation_pairs)
## high_correlation_pairs
## [1,] "X1f_Pur" "Pur"
## [2,] "Ins_Pur" "Pur"
## [3,] "Pur_T" "Pur"
## [4,] "pay" "Pur"
## [5,] "Pur" "X1f_Pur"
## [6,] "Pur" "Ins_Pur"
## [7,] "Pur_T" "Ins_Pur"
## [8,] "Cash_Ad_F" "Cash_Ad"
## [9,] "Cash_Ad_T" "Cash_Ad"
## [10,] "Pur_Ins_F" "Pur_F"
## [11,] "Pur_F" "Pur_Ins_F"
## [12,] "Cash_Ad" "Cash_Ad_F"
## [13,] "Cash_Ad_T" "Cash_Ad_F"
## [14,] "Cash_Ad" "Cash_Ad_T"
## [15,] "Cash_Ad_F" "Cash_Ad_T"
## [16,] "Pur" "Pur_T"
## [17,] "Ins_Pur" "Pur_T"
## [18,] "Pur" "pay"
We can see that some variables have a strong correlation (more than 0.6). To make these correlations easier to see, I draw the chart below. Based on it, I decided to keep "Pur" and remove the highly correlated variables {"pay", "X1f_Pur", "Ins_Pur", "Pur_T"}; to keep "Pur_F" and remove "Pur_Ins_F"; and finally to keep "Cash_Ad" and remove "Cash_Ad_F" and "Cash_Ad_T".
knitr::include_graphics("Pca.png")
ccgeneral_zn <- ccgeneral_z[c(1,2,3,6,7,8,13,15,16,17)]
summary(ccgeneral_zn)
## Bal Bal_F Pur Cash_Ad
## Min. :-0.7516 Min. :-3.70306 Min. :-0.46953 Min. :-0.46676
## 1st Qu.:-0.6900 1st Qu.: 0.04904 1st Qu.:-0.45098 1st Qu.:-0.46676
## Median :-0.3320 Median : 0.51806 Median :-0.30044 Median :-0.46676
## Mean : 0.0000 Mean : 0.00000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.2352 3rd Qu.: 0.51806 3rd Qu.: 0.05004 3rd Qu.: 0.06435
## Max. : 8.3970 Max. : 0.51806 Max. :22.48225 Max. :22.00989
## Pur_F X1f_Pur_F Cred_Lim Min_pay
## Min. :-1.22169 Min. :-0.6786 Min. :-1.2214 Min. :-0.36218
## 1st Qu.:-1.01407 1st Qu.:-0.6786 1st Qu.:-0.7954 1st Qu.:-0.28895
## Median : 0.02404 Median :-0.3993 Median :-0.4107 Median :-0.22829
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 1.06215 3rd Qu.: 0.3270 3rd Qu.: 0.5512 3rd Qu.:-0.02409
## Max. : 1.26977 Max. : 2.6733 Max. : 7.0097 Max. :32.39092
## PRC_Ful_pay Ten
## Min. :-0.52552 Min. :-4.1225
## 1st Qu.:-0.52552 1st Qu.: 0.3607
## Median :-0.52552 Median : 0.3607
## Mean : 0.00000 Mean : 0.0000
## 3rd Qu.:-0.03712 3rd Qu.: 0.3607
## Max. : 2.89329 Max. : 0.3607
So by removing highly correlated variables the number of variables decreases from 17 to 10. Now I apply PCA to the new dataset.
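For readability, the same column subset can be selected by name instead of by position; a small sketch equivalent to the index-based selection above (given the renamed and scaled columns):
# Equivalent to ccgeneral_z[c(1,2,3,6,7,8,13,15,16,17)], but selecting by name
keep <- c("Bal", "Bal_F", "Pur", "Cash_Ad", "Pur_F", "X1f_Pur_F",
          "Cred_Lim", "Min_pay", "PRC_Ful_pay", "Ten")
ccgeneral_zn <- ccgeneral_z[, keep]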
Applying PCA
pca1 <- prcomp(ccgeneral_zn)
summary(pca1)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.5689 1.4609 1.0481 0.9803 0.93466 0.87910 0.72294
## Proportion of Variance 0.2462 0.2134 0.1099 0.0961 0.08736 0.07728 0.05226
## Cumulative Proportion 0.2462 0.4596 0.5694 0.6655 0.75290 0.83018 0.88244
## PC8 PC9 PC10
## Standard deviation 0.70079 0.65073 0.5109
## Proportion of Variance 0.04911 0.04235 0.0261
## Cumulative Proportion 0.93155 0.97390 1.0000
eig_val <- get_eigenvalue(pca1)
eig_val
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 2.4615528 24.615528 24.61553
## Dim.2 2.1342288 21.342288 45.95782
## Dim.3 1.0986103 10.986103 56.94392
## Dim.4 0.9609886 9.609886 66.55381
## Dim.5 0.8735979 8.735979 75.28978
## Dim.6 0.7728242 7.728242 83.01803
## Dim.7 0.5226356 5.226356 88.24438
## Dim.8 0.4911087 4.911087 93.15547
## Dim.9 0.4234513 4.234513 97.38998
## Dim.10 0.2610018 2.610018 100.00000
Looking at the PCA results, the artificial variables PC1 through PC6 now cover 83% of the variance, which is better than before (80% with 7 components). By removing highly correlated variables beforehand, we improved the coverage from 80% to 83%, which suggests it is wise to drop those correlated variables before running PCA. Let's look at the components and discuss what each of them represents.
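As a quick visual check of the same numbers, a scree plot from factoextra (the package already used below for the contribution plots) shows the variance explained per component; a small sketch:
# Scree plot of the variance explained by each principal component
fviz_eig(pca1, addlabels = TRUE)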
Analysis of the components:
pc1=fviz_contrib(pca1, choice = "var", axes = 1)
pc2=fviz_contrib(pca1, choice = "var", axes = 2)
pc3=fviz_contrib(pca1, choice = "var", axes = 3)
pc4=fviz_contrib(pca1, choice = "var", axes = 4)
pc5=fviz_contrib(pca1, choice = "var", axes = 5)
pc6=fviz_contrib(pca1, choice = "var", axes = 6)
grid.arrange(pc1, pc2, pc3, ncol=2)
grid.arrange(pc4, pc5, pc6, ncol=2)
This result shows, for every component, which variables are most important and deserve our focus. For component 1 the variables {"Cred_Lim", "Pur", "Bal", "X1f_Pur_F"} matter most; for component 2, {"Pur_F", "PRC_Ful_pay", "X1f_Pur_F", "Pur", "Ten"}; for component 3, {"Cred_Lim", "Cash_Ad", "PRC_Ful_pay", "Pur", "Bal"}; for component 4, {"Ten", "Cred_Lim"}; for component 5, {"Min_pay", "PRC_Ful_pay"}; for component 6, {"PRC_Ful_pay", "Bal_F"}.
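To cross-check these contribution charts, the loadings can also be read directly from the prcomp object; a small sketch (the sign shows direction, the magnitude shows weight):
# Loadings (rotation matrix) of the first six components, rounded for readability
round(pca1$rotation[, 1:6], 2)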
Clustering is a technique for understanding how data points group together. With the k-means algorithm we split the data into a chosen number of clusters, where each data point belongs to exactly one cluster. To decide how many clusters to use, we can apply the elbow method: we plot a clustering criterion against the number of clusters and look for the "elbow" where adding more clusters stops giving a clear improvement.
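Before running ClusterR's Optimal_Clusters_KMeans() below, here is a minimal hand-rolled sketch of the idea: fit k-means for a range of k and plot the total within-cluster sum of squares (note that Optimal_Clusters_KMeans() may plot a different criterion, such as variance explained, by default):
# Elbow curve by hand: total within-cluster sum of squares for k = 1..10
set.seed(123)  # arbitrary seed, only for reproducibility of this sketch
wss <- sapply(1:10, function(k) kmeans(ccgeneral_zn, centers = k, nstart = 10)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "Number of clusters k", ylab = "Total within-cluster SS")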
kkm<-Optimal_Clusters_KMeans(ccgeneral_zn,max_cluster=10,plot_cluster=TRUE)
kkm<-3
I decided to set the number of clusters to 3, because that is where the plot bends most sharply (the elbow) and the curve flattens afterwards, so I will continue with 3 clusters.
ccgeneral_cluster_km3<-kmeans(ccgeneral_zn,3)
fviz_cluster(list(data=ccgeneral_zn, cluster=ccgeneral_cluster_km3$cluster),
ellipse.type="norm", geom="point", stand=FALSE, palette="jco", ggtheme=theme_classic())
ccgeneral_zn$cluster_lable <- ccgeneral_cluster_km3$cluster
x <- split(ccgeneral_zn, ccgeneral_zn$cluster_lable)
class1 <- as.data.frame(x[1])
summary(class1)
## X1.Bal X1.Bal_F X1.Pur X1.Cash_Ad
## Min. :-0.7309 Min. :-2.9356 Min. :-0.46953 Min. :-0.4668
## 1st Qu.: 0.9939 1st Qu.: 0.5181 1st Qu.:-0.46953 1st Qu.: 0.4054
## Median : 1.6805 Median : 0.5181 Median :-0.39464 Median : 1.2038
## Mean : 1.8616 Mean : 0.4141 Mean :-0.09116 Mean : 1.5786
## 3rd Qu.: 2.4893 3rd Qu.: 0.5181 3rd Qu.:-0.05589 3rd Qu.: 2.2483
## Max. : 8.3970 Max. : 0.5181 Max. : 9.87468 Max. :22.0099
## X1.Pur_F X1.X1f_Pur_F X1.Cred_Lim X1.Min_pay
## Min. :-1.2217 Min. :-0.6786 Min. :-0.9603 Min. :-0.3549
## 1st Qu.:-1.2217 1st Qu.:-0.6786 1st Qu.: 0.4138 1st Qu.: 0.1126
## Median :-0.8064 Median :-0.6786 Median : 0.9635 Median : 0.3452
## Mean :-0.4116 Mean :-0.2570 Mean : 1.1472 Mean : 0.9545
## 3rd Qu.: 0.4393 3rd Qu.:-0.1200 3rd Qu.: 1.6505 3rd Qu.: 0.7494
## Max. : 1.2698 Max. : 2.6733 Max. : 7.0097 Max. :32.3909
## X1.PRC_Ful_pay X1.Ten X1.cluster_lable
## Min. :-0.5255 Min. :-4.12254 Min. :1
## 1st Qu.:-0.5255 1st Qu.: 0.36066 1st Qu.:1
## Median :-0.5255 Median : 0.36066 Median :1
## Mean :-0.4585 Mean : 0.07787 Mean :1
## 3rd Qu.:-0.5255 3rd Qu.: 0.36066 3rd Qu.:1
## Max. : 2.2095 Max. : 0.36066 Max. :1
Analysing the behavior of each segment:
Calculating and displaying the median values of the variables within each cluster helps us identify the characteristic features of each cluster and understand the differences between them.
# Calculate median for each variable within each cluster group
median_df <- ccgeneral_zn %>%
group_by(ccgeneral_zn$cluster_lable) %>%
summarise_at(vars(names(ccgeneral_zn)), median)
# Print the resulting dataframe
print(t(median_df))
## [,1] [,2] [,3]
## ccgeneral_zn$cluster_lable 1.0000000 2.0000000 3.0000000
## Bal 1.6805092 -0.3479796 -0.4964612
## Bal_F 0.5180549 0.5180549 0.5180549
## Pur -0.3946415 0.5584366 -0.3656239
## Cash_Ad 1.2038219 -0.4667595 -0.4667595
## Pur_F -0.8064453 1.2697723 -0.3912033
## X1f_Pur_F -0.6786229 1.7591414 -0.6786229
## Cred_Lim 0.9634674 0.4138125 -0.5480836
## Min_pay 0.3451772 -0.2478394 -0.2507422
## PRC_Ful_pay -0.5255216 -0.2406217 -0.5255216
## Ten 0.3606594 0.3606594 0.3606594
## cluster_lable 1.0000000 2.0000000 3.0000000
Cluster 1: This customer group has a high Bal (balance) and high Cash_Ad (cash advance) with low Pur (purchases) and low PRC_Ful_pay (percentage of full payments). We can assume that this customer group uses their credit cards as a loan.
Cluster 2: This customer group is on the opposite side from the previous one. They have a high Pur_F (purchases frequency) and high X1f_Pur_F (one-off purchases frequency) with a low balance and low Cash_Ad (cash advance). We can assume that this customer group uses their credit cards for household purchases.
Cluster 3: This customer group has a low balance, low purchases, low cash advance, and relatively low values on all factors. We can assume that this group of customers does not use their cards frequently.
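Since the medians above are on the standardized scale, it can help interpretation to recompute them on the original (unscaled) variables; a small sketch using the cluster labels from the k-means fit:
# Medians of the original (unscaled) variables per cluster, for easier interpretation
cluster_medians <- aggregate(ccgenerall, by = list(cluster = ccgeneral_cluster_km3$cluster), FUN = median)
t(cluster_medians)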
Clustering techniques are essential for grouping similar data points together based on their features or characteristics. By identifying inherent patterns and structures within the data, clustering helps in understanding the natural grouping of data points without the need for labeled training data.