knitr::include_graphics("card.png")
Analyzing credit card data can provide valuable insights for many purposes, including fraud detection, customer behavior analysis, risk management, and business strategy optimization. By segmenting customers based on their credit card usage patterns, businesses can offer more personalized services, design effective marketing strategies, and enhance customer satisfaction. In this project my aim is to cluster a credit card dataset into distinct segments of credit card users and then describe each segment: what the segments are and what defines each of them.
I downloaded the CC GENERAL dataset from Kaggle; you can find it here: https://www.kaggle.com/datasets/arjunbhasin2013/ccdata. The data covers about 9,000 active credit card holders over the last six months, described by 18 behavioral variables. Let's first take a look at these variables.
Data Dictionary:
CUST_ID : Identification of Credit Card holder (Categorical)
BALANCE : Balance amount left in their account to make purchases
BALANCE_FREQUENCY : How frequently the Balance is updated, score between 0 and 1 (1 = frequently updated, 0 = not frequently updated)
PURCHASES : Amount of purchases made from account
ONEOFF_PURCHASES : Maximum purchase amount done in one-go
INSTALLMENTS_PURCHASES : Amount of purchase done in installment
CASH_ADVANCE : Cash in advance given by the user
PURCHASES_FREQUENCY : How frequently the Purchases are being made, score between 0 and 1 (1 = frequently purchased, 0 = not frequently purchased)
ONEOFF_PURCHASES_FREQUENCY : How frequently purchases are happening in one-go (1 = frequently purchased, 0 = not frequently purchased)
PURCHASES_INSTALLMENTS_FREQUENCY : How frequently purchases in installments are being done (1 = frequently done, 0 = not frequently done)
CASH_ADVANCE_FREQUENCY : How frequently the cash advance is being paid
CASH_ADVANCE_TRX : Number of transactions made with "Cash in Advance"
PURCHASES_TRX : Number of purchase transactions made
CREDIT_LIMIT : Limit of Credit Card for user
PAYMENTS : Amount of Payment done by user
MINIMUM_PAYMENTS : Minimum amount of payments made by user
PRC_FULL_PAYMENT : Percent of full payment paid by user
TENURE : Tenure of credit card service for user
To prepare the data, I take the following steps:
. finding NA elements and handling them by replacing with the median of each column;
. removing the variable CUST_ID because it is of class character;
. standardizing the data, so that every feature contributes equally to the analysis;
. reducing the dimensionality.
NA handling: First I read the data, and to make the output easier to read I shorten the variable labels. Then the summary() function gives a statistical overview of the data and shows the type of each variable.
ccgeneral <- read.csv("F:/Warsaw university/unsupervised learning/R_code/project clustering/CC GENERAL.csv")
names(ccgeneral) <- c("CUST_ID","Bal", "Bal_F", "Pur", "1f_Pur", "Ins_Pur", "Cash_Ad", "Pur_F", "1f_Pur_F", "Pur_Ins_F",
"Cash_Ad_F", "Cash_Ad_T", "Pur_T", "Cred_Lim", "pay", "Min_pay", "PRC_Ful_pay", "Ten")
summary(ccgeneral)
## CUST_ID Bal Bal_F Pur
## Length:8950 Min. : 0.0 Min. :0.0000 Min. : 0.00
## Class :character 1st Qu.: 128.3 1st Qu.:0.8889 1st Qu.: 39.63
## Mode :character Median : 873.4 Median :1.0000 Median : 361.28
## Mean : 1564.5 Mean :0.8773 Mean : 1003.20
## 3rd Qu.: 2054.1 3rd Qu.:1.0000 3rd Qu.: 1110.13
## Max. :19043.1 Max. :1.0000 Max. :49039.57
##
## 1f_Pur Ins_Pur Cash_Ad Pur_F
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. :0.00000
## 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.:0.08333
## Median : 38.0 Median : 89.0 Median : 0.0 Median :0.50000
## Mean : 592.4 Mean : 411.1 Mean : 978.9 Mean :0.49035
## 3rd Qu.: 577.4 3rd Qu.: 468.6 3rd Qu.: 1113.8 3rd Qu.:0.91667
## Max. :40761.2 Max. :22500.0 Max. :47137.2 Max. :1.00000
##
## 1f_Pur_F Pur_Ins_F Cash_Ad_F Cash_Ad_T
## Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. : 0.000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 0.000
## Median :0.08333 Median :0.1667 Median :0.0000 Median : 0.000
## Mean :0.20246 Mean :0.3644 Mean :0.1351 Mean : 3.249
## 3rd Qu.:0.30000 3rd Qu.:0.7500 3rd Qu.:0.2222 3rd Qu.: 4.000
## Max. :1.00000 Max. :1.0000 Max. :1.5000 Max. :123.000
##
## Pur_T Cred_Lim pay Min_pay
## Min. : 0.00 Min. : 50 Min. : 0.0 Min. : 0.02
## 1st Qu.: 1.00 1st Qu.: 1600 1st Qu.: 383.3 1st Qu.: 169.12
## Median : 7.00 Median : 3000 Median : 856.9 Median : 312.34
## Mean : 14.71 Mean : 4494 Mean : 1733.1 Mean : 864.21
## 3rd Qu.: 17.00 3rd Qu.: 6500 3rd Qu.: 1901.1 3rd Qu.: 825.49
## Max. :358.00 Max. :30000 Max. :50721.5 Max. :76406.21
## NA's :1 NA's :313
## PRC_Ful_pay Ten
## Min. :0.0000 Min. : 6.00
## 1st Qu.:0.0000 1st Qu.:12.00
## Median :0.0000 Median :12.00
## Mean :0.1537 Mean :11.52
## 3rd Qu.:0.1429 3rd Qu.:12.00
## Max. :1.0000 Max. :12.00
##
We have 313 NAs in minimum payments (Min_pay) and one NA in credit limit (Cred_Lim). I replaced them with the median of each column.
ccgeneral$Min_pay <- ifelse(is.na(ccgeneral$Min_pay), median(ccgeneral$Min_pay, na.rm=TRUE), ccgeneral$Min_pay)
ccgeneral$Cred_Lim <- ifelse(is.na(ccgeneral$Cred_Lim), median(ccgeneral$Cred_Lim, na.rm=TRUE), ccgeneral$Cred_Lim)
Removing the character column and standardization: I remove the first column because it is a character variable, and then standardize the data. This is where data normalization comes in: it ensures that each attribute contributes at the same level, preventing one variable from dominating the others.
ccgenerall <- ccgeneral[2:18]
ccgeneral_z <- as.data.frame(lapply(ccgenerall, scale))
The credit card dataset has 17 numeric variables (after dropping CUST_ID), which is too many dimensions to analyze and describe segments with directly, so I decided to reduce the dimensionality with Principal Component Analysis (PCA) to get a better result and describe the segments more efficiently. While dimension reduction can be beneficial, especially with a large number of features, it is essential to ensure that we do not lose significant information in the process. I applied PCA to understand the variance explained by the different components.
Checking, installing, and loading the required packages:
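The package list is not shown in the original chunk; a minimal sketch, assuming the packages implied by the functions used below (corrplot, factoextra for the fviz_* functions and get_eigenvalue, gridExtra for grid.arrange, ClusterR for Optimal_Clusters_KMeans, and dplyr for the grouped summary), might look like this:
# Packages implied by the function calls in this report (assumed list)
packages <- c("corrplot", "factoextra", "gridExtra", "ClusterR", "dplyr")
# Install anything that is missing, then load everything
to_install <- packages[!packages %in% installed.packages()[, "Package"]]
if (length(to_install) > 0) install.packages(to_install)
invisible(lapply(packages, library, character.only = TRUE))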
First I compute the correlation matrix to see the correlations between variables. An important step in preparing data is finding pairs of highly correlated variables and, for each such pair, keeping one and removing the other. It is sometimes wrongly assumed that dimensionality reduction takes care of highly correlated variables by itself. I want to examine this hypothesis: first I apply PCA to my dataset, then I remove the highly correlated variables, apply PCA to the new dataset one more time, and compare the two PCA outputs.
pca1 <- prcomp(ccgeneral_z)
summary(pca1)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.154 1.8583 1.22403 1.1276 1.02869 0.9878 0.91114
## Proportion of Variance 0.273 0.2031 0.08813 0.0748 0.06225 0.0574 0.04883
## Cumulative Proportion 0.273 0.4761 0.56425 0.6390 0.70129 0.7587 0.80752
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.85491 0.80356 0.7236 0.63505 0.54907 0.49268 0.45484
## Proportion of Variance 0.04299 0.03798 0.0308 0.02372 0.01773 0.01428 0.01217
## Cumulative Proportion 0.85052 0.88850 0.9193 0.94302 0.96075 0.97503 0.98720
## PC15 PC16 PC17
## Standard deviation 0.41491 0.21306 0.003413
## Proportion of Variance 0.01013 0.00267 0.000000
## Cumulative Proportion 0.99733 1.00000 1.000000
There are 17 principal components, PC1 through PC17, which corresponds to the number of variables in the normalized data. The summary shows three statistics for each component: standard deviation, proportion of variance, and cumulative proportion. For choosing the number of components we should concentrate on the third one, the cumulative proportion, and select the number of components that covers at least 2/3 of the variance. In my case PC1 covers 27% of the variance, PC2 explains 20%, and components 1 through 7 together cover about 80% of the variance, which is a good share; based on these seven components I would choose the most important variables.
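To make the coverage rule concrete, the cumulative proportion can also be read off programmatically from the prcomp object; a small sketch:
# Cumulative proportion of variance explained (same numbers as the summary above)
cum_var <- cumsum(pca1$sdev^2) / sum(pca1$sdev^2)
which(cum_var >= 2/3)[1]   # smallest number of components covering at least 2/3 (5 here)
which(cum_var >= 0.80)[1]  # an 0.80 coverage target reproduces the 7 components kept here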
cor.matrix <- cor(ccgeneral_z, method = c("pearson"))
corrplot(cor.matrix, type = "lower", order = "alphabet", tl.cex = 0.6)
correlation_matrix <- cor(ccgeneral_z)
# Get the indices of variables with correlation greater than 0.6
high_correlation_indices <- which(correlation_matrix > 0.6 & correlation_matrix < 1, arr.ind = TRUE)
# Extract variable names
high_correlation_pairs <- rownames(correlation_matrix)[high_correlation_indices[, 1]]
high_correlation_pairs <- cbind(high_correlation_pairs, colnames(correlation_matrix)[high_correlation_indices[, 2]])
# Print the pairs of variables with correlation greater than 0.6
print(high_correlation_pairs)
## high_correlation_pairs
## [1,] "X1f_Pur" "Pur"
## [2,] "Ins_Pur" "Pur"
## [3,] "Pur_T" "Pur"
## [4,] "pay" "Pur"
## [5,] "Pur" "X1f_Pur"
## [6,] "Pur" "Ins_Pur"
## [7,] "Pur_T" "Ins_Pur"
## [8,] "Cash_Ad_F" "Cash_Ad"
## [9,] "Cash_Ad_T" "Cash_Ad"
## [10,] "Pur_Ins_F" "Pur_F"
## [11,] "Pur_F" "Pur_Ins_F"
## [12,] "Cash_Ad" "Cash_Ad_F"
## [13,] "Cash_Ad_T" "Cash_Ad_F"
## [14,] "Cash_Ad" "Cash_Ad_T"
## [15,] "Cash_Ad_F" "Cash_Ad_T"
## [16,] "Pur" "Pur_T"
## [17,] "Ins_Pur" "Pur_T"
## [18,] "Pur" "pay"
We can see that some variables have a strong correlation (more than 0.6). To make these correlations easier to see, I draw the chart below. Based on it, I decided to keep "Pur" and remove the highly correlated variables {"pay", "X1f_Pur", "Ins_Pur", "Pur_T"}; to keep "Pur_F" and remove "Pur_Ins_F"; and finally to keep "Cash_Ad" and remove "Cash_Ad_F" and "Cash_Ad_T".
knitr::include_graphics("Pca.png")
ccgeneral_zn <- ccgeneral_z[c(1,2,3,6,7,8,13,15,16,17)]
summary(ccgeneral_zn)
## Bal Bal_F Pur Cash_Ad
## Min. :-0.7516 Min. :-3.70306 Min. :-0.46953 Min. :-0.46676
## 1st Qu.:-0.6900 1st Qu.: 0.04904 1st Qu.:-0.45098 1st Qu.:-0.46676
## Median :-0.3320 Median : 0.51806 Median :-0.30044 Median :-0.46676
## Mean : 0.0000 Mean : 0.00000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.2352 3rd Qu.: 0.51806 3rd Qu.: 0.05004 3rd Qu.: 0.06435
## Max. : 8.3970 Max. : 0.51806 Max. :22.48225 Max. :22.00989
## Pur_F X1f_Pur_F Cred_Lim Min_pay
## Min. :-1.22169 Min. :-0.6786 Min. :-1.2214 Min. :-0.36218
## 1st Qu.:-1.01407 1st Qu.:-0.6786 1st Qu.:-0.7954 1st Qu.:-0.28895
## Median : 0.02404 Median :-0.3993 Median :-0.4107 Median :-0.22829
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 1.06215 3rd Qu.: 0.3270 3rd Qu.: 0.5512 3rd Qu.:-0.02409
## Max. : 1.26977 Max. : 2.6733 Max. : 7.0097 Max. :32.39092
## PRC_Ful_pay Ten
## Min. :-0.52552 Min. :-4.1225
## 1st Qu.:-0.52552 1st Qu.: 0.3607
## Median :-0.52552 Median : 0.3607
## Mean : 0.00000 Mean : 0.0000
## 3rd Qu.:-0.03712 3rd Qu.: 0.3607
## Max. : 2.89329 Max. : 0.3607
So by removing highly correlated variables the number of variables decreases from 17 to 10. Now I apply PCA to the new dataset.
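For readability, the same column subset can be selected by name instead of by position; a small sketch equivalent to the index-based selection above (given the renamed and scaled columns):
# Equivalent to ccgeneral_z[c(1,2,3,6,7,8,13,15,16,17)], but selecting by name
keep <- c("Bal", "Bal_F", "Pur", "Cash_Ad", "Pur_F", "X1f_Pur_F",
          "Cred_Lim", "Min_pay", "PRC_Ful_pay", "Ten")
ccgeneral_zn <- ccgeneral_z[, keep]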
Applying PCA
pca1 <- prcomp(ccgeneral_zn)
summary(pca1)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.5689 1.4609 1.0481 0.9803 0.93466 0.87910 0.72294
## Proportion of Variance 0.2462 0.2134 0.1099 0.0961 0.08736 0.07728 0.05226
## Cumulative Proportion 0.2462 0.4596 0.5694 0.6655 0.75290 0.83018 0.88244
## PC8 PC9 PC10
## Standard deviation 0.70079 0.65073 0.5109
## Proportion of Variance 0.04911 0.04235 0.0261
## Cumulative Proportion 0.93155 0.97390 1.0000
eig_val <- get_eigenvalue(pca1)
eig_val
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 2.4615528 24.615528 24.61553
## Dim.2 2.1342288 21.342288 45.95782
## Dim.3 1.0986103 10.986103 56.94392
## Dim.4 0.9609886 9.609886 66.55381
## Dim.5 0.8735979 8.735979 75.28978
## Dim.6 0.7728242 7.728242 83.01803
## Dim.7 0.5226356 5.226356 88.24438
## Dim.8 0.4911087 4.911087 93.15547
## Dim.9 0.4234513 4.234513 97.38998
## Dim.10 0.2610018 2.610018 100.00000
Looking at the PCA results, the artificial variables PC1 through PC6 now cover 83% of the variance, which is better than before (80% with 7 components). By removing highly correlated variables beforehand, we improved the coverage from 80% to 83%, which suggests it is wise to drop those correlated variables before running PCA. Let's look at the components and discuss what each of them represents.
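As a quick visual check of the same numbers, a scree plot from factoextra (the package already used below for the contribution plots) shows the variance explained per component; a small sketch:
# Scree plot of the variance explained by each principal component
fviz_eig(pca1, addlabels = TRUE)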
Analysis of the components:
pc1=fviz_contrib(pca1, choice = "var", axes = 1)
pc2=fviz_contrib(pca1, choice = "var", axes = 2)
pc3=fviz_contrib(pca1, choice = "var", axes = 3)
pc4=fviz_contrib(pca1, choice = "var", axes = 4)
pc5=fviz_contrib(pca1, choice = "var", axes = 5)
pc6=fviz_contrib(pca1, choice = "var", axes = 6)
grid.arrange(pc1, pc2, pc3, ncol=2)
grid.arrange(pc4, pc5, pc6, ncol=2)
This result shows, for every component, which variables are most important and deserve our focus. For component 1 the variables {"Cred_Lim", "Pur", "Bal", "X1f_Pur_F"} matter most; for component 2, {"Pur_F", "PRC_Ful_pay", "X1f_Pur_F", "Pur", "Ten"}; for component 3, {"Cred_Lim", "Cash_Ad", "PRC_Ful_pay", "Pur", "Bal"}; for component 4, {"Ten", "Cred_Lim"}; for component 5, {"Min_pay", "PRC_Ful_pay"}; for component 6, {"PRC_Ful_pay", "Bal_F"}.
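To cross-check these contribution charts, the loadings can also be read directly from the prcomp object; a small sketch (the sign shows direction, the magnitude shows weight):
# Loadings (rotation matrix) of the first six components, rounded for readability
round(pca1$rotation[, 1:6], 2)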
Clustering is a technique for understanding how data points group together. With the k-means algorithm we split the data into a chosen number of clusters, where each data point belongs to exactly one cluster. To decide how many clusters to use, we can apply the elbow method: we plot a clustering criterion against the number of clusters and look for the "elbow" where adding more clusters stops giving a clear improvement.
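Before running ClusterR's Optimal_Clusters_KMeans() below, here is a minimal hand-rolled sketch of the idea: fit k-means for a range of k and plot the total within-cluster sum of squares (note that Optimal_Clusters_KMeans() may plot a different criterion, such as variance explained, by default):
# Elbow curve by hand: total within-cluster sum of squares for k = 1..10
set.seed(123)  # arbitrary seed, only for reproducibility of this sketch
wss <- sapply(1:10, function(k) kmeans(ccgeneral_zn, centers = k, nstart = 10)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "Number of clusters k", ylab = "Total within-cluster SS")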
kkm<-Optimal_Clusters_KMeans(ccgeneral_zn,max_cluster=10,plot_cluster=TRUE)
kkm<-3
I decided to set the number of clusters to 3, because that is where the plot bends most sharply (the elbow) and the curve flattens afterwards, so I will continue with 3 clusters.
ccgeneral_cluster_km3<-kmeans(ccgeneral_zn,3)
fviz_cluster(list(data=ccgeneral_zn, cluster=ccgeneral_cluster_km3$cluster),
ellipse.type="norm", geom="point", stand=FALSE, palette="jco", ggtheme=theme_classic())
ccgeneral_zn$cluster_lable <- ccgeneral_cluster_km3$cluster
x <- split(ccgeneral_zn, ccgeneral_zn$cluster_lable)
class1 <- as.data.frame(x[1])
summary(class1)
## X1.Bal X1.Bal_F X1.Pur X1.Cash_Ad
## Min. :-0.7309 Min. :-2.9356 Min. :-0.46953 Min. :-0.4668
## 1st Qu.: 0.9939 1st Qu.: 0.5181 1st Qu.:-0.46953 1st Qu.: 0.4054
## Median : 1.6805 Median : 0.5181 Median :-0.39464 Median : 1.2038
## Mean : 1.8616 Mean : 0.4141 Mean :-0.09116 Mean : 1.5786
## 3rd Qu.: 2.4893 3rd Qu.: 0.5181 3rd Qu.:-0.05589 3rd Qu.: 2.2483
## Max. : 8.3970 Max. : 0.5181 Max. : 9.87468 Max. :22.0099
## X1.Pur_F X1.X1f_Pur_F X1.Cred_Lim X1.Min_pay
## Min. :-1.2217 Min. :-0.6786 Min. :-0.9603 Min. :-0.3549
## 1st Qu.:-1.2217 1st Qu.:-0.6786 1st Qu.: 0.4138 1st Qu.: 0.1126
## Median :-0.8064 Median :-0.6786 Median : 0.9635 Median : 0.3452
## Mean :-0.4116 Mean :-0.2570 Mean : 1.1472 Mean : 0.9545
## 3rd Qu.: 0.4393 3rd Qu.:-0.1200 3rd Qu.: 1.6505 3rd Qu.: 0.7494
## Max. : 1.2698 Max. : 2.6733 Max. : 7.0097 Max. :32.3909
## X1.PRC_Ful_pay X1.Ten X1.cluster_lable
## Min. :-0.5255 Min. :-4.12254 Min. :1
## 1st Qu.:-0.5255 1st Qu.: 0.36066 1st Qu.:1
## Median :-0.5255 Median : 0.36066 Median :1
## Mean :-0.4585 Mean : 0.07787 Mean :1
## 3rd Qu.:-0.5255 3rd Qu.: 0.36066 3rd Qu.:1
## Max. : 2.2095 Max. : 0.36066 Max. :1
Analysing the behavior of each segment:
Calculating and displaying the median values of the variables within each cluster helps us identify the characteristic features of each cluster and understand the differences between them.
# Calculate median for each variable within each cluster group
median_df <- ccgeneral_zn %>%
group_by(ccgeneral_zn$cluster_lable) %>%
summarise_at(vars(names(ccgeneral_zn)), median)
# Print the resulting dataframe
print(t(median_df))
## [,1] [,2] [,3]
## ccgeneral_zn$cluster_lable 1.0000000 2.0000000 3.0000000
## Bal 1.6805092 -0.3479796 -0.4964612
## Bal_F 0.5180549 0.5180549 0.5180549
## Pur -0.3946415 0.5584366 -0.3656239
## Cash_Ad 1.2038219 -0.4667595 -0.4667595
## Pur_F -0.8064453 1.2697723 -0.3912033
## X1f_Pur_F -0.6786229 1.7591414 -0.6786229
## Cred_Lim 0.9634674 0.4138125 -0.5480836
## Min_pay 0.3451772 -0.2478394 -0.2507422
## PRC_Ful_pay -0.5255216 -0.2406217 -0.5255216
## Ten 0.3606594 0.3606594 0.3606594
## cluster_lable 1.0000000 2.0000000 3.0000000
Cluster 1: This customer group has a high Bal (balance) and high Cash_Ad (cash advance) with low Pur (purchases) and low PRC_Ful_pay (percentage of full payments). We can assume that this customer group uses their credit cards as a loan.
Cluster 2: This customer group is on the opposite side from the previous one. They have a high Pur_F (purchases frequency) and high X1f_Pur_F (one-off purchases frequency) with a low balance and low Cash_Ad (cash advance). We can assume that this customer group uses their credit cards for household purchases.
Cluster 3: This customer group has a low balance, low purchases, low cash advance, and relatively low values on all factors. We can assume that this group of customers does not use their cards frequently.
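Since the medians above are on the standardized scale, it can help interpretation to recompute them on the original (unscaled) variables; a small sketch using the cluster labels from the k-means fit:
# Medians of the original (unscaled) variables per cluster, for easier interpretation
cluster_medians <- aggregate(ccgenerall, by = list(cluster = ccgeneral_cluster_km3$cluster), FUN = median)
t(cluster_medians)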
Clustering techniques are essential for grouping similar data points together based on their features or characteristics. By identifying inherent patterns and structures within the data, clustering helps in understanding the natural grouping of data points without the need for labeled training data.