categorize the dataset based on distinct characteristics or patterns. These segments can provide valuable insights for formulating targeted marketing strategies. Since each segment exhibits different purchase capacities, understanding these variations allows for tailored approaches to maximize effectiveness and meet the unique needs of each customer group.
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(stats)
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
library(e1071)
library(caret)
## Loading required package: lattice
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(cluster)
data <- read.csv("/Users/zihualai/Desktop/CC GENERAL.csv")
head(data)
## CUST_ID BALANCE BALANCE_FREQUENCY PURCHASES ONEOFF_PURCHASES
## 1 C10001 40.90075 0.818182 95.40 0.00
## 2 C10002 3202.46742 0.909091 0.00 0.00
## 3 C10003 2495.14886 1.000000 773.17 773.17
## 4 C10004 1666.67054 0.636364 1499.00 1499.00
## 5 C10005 817.71434 1.000000 16.00 16.00
## 6 C10006 1809.82875 1.000000 1333.28 0.00
## INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY
## 1 95.40 0.000 0.166667
## 2 0.00 6442.945 0.000000
## 3 0.00 0.000 1.000000
## 4 0.00 205.788 0.083333
## 5 0.00 0.000 0.083333
## 6 1333.28 0.000 0.666667
## ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY
## 1 0.000000 0.083333
## 2 0.000000 0.000000
## 3 1.000000 0.000000
## 4 0.083333 0.000000
## 5 0.083333 0.000000
## 6 0.000000 0.583333
## CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX PURCHASES_TRX CREDIT_LIMIT PAYMENTS
## 1 0.000000 0 2 1000 201.8021
## 2 0.250000 4 0 7000 4103.0326
## 3 0.000000 0 12 7500 622.0667
## 4 0.083333 1 1 7500 0.0000
## 5 0.000000 0 1 1200 678.3348
## 6 0.000000 0 8 1800 1400.0578
## MINIMUM_PAYMENTS PRC_FULL_PAYMENT TENURE
## 1 139.5098 0.000000 12
## 2 1072.3402 0.222222 12
## 3 627.2848 0.000000 12
## 4 NA 0.000000 12
## 5 244.7912 0.000000 12
## 6 2407.2460 0.000000 12
summary(data)
## CUST_ID BALANCE BALANCE_FREQUENCY PURCHASES
## Length:8950 Min. : 0.0 Min. :0.0000 Min. : 0.00
## Class :character 1st Qu.: 128.3 1st Qu.:0.8889 1st Qu.: 39.63
## Mode :character Median : 873.4 Median :1.0000 Median : 361.28
## Mean : 1564.5 Mean :0.8773 Mean : 1003.20
## 3rd Qu.: 2054.1 3rd Qu.:1.0000 3rd Qu.: 1110.13
## Max. :19043.1 Max. :1.0000 Max. :49039.57
##
## ONEOFF_PURCHASES INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. :0.00000
## 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.:0.08333
## Median : 38.0 Median : 89.0 Median : 0.0 Median :0.50000
## Mean : 592.4 Mean : 411.1 Mean : 978.9 Mean :0.49035
## 3rd Qu.: 577.4 3rd Qu.: 468.6 3rd Qu.: 1113.8 3rd Qu.:0.91667
## Max. :40761.2 Max. :22500.0 Max. :47137.2 Max. :1.00000
##
## ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY
## Min. :0.00000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.08333 Median :0.1667
## Mean :0.20246 Mean :0.3644
## 3rd Qu.:0.30000 3rd Qu.:0.7500
## Max. :1.00000 Max. :1.0000
##
## CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX PURCHASES_TRX CREDIT_LIMIT
## Min. :0.0000 Min. : 0.000 Min. : 0.00 Min. : 50
## 1st Qu.:0.0000 1st Qu.: 0.000 1st Qu.: 1.00 1st Qu.: 1600
## Median :0.0000 Median : 0.000 Median : 7.00 Median : 3000
## Mean :0.1351 Mean : 3.249 Mean : 14.71 Mean : 4494
## 3rd Qu.:0.2222 3rd Qu.: 4.000 3rd Qu.: 17.00 3rd Qu.: 6500
## Max. :1.5000 Max. :123.000 Max. :358.00 Max. :30000
## NA's :1
## PAYMENTS MINIMUM_PAYMENTS PRC_FULL_PAYMENT TENURE
## Min. : 0.0 Min. : 0.02 Min. :0.0000 Min. : 6.00
## 1st Qu.: 383.3 1st Qu.: 169.12 1st Qu.:0.0000 1st Qu.:12.00
## Median : 856.9 Median : 312.34 Median :0.0000 Median :12.00
## Mean : 1733.1 Mean : 864.21 Mean :0.1537 Mean :11.52
## 3rd Qu.: 1901.1 3rd Qu.: 825.49 3rd Qu.:0.1429 3rd Qu.:12.00
## Max. :50721.5 Max. :76406.21 Max. :1.0000 Max. :12.00
## NA's :313
From the summary we can note three things: 1) CREDIT_LIMIT and MINIMUM_PAYMENTS contain missing values, and with 313 NAs in MINIMUM_PAYMENTS the gaps are too large to simply ignore; 2) apart from CUST_ID the data is numeric, so it is suitable for clustering; 3) the mean and median differ noticeably for most features, which hints at skewed distributions.
First, deal with CREDIT_LIMIT:
hist(data$CREDIT_LIMIT)
Due to the right-skewed distribution, it is better to fill the missing values with the median instead of the mean:
# Impute the missing CREDIT_LIMIT value with the column median
data <- data %>%
  mutate(CREDIT_LIMIT = replace(CREDIT_LIMIT,
                                is.na(CREDIT_LIMIT),
                                median(CREDIT_LIMIT, na.rm = TRUE)))
Next, plot the histogram of MINIMUM_PAYMENTS to inspect its distribution:
hist(data$MINIMUM_PAYMENTS)
As this data is also right-skewed, it is again better to fill the missing values with the median instead of the mean.
### fill in the NAs with median values
# Impute the 313 missing MINIMUM_PAYMENTS values with the column median
data <- data %>%
  mutate(MINIMUM_PAYMENTS = replace(MINIMUM_PAYMENTS,
                                    is.na(MINIMUM_PAYMENTS),
                                    median(MINIMUM_PAYMENTS, na.rm = TRUE)))
Then check the NA values of the dataset again, drop the CUST_ID column, and check the structure of the data.
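A minimal way to confirm that no missing values remain is to count NAs per column:
colSums(is.na(data))   # should be all zeros after the two imputations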
data <- data[,-1]
str(data)
## 'data.frame': 8950 obs. of 17 variables:
## $ BALANCE : num 40.9 3202.5 2495.1 1666.7 817.7 ...
## $ BALANCE_FREQUENCY : num 0.818 0.909 1 0.636 1 ...
## $ PURCHASES : num 95.4 0 773.2 1499 16 ...
## $ ONEOFF_PURCHASES : num 0 0 773 1499 16 ...
## $ INSTALLMENTS_PURCHASES : num 95.4 0 0 0 0 ...
## $ CASH_ADVANCE : num 0 6443 0 206 0 ...
## $ PURCHASES_FREQUENCY : num 0.1667 0 1 0.0833 0.0833 ...
## $ ONEOFF_PURCHASES_FREQUENCY : num 0 0 1 0.0833 0.0833 ...
## $ PURCHASES_INSTALLMENTS_FREQUENCY: num 0.0833 0 0 0 0 ...
## $ CASH_ADVANCE_FREQUENCY : num 0 0.25 0 0.0833 0 ...
## $ CASH_ADVANCE_TRX : int 0 4 0 1 0 0 0 0 0 0 ...
## $ PURCHASES_TRX : int 2 0 12 1 1 8 64 12 5 3 ...
## $ CREDIT_LIMIT : num 1000 7000 7500 7500 1200 1800 13500 2300 7000 11000 ...
## $ PAYMENTS : num 202 4103 622 0 678 ...
## $ MINIMUM_PAYMENTS : num 140 1072 627 312 245 ...
## $ PRC_FULL_PAYMENT : num 0 0.222 0 0 0 ...
## $ TENURE : int 12 12 12 12 12 12 12 12 12 12 ...
Now that the data is clean, we can move on to visualization.
# Reshape to long format so each variable gets its own facet
long_data <- gather(data, key = "variable", value = "value")
ggplot(long_data, aes(x = value, fill = variable)) +
geom_density(alpha = 0.7) +
facet_wrap(~ variable, scales = "free") +
labs(title = "Density Plots of Variables", x = "Value", y = "Density") +
theme_minimal()
Most variables are heavily right-skewed: a small number of customers make a significant number of purchases while the majority make far fewer. This matters for k-means, because skewed features distort Euclidean distances and pull centroids toward the outliers, leaving centers that are not representative of the majority of points in a cluster. So we deal with the skewness first.
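To put numbers on the skew, a quick sketch using e1071::skewness (the package is already loaded above); strongly positive values flag right-skewed columns:
# Sample skewness per column; values well above 0 indicate right skew
skew_vals <- sapply(data, e1071::skewness)
sort(round(skew_vals, 2), decreasing = TRUE)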
First, select the positively skewed variables:
right_skewed_variables <- data[c('BALANCE_FREQUENCY', 'ONEOFF_PURCHASES',
                                 'INSTALLMENTS_PURCHASES', 'CASH_ADVANCE',
                                 'ONEOFF_PURCHASES_FREQUENCY',
                                 'PURCHASES_INSTALLMENTS_FREQUENCY',
                                 'CASH_ADVANCE_TRX', 'PURCHASES_TRX',
                                 'CREDIT_LIMIT', 'PAYMENTS',
                                 'MINIMUM_PAYMENTS', 'PRC_FULL_PAYMENT')]
Apply the log transformation and check the distributions again:
# log(x + 1) handles the zeros in these columns (equivalent to log1p)
transformed_data <- as.data.frame(lapply(right_skewed_variables, function(x) log(x + 1)))
long_data <- gather(transformed_data, key = "variable", value = "value")
ggplot(long_data, aes(x = value)) +
geom_density(alpha = 0.7) +
facet_wrap(~ variable, scales = "free") +
labs(title = "Density Plots of Right Skewed Variables", x = "Transformed Value", y = "Density") +
theme_minimal()
The results are not perfect, so let us also try a square-root transformation:
transformed_data1 <- as.data.frame(lapply(right_skewed_variables, sqrt))
long_data1 <- gather(transformed_data1, key = "variable", value = "value")
ggplot(long_data1, aes(x = value)) +
geom_density(alpha = 0.7) +
facet_wrap(~ variable, scales = "free") +
labs(title = "Density Plots of Right Skewed Variables", x = "Transformed Value1", y = "Density") +
theme_minimal()
Comparing the two sets of density plots, the log transformation is still the better choice.
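To make the comparison concrete, a small sketch contrasting skewness under both transforms (values closer to 0 are better):
# Per-column skewness under each transform
data.frame(log  = round(sapply(transformed_data,  e1071::skewness), 2),
           sqrt = round(sapply(transformed_data1, e1071::skewness), 2))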
Next, use a correlation heatmap to decide whether we need to reduce the dimensions: if some features are highly correlated, dimensionality reduction is warranted.
cor_matrix <- cor(data, use = "complete.obs")
# Melt the correlation matrix for ggplot2
cor_melted <- melt(cor_matrix)
# Plotting the heatmap with text labels
ggplot(cor_melted, aes(Var1, Var2, fill = value)) +
geom_tile() +
geom_text(aes(label = sprintf("%.2f", value)), vjust = 1) +
scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0) +
theme_minimal() +
labs(x = "Variables", y = "Variables", fill = "Correlation") +
coord_fixed()
According to the heatmap, several variables are highly correlated, so it is necessary to do dimensionality reduction.
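As a numeric cross-check, caret (loaded above) can flag columns whose absolute pairwise correlation exceeds a chosen cutoff; the 0.8 threshold here is an illustrative choice:
# Columns that findCorrelation would recommend dropping at |r| > 0.8
caret::findCorrelation(cor_matrix, cutoff = 0.8, names = TRUE)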
dim(data)
## [1] 8950 17
# Center and scale every column so distance-based methods weight them equally
preproc1 <- preProcess(data, method = c("center", "scale"))
data_s <- predict(preproc1, data)
data_s <- as.data.frame(data_s)
summary(data_s)
## BALANCE BALANCE_FREQUENCY PURCHASES ONEOFF_PURCHASES
## Min. :-0.7516 Min. :-3.70306 Min. :-0.46953 Min. :-0.356914
## 1st Qu.:-0.6900 1st Qu.: 0.04904 1st Qu.:-0.45098 1st Qu.:-0.356914
## Median :-0.3320 Median : 0.51806 Median :-0.30044 Median :-0.334021
## Mean : 0.0000 Mean : 0.00000 Mean : 0.00000 Mean : 0.000000
## 3rd Qu.: 0.2352 3rd Qu.: 0.51806 3rd Qu.: 0.05004 3rd Qu.:-0.009056
## Max. : 8.3970 Max. : 0.51806 Max. :22.48225 Max. :24.199714
## INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY
## Min. :-0.45455 Min. :-0.46676 Min. :-1.22169
## 1st Qu.:-0.45455 1st Qu.:-0.46676 1st Qu.:-1.01407
## Median :-0.35614 Median :-0.46676 Median : 0.02404
## Mean : 0.00000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.06366 3rd Qu.: 0.06435 3rd Qu.: 1.06215
## Max. :24.42552 Max. :22.00989 Max. : 1.26977
## ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY
## Min. :-0.6786 Min. :-0.9169
## 1st Qu.:-0.6786 1st Qu.:-0.9169
## Median :-0.3993 Median :-0.4976
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3270 3rd Qu.: 0.9701
## Max. : 2.6733 Max. : 1.5991
## CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX PURCHASES_TRX CREDIT_LIMIT
## Min. :-0.6753 Min. :-0.4760 Min. :-0.59176 Min. :-1.2214
## 1st Qu.:-0.6753 1st Qu.:-0.4760 1st Qu.:-0.55153 1st Qu.:-0.7954
## Median :-0.6753 Median :-0.4760 Median :-0.31016 Median :-0.4107
## Mean : 0.0000 Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.4351 3rd Qu.: 0.1101 3rd Qu.: 0.09213 3rd Qu.: 0.5512
## Max. : 6.8201 Max. :17.5469 Max. :13.81024 Max. : 7.0097
## PAYMENTS MINIMUM_PAYMENTS PRC_FULL_PAYMENT TENURE
## Min. :-0.59866 Min. :-0.36218 Min. :-0.52552 Min. :-4.1225
## 1st Qu.:-0.46626 1st Qu.:-0.28895 1st Qu.:-0.52552 1st Qu.: 0.3607
## Median :-0.30267 Median :-0.22829 Median :-0.52552 Median : 0.3607
## Mean : 0.00000 Mean : 0.00000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.05803 3rd Qu.:-0.02409 3rd Qu.:-0.03712 3rd Qu.: 0.3607
## Max. :16.92133 Max. :32.39092 Max. : 2.89329 Max. : 0.3607
### eigenvalues on the basis of the covariance matrix
data_s.cov<-cov(data_s)
data_s.eigen<-eigen(data_s.cov)
data_s.eigen$values
## [1] 4.640604e+00 3.453343e+00 1.498241e+00 1.271519e+00 1.058204e+00
## [6] 9.758096e-01 8.301823e-01 7.308646e-01 6.457040e-01 5.236004e-01
## [11] 4.032827e-01 3.014728e-01 2.427357e-01 2.068763e-01 1.721534e-01
## [16] 4.539592e-02 1.164886e-05
head(data_s.eigen$vectors)
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] -0.10173497 -0.40381873 -0.16163198 0.27931641 -0.05102225 0.04275156
## [2,] -0.12049041 -0.13134414 -0.42299545 0.15576173 0.47683620 0.02204495
## [3,] -0.41156240 -0.04002462 0.25747423 0.04162083 -0.02788702 0.18045255
## [4,] -0.34633416 -0.06191772 0.39089919 0.08507909 0.16083881 0.17998375
## [5,] -0.33676092 0.01917539 -0.10926846 -0.05791521 -0.36140247 0.09600344
## [6,] 0.02280988 -0.43902651 -0.02312809 -0.25659565 -0.09756007 -0.14501811
## [,7] [,8] [,9] [,10] [,11] [,12]
## [1,] 0.2541816 -0.17029351 0.12851376 -0.03117406 -0.1484861 0.48066944
## [2,] -0.1008009 0.28349585 0.61625425 0.08158054 0.1189014 -0.09139969
## [3,] -0.2002205 0.02605464 0.10477733 -0.05116270 -0.2007868 -0.07487246
## [4,] -0.1206849 0.12697358 0.01400471 0.17546751 -0.4481486 0.05238789
## [5,] -0.2517713 -0.17169631 0.22202301 -0.44326909 0.3484815 -0.27316781
## [6,] 0.0487302 -0.02241577 -0.04942654 0.35921657 0.3543696 0.18362311
## [,13] [,14] [,15] [,16] [,17]
## [1,] 0.524791875 -0.17453771 0.21891187 0.005249742 -4.216029e-05
## [2,] -0.184842432 0.04116932 0.04192939 0.010891185 -7.979220e-06
## [3,] 0.114177416 0.22513789 -0.05169797 -0.002159184 7.489748e-01
## [4,] -0.008510644 0.22524436 -0.05634003 0.004666332 -5.819431e-01
## [5,] 0.285345707 0.11843846 -0.01874770 -0.014555735 -3.168263e-01
## [6,] 0.023281010 0.60945077 -0.21008717 0.007087134 -2.220327e-05
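The eigenvalues translate directly into the proportion of variance each component explains; a quick sketch that should agree with the prcomp() summary below:
# Proportion (top row) and cumulative proportion (bottom row) of variance
prop_var <- data_s.eigen$values / sum(data_s.eigen$values)
round(rbind(prop_var, cumsum(prop_var)), 4)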
xxx <- data_s
xxx.pca1 <- prcomp(xxx, center = FALSE, scale. = FALSE) # stats::prcomp; data is already centered and scaled
summary(xxx.pca1)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.154 1.8583 1.22403 1.1276 1.02869 0.9878 0.91114
## Proportion of Variance 0.273 0.2031 0.08813 0.0748 0.06225 0.0574 0.04883
## Cumulative Proportion 0.273 0.4761 0.56425 0.6390 0.70129 0.7587 0.80752
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.85491 0.80356 0.7236 0.63505 0.54907 0.49268 0.45484
## Proportion of Variance 0.04299 0.03798 0.0308 0.02372 0.01773 0.01428 0.01217
## Cumulative Proportion 0.85052 0.88850 0.9193 0.94302 0.96075 0.97503 0.98720
## PC15 PC16 PC17
## Standard deviation 0.41491 0.21306 0.003413
## Proportion of Variance 0.01013 0.00267 0.000000
## Cumulative Proportion 0.99733 1.00000 1.000000
Now let us plot to verify:
plot(xxx.pca1)
fviz_pca_var(xxx.pca1, col.var="steelblue")
### visualisation of quality
fviz_eig(xxx.pca1, choice='eigenvalue') # eigenvalues on y-axis
fviz_eig(xxx.pca1) # percentage of explained variance on y-axis
eig.val<-get_eigenvalue(xxx.pca1)
a<-summary(xxx.pca1)
plot(a$importance[3,],type="l") # cumulative variance
Here we want to keep as much of the variance as we can within a manageable number of dimensions, and the cumulative-variance curve suggests 5 components. For the clustering below we keep the first five original variables, which makes the cluster centers directly interpretable:
data <- data[,c(1:5)]
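Note that this subsets raw columns rather than component scores; if one wanted to cluster on the PC scores themselves, a minimal sketch (not used for the results below):
# Hypothetical alternative: the first five PC scores from prcomp() above
pc_scores <- as.data.frame(xxx.pca1$x[, 1:5])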
set.seed(123) # Set seed for reproducibility
# Compute total within-cluster sum of square
wss <- sapply(1:5, function(k){
kmeans(data, k, nstart = 15)$tot.withinss
})
# Plot the elbow curve
plot(1:5, wss, type="b", xlab="Number of Clusters", ylab="Total within-clusters sum of squares")
According to the plot, we choose the elbow point and set k = 4 clusters.
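As a cross-check on the choice of k, average silhouette width can be computed with the cluster package loaded above; dist() on all 8950 rows is memory-hungry, so this sketch subsamples:
set.seed(123)
idx <- sample(nrow(data), 2000)   # subsample so dist() stays cheap
d <- dist(data[idx, ])
avg_sil <- sapply(2:6, function(k) {
  km <- kmeans(data[idx, ], centers = k, nstart = 15)
  mean(silhouette(km$cluster, d)[, 3])   # column 3 holds sil_width
})
round(avg_sil, 3)   # higher average width = better-separated clusters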
k <- 4
km_result <- kmeans(data, centers = k, nstart = 25)
data_with_cluster <- data
data_with_cluster$cluster <- as.factor(km_result$cluster)
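For a quick visual sanity check, factoextra (loaded above) can project the clusters onto the first two principal components:
# Each point is a customer, colored by its k-means cluster
fviz_cluster(km_result, data = data, geom = "point", ellipse.type = "convex")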
Calculate the mean of each variable within each cluster:
cluster_summary <- aggregate(. ~ cluster, data_with_cluster, mean)
print(cluster_summary)
## cluster BALANCE BALANCE_FREQUENCY PURCHASES ONEOFF_PURCHASES
## 1 1 5601.6446 0.9541322 28394.1623 22858.4573
## 2 2 827.6439 0.8516878 602.5697 305.6232
## 3 3 5646.9296 0.9902468 776.1323 434.0495
## 4 4 2461.1269 0.9756786 5921.5745 3991.9298
## INSTALLMENTS_PURCHASES
## 1 5535.7050
## 2 297.2188
## 3 342.1400
## 4 1930.8904
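Cluster sizes add useful context for the interpretation below:
table(km_result$cluster)   # number of customers in each segment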
Conclusion:
Cluster 1: This cluster has a high average balance and a high balance frequency, suggesting that these customers regularly use their accounts and maintain a high balance. They also have the highest average purchases and one-off purchases, which implies they make significant and frequent transactions, possibly indicating a group of high-spending customers. The high installment purchases suggest they also utilize installment plans.
Cluster 2: Customers in this cluster have the lowest average balance and balance frequency, along with low purchase values. This could indicate a group that uses their credit cards infrequently and for smaller amounts, suggesting they might be more conservative with credit card usage or have lower credit limits.
Cluster 3: This cluster has a very high balance frequency, almost equal to 1, which suggests that their balances are updated almost constantly, indicating regular account activity. However, the purchase amounts are relatively low, especially for one-off purchases. This might represent customers who use their cards regularly but for smaller routine expenses.
Cluster 4: Customers here have a moderate average balance and high balance frequency, with purchase amounts larger than Cluster 2 and 3 but significantly lower than Cluster 1. The one-off purchases are reasonably high, suggesting these customers engage in substantial one-time transactions, but less frequently than those in Cluster 1. The installment purchases are also moderate, indicating a balanced use between one-off and installment purchases.
We can infer:
Cluster 1 could represent affluent customers with high spending capacity.
Cluster 2 might consist of low-usage customers who could be targeted for increased engagement and usage.
Cluster 3 may represent customers who use their credit cards as a regular payment method for smaller amounts.
Cluster 4 could include a middle segment that uses credit cards for both regular and significant purchases, but not to the extent of Cluster 1.