K-mean clustering a machine learning model that uses distance to separates data into meaningful groups can be used for market segmentation.
The following data set taken from ‘https://www.kaggle.com/datasets/arjunbhasin2013/ccdata/data’ contains 8950 rows of behavior activity from credit card holders in the last 6 months. The dataset includes information on the id of the user, their credit card balance, and purchasing activity.
In this analysis, I will use three techniques, Principal Component Analysis (PCA), K-means clustering algorithm and the hierarchical clustering algorithm to perform customer segmentation and detail a marketing strategy to create a more targeted user experience.
Principal Component Analysis: This is a technique applied to large datasets to reduce the size to it’s principal components, values which capture the most variation of the data. This is especially useful where there are a lot of features, as PCA compresses the data while keeping most of the important information. I will be utilizing PCA in the pre-processing stage of building the K-means clustering model.
K-means Clustering: A machine learning algorithm that uses distance to classify data points into groups. This is a tool used for classification and is based on the assumption that data points in close proximity are related to each other.
Hierarchical Clustering: Hierarchical clustering is similar to k-means clustering in that it is a unsupervised machine-learning technique that seperates data into groups based on proximity. Hierarchial clustering unlike K-means clustering does not require a defined k value. Instead, it individually looks at each data point and merges based on proximity. These relationships take the form a hierarchical tree or a dendrogram. Clusters are merged or split based on something called the linkage criterion which determines the distance between sets of observations.
The objective of this study is to perform customer segmentation using the k-means clustering and hierarchical clustering technique and using the information to detail a marketing plan to engage these groups. I will perform data exploration, transformation, model building and analysis followed by a recommendation for next steps.
| Variable Name | Definition |
|---|---|
| CUST_ID | ID of Credit Card user |
| BALANCE | Credit Card balance |
| BALANCE_FREQUENCY | Frequency that balance is updated (1=frequent, 0=not frequent) |
| PURCHASES | The total amount of purchases from the credit card |
| ONEOFF_PURCHASES | Maximum purchase amount at one time |
| INSTALLMENT_PUR… | Total Amount from installment purchase |
| CASH_ADVANCE | Cash advance amount taken from user |
| PURCHASES_FREQUENCY | Frequency purchases are made (1=frequent, 0=not frequent) |
| ONEOFFPURCHASESFR… | Frequency one-off purchases are made |
| PURCHASEINSTALLME… | Frequency purchase installments are made |
| CASHADVANCEFREQUE… | Frequency cash advance is requested |
| CASHADVANCETRX | Number of transactions made with “Cash in Advance” |
| PURCHASES_TRX | Amount of purchase transactions |
| CREDIT_LIMIT | Credit Card limit for user |
| PAYMENTS | Amount of payments completed by user |
| MINIMUM_PAYMENTS | Minimum amount of payments from user |
| PRCFULLPAYMENT | Percentage of full payment paid |
| TENURE | Length of credit card service tenure |
With the exception of the CUST_ID which is a character type, all the other variables are numeric features. We can also observe different types of numeric features. There are your continuous numeric variables such as BALANCE and PURCHASES and there are variables like BALANCE_FREQUENCY which are on a range from 0 to 1. You can also see count variables such as TENURE, where each number represents a specific duration. When creating using PCA, we will need to be mindful about what variables to use as PCA works best with continuous variables. We will also have to address the null values in CREDIT_LIMIT, and MINIMUM_PAYMENTS.
glimpse(ccdataset)
## Rows: 8,950
## Columns: 18
## $ CUST_ID <chr> "C10001", "C10002", "C10003", "C10004…
## $ BALANCE <dbl> 40.90075, 3202.46742, 2495.14886, 166…
## $ BALANCE_FREQUENCY <dbl> 0.818182, 0.909091, 1.000000, 0.63636…
## $ PURCHASES <dbl> 95.40, 0.00, 773.17, 1499.00, 16.00, …
## $ ONEOFF_PURCHASES <dbl> 0.00, 0.00, 773.17, 1499.00, 16.00, 0…
## $ INSTALLMENTS_PURCHASES <dbl> 95.40, 0.00, 0.00, 0.00, 0.00, 1333.2…
## $ CASH_ADVANCE <dbl> 0.0000, 6442.9455, 0.0000, 205.7880, …
## $ PURCHASES_FREQUENCY <dbl> 0.166667, 0.000000, 1.000000, 0.08333…
## $ ONEOFF_PURCHASES_FREQUENCY <dbl> 0.000000, 0.000000, 1.000000, 0.08333…
## $ PURCHASES_INSTALLMENTS_FREQUENCY <dbl> 0.083333, 0.000000, 0.000000, 0.00000…
## $ CASH_ADVANCE_FREQUENCY <dbl> 0.000000, 0.250000, 0.000000, 0.08333…
## $ CASH_ADVANCE_TRX <dbl> 0, 4, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ PURCHASES_TRX <dbl> 2, 0, 12, 1, 1, 8, 64, 12, 5, 3, 12, …
## $ CREDIT_LIMIT <dbl> 1000, 7000, 7500, 7500, 1200, 1800, 1…
## $ PAYMENTS <dbl> 201.8021, 4103.0326, 622.0667, 0.0000…
## $ MINIMUM_PAYMENTS <dbl> 139.50979, 1072.34022, 627.28479, NA,…
## $ PRC_FULL_PAYMENT <dbl> 0.000000, 0.222222, 0.000000, 0.00000…
## $ TENURE <dbl> 12, 12, 12, 12, 12, 12, 12, 12, 12, 1…
summary(ccdataset)
## CUST_ID BALANCE BALANCE_FREQUENCY PURCHASES
## Length:8950 Min. : 0.0 Min. :0.0000 Min. : 0.00
## Class :character 1st Qu.: 128.3 1st Qu.:0.8889 1st Qu.: 39.63
## Mode :character Median : 873.4 Median :1.0000 Median : 361.28
## Mean : 1564.5 Mean :0.8773 Mean : 1003.20
## 3rd Qu.: 2054.1 3rd Qu.:1.0000 3rd Qu.: 1110.13
## Max. :19043.1 Max. :1.0000 Max. :49039.57
##
## ONEOFF_PURCHASES INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. :0.00000
## 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.:0.08333
## Median : 38.0 Median : 89.0 Median : 0.0 Median :0.50000
## Mean : 592.4 Mean : 411.1 Mean : 978.9 Mean :0.49035
## 3rd Qu.: 577.4 3rd Qu.: 468.6 3rd Qu.: 1113.8 3rd Qu.:0.91667
## Max. :40761.2 Max. :22500.0 Max. :47137.2 Max. :1.00000
##
## ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY
## Min. :0.00000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.08333 Median :0.1667
## Mean :0.20246 Mean :0.3644
## 3rd Qu.:0.30000 3rd Qu.:0.7500
## Max. :1.00000 Max. :1.0000
##
## CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX PURCHASES_TRX CREDIT_LIMIT
## Min. :0.0000 Min. : 0.000 Min. : 0.00 Min. : 50
## 1st Qu.:0.0000 1st Qu.: 0.000 1st Qu.: 1.00 1st Qu.: 1600
## Median :0.0000 Median : 0.000 Median : 7.00 Median : 3000
## Mean :0.1351 Mean : 3.249 Mean : 14.71 Mean : 4494
## 3rd Qu.:0.2222 3rd Qu.: 4.000 3rd Qu.: 17.00 3rd Qu.: 6500
## Max. :1.5000 Max. :123.000 Max. :358.00 Max. :30000
## NA's :1
## PAYMENTS MINIMUM_PAYMENTS PRC_FULL_PAYMENT TENURE
## Min. : 0.0 Min. : 0.02 Min. :0.0000 Min. : 6.00
## 1st Qu.: 383.3 1st Qu.: 169.12 1st Qu.:0.0000 1st Qu.:12.00
## Median : 856.9 Median : 312.34 Median :0.0000 Median :12.00
## Mean : 1733.1 Mean : 864.21 Mean :0.1537 Mean :11.52
## 3rd Qu.: 1901.1 3rd Qu.: 825.49 3rd Qu.:0.1429 3rd Qu.:12.00
## Max. :50721.5 Max. :76406.21 Max. :1.0000 Max. :12.00
## NA's :313
## CUST_ID BALANCE
## 0 0
## BALANCE_FREQUENCY PURCHASES
## 0 0
## ONEOFF_PURCHASES INSTALLMENTS_PURCHASES
## 0 0
## CASH_ADVANCE PURCHASES_FREQUENCY
## 0 0
## ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY
## 0 0
## CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX
## 0 0
## PURCHASES_TRX CREDIT_LIMIT
## 0 1
## PAYMENTS MINIMUM_PAYMENTS
## 0 313
## PRC_FULL_PAYMENT TENURE
## 0 0
To handle the nulls, I chose the replace the missing data with the median of the variables. By using the median, we avoid skewing the data and maintain the overall integrity of the dataset.
# Impute missing values
ccdataset$MINIMUM_PAYMENTS[is.na(ccdataset$MINIMUM_PAYMENTS)] <- median(ccdataset $MINIMUM_PAYMENTS, na.rm = TRUE)
ccdataset$CREDIT_LIMIT[is.na(ccdataset$CREDIT_LIMIT)] <- median(ccdataset$CREDIT_LIMIT, na.rm = TRUE)
Looking at the distribution of the data, we can see the variables are right skewed. Based on the density plot and the summary from earlier, we can see that each variable has different scales. To ensure certain variables don’t have larger influence due to the scaling, we have to make sure all the variables used for PCA are scaled appropriately. This is necessary for PCA because it’s technique relies on feature variability.
Prior to using the scale function, we will remove any variables that are
not going to be used.
ccdataset1 <- ccdataset %>%
select(-BALANCE_FREQUENCY,-PURCHASES_FREQUENCY,-ONEOFF_PURCHASES_FREQUENCY,-PURCHASES_INSTALLMENTS_FREQUENCY,-CASH_ADVANCE_FREQUENCY,-PRC_FULL_PAYMENT,-CUST_ID,-TENURE)
Now that our features are scaled. We can perform PCA.
data_scaled <- scale(ccdataset1)
Here we performed PCA, and are visualizing the Scree Plot. This shows us the variances that is explained by each principal component. The first three components make up a significant amount of variance, which tells us the importance of the variables.
I’m keeping all the components to increase the complexity of the data. Though everything after the first three components might not capture as much variance, they can still contain important behavioral information on the customers that is worth examining.
pca_result <- prcomp(data_scaled, center = TRUE, scale. = TRUE)
fviz_eig(pca_result)
Below I am using the fviz_nbclust function to look at the optimal number of clusters. This is calculated by using the sum of squares - which looks at the variability of observations within the clusters. The optimal number of clusters appears to be 3. We can see the rate of decrease in sum of squares slow down after cluster 3. This tells us there is diminishing returns when adding more than 3 clusters.
set.seed(123)
fviz_nbclust(data_scaled, kmeans, method = "wss") # Elbow method to determine the optimal number of clusters
When visualizing our clusters, we can see three distinct groups. Cluster 1 is in red and is spread in the upper values of Dim2 and the lower values of Dim1. We can see cluster 2 is around the lower value of Dim2 and then spread across the lower to higher end of Dim1. Cluster 3 is in the lower value of Dim2 and Dim1.
set.seed(124)
k <- 3
knn_res <- kmeans(data_scaled, centers = k)
fviz_cluster(knn_res, data = data_scaled)
In examining the mean z scores of each variable for every cluster, we can see 3 distinct credit credit profiles with different spending and borrowing habits.
The score values indicate the deviation from the mean. If a score is positive, then that variable within that cluster is greater than the overall dataset’s mean. A negative value indicates that the variable’s score is lower than the overall dataset’s mean.
Below I breakdown each cluster.
cluster_profiles <- aggregate(data_scaled, by=list(cluster=knn_res$cluster), FUN=mean)
print(cluster_profiles)
## cluster BALANCE PURCHASES ONEOFF_PURCHASES INSTALLMENTS_PURCHASES
## 1 1 1.596769 -0.1793127 -0.1255767 -0.1933779
## 2 2 0.832024 2.9423145 2.4932632 2.3764671
## 3 3 -0.312620 -0.1544720 -0.1352014 -0.1168396
## CASH_ADVANCE CASH_ADVANCE_TRX PURCHASES_TRX CREDIT_LIMIT PAYMENTS
## 1 1.7038412 1.4589341 -0.2084392 1.0090386 0.7014612
## 2 -0.1958818 -0.2090157 2.8785234 1.3423195 1.8011723
## 3 -0.2659075 -0.2251080 -0.1457337 -0.2485445 -0.2269894
## MINIMUM_PAYMENTS
## 1 0.6558583
## 2 0.4278902
## 3 -0.1337854
In cluster 1 we can observe that these customers carry high balances and have less purchasing activity.
CASH_ADVANCE(1.7038412) and BALANCE(1.596769) have the highest positive values - which tells us customers in cluster 1 have high cash advance usage and high credit card balances.
The customers in the cluster also high higher than average CREDIT_LIMIT(1.0090386) and make large payments (PAYMENTS, 0.7014612). This makes sense because of the large balances carried by those in cluster 1.
PURCHASES_TRX(-0.2084392),INSTALLMENT_PURCHASES(-0.1933779), and PURCHASES(-0.1793127) have the highest negative values which tells us that there is lower than average purchasing activity in this group.
Cluster 2 is made up of customers that have large purchasing activity and low cash advance/borrowing events. Active shoppers with high purchasing power.
Purchasing variables such as PURCHASES(2.9423145), PURCHASES_TRX(2.8785234), ONEOFF_PURCHASES(2.4932632), and INSTALLMENTS_PURCHASES(2.3764671) all have high positive means. Not only are the purchase transaction numbers high but the amount of the purchases themselves as well.
This group borrows less money through cash advances as show by the negative values for CASH_ADVANCE(-0.1958818) and CASH_ADVANCE_TRX(-0.2090157).
Cluster 3 shows negative values in all categories, which tells us these users have overall low engagement. This is especially evident by the low PURCHASES_TRX(-0.1457337) and CREDIT_LIMIT(-0.2485445).
For the second model, I will select the variables that were removed, with exception to the variables CUST_ID, and TENURE.
At first glance, we can see that these variables need to be scaled as CASH_ADVANCE_FREQUENCY has a scale of 0 to 1.5 unlike the other variables which are scaled from 0 to 1. This can be problematic for us when performing PCA.
### Data Preparation
Now that our data is scaled, we can perform the PCA technique.
data_scaled2 <- scale(ccdataset2)
Similar to the first model, we can see that the first three principal components explain a significant amount of variance. However, we only observe a sharp decline in the percentage until after the fifth component.
I’m keeping all the components as with the first model, to increase the complexity of the data.
pca_result2 <- prcomp(data_scaled2, center = TRUE, scale. = TRUE)
fviz_eig(pca_result2)
Using the fviz_nbclust function to look at the optimal number of clusters, we can see the elbow or bend in the plot at around K = 3. This will determine our ideal K value.
set.seed(235)
fviz_nbclust(data_scaled2, kmeans, method = "wss")
We can see that cluster 1 is in the middle of both cluster 2 and cluster 3, which means there might be some overlap in characteristics. Cluster 2 is mainly on the right side of the plot, which tells us it is very influential. Cluster 3 is on the left side of the plot and looks to have different characteristics from cluster 2 and 3.
set.seed(125)
k <- 3
knn_res2 <- kmeans(data_scaled2, centers = k)
fviz_cluster(knn_res2, data = data_scaled2)
The customers in cluster 1 frequently use their cards for installment purchases and pay their balance full more often.
This group of customers often makes one-off purchases, and overall have high purchasing activity.
The third cluster is classified by customers who frequently use cash advances.
cluster_profiles2 <- aggregate(data_scaled2, by=list(cluster=knn_res2$cluster), FUN=mean)
print(cluster_profiles2)
## cluster BALANCE_FREQUENCY PURCHASES_FREQUENCY ONEOFF_PURCHASES_FREQUENCY
## 1 1 0.1336238 0.9144197 -0.3883945
## 2 2 0.4023126 0.9984649 1.9499965
## 3 3 -0.2119321 -0.8735944 -0.4092443
## PURCHASES_INSTALLMENTS_FREQUENCY CASH_ADVANCE_FREQUENCY PRC_FULL_PAYMENT
## 1 1.1104971 -0.3800583 0.3268186
## 2 0.3256025 -0.2228577 0.3715532
## 3 -0.7693975 0.2999415 -0.3170577
Next we’re going to create the dataset, we’re going to use for a hierarchical clustering model. Unlike K-means clustering which uses a predefined K to create it’s distance based groups, hierarchical clustering doesn’t require an N; it uses a divisive method which looks at one record and then arranges clusters into a hierarchy. Like k-means clustering it uses proximity in determining relationships. Given the dendogram shape, it handles different types well unlike k-means clustering. For this reason we’re going to keep our variables together when building our hierarchical model.
ccdataset3 <- ccdataset %>%
select(-CUST_ID)
I will repeat the same steps I performed previously, including scaling the data and performing PCA to reduce the dimension of our dataset.
data_scaled3 <- scale(ccdataset3)
In looking at the variance, we can see that dimensions 1 and 2 explain the most variance. However, unlike our first two k-means clustering models, the variance here is spread more along the dimensions.
pca_result3 <- prcomp(data_scaled3, center = TRUE, scale. = TRUE)
fviz_eig(pca_result3, addlabels = TRUE, ylim = c(0, 100))
Given the large number of small groups. We’ll have to assign a cut-off
and examine the groups further.
Here we create our cutoff k, to be 5. In looking at each cluster, I
notice the following characteristics:
Cluster 1 is characterized by customers who are conservative spenders. They have the lowest balance, and low overall purchasing activity.
Cluster 2 has customers that have high balances and frequently use case advances. Their purchase activity is low which tells us they use their credit card for short term borrowing instead of making purchases.
This group contains customers who make large purchases, have high purchasing volume and also have a high credit limit to support their activity.
These customers have high purchasing frequency while keeping a low balancing and making ontime payments. In addition, they have a decent credit limit.
This cluster, similar to cluster 1 is either low or middle in purchasing activity.
## # A tibble: 5 × 18
## cluster BALANCE BALANCE_FREQUENCY PURCHASES ONEOFF_PURCHASES
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 958. 0.792 404. 294.
## 2 2 4451. 0.979 370. 237.
## 3 3 4037. 0.984 8550. 5785.
## 4 4 1067. 0.967 1527. 725.
## 5 5 1027. 0.831 482. 271.
## # ℹ 13 more variables: INSTALLMENTS_PURCHASES <dbl>, CASH_ADVANCE <dbl>,
## # PURCHASES_FREQUENCY <dbl>, ONEOFF_PURCHASES_FREQUENCY <dbl>,
## # PURCHASES_INSTALLMENTS_FREQUENCY <dbl>, CASH_ADVANCE_FREQUENCY <dbl>,
## # CASH_ADVANCE_TRX <dbl>, PURCHASES_TRX <dbl>, CREDIT_LIMIT <dbl>,
## # PAYMENTS <dbl>, MINIMUM_PAYMENTS <dbl>, PRC_FULL_PAYMENT <dbl>,
## # TENURE <dbl>
Now that we’ve seen clusters in using both the k-means clustering and the hierarchical clustering technique. I’m going to provide enagement strategies based on the customers groups we’ve seen.
These group of spenders are characterized by their low engagement across all categories including spending, borrowing activities and have a low to medium balance on their account.
Recommendation: To incentivize this group to use their card, rewards can be offered for his group. Special offers on installment purchases can also be offered to encourage frequent spending.
The customers in this groups have high balances in their account and are frequent users of cash advances. They are also characterized by their low purchasing activity, as they are using their cards primarily for short term financing.
Recommendation: Promotions can be offered on cash advance fees to encourage this activity. In addition, alerts or tools can be created to help users manage their spending.
These users having medium to low engagement across spending and borrowing activities. Our objective is to re-engage and offer different offers and services.
Recommendation: Create special one-time offers to users. A variety of related financial products can be offered such as loans, cash advances, etc. Surveys could also be sent to gauge their needs and understand their inactivity.
Digital strategy: It’s important that the online site or app have personalized displays tailored to their behavior, in the form of notifications and dashboards that provides insights into the customers behavior, offer financial assistance and cross-promote other products.
Risk Management: Another strategy to look into is to minimize risk and costs. A strong credit risk model can be created using this dataset, which can predict risk and manage other account related decisions like credit limits.
Leveraging customer behavior experience is key in any institution providing services. The k-means clustering and hierarchical clustering were both able to give us insights into customer behavior. Given the type of data types within the data (continuous, discrete, and nominal), an algorithm like hierarchical clustering can handle this better as it looks at each data point individually and creates clusters based on relationships with no predetermined K. Given the hierarchical decision making, it can handle different data types that have unique relationships. When seperating the continuous and nominal types for k-means clustering, the groups did not give us as useful information as when all the variables together in the hierarchical model.