Data622_Final

Abstract

To effectively target customers for financial or credit card services, the product should add value to the consumer. In the case of credit card services, consumers can be offered a range of options such as balance transfers, special introductory rate offers, and reward specific credit cards. Customer segmentation is a key strategy in targeting their audience with relevant content. This involves analyzing the spending behavior of consumers and segmenting them into meaningful groups that marketers can tailor content to.

K-mean clustering a machine learning model that uses distance to separates data into meaningful groups can be used for market segmentation.

Keywords

Customer Segmentation, Marketing, Credit Card Services, Unsupervised Machine Learning, Dimensionality, Clustering, Principal Component Analysis, K-means Clustering, Hierarchical Clustering

Introduction

The following data set taken from ‘https://www.kaggle.com/datasets/arjunbhasin2013/ccdata/data’ contains 8950 rows of behavior activity from credit card holders in the last 6 months. The dataset includes information on the id of the user, their credit card balance, and purchasing activity.

In this analysis, I will use three techniques, Principal Component Analysis (PCA), K-means clustering algorithm and the hierarchical clustering algorithm to perform customer segmentation and detail a marketing strategy to create a more targeted user experience.

Algorithms

Principal Component Analysis: This is a technique applied to large datasets to reduce the size to it’s principal components, values which capture the most variation of the data. This is especially useful where there are a lot of features, as PCA compresses the data while keeping most of the important information. I will be utilizing PCA in the pre-processing stage of building the K-means clustering model.

K-means Clustering: A machine learning algorithm that uses distance to classify data points into groups. This is a tool used for classification and is based on the assumption that data points in close proximity are related to each other.

Hierarchical Clustering: Hierarchical clustering is similar to k-means clustering in that it is a unsupervised machine-learning technique that seperates data into groups based on proximity. Hierarchial clustering unlike K-means clustering does not require a defined k value. Instead, it individually looks at each data point and merges based on proximity. These relationships take the form a hierarchical tree or a dendrogram. Clusters are merged or split based on something called the linkage criterion which determines the distance between sets of observations.

Objective

The objective of this study is to perform customer segmentation using the k-means clustering and hierarchical clustering technique and using the information to detail a marketing plan to engage these groups. I will perform data exploration, transformation, model building and analysis followed by a recommendation for next steps.

Initial Data Exploration

The dataset which is labeled ccdataset contains 8,950 rows of data and 18 features. The list of features and their definitions are below:

Variable Name	Definition
CUST_ID	ID of Credit Card user
BALANCE	Credit Card balance
BALANCE_FREQUENCY	Frequency that balance is updated (1=frequent, 0=not frequent)
PURCHASES	The total amount of purchases from the credit card
ONEOFF_PURCHASES	Maximum purchase amount at one time
INSTALLMENT_PUR…	Total Amount from installment purchase
CASH_ADVANCE	Cash advance amount taken from user
PURCHASES_FREQUENCY	Frequency purchases are made (1=frequent, 0=not frequent)
ONEOFFPURCHASESFR…	Frequency one-off purchases are made
PURCHASEINSTALLME…	Frequency purchase installments are made
CASHADVANCEFREQUE…	Frequency cash advance is requested
CASHADVANCETRX	Number of transactions made with “Cash in Advance”
PURCHASES_TRX	Amount of purchase transactions
CREDIT_LIMIT	Credit Card limit for user
PAYMENTS	Amount of payments completed by user
MINIMUM_PAYMENTS	Minimum amount of payments from user
PRCFULLPAYMENT	Percentage of full payment paid
TENURE	Length of credit card service tenure

With the exception of the CUST_ID which is a character type, all the other variables are numeric features. We can also observe different types of numeric features. There are your continuous numeric variables such as BALANCE and PURCHASES and there are variables like BALANCE_FREQUENCY which are on a range from 0 to 1. You can also see count variables such as TENURE, where each number represents a specific duration. When creating using PCA, we will need to be mindful about what variables to use as PCA works best with continuous variables. We will also have to address the null values in CREDIT_LIMIT, and MINIMUM_PAYMENTS.

glimpse(ccdataset)

## Rows: 8,950
## Columns: 18
## $ CUST_ID                          <chr> "C10001", "C10002", "C10003", "C10004…
## $ BALANCE                          <dbl> 40.90075, 3202.46742, 2495.14886, 166…
## $ BALANCE_FREQUENCY                <dbl> 0.818182, 0.909091, 1.000000, 0.63636…
## $ PURCHASES                        <dbl> 95.40, 0.00, 773.17, 1499.00, 16.00, …
## $ ONEOFF_PURCHASES                 <dbl> 0.00, 0.00, 773.17, 1499.00, 16.00, 0…
## $ INSTALLMENTS_PURCHASES           <dbl> 95.40, 0.00, 0.00, 0.00, 0.00, 1333.2…
## $ CASH_ADVANCE                     <dbl> 0.0000, 6442.9455, 0.0000, 205.7880, …
## $ PURCHASES_FREQUENCY              <dbl> 0.166667, 0.000000, 1.000000, 0.08333…
## $ ONEOFF_PURCHASES_FREQUENCY       <dbl> 0.000000, 0.000000, 1.000000, 0.08333…
## $ PURCHASES_INSTALLMENTS_FREQUENCY <dbl> 0.083333, 0.000000, 0.000000, 0.00000…
## $ CASH_ADVANCE_FREQUENCY           <dbl> 0.000000, 0.250000, 0.000000, 0.08333…
## $ CASH_ADVANCE_TRX                 <dbl> 0, 4, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ PURCHASES_TRX                    <dbl> 2, 0, 12, 1, 1, 8, 64, 12, 5, 3, 12, …
## $ CREDIT_LIMIT                     <dbl> 1000, 7000, 7500, 7500, 1200, 1800, 1…
## $ PAYMENTS                         <dbl> 201.8021, 4103.0326, 622.0667, 0.0000…
## $ MINIMUM_PAYMENTS                 <dbl> 139.50979, 1072.34022, 627.28479, NA,…
## $ PRC_FULL_PAYMENT                 <dbl> 0.000000, 0.222222, 0.000000, 0.00000…
## $ TENURE                           <dbl> 12, 12, 12, 12, 12, 12, 12, 12, 12, 1…

summary(ccdataset)

##    CUST_ID             BALANCE        BALANCE_FREQUENCY   PURCHASES       
##  Length:8950        Min.   :    0.0   Min.   :0.0000    Min.   :    0.00  
##  Class :character   1st Qu.:  128.3   1st Qu.:0.8889    1st Qu.:   39.63  
##  Mode  :character   Median :  873.4   Median :1.0000    Median :  361.28  
##                     Mean   : 1564.5   Mean   :0.8773    Mean   : 1003.20  
##                     3rd Qu.: 2054.1   3rd Qu.:1.0000    3rd Qu.: 1110.13  
##                     Max.   :19043.1   Max.   :1.0000    Max.   :49039.57  
##                                                                           
##  ONEOFF_PURCHASES  INSTALLMENTS_PURCHASES  CASH_ADVANCE     PURCHASES_FREQUENCY
##  Min.   :    0.0   Min.   :    0.0        Min.   :    0.0   Min.   :0.00000    
##  1st Qu.:    0.0   1st Qu.:    0.0        1st Qu.:    0.0   1st Qu.:0.08333    
##  Median :   38.0   Median :   89.0        Median :    0.0   Median :0.50000    
##  Mean   :  592.4   Mean   :  411.1        Mean   :  978.9   Mean   :0.49035    
##  3rd Qu.:  577.4   3rd Qu.:  468.6        3rd Qu.: 1113.8   3rd Qu.:0.91667    
##  Max.   :40761.2   Max.   :22500.0        Max.   :47137.2   Max.   :1.00000    
##                                                                                
##  ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY
##  Min.   :0.00000            Min.   :0.0000                  
##  1st Qu.:0.00000            1st Qu.:0.0000                  
##  Median :0.08333            Median :0.1667                  
##  Mean   :0.20246            Mean   :0.3644                  
##  3rd Qu.:0.30000            3rd Qu.:0.7500                  
##  Max.   :1.00000            Max.   :1.0000                  
##                                                             
##  CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX  PURCHASES_TRX     CREDIT_LIMIT  
##  Min.   :0.0000         Min.   :  0.000   Min.   :  0.00   Min.   :   50  
##  1st Qu.:0.0000         1st Qu.:  0.000   1st Qu.:  1.00   1st Qu.: 1600  
##  Median :0.0000         Median :  0.000   Median :  7.00   Median : 3000  
##  Mean   :0.1351         Mean   :  3.249   Mean   : 14.71   Mean   : 4494  
##  3rd Qu.:0.2222         3rd Qu.:  4.000   3rd Qu.: 17.00   3rd Qu.: 6500  
##  Max.   :1.5000         Max.   :123.000   Max.   :358.00   Max.   :30000  
##                                                            NA's   :1      
##     PAYMENTS       MINIMUM_PAYMENTS   PRC_FULL_PAYMENT     TENURE     
##  Min.   :    0.0   Min.   :    0.02   Min.   :0.0000   Min.   : 6.00  
##  1st Qu.:  383.3   1st Qu.:  169.12   1st Qu.:0.0000   1st Qu.:12.00  
##  Median :  856.9   Median :  312.34   Median :0.0000   Median :12.00  
##  Mean   : 1733.1   Mean   :  864.21   Mean   :0.1537   Mean   :11.52  
##  3rd Qu.: 1901.1   3rd Qu.:  825.49   3rd Qu.:0.1429   3rd Qu.:12.00  
##  Max.   :50721.5   Max.   :76406.21   Max.   :1.0000   Max.   :12.00  
##                    NA's   :313

##                          CUST_ID                          BALANCE 
##                                0                                0 
##                BALANCE_FREQUENCY                        PURCHASES 
##                                0                                0 
##                 ONEOFF_PURCHASES           INSTALLMENTS_PURCHASES 
##                                0                                0 
##                     CASH_ADVANCE              PURCHASES_FREQUENCY 
##                                0                                0 
##       ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY 
##                                0                                0 
##           CASH_ADVANCE_FREQUENCY                 CASH_ADVANCE_TRX 
##                                0                                0 
##                    PURCHASES_TRX                     CREDIT_LIMIT 
##                                0                                1 
##                         PAYMENTS                 MINIMUM_PAYMENTS 
##                                0                              313 
##                 PRC_FULL_PAYMENT                           TENURE 
##                                0                                0

First K-means Clustering Model

Data Preparation

Handling Nulls

To handle the nulls, I chose the replace the missing data with the median of the variables. By using the median, we avoid skewing the data and maintain the overall integrity of the dataset.

# Impute missing values
ccdataset$MINIMUM_PAYMENTS[is.na(ccdataset$MINIMUM_PAYMENTS)] <- median(ccdataset $MINIMUM_PAYMENTS, na.rm = TRUE)

ccdataset$CREDIT_LIMIT[is.na(ccdataset$CREDIT_LIMIT)] <- median(ccdataset$CREDIT_LIMIT, na.rm = TRUE)

Scaling Data

Looking at the distribution of the data, we can see the variables are right skewed. Based on the density plot and the summary from earlier, we can see that each variable has different scales. To ensure certain variables don’t have larger influence due to the scaling, we have to make sure all the variables used for PCA are scaled appropriately. This is necessary for PCA because it’s technique relies on feature variability.

Prior to using the scale function, we will remove any variables that are not going to be used.

ccdataset1 <- ccdataset %>%
  select(-BALANCE_FREQUENCY,-PURCHASES_FREQUENCY,-ONEOFF_PURCHASES_FREQUENCY,-PURCHASES_INSTALLMENTS_FREQUENCY,-CASH_ADVANCE_FREQUENCY,-PRC_FULL_PAYMENT,-CUST_ID,-TENURE)

Now that our features are scaled. We can perform PCA.

data_scaled <- scale(ccdataset1)

PCA

Here we performed PCA, and are visualizing the Scree Plot. This shows us the variances that is explained by each principal component. The first three components make up a significant amount of variance, which tells us the importance of the variables.

I’m keeping all the components to increase the complexity of the data. Though everything after the first three components might not capture as much variance, they can still contain important behavioral information on the customers that is worth examining.

pca_result <- prcomp(data_scaled, center = TRUE, scale. = TRUE)
fviz_eig(pca_result)

Below I am using the fviz_nbclust function to look at the optimal number of clusters. This is calculated by using the sum of squares - which looks at the variability of observations within the clusters. The optimal number of clusters appears to be 3. We can see the rate of decrease in sum of squares slow down after cluster 3. This tells us there is diminishing returns when adding more than 3 clusters.

set.seed(123)
fviz_nbclust(data_scaled, kmeans, method = "wss")  # Elbow method to determine the optimal number of clusters

Model Building

When visualizing our clusters, we can see three distinct groups. Cluster 1 is in red and is spread in the upper values of Dim2 and the lower values of Dim1. We can see cluster 2 is around the lower value of Dim2 and then spread across the lower to higher end of Dim1. Cluster 3 is in the lower value of Dim2 and Dim1.

set.seed(124)
k <- 3  
knn_res <- kmeans(data_scaled, centers = k)
fviz_cluster(knn_res, data = data_scaled)

Analysis

In examining the mean z scores of each variable for every cluster, we can see 3 distinct credit credit profiles with different spending and borrowing habits.

The score values indicate the deviation from the mean. If a score is positive, then that variable within that cluster is greater than the overall dataset’s mean. A negative value indicates that the variable’s score is lower than the overall dataset’s mean.

Below I breakdown each cluster.

cluster_profiles <- aggregate(data_scaled, by=list(cluster=knn_res$cluster), FUN=mean)
print(cluster_profiles)

##   cluster   BALANCE  PURCHASES ONEOFF_PURCHASES INSTALLMENTS_PURCHASES
## 1       1  1.596769 -0.1793127       -0.1255767             -0.1933779
## 2       2  0.832024  2.9423145        2.4932632              2.3764671
## 3       3 -0.312620 -0.1544720       -0.1352014             -0.1168396
##   CASH_ADVANCE CASH_ADVANCE_TRX PURCHASES_TRX CREDIT_LIMIT   PAYMENTS
## 1    1.7038412        1.4589341    -0.2084392    1.0090386  0.7014612
## 2   -0.1958818       -0.2090157     2.8785234    1.3423195  1.8011723
## 3   -0.2659075       -0.2251080    -0.1457337   -0.2485445 -0.2269894
##   MINIMUM_PAYMENTS
## 1        0.6558583
## 2        0.4278902
## 3       -0.1337854

Cluster 1

In cluster 1 we can observe that these customers carry high balances and have less purchasing activity.

CASH_ADVANCE(1.7038412) and BALANCE(1.596769) have the highest positive values - which tells us customers in cluster 1 have high cash advance usage and high credit card balances.

The customers in the cluster also high higher than average CREDIT_LIMIT(1.0090386) and make large payments (PAYMENTS, 0.7014612). This makes sense because of the large balances carried by those in cluster 1.

PURCHASES_TRX(-0.2084392),INSTALLMENT_PURCHASES(-0.1933779), and PURCHASES(-0.1793127) have the highest negative values which tells us that there is lower than average purchasing activity in this group.

Cluster 2

Cluster 2 is made up of customers that have large purchasing activity and low cash advance/borrowing events. Active shoppers with high purchasing power.

Purchasing variables such as PURCHASES(2.9423145), PURCHASES_TRX(2.8785234), ONEOFF_PURCHASES(2.4932632), and INSTALLMENTS_PURCHASES(2.3764671) all have high positive means. Not only are the purchase transaction numbers high but the amount of the purchases themselves as well.

This group borrows less money through cash advances as show by the negative values for CASH_ADVANCE(-0.1958818) and CASH_ADVANCE_TRX(-0.2090157).

Cluster 3

Cluster 3 shows negative values in all categories, which tells us these users have overall low engagement. This is especially evident by the low PURCHASES_TRX(-0.1457337) and CREDIT_LIMIT(-0.2485445).

Second K-means Clustering Model

For the second model, I will select the variables that were removed, with exception to the variables CUST_ID, and TENURE.

Data Exploration

At first glance, we can see that these variables need to be scaled as CASH_ADVANCE_FREQUENCY has a scale of 0 to 1.5 unlike the other variables which are scaled from 0 to 1. This can be problematic for us when performing PCA.

### Data Preparation

Scaling Data

Now that our data is scaled, we can perform the PCA technique.

data_scaled2 <- scale(ccdataset2)

PCA

Similar to the first model, we can see that the first three principal components explain a significant amount of variance. However, we only observe a sharp decline in the percentage until after the fifth component.

I’m keeping all the components as with the first model, to increase the complexity of the data.

pca_result2 <- prcomp(data_scaled2, center = TRUE, scale. = TRUE)
fviz_eig(pca_result2)

Using the fviz_nbclust function to look at the optimal number of clusters, we can see the elbow or bend in the plot at around K = 3. This will determine our ideal K value.

set.seed(235)
fviz_nbclust(data_scaled2, kmeans, method = "wss")

Model Building

We can see that cluster 1 is in the middle of both cluster 2 and cluster 3, which means there might be some overlap in characteristics. Cluster 2 is mainly on the right side of the plot, which tells us it is very influential. Cluster 3 is on the left side of the plot and looks to have different characteristics from cluster 2 and 3.

set.seed(125)
k <- 3  
knn_res2 <- kmeans(data_scaled2, centers = k)
fviz_cluster(knn_res2, data = data_scaled2)

Analysis

Cluster 1

The customers in cluster 1 frequently use their cards for installment purchases and pay their balance full more often.

Cluster 2

This group of customers often makes one-off purchases, and overall have high purchasing activity.

Cluster 3

The third cluster is classified by customers who frequently use cash advances.

cluster_profiles2 <- aggregate(data_scaled2, by=list(cluster=knn_res2$cluster), FUN=mean)
print(cluster_profiles2)

##   cluster BALANCE_FREQUENCY PURCHASES_FREQUENCY ONEOFF_PURCHASES_FREQUENCY
## 1       1         0.1336238           0.9144197                 -0.3883945
## 2       2         0.4023126           0.9984649                  1.9499965
## 3       3        -0.2119321          -0.8735944                 -0.4092443
##   PURCHASES_INSTALLMENTS_FREQUENCY CASH_ADVANCE_FREQUENCY PRC_FULL_PAYMENT
## 1                        1.1104971             -0.3800583        0.3268186
## 2                        0.3256025             -0.2228577        0.3715532
## 3                       -0.7693975              0.2999415       -0.3170577

Hierarchical Clustering

Next we’re going to create the dataset, we’re going to use for a hierarchical clustering model. Unlike K-means clustering which uses a predefined K to create it’s distance based groups, hierarchical clustering doesn’t require an N; it uses a divisive method which looks at one record and then arranges clusters into a hierarchy. Like k-means clustering it uses proximity in determining relationships. Given the dendogram shape, it handles different types well unlike k-means clustering. For this reason we’re going to keep our variables together when building our hierarchical model.

ccdataset3 <- ccdataset %>%
  select(-CUST_ID)

I will repeat the same steps I performed previously, including scaling the data and performing PCA to reduce the dimension of our dataset.

data_scaled3 <- scale(ccdataset3)

PCA

In looking at the variance, we can see that dimensions 1 and 2 explain the most variance. However, unlike our first two k-means clustering models, the variance here is spread more along the dimensions.

pca_result3 <- prcomp(data_scaled3, center = TRUE, scale. = TRUE)
fviz_eig(pca_result3, addlabels = TRUE, ylim = c(0, 100))

Given the large number of small groups. We’ll have to assign a cut-off and examine the groups further.

Here we create our cutoff k, to be 5. In looking at each cluster, I notice the following characteristics:

Cluster 1

Cluster 1 is characterized by customers who are conservative spenders. They have the lowest balance, and low overall purchasing activity.

Cluster 2

Cluster 2 has customers that have high balances and frequently use case advances. Their purchase activity is low which tells us they use their credit card for short term borrowing instead of making purchases.

Cluster 3

This group contains customers who make large purchases, have high purchasing volume and also have a high credit limit to support their activity.

Cluster 4

These customers have high purchasing frequency while keeping a low balancing and making ontime payments. In addition, they have a decent credit limit.

Cluster 5

This cluster, similar to cluster 1 is either low or middle in purchasing activity.

## # A tibble: 5 × 18
##   cluster BALANCE BALANCE_FREQUENCY PURCHASES ONEOFF_PURCHASES
##     <int>   <dbl>             <dbl>     <dbl>            <dbl>
## 1       1    958.             0.792      404.             294.
## 2       2   4451.             0.979      370.             237.
## 3       3   4037.             0.984     8550.            5785.
## 4       4   1067.             0.967     1527.             725.
## 5       5   1027.             0.831      482.             271.
## # ℹ 13 more variables: INSTALLMENTS_PURCHASES <dbl>, CASH_ADVANCE <dbl>,
## #   PURCHASES_FREQUENCY <dbl>, ONEOFF_PURCHASES_FREQUENCY <dbl>,
## #   PURCHASES_INSTALLMENTS_FREQUENCY <dbl>, CASH_ADVANCE_FREQUENCY <dbl>,
## #   CASH_ADVANCE_TRX <dbl>, PURCHASES_TRX <dbl>, CREDIT_LIMIT <dbl>,
## #   PAYMENTS <dbl>, MINIMUM_PAYMENTS <dbl>, PRC_FULL_PAYMENT <dbl>,
## #   TENURE <dbl>

Marketing Strategy

Now that we’ve seen clusters in using both the k-means clustering and the hierarchical clustering technique. I’m going to provide enagement strategies based on the customers groups we’ve seen.

Low Activity/Conservative Spenders

These group of spenders are characterized by their low engagement across all categories including spending, borrowing activities and have a low to medium balance on their account.

Recommendation: To incentivize this group to use their card, rewards can be offered for his group. Special offers on installment purchases can also be offered to encourage frequent spending.

Cash Advance Users

The customers in this groups have high balances in their account and are frequent users of cash advances. They are also characterized by their low purchasing activity, as they are using their cards primarily for short term financing.

Recommendation: Promotions can be offered on cash advance fees to encourage this activity. In addition, alerts or tools can be created to help users manage their spending.

Premium Spenders

Premium spenders make large purchases, have high purchasing volume, possess a high credit limit and make timely payments. We want to continue maximize both revenue and customer satisfaction.

Recommendation: We can offer a spending tier-based loyal program. The more spending, the more rewards they are eligible for. They should also be offered higher credit limits and exclusive offers to improve their experience.

Low Engagement

These users having medium to low engagement across spending and borrowing activities. Our objective is to re-engage and offer different offers and services.

Recommendation: Create special one-time offers to users. A variety of related financial products can be offered such as loans, cash advances, etc. Surveys could also be sent to gauge their needs and understand their inactivity.

Other Recommendations

Digital strategy: It’s important that the online site or app have personalized displays tailored to their behavior, in the form of notifications and dashboards that provides insights into the customers behavior, offer financial assistance and cross-promote other products.

Risk Management: Another strategy to look into is to minimize risk and costs. A strong credit risk model can be created using this dataset, which can predict risk and manage other account related decisions like credit limits.

Final Thoughts

Leveraging customer behavior experience is key in any institution providing services. The k-means clustering and hierarchical clustering were both able to give us insights into customer behavior. Given the type of data types within the data (continuous, discrete, and nominal), an algorithm like hierarchical clustering can handle this better as it looks at each data point individually and creates clusters based on relationships with no predetermined K. Given the hierarchical decision making, it can handle different data types that have unique relationships. When seperating the continuous and nominal types for k-means clustering, the groups did not give us as useful information as when all the variables together in the hierarchical model.

Data622_Final_Project

Christian Uriostegui

2024-05-13

Abstract

Keywords

Introduction

Algorithms

Objective

Initial Data Exploration

First K-means Clustering Model

Data Preparation

Handling Nulls

Scaling Data

PCA

Model Building

Analysis

Cluster 1

Cluster 2

Cluster 3

Second K-means Clustering Model

Data Exploration

Scaling Data

PCA

Model Building

Analysis

Cluster 1

Cluster 2

Cluster 3

Hierarchical Clustering

PCA

Cluster 1

Cluster 2

Cluster 3

Cluster 4

Cluster 5

Marketing Strategy

Low Activity/Conservative Spenders

Cash Advance Users

Premium Spenders

Low Engagement

Other Recommendations

Final Thoughts