K-Means Clustering

Customer Segmentation

You were just hired as a data scientist at MarketCo, one of the largest retail companies. MarketCo offers its customers store credit cards and has collected the transactions made on them. Your boss, Mr. Johnson, has asked you to develop a customer segmentation to inform the marketing strategy. You are given a historical dataset summarizing the usage behavior of about 9,000 active credit card holders over the last 6 months, and you are asked to perform a cluster analysis of this dataset.

Data Source: Kaggle at https://www.kaggle.com/arjunbhasin2013/ccdata/home

Data Dictionary

This dataset contains the following columns:

Variable Name	Data Type	Description	Constraints/Rules
`CUST_ID`	Integer	Row identifier or index (appears as `X` when the file is read in Question 1)	Unique, Non-null
`BALANCE`	Float	Current balance on the card	≥ 0, Non-null
`BALANCE_FREQUENCY`	Float	How frequently the balance is updated	Range: [0.0, 1.0], Non-null
`PURCHASES`	Float	Total amount of purchases made	≥ 0, Non-null
`ONEOFF_PURCHASES`	Float	Purchases made in a single transaction	≥ 0, ≤ `PURCHASES`, Non-null
`INSTALLMENTS_PURCHASES`	Float	Purchases made in installments	≥ 0, ≤ `PURCHASES`, Non-null
`CASH_ADVANCE`	Float	Cash advances taken	≥ 0, Non-null
`PURCHASES_FREQUENCY`	Float	Frequency of making purchases	Range: [0.0, 1.0], Non-null
`ONEOFF_PURCHASES_FREQUENCY`	Float	Frequency of one-off purchases	Range: [0.0, 1.0], Non-null
`PURCHASES_INSTALLMENTS_FREQUENCY`	Float	Frequency of purchases in installments	Range: [0.0, 1.0], Non-null
`CASH_ADVANCE_FREQUENCY`	Float	Frequency of cash advances	Range: [0.0, 1.0], Non-null
`CASH_ADVANCE_TRX`	Integer	Number of cash advance transactions	≥ 0, Non-null
`PURCHASES_TRX`	Integer	Number of purchase transactions	≥ 0, Non-null
`CREDIT_LIMIT`	Integer	Credit limit assigned to the card	≥ 0, Non-null
`PAYMENTS`	Float	Total amount of payments made	≥ 0, Non-null
`MINIMUM_PAYMENTS`	Float	Minimum payments made	≥ 0, Non-null
`PRC_FULL_PAYMENT`	Float	Proportion of months with full payment of balance	Range: [0.0, 1.0], Non-null
`TENURE`	Integer	Number of months the customer has been active (usually 12)	Integer ≥ 0, Non-null, typically constant (12) across records
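The constraints in the table above can be checked programmatically before clustering. The following is a minimal sketch on a made-up toy data frame, not the real file; the helper name check_constraints and the sample values are assumptions introduced purely for illustration, and only a few of the rules are covered.

```r
# Toy data frame standing in for the real file (values are invented for illustration).
toy <- data.frame(
  BALANCE             = c(40.9, 3202.5, 2495.1),
  PURCHASES           = c(95.4, 0.0, 773.2),
  ONEOFF_PURCHASES    = c(0.0, 0.0, 773.2),
  PURCHASES_FREQUENCY = c(0.17, 0.0, 1.0),
  TENURE              = c(12L, 12L, 12L)
)

# Hypothetical helper: returns a named logical vector, one entry per rule.
check_constraints <- function(df) {
  c(
    no_missing       = !anyNA(df),
    balance_nonneg   = all(df$BALANCE >= 0),
    purchases_nonneg = all(df$PURCHASES >= 0),
    oneoff_le_total  = all(df$ONEOFF_PURCHASES <= df$PURCHASES),
    freq_in_unit     = all(df$PURCHASES_FREQUENCY >= 0 &
                             df$PURCHASES_FREQUENCY <= 1),
    tenure_nonneg    = all(df$TENURE >= 0)
  )
}

checks <- check_constraints(toy)
print(checks)
```

Any entry that comes back FALSE points at a rule worth investigating before the data is scaled, since missing values in particular would cause kmeans() to fail.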

Question 1

Load the dataset CreditCards.csv into memory.

Read the dataset into memory

CreditCards.df <- read.csv("data/CreditCards.csv")

Display the dimensions of the data frame (number of rows and columns)

dim(CreditCards.df)

## [1] 8950   18

Display the column names of the data frame

colnames(CreditCards.df)

##  [1] "X"                                "BALANCE"                         
##  [3] "BALANCE_FREQUENCY"                "PURCHASES"                       
##  [5] "ONEOFF_PURCHASES"                 "INSTALLMENTS_PURCHASES"          
##  [7] "CASH_ADVANCE"                     "PURCHASES_FREQUENCY"             
##  [9] "ONEOFF_PURCHASES_FREQUENCY"       "PURCHASES_INSTALLMENTS_FREQUENCY"
## [11] "CASH_ADVANCE_FREQUENCY"           "CASH_ADVANCE_TRX"                
## [13] "PURCHASES_TRX"                    "CREDIT_LIMIT"                    
## [15] "PAYMENTS"                         "MINIMUM_PAYMENTS"                
## [17] "PRC_FULL_PAYMENT"                 "TENURE"

Question 2

Perform the k-means cluster analysis

Section A

Remove the first column, the customer identifier (CUST_ID in the data dictionary, loaded here as X), since it carries no information for clustering.

Remove the first column

CreditCards.df <- CreditCards.df[,-1]

Display the dimensions of the data frame (number of rows and columns)

dim(CreditCards.df)

## [1] 8950   17

Display the column names of the data frame

colnames(CreditCards.df)

##  [1] "BALANCE"                          "BALANCE_FREQUENCY"               
##  [3] "PURCHASES"                        "ONEOFF_PURCHASES"                
##  [5] "INSTALLMENTS_PURCHASES"           "CASH_ADVANCE"                    
##  [7] "PURCHASES_FREQUENCY"              "ONEOFF_PURCHASES_FREQUENCY"      
##  [9] "PURCHASES_INSTALLMENTS_FREQUENCY" "CASH_ADVANCE_FREQUENCY"          
## [11] "CASH_ADVANCE_TRX"                 "PURCHASES_TRX"                   
## [13] "CREDIT_LIMIT"                     "PAYMENTS"                        
## [15] "MINIMUM_PAYMENTS"                 "PRC_FULL_PAYMENT"                
## [17] "TENURE"

Section B

Determine the optimal number of clusters and justify your answer. This step may take a while to run because the dataset is large.

Set seed for reproducibility

set.seed(123)

Scale the data for clustering

scaled_data <- scale(CreditCards.df)
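Scaling matters here because k-means relies on Euclidean distance, so a column measured in currency units (such as BALANCE) would otherwise dominate a column bounded in [0, 1] (such as PURCHASES_FREQUENCY). As a small sketch on invented values, the following shows that scale() with its defaults is equivalent to subtracting each column's mean and dividing by its standard deviation.

```r
# Invented values, only to show the effect of scale(); not the real data.
toy <- data.frame(
  BALANCE             = c(100, 2500, 8000, 400),
  PURCHASES_FREQUENCY = c(0.1, 0.9, 0.5, 0.3)
)

scaled_toy <- scale(toy)  # centers and scales each column (z-scores)

# Manual z-score: subtract column means, then divide by column standard deviations.
centered <- sweep(as.matrix(toy), 2, colMeans(toy))          # default FUN is "-"
manual   <- sweep(centered, 2, apply(toy, 2, sd), "/")

same_result <- isTRUE(all.equal(manual, scaled_toy, check.attributes = FALSE))
col_means   <- colMeans(scaled_toy)      # approximately 0 for every column
col_sds     <- apply(scaled_toy, 2, sd)  # exactly or approximately 1

print(same_result)
print(round(col_means, 10))
print(col_sds)
```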

Determine the optimal number of clusters using “gap_stat” Method

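# fviz_nbclust() comes from factoextra; ggtitle(), theme_test() and theme() from ggplot2
library(factoextra)
library(ggplot2)

fviz_nbclust(scaled_data, kmeans, method = "gap_stat") +
  ggtitle("Optimal Number of Clusters - Gap Statistic Method") +
  theme_test() +
  theme(
    plot.title = element_text(size = 14, face = "bold"),
    axis.text = element_text(size = 10)
  )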

The gap statistic compares the within-cluster dispersion obtained for each k with the dispersion expected under a reference distribution that has no cluster structure; a larger gap means more structure than chance alone would produce. In this analysis, the gap statistic increases steadily up to k = 7, the value marked as optimal on the plot, and any gains beyond k = 7 are minimal, suggesting diminishing returns. The error bars show the simulation standard error of each gap value, so they indicate how stable the estimates are rather than confirming the choice by themselves. On this evidence, k = 7 is a reasonable segmentation for this dataset.
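To make the selection rule explicit, the gap statistic can also be computed directly with clusGap() from the cluster package and inspected as numbers. This is a sketch on synthetic data (three well-separated blobs), not the credit card data; the small number of bootstrap samples B is an assumption chosen only to keep the run short.

```r
library(cluster)  # recommended package, normally installed with R

set.seed(123)
# Synthetic data: three 2-D blobs of 50 points each (not the credit card data).
toy <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 6, sd = 0.3), ncol = 2)
)

# B is small so the sketch runs quickly; a larger B gives steadier estimates.
gap <- clusGap(toy, FUNcluster = kmeans, K.max = 6, B = 20,
               nstart = 10, verbose = FALSE)

gap_values <- gap$Tab[, "gap"]     # gap statistic for k = 1..K.max
gap_se     <- gap$Tab[, "SE.sim"]  # simulation standard error of each gap

# Two of the selection rules offered by maxSE(); they can disagree.
k_globalmax <- maxSE(gap_values, gap_se, method = "globalmax")
k_firstSE   <- maxSE(gap_values, gap_se, method = "firstSEmax")

print(round(gap$Tab, 3))
print(c(globalmax = k_globalmax, firstSEmax = k_firstSE))
```

Because different rules can pick different values of k from the same curve, it is worth checking which rule the plotting helper applies before reading the marked k as the global maximum.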

Determine the optimal number of clusters using “Silhouette” Method

fviz_nbclust(scaled_data, kmeans, method = "silhouette") +
  ggtitle("Optimal Number of Clusters - Silhouette Method") +
  theme_test() +
  theme(
    plot.title = element_text(size = 14, face = "bold"),
    axis.text = element_text(size = 10)
  )

The silhouette method also recommends 7 clusters for k-means on scaled_data: the average silhouette width is highest at k = 7. The silhouette width balances cluster cohesion (how close points in the same cluster are) against separation (how distinct each cluster is from the others), so a higher average indicates a better-defined partition.
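The average silhouette width behind this recommendation can be computed by hand with silhouette() from the cluster package. The sketch below uses synthetic data (three blobs), not the credit card data, and simply picks the k with the highest average width.

```r
library(cluster)  # provides silhouette()

set.seed(123)
# Synthetic data: three 2-D blobs of 50 points each (not the credit card data).
toy <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 6, sd = 0.3), ncol = 2)
)
d <- dist(toy)  # pairwise Euclidean distances, reused for every k

candidate_k <- 2:6  # silhouette width is undefined for a single cluster
avg_sil <- sapply(candidate_k, function(k) {
  km <- kmeans(toy, centers = k, nstart = 10)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
names(avg_sil) <- candidate_k

best_k <- candidate_k[which.max(avg_sil)]
print(round(avg_sil, 3))
print(best_k)
```

Note that dist() on the full 8,950-row dataset builds roughly 40 million pairwise distances, which is one reason this step is slow on the real data.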

Set the optimal number of clusters observed above

optimal_k <- 7

Section C

Perform k-means clustering using the optimal number of clusters.

Perform k-means clustering

kmeans_result <- kmeans(scaled_data, centers = optimal_k, nstart = 25)

Display the structure of the k-means result object

summary(kmeans_result)

##              Length Class  Mode   
## cluster      8950   -none- numeric
## centers       119   -none- numeric
## totss           1   -none- numeric
## withinss        7   -none- numeric
## tot.withinss    1   -none- numeric
## betweenss       1   -none- numeric
## size            7   -none- numeric
## iter            1   -none- numeric
## ifault          1   -none- numeric

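The k-means analysis assigned all 8,950 observations to 7 clusters, each with its own centroid. Note that summary() on a kmeans object reports only the length, class, and mode of each component, not the fitted values: cluster holds 8,950 assignments, centers holds 7 × 17 = 119 centroid coordinates, and withinss and size hold one value per cluster. A length of 1 for iter and ifault means each is a single number; it does not mean the algorithm converged in 1 iteration. Compactness and separation cannot be read from this output either. They require the actual values of kmeans_result$tot.withinss and kmeans_result$betweenss (for example, the ratio betweenss / totss), and convergence should be checked through kmeans_result$iter and kmeans_result$ifault.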
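The values that describe clustering quality and convergence live in the components of the kmeans object itself. The following sketch fits k-means on synthetic data (not the credit card data) and prints those components; the same accessors would apply to kmeans_result.

```r
set.seed(123)
# Synthetic data: three 2-D blobs of 50 points each, standardized.
toy <- scale(rbind(
  matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 6, sd = 0.3), ncol = 2)
))

km <- kmeans(toy, centers = 3, nstart = 25)

print(km$size)          # number of observations in each cluster
print(km$iter)          # iterations used by the reported run
print(km$ifault)        # 0 indicates no algorithm problem was flagged
print(km$tot.withinss)  # total within-cluster sum of squares (compactness)

# Share of total variance explained by the partition (separation).
explained <- km$betweenss / km$totss
print(round(explained, 3))

print(dim(km$centers))  # one row per cluster, one column per variable
```

The total sum of squares splits exactly into within and between parts, so betweenss / totss lies between 0 and 1. It always grows as k increases, which is why it is better used to compare solutions than as a target in its own right. If kmeans() warns that it did not converge, the iter.max argument (default 10) can be raised.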

Section D

Visualize the clusters in different colors.

Add the cluster assignment to the original data

CreditCards.df$Cluster <- as.factor(kmeans_result$cluster)
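Once the labels are attached to the unscaled data, each segment can be described in the original units by averaging every variable within each cluster. This is a minimal sketch on invented data with two obvious groups; the column values are assumptions for illustration only.

```r
set.seed(123)
# Invented data: one group with low balance and high purchases, one the reverse.
toy <- data.frame(
  BALANCE   = c(rnorm(20, mean = 500,  sd = 50),  rnorm(20, mean = 5000, sd = 300)),
  PURCHASES = c(rnorm(20, mean = 2000, sd = 200), rnorm(20, mean = 100,  sd = 20))
)

km <- kmeans(scale(toy), centers = 2, nstart = 10)
toy$Cluster <- as.factor(km$cluster)

# Mean of every numeric column per cluster, in the original (unscaled) units.
profile <- aggregate(. ~ Cluster, data = toy, FUN = mean)
sizes   <- table(toy$Cluster)

print(profile)
print(sizes)
```

One caveat: after the factor column Cluster has been added, calling scale() on the whole data frame again would fail because the input is no longer entirely numeric, so any rescaling should exclude that column.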

Visualize the clusters

fviz_cluster(
  kmeans_result,
  data = scaled_data,
  geom = "point",
  ellipse.type = "norm",
  ggtheme = theme_test(),
  main = "Cluster Visualization"
)

The plot shows the 7 clusters projected onto the first two principal components (Dim1 and Dim2), which together explain 47.6% of the variance in the data. Some clusters, like Cluster 5, are distinct and well-separated, while others, like Clusters 6 and 7, overlap slightly, indicating shared characteristics. Compact clusters (e.g., Cluster 2) suggest closely related points, while wider clusters (e.g., Cluster 6) show greater variability. Because more than half of the variance lies outside these two dimensions, overlap in the plot does not necessarily mean overlap in the full 17-dimensional space. Overall, the plot is consistent with the clustering structure but highlights potential for refinement in the overlapping areas.
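When the data has more than two variables, fviz_cluster() draws the points on the first two principal components, and the percentages on the axes are the shares of variance those components explain. The sketch below reproduces that calculation with prcomp() on synthetic correlated data (not the credit card data).

```r
set.seed(123)
# Synthetic data: 200 rows, 5 correlated columns, standardized.
raw <- matrix(rnorm(200 * 5), ncol = 5) %*% matrix(runif(25), ncol = 5)
toy <- scale(raw)

pca <- prcomp(toy)  # data is already centered and scaled

# Share of total variance carried by each principal component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
first_two     <- sum(var_explained[1:2])

# Coordinates of each observation on the first two components.
coords <- pca$x[, 1:2]

print(round(var_explained, 3))
print(round(first_two, 3))
print(head(coords, 3))
```

Applying the same calculation to scaled_data should give the share of variance reported on the plot axes, and inspecting later components can show whether clusters that overlap in two dimensions separate along a third.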