K-Means Clustering


Customer Segmentation


You were just hired as a data scientist at MarketCo, one of the largest retail companies. MarketCo offers its customers store credit cards and has collected the transactions made with them. Your boss, Mr. Johnson, has asked you to develop a customer segmentation to inform the marketing strategy. You are given a historical dataset summarizing the usage behavior of about 9,000 active credit card holders over the last 6 months, and you are asked to perform a cluster analysis of it.

Data Source: Kaggle at https://www.kaggle.com/arjunbhasin2013/ccdata/home

Data Dictionary

This dataset contains the following columns:

| Variable Name | Data Type | Description | Constraints/Rules |
|---|---|---|---|
| CUST_ID | Integer | Row identifier or index | Unique, Non-null |
| BALANCE | Float | Current balance on the card | ≥ 0, Non-null |
| BALANCE_FREQUENCY | Float | How frequently the balance is updated | Range: [0.0, 1.0], Non-null |
| PURCHASES | Float | Total amount of purchases made | ≥ 0, Non-null |
| ONEOFF_PURCHASES | Float | Purchases made in a single transaction | ≥ 0, ≤ PURCHASES, Non-null |
| INSTALLMENTS_PURCHASES | Float | Purchases made in installments | ≥ 0, ≤ PURCHASES, Non-null |
| CASH_ADVANCE | Float | Cash advances taken | ≥ 0, Non-null |
| PURCHASES_FREQUENCY | Float | Frequency of making purchases | Range: [0.0, 1.0], Non-null |
| ONEOFF_PURCHASES_FREQUENCY | Float | Frequency of one-off purchases | Range: [0.0, 1.0], Non-null |
| PURCHASES_INSTALLMENTS_FREQUENCY | Float | Frequency of purchases in installments | Range: [0.0, 1.0], Non-null |
| CASH_ADVANCE_FREQUENCY | Float | Frequency of cash advances | Range: [0.0, 1.0], Non-null |
| CASH_ADVANCE_TRX | Integer | Number of cash advance transactions | ≥ 0, Non-null |
| PURCHASES_TRX | Integer | Number of purchase transactions | ≥ 0, Non-null |
| CREDIT_LIMIT | Integer | Credit limit assigned to the card | ≥ 0, Non-null |
| PAYMENTS | Float | Total amount of payments made | ≥ 0, Non-null |
| MINIMUM_PAYMENTS | Float | Minimum payments made | ≥ 0, Non-null |
| PRC_FULL_PAYMENT | Float | Proportion of months with full payment of balance | Range: [0.0, 1.0], Non-null |
| TENURE | Integer | Number of months the customer has been active (usually 12) | Integer ≥ 0, Non-null, typically constant (12) across records |


Question 1

Load the dataset CreditCards.csv into memory.

Read the dataset into memory

CreditCards.df <- read.csv("data/CreditCards.csv")

Display the dimensions of the data frame (number of rows and columns)

dim(CreditCards.df)
## [1] 8950   18

Display the column names of the data frame

colnames(CreditCards.df)
##  [1] "X"                                "BALANCE"                         
##  [3] "BALANCE_FREQUENCY"                "PURCHASES"                       
##  [5] "ONEOFF_PURCHASES"                 "INSTALLMENTS_PURCHASES"          
##  [7] "CASH_ADVANCE"                     "PURCHASES_FREQUENCY"             
##  [9] "ONEOFF_PURCHASES_FREQUENCY"       "PURCHASES_INSTALLMENTS_FREQUENCY"
## [11] "CASH_ADVANCE_FREQUENCY"           "CASH_ADVANCE_TRX"                
## [13] "PURCHASES_TRX"                    "CREDIT_LIMIT"                    
## [15] "PAYMENTS"                         "MINIMUM_PAYMENTS"                
## [17] "PRC_FULL_PAYMENT"                 "TENURE"
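Before clustering, it can be worth confirming that the loaded data respects the rules in the data dictionary. The following is a minimal sketch on a small made-up data frame; the values in toy are hypothetical, and in practice toy would be replaced by CreditCards.df.

```r
# Toy stand-in for CreditCards.df (hypothetical values, real column names)
toy <- data.frame(
  BALANCE                = c(40.9, 3202.5, 2495.1),
  BALANCE_FREQUENCY      = c(0.82, 0.91, 1.00),
  PURCHASES              = c(95.4, 0, 773.2),
  ONEOFF_PURCHASES       = c(0, 0, 773.2),
  INSTALLMENTS_PURCHASES = c(95.4, 0, 0)
)

# Number of missing values per column (the dictionary says none are allowed)
na_counts <- colSums(is.na(toy))

# Every *_FREQUENCY column should lie in [0, 1]
freq_cols <- grep("FREQUENCY", names(toy), value = TRUE)
freq_ok <- sapply(toy[freq_cols], function(x) all(x >= 0 & x <= 1))

# The two purchase components should not exceed the purchase total
parts_ok <- all(toy$ONEOFF_PURCHASES <= toy$PURCHASES) &&
  all(toy$INSTALLMENTS_PURCHASES <= toy$PURCHASES)

print(na_counts)
print(freq_ok)
print(parts_ok)
```

A non-zero entry in na_counts matters here because kmeans() stops with an error when the input contains missing values.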


Question 2

Perform the k-means cluster analysis

Section A

Remove the first column, CUST_ID (read in as X in this file), since a row identifier carries no information for clustering.

Remove the first column

CreditCards.df <- CreditCards.df[,-1]

Display the dimensions of the data frame (number of rows and columns)

dim(CreditCards.df)
## [1] 8950   17

Display the column names of the data frame

colnames(CreditCards.df)
##  [1] "BALANCE"                          "BALANCE_FREQUENCY"               
##  [3] "PURCHASES"                        "ONEOFF_PURCHASES"                
##  [5] "INSTALLMENTS_PURCHASES"           "CASH_ADVANCE"                    
##  [7] "PURCHASES_FREQUENCY"              "ONEOFF_PURCHASES_FREQUENCY"      
##  [9] "PURCHASES_INSTALLMENTS_FREQUENCY" "CASH_ADVANCE_FREQUENCY"          
## [11] "CASH_ADVANCE_TRX"                 "PURCHASES_TRX"                   
## [13] "CREDIT_LIMIT"                     "PAYMENTS"                        
## [15] "MINIMUM_PAYMENTS"                 "PRC_FULL_PAYMENT"                
## [17] "TENURE"

Section B

Determine the optimal number of clusters. Justify your answer. This step may take a while to run because the dataset is large.

Set seed for reproducibility

set.seed(123)

Scale the data for clustering

scaled_data <- scale(CreditCards.df)
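Scaling matters because k-means relies on Euclidean distance, so a variable measured in currency units (such as BALANCE) would otherwise dominate variables bounded in [0, 1] (such as the frequency columns). The sketch below uses made-up data to show what scale() does by default: each column is centered to mean 0 and divided by its standard deviation.

```r
set.seed(1)

# Toy matrix with two columns on very different scales (hypothetical values)
toy <- cbind(
  BALANCE             = rnorm(50, mean = 1500, sd = 2000),
  PURCHASES_FREQUENCY = runif(50)
)

scaled_toy <- scale(toy)

# After scaling, every column has mean ~0 and standard deviation 1
col_means <- colMeans(scaled_toy)
col_sds   <- apply(scaled_toy, 2, sd)

# The same result computed by hand: (x - mean) / sd, column by column
manual <- sweep(sweep(toy, 2, colMeans(toy), "-"), 2, apply(toy, 2, sd), "/")

print(round(col_means, 10))
print(col_sds)
print(all.equal(manual, scaled_toy, check.attributes = FALSE))
```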

Determine the optimal number of clusters using “gap_stat” Method

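# factoextra provides fviz_nbclust(); ggplot2 provides ggtitle(), theme_test(), theme()
library(factoextra)
library(ggplot2)

fviz_nbclust(scaled_data, kmeans, method = "gap_stat") +
  ggtitle("Optimal Number of Clusters - Gap Statistic Method") +
  theme_test() +
  theme(
    plot.title = element_text(size = 14, face = "bold"),
    axis.text = element_text(size = 10)
  )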

The gap statistic compares the observed within-cluster dispersion for each k with the dispersion expected under a reference distribution that has no cluster structure; larger gaps indicate stronger clustering. In this analysis, the gap statistic increases steadily and reaches its maximum at k = 7. Beyond k = 7 the curve shows no meaningful gain, suggesting diminishing returns. The error bars around each estimate support k = 7 as a reasonable choice, so 7 clusters are used for this dataset.
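The numbers behind the gap plot can also be computed directly with cluster::clusGap(), which makes the selection rule explicit. The sketch below runs on made-up toy data (three 2-D blobs), not on the credit card data; K.max, B, and nstart are illustrative choices. It assumes that fviz_nbclust() applies the "firstSEmax" rule by default, which can pick a smaller k than the global maximum, so both rules are shown.

```r
library(cluster)  # clusGap() and maxSE()

set.seed(123)

# Toy data: three well-separated blobs centered at (0,0), (3,3), (6,6)
toy <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 6, sd = 0.3), ncol = 2)
)

# Gap statistic for k = 1..6, using B = 20 reference data sets
gap <- clusGap(toy, FUNcluster = kmeans, K.max = 6, B = 20,
               verbose = FALSE, nstart = 10)
tab <- gap$Tab  # columns: logW, E.logW, gap, SE.sim

# Two ways of choosing k from the same gap curve
k_firstSE <- maxSE(tab[, "gap"], tab[, "SE.sim"], method = "firstSEmax")
k_global  <- maxSE(tab[, "gap"], tab[, "SE.sim"], method = "globalmax")

print(round(tab, 3))
print(c(firstSEmax = k_firstSE, globalmax = k_global))
```

If the two rules disagree on the real data, that is worth reporting alongside the chosen k.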

Determine the optimal number of clusters using “Silhouette” Method

fviz_nbclust(scaled_data, kmeans, method = "silhouette") +
  ggtitle("Optimal Number of Clusters - Silhouette Method") +
  theme_test() +
  theme(
    plot.title = element_text(size = 14, face = "bold"),
    axis.text = element_text(size = 10)
  )

The silhouette method also recommends 7 clusters for k-means on scaled_data: the average silhouette width is highest at k = 7. A high average silhouette width indicates a good balance between cluster cohesion (how close points in the same cluster are) and separation (how distinct each cluster is from the others).
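The average silhouette width that the plot summarizes can be reproduced by hand with cluster::silhouette(). This is a minimal sketch on made-up toy data (three 2-D blobs); the range of k and nstart are illustrative assumptions.

```r
library(cluster)  # silhouette()

set.seed(123)

# Toy data: three well-separated blobs (not the credit card data)
toy <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 6, sd = 0.3), ncol = 2)
)
d <- dist(toy)  # pairwise Euclidean distances, computed once

# Average silhouette width for each candidate k (not defined for k = 1)
ks <- 2:6
avg_sil <- sapply(ks, function(k) {
  km <- kmeans(toy, centers = k, nstart = 10)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
names(avg_sil) <- ks

# The suggested k is the one with the largest average width
best_k <- ks[which.max(avg_sil)]

print(round(avg_sil, 3))
print(best_k)
```

Note that dist() stores all pairwise distances, so on several thousand rows this step uses a noticeable amount of memory.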

Set observed optimal number

optimal_k <- 7

Section C

Perform k-means clustering using the optimal number of clusters.

Perform k-means clustering

kmeans_result <- kmeans(scaled_data, centers = optimal_k, nstart = 25)

Display the structure of the k-means result

summary(kmeans_result)
##              Length Class  Mode   
## cluster      8950   -none- numeric
## centers       119   -none- numeric
## totss           1   -none- numeric
## withinss        7   -none- numeric
## tot.withinss    1   -none- numeric
## betweenss       1   -none- numeric
## size            7   -none- numeric
## iter            1   -none- numeric
## ifault          1   -none- numeric

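The k-means analysis assigned the 8,950 observations to 7 clusters. Note that summary() on a kmeans object lists only the length, class, and mode of each component, not its value: cluster holds 8,950 assignments, centers holds 119 values (7 centroids × 17 variables), and withinss and size each hold one value per cluster. The 1 shown for iter and ifault is the length of those components, not the number of iterations or an error code. Compactness, separation, and convergence therefore cannot be read from this output; they have to be checked from kmeans_result$tot.withinss, kmeans_result$betweenss, kmeans_result$iter, and kmeans_result$ifault directly.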
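To judge the fit itself, the component values have to be read from the kmeans object rather than from summary(). The sketch below does this on made-up toy data; the object name km is hypothetical and stands in for kmeans_result.

```r
set.seed(123)

# Toy data: three blobs, scaled the same way as in the analysis
toy <- scale(rbind(
  matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 6, sd = 0.3), ncol = 2)
))

km <- kmeans(toy, centers = 3, nstart = 25)

# Cluster sizes, iterations used by the best run, and the fault indicator
print(km$size)
print(km$iter)
print(km$ifault)  # 0 means no problem was reported

# Share of total variance explained by the partition:
# totss = tot.withinss + betweenss, so this ratio lies in [0, 1]
between_share <- km$betweenss / km$totss
print(round(between_share, 3))
```

A ratio close to 1 means most of the variation lies between clusters rather than within them; the ratio always increases with k, so it is best compared across candidate values of k rather than read in isolation.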

Section D

Visualize the clusters in different colors.

Add the cluster assignment to the original data

CreditCards.df$Cluster <- as.factor(kmeans_result$cluster)
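Because the cluster label is attached to the unscaled data, per-cluster averages come out in the original units, which is what a marketing profile needs. A minimal sketch on a made-up data frame (toy_df and its values are hypothetical):

```r
# Toy data frame with a Cluster factor, mimicking CreditCards.df after labeling
toy_df <- data.frame(
  BALANCE   = c(100, 120, 5000, 5200),
  PURCHASES = c(900, 1100, 50, 70),
  Cluster   = factor(c(1, 1, 2, 2))
)

# Mean of every variable within each cluster
profile <- aggregate(. ~ Cluster, data = toy_df, FUN = mean)

# Number of customers per cluster
sizes <- table(toy_df$Cluster)

print(profile)
print(sizes)
```

In this toy example, cluster 1 looks like low-balance, high-purchase customers and cluster 2 the reverse; the same table on the real data would support naming each segment.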

Visualize the clusters

fviz_cluster(
  kmeans_result,
  data = scaled_data,
  geom = "point",
  ellipse.type = "norm",
  ggtheme = theme_test(),
  main = "Cluster Visualization"
)

The plot shows the 7 clusters projected onto the first two principal components (Dim1 and Dim2), which together explain 47.6% of the variance in the data. Some clusters, like Cluster 5, are distinct and well separated, while others, like Clusters 6 and 7, overlap slightly, indicating shared characteristics. Compact clusters (e.g., Cluster 2) contain closely related points, while wider clusters (e.g., Cluster 6) show greater variability. Because more than half of the variance is not visible in this projection, overlap in the plot does not necessarily mean overlap in the full 17-variable space. Overall, the plot is consistent with the clustering structure but suggests room for refinement where clusters overlap.
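When the data has more than two variables, fviz_cluster() plots the observations on principal components, and the percentages on the Dim1 and Dim2 axes are the shares of variance those components explain. The sketch below shows how such a share can be computed with prcomp() on made-up data; the toy matrix is hypothetical and stands in for scaled_data.

```r
set.seed(123)

# Toy stand-in for scaled_data: 200 rows, 5 already-scaled variables
toy <- scale(matrix(rnorm(200 * 5), ncol = 5))

# PCA on data that is already centered and scaled
pca <- prcomp(toy, center = FALSE, scale. = FALSE)

# Share of total variance carried by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Combined share of the first two components (what a Dim1/Dim2 plot shows)
first_two <- sum(var_explained[1:2])

print(round(var_explained, 3))
print(round(first_two, 3))
```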