K-Means Clustering
Customer Segmentation
You were just hired as a data scientist at MarketCo, one of the largest retail companies. MarketCo offers its customers store credit cards and has collected the transactions made with them. Your boss, Mr. Johnson, has asked you to develop a customer segmentation to inform the marketing strategy. You are given a historical dataset summarizing the usage behavior of about 9,000 active credit card holders over the last 6 months, and you are asked to perform a cluster analysis of it.
Data Source: Kaggle at https://www.kaggle.com/arjunbhasin2013/ccdata/home
Data Dictionary
This dataset contains the following columns:
Variable Name | Data Type | Description | Constraints/Rules |
---|---|---|---|
CUST_ID | Integer | Row identifier or index | Unique, Non-null |
BALANCE | Float | Current balance on the card | ≥ 0, Non-null |
BALANCE_FREQUENCY | Float | How frequently the balance is updated | Range: [0.0, 1.0], Non-null |
PURCHASES | Float | Total amount of purchases made | ≥ 0, Non-null |
ONEOFF_PURCHASES | Float | Purchases made in a single transaction | ≥ 0, ≤ PURCHASES, Non-null |
INSTALLMENTS_PURCHASES | Float | Purchases made in installments | ≥ 0, ≤ PURCHASES, Non-null |
CASH_ADVANCE | Float | Cash advances taken | ≥ 0, Non-null |
PURCHASES_FREQUENCY | Float | Frequency of making purchases | Range: [0.0, 1.0], Non-null |
ONEOFF_PURCHASES_FREQUENCY | Float | Frequency of one-off purchases | Range: [0.0, 1.0], Non-null |
PURCHASES_INSTALLMENTS_FREQUENCY | Float | Frequency of purchases in installments | Range: [0.0, 1.0], Non-null |
CASH_ADVANCE_FREQUENCY | Float | Frequency of cash advances | Range: [0.0, 1.0], Non-null |
CASH_ADVANCE_TRX | Integer | Number of cash advance transactions | ≥ 0, Non-null |
PURCHASES_TRX | Integer | Number of purchase transactions | ≥ 0, Non-null |
CREDIT_LIMIT | Integer | Credit limit assigned to the card | ≥ 0, Non-null |
PAYMENTS | Float | Total amount of payments made | ≥ 0, Non-null |
MINIMUM_PAYMENTS | Float | Minimum payments made | ≥ 0, Non-null |
PRC_FULL_PAYMENT | Float | Proportion of months with full payment of balance | Range: [0.0, 1.0], Non-null |
TENURE | Integer | Number of months the customer has been active (usually 12) | ≥ 0, Non-null, typically constant (12) across records |
Question 1
Load the dataset CreditCards.csv into memory.
Read the dataset into memory
Display the dimensions of the data frame (number of rows and columns)
## [1] 8950 18
Display the column names of the data frame
## [1] "X" "BALANCE"
## [3] "BALANCE_FREQUENCY" "PURCHASES"
## [5] "ONEOFF_PURCHASES" "INSTALLMENTS_PURCHASES"
## [7] "CASH_ADVANCE" "PURCHASES_FREQUENCY"
## [9] "ONEOFF_PURCHASES_FREQUENCY" "PURCHASES_INSTALLMENTS_FREQUENCY"
## [11] "CASH_ADVANCE_FREQUENCY" "CASH_ADVANCE_TRX"
## [13] "PURCHASES_TRX" "CREDIT_LIMIT"
## [15] "PAYMENTS" "MINIMUM_PAYMENTS"
## [17] "PRC_FULL_PAYMENT" "TENURE"
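The loading code is not shown in the rendered output. A minimal sketch of this step, assuming the CSV sits in the working directory and is read with base R's `read.csv()`; the helper `load_credit_data()` and the variable `credit_data` are hypothetical names, not taken from the source:

```r
# Hypothetical helper: read the CSV and report its shape.
# The file name comes from the question; its location is an assumption.
load_credit_data <- function(path = "CreditCards.csv") {
  df <- read.csv(path, stringsAsFactors = FALSE)
  cat("Rows and columns:", dim(df), "\n")
  df
}

# Run only when the file is actually present in the working directory.
if (file.exists("CreditCards.csv")) {
  credit_data <- load_credit_data()
  print(names(credit_data))
}
```

Since k-means cannot handle missing values, it may be worth checking `anyNA(credit_data)` right after loading.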
Question 2
Perform the k-means cluster analysis
Section A
Remove the first column (CUST_ID in the data dictionary, read in as X in the output above), since a row identifier carries no information for clustering.
Remove the first column
Display the dimensions of the data frame (number of rows and columns)
## [1] 8950 17
Display the column names of the data frame
## [1] "BALANCE" "BALANCE_FREQUENCY"
## [3] "PURCHASES" "ONEOFF_PURCHASES"
## [5] "INSTALLMENTS_PURCHASES" "CASH_ADVANCE"
## [7] "PURCHASES_FREQUENCY" "ONEOFF_PURCHASES_FREQUENCY"
## [9] "PURCHASES_INSTALLMENTS_FREQUENCY" "CASH_ADVANCE_FREQUENCY"
## [11] "CASH_ADVANCE_TRX" "PURCHASES_TRX"
## [13] "CREDIT_LIMIT" "PAYMENTS"
## [15] "MINIMUM_PAYMENTS" "PRC_FULL_PAYMENT"
## [17] "TENURE"
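The removal can be sketched as follows. The toy data frame below is a made-up stand-in so the snippet runs on its own, and `credit_features` is a hypothetical name:

```r
# Toy stand-in for the loaded data; the values are invented for illustration.
credit_data <- data.frame(
  X = 1:4,
  BALANCE = c(100, 200, 300, 400),
  PURCHASES = c(10, 0, 25, 40)
)

# Drop the identifier by position; drop = FALSE keeps a data frame
# even if only one column were left.
credit_features <- credit_data[, -1, drop = FALSE]

dim(credit_features)    # one column fewer than before
names(credit_features)  # the identifier column is gone
```

Dropping by position matches the wording of the task; dropping by name (for example with `setdiff(names(credit_data), "X")`) would be more robust if the column order ever changes.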
Section B
Determine the optimal number of clusters and justify your answer. This step may take a while to run because the dataset is large.
Set seed for reproducibility
Scale the data for clustering
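The code for these two steps is not shown in the rendered output. A minimal sketch, assuming base R's `set.seed()` and `scale()`; the seed value 123 is an assumption, and the built-in `iris` measurements stand in for the credit card features so the snippet runs without CreditCards.csv:

```r
set.seed(123)  # assumed seed value; the source does not state which one was used

# Stand-in numeric data; replace with the credit card features in practice.
features <- iris[, 1:4]

# scale() centers each column to mean 0 and rescales it to standard deviation 1,
# so large-valued columns do not dominate the Euclidean distances used by k-means.
scaled_data <- scale(features)

round(colMeans(scaled_data), 10)  # all means are (numerically) zero
apply(scaled_data, 2, sd)         # all standard deviations are one
```

Scaling matters here because dollar amounts such as BALANCE or CREDIT_LIMIT span thousands of units, while the frequency columns lie in [0, 1].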
Determine the optimal number of clusters using “gap_stat” Method
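library(factoextra)  # provides fviz_nbclust()
library(ggplot2)     # provides ggtitle(), theme_test(), theme(), element_text()

fviz_nbclust(scaled_data, kmeans, method = "gap_stat") +
  ggtitle("Optimal Number of Clusters - Gap Statistic Method") +
  theme_test() +
  theme(
    plot.title = element_text(size = 14, face = "bold"),
    axis.text = element_text(size = 10)
  )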
The gap statistic compares the within-cluster dispersion at each number of clusters (k) with the dispersion expected under a reference distribution that has no cluster structure; a larger gap indicates stronger clustering. In this analysis, the gap statistic increases steadily and reaches its highest value at k = 7, the value marked as optimal on the plot. Taking the error bars into account, no larger k offers a meaningful improvement, so k = 7 is used as the number of segments for this dataset.
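To inspect the numbers behind such a plot, the gap statistic can be computed directly. The sketch below assumes `cluster::clusGap()` and `cluster::maxSE()`; `fviz_nbclust()` is understood to rely on `clusGap()` internally, but treat that as an assumption. The `iris` stand-in, the seed, `K.max = 6`, `B = 20`, and `nstart = 10` are illustrative choices that keep the run short:

```r
library(cluster)  # clusGap() and maxSE()

set.seed(123)                      # assumed seed
scaled_data <- scale(iris[, 1:4])  # stand-in for the scaled credit card data

# B reference data sets are simulated for each k from 1 to K.max.
gap <- clusGap(scaled_data, FUNcluster = kmeans, K.max = 6, B = 20,
               nstart = 10, verbose = FALSE)

gap_table <- gap$Tab  # columns: logW, E.logW, gap, SE.sim
print(round(gap_table, 3))

# "firstSEmax": the smallest k whose gap is within one standard error
# of the first local maximum of the gap curve.
best_k <- maxSE(gap_table[, "gap"], gap_table[, "SE.sim"],
                method = "firstSEmax", SE.factor = 1)
best_k
```

The selection rule matters: `"globalmax"` simply picks the largest gap, whereas the standard-error rules prefer a smaller k when the difference is within simulation noise.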
Determine the optimal number of clusters using “Silhouette” Method
fviz_nbclust(scaled_data, kmeans, method = "silhouette") +
  ggtitle("Optimal Number of Clusters - Silhouette Method") +
  theme_test() +
  theme(
    plot.title = element_text(size = 14, face = "bold"),
    axis.text = element_text(size = 10)
  )
The silhouette method also recommends 7 clusters for k-means on scaled_data: the average silhouette width is highest at k = 7. A high average width indicates a good balance between cluster cohesion (how close points in the same cluster are) and separation (how distinct each cluster is from the others).
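For a single point i, the silhouette width is s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance to the other points in its own cluster and b(i) is the mean distance to the points of the nearest other cluster. A minimal sketch of the average width per k, assuming `cluster::silhouette()`; the helper `avg_sil_width()`, the seed, and the `iris` stand-in are illustrative:

```r
library(cluster)  # silhouette()

set.seed(123)                      # assumed seed
scaled_data <- scale(iris[, 1:4])  # stand-in for the scaled credit card data
d <- dist(scaled_data)             # pairwise Euclidean distances

# Hypothetical helper: average silhouette width of a k-means solution with k clusters.
avg_sil_width <- function(k) {
  km <- kmeans(scaled_data, centers = k, nstart = 10)
  sil <- silhouette(km$cluster, d)
  mean(sil[, "sil_width"])
}

k_values <- 2:6  # silhouette widths are not defined for k = 1
avg_widths <- sapply(k_values, avg_sil_width)
names(avg_widths) <- k_values

round(avg_widths, 3)
k_values[which.max(avg_widths)]  # k with the highest average width
```

On the full data set of 8,950 rows, `dist()` stores roughly 40 million pairwise distances, which is one reason this step can be slow.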
Set observed optimal number
Section C
Perform k-means clustering using the optimal number of clusters.
Perform k-means clustering
Display the statistical summary of the kmeans
## Length Class Mode
## cluster 8950 -none- numeric
## centers 119 -none- numeric
## totss 1 -none- numeric
## withinss 7 -none- numeric
## tot.withinss 1 -none- numeric
## betweenss 1 -none- numeric
## size 7 -none- numeric
## iter 1 -none- numeric
## ifault 1 -none- numeric
The k-means run assigned all 8,950 observations to 7 clusters. Note that the summary above lists only the length of each component of the fitted object, not its values: cluster holds one label per observation, centers holds 7 × 17 = 119 centroid coordinates, and withinss and size hold one value per cluster. Compactness (tot.withinss), separation (betweenss), and the number of iterations (iter) must therefore be read from the components themselves. A length of 1 for iter only means it is a single number, not that the algorithm converged in one iteration.
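A minimal sketch of fitting the model and reading those components, assuming base R's `kmeans()`. The `iris` stand-in, the seed, `nstart = 25`, and `optimal_k <- 3` are illustrative assumptions (the report uses 7 clusters on the credit card data):

```r
set.seed(123)                      # assumed seed
scaled_data <- scale(iris[, 1:4])  # stand-in for the scaled credit card data
optimal_k <- 3                     # stand-in value; the report uses 7

# nstart = 25 reruns k-means from 25 random starts and keeps the best solution.
kmeans_result <- kmeans(scaled_data, centers = optimal_k, nstart = 25)

dim(kmeans_result$centers)   # optimal_k rows, one column per feature
kmeans_result$size           # number of observations in each cluster
kmeans_result$iter           # iterations used by the best run
kmeans_result$tot.withinss   # total within-cluster sum of squares (compactness)

# Share of total variance explained by the cluster structure (separation).
kmeans_result$betweenss / kmeans_result$totss
```

The ratio `betweenss / totss` lies between 0 and 1; values closer to 1 mean that more of the variation is between clusters rather than within them.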
Section D
Visualize the clusters in different colors.
Add the cluster assignment to the original data
Visualize the clusters
fviz_cluster(
  kmeans_result,
  data = scaled_data,
  geom = "point",
  ellipse.type = "norm",
  ggtheme = theme_test(),
  main = "Cluster Visualization"
)
The visualization shows the 7 clusters projected onto the first two principal components (Dim1 and Dim2), which together explain 47.6% of the variance in the data. Some clusters, like Cluster 5, are distinct and well-separated, while others, like Clusters 6 and 7, overlap slightly, indicating shared characteristics. Compact clusters (e.g., Cluster 2) suggest closely related points, while wider clusters (e.g., Cluster 6) show greater variability. Because more than half of the variance is not visible in this two-dimensional view, overlap in the plot does not necessarily mean overlap in the full feature space. Overall, the plot supports the clustering structure but highlights potential for refinement in the overlapping areas.
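The Dim1 and Dim2 axes and their percentages can be reproduced with a principal component analysis, which `fviz_cluster()` uses when the data has more than two columns. A minimal sketch, assuming base R's `prcomp()`; the `iris` stand-in, the seed, and the three clusters are illustrative, so the percentages will differ from the 47.6% reported above:

```r
set.seed(123)                      # assumed seed
scaled_data <- scale(iris[, 1:4])  # stand-in for the scaled credit card data
kmeans_result <- kmeans(scaled_data, centers = 3, nstart = 25)

# The data is already scaled, so prcomp() is run without further scaling.
pca <- prcomp(scaled_data)

# Proportion of variance carried by each principal component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(100 * var_explained[1:2], 1)  # percentages shown on the Dim1 and Dim2 axes
sum(var_explained[1:2])             # share of variance visible in the 2-D plot

# Coordinates of each observation in the 2-D view, with its cluster label.
projected <- data.frame(
  Dim1 = pca$x[, 1],
  Dim2 = pca$x[, 2],
  cluster = factor(kmeans_result$cluster)
)
head(projected)
```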