library(tidyverse)
library(corrplot)
library(flexclust)
library(dbscan)
library(meanShiftR)
library(e1071)
library(cluster)
library(fpc)
library(factoextra)
The dataset used is the Online Shoppers Purchasing Intention Dataset, sourced from the UCI Machine Learning Repository (https://archive.ics.uci.edu/dataset/468). This dataset contains 12,330 online store visit sessions with 18 attributes. In this study, only 10 numerical features were used as clustering variables, namely Administrative, Administrative_Duration, Informational, Informational_Duration, ProductRelated, ProductRelated_Duration, BounceRates, ExitRates, PageValues, and SpecialDay.
df <- read.csv("online_shoppers_intention.csv", sep=",")
df_num <- df %>% select(
Administrative, Administrative_Duration,
Informational, Informational_Duration,
ProductRelated, ProductRelated_Duration,
BounceRates, ExitRates, PageValues, SpecialDay
)
Data structure checks are performed to ensure that all selected
variables are of numeric type and ready for further processing. The
output of str() displays the data type of each column along
with some initial values as samples.
str(df_num)
## 'data.frame': 12330 obs. of 10 variables:
## $ Administrative : int 0 0 0 0 0 0 0 1 0 0 ...
## $ Administrative_Duration: num 0 0 0 0 0 0 0 0 0 0 ...
## $ Informational : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Informational_Duration : num 0 0 0 0 0 0 0 0 0 0 ...
## $ ProductRelated : int 1 2 1 2 10 19 1 0 2 3 ...
## $ ProductRelated_Duration: num 0 64 0 2.67 627.5 ...
## $ BounceRates : num 0.2 0 0.2 0.05 0.02 ...
## $ ExitRates : num 0.2 0.1 0.2 0.14 0.05 ...
## $ PageValues : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SpecialDay : num 0 0 0 0 0 0 0.4 0 0.8 0.4 ...
Descriptive statistics provide a summary of the distribution of each
feature. It is evident that most features have a minimum value of 0 and
a highly right-skewed distribution, particularly for duration features
such as ProductRelated_Duration, which has a range of 0 to
63,973 seconds. This significant difference in scale between features is
the primary reason why normalization is necessary before clustering is
performed.
summary(df_num)
## Administrative Administrative_Duration Informational
## Min. : 0.000 Min. : 0.00 Min. : 0.0000
## 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 0.0000
## Median : 1.000 Median : 7.50 Median : 0.0000
## Mean : 2.315 Mean : 80.82 Mean : 0.5036
## 3rd Qu.: 4.000 3rd Qu.: 93.26 3rd Qu.: 0.0000
## Max. :27.000 Max. :3398.75 Max. :24.0000
## Informational_Duration ProductRelated ProductRelated_Duration
## Min. : 0.00 Min. : 0.00 Min. : 0.0
## 1st Qu.: 0.00 1st Qu.: 7.00 1st Qu.: 184.1
## Median : 0.00 Median : 18.00 Median : 598.9
## Mean : 34.47 Mean : 31.73 Mean : 1194.7
## 3rd Qu.: 0.00 3rd Qu.: 38.00 3rd Qu.: 1464.2
## Max. :2549.38 Max. :705.00 Max. :63973.5
## BounceRates ExitRates PageValues SpecialDay
## Min. :0.000000 Min. :0.00000 Min. : 0.000 Min. :0.00000
## 1st Qu.:0.000000 1st Qu.:0.01429 1st Qu.: 0.000 1st Qu.:0.00000
## Median :0.003112 Median :0.02516 Median : 0.000 Median :0.00000
## Mean :0.022191 Mean :0.04307 Mean : 5.889 Mean :0.06143
## 3rd Qu.:0.016813 3rd Qu.:0.05000 3rd Qu.: 0.000 3rd Qu.:0.00000
## Max. :0.200000 Max. :0.20000 Max. :361.764 Max. :1.00000
Boxplots are used to visually detect outliers in the data prior to preprocessing. The red points on the boxplots indicate observations that fall outside the IQR. It is evident that nearly all features contain extreme outliers, which, if left unaddressed, will affect the quality of the clustering results, particularly in the case of K-Means, which is sensitive to outliers. The histograms show that all features are non-normally distributed and heavily skewed to the left.
df_long <- df_num %>%
pivot_longer(cols = everything(), names_to = "Feature", values_to = "Value")
ggplot(df_long, aes(x = Feature, y = Value, fill = Feature)) +
geom_boxplot(show.legend = FALSE, outlier.color = "red", outlier.alpha = 0.5) +
coord_flip() +
theme_minimal() +
labs(title = "Boxplot of 10 Numeric Features (Raw Data)",
subtitle = "Red dots indicate the presence of outliers",
x = "Feature",
y = "Value")
ggplot(df_long, aes(x = Value, fill = Feature)) +
geom_histogram(bins = 30, show.legend = FALSE, color = "white") +
facet_wrap(~Feature, scales = "free") +
theme_minimal() +
labs(title = "Feature Distribution Histogram",
x = "Value", y = "Frequency")
Checking for missing values is a mandatory step before clustering. Clustering algorithms such as K-Means and DBSCAN cannot process data with missing values. The results of the check show that all 10 numeric columns have no missing values, so data imputation is not required.
cat("Number of Missing Values per Column:\n")
## Number of Missing Values per Column:
colSums(is.na(df_num))
## Administrative Administrative_Duration Informational
## 0 0 0
## Informational_Duration ProductRelated ProductRelated_Duration
## 0 0 0
## BounceRates ExitRates PageValues
## 0 0 0
## SpecialDay
## 0
df_clean <- df_num %>% drop_na()
Duplicate rows can introduce bias in the clustering results because the same observations are counted more than once during the centroid formation process. A total of 761 duplicate rows were identified out of 12,330 rows; after removal, 11,569 unique data rows remained for use in further analysis.
cat("Number of duplicate rows:", sum(duplicated(df_clean)), "\n")
## Number of duplicate rows: 761
cat("Number of rows before cleaning:", nrow(df_clean), "\n")
## Number of rows before cleaning: 12330
df_clean <- df_clean %>% distinct()
cat("Number of rows after cleaning:", nrow(df_clean), "\n")
## Number of rows after cleaning: 11569
Extreme outliers are handled using the IQR capping (Winsorization) method, which involves replacing values below the lower bound (Q1 – 1.5×IQR) with that lower bound value, and values exceeding the upper bound (Q3 + 1.5×IQR) with the upper bound value. This method was chosen because it does not remove observations but merely limits the influence of extreme values, thereby preserving the total number of data points. Columns with IQR = 0 are not capped to avoid division by zero.
cap_outliers <- function(x) {
Q1 <- quantile(x, 0.25)
Q3 <- quantile(x, 0.75)
IQR_val <- Q3 - Q1
if (IQR_val == 0) return(x)
lower <- Q1 - 1.5 * IQR_val
upper <- Q3 + 1.5 * IQR_val
pmin(pmax(x, lower), upper)
}
df_clean <- df_clean %>% mutate(across(everything(), cap_outliers))
Standardization using Z-scores was performed to normalize the scales of all features so that no single feature would dominate the distance calculation simply because it had a wider range of values. After scaling, each feature has a mean of 0 and a standard deviation of 1. A check was performed to ensure that no NaN or Inf values were generated after the scaling process; neither was found.
zero_var_cols <- names(df_clean)[sapply(df_clean, var) == 0]
if (length(zero_var_cols) > 0) {
cat("Columns with variance = 0 (removed before scaling):",
paste(zero_var_cols, collapse = ", "), "\n")
df_clean <- df_clean %>% select(-all_of(zero_var_cols))
} else {
cat("All columns OK, no variance =0\n")
}
## All columns OK, no variance =0
df_scaled <- as.data.frame(scale(df_clean))
cat("NaN after scaling:", sum(is.nan(as.matrix(df_scaled))), "\n")
## NaN after scaling: 0
cat("Inf after scaling:", sum(is.infinite(as.matrix(df_scaled))), "\n")
## Inf after scaling: 0
Descriptive statistics following preprocessing reveal significant
changes in the data distribution. Previously extreme maximum values
(e.g., ProductRelated_Duration reaching 63,973) are now
under control following IQR capping. This tighter data distribution will
result in more representative clusters.
summary(df_clean)
## Administrative Administrative_Duration Informational
## Min. : 0.000 Min. : 0.00 Min. : 0.0000
## 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 0.0000
## Median : 1.000 Median : 16.50 Median : 0.0000
## Mean : 2.343 Mean : 62.55 Mean : 0.5365
## 3rd Qu.: 4.000 3rd Qu.:100.92 3rd Qu.: 0.0000
## Max. :10.000 Max. :252.29 Max. :24.0000
## Informational_Duration ProductRelated ProductRelated_Duration
## Min. : 0.00 Min. : 0.00 Min. : 0.0
## 1st Qu.: 0.00 1st Qu.: 9.00 1st Qu.: 242.6
## Median : 0.00 Median :20.00 Median : 671.9
## Mean : 36.74 Mean :28.53 Mean :1066.1
## 3rd Qu.: 0.00 3rd Qu.:40.00 3rd Qu.:1557.2
## Max. :2549.38 Max. :86.50 Max. :3529.0
## BounceRates ExitRates PageValues SpecialDay
## Min. :0.000000 Min. :0.00000 Min. : 0.000 Min. :0.00000
## 1st Qu.:0.000000 1st Qu.:0.01334 1st Qu.: 0.000 1st Qu.:0.00000
## Median :0.002151 Median :0.02402 Median : 0.000 Median :0.00000
## Mean :0.008573 Mean :0.03101 Mean : 6.277 Mean :0.06287
## 3rd Qu.:0.013793 3rd Qu.:0.04181 3rd Qu.: 0.000 3rd Qu.:0.00000
## Max. :0.034483 Max. :0.08451 Max. :361.764 Max. :1.00000
A correlation heatmap is used to identify linear relationships
between features. High correlations between features can influence
cluster formation. The heatmap results show a strong positive
correlation between BounceRates and ExitRates,
as well as between Administrative and
Administrative_Duration. This suggests that visitors with a
high bounce rate also tend to have a high exit rate, indicating a
segment of visitors who are not engaged with the store’s content.
cor_matrix <- cor(df_clean)
corrplot(cor_matrix,
method = "color",
type = "upper",
addCoef.col = "black",
tl.col = "black",
tl.srt = 45,
number.cex = 0.7,
title = "Correlation Heatmap of 10 Numerical Features",
mar = c(0,0,1,0))
Box plots and histograms of the scaled data show that all features are now on a comparable scale (Z-scores). The outliers remaining after IQR capping appear to be better controlled. This more homogeneous distribution ensures that distance-based algorithms such as K-Means and DBSCAN will not be biased toward specific features.
df_long <- df_scaled %>%
pivot_longer(cols = everything(), names_to = "Feature", values_to = "Value")
ggplot(df_long, aes(x = Feature, y = Value, fill = Feature)) +
geom_boxplot(show.legend = FALSE, outlier.color = "red", outlier.alpha = 0.5) +
coord_flip() +
theme_minimal() +
labs(title = "Boxplot of 10 Numeric Features (Scaled Data)",
subtitle = "Red dots indicate the presence of outliers (Z-Score)",
x = "Feature",
y = "Z-Score Value")
ggplot(df_long, aes(x = Value, fill = Feature)) +
geom_histogram(bins = 30, show.legend = FALSE, color = "white") +
facet_wrap(~Feature, scales = "free") +
theme_minimal() +
labs(title = "Feature Distribution Histogram (Scaled Data)",
x = "Z-Score Value", y = "Frequency")
The Elbow Method calculates the total Within-Cluster Sum of Squares (WSS) for each value of k. As k increases, the WSS decreases because the clusters become smaller and more compact. The “elbow” point—the point at which the decrease in WSS begins to slow significantly indicates the optimal number of clusters. From the resulting graph, the elbow point is identified at k = 7.
set.seed(123)
fviz_nbclust(df_scaled, kmeans, method = "wss", k.max = 10) +
labs(title = "Elbow Method for Optimal K") +
theme_minimal()
The Silhouette Method measures how well each observation fits within its own cluster compared to the nearest neighboring clusters. Silhouette values range from -1 to 1, where higher values indicate better cluster separation. The silhouette plot confirms that the highest average silhouette value is also obtained at k = 7. Based on these two methods, k = 7 is selected as the optimal number of clusters for all methods that require k to be determined.
set.seed(123)
fviz_nbclust(df_scaled, kmeans, method = "silhouette", k.max = 10) +
labs(title = "Silhouette Method for Optimal K") +
theme_minimal()
## CONCLUSION OF K SELECTION
# Elbow Method: The inflection point is visible at k = 7
# Silhouette: The highest value is at k = 7
# → Both methods agree, k = 7 is selected
optimal_k <- 7
cat("Optimal K selected:", optimal_k)
## Optimal K selected: 7
For the DBSCAN algorithm, the epsilon (ε) parameter is determined using a kNN Distance Plot. This plot displays the distances to the k nearest neighbors (where k = inPts - 1 = 10) for each point, sorted in ascending order. The “knee” point on the graph indicates the optimal value of ε where points begin to be far apart from one another, marking the transition from a dense area to a sparse area. From this plot, ε = 1.0 was selected because it produces 3 clusters with 18.3% noise, which represents the best balance between a meaningful number of clusters and an acceptable level of noise.
kNNdistplot(df_scaled, k = 11)
abline(h = 2.0, col = "red", lty = 2)
Principal Component Analysis (PCA) is applied not as a clustering method, but as a visualization tool. Since the data has 10 dimensions, the clustering results cannot be visualized directly. PCA projects the data onto two principal components (PC1 and PC2) that capture the greatest variance. PC1 explains 33.7% of the variance and PC2 explains 16.0%, so the total variance captured by these two components is 49.7%, which is sufficiently representative for visualization purposes.
pca_res <- prcomp(df_scaled, center = FALSE, scale. = FALSE)
df_pca <- as.data.frame(pca_res$x[, 1:2])
cat("Proportion of variance explained by PC1 and PC2:\n")
## Proportion of variance explained by PC1 and PC2:
summary(pca_res)$importance[2, 1:2]
## PC1 PC2
## 0.33748 0.15972
var_explained <- summary(pca_res)$importance[2, 1:2]
cat(sprintf("PC1 explained %.1f%% variance\n", var_explained[1] * 100))
## PC1 explained 33.7% variance
cat(sprintf("PC2 explained %.1f%% variance\n", var_explained[2] * 100))
## PC2 explained 16.0% variance
cat(sprintf("Total 2 components: %.1f%% variance\n", sum(var_explained) * 100))
## Total 2 components: 49.7% variance
K-Means is a partition-based clustering algorithm that groups data
into k clusters by minimizing the Within-Cluster Sum of Squares (WCSS).
The algorithm begins with k random centroids, then iteratively assigns
each data point to the nearest centroid and updates the centroid’s
position based on the average of the data points in that cluster until
convergence. The parameter nstart = 25 is used to run the
algorithm from 25 different starting points and select the best result,
in order to avoid suboptimal local solutions. The results show a BSS/TSS
ratio of 58.48%, meaning that 58.48% of the total data variance can be
explained by the separation between clusters.
set.seed(123)
km_res <- kmeans(df_scaled, centers = optimal_k, nstart = 25)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 578450)
cat("Ukuran tiap cluster:\n")
## Ukuran tiap cluster:
print(table(km_res$cluster))
##
## 1 2 3 4 5 6 7
## 1662 499 1632 4719 435 798 1824
cat("\nTotal Within-cluster SS :", round(km_res$tot.withinss, 2))
##
## Total Within-cluster SS : 48028.32
cat("\nBetween-cluster SS :", round(km_res$betweenss, 2))
##
## Between-cluster SS : 67651.68
cat("\nRasio BSS/TSS :", round(km_res$betweenss / km_res$totss * 100, 2), "%\n")
##
## Rasio BSS/TSS : 58.48 %
df_pca$cluster_km <- as.factor(km_res$cluster)
ggplot(df_pca, aes(x = PC1, y = PC2, color = cluster_km)) +
geom_point(alpha = 0.3, size = 1.0) +
stat_ellipse(level = 0.95, linewidth = 0.8) +
theme_minimal() +
labs(
title = "K-Means Clustering — PCA Visualization",
subtitle = paste0("k = ", optimal_k,
" | PC1 + PC2 explained variance: ",
round(sum(var_explained) * 100, 1), "%"),
color = "Cluster",
x = "PC1",
y = "PC2"
)
Cluster profiling displays the average value of each feature per cluster using the original data (before scaling) to make the results easier to interpret in context. Heatmap profiling visually illustrates the differences in characteristics between clusters, with red indicating high values and blue indicating low values.
df_profile_km <- df_clean %>%
mutate(cluster = as.factor(km_res$cluster))
profiling <- df_profile_km %>%
group_by(cluster) %>%
summarise(across(where(is.numeric), ~mean(.x, na.rm = TRUE), .names = "{.col}"))
print(profiling)
## # A tibble: 7 × 11
## cluster Administrative Administrative_Duration Informational
## <fct> <dbl> <dbl> <dbl>
## 1 1 0.690 19.5 0.153
## 2 2 2.10 58.6 0.345
## 3 3 4.08 97.6 0.911
## 4 4 0.881 19.2 0.176
## 5 5 5.95 156. 4.82
## 6 6 0.812 17.4 0.198
## 7 7 5.95 181. 0.663
## # ℹ 7 more variables: Informational_Duration <dbl>, ProductRelated <dbl>,
## # ProductRelated_Duration <dbl>, BounceRates <dbl>, ExitRates <dbl>,
## # PageValues <dbl>, SpecialDay <dbl>
profiling %>%
pivot_longer(-cluster, names_to = "Feature", values_to = "Mean") %>%
ggplot(aes(x = cluster, y = Feature, fill = Mean)) +
geom_tile(color = "white") +
geom_text(aes(label = round(Mean, 2)), size = 3) +
scale_fill_gradient2(
low = "#3498DB",
mid = "#ECF0F1",
high = "#E74C3C",
midpoint = 0
) +
theme_minimal() +
labs(
title = "Cluster Profiling Heatmap",
subtitle = "Mean value per feature per cluster (original scale)",
x = "Cluster",
y = "Feature",
fill = "Mean"
)
sil_km <- silhouette(km_res$cluster, dist(df_scaled))
sil_score <- round(mean(sil_km[, 3]), 4)
cat("Silhouette Score K-Means:", sil_score, "\n")
## Silhouette Score K-Means: 0.2861
stats_km <- cluster.stats(dist(df_scaled), km_res$cluster)
dunn_score <- round(stats_km$dunn, 4)
cat("Dunn Index K-Means :", dunn_score, "\n")
## Dunn Index K-Means : 0.0049
fviz_silhouette(sil_km) +
theme_minimal() +
labs(title = "Silhouette Plot — K-Means Clustering")
## cluster size ave.sil.width
## 1 1 1662 0.40
## 2 2 499 0.17
## 3 3 1632 0.16
## 4 4 4719 0.37
## 5 5 435 0.07
## 6 6 798 0.23
## 7 7 1824 0.17
K-Medians is a variant of K-Means that uses the median as the cluster
center instead of the mean. The use of the median makes this algorithm
more robust to outliers than K-Means, because the median is not affected
by extreme values. K-Medians uses the Manhattan distance (L1 norm) as
the distance metric, unlike K-Means, which uses the Euclidean distance
(L2 norm). The implementation uses the kcca() function from
the flexclust package with the kmedians
family.
set.seed(123)
kmed_res <- kcca(df_scaled, k = optimal_k, family = kccaFamily("kmedians"))
## Found more than one class "kcca" in cache; using the first, from namespace 'flexclust'
## Also defined by 'kernlab'
## Found more than one class "kcca" in cache; using the first, from namespace 'flexclust'
## Also defined by 'kernlab'
cat("Ukuran tiap cluster:\n")
## Ukuran tiap cluster:
print(table(clusters(kmed_res)))
##
## 1 2 3 4 5 6 7
## 663 2277 1581 3737 701 1742 868
df_pca$cluster_kmed <- as.factor(clusters(kmed_res))
ggplot(df_pca, aes(x = PC1, y = PC2, color = cluster_kmed)) +
geom_point(alpha = 0.3, size = 1.0) +
stat_ellipse(level = 0.95, linewidth = 0.8) +
theme_minimal() +
labs(
title = "K-Medians Clustering — PCA Visualization",
subtitle = paste0("k = ", optimal_k),
color = "Cluster", x = "PC1", y = "PC2"
)
df_profile_kmed <- df_clean %>%
mutate(cluster = as.factor(clusters(kmed_res))) %>%
group_by(cluster) %>%
summarise(across(where(is.numeric), ~mean(.x, na.rm = TRUE))) %>%
mutate(across(where(is.numeric), ~round(.x, 3)))
print(df_profile_kmed)
## # A tibble: 7 × 11
## cluster Administrative Administrative_Duration Informational
## <fct> <dbl> <dbl> <dbl>
## 1 1 1.78 50.5 0.243
## 2 2 4.58 134. 0.185
## 3 3 1.59 33.3 0.467
## 4 4 0.385 7.07 0.139
## 5 5 5.18 144. 3.24
## 6 6 0.554 16.2 0.13
## 7 7 8.00 205. 2.15
## # ℹ 7 more variables: Informational_Duration <dbl>, ProductRelated <dbl>,
## # ProductRelated_Duration <dbl>, BounceRates <dbl>, ExitRates <dbl>,
## # PageValues <dbl>, SpecialDay <dbl>
df_profile_kmed %>%
pivot_longer(-cluster, names_to = "Feature", values_to = "Mean") %>%
ggplot(aes(x = cluster, y = Feature, fill = Mean)) +
geom_tile(color = "white") +
geom_text(aes(label = round(Mean, 2)), size = 3) +
scale_fill_gradient2(low="#3498DB", mid="#ECF0F1", high="#E74C3C", midpoint=0) +
theme_minimal() +
labs(title = "Cluster Profiling Heatmap — K-Medians")
sil_kmed <- silhouette(clusters(kmed_res), dist(df_scaled))
sil_kmed_score <- round(mean(sil_kmed[, 3]), 4)
cat("Silhouette Score K-Medians:", sil_kmed_score, "\n")
## Silhouette Score K-Medians: 0.2182
stats_kmed <- cluster.stats(dist(df_scaled), clusters(kmed_res))
cat("Dunn Index K-Medians :", round(stats_kmed$dunn, 4), "\n")
## Dunn Index K-Medians : 0.003
fviz_silhouette(sil_kmed) +
theme_minimal() +
labs(title = "Silhouette Plot — K-Medians Clustering")
## cluster size ave.sil.width
## 1 1 663 0.13
## 2 2 2277 0.18
## 3 3 1581 0.14
## 4 4 3737 0.31
## 5 5 701 -0.02
## 6 6 1742 0.33
## 7 7 868 0.09
Before implementation, the optimal eps value was found using the following loop. Eps = 1.0 was chosen because it produced more than 1 cluster with a reasonable noise percentage (18.3%).
# Test berbagai nilai eps
for (eps_val in c(0.3, 0.5, 0.8, 1.0, 1.2, 1.5, 2.0, 2.5)) {
db_test <- dbscan(df_scaled, eps = eps_val, MinPts = 11)
n_cluster <- length(unique(db_test$cluster[db_test$cluster != 0]))
n_noise <- sum(db_test$cluster == 0)
cat(sprintf("eps = %.1f | Clusters: %d | Noise: %d (%.1f%%)\n",
eps_val, n_cluster, n_noise, n_noise/nrow(df_scaled)*100))
}
## eps = 0.3 | Clusters: 25 | Noise: 8476 (73.3%)
## eps = 0.5 | Clusters: 14 | Noise: 6226 (53.8%)
## eps = 0.8 | Clusters: 14 | Noise: 3648 (31.5%)
## eps = 1.0 | Clusters: 3 | Noise: 2120 (18.3%)
## eps = 1.2 | Clusters: 1 | Noise: 1308 (11.3%)
## eps = 1.5 | Clusters: 1 | Noise: 698 (6.0%)
## eps = 2.0 | Clusters: 1 | Noise: 237 (2.0%)
## eps = 2.5 | Clusters: 1 | Noise: 98 (0.8%)
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
is a density-based algorithm that does not require the number of
clusters to be specified in advance. This algorithm classifies data
points into three categories: core points (having at
least MinPts neighbors within a radius of ε), border
points (located within a radius of ε from a core point but not
meeting the MinPts requirement), and noise points
(neither of the above). The parameters used are eps = 1.0
based on the kNN Distance Plot analysis, and MinPts = 11
according to the general rule MinPts ≥ D+1 where D = 10 (the number of
dimensions). DBSCAN produces 3 clusters with 2,120 noise points (17.19%
of the total data).
db_res <- dbscan(df_scaled, eps = 1.0, MinPts = 11)
cat("Distribusi cluster DBSCAN (0 = noise):\n")
## Distribusi cluster DBSCAN (0 = noise):
print(table(db_res$cluster))
##
## 0 1 2 3
## 2120 9401 37 11
cat("Jumlah noise points:", sum(db_res$cluster == 0), "\n")
## Jumlah noise points: 2120
df_pca$cluster_db <- as.factor(db_res$cluster)
ggplot(df_pca, aes(x = PC1, y = PC2, color = cluster_db)) +
geom_point(alpha = 0.3, size = 1.0) +
theme_minimal() +
scale_color_manual(
values = c("0" = "grey70",
setNames(scales::hue_pal()(length(unique(db_res$cluster)) - 1),
as.character(sort(unique(db_res$cluster[db_res$cluster != 0]))))),
labels = c("0" = "Noise",
as.character(sort(unique(db_res$cluster[db_res$cluster != 0]))))
) +
labs(
title = "DBSCAN Clustering — PCA Visualization",
subtitle = paste0("eps = 1.0 | minPts = 11 | Noise points: ",
sum(db_res$cluster == 0)),
color = "Cluster", x = "PC1", y = "PC2"
)
Metrics are evaluated only on data that is not noise (cluster ≠ 0), since noise points, by definition, do not belong to any cluster and would distort the silhouette value if included.
db_no_noise <- db_res$cluster[db_res$cluster != 0]
df_scaled_no_noise <- df_scaled[db_res$cluster != 0, ]
if (length(unique(db_no_noise)) > 1) {
sil_db <- silhouette(db_no_noise, dist(df_scaled_no_noise))
cat("Silhouette Score DBSCAN:", round(mean(sil_db[, 3]), 4), "\n")
stats_db <- cluster.stats(dist(df_scaled_no_noise), db_no_noise)
cat("Dunn Index DBSCAN :", round(stats_db$dunn, 4), "\n")
} else {
cat("Hanya 1 cluster terbentuk, evaluasi tidak dapat dihitung.\n")
}
## Silhouette Score DBSCAN: 0.169
## Dunn Index DBSCAN : 0.0531
cat ("\nDBSCAN often produces low silhouettes due to its non-spherical clusters. This needs to be explained in the DBSCAN evaluation narrative.")
##
## DBSCAN often produces low silhouettes due to its non-spherical clusters. This needs to be explained in the DBSCAN evaluation narrative.
Mean Shift is a density-based non-parametric clustering algorithm that works by iteratively shifting each data point toward the weighted average of its neighbors within a certain bandwidth radius. This process is called hill-climbing because the points always move toward areas of higher density until they converge on the mode (peak) of the distribution. The main advantage of Mean Shift is that it does not require the number of clusters to be specified at the outset. Due to the high computational complexity, the analysis was performed on a sample of 2000 data points using a bandwidth of 3.0
set.seed(123)
sample_idx <- sample(nrow(df_scaled), 2000)
df_ms_sample <- as.matrix(df_scaled[sample_idx, ])
ms_res <- meanShift(
df_ms_sample,
bandwidth = rep(3.0, ncol(df_ms_sample))
)
cat("Distribusi cluster Mean Shift:\n")
## Distribusi cluster Mean Shift:
print(table(ms_res$assignment))
##
## 1 2 3
## 1998 1 1
Due to the quadratic computational complexity O(n²) of the Mean Shift
algorithm, which makes it infeasible to run on the full dataset of
11,569 observations within a reasonable time, the analysis was performed
on a representative random sample of 2,000 data points (17.3% of the
total cleaned dataset). The sample was drawn using
set.seed(123) to ensure reproducibility. This sampling
approach is consistent with common practice in the literature when
applying Mean Shift to large datasets (Comaniciu & Meer, 2002).
df_pca_ms <- df_pca[sample_idx, ]
df_pca_ms$cluster_ms <- as.factor(ms_res$assignment)
ggplot(df_pca_ms, aes(x = PC1, y = PC2, color = cluster_ms)) +
geom_point(alpha = 0.4, size = 1.2) +
stat_ellipse(level = 0.95, linewidth = 0.8) +
theme_minimal() +
labs(
title = "Mean Shift Clustering — PCA Visualization",
subtitle = paste0("bandwidth = 3.0 | n sample = 2000"),
color = "Cluster", x = "PC1", y = "PC2"
)
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_path()`).
if (length(unique(ms_res$assignment)) > 1) {
sil_ms <- silhouette(ms_res$assignment, dist(df_ms_sample))
cat("Silhouette Score Mean Shift:", round(mean(sil_ms[, 3]), 4), "\n")
stats_ms <- cluster.stats(dist(df_ms_sample), ms_res$assignment)
cat("Dunn Index Mean Shift :", round(stats_ms$dunn, 4), "\n")
} else {
cat("Mean Shift hanya menghasilkan 1 cluster dengan bandwidth ini.\n")
cat("Coba turunkan bandwidth, misal: 1.0 atau 0.8\n")
}
## Silhouette Score Mean Shift: 0.7736
## Dunn Index Mean Shift : 0.1095
Fuzzy C-Means (FCM) is a soft clustering method in which each data
point does not belong exclusively to a single cluster, but rather has a
membership degree for all clusters with values between 0 and 1. The
total sum of a data point’s membership degrees across all clusters
equals 1. The fuzziness parameter m = 2 is the standard
value that controls how “fuzzy” the boundaries between clusters are; the
larger m is, the more the clusters overlap. FCM is highly relevant for
visitor behavior data where the boundaries between segments are not
always clear-cut.
set.seed(123)
fcm_res <- cmeans(as.matrix(df_scaled), centers = optimal_k, m = 2)
cat("Distribusi cluster Fuzzy C-Means:\n")
## Distribusi cluster Fuzzy C-Means:
print(table(fcm_res$cluster))
##
## 1 2 3 4 5 6 7
## 1798 316 1590 3718 794 1853 1500
df_pca$cluster_fcm <- as.factor(fcm_res$cluster)
ggplot(df_pca, aes(x = PC1, y = PC2, color = cluster_fcm)) +
geom_point(alpha = 0.3, size = 1.0) +
stat_ellipse(level = 0.95, linewidth = 0.8) +
theme_minimal() +
labs(
title = "Fuzzy C-Means Clustering — PCA Visualization",
subtitle = paste0("k = ", optimal_k, " | m = 2 (fuzziness)"),
color = "Cluster", x = "PC1", y = "PC2"
)
df_profile_fcm <- df_clean %>%
mutate(cluster = as.factor(fcm_res$cluster)) %>%
group_by(cluster) %>%
summarise(across(where(is.numeric), ~mean(.x, na.rm = TRUE))) %>%
mutate(across(where(is.numeric), ~round(.x, 3)))
print(df_profile_fcm)
## # A tibble: 7 × 11
## cluster Administrative Administrative_Duration Informational
## <fct> <dbl> <dbl> <dbl>
## 1 1 0.542 15.6 0.12
## 2 2 2.01 53.4 0.408
## 3 3 4.82 147. 0.656
## 4 4 0.57 11.9 0.082
## 5 5 2.12 55.8 0.513
## 6 6 6.49 167. 1.99
## 7 7 1.35 30.8 0.278
## # ℹ 7 more variables: Informational_Duration <dbl>, ProductRelated <dbl>,
## # ProductRelated_Duration <dbl>, BounceRates <dbl>, ExitRates <dbl>,
## # PageValues <dbl>, SpecialDay <dbl>
sil_fcm <- silhouette(fcm_res$cluster, dist(df_scaled))
cat("Silhouette Score FCM:", round(mean(sil_fcm[, 3]), 4), "\n")
## Silhouette Score FCM: 0.1375
stats_fcm <- cluster.stats(dist(df_scaled), fcm_res$cluster)
cat("Dunn Index FCM :", round(stats_fcm$dunn, 4), "\n")
## Dunn Index FCM : 0.0033
fviz_silhouette(sil_fcm) +
theme_minimal() +
labs(title = "Silhouette Plot — Fuzzy C-Means Clustering")
## cluster size ave.sil.width
## 1 1 1798 0.34
## 2 2 316 -0.10
## 3 3 1590 0.05
## 4 4 3718 0.29
## 5 5 794 -0.08
## 6 6 1853 0.01
## 7 7 1500 -0.05
The following comparison table summarizes the performance of the five clustering methods based on two internal evaluation metrics. The Silhouette Score measures how well each observation fits within its own cluster (range -1 to 1; higher is better). The Dunn Index measures the ratio of the minimum distance between clusters to the maximum cluster diameter (higher is better). The method with the highest values for both metrics is considered to produce the most optimal segmentation.
comparison_table <- data.frame(
Method = c("K-Means", "K-Medians", "DBSCAN", "Mean Shift", "Fuzzy C-Means"),
N_Clusters = c(
length(unique(km_res$cluster)),
length(unique(clusters(kmed_res))),
length(unique(db_res$cluster[db_res$cluster != 0])),
length(unique(ms_res$assignment)),
length(unique(fcm_res$cluster))
),
Silhouette_Score = c(sil_score, sil_kmed_score,
round(mean(sil_db[, 3]), 4),
round(mean(sil_ms[, 3]), 4),
round(mean(sil_fcm[, 3]), 4)),
Dunn_Index = c(dunn_score,
round(stats_kmed$dunn, 4),
round(stats_db$dunn, 4),
round(stats_ms$dunn, 4),
round(stats_fcm$dunn, 4))
)
print(comparison_table)
## Method N_Clusters Silhouette_Score Dunn_Index
## 1 K-Means 7 0.2861 0.0049
## 2 K-Medians 7 0.2182 0.0030
## 3 DBSCAN 3 0.1690 0.0531
## 4 Mean Shift 3 0.7736 0.1095
## 5 Fuzzy C-Means 7 0.1375 0.0033
comparison_table %>%
pivot_longer(cols = c(Silhouette_Score, Dunn_Index),
names_to = "Metric", values_to = "Score") %>%
ggplot(aes(x = reorder(Method, Score), y = Score, fill = Method)) +
geom_col(show.legend = FALSE) +
facet_wrap(~Metric, scales = "free_x") +
coord_flip() +
theme_minimal() +
labs(title = "Clustering Methods Comparison",
subtitle = "Higher is better for both metrics",
x = "Method", y = "Score")
Note on Mean Shift: The Silhouette Score and Dunn Index for Mean Shift were computed on a 2,000-point sample due to computational constraints. Direct numerical comparison with other methods should be interpreted with caution, as sampling may influence metric values. The reported metrics are indicative of the method’s clustering tendency rather than an absolute comparison.