Install Package

library(tidyverse)
library(corrplot)
library(flexclust)
library(dbscan)
library(meanShiftR)
library(e1071)
library(cluster)
library(fpc)
library(factoextra)

Data Understanding

Load Dataset

The dataset used is the Online Shoppers Purchasing Intention Dataset, sourced from the UCI Machine Learning Repository (https://archive.ics.uci.edu/dataset/468). This dataset contains 12,330 online store visit sessions with 18 attributes. In this study, only 10 numerical features were used as clustering variables, namely Administrative, Administrative_Duration, Informational, Informational_Duration, ProductRelated, ProductRelated_Duration, BounceRates, ExitRates, PageValues, and SpecialDay.

df <- read.csv("online_shoppers_intention.csv", sep=",")

df_num <- df %>% select(
  Administrative, Administrative_Duration,
  Informational, Informational_Duration,
  ProductRelated, ProductRelated_Duration,
  BounceRates, ExitRates, PageValues, SpecialDay
)

Check Structure Data

Data structure checks are performed to ensure that all selected variables are of numeric type and ready for further processing. The output of str() displays the data type of each column along with some initial values as samples.

str(df_num)
## 'data.frame':    12330 obs. of  10 variables:
##  $ Administrative         : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ Administrative_Duration: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Informational          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Informational_Duration : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ProductRelated         : int  1 2 1 2 10 19 1 0 2 3 ...
##  $ ProductRelated_Duration: num  0 64 0 2.67 627.5 ...
##  $ BounceRates            : num  0.2 0 0.2 0.05 0.02 ...
##  $ ExitRates              : num  0.2 0.1 0.2 0.14 0.05 ...
##  $ PageValues             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SpecialDay             : num  0 0 0 0 0 0 0.4 0 0.8 0.4 ...

Descriptive Statistics

Descriptive statistics provide a summary of the distribution of each feature. It is evident that most features have a minimum value of 0 and a highly right-skewed distribution, particularly for duration features such as ProductRelated_Duration, which has a range of 0 to 63,973 seconds. This significant difference in scale between features is the primary reason why normalization is necessary before clustering is performed.

summary(df_num)
##  Administrative   Administrative_Duration Informational    
##  Min.   : 0.000   Min.   :   0.00         Min.   : 0.0000  
##  1st Qu.: 0.000   1st Qu.:   0.00         1st Qu.: 0.0000  
##  Median : 1.000   Median :   7.50         Median : 0.0000  
##  Mean   : 2.315   Mean   :  80.82         Mean   : 0.5036  
##  3rd Qu.: 4.000   3rd Qu.:  93.26         3rd Qu.: 0.0000  
##  Max.   :27.000   Max.   :3398.75         Max.   :24.0000  
##  Informational_Duration ProductRelated   ProductRelated_Duration
##  Min.   :   0.00        Min.   :  0.00   Min.   :    0.0        
##  1st Qu.:   0.00        1st Qu.:  7.00   1st Qu.:  184.1        
##  Median :   0.00        Median : 18.00   Median :  598.9        
##  Mean   :  34.47        Mean   : 31.73   Mean   : 1194.7        
##  3rd Qu.:   0.00        3rd Qu.: 38.00   3rd Qu.: 1464.2        
##  Max.   :2549.38        Max.   :705.00   Max.   :63973.5        
##   BounceRates         ExitRates         PageValues        SpecialDay     
##  Min.   :0.000000   Min.   :0.00000   Min.   :  0.000   Min.   :0.00000  
##  1st Qu.:0.000000   1st Qu.:0.01429   1st Qu.:  0.000   1st Qu.:0.00000  
##  Median :0.003112   Median :0.02516   Median :  0.000   Median :0.00000  
##  Mean   :0.022191   Mean   :0.04307   Mean   :  5.889   Mean   :0.06143  
##  3rd Qu.:0.016813   3rd Qu.:0.05000   3rd Qu.:  0.000   3rd Qu.:0.00000  
##  Max.   :0.200000   Max.   :0.20000   Max.   :361.764   Max.   :1.00000

Distribution & Outlier

Boxplots are used to visually detect outliers in the data prior to preprocessing. The red points on the boxplots indicate observations that fall outside the IQR. It is evident that nearly all features contain extreme outliers, which, if left unaddressed, will affect the quality of the clustering results, particularly in the case of K-Means, which is sensitive to outliers. The histograms show that all features are non-normally distributed and heavily skewed to the left.

df_long <- df_num %>%
pivot_longer(cols = everything(), names_to = "Feature", values_to = "Value")

ggplot(df_long, aes(x = Feature, y = Value, fill = Feature)) +
geom_boxplot(show.legend = FALSE, outlier.color = "red", outlier.alpha = 0.5) +
coord_flip() +
theme_minimal() +
labs(title = "Boxplot of 10 Numeric Features (Raw Data)",
subtitle = "Red dots indicate the presence of outliers",
x = "Feature",
y = "Value")

ggplot(df_long, aes(x = Value, fill = Feature)) +
geom_histogram(bins = 30, show.legend = FALSE, color = "white") +
facet_wrap(~Feature, scales = "free") +
theme_minimal() +
labs(title = "Feature Distribution Histogram",
x = "Value", y = "Frequency")

Preprocessing Data

Handling Missing Values

Checking for missing values is a mandatory step before clustering. Clustering algorithms such as K-Means and DBSCAN cannot process data with missing values. The results of the check show that all 10 numeric columns have no missing values, so data imputation is not required.

cat("Number of Missing Values per Column:\n")
## Number of Missing Values per Column:
colSums(is.na(df_num))
##          Administrative Administrative_Duration           Informational 
##                       0                       0                       0 
##  Informational_Duration          ProductRelated ProductRelated_Duration 
##                       0                       0                       0 
##             BounceRates               ExitRates              PageValues 
##                       0                       0                       0 
##              SpecialDay 
##                       0
df_clean <- df_num %>% drop_na()

Handling Duplicates

Duplicate rows can introduce bias in the clustering results because the same observations are counted more than once during the centroid formation process. A total of 761 duplicate rows were identified out of 12,330 rows; after removal, 11,569 unique data rows remained for use in further analysis.

cat("Number of duplicate rows:", sum(duplicated(df_clean)), "\n")
## Number of duplicate rows: 761
cat("Number of rows before cleaning:", nrow(df_clean), "\n")
## Number of rows before cleaning: 12330
df_clean <- df_clean %>% distinct()

cat("Number of rows after cleaning:", nrow(df_clean), "\n")
## Number of rows after cleaning: 11569

Handling Outliers

Extreme outliers are handled using the IQR capping (Winsorization) method, which involves replacing values below the lower bound (Q1 – 1.5×IQR) with that lower bound value, and values exceeding the upper bound (Q3 + 1.5×IQR) with the upper bound value. This method was chosen because it does not remove observations but merely limits the influence of extreme values, thereby preserving the total number of data points. Columns with IQR = 0 are not capped to avoid division by zero.

cap_outliers <- function(x) {
  Q1 <- quantile(x, 0.25)
  Q3 <- quantile(x, 0.75)
  IQR_val <- Q3 - Q1
  
  if (IQR_val == 0) return(x)
  
  lower <- Q1 - 1.5 * IQR_val
  upper <- Q3 + 1.5 * IQR_val
  pmin(pmax(x, lower), upper)
}

df_clean <- df_clean %>% mutate(across(everything(), cap_outliers))

Scaling (Standarization)

Standardization using Z-scores was performed to normalize the scales of all features so that no single feature would dominate the distance calculation simply because it had a wider range of values. After scaling, each feature has a mean of 0 and a standard deviation of 1. A check was performed to ensure that no NaN or Inf values were generated after the scaling process; neither was found.

zero_var_cols <- names(df_clean)[sapply(df_clean, var) == 0]
if (length(zero_var_cols) > 0) {
  cat("Columns with variance = 0 (removed before scaling):", 
      paste(zero_var_cols, collapse = ", "), "\n")
  df_clean <- df_clean %>% select(-all_of(zero_var_cols))
} else {
  cat("All columns OK, no variance =0\n")
}
## All columns OK, no variance =0
df_scaled <- as.data.frame(scale(df_clean))

cat("NaN after scaling:", sum(is.nan(as.matrix(df_scaled))), "\n")
## NaN after scaling: 0
cat("Inf after scaling:", sum(is.infinite(as.matrix(df_scaled))), "\n")
## Inf after scaling: 0

Exploratory Data Analysis

Descriptive Statistics

Descriptive statistics following preprocessing reveal significant changes in the data distribution. Previously extreme maximum values (e.g., ProductRelated_Duration reaching 63,973) are now under control following IQR capping. This tighter data distribution will result in more representative clusters.

summary(df_clean)
##  Administrative   Administrative_Duration Informational    
##  Min.   : 0.000   Min.   :  0.00          Min.   : 0.0000  
##  1st Qu.: 0.000   1st Qu.:  0.00          1st Qu.: 0.0000  
##  Median : 1.000   Median : 16.50          Median : 0.0000  
##  Mean   : 2.343   Mean   : 62.55          Mean   : 0.5365  
##  3rd Qu.: 4.000   3rd Qu.:100.92          3rd Qu.: 0.0000  
##  Max.   :10.000   Max.   :252.29          Max.   :24.0000  
##  Informational_Duration ProductRelated  ProductRelated_Duration
##  Min.   :   0.00        Min.   : 0.00   Min.   :   0.0         
##  1st Qu.:   0.00        1st Qu.: 9.00   1st Qu.: 242.6         
##  Median :   0.00        Median :20.00   Median : 671.9         
##  Mean   :  36.74        Mean   :28.53   Mean   :1066.1         
##  3rd Qu.:   0.00        3rd Qu.:40.00   3rd Qu.:1557.2         
##  Max.   :2549.38        Max.   :86.50   Max.   :3529.0         
##   BounceRates         ExitRates         PageValues        SpecialDay     
##  Min.   :0.000000   Min.   :0.00000   Min.   :  0.000   Min.   :0.00000  
##  1st Qu.:0.000000   1st Qu.:0.01334   1st Qu.:  0.000   1st Qu.:0.00000  
##  Median :0.002151   Median :0.02402   Median :  0.000   Median :0.00000  
##  Mean   :0.008573   Mean   :0.03101   Mean   :  6.277   Mean   :0.06287  
##  3rd Qu.:0.013793   3rd Qu.:0.04181   3rd Qu.:  0.000   3rd Qu.:0.00000  
##  Max.   :0.034483   Max.   :0.08451   Max.   :361.764   Max.   :1.00000

Correlation Heatmap

A correlation heatmap is used to identify linear relationships between features. High correlations between features can influence cluster formation. The heatmap results show a strong positive correlation between BounceRates and ExitRates, as well as between Administrative and Administrative_Duration. This suggests that visitors with a high bounce rate also tend to have a high exit rate, indicating a segment of visitors who are not engaged with the store’s content.

cor_matrix <- cor(df_clean)

corrplot(cor_matrix, 
         method = "color", 
         type = "upper", 
         addCoef.col = "black", 
         tl.col = "black",      
         tl.srt = 45,           
         number.cex = 0.7,      
         title = "Correlation Heatmap of 10 Numerical Features",
         mar = c(0,0,1,0))      

Distribution and Outliers

Box plots and histograms of the scaled data show that all features are now on a comparable scale (Z-scores). The outliers remaining after IQR capping appear to be better controlled. This more homogeneous distribution ensures that distance-based algorithms such as K-Means and DBSCAN will not be biased toward specific features.

df_long <- df_scaled %>%
  pivot_longer(cols = everything(), names_to = "Feature", values_to = "Value")

ggplot(df_long, aes(x = Feature, y = Value, fill = Feature)) +
  geom_boxplot(show.legend = FALSE, outlier.color = "red", outlier.alpha = 0.5) +
  coord_flip() +
  theme_minimal() +
  labs(title = "Boxplot of 10 Numeric Features (Scaled Data)",
       subtitle = "Red dots indicate the presence of outliers (Z-Score)",
       x = "Feature",
       y = "Z-Score Value")

ggplot(df_long, aes(x = Value, fill = Feature)) +
  geom_histogram(bins = 30, show.legend = FALSE, color = "white") +
  facet_wrap(~Feature, scales = "free") +
  theme_minimal() +
  labs(title = "Feature Distribution Histogram (Scaled Data)",
       x = "Z-Score Value", y = "Frequency")

Optimal Cluster Determination

Elbow Method (WSS)

The Elbow Method calculates the total Within-Cluster Sum of Squares (WSS) for each value of k. As k increases, the WSS decreases because the clusters become smaller and more compact. The “elbow” point—the point at which the decrease in WSS begins to slow significantly indicates the optimal number of clusters. From the resulting graph, the elbow point is identified at k = 7.

set.seed(123)
fviz_nbclust(df_scaled, kmeans, method = "wss", k.max = 10) +
  labs(title = "Elbow Method for Optimal K") +
  theme_minimal()

Silhouette Method

The Silhouette Method measures how well each observation fits within its own cluster compared to the nearest neighboring clusters. Silhouette values range from -1 to 1, where higher values indicate better cluster separation. The silhouette plot confirms that the highest average silhouette value is also obtained at k = 7. Based on these two methods, k = 7 is selected as the optimal number of clusters for all methods that require k to be determined.

set.seed(123)
fviz_nbclust(df_scaled, kmeans, method = "silhouette", k.max = 10) +
  labs(title = "Silhouette Method for Optimal K") +
  theme_minimal()

## CONCLUSION OF K SELECTION
# Elbow Method: The inflection point is visible at k = 7
# Silhouette: The highest value is at k = 7
# → Both methods agree, k = 7 is selected

optimal_k <- 7
cat("Optimal K selected:", optimal_k)
## Optimal K selected: 7

K-NN Distance Plot

For the DBSCAN algorithm, the epsilon (ε) parameter is determined using a kNN Distance Plot. This plot displays the distances to the k nearest neighbors (where k = inPts - 1 = 10) for each point, sorted in ascending order. The “knee” point on the graph indicates the optimal value of ε where points begin to be far apart from one another, marking the transition from a dense area to a sparse area. From this plot, ε = 1.0 was selected because it produces 3 clusters with 18.3% noise, which represents the best balance between a meaningful number of clusters and an acceptable level of noise.

kNNdistplot(df_scaled, k = 11)
abline(h = 2.0, col = "red", lty = 2)

PCA 2D

Principal Component Analysis (PCA) is applied not as a clustering method, but as a visualization tool. Since the data has 10 dimensions, the clustering results cannot be visualized directly. PCA projects the data onto two principal components (PC1 and PC2) that capture the greatest variance. PC1 explains 33.7% of the variance and PC2 explains 16.0%, so the total variance captured by these two components is 49.7%, which is sufficiently representative for visualization purposes.

pca_res <- prcomp(df_scaled, center = FALSE, scale. = FALSE)
df_pca <- as.data.frame(pca_res$x[, 1:2])

cat("Proportion of variance explained by PC1 and PC2:\n")
## Proportion of variance explained by PC1 and PC2:
summary(pca_res)$importance[2, 1:2]
##     PC1     PC2 
## 0.33748 0.15972
var_explained <- summary(pca_res)$importance[2, 1:2]
cat(sprintf("PC1 explained %.1f%% variance\n", var_explained[1] * 100))
## PC1 explained 33.7% variance
cat(sprintf("PC2 explained %.1f%% variance\n", var_explained[2] * 100))
## PC2 explained 16.0% variance
cat(sprintf("Total 2 components: %.1f%% variance\n", sum(var_explained) * 100))
## Total 2 components: 49.7% variance

K-Means Clustering

Implementation

K-Means is a partition-based clustering algorithm that groups data into k clusters by minimizing the Within-Cluster Sum of Squares (WCSS). The algorithm begins with k random centroids, then iteratively assigns each data point to the nearest centroid and updates the centroid’s position based on the average of the data points in that cluster until convergence. The parameter nstart = 25 is used to run the algorithm from 25 different starting points and select the best result, in order to avoid suboptimal local solutions. The results show a BSS/TSS ratio of 58.48%, meaning that 58.48% of the total data variance can be explained by the separation between clusters.

set.seed(123)
km_res <- kmeans(df_scaled, centers = optimal_k, nstart = 25)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 578450)
cat("Ukuran tiap cluster:\n")
## Ukuran tiap cluster:
print(table(km_res$cluster))
## 
##    1    2    3    4    5    6    7 
## 1662  499 1632 4719  435  798 1824
cat("\nTotal Within-cluster SS :", round(km_res$tot.withinss, 2))
## 
## Total Within-cluster SS : 48028.32
cat("\nBetween-cluster SS      :", round(km_res$betweenss, 2))
## 
## Between-cluster SS      : 67651.68
cat("\nRasio BSS/TSS           :", round(km_res$betweenss / km_res$totss * 100, 2), "%\n")
## 
## Rasio BSS/TSS           : 58.48 %

PCA Visualization

df_pca$cluster_km <- as.factor(km_res$cluster)

ggplot(df_pca, aes(x = PC1, y = PC2, color = cluster_km)) +
  geom_point(alpha = 0.3, size = 1.0) +
  stat_ellipse(level = 0.95, linewidth = 0.8) +
  theme_minimal() +
  labs(
    title    = "K-Means Clustering — PCA Visualization",
    subtitle = paste0("k = ", optimal_k,
                      " | PC1 + PC2 explained variance: ",
                      round(sum(var_explained) * 100, 1), "%"),
    color    = "Cluster",
    x        = "PC1",
    y        = "PC2"
  )

Cluster Profiling

Cluster profiling displays the average value of each feature per cluster using the original data (before scaling) to make the results easier to interpret in context. Heatmap profiling visually illustrates the differences in characteristics between clusters, with red indicating high values and blue indicating low values.

df_profile_km <- df_clean %>%
  mutate(cluster = as.factor(km_res$cluster))

profiling <- df_profile_km %>%
  group_by(cluster) %>%
  summarise(across(where(is.numeric), ~mean(.x, na.rm = TRUE), .names = "{.col}"))
  
print(profiling)
## # A tibble: 7 × 11
##   cluster Administrative Administrative_Duration Informational
##   <fct>            <dbl>                   <dbl>         <dbl>
## 1 1                0.690                    19.5         0.153
## 2 2                2.10                     58.6         0.345
## 3 3                4.08                     97.6         0.911
## 4 4                0.881                    19.2         0.176
## 5 5                5.95                    156.          4.82 
## 6 6                0.812                    17.4         0.198
## 7 7                5.95                    181.          0.663
## # ℹ 7 more variables: Informational_Duration <dbl>, ProductRelated <dbl>,
## #   ProductRelated_Duration <dbl>, BounceRates <dbl>, ExitRates <dbl>,
## #   PageValues <dbl>, SpecialDay <dbl>
profiling %>%
  pivot_longer(-cluster, names_to = "Feature", values_to = "Mean") %>%
  ggplot(aes(x = cluster, y = Feature, fill = Mean)) +
  geom_tile(color = "white") +
  geom_text(aes(label = round(Mean, 2)), size = 3) +
  scale_fill_gradient2(
    low      = "#3498DB",
    mid      = "#ECF0F1",
    high     = "#E74C3C",
    midpoint = 0
  ) +
  theme_minimal() +
  labs(
    title    = "Cluster Profiling Heatmap",
    subtitle = "Mean value per feature per cluster (original scale)",
    x        = "Cluster",
    y        = "Feature",
    fill     = "Mean"
  )

K-Means Evaluation

sil_km <- silhouette(km_res$cluster, dist(df_scaled))
sil_score <- round(mean(sil_km[, 3]), 4)
cat("Silhouette Score K-Means:", sil_score, "\n")
## Silhouette Score K-Means: 0.2861
stats_km <- cluster.stats(dist(df_scaled), km_res$cluster)
dunn_score <- round(stats_km$dunn, 4)
cat("Dunn Index K-Means      :", dunn_score, "\n")
## Dunn Index K-Means      : 0.0049
fviz_silhouette(sil_km) +
  theme_minimal() +
  labs(title = "Silhouette Plot — K-Means Clustering")
##   cluster size ave.sil.width
## 1       1 1662          0.40
## 2       2  499          0.17
## 3       3 1632          0.16
## 4       4 4719          0.37
## 5       5  435          0.07
## 6       6  798          0.23
## 7       7 1824          0.17

K-Median Clustering

Implementation

K-Medians is a variant of K-Means that uses the median as the cluster center instead of the mean. The use of the median makes this algorithm more robust to outliers than K-Means, because the median is not affected by extreme values. K-Medians uses the Manhattan distance (L1 norm) as the distance metric, unlike K-Means, which uses the Euclidean distance (L2 norm). The implementation uses the kcca() function from the flexclust package with the kmedians family.

set.seed(123)
kmed_res <- kcca(df_scaled, k = optimal_k, family = kccaFamily("kmedians"))
## Found more than one class "kcca" in cache; using the first, from namespace 'flexclust'
## Also defined by 'kernlab'
## Found more than one class "kcca" in cache; using the first, from namespace 'flexclust'
## Also defined by 'kernlab'
cat("Ukuran tiap cluster:\n")
## Ukuran tiap cluster:
print(table(clusters(kmed_res)))
## 
##    1    2    3    4    5    6    7 
##  663 2277 1581 3737  701 1742  868

PCA Visualization

df_pca$cluster_kmed <- as.factor(clusters(kmed_res))

ggplot(df_pca, aes(x = PC1, y = PC2, color = cluster_kmed)) +
  geom_point(alpha = 0.3, size = 1.0) +
  stat_ellipse(level = 0.95, linewidth = 0.8) +
  theme_minimal() +
  labs(
    title    = "K-Medians Clustering — PCA Visualization",
    subtitle = paste0("k = ", optimal_k),
    color    = "Cluster", x = "PC1", y = "PC2"
  )

Cluster Profiling

df_profile_kmed <- df_clean %>%
  mutate(cluster = as.factor(clusters(kmed_res))) %>%
  group_by(cluster) %>%
  summarise(across(where(is.numeric), ~mean(.x, na.rm = TRUE))) %>%
  mutate(across(where(is.numeric), ~round(.x, 3)))
print(df_profile_kmed)
## # A tibble: 7 × 11
##   cluster Administrative Administrative_Duration Informational
##   <fct>            <dbl>                   <dbl>         <dbl>
## 1 1                1.78                    50.5          0.243
## 2 2                4.58                   134.           0.185
## 3 3                1.59                    33.3          0.467
## 4 4                0.385                    7.07         0.139
## 5 5                5.18                   144.           3.24 
## 6 6                0.554                   16.2          0.13 
## 7 7                8.00                   205.           2.15 
## # ℹ 7 more variables: Informational_Duration <dbl>, ProductRelated <dbl>,
## #   ProductRelated_Duration <dbl>, BounceRates <dbl>, ExitRates <dbl>,
## #   PageValues <dbl>, SpecialDay <dbl>
df_profile_kmed %>%
  pivot_longer(-cluster, names_to = "Feature", values_to = "Mean") %>%
  ggplot(aes(x = cluster, y = Feature, fill = Mean)) +
  geom_tile(color = "white") +
  geom_text(aes(label = round(Mean, 2)), size = 3) +
  scale_fill_gradient2(low="#3498DB", mid="#ECF0F1", high="#E74C3C", midpoint=0) +
  theme_minimal() +
  labs(title = "Cluster Profiling Heatmap — K-Medians")

Evaluation

sil_kmed  <- silhouette(clusters(kmed_res), dist(df_scaled))
sil_kmed_score <- round(mean(sil_kmed[, 3]), 4)
cat("Silhouette Score K-Medians:", sil_kmed_score, "\n")
## Silhouette Score K-Medians: 0.2182
stats_kmed <- cluster.stats(dist(df_scaled), clusters(kmed_res))
cat("Dunn Index K-Medians      :", round(stats_kmed$dunn, 4), "\n")
## Dunn Index K-Medians      : 0.003
fviz_silhouette(sil_kmed) +
  theme_minimal() +
  labs(title = "Silhouette Plot — K-Medians Clustering")
##   cluster size ave.sil.width
## 1       1  663          0.13
## 2       2 2277          0.18
## 3       3 1581          0.14
## 4       4 3737          0.31
## 5       5  701         -0.02
## 6       6 1742          0.33
## 7       7  868          0.09

DBSCAN Clustering

Parameter Selection (eps Tuning)

Before implementation, the optimal eps value was found using the following loop. Eps = 1.0 was chosen because it produced more than 1 cluster with a reasonable noise percentage (18.3%).

# Test berbagai nilai eps
for (eps_val in c(0.3, 0.5, 0.8, 1.0, 1.2, 1.5, 2.0, 2.5)) {
  db_test <- dbscan(df_scaled, eps = eps_val, MinPts = 11)
  n_cluster <- length(unique(db_test$cluster[db_test$cluster != 0]))
  n_noise   <- sum(db_test$cluster == 0)
  cat(sprintf("eps = %.1f | Clusters: %d | Noise: %d (%.1f%%)\n",
              eps_val, n_cluster, n_noise, n_noise/nrow(df_scaled)*100))
}
## eps = 0.3 | Clusters: 25 | Noise: 8476 (73.3%)
## eps = 0.5 | Clusters: 14 | Noise: 6226 (53.8%)
## eps = 0.8 | Clusters: 14 | Noise: 3648 (31.5%)
## eps = 1.0 | Clusters: 3 | Noise: 2120 (18.3%)
## eps = 1.2 | Clusters: 1 | Noise: 1308 (11.3%)
## eps = 1.5 | Clusters: 1 | Noise: 698 (6.0%)
## eps = 2.0 | Clusters: 1 | Noise: 237 (2.0%)
## eps = 2.5 | Clusters: 1 | Noise: 98 (0.8%)

Implementation

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based algorithm that does not require the number of clusters to be specified in advance. This algorithm classifies data points into three categories: core points (having at least MinPts neighbors within a radius of ε), border points (located within a radius of ε from a core point but not meeting the MinPts requirement), and noise points (neither of the above). The parameters used are eps = 1.0 based on the kNN Distance Plot analysis, and MinPts = 11 according to the general rule MinPts ≥ D+1 where D = 10 (the number of dimensions). DBSCAN produces 3 clusters with 2,120 noise points (17.19% of the total data).

db_res <- dbscan(df_scaled, eps = 1.0, MinPts = 11)
cat("Distribusi cluster DBSCAN (0 = noise):\n")
## Distribusi cluster DBSCAN (0 = noise):
print(table(db_res$cluster))
## 
##    0    1    2    3 
## 2120 9401   37   11
cat("Jumlah noise points:", sum(db_res$cluster == 0), "\n")
## Jumlah noise points: 2120

PCA Visualization

df_pca$cluster_db <- as.factor(db_res$cluster)

ggplot(df_pca, aes(x = PC1, y = PC2, color = cluster_db)) +
  geom_point(alpha = 0.3, size = 1.0) +
  theme_minimal() +
  scale_color_manual(
    values = c("0" = "grey70",
               setNames(scales::hue_pal()(length(unique(db_res$cluster)) - 1),
                        as.character(sort(unique(db_res$cluster[db_res$cluster != 0]))))),
    labels = c("0" = "Noise", 
               as.character(sort(unique(db_res$cluster[db_res$cluster != 0]))))
  ) +
  labs(
    title    = "DBSCAN Clustering — PCA Visualization",
    subtitle = paste0("eps = 1.0 | minPts = 11 | Noise points: ",
                      sum(db_res$cluster == 0)),
    color    = "Cluster", x = "PC1", y = "PC2"
  )

Evaluation

Metrics are evaluated only on data that is not noise (cluster ≠ 0), since noise points, by definition, do not belong to any cluster and would distort the silhouette value if included.

db_no_noise <- db_res$cluster[db_res$cluster != 0]
df_scaled_no_noise <- df_scaled[db_res$cluster != 0, ]

if (length(unique(db_no_noise)) > 1) {
  sil_db <- silhouette(db_no_noise, dist(df_scaled_no_noise))
  cat("Silhouette Score DBSCAN:", round(mean(sil_db[, 3]), 4), "\n")
  stats_db <- cluster.stats(dist(df_scaled_no_noise), db_no_noise)
  cat("Dunn Index DBSCAN      :", round(stats_db$dunn, 4), "\n")
} else {
  cat("Hanya 1 cluster terbentuk, evaluasi tidak dapat dihitung.\n")
}
## Silhouette Score DBSCAN: 0.169 
## Dunn Index DBSCAN      : 0.0531
cat ("\nDBSCAN often produces low silhouettes due to its non-spherical clusters. This needs to be explained in the DBSCAN evaluation narrative.")
## 
## DBSCAN often produces low silhouettes due to its non-spherical clusters. This needs to be explained in the DBSCAN evaluation narrative.

Mean Shift Clustering

Implementation

Mean Shift is a density-based non-parametric clustering algorithm that works by iteratively shifting each data point toward the weighted average of its neighbors within a certain bandwidth radius. This process is called hill-climbing because the points always move toward areas of higher density until they converge on the mode (peak) of the distribution. The main advantage of Mean Shift is that it does not require the number of clusters to be specified at the outset. Due to the high computational complexity, the analysis was performed on a sample of 2000 data points using a bandwidth of 3.0

set.seed(123)
sample_idx <- sample(nrow(df_scaled), 2000)
df_ms_sample <- as.matrix(df_scaled[sample_idx, ])

ms_res <- meanShift(
  df_ms_sample,
  bandwidth = rep(3.0, ncol(df_ms_sample))
)

cat("Distribusi cluster Mean Shift:\n")
## Distribusi cluster Mean Shift:
print(table(ms_res$assignment))
## 
##    1    2    3 
## 1998    1    1

Due to the quadratic computational complexity O(n²) of the Mean Shift algorithm, which makes it infeasible to run on the full dataset of 11,569 observations within a reasonable time, the analysis was performed on a representative random sample of 2,000 data points (17.3% of the total cleaned dataset). The sample was drawn using set.seed(123) to ensure reproducibility. This sampling approach is consistent with common practice in the literature when applying Mean Shift to large datasets (Comaniciu & Meer, 2002).

PCA Visualization

df_pca_ms <- df_pca[sample_idx, ]
df_pca_ms$cluster_ms <- as.factor(ms_res$assignment)

ggplot(df_pca_ms, aes(x = PC1, y = PC2, color = cluster_ms)) +
  geom_point(alpha = 0.4, size = 1.2) +
  stat_ellipse(level = 0.95, linewidth = 0.8) +
  theme_minimal() +
  labs(
    title    = "Mean Shift Clustering — PCA Visualization",
    subtitle = paste0("bandwidth = 3.0 | n sample = 2000"),
    color    = "Cluster", x = "PC1", y = "PC2"
  )
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_path()`).

Evaluation

if (length(unique(ms_res$assignment)) > 1) {
  sil_ms <- silhouette(ms_res$assignment, dist(df_ms_sample))
  cat("Silhouette Score Mean Shift:", round(mean(sil_ms[, 3]), 4), "\n")
  stats_ms <- cluster.stats(dist(df_ms_sample), ms_res$assignment)
  cat("Dunn Index Mean Shift      :", round(stats_ms$dunn, 4), "\n")
} else {
  cat("Mean Shift hanya menghasilkan 1 cluster dengan bandwidth ini.\n")
  cat("Coba turunkan bandwidth, misal: 1.0 atau 0.8\n")
}
## Silhouette Score Mean Shift: 0.7736 
## Dunn Index Mean Shift      : 0.1095

FUZZY C-MEANS Clustering

Implementation

Fuzzy C-Means (FCM) is a soft clustering method in which each data point does not belong exclusively to a single cluster, but rather has a membership degree for all clusters with values between 0 and 1. The total sum of a data point’s membership degrees across all clusters equals 1. The fuzziness parameter m = 2 is the standard value that controls how “fuzzy” the boundaries between clusters are; the larger m is, the more the clusters overlap. FCM is highly relevant for visitor behavior data where the boundaries between segments are not always clear-cut.

set.seed(123)
fcm_res <- cmeans(as.matrix(df_scaled), centers = optimal_k, m = 2)

cat("Distribusi cluster Fuzzy C-Means:\n")
## Distribusi cluster Fuzzy C-Means:
print(table(fcm_res$cluster))
## 
##    1    2    3    4    5    6    7 
## 1798  316 1590 3718  794 1853 1500

PCA Visualization

df_pca$cluster_fcm <- as.factor(fcm_res$cluster)

ggplot(df_pca, aes(x = PC1, y = PC2, color = cluster_fcm)) +
  geom_point(alpha = 0.3, size = 1.0) +
  stat_ellipse(level = 0.95, linewidth = 0.8) +
  theme_minimal() +
  labs(
    title    = "Fuzzy C-Means Clustering — PCA Visualization",
    subtitle = paste0("k = ", optimal_k, " | m = 2 (fuzziness)"),
    color    = "Cluster", x = "PC1", y = "PC2"
  )

CLUSTER PROFILING

df_profile_fcm <- df_clean %>%
  mutate(cluster = as.factor(fcm_res$cluster)) %>%
  group_by(cluster) %>%
  summarise(across(where(is.numeric), ~mean(.x, na.rm = TRUE))) %>%
  mutate(across(where(is.numeric), ~round(.x, 3)))
print(df_profile_fcm)
## # A tibble: 7 × 11
##   cluster Administrative Administrative_Duration Informational
##   <fct>            <dbl>                   <dbl>         <dbl>
## 1 1                0.542                    15.6         0.12 
## 2 2                2.01                     53.4         0.408
## 3 3                4.82                    147.          0.656
## 4 4                0.57                     11.9         0.082
## 5 5                2.12                     55.8         0.513
## 6 6                6.49                    167.          1.99 
## 7 7                1.35                     30.8         0.278
## # ℹ 7 more variables: Informational_Duration <dbl>, ProductRelated <dbl>,
## #   ProductRelated_Duration <dbl>, BounceRates <dbl>, ExitRates <dbl>,
## #   PageValues <dbl>, SpecialDay <dbl>

Evaluation

sil_fcm <- silhouette(fcm_res$cluster, dist(df_scaled))
cat("Silhouette Score FCM:", round(mean(sil_fcm[, 3]), 4), "\n")
## Silhouette Score FCM: 0.1375
stats_fcm <- cluster.stats(dist(df_scaled), fcm_res$cluster)
cat("Dunn Index FCM      :", round(stats_fcm$dunn, 4), "\n")
## Dunn Index FCM      : 0.0033
fviz_silhouette(sil_fcm) +
  theme_minimal() +
  labs(title = "Silhouette Plot — Fuzzy C-Means Clustering")
##   cluster size ave.sil.width
## 1       1 1798          0.34
## 2       2  316         -0.10
## 3       3 1590          0.05
## 4       4 3718          0.29
## 5       5  794         -0.08
## 6       6 1853          0.01
## 7       7 1500         -0.05

Comparison All Clustering

The following comparison table summarizes the performance of the five clustering methods based on two internal evaluation metrics. The Silhouette Score measures how well each observation fits within its own cluster (range -1 to 1; higher is better). The Dunn Index measures the ratio of the minimum distance between clusters to the maximum cluster diameter (higher is better). The method with the highest values for both metrics is considered to produce the most optimal segmentation.

comparison_table <- data.frame(
  Method           = c("K-Means", "K-Medians", "DBSCAN", "Mean Shift", "Fuzzy C-Means"),
  N_Clusters       = c(
    length(unique(km_res$cluster)),
    length(unique(clusters(kmed_res))),
    length(unique(db_res$cluster[db_res$cluster != 0])),
    length(unique(ms_res$assignment)),
    length(unique(fcm_res$cluster))
  ),
  Silhouette_Score = c(sil_score, sil_kmed_score,
                       round(mean(sil_db[, 3]), 4),
                       round(mean(sil_ms[, 3]), 4),
                       round(mean(sil_fcm[, 3]), 4)),
  Dunn_Index       = c(dunn_score,
                       round(stats_kmed$dunn, 4),
                       round(stats_db$dunn, 4),
                       round(stats_ms$dunn, 4),
                       round(stats_fcm$dunn, 4))
)

print(comparison_table)
##          Method N_Clusters Silhouette_Score Dunn_Index
## 1       K-Means          7           0.2861     0.0049
## 2     K-Medians          7           0.2182     0.0030
## 3        DBSCAN          3           0.1690     0.0531
## 4    Mean Shift          3           0.7736     0.1095
## 5 Fuzzy C-Means          7           0.1375     0.0033

Comparison Visualization

comparison_table %>%
  pivot_longer(cols = c(Silhouette_Score, Dunn_Index),
               names_to = "Metric", values_to = "Score") %>%
  ggplot(aes(x = reorder(Method, Score), y = Score, fill = Method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~Metric, scales = "free_x") +
  coord_flip() +
  theme_minimal() +
  labs(title = "Clustering Methods Comparison",
       subtitle = "Higher is better for both metrics",
       x = "Method", y = "Score")

Note on Mean Shift: The Silhouette Score and Dunn Index for Mean Shift were computed on a 2,000-point sample due to computational constraints. Direct numerical comparison with other methods should be interpreted with caution, as sampling may influence metric values. The reported metrics are indicative of the method’s clustering tendency rather than an absolute comparison.