1 Introduction

Clustering is a powerful unsupervised learning technique used to group data points into meaningful clusters based on their similarities. The key objective is to maximize intra-cluster similarity and minimize inter-cluster similarity.

The dataset analyzed in this project would typically originate from a survey filled out by guests upon arrival at a club with two distinct rooms and two different music sources. The survey captures key information such as room preferences, entry times, additional preferences for music or activities, demographic details, and the real-time status of the club’s environment. This structured data collection process ensures a rich and reliable dataset for subsequent analysis. However, due to privacy reasons…

The primary goal of this analysis is to better understand the preferences and behaviors of clubgoers. By clustering guests based on their responses, we aim to:

In this project, we explore:

2 Data Preprocessing

2.1 Preprocessing Steps

2.1.1 Handling Missing Data

Checked and imputed missing values.

2.1.2 Binary Encoding

Converted Preferred.Room and Gender into binary variables.

2.1.3 One-Hot Encoding

Created dummy variables for Additional.Preferences.

2.1.4 Time Conversion

Validated and converted Time.of.Entry into numeric minutes.

2.1.5 Normalization

Scaled all numerical features to ensure equal weighting.

# Load required packages
library(dplyr)      # pipes and mutate() for recoding variables
library(lubridate)  # hm(), hour(), minute() for parsing entry times

# Load the dataset
club_goers_dataset <- read.csv("~/Desktop/Data science and Business Analytics/unsupervised learning project /Generated_Clubgoers_Dataset.csv", sep = ";")

# Check missing values
missing_summary <- colSums(is.na(club_goers_dataset))
missing_summary
##              Respondent.ID             Preferred.Room 
##                          0                          0 
##              Time.of.Entry     Additional.Preferences 
##                          0                          0 
##                        Age                     Gender 
##                          0                          0 
## Minutes.Since.Club.Opening            Small.Room.Open 
##                          0                          0
# Binary encoding for Preferred.Room and Gender
club_goers_dataset <- club_goers_dataset %>%
  mutate(
    Preferred.Room = ifelse(Preferred.Room == "Latin", 0, 1),
    Gender = ifelse(Gender == "Male", 0, 1)
  )

# One-hot encoding for Additional Preferences
club_goers_dataset <- fastDummies::dummy_cols(
  club_goers_dataset, 
  select_columns = "Additional.Preferences", 
  remove_first_dummy = TRUE
)

# Convert Time.of.Entry to numeric (minutes since midnight)
club_goers_dataset <- club_goers_dataset %>%
  mutate(
    Time.of.Entry = as.numeric(hour(hm(Time.of.Entry)) * 60 + minute(hm(Time.of.Entry)))
  )

# Handle missing values in Time.of.Entry
club_goers_dataset$Time.of.Entry[is.na(club_goers_dataset$Time.of.Entry)] <- 
  mean(club_goers_dataset$Time.of.Entry, na.rm = TRUE)

# Normalize numerical features
scaled_data <- club_goers_dataset %>%
  select(Time.of.Entry, Age, Preferred.Room, Gender) %>%
  scale()

3 Clustering Analysis

3.1 K-Means Clustering

3.1.1 Choosing Optimal k Using the Elbow Method

The elbow method was used to determine the optimal number of clusters by analyzing the total within-cluster sum of squares (WSS) as the number of clusters (k) increases. The ‘elbow point’ represents the value of k where the rate of decrease in WSS sharply levels off, indicating a good balance between cluster compactness and interpretability.

# Test multiple values of k for WSS (within-cluster sum of squares)
wss <- sapply(2:10, function(k) {
  # Perform K-Means clustering for different values of k
  km <- kmeans(scaled_data, centers = k, nstart = 25)
  km$tot.withinss  # Total within-cluster sum of squares (inertia)
})

# Plot the elbow method to determine the optimal k
plot(2:10, wss, type = "b",
     xlab = "Number of Clusters (k)", ylab = "Within-cluster Sum of Squares (WSS)",
     main = "Choosing Optimal k with the Elbow Method")

# Annotate the elbow plot with the optimal k (k = 3)
optimal_k <- 3  # The elbow point
abline(v = optimal_k, col = "blue", lty = 2)
text(optimal_k, wss[optimal_k - 1], paste("k =", optimal_k), pos = 4, col = "blue")

# Perform K-Means Clustering with k = 3
set.seed(42)
kmeans_result <- kmeans(scaled_data, centers = 3, nstart = 25)  # Use k = 3 based on the elbow plot

# Add cluster labels to the dataset
club_goers_dataset$KMeans_Cluster <- kmeans_result$cluster

# Check cluster sizes and mean values for each variable within clusters
cluster_summary <- aggregate(scaled_data, by = list(Cluster = kmeans_result$cluster), FUN = mean)
print(cluster_summary)
##   Cluster Time.of.Entry         Age Preferred.Room      Gender
## 1       1    -0.5623472 -0.06339887      0.9206479  0.02396141
## 2       2    -0.5467574 -0.01092294     -1.0807606  0.11565977
## 3       3     1.7570567  0.12095594      0.1284237 -0.21533571
# Optional: Visualize cluster distribution using PCA for better separation
library(ggfortify)  # provides autoplot() for prcomp objects

pca_result <- prcomp(scaled_data, center = TRUE, scale. = TRUE)
autoplot(pca_result, data = club_goers_dataset, colour = 'KMeans_Cluster') +
  labs(title = "K-Means Clustering Visualization (PCA)")

3.1.2 Interpretation of Visualisations

The elbow plot shows the relationship between the number of clusters (k) and the within-cluster sum of squares (WSS). The ‘elbow’ point at k = 3 indicates the optimal number of clusters. Beyond this point, increasing k does not significantly reduce the WSS, suggesting that 3 clusters provide the best balance between cluster compactness and simplicity. This confirms that k = 3 is the ideal number for this dataset, where the clusters start to form clearly without overfitting.

This PCA plot visualizes the results of K-Means clustering with k = 3. The points are colored based on their assigned cluster, with three distinct groups visible. Cluster 1 (represented by light blue) is mostly separated from Clusters 2 and 3, while Cluster 3 (dark blue) appears to be more spread out. The plot demonstrates how the K-Means algorithm has successfully divided the data into three clusters based on their preferences and behaviors. The clear separation between the clusters in the PCA space suggests that the clustering was successful in identifying meaningful patterns in the data.
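As a quick numeric check on this plot (a minimal sketch using the pca_result object created above), the proportion of variance captured by the first two principal components indicates how faithfully the two-dimensional projection represents the four scaled features:

# Share of total variance explained by PC1 and PC2 in the PCA used for plotting
summary(pca_result)$importance["Proportion of Variance", 1:2]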

3.2 DBSCAN Clustering

3.2.1 Choosing Optimal eps Using k-Distance Plot

library(dbscan)  # kNNdist(), kNNdistplot() and dbscan()

# k-distance plot: sorted distance from each point to its 3rd nearest neighbor
kNNdistplot(scaled_data, k = 3)
abline(h = 0.5, col = "red", lty = 2)

This plot shows the 3-NN distance for each point in the dataset, sorted in increasing order. The red horizontal line marks the chosen eps threshold (0.5): points whose 3-NN distance falls below this threshold lie in dense regions and are assigned to clusters, while points above it are treated as noise or outliers. The sharp increase in the 3-NN distance after a certain point suggests that most observations sit in dense regions, while the jump identifies sparse regions where points are far from their neighbors.
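To complement the visual inspection, the sketch below (assuming the dbscan package is loaded) locates the sharpest jump in the sorted 3-NN distances, which gives a rough numeric candidate for eps:

# Sorted distance from each point to its 3rd nearest neighbor
knn_dists <- sort(dbscan::kNNdist(scaled_data, k = 3))

# The largest gap between consecutive sorted distances marks the sharpest jump;
# the distance just before that jump is a rough candidate for eps
gaps <- diff(knn_dists)
knn_dists[which.max(gaps)]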

3.2.2 Perform DBSCAN Clustering

dbscan_result <- dbscan(scaled_data, eps = 0.5, minPts = 5)
club_goers_dataset$DBSCAN_Cluster <- dbscan_result$cluster

# Reuse the PCA projection computed earlier to visualize the DBSCAN clusters
autoplot(pca_result, data = club_goers_dataset, colour = 'DBSCAN_Cluster') +
  labs(title = "DBSCAN Clustering Visualization (PCA)")

The PCA plot visualizes the clustering results from DBSCAN, where points are colored according to their assigned clusters. The different shades of blue represent varying clusters, with some points marked as noise (cluster 0). DBSCAN successfully identified several distinct clusters, and the outliers or noise points are visibly separated from the rest of the data. The plot shows that DBSCAN is able to handle irregularly shaped clusters and can detect noise effectively, unlike K-Means, which assumes spherical clusters.

3.2.3 Visual Comparison of K-Means and DBSCAN

# Compare the clusters from K-Means and DBSCAN using PCA
ggplot(club_goers_dataset, aes(x = pca_result$x[,1], y = pca_result$x[,2], color = factor(KMeans_Cluster))) +
  geom_point() +
  labs(title = "Comparison of K-Means and DBSCAN Clusters",
       x = "PCA Component 1", y = "PCA Component 2", color = "K-Means Clusters")

ggplot(club_goers_dataset, aes(x = pca_result$x[,1], y = pca_result$x[,2], color = factor(DBSCAN_Cluster))) +
  geom_point() +
  labs(title = "DBSCAN Clusters Visualization",
       x = "PCA Component 1", y = "PCA Component 2", color = "DBSCAN Clusters")

3.2.4 Interpretation of Visualisations

The Comparison of K-Means and DBSCAN Clusters plot compares the clusters identified by both K-Means and DBSCAN. In the K-Means clustering, the data points are divided into 3 distinct clusters (red, green, and blue), while DBSCAN identifies several irregular clusters and noise points. K-Means performs well when the data forms compact, spherical clusters, as seen in the clear separation between the clusters. On the other hand, DBSCAN is more flexible in detecting non-spherical clusters and noise, making it more suitable for handling complex data structures where some points do not belong to any cluster.

The DBSCAN Clusters Visualization plot visualizes the DBSCAN clustering results, with each cluster colored differently. DBSCAN has successfully identified multiple clusters, ranging from well-defined groups (such as green and yellow) to outliers marked as noise (cluster 0). The varying colors indicate the diversity of clusters found by DBSCAN, showing its ability to detect both dense areas and irregular-shaped clusters. Points that do not fit the cluster criteria are labeled as noise and are not included in any cluster. This highlights DBSCAN’s capacity for detecting outliers and irregular cluster shapes.
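A simple way to quantify this (a small sketch based on the DBSCAN_Cluster column added above) is to tabulate the cluster sizes, where label 0 counts the points flagged as noise:

# DBSCAN cluster sizes; cluster 0 corresponds to noise points
table(DBSCAN_Cluster = club_goers_dataset$DBSCAN_Cluster)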

4 Conclusion

K-Means Findings: Identified 3 clusters with distinct entry times, preferences, and demographics.
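As a supporting check (a minimal sketch, not part of the original analysis), the K-Means clusters can be profiled on the original, unscaled variables used for clustering:

# Mean entry time (minutes since midnight), age, room preference (0 = Latin)
# and gender (0 = Male) per K-Means cluster, on the original scale
aggregate(cbind(Time.of.Entry, Age, Preferred.Room, Gender) ~ KMeans_Cluster,
          data = club_goers_dataset, FUN = mean)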

5 Discussion of Discovered Clusters

5.1 Overall Insights

The clustering analysis revealed several meaningful patterns among the clubgoers:

  1. Cluster 1: Early EDM Enthusiasts
    • These attendees prefer EDM music and tend to arrive early in the evening.
    • Age group: Predominantly younger professionals.
    • Recommendation: Tailor early promotions or happy hour events targeting this group.
  2. Cluster 2: Latin Music Lovers
    • This cluster is characterized by attendees who prefer Latin music and arrive later in the night.
    • Age group: A mix of young and middle-aged attendees.
    • Recommendation: Introduce themed Latin nights or discounted entry for late arrivals.
  3. Cluster 3: Mixed Preferences
    • A diverse group in terms of preferences and demographics.
    • They do not show a strong preference for room type and have varied entry times.
    • Recommendation: Use flexible room allocation based on their preferences and increase engagement with personalized promotions.
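These profiles can be sanity-checked against the raw survey answers. The sketch below assumes the original Additional.Preferences column is still present (fastDummies keeps the source column by default) and cross-tabulates it with the K-Means clusters:

# Which additional preferences dominate each K-Means cluster
table(Cluster = club_goers_dataset$KMeans_Cluster,
      Preference = club_goers_dataset$Additional.Preferences)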

5.2 Comparison of K-Means and DBSCAN

  • K-Means: K-Means is a centroid-based clustering algorithm that divides the data into a pre-defined number of clusters (k). It minimizes the within-cluster sum of squares (WSS) to create compact and spherical clusters. K-Means performs well when the data naturally forms well-separated, spherical clusters.

Cluster Shape and Size: K-Means assumes that clusters are compact, well-separated, and generally of similar size. As seen in the PCA plot comparison, the K-Means algorithm has clearly separated the dataset into three distinct clusters, represented by the red, green, and blue colors. This suggests that the data may have inherent groupings with relatively equal density and structure.

Data Assumptions: K-Means tends to struggle with irregularly shaped clusters and does not handle noise or outliers well. The elbow method used with K-Means also assumes that the clusters are balanced in terms of their spread and density. Where there are noise points, or clusters that vary significantly in shape or size, K-Means may misclassify data points or fail to form meaningful clusters.

Limitations: Where there are outliers or irregular cluster shapes, K-Means may perform poorly. For instance, as shown in the PCA visualization, K-Means groups the data into well-separated clusters, but it would fail to identify the noise points that do not fit into any of the three clusters (such as the outliers flagged by DBSCAN). K-Means also requires the user to specify the number of clusters beforehand, which can be problematic when the true number of clusters is unknown.

  • DBSCAN (Density-Based Spatial Clustering of Applications with Noise): DBSCAN is a density-based clustering algorithm that can identify clusters of varying shapes and sizes. It groups points based on their density, which makes it suitable for datasets with irregular or non-spherical clusters.

Cluster Shape and Size: Unlike K-Means, DBSCAN is not constrained by the assumption of spherical clusters. It can find clusters of arbitrary shapes, as seen in the second PCA plot for DBSCAN. In the plot, DBSCAN identifies a variety of distinct clusters, with points labeled according to their assigned cluster number (colored in different shades). DBSCAN is able to separate dense regions of data into clusters even when the clusters have irregular shapes, such as those shown in green and orange. This flexibility makes DBSCAN an ideal choice for more complex datasets where clusters do not follow simple geometrical shapes.

Noise and Outliers: One of the main advantages of DBSCAN is its ability to identify and handle noise (outliers). In the PCA plot, points that do not fit into any cluster are marked as noise (represented by cluster 0). These points sit far from the dense regions, and DBSCAN recognizes them as anomalies. This capability makes DBSCAN more robust in real-world datasets, where outliers are common.

Data Assumptions: DBSCAN only requires two parameters to be set: the radius (epsilon) that defines the neighborhood of a point, and the minimum number of points needed to form a dense region. The algorithm does not require the user to specify the number of clusters beforehand, which is a significant advantage over K-Means when the true number of clusters is not known. DBSCAN's performance is highly dependent on the choice of the epsilon parameter, which must be carefully tuned to the data's characteristics.

Limitations: While DBSCAN is powerful in identifying clusters of arbitrary shapes and detecting outliers, it can struggle with varying densities. If the dataset has clusters with significantly different densities, DBSCAN might miss some clusters or merge separate clusters into one. Its sensitivity to the epsilon parameter also requires careful consideration, as setting it too high or too low can result in incorrect clustering.

Summary: K-Means is ideal for datasets with compact, well-separated, and similarly sized clusters. It works well when the clusters are expected to be spherical and of similar density, but it struggles with irregular shapes and noise. It is a good option when you have a general idea of the number of clusters and the data is relatively clean, without significant outliers. DBSCAN, on the other hand, excels when the clusters are irregularly shaped and the data may contain noise or outliers. It does not require the user to specify the number of clusters, making it useful when the cluster structure is unknown or complex. However, DBSCAN's performance is sensitive to the choice of the epsilon parameter, and it may not perform as well when the data contains clusters of varying densities.

In the visualizations provided, K-Means results in three distinct clusters (red, green, and blue), suggesting that the data points are well separated and compact; however, K-Means does not identify the noise points, which are handled by DBSCAN. DBSCAN, as shown in the second PCA plot, identifies more complex and irregularly shaped clusters and labels several points as noise. The varying colors represent the diverse clusters, and noise points are effectively detected. Both methods have their strengths and limitations, and the choice between them depends on the dataset and the specific needs of the analysis.
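One way to make this comparison concrete (a minimal sketch using the cluster columns created earlier) is to cross-tabulate the two assignments, with DBSCAN's label 0 marking noise:

# Overlap between the K-Means partition and the DBSCAN clusters/noise points
table(KMeans = club_goers_dataset$KMeans_Cluster,
      DBSCAN = club_goers_dataset$DBSCAN_Cluster)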

5.3 Recommendations for the Club

  1. Targeted Marketing:
    • Offer promotions to early attendees and loyal groups.
    • Tailor EDM events for late-night crowds.
  2. Operational Efficiency:
    • Adjust staffing schedules based on cluster entry times.
    • Stock beverages based on demographic preferences.
  3. Enhanced Customer Experience:
    • Design loyalty programs for regular attendees.
    • Offer room-specific benefits for distinct clusters.

6 Appendix

6.1 Missing Data Summary

print(missing_summary)
##              Respondent.ID             Preferred.Room 
##                          0                          0 
##              Time.of.Entry     Additional.Preferences 
##                          0                          0 
##                        Age                     Gender 
##                          0                          0 
## Minutes.Since.Club.Opening            Small.Room.Open 
##                          0                          0
