Clustering is a powerful unsupervised learning technique used to group data points into meaningful clusters based on their similarities. The key objective is to maximize intra-cluster similarity and minimize inter-cluster similarity.
The dataset analyzed in this project would typically originate from a survey filled out by guests upon arrival at a club with two distinct rooms and two different music sources. The survey captures key information such as room preferences, entry times, additional preferences for music or activities, demographic details, and the real-time status of the club's environment. This structured data collection process ensures a rich and reliable dataset for subsequent analysis. However, due to privacy reasons, a synthetically generated dataset with the same structure is used for this analysis.
The primary goal of this analysis is to better understand the preferences and behaviors of clubgoers by clustering guests according to their survey responses. In this project, we explore two clustering approaches, K-Means and DBSCAN. The data were first preprocessed as follows:
- Checked and imputed missing values.
- Converted Preferred.Room and Gender into binary variables.
- Created dummy variables for Additional.Preferences.
- Validated and converted Time.of.Entry into numeric minutes.
- Scaled all numerical features to ensure equal weighting.
# Load required packages
library(dplyr); library(lubridate); library(fastDummies)   # data wrangling and encoding
library(dbscan); library(ggplot2); library(ggfortify)      # clustering and plotting
# Load the dataset
club_goers_dataset <- read.csv("~/Desktop/Data science and Business Analytics/unsupervised learning project /Generated_Clubgoers_Dataset.csv", sep = ";")
# Check missing values
missing_summary <- colSums(is.na(club_goers_dataset))
missing_summary
## Respondent.ID Preferred.Room
## 0 0
## Time.of.Entry Additional.Preferences
## 0 0
## Age Gender
## 0 0
## Minutes.Since.Club.Opening Small.Room.Open
## 0 0
# Binary encoding for Preferred.Room and Gender
club_goers_dataset <- club_goers_dataset %>%
  mutate(
    Preferred.Room = ifelse(Preferred.Room == "Latin", 0, 1),  # Latin room = 0, other room = 1
    Gender = ifelse(Gender == "Male", 0, 1)                    # Male = 0, other = 1
  )
# One-hot encoding for Additional Preferences
club_goers_dataset <- fastDummies::dummy_cols(
  club_goers_dataset,
  select_columns = "Additional.Preferences",
  remove_first_dummy = TRUE
)
# Convert Time.of.Entry to numeric (minutes since midnight)
club_goers_dataset <- club_goers_dataset %>%
  mutate(
    Time.of.Entry = as.numeric(hour(hm(Time.of.Entry)) * 60 + minute(hm(Time.of.Entry)))
  )
# Impute any Time.of.Entry values that failed to parse with the mean entry time
club_goers_dataset$Time.of.Entry[is.na(club_goers_dataset$Time.of.Entry)] <-
  mean(club_goers_dataset$Time.of.Entry, na.rm = TRUE)
# Normalize numerical features
scaled_data <- club_goers_dataset %>%
  select(Time.of.Entry, Age, Preferred.Room, Gender) %>%
  scale()
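As a quick sanity check on the scaling step, the short sketch below (base R only) confirms that each scaled column has mean ≈ 0 and standard deviation ≈ 1, so no single feature dominates the distance calculations.
# Sanity check (sketch): scaled columns should have mean ~0 and sd ~1
round(colMeans(scaled_data), 3)
round(apply(scaled_data, 2, sd), 3)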
The elbow method was used to determine the optimal number of clusters by analyzing the total within-cluster sum of squares (WSS) as the number of clusters (k) increases. The ‘elbow point’ represents the value of k where the rate of decrease in WSS sharply levels off, indicating a good balance between cluster compactness and interpretability.
# Test multiple values of k for WSS (within-cluster sum of squares)
wss <- sapply(2:10, function(k) {
  km <- kmeans(scaled_data, centers = k, nstart = 25)  # K-Means for each candidate k
  km$tot.withinss  # total within-cluster sum of squares (inertia)
})
# Plot the elbow method to determine the optimal k
plot(2:10, wss, type = "b",
     xlab = "Number of Clusters (k)", ylab = "Within-cluster Sum of Squares (WSS)",
     main = "Choosing Optimal k with the Elbow Method")
# Annotate the elbow plot with the optimal k (k = 3)
optimal_k <- 3  # the elbow point
abline(v = optimal_k, col = "blue", lty = 2)
text(optimal_k, wss[optimal_k - 1], paste("k =", optimal_k), pos = 4, col = "blue")
# Perform K-Means Clustering with k = 3
set.seed(42)
kmeans_result <- kmeans(scaled_data, centers = 3, nstart = 25) # Use k = 3 based on the elbow plot
# Add cluster labels to the dataset
club_goers_dataset$KMeans_Cluster <- kmeans_result$cluster
# Check cluster sizes and mean values for each variable within clusters
cluster_summary <- aggregate(scaled_data, by = list(Cluster = kmeans_result$cluster), FUN = mean)
print(cluster_summary)
## Cluster Time.of.Entry Age Preferred.Room Gender
## 1 1 -0.5623472 -0.06339887 0.9206479 0.02396141
## 2 2 -0.5467574 -0.01092294 -1.0807606 0.11565977
## 3 3 1.7570567 0.12095594 0.1284237 -0.21533571
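Because the cluster means above are on the standardized scale, they are easier to interpret after translating back into the original units. The sketch below is an optional addition, assuming the unscaled columns are still present in club_goers_dataset; it reports cluster sizes and the mean profile of each cluster in raw units.
# Sketch: cluster sizes and mean profiles in original (unscaled) units
table(kmeans_result$cluster)
aggregate(
  club_goers_dataset[, c("Time.of.Entry", "Age", "Preferred.Room", "Gender")],
  by = list(Cluster = kmeans_result$cluster),
  FUN = mean
)
Since Preferred.Room and Gender were encoded as 0/1, their cluster means can be read as proportions (e.g., the share of guests preferring the non-Latin room).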
# Optional: Visualize cluster distribution using PCA for better separation
pca_result <- prcomp(scaled_data, center = TRUE, scale. = TRUE)
autoplot(pca_result, data = club_goers_dataset, colour = 'KMeans_Cluster') +
  labs(title = "K-Means Clustering Visualization (PCA)")
The elbow plot shows the relationship between the number of clusters (k) and the within-cluster sum of squares (WSS). The ‘elbow’ at k = 3 indicates the optimal number of clusters: beyond this point, increasing k does not substantially reduce the WSS, so three clusters provide the best balance between cluster compactness and simplicity without overfitting.
This PCA plot visualizes the results of K-Means clustering with k = 3. The points are colored based on their assigned cluster, with three distinct groups visible. Cluster 1 (represented by light blue) is mostly separated from Clusters 2 and 3, while Cluster 3 (dark blue) appears to be more spread out. The plot demonstrates how the K-Means algorithm has successfully divided the data into three clusters based on their preferences and behaviors. The clear separation between the clusters in the PCA space suggests that the clustering was successful in identifying meaningful patterns in the data.
# Plot sorted 3-nearest-neighbor distances to help choose eps for DBSCAN
dbscan::kNNdistplot(scaled_data, k = 3)
abline(h = 0.5, col = "red", lty = 2)  # candidate eps threshold
This plot represents the k-distance (3-NN distance) for each point in the dataset, sorted by distance. The red horizontal line indicates the threshold for density-based clustering. Points below this threshold belong to a cluster, while points above the threshold are considered noise or outliers. In this case, you can observe that there is a sharp increase in the 3-NN distance after a certain point, suggesting that most of the data points are part of dense regions (clusters), while the sharp jump identifies sparse regions where the points are distant from each other.
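To complement the visual reading of the elbow in the k-distance curve, the following sketch explores the neighborhood distances numerically; it assumes the dbscan package's kNNdist() helper and a few illustrative eps candidates around the red threshold line.
# Sketch: inspect sorted 3-NN distances and try a few candidate eps values
knn_dist <- sort(dbscan::kNNdist(scaled_data, k = 3))
tail(knn_dist, 10)  # the largest distances correspond to the sparsest points
for (eps_candidate in c(0.4, 0.5, 0.6)) {  # illustrative values around the chosen threshold
  res <- dbscan::dbscan(scaled_data, eps = eps_candidate, minPts = 5)
  cat("eps =", eps_candidate,
      "-> clusters:", max(res$cluster),
      "| noise points:", sum(res$cluster == 0), "\n")
}
Values of eps where the cluster count stabilizes and the noise share stays modest are reasonable choices; eps = 0.5 is retained for the DBSCAN run below.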
# Run DBSCAN with eps = 0.5 (from the k-distance plot) and minPts = 5
dbscan_result <- dbscan(scaled_data, eps = 0.5, minPts = 5)
club_goers_dataset$DBSCAN_Cluster <- dbscan_result$cluster  # 0 = noise
# Visualize the DBSCAN clusters in the same PCA space
autoplot(pca_result, data = club_goers_dataset, colour = 'DBSCAN_Cluster') +
  labs(title = "DBSCAN Clustering Visualization (PCA)")
The PCA plot visualizes the clustering results from DBSCAN, where points are colored according to their assigned clusters. The different shades of blue represent varying clusters, with some points marked as noise (cluster 0). DBSCAN successfully identified several distinct clusters, and the outliers or noise points are visibly separated from the rest of the data. The plot shows that DBSCAN is able to handle irregularly shaped clusters and can detect noise effectively, unlike K-Means, which assumes spherical clusters.
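A simple count of points per DBSCAN label (a minimal sketch with base R's table(); cluster 0 is the noise label) makes the amount of noise explicit.
# Sketch: number of points per DBSCAN cluster (0 = noise)
table(DBSCAN_cluster = dbscan_result$cluster)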
# Compare the clusters from K-Means and DBSCAN using PCA
ggplot(club_goers_dataset, aes(x = pca_result$x[, 1], y = pca_result$x[, 2], color = factor(KMeans_Cluster))) +
  geom_point() +
  labs(title = "Comparison of K-Means and DBSCAN Clusters",
       x = "PCA Component 1", y = "PCA Component 2", color = "K-Means Clusters")
ggplot(club_goers_dataset, aes(x = pca_result$x[, 1], y = pca_result$x[, 2], color = factor(DBSCAN_Cluster))) +
  geom_point() +
  labs(title = "DBSCAN Clusters Visualization",
       x = "PCA Component 1", y = "PCA Component 2", color = "DBSCAN Clusters")
Taken together, these plots compare the clusters identified by K-Means and DBSCAN. K-Means divides the data points into three distinct clusters (red, green, and blue), while DBSCAN identifies several irregular clusters plus noise points. K-Means performs well when the data form compact, spherical clusters, as seen in the clear separation between its clusters. DBSCAN, by contrast, is more flexible in detecting non-spherical clusters and noise, making it better suited to complex data structures where some points do not belong to any cluster.
The DBSCAN plot colors each cluster differently. DBSCAN has identified multiple clusters, ranging from well-defined groups (such as green and yellow) to outliers marked as noise (cluster 0). The varying colors indicate the diversity of clusters found, showing its ability to detect both dense areas and irregularly shaped clusters; points that do not meet the density criteria are labeled as noise and are not assigned to any cluster. This highlights DBSCAN's capacity for detecting outliers and irregular cluster shapes.
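As a numerical complement to the two plots, a cross-tabulation of the label sets (a minimal sketch with base R's table()) shows how the K-Means clusters map onto the DBSCAN clusters and how many points DBSCAN marks as noise.
# Sketch: cross-tabulate K-Means vs. DBSCAN assignments (DBSCAN cluster 0 = noise)
table(KMeans = club_goers_dataset$KMeans_Cluster,
      DBSCAN = club_goers_dataset$DBSCAN_Cluster)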
The clustering analysis revealed several meaningful patterns among the clubgoers. K-Means identified three clusters with distinct entry times, preferences, and demographics, while DBSCAN separated dense groups from noise points. The two methods also rest on different assumptions, summarized below.
K-Means:
- Cluster shape and size: K-Means assumes that clusters are compact, well separated, and of broadly similar size. As seen in the PCA plot comparison, K-Means clearly separates the dataset into three distinct clusters (red, green, and blue), suggesting that the data contain inherent groupings of relatively equal density and structure.
- Data assumptions: K-Means tends to struggle with irregularly shaped clusters and does not handle noise or outliers well. The elbow method used with K-Means also assumes that clusters are balanced in spread and density; when noise points are present, or clusters vary significantly in shape or size, K-Means may misclassify points or fail to form meaningful clusters.
- Limitations: Where there are outliers or irregular cluster shapes, K-Means may perform poorly. As shown in the PCA visualization, K-Means groups the data into well-separated clusters but cannot flag the noise points that do not fit into any of the three clusters (the outliers that DBSCAN identifies). K-Means also requires the number of clusters to be specified beforehand, which is problematic when the true number of clusters is unknown.
DBSCAN:
- Cluster shape and size: Unlike K-Means, DBSCAN is not constrained by the assumption of spherical clusters and can find clusters of arbitrary shapes, as seen in the second PCA plot. There, DBSCAN identifies a variety of distinct clusters, with points colored by their assigned cluster number. It separates dense regions of data into clusters even when those clusters have irregular shapes, such as those shown in green and orange, which makes it well suited to datasets whose clusters do not follow simple geometric forms.
- Noise and outliers: One of DBSCAN's main advantages is its ability to identify and handle noise. In the PCA plot, points that do not fit into any cluster are marked as noise (cluster 0); they lie far from the dense regions, and DBSCAN recognizes them as anomalies. This makes DBSCAN more robust on real-world datasets, where outliers are common.
- Data assumptions: DBSCAN requires only two parameters: the radius (epsilon) that defines a point's neighborhood and the minimum number of points needed to form a dense region. It does not require the number of clusters to be specified in advance, a significant advantage over K-Means when the true number of clusters is unknown. Its performance, however, depends strongly on the choice of epsilon, which must be carefully tuned to the data.
- Limitations: While DBSCAN is powerful at identifying arbitrarily shaped clusters and detecting outliers, it can struggle when densities vary. If the dataset contains clusters with significantly different densities, DBSCAN may miss some clusters or merge separate ones into one. Its sensitivity to epsilon also requires care, since setting it too high or too low results in incorrect clustering.
Summary: K-Means is ideal for datasets with compact, well-separated, and similarly sized clusters. It works well when the clusters are expected to be roughly spherical and of similar density, but it struggles with irregular shapes and noise; it is a good option when you have a general idea of the number of clusters and the data are relatively clean. DBSCAN, on the other hand, excels when clusters are irregularly shaped and the data may contain noise or outliers. It does not require the number of clusters to be specified, which is useful when the cluster structure is unknown or complex, but its performance is sensitive to the choice of epsilon and can degrade when clusters have very different densities. In the visualizations provided:
- K-Means produces three distinct clusters (red, green, and blue), suggesting that the data points are well separated and compact; however, it does not identify the noise points, which are handled by DBSCAN.
- DBSCAN, as shown in the second PCA plot, identifies more complex and irregularly shaped clusters and labels several points as noise; the varying colors reflect the diversity of clusters it finds, and noise points are effectively detected.
Both methods have their strengths and limitations, and the choice between them depends on the dataset and the specific needs of the analysis.