Clustering is the process of grouping similar data objects into related groups. This unsupervised learning technique is applied to discover patterns and structure within large datasets. The core objective is to generate clusters with high intra-cluster similarity (alike objects within the same cluster) and low inter-cluster similarity (distinct objects in different clusters).
This project focuses on applying different clustering methods to analyze hourly bicycle rental data. The data has been aggregated to the daily level to define daily bike-rental demand profiles based on fundamental environmental and calendar factors. Accurately defining these profiles is crucial for business applications, including optimizing fleet management, enhancing demand-forecasting accuracy, and strategically planning operations based on daily conditions.
The dataset used in this analysis was provided by the UC Irvine Machine Learning Repository (https://archive.ics.uci.edu/dataset/275/bike+sharing+dataset). It contains the hourly and daily counts of rental bikes between 2011 and 2012 in the Capital Bikeshare system in Washington, D.C., USA, together with the corresponding weather and seasonal information. The initial hourly data was aggregated to the daily level to define the daily profiles.
The clustering process relies on four continuous variables: cnt_mean, temp_mean, hum_mean and windspeed_mean. The selection is based on the direct and measurable impact of these variables on customers' rental behavior.
The table below lists all variables used in the analysis, together with their descriptions.
| Aggregated Variable | Description |
|---|---|
| cnt_mean | Mean count of total rental bikes per day. |
| temp_mean | Mean normalized temperature in Celsius per day. Values divided by 41 (max). |
| hum_mean | Mean normalized humidity per day. Values divided by 100 (max). |
| windspeed_mean | Mean normalized wind speed per day. Values divided by 67 (max). |
| season | Season (1: Spring, 2: Summer, 3: Fall, 4: Winter). First value of the day. |
| yr | Year (0: 2011, 1: 2012). First value of the day. |
| month | Month (1 to 12). First value of the day. |
| holiday | Whether the day is an official holiday (1) or not (0). First value of the day. |
| workingday | Whether the day is neither a weekend nor an official holiday (1) or not (0). First value of the day. |
#Packages assumed for the analysis: dplyr (aggregation), fastDummies (one-hot encoding),
#factoextra, NbClust, flexclust, clValid, cluster, purrr and fpc (clustering and validation)
library(dplyr)
library(fastDummies)
df$day <- as.Date(df$dteday)
#Group the data by day and compute daily mean variables.
df_day <- df %>%
group_by(day) %>%
summarise(
cnt_mean = mean(cnt),
temp_mean = mean(temp),
hum_mean = mean(hum),
windspeed_mean = mean(windspeed),
season = first(season),
yr = first(yr),
month = first(mnth),
holiday = first(holiday),
workingday = first(workingday)
) %>%
ungroup()
df_day <- na.omit(df_day)
#One-hot encoding for season (season 1 is the baseline)
df_day <- df_day %>%
dummy_cols(
select_columns = c("season"),
remove_selected_columns = TRUE,
remove_first_dummy = TRUE
)
#Selection of clustering variables and scaling using Z-score standardization
variables <- df_day[2:5]
variables_df <- as.data.frame(lapply(variables, scale))
variables_df <- na.omit(variables_df)
Before applying any clustering algorithm, it is crucial to perform preliminary steps. First, clusterability was assessed to determine whether the data is randomly distributed or contains inherent groupings. If the data points were randomly distributed, any clustering result would be statistically meaningless. A statistical test, the Hopkins statistic, and visualization techniques were used to confirm whether the data is suitable for clustering.
The evaluation of clusterability was performed using the Hopkins statistic and the Visual Assessment of cluster Tendency (VAT) plot. The Hopkins statistic acts as a statistical hypothesis test where the null hypothesis is that the data is randomly distributed. If the value is close to 0.5, the dataset reveals no clustering structure, whereas a value close to 1 provides significant evidence that the data might be clusterable.
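For reference, one common formulation of the statistic (consistent with the interpretation above) compares nearest-neighbour distances of uniformly simulated points with those of sampled real points:

$$ H = \frac{\sum_{i=1}^{m} u_i}{\sum_{i=1}^{m} u_i + \sum_{i=1}^{m} w_i} $$

where $u_i$ are nearest-neighbour distances from the simulated points to the data and $w_i$ are nearest-neighbour distances between sampled data points.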
The calculated Hopkins statistic is H = 0.7635, which is relatively high and indicates that the data is suitable for clustering. The dataset probably does not possess a strong, naturally well-separated clustering structure, but the data points are also clearly not completely random.
The visualization generated below is a heatmap of the dissimilarity between pairs of data points. The presence of distinct darker regions shows that the pairwise dissimilarities are far from uniform, confirming the dense, non-random nature of the data. The plot does not display distinct, clearly delimited squares of color along the diagonal, but the data is structured enough to be partitioned into meaningful segments.
I acknowledge that the clusters will not have perfectly sharp boundaries due to the continuous nature of the variables. I therefore justify the decision to proceed with the analysis by the practical need to profile the year.
#Hopkins statistic and VAT plot
get_clust_tendency(variables_df, 10, graph=TRUE, gradient=list(low="red", mid="white", high="blue"), seed = 123)
## $hopkins_stat
## [1] 0.763544
##
## $plot
The next step is to determine the optimal number of clusters (K) that best represents the structure of the data and is suitable for later interpretation.
I use the total within-cluster sum of squares (WSS), visualized with the Elbow Method, to identify the number of clusters K beyond which there are diminishing returns, i.e., where adding clusters no longer yields a significant reduction in within-cluster variability.
The curve exhibits a steep decline in WSS at K=2 and K=3. This indicates that these initial partitions are highly effective. The steep decline visibly begins to flatten out around K=4. After K=4, the marginal gain in reducing internal variability is not as significant.
#optimal number of clusters - elbow
fviz_nbclust(variables_df, kmeans, method = "wss") +
labs(title = "WSS (Elbow Method) for Optimal Number of Clusters") +
xlab("Number of Clusters K")
The Average Silhouette Width is a measure used to evaluate the quality of clustering. The score ranges from -1 to 1: values close to -1 indicate that objects are probably assigned to the wrong clusters, while values close to 1 indicate a well-clustered dataset. The plot below suggests that K=2 statistically provides the best separation (s = 0.3); the score drops noticeably for K=3 and K=4.
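For reference, the silhouette value of a single observation $i$ is

$$ s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} $$

where $a(i)$ is the mean distance from $i$ to the other members of its own cluster and $b(i)$ is the mean distance from $i$ to the members of the nearest neighbouring cluster; the Average Silhouette Width is the mean of $s(i)$ over all observations.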
#optimal number of clusters - silhouette
fviz_nbclust(variables_df, kmeans, method = "silhouette")
The Calinski-Harabasz (CH) index is a clustering validation measure that evaluates the ratio of between-cluster (inter-group) variance to within-cluster (intra-group) variance. The optimal number of clusters is the one that maximizes this ratio. The NbClust function was used to calculate the CH index for K values ranging from 2 to 8 clusters. The highest value of the index is observed at K=4 (158.83).
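For completeness, for $n$ observations partitioned into $K$ clusters the index can be written as

$$ CH(K) = \frac{B(K)/(K-1)}{W(K)/(n-K)} $$

where $B(K)$ is the between-cluster dispersion and $W(K)$ is the total within-cluster dispersion; larger values indicate denser and better-separated clusters.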
#NbClust
opt1<-NbClust(variables_df, distance="euclidean", min.nc=2, max.nc=8, method="complete", index="ch")
opt1$All.index
## 2 3 4 5 6 7 8
## 12.0681 149.4974 158.8303 154.5182 153.8990 142.5123 142.0433
opt1$Best.nc
## Number_clusters Value_Index
## 4.0000 158.8303
The shadow statistic measures how well a given cluster is isolated from the others: the higher the score, the better the cluster's isolation from its neighbors. All calculated shadow values, across all K solutions, are strong (ranging from 0.69 to 0.80). Furthermore, isolation improves as the number of clusters increases. This finding confirms that the dataset, despite its continuous nature, contains dense regions that can be effectively separated.
d2<-cclust(variables_df, 2, dist="euclidean")
shadow(d2)
## 1 2
## 0.7312806 0.6899607
d3<-cclust(variables_df, 3, dist="euclidean")
shadow(d3)
## 1 2 3
## 0.7341732 0.6999410 0.7763433
d4<-cclust(variables_df, 4, dist="euclidean")
shadow(d4)
## 1 2 3 4
## 0.7822810 0.7109780 0.8027311 0.7414861
plot(shadow(d2))
plot(shadow(d3))
plot(shadow(d4))
The primary analytical goal of this study is to segment daily profiles into four distinct groups corresponding to the four seasons. The convergence of three validation metrics (the Elbow Method, the Calinski-Harabasz index and the shadow statistic) on K=4 offers strong statistical justification that this choice is robust and yields well-isolated clusters. To ensure a comprehensive and robust analysis, I will proceed by evaluating and comparing clusterings with K=2, K=3 and K=4.
To compare internal validation measures, I executed the clValid function, which assessed the performance of the Hierarchical, K-Means and PAM algorithms across cluster sizes from K=2 to K=5. Hierarchical clustering delivered the statistically best scores (Dunn index = 0.1011, silhouette width = 0.322) at K=2, while the PAM algorithm yielded the poorest scores across all metrics for all K values. The K-Means method had slightly lower statistical performance than hierarchical clustering, but it provided the highest silhouette score for the desired K=4. The clustering will therefore be performed using all three methods, and the final model will be chosen based on the quality and interpretability of the resulting day profiles.
#clValid for hierarchical, kmeans, pam
clmethods<-c("hierarchical","kmeans","pam")
internal_validation <- clValid(
variables_df,
nClust = 2:5,
clMethods = clmethods,
validation = "internal",
maxitems=100000)
summary(internal_validation)
##
## Clustering Methods:
## hierarchical kmeans pam
##
## Cluster sizes:
## 2 3 4 5
##
## Validation Measures:
## 2 3 4 5
##
## hierarchical Connectivity 19.3365 19.3365 96.6587 129.3909
## Dunn 0.1011 0.1011 0.0695 0.0739
## Silhouette 0.3220 0.3053 0.2352 0.2344
## kmeans Connectivity 105.1321 157.1516 211.5567 268.2187
## Dunn 0.0361 0.0438 0.0256 0.0450
## Silhouette 0.3013 0.2768 0.2669 0.2303
## pam Connectivity 106.9016 176.3357 244.4952 285.1659
## Dunn 0.0193 0.0207 0.0321 0.0262
## Silhouette 0.2967 0.2466 0.2089 0.2192
##
## Optimal Scores:
##
## Score Method Clusters
## Connectivity 19.3365 hierarchical 2
## Dunn 0.1011 hierarchical 2
## Silhouette 0.3220 hierarchical 2
The K-Means algorithm is an iterative, distance-based partitioning method used to group observations into a predefined number of K clusters. The process begins with selecting K random central points (centroids) and assigning each data point to the nearest centroid. With every iteration the centroid positions are recalculated until they no longer change. The aim is to minimize the within-cluster sum of squares while keeping the centroids well separated.
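Formally, K-Means searches for the partition $C_1, \dots, C_K$ with centroids $\mu_1, \dots, \mu_K$ that minimizes the total within-cluster sum of squares:

$$ \min_{C_1, \dots, C_K} \; \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2 $$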
The initial K-Means clustering with K=2 was executed as a baseline, as this partition often yielded the highest separation scores in the validation step. The cluster plot shows two clusters with significant overlap at their boundaries, especially in the central region. The first two dimensions, represented by the horizontal and vertical axes, explain about 72.3% of the variance. This confirms that the groups are not perfectly separated; the partition reflects the continuous distribution of the data.
The average silhouette width is 0.30. While this score would not typically be considered high, it is the highest achieved across all tested K values. Cluster 2 displays higher quality, with fewer negative values, indicating that more of its observations lie well within the cluster boundary.
The stripes plot visualizes the distance of each observation from its centroid and therefore the density of the clusters. Tightly packed lines near y = 0 indicate that the majority of observations lie near the cluster center. In this case, both clusters exhibit a high density of points near the x-axis, confirming the presence of compact cores.
#k-means, 2 clusters
day_km2 <- kmeans(variables_df, 2)
df_day$cluster_km2 <- day_km2$cluster
#plot
fviz_cluster(list(data=variables_df, cluster=day_km2$cluster),
ellipse.type="norm", geom="point", stand=FALSE, palette="jco", ggtheme=theme_classic())
#silhouette plot
sil<-silhouette(day_km2$cluster, dist(variables_df))
fviz_silhouette(sil)
## cluster size ave.sil.width
## 1 1 313 0.24
## 2 2 418 0.35
#distance from the centroid
d1<-cclust(variables_df, 2, dist="euclidean")
stripes(d1)
The K-Means algorithm was next executed with K=3. The cluster plot shows three distinct groups (blue, yellow, grey). The overall separation has visibly decreased compared to K=2; the overlap suggests poorly defined boundaries and potential misclassification.
The average silhouette width dropped to 0.28, with cluster 1 having the highest score (0.33) and being the most numerous (339 observations, versus 196 in each of clusters 2 and 3).
The stripes plot reveals the compactness of the three groups. While all of them have a relatively dense core, cluster 3 is spatially more dispersed than the other two.
#k-means k=3
day_km3 <- kmeans(variables_df, 3)
df_day$cluster_km3 <- day_km3$cluster
#plot
fviz_cluster(list(data=variables_df, cluster=day_km3$cluster),
ellipse.type="norm", geom="point", stand=FALSE, palette="jco", ggtheme=theme_classic())
#silhouette
sil3<-silhouette(day_km3$cluster, dist(variables_df))
fviz_silhouette(sil3)
## cluster size ave.sil.width
## 1 1 339 0.33
## 2 2 196 0.21
## 3 3 196 0.26
#distance from the centroid
d3<-cclust(variables_df, 3, dist="euclidean")
stripes(d3)
The method was executed with K=4 to align with the analytical objective of segmenting the data into four seasonal profiles. The cluster plot shows the formation of four distinct groups (blue, yellow, grey and red). The boundaries show significant inter-cluster overlap, which confirms the difficulty of drawing sharp edges with this type of data. The perceived overlap may be exaggerated because the plot displays only the first two dimensions (73.2% of the variance).
The average silhouette score dropped further to 0.26. Two clusters maintain reasonable cohesion: cluster 4 (0.31) and cluster 2 (0.28). The other ones are less defined. Cluster 4 (301 observations) is the largest, while Cluster 1 (97 observations) is the smallest, which is visible in the varying heights of the Silhouette bands.
All four clusters display relatively high concentration, as can be seen in the stripes plot. Cluster 4 is the least compact, with the furthest point reaching a distance of more than 4, while cluster 1 is the most compact, with distances remaining below 2.5.
#k-means k=4
day_km4 <- kmeans(variables_df, 4)
df_day$cluster_km4 <- day_km4$cluster
#plot
fviz_cluster(list(data=variables_df, cluster=day_km4$cluster),
ellipse.type="norm", geom="point", stand=FALSE, palette="jco", ggtheme=theme_classic())
#silhouette
sil4<-silhouette(day_km4$cluster, dist(variables_df))
fviz_silhouette(sil4)
## cluster size ave.sil.width
## 1 1 97 0.19
## 2 2 164 0.28
## 3 3 169 0.18
## 4 4 301 0.31
#distance from the centroid
d4<-cclust(variables_df, 4, dist="euclidean")
stripes(d4)
Partitioning Around Medoids (PAM) is an alternative to the K-Means algorithm. It is an iterative clustering method that uses actual data points, called medoids, as cluster centers. Unlike K-Means, PAM is less sensitive to noise and outliers, and it is also suitable for small datasets. The algorithm selects the most centrally located data point within each cluster as its medoid. The PAM algorithm will be performed for K=2, K=3 and K=4.
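In contrast to the K-Means objective, PAM chooses the medoids $m_1, \dots, m_K$ that minimize the total dissimilarity of each observation to the medoid of its cluster,

$$ \min_{m_1, \dots, m_K} \; \sum_{i=1}^{n} d(x_i, m_{c(i)}) $$

where $c(i)$ denotes the cluster to which observation $x_i$ is assigned; because the centers are actual data points and any dissimilarity $d$ can be used, the method is more robust to outliers.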
The PAM algorithm was executed with K=2 to determine whether a medoid-based approach would alter the cluster structure compared to K-Means. The convex hull visualization shows a clear division into two groups, with the first two dimensions explaining 73.2% of the variance.
The average silhouette width is 0.30, which is consistent with the results from K-Means.
The clusters were compared by visualizing boxplots of the continuous variables. These reveal that cluster 2 is characterized by a higher median temperature and bike rental count than cluster 1. The presence of numerous outliers across all variables justifies the use of the PAM algorithm, which is more robust to noise.
#k=2
pam2<-pam(variables_df,2)
df_day$pam2 <- pam2$cluster
fviz_cluster(pam2,
geom = "point",
ellipse.type = "convex",
stand = FALSE,
palette = "jco",
main = paste("PAM, K = 2"))
#Silhouette
silpam2 <- silhouette(pam2$clustering, dist(variables_df))
#Visualisation of Silhouette plot
fviz_silhouette(silpam2,
main = paste("Silhouette Plot for PAM K = 2"))
## cluster size ave.sil.width
## 1 1 324 0.23
## 2 2 407 0.35
#boxplot
groupBWplot(variables_df,
pam2$clustering,
main = "Comparison of variables for PAM",
xlab = "Cluster",
alpha = 0.05)
The convex hull visualization for K=3 shows three distinct regions. While the clusters are spatially defined, their boundaries are in close proximity and slightly overlap.
The average silhouette width dropped to 0.25, which is lower than the K-Means score.
The groupBWplot highlights the characteristics of the data: cluster 3 has the highest median values for temperature and rental counts, cluster 1 represents low-demand days, and cluster 2 acts as a transitional group.
#k=3
pam3<-pam(variables_df,3)
df_day$pam3 <- pam3$cluster
fviz_cluster(pam3,
geom = "point",
ellipse.type = "convex",
stand = FALSE,
palette = "jco",
main = paste("PAM, K = 3"))
#Silhouette
silpam3 <- silhouette(pam3$clustering, dist(variables_df))
#Visualisation of Silhouette plot
fviz_silhouette(silpam3,
main = paste("Silhouette Plot for PAM K = 3"))
## cluster size ave.sil.width
## 1 1 244 0.18
## 2 2 231 0.24
## 3 3 256 0.32
#boxplot
groupBWplot(variables_df,
pam3$clustering,
main = "Comparison of variables for PAM",
xlab = "Cluster",
alpha = 0.05)
The PAM algorithm was performed for the analytically desired number of clusters K=4 in order to compare it with K-Means results. The cluster plot shows a significant overlap between clusters in the central region of the plot. This indicates that the boundaries between clusters remain fluid.
The average silhouette width was 0.21, with the width for each cluster varying from 0.18 to 0.23. Negative silhouette values confirm fuzzy boundaries between groups.
The four boxplots provide a detailed look at the seasonal profiles. Cluster 4 represents the peak season, with a high count of rental bikes and the highest median temperature. Cluster 1 is characterized by a low median temperature and lower rentals, whereas clusters 2 and 3 represent transitional seasons.
#k=4
pam4<-pam(variables_df,4)
df_day$pam4 <- pam4$cluster
fviz_cluster(pam4,
geom = "point",
ellipse.type = "convex",
stand = FALSE,
palette = "jco",
main = paste("PAM, K = 4"))
#Silhouette
silpam4 <- silhouette(pam4$clustering, dist(variables_df))
#Visualisation of Silhouette plot
fviz_silhouette(silpam4,
main = paste("Silhouette Plot for PAM K = 4"))
## cluster size ave.sil.width
## 1 1 140 0.20
## 2 2 219 0.22
## 3 3 191 0.23
## 4 4 181 0.18
#boxplot
groupBWplot(variables_df,
pam4$clustering,
main = "Comparison of variables for PAM",
xlab = "Cluster",
alpha = 0.05)
The final method employed in the study is Hierarchical Agglomerative Clustering (HAC). Unlike the previous methods, HAC does not require a predefined number of clusters. It builds a hierarchy of clusters using a bottom-up approach: each data point begins as its own individual cluster, and at each step the algorithm merges the two most similar clusters until only one cluster remains. While K-Means can produce different results depending on its random starting points, HAC is deterministic and provides consistent results.
Four common linkage criteria were evaluated using the Agglomerative Coefficient (AC). The AC measures the strength of the clustering structure, with values closer to 1 indicating better separation and cohesion. Ward's method achieved the highest coefficient (0.9894). Based on this result, the hierarchical model was built using Euclidean distance and Ward's D2 linkage.
#HAC
# dissimilarity matrix
d <- dist(variables_df, method = "euclidean")
#linkage method selection
method<- c( "average", "single", "complete", "ward")
names(method) <- method
ac <- function(x) {
agnes(d, method = x)$ac
}
map_dbl(method, ac)
## average single complete ward
## 0.8934425 0.8392369 0.9445458 0.9894666
# ward linkage
hc1 <- hclust(d, method = "ward.D2")
# dendrogram
plot(hc1,
cex = 0.6,
hang = -1,
main = "Dendrogram: Ward Linkage (Hclust)")
After establishing the structure using Ward’s method, the dendrogram was cut at different levels to create K=2, K=3 and K=4 clusters.
At the K=2 level the dendrogram shows a split at a height of a little over 30, with relatively balanced groups of 379 and 352 data points. When moving to K=3, group 2 from the previous partition remains stable, while group 1 splits into two subgroups (199 and 180). At the target level of K=4, the hierarchy divides the data into groups of 199, 180, 115 and 237 at a height of approximately 20.
#subgroups for K=2, K=3, K=4
sub_grp2 <- cutree(hc1, k = 2)
table(sub_grp2)
## sub_grp2
## 1 2
## 379 352
sub_grp3 <- cutree(hc1, k = 3)
table(sub_grp3)
## sub_grp3
## 1 2 3
## 199 180 352
sub_grp4 <- cutree(hc1, k = 4)
table(sub_grp4)
## sub_grp4
## 1 2 3 4
## 199 180 115 237
To objectively evaluate the trade-off between model complexity and clustering quality, key internal validation metrics were extracted with the cluster.stats function (fpc package). The key insight is the steady decrease in the average within-cluster distance, which drops from 2.185 (K=2) to 1.869 (K=4). This confirms that as K increases, the homogeneity of the groups improves. As expected, the Dunn index, the ratio of cluster separation to cluster compactness, decreases with a greater number of clusters. Similarly, the average silhouette width drops from 0.276 (K=2) to 0.179 (K=4).
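The chunk behind the summary below was not included; a minimal sketch of how these metrics could be extracted from the Ward solution, assuming the distance matrix d and tree hc1 defined earlier (the object name hac_metrics is hypothetical), is:
#Internal validation metrics for the Ward solution at K = 2, 3, 4 (sketch)
library(fpc)
hac_metrics <- do.call(rbind, lapply(2:4, function(k) {
  grp <- cutree(hc1, k = k)        # cut the Ward tree into k groups
  st <- cluster.stats(d, grp)      # compute internal validation statistics
  data.frame(K = k,
             Dunn_Index = st$dunn,
             Avg_Silhouette = st$avg.silwidth,
             Avg_Within_Dist = st$average.within)
}))
hac_metrics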
## K Dunn_Index Avg_Silhouette Avg_Within_Dist
## 1 2 0.05476476 0.2756214 2.185009
## 2 3 0.05009749 0.2389185 1.998094
## 3 4 0.03224339 0.1778102 1.869034
To visually assess the hierarchical progression, dendrograms were generated for K=2, K=3, and K=4 levels, with each solution marked by borders to illustrate the splitting of the branches.
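The corresponding chunk is not shown above; a straightforward sketch, using the hc1 tree from the Ward linkage and base R's rect.hclust to draw the borders, could look like this:
#Dendrograms with cluster borders for K = 2, 3 and 4 (sketch)
for (k in 2:4) {
  plot(hc1, cex = 0.6, hang = -1,
       main = paste("Dendrogram: Ward Linkage, K =", k))
  rect.hclust(hc1, k = k, border = 2:(k + 1))   # border around each of the k clusters
}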
Following the evaluation of all three algorithms, the K-Means K=4 model was selected for final interpretation to meet the analytical goal of defining four seasonal profiles of bike rental. This section focuses on characterizing each segment to understand the environmental drivers of bike-sharing demand.
The table below presents the entire dataset grouped by the final K-Means segmentation for K=4. The internal clustering features (continuous variables) were de-standardized to their original physical units and presented as means for each cluster. Beyond the variables used for clustering, additional categorical features have been added, such as the dominant month, the percentage of holidays and working days, and the percentage distribution of seasons, to allow a thorough interpretation. The clusters were then arranged from lowest to highest mean temperature.
The following table presents features of each group, allowing a direct comparison and characterization of all four clusters.
| Variable | Cluster_2 | Cluster_3 | Cluster_1 | Cluster_4 |
|---|---|---|---|---|
| N (days) | 164 | 169 | 97 | 301 |
| Count_Avg (mean hourly rentals) | 119 | 134 | 206 | 252 |
| Temp_Avg (°C) | 11.4 | 17.9 | 19.9 | 26.6 |
| Hum_Avg (%) | 52.5 | 79.6 | 50.8 | 62.8 |
| Wind_Avg | 13.3 | 12.2 | 20.4 | 10.4 |
| Dominant_Month | 1 | 12 | 4 | 7 |
| Holiday_Pct (%) | 4.9 | 2.4 | 2.1 | 2.3 |
| WorkingDay_Pct (%) | 64 | 68 | 76.3 | 68.4 |
| Winter_Pct (%) | 65.9 | 26 | 22.7 | 2.3 |
| Spring_Pct (%) | 10.4 | 26 | 42.3 | 27.2 |
| Summer_Pct (%) | 0 | 11.2 | 10.3 | 52.8 |
| Fall_Pct (%) | 23.8 | 36.7 | 24.7 | 17.6 |
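The aggregation behind this profile table was not shown; a minimal sketch of how the continuous part could be rebuilt in original units from df_day (the de-normalization factors follow the dataset description, and the season shares and dominant month would be derived analogously from the calendar columns) is:
#Cluster profiles in original units for K-Means with K = 4 (sketch)
df_day %>%
  group_by(cluster_km4) %>%
  summarise(
    N = n(),
    Count_Avg = mean(cnt_mean),            # mean hourly rentals
    Temp_Avg = mean(temp_mean) * 41,       # temperature was divided by 41
    Hum_Avg = mean(hum_mean) * 100,        # humidity was divided by 100
    Wind_Avg = mean(windspeed_mean) * 67,  # wind speed was divided by 67
    Holiday_Pct = mean(holiday) * 100,
    WorkingDay_Pct = mean(workingday) * 100
  ) %>%
  arrange(Temp_Avg)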
The K=4 K-Means segmentation successfully divided the dataset into four distinct operational seasons, showing that while temperature is the primary driver of demand, factors such as high wind speed and extreme humidity are also crucial determinants. By identifying these four distinct profiles, the business can plan more strategically; for example, fleet maintenance can be scheduled during the cold, dry winter period, ensuring maximum capacity and bike availability to capture the full potential of the peak summer season.