1. Introduction

Clustering is the process of grouping similar data objects into related groups. This unsupervised learning technique is applied to discover patterns and structure within large datasets. The core objective is to generate clusters with high intra-class similarity (alike objects within the same cluster) and low inter-class similarity (distinct objects in different clusters).

This project focuses on applying different clustering methods to analyze hourly bicycle rental data. The data have been aggregated to define daily bike-renting demand profiles based on fundamental environmental and calendar factors. Accurately defining these profiles is crucial for business applications, including optimizing fleet management, enhancing demand forecasting accuracy, and strategically planning operations based on daily conditions.

2. Database

The dataset used for the analysis was provided by the UC Irvine Machine Learning Repository (https://archive.ics.uci.edu/dataset/275/bike+sharing+dataset). It contains the hourly and daily counts of rental bikes between 2011 and 2012 in the Capital Bikeshare system in Washington, D.C., USA, together with the corresponding weather and seasonal information. The initial hourly data were aggregated to the daily level to define the daily profiles.

The clustering process relies on four continuous variables: cnt_mean, temp_mean, hum_mean and windspeed_mean. The selection is based on the direct and measurable impact of these variables on customers' renting behavior.

The table below lists all variables used in the analysis, together with their descriptions.

Definition of Aggregated Variables
Aggregated Variable Description
cnt_mean Mean count of total rental bikes per day.
temp_mean Mean normalized temperature in Celsius per day. Values divided by 41 (max).
hum_mean Mean normalized humidity per day. Values divided by 100 (max).
windspeed_mean Mean normalized wind speed per day. Values divided by 67 (max).
season Season (1: Spring, 2: Summer, 3: Fall, 4: Winter). First value of the day.
yr Year (0: 2011, 1: 2012). First value of the day.
month Month (1 to 12). First value of the day.
holiday Whether the day is an official holiday (1) or not (0). First value of the day.
workingday Whether the day is neither a weekend nor an official holiday (1) or not (0). First value of the day.
library(dplyr)        #data aggregation (group_by, summarise, %>%)
library(fastDummies)  #dummy_cols for one-hot encoding

df$day <- as.Date(df$dteday)
#Grouping database by day and adding mean day variables. 
df_day <- df %>%
  group_by(day) %>%
  summarise(
    cnt_mean        = mean(cnt),
    temp_mean       = mean(temp),
    hum_mean        = mean(hum),
    windspeed_mean  = mean(windspeed),
    season          = first(season),
    yr              = first(yr),
    month           = first(mnth),
    holiday         = first(holiday),
    workingday      = first(workingday)
  ) %>%
  ungroup()
df_day<-na.omit(df_day)

#One-hot encoding for season (season 1 is baseline)
df_day <- df_day %>%
  dummy_cols(
    select_columns = c("season"),
    remove_selected_columns = TRUE,
    remove_first_dummy = TRUE
  )

#Selection of variables and scaling, using Z-score normalization
variables<-df_day[2:5]
variables_df <- as.data.frame(lapply(variables, scale)) 
variables_df<-na.omit(variables_df)

3. Clusterability and Optimal K Selection

Before applying any clustering algorithms, it is crucial to perform some preliminary steps. First, the clusterability was assessed to determine whether the data are randomly distributed or contain inherent groupings; if the data points were randomly distributed, any clustering result would be statistically meaningless. A statistical test, the Hopkins Statistic, together with a visualization technique, was used to confirm whether the data are suitable for clustering.

Cluster Tendency (Hopkins & VAT)

The evaluation of clusterability was performed using the Hopkins Statistic and the Visual Assessment of Tendency (VAT) plot. The Hopkins Statistic acts as a statistical hypothesis test whose null hypothesis is that the data are randomly (uniformly) distributed. If the value is close to 0.5, the dataset reveals no clustering structure, whereas a value close to 1 provides significant evidence that the data might be clusterable.
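Conceptually, the statistic compares nearest-neighbour distances of points sampled from the data with those of points generated uniformly over the same range. A minimal base-R sketch of this idea is shown below; the helper nnd and the sample size are illustrative, and get_clust_tendency is the implementation actually used for the reported value.

#Illustrative computation of the Hopkins statistic (sketch only)
set.seed(123)
n   <- 10                                                   #same sample size as in the call below
X   <- as.matrix(variables_df)
idx <- sample(nrow(X), n)                                   #randomly sampled real observations
U   <- apply(X, 2, function(v) runif(n, min(v), max(v)))    #uniform points over the data range
nnd <- function(p, M) min(sqrt(rowSums(sweep(M, 2, p)^2)))  #nearest-neighbour distance
w <- sapply(idx, function(i) nnd(X[i, ], X[-i, ]))          #real point to its nearest other real point
u <- apply(U, 1, function(p) nnd(p, X))                     #uniform point to its nearest real point
sum(u) / (sum(u) + sum(w))                                  #values near 1 suggest clusterability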

The calculated Hopkins statistic is H = 0.7635, which is relatively high and suggests the data are suitable for clustering. The dataset probably does not possess a strong, naturally well-separated clustering structure, but the data points are also not completely random.

The visualization generated below is a heatmap of the dissimilarities between pairs of data points. The presence of distinct dark purple regions shows that the dissimilarities are not uniform across the data, i.e. the data have a dense, non-random structure. The plot does not display clear, sharply delimited squares of colour along the diagonal, but the data are structured enough to be partitioned into meaningful segments.

I acknowledge that the clusters will not have perfectly sharp edges due to the continuous nature of the variables. The decision to proceed with the analysis is therefore justified by the practical need to profile the days of the year.

#Hopkins statistic and VAT plot
get_clust_tendency(variables_df, 10, graph=TRUE, gradient=list(low="red", mid="white", high="blue"), seed = 123)
## $hopkins_stat
## [1] 0.763544
## 
## $plot

Optimal number of clusters

The next step is to determine the optimal number of clusters (K) that best represents the structure of the data and is suitable for later interpretation.

I use the Within-Cluster Sum of Squares (WSS), visualized with the Elbow Method, to identify the number of clusters K beyond which adding further clusters yields diminishing returns, i.e. no longer produces a significant reduction in within-cluster variability.

The curve exhibits a steep decline in WSS at K=2 and K=3, indicating that these initial partitions are highly effective. The decline visibly begins to flatten out around K=4; beyond that point, the marginal gain in reducing internal variability is no longer significant.
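For reference, the quantity plotted by fviz_nbclust below can also be computed directly, since every kmeans fit reports its total within-cluster sum of squares. A minimal sketch (the seed and nstart value are illustrative choices, not taken from the analysis above):

#Total WSS for K = 1..8, the quantity behind the elbow plot
set.seed(123)
wss <- sapply(1:8, function(k) kmeans(variables_df, centers = k, nstart = 25)$tot.withinss)
plot(1:8, wss, type = "b", xlab = "Number of Clusters K", ylab = "Total within-cluster SS")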

#optimal number of clusters - elbow
fviz_nbclust(variables_df, kmeans, method = "wss") + 
  labs(title = "WSS (Elbow Method) for Optimal Number of Clusters") +
  xlab("Number of Clusters K")

The Average Silhouette Width is a measure used to evaluate the quality of a clustering. For each observation i the silhouette is s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance to the other members of its own cluster and b(i) is the mean distance to the members of the nearest other cluster, so the score ranges from -1 to 1. Values close to -1 indicate objects that are probably assigned to the wrong cluster, while values close to 1 indicate a well-clustered dataset. The plot below suggests that K=2 statistically provides the best separation (s = 0.3); the score drops significantly for K=3 and K=4.

#optimal number of clusters - silhouette
fviz_nbclust(variables_df, kmeans, method = "silhouette")

The Calinski-Harabasz Index is a clustering validation measure that evaluates the ratio of between-cluster (inter-group) variance to within-cluster (intra-group) variance. The optimal number of clusters is the one that maximizes this ratio. The NbClust function was used to calculate the CH Index for K values ranging from 2 to 8 clusters. The highest value of the index is observed at K=4 (158.83).

#NbClust
opt1<-NbClust(variables_df, distance="euclidean", min.nc=2, max.nc=8, method="complete", index="ch")
opt1$All.index
##        2        3        4        5        6        7        8 
##  12.0681 149.4974 158.8303 154.5182 153.8990 142.5123 142.0433
opt1$Best.nc
## Number_clusters     Value_Index 
##          4.0000        158.8303

The shadow statistic measures how close each observation lies to its own cluster centroid relative to the second-closest centroid: the shadow value of a point is s(x) = 2*d(x, c1(x)) / (d(x, c1(x)) + d(x, c2(x))), where c1(x) and c2(x) are the closest and second-closest centroids. Values near 0 mean the point sits close to its own centroid, while values near 1 mean it lies almost halfway between two centroids, so lower cluster-level averages indicate better isolation. The calculated averages, across all K solutions, range from 0.69 to 0.80, which is relatively high and confirms that neighbouring clusters overlap considerably. This is consistent with the continuous nature of the data: there are dense regions, but they are not sharply separated from one another.

d2<-cclust(variables_df, 2, dist="euclidean") 
shadow(d2) 
##         1         2 
## 0.7312806 0.6899607
d3<-cclust(variables_df, 3, dist="euclidean") 
shadow(d3)
##         1         2         3 
## 0.7341732 0.6999410 0.7763433
d4<-cclust(variables_df, 4, dist="euclidean") 
shadow(d4)
##         1         2         3         4 
## 0.7822810 0.7109780 0.8027311 0.7414861
plot(shadow(d2)) 

plot(shadow(d3))

plot(shadow(d4))

The primary analytical goal of this study is to segment daily profiles into four distinct groups corresponding to the four seasons. The Elbow Method and the Calinski-Harabasz index both point to K=4, which provides statistical support for this choice, even if the silhouette and shadow diagnostics show that the resulting clusters will not be sharply isolated. To ensure a comprehensive and robust analysis, I will proceed by evaluating and comparing clusterings with K=2, K=3 and K=4.

Clustering methods

To compare the algorithms by internal validation, I executed the clValid function, assessing the performance of the Hierarchical, K-Means and PAM algorithms across cluster sizes from K=2 to K=5. Hierarchical clustering delivered the statistically best scores (Dunn Index = 0.1011, Silhouette Width = 0.322) at K=2, while the PAM algorithm yielded the poorest scores across all metrics for all K values. The K-Means method had slightly lower statistical performance than Hierarchical clustering, but it provided the highest Silhouette score for the desired K=4. The clustering will therefore be performed using all three methods, and the final model will be chosen based on the quality and interpretability of the resulting day profiles.

#clValid for hierarchical, kmeans, pam
clmethods<-c("hierarchical","kmeans","pam")
internal_validation <- clValid(
  variables_df, 
  nClust = 2:5, 
  clMethods = clmethods, 
  validation = "internal", 
  maxitems=100000)
summary(internal_validation)
## 
## Clustering Methods:
##  hierarchical kmeans pam 
## 
## Cluster sizes:
##  2 3 4 5 
## 
## Validation Measures:
##                                   2        3        4        5
##                                                               
## hierarchical Connectivity   19.3365  19.3365  96.6587 129.3909
##              Dunn            0.1011   0.1011   0.0695   0.0739
##              Silhouette      0.3220   0.3053   0.2352   0.2344
## kmeans       Connectivity  105.1321 157.1516 211.5567 268.2187
##              Dunn            0.0361   0.0438   0.0256   0.0450
##              Silhouette      0.3013   0.2768   0.2669   0.2303
## pam          Connectivity  106.9016 176.3357 244.4952 285.1659
##              Dunn            0.0193   0.0207   0.0321   0.0262
##              Silhouette      0.2967   0.2466   0.2089   0.2192
## 
## Optimal Scores:
## 
##              Score   Method       Clusters
## Connectivity 19.3365 hierarchical 2       
## Dunn          0.1011 hierarchical 2       
## Silhouette    0.3220 hierarchical 2

4. K-Means

The K-Means algorithm is an iterative, distance-based partitioning method used to group observations into a predefined number of clusters K. The process begins by selecting K random central points (centroids) and assigning each data point to the nearest centroid. In every iteration the centroid positions are recalculated and points are reassigned, until the assignments no longer change. The aim is to minimize the within-cluster sum of squares, which at the same time keeps the cluster centers well separated.
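For reference, the core iteration can be written down in a few lines of base R. The sketch below is an illustration only: the name simple_kmeans is made up, it assumes no cluster becomes empty, and kmeans() from the stats package is what the analysis actually uses.

#Minimal sketch of the K-Means (Lloyd) iteration
simple_kmeans <- function(X, k, max_iter = 100) {
  X <- as.matrix(X)
  centers <- X[sample(nrow(X), k), , drop = FALSE]              #1. random initial centroids
  cl <- rep(1L, nrow(X))
  for (i in seq_len(max_iter)) {
    D  <- as.matrix(dist(rbind(centers, X)))[-(1:k), 1:k]       #point-to-centroid distances
    cl <- max.col(-D)                                           #2. assign each point to its nearest centroid
    new_centers <- apply(X, 2, function(v) tapply(v, cl, mean)) #3. recompute the centroids
    if (all(new_centers == centers)) break                      #4. stop once the centroids are stable
    centers <- new_centers
  }
  list(cluster = cl, centers = centers)
}
km_sketch <- simple_kmeans(variables_df, 4)                     #illustrative use on the scaled data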

K-Means analysis

K = 2

The initial K-Means clustering with K=2 was executed as a baseline, since this partition often yields the highest separation scores. The cluster plot shows two clusters with significant overlap at their boundary, especially in the central region. The first two dimensions, represented by the horizontal and vertical axes, explain about 72.3% of the variance. This confirms that the groups are not perfectly separated; the partition cuts through a continuous data distribution.

The average silhouette width is 0.30. While this score would not typically be considered high, it is the highest achieved across all tested K values. Cluster 2 displays higher quality, with fewer negative values, indicating that a greater share of its observations lie firmly within the cluster boundary.

The distance-from-centroid (stripes) plot visualizes the denseness of the clusters. Tightly packed lines near y=0 indicate that a majority of the observations lie close to the center of their cluster. In this case, both clusters exhibit a high density of points near the x-axis, confirming the presence of compact cores.

#k-means, 2 clusters
day_km2 <- kmeans(variables_df, 2)
df_day$cluster_km2 <- day_km2$cluster
#plot
fviz_cluster(list(data=variables_df, cluster=day_km2$cluster),
             ellipse.type="norm", geom="point", stand=FALSE, palette="jco", ggtheme=theme_classic())

#silhouette plot
sil<-silhouette(day_km2$cluster, dist(variables_df))
fviz_silhouette(sil)
##   cluster size ave.sil.width
## 1       1  313          0.24
## 2       2  418          0.35

#distance from the centroid
d1<-cclust(variables_df, 2, dist="euclidean")
stripes(d1)

K = 3

The K-Means algorithm was next executed with K=3. The cluster plot shows three distinct groups (blue, yellow, grey). The overall separation has visibly decreased compared to K=2; the overlap suggests blurred boundaries and potential misclassification.

The average silhouette width dropped to 0.28, with cluster 1 having the highest score and being the most numerous (339 observations, compared with 196 observations in each of clusters 2 and 3).

The stripes plot reveals the compactness of the three groups. While all of them have a relatively dense core, cluster 3 is spatially more dispersed than the other two.

#k-means k=3
day_km3 <- kmeans(variables_df, 3)
df_day$cluster_km3 <- day_km3$cluster
#plot
fviz_cluster(list(data=variables_df, cluster=day_km3$cluster),
             ellipse.type="norm", geom="point", stand=FALSE, palette="jco", ggtheme=theme_classic())

#silhouette
sil3<-silhouette(day_km3$cluster, dist(variables_df))
fviz_silhouette(sil3)
##   cluster size ave.sil.width
## 1       1  339          0.33
## 2       2  196          0.21
## 3       3  196          0.26

#distance from the centroid
d3<-cclust(variables_df, 3, dist="euclidean")
stripes(d3)

K = 4

The method was executed with K=4 to align with the analytical objective of segmenting the data into four seasonal profiles. The cluster plot shows the formation of four distinct groups (blue, yellow, grey and red). The boundaries show significant inter-cluster overlap, which confirms the difficulty of drawing sharp edges in this type of data. The perceived overlap may be exaggerated because the plot displays only the first two dimensions (73.2% of the variance).

The average silhouette score dropped further to 0.26. Two clusters maintain reasonable cohesion: cluster 4 (0.31) and cluster 2 (0.28). The other ones are less defined. Cluster 4 (301 observations) is the largest, while Cluster 1 (97 observations) is the smallest, which is visible in the varying heights of the Silhouette bands.

All four clusters display relatively high concentration, as can be seen in the stripes plot. Cluster 4 is the least compact, with its furthest point reaching a distance of more than 4, while cluster 1 is the most compact, with distances remaining below 2.5.

#k-means k=4
day_km4 <- kmeans(variables_df, 4)
df_day$cluster_km4 <- day_km4$cluster
#plot
fviz_cluster(list(data=variables_df, cluster=day_km4$cluster),
             ellipse.type="norm", geom="point", stand=FALSE, palette="jco", ggtheme=theme_classic())

#silhouette
sil4<-silhouette(day_km4$cluster, dist(variables_df))
fviz_silhouette(sil4)
##   cluster size ave.sil.width
## 1       1   97          0.19
## 2       2  164          0.28
## 3       3  169          0.18
## 4       4  301          0.31

#distance from the centroid
d4<-cclust(variables_df, 4, dist="euclidean")
stripes(d4)


5. PAM

Partitioning Around Medoids (PAM) is an alternative to the K-Means algorithm. It is an iterative clustering method built around actual data points, called medoids: within each cluster, the algorithm selects the most centrally located observation as its representative. Unlike K-Means, PAM is less sensitive to noise and outliers, and it is well suited to smaller data sets. The PAM algorithm will be performed for K=2, K=3 and K=4.
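To illustrate what distinguishes a medoid from a centroid, the short sketch below finds the medoid of a set of points, i.e. the actual observation with the smallest total distance to all others. The helper medoid_index is illustrative only; pam() from the cluster package is what the analysis uses.

#A medoid is an actual observation minimizing its total distance to all other points
medoid_index <- function(X) {
  D <- as.matrix(dist(X))            #pairwise Euclidean distances
  which.min(rowSums(D))              #row with the smallest total distance to the rest
}
medoid_index(variables_df)           #index of the single medoid of the whole (scaled) data set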

PAM Analysis

K = 2

The PAM algorithm was executed with K=2 to determine whether a medoid-based approach would alter the cluster structure compared to K-Means. The convex-hull visualization shows a clear division into two groups, with the first two dimensions explaining 73.2% of the variance.

The average silhouette width is 0.30, which is consistent with the results from K-Means.

The clusters were compared by visualizing boxplots of the continuous variables. They reveal that cluster 2 is characterized by a higher median temperature and bike rental count than cluster 1. The presence of numerous outliers across all variables justifies the use of the PAM algorithm, which is more robust against such noise.

#k=2
pam2<-pam(variables_df,2) 
df_day$pam2 <- pam2$cluster

fviz_cluster(pam2, 
             geom = "point", 
             ellipse.type = "convex", 
             stand = FALSE, 
             palette = "jco",
             main = paste("PAM, K = 2"))

#Silhouette
silpam2 <- silhouette(pam2$clustering, dist(variables_df))

#Visualisation of Silhouette plot
fviz_silhouette(silpam2, 
                main = paste("Silhouette Plot for PAM K = 2"))
##   cluster size ave.sil.width
## 1       1  324          0.23
## 2       2  407          0.35

#boxplot
groupBWplot(variables_df, 
            pam2$clustering, 
            main = "Comparison of variables for PAM", 
            xlab = "Cluster",
            alpha = 0.05)

K = 3

The convex hull visualization for K=3 shows three distinct regions. While the clusters are spatially defined, their boundaries are in close proximity and slightly overlap.

The average silhouette width dropped to 0.25, which is lower than the corresponding K-Means score.

The groupBWplot highlights the characteristics of the data. Cluster 3 has the highest median values for temperature and rental counts, cluster 1 represents low-demand days, and cluster 2 acts as a transitional group.

#k=3
pam3<-pam(variables_df,3) 
df_day$pam3 <- pam3$cluster

fviz_cluster(pam3, 
             geom = "point", 
             ellipse.type = "convex", 
             stand = FALSE, 
             palette = "jco",
             main = paste("PAM, K = 3"))

#Silhouette
silpam3 <- silhouette(pam3$clustering, dist(variables_df))

#Visualisation of Silhouette plot
fviz_silhouette(silpam3, 
                main = paste("Silhouette Plot for PAM K = 3"))
##   cluster size ave.sil.width
## 1       1  244          0.18
## 2       2  231          0.24
## 3       3  256          0.32

#boxplot
groupBWplot(variables_df, 
            pam3$clustering, 
            main = "Comparison of variables for PAM", 
            xlab = "Cluster",
            alpha = 0.05)

K = 4

The PAM algorithm was performed for the analytically desired number of clusters K=4 in order to compare it with K-Means results. The cluster plot shows a significant overlap between clusters in the central region of the plot. This indicates that the boundaries between clusters remain fluid.

The average silhouette width was 0.21, with the width for each cluster varying from 0.18 to 0.23. Negative silhouette values confirm fuzzy boundaries between groups.

The four boxplots provide a detailed look at the seasonal profiles. Cluster 4 represents the peak season, with a high count of rented bikes and the highest median temperature. Cluster 1 is characterized by a low median temperature and fewer rentals, whereas clusters 2 and 3 correspond to transitional seasons.

#k=4
pam4<-pam(variables_df,4) 
df_day$pam4 <- pam4$cluster

fviz_cluster(pam4, 
             geom = "point", 
             ellipse.type = "convex", 
             stand = FALSE, 
             palette = "jco",
             main = paste("PAM, K = 4"))

#Silhouette
silpam4 <- silhouette(pam4$clustering, dist(variables_df))

#Visualisation of Silhouette plot
fviz_silhouette(silpam4, 
                main = paste("Silhouette Plot for PAM K = 4"))
##   cluster size ave.sil.width
## 1       1  140          0.20
## 2       2  219          0.22
## 3       3  191          0.23
## 4       4  181          0.18

#boxplot
groupBWplot(variables_df, 
            pam4$clustering, 
            main = "Comparison of variables for PAM", 
            xlab = "Cluster",
            alpha = 0.05)


6. HAC

The final method employed in the study is Hierarchical Agglomerative Clustering (HAC). Unlike the previous methods, HAC does not require a predefined number of clusters. It builds a hierarchy of clusters using a bottom-up approach: each data point begins as its own cluster and, at each step, the algorithm merges the two most similar clusters until only one cluster remains. While K-Means can produce different results depending on its random starting points, HAC is deterministic and provides consistent results.
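The bottom-up merging can be inspected directly on a small example, since hclust() records every merge and the height (dissimilarity) at which it happens. The toy data below are purely illustrative.

#Toy illustration of the agglomerative merging recorded by hclust()
set.seed(123)
toy   <- matrix(rnorm(10), ncol = 2)                 #five points in two dimensions
h_toy <- hclust(dist(toy), method = "ward.D2")
h_toy$merge                                          #which clusters are joined at each step
h_toy$height                                         #dissimilarity level of each merge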

Linkage method selection

Four common linkage criteria were evaluated using the Agglomerative Coefficient (AC). The AC measures the strength of the clustering structure, with values closer to 1 indicating better separation and cohesion. Ward's method achieved the highest coefficient of 0.9894. Based on this result, the hierarchical model was built using Euclidean distance and Ward's D2 linkage.

#HAC
# dissimilarity matrix
d <- dist(variables_df, method = "euclidean")

#linkage method selection
method<- c( "average", "single", "complete", "ward")
names(method) <- method
ac <- function(x) {
  agnes(d, method = x)$ac
}
map_dbl(method, ac)
##   average    single  complete      ward 
## 0.8934425 0.8392369 0.9445458 0.9894666
# ward linkage
hc1 <- hclust(d, method = "ward.D2")

# dendrogram
plot(hc1, 
     cex = 0.6, 
     hang = -1, 
     main = "Dendrogram: Ward Linkage (Hclust)")

After establishing the structure using Ward’s method, the dendrogram was cut at different levels to create K=2, K=3 and K=4 clusters.

At the K=2 level the dendrogram shows a split at a height of just over 30, with relatively balanced groups of 379 and 352 data points. Moving to K=3, group 2 from the previous partition remains stable, while group 1 splits into two subgroups (199 and 180 observations). At the target level of K=4, the hierarchy divides the data into groups of 199, 180, 115 and 237 at a height of approximately 20.

#subgroups for K=2, K=3, K=4
sub_grp2 <- cutree(hc1, k = 2)      
table(sub_grp2)
## sub_grp2
##   1   2 
## 379 352
sub_grp3 <- cutree(hc1, k = 3)      
table(sub_grp3)
## sub_grp3
##   1   2   3 
## 199 180 352
sub_grp4 <- cutree(hc1, k = 4)      
table(sub_grp4)
## sub_grp4
##   1   2   3   4 
## 199 180 115 237

To objectively evaluate the trade-off between model complexity and clustering quality, key internal validation metrics were extracted with the cluster.stats function. The key insight is the steady decrease in the average within-cluster distance, which drops from 2.185 (K=2) to 1.869 (K=4). This confirms that as K increases, the homogeneity of the groups improves. As expected, the Dunn index, the ratio of cluster separation to cluster compactness, decreases with a greater number of clusters. Similarly, the average silhouette width drops from 0.276 (K=2) to 0.178 (K=4).
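The extraction code is not reproduced above; one possible way to obtain the table below, assuming the cluster.stats function from the fpc package and the dissimilarity matrix d computed earlier, is:

#Possible extraction of the reported validation measures (sketch)
hac_stats <- lapply(list(sub_grp2, sub_grp3, sub_grp4), function(grp) cluster.stats(d, grp))
data.frame(
  K               = 2:4,
  Dunn_Index      = sapply(hac_stats, `[[`, "dunn"),
  Avg_Silhouette  = sapply(hac_stats, `[[`, "avg.silwidth"),
  Avg_Within_Dist = sapply(hac_stats, `[[`, "average.within")
)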

##   K Dunn_Index Avg_Silhouette Avg_Within_Dist
## 1 2 0.05476476      0.2756214        2.185009
## 2 3 0.05009749      0.2389185        1.998094
## 3 4 0.03224339      0.1778102        1.869034

Dendrograms

To visually assess the hierarchical progression, dendrograms were generated for K=2, K=3, and K=4 levels, with each solution marked by borders to illustrate the splitting of the branches.
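The plotting code is not reproduced here; one way to draw such borders, assuming the fviz_dend function from factoextra, is sketched below for K=4 (the same call with k = 2 or k = 3 produces the other two plots).

#Dendrogram with rectangles around the K = 4 branches (illustrative call)
fviz_dend(hc1, k = 4, rect = TRUE, cex = 0.5, show_labels = FALSE,
          main = "Dendrogram: Ward Linkage, K = 4")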

K = 2

K = 3

K = 4


7. Interpretation

Following the evaluation of all three algorithms, the K-Means model with K=4 was selected for final interpretation, as it meets the analytical goal of defining four seasonal profiles of bike renting. This section focuses on characterizing each segment to understand the environmental drivers of bike-sharing demand.

The table below presents the entire data set grouped by the final K-Means segmentation for K=4. The continuous clustering variables were destandardized back to their original physical units and are presented as a mean for each cluster. Beyond the variables used for clustering, additional categorical features have been added, such as the dominant month, the percentage of holiday versus working days and the percentage distribution of seasons, to allow a thorough interpretation. The clusters were then arranged from the lowest to the highest mean temperature.
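The aggregation code itself is not shown; a sketch of how the numeric rows of such a profile table could be assembled is given below. It assumes the K=4 labels stored earlier in df_day$cluster_km4 and uses the normalization constants 41, 100 and 67 from the variable definitions; the dominant month and the season shares would be derived in a similar way.

#Possible construction of the cluster profiles in original physical units (sketch)
df_day %>%
  group_by(cluster_km4) %>%
  summarise(
    N              = n(),
    Count_Avg      = mean(cnt_mean),                 #mean hourly rentals
    Temp_Avg       = mean(temp_mean) * 41,           #back to degrees Celsius
    Hum_Avg        = mean(hum_mean) * 100,           #back to percent
    Wind_Avg       = mean(windspeed_mean) * 67,      #back to km/h
    Holiday_Pct    = mean(holiday) * 100,
    WorkingDay_Pct = mean(workingday) * 100
  ) %>%
  arrange(Temp_Avg)                                  #order clusters from coldest to warmest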

Seasonal segment analysis

The following table presents features of each group, allowing a direct comparison and characterization of all four clusters.

  • Cluster 2 - The cold, dry winter - This cluster represents the period of lowest demand for bike rentals. It is characterized by the lowest average temperature (11.4°C) and the lowest overall demand (119 units). The dominant month is January and the dominant season is winter. It also features the highest share of holiday days, illustrating that cold weather is such a strong factor that even the highest share of non-working days does not boost rental numbers.
  • Cluster 3 - The humid and cold period - A group dominated by the fall season and December, distinguished by the highest air humidity, averaging 79.6%. Although the temperatures are moderate at 17.9°C, the high moisture and likely rainy days are the determining conditions behind the second-lowest rental count of only 134 bikes.
  • Cluster 1 - The windy spring - The least numerous cluster (97 observations), with an average rental count of 206 units. This group contains the windiest days and is associated with April and the spring season (42.3%). Despite an average temperature of nearly 20°C, demand stops at the second-highest level, held back by peak wind speeds of 20.4 km/h. It also has the highest percentage of working days (76.3%).
  • Cluster 4 - The peak summer - The most numerous group, with 301 days, defined by the highest average temperatures (26.6°C) and the calmest winds (10.4 km/h). The dominant month is July and the dominant season is summer. It yields the maximum demand of 252 units, making it the most profitable and stable period for business operations.
Final Cluster Profiles (K=4)
Variable Cluster_2 Cluster_3 Cluster_1 Cluster_4
N 164 169 97 301
Count_Avg 119 134 206 252
Temp_Avg (°C) 11.4 17.9 19.9 26.6
Hum_Avg (%) 52.5 79.6 50.8 62.8
Wind_Avg (km/h) 13.3 12.2 20.4 10.4
Dominant_Month 1 12 4 7
Holiday_Pct 4.9 2.4 2.1 2.3
WorkingDay_Pct 64 68 76.3 68.4
Winter_Pct 65.9 26 22.7 2.3
Spring_Pct 10.4 26 42.3 27.2
Summer_Pct 0 11.2 10.3 52.8
Fall_Pct 23.8 36.7 24.7 17.6

The K=4 K-Means segmentation successfully divided the data set into four distinct operational seasons, showing that while temperature is the primary driver of demand, factors such as high wind speed and extreme humidity are also crucial determinants. By identifying these four distinct profiles, the business can plan more strategically: for example, fleet maintenance can be scheduled during the cold, dry winter, ensuring maximum capacity and bike availability to capture the full potential of the peak summer season.