The aim of this project is to perform a comparative analysis of clustering algorithms in order to achieve a market segmentation on a dataset of books, leveraging key metrics of success and quality. The primary goal of cluster analysis is to automatically divide data into clusters of similar items, providing insight into natural groupings. This approach is widely used in marketing to segment customers based on similar characteristics or buying patterns.
The added value of this study lies in addressing a common challenge in social data: high feature correlation. Initial assessment indicates an extremely high tendency for clustering, strongly suggesting that the data is not random. However, visual analysis shows that most observations form a dense mass. This suggests that while a pattern exists, the true separation between clusters is low, challenging the fundamental clustering goal of high intra-class similarity and high inter-class dissimilarity. The project pivots from simply finding the “best” statistical fit to evaluating the utility of different algorithms (K-means, Hierarchical, DBSCAN) when segmentation is practically necessary.
The main objectives of this project are to evaluate the structural integrity of the data, compare different clustering methods, and analyze the final segments against external variables to ensure the generated segments are useful for market strategy. This analysis provides a robust framework for handling non-ideal data in unsupervised learning contexts.
The dataset is available on GitHub and contains over 52 thousand records and 25 variables retrieved from the Goodreads Best Books Ever list (https://github.com/scostap/goodreads_bbe_dataset?tab=readme-ov-file).
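A minimal loading sketch is shown below, assuming the CSV has been downloaded from the repository; the file name is a placeholder and should be adjusted to the downloaded file.
# Placeholder file name - adjust to the CSV downloaded from the GitHub repository
books <- read.csv("Best_Books_Ever.csv", stringsAsFactors = FALSE)
dim(books) # roughly 52 thousand rows and 25 columns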
The variables chosen for the analysis contain information about the titles and authors of the books, as well as their Goodreads rating score, a list of genres used to characterize them, the number of pages, the date of publication, the number of ratings, the percentage of ratings above 2 stars, and the score on the Best Books Ever list.
books <- dplyr::select(books, title, author, rating, genres, pages, publishDate, numRatings, likedPercent, bbeScore)
str(books)
## 'data.frame': 52478 obs. of 9 variables:
## $ title : chr "The Hunger Games" "Harry Potter and the Order of the Phoenix" "To Kill a Mockingbird" "Pride and Prejudice" ...
## $ author : chr "Suzanne Collins" "J.K. Rowling, Mary GrandPré (Illustrator)" "Harper Lee" "Jane Austen, Anna Quindlen (Introduction)" ...
## $ rating : num 4.33 4.5 4.28 4.26 3.6 4.37 3.95 4.26 4.6 4.3 ...
## $ genres : chr "['Young Adult', 'Fiction', 'Dystopia', 'Fantasy', 'Science Fiction', 'Romance', 'Adventure', 'Teen', 'Post Apoc"| __truncated__ "['Fantasy', 'Young Adult', 'Fiction', 'Magic', 'Childrens', 'Adventure', 'Audiobook', 'Middle Grade', 'Classics"| __truncated__ "['Classics', 'Fiction', 'Historical Fiction', 'School', 'Literature', 'Young Adult', 'Historical', 'Novels', 'R"| __truncated__ "['Classics', 'Fiction', 'Romance', 'Historical Fiction', 'Literature', 'Historical', 'Novels', 'Historical Roma"| __truncated__ ...
## $ pages : chr "374" "870" "324" "279" ...
## $ publishDate : chr "09/14/08" "09/28/04" "05/23/06" "10/10/00" ...
## $ numRatings : int 6376780 2507623 4501075 2998241 4964519 1834276 2740713 517740 110146 1074620 ...
## $ likedPercent: int 96 98 95 94 78 96 91 96 98 94 ...
## $ bbeScore : int 2993816 2632233 2269402 1983116 1459448 1372809 1276599 1238556 1159802 1087732 ...
The data is converted to the correct datatypes and missing values are omitted. There are 817 observations remaining in the dataset after this step.
books <- books[books$genres != "[]", ]
books$pages_num <- as.numeric(books$pages)
## Warning: NAs introduced by coercion
books$publishDate_date <- as.Date(books$publishDate, format="%m/%d/%y")
books <- na.omit(books)
For the clustering, four numeric variables are used, while the rest will be used for the analysis of the final clusters.
books_m <- books %>%
select(rating, numRatings, likedPercent, bbeScore)
books_m <- as.matrix(books_m)
books_scaled <- scale(books_m)
The initial inspection of the scatter plots of the variables used for clustering clearly reveals the structural complexity of the data by illustrating two distinct distribution patterns. The variables assessing quality, rating and likedPercent, exhibit extremely low dispersion, as the majority of values are tightly compacted within a narrow range (e.g., 3.8−4.4 for rating and 90%−98% for likedPercent). This high compactness signifies strong intrinsic data density in the quality dimension, making internal cluster separation difficult. Conversely, the popularity metrics, numRatings and bbeScore, display a classic long-tail distribution. The vast majority of observations cluster near zero, while a few records possess extreme outlier values. This structural difference ensures that the overall cluster distance calculations are disproportionately dominated by the few outliers in the long-tail variables, leading to the low inter-cluster separation observed in the main data mass.
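A minimal sketch of how these scatter plots can be inspected, using the unscaled matrix books_m defined above:
pairs(books_m, pch = 20, cex = 0.5, main = "Pairwise scatter plots of the clustering variables")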
The correlation matrix provides critical insight into the relationship between the final four clustering variables, directly informing the methodological choices. The key observation is the strong positive correlation between rating and likedPercent (0.86). This relationship is structurally natural: a higher average rating is inherently linked to a higher percentage of positive responses. I retain both variables in the clustering model despite this strong correlation because it is not perfect: with r = 0.86, the shared variance is r² ≈ 0.74, leaving roughly a quarter of the variance unique to each variable (e.g., likedPercent measures widespread acceptance, while rating is sensitive to extreme negative reviews). Retaining both variables is a deliberate methodological choice to preserve this subtle distinction in quality assessment. Retaining these two highly, but non-perfectly, correlated variables highlights the core challenge of the data: the variables measuring quality are interdependent. This forces the clustering algorithms to perform a decomposition within a dense, correlated space, rather than easily separating independent dimensions. The metrics of popularity (numRatings and bbeScore) are also strongly correlated (0.69), confirming that these variables collectively represent a “Scale of Success”, which dominates the overall variance and is responsible for the long-tail distribution and the low inter-cluster separation in the dense data mass.
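The quoted correlations can be reproduced directly from the clustering matrix; a minimal sketch (the visualisation is optional and assumes the corrplot package is installed):
round(cor(books_m), 2) # correlation matrix of the four clustering variables
library(corrplot)
corrplot(cor(books_m), method = "color", addCoef.col = "black", tl.col = "black")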
Before clustering, exploratory analysis is performed to further learn about the structure of the data, to determine if the data is suitable for clustering, and to define the appropriate parameters, such as the optimum number of clusters (k).
The high value of the Hopkins Statistic (H≈0.94) confirms that the data is structured and suitable for clustering. However, this result is largely driven by the extreme outlier values in the long-tail variables (bbeScore, numRatings), which create large distances from the central mass of the data. Thus, the statistic confirms that some structure exists, but it doesn’t guarantee the existence of multiple, well-separated clusters.
The Visual Assessment of Tendency (VAT) plot, which displays the ordered dissimilarity matrix, provides immediate visual evidence of the high-density problem. The matrix is overwhelmingly dominated by light colors (white and light pink) across large blocks. These light colors denote very low dissimilarity (high similarity/density) between objects. The dark separation lines (high dissimilarity) between major groups are thin and weakly defined, and they are confined mainly to the corners. The presence of intense, deep violet/purple colors along the bottom and left edges of the VAT matrix is particularly noteworthy. This visual phenomenon indicates the existence of observations that are highly dissimilar to the majority of the data points. Since the horizontal and vertical axes of the VAT plot represent the observations themselves, the intense violet along the edges means that these specific points exhibit a maximum measure of distance (dissimilarity) from the central, dense mass of the data (represented by the light colors in the middle of the plot). These edges are typically where true outliers or extreme values are positioned when the dissimilarity matrix is ordered, which confirms the presence of books with extreme values that are very distant from the remaining observations.
get_clust_tendency(books_scaled, 30, graph=TRUE, gradient=list(low="red", mid="white", high="blue"), seed = 123)
## $hopkins_stat
## [1] 0.9371366
##
## $plot
To determine the appropriate number of clusters for partitioning, the Elbow Method was applied. The plot displays the within-cluster (unexplained) variance, expressed as a proportion of total variance, for different numbers of clusters. The first major bend occurs at k=2, reducing this proportion from 1.0 to 0.8. Crucially, the subsequent drop, from 0.8 at k=2 to 0.52 at k=3, represents the largest single reduction in unexplained variance (a gain of 0.28). The curve shows another, albeit smaller, significant reduction when moving from k=3 to k=4. The presence of multiple inflection points (at k=2, k=3, and k=4) confirms that the Elbow Method does not provide an unambiguous answer to the question of the optimal number of clusters. Based on it, k=3 can be selected as the primary focus for market interpretation, as it provides the largest single gain in information after the initial outlier separation at k=2. This choice allows for a balanced trade-off between statistical integrity and practical segmentation utility.
opt<-Optimal_Clusters_KMeans(books_scaled, max_clusters=10, plot_clusters = TRUE)
The silhouette plot, on the other hand, clearly suggests that k=2 is the optimal number of clusters, as the value of the silhouette statistic for this k is significantly higher than for other values (0.74). This can, however, be influenced by the presence of extreme values in the data: a division into two clusters can simply separate the outliers from the dense data mass. While this score represents high statistical purity and strong separation of the resulting clusters, the two-cluster solution offers limited utility for comprehensive market segmentation. Therefore, the statistically optimal k=2 serves as a structural baseline, which is then refined by the more interpretable larger-k segmentations.
opt<-Optimal_Clusters_KMeans(books_scaled, max_clusters=10, plot_clusters=TRUE, criterion="silhouette")
The Calinski-Harabasz (CH) Index provides another objective measure for determining the optimal k by measuring the ratio of between-cluster variance (separation) to within-cluster variance (compactness). Crucially, this index was calculated using Ward’s linkage method, consistent with the strong performance observed in the Hierarchical Clustering model. This index suggests that k=4 yields the optimal clustering partition, with the highest index value of 437.35.
library(NbClust)
opt<-NbClust(books_scaled, distance="euclidean", min.nc=2, max.nc=8, method="ward.D", index="ch")
opt
## $All.index
## 2 3 4 5 6 7 8
## 289.9131 430.4361 437.3484 364.7701 309.0423 340.7541 325.0947
##
## $Best.nc
## Number_clusters Value_Index
## 4.0000 437.3484
##
## $Best.partition
## (vector of k = 4 cluster assignments for all 817 observations; full printout omitted)
This result is the third contradictory finding for the optimal number of clusters, solidifying the project’s main thesis: the lack of statistical consensus demonstrates that the data does not possess one clearly defined, natural structure.
| Method | Optimal k Suggested | Rationale/Mechanism |
|---|---|---|
| Silhouette Plot | k=2 (~0.74) | Statistically purest; highly sensitive to outliers, separating noise from the dense core. |
| Elbow Method | k=3 (Max Gain) | Represents the point of maximum gain in partitioning the complex central data mass. |
| Calinski-Harabasz (Ward) | k=4 (Max Separation) | Maximizes global separation of centroids, indicating potential for four structurally distinct groups. |
Internal validation metrics assess the quality of the clustering based solely on the data structure (cohesion and separation). I examine three key metrics: Connectivity, Dunn Index, and Silhouette Width. The results clearly indicate that Hierarchical Clustering is the superior method for fitting the structure of this dense dataset across all three metrics and nearly all values of k. The internal validation metrics confirm that the K-means algorithm struggled with the high density and lack of separation in the data. PAM turns out to be ineffective, demonstrating that the data is not well represented by medoids, and for that reason this method is excluded from further analysis.
library(clValid)
##
## Attaching package: 'clValid'
## The following object is masked from 'package:flexclust':
##
## clusters
clmethods <- c("hierarchical","kmeans","pam")
internal <- clValid(books_scaled, nClust = 2:10, clMethods = clmethods, validation = "internal", maxitems = 100000)
summary(internal)
##
## Clustering Methods:
## hierarchical kmeans pam
##
## Cluster sizes:
## 2 3 4 5 6 7 8 9 10
##
## Validation Measures:
## 2 3 4 5 6 7 8 9 10
##
## hierarchical Connectivity 9.9921 10.1349 10.7071 13.0738 16.2528 20.1107 36.8536 42.0774 45.5893
## Dunn 0.2971 0.3419 0.3552 0.3552 0.2025 0.1371 0.0605 0.0605 0.0605
## Silhouette 0.7945 0.7876 0.7694 0.7621 0.5765 0.4728 0.4788 0.4701 0.4627
## kmeans Connectivity 22.0353 38.7718 44.7591 99.4163 117.2298 119.5964 112.9258 116.8016 113.8242
## Dunn 0.0311 0.0396 0.0414 0.0100 0.0100 0.0100 0.0114 0.0144 0.0181
## Silhouette 0.5644 0.5039 0.4807 0.4080 0.4025 0.4019 0.4140 0.4113 0.3864
## pam Connectivity 97.1929 152.9433 122.1714 180.7532 159.4905 209.7143 240.1393 250.3266 289.9290
## Dunn 0.0040 0.0062 0.0040 0.0037 0.0034 0.0037 0.0051 0.0074 0.0052
## Silhouette 0.3410 0.2708 0.3266 0.2918 0.2665 0.2812 0.2308 0.2419 0.2264
##
## Optimal Scores:
##
## Score Method Clusters
## Connectivity 9.9921 hierarchical 2
## Dunn 0.3552 hierarchical 4
## Silhouette 0.7945 hierarchical 2
This section is dedicated to implementing and comparing various clustering techniques across different configurations (k values) to address the structural challenges identified in the Exploratory Data Analysis (EDA) phase. The primary objective is to compare the statistical outcome with the practical requirement for market segmentation. I aim to compare the results in terms of Statistical Purity (identifying which methods yield the statistically optimal k=2 solution by primarily separating the extreme outliers from the dense central data mass, as predicted by the Silhouette method) and Market Utility (determining which methods provide results that are useful for market analysis). To achieve this, I will implement three distinct algorithmic approaches:
Density-Based Clustering (DBSCAN): I will first use DBSCAN to confirm the hypothesis that the data is primarily a single, highly dense mass. The optimal DBSCAN result is expected to yield one large cluster and classify the extreme observations as noise (Cluster 0), solidifying the argument against density-based segmentation.
Partitioning Methods (K-means): I will test K-means for various k values, focusing on the interpretability of the market segments and the comparison with hierarchical clustering.
Hierarchical Clustering: I will test Hclust (Ward’s method), which was deemed statistically superior across the internal validation metrics. I will test various k values to compare its performance against the K-means segmentation.
To determine the optimal parameters for the DBSCAN model, the kNN distance plot is analyzed. This plot shows the distance of each object to its 9th nearest neighbor (k=9 was chosen based on the dimension rule, k≥2×D). The plot exhibits an extremely flat curve for the majority of observations, visually confirming that the data consists of one large, highly dense mass. The point where the curve begins to sharply bend upwards identifies the optimal distance threshold (ϵ). I select ϵ=1.3 (marked by the red dashed line) as the maximum distance allowed for two points to be considered neighbors within this core dense region.
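A minimal sketch of how this kNN distance plot can be generated, assuming the dbscan package; the horizontal line marks the selected threshold:
library(dbscan)
kNNdistplot(books_scaled, k = 9) # distance of each observation to its 9th nearest neighbour
abline(h = 1.3, lty = 2, col = "red") # selected eps = 1.3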
The minimum number of points required to form a dense region is set to 10. The DBSCAN model, run with the derived parameters, yielded the following cluster sizes. The result is a definitive validation of the project’s thesis. The DBSCAN model successfully segmented the data into one large, cohesive cluster (781 points) and separated only 36 points as structural noise. This outcome confirms that the data possesses no natural, separated cluster structure beyond the single, dense core, thus rejecting the viability of density-based segmentation.
res_dbscan_final <- fpc::dbscan(books_scaled, eps = 1.3, MinPts = 10)
table(res_dbscan_final$cluster)
##
## 0 1
## 36 781
The visualization of the DBSCAN results on the first two principal components provides definitive visual evidence for the structural challenges in the data. The plot is dominated by one enormous cluster (colored red, representing the 781 core points), which confirms the conclusion from the VAT plot and the DBSCAN size table: the vast majority of the data forms a single, highly dense structure. The algorithm successfully isolated the 36 points classified as structural noise (Cluster 0), which are seen as scattered black points and are too far from the dense core to be considered neighbors.
fviz_cluster(res_dbscan_final, data = books_scaled, ellipse.type = "convex") + theme_minimal()
This visualization provides the final piece of evidence: since the data contains only one significant region of density, any attempt to segment the market into more clusters must involve forcing a split within the single dense mass, validating the methodological pivot of this project towards market utility rather than statistical purity.
This solution represents the algorithm’s attempt to find the two largest, most separable groups, driven by the strong recommendation from the Silhouette method. The data is split into two groups: Cluster 1 (473) and Cluster 2 (344).
clusters_km2 <- kmeans(books_scaled, 2, nstart = 25)
books$cluster_km2 <- clusters_km2$cluster
clusters_km2$size
## [1] 473 344
The centers are distinct, defined almost entirely by the Quality Factor. Cluster 2 exhibits metrics below the mean (rating≈−0.89), making it the “Low Quality” segment. Cluster 1 is the “Above Average Quality” segment (rating≈0.65). The differences in popularity metrics (numRatings and bbeScore) are negligible, meaning both clusters reside within the average popularity range.
clusters_km2$centers
## rating numRatings likedPercent bbeScore
## 1 0.6500815 0.05981488 0.6287239 0.08882855
## 2 -0.8938621 -0.08224546 -0.8644953 -0.12213926
The average silhouette width is 0.34. This score is low (below 0.5), indicating a weak structural fit despite being statistically optimal.
sil<-silhouette(clusters_km2$cluster, dist(books_scaled))
fviz_silhouette(sil)
## cluster size ave.sil.width
## 1 1 473 0.31
## 2 2 344 0.37
Unlike DBSCAN, which isolated the few outliers, K-means with k=2 was forced to partition the main dense mass in half. Cluster 2 represents the sub-population with significantly lower quality metrics, while Cluster 1 is the better-rated one. The dense structure of the data meant K-means could not easily separate a small outlier cluster.
fviz_cluster(clusters_km2, data = books_scaled, ellipse.type = "convex") + theme_minimal()
While the centers are easy to interpret, this solution is statistically weak due to the dense data mass and is practically insufficient for market segmentation as it fails to isolate the critical “Extreme Bestseller” outlier segment, which is essential for targeted analysis.
This is the primary model selected for market interpretation, based on the maximum informational gain observed in the Elbow Method. This model successfully isolates the structural outliers while segmenting the dense core.
The split yields one small cluster (71 points) and two larger groups (296 and 450 points).
clusters_km3 <- kmeans(books_scaled, 3, nstart = 25)
books$cluster_km3 <- clusters_km3$cluster
clusters_km3$size
## [1] 296 71 450
Cluster 1 (296 points) represents the “Low Quality / Niche” segment (very low quality, low popularity). Cluster 2 (71 points) exhibits extremely high popularity (numRatings≈2.41, bbeScore≈2.38). This cluster effectively isolates the structural outliers (Extreme Bestseller/Outlier segment), which K-means was previously unable to separate when k=2. Cluster 3 (450 points) represents the large “High Quality Core” segment (above-average quality, low popularity).
clusters_km3$centers
## rating numRatings likedPercent bbeScore
## 1 -0.9545236 -0.1835585 -0.9485002 -0.2165369
## 2 0.3539341 2.4111429 0.1988674 2.3783308
## 3 0.5720215 -0.2596840 0.5925255 -0.2328146
The average silhouette width is 0.40. The score remains low, but shows an improvement from the previous model with k=2.
sil<-silhouette(clusters_km3$cluster, dist(books_scaled))
fviz_silhouette(sil)
## cluster size ave.sil.width
## 1 1 296 0.35
## 2 2 71 0.10
## 3 3 450 0.47
By setting k=3, the algorithm uses one cluster (Cluster 2) to capture the extreme variance introduced by the long-tail variables (numRatings and bbeScore). This action stabilizes the centroids of the remaining two clusters, allowing them to form more interpretable and actionable segments within the dense quality space.
fviz_cluster(clusters_km3, data = books_scaled, ellipse.type = "convex") + theme_minimal()
This solution tests the suggestion from the Calinski-Harabasz Index. The split yields one outlier group (73 points) and three unequal groups (362, 112, and 270 points).
clusters_km4 <- kmeans(books_scaled, 4, nstart = 25)
books$cluster_km4 <- clusters_km4$cluster
clusters_km4$size
## [1] 362 112 73 270
Cluster 3 (73 points) successfully isolates the Extreme Bestseller/Outlier group again. Cluster 1 represents books with rating and popularity slightly below average, while Cluster 2 consists of points with slightly higher popularity but significantly lower quality. Cluster 4, on the other hand, represents books with extremely high ratings and popularity close to Cluster 1 (below average).
clusters_km4$centers
## rating numRatings likedPercent bbeScore
## 1 -0.3194191 -0.2365083 -0.1212865 -0.2507834
## 2 -1.5377003 -0.1646671 -1.8098411 -0.1978735
## 3 0.3365548 2.3669623 0.1853003 2.3553151
## 4 0.9751246 -0.2545538 0.8632629 -0.2184910
Despite the clear interpretability of the centers (validating the CH Index’s objective of maximizing separation), the average silhouette width drops to 0.36. This decline confirms that forcing the split into four groups leads to over-segmentation and structural instability of the dense core.
sil<-silhouette(clusters_km4$cluster, dist(books_scaled))
fviz_silhouette(sil)
## cluster size ave.sil.width
## 1 1 362 0.39
## 2 2 112 0.35
## 3 3 73 0.08
## 4 4 270 0.41
fviz_cluster(clusters_km4, data = books_scaled, ellipse.type = "convex") + theme_minimal()
The next method used to cluster the data and compare the results with K-means is hierarchical clustering. This approach is valuable as it does not require a pre-defined number of clusters (k), allowing potential natural subgroupings within the data, which may correspond to meaningful taxonomies, to be confirmed. Firstly, the Agglomerative Coefficient (AC) is computed for various linkage methods to choose the most efficient one. As anticipated, the Ward method proves to be the best choice: it yields the highest agglomerative coefficient (0.99), and it is the recommended default when no theoretical justification favours another linkage, since its goal is to minimize the total within-cluster sum of squares.
library(cluster) # agnes()
library(purrr)   # map_dbl()
d <- dist(books_scaled, method = "euclidean")
m <- c("average", "single", "complete", "ward")
names(m) <- c("average", "single", "complete", "ward")
ac_coeff <- function(x) {
agnes(books_scaled, method = x)$ac
}
map_dbl(m, ac_coeff)
## average single complete ward
## 0.9747986 0.9590764 0.9828348 0.9934339
The hierarchical clustering is performed and in the next steps various k values will be tested, specifically comparing its statistical performance (high Silhouette and low Connectivity, as shown in the clValid analysis) against the boundaries imposed by K-means.
hc_model_ward <- hclust(d, method = "ward.D")
plot(hc_model_ward, cex = 0.6, hang = -1, main = "Dendrogram - Ward method")
This solution tests the statistically optimal number of clusters suggested by the Silhouette index (k=2).
The data is partitioned into two relatively balanced groups: Cluster 1 (372) and Cluster 2 (445). This split is obtained by cutting the dendrogram just below its root, at the highest merge. This size distribution is similar to the K-means k=2 model (344 vs. 473), confirming that both algorithms are forced to arbitrarily bisect the large, dense central mass.
sub_grp_h2 <- cutree(hc_model_ward, k = 2)
books$cluster_h2 <- sub_grp_h2
table(sub_grp_h2)
## sub_grp_h2
## 1 2
## 372 445
The average silhouette width is 0.33. This low Silhouette score directly contradicts the high theoretical Silhouette value (0.79) obtained for hierarchical clustering during the internal validation (clValid) analysis. This dramatic drop confirms that the initial high score was misleading, being entirely dominated by the theoretical distance to the extreme outliers. The actual k=2 cut reveals the low separation and structural weakness inherent in partitioning the dense core.
sil_h2 <- silhouette(sub_grp_h2, d)
fviz_silhouette(sil_h2,
palette = "Set2",
ggtheme = theme_minimal())
## cluster size ave.sil.width
## 1 1 372 0.19
## 2 2 445 0.44
This solution is tested as the necessary step for market segmentation and the point of maximum gain identified by the Elbow Method.
The division into three clusters is highly informative: Cluster 1 (74), Cluster 2 (445), and Cluster 3 (298). This split precisely mirrors the most valuable outcome of the K-means k=3 model: one small cluster (74 points) successfully isolates the structural outliers (Extreme Bestseller segment), allowing the remaining two clusters to form the primary segments of the dense core.
sub_grp_h3 <- cutree(hc_model_ward, k = 3)
books$cluster_h3 <- sub_grp_h3
table(sub_grp_h3)
## sub_grp_h3
## 1 2 3
## 74 445 298
The average silhouette width increases to 0.36. This increase in the Silhouette score (from 0.33 to 0.36) confirms that k=3 is a structurally better fit than k=2. By dedicating one cluster to the extreme variance, the overall coherence of the remaining groups improves. This validates the choice of k=3 as the optimal trade-off between statistical cohesion and structural informativeness.
sil_h3 <- silhouette(sub_grp_h3, d)
fviz_silhouette(sil_h3,
palette = "Set2",
ggtheme = theme_minimal())
## cluster size ave.sil.width
## 1 1 74 0.08
## 2 2 445 0.30
## 3 3 298 0.53
This solution tests the structural suggestion of the Calinski-Harabasz Index (k=4), which was statistically preferred by this specific metric.
The data is split into four unequal groups: Cluster 1 (74), Cluster 2 (120), Cluster 3 (298), and Cluster 4 (325). Like K-means, Hclust isolates the small Outlier Cluster (74 points), but then arbitrarily splits the remaining dense core into three parts (Clusters 2, 3, and 4), attempting to maximize global separation.
sub_grp_h4 <- cutree(hc_model_ward, k = 4)
books$cluster_h4 <- sub_grp_h4
table(sub_grp_h4)
## sub_grp_h4
## 1 2 3 4
## 74 120 298 325
The average silhouette width slightly drops back to 0.35. The small decline in the Silhouette score confirms the result seen in the K-means analysis: k=4 represents an over-segmentation of the dense central data mass. While highly interpretable (as shown in the K-means centers analysis), the loss of statistical coherence makes k=4 a less desirable base for robust market segmentation than the k=3 solution.
sil_h4 <- silhouette(sub_grp_h4, d)
fviz_silhouette(sil_h4,
palette = "Set2",
ggtheme = theme_minimal())
## cluster size ave.sil.width
## 1 1 74 0.07
## 2 2 120 0.30
## 3 3 298 0.38
## 4 4 325 0.40
This final section compares the outcomes of the K-means and Hierarchical Clustering models, assesses their performance against the analytical objective, and provides the definitive interpretation of the chosen market segments.
Based on the preceding structural analysis, both statistically favoured alternatives (k=2 and k=4) were deemed inadequate for robust market segmentation: k=2 (statistically optimal by Silhouette) was rejected because it fails to isolate the critical outlier segment, while k=4 (optimal by the CH Index) was rejected due to structural instability and over-segmentation (a drop in Silhouette width). I will therefore focus the discussion entirely on the k=3 solution, as it represents the optimal trade-off between statistical coherence and practical utility.
Prior to comparing the models, aggregation was performed on the unscaled data to obtain mean (or median) values for each feature within the k=3 partitions generated by K-means and Hierarchical Clustering. This is necessary for a meaningful, market-focused interpretation. The following data aggregation steps were executed:
Cluster centers: calculation of the mean unscaled values of the four clustering variables (rating, numRatings, likedPercent, bbeScore) for each cluster.
Genre profile: calculation of the mean proportion of the top 20 genres within each cluster.
External attributes: calculation of the mean number of pages (pages_num) and the median publication date (publishDate_date) for each cluster.
# Unnest the genre strings into one row per book-genre pair
genres_tidy <- books %>%
  mutate(book_key = paste(title, author, sep = "__")) %>%              # unique key per book
  mutate(genres_list = str_extract(genres, "\\[.*?\\]")) %>%           # keep the bracketed genre list
  mutate(genres_list = str_remove_all(genres_list, '[\\[\\]\\"]')) %>% # strip brackets and double quotes
  separate_rows(genres_list, sep = ',\\s*') %>%                        # one row per genre
  mutate(genres_list = str_trim(genres_list)) %>%
  filter(genres_list != "") %>%
  select(book_key, genres_list)
# One-hot encode the genres (one indicator column per genre)
genres_binary <- genres_tidy %>%
  mutate(value = 1) %>%
  pivot_wider(names_from = genres_list, values_from = value, values_fill = 0) %>%
  select(-book_key)
# Keep the 20 most frequent genres
genre_sums <- colSums(genres_binary)
top_genres <- names(genre_sums[order(genre_sums, decreasing = TRUE)][1:20])
# Attach the top-genre indicators back to the books
books_final <- books %>%
  mutate(book_key = paste(title, author, sep = "__")) %>%
  left_join(genres_binary %>% mutate(book_key = genres_tidy %>% distinct(book_key) %>% pull(book_key)) %>% select(book_key, all_of(top_genres)), by = "book_key") %>%
  select(-book_key)
The structural comparison of the K-means and Hierarchical Clustering k=3 centers (using unscaled data) confirms that both methods arrive at the same three functional archetypes, reinforcing the stability of the k=3 solution. The cluster describing extreme bestsellers with the highest number of ratings (cluster 2 in K-means and cluster 1 in hierarchical) presents a minimal difference: both models perfectly isolate the high-variance segment. The high-quality cluster (cluster 3 in K-means and cluster 3 in hierarchical) contains books with the highest ratings and positive reviews and is very similar in both clustering methods, with hierarchical clustering presenting slightly higher values of the two crucial variables. The remaining cluster of books with the lowest quality shows a slight difference in where the boundary is drawn between methods, but the archetypes are the same.
aggregate(
x = books_final[, c("rating", "numRatings", "likedPercent", "bbeScore")],
by = list(Klaster = books_final$cluster_km3),
FUN = mean
)
## Klaster rating numRatings likedPercent bbeScore
## 1 1 3.866284 340942.9 89.06081 53778.76
## 2 2 4.164648 2040800.1 93.35211 728114.15
## 3 3 4.214378 291071.1 94.82444 49548.61
aggregate(
x = books_final[, c("rating", "numRatings", "likedPercent", "bbeScore")],
by = list(Klaster = books_final$cluster_h3),
FUN = mean
)
## Klaster rating numRatings likedPercent bbeScore
## 1 1 4.167432 2001024.3 93.36486 709013.36
## 2 2 3.934472 316311.5 90.41348 50029.55
## 3 3 4.286409 295179.4 95.69799 50944.14
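The convergence of the two k=3 partitions can also be verified directly by cross-tabulating the assignments; in this minimal check the cluster labels are arbitrary, so matching segments appear as dominant cells rather than along the diagonal:
table(KMeans = books$cluster_km3, Hierarchical = books$cluster_h3)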
I now compare the external characteristics, which were not used for clustering, to check whether the different segmentation methods result in different external or genre profiles. While the core structural definitions (rating, numRatings) are virtually identical, the external profiles (publication date, page count) show moderate differences. This suggests that the slight variations in the cluster boundaries (dictated by the respective algorithms’ distance calculations) result in different secondary characteristics being absorbed by the surrounding segments. For instance, the Hclust model defined the High Quality Core as somewhat newer (median publication date 2005-10-28 vs. 2005-05-02 for K-means) and longer (489 vs. 464 pages on average), while the Low Quality segment contains shorter books under K-means (373 pages) than under Hclust (386 pages).
Despite these differences, the K-means k=3 solution is selected as the final model due to its higher silhouette score, ease of interpretation, and its common adoption in large-scale market segmentation studies. Crucially, the external profiles generated by the K-means boundaries were deemed more coherent for defining the market archetypes.
aggregate(
x = books_final[, top_genres],
by = list(Klaster = books_final$cluster_km3),
FUN = mean
)
## Klaster 'Fiction' 'Classics' 'Fantasy' 'Novels' 'Literature' 'Young Adult'
## 1 1 0.9391892 0.5743243 0.3479730 0.4797297 0.5135135 0.3006757
## 2 2 0.9718310 0.7605634 0.5915493 0.4647887 0.4366197 0.5352113
## 3 3 0.9200000 0.4777778 0.5111111 0.3200000 0.2800000 0.3733333
## 'Romance' 'Historical Fiction' 'Audiobook' 'Adventure' 'Adult'
## 1 0.3513514 0.3310811 0.1621622 0.1722973 0.1858108
## 2 0.2816901 0.3098592 0.2676056 0.3943662 0.2253521
## 3 0.3066667 0.2466667 0.2755556 0.2400000 0.2577778
## 'Contemporary' 'Historical' 'Science Fiction' 'School' 'Childrens'
## 1 0.2601351 0.1554054 0.1520270 0.24662162 0.0777027
## 2 0.1690141 0.2253521 0.2112676 0.21126761 0.2816901
## 3 0.1733333 0.2066667 0.1444444 0.06666667 0.1622222
## 'Paranormal' 'Magic' 'Science Fiction Fantasy' 'Mystery'
## 1 0.12837838 0.0777027 0.07094595 0.14527027
## 2 0.04225352 0.1690141 0.26760563 0.07042254
## 3 0.15333333 0.1577778 0.14666667 0.12000000
aggregate(
x = books_final[, c("pages_num")],
by = list(Klaster = books_final$cluster_km3),
FUN = mean
)
## Klaster x
## 1 1 372.7973
## 2 2 426.3944
## 3 3 463.9289
aggregate(
x = books_final[, c("publishDate_date")],
by = list(Klaster = books_final$cluster_km3),
FUN = median
)
## Klaster x
## 1 1 2004-10-13
## 2 2 2004-09-28
## 3 3 2005-05-02
aggregate(
x = books_final[, top_genres],
by = list(Klaster = books_final$cluster_h3),
FUN = mean
)
## Klaster 'Fiction' 'Classics' 'Fantasy' 'Novels' 'Literature' 'Young Adult'
## 1 1 0.9729730 0.7432432 0.5945946 0.4459459 0.4054054 0.5270270
## 2 2 0.9415730 0.5707865 0.3797753 0.4449438 0.4741573 0.3191011
## 3 3 0.9060403 0.4362416 0.5436242 0.2953020 0.2281879 0.3825503
## 'Romance' 'Historical Fiction' 'Audiobook' 'Adventure' 'Adult'
## 1 0.2837838 0.3108108 0.2972973 0.3918919 0.2567568
## 2 0.3370787 0.2988764 0.1730337 0.1842697 0.1977528
## 3 0.3053691 0.2516779 0.3087248 0.2550336 0.2684564
## 'Contemporary' 'Historical' 'Science Fiction' 'School' 'Childrens'
## 1 0.1621622 0.2297297 0.2162162 0.20270270 0.2837838
## 2 0.2449438 0.1483146 0.1438202 0.20449438 0.1011236
## 3 0.1543624 0.2416107 0.1510067 0.04026846 0.1677852
## 'Paranormal' 'Magic' 'Science Fiction Fantasy' 'Mystery'
## 1 0.04054054 0.16216216 0.27027027 0.08108108
## 2 0.13483146 0.09213483 0.09213483 0.13932584
## 3 0.15771812 0.17785235 0.15100671 0.11409396
aggregate(
x = books_final[, c("pages_num")],
by = list(Klaster = books_final$cluster_h3),
FUN = mean
)
## Klaster x
## 1 1 428.2027
## 2 2 386.2315
## 3 3 489.3624
aggregate(
x = books_final[, c("publishDate_date")],
by = list(Klaster = books_final$cluster_h3),
FUN = median ## the median is better suited to dates than the mean
)
## Klaster x
## 1 1 2004-09-28
## 2 2 2004-09-01
## 3 3 2005-10-28
The final K-means k=3 segments are defined by strong distinctions in Scale and Quality, providing three actionable market archetypes.
High Quality: The largest group, defined by strong subjective quality but lacking the extreme scale of the Bestseller Cluster. Its defining features are the highest mean rating and likedPercent, while popularity is average. When it comes to the external variables, this cluster exhibits the lowest proportion of ‘Classics’ (0.477) among all segments, but a high representation of general ‘Fiction’ (0.92). Additionally, more than half of the books in this cluster belong to the ‘Fantasy’ genre. The books in this cluster tend to be longer and published later than in the other clusters. This indicates that its high quality is driven by well-received, contemporary, and diverse literary titles, rather than historical classics. Market Implication: this is the segment for stable investments, characterized by current content that meets high editorial and critical standards.
Extreme Bestseller (Outliers): This small cluster represents the structural outliers, defined by maximum scale and volume. Its defining features are the highest total number of ratings and bbeScore, as well as a high mean rating. It is the most genre-specific segment: it has the highest proportion of ‘Classics’ (0.760), ‘Fiction’ (0.97), and ‘Young Adult’ (0.535). It also contains a considerable amount of ‘Adventure’, ‘Historical’, ‘Science Fiction’, and ‘Childrens’ books, while there are very few ‘Paranormal’ or ‘Mystery’ titles. The median publication date suggests that the books in this segment tend to be slightly older than in the other groups. This confirms that the extreme sales volume is primarily driven by highly established, multi-generational Classics and popular franchises that accumulate immense numbers of ratings over decades. Market Implication: maximizing secondary sales (audiobooks, re-releases) of established books.
Lower Quality: Defined by the lowest subjective quality scores, yet maintaining average popularity. This cluster’s mean rating and likedPercent are the lowest. It shows an intermediate proportion of ‘Classics’ (0.574), alongside a larger representation of diverse niche genres (Romance, Historical Fiction, Contemporary, Mystery) than the other segments. The average number of pages is significantly lower than in the other clusters. This segment consists of books that appeal to specific tastes but are polarizing or critically challenging. Market Implication: highly targeted marketing toward niche communities where content type overrides general quality perception.
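For downstream reporting, the numeric K-means labels can be mapped to these archetype names; a minimal sketch, assuming the label-to-archetype correspondence of the run reported above (1 = Lower Quality, 2 = Extreme Bestseller, 3 = High Quality). K-means labels depend on the random initialisation, so the mapping should be checked against the cluster centers before use.
# Map numeric labels to archetype names (verify against clusters_km3$centers first)
books_final$segment <- factor(books_final$cluster_km3, levels = c(1, 2, 3), labels = c("Lower Quality", "Extreme Bestseller", "High Quality"))
table(books_final$segment)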
This project successfully achieved its goal of segmenting a structurally challenging dataset, demonstrating that an analyst must prioritize market utility over strict statistical optimality when dealing with high-density data. The initial Exploratory Data Analysis confirmed the data’s central problem: it consists of one enormous, dense core and a few highly influential extreme outliers.
The analysis of the optimal number of clusters (k) resulted in contradictory suggestions, highlighting the complexity of the data structure. The Silhouette method suggested k=2 (0.74 for K-means, 0.79 for hierarchical clustering), but this high score was misleadingly driven by the vast distance to the few extreme outliers, not by a cohesive split of the central mass; the actual internal coherence of the k=2 cut was very weak (0.33). The Elbow Method indicated that the point of maximum informational gain was at k=3, where dedicating one cluster to the outlier variance significantly improved the overall coherence. The Calinski-Harabasz Index preferred k=4, an over-segmentation which provided highly interpretable segments but reduced the Silhouette score. This confirmed the core thesis: the data does not possess a single natural structure, necessitating a strategic decision to select k=3 as the optimal point for interpretive relevance. This finding was further supported by DBSCAN, which confirmed the presence of a single, massive dense core (781 points) with only 36 structural outliers. This lack of natural separation forced all partitioning algorithms to artificially impose boundaries onto the data mass.
The core comparison between K-means and Hierarchical Clustering revealed a nuanced relationship between statistical efficiency and practical segmentation utility in this high-density dataset. Internal validation (using the clValid summary) initially suggested that Hierarchical Clustering (Ward’s method) was statistically superior to K-means across all key metrics (Connectivity, Dunn, and Silhouette). The Ward method’s objective, minimizing the within-cluster variance, made it theoretically better at finding compact groups in the data. K-means suffered more from the presence of the dense core and outliers, resulting in lower scores (the Dunn Index was extremely low for K-means at k=3, suggesting very poor separation). The performance of both algorithms at the k=2 threshold was highly revealing regarding the outlier problem. The K-means algorithm was visibly forced to arbitrarily bisect the dense core along the main dimension to minimize the squared distances, resulting in two large, non-separated segments and a low Silhouette score (0.34); the outliers were masked within the largest segment. Hclust also partitioned the dense mass, confirming that despite Hclust’s statistical strength, it cannot find two naturally separated clusters within the single dense mass. The crucial shift to k=3 stabilized both models and neutralized the adverse effect of the outliers: both algorithms dedicated one cluster to the high-variance Extreme Bestseller/Outlier segment, which effectively removed the destabilizing influence of the outliers from the remaining two clusters. Once stabilized, K-means and Hclust produced structurally identical segments for the remaining dense core. This strong convergence proves that the k=3 boundaries are robust, regardless of the calculation methodology.
Despite hierarchical clustering’s initial statistical edge, K-means was ultimately selected as the final model because the k=3 solutions were structurally equivalent and k-means provides direct, easily interpretable cluster centers, which is far more practical for market researchers. K-means is also widely accepted and simpler to implement and reproduce than navigating the linkage complexities of hierarchical models. The consistent observation of a low average Silhouette Width (reaching a maximum of only 0.40 at k=3) is not an indicator of poor analysis, but strong evidence of the data’s structural difficulty. The low score confirms that points within a cluster often lie closer to the center of a neighboring cluster, meaning the boundaries are overlapping.
Despite the low Silhouette score, the k=3 solution proved highly useful, confirming that structural metrics are secondary to the objective of the study. The final K-means model successfully defined the segments based on clear distinctions in scale and quality, providing immediate commercial relevance: Extreme Bestseller, High Quality, and Low Quality. While the clustering is statistically imperfect (low Silhouette width), the stability and high functional interpretability of the three derived segments make this model highly valuable for market strategy. The project successfully navigated the challenges of the dense dataset to extract meaningful commercial information.