Rows: 3,500
Columns: 6
$ outcome <fct> No Shot, No Shot, Shot, No Shot, No Shot, No Shot, No…
$ duration_sec <dbl> 2, 3, 4, 5, 5, 16, 17, 24, 24, 33, 36, 36, 40, 43, 45…
$ half <chr> "1st Half", "1st Half", "1st Half", "1st Half", "1st …
$ num_passes <dbl> 0, 1, 0, 1, 2, 5, 2, 4, 4, 4, 9, 4, 11, 10, 9, 2, 0, …
$ starting_area <chr> "Defence", "Midfield", "Attack", "Attack", "Attack", …
$ method_gained_by <chr> "KickoutOwn", "ThrowIn", "KickoutOpp", "KickoutOpp", …
tibble [3,500 × 6] (S3: tbl_df/tbl/data.frame)
$ outcome : Factor w/ 2 levels "No Shot","Shot": 1 1 2 1 1 1 1 2 2 1 ...
$ duration_sec : num [1:3500] 2 3 4 5 5 16 17 24 24 33 ...
$ half : chr [1:3500] "1st Half" "1st Half" "1st Half" "1st Half" ...
$ num_passes : num [1:3500] 0 1 0 1 2 5 2 4 4 4 ...
$ starting_area : chr [1:3500] "Defence" "Midfield" "Attack" "Attack" ...
$ method_gained_by: chr [1:3500] "KickoutOwn" "ThrowIn" "KickoutOpp" "KickoutOpp" ...
Show actual column names
colnames(players)
fg_pct two_p_pct three_p_pct ft_pct trb ast
[1,] -0.1195441 -0.04501742 0.11164631 0.60835344 0.26239810 2.4465644
[2,] 1.2271573 0.43803808 -0.42080843 -1.01149068 3.02109532 0.8722245
[3,] -0.2662146 1.06713362 -0.05218592 0.08383249 1.61470066 -0.6724109
[4,] -0.3195493 -0.03378357 -0.72389806 -0.37126658 0.20830600 -0.5238883
[5,] -0.2262135 -0.01131588 0.23452048 0.37694714 -0.06215451 -0.1971385
[6,] -0.9595658 -0.67411296 0.21813726 1.11744730 -0.67519833 -0.2862521
stl blk tov
[1,] 1.87840853 -0.6141150 1.88560837
[2,] -0.01304374 -0.2482917 1.60529394
[3,] 0.61744035 1.2150014 -0.54378335
[4,] -0.40709630 -0.4312034 -0.02987356
[5,] -0.64352783 -0.2482917 -0.26346892
[6,] -0.40709630 -0.7970266 -0.87081685
- Does the data need to be scaled before computing the distance matrix for hierarchical clustering or before being entered into the K-means clustering algorithm? Explain your answer.
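Yes. Both methods rely on Euclidean distance, so variables measured in larger units dominate the result.
The four shooting percentages lie between 0 and 1, whereas trb, ast, stl, blk and tov are counts that run into the tens or hundreds, so without scaling the distances would be driven almost entirely by the count variables. Standardising the 9 attributes (as in the scaled values shown above) puts them on a comparable footing.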
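To make the effect of scaling concrete, here is a minimal sketch on made-up numbers (the `toy` data frame and the helper `share_trb` are hypothetical, not part of the analysis). It measures how much of the squared Euclidean distance between two players comes from the rebound count, before and after `scale()`.

```r
# Toy data (made up for illustration): one proportion and one count column.
toy <- data.frame(
  fg_pct = c(0.40, 0.45, 0.50, 0.55),
  trb    = c(40, 90, 120, 230)
)

# scale() centres each column to mean 0 and rescales it to standard deviation 1.
toy_scaled <- scale(toy)

# Share of the squared distance between rows 1 and 4 that comes from trb.
share_trb <- function(m) {
  diff_sq <- (m[1, ] - m[4, ])^2
  unname(diff_sq["trb"] / sum(diff_sq))
}

share_raw    <- share_trb(as.matrix(toy))  # almost all of the distance
share_scaled <- share_trb(toy_scaled)      # roughly half of the distance
```

On the raw values the percentage column contributes almost nothing to the distance; after scaling both columns contribute about equally.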
- Hierarchical Clustering:
- Create a suitable distance matrix containing the Euclidean distance between all pairs of players.
- Create a 4-cluster solution and assess the quality of this solution using silhouette scores.
Average silhouette width: 0.094
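The steps above (distance matrix, 4-cluster cut, silhouette score) can be sketched as follows. This is not the report's own code: the matrix `x` is simulated as a stand-in for the scaled player data, and complete linkage is an assumption.

```r
library(cluster)  # silhouette(); a recommended package shipped with R

set.seed(1)
# Simulated stand-in for the scaled player matrix (not the real data).
x <- scale(matrix(rnorm(200 * 9), ncol = 9))

d  <- dist(x, method = "euclidean")   # pairwise Euclidean distances
hc <- hclust(d, method = "complete")  # linkage method is an assumption here
cl <- cutree(hc, k = 4)               # cut the tree into 4 clusters

sil       <- silhouette(cl, d)        # one row per observation
avg_width <- mean(sil[, "sil_width"]) # average silhouette width
```

`avg_width` corresponds to the single summary figure reported above; the value on simulated data will of course differ.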
- Use tables and suitable graphs to profile the clusters, making sure to include answers to the following questions:
- How do the clusters differ on their average performance for the 9 attributes included in the cluster analysis?
- How does the distribution of player position differ between clusters?
# A tibble: 4 × 10
cluster fg_pct two_p_pct three_p_pct ft_pct trb ast stl blk tov
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0.433 0.481 0.333 0.733 118. 115. 38.0 6.06 73.5
2 2 0.431 0.494 0.299 0.703 89.8 35.6 18.7 7.79 35.5
3 3 0.329 0.323 0.320 0.634 41.0 15.9 9.72 3.30 15.2
4 4 0.500 0.540 0.269 0.686 228. 45.7 26.2 38.9 59.8
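A table of per-cluster means like the one above can be produced with a grouped `summarise()`. The sketch below uses made-up rows (the `toy` tibble is hypothetical) and passes the summary function to `across()` as an anonymous function, which is the form current dplyr versions expect when extra arguments such as `na.rm` are needed.

```r
library(dplyr)

# Made-up rows for illustration; the real data has nine numeric attributes.
toy <- tibble(
  cluster = factor(c(1, 1, 2, 2)),
  trb     = c(100, 120, 40, NA),
  ast     = c(90, 110, 10, 20)
)

# One row per cluster, one mean per attribute; missing values are skipped.
cluster_means <- toy |>
  group_by(cluster) |>
  summarise(across(c(trb, ast), \(x) mean(x, na.rm = TRUE)))
```

The `\(x)` shorthand and the native pipe both require R 4.1 or later.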
C F G
1 0 8 230
2 55 791 1532
3 2 34 105
4 16 87 16
C F G
1 0.00000000 0.03361345 0.96638655
2 0.02312868 0.33263246 0.64423886
3 0.01418440 0.24113475 0.74468085
4 0.13445378 0.73109244 0.13445378
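The two tables above are a count cross-tabulation and its row proportions. A minimal sketch with made-up labels (the vectors `cluster` and `pos` are hypothetical) shows how `table()` and `prop.table()` produce that pair.

```r
# Made-up labels for illustration only.
cluster <- c(1, 1, 1, 2, 2, 2, 2, 2)
pos     <- c("G", "G", "F", "C", "F", "F", "G", "G")

counts <- table(cluster, pos)             # rows = clusters, columns = positions
props  <- prop.table(counts, margin = 1)  # divide each row by its total
```

Using `margin = 1` makes each row sum to 1, so the proportions describe the position mix within a cluster rather than the cluster mix within a position.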
Quarto Report Write-up (Section 3e) i. How do the clusters differ on performance? The averages for the 9 performance metrics differ notably across clusters. Cluster 4 has the highest fg_pct (0.500), two_p_pct (0.540), trb (228) and blk (38.9), indicating strong inside scorers and rebounders. Cluster 1 has the highest ast (115), stl (38.0) and tov (73.5), along with the best three_p_pct (0.333) and ft_pct (0.733), suggesting high-usage playmakers and perimeter shooters. Cluster 3 has the lowest average on every attribute except three_p_pct (for example fg_pct 0.329 and trb 41.0), suggesting players with limited production. Cluster 2, by far the largest group (2,378 of 2,876 players), sits between these extremes, consistent with “average” role players. A grouped bar chart was used to visualize these attribute differences across clusters.
- How do positions differ by cluster? The position distribution varies across clusters: Cluster 1 is almost entirely guards (230 of 238, 97%), consistent with its high assists and steals. Cluster 4 is dominated by forwards (73%) and has the largest share of centers (13%), consistent with its high rebounds and blocks. Clusters 2 and 3 are mostly guards (64% and 74%) with a sizeable minority of forwards (33% and 24%). Cluster 2 also holds 55 of the 73 centers, largely because it contains most of the players.
A stacked bar chart visualized the proportion of each position per cluster.
- K-means Clustering:
- Carry out a K-means clustering that will produce 4 clusters. Remember to use set.seed(101) to ensure that your results are reproducible
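A minimal sketch of a reproducible 4-cluster K-means run follows. The matrix `x` is simulated as a stand-in for the scaled player data, and `nstart = 25` is an assumption rather than a setting taken from the report.

```r
set.seed(1)
# Simulated stand-in for the scaled player matrix (not the real data).
x <- scale(matrix(rnorm(200 * 9), ncol = 9))

set.seed(101)  # fixes the random choice of starting centroids
km <- kmeans(x, centers = 4, nstart = 25)  # nstart = 25 is an assumption

# Resetting the seed and repeating the call reproduces the same assignment.
set.seed(101)
km_again <- kmeans(x, centers = 4, nstart = 25)
same_result <- identical(km$cluster, km_again$cluster)
```

Because K-means starts from randomly chosen centroids, the seed has to be set immediately before the call for the cluster labels to be repeatable.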
Rows: 2876 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): team, player, pos
dbl (9): fg_pct, two_p_pct, three_p_pct, ft_pct, trb, ast, stl, blk, tov
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 4 × 10
cluster avg_fg_pct avg_two_p_pct avg_three_p_pct avg_ft_pct avg_trb avg_ast
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0.474 0.557 0.249 0.606 70.6 17.1
2 2 0.374 0.424 0.319 0.729 65.0 29.5
3 3 0.425 0.476 0.337 0.755 111. 82.9
4 4 0.507 0.554 0.273 0.677 183. 36.6
# ℹ 3 more variables: avg_stl <dbl>, avg_blk <dbl>, avg_tov <dbl>
- Assess the quality of the clustering solution by calculating silhouette scores.
Silhouette of 2876 units in 4 clusters from silhouette.default(x = kmeans_result$cluster, dist = dist_matrix) :
Cluster sizes and average silhouette widths:
626 1080 742 428
0.09580838 0.23013642 0.23190000 0.12343252
Individual silhouette widths:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.11241 0.09224 0.18691 0.18547 0.28534 0.44773
The average silhouette width was approximately 0.19. Under the commonly cited rule of thumb, values of 0.25 or below indicate no substantial cluster structure, so this solution is weak: the clusters overlap, and many players are not well matched to their assigned clusters.
The silhouette plot visually confirms this, showing a large number of narrow bars and values close to zero, which suggests:
- Substantial overlap between clusters
- Low cohesion within clusters
- Potential misclassification of certain observations
While the 4-cluster K-means model still allows for some broad pattern identification, the low silhouette score indicates that this clustering solution should be interpreted with caution.
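The figures in the silhouette summary can be pulled out programmatically. The sketch below again uses simulated data in place of the player matrix; the components `clus.avg.widths`, `clus.sizes` and `avg.width` are those returned by `summary()` on a silhouette object from the cluster package.

```r
library(cluster)

set.seed(1)
x <- scale(matrix(rnorm(200 * 9), ncol = 9))  # simulated, not the real data

set.seed(101)
km  <- kmeans(x, centers = 4, nstart = 25)    # nstart is an assumption
sil <- silhouette(km$cluster, dist(x))

sil_summary <- summary(sil)
per_cluster <- sil_summary$clus.avg.widths    # one average width per cluster
overall     <- sil_summary$avg.width          # mean over all observations
n_negative  <- sum(sil[, "sil_width"] < 0)    # points closer to another cluster

# The overall width is the size-weighted mean of the per-cluster widths.
weighted <- sum(per_cluster * sil_summary$clus.sizes) / nrow(x)
```

The same relationship holds in the output above: (626 × 0.0958 + 1080 × 0.2301 + 742 × 0.2319 + 428 × 0.1234) / 2876 ≈ 0.185, matching the reported mean.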
- Use tables and suitable graphs to profile the clusters, making sure to include answers to the following questions:
- How do the clusters differ on their average performance for the 9 attributes included in the cluster analysis?
# A tibble: 4 × 10
cluster avg_fg_pct avg_two_p_pct avg_three_p_pct avg_ft_pct avg_trb avg_ast
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0.474 0.557 0.249 0.606 70.6 17.1
2 2 0.374 0.424 0.319 0.729 65.0 29.5
3 3 0.425 0.476 0.337 0.755 111. 82.9
4 4 0.507 0.554 0.273 0.677 183. 36.6
# ℹ 3 more variables: avg_stl <dbl>, avg_blk <dbl>, avg_tov <dbl>
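This table allows us to identify cluster-specific characteristics among the six attributes displayed (avg_stl, avg_blk and avg_tov are truncated from the printed output):
Cluster 1 has the highest two-point percentage (0.557) but the lowest three-point (0.249) and free-throw (0.606) percentages and the fewest assists (17.1), suggesting players who score close to the basket.
Cluster 2 has the lowest field goal percentage (0.374) and the fewest rebounds (65.0), with moderate three-point and free-throw shooting, suggesting lower-output perimeter players.
Cluster 3 has by far the most assists (82.9) and the best three-point (0.337) and free-throw (0.755) percentages, consistent with playmakers and shooters.
Cluster 4 has the highest field goal percentage (0.507) and the most rebounds (183), consistent with interior players.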
- How does the distribution of player position differ between clusters?
C F G
1 31 342 253
2 2 223 855
3 0 49 693
4 40 306 82
Interpretation Cluster 1 is a mix of forwards (342 of 626, 55%) and guards (40%), with 31 centers, in line with its interior-scoring profile.
Cluster 2 is mostly guards (855 of 1,080, 79%), with forwards making up nearly all of the rest.
Cluster 3 is almost entirely guards (693 of 742, 93%) and contains no centers, aligning with its high assists and strong free-throw and three-point shooting.
Cluster 4 is dominated by forwards (306 of 428, 71%) and contains 40 of the 73 centers, aligning with its high rebounding.
- Compare and contrast the clusters produced by Hierarchical Clustering and K-means. For example:
- Which algorithm produced the highest quality clusters?
To assess cluster quality, the average silhouette width was calculated for both clustering methods:
K-means (4 clusters): average silhouette width = 0.19
Hierarchical clustering (4 clusters): average silhouette width = 0.094
These values indicate that both clustering solutions are relatively weak, but K-means produced a noticeably higher silhouette score, suggesting it provided slightly better-defined clusters in terms of cohesion and separation.
Neither solution exceeds 0.25, the level below which a silhouette width is usually read as showing no substantial structure, but K-means, at roughly double the hierarchical score, was the more effective method for this dataset.
- Did both algorithms produce clusters with a similar profile? Are there any noticeable differences?
While both K-means and Hierarchical Clustering grouped players based on the same 9 performance attributes, their resulting clusters overlap in broad player types but differ in size and structure.
Similarities: Both algorithms identified a cluster made up largely of Guards, characterised by:
High assists
Above-average steals
Both also included a cluster of Forwards and Centers, typically showing:
High rebounds
High blocks
The highest field goal percentage but a low three-point percentage
These similarities suggest that the main performance styles (e.g., playmakers vs post players) were robust enough to emerge under both methods.
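Differences: The clearest difference is cluster size. Hierarchical clustering placed 2,378 of the 2,876 players (83%) in a single cluster and left three small groups of 238, 141 and 119, whereas K-means produced more balanced clusters of 626, 1,080, 742 and 428.
As a result, the hierarchical solution isolates small, extreme groups:
A high-volume guard cluster (115 assists on average, 97% guards)
A high-rebounding frontcourt cluster (228 rebounds on average)
A low-output cluster with the lowest averages on most attributes
K-means instead splits the bulk of the players more evenly, so:
Its cluster averages are less extreme (average rebounds range from 65.0 to 183 rather than 41.0 to 228)
It separates the guards into two groups (clusters 2 and 3) that differ mainly in assists (29.5 vs 82.9) and rebounds (65.0 vs 111)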
This difference likely stems from how the methods operate:
K-means minimises the within-cluster sum of squared distances to each centroid and can reassign players between iterations, which tends to produce compact clusters of comparable size.
Hierarchical clustering, especially with complete linkage, merges clusters greedily and never revisits a merge, so a few extreme players can end up in small clusters of their own while most players remain in one large group.
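One direct way to check how far the two solutions agree is to cross-tabulate their labels. The sketch below does this on simulated data (the matrix `x` is a stand-in for the scaled player data, and the linkage and `nstart` settings are assumptions); `best_match_share` is a hypothetical helper quantity, not a standard index.

```r
set.seed(1)
x <- scale(matrix(rnorm(200 * 9), ncol = 9))  # simulated, not the real data

# Hierarchical labels (complete linkage assumed), cut at 4 clusters.
hc_cl <- cutree(hclust(dist(x), method = "complete"), k = 4)

# K-means labels with the seed used in the report.
set.seed(101)
km_cl <- kmeans(x, centers = 4, nstart = 25)$cluster

# Rows: hierarchical clusters; columns: K-means clusters.
agreement <- table(hierarchical = hc_cl, kmeans = km_cl)

# For each hierarchical cluster, the share of its members that fall in its
# best-matching K-means cluster (1 means the cluster was reproduced intact).
best_match_share <- apply(agreement, 1, max) / rowSums(agreement)
```

Cluster numbers are arbitrary in both methods, so the comparison should be read from where the counts concentrate in the table, not from matching labels.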