Loading in Library Packages

Installing and downloading all the packages necessary in a function provided by Paul Tallon.

# Downloading shared code functions from Paul Tallon
source('https://raw.githubusercontent.com/ptallon/SportsAnalytics_Fall2024/main/SharedCode.R')
# downloading necessary packages using Paul Tallon functions
load_packages(c("data.table", "shiny", "dplyr", "viridis", "ggplot2","patchwork", "biotools" , "MVN", "ggalt", "factoextra", "cluster", "gifski", "gganimate", "httr", "r02pro", "tidyverse", "ggpubr", "factoextra","magrittr", "rlang", "reshape2", "rsconnect", "ggpmisc", "kableExtra", "glue"))

Loading in Files

Setting the working directory, path, seed, and reading in week 1 NFL data. To ensure reproducibility of the histogram when working with random or variable data, you can set a random seed using the set.seed() function before plotting

# prepare working directory and path for files
setwd("/Users/jose/Desktop/Data Viz")
my_path <- paste0(getwd(), "/nfl2025/NFLBDB2025")

# setting seed for machine learning functions
set.seed(123)

# downloading necessary packages using Paul Tallon functions
load_packages(c("data.table", "shiny", "dplyr", "viridis", "ggplot2","patchwork", "biotools" , "MVN", "ggalt", "factoextra", "cluster", "gifski", "gganimate", "httr", "r02pro", "tidyverse", "ggpubr", "factoextra", "rlang", "reshape2"))

# reading raw data
df <- load_data_for_one_week(directory = my_path, week = 1, merge = TRUE)

Filtering out the Data

Here we used functions to filter the data to only get motion that is pre snap but after the line is set so we can gauge the motion without seeing players leaving the huddle first. We also prepared data to be clustered by taking a subset of the previous data frame and cleaning and scaling that data, while also calculating the mean of statistics we hope to measure. We updated the raw data frame by modifying the x and y coordinates to center the selected player on the plot. This adjustment ensures that the chosen player’s position is at the origin (0, 0), with all other players’ positions recalculated relative to this centered reference.

# changing raw data frame to reflect x and y to center player on plot and filter out movement before motion. Tracking data after line is set for the player in motion.
lineset_inmotion <-  df %>%
  filter(frameType == "BEFORE_SNAP", inMotionAtBallSnap == TRUE) %>%
  group_by(nflId) %>%
  mutate(total_motion_pre_snap = cumsum(ifelse(frameType == "BEFORE_SNAP" & inMotionAtBallSnap == TRUE,    dis, 0))) %>%
  arrange(gameId, playId, frameId) %>%
  group_by(playId, gameId) %>%
  mutate(m1 = which(event=='line_set', arr.ind=TRUE)[1] ) %>%
  filter(row_number(gameId) >= unique(m1)) %>%
  ungroup() %>%
  group_by(gameId, playId) %>%
  mutate( frameId = (frameId - min(frameId)) + 1) %>%
  mutate( x = x - first(x[frameId == 1]),
          y = y - first(y[frameId == 1]),) %>%
  data.frame()

# using lineset_inmotion data frame to feature reduce and create summary statistics based on playId and gameId
data_for_cluster <- lineset_inmotion %>%
  group_by(playId, gameId) %>%
  summarise( x_mean = mean(x, na.rm = TRUE), 
             s_mean = mean(s, na.rm = TRUE),
             dir_mean = mean(dir, na.rm = TRUE),
             total_motion_pre_snap_mean = mean(total_motion_pre_snap, na.rm = TRUE)) %>%
  data.frame()

# scale data for better clustering and remove non numerical columns (gameId, playId)
scaled_data <- as.data.frame(scale(data_for_cluster[, -c(1, 2)]))

# remove rows with a z score higher or lower than 3
threshold <- 3
# Identify rows where all columns have a z score withing the threshold
non_outlier_rows <- apply(scaled_data, 1, function(row) all(abs(row) <= threshold))

# review the count of non-outlier rows
play_count <- (sum(non_outlier_rows == TRUE))
glue("The amount of rows which represent one play in the data is: {play_count}" )
## The amount of rows which represent one play in the data is: 594
# Subset the data to remove outliers
cleaned_data <- scaled_data[non_outlier_rows, ]

Correlation Heat Map

We created a correlation matrix and constructed a heat map that visualizes what statistics are correlated and how correlated they are relative to other combinations.

# columns to include in the correlation heatmap
cols_to_use_cor <- c("x", "s", "dis", "dir", "yardsGained", 
                     "expectedPoints", "passLength", "dropbackDistance", 
                     "receivingYards", "rushingYards", "total_motion_pre_snap")

# scale the data frame to ensure good correlation metrics
scaled_temp <- as.data.frame(scale(lineset_inmotion[, cols_to_use_cor]))

# Compute the correlation matrix
correlation_matrix <- cor(scaled_temp, use = "complete.obs")

# Convert the correlation matrix to long format
correlation_long <- melt(correlation_matrix)

# Create the heatmap
ggplot(correlation_long, aes(Var1, Var2, fill = value)) +
  geom_tile(color = "white") + # Add white grid lines for clarity
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
                       midpoint = 0, limit = c(-1, 1), space = "Lab", 
                       name = "Correlation") +
  geom_text(aes(label = round(value, 2)), color = "black", size = 4) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  labs(title = "Correlation Heatmap", 
       x = "Variables", 
       y = "Variables")

Calculation of Sum of Squares

Below is the process we used to calculate and visualize the sum of squares using the scaled data. Based on the resulting plot, we determined and highlighted k = 7 as the optimal value.

set.seed(123)
# Calculate the within-cluster sum of squares (WSS) for k values 1 through 20
wss <- sapply(1:20, function(k) {
  kmeans(scaled_data, centers = k, nstart = 20)$tot.withinss
})

# generate a plot of wss for each k value
plot(1:20, wss, type = "b", pch = 19, frame = FALSE,
     xlab = "Number of Clusters (k)",
     ylab = "Total Within-Cluster Sum of Squares",
     main = "Elbow Method for Optimal k",
     col = "blue", lwd = 2, cex = 1.5)

# Add grid lines for better readability
grid(nx = NULL, ny = NULL, col = "gray", lty = "dotted", lwd = 1)

# Mathematically determine the optimal k using the "elbow" method using rapid change 
elbow_k <- 7

# Highlight the optimal k with a vertical line
abline(v = elbow_k, col = "red", lwd = 2, lty = 2)

# Add a legend to explain the vertical line
legend("topright", legend = paste("Optimal k =", elbow_k),
       col = "red", lty = 2, lwd = 2, bty = "n", cex = 0.9)

Kmeans Clustering

The kmeans function was applied to the cleaned data to create clusters representing motion plays. A new data frame was created to integrate the clustering results with the relevant statistics and variables for analysis. Clustering was performed with k = 7, and the cluster assignments were appended to data_to_analysis. A scaled version of the data was also scaled to enable comparisons between scaled and unscaled versions. Scaled data is crucial for anova and other testing algorithms.

set.seed(123)
kmeans_result <- kmeans(cleaned_data, centers = elbow_k, iter.max = 20, nstart = 25)
# Add cluster assignments to the original dataset

data_to_analysis <- cbind(data_for_cluster[non_outlier_rows, ], Cluster = kmeans_result$cluster)
data_to_analysis <- data_to_analysis %>% 
  left_join(
    df %>%
      dplyr::select(gameId, playId, yardsGained, event, quarter, expectedPoints, passLength, dropbackType, dropbackDistance, receivingYards, rushingYards) %>%
      group_by(playId, gameId) %>%
      summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE), .names = "{.col}_mean")),
    by = c("gameId", "playId")
  ) %>%
  data.frame()
scaled_to_analysis <- cbind(cleaned_data, Cluster = kmeans_result$cluster)

Play Count in Clusters

Here we calculated the amount of plays that exist in each cluster and created a bar plot to show the results. This helps us gauge the amount of data that supports the cluster.

# Calculate cluster counts
cluster_counts <- data_to_analysis %>%
  group_by(Cluster) %>% 
  summarize(count_motion_plays = n()) %>%
  data.frame()

# Add a row for the total sum
cluster_counts <- cluster_counts %>%
  rbind(data.frame(Cluster = "Total", count_motion_plays = sum(cluster_counts$count_motion_plays)))

# Print the table using kable
kable(cluster_counts, col.names = c("Cluster", "Count of Motion Plays"), 
      caption = "Summary of Motion Plays by Cluster", 
      align = c('c', 'c'))
Summary of Motion Plays by Cluster
Cluster Count of Motion Plays
1 178
2 87
3 31
4 51
5 83
6 85
7 79
Total 594
# Creating the bar plot with frequencies displayed on top of the bars
ggplot(cluster_counts[-8,], aes(x = factor(Cluster), y = count_motion_plays)) +  # Convert Cluster to factor
  geom_bar(stat = "identity", fill = "steelblue") +
  geom_text(aes(label = count_motion_plays), vjust = -0.5, size = 4, color = "black") +
  labs(
    title = "Count of Motion Plays by Cluster",
    x = "Cluster",
    y = "Count of Plays"
  ) +
  theme_minimal()

Explained Variation

Looking back at the Kmeans results, we notice that direction showcases clusters with greatest values of plays where player in motion are moving left and away from their starting point according to the scaled value of 1.9 equals 228 degrees, which translates to South West. The total of sum of squared distances between each play within a cluster and its cluster center shows that Cluster 3 is the tightest cluster having a measure of 56. Cluster 1, Cluster 6, Cluster 7, and Cluster 4 have the largest distance within their clusters. Cluster 1 has a Within Cluster score of 120.47 and is 17% of the total distance within all clusters. the mean is 99.68. Cluster 2 is the closest to the mean, and Cluster 5 follows with the furthest clusters being Clusters 3 and 1. The total variability of a cluster was calculated using the Between SS divided by the Total SS. The result was 61.4% which indicates relatively strong clustering solution.

## K-means clustering with 7 clusters of sizes 178, 87, 31, 51, 83, 85, 79
## 
## Cluster means:
##        x_mean     s_mean     dir_mean total_motion_pre_snap_mean
## 1 -0.35483551 -0.6338623  0.004860760                -0.58013156
## 2  0.06653728 -0.3267660  1.321364975                -0.35175377
## 3  1.93543160  0.5166192 -0.156262390                -0.07142667
## 4  0.27890520  0.6190286  0.555087110                 1.92980359
## 5 -0.08911668 -0.2496173 -1.379619587                -0.58109686
## 6 -0.30945935 -0.4304406 -0.276001425                 0.92442847
## 7  0.05207428  1.3103285 -0.003132736                -0.23935854
## 
## Clustering vector:
##   1   2   3   4   5   6   8   9  10  11  12  13  15  16  17  18  19  20  21  22 
##   5   2   1   1   1   1   1   1   3   2   5   1   1   1   5   1   2   1   1   7 
##  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42 
##   1   7   3   1   2   2   2   5   5   5   1   1   7   7   2   1   7   1   2   1 
##  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  59  60  61  62  63 
##   1   5   3   1   3   1   6   1   2   1   3   1   2   2   2   1   1   1   1   1 
##  64  65  66  67  68  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84 
##   2   5   1   6   1   2   2   1   2   1   5   5   7   5   2   1   5   7   5   3 
##  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 101 102 103 104 105 
##   7   5   7   5   1   2   7   5   1   2   5   1   7   7   2   2   1   5   1   1 
## 106 107 108 109 110 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 
##   5   5   3   1   1   1   1   5   1   5   1   1   1   1   1   1   1   1   5   2 
## 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 
##   5   7   1   2   7   1   2   5   5   1   5   7   2   2   3   2   1   7   2   1 
## 147 148 149 150 151 152 153 154 155 156 157 159 160 161 162 163 164 165 166 167 
##   2   5   5   7   6   2   5   2   6   7   2   5   5   1   5   3   1   1   5   2 
## 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 
##   1   2   1   6   2   2   1   1   1   1   7   2   7   5   2   6   1   2   5   5 
## 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 206 207 208 
##   1   1   3   1   5   5   1   6   6   2   1   6   5   2   7   6   1   3   1   3 
## 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 
##   2   6   6   2   6   5   7   1   5   3   1   1   6   1   2   1   2   6   1   1 
## 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 246 247 248 249 250 
##   5   2   2   7   6   5   1   4   1   7   5   7   1   1   1   1   1   1   7   2 
## 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 
##   7   2   1   6   2   1   1   5   5   1   2   5   3   1   2   3   2   1   5   3 
## 271 272 273 274 275 276 277 278 279 280 281 283 284 285 286 287 288 289 290 291 
##   6   2   1   1   1   1   3   7   2   7   6   7   2   6   2   6   6   7   7   7 
## 292 293 294 295 296 297 298 299 300 301 303 304 305 306 307 308 309 310 311 312 
##   7   6   7   1   3   5   6   6   5   7   5   4   1   1   2   1   3   1   1   6 
## 313 314 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 
##   5   6   3   1   3   6   3   7   1   2   1   5   7   6   2   3   5   2   6   7 
## 334 335 336 337 338 339 340 341 342 343 344 346 347 348 349 350 351 352 353 354 
##   7   2   4   5   1   2   1   1   2   3   1   1   1   5   5   1   1   7   1   1 
## 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 
##   6   2   5   4   6   3   7   5   1   2   7   1   7   7   1   6   6   1   4   1 
## 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 393 394 395 
##   7   1   1   1   6   1   1   4   5   7   4   1   1   5   2   1   6   6   1   1 
## 396 397 398 399 400 401 402 405 406 407 408 409 410 411 412 413 414 415 416 417 
##   1   6   4   1   3   6   1   1   7   2   5   7   1   7   5   1   7   4   6   7 
## 418 419 420 421 422 423 424 426 427 428 429 430 431 432 433 434 435 436 437 438 
##   1   6   4   1   4   1   6   7   1   2   1   6   6   6   1   1   3   6   6   1 
## 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 
##   1   7   2   2   1   1   6   5   6   5   6   4   6   2   7   7   7   2   5   6 
## 459 460 461 462 463 464 465 466 467 468 469 470 471 473 474 475 476 477 479 480 
##   1   4   4   2   6   4   5   7   1   6   7   6   4   7   1   4   2   2   4   3 
## 481 482 483 484 485 486 487 488 489 491 492 493 494 495 496 497 498 499 500 501 
##   6   4   1   3   7   4   5   3   5   1   4   7   6   1   4   6   4   3   7   1 
## 502 503 504 505 506 509 510 511 512 513 514 515 516 517 518 520 521 522 523 524 
##   2   5   4   1   4   4   2   2   6   1   7   6   5   7   5   4   6   5   6   7 
## 525 526 527 528 529 530 531 532 533 534 535 536 538 539 540 541 542 543 544 545 
##   2   4   1   2   1   4   4   6   4   7   4   6   4   4   1   5   5   4   7   6 
## 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 
##   4   4   6   6   6   1   6   1   2   5   5   6   1   6   5   4   4   7   6   2 
## 566 567 568 569 570 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 
##   6   6   7   1   2   6   4   7   4   6   4   2   4   7   6   4   7   1   5   1 
## 587 589 592 593 594 596 597 598 599 600 601 604 605 607 608 609 610 611 612 613 
##   7   6   1   6   4   4   4   6   5   1   4   1   6   6   1   7   5   6   1   7 
## 614 615 616 618 619 620 621 622 623 624 625 626 627 628 
##   7   7   6   1   7   2   2   4   6   4   5   1   4   1 
## 
## Within cluster sum of squares by cluster:
## [1] 120.46906  99.19833  56.96522 102.07000  89.58418 115.39478 114.08810
##  (between_SS / total_SS =  61.4 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
## 
## ### Total Sum of Squares
Total and Total Within-Cluster Sum of Squares
Total_SS Total_Within_SS
1808.412 697.7697
## 
## ### Cluster Centers
Coordinates of Cluster Centers
x_mean s_mean dir_mean total_motion_pre_snap_mean
-0.3548355 -0.6338623 0.0048608 -0.5801316
0.0665373 -0.3267660 1.3213650 -0.3517538
1.9354316 0.5166192 -0.1562624 -0.0714267
0.2789052 0.6190286 0.5550871 1.9298036
-0.0891167 -0.2496173 -1.3796196 -0.5810969
-0.3094594 -0.4304406 -0.2760014 0.9244285
0.0520743 1.3103285 -0.0031327 -0.2393585

Anova Test

A series of statistical tests were performed to analyze the data. First, a new data frame was created by selecting the Cluster, x_mean, s_mean, and total_motion_pre_snap_mean columns. The Mardia test for multivariate normality was applied to check if the data followed a multivariate normal distribution. Box’s M test was then conducted to assess the homogeneity of covariance matrices across the clusters. A MANOVA was performed to evaluate differences in the means of the selected variables across the clusters, followed by a summary of the MANOVA results to assess the overall significance of the model. Finally, the Wilks’ test was used as a post-hoc test to further evaluate the significance of the cluster differences. These tests provide insights into the relationships between the clusters and the dependent variables, helping to inform subsequent analysis and modeling. The Pillai trace indicates strength of relationship between independent and dependent variables and with a 0.32 value I would say the independent variables only slightly affect the dependent variables in the anova test. The Shapiro Wilks test having a value of 0.68 indicates to use that a sample of our data is mostly normally distributed.

##                     Df  Pillai approx F num Df den Df    Pr(>F)    
## anova_data$Cluster   1 0.32442   94.442      3    590 < 2.2e-16 ***
## Residuals          592                                             
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##                     Df   Wilks approx F num Df den Df    Pr(>F)    
## anova_data$Cluster   1 0.67558   94.442      3    590 < 2.2e-16 ***
## Residuals          592                                             
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Plot of Clusters

Here we created a plot of every play and enclosed each cluster in a convex hull in order to show how they are separated.

Silhouette Distance

The silhouette analysis was conducted to evaluate the quality of the clustering. A distance matrix was first calculated by removing the Cluster, game, and play ID columns from the scaled dataset. The silhouette function was then applied to calculate the silhouette scores, which measure how similar each point is to its own cluster compared to other clusters. The silhouette plot was generated, with distinct colors assigned to each cluster for clarity, and the plot included a vertical red line indicating the average silhouette width, which turned out to be 0.4. The average silhouette width provides insight into the overall quality of the clustering, with higher values indicating better-defined clusters. For this cluster the silhouette score shows a moderate level of cluster separation signaling good clustering. Cluster 1 is the cluster with the most plays having an average silhouette width of 0.5 or higher, to be considered good clustering quality. Cluster 4 has the most plays with the lowest average silhouette score of 0.24.

# Calculate the distance matrix
dist_matrix <- dist(scaled_to_analysis[, c(-4)]) # Remove clust, game, and play ID columns

# Run silhouette analysis
silhouette_score <- silhouette(scaled_to_analysis$Cluster, dist_matrix)

# Plot silhouette scores
plot(silhouette_score, 
     col = rainbow(length(unique(scaled_to_analysis$Cluster))), # Unique colors for each cluster
     border = NA, 
     main = "Silhouette Plot", 
     xlab = "Silhouette Width", 
     ylab = "Cluster", 
     cex.axis = 1.2, # Adjust axis label size
     cex.lab = 1.4,  # Adjust axis title size
     cex.main = 1.6) # Adjust title size

# Add a vertical line at average silhouette width
avg_sil_width <- mean(silhouette_score[, 3])
abline(v = avg_sil_width, col = "red", lty = 2, lwd = 2)

# Add legend to explain the average line
legend("topright", legend = paste("Avg Silhouette Width =", round(avg_sil_width, 3)), 
       col = "red", lty = 2, lwd = 2, bty = "n", cex = 1)  # Adjust legend text size

Histogram Creation

This was a creation of a histogram that would allow us to visualize how clusters perform on average in terms of scaled speed of players motion for a particular play. Cluster 1 possesses the most plays where the mean scaled speed of the player in motion for the given play is between 0 and 2. It is followed by Clusters 2, 3, 4, and 5. Cluster 3 has the most plays in where the mean scaled speed of the player in motion is 3 and it is followed by Clusters 4, 5, and 7. Cluster 4 has the most plays where the mean scaled speed of the player in motion is at the greatest value, and Cluster 7 is the only other. Cluster 7, Cluster 4, and Cluster 3 have the highest average scaled speed of the player in motion for a given play. Then are then followed by Cluster 4, Cluster 2, Cluster 6 and then Cluster 1.

# Stacked histogram of speed, grouped by cluster
ggplot(data_to_analysis, aes(x = (s_mean), fill = as.factor(Cluster))) +
  geom_histogram(binwidth = 1, color = "black", alpha = 0.7, position = "stack") + 
  labs(
    title = "Stacked Histogram of Speed per Cluster",
    x = "Speed",
    y = "Frequency"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5),
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

# Bar plot of mean speed per cluster
ggplot(data_to_analysis, aes(x = factor(Cluster), y = s_mean, fill = factor(Cluster))) +
  geom_bar(stat = "summary", fun = "mean", show.legend = FALSE) +  
  labs(
    title = "Mean Speed per Cluster", 
    x = "Cluster ID",
    y = "Mean Speed"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Average Receiving Yards Chart

We have created a bar chart and histogram that will illustrate the scaled average receiving yards gained within each cluster of plays. Cluster 1 led with almost 500 plays where the scaled mean receiving yards. It was followed by Cluster 2, Cluster 3, Cluster 4, Cluster 5, Cluster 6, and Cluster 7. Cluster 1 also led in plays where the scaled mean of receiving yards is half a yard, as well as 1 yard. The bar chart also reveals that Cluster 6, Cluster 7, and Cluster 2 have the greatest scaled mean of receiving yards. They were followed Cluster 3, Cluster 1, Cluster 4, and then Cluster 5.

ggplot(data_to_analysis, aes(x = factor(Cluster), y = receivingYards_mean, fill = factor(Cluster))) +
  geom_bar(stat = "summary", fun = "mean", show.legend = FALSE) +  
  labs(
    title = "Mean Receiving Yards per Cluster", 
    x = "Cluster ID",
    y = "Mean Rushing Yards",
    fill = "Cluster ID"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggplot(data_to_analysis, aes(x = receivingYards_mean, fill = as.factor(Cluster))) +
  geom_histogram(binwidth = .5, color = "black", alpha = 0.7) + 
  labs(title = "Histogram of Receiving Yards",
       x = "Rushing Yards",
       y = "Frequency",
       fill = "Cluster ID") +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5),
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

Average Rushing Yards Chart

We have created a bar chart that will illustrate the average scaled rushing yards gained within each cluster of plays. Cluster 1 led with plays that have a 0 average of scaled rushing yards. It was followed by Cluster 2, Cluster 3, Cluster 4, Cluster 5, Cluster 6, and Cluster 7. Cluster 1 also led for average scaled rushing yards of half a yard. And all the Clusters followed numerically as they did before. Cluster 1 also led in average scaled rushing yards of 1 or more. It was followed by Cluster 2, Cluster 4, and Cluster 7.

# Bar plot of mean receiving yards per cluster
ggplot(data_to_analysis, aes(x = factor(Cluster), y = rushingYards_mean, fill = factor(Cluster))) +
  geom_bar(stat = "summary", fun = "mean", show.legend = FALSE) +  
  labs(
    title = "Mean Rushing Yards per Cluster", 
    x = "Cluster ID",
    y = "Mean Rushing Yards",
    fill = "Cluster ID"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggplot(data_to_analysis, aes(x = rushingYards_mean, fill = as.factor(Cluster))) +
  geom_histogram(binwidth = .5, color = "black", alpha = 0.7) + 
  labs(title = "Histogram of Rushing Yards",
       x = "Rushing Yards",
       y = "Frequency",
       fill = "Cluster ID") +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5),
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

Total Motion Pre Snap Histogram and Bar Chart Creation

This was a creation of a histogram that would allow us to visualize how clusters perform on scaled average total motion pre snap. Cluster 1 possesses the most plays where the mean total motion of the player in motion for the given play is between 0 and 25. It is followed by Clusters 2, 3, 4, 5, 6, and 7. Cluster 1 has the most plays in where the mean scaled total motion is 50 and it is followed by Clusters 2, 3, 5, 6, and 7. Cluster 2 has the most plays where the mean scaled total motion is 75 and Clusters 3, 5, 6, and 7. follow. Cluster 2, Cluster 4, Cluster 6, and Cluster 7 have plays with means of 100 scaled total motion pre snap. Clusters 4 and 6 have the most plays where the mean scaled total motion pre snap is 125 to 150. Then are then followed by Cluster 4, with a mean scaled total motion pre snap was 175 or more. The bar chart shows that Clusters 4 and 6 have the highest number of plays with large total motion pre snap. They are then followed by Clusters 3, 7, 2, 1, and 5.

# Stacked histogram of Total Motion pre snap, grouped by cluster
ggplot(data_to_analysis, aes(x = total_motion_pre_snap_mean, fill = as.factor(Cluster))) +
  geom_histogram(binwidth = 25, color = "black", alpha = 0.7, position = "stack") + 
  labs(
    title = "Stacked Histogram of Total Motion pre snap per Cluster",
    x = "Mean Total Motion pre snap",
    y = "Frequency"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5),
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

# Bar plot of mean total motion pre snap per cluster
ggplot(data_to_analysis, aes(x = factor(Cluster), y = total_motion_pre_snap_mean, fill = factor(Cluster))) +
  geom_bar(stat = "summary", fun = "mean", show.legend = FALSE) +  
  labs(
    title = "Mean Total Motion pre snap per Cluster", 
    x = "Cluster ID",
    y = "Mean Total Motion"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Shiny

The plot for each cluster for each play are similar to each other visually, meaning that we can accept the clustering even with lower R squared values. We can see that Cluster 1 and Cluster 7 had the lowest values at 0.27 and 0.29 and Clusters 2, 3, 4, 6 saw rises in R square value to be in the 0.35 and 0.39 range in R squared value. Cluster 5 had the greatest R squared value. (https://jmojea.shinyapps.io/Shiny_NFL2025/)