This project’s data set is the overall results from the 2022 Canadian Football League (CFL) Combine. The purpose of this project is to faithfully represent the eight-dimensional data set in two dimensions while preserving as much of the variance (that is, the differences in athletic performance) as possible and keeping the process easy for non-statisticians to understand. This is achieved via principal component analysis (PCA), which reveals interesting patterns in player position with respect to athletic performance as measured by Combine results.
The CFL Combine is a series of athletic tests used to scout potential athletes for the league’s nine franchises (i.e., teams). The teams use the results of the Combine during the CFL’s Draft, where they select new team members from a pool of eligible athletes. The data set for this project is the overall results from the 2022 CFL National Combine, truncated to remove athletes who did not participate in every event. Of the 53 athletes reported in the overall results, 43 completed every event.
The data set consists of nine variables of interest: one categorical and eight quantitative.
The first variable of interest is the categorical variable player position. We omit this variable from the calculations, but it plays an important role in the construction of the final biplot, where it allows us to see the resulting patterns.
The second and third variables of interest are height and weight. Together these variables form the concept of size.
The fourth variable of interest is the number of completed bench press repetitions of 225 pounds. This measures an athlete’s upper body strength.
The fifth and sixth variables of interest are the jumping tests. The vertical jump height and the broad (standing long) jump distance measure an athlete’s potential for explosive movement.
The seventh, eighth, and ninth variables of interest are the timed running drills. The forty-yard dash measures an athlete’s forward running acceleration, the three-cone drill measures an athlete’s ability to change direction and accelerate, and the twenty-yard short shuttle run measures an athlete’s ability to accelerate laterally. The jumping tests and timed running drills together form the concept of agility.
The data set is available at https://www.cfl.ca/combine/2022-cfl-national-combine/.
Together, the eight quantitative variables act as measurable proxies for the intuitive (and, admittedly, nebulous) athletic qualities of size, strength, and agility.
The purpose of this project is to summarize the data set in a two-dimensional figure that allows for the discovery of patterns or trends among the observations that would not be apparent from a table of values. We judge the success of the project by how much of the variance the two-dimensional summary preserves.
library(DT)
datatable(CFL_2022_Combine_No_Missing_Entries)
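Standardizing (A.K.A. normalizing) the data centers each variable at zero and rescales it to unit standard deviation. This preserves the relative differences between athletes while removing the influence of measurement units, which allows us to perform meaningful mathematics across the variables. For instance, weight times height in inches results in a much larger number than weight times height in feet, even though the athletes are the same. When we standardize the data we eliminate these issues.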
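The unit-independence described above can be checked on a small example. This is only a sketch: the three heights are hypothetical values chosen for illustration, not Combine results.

```r
# Hypothetical toy data: three athletes, height recorded in inches and in feet
height_in <- c(70, 74, 78)
height_ft <- height_in / 12

# scale() centers each column to mean 0 and rescales it to standard deviation 1
z_in <- as.vector(scale(height_in))
z_ft <- as.vector(scale(height_ft))

# The standardized values no longer depend on the unit of measurement
all.equal(z_in, z_ft)
mean(z_in)
sd(z_in)
```

Because both versions standardize to the same values, any later calculation on the standardized data is unaffected by the choice of units.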
## Turn the dataframe into matrix so we can use the values
CFL_mat <- data.matrix(CFL_2022_Combine_No_Missing_Entries)
## Standardize the entries of the matrix and truncate out Name, Pos, and Draft Round
## Pick #
CFL_mat_norm <- scale(CFL_mat[, 3:10])
## Create the covariance matrix
Sigma <- var(CFL_mat_norm, na.rm = TRUE)
## Create the correlation matrix
R <- cor(CFL_mat_norm, use = "pairwise.complete.obs") ##this handles NAs the same as ggpairs
## Create explanatory vectors for visualization
height_standardized <- CFL_mat_norm[, 1]
weight_standardized <- CFL_mat_norm[, 2]
bench_standardized <- CFL_mat_norm[, 3]
forty_standardized <- CFL_mat_norm[, 4]
vertical_standardized <- CFL_mat_norm[, 5]
broad_standardized <- CFL_mat_norm[, 6]
cones_standardized <- CFL_mat_norm[, 7]
shuttle_standardized <- CFL_mat_norm[, 8]
### create a response vector for regression
response <- CFL_2022_Combine_No_Missing_Entries$Draft_Round
## Are there any outliers?
all(CFL_mat_norm < 3)
## [1] TRUE
all(CFL_mat_norm > (-3))
## [1] TRUE
We begin by creating pairs plots to determine if there are any highly correlated variables. We can immediately see correlations where we would expect them, between height and weight, and between running tests. We also see inverse correlations where we would expect them, between weight and broad jump length (heavier athletes don’t jump as far as lighter athletes).
## Create a pairs plot for 1st idea of visualization
pairs(CFL_mat_norm)
A heat map makes it easier for us to see these correlations.
## Create a Correlation Heat Map
library(corrplot)
## corrplot 0.92 loaded
col <- colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))
corrplot(R, method = "shade", bg = "grey", shade.col = NA, shade.lwd = 1, tl.col = "black",
tl.srt = 45, tl.cex = 0.9, col = col(200), addCoef.col = "black", cl.pos = "n", order = "AOE",
type = "lower", addshade = "all", title = "Correlations for the CFL Combine Tests", mar = c(0,
0, 1, 0)) # title position fix from http://stackoverflow.com/a/14754408/54964)
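As a complement to reading the heat map by eye, the correlation pairs can also be ranked programmatically. The sketch below uses simulated stand-in data (the object names `toy`, `R_toy`, and `pairs_ranked` are hypothetical, and the values are not real Combine results); the same steps should apply to the matrix `R` computed earlier.

```r
# Simulated stand-in for the Combine data (hypothetical values, not real results)
set.seed(1)
n <- 40
size <- rnorm(n)
speed <- rnorm(n)
toy <- data.frame(
  Height = size + rnorm(n, sd = 0.5),
  Weight = size + rnorm(n, sd = 0.5),
  Forty = speed + rnorm(n, sd = 0.5),
  Shuttle = speed + rnorm(n, sd = 0.5)
)
R_toy <- cor(toy)

# Keep each pair once by taking the strict lower triangle
idx <- which(lower.tri(R_toy), arr.ind = TRUE)
pairs_ranked <- data.frame(
  var1 = rownames(R_toy)[idx[, "row"]],
  var2 = colnames(R_toy)[idx[, "col"]],
  r = R_toy[idx]
)

# Sort from strongest to weakest absolute correlation
pairs_ranked <- pairs_ranked[order(-abs(pairs_ranked$r)), ]
pairs_ranked
```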
Also, we can see multicollinearity among the three running tests.
##Show multicollinearity between standardized variables
library(rgl)
# Function to interleave the elements of two vectors
interleave <- function(v1, v2) as.vector(rbind(v1,v2))
## Multicollinearity between agility tests
plot3d(forty_standardized,
shuttle_standardized,
cones_standardized,
type = "s",
size = 0.75,
lit = FALSE,
xlab = "",
ylab = "",
zlab = "",
axes = FALSE)
segments3d(interleave(forty_standardized,
forty_standardized),
interleave(shuttle_standardized,
shuttle_standardized),
interleave(cones_standardized,
min(cones_standardized,na.rm = TRUE)),
alpha = 0.4,
col = "blue")
# Draw the box.
rgl.bbox(color = "grey50", # grey50 surface and black text
emission = "grey50", # emission color is grey50
xlen = 0, ylen = 0, zlen = 0) # Don't add tick marks
# Set default color of future objects to black
rgl.material(color = "black")
# Add axes to specific sides. Possible values are "x--", "x-+", "x+-", and "x++".
axes3d(edges = c("x--", "y+-", "z--"),
ntick = 6, # Attempt 6 tick marks on each side
cex = 1.5) # Larger font
# Add axis labels. 'line' specifies how far to set the label from the axis.
mtext3d("Forty", edge = "x--", line = 6, cex = 2)
mtext3d("Shuttle", edge = "y+-", line = 7, cex = 2)
mtext3d("Cones", edge = "z--", line = 10, cex = 2)
view <- structure(c(0.730821192264557, -0.21270352602005, 0.648579120635986,
0, 0.682410538196564, 0.207353606820107, -0.700940430164337,
0, 0.014607222750783, 0.954860508441925, 0.296689987182617, 0,
0, 0, 0, 1), .Dim = c(4L, 4L))##this positions the view of the 3d plot
par3d(userMatrix = view)
##play3d(spin3d())
rglwidget()
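Multicollinearity can also be quantified numerically rather than only inspected in a 3D plot. One common measure is the variance inflation factor (VIF), which for standardized variables equals the corresponding diagonal entry of the inverse correlation matrix. The sketch below uses simulated data in which three hypothetical running tests share a common "speed" factor and a fourth variable is independent; it is an illustration of the idea, not part of the original analysis.

```r
# Simulated data: three running tests driven by one shared factor,
# plus an unrelated bench press variable (hypothetical values)
set.seed(2)
n <- 200
speed <- rnorm(n)
toy <- cbind(
  Forty = speed + rnorm(n, sd = 0.3),
  Shuttle = speed + rnorm(n, sd = 0.3),
  Cones = speed + rnorm(n, sd = 0.3),
  Bench = rnorm(n)
)
R_toy <- cor(toy)

# The diagonal of the inverse correlation matrix gives each variable's VIF
vif <- setNames(diag(solve(R_toy)), colnames(R_toy))
round(vif, 2)
```

A VIF near 1 indicates a variable that is nearly uncorrelated with the others, while larger values indicate that the variable is largely predictable from the rest.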
The role of this principal component analysis will be transforming the eight quantitative variables from the CFL Combine into two principal components (PCs) that form a plane onto which we can plot the data points, while preserving as much of the variance (difference in athletic ability) as possible.
We can use a scree plot to see that more than 80% of the variance is explained by the first two principal components. That means that we can preserve over 80% of the difference in athletic ability while simplifying the multi-dimensional data down to just two dimensions.
## change timed tests to negative so that arrows and positive loadings indicate the
## direction of superior performance
CFL_2022_Combine_No_Missing_Entries$Forty <- CFL_2022_Combine_No_Missing_Entries$Forty * (-1)
CFL_2022_Combine_No_Missing_Entries$Shuttle <- CFL_2022_Combine_No_Missing_Entries$Shuttle *
(-1)
CFL_2022_Combine_No_Missing_Entries$Cones <- CFL_2022_Combine_No_Missing_Entries$Cones * (-1)
## CFL_Data <- data.matrix(CFL_2022_Combine)
library(stats)
library(FactoMineR)
library(ggplot2)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
pca <- prcomp(na.pass(data.matrix(CFL_2022_Combine_No_Missing_Entries)[, 3:10]), scale. = TRUE)
pca_summary <- summary(pca)
percent_of_pc_1 <- pca_summary$importance[2, 1]
percent_of_pc_2 <- pca_summary$importance[2, 2]
fviz_eig(pca, addlabels = TRUE, main = "Scree Plot: CFL 2022 Combine Principal Components")
Given that we explain nearly 83% of the variance with two PCs, we can construct a biplot using the first PC as the horizontal axis, the second PC as the vertical axis, and an arrow from the origin for each variable, representing an axis onto which the data points can be projected.
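The percentages shown in a scree plot come from the eigenvalues of the correlation matrix. A minimal sketch on simulated data (the object names and values are hypothetical) shows that these eigenvalues match the variances reported by `prcomp()` when the data are scaled, and how the cumulative proportion is obtained from them.

```r
# Simulated stand-in data with two underlying factors (hypothetical values)
set.seed(3)
n <- 50
size <- rnorm(n)
speed <- rnorm(n)
toy <- cbind(
  Height = size + rnorm(n, sd = 0.5),
  Weight = size + rnorm(n, sd = 0.5),
  Forty = speed + rnorm(n, sd = 0.5),
  Shuttle = speed + rnorm(n, sd = 0.5)
)
pca_toy <- prcomp(toy, scale. = TRUE)

# Eigenvalues of the correlation matrix, returned in decreasing order
eig <- eigen(cor(toy))$values

# They agree with the squared standard deviations of the principal components
all.equal(eig, pca_toy$sdev^2)

# Proportion of variance per component, and the running total
prop <- eig / sum(eig)
cumsum(prop)
```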
fviz_pca_biplot(pca, label = "var", habillage = CFL_2022_Combine_No_Missing_Entries$Pos, repel = TRUE,
col.var = "slategrey", title = "title", geom = "point") + labs(title = "Biplot: 2022 CFL Combine Athlete Performance",
subtitle = "") + xlab(paste("PC1 (", percent_of_pc_1 * 100, "%)")) + ylab(paste("PC2 (",
percent_of_pc_2 * 100, "%)")) + scale_shape_manual(values = c(15, 16, 17, 18, 13, 10, 1,
6)) + theme(panel.background = element_rect(fill = "snow"))
## Scale for shape is already present.
## Adding another scale for shape, which will replace the existing scale.
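To make concrete what a two-component biplot keeps and discards, the standardized data can be approximately rebuilt from the first two PC scores and loadings. The sketch below uses simulated data (all names and values are hypothetical) and shows that the share of variation lost in this rank-two reconstruction is exactly the share not explained by the first two PCs. Plotting functions may rescale the arrows for display, so the loadings here indicate directions rather than plotted arrow lengths.

```r
# Simulated stand-in data with two underlying factors (hypothetical values)
set.seed(5)
n <- 50
size <- rnorm(n)
speed <- rnorm(n)
toy <- cbind(
  Height = size + rnorm(n, sd = 0.5),
  Weight = size + rnorm(n, sd = 0.5),
  Forty = speed + rnorm(n, sd = 0.5),
  Shuttle = speed + rnorm(n, sd = 0.5)
)
Z <- scale(toy)
pca_toy <- prcomp(toy, scale. = TRUE)

# Point coordinates (scores) and variable directions (loadings) on PC1 and PC2
scores2 <- pca_toy$x[, 1:2]
loadings2 <- pca_toy$rotation[, 1:2]

# Rank-two reconstruction of the standardized data from the biplot quantities
Z_hat <- scores2 %*% t(loadings2)

# Fraction of total variation lost versus fraction kept by two PCs
lost <- sum((Z - Z_hat)^2) / sum(Z^2)
kept <- sum(pca_toy$sdev[1:2]^2) / sum(pca_toy$sdev^2)
c(kept = kept, lost = lost)
```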
An interesting pattern emerges when we group the data points by player position. We see that positions tend to cluster around correlated variables. The points that project the most onto height and weight are almost exclusively players on the offensive line. This makes sense, since that position’s role is to protect the quarterback from the opposing team’s defensive line by forming a wall (pocket) around him. The points that project the most onto the bench press are players on the defensive line. This also makes sense, since that position’s role is to create openings in the opposing team’s offensive line, which as previously noted is made up of the largest players on the field, and to block the offence from running the ball downfield. Finally, the points that project the most onto the running and jumping drills are the defensive backs. This makes sense as well, since that position’s role is to provide downfield coverage of the opposing team’s receivers and potentially jump to catch an interception.
Thus, we see that the 2022 CFL Combine results can be faithfully summarized in a two-dimensional plot that is both intuitive and retains nearly 83% of the variance in athletic performance, and the same approach should apply to the results of other years. This suggests that the project is a success.
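The clustering by position can be summarized numerically by averaging the PC scores within each position. The following is a minimal sketch on simulated data: the three position labels, the group means, and the variable names are assumptions made for illustration and do not reproduce the Combine results.

```r
# Simulated data: three hypothetical position groups of ten athletes each,
# with group-specific average performance (values are illustrative only)
set.seed(4)
pos <- rep(c("OL", "DL", "DB"), each = 10)
toy <- data.frame(
  Weight = rnorm(30, mean = c(OL = 2, DL = 1, DB = -1)[pos]),
  Bench = rnorm(30, mean = c(OL = 1, DL = 2, DB = -1)[pos]),
  Forty = rnorm(30, mean = c(OL = -2, DL = 0, DB = 2)[pos])
)
pca_toy <- prcomp(toy, scale. = TRUE)

# Average PC1 and PC2 score per position: one centroid per group
centroids <- aggregate(pca_toy$x[, 1:2], by = list(Pos = pos), FUN = mean)
centroids
```

Centroids that sit far apart on the PC1/PC2 plane correspond to positions that separate clearly in the biplot, while centroids close together correspond to positions with similar athletic profiles.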