2025-03-24

Dataset

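library(nflreadr)  # provides load_combine()
library(dplyr)     # provides %>%, filter(), mutate(), group_by(), summarize()

# The NFL Scouting Combine is an annual event where prospective NFL players
# are evaluated by NFL scouts.
scoutingCombine = load_combine()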

# Tidying up our position categories.
# Mostly just clumping into logical groups, but also accounting for 
# year-over-year discrepancies in how the Combine groups positions. 
scoutingCombine = scoutingCombine %>%
  filter(!pos %in% c("P", "K", "LS")) %>%
  mutate(pos = ifelse(pos %in% c("C", "G", "OG", "OT"), "OL", pos)) %>%
  mutate(pos = ifelse(pos %in% c("ILB", "OLB"), "LB", pos)) %>%
  mutate(pos = ifelse(pos %in% c("EDGE", "DE", "DT"), "DL", pos)) %>%
  mutate(pos = ifelse(pos %in% c("CB", "CB/WR", "S", "SAF"), "DB", pos)) %>%
  mutate(pos = ifelse(pos == "FB", "RB", pos))

Linear Regression

Simple linear regression is a statistical model used to describe a linear relationship between a dependent variable and a single independent variable. The formula for simple linear regression is as follows: \[ y = b_0 + b_1x +\epsilon \] where \(y\) is the dependent variable, \(x\) is the independent variable, \(b_0\) is the y-intercept, \(b_1\) is the slope of the resulting line, and \(\epsilon\) is the error, or the difference between the observed value of the dependent variable and the value predicted by the line.
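To make the formula concrete, here is a small sketch that computes \(b_0\) and \(b_1\) by hand with the usual least-squares estimates and compares them to what lm returns. The x and y values are made up for illustration and have nothing to do with the Combine data.

```r
# Toy data, invented for illustration only (not Combine data)
x = c(1, 2, 3, 4, 5)
y = c(2.1, 3.9, 6.2, 7.8, 10.1)

# Least-squares estimates for y = b0 + b1 * x + e
b1 = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 = mean(y) - b1 * mean(x)

# The residuals are our estimate of the error term
residualsByHand = y - (b0 + b1 * x)

# lm should report the same intercept and slope
fit = lm(y ~ x)
coef(fit)
```

The slope is the covariance of x and y scaled by the variance of x, and the intercept is whatever makes the line pass through the point of means.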

Linear Regression in R

In R, linear regression is performed using the lm function. In our case, we want to model the average vertical jump of NFL prospects by year, as provided by our scoutingCombine dataframe.

meanVert = scoutingCombine %>%
  group_by(season) %>%
  summarize(mean_vert = mean(vertical, na.rm = TRUE))
model = lm(mean_vert ~ season, meanVert)

Model Summary

We can view a summary of our model by calling summary(model):

## 
## Call:
## lm(formula = mean_vert ~ season, data = meanVert)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3.11239 -0.42404  0.00729  0.64777  1.30900 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  4.24493   48.67533   0.087    0.931
## season       0.01431    0.02419   0.591    0.560
## 
## Residual standard error: 0.9249 on 24 degrees of freedom
## Multiple R-squared:  0.01437,    Adjusted R-squared:  -0.0267 
## F-statistic: 0.3498 on 1 and 24 DF,  p-value: 0.5597
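If we want the individual numbers rather than the printed table, the object returned by summary can be picked apart. The sketch below uses a stand-in meanVert with invented values so that it runs on its own; the 2000 to 2025 season range is an assumption chosen to give 26 rows, which matches the 24 residual degrees of freedom above.

```r
# Stand-in for meanVert: invented numbers, same column names as above.
# The season range is an assumption that gives 26 rows (24 residual df).
set.seed(1)
meanVert = data.frame(
  season = 2000:2025,
  mean_vert = 33 + rnorm(26, sd = 0.9)
)
model = lm(mean_vert ~ season, meanVert)

modelSummary = summary(model)  # printing this object gives the table above
modelSummary$coefficients      # estimates, std. errors, t values, p-values
modelSummary$r.squared         # "Multiple R-squared"
modelSummary$adj.r.squared     # "Adjusted R-squared"
modelSummary$df[2]             # residual degrees of freedom
```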

R-Squared

The “adjusted R-squared” value returned by our model in the last slide was -0.0267. If this were positive (say, 0.0267), we could attribute about 2.67% of the variation in mean vertical jump to changes in year. Since it is negative, season explains essentially none of that variation once we account for the size of our sample, and it is safe to say that our linear model as built does not suit our data. The slope's p-value of 0.560 tells the same story. If we were to use this model to extrapolate values outside of the seasons we’ve been provided (i.e. earlier than 2000 or later than 2025), we could basically not trust the results at all.

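The formula for (unadjusted) R-Squared is as follows: \[ R^2 = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2} \] where \(y_i\) are the observed values, \(\hat{y}_i\) are the values predicted by the model, and \(\bar{y}\) is the mean of the observed values. The adjusted version additionally penalizes the model for the number of predictors it uses relative to the number of observations, which is how it can dip below zero.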
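As a sanity check, here is a sketch of the R-squared formula in code, along with the standard adjustment \(1 - (1 - R^2)\frac{n - 1}{n - p - 1}\) for \(n\) observations and \(p\) predictors. The toy data and the adjR2 helper are made up for illustration. The last line plugs in the values reported by our model summary: R-squared of 0.01437, one predictor, and 24 residual degrees of freedom, which implies 26 observations.

```r
# Toy data, invented for illustration only
x = c(1, 2, 3, 4, 5, 6)
y = c(3.1, 2.7, 3.6, 3.0, 3.4, 2.9)
fit = lm(y ~ x)

# R-squared straight from the formula
ssRes = sum((y - fitted(fit))^2)  # sum of squared residuals
ssTot = sum((y - mean(y))^2)      # total sum of squares
r2 = 1 - ssRes / ssTot

# Adjusted R-squared penalizes for the number of predictors p
adjR2 = function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Plugging in the values from our model summary
adjR2(0.01437, n = 26, p = 1)  # about -0.0267
```

With an R-squared this close to zero, the penalty term is enough to push the adjusted value negative.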

Visualizing a Bad Linear Model

We can plot our data and rather easily see why it might not be a good fit for a linear model.

## `geom_smooth()` using formula = 'y ~ x'
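The plotting code is not shown on this slide. The message above is what ggplot2 prints when geom_smooth is given a method but no formula, so a plot along these lines would be consistent with it. This is only a sketch: the data is an invented stand-in for meanVert, and the axis labels are my own choice.

```r
library(ggplot2)

# Stand-in for meanVert: invented numbers, same column names as above
set.seed(1)
meanVert = data.frame(
  season = 2000:2025,
  mean_vert = 33 + rnorm(26, sd = 0.9)
)

# One point per season, plus the fitted regression line and its confidence band
vertPlot = ggplot(meanVert, aes(x = season, y = mean_vert)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(x = "Season", y = "Mean vertical jump")
vertPlot
```

A nearly flat line with a wide band around it is the visual counterpart of the tiny R-squared.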

Vertical over Time by Position

Just out of curiosity, I also want to see the changes in vertical jump over time separated by position.
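One way to build that view is to add pos to the grouping and map it to line color. The sketch below assumes that approach and uses an invented stand-in for scoutingCombine with only the three columns we need. The name meanVertByPos is hypothetical.

```r
library(dplyr)
library(ggplot2)

# Stand-in for scoutingCombine: invented rows with the columns we need
set.seed(2)
positions = c("QB", "RB", "WR", "TE", "OL", "DL", "LB", "DB")
scoutingCombine = data.frame(
  season = rep(2000:2025, each = 30),
  pos = sample(positions, 26 * 30, replace = TRUE),
  vertical = rnorm(26 * 30, mean = 33, sd = 3)
)

# Mean vertical per season and position
meanVertByPos = scoutingCombine %>%
  group_by(season, pos) %>%
  summarize(mean_vert = mean(vertical, na.rm = TRUE), .groups = "drop")

# One line per position
posPlot = ggplot(meanVertByPos, aes(x = season, y = mean_vert, color = pos)) +
  geom_line()
posPlot
```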

Making a Plot Interactive

That last plot was kind of hard to handle since there were so many lines. By using plotly to make the plot interactive, we can show or hide different position lines using the legend.
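A minimal sketch of the conversion, assuming the static plot was built with ggplot2: plotly's ggplotly function takes a ggplot object and returns an interactive version. The data here is an invented stand-in, and the object names are hypothetical.

```r
library(ggplot2)
library(plotly)

# Stand-in for the per-position means: invented numbers
set.seed(3)
meanVertByPos = expand.grid(season = 2000:2025, pos = c("QB", "WR", "OL", "DB"))
meanVertByPos$mean_vert = 33 + rnorm(nrow(meanVertByPos))

# Static plot first, then hand it to ggplotly
posPlot = ggplot(meanVertByPos, aes(x = season, y = mean_vert, color = pos)) +
  geom_line()
interactivePlot = ggplotly(posPlot)
interactivePlot
```

In the rendered plot, clicking a legend entry once toggles that position's line, and double-clicking isolates it.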