title: "Breakdown and Analysis of NFL Plays" author: "Andreas Collet" format: revealjs:
incremental: true embed-resources: true scrollable: true slide-number: true smaller: true footer: Breakdown and Analysis of NFL Plays ---

{r, message = FALSE, warning = FALSE} library(dplyr) library(tibble) library(ggplot2) library(tidyr) library(here) library(gt)

```{r} mypalette <- c("navyblue", "orange", "darkgreen", "maroon", "turquoise", "darkviolet", "grey") palette(mypalette)

options(ggplot2.discrete.fill = mypalette, ggplot2.discrete.color = mypalette) ```

{r} nfl <- read.csv("NFL.csv") #Note for self: On school computers, do ("C:/Users/amcollet/OneDrive - University of Iowa/BIOS_4510_Project/NFL.csv")

NFL Dataset

::: {.incremental}

-Dataset covers every NFL play from 2009–2019

450,000 plays
Includes:
- Down & distance
- Player-level stats
- EPA & WPA
- Play direction and type

:::

Goals

::: {.incremental}

Football is ultimately about scoring points.

This analysis examines:

Play direction
Down & distance
Quarter
Play type
Goal: identify patterns associated with offensive success

:::

Tidy Data

Dataset is mostly tidy as is, but there are some redundancies:

time, halfsecondsremaining, quartersecondsremaining

{r, echo=TRUE} nfl_tidy1 <- nfl %>% #weird variable name because it's not fully tidy yet select(-c(time,half_seconds_remaining,quarter_seconds_remaining))

Assist Tackle Indicator Variables

{r, echo=TRUE} nfl_tidy1 %>% drop_na() %>% distinct(assist_tackle_3_player_id)

There are no instances of this column being filled, so it is unnecessary

{r, echo=TRUE} nfl_tidy <- nfl_tidy1 %>% select(-(assist_tackle_3_player_id:assist_tackle_4_player_name))

Field Direction

{r, message = FALSE} options(scipen = 999) #Our final variable defaults to scientific notation, and I think it is more easy to comprehend if we keep it in normal numerics directions_table <- nfl_tidy %>% filter((play_type == "run" | play_type == "pass") & !(!is.na(pass_location) == !is.na(run_location))) %>% #This looks like a mess, but I'm filtering out the 4 or 5 plays where there is a pass_location and run_location, caused by laterals, which should be disregarded in finding general patterns due to their outstandingness. Furthermore, I needed to drop the 15000 or so variables with play_type pass or run which did not have a location, most of which were sacks. group_by(pass_location, run_location) %>% summarise(number = n(), mean_gain = mean(yards_gained, na.rm = TRUE), first_down_odds = (sum(first_down_rush) + sum(first_down_pass) + sum(first_down_penalty)) / number, touchdowns = sum(!is.na(td_team)), touchdown_odds = touchdowns / number, turnovers = (sum(interception) + sum(fumble_lost)), turnover_odds = turnovers / number, expected_points_added = sum(epa, na.rm = TRUE) / number, win_probability_added = sum(wpa, na.rm = TRUE) / number) %>% ungroup()

{r} gt(directions_table) %>% fmt_percent(columns = c("touchdown_odds", "turnover_odds", "win_probability_added", "first_down_odds")) %>% fmt_integer(columns = c("number", "touchdowns", "turnovers")) %>% fmt_number(columns = c("mean_gain", "expected_points_added"), decimals = 3) %>% tab_row_group(label = "Pass", rows = 1:3) %>% tab_row_group(label = "Run", rows = 4:6) %>% summary_rows(columns = c(number, touchdowns, turnovers), fns = list(fn = "sum", label = "Total"), fmt = ~ fmt_integer(.)) %>% summary_rows(columns = c("mean_gain", "expected_points_added"), fns = list(fn = "mean", label = "Average"), fmt = ~ fmt_number(., decimals = 3)) %>% summary_rows(columns = c("touchdown_odds", "turnover_odds", "win_probability_added", "first_down_odds"), fns = list(fn = "mean", label = "Average"), fmt = ~ fmt_percent(.)) %>% tab_header(title = "Summary of Play Success by Play Type and Direction")

Field Direction: Key Observations

Runs up the middle:
- Least common
- Lowest average gain
Passes to the right:
- Most common
- Lowest average gain
Passes over the middle:
- Highest EPA
- Highest turnover risk

Field Direction: Runs up the Middle

{r} nfl_tidy %>% filter(play_type == "run") %>% drop_na(run_location) %>% group_by(run_location) %>% summarize(number = n(), mean_gain = mean(yards_gained, na.rm = TRUE), sd_gain = sd(yards_gained, na.rm = TRUE)) %>% ungroup()

Runs up the middle are low-risk, low-reward. Lower mean, but also a lower standard deviation
More reliable for a short gain, useful for 3rd/4th down and short situations
Higher turnover rate despite this

Field Direction: Passes to the Right

{r} nfl_tidy %>% filter(play_type == "pass") %>% drop_na(pass_location, yards_gained) %>% group_by(pass_location) %>% summarize(number = n(), mean_air_yards = mean(air_yards, na.rm = TRUE), completion_rate = sum(!is.na(yards_after_catch))/number, mean_yards_after_catch = sum(yards_after_catch, na.rm = TRUE) / sum(!is.na(yards_after_catch))) %>% ungroup()

Passes to the right travel the least, but surprisingly have the lowest completion rate
They also have a lower average Yards After Catch, so my screen pass theory is wrong
Right side is the strong side, which means more nearby defenders to disrupt the play

Field Direction and Down

{r, message = FALSE} options(scipen = 999) directions_downs_table <- nfl_tidy %>% filter((play_type == "run" | play_type == "pass") & !(!is.na(pass_location) == !is.na(run_location))) %>% group_by(pass_location, run_location, down) %>% summarise(number = n(), mean_gain = mean(yards_gained, na.rm = TRUE), first_down_odds = (sum(first_down_rush) + sum(first_down_pass) + sum(first_down_penalty)) / number, touchdowns = sum(!is.na(td_team)), touchdown_odds = touchdowns / number, turnovers = (sum(interception) + sum(fumble_lost)), turnover_odds = turnovers / number, expected_points_added = sum(epa, na.rm = TRUE) / number, win_probability_added = sum(wpa, na.rm = TRUE) / number) %>% ungroup()

{r} gt(directions_downs_table) %>% fmt_percent(columns = c("touchdown_odds", "turnover_odds", "win_probability_added", "first_down_odds")) %>% fmt_integer(columns = c("number", "touchdowns", "turnovers")) %>% fmt_number(columns = c("mean_gain", "expected_points_added"), decimals = 3) %>% tab_row_group(label = "Pass", rows = 1:12) %>% tab_row_group(label = "Run", rows = 13:24) %>% summary_rows(columns = c(number, touchdowns, turnovers), fns = list(fn = "sum", label = "Total"), fmt = ~ fmt_integer(.)) %>% summary_rows(columns = c("mean_gain", "expected_points_added"), fns = list(fn = "mean", label = "Average"), fmt = ~ fmt_number(., decimals = 3)) %>% summary_rows(columns = c("touchdown_odds", "turnover_odds", "win_probability_added", "first_down_odds"), fns = list(fn = "mean", label = "Average"), fmt = ~ fmt_percent(.)) %>% tab_header(title = "Summary of Play Success by Play Type, Down, and Direction")

Field Direction and Down: Key Observations

Running up the middle on 3rd down does not have any better odds of first down despite its popularity
Every subsequent down increases odds of both Touchdown and Turnover
Teams do far more conservative play calls on 1st and 2nd down
Passes up the middle are the most effective, yet the rarest

Scoring by Quarter

In the NFL, there is a well documented increase in scoring in the 2nd and 4th quarters.

The first and third quarters always begin with a team at the very beginning of their possession

The urgency to score picks up at the end of halves.

Firstly, I need to turn the qtr variable into a factor.

{r, echo = TRUE} nfl %>% pull(qtr) %>% class()

{r, echo = TRUE} nfl_factors <- nfl %>% mutate(quarter = factor(qtr, levels = c("1", "2", "3", "4", "5"), labels = c("1", "2", "3", "4", "OT")))

Now we can make a simple bar chart to see how scores vary across quarter.

{r, message = FALSE} nfl_factors %>% group_by(game_id, quarter) %>% summarise(points = 6*sum(touchdown, na.rm = TRUE) + 2*sum(safety, na.rm = TRUE) + 3*sum(field_goal_result == "good", na.rm = TRUE) + sum(extra_point_result == "made", na.rm = TRUE) + 2*sum(two_point_conv_result == "success", na.rm = TRUE)) %>% group_by(quarter) %>% summarise(mean_points = mean(points)) %>% ggplot(aes(x = quarter, y = mean_points, fill = quarter)) + geom_col() + labs(title = "Average Points per Quarter Across Games", x = "Quarter", y = "Average Points") + theme_light()

Second and Fourth Quarters have nearly 40% more scoring
Next, we're going to see if this is a result of increased efficiency or increased pace

```{r} nflfactors %>% dropna(firstdownrush) %>% #columns where firstdownrush is NA are the same columns for all the others, so I just need to filter one of them mutate(success = as.integer(firstdownrush == 1 | firstdownpass == 1 | firstdownpenalty == 1 | touchdown == 1)) %>% groupby(quarter) %>% summarise(successrate = mean(success)) %>% ggplot(aes(x = quarter, y = successrate, fill = quarter)) + geomcol() + labs(title = "First Down Conversion Rate Across Quarters", x = "Quarter", y = "Success Rate") + theme_light()

```

First and Third Quarters actually have a higher conversion rate
Overtime, the most urgent period, is significantly lower
Offensive efficiency does not pick up, rather pace picks up, at the expense of efficiency

```{r} nflfactors %>% groupby(quarter) %>% summarise(averageepa = mean(epa, na.rm = TRUE)) %>% ggplot(aes(x = quarter, y = averageepa, fill = quarter)) + geomcol() + labs(title = "Expected Points Added Per Play Across Quarters", x = "Quarter", y = "EPA") + themelight()

```

This tells the same story. The next graph will check the rate of "good plays" (EPA > 2) and "bad plays" (EPA < -2)

```{r} nflfactors %>% dropna(epa) %>% mutate(goodplay = as.integer(epa > 2), badplay = as.integer(epa < -2)) %>% groupby(quarter) %>% summarise(number = n(), goodplayrate = sum(goodplay) / number, badplayrate = sum(badplay) / number) %>% pivotlonger(cols = c(goodplayrate, badplayrate), namesto = "playquality", valuesto = "rate") %>% ggplot(aes(x = quarter, y = rate, fill = playquality)) + geomcol(position = "dodge") + scaleycontinuous(labels = scales::percentformat()) + labs(x = "Quarter", y = "Rate", title = "Good vs Bad Play Rates by Quarter") + theme_light()

```

One would expect more urgent periods to have an increase in good plays and a bigger increase in bad plays
Instead, we just see a downtick in good plays
The winning team plays less aggressively, driving down the average for both good and bad plays

{r, message = FALSE} nfl_tidy %>% filter(down == 4, qtr %in% 1:4, fourth_down_converted == 1 | fourth_down_failed == 1) %>% #The second filter condition is to exclude OT because it has a small sample size and a 2x2 facet grid is much more aesthically pleasing. The last filter condition gets rid of all 4th downs with a punt or field goal attempt mutate(success = fourth_down_converted == 1) %>% group_by(qtr, ydstogo) %>% summarise(attempts = n(), success_rate = mean(success)) %>% ggplot(aes(x = ydstogo, y = success_rate)) + geom_line() + geom_point() + scale_y_continuous(labels = scales::percent_format()) + labs(title = "4th Down Conversion Rate by Distance and Quarter", x = "Yards to Go", y = "Success Rate (%)") + theme_light() + geom_smooth(se = FALSE) + facet_wrap(~ qtr)

4th and 1 as well as 4th and 2 have higher success rates in all other quarters than the 4th
Beyond that, sample size for 4th and 3+ is too small for the first 3 quarters to be worth analyzing
We can conclude that offenses actually get worse in the high-scoring quarters because they are more desperate
The message for NFL teams: GET THE LEAD IN THE FIRST QUARTER

Conclusion

NFL Teams know everything I do times a million.

They know that passes to the right are inefficient and 4th down runs to the left are the most efficient.

Yet they pass to the right more than anything and avoid 4th down runs to the left like the plague.

There is more than just the bare numbers to football.