title: "Breakdown and Analysis of NFL Plays" author: "Andreas Collet" format: revealjs:
incremental: true embed-resources: true scrollable: true slide-number: true smaller: true footer: Breakdown and Analysis of NFL Plays ---

{r, message = FALSE, warning = FALSE} library(dplyr) library(tibble) library(ggplot2) library(tidyr) library(here) library(gt)

```{r} mypalette <- c("navyblue", "orange", "darkgreen", "maroon", "turquoise", "darkviolet", "grey") palette(mypalette)

options(ggplot2.discrete.fill = mypalette, ggplot2.discrete.color = mypalette) ```

{r} nfl <- read.csv("NFL.csv") #Note for self: On school computers, do ("C:/Users/amcollet/OneDrive - University of Iowa/BIOS_4510_Project/NFL.csv")

NFL Dataset

::: {.incremental}

-Dataset covers every NFL play from 2009–2019

:::

Goals

::: {.incremental}

Football is ultimately about scoring points.

This analysis examines:

:::

Tidy Data

Dataset is mostly tidy as is, but there are some redundancies:

{r, echo=TRUE} nfl_tidy1 <- nfl %>% #weird variable name because it's not fully tidy yet select(-c(time,half_seconds_remaining,quarter_seconds_remaining))

{r, echo=TRUE} nfl_tidy1 %>% drop_na() %>% distinct(assist_tackle_3_player_id)

{r, echo=TRUE} nfl_tidy <- nfl_tidy1 %>% select(-(assist_tackle_3_player_id:assist_tackle_4_player_name))

Field Direction

{r, message = FALSE} options(scipen = 999) #Our final variable defaults to scientific notation, and I think it is more easy to comprehend if we keep it in normal numerics directions_table <- nfl_tidy %>% filter((play_type == "run" | play_type == "pass") & !(!is.na(pass_location) == !is.na(run_location))) %>% #This looks like a mess, but I'm filtering out the 4 or 5 plays where there is a pass_location and run_location, caused by laterals, which should be disregarded in finding general patterns due to their outstandingness. Furthermore, I needed to drop the 15000 or so variables with play_type pass or run which did not have a location, most of which were sacks. group_by(pass_location, run_location) %>% summarise(number = n(), mean_gain = mean(yards_gained, na.rm = TRUE), first_down_odds = (sum(first_down_rush) + sum(first_down_pass) + sum(first_down_penalty)) / number, touchdowns = sum(!is.na(td_team)), touchdown_odds = touchdowns / number, turnovers = (sum(interception) + sum(fumble_lost)), turnover_odds = turnovers / number, expected_points_added = sum(epa, na.rm = TRUE) / number, win_probability_added = sum(wpa, na.rm = TRUE) / number) %>% ungroup()

{r} gt(directions_table) %>% fmt_percent(columns = c("touchdown_odds", "turnover_odds", "win_probability_added", "first_down_odds")) %>% fmt_integer(columns = c("number", "touchdowns", "turnovers")) %>% fmt_number(columns = c("mean_gain", "expected_points_added"), decimals = 3) %>% tab_row_group(label = "Pass", rows = 1:3) %>% tab_row_group(label = "Run", rows = 4:6) %>% summary_rows(columns = c(number, touchdowns, turnovers), fns = list(fn = "sum", label = "Total"), fmt = ~ fmt_integer(.)) %>% summary_rows(columns = c("mean_gain", "expected_points_added"), fns = list(fn = "mean", label = "Average"), fmt = ~ fmt_number(., decimals = 3)) %>% summary_rows(columns = c("touchdown_odds", "turnover_odds", "win_probability_added", "first_down_odds"), fns = list(fn = "mean", label = "Average"), fmt = ~ fmt_percent(.)) %>% tab_header(title = "Summary of Play Success by Play Type and Direction")

Field Direction: Key Observations

Field Direction: Runs up the Middle

{r} nfl_tidy %>% filter(play_type == "run") %>% drop_na(run_location) %>% group_by(run_location) %>% summarize(number = n(), mean_gain = mean(yards_gained, na.rm = TRUE), sd_gain = sd(yards_gained, na.rm = TRUE)) %>% ungroup()

Field Direction: Passes to the Right

{r} nfl_tidy %>% filter(play_type == "pass") %>% drop_na(pass_location, yards_gained) %>% group_by(pass_location) %>% summarize(number = n(), mean_air_yards = mean(air_yards, na.rm = TRUE), completion_rate = sum(!is.na(yards_after_catch))/number, mean_yards_after_catch = sum(yards_after_catch, na.rm = TRUE) / sum(!is.na(yards_after_catch))) %>% ungroup()

Field Direction and Down

{r, message = FALSE} options(scipen = 999) directions_downs_table <- nfl_tidy %>% filter((play_type == "run" | play_type == "pass") & !(!is.na(pass_location) == !is.na(run_location))) %>% group_by(pass_location, run_location, down) %>% summarise(number = n(), mean_gain = mean(yards_gained, na.rm = TRUE), first_down_odds = (sum(first_down_rush) + sum(first_down_pass) + sum(first_down_penalty)) / number, touchdowns = sum(!is.na(td_team)), touchdown_odds = touchdowns / number, turnovers = (sum(interception) + sum(fumble_lost)), turnover_odds = turnovers / number, expected_points_added = sum(epa, na.rm = TRUE) / number, win_probability_added = sum(wpa, na.rm = TRUE) / number) %>% ungroup()

{r} gt(directions_downs_table) %>% fmt_percent(columns = c("touchdown_odds", "turnover_odds", "win_probability_added", "first_down_odds")) %>% fmt_integer(columns = c("number", "touchdowns", "turnovers")) %>% fmt_number(columns = c("mean_gain", "expected_points_added"), decimals = 3) %>% tab_row_group(label = "Pass", rows = 1:12) %>% tab_row_group(label = "Run", rows = 13:24) %>% summary_rows(columns = c(number, touchdowns, turnovers), fns = list(fn = "sum", label = "Total"), fmt = ~ fmt_integer(.)) %>% summary_rows(columns = c("mean_gain", "expected_points_added"), fns = list(fn = "mean", label = "Average"), fmt = ~ fmt_number(., decimals = 3)) %>% summary_rows(columns = c("touchdown_odds", "turnover_odds", "win_probability_added", "first_down_odds"), fns = list(fn = "mean", label = "Average"), fmt = ~ fmt_percent(.)) %>% tab_header(title = "Summary of Play Success by Play Type, Down, and Direction")

Field Direction and Down: Key Observations

Scoring by Quarter

In the NFL, there is a well documented increase in scoring in the 2nd and 4th quarters.

The first and third quarters always begin with a team at the very beginning of their possession

The urgency to score picks up at the end of halves.

Firstly, I need to turn the qtr variable into a factor.

{r, echo = TRUE} nfl %>% pull(qtr) %>% class()

{r, echo = TRUE} nfl_factors <- nfl %>% mutate(quarter = factor(qtr, levels = c("1", "2", "3", "4", "5"), labels = c("1", "2", "3", "4", "OT")))

Now we can make a simple bar chart to see how scores vary across quarter.


{r, message = FALSE} nfl_factors %>% group_by(game_id, quarter) %>% summarise(points = 6*sum(touchdown, na.rm = TRUE) + 2*sum(safety, na.rm = TRUE) + 3*sum(field_goal_result == "good", na.rm = TRUE) + sum(extra_point_result == "made", na.rm = TRUE) + 2*sum(two_point_conv_result == "success", na.rm = TRUE)) %>% group_by(quarter) %>% summarise(mean_points = mean(points)) %>% ggplot(aes(x = quarter, y = mean_points, fill = quarter)) + geom_col() + labs(title = "Average Points per Quarter Across Games", x = "Quarter", y = "Average Points") + theme_light()


```{r} nflfactors %>% dropna(firstdownrush) %>% #columns where firstdownrush is NA are the same columns for all the others, so I just need to filter one of them mutate(success = as.integer(firstdownrush == 1 | firstdownpass == 1 | firstdownpenalty == 1 | touchdown == 1)) %>% groupby(quarter) %>% summarise(successrate = mean(success)) %>% ggplot(aes(x = quarter, y = successrate, fill = quarter)) + geomcol() + labs(title = "First Down Conversion Rate Across Quarters", x = "Quarter", y = "Success Rate") + theme_light()

```


```{r} nflfactors %>% groupby(quarter) %>% summarise(averageepa = mean(epa, na.rm = TRUE)) %>% ggplot(aes(x = quarter, y = averageepa, fill = quarter)) + geomcol() + labs(title = "Expected Points Added Per Play Across Quarters", x = "Quarter", y = "EPA") + themelight()

```


```{r} nflfactors %>% dropna(epa) %>% mutate(goodplay = as.integer(epa > 2), badplay = as.integer(epa < -2)) %>% groupby(quarter) %>% summarise(number = n(), goodplayrate = sum(goodplay) / number, badplayrate = sum(badplay) / number) %>% pivotlonger(cols = c(goodplayrate, badplayrate), namesto = "playquality", valuesto = "rate") %>% ggplot(aes(x = quarter, y = rate, fill = playquality)) + geomcol(position = "dodge") + scaleycontinuous(labels = scales::percentformat()) + labs(x = "Quarter", y = "Rate", title = "Good vs Bad Play Rates by Quarter") + theme_light()

```


{r, message = FALSE} nfl_tidy %>% filter(down == 4, qtr %in% 1:4, fourth_down_converted == 1 | fourth_down_failed == 1) %>% #The second filter condition is to exclude OT because it has a small sample size and a 2x2 facet grid is much more aesthically pleasing. The last filter condition gets rid of all 4th downs with a punt or field goal attempt mutate(success = fourth_down_converted == 1) %>% group_by(qtr, ydstogo) %>% summarise(attempts = n(), success_rate = mean(success)) %>% ggplot(aes(x = ydstogo, y = success_rate)) + geom_line() + geom_point() + scale_y_continuous(labels = scales::percent_format()) + labs(title = "4th Down Conversion Rate by Distance and Quarter", x = "Yards to Go", y = "Success Rate (%)") + theme_light() + geom_smooth(se = FALSE) + facet_wrap(~ qtr)

Conclusion

NFL Teams know everything I do times a million.

They know that passes to the right are inefficient and 4th down runs to the left are the most efficient.

Yet they pass to the right more than anything and avoid 4th down runs to the left like the plague.

There is more than just the bare numbers to football.