American football and me

Having heard about the NFL and also its major event Super Bowl, I have to admit that I am not familiar at all with the rules of this game. Maybe except that there are two teams on a field playing against each other by throwing and carrying an egg-shaped ball around. So today I stumbled upon a challenge on Kaggle called the NFL Big Data Bowl 2021 which sparked my interest in the sport (possibly due to the nice makeup of the competition). In the following sections I will show you how this sport and I grew closer together by doing some exploratory data analysis.


The data

In total there are 20 .CSV files containing aggregate data on games, players and plays as well as detailed information about all weekly matches.

Read games data

At first, I read the data on games which hopefully is also a starting point for me into the depths of american football. This dataset has general information on all the season’s matches, including the date, time and competing teams. Naturally, the teams’ names appear as acronyms, so my first task was to decipher those abbreviations. I created a separate .CSV file with the abbreviation and full name of each team and joined them with the existing games dataset. At last, I created some additional date columns. Hey GB aka Green Bay Packers, I guess I heard about you.

df.games <- read_delim(file = "data_raw/games.csv", delim = ",")

# Read team abbreviations file
team.abbreviatons <- read_delim(file = "nfl_team_abbreviations.csv", delim = ";") 

# Join with games dataset
df.games <- df.games %>%
  # Home team
  rename(abbreviation = homeTeamAbbr) %>%
  left_join(team.abbreviatons, by = "abbreviation") %>%
  rename(homeTeamAbbr = abbreviation,
         homeTeam = franchise) %>%
  # Visitor team
  rename(abbreviation = visitorTeamAbbr) %>%
  left_join(team.abbreviatons, by = "abbreviation") %>%
  rename(visitorTeamAbbr = abbreviation,
         visitorTeam = franchise) %>%
  # Additional date columns
  separate(col = gameDate, into = c("month", "day", "year"),
           remove = FALSE) %>%
  # Convert date
  mutate(gameDate = lubridate::mdy(gameDate))
gameId gameDate month day year gameTimeEastern homeTeamAbbr visitorTeamAbbr week homeTeam visitorTeam
2018090600 2018-09-06 09 06 2018 20:20:00 PHI ATL 1 Philadelphia Eagles Atlanta Falcons
2018090901 2018-09-09 09 09 2018 13:00:00 CLE PIT 1 Cleveland Browns Pittsburgh Steelers
2018090902 2018-09-09 09 09 2018 13:00:00 IND CIN 1 Indianapolis Colts Cincinnati Bengals
2018090903 2018-09-09 09 09 2018 13:00:00 MIA TEN 1 Miami Dolphins Tennessee Titans
2018090900 2018-09-09 09 09 2018 13:00:00 BAL BUF 1 Baltimore Ravens Buffalo Bills
2018090905 2018-09-09 09 09 2018 13:00:00 NE HOU 1 New England Patriots Houston Texans
2018090907 2018-09-09 09 09 2018 13:00:00 NYG JAX 1 New York Giants Jacksonville Jaguars
2018090906 2018-09-09 09 09 2018 13:00:00 NO TB 1 New Orleans Saints Tampa Bay Buccaneers
2018090909 2018-09-09 09 09 2018 16:25:00 ARI WAS 1 Arizona Cardinals Washington Football Team
2018090910 2018-09-09 09 09 2018 16:25:00 CAR DAL 1 Carolina Panthers Dallas Cowboys

Read players data

Next, I read the data with information on all players of the 2018 NFL season. Now I also have to learn stuff about the Imperial system in order to make sense of the players’ weight and height. The height column is also pretty messy as it contains height measurements which are in feet and inches but some are only in inches. Fortunately, they are distinguishable by the use of the - sign. So I cleaned that one up and also converted the players’ height and weight to the Metric system. Also the players’ birth date is stored in two different versions: one is formatted as year-month-day and the other is formatted as month-day-year. After I cleaned this, I calculated the players’ age. Last but not least, in column position there is some other uncharted territory for me. All the players’ positions are abbreviated and I created a separate .CSV file in which I figured out all acronyms and then joined the full position names and category to the data.

df.players <- read_delim(file = "data_raw/players.csv", delim = ",")

# Clean height: some feet + inches; some only inches
df.players %>%
  filter(str_detect(df.players$height,"-")) %>%
  separate(col = height, into = c("feet", "inches"), 
           remove = FALSE, convert = TRUE) %>%
  mutate(height = 12*feet + inches,
         height_eu =(12*feet + inches)*2.54) %>%
  select(-c(feet,inches)) -> df.players.feet

df.players %>%
  filter(!(str_detect(df.players$height,"-")))  %>%
  mutate(height = as.numeric(height),
         height_eu = as.numeric(height) * 2.54) -> df.players.inch

# Combine dataframes and convert to metric system
df.players <- bind_rows(df.players.feet, df.players.inch) %>%
  mutate(height_eu = round(height_eu,digits = 0)) %>%
  mutate(weight_eu = round(weight / 2.205, digits = 0))

# Clean birthDate
df.players %>%
  filter(str_detect(df.players$birthDate,"-")) %>%
  separate(col = birthDate, into = c("year","month", "day"), convert = TRUE) -> df.player.y1
  
df.players %>%
  filter(str_detect(df.players$birthDate,"/")) %>%
  separate(col = birthDate, into = c("month","day", "year"), convert = TRUE) -> df.player.y2

# Combine and calculate age
df.players <- bind_rows(df.player.y1, df.player.y2) %>%
  mutate(age = 2018 - year)

# Read position abbreviations file
position.abbreviations <- read_delim(file = "nfl_position_abbreviations.csv", delim = ";") 

# Join
df.players <- df.players %>%
  rename(abbreviation = position) %>%
  left_join(position.abbreviations, by = "abbreviation") %>%
  rename(position_abb = abbreviation)
nflId height weight year month day collegeName position_abb displayName height_eu weight_eu age position position_category
2539334 72 190 1990 9 10 Washington CB Desmond Trufant 183 86 28 Cornerback Defense
2539653 70 186 1988 11 1 Southeastern Louisiana CB Robert Alford 178 84 30 Cornerback Defense
2543850 69 186 1991 12 18 Purdue SS Ricardo Allen 175 84 27 Strong Safety Defense
2555162 73 227 1994 11 4 Louisiana State MLB Deion Jones 185 103 24 Middle Linebacker Defense
2555255 75 232 1993 7 1 Minnesota OLB De’Vondre Campbell 190 105 25 Outside Linebacker Defense
2555543 73 216 1995 7 26 Florida FS Keanu Neal 185 98 23 Free Safety Defense
2556445 70 211 1992 10 20 Florida CB Brian Poole 178 96 26 Cornerback Defense
2532842 78 243 1989 1 20 Arizona QB Nick Foles 198 110 29 Quarterback Offense
2540158 77 250 1990 11 10 Stanford TE Zach Ertz 196 113 28 Tight End Offense
2552600 72 198 1993 5 24 Southern California WR Nelson Agholor 183 90 25 Wide Receiver Offense

Read plays data

The plays data contains general information about certain plays in each match like its verbal description and other game characteristics related to the play.

df.plays <- read_delim(file = "data_raw/plays.csv", delim = ",")
gameId playId playDescription quarter down yardsToGo possessionTeam playType yardlineSide yardlineNumber offenseFormation personnelO defendersInTheBox numberOfPassRushers personnelD typeDropback preSnapVisitorScore preSnapHomeScore gameClock absoluteYardlineNumber penaltyCodes penaltyJerseyNumbers passResult offensePlayResult playResult epa isDefensivePI
2018090600 75 (15:00) M.Ryan pass short right to J.Jones pushed ob at ATL 30 for 10 yards (M.Jenkins). 1 1 15 ATL play_type_pass ATL 20 I_FORM 2 RB, 1 TE, 2 WR 7 4 4 DL, 2 LB, 5 DB TRADITIONAL 0 0 15:00:00 90 NA NA C 10 10 0.2618273 FALSE
2018090600 146 (13:10) M.Ryan pass incomplete short right to C.Ridley (J.Mills, J.Hicks). 1 1 10 ATL play_type_pass PHI 39 SINGLEBACK 1 RB, 1 TE, 3 WR 7 4 4 DL, 2 LB, 5 DB TRADITIONAL 0 0 13:10:00 49 NA NA I 0 0 -0.3723598 FALSE
2018090600 168 (13:05) (Shotgun) M.Ryan pass incomplete short left to D.Freeman. 1 2 10 ATL play_type_pass PHI 39 SHOTGUN 2 RB, 1 TE, 2 WR 6 4 4 DL, 2 LB, 5 DB TRADITIONAL 0 0 13:05:00 49 NA NA I 0 0 -0.7027787 FALSE
2018090600 190 (13:01) (Shotgun) M.Ryan pass deep left to J.Jones to PHI 6 for 33 yards (R.Darby). 1 3 10 ATL play_type_pass PHI 39 SHOTGUN 1 RB, 1 TE, 3 WR 6 5 4 DL, 1 LB, 6 DB SCRAMBLE_ROLLOUT_LEFT 0 0 13:01:00 49 NA NA C 33 33 3.0475300 FALSE
2018090600 256 (10:59) (Shotgun) M.Ryan pass incomplete short right to D.Freeman. 1 3 1 ATL play_type_pass PHI 1 SHOTGUN 2 RB, 3 TE, 0 WR 8 6 6 DL, 3 LB, 2 DB TRADITIONAL 0 0 10:59:00 11 NA NA I 0 0 -0.8422719 FALSE
2018090600 320 (10:10) (Shotgun) N.Foles pass short left to N.Agholor to PHI 8 for 4 yards (R.Alford). 1 2 8 PHI play_type_pass PHI 4 SHOTGUN 1 RB, 1 TE, 3 WR 7 4 3 DL, 3 LB, 5 DB TRADITIONAL 0 0 10:10:00 14 NA NA C 4 4 -0.3440965 FALSE
2018090600 344 (9:24) (Shotgun) N.Foles pass incomplete short left to D.Sproles (R.Allen). 1 3 4 PHI play_type_pass PHI 8 EMPTY 1 RB, 2 TE, 2 WR 6 4 3 DL, 3 LB, 5 DB TRADITIONAL 0 0 09:24:00 18 NA NA I 0 0 -1.1922083 FALSE
2018090600 402 (9:08) M.Ryan pass incomplete deep left to M.Sanu. 1 1 10 ATL play_type_pass PHI 44 SINGLEBACK 1 RB, 1 TE, 3 WR 7 4 4 DL, 2 LB, 5 DB SCRAMBLE_ROLLOUT_LEFT 0 0 09:08:00 54 NA NA I 0 0 -0.4298626 FALSE
2018090600 492 (7:01) M.Ryan pass short left to T.Coleman pushed ob at PHI 10 for 26 yards (R.McLeod). 1 2 13 ATL play_type_pass PHI 36 SINGLEBACK 1 RB, 2 TE, 2 WR 8 4 4 DL, 2 LB, 5 DB TRADITIONAL 0 0 07:01:00 46 NA NA C 26 26 1.8798040 FALSE
2018090600 521 (6:19) M.Ryan pass short left to A.Hooper to PHI 7 for 3 yards (N.Gerry). Philadelphia challenged the pass completion ruling, and the play was Upheld. The ruling on the field stands. (Timeout #1.) 1 1 10 ATL play_type_pass PHI 10 SINGLEBACK 1 RB, 2 TE, 2 WR 8 3 4 DL, 2 LB, 5 DB DESIGNED_ROLLOUT_LEFT 0 0 06:19:00 20 NA NA C 3 3 0.0456651 FALSE

Read weeks data

The last 17 datasets contain detailed information about each play in each game. Each play is broken down in so called frames where the players’ movements on the field are tracked.

df.weeks <- purrr::map_dfr(as.character(1:17),
                           ~ readr::read_delim(delim = ",",
                                               paste0("data_raw/week", ., ".csv")))

Add team information to players data

With the weekly datasets I am able to determine which players is playing for which team in the current season. So I filtered the weekly data and only kept the occurence of a player one single time.

players_to_team <- df.weeks %>%
  filter(displayName != "Football") %>%
  group_by(gameId, displayName) %>%
  slice(1) %>%
  ungroup() %>%
  select(gameId, displayName, team)

players_to_team_out <- left_join(x = players_to_team, y = df.games, by = "gameId") %>%
  mutate(team_abb = ifelse(team == "away", visitorTeamAbbr, homeTeamAbbr),
         team = ifelse(team == "away", visitorTeam, homeTeam)) %>%
  group_by(displayName) %>%
  slice(1) %>%
  ungroup() %>%
  select(gameId, displayName, team_abb, team)

Finally, I am linking this information to the games dataset.

df.players <- left_join(x = df.players,
                        y = players_to_team,
                        by = "displayName")


Exploratory data analysis

I will begin with some exploratory data analysis as I have no prior knowledge about american football.

Games’ stats

Games per month

Most matches take place in December.

ggplot(data = df.games,
       aes(x = month)) +
  geom_bar(fill = "#013369", color = "black") +
  labs(x = "Month", y = "Number of games")

Average breaks between games per team

A team has an average break of 7 days between two matches. Some teams have a little more, some teams have a little less.

df.games %>%
  pivot_longer(cols = c(homeTeam, visitorTeam),
               names_to = "group", values_to = "team") %>%
  group_by(team) %>%
  mutate(rest = as.double(difftime(gameDate, lag(gameDate)))) %>%
  summarise(count = n(),
            average_rest = mean(rest, na.rm = T)) %>%
  ungroup() %>%
  ggplot(data = .,
         aes(x = reorder(team, average_rest), y = average_rest)) +
  geom_bar(stat = "identity", color = "black", fill = "#013369") +
  geom_hline(yintercept = 7.41, size = 1, color = "#d50a0a") +
  coord_flip() +
  labs(x = "", y = "Average break between games (in days)")


Players’ stats

Players’ position category

Barchart of players’ position category.

Most players are playing a defensive position, followed by offensive positions and special teams.

ggplot(data = df.players,
       aes(x = fct_infreq(position_category), fill = position_category)) +
  geom_bar(stat = "count",  color = "black") +
  scale_fill_manual(values = c("#013369","#d50a0a","#ffffff")) +
  labs(x = "", y = "Number of players", fill = "") +
  theme(legend.position = "bottom")

Barchart of players’ position filled by position category.

Most players in the league play as a Wide Receiver, followed by the Cornerback and Running Back.

ggplot(data = df.players,
       aes(x = fct_rev(fct_infreq(position)), fill = position_category)) +
  geom_bar(stat = "count",  color = "black") +
  scale_fill_manual(values = c("#013369","#d50a0a","#ffffff")) +
  coord_flip() +
  labs(x = "", y = "Number of players", fill = "") +
  theme(legend.position = "bottom")

Players’ college origin

Barchart of players’ top colleges filled by position category.

The three top universities are Alabama, Ohio State and Florida State.

df.players %>% 
  group_by(collegeName, position_category) %>%
  summarise(n = n()) %>%
  mutate(pct = n / sum(n),
         total = sum(n)) %>%
  ungroup() %>%
  arrange(desc(total)) %>%
  slice(1:54) %>%
  ggplot(data = .,
       aes(x = reorder(collegeName, total), y = n, fill = position_category)) +
  geom_bar(stat = "identity", color = "black") +
  scale_fill_manual(values = c("#013369","#d50a0a","#ffffff")) +
  coord_flip() +
  labs(x = "", y = "Number of players", fill = "") +
  theme(legend.position = "bottom")

Players’ age

Density plot of players’ age filled by position category.

Offense and defense players’ have a similiar distribution of age while special teams players are older on average.

ggplot(data = df.players,
       aes(x = age, fill = position_category)) +
  geom_density(color = "black", alpha = 0.75) +
  scale_fill_manual(values = c("#013369","#d50a0a","#ffffff")) +
  labs(x = "Age", y = "Density", fill = "") +
  theme(legend.position = "bottom")

Box plots of players’ age filled by position.

This can be shown in more detail with each player’s position.

ggplot(data = df.players,
       aes(x = reorder(position, age),
           y = age,
           fill = position_category)) +
  geom_boxplot() +
  coord_flip() +
  scale_fill_manual(values = c("#013369","#d50a0a","#ffffff")) +
  labs(x = "", y = "Age", fill = "") +
  theme(legend.position = "bottom")

Box plots of teams’ players age.

In most teams the median age of all players is 25 years. The San Francisco 49ers have the youngest team.

ggplot(data = filter(df.players, !is.na(team)),
       aes(x = reorder(team, age, FUN = median),
           y = age)) +
  geom_boxplot(fill = "white", color = "black") +
  coord_flip() +
  labs(x = "", y = "Age", fill = "") +
  theme(legend.position = "bottom")

Players’ weight and height

Scatterplot of players’ height and weight filled by position category.

Defensive players are compared to offensive players more heavy for a given height. There is even a bent in the distribution of the defensive players.

ggplot(data = df.players,
       aes(x = height_eu, y = weight_eu, fill = position_category)) +
  geom_point(size = 2.5, alpha = 1, shape = 21, colour = "black") +
  scale_fill_manual(values = c("#013369","#d50a0a","#ffffff")) +
  facet_wrap(. ~ position_category) +
  labs(x = "Height (cm)", y = "Weight (kg)", fill = "") +
  theme(legend.position = "bottom")

Scatterplot of players’ height and weight depending on position filled by position category.

The bent in the defensive player’s distribution of weight and height is explained by players on the Defensive Tackle and Nose Tackle position, i.e. players with an extraordinary weight compared to their height. I fitted an overall regression line of the model weight = β0 + β1height to each panel.

ggplot(data = df.players,
       aes(x = height_eu, y = weight_eu, fill = position_category)) +
  geom_point(size = 2.5, alpha = 1, shape = 21, colour = "black") +
  scale_fill_manual(values = c("#013369", "#d50a0a", "#ffffff")) +
  scale_x_continuous(limits = c(160, 210),
                     breaks = c(160, 170, 180, 190, 200, 210)) +
  scale_y_continuous(limits = c(70, 170),
                     breaks = c(70, 90, 110, 130, 150, 170)) +
  geom_abline(intercept = -160, slope = 1.4) +
  facet_wrap(. ~ position) +
  labs(x = "Height (cm)", y = "Weight (kg)", fill = "") +
  theme(legend.position = "bottom")


Plays’ stats

Players’ movements in play

Let us take a closer look into one single play of the match between the Philadelphia Eagles and Atlanta Falcons. The Atlanta Falcos are in the offense and the Philadelphia Eagles are in the defense.

library(gganimate)
df.week1 <- read_delim(file = "data_raw/week1.csv", delim = ",")

# PlayId 75
phi_atl <-
  filter(df.week1,  gameId == "2018090600" & playId %in% c("190")) %>%
  mutate(time_comb = paste(hour(time), 
                           minute(time), 
                           floor(second(time)), sep =":"))

plot <- phi_atl %>%
  group_by(playId, time_comb, displayName) %>%
  slice(1) %>%
  ggplot(data = .,
         aes(x=x, y=y, color=team, label=position)) +
  geom_point() +
  geom_text(aes(label=position),hjust=0, vjust=0) +
  scale_color_manual(values = c("#a71930","#624a2e","#004c54")) +
  scale_x_continuous(limits = c(0,100), breaks = seq(10,90,10),
                     labels = c(10,20,30,40,50,40,30,20,10)) +
  scale_y_continuous(limits = c(0,53.33), breaks = c(20,33,33)) +
  geom_vline(xintercept = 0, size = 1.5, color = "white") +
  geom_vline(xintercept = 100, size = 1.5, color = "white") +
  labs(x = "", color = "") +
  theme(axis.title.y=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks.y=element_blank(),
        axis.ticks.x=element_blank(),
        panel.background = element_rect(fill = "#7eaf34",
                                        colour = "#7eaf34",
                                        size = 0.5, linetype = "solid"),
        panel.grid.minor.y = element_blank(),
        legend.position = "none") +
  transition_time(time) +
  labs(title = "Time: {frame_time}",
       subtitle = df.plays[df.plays$gameId == "2018090600" & 
                            df.plays$playId == "190",]$playDescription)
  
# Animation 
animate(plot, end_pause = 10)

You can check this on YouTube!

IMAGE ALT TEXT

Offense formations per play

I calculated the occurences of all offense formations for each recorded play over all available matches. The Shotgun is the most common formation.

df.plays %>%
  group_by(possessionTeam, offenseFormation) %>%
  summarise(count = n()) %>%
  mutate(total = sum(count),
         share = count/total) %>%
  ggplot(data = .,
         aes(x = possessionTeam, y = share, fill=offenseFormation)) +
  geom_bar(stat="identity", position="stack") +
  coord_flip() +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "", y = "Share", fill = "Offense Formation")