R Tutorial: NFL Big Data Bowl 2021

American football and me
The data
Exploratory data analysis

American football and me

Having heard about the NFL and also its major event Super Bowl, I have to admit that I am not familiar at all with the rules of this game. Maybe except that there are two teams on a field playing against each other by throwing and carrying an egg-shaped ball around. So today I stumbled upon a challenge on Kaggle called the NFL Big Data Bowl 2021 which sparked my interest in the sport (possibly due to the nice makeup of the competition). In the following sections I will show you how this sport and I grew closer together by doing some exploratory data analysis.

The data

In total there are 20 .CSV files containing aggregate data on games, players and plays as well as detailed information about all weekly matches.

Read games data

At first, I read the data on games which hopefully is also a starting point for me into the depths of american football. This dataset has general information on all the season’s matches, including the date, time and competing teams. Naturally, the teams’ names appear as acronyms, so my first task was to decipher those abbreviations. I created a separate .CSV file with the abbreviation and full name of each team and joined them with the existing games dataset. At last, I created some additional date columns. Hey GB aka Green Bay Packers, I guess I heard about you.

df.games <- read_delim(file = "data_raw/games.csv", delim = ",")

# Read team abbreviations file
team.abbreviatons <- read_delim(file = "nfl_team_abbreviations.csv", delim = ";") 

# Join with games dataset
df.games <- df.games %>%
  # Home team
  rename(abbreviation = homeTeamAbbr) %>%
  left_join(team.abbreviatons, by = "abbreviation") %>%
  rename(homeTeamAbbr = abbreviation,
         homeTeam = franchise) %>%
  # Visitor team
  rename(abbreviation = visitorTeamAbbr) %>%
  left_join(team.abbreviatons, by = "abbreviation") %>%
  rename(visitorTeamAbbr = abbreviation,
         visitorTeam = franchise) %>%
  # Additional date columns
  separate(col = gameDate, into = c("month", "day", "year"),
           remove = FALSE) %>%
  # Convert date
  mutate(gameDate = lubridate::mdy(gameDate))

gameId	gameDate	month	day	year	gameTimeEastern	homeTeamAbbr	visitorTeamAbbr	week	homeTeam	visitorTeam
2018090600	2018-09-06	09	06	2018	20:20:00	PHI	ATL	1	Philadelphia Eagles	Atlanta Falcons
2018090901	2018-09-09	09	09	2018	13:00:00	CLE	PIT	1	Cleveland Browns	Pittsburgh Steelers
2018090902	2018-09-09	09	09	2018	13:00:00	IND	CIN	1	Indianapolis Colts	Cincinnati Bengals
2018090903	2018-09-09	09	09	2018	13:00:00	MIA	TEN	1	Miami Dolphins	Tennessee Titans
2018090900	2018-09-09	09	09	2018	13:00:00	BAL	BUF	1	Baltimore Ravens	Buffalo Bills
2018090905	2018-09-09	09	09	2018	13:00:00	NE	HOU	1	New England Patriots	Houston Texans
2018090907	2018-09-09	09	09	2018	13:00:00	NYG	JAX	1	New York Giants	Jacksonville Jaguars
2018090906	2018-09-09	09	09	2018	13:00:00	NO	TB	1	New Orleans Saints	Tampa Bay Buccaneers
2018090909	2018-09-09	09	09	2018	16:25:00	ARI	WAS	1	Arizona Cardinals	Washington Football Team
2018090910	2018-09-09	09	09	2018	16:25:00	CAR	DAL	1	Carolina Panthers	Dallas Cowboys

Read players data

Next, I read the data with information on all players of the 2018 NFL season. Now I also have to learn stuff about the Imperial system in order to make sense of the players’ weight and height. The height column is also pretty messy as it contains height measurements which are in feet and inches but some are only in inches. Fortunately, they are distinguishable by the use of the - sign. So I cleaned that one up and also converted the players’ height and weight to the Metric system. Also the players’ birth date is stored in two different versions: one is formatted as year-month-day and the other is formatted as month-day-year. After I cleaned this, I calculated the players’ age. Last but not least, in column position there is some other uncharted territory for me. All the players’ positions are abbreviated and I created a separate .CSV file in which I figured out all acronyms and then joined the full position names and category to the data.

df.players <- read_delim(file = "data_raw/players.csv", delim = ",")

# Clean height: some feet + inches; some only inches
df.players %>%
  filter(str_detect(df.players$height,"-")) %>%
  separate(col = height, into = c("feet", "inches"), 
           remove = FALSE, convert = TRUE) %>%
  mutate(height = 12*feet + inches,
         height_eu =(12*feet + inches)*2.54) %>%
  select(-c(feet,inches)) -> df.players.feet

df.players %>%
  filter(!(str_detect(df.players$height,"-")))  %>%
  mutate(height = as.numeric(height),
         height_eu = as.numeric(height) * 2.54) -> df.players.inch

# Combine dataframes and convert to metric system
df.players <- bind_rows(df.players.feet, df.players.inch) %>%
  mutate(height_eu = round(height_eu,digits = 0)) %>%
  mutate(weight_eu = round(weight / 2.205, digits = 0))

# Clean birthDate
df.players %>%
  filter(str_detect(df.players$birthDate,"-")) %>%
  separate(col = birthDate, into = c("year","month", "day"), convert = TRUE) -> df.player.y1
  
df.players %>%
  filter(str_detect(df.players$birthDate,"/")) %>%
  separate(col = birthDate, into = c("month","day", "year"), convert = TRUE) -> df.player.y2

# Combine and calculate age
df.players <- bind_rows(df.player.y1, df.player.y2) %>%
  mutate(age = 2018 - year)

# Read position abbreviations file
position.abbreviations <- read_delim(file = "nfl_position_abbreviations.csv", delim = ";") 

# Join
df.players <- df.players %>%
  rename(abbreviation = position) %>%
  left_join(position.abbreviations, by = "abbreviation") %>%
  rename(position_abb = abbreviation)

nflId	height	weight	year	month	day	collegeName	position_abb	displayName	height_eu	weight_eu	age	position	position_category
2539334	72	190	1990	9	10	Washington	CB	Desmond Trufant	183	86	28	Cornerback	Defense
2539653	70	186	1988	11	1	Southeastern Louisiana	CB	Robert Alford	178	84	30	Cornerback	Defense
2543850	69	186	1991	12	18	Purdue	SS	Ricardo Allen	175	84	27	Strong Safety	Defense
2555162	73	227	1994	11	4	Louisiana State	MLB	Deion Jones	185	103	24	Middle Linebacker	Defense
2555255	75	232	1993	7	1	Minnesota	OLB	De’Vondre Campbell	190	105	25	Outside Linebacker	Defense
2555543	73	216	1995	7	26	Florida	FS	Keanu Neal	185	98	23	Free Safety	Defense
2556445	70	211	1992	10	20	Florida	CB	Brian Poole	178	96	26	Cornerback	Defense
2532842	78	243	1989	1	20	Arizona	QB	Nick Foles	198	110	29	Quarterback	Offense
2540158	77	250	1990	11	10	Stanford	TE	Zach Ertz	196	113	28	Tight End	Offense
2552600	72	198	1993	5	24	Southern California	WR	Nelson Agholor	183	90	25	Wide Receiver	Offense

Read plays data

The plays data contains general information about certain plays in each match like its verbal description and other game characteristics related to the play.

df.plays <- read_delim(file = "data_raw/plays.csv", delim = ",")

gameId	playId	playDescription	quarter	down	yardsToGo	possessionTeam	playType	yardlineSide	yardlineNumber	offenseFormation	personnelO	defendersInTheBox	numberOfPassRushers	personnelD	typeDropback	gameClock	absoluteYardlineNumber	penaltyCodes	penaltyJerseyNumbers	passResult	offensePlayResult	playResult	epa	isDefensivePI
2018090600	75	(15:00) M.Ryan pass short right to J.Jones pushed ob at ATL 30 for 10 yards (M.Jenkins).	1	1	15	ATL	play_type_pass	ATL	20	I_FORM	2 RB, 1 TE, 2 WR	7	4	4 DL, 2 LB, 5 DB	TRADITIONAL	15:00:00	90	NA	NA	C	10	10	0.2618273	FALSE
2018090600	146	(13:10) M.Ryan pass incomplete short right to C.Ridley (J.Mills, J.Hicks).	1	1	10	ATL	play_type_pass	PHI	39	SINGLEBACK	1 RB, 1 TE, 3 WR	7	4	4 DL, 2 LB, 5 DB	TRADITIONAL	13:10:00	49	NA	NA	I	0	0	-0.3723598	FALSE
2018090600	168	(13:05) (Shotgun) M.Ryan pass incomplete short left to D.Freeman.	1	2	10	ATL	play_type_pass	PHI	39	SHOTGUN	2 RB, 1 TE, 2 WR	6	4	4 DL, 2 LB, 5 DB	TRADITIONAL	13:05:00	49	NA	NA	I	0	0	-0.7027787	FALSE
2018090600	190	(13:01) (Shotgun) M.Ryan pass deep left to J.Jones to PHI 6 for 33 yards (R.Darby).	1	3	10	ATL	play_type_pass	PHI	39	SHOTGUN	1 RB, 1 TE, 3 WR	6	5	4 DL, 1 LB, 6 DB	SCRAMBLE_ROLLOUT_LEFT	13:01:00	49	NA	NA	C	33	33	3.0475300	FALSE
2018090600	256	(10:59) (Shotgun) M.Ryan pass incomplete short right to D.Freeman.	1	3	1	ATL	play_type_pass	PHI	1	SHOTGUN	2 RB, 3 TE, 0 WR	8	6	6 DL, 3 LB, 2 DB	TRADITIONAL	10:59:00	11	NA	NA	I	0	0	-0.8422719	FALSE
2018090600	320	(10:10) (Shotgun) N.Foles pass short left to N.Agholor to PHI 8 for 4 yards (R.Alford).	1	2	8	PHI	play_type_pass	PHI	4	SHOTGUN	1 RB, 1 TE, 3 WR	7	4	3 DL, 3 LB, 5 DB	TRADITIONAL	10:10:00	14	NA	NA	C	4	4	-0.3440965	FALSE
2018090600	344	(9:24) (Shotgun) N.Foles pass incomplete short left to D.Sproles (R.Allen).	1	3	4	PHI	play_type_pass	PHI	8	EMPTY	1 RB, 2 TE, 2 WR	6	4	3 DL, 3 LB, 5 DB	TRADITIONAL	09:24:00	18	NA	NA	I	0	0	-1.1922083	FALSE
2018090600	402	(9:08) M.Ryan pass incomplete deep left to M.Sanu.	1	1	10	ATL	play_type_pass	PHI	44	SINGLEBACK	1 RB, 1 TE, 3 WR	7	4	4 DL, 2 LB, 5 DB	SCRAMBLE_ROLLOUT_LEFT	09:08:00	54	NA	NA	I	0	0	-0.4298626	FALSE
2018090600	492	(7:01) M.Ryan pass short left to T.Coleman pushed ob at PHI 10 for 26 yards (R.McLeod).	1	2	13	ATL	play_type_pass	PHI	36	SINGLEBACK	1 RB, 2 TE, 2 WR	8	4	4 DL, 2 LB, 5 DB	TRADITIONAL	07:01:00	46	NA	NA	C	26	26	1.8798040	FALSE
2018090600	521	(6:19) M.Ryan pass short left to A.Hooper to PHI 7 for 3 yards (N.Gerry). Philadelphia challenged the pass completion ruling, and the play was Upheld. The ruling on the field stands. (Timeout #1.)	1	1	10	ATL	play_type_pass	PHI	10	SINGLEBACK	1 RB, 2 TE, 2 WR	8	3	4 DL, 2 LB, 5 DB	DESIGNED_ROLLOUT_LEFT	06:19:00	20	NA	NA	C	3	3	0.0456651	FALSE

Read weeks data

The last 17 datasets contain detailed information about each play in each game. Each play is broken down in so called frames where the players’ movements on the field are tracked.

df.weeks <- purrr::map_dfr(as.character(1:17),
                           ~ readr::read_delim(delim = ",",
                                               paste0("data_raw/week", ., ".csv")))

Add team information to players data

With the weekly datasets I am able to determine which players is playing for which team in the current season. So I filtered the weekly data and only kept the occurence of a player one single time.

players_to_team <- df.weeks %>%
  filter(displayName != "Football") %>%
  group_by(gameId, displayName) %>%
  slice(1) %>%
  ungroup() %>%
  select(gameId, displayName, team)

players_to_team_out <- left_join(x = players_to_team, y = df.games, by = "gameId") %>%
  mutate(team_abb = ifelse(team == "away", visitorTeamAbbr, homeTeamAbbr),
         team = ifelse(team == "away", visitorTeam, homeTeam)) %>%
  group_by(displayName) %>%
  slice(1) %>%
  ungroup() %>%
  select(gameId, displayName, team_abb, team)

Finally, I am linking this information to the games dataset.

df.players <- left_join(x = df.players,
                        y = players_to_team,
                        by = "displayName")

Exploratory data analysis

I will begin with some exploratory data analysis as I have no prior knowledge about american football.

Games’ stats

Games per month

Most matches take place in December.

ggplot(data = df.games,
       aes(x = month)) +
  geom_bar(fill = "#013369", color = "black") +
  labs(x = "Month", y = "Number of games")

Average breaks between games per team

A team has an average break of 7 days between two matches. Some teams have a little more, some teams have a little less.

df.games %>%
  pivot_longer(cols = c(homeTeam, visitorTeam),
               names_to = "group", values_to = "team") %>%
  group_by(team) %>%
  mutate(rest = as.double(difftime(gameDate, lag(gameDate)))) %>%
  summarise(count = n(),
            average_rest = mean(rest, na.rm = T)) %>%
  ungroup() %>%
  ggplot(data = .,
         aes(x = reorder(team, average_rest), y = average_rest)) +
  geom_bar(stat = "identity", color = "black", fill = "#013369") +
  geom_hline(yintercept = 7.41, size = 1, color = "#d50a0a") +
  coord_flip() +
  labs(x = "", y = "Average break between games (in days)")

Players’ stats

Players’ position category

Barchart of players’ position category.

Most players are playing a defensive position, followed by offensive positions and special teams.

ggplot(data = df.players,
       aes(x = fct_infreq(position_category), fill = position_category)) +
  geom_bar(stat = "count",  color = "black") +
  scale_fill_manual(values = c("#013369","#d50a0a","#ffffff")) +
  labs(x = "", y = "Number of players", fill = "") +
  theme(legend.position = "bottom")

Barchart of players’ position filled by position category.

Most players in the league play as a Wide Receiver, followed by the Cornerback and Running Back.

ggplot(data = df.players,
       aes(x = fct_rev(fct_infreq(position)), fill = position_category)) +
  geom_bar(stat = "count",  color = "black") +
  scale_fill_manual(values = c("#013369","#d50a0a","#ffffff")) +
  coord_flip() +
  labs(x = "", y = "Number of players", fill = "") +
  theme(legend.position = "bottom")

Players’ college origin

Barchart of players’ top colleges filled by position category.

The three top universities are Alabama, Ohio State and Florida State.

df.players %>% 
  group_by(collegeName, position_category) %>%
  summarise(n = n()) %>%
  mutate(pct = n / sum(n),
         total = sum(n)) %>%
  ungroup() %>%
  arrange(desc(total)) %>%
  slice(1:54) %>%
  ggplot(data = .,
       aes(x = reorder(collegeName, total), y = n, fill = position_category)) +
  geom_bar(stat = "identity", color = "black") +
  scale_fill_manual(values = c("#013369","#d50a0a","#ffffff")) +
  coord_flip() +
  labs(x = "", y = "Number of players", fill = "") +
  theme(legend.position = "bottom")

Players’ age

Density plot of players’ age filled by position category.

Offense and defense players’ have a similiar distribution of age while special teams players are older on average.

ggplot(data = df.players,
       aes(x = age, fill = position_category)) +
  geom_density(color = "black", alpha = 0.75) +
  scale_fill_manual(values = c("#013369","#d50a0a","#ffffff")) +
  labs(x = "Age", y = "Density", fill = "") +
  theme(legend.position = "bottom")

Box plots of players’ age filled by position.

This can be shown in more detail with each player’s position.

ggplot(data = df.players,
       aes(x = reorder(position, age),
           y = age,
           fill = position_category)) +
  geom_boxplot() +
  coord_flip() +
  scale_fill_manual(values = c("#013369","#d50a0a","#ffffff")) +
  labs(x = "", y = "Age", fill = "") +
  theme(legend.position = "bottom")

Box plots of teams’ players age.

In most teams the median age of all players is 25 years. The San Francisco 49ers have the youngest team.

ggplot(data = filter(df.players, !is.na(team)),
       aes(x = reorder(team, age, FUN = median),
           y = age)) +
  geom_boxplot(fill = "white", color = "black") +
  coord_flip() +
  labs(x = "", y = "Age", fill = "") +
  theme(legend.position = "bottom")

Players’ weight and height

Scatterplot of players’ height and weight filled by position category.

Defensive players are compared to offensive players more heavy for a given height. There is even a bent in the distribution of the defensive players.

ggplot(data = df.players,
       aes(x = height_eu, y = weight_eu, fill = position_category)) +
  geom_point(size = 2.5, alpha = 1, shape = 21, colour = "black") +
  scale_fill_manual(values = c("#013369","#d50a0a","#ffffff")) +
  facet_wrap(. ~ position_category) +
  labs(x = "Height (cm)", y = "Weight (kg)", fill = "") +
  theme(legend.position = "bottom")

Scatterplot of players’ height and weight depending on position filled by position category.

The bent in the defensive player’s distribution of weight and height is explained by players on the Defensive Tackle and Nose Tackle position, i.e. players with an extraordinary weight compared to their height. I fitted an overall regression line of the model weight = β₀ + β₁height to each panel.

ggplot(data = df.players,
       aes(x = height_eu, y = weight_eu, fill = position_category)) +
  geom_point(size = 2.5, alpha = 1, shape = 21, colour = "black") +
  scale_fill_manual(values = c("#013369", "#d50a0a", "#ffffff")) +
  scale_x_continuous(limits = c(160, 210),
                     breaks = c(160, 170, 180, 190, 200, 210)) +
  scale_y_continuous(limits = c(70, 170),
                     breaks = c(70, 90, 110, 130, 150, 170)) +
  geom_abline(intercept = -160, slope = 1.4) +
  facet_wrap(. ~ position) +
  labs(x = "Height (cm)", y = "Weight (kg)", fill = "") +
  theme(legend.position = "bottom")

Plays’ stats

Players’ movements in play

Let us take a closer look into one single play of the match between the Philadelphia Eagles and Atlanta Falcons. The Atlanta Falcos are in the offense and the Philadelphia Eagles are in the defense.

library(gganimate)
df.week1 <- read_delim(file = "data_raw/week1.csv", delim = ",")

# PlayId 75
phi_atl <-
  filter(df.week1,  gameId == "2018090600" & playId %in% c("190")) %>%
  mutate(time_comb = paste(hour(time), 
                           minute(time), 
                           floor(second(time)), sep =":"))

plot <- phi_atl %>%
  group_by(playId, time_comb, displayName) %>%
  slice(1) %>%
  ggplot(data = .,
         aes(x=x, y=y, color=team, label=position)) +
  geom_point() +
  geom_text(aes(label=position),hjust=0, vjust=0) +
  scale_color_manual(values = c("#a71930","#624a2e","#004c54")) +
  scale_x_continuous(limits = c(0,100), breaks = seq(10,90,10),
                     labels = c(10,20,30,40,50,40,30,20,10)) +
  scale_y_continuous(limits = c(0,53.33), breaks = c(20,33,33)) +
  geom_vline(xintercept = 0, size = 1.5, color = "white") +
  geom_vline(xintercept = 100, size = 1.5, color = "white") +
  labs(x = "", color = "") +
  theme(axis.title.y=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks.y=element_blank(),
        axis.ticks.x=element_blank(),
        panel.background = element_rect(fill = "#7eaf34",
                                        colour = "#7eaf34",
                                        size = 0.5, linetype = "solid"),
        panel.grid.minor.y = element_blank(),
        legend.position = "none") +
  transition_time(time) +
  labs(title = "Time: {frame_time}",
       subtitle = df.plays[df.plays$gameId == "2018090600" & 
                            df.plays$playId == "190",]$playDescription)
  
# Animation 
animate(plot, end_pause = 10)

You can check this on YouTube!

Offense formations per play

I calculated the occurences of all offense formations for each recorded play over all available matches. The Shotgun is the most common formation.

df.plays %>%
  group_by(possessionTeam, offenseFormation) %>%
  summarise(count = n()) %>%
  mutate(total = sum(count),
         share = count/total) %>%
  ggplot(data = .,
         aes(x = possessionTeam, y = share, fill=offenseFormation)) +
  geom_bar(stat="identity", position="stack") +
  coord_flip() +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "", y = "Share", fill = "Offense Formation")