Having heard about the NFL and also its major event Super Bowl, I have to admit that I am not familiar at all with the rules of this game. Maybe except that there are two teams on a field playing against each other by throwing and carrying an egg-shaped ball around. So today I stumbled upon a challenge on Kaggle called the NFL Big Data Bowl 2021 which sparked my interest in the sport (possibly due to the nice makeup of the competition). In the following sections I will show you how this sport and I grew closer together by doing some exploratory data analysis.
In total there are 20 .CSV files containing aggregate data on games, players and plays as well as detailed information about all weekly matches.
At first, I read the data on games which hopefully is also a starting point for me into the depths of american football. This dataset has general information on all the season’s matches, including the date, time and competing teams. Naturally, the teams’ names appear as acronyms, so my first task was to decipher those abbreviations. I created a separate .CSV file with the abbreviation and full name of each team and joined them with the existing games dataset. At last, I created some additional date columns. Hey GB aka Green Bay Packers, I guess I heard about you.
df.games <- read_delim(file = "data_raw/games.csv", delim = ",")
# Read team abbreviations file
team.abbreviatons <- read_delim(file = "nfl_team_abbreviations.csv", delim = ";")
# Join with games dataset
df.games <- df.games %>%
# Home team
rename(abbreviation = homeTeamAbbr) %>%
left_join(team.abbreviatons, by = "abbreviation") %>%
rename(homeTeamAbbr = abbreviation,
homeTeam = franchise) %>%
# Visitor team
rename(abbreviation = visitorTeamAbbr) %>%
left_join(team.abbreviatons, by = "abbreviation") %>%
rename(visitorTeamAbbr = abbreviation,
visitorTeam = franchise) %>%
# Additional date columns
separate(col = gameDate, into = c("month", "day", "year"),
remove = FALSE) %>%
# Convert date
mutate(gameDate = lubridate::mdy(gameDate))
gameId | gameDate | month | day | year | gameTimeEastern | homeTeamAbbr | visitorTeamAbbr | week | homeTeam | visitorTeam |
---|---|---|---|---|---|---|---|---|---|---|
2018090600 | 2018-09-06 | 09 | 06 | 2018 | 20:20:00 | PHI | ATL | 1 | Philadelphia Eagles | Atlanta Falcons |
2018090901 | 2018-09-09 | 09 | 09 | 2018 | 13:00:00 | CLE | PIT | 1 | Cleveland Browns | Pittsburgh Steelers |
2018090902 | 2018-09-09 | 09 | 09 | 2018 | 13:00:00 | IND | CIN | 1 | Indianapolis Colts | Cincinnati Bengals |
2018090903 | 2018-09-09 | 09 | 09 | 2018 | 13:00:00 | MIA | TEN | 1 | Miami Dolphins | Tennessee Titans |
2018090900 | 2018-09-09 | 09 | 09 | 2018 | 13:00:00 | BAL | BUF | 1 | Baltimore Ravens | Buffalo Bills |
2018090905 | 2018-09-09 | 09 | 09 | 2018 | 13:00:00 | NE | HOU | 1 | New England Patriots | Houston Texans |
2018090907 | 2018-09-09 | 09 | 09 | 2018 | 13:00:00 | NYG | JAX | 1 | New York Giants | Jacksonville Jaguars |
2018090906 | 2018-09-09 | 09 | 09 | 2018 | 13:00:00 | NO | TB | 1 | New Orleans Saints | Tampa Bay Buccaneers |
2018090909 | 2018-09-09 | 09 | 09 | 2018 | 16:25:00 | ARI | WAS | 1 | Arizona Cardinals | Washington Football Team |
2018090910 | 2018-09-09 | 09 | 09 | 2018 | 16:25:00 | CAR | DAL | 1 | Carolina Panthers | Dallas Cowboys |
Next, I read the data with information on all players of the 2018 NFL season. Now I also have to learn stuff about the Imperial system in order to make sense of the players’ weight and height. The height
column is also pretty messy as it contains height measurements which are in feet and inches but some are only in inches. Fortunately, they are distinguishable by the use of the -
sign. So I cleaned that one up and also converted the players’ height and weight to the Metric system. Also the players’ birth date is stored in two different versions: one is formatted as year-month-day and the other is formatted as month-day-year. After I cleaned this, I calculated the players’ age. Last but not least, in column position
there is some other uncharted territory for me. All the players’ positions are abbreviated and I created a separate .CSV file in which I figured out all acronyms and then joined the full position names and category to the data.
df.players <- read_delim(file = "data_raw/players.csv", delim = ",")
# Clean height: some feet + inches; some only inches
df.players %>%
filter(str_detect(df.players$height,"-")) %>%
separate(col = height, into = c("feet", "inches"),
remove = FALSE, convert = TRUE) %>%
mutate(height = 12*feet + inches,
height_eu =(12*feet + inches)*2.54) %>%
select(-c(feet,inches)) -> df.players.feet
df.players %>%
filter(!(str_detect(df.players$height,"-"))) %>%
mutate(height = as.numeric(height),
height_eu = as.numeric(height) * 2.54) -> df.players.inch
# Combine dataframes and convert to metric system
df.players <- bind_rows(df.players.feet, df.players.inch) %>%
mutate(height_eu = round(height_eu,digits = 0)) %>%
mutate(weight_eu = round(weight / 2.205, digits = 0))
# Clean birthDate
df.players %>%
filter(str_detect(df.players$birthDate,"-")) %>%
separate(col = birthDate, into = c("year","month", "day"), convert = TRUE) -> df.player.y1
df.players %>%
filter(str_detect(df.players$birthDate,"/")) %>%
separate(col = birthDate, into = c("month","day", "year"), convert = TRUE) -> df.player.y2
# Combine and calculate age
df.players <- bind_rows(df.player.y1, df.player.y2) %>%
mutate(age = 2018 - year)
# Read position abbreviations file
position.abbreviations <- read_delim(file = "nfl_position_abbreviations.csv", delim = ";")
# Join
df.players <- df.players %>%
rename(abbreviation = position) %>%
left_join(position.abbreviations, by = "abbreviation") %>%
rename(position_abb = abbreviation)
nflId | height | weight | year | month | day | collegeName | position_abb | displayName | height_eu | weight_eu | age | position | position_category |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2539334 | 72 | 190 | 1990 | 9 | 10 | Washington | CB | Desmond Trufant | 183 | 86 | 28 | Cornerback | Defense |
2539653 | 70 | 186 | 1988 | 11 | 1 | Southeastern Louisiana | CB | Robert Alford | 178 | 84 | 30 | Cornerback | Defense |
2543850 | 69 | 186 | 1991 | 12 | 18 | Purdue | SS | Ricardo Allen | 175 | 84 | 27 | Strong Safety | Defense |
2555162 | 73 | 227 | 1994 | 11 | 4 | Louisiana State | MLB | Deion Jones | 185 | 103 | 24 | Middle Linebacker | Defense |
2555255 | 75 | 232 | 1993 | 7 | 1 | Minnesota | OLB | De’Vondre Campbell | 190 | 105 | 25 | Outside Linebacker | Defense |
2555543 | 73 | 216 | 1995 | 7 | 26 | Florida | FS | Keanu Neal | 185 | 98 | 23 | Free Safety | Defense |
2556445 | 70 | 211 | 1992 | 10 | 20 | Florida | CB | Brian Poole | 178 | 96 | 26 | Cornerback | Defense |
2532842 | 78 | 243 | 1989 | 1 | 20 | Arizona | QB | Nick Foles | 198 | 110 | 29 | Quarterback | Offense |
2540158 | 77 | 250 | 1990 | 11 | 10 | Stanford | TE | Zach Ertz | 196 | 113 | 28 | Tight End | Offense |
2552600 | 72 | 198 | 1993 | 5 | 24 | Southern California | WR | Nelson Agholor | 183 | 90 | 25 | Wide Receiver | Offense |
The plays data contains general information about certain plays in each match like its verbal description and other game characteristics related to the play.
df.plays <- read_delim(file = "data_raw/plays.csv", delim = ",")
gameId | playId | playDescription | quarter | down | yardsToGo | possessionTeam | playType | yardlineSide | yardlineNumber | offenseFormation | personnelO | defendersInTheBox | numberOfPassRushers | personnelD | typeDropback | preSnapVisitorScore | preSnapHomeScore | gameClock | absoluteYardlineNumber | penaltyCodes | penaltyJerseyNumbers | passResult | offensePlayResult | playResult | epa | isDefensivePI |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2018090600 | 75 | (15:00) M.Ryan pass short right to J.Jones pushed ob at ATL 30 for 10 yards (M.Jenkins). | 1 | 1 | 15 | ATL | play_type_pass | ATL | 20 | I_FORM | 2 RB, 1 TE, 2 WR | 7 | 4 | 4 DL, 2 LB, 5 DB | TRADITIONAL | 0 | 0 | 15:00:00 | 90 | NA | NA | C | 10 | 10 | 0.2618273 | FALSE |
2018090600 | 146 | (13:10) M.Ryan pass incomplete short right to C.Ridley (J.Mills, J.Hicks). | 1 | 1 | 10 | ATL | play_type_pass | PHI | 39 | SINGLEBACK | 1 RB, 1 TE, 3 WR | 7 | 4 | 4 DL, 2 LB, 5 DB | TRADITIONAL | 0 | 0 | 13:10:00 | 49 | NA | NA | I | 0 | 0 | -0.3723598 | FALSE |
2018090600 | 168 | (13:05) (Shotgun) M.Ryan pass incomplete short left to D.Freeman. | 1 | 2 | 10 | ATL | play_type_pass | PHI | 39 | SHOTGUN | 2 RB, 1 TE, 2 WR | 6 | 4 | 4 DL, 2 LB, 5 DB | TRADITIONAL | 0 | 0 | 13:05:00 | 49 | NA | NA | I | 0 | 0 | -0.7027787 | FALSE |
2018090600 | 190 | (13:01) (Shotgun) M.Ryan pass deep left to J.Jones to PHI 6 for 33 yards (R.Darby). | 1 | 3 | 10 | ATL | play_type_pass | PHI | 39 | SHOTGUN | 1 RB, 1 TE, 3 WR | 6 | 5 | 4 DL, 1 LB, 6 DB | SCRAMBLE_ROLLOUT_LEFT | 0 | 0 | 13:01:00 | 49 | NA | NA | C | 33 | 33 | 3.0475300 | FALSE |
2018090600 | 256 | (10:59) (Shotgun) M.Ryan pass incomplete short right to D.Freeman. | 1 | 3 | 1 | ATL | play_type_pass | PHI | 1 | SHOTGUN | 2 RB, 3 TE, 0 WR | 8 | 6 | 6 DL, 3 LB, 2 DB | TRADITIONAL | 0 | 0 | 10:59:00 | 11 | NA | NA | I | 0 | 0 | -0.8422719 | FALSE |
2018090600 | 320 | (10:10) (Shotgun) N.Foles pass short left to N.Agholor to PHI 8 for 4 yards (R.Alford). | 1 | 2 | 8 | PHI | play_type_pass | PHI | 4 | SHOTGUN | 1 RB, 1 TE, 3 WR | 7 | 4 | 3 DL, 3 LB, 5 DB | TRADITIONAL | 0 | 0 | 10:10:00 | 14 | NA | NA | C | 4 | 4 | -0.3440965 | FALSE |
2018090600 | 344 | (9:24) (Shotgun) N.Foles pass incomplete short left to D.Sproles (R.Allen). | 1 | 3 | 4 | PHI | play_type_pass | PHI | 8 | EMPTY | 1 RB, 2 TE, 2 WR | 6 | 4 | 3 DL, 3 LB, 5 DB | TRADITIONAL | 0 | 0 | 09:24:00 | 18 | NA | NA | I | 0 | 0 | -1.1922083 | FALSE |
2018090600 | 402 | (9:08) M.Ryan pass incomplete deep left to M.Sanu. | 1 | 1 | 10 | ATL | play_type_pass | PHI | 44 | SINGLEBACK | 1 RB, 1 TE, 3 WR | 7 | 4 | 4 DL, 2 LB, 5 DB | SCRAMBLE_ROLLOUT_LEFT | 0 | 0 | 09:08:00 | 54 | NA | NA | I | 0 | 0 | -0.4298626 | FALSE |
2018090600 | 492 | (7:01) M.Ryan pass short left to T.Coleman pushed ob at PHI 10 for 26 yards (R.McLeod). | 1 | 2 | 13 | ATL | play_type_pass | PHI | 36 | SINGLEBACK | 1 RB, 2 TE, 2 WR | 8 | 4 | 4 DL, 2 LB, 5 DB | TRADITIONAL | 0 | 0 | 07:01:00 | 46 | NA | NA | C | 26 | 26 | 1.8798040 | FALSE |
2018090600 | 521 | (6:19) M.Ryan pass short left to A.Hooper to PHI 7 for 3 yards (N.Gerry). Philadelphia challenged the pass completion ruling, and the play was Upheld. The ruling on the field stands. (Timeout #1.) | 1 | 1 | 10 | ATL | play_type_pass | PHI | 10 | SINGLEBACK | 1 RB, 2 TE, 2 WR | 8 | 3 | 4 DL, 2 LB, 5 DB | DESIGNED_ROLLOUT_LEFT | 0 | 0 | 06:19:00 | 20 | NA | NA | C | 3 | 3 | 0.0456651 | FALSE |
The last 17 datasets contain detailed information about each play in each game. Each play is broken down in so called frames where the players’ movements on the field are tracked.
df.weeks <- purrr::map_dfr(as.character(1:17),
~ readr::read_delim(delim = ",",
paste0("data_raw/week", ., ".csv")))
With the weekly datasets I am able to determine which players is playing for which team in the current season. So I filtered the weekly data and only kept the occurence of a player one single time.
players_to_team <- df.weeks %>%
filter(displayName != "Football") %>%
group_by(gameId, displayName) %>%
slice(1) %>%
ungroup() %>%
select(gameId, displayName, team)
players_to_team_out <- left_join(x = players_to_team, y = df.games, by = "gameId") %>%
mutate(team_abb = ifelse(team == "away", visitorTeamAbbr, homeTeamAbbr),
team = ifelse(team == "away", visitorTeam, homeTeam)) %>%
group_by(displayName) %>%
slice(1) %>%
ungroup() %>%
select(gameId, displayName, team_abb, team)
Finally, I am linking this information to the games dataset.
df.players <- left_join(x = df.players,
y = players_to_team,
by = "displayName")
I will begin with some exploratory data analysis as I have no prior knowledge about american football.
Most matches take place in December.
ggplot(data = df.games,
aes(x = month)) +
geom_bar(fill = "#013369", color = "black") +
labs(x = "Month", y = "Number of games")
A team has an average break of 7 days between two matches. Some teams have a little more, some teams have a little less.
df.games %>%
pivot_longer(cols = c(homeTeam, visitorTeam),
names_to = "group", values_to = "team") %>%
group_by(team) %>%
mutate(rest = as.double(difftime(gameDate, lag(gameDate)))) %>%
summarise(count = n(),
average_rest = mean(rest, na.rm = T)) %>%
ungroup() %>%
ggplot(data = .,
aes(x = reorder(team, average_rest), y = average_rest)) +
geom_bar(stat = "identity", color = "black", fill = "#013369") +
geom_hline(yintercept = 7.41, size = 1, color = "#d50a0a") +
coord_flip() +
labs(x = "", y = "Average break between games (in days)")
Barchart of players’ position category.
Most players are playing a defensive position, followed by offensive positions and special teams.
ggplot(data = df.players,
aes(x = fct_infreq(position_category), fill = position_category)) +
geom_bar(stat = "count", color = "black") +
scale_fill_manual(values = c("#013369","#d50a0a","#ffffff")) +
labs(x = "", y = "Number of players", fill = "") +
theme(legend.position = "bottom")
Barchart of players’ position filled by position category.
Most players in the league play as a Wide Receiver, followed by the Cornerback and Running Back.
ggplot(data = df.players,
aes(x = fct_rev(fct_infreq(position)), fill = position_category)) +
geom_bar(stat = "count", color = "black") +
scale_fill_manual(values = c("#013369","#d50a0a","#ffffff")) +
coord_flip() +
labs(x = "", y = "Number of players", fill = "") +
theme(legend.position = "bottom")
Barchart of players’ top colleges filled by position category.
The three top universities are Alabama, Ohio State and Florida State.
df.players %>%
group_by(collegeName, position_category) %>%
summarise(n = n()) %>%
mutate(pct = n / sum(n),
total = sum(n)) %>%
ungroup() %>%
arrange(desc(total)) %>%
slice(1:54) %>%
ggplot(data = .,
aes(x = reorder(collegeName, total), y = n, fill = position_category)) +
geom_bar(stat = "identity", color = "black") +
scale_fill_manual(values = c("#013369","#d50a0a","#ffffff")) +
coord_flip() +
labs(x = "", y = "Number of players", fill = "") +
theme(legend.position = "bottom")
Density plot of players’ age filled by position category.
Offense and defense players’ have a similiar distribution of age while special teams players are older on average.
ggplot(data = df.players,
aes(x = age, fill = position_category)) +
geom_density(color = "black", alpha = 0.75) +
scale_fill_manual(values = c("#013369","#d50a0a","#ffffff")) +
labs(x = "Age", y = "Density", fill = "") +
theme(legend.position = "bottom")
Box plots of players’ age filled by position.
This can be shown in more detail with each player’s position.
ggplot(data = df.players,
aes(x = reorder(position, age),
y = age,
fill = position_category)) +
geom_boxplot() +
coord_flip() +
scale_fill_manual(values = c("#013369","#d50a0a","#ffffff")) +
labs(x = "", y = "Age", fill = "") +
theme(legend.position = "bottom")
Box plots of teams’ players age.
In most teams the median age of all players is 25 years. The San Francisco 49ers have the youngest team.
ggplot(data = filter(df.players, !is.na(team)),
aes(x = reorder(team, age, FUN = median),
y = age)) +
geom_boxplot(fill = "white", color = "black") +
coord_flip() +
labs(x = "", y = "Age", fill = "") +
theme(legend.position = "bottom")
Scatterplot of players’ height and weight filled by position category.
Defensive players are compared to offensive players more heavy for a given height. There is even a bent in the distribution of the defensive players.
ggplot(data = df.players,
aes(x = height_eu, y = weight_eu, fill = position_category)) +
geom_point(size = 2.5, alpha = 1, shape = 21, colour = "black") +
scale_fill_manual(values = c("#013369","#d50a0a","#ffffff")) +
facet_wrap(. ~ position_category) +
labs(x = "Height (cm)", y = "Weight (kg)", fill = "") +
theme(legend.position = "bottom")
Scatterplot of players’ height and weight depending on position filled by position category.
The bent in the defensive player’s distribution of weight and height is explained by players on the Defensive Tackle and Nose Tackle position, i.e. players with an extraordinary weight compared to their height. I fitted an overall regression line of the model weight = β0 + β1height to each panel.
ggplot(data = df.players,
aes(x = height_eu, y = weight_eu, fill = position_category)) +
geom_point(size = 2.5, alpha = 1, shape = 21, colour = "black") +
scale_fill_manual(values = c("#013369", "#d50a0a", "#ffffff")) +
scale_x_continuous(limits = c(160, 210),
breaks = c(160, 170, 180, 190, 200, 210)) +
scale_y_continuous(limits = c(70, 170),
breaks = c(70, 90, 110, 130, 150, 170)) +
geom_abline(intercept = -160, slope = 1.4) +
facet_wrap(. ~ position) +
labs(x = "Height (cm)", y = "Weight (kg)", fill = "") +
theme(legend.position = "bottom")
Let us take a closer look into one single play of the match between the Philadelphia Eagles and Atlanta Falcons. The Atlanta Falcos are in the offense and the Philadelphia Eagles are in the defense.
library(gganimate)
df.week1 <- read_delim(file = "data_raw/week1.csv", delim = ",")
# PlayId 75
phi_atl <-
filter(df.week1, gameId == "2018090600" & playId %in% c("190")) %>%
mutate(time_comb = paste(hour(time),
minute(time),
floor(second(time)), sep =":"))
plot <- phi_atl %>%
group_by(playId, time_comb, displayName) %>%
slice(1) %>%
ggplot(data = .,
aes(x=x, y=y, color=team, label=position)) +
geom_point() +
geom_text(aes(label=position),hjust=0, vjust=0) +
scale_color_manual(values = c("#a71930","#624a2e","#004c54")) +
scale_x_continuous(limits = c(0,100), breaks = seq(10,90,10),
labels = c(10,20,30,40,50,40,30,20,10)) +
scale_y_continuous(limits = c(0,53.33), breaks = c(20,33,33)) +
geom_vline(xintercept = 0, size = 1.5, color = "white") +
geom_vline(xintercept = 100, size = 1.5, color = "white") +
labs(x = "", color = "") +
theme(axis.title.y=element_blank(),
axis.text.y=element_blank(),
axis.ticks.y=element_blank(),
axis.ticks.x=element_blank(),
panel.background = element_rect(fill = "#7eaf34",
colour = "#7eaf34",
size = 0.5, linetype = "solid"),
panel.grid.minor.y = element_blank(),
legend.position = "none") +
transition_time(time) +
labs(title = "Time: {frame_time}",
subtitle = df.plays[df.plays$gameId == "2018090600" &
df.plays$playId == "190",]$playDescription)
# Animation
animate(plot, end_pause = 10)
You can check this on YouTube!
I calculated the occurences of all offense formations for each recorded play over all available matches. The Shotgun is the most common formation.
df.plays %>%
group_by(possessionTeam, offenseFormation) %>%
summarise(count = n()) %>%
mutate(total = sum(count),
share = count/total) %>%
ggplot(data = .,
aes(x = possessionTeam, y = share, fill=offenseFormation)) +
geom_bar(stat="identity", position="stack") +
coord_flip() +
scale_y_continuous(labels = scales::percent) +
labs(x = "", y = "Share", fill = "Offense Formation")