In the NFL, the ability to run the football effectively has long been a key factor in a team’s success. Although the modern game has increasingly emphasized passing, rushing remains an essential component of offensive strategy, influencing time of possession, field position, and overall game control. A strong ground game can wear down defenses, open up opportunities in the passing attack, and provide crucial balance to an offense. Running the ball was once limited mostly to running backs and fullbacks, but this has changed in the modern NFL: quarterbacks and skill-position players (wide receivers and tight ends) are also threats to run the ball in today’s offenses.
This project uses R to explore and visualize NFL rushing statistics across a variety of metrics. The objective is to identify the one year between 2014 and 2023 in which the NFL as a whole, across all 32 teams, had the most success running the football. 2014 is the year I started closely following the NFL, and choosing it as the start year creates a 10-year period of NFL rushing data.
The following libraries are used for my data visualizations:
library(ggplot2)
library(lubridate)
library(dplyr)
library(scales)
library(ggthemes)
library(RColorBrewer)
library(plotly)
The data was found on the Kaggle page titled “NFL Rushing Statistics (2001-2023)”. The dataset originates from Pro Football Reference, and libraries such as BeautifulSoup and Pandas were used to scrape and clean the data into a CSV file (“rushing_cleaned.csv”). Each row records one player’s rushing statistics for one season, with variables such as rushing yards per attempt alongside the player’s age and the year. For example, a player who was in the NFL from 2016-2020 would have 5 unique observations (1 for each season). Below are some general details about the dataset.
# Reading in the dataset into a dataframe
dataset <- "C:/Users/Admin/Downloads/IS460/R/Assignment/rushing_cleaned.csv"
df_orig <- read.csv(dataset, header = TRUE)
# Basic Analysis of the Dataframe and how it is organized
# getting rid of the X column that is simply the row number
df_noX <- df_orig[,-1]
df <- df_noX
head(df)
## Player Age G GS rAtt rYds rTD r1D rLng rY.A rY.g Fmb Year
## 1 Stephen Davis 27 16 16 356 1432 5 75 32 4.0 89.5 6 2001
## 2 Corey Dillon 27 16 16 340 1315 10 69 96 3.9 82.2 5 2001
## 3 LaDainian Tomlinson 22 16 16 339 1236 10 68 54 3.6 77.3 8 2001
## 4 Curtis Martin 28 16 16 333 1513 10 78 47 4.5 94.6 2 2001
## 5 Priest Holmes 28 16 16 327 1555 8 81 41 4.8 97.2 4 2001
## 6 Eddie George 28 16 16 315 939 5 40 27 3.0 58.7 8 2001
dim(df)
## [1] 7516 13
colnames(df)
## [1] "Player" "Age" "G" "GS" "rAtt" "rYds" "rTD" "r1D"
## [9] "rLng" "rY.A" "rY.g" "Fmb" "Year"
colSums(is.na(df)) # no missing values
## Player Age G GS rAtt rYds rTD r1D rLng rY.A rY.g
## 0 0 0 0 0 0 0 0 0 0 0
## Fmb Year
## 0 0
summary(df)
## Player Age G GS
## Length:7516 Min. :21.00 Min. : 1.00 Min. : 0.000
## Class :character 1st Qu.:24.00 1st Qu.: 8.00 1st Qu.: 0.000
## Mode :character Median :26.00 Median :14.00 Median : 4.000
## Mean :26.56 Mean :11.89 Mean : 5.896
## 3rd Qu.:28.00 3rd Qu.:16.00 3rd Qu.:11.000
## Max. :45.00 Max. :17.00 Max. :17.000
## rAtt rYds rTD r1D
## Min. : 1.00 Min. : -31.0 Min. : 0.000 Min. : 0.000
## 1st Qu.: 2.00 1st Qu.: 7.0 1st Qu.: 0.000 1st Qu.: 0.000
## Median : 9.00 Median : 36.0 Median : 0.000 Median : 2.000
## Mean : 42.84 Mean : 180.4 Mean : 1.313 Mean : 9.848
## 3rd Qu.: 46.00 3rd Qu.: 187.0 3rd Qu.: 1.000 3rd Qu.: 11.000
## Max. :416.00 Max. :2097.0 Max. :28.000 Max. :109.000
## rLng rY.A rY.g Fmb
## Min. :-28.00 Min. :-28.00 Min. :-16.00 Min. : 0.00
## 1st Qu.: 6.00 1st Qu.: 2.20 1st Qu.: 0.70 1st Qu.: 0.00
## Median : 13.00 Median : 3.80 Median : 4.00 Median : 1.00
## Mean : 18.48 Mean : 4.04 Mean : 14.39 Mean : 1.73
## 3rd Qu.: 25.00 3rd Qu.: 5.00 3rd Qu.: 18.23 3rd Qu.: 2.00
## Max. : 99.00 Max. : 68.00 Max. :131.10 Max. :23.00
## Year
## Min. :2001
## 1st Qu.:2006
## Median :2012
## Mean :2012
## 3rd Qu.:2018
## Max. :2023
df$Year <- as.factor(df$Year)
# Determine the last 10 years in the dataset
last_10_years <- sort(unique(df$Year), decreasing = TRUE)[1:10]
# Filter the dataset for the last 10 years and include only players with 75+ rushing attempts
df_filtered <- df %>%
filter(Year %in% last_10_years, rAtt >= 75)
summary(df_filtered)
## Player Age G GS
## Length:614 Min. :21.0 Min. : 6.00 Min. : 0.00
## Class :character 1st Qu.:23.0 1st Qu.:13.00 1st Qu.: 4.00
## Mode :character Median :25.0 Median :15.00 Median : 9.00
## Mean :25.4 Mean :14.21 Mean : 8.98
## 3rd Qu.:27.0 3rd Qu.:16.00 3rd Qu.:14.00
## Max. :37.0 Max. :17.00 Max. :17.00
##
## rAtt rYds rTD r1D
## Min. : 75.0 Min. : 207.0 Min. : 0.000 Min. : 0.00
## 1st Qu.:104.0 1st Qu.: 445.0 1st Qu.: 2.000 1st Qu.: 23.00
## Median :149.5 Median : 640.0 Median : 4.000 Median : 34.00
## Mean :160.9 Mean : 699.7 Mean : 5.145 Mean : 36.69
## 3rd Qu.:207.0 3rd Qu.: 898.0 3rd Qu.: 7.000 3rd Qu.: 48.00
## Max. :392.0 Max. :2027.0 Max. :18.000 Max. :107.00
##
## rLng rY.A rY.g Fmb
## Min. :15.00 Min. :2.600 Min. : 15.90 Min. : 0.000
## 1st Qu.:28.00 1st Qu.:3.825 1st Qu.: 33.40 1st Qu.: 1.000
## Median :40.00 Median :4.300 Median : 47.95 Median : 2.000
## Mean :43.09 Mean :4.352 Mean : 49.64 Mean : 2.228
## 3rd Qu.:55.00 3rd Qu.:4.775 3rd Qu.: 62.90 3rd Qu.: 3.000
## Max. :99.00 Max. :7.800 Max. :126.70 Max. :16.000
##
## Year
## 2020 : 70
## 2023 : 65
## 2022 : 63
## 2014 : 62
## 2021 : 62
## 2018 : 61
## (Other):231
The original dataset has 7,516 observations and 13 features, and the filtered dataset has 614 observations. Summary statistics are shown above for both the original data and the filtered data (players with 75+ rush attempts in a season from 2014-2023).
It is important to note that my analysis focuses only on players with at least 75 rushing attempts in a season. Player position was not a variable in the dataset, so this threshold was a way to select only the starting-caliber players who made a meaningful contribution to their team in a given season. These players are typically running backs, but they can also be quarterbacks or wide receivers who are frequent rushers. Since every player who recorded a rush was originally included, it was necessary to filter out those with few rush attempts so that statistics such as the mean rushing yards per game are meaningful in my analysis. The median before filtering is not meaningful either, because so many players record only a few rush attempts in a given season.
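To illustrate why the unfiltered mean and median say little about starting-caliber rushers, here is a minimal sketch on made-up numbers. The `toy` data frame and its players are hypothetical and are not rows from the real dataset; only the column names (`rAtt`, `rY.g`) and the 75-attempt threshold come from the analysis above.

```r
# Toy data (made up for illustration, not taken from the real dataset):
# two high-volume backs plus four players with only a handful of carries.
toy <- data.frame(
  Player = c("Back A", "Back B", "QB C", "WR D", "WR E", "P F"),
  rAtt   = c(250, 180, 12, 3, 1, 1),
  rY.g   = c(85.0, 60.0, 4.0, 1.5, 0.5, -0.5)
)

# Mean and median yards per game before filtering
mean_all   <- mean(toy$rY.g)
median_all <- median(toy$rY.g)

# Same statistics after keeping only players with 75+ attempts
qualified <- toy[toy$rAtt >= 75, ]
mean_q    <- mean(qualified$rY.g)
median_q  <- median(qualified$rY.g)

print(c(mean_all = mean_all, median_all = median_all,
        mean_qualified = mean_q, median_qualified = median_q))
```

In this toy example the unfiltered median is 2.75 yards per game, while the median among qualified players is 72.5, which mirrors the gap between the two `summary()` outputs shown earlier.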
The dataset section above shows the summary statistics for only those players with at least 75 attempts in a season from 2014-2023. These statistics differ substantially from those of the original dataset because only starting-caliber players with a high volume of rush attempts are included.
Note: This graph is the only one that includes all years (2001-2023), as I wanted to learn if the shift to more passing over the years affected the yardage production of the top running backs.
# First data visualization
# Sort by rushing yards (descending) and keep the 10 highest single-season totals
df_p1 <- df[order(-df$rYds),][1:10,]
# Creating the bar chart
p1 <- ggplot(df_p1, aes(x= reorder(Player, rYds), y = rYds, fill = Year)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(title="Top 10 NFL Single Season Rushers (2001-2023)", x="Player Name", y="Rushing Yards", fill = "Year") +
theme_light() +
geom_text(aes(label = rYds) ,hjust= 3.5)
p1
The goal of the first visualization was to learn in which years a player had an outstanding season running the football, and to determine whether the shift of NFL offenses toward the pass affected the top running backs’ rushing yard totals. The graph shows that the 2000s and early 2010s are more heavily represented than the late 2010s and early 2020s. Only 2 of the top 10 single-season rushing totals came in 2014 or later (DeMarco Murray and Derrick Henry). This suggests that it was easier for running backs to have breakout seasons in the earlier time frame (2001-2013) than in the later one (2014-2023). Several factors could explain this, including more rushing attempts for running backs and more emphasis on establishing the run game.
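The era comparison above can be checked with a quick count. The sketch below uses a made-up vector of years (not the actual top 10) and assumes, as in the earlier code, that `Year` has been converted to a factor, so it has to be converted back to a number before it can be compared with 2014.

```r
# Hypothetical years for five top seasons (illustrative values only)
Year <- factor(c(2003, 2006, 2012, 2014, 2020))

# A factor cannot be compared numerically; convert via character first
year_num <- as.numeric(as.character(Year))

# Label each season by era and count the seasons per era
era    <- ifelse(year_num >= 2014, "2014-2023", "2001-2013")
counts <- table(era)
print(counts)
```

Applied to the real `df_p1`, the same steps would give the split between the two eras that the paragraph describes.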
# Second data visualization
# For each year calculate the mean number of fumbles, mean yards per attempt, and mean rushing TDs
df_summary <- df_filtered %>%
group_by(Year) %>%
summarise(mean_rY_A = mean(rY.A),
mean_Fmb = mean(Fmb),
mean_rTD = mean(rTD))
# Create the multiple line plot
p2 <- ggplot(df_summary, aes(x = Year)) +
geom_line(aes(y = mean_rY_A, color = "Mean Yards per Attempt", group = 1), linewidth = 2) +
geom_line(aes(y = mean_Fmb, color = "Mean Fumbles per Player", group = 1), linewidth = 2) +
geom_line(aes(y = mean_rTD, color = "Mean Rushing Touchdowns per Player", group = 1), linewidth = 2) +
# Add labels
geom_text(aes(y = mean_rY_A, label = round(mean_rY_A, 2)),
vjust = .8, size = 3, color = "black") +
geom_text(aes(y = mean_Fmb, label = round(mean_Fmb, 2)),
vjust = -1.5, size = 3, color = "black") +
geom_text(aes(y = mean_rTD, label = round(mean_rTD, 2)),
vjust = -2, size = 3, color = "black") +
# Adjust y-axis ticks
scale_y_continuous(breaks = seq(0, 7, by = 0.5), limits = c(1.5, 7)) +
labs(title = "NFL Rushing Statistics by Year for Players with 75+ Attempts",
x = "Year",
y = "Average Yards, Fumbles, or TDs",
color = "Metrics for each Season") +
theme_light()
p2
This multiple line plot depicts how three rushing statistics changed from 2014-2023: rushing yards per attempt, fumbles per player, and rushing touchdowns per player. Teams want yards per attempt and touchdowns to be as high as possible, as they lead to high-quality drives and ultimately more points scored. Teams want fumbles to be as low as possible, since a fumble risks turning the ball over to the opposing team. The average number of rushing touchdowns per player generally increased over the years, peaking at 5.64 touchdowns per player in 2019. This could be a result of NFL offenses improving over time, with rule changes favoring offense and improvements in play design, which gives teams more opportunities in the red zone to run the ball in for a touchdown than in previous years. The average yards per rushing attempt remained much steadier, staying in the 4s, with a general increase beginning in 2019 and a peak in 2022 at 4.66 yards per rush attempt. It is likely steadier over time because some rush attempts are called just to keep the defense honest; not every run play is designed to be a big yardage play. The average fumbles per player per season was the most volatile of the three, reaching a minimum of 1.78 in 2016 and a maximum of 2.67 in 2022. Fumbles are generally a more volatile statistic, as they involve aspects of luck, including how the ball bounces when it hits the ground. Nevertheless, the peak in 2022 might reflect coaches instructing their defenses to try to produce more turnovers (for example, by trying to punch the ball out of the offensive player’s hands), as winning the turnover battle has been shown to be a major factor in winning games.
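The peak and minimum years quoted above can also be read off programmatically instead of from the plot labels. This is a minimal sketch using a hypothetical `toy_summary` table with the same column names as `df_summary`; the numbers are invented for illustration and are not the real results.

```r
# Hypothetical yearly means (illustrative numbers only, not the real results)
toy_summary <- data.frame(
  Year      = c(2014, 2015, 2016),
  mean_rY_A = c(4.2, 4.1, 4.4),
  mean_Fmb  = c(2.3, 1.9, 2.1),
  mean_rTD  = c(4.8, 5.2, 5.0)
)

# Yards per attempt and touchdowns: higher is better, so take the maximum
best_ypa <- toy_summary$Year[which.max(toy_summary$mean_rY_A)]
best_td  <- toy_summary$Year[which.max(toy_summary$mean_rTD)]

# Fumbles: lower is better, so take the minimum
fewest_fmb <- toy_summary$Year[which.min(toy_summary$mean_Fmb)]

print(c(best_ypa = best_ypa, best_td = best_td, fewest_fmb = fewest_fmb))
```

Note that in the real `df_summary` the `Year` column is a factor, so the returned values would be factor levels rather than numbers.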
# Data Visualization 3
# Nested Pie Chart
# only include players with 75+ rush attempts in a season
df3 <- df_orig %>%
filter(rAtt >= 75, Year > 2013) # Ensure only years > 2013
plot_ly() %>%
# Outer ring
add_trace(data = df3, # df3 is already limited to Year > 2013
labels = ~Year,
values = ~rYds,
type = "pie",
hole = 0.7,
textposition = "inside",
hovertemplate = "Year: %{label}<br>Rushing Yards: %{value}<br>Percent: %{percent}<extra></extra>",
domain = list(
x= c(.16,.84),
y= c(.16,.84))) %>%
# Inner pie chart
add_trace(data = df3, # df3 is already limited to Year > 2013
labels = ~Year,
values = ~rTD,
type = "pie",
hole = 0,
textposition = "inside",
hovertemplate = "Year: %{label}<br>Rushing Touchdowns: %{value}<br>Percent: %{percent}<extra></extra>",
domain = list(
x= c(.27,.73),
y= c(.27,.73))) %>%
layout(title = "Comparing Rushing Yards and Rushing Touchdowns Between 2014-2023")
The nested pie chart compares total rushing yards (outer ring) and total rushing touchdowns (inner pie) for each season from 2014 through 2023. For total rushing yards, the more recent seasons have bigger slices (more yards) than the earlier years. The top 4 seasons are the 4 years in the 2020s, with 2022 accounting for 11.3% of the total rush yards (48,623, which is nearly 3,000 more than the second-best year). A possible explanation is that as offenses continue to improve with each season, more total yards are gained, which leads to more rushing yards. Another reason could be that efficient passing attacks create more opportunities for big plays on the ground, as defenses have to focus more of their effort on limiting elite quarterbacks. Total rushing touchdowns have a slightly different order but follow a similar trend. The seasons from the 2020s all appear again in the top 5, with 2020 having the most rushing TDs (386, which is 12.2% of the 10-year sample). This could be a result of the explosion of dual-threat quarterbacks, such as Jalen Hurts, Lamar Jackson, and Josh Allen, who can score many touchdowns on the ground in short goal-line situations.
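The pie traces receive one row per player-season, and to my understanding plotly sums the values of rows that share the same label, which is how the per-year slices arise. The same totals and percentages can be computed explicitly with dplyr. The sketch below uses a small made-up `toy` data frame rather than the real data, so the numbers are purely illustrative.

```r
library(dplyr)

# Toy player-season rows (made up): two players in 2014, three in 2015
toy <- data.frame(
  Year = c(2014, 2014, 2015, 2015, 2015),
  rYds = c(1000, 500, 1200, 800, 500),
  rTD  = c(10, 5, 8, 4, 3)
)

# Total yards and touchdowns per year, plus each year's share of the sample
shares <- toy %>%
  group_by(Year) %>%
  summarise(total_rYds = sum(rYds), total_rTD = sum(rTD)) %>%
  mutate(pct_rYds = round(100 * total_rYds / sum(total_rYds), 1),
         pct_rTD  = round(100 * total_rTD / sum(total_rTD), 1))

print(shares)
```

Running the same pipeline on `df3` should reproduce the values shown in the chart's hover text, such as each season's share of total rushing yards.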
# Data Visualization 4
# set df4 to original dataset
df4 <- df_noX
# Determine the last 10 years in the dataset
last_10_years <- sort(unique(df4$Year), decreasing = TRUE)[1:10]
# Filter the dataset for the last 10 years
df_filtered2 <- df4 %>%
filter(Year %in% last_10_years)
# Get the top 10 players with the most rushing yards per year
df_top10 <- df_filtered2 %>%
group_by(Year) %>%
slice_max(order_by = rYds, n = 10, with_ties = FALSE) %>%
ungroup()
# Compute the mean rushing yards of the top 10 rushers for each year
yearly_means <- df_top10 %>%
group_by(Year) %>%
summarise(mean_rYds = round(mean(rYds), 1),
max_rYds = max(rYds))
# Merge the means back into the main dataset for the labels
df_top10 <- df_top10 %>%
left_join(yearly_means, by = "Year")
# Trellis plot
ggplot(df_top10, aes(x = reorder(Player, rYds), y = rYds, fill = as.factor(Year))) +
geom_col(show.legend = F) + # Bar plot without legend for players
coord_flip() + # Flip for horizontal bars
facet_wrap(~ Year, scales = "free_y") + # Separate plots per year
labs(title = "Top 10 NFL Rushing Yards Leaders by Year (2014-2023)",
x = NULL, # Remove x-axis label (player names)
y = "Rushing Yards") +
theme_light() + # Use light theme
theme(axis.text.y = element_blank(), # Remove player names
strip.text = element_text(size = 12, face = "bold"),
axis.ticks.y = element_blank()) + # Remove axis ticks
geom_text(data = yearly_means, aes(x = 1, y = max_rYds + 100,
label = paste0("Mean:\n", mean_rYds)),
hjust = 0.7, vjust = 0, size = 3, fontface = "bold", inherit.aes = FALSE)
The trellis chart shows the top 10 rushing yards leaders for each season from 2014-2023. There is no clear trend here, as the 3 years with the highest means are spread out (2022, 2014, and 2019). The variance among the top 10 rushers appears to be smaller in the earlier years than in the more recent years. A possible reason for the lack of a clear trend is that each NFL team has unique strengths and weaknesses on offense, so each team constructs its own game plan to give itself the best chance to win. For example, a team like the Bengals with an elite quarterback and wide receivers will likely choose to throw the ball a lot, while a team like the Eagles with a very strong offensive line will lean more on the ground game. This could explain why the yearly means in this plot seem mostly random.
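The remark about variance is based on the look of the bars, and it could be checked numerically by computing the standard deviation of the top 10 for each year. Below is a minimal sketch on a made-up `toy_top` data frame with three rushers per year instead of ten; the values are hypothetical.

```r
library(dplyr)

# Toy top-rusher totals (made up): a tight group in 2014, a wide one in 2023
toy_top <- data.frame(
  Year = rep(c(2014, 2023), each = 3),
  rYds = c(1300, 1250, 1200, 1600, 1200, 1000)
)

# Mean and standard deviation of rushing yards within each year
spread <- toy_top %>%
  group_by(Year) %>%
  summarise(mean_rYds = mean(rYds), sd_rYds = sd(rYds))

print(spread)
```

Applying the same `summarise()` call to `df_top10` would show whether the spread among the top 10 really grows in the more recent seasons.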
# Data Visualization 5
# Donut Chart
# Only include players with 75+ rush attempts in a season
df5 <- df_orig %>%
filter(rAtt >= 75, Year > 2013) # Ensure only years > 2013
# Compute mean rushing yards per game per year
mean_rYg_per_year <- df5 %>%
group_by(Year) %>%
summarise(mean_rYg = mean(rY.g, na.rm = TRUE)) # Compute mean per year
# Compute total number of qualified rushers (each row represents a player in a given year)
total_rushers <- nrow(df5)
# Create the donut chart
plot_ly() %>%
# First Downs Donut (Outer Ring)
add_trace(data = df5,
labels = ~Year,
values = ~r1D,
type = "pie",
hole = 0.7,
textposition = "inside",
hovertemplate = "Year: %{label}<br>First Downs: %{value}<br>Percent: %{percent}<extra></extra>",
domain = list(
x= c(.16,.84),
y= c(.16,.84))) %>%
# Mean Yards/Game Donut (Inner Ring)
add_trace(data = mean_rYg_per_year,
labels = ~Year,
values = ~mean_rYg,
type = "pie",
hole = 0.3,
textposition = "inside",
hovertemplate = "Year: %{label}<br>Avg Yards/Game: %{value}<br>Percent: %{percent}<extra></extra>",
domain = list(
x= c(.27,.73),
y= c(.27,.73))) %>%
# Add title and text in the center of the donut
layout(title = "Comparing First Downs and Yards/Game Between 2014-2023",
annotations = list(
list(
x = 0.5, # Center position
y = 0.5,
text = paste("Total Qualified<br>Players:<br>", total_rushers),
showarrow = FALSE,
font = list(size = 7.5, color = "black"),
xanchor = "center",
yanchor = "middle",
align = "center")))
The nested donut chart compares the total first downs per year (outer ring) with the average yards per game per player (inner ring) for the 614 qualified players from 2014-2023. The seasons from the 2020s once again feature prominently in the top 5 for total first downs, with 2022 leading the group at 2,552 first downs (11.3%). If more points were scored in the more recent years, more first downs generally occurred as well, since first downs keep a team’s drive going. For average yards per game per player, the order is different, with more years from the 2010s featured towards the top. 2019 has the highest average at 53.08 yards per game per player, with 2022 right behind at 52.62 yards per game. The most likely explanation for the different order is that teams in the earlier years relied more on their starting running backs than teams do today. In other words, teams would give more rush attempts to a single starter (known as a bell-cow running back). Today, more teams, like the Lions, spread their rushing attempts more evenly among their running backs. This lowers the average yards per game per rusher because recent starting running backs generally have fewer opportunities per game than starters of earlier seasons. However, as in the other visualizations, the 2022 season stands out and does not follow this pattern.
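The bell-cow explanation is a hypothesis, and one way to probe it would be to compare how many rushers qualify each year and how many attempts per game they average. The sketch below does this on a made-up `toy` data frame; the `rAtt` and `G` column names match the dataset, but the values and the per-game workload measure are my own illustrative assumptions.

```r
library(dplyr)

# Toy player-season rows (made up): two heavy-workload backs in 2014,
# three backs sharing carries in 2023
toy <- data.frame(
  Year = c(2014, 2014, 2023, 2023, 2023),
  rAtt = c(300, 280, 200, 180, 150),
  G    = c(16, 16, 17, 17, 15)
)

# Number of qualified rushers and their average attempts per game, by year
workload <- toy %>%
  filter(rAtt >= 75) %>%
  group_by(Year) %>%
  summarise(qualified = n(),
            mean_att_per_game = mean(rAtt / G))

print(workload)
```

If the explanation holds, the real data would show more qualified rushers and fewer attempts per game per rusher in recent seasons, as in this toy example.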
After analyzing the results from the 5 graphs, the evidence points to the 2022 season as the most successful year between 2014 and 2023 for running the ball in the NFL. In the 2022 season, players had the highest average rush yards per attempt (4.66) and the second-highest average rushing touchdowns per player (5.51) of the 10-year period. 2022 also featured the most total rushing yards (48,623) and the third-highest total rushing touchdowns (347) from 2014-2023. Among the top 10 rushers in each season, 2022 had the highest mean, at approximately 1,300 rush yards per player. Finally, the most first downs were gained in 2022 (2,552), and it had the second-highest yards per player per game (52.62). Taken together, these results make 2022 the most successful year for the rushing attack across all 32 teams. In an era when passing is becoming more prominent, the ground game remains an important key to winning. There truly was something special about the 2022 NFL season.
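The conclusion above weighs several metrics by eye. One possible way to make that comparison explicit, which is not the method used in this analysis, is to rank the years on each metric and average the ranks. The sketch below uses a hypothetical `metrics` table with placeholder years "Y1" to "Y3" and invented values.

```r
# Hypothetical per-year metrics (placeholder years and invented values)
metrics <- data.frame(
  Year      = c("Y1", "Y2", "Y3"),
  ypa       = c(4.2, 4.4, 4.6),
  total_yds = c(40000, 45000, 48000),
  total_td  = c(380, 340, 350)
)

# rank(-x) assigns rank 1 to the largest value of x
ranks <- data.frame(
  Year      = metrics$Year,
  ypa       = rank(-metrics$ypa),
  total_yds = rank(-metrics$total_yds),
  total_td  = rank(-metrics$total_td)
)

# Average rank across the three metrics; the lowest average is the best year
ranks$mean_rank <- rowMeans(ranks[, c("ypa", "total_yds", "total_td")])
best_year <- ranks$Year[which.min(ranks$mean_rank)]

print(ranks)
print(best_year)
```

In this toy example "Y3" comes out on top without leading every metric, much like 2022, which leads most but not all of the statistics discussed above. Averaging ranks weights every metric equally, which is itself an assumption.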