Web Scraping and Analysis of 2025 NFL Rushing Data

Author

Ryan Zaccour

Introduction and Question

As a major fan of the National Football League (NFL) and of analytics, I’m drawn to questions about the efficiency and performance sustainability of running backs (RBs). Some people say that it is best for a team to rely on a “workhorse” or “bell-cow” running back to maintain a dominant running game. However, modern analytics often suggests a “workload penalty”, where efficiency (Yards Per Attempt) declines as carries increase due to fatigue, injury risk, and matchups against solid defenses.

I would like to know more about the relationship between rushing volume and three key measures of success: per-carry efficiency, outlier play potential, and goal-line scoring effectiveness. Examining these three measures can help identify which running backs are truly “worth it” and likely to perform well in the future.

Research Question: Does an increase in rushing volume (attempts) among the NFL’s leading rushers lead to a decline in per-carry efficiency (Yards Per Attempt), and how do factors like player availability, single-play outliers, and goal-line usage influence this apparent relationship?

Through my analysis, football enthusiasts, fantasy league managers, and team analysts can learn which running backs offer efficient value and which have statistics inflated by exceptional outlier plays or a low number of carries. Overall, understanding the true cost and benefit of high rushing volume is important for predicting player sustainability and building effective offensive schemes.

Method

I scraped statistical data for the top 250 rushing leaders in the NFL as of Week 11 of the 2025 season. For each player, I pulled the name, games played, rushing attempts, rushing yards, rushing yards per game, average yards per rush, rushing touchdowns, and longest rush. I then created a series of visualizations to compare players across categories like workload to see how their efficiency and production differ.

To get this data, I web-scraped from the paginated rushing statistics table on CBS Sports, which lists 50 players per page. Here is an example of the first page of this data. https://www.cbssports.com/nfl/stats/player/rushing/nfl/regular/all/?page=

I used a for loop to access and scrape five separate pages of the CBS Sports statistics table, retrieving 250 rows (players) in total. To scrape ethically and avoid overloading the server, I added a three-second sleep delay between pages. Once all the raw data was collected, I combined it into one data frame and began the cleaning process. I used clean_names() from the janitor package to turn the messy column headers into clear, usable names, then converted all the statistical fields from character to numeric so they could be used in calculations and visuals. Here is a portion of the final cleaned data:
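
The loop described above can be sketched roughly as follows. This is a minimal sketch, not the exact code behind this report: the helper names `parse_stats_page()` and `scrape_pages()` are hypothetical, and it assumes that the page number is appended directly to the `?page=` query and that the statistics sit in the first `<table>` on each page.

```r
library(rvest)
library(dplyr)

# Assumption: the page number is appended directly after "?page=".
base_url <- "https://www.cbssports.com/nfl/stats/player/rushing/nfl/regular/all/?page="

# Hypothetical helper: pull the first <table> from a parsed page, keeping
# every column as character so the pages can be stacked safely.
parse_stats_page <- function(page_html, page_number) {
  page_html %>%
    html_element("table") %>%
    html_table(convert = FALSE) %>%
    mutate(source_page = as.integer(page_number))
}

# Hypothetical helper: loop over the pages, pausing between requests.
# `fetch` defaults to rvest's read_html() and can be swapped for a stub.
scrape_pages <- function(pages = 1:5, delay_seconds = 3, fetch = read_html) {
  results <- vector("list", length(pages))
  for (i in seq_along(pages)) {
    message("Scraping Page: ", pages[i], " of ", length(pages))
    page_html <- fetch(paste0(base_url, pages[i]))
    results[[i]] <- parse_stats_page(page_html, pages[i])
    # Be polite to the server: wait before requesting the next page
    if (i < length(pages)) Sys.sleep(delay_seconds)
  }
  bind_rows(results)
}
```

Under these assumptions, `final_data <- scrape_pages()` would request the five pages with a three-second pause between them and return one combined data frame with a `source_page` column.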

Importing Data

Packages loaded: tidyverse 2.0.0 (dplyr 1.1.4, readr 2.1.5, forcats 1.0.0, stringr 1.5.1, ggplot2 4.0.0, tibble 3.3.0, lubridate 1.9.4, tidyr 1.3.1, purrr 1.1.0), rvest, magrittr, and janitor.


Scraping Page: 1 of 5

Scraping Page: 2 of 5

Scraping Page: 3 of 5

Scraping Page: 4 of 5

Scraping Page: 5 of 5

# A tibble: 6 × 9
  Player_Name            Games_Played Rushing_Attempts Rushing_Yards Rushing_YPG
  <chr>                         <dbl>            <dbl>         <dbl>       <dbl>
1 "J. Taylor\n         …           11              205          1197       109. 
2 "J. Cook\n           …           11              199          1084        98.5
3 "J. Gibbs\n          …           11              155           951        86.5
4 "D. Achane\n         …           11              164           900        81.8
5 "J. Williams\n       …           11              181           896        81.5
6 "D. Henry\n          …           11              187           871        79.2
# ℹ 4 more variables: AVG_Yards_Per_Rush <dbl>, Rushing_Tds <dbl>,
#   Longest_Rush <dbl>, source_page <int>
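
The `Player_Name` values in the preview still contain a line break and padding after the name, which usually means extra text from the same table cell came along with it. One way to keep only the name is sketched below. It assumes the name is everything before the first line break; the helper name `clean_player_name()` is hypothetical and not part of the original pipeline.

```r
library(stringr)

# Hypothetical helper: keep the text before the first line break and
# collapse any leftover whitespace.
clean_player_name <- function(x) {
  str_squish(str_extract(x, "^[^\n]+"))
}

clean_player_name("J. Taylor\n            extra cell text")
```

Applied inside `mutate()`, this would leave the axis labels in the bar charts free of the trailing cell text.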


Cleaning Data

The code below cleans the data; the comments explain how and why each step is done.

# Clean the column names, then convert the stat columns back to numeric
nfl_clean <- 
  final_data %>% 
  # Standardize the scraped headers using janitor
  clean_names() %>% 
  # Rename the columns so they are easier to read
  rename(
    Player_Name = player_player_on_team,
    Games_Played = gp_games_played,
    Rushing_Attempts = att_rushing_attempts,
    Rushing_Yards = yds_rushing_yards,
    Rushing_YPG = yds_g_rushing_yards_per_game,
    AVG_Yards_Per_Rush = avg_average_yards_per_rush,
    Rushing_Tds = td_rushing_touchdowns,
    Longest_Rush = lng_longest_rush
  ) %>% 
  # The pages had to be combined as character columns, so convert the
  # stat columns back to numeric for use in calculations and visuals
  mutate(
    across(
      c(Games_Played, Rushing_Attempts, Rushing_Yards, Rushing_YPG,
        AVG_Yards_Per_Rush, Rushing_Tds, Longest_Rush),
      as.numeric
    )
  )
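
`as.numeric()` turns any value it cannot read, such as a number written with a thousands separator, into `NA` with only a warning. A possible safeguard is sketched below, assuming readr is available (it is attached with the tidyverse). The helper names `to_number()` and `count_new_nas()` are hypothetical, and the `"1,197"` string is only an illustration of a formatted value, not a claim about how the site displays it.

```r
library(readr)

# Hypothetical helper: parse numbers while tolerating grouping marks like "1,197".
to_number <- function(x) parse_number(as.character(x))

# Hypothetical check: how many non-missing raw values became NA after conversion?
count_new_nas <- function(raw, converted) sum(!is.na(raw) & is.na(converted))

# Illustrative strings only
raw_yards <- c("1,197", "1084", "951")

# as.numeric() cannot read the value with the comma
count_new_nas(raw_yards, suppressWarnings(as.numeric(raw_yards)))

# parse_number() keeps it
count_new_nas(raw_yards, to_number(raw_yards))
```

Running a check like `count_new_nas()` on each stat column after the conversion is a cheap way to confirm that no values were lost during cleaning.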

Visuals & Analysis

ggplot(nfl_clean, aes(x = Rushing_Attempts, y = AVG_Yards_Per_Rush)) +
  geom_point(alpha = 0.6, color = "darkgreen") + #points
  geom_smooth(method = "lm", se = FALSE, color = "red") +#linear regression line
  labs(
    title = "Rushing Workload vs. Rushing Efficiency (Top 250 Rushers)",
    x = "Total Rushing Attempts (Workload)",
    y = "Average Yards Per Rush (Efficiency)"
  )

`geom_smooth()` using formula = 'y ~ x'

This scatter plot explores the relationship between a running back’s total rushing attempts (workload) and their average yards gained per rush (efficiency). The most significant feature is the nearly flat red trend line, which sits just below 5 yards per attempt. This suggests that, for the top 250 rushers in the NFL, an increase in rushing volume does not come with a steep decline in per-carry efficiency. The data points show high variability at the lowest levels of workload (0-50 attempts), including several extreme outliers with 10+ yards per rush; these are likely the product of small sample sizes, where one or two breakaway runs dominate the average. The clustering tightens considerably once players reach a medium volume of around 50 attempts. For the highest-volume running backs, efficiency remains stable and tightly clustered around the 4-5 yard mark, which suggests that elite RBs are able to sustain their performance levels despite heavy usage.

These findings provide evidence that an increase in rushing volume does not lead to a decline in per-carry efficiency among the top performers. For fantasy managers and team analysts, this means that a running back’s efficiency is largely sustainable, so you don’t necessarily have to fear a drop-off in Yards Per Attempt as their carries increase. Instead, high-volume RBs offer a reliable floor of production, which is a valuable insight when evaluating potential “workhorse” players.
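
The “nearly flat” trend line can also be checked numerically. The sketch below is one possible way to do that, using only base R. The helper name `trend_summary()` is hypothetical, and the small `preview` data frame contains just the six players shown in the data preview, so it only demonstrates the mechanics; six rows are far too few to support any conclusion.

```r
# The six players shown in the data preview (attempts and yards only).
preview <- data.frame(
  Rushing_Attempts = c(205, 199, 155, 164, 181, 187),
  Rushing_Yards    = c(1197, 1084, 951, 900, 896, 871)
)
preview$AVG_Yards_Per_Rush <- preview$Rushing_Yards / preview$Rushing_Attempts

# Hypothetical helper: correlation and fitted slope of efficiency on workload,
# optionally restricted to players with at least `min_attempts` carries.
trend_summary <- function(data, min_attempts = 0) {
  kept <- data[which(data$Rushing_Attempts >= min_attempts), ]
  fit <- lm(AVG_Yards_Per_Rush ~ Rushing_Attempts, data = kept)
  data.frame(
    n_players = nrow(kept),
    correlation = cor(kept$Rushing_Attempts, kept$AVG_Yards_Per_Rush,
                      use = "complete.obs"),
    # Change in yards per rush for every additional 100 attempts
    slope_per_100_attempts = unname(coef(fit)["Rushing_Attempts"]) * 100
  )
}

trend_summary(preview)
```

On the full data, comparing `trend_summary(nfl_clean)` with `trend_summary(nfl_clean, min_attempts = 50)` would show how much the low-volume outliers influence the fitted line.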

ggplot(nfl_categorized, aes(x = Workload_Category, y = AVG_Yards_Per_Rush, 
                            fill = Workload_Category))+
  geom_boxplot(outlier.shape = 1) +
  labs(
    title = "Rushing Efficiency Distribution by Workload Category",
    x = "Workload Category",
    y = "Average Yards Per Rush (YPA)"
  )

This box plot segments all players into three discrete workload tiers to show the effect of volume on the spread of Yards Per Attempt (YPA). The plot does not support the simple “workload penalty” idea, as YPA does not decrease with volume; in fact, the Heavy Workload category has the highest typical YPA and the tightest overall distribution. The key takeaway is the dramatic difference in variability: the Light Workload category has a massive interquartile range and numerous high-efficiency outliers. In contrast, the Medium and Heavy Workload boxes are far narrower, showing that high-volume backs are far more consistent.

These findings confirm the scatter plot’s results by showing that the efficiency of high-volume running backs is predictable and reliable, with a high typical performance. The extreme outliers in the Light Workload category skew the overall data: when looking at all players, low-volume rushers can appear more efficient, but only because they benefit from a few big plays on a small number of carries. For teams seeking stability, the consistency and slightly higher typical YPA of the Heavy Workload backs suggest that trusting a high-volume runner provides a reliable foundation for the running game.
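
The code that builds `nfl_categorized` is not shown in this report. A minimal sketch of how such tiers could be created and summarized is given below. The cut points (74 and 149 attempts) are placeholder assumptions, not the thresholds actually used for the plot, and the helper names `categorize_workload()` and `summarize_by_workload()` are hypothetical.

```r
library(dplyr)

# Hypothetical helper: bin players into three workload tiers.
# The default cut points are placeholders and should be adjusted.
categorize_workload <- function(data, light_max = 74, medium_max = 149) {
  data %>%
    mutate(
      Workload_Category = cut(
        Rushing_Attempts,
        breaks = c(-Inf, light_max, medium_max, Inf),
        labels = c("Light", "Medium", "Heavy")
      )
    )
}

# Hypothetical helper: the numbers a box plot displays, per tier.
summarize_by_workload <- function(data) {
  data %>%
    group_by(Workload_Category) %>%
    summarise(
      players    = n(),
      median_ypa = median(AVG_Yards_Per_Rush, na.rm = TRUE),
      iqr_ypa    = IQR(AVG_Yards_Per_Rush, na.rm = TRUE),
      .groups    = "drop"
    )
}
```

Something like `nfl_clean %>% categorize_workload() %>% summarize_by_workload()` would then put exact medians and interquartile ranges next to the visual comparison.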

ggplot(top_10_longest_rush, aes(x = reorder(Player_Name, Longest_Rush),
                                y = Longest_Rush,
                                fill = AVG_Yards_Per_Rush))+
  geom_bar(stat = "identity") +
  geom_text(aes(label = Longest_Rush))+
  labs(
    title = "Top 10 RBs by Longest Rush",
    x = "Player Name",
    y = "Longest Rush (Yards)",
    fill = "Avg Yards/Rush")

This bar chart shows the influence of outlier play potential on a running back’s profile. By ranking players solely on their longest single run of the season, the visualization highlights those capable of generating explosive, defense-shattering gains. The longest rushes among the league’s top RBs range from Saquon Barkley’s 65 yards up to Jonathan Taylor’s league-leading 83 yards, demonstrating a clear hierarchy of game-breaking ability. The bars are also color-coded by the player’s overall average yards per rush, but there appears to be no strong visual correlation between the length of a player’s longest run and their day-to-day efficiency.

This visualization isolates outlier play potential. The findings suggest that a running back’s ability to generate a massive, game-changing rush is a separate dimension from their reliable efficiency. For players like Emari Demercado and Justice Hill, whose long rushes of 71 yards are large relative to their low overall workloads, it is likely that a single outlier play skews their seasonal YPA, making them appear more efficient than they are. On the other hand, a runner like David Montgomery, whose bar is a darker shade of blue despite a 72-yard rush, shows a classic case of lower consistent efficiency; his long outlier rush could not lift his overall YPA to the level of the high-volume, high-efficiency runners. This analysis informs decision-makers on whether a back is valuable for stability or for explosive, momentum-shifting plays, and indicates that the best players (like Jonathan Taylor at 83 yards) offer both exceptional outlier potential and high day-to-day efficiency.
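
The claim that one long run can inflate a season average can be made concrete by recomputing YPA with the longest rush removed. The sketch below is one way to do that; the helper name `add_outlier_check()` and the two new column names are hypothetical, and the example row uses Taylor’s figures as reported in this document.

```r
library(dplyr)

# Hypothetical helper: YPA with the single longest run removed, plus the
# share of a player's total yards that came from that one run.
add_outlier_check <- function(data) {
  data %>%
    mutate(
      YPA_Without_Longest = if_else(
        Rushing_Attempts > 1,
        (Rushing_Yards - Longest_Rush) / (Rushing_Attempts - 1),
        NA_real_
      ),
      Longest_Rush_Share = Longest_Rush / Rushing_Yards
    )
}

# Taylor's line from this report: 205 attempts, 1197 yards, long of 83
taylor <- data.frame(Rushing_Attempts = 205, Rushing_Yards = 1197, Longest_Rush = 83)
add_outlier_check(taylor)
```

For Taylor, removing the 83-yard run moves his average from 1197 / 205 (about 5.84) to 1114 / 204 (about 5.46), a modest change. For a low-volume back the same calculation would shift the average far more, which is the skew described above.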

ggplot(top_10_efficient_full_availability, aes(
  x = reorder(Player_Name, AVG_Yards_Per_Rush),
  y = AVG_Yards_Per_Rush,
  fill = Rushing_Attempts #color the bars by their workload
)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = round(AVG_Yards_Per_Rush, 2))) +
  labs(
    title = "Top 10 RBs by AVG Yards Per Rush (Played 11 Games)",
    subtitle = "Color Intensity indicates Rushing Attempts (Workload)",
    x = "Player Name",
    y = "Average Yards Per Rush (YPA)",
    fill = "Rushing Attempts")

This bar chart introduces the filter of player availability by only including backs who have played 11 games and maintained a medium-to-heavy workload (at least 75 attempts). The height of each bar shows efficiency, while the color intensity indicates workload. James Cook and De’Von Achane lead the pack with 5.5 YPA and a lighter blue color, signaling that they are highly efficient while carrying a very heavy workload. This demonstrates that they have sustained elite efficiency across many games and touches. Players like TreVeyon Henderson, Nick Chubb, and Jacory Croskey-Merritt all show high efficiency on fewer attempts, indicating that they are productive but not relied upon for the absolute heaviest lifting within this elite subset.

By filtering the data, we isolate the most valuable runners who combine efficiency with durability and high usage, showing that sustained elite performance is possible. The light blue bars of Cook and Dowdle are the “sweet spot” for analysts and fantasy managers: high bars (efficiency) combined with light color (high workload) mean they are successfully defying the workload penalty. This evidence confirms that volume and efficiency are not mutually exclusive. A player who combines both shows exceptional talent and availability, which is key for NFL analysts predicting player sustainability in future seasons.
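
The code that builds `top_10_efficient_full_availability` is not shown here. Based on the filters described above (11 games played and at least 75 attempts), it could look roughly like the sketch below. The helper name `top_efficient_available()` is hypothetical, and dropping ties with `with_ties = FALSE` is an assumption made so that exactly ten rows are returned.

```r
library(dplyr)

# Hypothetical helper: keep fully available, medium-to-heavy workload backs,
# then take the most efficient ones.
top_efficient_available <- function(data, games = 11, min_attempts = 75, n = 10) {
  data %>%
    filter(Games_Played == games, Rushing_Attempts >= min_attempts) %>%
    slice_max(AVG_Yards_Per_Rush, n = n, with_ties = FALSE)
}
```

Calling `top_efficient_available(nfl_clean)` would then give a data frame that can be passed directly to the plotting code above.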

ggplot(top_10_tds, aes(
  x = reorder(Player_Name, Rushing_Tds),
  y = Rushing_Tds,
  fill = Rushing_Attempts
)) + 
  geom_bar(stat = "identity") + 
  geom_text(aes(label = Rushing_Tds)) +
  labs(
    title = "Top 10 RBs in Rushing Touchdowns (Min. 100 Attempts)",
    subtitle = "Finding the most successful goal-line runners",
    x = "Player Name",
    y = "Total Rushing Touchdowns (TDs)",
    fill = "Rushing Attempts"
  )

This final bar chart addresses the third component of the research question: goal-line scoring effectiveness. By filtering for players with at least 100 rushing attempts, we ensure we are analyzing backs who are trusted with meaningful volume. The chart reveals the hierarchy of goal-line success, with Jonathan Taylor once again standing out dramatically, leading the league with 15 rushing touchdowns, significantly ahead of the next tier of producers like Josh Jacobs. The color of the bars indicates the workload (rushing attempts). A few players, such as Zach Charbonnet and Jahmyr Gibbs, have a darker bar, indicating a lower workload on the ground, yet rank high in touchdowns. This suggests that they see heavy red-zone usage compared to the rest of the RBs with 100+ rushing attempts.

This analysis suggests that high workload is a necessary, but not sufficient, condition for high rushing touchdown totals. While goal-line opportunities are largely a function of volume, the final result is driven by the player’s ability to convert those chances. Jonathan Taylor’s position at the top of the chart solidifies a finding that runs across three visuals: he is the elite performer, successfully combining high volume with top-tier efficiency and dominant scoring. For fantasy managers, this chart directly identifies the best red-zone running backs heading into the second half of the season, which can help them improve their teams by trading for these players.
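
Because raw touchdown totals grow with volume, a per-carry scoring rate helps separate opportunity from conversion. The sketch below is one possible way to compute it. The helper name `add_td_rate()` and the column `TD_Per_100_Attempts` are hypothetical, and the example row uses Taylor’s figures as reported in this document.

```r
library(dplyr)

# Hypothetical helper: touchdowns per 100 carries among backs with
# meaningful volume, sorted from highest to lowest rate.
add_td_rate <- function(data, min_attempts = 100) {
  data %>%
    filter(Rushing_Attempts >= min_attempts) %>%
    mutate(TD_Per_100_Attempts = 100 * Rushing_Tds / Rushing_Attempts) %>%
    arrange(desc(TD_Per_100_Attempts))
}

# Taylor's line from this report: 15 rushing touchdowns on 205 attempts
taylor <- data.frame(Player_Name = "J. Taylor", Rushing_Attempts = 205, Rushing_Tds = 15)
add_td_rate(taylor)
```

A back with a dark bar in the chart (fewer attempts) but a high touchdown count would rank near the top of this rate, which is what the comparison above describes.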

Conclusion

The research question asked whether increased rushing volume (attempts) leads to a demonstrable decline in per-carry efficiency (Yards Per Attempt) and how factors like player availability, outlier plays, and goal-line usage influence this relationship. Based on the analysis of the NFL’s top 250 rushers, the answer is a qualified no. The “workload penalty” is not a systematic issue among elite performers, because high volume correlates with high consistency and sustained efficiency.

Efficiency and Workload: Both the scatter plot and box plot demonstrate that once a running back achieves a heavy workload, their efficiency is above the league mean. Elite RBs are reliable, providing a consistent per-carry average that minimizes performance variance.
Outlier Plays: The longest rush analysis confirmed that explosive, game-breaking ability is a separate metric from reliable efficiency. Players like Emari Demercado and Justice Hill saw their YPA potentially skewed by a single massive run, while high volume backs like Taylor demonstrated the rare combination of both elite consistency and superior outlier play potential.
Goal-line Usage: The visuals filtered for availability and goal-line effectiveness confirmed that the most valuable assets, such as James Cook and Taylor, combine high efficiency and high scoring with high levels of durability.

The most valuable running backs in the NFL are those who defy the theoretical workload penalty by possessing the talent to sustain a high YPA while simultaneously shouldering the heaviest volume and converting the most goal-line opportunities.