Overview

There are a variety of metrics used to evaluate football players and teams. Simple measures, often related to total production, are frequently featured on broadcasts but are considered less powerful by analysts. These include total yardage, yards per attempt, and passer rating. This document presents an overview of the metrics: how they’re calculated, how important and commonly used they are, and how I could incorporate them into my analyses.

In the second section, I also discuss some findings that generally seem to be accepted among analysts, possibly as a result of using these metrics.

General Metrics

Wins Above Replacement (WAR)

WAR applies to all players and seeks to provide a single number that quantifies a player’s value. To do this, it reports the projected number of wins (including fractions of a win) the player added to a team compared to a replacement-level player. Multiple sources have developed WAR metrics, each based on different techniques and models.

One example is the metric created by Pro Football Focus (PFF), which is computed in a multi-step process.

  1. Use PFF’s player grades to determine how good a player was in a given period of time.
  2. Map a player’s production to a ‘wins’ value for the team using the relative importance of each facet of play.
  3. Simulate a team’s expected performance with the player of interest and with an average player, and take the difference in wins.
  4. Determine the difference in wins between an average and a replacement-level player.
  5. Add together the totals from the last two steps to get WAR.
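
The arithmetic in steps 3 to 5 can be sketched in a few lines of Python. This is only an illustration of how the two differences combine; the function name and the win totals are hypothetical placeholders, not PFF output.

```python
def wins_above_replacement(wins_with_player, wins_with_average, wins_with_replacement):
    """Combine steps 3-5: (player - average) + (average - replacement)."""
    above_average = wins_with_player - wins_with_average            # step 3
    average_gap = wins_with_average - wins_with_replacement         # step 4
    return above_average + average_gap                              # step 5


# Hypothetical simulated win totals for one team (made-up numbers)
war = wins_above_replacement(
    wins_with_player=9.4,
    wins_with_average=8.6,
    wins_with_replacement=7.9,
)
print(round(war, 2))
```

Because the average-player term cancels, the result is just the gap between the player and a replacement-level player; the average player only serves as an intermediate reference point.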

The exact details are complicated, but they aren’t necessary for understanding the metric itself. I haven’t yet learned enough to tell how well regarded this measure, and PFF’s version of it in particular, is compared to similar measures. The PFF measure does show a fairly strong correlation for players between seasons, which suggests it captures something stable about player quality.

There are clear relationships between positions and WAR values. The positions that tend to get the highest WAR (and are therefore the most important to a team’s win percentage) are quarterback, defensive back, wide receiver, and tight end. WAR has the highest variability among defensive linemen, likely due to a couple of standout players at those positions. One reason linemen score lower is that offside and holding penalties really hurt them. Running backs also score much lower than a position like quarterback. This contributes to the general consensus that running backs are fairly unimportant.

WAR can also be used to predict team win totals. However, this particular version is only available through PFF, so it is unlikely to be included in any of my analyses. If available, it could be a good way to evaluate the impact of individual players. It has apparently been a go-to metric in baseball for a while, but it is early in its adoption for the NFL. Here’s an additional article that computes WAR for punters. This presentation also gives a more detailed exploration of how it’s generated.

Expected Points Added (EPA)

EPA seeks to measure the value of individual plays in terms of points (source). It compares the expected points (EP) at the beginning and end of a play to calculate the play’s value. This helps differentiate a three-yard run on first down from a three-yard run on 3rd and 2. It can be used to examine what should be done on individual plays, but it can also be extended to value the players themselves. This avoids the heavy influence of outlier plays on metrics such as yards gained, and it captures the contextual value of plays. However, there’s still the difficulty of dividing credit for a play between players.
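
A minimal sketch of the calculation, assuming EP values are already available from some expected points model. The EP numbers below are invented for illustration and don’t come from any real model.

```python
def expected_points_added(ep_before, ep_after):
    """EPA is the change in expected points from the start to the end of a play."""
    return ep_after - ep_before


# Hypothetical EP values for two three-yard runs (made-up numbers)
# 1st and 10: three yards leaves the offense slightly behind schedule
first_down_run = expected_points_added(ep_before=1.50, ep_after=1.35)
# 3rd and 2: three yards converts and earns a new set of downs
third_down_run = expected_points_added(ep_before=1.10, ep_after=2.00)

print(round(first_down_run, 2), round(third_down_run, 2))
```

The same yardage produces a negative value in one situation and a positive value in the other, which is the contextual difference that raw yards can’t show.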

The post cited above also discusses some of the drawbacks. One of these is that EPA has increased over time, reflecting increased offensive efficiency. How to divide the measure between teammates and the inherent subjectivity of defining things like ‘garbage time’ are also issues.

Passer Rating

This is a classic measure of quarterback skill, dating back to the 1970s. Overall, it’s considered a bit dated and is not necessarily a favored metric among analysts, even if it is still commonly used during broadcasts. It’s calculated differently in the NFL and the NCAA, but for the pros it incorporates four variables: completion percentage, yards per attempt, touchdowns per attempt, and interceptions per attempt. Wikipedia has a good explanation of how this works, but essentially the measures are weighted, scored, and combined to produce a passer rating with a maximum value of 158.3.
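
For reference, here is a short Python sketch of the standard NFL formula: each of the four components is scaled, capped between 0 and 2.375, and the capped values are averaged and rescaled. The function name and the sample stat line are my own.

```python
def nfl_passer_rating(completions, attempts, yards, touchdowns, interceptions):
    """NFL passer rating from raw passing totals (not the NCAA formula)."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")

    def cap(value):
        # Every component is limited to the range [0, 2.375]
        return max(0.0, min(value, 2.375))

    a = cap((completions / attempts - 0.3) * 5)        # completion percentage
    b = cap((yards / attempts - 3) * 0.25)             # yards per attempt
    c = cap((touchdowns / attempts) * 20)              # touchdowns per attempt
    d = cap(2.375 - (interceptions / attempts) * 25)   # interceptions per attempt
    return (a + b + c + d) / 6 * 100


# A stat line that maxes out every component
perfect = nfl_passer_rating(completions=20, attempts=20, yards=300,
                            touchdowns=4, interceptions=0)
print(round(perfect, 1))
```

The caps explain the odd-looking ceiling: four components at 2.375 each, divided by 6 and multiplied by 100, gives 158.3.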

One issue with this measure is that quarterback performance has improved over time in the NFL, resulting in higher average ratings, yet the formula has stayed the same. However, the Wikipedia entry cited above also mentions a .793 correlation between the QB that posts the higher rating and the QB that wins. A slightly adjusted version of this metric, which uses the same scale but removes plays that are out of the QB’s control (e.g. drops), is called the Independent Quarterback Rating (IQR). A somewhat different measure is Adjusted Net Yards per Attempt (ANY/A); the major differences are that it incorporates sacks and doesn’t have a finite scale.
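
ANY/A is simple enough to compute by hand. The sketch below uses the commonly published definition (a 20-yard bonus per touchdown, a 45-yard penalty per interception, with sack yards and sacks included); the season totals are invented.

```python
def adjusted_net_yards_per_attempt(pass_yards, touchdowns, interceptions,
                                   sack_yards, attempts, sacks):
    """ANY/A = (yards + 20*TD - 45*INT - sack yards) / (attempts + sacks)."""
    dropbacks = attempts + sacks
    if dropbacks <= 0:
        raise ValueError("attempts + sacks must be positive")
    adjusted_yards = pass_yards + 20 * touchdowns - 45 * interceptions - sack_yards
    return adjusted_yards / dropbacks


# Hypothetical season line (made-up numbers)
any_a = adjusted_net_yards_per_attempt(pass_yards=4000, touchdowns=30,
                                       interceptions=10, sack_yards=200,
                                       attempts=550, sacks=30)
print(round(any_a, 2))
```

Since the result is just yards per dropback with bonuses and penalties, it has no cap and can go negative, unlike passer rating.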

Total QBR

ESPN has a good explainer of QBR. It’s designed to examine all of a quarterback’s contributions, put plays in their proper context, and reflect that teammates also influence what happens on the field. It incorporates things like rushes, sacks, turnovers, and penalties. It also weights a five-yard pass that gains a first down differently from a five-yard completion short of the first down (context). The same goes for a red zone interception versus a Hail Mary interception before halftime.

In general, the measure begins by asking: how successful was the play for the team, given its context? The change in expected points caused by the play can then be estimated. Next, the quarterback’s role is separated from that of his teammates. This is done by dividing the play’s EPA among all teammates, taking into account things like YAC or QB pressure. The article explains in more depth. Overall, this should be thought of as an efficiency stat rather than a value stat. Even if one quarterback produces more total value (for instance by being part of more plays), he may have less value per play than his opponent. ESPN scales this total efficiency value to between 0 and 100 using logistic regression.
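
To make the two ideas concrete (splitting a play’s EPA, then squashing a per-play value onto a 0 to 100 scale), here is a toy Python sketch. It is not ESPN’s model: the credit shares, the logistic steepness, and the function names are all assumptions chosen for illustration.

```python
import math


def qb_share_of_epa(play_epa, qb_credit_share):
    """Give the QB a fraction of a play's EPA; the share is a made-up input."""
    return play_epa * qb_credit_share


def scale_to_100(per_play_value, steepness=4.0):
    """Squash a per-play value onto 0-100 with a logistic curve.

    The steepness is an assumed parameter, not a published one.
    """
    return 100.0 / (1.0 + math.exp(-steepness * per_play_value))


# Hypothetical plays as (EPA, share of credit assigned to the QB)
plays = [(0.9, 0.6), (-0.4, 0.8), (1.5, 0.3)]
qb_values = [qb_share_of_epa(epa, share) for epa, share in plays]

total_value = sum(qb_values)                # "value" view: grows with volume
per_play_value = total_value / len(plays)   # "efficiency" view: per play
rating = scale_to_100(per_play_value)
print(round(total_value, 2), round(rating, 1))
```

A per-play value of zero maps to 50 under this curve, so an average quarterback lands in the middle of the scale.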

Total Points Earned

The Sharp Football Analysis article discusses this together with Total QBR because they share similar modeling strategies. Like QBR, it uses EPA and divides credit among the team for each play. It also uses statistics such as YAC and air yards to apportion credit for a play. However, this method includes slightly more inputs because it was designed to evaluate all players, not just QBs.

QB Measure Comparison

Several useful figures and tables from Sharp Football Analysis are included below that give some indication of the validity of these measures for quarterback evaluation. Read the article for more specific information about each one. The first table summarizes which specific facets of the game are reflected in each metric.


There are several possible ways to evaluate the effectiveness of different metrics. One assumption would be that an effective metric correlates highly with other stats and with important outcomes such as points scored or victories. Another assumption is that an accurate metric would be fairly consistent for each player from year to year. The next figure and table evaluate different metrics based on these assumptions.

Next, this figure shows the correlation between each pair of measures.


Third, this table shows how strongly player scores using each metric are correlated between seasons. An accurate metric that provides insight into the value of a player should be fairly consistent between years, although of course ‘player value’ does vary to some extent between years. The table shows the correlation of the metric when calculated for just passing plays and for all offensive plays. Overall, they’re all moderately correlated and predictive of future values.
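
If I want to run this kind of stability check myself, a Pearson correlation between consecutive seasons is enough. The sketch below assumes I have one value per player per season; the player labels and values are invented.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (spread_x * spread_y)


# Hypothetical metric values for the same players in back-to-back seasons
season_1 = {"QB A": 7.1, "QB B": 5.9, "QB C": 6.4, "QB D": 4.8}
season_2 = {"QB A": 6.8, "QB B": 6.1, "QB C": 5.7, "QB D": 5.2}

# Only compare players who appear in both seasons
players = sorted(set(season_1) & set(season_2))
r = pearson([season_1[p] for p in players], [season_2[p] for p in players])
print(round(r, 3))
```

With real data I’d also want a minimum number of attempts per season, since small samples would drag the correlation down for reasons unrelated to the metric.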


Finally, this article concludes that any of these measures could be justifiably used to evaluate performance. Still, it argues that ANY/A is a strong method because it’s easily interpretable and accounts for sacks. Among metrics with more modern analytical underpinnings, IQR and EPA are strong, although they are slightly harder to ‘see’ on the field. EPA is more all-encompassing, while IQR focuses more on throwing. For my own analysis, it may be best to always try to use one easily interpretable metric and one less interpretable one.

PFF Grades

Pro Football Focus produces player grades for each play. However, they aren’t public data so I’ll only mention them here for now.

Elo

This is used by 538 and measures strength based on game-by-game results. It can apply to teams, but I don’t think it applies to players. It works similarly to the Elo ratings used in competitive gaming rankings. Teams gain and lose rating points based on the strength of their results and how unexpected those results are. At the end of each season, team ratings are regressed back toward the mean (methodology explained more here).
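
The core update is the textbook rating formula, sketched below in Python. The K-factor, the mean of 1500, and the one-third regression fraction are assumptions for illustration; 538’s actual model may use different values and extra adjustments that aren’t shown here.

```python
def expected_score(rating_a, rating_b):
    """Win probability for team A implied by the rating gap."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_ratings(rating_a, rating_b, score_a, k=20.0):
    """Return new ratings after one game.

    score_a is 1 for an A win, 0.5 for a tie, 0 for a loss.
    k is an assumed K-factor.
    """
    shift = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + shift, rating_b - shift


def regress_to_mean(rating, mean=1500.0, fraction=1 / 3):
    """Pull a rating part of the way back to the mean between seasons (assumed values)."""
    return rating + fraction * (mean - rating)


# An underdog win moves the ratings more than a favorite's win
underdog, favorite = update_ratings(1400.0, 1600.0, score_a=1)
print(round(underdog, 1), round(favorite, 1))
print(round(regress_to_mean(1650.0), 1))
```

The update is zero-sum: whatever one team gains, the other loses, and the size of the swing depends on how surprising the result was.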

Some RB stats

So far, most of these stats have primarily applied to QBs. I’ll have to spend more time looking at RB stats later, but some to think about may be receiving-related stats, pass blocking efficiency, and various measures of elusiveness. Some discussion of these terms is found here. The site has similar discussions of WR metrics.

Other Assorted Stats

There are many other stats that are important to remember but aren’t worth an entire section on their own. Many of these are available through Next Gen Stats and are defined in its glossary. Many of these debuted for the 2020 season, so that’s something to keep in mind.

Passing

  • Speed. This is available for individual plays.
  • Improbable Completions/Rushes. From Next Gen Stats.
  • Time to throw (NGS)
  • Average completed air yards, average intended air yards, and the differential between the two. Also, air yards to the sticks, which tracks how far in front of or behind the sticks intended passes are.
  • Aggressiveness % (throws into tight coverage)
  • Completion +/- compared to expected. Goff appears to have a very low one consistently.

Rushing

  • Rushing efficiency (total distance traveled/yards gained). I could also back-calculate distance traveled from this metric.
  • % of runs with 8+ defenders in the box. Could use this, among other things, to show how this differed with Gurley.
  • Rushing yards over expected average per attempt. Gurley is low on this. Question: how much is this skewed by a few big runs?
  • Percent of rushes that are over average.
  • Average time behind the line of scrimmage
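
A small Python sketch of two of these rushing calculations: backing out distance traveled from the efficiency ratio, and comparing the average yards over expected with the share of carries that beat expectation. The per-carry expected yards would come from a model I don’t have, so the numbers here are invented.

```python
def distance_traveled(efficiency, rushing_yards):
    """Back out total distance run from efficiency = distance / yards gained."""
    return efficiency * rushing_yards


def rushing_summary(actual_yards, expected_yards):
    """Return (average yards over expected per attempt, share of rushes over expected)."""
    if len(actual_yards) != len(expected_yards) or not actual_yards:
        raise ValueError("need two equal-length, non-empty lists")
    diffs = [a - e for a, e in zip(actual_yards, expected_yards)]
    average_over = sum(diffs) / len(diffs)
    share_over = sum(1 for d in diffs if d > 0) / len(diffs)
    return average_over, share_over


# Hypothetical carries: one long run dominates the average
actual = [2, 3, 1, 45, 2]
expected = [4, 4, 3, 5, 4]
average_over, share_over = rushing_summary(actual, expected)
print(round(average_over, 1), share_over)
print(distance_traveled(efficiency=4.0, rushing_yards=100))
```

In this toy example the average looks strong while only one carry in five beat expectation, which is why it seems worth reporting both numbers when asking how much a few big runs skew the average.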

Receiving Stats

  • Average cushion
  • Average separation (distance to the nearest defender at time of completion)
  • Average targeted air yards and % share of team’s
  • Yards after catch
  • Expected YAC and YAC above expectation

Opponent Receptions Plus/Minus

This is a measure used to assess the strength of defenders covering the pass. It estimates the number of incompletions over expected that a team creates, then assigns them to the nearest defender. See this 538 article. This helps grade coverage, which is otherwise quite hard for a couple of reasons. First, good coverage often means the ball will be thrown somewhere else. Second, more important metrics like breakups and interceptions are fairly rare. However, the strength of this metric seems to vary by coverage style, and it appears to be more accurate in man coverage. Also, it is not stable or predictive from year to year.
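
A rough sketch of the bookkeeping in Python, assuming each target already comes with an expected completion probability and a nearest defender. The probabilities, the defender labels, and the sign convention (positive means more incompletions than expected) are my own assumptions, not 538’s implementation.

```python
from collections import defaultdict


def incompletions_over_expected(targets):
    """Sum (expected completion probability - actual completion) per nearest defender.

    targets: iterable of (defender, completion_probability, was_completed).
    """
    totals = defaultdict(float)
    for defender, completion_probability, was_completed in targets:
        outcome = 1.0 if was_completed else 0.0
        totals[defender] += completion_probability - outcome
    return dict(totals)


# Hypothetical targets (made-up probabilities and outcomes)
targets = [
    ("CB1", 0.70, False),   # expected catch broken up: +0.70
    ("CB1", 0.60, True),    # completion allowed: -0.40
    ("S1", 0.50, True),     # completion allowed: -0.50
]
credit = incompletions_over_expected(targets)
print({name: round(value, 2) for name, value in credit.items()})
```

This also makes the weakness visible: a defender who is never targeted because of good coverage accumulates nothing at all.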

Adjusted Line Yards

This metric is used to evaluate the performance of offensive lines in the run game. It is created using a regression model that assigns the line a portion of a run’s value. Essentially, it assigns blame for losses, gives credit for mid-yardage runs, and gives the line no credit for long runs (see here). This source also has this data for each team! One problem is that it doesn’t account for the number of defenders the line faced (538).
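
The credit scheme can be sketched as a piecewise function of yards gained. The weights below (120% of a loss, full credit for yards 0 to 4, half credit for yards 5 to 10, nothing beyond 10) are the commonly cited ones as I understand them; treat them as assumptions to verify against the source. Any further situational adjustments in the real metric are omitted.

```python
def line_yards(yards_gained):
    """Portion of a single run credited to the offensive line (assumed weights)."""
    if yards_gained < 0:
        return 1.2 * yards_gained                        # extra blame for losses
    full_credit = min(yards_gained, 4)                   # yards 0-4 at 100%
    half_credit = max(0, min(yards_gained, 10) - 4)      # yards 5-10 at 50%
    return full_credit + 0.5 * half_credit               # yards 11+ earn nothing


def line_yards_per_carry(runs):
    """Average line yards over a list of runs."""
    if not runs:
        raise ValueError("runs must not be empty")
    return sum(line_yards(y) for y in runs) / len(runs)


# Hypothetical carries (made-up yardages)
runs = [-2, 3, 7, 15]
print([line_yards(y) for y in runs])
print(round(line_yards_per_carry(runs), 2))
```

Under these weights a 15-yard run and a 10-yard run credit the line equally, which reflects the idea that long gains are mostly the back’s doing.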

Advanced Football Analytics also has a glossary that includes additional stats.

General Takeaways

I came across a number of interesting insights while looking through these measures and how they’re used. Some short summaries of these are below.