NFL Game and Attendance Insights

[Image: Highest-Paid NFL Players 2021, from Forbes]

Introduction

This section gives some background and a brief introduction to the project. We intend to explore the NFL data provided, tidy it, raise the questions we are interested in, and answer them using what we have learned in the course. Below we list the key questions, the methods proposed to answer them, and how the conclusions will be used.

  • 1.1 Problems to be addressed -What football team should you root for? How can a football team improve its chances of making the playoffs? What is the most important contributor to Super Bowl wins? Our team will investigate historical NFL standings, games, and attendance data to identify the most important factor in determining whether teams make the playoffs, and ultimately win the Super Bowl, at the end of each season.

  • 1.2 Methodologies to be involved -The team plans to answer these questions by analyzing how total number of wins, point margins, team rankings, and fan attendance relate to a team's chances of making the playoffs and winning the coveted Lombardi Trophy. In doing so, the team will evaluate the correlation of playoff berths and Super Bowl wins with variables such as total wins per season, total points scored for and against, strength of schedule, offensive and defensive rankings, fan attendance, and yards gained and lost. The team will evaluate the validity of these correlations using scatter plots, assessing the p-values obtained from trend-line analysis to determine whether a correlation between two or more variables is meaningful or immaterial.

  • 1.3 Proposed approach/analytic techniques -As described above, the team believes that analyzing the relationships between total wins per season, total points scored for and against, strength of schedule, offensive and defensive rankings, fan attendance, and yards gained and lost will let us identify the variable(s) with the greatest impact on whether a football team makes the playoffs and wins the Super Bowl. Scatter plots, together with the p-values from trend-line analysis, will show whether these variables correlate with playoff berths and Super Bowl wins, or whether intangible variables beyond our analysis contribute to success on the football field.

  • 1.4 How problem will be resolved and explained -The team’s findings will be helpful to football organizations by identifying what variables of a team’s operations are most heavily correlated to team success in making the playoffs, and winning the Super Bowl. Secondarily, the team will help football fans identify teams to bet on or root for based on the weekly, or yearly scenario the team is trending toward. Overall, the team wants to help educate fans on historical football statistics and impactful metrics.
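As an illustration of the trend-line p-value check described in 1.2 and 1.3, here is a minimal sketch in base R. The data is simulated (it is not the NFL data), and the variable names `wins` and `points_differential` are borrowed from the standings table only for readability.

```r
# Hypothetical data: 40 simulated team seasons (NOT the NFL data).
set.seed(1)
wins <- sample(2:14, 40, replace = TRUE)
# Simulated points differential that rises with wins, plus noise.
points_differential <- 25 * (wins - 8) + rnorm(40, sd = 30)

# Fit a straight trend line and pull out the p-value of its slope.
fit <- lm(points_differential ~ wins)
slope_p_value <- summary(fit)$coefficients["wins", "Pr(>|t|)"]

# A small p-value suggests the trend is unlikely to be due to chance alone.
slope_p_value < 0.05
```

A p-value near or above 0.05 would be the "immaterial" case: the scatter plot may show a slope, but the data does not support it strongly.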

Packages Required

Packages we will apply:

## install and load packages
library(dplyr)  #for manipulating, joining, and summarizing data
library(readr) #for importing data
library(ggplot2) #for creating plots and graphics
library(DT) #for creating nice tables in HTML output
library(rmarkdown) #to use paged_table() function to create a page-able version of a data frame.

Several packages were loaded for this data wrangling project. The dplyr package assists with manipulating, joining, and summarizing the data; since the data comes in three separate data frames, it helps shape them into a single, usable data frame. The readr package imports the data. The three data frames were pulled from this page on GitHub. The ggplot2 library creates plots and graphs; although we have not created any yet, we will as we move further into the data analysis section of the project. The DT package creates tables in the HTML R Markdown document, and rmarkdown provides the paged_table() function for a page-able view of a data frame. dplyr, readr, and ggplot2 are all part of the tidyverse, many of whose functions help with data analysis.
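The chunk above only loads the packages. A small sketch of how one might check that they are installed first is shown below; `missing_packages` is a hypothetical helper name, not part of any package.

```r
# Packages named in this section.
pkgs <- c("dplyr", "readr", "ggplot2", "DT", "rmarkdown")

# Return the subset of `pkgs` that is not installed yet.
missing_packages <- function(pkgs) {
  pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]
}

to_install <- missing_packages(pkgs)
if (length(to_install) > 0) {
  message("Not installed: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install from CRAN
}
```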

Data Preparation

The data we used for the project came from this page on GitHub. It was originally gathered from the team standings page and the attendance page on Pro-Football-Reference.com. The three CSV files are attendance, standings, and games. The data was originally collected for record-keeping purposes by a sports data website, then further edited and compiled to focus on attendance. It was recorded on a weekly or per-game basis from the first week of the 2000 season (September 3, 2000) to Super Bowl LIV (February 2, 2020).

The attendance data frame has 10846 observations of 8 variables, the games data frame has 5324 observations of 19 variables, and the standings data frame has 638 observations of 15 variables. The variable names and data dictionary are available at this link, and the variables used in this project are further explained below. The only variables with missing values were attendance during bye weeks (a bye week is a week in which a team does not play during the regular season) and tie in the games data. The tie variable was removed from the data frame as it was not relevant to the analysis we performed, and all attendance data was aggregated with the NA values removed.
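One way to see which variables contain missing values is to count the NAs per column. The sketch below uses a small made-up data frame in place of the loaded tables; the column names mimic the attendance data but the values are invented.

```r
# Hypothetical stand-in for a loaded table such as `attendance`.
toy <- data.frame(
  team = c("A", "A", "B", "B"),
  week = c(1, 2, 1, 2),
  weekly_attendance = c(70000, NA, 65000, 66000)  # NA marks a bye week
)

# Number of missing values in each column.
na_counts <- colSums(is.na(toy))

# Keep only the columns that actually have missing values.
na_counts[na_counts > 0]
```

Running the same two lines on attendance, standings, and games after loading them would list the columns that need attention.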

Load Data

All data is loaded from the GitHub page and is now ready for analysis and manipulation.

attendance <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/attendance.csv')
standings <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/standings.csv')
games <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/games.csv')

Attendance Data

The only variable we wanted to get from this table was the total attendance for the year by team. This table has both weekly data and yearly totals so we just needed to pull the total value and then turn that into a new consolidated table that lists values by year rather than by week. Then we added these values to the standings table as it would serve as the base for our data.

# ATTENDANCE DATA      ----------------------------------------------------
#aggregating the attendance totals by team and year
attendance$key <- paste(attendance$year, attendance$team, attendance$team_name)
#column 4 is `total`, the season total repeated on every weekly row, so max() recovers it
attendance_ag <- aggregate(attendance[,4], by = list(attendance$key), FUN = max, na.rm = TRUE)
names(attendance_ag) <- c("key", "total_attendance")
#creating the same key in the standings table and adding the attendance totals to the standings
standings$key <- paste(standings$year, standings$team, standings$team_name)
standings_attend <- merge(standings, attendance_ag, by.x = "key")
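As a sanity check on the aggregation, the season total taken from the repeated yearly column can be compared with the sum of the weekly figures. This is a sketch on invented numbers, assuming the yearly total is meant to equal the sum of the weekly attendance values.

```r
# Toy rows shaped like `attendance` (values are made up).
toy_att <- data.frame(
  year = 2000,
  team = c("X", "X", "X", "Y", "Y", "Y"),
  total = c(300, 300, 300, 250, 250, 250),
  weekly_attendance = c(100, NA, 200, 120, 130, NA)
)
key <- paste(toy_att$year, toy_att$team)

# Season total taken from the repeated `total` column ...
total_from_column <- tapply(toy_att$total, key, max, na.rm = TRUE)
# ... and recomputed from the weekly figures, ignoring bye-week NAs.
total_from_weeks <- tapply(toy_att$weekly_attendance, key, sum, na.rm = TRUE)

# TRUE when both routes agree for every team season.
all(total_from_column == total_from_weeks)
```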

Games Data

For this table, each row essentially serves as two observations since it represents a game with two teams, rather than just an observation on one team. There are variables for yardage and turnovers but they are separated by the winning team and the losing team, so we need to pull these observations separately. Currently, there is a variable that lists the winning team but there is not a variable for the losing team, so we need to create that by comparing the home and away teams to the winning team, and then choosing the one that did not win. Then after we have identified each winner and loser, we need to use those as the keys and aggregate the yardage and turnover variables by team and year. This results in two separate tables of aggregated values: one for the yearly totals of all the games each team won, and one for the yearly totals of all the games each team lost. Then we needed to add these results together to get the yearly totals. Then we added the final results to the original standings table with the additional attendance data.

# GAMES DATA --------------------------------------------------------------
#all variables in this table have two versions (one for winners and one for losers) so i will first aggregate them separately and then combine the aggregations
#adding a loser column in the games table: the loser is whichever of the home and away teams did not win
games$loser <- ifelse(games$home_team == games$winner, games$away_team, games$home_team)
#creating a key for winners and a key for losers and aggregating data on yardage and turnovers
games$key_W <- paste(games$year, games$winner)
games$key_L <- paste(games$year, games$loser)
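#games without a tie flag; note that games2 is not used below, so the aggregations still run on the full games table
games2 <- games[is.na(games$tie), ]
#columns 12-13 are the winner's yards and turnovers; columns 14-15 are the loser's yards and turnovers
games_agg_W <- aggregate(games[,c(12,13)], by = list(games$key_W), FUN = sum, na.rm = TRUE)
games_agg_L <- aggregate(games[,c(14,15)], by = list(games$key_L), FUN = sum, na.rm = TRUE)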
#renaming columns to be easier to work with
names(games_agg_L) <- c("key", "yds_total1", "turnovers_total1")
names(games_agg_W) <- c("key", "yds_total2", "turnovers_total2")
#joining the aggregated yardage and turnovers between the wins and losses
games_agg_full <- merge(games_agg_L, games_agg_W, by.x = "key", all.x = TRUE, all.y = TRUE )
#combining the values that are separated by wins and losses into season totals, in a new table
games_agg_final <- transmute(games_agg_full, 
          yds_total = yds_total1 + yds_total2,
          turnovers_total = turnovers_total1 + turnovers_total2, 
          key = key)
# FINAL TABLE FOR ANALYSIS ------------------------------------------------
#combining the yearly totals that were just calculated with the table that has the standings and attendance totals
standings_final <- merge(standings_attend, games_agg_final, by.x = "key")
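One thing to watch in the step that adds the win and loss aggregates together: the full outer merge leaves NA in the win columns for a team season with no wins (and in the loss columns for one with no losses), and adding NA to a number gives NA. The sketch below shows one possible way to guard against that on invented data; whether such team seasons occur in the real tables is not checked here.

```r
# Toy aggregates: team "2008 A" lost every game, so it has no row among winners.
agg_L <- data.frame(key = c("2008 A", "2008 B"),
                    yds_total1 = c(4000, 1500),
                    turnovers_total1 = c(30, 8))
agg_W <- data.frame(key = "2008 B", yds_total2 = 3500, turnovers_total2 = 12)

# Full outer join, as in the code above.
full <- merge(agg_L, agg_W, by = "key", all = TRUE)

# Replace missing aggregates with 0 before adding the two halves together.
num_cols <- setdiff(names(full), "key")
full[num_cols] <- lapply(full[num_cols], function(x) ifelse(is.na(x), 0, x))

full$yds_total <- full$yds_total1 + full$yds_total2
full$turnovers_total <- full$turnovers_total1 + full$turnovers_total2
full[, c("key", "yds_total", "turnovers_total")]
```

Without the replacement step, team "2008 A" would end up with NA for both season totals.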

Converting Variables

Almost all of our data was already in the correct formats and types, but we also wanted to be able to analyze Super Bowl wins and playoff appearances as numerical variables. To do this, we created a binary version of the Super Bowl variable and of the playoff variable.

# CONVERTING VARIABLES ----------------------------------------------------
#converting superbowl variable into binary variable (1 = won the Super Bowl, 0 = did not)
standings_final$sb_binary <- as.numeric(standings_final$sb_winner == "Won Superbowl")
#converting playoffs variable into binary variable (1 = made the playoffs, 0 = did not)
standings_final$playoffs_binary <- as.numeric(standings_final$playoffs == "Playoffs")
#All variables created and all important values combined into single table called "standings_final"

Final Table

paged_table(standings_final)

Exploration of Data

  • Points For: The values range from 161 points, representing a season with very low scoring, to 606 points, representing an incredible season in terms of scoring. The average is around 350 points and 75% of the seasons fall between 300 and 400 total points scored.

  • Points Against: It has almost identical values as you would expect, but the largest value is slightly different at 517 points against. This implies that even though a team may get scored on a lot, they are not consistently getting beaten by as much as the successful teams are winning by.

  • Margin of Victory: The values range from -16.3 to 19.7, while 75% of the data is within the range of -4.7 to 4.575 and the average is -0.001881. This variable is equivalent to points for minus points against, all divided by the total number of games. It attempts to describe how close an average game was in terms of points: a large positive margin of victory represents consistent blowout wins, while a negative value represents consistent losses.

  • Simple Rating: These values range from -17.4 to 20, with 75% of data falling between -4.475 and 4.5 and an average of 0. This can be slightly deceptive, as not many teams actually have a rating of 0, but it implies that there is a good balance between positive and negative ratings even though the max and min values are not symmetric.

  • Super Bowl: One team has won the Super Bowl six times in this data, while 20 teams have not won a Super Bowl at all. Eight teams have won it once and three teams have won it twice. The majority of teams are still chasing the elusive Super Bowl.
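The margin of victory formula and the quartile-style ranges quoted above can be reproduced with a few lines of base R. The numbers below are invented for illustration; on the real data the same calls would be applied to columns of standings_final.

```r
# Made-up season: 16 games, 400 points scored, 320 allowed.
points_for <- 400
points_against <- 320
games_played <- 16

# Margin of victory = (points for - points against) / games played.
margin_of_victory <- (points_for - points_against) / games_played
margin_of_victory  # 5 points per game

# Hypothetical vector of margins, used to show how ranges can be read off.
margins <- c(-8, -4, -1, 0, 1, 3, 6, 9)
quartiles <- quantile(margins, probs = c(0.25, 0.75))
quartiles

# Share of values lying between the first and third quartile.
mean(margins >= quartiles[1] & margins <= quartiles[2])
```

Note that the span between the first and third quartile covers about the middle half of the values, so it is worth confirming which quantiles a quoted range refers to.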

Exploratory Data Analysis

Introduction

Based on the cleaned data sets, we proposed certain exploratory data analysis as below:

  • 4.1 The team has a variety of options at their disposal in terms of matching the data. Since this data set comes in three separate files, each with a different focus, various matchups may be an important factor in moving towards an eventual model. A season-by-season breakdown is likely, as this will allow the team to observe variables like regular-season results, performance statistics, and strength-of-schedule in a league-wide context. An example of possible new variables includes summarized attendance numbers for various periods of time.

  • 4.2 Plots would likely be helpful when looking at different seasons for a single team, to see how their performance or attendance numbers have changed over time, especially when compared against each other. Tables may help show a league-wide comparison between outcomes of different franchises. For example, it may be useful to have a table comparing strength-of-schedule for all teams each year, to better understand their placements.

  • 4.3 Right now, we do not know which variable clearly drives postseason / Super Bowl performance, which can hopefully be uncovered in the modeling steps. Additionally, we don’t know how performance impacts attendance (if at all), or how specific athletic statistics impact wins.

  • 4.4 Machine learning techniques will likely be appropriate here. Specifically, gradient-boosted techniques like XGBoost may help the team to better categorize seasons based on their component statistics.

Playoff Indicators

1. Offensive Ranking vs. Defensive Ranking


Above, we have a graph of all teams and their corresponding offensive and defensive rankings, with the x axis corresponding to offense and the y axis corresponding to defense. The trendlines show that playoff teams typically have a higher offensive and defensive ranking, and the playoff teams most often appear in the upper right quadrant of the graph. When looking at which scores lead to the best chance of making the playoffs, though, a table can be very helpful. We have split the graph into four quadrants, beginning with quadrant 1 in the top right and then moving counterclockwise to 2, 3, and 4.

The main takeaways here are:

  • There are 240 teams that have made the playoffs over the course of our data. There are 186 teams in the first quadrant (positive offense, positive defense), and 150 of those teams made the playoffs, which is roughly 80%.
  • Of the 128 teams in quadrant 4 (positive offense, negative defense), 47 have made the playoffs, which is roughly 37%.
  • Of the 134 teams in quadrant 2 (positive defense, negative offense), 31 have made the playoffs, which is roughly 23%.
  • Of the 190 teams in quadrant 3 (negative defense, negative offense), only 12 have made the playoffs, which is roughly 6%.

This tells us that it is better to have a strong offense and a weak defense than a strong defense and a weak offense. However, teams with a positive score for both have made the playoffs about 80% of the time.
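The percentages in the takeaways follow directly from the counts reported there, and can be recomputed as a quick arithmetic check:

```r
# Counts reported in the takeaways: team seasons and playoff teams per quadrant.
seasons  <- c(Q1 = 186, Q2 = 134, Q3 = 190, Q4 = 128)
playoffs <- c(Q1 = 150, Q2 = 31,  Q3 = 12,  Q4 = 47)

# Playoff rate per quadrant, in percent, rounded to one decimal place.
playoff_rate <- round(100 * playoffs / seasons, 1)
playoff_rate

# Totals across quadrants: all playoff teams and all team seasons.
sum(playoffs)
sum(seasons)
```

The totals match the 240 playoff teams and the 638 rows of the standings table mentioned earlier.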

2. Exploration Using the Playoff Binary Variable and Histograms

The following section uses the playoff binary variable in the data frame to analyze making the playoffs in relation to several other important variables. Specifically, it explores wins (wins), total yards (yds_total), strength of schedule (strength_of_schedule), and points differential (points_differential). These graphs show the minimum a team must reach to have a chance at its goal and the level at which success has been guaranteed.

These two graphs show wins split by teams that made the playoffs versus those that did not. They show that an NFL team that wishes to make the playoffs must exceed 7 wins in the season, with most playoff teams averaging about 10 wins. Every team that has won 12 or more games has made the playoffs. Neither distribution is normal, as both are skewed, but the combined distribution is approximately normal when the playoff/non-playoff split is removed. This analysis is useful for setting simple benchmarks a team must reach.

These two graphs show total yards split by teams that made the playoffs versus those that did not. They show that an NFL team that wishes to make the playoffs must exceed 4,500 total yards in the season, with most playoff teams averaging 6,000-6,500 total yards. Interestingly, a team is not guaranteed a playoff spot until it reaches over 7,000 yards. A team reaching around 6,500 yards (the mean of previous playoff teams) still has a greater chance of missing the playoffs than of making them. This shows that the information is most useful at the extremes, where it indicates either a near-certain miss or a near-certain berth; it is not the best variable for judging a team that is around average. It is also interesting to note that the distribution is normal for both playoff and non-playoff teams.

These two graphs show points differential split by teams that made the playoffs versus those that did not. They show that an NFL team that wishes to make the playoffs must exceed a points differential of -100, with most playoff teams averaging a points differential of about +100. Every team that has exceeded a points differential of +125 has made the playoffs. It is interesting that several teams have made the playoffs with a negative points differential. This defies the logic that only winning teams make the playoffs, but due to how playoff berths are awarded in the NFL, losing teams can still qualify. Further analysis could remove the lower extremes to see how that affects the graphs and benchmarks.

These two graphs show strength of schedule split by teams that made the playoffs versus those that did not. It is interesting that strength of schedule seems to have no real effect on making the playoffs. Logic says that teams with a lower strength of schedule (indicating easier opponents) would make the playoffs more often than teams with a higher strength of schedule, but the graphs do not show this. This is not a good variable for predicting whether a team will make the playoffs.
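The two kinds of benchmark read off the histograms (the lowest value any playoff team reached, and the value from which every team made the playoffs) can also be computed directly. The following is a sketch on invented data; `playoff_benchmarks` is a hypothetical helper, and on the real data it would be called with columns such as standings_final$wins and standings_final$playoffs_binary.

```r
# Smallest value seen among playoff teams, and the smallest value from which
# every team at or above it made the playoffs (NA if no such value exists).
playoff_benchmarks <- function(x, made_playoffs) {
  made <- made_playoffs == 1
  highest_miss <- if (any(!made)) max(x[!made]) else -Inf
  above <- x[made & x > highest_miss]
  c(minimum_seen = min(x[made]),
    guaranteed_from = if (length(above) > 0) min(above) else NA)
}

# Made-up wins and playoff flags for ten team seasons.
wins <- c(4, 6, 8, 9, 9, 10, 11, 11, 12, 13)
playoffs_binary <- c(0, 0, 1, 0, 1, 1, 0, 1, 1, 1)

benchmarks <- playoff_benchmarks(wins, playoffs_binary)
benchmarks
```

In this toy example the lowest playoff team had 8 wins, and every team with 12 or more wins made the playoffs.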

Super Bowl Indicators

1. Offensive Ranking vs. Defensive Ranking

We can see in this graph that most winners of the Super Bowl have had both a positive offensive and a positive defensive ranking, and that among winners there is actually a negative linear relationship between the two. Most winners with a very high offensive ranking have a defensive ranking closer to 0, and vice versa. There seems to be a decent spread along that line, with a few more outliers on the side of really strong defenses and decent offenses, but it is hard to draw a definitive conclusion from this graph on whether it is more valuable to have a strong offense or a strong defense. To examine this further, we can use the code below to count how many Super Bowl winners come from each quadrant of the above graph, with quadrant 1 being the upper right quadrant and moving counterclockwise to 2, 3, and 4.

SB_Q1  NonSB_Q1  SB_Q2  NonSB_Q2  SB_Q3  NonSB_Q3  SB_Q4  NonSB_Q4
   15       171      2       132      0       190      3       125
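Counts like these could be produced along the following lines. This is a sketch on invented data rather than the original code: the `quadrant` helper is hypothetical, it assumes offense on the x axis and defense on the y axis as described above, and it treats a ranking of exactly 0 as positive.

```r
# Quadrant of each team season: 1 = upper right, then counterclockwise.
quadrant <- function(offense, defense) {
  ifelse(offense >= 0 & defense >= 0, 1,
  ifelse(offense <  0 & defense >= 0, 2,
  ifelse(offense <  0 & defense <  0, 3, 4)))
}

# Toy data (made-up values) with the column names used in standings_final.
toy <- data.frame(
  offensive_ranking = c(5, 3, -2, -4, 6, 1),
  defensive_ranking = c(2, 4, 3, -1, -3, 0.5),
  sb_binary         = c(1, 0, 0, 0, 1, 0)
)
toy$quadrant <- quadrant(toy$offensive_ranking, toy$defensive_ranking)

# Rows: quadrant; columns: 0 = did not win, 1 = won the Super Bowl.
counts <- table(quadrant = factor(toy$quadrant, levels = 1:4),
                sb = toy$sb_binary)
counts
```

Swapping sb_binary for playoffs_binary gives the corresponding playoff counts per quadrant.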

The main takeaways here are:

  • 15 of the 20 winners come from quadrant 1 (positive scores for defense and offense). This is 15 of the 186 teams in the quadrant
  • Quadrant 4 (positive offense, negative defense) has 3 winners out of 128 teams in the quadrant
  • Quadrant 2 (positive defense, negative offense) has 2 winners out of 134 teams in the quadrant
  • No winners from quadrant 3 (negative defense, negative offense)

So we can see that there is not a big difference between the number of Super Bowl winners in quadrants 2 and 4. But 75% of Super Bowl winners have a positive ranking in both categories, so focusing on one category while neglecting the other may not be beneficial.




2. Trends Over the Years

Here we can see that Super Bowl winners consistently have much higher simple ratings than the majority of their competition, but they are not guaranteed to be the team with the largest rating that year. Several years feature multiple teams with higher simple ratings than the team that won the Super Bowl, but usually the winner has a rating that is better than most of the league.
The rating of the team that wins the Super Bowl is growing very slowly from year to year according to the trendline, but the points fluctuate considerably. This tells us that it is not consistently getting much harder to win the Super Bowl, but that there is an element of luck. Some years the competition is much stiffer than others, and this affects who is able to win the Super Bowl and what their talent level has to be.


Here we can see similar findings: the team that wins the Super Bowl is not always the team with the most yards over the season, but it does need decent yardage in comparison to its competition. The average yardage of Super Bowl winners is much higher than the league average, but there are still some years where the Super Bowl winner is closer to the middle of the pack than to the top. Most years, though, the winner is within the top 10, if not the top 5, teams for yardage.
Year to year, we do see a large increase in total yardage. From 2000 to 2002, Super Bowl winners were only achieving around 6,000 yards, whereas from 2017 to 2019 the winners consistently achieved more than 7,000 yards, a significant increase. Since we know that simple ratings are not increasing, we can see that the ratings are relative to the competition at the time rather than to a fixed numerical standard.
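The statement that the winner is usually within the top 5 or 10 teams for yardage can be checked by ranking teams within each season. A minimal sketch on invented data, using the column names from standings_final:

```r
# Toy data: three teams over two seasons (values are made up).
toy <- data.frame(
  year = c(2000, 2000, 2000, 2001, 2001, 2001),
  team = c("A", "B", "C", "A", "B", "C"),
  yds_total = c(6100, 5800, 6400, 7000, 6500, 6900),
  sb_binary = c(1, 0, 0, 0, 0, 1)
)

# Rank within each season; rank 1 = most yards that year.
toy$yds_rank <- ave(-toy$yds_total, toy$year, FUN = rank)

# Yardage rank of each season's Super Bowl winner.
winner_ranks <- toy[toy$sb_binary == 1, c("year", "team", "yds_rank")]
winner_ranks
```

The same approach works for simple_rating in place of yds_total.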

3. Exploration Using the Super Bowl Binary Variable and Histograms

The following section uses the Super Bowl binary variable in the data frame to analyze winning the Super Bowl in relation to several other important variables. Specifically, it explores wins (wins), total yards (yds_total), and points differential (points_differential). These graphs show the minimum a team must reach to have a chance at its goal and whether any level has guaranteed success.

Super Bowl winners make up a very small share of the overall data, since each year only 1 out of 32 teams wins. This leads to interesting results and a smaller data set to work with. The graph above shows the number of regular-season wins each Super Bowl winner achieved. In a typical season a team will win between 0 and 16 games, and no win total by itself guarantees a Super Bowl. However, from the graphs we can see that only teams with wins in the set {9,10,11,12,13,14} have won a Super Bowl since the year 2000.

These two graphs compare the total yards of teams that won the Super Bowl with those of teams that did not. As with the earlier graphs, the data presents a logic puzzle. Suppose, for example, an NFL team has 6,000 total yards. This does not guarantee a Super Bowl win, and does not even make one likely. However, it is within the range that all Super Bowl winners have occupied: every Super Bowl winner had 5,500-8,000 total yards.

These two graphs show the points differential of Super Bowl winners and of all other teams. They resemble the total yards graphs above, and they tell us that no range or benchmark guarantees Super Bowl success. However, all but one Super Bowl winner has had a positive points differential, and all fall within the range (-10, 200). This is not a good indicator of success, but it provides a range to strive for to increase the chances of success.
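The set of win totals and the points differential range of Super Bowl winners can be read off directly by filtering on the binary variable. A short sketch on invented data:

```r
# Toy team seasons (values are made up), using standings_final column names.
toy <- data.frame(
  wins = c(9, 12, 14, 7, 16, 10),
  points_differential = c(-6, 110, 180, -50, 210, 40),
  sb_binary = c(1, 1, 1, 0, 0, 1)
)
winners <- toy[toy$sb_binary == 1, ]

# Distinct regular-season win totals among winners, in ascending order.
winner_wins <- sort(unique(winners$wins))
# Lowest and highest points differential among winners.
winner_diff_range <- range(winners$points_differential)

winner_wins
winner_diff_range
```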

Full League Analysis

1. Simple Ratings

Insights:

  • The Patriots have dominated in the past 20 years and have only continued to get better
  • The Dolphins and Giants have gotten exponentially worse and consequently made the playoffs much less often
  • The Browns, Bills, and Raiders have barely made the playoffs in the last 20 years
  • The Packers, Steelers, and Seahawks have been consistently good and made the playoffs frequently
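Statements such as "continued to get better" or "gotten worse" can be backed by the slope of a per-team trend line of simple rating against year. This is a sketch on invented data with made-up team names, not the analysis behind the insights above.

```r
# Toy data: two hypothetical teams over four seasons.
toy <- data.frame(
  team_name = rep(c("Risers", "Fallers"), each = 4),
  year = rep(2000:2003, times = 2),
  simple_rating = c(-2, 0, 3, 5, 6, 4, 1, -3)
)

# Slope of simple_rating against year, fitted separately for each team.
slope_by_team <- sapply(split(toy, toy$team_name), function(d) {
  unname(coef(lm(simple_rating ~ year, data = d))["year"])
})

# Positive slope = improving over time, negative slope = declining.
slope_by_team
```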

2. Yardage


Insights:

  • The Cowboys, Falcons, Texans and Patriots have seen huge growth in yearly yardage totals and consequently seen more playoff appearances
  • The Bills, Browns, and Dolphins have consistently low yardage and not many playoff appearances
  • The Colts and Packers have consistently high yardage
  • The Rams have fluctuated in yardage but have been returning to high yardage years recently

Conclusion

  • 6.1 Summary of the problems addressed -Our team investigated historical NFL standings, games, and attendance data to identify the most important factor to determine whether teams make the playoffs, and ultimately win the Super Bowl at the end of each season.

  • 6.2 Summary of the data and methodologies involved -The team analyzed the relationships between total number of wins, offensive and defensive rankings, yards, turnovers, and fan attendance to their chances of making the playoffs, and winning the coveted Lombardi trophy. In doing so, the team evaluated the given correlations between playoff berths and Super Bowl wins with variables such as total wins per season, total points scored for and against, strength of schedule, offensive and defensive ratings, and yards gained and lost. The team further evaluated the validity and strength of the above correlations through the use of scatter plots, and assessing the p-values gathered through trend-line analysis to determine factual or immaterial correlation between two or more variables in the datasets gathered.

  • 6.3 Summary of insights and analytics -The team derived the following insights from our analysis:

    • Playoff teams typically have a higher offensive and defensive ranking by the trend lines.
    • There were 186 team seasons with positive offense and defense, of which 80% went on to make the playoffs.
    • There were 128 team seasons with positive offense and negative defense, of which 37% went on to make the playoffs, a larger percentage than teams with negative offense and positive defense, meaning positive offense played a more important role in making the playoffs.
    • NFL teams with playoff hopes must exceed 7 wins in the season, but likely need upwards of 10 total wins.
    • If an NFL team wishes to make the playoffs, it must exceed 4,500 total yards during the season, but likely needs upwards of 6,000-6,500 total yards.
    • NFL teams can still make the playoffs with a negative point differential!
    • 15 of the 20 Super Bowl winners had positive scores for defense AND offense.
    • Super Bowl winners are not guaranteed to be the team with the largest rating that year.
    • Only teams with wins in the set {9,10,11,12,13,14} have won a Super Bowl since the year 2000!
    • The Leaders: The Patriots have dominated in the past 20 years and have only continued to get better.
    • The Average: The Packers, Steelers, and Seahawks have been consistently good and made the playoffs frequently.
    • The Below Average: The Dolphins and Giants have gotten exponentially worse and consequently made the playoffs on a much less frequent basis.
    • The Bottom Tier: The Browns, Bills, and Raiders have barely made the playoffs in the last 20 years (but may be turning things around!).
  • 6.4 Summary of the implications to the consumer of our analysis -The team’s findings are helpful to football organizations in the NFL by identifying the seemingly random and fluctuating variables of a team’s operations that most heavily correlate to their definition of success: making the playoffs and/or winning the Super Bowl. Additionally, our analysis can be used by individuals all over the United States, and the world, to identify the teams most likely to hit or miss the spread, have a winning season, or even win the Super Bowl at the end of the season, based on the variable they bet on or would like to focus on.

  • 6.5 Summary of the limitations in our analysis -Fortunately, the team did not encounter many limitations, as we prepared thoroughly before performing our analysis. However, we did come across several instances where our hypotheses about correlation to wins, playoff appearances, and Super Bowl wins did not go to plan. In these cases, we had to re-evaluate our understanding of the datasets and the implications variables could have on these outcomes in order to formulate more comprehensive hypotheses about which variables are key to NFL success. For future improvements and reperformance of our analysis, we recommend analysts carefully evaluate our Data Preparation module to understand the steps we took to gather, massage, wrangle, and merge the datasets required to complete the analysis.