Data Wrangling Project

Seth Draper

The Dynasty of the New England Patriots

From 2001 to 2018, the New England Patriots dominated the National Football League. Over the course of 17 years, the Patriots won 15 division titles, made 12 AFC Championship appearances and 7 Super Bowl appearances, and won 5 Super Bowls. Let’s find out how they got there…

Synopsis

Problem Statement:

What does it take for an NFL team to win the Super Bowl? More importantly, how can a team make the playoffs? These questions used to be answered with “win the most games in the division” and “win every game in the playoffs.” We can now quantify what it takes to reach the most elite level of football. The New England Patriots reached the AFC Championship or the Super Bowl almost every year, to the point where they made it look easy. We will dig into the data to quantify what it takes to be a playoff team.

Solution Overview:

To reach a conclusion, NFL data from 2000 to 2019 will be cleaned, wrangled, and analyzed to find which variables lead to being a playoff team and Super Bowl victor. The source includes three datasets, of which the standings data drives the analysis. Most of the analysis compares the New England Patriots, Super Bowl winners, playoff contenders, and teams that missed the playoffs.

Implementation:

After the data is cleaned, we will analyze what goes into a playoff team, including wins. We will then dig deeper into what determines a win, and show what it takes for an NFL team to become as successful as the New England Patriots.

Direct Impact of Analysis:

This report is for NFL ownership and management to understand what is required to take their team to the next step of either making the playoffs or winning it all.

Packages Required

Packages Required:

To reproduce the code and results from this report the following packages will need to be installed.

library(tidyverse)  ## Visualizing, transforming, tidying, and joining data
library(dplyr)      ## Manipulating data
library(ggplot2)    ## Visualizing data
library(ggcorrplot) ## Visualizing the correlation matrix
library(knitr)      ## Necessary to show tables in R Markdown
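Before loading the packages, it can help to check which ones are missing. The following is a minimal sketch; the helper name `missing_packages` is hypothetical and not part of the original report, and the install call is left commented out so nothing is installed by accident.

```r
# Hypothetical helper (not part of the original report): list which of the
# required packages are not yet installed.
missing_packages <- function(required,
                             installed = rownames(installed.packages())) {
  setdiff(required, installed)
}

required <- c("tidyverse", "dplyr", "ggplot2", "ggcorrplot", "knitr")
to_install <- missing_packages(required)

# Uncomment to install anything that is missing:
# if (length(to_install) > 0) install.packages(to_install)
print(to_install)
```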

Data Preparation

A. Data Import

Data Import:

The data for this report was compiled from Pro Football Reference and can be accessed HERE. It is titled “2019 NFL Attendance Data” and includes 3 datasets: Attendance, Standings, and Games. The datasets cover 2000-2019 and include many variables. Although the data is titled and focused on attendance, a more interesting analysis came to mind regarding playoff teams and Super Bowl winners. The data was clean, with no missing values apart from ‘NA’ attendance on bye weeks.

The attendance dataset was comprised of 10,846 observations and 8 variables:

  • Team
  • Team Name
  • Year
  • Total
  • Home
  • Away
  • Week
  • Weekly Attendance

The standings dataset was comprised of 638 observations and 15 variables:

  • Team
  • Team Name
  • Year
  • Wins
  • Loss
  • Points For
  • Points Against
  • Point Differential
  • Margin of Victory
  • Strength of Schedule
  • Simple Rating
  • Offensive Ranking
  • Defensive Ranking
  • Playoffs
  • Super Bowl Winner

The games dataset was comprised of 5,324 observations and 19 variables:

  • Year
  • Week
  • Home Team
  • Away Team
  • Winner
  • Tie
  • Day
  • Time
  • Points Win
  • Points Loss
  • Yards Win
  • Turnovers Win
  • Yards Loss
  • Turnovers Loss
  • Home Team Name
  • Home Team City
  • Away Team Name
  • Away Team City

The first step in this project was to download the datasets and store them in an “R Files” folder on my desktop. From there, I set the working directory and imported all three datasets into R.

setwd("C:/Users/sethd/OneDrive/Desktop/R Files")
attendance <- read.csv("attendance.csv")
standings <- read.csv("standings.csv")
games <- read.csv("games.csv")
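As an alternative to changing the working directory, the three files can be read from a folder path passed in as an argument. This is only a sketch: the helper `load_nfl_data` and its `data_dir` argument are hypothetical names, and the file names are assumed to match the ones used above.

```r
# Hypothetical helper: read the three CSV files from one folder without
# changing the working directory. `data_dir` is an assumed argument name.
load_nfl_data <- function(data_dir) {
  files <- c("attendance", "standings", "games")
  out <- lapply(files, function(f) {
    read.csv(file.path(data_dir, paste0(f, ".csv")), stringsAsFactors = FALSE)
  })
  names(out) <- files
  out
}

# Example usage (the path is an assumption; point it at your own folder):
# nfl <- load_nfl_data("C:/Users/sethd/OneDrive/Desktop/R Files")
# sapply(nfl, dim)  # rows and columns of each dataset
```

Comparing the printed dimensions with the observation and variable counts listed above is a quick way to confirm the import worked.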

B. Attendance Dataset Cleaning

Attendance Dataset Cleaning:

Upon briefly exploring the datasets, I decided not to use the games data, but kept it in R in case I want to continue the report at the single-game level instead of the season level.

After importing the data, the first step was to clean the attendance dataset. Because the analysis is at the season level, weekly attendance is not needed. I first united the team and team_name variables so that each team has one name instead of two. Then I removed the week and weekly_attendance columns, which left many duplicate rows (17 per team per season) because only the total, home, and away attendance figures remained, repeating on every row. To eliminate the redundancy, I kept only distinct observations. The code below carries out these steps and is followed by the first 6 observations of the now clean attendance dataset:

#Combine Team Name
attendance <- attendance %>% unite(team, team, team_name, sep = " ")

#Transform into Team Total Attendance per year
attendance <- attendance[,-(6:7)]
attendance <- distinct(attendance)

| team | year | total | home | away |
|---|---|---|---|---|
| Arizona Cardinals | 2000 | 893926 | 387475 | 506451 |
| Atlanta Falcons | 2000 | 964579 | 422814 | 541765 |
| Baltimore Ravens | 2000 | 1062373 | 551695 | 510678 |
| Buffalo Bills | 2000 | 1098587 | 560695 | 537892 |
| Carolina Panthers | 2000 | 1095192 | 583489 | 511703 |
| Chicago Bears | 2000 | 1080684 | 535552 | 545132 |
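The same collapse from weekly rows to one row per team and season can be sketched on a small made-up table. This version drops the weekly columns by name rather than by position, which is an assumption about style and not the report's original code; the values are invented for illustration.

```r
library(dplyr)
library(tidyr)

# Toy data (made up for illustration): one team, one season, three weeks.
toy <- data.frame(
  team = "Arizona", team_name = "Cardinals", year = 2000,
  total = 100, home = 40, away = 60,
  week = 1:3, weekly_attendance = c(30, NA, 35)
)

season <- toy %>%
  unite(team, team, team_name, sep = " ") %>%  # one combined team name
  select(-week, -weekly_attendance) %>%         # drop weekly columns by name
  distinct()                                    # keep one row per season

print(season)
```

Dropping columns by name keeps the step correct even if the column order of the source file changes.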

C. Standings Dataset Cleaning

Standings Dataset Cleaning:

The first step in cleaning the standings dataset was to combine the team and team_name variables, consistent with the attendance data. I then sorted by year and by wins in descending order so I could check the first few observations and make sure the data made sense. When checking the head, I noticed that the variables playoffs and sb_winner were character values, reported as “No Playoffs” or “Playoffs” and “No Superbowl” or “Won Superbowl” respectively. I recoded them as binary values: 0 for No Playoffs/No Superbowl and 1 for Playoffs/Won Superbowl. The following code shows these actions, along with the first 6 rows of the new clean standings dataset:

#Combining team and team_name
standings <- as_tibble(standings)
standings <- standings %>% unite(team, team, team_name, sep = " ")

#Sort by years then wins
standings <- arrange(standings, year, desc(wins))

#Changing playoffs and sb_winner from a character to Binary variable
standings$playoffs <- factor(standings$playoffs, levels=c("No Playoffs", "Playoffs"), labels=c(0, 1))
standings$sb_winner <- factor(standings$sb_winner, levels=c("No Superbowl", "Won Superbowl"), labels=c(0, 1))

| team | year | wins | loss | points_for | points_against | points_differential | margin_of_victory | strength_of_schedule | simple_rating | offensive_ranking | defensive_ranking | playoffs | sb_winner |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Tennessee Titans | 2000 | 13 | 3 | 346 | 191 | 155 | 9.7 | -1.3 | 8.3 | 1.5 | 6.8 | 1 | 0 |
| Baltimore Ravens | 2000 | 12 | 4 | 333 | 165 | 168 | 10.5 | -2.5 | 8.0 | 0.0 | 8.0 | 1 | 1 |
| Oakland Raiders | 2000 | 12 | 4 | 479 | 299 | 180 | 11.3 | -1.5 | 9.7 | 8.0 | 1.8 | 1 | 0 |
| New York Giants | 2000 | 12 | 4 | 328 | 246 | 82 | 5.1 | -2.7 | 2.4 | -1.3 | 3.8 | 1 | 0 |
| Miami Dolphins | 2000 | 11 | 5 | 323 | 226 | 97 | 6.1 | 1.0 | 7.1 | 0.0 | 7.1 | 1 | 0 |
| Denver Broncos | 2000 | 11 | 5 | 485 | 369 | 116 | 7.3 | -2.2 | 5.0 | 7.8 | -2.7 | 1 | 0 |
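The recoding can also be sketched without factors, by comparing each label with the "yes" value and converting the result to an integer. This is an alternative shown on toy vectors, not the report's own approach; the label spellings are taken from the code above.

```r
# Toy labels in the same style as the standings data.
playoffs  <- c("Playoffs", "No Playoffs", "Playoffs")
sb_winner <- c("No Superbowl", "No Superbowl", "Won Superbowl")

# A logical comparison converted to integer gives 1 for a match, 0 otherwise.
playoffs_bin  <- as.integer(playoffs == "Playoffs")
sb_winner_bin <- as.integer(sb_winner == "Won Superbowl")

# Guard against unexpected spellings that would silently become 0.
stopifnot(all(playoffs %in% c("Playoffs", "No Playoffs")))
stopifnot(all(sb_winner %in% c("Won Superbowl", "No Superbowl")))

print(playoffs_bin)   # 1 0 1
print(sb_winner_bin)  # 0 0 1
```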

D. Combining & Cleaning Datasets

Combining & Cleaning Datasets:

Both datasets had the same number of observations and identical team and year keys, so I combined them to work with one dataset. After combining, I renamed the attendance variables from “total”, “home”, and “away” to “total_attendance”, “home_attendance”, and “away_attendance” respectively to differentiate them. I then looked at the structure of the variables and noticed that many were stored as integers and the two binary variables were still factors. I converted all of them to numeric to make calculations easy. The new merged dataset is named standings2. Below is the code and the first 6 observations:

#Merge two data frames by team and year
standings2 <- merge(standings,attendance,by=c("team","year"))

#Rename variables
standings2 <- standings2 %>% 
  rename(
    total_attendance = total,
    home_attendance = home,
    away_attendance = away
    )

#Changing variable types
standings2[3:7] <- lapply(standings2[3:7], as.numeric)
standings2$playoffs <- as.numeric(levels(standings2$playoffs))[standings2$playoffs]
standings2$sb_winner <- as.numeric(levels(standings2$sb_winner))[standings2$sb_winner]
standings2[15:17] <- lapply(standings2[15:17], as.numeric)

| team | year | wins | loss | points_for | points_against | points_differential | margin_of_victory | strength_of_schedule | simple_rating | offensive_ranking | defensive_ranking | playoffs | sb_winner | total_attendance | home_attendance | away_attendance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Arizona Cardinals | 2000 | 3 | 13 | 210 | 443 | -233 | -14.6 | -0.7 | -15.2 | -7.2 | -8.1 | 0 | 0 | 893926 | 387475 | 506451 |
| Arizona Cardinals | 2001 | 7 | 9 | 295 | 343 | -48 | -3.0 | -1.2 | -4.2 | -1.5 | -2.6 | 0 | 0 | 811391 | 307315 | 504076 |
| Arizona Cardinals | 2002 | 5 | 11 | 262 | 417 | -155 | -9.7 | -0.2 | -9.9 | -5.4 | -4.5 | 0 | 0 | 898877 | 327272 | 571605 |
| Arizona Cardinals | 2003 | 4 | 12 | 225 | 452 | -227 | -14.2 | 1.6 | -12.6 | -6.3 | -6.2 | 0 | 0 | 804401 | 288499 | 515902 |
| Arizona Cardinals | 2004 | 6 | 10 | 284 | 322 | -38 | -2.4 | -2.5 | -4.9 | -5.1 | 0.2 | 0 | 0 | 838557 | 300267 | 538290 |
| Arizona Cardinals | 2005 | 5 | 11 | 311 | 387 | -76 | -4.8 | -0.2 | -5.0 | -2.0 | -3.0 | 0 | 0 | 920848 | 401035 | 519813 |
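Two details of this step are worth checking on a toy example with made-up values: that the merge keeps every row, and why the factor columns are converted through `levels()` rather than with `as.numeric()` directly. The data frames `st` and `at` below are hypothetical stand-ins for the standings and attendance tables.

```r
# Toy tables (made-up values) sharing the keys team and year.
st <- data.frame(team = c("A", "B"), year = c(2000, 2000), wins = c(12, 4),
                 playoffs = factor(c("Playoffs", "No Playoffs"),
                                   levels = c("No Playoffs", "Playoffs"),
                                   labels = c(0, 1)))
at <- data.frame(team = c("A", "B"), year = c(2000, 2000), total = c(900, 800))

merged <- merge(st, at, by = c("team", "year"))

# An inner join drops unmatched keys, so the row count should be unchanged.
stopifnot(nrow(merged) == nrow(st))

# as.numeric() on a factor returns level codes (1, 2), not the labels (0, 1);
# going through levels() recovers the labels.
codes  <- as.numeric(merged$playoffs)
values <- as.numeric(levels(merged$playoffs))[merged$playoffs]
print(codes)   # 2 1
print(values)  # 1 0
```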

E. Subsetting into 4 Datasets

Subsetting into 4 Datasets:

The last step of the cleaning process is where it all starts coming together. From the standings2 dataset, I created 4 subsets:

  • Super Bowl Winners (sb_champs)
  • Teams Who Made the Playoffs (made_playoff)
  • Teams Who Missed the Playoffs (missed_playoff)
  • The New England Patriots (pats)

I removed the Super Bowl winners from made_playoff so they are not counted twice. I then sorted made_playoff and missed_playoff by year and by descending wins to get a good look at the data. The code and the head of each subset follow:

#Subsetting (filtering) SB winners, playoffs, missed playoffs, and NE Patriots
sb_champs <- filter(standings2, sb_winner == 1) 
sb_champs <- sb_champs[,-(13:14)]
sb_champs <- arrange(sb_champs, year)
kable(head(sb_champs), format = "markdown")

| team | year | wins | loss | points_for | points_against | points_differential | margin_of_victory | strength_of_schedule | simple_rating | offensive_ranking | defensive_ranking | total_attendance | home_attendance | away_attendance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Baltimore Ravens | 2000 | 12 | 4 | 333 | 165 | 168 | 10.5 | -2.5 | 8.0 | 0.0 | 8.0 | 1062373 | 551695 | 510678 |
| New England Patriots | 2001 | 11 | 5 | 371 | 272 | 99 | 6.2 | -1.9 | 4.3 | 1.2 | 3.1 | 977717 | 482336 | 495381 |
| Tampa Bay Buccaneers | 2002 | 12 | 4 | 346 | 196 | 150 | 9.4 | -0.6 | 8.8 | -1.0 | 9.8 | 1044920 | 525031 | 519889 |
| New England Patriots | 2003 | 14 | 2 | 348 | 238 | 110 | 6.9 | 0.1 | 6.9 | 2.1 | 4.9 | 1127515 | 547488 | 580027 |
| New England Patriots | 2004 | 14 | 2 | 437 | 260 | 177 | 11.1 | 1.8 | 12.8 | 6.4 | 6.5 | 1108210 | 550048 | 558162 |
| Pittsburgh Steelers | 2005 | 11 | 5 | 389 | 258 | 131 | 8.2 | -0.4 | 7.8 | 3.8 | 4.0 | 1048739 | 507434 | 541305 |

made_playoff <- filter(standings2, playoffs == 1)   
made_playoff <- subset(made_playoff, made_playoff$sb_winner==0 )
made_playoff <- made_playoff[,-(13:14)]
made_playoff <- arrange(made_playoff, year, desc(wins))
kable(head(made_playoff), format = "markdown")

| team | year | wins | loss | points_for | points_against | points_differential | margin_of_victory | strength_of_schedule | simple_rating | offensive_ranking | defensive_ranking | total_attendance | home_attendance | away_attendance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Tennessee Titans | 2000 | 13 | 3 | 346 | 191 | 155 | 9.7 | -1.3 | 8.3 | 1.5 | 6.8 | 1091274 | 547524 | 543750 |
| New York Giants | 2000 | 12 | 4 | 328 | 246 | 82 | 5.1 | -2.7 | 2.4 | -1.3 | 3.8 | 1135455 | 624085 | 511370 |
| Oakland Raiders | 2000 | 12 | 4 | 479 | 299 | 180 | 11.3 | -1.5 | 9.7 | 8.0 | 1.8 | 998655 | 462515 | 536140 |
| Denver Broncos | 2000 | 11 | 5 | 485 | 369 | 116 | 7.3 | -2.2 | 5.0 | 7.8 | -2.7 | 1140030 | 604042 | 535988 |
| Miami Dolphins | 2000 | 11 | 5 | 323 | 226 | 97 | 6.1 | 1.0 | 7.1 | 0.0 | 7.1 | 1118883 | 589909 | 528974 |
| Minnesota Vikings | 2000 | 11 | 5 | 397 | 371 | 26 | 1.6 | 0.3 | 1.9 | 4.3 | -2.3 | 1029262 | 513322 | 515940 |

missed_playoff <- filter(standings2, playoffs == 0) 
missed_playoff <- missed_playoff[,-(13:14)]
missed_playoff <- arrange(missed_playoff, year, desc(wins))
kable(head(missed_playoff), format = "markdown")

| team | year | wins | loss | points_for | points_against | points_differential | margin_of_victory | strength_of_schedule | simple_rating | offensive_ranking | defensive_ranking | total_attendance | home_attendance | away_attendance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Detroit Lions | 2000 | 9 | 7 | 307 | 307 | 0 | 0.0 | 1.4 | 1.4 | -0.1 | 1.5 | 1140926 | 607076 | 533850 |
| Green Bay Packers | 2000 | 9 | 7 | 353 | 323 | 30 | 1.9 | 0.6 | 2.5 | 1.8 | 0.7 | 1049602 | 478747 | 570855 |
| New York Jets | 2000 | 9 | 7 | 321 | 321 | 0 | 0.0 | 3.5 | 3.5 | 1.4 | 2.2 | 1145146 | 623711 | 521435 |
| Pittsburgh Steelers | 2000 | 9 | 7 | 321 | 255 | 66 | 4.1 | -0.2 | 3.9 | 0.6 | 3.3 | 987037 | 440426 | 546611 |
| Buffalo Bills | 2000 | 8 | 8 | 315 | 350 | -35 | -2.2 | 2.2 | 0.0 | 0.5 | -0.5 | 1098587 | 560695 | 537892 |
| Washington Redskins | 2000 | 8 | 8 | 281 | 269 | 12 | 0.8 | 0.2 | 1.0 | -2.9 | 3.8 | 1174332 | 647424 | 526908 |

pats <- filter(standings2, team == "New England Patriots")
kable(head(pats), format = "markdown")

| team | year | wins | loss | points_for | points_against | points_differential | margin_of_victory | strength_of_schedule | simple_rating | offensive_ranking | defensive_ranking | playoffs | sb_winner | total_attendance | home_attendance | away_attendance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| New England Patriots | 2000 | 5 | 11 | 276 | 338 | -62 | -3.9 | 1.4 | -2.5 | -2.7 | 0.2 | 0 | 0 | 1030594 | 482336 | 548258 |
| New England Patriots | 2001 | 11 | 5 | 371 | 272 | 99 | 6.2 | -1.9 | 4.3 | 1.2 | 3.1 | 1 | 1 | 977717 | 482336 | 495381 |
| New England Patriots | 2002 | 9 | 7 | 381 | 346 | 35 | 2.2 | 1.8 | 4.0 | 2.1 | 1.9 | 0 | 0 | 1096069 | 547488 | 548581 |
| New England Patriots | 2003 | 14 | 2 | 348 | 238 | 110 | 6.9 | 0.1 | 6.9 | 2.1 | 4.9 | 1 | 1 | 1127515 | 547488 | 580027 |
| New England Patriots | 2004 | 14 | 2 | 437 | 260 | 177 | 11.1 | 1.8 | 12.8 | 6.4 | 6.5 | 1 | 1 | 1108210 | 550048 | 558162 |
| New England Patriots | 2005 | 10 | 6 | 379 | 338 | 41 | 2.6 | 0.6 | 3.1 | 3.7 | -0.5 | 1 | 0 | 1136903 | 550048 | 586855 |
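Because the champions are removed from the playoff group, the three outcome subsets should split the data with no overlap. A quick way to confirm this is sketched below on a made-up table; the names `champs`, `made`, and `missed` are stand-ins for sb_champs, made_playoff, and missed_playoff.

```r
# Toy season table (made-up values): one champion, one other playoff team,
# and two teams that missed the playoffs.
toy <- data.frame(team      = c("A", "B", "C", "D"),
                  playoffs  = c(1, 1, 0, 0),
                  sb_winner = c(1, 0, 0, 0))

champs <- toy[toy$sb_winner == 1, ]
made   <- toy[toy$playoffs == 1 & toy$sb_winner == 0, ]
missed <- toy[toy$playoffs == 0, ]

# The three groups should not overlap and should cover every row.
all_teams <- c(champs$team, made$team, missed$team)
stopifnot(!anyDuplicated(all_teams))
stopifnot(setequal(all_teams, toy$team))
print(c(champs = nrow(champs), made = nrow(made), missed = nrow(missed)))
```

On the real data, the equivalent check is that the row counts of the three subsets add up to the row count of standings2.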

Exploratory Data Analysis

F. Correlation Matrix

Correlation Matrix

The first step of the analysis was to create a correlation matrix, with the hope of identifying variables that are correlated with the playoffs variable. Many variables are highly correlated with playoffs. However, all of them are correlated even more strongly with wins, which suggests that these variables lead to wins and wins lead to playoffs. Below is a list of these variables with their correlation coefficients (r) with playoffs and with wins, in that order:

  • simple_rating (0.67)(0.88)
  • points_differential (0.71)(0.92)
  • margin_of_victory (0.71)(0.92)
  • points_for (0.56)(0.73)
  • offensive_ranking (0.55)(0.73)
  • wins (0.78)(NA)
  • points_against (-0.53)(-0.68)
  • defensive_ranking (0.49)(0.64)
#Correlation Matrix of Standings2
corr <- round(cor(standings2[3:17]), 1)

p.mat <- cor_pmat(standings2[3:17])

correlation <-ggcorrplot(corr, hc.order = TRUE, type = "lower",
           lab = TRUE)
correlation
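A note on terminology, sketched on made-up numbers: `cor()` returns the correlation coefficient r, which carries a sign, while R^2 in a one-predictor regression is the square of r and is never negative. The vectors below are invented for illustration only.

```r
# Toy data (made up): wins and point differential for five seasons.
wins       <- c(4, 6, 8, 10, 13)
point_diff <- c(-120, -40, 10, 60, 150)

r <- cor(wins, point_diff)                       # correlation coefficient
r_squared <- summary(lm(wins ~ point_diff))$r.squared

# For a single predictor, R^2 equals r squared, and only r carries a sign.
stopifnot(isTRUE(all.equal(r^2, r_squared)))
print(round(c(r = r, r_squared = r_squared), 3))
```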

G. Making the Playoffs

Wins Per Season Analysis

After seeing that wins are the most direct influence on making the playoffs, I wanted to compare the distribution of wins across the four subsets. In this data, no team made the playoffs with fewer than 7 wins. It is also worth noting that the median for Super Bowl winners is 1 win higher than for the other teams that made the playoffs. Lastly, a note on the Patriots dynasty: aside from 2000 (pre-Tom Brady), when the Patriots won only 5 games, they won 9 or more games in every season for the next 19 years.

boxplot(pats$wins, sb_champs$wins, made_playoff$wins, missed_playoff$wins, 
        main = "Wins Per Season", ylab = "Wins", col = c("royalblue", "gold", "green", "red"),
        names = c("Patriots", "SB Champs", "Playoff Made", "Playoff Missed"))

Playoff Probability

Based on the given data, I calculated the share of teams at each win total that made the playoffs:

  • If a team finishes with at least 11 wins, they have a 99.3% chance of making the playoffs
  • If a team finishes with 10 wins, they have an 87.3% chance of making the playoffs
  • If a team finishes with 9 wins, they have a 37.7% chance of making the playoffs
  • If a team finishes with 8 wins, they have an 8.7% chance of making the playoffs
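These percentages are empirical shares: the fraction of team-seasons at a given win total that made the playoffs. The report does not show the calculation, so the following is one possible sketch on made-up data, relying on the fact that the mean of a 0/1 column is the share of 1s.

```r
# Toy data (made up): eight team-seasons.
toy <- data.frame(wins     = c(11, 11, 10, 10, 9, 9, 8, 8),
                  playoffs = c(1, 1, 1, 0, 1, 0, 0, 0))

# The mean of a 0/1 column is the share of 1s, i.e. the playoff rate.
rate_by_wins <- tapply(toy$playoffs, toy$wins, mean)
print(rate_by_wins)

# Share of teams with at least 11 wins that made the playoffs.
rate_11_plus <- mean(toy$playoffs[toy$wins >= 11])
print(rate_11_plus)
```

Applied to standings2, the same two lines would use `standings2$playoffs` and `standings2$wins`.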

What Leads to a Win?

There are essentially 2 things that go into winning or losing a game, and they are pretty obvious:

  • Points Scored For
  • Points Scored Against

Although there are more specific variables that can influence the result of a football game, like weather, home field advantage, health of players, and so on, at a game’s core, these variables can likely predict whether a team will reach the playoffs or be sitting on their couches come January. Following is a visual display to demonstrate each of these variables’ effects on wins.

Analyzing Points For

As seen in the graph below, there is a strong positive correlation between wins and points for. Points For has a direct impact on number of wins in a season.

highlight_df <- standings2 %>% 
  filter(team == "New England Patriots")

ggplot(standings2, aes(x=wins, y=points_for)) + 
  geom_point(aes(col=playoffs), size = 2) + 
  scale_colour_gradientn(colours= c("red","green")) +
  geom_point(data = highlight_df, aes(x=wins, y=points_for), colour = "navyblue", size = 3 ) +
  geom_smooth(method="lm", size=1.5, colour = "black")  +
  labs(title="Wins vs. Points For", subtitle = "Compared with The NE Patriots (Blue)", y="Total Points For", x="Wins")

The following boxplot compares points for across the four subsets. There is a drastic increase from teams that missed the playoffs to teams that made them. Note that the middle 50% of ‘points for’ is almost identical for playoff teams and Super Bowl champions, meaning ‘points for’ is not what separates a Super Bowl winner from the other playoff teams.

boxplot(pats$points_for, sb_champs$points_for, made_playoff$points_for, missed_playoff$points_for, 
        main = "Points For Comparison", ylab = "Points For", col = c("royalblue", "gold", "green", "red"),
        names = c("Patriots", "SB Champs", "Playoff Made", "Playoff Missed"))

This plot shows just how dominant the Patriots were. Their lowest scoring year was still in the middle 50% of teams that made the playoffs, which helped lead to 17 division titles in 19 years. They consistently scored in the highest tier among NFL teams. Note: the “dynasty” dataset includes the Patriots standings data excluding 2000, since the dynasty officially started in 2001.

dynasty <- pats[-1,]
ggplot(dynasty, aes(wins, points_for)) +
  geom_point(size = 3, colour = "navyblue") +
  geom_point(aes(col=sb_winner)) +
  geom_smooth(method="lm", size=1.5, colour = "navyblue") + 
  labs(title="Patriots Wins vs. Points For", y="Total Points For", x="Wins")

Points For Probabilities

Based on the given data, I calculated the share of teams at or above each points-for threshold that made the playoffs:

  • If a team finishes with at least 450 points for, they have an 86.0% chance of making the playoffs
  • If a team finishes with at least 400 points for, they have a 76.7% chance of making the playoffs
  • If a team finishes with at least 350 points for, they have a 60.6% chance of making the playoffs
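Threshold-based shares like these can be computed with one small helper applied to several cutoffs. The sketch below uses made-up data, and the function name `playoff_rate` is hypothetical rather than something from the report.

```r
# Hypothetical helper: playoff rate among rows where `keep` is TRUE.
playoff_rate <- function(playoffs, keep) {
  if (!any(keep)) return(NA_real_)  # no team-seasons meet the condition
  mean(playoffs[keep])
}

# Toy data (made up): points scored and playoff flag for six team-seasons.
points_for <- c(480, 455, 410, 390, 340, 300)
playoffs   <- c(1,   1,   1,   0,   0,   0)

thresholds <- c(450, 400, 350)
rates <- sapply(thresholds,
                function(cutoff) playoff_rate(playoffs, points_for >= cutoff))
names(rates) <- paste0(">=", thresholds)
print(round(rates, 3))
```

The same helper works for other conditions, for example a logical vector built from points against or point differential.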

Analyzing Points Against

As seen in the graph below, there is a strong negative correlation between wins and points against. Points Against has a direct impact on the number of wins in a season.

highlight_df <- standings2 %>% 
  filter(team == "New England Patriots")

ggplot(standings2, aes(x=wins, y=points_against)) + 
  geom_point(aes(col=playoffs), size = 2) + 
  scale_colour_gradientn(colours= c("red","green")) +
  geom_point(data = highlight_df, aes(x=wins, y=points_against), colour = "navyblue", size = 3 ) +
  geom_smooth(method="lm", size=1.5, colour = "black")  +
  labs(title="Wins vs. Points Against", subtitle = "Compared with The NE Patriots (Blue)", y="Total Points Against", x="Wins")

The following boxplot compares points against across the four subsets. There is a drastic decrease from teams that missed the playoffs to teams that made them. Unlike points for, this plot shows a difference between playoff teams and Super Bowl champions: 50% of Super Bowl winners gave up fewer points than 75% of playoff teams. This is where playoff teams and Super Bowl winners differ.

boxplot(pats$points_against, sb_champs$points_against, made_playoff$points_against, missed_playoff$points_against, 
        main = "Points Against", ylab = "Points", col = c("royalblue", "gold", "green", "red"),
        names = c("Patriots", "Superbowl Champs", "Playoff Made", "Playoff Missed"))

This plot yet again shows just how dominant the Patriots were. The same negative correlation holds for Patriots wins. The Patriots gave up fewer than 300 points in 45% of their seasons and never exceeded 350.

ggplot(dynasty, aes(wins, points_against)) +
  geom_point(size = 3, colour = "navyblue") +
  geom_point(aes(col=sb_winner)) +
  geom_smooth(method="lm", size=1.5, colour = "navyblue") + 
  labs(title="Patriots Wins vs. Points Against", y="Total Points Against", x="Wins")

Points Against Probabilities

Based on the given data, I calculated the share of teams at each points-against threshold that made the playoffs:

  • If a team allows 260 or fewer points, they have a 97.6% chance of making the playoffs
  • If a team allows 300 or fewer points, they have an 80.8% chance of making the playoffs
  • If a team allows more than 350 points, they have a 14.7% chance of making the playoffs
  • If a team allows more than 400 points, they have a 4.8% chance of making the playoffs

Comparing Points For vs Points Against

The following plot shows the Points For vs Points Against for each team, separated between the teams who missed the playoffs and those who did make the playoffs.

points_fa <- ggplot(standings2, aes(x=points_for, y=points_against, color = factor(playoffs))) + 
  geom_point(data = highlight_df, aes(x=points_for, y=points_against), colour = "navyblue", size = 3 ) +
  geom_point() +
  xlim(150,625) + ylim(150,625) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", size = 1) +
  labs(title="Points For vs. Points Against", subtitle = "Compared with The NE Patriots (Blue)", y="Total Points Against", x="Total Points For", color = "Playoff Made")

points_fa

The dashed line is the x = y line: a team that gives up more points than it scores falls above the line, and a team that scores more than it gives up falls below it. As shown, the farther below the dashed line a team falls, the more likely it is to reach the playoffs. This means that Points For and Points Against do not help or hurt independently; what matters is the combination of the number of points and the margin between the two. You can see how strong the Patriots offense was over the years, and that their defense also performed above average.

Analyzing Point Differentials

It is important to note that Point Differential shows a similar correlation with wins as Points For and Points Against. You can see the Patriots’ level of dominance, as they sit near the top of the chart for point differential in 19 of the 20 seasons.

point_dif <- ggplot(standings2, aes(x=wins, y=points_differential, color = factor(playoffs))) + 
  geom_point() + 
  geom_point(data = highlight_df, aes(x=wins, y=points_differential), colour = "navyblue", size = 3 ) +
  geom_smooth(method="lm", size=1.5, colour = "black")  +
  labs(title="Wins vs. Point Differential", subtitle = "Compared with The NE Patriots (Navy)", y="Point Differential", x="Wins", color = "Playoff Made")

plot(point_dif)

Point Differential Probabilities

Based on the given data, I calculated the share of teams at each point-differential threshold that made the playoffs:

  • If a team finishes with a Point Differential of -50 or lower, they have a 1.9% chance of making the playoffs
  • If a team finishes with a Point Differential of 0 or lower, they have a 5.2% chance of making the playoffs
  • If a team finishes with a Point Differential of 0 or higher, they have a 69.3% chance of making the playoffs
  • If a team finishes with a Point Differential of 50 or higher, they have a 86.6% chance of making the playoffs
  • If a team finishes with a Point Differential of 100 or higher, they have a 95.9% chance of making the playoffs

H. Multiple Linear Regression

Multiple Linear Regression

A final way to analyze the impact that Points For and Points Against have on Wins is to fit a multiple linear regression. The fitted model has an R^2 value of 0.84, indicating that these two variables explain about 84% of the variation in Wins. Below are the code, the regression summary, and the equation.

fit <- lm(standings2$wins ~ standings2$points_for + standings2$points_against)
summary(fit)
## 
## Call:
## lm(formula = standings2$wins ~ standings2$points_for + standings2$points_against)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3445 -0.8470 -0.0334  0.8322  3.9053 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                8.7362443  0.4184560   20.88   <2e-16 ***
## standings2$points_for      0.0270262  0.0006989   38.67   <2e-16 ***
## standings2$points_against -0.0291728  0.0008380  -34.81   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.238 on 635 degrees of freedom
## Multiple R-squared:  0.8395, Adjusted R-squared:  0.839 
## F-statistic:  1660 on 2 and 635 DF,  p-value: < 2.2e-16

The equation that will calculate number of Wins based on Points For and Points Against is as follows:

\[\text{Wins} = 8.736 + 0.027(\text{Points For}) - 0.029(\text{Points Against})\]

One of the biggest takeaways from this regression is that the coefficient on Points Against (-0.029) is slightly larger in magnitude than the coefficient on Points For (0.027), meaning that each point scored against a team hurts slightly more than each point scored helps. From this standpoint, teams should put emphasis on a stronger defense that gives up fewer points per game and per season.
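The fitted equation can be wrapped in a small function to try different point totals. This is a sketch: the function name `predict_wins` is hypothetical, and the coefficients are copied from the regression output above rather than re-estimated.

```r
# Coefficients copied from the regression output above.
predict_wins <- function(points_for, points_against) {
  8.7362443 + 0.0270262 * points_for - 0.0291728 * points_against
}

# Worked example: 450 points scored, 260 allowed.
print(round(predict_wins(450, 260), 1))  # 13.3

# Effect of one extra point allowed versus one extra point scored.
print(predict_wins(400, 301) - predict_wins(400, 300))  # about -0.029
print(predict_wins(401, 300) - predict_wins(400, 300))  # about  0.027
```

The last two lines show the per-point effects directly: one more point allowed costs slightly more in projected wins than one more point scored adds.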

Summary

Problem Statement

Again, the purpose of this report is to understand how NFL teams can reach the playoffs and the Super Bowl. Many underlying factors besides Tom Brady lead to the playoffs, and we attempted to pinpoint what they are.

Methodology

To break down what it takes to reach the playoffs, I used data from Pro Football Reference, which can be accessed HERE. This data is titled “2019 NFL Attendance Data” and includes 3 datasets: Attendance, Standings, and Games. Only the Standings variables were used in the analysis. From the data, I created a correlation matrix, which showed that many variables are correlated with making the playoffs; however, those variables are correlated more strongly with wins, which in turn is highly correlated with playoffs. After breaking down what makes up a win, I ran a multiple linear regression to quantify the relationship.

Insights and Results

  • Playoffs are decided by Wins

    • With the exception of 1 of 145 teams in the last 20 years, every team that won at least 11 games made the playoffs. Teams that finished with 10 wins had an 87.3% chance of making the playoffs. Wins were the biggest predictor of whether a team makes the playoffs. This raises the question of how a team can win 11 games and secure a playoff bid.
  • What it takes to achieve a Win in an NFL game

    • After looking at all the variables, Points For and Points Against were the most telling, as expected. More Points For and fewer Points Against lead to more wins. On average, a playoff team had 402 Points For and 312 Points Against in a season (25 and 20 points per game respectively), whereas teams that missed the playoffs averaged 319 Points For and 375 Points Against (20 and 23 points per game respectively). These are only averages; many teams scored or allowed considerably more or fewer points, so treat them as a reference point.

      • If a team finishes with at least 450 Points For, they have an 86% chance of making the playoffs.
      • If a team gives up 260 or fewer Points Against, they have a 97.6% chance of making the playoffs.
      • If a team finishes with a Point Differential of 100 or higher, they have a 95.9% chance of making the playoffs.
    • These are three metrics teams should strive for. On a per-game level, this means averaging at least 28 points scored, 16 or fewer points allowed, and at least a 6-point differential.

    • Lastly, after fitting a multiple linear regression, the equation for estimating wins is as follows: \[\text{Wins} = 8.736 + 0.027(\text{Points For}) - 0.029(\text{Points Against})\]

      • This equation can predict quite accurately the number of wins a team will have based on their points for and points against. As an example, I will use the thresholds from the probabilities above:

        \[\text{Wins} = 8.736 + 0.027(450) - 0.029(260) \approx 13\]

      • If a team scores 450 points and allows 260 points, they are projected to win 13 games; every team with that many wins in the past 20 years made the playoffs.

      • An important takeaway from this regression is that each point against carries slightly more weight than each point for, meaning that every additional point against reduces a team’s wins by slightly more than every additional point for increases them. This suggests that a team with a stronger defense will end with more wins, assuming their offense can still score more points than the defense allows.

Limitations to this Report

This report takes a very broad, bird’s-eye view of how teams can make the playoffs. John Madden, NFL legend, put it pretty simply: “Usually the team that scores the most points wins the game.” This is essentially what this report says in general terms. With more time and effort to analyze other correlated variables, or to go more in depth into the Games dataset, where metrics from every NFL game in the last 20 years have been recorded, I believe we can uncover more key factors that determine the fate of an NFL team’s season. However, I do believe that this report is a good starting point for recognizing that there is a place in sports for analytics and data. From what I uncovered, there are a few ideas that coaches and ownership can act on, specifically the emphasis that should be placed on the defensive side of the ball. In addition, I began this analysis with the idea of incorporating the attendance data but ended up not using it. I believe it would be more useful to analyze those numbers at the game level from the Games dataset; however, that is an entirely different problem statement and analysis.

Lastly,

Thank You For Reading!

- Seth Draper