Problem Statement:
What does it take for an NFL team to win the Super Bowl? More importantly, how can they make the playoffs? These are questions that used to be answered with “win the most games in the division” and “win every game in playoffs.” We can now quantify exactly what it takes to make it to the most elite level of football. The New England Patriots consistently ended up in the AFC Championship or the Super Bowl almost every year, to the point where they made it look easy. We will dive in-depth to quantify what it takes to be a playoff team.
Solution Overview:
To reach a conclusion, NFL data including three datasets from 2000 to 2019 - one of which will be used, will be cleaned, wrangled, and analyzed to find what variables lead to being a playoff team and Super Bowl victor. The majority of the analysis will draw comparisons between the New England Patriots, Super Bowl winners, playoff contenders, and teams who missed the playoffs.
Implementation:
After the data is cleaned, we will analyze what goes into a playoff team, including wins. Additionally we will dive deeper into what exactly determines a win. We will show what it will take for an NFL team to become successful, like the New England Patriots.
Direct Impact of Analysis:
This report is for NFL ownership and management to understand what is required to take their team to the next step of either making the playoffs or winning it all.
Packages Required:
To reproduce the code and results from this report the following packages will need to be installed.
library(tidyverse) ## Visualizing, transforming, tidying and joining data
library(dplyr) ## Manipulating data
library(ggplot2) ## Used for Visualizing data
library(ggcorrplot) ## Used to visualize Correlation matrix
library(knitr) ## Neccessary to show tables in RMarkdown
Data Import:
The data from this report was compiled from Pro Football Reference and can be accessed HERE. This data is titled “2019 NFL Attendance Data” and includes 3 datasets: Attendance, Standings, and Games. These datasets are comprised of data from 2000-2019 and include many variables. Although this data is titled and focused around attendance, a more interesting analysis came to mind regarding playoff teams and Super Bowl winners. All the data was mostly clean with no missing values (except ‘NA’ attendance on Bye Weeks).
The attendance dataset was comprised of 10,846 observations and 8 variables:
The standings dataset was comprised of 638 observations and 15 variables:
The games dataset was comprised of 5,324 observations and 19 variables:
The first step in this project was to download the datasets and store them in an R File on my desktop. From there, set the working directory and import all three datasets into R.
setwd("C:/Users/sethd/OneDrive/Desktop/R Files")
attendance <- read.csv("attendance.csv")
standings <- read.csv("standings.csv")
games <- read.csv("games.csv")
Attendance Dataset Cleaning:
Upon briefly exploring the datasets, I decided to not use the games data, but felt it was appropriate to keep it in R in the event that I want to continue the report at a single game level instead of the season level.
After importing the data, the first step was to clean the attendance dataset. As just stated, I am only concerned about the seasons, not game specific - this applies to attendance. The first step was that I united the team and team_name variables to condense and have one team name as opposed to two. Then I removed week and week_attendance columns, leaving many duplicate rows (17 per team per season) because all that was left was the total, home, and away attendances repeating. To eliminate redundancies, I called for only distinct observations to remain. The code below shows the previous description followed by the first 6 observations of the now clean attendance dataset:
#Combine Team Name
attendance <- attendance %>% unite(team, team, team_name, sep = " ")
#Transform into Team Total Attendance per year
attendance <- attendance[,-(6:7)]
attendance <- distinct(attendance)
| team | year | total | home | away |
|---|---|---|---|---|
| Arizona Cardinals | 2000 | 893926 | 387475 | 506451 |
| Atlanta Falcons | 2000 | 964579 | 422814 | 541765 |
| Baltimore Ravens | 2000 | 1062373 | 551695 | 510678 |
| Buffalo Bills | 2000 | 1098587 | 560695 | 537892 |
| Carolina Panthers | 2000 | 1095192 | 583489 | 511703 |
| Chicago Bears | 2000 | 1080684 | 535552 | 545132 |
Standings Dataset Cleaning:
The first step I took in cleaning the standings dataset was to combine the two team and team_name variables to condense and keep consistent with the attendance data. Following that I proceeded to do a multi-layered sort by year then wins in descending order. This was to check the first few observations and make sure the data made sense. Then when checking the head, I realized the variables playoffs and sb_winner were character values as they were reported as “No Playoffs”, “Playoffs”, “No Superbowl”, and “Superbowl” respectively. I changed the values to a binary 0 and 1 for No Playoffs/No Superbowl and Yes Playoffs/Yes Superbowl. The following codes shows these actions, along with an output of the first 6 values in the new clean standings dataset:
#Combining team and team_name
standings <- as_tibble(standings)
standings <- standings %>% unite(team, team, team_name, sep = " ")
#Sort by years then wins
standings <- arrange(standings, year, desc(wins))
#Changing playoffs and sb_winner from a character to Binary variable
standings$playoffs <- factor(standings$playoffs, levels=c("No Playoffs", "Playoffs"), labels=c(0, 1))
standings$sb_winner <- factor(standings$sb_winner, levels=c("No Superbowl", "Won Superbowl"), labels=c(0, 1))
| team | year | wins | loss | points_for | points_against | points_differential | margin_of_victory | strength_of_schedule | simple_rating | offensive_ranking | defensive_ranking | playoffs | sb_winner |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Tennessee Titans | 2000 | 13 | 3 | 346 | 191 | 155 | 9.7 | -1.3 | 8.3 | 1.5 | 6.8 | 1 | 0 |
| Baltimore Ravens | 2000 | 12 | 4 | 333 | 165 | 168 | 10.5 | -2.5 | 8.0 | 0.0 | 8.0 | 1 | 1 |
| Oakland Raiders | 2000 | 12 | 4 | 479 | 299 | 180 | 11.3 | -1.5 | 9.7 | 8.0 | 1.8 | 1 | 0 |
| New York Giants | 2000 | 12 | 4 | 328 | 246 | 82 | 5.1 | -2.7 | 2.4 | -1.3 | 3.8 | 1 | 0 |
| Miami Dolphins | 2000 | 11 | 5 | 323 | 226 | 97 | 6.1 | 1.0 | 7.1 | 0.0 | 7.1 | 1 | 0 |
| Denver Broncos | 2000 | 11 | 5 | 485 | 369 | 116 | 7.3 | -2.2 | 5.0 | 7.8 | -2.7 | 1 | 0 |
Combining & Cleaning Datasets:
To make things easier, I noticed that both datasets had the same number of observations and identical team and year keys - therefore I decided to combine the two datasets to work with them as one. After combining, I renamed the attendance variables from “total”,“home”,“away” to “total_attendance”, “home_attendance”, and “away_attendance” respectively to differentiate. I then looked at the structures of all the variables and noticed that many were in integer form and the two binary variables were still variables. I then converted them to numeric variables to make calculations easy. This new merged dataset is named standings2. Below is the code and the first 6 observations
#Merge two data frames by team and year
standings2 <- merge(standings,attendance,by=c("team","year"))
#Rename variables
standings2 <- standings2 %>%
rename(
total_attendance = total,
home_attendance = home,
away_attendance = away
)
#Changing variable types
standings2[3:7] <- lapply(standings2[3:7], as.numeric)
standings2$playoffs <- as.numeric(levels(standings2$playoffs))[standings2$playoffs]
standings2$sb_winner <- as.numeric(levels(standings2$sb_winner))[standings2$sb_winner]
standings2[15:17] <- lapply(standings2[15:17], as.numeric)
| team | year | wins | loss | points_for | points_against | points_differential | margin_of_victory | strength_of_schedule | simple_rating | offensive_ranking | defensive_ranking | playoffs | sb_winner | total_attendance | home_attendance | away_attendance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Arizona Cardinals | 2000 | 3 | 13 | 210 | 443 | -233 | -14.6 | -0.7 | -15.2 | -7.2 | -8.1 | 0 | 0 | 893926 | 387475 | 506451 |
| Arizona Cardinals | 2001 | 7 | 9 | 295 | 343 | -48 | -3.0 | -1.2 | -4.2 | -1.5 | -2.6 | 0 | 0 | 811391 | 307315 | 504076 |
| Arizona Cardinals | 2002 | 5 | 11 | 262 | 417 | -155 | -9.7 | -0.2 | -9.9 | -5.4 | -4.5 | 0 | 0 | 898877 | 327272 | 571605 |
| Arizona Cardinals | 2003 | 4 | 12 | 225 | 452 | -227 | -14.2 | 1.6 | -12.6 | -6.3 | -6.2 | 0 | 0 | 804401 | 288499 | 515902 |
| Arizona Cardinals | 2004 | 6 | 10 | 284 | 322 | -38 | -2.4 | -2.5 | -4.9 | -5.1 | 0.2 | 0 | 0 | 838557 | 300267 | 538290 |
| Arizona Cardinals | 2005 | 5 | 11 | 311 | 387 | -76 | -4.8 | -0.2 | -5.0 | -2.0 | -3.0 | 0 | 0 | 920848 | 401035 | 519813 |
Subsetting into 4 Datasets:
The last step of the cleaning process is where it all starts coming together. From the standings2 dataset, I created 4 subsets: * Super Bowl Winners (sb_champs) * Teams Who Made the Playoffs (made_playoff) * Teams Who Missed the Playoffs (missed_playoff) * The New England Patriots (pats) I removed the Super Bowl Winners from the made_playoff to make sure they were not counted twice. I then sorted missed_playoff and made_playoff by year then descending years to get a good look at the data. Following is the code and the heads of each:
#Subsetting (filtering) SB winners, playoffs, missed playoffs, and NE Patriots
sb_champs <- filter(standings2, sb_winner == 1)
sb_champs <- sb_champs[,-(13:14)]
sb_champs <- arrange(sb_champs, year)
kable(head(sb_champs), format = "markdown")
| team | year | wins | loss | points_for | points_against | points_differential | margin_of_victory | strength_of_schedule | simple_rating | offensive_ranking | defensive_ranking | total_attendance | home_attendance | away_attendance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Baltimore Ravens | 2000 | 12 | 4 | 333 | 165 | 168 | 10.5 | -2.5 | 8.0 | 0.0 | 8.0 | 1062373 | 551695 | 510678 |
| New England Patriots | 2001 | 11 | 5 | 371 | 272 | 99 | 6.2 | -1.9 | 4.3 | 1.2 | 3.1 | 977717 | 482336 | 495381 |
| Tampa Bay Buccaneers | 2002 | 12 | 4 | 346 | 196 | 150 | 9.4 | -0.6 | 8.8 | -1.0 | 9.8 | 1044920 | 525031 | 519889 |
| New England Patriots | 2003 | 14 | 2 | 348 | 238 | 110 | 6.9 | 0.1 | 6.9 | 2.1 | 4.9 | 1127515 | 547488 | 580027 |
| New England Patriots | 2004 | 14 | 2 | 437 | 260 | 177 | 11.1 | 1.8 | 12.8 | 6.4 | 6.5 | 1108210 | 550048 | 558162 |
| Pittsburgh Steelers | 2005 | 11 | 5 | 389 | 258 | 131 | 8.2 | -0.4 | 7.8 | 3.8 | 4.0 | 1048739 | 507434 | 541305 |
made_playoff <- filter(standings2, playoffs == 1)
made_playoff <- subset(made_playoff, made_playoff$sb_winner==0 )
made_playoff <- made_playoff[,-(13:14)]
made_playoff <- arrange(made_playoff, year, desc(wins))
kable(head(made_playoff), format = "markdown")
| team | year | wins | loss | points_for | points_against | points_differential | margin_of_victory | strength_of_schedule | simple_rating | offensive_ranking | defensive_ranking | total_attendance | home_attendance | away_attendance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Tennessee Titans | 2000 | 13 | 3 | 346 | 191 | 155 | 9.7 | -1.3 | 8.3 | 1.5 | 6.8 | 1091274 | 547524 | 543750 |
| New York Giants | 2000 | 12 | 4 | 328 | 246 | 82 | 5.1 | -2.7 | 2.4 | -1.3 | 3.8 | 1135455 | 624085 | 511370 |
| Oakland Raiders | 2000 | 12 | 4 | 479 | 299 | 180 | 11.3 | -1.5 | 9.7 | 8.0 | 1.8 | 998655 | 462515 | 536140 |
| Denver Broncos | 2000 | 11 | 5 | 485 | 369 | 116 | 7.3 | -2.2 | 5.0 | 7.8 | -2.7 | 1140030 | 604042 | 535988 |
| Miami Dolphins | 2000 | 11 | 5 | 323 | 226 | 97 | 6.1 | 1.0 | 7.1 | 0.0 | 7.1 | 1118883 | 589909 | 528974 |
| Minnesota Vikings | 2000 | 11 | 5 | 397 | 371 | 26 | 1.6 | 0.3 | 1.9 | 4.3 | -2.3 | 1029262 | 513322 | 515940 |
missed_playoff <- filter(standings2, playoffs == 0)
missed_playoff <- missed_playoff[,-(13:14)]
missed_playoff <- arrange(missed_playoff, year, desc(wins))
kable(head(missed_playoff), format = "markdown")
| team | year | wins | loss | points_for | points_against | points_differential | margin_of_victory | strength_of_schedule | simple_rating | offensive_ranking | defensive_ranking | total_attendance | home_attendance | away_attendance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Detroit Lions | 2000 | 9 | 7 | 307 | 307 | 0 | 0.0 | 1.4 | 1.4 | -0.1 | 1.5 | 1140926 | 607076 | 533850 |
| Green Bay Packers | 2000 | 9 | 7 | 353 | 323 | 30 | 1.9 | 0.6 | 2.5 | 1.8 | 0.7 | 1049602 | 478747 | 570855 |
| New York Jets | 2000 | 9 | 7 | 321 | 321 | 0 | 0.0 | 3.5 | 3.5 | 1.4 | 2.2 | 1145146 | 623711 | 521435 |
| Pittsburgh Steelers | 2000 | 9 | 7 | 321 | 255 | 66 | 4.1 | -0.2 | 3.9 | 0.6 | 3.3 | 987037 | 440426 | 546611 |
| Buffalo Bills | 2000 | 8 | 8 | 315 | 350 | -35 | -2.2 | 2.2 | 0.0 | 0.5 | -0.5 | 1098587 | 560695 | 537892 |
| Washington Redskins | 2000 | 8 | 8 | 281 | 269 | 12 | 0.8 | 0.2 | 1.0 | -2.9 | 3.8 | 1174332 | 647424 | 526908 |
pats <- filter(standings2, team == "New England Patriots")
kable(head(pats), format = "markdown")
| team | year | wins | loss | points_for | points_against | points_differential | margin_of_victory | strength_of_schedule | simple_rating | offensive_ranking | defensive_ranking | playoffs | sb_winner | total_attendance | home_attendance | away_attendance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| New England Patriots | 2000 | 5 | 11 | 276 | 338 | -62 | -3.9 | 1.4 | -2.5 | -2.7 | 0.2 | 0 | 0 | 1030594 | 482336 | 548258 |
| New England Patriots | 2001 | 11 | 5 | 371 | 272 | 99 | 6.2 | -1.9 | 4.3 | 1.2 | 3.1 | 1 | 1 | 977717 | 482336 | 495381 |
| New England Patriots | 2002 | 9 | 7 | 381 | 346 | 35 | 2.2 | 1.8 | 4.0 | 2.1 | 1.9 | 0 | 0 | 1096069 | 547488 | 548581 |
| New England Patriots | 2003 | 14 | 2 | 348 | 238 | 110 | 6.9 | 0.1 | 6.9 | 2.1 | 4.9 | 1 | 1 | 1127515 | 547488 | 580027 |
| New England Patriots | 2004 | 14 | 2 | 437 | 260 | 177 | 11.1 | 1.8 | 12.8 | 6.4 | 6.5 | 1 | 1 | 1108210 | 550048 | 558162 |
| New England Patriots | 2005 | 10 | 6 | 379 | 338 | 41 | 2.6 | 0.6 | 3.1 | 3.7 | -0.5 | 1 | 0 | 1136903 | 550048 | 586855 |
Correlation Matrix
The first step of analysis was to create a correlation matrix. From this my hope was to establish variables that are correlated to the playoff variable. When analyzing correlations against the playoffs variable, it appeared that there are many highly correlated variables with playoffs. However, all of these variables are correlated significantly higher with wins, which leads to the idea that these variables lead to wins and wins lead to playoffs. Below is a list of the correlated variables with the R^2-value to playoffs and wins in that order:
#Correlation Matrix of Standings2
corr <- round(cor(standings2[3:17]), 1)
p.mat <- cor_pmat(standings2[3:17])
correlation <-ggcorrplot(corr, hc.order = TRUE, type = "lower",
lab = TRUE)
correlation
Wins Per Season Analysis
After realizing that wins are the direct influencer of making the playoffs, I wanted to compare the distributions of the four subsets. It is evident that teams will have no chance to make the playoffs without at least 7 wins. It is also worth noting that the Super Bowl winner’s median is 1 game better than the teams that made the playoffs. Lastly, a note on the Patriot Dynasty, beside 2000 (pre-Tom Brady) when the Patriots only won 5 games, they won 11 or more games every season for the next 19 years.
boxplot(pats$wins, sb_champs$wins, made_playoff$wins, missed_playoff$wins,
main = "Wins Per Season", ylab = "Wins", col = c("royalblue", "gold", "green", "red"),
names = c("Patriots", "SB Champs", "Playoff Made", "Playoff Missed"))
Playoff Probability
Based on the given data, I calculated out probabilities based on wins:
What Leads to a Win?
There are essentially 2 things that go into winning or losing a game, and they are pretty obvious:
Although there are more specific variables that can influence the result of a football game, like weather, home field advantage, health of players, and so on, at a game’s core, these variables can likely predict whether a team will reach the playoffs or be sitting on their couches come January. Following is a visual display to demonstrate each of these variables’ effects on wins.
Analyzing Points For
As seen in the graph below, there is a strong positive correlation between wins and points for. Points For has a direct impact on number of wins in a season.
highlight_df <- standings2 %>%
filter(standings2$team == "New England Patriots")
ggplot(standings2, aes(x=wins, y=points_for)) +
geom_point(aes(col=playoffs), size = 2) +
scale_colour_gradientn(colours= c("red","green")) +
geom_point(data = highlight_df, aes(x=wins, y=points_for), colour = "navyblue", size = 3 ) +
geom_smooth(method="lm", size=1.5, colour = "black") +
labs(title="Wins vs. Points For", subtitle = "Compared with The NE Patriots (Blue)", y="Total Points For", x="Wins")
The boxplot following compares points for between the four subsets. You can see that there is a drastic increase between teams who did not make the playoffs as opposed to those who did. One thing to note here is that the middle 50% of ‘points for’ of teams who made the playoffs and Super Bowl Champions is almost identical, meaning ‘points for’ is not the driving factor for a team who made the playoffs to lead to a Super Bowl victory.
boxplot(pats$points_for, sb_champs$points_for, made_playoff$points_for, missed_playoff$points_for,
main = "Points For Comparison", ylab = "Points For", col = c("royalblue", "gold", "green", "red"),
names = c("Patriots", "SB Champs", "Playoff Made", "Playoff Missed"))
This plot shows just how dominate the Patriots were. Their lowest scoring year was still in the middle 50% of made playoffs, which led to the 17 titles in 19 years. They consistently scored in the highest tier among all other NFL teams. Note - The "dynasty’ dataset includes patriots standings data, excluding 2000 since the dynasty officially started in 2001.
dynasty <- pats[-1,]
ggplot(dynasty, aes(wins, points_for)) +
geom_point(size = 3, colour = "navyblue") +
geom_point(aes(col=sb_winner)) +
geom_smooth(method="lm", size=1.5, colour = "navyblue") +
labs(title="Patriots Wins vs. Points For", y="Total Points For", x="Wins")
Points For Probabilities
Based on the given data, I calculated out probabilities based on points for:
Analyzing Points Against
As seen in the graph below, there is a strong negative correlation between wins and points points against. Points Against has a direct impact on number of wins in a season.
highlight_df <- standings2 %>%
filter(standings2$team == "New England Patriots")
ggplot(standings2, aes(x=wins, y=points_against)) +
geom_point(aes(col=playoffs), size = 2) +
scale_colour_gradientn(colours= c("red","green")) +
geom_point(data = highlight_df, aes(x=wins, y=points_against), colour = "navyblue", size = 3 ) +
geom_smooth(method="lm", size=1.5, colour = "black") +
labs(title="Wins vs. Points Against", subtitle = "Compared with The NE Patriots (Blue)", y="Total Points Against", x="Wins")
The boxplot following compares points for between the four subsets. You can see that there is a drastic increase between teams who did not make the playoffs as opposed to those who did. Unlike points for, this plot shows that there is a difference between points scored against by playoff teams and Super Bowl Champs. 50% of Super Bowl winners gave up less points than 75% of playoff teams. This is where the difference is between playoff teams and Super Bowl winners.
boxplot(pats$points_against, sb_champs$points_against, made_playoff$points_against, missed_playoff$points_against,
main = "Points Against", ylab = "Points", col = c("royalblue", "gold", "green", "red"),
names = c("Patriots", "Superbowl Champs", "Playoff Made", "Playoff Missed"))
This plot yet again shows just how dominate the Patriots were. The same negative correlation applies with Patriot wins. The Patriots gave up less than 300 points in 45% of their seasons and never exceed 350.
ggplot(dynasty, aes(wins, points_against)) +
geom_point(size = 3, colour = "navyblue") +
geom_point(aes(col=sb_winner)) +
geom_smooth(method="lm", size=1.5, colour = "navyblue") +
labs(title="Patriots Wins vs. Points Against", y="Total Points Against", x="Wins")
Points For Probabilities
Based on the given data, I calculated out probabilities based on points for:
A lot of the analysis is complete, however, I plan to go further in depth. I was going off of the final project rubric so I went ahead and did a lot of the analysis as opposed to formulating what I want to do. Moving forward for the Final project turn in, I will perform more analysis and right up a full summary. Thank you.