Data Wrangling Project

Seth Draper

The Dynasty of the New England Patriots

From 2001 to 2018, the New England Patriots dominated the National Football League. Over the course of 17 years, the Patriots won 15 division titles, 12 AFC Championship appearances, 7 Super Bowl appearances, and 5 Super Bowl victories. Let’s find out how they got there…

Synopsis

Problem Statement:

What does it take for an NFL team to win the Super Bowl? More importantly, how can they make the playoffs? These are questions that used to be answered with “win the most games in the division” and “win every game in playoffs.” We can now quantify exactly what it takes to make it to the most elite level of football. The New England Patriots consistently ended up in the AFC Championship or the Super Bowl almost every year, to the point where they made it look easy. We will dive in-depth to quantify what it takes to be a playoff team.

Solution Overview:

To reach a conclusion, NFL data including three datasets from 2000 to 2019 - one of which will be used, will be cleaned, wrangled, and analyzed to find what variables lead to being a playoff team and Super Bowl victor. The majority of the analysis will draw comparisons between the New England Patriots, Super Bowl winners, playoff contenders, and teams who missed the playoffs.

Implementation:

After the data is cleaned, we will analyze what goes into a playoff team, including wins. Additionally we will dive deeper into what exactly determines a win. We will show what it will take for an NFL team to become successful, like the New England Patriots.

Direct Impact of Analysis:

This report is for NFL ownership and management to understand what is required to take their team to the next step of either making the playoffs or winning it all.

Packages Required

Packages Required:

To reproduce the code and results from this report the following packages will need to be installed.

library(tidyverse)  ## Visualizing, transforming, tidying and joining data

library(dplyr)      ## Manipulating data

library(ggplot2)    ## Used for Visualizing data

library(ggcorrplot) ## Used to visualize Correlation matrix

library(knitr)      ## Neccessary to show tables in RMarkdown

Data Preparation

A. Data Import

Data Import:

The data from this report was compiled from Pro Football Reference and can be accessed HERE. This data is titled “2019 NFL Attendance Data” and includes 3 datasets: Attendance, Standings, and Games. These datasets are comprised of data from 2000-2019 and include many variables. Although this data is titled and focused around attendance, a more interesting analysis came to mind regarding playoff teams and Super Bowl winners. All the data was mostly clean with no missing values (except ‘NA’ attendance on Bye Weeks).

The attendance dataset was comprised of 10,846 observations and 8 variables:

  • Team
  • Team Name
  • Year
  • Total
  • Home
  • Away
  • Week
  • Weekly Attendance

The standings dataset was comprised of 638 observations and 15 variables:

  • Team
  • Team Name
  • Year
  • Wins
  • Loss
  • Points For
  • Points Against
  • Point Differential
  • Margin of Victory
  • Strength of Schedule

The games dataset was comprised of 5,324 observations and 19 variables:

  • Year
  • Week
  • Home Team
  • Away Team
  • Winner
  • Tie
  • Day
  • Time
  • Points Win
  • Points Loss
  • Yards Win
  • Turnovers Win
  • Yards Loss
  • Turnovers Loss
  • Home Team Name
  • Home Team City
  • Away Team Name
  • Away Team City

The first step in this project was to download the datasets and store them in an R File on my desktop. From there, set the working directory and import all three datasets into R.

setwd("C:/Users/sethd/OneDrive/Desktop/R Files")
attendance <- read.csv("attendance.csv")
standings <- read.csv("standings.csv")
games <- read.csv("games.csv")

B. Attendance Dataset Cleaning

Attendance Dataset Cleaning:

Upon briefly exploring the datasets, I decided to not use the games data, but felt it was appropriate to keep it in R in the event that I want to continue the report at a single game level instead of the season level.

After importing the data, the first step was to clean the attendance dataset. As just stated, I am only concerned about the seasons, not game specific - this applies to attendance. The first step was that I united the team and team_name variables to condense and have one team name as opposed to two. Then I removed week and week_attendance columns, leaving many duplicate rows (17 per team per season) because all that was left was the total, home, and away attendances repeating. To eliminate redundancies, I called for only distinct observations to remain. The code below shows the previous description followed by the first 6 observations of the now clean attendance dataset:

#Combine Team Name
attendance <- attendance %>% unite(team, team, team_name, sep = " ")

#Transform into Team Total Attendance per year
attendance <- attendance[,-(6:7)]
attendance <- distinct(attendance)
team year total home away
Arizona Cardinals 2000 893926 387475 506451
Atlanta Falcons 2000 964579 422814 541765
Baltimore Ravens 2000 1062373 551695 510678
Buffalo Bills 2000 1098587 560695 537892
Carolina Panthers 2000 1095192 583489 511703
Chicago Bears 2000 1080684 535552 545132

C. Standings Dataset Cleaning

Standings Dataset Cleaning:

The first step I took in cleaning the standings dataset was to combine the two team and team_name variables to condense and keep consistent with the attendance data. Following that I proceeded to do a multi-layered sort by year then wins in descending order. This was to check the first few observations and make sure the data made sense. Then when checking the head, I realized the variables playoffs and sb_winner were character values as they were reported as “No Playoffs”, “Playoffs”, “No Superbowl”, and “Superbowl” respectively. I changed the values to a binary 0 and 1 for No Playoffs/No Superbowl and Yes Playoffs/Yes Superbowl. The following codes shows these actions, along with an output of the first 6 values in the new clean standings dataset:

#Combining team and team_name
standings <- as_tibble(standings)
standings <- standings %>% unite(team, team, team_name, sep = " ")

#Sort by years then wins
standings <- arrange(standings, year, desc(wins))

#Changing playoffs and sb_winner from a character to Binary variable
standings$playoffs <- factor(standings$playoffs, levels=c("No Playoffs", "Playoffs"), labels=c(0, 1))
standings$sb_winner <- factor(standings$sb_winner, levels=c("No Superbowl", "Won Superbowl"), labels=c(0, 1))
team year wins loss points_for points_against points_differential margin_of_victory strength_of_schedule simple_rating offensive_ranking defensive_ranking playoffs sb_winner
Tennessee Titans 2000 13 3 346 191 155 9.7 -1.3 8.3 1.5 6.8 1 0
Baltimore Ravens 2000 12 4 333 165 168 10.5 -2.5 8.0 0.0 8.0 1 1
Oakland Raiders 2000 12 4 479 299 180 11.3 -1.5 9.7 8.0 1.8 1 0
New York Giants 2000 12 4 328 246 82 5.1 -2.7 2.4 -1.3 3.8 1 0
Miami Dolphins 2000 11 5 323 226 97 6.1 1.0 7.1 0.0 7.1 1 0
Denver Broncos 2000 11 5 485 369 116 7.3 -2.2 5.0 7.8 -2.7 1 0

D. Combining & Cleaning Datasets

Combining & Cleaning Datasets:

To make things easier, I noticed that both datasets had the same number of observations and identical team and year keys - therefore I decided to combine the two datasets to work with them as one. After combining, I renamed the attendance variables from “total”,“home”,“away” to “total_attendance”, “home_attendance”, and “away_attendance” respectively to differentiate. I then looked at the structures of all the variables and noticed that many were in integer form and the two binary variables were still variables. I then converted them to numeric variables to make calculations easy. This new merged dataset is named standings2. Below is the code and the first 6 observations

#Merge two data frames by team and year
standings2 <- merge(standings,attendance,by=c("team","year"))

#Rename variables
standings2 <- standings2 %>% 
  rename(
    total_attendance = total,
    home_attendance = home,
    away_attendance = away
    )

#Changing variable types
standings2[3:7] <- lapply(standings2[3:7], as.numeric)
standings2$playoffs <- as.numeric(levels(standings2$playoffs))[standings2$playoffs]
standings2$sb_winner <- as.numeric(levels(standings2$sb_winner))[standings2$sb_winner]
standings2[15:17] <- lapply(standings2[15:17], as.numeric)
team year wins loss points_for points_against points_differential margin_of_victory strength_of_schedule simple_rating offensive_ranking defensive_ranking playoffs sb_winner total_attendance home_attendance away_attendance
Arizona Cardinals 2000 3 13 210 443 -233 -14.6 -0.7 -15.2 -7.2 -8.1 0 0 893926 387475 506451
Arizona Cardinals 2001 7 9 295 343 -48 -3.0 -1.2 -4.2 -1.5 -2.6 0 0 811391 307315 504076
Arizona Cardinals 2002 5 11 262 417 -155 -9.7 -0.2 -9.9 -5.4 -4.5 0 0 898877 327272 571605
Arizona Cardinals 2003 4 12 225 452 -227 -14.2 1.6 -12.6 -6.3 -6.2 0 0 804401 288499 515902
Arizona Cardinals 2004 6 10 284 322 -38 -2.4 -2.5 -4.9 -5.1 0.2 0 0 838557 300267 538290
Arizona Cardinals 2005 5 11 311 387 -76 -4.8 -0.2 -5.0 -2.0 -3.0 0 0 920848 401035 519813

E. Subsetting into 4 Datasets

Subsetting into 4 Datasets:

The last step of the cleaning process is where it all starts coming together. From the standings2 dataset, I created 4 subsets: * Super Bowl Winners (sb_champs) * Teams Who Made the Playoffs (made_playoff) * Teams Who Missed the Playoffs (missed_playoff) * The New England Patriots (pats) I removed the Super Bowl Winners from the made_playoff to make sure they were not counted twice. I then sorted missed_playoff and made_playoff by year then descending years to get a good look at the data. Following is the code and the heads of each:

#Subsetting (filtering) SB winners, playoffs, missed playoffs, and NE Patriots
sb_champs <- filter(standings2, sb_winner == 1) 
sb_champs <- sb_champs[,-(13:14)]
sb_champs <- arrange(sb_champs, year)
kable(head(sb_champs), format = "markdown")
team year wins loss points_for points_against points_differential margin_of_victory strength_of_schedule simple_rating offensive_ranking defensive_ranking total_attendance home_attendance away_attendance
Baltimore Ravens 2000 12 4 333 165 168 10.5 -2.5 8.0 0.0 8.0 1062373 551695 510678
New England Patriots 2001 11 5 371 272 99 6.2 -1.9 4.3 1.2 3.1 977717 482336 495381
Tampa Bay Buccaneers 2002 12 4 346 196 150 9.4 -0.6 8.8 -1.0 9.8 1044920 525031 519889
New England Patriots 2003 14 2 348 238 110 6.9 0.1 6.9 2.1 4.9 1127515 547488 580027
New England Patriots 2004 14 2 437 260 177 11.1 1.8 12.8 6.4 6.5 1108210 550048 558162
Pittsburgh Steelers 2005 11 5 389 258 131 8.2 -0.4 7.8 3.8 4.0 1048739 507434 541305
made_playoff <- filter(standings2, playoffs == 1)   
made_playoff <- subset(made_playoff, made_playoff$sb_winner==0 )
made_playoff <- made_playoff[,-(13:14)]
made_playoff <- arrange(made_playoff, year, desc(wins))
kable(head(made_playoff), format = "markdown")
team year wins loss points_for points_against points_differential margin_of_victory strength_of_schedule simple_rating offensive_ranking defensive_ranking total_attendance home_attendance away_attendance
Tennessee Titans 2000 13 3 346 191 155 9.7 -1.3 8.3 1.5 6.8 1091274 547524 543750
New York Giants 2000 12 4 328 246 82 5.1 -2.7 2.4 -1.3 3.8 1135455 624085 511370
Oakland Raiders 2000 12 4 479 299 180 11.3 -1.5 9.7 8.0 1.8 998655 462515 536140
Denver Broncos 2000 11 5 485 369 116 7.3 -2.2 5.0 7.8 -2.7 1140030 604042 535988
Miami Dolphins 2000 11 5 323 226 97 6.1 1.0 7.1 0.0 7.1 1118883 589909 528974
Minnesota Vikings 2000 11 5 397 371 26 1.6 0.3 1.9 4.3 -2.3 1029262 513322 515940
missed_playoff <- filter(standings2, playoffs == 0) 
missed_playoff <- missed_playoff[,-(13:14)]
missed_playoff <- arrange(missed_playoff, year, desc(wins))
kable(head(missed_playoff), format = "markdown")
team year wins loss points_for points_against points_differential margin_of_victory strength_of_schedule simple_rating offensive_ranking defensive_ranking total_attendance home_attendance away_attendance
Detroit Lions 2000 9 7 307 307 0 0.0 1.4 1.4 -0.1 1.5 1140926 607076 533850
Green Bay Packers 2000 9 7 353 323 30 1.9 0.6 2.5 1.8 0.7 1049602 478747 570855
New York Jets 2000 9 7 321 321 0 0.0 3.5 3.5 1.4 2.2 1145146 623711 521435
Pittsburgh Steelers 2000 9 7 321 255 66 4.1 -0.2 3.9 0.6 3.3 987037 440426 546611
Buffalo Bills 2000 8 8 315 350 -35 -2.2 2.2 0.0 0.5 -0.5 1098587 560695 537892
Washington Redskins 2000 8 8 281 269 12 0.8 0.2 1.0 -2.9 3.8 1174332 647424 526908
pats <- filter(standings2, team == "New England Patriots")
kable(head(pats), format = "markdown")
team year wins loss points_for points_against points_differential margin_of_victory strength_of_schedule simple_rating offensive_ranking defensive_ranking playoffs sb_winner total_attendance home_attendance away_attendance
New England Patriots 2000 5 11 276 338 -62 -3.9 1.4 -2.5 -2.7 0.2 0 0 1030594 482336 548258
New England Patriots 2001 11 5 371 272 99 6.2 -1.9 4.3 1.2 3.1 1 1 977717 482336 495381
New England Patriots 2002 9 7 381 346 35 2.2 1.8 4.0 2.1 1.9 0 0 1096069 547488 548581
New England Patriots 2003 14 2 348 238 110 6.9 0.1 6.9 2.1 4.9 1 1 1127515 547488 580027
New England Patriots 2004 14 2 437 260 177 11.1 1.8 12.8 6.4 6.5 1 1 1108210 550048 558162
New England Patriots 2005 10 6 379 338 41 2.6 0.6 3.1 3.7 -0.5 1 0 1136903 550048 586855

Exploratory Data Analysis

F. Correlation Matrix

Correlation Matrix

The first step of analysis was to create a correlation matrix. From this my hope was to establish variables that are correlated to the playoff variable. When analyzing correlations against the playoffs variable, it appeared that there are many highly correlated variables with playoffs. However, all of these variables are correlated significantly higher with wins, which leads to the idea that these variables lead to wins and wins lead to playoffs. Below is a list of the correlated variables with the R^2-value to playoffs and wins in that order:

  • simple_rating (0.67)(0.88)
  • points_differential (0.71)(0.92)
  • margin_of_victory (0.71)(0.92)
  • points_for (0.56)(0.73)
  • offensive_ranking (0.55)(0.73)
  • wins (0.78)(NA)
  • points_against (-0.53)(-0.68)
  • defensive_ranking (0.49)(0.64)
#Correlation Matrix of Standings2
corr <- round(cor(standings2[3:17]), 1)

p.mat <- cor_pmat(standings2[3:17])

correlation <-ggcorrplot(corr, hc.order = TRUE, type = "lower",
           lab = TRUE)
correlation

G. Making the Playoffs

Wins Per Season Analysis

After realizing that wins are the direct influencer of making the playoffs, I wanted to compare the distributions of the four subsets. It is evident that teams will have no chance to make the playoffs without at least 7 wins. It is also worth noting that the Super Bowl winner’s median is 1 game better than the teams that made the playoffs. Lastly, a note on the Patriot Dynasty, beside 2000 (pre-Tom Brady) when the Patriots only won 5 games, they won 11 or more games every season for the next 19 years.

boxplot(pats$wins, sb_champs$wins, made_playoff$wins, missed_playoff$wins, 
        main = "Wins Per Season", ylab = "Wins", col = c("royalblue", "gold", "green", "red"),
        names = c("Patriots", "SB Champs", "Playoff Made", "Playoff Missed"))

Playoff Probability

Based on the given data, I calculated out probabilities based on wins:

  • If a team finishes with at least 11 wins, they have a 99.3% chance of making the playoffs
  • If a team finishes with 10 wins, they have an 87.3% chance of making the playoffs
  • If a team finishes with 9 wins, they have an 37.7% chance of making the playoffs
  • If a team finishes with 8 wins, they have an 8.7% chance of making the playoffs

What Leads to a Win?

There are essentially 2 things that go into winning or losing a game, and they are pretty obvious:

  • Points Scored For
  • Points Scored Against

Although there are more specific variables that can influence the result of a football game, like weather, home field advantage, health of players, and so on, at a game’s core, these variables can likely predict whether a team will reach the playoffs or be sitting on their couches come January. Following is a visual display to demonstrate each of these variables’ effects on wins.

Analyzing Points For

As seen in the graph below, there is a strong positive correlation between wins and points for. Points For has a direct impact on number of wins in a season.

highlight_df <- standings2 %>% 
  filter(standings2$team == "New England Patriots")

ggplot(standings2, aes(x=wins, y=points_for)) + 
  geom_point(aes(col=playoffs), size = 2) + 
  scale_colour_gradientn(colours= c("red","green")) +
  geom_point(data = highlight_df, aes(x=wins, y=points_for), colour = "navyblue", size = 3 ) +
  geom_smooth(method="lm", size=1.5, colour = "black")  +
  labs(title="Wins vs. Points For", subtitle = "Compared with The NE Patriots (Blue)", y="Total Points For", x="Wins")

The boxplot following compares points for between the four subsets. You can see that there is a drastic increase between teams who did not make the playoffs as opposed to those who did. One thing to note here is that the middle 50% of ‘points for’ of teams who made the playoffs and Super Bowl Champions is almost identical, meaning ‘points for’ is not the driving factor for a team who made the playoffs to lead to a Super Bowl victory.

boxplot(pats$points_for, sb_champs$points_for, made_playoff$points_for, missed_playoff$points_for, 
        main = "Points For Comparison", ylab = "Points For", col = c("royalblue", "gold", "green", "red"),
        names = c("Patriots", "SB Champs", "Playoff Made", "Playoff Missed"))

This plot shows just how dominate the Patriots were. Their lowest scoring year was still in the middle 50% of made playoffs, which led to the 17 titles in 19 years. They consistently scored in the highest tier among all other NFL teams. Note - The "dynasty’ dataset includes patriots standings data, excluding 2000 since the dynasty officially started in 2001.

dynasty <- pats[-1,]
ggplot(dynasty, aes(wins, points_for)) +
  geom_point(size = 3, colour = "navyblue") +
  geom_point(aes(col=sb_winner)) +
  geom_smooth(method="lm", size=1.5, colour = "navyblue") + 
labs(title="Patriots Wins vs. Points For", y="Total Points For", x="Wins")

Points For Probabilities

Based on the given data, I calculated out probabilities based on points for:

  • If a team finishes with at least 450 points for, they have a 86.0% chance of making the playoffs
  • If a team finishes with at least 400 points for, they have a 76.7% chance of making the playoffs
  • If a team finishes with at least 350 points for, they have a 60.6% chance of making the playoffs

Analyzing Points Against

As seen in the graph below, there is a strong negative correlation between wins and points points against. Points Against has a direct impact on number of wins in a season.

highlight_df <- standings2 %>% 
  filter(standings2$team == "New England Patriots")

ggplot(standings2, aes(x=wins, y=points_against)) + 
  geom_point(aes(col=playoffs), size = 2) + 
  scale_colour_gradientn(colours= c("red","green")) +
  geom_point(data = highlight_df, aes(x=wins, y=points_against), colour = "navyblue", size = 3 ) +
  geom_smooth(method="lm", size=1.5, colour = "black")  +
  labs(title="Wins vs. Points Against", subtitle = "Compared with The NE Patriots (Blue)", y="Total Points Against", x="Wins")

The boxplot following compares points for between the four subsets. You can see that there is a drastic increase between teams who did not make the playoffs as opposed to those who did. Unlike points for, this plot shows that there is a difference between points scored against by playoff teams and Super Bowl Champs. 50% of Super Bowl winners gave up less points than 75% of playoff teams. This is where the difference is between playoff teams and Super Bowl winners.

boxplot(pats$points_against, sb_champs$points_against, made_playoff$points_against, missed_playoff$points_against, 
        main = "Points Against", ylab = "Points", col = c("royalblue", "gold", "green", "red"),
        names = c("Patriots", "Superbowl Champs", "Playoff Made", "Playoff Missed"))

This plot yet again shows just how dominate the Patriots were. The same negative correlation applies with Patriot wins. The Patriots gave up less than 300 points in 45% of their seasons and never exceed 350.

ggplot(dynasty, aes(wins, points_against)) +
  geom_point(size = 3, colour = "navyblue") +
  geom_point(aes(col=sb_winner)) +
  geom_smooth(method="lm", size=1.5, colour = "navyblue") + 
  labs(title="Patriots Wins vs. Points Against", y="Total Points Against", x="Wins")

Points For Probabilities

Based on the given data, I calculated out probabilities based on points for:

  • If a team lets up less than 260 points against, they have a 97.6% chance of making the playoffs
  • If a team lets up less than 300 points against, they have a 80.8% chance of making the playoffs
  • If a team lets up less than 350 points against, they have a 57.8% chance of making the playoffs
  • If a team lets up less than 450 points against, they have a 39.8% chance of making the playoffs

Summary

A lot of the analysis is complete, however, I plan to go further in depth. I was going off of the final project rubric so I went ahead and did a lot of the analysis as opposed to formulating what I want to do. Moving forward for the Final project turn in, I will perform more analysis and right up a full summary. Thank you.