Draper_Midterm_Markdown.utf8.md

Synopsis

Problem Statement:

What does it take for an NFL team to win the Super Bowl? More importantly, how can they make the playoffs? These are questions that used to be answered with “win the most games in the division” and “win every game in playoffs.” We can now quantify exactly what it takes to make it to the most elite level of football. The New England Patriots consistently ended up in the AFC Championship or the Super Bowl almost every year, to the point where they made it look easy. We will dive in-depth to quantify what it takes to be a playoff team.

Solution Overview:

To reach a conclusion, NFL data including three datasets from 2000 to 2019 - one of which will be used, will be cleaned, wrangled, and analyzed to find what variables lead to being a playoff team and Super Bowl victor. The majority of the analysis will draw comparisons between the New England Patriots, Super Bowl winners, playoff contenders, and teams who missed the playoffs.

Implementation:

After the data is cleaned, we will analyze what goes into a playoff team, including wins. Additionally we will dive deeper into what exactly determines a win. We will show what it will take for an NFL team to become successful, like the New England Patriots.

Direct Impact of Analysis:

This report is for NFL ownership and management to understand what is required to take their team to the next step of either making the playoffs or winning it all.

Packages Required

Packages Required:

To reproduce the code and results from this report the following packages will need to be installed.

library(tidyverse)  ## Visualizing, transforming, tidying and joining data

library(dplyr)      ## Manipulating data

library(ggplot2)    ## Used for Visualizing data

library(ggcorrplot) ## Used to visualize Correlation matrix

library(knitr)      ## Neccessary to show tables in RMarkdown

Data Preparation

A. Data Import

Data Import:

The data from this report was compiled from Pro Football Reference and can be accessed HERE. This data is titled “2019 NFL Attendance Data” and includes 3 datasets: Attendance, Standings, and Games. These datasets are comprised of data from 2000-2019 and include many variables. Although this data is titled and focused around attendance, a more interesting analysis came to mind regarding playoff teams and Super Bowl winners. All the data was mostly clean with no missing values (except ‘NA’ attendance on Bye Weeks).

The attendance dataset was comprised of 10,846 observations and 8 variables:

Team
Team Name
Year
Total
Home
Away
Week
Weekly Attendance

The standings dataset was comprised of 638 observations and 15 variables:

Team
Team Name
Year
Wins
Loss
Points For
Points Against
Point Differential
Margin of Victory
Strength of Schedule

The games dataset was comprised of 5,324 observations and 19 variables:

Year
Week
Home Team
Away Team
Winner
Tie
Day
Time
Points Win
Points Loss
Yards Win
Turnovers Win
Yards Loss
Turnovers Loss
Home Team Name
Home Team City
Away Team Name
Away Team City

The first step in this project was to download the datasets and store them in an R File on my desktop. From there, set the working directory and import all three datasets into R.

setwd("C:/Users/sethd/OneDrive/Desktop/R Files")
attendance <- read.csv("attendance.csv")
standings <- read.csv("standings.csv")
games <- read.csv("games.csv")

B. Attendance Dataset Cleaning

Attendance Dataset Cleaning:

Upon briefly exploring the datasets, I decided to not use the games data, but felt it was appropriate to keep it in R in the event that I want to continue the report at a single game level instead of the season level.

After importing the data, the first step was to clean the attendance dataset. As just stated, I am only concerned about the seasons, not game specific - this applies to attendance. The first step was that I united the team and team_name variables to condense and have one team name as opposed to two. Then I removed week and week_attendance columns, leaving many duplicate rows (17 per team per season) because all that was left was the total, home, and away attendances repeating. To eliminate redundancies, I called for only distinct observations to remain. The code below shows the previous description followed by the first 6 observations of the now clean attendance dataset:

#Combine Team Name
attendance <- attendance %>% unite(team, team, team_name, sep = " ")

#Transform into Team Total Attendance per year
attendance <- attendance[,-(6:7)]
attendance <- distinct(attendance)

team	year	total	home	away
Arizona Cardinals	2000	893926	387475	506451
Atlanta Falcons	2000	964579	422814	541765
Baltimore Ravens	2000	1062373	551695	510678
Buffalo Bills	2000	1098587	560695	537892
Carolina Panthers	2000	1095192	583489	511703
Chicago Bears	2000	1080684	535552	545132

C. Standings Dataset Cleaning

Standings Dataset Cleaning:

The first step I took in cleaning the standings dataset was to combine the two team and team_name variables to condense and keep consistent with the attendance data. Following that I proceeded to do a multi-layered sort by year then wins in descending order. This was to check the first few observations and make sure the data made sense. Then when checking the head, I realized the variables playoffs and sb_winner were character values as they were reported as “No Playoffs”, “Playoffs”, “No Superbowl”, and “Superbowl” respectively. I changed the values to a binary 0 and 1 for No Playoffs/No Superbowl and Yes Playoffs/Yes Superbowl. The following codes shows these actions, along with an output of the first 6 values in the new clean standings dataset:

#Combining team and team_name
standings <- as_tibble(standings)
standings <- standings %>% unite(team, team, team_name, sep = " ")

#Sort by years then wins
standings <- arrange(standings, year, desc(wins))

#Changing playoffs and sb_winner from a character to Binary variable
standings$playoffs <- factor(standings$playoffs, levels=c("No Playoffs", "Playoffs"), labels=c(0, 1))
standings$sb_winner <- factor(standings$sb_winner, levels=c("No Superbowl", "Won Superbowl"), labels=c(0, 1))

team	year	wins	loss	points_for	points_against	points_differential	margin_of_victory	strength_of_schedule	simple_rating	offensive_ranking	defensive_ranking	playoffs	sb_winner
Tennessee Titans	2000	13	3	346	191	155	9.7	-1.3	8.3	1.5	6.8	1	0
Baltimore Ravens	2000	12	4	333	165	168	10.5	-2.5	8.0	0.0	8.0	1	1
Oakland Raiders	2000	12	4	479	299	180	11.3	-1.5	9.7	8.0	1.8	1	0
New York Giants	2000	12	4	328	246	82	5.1	-2.7	2.4	-1.3	3.8	1	0
Miami Dolphins	2000	11	5	323	226	97	6.1	1.0	7.1	0.0	7.1	1	0
Denver Broncos	2000	11	5	485	369	116	7.3	-2.2	5.0	7.8	-2.7	1	0

D. Combining & Cleaning Datasets

Combining & Cleaning Datasets:

To make things easier, I noticed that both datasets had the same number of observations and identical team and year keys - therefore I decided to combine the two datasets to work with them as one. After combining, I renamed the attendance variables from “total”,“home”,“away” to “total_attendance”, “home_attendance”, and “away_attendance” respectively to differentiate. I then looked at the structures of all the variables and noticed that many were in integer form and the two binary variables were still variables. I then converted them to numeric variables to make calculations easy. This new merged dataset is named standings2. Below is the code and the first 6 observations

#Merge two data frames by team and year
standings2 <- merge(standings,attendance,by=c("team","year"))

#Rename variables
standings2 <- standings2 %>% 
  rename(
    total_attendance = total,
    home_attendance = home,
    away_attendance = away
    )

#Changing variable types
standings2[3:7] <- lapply(standings2[3:7], as.numeric)
standings2$playoffs <- as.numeric(levels(standings2$playoffs))[standings2$playoffs]
standings2$sb_winner <- as.numeric(levels(standings2$sb_winner))[standings2$sb_winner]
standings2[15:17] <- lapply(standings2[15:17], as.numeric)

team	year	wins	loss	points_for	points_against	points_differential	margin_of_victory	strength_of_schedule	simple_rating	offensive_ranking	defensive_ranking	total_attendance	home_attendance	away_attendance
Arizona Cardinals	2000	3	13	210	443	-233	-14.6	-0.7	-15.2	-7.2	-8.1	893926	387475	506451
Arizona Cardinals	2001	7	9	295	343	-48	-3.0	-1.2	-4.2	-1.5	-2.6	811391	307315	504076
Arizona Cardinals	2002	5	11	262	417	-155	-9.7	-0.2	-9.9	-5.4	-4.5	898877	327272	571605
Arizona Cardinals	2003	4	12	225	452	-227	-14.2	1.6	-12.6	-6.3	-6.2	804401	288499	515902
Arizona Cardinals	2004	6	10	284	322	-38	-2.4	-2.5	-4.9	-5.1	0.2	838557	300267	538290
Arizona Cardinals	2005	5	11	311	387	-76	-4.8	-0.2	-5.0	-2.0	-3.0	920848	401035	519813

E. Subsetting into 4 Datasets

Subsetting into 4 Datasets:

The last step of the cleaning process is where it all starts coming together. From the standings2 dataset, I created 4 subsets: * Super Bowl Winners (sb_champs) * Teams Who Made the Playoffs (made_playoff) * Teams Who Missed the Playoffs (missed_playoff) * The New England Patriots (pats) I removed the Super Bowl Winners from the made_playoff to make sure they were not counted twice. I then sorted missed_playoff and made_playoff by year then descending years to get a good look at the data. Following is the code and the heads of each:

#Subsetting (filtering) SB winners, playoffs, missed playoffs, and NE Patriots
sb_champs <- filter(standings2, sb_winner == 1) 
sb_champs <- sb_champs[,-(13:14)]
sb_champs <- arrange(sb_champs, year)
kable(head(sb_champs), format = "markdown")

team	year	wins	loss	points_for	points_against	points_differential	margin_of_victory	strength_of_schedule	simple_rating	offensive_ranking	defensive_ranking	total_attendance	home_attendance	away_attendance
Baltimore Ravens	2000	12	4	333	165	168	10.5	-2.5	8.0	0.0	8.0	1062373	551695	510678
New England Patriots	2001	11	5	371	272	99	6.2	-1.9	4.3	1.2	3.1	977717	482336	495381
Tampa Bay Buccaneers	2002	12	4	346	196	150	9.4	-0.6	8.8	-1.0	9.8	1044920	525031	519889
New England Patriots	2003	14	2	348	238	110	6.9	0.1	6.9	2.1	4.9	1127515	547488	580027
New England Patriots	2004	14	2	437	260	177	11.1	1.8	12.8	6.4	6.5	1108210	550048	558162
Pittsburgh Steelers	2005	11	5	389	258	131	8.2	-0.4	7.8	3.8	4.0	1048739	507434	541305

made_playoff <- filter(standings2, playoffs == 1)   
made_playoff <- subset(made_playoff, made_playoff$sb_winner==0 )
made_playoff <- made_playoff[,-(13:14)]
made_playoff <- arrange(made_playoff, year, desc(wins))
kable(head(made_playoff), format = "markdown")

team	year	wins	loss	points_for	points_against	points_differential	margin_of_victory	strength_of_schedule	simple_rating	offensive_ranking	defensive_ranking	total_attendance	home_attendance	away_attendance
Tennessee Titans	2000	13	3	346	191	155	9.7	-1.3	8.3	1.5	6.8	1091274	547524	543750
New York Giants	2000	12	4	328	246	82	5.1	-2.7	2.4	-1.3	3.8	1135455	624085	511370
Oakland Raiders	2000	12	4	479	299	180	11.3	-1.5	9.7	8.0	1.8	998655	462515	536140
Denver Broncos	2000	11	5	485	369	116	7.3	-2.2	5.0	7.8	-2.7	1140030	604042	535988
Miami Dolphins	2000	11	5	323	226	97	6.1	1.0	7.1	0.0	7.1	1118883	589909	528974
Minnesota Vikings	2000	11	5	397	371	26	1.6	0.3	1.9	4.3	-2.3	1029262	513322	515940

missed_playoff <- filter(standings2, playoffs == 0) 
missed_playoff <- missed_playoff[,-(13:14)]
missed_playoff <- arrange(missed_playoff, year, desc(wins))
kable(head(missed_playoff), format = "markdown")

team	year	wins	loss	points_for	points_against	points_differential	margin_of_victory	strength_of_schedule	simple_rating	offensive_ranking	defensive_ranking	total_attendance	home_attendance	away_attendance
Detroit Lions	2000	9	7	307	307	0	0.0	1.4	1.4	-0.1	1.5	1140926	607076	533850
Green Bay Packers	2000	9	7	353	323	30	1.9	0.6	2.5	1.8	0.7	1049602	478747	570855
New York Jets	2000	9	7	321	321	0	0.0	3.5	3.5	1.4	2.2	1145146	623711	521435
Pittsburgh Steelers	2000	9	7	321	255	66	4.1	-0.2	3.9	0.6	3.3	987037	440426	546611
Buffalo Bills	2000	8	8	315	350	-35	-2.2	2.2	0.0	0.5	-0.5	1098587	560695	537892
Washington Redskins	2000	8	8	281	269	12	0.8	0.2	1.0	-2.9	3.8	1174332	647424	526908

pats <- filter(standings2, team == "New England Patriots")
kable(head(pats), format = "markdown")

team	year	wins	loss	points_for	points_against	points_differential	margin_of_victory	strength_of_schedule	simple_rating	offensive_ranking	defensive_ranking	playoffs	sb_winner	total_attendance	home_attendance	away_attendance
New England Patriots	2000	5	11	276	338	-62	-3.9	1.4	-2.5	-2.7	0.2	0	0	1030594	482336	548258
New England Patriots	2001	11	5	371	272	99	6.2	-1.9	4.3	1.2	3.1	1	1	977717	482336	495381
New England Patriots	2002	9	7	381	346	35	2.2	1.8	4.0	2.1	1.9	0	0	1096069	547488	548581
New England Patriots	2003	14	2	348	238	110	6.9	0.1	6.9	2.1	4.9	1	1	1127515	547488	580027
New England Patriots	2004	14	2	437	260	177	11.1	1.8	12.8	6.4	6.5	1	1	1108210	550048	558162
New England Patriots	2005	10	6	379	338	41	2.6	0.6	3.1	3.7	-0.5	1	0	1136903	550048	586855

Exploratory Data Analysis

F. Correlation Matrix

Correlation Matrix

The first step of analysis was to create a correlation matrix. From this my hope was to establish variables that are correlated to the playoff variable. When analyzing correlations against the playoffs variable, it appeared that there are many highly correlated variables with playoffs. However, all of these variables are correlated significantly higher with wins, which leads to the idea that these variables lead to wins and wins lead to playoffs. Below is a list of the correlated variables with the R^2-value to playoffs and wins in that order:

simple_rating (0.67)(0.88)
points_differential (0.71)(0.92)
margin_of_victory (0.71)(0.92)
points_for (0.56)(0.73)
offensive_ranking (0.55)(0.73)
wins (0.78)(NA)
points_against (-0.53)(-0.68)
defensive_ranking (0.49)(0.64)

#Correlation Matrix of Standings2
corr <- round(cor(standings2[3:17]), 1)

p.mat <- cor_pmat(standings2[3:17])

correlation <-ggcorrplot(corr, hc.order = TRUE, type = "lower",
           lab = TRUE)
correlation

G. Making the Playoffs

Wins Per Season Analysis

After realizing that wins are the direct influencer of making the playoffs, I wanted to compare the distributions of the four subsets. It is evident that teams will have no chance to make the playoffs without at least 7 wins. It is also worth noting that the Super Bowl winner’s median is 1 game better than the teams that made the playoffs. Lastly, a note on the Patriot Dynasty, beside 2000 (pre-Tom Brady) when the Patriots only won 5 games, they won 9 or more games every season for the next 19 years.

boxplot(pats$wins, sb_champs$wins, made_playoff$wins, missed_playoff$wins, 
        main = "Wins Per Season", ylab = "Wins", col = c("royalblue", "gold", "green", "red"),
        names = c("Patriots", "SB Champs", "Playoff Made", "Playoff Missed"))

Playoff Probability

Based on the given data, I calculated out probabilities based on wins:

If a team finishes with at least 11 wins, they have a 99.3% chance of making the playoffs
If a team finishes with 10 wins, they have an 87.3% chance of making the playoffs
If a team finishes with 9 wins, they have an 37.7% chance of making the playoffs
If a team finishes with 8 wins, they have an 8.7% chance of making the playoffs

What Leads to a Win?

There are essentially 2 things that go into winning or losing a game, and they are pretty obvious:

Points Scored For
Points Scored Against

Although there are more specific variables that can influence the result of a football game, like weather, home field advantage, health of players, and so on, at a game’s core, these variables can likely predict whether a team will reach the playoffs or be sitting on their couches come January. Following is a visual display to demonstrate each of these variables’ effects on wins.

Analyzing Points For

As seen in the graph below, there is a strong positive correlation between wins and points for. Points For has a direct impact on number of wins in a season.

highlight_df <- standings2 %>% 
  filter(standings2$team == "New England Patriots")

ggplot(standings2, aes(x=wins, y=points_for)) + 
  geom_point(aes(col=playoffs), size = 2) + 
  scale_colour_gradientn(colours= c("red","green")) +
  geom_point(data = highlight_df, aes(x=wins, y=points_for), colour = "navyblue", size = 3 ) +
  geom_smooth(method="lm", size=1.5, colour = "black")  +
  labs(title="Wins vs. Points For", subtitle = "Compared with The NE Patriots (Blue)", y="Total Points For", x="Wins")

The boxplot following compares points for between the four subsets. You can see that there is a drastic increase between teams who did not make the playoffs as opposed to those who did. One thing to note here is that the middle 50% of ‘points for’ of teams who made the playoffs and Super Bowl Champions is almost identical, meaning ‘points for’ is not the driving factor for a team who made the playoffs to lead to a Super Bowl victory.

boxplot(pats$points_for, sb_champs$points_for, made_playoff$points_for, missed_playoff$points_for, 
        main = "Points For Comparison", ylab = "Points For", col = c("royalblue", "gold", "green", "red"),
        names = c("Patriots", "SB Champs", "Playoff Made", "Playoff Missed"))

This plot shows just how dominate the Patriots were. Their lowest scoring year was still in the middle 50% of made playoffs, which led to the 17 titles in 19 years. They consistently scored in the highest tier among all other NFL teams. Note - The "dynasty’ dataset includes patriots standings data, excluding 2000 since the dynasty officially started in 2001.

dynasty <- pats[-1,]
ggplot(dynasty, aes(wins, points_for)) +
  geom_point(size = 3, colour = "navyblue") +
  geom_point(aes(col=sb_winner)) +
  geom_smooth(method="lm", size=1.5, colour = "navyblue") + 
labs(title="Patriots Wins vs. Points For", y="Total Points For", x="Wins")

Points For Probabilities

Based on the given data, I calculated out probabilities based on points for:

If a team finishes with at least 450 points for, they have a 86.0% chance of making the playoffs
If a team finishes with at least 400 points for, they have a 76.7% chance of making the playoffs
If a team finishes with at least 350 points for, they have a 60.6% chance of making the playoffs

Analyzing Points Against

As seen in the graph below, there is a strong negative correlation between wins and points points against. Points Against has a direct impact on number of wins in a season.

highlight_df <- standings2 %>% 
  filter(standings2$team == "New England Patriots")

ggplot(standings2, aes(x=wins, y=points_against)) + 
  geom_point(aes(col=playoffs), size = 2) + 
  scale_colour_gradientn(colours= c("red","green")) +
  geom_point(data = highlight_df, aes(x=wins, y=points_against), colour = "navyblue", size = 3 ) +
  geom_smooth(method="lm", size=1.5, colour = "black")  +
  labs(title="Wins vs. Points Against", subtitle = "Compared with The NE Patriots (Blue)", y="Total Points Against", x="Wins")

The boxplot following compares points for between the four subsets. You can see that there is a drastic increase between teams who did not make the playoffs as opposed to those who did. Unlike points for, this plot shows that there is a difference between points scored against by playoff teams and Super Bowl Champs. 50% of Super Bowl winners gave up less points than 75% of playoff teams. This is where the difference is between playoff teams and Super Bowl winners.

boxplot(pats$points_against, sb_champs$points_against, made_playoff$points_against, missed_playoff$points_against, 
        main = "Points Against", ylab = "Points", col = c("royalblue", "gold", "green", "red"),
        names = c("Patriots", "Superbowl Champs", "Playoff Made", "Playoff Missed"))

This plot yet again shows just how dominate the Patriots were. The same negative correlation applies with Patriot wins. The Patriots gave up less than 300 points in 45% of their seasons and never exceed 350.

ggplot(dynasty, aes(wins, points_against)) +
  geom_point(size = 3, colour = "navyblue") +
  geom_point(aes(col=sb_winner)) +
  geom_smooth(method="lm", size=1.5, colour = "navyblue") + 
  labs(title="Patriots Wins vs. Points Against", y="Total Points Against", x="Wins")

Points For Probabilities

Based on the given data, I calculated out probabilities based on points for:

If a team lets up 260 or less points against, they have a 97.6% chance of making the playoffs
If a team lets up 300 or less points against, they have a 80.8% chance of making the playoffs
If a team lets up more than 350 points against, they have a 14.7% chance of making the playoffs
If a team lets up more than 400 points against, they have a 4.8% chance of making the playoffs

Comparing Points For vs Points Against

The following plot shows the Points For vs Points Against for each team, separated between the teams who missed the playoffs and those who did make the playoffs.

points_fa <- ggplot(standings2, aes(x=points_for, y=points_against, color = factor(standings2$playoffs))) + 
  geom_point(data = highlight_df, aes(x=points_for, y=points_against), colour = "navyblue", size = 3 ) +
  geom_point() +
  xlim(150,625) + ylim(150,625) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", size = 1) +
  labs(title="Points For vs. Points Against", subtitle = "Compared with The NE Patriots (Blue)", y="Total Points Against", x="Total Points For", color = "Playoff Made")

points_fa

The dotted line between the data is a simple x=y line, meaning that if a team gives up more points against than points for, they will fall above the line, whereas if a team scores more points than points against they will fall below the line. As shown, the farther a team can fall from the dotted line (below), the more likely they are to reach the playoffs. This means that that Points For and Points Against do not help or hurt independently, but instead, it is a combination of the number of points and the margin between the two. You can see how strong and well performing the Patriot offense was over the years, but you can also see that their defense performed above average.

Analyzing Point Differentials

Its is important to note that Point Differential shows a similar correlation to Points For and Points Against. You can see the level of dominance that the Patriots were at as they top the charts for point differentials in 19 of the 20 seasons.

point_dif <- ggplot(standings2, aes(x=wins, y=points_differential, color = factor(playoffs))) + 
  geom_point() + 
  geom_point(data = highlight_df, aes(x=wins, y=points_differential), colour = "navyblue", size = 3 ) +
  geom_smooth(method="lm", size=1.5, colour = "black")  +
  labs(title="Wins vs. Point Differential", subtitle = "Compared with The NE Patriots (Navy)", y="Point Differential", x="Wins", color = "Playoff Made")

plot(point_dif)

Point Differential Probabilities

Based on the given data, I calculated out probabilities based on points for:

If a team finishes with a Point Differential of -50 or lower, they have a 1.9% chance of making the playoffs
If a team finishes with a Point Differential of 0 or lower, they have a 5.2% chance of making the playoffs
If a team finishes with a Point Differential of 0 or higher, they have a 69.3% chance of making the playoffs
If a team finishes with a Point Differential of 50 or higher, they have a 86.6% chance of making the playoffs
If a team finishes with a Point Differential of 100 or higher, they have a 95.9% chance of making the playoffs

H. Multiple Linear Regression

Multiple Linear Regression

A final glimpse of how to analyze the impact of Points For and Points Against has on Wins is to generate a multiple linear regression. After calculating, there is an R^2 value of 0.84, indicating that this model can strongly predict Wins based on these two variables. Below is the code, analysis of regression, and equation.

fit <- lm(standings2$wins ~ standings2$points_for + standings2$points_against)
summary(fit)

## 
## Call:
## lm(formula = standings2$wins ~ standings2$points_for + standings2$points_against)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3445 -0.8470 -0.0334  0.8322  3.9053 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                8.7362443  0.4184560   20.88   <2e-16 ***
## standings2$points_for      0.0270262  0.0006989   38.67   <2e-16 ***
## standings2$points_against -0.0291728  0.0008380  -34.81   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.238 on 635 degrees of freedom
## Multiple R-squared:  0.8395, Adjusted R-squared:  0.839 
## F-statistic:  1660 on 2 and 635 DF,  p-value: < 2.2e-16

The equation that will calculate number of Wins based on Points For and Points Against is as follows:

\[Wins = 8.736 + 0.027(Points For) - 0.029(Points Against)\]

One of the biggest takeaways from this regression is that Points Against has a stronger impact on Wins than Points For, meaning that a team hurts more from points scored against them, than they are helped by points scored for. From this standpoint, teams need to put emphasize on a stronger defense that gives up a lower number of points per game and season.

Summary

Problem Statement

Again, the purpose of this report is to understand exactly how NFL teams can reach the playoffs and the Super Bowl. There are many underlying factors besides Tom Brady that lead to the playoff and we attempted to pinpoint exactly what those are.

Methodology

In order to breakdown what it takes to reach the playoffs, I used data from Pro Football Reference and can be accessed HERE. This data is titled “2019 NFL Attendance Data” and includes 3 datasets: Attendance, Standings, and Games. The Standings dataset was the only one used in this report. From the data, I was able to create a correlation matrix which showed that there were many factors that led to making the playoffs, however, those variables were correlated higher with wins, which is highly correlated with playoffs. After breaking down what consists of a win, a multiple linear regression was run to establish the direct relationship.

Insights and Results

Playoffs are decided by Wins
- With an exception of 1 of 145 teams in the last 20 years, every team that won at least 11 games made the playoffs. Teams that finished with 10 wins had an 87.3% chance of making the playoffs. This was obviously the biggest predictor of whether a team will make the playoffs or not. This then draws the question of how can a team win 11 games and secure a playoff bid.
What it takes to achieve a Win in an NFL game
- After looking at all the variables, Points For and Points Against were the most telling variables, as expected. The more Points For and the fewer Points Against leads to more wins. On average, a playoff team had 402 Points For and 312 Points Against in a season (25 and 20 points per game respectively), whereas teams that missed the playoffs averaged 319 Points For and 375 Points Against in a season (20 and 23 points per game respectively). As these are just averages, many teams had many more or less points for and points against but this is just a reference point.
  - If a team finished with at least 450 Points for, they have an 86% chance of making the playoffs.
  - If a team gives up 260 or less Points against, they have a 97.6% chance of making the playoffs.
  - If a team finishes with a Point Differential of 100 or higher, they have a 95.9% chance of making the playoffs.
- These are three metrics teams should strive for. On a per game level - this means averaging at least 28 point scored for, 16 points against or less, and at least a 6 point differential.
- Lastly, after fitting a multiple linear regression, the true recipe to calculating wins is as follows: \[Wins = 8.736 + 0.027(Points For) - 0.029(Points Against)\]
  - This equation can pretty accurately predict the number of wins a team will have based on their points for and points against. To give an example, I will use the numbers I gave based on probabilities:
    
    \[Wins = 8.736 + 0.027(450) - 0.029(260) ==> 13 Wins\]
  - If a team scores 450 points and allow 260 points, they are projected to win 13 games, which has a 100% chance of making the playoffs based on the past 20 years.
  - An important take away from this regression, is that the weight of every point against is heavier than points for, meaning that a more points against will negatively impact a teams number of wins at a higher rate than every additional point for. This demonstrates that a team with stronger defense will in turn end with more wins, assuming their offense can still score more points for than the number of points a defense lets up.

Limitations to this Report

This report is at a very broad “Birds-eye-View” look at how teams can make the playoffs. John Madden, NFL Legend, put it pretty simple: “Usually the team that scores the most points wins the game.” This is essentially what this report is saying in general terms. With more time and effort put in to further analyze other correlated variables, or go more in depth into the Games dataset where metrics from every NFL game in the last 20 years has been recorded, I believe we can uncover more key factors that can determine the fate of an NFL team’s season. However, I do believe that this report is a good starting point to recognizing that there is a place in sports for analytics and data. From what I uncovered in this report, there are a few ideologies that coaches and ownership can act on, specifically the emphasis that should be placed on the defensive side of the ball. Secondly, I began this analysis with the idea of incorporating the attendance data but ended up not using it. I believe that it would be more useful to analyze those numbers at a game level from the Game dataset, however, that is an entirely different problem statement and analysis.

Lastly,

Thank You For Reading!

- Seth Draper

Data Wrangling Project

Seth Draper

The Dynasty of the New England Patriots

From 2001 to 2018, the New England Patriots dominated the National Football League. Over the course of 17 years, the Patriots won 15 division titles, 12 AFC Championship appearances, 7 Super Bowl appearances, and 5 Super Bowl victories. Let’s find out how they got there…

Synopsis

Packages Required

Data Preparation

A. Data Import

B. Attendance Dataset Cleaning

C. Standings Dataset Cleaning

D. Combining & Cleaning Datasets

E. Subsetting into 4 Datasets

Exploratory Data Analysis

F. Correlation Matrix

G. Making the Playoffs

H. Multiple Linear Regression

Summary