Draper_Midterm_Markdown.utf8.md

Synopsis

Problem Statement:

What does it take for an NFL team to win the Super Bowl? More importantly, how can they make the playoffs? These are questions that used to be answered with “win the most games in the division” and “win every game in playoffs.” We can now quantify exactly what it takes to make it to the most elite level of football. The New England Patriots consistently ended up in the AFC Championship or the Super Bowl almost every year, to the point where they made it look easy. We will dive in-depth to quantify what it takes to be a playoff team.

Solution Overview:

To reach a conclusion, NFL data including three datasets from 2000 to 2019 - one of which will be used, will be cleaned, wrangled, and analyzed to find what variables lead to being a playoff team and Super Bowl victor. The majority of the analysis will draw comparisons between the New England Patriots, Super Bowl winners, playoff contenders, and teams who missed the playoffs.

Implementation:

After the data is cleaned, we will analyze what goes into a playoff team, including wins. Additionally we will dive deeper into what exactly determines a win. We will show what it will take for an NFL team to become successful, like the New England Patriots.

Direct Impact of Analysis:

This report is for NFL ownership and management to understand what is required to take their team to the next step of either making the playoffs or winning it all.

Packages Required

Packages Required:

To reproduce the code and results from this report the following packages will need to be installed.

library(tidyverse)  ## Visualizing, transforming, tidying and joining data

library(dplyr)      ## Manipulating data

library(ggplot2)    ## Used for Visualizing data

library(ggcorrplot) ## Used to visualize Correlation matrix

library(knitr)      ## Neccessary to show tables in RMarkdown

Data Preparation

A. Data Import

Data Import:

The data from this report was compiled from Pro Football Reference and can be accessed HERE. This data is titled “2019 NFL Attendance Data” and includes 3 datasets: Attendance, Standings, and Games. These datasets are comprised of data from 2000-2019 and include many variables. Although this data is titled and focused around attendance, a more interesting analysis came to mind regarding playoff teams and Super Bowl winners. All the data was mostly clean with no missing values (except ‘NA’ attendance on Bye Weeks).

The attendance dataset was comprised of 10,846 observations and 8 variables:

Team
Team Name
Year
Total
Home
Away
Week
Weekly Attendance

The standings dataset was comprised of 638 observations and 15 variables:

Team
Team Name
Year
Wins
Loss
Points For
Points Against
Point Differential
Margin of Victory
Strength of Schedule

The games dataset was comprised of 5,324 observations and 19 variables:

Year
Week
Home Team
Away Team
Winner
Tie
Day
Time
Points Win
Points Loss
Yards Win
Turnovers Win
Yards Loss
Turnovers Loss
Home Team Name
Home Team City
Away Team Name
Away Team City

The first step in this project was to download the datasets and store them in an R File on my desktop. From there, set the working directory and import all three datasets into R.

setwd("C:/Users/sethd/OneDrive/Desktop/R Files")
attendance <- read.csv("attendance.csv")
standings <- read.csv("standings.csv")
games <- read.csv("games.csv")

B. Attendance Dataset Cleaning

Attendance Dataset Cleaning:

Upon briefly exploring the datasets, I decided to not use the games data, but felt it was appropriate to keep it in R in the event that I want to continue the report at a single game level instead of the season level.

After importing the data, the first step was to clean the attendance dataset. As just stated, I am only concerned about the seasons, not game specific - this applies to attendance. The first step was that I united the team and team_name variables to condense and have one team name as opposed to two. Then I removed week and week_attendance columns, leaving many duplicate rows (17 per team per season) because all that was left was the total, home, and away attendances repeating. To eliminate redundancies, I called for only distinct observations to remain. The code below shows the previous description followed by the first 6 observations of the now clean attendance dataset:

#Combine Team Name
attendance <- attendance %>% unite(team, team, team_name, sep = " ")

#Transform into Team Total Attendance per year
attendance <- attendance[,-(6:7)]
attendance <- distinct(attendance)

team	year	total	home	away
Arizona Cardinals	2000	893926	387475	506451
Atlanta Falcons	2000	964579	422814	541765
Baltimore Ravens	2000	1062373	551695	510678
Buffalo Bills	2000	1098587	560695	537892
Carolina Panthers	2000	1095192	583489	511703
Chicago Bears	2000	1080684	535552	545132

C. Standings Dataset Cleaning

Standings Dataset Cleaning:

The first step I took in cleaning the standings dataset was to combine the two team and team_name variables to condense and keep consistent with the attendance data. Following that I proceeded to do a multi-layered sort by year then wins in descending order. This was to check the first few observations and make sure the data made sense. Then when checking the head, I realized the variables playoffs and sb_winner were character values as they were reported as “No Playoffs”, “Playoffs”, “No Superbowl”, and “Superbowl” respectively. I changed the values to a binary 0 and 1 for No Playoffs/No Superbowl and Yes Playoffs/Yes Superbowl. The following codes shows these actions, along with an output of the first 6 values in the new clean standings dataset:

#Combining team and team_name
standings <- as_tibble(standings)
standings <- standings %>% unite(team, team, team_name, sep = " ")

#Sort by years then wins
standings <- arrange(standings, year, desc(wins))

#Changing playoffs and sb_winner from a character to Binary variable
standings$playoffs <- factor(standings$playoffs, levels=c("No Playoffs", "Playoffs"), labels=c(0, 1))
standings$sb_winner <- factor(standings$sb_winner, levels=c("No Superbowl", "Won Superbowl"), labels=c(0, 1))

team	year	wins	loss	points_for	points_against	points_differential	margin_of_victory	strength_of_schedule	simple_rating	offensive_ranking	defensive_ranking	playoffs	sb_winner
Tennessee Titans	2000	13	3	346	191	155	9.7	-1.3	8.3	1.5	6.8	1	0
Baltimore Ravens	2000	12	4	333	165	168	10.5	-2.5	8.0	0.0	8.0	1	1
Oakland Raiders	2000	12	4	479	299	180	11.3	-1.5	9.7	8.0	1.8	1	0
New York Giants	2000	12	4	328	246	82	5.1	-2.7	2.4	-1.3	3.8	1	0
Miami Dolphins	2000	11	5	323	226	97	6.1	1.0	7.1	0.0	7.1	1	0
Denver Broncos	2000	11	5	485	369	116	7.3	-2.2	5.0	7.8	-2.7	1	0

D. Combining & Cleaning Datasets

Combining & Cleaning Datasets:

To make things easier, I noticed that both datasets had the same number of observations and identical team and year keys - therefore I decided to combine the two datasets to work with them as one. After combining, I renamed the attendance variables from “total”,“home”,“away” to “total_attendance”, “home_attendance”, and “away_attendance” respectively to differentiate. I then looked at the structures of all the variables and noticed that many were in integer form and the two binary variables were still variables. I then converted them to numeric variables to make calculations easy. This new merged dataset is named standings2. Below is the code and the first 6 observations

#Merge two data frames by team and year
standings2 <- merge(standings,attendance,by=c("team","year"))

#Rename variables
standings2 <- standings2 %>% 
  rename(
    total_attendance = total,
    home_attendance = home,
    away_attendance = away
    )

#Changing variable types
standings2[3:7] <- lapply(standings2[3:7], as.numeric)
standings2$playoffs <- as.numeric(levels(standings2$playoffs))[standings2$playoffs]
standings2$sb_winner <- as.numeric(levels(standings2$sb_winner))[standings2$sb_winner]
standings2[15:17] <- lapply(standings2[15:17], as.numeric)

team	year	wins	loss	points_for	points_against	points_differential	margin_of_victory	strength_of_schedule	simple_rating	offensive_ranking	defensive_ranking	total_attendance	home_attendance	away_attendance
Arizona Cardinals	2000	3	13	210	443	-233	-14.6	-0.7	-15.2	-7.2	-8.1	893926	387475	506451
Arizona Cardinals	2001	7	9	295	343	-48	-3.0	-1.2	-4.2	-1.5	-2.6	811391	307315	504076
Arizona Cardinals	2002	5	11	262	417	-155	-9.7	-0.2	-9.9	-5.4	-4.5	898877	327272	571605
Arizona Cardinals	2003	4	12	225	452	-227	-14.2	1.6	-12.6	-6.3	-6.2	804401	288499	515902
Arizona Cardinals	2004	6	10	284	322	-38	-2.4	-2.5	-4.9	-5.1	0.2	838557	300267	538290
Arizona Cardinals	2005	5	11	311	387	-76	-4.8	-0.2	-5.0	-2.0	-3.0	920848	401035	519813

E. Subsetting into 4 Datasets

Subsetting into 4 Datasets:

The last step of the cleaning process is where it all starts coming together. From the standings2 dataset, I created 4 subsets: * Super Bowl Winners (sb_champs) * Teams Who Made the Playoffs (made_playoff) * Teams Who Missed the Playoffs (missed_playoff) * The New England Patriots (pats) I removed the Super Bowl Winners from the made_playoff to make sure they were not counted twice. I then sorted missed_playoff and made_playoff by year then descending years to get a good look at the data. Following is the code and the heads of each:

#Subsetting (filtering) SB winners, playoffs, missed playoffs, and NE Patriots
sb_champs <- filter(standings2, sb_winner == 1) 
sb_champs <- sb_champs[,-(13:14)]
sb_champs <- arrange(sb_champs, year)
kable(head(sb_champs), format = "markdown")

team	year	wins	loss	points_for	points_against	points_differential	margin_of_victory	strength_of_schedule	simple_rating	offensive_ranking	defensive_ranking	total_attendance	home_attendance	away_attendance
Baltimore Ravens	2000	12	4	333	165	168	10.5	-2.5	8.0	0.0	8.0	1062373	551695	510678
New England Patriots	2001	11	5	371	272	99	6.2	-1.9	4.3	1.2	3.1	977717	482336	495381
Tampa Bay Buccaneers	2002	12	4	346	196	150	9.4	-0.6	8.8	-1.0	9.8	1044920	525031	519889
New England Patriots	2003	14	2	348	238	110	6.9	0.1	6.9	2.1	4.9	1127515	547488	580027
New England Patriots	2004	14	2	437	260	177	11.1	1.8	12.8	6.4	6.5	1108210	550048	558162
Pittsburgh Steelers	2005	11	5	389	258	131	8.2	-0.4	7.8	3.8	4.0	1048739	507434	541305

made_playoff <- filter(standings2, playoffs == 1)   
made_playoff <- subset(made_playoff, made_playoff$sb_winner==0 )
made_playoff <- made_playoff[,-(13:14)]
made_playoff <- arrange(made_playoff, year, desc(wins))
kable(head(made_playoff), format = "markdown")

team	year	wins	loss	points_for	points_against	points_differential	margin_of_victory	strength_of_schedule	simple_rating	offensive_ranking	defensive_ranking	total_attendance	home_attendance	away_attendance
Tennessee Titans	2000	13	3	346	191	155	9.7	-1.3	8.3	1.5	6.8	1091274	547524	543750
New York Giants	2000	12	4	328	246	82	5.1	-2.7	2.4	-1.3	3.8	1135455	624085	511370
Oakland Raiders	2000	12	4	479	299	180	11.3	-1.5	9.7	8.0	1.8	998655	462515	536140
Denver Broncos	2000	11	5	485	369	116	7.3	-2.2	5.0	7.8	-2.7	1140030	604042	535988
Miami Dolphins	2000	11	5	323	226	97	6.1	1.0	7.1	0.0	7.1	1118883	589909	528974
Minnesota Vikings	2000	11	5	397	371	26	1.6	0.3	1.9	4.3	-2.3	1029262	513322	515940

missed_playoff <- filter(standings2, playoffs == 0) 
missed_playoff <- missed_playoff[,-(13:14)]
missed_playoff <- arrange(missed_playoff, year, desc(wins))
kable(head(missed_playoff), format = "markdown")

team	year	wins	loss	points_for	points_against	points_differential	margin_of_victory	strength_of_schedule	simple_rating	offensive_ranking	defensive_ranking	total_attendance	home_attendance	away_attendance
Detroit Lions	2000	9	7	307	307	0	0.0	1.4	1.4	-0.1	1.5	1140926	607076	533850
Green Bay Packers	2000	9	7	353	323	30	1.9	0.6	2.5	1.8	0.7	1049602	478747	570855
New York Jets	2000	9	7	321	321	0	0.0	3.5	3.5	1.4	2.2	1145146	623711	521435
Pittsburgh Steelers	2000	9	7	321	255	66	4.1	-0.2	3.9	0.6	3.3	987037	440426	546611
Buffalo Bills	2000	8	8	315	350	-35	-2.2	2.2	0.0	0.5	-0.5	1098587	560695	537892
Washington Redskins	2000	8	8	281	269	12	0.8	0.2	1.0	-2.9	3.8	1174332	647424	526908

pats <- filter(standings2, team == "New England Patriots")
kable(head(pats), format = "markdown")

team	year	wins	loss	points_for	points_against	points_differential	margin_of_victory	strength_of_schedule	simple_rating	offensive_ranking	defensive_ranking	playoffs	sb_winner	total_attendance	home_attendance	away_attendance
New England Patriots	2000	5	11	276	338	-62	-3.9	1.4	-2.5	-2.7	0.2	0	0	1030594	482336	548258
New England Patriots	2001	11	5	371	272	99	6.2	-1.9	4.3	1.2	3.1	1	1	977717	482336	495381
New England Patriots	2002	9	7	381	346	35	2.2	1.8	4.0	2.1	1.9	0	0	1096069	547488	548581
New England Patriots	2003	14	2	348	238	110	6.9	0.1	6.9	2.1	4.9	1	1	1127515	547488	580027
New England Patriots	2004	14	2	437	260	177	11.1	1.8	12.8	6.4	6.5	1	1	1108210	550048	558162
New England Patriots	2005	10	6	379	338	41	2.6	0.6	3.1	3.7	-0.5	1	0	1136903	550048	586855

Exploratory Data Analysis

F. Correlation Matrix

Correlation Matrix

The first step of analysis was to create a correlation matrix. From this my hope was to establish variables that are correlated to the playoff variable. When analyzing correlations against the playoffs variable, it appeared that there are many highly correlated variables with playoffs. However, all of these variables are correlated significantly higher with wins, which leads to the idea that these variables lead to wins and wins lead to playoffs. Below is a list of the correlated variables with the R^2-value to playoffs and wins in that order:

simple_rating (0.67)(0.88)
points_differential (0.71)(0.92)
margin_of_victory (0.71)(0.92)
points_for (0.56)(0.73)
offensive_ranking (0.55)(0.73)
wins (0.78)(NA)
points_against (-0.53)(-0.68)
defensive_ranking (0.49)(0.64)

#Correlation Matrix of Standings2
corr <- round(cor(standings2[3:17]), 1)

p.mat <- cor_pmat(standings2[3:17])

correlation <-ggcorrplot(corr, hc.order = TRUE, type = "lower",
           lab = TRUE)
correlation

G. Making the Playoffs

Wins Per Season Analysis

After realizing that wins are the direct influencer of making the playoffs, I wanted to compare the distributions of the four subsets. It is evident that teams will have no chance to make the playoffs without at least 7 wins. It is also worth noting that the Super Bowl winner’s median is 1 game better than the teams that made the playoffs. Lastly, a note on the Patriot Dynasty, beside 2000 (pre-Tom Brady) when the Patriots only won 5 games, they won 11 or more games every season for the next 19 years.

boxplot(pats$wins, sb_champs$wins, made_playoff$wins, missed_playoff$wins, 
        main = "Wins Per Season", ylab = "Wins", col = c("royalblue", "gold", "green", "red"),
        names = c("Patriots", "SB Champs", "Playoff Made", "Playoff Missed"))

Playoff Probability

Based on the given data, I calculated out probabilities based on wins:

If a team finishes with at least 11 wins, they have a 99.3% chance of making the playoffs
If a team finishes with 10 wins, they have an 87.3% chance of making the playoffs
If a team finishes with 9 wins, they have an 37.7% chance of making the playoffs
If a team finishes with 8 wins, they have an 8.7% chance of making the playoffs

What Leads to a Win?

There are essentially 2 things that go into winning or losing a game, and they are pretty obvious:

Points Scored For
Points Scored Against

Although there are more specific variables that can influence the result of a football game, like weather, home field advantage, health of players, and so on, at a game’s core, these variables can likely predict whether a team will reach the playoffs or be sitting on their couches come January. Following is a visual display to demonstrate each of these variables’ effects on wins.

Analyzing Points For

As seen in the graph below, there is a strong positive correlation between wins and points for. Points For has a direct impact on number of wins in a season.

highlight_df <- standings2 %>% 
  filter(standings2$team == "New England Patriots")

ggplot(standings2, aes(x=wins, y=points_for)) + 
  geom_point(aes(col=playoffs), size = 2) + 
  scale_colour_gradientn(colours= c("red","green")) +
  geom_point(data = highlight_df, aes(x=wins, y=points_for), colour = "navyblue", size = 3 ) +
  geom_smooth(method="lm", size=1.5, colour = "black")  +
  labs(title="Wins vs. Points For", subtitle = "Compared with The NE Patriots (Blue)", y="Total Points For", x="Wins")

The boxplot following compares points for between the four subsets. You can see that there is a drastic increase between teams who did not make the playoffs as opposed to those who did. One thing to note here is that the middle 50% of ‘points for’ of teams who made the playoffs and Super Bowl Champions is almost identical, meaning ‘points for’ is not the driving factor for a team who made the playoffs to lead to a Super Bowl victory.

boxplot(pats$points_for, sb_champs$points_for, made_playoff$points_for, missed_playoff$points_for, 
        main = "Points For Comparison", ylab = "Points For", col = c("royalblue", "gold", "green", "red"),
        names = c("Patriots", "SB Champs", "Playoff Made", "Playoff Missed"))

This plot shows just how dominate the Patriots were. Their lowest scoring year was still in the middle 50% of made playoffs, which led to the 17 titles in 19 years. They consistently scored in the highest tier among all other NFL teams. Note - The "dynasty’ dataset includes patriots standings data, excluding 2000 since the dynasty officially started in 2001.

dynasty <- pats[-1,]
ggplot(dynasty, aes(wins, points_for)) +
  geom_point(size = 3, colour = "navyblue") +
  geom_point(aes(col=sb_winner)) +
  geom_smooth(method="lm", size=1.5, colour = "navyblue") + 
labs(title="Patriots Wins vs. Points For", y="Total Points For", x="Wins")

Points For Probabilities

Based on the given data, I calculated out probabilities based on points for:

If a team finishes with at least 450 points for, they have a 86.0% chance of making the playoffs
If a team finishes with at least 400 points for, they have a 76.7% chance of making the playoffs
If a team finishes with at least 350 points for, they have a 60.6% chance of making the playoffs

Analyzing Points Against

As seen in the graph below, there is a strong negative correlation between wins and points points against. Points Against has a direct impact on number of wins in a season.

highlight_df <- standings2 %>% 
  filter(standings2$team == "New England Patriots")

ggplot(standings2, aes(x=wins, y=points_against)) + 
  geom_point(aes(col=playoffs), size = 2) + 
  scale_colour_gradientn(colours= c("red","green")) +
  geom_point(data = highlight_df, aes(x=wins, y=points_against), colour = "navyblue", size = 3 ) +
  geom_smooth(method="lm", size=1.5, colour = "black")  +
  labs(title="Wins vs. Points Against", subtitle = "Compared with The NE Patriots (Blue)", y="Total Points Against", x="Wins")

The boxplot following compares points for between the four subsets. You can see that there is a drastic increase between teams who did not make the playoffs as opposed to those who did. Unlike points for, this plot shows that there is a difference between points scored against by playoff teams and Super Bowl Champs. 50% of Super Bowl winners gave up less points than 75% of playoff teams. This is where the difference is between playoff teams and Super Bowl winners.

boxplot(pats$points_against, sb_champs$points_against, made_playoff$points_against, missed_playoff$points_against, 
        main = "Points Against", ylab = "Points", col = c("royalblue", "gold", "green", "red"),
        names = c("Patriots", "Superbowl Champs", "Playoff Made", "Playoff Missed"))

This plot yet again shows just how dominate the Patriots were. The same negative correlation applies with Patriot wins. The Patriots gave up less than 300 points in 45% of their seasons and never exceed 350.

ggplot(dynasty, aes(wins, points_against)) +
  geom_point(size = 3, colour = "navyblue") +
  geom_point(aes(col=sb_winner)) +
  geom_smooth(method="lm", size=1.5, colour = "navyblue") + 
  labs(title="Patriots Wins vs. Points Against", y="Total Points Against", x="Wins")