More than Touchdowns: An NFL Data Analysis

Introduction

Breaking Down the NFL

From a young age, both of us have watched NFL football and cheered for our respective teams every week. We (Adam, a die-hard Bengals fan, and Katie, a lifelong Steelers fan) wanted to compare not only our own teams but also the league as a whole. The past 20 years have seen successes and failures from every NFL team. Looking at the datasets available for this project, we were instantly drawn to the NFL option, as it showcases many variables that affect an NFL team. We then asked, “What goes into winning an NFL game, and which teams are historically successful in the final standings?” Using the past 20 years' worth of data, we sought to investigate this question.

Our Focus

We plan to use R to produce overall summary statistics on games and standings. Additionally, we will use the data to explore potential correlations and plot the corresponding data visualizations. Through descriptive analysis of the past 20 years, we want to see whether NFL teams show predictive tendencies.

The NFL dataset contains three individual datasets:

NFL Attendance (nfl_attendance)
NFL Standings (nfl_standings)
NFL Games (nfl_games)

More detailed information about each dataset can be found in the Data Preparation tab.

Analytical Technique and Approach

The datasets contain loads of information for the NFL. With a wide range of variables, many options are available to analytically investigate the NFL. With the three datasets at hand, we looked to compare them to draw conclusions about team performances. To see if statistical significance or rational conclusions related to the NFL could be realized, the following situations were explored:

The Importance of Fan Attendance - This analysis will look into whether the number of fans in attendance correlates with a team’s success (number of games won). Additionally, it will compare how teams fare in home versus away games while keeping in mind the attendance at those games.
Standings over the Years - The NFL has two conferences: the American Football Conference (AFC) and the National Football Conference (NFC). Each conference contains four divisions with four teams in each division. Each division then has a winner over the 16-game regular season. Our analysis will look into the qualities of the division winners and the attributes that teams high in the standings have over teams in the lower portion of the standings. Furthermore, this approach will explore what separates Super Bowl Champions from the 31 other teams each season.
Offense vs. Defense - The two main parts of an NFL team are the offense and the defense. The goal for each team is to be great on both sides. However, this is rarely the case. Using individual game data and season-long statistics, we will give a thorough breakdown of how having a great offense or defense improves a team. We will also see whether a better offense or a better defense has been more critical to success over the years.
Individual Game Observations - The nfl_games dataset contains many variables for games. Turnovers, day of the week, points, etc. are shown for every match-up. The goal of this analysis is to find which variables correlate with winning or losing. With so many variables available, identifying the significant ones is essential for further understanding.
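
As a rough illustration of the kind of check these questions call for, the sketch below computes a Pearson correlation and its p-value in base R. The data frame and its values are invented purely for illustration and are not taken from the NFL datasets; the column names home_attendance and wins are assumptions.

```r
# Toy values, invented purely for illustration (not from the NFL data)
toy <- data.frame(
  home_attendance = c(520000, 545000, 498000, 610000, 575000, 530000),
  wins            = c(6, 9, 4, 12, 10, 7)
)

# Pearson correlation between attendance and wins
r <- cor(toy$home_attendance, toy$wins, method = "pearson")

# cor.test() adds a significance test for the same correlation
attendance_wins_test <- cor.test(toy$home_attendance, toy$wins)

r
attendance_wins_test$p.value
```

A correlation alone does not show that attendance causes wins (or the reverse), so a result like this is only a starting point for the comparisons described above.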

Moving Forward

The NFL is a multi-billion dollar industry. Millions of fans across the world cheer for these 32 teams every year. People are now looking for ways to understand the game better.

Coaches want to understand what makes a team more successful. Sports gamblers want to get an edge and make the correct picks based on more than just gut feelings. Fans want to know if their team is progressing in the right direction. This analysis is useful for all of these situations. Using descriptive analysis, past results can be better explained. From there, trends can be identified to help predict how NFL games and seasons will unfold. Although no one can see into the future, understanding the data sheds better light on the probability of certain results occurring in the NFL.

Packages Required

This project requires a variety of packages. Given there are over 10,000 packages in R, we want to focus on the ones that will provide us with the best results while cleaning and interpreting the data. Because we still consider ourselves very much novices, many of these packages are standard.

Some packages will be more useful than others. For example, ggplot2 allows for great visualizations that provide a better understanding of the data. Additionally, dplyr can drill deeper into the three datasets to reach conclusions that may be hidden at first. R has powerful functions that can help answer questions about massive datasets. Please see below for all of the packages loaded for this analysis:

# Packages required
library(tidyverse) # Use to tidy data
library(dplyr) # Use to manipulate data
library(ggplot2) # Use to plot data and create visualizations
library(tibble) # Use to manipulate and re-imagine data
library(readr) # Use to import data cleanly and efficiently
library(DT) # Use to create comprehensive data tables with HTML output
library(knitr) # Use for dynamic report generation
library(base) # Contains Base R functions
library(ggthemes) # Use themes in data visualizations
library(plotly) # Use to plot data and create visualizations
library(ggpubr) # Use to show multiple plots at once
library(GGally) # Use to produce scatter plot matrix
library(rmarkdown) # Use to produce report
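
Because the report depends on several packages, it can help to check that they are installed before knitting. The following is a minimal sketch of such a check, not part of the original analysis; the package list is copied from the library() calls above (dplyr, ggplot2, tibble, and readr are already part of tidyverse), plus tidytuesdayR, which the data preparation step calls with `::`.

```r
# Packages used in this report (tidyverse covers dplyr, ggplot2, tibble, readr)
required_pkgs <- c("tidyverse", "DT", "knitr", "ggthemes", "plotly",
                   "ggpubr", "GGally", "rmarkdown", "tidytuesdayR")

# requireNamespace() returns TRUE/FALSE without attaching the package
is_installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required_pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install from CRAN
} else {
  message("All required packages are installed.")
}
```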

Data Preparation

The data was obtained from our professor, Tianhai Zu, for this class. He provided four different datasets from which to choose, and we chose the NFL Attendance Data option. This dataset can be found on GitHub. Reading the information on GitHub led us to the original sources of the data: Pro Football Reference Standings and Pro Football Reference Attendance.

The NFL dataset consists of three individual datasets: (1) nfl_attendance, (2) nfl_standings, and (3) nfl_games. We first merged the data into one dataframe called nfl_df. We decided it would be beneficial to have multiple frames of reference, some using the individual datasets and another using the combined dataframe. Rather than using str() and summary() to show descriptive statistics for each variable, we decided to create comprehensive tables.

# Get working directory
getwd()

## [1] "C:/Users/katie/OneDrive - University of Cincinnati/FS20/Second Half/Data Wrangling (BANA 7025)/Final Project"

# Get the data
nfl_attendance <- readr::read_csv('attendance.csv')
nfl_standings <- readr::read_csv('standings.csv')
nfl_games <- readr::read_csv('games.csv')

# To use 2020 data you need to update tidytuesdayR from GitHub
# Install via devtools::install_github("thebioengineer/tidytuesdayR")

tuesdata <- tidytuesdayR::tt_load('2020-02-04')

## 
##  Downloading file 1 of 3: `attendance.csv`
##  Downloading file 2 of 3: `games.csv`
##  Downloading file 3 of 3: `standings.csv`

tuesdata <- tidytuesdayR::tt_load(2020, week = 6)

## 
##  Downloading file 1 of 3: `attendance.csv`
##  Downloading file 2 of 3: `games.csv`
##  Downloading file 3 of 3: `standings.csv`

attendance <- tuesdata$attendance

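# Join attendance and standings on their shared keys with dplyr.
# left_join() joins two tables at a time; nfl_games has no team or team_name column, so it is kept separate.
nfl_df <- dplyr::left_join(nfl_attendance, nfl_standings, by = c("year", "team_name", "team"))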
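
To show what a left join on these keys does, here is a small sketch on toy tables. It uses base R's merge() with all.x = TRUE, which behaves like a left join, so that it runs without extra packages; the tables and their values are illustrative stand-ins, not the real data.

```r
# Toy tables with illustrative values; column names mirror the join keys
toy_attendance <- data.frame(
  team = c("Cincinnati", "Cincinnati", "Pittsburgh"),
  team_name = c("Bengals", "Bengals", "Steelers"),
  year = c(2019, 2019, 2019),
  week = c(1, 2, 1),
  stringsAsFactors = FALSE
)
toy_standings <- data.frame(
  team = c("Cincinnati", "Pittsburgh"),
  team_name = c("Bengals", "Steelers"),
  year = c(2019, 2019),
  wins = c(2, 8),
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps every attendance row, like dplyr::left_join()
toy_joined <- merge(toy_attendance, toy_standings,
                    by = c("year", "team_name", "team"), all.x = TRUE)

# Each weekly row now carries its team's season-level standings columns
nrow(toy_joined) == nrow(toy_attendance)
```

Because the standings table has one row per team and season, the season-level values are repeated on every weekly row of the joined result.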

Attendance

As mentioned above, the NFL Attendance data was obtained from Pro Football Reference. The original data contains 10,846 observations and eight variables. There are two character variables, team and team_name, and six numeric variables: year, total, home, away, week, and weekly_attendance. The variables are described in the data dictionary below. The data covers the 2000 to 2019 seasons, and the values were recorded for each of the 17 weeks of the NFL regular season. See the ORIGINAL dataset below.

# Examine the structure of the dataset
datatable(head(nfl_attendance, 50))

# Create a data dictionary for attendance
var_names_att <- colnames(nfl_attendance)
var_types_att <- lapply(nfl_attendance, class) # lapply returns a list of the same length as X (a vector)
var_desciptions_att <- c("City or state in which the team originates", "Name or mascot of the team", "Year", "Total attendance per season", "Total attendance at home games per season", "Total attendance at away games per season", "Week in which game was played", "Attendance for given week")
data_dict_att <- as_tibble(cbind(var_names_att, var_types_att, var_desciptions_att))
colnames(data_dict_att) <- c("Variable Name", "Variable Data Type", "Variable Description")
kable(data_dict_att) # kable returns a single table for a single data object

Variable Name	Variable Data Type	Variable Description
team	character	City or state in which the team originates
team_name	character	Name or mascot of the team
year	numeric	Year
total	numeric	Total attendance per season
home	numeric	Total attendance at home games per season
away	numeric	Total attendance at away games per season
week	numeric	Week in which game was played
weekly_attendance	numeric	Attendance for given week
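
The same data dictionary pattern is repeated for every dataset in this report, so it could be wrapped in a small helper. The function below is a hypothetical sketch (make_data_dict is our own illustrative name, not a function from any package) written in base R, with a toy data frame standing in for the real data.

```r
# Hypothetical helper: build a data dictionary from a data frame and a
# character vector of descriptions (one per column, in column order)
make_data_dict <- function(df, descriptions) {
  stopifnot(length(descriptions) == ncol(df))
  data.frame(
    `Variable Name` = names(df),
    # Collapse multi-class columns (e.g. "hms, difftime") into one string
    `Variable Data Type` = unname(vapply(
      df, function(col) paste(class(col), collapse = ", "), character(1)
    )),
    `Variable Description` = descriptions,
    check.names = FALSE,
    stringsAsFactors = FALSE
  )
}

# Toy data frame standing in for one of the NFL datasets
toy <- data.frame(team = c("Cincinnati", "Pittsburgh"), year = c(2019, 2019),
                  stringsAsFactors = FALSE)
toy_dict <- make_data_dict(toy, c("City or state in which the team originates", "Year"))
toy_dict
```

The result can be passed to kable() in the same way as the dictionaries built with cbind() in this report.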

Looking at the missing data values, the only column in which missing values exist is weekly_attendance. This makes sense, as each NFL team has a bye week during the regular season, so one week per team per season has no attendance figure. We decided to omit these rows, as they would otherwise distort the trends for each team.

colSums(is.na(nfl_attendance)) # Find the number of missing values per column

##              team         team_name              year             total 
##                 0                 0                 0                 0 
##              home              away              week weekly_attendance 
##                 0                 0                 0               638

nfl_attendance <- na.omit(nfl_attendance)
colSums(is.na(nfl_attendance)) # Confirm there are no missing values

##              team         team_name              year             total 
##                 0                 0                 0                 0 
##              home              away              week weekly_attendance 
##                 0                 0                 0                 0
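
One way to check the bye-week explanation is to count the missing values per team and season before dropping them; if every count is one, the gaps line up with bye weeks. The sketch below shows the idea on a toy data frame with invented values, using base R's aggregate().

```r
# Toy data: two team-seasons of three weeks each, one bye apiece (values invented)
toy_weekly <- data.frame(
  team_name = rep(c("Bengals", "Steelers"), each = 3),
  year = 2019,
  week = rep(1:3, times = 2),
  weekly_attendance = c(60000, NA, 58000, 62000, 61000, NA),
  stringsAsFactors = FALSE
)

# Flag missing attendance, then count the flags per team and season
toy_weekly$is_missing <- is.na(toy_weekly$weekly_attendance)
na_per_season <- aggregate(is_missing ~ team_name + year, data = toy_weekly, FUN = sum)

na_per_season
all(na_per_season$is_missing == 1)  # TRUE when every team-season has exactly one gap
```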

Looking at the original dataset above, we first decided to rename the columns to better describe the data.

nfl_attendance <- nfl_attendance %>% rename(
  team_location = team,
  total_attendance = total,
  total_home_attendance = home,
  total_away_attendance = away
)

Additionally, we split it into two dataframes, the first omitting the weekly data and the second omitting the season totals. This decision was made largely to remove duplicates, and we knew it would make for better visualizations during the exploratory data analysis (EDA).

The first dataset, nfl_total_attendance, drops the two weekly columns, week and weekly_attendance, and shows the season attendance totals for each team. The second dataset, nfl_weekly_attendance, drops the season total columns: total_attendance, total_home_attendance, and total_away_attendance.

nfl_total_attendance <- nfl_attendance[-c(7, 8)] # Remove weekly data
nfl_total_attendance <- nfl_total_attendance[!duplicated(nfl_total_attendance), ] # Remove duplicates
datatable(head(nfl_total_attendance, 50))

nfl_weekly_attendance <- nfl_attendance[-c(4, 5, 6)] # Remove season total attendance data
datatable(head(nfl_weekly_attendance, 50))
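
Dropping columns by position works here, but it silently breaks if the column order ever changes. As an alternative sketch, the same split can be done by column name in base R; the toy data frame below stands in for the renamed nfl_attendance and its values are invented.

```r
# Toy stand-in for the renamed nfl_attendance (values invented)
toy_attendance <- data.frame(
  team_location = "Cincinnati", team_name = "Bengals", year = 2019,
  total_attendance = 1000000, total_home_attendance = 400000,
  total_away_attendance = 600000,
  week = 1:3, weekly_attendance = c(50000, 52000, 48000),
  stringsAsFactors = FALSE
)

weekly_cols <- c("week", "weekly_attendance")
season_cols <- c("total_attendance", "total_home_attendance", "total_away_attendance")

# Season totals: drop weekly columns by name, then keep one row per team-season
toy_total <- unique(toy_attendance[, setdiff(names(toy_attendance), weekly_cols)])

# Weekly data: drop the season-total columns by name
toy_weekly <- toy_attendance[, setdiff(names(toy_attendance), season_cols)]

dim(toy_total)
dim(toy_weekly)
```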

Now, for a summary of the two datasets and associated tables of the CLEANED data, please see below.

# Examine the final summary and structure of the nfl_total_attendance dataset
datatable(head(nfl_total_attendance, 50))

# Create a data dictionary for nfl_total_attendance
var_names_att <- colnames(nfl_total_attendance)
var_types_att <- lapply(nfl_total_attendance, class) # lapply returns a list of the same length as X (a vector)
var_desciptions_att <- c("City or state in which the team originates", "Name or mascot of the team", "Year", "Total attendance per season", "Total attendance at home games per season", "Total attendance at away games per season")
data_dict_total_att <- as_tibble(cbind(var_names_att, var_types_att, var_desciptions_att))
colnames(data_dict_total_att) <- c("Variable Name", "Variable Data Type", "Variable Description")
kable(data_dict_total_att) # kable returns a single table for a single data object

Variable Name	Variable Data Type	Variable Description
team_location	character	City or state in which the team originates
team_name	character	Name or mascot of the team
year	numeric	Year
total_attendance	numeric	Total attendance per season
total_home_attendance	numeric	Total attendance at home games per season
total_away_attendance	numeric	Total attendance at away games per season

# Examine the final summary and structure of the nfl_weekly_attendance dataset
datatable(head(nfl_weekly_attendance, 50))

# Create a data dictionary for nfl_weekly_attendance
var_names_att <- colnames(nfl_weekly_attendance)
var_types_att <- lapply(nfl_weekly_attendance, class) # lapply returns a list of the same length as X (a vector)
var_desciptions_att <- c("City or state in which the team originates", "Name or mascot of the team", "Year", "Week in which game was played", "Attendance for given week")
data_dict_weekly_att <- as_tibble(cbind(var_names_att, var_types_att, var_desciptions_att))
colnames(data_dict_weekly_att) <- c("Variable Name", "Variable Data Type", "Variable Description")
kable(data_dict_weekly_att) # kable returns a single table for a single data object

Variable Name	Variable Data Type	Variable Description
team_location	character	City or state in which the team originates
team_name	character	Name or mascot of the team
year	numeric	Year
week	numeric	Week in which game was played
weekly_attendance	numeric	Attendance for given week

Standings

Similar to above, the NFL Standings data was obtained from Pro Football Reference. The original data contains 638 observations and 15 variables. There are four character variables: team, team_name, playoffs, and sb_winner. There are 11 numeric variables: year, wins, loss, points_for, points_against, points_differential, margin_of_victory, strength_of_schedule, simple_rating, offensive_ranking, and defensive_ranking. The variables are described in the data dictionary below. The data covers the 2000 to 2019 seasons. See the ORIGINAL NFL Standings data below.

# Examine the structure of the dataset
datatable(head(nfl_standings, 50))

# Create a data dictionary for standings
var_names_st <- colnames(nfl_standings)
var_types_st <- lapply(nfl_standings, class) # lapply returns a list of the same length as X (a vector)
var_desciptions_st <- c("City or state in which the team originates", "Name or mascot of the team", "Year", "Total wins per season (0 to 16)", "Total losses per season (0 to 16)", "Total points the team scored per season", "Total points the opponent scored on the team per season", "The difference between the total points for the team and against the team", "Points differential divided by the total number of games per season", "Difficulty of schedule based on opponent records", "A rating for the team that takes into account points differential and strength of schedule (measured by Simple Rating System)", "A rating comparing how well the offense performs to opponent teams (measured by Simple Rating System)", "A rating comparing how well the defense performs to opponent teams (measured by Simple Rating System)", "Stating whether or not the team made it to the playoffs", "Stating whether or not the team won the Super Bowl for the season")
data_dict_st <- as_tibble(cbind(var_names_st, var_types_st, var_desciptions_st))
colnames(data_dict_st) <- c("Variable Name", "Variable Data Type", "Variable Description")
kable(data_dict_st) # kable returns a single table for a single data object

Variable Name	Variable Data Type	Variable Description
team	character	City or state in which the team originates
team_name	character	Name or mascot of the team
year	numeric	Year
wins	numeric	Total wins per season (0 to 16)
loss	numeric	Total losses per season (0 to 16)
points_for	numeric	Total points the team scored per season
points_against	numeric	Total points the opponent scored on the team per season
points_differential	numeric	The difference between the total points for the team and against the team
margin_of_victory	numeric	Points differential divided by the total number of games per season
strength_of_schedule	numeric	Difficulty of schedule based on opponent records
simple_rating	numeric	A rating for the team that takes into account points differential and strength of schedule (measured by Simple Rating System)
offensive_ranking	numeric	A rating comparing how well the offense performs to opponent teams (measured by Simple Rating System)
defensive_ranking	numeric	A rating comparing how well the defense performs to opponent teams (measured by Simple Rating System)
playoffs	character	Stating whether or not the team made it to the playoffs
sb_winner	character	Stating whether or not the team won the Super Bowl for the season

Looking at the above dataset, we first decided to change the column names to better describe the data.

nfl_standings <- nfl_standings %>% rename(
  team_location = team,
  total_wins = wins,
  total_losses = loss
)

It is important to note as well that a few of the variables are calculated values. The points differential is calculated as: points_differential = points_for - points_against. Additionally, the margin of victory is calculated as: margin_of_victory = (points_for - points_against) / games_played.

Lastly, the simple_rating is calculated by: \[SRS = MoV + SoS = OSRS + DSRS\]

In layman’s terms, the simple rating (SRS) is equal to the margin of victory (MoV) plus the strength of schedule (SoS). This is also equal to the offensive simple rating (OSRS) plus the defensive simple rating (DSRS), which are stored here as offensive_ranking and defensive_ranking.
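
The calculated values above can be worked through by hand on a single season line. The numbers in this sketch are invented for illustration, and games_played and the strength-of-schedule value are assumptions rather than values from the dataset.

```r
# Toy season line (values invented for illustration)
points_for <- 400
points_against <- 320
games_played <- 16

# Points differential: points scored minus points allowed
points_differential <- points_for - points_against

# Margin of victory: subtract first, then divide by games played
margin_of_victory <- (points_for - points_against) / games_played

# SRS adds strength of schedule to margin of victory (hypothetical SoS value)
strength_of_schedule <- -0.5
simple_rating <- margin_of_victory + strength_of_schedule

points_differential
margin_of_victory
simple_rating
```

Without the parentheses, R would divide points_against by games_played before subtracting, which gives a very different number.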

Next, we wanted to see what the sum of missing values was per column. As evident below, there are no missing values.

colSums(is.na(nfl_standings))

##        team_location            team_name                 year 
##                    0                    0                    0 
##           total_wins         total_losses           points_for 
##                    0                    0                    0 
##       points_against  points_differential    margin_of_victory 
##                    0                    0                    0 
## strength_of_schedule        simple_rating    offensive_ranking 
##                    0                    0                    0 
##    defensive_ranking             playoffs            sb_winner 
##                    0                    0                    0

Moving forward, we decided to change both the playoffs and sb_winner to binary variables. This is because they both only have two unique values.

unique(nfl_standings$playoffs, incomparables = FALSE) # View the unique values for the playoffs column

## [1] "Playoffs"    "No Playoffs"

unique(nfl_standings$sb_winner, incomparables = FALSE) # View the unique values for the sb_winner column

## [1] "No Superbowl"  "Won Superbowl"

Knowing this, we changed the two columns to binary variables. For the playoffs column, a value of one stands for “Playoffs”, and a value of zero stands for “No Playoffs”.

nfl_standings$playoffs[nfl_standings$playoffs == "Playoffs"] <- "1"
nfl_standings$playoffs[nfl_standings$playoffs == "No Playoffs"] <- "0"

nfl_standings$playoffs <- as.numeric(nfl_standings$playoffs)

For the sb_winner column, a value of one denotes “Won Superbowl”, and a value of zero denotes “No Superbowl”.

nfl_standings$sb_winner[nfl_standings$sb_winner == "Won Superbowl"] <- "1"
nfl_standings$sb_winner[nfl_standings$sb_winner == "No Superbowl"] <- "0"

nfl_standings$sb_winner <- as.numeric(nfl_standings$sb_winner)
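
The same recoding can also be written in one step per column, since a logical comparison already yields TRUE/FALSE and as.numeric() maps those to 1/0. This is an alternative sketch on toy vectors, not the code used in the report.

```r
# Toy vectors using the two labels found in each column
playoffs_raw <- c("Playoffs", "No Playoffs", "No Playoffs", "Playoffs")
sb_winner_raw <- c("No Superbowl", "No Superbowl", "No Superbowl", "Won Superbowl")

# The comparison is TRUE/FALSE; as.numeric() turns that into 1/0
playoffs_binary <- as.numeric(playoffs_raw == "Playoffs")
sb_winner_binary <- as.numeric(sb_winner_raw == "Won Superbowl")

playoffs_binary
sb_winner_binary
```

One difference to keep in mind: with this approach any unexpected label becomes 0, whereas the two-step replacement above would turn it into NA when converted.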

Now, for a summary of the dataset and associated table of the data, please see the CLEANED dataset below.

# Examine the structure of the dataset
datatable(head(nfl_standings, 50))

# Create a data dictionary for standings
var_names_st <- colnames(nfl_standings)
var_types_st <- lapply(nfl_standings, class) # lapply returns a list of the same length as X (a vector)
var_desciptions_st <- c("City or state in which the team originates", "Name or mascot of the team", "Year", "Total wins per season (0 to 16)", "Total losses per season (0 to 16)", "Total points the team scored per season", "Total points the opponent scored on the team per season", "The difference between the total points for the team and against the team", "Points differential divided by the total number of games per season", "Difficulty of schedule based on opponent records", "A rating for the team that takes into account points differential and strength of schedule (measured by Simple Rating System)", "A rating comparing how well the offense performs to opponent teams (measured by Simple Rating System)", "A rating comparing how well the defense performs to opponent teams (measured by Simple Rating System)", "Stating whether or not the team made it to the playoffs", "Stating whether or not the team won the Super Bowl for the season")
data_dict_st <- as_tibble(cbind(var_names_st, var_types_st, var_desciptions_st))
colnames(data_dict_st) <- c("Variable Name", "Variable Data Type", "Variable Description")
kable(data_dict_st) # kable returns a single table for a single data object

Variable Name	Variable Data Type	Variable Description
team_location	character	City or state in which the team originates
team_name	character	Name or mascot of the team
year	numeric	Year
total_wins	numeric	Total wins per season (0 to 16)
total_losses	numeric	Total losses per season (0 to 16)
points_for	numeric	Total points the team scored per season
points_against	numeric	Total points the opponent scored on the team per season
points_differential	numeric	The difference between the total points for the team and against the team
margin_of_victory	numeric	Points differential divided by the total number of games per season
strength_of_schedule	numeric	Difficulty of schedule based on opponent records
simple_rating	numeric	A rating for the team that takes into account points differential and strength of schedule (measured by Simple Rating System)
offensive_ranking	numeric	A rating comparing how well the offense performs to opponent teams (measured by Simple Rating System)
defensive_ranking	numeric	A rating comparing how well the defense performs to opponent teams (measured by Simple Rating System)
playoffs	numeric	Stating whether or not the team made it to the playoffs
sb_winner	numeric	Stating whether or not the team won the Super Bowl for the season

Games

Once again, the NFL Games data was obtained from Pro Football Reference. The original data contains 5,324 observations and 19 variables. There are 11 character variables: week, home_team, away_team, winner, tie, day, date, home_team_name, home_team_city, away_team_name, and away_team_city. There are seven numeric variables: year, pts_win, pts_loss, yds_win, turnovers_win, yds_loss, and turnovers_loss. The remaining variable, time, is stored as a time of day. See the ORIGINAL dataset below.

# Examine the structure of the dataset
datatable(head(nfl_games, 50))

# Create a data dictionary for games
var_names_games <- colnames(nfl_games)
var_types_games <- lapply(nfl_games, class) # lapply returns a list of the same length as X (a vector)
var_desciptions_games <- c("Year", "Week of the season in which the game was played", "Home team for the game", "Away team for the game", "Winner of the game", "Was there a tie?  (if so, the other team will be listed in this column)", "Day of the week in which the game was played", "Date of the game", "Time of the day in which the game was played", "Number of points the winning team scored", "Number of points the losing team scored", "Total number of yards the winning team had", "Total number of turnovers the winning team had", "Total number of yards the losing team had", "Total number of turnovers the losing team had", "Name or mascot of the home team", "City of the home team", "Name or mascot of the away team", "City of the away team")
data_dict_games <- as_tibble(cbind(var_names_games, var_types_games, var_desciptions_games))
colnames(data_dict_games) <- c("Variable Name", "Variable Data Type", "Variable Description")
kable(data_dict_games) # kable returns a single table for a single data object

Variable Name	Variable Data Type	Variable Description
year	numeric	Year
week	character	Week of the season in which the game was played
home_team	character	Home team for the game
away_team	character	Away team for the game
winner	character	Winner of the game
tie	character	Was there a tie? (if so, the other team will be listed in this column)
day	character	Day of the week in which the game was played
date	character	Date of the game
time	hms , difftime	Time of the day in which the game was played
pts_win	numeric	Number of points the winning team scored
pts_loss	numeric	Number of points the losing team scored
yds_win	numeric	Total number of yards the winning team had
turnovers_win	numeric	Total number of turnovers the winning team had
yds_loss	numeric	Total number of yards the losing team had
turnovers_loss	numeric	Total number of turnovers the losing team had
home_team_name	character	Name or mascot of the home team
home_team_city	character	City of the home team
away_team_name	character	Name or mascot of the away team
away_team_city	character	City of the away team

Looking at the above dataset, our first cleaning step was to remove the last four columns, as they are redundant with home_team and away_team.

names(nfl_games)

##  [1] "year"           "week"           "home_team"      "away_team"     
##  [5] "winner"         "tie"            "day"            "date"          
##  [9] "time"           "pts_win"        "pts_loss"       "yds_win"       
## [13] "turnovers_win"  "yds_loss"       "turnovers_loss" "home_team_name"
## [17] "home_team_city" "away_team_name" "away_team_city"

nfl_games <- nfl_games[-c(16, 17, 18, 19)] # Remove redundant columns
names(nfl_games)

##  [1] "year"           "week"           "home_team"      "away_team"     
##  [5] "winner"         "tie"            "day"            "date"          
##  [9] "time"           "pts_win"        "pts_loss"       "yds_win"       
## [13] "turnovers_win"  "yds_loss"       "turnovers_loss"

Then, we changed the week column to be numeric.

nfl_games$week <- as.numeric(nfl_games$week)
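
Converting a character column to numeric turns any non-numeric label into NA, so it is worth checking what gets lost. The sketch below shows this on a made-up vector; the label "WildCard" is a hypothetical example of a non-numeric week value.

```r
# Made-up week labels: regular-season numbers plus one non-numeric label
week_raw <- c("1", "2", "17", "WildCard")

# as.numeric() cannot parse "WildCard", so it returns NA (with a warning)
week_num <- suppressWarnings(as.numeric(week_raw))

sum(is.na(week_num))        # how many labels could not be converted
week_raw[is.na(week_num)]   # which labels were lost
```

Running the same two checks on the real column before overwriting it shows exactly which weeks become NA.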

Looking at missing values, two columns contain them: week and tie. The 220 missing values in week were introduced by the numeric conversion above, because non-numeric week labels (the playoff rounds) cannot be converted to numbers. The 5,314 missing values in tie make sense, as very few NFL games result in a tie.

A tie was denoted by listing one team name in the winner column and the opposing team name in the tie column. To fix this, we identified every game that resulted in a tie and changed the value in its winner column to “Tie”. The tie column was then removed.

colSums(is.na(nfl_games))

##           year           week      home_team      away_team         winner 
##              0            220              0              0              0 
##            tie            day           date           time        pts_win 
##           5314              0              0              0              0 
##       pts_loss        yds_win  turnovers_win       yds_loss turnovers_loss 
##              0              0              0              0              0

unique(nfl_games$tie, incomparables = FALSE)

## [1] NA                   "Atlanta Falcons"    "Cincinnati Bengals"
## [4] "St. Louis Rams"     "Green Bay Packers"  "Carolina Panthers" 
## [7] "Arizona Cardinals"  "Cleveland Browns"

nfl_games$winner[!is.na(nfl_games$tie)] <- "Tie" # Mark games that have a team listed in the tie column
nfl_games <- nfl_games[-c(6)] # Remove the tie column
colSums(is.na(nfl_games)) # Confirm the tie column is gone; only week still has missing values

##           year           week      home_team      away_team         winner 
##              0            220              0              0              0 
##            day           date           time        pts_win       pts_loss 
##              0              0              0              0              0 
##        yds_win  turnovers_win       yds_loss turnovers_loss 
##              0              0              0              0
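
A quick sanity check for this step is that the number of games marked “Tie” equals the number of non-missing entries in the tie column (10 in the output above, since 5,314 of 5,324 values are missing). The sketch below shows that check on a toy data frame with placeholder team names.

```r
# Toy games table with placeholder team names; the second game is a tie
toy_games <- data.frame(
  winner = c("Team A", "Team C", "Team E"),
  tie = c(NA, "Team D", NA),
  stringsAsFactors = FALSE
)

# Count ties before recoding: rows where the tie column is filled in
n_ties_before <- sum(!is.na(toy_games$tie))

# Recode the winner for tied games, then drop the tie column
toy_games$winner[!is.na(toy_games$tie)] <- "Tie"
toy_games$tie <- NULL

# The number of "Tie" winners should match the earlier count
sum(toy_games$winner == "Tie") == n_ties_before
```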

To view the summary and structure of the CLEANED data:

# Examine the structure of the dataset
datatable(head(nfl_games, 50))

# Create a data dictionary for games
var_names_games <- colnames(nfl_games)
var_types_games <- lapply(nfl_games, class) # lapply returns a list of the same length as X (a vector)
var_desciptions_games <- c("Year", "Week of the season in which the game was played", "Home team for the game", "Away team for the game", "Winner of the game", "Day of the week in which the game was played", "Date of the game", "Time of the day in which the game was played", "Number of points the winning team scored", "Number of points the losing team scored", "Total number of yards the winning team had", "Total number of turnovers the winning team had", "Total number of yards the losing team had", "Total number of turnovers the losing team had")
data_dict_games <- as_tibble(cbind(var_names_games, var_types_games, var_desciptions_games))
colnames(data_dict_games) <- c("Variable Name", "Variable Data Type", "Variable Description")
kable(data_dict_games) # kable returns a single table for a single data object

Variable Name	Variable Data Type	Variable Description
year	numeric	Year
week	numeric	Week of the season in which the game was played
home_team	character	Home team for the game
away_team	character	Away team for the game
winner	character	Winner of the game
day	character	Day of the week in which the game was played
date	character	Date of the game
time	hms , difftime	Time of the day in which the game was played
pts_win	numeric	Number of points the winning team scored
pts_loss	numeric	Number of points the losing team scored
yds_win	numeric	Total number of yards the winning team had
turnovers_win	numeric	Total number of turnovers the winning team had
yds_loss	numeric	Total number of yards the losing team had
turnovers_loss	numeric	Total number of turnovers the losing team had

Exploratory Data Analysis

The Importance of Fan Attendance

Total Attendance Breakdown

As mentioned in the introduction, this analysis looks into whether the number of fans in attendance correlates with a team’s success (number of games won). Additionally, it compares how teams fare in home versus away games while keeping in mind the attendance at those games. Earlier, in our data preparation, we split the attendance dataset into two separate datasets, nfl_total_attendance and nfl_weekly_attendance. To understand the importance of fan attendance, it is first critical to observe which teams have had the strongest fan bases over the past 20 years.

# Create visualization for total attendance per year for all teams
team_total_attendance <- 
  ggplot(data = nfl_total_attendance, 
         aes(x = year, 
             y = total_attendance, 
             color = team_name)) + 
  geom_point(size = 1, alpha = .8) +
  geom_smooth(size = .8, se = FALSE) +
  scale_y_continuous(name = "Total Attendance") +
  scale_x_continuous(name = "Year") +
  ggtitle("Total Attendance Per Year") + 
  labs(col = "Team Name") +
  theme_stata()

ggplotly(team_total_attendance)

As evident in the above visualization, the Dallas Cowboys appear to have the strongest fan base, and the Los Angeles Chargers appear to have the weakest. Because the plot is rendered with the ggplotly function, it is interactive: double-click on any team to see its attendance trend since 2000. Next, we wanted to break this down on a division basis. To do this, we added a column called division to the dataset.

# Attach the dataset to avoid calling on specific columns
attach(nfl_total_attendance)

# Create new column
nfl_total_attendance$division <- 
  ifelse(team_name == "Patriots" | team_name == "Bills" | team_name == "Jets" | team_name == "Dolphins", "AFC East", 
  ifelse(team_name == "Ravens" | team_name == "Steelers" | team_name == "Bengals" | team_name == "Browns", "AFC North",
  ifelse(team_name == "Texans" | team_name == "Titans" | team_name == "Colts" | team_name == "Jaguars", "AFC South", 
  ifelse(team_name == "Chiefs" | team_name == "Broncos" | team_name == "Raiders" | team_name == "Chargers", "AFC West",
  ifelse(team_name == "Eagles" | team_name == "Cowboys" | team_name == "Giants" | team_name == "Redskins", "NFC East",
  ifelse(team_name == "Packers" | team_name == "Vikings" | team_name == "Bears" | team_name == "Lions", "NFC North",
  ifelse(team_name == "Saints" | team_name == "Falcons" | team_name == "Buccaneers" | team_name == "Panthers", "NFC South",
  ifelse(team_name == "49ers" | team_name == "Seahawks" | team_name == "Rams" | team_name == "Cardinals", "NFC West",
  NA))))))) )
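As an alternative to the nested ifelse() chain, the same mapping could be kept in a lookup vector and reused for any dataset that has a team_name column. The sketch below is only an illustration: division_lookup and the toy data frame are hypothetical names that do not appear in the original analysis, and the groupings are copied from the code above.

```r
# Named vector: team nickname -> division (same groupings as the ifelse() chain)
division_lookup <- c(
  Patriots = "AFC East",  Bills = "AFC East",  Jets = "AFC East",  Dolphins = "AFC East",
  Ravens = "AFC North",   Steelers = "AFC North", Bengals = "AFC North", Browns = "AFC North",
  Texans = "AFC South",   Titans = "AFC South", Colts = "AFC South", Jaguars = "AFC South",
  Chiefs = "AFC West",    Broncos = "AFC West", Raiders = "AFC West", Chargers = "AFC West",
  Eagles = "NFC East",    Cowboys = "NFC East", Giants = "NFC East", Redskins = "NFC East",
  Packers = "NFC North",  Vikings = "NFC North", Bears = "NFC North", Lions = "NFC North",
  Saints = "NFC South",   Falcons = "NFC South", Buccaneers = "NFC South", Panthers = "NFC South",
  "49ers" = "NFC West",   Seahawks = "NFC West", Rams = "NFC West", Cardinals = "NFC West"
)

# Toy data frame standing in for nfl_total_attendance (hypothetical rows)
toy <- data.frame(
  team_name = c("Bengals", "Steelers", "49ers", "Unknowns"),
  stringsAsFactors = FALSE
)

# Indexing by name returns NA for any team that is not in the lookup
toy$division <- unname(division_lookup[as.character(toy$team_name)])
toy
```

Because the lookup lives in one place, the same vector could be applied to both nfl_total_attendance and nfl_standings instead of repeating the chain.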

Now that the division column exists, the strongest and weakest fan bases in each division are summarized in the table and visualizations below:

# Create table for attendance summary
attendance_summary <- matrix(c("New York Jets", "Miami Dolphins", "Baltimore Ravens", "Cincinnati Bengals", "Houston Texans", "Indianapolis Colts", "Kansas City Chiefs", "Los Angeles Chargers", "Dallas Cowboys", "Washington Redskins", "Green Bay Packers", "Detroit Lions", "New Orleans Saints", "Tampa Bay Buccaneers", "Los Angeles Rams", "Arizona Cardinals"), ncol = 2, byrow = TRUE)
colnames(attendance_summary) <- c("Strongest Fan Base","Weakest Fan Base")
rownames(attendance_summary) <- c("AFC East","AFC North","AFC South", "AFC West", "NFC East", "NFC North", "NFC South", "NFC West")
attendance_summary <- as.table(attendance_summary)
kable(attendance_summary)

	Strongest Fan Base	Weakest Fan Base
AFC East	New York Jets	Miami Dolphins
AFC North	Baltimore Ravens	Cincinnati Bengals
AFC South	Houston Texans	Indianapolis Colts
AFC West	Kansas City Chiefs	Los Angeles Chargers
NFC East	Dallas Cowboys	Washington Redskins
NFC North	Green Bay Packers	Detroit Lions
NFC South	New Orleans Saints	Tampa Bay Buccaneers
NFC West	Los Angeles Rams	Arizona Cardinals
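The table above was entered by hand after reading the plots. A minimal sketch of how a similar summary could be derived from the data follows, assuming that "strongest" and "weakest" mean the highest and lowest mean total_attendance within a division (the report does not state its exact rule). The toy_attendance values are made up for illustration, and the object names are hypothetical.

```r
# Toy stand-in for nfl_total_attendance (values are made up for illustration)
toy_attendance <- data.frame(
  team_name        = rep(c("Jets", "Dolphins", "Ravens", "Bengals"), each = 2),
  division         = rep(c("AFC East", "AFC North"), each = 4),
  year             = rep(c(2000, 2001), times = 4),
  total_attendance = c(1200, 1250, 1000, 1050, 1150, 1100, 900, 950),
  stringsAsFactors = FALSE
)

# Mean yearly attendance per team within each division
team_means <- aggregate(total_attendance ~ division + team_name,
                        data = toy_attendance, FUN = mean)

# For each division, pick the teams with the highest and lowest mean attendance
fan_base_summary <- do.call(rbind, lapply(
  split(team_means, team_means$division),
  function(d) {
    data.frame(
      division  = d$division[1],
      strongest = d$team_name[which.max(d$total_attendance)],
      weakest   = d$team_name[which.min(d$total_attendance)],
      stringsAsFactors = FALSE
    )
  }
))
fan_base_summary
```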

American Football Conference Attendance Breakdown

# AFC East 
ggplotly(
  nfl_total_attendance %>% 
  filter(division == "AFC East") %>% 
  ggplot(aes(x = year, 
             y = total_attendance, 
             color = team_name)) + 
  geom_point(size = 1, alpha = .8) +
  geom_smooth(size = .8, se = FALSE) +
  scale_y_continuous(name = "Total Attendance") +
  scale_x_continuous(name = "Year") +
  ggtitle("AFC East Total Attendance Per Year") + 
  labs(col = "Team Name") +
  theme_stata()
)

# AFC North
ggplotly(
  nfl_total_attendance %>% 
  filter(division == "AFC North") %>% 
  ggplot(aes(x = year, 
             y = total_attendance, 
             color = team_name)) + 
  geom_point(size = 1, alpha = .8) +
  geom_smooth(size = .8, se = FALSE) +
  scale_y_continuous(name = "Total Attendance") +
  scale_x_continuous(name = "Year") +
  ggtitle("AFC North Total Attendance Per Year") + 
  labs(col = "Team Name") +
  theme_stata()
)

# AFC South
ggplotly(
  nfl_total_attendance %>% 
  filter(division == "AFC South") %>% 
  ggplot(aes(x = year, 
             y = total_attendance, 
             color = team_name)) + 
  geom_point(size = 1, alpha = .8) +
  geom_smooth(size = .8, se = FALSE) +
  scale_y_continuous(name = "Total Attendance") +
  scale_x_continuous(name = "Year") +
  ggtitle("AFC South Total Attendance Per Year") + 
  labs(col = "Team Name") +
  theme_stata()
)

# AFC West
ggplotly(
  nfl_total_attendance %>% 
  filter(division == "AFC West") %>% 
  ggplot(aes(x = year, 
             y = total_attendance, 
             color = team_name)) + 
  geom_point(size = 1, alpha = .8) +
  geom_smooth(size = .8, se = FALSE) +
  scale_y_continuous(name = "Total Attendance") +
  scale_x_continuous(name = "Year") +
  ggtitle("AFC West Total Attendance Per Year") + 
  labs(col = "Team Name") +
  theme_stata()
)

National Football Conference Attendance Breakdown

# NFC East
ggplotly(
  nfl_total_attendance %>% 
  filter(division == "NFC East") %>% 
  ggplot(aes(x = year, 
             y = total_attendance, 
             color = team_name)) + 
  geom_point(size = 1, alpha = .8) +
  geom_smooth(size = .8, se = FALSE) +
  scale_y_continuous(name = "Total Attendance") +
  scale_x_continuous(name = "Year") +
  ggtitle("NFC East Total Attendance Per Year") + 
  labs(col = "Team Name") +
  theme_stata()
)

# NFC North
ggplotly(
  nfl_total_attendance %>% 
  filter(division == "NFC North") %>% 
  ggplot(aes(x = year, 
             y = total_attendance, 
             color = team_name)) + 
  geom_point(size = 1, alpha = .8) +
  geom_smooth(size = .8, se = FALSE) +
  scale_y_continuous(name = "Total Attendance") +
  scale_x_continuous(name = "Year") +
  ggtitle("NFC North Total Attendance Per Year") + 
  labs(col = "Team Name") +
  theme_stata()
)

# NFC South
ggplotly(
  nfl_total_attendance %>% 
  filter(division == "NFC South") %>% 
  ggplot(aes(x = year, 
             y = total_attendance, 
             color = team_name)) + 
  geom_point(size = 1, alpha = .8) +
  geom_smooth(size = .8, se = FALSE) +
  scale_y_continuous(name = "Total Attendance") +
  scale_x_continuous(name = "Year") +
  ggtitle("NFC South Total Attendance Per Year") + 
  labs(col = "Team Name") +
  theme_stata()
)

# NFC West
ggplotly(
  nfl_total_attendance %>% 
  filter(division == "NFC West") %>% 
  ggplot(aes(x = year, 
             y = total_attendance, 
             color = team_name)) + 
  geom_point(size = 1, alpha = .8) +
  geom_smooth(size = .8, se = FALSE) +
  scale_y_continuous(name = "Total Attendance") +
  scale_x_continuous(name = "Year") +
  ggtitle("NFC West Total Attendance Per Year") + 
  labs(col = "Team Name") +
  theme_stata()
)

Does Home Attendance Impact Total Wins?

Knowing the above attendance statistics, we wanted to see whether stronger home attendance is associated with a higher total number of wins. A team has little control over its away attendance, since we assume its most loyal fans are unlikely to attend away games.

# Select columns from nfl_standings 
nfl_wins <- 
  nfl_standings %>%
  select(team_name, year, total_wins)

# Perform left join to get needed statistics
joined_data <- left_join(nfl_total_attendance, nfl_wins, by = c("team_name", "year"))
joined_data

## # A tibble: 638 x 8
##    team_location team_name  year total_attendance total_home_atte~
##    <chr>         <chr>     <dbl>            <dbl>            <dbl>
##  1 Arizona       Cardinals  2000           893926           387475
##  2 Atlanta       Falcons    2000           964579           422814
##  3 Baltimore     Ravens     2000          1062373           551695
##  4 Buffalo       Bills      2000          1098587           560695
##  5 Carolina      Panthers   2000          1095192           583489
##  6 Chicago       Bears      2000          1080684           535552
##  7 Cincinnati    Bengals    2000           967434           469992
##  8 Cleveland     Browns     2000          1057139           581544
##  9 Dallas        Cowboys    2000          1075470           504360
## 10 Denver        Broncos    2000          1140030           604042
## # ... with 628 more rows, and 3 more variables: total_away_attendance <dbl>,
## #   division <chr>, total_wins <dbl>

# Attach the dataset
attach(joined_data)

# Create linear model 
home_attendance_model <- lm(total_wins ~ total_home_attendance)
summary(home_attendance_model)

## 
## Call:
## lm(formula = total_wins ~ total_home_attendance)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.7799 -2.2250 -0.0726  2.2478  7.9490 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           4.225e+00  9.851e-01   4.289 2.07e-05 ***
## total_home_attendance 6.955e-06  1.809e-06   3.845 0.000133 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.051 on 636 degrees of freedom
## Multiple R-squared:  0.02272,    Adjusted R-squared:  0.02118 
## F-statistic: 14.78 on 1 and 636 DF,  p-value: 0.0001327

cor(total_wins, total_home_attendance)

## [1] 0.1507174

# Plot data
ggplot(data = joined_data, aes(total_home_attendance, total_wins)) +
        geom_point(size = 1, alpha = .8, col = "red") +
        geom_smooth(method = "lm", size = .8, se = FALSE) +
        xlab("Total Home Attendance") +
        ylab("Total Wins") +
        ggtitle("Total Home Attendance vs. Total Wins")

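From the above visualization, there appears to be a slight, positive linear relationship between the predictor variable (X, total_home_attendance) and the response variable (Y, total_wins). The correlation coefficient between the two variables is 0.1507, and the relationship is statistically significant at a 99% confidence level with a p-value of 0.000133. The effect is small, however: the R-squared of 0.0227 means home attendance explains only about 2.3% of the variation in total wins, and the slope of 6.955e-06 corresponds to roughly 0.07 additional wins per 10,000 additional home fans over a season. The model also cannot say which way the relationship runs, since winning teams may simply draw larger crowds. The lm() function was used to perform simple linear regression between the two variables.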
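To make the link between the correlation coefficient and the regression output concrete, the following sketch fits the same kind of model on simulated data and checks that R-squared equals the squared correlation. The toy data frame and its values are assumptions made up for illustration; only the slope 6.955e-06 is taken from the model summary above. The sketch also passes data = to lm(), which is one way to avoid attach().

```r
# Simulated stand-in for joined_data (all values are made up)
set.seed(42)
toy <- data.frame(total_home_attendance = runif(100, min = 4e5, max = 7e5))
toy$total_wins <- 4 + 7e-06 * toy$total_home_attendance + rnorm(100, sd = 3)

# Passing data = avoids the need for attach()
fit <- lm(total_wins ~ total_home_attendance, data = toy)

# In simple linear regression, R-squared equals the squared correlation
r <- cor(toy$total_wins, toy$total_home_attendance)
r_squared <- summary(fit)$r.squared
same_value <- isTRUE(all.equal(r^2, r_squared))

# Rescale the slope reported in this section to wins per 10,000 home fans
reported_slope <- 6.955e-06
wins_per_10k_fans <- reported_slope * 10000
```

Applied to the reported figures, squaring the correlation of 0.1507 gives approximately the Multiple R-squared of 0.0227 shown in the summary.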

Does Away Attendance Impact Total Wins?

# joined_data is still attached from the home attendance model above

# Create linear model
away_attendance_model <- lm(total_wins ~ total_away_attendance)
summary(away_attendance_model)

## 
## Call:
## lm(formula = total_wins ~ total_away_attendance)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.8922 -2.2928 -0.1177  2.2793  7.7255 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)   
## (Intercept)           -3.310e-01  2.571e+00  -0.129  0.89758   
## total_away_attendance  1.539e-05  4.751e-06   3.238  0.00126 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.061 on 636 degrees of freedom
## Multiple R-squared:  0.01622,    Adjusted R-squared:  0.01468 
## F-statistic: 10.49 on 1 and 636 DF,  p-value: 0.001265

cor(total_wins, total_away_attendance)

## [1] 0.1273652

# Plot data
ggplot(data = joined_data, aes(total_away_attendance, total_wins)) +
        geom_point(size = 1, alpha = .8, col = "magenta") +
        geom_smooth(method = "lm", size = .8, se = FALSE) +
        xlab("Total Away Attendance") +
        ylab("Total Wins") +
        ggtitle("Total Away Attendance vs. Total Wins")

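From the above visualization, there also appears to be a very slight, positive linear relationship between the predictor variable (X, total_away_attendance) and the response variable (Y, total_wins). The correlation coefficient between the two variables is 0.1274, and the relationship is statistically significant at a 99% confidence level with a p-value of 0.00126. As with home attendance, the effect is small: the R-squared of 0.0162 means away attendance explains only about 1.6% of the variation in total wins, and the slope of 1.539e-05 corresponds to roughly 0.15 additional wins per 10,000 additional away attendees over a season. The lm() function was used to perform simple linear regression between the two variables.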

Standings Over the Years

This part of our analysis looks into the qualities of the division winners and the attributes that separate teams high in the standings from teams in the lower portion of the standings. Furthermore, it explores what separates Super Bowl Champions from the 31 other teams each season.

First, we wanted to see which division has brought home the most Super Bowl Championships over the past 20 years. To do so, we once again added a division column, this time to nfl_standings. The results can be seen in the graphic below.

# Attach the dataset
attach(nfl_standings)

# Create new column
nfl_standings$division <- 
  ifelse(team_name == "Patriots" | team_name == "Bills" | team_name == "Jets" | team_name == "Dolphins", "AFC East", 
  ifelse(team_name == "Ravens" | team_name == "Steelers" | team_name == "Bengals" | team_name == "Browns", "AFC North",
  ifelse(team_name == "Texans" | team_name == "Titans" | team_name == "Colts" | team_name == "Jaguars", "AFC South", 
  ifelse(team_name == "Chiefs" | team_name == "Broncos" | team_name == "Raiders" | team_name == "Chargers", "AFC West",
  ifelse(team_name == "Eagles" | team_name == "Cowboys" | team_name == "Giants" | team_name == "Redskins", "NFC East",
  ifelse(team_name == "Packers" | team_name == "Vikings" | team_name == "Bears" | team_name == "Lions", "NFC North",
  ifelse(team_name == "Saints" | team_name == "Falcons" | team_name == "Buccaneers" | team_name == "Panthers", "NFC South",
  ifelse(team_name == "49ers" | team_name == "Seahawks" | team_name == "Rams" | team_name == "Cardinals", "NFC West",
  NA))))))) )

# Plot Super Bowl Championships per division
# fill (rather than col) colors the bar segments by team; FUN = sum orders divisions by total titles
ggplot(data = nfl_standings, 
       aes(reorder(division, -sb_winner, FUN = sum), sb_winner, fill = team_name)) + 
       geom_col() +
       ggtitle("Super Bowl Winners by Division") +
       xlab("Division") + ylab("Count of Super Bowl Championships Won") + 
       labs(fill = "Team Name")

As evident in the above visualization, the AFC East has had the most Super Bowl wins across the 2000 to 2019 seasons. This is entirely due to the New England Patriots, with quarterback Tom Brady and head coach Bill Belichick bringing home championships for the 2001, 2003, 2004, 2014, 2016, and 2018 seasons (the games themselves were played in February 2002, 2004, 2005, 2015, 2017, and 2019). Additionally, the second-best division is the AFC North, with the Pittsburgh Steelers and Baltimore Ravens winning two Super Bowl Championships each. Conversely, the AFC South, NFC North, and NFC West have each won only one Super Bowl over the past two decades.
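The division counts can be checked with a short tally. The sketch below copies the list of champions from the Super Bowl table later in this report and uses the same division groupings as the division column; the object names champions and champion_division are hypothetical and not part of the original analysis.

```r
# Champions for the 2000-2019 seasons, in season order,
# copied from the Super Bowl table later in this report
champions <- c("Ravens", "Patriots", "Buccaneers", "Patriots", "Patriots",
               "Steelers", "Colts", "Giants", "Steelers", "Saints",
               "Packers", "Giants", "Ravens", "Seahawks", "Patriots",
               "Broncos", "Patriots", "Eagles", "Patriots", "Chiefs")

# Division of each champion franchise (same groupings as the division column)
champion_division <- c(
  Patriots = "AFC East",    Ravens = "AFC North",  Steelers = "AFC North",
  Colts = "AFC South",      Broncos = "AFC West",  Chiefs = "AFC West",
  Giants = "NFC East",      Eagles = "NFC East",   Packers = "NFC North",
  Buccaneers = "NFC South", Saints = "NFC South",  Seahawks = "NFC West"
)

# Count titles per division, most titles first
titles_by_division <- sort(table(unname(champion_division[champions])),
                           decreasing = TRUE)
titles_by_division
```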

Analyzing NFL standings with the given datasets is a bit tricky because standings are decided by tie-breakers when necessary. Additionally, which teams make the playoffs is largely based on division success. As a result, the team with the most wins might not be the team with the best standing. For this analysis, we decided to break the teams down by division and see which ones have been dominant over the years.

We analyzed their success by using summary statistics showing the Average Total Wins, Average Total Losses, Average Points Per Game, and Average Opponent Points Per Game. The results can be seen below. We also developed box plots for the average total wins per season by division to analyze the range of data for each team and any relevant outliers. At the end of this analysis, we then grouped the box plots by conference (AFC vs. NFC).

The most dominant team in each division, defined by the highest average of total wins per season (as found in the analysis below), is as follows:

AFC East: New England Patriots
AFC North: Pittsburgh Steelers
AFC South: Indianapolis Colts
AFC West: Denver Broncos
NFC East: Philadelphia Eagles
NFC North: Green Bay Packers
NFC South: New Orleans Saints
NFC West: Seattle Seahawks

Please see the full breakdown below.

AFC East Breakdown

# AFC East
afc_east <- nfl_standings %>% 
  filter(division == "AFC East") %>% 
  select(team_name, total_wins, points_for, total_losses, points_against) %>% 
  group_by(team_name) %>% 
  summarise(total_wins = mean(total_wins), total_losses = mean(total_losses), points_for = mean(points_for)/16, points_against = mean(points_against)/16)

colnames(afc_east) <- c("Team Name", "Average Total Wins", "Average Total Losses", "Average Points Per Game", "Average Opponent Points Per Game")

kable(afc_east)

Team Name	Average Total Wins	Average Total Losses	Average Points Per Game	Average Opponent Points Per Game
Bills	6.85	9.15	19.86875	22.28125
Dolphins	7.45	8.55	19.85938	21.67500
Jets	7.40	8.60	19.83750	21.38750
Patriots	11.85	4.15	27.25937	18.57812

# AFC East box plot
afc_east_box <- nfl_standings %>% 
  filter(division == "AFC East") %>% 
  select(team_name, total_wins, points_for, total_losses, points_against)

boxplot_afceast <- ggplot(afc_east_box, 
       aes(team_name, total_wins)) + 
       geom_boxplot(col = "blue") + 
       ggtitle("AFC East") +
       xlab("Team") + ylab("Total Wins") +
       theme_stata()
boxplot_afceast

AFC North Breakdown

# AFC North 
afc_north <- nfl_standings %>% 
  filter(division == "AFC North") %>% 
  select(team_name, total_wins, points_for, total_losses, points_against) %>% 
  group_by(team_name) %>% 
  summarise(total_wins = mean(total_wins), total_losses = mean(total_losses), points_for = mean(points_for)/16, points_against = mean(points_against)/16)

colnames(afc_north) <- c("Team Name", "Average Total Wins", "Average Total Losses", "Average Points Per Game", "Average Opponent Points Per Game")

kable(afc_north)

Team Name	Average Total Wins	Average Total Losses	Average Points Per Game	Average Opponent Points Per Game
Bengals	7.25	8.60	20.70625	22.20625
Browns	4.95	11.00	17.30313	23.00937
Ravens	9.50	6.50	22.42813	18.28125
Steelers	10.25	5.65	23.06563	18.45938

# AFC North box plot
afc_north_box <- nfl_standings %>% 
  filter(division == "AFC North") %>% 
  select(team_name, total_wins, points_for, total_losses, points_against)

boxplot_afcnorth <- ggplot(afc_north_box, 
       aes(team_name, total_wins)) + 
       geom_boxplot(col = "purple") + 
       ggtitle("AFC North") +
       xlab("Team") + ylab("Total Wins") +
       theme_stata()
boxplot_afcnorth

AFC South Breakdown

# AFC South
afc_south <- nfl_standings %>% 
  filter(division == "AFC South") %>% 
  select(team_name, total_wins, points_for, total_losses, points_against) %>% 
  group_by(team_name) %>% 
  summarise(total_wins = mean(total_wins), total_losses = mean(total_losses), points_for = mean(points_for)/16, points_against = mean(points_against)/16)

colnames(afc_south) <- c("Team Name", "Average Total Wins", "Average Total Losses", "Average Points Per Game", "Average Opponent Points Per Game")

kable(afc_south)

Team Name	Average Total Wins	Average Total Losses	Average Points Per Game	Average Opponent Points Per Game
Colts	9.850000	6.150000	24.85625	22.24063
Jaguars	6.350000	9.650000	19.57500	21.92188
Texans	7.277778	8.722222	20.86111	22.68056
Titans	8.000000	8.000000	21.35625	22.36563

# AFC South box plot
afc_south_box <- nfl_standings %>% 
  filter(division == "AFC South") %>% 
  select(team_name, total_wins, points_for, total_losses, points_against)

boxplot_afcsouth <- ggplot(afc_south_box, 
       aes(team_name, total_wins)) + 
       geom_boxplot(col = "red") + 
       ggtitle("AFC South") +
       xlab("Team") + ylab("Total Wins") +
       theme_stata()
boxplot_afcsouth

AFC West Breakdown

# AFC West
afc_west <- nfl_standings %>% 
  filter(division == "AFC West") %>% 
  select(team_name, total_wins, points_for, total_losses, points_against) %>% 
  group_by(team_name) %>% 
  summarise(total_wins = mean(total_wins), total_losses = mean(total_losses), points_for = mean(points_for)/16, points_against = mean(points_against)/16)

colnames(afc_west) <- c("Team Name", "Average Total Wins", "Average Total Losses", "Average Points Per Game", "Average Opponent Points Per Game")

kable(afc_west)

Team Name	Average Total Wins	Average Total Losses	Average Points Per Game	Average Opponent Points Per Game
Broncos	9.10	6.90	23.49687	21.70000
Chargers	8.10	7.90	24.05937	21.75000
Chiefs	8.30	7.70	23.28438	22.01250
Raiders	6.25	9.75	20.10000	24.45625

# AFC West box plot
afc_west_box <- nfl_standings %>% 
  filter(division == "AFC West") %>% 
  select(team_name, total_wins, points_for, total_losses, points_against)

boxplot_afcwest <- ggplot(afc_west_box, 
       aes(team_name, total_wins)) + 
       geom_boxplot(col = "seagreen") + 
       ggtitle("AFC West") +
       xlab("Team") + ylab("Total Wins") +
       theme_stata()
boxplot_afcwest

NFC East Breakdown

# NFC East
nfc_east <- nfl_standings %>% 
  filter(division == "NFC East") %>% 
  select(team_name, total_wins, points_for, total_losses, points_against) %>% 
  group_by(team_name) %>% 
  summarise(total_wins = mean(total_wins), total_losses = mean(total_losses), points_for = mean(points_for)/16, points_against = mean(points_against)/16)

colnames(nfc_east) <- c("Team Name", "Average Total Wins", "Average Total Losses", "Average Points Per Game", "Average Opponent Points Per Game")

kable(nfc_east)

Team Name	Average Total Wins	Average Total Losses	Average Points Per Game	Average Opponent Points Per Game
Cowboys	8.4	7.60	22.29688	21.60938
Eagles	9.5	6.45	24.19375	20.53750
Giants	7.9	8.10	22.01250	22.43437
Redskins	6.6	9.35	19.48750	22.42813

# NFC East box plot
nfc_east_box <- nfl_standings %>% 
  filter(division == "NFC East") %>% 
  select(team_name, total_wins, points_for, total_losses, points_against)

boxplot_nfceast <- ggplot(nfc_east_box, 
       aes(team_name, total_wins)) + 
       geom_boxplot() + 
       ggtitle("NFC East") +
       xlab("Team") + ylab("Total Wins") +
       theme_stata()
boxplot_nfceast

NFC North Breakdown

# NFC North
nfc_north <- nfl_standings %>% 
  filter(division == "NFC North") %>% 
  select(team_name, total_wins, points_for, total_losses, points_against) %>% 
  group_by(team_name) %>% 
  summarise(total_wins = mean(total_wins), total_losses = mean(total_losses), points_for = mean(points_for)/16, points_against = mean(points_against)/16)

colnames(nfc_north) <- c("Team Name", "Average Total Wins", "Average Total Losses", "Average Points Per Game", "Average Opponent Points Per Game")

kable(nfc_north)

Team Name	Average Total Wins	Average Total Losses	Average Points Per Game	Average Opponent Points Per Game
Bears	7.85	8.15	20.24063	20.82812
Lions	5.70	10.25	20.58438	24.61563
Packers	9.85	6.05	25.24063	21.25313
Vikings	8.25	7.65	22.67813	22.03438

# NFC North box plot
nfc_north_box <- nfl_standings %>% 
  filter(division == "NFC North") %>% 
  select(team_name, total_wins, points_for, total_losses, points_against)

boxplot_nfcnorth <- ggplot(nfc_north_box, 
       aes(team_name, total_wins)) + 
       geom_boxplot(col = "brown") + 
       ggtitle("NFC North") +
       xlab("Team") + ylab("Total Wins") +
       theme_stata()
boxplot_nfcnorth

NFC South Breakdown

# NFC South
nfc_south <- nfl_standings %>% 
  filter(division == "NFC South") %>% 
  select(team_name, total_wins, points_for, total_losses, points_against) %>% 
  group_by(team_name) %>% 
  summarise(total_wins = mean(total_wins), total_losses = mean(total_losses), points_for = mean(points_for)/16, points_against = mean(points_against)/16)

colnames(nfc_south) <- c("Team Name", "Average Total Wins", "Average Total Losses", "Average Points Per Game", "Average Opponent Points Per Game")

kable(nfc_south)

Team Name	Average Total Wins	Average Total Losses	Average Points Per Game	Average Opponent Points Per Game
Buccaneers	6.90	9.10	20.55313	21.93437
Falcons	8.20	7.75	22.61250	22.75313
Panthers	7.85	8.10	21.15625	21.61563
Saints	9.15	6.85	25.94063	23.29062

# NFC South box plot
nfc_south_box <- nfl_standings %>% 
  filter(division == "NFC South") %>% 
  select(team_name, total_wins, points_for, total_losses, points_against)

boxplot_nfcsouth <- ggplot(nfc_south_box, 
       aes(team_name, total_wins)) + 
       geom_boxplot(col = "magenta") + 
       ggtitle("NFC South") +
       xlab("Team") + ylab("Total Wins") +
       theme_stata()
boxplot_nfcsouth

NFC West Breakdown

# NFC West
nfc_west <- nfl_standings %>% 
  filter(division == "NFC West") %>% 
  select(team_name, total_wins, points_for, total_losses, points_against) %>% 
  group_by(team_name) %>% 
  summarise(total_wins = mean(total_wins), total_losses = mean(total_losses), points_for = mean(points_for)/16, points_against = mean(points_against)/16)

colnames(nfc_west) <- c("Team Name", "Average Total Wins", "Average Total Losses", "Average Points Per Game", "Average Opponent Points Per Game")

kable(nfc_west)

Team Name	Average Total Wins	Average Total Losses	Average Points Per Game	Average Opponent Points Per Game
49ers	7.40	8.55	21.01562	22.39062
Cardinals	6.85	9.05	20.10938	23.65000
Rams	7.20	8.75	21.50313	23.70000
Seahawks	9.10	6.85	22.92188	20.56563

# NFC West box plot
nfc_west_box <- nfl_standings %>% 
  filter(division == "NFC West") %>% 
  select(team_name, total_wins, points_for, total_losses, points_against)

boxplot_nfcwest <- ggplot(nfc_west_box, 
       aes(team_name, total_wins)) + 
       geom_boxplot(col = "goldenrod4") + 
       ggtitle("NFC West") +
       xlab("Team") + ylab("Total Wins") +
       theme_stata()
boxplot_nfcwest

# Comparison by conference
ggarrange(boxplot_afceast, boxplot_afcnorth, boxplot_afcsouth, boxplot_afcwest)

ggarrange(boxplot_nfceast, boxplot_nfcnorth, boxplot_nfcsouth, boxplot_nfcwest)

Division Leaders Breakdown

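Combining the above tables into one table of average statistics for all 32 teams, the following teams have the highest value in each column. Note that leading in losses or in opponent points marks the weakest record, not the strongest:

Average Total Wins: New England Patriots
Average Total Losses: Cleveland Browns
Average Points Per Game: New England Patriots
Average Opponent Points Per Game: Detroit Lions (most allowed; the Baltimore Ravens allowed the fewest)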

# Analysis of all teams in the NFL
division_leaders <- nfl_standings %>% 
  select(team_name, total_wins, points_for, total_losses, points_against) %>% 
  group_by(team_name) %>% 
  summarise(total_wins = mean(total_wins), total_losses = mean(total_losses), points_for = mean(points_for)/16, points_against = mean(points_against)/16)

colnames(division_leaders) <- c("Team Name", "Average Total Wins", "Average Total Losses", "Average Points Per Game", "Average Opponent Points Per Game")

kable(division_leaders)

Team Name	Average Total Wins	Average Total Losses	Average Points Per Game	Average Opponent Points Per Game
49ers	7.400000	8.550000	21.01562	22.39062
Bears	7.850000	8.150000	20.24063	20.82812
Bengals	7.250000	8.600000	20.70625	22.20625
Bills	6.850000	9.150000	19.86875	22.28125
Broncos	9.100000	6.900000	23.49687	21.70000
Browns	4.950000	11.000000	17.30313	23.00937
Buccaneers	6.900000	9.100000	20.55313	21.93437
Cardinals	6.850000	9.050000	20.10938	23.65000
Chargers	8.100000	7.900000	24.05937	21.75000
Chiefs	8.300000	7.700000	23.28438	22.01250
Colts	9.850000	6.150000	24.85625	22.24063
Cowboys	8.400000	7.600000	22.29688	21.60938
Dolphins	7.450000	8.550000	19.85938	21.67500
Eagles	9.500000	6.450000	24.19375	20.53750
Falcons	8.200000	7.750000	22.61250	22.75313
Giants	7.900000	8.100000	22.01250	22.43437
Jaguars	6.350000	9.650000	19.57500	21.92188
Jets	7.400000	8.600000	19.83750	21.38750
Lions	5.700000	10.250000	20.58438	24.61563
Packers	9.850000	6.050000	25.24063	21.25313
Panthers	7.850000	8.100000	21.15625	21.61563
Patriots	11.850000	4.150000	27.25937	18.57812
Raiders	6.250000	9.750000	20.10000	24.45625
Rams	7.200000	8.750000	21.50313	23.70000
Ravens	9.500000	6.500000	22.42813	18.28125
Redskins	6.600000	9.350000	19.48750	22.42813
Saints	9.150000	6.850000	25.94063	23.29062
Seahawks	9.100000	6.850000	22.92188	20.56563
Steelers	10.250000	5.650000	23.06563	18.45938
Texans	7.277778	8.722222	20.86111	22.68056
Titans	8.000000	8.000000	21.35625	22.36563
Vikings	8.250000	7.650000	22.67813	22.03438
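The leaders listed in this section can be read off the table by eye, but they could also be picked out programmatically. The sketch below uses four rows copied from the table above; team_averages and its column names are hypothetical, and the same which.max() and which.min() calls would apply to the full 32-team table.

```r
# Four rows copied from the table above (hypothetical object and column names)
team_averages <- data.frame(
  team           = c("Browns", "Lions", "Patriots", "Ravens"),
  avg_wins       = c(4.95, 5.70, 11.85, 9.50),
  avg_losses     = c(11.00, 10.25, 4.15, 6.50),
  avg_points     = c(17.30313, 20.58438, 27.25937, 22.42813),
  avg_opp_points = c(23.00937, 24.61563, 18.57812, 18.28125),
  stringsAsFactors = FALSE
)

# Highest value in each column
most_wins       <- team_averages$team[which.max(team_averages$avg_wins)]
most_losses     <- team_averages$team[which.max(team_averages$avg_losses)]
most_points     <- team_averages$team[which.max(team_averages$avg_points)]
most_opp_points <- team_averages$team[which.max(team_averages$avg_opp_points)]

# For opponent points, the lowest value marks the best defense
fewest_opp_points <- team_averages$team[which.min(team_averages$avg_opp_points)]
```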

Offense vs. Defense

In the next step of our analysis, we investigated the importance of an offense and a defense. The offense in football is the 11 players who are on the field for a team when they have the ball. Conversely, the defense is the 11 players on the field when the other team has the ball. Sports writers and analysts have argued over the years whether a better offense or defense is more critical to a team’s success. Utilizing the nfl_standings dataset, we sought to analyze this discussion.

First, we created a linear model of a team’s wins in a season using offensive_ranking and defensive_ranking as the predictor variables.

# Attach the dataset
attach(nfl_standings)

# Create a linear model
rankings_model <- lm(total_wins ~ offensive_ranking + defensive_ranking)
summary(rankings_model)

## 
## Call:
## lm(formula = total_wins ~ offensive_ranking + defensive_ranking)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4463 -0.9950 -0.0159  1.0455  5.0687 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        7.98487    0.05837  136.81   <2e-16 ***
## offensive_ranking  0.43617    0.01381   31.58   <2e-16 ***
## defensive_ranking  0.43745    0.01681   26.03   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.474 on 635 degrees of freedom
## Multiple R-squared:  0.7722, Adjusted R-squared:  0.7715 
## F-statistic:  1076 on 2 and 635 DF,  p-value: < 2.2e-16

The model shows that both offensive_ranking and defensive_ranking are significant predictors of a team’s total wins at a 99% confidence level. Their estimated coefficients are nearly identical (0.436 and 0.437), so the regression alone does not favor either side of the ball. To drill deeper, we computed the correlation coefficient between each predictor variable and total wins.

# Run correlation tests
cor(offensive_ranking,total_wins)

## [1] 0.7274288

cor(defensive_ranking,total_wins)

## [1] 0.6437138

The offensive_ranking had a correlation coefficient of 0.7274288 and the defensive_ranking had a correlation coefficient of 0.6437138. As such, it appears that a team’s offense has a greater correlation with its wins than its defense does. To visualize this, we plotted two graphs to further test this hypothesis:

# Create rankings graphs with confidence bands
Def_Ranking_Plot <- ggplot(data = nfl_standings, aes(defensive_ranking, total_wins)) + 
  geom_smooth(col = "red") +
  ggtitle("Total Wins | Defensive Ranking") +
  xlab("Defensive Ranking") +
  ylab("Total Wins") +
  xlim(-10,10) + ylim(0,16) +
  theme_stata()

Off_Ranking_Plot <- ggplot(data = nfl_standings, aes(offensive_ranking, total_wins)) + geom_smooth() +
  ggtitle("Total Wins | Offensive Ranking")  +
  xlab("Offensive Ranking") + ylab("Total Wins") +
  xlim(-10,10) + ylim(0,16) +
  theme_stata()

# Put graphs next to each other
ggarrange(Off_Ranking_Plot,Def_Ranking_Plot)

These graphs confirm the positive relationship between a higher offensive or defensive ranking and a team’s total wins. Additionally, the confidence band for the defensive ranking is wider than the band for the offensive ranking. This agrees with our conclusion that the offensive ranking’s correlation with wins is stronger than the defensive ranking’s.

Using different statistics now, we changed the predictor variables to points_for and points_against, as these represent offensive and defensive success, respectively. We then used the binary playoffs variable to color each team-season and see how scoring or giving up points relates to making the playoffs. Otherwise, we took the same approach as with the previous variables.

# Create linear model
points_model <- lm(total_wins ~ points_for + points_against)
summary(points_model)

## 
## Call:
## lm(formula = total_wins ~ points_for + points_against)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3445 -0.8470 -0.0334  0.8322  3.9053 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     8.7362443  0.4184560   20.88   <2e-16 ***
## points_for      0.0270262  0.0006989   38.67   <2e-16 ***
## points_against -0.0291728  0.0008380  -34.81   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.238 on 635 degrees of freedom
## Multiple R-squared:  0.8395, Adjusted R-squared:  0.839 
## F-statistic:  1660 on 2 and 635 DF,  p-value: < 2.2e-16

# Run correlation tests
cor(points_for,total_wins)

## [1] 0.7300927

cor(points_against,total_wins)

## [1] -0.6792343

# Create scatter plots
Points_For_Plot <- ggplot(data = nfl_standings, aes(points_for, total_wins, col = as.character(playoffs))) +
  geom_point() +
  labs(col = "Playoffs") +
  ggtitle("Total Wins | Total Points Scored") +
  xlab("Total Points Scored") +
  ylab("Total Wins") +
  ylim(0,16) +
  theme_stata()

Points_Against_Plot <- ggplot(data = nfl_standings, aes(points_against, total_wins, col = as.character(playoffs))) +
  geom_point() +
  labs(col = "Playoffs") +
  ggtitle("Total Wins | Total Points Against") +
  xlab("Total Points Against") +
  ylab("Total Wins") +
  ylim(0,16) +
  theme_stata()

# Put graphs next to each other
ggarrange(Points_For_Plot,Points_Against_Plot)

This model shows that points_for and points_against are both statistically significant predictors of a team’s total wins (p < 2e-16 for each). Together they explain about 84% of the variation in total wins (adjusted R-squared of 0.839). Additionally, their correlations with total wins are 0.730 and -0.679. This indicates a strong, positive relationship for points_for and a strong, negative relationship for points_against. The offensive side, once again, has a slightly stronger relationship.
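To make the coefficients concrete, the fitted equation can be applied by hand. The sketch below copies the estimates from the summary output above; the 400 points scored and 350 points allowed are a hypothetical season chosen only for illustration, and the variable names are our own.

```r
# Coefficients copied from summary(points_model) above
intercept        <- 8.7362443
b_points_for     <- 0.0270262
b_points_against <- -0.0291728

# Hypothetical season (illustrative values, not taken from the dataset)
hyp_points_for     <- 400
hyp_points_against <- 350

# Predicted wins = intercept + slope * points_for + slope * points_against
predicted_wins <- intercept +
  b_points_for * hyp_points_for +
  b_points_against * hyp_points_against
predicted_wins
```

Under these assumptions the model predicts a little over nine wins. The coefficients also imply that roughly 37 additional points scored (1 / 0.0270262), or about 34 fewer points allowed (1 / 0.0291728), correspond to one extra win.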

The graphical representations show these strong linear relationships as well, with playoff teams typically having low Total Points Against and high Total Points Scored on the season.
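The scatter plots color teams by playoff status but do not estimate a probability. One way to do that would be a logistic regression, which we did not run in this analysis. The sketch below shows the idea on simulated data: toy_standings, its values, and the 400/350 season are hypothetical stand-ins so that the example runs on its own, and the results say nothing about real NFL teams.

```r
# Simulated stand-in for nfl_standings (hypothetical values, not real results)
set.seed(42)
n <- 200
toy_standings <- data.frame(
  points_for     = round(rnorm(n, mean = 350, sd = 70)),
  points_against = round(rnorm(n, mean = 350, sd = 60))
)

# Simulated playoff flag: a better point differential gives a higher chance
point_diff <- toy_standings$points_for - toy_standings$points_against
toy_standings$playoffs <- rbinom(n, size = 1,
                                 prob = plogis(point_diff / 50 - 0.5))

# Logistic regression: binary outcome modeled with the binomial family
playoff_model <- glm(playoffs ~ points_for + points_against,
                     data = toy_standings, family = binomial)

# Predicted playoff probability for a hypothetical season
new_team <- data.frame(points_for = 400, points_against = 350)
playoff_prob <- predict(playoff_model, newdata = new_team, type = "response")
playoff_prob
```

With the real data, the same call with data = nfl_standings would give an estimated playoff probability for any combination of points scored and points allowed.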

Although having both a good offense and a good defense is clearly important in the NFL, our results suggest that a better offense is slightly more strongly associated with success. The margin is close, and the debate will continue as long as the NFL is played. For one last piece of context, we developed a table of the last 20 Super Bowl winners with their offensive and defensive rankings:

# Create table for all Super Bowl Champions
super_bowl_champs <- nfl_standings %>% 
  filter(sb_winner == 1) %>% 
  select(year, team_name, offensive_ranking, defensive_ranking)

colnames(super_bowl_champs) <- c("Year", "Super Bowl Champion", "Offensive Ranking", "Defensive Ranking")

kable(super_bowl_champs)

Year	Super Bowl Champion	Offensive Ranking	Defensive Ranking
2000	Ravens	0.0	8.0
2001	Patriots	1.2	3.1
2002	Buccaneers	-1.0	9.8
2003	Patriots	2.1	4.9
2004	Patriots	6.4	6.5
2005	Steelers	3.8	4.0
2006	Colts	6.9	-1.1
2007	Giants	2.8	0.4
2008	Steelers	1.6	8.2
2009	Saints	11.2	-0.5
2010	Packers	3.1	7.9
2011	Giants	3.1	-1.5
2012	Ravens	1.9	1.0
2013	Seahawks	4.1	8.9
2014	Patriots	7.5	3.5
2015	Broncos	0.3	5.5
2016	Patriots	4.3	5.0
2017	Eagles	7.0	2.5
2018	Patriots	3.1	2.1
2019	Chiefs	6.2	2.9
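To read the table more systematically, we can count how often a champion's offensive ranking was higher than its defensive ranking. The sketch below retypes the values from the table (the champs data frame and the offense_higher column are names introduced here) and assumes, as elsewhere in this analysis, that a higher ranking value means a better unit.

```r
# Values transcribed from the Super Bowl champions table above
champs <- data.frame(
  year = 2000:2019,
  offensive_ranking = c(0.0, 1.2, -1.0, 2.1, 6.4, 3.8, 6.9, 2.8, 1.6, 11.2,
                        3.1, 3.1, 1.9, 4.1, 7.5, 0.3, 4.3, 7.0, 3.1, 6.2),
  defensive_ranking = c(8.0, 3.1, 9.8, 4.9, 6.5, 4.0, -1.1, 0.4, 8.2, -0.5,
                        7.9, -1.5, 1.0, 8.9, 3.5, 5.5, 5.0, 2.5, 2.1, 2.9)
)

# TRUE when the champion's offensive ranking exceeded its defensive ranking
champs$offense_higher <- champs$offensive_ranking > champs$defensive_ranking

# Cross-tabulate by decade
table(decade = ifelse(champs$year < 2010, "2000-2009", "2010-2019"),
      offense_higher = champs$offense_higher)
```

By this count, the offensive ranking was the higher of the two for 9 of the 20 champions: 3 in 2000-2009 and 6 in 2010-2019.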

Individual Game Observations

Our last analysis looks at data from individual NFL games, using the nfl_games dataset.

To analyze the correlation between different variables, we used the GGally package to produce a detailed scatter plot matrix. The function ggpairs() produced histograms along the diagonal of the matrix. Pearson correlation coefficients are shown in the upper-right, and scatter plots are shown in the lower-left. We analyzed six variables here - (1) Points Scored by Winning Team (pts_win); (2) Yards Gained by Winning Team (yds_win); (3) Turnovers Committed by Winning Team (turnovers_win); (4) Points Scored by Losing Team (pts_loss); (5) Yards Gained by Losing Team (yds_loss); and (6) Turnovers Committed by Losing Team (turnovers_loss).

We then grouped these variables by winning team v. losing team.

# Create correlation graphs for the winning-team variables in nfl_games
ggpairs(nfl_games %>% select(pts_win, yds_win, turnovers_win))

As is evident from both the scatter plots and the Pearson correlation coefficients, there is little to no relationship between Points Scored by Winning Team v. Turnovers Committed by Winning Team as well as Yards Gained by Winning Team v. Turnovers Committed by Winning Team. Both of these correlation coefficients are close to zero.

On the other hand, there is a moderate, positive relationship between Points Scored by Winning Team v. Yards Gained by Winning Team, with a Pearson correlation coefficient of 0.537.

# Create correlation graphs for the losing-team variables in nfl_games
ggpairs(nfl_games %>% select(pts_loss, yds_loss, turnovers_loss))

Very similar to the winning teams, there is little to no relationship between Points Scored by Losing Team v. Turnovers Committed by Losing Team as well as Yards Gained by Losing Team v. Turnovers Committed by Losing Team. Both of these correlation coefficients are close to zero.

On the other hand, there is a moderately strong, positive relationship between Points Scored by Losing Team v. Yards Gained by Losing Team, with a Pearson correlation coefficient of 0.632.
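The coefficients that ggpairs() prints in the upper-right panels can also be pulled out as a plain matrix with base R's cor(), and cor.test() adds a confidence interval and p-value for a single pair. The sketch below uses a small simulated data frame (toy_games, with made-up values) so that it runs on its own; with the real data, nfl_games would take its place.

```r
# Small simulated stand-in for the winning-team columns of nfl_games
# (hypothetical values, generated only so the example is self-contained)
set.seed(1)
n <- 100
yds_win       <- round(rnorm(n, mean = 360, sd = 80))
pts_win       <- round(10 + 0.05 * yds_win + rnorm(n, sd = 6))
turnovers_win <- rpois(n, lambda = 1.2)
toy_games <- data.frame(pts_win, yds_win, turnovers_win)

# Pearson correlation matrix (Pearson is the default method for cor())
cor_matrix <- cor(toy_games, method = "pearson")
round(cor_matrix, 3)

# Confidence interval and p-value for one pair of variables
pts_yds_test <- cor.test(toy_games$pts_win, toy_games$yds_win)
pts_yds_test$conf.int
```

A numeric matrix like this is easier to reuse in a table or a follow-up calculation than values read off a plot.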

The main takeaway from these correlation matrices is that the more yards a team gains, the more points it tends to score. To compare a winning team and a losing team, we wanted to see whether more turnovers by the losing team are associated with more points for the winning team.

# Attach the dataset
attach(nfl_games)

# Create linear model
games_model <- lm(pts_win ~ turnovers_loss)
summary(games_model)

## 
## Call:
## lm(formula = pts_win ~ turnovers_loss)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.598  -5.786  -0.598   5.425  33.308 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    25.41092    0.21730  116.94   <2e-16 ***
## turnovers_loss  1.09370    0.08384   13.04   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.693 on 5322 degrees of freedom
## Multiple R-squared:  0.03098,    Adjusted R-squared:  0.0308 
## F-statistic: 170.2 on 1 and 5322 DF,  p-value: < 2.2e-16

cor(turnovers_loss, pts_win)

## [1] 0.1760229

# Plot the data
ggplot(nfl_games) + geom_point(aes(x = turnovers_loss, y = pts_win), color = "coral1") +
  ggtitle("Turnovers Committed by Losing Team vs. Points Scored by Winning Team")

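From the above graphic, we see that there is a slight, positive relationship between the Turnovers Committed by Losing Team and Points Scored by Winning Team. The correlation coefficient between the two variables is 0.176, and the model estimates about 1.09 additional points for the winning team per turnover committed by the losing team. Although this effect is statistically significant, turnovers by the losing team explain only about 3% of the variation in the winning team’s points (R-squared of 0.031).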
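The correlation coefficient and the model's R-squared are two views of the same quantity: for a linear model with a single predictor, R-squared equals the squared Pearson correlation. A quick check, using only the numbers printed in the output above (the variable names are our own):

```r
# Values copied from the output above
r_turnovers <- 0.1760229   # cor(turnovers_loss, pts_win)
r_squared   <- 0.03098     # Multiple R-squared from summary(games_model)

# For a one-predictor linear model, R-squared is the squared correlation
r_turnovers^2

# Compare after rounding to the precision shown in the summary output
all.equal(round(r_turnovers^2, 5), r_squared)
```

This is why a statistically significant slope can still describe a weak relationship: with more than 5,000 games, even a correlation of 0.176 is clearly distinguishable from zero.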

Summary

In this analysis, the main goal was to understand what goes into winning an NFL game and which teams have been historically successful in the standings. We broke this analysis into four sections: (1) The Importance of Fan Attendance; (2) Standings over the Years; (3) Offense vs. Defense; and (4) Individual Game Observations.

Through extensive use of R, we investigated the nfl_games, nfl_attendance, and nfl_standings datasets. We frequently used linear modeling to examine the relationships between variables. Additionally, the ggplot2 package delivered great visualizations to showcase this breakdown of the NFL. We also created new variables and tables to drill deeper into the raw data. One of our primary focuses was a breakdown of the divisions and their successes over the past 20 years. Box plot visualizations between the two conferences illuminated how teams have fared in the win column from their best season to their worst season.

Our first analysis looked into NFL Fan Attendance. Graphical representations were created to better understand which teams have a strong fan base and how consistently fans show up from year to year. From this analysis, it was evident that the Dallas Cowboys have the strongest fan base and the Los Angeles Chargers have the weakest. Additionally, greater attendance at games was positively correlated with a team’s total wins per season.

Secondly, we focused on the divisional standings through the years. As mentioned above, box plot visualizations by division showed the range of success for NFL teams. Per division, these teams have had the most success based on the nfl_standings dataset:

AFC East: New England Patriots
AFC North: Pittsburgh Steelers
AFC South: Indianapolis Colts
AFC West: Denver Broncos
NFC East: Philadelphia Eagles
NFC North: Green Bay Packers
NFC South: New Orleans Saints
NFC West: Seattle Seahawks

Using geom_col(), we observed that the AFC East has won the most Super Bowl Championships. This is largely due to the phenomenal success of Tom Brady and the New England Patriots during this time period.

Next, we researched one of the most common arguments in football - is the offense or defense more important? Linear modeling of the nfl_standings data was completed on several variables. High offensive rankings and defensive rankings correlate with more wins for teams. Even though a great offense and a great defense are both important, the correlation coefficients indicated that a better offense is slightly more strongly associated with a team’s success than a better defense. We created a table of the last 20 Super Bowl Champions and showcased the offensive_ranking and defensive_ranking. Champions have been trending towards stronger offenses in recent years, as evidenced by this table: the champion’s offensive ranking exceeded its defensive ranking in 6 of the 10 seasons from 2010 to 2019, compared with 3 of the 10 seasons from 2000 to 2009.

Finally, we observed individual game data in the NFL. Through graphs created by ggpairs(), we were able to view correlation coefficients for six variables. The main conclusion we deduced from this is that a positive correlation exists between yards gained and points scored.

As big NFL fans, we found it incredibly interesting to see how the NFL has worked during our entire lifetime. It was also intriguing to see our favorite teams’ successes over this time span. The Steelers have been comfortably better than the Bengals, but the next 20 years could be a different story. Some limitations of our analysis are the brief time period that the datasets cover (20 years) and the lack of detail in the nfl_games dataset (e.g. rushing yards, passing yards, penalty yards, etc.). Additionally, there is no player information included in these datasets. Players and coaching staff would definitely impact the success of each team. If we were to continue this analysis and potentially collect more data, we could add in tie-breaker information and how weather affects scoring. Predictive modeling could also be a great next step to see how NFL franchises will perform based on prior information.

The NFL is one of the largest sports businesses in the world, with implications on many levels. Sports bettors, draft analysts, fantasy football players, and everyday fans could all have different takeaways from this analysis that would help them better understand the recent history of the NFL. With fans across the globe, a deep dive into the NFL is exciting for many groups. Coaches and players could prepare more effectively for their opponents, gamblers could make more educated bets, general managers could identify their team’s needs in the Draft, and the common fan could revel in their team’s history.

This data tells a phenomenal story of the state of the NFL. However, it is a game for a reason. No one will ever be able to fully predict NFL outcomes, and that is what makes the sport as intriguing as it is!