Project Introduction

Row

Breaking Down the NFL

From a young age, I have watched NFL football and cheered for my team every week. Because my dad is from Pittsburgh and my parents met there, I was naturally raised a lifelong Steelers fan. That said, I wanted to examine not only my own team but also the league as a whole. The past 20 years have seen successes and failures from every NFL team. When choosing an option for my final capstone project, I was instantly drawn to extending a project I had previously done on the NFL, which has a plethora of publicly available data points. I thought, “What all goes into winning an NFL game, and what teams are historically successful in the final standings?” Using the past 20 years’ worth of data, I sought to investigate this question.

My Focus

As mentioned above, for my final capstone project I am expanding upon my final project from Data Wrangling in R (BANA 7025) with Professor Tianhai Zu. I originally worked with a partner on that project; however, the extension is my own individual work. I plan to use R functions to deliver overall summary statistics on games and standings. Additionally, I will use the data to explore potential correlations and plot the corresponding data visualizations. Using descriptive analysis of the past 20 years, I am looking to see whether NFL teams show predictive tendencies.

This analysis includes data from 2000 - 2019. I added 2020 season data to every dataset aside from nfl_attendance and nfl_games, as these would be skewed if 2020 data were added. The skew would be due to the impact of COVID-19, which caused games to be played on different days and times, games to be cancelled, and little to no attendance depending on location.

This NFL analysis consists of eight individual datasets:

NFL Attendance (nfl_attendance)
NFL Standings (nfl_standings)
NFL Games (nfl_games)
NFL Weather (nfl_weather)
NFL Playoff Coaches and Quarterbacks (nfl_playoffs)
NFL Passing Yards Leaders (nfl_passing)
NFL Rushing Yards Leaders (nfl_rushing)
NFL Penalty Yards Per Game (nfl_penalty)

More detailed information about each dataset can be found in the Data Preparation tab.

Row

Goal

The NFL is a multi-billion dollar industry. Millions of fans across the world cheer for these 32 teams every year. People are now looking for ways to understand the game better.

Coaches want to understand what makes a team more successful. Sports gamblers want to get an edge and make the correct picks based on more than just gut feelings. Fans want to know if their team is progressing in the right direction. This analysis is useful for all of these situations. Using descriptive analysis, past results can be better explained. As such, trends can be deduced to predict how NFL games and seasons will occur. Although no one can see into the future, understanding the data sheds a better light on the probability of certain results occurring in the NFL.

The goal of my analysis is to inform my readers on what all goes into winning an NFL game. My hope is that the audience will finish reading my report and better understand historic trends and performance from teams, players, and coaches alike. As a final capstone project, I hope to demonstrate proficiency in R using R Markdown as well as flexdashboard with Shiny components.

Analytical Technique and Approach

Analytical Approach

The datasets contain loads of information for the NFL. With a wide range of variables, many options are available to analytically investigate the NFL. With the eight datasets at hand, I looked to compare them to draw conclusions about team performances. To see if statistical significance or rational conclusions related to the NFL could be realized, the following situations were explored:

Packages Required

This project requires a variety of packages. Given there are over 10,000 packages in R, I want to focus on the ones that will provide me with the best results while cleaning and interpreting the data.

Some packages will be more useful than others. For example, ggplot2 allows for great visualizations that provide better understanding of the data. Additionally, dplyr can drill deeper into the eight datasets to come to conclusions that may be hidden at first. R has powerful functions that can derive explanations for questions to massive datasets. Please see below for all of the packages loaded for this analysis:

# Packages required
library(tidyverse) # Use to tidy data
library(dplyr) # Use to manipulate data
library(ggplot2) # Use to plot data and create visualizations
library(tibble) # Use to manipulate and re-imagine data
library(readr) # Use to import data cleanly and efficiently
library(DT) # Use to create comprehensive data tables with HTML output
library(knitr) # Use for dynamic report generation
library(base) # Contains Base R functions
library(ggthemes) # Use themes in data visualizations
library(plotly) # Use to plot data and create visualizations
library(ggpubr) # Use to show multiple plots at once
library(GGally) # Use to produce scatter plot matrix
library(rmarkdown) # Use to produce report
library(flexdashboard) # Use to produce flexdashboard
library(stringr) # Provides functions to work with strings
library(highcharter) # Includes shortcut functions to plot R objects
library(shinythemes) # Use to implement themes for output

Importing the Data

Most of the data (nfl_attendance, nfl_standings, and nfl_games) was obtained from my professor, Tianhai Zu, for the Data Wrangling in R class. He had provided four different datasets from which to choose, and my partner and I chose the NFL option. These datasets can be found on GitHub. Reading the information on GitHub led me to the original sources of the data, which are Pro Football Reference Standings and Pro Football Reference Attendance.

This NFL analysis consists of eight individual datasets: (1) nfl_attendance, (2) nfl_standings, (3) nfl_games, (4) nfl_weather, (5) nfl_playoffs, (6) nfl_passing, (7) nfl_rushing, and (8) nfl_penalty.

I first merged the datasets that share team and season keys (nfl_attendance and nfl_standings) into one dataframe called nfl_df; nfl_games is organized by home and away team rather than by a single team column, so it does not join on those keys. I decided it might be beneficial to have multiple frames of reference, some using individual datasets and another using the combined dataframe. Rather than using str() and summary() to show descriptive statistics for each variable, I decided to create comprehensive tables. Then, in the Data Preparation tab, I cleaned every dataset.

# Get working directory
getwd()

[1] "C:/Users/katie/OneDrive - University of Cincinnati/SS21/Full Semester/Capstone (BANA 8083)/Project Files"

# Get the data
nfl_attendance <- readr::read_csv('attendance.csv')
nfl_standings <- readr::read_csv('updatedstandings.csv')
nfl_games <- readr::read_csv('games.csv')
nfl_weather <- readr::read_csv('weather.csv')
nfl_playoffs <- readr::read_csv('post_season.csv')
nfl_passing <- readr::read_csv('passing_yards_leaders.csv')
nfl_rushing <- readr::read_csv('rushing_yards_leaders.csv')
nfl_penalty <- readr::read_csv('penalty_yards_per_game.csv')

# To use 2020 data you need to update tidytuesdayR from GitHub
# Install via devtools::install_github("thebioengineer/tidytuesdayR")

tuesdata <- tidytuesdayR::tt_load('2020-02-04')


    Downloading file 1 of 3: `attendance.csv`
    Downloading file 2 of 3: `games.csv`
    Downloading file 3 of 3: `standings.csv`

# Equivalent alternative: load the same files by year and week
# tuesdata <- tidytuesdayR::tt_load(2020, week = 6)

attendance <- tuesdata$attendance

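# Join attendance and standings on their shared keys with dplyr.
# left_join() joins two tables per call: a third data frame passed positionally
# is matched to the `copy` argument and is not joined, so nfl_games is left out here.
nfl_df <- dplyr::left_join(nfl_attendance, nfl_standings, by = c("year", "team_name", "team"))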
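
To make the join on the three key columns concrete, here is a minimal sketch on made-up rows. The tibbles `att` and `stand` and all of their values are illustrative assumptions; only the key column names (year, team_name, team) mirror the real data.

```r
library(dplyr)
library(tibble)

# Made-up weekly attendance rows (one row per team per week)
att <- tibble(
  team = c("Pittsburgh", "Pittsburgh", "Dallas"),
  team_name = c("Steelers", "Steelers", "Cowboys"),
  year = c(2019, 2019, 2019),
  week = c(1, 2, 1),
  weekly_attendance = c(65000, 64000, 90000)
)

# Made-up season standings rows (one row per team per year)
stand <- tibble(
  team = c("Pittsburgh", "Dallas"),
  team_name = c("Steelers", "Cowboys"),
  year = c(2019, 2019),
  wins = c(8, 8)
)

# Every attendance row keeps its place and picks up the matching season columns
joined <- left_join(att, stand, by = c("year", "team_name", "team"))
joined
```

Because the standings table has one row per team and year, the season-level columns repeat on every weekly row after the join. Bringing in a third table would take a second left_join() call with keys that exist in that table.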

Data Preparation

Row

Attendance

As mentioned above, the nfl_attendance dataset was obtained from Pro Football Reference. The original data contains 10,846 observations and eight variables. There are two character type variables: team and team_name. There are six numeric type variables: year, total, home, away, week, and weekly_attendance. The data covers 2000 - 2019 (2020 was not added to this dataset), and the values were observed during the 17 weeks of each NFL regular season.

# Examine the structure of the dataset
datatable(head(nfl_attendance, 10))

# Create a data dictionary for attendance
var_names_att <- colnames(nfl_attendance)
var_types_att <- lapply(nfl_attendance, class) # lapply returns a list of the same length as X (a vector)
var_descriptions_att <- c("City or state in which the team originates", "Name or mascot of the team", "Year", "Total attendance per season", "Total attendance at home games per season", "Total attendance at away games per season", "Week in which game was played", "Attendance for given week")
data_dict_att <- as_tibble(cbind(var_names_att, var_types_att, var_descriptions_att))
colnames(data_dict_att) <- c("Variable Name", "Variable Data Type", "Variable Description")
kable(data_dict_att) # kable returns a single table for a single data object

Looking at the missing data values, the only column in which missing values exist is the weekly_attendance. This makes sense, as each NFL team has at least one bye week during the regular season. I decided to omit these values as they would skew the data and misrepresent the trends for each team.

colSums(is.na(nfl_attendance)) # Find the number of missing values per column
nfl_attendance <- na.omit(nfl_attendance)
colSums(is.na(nfl_attendance)) # Confirm there are no missing values

Looking at the original dataset above, I first decided to rename the columns to better describe the data.

nfl_attendance <- nfl_attendance %>% dplyr::rename(
  team_location = team,
  total_attendance = total,
  total_home_attendance = home,
  total_away_attendance = away
)

Additionally, I split it into two dataframes, the first omitting the weekly data and the second omitting the season totals. This decision was made largely to remove duplicates, and I knew it would make for better visualizations during the exploratory data analysis (EDA).

The first dataset, nfl_total_attendance, drops the two weekly columns, week and weekly_attendance. This dataset shows the season attendance totals for each team. The second dataset, nfl_weekly_attendance, drops the season total columns (originally total, home, and away; now total_attendance, total_home_attendance, and total_away_attendance).

nfl_total_attendance <- nfl_attendance[-c(7, 8)] # Remove weekly data
nfl_total_attendance <- nfl_total_attendance[!duplicated(nfl_total_attendance), ] # Remove duplicates
datatable(head(nfl_total_attendance, 10))

nfl_weekly_attendance <- nfl_attendance[-c(4, 5, 6)] # Remove season total attendance data
datatable(head(nfl_weekly_attendance, 10))
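
The same split can also be written by column name rather than by position, which keeps working if the column order ever changes. This is only a sketch of that alternative, on a made-up three-row tibble (`att`, with invented values), not the code used in the report.

```r
library(dplyr)
library(tibble)

# Made-up rows: season totals repeat on every weekly row
att <- tibble(
  team_location = "Dallas",
  team_name = "Cowboys",
  year = 2019,
  total_attendance = 1000,
  total_home_attendance = 600,
  total_away_attendance = 400,
  week = 1:3,
  weekly_attendance = c(90, 85, 95)
)

# Season totals: drop the weekly columns by name, then collapse the repeats
total_att <- att %>%
  select(-week, -weekly_attendance) %>%
  distinct()

# Weekly data: drop the season total columns by name
weekly_att <- att %>%
  select(-total_attendance, -total_home_attendance, -total_away_attendance)
```

Here distinct() plays the same role as the !duplicated() filter above: the three weekly rows collapse to a single season row.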

Now, for a summary of the two datasets and associated tables of the CLEANED data, please see below.

NFL Total Attendance Dataset

Data Dictionary for the NFL Total Attendance Dataset

Variable Name	Variable Data Type	Variable Description
team_location	character	City or state in which the team originates
team_name	character	Name or mascot of the team
year	numeric	Year
total_attendance	numeric	Total attendance per season
total_home_attendance	numeric	Total attendance at home games per season
total_away_attendance	numeric	Total attendance at away games per season

NFL Weekly Attendance Dataset

Data Dictionary for the NFL Weekly Attendance Dataset

Variable Name	Variable Data Type	Variable Description
team_location	character	City or state in which the team originates
team_name	character	Name or mascot of the team
year	numeric	Year
week	numeric	Week in which game was played
weekly_attendance	numeric	Attendance for given week

Standings

The nfl_standings dataset was imported and obtained from Pro Football Reference. The original data contains 638 observations and 15 variables. There are four character type variables, team, team_name, playoffs, and sb_winner. There are 11 numeric type variables, year, wins, loss, points_for, points_against, points_differential, margin_of_victory, strength_of_schedule, simple_rating, offensive_ranking, and defensive_ranking. The data observed was collected from 2000 - 2020. The process of cleaning the ORIGINAL data can be seen below.

# Examine the structure of the dataset
datatable(head(nfl_standings, 10))

# Create a data dictionary for standings
var_names_st <- colnames(nfl_standings)
var_types_st <- lapply(nfl_standings, class) # lapply returns a list of the same length as X (a vector)
var_descriptions_st <- c("City or state in which the team originates", "Name or mascot of the team", "Year", "Total wins per season (0 to 16)", "Total losses per season (0 to 16)", "Total points the team scored per season", "Total points the opponent scored on the team per season", "The difference between the total points for the team and against the team", "Points differential divided by the total number of games per season", "Difficulty of schedule based on opponent records", "A rating for the team that takes into account points differential and strength of schedule (measured by Simple Rating System)", "A rating comparing how well the offense performs to opponent teams (measured by Simple Rating System)", "A rating comparing how well the defense performs to opponent teams (measured by Simple Rating System)", "Stating whether or not the team made it to the playoffs", "Stating whether or not the team won the Super Bowl for the season")
data_dict_st <- as_tibble(cbind(var_names_st, var_types_st, var_descriptions_st))
colnames(data_dict_st) <- c("Variable Name", "Variable Data Type", "Variable Description")
kable(data_dict_st) # kable returns a single table for a single data object

Looking at the above dataset, I first decided to change the column names to better describe the data.

nfl_standings <- nfl_standings %>% dplyr::rename(
  team_location = team,
  total_wins = wins,
  total_losses = loss
)

It is important to note as well that a few of the variable names refer to calculated values. The calculated value for points_differential is: points_differential = points_for - points_against. Additionally, margin_of_victory is calculated by: margin_of_victory = (points_for - points_against) / games_played.

Lastly, the simple_rating is calculated by: \[SRS = MoV + SoS = OSRS + DSRS\]

In layman’s terms, the simple rating system is equal to the margin of victory plus the strength of schedule. This is equal to the offensive simple rating standing plus the defensive simple rating standing.
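
As a quick worked example of these three formulas, the sketch below uses made-up season numbers (400 points scored, 320 allowed, 16 games, and a strength of schedule of -1.5). The values are assumptions chosen for easy arithmetic, not figures from the dataset.

```r
# Made-up season totals for one team
points_for <- 400
points_against <- 320
games_played <- 16
strength_of_schedule <- -1.5

# points_differential = points_for - points_against
points_differential <- points_for - points_against

# margin_of_victory = (points_for - points_against) / games_played
margin_of_victory <- points_differential / games_played

# SRS = MoV + SoS
simple_rating <- margin_of_victory + strength_of_schedule

# Without the parentheses, only points_against is divided by games_played
without_parentheses <- points_for - points_against / games_played

c(points_differential, margin_of_victory, simple_rating, without_parentheses)
# 80.0  5.0  3.5  380.0
```

The last value shows why the parentheses matter: dividing only the points allowed gives 380 instead of a margin of 5 points per game.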

Next, I wanted to see what the sum of missing values was per column. As evident below, there are no missing values.

colSums(is.na(nfl_standings))

Moving forward, I decided to change both the playoffs and sb_winner to binary variables. This is because they both only have two unique values.

unique(nfl_standings$playoffs, incomparables = FALSE) # View the unique values for the playoffs column
unique(nfl_standings$sb_winner, incomparables = FALSE) # View the unique values for the sb_winner column

Knowing this, I changed the two columns to binary variables. For the playoffs column, a value of one stands for “Playoffs”, and a value of zero stands for “No Playoffs”.

nfl_standings$playoffs[nfl_standings$playoffs == "Playoffs"] <- "1"
nfl_standings$playoffs[nfl_standings$playoffs == "No Playoffs"] <- "0"

nfl_standings$playoffs <- as.numeric(nfl_standings$playoffs)

For the sb_winner column, a value of one denotes “Won Superbowl”, and a value of zero denotes “No Superbowl”.

nfl_standings$sb_winner[nfl_standings$sb_winner == "Won Superbowl"] <- "1"
nfl_standings$sb_winner[nfl_standings$sb_winner == "No Superbowl"] <- "0"

nfl_standings$sb_winner <- as.numeric(nfl_standings$sb_winner)
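
A more compact way to get the same 0/1 coding is to convert a logical comparison directly. The sketch below uses a made-up character vector; the check on allowed labels is an added safeguard of my own, since a comparison would otherwise silently code any unexpected label as 0.

```r
# Made-up values using the two labels from the playoffs column
playoffs <- c("Playoffs", "No Playoffs", "Playoffs")

# Guard against unexpected labels before recoding
stopifnot(all(playoffs %in% c("Playoffs", "No Playoffs")))

# TRUE/FALSE converts to 1/0
playoffs_flag <- as.integer(playoffs == "Playoffs")
playoffs_flag
# 1 0 1
```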

Now, for a summary of the dataset and associated table of the data, please see the CLEANED dataset below.

NFL Standings Dataset

Data Dictionary for the NFL Standings Dataset

Variable Name	Variable Data Type	Variable Description
team_location	character	City or state in which the team originates
team_name	character	Name or mascot of the team
year	numeric	Year
total_wins	numeric	Total wins per season (0 to 16)
total_losses	numeric	Total losses per season (0 to 16)
points_for	numeric	Total points the team scored per season
points_against	numeric	Total points the opponent scored on the team per season
points_differential	numeric	The difference between the total points for the team and against the team
margin_of_victory	numeric	Points differential divided by the total number of games per season
strength_of_schedule	numeric	Difficulty of schedule based on opponent records
simple_rating	numeric	A rating for the team that takes into account points differential and strength of schedule (measured by Simple Rating System)
offensive_ranking	numeric	A rating comparing how well the offense performs to opponent teams (measured by Simple Rating System)
defensive_ranking	numeric	A rating comparing how well the defense performs to opponent teams (measured by Simple Rating System)
playoffs	numeric	Stating whether or not the team made it to the playoffs
sb_winner	numeric	Stating whether or not the team won the Super Bowl for the season

Games

Once again, the nfl_games data was obtained from Pro Football Reference. The original data contains 5,324 observations and 19 variables. There are 11 character type variables: week, home_team, away_team, winner, tie, day, date, home_team_name, home_team_city, away_team_name, and away_team_city. There are seven numeric type variables: year, pts_win, pts_loss, yds_win, turnovers_win, yds_loss, and turnovers_loss. The remaining variable, time, stores the time of day. See the ORIGINAL dataset below.

# Examine the structure of the dataset
datatable(head(nfl_games, 10))

# Create a data dictionary for games
var_names_games <- colnames(nfl_games)
var_types_games <- lapply(nfl_games, class) # lapply returns a list of the same length as X (a vector)
var_descriptions_games <- c("Year", "Week of the season in which the game was played", "Home team for the game", "Away team for the game", "Winner of the game", "Was there a tie?  (if so, the other team will be listed in this column)", "Day of the week in which the game was played", "Date of the game", "Time of the day in which the game was played", "Number of points the winning team scored", "Number of points the losing team scored", "Total number of yards the winning team had", "Total number of turnovers the winning team had", "Total number of yards the losing team had", "Total number of turnovers the losing team had", "Name or mascot of the home team", "City of the home team", "Name or mascot of the away team", "City of the away team")
data_dict_games <- as_tibble(cbind(var_names_games, var_types_games, var_descriptions_games))
colnames(data_dict_games) <- c("Variable Name", "Variable Data Type", "Variable Description")
kable(data_dict_games) # kable returns a single table for a single data object

Looking at the above dataset, the first step I took to clean the data was to remove the last four unnecessary columns, as I felt they were redundant.

names(nfl_games)
nfl_games <- nfl_games[-c(16, 17, 18, 19)] # Remove redundant columns
names(nfl_games)

Then, I changed the week column to be numeric.

nfl_games$week <- as.numeric(nfl_games$week) # Any non-numeric week labels are coerced to NA with a warning

Looking at missing values, the only column which contained them was the tie column. This makes sense, as very few NFL games result in a tie.

Next, the way in which a tie was denoted was by listing one team name in the winner column, and the opponent team name in the tie column. To fix this, I identified any game that resulted in a tie. Then, for these specific games, I renamed the value in the winner column to “Tie”. The tie column was then erased.

colSums(is.na(nfl_games))
unique(nfl_games$tie, incomparables = FALSE)
nfl_games$winner[!is.na(nfl_games$tie)] <- "Tie" # A non-missing tie value marks a tied game
nfl_games <- nfl_games[-c(6)] # Remove the tie column
colSums(is.na(nfl_games)) # Confirm there are no missing values

To view the summary and structure of the CLEANED data:

NFL Games Dataset

Data Dictionary for the NFL Games Dataset

Variable Name	Variable Data Type	Variable Description
year	numeric	Year
week	numeric	Week of the season in which the game was played
home_team	character	Home team for the game
away_team	character	Away team for the game
winner	character	Winner of the game
day	character	Day of the week in which the game was played
date	character	Date of the game
time	hms , difftime	Time of the day in which the game was played
pts_win	numeric	Number of points the winning team scored
pts_loss	numeric	Number of points the losing team scored
yds_win	numeric	Total number of yards the winning team had
turnovers_win	numeric	Total number of turnovers the winning team had
yds_loss	numeric	Total number of yards the losing team had
turnovers_loss	numeric	Total number of turnovers the losing team had

Weather

Incorporating weather data into my analysis is an interesting next step. I want to see how the weather impacts the outcome of individual games. The nfl_weather data is from NFLsavant.com. All data and statistics from this site are compiled from publicly available NFL play-by-play data on the Internet. The one drawback is that this data only runs through 2013; however, I thought 13 years of data was enough to see any significant trends.

The original data contains 3,521 observations and 13 variables. The variables are described in the data dictionary below. See the ORIGINAL NFL Weather data below.

# Examine the structure of the dataset
datatable(head(nfl_weather, 10))

# Create a data dictionary for weather
var_names_w <- colnames(nfl_weather)
var_types_w <- lapply(nfl_weather, class) # lapply returns a list of the same length as X (a vector)
var_descriptions_w <- c("Full home team name", "City or state in which the home team originates", "Name or mascot of the home team", "Total points scored by the home team", "Full away team name", "City or state in which the away team originates", "Name or mascot of the away team", "Total points scored by the away team", "Winner of the game", "Temperature during the game (in Fahrenheit)", "Humidity percentage during the game", "Wind speed in miles per hour (mph) during the game", "Date of the game played")

data_dict_w <- as_tibble(cbind(var_names_w, var_types_w, var_descriptions_w))
colnames(data_dict_w) <- c("Variable Name", "Variable Data Type", "Variable Description")
kable(data_dict_w) # kable returns a single table for a single data object

Looking at the above dataset, the first step I took to clean the data was to remove the home_team and away_team columns, as I felt they were redundant.

names(nfl_weather)
nfl_weather <- nfl_weather[-c(1, 5)] # Remove redundant columns
names(nfl_weather)

To view the summary and structure of the CLEANED data:

NFL Weather Dataset

Data Dictionary for the NFL Weather Dataset

Variable Name	Variable Data Type	Variable Description
home_team_city	character	City or state in which the home team originates
home_team_name	character	Name or mascot of the home team
home_score	numeric	Total points scored by the home team
away_team_city	character	City or state in which the away team originates
away_team_name	character	Name or mascot of the away team
away_score	numeric	Total points scored by the away team
winning_team	character	Winner of the game
temperature	numeric	Temperature during the game (in Fahrenheit)
humidity	numeric	Humidity percentage during the game
wind_mph	numeric	Wind speed in miles per hour (mph) during the game
date	character	Date of the game played

Playoffs

The next dataset within my analysis is the nfl_playoffs dataset. This looks into the coaches and quarterbacks for each team that went to the playoffs from 2000 - 2020. I created this dataset myself through research.

The original data contains nine variables, which are described in the data dictionary below. See the ORIGINAL NFL Playoffs data below.

# Examine the structure of the dataset
datatable(head(nfl_playoffs, 10))

# Create a data dictionary for playoffs
var_names_playoffs <- colnames(nfl_playoffs)
var_types_playoffs <- lapply(nfl_playoffs, class) # lapply returns a list of the same length as X (a vector)
var_descriptions_playoffs <- c("City or state in which the team originates", "Name or mascot of the team", "Year", "Total wins for the team", "Total losses for the team", "Whether or not the team went to the Playoffs", "Whether or not the team won the Super Bowl", "Head coach of the team", "Starting quarterback during the postseason")

data_dict_playoffs <- as_tibble(cbind(var_names_playoffs, var_types_playoffs, var_descriptions_playoffs))
colnames(data_dict_playoffs) <- c("Variable Name", "Variable Data Type", "Variable Description")
kable(data_dict_playoffs) # kable returns a single table for a single data object

Moving forward, I decided to change sb_winner to a binary variable, because it only has two unique values. Because the only unique value in the playoffs column is “Playoffs”, I decided to drop that column.

unique(nfl_playoffs$playoffs, incomparables = FALSE) # View the unique values for the playoffs column
unique(nfl_playoffs$sb_winner, incomparables = FALSE) # View the unique values for the sb_winner column

names(nfl_playoffs)
nfl_playoffs <- nfl_playoffs[-6] # Remove unnecessary column
names(nfl_playoffs)

For the sb_winner column, a value of one denotes “Won Superbowl”, and a value of zero denotes “No Superbowl”.

nfl_playoffs$sb_winner[nfl_playoffs$sb_winner == "Won Superbowl"] <- "1"
nfl_playoffs$sb_winner[nfl_playoffs$sb_winner == "No Superbowl"] <- "0"

nfl_playoffs$sb_winner <- as.numeric(nfl_playoffs$sb_winner)

To view the summary and structure of the CLEANED data:

NFL Playoffs Dataset

Data Dictionary for the NFL Playoffs Dataset

Variable Name	Variable Data Type	Variable Description
team	character	City or state in which the team originates
team_name	character	Name or mascot of the team
year	numeric	Year
wins	numeric	Total wins for the team
loss	numeric	Total losses for the team
sb_winner	numeric	Whether or not the team won the Super Bowl
head_coach	character	Head coach of the team
qb	character	Starting quarterback during the postseason

Passing Yards Leaders

The nfl_passing dataset contains information regarding the league leader for passing yards from each year. Their respective team information is included. This data is from Pro Football Reference.

This dataset does not need to be cleaned or edited. The summary and structure of the data are shown below:

NFL Passing Dataset

Data Dictionary for the NFL Passing Dataset

Variable Name	Variable Data Type	Variable Description
year	numeric	Year
player	character	Name of the player with the most passing yards
yds	numeric	Total yards
team	character	Location of the team the player is on
team_name	character	Name or mascot of the team the player is on

Rushing Yards Leaders

The next dataset, nfl_rushing, contains information regarding the league leader for rushing yards from each year. Their respective team information is included. This data is also from Pro Football Reference.

Similar to the previous dataset, this dataset does not need to be cleaned or edited. The summary and structure of the data are shown below:

NFL Rushing Dataset

Data Dictionary for the NFL Rushing Dataset

Variable Name	Variable Data Type	Variable Description
year	numeric	Year
player	character	Name of the player with the most rushing yards
yds	numeric	Total yards
team	character	Location of the team the player is on
team_name	character	Name or mascot of the team the player is on

Penalties Per Game

The nfl_penalty dataset contains the average penalty yards per game per team from 2003 - 2020. The data is from TeamRankings.

This dataset did not need to be cleaned. The summary and structure of the data are shown below:

NFL Penalty Dataset

Data Dictionary for the NFL Penalty Dataset

Variable Name	Variable Data Type	Variable Description
team	character	City or state in which the team originates
team_name	character	Name or mascot of the team
2020	numeric	Average penalty yards per game from 2020
2019	numeric	Average penalty yards per game from 2019
2018	numeric	Average penalty yards per game from 2018
2017	numeric	Average penalty yards per game from 2017
2016	numeric	Average penalty yards per game from 2016
2015	numeric	Average penalty yards per game from 2015
2014	numeric	Average penalty yards per game from 2014
2013	numeric	Average penalty yards per game from 2013
2012	numeric	Average penalty yards per game from 2012
2011	numeric	Average penalty yards per game from 2011
2010	numeric	Average penalty yards per game from 2010
2009	numeric	Average penalty yards per game from 2009
2008	numeric	Average penalty yards per game from 2008
2007	numeric	Average penalty yards per game from 2007
2006	numeric	Average penalty yards per game from 2006
2005	numeric	Average penalty yards per game from 2005
2004	numeric	Average penalty yards per game from 2004
2003	numeric	Average penalty yards per game from 2003
total	numeric	Total penalty yards
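
Because this dataset stores one column per year, joining it to the other datasets by year would first require reshaping it to one row per team and year. The sketch below shows one way to do that with tidyr::pivot_longer() on a made-up two-team, two-year table. The object names and the penalty_yds_per_game column name are hypothetical, and this reshaping step is an illustration rather than part of the original cleaning.

```r
library(tidyr)
library(tibble)

# Made-up wide table in the same shape as nfl_penalty (one column per year)
penalty_wide <- tibble(
  team = c("Pittsburgh", "Dallas"),
  team_name = c("Steelers", "Cowboys"),
  `2020` = c(50, 55),
  `2019` = c(60, 45)
)

# Stack the year columns into year / value pairs
penalty_long <- pivot_longer(
  penalty_wide,
  cols = c(`2020`, `2019`),
  names_to = "year",
  values_to = "penalty_yds_per_game"
)

# Column names arrive as text, so convert year to a number
penalty_long$year <- as.integer(penalty_long$year)
penalty_long
```

The result has four rows (two teams by two years) with team, team_name, year, and penalty_yds_per_game columns.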

Total Attendance Breakdown

Column

Total Attendance Breakdown

As mentioned in the introduction, this data analysis examines whether the number of fans in attendance correlates with a team’s success (number of games won). Additionally, it compares how teams fare in home versus away games while keeping in mind the attendance at those games. Earlier in the data preparation, I split the attendance dataset into two separate datasets, nfl_total_attendance and nfl_weekly_attendance. To understand the importance of fan attendance, it is critical to first observe which teams have had the strongest fan base over the past 20 years.

Instead of using the teams’ total attendance numbers, I wanted to take an average of each team’s weekly attendance. I feel this gives a more accurate representation of attendance. With that said, I added a column to the nfl_weekly_attendance dataset to hold the mean.
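
A minimal sketch of adding such a mean column is shown below. It assumes the mean is taken per team and per year, and it uses a made-up four-row table. The column name mean_weekly_attendance is hypothetical.

```r
library(dplyr)
library(tibble)

# Made-up weekly attendance rows
weekly <- tibble(
  team_name = c("Steelers", "Steelers", "Cowboys", "Cowboys"),
  year = 2019,
  week = c(1, 2, 1, 2),
  weekly_attendance = c(60000, 64000, 90000, 92000)
)

# Add the team-season mean as a new column on every weekly row
weekly <- weekly %>%
  group_by(team_name, year) %>%
  mutate(mean_weekly_attendance = mean(weekly_attendance)) %>%
  ungroup()

weekly
```

Using mutate() keeps every weekly row and repeats the mean alongside it; summarise() would instead return one row per team and year.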

Using the ggplotly function, the graph becomes interactive. In the visualization to the right, you can hover over the lines to get a detailed description including the year, total attendance, and team. You can click on a team in the legend once to remove it from the visualization, or you can double-click on the team in the legend to isolate that line. This interaction enables you to filter to specific teams in order to see their attendance trends since 2000.

In the visualization to the right, it is evident that the Dallas Cowboys appear to have the strongest fan base, and the Los Angeles Chargers appear to have the weakest fan base. The top five teams with the current highest attendance records are:

Dallas Cowboys
Green Bay Packers
Los Angeles Rams
New York Giants
Philadelphia Eagles

It is also important to note that the spike in attendance for the Dallas Cowboys in 2009 can be attributed to the opening of their brand new AT&T Stadium. This stadium opened on May 27, 2009. The stadium holds 80,000 people in the stands but can be expanded to hold more than 100,000 individuals when standing room only areas are included.

Column

NFL Weekly Attendance

Division-Basis

Now, I wanted to break attendance down on a division-basis. In order to do this, I added a column to the dataset, called “division”.
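
One way to add such a column is to join against a small lookup table of team and division. The sketch below covers only four teams and made-up attendance values. The lookup table `divisions` is a hypothetical helper, not the code used to build the dashboard.

```r
library(dplyr)
library(tibble)

# Hypothetical lookup table: one row per team (only four teams shown)
divisions <- tribble(
  ~team_name, ~division,
  "Steelers", "AFC North",
  "Ravens",   "AFC North",
  "Cowboys",  "NFC East",
  "Packers",  "NFC North"
)

# Made-up attendance rows
weekly <- tibble(
  team_name = c("Steelers", "Cowboys", "Packers"),
  weekly_attendance = c(62000, 91000, 78000)
)

# Attach the division to each row by team name
weekly <- left_join(weekly, divisions, by = "team_name")
weekly
```

A lookup table keeps the 32 team-to-division assignments in one place, and any team missing from it shows up as NA in the division column rather than being silently mislabeled.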

Once the division column was created, the breakdown of the strongest and weakest fan bases per division can be seen in the table below. Individual graphs for both the AFC and NFC can be seen under the tabs AFC Attendance Breakdown and NFC Attendance Breakdown.

	Strongest Fan Base	Weakest Fan Base
AFC East	New York Jets	Miami Dolphins
AFC North	Baltimore Ravens	Cincinnati Bengals
AFC South	Houston Texans	Indianapolis Colts
AFC West	Kansas City Chiefs	Los Angeles Chargers
NFC East	Dallas Cowboys	Washington Redskins
NFC North	Green Bay Packers	Detroit Lions
NFC South	New Orleans Saints	Tampa Bay Buccaneers
NFC West	Los Angeles Rams	Arizona Cardinals

AFC Attendance Breakdown

Row

AFC East

AFC North

Row

AFC South

AFC West

NFC Attendance Breakdown

Row

NFC East

NFC North

Row

NFC South

NFC West

Impact on Wins

Column

What Impacts Total Wins?

Knowing the previously discussed attendance statistics, I want to see if a stronger home attendance impacts the total number of wins. A team cannot necessarily control their away attendance, as their most loyal fans are assumed to be unlikely attendees at an away game.

First, I wanted to discover if home attendance impacts total wins. To do so, I used the lm() function to fit a simple linear regression with total_wins as the response variable and total_home_attendance as the predictor variable. I also obtained the correlation coefficient between the two variables. To the right, in the Home Attendance tab, it appears that there is a slight, positive linear relationship between the predictor variable (X or total_home_attendance) and the response variable (Y or total_wins). The correlation coefficient between the two variables is 0.1507, and the relationship is statistically significant at the 1% level with a p-value of 0.000133. The relationship is weak, however: the R-squared of 0.0227 means home attendance explains only about 2% of the variation in total wins.

Next, I wanted to discover if away attendance impacts total wins. I followed the same process I did for home attendance, fitting a linear model with total_wins as the response variable and total_away_attendance as the predictor variable. From the visualization in the Away Attendance tab, it appears that there is also a very slight, positive linear relationship between the predictor variable (X or total_away_attendance) and the response variable (Y or total_wins). The correlation coefficient between the two variables is 0.1274, and the relationship is statistically significant at the 1% level with a p-value of 0.00126. Here the R-squared is 0.0162, so away attendance explains less than 2% of the variation in total wins.

# Attach the dataset
attach(joined_data)

# Create linear model for home attendance
home_attendance_model <- lm(total_wins ~ total_home_attendance)
summary(home_attendance_model)


Call:
lm(formula = total_wins ~ total_home_attendance)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.7799 -2.2250 -0.0726  2.2478  7.9490 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)           4.225e+00  9.851e-01   4.289 2.07e-05 ***
total_home_attendance 6.955e-06  1.809e-06   3.845 0.000133 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.051 on 636 degrees of freedom
Multiple R-squared:  0.02272,   Adjusted R-squared:  0.02118 
F-statistic: 14.78 on 1 and 636 DF,  p-value: 0.0001327

cor(total_wins, total_home_attendance)

[1] 0.1507174

# Attach the dataset
attach(joined_data)

# Create linear model for away attendance
away_attendance_model <- lm(total_wins ~ total_away_attendance)
summary(away_attendance_model)


Call:
lm(formula = total_wins ~ total_away_attendance)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.8922 -2.2928 -0.1177  2.2793  7.7255 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)   
(Intercept)           -3.310e-01  2.571e+00  -0.129  0.89758   
total_away_attendance  1.539e-05  4.751e-06   3.238  0.00126 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.061 on 636 degrees of freedom
Multiple R-squared:  0.01622,   Adjusted R-squared:  0.01468 
F-statistic: 10.49 on 1 and 636 DF,  p-value: 0.001265

cor(total_wins, total_away_attendance)

[1] 0.1273652
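The models above rely on attach(), which is called twice on the same data frame. As a hedged alternative sketch, the same kind of fit can be written with the data argument of lm(), and cor.test() returns the correlation together with its p-value. The data here are simulated only so the example runs on its own; the sample size and coefficients are arbitrary assumptions, not the joined_data values.

```r
set.seed(1)  # simulated data, only so the sketch is self-contained
n <- 200
sim <- data.frame(total_home_attendance = rnorm(n, mean = 540000, sd = 60000))
sim$total_wins <- 4 + 7e-06 * sim$total_home_attendance + rnorm(n, sd = 3)

# Passing data = avoids attach() and the masking messages from attaching twice.
fit <- lm(total_wins ~ total_home_attendance, data = sim)
slope_p <- summary(fit)$coefficients["total_home_attendance", "Pr(>|t|)"]

# cor.test() reports the Pearson correlation and its p-value in one call.
ct <- cor.test(sim$total_wins, sim$total_home_attendance)

# In simple linear regression the slope test and the correlation test coincide,
# and R-squared is the squared correlation.
all.equal(slope_p, ct$p.value)
all.equal(unname(ct$estimate)^2, summary(fit)$r.squared)
```

This is why the reported p-values match between the regression output and the correlation, and why a correlation of 0.1507 corresponds to an R-squared of about 0.0227.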

Column

Home Attendance

Away Attendance

Super Bowl Champions

Row

Standings Over the Years

This part of the analysis will look into the qualities of the division winners and the attributes that teams high in the standings have over teams in the lower portion of the standings. Furthermore, this approach will discover what separates Super Bowl Champions from the 31 other teams each season.

First, I wanted to see which division has brought home the most Super Bowl Championships over the past 20 years. I once again added a “division” column to nfl_standings. As evident in the visualization below, the AFC East has had the most Super Bowl wins across the 2000-2020 seasons. This can be largely attributed to the New England Patriots’ former quarterback Tom Brady and current head coach Bill Belichick, who won championships in the 2001, 2003, 2004, 2014, 2016, and 2018 seasons (the games themselves were played in February 2002, 2004, 2005, 2015, 2017, and 2019). Additionally, the second-best division appears to be the AFC North, with the Pittsburgh Steelers and Baltimore Ravens each winning two Super Bowl Championships. Conversely, the AFC South, NFC North, and NFC West have each won only one Super Bowl over the past two decades.

Analyzing NFL standings with the given datasets is a bit tricky because standings are calculated using tie-breakers when necessary. Additionally, which teams make the playoffs is largely based on division success. As a result, the team with the most wins might not be the team with the best standing. For this analysis, I decided to break the teams down by division and see which ones have been dominant over the years.

I analyzed their success by using summary statistics showing the Average Total Wins, Average Total Losses, Average Points Per Game, and Average Opponent Points Per Game. The results can be seen in the tabs AFC Summaries and NFC Summaries. I also developed box plots for the average total wins per season by division to analyze the range of data for each team and any relevant outliers. These box plots can be seen in the tabs AFC Box Plots | Total Wins and NFC Box Plots | Total Wins. I also grouped the box plots by conference (AFC vs. NFC).
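The per-team averages described above can be sketched in base R as follows. This is an illustration under assumptions: the column names mirror those used elsewhere in this analysis, the season rows are made-up values, and per-game figures are assumed to be season totals divided by 16 games.

```r
# Made-up season rows, for illustration only.
standings <- data.frame(
  team_name      = c("Patriots", "Patriots", "Browns", "Browns"),
  year           = c(2018, 2019, 2018, 2019),
  total_wins     = c(11, 12, 7, 6),
  total_losses   = c(5, 4, 8, 10),
  points_for     = c(436, 420, 359, 335),
  points_against = c(325, 225, 392, 393)
)

games_per_season <- 16  # assumption: 16-game regular seasons
standings$ppg     <- standings$points_for / games_per_season
standings$opp_ppg <- standings$points_against / games_per_season

# Average each statistic across seasons, one row per team.
team_summary <- aggregate(
  cbind(total_wins, total_losses, ppg, opp_ppg) ~ team_name,
  data = standings,
  FUN = mean
)
team_summary
```

Average wins and losses need not sum to 16 for every team, since tied games count toward neither column.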

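The most dominant teams per division, defined by the highest average total wins (as shown in the AFC Summaries and NFC Summaries tabs), are as follows:

AFC East: New England Patriots
AFC North: Pittsburgh Steelers
AFC South: Indianapolis Colts
AFC West: Denver Broncos
NFC East: Philadelphia Eagles
NFC North: Green Bay Packers
NFC South: New Orleans Saints
NFC West: Seattle Seahawks

I also developed a table of the Super Bowl winners for the 2000 through 2020 seasons with their offensive and defensive rankings. This table can be found in the Rankings of Super Bowl Champions tab.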

Row

Super Bowl Champions Per Division

Rankings of Super Bowl Champions

Year	Super Bowl Champion	Offensive Ranking	Defensive Ranking
2000	Ravens	0.0	8.0
2001	Patriots	1.2	3.1
2002	Buccaneers	-1.0	9.8
2003	Patriots	2.1	4.9
2004	Patriots	6.4	6.5
2005	Steelers	3.8	4.0
2006	Colts	6.9	-1.1
2007	Giants	2.8	0.4
2008	Steelers	1.6	8.2
2009	Saints	11.2	-0.5
2010	Packers	3.1	7.9
2011	Giants	3.1	-1.5
2012	Ravens	1.9	1.0
2013	Seahawks	4.1	8.9
2014	Patriots	7.5	3.5
2015	Broncos	0.3	5.5
2016	Patriots	4.3	5.0
2017	Eagles	7.0	2.5
2018	Patriots	3.1	2.1
2019	Chiefs	6.2	2.9
2020	Buccaneers	6.5	2.8

AFC Summaries

Row

AFC East Summary

Team Name	Average Total Wins	Average Total Losses	Average Points Per Game	Average Opponent Points Per Game
Bills	7.142857	8.857143	20.41369	22.33631
Dolphins	7.571429	8.428571	20.11607	21.64881
Jets	7.142857	8.857143	19.61607	21.72917
Patriots	11.619048	4.380952	26.93155	18.74405

AFC North Summary

Team Name	Average Total Wins	Average Total Losses	Average Points Per Game	Average Opponent Points Per Game
Bengals	7.095238	8.714286	20.64583	22.41071
Browns	5.238095	10.714286	17.69345	23.16071
Ravens	9.571429	6.428571	22.75298	18.31250
Steelers	10.333333	5.571429	23.20536	18.50893

Row

AFC South Summary

Team Name	Average Total Wins	Average Total Losses	Average Points Per Game	Average Opponent Points Per Game
Colts	9.904762	6.095238	25.01488	22.25893
Jaguars	6.095238	9.904762	19.55357	22.34226
Texans	7.105263	8.894737	21.02632	23.01316
Titans	8.142857	7.857143	21.80060	22.60714

AFC West Summary

Team Name	Average Total Wins	Average Total Losses	Average Points Per Game	Average Opponent Points Per Game
Broncos	8.904762	7.095238	23.33929	21.99405
Chargers	8.047619	7.952381	24.05655	21.98214
Chiefs	8.571429	7.428571	23.58333	22.04167
Raiders	6.333333	9.666667	20.43452	24.71429

AFC Box Plots | Total Wins

Row

AFC East Box Plot

AFC North Box Plot

Row

AFC South Box Plot

AFC West Box Plot

NFC Summaries

Row

NFC East Summary

Team Name	Average Total Wins	Average Total Losses	Average Points Per Game	Average Opponent Points Per Game
Cowboys	8.285714	7.714286	22.41071	21.98810
Eagles	9.238095	6.666667	24.03571	20.80357
Giants	7.809524	8.190476	21.79762	22.42857
Redskins	6.619048	9.333333	19.55655	22.33929

NFC North Summary

Team Name	Average Total Wins	Average Total Losses	Average Points Per Game	Average Opponent Points Per Game
Bears	7.857143	8.142857	20.38393	20.93750
Lions	5.666667	10.285714	20.72619	24.98810
Packers	10.000000	5.904762	25.55357	21.33929
Vikings	8.190476	7.714286	22.87798	22.39881

Row

NFC South Summary

Team Name	Average Total Wins	Average Total Losses	Average Points Per Game	Average Opponent Points Per Game
Buccaneers	7.095238	8.904762	21.03869	21.94643
Falcons	8.000000	7.952381	22.71429	22.90179
Panthers	7.714286	8.238095	21.19048	21.78274
Saints	9.285714	6.714286	26.13988	23.18452

NFC West Summary

Team Name	Average Total Wins	Average Total Losses	Average Points Per Game	Average Opponent Points Per Game
49ers	7.333333	8.619048	21.13393	22.48512
Cardinals	6.904762	9.000000	20.37202	23.61607
Rams	7.333333	8.619048	21.58631	23.45238
Seahawks	9.238095	6.714286	23.19643	20.69048

NFC Box Plots | Total Wins

Row

NFC East Box Plot

NFC North Box Plot

Row

NFC South Box Plot

NFC West Box Plot

Division Leaders

Division Leaders Breakdown

Combining the tables from the previous tabs to form one table with average statistics, the following leaders can be found:

Highest Average Total Wins: New England Patriots
Highest Average Total Losses: Cleveland Browns
Highest Average Points Per Game: New England Patriots
Highest Average Opponent Points Per Game: Detroit Lions

For a more in-depth look at each team, please refer to the table below.

NFL Standings of all Teams

Team Name	Average Total Wins	Average Total Losses	Average Points Per Game	Average Opponent Points Per Game
49ers	7.333333	8.619048	21.13393	22.48512
Bears	7.857143	8.142857	20.38393	20.93750
Bengals	7.095238	8.714286	20.64583	22.41071
Bills	7.142857	8.857143	20.41369	22.33631
Broncos	8.904762	7.095238	23.33929	21.99405
Browns	5.238095	10.714286	17.69345	23.16071
Buccaneers	7.095238	8.904762	21.03869	21.94643
Cardinals	6.904762	9.000000	20.37202	23.61607
Chargers	8.047619	7.952381	24.05655	21.98214
Chiefs	8.571429	7.428571	23.58333	22.04167
Colts	9.904762	6.095238	25.01488	22.25893
Cowboys	8.285714	7.714286	22.41071	21.98810
Dolphins	7.571429	8.428571	20.11607	21.64881
Eagles	9.238095	6.666667	24.03571	20.80357
Falcons	8.000000	7.952381	22.71429	22.90179
Giants	7.809524	8.190476	21.79762	22.42857
Jaguars	6.095238	9.904762	19.55357	22.34226
Jets	7.142857	8.857143	19.61607	21.72917
Lions	5.666667	10.285714	20.72619	24.98810
Packers	10.000000	5.904762	25.55357	21.33929
Panthers	7.714286	8.238095	21.19048	21.78274
Patriots	11.619048	4.380952	26.93155	18.74405
Raiders	6.333333	9.666667	20.43452	24.71429
Rams	7.333333	8.619048	21.58631	23.45238
Ravens	9.571429	6.428571	22.75298	18.31250
Redskins	6.619048	9.333333	19.55655	22.33929
Saints	9.285714	6.714286	26.13988	23.18452
Seahawks	9.238095	6.714286	23.19643	20.69048
Steelers	10.333333	5.571429	23.20536	18.50893
Texans	7.105263	8.894737	21.02632	23.01316
Titans	8.142857	7.857143	21.80060	22.60714
Vikings	8.190476	7.714286	22.87798	22.39881

Offense vs. Defense

Column

Offense vs. Defense

In the next step of this analysis, I investigated the importance of an offense and a defense. The offense in football is the 11 players who are on the field for a team when they have the ball. Conversely, the defense is the 11 players on the field when the other team has the ball. Sports writers and analysts have argued over the years whether a better offense or defense is more critical to a team’s success. Utilizing the nfl_standings dataset, I sought to analyze this discussion.

First, I created a linear model showcasing a team’s wins in a season using offensive_ranking and defensive_ranking as the predictor variables.

# Attach the dataset
attach(nfl_standings)

# Create a linear model
rankings_model <- lm(total_wins ~ offensive_ranking + defensive_ranking)
summary(rankings_model)


Call:
lm(formula = total_wins ~ offensive_ranking + defensive_ranking)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.2881 -1.0127 -0.0189  1.0536  5.0782 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)        7.98430    0.05741  139.06   <2e-16 ***
offensive_ranking  0.44409    0.01363   32.58   <2e-16 ***
defensive_ranking  0.43527    0.01657   26.26   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.486 on 667 degrees of freedom
Multiple R-squared:  0.7711,    Adjusted R-squared:  0.7705 
F-statistic:  1124 on 2 and 667 DF,  p-value: < 2.2e-16

The model shows that both offensive_ranking and defensive_ranking are significant predictors of a team’s total wins at the 1% significance level. To drill deeper, I computed the correlation coefficient between each predictor variable and total wins.

# Run correlation tests
cor(offensive_ranking, total_wins)

[1] 0.7310874

cor(defensive_ranking, total_wins)

[1] 0.6379049

The offensive_ranking had a correlation coefficient of 0.7311 and the defensive_ranking had a correlation coefficient of 0.6379. As such, it appears that a team’s offense has a greater correlation with a team’s wins than its defense. To visualize this, I plotted two graphs to further test this hypothesis. These can be seen in the tabs to the right.

These graphs confirm the positive correlation between an increasing offensive or defensive ranking and a team’s wins. Additionally, the confidence band for the defensive ranking is wider than the band for the offensive ranking. This agrees with my conclusion that the offensive ranking’s correlation with wins is stronger than the defensive ranking’s.

Column

Offensive Ranking

Defensive Ranking

Points For vs. Points Against

Column

Points For vs. Points Against

Using different statistics now, I changed the predictor variables to points_for and points_against, as these represent offensive and defensive success, respectively. Then, I used the binary playoffs variable to see how points scored and points allowed relate to whether a team made the playoffs. Otherwise, I took the same approach as with the previous variables.

This model shows that points_for and points_against are both significant predictors of a team’s total wins as well. Additionally, their correlations with total wins are 0.7276 and -0.6667. This indicates a strong, positive relationship for points_for and a strong, negative relationship for points_against. The offensive side, once again, has a slightly stronger relationship.

The graphical representations show these strong linear relationships as well with playoff teams typically having a low Total Points Against and high Total Points For on the season.

# Create linear model
points_model <- lm(total_wins ~ points_for + points_against)
summary(points_model)


Call:
lm(formula = total_wins ~ points_for + points_against)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.3942 -0.8523 -0.0140  0.8298  3.8933 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     8.5301159  0.4048277   21.07   <2e-16 ***
points_for      0.0274469  0.0006803   40.34   <2e-16 ***
points_against -0.0289973  0.0008114  -35.74   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.248 on 667 degrees of freedom
Multiple R-squared:  0.8385,    Adjusted R-squared:  0.838 
F-statistic:  1732 on 2 and 667 DF,  p-value: < 2.2e-16

# Run correlation tests
cor(points_for, total_wins)

[1] 0.7275527

cor(points_against, total_wins)

[1] -0.6666812

Column

Points For

Points Against

Individual Game Observations

Column

Individual Game Observations

The last analysis takes a look at data from the individual NFL games. Using the nfl_games dataset, I investigated the different variables.

Now, to analyze the correlation between different variables, I used the GGally package to produce a detailed scatter plot matrix. The function ggpairs() produced histograms along the diagonal of the matrix. Pearson correlation coefficients are shown in the upper-right. Scatter plots are shown in the lower-left. I analyzed six variables here - (1) Points Scored by Winning Team (pts_win); (2) Yards Gained by Winning Team (yds_win); (3) Turnovers Committed by Winning Team (turnovers_win); (4) Points Scored by Losing Team (pts_loss); (5) Yards Gained by Losing Team (yds_loss); and (6) Turnovers Committed by Losing Team (turnovers_loss).
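The correlations that appear in the upper-right of such a matrix can also be obtained directly with base R's cor(). The sketch below uses simulated values, not the nfl_games data; only the column names are borrowed from the variables listed above, and the simulated relationship between yards and points is an assumption.

```r
set.seed(42)  # simulated data, for illustration only
games <- data.frame(
  yds_win       = rnorm(300, mean = 360, sd = 80),
  turnovers_win = rpois(300, lambda = 1)
)
# Points loosely tied to yards gained, plus noise (assumed relationship).
games$pts_win <- 10 + 0.05 * games$yds_win + rnorm(300, sd = 6)

# Pairwise Pearson correlations between the three winning-team variables.
cor_matrix <- cor(games[, c("pts_win", "yds_win", "turnovers_win")],
                  method = "pearson")
round(cor_matrix, 3)
```

Each off-diagonal entry is the same statistic the scatter plot matrix prints for the corresponding pair of variables.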

I then grouped these variables by winning team vs. losing team. This correlation matrix can be seen in the first tab to the right. As evident through both the scatter plots and the Pearson correlation coefficients, there is little to no relationship between Points Scored by Winning Team vs. Turnovers Committed by Winning Team as well as Yards Gained by Winning Team vs. Turnovers Committed by Winning Team. Both of these correlation coefficients are close to zero. On the other hand, there is a moderate, positive relationship between Points Scored by Winning Team vs. Yards Gained by Winning Team, with a Pearson correlation of 0.537.

Looking at the variables by losing team in the second tab to the right, the picture is very similar to the winning teams: there is little to no relationship between Points Scored by Losing Team vs. Turnovers Committed by Losing Team as well as Yards Gained by Losing Team vs. Turnovers Committed by Losing Team. Both of these correlation coefficients are close to zero. On the other hand, there is a moderately strong, positive relationship between Points Scored by Losing Team vs. Yards Gained by Losing Team, with a Pearson correlation of 0.632.

The main takeaway from these correlation matrices is that the more yards a team gains, the more points it tends to score. To compare a winning team and a losing team, I wanted to see if more turnovers by the losing team were associated with more points for the winning team. Please see the third tab to the right for the linear model with pts_win as the response variable and turnovers_loss as the predictor variable. In this graphic, there is a slight, positive relationship between the Turnovers Committed by Losing Team and Points Scored by Winning Team. The correlation coefficient between the two variables is 0.176.

Column

Variables by Winning Team

Variables by Losing Team

Turnovers Committed by Losing Team vs. Points Scored by Winning Team

Average Weather Conditions

Column

Understanding Weather Conditions

Looking at the nfl_weather dataset, I wanted to see which teams performed well under certain weather conditions. To do this, I first wanted to observe the average temperature, humidity, and wind speed at each home location. In R, I utilized the dplyr package to tidy my data and create new columns with mutate. To visualize the average temperature, humidity, and wind speed at each location, I created bar graphs for each variable per city.

From the visualizations to the right, it appears that the following five cities have the highest average temperatures:

Miami, Florida – 76.70°F
Detroit, Michigan – 71.64°F
Tampa Bay, Florida – 71.51°F
New Orleans, Louisiana – 71.03°F
Houston, Texas – 71.03°F

The following five cities have the highest humidity percentage:

Seattle, Washington – 79%
San Francisco, California – 71%
Oakland, California – 71%
Green Bay, Wisconsin – 71%
Miami, Florida – 70%

Lastly, the following five cities have the highest winds (in mph):

New England, Massachusetts – 11.54 mph
New York, New York – 10.57 mph
Dallas, Texas – 10.27 mph
Denver, Colorado – 9.96 mph
Buffalo, New York – 9.95 mph

Column

Average Temperature Per City

Average Humidity Per City

Average Wind Per City

Can Weather Predict Game Outcomes?

Column

Can Weather Predict Game Outcomes?

Next, I wanted to see if high wind speeds were correlated with low-scoring games. To do this, I first added the home_score and away_score variables together to create a total_score column. With this column, I computed correlation coefficients between total_score and temperature, total_score and humidity, and total_score and wind_mph.

I also ran two linear models to see if weather conditions could predict the combined score of a game. I trained and tested the models using a 70-30 training-testing split.
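A minimal sketch of this step, using a handful of made-up games rather than the nfl_weather data:

```r
# Made-up games, for illustration only; column names follow the text above.
weather <- data.frame(
  home_score = c(24, 10, 31, 17, 20, 13),
  away_score = c(21, 7, 28, 14, 17, 3),
  wind_mph   = c(3, 18, 2, 12, 8, 20)
)

# Combined score of both teams for each game.
weather$total_score <- weather$home_score + weather$away_score

# Pearson correlation between combined score and wind speed.
cor(weather$total_score, weather$wind_mph)
```

A negative value indicates that, in these toy rows, windier games have lower combined scores; the same call with temperature or humidity gives the other two coefficients.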

It appears that the correlation coefficient between total_score and temperature is 0.0164. Knowing that a correlation coefficient value of plus or minus one is said to be a perfect correlation, I know that there is little to no correlation between total_score and temperature.

The correlation coefficient between total_score and humidity is -0.1207. Here, I can see that there is a slight, negative correlation between the two variables. Similar to this is the relationship between total_score and wind_mph. The correlation coefficient is -0.1328, indicating a slight, negative correlation.

Of the three relationships tested, it appears that the wind speed correlates most to a lower total score. The higher the wind, the lower the combined score of the game.

Looking further, I decided to create a linear model to see if temperature, humidity, and wind can predict the total combined score of the game. The first model I created included all three predictor variables; however, it did not perform well. The adjusted R-squared of this model is a meager 0.02 or so, indicating that the model explains only about 2% of the variance in total_score. The humidity and wind_mph variables were statistically significant at the 5% level, as they had p-values less than 0.05. The out-of-sample Mean Squared Error (MSE) of this model is higher than 170 (190.37 in the run shown), which is high given that one would ideally want an MSE close to zero. It is important to note that these results will vary given the random training-testing split.

Using the above data and knowing that wind_mph was the most correlated with total_score, I decided to create a second model with wind_mph as the sole predictor variable. This model performed slightly worse, with an adjusted R-squared below 0.02 (0.0191 in the run shown), indicating that the model explains less than 2% of the variance. The wind_mph variable is still statistically significant at the 5% level, as its p-value is less than 0.05. Additionally, the MSE of this model is still higher than 170. It is important to note that these results will also vary given the random training-testing split.

In conclusion, weather does not accurately predict whether a game will be high-scoring or low-scoring. I originally thought it would be more difficult for the players to score given higher wind speeds. Although wind speed does have a statistically significant negative relationship with total score, it explains so little of the variation that it is not a useful predictor on its own.
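Because the split is random, the reported numbers change from run to run. The sketch below shows one way to make a 70-30 split reproducible with set.seed() and to put the MSE in context by comparing it with a baseline that always predicts the training mean. The data are simulated and the seed, sample size, and coefficients are arbitrary assumptions.

```r
set.seed(2021)  # arbitrary seed so the split is reproducible
n <- 1000
sim <- data.frame(wind_mph = runif(n, min = 0, max = 25))
sim$total_score <- 46 - 0.35 * sim$wind_mph + rnorm(n, sd = 14)

# 70-30 training-testing split
train_index <- sample(nrow(sim), size = floor(0.70 * nrow(sim)))
train <- sim[train_index, ]
test  <- sim[-train_index, ]

wind_model <- lm(total_score ~ wind_mph, data = train)

# Out-of-sample MSE of the model
model_mse <- mean((predict(wind_model, newdata = test) - test$total_score)^2)

# Baseline: always predict the mean total score of the training set
baseline_mse <- mean((mean(train$total_score) - test$total_score)^2)

c(model = model_mse, baseline = baseline_mse)
```

MSE is measured in squared points, so its size depends on how much scores vary in the first place. A model MSE that is barely below the baseline tells the same story as a low R-squared: the predictor adds little.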

I thought about analyzing how teams fared in games located on opposite sides of the country (e.g., New England Patriots at Los Angeles Rams); however, I decided against that analysis. I decided against this because I thought there would be other contributing factors to a loss, such as home-field advantage.

Column

Linear Model with all Weather Variables

attach(nfl_weather)
cor(total_score, temperature); cor(total_score, humidity); cor(total_score, wind_mph)

# Split the data into training and testing
sample_index <- sample(nrow(nfl_weather), nrow(nfl_weather)*0.70)
weather_train <- nfl_weather[sample_index,]
weather_test <- nfl_weather[-sample_index,]

# Create the linear model
weather_model <- lm(total_score ~ temperature + humidity + wind_mph, data = weather_train)
model_summary <- summary(weather_model)
model_summary


Call:
lm(formula = total_score ~ temperature + humidity + wind_mph, 
    data = weather_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-37.371 -10.303  -1.084   8.629  65.488 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 47.48511    1.36569  34.770  < 2e-16 ***
temperature -0.01548    0.01849  -0.837  0.40268    
humidity    -3.12415    1.16805  -2.675  0.00753 ** 
wind_mph    -0.28202    0.06450  -4.372 1.28e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.45 on 2460 degrees of freedom
Multiple R-squared:  0.02244,   Adjusted R-squared:  0.02125 
F-statistic: 18.82 on 3 and 2460 DF,  p-value: 4.55e-12

# Out-of-sample performance
# (named pred rather than pi to avoid masking R's built-in constant pi)
pred <- predict(object = weather_model, newdata = weather_test)
mean((pred - weather_test$total_score)^2) # MSE

[1] 190.3678

Linear Model with Wind Variable

# Drop all variables except wind_mph
weather_model_2 <- lm(total_score ~ wind_mph, data = weather_train)
model_summary_2 <- summary(weather_model_2)
model_summary_2


Call:
lm(formula = total_score ~ wind_mph, data = weather_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-36.593 -10.377  -0.936   8.773  65.526 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 45.59288    0.47042  96.920  < 2e-16 ***
wind_mph    -0.36566    0.05221  -7.004  3.2e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.46 on 2462 degrees of freedom
Multiple R-squared:  0.01953,   Adjusted R-squared:  0.01914 
F-statistic: 49.05 on 1 and 2462 DF,  p-value: 3.205e-12

# Out-of-sample performance
pi_2 <- predict(object = weather_model_2, newdata = weather_test)
mean((pi_2 - weather_test$total_score)^2) # MSE

[1] 191.7339

Teams

Column

Successful Postseason Teams

For the next part of my analysis, I decided to look at the best head coaches and quarterbacks from the past 20 years. I created this nfl_playoffs dataset myself by taking every team from the past 20 years that made the playoffs and then listing their head coach and starting postseason quarterback.
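Given a dataset with one row per playoff team per season, the number of appearances per team can be counted as in the sketch below. The rows are made up and the column names are assumptions about how such a dataset might be laid out.

```r
# Made-up rows: one row per playoff team per season.
playoffs <- data.frame(
  year      = c(2018, 2018, 2019, 2019, 2020, 2020),
  team_name = c("Patriots", "Chiefs", "Patriots", "Chiefs", "Chiefs", "Packers")
)

# Count appearances per team.
counts <- as.data.frame(
  table(team_name = playoffs$team_name),
  responseName = "playoffs",
  stringsAsFactors = FALSE
)

# Sort by appearances (descending), breaking ties alphabetically,
# then keep the top two teams.
counts <- counts[order(-counts$playoffs, counts$team_name), ]
head(counts, 2)
```

Sorting before taking the first rows gives exactly the requested number of teams even when several teams are tied at the cutoff.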

The goal of this portion of the analysis is to see if the head coach or quarterback really does make a difference in team success. For example, was it a coincidence that the Tampa Bay Buccaneers won the most recent Super Bowl in their first year with Tom Brady as quarterback?

Over the past 20 years, it appears that the top three teams are as follows:

New England Patriots
Indianapolis Colts
Green Bay Packers

The New England Patriots have been a dominant force over the past two decades. I would argue that everyone who is not a New England Patriots fan strongly roots against them just because they have won so frequently. In the past few years, fans would claim the success comes from head coach Bill Belichick and former players Tom Brady and Rob Gronkowski.

The Indianapolis Colts’ success over the past 20 years can be attributed to their former quarterback Peyton Manning. Manning joined the Colts in 1998 and led the team to its first championship in 36 seasons at Super Bowl XLI.

The Green Bay Packers have also been quite the team over the past 20 years. Their former general manager, Ted Thompson (2005 - 2017), was a key figure in their success. The Packers have also had incredible leaders among their coaching staff and players. Notably, the Packers also have a notoriously strong fan base.

Column

Successful Postseason Teams

top_10_teams_playoffs <- nfl_teams_playoffs %>% top_n(10, playoffs) %>%
  arrange(desc(playoffs))
top_10_teams_playoffs <- top_10_teams_playoffs[1:10,]
kable(top_10_teams_playoffs)

team_name	playoffs
Patriots	17
Colts	15
Packers	15
Seahawks	14
Eagles	13
Ravens	13
Steelers	13
Chiefs	10
Saints	10
Broncos	9

Top Ten Teams by Playoff Appearances

Head Coaches

Column

Successful Postseason Head Coaches

Now knowing the top-performing teams in the NFL, I want to see which coaches have led these teams to success.

Looking at the visualization Top Ten Coaches by Playoff Appearances to the right, it is evident that the top two coaches in the NFL over the past 20 years are:

Bill Belichick
Andy Reid

Bill Belichick, head coach of the New England Patriots, has been with the team since 2000. As head coach, he has six Super Bowl championships (XXXVI, XXXVIII, XXXIX, XLIX, LI, and LIII). He has won AP NFL Coach of the Year in 2003, 2007, and 2010. He has also won 31 playoff games. I cannot say I was surprised to see him listed as number one in this analysis. His career record as a coach is 311-148 (0.678).

Andy Reid is the current head coach for the Kansas City Chiefs. He has been with the team since 2013. Prior to that, Reid was the head coach for the Philadelphia Eagles (1999 - 2012). He has won two Super Bowl championships (XXXI and LIV) - one as an assistant coach and one as a head coach. His career record as a coach is 238-145-1 (0.621).

Next, I wanted to look at the head coaches with the most Super Bowl championships won. Is this consistent with the top coaches who go to postseason play? Given Marvin Lewis, former head coach of the Cincinnati Bengals, is number ten in the graphic Top Ten Coaches by Playoff Appearances, I cannot be quite sure.

In the visualization Top Ten Coaches by Super Bowl Championships to the right, it is evident that Bill Belichick is still the dominant head coach from the past 20 years, with six Super Bowl championships.

The head coach with the next highest number of Super Bowl championships as head coach is Tom Coughlin. Tom Coughlin was the head coach of the Jacksonville Jaguars from 1995 - 2002, and he was the head coach of the New York Giants from 2004 - 2015. His two Super Bowl championships were as the head coach of the New York Giants (XLII and XLVI). His career record as a coach in the NFL was 182-157 (0.537).

Column

List of Top Coaches by Playoff Appearances

top_10_coaches_playoffs <- nfl_coaches_playoffs %>% top_n(10, playoffs) %>% arrange(desc(playoffs))
top_10_coaches_playoffs <- top_10_coaches_playoffs[1:10,]
kable(top_10_coaches_playoffs)

head_coach	playoffs
Bill Belichick	17
Andy Reid	16
John Harbaugh	9
Mike McCarthy	9
Mike Tomlin	9
Pete Carroll	9
Sean Payton	9
Tony Dungy	9
John Fox	7
Marvin Lewis	7

Top Ten Coaches by Playoff Appearances

List of Top Coaches by Super Bowl Championships

top_10_coaches <- nfl_coaches_sb %>% top_n(10, total_sb_coach) %>% arrange(desc(total_sb_coach))
top_10_coaches <- top_10_coaches[1:10,]
kable(top_10_coaches)

head_coach	total_sb_coach
Bill Belichick	6
Tom Coughlin	2
Brian Billick	1
Jon Gruden	1
Andy Reid	1
Tony Dungy	1
Bill Cowher	1
Sean Payton	1
Mike Tomlin	1
Mike McCarthy	1

Top Ten Coaches by Super Bowl Championships

Quarterbacks

Column

Successful Postseason Quarterbacks

An individual can be a great coach; however, they also need a great team. Quarterbacks are often described as the leaders of the NFL. With that said, I wanted to take a look at the best quarterbacks from the past 20 years.

From the visualization Top Ten Quarterbacks by Playoff Appearances to the right, it is evident that the top three quarterbacks in the NFL over the past 20 years are:

Tom Brady
Ben Roethlisberger
Drew Brees

It is no surprise that Tom Brady is number one, as he is often referred to as the Greatest Of All Time (GOAT). Tom Brady was drafted by the New England Patriots in the sixth round of the 2000 NFL Draft. Since then, he has become a seven-time Super Bowl Champion (six with the New England Patriots and one with the Tampa Bay Buccaneers). He is still active in the NFL as the quarterback for the Tampa Bay Buccaneers. As of 2020, his completion percentage is 64% and his accolades are many. I am interested to see how much longer he will excel in the league.

Ben Roethlisberger, the long-time quarterback for the Pittsburgh Steelers, was drafted in the first round of the 2004 NFL Draft. He has won two Super Bowl championships, and his completion percentage is 64.4%. He just signed with the Steelers for another year, so (as a Steelers fan) I am hoping he will lead the team to a third championship this upcoming season.

Drew Brees started his career with the San Diego Chargers (2001 - 2005), but he is most known for his career as the quarterback for the New Orleans Saints (2006 - 2020). Brees has won one Super Bowl championship, and he just announced his retirement this year. His completion percentage was 67.7%.

These top three quarterbacks are some of the most famous players in the NFL. Their career statistics speak for themselves.

Next, I wanted to see which quarterbacks have won the most Super Bowl Championships over the past 20 years.

Once again, looking at the Top Ten Quarterbacks by Super Bowl Championships to the right, Tom Brady is the most dominant quarterback in the NFL, with seven Super Bowl Championships. Ben Roethlisberger is tied for second with two. The other two quarterbacks with two championships since 2000, whom I have not yet discussed, are Eli Manning and Peyton Manning.

Eli Manning was the quarterback for the New York Giants from 2004 - 2019. He won two Super Bowl Championships (XLII and XLVI) and was the Super Bowl MVP for both games.

Peyton Manning was the quarterback for the Indianapolis Colts from 1998 - 2011 and the quarterback for the Denver Broncos from 2012 - 2015. He also won two Super Bowl Championships (XLI and 50) and was the Super Bowl MVP for Super Bowl XLI. He won one Super Bowl with the Colts in the 2006 season and one Super Bowl with the Broncos in the 2015 season.

Column

List of Top Quarterbacks by Playoff Appearances

# top_n() keeps ties, so sort by playoff appearances and then trim to ten rows
top_10_qb_playoffs <- nfl_qb_playoffs %>% top_n(10, playoffs) %>% arrange(desc(playoffs))
top_10_qb_playoffs <- head(top_10_qb_playoffs, 10)
kable(top_10_qb_playoffs)

qb	playoffs
Tom Brady	17
Ben Roethlisberger	9
Drew Brees	9
Aaron Rodgers	8
Russell Wilson	8
Donovan McNabb	7
Peyton Manning	7
Joe Flacco	6
Matt Hasselbeck	5
Eli Manning	5

Top Ten Quarterbacks by Playoff Appearances

List of Top Quarterbacks by Super Bowl Championships

# top_n() keeps ties, so sort by Super Bowl wins and then trim to ten rows
top_10_qb <- nfl_qb_sb %>% top_n(10, total_sb_qb) %>% arrange(desc(total_sb_qb))
top_10_qb <- head(top_10_qb, 10)
kable(top_10_qb)

qb	total_sb_qb
Tom Brady	7
Peyton Manning	2
Ben Roethlisberger	2
Eli Manning	2
Trent Dilfer	1
Brad Johnson	1
Drew Brees	1
Joe Flacco	1
Aaron Rodgers	1
Russell Wilson	1

Top Ten Quarterbacks by Super Bowl Championships

Passing Yards Leaders

Column

Passing Yards Leaders

For the next portion of my analysis, I wanted to analyze the top passing yards leaders. Given these are always quarterbacks, I wanted to see if this was consistent with my previous analysis of postseason quarterback success.

From the visualizations to the right, Drew Brees led the league in passing yards in seven of the 21 seasons from 2000 to 2020, including 2014, when he tied with Ben Roethlisberger. However, he did not have the most playoff appearances. Interestingly, Tom Brady appears on the list only three times, yet he has been by far the most successful quarterback in the postseason.
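The number of seasons each quarterback led the league can be checked directly against the Passing Yards Summary table. The following is a minimal sketch, assuming the table's year and player columns are transcribed into a data frame; the name `passing_leaders` and the `seasons_led` column are hypothetical and not part of the original analysis.

```r
library(dplyr)

# Season leaders transcribed from the Passing Yards Summary table (2000 - 2020).
# 2014 appears twice because Brees and Roethlisberger tied at 4,952 yards.
# `passing_leaders` is a hypothetical name used only for this sketch.
passing_leaders <- data.frame(
  year = c(2020, 2019, 2018, 2017, 2016, 2015, 2014, 2014, 2013, 2012, 2011,
           2010, 2009, 2008, 2007, 2006, 2005, 2004, 2003, 2002, 2001, 2000),
  player = c("Deshaun Watson", "Jameis Winston", "Ben Roethlisberger",
             "Tom Brady", "Drew Brees", "Drew Brees", "Drew Brees",
             "Ben Roethlisberger", "Peyton Manning", "Drew Brees", "Drew Brees",
             "Philip Rivers", "Matt Schaub", "Drew Brees", "Tom Brady",
             "Drew Brees", "Tom Brady", "Daunte Culpepper", "Peyton Manning",
             "Rich Gannon", "Kurt Warner", "Peyton Manning")
)

# Count how many seasons each player led (or tied for) the league lead,
# sorted so the most frequent leader comes first
leader_counts <- passing_leaders %>%
  count(player, name = "seasons_led", sort = TRUE)

print(head(leader_counts, 3))
```

The first row of `leader_counts` is the most frequent passing leader, which can then be compared with the playoff appearance rankings.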

Additionally, the visualization on the Top Passing Yards Leaders tab is interactive. The top passing season over the past 20 years was Peyton Manning's 2013 season, with 5,477 yards. The data points are colored by player, and the legend can be seen on the right.

Column

Passing Yards Summary

kable(nfl_passing)

year	player	yds	team	team_name
2020	Deshaun Watson	4823	Houston	Texans
2019	Jameis Winston	5109	Tampa Bay	Buccaneers
2018	Ben Roethlisberger	5129	Pittsburgh	Steelers
2017	Tom Brady	4577	New England	Patriots
2016	Drew Brees	5208	New Orleans	Saints
2015	Drew Brees	4870	New Orleans	Saints
2014	Drew Brees	4952	New Orleans	Saints
2014	Ben Roethlisberger	4952	Pittsburgh	Steelers
2013	Peyton Manning	5477	Denver	Broncos
2012	Drew Brees	5177	New Orleans	Saints
2011	Drew Brees	5476	New Orleans	Saints
2010	Philip Rivers	4710	San Diego	Chargers
2009	Matt Schaub	4770	Houston	Texans
2008	Drew Brees	5069	New Orleans	Saints
2007	Tom Brady	4806	New England	Patriots
2006	Drew Brees	4418	New Orleans	Saints
2005	Tom Brady	4110	New England	Patriots
2004	Daunte Culpepper	4717	Minnesota	Vikings
2003	Peyton Manning	4267	Indianapolis	Colts
2002	Rich Gannon	4689	Oakland	Raiders
2001	Kurt Warner	4830	St. Louis	Rams
2000	Peyton Manning	4413	Indianapolis	Colts

Top Passing Yards Leaders

Rushing Yards Leaders

Column

Rushing Yards Leaders

Looking at the leaders for rushing yards, I performed the same analysis as above.

As seen in the visualizations to the right, the rushing yards leaders have much more variation than the passing yards leaders. Most recently, Derrick Henry from the Tennessee Titans has been the dominant running back with 2027 yards in 2020. He has been the top rusher for the past two years in a row.

Additionally, the visualization on the Top Rushing Yards Leaders tab is also interactive. The top rushing season over the past 20 years was Adrian Peterson's 2012 season, with 2,097 yards. The data points are colored by player, and the legend can be seen on the right.
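The same questions can be answered from the Rushing Yards Summary table without the interactive plot. This is a minimal sketch, assuming the table is transcribed into a data frame; the names `rushing_leaders`, `top_season`, and `repeat_leaders` are hypothetical and only illustrate the approach.

```r
library(dplyr)

# Season leaders transcribed from the Rushing Yards Summary table (2000 - 2020).
# `rushing_leaders` is a hypothetical name used only for this sketch.
rushing_leaders <- data.frame(
  year = 2020:2000,
  player = c("Derrick Henry", "Derrick Henry", "Ezekiel Elliott", "Kareem Hunt",
             "Ezekiel Elliott", "Adrian Peterson", "DeMarco Murray",
             "LeSean McCoy", "Adrian Peterson", "Maurice Jones-Drew",
             "Arian Foster", "Chris Johnson", "Adrian Peterson",
             "LaDainian Tomlinson", "LaDainian Tomlinson", "Shaun Alexander",
             "Curtis Martin", "Jamal Lewis", "Ricky Williams", "Priest Holmes",
             "Edgerrin James"),
  yds = c(2027, 1540, 1434, 1327, 1631, 1485, 1845, 1607, 2097, 1606, 1616,
          2006, 1760, 1474, 1815, 1880, 1697, 2066, 1853, 1555, 1709)
)

# Highest single-season rushing total in the period
top_season <- rushing_leaders %>% slice_max(yds, n = 1)

# Players who led the league in more than one season
repeat_leaders <- rushing_leaders %>%
  count(player, name = "seasons_led") %>%
  filter(seasons_led > 1) %>%
  arrange(desc(seasons_led))

print(top_season)
print(repeat_leaders)
```

Comparing the number of rows in `repeat_leaders` with the passing table gives a rough sense of how much more the rushing title changes hands.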

Column

Rushing Yards Summary

kable(nfl_rushing)

year	player	yds	team	team_name
2020	Derrick Henry	2027	Tennessee	Titans
2019	Derrick Henry	1540	Tennessee	Titans
2018	Ezekiel Elliott	1434	Dallas	Cowboys
2017	Kareem Hunt	1327	Kansas City	Chiefs
2016	Ezekiel Elliott	1631	Dallas	Cowboys
2015	Adrian Peterson	1485	Minnesota	Vikings
2014	DeMarco Murray	1845	Dallas	Cowboys
2013	LeSean McCoy	1607	Philadelphia	Eagles
2012	Adrian Peterson	2097	Minnesota	Vikings
2011	Maurice Jones-Drew	1606	Jacksonville	Jaguars
2010	Arian Foster	1616	Houston	Texans
2009	Chris Johnson	2006	Tennessee	Titans
2008	Adrian Peterson	1760	Minnesota	Vikings
2007	LaDainian Tomlinson	1474	San Diego	Chargers
2006	LaDainian Tomlinson	1815	San Diego	Chargers
2005	Shaun Alexander	1880	Seattle	Seahawks
2004	Curtis Martin	1697	New York	Jets
2003	Jamal Lewis	2066	Baltimore	Ravens
2002	Ricky Williams	1853	Miami	Dolphins
2001	Priest Holmes	1555	Kansas City	Chiefs
2000	Edgerrin James	1709	Indianapolis	Colts

Top Rushing Yards Leaders

Penalty Yards Per Game

NFL Average Penalty Yards Per Game

To look at average penalty yards per game, I found a dataset that records the average penalty yards assessed against each team from 2003 - 2020. I wanted to find out which team was the most penalized.

From the graphic below, it is evident that the Las Vegas Raiders have been the most penalized team in the NFL. The top five most penalized teams are:

Las Vegas Raiders
Baltimore Ravens
Detroit Lions
Tampa Bay Buccaneers
Los Angeles Rams

The least penalized team in the NFL is the Indianapolis Colts.

Top Penalized Teams

Rival Analysis

Column

Rival Analysis

The last part of my analysis looks at five of the top rivalries in the NFL and the results of these games over the past 20 years.

The term “rivalry” can be a bit subjective; however, I chose the following five rivalries to analyze:

Green Bay Packers vs. Chicago Bears – It is evident that the Green Bay Packers have been dominant in this rivalry, winning 71% of the encounters.
Dallas Cowboys vs. Philadelphia Eagles – This is a very good rivalry, as both teams have performed well. It is evident that the Philadelphia Eagles have been leading this rivalry, winning 54% of the encounters.
Kansas City Chiefs vs. Las Vegas Raiders (formerly Oakland Raiders) – It is evident that the Kansas City Chiefs have been dominant in this rivalry, winning 63% of the encounters.
Baltimore Ravens vs. Pittsburgh Steelers – This is a phenomenal rivalry to watch, as each team has won 22 of the 44 games played, or 50% of the encounters.
Washington Football Team (formerly Washington Redskins) vs. New York Giants – It is evident that the New York Giants have been dominant in this rivalry, winning 68% of the encounters.

Then, just for fun, I decided to analyze the Pittsburgh Steelers vs. the Cincinnati Bengals. It is evident that the Pittsburgh Steelers have been dominant in this rivalry, winning 79% of the encounters.
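The win percentages quoted for each rivalry are the share of head-to-head games won by each team. The sketch below shows one way to compute that share with dplyr. The data frame `rivalry_games`, its column names, and the results in it are all hypothetical placeholders, not real NFL results and not the structure of nfl_games.

```r
library(dplyr)

# Toy results, NOT real NFL data: one row per game between two rivals.
# Column names (season, home_team, away_team, winner) are assumptions.
rivalry_games <- data.frame(
  season    = c(2017, 2017, 2018, 2018, 2019, 2019),
  home_team = c("Team A", "Team B", "Team A", "Team B", "Team A", "Team B"),
  away_team = c("Team B", "Team A", "Team B", "Team A", "Team B", "Team A"),
  winner    = c("Team A", "Team A", "Team B", "Team A", "Team A", "Team B")
)

# Wins per team and each team's share of all head-to-head games.
# Tied games are not handled in this sketch.
rivalry_summary <- rivalry_games %>%
  count(winner, name = "wins") %>%
  mutate(win_pct = round(100 * wins / sum(wins)))

print(rivalry_summary)
```

With real data, the same pipeline would first be filtered to the games in which the two rivals faced each other.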