From a young age, both of us have watched NFL football and cheered for our respective teams every week. We (Adam, a die-hard Bengals fan, and Katie, a lifelong Steelers fan) wanted to compare not only our own teams, but also the league as a whole. The past 20 years have seen successes and failures from every NFL team. Looking at the datasets available for this project, we were instantly drawn to the NFL option, as it showcases many variables that affect an NFL team. We then thought, “What all goes into winning an NFL game, and what teams are historically successful in the final standings?” Using the past 20 years' worth of data, we sought to investigate this problem.
We plan on using the functions in R to deliver overall summary statistics on games and standings. Additionally, we will use the data to develop potential correlations and plot respective data visualizations. Utilizing descriptive analysis of the past 20 years, we are looking to see if there can be predictive tendencies for NFL teams.
The NFL dataset contained three individual datasets:
(1) nfl_attendance
(2) nfl_standings
(3) nfl_games
More detailed information about each dataset can be found in the Data Preparation tab.
The datasets contain loads of information for the NFL. With a wide range of variables, many options are available to analytically investigate the NFL. With the three datasets at hand, we looked to compare them to draw conclusions about team performances. To see if statistical significance or rational conclusions related to the NFL could be realized, the following situations were explored:
The nfl_games dataset contains many variables for each game. Turnovers, day of the week, points, etc. are shown for every match-up. The goal of this analysis is to find correlations that explain why teams win or lose, so identifying which of these variables are significant will be essential.
The NFL is a multi-billion dollar industry. Millions of fans across the world cheer for these 32 teams every year. People are now looking for ways to understand the game better.
Coaches want to understand what makes a team more successful. Sports gamblers want to get an edge and make the correct picks based on more than just gut feelings. Fans want to know if their team is progressing in the right direction. This analysis is useful for all of these situations. Descriptive analysis helps explain past results, and the trends it reveals can inform predictions of how NFL games and seasons will unfold. Although no one can see into the future, understanding the data sheds a better light on the probability of certain results occurring in the NFL.
This project requires a variety of packages. Given there are over 10,000 packages in R, we want to focus on the ones that will provide us with the best results while cleaning and interpreting the data. Because we still consider ourselves very much novices, many of these packages are standard.
Some packages will be more useful than others. For example, ggplot2 produces visualizations that make the data easier to understand. Additionally, dplyr can drill deeper into the three datasets to reach conclusions that may be hidden at first. R has powerful functions that can answer questions about massive datasets. Please see below for all of the packages loaded for this analysis:
# Packages required
library(tidyverse) # Use to tidy data
library(dplyr) # Use to manipulate data
library(ggplot2) # Use to plot data and create visualizations
library(tibble) # Use to manipulate and re-imagine data
library(readr) # Use to import data cleanly and efficiently
library(DT) # Use to create comprehensive data tables with HTML output
library(knitr) # Use for dynamic report generation
library(base) # Contains Base R functions
library(ggthemes) # Use themes in data visualizations
library(plotly) # Use to plot data and create visualizations
library(ggpubr) # Use to show multiple plots at once
library(GGally) # Use to produce scatter plot matrix
library(rmarkdown) # Use to produce report
The data was obtained from our professor, Tianhai Zu, for this class. He provided four different datasets from which to choose, and we chose the NFL Attendance Data option. This dataset can be found on GitHub. Reading the information on GitHub led us to the original sources of the data: Pro Football Reference Standings and Pro Football Reference Attendance.
The NFL dataset consists of three individual datasets: (1) nfl_attendance, (2) nfl_standings, and (3) nfl_games. We first merged the attendance and standings datasets into one dataframe called nfl_df, joining on year, team_name, and team. We decided it might be beneficial to have multiple frames of reference, some utilizing individual datasets, and another looking at the combined dataframe. Rather than using str() and summary() to show descriptive statistics for each variable, we decided to create comprehensive tables.
# Get working directory
getwd()
## [1] "C:/Users/katie/OneDrive - University of Cincinnati/FS20/Second Half/Data Wrangling (BANA 7025)/Final Project"
# Get the data
nfl_attendance <- readr::read_csv('attendance.csv')
nfl_standings <- readr::read_csv('standings.csv')
nfl_games <- readr::read_csv('games.csv')
# To use 2020 data you need to update tidytuesdayR from GitHub
# Install via devtools::install_github("thebioengineer/tidytuesdayR")
tuesdata <- tidytuesdayR::tt_load('2020-02-04')
##
## Downloading file 1 of 3: `attendance.csv`
## Downloading file 2 of 3: `games.csv`
## Downloading file 3 of 3: `standings.csv`
tuesdata <- tidytuesdayR::tt_load(2020, week = 6)
##
## Downloading file 1 of 3: `attendance.csv`
## Downloading file 2 of 3: `games.csv`
## Downloading file 3 of 3: `standings.csv`
attendance <- tuesdata$attendance
# Join the data relatively nicely with dplyr
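# left_join() combines two tables at a time; nfl_games has no team/team_name key columns, so it is not part of this join
nfl_df <- dplyr::left_join(nfl_attendance, nfl_standings, by = c("year", "team_name", "team"))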
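To illustrate how a key-based left join behaves, here is a minimal sketch on toy data. The two small data frames and their values are made up for illustration and are not taken from the real files; only the key column names match the join above.

```r
library(dplyr)

# Toy stand-ins for the attendance and standings tables (hypothetical values)
att_toy <- data.frame(
  team = c("Cincinnati", "Cincinnati", "Pittsburgh"),
  team_name = c("Bengals", "Bengals", "Steelers"),
  year = c(2019, 2019, 2019),
  week = c(1, 2, 1),
  weekly_attendance = c(60000, 61000, 65000),
  stringsAsFactors = FALSE
)
st_toy <- data.frame(
  team = c("Cincinnati", "Pittsburgh"),
  team_name = c("Bengals", "Steelers"),
  year = c(2019, 2019),
  wins = c(2, 8),
  stringsAsFactors = FALSE
)

# Every weekly attendance row is kept and picks up the matching
# season-level standings columns for the same team and year
joined_toy <- left_join(att_toy, st_toy, by = c("year", "team_name", "team"))

# A left join on unique season keys should not change the row count
nrow(joined_toy) == nrow(att_toy)
```

Comparing row counts before and after the join is a quick way to catch duplicated keys on the right-hand table.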
As mentioned above, the NFL Attendance data was obtained from Pro Football Reference. The original data contains 10,846 observations and eight variables. There are two character variables (team and team_name) and six numeric variables (year, total, home, away, week, and weekly_attendance). The variables are described in the data dictionary below. The data covers the 2000 to 2019 seasons, with one row per team for each of the 17 weeks of the regular season. See the ORIGINAL dataset below.
# Examine the structure of the dataset
datatable(head(nfl_attendance, 50))
# Create a data dictionary for attendance
var_names_att <- colnames(nfl_attendance)
var_types_att <- lapply(nfl_attendance, class) # lapply returns a list of the same length as X (a vector)
var_desciptions_att <- c("City or state in which the team originates", "Name or mascot of the team", "Year", "Total attendance per season", "Total attendance at home games per season", "Total attendance at away games per season", "Week in which game was played", "Attendance for given week")
data_dict_att <- as_tibble(cbind(var_names_att, var_types_att, var_desciptions_att))
colnames(data_dict_att) <- c("Variable Name", "Variable Data Type", "Variable Description")
kable(data_dict_att) # kable returns a single table for a single data object
| Variable Name | Variable Data Type | Variable Description |
|---|---|---|
| team | character | City or state in which the team originates |
| team_name | character | Name or mascot of the team |
| year | numeric | Year |
| total | numeric | Total attendance per season |
| home | numeric | Total attendance at home games per season |
| away | numeric | Total attendance at away games per season |
| week | numeric | Week in which game was played |
| weekly_attendance | numeric | Attendance for given week |
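Since the same data dictionary steps are repeated for every dataset, they could be wrapped in a small helper. The sketch below is one possible way to do that; `make_data_dict` is a hypothetical name, not a function from any package, and the toy data frame is made up.

```r
# Hypothetical helper: build a data dictionary from a data frame and a
# vector of descriptions (one per column, in column order)
make_data_dict <- function(df, descriptions) {
  stopifnot(length(descriptions) == ncol(df))
  data.frame(
    `Variable Name` = names(df),
    # Collapse multi-class columns (e.g. "hms, difftime") into one string
    `Variable Data Type` = unname(vapply(
      df, function(col) paste(class(col), collapse = ", "), character(1)
    )),
    `Variable Description` = descriptions,
    check.names = FALSE,
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# Toy example (made-up rows)
toy <- data.frame(
  team = c("Cincinnati", "Pittsburgh"),
  year = c(2019, 2019),
  stringsAsFactors = FALSE
)
dict <- make_data_dict(
  toy,
  c("City or state in which the team originates", "Year")
)
dict
```

The result can be passed to kable() in the same way as the dictionaries built by hand.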
Looking at the missing values, the only column that contains any is weekly_attendance. This makes sense, as each NFL team has a bye week during the regular season. We decided to omit these rows, as they would skew the data and misrepresent the trends for each team.
colSums(is.na(nfl_attendance)) # Find the number of missing values per column
## team team_name year total
## 0 0 0 0
## home away week weekly_attendance
## 0 0 0 638
nfl_attendance <- na.omit(nfl_attendance)
colSums(is.na(nfl_attendance)) # Confirm there are no missing values
## team team_name year total
## 0 0 0 0
## home away week weekly_attendance
## 0 0 0 0
Looking at the original dataset above, we first decided to rename the columns to better describe the data.
nfl_attendance <- nfl_attendance %>% rename(
team_location = team,
total_attendance = total,
total_home_attendance = home,
total_away_attendance = away
)
Additionally, we split it into two dataframes, the first omitting the weekly data and the second omitting the season totals. This decision was made largely to remove duplicates, and we knew it would make for better visualizations during the exploratory data analysis (EDA).
The first dataset, nfl_total_attendance, drops the two weekly columns, week and weekly_attendance, and shows the season attendance totals for each team. The second dataset, nfl_weekly_attendance, drops the season total columns: total_attendance, total_home_attendance, and total_away_attendance.
nfl_total_attendance <- nfl_attendance[-c(7, 8)] # Remove weekly data
nfl_total_attendance <- nfl_total_attendance[!duplicated(nfl_total_attendance), ] # Remove duplicates
datatable(head(nfl_total_attendance, 50))
nfl_weekly_attendance <- nfl_attendance[-c(4, 5, 6)] # Remove season total attendance data
datatable(head(nfl_weekly_attendance, 50))
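The split above drops columns by position, which silently breaks if the column order ever changes. As a hedged alternative, the same split can be written by column name with dplyr. The toy data frame below uses made-up values and only mirrors the renamed column layout.

```r
library(dplyr)

# Toy attendance data with the renamed columns (hypothetical values)
toy_att <- data.frame(
  team_location = "Cincinnati",
  team_name = "Bengals",
  year = 2019,
  total_attendance = 1000,
  total_home_attendance = 600,
  total_away_attendance = 400,
  week = 1:3,
  weekly_attendance = c(200, 190, 210),
  stringsAsFactors = FALSE
)

# Season totals: drop the weekly columns by name, then remove duplicate rows
total_toy <- toy_att %>%
  select(-week, -weekly_attendance) %>%
  distinct()

# Weekly data: drop the season total columns by name
weekly_toy <- toy_att %>%
  select(-total_attendance, -total_home_attendance, -total_away_attendance)
```

Here the three weekly rows collapse to a single season row in total_toy, while weekly_toy keeps one row per week.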
Now, for a summary of the two datasets and associated tables of the CLEANED data, please see below.
# Examine the final summary and structure of the nfl_total_attendance dataset
datatable(head(nfl_total_attendance, 50))
# Create a data dictionary for nfl_total_attendance
var_names_att <- colnames(nfl_total_attendance)
var_types_att <- lapply(nfl_total_attendance, class) # lapply returns a list of the same length as X (a vector)
var_desciptions_att <- c("City or state in which the team originates", "Name or mascot of the team", "Year", "Total attendance per season", "Total attendance at home games per season", "Total attendance at away games per season")
data_dict_total_att <- as_tibble(cbind(var_names_att, var_types_att, var_desciptions_att))
colnames(data_dict_total_att) <- c("Variable Name", "Variable Data Type", "Variable Description")
kable(data_dict_total_att) # kable returns a single table for a single data object
| Variable Name | Variable Data Type | Variable Description |
|---|---|---|
| team_location | character | City or state in which the team originates |
| team_name | character | Name or mascot of the team |
| year | numeric | Year |
| total_attendance | numeric | Total attendance per season |
| total_home_attendance | numeric | Total attendance at home games per season |
| total_away_attendance | numeric | Total attendance at away games per season |
# Examine the final summary and structure of the nfl_weekly_attendance dataset
datatable(head(nfl_weekly_attendance, 50))
# Create a data dictionary for nfl_weekly_attendance
var_names_att <- colnames(nfl_weekly_attendance)
var_types_att <- lapply(nfl_weekly_attendance, class) # lapply returns a list of the same length as X (a vector)
var_desciptions_att <- c("City or state in which the team originates", "Name or mascot of the team", "Year", "Week in which game was played", "Attendance for given week")
data_dict_weekly_att <- as_tibble(cbind(var_names_att, var_types_att, var_desciptions_att))
colnames(data_dict_weekly_att) <- c("Variable Name", "Variable Data Type", "Variable Description")
kable(data_dict_weekly_att) # kable returns a single table for a single data object
| Variable Name | Variable Data Type | Variable Description |
|---|---|---|
| team_location | character | City or state in which the team originates |
| team_name | character | Name or mascot of the team |
| year | numeric | Year |
| week | numeric | Week in which game was played |
| weekly_attendance | numeric | Attendance for given week |
Similar to above, the NFL Standings data was obtained from Pro Football Reference. The original data contains 638 observations and 15 variables. There are four character variables (team, team_name, playoffs, and sb_winner) and 11 numeric variables (year, wins, loss, points_for, points_against, points_differential, margin_of_victory, strength_of_schedule, simple_rating, offensive_ranking, and defensive_ranking). The variables are described in the data dictionary below. The data covers the 2000 to 2019 seasons. See the ORIGINAL NFL Standings data below.
# Examine the structure of the dataset
datatable(head(nfl_standings, 50))
# Create a data dictionary for standings
var_names_st <- colnames(nfl_standings)
var_types_st <- lapply(nfl_standings, class) # lapply returns a list of the same length as X (a vector)
var_desciptions_st <- c("City or state in which the team originates", "Name or mascot of the team", "Year", "Total wins per season (0 to 16)", "Total losses per season (0 to 16)", "Total points the team scored per season", "Total points the opponent scored on the team per season", "The difference between the total points for the team and against the team", "Points differential divided by the total number of games per season", "Difficulty of schedule based on opponent records", "A rating for the team that takes into account points differential and strength of schedule (measured by Simple Rating System)", "A rating comparing how well the offense performs to opponent teams (measured by Simple Rating System)", "A rating comparing how well the defense performs to opponent teams (measured by Simple Rating System)", "Stating whether or not the team made it to the playoffs", "Stating whether or not the team won the Super Bowl for the season")
data_dict_st <- as_tibble(cbind(var_names_st, var_types_st, var_desciptions_st))
colnames(data_dict_st) <- c("Variable Name", "Variable Data Type", "Variable Description")
kable(data_dict_st) # kable returns a single table for a single data object
| Variable Name | Variable Data Type | Variable Description |
|---|---|---|
| team | character | City or state in which the team originates |
| team_name | character | Name or mascot of the team |
| year | numeric | Year |
| wins | numeric | Total wins per season (0 to 16) |
| loss | numeric | Total losses per season (0 to 16) |
| points_for | numeric | Total points the team scored per season |
| points_against | numeric | Total points the opponent scored on the team per season |
| points_differential | numeric | The difference between the total points for the team and against the team |
| margin_of_victory | numeric | Points differential divided by the total number of games per season |
| strength_of_schedule | numeric | Difficulty of schedule based on opponent records |
| simple_rating | numeric | A rating for the team that takes into account points differential and strength of schedule (measured by Simple Rating System) |
| offensive_ranking | numeric | A rating comparing how well the offense performs to opponent teams (measured by Simple Rating System) |
| defensive_ranking | numeric | A rating comparing how well the defense performs to opponent teams (measured by Simple Rating System) |
| playoffs | character | Stating whether or not the team made it to the playoffs |
| sb_winner | character | Stating whether or not the team won the Super Bowl for the season |
Looking at the above dataset, we first decided to change the column names to better describe the data.
nfl_standings <- nfl_standings %>% rename(
team_location = team,
total_wins = wins,
total_losses = loss
)
It is also important to note that a few of the variables are calculated values. points_differential is calculated as: points_differential = points_for - points_against. Additionally, margin_of_victory is calculated as: (points_scored - points_allowed) / games_played.
Lastly, the simple_rating is calculated by: \[SRS = MoV + SoS = OSRS + DSRS\]
In layman’s terms, the simple rating system is equal to the margin of victory plus the strength of schedule. This is equal to the offensive simple rating standing plus the defensive simple rating standing.
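The order of operations matters for the margin of victory: the subtraction has to happen before the division. A short worked example, using made-up season totals rather than values from the dataset:

```r
# Hypothetical season totals for one team
points_for <- 400
points_against <- 350
games_played <- 16

# Points differential: points scored minus points allowed
points_differential <- points_for - points_against

# Margin of victory: the differential averaged over games played
margin_of_victory <- (points_for - points_against) / games_played

# Without parentheses, only points_against is divided, which is not the
# margin of victory
without_parentheses <- points_for - points_against / games_played

points_differential  # 50
margin_of_victory    # 3.125
without_parentheses  # 378.125
```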
Next, we wanted to see what the sum of missing values was per column. As evident below, there are no missing values.
colSums(is.na(nfl_standings))
## team_location team_name year
## 0 0 0
## total_wins total_losses points_for
## 0 0 0
## points_against points_differential margin_of_victory
## 0 0 0
## strength_of_schedule simple_rating offensive_ranking
## 0 0 0
## defensive_ranking playoffs sb_winner
## 0 0 0
Moving forward, we decided to change both the playoffs and sb_winner to binary variables. This is because they both only have two unique values.
unique(nfl_standings$playoffs, incomparables = FALSE) # View the unique values for the playoffs column
## [1] "Playoffs" "No Playoffs"
unique(nfl_standings$sb_winner, incomparables = FALSE) # View the unique values for the sb_winner column
## [1] "No Superbowl" "Won Superbowl"
Knowing this, we changed the two columns to binary variables. For the playoffs column, a value of one stands for “Playoffs”, and a value of zero stands for “No Playoffs”.
nfl_standings$playoffs[nfl_standings$playoffs == "Playoffs"] <- "1"
nfl_standings$playoffs[nfl_standings$playoffs == "No Playoffs"] <- "0"
nfl_standings$playoffs <- as.numeric(nfl_standings$playoffs)
For the sb_winner column, a value of one denotes “Won Superbowl”, and a value of zero denotes “No Superbowl”.
nfl_standings$sb_winner[nfl_standings$sb_winner == "Won Superbowl"] <- "1"
nfl_standings$sb_winner[nfl_standings$sb_winner == "No Superbowl"] <- "0"
nfl_standings$sb_winner <- as.numeric(nfl_standings$sb_winner)
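The same recoding can also be done in a single step per column, since a logical comparison converts directly to 0 and 1. This is a minimal sketch on made-up vectors, shown as an alternative to the three-line approach above rather than a replacement for it:

```r
# Toy vectors using the two labels found in each column
playoffs <- c("Playoffs", "No Playoffs", "Playoffs")
sb_winner <- c("No Superbowl", "No Superbowl", "Won Superbowl")

# TRUE becomes 1 and FALSE becomes 0
playoffs_bin <- as.integer(playoffs == "Playoffs")
sb_winner_bin <- as.integer(sb_winner == "Won Superbowl")

playoffs_bin   # 1 0 1
sb_winner_bin  # 0 0 1
```

One trade-off: any unexpected label would be coded as 0 here, so checking unique() first, as done above, is still worthwhile.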
Now, for a summary of the dataset and associated table of the data, please see the CLEANED dataset below.
# Examine the structure of the dataset
datatable(head(nfl_standings, 50))
# Create a data dictionary for standings
var_names_st <- colnames(nfl_standings)
var_types_st <- lapply(nfl_standings, class) # lapply returns a list of the same length as X (a vector)
var_desciptions_st <- c("City or state in which the team originates", "Name or mascot of the team", "Year", "Total wins per season (0 to 16)", "Total losses per season (0 to 16)", "Total points the team scored per season", "Total points the opponent scored on the team per season", "The difference between the total points for the team and against the team", "Points differential divided by the total number of games per season", "Difficulty of schedule based on opponent records", "A rating for the team that takes into account points differential and strength of schedule (measured by Simple Rating System)", "A rating comparing how well the offense performs to opponent teams (measured by Simple Rating System)", "A rating comparing how well the defense performs to opponent teams (measured by Simple Rating System)", "Stating whether or not the team made it to the playoffs", "Stating whether or not the team won the Super Bowl for the season")
data_dict_st <- as_tibble(cbind(var_names_st, var_types_st, var_desciptions_st))
colnames(data_dict_st) <- c("Variable Name", "Variable Data Type", "Variable Description")
kable(data_dict_st) # kable returns a single table for a single data object
| Variable Name | Variable Data Type | Variable Description |
|---|---|---|
| team_location | character | City or state in which the team originates |
| team_name | character | Name or mascot of the team |
| year | numeric | Year |
| total_wins | numeric | Total wins per season (0 to 16) |
| total_losses | numeric | Total losses per season (0 to 16) |
| points_for | numeric | Total points the team scored per season |
| points_against | numeric | Total points the opponent scored on the team per season |
| points_differential | numeric | The difference between the total points for the team and against the team |
| margin_of_victory | numeric | Points differential divided by the total number of games per season |
| strength_of_schedule | numeric | Difficulty of schedule based on opponent records |
| simple_rating | numeric | A rating for the team that takes into account points differential and strength of schedule (measured by Simple Rating System) |
| offensive_ranking | numeric | A rating comparing how well the offense performs to opponent teams (measured by Simple Rating System) |
| defensive_ranking | numeric | A rating comparing how well the defense performs to opponent teams (measured by Simple Rating System) |
| playoffs | numeric | Stating whether or not the team made it to the playoffs |
| sb_winner | numeric | Stating whether or not the team won the Super Bowl for the season |
Once again, the NFL Games data was obtained from Pro Football Reference. The original data contains 5,324 observations and 19 variables. There are 11 character variables (week, home_team, away_team, winner, tie, day, date, home_team_name, home_team_city, away_team_name, and away_team_city), seven numeric variables (year, pts_win, pts_loss, yds_win, turnovers_win, yds_loss, and turnovers_loss), and one time variable (time, stored as hms/difftime). See the ORIGINAL dataset below.
# Examine the structure of the dataset
datatable(head(nfl_games, 50))
# Create a data dictionary for games
var_names_games <- colnames(nfl_games)
var_types_games <- lapply(nfl_games, class) # lapply returns a list of the same length as X (a vector)
var_desciptions_games <- c("Year", "Week of the season in which the game was played", "Home team for the game", "Away team for the game", "Winner of the game", "Was there a tie? (if so, the other team will be listed in this column)", "Day of the week in which the game was played", "Date of the game", "Time of the day in which the game was played", "Number of points the winning team scored", "Number of points the losing team scored", "Total number of yards the winning team had", "Total number of turnovers the winning team had", "Total number of yards the losing team had", "Total number of turnovers the losing team had", "Name or mascot of the home team", "City of the home team", "Name or mascot of the away team", "City of the away team")
data_dict_games <- as_tibble(cbind(var_names_games, var_types_games, var_desciptions_games))
colnames(data_dict_games) <- c("Variable Name", "Variable Data Type", "Variable Description")
kable(data_dict_games) # kable returns a single table for a single data object
| Variable Name | Variable Data Type | Variable Description |
|---|---|---|
| year | numeric | Year |
| week | character | Week of the season in which the game was played |
| home_team | character | Home team for the game |
| away_team | character | Away team for the game |
| winner | character | Winner of the game |
| tie | character | Was there a tie? (if so, the other team will be listed in this column) |
| day | character | Day of the week in which the game was played |
| date | character | Date of the game |
| time | hms , difftime | Time of the day in which the game was played |
| pts_win | numeric | Number of points the winning team scored |
| pts_loss | numeric | Number of points the losing team scored |
| yds_win | numeric | Total number of yards the winning team had |
| turnovers_win | numeric | Total number of turnovers the winning team had |
| yds_loss | numeric | Total number of yards the losing team had |
| turnovers_loss | numeric | Total number of turnovers the losing team had |
| home_team_name | character | Name or mascot of the home team |
| home_team_city | character | City of the home team |
| away_team_name | character | Name or mascot of the away team |
| away_team_city | character | City of the away team |
Looking at the above dataset, the first step we took to clean the data was to remove the last four unnecessary columns, as we felt they were redundant.
names(nfl_games)
## [1] "year" "week" "home_team" "away_team"
## [5] "winner" "tie" "day" "date"
## [9] "time" "pts_win" "pts_loss" "yds_win"
## [13] "turnovers_win" "yds_loss" "turnovers_loss" "home_team_name"
## [17] "home_team_city" "away_team_name" "away_team_city"
nfl_games <- nfl_games[-c(16, 17, 18, 19)] # Remove redundant columns
names(nfl_games)
## [1] "year" "week" "home_team" "away_team"
## [5] "winner" "tie" "day" "date"
## [9] "time" "pts_win" "pts_loss" "yds_win"
## [13] "turnovers_win" "yds_loss" "turnovers_loss"
Then, we changed the week column to be numeric.
nfl_games$week <- as.numeric(nfl_games$week)
Looking at missing values, the tie column contains 5,314 of them. This makes sense, as very few NFL games result in a tie. The week column also shows 220 missing values, which were introduced by the numeric conversion above: as.numeric() returns NA for any week value that is not a number. The count is consistent with postseason games (11 per season over 20 seasons), so these rows were kept.
A tie is recorded by listing one team in the winner column and its opponent in the tie column. To fix this, we identified every game that ended in a tie, set the winner value for those games to “Tie”, and then dropped the tie column.
colSums(is.na(nfl_games))
## year week home_team away_team winner
## 0 220 0 0 0
## tie day date time pts_win
## 5314 0 0 0 0
## pts_loss yds_win turnovers_win yds_loss turnovers_loss
## 0 0 0 0 0
unique(nfl_games$tie, incomparables = FALSE)
## [1] NA "Atlanta Falcons" "Cincinnati Bengals"
## [4] "St. Louis Rams" "Green Bay Packers" "Carolina Panthers"
## [7] "Arizona Cardinals" "Cleveland Browns"
nfl_games$winner[!is.na(nfl_games$tie)] <- "Tie" # Flag tied games in the winner column
nfl_games <- nfl_games[-c(6)] # Remove the tie column
colSums(is.na(nfl_games)) # Re-check missing values; only the 220 NAs in week remain
## year week home_team away_team winner
## 0 220 0 0 0
## day date time pts_win pts_loss
## 0 0 0 0 0
## yds_win turnovers_win yds_loss turnovers_loss
## 0 0 0 0
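The two cleaning steps for nfl_games can be illustrated on small made-up vectors. The week labels "WildCard" and "SuperBowl" below are assumptions used for illustration; the exact labels in the real column are not shown in this report.

```r
# Toy week column mixing regular-season numbers with text labels
# (the text labels are hypothetical)
week_raw <- c("1", "17", "WildCard", "SuperBowl")

# as.numeric() converts the numbers and returns NA (with a warning)
# for anything that is not a number
week_num <- suppressWarnings(as.numeric(week_raw))
is_non_numeric_week <- is.na(week_num)

# Toy winner/tie columns: the second game ended in a tie
winner <- c("Team A", "Team B")
tie <- c(NA, "Team C")

# Mark tied games by checking where the tie column is not missing
winner[!is.na(tie)] <- "Tie"

week_num  # 1 17 NA NA
winner    # "Team A" "Tie"
```

Keeping a flag such as is_non_numeric_week before converting makes it possible to separate those rows later instead of losing the original label.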
To view the summary and structure of the CLEANED data:
# Examine the structure of the dataset
datatable(head(nfl_games, 50))
# Create a data dictionary for games
var_names_games <- colnames(nfl_games)
var_types_games <- lapply(nfl_games, class) # lapply returns a list of the same length as X (a vector)
var_desciptions_games <- c("Year", "Week of the season in which the game was played", "Home team for the game", "Away team for the game", "Winner of the game", "Day of the week in which the game was played", "Date of the game", "Time of the day in which the game was played", "Number of points the winning team scored", "Number of points the losing team scored", "Total number of yards the winning team had", "Total number of turnovers the winning team had", "Total number of yards the losing team had", "Total number of turnovers the losing team had")
data_dict_games <- as_tibble(cbind(var_names_games, var_types_games, var_desciptions_games))
colnames(data_dict_games) <- c("Variable Name", "Variable Data Type", "Variable Description")
kable(data_dict_games) # kable returns a single table for a single data object
| Variable Name | Variable Data Type | Variable Description |
|---|---|---|
| year | numeric | Year |
| week | numeric | Week of the season in which the game was played |
| home_team | character | Home team for the game |
| away_team | character | Away team for the game |
| winner | character | Winner of the game |
| day | character | Day of the week in which the game was played |
| date | character | Date of the game |
| time | hms , difftime | Time of the day in which the game was played |
| pts_win | numeric | Number of points the winning team scored |
| pts_loss | numeric | Number of points the losing team scored |
| yds_win | numeric | Total number of yards the winning team had |
| turnovers_win | numeric | Total number of turnovers the winning team had |
| yds_loss | numeric | Total number of yards the losing team had |
| turnovers_loss | numeric | Total number of turnovers the losing team had |
As mentioned in the introduction, this analysis looks into whether the number of fans in attendance correlates with a team’s success (number of games won). Additionally, it compares how teams fare in home versus away games while keeping in mind the attendance at those games. Earlier in our data preparation, we split the attendance dataset into two separate datasets, nfl_total_attendance and nfl_weekly_attendance. To first understand the importance of fan attendance, it is critical to observe which teams have had the strongest fan bases over the past 20 years.
# Create visualization for total attendance per year for all teams
team_total_attendance <-
ggplot(data = nfl_total_attendance,
aes(x = year,
y = total_attendance,
color = team_name)) +
geom_point(size = 1, alpha = .8) +
geom_smooth(size = .8, se = FALSE) +
scale_y_continuous(name = "Total Attendance") +
scale_x_continuous(name = "Year") +
ggtitle("Total Attendance Per Year") +
labs(col = "Team Name") +
theme_stata()
ggplotly(team_total_attendance)
As evident in the above visualization, the Dallas Cowboys appear to have the strongest fan base, and the Los Angeles Chargers appear to have the weakest fan base. Because the plot is rendered with the ggplotly function, it is interactive: double-click on any team to see its attendance trend since 2000. Next, we wanted to break this down by division. To do this, we added a column called division to the dataset.
# Attach the dataset to avoid calling on specific columns
attach(nfl_total_attendance)
# Create new column
nfl_total_attendance$division <-
ifelse(team_name == "Patriots" | team_name == "Bills" | team_name == "Jets" | team_name == "Dolphins", "AFC East",
ifelse(team_name == "Ravens" | team_name == "Steelers" | team_name == "Bengals" | team_name == "Browns", "AFC North",
ifelse(team_name == "Texans" | team_name == "Titans" | team_name == "Colts" | team_name == "Jaguars", "AFC South",
ifelse(team_name == "Chiefs" | team_name == "Broncos" | team_name == "Raiders" | team_name == "Chargers", "AFC West",
ifelse(team_name == "Eagles" | team_name == "Cowboys" | team_name == "Giants" | team_name == "Redskins", "NFC East",
ifelse(team_name == "Packers" | team_name == "Vikings" | team_name == "Bears" | team_name == "Lions", "NFC North",
ifelse(team_name == "Saints" | team_name == "Falcons" | team_name == "Buccaneers" | team_name == "Panthers", "NFC South",
ifelse(team_name == "49ers" | team_name == "Seahawks" | team_name == "Rams" | team_name == "Cardinals", "NFC West",
NA))))))) )
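The nested ifelse() calls can be hard to read and rely on attach(). As a hedged alternative, the same mapping could be sketched with dplyr's case_when() and %in%. Only two divisions are shown, on a made-up data frame, to keep the example short:

```r
library(dplyr)

# Toy data frame; "Unknown" is a deliberately unmatched value
toy_teams <- data.frame(
  team_name = c("Bengals", "Steelers", "Cowboys", "Unknown"),
  stringsAsFactors = FALSE
)

toy_teams <- toy_teams %>%
  mutate(division = case_when(
    team_name %in% c("Ravens", "Steelers", "Bengals", "Browns") ~ "AFC North",
    team_name %in% c("Eagles", "Cowboys", "Giants", "Redskins") ~ "NFC East",
    TRUE ~ NA_character_  # Anything unmatched stays missing
  ))

toy_teams
```

Each condition sits on its own line, so adding the remaining six divisions is a matter of adding six more lines in the same pattern.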
Now that the division column exists, the strongest and weakest fan bases per division can be seen in the table and visualizations below:
# Create table for attendance summary
attendance_summary <- matrix(c("New York Jets", "Miami Dolphins", "Baltimore Ravens", "Cincinnati Bengals", "Houston Texans", "Indianapolis Colts", "Kansas City Chiefs", "Los Angeles Chargers", "Dallas Cowboys", "Washington Redskins", "Green Bay Packers", "Detroit Lions", "New Orleans Saints", "Tampa Bay Buccaneers", "Los Angeles Rams", "Arizona Cardinals"), ncol = 2, byrow = TRUE)
colnames(attendance_summary) <- c("Strongest Fan Base","Weakest Fan Base")
rownames(attendance_summary) <- c("AFC East","AFC North","AFC South", "AFC West", "NFC East", "NFC North", "NFC South", "NFC West")
attendance_summary <- as.table(attendance_summary)
kable(attendance_summary)
| | Strongest Fan Base | Weakest Fan Base |
|---|---|---|
| AFC East | New York Jets | Miami Dolphins |
| AFC North | Baltimore Ravens | Cincinnati Bengals |
| AFC South | Houston Texans | Indianapolis Colts |
| AFC West | Kansas City Chiefs | Los Angeles Chargers |
| NFC East | Dallas Cowboys | Washington Redskins |
| NFC North | Green Bay Packers | Detroit Lions |
| NFC South | New Orleans Saints | Tampa Bay Buccaneers |
| NFC West | Los Angeles Rams | Arizona Cardinals |
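The summary table above is typed in by hand. One way to derive a similar table from the data is sketched below. It assumes that "strongest" means the highest average season attendance and "weakest" the lowest, which is an assumption for illustration and not a definition given in this report. The toy values are made up.

```r
library(dplyr)

# Toy season totals for two divisions (hypothetical values)
toy_totals <- data.frame(
  division = c("AFC North", "AFC North", "AFC North", "AFC North",
               "NFC East", "NFC East"),
  team_name = c("Ravens", "Ravens", "Bengals", "Bengals",
                "Cowboys", "Redskins"),
  year = c(2018, 2019, 2018, 2019, 2019, 2019),
  total_attendance = c(1100, 1120, 900, 950, 1400, 1000),
  stringsAsFactors = FALSE
)

fan_base_summary <- toy_totals %>%
  # Average season attendance per team
  group_by(division, team_name) %>%
  summarise(mean_attendance = mean(total_attendance), .groups = "drop") %>%
  # Highest and lowest average within each division
  group_by(division) %>%
  summarise(
    strongest = team_name[which.max(mean_attendance)],
    weakest = team_name[which.min(mean_attendance)]
  )

fan_base_summary
```

A different metric, such as home attendance only, would only require changing the column passed to mean().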
# AFC East
ggplotly(
nfl_total_attendance %>%
filter(nfl_total_attendance$division == "AFC East") %>%
ggplot(aes(x = year,
y = total_attendance,
color = team_name)) +
geom_point(size = 1, alpha = .8) +
geom_smooth(size = .8, se = FALSE) +
scale_y_continuous(name = "Total Attendance") +
scale_x_continuous(name = "Year") +
ggtitle("AFC East Total Attendance Per Year") +
labs(col = "Team Name") +
theme_stata()
)
# AFC North
ggplotly(
nfl_total_attendance %>%
filter(division == "AFC North") %>%
ggplot(aes(x = year,
y = total_attendance,
color = team_name)) +
geom_point(size = 1, alpha = .8) +
geom_smooth(size = .8, se = FALSE) +
scale_y_continuous(name = "Total Attendance") +
scale_x_continuous(name = "Year") +
ggtitle("AFC North Total Attendance Per Year") +
labs(col = "Team Name") +
theme_stata()
)
# AFC South
ggplotly(
nfl_total_attendance %>%
filter(division == "AFC South") %>%
ggplot(aes(x = year,
y = total_attendance,
color = team_name)) +
geom_point(size = 1, alpha = .8) +
geom_smooth(size = .8, se = FALSE) +
scale_y_continuous(name = "Total Attendance") +
scale_x_continuous(name = "Year") +
ggtitle("AFC South Total Attendance Per Year") +
labs(col = "Team Name") +
theme_stata()
)
# AFC West
ggplotly(
nfl_total_attendance %>%
filter(division == "AFC West") %>%
ggplot(aes(x = year,
y = total_attendance,
color = team_name)) +
geom_point(size = 1, alpha = .8) +
geom_smooth(size = .8, se = FALSE) +
scale_y_continuous(name = "Total Attendance") +
scale_x_continuous(name = "Year") +
ggtitle("AFC West Total Attendance Per Year") +
labs(col = "Team Name") +
theme_stata()
)
# NFC East
ggplotly(
nfl_total_attendance %>%
filter(division == "NFC East") %>%
ggplot(aes(x = year,
y = total_attendance,
color = team_name)) +
geom_point(size = 1, alpha = .8) +
geom_smooth(size = .8, se = FALSE) +
scale_y_continuous(name = "Total Attendance") +
scale_x_continuous(name = "Year") +
ggtitle("NFC East Total Attendance Per Year") +
labs(col = "Team Name") +
theme_stata()
)
# NFC North
ggplotly(
nfl_total_attendance %>%
filter(division == "NFC North") %>%
ggplot(aes(x = year,
y = total_attendance,
color = team_name)) +
geom_point(size = 1, alpha = .8) +
geom_smooth(size = .8, se = FALSE) +
scale_y_continuous(name = "Total Attendance") +
scale_x_continuous(name = "Year") +
ggtitle("NFC North Total Attendance Per Year") +
labs(col = "Team Name") +
theme_stata()
)
# NFC South
ggplotly(
nfl_total_attendance %>%
filter(division == "NFC South") %>%
ggplot(aes(x = year,
y = total_attendance,
color = team_name)) +
geom_point(size = 1, alpha = .8) +
geom_smooth(size = .8, se = FALSE) +
scale_y_continuous(name = "Total Attendance") +
scale_x_continuous(name = "Year") +
ggtitle("NFC South Total Attendance Per Year") +
labs(col = "Team Name") +
theme_stata()
)
# NFC West
ggplotly(
nfl_total_attendance %>%
filter(division == "NFC West") %>%
ggplot(aes(x = year,
y = total_attendance,
color = team_name)) +
geom_point(size = 1, alpha = .8) +
geom_smooth(size = .8, se = FALSE) +
scale_y_continuous(name = "Total Attendance") +
scale_x_continuous(name = "Year") +
ggtitle("NFC West Total Attendance Per Year") +
labs(col = "Team Name") +
theme_stata()
)
With these attendance statistics in hand, we wanted to see whether stronger home attendance is associated with a higher total number of wins. We also looked at away attendance for comparison, although a team has little control over it, as its most loyal fans are assumed to be unlikely attendees at an away game.
# Select columns from nfl_standings
nfl_wins <-
nfl_standings %>%
select(team_name, year, total_wins)
# Perform left join to get needed statistics
joined_data <- left_join(nfl_total_attendance, nfl_wins, by = c("team_name", "year"))
joined_data
## # A tibble: 638 x 8
## team_location team_name year total_attendance total_home_atte~
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Arizona Cardinals 2000 893926 387475
## 2 Atlanta Falcons 2000 964579 422814
## 3 Baltimore Ravens 2000 1062373 551695
## 4 Buffalo Bills 2000 1098587 560695
## 5 Carolina Panthers 2000 1095192 583489
## 6 Chicago Bears 2000 1080684 535552
## 7 Cincinnati Bengals 2000 967434 469992
## 8 Cleveland Browns 2000 1057139 581544
## 9 Dallas Cowboys 2000 1075470 504360
## 10 Denver Broncos 2000 1140030 604042
## # ... with 628 more rows, and 3 more variables: total_away_attendance <dbl>,
## # division <chr>, total_wins <dbl>
# Attach the dataset
attach(joined_data)
# Create linear model
home_attendance_model <- lm(total_wins ~ total_home_attendance)
summary(home_attendance_model)
##
## Call:
## lm(formula = total_wins ~ total_home_attendance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.7799 -2.2250 -0.0726 2.2478 7.9490
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.225e+00 9.851e-01 4.289 2.07e-05 ***
## total_home_attendance 6.955e-06 1.809e-06 3.845 0.000133 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.051 on 636 degrees of freedom
## Multiple R-squared: 0.02272, Adjusted R-squared: 0.02118
## F-statistic: 14.78 on 1 and 636 DF, p-value: 0.0001327
cor(total_wins, total_home_attendance)
## [1] 0.1507174
# Plot data
ggplot(data = joined_data, aes(total_home_attendance, total_wins)) +
geom_point(size = 1, alpha = .8, col = "red") +
geom_smooth(method = "lm", size = .8, se = FALSE) +
xlab("Total Home Attendance") +
ylab("Total Wins") +
ggtitle("Total Home Attendance vs. Total Wins")
From the above visualization, there appears to be a slight, positive linear relationship between the predictor variable (X, total_home_attendance) and the response variable (Y, total_wins). The correlation coefficient between the two variables is 0.1507, and the relationship is statistically significant at the 99% confidence level (p-value of 0.000133). The effect is small, however: with an R-squared of 0.0227, home attendance explains only about 2% of the variation in total wins. The lm() function was used to perform simple linear regression between the two variables.
# joined_data is still attached from the home attendance model above
# Create linear model
away_attendance_model <- lm(total_wins ~ total_away_attendance)
summary(away_attendance_model)
##
## Call:
## lm(formula = total_wins ~ total_away_attendance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.8922 -2.2928 -0.1177 2.2793 7.7255
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.310e-01 2.571e+00 -0.129 0.89758
## total_away_attendance 1.539e-05 4.751e-06 3.238 0.00126 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.061 on 636 degrees of freedom
## Multiple R-squared: 0.01622, Adjusted R-squared: 0.01468
## F-statistic: 10.49 on 1 and 636 DF, p-value: 0.001265
cor(total_wins, total_away_attendance)
## [1] 0.1273652
# Plot data
ggplot(data = joined_data, aes(total_away_attendance, total_wins)) +
geom_point(size = 1, alpha = .8, col = "magenta") +
geom_smooth(method = "lm", size = .8, se = FALSE) +
xlab("Total Away Attendance") +
ylab("Total Wins") +
ggtitle("Total Away Attendance vs. Total Wins")
From the above visualization, there also appears to be a very slight, positive linear relationship between the predictor variable (X, total_away_attendance) and the response variable (Y, total_wins). The correlation coefficient between the two variables is 0.1274, and the relationship is statistically significant at the 99% confidence level (p-value of 0.00126). As with home attendance, the effect is small: the R-squared of 0.0162 means away attendance explains less than 2% of the variation in total wins. The lm() function was used to perform simple linear regression between the two variables.
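To put the two attendance models side by side, the short sketch below reuses only numbers copied from the regression output above. It checks that, in simple linear regression, R-squared equals the squared correlation coefficient, and it rescales the slopes to wins per 100,000 attendees, which is easier to read than wins per single attendee. The variable names are introduced here for illustration.

```r
# Values copied from the regression output above
r_home <- 0.1507174      # cor(total_wins, total_home_attendance)
r_away <- 0.1273652      # cor(total_wins, total_away_attendance)
slope_home <- 6.955e-06  # fitted wins per additional home attendee
slope_away <- 1.539e-05  # fitted wins per additional away attendee

# In simple linear regression, R-squared equals the squared correlation
r_squared <- c(home = r_home^2, away = r_away^2)
round(r_squared, 5)

# Rescale the slopes to a more readable unit: wins per 100,000 attendees
wins_per_100k <- c(home = slope_home, away = slope_away) * 1e5
round(wins_per_100k, 2)
```

The squared correlations reproduce the reported R-squared values (0.02272 and 0.01622). On the home side, an extra 100,000 fans over a season corresponds to roughly 0.7 additional fitted wins.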
This part of our analysis will look into the qualities of the division winners and the attributes that teams high in the standings have over teams in the lower portion of the standings. Furthermore, this approach will discover what separates Super Bowl Champions from the 31 other teams each season.
First, we wanted to see which division has brought home the most Super Bowl Championships over the past 20 years. To do so, we once again added a “Division” column, this time to nfl_standings. The result can be seen in the graphic below.
# Attach the dataset
attach(nfl_standings)
# Create new column
nfl_standings$division <-
ifelse(team_name == "Patriots" | team_name == "Bills" | team_name == "Jets" | team_name == "Dolphins", "AFC East",
ifelse(team_name == "Ravens" | team_name == "Steelers" | team_name == "Bengals" | team_name == "Browns", "AFC North",
ifelse(team_name == "Texans" | team_name == "Titans" | team_name == "Colts" | team_name == "Jaguars", "AFC South",
ifelse(team_name == "Chiefs" | team_name == "Broncos" | team_name == "Raiders" | team_name == "Chargers", "AFC West",
ifelse(team_name == "Eagles" | team_name == "Cowboys" | team_name == "Giants" | team_name == "Redskins", "NFC East",
ifelse(team_name == "Packers" | team_name == "Vikings" | team_name == "Bears" | team_name == "Lions", "NFC North",
ifelse(team_name == "Saints" | team_name == "Falcons" | team_name == "Buccaneers" | team_name == "Panthers", "NFC South",
ifelse(team_name == "49ers" | team_name == "Seahawks" | team_name == "Rams" | team_name == "Cardinals", "NFC West",
NA))))))) )
# Plot Super Bowl Championships per division
ggplot(data = nfl_standings,
aes(reorder(division, -sb_winner), sb_winner, col = team_name)) + geom_col() +
ggtitle("Super Bowl Winners by Division") +
xlab("Division") + ylab("Count of Super Bowl Championships Won") +
labs(col = "Team Name")
As evident in the above visualization, the AFC East has had the most Super Bowl wins between 2000 and 2019. This can be largely attributed to the New England Patriots’ former quarterback Tom Brady and current head coach Bill Belichick, who brought home championships for the 2001, 2003, 2004, 2014, 2016, and 2018 seasons (each Super Bowl is played early in the following calendar year, so these games took place in 2002, 2004, 2005, 2015, 2017, and 2019). Additionally, the second-best division appears to be the AFC North, with the Pittsburgh Steelers and Baltimore Ravens each winning two Super Bowl Championships. Conversely, the AFC South, NFC North, and NFC West have each won only one Super Bowl over the past two decades.
Analyzing NFL standings with the given datasets is a bit tricky because standings are decided by tie-breakers when necessary. Additionally, which teams make the playoffs is largely based on division success. As a result, the team with the most wins might not be the team with the best standing. For this analysis, we decided to break the teams down by division and see which ones have been dominant over the years.
We analyzed their success by using summary statistics showing the Average Total Wins, Average Total Losses, Average Points Per Game, and Average Opponent Points Per Game. The results can be seen below. We also developed box plots for the average total wins per season by division to analyze the range of data for each team and any relevant outliers. At the end of this analysis, we then grouped the box plots by conference (AFC vs. NFC).
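The most dominant team in each division, defined by the highest average total wins (as computed in the analysis below), is as follows:
- AFC East: Patriots (11.85)
- AFC North: Steelers (10.25)
- AFC South: Colts (9.85)
- AFC West: Broncos (9.10)
- NFC East: Eagles (9.50)
- NFC North: Packers (9.85)
- NFC South: Saints (9.15)
- NFC West: Seahawks (9.10)

Please see the full breakdown below.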
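The per-division summaries that follow all repeat the same filter, group, and average steps, so the pattern can be captured once in a function. The sketch below is an illustration only: `summarise_division()` and `toy_standings` are hypothetical names, the data is made up, and the helper is written in base R so that it runs without extra packages. It assumes 16-game regular seasons, as the report does when dividing by 16.

```r
# Hypothetical helper: average wins, losses, and per-game scoring for one division.
# Assumes 16-game regular seasons, matching the division by 16 used in this report.
summarise_division <- function(standings, division_name, games_per_season = 16) {
  # which() drops rows whose division is NA instead of returning NA rows
  in_division <- standings[which(standings$division == division_name), ]
  averages <- aggregate(
    cbind(total_wins, total_losses, points_for, points_against) ~ team_name,
    data = in_division,
    FUN = mean
  )
  # Convert season point totals to per-game averages
  averages$points_for <- averages$points_for / games_per_season
  averages$points_against <- averages$points_against / games_per_season
  names(averages) <- c("Team Name", "Average Total Wins", "Average Total Losses",
                       "Average Points Per Game", "Average Opponent Points Per Game")
  averages
}

# Toy data (made up for illustration, not taken from nfl_standings)
toy_standings <- data.frame(
  team_name = c("A", "A", "B", "B"),
  division = "Toy Division",
  total_wins = c(10, 12, 4, 6),
  total_losses = c(6, 4, 12, 10),
  points_for = c(400, 432, 288, 320),
  points_against = c(320, 304, 400, 416)
)
summarise_division(toy_standings, "Toy Division")
```

With the real data, a call such as `kable(summarise_division(nfl_standings, "AFC East"))` should give the same kind of table as the dplyr pipelines below, provided nfl_standings has the division column added earlier.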
# AFC East
afc_east <- nfl_standings %>%
filter(division == "AFC East") %>%
select(team_name, total_wins, points_for, total_losses, points_against) %>%
group_by(team_name) %>%
summarise(total_wins = mean(total_wins), total_losses = mean(total_losses), points_for = mean(points_for)/16, points_against = mean(points_against)/16)
colnames(afc_east) <- c("Team Name", "Average Total Wins", "Average Total Losses", "Average Points Per Game", "Average Opponent Points Per Game")
kable(afc_east)
| Team Name | Average Total Wins | Average Total Losses | Average Points Per Game | Average Opponent Points Per Game |
|---|---|---|---|---|
| Bills | 6.85 | 9.15 | 19.86875 | 22.28125 |
| Dolphins | 7.45 | 8.55 | 19.85938 | 21.67500 |
| Jets | 7.40 | 8.60 | 19.83750 | 21.38750 |
| Patriots | 11.85 | 4.15 | 27.25937 | 18.57812 |
# AFC East box plot
afc_east_box <- nfl_standings %>%
filter(division == "AFC East") %>%
select(team_name, total_wins, points_for, total_losses, points_against)
boxplot_afceast <- ggplot(afc_east_box,
aes(team_name, total_wins)) +
geom_boxplot(col = "blue") +
ggtitle("AFC East") +
xlab("Team") + ylab("Total Wins") +
theme_stata()
boxplot_afceast
# AFC North
afc_north <- nfl_standings %>%
filter(division == "AFC North") %>%
select(team_name, total_wins, points_for, total_losses, points_against) %>%
group_by(team_name) %>%
summarise(total_wins = mean(total_wins), total_losses = mean(total_losses), points_for = mean(points_for)/16, points_against = mean(points_against)/16)
colnames(afc_north) <- c("Team Name", "Average Total Wins", "Average Total Losses", "Average Points Per Game", "Average Opponent Points Per Game")
kable(afc_north)
| Team Name | Average Total Wins | Average Total Losses | Average Points Per Game | Average Opponent Points Per Game |
|---|---|---|---|---|
| Bengals | 7.25 | 8.60 | 20.70625 | 22.20625 |
| Browns | 4.95 | 11.00 | 17.30313 | 23.00937 |
| Ravens | 9.50 | 6.50 | 22.42813 | 18.28125 |
| Steelers | 10.25 | 5.65 | 23.06563 | 18.45938 |
# AFC North box plot
afc_north_box <- nfl_standings %>%
filter(division == "AFC North") %>%
select(team_name, total_wins, points_for, total_losses, points_against)
boxplot_afcnorth <- ggplot(afc_north_box,
aes(team_name, total_wins)) +
geom_boxplot(col = "purple") +
ggtitle("AFC North") +
xlab("Team") + ylab("Total Wins") +
theme_stata()
boxplot_afcnorth
# AFC South
afc_south <- nfl_standings %>%
filter(division == "AFC South") %>%
select(team_name, total_wins, points_for, total_losses, points_against) %>%
group_by(team_name) %>%
summarise(total_wins = mean(total_wins), total_losses = mean(total_losses), points_for = mean(points_for)/16, points_against = mean(points_against)/16)
colnames(afc_south) <- c("Team Name", "Average Total Wins", "Average Total Losses", "Average Points Per Game", "Average Opponent Points Per Game")
kable(afc_south)
| Team Name | Average Total Wins | Average Total Losses | Average Points Per Game | Average Opponent Points Per Game |
|---|---|---|---|---|
| Colts | 9.850000 | 6.150000 | 24.85625 | 22.24063 |
| Jaguars | 6.350000 | 9.650000 | 19.57500 | 21.92188 |
| Texans | 7.277778 | 8.722222 | 20.86111 | 22.68056 |
| Titans | 8.000000 | 8.000000 | 21.35625 | 22.36563 |
# AFC South box plot
afc_south_box <- nfl_standings %>%
filter(division == "AFC South") %>%
select(team_name, total_wins, points_for, total_losses, points_against)
boxplot_afcsouth <- ggplot(afc_south_box,
aes(team_name, total_wins)) +
geom_boxplot(col = "red") +
ggtitle("AFC South") +
xlab("Team") + ylab("Total Wins") +
theme_stata()
boxplot_afcsouth
# AFC West
afc_west <- nfl_standings %>%
filter(division == "AFC West") %>%
select(team_name, total_wins, points_for, total_losses, points_against) %>%
group_by(team_name) %>%
summarise(total_wins = mean(total_wins), total_losses = mean(total_losses), points_for = mean(points_for)/16, points_against = mean(points_against)/16)
colnames(afc_west) <- c("Team Name", "Average Total Wins", "Average Total Losses", "Average Points Per Game", "Average Opponent Points Per Game")
kable(afc_west)
| Team Name | Average Total Wins | Average Total Losses | Average Points Per Game | Average Opponent Points Per Game |
|---|---|---|---|---|
| Broncos | 9.10 | 6.90 | 23.49687 | 21.70000 |
| Chargers | 8.10 | 7.90 | 24.05937 | 21.75000 |
| Chiefs | 8.30 | 7.70 | 23.28438 | 22.01250 |
| Raiders | 6.25 | 9.75 | 20.10000 | 24.45625 |
# AFC West box plot
afc_west_box <- nfl_standings %>%
filter(division == "AFC West") %>%
select(team_name, total_wins, points_for, total_losses, points_against)
boxplot_afcwest <- ggplot(afc_west_box,
aes(team_name, total_wins)) +
geom_boxplot(col = "seagreen") +
ggtitle("AFC West") +
xlab("Team") + ylab("Total Wins") +
theme_stata()
boxplot_afcwest
# NFC East
nfc_east <- nfl_standings %>%
filter(division == "NFC East") %>%
select(team_name, total_wins, points_for, total_losses, points_against) %>%
group_by(team_name) %>%
summarise(total_wins = mean(total_wins), total_losses = mean(total_losses), points_for = mean(points_for)/16, points_against = mean(points_against)/16)
colnames(nfc_east) <- c("Team Name", "Average Total Wins", "Average Total Losses", "Average Points Per Game", "Average Opponent Points Per Game")
kable(nfc_east)
| Team Name | Average Total Wins | Average Total Losses | Average Points Per Game | Average Opponent Points Per Game |
|---|---|---|---|---|
| Cowboys | 8.4 | 7.60 | 22.29688 | 21.60938 |
| Eagles | 9.5 | 6.45 | 24.19375 | 20.53750 |
| Giants | 7.9 | 8.10 | 22.01250 | 22.43437 |
| Redskins | 6.6 | 9.35 | 19.48750 | 22.42813 |
# NFC East box plot
nfc_east_box <- nfl_standings %>%
filter(division == "NFC East") %>%
select(team_name, total_wins, points_for, total_losses, points_against)
boxplot_nfceast <- ggplot(nfc_east_box,
aes(team_name, total_wins)) +
geom_boxplot() +
ggtitle("NFC East") +
xlab("Team") + ylab("Total Wins") +
theme_stata()
boxplot_nfceast
# NFC North
nfc_north <- nfl_standings %>%
filter(division == "NFC North") %>%
select(team_name, total_wins, points_for, total_losses, points_against) %>%
group_by(team_name) %>%
summarise(total_wins = mean(total_wins), total_losses = mean(total_losses), points_for = mean(points_for)/16, points_against = mean(points_against)/16)
colnames(nfc_north) <- c("Team Name", "Average Total Wins", "Average Total Losses", "Average Points Per Game", "Average Opponent Points Per Game")
kable(nfc_north)
| Team Name | Average Total Wins | Average Total Losses | Average Points Per Game | Average Opponent Points Per Game |
|---|---|---|---|---|
| Bears | 7.85 | 8.15 | 20.24063 | 20.82812 |
| Lions | 5.70 | 10.25 | 20.58438 | 24.61563 |
| Packers | 9.85 | 6.05 | 25.24063 | 21.25313 |
| Vikings | 8.25 | 7.65 | 22.67813 | 22.03438 |
# NFC North box plot
nfc_north_box <- nfl_standings %>%
filter(division == "NFC North") %>%
select(team_name, total_wins, points_for, total_losses, points_against)
boxplot_nfcnorth <- ggplot(nfc_north_box,
aes(team_name, total_wins)) +
geom_boxplot(col = "brown") +
ggtitle("NFC North") +
xlab("Team") + ylab("Total Wins") +
theme_stata()
boxplot_nfcnorth
# NFC South
nfc_south <- nfl_standings %>%
filter(division == "NFC South") %>%
select(team_name, total_wins, points_for, total_losses, points_against) %>%
group_by(team_name) %>%
summarise(total_wins = mean(total_wins), total_losses = mean(total_losses), points_for = mean(points_for)/16, points_against = mean(points_against)/16)
colnames(nfc_south) <- c("Team Name", "Average Total Wins", "Average Total Losses", "Average Points Per Game", "Average Opponent Points Per Game")
kable(nfc_south)
| Team Name | Average Total Wins | Average Total Losses | Average Points Per Game | Average Opponent Points Per Game |
|---|---|---|---|---|
| Buccaneers | 6.90 | 9.10 | 20.55313 | 21.93437 |
| Falcons | 8.20 | 7.75 | 22.61250 | 22.75313 |
| Panthers | 7.85 | 8.10 | 21.15625 | 21.61563 |
| Saints | 9.15 | 6.85 | 25.94063 | 23.29062 |
# NFC South box plot
nfc_south_box <- nfl_standings %>%
filter(division == "NFC South") %>%
select(team_name, total_wins, points_for, total_losses, points_against)
boxplot_nfcsouth <- ggplot(nfc_south_box,
aes(team_name, total_wins)) +
geom_boxplot(col = "magenta") +
ggtitle("NFC South") +
xlab("Team") + ylab("Total Wins") +
theme_stata()
boxplot_nfcsouth
# NFC West
nfc_west <- nfl_standings %>%
filter(division == "NFC West") %>%
select(team_name, total_wins, points_for, total_losses, points_against) %>%
group_by(team_name) %>%
summarise(total_wins = mean(total_wins), total_losses = mean(total_losses), points_for = mean(points_for)/16, points_against = mean(points_against)/16)
colnames(nfc_west) <- c("Team Name", "Average Total Wins", "Average Total Losses", "Average Points Per Game", "Average Opponent Points Per Game")
kable(nfc_west)
| Team Name | Average Total Wins | Average Total Losses | Average Points Per Game | Average Opponent Points Per Game |
|---|---|---|---|---|
| 49ers | 7.40 | 8.55 | 21.01562 | 22.39062 |
| Cardinals | 6.85 | 9.05 | 20.10938 | 23.65000 |
| Rams | 7.20 | 8.75 | 21.50313 | 23.70000 |
| Seahawks | 9.10 | 6.85 | 22.92188 | 20.56563 |
# NFC West box plot
nfc_west_box <- nfl_standings %>%
filter(division == "NFC West") %>%
select(team_name, total_wins, points_for, total_losses, points_against)
boxplot_nfcwest <- ggplot(nfc_west_box,
aes(team_name, total_wins)) +
geom_boxplot(col = "goldenrod4") +
ggtitle("NFC West") +
xlab("Team") + ylab("Total Wins") +
theme_stata()
boxplot_nfcwest
# Comparison by conference
ggarrange(boxplot_afceast, boxplot_afcnorth, boxplot_afcsouth, boxplot_afcwest)
ggarrange(boxplot_nfceast, boxplot_nfcnorth, boxplot_nfcsouth, boxplot_nfcwest)
Combining the above tables to form one table with average statistics, the following leaders can be found:
# Analysis of all teams in the NFL
division_leaders <- nfl_standings %>%
select(team_name, total_wins, points_for, total_losses, points_against) %>%
group_by(team_name) %>%
summarise(total_wins = mean(total_wins), total_losses = mean(total_losses), points_for = mean(points_for)/16, points_against = mean(points_against)/16)
colnames(division_leaders) <- c("Team Name", "Average Total Wins", "Average Total Losses", "Average Points Per Game", "Average Opponent Points Per Game")
kable(division_leaders)
| Team Name | Average Total Wins | Average Total Losses | Average Points Per Game | Average Opponent Points Per Game |
|---|---|---|---|---|
| 49ers | 7.400000 | 8.550000 | 21.01562 | 22.39062 |
| Bears | 7.850000 | 8.150000 | 20.24063 | 20.82812 |
| Bengals | 7.250000 | 8.600000 | 20.70625 | 22.20625 |
| Bills | 6.850000 | 9.150000 | 19.86875 | 22.28125 |
| Broncos | 9.100000 | 6.900000 | 23.49687 | 21.70000 |
| Browns | 4.950000 | 11.000000 | 17.30313 | 23.00937 |
| Buccaneers | 6.900000 | 9.100000 | 20.55313 | 21.93437 |
| Cardinals | 6.850000 | 9.050000 | 20.10938 | 23.65000 |
| Chargers | 8.100000 | 7.900000 | 24.05937 | 21.75000 |
| Chiefs | 8.300000 | 7.700000 | 23.28438 | 22.01250 |
| Colts | 9.850000 | 6.150000 | 24.85625 | 22.24063 |
| Cowboys | 8.400000 | 7.600000 | 22.29688 | 21.60938 |
| Dolphins | 7.450000 | 8.550000 | 19.85938 | 21.67500 |
| Eagles | 9.500000 | 6.450000 | 24.19375 | 20.53750 |
| Falcons | 8.200000 | 7.750000 | 22.61250 | 22.75313 |
| Giants | 7.900000 | 8.100000 | 22.01250 | 22.43437 |
| Jaguars | 6.350000 | 9.650000 | 19.57500 | 21.92188 |
| Jets | 7.400000 | 8.600000 | 19.83750 | 21.38750 |
| Lions | 5.700000 | 10.250000 | 20.58438 | 24.61563 |
| Packers | 9.850000 | 6.050000 | 25.24063 | 21.25313 |
| Panthers | 7.850000 | 8.100000 | 21.15625 | 21.61563 |
| Patriots | 11.850000 | 4.150000 | 27.25937 | 18.57812 |
| Raiders | 6.250000 | 9.750000 | 20.10000 | 24.45625 |
| Rams | 7.200000 | 8.750000 | 21.50313 | 23.70000 |
| Ravens | 9.500000 | 6.500000 | 22.42813 | 18.28125 |
| Redskins | 6.600000 | 9.350000 | 19.48750 | 22.42813 |
| Saints | 9.150000 | 6.850000 | 25.94063 | 23.29062 |
| Seahawks | 9.100000 | 6.850000 | 22.92188 | 20.56563 |
| Steelers | 10.250000 | 5.650000 | 23.06563 | 18.45938 |
| Texans | 7.277778 | 8.722222 | 20.86111 | 22.68056 |
| Titans | 8.000000 | 8.000000 | 21.35625 | 22.36563 |
| Vikings | 8.250000 | 7.650000 | 22.67813 | 22.03438 |
In the next step of our analysis, we investigated the importance of an offense and a defense. The offense in football is the 11 players who are on the field for a team when they have the ball. Conversely, the defense is the 11 players on the field when the other team has the ball. Sports writers and analysts have argued over the years whether a better offense or defense is more critical to a team’s success. Utilizing the nfl_standings dataset, we sought to analyze this discussion.
First, we created a linear model of a team’s wins in a season using offensive_ranking and defensive_ranking as the predictor variables.
# Attach the dataset
attach(nfl_standings)
# Create a linear model
rankings_model <- lm(total_wins ~ offensive_ranking + defensive_ranking)
summary(rankings_model)
##
## Call:
## lm(formula = total_wins ~ offensive_ranking + defensive_ranking)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4463 -0.9950 -0.0159 1.0455 5.0687
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.98487 0.05837 136.81 <2e-16 ***
## offensive_ranking 0.43617 0.01381 31.58 <2e-16 ***
## defensive_ranking 0.43745 0.01681 26.03 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.474 on 635 degrees of freedom
## Multiple R-squared: 0.7722, Adjusted R-squared: 0.7715
## F-statistic: 1076 on 2 and 635 DF, p-value: < 2.2e-16
The model shows that both offensive_ranking and defensive_ranking are statistically significant predictors of a team’s total wins at the 99% confidence level, and together they explain about 77% of the variation in total wins (R-squared of 0.7722). To drill deeper, we computed the correlation coefficient between each predictor variable and total wins.
# Run correlation tests
cor(offensive_ranking,total_wins)
## [1] 0.7274288
cor(defensive_ranking,total_wins)
## [1] 0.6437138
The offensive_ranking had a correlation coefficient of 0.7274288 and the defensive_ranking had a correlation coefficient of 0.6437138. As such, it appears that a team’s offense has a greater correlation to a team’s wins than its defense. To visualize this, we plotted two graphs to further test this hypothesis:
# Create rankings graphs with confidence bands
Def_Ranking_Plot <- ggplot(data = nfl_standings, aes(defensive_ranking, total_wins)) +
geom_smooth(col = "red") +
ggtitle("Total Wins | Defensive Ranking") +
xlab("Defensive Ranking") +
ylab("Total Wins") +
xlim(-10,10) + ylim(0,16) +
theme_stata()
Off_Ranking_Plot <- ggplot(data = nfl_standings, aes(offensive_ranking, total_wins)) + geom_smooth() +
ggtitle("Total Wins | Offensive Ranking") +
xlab("Offensive Ranking") + ylab("Total Wins") +
xlim(-10,10) + ylim(0,16) +
theme_stata()
# Put graphs next to each other
ggarrange(Off_Ranking_Plot,Def_Ranking_Plot)
These graphs confirm the positive correlation between offensive or defensive ranking and a team’s wins. Additionally, the confidence band for the defensive ranking is wider than the band for the offensive ranking. This is consistent with our conclusion that offensive ranking correlates more strongly with wins than defensive ranking does.
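To make the rankings_model coefficients reported above more concrete, the sketch below plugs rankings into the fitted equation. The coefficients are copied from the summary output, `predict_wins()` is a hypothetical helper introduced for illustration, and the second call uses the 2004 Patriots' rankings as listed in the Super Bowl champions table later in this report.

```r
# Coefficients copied from summary(rankings_model) above
intercept <- 7.98487
b_offense <- 0.43617
b_defense <- 0.43745

# Hypothetical helper: fitted season wins for a given pair of rankings
predict_wins <- function(offensive_ranking, defensive_ranking) {
  intercept + b_offense * offensive_ranking + b_defense * defensive_ranking
}

predict_wins(0, 0)      # a team with both rankings at zero
predict_wins(6.4, 6.5)  # rankings of the 2004 Patriots
```

A team with both rankings at zero is fitted at about 8 wins, an even record in a 16-game season. The two slopes are nearly identical, so a one-point gain in either ranking adds roughly 0.44 fitted wins.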
Using different statistics now, we changed the predictor variables to points_for and points_against, as these represent offensive and defensive success, respectively. We took the same approach as with the previous variables, and additionally colored each point in the scatter plots by the binary playoffs variable to see how scoring or giving up points relates to making the playoffs.
# Create linear model
points_model <- lm(total_wins ~ points_for + points_against)
summary(points_model)
##
## Call:
## lm(formula = total_wins ~ points_for + points_against)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3445 -0.8470 -0.0334 0.8322 3.9053
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.7362443 0.4184560 20.88 <2e-16 ***
## points_for 0.0270262 0.0006989 38.67 <2e-16 ***
## points_against -0.0291728 0.0008380 -34.81 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.238 on 635 degrees of freedom
## Multiple R-squared: 0.8395, Adjusted R-squared: 0.839
## F-statistic: 1660 on 2 and 635 DF, p-value: < 2.2e-16
# Run correlation tests
cor(points_for,total_wins)
## [1] 0.7300927
cor(points_against,total_wins)
## [1] -0.6792343
# Create scatter plots
Points_For_Plot <- ggplot(data = nfl_standings, aes(points_for, total_wins, col = as.character(playoffs))) +
geom_point() +
labs(col = "Playoffs") +
ggtitle("Total Wins | Total Points Scored") +
xlab("Total Points Scored") +
ylab("Total Wins") +
ylim(0,16) +
theme_stata()
Points_Against_Plot <- ggplot(data = nfl_standings, aes(points_against, total_wins, col = as.character(playoffs))) +
geom_point() +
labs(col = "Playoffs") +
ggtitle("Total Wins | Total Points Against") +
xlab("Total Points Against") +
ylab("Total Wins") +
ylim(0,16) +
theme_stata()
# Put graphs next to each other
ggarrange(Points_For_Plot,Points_Against_Plot)
This model shows that points_for and points_against are also both statistically significant predictors of a team’s total wins. Additionally, their correlations with total wins are 0.7300927 and -0.6792343, respectively. This indicates a strong, positive relationship for points_for and a strong, negative relationship for points_against. The offensive side, once again, has a slightly stronger relationship.
The graphical representations show these strong linear relationships as well with playoff teams typically having a low Total Points Against and high Total Points For on the season.
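The points_model coefficients above can also be read as an exchange rate between points and wins. The sketch below, which uses only the coefficients copied from the summary output and variable names introduced for illustration, inverts each slope to estimate how many season points correspond to one additional fitted win when the other variable is held fixed.

```r
# Coefficients copied from summary(points_model) above
b_points_for <- 0.0270262
b_points_against <- -0.0291728

# Season points associated with one additional fitted win, other variable held fixed
points_scored_per_win <- 1 / b_points_for
points_prevented_per_win <- 1 / abs(b_points_against)

round(c(scored = points_scored_per_win, prevented = points_prevented_per_win), 1)
```

Under this model, roughly 37 additional points scored, or roughly 34 fewer points allowed, over a season correspond to one more fitted win.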
Although having both a good offense and a good defense is clearly important in the NFL, our results suggest that a better offense is slightly more strongly associated with success in the long run. The margin is close, and the question will continue to be debated. For one last piece of context, we developed a table of the last 20 Super Bowl winners with their offensive and defensive rankings:
# Create table for all Super Bowl Champions
super_bowl_champs <- nfl_standings %>%
filter(sb_winner == 1) %>%
select(year, team_name, offensive_ranking, defensive_ranking)
colnames(super_bowl_champs) <- c("Year", "Super Bowl Champion", "Offensive Ranking", "Defensive Ranking")
kable(super_bowl_champs)
| Year | Super Bowl Champion | Offensive Ranking | Defensive Ranking |
|---|---|---|---|
| 2000 | Ravens | 0.0 | 8.0 |
| 2001 | Patriots | 1.2 | 3.1 |
| 2002 | Buccaneers | -1.0 | 9.8 |
| 2003 | Patriots | 2.1 | 4.9 |
| 2004 | Patriots | 6.4 | 6.5 |
| 2005 | Steelers | 3.8 | 4.0 |
| 2006 | Colts | 6.9 | -1.1 |
| 2007 | Giants | 2.8 | 0.4 |
| 2008 | Steelers | 1.6 | 8.2 |
| 2009 | Saints | 11.2 | -0.5 |
| 2010 | Packers | 3.1 | 7.9 |
| 2011 | Giants | 3.1 | -1.5 |
| 2012 | Ravens | 1.9 | 1.0 |
| 2013 | Seahawks | 4.1 | 8.9 |
| 2014 | Patriots | 7.5 | 3.5 |
| 2015 | Broncos | 0.3 | 5.5 |
| 2016 | Patriots | 4.3 | 5.0 |
| 2017 | Eagles | 7.0 | 2.5 |
| 2018 | Patriots | 3.1 | 2.1 |
| 2019 | Chiefs | 6.2 | 2.9 |
Our last analysis takes a look at data from the individual NFL games. Using the nfl_games dataset, we investigated how points, yards, and turnovers relate to one another for winning and losing teams.
To analyze the correlation between these variables, we used the GGally package to produce a detailed scatter plot matrix. The function ggpairs() shows the distribution of each variable along the diagonal of the matrix. Pearson correlation estimates are seen in the upper-right. Scatter plots are seen in the lower-left. We analyzed six variables here: (1) Points Scored by Winning Team (pts_win); (2) Yards Gained by Winning Team (yds_win); (3) Turnovers Committed by Winning Team (turnovers_win); (4) Points Scored by Losing Team (pts_loss); (5) Yards Gained by Losing Team (yds_loss); and (6) Turnovers Committed by Losing Team (turnovers_loss).
We then grouped these variables by winning team v. losing team.
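The Pearson estimates that ggpairs() prints in the upper-right panels can also be computed directly with base R's cor(), which is handy when the numbers are needed without the plot. The sketch below uses made-up toy data (`toy_games` is hypothetical and not taken from nfl_games) purely to show the shape of the result.

```r
# Toy game-level data (made up for illustration, not taken from nfl_games)
toy_games <- data.frame(
  pts_win = c(31, 24, 17, 38, 27, 20),
  yds_win = c(420, 350, 290, 455, 380, 310),
  turnovers_win = c(1, 0, 2, 1, 3, 0)
)

# Pearson correlation matrix: symmetric, with ones on the diagonal
cor_matrix <- cor(toy_games, method = "pearson")
round(cor_matrix, 3)
```

On the real data, the equivalent call would be `cor(nfl_games[, c("pts_win", "yds_win", "turnovers_win")])`, assuming those columns contain no missing values.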
# Create correlation graphs for variables in nfl_games
ggpairs(nfl_games %>% select(pts_win, yds_win, turnovers_win))
As evident through both the scatter plots and the Pearson correlation estimates, there is little to no relationship between Points Scored by Winning Team and Turnovers Committed by Winning Team, or between Yards Gained by Winning Team and Turnovers Committed by Winning Team. Both of these correlation coefficients are close to zero.
On the other hand, there is a moderately strong, positive relationship between Points Scored by Winning Team and Yards Gained by Winning Team, with a Pearson correlation estimate of 0.537.
# Create correlation graphs for variables in nfl_games
ggpairs(nfl_games %>% select(pts_loss, yds_loss, turnovers_loss))
Very similar to the winning teams, there is little to no relationship between Points Scored by Losing Team and Turnovers Committed by Losing Team, or between Yards Gained by Losing Team and Turnovers Committed by Losing Team. Both of these correlation coefficients are close to zero.
On the other hand, there is a fairly strong, positive relationship between Points Scored by Losing Team and Yards Gained by Losing Team, with a Pearson correlation estimate of 0.632.
The main takeaway from these correlation matrices is that the more yards a team gains, the more points it tends to score. To compare a winning team and a losing team, we wanted to see whether more turnovers by the losing team are associated with more points for the winning team. Please see this below.
# Attach the dataset
attach(nfl_games)
# Create linear model
games_model <- lm(pts_win ~ turnovers_loss)
summary(games_model)
##
## Call:
## lm(formula = pts_win ~ turnovers_loss)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -24.598  -5.786  -0.598   5.425  33.308
##
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)    25.41092    0.21730  116.94   <2e-16 ***
## turnovers_loss  1.09370    0.08384   13.04   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.693 on 5322 degrees of freedom
## Multiple R-squared: 0.03098, Adjusted R-squared: 0.0308
## F-statistic: 170.2 on 1 and 5322 DF, p-value: < 2.2e-16
cor(turnovers_loss, pts_win)
## [1] 0.1760229
# Plot the data
ggplot(nfl_games) + geom_point(aes(x = turnovers_loss, y = pts_win), color = "coral1") +
ggtitle("Turnovers Committed by Losing Team vs. Points Scored by Winning Team")
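From the above graphic and model output, we see that there is a slight, positive relationship between the Turnovers Committed by Losing Team and Points Scored by Winning Team. The correlation coefficient between the two variables is 0.176, and the fitted slope indicates that each additional turnover by the losing team is associated with roughly 1.09 more points for the winning team. Although this slope is statistically significant (p < 2e-16), turnovers by the losing team explain only about 3% of the variation in the winning team’s points (R-squared of 0.031).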
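The model summary and the cor() output above are linked: in a simple linear regression with a single predictor, the Multiple R-squared equals the square of the Pearson correlation. The sketch below first checks this with the reported numbers, then confirms the identity on simulated data. Here toy_games is hypothetical, and its intercept, slope, and noise level are borrowed from the summary above only so the toy data resembles the real relationship.

```r
# Arithmetic check using the values reported above
r_reported <- 0.1760229
r_reported^2   # about 0.03098, matching the Multiple R-squared in the summary

# Hypothetical toy data (simulated, not the real nfl_games)
set.seed(1)
n <- 500
toy_games <- data.frame(turnovers_loss = rpois(n, lambda = 2))
toy_games$pts_win <- 25.4 + 1.09 * toy_games$turnovers_loss + rnorm(n, sd = 8.7)

# Fit the same form of model, passing the data explicitly rather than using attach()
toy_model <- lm(pts_win ~ turnovers_loss, data = toy_games)
r_squared <- summary(toy_model)$r.squared
r_toy     <- cor(toy_games$turnovers_loss, toy_games$pts_win)

# In simple linear regression, R-squared is the squared correlation
all.equal(r_squared, r_toy^2)
```

Passing data = to lm() is one way to avoid attach(), which can cause name clashes when several datasets share column names. It is shown here as an option, not as the method used in the analysis.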
In this analysis, the main goal was to understand what all goes into winning an NFL game and what teams are historically successful in the standings. We were able to successfully break out this analysis into four sections: (1) The Importance of Fan Attendance; (2) Standings over the Years; (3) Offense vs. Defense; and (4) Individual Game Observations.
Through extensive use of R, we investigated the nfl_games, nfl_attendance, and nfl_standings datasets. We frequently used linear modeling to explore correlations between variables across these datasets. Additionally, the ggplot2 package delivered great visualizations to showcase this breakdown of the NFL. New variables and tables were also created to drill deeper into the raw data. One of our primary focuses was a breakdown of the divisions and their successes over the past 20 years. Box plot visualizations between the two conferences illuminated how teams have fared in the win column from their best season to their worst season.
Our first analysis looked into NFL Fan Attendance. Graphical representations were created to better understand which teams have a strong fan base and how consistently fans show up from year to year. From this analysis, it was evident that the Dallas Cowboys have the strongest fan base and the Los Angeles Chargers have the weakest. Additionally, greater game attendance was positively correlated with a team’s total wins per season.
Secondly, we focused on the divisional standings through the years. As mentioned above, box plot visualizations by division showed the range of success for NFL teams. Per division, these teams have had the most success based on the nfl_standings dataset:
Using geom_col(), we observed that the AFC East has won the most Super Bowl Championships. This is due to the phenomenal success of Tom Brady and the New England Patriots during this time period.
Next, we researched one of the most common arguments in football: is offense or defense more important? We fit linear models to several variables in the nfl_standings data. High offensive rankings and defensive rankings both correlate with more wins for teams. Even though a great offense and a great defense are both important, the correlation tests indicated that a better offense is slightly more important to a team’s success than a better defense. We created a table of the last 20 Super Bowl Champions and showcased the offensive_ranking and defensive_ranking. Teams have been trending towards having better offenses in the last few years, as evident from this table.
Finally, we observed individual game data in the NFL. Through graphs created by ggpairs(), we were able to view correlation coefficients for six variables. The main conclusion we deduced from this is that a positive correlation exists between yards gained and points scored.
As big NFL fans, it was incredibly interesting to see how the NFL has worked during our entire lifetime. Also, it was intriguing to see our favorite teams’ successes over this time span. The Steelers have been comfortably better than the Bengals, but the next 20 years could be a different story. Some limitations of our analysis are the brief time period that the datasets cover (20 years) and the lack of detail in the nfl_games dataset (e.g., rushing yards, passing yards, and penalty yards). Additionally, there is no player information included in these datasets. Players and coaching staff would definitely impact the success of each team. If we were to continue this analysis and potentially collect more data, we could add in tie-breaker information and how weather affects scoring. Predictive modeling could also be a great next step to see how NFL franchises will perform based on prior information.
The NFL is one of the biggest industries in the world that has large implications on many levels. Sports gambling, the NFL Draft, fantasy football, and the common fan could all have different takeaways from this analysis that would help them better understand the recent history of the NFL. With fans across the globe, a deep dive into the NFL is exciting for many groups. Coaches and players would be able to more effectively prepare for their opponents, gamblers could make more educated bets, general managers could derive their team’s needs in the Draft, and the common fan could revel in their team’s history.
This data tells a phenomenal story of the state of the NFL. However, it is a game for a reason. No one will ever be able to fully predict NFL outcomes, and that is what makes the sport as intriguing as it is!