Project Introduction

Breaking Down the NFL

From a young age, I have watched NFL football and cheered for my team every week. Because my dad is from Pittsburgh and my parents met there, I was raised a lifelong Steelers fan. With that said, I wanted to examine not only my own team but also the league as a whole. The past 20 years have seen successes and failures from every NFL team. When choosing an option for my final capstone project, I was instantly drawn to extending a project I had previously done on the NFL, which has a plethora of publicly available data points. I thought, “What all goes into winning an NFL game, and which teams are historically successful in the final standings?” Using the past 20 years' worth of data, I sought to investigate this question.

My Focus

As mentioned, for my final capstone project, I am expanding upon my final project from Data Wrangling in R (BANA 7025) with Professor Tianhai Zu. I originally worked with a partner on that project; however, the extension is my own individual work. I plan to use R functions to deliver overall summary statistics on games and standings. Additionally, I will use the data to explore potential correlations and plot the corresponding data visualizations. Using descriptive analysis of the past 20 years, I am looking to see whether NFL teams show predictive tendencies.

This analysis includes data from 2000 - 2019. I added 2020 season data to every dataset aside from nfl_attendance and nfl_games, as these would be skewed if 2020 data were added. This skew would be due to the impact of COVID-19, which caused games to be moved to different days and times, caused some games to be cancelled, and left little to no attendance depending on location.

This NFL analysis consists of eight individual datasets:

  1. NFL Attendance (nfl_attendance)

  2. NFL Standings (nfl_standings)

  3. NFL Games (nfl_games)

  4. NFL Weather (nfl_weather)

  5. NFL Playoff Coaches and Quarterbacks (nfl_playoffs)

  6. NFL Passing Yards Leaders (nfl_passing)

  7. NFL Rushing Yards Leaders (nfl_rushing)

  8. NFL Penalty Yards Per Game (nfl_penalty)

More detailed information about each dataset can be found in the Data Preparation tab.

Goal

The NFL is a multi-billion dollar industry. Millions of fans across the world cheer for these 32 teams every year. People are now looking for ways to understand the game better.

Coaches want to understand what makes a team more successful. Sports gamblers want to get an edge and make the correct picks based on more than just gut feelings. Fans want to know if their team is progressing in the right direction. This analysis is useful for all of these situations. Using descriptive analysis, past results can be better explained. As such, trends can be deduced to predict how NFL games and seasons will occur. Although no one can see into the future, understanding the data sheds a better light on the probability of certain results occurring in the NFL.

The goal of my analysis is to inform my readers on what all goes into winning an NFL game. My hope is that the audience will finish reading my report and better understand historic trends and performance from teams, players, and coaches alike. As a final capstone project, I hope to demonstrate proficiency in R using R Markdown as well as flexdashboard with Shiny components.

Analytical Technique and Approach

Analytical Approach

The datasets contain loads of information for the NFL. With a wide range of variables, many options are available to analytically investigate the NFL. With the eight datasets at hand, I looked to compare them to draw conclusions about team performances. To see if statistical significance or rational conclusions related to the NFL could be realized, the following situations were explored:

  • The Importance of Fan Attendance - This data analysis will examine whether the number of fans in attendance correlates with a team’s success (number of games won). Additionally, it will compare how teams fare in home versus away games while keeping in mind the attendance at those games.
  • Standings over the Years - The NFL has two conferences: the American Football Conference (AFC) and the National Football Conference (NFC). Each conference contains four divisions with four teams in each division. Each division then has a winner over the 16 game regular season. This analysis will look into the qualities of the division winners and the attributes that teams high in the standings have over teams in the lower portion of the standings. Furthermore, this approach will discover what separates Super Bowl Champions from the 31 other teams each season.
  • Offense vs. Defense - The two main parts of an NFL team are the offense and defense. The goal for each team is to be great on both sides. However, this is rarely the case. Using individual game data and season-long statistics, I will give a thorough breakdown of how having a great offense or defense improves teams. I will also see whether a better offense or a better defense has been more critical to success over the years.
  • Individual Game Observations - The nfl_games dataset contains many variables for games. Turnovers, day of the week, points, etc. are shown for every match-up. The goal of this analysis is to find which of these variables correlate with teams winning or losing, and to test which of them are statistically significant.
  • The Impact of Weather on Game Outcomes - The nfl_weather dataset contains the information of both the home and away teams from 2000 - 2013. This dataset also includes three weather-related variables: (1) temperature, (2) humidity, and (3) wind speed (in mph). I want to see which teams perform under certain weather conditions. Additionally, I hope to create a few linear models to see if weather conditions can predict whether or not the game will be high-scoring or low-scoring.
  • Successful Teams, Head Coaches, and Quarterbacks - The nfl_playoffs dataset includes information of teams who went to the playoffs from 2000 - 2020. This dataset also includes the Super Bowl Champions. I am curious to analyze trends regarding the coaches and quarterbacks who led the teams to success. Are certain quarterbacks consistently better-performing? Are there better head coaches than others?
  • Passing Yards Leaders - The nfl_passing dataset includes information from the past 20 years on the players with the most passing yards. Which player has performed consistently over the past 20 years? Who is the “best”?
  • Rushing Yards Leaders - The nfl_rushing dataset has the same information as the nfl_passing dataset, except it focuses on rushing yards instead of passing yards. Which players had the most rushing yards each year from 2000 - 2020?
  • Average Penalty Yards Per Game - Penalties are game-changers when it comes to success in a football game. One mistake can lead to an automatic first down compared to what could have been a fourth down and ten yards. In this analysis, I want to see which teams have consistently lost yards in games due to penalties.
  • Rivalry Analysis - Every sports fanatic knows the top rivalries in the NFL. Regardless of whether you are a fan of these teams, many will tune into the game as the level of intensity is typically higher. With that said, I wanted to see which teams have been dominant in their respective rivalries.
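The linear models mentioned in the weather bullet above could take roughly the following shape. This is only a minimal sketch in base R: the data frame toy_weather and all of its values are invented for illustration, the column names are borrowed from the nfl_weather data dictionary, and the response total_points is an assumed definition of a "high-scoring" game rather than the one used in the final analysis.

```r
# Toy data, invented purely for illustration; these are not values from nfl_weather
toy_weather <- data.frame(
  temperature = c(35, 48, 60, 72, 81, 55, 40, 66),
  humidity    = c(70, 65, 50, 45, 60, 55, 80, 40),
  wind_mph    = c(18, 12, 5, 3, 8, 10, 20, 6),
  home_score  = c(13, 20, 27, 31, 24, 21, 10, 28),
  away_score  = c(10, 17, 24, 20, 21, 14, 9, 23)
)

# Assumed response: combined points scored by both teams in the game
toy_weather$total_points <- toy_weather$home_score + toy_weather$away_score

# Linear model of total points on the three weather variables
weather_fit <- lm(total_points ~ temperature + humidity + wind_mph, data = toy_weather)

# Each coefficient is the estimated change in total points per unit of that predictor
coef(weather_fit)
```

With the real data, summary(weather_fit) would additionally report standard errors and p-values for each weather variable.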

Packages Required

This project requires a variety of packages. Given there are over 10,000 packages in R, I want to focus on the ones that will provide me with the best results while cleaning and interpreting the data.

Some packages will be more useful than others. For example, ggplot2 allows for great visualizations that provide better understanding of the data. Additionally, dplyr can drill deeper into the eight datasets to come to conclusions that may be hidden at first. R has powerful functions that can derive explanations for questions to massive datasets. Please see below for all of the packages loaded for this analysis:

# Packages required
library(tidyverse) # Use to tidy data
library(dplyr) # Use to manipulate data
library(ggplot2) # Use to plot data and create visualizations
library(tibble) # Use to manipulate and re-imagine data
library(readr) # Use to import data cleanly and efficiently
library(DT) # Use to create comprehensive data tables with HTML output
library(knitr) # Use for dynamic report generation
library(base) # Contains Base R functions (always attached, so this call is optional)
library(ggthemes) # Use themes in data visualizations
library(plotly) # Use to plot data and create visualizations
library(ggpubr) # Use to show multiple plots at once
library(GGally) # Use to produce scatter plot matrix
library(rmarkdown) # Use to produce report
library(flexdashboard) # Use to produce flexdashboard
library(stringr) # Provides functions to work with strings
library(highcharter) # Includes shortcut functions to plot R objects
library(shinythemes) # Use to implement themes for output

Importing the Data

Importing the Data

Most of the data (nfl_attendance, nfl_standings, and nfl_games) was obtained from my professor, Tianhai Zu, for the Data Wrangling in R class. He had provided four different datasets from which to choose, and my partner and I chose the NFL option. These datasets can be found on GitHub. Reading the information on GitHub led me to the original source of the data, which is Pro Football Reference Standings and Pro Football Reference Attendance.

This NFL analysis consists of eight individual datasets - (1) nfl_attendance, (2) nfl_standings, (3) nfl_games, (4) nfl_weather, (5) nfl_playoffs, (6) nfl_passing, (7) nfl_rushing, and (8) nfl_penalty.

I first merged nfl_attendance and nfl_standings into one dataframe called nfl_df, joining on year, team_name, and team. The nfl_games dataset is recorded per game with separate home and away team columns, so it does not share those keys and is not part of this join. I decided it might be beneficial to have multiple frames of reference, some utilizing individual datasets and another looking at the combined dataframe. Rather than using str() and summary() to show descriptive statistics for each variable, I decided to create comprehensive tables. Then, in the Data Preparation tab, I cleaned every dataset.

# Get working directory
getwd()
#> [1] "C:/Users/katie/OneDrive - University of Cincinnati/SS21/Full Semester/Capstone (BANA 8083)/Project Files"
# Get the data
nfl_attendance <- readr::read_csv('attendance.csv')
nfl_standings <- readr::read_csv('updatedstandings.csv')
nfl_games <- readr::read_csv('games.csv')
nfl_weather <- readr::read_csv('weather.csv')
nfl_playoffs <- readr::read_csv('post_season.csv')
nfl_passing <- readr::read_csv('passing_yards_leaders.csv')
nfl_rushing <- readr::read_csv('rushing_yards_leaders.csv')
nfl_penalty <- readr::read_csv('penalty_yards_per_game.csv')

# To use 2020 data you need to update tidytuesdayR from GitHub
# Install via devtools::install_github("thebioengineer/tidytuesdayR")

# Either the ISO-8601 date or the year/week pair loads the same release
tuesdata <- tidytuesdayR::tt_load('2020-02-04')
#>     Downloading file 1 of 3: `attendance.csv`
#>     Downloading file 2 of 3: `games.csv`
#>     Downloading file 3 of 3: `standings.csv`
tuesdata <- tidytuesdayR::tt_load(2020, week = 6)
#>     Downloading file 1 of 3: `attendance.csv`
#>     Downloading file 2 of 3: `games.csv`
#>     Downloading file 3 of 3: `standings.csv`

attendance <- tuesdata$attendance

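# Join attendance and standings on their shared keys with dplyr.
# left_join() merges exactly two tables; a third positional data frame would be
# matched to the `copy` argument rather than joined, so nfl_games is left out here.
nfl_df <- dplyr::left_join(nfl_attendance, nfl_standings, by = c("year", "team_name", "team"))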
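A left join on year, team_name, and team only behaves as expected if those three columns identify a single standings row per team and season. The sketch below checks that assumption on toy tables. It uses base R's merge() with all.x = TRUE as a stand-in for dplyr::left_join(), and every table name and value in it is invented for illustration.

```r
# Toy tables, invented for illustration; column names mirror the join keys
toy_attendance <- data.frame(
  team = c("Pittsburgh", "Pittsburgh", "Dallas"),
  team_name = c("Steelers", "Steelers", "Cowboys"),
  year = c(2019, 2019, 2019),
  week = c(1, 2, 1),
  weekly_attendance = c(60000, 61000, 90000),
  stringsAsFactors = FALSE
)
toy_standings <- data.frame(
  team = c("Pittsburgh", "Dallas"),
  team_name = c("Steelers", "Cowboys"),
  year = c(2019, 2019),
  wins = c(8, 8),
  stringsAsFactors = FALSE
)

keys <- c("year", "team_name", "team")

# The right-hand table should have one row per key combination;
# otherwise the join duplicates rows from the left-hand table
stopifnot(!any(duplicated(toy_standings[keys])))

# all.x = TRUE keeps every attendance row, as a left join does
toy_joined <- merge(toy_attendance, toy_standings, by = keys, all.x = TRUE)

# The row count is unchanged and every row found a standings match
nrow(toy_joined) == nrow(toy_attendance)
anyNA(toy_joined$wins)
```

Running the same two checks on nfl_df after the real join is a cheap way to catch mismatched team spellings between files.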

Data Preparation

Attendance

As previously mentioned, the nfl_attendance dataset was obtained from Pro Football Reference. The original data contains 10,846 observations and eight variables. There are two character variables: team and team_name. There are six numeric variables: year, total, home, away, week, and weekly_attendance. The data was collected from 2000 - 2019, and the values were observed during the 17 weeks of each NFL regular season.

# Examine the structure of the dataset
datatable(head(nfl_attendance, 10))

# Create a data dictionary for attendance
var_names_att <- colnames(nfl_attendance)
var_types_att <- lapply(nfl_attendance, class) # lapply returns a list of the same length as X (a vector)
var_descriptions_att <- c("City or state in which the team originates", "Name or mascot of the team", "Year", "Total attendance per season", "Total attendance at home games per season", "Total attendance at away games per season", "Week in which game was played", "Attendance for given week")
data_dict_att <- as_tibble(cbind(var_names_att, var_types_att, var_descriptions_att))
colnames(data_dict_att) <- c("Variable Name", "Variable Data Type", "Variable Description")
kable(data_dict_att) # kable returns a single table for a single data object
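Because the same data dictionary steps are repeated for every dataset, they could be wrapped in one function. The helper below is a hypothetical sketch (make_data_dict is not part of the original project) written in base R. It also collapses columns that carry more than one class, such as c("hms", "difftime"), into a single string.

```r
# Hypothetical helper: build a data dictionary from any data frame
make_data_dict <- function(df, descriptions) {
  # One description is required per column
  stopifnot(length(descriptions) == ncol(df))
  data.frame(
    variable = names(df),
    # Collapse multi-class columns into one comma-separated string
    type = vapply(df, function(col) paste(class(col), collapse = ", "),
                  character(1), USE.NAMES = FALSE),
    description = descriptions,
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# Toy input, invented for illustration
toy <- data.frame(
  team = c("Pittsburgh", "Dallas"),
  year = c(2019, 2019),
  stringsAsFactors = FALSE
)
toy_dict <- make_data_dict(toy, c("City or state in which the team originates", "Year"))
toy_dict
```

The result is an ordinary data frame, so it can be passed to kable() in the same way as the tibble built above.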

Looking at the missing data values, the only column in which missing values exist is the weekly_attendance. This makes sense, as each NFL team has at least one bye week during the regular season. I decided to omit these values as they would skew the data and misrepresent the trends for each team.

colSums(is.na(nfl_attendance)) # Find the number of missing values per column
nfl_attendance <- na.omit(nfl_attendance)
colSums(is.na(nfl_attendance)) # Confirm there are no missing values
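na.omit() drops a row if any column is missing. Since only weekly_attendance has missing values here, the result is the same as filtering on that one column, but the targeted filter makes the intent explicit. A minimal sketch on toy rows (all values invented for illustration):

```r
# Toy rows, invented for illustration; NA marks a bye week
toy_att <- data.frame(
  team_name = c("Steelers", "Steelers", "Steelers"),
  week = c(1, 2, 3),
  weekly_attendance = c(60000, NA, 61000),
  stringsAsFactors = FALSE
)

# Count missing values per column before filtering
colSums(is.na(toy_att))

# Drop only the rows where weekly_attendance is missing
toy_att_clean <- toy_att[!is.na(toy_att$weekly_attendance), ]

# The bye-week row is gone; the other two weeks remain
nrow(toy_att_clean)
```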

Looking at the original dataset above, I first decided to rename the columns to better describe the data.

nfl_attendance <- nfl_attendance %>% dplyr::rename(
  team_location = team,
  total_attendance = total,
  total_home_attendance = home,
  total_away_attendance = away
)

Additionally, I split it into two dataframes, the first omitting the weekly data and the second omitting the season totals. This decision was made largely to remove duplicates, and I knew it would make for better visualizations during the exploratory data analysis (EDA).

The first dataset, nfl_total_attendance, drops the two columns week and weekly_attendance. This dataset shows the season attendance totals for each team. The second dataset, nfl_weekly_attendance, drops the season total columns (total, home, and away, renamed above to total_attendance, total_home_attendance, and total_away_attendance).

nfl_total_attendance <- nfl_attendance[-c(7, 8)] # Remove weekly data
nfl_total_attendance <- nfl_total_attendance[!duplicated(nfl_total_attendance), ] # Remove duplicates
datatable(head(nfl_total_attendance, 10))

nfl_weekly_attendance <- nfl_attendance[-c(4, 5, 6)] # Remove season total attendance data
datatable(head(nfl_weekly_attendance, 10))
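Dropping columns by position works as long as the column order never changes. As a hedged alternative, the same split can be written by column name, which keeps working if a column is added or reordered. The sketch uses base R and a toy data frame whose values are invented for illustration.

```r
# Toy rows, invented for illustration
toy_att <- data.frame(
  team_location = c("Pittsburgh", "Pittsburgh"),
  team_name = c("Steelers", "Steelers"),
  year = c(2019, 2019),
  total_attendance = c(1000000, 1000000),
  week = c(1, 2),
  weekly_attendance = c(60000, 61000),
  stringsAsFactors = FALSE
)

# Season totals: drop the weekly columns by name, then keep unique rows
weekly_cols <- c("week", "weekly_attendance")
toy_total <- unique(toy_att[, !(names(toy_att) %in% weekly_cols)])

# Weekly data: drop the season-total column by name
toy_weekly <- toy_att[, names(toy_att) != "total_attendance"]
```

Here toy_total collapses to a single row per team and season, which is the de-duplication the split is meant to achieve.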

Now, for a summary of the two datasets and associated tables of the CLEANED data, please see below.

NFL Total Attendance Dataset

Data Dictionary for the NFL Total Attendance Dataset

| Variable Name | Variable Data Type | Variable Description |
|---|---|---|
| team_location | character | City or state in which the team originates |
| team_name | character | Name or mascot of the team |
| year | numeric | Year |
| total_attendance | numeric | Total attendance per season |
| total_home_attendance | numeric | Total attendance at home games per season |
| total_away_attendance | numeric | Total attendance at away games per season |

NFL Weekly Attendance Dataset

Data Dictionary for the NFL Weekly Attendance Dataset

| Variable Name | Variable Data Type | Variable Description |
|---|---|---|
| team_location | character | City or state in which the team originates |
| team_name | character | Name or mascot of the team |
| year | numeric | Year |
| week | numeric | Week in which game was played |
| weekly_attendance | numeric | Attendance for given week |

Standings

The nfl_standings dataset was imported and obtained from Pro Football Reference. The original data contains 638 observations and 15 variables. There are four character type variables, team, team_name, playoffs, and sb_winner. There are 11 numeric type variables, year, wins, loss, points_for, points_against, points_differential, margin_of_victory, strength_of_schedule, simple_rating, offensive_ranking, and defensive_ranking. The data observed was collected from 2000 - 2020. The process of cleaning the ORIGINAL data can be seen below.

# Examine the structure of the dataset
datatable(head(nfl_standings, 10))

# Create a data dictionary for standings
var_names_st <- colnames(nfl_standings)
var_types_st <- lapply(nfl_standings, class) # lapply returns a list of the same length as X (a vector)
var_descriptions_st <- c("City or state in which the team originates", "Name or mascot of the team", "Year", "Total wins per season (0 to 16)", "Total losses per season (0 to 16)", "Total points the team scored per season", "Total points the opponent scored on the team per season", "The difference between the total points for the team and against the team", "Points differential divided by the total number of games per season", "Difficulty of schedule based on opponent records", "A rating for the team that takes into account points differential and strength of schedule (measured by Simple Rating System)", "A rating comparing how well the offense performs to opponent teams (measured by Simple Rating System)", "A rating comparing how well the defense performs to opponent teams (measured by Simple Rating System)", "Stating whether or not the team made it to the playoffs", "Stating whether or not the team won the Super Bowl for the season")
data_dict_st <- as_tibble(cbind(var_names_st, var_types_st, var_descriptions_st))
colnames(data_dict_st) <- c("Variable Name", "Variable Data Type", "Variable Description")
kable(data_dict_st) # kable returns a single table for a single data object

Looking at the above dataset, I first decided to change the column names to better describe the data.

nfl_standings <- nfl_standings %>% dplyr::rename(
  team_location = team,
  total_wins = wins,
  total_losses = loss
)

It is important to note as well that a few of the variable names refer to calculated values. The calculated value for points_differential is: points_differential = points_for - points_against. Additionally, margin_of_victory is calculated by: margin_of_victory = (points_for - points_against) / games_played.

Lastly, the simple_rating is calculated by: \[SRS = MoV + SoS = OSRS + DSRS\]

In layman’s terms, the simple rating (SRS) is equal to the margin of victory (MoV) plus the strength of schedule (SoS). This is also equal to the offensive rating (OSRS) plus the defensive rating (DSRS), which are stored in the offensive_ranking and defensive_ranking columns.
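To make the calculated columns concrete, here is a small worked example in R. The season totals and the strength of schedule value are hypothetical numbers chosen for easy arithmetic; they do not describe any real team.

```r
# Hypothetical season totals, invented for illustration
points_for <- 400
points_against <- 320
games_played <- 16

# points_differential = points_for - points_against
points_differential <- points_for - points_against

# margin_of_victory = points differential per game
margin_of_victory <- points_differential / games_played

# SRS = MoV + SoS, with a hypothetical strength of schedule
strength_of_schedule <- 1.5
simple_rating <- margin_of_victory + strength_of_schedule

points_differential  # 80
margin_of_victory    # 5
simple_rating        # 6.5
```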

Next, I wanted to see what the sum of missing values was per column. As evident below, there are no missing values.

colSums(is.na(nfl_standings))

Moving forward, I decided to change both the playoffs and sb_winner to binary variables. This is because they both only have two unique values.

unique(nfl_standings$playoffs, incomparables = FALSE) # View the unique values for the playoffs column
unique(nfl_standings$sb_winner, incomparables = FALSE) # View the unique values for the sb_winner column

Knowing this, I changed the two columns to binary variables. For the playoffs column, a value of one stands for “Playoffs”, and a value of zero stands for “No Playoffs”.

nfl_standings$playoffs[nfl_standings$playoffs == "Playoffs"] <- "1"
nfl_standings$playoffs[nfl_standings$playoffs == "No Playoffs"] <- "0"

nfl_standings$playoffs <- as.numeric(nfl_standings$playoffs)

For the sb_winner column, a value of one denotes “Won Superbowl”, and a value of zero denotes “No Superbowl”.

nfl_standings$sb_winner[nfl_standings$sb_winner == "Won Superbowl"] <- "1"
nfl_standings$sb_winner[nfl_standings$sb_winner == "No Superbowl"] <- "0"

nfl_standings$sb_winner <- as.numeric(nfl_standings$sb_winner)
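The same recoding can be written as a single logical comparison, with a guard that stops if an unexpected label shows up. This is a sketch of an alternative and not the code used above. The toy vector is invented for illustration.

```r
# Toy column, invented for illustration
playoffs <- c("Playoffs", "No Playoffs", "Playoffs")

# Fail early if a label other than the two expected ones appears
stopifnot(all(playoffs %in% c("Playoffs", "No Playoffs")))

# TRUE/FALSE converts directly to 1/0
playoffs_binary <- as.numeric(playoffs == "Playoffs")
playoffs_binary
```

The guard matters because a misspelled label would otherwise be silently coded as 0.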

Now, for a summary of the dataset and associated table of the data, please see the CLEANED dataset below.

NFL Standings Dataset

Data Dictionary for the NFL Standings Dataset

| Variable Name | Variable Data Type | Variable Description |
|---|---|---|
| team_location | character | City or state in which the team originates |
| team_name | character | Name or mascot of the team |
| year | numeric | Year |
| total_wins | numeric | Total wins per season (0 to 16) |
| total_losses | numeric | Total losses per season (0 to 16) |
| points_for | numeric | Total points the team scored per season |
| points_against | numeric | Total points the opponent scored on the team per season |
| points_differential | numeric | The difference between the total points for the team and against the team |
| margin_of_victory | numeric | Points differential divided by the total number of games per season |
| strength_of_schedule | numeric | Difficulty of schedule based on opponent records |
| simple_rating | numeric | A rating for the team that takes into account points differential and strength of schedule (measured by Simple Rating System) |
| offensive_ranking | numeric | A rating comparing how well the offense performs to opponent teams (measured by Simple Rating System) |
| defensive_ranking | numeric | A rating comparing how well the defense performs to opponent teams (measured by Simple Rating System) |
| playoffs | numeric | Stating whether or not the team made it to the playoffs |
| sb_winner | numeric | Stating whether or not the team won the Super Bowl for the season |

Games

Once again, the nfl_games data was obtained from Pro Football Reference. The original data contains 5,324 observations and 19 variables. There are 11 character variables: week, home_team, away_team, winner, tie, day, date, home_team_name, home_team_city, away_team_name, and away_team_city. There are seven numeric variables: year, pts_win, pts_loss, yds_win, turnovers_win, yds_loss, and turnovers_loss. The remaining variable, time, is stored as a time of day (hms). See the ORIGINAL dataset below.

# Examine the structure of the dataset
datatable(head(nfl_games, 10))

# Create a data dictionary for games
var_names_games <- colnames(nfl_games)
var_types_games <- lapply(nfl_games, class) # lapply returns a list of the same length as X (a vector)
var_descriptions_games <- c("Year", "Week of the season in which the game was played", "Home team for the game", "Away team for the game", "Winner of the game", "Was there a tie? (if so, the other team will be listed in this column)", "Day of the week in which the game was played", "Date of the game", "Time of the day in which the game was played", "Number of points the winning team scored", "Number of points the losing team scored", "Total number of yards the winning team had", "Total number of turnovers the winning team had", "Total number of yards the losing team had", "Total number of turnovers the losing team had", "Name or mascot of the home team", "City of the home team", "Name or mascot of the away team", "City of the away team")
data_dict_games <- as_tibble(cbind(var_names_games, var_types_games, var_descriptions_games))
colnames(data_dict_games) <- c("Variable Name", "Variable Data Type", "Variable Description")
kable(data_dict_games) # kable returns a single table for a single data object

Looking at the above dataset, the first step I took to clean the data was to remove the last four unnecessary columns, as I felt they were redundant.

names(nfl_games)
nfl_games <- nfl_games[-c(16, 17, 18, 19)] # Remove redundant columns
names(nfl_games)

Then, I changed the week column to be numeric.

# as.numeric() turns any non-numeric week label into NA (with a warning)
nfl_games$week <- as.numeric(nfl_games$week)

Looking at missing values, the only column which contained them was the tie column. This makes sense, as very few NFL games result in a tie.

A tie was denoted by listing one team name in the winner column and the opposing team name in the tie column. To fix this, I identified every game that resulted in a tie and, for those games, changed the value in the winner column to “Tie”. The tie column was then removed.

colSums(is.na(nfl_games))
unique(nfl_games$tie, incomparables = FALSE)
nfl_games$winner[!is.na(nfl_games$tie)] <- "Tie" # A non-missing tie value marks a tied game
nfl_games <- nfl_games[-c(6)] # Remove the tie column
colSums(is.na(nfl_games)) # Confirm there are no missing values
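The tie recoding described above can be tried on a small example first. In this sketch the team names and rows are placeholders invented for illustration; only the column names winner and tie come from the dataset.

```r
# Toy games, invented for illustration; the second game ended in a tie
toy_games <- data.frame(
  home_team = c("Team A", "Team C"),
  away_team = c("Team B", "Team D"),
  winner = c("Team A", "Team C"),
  tie = c(NA, "Team D"),
  stringsAsFactors = FALSE
)

# A game is a tie exactly when the tie column is not missing
is_tie <- !is.na(toy_games$tie)
toy_games$winner[is_tie] <- "Tie"

# Drop the tie column by name once it is no longer needed
toy_games$tie <- NULL

toy_games$winner
```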

To view the summary and structure of the CLEANED data:

NFL Games Dataset

Data Dictionary for the NFL Games Dataset

| Variable Name | Variable Data Type | Variable Description |
|---|---|---|
| year | numeric | Year |
| week | numeric | Week of the season in which the game was played |
| home_team | character | Home team for the game |
| away_team | character | Away team for the game |
| winner | character | Winner of the game |
| day | character | Day of the week in which the game was played |
| date | character | Date of the game |
| time | hms, difftime | Time of the day in which the game was played |
| pts_win | numeric | Number of points the winning team scored |
| pts_loss | numeric | Number of points the losing team scored |
| yds_win | numeric | Total number of yards the winning team had |
| turnovers_win | numeric | Total number of turnovers the winning team had |
| yds_loss | numeric | Total number of yards the losing team had |
| turnovers_loss | numeric | Total number of turnovers the losing team had |

Weather

Incorporating weather data into my analysis is an interesting next step. I want to see how the weather impacts the outcome of individual games. The nfl_weather data is from NFLsavant.com. All data and statistics from this site are compiled from publicly available NFL play-by-play data on the Internet. The one drawback is that this data only runs through 2013; however, I thought the 2000 - 2013 seasons provided enough data to see any significant trends.

The original data contains 3,521 observations and 13 variables. The variables are described in the data dictionary below. See the ORIGINAL NFL Weather data below.

# Examine the structure of the dataset
datatable(head(nfl_weather, 10))

# Create a data dictionary for weather
var_names_w <- colnames(nfl_weather)
var_types_w <- lapply(nfl_weather, class) # lapply returns a list of the same length as X (a vector)
var_descriptions_w <- c("Full home team name", "City or state in which the home team originates", "Name or mascot of the home team", "Total points scored by the home team", "Full away team name", "City or state in which the away team originates", "Name or mascot of the away team", "Total points scored by the away team", "Winner of the game", "Temperature during the game (in Fahrenheit)", "Humidity percentage during the game", "Wind speed in miles per hour (mph) during the game", "Date of the game played")

data_dict_w <- as_tibble(cbind(var_names_w, var_types_w, var_descriptions_w))
colnames(data_dict_w) <- c("Variable Name", "Variable Data Type", "Variable Description")
kable(data_dict_w) # kable returns a single table for a single data object

Looking at the above dataset, the first step I took to clean the data was to remove the home_team and away_team columns, as I felt they were redundant.

names(nfl_weather)
nfl_weather <- nfl_weather[-c(1, 5)] # Remove redundant columns
names(nfl_weather)

To view the summary and structure of the CLEANED data:

NFL Weather Dataset

Data Dictionary for the NFL Weather Dataset

| Variable Name | Variable Data Type | Variable Description |
|---|---|---|
| home_team_city | character | City or state in which the home team originates |
| home_team_name | character | Name or mascot of the home team |
| home_score | numeric | Total points scored by the home team |
| away_team_city | character | City or state in which the away team originates |
| away_team_name | character | Name or mascot of the away team |
| away_score | numeric | Total points scored by the away team |
| winning_team | character | Winner of the game |
| temperature | numeric | Temperature during the game (in Fahrenheit) |
| humidity | numeric | Humidity percentage during the game |
| wind_mph | numeric | Wind speed in miles per hour (mph) during the game |
| date | character | Date of the game played |

Playoffs

The next dataset within my analysis is the nfl_playoffs dataset. This looks into the coaches and quarterbacks for each team that went to the playoffs from 2000 - 2020. I created this dataset myself through research.

The original data contains nine variables, which are described in the data dictionary below. See the ORIGINAL NFL Playoffs data below.

# Examine the structure of the dataset
datatable(head(nfl_playoffs, 10))

# Create a data dictionary for playoffs
var_names_playoffs <- colnames(nfl_playoffs)
var_types_playoffs <- lapply(nfl_playoffs, class) # lapply returns a list of the same length as X (a vector)
var_descriptions_playoffs <- c("City or state in which the team originates", "Name or mascot of the team", "Year", "Total wins for the team", "Total losses for the team", "Whether or not the team went to the Playoffs", "Whether or not the team won the Super Bowl", "Head coach of the team", "Starting quarterback during the postseason")

data_dict_playoffs <- as_tibble(cbind(var_names_playoffs, var_types_playoffs, var_descriptions_playoffs))
colnames(data_dict_playoffs) <- c("Variable Name", "Variable Data Type", "Variable Description")
kable(data_dict_playoffs) # kable returns a single table for a single data object

Moving forward, I decided to change sb_winner to a binary variable because it only has two unique values. Because the only unique value in the playoffs column is “Playoffs”, I decided to drop that column.

unique(nfl_playoffs$playoffs, incomparables = FALSE) # View the unique values for the playoffs column
unique(nfl_playoffs$sb_winner, incomparables = FALSE) # View the unique values for the sb_winner column

names(nfl_playoffs)
nfl_playoffs <- nfl_playoffs[-6] # Remove unnecessary column
names(nfl_playoffs)

For the sb_winner column, a value of one denotes “Won Superbowl”, and a value of zero denotes “No Superbowl”.

nfl_playoffs$sb_winner[nfl_playoffs$sb_winner == "Won Superbowl"] <- "1"
nfl_playoffs$sb_winner[nfl_playoffs$sb_winner == "No Superbowl"] <- "0"

nfl_playoffs$sb_winner <- as.numeric(nfl_playoffs$sb_winner)

To view the summary and structure of the CLEANED data:

NFL Playoffs Dataset

Data Dictionary for the NFL Playoffs Dataset

| Variable Name | Variable Data Type | Variable Description |
|---|---|---|
| team | character | City or state in which the team originates |
| team_name | character | Name or mascot of the team |
| year | numeric | Year |
| wins | numeric | Total wins for the team |
| loss | numeric | Total losses for the team |
| sb_winner | numeric | Whether or not the team won the Super Bowl |
| head_coach | character | Head coach of the team |
| qb | character | Starting quarterback during the postseason |

Passing Yards Leaders

The nfl_passing dataset contains information regarding the league leader for passing yards from each year. Their respective team information is included. This data is from Pro Football Reference.

This dataset did not need to be cleaned or edited. To view the summary and structure of the data:

NFL Passing Dataset

Data Dictionary for the NFL Passing Dataset

| Variable Name | Variable Data Type | Variable Description |
|---|---|---|
| year | numeric | Year |
| player | character | Name of the player with the most passing yards |
| yds | numeric | Total yards |
| team | character | Location of the player's team |
| team_name | character | Name or mascot of the player's team |

Rushing Yards Leaders

The next dataset, nfl_rushing, contains information regarding the league leader for rushing yards from each year. Their respective team information is included. This data is also from Pro Football Reference.

Similar to the last dataset, this dataset did not need to be cleaned or edited. To view the summary and structure of the data:

NFL Rushing Dataset

Data Dictionary for the NFL Rushing Dataset

| Variable Name | Variable Data Type | Variable Description |
|---|---|---|
| year | numeric | Year |
| player | character | Name of the player with the most rushing yards |
| yds | numeric | Total yards |
| team | character | Location of the player's team |
| team_name | character | Name or mascot of the player's team |

Penalties Per Game

The last dataset, nfl_penalty, contains the average penalty yards per game for each team from 2003 - 2020. The data is from TeamRankings.

This dataset did not need to be cleaned. To view the summary and structure of the data:

NFL Penalty Dataset

Data Dictionary for the NFL Penalty Dataset

| Variable Name | Variable Data Type | Variable Description |
|---|---|---|
| team | character | City or state in which the team originates |
| team_name | character | Name or mascot of the team |
| 2020 | numeric | Average penalty yards per game from 2020 |
| 2019 | numeric | Average penalty yards per game from 2019 |
| 2018 | numeric | Average penalty yards per game from 2018 |
| 2017 | numeric | Average penalty yards per game from 2017 |
| 2016 | numeric | Average penalty yards per game from 2016 |
| 2015 | numeric | Average penalty yards per game from 2015 |
| 2014 | numeric | Average penalty yards per game from 2014 |
| 2013 | numeric | Average penalty yards per game from 2013 |
| 2012 | numeric | Average penalty yards per game from 2012 |
| 2011 | numeric | Average penalty yards per game from 2011 |
| 2010 | numeric | Average penalty yards per game from 2010 |
| 2009 | numeric | Average penalty yards per game from 2009 |
| 2008 | numeric | Average penalty yards per game from 2008 |
| 2007 | numeric | Average penalty yards per game from 2007 |
| 2006 | numeric | Average penalty yards per game from 2006 |
| 2005 | numeric | Average penalty yards per game from 2005 |
| 2004 | numeric | Average penalty yards per game from 2004 |
| 2003 | numeric | Average penalty yards per game from 2003 |
| total | numeric | Total penalty yards |

Total Attendance Breakdown

Column

NFL Weekly Attendance

Division-Basis

Next, I wanted to break attendance down by division. To do this, I added a column called “division” to the dataset.
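A minimal sketch of how such a column could be added, assuming the dataset has a team_name column holding the mascot name. The lookup table `division_lookup` and the toy `attendance` data frame are hypothetical, and only two divisions are filled in here:

```r
# Hypothetical lookup table mapping each team name to its division
# (only two divisions shown; a full table would list all 32 teams).
division_lookup <- data.frame(
  team_name = c("Bills", "Dolphins", "Jets", "Patriots",
                "Bengals", "Browns", "Ravens", "Steelers"),
  division  = rep(c("AFC East", "AFC North"), each = 4),
  stringsAsFactors = FALSE
)

# Toy stand-in for the attendance data (values are made up).
attendance <- data.frame(
  team_name = c("Steelers", "Jets", "Browns"),
  total     = c(1000000, 1200000, 900000),
  stringsAsFactors = FALSE
)

# match() returns, for each row, the position of its team in the lookup,
# so the division vector can be indexed directly.
attendance$division <- division_lookup$division[
  match(attendance$team_name, division_lookup$team_name)
]

attendance
```

A lookup table keeps the team-to-division mapping in one place, so the same table can be reused for nfl_standings.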

Once the division column was created, the breakdown of the strongest and weakest fan bases per division can be seen in the table below. Individual graphs for both the AFC and NFC can be seen under the tabs AFC Attendance Breakdown and NFC Attendance Breakdown.

Division Strongest Fan Base Weakest Fan Base
AFC East New York Jets Miami Dolphins
AFC North Baltimore Ravens Cincinnati Bengals
AFC South Houston Texans Indianapolis Colts
AFC West Kansas City Chiefs Los Angeles Chargers
NFC East Dallas Cowboys Washington Redskins
NFC North Green Bay Packers Detroit Lions
NFC South New Orleans Saints Tampa Bay Buccaneers
NFC West Los Angeles Rams Arizona Cardinals

AFC Attendance Breakdown

Row

AFC East

AFC North

Row

AFC South

AFC West

NFC Attendance Breakdown

Row

NFC East

NFC North

Row

NFC South

NFC West

Impact on Wins

Column

What Impacts Total Wins?

Knowing the previously discussed attendance statistics, I want to see if a stronger home attendance impacts the total number of wins. A team cannot necessarily control their away attendance, as their most loyal fans are assumed to be unlikely attendees at an away game.

First, I wanted to discover if home attendance impacts total wins. To do so, I used the lm() function to fit a simple linear regression with total_wins as the response variable and total_home_attendance as the predictor variable. I also obtained the correlation coefficient between the two variables. To the right, in the Home Attendance tab, there appears to be a slight, positive linear relationship between the predictor variable (X or total_home_attendance) and the response variable (Y or total_wins). The correlation coefficient between the two variables is 0.1507, and the relationship is statistically significant at the 99% confidence level with a p-value of 0.000133. Even so, the relationship is weak: home attendance explains only about 2.3% of the variation in total wins (R-squared of 0.02272).

Next, I wanted to discover if away attendance impacts total wins. I followed the same process as for home attendance, using lm() to fit a simple linear regression with total_wins as the response variable and total_away_attendance as the predictor variable. From the visualization in the Away Attendance tab, there also appears to be a very slight, positive linear relationship between the predictor variable (X or total_away_attendance) and the response variable (Y or total_wins). The correlation coefficient between the two variables is 0.1274, and the relationship is statistically significant at the 99% confidence level with a p-value of 0.00126. Here too the relationship is weak: away attendance explains only about 1.6% of the variation in total wins (R-squared of 0.01622).

# Attach the dataset
attach(joined_data)

# Create linear model for home attendance
home_attendance_model <- lm(total_wins ~ total_home_attendance)
summary(home_attendance_model)

Call:
lm(formula = total_wins ~ total_home_attendance)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.7799 -2.2250 -0.0726  2.2478  7.9490 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)           4.225e+00  9.851e-01   4.289 2.07e-05 ***
total_home_attendance 6.955e-06  1.809e-06   3.845 0.000133 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.051 on 636 degrees of freedom
Multiple R-squared:  0.02272,   Adjusted R-squared:  0.02118 
F-statistic: 14.78 on 1 and 636 DF,  p-value: 0.0001327
cor(total_wins, total_home_attendance)
[1] 0.1507174
# Attach the dataset
attach(joined_data)

# Create linear model for away attendance
away_attendance_model <- lm(total_wins ~ total_away_attendance)
summary(away_attendance_model)

Call:
lm(formula = total_wins ~ total_away_attendance)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.8922 -2.2928 -0.1177  2.2793  7.7255 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)   
(Intercept)           -3.310e-01  2.571e+00  -0.129  0.89758   
total_away_attendance  1.539e-05  4.751e-06   3.238  0.00126 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.061 on 636 degrees of freedom
Multiple R-squared:  0.01622,   Adjusted R-squared:  0.01468 
F-statistic: 10.49 on 1 and 636 DF,  p-value: 0.001265
cor(total_wins, total_away_attendance)
[1] 0.1273652
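
The same kind of model can be fit without attach() by passing the data frame to lm() directly. The sketch below is illustrative only: `toy_data` is a simulated stand-in for joined_data, and the names `home_fit`, `home_cor`, `r_squared`, and `slope_p` are hypothetical.

```r
# Simulated stand-in for joined_data (values are random, not NFL data).
set.seed(1)
toy_data <- data.frame(
  total_home_attendance = runif(100, 400000, 650000)
)
toy_data$total_wins <- 4 + 7e-06 * toy_data$total_home_attendance +
  rnorm(100, sd = 3)

# Passing data = ... keeps the variables scoped to the data frame,
# so attach() is not needed.
home_fit <- lm(total_wins ~ total_home_attendance, data = toy_data)

# cor.test() returns the correlation and its p-value in one call.
home_cor <- cor.test(toy_data$total_wins, toy_data$total_home_attendance)

# In simple linear regression, R-squared equals the squared correlation,
# and the slope test and the correlation test give the same p-value.
r_squared <- summary(home_fit)$r.squared
slope_p   <- summary(home_fit)$coefficients["total_home_attendance", "Pr(>|t|)"]
```

The identity in the last comment is visible in the output above: 0.1507 squared is about 0.0227, the Multiple R-squared of the home attendance model, and 0.1274 squared is about 0.0162, the Multiple R-squared of the away attendance model.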

Column

Home Attendance

Away Attendance

Super Bowl Champions

Row

Standings Over the Years

This part of the analysis will look into the qualities of the division winners and the attributes that teams high in the standings have over teams in the lower portion of the standings. Furthermore, this approach will discover what separates Super Bowl Champions from the 31 other teams each season.

First, I wanted to see which division has brought home the most Super Bowl Championships over the past two decades. I once again added a “division” column to nfl_standings. As evident in the visualization below, the AFC East has had the most Super Bowl wins between the 2000 and 2020 seasons. This can be largely attributed to the New England Patriots’ former quarterback Tom Brady and current head coach Bill Belichick bringing home championships in the 2001, 2003, 2004, 2014, 2016, and 2018 seasons (with the games played in 2002, 2004, 2005, 2015, 2017, and 2019). Additionally, the second-best division appears to be the AFC North, with the Pittsburgh Steelers and Baltimore Ravens each winning two Super Bowl Championships. Conversely, the AFC South, NFC North, and NFC West have each won only one Super Bowl over the past two decades.
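
The division counts can be checked against the Rankings of Super Bowl Champions table in this section. This is a minimal sketch, assuming each champion is assigned to its current division (several teams changed divisions in the 2002 realignment); `champions`, `division_of`, and `titles` are hypothetical names.

```r
# Champions by season (2000 through 2020), copied from the
# Rankings of Super Bowl Champions table.
champions <- c("Ravens", "Patriots", "Buccaneers", "Patriots", "Patriots",
               "Steelers", "Colts", "Giants", "Steelers", "Saints",
               "Packers", "Giants", "Ravens", "Seahawks", "Patriots",
               "Broncos", "Patriots", "Eagles", "Patriots", "Chiefs",
               "Buccaneers")

# Current division of each champion (named vector used as a lookup).
division_of <- c(
  Patriots = "AFC East", Ravens = "AFC North", Steelers = "AFC North",
  Colts = "AFC South", Broncos = "AFC West", Chiefs = "AFC West",
  Giants = "NFC East", Eagles = "NFC East", Packers = "NFC North",
  Buccaneers = "NFC South", Saints = "NFC South", Seahawks = "NFC West"
)

# Count titles per division and sort from most to fewest.
titles <- sort(table(division_of[champions]), decreasing = TRUE)
titles
```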

Analyzing NFL standings with the given datasets is a bit tricky because standings are determined by tie-breakers when necessary. Additionally, playoff qualification is largely based on division success. As a result, the team with the most wins might not be the team with the best standing. For this analysis, I decided to break the teams down by division and see which ones have been dominant over the years.

I analyzed their success by using summary statistics showing the Average Total Wins, Average Total Losses, Average Points Per Game, and Average Opponent Points Per Game. The results can be seen in the tabs AFC Summaries and NFC Summaries. I also developed box plots for the average total wins per season by division to analyze the range of data for each team and any relevant outliers. These box plots can be seen in the tabs AFC Box Plots | Total Wins and NFC Box Plots | Total Wins. I also grouped the box plots by conference (AFC vs. NFC).

The most dominant team in each division, defined by the highest average of total wins (as discovered in the AFC Summaries and NFC Summaries tabs), is as follows:

  • AFC East: New England Patriots
  • AFC North: Pittsburgh Steelers
  • AFC South: Indianapolis Colts
  • AFC West: Denver Broncos
  • NFC East: Philadelphia Eagles
  • NFC North: Green Bay Packers
  • NFC South: New Orleans Saints
  • NFC West: Seattle Seahawks
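
The list above can be derived from the summary tables by keeping the team with the highest average wins in each division. The sketch below uses the AFC North and NFC North averages from the summary tabs; the data frame `team_summary` and the column name `avg_wins` are hypothetical.

```r
# Team averages copied from the AFC North and NFC North summary tables.
team_summary <- data.frame(
  division  = rep(c("AFC North", "NFC North"), each = 4),
  team_name = c("Bengals", "Browns", "Ravens", "Steelers",
                "Bears", "Lions", "Packers", "Vikings"),
  avg_wins  = c(7.095238, 5.238095, 9.571429, 10.333333,
                7.857143, 5.666667, 10.000000, 8.190476),
  stringsAsFactors = FALSE
)

# Split by division, then keep the row with the highest average wins.
leaders <- do.call(rbind, lapply(
  split(team_summary, team_summary$division),
  function(d) d[which.max(d$avg_wins), ]
))
leaders
```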

I also developed a table of the 21 Super Bowl winners from the 2000 through 2020 seasons with their offensive and defensive ranking. This table can be found in the Rankings of Super Bowl Champions tab.

Row

Super Bowl Champions Per Division

Rankings of Super Bowl Champions

Year Super Bowl Champion Offensive Ranking Defensive Ranking
2000 Ravens 0.0 8.0
2001 Patriots 1.2 3.1
2002 Buccaneers -1.0 9.8
2003 Patriots 2.1 4.9
2004 Patriots 6.4 6.5
2005 Steelers 3.8 4.0
2006 Colts 6.9 -1.1
2007 Giants 2.8 0.4
2008 Steelers 1.6 8.2
2009 Saints 11.2 -0.5
2010 Packers 3.1 7.9
2011 Giants 3.1 -1.5
2012 Ravens 1.9 1.0
2013 Seahawks 4.1 8.9
2014 Patriots 7.5 3.5
2015 Broncos 0.3 5.5
2016 Patriots 4.3 5.0
2017 Eagles 7.0 2.5
2018 Patriots 3.1 2.1
2019 Chiefs 6.2 2.9
2020 Buccaneers 6.5 2.8

AFC Summaries

Row

AFC East Summary

Team Name Average Total Wins Average Total Losses Average Points Per Game Average Opponent Points Per Game
Bills 7.142857 8.857143 20.41369 22.33631
Dolphins 7.571429 8.428571 20.11607 21.64881
Jets 7.142857 8.857143 19.61607 21.72917
Patriots 11.619048 4.380952 26.93155 18.74405

AFC North Summary

Team Name Average Total Wins Average Total Losses Average Points Per Game Average Opponent Points Per Game
Bengals 7.095238 8.714286 20.64583 22.41071
Browns 5.238095 10.714286 17.69345 23.16071
Ravens 9.571429 6.428571 22.75298 18.31250
Steelers 10.333333 5.571429 23.20536 18.50893

Row

AFC South Summary

Team Name Average Total Wins Average Total Losses Average Points Per Game Average Opponent Points Per Game
Colts 9.904762 6.095238 25.01488 22.25893
Jaguars 6.095238 9.904762 19.55357 22.34226
Texans 7.105263 8.894737 21.02632 23.01316
Titans 8.142857 7.857143 21.80060 22.60714

AFC West Summary

Team Name Average Total Wins Average Total Losses Average Points Per Game Average Opponent Points Per Game
Broncos 8.904762 7.095238 23.33929 21.99405
Chargers 8.047619 7.952381 24.05655 21.98214
Chiefs 8.571429 7.428571 23.58333 22.04167
Raiders 6.333333 9.666667 20.43452 24.71429

AFC Box Plots | Total Wins

Row

AFC East Box Plot

AFC North Box Plot

Row

AFC South Box Plot

AFC West Box Plot

NFC Summaries

Row

NFC East Summary

Team Name Average Total Wins Average Total Losses Average Points Per Game Average Opponent Points Per Game
Cowboys 8.285714 7.714286 22.41071 21.98810
Eagles 9.238095 6.666667 24.03571 20.80357
Giants 7.809524 8.190476 21.79762 22.42857
Redskins 6.619048 9.333333 19.55655 22.33929

NFC North Summary

Team Name Average Total Wins Average Total Losses Average Points Per Game Average Opponent Points Per Game
Bears 7.857143 8.142857 20.38393 20.93750
Lions 5.666667 10.285714 20.72619 24.98810
Packers 10.000000 5.904762 25.55357 21.33929
Vikings 8.190476 7.714286 22.87798 22.39881

Row

NFC South Summary

Team Name Average Total Wins Average Total Losses Average Points Per Game Average Opponent Points Per Game
Buccaneers 7.095238 8.904762 21.03869 21.94643
Falcons 8.000000 7.952381 22.71429 22.90179
Panthers 7.714286 8.238095 21.19048 21.78274
Saints 9.285714 6.714286 26.13988 23.18452

NFC West Summary

Team Name Average Total Wins Average Total Losses Average Points Per Game Average Opponent Points Per Game
49ers 7.333333 8.619048 21.13393 22.48512
Cardinals 6.904762 9.000000 20.37202 23.61607
Rams 7.333333 8.619048 21.58631 23.45238
Seahawks 9.238095 6.714286 23.19643 20.69048

NFC Box Plots | Total Wins

Row

NFC East Box Plot

NFC North Box Plot

Row

NFC South Box Plot

NFC West Box Plot

Division Leaders

Division Leaders Breakdown

Combining the tables from the previous tabs into one table of average statistics, the following teams have the highest value in each category:

  • Average Total Wins: New England Patriots
  • Average Total Losses: Cleveland Browns
  • Average Points Per Game: New England Patriots
  • Average Opponent Points Per Game: Detroit Lions

For a more in-depth look at each team, please refer to the table below.

NFL Standings of all Teams

Team Name Average Total Wins Average Total Losses Average Points Per Game Average Opponent Points Per Game
49ers 7.333333 8.619048 21.13393 22.48512
Bears 7.857143 8.142857 20.38393 20.93750
Bengals 7.095238 8.714286 20.64583 22.41071
Bills 7.142857 8.857143 20.41369 22.33631
Broncos 8.904762 7.095238 23.33929 21.99405
Browns 5.238095 10.714286 17.69345 23.16071
Buccaneers 7.095238 8.904762 21.03869 21.94643
Cardinals 6.904762 9.000000 20.37202 23.61607
Chargers 8.047619 7.952381 24.05655 21.98214
Chiefs 8.571429 7.428571 23.58333 22.04167
Colts 9.904762 6.095238 25.01488 22.25893
Cowboys 8.285714 7.714286 22.41071 21.98810
Dolphins 7.571429 8.428571 20.11607 21.64881
Eagles 9.238095 6.666667 24.03571 20.80357
Falcons 8.000000 7.952381 22.71429 22.90179
Giants 7.809524 8.190476 21.79762 22.42857
Jaguars 6.095238 9.904762 19.55357 22.34226
Jets 7.142857 8.857143 19.61607 21.72917
Lions 5.666667 10.285714 20.72619 24.98810
Packers 10.000000 5.904762 25.55357 21.33929
Panthers 7.714286 8.238095 21.19048 21.78274
Patriots 11.619048 4.380952 26.93155 18.74405
Raiders 6.333333 9.666667 20.43452 24.71429
Rams 7.333333 8.619048 21.58631 23.45238
Ravens 9.571429 6.428571 22.75298 18.31250
Redskins 6.619048 9.333333 19.55655 22.33929
Saints 9.285714 6.714286 26.13988 23.18452
Seahawks 9.238095 6.714286 23.19643 20.69048
Steelers 10.333333 5.571429 23.20536 18.50893
Texans 7.105263 8.894737 21.02632 23.01316
Titans 8.142857 7.857143 21.80060 22.60714
Vikings 8.190476 7.714286 22.87798 22.39881

Offense vs. Defense

Column

Offensive Ranking

Defensive Ranking

Points For vs. Points Against

Column

Points For

Points Against

Individual Game Observations

Column

Individual Game Observations

The last analysis takes a look at data from the individual NFL games. Using the nfl_games dataset, I investigated the different variables.

Now, to analyze the correlation between different variables, I used the GGally package to produce a detailed scatter plot matrix. The function ggpairs() produced histograms along the diagonal of the matrix. Pearson correlation estimates are seen in the upper-right. Scatter plots are seen in the lower-left. I analyzed six variables here: (1) Points Scored by Winning Team (pts_win); (2) Yards Gained by Winning Team (yds_win); (3) Turnovers Committed by Winning Team (turnovers_win); (4) Points Scored by Losing Team (pts_loss); (5) Yards Gained by Losing Team (yds_loss); and (6) Turnovers Committed by Losing Team (turnovers_loss).
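
The correlation estimates in the upper-right panels can also be obtained as a plain matrix with base R. This is a sketch on simulated data, not the nfl_games values; only the column names are taken from the dataset, and `games` and `cor_matrix` are hypothetical names.

```r
# Simulated stand-in for three nfl_games columns (random, not NFL data).
set.seed(42)
n <- 200
yds_win       <- rnorm(n, mean = 370, sd = 70)
pts_win       <- 5 + 0.06 * yds_win + rnorm(n, sd = 6)
turnovers_win <- rpois(n, lambda = 1)
games <- data.frame(pts_win, yds_win, turnovers_win)

# cor() on a data frame returns the full Pearson correlation matrix:
# one value per pair of columns, with 1s on the diagonal.
cor_matrix <- round(cor(games, method = "pearson"), 3)
cor_matrix
```

Calling pairs(games) would draw a base R scatter plot matrix of the same columns, a plainer relative of the ggpairs() output.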

I then grouped these variables by winning team vs. losing team. The correlation matrix for the winning team can be seen in the first tab to the right. As evident through both the scatter plots and Pearson correlation estimates, there is little to no relationship between Points Scored by Winning Team vs. Turnovers Committed by Winning Team as well as Yards Gained by Winning Team vs. Turnovers Committed by Winning Team. Both of these correlation coefficients are close to zero. On the other hand, there is a moderate, positive relationship between Points Scored by Winning Team vs. Yards Gained by Winning Team, with a Pearson correlation estimate of 0.537.

Looking at the variables by losing team in the second tab to the right, the picture is very similar to that of the winning teams: there is little to no relationship between Points Scored by Losing Team vs. Turnovers Committed by Losing Team as well as Yards Gained by Losing Team vs. Turnovers Committed by Losing Team. Both of these correlation coefficients are close to zero. On the other hand, there is a moderately strong, positive relationship between Points Scored by Losing Team vs. Yards Gained by Losing Team, with a Pearson correlation estimate of 0.632.

The main takeaway from these correlation matrices is that the more yards a team gains, the more points it tends to score. To compare a winning team and a losing team, I wanted to see if more turnovers by the losing team were associated with more points for the winning team. Please see the third tab to the right for the linear model with pts_win as the response variable and turnovers_loss as the predictor variable. In this graphic, there is a slight, positive relationship between Turnovers Committed by Losing Team and Points Scored by Winning Team. The correlation coefficient between the two variables is 0.176.

Column

Variables by Winning Team

Variables by Losing Team

Turnovers Committed by Losing Team vs. Points Scored by Winning Team

Average Weather Conditions

Column

Understanding Weather Conditions

Looking at the nfl_weather dataset, I wanted to see which teams performed well under certain weather conditions. To do this, I first wanted to observe the average temperature, humidity, and wind speed at each home location. In R, I utilized the dplyr package to tidy my data and create new columns with mutate. To visualize the average temperature, humidity, and wind speed at each location, I created bar graphs for each variable per city.
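
A per-city average of this kind can be sketched in base R as follows. The `weather` data frame is a toy stand-in with made-up values, and the column name `home_city` is an assumption about how the location is stored.

```r
# Toy stand-in for nfl_weather (values are made up for illustration).
weather <- data.frame(
  home_city   = c("Miami", "Miami", "Buffalo", "Buffalo", "Denver"),
  temperature = c(80, 74, 40, 50, 55),
  wind_mph    = c(8, 10, 12, 9, 10),
  stringsAsFactors = FALSE
)

# aggregate() computes the mean of each variable within each city.
city_avg <- aggregate(cbind(temperature, wind_mph) ~ home_city,
                      data = weather, FUN = mean)

# Sort from warmest to coldest and keep at most five rows.
warmest <- head(city_avg[order(-city_avg$temperature), ], 5)
warmest
```

The same pattern, sorted on a different column, would give the humidity and wind rankings.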

From the visualizations to the right, it appears that the following five cities have the highest average temperatures:

  1. Miami, Florida – 76.70°F

  2. Detroit, Michigan – 71.64°F

  3. Tampa Bay, Florida – 71.51°F

  4. New Orleans, Louisiana – 71.03°F

  5. Houston, Texas – 71.03°F

The following five cities have the highest humidity percentage:

  1. Seattle, Washington – 79%

  2. San Francisco, California – 71%

  3. Oakland, California – 71%

  4. Green Bay, Wisconsin – 71%

  5. Miami, Florida – 70%

Lastly, the following five cities have the highest winds (in mph):

  1. New England, Massachusetts – 11.54 mph

  2. New York, New York – 10.57 mph

  3. Dallas, Texas – 10.27 mph

  4. Denver, Colorado – 9.96 mph

  5. Buffalo, New York – 9.95 mph

Column

Average Temperature Per City

Average Humidity Per City

Average Wind Per City

Can Weather Predict Game Outcomes?

Column

Linear Model with all Weather Variables

attach(nfl_weather)
cor(total_score, temperature); cor(total_score, humidity); cor(total_score, wind_mph)

# Split the data into training and testing
sample_index <- sample(nrow(nfl_weather), nrow(nfl_weather)*0.70)
weather_train <- nfl_weather[sample_index,]
weather_test <- nfl_weather[-sample_index,]
# Create the linear model
weather_model <- lm(total_score ~ temperature + humidity + wind_mph, data = weather_train)
model_summary <- summary(weather_model)
model_summary

Call:
lm(formula = total_score ~ temperature + humidity + wind_mph, 
    data = weather_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-37.371 -10.303  -1.084   8.629  65.488 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 47.48511    1.36569  34.770  < 2e-16 ***
temperature -0.01548    0.01849  -0.837  0.40268    
humidity    -3.12415    1.16805  -2.675  0.00753 ** 
wind_mph    -0.28202    0.06450  -4.372 1.28e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.45 on 2460 degrees of freedom
Multiple R-squared:  0.02244,   Adjusted R-squared:  0.02125 
F-statistic: 18.82 on 3 and 2460 DF,  p-value: 4.55e-12
# Out-of-sample performance
# Named pred rather than pi to avoid masking R's built-in constant pi
pred <- predict(object = weather_model, newdata = weather_test)
mean((pred - weather_test$total_score)^2) # MSE
[1] 190.3678

Linear Model with Wind Variable

# Drop all variables except wind_mph
weather_model_2 <- lm(total_score ~ wind_mph, data = weather_train)
model_summary_2 <- summary(weather_model_2)
model_summary_2

Call:
lm(formula = total_score ~ wind_mph, data = weather_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-36.593 -10.377  -0.936   8.773  65.526 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 45.59288    0.47042  96.920  < 2e-16 ***
wind_mph    -0.36566    0.05221  -7.004  3.2e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.46 on 2462 degrees of freedom
Multiple R-squared:  0.01953,   Adjusted R-squared:  0.01914 
F-statistic: 49.05 on 1 and 2462 DF,  p-value: 3.205e-12
# Out-of-sample performance
pi_2 <- predict(object = weather_model_2, newdata = weather_test)
mean((pi_2 - weather_test$total_score)^2) # MSE
[1] 191.7339
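
Because sample() draws a different split on every run, the MSE values above will change slightly each time the dashboard is rendered. The sketch below shows one way to make the comparison reproducible by setting a seed. It runs on simulated data rather than nfl_weather, and the helper `test_mse` and the other object names are hypothetical.

```r
# Simulated stand-in for nfl_weather (random values, not NFL data).
set.seed(2021)  # fixes the random numbers so the split can be reproduced
n <- 500
sim <- data.frame(
  temperature = rnorm(n, 60, 15),
  humidity    = runif(n, 0.2, 1),
  wind_mph    = rexp(n, 1 / 8)
)
sim$total_score <- 47 - 3 * sim$humidity - 0.3 * sim$wind_mph +
  rnorm(n, sd = 14)

# 70/30 split into training and testing rows.
train_idx <- sample(nrow(sim), floor(nrow(sim) * 0.70))
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]

# Helper (hypothetical name) returning out-of-sample mean squared error.
test_mse <- function(model, newdata) {
  mean((predict(model, newdata = newdata) - newdata$total_score)^2)
}

full_model <- lm(total_score ~ temperature + humidity + wind_mph, data = train)
wind_model <- lm(total_score ~ wind_mph, data = train)

mse <- c(full = test_mse(full_model, test), wind = test_mse(wind_model, test))
mse
sqrt(mse)  # RMSE, in the same units as total_score (points)
```

Taking the square root puts the error back in points: the MSE values of 190.3678 and 191.7339 reported above both correspond to an RMSE of roughly 13.8 points.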

Teams

Column

Successful Postseason Teams

top_10_teams_playoffs <- nfl_teams_playoffs %>% top_n(10, playoffs) %>%
  arrange(desc(playoffs))
top_10_teams_playoffs <- top_10_teams_playoffs[1:10,]
kable(top_10_teams_playoffs)
team_name playoffs
Patriots 17
Colts 15
Packers 15
Seahawks 14
Eagles 13
Ravens 13
Steelers 13
Chiefs 10
Saints 10
Broncos 9
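
The extra `[1:10,]` step in the code above is there because top_n() keeps tied rows, so it can return more than ten. A base R sketch of the same idea, on made-up data with a tie at the cutoff, is shown below; `top_n_rows` is a hypothetical helper, not a dplyr function. In dplyr 1.0.0 and later, top_n() is superseded by slice_max(), which offers a with_ties argument for the same purpose.

```r
# Toy data with a tie at the cutoff (values are made up).
teams <- data.frame(
  team_name = c("A", "B", "C", "D", "E"),
  playoffs  = c(9, 7, 7, 7, 3),
  stringsAsFactors = FALSE
)

# Sort descending, then keep the first n rows. order() breaks ties by
# original row position, so exactly n rows come back.
top_n_rows <- function(df, n, col) {
  head(df[order(-df[[col]]), , drop = FALSE], n)
}

top_2 <- top_n_rows(teams, 2, "playoffs")
top_2
```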

Top Ten Teams by Playoff Appearances

Head Coaches

Column

List of Top Coaches by Playoff Appearances

top_10_coaches_playoffs <- nfl_coaches_playoffs %>% top_n(10, playoffs) %>% arrange(desc(playoffs))
top_10_coaches_playoffs <- top_10_coaches_playoffs[1:10,]
kable(top_10_coaches_playoffs)
head_coach playoffs
Bill Belichick 17
Andy Reid 16
John Harbaugh 9
Mike McCarthy 9
Mike Tomlin 9
Pete Carroll 9
Sean Payton 9
Tony Dungy 9
John Fox 7
Marvin Lewis 7

Top Ten Coaches by Playoff Appearances

List of Top Coaches by Super Bowl Championships

top_10_coaches <- nfl_coaches_sb %>% top_n(10, total_sb_coach) %>% arrange(desc(total_sb_coach))
top_10_coaches <- top_10_coaches[1:10,]
kable(top_10_coaches)
head_coach total_sb_coach
Bill Belichick 6
Tom Coughlin 2
Brian Billick 1
Jon Gruden 1
Andy Reid 1
Tony Dungy 1
Bill Cowher 1
Sean Payton 1
Mike Tomlin 1
Mike McCarthy 1

Top Ten Coaches by Super Bowl Championships

Quarterbacks

Column

List of Top Quarterbacks by Playoff Appearances

top_10_qb_playoffs <- nfl_qb_playoffs %>% top_n(10, playoffs) %>% arrange(desc(playoffs))
top_10_qb_playoffs <- top_10_qb_playoffs[1:10,]
kable(top_10_qb_playoffs)
qb playoffs
Tom Brady 17
Ben Roethlisberger 9
Drew Brees 9
Aaron Rodgers 8
Russell Wilson 8
Donovan McNabb 7
Peyton Manning 7
Joe Flacco 6
Matt Hasselbeck 5
Eli Manning 5

Top Ten Quarterbacks by Playoff Appearances

List of Top Quarterbacks by Super Bowl Championships

top_10_qb <- nfl_qb_sb %>% top_n(10, total_sb_qb) %>% arrange(desc(total_sb_qb))
top_10_qb <- top_10_qb[1:10,]
kable(top_10_qb)
qb total_sb_qb
Tom Brady 7
Peyton Manning 2
Ben Roethlisberger 2
Eli Manning 2
Trent Dilfer 1
Brad Johnson 1
Drew Brees 1
Joe Flacco 1
Aaron Rodgers 1
Russell Wilson 1

Top Ten Quarterbacks by Super Bowl Championships

Passing Yards Leaders

Column

Passing Yards Summary

kable(nfl_passing)
year player yds team team_name
2020 Deshaun Watson 4823 Houston Texans
2019 Jameis Winston 5109 Tampa Bay Buccaneers
2018 Ben Roethlisberger 5129 Pittsburgh Steelers
2017 Tom Brady 4577 New England Patriots
2016 Drew Brees 5208 New Orleans Saints
2015 Drew Brees 4870 New Orleans Saints
2014 Drew Brees 4952 New Orleans Saints
2014 Ben Roethlisberger 4952 Pittsburgh Steelers
2013 Peyton Manning 5477 Denver Broncos
2012 Drew Brees 5177 New Orleans Saints
2011 Drew Brees 5476 New Orleans Saints
2010 Philip Rivers 4710 San Diego Chargers
2009 Matt Schaub 4770 Houston Texans
2008 Drew Brees 5069 New Orleans Saints
2007 Tom Brady 4806 New England Patriots
2006 Drew Brees 4418 New Orleans Saints
2005 Tom Brady 4110 New England Patriots
2004 Daunte Culpepper 4717 Minnesota Vikings
2003 Peyton Manning 4267 Indianapolis Colts
2002 Rich Gannon 4689 Oakland Raiders
2001 Kurt Warner 4830 St. Louis Rams
2000 Peyton Manning 4413 Indianapolis Colts

Top Passing Yards Leaders

Rushing Yards Leaders

Column

Rushing Yards Summary

kable(nfl_rushing)
year player yds team team_name
2020 Derrick Henry 2027 Tennessee Titans
2019 Derrick Henry 1540 Tennessee Titans
2018 Ezekiel Elliott 1434 Dallas Cowboys
2017 Kareem Hunt 1327 Kansas City Chiefs
2016 Ezekiel Elliott 1631 Dallas Cowboys
2015 Adrian Peterson 1485 Minnesota Vikings
2014 DeMarco Murray 1845 Dallas Cowboys
2013 LeSean McCoy 1607 Philadelphia Eagles
2012 Adrian Peterson 2097 Minnesota Vikings
2011 Maurice Jones-Drew 1606 Jacksonville Jaguars
2010 Arian Foster 1616 Houston Texans
2009 Chris Johnson 2006 Tennessee Titans
2008 Adrian Peterson 1760 Minnesota Vikings
2007 LaDainian Tomlinson 1474 San Diego Chargers
2006 LaDainian Tomlinson 1815 San Diego Chargers
2005 Shaun Alexander 1880 Seattle Seahawks
2004 Curtis Martin 1697 New York Jets
2003 Jamal Lewis 2066 Baltimore Ravens
2002 Ricky Williams 1853 Miami Dolphins
2001 Priest Holmes 1555 Kansas City Chiefs
2000 Edgerrin James 1709 Indianapolis Colts

Top Rushing Yards Leaders

Penalty Yards Per Game

NFL Average Penalty Yards Per Game

For average penalty yards per game, I found a dataset that records the average penalty yards assessed against each team from 2003 - 2020. I wanted to figure out which team was the most penalized.

From the graphic below, it is evident that the Las Vegas Raiders have been the most penalized team in the NFL. The top five most penalized teams are:

  1. Las Vegas Raiders

  2. Baltimore Ravens

  3. Detroit Lions

  4. Tampa Bay Buccaneers

  5. Los Angeles Rams

The least penalized team in the NFL is the Indianapolis Colts.
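
Since nfl_penalty stores one column per season, a ranking like this can be built by summing across the year columns. The sketch below uses a toy data frame with three seasons and made-up values; `penalty`, `year_cols`, and `ranked` are hypothetical names.

```r
# Toy stand-in for nfl_penalty with three seasons (values are made up).
penalty <- data.frame(
  team_name = c("Raiders", "Colts", "Ravens"),
  `2020` = c(60, 40, 55),
  `2019` = c(70, 45, 60),
  `2018` = c(65, 42, 58),
  check.names = FALSE,   # keep the year column names as-is
  stringsAsFactors = FALSE
)

# Select the year columns by their four-digit names and sum across each row.
year_cols <- grep("^[0-9]{4}$", names(penalty), value = TRUE)
penalty$total <- rowSums(penalty[, year_cols])

# Rank teams from most to least penalized.
ranked <- penalty[order(-penalty$total), c("team_name", "total")]
ranked
```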

Top Penalized Teams

Rival Analysis

Column

Packers vs. Bears

Which team has won more games in the past 20 years?

team games
Chicago Bears 12
Green Bay Packers 29
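
A win count like the one above can be produced by filtering the games to the two rivals and tabulating the winner column. The sketch below uses a handful of rows copied from the game tables in this section; `games`, `rivals`, `rivalry`, and `wins` are hypothetical names.

```r
# A few rows copied from the game tables in this section.
games <- data.frame(
  year      = c(2000, 2000, 2001, 2001, 2000),
  week      = c(5, 14, 9, 13, 1),
  home_team = c("Green Bay Packers", "Chicago Bears", "Chicago Bears",
                "Green Bay Packers", "Dallas Cowboys"),
  away_team = c("Chicago Bears", "Green Bay Packers", "Green Bay Packers",
                "Chicago Bears", "Philadelphia Eagles"),
  winner    = c("Chicago Bears", "Green Bay Packers", "Green Bay Packers",
                "Green Bay Packers", "Philadelphia Eagles"),
  stringsAsFactors = FALSE
)

# Keep only games in which both teams of the rivalry took part.
rivals  <- c("Green Bay Packers", "Chicago Bears")
rivalry <- games[games$home_team %in% rivals & games$away_team %in% rivals, ]

# Count wins per team.
wins <- table(rivalry$winner)
wins
```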

Individual Game Statistics

year week home_team away_team winner day date time pts_win pts_loss yds_win turnovers_win yds_loss turnovers_loss
2000 5 Green Bay Packers Chicago Bears Chicago Bears Sun October 1 16:15:00 27 24 370 0 364 3
2000 14 Chicago Bears Green Bay Packers Green Bay Packers Sun December 3 20:35:00 28 6 304 0 330 2
2001 9 Chicago Bears Green Bay Packers Green Bay Packers Sun November 11 13:02:00 20 12 368 2 262 1
2001 13 Green Bay Packers Chicago Bears Green Bay Packers Sun December 9 13:02:00 17 7 352 1 189 1
2002 5 Chicago Bears Green Bay Packers Green Bay Packers Mon October 7 21:08:00 34 21 457 1 380 4
2002 13 Green Bay Packers Chicago Bears Green Bay Packers Sun December 1 13:02:00 30 20 396 2 304 4
2003 4 Chicago Bears Green Bay Packers Green Bay Packers Mon September 29 21:09:00 38 23 380 1 361 2
2003 14 Green Bay Packers Chicago Bears Green Bay Packers Sun December 7 13:02:00 34 21 307 1 275 5
2004 2 Green Bay Packers Chicago Bears Chicago Bears Sun September 19 13:02:00 21 10 307 2 404 3
2004 17 Chicago Bears Green Bay Packers Green Bay Packers Sun January 2 13:03:00 31 14 387 0 246 1
2005 13 Chicago Bears Green Bay Packers Chicago Bears Sun December 4 13:05:00 19 7 190 2 358 4
2005 16 Green Bay Packers Chicago Bears Chicago Bears Sun December 25 17:11:00 24 17 292 1 365 4
2006 1 Green Bay Packers Chicago Bears Chicago Bears Sun September 10 16:15:00 26 0 361 1 267 3
2006 17 Chicago Bears Green Bay Packers Green Bay Packers Sun December 31 20:15:00 26 7 373 1 316 6
2007 5 Green Bay Packers Chicago Bears Chicago Bears Sun October 7 20:24:00 27 20 285 1 439 5
2007 16 Chicago Bears Green Bay Packers Chicago Bears Sun December 23 13:03:00 35 7 240 0 274 2
2008 11 Green Bay Packers Chicago Bears Green Bay Packers Sun November 16 13:02:00 37 3 427 1 234 1
2008 16 Chicago Bears Green Bay Packers Chicago Bears Mon December 22 20:40:00 20 17 210 2 325 2
2009 1 Green Bay Packers Chicago Bears Green Bay Packers Sun September 13 20:30:00 21 15 226 0 352 4
2009 14 Chicago Bears Green Bay Packers Green Bay Packers Sun December 13 13:02:00 21 14 315 2 254 2
2010 3 Chicago Bears Green Bay Packers Chicago Bears Mon September 27 20:40:00 20 17 276 1 379 2
2010 17 Green Bay Packers Chicago Bears Green Bay Packers Sun January 2 16:15:00 10 3 284 2 227 2
2010 NA Chicago Bears Green Bay Packers Green Bay Packers Sun January 23 15:05:00 21 14 356 2 301 3
2011 3 Chicago Bears Green Bay Packers Green Bay Packers Sun September 25 16:15:00 27 17 392 2 291 2
2011 16 Green Bay Packers Chicago Bears Green Bay Packers Sun December 25 20:30:00 35 21 363 0 441 2
2012 2 Green Bay Packers Chicago Bears Green Bay Packers Thu September 13 20:29:00 23 10 321 2 168 4
2012 15 Chicago Bears Green Bay Packers Green Bay Packers Sun December 16 13:03:00 21 13 391 2 190 1
2013 9 Green Bay Packers Chicago Bears Chicago Bears Mon November 4 20:40:00 27 20 442 0 312 1
2013 17 Chicago Bears Green Bay Packers Green Bay Packers Sun December 29 16:25:00 33 28 473 2 345 2
2014 4 Chicago Bears Green Bay Packers Green Bay Packers Sun September 28 13:02:00 38 17 358 0 496 2
2014 10 Green Bay Packers Chicago Bears Green Bay Packers Sun November 9 20:30:00 55 14 451 1 311 3
2015 1 Chicago Bears Green Bay Packers Green Bay Packers Sun September 13 13:00:00 31 23 322 0 402 1
2015 12 Green Bay Packers Chicago Bears Chicago Bears Thu November 26 20:30:00 17 13 290 0 365 2
2016 7 Green Bay Packers Chicago Bears Green Bay Packers Thu October 20 20:26:00 26 10 406 1 189 2
2016 15 Chicago Bears Green Bay Packers Green Bay Packers Sun December 18 13:00:00 30 27 451 0 449 4
2017 4 Green Bay Packers Chicago Bears Green Bay Packers Thu September 28 20:25:00 35 14 260 0 308 4
2017 10 Chicago Bears Green Bay Packers Green Bay Packers Sun November 12 13:00:00 23 16 342 0 323 1
2018 1 Green Bay Packers Chicago Bears Green Bay Packers Sun September 9 20:20:00 24 23 370 2 294 1
2018 15 Chicago Bears Green Bay Packers Chicago Bears Sun December 16 13:00:00 24 17 332 1 323 1
2019 1 Chicago Bears Green Bay Packers Green Bay Packers Thu September 5 20:20:00 10 3 213 0 254 1
2019 15 Green Bay Packers Chicago Bears Green Bay Packers Sun December 15 13:00:00 21 13 292 0 415 3

Cowboys vs. Eagles

Which team has won more games in the past 20 years?

team games
Dallas Cowboys 19
Philadelphia Eagles 22

Individual Game Statistics

year week home_team away_team winner day date time pts_win pts_loss yds_win turnovers_win yds_loss turnovers_loss
2000 1 Dallas Cowboys Philadelphia Eagles Philadelphia Eagles Sun September 3 16:05:00 41 14 425 3 167 2
2000 10 Philadelphia Eagles Dallas Cowboys Philadelphia Eagles Sun November 5 13:03:00 16 13 357 2 295 2
2001 3 Philadelphia Eagles Dallas Cowboys Philadelphia Eagles Sun September 30 20:38:00 40 18 276 3 242 5
2001 10 Dallas Cowboys Philadelphia Eagles Philadelphia Eagles Sun November 18 13:03:00 36 3 227 1 213 4
2002 3 Philadelphia Eagles Dallas Cowboys Philadelphia Eagles Sun September 22 13:02:00 44 13 447 2 304 4
2002 16 Dallas Cowboys Philadelphia Eagles Philadelphia Eagles Sat December 21 20:39:00 27 3 359 2 146 3
2003 6 Dallas Cowboys Philadelphia Eagles Dallas Cowboys Sun October 12 13:03:00 23 21 292 1 232 1
2003 14 Philadelphia Eagles Dallas Cowboys Philadelphia Eagles Sun December 7 13:03:00 36 10 403 0 225 2
2004 10 Dallas Cowboys Philadelphia Eagles Philadelphia Eagles Mon November 15 21:09:00 49 21 485 0 317 3
2004 15 Philadelphia Eagles Dallas Cowboys Philadelphia Eagles Sun December 19 13:02:00 12 7 328 3 237 2
2005 5 Dallas Cowboys Philadelphia Eagles Dallas Cowboys Sun October 9 16:15:00 33 10 456 1 129 0
2005 10 Philadelphia Eagles Dallas Cowboys Dallas Cowboys Mon November 14 21:08:00 21 20 241 1 359 1
2006 5 Philadelphia Eagles Dallas Cowboys Philadelphia Eagles Sun October 8 16:14:00 38 24 383 2 320 5
2006 16 Dallas Cowboys Philadelphia Eagles Philadelphia Eagles Mon December 25 17:07:00 23 7 426 1 201 3
2007 9 Philadelphia Eagles Dallas Cowboys Dallas Cowboys Sun November 4 20:23:00 38 17 434 1 316 3
2007 15 Dallas Cowboys Philadelphia Eagles Philadelphia Eagles Sun December 16 16:15:00 10 6 315 1 240 3
2008 2 Dallas Cowboys Philadelphia Eagles Dallas Cowboys Mon September 15 20:40:00 41 37 380 2 337 1
2008 17 Philadelphia Eagles Dallas Cowboys Philadelphia Eagles Sun December 28 16:15:00 44 6 303 1 298 5
2009 9 Philadelphia Eagles Dallas Cowboys Dallas Cowboys Sun November 8 20:31:00 20 16 358 1 297 2
2009 17 Dallas Cowboys Philadelphia Eagles Dallas Cowboys Sun January 3 16:15:00 24 0 474 1 228 1
2009 NA Dallas Cowboys Philadelphia Eagles Dallas Cowboys Sat January 9 20:05:00 34 14 426 1 340 4
2010 14 Dallas Cowboys Philadelphia Eagles Philadelphia Eagles Sun December 12 20:30:00 30 27 429 2 349 2
2010 17 Philadelphia Eagles Dallas Cowboys Dallas Cowboys Sun January 2 16:15:00 14 13 272 1 244 4
2011 8 Philadelphia Eagles Dallas Cowboys Philadelphia Eagles Sun October 30 20:28:00 34 7 495 0 267 1
2011 16 Dallas Cowboys Philadelphia Eagles Philadelphia Eagles Sat December 24 16:15:00 20 7 386 1 238 0
2012 10 Philadelphia Eagles Dallas Cowboys Dallas Cowboys Sun November 11 16:25:00 38 23 294 0 369 2
2012 13 Dallas Cowboys Philadelphia Eagles Dallas Cowboys Sun December 2 20:20:00 38 33 417 0 423 1
2013 7 Philadelphia Eagles Dallas Cowboys Dallas Cowboys Sun October 20 13:02:00 17 3 368 2 278 3
2013 17 Dallas Cowboys Philadelphia Eagles Philadelphia Eagles Sun December 29 20:30:00 24 22 366 1 414 3
2014 13 Dallas Cowboys Philadelphia Eagles Philadelphia Eagles Thu November 27 16:36:00 33 10 464 1 267 3
2014 15 Philadelphia Eagles Dallas Cowboys Dallas Cowboys Sun December 14 20:35:00 38 27 364 1 294 3
2015 2 Philadelphia Eagles Dallas Cowboys Dallas Cowboys Sun September 20 16:25:00 20 10 359 2 226 3
2015 9 Dallas Cowboys Philadelphia Eagles Philadelphia Eagles Sun November 8 20:31:00 33 27 459 0 411 1
2016 8 Dallas Cowboys Philadelphia Eagles Dallas Cowboys Sun October 30 20:31:00 29 23 460 1 291 1
2016 17 Philadelphia Eagles Dallas Cowboys Philadelphia Eagles Sun January 1 13:00:00 27 13 346 0 195 2
2017 11 Dallas Cowboys Philadelphia Eagles Philadelphia Eagles Sun November 19 20:30:00 37 9 383 0 225 4
2017 17 Philadelphia Eagles Dallas Cowboys Dallas Cowboys Sun December 31 13:00:00 6 0 301 1 219 2
2018 10 Philadelphia Eagles Dallas Cowboys Dallas Cowboys Sun November 11 20:20:00 27 20 410 0 421 1
2018 14 Dallas Cowboys Philadelphia Eagles Dallas Cowboys Sun December 9 16:25:00 29 23 576 3 256 1
2019 7 Dallas Cowboys Philadelphia Eagles Dallas Cowboys Sun October 20 20:20:00 37 10 402 1 283 4
2019 16 Philadelphia Eagles Dallas Cowboys Philadelphia Eagles Sun December 22 16:25:00 17 9 431 0 311 1

Chiefs vs. Raiders

Which team has won more games in the past 20 years?

team games
Kansas City Chiefs 25
Oakland Raiders 15

Individual Game Statistics

year week home_team away_team winner day date time pts_win pts_loss yds_win turnovers_win yds_loss turnovers_loss
2000 7 Kansas City Chiefs Oakland Raiders Oakland Raiders Sun October 15 13:00:00 20 17 391 1 346 1
2000 10 Oakland Raiders Kansas City Chiefs Oakland Raiders Sun November 5 13:15:00 49 31 473 0 513 3
2001 1 Kansas City Chiefs Oakland Raiders Oakland Raiders Sun September 9 12:01:00 27 24 427 3 254 3
2001 13 Oakland Raiders Kansas City Chiefs Oakland Raiders Sun December 9 16:15:00 28 26 264 5 447 1
2002 8 Kansas City Chiefs Oakland Raiders Kansas City Chiefs Sun October 27 13:00:00 20 10 323 2 417 2
2002 17 Oakland Raiders Kansas City Chiefs Oakland Raiders Sat December 28 17:15:00 24 0 354 1 176 1
2003 7 Oakland Raiders Kansas City Chiefs Kansas City Chiefs Mon October 20 21:05:00 17 10 319 1 357 3
2003 12 Kansas City Chiefs Oakland Raiders Kansas City Chiefs Sun November 23 16:15:00 27 24 384 0 379 0
2004 13 Oakland Raiders Kansas City Chiefs Kansas City Chiefs Sun December 5 16:05:00 34 27 500 1 364 0
2004 16 Kansas City Chiefs Oakland Raiders Kansas City Chiefs Sat December 25 17:00:00 31 30 433 2 300 1
2005 2 Oakland Raiders Kansas City Chiefs Kansas City Chiefs Sun September 18 20:30:00 23 17 354 1 327 2
2005 9 Kansas City Chiefs Oakland Raiders Kansas City Chiefs Sun November 6 13:00:00 27 23 321 1 263 1
2006 11 Kansas City Chiefs Oakland Raiders Kansas City Chiefs Sun November 19 13:00:00 17 13 292 0 326 1
2006 16 Oakland Raiders Kansas City Chiefs Kansas City Chiefs Sat December 23 20:00:00 20 9 292 1 307 5
2007 7 Oakland Raiders Kansas City Chiefs Kansas City Chiefs Sun October 21 16:05:00 12 10 290 1 268 2
2007 12 Kansas City Chiefs Oakland Raiders Oakland Raiders Sun November 25 13:00:00 20 17 312 1 292 1
2008 2 Kansas City Chiefs Oakland Raiders Oakland Raiders Sun September 14 13:00:00 23 8 355 2 190 2
2008 13 Oakland Raiders Kansas City Chiefs Kansas City Chiefs Sun November 30 16:15:00 20 13 301 1 271 2
2009 2 Kansas City Chiefs Oakland Raiders Oakland Raiders Sun September 20 13:00:00 13 10 166 0 409 2
2009 10 Oakland Raiders Kansas City Chiefs Kansas City Chiefs Sun November 15 16:05:00 16 10 318 3 272 2
2010 9 Oakland Raiders Kansas City Chiefs Oakland Raiders Sun November 7 16:15:00 23 20 321 3 304 2
2010 17 Kansas City Chiefs Oakland Raiders Oakland Raiders Sun January 2 13:02:00 31 10 344 1 201 2
2011 7 Oakland Raiders Kansas City Chiefs Kansas City Chiefs Sun October 23 19:05:00 28 0 300 2 322 6
2011 16 Kansas City Chiefs Oakland Raiders Oakland Raiders Sat December 24 13:03:00 16 13 308 2 435 2
2012 8 Kansas City Chiefs Oakland Raiders Oakland Raiders Sun October 28 16:06:00 26 16 344 1 299 4
2012 15 Oakland Raiders Kansas City Chiefs Oakland Raiders Sun December 16 16:25:00 15 0 385 1 119 1
2013 6 Kansas City Chiefs Oakland Raiders Kansas City Chiefs Sun October 13 13:02:00 24 7 216 1 274 3
2013 15 Oakland Raiders Kansas City Chiefs Kansas City Chiefs Sun December 15 16:05:00 56 31 384 1 461 7
2014 12 Oakland Raiders Kansas City Chiefs Oakland Raiders Thu November 20 20:26:00 24 20 351 1 313 0
2014 15 Kansas City Chiefs Oakland Raiders Kansas City Chiefs Sun December 14 13:03:00 31 13 388 1 280 1
2015 13 Oakland Raiders Kansas City Chiefs Kansas City Chiefs Sun December 6 16:05:00 34 20 232 2 361 3
2015 17 Kansas City Chiefs Oakland Raiders Kansas City Chiefs Sun January 3 16:26:00 23 17 339 2 205 1
2016 6 Oakland Raiders Kansas City Chiefs Kansas City Chiefs Sun October 16 16:05:00 26 10 406 0 285 2
2016 14 Kansas City Chiefs Oakland Raiders Kansas City Chiefs Thu December 8 20:25:00 21 13 323 3 244 0
2017 7 Oakland Raiders Kansas City Chiefs Oakland Raiders Thu October 19 20:25:00 31 30 505 0 425 0
2017 14 Kansas City Chiefs Oakland Raiders Kansas City Chiefs Sun December 10 13:00:00 26 15 408 1 268 3
2018 13 Oakland Raiders Kansas City Chiefs Kansas City Chiefs Sun December 2 16:05:00 40 33 469 1 442 3
2018 17 Kansas City Chiefs Oakland Raiders Kansas City Chiefs Sun December 30 16:25:00 35 3 409 1 292 4
2019 2 Oakland Raiders Kansas City Chiefs Kansas City Chiefs Sun September 15 16:05:00 28 10 467 1 307 2
2019 13 Kansas City Chiefs Oakland Raiders Kansas City Chiefs Sun December 1 16:25:00 40 9 259 0 332 3

Ravens vs. Steelers

Which team has won more games in the past 20 years?

team games
Baltimore Ravens 22
Pittsburgh Steelers 22

Individual Game Statistics

year week home_team away_team winner day date time pts_win pts_loss yds_win turnovers_win yds_loss turnovers_loss
2000 1 Pittsburgh Steelers Baltimore Ravens Baltimore Ravens Sun September 3 13:02:00 16 0 336 0 223 1
2000 9 Baltimore Ravens Pittsburgh Steelers Pittsburgh Steelers Sun October 29 13:02:00 9 6 231 1 274 3
2001 8 Pittsburgh Steelers Baltimore Ravens Baltimore Ravens Sun November 4 13:01:00 13 10 183 1 348 1
2001 14 Baltimore Ravens Pittsburgh Steelers Pittsburgh Steelers Sun December 16 20:35:00 26 21 476 0 207 1
2001 NA Pittsburgh Steelers Baltimore Ravens Pittsburgh Steelers Sun January 20 12:40:00 27 10 297 1 150 4
2002 8 Baltimore Ravens Pittsburgh Steelers Pittsburgh Steelers Sun October 27 13:02:00 31 18 283 1 360 5
2002 17 Pittsburgh Steelers Baltimore Ravens Pittsburgh Steelers Sun December 29 13:02:00 34 31 351 2 422 4
2003 1 Pittsburgh Steelers Baltimore Ravens Pittsburgh Steelers Sun September 7 13:04:00 34 15 339 1 231 2
2003 17 Baltimore Ravens Pittsburgh Steelers Baltimore Ravens Sun December 28 20:36:00 13 10 279 2 214 5
2004 2 Baltimore Ravens Pittsburgh Steelers Baltimore Ravens Sun September 19 13:02:00 30 13 259 0 310 3
2004 16 Pittsburgh Steelers Baltimore Ravens Pittsburgh Steelers Sun December 26 13:01:00 20 7 404 2 248 1
2005 8 Pittsburgh Steelers Baltimore Ravens Pittsburgh Steelers Mon October 31 21:07:00 20 19 261 2 318 3
2005 11 Baltimore Ravens Pittsburgh Steelers Baltimore Ravens Sun November 20 13:02:00 16 13 241 2 282 2
2006 12 Baltimore Ravens Pittsburgh Steelers Baltimore Ravens Sun November 26 13:02:00 27 0 275 0 172 3
2006 16 Pittsburgh Steelers Baltimore Ravens Baltimore Ravens Sun December 24 13:02:00 31 7 359 3 251 3
2007 9 Pittsburgh Steelers Baltimore Ravens Pittsburgh Steelers Mon November 5 20:38:00 38 7 291 1 104 4
2007 17 Baltimore Ravens Pittsburgh Steelers Baltimore Ravens Sun December 30 16:17:00 27 21 334 1 264 3
2008 4 Pittsburgh Steelers Baltimore Ravens Pittsburgh Steelers Mon September 29 20:40:00 23 20 237 1 243 1
2008 15 Baltimore Ravens Pittsburgh Steelers Pittsburgh Steelers Sun December 14 16:15:00 13 9 311 2 202 2
2008 NA Pittsburgh Steelers Baltimore Ravens Pittsburgh Steelers Sun January 18 18:43:00 23 14 275 1 198 4
2009 12 Baltimore Ravens Pittsburgh Steelers Baltimore Ravens Sun November 29 20:30:00 20 17 393 2 298 1
2009 16 Pittsburgh Steelers Baltimore Ravens Pittsburgh Steelers Sun December 27 13:02:00 23 20 286 2 323 3
2010 4 Pittsburgh Steelers Baltimore Ravens Baltimore Ravens Sun October 3 13:02:00 17 14 320 2 210 1
2010 13 Baltimore Ravens Pittsburgh Steelers Pittsburgh Steelers Sun December 5 20:30:00 13 10 288 1 269 1
2010 NA Pittsburgh Steelers Baltimore Ravens Pittsburgh Steelers Sat January 15 16:35:00 31 24 263 2 126 3
2011 1 Baltimore Ravens Pittsburgh Steelers Baltimore Ravens Sun September 11 13:05:00 35 7 385 0 312 7
2011 9 Pittsburgh Steelers Baltimore Ravens Baltimore Ravens Sun November 6 20:30:00 23 20 356 1 392 2
2012 11 Pittsburgh Steelers Baltimore Ravens Baltimore Ravens Sun November 18 20:30:00 13 10 200 0 309 3
2012 13 Baltimore Ravens Pittsburgh Steelers Pittsburgh Steelers Sun December 2 16:25:00 23 20 366 3 288 2
2013 7 Pittsburgh Steelers Baltimore Ravens Pittsburgh Steelers Sun October 20 16:25:00 19 16 286 1 287 0
2013 13 Baltimore Ravens Pittsburgh Steelers Baltimore Ravens Thu November 28 20:31:00 22 20 311 0 329 0
2014 2 Baltimore Ravens Pittsburgh Steelers Baltimore Ravens Thu September 11 20:28:00 26 6 323 0 301 3
2014 9 Pittsburgh Steelers Baltimore Ravens Pittsburgh Steelers Sun November 2 20:30:00 43 23 376 1 332 2
2014 NA Pittsburgh Steelers Baltimore Ravens Baltimore Ravens Sat January 3 20:15:00 30 17 299 1 387 3
2015 4 Pittsburgh Steelers Baltimore Ravens Baltimore Ravens Thu October 1 20:26:00 23 20 356 2 263 0
2015 16 Baltimore Ravens Pittsburgh Steelers Baltimore Ravens Sun December 27 13:02:00 20 17 386 0 308 3
2016 9 Baltimore Ravens Pittsburgh Steelers Baltimore Ravens Sun November 6 13:00:00 21 14 274 1 277 1
2016 16 Pittsburgh Steelers Baltimore Ravens Pittsburgh Steelers Sun December 25 16:30:00 31 27 406 2 368 1
2017 4 Baltimore Ravens Pittsburgh Steelers Pittsburgh Steelers Sun October 1 13:00:00 26 9 381 1 288 3
2017 14 Pittsburgh Steelers Baltimore Ravens Pittsburgh Steelers Sun December 10 20:30:00 39 38 545 0 413 1
2018 4 Pittsburgh Steelers Baltimore Ravens Baltimore Ravens Sun September 30 20:20:00 26 14 451 1 284 2
2018 9 Baltimore Ravens Pittsburgh Steelers Pittsburgh Steelers Sun November 4 13:00:00 23 16 395 0 265 0
2019 5 Pittsburgh Steelers Baltimore Ravens Baltimore Ravens Sun October 6 13:00:00 26 23 277 3 269 2
2019 17 Baltimore Ravens Pittsburgh Steelers Baltimore Ravens Sun December 29 16:25:00 28 10 304 2 168 2

Giants vs. Redskins

Which team has won more games in the past 20 years?

team games
New York Giants 27
Washington Redskins 13

Individual Game Statistics

year week home_team away_team winner day date time pts_win pts_loss yds_win turnovers_win yds_loss turnovers_loss
2000 4 New York Giants Washington Redskins Washington Redskins Sun September 24 20:37:00 16 6 394 0 261 1
2000 14 Washington Redskins New York Giants New York Giants Sun December 3 13:01:00 9 7 305 3 290 2
2001 4 New York Giants Washington Redskins New York Giants Sun October 7 13:04:00 23 9 309 4 181 5
2001 7 Washington Redskins New York Giants Washington Redskins Sun October 28 16:05:00 35 21 353 0 388 2
2002 11 New York Giants Washington Redskins New York Giants Sun November 17 13:04:00 19 17 299 3 166 2
2002 14 Washington Redskins New York Giants New York Giants Sun December 8 13:02:00 27 21 316 0 447 5
2003 3 Washington Redskins New York Giants New York Giants Sun September 21 16:05:00 24 21 399 0 456 1
2003 14 New York Giants Washington Redskins Washington Redskins Sun December 7 13:02:00 20 7 288 0 220 3
2004 2 New York Giants Washington Redskins New York Giants Sun September 19 13:04:00 20 14 277 1 322 7
2004 13 Washington Redskins New York Giants Washington Redskins Sun December 5 16:16:00 31 7 379 0 145 0
2005 8 New York Giants Washington Redskins New York Giants Sun October 30 13:06:00 36 0 386 1 125 4
2005 16 Washington Redskins New York Giants Washington Redskins Sat December 24 13:03:00 35 20 380 1 332 1
2006 5 New York Giants Washington Redskins New York Giants Sun October 8 13:03:00 19 3 411 0 164 0
2006 17 Washington Redskins New York Giants New York Giants Sat December 30 20:05:00 34 28 355 0 393 2
2007 3 Washington Redskins New York Giants New York Giants Sun September 23 16:15:00 24 17 315 3 260 1
2007 15 New York Giants Washington Redskins Washington Redskins Sun December 16 20:24:00 22 10 309 0 307 1
2008 1 New York Giants Washington Redskins New York Giants Thu September 4 19:08:00 16 7 354 1 209 0
2008 13 Washington Redskins New York Giants New York Giants Sun November 30 13:04:00 23 7 404 1 320 2
2009 1 New York Giants Washington Redskins New York Giants Sun September 13 16:15:00 23 17 351 2 272 2
2009 15 Washington Redskins New York Giants New York Giants Mon December 21 20:40:00 45 12 387 0 302 3
2010 13 New York Giants Washington Redskins New York Giants Sun December 5 13:02:00 31 7 358 1 338 6
2010 17 Washington Redskins New York Giants New York Giants Sun January 2 16:16:00 17 14 325 1 385 4
2011 1 Washington Redskins New York Giants Washington Redskins Sun September 11 16:23:00 28 14 332 1 315 1
2011 15 New York Giants Washington Redskins Washington Redskins Sun December 18 13:02:00 23 10 300 2 324 3
2012 7 New York Giants Washington Redskins New York Giants Sun October 21 13:03:00 27 23 393 2 480 4
2012 13 Washington Redskins New York Giants Washington Redskins Mon December 3 20:40:00 17 16 370 1 390 0
2013 13 Washington Redskins New York Giants New York Giants Sun December 1 20:31:00 24 17 286 1 323 1
2013 17 New York Giants Washington Redskins New York Giants Sun December 29 13:04:00 20 6 278 3 251 4
2014 4 Washington Redskins New York Giants New York Giants Thu September 25 20:26:00 45 14 449 1 329 6
2014 15 New York Giants Washington Redskins New York Giants Sun December 14 13:02:00 24 13 287 1 372 1
2015 3 New York Giants Washington Redskins New York Giants Thu September 24 20:26:00 32 21 363 0 393 3
2015 12 Washington Redskins New York Giants Washington Redskins Sun November 29 13:03:00 20 14 407 0 332 3
2016 3 New York Giants Washington Redskins Washington Redskins Sun September 25 13:02:00 29 27 403 1 457 3
2016 17 Washington Redskins New York Giants New York Giants Sun January 1 16:25:00 19 10 332 0 284 3
2017 12 Washington Redskins New York Giants Washington Redskins Thu November 23 20:30:00 20 10 323 1 170 1
2017 17 New York Giants Washington Redskins New York Giants Sun December 31 13:00:00 18 10 381 1 197 3
2018 8 New York Giants Washington Redskins Washington Redskins Sun October 28 13:00:00 20 13 360 1 303 2
2018 14 Washington Redskins New York Giants New York Giants Sun December 9 13:00:00 40 16 402 1 288 3
2019 4 New York Giants Washington Redskins New York Giants Sun September 29 13:00:00 24 3 389 4 176 4
2019 16 Washington Redskins New York Giants New York Giants Sun December 22 13:00:00 41 35 552 0 361 0

Steelers vs. Bengals

Which team has won more games in the past 20 years?

team games
Cincinnati Bengals 9
Pittsburgh Steelers 33

Individual Game Statistics

year week home_team away_team winner day date time pts_win pts_loss yds_win turnovers_win yds_loss turnovers_loss
2000 7 Pittsburgh Steelers Cincinnati Bengals Pittsburgh Steelers Sun October 15 13:01:00 15 0 274 0 232 3
2000 13 Cincinnati Bengals Pittsburgh Steelers Pittsburgh Steelers Sun November 26 13:02:00 48 28 372 0 309 3
2001 4 Pittsburgh Steelers Cincinnati Bengals Pittsburgh Steelers Sun October 7 13:02:00 16 7 413 2 214 1
2001 16 Cincinnati Bengals Pittsburgh Steelers Cincinnati Bengals Sun December 30 13:02:00 26 23 544 3 313 5
2002 6 Cincinnati Bengals Pittsburgh Steelers Pittsburgh Steelers Sun October 13 13:02:00 34 7 408 2 268 4
2002 12 Pittsburgh Steelers Cincinnati Bengals Pittsburgh Steelers Sun November 24 13:01:00 29 21 391 0 352 1
2003 3 Cincinnati Bengals Pittsburgh Steelers Pittsburgh Steelers Sun September 21 13:02:00 17 10 376 1 182 1
2003 13 Pittsburgh Steelers Cincinnati Bengals Cincinnati Bengals Sun November 30 13:01:00 24 20 379 0 384 2
2004 4 Pittsburgh Steelers Cincinnati Bengals Pittsburgh Steelers Sun October 3 13:01:00 28 17 333 2 293 3
2004 11 Cincinnati Bengals Pittsburgh Steelers Pittsburgh Steelers Sun November 21 13:03:00 19 14 235 1 209 1
2005 7 Cincinnati Bengals Pittsburgh Steelers Pittsburgh Steelers Sun October 23 13:02:00 27 13 304 2 302 2
2005 13 Pittsburgh Steelers Cincinnati Bengals Cincinnati Bengals Sun December 4 13:02:00 38 31 324 0 474 4
2005 NA Cincinnati Bengals Pittsburgh Steelers Pittsburgh Steelers Sun January 8 16:36:00 31 17 346 0 327 2
2006 3 Pittsburgh Steelers Cincinnati Bengals Cincinnati Bengals Sun September 24 13:02:00 28 20 246 3 365 5
2006 17 Cincinnati Bengals Pittsburgh Steelers Pittsburgh Steelers Sun December 31 13:03:00 23 17 482 2 295 0
2007 8 Cincinnati Bengals Pittsburgh Steelers Pittsburgh Steelers Sun October 28 13:02:00 24 13 390 1 296 1
2007 13 Pittsburgh Steelers Cincinnati Bengals Pittsburgh Steelers Sun December 2 20:24:00 24 10 285 4 249 1
2008 7 Cincinnati Bengals Pittsburgh Steelers Pittsburgh Steelers Sun October 19 13:04:00 38 10 375 0 212 1
2008 12 Pittsburgh Steelers Cincinnati Bengals Pittsburgh Steelers Thu November 20 20:15:00 27 10 364 1 208 1
2009 3 Cincinnati Bengals Pittsburgh Steelers Cincinnati Bengals Sun September 27 16:15:00 23 20 273 0 373 1
2009 10 Pittsburgh Steelers Cincinnati Bengals Cincinnati Bengals Sun November 15 13:02:00 18 12 218 0 226 1
2010 9 Cincinnati Bengals Pittsburgh Steelers Pittsburgh Steelers Mon November 8 20:40:00 27 21 314 2 272 2
2010 14 Pittsburgh Steelers Cincinnati Bengals Pittsburgh Steelers Sun December 12 13:02:00 23 7 354 0 190 3
2011 10 Cincinnati Bengals Pittsburgh Steelers Pittsburgh Steelers Sun November 13 13:02:00 24 17 328 1 279 2
2011 13 Pittsburgh Steelers Cincinnati Bengals Pittsburgh Steelers Sun December 4 13:02:00 35 7 295 0 232 2
2012 7 Cincinnati Bengals Pittsburgh Steelers Pittsburgh Steelers Sun October 21 20:30:00 24 17 431 2 185 1
2012 16 Pittsburgh Steelers Cincinnati Bengals Cincinnati Bengals Sun December 23 13:02:00 13 10 267 3 280 3
2013 2 Cincinnati Bengals Pittsburgh Steelers Cincinnati Bengals Mon September 16 20:41:00 20 10 407 0 278 2
2013 15 Pittsburgh Steelers Cincinnati Bengals Pittsburgh Steelers Sun December 15 20:30:00 30 20 290 1 279 1
2014 14 Cincinnati Bengals Pittsburgh Steelers Pittsburgh Steelers Sun December 7 13:02:00 42 21 543 0 408 2
2014 17 Pittsburgh Steelers Cincinnati Bengals Pittsburgh Steelers Sun December 28 20:30:00 27 17 346 3 337 3
2015 8 Pittsburgh Steelers Cincinnati Bengals Cincinnati Bengals Sun November 1 13:02:00 16 10 296 2 356 3
2015 14 Cincinnati Bengals Pittsburgh Steelers Pittsburgh Steelers Sun December 13 13:02:00 33 20 354 1 385 3
2015 NA Cincinnati Bengals Pittsburgh Steelers Pittsburgh Steelers Sat January 9 20:15:00 18 16 369 2 279 4
2016 2 Pittsburgh Steelers Cincinnati Bengals Pittsburgh Steelers Sun September 18 13:02:00 24 16 374 2 412 2
2016 15 Cincinnati Bengals Pittsburgh Steelers Pittsburgh Steelers Sun December 18 13:00:00 24 20 382 0 222 1
2017 7 Pittsburgh Steelers Cincinnati Bengals Pittsburgh Steelers Sun October 22 16:25:00 29 14 420 0 179 2
2017 13 Cincinnati Bengals Pittsburgh Steelers Pittsburgh Steelers Mon December 4 20:30:00 23 20 374 1 353 0
2018 6 Cincinnati Bengals Pittsburgh Steelers Pittsburgh Steelers Sun October 14 13:00:00 28 21 481 0 275 0
2018 17 Pittsburgh Steelers Cincinnati Bengals Pittsburgh Steelers Sun December 30 16:25:00 16 13 343 1 196 0
2019 4 Pittsburgh Steelers Cincinnati Bengals Pittsburgh Steelers Mon September 30 20:15:00 27 3 326 1 175 2
2019 12 Cincinnati Bengals Pittsburgh Steelers Pittsburgh Steelers Sun November 24 13:00:00 16 10 338 1 244 2

Summary

The main goal of this analysis was to understand what goes into winning an NFL game and which teams have been historically successful in the standings. I broke the analysis into several sections, including, but not limited to: (1) The Importance of Fan Attendance; (2) Standings over the Years; (3) Offense vs. Defense; and (4) Individual Game Observations.

Through extensive use of R, I investigated eight datasets covering various aspects of the NFL. I frequently used linear modeling to examine the relationships between variables across these datasets. Additionally, the ggplot2 package delivered the visualizations that showcase this breakdown of the NFL. I also created new variables and tables to drill deeper into the raw data. One of my primary focuses was a breakdown of the divisions and their successes over the past 20 years. Box plot visualizations across the two conferences showed how teams have fared in the win column from their best season to their worst season.

My first analysis looked into NFL Fan Attendance. Graphical representations were created to better understand which teams have a strong fan base and how consistently fans show up from year to year. From this analysis, it was evident that the Dallas Cowboys have the strongest fan base and the Los Angeles Chargers have the weakest. Additionally, higher game attendance correlated positively with a team’s total wins per season.
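
The attendance and wins relationship described above can be sketched in a few lines of base R. This is a minimal illustration, not the dashboard's actual code: the data frame, its column names (`total_attendance`, `wins`), and every value in it are hypothetical stand-ins for the merged attendance and standings data.

```r
# Hypothetical team-season data; names and values are illustrative only
seasons <- data.frame(
  team             = c("A", "B", "C", "D", "E", "F"),
  total_attendance = c(1010000, 1050000, 1085000, 1120000, 1150000, 1190000),
  wins             = c(4, 7, 6, 9, 11, 12)
)

# Pearson correlation between season attendance and season wins
attendance_cor <- cor(seasons$total_attendance, seasons$wins)

# Simple linear model: wins as a function of attendance
attendance_fit <- lm(wins ~ total_attendance, data = seasons)
slope <- unname(coef(attendance_fit)["total_attendance"])

print(attendance_cor)
print(summary(attendance_fit)$r.squared)
```

A positive correlation and a positive slope would be consistent with the finding above, although neither says anything about which way the effect runs: winning teams may simply draw larger crowds.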

Secondly, I focused on the divisional standings through the years. As mentioned above, box plot visualizations by division showed the range of success for NFL teams. Per division, these teams have had the most success based on the nfl_standings dataset:

  • AFC East: New England Patriots
  • AFC North: Pittsburgh Steelers
  • AFC South: Indianapolis Colts
  • AFC West: Denver Broncos
  • NFC East: Philadelphia Eagles
  • NFC North: Green Bay Packers
  • NFC South: New Orleans Saints
  • NFC West: Seattle Seahawks

Using geom_col(), I observed that the AFC East has won the most Super Bowl Championships. This is due to the phenomenal success of Tom Brady and the New England Patriots during this time period.

Next, I researched one of the most common arguments in football - is the offense or the defense more important? Linear modeling of the nfl_standings data was completed on several variables. High offensive rankings and defensive rankings both correlate with more wins for teams. Although a great offense and a great defense are both important, the correlation tests indicated that a better offense is slightly more important to a team’s success than a better defense. I created a table of the last 20 Super Bowl Champions and showcased the offensive_ranking and defensive_ranking. Teams have been trending towards having better offenses in the last few years, as evidenced by this table.
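
One way to run the offense versus defense comparison described above is to correlate each rating with wins and then fit both in a single model. The sketch below uses made-up values, and it assumes that higher `offensive_ranking` and `defensive_ranking` values mean a better unit; that assumption should be checked against the actual nfl_standings data.

```r
# Hypothetical team-season ratings; values are illustrative only
standings <- data.frame(
  wins              = c(3, 5, 7, 8, 10, 12, 13),
  offensive_ranking = c(-6.0, -3.5, -1.0, 0.5, 2.0, 5.5, 7.0),
  defensive_ranking = c(-4.0, 1.0, -2.5, 0.0, 3.0, 1.5, 4.5)
)

# Correlation of each unit's rating with wins
cor_offense <- cor(standings$wins, standings$offensive_ranking)
cor_defense <- cor(standings$wins, standings$defensive_ranking)

# Joint linear model so the two effects are estimated together
fit <- lm(wins ~ offensive_ranking + defensive_ranking, data = standings)

print(c(offense = cor_offense, defense = cor_defense))
print(coef(fit))
```

Comparing the two correlations, or the two coefficients when the ratings share a scale, gives a rough sense of which side of the ball is more closely tied to winning.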

I then observed individual game data in the NFL. Through graphs created by ggpairs(), I was able to view correlation coefficients for six variables. The main conclusion I deduced from this is that a positive correlation exists between yards gained and points scored.

Extending my analysis, I looked into how weather conditions play a role in game outcomes. I found the average temperature, humidity, and wind for each location where teams play. I also trained and tested a model with a 70-30 split to see if weather conditions predict whether a game will be high-scoring or low-scoring. I deduced there is little predictability of game outcomes from weather conditions, as both the in-sample and out-of-sample performance of my models was underwhelming.
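
The 70-30 split and the in-sample versus out-of-sample comparison can be sketched as follows. The data here is simulated, the column names are assumptions rather than the real nfl_weather columns, and the model predicts total points with `lm()` as a simple stand-in for the high-scoring versus low-scoring models used in the dashboard.

```r
set.seed(42)  # make the random split reproducible

# Simulated stand-in for the weather data; column names are assumptions
n <- 100
weather <- data.frame(
  temperature  = runif(n, 10, 95),
  humidity     = runif(n, 20, 100),
  wind_mph     = runif(n, 0, 30),
  total_points = round(rnorm(n, mean = 43, sd = 13))
)

# 70-30 split: sample 70% of the row indices for training
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train <- weather[train_idx, ]
test  <- weather[-train_idx, ]

# Fit on the training rows only
fit <- lm(total_points ~ temperature + humidity + wind_mph, data = train)

# Root mean squared error, in-sample and out-of-sample
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
rmse_in  <- rmse(train$total_points, predict(fit, newdata = train))
rmse_out <- rmse(test$total_points, predict(fit, newdata = test))

print(c(in_sample = rmse_in, out_of_sample = rmse_out))
```

If both errors stay close to the spread of the outcome itself, the weather variables add little predictive value, which is the pattern reported above.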

I also analyzed league leaders over the past 20 years – this includes teams, head coaches, quarterbacks, passing yards leaders, rushing yards leaders, and penalty leaders. Understanding who the game-changers are is important when trying to predict which team will win.

Lastly, I analyzed popular rivalries in the NFL to see which teams have been dominant. My personal favorite finding is that the Pittsburgh Steelers are 33-9 against the Cincinnati Bengals over the past 20 years.
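
The team and games tables in the rivalry tabs amount to a count of the winner column. Here is a small base R sketch using only the first four Steelers-Bengals games listed in the rivalry table (2000 and 2001 seasons); the dashboard itself may compute this differently, for example with dplyr.

```r
# First four Steelers-Bengals games from the rivalry table (2000-2001)
rivalry <- data.frame(
  year   = c(2000, 2000, 2001, 2001),
  week   = c(7, 13, 4, 16),
  winner = c("Pittsburgh Steelers", "Pittsburgh Steelers",
             "Pittsburgh Steelers", "Cincinnati Bengals")
)

# Count wins per team, mirroring the team/games summary table
wins_by_team <- as.data.frame(table(team = rivalry$winner),
                              responseName = "games")

print(wins_by_team)
```

Applied to all 42 games in the table, the same count gives the 33-9 record quoted above.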

As a big NFL fan, I found it incredibly interesting to see how the NFL has worked during my entire lifetime. It was also intriguing to see my favorite team’s success over this time span. The NFL is a multi-billion dollar industry with implications on many levels. With fans across the globe, a deep dive into the NFL is exciting for many groups, and each could take something different from this analysis: coaches and players could prepare more effectively for their opponents, gamblers could make more educated bets, general managers could identify their team’s needs in the Draft, and the common fan could revel in their team’s history.

This data tells a phenomenal story of the state of the NFL. However, it is a game for a reason. No one will ever be able to fully predict NFL outcomes, and that is what makes the sport as intriguing as it is!

---
title: 'More than Touchdowns: An NFL Data Analysis'
output: 
  flexdashboard::flex_dashboard:
    source_code: embed
    social: menu
    theme: flatly
    vertical_layout: fill
---

Project Introduction {data-navmenu="Background" data-orientation=rows}
=============================================================================

Row
-----------------------------------------------------------------------------

### **Breaking Down the NFL**

From a young age, I have watched NFL football and cheered for my team every week.  Given my dad is from Pittsburgh and my parents met in Pittsburgh, I was naturally raised a lifelong Steelers fan.  With that said, I not only wanted to compare my team individually, but also the league as a whole.  The past 20 years have seen successes and failures from every NFL team.  When choosing an option for my final capstone project, I was instantly drawn to extending a project I had previously done on the NFL.  The NFL has a plethora of data points publicly available.  I thought, "What all goes into winning an NFL game, and what teams are historically successful in the final standings?"  Using the past 20 years worth of data, I sought to investigate this problem.

### **My Focus**

As mentioned above, for my final capstone project, I am expanding upon my final project from Data Wrangling in R (BANA 7025) with Professor Tianhai Zu.  I originally worked with a partner on this project; however, the extension will be my own individual work.  I plan on using the functions in R to deliver overall summary statistics on games and standings.  Additionally, I will use the data to explore potential correlations and plot the corresponding data visualizations. Utilizing descriptive analysis of the past 20 years, I am looking to see if there can be predictive tendencies for NFL teams.

This analysis includes data from 2000 - 2019.  I added 2020 season data to every dataset aside from `nfl_attendance` and `nfl_games`, as these would be skewed if 2020 data was added.  This skewness would be due to the impact of COVID-19.  COVID-19 caused games to be played on different days / times, cancellation of games, and it also caused little to no attendance based on location.

This NFL analysis consists of eight individual datasets:

1. **NFL Attendance** (`nfl_attendance`)

2. **NFL Standings** (`nfl_standings`)

3. **NFL Games** (`nfl_games`)

4. **NFL Weather** (`nfl_weather`)

5. **NFL Playoff Coaches and Quarterbacks** (`nfl_playoffs`)

6. **NFL Passing Yards Leaders** (`nfl_passing`)

7. **NFL Rushing Yards Leaders** (`nfl_rushing`)

8. **NFL Penalty Yards Per Game** (`nfl_penalty`)

More detailed information about each dataset can be found in the *Data Preparation* tab.

Row
-----------------------------------------------------------------------------

### **Goal**

The NFL is a multi-billion dollar industry. Millions of fans across the world cheer for these 32 teams every year. People are now looking for ways to understand the game better. 

Coaches want to understand what makes a team more successful. Sports gamblers want to get an edge and make the correct picks based on more than just gut feelings. Fans want to know if their team is progressing in the right direction. This analysis is useful for all of these situations. Using descriptive analysis, past results can be better explained. As such, trends can be identified to help predict how NFL games and seasons will unfold. Although no one can see into the future, understanding the data sheds a better light on the probability of certain results occurring in the NFL.

The goal of my analysis is to inform my readers on what all goes into winning an NFL game.  My hope is that the audience will finish reading my report and better understand historic trends and performance from teams, players, and coaches alike.  As a final capstone project, I hope to demonstrate proficiency in R using R Markdown as well as flexdashboard with Shiny components.

![](bengals.png){width=10%} ![](NFL.png){width=12%} ![](steelers.png){width=9%}
Analytical Technique and Approach {data-navmenu="Background"} ============================================================================= ### **Analytical Approach** The datasets contain loads of information for the NFL. With a wide range of variables, many options are available to analytically investigate the NFL. With the eight datasets at hand, I looked to compare them to draw conclusions about team performances. To see if statistical significance or rational conclusions related to the NFL could be realized, the following situations were explored: * **The Importance of Fan Attendance** - This data analysis will look into if the number of fans in attendance correlates to a team's success (number of games won). Additionally, it will provide comparisons of how teams fare in home versus away games while keeping in mind the attendance at those games. * **Standings over the Years** - The NFL has two conferences: the American Football Conference (AFC) and the National Football Conference (NFC). Each conference contains four divisions with four teams in each division. Each division then has a winner over the 16 game regular season. This analysis will look into the qualities of the division winners and the attributes that teams high in the standings have over teams in the lower portion of the standings. Furthermore, this approach will discover what separates Super Bowl Champions from the 31 other teams each season. * **Offense vs. Defense** - The two main parts of a NFL team are the offense and defense. The goal for each team is to be great on both sides. However, this is rarely the case. Using individual game data and season-long statistics, a thorough breakdown of how having a great offense or defense improves teams will be given. I will also see if having a better offense or defense is critical to success over the years. * **Individual Game Observations** - The `nfl_games` dataset contains many variables for games. Turnovers, day of the week, points, etc. 
are shown for every match-up. Correlations into why teams win or lose will be the goal of this analysis. Using a plethora of variables, significance of certain variables will be essential for further understanding. * **The Impact of Weather on Game Outcomes** - The `nfl_weather` dataset contains the information of both the home and away teams from 2000 - 2013. This dataset also includes three weather-related variables: (1) temperature, (2) humidity, and (3) wind speed (in mph). I want to see which teams perform under certain weather conditions. Additionally, I hope to create a few linear models to see if weather conditions can predict whether or not the game will be high-scoring or low-scoring. * **Successful Teams, Head Coaches, and Quarterbacks** - The `nfl_playoffs` dataset includes information of teams who went to the playoffs from 2000 - 2020. This dataset also includes the Super Bowl Champions. I am curious to analyze trends regarding the coaches and quarterbacks who led the teams to success. Are certain quarterbacks consistently better-performing? Are there better head coaches than others? * **Passing Yards Leaders** - The `nfl_passing` dataset includes information from the past 20 years on the players with the most passing yards. Which player has performed consistently over the past 20 years? Who is the "best"? * **Rushing Yards Leaders** - The `nfl_rushing` dataset has the same information as the `nfl_passing` information, except it focuses on rushing yards instead of passing yards. Which players had the most rushing yards each year from 2000 - 2020? * **Average Penalty Yards Per Game** - Penalties are game-changers when it comes to success in a football game. One mistake can lead to an automatic first down compared to what could have been a fourth down and ten yards. In this analysis, I want to see which teams have consistently lost yards in games due to penalties. * **Rivalry Analysis** - Every sports fanatic knows the top rivalries in the NFL. 
Regardless of whether you are a fan of these teams, many will tune into the game as the level of intensity is typically higher. With that said, I wanted to see which teams have been dominant in their respective rivalries.

### **Packages Required**

This project requires a variety of packages. Given there are over 10,000 packages in R, I want to focus on the ones that will provide me with the best results while cleaning and interpreting the data. Some packages will be more useful than others. For example, `ggplot2` allows for great visualizations that provide a better understanding of the data. Additionally, `dplyr` can drill deeper into the eight datasets to reach conclusions that may be hidden at first. R has powerful functions that can help answer questions about massive datasets. Please see below for all of the packages loaded for this analysis:

```{r, message = FALSE, warning = FALSE, echo = TRUE}
# Packages required
library(tidyverse)      # Use to tidy data
library(dplyr)          # Use to manipulate data
library(ggplot2)        # Use to plot data and create visualizations
library(tibble)         # Use to manipulate and re-imagine data
library(readr)          # Use to import data cleanly and efficiently
library(DT)             # Use to create comprehensive data tables with HTML output
library(knitr)          # Use for dynamic report generation
library(base)           # Contains Base R functions
library(ggthemes)       # Use themes in data visualizations
library(plotly)         # Use to plot data and create visualizations
library(ggpubr)         # Use to show multiple plots at once
library(GGally)         # Use to produce scatter plot matrix
library(rmarkdown)      # Use to produce report
library(flexdashboard)  # Use to produce flexdashboard
library(stringr)        # Provides functions to work with strings
library(highcharter)    # Includes shortcut functions to plot R objects
library(shinythemes)    # Use to implement themes for output
```

Importing the Data {data-navmenu="Background" data-orientation=rows}
=============================================================================

### **Importing the Data**

Most of the data (`nfl_attendance`, `nfl_standings`, and `nfl_games`) was obtained from my professor, Tianhai Zu, for the Data Wrangling in R class. He provided four different datasets from which to choose, and my partner and I chose the NFL option. These datasets can be found on [GitHub](https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-02-04/readme.md). Reading the information on GitHub led me to the original source of the data, which is [Pro Football Reference Standings](https://www.pro-football-reference.com/years/2019/index.htm) and [Pro Football Reference Attendance](https://www.pro-football-reference.com/years/2019/attendance.htm).

This NFL analysis consists of eight individual datasets: (1) `nfl_attendance`, (2) `nfl_standings`, (3) `nfl_games`, (4) `nfl_weather`, (5) `nfl_playoffs`, (6) `nfl_passing`, (7) `nfl_rushing`, and (8) `nfl_penalty`. I first merged the two datasets that share team and season keys (`nfl_attendance` and `nfl_standings`) into one dataframe called `nfl_df`; `nfl_games` is organized by match-up rather than by team and season, so it is kept as its own dataset. I decided it might be beneficial to have multiple frames of reference, some using the individual datasets and another using the combined dataframe. Rather than using `str()` and `summary()` to show descriptive statistics for each variable, I decided to create comprehensive tables. Then, in the *Data Preparation* tab, I cleaned every dataset.
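As a rough illustration of how a merge on shared keys behaves, the sketch below joins two toy tibbles on `year`, `team_name`, and `team`. The tibbles `att_toy` and `st_toy` and all of their values are made up for illustration and are not drawn from the actual data. Note that `dplyr::left_join()` merges two tables at a time, so bringing in a third table would need a second, chained call.

```{r, eval = FALSE, echo = TRUE}
library(dplyr)
library(tibble)

# Hypothetical weekly attendance rows (values are invented)
att_toy <- tibble(
  team = c("Pittsburgh", "Pittsburgh", "Dallas"),
  team_name = c("Steelers", "Steelers", "Cowboys"),
  year = c(2019, 2019, 2019),
  week = c(1, 2, 1),
  weekly_attendance = c(60000, 61000, 90000)
)

# Hypothetical season-level standings rows (values are invented)
st_toy <- tibble(
  team = c("Pittsburgh", "Dallas"),
  team_name = c("Steelers", "Cowboys"),
  year = c(2019, 2019),
  wins = c(9, 7)
)

# Every attendance row keeps its place and picks up the matching season columns
joined_toy <- left_join(att_toy, st_toy, by = c("year", "team_name", "team"))
```

Because the standings table has one row per team and season, each weekly row simply repeats that season's values; the row count of the left table is unchanged.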
```{r, message = FALSE, warning = FALSE, echo = TRUE, cache = TRUE}
# Get working directory
getwd()

# Get the data
nfl_attendance <- readr::read_csv('attendance.csv')
nfl_standings <- readr::read_csv('updatedstandings.csv')
nfl_games <- readr::read_csv('games.csv')
nfl_weather <- readr::read_csv('weather.csv')
nfl_playoffs <- readr::read_csv('post_season.csv')
nfl_passing <- readr::read_csv('passing_yards_leaders.csv')
nfl_rushing <- readr::read_csv('rushing_yards_leaders.csv')
nfl_penalty <- readr::read_csv('penalty_yards_per_game.csv')

# To use 2020 data you need to update tidytuesdayR from GitHub
# Install via devtools::install_github("thebioengineer/tidytuesdayR")
# Either call below loads the same Tidy Tuesday release, so only one is needed
tuesdata <- tidytuesdayR::tt_load('2020-02-04')
# tuesdata <- tidytuesdayR::tt_load(2020, week = 6)

attendance <- tuesdata$attendance

# Join attendance and standings on their shared keys with dplyr.
# left_join() merges two tables at a time; nfl_games has no team or team_name
# column, so it cannot be joined on these keys and stays a separate dataset.
nfl_df <- dplyr::left_join(nfl_attendance, nfl_standings,
                           by = c("year", "team_name", "team"))
```

Data Preparation {data-navmenu="Background" data-orientation=rows}
=============================================================================

Row {.tabset .tabset-fade}
-----------------------------------------------------------------------------

### **Attendance**

As mentioned, the `nfl_attendance` dataset was imported and obtained from Pro Football Reference. The original data contains 10,846 observations and eight variables. There are two character type variables, `team` and `team_name`. There are six numeric type variables, `year`, `total`, `home`, `away`, `week`, and `weekly_attendance`. The data was collected from 2000 - 2019, and the values for the columns were observed during the 17 weeks of the NFL season.
```{r, message = FALSE, warning = FALSE, echo = TRUE, results = 'hide'}
# Examine the structure of the dataset
datatable(head(nfl_attendance, 10))

# Create a data dictionary for attendance
var_names_att <- colnames(nfl_attendance)
var_types_att <- lapply(nfl_attendance, class) # lapply returns a list of the same length as X (a vector)
var_descriptions_att <- c("City or state in which the team originates",
                          "Name or mascot of the team",
                          "Year",
                          "Total attendance per season",
                          "Total attendance at home games per season",
                          "Total attendance at away games per season",
                          "Week in which game was played",
                          "Attendance for given week")
data_dict_att <- as_tibble(cbind(var_names_att, var_types_att, var_descriptions_att))
colnames(data_dict_att) <- c("Variable Name", "Variable Data Type", "Variable Description")
kable(data_dict_att) # kable returns a single table for a single data object
```

Looking at the missing data values, the only column with missing values is `weekly_attendance`. This makes sense, as each NFL team has at least one bye week during the regular season. I decided to omit these values, as they would skew the data and misrepresent the trends for each team.

```{r, message = FALSE, warning = FALSE, echo = TRUE, results = 'hide'}
colSums(is.na(nfl_attendance)) # Find the number of missing values per column
nfl_attendance <- na.omit(nfl_attendance)
colSums(is.na(nfl_attendance)) # Confirm there are no missing values
```

Looking at the original dataset above, I decided to first rename the columns to better describe the data.

```{r, message = FALSE, warning = FALSE, echo = TRUE, results = 'hide'}
nfl_attendance <- nfl_attendance %>%
  dplyr::rename(
    team_location = team,
    total_attendance = total,
    total_home_attendance = home,
    total_away_attendance = away
  )
```

Additionally, I split it into two dataframes, the first omitting the weekly data, and the second omitting the season totals.
This decision was made largely to remove duplicates, and I knew it would bode well for the visualizations during the exploratory data analysis (EDA). The first dataset, `nfl_total_attendance`, drops the two columns `week` and `weekly_attendance`. This dataset shows the season attendance totals for each team. The second dataset, `nfl_weekly_attendance`, drops the season total columns, `total_attendance`, `total_home_attendance`, and `total_away_attendance`.

```{r, message = FALSE, warning = FALSE, echo = TRUE, results = 'hide'}
nfl_total_attendance <- nfl_attendance[-c(7, 8)] # Remove weekly data
nfl_total_attendance <- nfl_total_attendance[!duplicated(nfl_total_attendance), ] # Remove duplicates
datatable(head(nfl_total_attendance, 10))

nfl_weekly_attendance <- nfl_attendance[-c(4, 5, 6)] # Remove season total attendance data
datatable(head(nfl_weekly_attendance, 10))
```

Now, for a summary of the two datasets and associated tables of the ***CLEANED*** data, please see below.

**NFL Total Attendance Dataset**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# Examine the final summary and structure of the nfl_total_attendance dataset
datatable(head(nfl_total_attendance, 10))
```

**Data Dictionary for the NFL Total Attendance Dataset**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# Create a data dictionary for nfl_total_attendance
var_names_att <- colnames(nfl_total_attendance)
var_types_att <- lapply(nfl_total_attendance, class) # lapply returns a list of the same length as X (a vector)
var_descriptions_att <- c("City or state in which the team originates",
                          "Name or mascot of the team",
                          "Year",
                          "Total attendance per season",
                          "Total attendance at home games per season",
                          "Total attendance at away games per season")
data_dict_total_att <- as_tibble(cbind(var_names_att, var_types_att, var_descriptions_att))
colnames(data_dict_total_att) <- c("Variable Name", "Variable Data Type", "Variable Description")
kable(data_dict_total_att) # kable returns a single table for a single data object
```

**NFL Weekly Attendance Dataset**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# Examine the final summary and structure of the nfl_weekly_attendance dataset
datatable(head(nfl_weekly_attendance, 10))
```

**Data Dictionary for the NFL Weekly Attendance Dataset**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# Create a data dictionary for nfl_weekly_attendance
var_names_att <- colnames(nfl_weekly_attendance)
var_types_att <- lapply(nfl_weekly_attendance, class) # lapply returns a list of the same length as X (a vector)
var_descriptions_att <- c("City or state in which the team originates",
                          "Name or mascot of the team",
                          "Year",
                          "Week in which game was played",
                          "Attendance for given week")
data_dict_weekly_att <- as_tibble(cbind(var_names_att, var_types_att, var_descriptions_att))
colnames(data_dict_weekly_att) <- c("Variable Name", "Variable Data Type", "Variable Description")
kable(data_dict_weekly_att) # kable returns a single table for a single data object
```

### **Standings**

The `nfl_standings` dataset was imported and obtained from Pro Football Reference. The original data contains 638 observations and 15 variables. There are four character type variables, `team`, `team_name`, `playoffs`, and `sb_winner`. There are 11 numeric type variables, `year`, `wins`, `loss`, `points_for`, `points_against`, `points_differential`, `margin_of_victory`, `strength_of_schedule`, `simple_rating`, `offensive_ranking`, and `defensive_ranking`. The data observed was collected from 2000 - 2020. The process of cleaning the ***ORIGINAL*** data can be seen below.
```{r, message = FALSE, warning = FALSE, echo = TRUE, results = 'hide'}
# Examine the structure of the dataset
datatable(head(nfl_standings, 10))

# Create a data dictionary for standings
var_names_st <- colnames(nfl_standings)
var_types_st <- lapply(nfl_standings, class) # lapply returns a list of the same length as X (a vector)
var_descriptions_st <- c("City or state in which the team originates",
                         "Name or mascot of the team",
                         "Year",
                         "Total wins per season (0 to 16)",
                         "Total losses per season (0 to 16)",
                         "Total points the team scored per season",
                         "Total points the opponent scored on the team per season",
                         "The difference between the total points for the team and against the team",
                         "Points differential divided by the total number of games per season",
                         "Difficulty of schedule based on opponent records",
                         "A rating for the team that takes into account points differential and strength of schedule (measured by Simple Rating System)",
                         "A rating comparing how well the offense performs to opponent teams (measured by Simple Rating System)",
                         "A rating comparing how well the defense performs to opponent teams (measured by Simple Rating System)",
                         "Stating whether or not the team made it to the playoffs",
                         "Stating whether or not the team won the Super Bowl for the season")
data_dict_st <- as_tibble(cbind(var_names_st, var_types_st, var_descriptions_st))
colnames(data_dict_st) <- c("Variable Name", "Variable Data Type", "Variable Description")
kable(data_dict_st) # kable returns a single table for a single data object
```

Looking at the above dataset, I first decided to change the column names to better describe the data.

```{r, message = FALSE, warning = FALSE, echo = TRUE, results = 'hide'}
nfl_standings <- nfl_standings %>%
  dplyr::rename(
    team_location = team,
    total_wins = wins,
    total_losses = loss
  )
```

It is important to note as well that a few of the variables are calculated values.
The calculated value for `points_differential` is: `points_differential = points_for - points_against`. Additionally, `margin_of_victory` is calculated by: `margin_of_victory = (points_for - points_against) / games_played`. Lastly, the `simple_rating` is calculated by:

$$SRS = MoV + SoS = OSRS + DSRS$$

In layman's terms, the simple rating (SRS) is equal to the margin of victory (MoV) plus the strength of schedule (SoS). This is also equal to the offensive simple rating (OSRS) plus the defensive simple rating (DSRS).

Next, I wanted to see what the sum of missing values was per column. As evident below, there are no missing values.

```{r, message = FALSE, warning = FALSE, echo = TRUE, results = 'hide'}
colSums(is.na(nfl_standings))
```

Moving forward, I decided to change both `playoffs` and `sb_winner` to binary variables. This is because they both only have two unique values.

```{r, message = FALSE, warning = FALSE, echo = TRUE, results = 'hide'}
unique(nfl_standings$playoffs, incomparables = FALSE)  # View the unique values for the playoffs column
unique(nfl_standings$sb_winner, incomparables = FALSE) # View the unique values for the sb_winner column
```

Knowing this, I changed the two columns to binary variables. For the `playoffs` column, a value of one stands for "Playoffs", and a value of zero stands for "No Playoffs".

```{r, message = FALSE, warning = FALSE, echo = TRUE, results = 'hide'}
nfl_standings$playoffs[nfl_standings$playoffs == "Playoffs"] <- "1"
nfl_standings$playoffs[nfl_standings$playoffs == "No Playoffs"] <- "0"
nfl_standings$playoffs <- as.numeric(nfl_standings$playoffs)
```

For the `sb_winner` column, a value of one denotes "Won Superbowl", and a value of zero denotes "No Superbowl".
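One compact way to express this kind of recoding, shown here on a made-up vector rather than the real column, is a single vectorized comparison; the chunk that follows applies the step-by-step version to the real data. The name `sb_toy` and its values are hypothetical stand-ins for `sb_winner`.

```{r, eval = FALSE, echo = TRUE}
# Hypothetical labels standing in for the sb_winner column
sb_toy <- c("No Superbowl", "Won Superbowl", "No Superbowl")

# The comparison yields TRUE/FALSE, which as.numeric() turns into 1/0
sb_binary <- as.numeric(sb_toy == "Won Superbowl")
```

One difference worth keeping in mind: with this shortcut, any unexpected label is coded as zero, whereas the replace-then-convert approach would turn it into `NA`, which makes stray values easier to spot.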
```{r, message = FALSE, warning = FALSE, echo = TRUE, results = 'hide'}
nfl_standings$sb_winner[nfl_standings$sb_winner == "Won Superbowl"] <- "1"
nfl_standings$sb_winner[nfl_standings$sb_winner == "No Superbowl"] <- "0"
nfl_standings$sb_winner <- as.numeric(nfl_standings$sb_winner)
```

Now, for a summary of the dataset and associated table of the data, please see the ***CLEANED*** dataset below.

**NFL Standings Dataset**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# Examine the structure of the dataset
datatable(head(nfl_standings, 10))
```

**Data Dictionary for the NFL Standings Dataset**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# Create a data dictionary for standings
var_names_st <- colnames(nfl_standings)
var_types_st <- lapply(nfl_standings, class) # lapply returns a list of the same length as X (a vector)
var_descriptions_st <- c("City or state in which the team originates",
                         "Name or mascot of the team",
                         "Year",
                         "Total wins per season (0 to 16)",
                         "Total losses per season (0 to 16)",
                         "Total points the team scored per season",
                         "Total points the opponent scored on the team per season",
                         "The difference between the total points for the team and against the team",
                         "Points differential divided by the total number of games per season",
                         "Difficulty of schedule based on opponent records",
                         "A rating for the team that takes into account points differential and strength of schedule (measured by Simple Rating System)",
                         "A rating comparing how well the offense performs to opponent teams (measured by Simple Rating System)",
                         "A rating comparing how well the defense performs to opponent teams (measured by Simple Rating System)",
                         "Stating whether or not the team made it to the playoffs",
                         "Stating whether or not the team won the Super Bowl for the season")
data_dict_st <- as_tibble(cbind(var_names_st, var_types_st, var_descriptions_st))
colnames(data_dict_st) <- c("Variable Name", "Variable Data Type", "Variable Description")
kable(data_dict_st) # kable returns a single table for a single data object
```

### **Games**

Once again, the `nfl_games` data was imported and obtained from Pro Football Reference. The original data contains 5,324 observations and 19 variables. There are 11 character variables, `week`, `home_team`, `away_team`, `winner`, `tie`, `day`, `date`, `home_team_name`, `home_team_city`, `away_team_name`, and `away_team_city`. There are seven numeric type variables, `year`, `pts_win`, `pts_loss`, `yds_win`, `turnovers_win`, `yds_loss`, and `turnovers_loss`. The remaining variable, `time`, holds the time of day at which the game was played. See the ***ORIGINAL*** dataset below.

```{r, message = FALSE, warning = FALSE, echo = TRUE, results = 'hide'}
# Examine the structure of the dataset
datatable(head(nfl_games, 10))

# Create a data dictionary for games
var_names_games <- colnames(nfl_games)
var_types_games <- lapply(nfl_games, class) # lapply returns a list of the same length as X (a vector)
var_descriptions_games <- c("Year",
                            "Week of the season in which the game was played",
                            "Home team for the game",
                            "Away team for the game",
                            "Winner of the game",
                            "Was there a tie? (if so, the other team will be listed in this column)",
                            "Day of the week in which the game was played",
                            "Date of the game",
                            "Time of the day in which the game was played",
                            "Number of points the winning team scored",
                            "Number of points the losing team scored",
                            "Total number of yards the winning team had",
                            "Total number of turnovers the winning team had",
                            "Total number of yards the losing team had",
                            "Total number of turnovers the losing team had",
                            "Name or mascot of the home team",
                            "City of the home team",
                            "Name or mascot of the away team",
                            "City of the away team")
data_dict_games <- as_tibble(cbind(var_names_games, var_types_games, var_descriptions_games))
colnames(data_dict_games) <- c("Variable Name", "Variable Data Type", "Variable Description")
kable(data_dict_games) # kable returns a single table for a single data object
```

Looking at the above dataset, the first step I took to clean the data was to remove the last four columns, as I felt they were redundant.

```{r, message = FALSE, warning = FALSE, echo = TRUE, results = 'hide'}
names(nfl_games)
nfl_games <- nfl_games[-c(16, 17, 18, 19)] # Remove redundant columns
names(nfl_games)
```

Then, I changed the `week` column to be numeric.

```{r, message = FALSE, warning = FALSE, echo = TRUE, results = 'hide'}
# Note: any non-numeric week labels, if present, become NA in this conversion
nfl_games$week <- as.numeric(nfl_games$week)
```

Looking at missing values, the only column which contained them was the `tie` column. This makes sense, as very few NFL games result in a tie. A tie was denoted by listing one team name in the `winner` column and the opposing team name in the `tie` column. To fix this, I identified any game that resulted in a tie. Then, for these specific games, I changed the value in the `winner` column to "Tie". The `tie` column was then removed.
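To make the tie logic concrete, here is a small sketch on a made-up tibble, `games_toy`, whose team labels are hypothetical and not taken from `nfl_games`. It assumes a game counts as a tie whenever its `tie` entry is not missing, which `!is.na()` tests directly.

```{r, eval = FALSE, echo = TRUE}
library(tibble)

# Hypothetical games: only the second row has an entry in the tie column
games_toy <- tibble(
  winner = c("Team A", "Team C", "Team E"),
  tie    = c(NA, "Team D", NA)
)

# Relabel the winner wherever a tie opponent is recorded
games_toy$winner[!is.na(games_toy$tie)] <- "Tie"

# Drop the tie column by name once it is no longer needed
games_toy$tie <- NULL
```

Testing for missingness explicitly avoids comparing the column against the result of `is.na()`, which mixes character and logical values and is harder to reason about.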
```{r, message = FALSE, warning = FALSE, echo = TRUE, results = 'hide'}
colSums(is.na(nfl_games))
unique(nfl_games$tie, incomparables = FALSE)

# A game is a tie when the tie column is not missing
nfl_games$winner[!is.na(nfl_games$tie)] <- "Tie"
nfl_games <- nfl_games[-c(6)] # Remove the tie column
colSums(is.na(nfl_games)) # Check the remaining missing values per column
```

To view the summary and structure of the ***CLEANED*** data:

**NFL Games Dataset**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# Examine the structure of the dataset
datatable(head(nfl_games, 10))
```

**Data Dictionary for the NFL Games Dataset**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# Create a data dictionary for games
var_names_games <- colnames(nfl_games)
var_types_games <- lapply(nfl_games, class) # lapply returns a list of the same length as X (a vector)
var_descriptions_games <- c("Year",
                            "Week of the season in which the game was played",
                            "Home team for the game",
                            "Away team for the game",
                            "Winner of the game",
                            "Day of the week in which the game was played",
                            "Date of the game",
                            "Time of the day in which the game was played",
                            "Number of points the winning team scored",
                            "Number of points the losing team scored",
                            "Total number of yards the winning team had",
                            "Total number of turnovers the winning team had",
                            "Total number of yards the losing team had",
                            "Total number of turnovers the losing team had")
data_dict_games <- as_tibble(cbind(var_names_games, var_types_games, var_descriptions_games))
colnames(data_dict_games) <- c("Variable Name", "Variable Data Type", "Variable Description")
kable(data_dict_games) # kable returns a single table for a single data object
```

### **Weather**

Incorporating weather data into my analysis is an interesting next step. I want to see how the weather impacts the outcome of individual games. The `nfl_weather` data is from [NFLsavant.com](http://nflsavant.com/about.php). All data and statistics from this site are compiled from publicly-available NFL play-by-play on the Internet.
The one negative is that this data only runs through 2013; however, I thought 14 seasons of data (2000 - 2013) were enough to reveal any significant trends. The original data contains 3,521 observations and 13 variables. The variables are described in the data dictionary below. See the ***ORIGINAL*** NFL Weather data below.

```{r, message = FALSE, warning = FALSE, echo = TRUE, results = 'hide'}
# Examine the structure of the dataset
datatable(head(nfl_weather, 10))

# Create a data dictionary for weather
var_names_w <- colnames(nfl_weather)
var_types_w <- lapply(nfl_weather, class) # lapply returns a list of the same length as X (a vector)
var_descriptions_w <- c("Full home team name",
                        "City or state in which the home team originates",
                        "Name or mascot of the home team",
                        "Total points scored by the home team",
                        "Full away team name",
                        "City or state in which the away team originates",
                        "Name or mascot of the away team",
                        "Total points scored by the away team",
                        "Winner of the game",
                        "Temperature during the game (in Fahrenheit)",
                        "Humidity percentage during the game",
                        "Wind speed in miles per hour (mph) during the game",
                        "Date of the game played")
data_dict_w <- as_tibble(cbind(var_names_w, var_types_w, var_descriptions_w))
colnames(data_dict_w) <- c("Variable Name", "Variable Data Type", "Variable Description")
kable(data_dict_w) # kable returns a single table for a single data object
```

Looking at the above dataset, the first step I took to clean the data was to remove the `home_team` and `away_team` columns, as I felt they were redundant.
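Dropping columns by name rather than by position can be a little more robust if the column order ever changes. The sketch below assumes the two redundant columns are named `home_team` and `away_team`, as described above; the tibble `weather_toy`, its other column names, and its values are made up for illustration.

```{r, eval = FALSE, echo = TRUE}
library(dplyr)
library(tibble)

# Hypothetical one-row weather table (all values invented)
weather_toy <- tibble(
  home_team = "City A Mascots",
  home_team_city = "City A",
  home_team_name = "Mascots",
  away_team = "City B Rivals",
  temperature = 55
)

# Remove the full-name columns by name; the remaining columns keep their order
weather_clean <- weather_toy %>%
  select(-home_team, -away_team)
```

The chunk that follows does the same job by index on the real data, which works as long as the columns stay in their original positions.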
```{r, message = FALSE, warning = FALSE, echo = TRUE, results = 'hide'}
names(nfl_weather)
nfl_weather <- nfl_weather[-c(1, 5)] # Remove redundant columns
names(nfl_weather)
```

To view the summary and structure of the ***CLEANED*** data:

**NFL Weather Dataset**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# Examine the structure of the dataset
datatable(head(nfl_weather, 10))
```

**Data Dictionary for the NFL Weather Dataset**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# Create a data dictionary for weather
var_names_w <- colnames(nfl_weather)
var_types_w <- lapply(nfl_weather, class) # lapply returns a list of the same length as X (a vector)
var_descriptions_w <- c("City or state in which the home team originates",
                        "Name or mascot of the home team",
                        "Total points scored by the home team",
                        "City or state in which the away team originates",
                        "Name or mascot of the away team",
                        "Total points scored by the away team",
                        "Winner of the game",
                        "Temperature during the game (in Fahrenheit)",
                        "Humidity percentage during the game",
                        "Wind speed in miles per hour (mph) during the game",
                        "Date of the game played")
data_dict_w <- as_tibble(cbind(var_names_w, var_types_w, var_descriptions_w))
colnames(data_dict_w) <- c("Variable Name", "Variable Data Type", "Variable Description")
kable(data_dict_w) # kable returns a single table for a single data object
```

### **Playoffs**

The next dataset within my analysis is the `nfl_playoffs` dataset. This looks into the coaches and quarterbacks for each team that went to the playoffs from 2000 - 2020. I created this dataset myself through research. The original data contains nine variables, which are described in the data dictionary below. See the ***ORIGINAL*** NFL Playoffs data below.
```{r, message = FALSE, warning = FALSE, echo = TRUE, results = 'hide'}
# Examine the structure of the dataset
datatable(head(nfl_playoffs, 10))

# Create a data dictionary for playoffs
var_names_playoffs <- colnames(nfl_playoffs)
var_types_playoffs <- lapply(nfl_playoffs, class) # lapply returns a list of the same length as X (a vector)
var_descriptions_playoffs <- c("City or state in which the team originates",
                               "Name or mascot of the team",
                               "Year",
                               "Total wins for the team",
                               "Total losses for the team",
                               "Whether or not the team went to the Playoffs",
                               "Whether or not the team won the Super Bowl",
                               "Head coach of the team",
                               "Starting quarterback during the postseason")
data_dict_playoffs <- as_tibble(cbind(var_names_playoffs, var_types_playoffs, var_descriptions_playoffs))
colnames(data_dict_playoffs) <- c("Variable Name", "Variable Data Type", "Variable Description")
kable(data_dict_playoffs) # kable returns a single table for a single data object
```

Moving forward, I decided to change `sb_winner` to a binary variable, because it only has two unique values. Because the only unique value in the `playoffs` column is "Playoffs", I decided to drop that column.

```{r, message = FALSE, warning = FALSE, echo = TRUE, results = 'hide'}
unique(nfl_playoffs$playoffs, incomparables = FALSE)  # View the unique values for the playoffs column
unique(nfl_playoffs$sb_winner, incomparables = FALSE) # View the unique values for the sb_winner column
names(nfl_playoffs)
nfl_playoffs <- nfl_playoffs[-6] # Remove unnecessary column
names(nfl_playoffs)
```

For the `sb_winner` column, a value of one denotes "Won Superbowl", and a value of zero denotes "No Superbowl".
```{r, message = FALSE, warning = FALSE, echo = TRUE, results = 'hide'}
nfl_playoffs$sb_winner[nfl_playoffs$sb_winner == "Won Superbowl"] <- "1"
nfl_playoffs$sb_winner[nfl_playoffs$sb_winner == "No Superbowl"] <- "0"
nfl_playoffs$sb_winner <- as.numeric(nfl_playoffs$sb_winner)
```

To view the summary and structure of the ***CLEANED*** data:

**NFL Playoffs Dataset**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# Examine the structure of the dataset
datatable(head(nfl_playoffs, 10))
```

**Data Dictionary for the NFL Playoffs Dataset**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# Create a data dictionary for playoffs
var_names_playoffs <- colnames(nfl_playoffs)
var_types_playoffs <- lapply(nfl_playoffs, class) # lapply returns a list of the same length as X (a vector)
var_descriptions_playoffs <- c("City or state in which the team originates",
                               "Name or mascot of the team",
                               "Year",
                               "Total wins for the team",
                               "Total losses for the team",
                               "Whether or not the team won the Super Bowl",
                               "Head coach of the team",
                               "Starting quarterback during the postseason")
data_dict_playoffs <- as_tibble(cbind(var_names_playoffs, var_types_playoffs, var_descriptions_playoffs))
colnames(data_dict_playoffs) <- c("Variable Name", "Variable Data Type", "Variable Description")
kable(data_dict_playoffs) # kable returns a single table for a single data object
```

### **Passing Yards Leaders**

The `nfl_passing` dataset contains information regarding the league leader for passing yards from each year. Their respective team information is included. This data is from [Pro Football Reference](https://www.pro-football-reference.com/).
This dataset does not need to be cleaned or edited. To view the summary and structure of the data:

**NFL Passing Dataset**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# Examine the structure of the dataset
datatable(head(nfl_passing, 10))
```

**Data Dictionary for the NFL Passing Dataset**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# Create a data dictionary for passing
var_names_passing <- colnames(nfl_passing)
var_types_passing <- lapply(nfl_passing, class) # lapply returns a list of the same length as X (a vector)
var_descriptions_passing <- c("Year",
                              "Name of the player with the most passing yards",
                              "Total yards",
                              "Location of the player's team",
                              "Name or mascot of the player's team")
data_dict_passing <- as_tibble(cbind(var_names_passing, var_types_passing, var_descriptions_passing))
colnames(data_dict_passing) <- c("Variable Name", "Variable Data Type", "Variable Description")
kable(data_dict_passing) # kable returns a single table for a single data object
```

### **Rushing Yards Leaders**

The next dataset, `nfl_rushing`, contains information regarding the league leader for rushing yards from each year. Their respective team information is included. This data is also from [Pro Football Reference](https://www.pro-football-reference.com/).
Similar to the last dataset, this dataset does not need to be cleaned or edited. To view the summary and structure of the data:

**NFL Rushing Dataset**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# Examine the structure of the dataset
datatable(head(nfl_rushing, 10))
```

**Data Dictionary for the NFL Rushing Dataset**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# Create a data dictionary for rushing
var_names_rushing <- colnames(nfl_rushing)
var_types_rushing <- lapply(nfl_rushing, class) # lapply returns a list of the same length as X (a vector)
var_descriptions_rushing <- c("Year",
                              "Name of the player with the most rushing yards",
                              "Total yards",
                              "Location of the player's team",
                              "Name or mascot of the player's team")
data_dict_rushing <- as_tibble(cbind(var_names_rushing, var_types_rushing, var_descriptions_rushing))
colnames(data_dict_rushing) <- c("Variable Name", "Variable Data Type", "Variable Description")
kable(data_dict_rushing) # kable returns a single table for a single data object
```

### **Penalties Per Game**

The `nfl_penalty` dataset contains the average penalty yards per game for each team from 2003 - 2020. The data is from [TeamRankings](https://www.teamrankings.com/nfl/stat/penalty-yards-per-game?date=2020-02-03).
This dataset did not need to be cleaned. To look at the summary and structure of the data:

**NFL Penalty Dataset**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# Examine the structure of the dataset
datatable(head(nfl_penalty, 10))
```

**Data Dictionary for the NFL Penalty Dataset**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# Create a data dictionary for penalty
var_names_penalty <- colnames(nfl_penalty)
var_types_penalty <- lapply(nfl_penalty, class) # lapply returns a list of the same length as X (a vector)
var_descriptions_penalty <- c("City or state in which the team originates",
                              "Name or mascot of the team",
                              # One description per season column, from 2020 down to 2003
                              paste("Average penalty yards per game from", 2020:2003),
                              "Total penalty yards")
data_dict_penalty <- as_tibble(cbind(var_names_penalty, var_types_penalty, var_descriptions_penalty))
colnames(data_dict_penalty) <- c("Variable Name", "Variable Data Type", "Variable Description")
kable(data_dict_penalty) # kable returns a single table for a single data object
```

Total Attendance Breakdown {data-navmenu="Attendance" data-orientation=columns}
=============================================================================

Column {.sidebar data-width=450}
-----------------------------------------------------------------------------

#### **Total Attendance Breakdown**

As mentioned in the introduction, this data analysis will look into whether the number of fans in attendance correlates with a team's success (number of games won). Additionally, it will provide comparisons of how teams fare in home versus away games while keeping in mind the attendance at those games. Earlier in the data preparation, I split the `attendance` dataset into two separate datasets, `nfl_total_attendance` and `nfl_weekly_attendance`.

To first understand the importance of fan attendance, it is critical to observe which teams have had the strongest fan base over the past 20 years. Instead of using the teams' total attendance numbers, I wanted to take an average of each team's weekly attendance. I feel this will give me a more accurate representation of attendance. With that said, I added a column to the `nfl_weekly_attendance` dataset to calculate the mean.

```{r, message = FALSE, warning = FALSE, echo = FALSE, results = 'hide'}
# Add a column of yearly means for each team's attendance
nfl_weekly_attendance <- nfl_weekly_attendance %>%
  group_by(team_name, year) %>%
  mutate(avg_attendance = mean(weekly_attendance))

head(nfl_weekly_attendance, 20)
```

Using the `ggplotly` function, the graph becomes interactive. To interact with the visualization to the right, you can hover over the lines to get a detailed description including the year, average weekly attendance, and team. You can click on a team once in the legend to *remove* it from the visualization, or you can double-click on the team in the legend to *isolate* that line. This interaction enables you to filter to specific teams in order to see their attendance trends since 2000.

In the visualization to the right, it is evident that the **Dallas Cowboys** appear to have the strongest fan base, and the **Los Angeles Chargers** appear to have the weakest fan base.
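
The visual impression of the strongest and weakest fan bases can be cross-checked numerically by averaging weekly attendance per team and sorting the result. The following is only a minimal sketch: `toy_attendance`, the team labels, and every attendance figure in it are made-up placeholders standing in for `nfl_weekly_attendance`, not the real data.

```r
# Toy stand-in for nfl_weekly_attendance; all numbers are invented for illustration
toy_attendance <- data.frame(
  team_name = rep(c("Team A", "Team B", "Team C"), each = 4),
  year = rep(c(2018, 2018, 2019, 2019), times = 3),
  weekly_attendance = c(90000, 92000, 91000, 93000,
                        60000, 61000, 59000, 62000,
                        25000, 27000, 26000, 30000),
  stringsAsFactors = FALSE
)

# Average weekly attendance per team across all seasons
# (the formula interface of aggregate() omits rows with missing values by default)
team_means <- aggregate(weekly_attendance ~ team_name, data = toy_attendance, FUN = mean)

# Sort from highest to lowest average attendance
team_ranking <- team_means[order(-team_means$weekly_attendance), ]
team_ranking
```

The first and last rows of `team_ranking` then correspond to the strongest and weakest fan bases under this definition.
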
The top five teams with the current highest attendance records are:

1. **Dallas Cowboys**
2. **Green Bay Packers**
3. **Los Angeles Rams**
4. **New York Giants**
5. **Philadelphia Eagles**

It is also important to note that the spike in attendance for the Dallas Cowboys in 2009 can be attributed to the opening of their brand new AT&T Stadium. This stadium opened on May 27, 2009. The stadium holds 80,000 people in the stands but can be expanded to hold more than 100,000 individuals when standing room only areas are included.

Column {.tabset .tabset-fade}
-----------------------------------------------------------------------------

### **NFL Weekly Attendance**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# Create visualization for average weekly attendance per year for all teams
team_avg_attendance <- ggplot(data = nfl_weekly_attendance,
                              aes(x = year, y = avg_attendance, color = team_name)) +
  geom_point(size = 1, alpha = .8) +
  geom_smooth(size = .8, se = FALSE) +
  scale_y_continuous(name = "Average Weekly Attendance") +
  scale_x_continuous(name = "Year") +
  ggtitle("Average Weekly Attendance Per Year") +
  labs(col = "Team Name") +
  theme_stata()

ggplotly(team_avg_attendance)
```

### **Division-Basis**

Now, I wanted to break attendance down on a division-basis. In order to do this, I added a column to the dataset, called "division".
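
One way to build such a column is a lookup table keyed by team name, which avoids a long chain of nested conditions. The sketch below is only an illustration and not necessarily how this dashboard does it: `division_lookup` and `toy_teams` are hypothetical names, and only two divisions are filled in to keep the example short.

```r
# Hypothetical lookup: names are team names, values are divisions
# (only two divisions shown; the remaining six would be added the same way)
division_lookup <- c(
  Patriots = "AFC East", Bills = "AFC East", Jets = "AFC East", Dolphins = "AFC East",
  Ravens = "AFC North", Steelers = "AFC North", Bengals = "AFC North", Browns = "AFC North"
)

# Toy data frame standing in for nfl_weekly_attendance
toy_teams <- data.frame(
  team_name = c("Steelers", "Jets", "Unknown Team"),
  stringsAsFactors = FALSE
)

# Index the lookup by team name; teams missing from the lookup become NA
toy_teams$division <- unname(division_lookup[toy_teams$team_name])
toy_teams
```

Keeping the mapping in one named vector means it can be reused for any dataset that has a `team_name` column.
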
```{r, message = FALSE, warning = FALSE, echo = FALSE}
# Attach the dataset to avoid calling on specific columns
attach(nfl_weekly_attendance)

# Create new column
nfl_weekly_attendance$division <-
  ifelse(team_name == "Patriots" | team_name == "Bills" | team_name == "Jets" | team_name == "Dolphins", "AFC East",
  ifelse(team_name == "Ravens" | team_name == "Steelers" | team_name == "Bengals" | team_name == "Browns", "AFC North",
  ifelse(team_name == "Texans" | team_name == "Titans" | team_name == "Colts" | team_name == "Jaguars", "AFC South",
  ifelse(team_name == "Chiefs" | team_name == "Broncos" | team_name == "Raiders" | team_name == "Chargers", "AFC West",
  ifelse(team_name == "Eagles" | team_name == "Cowboys" | team_name == "Giants" | team_name == "Redskins", "NFC East",
  ifelse(team_name == "Packers" | team_name == "Vikings" | team_name == "Bears" | team_name == "Lions", "NFC North",
  ifelse(team_name == "Saints" | team_name == "Falcons" | team_name == "Buccaneers" | team_name == "Panthers", "NFC South",
  ifelse(team_name == "49ers" | team_name == "Seahawks" | team_name == "Rams" | team_name == "Cardinals", "NFC West",
         NA))))))))
```

Once the `division` column was created, the breakdown of the strongest and weakest fan bases per division can be seen in the table below. Individual graphs for both the AFC and NFC can be seen under the tabs *AFC Attendance Breakdown* and *NFC Attendance Breakdown*.

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# Create table for attendance summary
attendance_summary <- matrix(c("New York Jets", "Miami Dolphins",
                               "Baltimore Ravens", "Cincinnati Bengals",
                               "Houston Texans", "Indianapolis Colts",
                               "Kansas City Chiefs", "Los Angeles Chargers",
                               "Dallas Cowboys", "Washington Redskins",
                               "Green Bay Packers", "Detroit Lions",
                               "New Orleans Saints", "Tampa Bay Buccaneers",
                               "Los Angeles Rams", "Arizona Cardinals"),
                             ncol = 2, byrow = TRUE)
colnames(attendance_summary) <- c("Strongest Fan Base","Weakest Fan Base")
rownames(attendance_summary) <- c("AFC East","AFC North","AFC South", "AFC West",
                                  "NFC East", "NFC North", "NFC South", "NFC West")
attendance_summary <- as.table(attendance_summary)
kable(attendance_summary)
```

AFC Attendance Breakdown {data-navmenu="Attendance" data-orientation=rows}
=============================================================================

Row
-----------------------------------------------------------------------------

### **AFC East**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# AFC East
ggplotly(
  nfl_weekly_attendance %>%
    filter(division == "AFC East") %>%
    ggplot(aes(x = year, y = avg_attendance, color = team_name)) +
    geom_point(size = 1, alpha = .8) +
    geom_smooth(size = .8, se = FALSE) +
    scale_y_continuous(name = "Average Weekly Attendance") +
    scale_x_continuous(name = "Year") +
    ggtitle("AFC East Average Weekly Attendance Per Year") +
    labs(col = "Team Name") +
    theme_stata()
)
```

### **AFC North**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# AFC North
ggplotly(
  nfl_weekly_attendance %>%
    filter(division == "AFC North") %>%
    ggplot(aes(x = year, y = avg_attendance, color = team_name)) +
    geom_point(size = 1, alpha = .8) +
    geom_smooth(size = .8, se = FALSE) +
    scale_y_continuous(name = "Average Weekly Attendance") +
    scale_x_continuous(name = "Year") +
    ggtitle("AFC North Average Weekly Attendance Per Year") +
    labs(col = "Team Name") +
    theme_stata()
)
```

Row
------------------------------------------------------------------------------

### **AFC South**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# AFC South
ggplotly(
  nfl_weekly_attendance %>%
    filter(division == "AFC South") %>%
    ggplot(aes(x = year, y = avg_attendance, color = team_name)) +
    geom_point(size = 1, alpha = .8) +
    geom_smooth(size = .8, se = FALSE) +
    scale_y_continuous(name = "Average Weekly Attendance") +
    scale_x_continuous(name = "Year") +
    ggtitle("AFC South Average Weekly Attendance Per Year") +
    labs(col = "Team Name") +
    theme_stata()
)
```

### **AFC West**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# AFC West
ggplotly(
  nfl_weekly_attendance %>%
    filter(division == "AFC West") %>%
    ggplot(aes(x = year, y = avg_attendance, color = team_name)) +
    geom_point(size = 1, alpha = .8) +
    geom_smooth(size = .8, se = FALSE) +
    scale_y_continuous(name = "Average Weekly Attendance") +
    scale_x_continuous(name = "Year") +
    ggtitle("AFC West Average Weekly Attendance Per Year") +
    labs(col = "Team Name") +
    theme_stata()
)
```

NFC Attendance Breakdown {data-navmenu="Attendance" data-orientation=rows}
=============================================================================

Row
-----------------------------------------------------------------------------

### **NFC East**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# NFC East
ggplotly(
  nfl_weekly_attendance %>%
    filter(division == "NFC East") %>%
    ggplot(aes(x = year, y = avg_attendance, color = team_name)) +
    geom_point(size = 1, alpha = .8) +
    geom_smooth(size = .8, se = FALSE) +
    scale_y_continuous(name = "Average Weekly Attendance") +
    scale_x_continuous(name = "Year") +
    ggtitle("NFC East Average Weekly Attendance Per Year") +
    labs(col = "Team Name") +
    theme_stata()
)
```

### **NFC North**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# NFC North
ggplotly(
  nfl_weekly_attendance %>%
    filter(division == "NFC North") %>%
    ggplot(aes(x = year, y = avg_attendance, color = team_name)) +
    geom_point(size = 1, alpha = .8) +
    geom_smooth(size = .8, se = FALSE) +
    scale_y_continuous(name = "Average Weekly Attendance") +
    scale_x_continuous(name = "Year") +
    ggtitle("NFC North Average Weekly Attendance Per Year") +
    labs(col = "Team Name") +
    theme_stata()
)
```

Row
-----------------------------------------------------------------------------

### **NFC South**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# NFC South
ggplotly(
  nfl_weekly_attendance %>%
    filter(division == "NFC South") %>%
    ggplot(aes(x = year, y = avg_attendance, color = team_name)) +
    geom_point(size = 1, alpha = .8) +
    geom_smooth(size = .8, se = FALSE) +
    scale_y_continuous(name = "Average Weekly Attendance") +
    scale_x_continuous(name = "Year") +
    ggtitle("NFC South Average Weekly Attendance Per Year") +
    labs(col = "Team Name") +
    theme_stata()
)
```

### **NFC West**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# NFC West
ggplotly(
  nfl_weekly_attendance %>%
    filter(division == "NFC West") %>%
    ggplot(aes(x = year, y = avg_attendance, color = team_name)) +
    geom_point(size = 1, alpha = .8) +
    geom_smooth(size = .8, se = FALSE) +
    scale_y_continuous(name = "Average Weekly Attendance") +
    scale_x_continuous(name = "Year") +
    ggtitle("NFC West Average Weekly Attendance Per Year") +
    labs(col = "Team Name") +
    theme_stata()
)
```

Impact on Wins {data-navmenu="Attendance" data-orientation=columns}
=============================================================================

Column
-----------------------------------------------------------------------------

### **What Impacts Total Wins?**

Knowing the previously discussed attendance statistics, I want to see if a stronger home attendance impacts the total number of wins. A team cannot necessarily control their away attendance, as their most loyal fans are assumed to be unlikely attendees at an away game.

First, I wanted to discover if home attendance impacts total wins.
To do so, I used the `lm()` function to fit a simple linear regression with `total_wins` as the response variable and `total_home_attendance` as the predictor variable. I also obtained the correlation coefficient between the two variables. To the right, in the **Home Attendance** tab, it appears that there is a slight, positive linear relationship between the predictor variable (**X** or `total_home_attendance`) and the response variable (**Y** or `total_wins`). The correlation coefficient between the two variables is 0.1507, and this relationship is statistically significant at a 99% confidence level with a p-value of *0.000133*.

Next, I wanted to discover if away attendance impacts total wins. I followed the same process I did for home attendance, fitting a simple linear regression with `total_wins` as the response variable and `total_away_attendance` as the predictor variable. From the visualization in the **Away Attendance** tab, it appears that there is also a very slight, positive linear relationship between the predictor variable (**X** or `total_away_attendance`) and the response variable (**Y** or `total_wins`). The correlation coefficient between the two variables is 0.1274, and this relationship is statistically significant at a 99% confidence level with a p-value of *0.00126*.

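
The correlation coefficient and its p-value can also be obtained together with `cor.test()`, and for a simple linear regression the model's R-squared equals the squared correlation. The sketch below illustrates this on simulated data; `toy_data` and its values are randomly generated placeholders rather than the NFL data, so its output will not match the figures reported above.

```r
set.seed(1)  # make the simulated example reproducible

# Simulated stand-in for joined_data (values are random, not real NFL figures)
toy_data <- data.frame(
  total_home_attendance = rnorm(100, mean = 540000, sd = 40000)
)
toy_data$total_wins <- 8 +
  0.00001 * (toy_data$total_home_attendance - 540000) +
  rnorm(100, sd = 3)

# Simple linear regression, passing the data frame explicitly instead of attaching it
toy_model <- lm(total_wins ~ total_home_attendance, data = toy_data)

# Pearson correlation together with its test statistic and p-value
toy_cor <- cor.test(toy_data$total_wins, toy_data$total_home_attendance)

toy_cor$estimate              # correlation coefficient r
toy_cor$p.value               # p-value for the null hypothesis r = 0
summary(toy_model)$r.squared  # equals r^2 in simple linear regression
```

Applied to the reported home-attendance coefficient of 0.1507, this identity gives an R-squared of roughly 0.023, which is consistent with describing the relationship as slight.
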
```{r, message = FALSE, warning = FALSE, echo = FALSE, results = 'hide'}
# Select columns from nfl_standings
nfl_wins <- nfl_standings %>%
  select(team_name, year, total_wins)

# Perform left join to get needed statistics
joined_data <- left_join(nfl_total_attendance, nfl_wins, by = c("team_name", "year"))
joined_data
```

```{r, message = FALSE, warning = FALSE, echo = TRUE}
# Attach the dataset
attach(joined_data)

# Create linear model for home attendance
home_attendance_model <- lm(total_wins ~ total_home_attendance)
summary(home_attendance_model)
cor(total_wins, total_home_attendance)
```

```{r, message = FALSE, warning = FALSE, echo = TRUE}
# Attach the dataset
attach(joined_data)

# Create linear model for away attendance
away_attendance_model <- lm(total_wins ~ total_away_attendance)
summary(away_attendance_model)
cor(total_wins, total_away_attendance)
```

Column {.tabset}
------------------------------------------------------------------------------

### **Home Attendance**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# Plot data
home_atten_v_wins <- ggplot(data = joined_data, aes(total_home_attendance, total_wins)) +
  geom_point(size = 1, alpha = .8, col = "red") +
  geom_smooth(method = "lm", size = .8, se = FALSE) +
  xlab("Total Home Attendance") +
  ylab("Total Wins") +
  ggtitle("Total Home Attendance vs. Total Wins")

ggplotly(home_atten_v_wins)
```

### **Away Attendance**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# Plot data
away_atten_v_wins <- ggplot(data = joined_data, aes(total_away_attendance, total_wins)) +
  geom_point(size = 1, alpha = .8, col = "magenta") +
  geom_smooth(method = "lm", size = .8, se = FALSE) +
  xlab("Total Away Attendance") +
  ylab("Total Wins") +
  ggtitle("Total Away Attendance vs. Total Wins")

ggplotly(away_atten_v_wins)
```

Super Bowl Champions {data-navmenu="Standings" data-orientation=rows}
=============================================================================

Row
-----------------------------------------------------------------------------

### **Standings Over the Years**

This part of the analysis will look into the qualities of the division winners and the attributes that teams high in the standings have over teams in the lower portion of the standings. Furthermore, this approach will discover what separates Super Bowl Champions from the 31 other teams each season.

Firstly, I wanted to see which division has brought home the most Super Bowl Championships over the past 20 years. I once again added a "division" column to `nfl_standings`. As evident in the visualization below, the **AFC East** has had the most Super Bowl wins between 2000-2020. This can be largely attributed to the New England Patriots' former quarterback Tom Brady and current head coach Bill Belichick bringing home championships in 2002, 2004, 2005, 2015, 2017, and 2019. Additionally, the second-best division appears to be the **AFC North**, with both the Pittsburgh Steelers and Baltimore Ravens winning at least one Super Bowl Championship each. Conversely, it appears the AFC South, NFC North, and NFC West have each won only one Super Bowl over the past two decades.

Analyzing NFL standings with the given datasets is a bit tricky because standings are calculated using tie-breakers when necessary. Additionally, which teams make the playoffs is largely based on division success. With that being said, the team that had the most wins might not be the team with the best standing. For this analysis, I decided to break the teams down by division and see which ones have been dominant over the years.
I analyzed their success by using summary statistics showing the *Average Total Wins*, *Average Total Losses*, *Average Points Per Game*, and *Average Opponent Points Per Game*. The results can be seen in the tabs *AFC Summaries* and *NFC Summaries*. I also developed box plots of total wins per season for each team, grouped by division, to analyze the range of data for each team and any relevant outliers. These box plots can be seen in the tabs *AFC Box Plots | Total Wins* and *NFC Box Plots | Total Wins*. I also grouped the box plots by conference (AFC vs. NFC).

The most dominant teams per division, defined by the highest average of total wins (as discovered in the *AFC Summaries* and *NFC Summaries* tabs), are as follows:

* **AFC East**: New England Patriots
* **AFC North**: Pittsburgh Steelers
* **AFC South**: Indianapolis Colts
* **AFC West**: Denver Broncos
* **NFC East**: Philadelphia Eagles
* **NFC North**: Green Bay Packers
* **NFC South**: New Orleans Saints
* **NFC West**: Seattle Seahawks

I also developed a table of the last 20 Super Bowl winners with their offensive and defensive ranking. This table can be found in the **Rankings of Super Bowl Champions** tab.

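
The per-division summary statistics described above could also be computed in a single grouped pipeline rather than one block per division. The sketch below is a hedged illustration on a tiny made-up data frame: `toy_standings`, its team and division labels, and all of its numbers are invented, and `games_per_season` assumes 16-game regular seasons when converting season point totals to per-game averages.

```r
library(dplyr)

# Made-up stand-in for nfl_standings (two teams, two seasons each)
toy_standings <- data.frame(
  division = rep("Division X", 4),
  team_name = c("Team A", "Team A", "Team B", "Team B"),
  total_wins = c(10, 12, 4, 6),
  total_losses = c(6, 4, 12, 10),
  points_for = c(400, 432, 288, 320),
  points_against = c(320, 304, 416, 384),
  stringsAsFactors = FALSE
)

games_per_season <- 16  # assumption: regular seasons of 16 games

# One row per team, ordered so each division's top team by average wins comes first
toy_summary <- toy_standings %>%
  group_by(division, team_name) %>%
  summarise(
    avg_wins = mean(total_wins),
    avg_losses = mean(total_losses),
    avg_points_per_game = mean(points_for) / games_per_season,
    avg_opp_points_per_game = mean(points_against) / games_per_season,
    .groups = "drop"
  ) %>%
  arrange(division, desc(avg_wins))

toy_summary
```

Filtering `toy_summary` on `division` would then reproduce a single division's table.
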
```{r, message = FALSE, warning = FALSE, echo = FALSE}
# Attach the dataset
attach(nfl_standings)

# Create new column
nfl_standings$division <-
  ifelse(team_name == "Patriots" | team_name == "Bills" | team_name == "Jets" | team_name == "Dolphins", "AFC East",
  ifelse(team_name == "Ravens" | team_name == "Steelers" | team_name == "Bengals" | team_name == "Browns", "AFC North",
  ifelse(team_name == "Texans" | team_name == "Titans" | team_name == "Colts" | team_name == "Jaguars", "AFC South",
  ifelse(team_name == "Chiefs" | team_name == "Broncos" | team_name == "Raiders" | team_name == "Chargers", "AFC West",
  ifelse(team_name == "Eagles" | team_name == "Cowboys" | team_name == "Giants" | team_name == "Redskins", "NFC East",
  ifelse(team_name == "Packers" | team_name == "Vikings" | team_name == "Bears" | team_name == "Lions", "NFC North",
  ifelse(team_name == "Saints" | team_name == "Falcons" | team_name == "Buccaneers" | team_name == "Panthers", "NFC South",
  ifelse(team_name == "49ers" | team_name == "Seahawks" | team_name == "Rams" | team_name == "Cardinals", "NFC West",
         NA))))))))
```

Row {.tabset .tabset-fade}
------------------------------------------------------------------------------

### **Super Bowl Champions Per Division**

```{r, message = FALSE, warning = FALSE, echo = FALSE, fig.width = 10, fig.height = 11}
# Plot Super Bowl Championships per division
sb_champions_division <- ggplot(data = nfl_standings,
                                aes(reorder(division, -sb_winner), sb_winner, col = team_name)) +
  geom_col() +
  ggtitle("Super Bowl Winners by Division") +
  xlab("Division") +
  ylab("Count of Super Bowl Championships Won") +
  labs(col = "Team Name")

ggplotly(sb_champions_division)
```

### **Rankings of Super Bowl Champions**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# Create table for all Super Bowl Champions
super_bowl_champs <- nfl_standings %>%
  filter(sb_winner == 1) %>%
  select(year, team_name, offensive_ranking, defensive_ranking)
colnames(super_bowl_champs) <- c("Year", "Super Bowl Champion",
                                 "Offensive Ranking", "Defensive Ranking")
kable(super_bowl_champs)
```

AFC Summaries {data-navmenu="Standings" data-orientation=rows}
=============================================================================

Row
-----------------------------------------------------------------------------

### **AFC East Summary**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# AFC East
afc_east <- nfl_standings %>%
  filter(division == "AFC East") %>%
  select(team_name, total_wins, points_for, total_losses, points_against) %>%
  group_by(team_name) %>%
  summarise(total_wins = mean(total_wins),
            total_losses = mean(total_losses),
            points_for = mean(points_for)/16,
            points_against = mean(points_against)/16)
colnames(afc_east) <- c("Team Name",
                        "Average Total Wins", "Average Total Losses",
                        "Average Points Per Game", "Average Opponent Points Per Game")
kable(afc_east)
```

### **AFC North Summary**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# AFC North
afc_north <- nfl_standings %>%
  filter(division == "AFC North") %>%
  select(team_name, total_wins, points_for, total_losses, points_against) %>%
  group_by(team_name) %>%
  summarise(total_wins = mean(total_wins),
            total_losses = mean(total_losses),
            points_for = mean(points_for)/16,
            points_against = mean(points_against)/16)
colnames(afc_north) <- c("Team Name",
                         "Average Total Wins", "Average Total Losses",
                         "Average Points Per Game", "Average Opponent Points Per Game")
kable(afc_north)
```

Row
-----------------------------------------------------------------------------

### **AFC South Summary**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# AFC South
afc_south <- nfl_standings %>%
  filter(division == "AFC South") %>%
  select(team_name, total_wins, points_for, total_losses, points_against) %>%
  group_by(team_name) %>%
  summarise(total_wins = mean(total_wins),
            total_losses = mean(total_losses),
            points_for = mean(points_for)/16,
            points_against = mean(points_against)/16)
colnames(afc_south) <- c("Team Name",
                         "Average Total Wins", "Average Total Losses",
                         "Average Points Per Game", "Average Opponent Points Per Game")
kable(afc_south)
```

### **AFC West Summary**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# AFC West
afc_west <- nfl_standings %>%
  filter(division == "AFC West") %>%
  select(team_name, total_wins, points_for, total_losses, points_against) %>%
  group_by(team_name) %>%
  summarise(total_wins = mean(total_wins),
            total_losses = mean(total_losses),
            points_for = mean(points_for)/16,
            points_against = mean(points_against)/16)
colnames(afc_west) <- c("Team Name",
                        "Average Total Wins", "Average Total Losses",
                        "Average Points Per Game", "Average Opponent Points Per Game")
kable(afc_west)
```

AFC Box Plots | Total Wins {data-navmenu="Standings" data-orientation=rows}
=============================================================================

Row
-----------------------------------------------------------------------------

### **AFC East Box Plot**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# AFC East box plot
afc_east_box <- nfl_standings %>%
  filter(division == "AFC East") %>%
  select(team_name, total_wins, points_for, total_losses, points_against)

boxplot_afceast <- ggplot(afc_east_box, aes(team_name, total_wins)) +
  geom_boxplot(col = "blue") +
  ggtitle("AFC East") +
  xlab("Team") +
  ylab("Total Wins") +
  theme_stata()

ggplotly(boxplot_afceast)
```

### **AFC North Box Plot**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# AFC North box plot
afc_north_box <- nfl_standings %>%
  filter(division == "AFC North") %>%
  select(team_name, total_wins, points_for, total_losses, points_against)

boxplot_afcnorth <- ggplot(afc_north_box, aes(team_name, total_wins)) +
  geom_boxplot(col = "purple") +
  ggtitle("AFC North") +
  xlab("Team") +
  ylab("Total Wins") +
  theme_stata()

ggplotly(boxplot_afcnorth)
```

Row
-----------------------------------------------------------------------------

### **AFC South Box Plot**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# AFC South box plot
afc_south_box <- nfl_standings %>%
  filter(division == "AFC South") %>%
  select(team_name, total_wins, points_for, total_losses, points_against)

boxplot_afcsouth <- ggplot(afc_south_box, aes(team_name, total_wins)) +
  geom_boxplot(col = "red") +
  ggtitle("AFC South") +
  xlab("Team") +
  ylab("Total Wins") +
  theme_stata()

ggplotly(boxplot_afcsouth)
```

### **AFC West Box Plot**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# AFC West box plot
afc_west_box <- nfl_standings %>%
  filter(division == "AFC West") %>%
  select(team_name, total_wins, points_for, total_losses, points_against)

boxplot_afcwest <- ggplot(afc_west_box, aes(team_name, total_wins)) +
  geom_boxplot(col = "seagreen") +
  ggtitle("AFC West") +
  xlab("Team") +
  ylab("Total Wins") +
  theme_stata()

ggplotly(boxplot_afcwest)
```

NFC Summaries {data-navmenu="Standings" data-orientation=rows}
=============================================================================

Row
-----------------------------------------------------------------------------

### **NFC East Summary**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# NFC East
nfc_east <- nfl_standings %>%
  filter(division == "NFC East") %>%
  select(team_name, total_wins, points_for, total_losses, points_against) %>%
  group_by(team_name) %>%
  summarise(total_wins = mean(total_wins),
            total_losses = mean(total_losses),
            points_for = mean(points_for)/16,
            points_against = mean(points_against)/16)
colnames(nfc_east) <- c("Team Name",
                        "Average Total Wins", "Average Total Losses",
                        "Average Points Per Game", "Average Opponent Points Per Game")
kable(nfc_east)
```

### **NFC North Summary**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# NFC North
nfc_north <- nfl_standings %>%
  filter(division == "NFC North") %>%
  select(team_name, total_wins, points_for, total_losses, points_against) %>%
  group_by(team_name) %>%
  summarise(total_wins = mean(total_wins),
            total_losses = mean(total_losses),
            points_for = mean(points_for)/16,
            points_against = mean(points_against)/16)
colnames(nfc_north) <- c("Team Name",
                         "Average Total Wins", "Average Total Losses",
                         "Average Points Per Game", "Average Opponent Points Per Game")
kable(nfc_north)
```

Row
-----------------------------------------------------------------------------

### **NFC South Summary**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# NFC South
nfc_south <- nfl_standings %>%
  filter(division == "NFC South") %>%
  select(team_name, total_wins, points_for, total_losses, points_against) %>%
  group_by(team_name) %>%
  summarise(total_wins = mean(total_wins),
            total_losses = mean(total_losses),
            points_for = mean(points_for)/16,
            points_against = mean(points_against)/16)
colnames(nfc_south) <- c("Team Name",
                         "Average Total Wins", "Average Total Losses",
                         "Average Points Per Game", "Average Opponent Points Per Game")
kable(nfc_south)
```

### **NFC West Summary**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# NFC West
nfc_west <- nfl_standings %>%
  filter(division == "NFC West") %>%
  select(team_name, total_wins, points_for, total_losses, points_against) %>%
  group_by(team_name) %>%
  summarise(total_wins = mean(total_wins),
            total_losses = mean(total_losses),
            points_for = mean(points_for)/16,
            points_against = mean(points_against)/16)
colnames(nfc_west) <- c("Team Name",
                        "Average Total Wins", "Average Total Losses",
                        "Average Points Per Game", "Average Opponent Points Per Game")
kable(nfc_west)
```

NFC Box Plots | Total Wins {data-navmenu="Standings" data-orientation=rows}
=============================================================================

Row
-----------------------------------------------------------------------------

### **NFC East Box Plot**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# NFC East box plot
nfc_east_box <- nfl_standings %>%
  filter(division == "NFC East") %>%
  select(team_name, total_wins, points_for, total_losses, points_against)

boxplot_nfceast <- ggplot(nfc_east_box, aes(team_name, total_wins)) +
  geom_boxplot() +
  ggtitle("NFC East") +
  xlab("Team") +
  ylab("Total Wins") +
  theme_stata()

ggplotly(boxplot_nfceast)
```

### **NFC North Box Plot**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# NFC North box plot
nfc_north_box <- nfl_standings %>%
  filter(division == "NFC North") %>%
  select(team_name, total_wins, points_for, total_losses, points_against)

boxplot_nfcnorth <- ggplot(nfc_north_box, aes(team_name, total_wins)) +
  geom_boxplot(col = "brown") +
  ggtitle("NFC North") +
  xlab("Team") +
  ylab("Total Wins") +
  theme_stata()

ggplotly(boxplot_nfcnorth)
```

Row
-----------------------------------------------------------------------------

### **NFC South Box Plot**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# NFC South box plot
nfc_south_box <- nfl_standings %>%
  filter(division == "NFC South") %>%
  select(team_name, total_wins, points_for, total_losses, points_against)

boxplot_nfcsouth <- ggplot(nfc_south_box, aes(team_name, total_wins)) +
  geom_boxplot(col = "magenta") +
  ggtitle("NFC South") +
  xlab("Team") +
  ylab("Total Wins") +
  theme_stata()

ggplotly(boxplot_nfcsouth)
```

### **NFC West Box Plot**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# NFC West box plot
nfc_west_box <- nfl_standings %>%
  filter(division == "NFC West") %>%
  select(team_name, total_wins, points_for, total_losses, points_against)

boxplot_nfcwest <- ggplot(nfc_west_box, aes(team_name, total_wins)) +
  geom_boxplot(col = "goldenrod4") +
  ggtitle("NFC West") +
  xlab("Team") +
  ylab("Total Wins") +
  theme_stata()

ggplotly(boxplot_nfcwest)
```

Division Leaders {data-navmenu="Standings"}
=============================================================================

### **Division Leaders Breakdown**

Combining the tables from the previous tabs to form one table with average statistics, the following leaders can be found:

* **Average Total Wins**: New England Patriots
* **Average Total Losses**: Cleveland Browns
* **Average Points Per Game**: New England Patriots
* **Average Opponent Points Per Game**: Detroit Lions

For a more in-depth look at each team, please refer to the table below.

#### NFL Standings of all Teams

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# Analysis of all teams in the NFL
division_leaders <- nfl_standings %>%
  select(team_name, total_wins, points_for, total_losses, points_against) %>%
  group_by(team_name) %>%
  summarise(total_wins = mean(total_wins),
            total_losses = mean(total_losses),
            points_for = mean(points_for)/16,
            points_against = mean(points_against)/16)
colnames(division_leaders) <- c("Team Name",
                                "Average Total Wins", "Average Total Losses",
                                "Average Points Per Game", "Average Opponent Points Per Game")
kable(division_leaders)
```

Offense vs. Defense {data-navmenu="Standings" data-orientation=columns}
=============================================================================

Column {.sidebar data-width=450}
-----------------------------------------------------------------------------

#### **Offense vs. Defense**

In the next step of this analysis, I investigated the importance of an offense and a defense. The offense in football is the 11 players who are on the field for a team when they have the ball. Conversely, the defense is the 11 players on the field when the other team has the ball. Sports writers and analysts have argued over the years whether a better offense or defense is more critical to a team's success. Utilizing the `nfl_standings` dataset, I sought to analyze this discussion.

First, I created a linear model showcasing a team's wins in a season using `offensive_ranking` and `defensive_ranking` as the predictor variables.

```{r, message = FALSE, warning = FALSE, echo = TRUE}
# Attach the dataset
attach(nfl_standings)

# Create a linear model
rankings_model <- lm(total_wins ~ offensive_ranking + defensive_ranking)
summary(rankings_model)
```

The model showcases that both `offensive_ranking` and `defensive_ranking` are significant variables in determining a team's total wins at a 99% confidence level.
To drill deeper, I computed the correlation coefficient between each predictor variable and total wins.

```{r, message = FALSE, warning = FALSE, echo = TRUE}
# Compute correlation coefficients
cor(offensive_ranking, total_wins)
cor(defensive_ranking, total_wins)
```

The `offensive_ranking` had a coefficient of **0.7311** and the `defensive_ranking` had a coefficient of **0.6379**. As such, it appears that a team's offense has a greater correlation to a team's wins than its defense.

To visualize this, I plotted two graphs to further test this hypothesis. These can be seen in the tabs to the right. These graphs confirm the positive correlation between an increasing offensive or defensive ranking and a team's wins. Additionally, the confidence band in the defensive ranking plot is wider than the band in the offensive ranking plot. This agrees with my conclusion that the offensive ranking's correlation with wins is stronger than the defensive ranking's.

Column {.tabset .tabset-fade}
-----------------------------------------------------------------------------

### **Offensive Ranking**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# Plot total wins against offensive ranking with a confidence band
Off_Ranking_Plot <- ggplot(data = nfl_standings, aes(offensive_ranking, total_wins)) +
  geom_smooth() +
  ggtitle("Total Wins | Offensive Ranking") +
  xlab("Offensive Ranking") +
  ylab("Total Wins") +
  xlim(-10, 10) +
  ylim(0, 16) +
  theme_stata()

ggplotly(Off_Ranking_Plot)
```

### **Defensive Ranking**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# Plot total wins against defensive ranking with a confidence band
Def_Ranking_Plot <- ggplot(data = nfl_standings, aes(defensive_ranking, total_wins)) +
  geom_smooth(col = "red") +
  ggtitle("Total Wins | Defensive Ranking") +
  xlab("Defensive Ranking") +
  ylab("Total Wins") +
  xlim(-10, 10) +
  ylim(0, 16) +
  theme_stata()

ggplotly(Def_Ranking_Plot)
```

Points For vs. Points Against {data-navmenu="Standings" data-orientation=columns}
=============================================================================

Column {.sidebar data-width=450}
-----------------------------------------------------------------------------

#### **Points For vs. Points Against**

Using different statistics now, I changed the predictor variables to be `points_for` and `points_against` as these represent offensive and defensive success, respectively. Then, I used the binary `playoffs` variable to see how scoring or giving up points related to a team's chances of making the playoffs.

I took the same approach as with the previous variables. This model shows that `points_for` and `points_against` are both significant predictors of a team's total wins as well. Additionally, their correlations with total wins are **0.7276** and **-0.6667**, respectively. This indicates a strong, positive relationship for `points_for` and a strong, negative relationship for `points_against`. The offensive side, once again, has a slightly stronger relationship. The graphical representations show these strong linear relationships as well, with playoff teams typically having a low Total Points Against and a high Total Points For on the season.
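
One way to look at playoff chances directly, rather than through total wins, would be a logistic regression on the binary `playoffs` indicator. This is not the approach used in this dashboard. It is only a sketch under stated assumptions: `toy_season` is invented data, and `playoffs` is assumed to be coded as 0/1.

```r
# Hypothetical season-level data; all values are invented for illustration
toy_season <- data.frame(
  points_for     = c(450, 350, 320, 500, 400, 300, 380, 290, 330, 420),
  points_against = c(310, 300, 400, 380, 345, 420, 400, 350, 410, 390),
  playoffs       = c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)  # assumed 0/1 coding
)

# Logistic regression: playoff indicator as a function of scoring
playoff_model <- glm(playoffs ~ points_for + points_against,
                     family = binomial, data = toy_season)

# Estimated playoff probability for a hypothetical team
new_team <- data.frame(points_for = 400, points_against = 350)
playoff_prob <- predict(playoff_model, newdata = new_team, type = "response")
playoff_prob
```

With `type = "response"`, `predict()` returns a probability between 0 and 1 rather than a value on the log-odds scale.
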
```{r, message = FALSE, warning = FALSE, echo = TRUE}
# Create linear model
points_model <- lm(total_wins ~ points_for + points_against)
summary(points_model)

# Compute correlation coefficients
cor(points_for, total_wins)
cor(points_against, total_wins)
```

Column {.tabset .tabset-fade}
-----------------------------------------------------------------------------

### **Points For**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# Create scatter plots
Points_For_Plot <- ggplot(data = nfl_standings, aes(points_for, total_wins, col = as.character(playoffs))) +
  geom_point() +
  labs(col = "Playoffs") +
  ggtitle("Total Wins | Total Points Scored") +
  xlab("Total Points Scored") +
  ylab("Total Wins") +
  ylim(0, 16) +
  theme_stata()

ggplotly(Points_For_Plot)
```

### **Points Against**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
Points_Against_Plot <- ggplot(data = nfl_standings, aes(points_against, total_wins, col = as.character(playoffs))) +
  geom_point() +
  labs(col = "Playoffs") +
  ggtitle("Total Wins | Total Points Against") +
  xlab("Total Points Against") +
  ylab("Total Wins") +
  ylim(0, 16) +
  theme_stata()

ggplotly(Points_Against_Plot)
```

Individual Game Observations {data-navmenu="Standings" data-orientation=columns}
=============================================================================

Column
-----------------------------------------------------------------------------

### **Individual Game Observations**

The last analysis takes a look at data from the individual NFL games. Using the `nfl_games` dataset, I investigated the different variables. To analyze the correlation between different variables, I used the GGally package to produce a detailed scatter plot matrix. The function `ggpairs()` plots each variable's distribution along the diagonal of the matrix. Pearson correlation coefficients are shown in the upper-right. Scatter plots are shown in the lower-left.
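
The numeric half of such a matrix can also be reproduced with base R's `cor()`, which returns the pairwise Pearson coefficients as a symmetric matrix. The `toy_games` values below are invented for illustration. Only the column names mirror `nfl_games`.

```r
# Hypothetical game-level data; values are invented
toy_games <- data.frame(
  pts_win       = c(27, 31, 20, 24, 38, 17),
  yds_win       = c(380, 420, 310, 350, 470, 290),
  turnovers_win = c(1, 0, 2, 1, 0, 3)
)

# Pairwise Pearson correlations between all columns
cor_matrix <- cor(toy_games, method = "pearson")
round(cor_matrix, 3)
```

Each off-diagonal entry corresponds to one of the coefficients printed in the upper-right panels of a `ggpairs()` matrix.
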
I analyzed six variables here: (1) Points Scored by Winning Team (`pts_win`); (2) Yards Gained by Winning Team (`yds_win`); (3) Turnovers Committed by Winning Team (`turnovers_win`); (4) Points Scored by Losing Team (`pts_loss`); (5) Yards Gained by Losing Team (`yds_loss`); and (6) Turnovers Committed by Losing Team (`turnovers_loss`). I then grouped these variables by winning team vs. losing team.

The correlation matrix for the winning team can be seen in the first tab to the right. As evident through both the scatter plots and the Pearson correlation coefficients, there is little to no relationship between Points Scored by Winning Team vs. Turnovers Committed by Winning Team as well as Yards Gained by Winning Team vs. Turnovers Committed by Winning Team. Both of these correlation coefficients are close to zero. On the other hand, there is a moderate, positive relationship between Points Scored by Winning Team vs. Yards Gained by Winning Team, with a Pearson correlation coefficient of **0.537**.

The variables by losing team, shown in the second tab to the right, behave very similarly. There is little to no relationship between Points Scored by Losing Team vs. Turnovers Committed by Losing Team as well as Yards Gained by Losing Team vs. Turnovers Committed by Losing Team. Both of these correlation coefficients are close to zero. On the other hand, there is a moderately strong, positive relationship between Points Scored by Losing Team vs. Yards Gained by Losing Team, with a Pearson correlation coefficient of **0.632**. The main takeaway from these correlation matrices is that the more yards a team gains, the more points it tends to score.

To compare a winning team and a losing team, I wanted to see if more turnovers from a losing team were associated with more points for the winning team. Please see the third tab to the right for the linear model with `pts_win` as the response variable and `turnovers_loss` as the predictor variable.
In this graphic, there is a slight, positive relationship between the Turnovers Committed by Losing Team and Points Scored by Winning Team. The correlation coefficient between the two variables is **0.176**.

Column {.tabset .tabset-fade}
-----------------------------------------------------------------------------

### **Variables by Winning Team**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# Create correlation graphs for variables in nfl_games
cor_graphs1 <- ggpairs(nfl_games %>% select(pts_win, yds_win, turnovers_win))
ggplotly(cor_graphs1)
```

### **Variables by Losing Team**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# Create correlation graphs for variables in nfl_games
cor_graphs2 <- ggpairs(nfl_games %>% select(pts_loss, yds_loss, turnovers_loss))
ggplotly(cor_graphs2)
```

### **Turnovers Committed by Losing Team vs. Points Scored by Winning Team**

```{r, message = FALSE, warning = FALSE, echo = FALSE, results = 'hide'}
# Attach the dataset
attach(nfl_games)

# Create linear model
games_model <- lm(pts_win ~ turnovers_loss)
summary(games_model)
cor(turnovers_loss, pts_win)
```

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# Plot the data
turn_plot <- ggplot(nfl_games) +
  geom_point(aes(x = turnovers_loss, y = pts_win), color = "coral1") +
  ggtitle("Turnovers by Losing Team vs. Points Scored by Winning Team")

ggplotly(turn_plot)
```

Average Weather Conditions {data-navmenu="Weather" data-orientation=columns}
=============================================================================

Column
-----------------------------------------------------------------------------

### **Understanding Weather Conditions**

Looking at the `nfl_weather` dataset, I wanted to see which teams performed well under certain weather conditions. To do this, I first wanted to observe the average temperature, humidity, and wind speed at each home location. In R, I utilized the `dplyr` package to tidy my data and create new columns with `mutate`.
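
Per-city averages can also be computed in one step with `group_by()` and `summarise()`, which returns one row per city and so needs no separate de-duplication step. This is an alternative sketch, not the code used in this dashboard. `toy_weather` is invented data that only borrows the `nfl_weather` column names.

```r
library(dplyr)

# Hypothetical game-level weather records; values are invented
toy_weather <- data.frame(
  home_team_city = c("Miami", "Miami", "Buffalo", "Buffalo", "Denver"),
  temperature    = c(80, 74, 35, 41, 55),
  humidity       = c(72, 68, 60, 64, 30),
  wind_mph       = c(8, 10, 12, 9, 11),
  stringsAsFactors = FALSE
)

# One row per city, with each average rounded to two decimal places
city_averages <- toy_weather %>%
  group_by(home_team_city) %>%
  summarise(
    avg_temperature = round(mean(temperature), 2),
    avg_humidity    = round(mean(humidity), 2),
    avg_wind        = round(mean(wind_mph), 2)
  )

city_averages
```

The trade-off is that `summarise()` drops the game-level rows, whereas `mutate()` keeps them alongside the new average columns.
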
To visualize the average temperature, humidity, and wind speed at each location, I created bar graphs for each variable per city.

```{r, message = FALSE, warning = FALSE, echo = FALSE, results = 'hide'}
head(nfl_weather, 50)

# Add a column of weather averages for each team's location
nfl_weather <- nfl_weather %>%
  group_by(home_team_city) %>%
  mutate(avg_temperature = mean(temperature)) %>%
  mutate(avg_humidity = mean(humidity)) %>%
  mutate(avg_wind = mean(wind_mph))

names(nfl_weather)

# Drop columns 3 through 11 so only the city-level columns remain
nfl_weather_2 <- nfl_weather[-c(3, 4, 5, 6, 7, 8, 9, 10, 11)]

# Remove duplicates based on home_team_city column
nfl_weather_2 <- nfl_weather_2[!duplicated(nfl_weather_2$home_team_city), ]

# Round values to two decimal places
nfl_weather_2$avg_temperature <- round(nfl_weather_2$avg_temperature, digits = 2)
nfl_weather_2$avg_humidity <- round(nfl_weather_2$avg_humidity, digits = 2)
nfl_weather_2$avg_wind <- round(nfl_weather_2$avg_wind, digits = 2)

head(nfl_weather_2, 20)
```

From the visualizations to the right, it appears that the following five cities have the highest average temperatures:

1. **Miami, Florida** -- 76.70°F
2. **Detroit, Michigan** -- 71.64°F
3. **Tampa Bay, Florida** -- 71.51°F
4. **New Orleans, Louisiana** -- 71.03°F
5. **Houston, Texas** -- 71.03°F

The following five cities have the highest average humidity:

1. **Seattle, Washington** -- 79%
2. **San Francisco, California** -- 71%
3. **Oakland, California** -- 71%
4. **Green Bay, Wisconsin** -- 71%
5. **Miami, Florida** -- 70%

Lastly, the following five cities have the highest average wind speeds (in mph):

1. **New England, Massachusetts** -- 11.54 mph
2. **New York, New York** -- 10.57 mph
3. **Dallas, Texas** -- 10.27 mph
4. **Denver, Colorado** -- 9.96 mph
5. **Buffalo, New York** -- 9.95 mph

Column {.tabset .tabset-fade}
-----------------------------------------------------------------------------

### **Average Temperature Per City**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# Bar graph of average temperature per city
avg_weather <- ggplot(nfl_weather_2, aes(x = reorder(home_team_city, avg_temperature), y = avg_temperature)) +
  ggtitle("Average Temperature by City") +
  xlab("City") +
  ylab("Temperature (in Fahrenheit)") +
  geom_col(width = 0.7) +
  coord_flip()

ggplotly(avg_weather)
```

### **Average Humidity Per City**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# Bar graph of average humidity per city
avg_humidity <- ggplot(nfl_weather_2, aes(x = reorder(home_team_city, avg_humidity), y = avg_humidity)) +
  ggtitle("Average Humidity by City") +
  xlab("City") +
  ylab("Humidity") +
  geom_col(width = 0.7) +
  coord_flip()

ggplotly(avg_humidity)
```

### **Average Wind Per City**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# Bar graph of average wind per city
avg_wind <- ggplot(nfl_weather_2, aes(x = reorder(home_team_city, avg_wind), y = avg_wind)) +
  ggtitle("Average Wind (mph) by City") +
  xlab("City") +
  ylab("Wind (mph)") +
  geom_col(width = 0.7) +
  coord_flip()

ggplotly(avg_wind)
```

Can Weather Predict Game Outcomes? {data-navmenu="Weather" data-orientation=columns}
=============================================================================

Column {.sidebar data-width=450}
-----------------------------------------------------------------------------

#### **Can Weather Predict Game Outcomes?**

Next, I wanted to see if high wind speeds were correlated with low-scoring games. To do this, I first summed the `home_score` and `away_score` variables into a new `total_score` column. With this column, I computed the correlation coefficients between `total_score` and `temperature`, `total_score` and `humidity`, and `total_score` and `wind_mph`.
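
The three correlations described above can be computed in a single pass. In this sketch, `toy_weather_games` is invented data that reuses the same column names, and `use = "complete.obs"` is an added assumption to guard against missing weather readings.

```r
# Hypothetical game records; values are invented, including one missing reading
toy_weather_games <- data.frame(
  home_score  = c(24, 17, 31, 10, 27, 20),
  away_score  = c(21, 14, 28, 13, 17, 23),
  temperature = c(72, 45, 68, 30, 80, 55),
  humidity    = c(60, 75, 55, 80, 65, NA),
  wind_mph    = c(5, 14, 3, 20, 8, 10)
)

# Combined score of both teams
toy_weather_games$total_score <-
  toy_weather_games$home_score + toy_weather_games$away_score

# Correlate total_score with each weather variable, skipping incomplete pairs
weather_vars <- c("temperature", "humidity", "wind_mph")
weather_cors <- sapply(weather_vars, function(v) {
  cor(toy_weather_games$total_score, toy_weather_games[[v]],
      use = "complete.obs")
})

round(weather_cors, 4)
```
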
I also ran two linear models to see if weather conditions could predict whether or not the game would be high-scoring or low-scoring. I trained and tested each model using a 70-30 training-testing split.

The correlation coefficient between `total_score` and `temperature` is **0.0164**. Since a correlation coefficient of plus or minus one indicates a perfect linear relationship and zero indicates none, there is little to no correlation between `total_score` and `temperature`. The correlation coefficient between `total_score` and `humidity` is **-0.1207**. Here, I can see that there is a slight, negative correlation between the two variables. The relationship between `total_score` and `wind_mph` is similar. The correlation coefficient is **-0.1328**, indicating a slight, negative correlation. Of the three relationships tested, wind speed correlates most with a lower total score, although the relationship is weak. The higher the wind, the lower the combined score of the game tends to be.

Looking further, I decided to create a linear model to see if temperature, humidity, and wind can predict the total combined score of the game. The first model I created included all three predictor variables; however, it did not perform well. The adjusted R-squared of this model is a meager 0.02 or so, indicating that the model explains only about 2% of the variance in total score. The `humidity` and `wind_mph` variables were statistically significant at the 95% confidence level, as they had p-values less than 0.05. The out-of-sample Mean Squared Error (MSE) of this model is higher than 170, which corresponds to a typical prediction error of roughly 13 points (the square root of 170) in combined score. It is important to note that these results will vary given the random training-testing split.

Using the above data and knowing that `wind_mph` was the most correlated with `total_score`, I decided to create a second model with just `wind_mph` as the sole predictor variable.
This model performed even worse with an adjusted R-squared of around 0.015, indicating that the model accounts for less than 2% of the variance. The `wind_mph` variable is still statistically significant at the 95% confidence level, with a p-value less than 0.05. Additionally, the MSE of this model is still higher than 170. It is important to note that these results will also vary given the random training-testing split.

In conclusion, weather does not accurately predict whether or not the game will be high-scoring or low-scoring. I originally thought it would be more difficult for the players to score given higher wind speeds; however, the effect turned out to be far weaker than I expected. I thought about analyzing how teams fared in games located on opposite sides of the country (e.g., New England Patriots at Los Angeles Rams); however, I decided against that analysis because I thought there would be other contributing factors to a loss, such as home-field advantage.

Column {.tabset .tabset-fade}
-----------------------------------------------------------------------------

### **Linear Model with all Weather Variables**

```{r, message = FALSE, warning = FALSE, echo = FALSE, results = 'hide'}
nfl_weather$total_score <- nfl_weather$home_score + nfl_weather$away_score
head(nfl_weather)
```

```{r, message = FALSE, warning = FALSE, echo = TRUE, results = 'hide'}
attach(nfl_weather)
cor(total_score, temperature); cor(total_score, humidity); cor(total_score, wind_mph)

# Split the data into training and testing
sample_index <- sample(nrow(nfl_weather), nrow(nfl_weather)*0.70)
weather_train <- nfl_weather[sample_index,]
weather_test <- nfl_weather[-sample_index,]
```

```{r, message = FALSE, warning = FALSE, echo = TRUE}
# Create the linear model
weather_model <- lm(total_score ~ temperature + humidity + wind_mph, data = weather_train)
model_summary <- summary(weather_model)
model_summary

# Out-of-sample performance
# (predictions are stored in `pred` rather than `pi` so the built-in constant is not masked)
pred <- predict(object = weather_model, newdata = weather_test)
mean((pred - weather_test$total_score)^2) # MSE
```

### **Linear Model with Wind Variable**

```{r, message = FALSE, warning = FALSE, echo = TRUE}
# Drop all variables except wind_mph
weather_model_2 <- lm(total_score ~ wind_mph, data = weather_train)
model_summary_2 <- summary(weather_model_2)
model_summary_2

# Out-of-sample performance
pred_2 <- predict(object = weather_model_2, newdata = weather_test)
mean((pred_2 - weather_test$total_score)^2) # MSE
```

Teams {data-navmenu="League Leaders" data-orientation=columns}
=============================================================================

Column {.sidebar data-width=450}
-----------------------------------------------------------------------------

#### **Successful Postseason Teams**

For the next part of my analysis, I decided to look at the best head coaches and quarterbacks from the past 20 years. I created this `nfl_playoffs` dataset myself by taking every team from the past 20 years that made the playoffs and then listing their head coach and starting postseason quarterback. The goal of this portion of the analysis is to see if the head coach or quarterback really makes a difference in team success. For example, was it a coincidence that the Tampa Bay Buccaneers won the most recent Super Bowl -- conveniently, the first year with Tom Brady as quarterback?

Over the past 20 years, it appears that the top three teams are as follows:

1. **New England Patriots**
2. **Indianapolis Colts**
3. **Green Bay Packers**

The New England Patriots have been a dominant force the past few decades. I would argue that everyone who is not a New England Patriots fan strongly roots against them just because they have won so frequently. In the past few years, fans would claim the success comes from head coach Bill Belichick and former players -- Tom Brady and Rob Gronkowski.

The Indianapolis Colts' success over the past 20 years can be attributed to their former quarterback Peyton Manning.
Manning joined the Colts in 1998 and led the team to its first championship in 36 seasons at Super Bowl XLI.

The Green Bay Packers have also been quite the team over the past 20 years. Ted Thompson, their general manager from 2005 through 2017, was a key figure in their success. The Packers also have had incredible leaders through their coaching staff and players. Notably, the Packers also have a famously strong fan base.

Column {.tabset .tabset-fade}
-----------------------------------------------------------------------------

### **Successful Postseason Teams**

```{r, message = FALSE, warning = FALSE, echo = FALSE, results = 'hide'}
head(nfl_playoffs)

# Count playoff appearances per team
nfl_teams_playoffs <- nfl_playoffs %>% count(nfl_playoffs$team_name)
nfl_teams_playoffs <- as.data.frame(nfl_teams_playoffs)
nfl_teams_playoffs
names(nfl_teams_playoffs)

# Rename the columns
nfl_teams_playoffs <- nfl_teams_playoffs %>%
  rename(
    team_name = "nfl_playoffs$team_name",
    playoffs = "n"
  )
nfl_teams_playoffs
```

```{r, message = FALSE, warning = FALSE, echo = TRUE}
top_10_teams_playoffs <- nfl_teams_playoffs %>%
  top_n(10, playoffs) %>%
  arrange(desc(playoffs))

top_10_teams_playoffs <- top_10_teams_playoffs[1:10,]
kable(top_10_teams_playoffs)
```

### **Top Ten Teams by Playoff Appearances**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
playoff_team_plot <- ggplot(data = top_10_teams_playoffs, aes(x = reorder(team_name, -playoffs), y = playoffs)) +
  geom_bar(stat = "identity", width = 0.5, fill = "black") +
  scale_y_continuous(name = "Total Playoff Appearances") +
  scale_x_discrete(name = "Team") +
  ggtitle("Top Ten Teams in the NFL by Playoff Appearances") +
  theme(axis.text.x = element_text(angle = 65, vjust = 0.6)) +
  geom_text(aes(label = playoffs), position = position_dodge(width = 0.5), vjust = 2, color = "white", size = 3.5)

playoff_team_plot
```

Head Coaches {data-navmenu="League Leaders" data-orientation=columns}
=============================================================================

Column {.sidebar data-width=450}
-----------------------------------------------------------------------------

#### **Successful Postseason Head Coaches**

Now knowing the top-performing teams in the NFL, I wanted to see which coaches have led these teams to success. Looking at the visualization **Top Ten Coaches by Playoff Appearances** to the right, it is evident that the top two coaches in the NFL over the past 20 years are:

1. **Bill Belichick**
2. **Andy Reid**

Bill Belichick, head coach of the New England Patriots, has been with the team since 2000. As head coach, he has six Super Bowl championships (XXXVI, XXXVIII, XXXIX, XLIX, LI, and LIII). He won AP NFL Coach of the Year in 2003, 2007, and 2010. He has also won 31 playoff games. His career record as a coach is 311-148 (0.678). I cannot say I was surprised to see him listed as number one in this analysis.

Andy Reid is the current head coach for the Kansas City Chiefs. He has been with the team since 2013. Prior to that, Reid was the head coach for the Philadelphia Eagles (1999 - 2012). He has won two Super Bowl championships (XXXI and LIV), one as an assistant coach and one as a head coach. His career record as a coach is 238-145-1 (0.621).

Next, I wanted to look at the head coaches with the most Super Bowl championships won. Is this consistent with the top coaches who go to postseason play? Given Marvin Lewis, former head coach of the Cincinnati Bengals, is number ten in the graphic **Top Ten Coaches by Playoff Appearances**, I cannot be quite sure.

In the visualization **Top Ten Coaches by Super Bowl Championships** to the right, it is evident that Bill Belichick is still the dominant head coach from the past 20 years, with *six* Super Bowl championships. The head coach with the next highest number of Super Bowl championships as head coach is Tom Coughlin. Tom Coughlin was the head coach of the Jacksonville Jaguars from 1995 - 2002, and he was the head coach of the New York Giants from 2004 - 2015.
His two Super Bowl championships were as the head coach of the New York Giants (XLII and XLVI). His career record as a coach in the NFL was 182-157 (0.537).

Column {.tabset .tabset-fade}
-----------------------------------------------------------------------------

### **List of Top Coaches by Playoff Appearances**

```{r, message = FALSE, warning = FALSE, echo = FALSE, results = 'hide'}
head(nfl_playoffs)

# Count playoff appearances per head coach
nfl_coaches_playoffs <- nfl_playoffs %>% count(nfl_playoffs$head_coach)
nfl_coaches_playoffs <- as.data.frame(nfl_coaches_playoffs)
nfl_coaches_playoffs

# Rename the columns
nfl_coaches_playoffs <- nfl_coaches_playoffs %>%
  rename(
    head_coach = "nfl_playoffs$head_coach",
    playoffs = "n"
  )
nfl_coaches_playoffs
```

```{r, message = FALSE, warning = FALSE, echo = TRUE}
top_10_coaches_playoffs <- nfl_coaches_playoffs %>%
  top_n(10, playoffs) %>%
  arrange(desc(playoffs))

top_10_coaches_playoffs <- top_10_coaches_playoffs[1:10,]
kable(top_10_coaches_playoffs)
```

### **Top Ten Coaches by Playoff Appearances**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
playoff_coach_plot <- ggplot(data = top_10_coaches_playoffs, aes(x = reorder(head_coach, -playoffs), y = playoffs)) +
  geom_bar(stat = "identity", width = 0.5, fill = "red") +
  scale_y_continuous(name = "Total Playoff Appearances") +
  scale_x_discrete(name = "Head Coach") +
  ggtitle("Top Ten Coaches in the NFL by Playoff Appearances") +
  theme(axis.text.x = element_text(angle = 65, vjust = 0.6)) +
  geom_text(aes(label = playoffs), position = position_dodge(width = 0.5), vjust = 2, color = "white", size = 3.5)

playoff_coach_plot
```

### **List of Top Coaches by Super Bowl Championships**

```{r, message = FALSE, warning = FALSE, echo = FALSE, results = 'hide'}
head(nfl_playoffs)
names(nfl_playoffs)

# Add a column of count of Super Bowl wins for each head coach
nfl_playoffs <- nfl_playoffs %>%
  group_by(head_coach) %>%
  mutate(total_sb_coach = sum(sb_winner))
head(nfl_playoffs)

# Create new dataset for top coaches
nfl_coaches_sb <- nfl_playoffs[-c(1, 2, 3, 4, 5, 6, 8)]
head(nfl_coaches_sb)

# Remove duplicates based on head_coach column
nfl_coaches_sb <- nfl_coaches_sb[!duplicated(nfl_coaches_sb$head_coach), ]
```

```{r, message = FALSE, warning = FALSE, echo = TRUE}
top_10_coaches <- nfl_coaches_sb %>%
  top_n(10, total_sb_coach) %>%
  arrange(desc(total_sb_coach))

top_10_coaches <- top_10_coaches[1:10,]
kable(top_10_coaches)
```

### **Top Ten Coaches by Super Bowl Championships**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
sb_coach_plot <- ggplot(data = top_10_coaches, aes(x = reorder(head_coach, -total_sb_coach), y = total_sb_coach)) +
  geom_bar(stat = "identity", width = 0.5, fill = "darkblue") +
  scale_y_continuous(name = "Total Super Bowls Won") +
  scale_x_discrete(name = "Head Coach") +
  ggtitle("Top Ten Coaches in the NFL by Super Bowl Championships") +
  theme(axis.text.x = element_text(angle = 65, vjust = 0.6)) +
  geom_text(aes(label = total_sb_coach), position = position_dodge(width = 0.5), vjust = 2, color = "white", size = 3.5)

sb_coach_plot
```

Quarterbacks {data-navmenu="League Leaders" data-orientation=columns}
=============================================================================

Column {.sidebar data-width=450}
-----------------------------------------------------------------------------

#### **Successful Postseason Quarterbacks**

An individual can be a great coach; however, they also need a great team. Quarterbacks are often described as the leaders of the NFL. With that said, I wanted to take a look at the best quarterbacks from the past 20 years. From the visualization **Top Ten Quarterbacks by Playoff Appearances** to the right, it is evident that the top three quarterbacks in the NFL over the past 20 years are:

1. **Tom Brady**
2. **Ben Roethlisberger**
3. **Drew Brees**

It is no surprise that Tom Brady is number one, as he is often referred to as the Greatest Of All Time (GOAT).
Tom Brady was drafted by the New England Patriots in the sixth round of the 2000 NFL Draft. Since then, he has become a seven-time Super Bowl Champion (six with the New England Patriots and one with the Tampa Bay Buccaneers). He is still active in the NFL as the quarterback for the Tampa Bay Buccaneers. As of 2020, his completion percentage is 64% and his accolades are many. I am interested to see how much longer he will excel in the league.

Ben Roethlisberger, the long-time quarterback for the Pittsburgh Steelers, was drafted in the first round of the 2004 NFL Draft. He has won two Super Bowl championships, and his completion percentage is 64.4%. He just signed with the Steelers for another year, so (as a Steelers fan) I am hoping he will lead the team to a third championship this upcoming season.

Drew Brees started his career with the San Diego Chargers (2001 - 2005), but he is most known for his career as the quarterback for the New Orleans Saints (2006 - 2020). Brees has won one Super Bowl championship, and he just announced his retirement this year. His completion percentage was 67.7%.

These top three quarterbacks are some of the most famous players in the NFL. It is interesting to see how their career statistics speak for themselves.

Next, I wanted to see which quarterbacks have won the most Super Bowl Championships over the past 20 years. Once again, looking at the **Top Ten Quarterbacks by Super Bowl Championships** to the right, Tom Brady is the most dominant quarterback in the NFL based on Super Bowl Championships. Ben Roethlisberger is next with two. The other two quarterbacks that I have not yet discussed are Eli Manning and Peyton Manning. Both of them have won two Super Bowl Championships since 2000.

Eli Manning was the quarterback for the New York Giants from 2004 - 2019. He won two Super Bowl Championships (XLII and XLVI) and was the Super Bowl MVP for both games.
Peyton Manning was the quarterback for the Indianapolis Colts from 1998 - 2011 and the quarterback for the Denver Broncos from 2012 - 2015. He also won two Super Bowl Championships (XLI and 50) and was the Super Bowl MVP for Super Bowl XLI. He won one Super Bowl with the Colts in 2006 and one Super Bowl with the Broncos in 2015.

Column {.tabset .tabset-fade}
-----------------------------------------------------------------------------

### **List of Top Quarterbacks by Playoff Appearances**

```{r, message = FALSE, warning = FALSE, echo = FALSE, results = 'hide'}
head(nfl_playoffs)

# Note: if nfl_playoffs is still grouped by head_coach at this point (from the
# Head Coaches page), count() tallies coach-quarterback pairs and returns the
# grouping column first; that leading column is dropped below
nfl_qb_playoffs <- nfl_playoffs %>% count(nfl_playoffs$qb)
nfl_qb_playoffs <- as.data.frame(nfl_qb_playoffs)
nfl_qb_playoffs <- nfl_qb_playoffs[-1]
nfl_qb_playoffs

# Rename the columns
nfl_qb_playoffs <- nfl_qb_playoffs %>%
  rename(
    qb = "nfl_playoffs$qb",
    playoffs = "n"
  )
nfl_qb_playoffs
```

```{r, message = FALSE, warning = FALSE, echo = TRUE}
top_10_qb_playoffs <- nfl_qb_playoffs %>%
  top_n(10, playoffs) %>%
  arrange(desc(playoffs))

top_10_qb_playoffs <- top_10_qb_playoffs[1:10,]
kable(top_10_qb_playoffs)
```

### **Top Ten Quarterbacks by Playoff Appearances**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
playoff_qb_plot <- ggplot(data = top_10_qb_playoffs, aes(x = reorder(qb, -playoffs), y = playoffs)) +
  geom_bar(stat = "identity", width = 0.5, fill = "slategray") +
  scale_y_continuous(name = "Total Playoff Appearances") +
  scale_x_discrete(name = "Quarterback") +
  ggtitle("Top Ten Quarterbacks in the NFL by Playoff Appearances") +
  theme(axis.text.x = element_text(angle = 65, vjust = 0.6)) +
  geom_text(aes(label = playoffs), position = position_dodge(width = 0.5), vjust = 2, color = "white", size = 3.5)

playoff_qb_plot
```

### **List of Top Quarterbacks by Super Bowl Championships**

```{r, message = FALSE, warning = FALSE, echo = FALSE, results = 'hide'}
# Add a column of count of Super Bowl wins for each quarterback
nfl_playoffs <- nfl_playoffs %>%
  group_by(qb) %>%
  mutate(total_sb_qb = sum(sb_winner))
head(nfl_playoffs)

# Create new dataset for top quarterbacks
nfl_qb_sb <- nfl_playoffs[-c(1, 2, 3, 4, 5, 6, 7, 9)]
head(nfl_qb_sb)

# Remove duplicates based on quarterback column
nfl_qb_sb <- nfl_qb_sb[!duplicated(nfl_qb_sb$qb), ]
```

```{r, message = FALSE, warning = FALSE, echo = TRUE}
top_10_qb <- nfl_qb_sb %>%
  top_n(10, total_sb_qb) %>%
  arrange(desc(total_sb_qb))

top_10_qb <- top_10_qb[1:10,]
kable(top_10_qb)
```

### **Top Ten Quarterbacks by Super Bowl Championships**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
sb_qb_plot <- ggplot(data = top_10_qb, aes(x = reorder(qb, -total_sb_qb), y = total_sb_qb)) +
  geom_bar(stat = "identity", width = 0.5, fill = "darkgray") +
  scale_y_continuous(name = "Total Super Bowls Won") +
  scale_x_discrete(name = "Quarterback") +
  ggtitle("Top Ten Quarterbacks in the NFL by Super Bowl Championships") +
  theme(axis.text.x = element_text(angle = 65, vjust = 0.6)) +
  geom_text(aes(label = total_sb_qb), position = position_dodge(width = 0.5), vjust = 2, color = "white", size = 3.5)

sb_qb_plot
```

Passing Yards Leaders {data-navmenu="League Leaders" data-orientation=columns}
=============================================================================

Column {.sidebar data-width=450}
-----------------------------------------------------------------------------

#### **Passing Yards Leaders**

For the next portion of my analysis, I wanted to analyze the top passing yards leaders. Given these are always quarterbacks, I wanted to see if this was consistent with my previous analysis of postseason quarterback success.

From the visualizations to the right, it is evident that Drew Brees led the league in passing yards in six of the past 20 seasons. However, he did not have the most playoff appearances. It is interesting that Tom Brady is on the list only three times, yet he has by far been the most successful quarterback.
Additionally, looking at the **Top Passing Yards Leaders** tab, the visualization is interactive. It is evident that the top passer over the past 20 years was Peyton Manning in 2013. The data points are colored based on player. The legend can be seen on the right.

Column {.tabset .tabset-fade}
-----------------------------------------------------------------------------

### **Passing Yards Summary**

```{r, message = FALSE, warning = FALSE, echo = TRUE}
kable(nfl_passing)
```

### **Top Passing Yards Leaders**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
passing_leaders <- nfl_passing %>%
  ggplot(aes(x = year, y = yds)) +
  geom_point(alpha = 0.8, aes(color = player)) +
  ggtitle("Top Passing Yards Per Year")

ggplotly(passing_leaders)
```

Rushing Yards Leaders {data-navmenu="League Leaders" data-orientation=columns}
=============================================================================

Column {.sidebar data-width=450}
-----------------------------------------------------------------------------

#### **Rushing Yards Leaders**

Looking at the leaders for rushing yards, I performed the same analysis as above. As seen in the visualizations to the right, the rushing yards leaders have much more variation than the passing yards leaders. Most recently, **Derrick Henry** from the Tennessee Titans has been the dominant running back with 2027 yards in 2020. He has been the top rusher for the past two years in a row.

Additionally, looking at the **Top Rushing Yards Leaders** tab, the visualization is also interactive. It is evident that the top rusher over the past 20 years was Adrian Peterson in 2012. The data points are colored based on player. The legend can be seen on the right.
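
Both observations of this kind, the single best season and how often each player led the league, can be read off a leaders table with base R. The `toy_leaders` data frame below is a hypothetical stand-in for `nfl_passing` or `nfl_rushing`, with placeholder player names and invented yardage.

```r
# Hypothetical leaders table: one row per season; values are invented
toy_leaders <- data.frame(
  year   = c(2018, 2019, 2020),
  player = c("Player A", "Player B", "Player A"),
  yds    = c(5100, 4900, 4800),
  stringsAsFactors = FALSE
)

# Season with the highest yardage
top_season <- toy_leaders[which.max(toy_leaders$yds), ]
top_season

# Number of seasons each player led the league
times_led <- table(toy_leaders$player)
times_led
```
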
Column {.tabset .tabset-fade}
-----------------------------------------------------------------------------

### **Rushing Yards Summary**

```{r, message = FALSE, warning = FALSE, echo = TRUE}
kable(nfl_rushing)
```

### **Top Rushing Yards Leaders**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
rushing_leaders <- nfl_rushing %>%
  ggplot(aes(x = year, y = yds)) +
  geom_point(alpha = 0.8, aes(color = player)) +
  ggtitle("Top Rushing Yards Per Year")

ggplotly(rushing_leaders)
```

Penalty Yards Per Game {data-navmenu="League Leaders" data-orientation=rows}
=============================================================================

### **NFL Average Penalty Yards Per Game**

For penalties, I found a dataset that records each team's average penalty yards per game from 2003 to 2020. I wanted to find out which team was the most penalized. From the graphic below, it is evident that the **Las Vegas Raiders** have been the most penalized team in the NFL. The top five most penalized teams are:

1. **Las Vegas Raiders**
2. **Baltimore Ravens**
3. **Detroit Lions**
4. **Tampa Bay Buccaneers**
5. **Los Angeles Rams**

The least penalized team in the NFL is the **Indianapolis Colts**.
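
The ranking above can be read off the chart, but it can also be computed. The following is a minimal sketch, assuming a table shaped like `nfl_penalty` with the columns `team_name` and `total` used in the plot. The teams and values are invented placeholders.

```r
library(dplyr)

# Hypothetical stand-in for nfl_penalty; values are invented for illustration
penalty <- data.frame(
  team_name = c("Team A", "Team B", "Team C", "Team D"),
  total     = c(55.2, 61.8, 48.9, 58.4),
  stringsAsFactors = FALSE
)

# Most penalized teams: sort descending and keep the top rows
most_penalized <- penalty %>%
  arrange(desc(total)) %>%
  slice_head(n = 2)

# Least penalized team: the row with the smallest total
least_penalized <- penalty %>%
  slice_min(total, n = 1)

most_penalized
least_penalized
```

On the real data, `slice_head(n = 5)` would return a top-five list like the one above.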
### **Top Penalized Teams**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
top_penalized <- ggplot(nfl_penalty, aes(x = reorder(team_name, total), y = total)) +
  geom_bar(fill = "slategray", stat = "identity") +
  coord_flip() +
  ggtitle("Most Penalized Teams in the NFL") +
  ylab("Average Penalty Yards") +
  xlab("Team Name")

ggplotly(top_penalized)
```

Rival Analysis {data-navmenu="League Leaders" data-orientation=columns}
=============================================================================

Column {.sidebar data-width=450}
-----------------------------------------------------------------------------

#### **Rival Analysis**

The last part of my analysis looks at five top rivalries in the NFL and how those games have gone over the past 20 years. The term "rivalry" can be a bit subjective; however, I chose the following five rivalries to analyze:

1. **Green Bay Packers** vs. **Chicago Bears**: The Green Bay Packers have been dominant in this rivalry, winning 71% of the encounters.

2. **Dallas Cowboys** vs. **Philadelphia Eagles**: This is a closely contested rivalry, as both teams have performed well. The Philadelphia Eagles lead it, winning 54% of the encounters.

3. **Kansas City Chiefs** vs. **Las Vegas Raiders** (formerly Oakland Raiders): The Kansas City Chiefs have been dominant in this rivalry, winning 63% of the encounters.

4. **Baltimore Ravens** vs. **Pittsburgh Steelers**: This is a phenomenal rivalry to watch, as both teams have won 22 of the 44 games played. Each team has won 50% of the encounters.

5. **Washington Football Team** (formerly Washington Redskins) vs. **New York Giants**: The New York Giants have been dominant in this rivalry, winning 68% of the encounters.

Then, just for fun, I also analyzed the **Pittsburgh Steelers** vs. the **Cincinnati Bengals**.
It is evident that the Pittsburgh Steelers have been dominant in this rivalry, winning 79% of the encounters.

Column {.tabset .tabset-fade}
-----------------------------------------------------------------------------

### **Packers vs. Bears**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# Packers vs. Bears
rivalry1 <- nfl_games %>%
  filter(home_team %in% c("Green Bay Packers", "Chicago Bears")) %>%
  filter(away_team %in% c("Green Bay Packers", "Chicago Bears"))
```

```{r, message = FALSE, warning = FALSE, echo = FALSE, results = 'hide'}
# Add a column of count of wins for each team
rivalry1_freq <- rivalry1 %>%
  count(rivalry1$winner)

rivalry1_freq <- as.data.frame(rivalry1_freq)
rivalry1_freq
```

**Which team has won more games in the past 20 years?**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# Rename columns
rivalry1_freq <- rivalry1_freq %>%
  rename(
    team = "rivalry1$winner",
    games = "n"
  )

kable(rivalry1_freq)
```

**Individual Game Statistics**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
kable(rivalry1)
```

### **Cowboys vs. Eagles**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# Cowboys vs. Eagles
rivalry2 <- nfl_games %>%
  filter(home_team %in% c("Dallas Cowboys", "Philadelphia Eagles")) %>%
  filter(away_team %in% c("Dallas Cowboys", "Philadelphia Eagles"))
```

```{r, message = FALSE, warning = FALSE, echo = FALSE, results = 'hide'}
# Add a column of count of wins for each team
rivalry2_freq <- rivalry2 %>%
  count(rivalry2$winner)

rivalry2_freq <- as.data.frame(rivalry2_freq)
rivalry2_freq
```

**Which team has won more games in the past 20 years?**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# Rename columns
rivalry2_freq <- rivalry2_freq %>%
  rename(
    team = "rivalry2$winner",
    games = "n"
  )

kable(rivalry2_freq)
```

**Individual Game Statistics**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
kable(rivalry2)
```

### **Chiefs vs. Raiders**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# Chiefs vs.
# Raiders (filtered under the former name, Oakland Raiders)
rivalry3 <- nfl_games %>%
  filter(home_team %in% c("Kansas City Chiefs", "Oakland Raiders")) %>%
  filter(away_team %in% c("Kansas City Chiefs", "Oakland Raiders"))
```

```{r, message = FALSE, warning = FALSE, echo = FALSE, results = 'hide'}
# Add a column of count of wins for each team
rivalry3_freq <- rivalry3 %>%
  count(rivalry3$winner)

rivalry3_freq <- as.data.frame(rivalry3_freq)
rivalry3_freq
```

**Which team has won more games in the past 20 years?**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# Rename columns
rivalry3_freq <- rivalry3_freq %>%
  rename(
    team = "rivalry3$winner",
    games = "n"
  )

kable(rivalry3_freq)
```

**Individual Game Statistics**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
kable(rivalry3)
```

### **Ravens vs. Steelers**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# Ravens vs. Steelers
rivalry4 <- nfl_games %>%
  filter(home_team %in% c("Baltimore Ravens", "Pittsburgh Steelers")) %>%
  filter(away_team %in% c("Baltimore Ravens", "Pittsburgh Steelers"))
```

```{r, message = FALSE, warning = FALSE, echo = FALSE, results = 'hide'}
# Add a column of count of wins for each team
rivalry4_freq <- rivalry4 %>%
  count(rivalry4$winner)

rivalry4_freq <- as.data.frame(rivalry4_freq)
rivalry4_freq
```

**Which team has won more games in the past 20 years?**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# Rename columns
rivalry4_freq <- rivalry4_freq %>%
  rename(
    team = "rivalry4$winner",
    games = "n"
  )

kable(rivalry4_freq)
```

**Individual Game Statistics**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
kable(rivalry4)
```

### **Washington vs. Giants**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# Washington vs.
# Giants (Washington is filtered under its former name, Washington Redskins)
rivalry5 <- nfl_games %>%
  filter(home_team %in% c("Washington Redskins", "New York Giants")) %>%
  filter(away_team %in% c("Washington Redskins", "New York Giants"))
```

```{r, message = FALSE, warning = FALSE, echo = FALSE, results = 'hide'}
# Add a column of count of wins for each team
rivalry5_freq <- rivalry5 %>%
  count(rivalry5$winner)

rivalry5_freq <- as.data.frame(rivalry5_freq)
rivalry5_freq
```

**Which team has won more games in the past 20 years?**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# Rename columns
rivalry5_freq <- rivalry5_freq %>%
  rename(
    team = "rivalry5$winner",
    games = "n"
  )

kable(rivalry5_freq)
```

**Individual Game Statistics**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
kable(rivalry5)
```

### **Steelers vs. Bengals**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# Steelers vs. Bengals
rivalry6 <- nfl_games %>%
  filter(home_team %in% c("Cincinnati Bengals", "Pittsburgh Steelers")) %>%
  filter(away_team %in% c("Cincinnati Bengals", "Pittsburgh Steelers"))
```

```{r, message = FALSE, warning = FALSE, echo = FALSE, results = 'hide'}
# Add a column of count of wins for each team
rivalry6_freq <- rivalry6 %>%
  count(rivalry6$winner)

rivalry6_freq <- as.data.frame(rivalry6_freq)
rivalry6_freq
```

**Which team has won more games in the past 20 years?**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
# Rename columns
rivalry6_freq <- rivalry6_freq %>%
  rename(
    team = "rivalry6$winner",
    games = "n"
  )

kable(rivalry6_freq)
```

**Individual Game Statistics**

```{r, message = FALSE, warning = FALSE, echo = FALSE}
kable(rivalry6)
```

Summary {data-navmenu="Summary" data-orientation=columns}
=============================================================================

#### **Summary**

The main goal of this analysis was to understand what goes into winning an NFL game and which teams have historically been successful in the standings.
I broke this analysis into several sections, including, but not limited to: (1) The Importance of Fan Attendance; (2) Standings over the Years; (3) Offense vs. Defense; and (4) Individual Game Observations.

Through extensive use of R, I investigated eight datasets with various information regarding the NFL. I frequently used linear modeling to explore correlations between variables across several datasets. Additionally, the `ggplot2` package delivered great visualizations to showcase this breakdown of the NFL. I also created new variables and tables to drill deeper into the raw data. One of my primary focuses was a breakdown of the divisions and their successes over the past 20 years. Box plot visualizations for the two conferences illuminated how teams have fared in the win column from their best season to their worst season.

My first analysis looked into NFL fan attendance. I created graphical representations to better understand which teams have a strong fan base and how consistently fans show up from year to year. From this analysis, it was evident that the **Dallas Cowboys** have the strongest fan base and the **Los Angeles Chargers** have the weakest. Additionally, greater game attendance correlated positively with a team's total wins per season.

Secondly, I focused on the divisional standings through the years. As mentioned above, box plot visualizations by division showed the range of success for NFL teams.
Per division, these teams have had the most success based on the `nfl_standings` dataset:

* **AFC East**: New England Patriots
* **AFC North**: Pittsburgh Steelers
* **AFC South**: Indianapolis Colts
* **AFC West**: Denver Broncos
* **NFC East**: Philadelphia Eagles
* **NFC North**: Green Bay Packers
* **NFC South**: New Orleans Saints
* **NFC West**: Seattle Seahawks

Using `geom_col()`, I observed that the **AFC East** has won the most Super Bowl Championships. This is due to the phenomenal success of Tom Brady and the New England Patriots during this time period.

Next, I researched one of the most common arguments in football: is the offense or the defense more important? I ran linear models on several variables in the `nfl_standings` data. High offensive and defensive rankings both correlate with more wins. Even though a great offense and a great defense are both important, the correlation tests indicated that a better offense is slightly more important to a team's success than a better defense. I created a table of the last 20 Super Bowl Champions and showcased the `offensive_ranking` and `defensive_ranking`. As this table shows, teams have been trending towards better offenses in the last few years.

I then observed individual game data in the NFL. Through graphs created by `ggpairs()`, I was able to view correlation coefficients for six variables. The main conclusion I drew is that a positive correlation exists between yards gained and points scored.

Extending my analysis, I looked into how weather conditions play a role in game outcomes. I found the average temperature, humidity, and wind for each location where teams may play. I also trained and tested a model with a 70-30 split to see if weather conditions predict whether a game will be high-scoring or low-scoring.
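
The 70-30 split described above can be sketched as follows. This is only an illustration under stated assumptions: the model type is not named in this summary, so logistic regression via `glm()` is an assumption, and the data frame, its column names, and the outcome flag `high_scoring` are simulated placeholders rather than the `nfl_weather` data.

```r
set.seed(7025)  # arbitrary seed so the random split is reproducible

# Simulated stand-in for the weather data; every column here is hypothetical,
# and the outcome is random, so no real signal exists by construction
n <- 200
weather <- data.frame(
  temperature  = runif(n, 10, 95),
  humidity     = runif(n, 20, 100),
  wind         = runif(n, 0, 25),
  high_scoring = rbinom(n, 1, 0.5)
)

# 70-30 split: sample 70% of the row indices for training, keep the rest for testing
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train <- weather[train_idx, ]
test  <- weather[-train_idx, ]

# Logistic regression (an assumed model choice) fit on the training rows only
fit <- glm(high_scoring ~ temperature + humidity + wind,
           data = train, family = binomial)

# Share of games classified correctly at a 0.5 probability cutoff
accuracy <- function(model, data) {
  pred <- as.integer(predict(model, newdata = data, type = "response") > 0.5)
  mean(pred == data$high_scoring)
}

in_sample  <- accuracy(fit, train)
out_sample <- accuracy(fit, test)
```

Comparing `in_sample` with `out_sample` is what separates fitting the training games from predicting unseen ones. The numbers this simulation produces say nothing about the real weather data.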
I concluded that weather conditions offer little predictive power for game outcomes, as both the in-sample and out-of-sample performance of my models was underwhelming.

I also analyzed league leaders over the past 20 years, including teams, head coaches, quarterbacks, passing yards leaders, rushing yards leaders, and penalty leaders. Understanding who the game-changers are is important when trying to predict which team will win.

Lastly, I analyzed popular rivalries in the NFL to see which teams have been dominant. My personal favorite finding is that the Pittsburgh Steelers are 33-9 against the Cincinnati Bengals over the past 20 years.

As a big NFL fan, it was incredibly interesting to see how the NFL has worked during my entire lifetime. It was also intriguing to see my favorite team's success over this time span. The NFL is one of the biggest industries in the world, with large implications on many levels. People involved in sports gambling, the NFL Draft, and fantasy football, as well as the common fan, could all draw different takeaways from this analysis that would help them better understand the recent history of the NFL.

With fans across the globe, a deep dive into the NFL is exciting for many groups. Coaches and players would be able to more effectively prepare for their opponents, gamblers could make more educated bets, general managers could derive their team's needs in the Draft, and the common fan could revel in their team's history.

This data tells a phenomenal story of the state of the NFL. However, it is a game for a reason. No one will ever be able to fully predict NFL outcomes, and that is what makes the sport as intriguing as it is!