Introduction

Motivation

The motivation behind choosing this dataset is largely centered around our love for sports, in particular football. With Cincinnati football on the rise, both collegiately and professionally, we thought it would be interesting to dive into the attendance data of the National Football League. As Bengals fans, we’ve experienced several years of below average football, and anticipate that the attendance numbers will be reflective of our performance on the field. We think the results will be beneficial to not only every day fans, but also NFL teams around the league. The Cincinnati Bengals organization is widely known as one of the cheapest in the league, and our hope is through the results of our analysis they will recognize that in order to improve attendance in this city, our team’s performance must improve.

Data and Methodology

For this project, we will be utilizing summary statistics in R to initially explore the various variables within our dataset in order to uncover new information and insights surrounding NFL stadium attendance from the 2000 season through the 2019 season. This analysis will lead us to a greater understanding of the variables affecting attendance at NFL games.

The NFL dataset includes data sourced from Pro Football Reference team standings. In total, the NFL Attendance data is comprised of three unique datasets:

  • Attendance - 8 variables
  • Standings - 15 variables
  • Games - 19 variables

Planned Analysis

With the data we have selected, there are many options for analysis regarding the variables surrounding attendance at NFL games. Through our exploration, we aim to analyze:

  • Past Game Attendance - This analysis will explore to what degree past game attendance for NFL teams in years prior can be an indicator of future game attendance.
  • Team Standing - This analysis will examine how individual NFL team rankings in standing tables prior to said team’s games influence game attendance.
  • Team Game Performance - This analysis will delve into how various metrics recorded over prior games, such as total points, influence the attendance recorded at games.
  • Attendance Factors - This analysis will allow us find correlation between different variables and possibly build a regression to predict fan attendance based on different factors.

Analysis Impact

As addressed earlier, our analysis of this data will prove to be useful not only for every day football fans, but for NFL executives as well. In the most simplistic manner, our research will provide in-depth analysis of the leaders in attendance over the time frame. With that being said, we believe our research can prove to be more beneficial. In professional sports, front offices are constantly using cost-benefit analysis to determine whether it is worth it to spend more money in order to improve their team, or whether they should maintain their current roster regardless of talent. While this data does not dive into the metrics of how much they should spend on players, our analysis of the correlation between performance variables and attendance will allow them to see if it’s worth it to consider investing in better players/coaches to improve the team. Additionally, by diving into trends in points, win/loss percentage, offensive ranking, and defensive ranking, they will be able to determine where their teams have been lacking, and address those issues accordingly.

Packages Used

The project description mentions there are thousands of packages in R that allow for analysis past base R, and this project requires a variety of those packages. The packages below are packages that provide the best use and results while cleaning, understanding, and visualizing the data, but as we continue our analysis we anticipate multiple packages being added.

Many of our packages provide different purposes, as some below are involved in the Data Preparation Section while others are planned to be used in the Exploratory Data Analysis section. Packages such as dplyr and tidyverse were used for data cleaning and manipulation to dive deeper into the data, while packages such as ggplot2 and tibble are planned to be used to make good visualizations to allow us to understand the data and results.

While many are packages are standard because of our growing experience and education in R, we look forward to growing our knowledge on new packages and uses of ones we have already worked with. Packages loaded for this project so far are displayed below.

library(dplyr) #for data cleaning
library(tidyverse) #for data cleaning and data manipulation
library(DT) #for creating and viewing data tables
library(knitr) #for dynamic report generation
library(skimr) # for displaying summary statistics
library(ggplot2) # for data visualization and plotting
library(tibble) # for data visualization and data manipulation

Data Preparation

Data Source

Data selected for this project originally came from our professor, Tianhai Zu, among three other datasets. The data is from GitHub. The GitHub page titled ‘NFL Stadium Attendance’ credits Pro Football Reference for the data, specifically their Team Standings page.

This data is made publicly available by Sports Reference, which makes websites for different sports including both basic and sabermetric statistics and resources for sports fans everywhere. Pro Football Reference includes the full statistical history of the NFL including seasons, teams, players, coaches, drafts, and more since 1960.

Overview

The data includes data for all teams and games played from the beginning of the 2000-2001 NFL season through the end of the 2019-2020 NFL season. The dataset contains three separate tables:

  1. attendance - contains 8 variables and 10,846 observations that list the weekly and seasonal attendance for every week for every team.
  2. standings - contains 15 variables and 638 observations that display the record and basic stats and standings for every team each season.
  3. games - contains 19 variables and 5,324 observations containing the game results and stats for every game from the 2000-2001 season through the 2019-2020 season.

Any of the three tables could be joined relatively nicely by joining on “year”, “team_name”, and “team”.

Missing data is denoted by NA, but only two variables overall contain NAs. Both variables are necessary to the data, as one is weekly_attendance that is only NA when a team is on its by week (when the team does not play) and the other is tie, when is NA unless there is a tie.

Data Importing and Cleaning

Our first step was to set the working directory and import the three data sets.

setwd("/Users/drewgreiner/Library/Mobile Documents/com~apple~CloudDocs/5th Year/Data Wrangling/Project/Midterm") #Sets working directory

# NFL Attendance Data
attendance <- read.csv("/Users/drewgreiner/Documents/nfl/attendance.csv") #imports attendance table
standings <- read.csv("/Users/drewgreiner/Documents/nfl/standings.csv") #imports standings table
games <- read.csv("/Users/drewgreiner/Documents/nfl/games.csv") #imports games table

 

Joining the Datasets

# Join the data relatively nicely with dplyr
nfl_attendance <- left_join(attendance, standings, games, by = c("year", "team_name", "team"))


Table Cleaning and Summarizing

attendance Table

The original attendance dataset contains 10,846 observations and eight variables. In the data, there are two character variables - team and team_name. There also six numeric variables - year, total, home, away, week, weekly_attendance. The data spans from the 2000 season through the 2019 season, and only for weeks 1 through 17 are included, with no playoff attendance data. A view of the original attendance dataset is available below.

# View the attendance data set for the 2000 Arizona Cardinals
datatable(head(attendance, 17))

Below is a data dictionary for the attendance dataset.

Variable Name Variable Data Type Variable Desciption
team character City or state in which the team originates
team_name character Name or mascot of the team
year integer Year
total integer Total attendance per season
home integer Total attendance at home games per season
away integer Total attendance at away games per season
week integer Week in which game was played
weekly_attendance integer Attendance for given week

As mentioned above, there are missing values in the attendance dataset in the weekly_attendance column. Also mentioned above, these NA’s make sense as each team has a bye week in the NFL in which they do not play. In order to be consistent with counts and to not skew the data, we have omitted bye weeks below, resulting in 638 fewer observations.

colSums(is.na(attendance)) # Show the number of NA per column in attendance
##              team         team_name              year             total 
##                 0                 0                 0                 0 
##              home              away              week weekly_attendance 
##                 0                 0                 0               638
attendance <- na.omit(attendance)
colSums(is.na(attendance)) # Show the NA have been omitted
##              team         team_name              year             total 
##                 0                 0                 0                 0 
##              home              away              week weekly_attendance 
##                 0                 0                 0                 0

We also decided to change some of the variable names in the original attendance dataset in order to better define and describe the variables. We made name changes to four variables - team, total, home, away.

attendance <- attendance %>% rename(
  team_city = team,
  total_season_attendance = total,
  season_home_attendance = home,
  season_away_attendance = away
)
names(attendance)
## [1] "team_city"               "team_name"              
## [3] "year"                    "total_season_attendance"
## [5] "season_home_attendance"  "season_away_attendance" 
## [7] "week"                    "weekly_attendance"


The final step in cleaning the attendance dataset is creating two new datasets, one with only weekly attendance information and one with total attendance information (including total_season_attendance, season_home_attendance, season_away_attendance).

weekly_attendance <- attendance[c(1:3, 7:8)] # Removing season totals for attendance
datatable(head(weekly_attendance,16)) # Viewing the new weekly_attendance dataset
total_attendance <- attendance[c(1:6)] # Removing weekly attendance numbers
total_attendance <- total_attendance[!duplicated(attendance), ] # Removing weekly duplicates
datatable(head(total_attendance,31)) # Viewing the new total_attendance dataset


Lastly, you can see a full summary of the cleaned attendance table.

skim(attendance)
Data summary
Name attendance
Number of rows 10208
Number of columns 8
_______________________
Column type frequency:
character 2
numeric 6
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
team_city 0 1 5 13 0 32 0
team_name 0 1 4 10 0 32 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1 2009.53 5.75 2000 2005.0 2010 2015.00 2019 ▇▇▇▇▇
total_season_attendance 0 1 1080910.03 72877.18 760644 1040509.0 1081090 1123230.00 1322087 ▁▁▇▆▁
season_home_attendance 0 1 540455.01 66774.84 202687 504360.0 543185 578342.00 741775 ▁▁▅▇▁
season_away_attendance 0 1 540455.01 25509.40 450295 524974.0 541757 557741.00 601655 ▁▂▇▇▂
week 0 1 9.10 4.99 1 5.0 9 13.00 17 ▇▅▅▆▇
weekly_attendance 0 1 67556.88 9022.02 23127 63245.5 68334 72544.75 105121 ▁▁▇▃▁
standings Table

The original standings data set contains 6,638 observations and 15 variables. In the data, there are four character variables - team, team_name, playoffs, sb_winner. There are also 11 numeric variables - year, wins, loss, points_for, points_against, points_differential, margin_of_victory, strength_of_schedule, simple_rating, offensive_ranking, defensive_ranking. The data spans from the 2000 season through the 2019 season. A view of the original standings dataset is available below.

# View the standings data set for the 2000 season
datatable(head(standings, 31))

Below is a data dictionary for the standings dataset.

Variable Name Variable Data Type Variable Desciption
team character City or state in which the team originates
team_name character Name or mascot of the team
year integer Year
wins integer Total wins per season (0 to 16)
loss integer Total losses per season (0 to 16)
points_for integer Total points the team scored per season
points_against integer Total points the opponent scored on the team per season
points_differential integer The difference between the total points for the team and against the team
margin_of_victory numeric Points differential divided by the total number of games per season
strength_of_schedule numeric Difficulty of schedule based on opponent records
simple_rating numeric A rating for the team that takes into account points differential and strength of schedule (measured by Simple Rating System)
offensive_ranking numeric A rating comparing how well the offense performs to opponent teams (measured by Simple Rating System)
defensive_ranking numeric A rating comparing how well the defense performs to opponent teams (measured by Simple Rating System)
playoffs character Stating whether or not the team made it to the playoffs
sb_winner character Stating whether or not the team won the Super Bowl for the season

Next, we wanted to check for any missing values in the columns of the standings dataset. There were no missing value or NA value in the standings dataset, as seen below.

colSums(is.na(standings)) # Show the number of NA per column in standings
##                 team            team_name                 year 
##                    0                    0                    0 
##                 wins                 loss           points_for 
##                    0                    0                    0 
##       points_against  points_differential    margin_of_victory 
##                    0                    0                    0 
## strength_of_schedule        simple_rating    offensive_ranking 
##                    0                    0                    0 
##    defensive_ranking             playoffs            sb_winner 
##                    0                    0                    0


Looking at the last two character variables, we decided to change them to binary variables, and both only have two different values, as shown below.

unique(standings$playoffs) # Show the unique values for playoffs
## [1] "Playoffs"    "No Playoffs"
unique(standings$sb_winner) # Show the unique values for sb_winner
## [1] "No Superbowl"  "Won Superbowl"

For the playoffs variable, we changed the “Playoffs” value to 1, and “No Playoffs” to 0.

standings$playoffs[standings$playoffs == "Playoffs"] <- "1"
standings$playoffs[standings$playoffs == "No Playoffs"] <- "0"

standings$playoffs <- as.numeric(standings$playoffs) # Changing the binary values to numerics

For the sb_winner variable, we changed the “Won Superbowl” value to 1, and “No Superbowl” to 0.

standings$sb_winner[standings$sb_winner == "Won Superbowl"] <- "1"
standings$sb_winner[standings$sb_winner == "No Superbowl"] <- "0"

standings$sb_winner <- as.numeric(standings$sb_winner) # Changing the binary values to numerics

Below you can see the new updated unique values for playoffs and sb_winner.

unique(standings$playoffs) # Show the new unique values for playoffs
## [1] 1 0
unique(standings$sb_winner) # Show the new unique values for sb_winner
## [1] 0 1


With the data now cleaned after checking for missing values and changing two variables to binaries, a view of the cleaned standings dataset is below.

datatable(head(standings,31)) # Viewing the cleaned standings dataset


Lastly, you can see a full summary of the cleaned standings table.

skim(standings)
Data summary
Name standings
Number of rows 638
Number of columns 15
_______________________
Column type frequency:
character 2
numeric 13
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
team 0 1 5 13 0 32 0
team_name 0 1 4 10 0 32 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1 2009.53 5.76 2000.0 2005.00 2010.0 2014.75 2019.0 ▇▇▇▇▇
wins 0 1 7.98 3.08 0.0 6.00 8.0 10.00 16.0 ▂▆▇▆▂
loss 0 1 7.98 3.08 0.0 6.00 8.0 10.00 16.0 ▂▆▇▆▂
points_for 0 1 350.28 71.40 161.0 299.00 348.0 396.00 606.0 ▂▇▇▂▁
points_against 0 1 350.28 59.55 165.0 310.00 347.0 391.50 517.0 ▁▃▇▆▁
points_differential 0 1 0.00 101.09 -261.0 -75.00 1.5 72.75 315.0 ▂▆▇▅▁
margin_of_victory 0 1 0.00 6.32 -16.3 -4.70 0.1 4.57 19.7 ▂▆▇▅▁
strength_of_schedule 0 1 0.00 1.63 -4.6 -1.10 0.0 1.20 4.3 ▁▅▇▅▁
simple_rating 0 1 0.00 6.20 -17.4 -4.47 0.0 4.50 20.1 ▁▆▇▅▁
offensive_ranking 0 1 0.00 4.34 -11.7 -3.18 0.0 2.70 15.9 ▁▇▇▂▁
defensive_ranking 0 1 0.00 3.57 -9.8 -2.40 0.1 2.50 9.8 ▁▅▇▅▁
playoffs 0 1 0.38 0.48 0.0 0.00 0.0 1.00 1.0 ▇▁▁▁▅
sb_winner 0 1 0.03 0.17 0.0 0.00 0.0 0.00 1.0 ▇▁▁▁▁
games Table

The original games data set contains 5,324 observations and 19 variables. In the data, there are 12 character variables - week, home_team, away_team, winner, tie, day, date, time, home_team_name, home_team_city, away_team_name, away_team_city. There are also seven numeric variables - year, pts_win, pts_loss, yds_win, turnovers_win, yds_loss, turnovers_loss. The data spans from the 2000 season through the 2019 season. A view of the original standings dataset is available below.

# View the games data set for the Week 1 of the 2000 season
datatable(head(games,16))

Below is a data dictionary for the games dataset.

Variable Name Variable Data Type Variable Desciption
year integer Year
week character Week of the season in which the game was played
home_team character Home team for the game
away_team character Away team for the game
winner character Winner of the game
tie character Was there a tie? (if so, the other team will be listed in this column)
day character Day of the week in which the game was played
date character Date of the game
time character Time of the day in which the game was played
pts_win integer Number of points the winning team scored
pts_loss integer Number of points the losing team scored
yds_win integer Total number of yards the winning team had
turnovers_win integer Total number of turnovers the winning team had
yds_loss integer Total number of yards the losing team had
turnovers_loss integer Total number of turnovers the losing team had
home_team_name character Name or mascot of the winning team
home_team_city character City of the winning team
away_team_name character Name or mascot of the losing team
away_team_city character City of the losing team

Based on looking at the above dataset, we removed the last four columns (home_team_name, home_team_city, away_team_name, away_team_city) as we felt they were repetitive.

names(games) # Viewing the names of the columns in the games dataset
##  [1] "year"           "week"           "home_team"      "away_team"     
##  [5] "winner"         "tie"            "day"            "date"          
##  [9] "time"           "pts_win"        "pts_loss"       "yds_win"       
## [13] "turnovers_win"  "yds_loss"       "turnovers_loss" "home_team_name"
## [17] "home_team_city" "away_team_name" "away_team_city"
games <- games[-c(16:19)] # Removing the repeitive columns from the games dataset
names(games) # Viewing the names of the columns in the updated games dataset
##  [1] "year"           "week"           "home_team"      "away_team"     
##  [5] "winner"         "tie"            "day"            "date"          
##  [9] "time"           "pts_win"        "pts_loss"       "yds_win"       
## [13] "turnovers_win"  "yds_loss"       "turnovers_loss"

As mentioned above, there are missing values in the games dataset in the tie column. Also mentioned above, these NA’s make sense as there are very few ties in the NFL. Knowing that tie contains all of the NA, we also must look at the unique values for tie.

colSums(is.na(games)) # Show the number of NA per column in games
##           year           week      home_team      away_team         winner 
##              0              0              0              0              0 
##            tie            day           date           time        pts_win 
##           5314              0              0              0              0 
##       pts_loss        yds_win  turnovers_win       yds_loss turnovers_loss 
##              0              0              0              0              0
unique(games$tie) # Viewing the unique values in the tie column in games
## [1] NA                   "Atlanta Falcons"    "Cincinnati Bengals"
## [4] "St. Louis Rams"     "Green Bay Packers"  "Carolina Panthers" 
## [7] "Arizona Cardinals"  "Cleveland Browns"
games$winner[games$tie != is.na(games$tie)] <- "Tie" # Entering tie into winner column for ties
games <- games[-c(6)] # Removing the ties column
colSums(is.na(games)) # Checking the number of NA per column in games
##           year           week      home_team      away_team         winner 
##              0              0              0              0              0 
##            day           date           time        pts_win       pts_loss 
##              0              0              0              0              0 
##        yds_win  turnovers_win       yds_loss turnovers_loss 
##              0              0              0              0

The final change needed is changing the week variable from a character variable to a numeric variable. This is done below.

games$week <- as.numeric(games$week) # Changing week to numeric

With the data now cleaned after checking for missing values and changing variables, a view of the cleaned games dataset is below.

datatable(head(games,16)) # Viewing the cleaned games dataset


Lastly, you can see a full summary of the cleaned games table.

skim(games)
Data summary
Name games
Number of rows 5324
Number of columns 14
_______________________
Column type frequency:
character 6
numeric 8
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
home_team 0 1 13 20 0 34 0
away_team 0 1 13 20 0 34 0
winner 0 1 3 20 0 35 0
day 0 1 3 3 0 7 0
date 0 1 9 12 0 154 0
time 0 1 8 8 0 187 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1.00 2009.53 5.75 2000 2005 2010 2015 2019 ▇▇▇▇▇
week 220 0.96 9.10 4.99 1 5 9 13 17 ▇▅▅▆▇
pts_win 0 1.00 27.78 8.83 3 21 27 34 62 ▁▇▇▂▁
pts_loss 0 1.00 16.09 8.14 0 10 16 21 51 ▆▇▅▁▁
yds_win 0 1.00 361.64 78.58 47 308 361 415 653 ▁▂▇▃▁
turnovers_win 0 1.00 1.08 1.04 0 0 1 2 7 ▇▂▁▁▁
yds_loss 0 1.00 309.08 84.50 26 251 306 366 613 ▁▅▇▃▁
turnovers_loss 0 1.00 2.17 1.42 0 1 2 3 8 ▆▇▂▁▁

Exploratory Data Analysis

Discovery

For the discovery of new information in the data, we plan to utilize all three datasets in our analysis, which includes data on NFL attendance, game statistics, and standings. Our examination of attendance data will allow us to understand how standings and team game performance impact NFL attendance. This will allow us to find the following information surrounding attendance and team performance:

  • NFL teams who have had the most attendance over the seasons since 2000, looking at different summary statistics such as mean, median, total, etc.
  • Trends in points (mean, median, total, etc.) over the seasons for each team
  • Trends in win/loss percentage, offensive_ranking, defensive_ranking, and playoff berth by team over the seasons
  • Strength of divisions in relation to other divisions by comparing divisions’ total wins, playoff berths, and super bowl wins

We also plan to analyze how performance variables found in the standings dataset (such as wins, losses, offensive_ranking, defensive_ranking, and playoff berth) correlate to the team’s attendance data over the seasons. We plan to do this by creating a correlation matrix between the variables total (total attendance), wins, loss, offensive_ranking, defensive_ranking, playoffs, and sb_winner. We can also use the correlation functionality in R to find the exact correlation values across the board.

Plots and Tables

For the initial attendance analysis, plots and tables will be very prevalent in our results. For offensive_ranking, defensive_ranking, playoff, sb_winner, wins, points_for, points_against, and other variables, we plan to output tables showing the frequency of each response in relation to weekly_attendance over time. We will also utilize bar plots to visualize the frequency of each over time. A correlation matrix will be essential for our attendance analysis in visualizing the correlation between various variables, including most variables in our standings dataset. We are also looking to visualize the strength of each division in relation to one another. This could be through a plot of the count of different summary statistics of wins or participation in the playoffs by division.

Required Learning

Most of what is needed for our analysis has been covered in this course or in one of the other courses in R that we have already taken. The main learning that we will need to do before completing the project is a deeper dive into joins. We have an interesting challenge to face: there is not a common variable in all the datasets that can be used as a primary/foreign key (issues in the games data set with home vs away and combined team_name and team_city). We are looking for ways to create a common key using the variables that we have. This will be one of our required learnings. We will also dig a little deeper into plotting inside of R to ensure our plots come out as clean and professional.

Machine Learning Techniques

The incorporation of linear regression into our analysis will depend on preliminary results of the correlation values when we examine variables’ correlation with attendance. If we find that there is a strong correlation between any of the variables, that would prompt us to create a regression model to find more information on how attendance is typically affected by other variables present. One such correlation we hope will be worth looking into is the correlation between weekly_attendance and some variables in our standings dataset (i.e. wins).