Synopsis

  1. Problem Statement: The problem statement that we looked at was to find the possible causes in the decline of the regular season attendance and how the attendance can be approved. One of the questions we had was to examine the effect of the team’s advancement into the playoffs and even the championship game from the previous year on the current years regular season game attendance. Another focus was the relationship between the team’s standing or win/loss record will affect the following seasons regular season game attendance

  1. Solution Overview: The data was joined and cleaned into a single exploratory table for descriptive analysis. It was used to look at trends and the distribution of the data provided. Scatter plots and bar charts were used to explore the different factors between teams to look for relationships. The analysis was used to showcase interest insights we found to help identify possible improvement in regular game attendance data. Packages used and analyses performed are shown in the later sections.

  1. Insights: Some very interesting insights were obtained from this analysis:
  • There are two reasons that cause the attendance to skyrocket in the 2002 season. First, The New England Patriots won their 1st Superbowl, and 2002 is the first season for the Houston Texans to join the NFL.
  • The decline in attendance in 2017 was due to a move of Chargers from the Qualcomm Stadium which holds a max capacity of 70,561 to LA Dignity Health Sports Park which holds a max capacity of 27,000 peoplein Los Angeles.
  • The teams that won the Superbowl did not increase their regular season attendance significantly except for the Patriots in 2001.
  • New England Patriots won their 1st Superbowl ever in the 2001 season causing the attendance to skyrocket in the 2002 season. Even though Tom Brady as the star quarterback joined the team in 2001 but the main reason is because of their 1st Superbowl win.
  • In 2001, the Patriots faced the lowest regular season attendance until they won the 1st Superbowl in 2002. However, the attendance fluctuated over years until they won the Superbowl again in 2014. But there is definitely a gradual decline since 2015.
  • The teams that entered the division games did not see a great increase in their total attendance for the following year.

Packages Used

  1. Following packages were used:
    • readr: Used the readr package to read in the csv file we cleaned up last time
    • reshape2: Used the reshape2 package to reshape variables into groups
    • tidyr: For data cleaning
    • tibble: Used to store data, and makes it much easier to handle and manipulate data
    • knitr: For knitting document and include_graphics, kable functions
    • dplyr: Used for data manipulation
    • ggplot2: Used to plot charts
    • plotly: Used to plot interactive charts
    • DT: Used to display the data on the screen in a scrollable format
library(readr)
library(ggplot2)
library(dplyr)
library(tibble)
library(lubridate)
library(reshape2)
library(tidyr)
library(knitr)
library(plotly)
library(DT)

Data Preparation

This sections contains all the procedures followed in getting the data analysis ready. Each step has been explained and the codes have been given

Data Import

Original data (attendance, games and standings) were obtained from Pro Football Reference website and downloaded and cited through Github Rfordatascience

# Read in original data sets from Github
attendance <- as_tibble(read.csv("https://raw.githubusercontent.com/angiezhangty/nfl-analysis/master/attendance.csv", stringsAsFactors = FALSE))
games<- as_tibble(read.csv("https://raw.githubusercontent.com/angiezhangty/nfl-analysis/master/games.csv", stringsAsFactors = FALSE))
standings <- as_tibble(read.csv("https://raw.githubusercontent.com/angiezhangty/nfl-analysis/master/standings.csv", stringsAsFactors = FALSE))
  1. Attendance data
    • Purpose: originally collected to record attendance information for each game, each team throughout the years by week. We believe the data was possibly used to evaluate sales performance and predict future sales.
    • Period: data was collected from 2000 to 2019 for each regular season.
    • Variables: this data source has 8 variables.
    • Peculiarities: all 638 missing data points in the “weekly_attendance” variable were coded as “NA” (numeric) because each team skips a week during each regular season.

  1. Games data
    • Purpose: originally collected to record win/loss data for each game, each team throughout the years and details regarding the win/loss including yards win/loss, points win/loss and turnovers win/loss, etc. We believe the data was possibly used to evaluate team performance and prepare for future games.
    • Period: data was collected from 2000 to 2019 for each season.
    • Variables: this data source has 19 variables.
    • Peculiarities: 5314 values with no tie from the “tie” column were input as “NA” (character).

  1. Standings data
    • Purpose: originally collected to record each team’s standings among each other including rankings, win/loss, ratings, points win/loss and whether won Superbowl, etc.
    • Period: data was collected from 2000 to 2019 for each season.
    • Variables: this data source has 15 variables.
    • Peculiarities: None.

Data Cleaning

  1. Data cleaning steps:

All data cleaning steps were performed in RStudio in a R script. We used a full join first on games and attendance by team_name (attendance) = home_team_name (games), year and week. R automatically merged team_name into home_team_name and we saved the result into a table named “nfl”.

Then we used a left join on nfl and standing by year and home_team_name (nfl) = team_name (standings) and saved the result into a new table named “nfl_cleaned_data”. And we started examining each variable in the new data set. Since we have 2 team columns in the data and they are exactly the same as home_team_city, we deleted them and kept home_team_city instead.

Next we started renaming column names: We changed day into weekday. We separated date into month and day, combined them with year to create game_date, and changed the class of game_date into Date. Then we changed total, home, away to total_attendance, home_attendance, away_attendance. We also replaced some values: We changed playoffs values: Playoffs to 1, No Playoffs to 0, and converted it into a numeric variable.

We renamed sb_winner to SuperBowl_winner and did the same thing to replace values: Won Superbowl to 1, No Superbowl to 0 and converted it into a numeric variable.

Last, we saved the final file into our local folder and named it “nfl_cleaned_data”. However, we did not use the data cleaning steps in this R Markdown report to save time but we have listed the detailed steps below.

#change attendance week column to character 
attendance$week <- as.character(attendance$week)
#full join games and attendance data together
nfl<- full_join(games,attendance,by = c("home_team_name"="team_name", "year", "week"))
#delete duplicate team column
nfl <- select(nfl,-"team")
#join new cleaned data to standings data
nfl_cleaned_data<- left_join(nfl,standings,by=c("home_team_name"="team_name","year","home_team_city"="team"))
#rename column name "day" to "weekday"
names(nfl_cleaned_data)[7] <- "weekday"
#separate data column to month and day columns
nfl_cleaned_data <-separate(nfl_cleaned_data, date, c("month", "day"), sep = " ")
#unite year,month and day columns to game_data column
nfl_cleaned_data <- unite(nfl_cleaned_data, game_date, year, month, day, sep = "-")
#change game_date to standard time format
nfl_cleaned_data$game_date <- as.Date(nfl_cleaned_data$game_date, format = "%Y-%B-%d")
#add game_year columns
nfl_cleaned_data$game_year <- year(nfl_cleaned_data$game_date)
#rename columns name
names(nfl_cleaned_data)[19:21] <- c("total_attendance", "home_attendance","away_attendance")
names(nfl_cleaned_data)[34] <- c("superbowl_winner")

#replace no playoffs to 0, and playoffs to 1 and change to numeric data type
nfl_cleaned_data$playoffs <- str_replace(nfl_cleaned_data$playoffs, "No Playoffs", "0")
nfl_cleaned_data$playoffs <- str_replace(nfl_cleaned_data$playoffs, "Playoffs", "1")
nfl_cleaned_data$playoffs <- as.numeric(nfl_cleaned_data$playoffs)
#replace no Superbowl to 0, and Won Superbowl to 1 and change to numeric data type
nfl_cleaned_data$superbowl_winner <- str_replace(nfl_cleaned_data$superbowl_winner, "No Superbowl", "0")
nfl_cleaned_data$superbowl_winner <- str_replace(nfl_cleaned_data$superbowl_winner, "Won Superbowl", "1")
nfl_cleaned_data$superbowl_winner <- as.numeric(nfl_cleaned_data$superbowl_winner)

After the data cleaning, we once again checked the number of rows and columns, as shown in the code below.

colnames(nfl_cleaned_data)
##  [1] "game_date"            "week"                 "home_team"           
##  [4] "away_team"            "winner"               "tie"                 
##  [7] "weekday"              "time"                 "pts_win"             
## [10] "pts_loss"             "yds_win"              "turnovers_win"       
## [13] "yds_loss"             "turnovers_loss"       "home_team_name"      
## [16] "home_team_city"       "away_team_name"       "away_team_city"      
## [19] "total_attendance"     "home_attendance"      "away_attendance"     
## [22] "weekly_attendance"    "wins"                 "loss"                
## [25] "points_for"           "points_against"       "points_differential" 
## [28] "margin_of_victory"    "strength_of_schedule" "simple_rating"       
## [31] "offensive_ranking"    "defensive_ranking"    "playoffs"            
## [34] "superbowl_winner"     "game_year"
dim(nfl_cleaned_data)
## [1] 11066    35

  1. Cleaned data variables of concern:

-There are a lot of NA’s in the data set after we performed full join and left join with the original data, but we think it should stay NA to be true to our data.
-Week variable: Since there are Wildcard, Division and ConfChamp in the playoffs, we decided to keep the variable as a character.
-Tie variable: only very few games had values in there because ties are only allowed in the regular season and we decided to keep them that way.
-We believe that all the other changes that we performed are for the benefit of the analysis, but we will come back to change it if we believe it is better to do so.


Data Preview

# View data with DT package
datatable(head(nfl_cleaned_data, 50))

Exploratory Data Analysis

In this section, we used a variety of packages, functions and graphical methods to explore the NFL data set.

NFL Analysis

# Here we are looking at the overall attendance trend for all NFL teams from 2000 to 2019.
 annual_total <- nfl_cleaned_data%>%
  drop_na('weekly_attendance', 'game_year')%>%
  group_by(game_year)%>%
  summarize(total = sum(weekly_attendance))

 
 annual_total_plot <- ggplot(annual_total, aes(x=game_year,y= total/1000000))+
  geom_point(size=3)+
  geom_line()+
  labs(title ='Total NFL Regular Games Attendance by Year',x="Year",y="Attendance (Millions)",caption="Source: Pro-football Reference") +
  theme_classic()+ theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) +
   scale_x_continuous(breaks = seq(2000, 2019, 1))+
   annotate(geom = "point", x = 2016, y = 17.788671, colour = "orange", size = 4) + 
   annotate(geom = "point", x = 2016, y = 17.788671) + 
   annotate(geom = "text", size=3,x = 2015, y = 17.788671, label = "New team; Los Angeles Rams added 
 to NFL hence an increase in attendance", hjust = "right")+
   annotate(geom = "point", x = 2017, y = 17.253425, colour = "orange", size = 4) + 
   annotate(geom = "point", x = 2017, y = 17.253425) +
   annotate(geom = "curve", x = 2015, y = 16.9, xend = 2017, yend = 17.23,curvature = .3, arrow = arrow(length = unit(2, "mm"))   ) +
   annotate(geom = "text", size=3,x = 2015, y = 16.8, label = "Drop in attendence is due to Chargers after
 their move from a Qualcomm Stadium that 
 had a maximum capacity of of 70,4561 people
 to LA Dignity Health Sports Park which holds
 a max capacity of 27,000 people" , hjust = "right")+
   annotate("rect", xmin = 1999.5, xmax = 2001.5, ymin = 16.2, ymax = 16.5,alpha = .2)+
   annotate(geom = "curve", x = 2003, y = 16.4, xend = 2001.5, yend = 16.4,curvature = .0, arrow = arrow(length = unit(2, "mm"))   ) +
   annotate(geom = "text", size=3,x = 2003, y = 16.4, label = "New England Patriots 1st superbowl win and the new team
            Houston Texans joining NFL leads to increase in 
            attendence in 2002." , hjust = "left")
 annual_total_plot

Team Analysis

From the graph below, the Dallas Cowboys, New York Giants, New York Jets and Washington Redskins have the highest total attendance over the years whereas the Los Angeles Chargers and Los Angeles Rams have the lowest attendance. It should be noted that the Chargers and the Rams are still new in NFL since they joined in 2016.

# In this section we are looking at the ranking of the teams attendance.
 team_trend <- nfl_cleaned_data%>%
   drop_na('weekly_attendance','home_team')%>%
   group_by(home_team)%>%
   summarize(total = sum(weekly_attendance))
top5team<-team_trend%>%
  arrange(desc(total))%>%
  head(4)
bottom2team<-team_trend%>%
  arrange(total)%>%
  head(2)

 team_trend_graph <- ggplot(team_trend, aes(x=home_team,y= total/1000000))+
   geom_point(size=3)+
   geom_point(data=top5team,color="red",size=3)+
   geom_point(data=bottom2team,color="blue",size=3)+
   theme_classic()+
   theme(axis.text.x = element_text(angle = 90,size = 10))+
   labs(title ='Total Regular Season Attendance for Each NFL Team Since 2000',x="Team",y="Attendance (Millions)")+
   annotate("rect", xmin = 'Kansas City Chiefs', xmax = 'Miami Dolphins', ymin = .0, ymax = 3,alpha = .2)+
   annotate(geom = "curve", x = 'Indianapolis Colts', y = 2.5, xend = 'Kansas City Chiefs', yend = 2,curvature = .0, arrow = arrow(length = unit(2, "mm"))   ) +
   annotate(geom = "text", x = 'Indianapolis Colts', y = 2.5, label = 'New teams in NFL' , hjust = "right")
 
 team_trend_graph 

Some interesting things we observed were that the Dallas Cowboys had continuously high attendance record based on the boxplots below and the Arizona Cardinals had the lowest weekly attendance ever compared to other teams.

# In this section we are looking at each team's attendance
 team_box <- nfl_cleaned_data%>%
   drop_na('weekly_attendance','home_team')
 
 team_trend_box <- ggplot(team_box, aes(x=home_team,y= weekly_attendance/1000))+
   geom_boxplot()+
   labs(title ='NFL Weekly Games Attendance by Team',x="Home team",y="Weekly Attendance (Thousands)") +
   theme_classic()+ theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
 
 team_trend_box

Superbowl Winner

From the graph we can see that for most teams, winning the Superbowl did not increase their regular season attendance significantly except for the Patriots in 2001. The 1st win brought the team fame and popularity making their attendance data go through the roof.

 # We created a sbwinner data frame to showcase the list of Superbowl winners and their attendance data for each year from 2000 to 2019
sbwinner<- nfl_cleaned_data %>%
  filter(superbowl_winner == 1) %>%
  group_by(game_year, home_team) %>%
  summarize(attendance = mean(total_attendance, na.rm = TRUE))
kable(sbwinner, col.names = c("Game Year", "Team Name", "Total Attendance"), align = "ccc")
Game Year Team Name Total Attendance
2000 Baltimore Ravens 1062373
2001 New England Patriots 977717
2002 Tampa Bay Buccaneers 1044920
2003 New England Patriots 1127515
2004 New England Patriots 1108210
2005 Pittsburgh Steelers 1048739
2006 Indianapolis Colts 1028634
2007 New York Giants 1197501
2008 Pittsburgh Steelers 1076998
2009 New Orleans Saints 1116009
2010 Green Bay Packers 1122668
2011 New York Giants 1232157
2012 Baltimore Ravens 1109404
2013 Seattle Seahawks 1094133
2014 New England Patriots 1111983
2015 Denver Broncos 1141878
2016 New England Patriots 1094772
2017 Philadelphia Eagles 1117141
2018 New England Patriots 1068530
2019 Kansas City Chiefs 1115771
# This is a for loop to create a list to get a list of team names and the corresponding attendance data
  team <- list()
  next_year <- list()
  for (x in unique(sbwinner$home_team)) {
    temp <- sbwinner[sbwinner$home_team == x,]
    for (y in unique(temp$game_year)) {
      team <- c(team, x)
      y = y + 1
      next_year <- c(next_year, y)
    }
  }
  
# We put these 2 lists into a tibble df using dplyr and inner joined the new list with the nfl_cleaned_data to eventually find the attendance data
  next_year_list <- tibble(next_year = as.numeric(next_year), team = as.character(team)) %>%
    arrange(next_year)
  next_sbwinner <- right_join(nfl_cleaned_data, next_year_list, by =c("game_year"="next_year", "home_team"="team")) %>%
    group_by(game_year, home_team) %>%
    select(next_game_year = game_year, alt_home_team = home_team, total_attendance) %>%
    summarise(next_attendance = mean(total_attendance, na.rm = TRUE))
  
# Now we can combine the sbwinner and next_sbwinner together to get a consolidated table to show the differences
  sbwinner_combine <- tibble(sbwinner, next_sbwinner) %>%
    select(-next_game_year, -alt_home_team)
  view(sbwinner_combine)
  
# We plotted the sbwinner_combine graph as well
  sbwinner_graph <- ggplot(data = sbwinner_combine) +
    geom_col(aes(x=game_year, y=attendance/1000000, alpha=0.3, fill=home_team), show.legend = FALSE) +
    geom_line(aes(x=game_year, y=next_attendance/1000000)) +
    geom_point(aes(x=game_year, y=next_attendance/1000000), size=4) +
    geom_text(aes(x=game_year, y=attendance/1000000, label=home_team), angle=90, size=3, hjust=2) +
    annotate("text", x=2016, y=1.3, size = 3, label="Note: The dots represent the year subsequent
             to winning the Superbowl") +
    labs(x="Year", y="Total Attendance") +
    scale_x_continuous(breaks = seq(2000, 2019, 1)) +
    scale_y_continuous(breaks = seq(0.0, 1.5, 0.1)) +
    ggtitle("Superbowl Winners Attendance From 2000 to 2020") +
    theme_classic()+ theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
  sbwinner_graph

New England Patriots
As we can see, in 2001, the Patriots had the lowest regular season attendance until they won the Superbowl and attendance suddenly skyrocketed. Then people started expecting the Patriots to win again next year and they did. Back then, they did not think the Patriots could win 2 years in a row so attendance fell a little in 2004, after which attendance fluctuated for a while until they won again in 2014. We can see the rebound in the subsequent years after they won each game. But we can definitely see a gradual decline since 2015.

# So we selected New England Patriots' attendance data for every year because they are a special case
  nep_attendence <- nfl_cleaned_data %>%
    filter(home_team_name == "Patriots") %>%
    group_by(game_year,superbowl_winner) %>%
    summarize(attendance = mean(total_attendance, na.rm = TRUE))

# We filtered out the years where New England Patriots won a Superbowl
  nep_superbowl<- nep_attendence %>%
    filter(superbowl_winner == 1)

# We built a line chart with emphasis on the Patriot's winning years
  nep_trends <- ggplot(nep_attendence, aes(x=game_year, y=attendance/1000000)) +
    geom_line() +
    geom_point(data=nep_superbowl, aes(x=game_year, y=attendance/1000000),color= 'red', size=3) +
    geom_text(data=nep_superbowl, aes(label=attendance), size = 3, vjust = 2) +
    annotate("text", x=2016, y=1.14, label="Note: The dots represent the year the Patriots won the Superbowl") +
    labs(x="Year",y= "Total Attendance (Millions)", title="New England Patriots Attendance From 2000 - 2019") +
    scale_x_continuous(breaks = seq(2000, 2019, 1)) +
    theme_classic()+ theme(axis.text.x = element_text(angle = 90, vjust = 0.45))
  nep_trends

Division Analyisis

We are wondering if advancing to the Division games could increase regular attendance the following year.

# We got the teams attendance data for each year
  teams_attendence <- nfl_cleaned_data %>%
    group_by(game_year, home_team) %>%
    summarize(attendance = mean(total_attendance, na.rm = TRUE))
# We got the team names that entered division games
  division_teams <- nfl_cleaned_data %>%
    filter(week =="Division") %>%
    group_by(game_year)%>%
    select(home_team)
# We got team attendance of every team that entered division games
  division_teams_attendance<- left_join(division_teams,teams_attendence,by=c("game_year"="game_year","home_team"="home_team"))

# We used another for loop for the same purpose as last time 
  teamD <- list()
  next_yearD <- list()
  for (x in unique(division_teams_attendance$home_team)) {
    temp <- division_teams_attendance[division_teams_attendance$home_team == x,]
    for (y in unique(temp$game_year)) {
      teamD <- c(teamD, x)
      y = y + 1
      next_yearD <- c(next_yearD, y)
    }
  }
  
# We put these 2 lists into a tibble df using dplyr and inner joined the new list with the nfl_cleaned_data to eventually find the attendance data
  next_year_listD <- tibble(next_year = as.numeric(next_yearD), team = as.character(teamD)) %>%
    arrange(next_yearD)
  
# We got team next year attendance that entered division games
  division_team_attendance1<-left_join(next_year_listD,nfl_cleaned_data,by= c("team"="home_team","next_year"="game_year"))%>%
    group_by(next_year,team)%>%
    summarise(next_attendance=mean(total_attendance,na.rm = TRUE))

# We got team attendance of this year and next year
  division_combine <- tibble(division_teams_attendance, division_team_attendance1) %>%
    select(-team)

# We melted divison team attendance 
  data.m <- melt(division_combine, id=c("game_year", "home_team"), measure.vars = c("attendance", "next_attendance"))
  
# And got bar chart for division team
  division_combine_trend<-
    ggplot(data.m, aes(x = home_team, y = value/1000000)) +   
    geom_bar(aes(fill = variable), stat="identity", width=0.5, position = position_dodge(width=0.5))+
    theme_classic()+
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5))+
    scale_fill_manual(values=c("#999999", "#E69F00"))+
    labs(title = "Entering the Divison Games on Team Attendance Trends", x ="Team Name",y = "Total Attendance (Millions)")
division_combine_trend  

According to this plot, as we can see, the teams that entered the division games did not see a great increase in their total attendance for the following year. What interested us was that for the Dallas Cowboys, after entering the Division games, their attendance for the following year decreased quite a lot. We do not know the exact reason why but we were wondering if it had anything to do with their long-time drought for a team that only entered the division games twice in the past 20 years but considered themselves America’s team.

     


Limitations

  • In order to understand the impact of NFL game attendance on the revenue generated, we needed to have a revenue data set from the attendances broken down by the team and game played. We could not find that information available to the public.

  • The three data sets that we had, had no unique ids, thus making it hard to joining into one data set without creating many records with NA values.

  • Also, the team had to formulate the problem statement for the project. This meant that we had to limit the questions only to those that can be answered by the available data. Thus, the team had to do an initial data exploration prior to setting up the problem statement and make our own assumptions based on our limited knowledge in football.