Progression of NFL Attendance Across a Season

Introduction

Project Motivation

MetLife Stadium January 2014

Football in the United States has one of the most unique spectating experiences of all professional North American sports. College football flaunts 8 stadiums with attendance capacities greater than 100,000 spectators. All 8 of these are amongst the top 10 largest stadiums in the world,according to data compiled by Wikipedia (not classifying motor sport venues as stadiums).

Looking specifically at the professional sporting industry, the National Football League (NFL) has continued to dominate the world in per game attendances of recent decades. 2007-2016 data from sportingintelligence.com compares annual attendances of various domestic sports leagues across the globe. The NFL has annually posted average per game attendances between ~67,000 - ~69,000, which is consistently 22,000 - 25,000 attendees higher than the next highest league: German Bundesliga (soccer). It is more than double the next highest attendance of US-based leagues (Major League Baseball (MLB) ~30,000 per game).

Further data published on GitHub by Thomas Mock & Chris Stehlik, shows the NFL has continued these trends of high attendance into 2019.

One of the most impressive things about these consistently high numbers, is that they have prevailed in a league that has played a majority of its games outdoors, and many of these games continue in a number of harsh conditions that come with playing during the fall and winter months: rain, snow,extreme cold, etc. It is one of the only major sports in the US that plays outdoors through inclement weather. The MLB plays during warmer summer months and cancels for rain. MLS also plays majority of their season in warmer months. NBA and NHL are entirely indoor.

NFL Total Dome vs. Outdoor Games (Regular Season 2000-2019)
roof	Count	pct_outdoor
dome	1157	0.23
outdoors	3945	0.77

The question is: does weather significantly effect attendance? How does this and success of the team combine to effect attendance?

This type of study has become more relevant in recent years, as high-quality TV viewing options become more available with the advancement of video quality and instant media sharing.These can serve as viable replacement options of attending games. Also, many of the new stadiums are now being built with retractable roofs allowing for indoor play, making it possible that the sport could be moving away from outdoor play in the future.

Analytic Proposal

Problem Statement

Does temperature and weather(rain/snow) significantly effect attendance for all teams, winning teams or losing teams? How does it compare to other predictors involved?

Output to Consumer

Quantify the effect of weather and success on attendance to help teams understand how to maximize return on stadium projects, ticket sales, promotions and overall attendance revenue.

Questions

This analysis will seek to answer the following questions:

What do total attendance trends look like by week? by month?
- particular focus on the back-half of season
How does the trend change when looking at it for contending teams with higher % wins than losses?
How do indoor games compare to outdoor, especially in later weeks? +overall, winning teams, losing teams?
Using weekly attendance as a response variable, how strong a predictor is temperature +add in win % of involved teams?
Other independent variables (ie. Points-For, Offensive-Rank, etc.) How do these work as predictors towards attendance?

Approach

Will utilize game-by-game data for multiple recent seasons of NFL play. This will require joining together weather and attendance-based datasets.

Upon cleaning and merging this data, will begin getting an idea of the time-wise trends across each week of a season without subsetting variables. This will be done through basic visuals and regression analyses, and will give an idea of how attendance trends across a season regardless of attention to other variables.

will begin to expand these analyses to subset the data to look at different factions of the data. For example: teams with higher win % (above .500) vs. lower win % (below .500). Also, teams playing indoors vs. outdoors by week.

Will further mix in regression analyses including specific weather data: game temperature, precipitation/none.

Packages

Packages used in this analysis are read in below.

# Packages to use ----------------------------------------------------
library(tidyverse)
library(readr)
library(kableExtra)
library(nflfastR)
library(lubridate)
library(teamcolors)
library(ggimage)
library(gridExtra)

Package Descriptions

tidyverse- Includes various sub-packages to help with cleaning and manipulating the data. The core sub-packges include:
- dplyr- For data-frame editing and manipulation
- ggplot2- For graphical visualization of data
- tibble- For clean and efficient creation and manipulation of data frames.
readr- To import the .csv data and export sub-data created from nflfastR
kableExtra- For neat outputting of data tables to HTML format
nflfastR- For assisting with importing the play-by-play data that will be condensed and truncated to feed weather data into the final datasets
lubridate- Extenstion of tidyverse for date manipulation
teamcolors- Expansive package of team color and logo data that is an extension of nflfastr
ggimage- For use with ggplot2 to assist in adding graphics to a plot
gridExtra- For further combining plots into single grids

Data Preparation

Data Description

Attendance

Thomas Mock & Chris Stehlik’s GitHub compilation will be used to extract attendance and basic data by game as well as season results. these sets pull from ProFootballReference.com.

The motivation of the data is to provide three different game sets that can be used to bridge together attendance and other factors like offensive/defensive performance, wins, losses, etc. It can be looked at as a whole or at a weekly level.

The sets include:

Attendance- gives total and weekly attendance by team and year from 2000-2019
Standings- gives season results by team and year. This includes wins, losses, points scored(for/against),strength of schedule, etc.
Games- gives basic data for each game from 2000-2019 such as: teams involved (home/away),date, score,winner,loser,etc.

Weather

nflfastr play-by-play data will be manipulated to extract temperature and weather values by unique game for 2000-2019.

Normally this data is used to analyze valuation and expected return from different decisions based on historic information of the play type. This analysis will only use a very small subset of this data to be joined with attendance.

This dataset will include game_id, teams involved, week number, and:

temp- temperature in Farhenheit for the game
roof- Indicates if the game was played outdoors, in dome, in a closed roof, in an open roof +Indoor games (dome and closed roof) will result in NA weather factors
weather_type- A cleaned version of the weather description that can indicate rain,snow,clouds,etc.

Attendance Import

The below code will import the GitHub attendance,standings,games datasets:

setwd('C:/Users/tgobi/OneDrive/MSBANA/Fall2020/Data_Wrangling/Project_Options')
attendance <- read_csv('attendance.csv')
standings <- read_csv('standings.csv')
games <- read_csv('games.csv')

Beginning with attendance the head() function is used to give a quick visual of the data. Upon investigating, some cleaning will be done to merge the team information (team and team_name) with year and week to create a unique ID for team.

This dataset can also be broken into two sets. One that looks solely at weekly attendance, and another that removes the weekly aspect and just selects distinct annual totals.

It is also noticed, that NA values appear during the BYE week in which the team does not play that week.

weekly_attendance as well as home and away will be pulled from this dataset

head(attendance) %>% kbl(caption='View of Attendance Data') %>% kable_styling()

View of Attendance Data
team	team_name	year	total	home	away	week	weekly_attendance
Arizona	Cardinals	2000	893926	387475	506451	1	77434
Arizona	Cardinals	2000	893926	387475	506451	2	66009
Arizona	Cardinals	2000	893926	387475	506451	3	NA
Arizona	Cardinals	2000	893926	387475	506451	4	71801
Arizona	Cardinals	2000	893926	387475	506451	5	66985
Arizona	Cardinals	2000	893926	387475	506451	6	44296

looking at head() for standings below. this analysis will primarily use the ‘points_for’ and ‘points_against’ metrics the most.

playoffs can be converted to a binary character. Same with Super Bowl; however, it is not useful for the analysis being performed.

head(standings) %>% kbl(caption='View of Standings Data') %>% kable_styling()

View of Standings Data
team	team_name	year	wins	loss	points_for	points_against	points_differential	margin_of_victory	strength_of_schedule	simple_rating	offensive_ranking	defensive_ranking	playoffs	sb_winner
Miami	Dolphins	2000	11	5	323	226	97	6.1	1.0	7.1	0.0	7.1	Playoffs	No Superbowl
Indianapolis	Colts	2000	10	6	429	326	103	6.4	1.5	7.9	7.1	0.8	Playoffs	No Superbowl
New York	Jets	2000	9	7	321	321	0	0.0	3.5	3.5	1.4	2.2	No Playoffs	No Superbowl
Buffalo	Bills	2000	8	8	315	350	-35	-2.2	2.2	0.0	0.5	-0.5	No Playoffs	No Superbowl
New England	Patriots	2000	5	11	276	338	-62	-3.9	1.4	-2.5	-2.7	0.2	No Playoffs	No Superbowl
Tennessee	Titans	2000	13	3	346	191	155	9.7	-1.3	8.3	1.5	6.8	Playoffs	No Superbowl

The games dataset has very useful information that will be used to join all other information. it first must be reformatted to give cumulative win information. It should also be sorted into home and away team datasets that will be rejoined later.

head(games) %>% kbl(caption='View of Games Data') %>% kable_styling()

View of Games Data
year	week	home_team	away_team	winner	tie	day	date	time	pts_win	pts_loss	yds_win	turnovers_win	yds_loss	turnovers_loss	home_team_name	home_team_city	away_team_name	away_team_city
2000	1	Minnesota Vikings	Chicago Bears	Minnesota Vikings	NA	Sun	September 3	13:00:00	30	27	374	1	425	1	Vikings	Minnesota	Bears	Chicago
2000	1	Kansas City Chiefs	Indianapolis Colts	Indianapolis Colts	NA	Sun	September 3	13:00:00	27	14	386	2	280	1	Chiefs	Kansas City	Colts	Indianapolis
2000	1	Washington Redskins	Carolina Panthers	Washington Redskins	NA	Sun	September 3	13:01:00	20	17	396	0	236	1	Redskins	Washington	Panthers	Carolina
2000	1	Atlanta Falcons	San Francisco 49ers	Atlanta Falcons	NA	Sun	September 3	13:02:00	36	28	359	1	339	1	Falcons	Atlanta	49ers	San Francisco
2000	1	Pittsburgh Steelers	Baltimore Ravens	Baltimore Ravens	NA	Sun	September 3	13:02:00	16	0	336	0	223	1	Steelers	Pittsburgh	Ravens	Baltimore
2000	1	Cleveland Browns	Jacksonville Jaguars	Jacksonville Jaguars	NA	Sun	September 3	13:02:00	27	7	398	0	249	1	Browns	Cleveland	Jaguars	Jacksonville

Weather Import

As mentioned earlier, the weather data will be a small subset of a raw dataset. The below code imports the data from GitHub. The code is then manipulated to pull game_id,year,home_team,away_team,season_type,week,weather,temp,roof as the important weather variables.

It then takes unique game values of this weather data selected, then extracts the basic weather description, and filters to only look at ‘regular season’ games.

Given the slow loading speed, the truncated version has been exported to a csv file that will be referenced.

# nflfastr weather opt ----------------------------------------------------

seasons <- 2000:2019
pbp <- purrr::map_df(seasons, function(x) {
  readRDS(
    url(
      glue::glue("https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_{x}.rds")
    )
  )
})

pbp_weath <- pbp %>% 
  mutate(year=as.integer(substr(game_id,1,4))) %>% 
  select(game_id,year,home_team,away_team,season_type,week,weather,temp,roof) %>% 
  unique() %>% 
  mutate(weath_type=substr(weather,1,str_locate(weather,'Temp:')-1)) %>% 
  filter(season_type=='REG') %>% 
  select(-weather)
  ###separate(weather,c('status','temp'),sep='temp')
#write_csv(pbp_weath,path='C:/Users/tgobi/OneDrive/MSBANA/Fall2020/Data_Wrangling/Project_Options/nflweather.csv')

pbp_weath <- read_csv('C:/Users/tgobi/OneDrive/MSBANA/Fall2020/Data_Wrangling/Project_Options/nflweather.csv',
                      col_types=list(col_character(),col_double(),col_character(),col_character(),col_character(),col_double(),col_double(),col_character(),col_character()))

Data Cleaning

Annual Attendance Data Separated

First step is to create a separate table that will allow for the separation of full-year attendances and weekly attendances from the attendance dataset. This will become tot_att

The below code will:

Merge the game info into a single game_id
Unite the team and team_name into a single team_id
Remove weekly_attendance and the other weekly identifiers
Change the names to be more descriptive that they are annual totals
group the results by year and team
select only unique values (should only have 1 observation per team per year)

The result is the below table that only displays the total annual attendance. The weekly attendance will be used seperately.

attendance <- 
  attendance %>% 
  unite(c(team,year,week),col=game_id,remove=FALSE) %>% 
  unite(c(team,team_name),col=team_id,remove=F,sep=" ")

tot_att <- attendance %>% 
  select(-weekly_attendance,-week,-game_id) %>% 
  rename(tot_season_att=total,tot_home_att=home,tot_away_att=away) %>% 
  group_by(year,team_id) %>% 
  unique(.)

head(tot_att) %>% kbl() %>% kable_styling(full_width = F)

team_id	team	team_name	year	tot_season_att	tot_home_att	tot_away_att
Arizona Cardinals	Arizona	Cardinals	2000	893926	387475	506451
Atlanta Falcons	Atlanta	Falcons	2000	964579	422814	541765
Baltimore Ravens	Baltimore	Ravens	2000	1062373	551695	510678
Buffalo Bills	Buffalo	Bills	2000	1098587	560695	537892
Carolina Panthers	Carolina	Panthers	2000	1095192	583489	511703
Chicago Bears	Chicago	Bears	2000	1080684	535552	545132

Given that this table now should have a unique count of each team by year, it is a good measure to help point out which years in the dataset may not have read all the teams.

The below code will take the tally grouped by year. Given that there are currently 32 teams in the league, it is expected that all years should have 32; however, the below table shows that 2000 and 2001 had 31 teams. This is due to the expansion of adding the Houston Texans in 2002. It is worth noting this, but not of any immeadiate concern.

  tot_att %>% group_by(year) %>% tally()%>% filter(n!=32) %>% kbl(caption='Unique Teams by Year') %>% kable_styling(full_width = F)

Unique Teams by Year
year	n
2000	31
2001	31

Breaking Game out into Home and Away Sets

The next step is to take the game dataset and break them into each of the home teams and away teams to ensure that all games for each team are captured when joining them into a master dataset.

Before beginning this. It must be ensured that only regular season data is being pulled. the week number variable in game has NA values when it is considered a playoff game. These are filtered and changed to a numeric datatype below:

games <- games %>% mutate(week=as.double(week)) %>% filter(.,!is.na(week))

The following code creates unique team_id’s using the home and away team fields:

It pulls year, week, date, both teams, the winning team and tie
each team_id of focus will be the home_team in home_games and away_team in away_games
The opponent is classified as the opposite (away_team for home_games and home_team for away_games)
if the home_team won in home_games then the binary variable of 1 is assigned to say the home team won (same if tie)
if the away_team won in away_games then the binary variable of 1 is assigned to say the away team won (same if tie)

home_games <- games %>%  select(year,week,date,home_team,away_team,winner,tie) %>%
  rename(team_id=home_team,home_opponent=away_team,home_winner=winner,home_date=date) %>% 
  mutate(home_win=ifelse(team_id==home_winner,1,0),home_tie=ifelse(!is.na(tie),1,0))

away_games <- games %>%  select(year,week,date,away_team,home_team,winner,tie) %>%
  rename(team_id=away_team,away_opponent=home_team,away_winner=winner,away_date=date) %>% 
  mutate(away_win=ifelse(team_id==away_winner,1,0),away_tie=ifelse(!is.na(tie),1,0))

Joining the Broken-Out Game Sets with Attendance Set

Taking the home_games and away_games datasets, the next step is to join their values into the attendance dataset.

The below code creates a join table:

left joins attendance with home_games,away_games,tot_att on the year,week and team identifiers
now that the data is joined and organized by team, the two opponent fields home_opponent and away_opponent can be merged into a single opponent field as long as the team is not on BYE which would be NA
Removes the old opponent fields

join <- attendance %>%  
  left_join(.,home_games,by=c('year','week','team_id')) %>% 
  left_join(.,away_games,by=c('year','week','team_id')) %>% 
  left_join(.,tot_att,by=c('year','team_id')) %>% 
  mutate(opponent=ifelse(is.na(home_opponent),away_opponent,home_opponent)) %>% 
  select(-home_opponent,-away_opponent)

The join table is now used as a index to feed into the main data table built below.

Here are the steps:

mutate the home and away date fields and merge them on whichever is not NA (missing from join)
Create another column to classify month seperately from date for broad classification of time
Add a classification variable to whether the team in team_id is home or away given whichever date is binary win variable is not NA +if both are NA then the team is on BYE and will remain NA
select variables of interest (see next tab of report for details)
aggregate the win and tie variables on whichever is not NA to make a final classification of if game was win or tie
if the win and tie variables are 0 and a weekly_attendance exists (team not on BYE) then a loss=1 binary
grouping by the team_id and year keys, create a dynamic running sum of teams win,losses and ties WTD on season
Create a totwins and totloss variable by year to indicate the ending record for the team that year
Determine a running winning % win_pct of the team by week using the (cumulative wins)/(sum cumulative outcomes)
Create two variables that calculate the percentage of the full season attendance the single game accounts for

by_game <- join %>% 
  mutate(date=ifelse(is.na(home_date),away_date,home_date)) %>% 
  mutate(date=as.Date(date,"%B %d")) %>% mutate(month=format(date,"%m")) %>% #mutate(month=as.integer(month)) %>% 
  mutate(home_away=ifelse(is.na(date),NA,ifelse(is.na(home_win),'away','home'))) %>% 
  
  select(team_id,year,week,date,month,opponent,home_away,weekly_attendance,tot_season_att,tot_home_att,tot_away_att,home_win,home_tie,away_win,away_tie) %>% 
  rowwise() %>% mutate(win=sum(home_win,away_win,na.rm=T),tie=sum(home_tie,away_tie,na.rm=T)) %>% 
  rowwise() %>% mutate(loss=ifelse(win==0 & tie==0 & !is.na(weekly_attendance),1,0)) %>% 
  group_by(team_id,year) %>% mutate(cumwins=cumsum(win),cumloss=cumsum(loss),cumtie=cumsum(tie)) %>% 
  mutate(totwins=sum(win),totloss=sum(loss)) %>% 
  rowwise() %>% mutate(win_pct=cumwins/sum(cumwins,cumloss,cumtie)) %>%
  rowwise() %>% mutate(pct_home=ifelse(home_away=='home',weekly_attendance/tot_home_att,NA)) %>% 
  rowwise() %>% mutate(pct_away=ifelse(home_away=='away',weekly_attendance/tot_away_att,NA)) %>%
  select(-home_win,-home_tie,-away_win,-away_tie)

Final Datasets

Below are the head() views of of the final datasets that will be joined:

by_game:

team_id (chr): Team name and mascot (unique identifier)
year (dbl): Year the game was played (2000-2019)
week (dbl) : Week of the season in which the game occurred (1-17 for the 17 wks. of NFL season)
date (date): Date of the game
month (chr): Month of the game
opponent (chr): Opponent that the team (team_id) played in the given week
home_away (chr): Was the team the home or away team? (if NA then on BYE)
weekly_attendance (dbl): The attendance for the game that week
tot_season_att (dbl): Total attendance the team played in front of across all 16 games
tot_home_att (dbl): Total attendance the team played in front of for all home games
tot_away_att (dbl): Total attendance the team played in front of for all away games
win (dbl): binary, was the game a win (1 or 0)
tie (dbl): binary, was the game a tie (1 or 0)
loss (dbl): binary, was the game a loss (1 or 0)
cumwins (dbl): cumulative wins for the team on the season after the given week has concluded
cumloss (dbl): cumulative losses for the team on the season after the given week has concluded
cumtie (dbl): cumulative ties for the team on the season after the given week has concluded
totwins (dbl): total wins the team had for the season
totloss (dbl): total losses the team had for the season
win_pct (dbl): cumulative % of games won as percent of total games played through the given week
pct_home (dbl): percent of total season-long home attendance accounted for by the specific game
pct_away (dbl): percent of total season-long away attendance accounted for by the specific game

head(by_game) %>% kbl() %>% kable_styling()

team_id	year	week	date	month	opponent	home_away	weekly_attendance	tot_season_att	tot_home_att	tot_away_att	win	loss	cumwins	cumloss	totwins	totloss	win_pct	pct_home	pct_away
Arizona Cardinals	2000	1	2020-09-03	09	New York Giants	away	77434	893926	387475	506451	0	1	0	1	3	13	0.0000000	NA	0.1528953
Arizona Cardinals	2000	2	2020-09-10	09	Dallas Cowboys	home	66009	893926	387475	506451	1	0	1	1	3	13	0.5000000	0.1703568	NA
Arizona Cardinals	2000	3	NA	NA	NA	NA	NA	893926	387475	506451	0	0	1	1	3	13	0.5000000	NA	NA
Arizona Cardinals	2000	4	2020-09-24	09	Green Bay Packers	home	71801	893926	387475	506451	0	1	1	2	3	13	0.3333333	0.1853049	NA
Arizona Cardinals	2000	5	2020-10-01	10	San Francisco 49ers	away	66985	893926	387475	506451	0	1	1	3	3	13	0.2500000	NA	0.1322635
Arizona Cardinals	2000	6	2020-10-08	10	Cleveland Browns	home	44296	893926	387475	506451	1	0	2	3	3	13	0.4000000	0.1143196	NA

pbp_weather:

game_id (chr): the unique identifier for game (year_week_team1_team2)
year (int): year of the game
home_team (chr): abbreviation ID for home team
away_team (chr): abbreviation ID for away team
season_type (chr): ID indicating the game type (all should be ‘REG’ given it’s already filtered for regular season)
week (int): week number for the game in the given year
temp (int): temperature in Fahrenheit for the game
roof (chr): classifier of if the game was played indoors or outdoors +dome=permenant indoor stadium +outdoors=outdoor stadium +open=retractable roof was open to elements (can be simplified to ‘outdoors’ if needed) +closed=retractable roof was closed to the elements (can be simplified to ‘dome’ for indoor classification)
weath_type (chr): Simple description of the weather (participation in particular)

head(pbp_weath) %>% kbl() %>% kable_styling()

game_id	year	home_team	away_team	season_type	week	temp	roof	weath_type
2000_01_ARI_NYG	2000	NYG	ARI	REG	1	80	outdoors	NA
2000_01_BAL_PIT	2000	PIT	BAL	REG	1	74	outdoors	NA
2000_01_CAR_WAS	2000	WAS	CAR	REG	1	80	outdoors	NA
2000_01_CHI_MIN	2000	MIN	CHI	REG	1	NA	dome	NA
2000_01_DEN_STL	2000	LA	DEN	REG	1	NA	dome	NA
2000_01_DET_NO	2000	NO	DET	REG	1	NA	dome	NA

The final cleaning will need to come in the form of matching the team_id variables to do a final join that will help with analysis.

The type_weath from pbp_weath will also need to be string searched for this to be used as a reasonable character variable.

Merging Data

Adding Visual Logo Data

To add clarity to visuals, below code will add team logos and colors:

logo <- select(teams_colors_logos,team_abbr,team_name,team_logo_wikipedia,team_logo_espn)
team_colors <- teamcolors %>% filter(league=='nfl')

Join weather and logos

Further, to make plotting and summarizing easier, the below code will join together the pbp_weath and logos datasets with the by_game to create a final_data set that has each team and their games (home and away) and all possible info for those games.

Warning for this set: it is not unique and will need to likely need distinct() or filter() operations to just look at games uniquely

## add team logo and abbreviation info (abbr is needed to create id to join to weather)
by_game[which(str_detect(by_game$team_id,'Washington Redskins')==TRUE),'team_id'] <- 'Washington Football Team'
by_game[which(str_detect(by_game$opponent,'Washington Redskins')==TRUE),'opponent'] <- 'Washington Football Team'
team_colors[which(str_detect(team_colors$name,'Washington Redskins')==TRUE),'name'] <- 'Washington Football Team'


by_game <- by_game %>% left_join(logo,by=c('team_id'='team_name')) %>% left_join(logo,by=c('opponent'='team_name'))
 names(by_game) <- names(by_game) %>% 
  str_replace_all(string=.,pattern='\\.x$',replacement='') %>% 
  str_replace_all(string=.,pattern='\\.y$',replacement='_opp')

 
 
 ## Creating the game_id key to be used to join accurately with weather data
 by_game <- by_game %>% 
  mutate(game_id=str_c(year,sprintf('%02d',week)
                     ,ifelse(home_away=='away'
                             ,team_abbr,team_abbr_opp)
                     ,ifelse(home_away=='home',team_abbr,team_abbr_opp),sep='_'))

##joining weather and game/attendance data into final dataset
final_data <- by_game %>% 
  left_join(pbp_weath,by=c('year','week','game_id')) %>% 
  select(-season_type)

## adjusting date to scale January to be viewed at end of season
final_data$date[which(final_data$date<='2020-01-31')] <- final_data$date[which(final_data$date<='2020-01-31')] %>% 
  ymd(.) +years(1)

Classify Weather Strings

Visualizing the weather string data, there are large inconsistencies with how a weather event is reported. Stringr extension of tidyverse will be used to detect string values that indicate anything like:

snow
rain
shower

and mutate it to a single 2-factor classification:

Precipitation - If it is detected the string has some form of snow or rain
none- Otherwise is the case

the regex function allows for us to ignore the case of the string. Also, there are strings in which there is Zero Chance or No Chance when referring to rain. Those will be made exceptions to be removed in the case_when

final_data <- final_data %>% mutate(precipitation= case_when(
  str_detect(string=weath_type,pattern=regex('rain|snow|shower',ignore_case = T))==TRUE & 
    str_detect(string=weath_type,pattern=regex('Zero|(?<!s)No',ignore_case=T))!=TRUE ~'Precipitation',
  TRUE ~'none'
)
)

Data Exploration

Outdoor Splits

Amount of outdoor games is given in the table below. since 2000, they have averaged 77%.

The second table visualizes this trend by year to confirm that the annual samples remain relatively consistent in terms of the number of outdoor games being looked at:

ct_table <- pbp_weath %>% 
  mutate(roof=replace(roof,roof=='closed','dome'),roof=replace(roof,roof=='open','outdoors')) %>% 
  group_by(roof) %>%  tally(name='Count') %>% mutate(Count=as.numeric(Count)) %>% 
  mutate(pct_outdoor=round(Count/sum(Count),2))

ct_table_2 <- pbp_weath %>% 
mutate(roof=replace(roof,roof=='closed','dome'),roof=replace(roof,roof=='open','outdoors')) %>% 
group_by(year,roof) %>%  tally(name='Count') %>% mutate(Count=as.numeric(Count)) %>% spread(roof,Count) %>% 
  rowwise() %>% mutate(pct_outdoor=round(outdoors/sum(dome,outdoors),2))
  

#ct_table <- ct_table %>% 
#rbind(c('Total',sum(ct_table$Count),sum(ct_table$pct))) %>% mutate(Count=as.numeric(Count))

ct_table %>% 
  kbl(caption='Dome vs. outdoor games 2000-2019') %>% 
  kable_styling('hover',full_width=F)

Dome vs. outdoor games 2000-2019
roof	Count	pct_outdoor
dome	1157	0.23
outdoors	3945	0.77

ct_table_2 %>% 
  kbl(caption='Dome vs. Outdoor by Year') %>% 
  kable_styling('hover',full_width = F)

Dome vs. Outdoor by Year
year	dome	outdoors	pct_outdoor
2000	48	198	0.80
2001	48	200	0.81
2002	51	205	0.80
2003	53	203	0.79
2004	51	205	0.80
2005	46	210	0.82
2006	58	198	0.77
2007	58	198	0.77
2008	50	206	0.80
2009	62	194	0.76
2010	65	191	0.75
2011	68	188	0.73
2012	68	188	0.73
2013	70	186	0.73
2014	61	195	0.76
2015	60	196	0.77
2016	62	194	0.76
2017	59	197	0.77
2018	60	196	0.77
2019	59	197	0.77

Graphical Summaries

Temperature Trends

To begin analysis of how weather effects attendance, we will first look at the weekly trends for each.

Below gives basic scatter plot and trend of weekly attendance by week:

pbp_weath %>% 
  ggplot(.,aes(x=week,y=temp))+
  geom_point(alpha=.5)+
  geom_smooth()+
  ggtitle(label='Temperature Trends Across a Season',subtitle='All Games 2000-2019')+
  scale_y_continuous(name='Game Temperature')+
  scale_x_continuous(name='NFL Week Number')

It is clear, given the trend line, that there is a substantial downward trend in temperature of outdoor games as the season progresses (as is expected given season in which the sport is played).

Attendance Trends

For comparison across the same time period, two graphs below visualize how attendance trends across a season.

The first is a general scatter for attendances by date from 2000-2019 The second takes the same plot and overlays it by type of stadium determined by roof variable:

final_data %>% distinct(game_id,.keep_all = T) %>% 
  ggplot(.,aes(x=date,y=weekly_attendance))+ 
  geom_point(alpha=.8,color='LightBlue')+
  geom_smooth(color='Blue')+
  scale_y_continuous(name='Weekly Attendance',labels=scales::comma)+
  scale_x_date(name='Date of Game')+
  ggtitle(label='Attendance Trends Across a Season',subtitle='All Games 2000-2019')

final_data %>% distinct(game_id,.keep_all = T) %>% filter(!is.na(roof)) %>% 
  ggplot(.,aes(x=date,y=weekly_attendance,fill=roof))+ 
  geom_point(aes(color=roof),alpha=.2)+
  geom_smooth(color='DarkBlue',method='loess')+
  #stat_summary(fun.y = "mean", geom = "point", size = 3, aes(color = roof), shape = 21, color = "white", stroke = 1,alpha=.6)+
  facet_grid(~roof)+
  scale_y_continuous(name='Weekly Attendance',labels=scales::comma)+
  scale_x_date(name='Date of Game')+
  ggtitle(label='Attendance Trends Across a Season',subtitle='All Games 2000-2019 by Stadium Type')

As from prior analysis on Outdoor Splits, ~80% of all games are currently played outdoors, so clearly from the plots it has the most leverage on any manipulation done in this analysis.

Overall, this visual doesn’t give a clear indication of change in attendance across a season; however, there is some small dips in the smoothed-fit in December.

While this may be due to weather, it could also be a series of things (such as lost interest on losing teams), so further investigation into the types of teams that each of the above stadium groups contain.

Temperature vs. Attendance

Final basic summary will be a basic scatter plotting the two variables against each other.

final_data %>% distinct(game_id,.keep_all = T) %>% 
  ggplot(.,aes(x=temp,y=weekly_attendance))+
  geom_point(alpha=.8,color='lightblue')+
  geom_smooth(color='Blue')+
  scale_y_continuous(name='Weekly Attendance',labels=scales::comma)+
  scale_x_continuous(name='Temperature of Game')+
  ggtitle(label='Temperature of Game by Attendance',subtitle='All Games 2000-2019 by Outdoor Stadiums')

The plot attempting to correlate the Weekly Attendance and Temperature variables gives better context into the data. There is clearly some lower outdoor attendances even at moderate temperatures.

These outliers could be due to other differences such as stadium capacity, which has no clear way of being captured using available data. This will be explored in the next section.

Summary of Stadium Types

Boxplot Summary

To give more visual clarity into the distributions of each stadium type, we can analyze the boxplot and histograms for each group below.

final_data %>% distinct(game_id,.keep_all = T) %>%  ggplot(.,aes(y=weekly_attendance,x=roof))+
  # geom_histogram(aes(y=..density..),color='black',alpha=.5,bins=20,position='identity')+
  geom_boxplot(aes(fill=roof),alpha=.7)+
  scale_y_continuous(name='Weekly Attendance',labels=scales::comma)+
  ggtitle('Boxplot Comparison of Stadium Type Distributions')
final_data %>% distinct(game_id,.keep_all = T) %>% ggplot(.,aes(x=weekly_attendance))+
  facet_wrap(vars(roof),scales='free_y')+
  geom_histogram(color='black',fill='dodgerblue',alpha=.5,bins=20)+
  geom_vline(aes(xintercept=mean(weekly_attendance,na.rm=T)),linetype='dashed',color='red',alpha=.8,lwd=1)+
  ggtitle('Histogram of each Stadium Type')

The mean is overall highest for the closed roof stadiums.the boxplot shows that this is skewed by more attendance values that are higher than the rest of the league. Some of this could be because several stadiums of this type being relatively new and relatively higher in capacity. The overall distribution for these appears bimodal with seemingly a distribution forming around the overall mean attendance of ~67,000 , and another distribution forming about the ~80,000 range.

Comparing to the open roof classification, this distribution appears almost approximately normal about the overall mean of ~67,000 ; however, has a larger left tail in its histogram (making it left-skewed). There is further evidenced in the boxplot, with a large number of attendances below the 1.5IQR range used by the plot.

Using domain knowledge to give a hypothesis for the above observations, the Dallas Cowboys moved into AT&T Stadium with a capacity of about 80,000 in 2009. Given the relative low number of observations in closed classification, these games at 80,000 could be because of a move like this.

Further, to explain the lower open stadiums, there have been several teams playing in stadiums with lower capacities while making a move to a new one. The biggest example of this was the Los Angeles Chargers who played at Dignity Health Sports Park from 2017-2019–who’s smaller capacity of ~30,000 was meant to accommodate soccer–while awaiting the opening of SoFi Stadium this season.

The below graphs will investigate the teams in more detail:

15 seasons with the lowest average attendance by team
15 seasons with the highest average attendance by team

final_data %>% 
    #distinct(game_id,.keep_all = T) %>%  
    left_join(teams_colors_logos[,c('team_abbr','team_color')],by=c('team_abbr')) %>% 
    select(team_abbr,team_logo_wikipedia,home_away,year,team_color,weekly_attendance) %>% 
    group_by(team_abbr,team_logo_wikipedia,team_color,home_away,year) %>%
    filter(home_away=='home') %>% 
    summarise(avg_att=mean(weekly_attendance)) %>%
    arrange(avg_att) %>% ungroup() %>%  top_n(-15) %>% 
    ggplot(.,aes(x=reorder(str_c(team_abbr,year),avg_att),y=avg_att))+
    geom_col(aes(fill=team_color,color=team_color),position='identity',alpha=.75)+
    scale_color_identity(aesthetics = c('color','fill'))+
    geom_image(aes(image=team_logo_wikipedia))+
    geom_label(aes(y=0,label=year))+
    xlab('Team, Year')+
    ylab('Average Attendance Per Game')+
    scale_y_continuous(labels=scales::comma)+
    theme(
      axis.text.x=element_blank(),
      axis.ticks.x=element_blank()
    )+
    ggtitle('Bottom 15 Seasons in Terms of per Game Attendance')

As suspected, the Chargers have the 3 lowest seasons in terms of attendance since 2000. However, They were limited in capacity.

The Cowboys have 10 of the top 15 seasons in terms of attendance and all of them come after the new stadium opening in 2009.

As a result, stadium capacity can be considered the likely reason for many of the attendance outliers in the dataset.

Precipitation Analysis

Rainy/Snowy Attendance Comparison

Given that temperature appears that it may not have a significant impact on attendance, we will also look at the weather descriptions for games in which that info is available.

Weather descriptions have only been recorded in the nflfastR datasets to a limited capacity since 2013, so the data for those games is relatively limited; however, there are some important insights that indicate it having an attendance impact.

Each available string was searched for strings like ‘rain’,‘snow’,‘shower’,etc. , and for those that contained values such as this, are given a ‘Precipitation’ classification–indicating there was precipitation for the game. The classification is ‘none’ if there is no evidence from the weather description.

Below is a high-level summary of the average league attendance for games with precipitation vs. not:

  final_data  %>% filter(!is.na(weath_type) & roof%in%c('outdoors','open') & year>2014 & home_away=='home') %>%
    group_by(year,precipitation) %>% summarise(avg_att=mean(weekly_attendance,na.rm=T)) %>% ungroup() %>% 
    ggplot(aes(x=year,y=avg_att))+
    geom_line(aes(color=precipitation))+
    geom_label(aes(y=avg_att,label=round(avg_att,0)))+
    scale_y_continuous(name='Average Attendance',labels=scales::comma)+
    ggtitle('Average League Attendance for Rainy/Snowy games vs. Fair Weather')

This gives some evidence using 2015-2019 data to indicate there is a difference between rainy/snowy attendance; however, it fluctuates between which average is higher YoY. 2015,2018,2019 all have higher attendances for rainy/snowy games, but the others do not, which indicates potential for other factors skewing the information.

The best solution to see this further, is to subset the analysis by isolating it to each team.

Below graph will compare differences for each team:

  final_data %>% filter(!is.na(weath_type) & home_away=='home' & year>2014) %>%  select(precipitation,team_abbr,weekly_attendance,team_logo_wikipedia) %>% 
    group_by(team_abbr,precipitation,team_logo_wikipedia) %>% 
    summarise(avg_att=mean(weekly_attendance,na.rm=T),count=n()) %>%
    ggplot(aes(x=reorder(team_abbr,avg_att),y=avg_att))+
    geom_col(aes(x=reorder(team_abbr,avg_att),y=avg_att,fill=precipitation,group=precipitation),position='dodge',alpha=.8)+
    coord_flip()+
    geom_image(aes(y=0,image=team_logo_wikipedia,group=team_abbr),position='identity',hjust=-.5)+
    scale_y_continuous(name='Average Attendance',labels=scales::comma)+
    theme(
     # legend.position = c(1,0),
      axis.text.y=element_blank(),
      axis.title.y=element_blank(),
      axis.ticks.y=element_blank()
    )+
    ggtitle('Attendance by Team for Rainy/Snowy Games vs. Normal Games')

There appears to be some small differences in the attendance that support the claim of precipitation negatively effecting attendance. To make these margins clearer, the below barplot is made for comparison of the margins:

 final_data %>% filter(!is.na(weath_type) & home_away=='home' & year>2014) %>%  select(precipitation,team_abbr,weekly_attendance,team_logo_wikipedia) %>% 
    group_by(team_abbr,precipitation,team_logo_wikipedia) %>% 
    summarise(avg_att=mean(weekly_attendance,na.rm=T)) %>% spread(key=precipitation,value=avg_att) %>% mutate(diff=Precipitation-none) %>% 
    gather(key=var,value=val,none:diff) %>% filter(var=='diff') %>% mutate(val=replace_na(val,0)) %>% 
    ggplot(aes(x=reorder(team_abbr,val),y=val,fill=val))+
    geom_bar(stat='identity',alpha=.9,position='identity')+
    scale_color_gradient2(name='Attendance Change',low='red',mid='green',high='red',aesthetics = c('fill'),guide='colorbar')+
    scale_color_gradient2(name='Attendance Change',low='red',mid='green',high='red',aesthetics = c('color'),guide='colorbar')+
    coord_flip()+
    scale_y_continuous(name='Change in Attendance',labels=scales::comma)+
    geom_image(aes(y=0,image=team_logo_wikipedia))+
    theme(axis.title.y=element_blank(),
          axis.text.y=element_blank(),
          axis.ticks.y=element_blank(),
          legend.justification = c(1,0),
          legend.position=c(1,0))+
    ggtitle(label='Attendance Change for games with Precipitation',subtitle='Negative values indicate an average decrease')

We see that teams such as the Jaguars, Steelers and Washington have average decreases of more than 5,000 since 2015.

This is, however, not consistent. Many teams have much smaller decreases, and some exceptions of increases. For teams with little to no change, there is likely very little to no rainy/snowy weather data to compare.

Attendance by Win Pct.

Full NFL Comparison

Given that there is some small evidence of different aspects of weather effecting attendance, we will also investigate other potential causes for attendance changes at any point in a season. One of the biggest could be the interest of fans in attending a game with a winning team vs. a losing team.

Below will summarize live winning percentage for a team plotted against their weekly attendance.

  final_data %>% filter(home_away=='home') %>% select(weekly_attendance,win_pct,roof) %>% filter(!is.na(roof)) %>% 
    ggplot(aes(x=win_pct,y=weekly_attendance))+
    geom_point(alpha=.2,color='Blue',fill='blue')+
    geom_smooth(color = "DarkBlue", alpha = 0.8)+
    #stat_summary(fun.y = "mean", geom = "point", size = 3, fill='Black', shape = 21, color = "white", stroke = 1,alpha=.5) +
    scale_y_continuous(name='Weekly Attendance',labels=scales::comma)+
    scale_x_continuous(name='Winning Percentage',labels=scales::comma)+
    ggtitle('Weekly_Attendance by Winning Percentage','Data 2000-2019')

There is similarly a small trend change in the later months of the season. The higher the win percentage, the higher the up-tick in attendance.

To further visualize at a team level, the below plot shows the same scatter summary; however uses the mean attendance for each team.

  avg_overall_att <- final_data %>% ungroup() %>%  filter(home_away=='home') %>% summarise(mean(weekly_attendance)) %>% as.data.frame()
  avg_overall_win_pct <- final_data %>% ungroup() %>%  filter(home_away=='home') %>% summarise(mean(win_pct)) %>% as.data.frame()
  
  final_data %>% 
    #distinct(game_id,.keep_all = T) %>%  
    left_join(teams_colors_logos[,c('team_abbr','team_color')],by=c('team_abbr')) %>% 
    left_join(team_colors[,c('name','division')],by=c('team_id'='name')) %>% 
    select(team_id,team_abbr,team_logo_wikipedia,home_away,year,team_color,division,weekly_attendance,win_pct) %>% 
    filter(home_away=='home'& team_abbr!='NA') %>% mutate(team_abbr=replace(team_abbr,team_abbr=='LA','LAR')) %>%
    group_by(team_abbr,team_logo_wikipedia,team_color,division,home_away) %>%
    summarise(avg_att=mean(weekly_attendance,na.rm=T),avg_win_pct=mean(win_pct)) %>%
    arrange(avg_att) %>% ungroup() %>% mutate(rank=min_rank(-avg_att)) %>% 
    ggplot(.,aes(x=avg_win_pct,y=avg_att))+
    scale_color_identity(aesthetics = c('color','fill'))+
    geom_image(aes(image=team_logo_wikipedia))+
    xlab('Avg. Winning Percentage')+
    ylab('Average Attendance Per Game')+
    scale_y_continuous(labels=scales::comma)+
    geom_hline(yintercept=avg_overall_att[1,1],linetype='dashed',color='red')+
    geom_vline(xintercept=avg_overall_win_pct[1,1],linetype='dashed',color='red')+
    ggtitle('NFL Teams in Terms of per Game Attendance','Seasons 2000-2019')

Looking at the general trend for each team, it appears to be close to positively linear (with exception of Browns and 2017-2019 Chargers).

Several of these teams with above average winning tendencies are at or near their stadium capacities, so even team such as Seahawks and Steelers in the lower-right quadrant actually have very good attendance utilization as a part of winning.

This graph is helpful in showing that there is a general upward trend in attendance with winning, but there are still several factors such as capacity and regional sporting interest that this analysis does not take into account.

Summary

Problem Statement Review

To review these were two of the primary problems addressed:

Does temperature and weather(rain/snow) significantly effect attendance for all teams, winning teams or losing teams? How does it compare to other predictors involved?

This was addressed through the use of adjoining NFL datasets on 2000-2019 attendance with 2000-2019 game-by-game weather data that included temperature and condition reports.

The data is graphically analyzed to compare outdoor games to indoor games spanned across a season as weather gets harsher. Doing so,we can conclude that there are small downward trends in attendance as the season progresses for outdoor games. Attendance actually increases for indoor games. This could support a conclusion that colder weather effects attendance; however, it there are several other reasons also uncovered in analysis.

Further, temperature and precipitation were mapped directly against attendance to find if there is any direct relation. The plots indicated downward trends at the more extreme values; however, was very small for cold weather while fairly large dips for hot weather. These trends may not be consistent given the limited data it was being based on at the extreme values. Limited data indicates they are high-leverage observations that may not be fully indicative of the population.

To add further context to the analysis, a deeper-dive into the distributions by stadium type and by team is performed. This was done due to the various aspects of team and stadium that could also effect total attendance. It was also found that many variables being analyzed were, in fact, skewed by stadium capacity and teams with newer and larger stadiums with indoor options. There was not sufficient enough capacity data for all stadiums over the past 20 years to perform a more robust analysis using relative attendance (as a % utilization), so the comparisons were broken out at a team level to better visualize the team context.

Using team-level graphs, rainy/snowy games were compared to the normal weather games using 2015-2019 weather description data. Despite limited data, it was found that most teams had very little to no change in average attendance. There were however interesting exceptions of a few teams with a few thousand person difference in attendance.

It was also found that winning percentage has a positive effect on late season attendance, thus leading to some of the season-long trends as well. It is also only a small difference, but is noticable as well.

Implications to Consumer

Quantify the effect of weather and success on attendance to help teams understand how to maximize return on stadium projects, ticket sales, promotions and overall attendance revenue.

This analysis gives some basic visual analysis into the effect of various weather aspects of attendance in relation to other team-oriented aspects. This can be used as a guide towards furthering analysis into more detailed sales analysis using more robust forms of attendance.

Limitations

As already discussed above, the limitations of this model mostly lie in it’s lack of robust transformation to the attendance metrics. There are various team and stadium-related aspects in each season that need to be accounted for in stronger variables.

A next step improvement would be something as simple as scraping for stadium capacity data to begin looking at attendance as a percentage of capacity to better scale its expectations. In addition, further analysis using the more robust metrics could be performed to make statistically significant conclusions. These techniques include regression, ANOVA analysis and significance testing of the temperature and precipitation data to confirm significant insights.

Progression of NFL Attendance Across a Season

Tony G.

11/1/2020

Introduction

Project Motivation

Analytic Proposal

Problem Statement

Output to Consumer

Questions

Approach

Packages

Packages

Package Descriptions

Data Preparation

Data Description

Attendance

Weather

Attendance Import

Weather Import

Data Cleaning

Annual Attendance Data Separated

Breaking Game out into Home and Away Sets

Joining the Broken-Out Game Sets with Attendance Set

Final Datasets

by_game:

pbp_weather:

Merging Data

Adding Visual Logo Data

Join weather and logos

Classify Weather Strings

Data Exploration

Outdoor Splits

Graphical Summaries

Temperature Trends

Attendance Trends

Temperature vs. Attendance

Summary of Stadium Types

Boxplot Summary

Precipitation Analysis

Rainy/Snowy Attendance Comparison

Attendance by Win Pct.

Full NFL Comparison

Summary

Problem Statement Review

Implications to Consumer

Limitations