Introduction

Project Motivation

MetLife Stadium January 2014

Football in the United States has one of the most unique spectating experiences of all professional North American sports. College football flaunts 8 stadiums with attendance capacities greater than 100,000 spectators. All 8 of these are amongst the top 10 largest stadiums in the world,according to data compiled by Wikipedia (not classifying motor sport venues as stadiums).

Looking specifically at the professional sporting industry, the National Football League (NFL) has continued to dominate the world in per game attendances of recent decades. 2007-2016 data from sportingintelligence.com compares annual attendances of various domestic sports leagues across the globe. The NFL has annually posted average per game attendances between ~67,000 - ~69,000, which is consistently 22,000 - 25,000 attendees higher than the next highest league: German Bundesliga (soccer). It is more than double the next highest attendance of US-based leagues (Major League Baseball (MLB) ~30,000 per game).

Further data published on GitHub by Thomas Mock & Chris Stehlik, shows the NFL has continued these trends of high attendance into 2019.

One of the most impressive things about these consistently high numbers, is that they have prevailed in a league that has played a majority of its games outdoors, and many of these games continue in a number of harsh conditions that come with playing during the fall and winter months: rain, snow,extreme cold, etc. It is one of the only major sports in the US that plays outdoors through inclement weather. The MLB plays during warmer summer months and cancels for rain. MLS also plays majority of their season in warmer months. NBA and NHL are entirely indoor.

NFL Total Dome vs. Outdoor Games (Regular Season 2000-2019)
roof Count pct_outdoor
dome 1157 0.23
outdoors 3945 0.77

The question is: does weather significantly effect attendance? How does this and success of the team combine to effect attendance?

This type of study has become more relevant in recent years, as high-quality TV viewing options become more available with the advancement of video quality and instant media sharing.These can serve as viable replacement options of attending games. Also, many of the new stadiums are now being built with retractable roofs allowing for indoor play, making it possible that the sport could be moving away from outdoor play in the future.

Analytic Proposal

Problem Statement

Does temperature and weather(rain/snow) significantly effect attendance for all teams, winning teams or losing teams? How does it compare to other predictors involved?

Output to Consumer

Quantify the effect of weather and success on attendance to help teams understand how to maximize return on stadium projects, ticket sales, promotions and overall attendance revenue.

Questions

This analysis will seek to answer the following questions:

  • What do total attendance trends look like by week? by month?
    • particular focus on the back-half of season
  • How does the trend change when looking at it for contending teams with higher % wins than losses?
  • How do indoor games compare to outdoor, especially in later weeks? +overall, winning teams, losing teams?
  • Using weekly attendance as a response variable, how strong a predictor is temperature +add in win % of involved teams?
  • Other independent variables (ie. Points-For, Offensive-Rank, etc.) How do these work as predictors towards attendance?

Approach

Will utilize game-by-game data for multiple recent seasons of NFL play. This will require joining together weather and attendance-based datasets.

Upon cleaning and merging this data, will begin getting an idea of the time-wise trends across each week of a season without subsetting variables. This will be done through basic visuals and regression analyses, and will give an idea of how attendance trends across a season regardless of attention to other variables.

will begin to expand these analyses to subset the data to look at different factions of the data. For example: teams with higher win % (above .500) vs. lower win % (below .500). Also, teams playing indoors vs. outdoors by week.

Will further mix in regression analyses including specific weather data: game temperature, precipitation/none.

Packages

Packages

Packages used in this analysis are read in below.

# Packages to use ----------------------------------------------------
library(tidyverse)
library(readr)
library(kableExtra)
library(nflfastR)
library(lubridate)
library(teamcolors)
library(ggimage)
library(gridExtra)

Package Descriptions

  • tidyverse- Includes various sub-packages to help with cleaning and manipulating the data. The core sub-packges include:
    • dplyr- For data-frame editing and manipulation
    • ggplot2- For graphical visualization of data
    • tibble- For clean and efficient creation and manipulation of data frames.
  • readr- To import the .csv data and export sub-data created from nflfastR
  • kableExtra- For neat outputting of data tables to HTML format
  • nflfastR- For assisting with importing the play-by-play data that will be condensed and truncated to feed weather data into the final datasets
  • lubridate- Extenstion of tidyverse for date manipulation
  • teamcolors- Expansive package of team color and logo data that is an extension of nflfastr
  • ggimage- For use with ggplot2 to assist in adding graphics to a plot
  • gridExtra- For further combining plots into single grids

Data Preparation

Data Description

Attendance

Thomas Mock & Chris Stehlik’s GitHub compilation will be used to extract attendance and basic data by game as well as season results. these sets pull from ProFootballReference.com.

The motivation of the data is to provide three different game sets that can be used to bridge together attendance and other factors like offensive/defensive performance, wins, losses, etc. It can be looked at as a whole or at a weekly level.

The sets include:

  • Attendance- gives total and weekly attendance by team and year from 2000-2019
  • Standings- gives season results by team and year. This includes wins, losses, points scored(for/against),strength of schedule, etc.
  • Games- gives basic data for each game from 2000-2019 such as: teams involved (home/away),date, score,winner,loser,etc.

Weather

nflfastr play-by-play data will be manipulated to extract temperature and weather values by unique game for 2000-2019.

Normally this data is used to analyze valuation and expected return from different decisions based on historic information of the play type. This analysis will only use a very small subset of this data to be joined with attendance.

This dataset will include game_id, teams involved, week number, and:

  • temp- temperature in Farhenheit for the game
  • roof- Indicates if the game was played outdoors, in dome, in a closed roof, in an open roof +Indoor games (dome and closed roof) will result in NA weather factors
  • weather_type- A cleaned version of the weather description that can indicate rain,snow,clouds,etc.

Attendance Import

The below code will import the GitHub attendance,standings,games datasets:

setwd('C:/Users/tgobi/OneDrive/MSBANA/Fall2020/Data_Wrangling/Project_Options')
attendance <- read_csv('attendance.csv')
standings <- read_csv('standings.csv')
games <- read_csv('games.csv')

Beginning with attendance the head() function is used to give a quick visual of the data. Upon investigating, some cleaning will be done to merge the team information (team and team_name) with year and week to create a unique ID for team.

This dataset can also be broken into two sets. One that looks solely at weekly attendance, and another that removes the weekly aspect and just selects distinct annual totals.

It is also noticed, that NA values appear during the BYE week in which the team does not play that week.

weekly_attendance as well as home and away will be pulled from this dataset

head(attendance) %>% kbl(caption='View of Attendance Data') %>% kable_styling()
View of Attendance Data
team team_name year total home away week weekly_attendance
Arizona Cardinals 2000 893926 387475 506451 1 77434
Arizona Cardinals 2000 893926 387475 506451 2 66009
Arizona Cardinals 2000 893926 387475 506451 3 NA
Arizona Cardinals 2000 893926 387475 506451 4 71801
Arizona Cardinals 2000 893926 387475 506451 5 66985
Arizona Cardinals 2000 893926 387475 506451 6 44296

looking at head() for standings below. this analysis will primarily use the ‘points_for’ and ‘points_against’ metrics the most.

playoffs can be converted to a binary character. Same with Super Bowl; however, it is not useful for the analysis being performed.

head(standings) %>% kbl(caption='View of Standings Data') %>% kable_styling()
View of Standings Data
team team_name year wins loss points_for points_against points_differential margin_of_victory strength_of_schedule simple_rating offensive_ranking defensive_ranking playoffs sb_winner
Miami Dolphins 2000 11 5 323 226 97 6.1 1.0 7.1 0.0 7.1 Playoffs No Superbowl
Indianapolis Colts 2000 10 6 429 326 103 6.4 1.5 7.9 7.1 0.8 Playoffs No Superbowl
New York Jets 2000 9 7 321 321 0 0.0 3.5 3.5 1.4 2.2 No Playoffs No Superbowl
Buffalo Bills 2000 8 8 315 350 -35 -2.2 2.2 0.0 0.5 -0.5 No Playoffs No Superbowl
New England Patriots 2000 5 11 276 338 -62 -3.9 1.4 -2.5 -2.7 0.2 No Playoffs No Superbowl
Tennessee Titans 2000 13 3 346 191 155 9.7 -1.3 8.3 1.5 6.8 Playoffs No Superbowl

The games dataset has very useful information that will be used to join all other information. it first must be reformatted to give cumulative win information. It should also be sorted into home and away team datasets that will be rejoined later.

head(games) %>% kbl(caption='View of Games Data') %>% kable_styling()
View of Games Data
year week home_team away_team winner tie day date time pts_win pts_loss yds_win turnovers_win yds_loss turnovers_loss home_team_name home_team_city away_team_name away_team_city
2000 1 Minnesota Vikings Chicago Bears Minnesota Vikings NA Sun September 3 13:00:00 30 27 374 1 425 1 Vikings Minnesota Bears Chicago
2000 1 Kansas City Chiefs Indianapolis Colts Indianapolis Colts NA Sun September 3 13:00:00 27 14 386 2 280 1 Chiefs Kansas City Colts Indianapolis
2000 1 Washington Redskins Carolina Panthers Washington Redskins NA Sun September 3 13:01:00 20 17 396 0 236 1 Redskins Washington Panthers Carolina
2000 1 Atlanta Falcons San Francisco 49ers Atlanta Falcons NA Sun September 3 13:02:00 36 28 359 1 339 1 Falcons Atlanta 49ers San Francisco
2000 1 Pittsburgh Steelers Baltimore Ravens Baltimore Ravens NA Sun September 3 13:02:00 16 0 336 0 223 1 Steelers Pittsburgh Ravens Baltimore
2000 1 Cleveland Browns Jacksonville Jaguars Jacksonville Jaguars NA Sun September 3 13:02:00 27 7 398 0 249 1 Browns Cleveland Jaguars Jacksonville

Weather Import

As mentioned earlier, the weather data will be a small subset of a raw dataset. The below code imports the data from GitHub. The code is then manipulated to pull game_id,year,home_team,away_team,season_type,week,weather,temp,roof as the important weather variables.

It then takes unique game values of this weather data selected, then extracts the basic weather description, and filters to only look at ‘regular season’ games.

Given the slow loading speed, the truncated version has been exported to a csv file that will be referenced.

# nflfastr weather opt ----------------------------------------------------

seasons <- 2000:2019
pbp <- purrr::map_df(seasons, function(x) {
  readRDS(
    url(
      glue::glue("https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_{x}.rds")
    )
  )
})

pbp_weath <- pbp %>% 
  mutate(year=as.integer(substr(game_id,1,4))) %>% 
  select(game_id,year,home_team,away_team,season_type,week,weather,temp,roof) %>% 
  unique() %>% 
  mutate(weath_type=substr(weather,1,str_locate(weather,'Temp:')-1)) %>% 
  filter(season_type=='REG') %>% 
  select(-weather)
  ###separate(weather,c('status','temp'),sep='temp')
#write_csv(pbp_weath,path='C:/Users/tgobi/OneDrive/MSBANA/Fall2020/Data_Wrangling/Project_Options/nflweather.csv')
pbp_weath <- read_csv('C:/Users/tgobi/OneDrive/MSBANA/Fall2020/Data_Wrangling/Project_Options/nflweather.csv',
                      col_types=list(col_character(),col_double(),col_character(),col_character(),col_character(),col_double(),col_double(),col_character(),col_character()))

Data Cleaning

Annual Attendance Data Separated

First step is to create a separate table that will allow for the separation of full-year attendances and weekly attendances from the attendance dataset. This will become tot_att

The below code will:

  • Merge the game info into a single game_id
  • Unite the team and team_name into a single team_id
  • Remove weekly_attendance and the other weekly identifiers
  • Change the names to be more descriptive that they are annual totals
  • group the results by year and team
  • select only unique values (should only have 1 observation per team per year)

The result is the below table that only displays the total annual attendance. The weekly attendance will be used seperately.

attendance <- 
  attendance %>% 
  unite(c(team,year,week),col=game_id,remove=FALSE) %>% 
  unite(c(team,team_name),col=team_id,remove=F,sep=" ")

tot_att <- attendance %>% 
  select(-weekly_attendance,-week,-game_id) %>% 
  rename(tot_season_att=total,tot_home_att=home,tot_away_att=away) %>% 
  group_by(year,team_id) %>% 
  unique(.)

head(tot_att) %>% kbl() %>% kable_styling(full_width = F)
team_id team team_name year tot_season_att tot_home_att tot_away_att
Arizona Cardinals Arizona Cardinals 2000 893926 387475 506451
Atlanta Falcons Atlanta Falcons 2000 964579 422814 541765
Baltimore Ravens Baltimore Ravens 2000 1062373 551695 510678
Buffalo Bills Buffalo Bills 2000 1098587 560695 537892
Carolina Panthers Carolina Panthers 2000 1095192 583489 511703
Chicago Bears Chicago Bears 2000 1080684 535552 545132

Given that this table now should have a unique count of each team by year, it is a good measure to help point out which years in the dataset may not have read all the teams.

The below code will take the tally grouped by year. Given that there are currently 32 teams in the league, it is expected that all years should have 32; however, the below table shows that 2000 and 2001 had 31 teams. This is due to the expansion of adding the Houston Texans in 2002. It is worth noting this, but not of any immeadiate concern.

  tot_att %>% group_by(year) %>% tally()%>% filter(n!=32) %>% kbl(caption='Unique Teams by Year') %>% kable_styling(full_width = F) 
Unique Teams by Year
year n
2000 31
2001 31

Breaking Game out into Home and Away Sets

The next step is to take the game dataset and break them into each of the home teams and away teams to ensure that all games for each team are captured when joining them into a master dataset.

Before beginning this. It must be ensured that only regular season data is being pulled. the week number variable in game has NA values when it is considered a playoff game. These are filtered and changed to a numeric datatype below:

games <- games %>% mutate(week=as.double(week)) %>% filter(.,!is.na(week))

The following code creates unique team_id’s using the home and away team fields:

  • It pulls year, week, date, both teams, the winning team and tie
  • each team_id of focus will be the home_team in home_games and away_team in away_games
  • The opponent is classified as the opposite (away_team for home_games and home_team for away_games)
  • if the home_team won in home_games then the binary variable of 1 is assigned to say the home team won (same if tie)
  • if the away_team won in away_games then the binary variable of 1 is assigned to say the away team won (same if tie)
home_games <- games %>%  select(year,week,date,home_team,away_team,winner,tie) %>%
  rename(team_id=home_team,home_opponent=away_team,home_winner=winner,home_date=date) %>% 
  mutate(home_win=ifelse(team_id==home_winner,1,0),home_tie=ifelse(!is.na(tie),1,0))

away_games <- games %>%  select(year,week,date,away_team,home_team,winner,tie) %>%
  rename(team_id=away_team,away_opponent=home_team,away_winner=winner,away_date=date) %>% 
  mutate(away_win=ifelse(team_id==away_winner,1,0),away_tie=ifelse(!is.na(tie),1,0))

Joining the Broken-Out Game Sets with Attendance Set

Taking the home_games and away_games datasets, the next step is to join their values into the attendance dataset.

The below code creates a join table:

  • left joins attendance with home_games,away_games,tot_att on the year,week and team identifiers
  • now that the data is joined and organized by team, the two opponent fields home_opponent and away_opponent can be merged into a single opponent field as long as the team is not on BYE which would be NA
  • Removes the old opponent fields
join <- attendance %>%  
  left_join(.,home_games,by=c('year','week','team_id')) %>% 
  left_join(.,away_games,by=c('year','week','team_id')) %>% 
  left_join(.,tot_att,by=c('year','team_id')) %>% 
  mutate(opponent=ifelse(is.na(home_opponent),away_opponent,home_opponent)) %>% 
  select(-home_opponent,-away_opponent)

The join table is now used as a index to feed into the main data table built below.

Here are the steps:

  • mutate the home and away date fields and merge them on whichever is not NA (missing from join)
  • Create another column to classify month seperately from date for broad classification of time
  • Add a classification variable to whether the team in team_id is home or away given whichever date is binary win variable is not NA +if both are NA then the team is on BYE and will remain NA
  • select variables of interest (see next tab of report for details)
  • aggregate the win and tie variables on whichever is not NA to make a final classification of if game was win or tie
  • if the win and tie variables are 0 and a weekly_attendance exists (team not on BYE) then a loss=1 binary
  • grouping by the team_id and year keys, create a dynamic running sum of teams win,losses and ties WTD on season
  • Create a totwins and totloss variable by year to indicate the ending record for the team that year
  • Determine a running winning % win_pct of the team by week using the (cumulative wins)/(sum cumulative outcomes)
  • Create two variables that calculate the percentage of the full season attendance the single game accounts for
by_game <- join %>% 
  mutate(date=ifelse(is.na(home_date),away_date,home_date)) %>% 
  mutate(date=as.Date(date,"%B %d")) %>% mutate(month=format(date,"%m")) %>% #mutate(month=as.integer(month)) %>% 
  mutate(home_away=ifelse(is.na(date),NA,ifelse(is.na(home_win),'away','home'))) %>% 
  
  select(team_id,year,week,date,month,opponent,home_away,weekly_attendance,tot_season_att,tot_home_att,tot_away_att,home_win,home_tie,away_win,away_tie) %>% 
  rowwise() %>% mutate(win=sum(home_win,away_win,na.rm=T),tie=sum(home_tie,away_tie,na.rm=T)) %>% 
  rowwise() %>% mutate(loss=ifelse(win==0 & tie==0 & !is.na(weekly_attendance),1,0)) %>% 
  group_by(team_id,year) %>% mutate(cumwins=cumsum(win),cumloss=cumsum(loss),cumtie=cumsum(tie)) %>% 
  mutate(totwins=sum(win),totloss=sum(loss)) %>% 
  rowwise() %>% mutate(win_pct=cumwins/sum(cumwins,cumloss,cumtie)) %>%
  rowwise() %>% mutate(pct_home=ifelse(home_away=='home',weekly_attendance/tot_home_att,NA)) %>% 
  rowwise() %>% mutate(pct_away=ifelse(home_away=='away',weekly_attendance/tot_away_att,NA)) %>%
  select(-home_win,-home_tie,-away_win,-away_tie)

Final Datasets

Below are the head() views of of the final datasets that will be joined:

by_game:

  • team_id (chr): Team name and mascot (unique identifier)
  • year (dbl): Year the game was played (2000-2019)
  • week (dbl) : Week of the season in which the game occurred (1-17 for the 17 wks. of NFL season)
  • date (date): Date of the game
  • month (chr): Month of the game
  • opponent (chr): Opponent that the team (team_id) played in the given week
  • home_away (chr): Was the team the home or away team? (if NA then on BYE)
  • weekly_attendance (dbl): The attendance for the game that week
  • tot_season_att (dbl): Total attendance the team played in front of across all 16 games
  • tot_home_att (dbl): Total attendance the team played in front of for all home games
  • tot_away_att (dbl): Total attendance the team played in front of for all away games
  • win (dbl): binary, was the game a win (1 or 0)
  • tie (dbl): binary, was the game a tie (1 or 0)
  • loss (dbl): binary, was the game a loss (1 or 0)
  • cumwins (dbl): cumulative wins for the team on the season after the given week has concluded
  • cumloss (dbl): cumulative losses for the team on the season after the given week has concluded
  • cumtie (dbl): cumulative ties for the team on the season after the given week has concluded
  • totwins (dbl): total wins the team had for the season
  • totloss (dbl): total losses the team had for the season
  • win_pct (dbl): cumulative % of games won as percent of total games played through the given week
  • pct_home (dbl): percent of total season-long home attendance accounted for by the specific game
  • pct_away (dbl): percent of total season-long away attendance accounted for by the specific game
head(by_game) %>% kbl() %>% kable_styling()
team_id year week date month opponent home_away weekly_attendance tot_season_att tot_home_att tot_away_att win tie loss cumwins cumloss cumtie totwins totloss win_pct pct_home pct_away
Arizona Cardinals 2000 1 2020-09-03 09 New York Giants away 77434 893926 387475 506451 0 0 1 0 1 0 3 13 0.0000000 NA 0.1528953
Arizona Cardinals 2000 2 2020-09-10 09 Dallas Cowboys home 66009 893926 387475 506451 1 0 0 1 1 0 3 13 0.5000000 0.1703568 NA
Arizona Cardinals 2000 3 NA NA NA NA NA 893926 387475 506451 0 0 0 1 1 0 3 13 0.5000000 NA NA
Arizona Cardinals 2000 4 2020-09-24 09 Green Bay Packers home 71801 893926 387475 506451 0 0 1 1 2 0 3 13 0.3333333 0.1853049 NA
Arizona Cardinals 2000 5 2020-10-01 10 San Francisco 49ers away 66985 893926 387475 506451 0 0 1 1 3 0 3 13 0.2500000 NA 0.1322635
Arizona Cardinals 2000 6 2020-10-08 10 Cleveland Browns home 44296 893926 387475 506451 1 0 0 2 3 0 3 13 0.4000000 0.1143196 NA

pbp_weather:

  • game_id (chr): the unique identifier for game (year_week_team1_team2)
  • year (int): year of the game
  • home_team (chr): abbreviation ID for home team
  • away_team (chr): abbreviation ID for away team
  • season_type (chr): ID indicating the game type (all should be ‘REG’ given it’s already filtered for regular season)
  • week (int): week number for the game in the given year
  • temp (int): temperature in Fahrenheit for the game
  • roof (chr): classifier of if the game was played indoors or outdoors +dome=permenant indoor stadium +outdoors=outdoor stadium +open=retractable roof was open to elements (can be simplified to ‘outdoors’ if needed) +closed=retractable roof was closed to the elements (can be simplified to ‘dome’ for indoor classification)
  • weath_type (chr): Simple description of the weather (participation in particular)
head(pbp_weath) %>% kbl() %>% kable_styling()
game_id year home_team away_team season_type week temp roof weath_type
2000_01_ARI_NYG 2000 NYG ARI REG 1 80 outdoors NA
2000_01_BAL_PIT 2000 PIT BAL REG 1 74 outdoors NA
2000_01_CAR_WAS 2000 WAS CAR REG 1 80 outdoors NA
2000_01_CHI_MIN 2000 MIN CHI REG 1 NA dome NA
2000_01_DEN_STL 2000 LA DEN REG 1 NA dome NA
2000_01_DET_NO 2000 NO DET REG 1 NA dome NA

The final cleaning will need to come in the form of matching the team_id variables to do a final join that will help with analysis.

The type_weath from pbp_weath will also need to be string searched for this to be used as a reasonable character variable.

Merging Data

Adding Visual Logo Data

To add clarity to visuals, below code will add team logos and colors:

logo <- select(teams_colors_logos,team_abbr,team_name,team_logo_wikipedia,team_logo_espn)
team_colors <- teamcolors %>% filter(league=='nfl')

Join weather and logos

Further, to make plotting and summarizing easier, the below code will join together the pbp_weath and logos datasets with the by_game to create a final_data set that has each team and their games (home and away) and all possible info for those games.

Warning for this set: it is not unique and will need to likely need distinct() or filter() operations to just look at games uniquely

## add team logo and abbreviation info (abbr is needed to create id to join to weather)
by_game[which(str_detect(by_game$team_id,'Washington Redskins')==TRUE),'team_id'] <- 'Washington Football Team'
by_game[which(str_detect(by_game$opponent,'Washington Redskins')==TRUE),'opponent'] <- 'Washington Football Team'
team_colors[which(str_detect(team_colors$name,'Washington Redskins')==TRUE),'name'] <- 'Washington Football Team'


by_game <- by_game %>% left_join(logo,by=c('team_id'='team_name')) %>% left_join(logo,by=c('opponent'='team_name'))
 names(by_game) <- names(by_game) %>% 
  str_replace_all(string=.,pattern='\\.x$',replacement='') %>% 
  str_replace_all(string=.,pattern='\\.y$',replacement='_opp')

 
 
 ## Creating the game_id key to be used to join accurately with weather data
 by_game <- by_game %>% 
  mutate(game_id=str_c(year,sprintf('%02d',week)
                     ,ifelse(home_away=='away'
                             ,team_abbr,team_abbr_opp)
                     ,ifelse(home_away=='home',team_abbr,team_abbr_opp),sep='_'))

##joining weather and game/attendance data into final dataset
final_data <- by_game %>% 
  left_join(pbp_weath,by=c('year','week','game_id')) %>% 
  select(-season_type)

## adjusting date to scale January to be viewed at end of season
final_data$date[which(final_data$date<='2020-01-31')] <- final_data$date[which(final_data$date<='2020-01-31')] %>% 
  ymd(.) +years(1)

Classify Weather Strings

Visualizing the weather string data, there are large inconsistencies with how a weather event is reported. Stringr extension of tidyverse will be used to detect string values that indicate anything like:

  • snow
  • rain
  • shower

and mutate it to a single 2-factor classification:

  • Precipitation - If it is detected the string has some form of snow or rain
  • none- Otherwise is the case

the regex function allows for us to ignore the case of the string. Also, there are strings in which there is Zero Chance or No Chance when referring to rain. Those will be made exceptions to be removed in the case_when

final_data <- final_data %>% mutate(precipitation= case_when(
  str_detect(string=weath_type,pattern=regex('rain|snow|shower',ignore_case = T))==TRUE & 
    str_detect(string=weath_type,pattern=regex('Zero|(?<!s)No',ignore_case=T))!=TRUE ~'Precipitation',
  TRUE ~'none'
)
)

Data Exploration

Outdoor Splits

Amount of outdoor games is given in the table below. since 2000, they have averaged 77%.

The second table visualizes this trend by year to confirm that the annual samples remain relatively consistent in terms of the number of outdoor games being looked at:

ct_table <- pbp_weath %>% 
  mutate(roof=replace(roof,roof=='closed','dome'),roof=replace(roof,roof=='open','outdoors')) %>% 
  group_by(roof) %>%  tally(name='Count') %>% mutate(Count=as.numeric(Count)) %>% 
  mutate(pct_outdoor=round(Count/sum(Count),2))

ct_table_2 <- pbp_weath %>% 
mutate(roof=replace(roof,roof=='closed','dome'),roof=replace(roof,roof=='open','outdoors')) %>% 
group_by(year,roof) %>%  tally(name='Count') %>% mutate(Count=as.numeric(Count)) %>% spread(roof,Count) %>% 
  rowwise() %>% mutate(pct_outdoor=round(outdoors/sum(dome,outdoors),2))
  

#ct_table <- ct_table %>% 
#rbind(c('Total',sum(ct_table$Count),sum(ct_table$pct))) %>% mutate(Count=as.numeric(Count))

ct_table %>% 
  kbl(caption='Dome vs. outdoor games 2000-2019') %>% 
  kable_styling('hover',full_width=F)
Dome vs. outdoor games 2000-2019
roof Count pct_outdoor
dome 1157 0.23
outdoors 3945 0.77
ct_table_2 %>% 
  kbl(caption='Dome vs. Outdoor by Year') %>% 
  kable_styling('hover',full_width = F)
Dome vs. Outdoor by Year
year dome outdoors pct_outdoor
2000 48 198 0.80
2001 48 200 0.81
2002 51 205 0.80
2003 53 203 0.79
2004 51 205 0.80
2005 46 210 0.82
2006 58 198 0.77
2007 58 198 0.77
2008 50 206 0.80
2009 62 194 0.76
2010 65 191 0.75
2011 68 188 0.73
2012 68 188 0.73
2013 70 186 0.73
2014 61 195 0.76
2015 60 196 0.77
2016 62 194 0.76
2017 59 197 0.77
2018 60 196 0.77
2019 59 197 0.77

Graphical Summaries

Temperature vs. Attendance

Final basic summary will be a basic scatter plotting the two variables against each other.

final_data %>% distinct(game_id,.keep_all = T) %>% 
  ggplot(.,aes(x=temp,y=weekly_attendance))+
  geom_point(alpha=.8,color='lightblue')+
  geom_smooth(color='Blue')+
  scale_y_continuous(name='Weekly Attendance',labels=scales::comma)+
  scale_x_continuous(name='Temperature of Game')+
  ggtitle(label='Temperature of Game by Attendance',subtitle='All Games 2000-2019 by Outdoor Stadiums')

The plot attempting to correlate the Weekly Attendance and Temperature variables gives better context into the data. There is clearly some lower outdoor attendances even at moderate temperatures.

These outliers could be due to other differences such as stadium capacity, which has no clear way of being captured using available data. This will be explored in the next section.

Summary of Stadium Types

Boxplot Summary

To give more visual clarity into the distributions of each stadium type, we can analyze the boxplot and histograms for each group below.

final_data %>% distinct(game_id,.keep_all = T) %>%  ggplot(.,aes(y=weekly_attendance,x=roof))+
  # geom_histogram(aes(y=..density..),color='black',alpha=.5,bins=20,position='identity')+
  geom_boxplot(aes(fill=roof),alpha=.7)+
  scale_y_continuous(name='Weekly Attendance',labels=scales::comma)+
  ggtitle('Boxplot Comparison of Stadium Type Distributions')
final_data %>% distinct(game_id,.keep_all = T) %>% ggplot(.,aes(x=weekly_attendance))+
  facet_wrap(vars(roof),scales='free_y')+
  geom_histogram(color='black',fill='dodgerblue',alpha=.5,bins=20)+
  geom_vline(aes(xintercept=mean(weekly_attendance,na.rm=T)),linetype='dashed',color='red',alpha=.8,lwd=1)+
  ggtitle('Histogram of each Stadium Type')

The mean is overall highest for the closed roof stadiums.the boxplot shows that this is skewed by more attendance values that are higher than the rest of the league. Some of this could be because several stadiums of this type being relatively new and relatively higher in capacity. The overall distribution for these appears bimodal with seemingly a distribution forming around the overall mean attendance of ~67,000 , and another distribution forming about the ~80,000 range.

Comparing to the open roof classification, this distribution appears almost approximately normal about the overall mean of ~67,000 ; however, has a larger left tail in its histogram (making it left-skewed). There is further evidenced in the boxplot, with a large number of attendances below the 1.5IQR range used by the plot.

Using domain knowledge to give a hypothesis for the above observations, the Dallas Cowboys moved into AT&T Stadium with a capacity of about 80,000 in 2009. Given the relative low number of observations in closed classification, these games at 80,000 could be because of a move like this.

Further, to explain the lower open stadiums, there have been several teams playing in stadiums with lower capacities while making a move to a new one. The biggest example of this was the Los Angeles Chargers who played at Dignity Health Sports Park from 2017-2019–who’s smaller capacity of ~30,000 was meant to accommodate soccer–while awaiting the opening of SoFi Stadium this season.

The below graphs will investigate the teams in more detail:

  • 15 seasons with the lowest average attendance by team
  • 15 seasons with the highest average attendance by team
final_data %>% 
    #distinct(game_id,.keep_all = T) %>%  
    left_join(teams_colors_logos[,c('team_abbr','team_color')],by=c('team_abbr')) %>% 
    select(team_abbr,team_logo_wikipedia,home_away,year,team_color,weekly_attendance) %>% 
    group_by(team_abbr,team_logo_wikipedia,team_color,home_away,year) %>%
    filter(home_away=='home') %>% 
    summarise(avg_att=mean(weekly_attendance)) %>%
    arrange(avg_att) %>% ungroup() %>%  top_n(-15) %>% 
    ggplot(.,aes(x=reorder(str_c(team_abbr,year),avg_att),y=avg_att))+
    geom_col(aes(fill=team_color,color=team_color),position='identity',alpha=.75)+
    scale_color_identity(aesthetics = c('color','fill'))+
    geom_image(aes(image=team_logo_wikipedia))+
    geom_label(aes(y=0,label=year))+
    xlab('Team, Year')+
    ylab('Average Attendance Per Game')+
    scale_y_continuous(labels=scales::comma)+
    theme(
      axis.text.x=element_blank(),
      axis.ticks.x=element_blank()
    )+
    ggtitle('Bottom 15 Seasons in Terms of per Game Attendance')

As suspected, the Chargers have the 3 lowest seasons in terms of attendance since 2000. However, They were limited in capacity.

The Cowboys have 10 of the top 15 seasons in terms of attendance and all of them come after the new stadium opening in 2009.

As a result, stadium capacity can be considered the likely reason for many of the attendance outliers in the dataset.

Precipitation Analysis

Rainy/Snowy Attendance Comparison

Given that temperature appears that it may not have a significant impact on attendance, we will also look at the weather descriptions for games in which that info is available.

Weather descriptions have only been recorded in the nflfastR datasets to a limited capacity since 2013, so the data for those games is relatively limited; however, there are some important insights that indicate it having an attendance impact.

Each available string was searched for strings like ‘rain’,‘snow’,‘shower’,etc. , and for those that contained values such as this, are given a ‘Precipitation’ classification–indicating there was precipitation for the game. The classification is ‘none’ if there is no evidence from the weather description.

Below is a high-level summary of the average league attendance for games with precipitation vs. not:

  final_data  %>% filter(!is.na(weath_type) & roof%in%c('outdoors','open') & year>2014 & home_away=='home') %>%
    group_by(year,precipitation) %>% summarise(avg_att=mean(weekly_attendance,na.rm=T)) %>% ungroup() %>% 
    ggplot(aes(x=year,y=avg_att))+
    geom_line(aes(color=precipitation))+
    geom_label(aes(y=avg_att,label=round(avg_att,0)))+
    scale_y_continuous(name='Average Attendance',labels=scales::comma)+
    ggtitle('Average League Attendance for Rainy/Snowy games vs. Fair Weather')

This gives some evidence using 2015-2019 data to indicate there is a difference between rainy/snowy attendance; however, it fluctuates between which average is higher YoY. 2015,2018,2019 all have higher attendances for rainy/snowy games, but the others do not, which indicates potential for other factors skewing the information.

The best solution to see this further, is to subset the analysis by isolating it to each team.

Below graph will compare differences for each team:

  final_data %>% filter(!is.na(weath_type) & home_away=='home' & year>2014) %>%  select(precipitation,team_abbr,weekly_attendance,team_logo_wikipedia) %>% 
    group_by(team_abbr,precipitation,team_logo_wikipedia) %>% 
    summarise(avg_att=mean(weekly_attendance,na.rm=T),count=n()) %>%
    ggplot(aes(x=reorder(team_abbr,avg_att),y=avg_att))+
    geom_col(aes(x=reorder(team_abbr,avg_att),y=avg_att,fill=precipitation,group=precipitation),position='dodge',alpha=.8)+
    coord_flip()+
    geom_image(aes(y=0,image=team_logo_wikipedia,group=team_abbr),position='identity',hjust=-.5)+
    scale_y_continuous(name='Average Attendance',labels=scales::comma)+
    theme(
     # legend.position = c(1,0),
      axis.text.y=element_blank(),
      axis.title.y=element_blank(),
      axis.ticks.y=element_blank()
    )+
    ggtitle('Attendance by Team for Rainy/Snowy Games vs. Normal Games')

There appears to be some small differences in the attendance that support the claim of precipitation negatively effecting attendance. To make these margins clearer, the below barplot is made for comparison of the margins:

 final_data %>% filter(!is.na(weath_type) & home_away=='home' & year>2014) %>%  select(precipitation,team_abbr,weekly_attendance,team_logo_wikipedia) %>% 
    group_by(team_abbr,precipitation,team_logo_wikipedia) %>% 
    summarise(avg_att=mean(weekly_attendance,na.rm=T)) %>% spread(key=precipitation,value=avg_att) %>% mutate(diff=Precipitation-none) %>% 
    gather(key=var,value=val,none:diff) %>% filter(var=='diff') %>% mutate(val=replace_na(val,0)) %>% 
    ggplot(aes(x=reorder(team_abbr,val),y=val,fill=val))+
    geom_bar(stat='identity',alpha=.9,position='identity')+
    scale_color_gradient2(name='Attendance Change',low='red',mid='green',high='red',aesthetics = c('fill'),guide='colorbar')+
    scale_color_gradient2(name='Attendance Change',low='red',mid='green',high='red',aesthetics = c('color'),guide='colorbar')+
    coord_flip()+
    scale_y_continuous(name='Change in Attendance',labels=scales::comma)+
    geom_image(aes(y=0,image=team_logo_wikipedia))+
    theme(axis.title.y=element_blank(),
          axis.text.y=element_blank(),
          axis.ticks.y=element_blank(),
          legend.justification = c(1,0),
          legend.position=c(1,0))+
    ggtitle(label='Attendance Change for games with Precipitation',subtitle='Negative values indicate an average decrease')

We see that teams such as the Jaguars, Steelers and Washington have average decreases of more than 5,000 since 2015.

This is, however, not consistent. Many teams have much smaller decreases, and some exceptions of increases. For teams with little to no change, there is likely very little to no rainy/snowy weather data to compare.

Attendance by Win Pct.

Full NFL Comparison

Given that there is some small evidence of different aspects of weather effecting attendance, we will also investigate other potential causes for attendance changes at any point in a season. One of the biggest could be the interest of fans in attending a game with a winning team vs. a losing team.

Below will summarize live winning percentage for a team plotted against their weekly attendance.

  final_data %>% filter(home_away=='home') %>% select(weekly_attendance,win_pct,roof) %>% filter(!is.na(roof)) %>% 
    ggplot(aes(x=win_pct,y=weekly_attendance))+
    geom_point(alpha=.2,color='Blue',fill='blue')+
    geom_smooth(color = "DarkBlue", alpha = 0.8)+
    #stat_summary(fun.y = "mean", geom = "point", size = 3, fill='Black', shape = 21, color = "white", stroke = 1,alpha=.5) +
    scale_y_continuous(name='Weekly Attendance',labels=scales::comma)+
    scale_x_continuous(name='Winning Percentage',labels=scales::comma)+
    ggtitle('Weekly_Attendance by Winning Percentage','Data 2000-2019')

There is similarly a small trend change in the later months of the season. The higher the win percentage, the higher the up-tick in attendance.

To further visualize at a team level, the below plot shows the same scatter summary; however uses the mean attendance for each team.

  avg_overall_att <- final_data %>% ungroup() %>%  filter(home_away=='home') %>% summarise(mean(weekly_attendance)) %>% as.data.frame()
  avg_overall_win_pct <- final_data %>% ungroup() %>%  filter(home_away=='home') %>% summarise(mean(win_pct)) %>% as.data.frame()
  
  final_data %>% 
    #distinct(game_id,.keep_all = T) %>%  
    left_join(teams_colors_logos[,c('team_abbr','team_color')],by=c('team_abbr')) %>% 
    left_join(team_colors[,c('name','division')],by=c('team_id'='name')) %>% 
    select(team_id,team_abbr,team_logo_wikipedia,home_away,year,team_color,division,weekly_attendance,win_pct) %>% 
    filter(home_away=='home'& team_abbr!='NA') %>% mutate(team_abbr=replace(team_abbr,team_abbr=='LA','LAR')) %>%
    group_by(team_abbr,team_logo_wikipedia,team_color,division,home_away) %>%
    summarise(avg_att=mean(weekly_attendance,na.rm=T),avg_win_pct=mean(win_pct)) %>%
    arrange(avg_att) %>% ungroup() %>% mutate(rank=min_rank(-avg_att)) %>% 
    ggplot(.,aes(x=avg_win_pct,y=avg_att))+
    scale_color_identity(aesthetics = c('color','fill'))+
    geom_image(aes(image=team_logo_wikipedia))+
    xlab('Avg. Winning Percentage')+
    ylab('Average Attendance Per Game')+
    scale_y_continuous(labels=scales::comma)+
    geom_hline(yintercept=avg_overall_att[1,1],linetype='dashed',color='red')+
    geom_vline(xintercept=avg_overall_win_pct[1,1],linetype='dashed',color='red')+
    ggtitle('NFL Teams in Terms of per Game Attendance','Seasons 2000-2019')

Looking at the general trend for each team, it appears to be close to positively linear (with exception of Browns and 2017-2019 Chargers).

Several of these teams with above average winning tendencies are at or near their stadium capacities, so even team such as Seahawks and Steelers in the lower-right quadrant actually have very good attendance utilization as a part of winning.

This graph is helpful in showing that there is a general upward trend in attendance with winning, but there are still several factors such as capacity and regional sporting interest that this analysis does not take into account.

Summary

Problem Statement Review

To review these were two of the primary problems addressed:

Does temperature and weather(rain/snow) significantly effect attendance for all teams, winning teams or losing teams? How does it compare to other predictors involved?

This was addressed through the use of adjoining NFL datasets on 2000-2019 attendance with 2000-2019 game-by-game weather data that included temperature and condition reports.

The data is graphically analyzed to compare outdoor games to indoor games spanned across a season as weather gets harsher. Doing so,we can conclude that there are small downward trends in attendance as the season progresses for outdoor games. Attendance actually increases for indoor games. This could support a conclusion that colder weather effects attendance; however, it there are several other reasons also uncovered in analysis.

Further, temperature and precipitation were mapped directly against attendance to find if there is any direct relation. The plots indicated downward trends at the more extreme values; however, was very small for cold weather while fairly large dips for hot weather. These trends may not be consistent given the limited data it was being based on at the extreme values. Limited data indicates they are high-leverage observations that may not be fully indicative of the population.

To add further context to the analysis, a deeper-dive into the distributions by stadium type and by team is performed. This was done due to the various aspects of team and stadium that could also effect total attendance. It was also found that many variables being analyzed were, in fact, skewed by stadium capacity and teams with newer and larger stadiums with indoor options. There was not sufficient enough capacity data for all stadiums over the past 20 years to perform a more robust analysis using relative attendance (as a % utilization), so the comparisons were broken out at a team level to better visualize the team context.

Using team-level graphs, rainy/snowy games were compared to the normal weather games using 2015-2019 weather description data. Despite limited data, it was found that most teams had very little to no change in average attendance. There were however interesting exceptions of a few teams with a few thousand person difference in attendance.

It was also found that winning percentage has a positive effect on late season attendance, thus leading to some of the season-long trends as well. It is also only a small difference, but is noticable as well.

Implications to Consumer

Quantify the effect of weather and success on attendance to help teams understand how to maximize return on stadium projects, ticket sales, promotions and overall attendance revenue.

This analysis gives some basic visual analysis into the effect of various weather aspects of attendance in relation to other team-oriented aspects. This can be used as a guide towards furthering analysis into more detailed sales analysis using more robust forms of attendance.

Limitations

As already discussed above, the limitations of this model mostly lie in it’s lack of robust transformation to the attendance metrics. There are various team and stadium-related aspects in each season that need to be accounted for in stronger variables.

A next step improvement would be something as simple as scraping for stadium capacity data to begin looking at attendance as a percentage of capacity to better scale its expectations. In addition, further analysis using the more robust metrics could be performed to make statistically significant conclusions. These techniques include regression, ANOVA analysis and significance testing of the temperature and precipitation data to confirm significant insights.