MetLife Stadium January 2014
Football in the United States has one of the most unique spectating experiences of all professional North American sports. College football flaunts 8 stadiums with attendance capacities greater than 100,000 spectators. All 8 of these are amongst the top 10 largest stadiums in the world,according to data compiled by Wikipedia (not classifying motor sport venues as stadiums).
Looking specifically at the professional sporting industry, the National Football League (NFL) has continued to dominate the world in per game attendances of recent decades. 2007-2016 data from sportingintelligence.com compares annual attendances of various domestic sports leagues across the globe. The NFL has annually posted average per game attendances between ~67,000 - ~69,000, which is consistently 22,000 - 25,000 attendees higher than the next highest league: German Bundesliga (soccer). It is more than double the next highest attendance of US-based leagues (Major League Baseball (MLB) ~30,000 per game).
Further data published on GitHub by Thomas Mock & Chris Stehlik, shows the NFL has continued these trends of high attendance into 2019.
One of the most impressive things about these consistently high numbers, is that they have prevailed in a league that has played a majority of its games outdoors, and many of these games continue in a number of harsh conditions that come with playing during the fall and winter months: rain, snow,extreme cold, etc. It is one of the only major sports in the US that plays outdoors through inclement weather. The MLB plays during warmer summer months and cancels for rain. MLS also plays majority of their season in warmer months. NBA and NHL are entirely indoor.
roof | Count | pct_outdoor |
---|---|---|
dome | 1157 | 0.23 |
outdoors | 3945 | 0.77 |
The question is: does weather significantly effect attendance? How does this and success of the team combine to effect attendance?
This type of study has become more relevant in recent years, as high-quality TV viewing options become more available with the advancement of video quality and instant media sharing.These can serve as viable replacement options of attending games. Also, many of the new stadiums are now being built with retractable roofs allowing for indoor play, making it possible that the sport could be moving away from outdoor play in the future.
Does temperature and weather(rain/snow) significantly effect attendance for all teams, winning teams or losing teams? How does it compare to other predictors involved?
Quantify the effect of weather and success on attendance to help teams understand how to maximize return on stadium projects, ticket sales, promotions and overall attendance revenue.
This analysis will seek to answer the following questions:
Will utilize game-by-game data for multiple recent seasons of NFL play. This will require joining together weather and attendance-based datasets.
Upon cleaning and merging this data, will begin getting an idea of the time-wise trends across each week of a season without subsetting variables. This will be done through basic visuals and regression analyses, and will give an idea of how attendance trends across a season regardless of attention to other variables.
will begin to expand these analyses to subset the data to look at different factions of the data. For example: teams with higher win % (above .500) vs. lower win % (below .500). Also, teams playing indoors vs. outdoors by week.
Will further mix in regression analyses including specific weather data: game temperature, precipitation/none.
Packages used in this analysis are read in below.
# Packages to use ----------------------------------------------------
library(tidyverse)
library(readr)
library(kableExtra)
library(nflfastR)
library(lubridate)
library(teamcolors)
library(ggimage)
library(gridExtra)
nflfastR
nflfastr
ggplot2
to assist in adding graphics to a plotThomas Mock & Chris Stehlik’s GitHub compilation will be used to extract attendance and basic data by game as well as season results. these sets pull from ProFootballReference.com.
The motivation of the data is to provide three different game sets that can be used to bridge together attendance and other factors like offensive/defensive performance, wins, losses, etc. It can be looked at as a whole or at a weekly level.
The sets include:
nflfastr play-by-play data will be manipulated to extract temperature and weather values by unique game for 2000-2019.
Normally this data is used to analyze valuation and expected return from different decisions based on historic information of the play type. This analysis will only use a very small subset of this data to be joined with attendance.
This dataset will include game_id, teams involved, week number, and:
The below code will import the GitHub attendance
,standings
,games
datasets:
setwd('C:/Users/tgobi/OneDrive/MSBANA/Fall2020/Data_Wrangling/Project_Options')
attendance <- read_csv('attendance.csv')
standings <- read_csv('standings.csv')
games <- read_csv('games.csv')
Beginning with attendance
the head()
function is used to give a quick visual of the data. Upon investigating, some cleaning will be done to merge the team information (team and team_name) with year and week to create a unique ID for team.
This dataset can also be broken into two sets. One that looks solely at weekly attendance, and another that removes the weekly aspect and just selects distinct annual totals.
It is also noticed, that NA
values appear during the BYE week in which the team does not play that week.
weekly_attendance
as well as home
and away
will be pulled from this dataset
head(attendance) %>% kbl(caption='View of Attendance Data') %>% kable_styling()
team | team_name | year | total | home | away | week | weekly_attendance |
---|---|---|---|---|---|---|---|
Arizona | Cardinals | 2000 | 893926 | 387475 | 506451 | 1 | 77434 |
Arizona | Cardinals | 2000 | 893926 | 387475 | 506451 | 2 | 66009 |
Arizona | Cardinals | 2000 | 893926 | 387475 | 506451 | 3 | NA |
Arizona | Cardinals | 2000 | 893926 | 387475 | 506451 | 4 | 71801 |
Arizona | Cardinals | 2000 | 893926 | 387475 | 506451 | 5 | 66985 |
Arizona | Cardinals | 2000 | 893926 | 387475 | 506451 | 6 | 44296 |
looking at head()
for standings
below. this analysis will primarily use the ‘points_for’ and ‘points_against’ metrics the most.
playoffs can be converted to a binary character. Same with Super Bowl; however, it is not useful for the analysis being performed.
head(standings) %>% kbl(caption='View of Standings Data') %>% kable_styling()
team | team_name | year | wins | loss | points_for | points_against | points_differential | margin_of_victory | strength_of_schedule | simple_rating | offensive_ranking | defensive_ranking | playoffs | sb_winner |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Miami | Dolphins | 2000 | 11 | 5 | 323 | 226 | 97 | 6.1 | 1.0 | 7.1 | 0.0 | 7.1 | Playoffs | No Superbowl |
Indianapolis | Colts | 2000 | 10 | 6 | 429 | 326 | 103 | 6.4 | 1.5 | 7.9 | 7.1 | 0.8 | Playoffs | No Superbowl |
New York | Jets | 2000 | 9 | 7 | 321 | 321 | 0 | 0.0 | 3.5 | 3.5 | 1.4 | 2.2 | No Playoffs | No Superbowl |
Buffalo | Bills | 2000 | 8 | 8 | 315 | 350 | -35 | -2.2 | 2.2 | 0.0 | 0.5 | -0.5 | No Playoffs | No Superbowl |
New England | Patriots | 2000 | 5 | 11 | 276 | 338 | -62 | -3.9 | 1.4 | -2.5 | -2.7 | 0.2 | No Playoffs | No Superbowl |
Tennessee | Titans | 2000 | 13 | 3 | 346 | 191 | 155 | 9.7 | -1.3 | 8.3 | 1.5 | 6.8 | Playoffs | No Superbowl |
The games
dataset has very useful information that will be used to join all other information. it first must be reformatted to give cumulative win information. It should also be sorted into home and away team datasets that will be rejoined later.
head(games) %>% kbl(caption='View of Games Data') %>% kable_styling()
year | week | home_team | away_team | winner | tie | day | date | time | pts_win | pts_loss | yds_win | turnovers_win | yds_loss | turnovers_loss | home_team_name | home_team_city | away_team_name | away_team_city |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2000 | 1 | Minnesota Vikings | Chicago Bears | Minnesota Vikings | NA | Sun | September 3 | 13:00:00 | 30 | 27 | 374 | 1 | 425 | 1 | Vikings | Minnesota | Bears | Chicago |
2000 | 1 | Kansas City Chiefs | Indianapolis Colts | Indianapolis Colts | NA | Sun | September 3 | 13:00:00 | 27 | 14 | 386 | 2 | 280 | 1 | Chiefs | Kansas City | Colts | Indianapolis |
2000 | 1 | Washington Redskins | Carolina Panthers | Washington Redskins | NA | Sun | September 3 | 13:01:00 | 20 | 17 | 396 | 0 | 236 | 1 | Redskins | Washington | Panthers | Carolina |
2000 | 1 | Atlanta Falcons | San Francisco 49ers | Atlanta Falcons | NA | Sun | September 3 | 13:02:00 | 36 | 28 | 359 | 1 | 339 | 1 | Falcons | Atlanta | 49ers | San Francisco |
2000 | 1 | Pittsburgh Steelers | Baltimore Ravens | Baltimore Ravens | NA | Sun | September 3 | 13:02:00 | 16 | 0 | 336 | 0 | 223 | 1 | Steelers | Pittsburgh | Ravens | Baltimore |
2000 | 1 | Cleveland Browns | Jacksonville Jaguars | Jacksonville Jaguars | NA | Sun | September 3 | 13:02:00 | 27 | 7 | 398 | 0 | 249 | 1 | Browns | Cleveland | Jaguars | Jacksonville |
As mentioned earlier, the weather data will be a small subset of a raw dataset. The below code imports the data from GitHub. The code is then manipulated to pull game_id,year,home_team,away_team,season_type,week,weather,temp,roof
as the important weather variables.
It then takes unique game values of this weather data selected, then extracts the basic weather description, and filters to only look at ‘regular season’ games.
Given the slow loading speed, the truncated version has been exported to a csv file that will be referenced.
# nflfastr weather opt ----------------------------------------------------
seasons <- 2000:2019
pbp <- purrr::map_df(seasons, function(x) {
readRDS(
url(
glue::glue("https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_{x}.rds")
)
)
})
pbp_weath <- pbp %>%
mutate(year=as.integer(substr(game_id,1,4))) %>%
select(game_id,year,home_team,away_team,season_type,week,weather,temp,roof) %>%
unique() %>%
mutate(weath_type=substr(weather,1,str_locate(weather,'Temp:')-1)) %>%
filter(season_type=='REG') %>%
select(-weather)
###separate(weather,c('status','temp'),sep='temp')
#write_csv(pbp_weath,path='C:/Users/tgobi/OneDrive/MSBANA/Fall2020/Data_Wrangling/Project_Options/nflweather.csv')
pbp_weath <- read_csv('C:/Users/tgobi/OneDrive/MSBANA/Fall2020/Data_Wrangling/Project_Options/nflweather.csv',
col_types=list(col_character(),col_double(),col_character(),col_character(),col_character(),col_double(),col_double(),col_character(),col_character()))
First step is to create a separate table that will allow for the separation of full-year attendances and weekly attendances from the attendance
dataset. This will become tot_att
The below code will:
game_id
team_id
weekly_attendance
and the other weekly identifiersThe result is the below table that only displays the total annual attendance. The weekly attendance will be used seperately.
attendance <-
attendance %>%
unite(c(team,year,week),col=game_id,remove=FALSE) %>%
unite(c(team,team_name),col=team_id,remove=F,sep=" ")
tot_att <- attendance %>%
select(-weekly_attendance,-week,-game_id) %>%
rename(tot_season_att=total,tot_home_att=home,tot_away_att=away) %>%
group_by(year,team_id) %>%
unique(.)
head(tot_att) %>% kbl() %>% kable_styling(full_width = F)
team_id | team | team_name | year | tot_season_att | tot_home_att | tot_away_att |
---|---|---|---|---|---|---|
Arizona Cardinals | Arizona | Cardinals | 2000 | 893926 | 387475 | 506451 |
Atlanta Falcons | Atlanta | Falcons | 2000 | 964579 | 422814 | 541765 |
Baltimore Ravens | Baltimore | Ravens | 2000 | 1062373 | 551695 | 510678 |
Buffalo Bills | Buffalo | Bills | 2000 | 1098587 | 560695 | 537892 |
Carolina Panthers | Carolina | Panthers | 2000 | 1095192 | 583489 | 511703 |
Chicago Bears | Chicago | Bears | 2000 | 1080684 | 535552 | 545132 |
Given that this table now should have a unique count of each team by year, it is a good measure to help point out which years in the dataset may not have read all the teams.
The below code will take the tally grouped by year. Given that there are currently 32 teams in the league, it is expected that all years should have 32; however, the below table shows that 2000
and 2001
had 31 teams. This is due to the expansion of adding the Houston Texans in 2002
. It is worth noting this, but not of any immeadiate concern.
tot_att %>% group_by(year) %>% tally()%>% filter(n!=32) %>% kbl(caption='Unique Teams by Year') %>% kable_styling(full_width = F)
year | n |
---|---|
2000 | 31 |
2001 | 31 |
The next step is to take the game
dataset and break them into each of the home teams and away teams to ensure that all games for each team are captured when joining them into a master dataset.
Before beginning this. It must be ensured that only regular season data is being pulled. the week
number variable in game
has NA
values when it is considered a playoff game. These are filtered and changed to a numeric datatype below:
games <- games %>% mutate(week=as.double(week)) %>% filter(.,!is.na(week))
The following code creates unique team_id
’s using the home and away team fields:
team_id
of focus will be the home_team in home_games
and away_team in away_games
home_games
and home_team for away_games
)home_games
then the binary variable of 1 is assigned to say the home team won (same if tie)away_games
then the binary variable of 1 is assigned to say the away team won (same if tie)home_games <- games %>% select(year,week,date,home_team,away_team,winner,tie) %>%
rename(team_id=home_team,home_opponent=away_team,home_winner=winner,home_date=date) %>%
mutate(home_win=ifelse(team_id==home_winner,1,0),home_tie=ifelse(!is.na(tie),1,0))
away_games <- games %>% select(year,week,date,away_team,home_team,winner,tie) %>%
rename(team_id=away_team,away_opponent=home_team,away_winner=winner,away_date=date) %>%
mutate(away_win=ifelse(team_id==away_winner,1,0),away_tie=ifelse(!is.na(tie),1,0))
Taking the home_games
and away_games
datasets, the next step is to join their values into the attendance
dataset.
The below code creates a join table:
attendance
with home_games
,away_games
,tot_att
on the year,week and team identifiershome_opponent
and away_opponent
can be merged into a single opponent
field as long as the team is not on BYE which would be NA
join <- attendance %>%
left_join(.,home_games,by=c('year','week','team_id')) %>%
left_join(.,away_games,by=c('year','week','team_id')) %>%
left_join(.,tot_att,by=c('year','team_id')) %>%
mutate(opponent=ifelse(is.na(home_opponent),away_opponent,home_opponent)) %>%
select(-home_opponent,-away_opponent)
The join
table is now used as a index to feed into the main data table built below.
Here are the steps:
team_id
is home or away given whichever date is binary win
variable is not NA
+if both are NA
then the team is on BYE and will remain NA
weekly_attendance
exists (team not on BYE) then a loss=1 binaryteam_id
and year
keys, create a dynamic running sum of teams win,losses and ties WTD on seasontotwins
and totloss
variable by year to indicate the ending record for the team that yearwin_pct
of the team by week using the (cumulative wins)/(sum cumulative outcomes)by_game <- join %>%
mutate(date=ifelse(is.na(home_date),away_date,home_date)) %>%
mutate(date=as.Date(date,"%B %d")) %>% mutate(month=format(date,"%m")) %>% #mutate(month=as.integer(month)) %>%
mutate(home_away=ifelse(is.na(date),NA,ifelse(is.na(home_win),'away','home'))) %>%
select(team_id,year,week,date,month,opponent,home_away,weekly_attendance,tot_season_att,tot_home_att,tot_away_att,home_win,home_tie,away_win,away_tie) %>%
rowwise() %>% mutate(win=sum(home_win,away_win,na.rm=T),tie=sum(home_tie,away_tie,na.rm=T)) %>%
rowwise() %>% mutate(loss=ifelse(win==0 & tie==0 & !is.na(weekly_attendance),1,0)) %>%
group_by(team_id,year) %>% mutate(cumwins=cumsum(win),cumloss=cumsum(loss),cumtie=cumsum(tie)) %>%
mutate(totwins=sum(win),totloss=sum(loss)) %>%
rowwise() %>% mutate(win_pct=cumwins/sum(cumwins,cumloss,cumtie)) %>%
rowwise() %>% mutate(pct_home=ifelse(home_away=='home',weekly_attendance/tot_home_att,NA)) %>%
rowwise() %>% mutate(pct_away=ifelse(home_away=='away',weekly_attendance/tot_away_att,NA)) %>%
select(-home_win,-home_tie,-away_win,-away_tie)
Below are the head()
views of of the final datasets that will be joined:
team_id
) played in the given weekNA
then on BYE)head(by_game) %>% kbl() %>% kable_styling()
team_id | year | week | date | month | opponent | home_away | weekly_attendance | tot_season_att | tot_home_att | tot_away_att | win | tie | loss | cumwins | cumloss | cumtie | totwins | totloss | win_pct | pct_home | pct_away |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Arizona Cardinals | 2000 | 1 | 2020-09-03 | 09 | New York Giants | away | 77434 | 893926 | 387475 | 506451 | 0 | 0 | 1 | 0 | 1 | 0 | 3 | 13 | 0.0000000 | NA | 0.1528953 |
Arizona Cardinals | 2000 | 2 | 2020-09-10 | 09 | Dallas Cowboys | home | 66009 | 893926 | 387475 | 506451 | 1 | 0 | 0 | 1 | 1 | 0 | 3 | 13 | 0.5000000 | 0.1703568 | NA |
Arizona Cardinals | 2000 | 3 | NA | NA | NA | NA | NA | 893926 | 387475 | 506451 | 0 | 0 | 0 | 1 | 1 | 0 | 3 | 13 | 0.5000000 | NA | NA |
Arizona Cardinals | 2000 | 4 | 2020-09-24 | 09 | Green Bay Packers | home | 71801 | 893926 | 387475 | 506451 | 0 | 0 | 1 | 1 | 2 | 0 | 3 | 13 | 0.3333333 | 0.1853049 | NA |
Arizona Cardinals | 2000 | 5 | 2020-10-01 | 10 | San Francisco 49ers | away | 66985 | 893926 | 387475 | 506451 | 0 | 0 | 1 | 1 | 3 | 0 | 3 | 13 | 0.2500000 | NA | 0.1322635 |
Arizona Cardinals | 2000 | 6 | 2020-10-08 | 10 | Cleveland Browns | home | 44296 | 893926 | 387475 | 506451 | 1 | 0 | 0 | 2 | 3 | 0 | 3 | 13 | 0.4000000 | 0.1143196 | NA |
head(pbp_weath) %>% kbl() %>% kable_styling()
game_id | year | home_team | away_team | season_type | week | temp | roof | weath_type |
---|---|---|---|---|---|---|---|---|
2000_01_ARI_NYG | 2000 | NYG | ARI | REG | 1 | 80 | outdoors | NA |
2000_01_BAL_PIT | 2000 | PIT | BAL | REG | 1 | 74 | outdoors | NA |
2000_01_CAR_WAS | 2000 | WAS | CAR | REG | 1 | 80 | outdoors | NA |
2000_01_CHI_MIN | 2000 | MIN | CHI | REG | 1 | NA | dome | NA |
2000_01_DEN_STL | 2000 | LA | DEN | REG | 1 | NA | dome | NA |
2000_01_DET_NO | 2000 | NO | DET | REG | 1 | NA | dome | NA |
The final cleaning will need to come in the form of matching the team_id
variables to do a final join that will help with analysis.
The type_weath
from pbp_weath
will also need to be string searched for this to be used as a reasonable character variable.
To add clarity to visuals, below code will add team logos and colors:
logo <- select(teams_colors_logos,team_abbr,team_name,team_logo_wikipedia,team_logo_espn)
team_colors <- teamcolors %>% filter(league=='nfl')
Further, to make plotting and summarizing easier, the below code will join together the pbp_weath
and logos
datasets with the by_game
to create a final_data
set that has each team and their games (home and away) and all possible info for those games.
Warning for this set: it is not unique and will need to likely need distinct()
or filter()
operations to just look at games uniquely
## add team logo and abbreviation info (abbr is needed to create id to join to weather)
by_game[which(str_detect(by_game$team_id,'Washington Redskins')==TRUE),'team_id'] <- 'Washington Football Team'
by_game[which(str_detect(by_game$opponent,'Washington Redskins')==TRUE),'opponent'] <- 'Washington Football Team'
team_colors[which(str_detect(team_colors$name,'Washington Redskins')==TRUE),'name'] <- 'Washington Football Team'
by_game <- by_game %>% left_join(logo,by=c('team_id'='team_name')) %>% left_join(logo,by=c('opponent'='team_name'))
names(by_game) <- names(by_game) %>%
str_replace_all(string=.,pattern='\\.x$',replacement='') %>%
str_replace_all(string=.,pattern='\\.y$',replacement='_opp')
## Creating the game_id key to be used to join accurately with weather data
by_game <- by_game %>%
mutate(game_id=str_c(year,sprintf('%02d',week)
,ifelse(home_away=='away'
,team_abbr,team_abbr_opp)
,ifelse(home_away=='home',team_abbr,team_abbr_opp),sep='_'))
##joining weather and game/attendance data into final dataset
final_data <- by_game %>%
left_join(pbp_weath,by=c('year','week','game_id')) %>%
select(-season_type)
## adjusting date to scale January to be viewed at end of season
final_data$date[which(final_data$date<='2020-01-31')] <- final_data$date[which(final_data$date<='2020-01-31')] %>%
ymd(.) +years(1)
Visualizing the weather string data, there are large inconsistencies with how a weather event is reported. Stringr
extension of tidyverse
will be used to detect string values that indicate anything like:
and mutate
it to a single 2-factor classification:
the regex
function allows for us to ignore the case of the string. Also, there are strings in which there is Zero Chance or No Chance when referring to rain. Those will be made exceptions to be removed in the case_when
final_data <- final_data %>% mutate(precipitation= case_when(
str_detect(string=weath_type,pattern=regex('rain|snow|shower',ignore_case = T))==TRUE &
str_detect(string=weath_type,pattern=regex('Zero|(?<!s)No',ignore_case=T))!=TRUE ~'Precipitation',
TRUE ~'none'
)
)
Amount of outdoor games is given in the table below. since 2000, they have averaged 77%.
The second table visualizes this trend by year to confirm that the annual samples remain relatively consistent in terms of the number of outdoor games being looked at:
ct_table <- pbp_weath %>%
mutate(roof=replace(roof,roof=='closed','dome'),roof=replace(roof,roof=='open','outdoors')) %>%
group_by(roof) %>% tally(name='Count') %>% mutate(Count=as.numeric(Count)) %>%
mutate(pct_outdoor=round(Count/sum(Count),2))
ct_table_2 <- pbp_weath %>%
mutate(roof=replace(roof,roof=='closed','dome'),roof=replace(roof,roof=='open','outdoors')) %>%
group_by(year,roof) %>% tally(name='Count') %>% mutate(Count=as.numeric(Count)) %>% spread(roof,Count) %>%
rowwise() %>% mutate(pct_outdoor=round(outdoors/sum(dome,outdoors),2))
#ct_table <- ct_table %>%
#rbind(c('Total',sum(ct_table$Count),sum(ct_table$pct))) %>% mutate(Count=as.numeric(Count))
ct_table %>%
kbl(caption='Dome vs. outdoor games 2000-2019') %>%
kable_styling('hover',full_width=F)
roof | Count | pct_outdoor |
---|---|---|
dome | 1157 | 0.23 |
outdoors | 3945 | 0.77 |
ct_table_2 %>%
kbl(caption='Dome vs. Outdoor by Year') %>%
kable_styling('hover',full_width = F)
year | dome | outdoors | pct_outdoor |
---|---|---|---|
2000 | 48 | 198 | 0.80 |
2001 | 48 | 200 | 0.81 |
2002 | 51 | 205 | 0.80 |
2003 | 53 | 203 | 0.79 |
2004 | 51 | 205 | 0.80 |
2005 | 46 | 210 | 0.82 |
2006 | 58 | 198 | 0.77 |
2007 | 58 | 198 | 0.77 |
2008 | 50 | 206 | 0.80 |
2009 | 62 | 194 | 0.76 |
2010 | 65 | 191 | 0.75 |
2011 | 68 | 188 | 0.73 |
2012 | 68 | 188 | 0.73 |
2013 | 70 | 186 | 0.73 |
2014 | 61 | 195 | 0.76 |
2015 | 60 | 196 | 0.77 |
2016 | 62 | 194 | 0.76 |
2017 | 59 | 197 | 0.77 |
2018 | 60 | 196 | 0.77 |
2019 | 59 | 197 | 0.77 |
To begin analysis of how weather effects attendance, we will first look at the weekly trends for each.
Below gives basic scatter plot and trend of weekly attendance by week:
pbp_weath %>%
ggplot(.,aes(x=week,y=temp))+
geom_point(alpha=.5)+
geom_smooth()+
ggtitle(label='Temperature Trends Across a Season',subtitle='All Games 2000-2019')+
scale_y_continuous(name='Game Temperature')+
scale_x_continuous(name='NFL Week Number')
It is clear, given the trend line, that there is a substantial downward trend in temperature of outdoor games as the season progresses (as is expected given season in which the sport is played).
For comparison across the same time period, two graphs below visualize how attendance trends across a season.
The first is a general scatter for attendances by date from 2000-2019 The second takes the same plot and overlays it by type of stadium determined by roof
variable:
final_data %>% distinct(game_id,.keep_all = T) %>%
ggplot(.,aes(x=date,y=weekly_attendance))+
geom_point(alpha=.8,color='LightBlue')+
geom_smooth(color='Blue')+
scale_y_continuous(name='Weekly Attendance',labels=scales::comma)+
scale_x_date(name='Date of Game')+
ggtitle(label='Attendance Trends Across a Season',subtitle='All Games 2000-2019')
final_data %>% distinct(game_id,.keep_all = T) %>% filter(!is.na(roof)) %>%
ggplot(.,aes(x=date,y=weekly_attendance,fill=roof))+
geom_point(aes(color=roof),alpha=.2)+
geom_smooth(color='DarkBlue',method='loess')+
#stat_summary(fun.y = "mean", geom = "point", size = 3, aes(color = roof), shape = 21, color = "white", stroke = 1,alpha=.6)+
facet_grid(~roof)+
scale_y_continuous(name='Weekly Attendance',labels=scales::comma)+
scale_x_date(name='Date of Game')+
ggtitle(label='Attendance Trends Across a Season',subtitle='All Games 2000-2019 by Stadium Type')
As from prior analysis on Outdoor Splits, ~80% of all games are currently played outdoors, so clearly from the plots it has the most leverage on any manipulation done in this analysis.
Overall, this visual doesn’t give a clear indication of change in attendance across a season; however, there is some small dips in the smoothed-fit in December.
While this may be due to weather, it could also be a series of things (such as lost interest on losing teams), so further investigation into the types of teams that each of the above stadium groups contain.
Final basic summary will be a basic scatter plotting the two variables against each other.
final_data %>% distinct(game_id,.keep_all = T) %>%
ggplot(.,aes(x=temp,y=weekly_attendance))+
geom_point(alpha=.8,color='lightblue')+
geom_smooth(color='Blue')+
scale_y_continuous(name='Weekly Attendance',labels=scales::comma)+
scale_x_continuous(name='Temperature of Game')+
ggtitle(label='Temperature of Game by Attendance',subtitle='All Games 2000-2019 by Outdoor Stadiums')
The plot attempting to correlate the Weekly Attendance and Temperature variables gives better context into the data. There is clearly some lower outdoor attendances even at moderate temperatures.
These outliers could be due to other differences such as stadium capacity, which has no clear way of being captured using available data. This will be explored in the next section.
To give more visual clarity into the distributions of each stadium type, we can analyze the boxplot and histograms for each group below.
final_data %>% distinct(game_id,.keep_all = T) %>% ggplot(.,aes(y=weekly_attendance,x=roof))+
# geom_histogram(aes(y=..density..),color='black',alpha=.5,bins=20,position='identity')+
geom_boxplot(aes(fill=roof),alpha=.7)+
scale_y_continuous(name='Weekly Attendance',labels=scales::comma)+
ggtitle('Boxplot Comparison of Stadium Type Distributions')
final_data %>% distinct(game_id,.keep_all = T) %>% ggplot(.,aes(x=weekly_attendance))+
facet_wrap(vars(roof),scales='free_y')+
geom_histogram(color='black',fill='dodgerblue',alpha=.5,bins=20)+
geom_vline(aes(xintercept=mean(weekly_attendance,na.rm=T)),linetype='dashed',color='red',alpha=.8,lwd=1)+
ggtitle('Histogram of each Stadium Type')
The mean is overall highest for the closed roof stadiums.the boxplot shows that this is skewed by more attendance values that are higher than the rest of the league. Some of this could be because several stadiums of this type being relatively new and relatively higher in capacity. The overall distribution for these appears bimodal with seemingly a distribution forming around the overall mean attendance of ~67,000 , and another distribution forming about the ~80,000 range.
Comparing to the open roof classification, this distribution appears almost approximately normal about the overall mean of ~67,000 ; however, has a larger left tail in its histogram (making it left-skewed). There is further evidenced in the boxplot, with a large number of attendances below the 1.5IQR range used by the plot.
Using domain knowledge to give a hypothesis for the above observations, the Dallas Cowboys moved into AT&T Stadium with a capacity of about 80,000 in 2009. Given the relative low number of observations in closed classification, these games at 80,000 could be because of a move like this.
Further, to explain the lower open stadiums, there have been several teams playing in stadiums with lower capacities while making a move to a new one. The biggest example of this was the Los Angeles Chargers who played at Dignity Health Sports Park from 2017-2019–who’s smaller capacity of ~30,000 was meant to accommodate soccer–while awaiting the opening of SoFi Stadium this season.
The below graphs will investigate the teams in more detail:
final_data %>%
#distinct(game_id,.keep_all = T) %>%
left_join(teams_colors_logos[,c('team_abbr','team_color')],by=c('team_abbr')) %>%
select(team_abbr,team_logo_wikipedia,home_away,year,team_color,weekly_attendance) %>%
group_by(team_abbr,team_logo_wikipedia,team_color,home_away,year) %>%
filter(home_away=='home') %>%
summarise(avg_att=mean(weekly_attendance)) %>%
arrange(avg_att) %>% ungroup() %>% top_n(-15) %>%
ggplot(.,aes(x=reorder(str_c(team_abbr,year),avg_att),y=avg_att))+
geom_col(aes(fill=team_color,color=team_color),position='identity',alpha=.75)+
scale_color_identity(aesthetics = c('color','fill'))+
geom_image(aes(image=team_logo_wikipedia))+
geom_label(aes(y=0,label=year))+
xlab('Team, Year')+
ylab('Average Attendance Per Game')+
scale_y_continuous(labels=scales::comma)+
theme(
axis.text.x=element_blank(),
axis.ticks.x=element_blank()
)+
ggtitle('Bottom 15 Seasons in Terms of per Game Attendance')
As suspected, the Chargers have the 3 lowest seasons in terms of attendance since 2000. However, They were limited in capacity.
The Cowboys have 10 of the top 15 seasons in terms of attendance and all of them come after the new stadium opening in 2009.
As a result, stadium capacity can be considered the likely reason for many of the attendance outliers in the dataset.
Given that temperature appears that it may not have a significant impact on attendance, we will also look at the weather descriptions for games in which that info is available.
Weather descriptions have only been recorded in the nflfastR
datasets to a limited capacity since 2013, so the data for those games is relatively limited; however, there are some important insights that indicate it having an attendance impact.
Each available string was searched for strings like ‘rain’,‘snow’,‘shower’,etc. , and for those that contained values such as this, are given a ‘Precipitation’ classification–indicating there was precipitation for the game. The classification is ‘none’ if there is no evidence from the weather description.
Below is a high-level summary of the average league attendance for games with precipitation vs. not:
final_data %>% filter(!is.na(weath_type) & roof%in%c('outdoors','open') & year>2014 & home_away=='home') %>%
group_by(year,precipitation) %>% summarise(avg_att=mean(weekly_attendance,na.rm=T)) %>% ungroup() %>%
ggplot(aes(x=year,y=avg_att))+
geom_line(aes(color=precipitation))+
geom_label(aes(y=avg_att,label=round(avg_att,0)))+
scale_y_continuous(name='Average Attendance',labels=scales::comma)+
ggtitle('Average League Attendance for Rainy/Snowy games vs. Fair Weather')
This gives some evidence using 2015-2019 data to indicate there is a difference between rainy/snowy attendance; however, it fluctuates between which average is higher YoY. 2015,2018,2019 all have higher attendances for rainy/snowy games, but the others do not, which indicates potential for other factors skewing the information.
The best solution to see this further, is to subset the analysis by isolating it to each team.
Below graph will compare differences for each team:
final_data %>% filter(!is.na(weath_type) & home_away=='home' & year>2014) %>% select(precipitation,team_abbr,weekly_attendance,team_logo_wikipedia) %>%
group_by(team_abbr,precipitation,team_logo_wikipedia) %>%
summarise(avg_att=mean(weekly_attendance,na.rm=T),count=n()) %>%
ggplot(aes(x=reorder(team_abbr,avg_att),y=avg_att))+
geom_col(aes(x=reorder(team_abbr,avg_att),y=avg_att,fill=precipitation,group=precipitation),position='dodge',alpha=.8)+
coord_flip()+
geom_image(aes(y=0,image=team_logo_wikipedia,group=team_abbr),position='identity',hjust=-.5)+
scale_y_continuous(name='Average Attendance',labels=scales::comma)+
theme(
# legend.position = c(1,0),
axis.text.y=element_blank(),
axis.title.y=element_blank(),
axis.ticks.y=element_blank()
)+
ggtitle('Attendance by Team for Rainy/Snowy Games vs. Normal Games')
There appears to be some small differences in the attendance that support the claim of precipitation negatively effecting attendance. To make these margins clearer, the below barplot is made for comparison of the margins:
final_data %>% filter(!is.na(weath_type) & home_away=='home' & year>2014) %>% select(precipitation,team_abbr,weekly_attendance,team_logo_wikipedia) %>%
group_by(team_abbr,precipitation,team_logo_wikipedia) %>%
summarise(avg_att=mean(weekly_attendance,na.rm=T)) %>% spread(key=precipitation,value=avg_att) %>% mutate(diff=Precipitation-none) %>%
gather(key=var,value=val,none:diff) %>% filter(var=='diff') %>% mutate(val=replace_na(val,0)) %>%
ggplot(aes(x=reorder(team_abbr,val),y=val,fill=val))+
geom_bar(stat='identity',alpha=.9,position='identity')+
scale_color_gradient2(name='Attendance Change',low='red',mid='green',high='red',aesthetics = c('fill'),guide='colorbar')+
scale_color_gradient2(name='Attendance Change',low='red',mid='green',high='red',aesthetics = c('color'),guide='colorbar')+
coord_flip()+
scale_y_continuous(name='Change in Attendance',labels=scales::comma)+
geom_image(aes(y=0,image=team_logo_wikipedia))+
theme(axis.title.y=element_blank(),
axis.text.y=element_blank(),
axis.ticks.y=element_blank(),
legend.justification = c(1,0),
legend.position=c(1,0))+
ggtitle(label='Attendance Change for games with Precipitation',subtitle='Negative values indicate an average decrease')
We see that teams such as the Jaguars, Steelers and Washington have average decreases of more than 5,000 since 2015.
This is, however, not consistent. Many teams have much smaller decreases, and some exceptions of increases. For teams with little to no change, there is likely very little to no rainy/snowy weather data to compare.
Given that there is some small evidence of different aspects of weather effecting attendance, we will also investigate other potential causes for attendance changes at any point in a season. One of the biggest could be the interest of fans in attending a game with a winning team vs. a losing team.
Below will summarize live winning percentage for a team plotted against their weekly attendance.
final_data %>% filter(home_away=='home') %>% select(weekly_attendance,win_pct,roof) %>% filter(!is.na(roof)) %>%
ggplot(aes(x=win_pct,y=weekly_attendance))+
geom_point(alpha=.2,color='Blue',fill='blue')+
geom_smooth(color = "DarkBlue", alpha = 0.8)+
#stat_summary(fun.y = "mean", geom = "point", size = 3, fill='Black', shape = 21, color = "white", stroke = 1,alpha=.5) +
scale_y_continuous(name='Weekly Attendance',labels=scales::comma)+
scale_x_continuous(name='Winning Percentage',labels=scales::comma)+
ggtitle('Weekly_Attendance by Winning Percentage','Data 2000-2019')
There is similarly a small trend change in the later months of the season. The higher the win percentage, the higher the up-tick in attendance.
To further visualize at a team level, the below plot shows the same scatter summary; however uses the mean attendance for each team.
avg_overall_att <- final_data %>% ungroup() %>% filter(home_away=='home') %>% summarise(mean(weekly_attendance)) %>% as.data.frame()
avg_overall_win_pct <- final_data %>% ungroup() %>% filter(home_away=='home') %>% summarise(mean(win_pct)) %>% as.data.frame()
final_data %>%
#distinct(game_id,.keep_all = T) %>%
left_join(teams_colors_logos[,c('team_abbr','team_color')],by=c('team_abbr')) %>%
left_join(team_colors[,c('name','division')],by=c('team_id'='name')) %>%
select(team_id,team_abbr,team_logo_wikipedia,home_away,year,team_color,division,weekly_attendance,win_pct) %>%
filter(home_away=='home'& team_abbr!='NA') %>% mutate(team_abbr=replace(team_abbr,team_abbr=='LA','LAR')) %>%
group_by(team_abbr,team_logo_wikipedia,team_color,division,home_away) %>%
summarise(avg_att=mean(weekly_attendance,na.rm=T),avg_win_pct=mean(win_pct)) %>%
arrange(avg_att) %>% ungroup() %>% mutate(rank=min_rank(-avg_att)) %>%
ggplot(.,aes(x=avg_win_pct,y=avg_att))+
scale_color_identity(aesthetics = c('color','fill'))+
geom_image(aes(image=team_logo_wikipedia))+
xlab('Avg. Winning Percentage')+
ylab('Average Attendance Per Game')+
scale_y_continuous(labels=scales::comma)+
geom_hline(yintercept=avg_overall_att[1,1],linetype='dashed',color='red')+
geom_vline(xintercept=avg_overall_win_pct[1,1],linetype='dashed',color='red')+
ggtitle('NFL Teams in Terms of per Game Attendance','Seasons 2000-2019')
Looking at the general trend for each team, it appears to be close to positively linear (with exception of Browns and 2017-2019 Chargers).
Several of these teams with above average winning tendencies are at or near their stadium capacities, so even team such as Seahawks and Steelers in the lower-right quadrant actually have very good attendance utilization as a part of winning.
This graph is helpful in showing that there is a general upward trend in attendance with winning, but there are still several factors such as capacity and regional sporting interest that this analysis does not take into account.
To review these were two of the primary problems addressed:
Does temperature and weather(rain/snow) significantly effect attendance for all teams, winning teams or losing teams? How does it compare to other predictors involved?
This was addressed through the use of adjoining NFL datasets on 2000-2019 attendance with 2000-2019 game-by-game weather data that included temperature and condition reports.
The data is graphically analyzed to compare outdoor games to indoor games spanned across a season as weather gets harsher. Doing so,we can conclude that there are small downward trends in attendance as the season progresses for outdoor games. Attendance actually increases for indoor games. This could support a conclusion that colder weather effects attendance; however, it there are several other reasons also uncovered in analysis.
Further, temperature and precipitation were mapped directly against attendance to find if there is any direct relation. The plots indicated downward trends at the more extreme values; however, was very small for cold weather while fairly large dips for hot weather. These trends may not be consistent given the limited data it was being based on at the extreme values. Limited data indicates they are high-leverage observations that may not be fully indicative of the population.
To add further context to the analysis, a deeper-dive into the distributions by stadium type and by team is performed. This was done due to the various aspects of team and stadium that could also effect total attendance. It was also found that many variables being analyzed were, in fact, skewed by stadium capacity and teams with newer and larger stadiums with indoor options. There was not sufficient enough capacity data for all stadiums over the past 20 years to perform a more robust analysis using relative attendance (as a % utilization), so the comparisons were broken out at a team level to better visualize the team context.
Using team-level graphs, rainy/snowy games were compared to the normal weather games using 2015-2019 weather description data. Despite limited data, it was found that most teams had very little to no change in average attendance. There were however interesting exceptions of a few teams with a few thousand person difference in attendance.
It was also found that winning percentage has a positive effect on late season attendance, thus leading to some of the season-long trends as well. It is also only a small difference, but is noticable as well.
Quantify the effect of weather and success on attendance to help teams understand how to maximize return on stadium projects, ticket sales, promotions and overall attendance revenue.
This analysis gives some basic visual analysis into the effect of various weather aspects of attendance in relation to other team-oriented aspects. This can be used as a guide towards furthering analysis into more detailed sales analysis using more robust forms of attendance.
As already discussed above, the limitations of this model mostly lie in it’s lack of robust transformation to the attendance metrics. There are various team and stadium-related aspects in each season that need to be accounted for in stronger variables.
A next step improvement would be something as simple as scraping for stadium capacity data to begin looking at attendance as a percentage of capacity to better scale its expectations. In addition, further analysis using the more robust metrics could be performed to make statistically significant conclusions. These techniques include regression, ANOVA analysis and significance testing of the temperature and precipitation data to confirm significant insights.