Abstract:

In this study, attendance and weather data for every NFL game from 2000-2019 is analyzed to determine if weeks with lower game time temperatures have a significant negative effect on attendance. To do this, game information data is subsetted into two primary groups: Indoor/closed for all indoor/closed roof games and Outdoor for all games played outdoors. These sets are attempted to be assigned parametric MLE’s for a normal distribution; however, it is found the skewness of the distributions are better suited toward non-parametric bootstrap analyses when drawing inference. A bootstrap analysis is used to help identify week groupings of the NFL season with significant temperature difference (wk 1-12 vs wk 13-17). Using these buckets to further subset the data, non-parametric bootstrap simulations create Empirical Distributions to test the difference in means between the early and late season buckets for each stadium type. A hypothesis test on these Empirical Distributions concludes with significance that outdoor games will on average yield between 415 and 1,720 fewer attendees after week 12 of the season (95% confidence). Indoor games do not conclude any significant difference using the same bootstrap analysis and week subsets. These results coincides with the drop in temperature to below 50 degrees on average after week 12, leading to the evidence that temperature does effect attendance at least mildly.

Introduction

Project Motivation

Football in the United States has one of the most unique spectating experiences of all professional North American sports. College football flaunts 8 stadiums with attendance capacities greater than 100,000 spectators. All 8 of these are amongst the top 10 largest stadiums in the world,according to data compiled by Wikipedia (not classifying motor sport venues as stadiums).

Looking specifically at the professional sporting industry, the National Football League (NFL) has continued to dominate the world in per game attendances of recent decades. 2007-2016 data from sportingintelligence.com compares annual attendances of various domestic sports leagues across the globe. The NFL has annually posted average per game attendances between ~67,000 - ~69,000, which is consistently >20,000 attendees higher than the next highest league: German Bundesliga (soccer). It is more than double the next highest attendance of US-based leagues (Major League Baseball (MLB) ~30,000 per game).Further data in this analysis will show that these trends have continued into 2019.

One of the most impressive things about these consistently high numbers, is that they have prevailed in a league that has played a majority of its games outdoors, and many of these games continue in a number of harsh conditions that come with playing during the fall and winter months: rain, snow,extreme cold, etc. It is one of the only major sports in the US that plays outdoors through inclement weather. The MLB plays during warmer summer months and cancels for rain. NBA and NHL are entirely indoor.

This type of study has become more relevant in recent years, as high-quality TV viewing options become more available with the advancement of video quality and instant media sharing.These can serve as viable replacement options of attending games. Also, many of the new stadiums are now being built with retractable roofs allowing for indoor play, making it possible that the sport could be moving away from outdoor play in the future.

Analytic Proposal

Problem Statement

Does temperature and weather significantly effect attendance for all teams? How does it compare to other predictors involved?

Output to Consumer

Quantify the effect of weather on attendance revenue to assist teams with decisions that that seek to maximize return on stadium projects, ticket sales, promotions and other factors that effect attendance revenue.

Data Description

Attendance

Thomas Mock & Chris Stehlik’s GitHub compilation will be used to extract attendance and basic data by game as well as season results. these sets pull from ProFootballReference.com.

The motivation of the data is to provide three different game sets that can be used to bridge together attendance and other factors like offensive/defensive performance, wins, losses, etc. It can be looked at as a whole or at a weekly level.

The sets include:

  • Attendance- gives total and weekly attendance by team and year from 2000-2019
  • Standings- gives season results by team and year. This includes wins, losses, points scored(for/against),strength of schedule, etc.
  • Games- gives basic data for each game from 2000-2019 such as: teams involved (home/away),date, score,winner,loser,etc.

Weather

nflfastr play-by-play data will be manipulated to extract temperature and weather values by unique game for 2000-2019.

Normally this data is used to analyze valuation and expected return from different decisions based on historic information of the play type. This analysis will only use a very small subset of this data to be joined with attendance.

This dataset will include game_id, teams involved, week number, and:

  • temp- temperature in Farhenheit for the game
  • roof- Indicates if the game was played outdoors, in dome, in a closed roof, in an open roof +Indoor games (dome and closed roof) will result in NA weather factors
  • weather_type- A cleaned version of the weather description that can indicate rain,snow,clouds,etc.

Data Exploration

Outdoor Splits

Amount of outdoor games is given in the table below. since 2000, they have averaged 77%.

The second table visualizes this trend by year to confirm that the annual samples remain relatively consistent in terms of the number of outdoor games being looked at:

Temperature Trend Bootstrap Analysis

To begin analysis of how weather effects attendance, we will first look at the weekly trends for each.

Below gives basic scatter plot and trend of weekly attendance by week:

It is clear, given the trend line, that there is a substantial downward trend in temperature of outdoor games as the season progresses (as is expected given season in which the sport is played).

To give further analysis into the weekly variance and standard errors, a bootstrap sample distribution is used with 2000 samples to give mean and standard deviation estimates.

It is seen in the results below that the \(\hat{se}\) for the mean temperature by week doubles from 0.55 to 1.02 with the temperatures taking their largest decreases from week 13 on.

Visualizing the ECDF for both the all-in data and the late-season data (week 13 and beyond) and comparing them with their 95% confidence bands there is much higher variance at the extremes partially due to less data; however, is an interesting trend to explore further with attendance.

These buckets of wk 1-12 and wk 13-17 can now be used as two-sample comparisons for later hypothesis testing.

Attendance MLE Analysis by Stadium Type

To give more visual clarity into the distributions of each stadium type, we can analyze the histograms for each group and see that both follow a distribution that is close to normal.

Because of the near-normality, a parametric MLE is given for each parameter: mean(\(\hat{\mu}\)) , Std. Deviation (\(\hat{\sigma}\)) , and the pdf is fitted against the actual histograms below:

MLE estimate results for a normal distribution
Mean Est. SD Est.
Indoor 66,551 8,479
Outdoor 66,903 9,558

In comparing the distributions to the fit, it is approximately close; however, it is likely still estimating poorly at the tails.

outdoor games appear to have left-skewed data from smaller stadiums, including Los Angeles Chargers playing at Dignity Health Sports Park from 2017-2019 with ~20,000 capacity closed roof games appear to have more right skewed data that is likely the result of newer stadiums with roof options, such as AT&T Stadium of the Dallas Cowboys, having higher capacities of about 80,000+

Because of the skewness, an Empirical Bootstrap approach will be used as sampling distribution for further hypothesis testing.

Conclusion

Upon running bootstrap samples to differentiate the drop in NFL game temperatures time of season, further bootstrapping and non-parametric hypothesis testing was used to claim that the weeks of lower temperatures did in fact have statistically significant lower attendances than weeks with higher temperatures. This conclusion alone does not directly indicate a causal relationship between temperature and attendance; however, a conclusion of insignificance using the same tests on indoor games (where weather is not effected) may strengthen a conclusion for a causal relationship.

While the difference for outdoor games was proven with statistical significance due to the consistent low variance of large outdoor samples, the mean difference is likely only between 415 and 1,720 attendees lower during colder/later weeks. Depending on revenue loss due to these drops in attendance, these may not be considered to have domain significance when comparing to 60,000-70,000 person attendances on average.

Other factors could, however, be theorized. One weakness of this model is that it doesn’t take into account any stadium capacity data. Therefore, some of the smaller/more variant samples within the indoor set (only 23% of total games) may be effected with higher leverage to the changes in having newer 80,000+ capacity stadiums mixed with older and lower capacity domes ~40,000 in capacity. Because of this, the sample may not be properly accounting for changes. Isolating indoor and outdoor games for comparison should have helped combat some of this, but the smaller sample of stadiums and the inconsistency of the weeks these stadiums are scheduled to play YoY may not be consistent.

Further, in theory, there are other factors that could also be effecting a lower late-season attendance. For example, the cumulative win % of the home team going into the game could have an effect. Fans of losing/non-contending teams (<.500 win %) may be more likely to lose interest in attending games when their team is not competing and/or vice-versa (Fans of winning/contending teams may have more attendance demand for meaningful late season games).

The second part of this analysis would seek to factor in capacity utilization rather than raw attendance, seek to run some of the same tests for subgroups of winning teams vs. losing teams, use the weather_description fields to pull keyword text for rain/snow or other ‘bad-weather’ indicators and run similar two-sample tests with these subsets.

Appendix

Packages

Packages used in this analysis are read in below.

Package Descriptions

  • tidyverse- Includes various sub-packages to help with cleaning and manipulating the data. The core sub-packges include:
    • dplyr- For data-frame editing and manipulation
    • ggplot2- For graphical visualization of data
    • tibble- For clean and efficient creation and manipulation of data frames.
  • readr- To import the .csv data and export sub-data created from nflfastR
  • kableExtra- For neat outputting of data tables to HTML format
  • nflfastR- For assisting with importing the play-by-play data that will be condensed and truncated to feed weather data into the final datasets
  • lubridate- Extenstion of tidyverse for date manipulation
  • teamcolors- Expansive package of team color and logo data that is an extension of nflfastr
  • ggimage- For use with ggplot2 to assist in adding graphics to a plot
  • gridExtra- For further combining plots into single grids

Rcode:

library(tidyverse)
library(readr)
library(kableExtra)
library(nflfastR)
library(lubridate)
library(teamcolors)
library(ggimage)
library(bootstrap)
library(lattice)
library(latticeExtra) 
library(grid)
library(gridExtra)

rm(attendance)
rm(games)
rm(standings)
rm(home_games)
rm(away_games)
rm(join)
rm(by_game)
rm(final_data)

# nflfastr weather opt ----------------------------------------------------

# seasons <- 2000:2019
# pbp <- purrr::map_df(seasons, function(x) {
#   readRDS(
#     url(
#       glue::glue("https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_{x}.rds")
#     )
#   )
# })
# 
# 
# 
# pbp_weath <- pbp %>% 
#   mutate(year=as.integer(substr(game_id,1,4))) %>% 
#   select(game_id,year,home_team,away_team,season_type,week,weather,temp,roof) %>% 
#   unique() %>% 
#   mutate(weath_type=substr(weather,1,str_locate(weather,'Temp:')-1)) %>% 
#   filter(season_type=='REG') %>% 
#   select(-weather)
#   ###separate(weather,c('status','temp'),sep='temp')



#write_csv(pbp_weath,path='C:/Users/tgobi/OneDrive/MSBANA/Fall2020/Data_Wrangling/Project_Options/nflweather.csv')
pbp_weath <- read_csv('C:/Users/tgobi/OneDrive/MSBANA/Fall2020/Data_Wrangling/Project_Options/nflweather.csv',
                      col_types=list(col_character(),col_double(),col_character(),col_character(),col_character(),col_double(),col_double(),col_character(),col_character()))

# Import Attendance Data --------------------------------------------------


attendance <- read_csv('attendance.csv')
standings <- read_csv('standings.csv')
games <- read_csv('games.csv')

head(attendance) %>% kbl(caption='View of Attendance Data') %>% kable_styling()
head(standings) %>% kbl(caption='View of Standings Data') %>% kable_styling()
head(games) %>% kbl(caption='View of Games Data') %>% kable_styling()


attendance <- 
  attendance %>% 
  unite(c(team,year,week),col=game_id,remove=FALSE) %>% 
  unite(c(team,team_name),col=team_id,remove=F,sep=" ")

tot_att <- attendance %>% 
  select(-weekly_attendance,-week,-game_id) %>% 
  rename(tot_season_att=total,tot_home_att=home,tot_away_att=away) %>% 
  group_by(year,team_id) %>% 
  unique(.)

head(tot_att) %>% kbl() %>% kable_styling()

tot_att %>% group_by(year) %>% tally()%>% filter(n!=32) %>% kbl(caption='Unique Teams by Year') %>% kable_styling(full_width = F) 


# combining weekly data ---------------------------------------------------


games <- games %>% mutate(week=as.double(week)) %>% filter(.,!is.na(week))


home_games <- games %>%  select(year,week,date,home_team,away_team,winner,tie) %>%
  rename(team_id=home_team,home_opponent=away_team,home_winner=winner,home_date=date) %>% 
  mutate(home_win=ifelse(team_id==home_winner,1,0),home_tie=ifelse(!is.na(tie),1,0))
home_games

away_games <- games %>%  select(year,week,date,away_team,home_team,winner,tie) %>%
  rename(team_id=away_team,away_opponent=home_team,away_winner=winner,away_date=date) %>% 
  mutate(away_win=ifelse(team_id==away_winner,1,0),away_tie=ifelse(!is.na(tie),1,0))
away_games






join <- attendance %>%  
  left_join(.,home_games,by=c('year','week','team_id')) %>% 
  left_join(.,away_games,by=c('year','week','team_id')) %>% 
  left_join(.,tot_att,by=c('year','team_id')) %>% 
  mutate(opponent=ifelse(is.na(home_opponent),away_opponent,home_opponent)) %>% 
  select(-home_opponent,-away_opponent)
head(join) %>% kbl() %>% kable_styling()

by_game <- join %>% 
  mutate(date=ifelse(is.na(home_date),away_date,home_date)) %>% 
  mutate(date=as.Date(date,"%B %d")) %>% mutate(month=format(date,"%m")) %>% #mutate(month=as.integer(month)) %>% 
  mutate(home_away=ifelse(is.na(date),NA,ifelse(is.na(home_win),'away','home'))) %>% 
  
  select(team_id,year,week,date,month,opponent,home_away,weekly_attendance,tot_season_att,tot_home_att,tot_away_att,home_win,home_tie,away_win,away_tie) %>% 
  rowwise() %>% mutate(win=sum(home_win,away_win,na.rm=T),tie=sum(home_tie,away_tie,na.rm=T)) %>% 
  rowwise() %>% mutate(loss=ifelse(win==0 & tie==0 & !is.na(weekly_attendance),1,0)) %>% 
  group_by(team_id,year) %>% mutate(cumwins=cumsum(win),cumloss=cumsum(loss),cumtie=cumsum(tie)) %>% 
  mutate(totwins=sum(win),totloss=sum(loss)) %>% 
  rowwise() %>% mutate(win_pct=cumwins/sum(cumwins,cumloss,cumtie)) %>%
  rowwise() %>% mutate(pct_home=ifelse(home_away=='home',weekly_attendance/tot_home_att,NA)) %>% 
  rowwise() %>% mutate(pct_away=ifelse(home_away=='away',weekly_attendance/tot_away_att,NA)) %>% 
  select(-home_win,-home_tie,-away_win,-away_tie)

#%>% 
#select(-weekly_attendance)







# Adding in team logos and indexes ----------------------------------------

logo <- select(teams_colors_logos,team_abbr,team_name,team_logo_wikipedia,team_logo_espn)


# Mass Join Datasets ------------------------------------------------------
logo <- select(teams_colors_logos,team_abbr,team_name,team_logo_wikipedia,team_logo_espn)
team_colors <- teamcolors %>% filter(league=='nfl')
# pbp_weath <- pbp_weath %>% 
#   left_join(logo,by=c('home_team'='team_abbr')) %>% 
#   left_join(logo,by=c('away_team'='team_abbr'))
# names(pbp_weath) <- names(pbp_weath) %>% 
#   str_replace_all(string=.,pattern='\\.x$',replacement='_home') %>% 
#   str_replace_all(string=.,pattern='\\.y$',replacement='_away')


## add team logo and abbreviation info (abbr is needed to create id to join to weather)
by_game[which(str_detect(by_game$team_id,'Washington Redskins')==TRUE),'team_id'] <- 'Washington Football Team'
by_game[which(str_detect(by_game$opponent,'Washington Redskins')==TRUE),'opponent'] <- 'Washington Football Team'


team_colors[which(str_detect(team_colors$name,'Washington Redskins')==TRUE),'name'] <- 'Washington Football Team'


by_game <- by_game %>% left_join(logo,by=c('team_id'='team_name')) %>% left_join(logo,by=c('opponent'='team_name'))
names(by_game) <- names(by_game) %>% 
  str_replace_all(string=.,pattern='\\.x$',replacement='') %>% 
  str_replace_all(string=.,pattern='\\.y$',replacement='_opp')



## Creating the game_id key to be used to join accurately with weather data
by_game <- by_game %>% 
  mutate(game_id=str_c(year,sprintf('%02d',week)
                       ,ifelse(home_away=='away'
                               ,team_abbr,team_abbr_opp)
                       ,ifelse(home_away=='home',team_abbr,team_abbr_opp),sep='_'))

##joining weather and game/attendance data into final dataset
final_data <- by_game %>% 
  left_join(pbp_weath,by=c('year','week','game_id')) %>% 
  select(-season_type)


stadium_cap <- read_delim('stadiums_nfl.csv',';')
str(stadium_cap)
names(stadium_cap) <- str_to_lower(names(stadium_cap))

stadium_cap <- stadium_cap %>% select(name,name2,name3,date_creat,date_modif,comp_affil,conference,division,roof_type,capacity,team)
## adjusting date to scale January to be viewed at end of season

final_data$date[which(final_data$date<='2020-01-31')] <- final_data$date[which(final_data$date<='2020-01-31')] %>% 
  ymd(.) +years(1)

final_data <- final_data %>% mutate(precipitation= case_when(
  str_detect(string=weath_type,pattern=regex('rain|snow|shower',ignore_case = T))==TRUE & 
    str_detect(string=weath_type,pattern=regex('Zero|(?<!s)No',ignore_case=T))!=TRUE ~'Precipitation',
  TRUE ~'none'
)
)


final_data %>% select(precipitation,weath_type) %>% filter(!is.na(weath_type)) %>% filter(precipitation=='Precipitation')





final_data[which(str_detect(final_data$team_id,'Washington Football Team')==TRUE),] %>% select(team_id,team_abbr,team_logo_wikipedia)



# Starting Analysis meh ----------------------------------------------------------------


head(by_game) %>% kbl() %>% kable_styling()
head(pbp_weath) %>% kbl() %>% kable_styling()


by_game %>% select(-team_id,-home_away,-year,-week,-date,-month,-opponent) %>% summary()
pbp_weath %>% group_by(week) %>%  summarise(min=min(temp,na.rm=T),q1=quantile(temp,probs=.25,na.rm=T),mean=mean(temp,na.rm=T),median=median(temp,na.rm=T),q3=quantile(temp,probs=.75,na.rm=T),max=max(temp,na.rm=T),sd=sd(temp,na.rm=T))


winner_by_game <- filter(by_game,totwins>8)
loser_by_game <- filter(by_game,totwins<=8)

by_game %>% select(win:pct_away)
par(mfrow=c(2,2))
plot(by_game$win_pct,by_game$pct_home,ylim=c(.05,.2))
plot(winner_by_game$win_pct,winner_by_game$pct_home,ylim=c(.06,.2))
plot(winner_by_game$date,winner_by_game$pct_home,ylim=c(.06,.2))
plot(loser_by_game$date,loser_by_game$pct_home,ylim=c(.06,.2))




#summary(by_game$weekly_attendance)

#by_game %>% group_by(year) %>% summarise(avg_weekly=mean(weekly_attendance,na.rm=T))


ct_table <- pbp_weath %>% 
  mutate(roof=replace(roof,roof=='closed','dome'),roof=replace(roof,roof=='open','outdoors')) %>% 
  group_by(roof) %>%  tally(name='Count') %>% mutate(Count=as.numeric(Count)) %>% 
  mutate(pct_outdoor=round(Count/sum(Count),2))

ct_table_2 <- pbp_weath %>% 
  mutate(roof=replace(roof,roof=='closed','dome'),roof=replace(roof,roof=='open','outdoors')) %>% 
  group_by(year,roof) %>%  tally(name='Count') %>% mutate(Count=as.numeric(Count)) %>% spread(roof,Count) %>% 
  rowwise() %>% mutate(pct_outdoor=round(outdoors/sum(dome,outdoors),2))


#ct_table <- ct_table %>% 
#rbind(c('Total',sum(ct_table$Count),sum(ct_table$pct))) %>% mutate(Count=as.numeric(Count))



ct_table %>% 
  kbl(caption='Dome vs. outdoor games 2000-2019') %>% 
  kable_styling('hover',full_width=F)
ct_table_2 %>% 
  kbl(caption='Dome vs. Outdoor by Year') %>% 
  kable_styling('hover',full_width = F)















# Weekly bootstraps of temp decrease cross season -------------------------


### Weekly bootstrap errors by week for temperature decrease

pbp_weath %>% 
  ggplot(.,aes(x=week,y=temp))+
  geom_point(alpha=.5)+
  geom_smooth()+
  ggtitle(label='Temperature Trends Across a Season',subtitle='All Games 2000-2019')+
  scale_y_continuous(name='Game Temperature')+
  scale_x_continuous(name='NFL Week Number')+
  stat_summary(fun.y = "mean", geom = "point", size = 3, color='DarkBlue', shape = 21, fill = "white", stroke = 1)

weather_group_boot <- data.frame(week=c(1:17),n=NA,mean=NA,mean_se=NA,sd=NA,sd_se=NA)

for (i in 1:17){
  boot_data <- pbp_weath %>% filter(week==i) %>% select(temp) %>% na.omit() %>% as.data.frame() %>% .$temp
  weather_group_boot[i,'n'] <- length(boot_data)

  boot_results <- bootstrap(boot_data,nboot=2000,mean,na.rm=T)
  weather_group_boot[i,'mean'] <- mean(boot_results$thetastar)
  weather_group_boot[i,'mean_se'] <- sqrt(var(boot_results$thetastar))
  
  rm(boot_results)
  
  boot_results <- bootstrap(boot_data,nboot=2000,sd,na.rm=T)
  weather_group_boot[i,'sd'] <- sd(boot_data)
  weather_group_boot[i,'sd_se'] <- sqrt(var(boot_results$thetastar))
  
  rm(boot_data)
  rm(boot_results)
  
}

weather_group_boot[,-(1:2)] <- weather_group_boot[,-(1:2)] %>% apply(c(1,2),round,digits=2)

weather_group_boot %>% kbl(format='latex',caption='Bootstrap Results for Weekly Temperature') %>% 
  kable_styling(full_width = F,latex_options = 'hold_position')


all_temp_data <- na.omit(pbp_weath$temp)
late_yr_data <- pbp_weath %>% filter(week>=13) %>% select(temp) %>% na.omit() %>% .$temp

temp.ecdf <- ecdf(all_temp_data)
late_yr_ecdf <- ecdf(late_yr_data)

Alpha <- .05

par(mfrow=c(1,2))
plot.ecdf(temp.ecdf,pch=1)
grid1<-seq(0,max(all_temp_data), length.out = 1000)
Eps1 <- sqrt(log(2/Alpha)/(2*length(all_temp_data)))
lines(grid1, pmin(temp.ecdf(grid1)+Eps1,1),lwd=2,col='red')
lines(grid1, pmax(temp.ecdf(grid1)-Eps1,0),lwd=2,col='green')

plot.ecdf(late_yr_ecdf,pch=1)
grid2<-seq(0,max(late_yr_data), length.out = 1000)
Eps2 <- sqrt(log(2/Alpha)/(2*length(late_yr_data)))
lines(grid2, pmin(late_yr_ecdf(grid2)+Eps2,1),lwd=2,col='red')
lines(grid2, pmax(late_yr_ecdf(grid2)-Eps2,0),lwd=2,col='green')





# hypothesis testing attendance trend of each stadium type ---------------------------------

## Overall Summary plot across season

final_data %>% distinct(game_id,.keep_all = T) %>% 
  ggplot(.,aes(x=date,y=weekly_attendance))+ 
  geom_point(alpha=.8,color='lightblue')+
  stat_summary(fun.y = "mean", geom = "point", size = 3, shape = 21, color = "white",fill='darkblue', stroke = 1,alpha=1)+
  geom_smooth()+
  scale_y_continuous(name='Weekly Attendance',labels=scales::comma)+
  scale_x_date(name='Date of Game')+
  ggtitle(label='Attendance Trends Across a Season',subtitle='All Games 2000-2019')


## Attendance visual trend by each Type of stadium
monthly_mean <- final_data %>% distinct(game_id,.keep_all = T) %>% filter(!is.na(roof)) %>% select(date,month,roof,weekly_attendance) %>% 
  mutate(ym=ymd(paste0(year(date),month,'01'))) %>% 
  group_by(ym,roof) %>% 
  summarise(mean=mean(weekly_attendance))


final_data %>% distinct(game_id,.keep_all = T) %>% filter(!is.na(roof)) %>% 
  ggplot(.,aes(x=date,y=weekly_attendance,fill=roof))+ 
  geom_point(aes(color=roof),alpha=.2)+
  geom_smooth(color='DarkBlue',method='loess')+
  #stat_summary(fun.y = "mean", geom = "point", size = 3, aes(color = roof), shape = 21, color = "white", stroke = 1,alpha=.6)+
  geom_point(data=monthly_mean,aes(x=ym,y=mean,fill=roof),size = 3, shape = 21, color = "white", stroke = 1,alpha=.85)+
  facet_grid(~roof)+
  scale_y_continuous(name='Weekly Attendance',labels=scales::comma)+
  scale_x_date(name='Date of Game')+
  ggtitle(label='Attendance Trends Across a Season',subtitle='All Games 2000-2019 by Stadium Type')

## Attendance trend by each re-grouped for dome and open stadium types

roof_data <- final_data %>% 
  mutate(roof=replace(roof,roof=='dome','closed')) %>% 
  mutate(roof=replace(roof,roof=='open','outdoors'))

monthly_mean_roof <- roof_data %>% distinct(game_id,.keep_all = T) %>% filter(!is.na(roof)) %>% select(date,month,roof,weekly_attendance) %>% 
  mutate(ym=ymd(paste0(year(date),month,'01'))) %>% 
  group_by(ym,roof) %>% 
  summarise(mean=mean(weekly_attendance))

  
roof_data %>% distinct(game_id,.keep_all = T) %>% filter(!is.na(roof)) %>% 
  ggplot(.,aes(x=date,y=weekly_attendance,fill=roof))+ 
  geom_point(aes(color=roof),alpha=.2)+
  geom_smooth(color='DarkBlue',method='loess')+
  #stat_summary(fun.y = "mean", geom = "point", size = 3, aes(color = roof), shape = 21, color = "white", stroke = 1,alpha=.6)+
  geom_point(data=monthly_mean_roof,aes(x=ym,y=mean,fill=roof),size = 3, shape = 21, color = "white", stroke = 1,alpha=.85)+
  facet_grid(~roof)+
  scale_y_continuous(name='Weekly Attendance',labels=scales::comma)+
  scale_x_date(name='Date of Game')+
  ggtitle(label='Attendance Trends Across a Season',subtitle='All Games 2000-2019 by Stadium Type')

## Further Distribution Test for types
closed_data <- roof_data %>% distinct(game_id,.keep_all = T) %>% filter(roof=='closed')
outdoor_data <- roof_data %>% distinct(game_id,.keep_all = T) %>% filter(roof=='outdoors')

############### boxplot
roof_data %>% distinct(game_id,.keep_all = T) %>% filter(!is.na(roof)) %>% ggplot(.,aes(y=weekly_attendance,x=roof))+
  # geom_histogram(aes(y=..density..),color='black',alpha=.5,bins=20,position='identity')+
  geom_boxplot(aes(fill=roof),alpha=.7)+
  scale_y_continuous(name='Weekly Attendance',labels=scales::comma)+
  ggtitle('Boxplot Comparison of Stadium Type Distributions')


## Histograms:

roof_data %>% distinct(game_id,.keep_all = T) %>% filter(!is.na(roof)) %>%  ggplot(.,aes(x=weekly_attendance))+
  facet_wrap(vars(roof),scales='free_y')+
  geom_histogram(color='black',fill='dodgerblue',alpha=.5,bins=20)+
  geom_vline(aes(xintercept=mean(weekly_attendance,na.rm=T)),linetype='dashed',color='red',alpha=.8,lwd=1)+
  ggtitle('Histogram of each Stadium Type')

minuslog_lik_c <- function(mu,sig){
  func <- 0
  
  for(i in 1:nrow(closed_data)){
    func <- func+log(dnorm(closed_data$weekly_attendance[i],mean=mu,sd=sig))
  }
  return(-func)
}
minuslog_lik_o <- function(mu,sig){
  func <- 0
  
  for(i in 1:nrow(closed_data)){
    func <- func+log(dnorm(outdoor_data$weekly_attendance[i],mean=mu,sd=sig))
  }
  return(-func)
}

est_lognorm_c <- stats4::mle(minuslog=minuslog_lik_c,start=list(mu=mean(closed_data$weekly_attendance),sig=sd(closed_data$weekly_attendance)))
est_lognorm_o <- stats4::mle(minuslog=minuslog_lik_o,start=list(mu=mean(outdoor_data$weekly_attendance),sig=sd(outdoor_data$weekly_attendance)))

sim_mle_c <- tibble(sim=rnorm(5000,mean=est_lognorm_c@coef[1],sd=est_lognorm_c@coef[2]),roof='closed')
sim_mle_o <- tibble(sim=rnorm(5000,mean=est_lognorm_o@coef[1],sd=est_lognorm_o@coef[2]),roof='outdoors')

sim_mle_total <- rbind(sim_mle_c,sim_mle_o)

ggplot(data=sim_mle_c,aes(x=sim))+
  geom_density()

mle_summary <- rbind(est_lognorm_c@coef,est_lognorm_o@coef) %>% 
  apply(c(1,2),round,digits=1) %>% 
  apply(c(1,2),scales::comma)

rownames(mle_summary) <- c('Indoor','Outdoor')
  mle_summary %>% kbl(caption='MLE estimate results for a normal distribution',row.names=T,col.names=c('Mean Est.','SD Est.')) %>% 
    kable_styling(full_width = F)



roof_data %>% distinct(game_id,.keep_all = T) %>% filter(!is.na(roof)) %>%  
  ggplot(.,aes(x=weekly_attendance))+
  facet_wrap(vars(roof),scales='free_y')+
  geom_histogram(aes(y=..density..),color='black',fill='dodgerblue',alpha=.5,bins=15)+
  geom_density(data=sim_mle_total,aes(x=sim))+
  scale_x_continuous(labels=scales::comma,name='Weekly Attendance')+
  scale_y_continuous(labels=scales::comma,name='Density')
  geom_vline(aes(xintercept=mean(weekly_attendance,na.rm=T)),linetype='dashed',color='red',alpha=.8,lwd=1)+
  ggtitle('Histogram of each Stadium Type',subtitle='With MLE density curve estimated')

  
qqplot(closed_data$weekly_attendance,rnorm(500,mean=est_lognorm_c@coef[1],sd=est_lognorm_c@coef[2]))
qqnorm(closed_data$weekly_attendance)
qqnorm(outdoor_data$weekly_attendance)
# 
# hist(closed_data$weekly_attendance,freq = F)
# lines(density(sim_mle_c$sim))
# 
# hist(outdoor_data$weekly_attendance,freq = F)
# lines(density(sim_mle_o$sim))


## ecdf comparisons of early weeks vs. late weeks for closed and open roofs respectively


closed_data <- roof_data %>% distinct(game_id,.keep_all = T) %>% filter(roof=='closed')
outdoor_data <- roof_data %>% distinct(game_id,.keep_all = T) %>% filter(roof=='outdoors')

early_closed <- closed_data %>% filter(week<13)
late_closed <- closed_data %>% filter(week>=13)
early_outdoors <- outdoor_data %>% filter(week<13)
late_outdoors <- outdoor_data %>% filter(week>=13)


ecdf.earlyc <- ecdf(early_closed$weekly_attendance)
ecdf.latec <- ecdf(late_closed$weekly_attendance)
ecdf.earlyo <- ecdf(early_outdoors$weekly_attendance)
ecdf.lateo <- ecdf(late_outdoors$weekly_attendance)


# ecdfplot(~early_closed$weekly_attendance+late_closed$weekly_attendance)

 # plot(ecdf.earlyc, verticals=TRUE, do.points=FALSE,col='red')
 # plot(ecdf.latec, verticals=TRUE, do.points=FALSE, add=TRUE, col='blue')
 # legend('bottomright',legend=c('wk 1-12','wk 13-17'),lty=c(1,1),col=c('red','blue'))

# ecdfplot(~early_outdoors$weekly_attendance+late_outdoors$weekly_attendance)
 d <- data.frame(x=c(40000,90000))

 cols1 <- c('wk 1-12'='red','wk 13-17'='blue')
 grid.arrange(
   ggplot(d,aes(x=x))+
     stat_function(aes(color='wk 1-12'),fun=ecdf.earlyc,geom='step')+
     stat_function(aes(color='wk 13-17'),fun=ecdf.latec,geom='step')+
     
     # Map(f=stat_function,fun=list(ecdf.earlyc,ecdf.latec),mapping=aes(color=c('red','blue')),geom='smooth')+
     scale_y_continuous(name='F(x)')+
     scale_x_continuous(name='Weekly Attendance',labels=scales::comma)+
     ggtitle('ECDF of Indoor Games')+
     labs(color='legend')+
     scale_color_manual(values=cols1)+
     theme(legend.position=c(1,0),
           legend.justification = c(1,0))
,
ggplot(d,aes(x=x))+
  stat_function(aes(color='wk 1-12'),fun=ecdf.earlyo,geom='step')+
  stat_function(aes(color='wk 13-17'),fun=ecdf.lateo,geom='step')+

  # Map(f=stat_function,fun=list(ecdf.earlyo,ecdf.lateo),color=c('red','blue'),geom='smooth')+
  scale_y_continuous(name='F(x)')+
  scale_x_continuous(name='Weekly Attendance',labels=scales::comma)+
  ggtitle('ECDF of Outdoor Games')+
  labs(color='legend')+
  scale_color_manual(values=cols1)+
  theme(legend.position=c(1,0),
        legend.justification = c(1,0))
, nrow=1,ncol=2
)
 
 
## Hypothesis test on each group
 
 diff_mean <- function(x,n){
   len <- length(x)
   m <- n+1
   mean(x[1:n])-mean(x[m:len])
 }
 
closed_data$weekly_attendance
n_c <- nrow(early_closed)
m_c <- nrow(late_closed)
 
##Actual Test Statistic
obs_diff_c <- mean(early_closed$weekly_attendance)-mean(late_closed$weekly_attendance)





z_boot_c <- bootstrap(closed_data$weekly_attendance,2000,diff_mean,n_c)
mean(z_boot_c$thetastar)
p_boot_c <- sum(abs(z_boot_c$thetastar)>=abs(obs_diff_c))/2000


##split up ci

x_boot_c <- bootstrap(early_closed$weekly_attendance,2000,mean)
y_boot_c <- bootstrap(late_closed$weekly_attendance,2000,mean)


#se
boot_diff_c <- x_boot_c$thetastar-y_boot_c$thetastar
se_boot_c <- sd(boot_diff_c)

#ci
ci_c <- c(2*obs_diff_c-quantile(boot_diff_c,.975),2*obs_diff_c-quantile(boot_diff_c,.025))
ci_c_2 <- c(obs_diff_c-2*se_boot_c,obs_diff_c+2*se_boot_c)




## hypothesis of outdoor

outdoor_data$weekly_attendance
n_o <- nrow(early_outdoors)
m_o <- nrow(late_outdoors)

##Actual Test Statistic
obs_diff_o <- mean(early_outdoors$weekly_attendance)-mean(late_outdoors$weekly_attendance)





z_boot_o <- bootstrap(outdoor_data$weekly_attendance,2000,diff_mean,n_o)
mean(z_boot_o$thetastar)
p_boot_o <- sum(abs(z_boot_o$thetastar)>=abs(obs_diff_o))/2000

x_boot_o <- bootstrap(early_outdoors$weekly_attendance,2000,mean)
y_boot_o <- bootstrap(late_outdoors$weekly_attendance,2000,mean)


#se
boot_diff_o <- x_boot_o$thetastar-y_boot_o$thetastar
se_boot_o <- sd(boot_diff_o)

#ci
ci_o <- c(2*obs_diff_o-quantile(boot_diff_o,.975),2*obs_diff_o-quantile(boot_diff_o,.025))
ci_o_2 <- c(obs_diff_o-2*se_boot_o,obs_diff_o+2*se_boot_o)

test_results <- data.frame(mean_diff=c(obs_diff_c,obs_diff_o)
                           ,boot_se=c(se_boot_c,se_boot_o)
                           ,p_values=c(p_boot_c,p_boot_o)
                           ,LB_ci=c(ci_c[1],ci_o[1])
                           ,UB_ci=c(ci_c[2],ci_o[2]))
rownames(test_results) <- c('Indoor','Outdoor')

test_results[,-3] <- test_results[,-3] %>% apply(c(1,2),round,digits=1) %>% apply(c(1,2),scales::comma)

test_results %>%  kbl(caption='Non-Parametric Bootstrap Results for Comparing Attendance',row.names=T) %>% 
  kable_styling(full_width=F)