Every fall season, the game of football is played in the United States, including little league, high school, college, and in the highest level, the National Football League (NFL). In the NFL, there are 32 teams, each with their own stadium in the United States, and each team plays a total of 17 games, running through September to late December. Each team has 8 home games as well as 8 away games, and then one bye week during the season. At the end of the season, there is playoff tournament with the best teams, and it culminates into one of the biggest games in all of sports: The Super Bowl (what a great name). The NFL has the largest annual revenue of any sports league in the world, totaling over $9.5 Billion dollars(!) in 2019 alone. One of the biggest ways that the NFL attains their revenue is through fans coming to the game. In our project, we plan on looking through attendance data from 2000-2019 to gain insights on what brings fans to the game. Some of these include:
….. and any other insights that we think would be good while glancing through the data.
We will be using data from Pro Football Focus data for our attendance, games, and standings facts. We plan on combining all three, then cleaning and manipulating the data to get the desired final dataset that we can perform different analysis techniques. We hope to look at specific teams too, and see if there is any interesting trends.
Required packages:
Here are a little taste of some of the packages that will be used in order to manipulate and visualize the data:
library(readr)
library(tidyverse)
library(Hmisc)
library(knitr)
library(funModeling)
library(rpart)
library(skimr)
library(scales)
library(DT)
library(zoo)
library(tidymodels)
library(lubridate)
library(grid)
readr: Report generator for datasets
tidyverse: set of packages that include ggplot2 for visualization, dplyr for manipulation, tidyr for data tidying, and much more
knitr:Allows you to creat HTML reports, and other markdown reports
funModeling: EDA chest with multiple uses
skimr: Allows different summaries of data
DT: Can create readable datatables
tidymodels:helps with data pre- processing
Hmisc: Data analysis and manipulation package
rpart: Data analysis package focusing on classification, regression technique such as trees
zoo: Time serires & ordered index observations analysis(regular and irregular) focused package
lubridate belongs to tidyverse, focused on time&date
grid handle static text
When looking at the data, it is split into three datasets:
Attendance: The data includes variables including the week, location, and teams
Games: Has the games, scores, location, turnovers,and other game stats that are strongly correlated with the outcome of the game.
Standings:Has the outcomes of each of the teams at the end of the season, as well as offensive and defensive standings compared to each of the other teams.
Looking at the attendance data, it a has 2 character variables, team and team_name, one for the away and one for the home team, as well as 6 numeric, with year, total, home, away, week and weekly_attendance. It appears that the highest attendance on record is 105,121 fans at a single game, as you can see the percentiles of each of the variables below:
Looking at the attendance data, it a has 2 character variables, team and team_name, one for the away and one for the home team, as well as 6 numeric, with year, total, home, away, week and weekly_attendance. It appears that the highest attendance on record is 105,121 fans at a single game, as you can see the percentiles of each of the variables below:
skim(attendance)
| Name | attendance |
| Number of rows | 10846 |
| Number of columns | 8 |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| numeric | 6 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| team | 0 | 1 | 5 | 13 | 0 | 32 | 0 |
| team_name | 0 | 1 | 4 | 10 | 0 | 32 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| year | 0 | 1.00 | 2009.53 | 5.75 | 2000 | 2005.0 | 2010 | 2015.00 | 2019 | ▇▇▇▇▇ |
| total | 0 | 1.00 | 1080910.03 | 72876.97 | 760644 | 1040509.0 | 1081090 | 1123230.00 | 1322087 | ▁▁▇▆▁ |
| home | 0 | 1.00 | 540455.01 | 66774.65 | 202687 | 504360.0 | 543185 | 578342.00 | 741775 | ▁▁▅▇▁ |
| away | 0 | 1.00 | 540455.01 | 25509.33 | 450295 | 524974.0 | 541757 | 557741.00 | 601655 | ▁▂▇▇▂ |
| week | 0 | 1.00 | 9.00 | 4.90 | 1 | 5.0 | 9 | 13.00 | 17 | ▇▆▆▆▇ |
| weekly_attendance | 638 | 0.94 | 67556.88 | 9022.02 | 23127 | 63245.5 | 68334 | 72544.75 | 105121 | ▁▁▇▃▁ |
The games data table is by far our largest, with over 5324 games being played. There appears to be 12 character and 7 numeric variable types, as well as a lot of tied data sets missing. The dataset also contains the pts scored by each team, as well as the number of yards and turnovers per team.
skim(games)
| Name | games |
| Number of rows | 5324 |
| Number of columns | 19 |
| _______________________ | |
| Column type frequency: | |
| character | 12 |
| numeric | 7 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| week | 0 | 1 | 1 | 9 | 0 | 21 | 0 |
| home_team | 0 | 1 | 13 | 20 | 0 | 34 | 0 |
| away_team | 0 | 1 | 13 | 20 | 0 | 34 | 0 |
| winner | 0 | 1 | 13 | 20 | 0 | 34 | 0 |
| tie | 5314 | 0 | 14 | 18 | 0 | 7 | 0 |
| day | 0 | 1 | 3 | 3 | 0 | 7 | 0 |
| date | 0 | 1 | 9 | 12 | 0 | 154 | 0 |
| time | 0 | 1 | 8 | 8 | 0 | 187 | 0 |
| home_team_name | 0 | 1 | 4 | 10 | 0 | 32 | 0 |
| home_team_city | 0 | 1 | 5 | 13 | 0 | 32 | 0 |
| away_team_name | 0 | 1 | 4 | 10 | 0 | 32 | 0 |
| away_team_city | 0 | 1 | 5 | 13 | 0 | 32 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| year | 0 | 1 | 2009.53 | 5.75 | 2000 | 2005 | 2010 | 2015 | 2019 | ▇▇▇▇▇ |
| pts_win | 0 | 1 | 27.78 | 8.83 | 3 | 21 | 27 | 34 | 62 | ▁▇▇▂▁ |
| pts_loss | 0 | 1 | 16.09 | 8.14 | 0 | 10 | 16 | 21 | 51 | ▆▇▅▁▁ |
| yds_win | 0 | 1 | 361.64 | 78.58 | 47 | 308 | 361 | 415 | 653 | ▁▂▇▃▁ |
| turnovers_win | 0 | 1 | 1.08 | 1.04 | 0 | 0 | 1 | 2 | 7 | ▇▂▁▁▁ |
| yds_loss | 0 | 1 | 309.08 | 84.50 | 26 | 251 | 306 | 366 | 613 | ▁▅▇▃▁ |
| turnovers_loss | 0 | 1 | 2.17 | 1.42 | 0 | 1 | 2 | 3 | 8 | ▆▇▂▁▁ |
For our standings dataset, it contains 4 character and 11 numeric, which describes each team with how many wins they had, the number of playoffs gone to, the sb_winner that year, as well as some other metrics such as the offensive and defensive rankings at the end of the year for every team.
skim(standings)
| Name | standings |
| Number of rows | 638 |
| Number of columns | 15 |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| numeric | 11 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| team | 0 | 1 | 5 | 13 | 0 | 32 | 0 |
| team_name | 0 | 1 | 4 | 10 | 0 | 32 | 0 |
| playoffs | 0 | 1 | 8 | 11 | 0 | 2 | 0 |
| sb_winner | 0 | 1 | 12 | 13 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| year | 0 | 1 | 2009.53 | 5.76 | 2000.0 | 2005.00 | 2010.0 | 2014.75 | 2019.0 | ▇▇▇▇▇ |
| wins | 0 | 1 | 7.98 | 3.08 | 0.0 | 6.00 | 8.0 | 10.00 | 16.0 | ▂▆▇▆▂ |
| loss | 0 | 1 | 7.98 | 3.08 | 0.0 | 6.00 | 8.0 | 10.00 | 16.0 | ▂▆▇▆▂ |
| points_for | 0 | 1 | 350.28 | 71.40 | 161.0 | 299.00 | 348.0 | 396.00 | 606.0 | ▂▇▇▂▁ |
| points_against | 0 | 1 | 350.28 | 59.55 | 165.0 | 310.00 | 347.0 | 391.50 | 517.0 | ▁▃▇▆▁ |
| points_differential | 0 | 1 | 0.00 | 101.09 | -261.0 | -75.00 | 1.5 | 72.75 | 315.0 | ▂▆▇▅▁ |
| margin_of_victory | 0 | 1 | 0.00 | 6.32 | -16.3 | -4.70 | 0.1 | 4.57 | 19.7 | ▂▆▇▅▁ |
| strength_of_schedule | 0 | 1 | 0.00 | 1.63 | -4.6 | -1.10 | 0.0 | 1.20 | 4.3 | ▁▅▇▅▁ |
| simple_rating | 0 | 1 | 0.00 | 6.20 | -17.4 | -4.47 | 0.0 | 4.50 | 20.1 | ▁▆▇▅▁ |
| offensive_ranking | 0 | 1 | 0.00 | 4.34 | -11.7 | -3.18 | 0.0 | 2.70 | 15.9 | ▁▇▇▂▁ |
| defensive_ranking | 0 | 1 | 0.00 | 3.57 | -9.8 | -2.40 | 0.1 | 2.50 | 9.8 | ▁▅▇▅▁ |
We wanted to combine all of these data tables into a single, grand table that would encompass all the data, while still adhering to the structure of the data. We did the following:
Combined the games and attendance by doing a full join of games and attendance by using team_name equal to home_team_name, as well by year and week
Combined this brand new table with the standings table, by using home_team_name equal to team_name, as well as year, and home_team_city = team, which are the same
Turned the playoffs and sb_winner into binary variables, with a 1 for yes, and 0 for no.
Dropped unnecessary and repeated values, such as ties, repeated home and away team names that were unnecessary
#change attendance week column to character
attendance$week <- as.character(attendance$week)
nfl.df<- full_join(games,attendance,by = c("home_team_name"="team_name", "year", "week"))
nfl_new_data<- left_join(nfl.df,standings,by=c("home_team_name"="team_name","year","home_team_city"="team"))
nfl_new_data$playoffs <- str_replace(nfl_new_data$playoffs, "No Playoffs", "0")
nfl_new_data$playoffs <- str_replace(nfl_new_data$playoffs, "Playoffs", "1")
nfl_new_data$playoffs <- as.numeric(nfl_new_data$playoffs)
nfl_new_data$sb_winner <- str_replace(nfl_new_data$sb_winner, "No Superbowl", "0")
nfl_new_data$sb_winner <- str_replace(nfl_new_data$sb_winner, "Won Superbowl", "1")
nfl_new_data$sb_winner <- as.numeric(nfl_new_data$sb_winner)
nfl_new_data = subset(nfl_new_data, select = -c(tie,home,away,home_team_name,home_team_city,away_team_name, away_team_city))
nfl_new_data
With the final table, we decided to leave in any NA’s within the table, as being replaced by any values would change the data ( even though there was a lot). The majority of the dataset is a combination of character and integer values, and each value has a quick explanation attached to it:
Here is a quick overview of the data, with the first 50 rows shown.
While looking through the NFL attendance data, I thought it would be a good idea to find a good range of what exactly the average attendance is for every team, and to see what teams tend to have more fans, what outliers there are, etc. In particular, I wanted to see what the general average is of the entire league, and to just visualize the data:
library(tidyverse)
library(ggplot2)
nfl_new_data %>%
mutate(week = factor(home_team)) %>%
drop_na('weekly_attendance','home_team') %>%
ggplot(aes(home_team, weekly_attendance, fill = home_team)) +
geom_boxplot(show.legend = FALSE, outlier.alpha = .5)+
theme(axis.text.x = element_text(angle = 90))+
labs(title = "Box plot of NFL Attendance per team",
x = "Team Name",
y = "Weekly NFL game attendance"
)
While visualizing the data, it appears that the highest mean goes to the New York Giants, followed closely by the Washington Football team(Formally Redskins). There is a range of values that the teams appear to be around, which would be between 60,000 and 80,000 fans. The team that appears to have the largest range of would be between Dallas Cowboys and the Oakland raiders, with the boxplot being larger for the Cowboys and the whiskers are longer for Oakland. The main reason why we created this was just to get a glimpse of the data, in which it shows that all the teams have around the same range and values for attendance, with some outliars that are much lower than the average values.
After viewing the data, the next step taken was going to see if there was any changes year after year with all the teams. We wanted to see if there was any teams (from 2000-2020) that just generally had higher attendance than the rest of the teams, and so we created a line chart for every team to see what the general trend in total attendance per year:
That’s not a good graph! It’s too clustered, and it’s hard to see what teams are separated from the rest. Let’s go ahead and look at a summary of the data to see if we can find highest 8 teams (or 75th percentile) and separate those from the rest:
summary(nfl_attendance_perTandY$sumyearly)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 202687 504405 543185 540455 578339 741775
It appears that the cutoff value would be around 580,000, and so lets go ahead and filter the teams that have an average of 580,000 or more per year:
nfl_attendance_perTandY.copy <- nfl_attendance_perTandY %>% group_by(home_team) %>%
summarise(meanyearly = mean(sumyearly, na.rm = T))
##### only care about the top teams
nfl_arizona <- nfl_attendance_perTandY.copy[which(nfl_attendance_perTandY.copy[,2]>578339),]
nfl_arizona
## # A tibble: 8 x 2
## home_team meanyearly
## <chr> <dbl>
## 1 Carolina Panthers 585030.
## 2 Dallas Cowboys 623466.
## 3 Denver Broncos 607991
## 4 Kansas City Chiefs 598235.
## 5 Los Angeles Rams 583554.
## 6 New York Giants 628300.
## 7 New York Jets 624141.
## 8 Washington Redskins 639074
nfl_graph <- left_join(nfl_arizona, nfl_attendance_perTandY, by = "home_team")
nfl_graph <- nfl_graph[-c(2)]
It appears that only eight teams, Carolina, Dallas, Denver, Kansas City, Los Angeles Rams, New York Giants, New York Jets, and Washington Redskins appear to have the highest annual attendance over the past 20 years. When we visualize the 8 teams, it is clear that we can see any huge jumps or spikes:
ggplot(data = nfl_graph, aes(x=year, y=sumyearly)) + geom_line(aes(colour=home_team))+annotate(geom = "text", size=3,x = 2009, y = 680000, label = "Dallas Cowboys complete their
brand new AT&T stadium, which holds
a max capacity of 80,000 people" , hjust = "left")+annotate(geom = "point", x = 2009, y = 718055, colour = "red", size = 4)+
labs(title = "Attendance per Team by Year",
x = "Year",
y = "Sum of NFL game attendance"
)
While we have these teams, lets view if there are certain predictors we can use with attendance, including wins, whether or not they made the playoffs, and any others.
When looking at these 8 teams, let us create a new variable in our dataset that computes the win-loss percentage of every team, using the number of wins and losses in the data:
nfl_new_data <- nfl_new_data %>% mutate(wpercent = nfl_new_data$wins/(nfl_new_data$loss+nfl_new_data$wins))
Then let us go ahead and combine the two datasets together to give us a nice linear plot that has the win percent as the x variable and sum yearly as the dependent variable.
nfl_winper <- nfl_new_data %>% group_by(home_team,year) %>%
summarise(winpercent = mean(wpercent, na.rm = T))
lm_model_nfl <- full_join(nfl_winper, nfl_graph)
lm_model_nfl <- lm_model_nfl %>%
drop_na('sumyearly')
When graphing these together, it looks like there is no relationship between win percentage, as the line appears to be almost a straight flat line, which no correlation between the two at all:
plot1 <- ggplot(as.data.frame(lm_model_nfl),aes(x=winpercent,y=sumyearly)) +
geom_point() +
geom_smooth(method=lm)+labs(title = "Win Percentage Graph",
x = "Win Percentage",
y = "Sum of NFL game attendance"
)
plot1
Let’s go ahead and create a model to view the data, and include win-loss percentage, wins, playoffs, and the entire dataset to decide what variables are great at determining what predicts attendance.
nfl_lm_model <- nfl_new_data[, !sapply(nfl_new_data, is.character)]
## 75% of the sample size
smp_size <- floor(0.75 * nrow(nfl_lm_model))
## set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(nfl_lm_model)), size = smp_size)
nfl.train <- nfl_lm_model[train_ind, ]
nfl.test <- nfl_lm_model[-train_ind, ]
lm_fit <- lm(weekly_attendance ~ playoffs,data = nfl.train)
summary(lm_fit)$r.squared
## [1] 0.01134179
summary(lm_fit)$adj.r.squared
## [1] 0.01108412
Wow those two R-squared values are not high at all! It appears that making the playoffs has not implication on how much attendance the team has during the year. The issue with this is that we don’t know whether or not a team makes the playoffs until the end, while during the year you can easily check to see how well your team is doing by checking on their wins. We are using the year end amount of wins in our model however, but it should work as the end of the year record is a summary of the year. So when we run on a model on both playoffs and wins, hopefully our model predicts the average attendence, with a high R squared.
lm_fit_best <- lm(weekly_attendance ~ playoffs + wins, data = nfl.train)
summary(lm_fit_best)
##
## Call:
## lm(formula = weekly_attendance ~ playoffs + wins, data = nfl.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -43918 -4136 213 4806 31837
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 63930.06 488.84 130.780 < 2e-16 ***
## playoffs -393.24 474.12 -0.829 0.407
## wins 477.01 74.38 6.413 1.6e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8876 on 3836 degrees of freedom
## (4460 observations deleted due to missingness)
## Multiple R-squared: 0.02183, Adjusted R-squared: 0.02132
## F-statistic: 42.8 on 2 and 3836 DF, p-value: < 2.2e-16
This one isn’t much better! The R squared is still very low. Let’s go ahead and see if we can choose a new model, based on the most important variables.
Finally, when we use those new ones, it appears that the model isn’t that effective:
lm_fit2 <- lm(weekly_attendance ~ year + week + pts_win + pts_loss +
yds_win+turnovers_win+yds_loss+turnovers_loss+ playoffs+ sb_winner, data = nfl.train)
summary(lm_fit2)
##
## Call:
## lm(formula = weekly_attendance ~ year + week + pts_win + pts_loss +
## yds_win + turnovers_win + yds_loss + turnovers_loss + playoffs +
## sb_winner, data = nfl.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -44713 -4188 158 4854 32727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.052e+05 5.175e+04 -2.034 0.042036 *
## year 8.602e+01 2.580e+01 3.334 0.000863 ***
## week -7.189e+01 2.887e+01 -2.490 0.012813 *
## pts_win -3.601e+01 2.288e+01 -1.574 0.115647
## pts_loss 3.379e+01 2.655e+01 1.273 0.203257
## yds_win 1.353e+00 2.337e+00 0.579 0.562790
## turnovers_win -2.202e+02 1.489e+02 -1.479 0.139334
## yds_loss -1.346e+00 2.331e+00 -0.577 0.563855
## turnovers_loss 2.247e+02 1.164e+02 1.931 0.053610 .
## playoffs 2.077e+03 3.081e+02 6.741 1.8e-11 ***
## sb_winner -4.627e+02 8.569e+02 -0.540 0.589288
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8905 on 3828 degrees of freedom
## (4460 observations deleted due to missingness)
## Multiple R-squared: 0.01727, Adjusted R-squared: 0.0147
## F-statistic: 6.726 on 10 and 3828 DF, p-value: 1.879e-10
With such a low R squared, it is not worth our time trying to find if the model is correct or not. Since that is the case, let us now look at the attendance over the week, and see if the data is linear or not and see what the data looks like.
When looking at it week by week, we first would want to see the attendance the entire nfl has per week, to see if it changes (on average) from week to week.
nfl_week <- nfl_new_data %>% group_by(year,week) %>%
summarise(mean_weekly = mean(weekly_attendance, na.rm = TRUE)) %>%
arrange(year,week)
nfl_week$year <- as.character(nfl_week$year)
nfl_week
## # A tibble: 360 x 3
## # Groups: year [20]
## year week mean_weekly
## <chr> <int> <dbl>
## 1 2000 1 66530.
## 2 2000 2 67142.
## 3 2000 3 68332
## 4 2000 4 63451.
## 5 2000 5 70715.
## 6 2000 6 64803.
## 7 2000 7 65463.
## 8 2000 8 65518.
## 9 2000 9 65369.
## 10 2000 10 65361.
## # ... with 350 more rows
We now have every single week for every single year, which we have the corresponding average attendance for all games that week. When we go ahead and plot the data:
ggplot(data = nfl_week, aes(x=week, y=mean_weekly)) + geom_line(aes(colour=year))+ ylim(55000,80000)+
labs(title = " Average Attendance per Team by Year",
x = "Year",
y = "Average of NFL game attendance"
)
It appears that it follows a slight decreasing linear regression, in which means that as the season progresses, the number of fans tends to decrease per game.
When looking at the total fans per week though, it appears that the data does not follow a linear curve:
#attendance per week all years
nfl_week_graph <- nfl_new_data %>% group_by(week) %>%
summarise(sumweekly = sum(weekly_attendance, na.rm = TRUE)/2) %>%
arrange(week)
ggplot(data=nfl_week_graph, aes(x= week, y= sumweekly)) +
geom_point() +
geom_smooth()
This is due to a number of byes given to each of the NFL teams, where the NFL tends to give byes in the middle of the season. And so, when there are less teams playing, there are less games for people to go to, and less people attending, which explains the drop midway through the season.
While it is not feasible to build a good model which predicts attendance based on available data, we can find visualizations which give some insight on attendance in NFL. One of idea comes to mind is to visualize the whole league attendance as a whole. Let start with a 2018-2019 chart
nfl_new_data %>% filter(year %in% (2018:2019)) %>%
ggplot(aes(home_team,away_team, size = weekly_attendance, color = as.factor(playoffs), shape = as.factor(sb_winner)))+geom_point()+
theme(axis.text.x = element_text(angle = 90))+geom_abline(slope =1, color = "white")+
facet_wrap(~year,nrow=1)+
scale_radius(name = "Weekly Attendance", limits = c(0, 106000), range = c(0, 7))+
labs(title = "NFL Attendance Matrix 2018-2019 Regular Season",
x = "Home Team",
y = "Away Team",
shape = "Super Bowl Champion")+
scale_shape_discrete(name = "Super Bowl Champion(Home)", labels = c("No", "Winner"))+
scale_color_manual(name= "Playoff Teams(Home)", labels = c("No", "Yes") ,values = c("gray", "orange"))
One immediate thing we can realize right away is that the attendance of a match is mainly decided by the home team, and so this can lead to more variation horizontally than vertically among weekly attendance matrix. And indeed on a larger scale, the attendance level of one team stays the same across the length of the season with a few exceptions every season. An example of this would be the Los Angeles Chargers in 2019 season when they faced against the Kansas City Chiefs, in which you see an increase in over 20,000 more fans than usual. Checking back on schedule, the match actually wasn’t played in the Charger’s stadium, but instead on a different stadium. Interestingly enough, the Charger’s has another big attendance game against Kansas City Chief, this time was played on their new stadium(Sofi stadium plenty of years later. The other thing we can see is that teams with a big fanbase (like the Dallas Cowboys) usually can pull 20000 attendance more than usual when they are the away team. On the other hand, we can also look at team’s tradition and past performance in the league:
## # A tibble: 33 x 3
## team num_of_SB_won num_of_playoffs
## <chr> <dbl> <dbl>
## 1 New England 6 17
## 2 Baltimore 2 12
## 3 New York 2 14
## 4 Pittsburgh 2 12
## 5 Denver 1 9
## 6 Green Bay 1 14
## 7 Indianapolis 1 14
## 8 Kansas City 1 9
## 9 New Orleans 1 9
## 10 Philadelphia 1 13
## # ... with 23 more rows
## # A tibble: 33 x 3
## team num_of_SB_won num_of_playoffs
## <chr> <dbl> <dbl>
## 1 New England 6 17
## 2 Green Bay 1 14
## 3 Indianapolis 1 14
## 4 New York 2 14
## 5 Philadelphia 1 13
## 6 Seattle 1 13
## 7 Baltimore 2 12
## 8 Pittsburgh 2 12
## 9 Denver 1 9
## 10 Kansas City 1 9
## # ... with 23 more rows
We can see that tradition wise, only 4 team who won the SB more than 2 times . On the other hand, the top 8 team in term of playoff contender also put themselves above the rest while staying neck and neck. In term of attendance, let look the 2019 season:
grob <- grobTree(textGrob("Average attendance 2019 is 66563", x=0.35, y=0.95, hjust=0,
gp=gpar(col="red", fontsize=13, fontface="italic")))
nfl_new_data_t %>% filter(year==2019) %>% filter(sb_winner_count>=2|playoffs_count>=10) %>%
ggplot(aes(home_team,away_team, size = weekly_attendance, color = playoffs_count, shape = as.factor(sb_winner_count) ))+geom_point()+theme(axis.text.x = element_text(angle = 90))+geom_abline(slope =1, color = "white")+
scale_radius(name = "Weekly Attendance")+
facet_wrap(~year,nrow=1)+
scale_colour_gradient( low = "gray", high = "orange")+
annotation_custom(grob)+
labs(title = "Top team attendance in 2019",
color = "Number of times reaching playoff",
shape = "Number of Super Bowl won ")
The New England Patriots are really something else, but they do not have the highest attendance in the league. However, one can notice that most of the team here have more than average the weekly attendance. We can see that there maybe some correlation between past past performance and attendance. However, as we showed before with the Win Percentage model, there is no correlation between those two!
Weather is an interesting aspect It will surely affect the match condition and the result, but does it really affect attendance? I was able to find another data set in which, unfortunately, only has data until 2013. However, 13 years of data may still be able to help us identify weather effect on attendance.
model_weather <-lm(weekly_attendance ~temperature+humidity+wind_mph, data = nfl_new_data_t3)
summary(model_weather)
##
## Call:
## lm(formula = weekly_attendance ~ temperature + humidity + wind_mph,
## data = nfl_new_data_t3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -44579 -4020 589 5299 36065
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 71642.95 1038.18 69.008 < 2e-16 ***
## temperature -53.33 11.23 -4.748 2.17e-06 ***
## humidity -2465.40 1023.39 -2.409 0.01606 *
## wind_mph 101.29 36.44 2.779 0.00549 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8803 on 2570 degrees of freedom
## (714 observations deleted due to missingness)
## Multiple R-squared: 0.01488, Adjusted R-squared: 0.01373
## F-statistic: 12.94 on 3 and 2570 DF, p-value: 2.188e-08
mean(nfl_new_data_t3$temperature)
## [1] 58.03741
mean(nfl_new_data_t3$wind_mph, na.rm = TRUE)
## [1] 8.793609
From the result, it does not seems like temperature plays an important role in attendance count. However, we did find that temperature effect is significant and it is negative. Although, it is hard to reach a conclusion, it is clear that weather itself does not dictate attendance and people are definitely coming to watch the NFL match regardless of the weather. From the visualization below, it seems that the ideal temperature for a NFL day is between 40-50 degrees even though it is not very clear.
nfl_new_data_t3$year <- year(nfl_new_data_t3$new_date)
nfl_new_data_t3 %>% ggplot(aes(temperature,weekly_attendance))+geom_smooth()+
facet_wrap(~year,nrow=5)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 5 rows containing non-finite values (`stat_smooth()`).
There were some conclusions that we found were interesting to find after looking through the data:
There is no correlation between Winning games and having a large attendance!
Weather does not seem to play an important role (on a grand scale) for attendance count
Some games bring in more fans than usual, especially ones that might be rivalry
The Dallas Cowboys bring in the most fans on average per year
There might be some other issues with the data as well, such as some missing data points, other factors that might not be represented in the dataset, some correlation issues between some of the variables when creating the models, and there was no unique ids for each of the datasets, making it difficult to join multiple datasets while still trying to keep the integrity of the data.