Directory

Introduction

Every fall season, the game of football is played in the United States, including little league, high school, college, and in the highest level, the National Football League (NFL). In the NFL, there are 32 teams, each with their own stadium in the United States, and each team plays a total of 17 games, running through September to late December. Each team has 8 home games as well as 8 away games, and then one bye week during the season. At the end of the season, there is playoff tournament with the best teams, and it culminates into one of the biggest games in all of sports: The Super Bowl (what a great name). The NFL has the largest annual revenue of any sports league in the world, totaling over $9.5 Billion dollars(!) in 2019 alone. One of the biggest ways that the NFL attains their revenue is through fans coming to the game. In our project, we plan on looking through attendance data from 2000-2019 to gain insights on what brings fans to the game. Some of these include:

  • What team brings in the largest average attendance per year?
  • What week brings in the most fans?
  • Can we find a correlation between wins and average attendance?
  • What has been the average attendance from 2000-2019 per year?
  • Does making the playoffs or the Super Bowl correspond to higher attendance?
  • Do weather conditions have any impact on the attendance?

….. and any other insights that we think would be good while glancing through the data.

We will be using data from Pro Football Focus data for our attendance, games, and standings facts. We plan on combining all three, then cleaning and manipulating the data to get the desired final dataset that we can perform different analysis techniques. We hope to look at specific teams too, and see if there is any interesting trends.


Required Packages

Required packages:

Here are a little taste of some of the packages that will be used in order to manipulate and visualize the data:

library(readr)
library(tidyverse)
library(Hmisc)
library(knitr)
library(funModeling)
library(rpart)
library(skimr)
library(scales)
library(DT)
library(zoo)
library(tidymodels)
library(lubridate)
library(grid)

readr: Report generator for datasets

tidyverse: set of packages that include ggplot2 for visualization, dplyr for manipulation, tidyr for data tidying, and much more

knitr:Allows you to creat HTML reports, and other markdown reports

funModeling: EDA chest with multiple uses

skimr: Allows different summaries of data

DT: Can create readable datatables

tidymodels:helps with data pre- processing

Hmisc: Data analysis and manipulation package

rpart: Data analysis package focusing on classification, regression technique such as trees

zoo: Time serires & ordered index observations analysis(regular and irregular) focused package

lubridate belongs to tidyverse, focused on time&date

grid handle static text

Quick Summaries of the Data

When looking at the data, it is split into three datasets:

Attendance: The data includes variables including the week, location, and teams

Games: Has the games, scores, location, turnovers,and other game stats that are strongly correlated with the outcome of the game.

Standings:Has the outcomes of each of the teams at the end of the season, as well as offensive and defensive standings compared to each of the other teams.

Looking at the attendance data, it a has 2 character variables, team and team_name, one for the away and one for the home team, as well as 6 numeric, with year, total, home, away, week and weekly_attendance. It appears that the highest attendance on record is 105,121 fans at a single game, as you can see the percentiles of each of the variables below:

Attendance

Looking at the attendance data, it a has 2 character variables, team and team_name, one for the away and one for the home team, as well as 6 numeric, with year, total, home, away, week and weekly_attendance. It appears that the highest attendance on record is 105,121 fans at a single game, as you can see the percentiles of each of the variables below:

skim(attendance)
Data summary
Name attendance
Number of rows 10846
Number of columns 8
_______________________
Column type frequency:
character 2
numeric 6
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
team 0 1 5 13 0 32 0
team_name 0 1 4 10 0 32 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1.00 2009.53 5.75 2000 2005.0 2010 2015.00 2019 ▇▇▇▇▇
total 0 1.00 1080910.03 72876.97 760644 1040509.0 1081090 1123230.00 1322087 ▁▁▇▆▁
home 0 1.00 540455.01 66774.65 202687 504360.0 543185 578342.00 741775 ▁▁▅▇▁
away 0 1.00 540455.01 25509.33 450295 524974.0 541757 557741.00 601655 ▁▂▇▇▂
week 0 1.00 9.00 4.90 1 5.0 9 13.00 17 ▇▆▆▆▇
weekly_attendance 638 0.94 67556.88 9022.02 23127 63245.5 68334 72544.75 105121 ▁▁▇▃▁
Games

The games data table is by far our largest, with over 5324 games being played. There appears to be 12 character and 7 numeric variable types, as well as a lot of tied data sets missing. The dataset also contains the pts scored by each team, as well as the number of yards and turnovers per team.

skim(games)
Data summary
Name games
Number of rows 5324
Number of columns 19
_______________________
Column type frequency:
character 12
numeric 7
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
week 0 1 1 9 0 21 0
home_team 0 1 13 20 0 34 0
away_team 0 1 13 20 0 34 0
winner 0 1 13 20 0 34 0
tie 5314 0 14 18 0 7 0
day 0 1 3 3 0 7 0
date 0 1 9 12 0 154 0
time 0 1 8 8 0 187 0
home_team_name 0 1 4 10 0 32 0
home_team_city 0 1 5 13 0 32 0
away_team_name 0 1 4 10 0 32 0
away_team_city 0 1 5 13 0 32 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1 2009.53 5.75 2000 2005 2010 2015 2019 ▇▇▇▇▇
pts_win 0 1 27.78 8.83 3 21 27 34 62 ▁▇▇▂▁
pts_loss 0 1 16.09 8.14 0 10 16 21 51 ▆▇▅▁▁
yds_win 0 1 361.64 78.58 47 308 361 415 653 ▁▂▇▃▁
turnovers_win 0 1 1.08 1.04 0 0 1 2 7 ▇▂▁▁▁
yds_loss 0 1 309.08 84.50 26 251 306 366 613 ▁▅▇▃▁
turnovers_loss 0 1 2.17 1.42 0 1 2 3 8 ▆▇▂▁▁
Standings

For our standings dataset, it contains 4 character and 11 numeric, which describes each team with how many wins they had, the number of playoffs gone to, the sb_winner that year, as well as some other metrics such as the offensive and defensive rankings at the end of the year for every team.

skim(standings)
Data summary
Name standings
Number of rows 638
Number of columns 15
_______________________
Column type frequency:
character 4
numeric 11
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
team 0 1 5 13 0 32 0
team_name 0 1 4 10 0 32 0
playoffs 0 1 8 11 0 2 0
sb_winner 0 1 12 13 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1 2009.53 5.76 2000.0 2005.00 2010.0 2014.75 2019.0 ▇▇▇▇▇
wins 0 1 7.98 3.08 0.0 6.00 8.0 10.00 16.0 ▂▆▇▆▂
loss 0 1 7.98 3.08 0.0 6.00 8.0 10.00 16.0 ▂▆▇▆▂
points_for 0 1 350.28 71.40 161.0 299.00 348.0 396.00 606.0 ▂▇▇▂▁
points_against 0 1 350.28 59.55 165.0 310.00 347.0 391.50 517.0 ▁▃▇▆▁
points_differential 0 1 0.00 101.09 -261.0 -75.00 1.5 72.75 315.0 ▂▆▇▅▁
margin_of_victory 0 1 0.00 6.32 -16.3 -4.70 0.1 4.57 19.7 ▂▆▇▅▁
strength_of_schedule 0 1 0.00 1.63 -4.6 -1.10 0.0 1.20 4.3 ▁▅▇▅▁
simple_rating 0 1 0.00 6.20 -17.4 -4.47 0.0 4.50 20.1 ▁▆▇▅▁
offensive_ranking 0 1 0.00 4.34 -11.7 -3.18 0.0 2.70 15.9 ▁▇▇▂▁
defensive_ranking 0 1 0.00 3.57 -9.8 -2.40 0.1 2.50 9.8 ▁▅▇▅▁

Filtering the Data

We wanted to combine all of these data tables into a single, grand table that would encompass all the data, while still adhering to the structure of the data. We did the following:

Combined the games and attendance by doing a full join of games and attendance by using team_name equal to home_team_name, as well by year and week

Combined this brand new table with the standings table, by using home_team_name equal to team_name, as well as year, and home_team_city = team, which are the same

Turned the playoffs and sb_winner into binary variables, with a 1 for yes, and 0 for no.

Dropped unnecessary and repeated values, such as ties, repeated home and away team names that were unnecessary

#change attendance week column to character 
attendance$week <- as.character(attendance$week)
nfl.df<- full_join(games,attendance,by = c("home_team_name"="team_name", "year", "week"))
nfl_new_data<- left_join(nfl.df,standings,by=c("home_team_name"="team_name","year","home_team_city"="team"))
nfl_new_data$playoffs <- str_replace(nfl_new_data$playoffs, "No Playoffs", "0")
nfl_new_data$playoffs <- str_replace(nfl_new_data$playoffs, "Playoffs", "1")
nfl_new_data$playoffs <- as.numeric(nfl_new_data$playoffs)

nfl_new_data$sb_winner <- str_replace(nfl_new_data$sb_winner, "No Superbowl", "0")
nfl_new_data$sb_winner <- str_replace(nfl_new_data$sb_winner, "Won Superbowl", "1")
nfl_new_data$sb_winner <- as.numeric(nfl_new_data$sb_winner)

nfl_new_data = subset(nfl_new_data, select = -c(tie,home,away,home_team_name,home_team_city,away_team_name, away_team_city))
nfl_new_data

With the final table, we decided to leave in any NA’s within the table, as being replaced by any values would change the data ( even though there was a lot). The majority of the dataset is a combination of character and integer values, and each value has a quick explanation attached to it:

Here is a quick overview of the data, with the first 50 rows shown.

Overview

While looking through the NFL attendance data, I thought it would be a good idea to find a good range of what exactly the average attendance is for every team, and to see what teams tend to have more fans, what outliers there are, etc. In particular, I wanted to see what the general average is of the entire league, and to just visualize the data:

library(tidyverse)
library(ggplot2)
nfl_new_data %>%
  mutate(week = factor(home_team)) %>%
  drop_na('weekly_attendance','home_team') %>%
  ggplot(aes(home_team, weekly_attendance, fill = home_team)) +
  geom_boxplot(show.legend = FALSE, outlier.alpha = .5)+
  theme(axis.text.x = element_text(angle = 90))+
  labs(title = "Box plot of NFL Attendance per team",
    x = "Team Name",
    y = "Weekly NFL game attendance"
  )

While visualizing the data, it appears that the highest mean goes to the New York Giants, followed closely by the Washington Football team(Formally Redskins). There is a range of values that the teams appear to be around, which would be between 60,000 and 80,000 fans. The team that appears to have the largest range of would be between Dallas Cowboys and the Oakland raiders, with the boxplot being larger for the Cowboys and the whiskers are longer for Oakland. The main reason why we created this was just to get a glimpse of the data, in which it shows that all the teams have around the same range and values for attendance, with some outliars that are much lower than the average values.

After viewing the data, the next step taken was going to see if there was any changes year after year with all the teams. We wanted to see if there was any teams (from 2000-2020) that just generally had higher attendance than the rest of the teams, and so we created a line chart for every team to see what the general trend in total attendance per year:

That’s not a good graph! It’s too clustered, and it’s hard to see what teams are separated from the rest. Let’s go ahead and look at a summary of the data to see if we can find highest 8 teams (or 75th percentile) and separate those from the rest:

summary(nfl_attendance_perTandY$sumyearly)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  202687  504405  543185  540455  578339  741775

It appears that the cutoff value would be around 580,000, and so lets go ahead and filter the teams that have an average of 580,000 or more per year:

nfl_attendance_perTandY.copy <- nfl_attendance_perTandY %>% group_by(home_team) %>% 
  summarise(meanyearly = mean(sumyearly, na.rm = T))
##### only care about the top teams
nfl_arizona <- nfl_attendance_perTandY.copy[which(nfl_attendance_perTandY.copy[,2]>578339),]
nfl_arizona
## # A tibble: 8 x 2
##   home_team           meanyearly
##   <chr>                    <dbl>
## 1 Carolina Panthers      585030.
## 2 Dallas Cowboys         623466.
## 3 Denver Broncos         607991 
## 4 Kansas City Chiefs     598235.
## 5 Los Angeles Rams       583554.
## 6 New York Giants        628300.
## 7 New York Jets          624141.
## 8 Washington Redskins    639074
nfl_graph <- left_join(nfl_arizona, nfl_attendance_perTandY, by = "home_team")
nfl_graph <- nfl_graph[-c(2)]

It appears that only eight teams, Carolina, Dallas, Denver, Kansas City, Los Angeles Rams, New York Giants, New York Jets, and Washington Redskins appear to have the highest annual attendance over the past 20 years. When we visualize the 8 teams, it is clear that we can see any huge jumps or spikes:

ggplot(data = nfl_graph, aes(x=year, y=sumyearly)) + geom_line(aes(colour=home_team))+annotate(geom = "text", size=3,x = 2009, y = 680000, label = "Dallas Cowboys complete their 
  brand new AT&T stadium,  which holds
 a max capacity of 80,000 people" , hjust = "left")+annotate(geom = "point", x = 2009, y = 718055, colour = "red", size = 4)+
  labs(title = "Attendance per Team by Year",
    x = "Year",
    y = "Sum of NFL game attendance"
  )

While we have these teams, lets view if there are certain predictors we can use with attendance, including wins, whether or not they made the playoffs, and any others.

Win percentage

When looking at these 8 teams, let us create a new variable in our dataset that computes the win-loss percentage of every team, using the number of wins and losses in the data:

nfl_new_data <- nfl_new_data %>% mutate(wpercent = nfl_new_data$wins/(nfl_new_data$loss+nfl_new_data$wins))

Then let us go ahead and combine the two datasets together to give us a nice linear plot that has the win percent as the x variable and sum yearly as the dependent variable.

nfl_winper <- nfl_new_data %>% group_by(home_team,year) %>% 
  summarise(winpercent = mean(wpercent, na.rm = T))
lm_model_nfl <- full_join(nfl_winper, nfl_graph)
lm_model_nfl <- lm_model_nfl %>%
  drop_na('sumyearly')

When graphing these together, it looks like there is no relationship between win percentage, as the line appears to be almost a straight flat line, which no correlation between the two at all:

plot1 <- ggplot(as.data.frame(lm_model_nfl),aes(x=winpercent,y=sumyearly)) + 
  geom_point() +
  geom_smooth(method=lm)+labs(title = "Win Percentage Graph",
    x = "Win Percentage",
    y = "Sum of NFL game attendance"
  )
plot1

Let’s go ahead and create a model to view the data, and include win-loss percentage, wins, playoffs, and the entire dataset to decide what variables are great at determining what predicts attendance.

nfl_lm_model <- nfl_new_data[, !sapply(nfl_new_data, is.character)]
## 75% of the sample size
smp_size <- floor(0.75 * nrow(nfl_lm_model))

## set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(nfl_lm_model)), size = smp_size)

nfl.train <- nfl_lm_model[train_ind, ]
nfl.test <- nfl_lm_model[-train_ind, ]
lm_fit <- lm(weekly_attendance ~ playoffs,data = nfl.train)
summary(lm_fit)$r.squared
## [1] 0.01134179
summary(lm_fit)$adj.r.squared
## [1] 0.01108412

Wow those two R-squared values are not high at all! It appears that making the playoffs has not implication on how much attendance the team has during the year. The issue with this is that we don’t know whether or not a team makes the playoffs until the end, while during the year you can easily check to see how well your team is doing by checking on their wins. We are using the year end amount of wins in our model however, but it should work as the end of the year record is a summary of the year. So when we run on a model on both playoffs and wins, hopefully our model predicts the average attendence, with a high R squared.

lm_fit_best <- lm(weekly_attendance ~ playoffs + wins, data = nfl.train)
summary(lm_fit_best)
## 
## Call:
## lm(formula = weekly_attendance ~ playoffs + wins, data = nfl.train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -43918  -4136    213   4806  31837 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 63930.06     488.84 130.780  < 2e-16 ***
## playoffs     -393.24     474.12  -0.829    0.407    
## wins          477.01      74.38   6.413  1.6e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8876 on 3836 degrees of freedom
##   (4460 observations deleted due to missingness)
## Multiple R-squared:  0.02183,    Adjusted R-squared:  0.02132 
## F-statistic:  42.8 on 2 and 3836 DF,  p-value: < 2.2e-16

This one isn’t much better! The R squared is still very low. Let’s go ahead and see if we can choose a new model, based on the most important variables.

Finally, when we use those new ones, it appears that the model isn’t that effective:

lm_fit2 <- lm(weekly_attendance ~ year + week + pts_win + pts_loss + 
            yds_win+turnovers_win+yds_loss+turnovers_loss+ playoffs+ sb_winner, data = nfl.train)
summary(lm_fit2)
## 
## Call:
## lm(formula = weekly_attendance ~ year + week + pts_win + pts_loss + 
##     yds_win + turnovers_win + yds_loss + turnovers_loss + playoffs + 
##     sb_winner, data = nfl.train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -44713  -4188    158   4854  32727 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -1.052e+05  5.175e+04  -2.034 0.042036 *  
## year            8.602e+01  2.580e+01   3.334 0.000863 ***
## week           -7.189e+01  2.887e+01  -2.490 0.012813 *  
## pts_win        -3.601e+01  2.288e+01  -1.574 0.115647    
## pts_loss        3.379e+01  2.655e+01   1.273 0.203257    
## yds_win         1.353e+00  2.337e+00   0.579 0.562790    
## turnovers_win  -2.202e+02  1.489e+02  -1.479 0.139334    
## yds_loss       -1.346e+00  2.331e+00  -0.577 0.563855    
## turnovers_loss  2.247e+02  1.164e+02   1.931 0.053610 .  
## playoffs        2.077e+03  3.081e+02   6.741  1.8e-11 ***
## sb_winner      -4.627e+02  8.569e+02  -0.540 0.589288    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8905 on 3828 degrees of freedom
##   (4460 observations deleted due to missingness)
## Multiple R-squared:  0.01727,    Adjusted R-squared:  0.0147 
## F-statistic: 6.726 on 10 and 3828 DF,  p-value: 1.879e-10

With such a low R squared, it is not worth our time trying to find if the model is correct or not. Since that is the case, let us now look at the attendance over the week, and see if the data is linear or not and see what the data looks like.

Week by Week

When looking at it week by week, we first would want to see the attendance the entire nfl has per week, to see if it changes (on average) from week to week.

nfl_week <- nfl_new_data %>% group_by(year,week) %>% 
  summarise(mean_weekly = mean(weekly_attendance, na.rm = TRUE)) %>% 
  arrange(year,week)
nfl_week$year <- as.character(nfl_week$year)
nfl_week
## # A tibble: 360 x 3
## # Groups:   year [20]
##    year   week mean_weekly
##    <chr> <int>       <dbl>
##  1 2000      1      66530.
##  2 2000      2      67142.
##  3 2000      3      68332 
##  4 2000      4      63451.
##  5 2000      5      70715.
##  6 2000      6      64803.
##  7 2000      7      65463.
##  8 2000      8      65518.
##  9 2000      9      65369.
## 10 2000     10      65361.
## # ... with 350 more rows

We now have every single week for every single year, which we have the corresponding average attendance for all games that week. When we go ahead and plot the data:

ggplot(data = nfl_week, aes(x=week, y=mean_weekly)) + geom_line(aes(colour=year))+ ylim(55000,80000)+
  labs(title = " Average Attendance per Team by Year",
    x = "Year",
    y = "Average of NFL game attendance"
  )

It appears that it follows a slight decreasing linear regression, in which means that as the season progresses, the number of fans tends to decrease per game.

When looking at the total fans per week though, it appears that the data does not follow a linear curve:

#attendance per week all years
nfl_week_graph <- nfl_new_data %>% group_by(week) %>% 
  summarise(sumweekly = sum(weekly_attendance, na.rm = TRUE)/2) %>% 
  arrange(week)
ggplot(data=nfl_week_graph, aes(x= week, y= sumweekly)) +
  geom_point() +
  geom_smooth()

This is due to a number of byes given to each of the NFL teams, where the NFL tends to give byes in the middle of the season. And so, when there are less teams playing, there are less games for people to go to, and less people attending, which explains the drop midway through the season.

Team by Team basis

While it is not feasible to build a good model which predicts attendance based on available data, we can find visualizations which give some insight on attendance in NFL. One of idea comes to mind is to visualize the whole league attendance as a whole. Let start with a 2018-2019 chart

nfl_new_data %>% filter(year %in% (2018:2019)) %>% 
  ggplot(aes(home_team,away_team, size = weekly_attendance, color = as.factor(playoffs), shape = as.factor(sb_winner)))+geom_point()+
  theme(axis.text.x = element_text(angle = 90))+geom_abline(slope =1, color = "white")+
  facet_wrap(~year,nrow=1)+
  scale_radius(name = "Weekly Attendance", limits = c(0, 106000), range = c(0, 7))+
  labs(title = "NFL Attendance Matrix 2018-2019 Regular Season",
        x = "Home Team",
        y = "Away Team",
        shape = "Super Bowl Champion")+
  scale_shape_discrete(name = "Super Bowl Champion(Home)", labels = c("No", "Winner"))+
  scale_color_manual(name= "Playoff Teams(Home)", labels = c("No", "Yes") ,values = c("gray", "orange"))

One immediate thing we can realize right away is that the attendance of a match is mainly decided by the home team, and so this can lead to more variation horizontally than vertically among weekly attendance matrix. And indeed on a larger scale, the attendance level of one team stays the same across the length of the season with a few exceptions every season. An example of this would be the Los Angeles Chargers in 2019 season when they faced against the Kansas City Chiefs, in which you see an increase in over 20,000 more fans than usual. Checking back on schedule, the match actually wasn’t played in the Charger’s stadium, but instead on a different stadium. Interestingly enough, the Charger’s has another big attendance game against Kansas City Chief, this time was played on their new stadium(Sofi stadium plenty of years later. The other thing we can see is that teams with a big fanbase (like the Dallas Cowboys) usually can pull 20000 attendance more than usual when they are the away team. On the other hand, we can also look at team’s tradition and past performance in the league:

## # A tibble: 33 x 3
##    team         num_of_SB_won num_of_playoffs
##    <chr>                <dbl>           <dbl>
##  1 New England              6              17
##  2 Baltimore                2              12
##  3 New York                 2              14
##  4 Pittsburgh               2              12
##  5 Denver                   1               9
##  6 Green Bay                1              14
##  7 Indianapolis             1              14
##  8 Kansas City              1               9
##  9 New Orleans              1               9
## 10 Philadelphia             1              13
## # ... with 23 more rows
## # A tibble: 33 x 3
##    team         num_of_SB_won num_of_playoffs
##    <chr>                <dbl>           <dbl>
##  1 New England              6              17
##  2 Green Bay                1              14
##  3 Indianapolis             1              14
##  4 New York                 2              14
##  5 Philadelphia             1              13
##  6 Seattle                  1              13
##  7 Baltimore                2              12
##  8 Pittsburgh               2              12
##  9 Denver                   1               9
## 10 Kansas City              1               9
## # ... with 23 more rows

We can see that tradition wise, only 4 team who won the SB more than 2 times . On the other hand, the top 8 team in term of playoff contender also put themselves above the rest while staying neck and neck. In term of attendance, let look the 2019 season:

grob <- grobTree(textGrob("Average attendance 2019 is 66563", x=0.35,  y=0.95, hjust=0,
  gp=gpar(col="red", fontsize=13, fontface="italic")))
nfl_new_data_t %>% filter(year==2019) %>% filter(sb_winner_count>=2|playoffs_count>=10) %>%
  ggplot(aes(home_team,away_team, size = weekly_attendance, color = playoffs_count, shape = as.factor(sb_winner_count) ))+geom_point()+theme(axis.text.x = element_text(angle = 90))+geom_abline(slope =1, color = "white")+
  scale_radius(name = "Weekly Attendance")+
  facet_wrap(~year,nrow=1)+
  scale_colour_gradient( low = "gray", high = "orange")+
  annotation_custom(grob)+
  labs(title = "Top team attendance in 2019",
    color = "Number of times reaching playoff",
       shape = "Number of Super Bowl won ")

The New England Patriots are really something else, but they do not have the highest attendance in the league. However, one can notice that most of the team here have more than average the weekly attendance. We can see that there maybe some correlation between past past performance and attendance. However, as we showed before with the Win Percentage model, there is no correlation between those two!

Weather

Weather is an interesting aspect It will surely affect the match condition and the result, but does it really affect attendance? I was able to find another data set in which, unfortunately, only has data until 2013. However, 13 years of data may still be able to help us identify weather effect on attendance.

model_weather <-lm(weekly_attendance ~temperature+humidity+wind_mph, data = nfl_new_data_t3)
summary(model_weather)
## 
## Call:
## lm(formula = weekly_attendance ~ temperature + humidity + wind_mph, 
##     data = nfl_new_data_t3)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -44579  -4020    589   5299  36065 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 71642.95    1038.18  69.008  < 2e-16 ***
## temperature   -53.33      11.23  -4.748 2.17e-06 ***
## humidity    -2465.40    1023.39  -2.409  0.01606 *  
## wind_mph      101.29      36.44   2.779  0.00549 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8803 on 2570 degrees of freedom
##   (714 observations deleted due to missingness)
## Multiple R-squared:  0.01488,    Adjusted R-squared:  0.01373 
## F-statistic: 12.94 on 3 and 2570 DF,  p-value: 2.188e-08
mean(nfl_new_data_t3$temperature)
## [1] 58.03741
mean(nfl_new_data_t3$wind_mph, na.rm = TRUE)
## [1] 8.793609

From the result, it does not seems like temperature plays an important role in attendance count. However, we did find that temperature effect is significant and it is negative. Although, it is hard to reach a conclusion, it is clear that weather itself does not dictate attendance and people are definitely coming to watch the NFL match regardless of the weather. From the visualization below, it seems that the ideal temperature for a NFL day is between 40-50 degrees even though it is not very clear.

nfl_new_data_t3$year <- year(nfl_new_data_t3$new_date)
nfl_new_data_t3 %>% ggplot(aes(temperature,weekly_attendance))+geom_smooth()+
    facet_wrap(~year,nrow=5)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 5 rows containing non-finite values (`stat_smooth()`).

Conclusions

There were some conclusions that we found were interesting to find after looking through the data:

There is no correlation between Winning games and having a large attendance!

Weather does not seem to play an important role (on a grand scale) for attendance count

Some games bring in more fans than usual, especially ones that might be rivalry

The Dallas Cowboys bring in the most fans on average per year

There might be some other issues with the data as well, such as some missing data points, other factors that might not be represented in the dataset, some correlation issues between some of the variables when creating the models, and there was no unique ids for each of the datasets, making it difficult to join multiple datasets while still trying to keep the integrity of the data.