NFL Stadium Attendance Analysis

Introduction

The National Football League (NFL) is a professional American football league consisting of 32 teams, divided equally between the National Football Conference (NFC) and the American Football Conference (AFC). The NFL’s 17-week regular season runs from early September to late December, with each team playing 16 games and having one bye week. Following the conclusion of the regular season, seven teams from each conference (four division winners and three wild card teams) advance to the playoffs, a single-elimination tournament culminating in the Super Bowl, which is usually held on the first Sunday in February and is played between the champions of the NFC and AFC.

The National Football League is the largest live spectator sporting league in the world in terms of average attendance. The NFL is one of the four major professional sports leagues in North America and the highest professional level of American football in the world. As of 2018, the NFL averaged 67,100 live spectators per game, and 17,177,581 total for the season.

Hard Rock Stadium, Miami Gardens, Florida

The purpose of this project is to analyse NFL attendance data from 2000 to 2019 and gain insights into spectator attendance over the 20-year period. The analysis aims to answer the questions below.

Does the win percentage have a bearing on the attendance?
Do some teams have better support compared to others?
Does a particular week in the year have better or poor attendance?
Can we build a model to predict the attendance for 2020?
Can we classify teams into categories of highest attendance vs lowest attendance?

For this study we are using data from the Pro Football Reference website. We will perform some data cleansing and data manipulation to set up the data for consumption. We will start with exploratory data analysis to understand the data, examine the factors that determine attendance at National Football League games, and build a model to identify factors having a bearing on the attendance.

These insights will help with ticket pricing, logistics planning, promotions and marketing campaigns.

Required Packages

The below packages are required to run the code.

      library(readr)
      library(tidyverse)
      library(Hmisc)
      library(knitr)
      library(funModeling)
      library(rpart)
      library(skimr)
      library(scales)

Library Index

readr - Read Rectangular Text Data
tidyverse - will load the below core tidyverse packages
- ggplot2 - for data visualisation.
- dplyr - for data manipulation.
- tidyr - for data tidying.
- readr - for data import.
- purrr - for functional programming.
- tibble - for tibbles, a modern re-imagining of data frames.
- stringr - for strings.
- forcats - for factors.
Hmisc - data analysis, high-level graphics, utility operations
knitr - A General-Purpose Package for Dynamic Report Generation in R
funModeling - Exploratory Data Analysis and Data Preparation Tool-Box
rpart - Recursive partitioning for classification, regression and survival trees.
skimr - Compact and Flexible Summaries of Data
scales - Graphical scales map data to aesthetics, and provide methods for automatically determining breaks and labels for axes and legends.

Data Preparation

Data Load

We are using the data that was downloaded from Pro Football Reference Website.

attendance <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/attendance.csv')
standings <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/standings.csv')
games <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/games.csv')

library(knitr)
# evaluate fig.cap after a chunk is evaluated
opts_knit$set(eval.after = 'fig.cap')

Data Dictionary

Attendance Data

The Attendance data set contains the weekly attendance information for a team for the years 2000 to 2019.

Data Dictionary - attendance.csv

Variable	Class	Description
team	character	Team City
team_name	character	Team name
year	integer	Season year
total	double	Total attendance across 17 weeks (1 week = no game)
home	double	Home attendance
away	double	Away attendance
week	character	Week number (1-17)
weekly_attendance	double	Weekly attendance number

Standings Data

The Standings data set contains the win/loss record, points scored and rankings for each team for each season from 2000 to 2019.

Data Dictionary - standings.csv

Variable	Class	Description
team	character	Team city
team_name	character	Team name
year	integer	season year
wins	double	Wins (0 to 16)
loss	double	Losses (0 to 16)
points_for	double	points for (offensive performance)
points_against	double	points against (defensive performance)
points_differential	double	Point differential (points_for - points_against)
margin_of_victory	double	(Points Scored - Points Allowed)/ Games Played
strength_of_schedule	double	Average quality of opponent as measured by SRS (Simple Rating System)
simple_rating	double	Team quality relative to average (0.0) as measured by SRS (Simple Rating System) SRS = MoV + SoS = OSRS + DSRS
offensive_ranking	double	Team offense quality relative to average (0.0) as measured by SRS (Simple Rating System)
defensive_ranking	double	Team defense quality relative to average (0.0) as measured by SRS (Simple Rating System)
playoffs	character	Made playoffs or not
sb_winner	character	Won superbowl or not

Games Data

The Games data set contains details about each game.

Data Dictionary - games.csv

Variable	Class	Description
year	integer	season year, note that playoff games will still be in the previous season
week	character	week number (1-17, plus playoffs)
home_team	character	Home team
away_team	character	Away team
winner	character	Winning team
tie	character	If a tie, the “losing” team as well
day	character	Day of week
date	character	Date minus year
time	character	Time of game start
pts_win	double	Points by winning team
pts_loss	double	Points by losing team
yds_win	double	Yards by winning team
turnovers_win	double	Turnovers by winning team
yds_loss	double	Yards by losing team
turnovers_loss	double	Turnovers by losing team
home_team_name	character	Home team name
home_team_city	character	Home team city
away_team_name	character	Away team name
away_team_city	character	Away team city

Attendance Data

Summary of the Attendance Data

Looking at the summary statistics of the Attendance Data

     skim(attendance)

Data summary
Name	attendance
Number of rows	10846
Number of columns	8
_______________________
Column type frequency:
character	2
numeric	6
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
team	0	1	5	13	0	32	0
team_name	0	1	4	10	0	32	0

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
year	0	1.00	2009.53	5.75	2000	2005.0	2010	2015.00	2019	▇▇▇▇▇
total	0	1.00	1080910.03	72876.97	760644	1040509.0	1081090	1123230.00	1322087	▁▁▇▆▁
home	0	1.00	540455.01	66774.65	202687	504360.0	543185	578342.00	741775	▁▁▅▇▁
away	0	1.00	540455.01	25509.33	450295	524974.0	541757	557741.00	601655	▁▂▇▇▂
week	0	1.00	9.00	4.90	1	5.0	9	13.00	17	▇▆▆▆▇
weekly_attendance	638	0.94	67556.88	9022.02	23127	63245.5	68334	72544.75	105121	▁▁▇▃▁

Data Quality/ Missing Data

We create a single variable for the team name by combining the team and team_name attributes so that we can join with the standings and games data sets.

attendance_reshape <- rename(attendance , annual_attendance = total , annual_homegame_attendance = home , 
                             annual_awaygame_attendance = away  )  %>% 
                    mutate(NFL_team_name = str_c(team, team_name, sep = " "))

After verifying the statistics of each variable, we notice that the values in the weekly_attendance variable are missing for 638 rows. We will check to see if there is any pattern to the missing values.

     missing_data <- 
     attendance_reshape %>% 
     filter(is.na(weekly_attendance))

Every team has one bye week each season, and the week varies by team and year. We can ignore these rows as there is no game in that week. We also notice that in 2000 and 2001 there are only 31 teams, and from 2002 onwards there are 32 teams.
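One way to confirm the bye-week pattern is to count the missing rows per team and season: a bye week should leave exactly one missing value in each group. The sketch below uses a small made-up data frame (the `toy` rows and their attendance values are illustrative assumptions, not taken from the data set) and base R only.

```r
# Toy stand-in for attendance_reshape; all values are invented for illustration.
toy <- data.frame(
  NFL_team_name     = rep(c("Team A", "Team B"), each = 4),
  year              = rep(2000, 8),
  week              = rep(1:4, times = 2),
  weekly_attendance = c(60000, NA, 61000, 62000,
                        70000, 71000, NA, 72000)
)

# Flag missing rows, then count them per team and season.
toy$is_missing <- is.na(toy$weekly_attendance)
na_counts <- aggregate(is_missing ~ NFL_team_name + year, data = toy, FUN = sum)
names(na_counts)[3] <- "n_missing"

# TRUE when every team-season has exactly one missing week (its bye).
all_one_bye <- all(na_counts$n_missing == 1)
print(na_counts)
```

Applied to the real data, the same count would be expected to return 1 for every team and season if bye weeks are the only source of missing values.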

We will filter out these 638 missing occurrences and use the clean data for further analysis.

 attendance_cleansed <-      attendance_reshape %>%       
 filter(! is.na(weekly_attendance)) %>%
 select(NFL_team_name, year, week , weekly_attendance , annual_attendance)

Sample Data for Attendance after cleansing

Looking at the sample data after renaming the variables and filtering out the bye week rows

    kable(attendance_cleansed[1:5,])

NFL_team_name	year	week	weekly_attendance	annual_attendance
Arizona Cardinals	2000	1	77434	893926
Arizona Cardinals	2000	2	66009	893926
Arizona Cardinals	2000	4	71801	893926
Arizona Cardinals	2000	5	66985	893926
Arizona Cardinals	2000	6	44296	893926

Standings Data

Description of the Standings Data

     skim(standings)

Data summary
Name	standings
Number of rows	638
Number of columns	15
_______________________
Column type frequency:
character	4
numeric	11
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
team	1	5	13	32
team_name	1	4	10	32
playoffs	1	8	11	2
sb_winner	1	12	13	2

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
year	1	2009.53	5.76	2000.0	2005.00	2010.0	2014.75	2019.0	▇▇▇▇▇
wins	1	7.98	3.08	0.0	6.00	8.0	10.00	16.0	▂▆▇▆▂
loss	1	7.98	3.08	0.0	6.00	8.0	10.00	16.0	▂▆▇▆▂
points_for	1	350.28	71.40	161.0	299.00	348.0	396.00	606.0	▂▇▇▂▁
points_against	1	350.28	59.55	165.0	310.00	347.0	391.50	517.0	▁▃▇▆▁
points_differential	1	0.00	101.09	-261.0	-75.00	1.5	72.75	315.0	▂▆▇▅▁
margin_of_victory	1	0.00	6.32	-16.3	-4.70	0.1	4.57	19.7	▂▆▇▅▁
strength_of_schedule	1	0.00	1.63	-4.6	-1.10	0.0	1.20	4.3	▁▅▇▅▁
simple_rating	1	0.00	6.20	-17.4	-4.47	0.0	4.50	20.1	▁▆▇▅▁
offensive_ranking	1	0.00	4.34	-11.7	-3.18	0.0	2.70	15.9	▁▇▇▂▁
defensive_ranking	1	0.00	3.57	-9.8	-2.40	0.1	2.50	9.8	▁▅▇▅▁

Data Quality/ Missing Data

After verifying the statistics for each variable, we create a single variable for the full team name by combining the team and team_name attributes so that we can join with the other data sets. The rest of the data looks good.

standings_reshape <- standings %>% 
mutate(NFL_team_name = str_c(team, team_name, sep = " ")) %>%
select(NFL_team_name, year, wins, loss, margin_of_victory, simple_rating , offensive_ranking, defensive_ranking, playoffs, sb_winner)

Sample Data for Standings

  kable(standings_reshape[1:5,])

NFL_team_name	year	wins	loss	margin_of_victory	simple_rating	offensive_ranking	defensive_ranking	playoffs	sb_winner
Miami Dolphins	2000	11	5	6.1	7.1	0.0	7.1	Playoffs	No Superbowl
Indianapolis Colts	2000	10	6	6.4	7.9	7.1	0.8	Playoffs	No Superbowl
New York Jets	2000	9	7	0.0	3.5	1.4	2.2	No Playoffs	No Superbowl
Buffalo Bills	2000	8	8	-2.2	0.0	0.5	-0.5	No Playoffs	No Superbowl
New England Patriots	2000	5	11	-3.9	-2.5	-2.7	0.2	No Playoffs	No Superbowl

Games Data

Summary of the Games Data

    skim(games)

Data summary
Name	games
Number of rows	5324
Number of columns	19
_______________________
Column type frequency:
character	11
difftime	1
numeric	7
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
week	0	1	1	9	21
home_team	0	1	13	20	34
away_team	0	1	13	20	34
winner	0	1	13	20	34
tie	5314	0	14	18	7
day	0	1	3	3	7
date	0	1	9	12	154
home_team_name	0	1	4	10	32
home_team_city	0	1	5	13	32
away_team_name	0	1	4	10	32
away_team_city	0	1	5	13	32

Variable type: difftime

skim_variable	n_missing	complete_rate	min	max	median	n_unique
time	0	1	30900 secs	84900 secs	47040 secs	187

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
year	1	2009.53	5.75	2000	2005	2010	2015	2019	▇▇▇▇▇
pts_win	1	27.78	8.83	3	21	27	34	62	▁▇▇▂▁
pts_loss	1	16.09	8.14	0	10	16	21	51	▆▇▅▁▁
yds_win	1	361.64	78.58	47	308	361	415	653	▁▂▇▃▁
turnovers_win	1	1.08	1.04	0	0	1	2	7	▇▂▁▁▁
yds_loss	1	309.08	84.50	26	251	306	366	613	▁▅▇▃▁
turnovers_loss	1	2.17	1.42	0	1	2	3	8	▆▇▂▁▁

Data Quality/ Missing Data

After verifying the statistics for each variable, the data in the variable week looks ambiguous. On closer inspection we notice that it has both numeric and character values. The weeks after the regular season are stored as character values: WildCard, Division, ConfChamp and SuperBowl. These are valid values for the playoff rounds. As the analysis covers only regular season games and not the playoffs, I will ignore the playoff games.

The games data set contains the result of each game. A total of 5104 games were played in the regular seasons from 2000 to 2019 (5324 rows minus 11 playoff games in each of the 20 seasons). We need to change the data type of week from character to numeric so that we can join the games data set with the attendance data set.

   games_regular <- games %>%
                   # keep regular season weeks only; playoff rounds have text labels
                   filter(week %in% as.character(1:17)) %>%
                   mutate(week = as.numeric(week)) %>%
                   select(home_team, away_team, year, week, winner)
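An alternative to listing the valid weeks is to treat any label made only of digits as a regular season week. The sketch below is a minimal illustration on a toy vector; the playoff labels are the ones named in the text above, and the variable names are hypothetical.

```r
# Toy week labels mixing regular season numbers and playoff rounds.
week <- c("1", "2", "17", "WildCard", "Division", "ConfChamp", "SuperBowl")

# A regular season week is a label that consists only of digits.
is_regular <- grepl("^[0-9]+$", week)

# Convert only the digit labels, so no NA coercion warnings are raised.
regular_weeks  <- as.numeric(week[is_regular])
playoff_rounds <- week[!is_regular]
```

Filtering before converting avoids the warning that `as.numeric()` would raise on the playoff labels.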

We reshape the games data so that the home and away teams are combined into a single column, and add a home game indicator.

  games_reshape <- games_regular %>%
                   gather(home_ind,NFL_team_name,home_team:away_team ) %>%
                   select(NFL_team_name, year, week, home_ind, winner)

Sample Data for Games

   kable(games_reshape[1:5,])

NFL_team_name	year	week	home_ind	winner
Minnesota Vikings	2000	1	home_team	Minnesota Vikings
Kansas City Chiefs	2000	1	home_team	Indianapolis Colts
Washington Redskins	2000	1	home_team	Washington Redskins
Atlanta Falcons	2000	1	home_team	Atlanta Falcons
Pittsburgh Steelers	2000	1	home_team	Baltimore Ravens

Combined Dataset

We will create an annual team attendance summary data set. Since the home and away attendance figures are totals for the year, we divide them by 8 (each team plays 8 home and 8 away games) to get the average home game and away game attendance.

  # Create a team annual summaries data set.
  # The annual totals repeat on every row of a team-season, so the mean
  # returns the annual value itself; dividing by 8 gives a per-game average.
  team_summaries <- attendance_reshape %>%
                    group_by(NFL_team_name, year) %>%
                    summarise(Average_home = mean(annual_homegame_attendance) / 8,
                              Average_away = mean(annual_awaygame_attendance) / 8)
  # Create an annual attendance summary data set
  annual_summaries <- attendance_cleansed %>%
                      group_by(year) %>%
                      summarise(Average_nfl = mean(weekly_attendance))
  # Create a summary data set with home, away and NFL attendance summaries
  attendance_summaries <- left_join(team_summaries, annual_summaries,
                                    by = c("year"))

Sample Data for Annual Summaries

   kable(attendance_summaries[1:5,])

NFL_team_name	year	Average_home	Average_away	Average_nfl
Arizona Cardinals	2000	48434.38	63306.38	65934.44
Arizona Cardinals	2001	38414.38	63009.50	65753.58
Arizona Cardinals	2002	40909.00	71450.62	66325.67
Arizona Cardinals	2003	36062.38	64487.75	66674.36
Arizona Cardinals	2004	37533.38	67286.25	67462.60
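The per-game averages can be checked by hand for the first row. The annual home and away totals below are not shown in the tables; they are back-computed from the rounded averages for the Arizona Cardinals in 2000, under the assumption of 8 home and 8 away games, and they add up to the annual_attendance value of 893926 shown earlier for that season.

```r
# Annual totals for the Arizona Cardinals in 2000, back-computed from the
# rounded per-game averages (assumption: 8 home games and 8 away games).
annual_home <- 387475
annual_away <- 506451
games_each  <- 8

average_home <- annual_home / games_each   # 48434.375, shown as 48434.38
average_away <- annual_away / games_each   # 63306.375, shown as 63306.38

# Home plus away reproduces the annual_attendance value for that season.
annual_total <- annual_home + annual_away  # 893926
```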

We will also create a weekly attendance summary data set

 weekly_summaries <- attendance_cleansed %>% 
   group_by(week) %>% 
   summarise(
     Average_week = mean(weekly_attendance)
   )

Sample Data for Weekly Summaries

   kable(weekly_summaries[1:5,])

week	Average_week
1	68532.91
2	67430.24
3	67761.31
4	67581.84
5	68629.94

I will combine the attendance, standings and games data sets into a single data set, which I will use in the exploratory data analysis.

   combined_data <-   inner_join(attendance_cleansed, standings_reshape, 
                                   by = c( "NFL_team_name", "year"))      %>%
                             inner_join(. , games_reshape,
                            by = c("NFL_team_name" , "year" , "week"))
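An inner join silently drops rows whose keys have no match, so it can be worth listing unmatched keys before relying on the combined data. The following is a minimal base R sketch on invented key tables; the data frames, the `key()` helper and the deliberately misspelled team name are all hypothetical.

```r
# Toy key tables; team names and years are invented for illustration.
attendance_keys <- data.frame(
  NFL_team_name = c("Team A", "Team B", "Team C"),
  year          = c(2000, 2000, 2000)
)
standings_keys <- data.frame(
  NFL_team_name = c("Team A", "Team B", "Team  C"),  # note the double space
  year          = c(2000, 2000, 2000)
)

# Build one comparable key per row, then list keys with no partner.
key <- function(df) paste(df$NFL_team_name, df$year, sep = "|")
unmatched <- setdiff(key(attendance_keys), key(standings_keys))

# merge() with default settings keeps only rows present in both tables,
# which mirrors the behaviour of an inner join.
joined <- merge(attendance_keys, standings_keys,
                by = c("NFL_team_name", "year"))
```

An empty `unmatched` vector would suggest that no attendance rows are lost in the join; here one key is reported because of the spelling difference.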

Sample Data for Combined Dataset

   kable(combined_data[1:5,])

NFL_team_name	year	week	weekly_attendance	annual_attendance	wins	loss	margin_of_victory	simple_rating	offensive_ranking	defensive_ranking	playoffs	sb_winner	home_ind	winner
Arizona Cardinals	2000	1	77434	893926	3	13	-14.6	-15.2	-7.2	-8.1	No Playoffs	No Superbowl	away_team	New York Giants
Arizona Cardinals	2000	2	66009	893926	3	13	-14.6	-15.2	-7.2	-8.1	No Playoffs	No Superbowl	home_team	Arizona Cardinals
Arizona Cardinals	2000	4	71801	893926	3	13	-14.6	-15.2	-7.2	-8.1	No Playoffs	No Superbowl	home_team	Green Bay Packers
Arizona Cardinals	2000	5	66985	893926	3	13	-14.6	-15.2	-7.2	-8.1	No Playoffs	No Superbowl	away_team	San Francisco 49ers
Arizona Cardinals	2000	6	44296	893926	3	13	-14.6	-15.2	-7.2	-8.1	No Playoffs	No Superbowl	home_team	Arizona Cardinals

Exploratory Data Analysis

The objective of exploratory data analysis is to understand the data better and identify the variables that may have a significant effect on the attendance numbers. I will look at different variables and plot some graphs to understand them better.

A look at Teams in the League

The data contains 34 distinct team names, but the NFL has at most 32 teams in any season. Let's take a look at a bar chart of games per team to understand what happened.

# Count rows (games) per team, ordered by number of games, coloured by season.
# The base-graphics arguments las and names.arg are not ggplot2 aesthetics
# and have been dropped.
ggplot(data = combined_data,
       aes(x = reorder(factor(NFL_team_name), NFL_team_name, function(x) length(x)),
           fill = factor(year))) +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 0), legend.title = element_blank(),
        legend.position = "bottom", axis.title.x = element_blank(), 
        axis.title.y = element_blank(), plot.title = element_text(hjust = 0.5)) +
        ggtitle("Games played by teams 2000-2019")  +
        coord_flip()

Figure 1 - NFL Teams 2000-2019.

Looking at the teams that did not play all 320 games, we notice that the Rams moved from St. Louis to Los Angeles in 2016. Similarly, the Chargers moved from San Diego to Los Angeles in 2017. We also notice that the Houston Texans started playing in the league in 2002.

Attendance based on Week number of the season.

Let's try to find out if attendance spikes in any particular week of the season. Using the summarized data, we plot the average attendance for each week number across all years to see if any particular weeks see higher attendance.

 ggplot(data=weekly_summaries, aes(x= week, y= Average_week)) +
   geom_point() +
   geom_smooth()+
   theme(axis.text.x = element_text(angle = 0), legend.title = element_blank(),
         legend.position = "bottom", axis.title.x = element_blank(), 
         axis.title.y = element_blank(), plot.title = element_text(hjust = 0.5)) + 
         ggtitle("Average attendance (per week number) 2000-2019")

Figure 2 - Attendance for Week#.

Looking at this plot, there seems to be a dip in attendance after week 8. Let's draw a box plot to understand whether this is significant.

combined_data %>%
                    mutate(week = factor(week)) %>%
                    ggplot(aes(week, weekly_attendance, fill = week)) +
                    geom_boxplot(show.legend = FALSE, outlier.alpha = 0.5) +
                    labs( x = "Week of NFL season",
                          y = "Weekly NFL game attendance") +
                  theme(axis.text.x = element_text(angle = 00), legend.title = element_blank(),
                        legend.position = "bottom", axis.title.x = element_blank(), 
                        axis.title.y = element_blank(), plot.title = element_text(hjust = 0.5)) + 
                        scale_y_continuous(label = unit_format(unit = "K", scale = 1/1000, sep = ""))

Figure 3 - Attendance for Week#.
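A rank-based test is one way to back up a visual comparison of box plots. The sketch below runs a Kruskal-Wallis test on simulated data in which every week is drawn from the same distribution; the mean and standard deviation only loosely echo the summary statistics above, and the test is offered as an illustration rather than as part of the original analysis.

```r
set.seed(42)  # make the simulated example reproducible

# Simulated attendance: 17 weeks, 30 games per week, same distribution for all.
sim <- data.frame(
  week              = factor(rep(1:17, each = 30)),
  weekly_attendance = rnorm(17 * 30, mean = 67500, sd = 9000)
)

# Kruskal-Wallis rank-sum test: do the weekly distributions differ in location?
kw <- kruskal.test(weekly_attendance ~ week, data = sim)
print(kw$p.value)
```

On the real data the same call could be applied to combined_data with week converted to a factor; a small p-value would suggest that at least one week differs from the others.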

Looking at the box plots, there does not seem to be much variation between weeks. Now let's look at the impact of home and away games on attendance.

Home games vs Away Games attendance

We will look at the average game attendance for home and away games along with the overall NFL game attendance average.

 ggplot(data = attendance_summaries) +
   geom_line(aes(x = year, y = Average_nfl, col='NFL Game Average')) +
   geom_line(aes(x = year, y = Average_away, col='Away Game Average')) +
   geom_line(aes(x = year, y = Average_home, col='Home Game Average')) +
   facet_wrap(facets = vars(NFL_team_name), shrink = TRUE) +
   theme(axis.text.x = element_text(angle = 00), legend.title = element_blank(),
         legend.position = "bottom", axis.title.x = element_blank(), 
         axis.title.y = element_blank(), plot.title = element_text(hjust = 0.5)) + 
   scale_x_continuous(labels = function(x) substring(x,3,4)) +
   scale_y_continuous(label = unit_format(unit = "K", scale = 1/1000, sep = "")) + 
   scale_color_manual(values = c("red", "royalblue", "black")) + 
   ggtitle("Average NFL attendance (per game) 2000-2019")

Figure 4 - Home,away averages compared with nfl average.

We do not see any common pattern across all the teams; the trends appear to be specific to each team. Dallas Cowboys home game attendance is well above that of the other teams. The Kansas City Chiefs could also be classified as a team with good home support. The Washington Redskins also had very strong home game support until 2017. The Oakland Raiders seem to have below-par home game attendance compared to other teams, and the Cincinnati Bengals also have lower home game attendance than most teams.

Weekly Attendance vs Playoffs

Let us create a box plot of the weekly attendance for each team, split by whether the team qualified for the playoffs that season.

combined_data %>%
  ggplot(aes(fct_reorder(NFL_team_name, weekly_attendance),
             weekly_attendance,
             fill = playoffs
  )) +
  geom_boxplot(outlier.alpha = 0.5) +
  coord_flip() +
  labs(
    fill = NULL, x = NULL,
    y = "Weekly NFL game attendance"
  ) + 
theme(axis.text.x = element_text(angle = 00), legend.title = element_blank(),
         legend.position = "bottom", axis.title.x = element_blank(), 
         axis.title.y = element_blank(), plot.title = element_text(hjust = 0.5))

Figure 5 - weekly nfl game attendance for playoffs and non playoffs.

Again, we do not see many patterns common to all the teams. We notice that Washington Redskins and Dallas Cowboys attendance spikes in seasons when they make the playoffs. Since the Los Angeles Rams data set is very small, we will ignore the trend where they seem to have higher attendance when they do not make the playoffs.

Machine Learning

Basic Scatterplot Matrix

Creating simple scatterplot Matrix to check for correlation.

   # simple_rating was listed twice in the original formula; it is needed only once
   pairs(~annual_attendance+wins+margin_of_victory+simple_rating+offensive_ranking+defensive_ranking,
   data=combined_data,
   main="Simple Scatterplot Matrix")

We do not observe a strong correlation among these variables.

Creating a linear Regression Model

Let us build a couple of simple linear regression models to identify the factors that may impact the game attendance numbers.

Attendance vs week number

lmann = lm(annual_attendance~week ,data=combined_data)
broom::tidy(lmann)

## # A tibble: 2 x 5
##   term         estimate std.error statistic p.value
##   <chr>           <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept) 1081044.      1501.   720.      0    
## 2 week            -14.7      145.    -0.101   0.919

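Looking at the coefficients from the model, the p-value for week is 0.919, so we cannot reject the hypothesis that the week coefficient is zero. This is expected: annual_attendance is a season total that repeats on every weekly row for a team, so it cannot vary with the week number. In other words, week number is not a useful predictor of annual attendance.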
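The near-zero slope can be reproduced on a toy panel. When a season total is repeated on every weekly row and every team has the same set of weeks, the fitted slope on week is zero by construction. The data below is invented for illustration; in the real data bye weeks make the panel slightly unbalanced, which would allow a small non-zero estimate.

```r
# Toy data: two teams, three weeks each; the annual total repeats on every row.
toy <- data.frame(
  team              = rep(c("Team A", "Team B"), each = 3),
  week              = rep(1:3, times = 2),
  annual_attendance = rep(c(1000000, 1100000), each = 3)
)

# Regress the repeated annual total on the week number.
fit <- lm(annual_attendance ~ week, data = toy)

# The slope is zero up to floating-point error.
week_slope <- unname(coef(fit)["week"])
```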

Weekly Attendance vs team performance

lmWeek = lm(weekly_attendance~week+wins+loss+margin_of_victory+simple_rating ,data=combined_data)
broom::tidy(lmWeek)

## # A tibble: 6 x 5
##   term              estimate std.error statistic  p.value
##   <chr>                <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)        70764.     8138.      8.69  4.00e-18
## 2 week                 -71.7      17.8    -4.03  5.61e- 5
## 3 wins                -112.      510.     -0.220 8.26e- 1
## 4 loss                -208.      512.     -0.406 6.85e- 1
## 5 margin_of_victory   -142.       64.7    -2.19  2.86e- 2
## 6 simple_rating        250.       55.6     4.49  7.24e- 6

Looking at the coefficients from the model, the p-values for simple_rating, week and margin_of_victory are all below 0.05, so these variables could be meaningful for predicting weekly attendance. The coefficient for week is negative (about -72 spectators per week), which points to slightly lower attendance later in the season. The coefficients for wins and loss are not significant and have large standard errors, likely because the two variables are almost perfectly collinear: they sum to 16 except in seasons with ties.

Note: more validation is required to check the model accuracy. The large sample size may also allow minute differences to be statistically significant.
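The wins and loss terms deserve a closer look, because in a 16-game season without ties one is fully determined by the other. The toy example below (invented win counts and simulated attendance) shows what `lm()` does in the exactly collinear case: the redundant coefficient is reported as NA. In the real data a few tied games break the exact relationship, which is consistent with both terms being estimated but with very large standard errors.

```r
set.seed(1)  # reproducible noise

# Toy seasons with no ties, so loss is fully determined by wins.
wins <- c(3, 5, 8, 8, 10, 11, 12, 6)
loss <- 16 - wins

# Simulated attendance that depends on wins only (illustrative numbers).
attendance <- 60000 + 500 * wins + rnorm(length(wins), sd = 1000)

fit   <- lm(attendance ~ wins + loss)
coefs <- coef(fit)  # the redundant term (loss) comes back as NA
print(coefs)
```

Dropping one of the two variables, or replacing both with a win percentage, would be one way to avoid the redundancy.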

Summary

Based on our analysis, there are not many strong factors that impact attendance, but some variables, such as simple rating, seem to have a slight impact. We also noticed that week number has a negative coefficient, which means that attendance dips as we get into week 8 and beyond. This could be because teams that are out of playoff contention see a drop in attendance. We also noticed that the Dallas Cowboys and the Kansas City Chiefs enjoy the most support from their fans.

An additional opportunity for analysis would be to use the previous year's performance and see whether it has an impact on attendance.
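A previous-season feature could be built by shifting each team's wins down by one year. The sketch below is a base R illustration on invented standings; `standings_toy` and `prev_wins` are hypothetical names, and the approach assumes one row per team and season with no gaps in the years.

```r
# Toy standings; teams, years and win counts are invented for illustration.
standings_toy <- data.frame(
  NFL_team_name = c("Team A", "Team A", "Team A", "Team B", "Team B", "Team B"),
  year          = c(2000, 2001, 2002, 2000, 2001, 2002),
  wins          = c(3, 7, 10, 12, 9, 4)
)

# Sort by team and year so "previous row" means "previous season".
standings_toy <- standings_toy[order(standings_toy$NFL_team_name,
                                     standings_toy$year), ]

# Shift wins down by one season within each team; the first season of each
# team has no history and gets NA.
standings_toy$prev_wins <- ave(
  standings_toy$wins,
  standings_toy$NFL_team_name,
  FUN = function(x) c(NA, head(x, -1))
)
print(standings_toy)
```

The resulting column could then be joined to the attendance data by team and year and added to the regression as a predictor.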

Hari Vuppala

Updated: 2020-04-26
