Introduction

The National Football League (NFL) is a professional American football league consisting of 32 teams, divided equally between the National Football Conference (NFC) and the American Football Conference (AFC). The NFL’s 17-week regular season runs from early September to late December, with each team playing 16 games and having one bye week. Following the conclusion of the regular season, seven teams from each conference (four division winners and three wild card teams) advance to the playoffs, a single-elimination tournament culminating in the Super Bowl, which is usually held on the first Sunday in February and is played between the champions of the NFC and AFC.

The National Football League is the largest live spectator sporting league in the world in terms of average attendance. The NFL is one of the four major professional sports leagues in North America and the highest professional level of American football in the world. As of 2018, the NFL averaged 67,100 live spectators per game, and 17,177,581 total for the season.

Hard Rock Stadium, Miami Gardens, Florida

The purpose of this project is to analyse NFL attendance data from 2000 to 2019 and gain insight into spectator attendance over that 20-year period. The analysis sets out to answer the following questions.

  • Does the win percentage have a bearing on the attendance?
  • Do some teams have better support compared to others?
  • Does a particular week of the season have better or worse attendance?
  • Can we build a model to predict attendance for the 2020 season?
  • Can we classify teams into categories of highest vs lowest attendance?

For this study we use data from the Pro Football Reference website. We will perform some data cleansing and data manipulation to prepare the data for analysis. We will start with exploratory data analysis to understand the data, examine the factors that determine attendance at National Football League games, and build a model to identify the factors that have a bearing on attendance.

These insights can help with ticket pricing, logistics planning, promotions and marketing campaigns.

Required Packages

The below packages are required to run the code.

      library(readr)
      library(tidyverse)
      library(Hmisc)
      library(knitr)
      library(funModeling)
      library(rpart)
      library(skimr)
      library(scales)

Library Index

  • readr - Read rectangular text data (such as csv, tsv and fwf files)
  • tidyverse - will load the below core tidyverse packages
    • ggplot2 - for data visualisation.
    • dplyr - for data manipulation.
    • tidyr - for data tidying.
    • readr - for data import.
    • purrr - for functional programming.
    • tibble - for tibbles, a modern re-imagining of data frames.
    • stringr - for strings.
    • forcats - for factors.
  • Hmisc - data analysis, high-level graphics, utility operations
  • knitr - A General-Purpose Package for Dynamic Report Generation in R
  • funModeling - Exploratory Data Analysis and Data Preparation Tool-Box
  • rpart - Recursive partitioning for classification, regression and survival trees.
  • skimr - Compact and Flexible Summaries of Data
  • scales - Graphical scales map data to aesthetics, and provide methods for automatically determining breaks and labels for axes and legends.

Data Preparation

Data Load

We use data that originates from the Pro Football Reference website. The CSV files are read from the TidyTuesday GitHub repository (2020-02-04 data set).

attendance <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/attendance.csv')
standings <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/standings.csv')
games <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/games.csv')
library(knitr)
# evaluate fig.cap after a chunk is evaluated
opts_knit$set(eval.after = 'fig.cap')

Data Dictionary

Attendance Data

The Attendance data set contains the weekly attendance information for a team for the years 2000 to 2019.

Data Dictionary - attendance.csv

| Variable | Class | Description |
|---|---|---|
| team | character | Team city |
| team_name | character | Team name |
| year | integer | Season year |
| total | double | Total attendance across 17 weeks (1 week = no game) |
| home | double | Home attendance |
| away | double | Away attendance |
| week | double | Week number (1-17) |
| weekly_attendance | double | Weekly attendance number |

Standings Data

The Standings data set contains wins and losses, points scored, and rating measures for each team for each season from 2000 to 2019.

Data Dictionary - standings.csv

| Variable | Class | Description |
|---|---|---|
| team | character | Team city |
| team_name | character | Team name |
| year | integer | Season year |
| wins | double | Wins (0 to 16) |
| loss | double | Losses (0 to 16) |
| points_for | double | Points for (offensive performance) |
| points_against | double | Points against (defensive performance) |
| points_differential | double | Point differential (points_for - points_against) |
| margin_of_victory | double | (Points Scored - Points Allowed) / Games Played |
| strength_of_schedule | double | Average quality of opponent as measured by SRS (Simple Rating System) |
| simple_rating | double | Team quality relative to average (0.0) as measured by SRS (Simple Rating System); SRS = MoV + SoS = OSRS + DSRS |
| offensive_ranking | double | Team offense quality relative to average (0.0) as measured by SRS (Simple Rating System) |
| defensive_ranking | double | Team defense quality relative to average (0.0) as measured by SRS (Simple Rating System) |
| playoffs | character | Made playoffs or not |
| sb_winner | character | Won Super Bowl or not |

Games Data

The Games data set contains details about each game.

Data Dictionary - games.csv

| Variable | Class | Description |
|---|---|---|
| year | integer | Season year; note that playoff games are still counted in the previous season |
| week | character | Week number (1-17, plus playoffs) |
| home_team | character | Home team |
| away_team | character | Away team |
| winner | character | Winning team |
| tie | character | If a tie, the “losing” team as well |
| day | character | Day of week |
| date | character | Date minus year |
| time | character | Time of game start |
| pts_win | double | Points by winning team |
| pts_loss | double | Points by losing team |
| yds_win | double | Yards by winning team |
| turnovers_win | double | Turnovers by winning team |
| yds_loss | double | Yards by losing team |
| turnovers_loss | double | Turnovers by losing team |
| home_team_name | character | Home team name |
| home_team_city | character | Home team city |
| away_team_name | character | Away team name |
| away_team_city | character | Away team city |

Attendance Data

Summary of the Attendance Data

The summary statistics of the attendance data are shown below.

     skim(attendance)
Data summary
Name attendance
Number of rows 10846
Number of columns 8
_______________________
Column type frequency:
character 2
numeric 6
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
team 0 1 5 13 0 32 0
team_name 0 1 4 10 0 32 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1.00 2009.53 5.75 2000 2005.0 2010 2015.00 2019 ▇▇▇▇▇
total 0 1.00 1080910.03 72876.97 760644 1040509.0 1081090 1123230.00 1322087 ▁▁▇▆▁
home 0 1.00 540455.01 66774.65 202687 504360.0 543185 578342.00 741775 ▁▁▅▇▁
away 0 1.00 540455.01 25509.33 450295 524974.0 541757 557741.00 601655 ▁▂▇▇▂
week 0 1.00 9.00 4.90 1 5.0 9 13.00 17 ▇▆▆▆▇
weekly_attendance 638 0.94 67556.88 9022.02 23127 63245.5 68334 72544.75 105121 ▁▁▇▃▁

Data Quality/ Missing Data

We create a single team-name variable, NFL_team_name, by combining the team and team_name attributes so that we can join with the standings and games data sets. The annual totals are also renamed to make their meaning explicit.

attendance_reshape <- rename(attendance , annual_attendance = total , annual_homegame_attendance = home , 
                             annual_awaygame_attendance = away  )  %>% 
                    mutate(NFL_team_name = str_c(team, team_name, sep = " "))

After reviewing the statistics for each variable, we notice that weekly_attendance is missing for 638 rows. We will check whether there is any pattern to the missing values.

     missing_data <- 
     attendance_reshape %>% 
     filter(is.na(weekly_attendance)) 

Every team has one bye week per season, and the missing values correspond to those bye weeks. We can ignore these rows because no game was played. We also notice that there are only 31 teams in 2000 and 2001 and 32 teams from 2002 onwards, which accounts for the 638 missing rows (2 × 31 + 18 × 32 = 638).
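The row counts reported by skim can be cross-checked from the league structure alone. The sketch below is illustrative and not part of the original analysis. It assumes 31 teams in 2000 and 2001, 32 teams afterwards, and 17 weekly rows per team per season (16 games plus one bye). The variable names are made up for this check.

```r
# Team counts per season: 31 teams in 2000-2001, 32 teams in 2002-2019.
teams_per_season <- c(rep(31, 2), rep(32, 18))
weeks_per_season <- 17   # 16 games plus one bye week per team

total_rows <- sum(teams_per_season) * weeks_per_season
bye_rows   <- sum(teams_per_season)      # one bye row per team per season
game_rows  <- total_rows - bye_rows      # rows that carry a weekly attendance

total_rows  # 10846, the number of rows in the attendance data
bye_rows    # 638, the number of missing weekly_attendance values
game_rows   # 10208
```

Both figures agree with the skim summary, which supports reading the missing values as bye weeks rather than as data-quality problems.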

We will filter out these 638 rows with missing values and use the clean data for further analysis.

 attendance_cleansed <-      attendance_reshape %>%       
 filter(! is.na(weekly_attendance)) %>%
 select(NFL_team_name, year, week , weekly_attendance , annual_attendance)

Sample Data for Attendance after cleansing

Sample data after renaming the variables and filtering out the bye-week rows:

    kable(attendance_cleansed[1:5,])
NFL_team_name year week weekly_attendance annual_attendance
Arizona Cardinals 2000 1 77434 893926
Arizona Cardinals 2000 2 66009 893926
Arizona Cardinals 2000 4 71801 893926
Arizona Cardinals 2000 5 66985 893926
Arizona Cardinals 2000 6 44296 893926

Standings Data

Description of the Standings Data

     skim(standings)  
Data summary
Name standings
Number of rows 638
Number of columns 15
_______________________
Column type frequency:
character 4
numeric 11
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
team 0 1 5 13 0 32 0
team_name 0 1 4 10 0 32 0
playoffs 0 1 8 11 0 2 0
sb_winner 0 1 12 13 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1 2009.53 5.76 2000.0 2005.00 2010.0 2014.75 2019.0 ▇▇▇▇▇
wins 0 1 7.98 3.08 0.0 6.00 8.0 10.00 16.0 ▂▆▇▆▂
loss 0 1 7.98 3.08 0.0 6.00 8.0 10.00 16.0 ▂▆▇▆▂
points_for 0 1 350.28 71.40 161.0 299.00 348.0 396.00 606.0 ▂▇▇▂▁
points_against 0 1 350.28 59.55 165.0 310.00 347.0 391.50 517.0 ▁▃▇▆▁
points_differential 0 1 0.00 101.09 -261.0 -75.00 1.5 72.75 315.0 ▂▆▇▅▁
margin_of_victory 0 1 0.00 6.32 -16.3 -4.70 0.1 4.57 19.7 ▂▆▇▅▁
strength_of_schedule 0 1 0.00 1.63 -4.6 -1.10 0.0 1.20 4.3 ▁▅▇▅▁
simple_rating 0 1 0.00 6.20 -17.4 -4.47 0.0 4.50 20.1 ▁▆▇▅▁
offensive_ranking 0 1 0.00 4.34 -11.7 -3.18 0.0 2.70 15.9 ▁▇▇▂▁
defensive_ranking 0 1 0.00 3.57 -9.8 -2.40 0.1 2.50 9.8 ▁▅▇▅▁

Data Quality/ Missing Data

After reviewing the statistics for each variable, the data looks complete. We create a single variable for the full team name by combining the team and team_name attributes so that we can join with the attendance and games data sets, and we keep only the columns needed for the analysis.

standings_reshape <- standings %>% 
mutate(NFL_team_name = str_c(team, team_name, sep = " ")) %>%
select(NFL_team_name, year, wins, loss, margin_of_victory, simple_rating , offensive_ranking, defensive_ranking, playoffs, sb_winner)

Sample Data for Standings

  kable(standings_reshape[1:5,])
NFL_team_name year wins loss margin_of_victory simple_rating offensive_ranking defensive_ranking playoffs sb_winner
Miami Dolphins 2000 11 5 6.1 7.1 0.0 7.1 Playoffs No Superbowl
Indianapolis Colts 2000 10 6 6.4 7.9 7.1 0.8 Playoffs No Superbowl
New York Jets 2000 9 7 0.0 3.5 1.4 2.2 No Playoffs No Superbowl
Buffalo Bills 2000 8 8 -2.2 0.0 0.5 -0.5 No Playoffs No Superbowl
New England Patriots 2000 5 11 -3.9 -2.5 -2.7 0.2 No Playoffs No Superbowl

Games Data

Summary of the Games Data

    skim(games) 
Data summary
Name games
Number of rows 5324
Number of columns 19
_______________________
Column type frequency:
character 11
difftime 1
numeric 7
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
week 0 1 1 9 0 21 0
home_team 0 1 13 20 0 34 0
away_team 0 1 13 20 0 34 0
winner 0 1 13 20 0 34 0
tie 5314 0 14 18 0 7 0
day 0 1 3 3 0 7 0
date 0 1 9 12 0 154 0
home_team_name 0 1 4 10 0 32 0
home_team_city 0 1 5 13 0 32 0
away_team_name 0 1 4 10 0 32 0
away_team_city 0 1 5 13 0 32 0

Variable type: difftime

skim_variable n_missing complete_rate min max median n_unique
time 0 1 30900 secs 84900 secs 47040 secs 187

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1 2009.53 5.75 2000 2005 2010 2015 2019 ▇▇▇▇▇
pts_win 0 1 27.78 8.83 3 21 27 34 62 ▁▇▇▂▁
pts_loss 0 1 16.09 8.14 0 10 16 21 51 ▆▇▅▁▁
yds_win 0 1 361.64 78.58 47 308 361 415 653 ▁▂▇▃▁
turnovers_win 0 1 1.08 1.04 0 0 1 2 7 ▇▂▁▁▁
yds_loss 0 1 309.08 84.50 26 251 306 366 613 ▁▅▇▃▁
turnovers_loss 0 1 2.17 1.42 0 1 2 3 8 ▆▇▂▁▁

Data Quality/ Missing Data

After reviewing the statistics for each variable, the values in the week variable look ambiguous. On closer inspection it holds both numeric and text values: the weeks after the regular season are recorded as WildCard, Division, ConfChamp and SuperBowl. These are valid values for playoff games. As the analysis covers only regular-season games, we will exclude the playoff games.

The games data set contains the result of each game. A total of 5,104 games were played in the regular seasons from 2000 to 2019 (248 per season with 31 teams in 2000 and 2001, and 256 per season with 32 teams from 2002 onwards). We need to change the data type of week from character to numeric so that we can join the games data set with the attendance data set.

   # Keep regular-season games only (weeks 1-17) and convert week to numeric
   games_regular <- games %>%
                   filter(week %in% as.character(1:17)) %>%
                   mutate(week = as.numeric(week)) %>%
                   select(home_team, away_team, year, week, winner)
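Before filtering, it can help to list which week labels are not numeric. The sketch below is a minimal illustration on a toy vector, not the report's own code. The playoff labels are the ones named in the text, and the variable names are hypothetical.

```r
# Toy week labels mixing regular-season numbers and playoff rounds.
week <- c("1", "2", "17", "WildCard", "Division", "ConfChamp", "SuperBowl")

# A label is a regular-season week if it consists of digits only.
is_regular <- grepl("^[0-9]+$", week)

regular_weeks  <- as.numeric(week[is_regular])
playoff_labels <- unique(week[!is_regular])

regular_weeks   # 1 2 17
playoff_labels  # "WildCard" "Division" "ConfChamp" "SuperBowl"
```

Checking the non-numeric labels first avoids the NA values, and the accompanying warning, that as.numeric() would produce if it were applied to the unfiltered column.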

We reshape the games data so that the home and away teams are stacked into a single NFL_team_name column, with home_ind indicating whether the team played at home or away. Each game therefore contributes two rows.

  games_reshape <- games_regular %>%
                   gather(home_ind,NFL_team_name,home_team:away_team ) %>%
                   select(NFL_team_name, year, week, home_ind, winner)
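In current versions of tidyr, gather() is superseded by pivot_longer(). The sketch below shows how the same wide-to-long reshape might look with pivot_longer() on two hypothetical games. The team names are placeholders rather than real results, and this is not the code used to produce the output in this report.

```r
library(tidyr)
library(tibble)

# Two hypothetical games; team names are placeholders, not real results.
toy_games <- tribble(
  ~home_team, ~away_team, ~year, ~week, ~winner,
  "Team A",   "Team B",   2000,  1,     "Team A",
  "Team C",   "Team D",   2000,  1,     "Team D"
)

# Stack home_team and away_team into one column; home_ind records the source column.
toy_long <- pivot_longer(
  toy_games,
  cols      = c(home_team, away_team),
  names_to  = "home_ind",
  values_to = "NFL_team_name"
)

toy_long
```

One difference to be aware of: pivot_longer() keeps the two rows of each game next to each other, whereas gather() lists all home-team rows first, so row order (but not content) differs between the two approaches.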

Sample Data for Games

   kable(games_reshape[1:5,])
NFL_team_name year week home_ind winner
Minnesota Vikings 2000 1 home_team Minnesota Vikings
Kansas City Chiefs 2000 1 home_team Indianapolis Colts
Washington Redskins 2000 1 home_team Washington Redskins
Atlanta Falcons 2000 1 home_team Atlanta Falcons
Pittsburgh Steelers 2000 1 home_team Baltimore Ravens

Combined Dataset

We will create an annual team attendance summary data set. Since the home and away attendance figures are season totals, and each team plays 8 home and 8 away games in a 16-game season, we divide them by 8 to get the average attendance per home game and per away game.

# Create a team annual summaries data set.
# The annual totals repeat on every row of a team-season, so mean() recovers them.
  team_summaries <- attendance_reshape %>% 
                    group_by(NFL_team_name,year) %>% 
                    summarise( Average_home = mean(annual_homegame_attendance)/8,
                               Average_away = mean(annual_awaygame_attendance)/8 )
 # Create an annual attendance summary data set
 annual_summaries <- attendance_cleansed %>% 
                     group_by(year) %>% 
                     summarise(Average_nfl = mean( weekly_attendance))
  # Create a summary data set with home, away and NFL attendance summaries
   attendance_summaries <- left_join(team_summaries, annual_summaries, 
                                   by = c( "year" ))   

Sample Data for Annual Summaries

   kable(attendance_summaries[1:5,])
NFL_team_name year Average_home Average_away Average_nfl
Arizona Cardinals 2000 48434.38 63306.38 65934.44
Arizona Cardinals 2001 38414.38 63009.50 65753.58
Arizona Cardinals 2002 40909.00 71450.62 66325.67
Arizona Cardinals 2003 36062.38 64487.75 66674.36
Arizona Cardinals 2004 37533.38 67286.25 67462.60

We will also create a weekly attendance summary data set

 weekly_summaries <- attendance_cleansed %>% 
   group_by(week) %>% 
   summarise(
     Average_week = mean(weekly_attendance)
   )

Sample Data for Weekly Summaries

   kable(weekly_summaries[1:5,])
week Average_week
1 68532.91
2 67430.24
3 67761.31
4 67581.84
5 68629.94

We combine the attendance, standings and games data sets into a single data set that is used in the exploratory data analysis.

   combined_data <-   inner_join(attendance_cleansed, standings_reshape, 
                                   by = c( "NFL_team_name", "year"))      %>%
                             inner_join(. , games_reshape,
                            by = c("NFL_team_name" , "year" , "week"))
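An inner join silently drops rows whose key has no match, for example when a team name is spelled differently in two tables. A simple way to check for this, sketched below on hypothetical miniature tables, is to run anti_join() with the same keys. The toy tables and their values are assumptions made for illustration.

```r
library(dplyr)
library(tibble)

# Hypothetical miniature versions of two tables that share a join key.
toy_attendance <- tribble(
  ~NFL_team_name, ~year, ~week, ~weekly_attendance,
  "Team A",       2000,  1,     60000,
  "Team A",       2000,  2,     61000,
  "Team B",       2000,  1,     70000
)
toy_games <- tribble(
  ~NFL_team_name, ~year, ~week, ~home_ind,
  "Team A",       2000,  1,     "home_team",
  "Team A",       2000,  2,     "away_team"
)

keys <- c("NFL_team_name", "year", "week")

joined    <- inner_join(toy_attendance, toy_games, by = keys)
unmatched <- anti_join(toy_attendance, toy_games, by = keys)

nrow(joined)     # rows kept by the inner join
nrow(unmatched)  # rows the inner join dropped
```

Applied to the real tables, an empty anti_join() result would confirm that every attendance row found its standings and games counterpart.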

Sample Data for Combined Dataset

   kable(combined_data[1:5,])
NFL_team_name year week weekly_attendance annual_attendance wins loss margin_of_victory simple_rating offensive_ranking defensive_ranking playoffs sb_winner home_ind winner
Arizona Cardinals 2000 1 77434 893926 3 13 -14.6 -15.2 -7.2 -8.1 No Playoffs No Superbowl away_team New York Giants
Arizona Cardinals 2000 2 66009 893926 3 13 -14.6 -15.2 -7.2 -8.1 No Playoffs No Superbowl home_team Arizona Cardinals
Arizona Cardinals 2000 4 71801 893926 3 13 -14.6 -15.2 -7.2 -8.1 No Playoffs No Superbowl home_team Green Bay Packers
Arizona Cardinals 2000 5 66985 893926 3 13 -14.6 -15.2 -7.2 -8.1 No Playoffs No Superbowl away_team San Francisco 49ers
Arizona Cardinals 2000 6 44296 893926 3 13 -14.6 -15.2 -7.2 -8.1 No Playoffs No Superbowl home_team Arizona Cardinals

Exploratory Data Analysis

The objective of exploratory data analysis is to understand the data better and identify the variables that may have a significant effect on attendance. We will look at different variables and plot some graphs to understand them better.

A look at Teams in the League

The games data lists 34 distinct teams, although the NFL has only 32 teams in any given season. Let's look at a bar chart of games per team to understand what happened.

# Count the games per team, ordered by number of games and coloured by season
ggplot(data = combined_data,
       aes(x = reorder(factor(NFL_team_name), NFL_team_name, length),
           fill = factor(year))) +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 0), legend.title = element_blank(),
        legend.position = "bottom", axis.title.x = element_blank(), 
        axis.title.y = element_blank(), plot.title = element_text(hjust = 0.5)) +
        ggtitle("Games played by teams 2000-2019")  +
        coord_flip()
Figure 1 - NFL teams, 2000-2019.

Looking at the teams that did not play all 320 games (20 seasons of 16 games), we notice that the Rams moved from St. Louis to Los Angeles in 2016 and the Chargers moved from San Diego to Los Angeles in 2017, so each franchise appears under two names. The Houston Texans only joined the league in 2002.

Attendance based on week number of the season

Let's find out whether attendance spikes in any particular week of the season. We plot the average attendance for each week number, taken from the weekly summary data, to see whether particular weeks draw higher attendance.

 ggplot(data=weekly_summaries, aes(x= week, y= Average_week)) +
   geom_point() +
   geom_smooth()+
   theme(axis.text.x = element_text(angle = 0), legend.title = element_blank(),
         legend.position = "bottom", axis.title.x = element_blank(), 
         axis.title.y = element_blank(), plot.title = element_text(hjust = 0.5)) + 
         ggtitle("Average attendance (per week number) 2000-2019")

Figure 2 - Attendance for Week#.

Looking at this plot, there seems to be a dip in attendance after week 8. Let's draw a box plot to see whether the difference is notable.

combined_data %>%
                    mutate(week = factor(week)) %>%
                    ggplot(aes(week, weekly_attendance, fill = week)) +
                    geom_boxplot(show.legend = FALSE, outlier.alpha = 0.5) +
                    labs( x = "Week of NFL season",
                          y = "Weekly NFL game attendance") +
                  theme(axis.text.x = element_text(angle = 00), legend.title = element_blank(),
                        legend.position = "bottom", axis.title.x = element_blank(), 
                        axis.title.y = element_blank(), plot.title = element_text(hjust = 0.5)) + 
                        scale_y_continuous(label = unit_format(unit = "K", scale = 1/1000, sep = "")) 

Figure 3 - Attendance for Week#.

Looking at the box plots, there does not seem to be much variation between weeks. Now let's look at the impact of home and away games on attendance.
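A box plot gives only a visual impression. One way to put a number on it, which is not part of the original analysis, is a rank-based test such as Kruskal-Wallis, which asks whether the weekly distributions differ. The sketch below runs on simulated data with no built-in week effect. The mean and standard deviation are rough assumptions, so the result says nothing about the real NFL figures.

```r
set.seed(42)

# Simulated attendance for 17 weeks with no real week effect (assumption).
toy <- data.frame(
  week = factor(rep(1:17, each = 30)),
  weekly_attendance = rnorm(17 * 30, mean = 67500, sd = 9000)
)

# Kruskal-Wallis rank-sum test: do the weekly distributions differ?
kw <- kruskal.test(weekly_attendance ~ week, data = toy)

kw$parameter  # degrees of freedom: number of weeks minus one
kw$p.value    # a large value gives no evidence of a week effect
```

On the real data the same call would take combined_data with week converted to a factor. With roughly ten thousand rows, even small differences may come out as statistically significant, so the effect size matters as much as the p-value.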

Home games vs Away Games attendance

We will look at the average attendance for home and away games along with the overall NFL game attendance average.

 ggplot(data = attendance_summaries) +
   geom_line(aes(x = year, y = Average_nfl, col='NFL Game Average')) +
   geom_line(aes(x = year, y = Average_away, col='Away Game Average')) +
   geom_line(aes(x = year, y = Average_home, col='Home Game Average')) +
   facet_wrap(facets = vars(NFL_team_name), shrink = TRUE) +
   theme(axis.text.x = element_text(angle = 00), legend.title = element_blank(),
         legend.position = "bottom", axis.title.x = element_blank(), 
         axis.title.y = element_blank(), plot.title = element_text(hjust = 0.5)) + 
   scale_x_continuous(labels = function(x) substring(x,3,4)) +
   scale_y_continuous(label = unit_format(unit = "K", scale = 1/1000, sep = "")) + 
   scale_color_manual(values = c("red", "royalblue", "black")) + 
   ggtitle("Average NFL attendance (per game) 2000-2019")
Figure 4 - Home and away averages compared with the NFL average.

We do not see any common pattern across all the teams; the trends appear to be specific to each team. The Dallas Cowboys' home game attendance is well above that of the other teams. The Kansas City Chiefs can also be classified as a team with good home support, and the Washington Redskins had very strong home support until 2017. The Oakland Raiders seem to have below-par home game attendance, and the Cincinnati Bengals also have lower home game attendance than most other teams.

Weekly Attendance vs Playoffs

Let us create a box plot of weekly attendance for each team, split by whether the team qualified for the playoffs that season.

combined_data %>%
  ggplot(aes(fct_reorder(NFL_team_name, weekly_attendance),
             weekly_attendance,
             fill = playoffs
  )) +
  geom_boxplot(outlier.alpha = 0.5) +
  coord_flip() +
  labs(
    fill = NULL, x = NULL,
    y = "Weekly NFL game attendance"
  ) + 
theme(axis.text.x = element_text(angle = 00), legend.title = element_blank(),
         legend.position = "bottom", axis.title.x = element_blank(), 
         axis.title.y = element_blank(), plot.title = element_text(hjust = 0.5))  

Figure 5 - weekly nfl game attendance for playoffs and non playoffs.

Again, we do not see many patterns common to all teams. Attendance for the Washington Redskins and Dallas Cowboys is higher in seasons when they make the playoffs. Since the Los Angeles Rams data set is very small, we will ignore the trend where they seem to have higher attendance when they do not make the playoffs.

Machine Learning

Basic Scatterplot Matrix

Creating simple scatterplot Matrix to check for correlation.

   pairs(~annual_attendance+wins+margin_of_victory+simple_rating+offensive_ranking+defensive_ranking,
   data=combined_data,
   main="Simple Scatterplot Matrix")

We do not observe a strong correlation between annual attendance and the team performance variables.

Creating a linear Regression Model

Let us build a couple of simple linear regression models to identify the factors that may impact the game attendance numbers.

Attendance vs week number

lmann = lm(annual_attendance~week ,data=combined_data)
broom::tidy(lmann)
## # A tibble: 2 x 5
##   term         estimate std.error statistic p.value
##   <chr>           <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept) 1081044.      1501.   720.      0    
## 2 week            -14.7      145.    -0.101   0.919

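Looking at the coefficients from the model, the p-value for week is 0.919, so we cannot reject the hypothesis that the week coefficient is zero. This is expected: annual_attendance is a season total that is repeated on every weekly row, so it cannot vary with the week number. In other words, week number is not a useful predictor of annual attendance.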
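The reason this regression cannot find a week effect can be seen on a toy example. The sketch below uses made-up season totals for three hypothetical teams. Because the response is constant within each team, the fitted slope for week is numerically zero in this balanced toy case. In the real data the slope is small but not exactly zero because bye weeks fall in different weeks for different teams.

```r
# Three hypothetical teams, each with 16 weekly rows in one season.
toy <- expand.grid(week = 1:16, team = c("A", "B", "C"),
                   stringsAsFactors = FALSE)

# The season total is the same on every weekly row of a team (made-up values).
season_total <- c(A = 900000, B = 1000000, C = 1100000)
toy$annual_attendance <- unname(season_total[toy$team])

fit <- lm(annual_attendance ~ week, data = toy)
week_slope <- unname(coef(fit)["week"])

week_slope  # numerically zero: week carries no information about a season total
```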

Weekly Attendance vs team performance

lmWeek = lm(weekly_attendance~week+wins+loss+margin_of_victory+simple_rating ,data=combined_data)
broom::tidy(lmWeek)
## # A tibble: 6 x 5
##   term              estimate std.error statistic  p.value
##   <chr>                <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)        70764.     8138.      8.69  4.00e-18
## 2 week                 -71.7      17.8    -4.03  5.61e- 5
## 3 wins                -112.      510.     -0.220 8.26e- 1
## 4 loss                -208.      512.     -0.406 6.85e- 1
## 5 margin_of_victory   -142.       64.7    -2.19  2.86e- 2
## 6 simple_rating        250.       55.6     4.49  7.24e- 6

Looking at the coefficients from the model, the p-values for simple_rating and margin_of_victory are below 0.05, so these variables could be meaningful for predicting attendance. The coefficient for week is negative (about 72 fewer spectators per week of the season) and its p-value is also below 0.05, so it could be meaningful as well. The wins and loss coefficients are not statistically significant.

Note: further validation is required to check the model's accuracy. With a sample this large, even minute differences can be statistically significant.
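One thing to keep in mind when reading the wins and loss rows is that the two predictors are nearly collinear: in a 16-game season without ties, losses are simply 16 minus wins. The sketch below, on simulated seasons with no ties, shows the extreme case in which lm() cannot estimate both coefficients. All numbers are made up for illustration.

```r
set.seed(1)

# Hypothetical seasons with no ties, so loss is fully determined by wins.
wins <- sample(0:16, 50, replace = TRUE)
loss <- 16 - wins
attendance <- 65000 + 200 * wins + rnorm(50, sd = 3000)

fit <- lm(attendance ~ wins + loss)
coef(fit)  # the loss coefficient is NA: it is aliased with wins
```

In the real standings a handful of tied games break the exact relationship, so both coefficients can be estimated, but near-collinearity of this kind typically inflates the standard errors. Dropping one of the two variables, or using win percentage instead, would be a reasonable variation to try.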

Summary

Based on our analysis, there are not many strong factors that affect attendance, but some variables, such as simple rating, seem to have a slight effect. Week number has a negative coefficient, which suggests that attendance dips as the season moves into week 8 and beyond. This could be because teams that are no longer in playoff contention see a drop in attendance. We also noticed that the Dallas Cowboys and Kansas City Chiefs enjoy the most support from their fans.

An additional opportunity for analysis would be to use the previous season's performance and see whether it has an impact on attendance.
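As a starting point for that follow-up, the previous season's win count could be attached to each team-season with dplyr's lag(). The sketch below uses hypothetical standings for two placeholder teams. The column name prev_wins is an assumption, and none of this is part of the analysis above.

```r
library(dplyr)
library(tibble)

# Hypothetical standings for two placeholder teams over three seasons.
toy_standings <- tribble(
  ~NFL_team_name, ~year, ~wins,
  "Team A",       2000,  10,
  "Team A",       2001,  6,
  "Team A",       2002,  12,
  "Team B",       2000,  4,
  "Team B",       2001,  9,
  "Team B",       2002,  8
)

# For each team, attach the previous season's win count.
with_prev <- toy_standings %>%
  group_by(NFL_team_name) %>%
  arrange(year, .by_group = TRUE) %>%
  mutate(prev_wins = lag(wins)) %>%
  ungroup()

with_prev
```

The first season of each team has no previous value and gets NA, so those rows would need to be dropped before fitting a model on prev_wins.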