Introduction

The National Football League (NFL) is a professional American football league consisting of 32 teams, divided equally between the National Football Conference (NFC) and the American Football Conference (AFC). The NFL’s 17-week regular season runs from early September to late December, with each team playing 16 games and having one bye week. Following the conclusion of the regular season, seven teams from each conference (four division winners and three wild card teams) advance to the playoffs, a single-elimination tournament culminating in the Super Bowl, which is usually held on the first Sunday in February and is played between the champions of the NFC and AFC.

The National Football League is the largest live spectator sporting league in the world in terms of average attendance. The NFL is one of the four major professional sports leagues in North America and the highest professional level of American football in the world. As of 2018, the NFL averaged 67,100 live spectators per game, and 17,177,581 total for the season.

The purpose of this project is to analyse NFL attendance data from 2000 to 2019 and gain insight into spectator attendance over the 20-year period. The objectives are to address the questions below.

  • Which teams have the most loyal fans?
  • Does a playoff team pull more attendance for the games?
  • Does the win percentage have a bearing on the attendance?
  • Historically, do some teams have better support than others?
  • Does crowd support help a team progress to the playoffs?
  • Can we build a model to predict attendance for the 2020 season?
  • Can we classify teams into categories, separating those with the highest attendance from those with the lowest?

For this study we use data from the Pro Football Reference website. We will perform some data cleansing and data manipulation to prepare the data for analysis. We will start with exploratory data analysis to understand the data, examine the factors that determine attendance at National Football League games, and build a classification model to identify the teams with the highest and the lowest attendance.

These insights will help with ticket pricing, logistics planning, promotions and marketing campaigns.

Packages Required

The below packages are required

  • readr - for importing rectangular data such as csv files
  • tidyverse - will load the below core tidyverse packages
    • ggplot2 - for data visualisation.
    • dplyr - for data manipulation.
    • tidyr - for data tidying.
    • readr - for data import.
    • purrr - for functional programming.
    • tibble - for tibbles, a modern re-imagining of data frames.
    • stringr - for strings.
    • forcats - for factors.
  • Hmisc - data analysis, high-level graphics, utility operations
  • knitr - A General-Purpose Package for Dynamic Report Generation in R
  • funModeling - Exploratory Data Analysis and Data Preparation Tool-Box
  • rpart - Recursive partitioning for classification, regression and survival trees.
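
The list above can be turned into a small pre-flight check before the report is knitted. This is a minimal sketch, not the author's setup chunk; the variable names are illustrative.

```r
# Packages named in the list above; tidyverse attaches ggplot2, dplyr, tidyr,
# readr, purrr, tibble, stringr and forcats in one call.
required_packages <- c("tidyverse", "Hmisc", "knitr", "funModeling", "rpart")

# requireNamespace() returns FALSE rather than raising an error when a
# package is not installed.
is_installed <- vapply(required_packages, requireNamespace,
                       logical(1), quietly = TRUE)
missing_packages <- required_packages[!is_installed]

if (length(missing_packages) > 0) {
  message("Install before knitting: ", paste(missing_packages, collapse = ", "))
} else {
  # Attach everything only once all packages are known to be available.
  invisible(lapply(required_packages, library, character.only = TRUE))
}
```

Attaching Hmisc after the tidyverse masks `dplyr::summarize`; the code in this report uses the `summarise` spelling, which is not affected.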

Data Preparation

The source data is from the Pro Football Reference website. The required data has been extracted into three csv files (attendance.csv, standings.csv and games.csv) and made available in the tidytuesday GitHub repository. We import the data from the csv files in that repository. The hyperlinks for the csv files are below.

Attendance Data, Standings Data, Games Data
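
A minimal sketch of the import step follows. The base URL is an assumption about where the three files sit in the tidytuesday repository and should be replaced by the hyperlinks above if they differ; `load_nfl_data` is a hypothetical helper name.

```r
library(readr)

# Assumed location of the files in the tidytuesday repository (not confirmed
# by this report); adjust if the hyperlinks above point elsewhere.
base_url <- paste0("https://raw.githubusercontent.com/rfordatascience/",
                   "tidytuesday/master/data/2020/2020-02-04")

dataset_names <- c("attendance", "standings", "games")
csv_urls <- setNames(paste0(base_url, "/", dataset_names, ".csv"),
                     dataset_names)

# Hypothetical helper: reads all three files into a named list of tibbles.
load_nfl_data <- function(urls = csv_urls) {
  lapply(urls, read_csv)
}

# Requires network access, so the calls are left commented out here:
# nfl        <- load_nfl_data()
# attendance <- nfl$attendance
# standings  <- nfl$standings
# games      <- nfl$games
```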

Attendance Data

The attendance data set contains the weekly attendance for each team in each season.

Data Dictionary - attendance.csv

Variable           Class      Description
team               character  Team city
team_name          character  Team name
year               integer    Season year
total              double     Total attendance across 17 weeks (1 week = no game)
home               double     Home attendance
away               double     Away attendance
week               character  Week number (1-17)
weekly_attendance  double     Weekly attendance number
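
The dictionary suggests that total is the sum of home and away. A quick sketch of that check, using three 2000-season records whose figures appear in the data listings of this report:

```r
# Three team-season records copied from the attendance data in this report.
season_totals <- data.frame(
  team  = c("Arizona", "Atlanta", "Baltimore"),
  year  = 2000,
  total = c(893926, 964579, 1062373),
  home  = c(387475, 422814, 551695),
  away  = c(506451, 541765, 510678)
)

# TRUE where the season total equals home plus away attendance.
season_totals$consistent <- with(season_totals, total == home + away)
season_totals
```

The same comparison could be run on the full attendance data to confirm that the relationship holds for every team and season.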

Summary of the Attendance Data

Looking at the summary statistics of the Attendance Data

    describe(attendance) %>% html()

attendance

8 Variables   10846 Observations

team
        n  missing distinct
    10846        0       32
lowest : Arizona, Atlanta, Baltimore, Buffalo, Carolina
highest: Seattle, St. Louis, Tampa Bay, Tennessee, Washington

team_name
        n  missing distinct
    10846        0       32
lowest : 49ers Bears Bengals Bills Broncos , highest: Seahawks Steelers Texans Titans Vikings

year
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
    10846        0       20    0.997     2010    6.635     2001     2002     2005     2010     2015     2018     2019
lowest : 2000 2001 2002 2003 2004 , highest: 2015 2016 2017 2018 2019
  Value       2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  2010  2011
  Frequency    527   527   544   544   544   544   544   544   544   544   544   544
  Proportion 0.049 0.049 0.050 0.050 0.050 0.050 0.050 0.050 0.050 0.050 0.050 0.050

  Value       2012  2013  2014  2015  2016  2017  2018  2019
  Frequency    544   544   544   544   544   544   544   544
  Proportion 0.050 0.050 0.050 0.050 0.050 0.050 0.050 0.050

total
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
    10846        0      637        1  1080910    78566   967434   999843  1040509  1081090  1123230  1161974  1195369
lowest : 760644 783367 803556 804401 811391 , highest: 1303393 1307231 1309211 1312509 1322087

home
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
    10846        0      603        1   540455    71678   428311   463353   504360   543185   578342   623325   631365
lowest : 202687 254007 262145 288499 300267 , highest: 727432 731672 732958 740318 741775

away
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
    10846        0      636        1   540455    28682   495744   505417   524974   541757   557741   572774   581257
lowest : 450295 456947 466104 471918 473459 , highest: 596357 596935 598761 601080 601655

week
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
    10846        0       17    0.997        9    5.648        1        2        5        9       13       16       17
lowest : 1 2 3 4 5 , highest: 13 14 15 16 17
  Value          1     2     3     4     5     6     7     8     9    10    11    12
  Frequency    638   638   638   638   638   638   638   638   638   638   638   638
  Proportion 0.059 0.059 0.059 0.059 0.059 0.059 0.059 0.059 0.059 0.059 0.059 0.059

  Value         13    14    15    16    17
  Frequency    638   638   638   638   638
  Proportion 0.059 0.059 0.059 0.059 0.059

weekly_attendance
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
    10208      638     4073        1    67557     9495    52433    57278    63246    68334    72545    77999    79114
lowest : 23127 23217 23531 24193 25015 , highest: 95595 95952 100621 103467 105121

Data Quality/ Missing Data

After reviewing the statistics for each variable, we notice that weekly_attendance is missing for 638 rows. These rows correspond to the one bye week for each team in each season. All the other variables look good.

    missing_data <-
      attendance %>%
      filter(is.na(weekly_attendance))

    missing_data
## # A tibble: 638 x 8
##    team       team_name  year   total   home   away  week weekly_attendance
##    <chr>      <chr>     <dbl>   <dbl>  <dbl>  <dbl> <dbl>             <dbl>
##  1 Arizona    Cardinals  2000  893926 387475 506451     3                NA
##  2 Atlanta    Falcons    2000  964579 422814 541765    15                NA
##  3 Baltimore  Ravens     2000 1062373 551695 510678    14                NA
##  4 Buffalo    Bills      2000 1098587 560695 537892     4                NA
##  5 Carolina   Panthers   2000 1095192 583489 511703     4                NA
##  6 Chicago    Bears      2000 1080684 535552 545132     9                NA
##  7 Cincinnati Bengals    2000  967434 469992 497442     1                NA
##  8 Cleveland  Browns     2000 1057139 581544 475595    17                NA
##  9 Dallas     Cowboys    2000 1075470 504360 571110     6                NA
## 10 Denver     Broncos    2000 1140030 604042 535988     9                NA
## # ... with 628 more rows

We will check to see if there is any pattern to the missing values.

Grouping by team to see if the missing data is specific to a team

    missing_data %>% group_by(team) %>% summarise(count_occurances = n())
## # A tibble: 32 x 2
##    team       count_occurances
##    <chr>                 <int>
##  1 Arizona                  20
##  2 Atlanta                  20
##  3 Baltimore                20
##  4 Buffalo                  20
##  5 Carolina                 20
##  6 Chicago                  20
##  7 Cincinnati               20
##  8 Cleveland                20
##  9 Dallas                   20
## 10 Denver                   20
## # ... with 22 more rows

Grouping by year to see if the missing data is specific to a year

    missing_data %>% group_by(year) %>% summarise(count_occurances = n())
## # A tibble: 20 x 2
##     year count_occurances
##    <dbl>            <int>
##  1  2000               31
##  2  2001               31
##  3  2002               32
##  4  2003               32
##  5  2004               32
##  6  2005               32
##  7  2006               32
##  8  2007               32
##  9  2008               32
## 10  2009               32
## 11  2010               32
## 12  2011               32
## 13  2012               32
## 14  2013               32
## 15  2014               32
## 16  2015               32
## 17  2016               32
## 18  2017               32
## 19  2018               32
## 20  2019               32

Grouping by week to see if the missing data is specific to a week

    missing_data %>% group_by(week) %>% summarise(count_occurances = n())
## # A tibble: 17 x 2
##     week count_occurances
##    <dbl>            <int>
##  1     1                4
##  2     2                6
##  3     3               26
##  4     4               60
##  5     5               74
##  6     6               78
##  7     7               78
##  8     8               88
##  9     9               92
## 10    10               68
## 11    11               38
## 12    12               14
## 13    13                4
## 14    14                2
## 15    15                2
## 16    16                2
## 17    17                2

The results above show that the missing values are spread across all 32 teams and all 20 seasons, with one bye week per team per season, most often between weeks 4 and 10. We can ignore these rows, as the team plays no game in its bye week.

We also notice that in 2000 and 2001 there are only 31 teams; from 2002 onwards there are 32 (the Houston Texans joined the league in 2002).

We will filter out these 638 rows with missing values and use the clean data for further analysis.

    attendance_cleansed <-
      attendance %>%
      filter(!is.na(weekly_attendance))
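
A small sanity check can confirm that the filter removes exactly the rows with missing attendance and nothing else. The sketch below uses a four-row stand-in for the attendance data (Arizona's first weeks of 2000, with week 3 as the bye); applied to the full data, the row count would be expected to fall from 10846 to 10208.

```r
library(dplyr)

# Toy stand-in for the attendance data; values copied from this report,
# with week 3 as Arizona's bye week.
toy_attendance <- tibble(
  team = "Arizona", year = 2000, week = 1:4,
  weekly_attendance = c(77434, 66009, NA, 71801)
)

toy_cleansed <- toy_attendance %>%
  filter(!is.na(weekly_attendance))

# The number of rows dropped should equal the number of missing values.
rows_dropped <- nrow(toy_attendance) - nrow(toy_cleansed)
stopifnot(rows_dropped == sum(is.na(toy_attendance$weekly_attendance)))
```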

Sample Data for Attendance after cleansing

Looking at a small sample set of 5 rows from the Attendance data

    kable(attendance_cleansed[1:5,])
team     team_name  year  total   home    away    week  weekly_attendance
Arizona  Cardinals  2000  893926  387475  506451  1     77434
Arizona  Cardinals  2000  893926  387475  506451  2     66009
Arizona  Cardinals  2000  893926  387475  506451  4     71801
Arizona  Cardinals  2000  893926  387475  506451  5     66985
Arizona  Cardinals  2000  893926  387475  506451  6     44296

Standings Data

Data Dictionary - standings.csv

Variable              Class      Description
team                  character  Team city
team_name             character  Team name
year                  integer    Season year
wins                  double     Wins (0 to 16)
loss                  double     Losses (0 to 16)
points_for            double     Points for (offensive performance)
points_against        double     Points against (defensive performance)
points_differential   double     Point differential (points_for - points_against)
margin_of_victory     double     (Points scored - points allowed) / games played
strength_of_schedule  double     Average quality of opponent as measured by SRS (Simple Rating System)
simple_rating         double     Team quality relative to average (0.0) as measured by SRS (Simple Rating System); SRS = MoV + SoS = OSRS + DSRS
offensive_ranking     double     Team offense quality relative to average (0.0) as measured by SRS (Simple Rating System)
defensive_ranking     double     Team defense quality relative to average (0.0) as measured by SRS (Simple Rating System)
playoffs              character  Made playoffs or not
sb_winner             character  Won Super Bowl or not
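
The derived columns in this dictionary can be recomputed from the raw ones. The sketch below does so for two 2000-season records that appear in the standings sample in this report (Miami and Indianapolis). It assumes neither team had a tie, so that wins plus losses gives games played; `win_pct` is an illustrative column that is not part of the source data.

```r
# Two 2000-season rows copied from the standings sample in this report.
standings_check <- data.frame(
  team_name            = c("Dolphins", "Colts"),
  wins                 = c(11, 10),
  loss                 = c(5, 6),
  points_for           = c(323, 429),
  points_against       = c(226, 326),
  margin_of_victory    = c(6.1, 6.4),
  strength_of_schedule = c(1.0, 1.5),
  simple_rating        = c(7.1, 7.9),
  offensive_ranking    = c(0.0, 7.1),
  defensive_ranking    = c(7.1, 0.8)
)

derived <- within(standings_check, {
  games_played        <- wins + loss                       # assumes no ties
  points_differential <- points_for - points_against
  mov_recomputed      <- round(points_differential / games_played, 1)
  srs_from_mov_sos    <- round(margin_of_victory + strength_of_schedule, 1)
  srs_from_osrs_dsrs  <- round(offensive_ranking + defensive_ranking, 1)
  win_pct             <- wins / games_played               # illustrative only
})

derived[, c("team_name", "points_differential", "mov_recomputed",
            "srs_from_mov_sos", "srs_from_osrs_dsrs", "win_pct")]
```

For both rows the recomputed margin of victory and both SRS decompositions agree with the published columns after rounding to one decimal place. Because the published values are themselves rounded, other rows may differ by 0.1.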

Description of the Standings Data

    describe(standings) %>% html()

standings

15 Variables   638 Observations

team
        n  missing distinct
      638        0       32
lowest : Arizona, Atlanta, Baltimore, Buffalo, Carolina
highest: Seattle, St. Louis, Tampa Bay, Tennessee, Washington

team_name
        n  missing distinct
      638        0       32
lowest : 49ers Bears Bengals Bills Broncos , highest: Seahawks Steelers Texans Titans Vikings

year
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
      638        0       20    0.998     2010    6.645     2001     2002     2005     2010     2015     2017     2018
lowest : 2000 2001 2002 2003 2004 , highest: 2015 2016 2017 2018 2019
  Value       2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  2010  2011
  Frequency     31    31    32    32    32    32    32    32    32    32    32    32
  Proportion 0.049 0.049 0.050 0.050 0.050 0.050 0.050 0.050 0.050 0.050 0.050 0.050

  Value       2012  2013  2014  2015  2016  2017  2018  2019
  Frequency     32    32    32    32    32    32    32    32
  Proportion 0.050 0.050 0.050 0.050 0.050 0.050 0.050 0.050

wins
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
      638        0       17    0.991    7.984    3.518        3        4        6        8       10       12       13
lowest : 0 1 2 3 4 , highest: 12 13 14 15 16
  Value          0     1     2     3     4     5     6     7     8     9    10    11
  Frequency      2     5    18    20    50    54    58    77    69    69    71    52
  Proportion 0.003 0.008 0.028 0.031 0.078 0.085 0.091 0.121 0.108 0.108 0.111 0.082

  Value         12    13    14    15    16
  Frequency     46    34     9     3     1
  Proportion 0.072 0.053 0.014 0.005 0.002

loss
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
      638        0       17    0.991    7.984    3.517        3        4        6        8       10       12       13
lowest : 0 1 2 3 4 , highest: 12 13 14 15 16
  Value          0     1     2     3     4     5     6     7     8     9    10    11
  Frequency      1     3     9    34    47    54    71    69    70    75    58    53
  Proportion 0.002 0.005 0.014 0.053 0.074 0.085 0.111 0.108 0.110 0.118 0.091 0.083

  Value         12    13    14    15    16
  Frequency     50    19    18     5     2
  Proportion 0.078 0.030 0.028 0.008 0.003

points_for
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
      638        0      257        1    350.3    80.37    240.0    262.0    299.0    348.0    396.0    437.6    468.1
lowest : 161 168 175 185 193 , highest: 557 560 565 589 606

points_against
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
      638        0      223        1    350.3    67.61    254.9    273.7    310.0    347.0    391.5    433.0    448.0
lowest : 165 191 196 201 202 , highest: 478 480 486 494 517

points_differential
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
      638        0      317        1        0    115.5  -169.15  -134.90   -75.00     1.50    72.75   134.00   155.45
lowest : -261 -258 -249 -233 -232 , highest: 208 226 230 249 315

margin_of_victory
        n  missing distinct     Info      Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
      638        0      230        1 -0.001881    7.227  -10.600   -8.460   -4.700    0.100    4.575    8.400    9.730
lowest : -16.3 -16.1 -15.6 -14.6 -14.5 , highest: 13.0 14.1 14.4 15.6 19.7

strength_of_schedule
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
      638        0       79        1 0.001097    1.861     -2.7     -2.2     -1.1      0.0      1.2      2.1      2.6
lowest : -4.6 -4.3 -4.2 -3.9 -3.6 , highest: 3.6 3.7 3.8 4.1 4.3

simple_rating
        n  missing distinct     Info      Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
      638        0      236        1 1.557e-17    7.077  -10.415   -8.130   -4.475    0.000    4.500    8.330    9.930
lowest : -17.4 -15.2 -15.1 -14.6 -14.4 , highest: 13.0 13.4 15.4 15.6 20.1

offensive_ranking
        n  missing distinct     Info       Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
      638        0      177        1 -0.0001567    4.873   -6.500   -5.300   -3.175    0.000    2.700    5.430    7.000
lowest : -11.7 -10.3 -10.2 -9.9 -9.6 , highest: 11.7 12.2 12.6 14.1 15.9

defensive_ranking
        n  missing distinct     Info      Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
      638        0      157        1 -0.001097    4.055   -5.815   -4.700   -2.400    0.100    2.500    4.500    5.915
lowest : -9.8 -9.5 -9.2 -9.1 -8.6 , highest: 8.0 8.2 8.6 8.9 9.8

playoffs
        n  missing distinct
      638        0        2
  Value      No Playoffs    Playoffs
  Frequency          398         240
  Proportion       0.624       0.376

sb_winner
        n  missing distinct
      638        0        2
  Value       No Superbowl Won Superbowl
  Frequency            618            20
  Proportion         0.969         0.031

Data Quality/ Missing Data

After reviewing the statistics for each variable, all the data looks good and no data manipulation is needed.

Sample Data for Standings

    kable(standings[1:5,])
team          team_name  year  wins  loss  points_for  points_against  points_differential  margin_of_victory  strength_of_schedule  simple_rating  offensive_ranking  defensive_ranking  playoffs     sb_winner
Miami         Dolphins   2000  11    5     323         226             97                   6.1                1.0                   7.1            0.0                7.1                Playoffs     No Superbowl
Indianapolis  Colts      2000  10    6     429         326             103                  6.4                1.5                   7.9            7.1                0.8                Playoffs     No Superbowl
New York      Jets       2000  9     7     321         321             0                    0.0                3.5                   3.5            1.4                2.2                No Playoffs  No Superbowl
Buffalo       Bills      2000  8     8     315         350             -35                  -2.2               2.2                   0.0            0.5                -0.5               No Playoffs  No Superbowl
New England   Patriots   2000  5     11    276         338             -62                  -3.9               1.4                   -2.5           -2.7               0.2                No Playoffs  No Superbowl

Games Data

Data Dictionary - games.csv

Variable        Class      Description
year            integer    Season year; note that playoff games will still be in the previous season
week            character  Week number (1-17, plus playoffs)
home_team       character  Home team
away_team       character  Away team
winner          character  Winning team
tie             character  If a tie, the “losing” team as well
day             character  Day of week
date            character  Date minus year
time            character  Time of game start
pts_win         double     Points by winning team
pts_loss        double     Points by losing team
yds_win         double     Yards by winning team
turnovers_win   double     Turnovers by winning team
yds_loss        double     Yards by losing team
turnovers_loss  double     Turnovers by losing team
home_team_name  character  Home team name
home_team_city  character  Home team city
away_team_name  character  Away team name
away_team_city  character  Away team city

Summary of the Games Data

    describe(games) %>% html()

games

19 Variables   5324 Observations

year
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
     5324        0       20    0.997     2010    6.637     2001     2002     2005     2010     2015     2018     2019
lowest : 2000 2001 2002 2003 2004 , highest: 2015 2016 2017 2018 2019
  Value       2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  2010  2011
  Frequency    259   259   267   267   267   267   267   267   267   267   267   267
  Proportion 0.049 0.049 0.050 0.050 0.050 0.050 0.050 0.050 0.050 0.050 0.050 0.050

  Value       2012  2013  2014  2015  2016  2017  2018  2019
  Frequency    267   267   267   267   267   267   267   267
  Proportion 0.050 0.050 0.050 0.050 0.050 0.050 0.050 0.050

week
        n  missing distinct
     5324        0       21
lowest : 1 10 11 12 13
highest: 9 ConfChamp Division SuperBowl WildCard

home_team
        n  missing distinct
     5324        0       34
lowest : Arizona Cardinals, Atlanta Falcons, Baltimore Ravens, Buffalo Bills, Carolina Panthers
highest: Seattle Seahawks, St. Louis Rams, Tampa Bay Buccaneers, Tennessee Titans, Washington Redskins

away_team
        n  missing distinct
     5324        0       34
lowest : Arizona Cardinals, Atlanta Falcons, Baltimore Ravens, Buffalo Bills, Carolina Panthers
highest: Seattle Seahawks, St. Louis Rams, Tampa Bay Buccaneers, Tennessee Titans, Washington Redskins

winner
        n  missing distinct
     5324        0       34
lowest : Arizona Cardinals, Atlanta Falcons, Baltimore Ravens, Buffalo Bills, Carolina Panthers
highest: Seattle Seahawks, St. Louis Rams, Tampa Bay Buccaneers, Tennessee Titans, Washington Redskins

tie
        n  missing distinct
       10     5314        7
lowest : Arizona Cardinals, Atlanta Falcons, Carolina Panthers, Cincinnati Bengals, Cleveland Browns
highest: Carolina Panthers, Cincinnati Bengals, Cleveland Browns, Green Bay Packers, St. Louis Rams
  Value       Arizona Cardinals    Atlanta Falcons  Carolina Panthers
  Frequency                   2                  1                  1
  Proportion                0.2                0.1                0.1

  Value      Cincinnati Bengals   Cleveland Browns  Green Bay Packers
  Frequency                   2                  1                  2
  Proportion                0.2                0.1                0.2

  Value          St. Louis Rams
  Frequency                   1
  Proportion                0.1

day
        n  missing distinct
     5324        0        7
lowest : Fri Mon Sat Sun Thu , highest: Sat Sun Thu Tue Wed
  Value        Fri   Mon   Sat   Sun   Thu   Tue   Wed
  Frequency      3   339   178  4588   214     1     1
  Proportion 0.001 0.064 0.033 0.862 0.040 0.000 0.000

date
        n  missing distinct
     5324        0      154
lowest : December 1, December 10, December 11, December 12, December 13
highest: September 5, September 6, September 7, September 8, September 9

time [secs]
        n  missing distinct
     5324        0      187
lowest : 08:35:00 08:36:00 09:30:00 09:31:00 09:35:00 , highest: 22:20:00 22:22:00 22:25:00 22:26:00 23:35:00

pts_win
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
     5324        0       56    0.997    27.78    9.914       14       17       21       27       34       40       44
lowest : 3 6 7 8 9 , highest: 56 57 58 59 62

pts_loss
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
     5324        0       47    0.996    16.09    9.176        3        6       10       16       21       27       31
lowest : 0 2 3 5 6 , highest: 45 46 48 49 51

yds_win
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
     5324        0      442        1    361.6    88.43      236      262      308      361      415      460      494
lowest : 47 98 104 107 120 , highest: 626 639 643 645 653

turnovers_win
        n  missing distinct     Info     Mean      Gmd
     5324        0        8    0.903     1.08    1.094
lowest : 0 1 2 3 4 , highest: 3 4 5 6 7
  Value          0     1     2     3     4     5     6     7
  Frequency   1804  1950  1074   364   106    20     5     1
  Proportion 0.339 0.366 0.202 0.068 0.020 0.004 0.001 0.000

yds_loss
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
     5324        0      444        1    309.1    95.69      175      200      251      306      366      420      451
lowest : 26 53 67 72 77 , highest: 576 583 589 595 613

turnovers_loss
        n  missing distinct     Info     Mean      Gmd
     5324        0        9    0.954    2.168    1.562
lowest : 0 1 2 3 4 , highest: 4 5 6 7 8
  Value          0     1     2     3     4     5     6     7     8
  Frequency    553  1361  1424  1065   592   237    66    21     5
  Proportion 0.104 0.256 0.267 0.200 0.111 0.045 0.012 0.004 0.001

home_team_name
        n  missing distinct
     5324        0       32
lowest : 49ers Bears Bengals Bills Broncos , highest: Seahawks Steelers Texans Titans Vikings

home_team_city
        n  missing distinct
     5324        0       32
lowest : Arizona, Atlanta, Baltimore, Buffalo, Carolina
highest: Seattle, St. Louis, Tampa Bay, Tennessee, Washington

away_team_name
        n  missing distinct
     5324        0       32
lowest : 49ers Bears Bengals Bills Broncos , highest: Seahawks Steelers Texans Titans Vikings

away_team_city
        n  missing distinct
     5324        0       32
lowest : Arizona, Atlanta, Baltimore, Buffalo, Carolina
highest: Seattle, St. Louis, Tampa Bay, Tennessee, Washington

Data Quality/ Missing Data

After reviewing the statistics for each variable, the values in the week variable look ambiguous. On closer inspection, the variable holds both integer and character values: the regular-season weeks are numbered 1 to 17, and the playoff rounds that follow are labelled WildCard, Division, ConfChamp and SuperBowl. This is a valid scenario, so I do not see a reason to convert these values for now; I will assess later whether any conversion is required.
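
Should a conversion turn out to be needed, one option is an ordered factor that places the playoff rounds after week 17. This is only a sketch of that option on a few toy values, not a step the analysis currently performs; `week_levels` and `week_raw` are illustrative names.

```r
# Week labels as described above: 17 regular-season weeks, then the four
# playoff rounds in the order they are played.
week_levels <- c(as.character(1:17),
                 "WildCard", "Division", "ConfChamp", "SuperBowl")

# Toy values standing in for games$week, which is stored as character.
week_raw <- c("9", "10", "SuperBowl", "1", "WildCard")

week_ordered <- factor(week_raw, levels = week_levels, ordered = TRUE)

# Sorting the factor follows the season calendar, whereas sorting the raw
# strings would place "10" before "9".
sort(week_ordered)
```

The 21 levels match the 21 distinct values reported for week in the summary of the games data.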

Sample Data for Games

    kable(games[1:5,])
year  week  home_team            away_team            winner               tie  day  date         time      pts_win  pts_loss  yds_win  turnovers_win  yds_loss  turnovers_loss  home_team_name  home_team_city  away_team_name  away_team_city
2000  1     Minnesota Vikings    Chicago Bears        Minnesota Vikings    NA   Sun  September 3  13:00:00  30       27        374      1              425       1               Vikings         Minnesota       Bears           Chicago
2000  1     Kansas City Chiefs   Indianapolis Colts   Indianapolis Colts   NA   Sun  September 3  13:00:00  27       14        386      2              280       1               Chiefs          Kansas City     Colts           Indianapolis
2000  1     Washington Redskins  Carolina Panthers    Washington Redskins  NA   Sun  September 3  13:01:00  20       17        396      0              236       1               Redskins        Washington      Panthers        Carolina
2000  1     Atlanta Falcons      San Francisco 49ers  Atlanta Falcons      NA   Sun  September 3  13:02:00  36       28        359      1              339       1               Falcons         Atlanta         49ers           San Francisco
2000  1     Pittsburgh Steelers  Baltimore Ravens     Baltimore Ravens     NA   Sun  September 3  13:02:00  16       0         336      0              223       1               Steelers        Pittsburgh      Ravens          Baltimore

Combined Dataset

I will combine the attendance and standings datasets into a single dataset, which I will use in the exploratory data analysis.

    attendance_standings <- inner_join(attendance_cleansed, standings, by = c("team","team_name","year"))
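
Because standings has one row per team and season while attendance has one row per team, season and week, the join repeats each standings record across that team's weekly rows. The sketch below shows this on toy rows copied from this report and adds an `anti_join` check for weekly rows that would be dropped for lack of a matching standings record; the check is a suggestion rather than part of the original analysis.

```r
library(dplyr)

# Toy rows copied from this report: two Arizona game weeks and two
# standings records from the 2000 season.
toy_attendance <- tibble(
  team = "Arizona", team_name = "Cardinals", year = 2000,
  week = c(1, 2), weekly_attendance = c(77434, 66009)
)
toy_standings <- tibble(
  team = c("Arizona", "Miami"), team_name = c("Cardinals", "Dolphins"),
  year = 2000, wins = c(3, 11), loss = c(13, 5)
)

join_keys <- c("team", "team_name", "year")

# One output row per weekly attendance row that has a standings match.
toy_joined <- inner_join(toy_attendance, toy_standings, by = join_keys)

# Weekly rows without a standings match; inner_join would drop these silently.
unmatched <- anti_join(toy_attendance, toy_standings, by = join_keys)
```

Here `toy_joined` has two rows, both carrying Arizona's 3 wins and 13 losses, and `unmatched` is empty.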

Sample Data for Combined Dataset

    kable(attendance_standings[1:5,])
team     team_name  year  total   home    away    week  weekly_attendance  wins  loss  points_for  points_against  points_differential  margin_of_victory  strength_of_schedule  simple_rating  offensive_ranking  defensive_ranking  playoffs     sb_winner
Arizona  Cardinals  2000  893926  387475  506451  1     77434              3     13    210         443             -233                 -14.6              -0.7                  -15.2          -7.2               -8.1               No Playoffs  No Superbowl
Arizona  Cardinals  2000  893926  387475  506451  2     66009              3     13    210         443             -233                 -14.6              -0.7                  -15.2          -7.2               -8.1               No Playoffs  No Superbowl
Arizona  Cardinals  2000  893926  387475  506451  4     71801              3     13    210         443             -233                 -14.6              -0.7                  -15.2          -7.2               -8.1               No Playoffs  No Superbowl
Arizona  Cardinals  2000  893926  387475  506451  5     66985              3     13    210         443             -233                 -14.6              -0.7                  -15.2          -7.2               -8.1               No Playoffs  No Superbowl
Arizona  Cardinals  2000  893926  387475  506451  6     44296              3     13    210         443             -233                 -14.6              -0.7                  -15.2          -7.2               -8.1               No Playoffs  No Superbowl

Exploratory Data Analysis

I will perform exploratory data analysis to understand which variables matter most in discovering patterns and to identify the relationships between variables. I will generate scatter plots and box plots to understand these relationships.
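
As a rough sketch of what such plots could look like, the two helper functions below build a box plot and a scatter plot with ggplot2. The function names are hypothetical, and the five input rows are copied from the combined sample above purely to show the calls; the real analysis would pass the full combined dataset.

```r
library(ggplot2)

# Five rows copied from the combined sample above (Arizona Cardinals, 2000).
sample_rows <- data.frame(
  week = c(1, 2, 4, 5, 6),
  weekly_attendance = c(77434, 66009, 71801, 66985, 44296),
  playoffs = "No Playoffs"
)

# Box plot: distribution of weekly attendance by playoff status.
plot_by_playoffs <- function(data) {
  ggplot(data, aes(x = playoffs, y = weekly_attendance)) +
    geom_boxplot() +
    labs(x = NULL, y = "Weekly attendance")
}

# Scatter plot: weekly attendance over the course of the season.
plot_by_week <- function(data) {
  ggplot(data, aes(x = week, y = weekly_attendance)) +
    geom_point(alpha = 0.5) +
    labs(x = "Week", y = "Weekly attendance")
}

p_box     <- plot_by_playoffs(sample_rows)
p_scatter <- plot_by_week(sample_rows)
```

With only one playoff category in the sample, the box plot shows a single box; on the full data it would compare playoff and non-playoff teams side by side.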

Machine Learning

I will perform statistical analysis on the key variables and build a linear regression model to identify the factors that affect game attendance.

I will check the statistical significance of these variables and build the model using training and testing datasets.
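
The mechanics of that workflow can be sketched as follows. The data here is simulated and does not represent real NFL figures; the 70/30 split, the choice of predictors and the use of RMSE are assumptions made for illustration only.

```r
set.seed(42)

# Simulated stand-in for the combined data; these are NOT real NFL figures.
n <- 200
sim <- data.frame(
  wins = sample(0:16, n, replace = TRUE),
  margin_of_victory = rnorm(n, mean = 0, sd = 7)
)
sim$weekly_attendance <- 65000 + 300 * sim$wins + rnorm(n, sd = 5000)

# Hold out 30% of the rows for testing (assumed split ratio).
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]

# Fit on the training rows only.
fit <- lm(weekly_attendance ~ wins + margin_of_victory, data = train)

# Estimates, standard errors, t statistics and p-values for each predictor.
summary(fit)$coefficients

# Root mean squared error on the held-out rows.
test_pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$weekly_attendance - test_pred)^2))
rmse
```

The p-values in the coefficient table indicate which predictors are statistically significant, and the held-out RMSE gives an error estimate that is not inflated by fitting and evaluating on the same rows.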

Summary

The findings from the statistical inference and the model will be summarized in this section.