The National Football League (NFL) is a professional American football league consisting of 32 teams, divided equally between the National Football Conference (NFC) and the American Football Conference (AFC). The NFL’s 17-week regular season runs from early September to late December, with each team playing 16 games and having one bye week. Following the conclusion of the regular season, seven teams from each conference (four division winners and three wild card teams) advance to the playoffs, a single-elimination tournament culminating in the Super Bowl, which is usually held on the first Sunday in February and is played between the champions of the NFC and AFC.
The National Football League is the largest live spectator sporting league in the world in terms of average attendance. The NFL is one of the four major professional sports leagues in North America and the highest professional level of American football in the world. As of 2018, the NFL averaged 67,100 live spectators per game, and 17,177,581 total for the season.
The purpose of this project is to analyse NFL attendance data from 2000 to 2019 and draw insights into spectator attendance over the 20-year period. Some of the objectives are to address the questions below.
For this study we use data from the Pro Football Reference website. We will perform some data cleansing and data manipulation to prepare the data for consumption. We will start with exploratory data analysis to understand the data, examine the factors that determine attendance at National Football League games, and build a classification model to identify the teams with the highest and the lowest attendance.
These insights will help with ticket pricing, logistics planning, promotions and marketing campaigns.
The below packages are required
The source data is from the Pro Football Reference website. The required data has been pulled into three CSV files (attendance.csv, standings.csv and games.csv) and made available in the tidytuesday GitHub repository. We import the data from the CSV files in that repository. The hyperlinks for the CSV files are below:
Attendance Data, Standings Data, Games Data
The Attendance data set contains the weekly attendance information for a team in any given year.
| Variable | Class | Description |
|---|---|---|
| team | character | Team City |
| team_name | character | Team name |
| year | integer | Season year |
| total | double | Total attendance across 17 weeks (1 week = no game) |
| home | double | Home attendance |
| away | double | Away attendance |
| week | character | Week number (1-17) |
| weekly_attendance | double | Weekly attendance number |
Looking at the summary statistics of the Attendance Data
describe(attendance) %>% html()
| n | missing | distinct |
|---|---|---|
| 10846 | 0 | 32 |
| lowest : | Arizona | Atlanta | Baltimore | Buffalo | Carolina |
| highest: | Seattle | St. Louis | Tampa Bay | Tennessee | Washington |
| n | missing | distinct |
|---|---|---|
| 10846 | 0 | 32 |
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10846 | 0 | 20 | 0.997 | 2010 | 6.635 | 2001 | 2002 | 2005 | 2010 | 2015 | 2018 | 2019 |
Value 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011
Frequency 527 527 544 544 544 544 544 544 544 544 544 544
Proportion 0.049 0.049 0.050 0.050 0.050 0.050 0.050 0.050 0.050 0.050 0.050 0.050
Value 2012 2013 2014 2015 2016 2017 2018 2019
Frequency 544 544 544 544 544 544 544 544
Proportion 0.050 0.050 0.050 0.050 0.050 0.050 0.050 0.050
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10846 | 0 | 637 | 1 | 1080910 | 78566 | 967434 | 999843 | 1040509 | 1081090 | 1123230 | 1161974 | 1195369 |

lowest : 760644 783367 803556 804401 811391 , highest: 1303393 1307231 1309211 1312509 1322087

| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10846 | 0 | 603 | 1 | 540455 | 71678 | 428311 | 463353 | 504360 | 543185 | 578342 | 623325 | 631365 |
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10846 | 0 | 636 | 1 | 540455 | 28682 | 495744 | 505417 | 524974 | 541757 | 557741 | 572774 | 581257 |
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10846 | 0 | 17 | 0.997 | 9 | 5.648 | 1 | 2 | 5 | 9 | 13 | 16 | 17 |
Value 1 2 3 4 5 6 7 8 9 10 11 12
Frequency 638 638 638 638 638 638 638 638 638 638 638 638
Proportion 0.059 0.059 0.059 0.059 0.059 0.059 0.059 0.059 0.059 0.059 0.059 0.059
Value 13 14 15 16 17
Frequency 638 638 638 638 638
Proportion 0.059 0.059 0.059 0.059 0.059
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10208 | 638 | 4073 | 1 | 67557 | 9495 | 52433 | 57278 | 63246 | 68334 | 72545 | 77999 | 79114 |
After reviewing the statistics for each variable, we notice that weekly_attendance is missing for 638 rows. These correspond to the one bye week each team has in a given year. All the other variables look good.
missing_data <-
attendance %>%
filter(is.na(weekly_attendance))
missing_data
## # A tibble: 638 x 8
## team team_name year total home away week weekly_attendance
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Arizona Cardinals 2000 893926 387475 506451 3 NA
## 2 Atlanta Falcons 2000 964579 422814 541765 15 NA
## 3 Baltimore Ravens 2000 1062373 551695 510678 14 NA
## 4 Buffalo Bills 2000 1098587 560695 537892 4 NA
## 5 Carolina Panthers 2000 1095192 583489 511703 4 NA
## 6 Chicago Bears 2000 1080684 535552 545132 9 NA
## 7 Cincinnati Bengals 2000 967434 469992 497442 1 NA
## 8 Cleveland Browns 2000 1057139 581544 475595 17 NA
## 9 Dallas Cowboys 2000 1075470 504360 571110 6 NA
## 10 Denver Broncos 2000 1140030 604042 535988 9 NA
## # ... with 628 more rows
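Before looking for patterns, it can help to confirm that the missing values really are confined to a single column. The following is a minimal base R sketch on a small synthetic data frame (the `toy` object and its values are illustrative stand-ins, not rows from the attendance data); the same `colSums(is.na(...))` call could be applied to `attendance`.

```r
# Synthetic stand-in for the attendance data (illustrative values only)
toy <- data.frame(
  team = c("A", "A", "B", "B"),
  week = c(1, 2, 1, 2),
  weekly_attendance = c(100, NA, 90, 95),
  stringsAsFactors = FALSE
)

# Count the missing values in every column
na_per_column <- colSums(is.na(toy))
print(na_per_column)
```

A column with a non-zero count is a candidate for the kind of filtering shown above.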
We will check to see if there is any pattern to the missing values.
Grouping by team to see whether the missing data is specific to a team:
missing_data %>%
group_by(team) %>%
summarise(count_occurances = n())
## # A tibble: 32 x 2
## team count_occurances
## <chr> <int>
## 1 Arizona 20
## 2 Atlanta 20
## 3 Baltimore 20
## 4 Buffalo 20
## 5 Carolina 20
## 6 Chicago 20
## 7 Cincinnati 20
## 8 Cleveland 20
## 9 Dallas 20
## 10 Denver 20
## # ... with 22 more rows
Grouping by year to see whether the missing data is specific to a year:
missing_data %>%
group_by(year) %>%
summarise(count_occurances = n())
## # A tibble: 20 x 2
## year count_occurances
## <dbl> <int>
## 1 2000 31
## 2 2001 31
## 3 2002 32
## 4 2003 32
## 5 2004 32
## 6 2005 32
## 7 2006 32
## 8 2007 32
## 9 2008 32
## 10 2009 32
## 11 2010 32
## 12 2011 32
## 13 2012 32
## 14 2013 32
## 15 2014 32
## 16 2015 32
## 17 2016 32
## 18 2017 32
## 19 2018 32
## 20 2019 32
Grouping by week to see whether the missing data is specific to a week:
missing_data %>%
group_by(week) %>%
summarise(count_occurances = n())
## # A tibble: 17 x 2
## week count_occurances
## <dbl> <int>
## 1 1 4
## 2 2 6
## 3 3 26
## 4 4 60
## 5 5 74
## 6 6 78
## 7 7 78
## 8 8 88
## 9 9 92
## 10 10 68
## 11 11 38
## 12 12 14
## 13 13 4
## 14 14 2
## 15 15 2
## 16 16 2
## 17 17 2
The results above show that the missing values are not specific to any team or year: each team has one bye week per season. The bye week itself varies by team and season, with most falling between weeks 4 and 10. We can ignore these rows, as no game was played in that week.
We also notice that there are only 31 teams in 2000 and 2001, and 32 teams from 2002 onwards.
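Both observations can be checked directly rather than read off the grouped counts. The sketch below uses dplyr on a synthetic tibble (`toy`, with made-up teams A, B and C, is an assumption for illustration only): it counts missing weeks per team and season, and distinct teams per year. Applied to `attendance`, the first summary would be expected to show one bye week for every team and season.

```r
library(dplyr)

# Synthetic stand-in for `attendance`: two teams in 2001, three in 2002,
# three weeks each, with one missing (bye) week per team and season.
toy <- tibble(
  team = c(rep("A", 3), rep("B", 3), rep("A", 3), rep("B", 3), rep("C", 3)),
  year = c(rep(2001, 6), rep(2002, 9)),
  week = rep(1:3, times = 5),
  weekly_attendance = c(100, NA, 120,  NA, 90, 95,
                        110, 105, NA,  80, NA, 85,  NA, 70, 75)
)

# Number of bye (missing) weeks for each team in each season
byes_per_team_season <- toy %>%
  group_by(team, year) %>%
  summarise(bye_weeks = sum(is.na(weekly_attendance)), .groups = "drop")

# Number of distinct teams in each season
teams_per_year <- toy %>%
  group_by(year) %>%
  summarise(teams = n_distinct(team))

print(byes_per_team_season)
print(teams_per_year)
```

Any team and season with `bye_weeks` different from 1 would point to a genuine data gap rather than a scheduled bye.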
We will filter out these 638 rows with missing values and use the clean data for further analysis.
attendance_cleansed <- attendance %>%
filter(! is.na(weekly_attendance))
Looking at a small sample set of 5 rows from the Attendance data
kable(attendance_cleansed[1:5,])
| team | team_name | year | total | home | away | week | weekly_attendance |
|---|---|---|---|---|---|---|---|
| Arizona | Cardinals | 2000 | 893926 | 387475 | 506451 | 1 | 77434 |
| Arizona | Cardinals | 2000 | 893926 | 387475 | 506451 | 2 | 66009 |
| Arizona | Cardinals | 2000 | 893926 | 387475 | 506451 | 4 | 71801 |
| Arizona | Cardinals | 2000 | 893926 | 387475 | 506451 | 5 | 66985 |
| Arizona | Cardinals | 2000 | 893926 | 387475 | 506451 | 6 | 44296 |
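In the rows shown in this report, the season-level columns satisfy total = home + away (for Arizona in 2000, 387475 + 506451 = 893926). A small base R sketch of that consistency check follows; it assumes the relationship is meant to hold for every team and season, and uses three season rows copied from the output above as its input.

```r
# Season-level values copied from the rows printed in this report
season_totals <- data.frame(
  team  = c("Arizona", "Atlanta", "Baltimore"),
  year  = c(2000, 2000, 2000),
  total = c(893926, 964579, 1062373),
  home  = c(387475, 422814, 551695),
  away  = c(506451, 541765, 510678)
)

# Flag seasons where the total does not equal home plus away attendance
season_totals$consistent <-
  season_totals$total == season_totals$home + season_totals$away

inconsistent <- season_totals[!season_totals$consistent, ]
print(season_totals)
```

An empty `inconsistent` data frame means the three columns agree for the rows checked.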
| Variable | Class | Description |
|---|---|---|
| team | character | Team city |
| team_name | character | Team name |
| year | integer | season year |
| wins | double | Wins (0 to 16) |
| loss | double | Losses (0 to 16) |
| points_for | double | Points for (offensive performance) |
| points_against | double | Points against (defensive performance) |
| points_differential | double | Point differential (points_for - points_against) |
| margin_of_victory | double | (Points Scored - Points Allowed)/ Games Played |
| strength_of_schedule | double | Average quality of opponent as measured by SRS (Simple Rating System) |
| simple_rating | double | Team quality relative to average (0.0) as measured by SRS (Simple Rating System) SRS = MoV + SoS = OSRS + DSRS |
| offensive_ranking | double | Team offense quality relative to average (0.0) as measured by SRS (Simple Rating System) |
| defensive_ranking | double | Team defense quality relative to average (0.0) as measured by SRS (Simple Rating System) |
| playoffs | character | Made playoffs or not |
| sb_winner | character | Won superbowl or not |
describe(standings) %>% html()
| n | missing | distinct |
|---|---|---|
| 638 | 0 | 32 |
| lowest : | Arizona | Atlanta | Baltimore | Buffalo | Carolina |
| highest: | Seattle | St. Louis | Tampa Bay | Tennessee | Washington |
| n | missing | distinct |
|---|---|---|
| 638 | 0 | 32 |
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 638 | 0 | 20 | 0.998 | 2010 | 6.645 | 2001 | 2002 | 2005 | 2010 | 2015 | 2017 | 2018 |
Value 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011
Frequency 31 31 32 32 32 32 32 32 32 32 32 32
Proportion 0.049 0.049 0.050 0.050 0.050 0.050 0.050 0.050 0.050 0.050 0.050 0.050
Value 2012 2013 2014 2015 2016 2017 2018 2019
Frequency 32 32 32 32 32 32 32 32
Proportion 0.050 0.050 0.050 0.050 0.050 0.050 0.050 0.050
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 638 | 0 | 17 | 0.991 | 7.984 | 3.518 | 3 | 4 | 6 | 8 | 10 | 12 | 13 |
Value 0 1 2 3 4 5 6 7 8 9 10 11
Frequency 2 5 18 20 50 54 58 77 69 69 71 52
Proportion 0.003 0.008 0.028 0.031 0.078 0.085 0.091 0.121 0.108 0.108 0.111 0.082
Value 12 13 14 15 16
Frequency 46 34 9 3 1
Proportion 0.072 0.053 0.014 0.005 0.002
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 638 | 0 | 17 | 0.991 | 7.984 | 3.517 | 3 | 4 | 6 | 8 | 10 | 12 | 13 |
Value 0 1 2 3 4 5 6 7 8 9 10 11
Frequency 1 3 9 34 47 54 71 69 70 75 58 53
Proportion 0.002 0.005 0.014 0.053 0.074 0.085 0.111 0.108 0.110 0.118 0.091 0.083
Value 12 13 14 15 16
Frequency 50 19 18 5 2
Proportion 0.078 0.030 0.028 0.008 0.003
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 638 | 0 | 257 | 1 | 350.3 | 80.37 | 240.0 | 262.0 | 299.0 | 348.0 | 396.0 | 437.6 | 468.1 |
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 638 | 0 | 223 | 1 | 350.3 | 67.61 | 254.9 | 273.7 | 310.0 | 347.0 | 391.5 | 433.0 | 448.0 |
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 638 | 0 | 317 | 1 | 0 | 115.5 | -169.15 | -134.90 | -75.00 | 1.50 | 72.75 | 134.00 | 155.45 |
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 638 | 0 | 230 | 1 | -0.001881 | 7.227 | -10.600 | -8.460 | -4.700 | 0.100 | 4.575 | 8.400 | 9.730 |

lowest : -16.3 -16.1 -15.6 -14.6 -14.5 , highest: 13.0 14.1 14.4 15.6 19.7

| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 638 | 0 | 79 | 1 | 0.001097 | 1.861 | -2.7 | -2.2 | -1.1 | 0.0 | 1.2 | 2.1 | 2.6 |
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 638 | 0 | 236 | 1 | 1.557e-17 | 7.077 | -10.415 | -8.130 | -4.475 | 0.000 | 4.500 | 8.330 | 9.930 |

lowest : -17.4 -15.2 -15.1 -14.6 -14.4 , highest: 13.0 13.4 15.4 15.6 20.1

| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 638 | 0 | 177 | 1 | -0.0001567 | 4.873 | -6.500 | -5.300 | -3.175 | 0.000 | 2.700 | 5.430 | 7.000 |
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 638 | 0 | 157 | 1 | -0.001097 | 4.055 | -5.815 | -4.700 | -2.400 | 0.100 | 2.500 | 4.500 | 5.915 |
| n | missing | distinct |
|---|---|---|
| 638 | 0 | 2 |

| Value | No Playoffs | Playoffs |
|---|---|---|
| Frequency | 398 | 240 |
| Proportion | 0.624 | 0.376 |

| n | missing | distinct |
|---|---|---|
| 638 | 0 | 2 |

| Value | No Superbowl | Won Superbowl |
|---|---|---|
| Frequency | 618 | 20 |
| Proportion | 0.969 | 0.031 |
After reviewing the statistics for each variable, all the data looks good and no data manipulation is needed.
kable(standings[1:5,])
| team | team_name | year | wins | loss | points_for | points_against | points_differential | margin_of_victory | strength_of_schedule | simple_rating | offensive_ranking | defensive_ranking | playoffs | sb_winner |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Miami | Dolphins | 2000 | 11 | 5 | 323 | 226 | 97 | 6.1 | 1.0 | 7.1 | 0.0 | 7.1 | Playoffs | No Superbowl |
| Indianapolis | Colts | 2000 | 10 | 6 | 429 | 326 | 103 | 6.4 | 1.5 | 7.9 | 7.1 | 0.8 | Playoffs | No Superbowl |
| New York | Jets | 2000 | 9 | 7 | 321 | 321 | 0 | 0.0 | 3.5 | 3.5 | 1.4 | 2.2 | No Playoffs | No Superbowl |
| Buffalo | Bills | 2000 | 8 | 8 | 315 | 350 | -35 | -2.2 | 2.2 | 0.0 | 0.5 | -0.5 | No Playoffs | No Superbowl |
| New England | Patriots | 2000 | 5 | 11 | 276 | 338 | -62 | -3.9 | 1.4 | -2.5 | -2.7 | 0.2 | No Playoffs | No Superbowl |
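The derived columns in the data dictionary can be checked against the five rows above. The base R sketch below recomputes margin_of_victory as points_differential divided by games played, and compares simple_rating with both MoV + SoS and OSRS + DSRS. The 16-game season length and the 0.1 tolerance (to allow for the published values being rounded to one decimal place) are assumptions made for this check.

```r
# The five sample rows shown above
standings_sample <- data.frame(
  team = c("Miami", "Indianapolis", "New York", "Buffalo", "New England"),
  points_differential  = c(97, 103, 0, -35, -62),
  margin_of_victory    = c(6.1, 6.4, 0.0, -2.2, -3.9),
  strength_of_schedule = c(1.0, 1.5, 3.5, 2.2, 1.4),
  simple_rating        = c(7.1, 7.9, 3.5, 0.0, -2.5),
  offensive_ranking    = c(0.0, 7.1, 1.4, 0.5, -2.7),
  defensive_ranking    = c(7.1, 0.8, 2.2, -0.5, 0.2)
)

games_played <- 16          # assumed regular-season length
tolerance <- 0.1 + 1e-8     # published values are rounded to one decimal

# Margin of victory = point differential / games played
mov_check <- round(standings_sample$points_differential / games_played, 1)
mov_gap <- abs(mov_check - standings_sample$margin_of_victory)

# SRS = MoV + SoS
srs_gap <- abs(standings_sample$margin_of_victory +
               standings_sample$strength_of_schedule -
               standings_sample$simple_rating)

# SRS = OSRS + DSRS
parts_gap <- abs(standings_sample$offensive_ranking +
                 standings_sample$defensive_ranking -
                 standings_sample$simple_rating)

print(data.frame(team = standings_sample$team, mov_gap, srs_gap, parts_gap))
```

For example, Miami's point differential of 97 over 16 games gives 6.06, which rounds to the listed 6.1, and 6.1 + 1.0 gives the listed simple rating of 7.1.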
| Variable | Class | Description |
|---|---|---|
| year | integer | Season year (playoff games held early in the following calendar year still count toward the previous season) |
| week | character | week number (1-17, plus playoffs) |
| home_team | character | Home team |
| away_team | character | Away team |
| winner | character | Winning team |
| tie | character | If a tie, the “losing” team as well |
| day | character | Day of week |
| date | character | Date minus year |
| time | character | Time of game start |
| pts_win | double | Points by winning team |
| pts_loss | double | Points by losing team |
| yds_win | double | Yards by winning team |
| turnovers_win | double | Turnovers by winning team |
| yds_loss | double | Yards by losing team |
| turnovers_loss | double | Turnovers by losing team |
| home_team_name | character | Home team name |
| home_team_city | character | Home team city |
| away_team_name | character | Away team name |
| away_team_city | character | Away team city |
describe(games) %>% html()
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5324 | 0 | 20 | 0.997 | 2010 | 6.637 | 2001 | 2002 | 2005 | 2010 | 2015 | 2018 | 2019 |
Value 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011
Frequency 259 259 267 267 267 267 267 267 267 267 267 267
Proportion 0.049 0.049 0.050 0.050 0.050 0.050 0.050 0.050 0.050 0.050 0.050 0.050
Value 2012 2013 2014 2015 2016 2017 2018 2019
Frequency 267 267 267 267 267 267 267 267
Proportion 0.050 0.050 0.050 0.050 0.050 0.050 0.050 0.050
| n | missing | distinct |
|---|---|---|
| 5324 | 0 | 21 |
| lowest : | 1 | 10 | 11 | 12 | 13 |
| highest: | 9 | ConfChamp | Division | SuperBowl | WildCard |
| n | missing | distinct |
|---|---|---|
| 5324 | 0 | 34 |
| lowest : | Arizona Cardinals | Atlanta Falcons | Baltimore Ravens | Buffalo Bills | Carolina Panthers |
| highest: | Seattle Seahawks | St. Louis Rams | Tampa Bay Buccaneers | Tennessee Titans | Washington Redskins |
| n | missing | distinct |
|---|---|---|
| 5324 | 0 | 34 |
| lowest : | Arizona Cardinals | Atlanta Falcons | Baltimore Ravens | Buffalo Bills | Carolina Panthers |
| highest: | Seattle Seahawks | St. Louis Rams | Tampa Bay Buccaneers | Tennessee Titans | Washington Redskins |
| n | missing | distinct |
|---|---|---|
| 5324 | 0 | 34 |
| lowest : | Arizona Cardinals | Atlanta Falcons | Baltimore Ravens | Buffalo Bills | Carolina Panthers |
| highest: | Seattle Seahawks | St. Louis Rams | Tampa Bay Buccaneers | Tennessee Titans | Washington Redskins |
| n | missing | distinct |
|---|---|---|
| 10 | 5314 | 7 |
| lowest : | Arizona Cardinals | Atlanta Falcons | Carolina Panthers | Cincinnati Bengals | Cleveland Browns |
| highest: | Carolina Panthers | Cincinnati Bengals | Cleveland Browns | Green Bay Packers | St. Louis Rams |
Value Arizona Cardinals Atlanta Falcons Carolina Panthers
Frequency 2 1 1
Proportion 0.2 0.1 0.1
Value Cincinnati Bengals Cleveland Browns Green Bay Packers
Frequency 2 1 2
Proportion 0.2 0.1 0.2
Value St. Louis Rams
Frequency 1
Proportion 0.1

| n | missing | distinct |
|---|---|---|
| 5324 | 0 | 7 |

| Value | Fri | Mon | Sat | Sun | Thu | Tue | Wed |
|---|---|---|---|---|---|---|---|
| Frequency | 3 | 339 | 178 | 4588 | 214 | 1 | 1 |
| Proportion | 0.001 | 0.064 | 0.033 | 0.862 | 0.040 | 0.000 | 0.000 |
| n | missing | distinct |
|---|---|---|
| 5324 | 0 | 154 |
| lowest : | December 1 | December 10 | December 11 | December 12 | December 13 |
| highest: | September 5 | September 6 | September 7 | September 8 | September 9 |
| n | missing | distinct |
|---|---|---|
| 5324 | 0 | 187 |
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5324 | 0 | 56 | 0.997 | 27.78 | 9.914 | 14 | 17 | 21 | 27 | 34 | 40 | 44 |
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5324 | 0 | 47 | 0.996 | 16.09 | 9.176 | 3 | 6 | 10 | 16 | 21 | 27 | 31 |
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5324 | 0 | 442 | 1 | 361.6 | 88.43 | 236 | 262 | 308 | 361 | 415 | 460 | 494 |

| n | missing | distinct | Info | Mean | Gmd |
|---|---|---|---|---|---|
| 5324 | 0 | 8 | 0.903 | 1.08 | 1.094 |

| Value | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| Frequency | 1804 | 1950 | 1074 | 364 | 106 | 20 | 5 | 1 |
| Proportion | 0.339 | 0.366 | 0.202 | 0.068 | 0.020 | 0.004 | 0.001 | 0.000 |
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5324 | 0 | 444 | 1 | 309.1 | 95.69 | 175 | 200 | 251 | 306 | 366 | 420 | 451 |

| n | missing | distinct | Info | Mean | Gmd |
|---|---|---|---|---|---|
| 5324 | 0 | 9 | 0.954 | 2.168 | 1.562 |

| Value | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| Frequency | 553 | 1361 | 1424 | 1065 | 592 | 237 | 66 | 21 | 5 |
| Proportion | 0.104 | 0.256 | 0.267 | 0.200 | 0.111 | 0.045 | 0.012 | 0.004 | 0.001 |
| n | missing | distinct |
|---|---|---|
| 5324 | 0 | 32 |
| n | missing | distinct |
|---|---|---|
| 5324 | 0 | 32 |
| lowest : | Arizona | Atlanta | Baltimore | Buffalo | Carolina |
| highest: | Seattle | St. Louis | Tampa Bay | Tennessee | Washington |
| n | missing | distinct |
|---|---|---|
| 5324 | 0 | 32 |
| n | missing | distinct |
|---|---|---|
| 5324 | 0 | 32 |
| lowest : | Arizona | Atlanta | Baltimore | Buffalo | Carolina |
| highest: | Seattle | St. Louis | Tampa Bay | Tennessee | Washington |
After reviewing the statistics for each variable, the values in the week variable look ambiguous. On closer inspection, it contains both integer values and character values. The character values WildCard, Division, ConfChamp and SuperBowl denote the playoff rounds that follow week 17 of the regular season, so this is valid data. I do not see a reason to convert these values for now; I will assess later whether any conversion is required.
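If a conversion does become necessary, one possible approach is sketched below in base R. It keeps the original labels, adds a playoff flag and a numeric week (NA for playoff rounds), and builds an ordered factor so that playoff rounds sort after week 17. The vector `week` here is a small hand-written sample of the values seen in the games data, and the names `is_playoff`, `week_number` and `week_ordered` are hypothetical.

```r
# Sample of the labels that occur in the week column of the games data
week <- c("1", "2", "17", "WildCard", "Division", "ConfChamp", "SuperBowl")

# Playoff rounds in the order they are played
playoff_rounds <- c("WildCard", "Division", "ConfChamp", "SuperBowl")

# Flag playoff games
is_playoff <- week %in% playoff_rounds

# Numeric week for regular-season games, NA for playoff rounds
week_number <- suppressWarnings(as.integer(week))

# Ordered factor: weeks 1 to 17 first, then the playoff rounds
week_ordered <- factor(
  week,
  levels = c(as.character(1:17), playoff_rounds),
  ordered = TRUE
)

print(data.frame(week, is_playoff, week_number, week_ordered))
```

The ordered factor is convenient for plotting by week, while `week_number` suits numeric summaries restricted to the regular season.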
kable(games[1:5,])
| year | week | home_team | away_team | winner | tie | day | date | time | pts_win | pts_loss | yds_win | turnovers_win | yds_loss | turnovers_loss | home_team_name | home_team_city | away_team_name | away_team_city |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2000 | 1 | Minnesota Vikings | Chicago Bears | Minnesota Vikings | NA | Sun | September 3 | 13:00:00 | 30 | 27 | 374 | 1 | 425 | 1 | Vikings | Minnesota | Bears | Chicago |
| 2000 | 1 | Kansas City Chiefs | Indianapolis Colts | Indianapolis Colts | NA | Sun | September 3 | 13:00:00 | 27 | 14 | 386 | 2 | 280 | 1 | Chiefs | Kansas City | Colts | Indianapolis |
| 2000 | 1 | Washington Redskins | Carolina Panthers | Washington Redskins | NA | Sun | September 3 | 13:01:00 | 20 | 17 | 396 | 0 | 236 | 1 | Redskins | Washington | Panthers | Carolina |
| 2000 | 1 | Atlanta Falcons | San Francisco 49ers | Atlanta Falcons | NA | Sun | September 3 | 13:02:00 | 36 | 28 | 359 | 1 | 339 | 1 | Falcons | Atlanta | 49ers | San Francisco |
| 2000 | 1 | Pittsburgh Steelers | Baltimore Ravens | Baltimore Ravens | NA | Sun | September 3 | 13:02:00 | 16 | 0 | 336 | 0 | 223 | 1 | Steelers | Pittsburgh | Ravens | Baltimore |
I will combine the attendance and standings datasets to create a combined data set, which I will use in my exploratory data analysis.
attendance_standings <- inner_join(attendance_cleansed, standings, by = c("team","team_name","year"))
kable(attendance_standings[1:5,])
| team | team_name | year | total | home | away | week | weekly_attendance | wins | loss | points_for | points_against | points_differential | margin_of_victory | strength_of_schedule | simple_rating | offensive_ranking | defensive_ranking | playoffs | sb_winner |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Arizona | Cardinals | 2000 | 893926 | 387475 | 506451 | 1 | 77434 | 3 | 13 | 210 | 443 | -233 | -14.6 | -0.7 | -15.2 | -7.2 | -8.1 | No Playoffs | No Superbowl |
| Arizona | Cardinals | 2000 | 893926 | 387475 | 506451 | 2 | 66009 | 3 | 13 | 210 | 443 | -233 | -14.6 | -0.7 | -15.2 | -7.2 | -8.1 | No Playoffs | No Superbowl |
| Arizona | Cardinals | 2000 | 893926 | 387475 | 506451 | 4 | 71801 | 3 | 13 | 210 | 443 | -233 | -14.6 | -0.7 | -15.2 | -7.2 | -8.1 | No Playoffs | No Superbowl |
| Arizona | Cardinals | 2000 | 893926 | 387475 | 506451 | 5 | 66985 | 3 | 13 | 210 | 443 | -233 | -14.6 | -0.7 | -15.2 | -7.2 | -8.1 | No Playoffs | No Superbowl |
| Arizona | Cardinals | 2000 | 893926 | 387475 | 506451 | 6 | 44296 | 3 | 13 | 210 | 443 | -233 | -14.6 | -0.7 | -15.2 | -7.2 | -8.1 | No Playoffs | No Superbowl |
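Because an inner join silently drops rows without a match, it is worth confirming that no attendance rows were lost. The dplyr sketch below illustrates the check on two small synthetic tibbles (`attendance_toy` and `standings_toy` are hypothetical stand-ins with made-up values): the joined row count is compared with the input, and `anti_join()` lists any attendance rows that have no standings record.

```r
library(dplyr)

# Synthetic stand-ins for attendance_cleansed and standings (made-up values)
attendance_toy <- tibble(
  team = c("Arizona", "Arizona", "Atlanta"),
  team_name = c("Cardinals", "Cardinals", "Falcons"),
  year = c(2000, 2000, 2000),
  week = c(1, 2, 1),
  weekly_attendance = c(100, 110, 90)
)
standings_toy <- tibble(
  team = c("Arizona", "Atlanta"),
  team_name = c("Cardinals", "Falcons"),
  year = c(2000, 2000),
  wins = c(3, 4)
)

keys <- c("team", "team_name", "year")

# Same join as above, on the toy data
joined <- inner_join(attendance_toy, standings_toy, by = keys)

# Attendance rows with no matching standings row (ideally none)
unmatched <- anti_join(attendance_toy, standings_toy, by = keys)

print(nrow(joined) == nrow(attendance_toy))
print(unmatched)
```

If `unmatched` is not empty, the key columns (for example, a relocated team's city) would need to be reconciled before joining.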
I will perform exploratory data analysis to understand which variables have greater significance in discovering patterns and to identify the relationships between variables. I will generate scatter plots and box plots to understand these relationships.
I will perform statistical analysis on the key variables and build a linear regression model to identify the factors that impact game attendance numbers.
I will check the statistical significance of these variables and build a model with training and testing datasets.
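As a rough outline of that workflow, the base R sketch below splits a data frame into training and testing sets, fits a linear regression, inspects coefficient significance and computes the test-set RMSE. The data are simulated with arbitrary coefficients purely to make the example runnable; the choice of predictors (`wins`, `margin_of_victory`) and the 80/20 split are assumptions, and the simulated numbers say nothing about the real attendance data.

```r
set.seed(42)

# Simulated data (arbitrary relationship, for illustration only)
n <- 200
sim <- data.frame(
  wins = sample(0:16, n, replace = TRUE),
  margin_of_victory = rnorm(n, mean = 0, sd = 6)
)
sim$weekly_attendance <- 65000 + 300 * sim$wins +
  50 * sim$margin_of_victory + rnorm(n, sd = 3000)

# 80/20 train/test split
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]

# Fit the linear model on the training set
fit <- lm(weekly_attendance ~ wins + margin_of_victory, data = train)

# Coefficient estimates with standard errors and p-values
print(summary(fit)$coefficients)

# Evaluate on the held-out test set
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$weekly_attendance - pred)^2))
print(rmse)
```

With the real data, `sim` would be replaced by `attendance_standings` and the formula by the predictors selected during exploratory analysis.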
The findings from the statistical inference and the model will be summarized in this section.