The National Football League (NFL) is a professional American football league consisting of 32 teams, divided equally between the National Football Conference (NFC) and the American Football Conference (AFC). The NFL’s 17-week regular season runs from early September to late December, with each team playing 16 games and having one bye week. Following the conclusion of the regular season, seven teams from each conference (four division winners and three wild card teams) advance to the playoffs, a single-elimination tournament culminating in the Super Bowl, which is usually held on the first Sunday in February and is played between the champions of the NFC and AFC.
The National Football League is the largest live spectator sporting league in the world in terms of average attendance. The NFL is one of the four major professional sports leagues in North America and the highest professional level of American football in the world. As of 2018, the NFL averaged 67,100 live spectators per game, and 17,177,581 total for the season.
Hard Rock Stadium, Miami Gardens, Florida
The purpose of this project is to analyse NFL attendance data from 2000 to 2019 and gain insight into spectator attendance over the 20-year period. Some of the objectives are to address the questions below.
For this study we use data from the Pro Football Reference website. We will perform some data cleansing and data manipulation to prepare the data for consumption. We will start with exploratory data analysis to understand the data, examine the factors that determine attendance at National Football League games, and build a model to identify the factors that have a bearing on attendance.
These insights will help with ticket pricing, logistics planning, promotions, and marketing campaigns.
The below packages are required to run the code.
library(readr)
library(tidyverse)
library(Hmisc)
library(knitr)
library(funModeling)
library(rpart)
library(skimr)
library(scales)
We are using data that originates from the Pro Football Reference website and is read here from the TidyTuesday repository on GitHub.
attendance <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/attendance.csv')
standings <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/standings.csv')
games <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/games.csv')
library(knitr)
# evaluate fig.cap after a chunk is evaluated
opts_knit$set(eval.after = 'fig.cap')
The Attendance data set contains the weekly attendance information for each team for the years 2000 to 2019.
Variable | Class | Description |
---|---|---|
team | character | Team City |
team_name | character | Team name |
year | integer | Season year |
total | double | Total attendance across 17 weeks (1 week = no game) |
home | double | Home attendance |
away | double | Away attendance |
week | character | Week number (1-17) |
weekly_attendance | double | Weekly attendance number |
The Standings data set contains the win/loss record, points scored, and rankings for each team for each season from 2000 to 2019.
Variable | Class | Description |
---|---|---|
team | character | Team city |
team_name | character | Team name |
year | integer | season year |
wins | double | Wins (0 to 16) |
loss | double | Losses (0 to 16) |
points_for | double | points for (offensive performance) |
points_against | double | Points against (defensive performance) |
points_differential | double | Point differential (points_for - points_against) |
margin_of_victory | double | (Points Scored - Points Allowed)/ Games Played |
strength_of_schedule | double | Average quality of opponent as measured by SRS (Simple Rating System) |
simple_rating | double | Team quality relative to average (0.0) as measured by SRS (Simple Rating System) SRS = MoV + SoS = OSRS + DSRS |
offensive_ranking | double | Team offense quality relative to average (0.0) as measured by SRS (Simple Rating System) |
defensive_ranking | double | Team defense quality relative to average (0.0) as measured by SRS (Simple Rating System) |
playoffs | character | Made playoffs or not |
sb_winner | character | Won superbowl or not |
The Games data set contains details about each game.
Variable | Class | Description |
---|---|---|
year | integer | season year, note that playoff games will still be in the previous season |
week | character | week number (1-17, plus playoffs) |
home_team | character | Home team |
away_team | character | Away team |
winner | character | Winning team |
tie | character | If a tie, the “losing” team as well |
day | character | Day of week |
date | character | Date minus year |
time | character | Time of game start |
pts_win | double | Points by winning team |
pts_loss | double | Points by losing team |
yds_win | double | Yards by winning team |
turnovers_win | double | Turnovers by winning team |
yds_loss | double | Yards by losing team |
turnovers_loss | double | Turnovers by losing team |
home_team_name | character | Home team name |
home_team_city | character | Home team city |
away_team_name | character | Away team name |
away_team_city | character | Away team city |
Looking at the summary statistics of the Attendance Data
skim(attendance)
Name | attendance |
Number of rows | 10846 |
Number of columns | 8 |
_______________________ | |
Column type frequency: | |
character | 2 |
numeric | 6 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
team | 0 | 1 | 5 | 13 | 0 | 32 | 0 |
team_name | 0 | 1 | 4 | 10 | 0 | 32 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
year | 0 | 1.00 | 2009.53 | 5.75 | 2000 | 2005.0 | 2010 | 2015.00 | 2019 | ▇▇▇▇▇ |
total | 0 | 1.00 | 1080910.03 | 72876.97 | 760644 | 1040509.0 | 1081090 | 1123230.00 | 1322087 | ▁▁▇▆▁ |
home | 0 | 1.00 | 540455.01 | 66774.65 | 202687 | 504360.0 | 543185 | 578342.00 | 741775 | ▁▁▅▇▁ |
away | 0 | 1.00 | 540455.01 | 25509.33 | 450295 | 524974.0 | 541757 | 557741.00 | 601655 | ▁▂▇▇▂ |
week | 0 | 1.00 | 9.00 | 4.90 | 1 | 5.0 | 9 | 13.00 | 17 | ▇▆▆▆▇ |
weekly_attendance | 638 | 0.94 | 67556.88 | 9022.02 | 23127 | 63245.5 | 68334 | 72544.75 | 105121 | ▁▁▇▃▁ |
Creating a single variable for the team name by combining the team and team_name attributes so that we can join with the standings and games data sets.
attendance_reshape <- rename(attendance , annual_attendance = total , annual_homegame_attendance = home ,
annual_awaygame_attendance = away ) %>%
mutate(NFL_team_name = str_c(team, team_name, sep = " "))
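To illustrate why the combined key is needed, here is a minimal base R sketch on a made-up two-row table. The toy rows are an assumption for illustration, and `paste()` stands in for `str_c()` so the snippet runs without any packages.

```r
# Toy rows (illustrative only): two franchises that share a city name
toy <- data.frame(
  team      = c("New York", "New York"),
  team_name = c("Giants", "Jets"),
  stringsAsFactors = FALSE
)

# Build the full team name; paste() with sep = " " mirrors
# str_c(team, team_name, sep = " ")
toy$NFL_team_name <- paste(toy$team, toy$team_name, sep = " ")

# The city alone is ambiguous, the combined key is not
n_city <- length(unique(toy$team))
n_key  <- length(unique(toy$NFL_team_name))
print(toy$NFL_team_name)
```

Joining on the city alone would mix up teams that share a city, which is why the full name is used as the key.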
After reviewing the statistics for each variable, we notice that weekly_attendance is missing for 638 rows. We will check whether there is any pattern to the missing values.
missing_data <-
attendance_reshape %>%
filter(is.na(weekly_attendance))
Every team has one bye week each season, and each missing value corresponds to a bye week. We can ignore these rows because no game was played that week. We also notice that there are only 31 teams in 2000 and 2001, and 32 teams from 2002 onward.
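The bye-week pattern can be checked in two ways: by counting missing weeks per team and season, and by confirming that the team counts add up to the 638 missing rows. The sketch below does both in base R. The toy data frame and its `is_bye` column are assumptions for illustration, not part of the real data set.

```r
# Toy attendance rows (illustrative only): one team, one season, weeks 1-5,
# with the bye in week 3 recorded as NA
toy <- data.frame(
  NFL_team_name     = "Team A",
  year              = 2000,
  week              = 1:5,
  weekly_attendance = c(70000, 65000, NA, 68000, 71000)
)

# Flag the missing weeks, then count them per team and season;
# a bye-only pattern gives exactly 1 for every group
toy$is_bye <- is.na(toy$weekly_attendance)
bye_counts <- aggregate(is_bye ~ NFL_team_name + year, data = toy, FUN = sum)

# One bye per team per season: 31 teams for 2 seasons, 32 teams for 18 seasons
expected_missing <- 31 * 2 + 32 * 18
print(expected_missing)
```

If the same count on the real data returned anything other than 1 for some team and season, the missing values would need a closer look before being dropped.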
We will filter out these 638 missing occurrences and use the clean data for further analysis.
attendance_cleansed <- attendance_reshape %>%
filter(! is.na(weekly_attendance)) %>%
select(NFL_team_name, year, week , weekly_attendance , annual_attendance)
Looking at the sample data after renaming the variables and filtering out the bye week data
kable(attendance_cleansed[1:5,])
NFL_team_name | year | week | weekly_attendance | annual_attendance |
---|---|---|---|---|
Arizona Cardinals | 2000 | 1 | 77434 | 893926 |
Arizona Cardinals | 2000 | 2 | 66009 | 893926 |
Arizona Cardinals | 2000 | 4 | 71801 | 893926 |
Arizona Cardinals | 2000 | 5 | 66985 | 893926 |
Arizona Cardinals | 2000 | 6 | 44296 | 893926 |
skim(standings)
Name | standings |
Number of rows | 638 |
Number of columns | 15 |
_______________________ | |
Column type frequency: | |
character | 4 |
numeric | 11 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
team | 0 | 1 | 5 | 13 | 0 | 32 | 0 |
team_name | 0 | 1 | 4 | 10 | 0 | 32 | 0 |
playoffs | 0 | 1 | 8 | 11 | 0 | 2 | 0 |
sb_winner | 0 | 1 | 12 | 13 | 0 | 2 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
year | 0 | 1 | 2009.53 | 5.76 | 2000.0 | 2005.00 | 2010.0 | 2014.75 | 2019.0 | ▇▇▇▇▇ |
wins | 0 | 1 | 7.98 | 3.08 | 0.0 | 6.00 | 8.0 | 10.00 | 16.0 | ▂▆▇▆▂ |
loss | 0 | 1 | 7.98 | 3.08 | 0.0 | 6.00 | 8.0 | 10.00 | 16.0 | ▂▆▇▆▂ |
points_for | 0 | 1 | 350.28 | 71.40 | 161.0 | 299.00 | 348.0 | 396.00 | 606.0 | ▂▇▇▂▁ |
points_against | 0 | 1 | 350.28 | 59.55 | 165.0 | 310.00 | 347.0 | 391.50 | 517.0 | ▁▃▇▆▁ |
points_differential | 0 | 1 | 0.00 | 101.09 | -261.0 | -75.00 | 1.5 | 72.75 | 315.0 | ▂▆▇▅▁ |
margin_of_victory | 0 | 1 | 0.00 | 6.32 | -16.3 | -4.70 | 0.1 | 4.57 | 19.7 | ▂▆▇▅▁ |
strength_of_schedule | 0 | 1 | 0.00 | 1.63 | -4.6 | -1.10 | 0.0 | 1.20 | 4.3 | ▁▅▇▅▁ |
simple_rating | 0 | 1 | 0.00 | 6.20 | -17.4 | -4.47 | 0.0 | 4.50 | 20.1 | ▁▆▇▅▁ |
offensive_ranking | 0 | 1 | 0.00 | 4.34 | -11.7 | -3.18 | 0.0 | 2.70 | 15.9 | ▁▇▇▂▁ |
defensive_ranking | 0 | 1 | 0.00 | 3.57 | -9.8 | -2.40 | 0.1 | 2.50 | 9.8 | ▁▅▇▅▁ |
After reviewing the statistics for each variable, we create a single variable for the full team name by combining the team and team_name attributes so that we can join with the attendance data set. The rest of the data looks good.
standings_reshape <- standings %>%
mutate(NFL_team_name = str_c(team, team_name, sep = " ")) %>%
select(NFL_team_name, year, wins, loss, margin_of_victory, simple_rating , offensive_ranking, defensive_ranking, playoffs, sb_winner)
kable(standings_reshape[1:5,])
NFL_team_name | year | wins | loss | margin_of_victory | simple_rating | offensive_ranking | defensive_ranking | playoffs | sb_winner |
---|---|---|---|---|---|---|---|---|---|
Miami Dolphins | 2000 | 11 | 5 | 6.1 | 7.1 | 0.0 | 7.1 | Playoffs | No Superbowl |
Indianapolis Colts | 2000 | 10 | 6 | 6.4 | 7.9 | 7.1 | 0.8 | Playoffs | No Superbowl |
New York Jets | 2000 | 9 | 7 | 0.0 | 3.5 | 1.4 | 2.2 | No Playoffs | No Superbowl |
Buffalo Bills | 2000 | 8 | 8 | -2.2 | 0.0 | 0.5 | -0.5 | No Playoffs | No Superbowl |
New England Patriots | 2000 | 5 | 11 | -3.9 | -2.5 | -2.7 | 0.2 | No Playoffs | No Superbowl |
skim(games)
Name | games |
Number of rows | 5324 |
Number of columns | 19 |
_______________________ | |
Column type frequency: | |
character | 11 |
difftime | 1 |
numeric | 7 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
week | 0 | 1 | 1 | 9 | 0 | 21 | 0 |
home_team | 0 | 1 | 13 | 20 | 0 | 34 | 0 |
away_team | 0 | 1 | 13 | 20 | 0 | 34 | 0 |
winner | 0 | 1 | 13 | 20 | 0 | 34 | 0 |
tie | 5314 | 0 | 14 | 18 | 0 | 7 | 0 |
day | 0 | 1 | 3 | 3 | 0 | 7 | 0 |
date | 0 | 1 | 9 | 12 | 0 | 154 | 0 |
home_team_name | 0 | 1 | 4 | 10 | 0 | 32 | 0 |
home_team_city | 0 | 1 | 5 | 13 | 0 | 32 | 0 |
away_team_name | 0 | 1 | 4 | 10 | 0 | 32 | 0 |
away_team_city | 0 | 1 | 5 | 13 | 0 | 32 | 0 |
Variable type: difftime
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
time | 0 | 1 | 30900 secs | 84900 secs | 47040 secs | 187 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
year | 0 | 1 | 2009.53 | 5.75 | 2000 | 2005 | 2010 | 2015 | 2019 | ▇▇▇▇▇ |
pts_win | 0 | 1 | 27.78 | 8.83 | 3 | 21 | 27 | 34 | 62 | ▁▇▇▂▁ |
pts_loss | 0 | 1 | 16.09 | 8.14 | 0 | 10 | 16 | 21 | 51 | ▆▇▅▁▁ |
yds_win | 0 | 1 | 361.64 | 78.58 | 47 | 308 | 361 | 415 | 653 | ▁▂▇▃▁ |
turnovers_win | 0 | 1 | 1.08 | 1.04 | 0 | 0 | 1 | 2 | 7 | ▇▂▁▁▁ |
yds_loss | 0 | 1 | 309.08 | 84.50 | 26 | 251 | 306 | 366 | 613 | ▁▅▇▃▁ |
turnovers_loss | 0 | 1 | 2.17 | 1.42 | 0 | 1 | 2 | 3 | 8 | ▆▇▂▁▁ |
After reviewing the statistics for each variable, the data in the week variable looks ambiguous. On closer inspection it holds both numeric and character values: the weeks after the regular season are labelled WildCard, Division, ConfChamp, and SuperBowl. These are valid values, but as the analysis covers only regular season games and not the playoffs, I will ignore the playoff games.
The games data set contains the result of each game. A total of 5104 games were played in the regular seasons from 2000 to 2019 (248 per season with 31 teams in 2000 and 2001, and 256 per season with 32 teams from 2002 onward), which leaves 220 playoff games among the 5324 rows. We need to change the data type of week from character to numeric so that we can join the games data set with the attendance data set.
games_regular <- games %>%
filter(week %in% c('1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17')) %>%
mutate(week = as.numeric(week)) %>%
select(home_team,away_team, year, week , winner)
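An alternative to listing all 17 week labels is to keep any label made only of digits. The base R sketch below shows the idea on a made-up vector of week labels. The pattern-based filter is a swapped-in technique, not the one used above.

```r
# Toy week labels (illustrative only), mixing regular-season numbers
# and the playoff round names
week <- c("1", "2", "17", "WildCard", "Division", "ConfChamp", "SuperBowl")

# Keep labels made only of digits, then convert them to numbers.
# Filtering first avoids the NA values (and warning) that as.numeric()
# would produce for the playoff labels.
is_regular   <- grepl("^[0-9]+$", week)
regular_week <- as.numeric(week[is_regular])
print(regular_week)
```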
Reshaping the games data so that the home and away teams are combined into a single column, with a home game indicator added
games_reshape <- games_regular %>%
gather(home_ind,NFL_team_name,home_team:away_team ) %>%
select(NFL_team_name, year, week, home_ind, winner)
kable(games_reshape[1:5,])
NFL_team_name | year | week | home_ind | winner |
---|---|---|---|---|
Minnesota Vikings | 2000 | 1 | home_team | Minnesota Vikings |
Kansas City Chiefs | 2000 | 1 | home_team | Indianapolis Colts |
Washington Redskins | 2000 | 1 | home_team | Washington Redskins |
Atlanta Falcons | 2000 | 1 | home_team | Atlanta Falcons |
Pittsburgh Steelers | 2000 | 1 | home_team | Baltimore Ravens |
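The reshape turns one row per game into two rows, one per participating team. The base R sketch below reproduces that idea on a made-up games table. The helper `to_long()` and the `won` column are hypothetical names introduced for illustration.

```r
# Toy games table (illustrative only): one row per game
games_toy <- data.frame(
  home_team = c("Team A", "Team C"),
  away_team = c("Team B", "Team D"),
  year      = c(2000, 2000),
  week      = c(1, 1),
  winner    = c("Team A", "Team D"),
  stringsAsFactors = FALSE
)

# Build one block of rows for a given side ("home_team" or "away_team"),
# recording the side in home_ind
to_long <- function(df, side) {
  data.frame(
    NFL_team_name = df[[side]],
    year          = df$year,
    week          = df$week,
    home_ind      = side,
    winner        = df$winner,
    stringsAsFactors = FALSE
  )
}

# Stack the home block on top of the away block: two rows per game
games_long <- rbind(to_long(games_toy, "home_team"),
                    to_long(games_toy, "away_team"))

# Derived flag: did the team in this row win the game?
games_long$won <- games_long$NFL_team_name == games_long$winner
print(games_long)
```

Because the home block is stacked first, the first rows of the result are all home teams, which matches the sample shown above.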
We will create an annual team attendance summary data set. Since the home and away attendance counts are season totals, we divide them by 8 to get the average attendance per home game and per away game.
# Create a team annual summaries data set.
# The annual totals are repeated on every weekly row, so their mean
# per team and year is the total itself.
team_summaries <- attendance_reshape %>%
  group_by(NFL_team_name, year) %>%
  summarise(Average_home = mean(annual_homegame_attendance) / 8,
            Average_away = mean(annual_awaygame_attendance) / 8)
# Create an annual attendance summary data set
annual_summaries <- attendance_cleansed %>%
  group_by(year) %>%
  summarise(Average_nfl = mean(weekly_attendance))
# Create a summary data set with home, away and NFL attendance summaries
attendance_summaries <- left_join(team_summaries, annual_summaries,
                                  by = c("year"))
kable(attendance_summaries[1:5,])
NFL_team_name | year | Average_home | Average_away | Average_nfl |
---|---|---|---|---|
Arizona Cardinals | 2000 | 48434.38 | 63306.38 | 65934.44 |
Arizona Cardinals | 2001 | 38414.38 | 63009.50 | 65753.58 |
Arizona Cardinals | 2002 | 40909.00 | 71450.62 | 66325.67 |
Arizona Cardinals | 2003 | 36062.38 | 64487.75 | 66674.36 |
Arizona Cardinals | 2004 | 37533.38 | 67286.25 | 67462.60 |
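As a quick consistency check on the division by 8, the per-game averages can be multiplied back up and compared with the season total. The sketch below uses the Arizona Cardinals 2000 values printed in the tables above. It assumes exactly eight home and eight away games, and the variable names are introduced here for illustration.

```r
# Values printed above for the Arizona Cardinals, 2000 season
annual_attendance <- 893926    # home + away total for the season
average_home      <- 48434.38  # annual home attendance / 8 (rounded in the table)
average_away      <- 63306.38  # annual away attendance / 8 (rounded in the table)

games_per_side <- 8            # assumption: 16 games, 8 home and 8 away

# Undo the division and compare with the reported season total
reconstructed <- (average_home + average_away) * games_per_side
difference    <- abs(reconstructed - annual_attendance)
print(difference)
```

The difference stays below one spectator, which is what the rounding in the table would explain.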
We will also create a weekly attendance summary data set
weekly_summaries <- attendance_cleansed %>%
group_by(week) %>%
summarise(
Average_week = mean(weekly_attendance)
)
kable(weekly_summaries[1:5,])
week | Average_week |
---|---|
1 | 68532.91 |
2 | 67430.24 |
3 | 67761.31 |
4 | 67581.84 |
5 | 68629.94 |
I will combine the attendance, standings, and games data sets into a single data set, which I will use in my exploratory data analysis.
combined_data <- inner_join(attendance_cleansed, standings_reshape,
by = c( "NFL_team_name", "year")) %>%
inner_join(. , games_reshape,
by = c("NFL_team_name" , "year" , "week"))
kable(combined_data[1:5,])
NFL_team_name | year | week | weekly_attendance | annual_attendance | wins | loss | margin_of_victory | simple_rating | offensive_ranking | defensive_ranking | playoffs | sb_winner | home_ind | winner |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Arizona Cardinals | 2000 | 1 | 77434 | 893926 | 3 | 13 | -14.6 | -15.2 | -7.2 | -8.1 | No Playoffs | No Superbowl | away_team | New York Giants |
Arizona Cardinals | 2000 | 2 | 66009 | 893926 | 3 | 13 | -14.6 | -15.2 | -7.2 | -8.1 | No Playoffs | No Superbowl | home_team | Arizona Cardinals |
Arizona Cardinals | 2000 | 4 | 71801 | 893926 | 3 | 13 | -14.6 | -15.2 | -7.2 | -8.1 | No Playoffs | No Superbowl | home_team | Green Bay Packers |
Arizona Cardinals | 2000 | 5 | 66985 | 893926 | 3 | 13 | -14.6 | -15.2 | -7.2 | -8.1 | No Playoffs | No Superbowl | away_team | San Francisco 49ers |
Arizona Cardinals | 2000 | 6 | 44296 | 893926 | 3 | 13 | -14.6 | -15.2 | -7.2 | -8.1 | No Playoffs | No Superbowl | home_team | Arizona Cardinals |
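An inner join silently drops rows whose key has no match, so it is worth comparing row counts before and after joining. The base R sketch below shows the effect on made-up tables. `merge()` stands in for `inner_join()`, and all names ending in `_toy` are assumptions for illustration.

```r
# Toy tables (illustrative only)
attendance_toy <- data.frame(
  NFL_team_name     = c("Team A", "Team A", "Team B"),
  year              = c(2000, 2000, 2000),
  week              = c(1, 2, 1),
  weekly_attendance = c(70000, 65000, 60000),
  stringsAsFactors = FALSE
)
standings_toy <- data.frame(
  NFL_team_name = "Team A",   # "Team B" is deliberately missing
  year          = 2000,
  wins          = 10,
  stringsAsFactors = FALSE
)

# merge() performs an inner join by default, like inner_join()
joined <- merge(attendance_toy, standings_toy,
                by = c("NFL_team_name", "year"))

# Rows whose key has no match in standings_toy are dropped
dropped <- nrow(attendance_toy) - nrow(joined)
print(dropped)
```

On the real data, a non-zero count here would usually point to team names spelled differently across the three data sets.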
The objective of performing exploratory data analysis is to understand the data better and identify the variables that may have a significant effect on the attendance numbers. I will look at different variables and plot some graphs to understand them better.
The games data shows 34 distinct team names, but the NFL has only 32 teams in any season. Let's take a look at a bar chart of games per team to understand what happened.
ggplot(data = combined_data,
       aes(x = reorder(factor(NFL_team_name), NFL_team_name, function(x) length(x)),
           fill = factor(year))) +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 0), legend.title = element_blank(),
        legend.position = "bottom", axis.title.x = element_blank(),
        axis.title.y = element_blank(), plot.title = element_text(hjust = 0.5)) +
  ggtitle("Games played by teams 2000-2019") +
  coord_flip()
Figure 1 - NFL Teams 2000-2019.
Looking at the teams that did not play all 320 games, we noticed that the Rams moved from St. Louis to Los Angeles in 2016. Similarly, the Chargers moved from San Diego to Los Angeles in 2017. We also notice that the Houston Texans started playing in the league in 2002.
Let's find out whether attendance spikes in any particular week of the season. We will plot the average attendance for each week number, computed across all seasons from the summarized data, to see whether any particular weeks see higher attendance.
ggplot(data=weekly_summaries, aes(x= week, y= Average_week)) +
geom_point() +
geom_smooth()+
theme(axis.text.x = element_text(angle = 0), legend.title = element_blank(),
legend.position = "bottom", axis.title.x = element_blank(),
axis.title.y = element_blank(), plot.title = element_text(hjust = 0.5)) +
ggtitle("Average attendance (per week number) 2000-2019")
Figure 2 - Attendance for Week#.
Looking at this plot, there seems to be a dip in attendance after week 8. Let's draw a box plot to see whether this is significant.
combined_data %>%
mutate(week = factor(week)) %>%
ggplot(aes(week, weekly_attendance, fill = week)) +
geom_boxplot(show.legend = FALSE, outlier.alpha = 0.5) +
labs( x = "Week of NFL season",
y = "Weekly NFL game attendance") +
theme(axis.text.x = element_text(angle = 00), legend.title = element_blank(),
legend.position = "bottom", axis.title.x = element_blank(),
axis.title.y = element_blank(), plot.title = element_text(hjust = 0.5)) +
scale_y_continuous(label = unit_format(unit = "K", scale = 1/1000, sep = ""))
Figure 3 - Attendance for Week#.
Looking at the box plots, there does not seem to be much variation between weeks. Now let's look at the impact of home and away games on attendance.
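A box plot gives only a visual impression. One way to put a number on it, offered here as a suggestion rather than something this analysis ran, is a Kruskal-Wallis rank-sum test across weeks. The sketch below runs it on simulated data with no true week effect. The mean and standard deviation are rough values taken from the attendance summary, and everything else is an assumption.

```r
set.seed(1)  # reproducible toy data

# Simulated attendance (illustrative only): 17 weeks, 30 games per week,
# all drawn from the same distribution, i.e. no true week effect
toy <- data.frame(
  week              = factor(rep(1:17, each = 30)),
  weekly_attendance = rnorm(17 * 30, mean = 67500, sd = 9000)
)

# Kruskal-Wallis rank-sum test: do the weekly distributions differ?
kw <- kruskal.test(weekly_attendance ~ week, data = toy)
print(kw$p.value)
```

On the real data, the same call on `combined_data` would test whether the week-to-week differences are larger than chance alone would produce.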
We will look at the average game attendance for home and away games along with the overall NFL game attendance average.
ggplot(data = attendance_summaries) +
geom_line(aes(x = year, y = Average_nfl, col='NFL Game Average')) +
geom_line(aes(x = year, y = Average_away, col='Away Game Average')) +
geom_line(aes(x = year, y = Average_home, col='Home Game Average')) +
facet_wrap(facets = vars(NFL_team_name), shrink = TRUE) +
theme(axis.text.x = element_text(angle = 00), legend.title = element_blank(),
legend.position = "bottom", axis.title.x = element_blank(),
axis.title.y = element_blank(), plot.title = element_text(hjust = 0.5)) +
scale_x_continuous(labels = function(x) substring(x,3,4)) +
scale_y_continuous(label = unit_format(unit = "K", scale = 1/1000, sep = "")) +
scale_color_manual(values = c("red", "royalblue", "black")) +
ggtitle("Average NFL attendance (per game) 2000-2019")
Figure 4 - Home and away averages compared with the NFL average.
We do not see any common pattern across all the teams; the trends look specific to each team. The Dallas Cowboys' home game attendance is well above that of the other teams. The Kansas City Chiefs could also be classified as a team with good home support. The Washington Redskins used to have very strong home game support until 2017. The Oakland Raiders seem to have below-par home game attendance compared to other teams, and the Cincinnati Bengals also have lower home game attendance than the rest of the teams.
Let us create a box plot of the weekly attendance for each team, split by whether the team qualified for the playoffs that season
combined_data %>%
ggplot(aes(fct_reorder(NFL_team_name, weekly_attendance),
weekly_attendance,
fill = playoffs
)) +
geom_boxplot(outlier.alpha = 0.5) +
coord_flip() +
labs(
fill = NULL, x = NULL,
y = "Weekly NFL game attendance"
) +
theme(axis.text.x = element_text(angle = 00), legend.title = element_blank(),
legend.position = "bottom", axis.title.x = element_blank(),
axis.title.y = element_blank(), plot.title = element_text(hjust = 0.5))
Figure 5 - Weekly NFL game attendance for playoff and non-playoff seasons.
Again, we do not see many common patterns across all the teams combined. We notice that attendance for the Washington Redskins and Dallas Cowboys spikes when they are having a playoff season. Since the Los Angeles Rams data set is very small, we will ignore the trend where they seem to have more attendance when they don't make the playoffs.
Creating a simple scatterplot matrix to check for correlation.
pairs(~annual_attendance+wins+margin_of_victory+simple_rating+offensive_ranking+defensive_ranking,
data=combined_data,
main="Simple Scatterplot Matrix")
We do not observe a strong correlation among these variables.
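A scatterplot matrix is read by eye, so a correlation matrix can complement it with numbers. The sketch below computes one with `cor()` on simulated team-season values. The data are made up so that one pair is strongly related and another is not, and they say nothing about the real relationships.

```r
set.seed(42)  # reproducible toy data

# Simulated team-season values (illustrative only)
n    <- 200
wins <- sample(0:16, n, replace = TRUE)
toy  <- data.frame(
  wins              = wins,
  margin_of_victory = (wins - 8) * 2 + rnorm(n, sd = 3),    # built to track wins
  annual_attendance = rnorm(n, mean = 1080000, sd = 73000)  # built to be unrelated
)

# Pairwise Pearson correlations, rounded for readability
cor_matrix <- round(cor(toy), 2)
print(cor_matrix)
```

Applied to the numeric columns of `combined_data`, the same call would give a figure for each panel of the scatterplot matrix.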
Let us build a couple of simple linear regression models to identify the factors that may impact the game attendance numbers.
lmann = lm(annual_attendance~week ,data=combined_data)
broom::tidy(lmann)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1081044. 1501. 720. 0
## 2 week -14.7 145. -0.101 0.919
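Looking at the coefficients from the model, the p-value for week is 0.919, so the data give no evidence of a linear relationship between week number and annual attendance. This is expected, because annual_attendance is a season total that is repeated on every weekly row for a team. In other words, week number is not a useful predictor of annual attendance.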
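The near-zero slope for week can be reproduced on toy data: when a season total is copied onto every weekly row, week number carries no information about it. The sketch below assumes three made-up teams that each play in all 16 listed weeks. The real data differ slightly because bye weeks fall in different weeks for different teams.

```r
# Toy data (illustrative only): 3 teams, each with the same annual total
# repeated on every one of its 16 game weeks
toy    <- expand.grid(week = 1:16, team = c("A", "B", "C"))
annual <- c(A = 900000, B = 1100000, C = 1250000)
toy$annual_attendance <- unname(annual[as.character(toy$team)])

# Regress the repeated annual total on week number
fit   <- lm(annual_attendance ~ week, data = toy)
slope <- unname(coef(fit)["week"])
print(slope)
```

The fitted slope is zero up to floating-point error, because every week sees the same mix of annual totals.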
lmWeek = lm(weekly_attendance~week+wins+loss+margin_of_victory+simple_rating ,data=combined_data)
broom::tidy(lmWeek)
## # A tibble: 6 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 70764. 8138. 8.69 4.00e-18
## 2 week -71.7 17.8 -4.03 5.61e- 5
## 3 wins -112. 510. -0.220 8.26e- 1
## 4 loss -208. 512. -0.406 6.85e- 1
## 5 margin_of_victory -142. 64.7 -2.19 2.86e- 2
## 6 simple_rating 250. 55.6 4.49 7.24e- 6
Looking at the coefficients from the model, the p-value for simple_rating is less than 0.05, so it could be meaningful for predicting attendance. The coefficient for week is negative and its p-value is also less than 0.05, so it could be meaningful as well. The same holds for margin_of_victory (p = 0.029), whereas wins and loss are not significant in this model.
Please note that more validation is required to check the model's accuracy, and that with a sample this large even minute differences can be statistically significant.
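One possible reason for the large standard errors on wins and loss, offered here as an assumption rather than something the analysis checked, is that the two predictors are almost redundant: in a 16-game season without ties, loss equals 16 minus wins. The sketch below shows the extreme case on simulated data with no ties. All values are made up.

```r
# Toy team-seasons (illustrative only), 16 games each and no ties
set.seed(7)
wins <- sample(0:16, 50, replace = TRUE)
toy  <- data.frame(
  wins = wins,
  loss = 16 - wins,  # fully determined by wins
  weekly_attendance = 67500 + 100 * wins + rnorm(50, sd = 5000)
)

# Perfect negative correlation between the two predictors
r <- cor(toy$wins, toy$loss)

# lm() cannot separate the two effects and reports NA for the aliased term
fit <- lm(weekly_attendance ~ wins + loss, data = toy)
print(coef(fit))
```

The real data contain a few tied games, which is presumably why the model above still produced estimates for both terms. Dropping one of the two would be a reasonable thing to try.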
Based on our analysis, there are not many strong factors that impact attendance, but some variables, such as simple rating, seem to have a slight impact. We also noticed that week number has a negative coefficient, which is consistent with the dip in attendance we saw from week 8 onward. This could be because teams that are not going to make the playoffs see a dip in attendance. We also noticed that the Dallas Cowboys and Kansas City Chiefs enjoy the most support from their fans.
An additional opportunity for analysis would be to use the previous year's performance and see whether it has an impact on attendance.
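As a starting point for that follow-up, the previous season's wins can be attached to each team-season as a lagged column. The base R sketch below does this on a made-up standings table. The column `prev_wins` is a hypothetical name, and the approach assumes one row per team for consecutive seasons, so a relocated team whose name changes would start over with a missing value.

```r
# Toy standings (illustrative only): two teams, three seasons each
standings_toy <- data.frame(
  NFL_team_name = rep(c("Team A", "Team B"), each = 3),
  year          = rep(2000:2002, times = 2),
  wins          = c(10, 6, 12, 4, 8, 9),
  stringsAsFactors = FALSE
)

# Sort by team and season so that "previous row" means "previous season"
ord <- order(standings_toy$NFL_team_name, standings_toy$year)
standings_toy <- standings_toy[ord, ]

# Within each team, shift wins down by one season;
# the first season of each team has no predecessor and gets NA
standings_toy$prev_wins <- ave(
  standings_toy$wins, standings_toy$NFL_team_name,
  FUN = function(x) c(NA, head(x, -1))
)
print(standings_toy)
```

The lagged column could then be joined to the attendance data and added to the regression as another predictor.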