Paul Brown Stadium
We were provided with 3 separate datasets with various information about NFL games, attendance, and standings across 20 seasons (from 2000-2019). Using this data, we are interested in trying to find a way to effectively predict attendance numbers for teams in the NFL. In order to do this, we will compare variables such as offensive_ranking, wins, and strenght_of_schedule with the overall attendance for teams with the hope that we will be able to shows relationships between the variables. Our analysis will mainly consist of visualizations to illustrate the relationships between our variables. However, we will also be doing some regression and machine learning to further our analysis. Our hope is that we can help organizations accurately predict attendance so that they can properly prepare their stadiums with concessions, employees, and other necessities.
library(tidyverse)
library(stats)
library(rpart)
library(forecast)
library(treemap)
library(rpart.plot)
We loaded tidyverse for its many data transformation, visualization, and processing tools such as dplyr. The stats package was loaded for its many regression and statistical tools. rpart, forecast, treemap, and rpart.plot were all loaded to create and visualize a Regression Tree
Our original data source was provided through our course website. A link to this dataset’s description and download can be found at this link. The data comes from the pro-football-reference website, and it is used to keep track of data from all current and historic NFL games. The data was collected as over the course of the 20 seasons in our dataset. Every season has 17 weeks, with all 32 teams playing 16 games (8 home, 8 away, 1 bye week).
The original data came in three separate csv files with the following information
Missing values were recorded as “NA”, and are cleanly translated in R to the language’s NA value. We found 638 missing values in the attendance dataset, all coming from the weekly_attendance variable. These were missing because the teams were on bye. Additionally, there were 5,314 missing values in the games dataset under the tie variable, which shows that there was no tie in that specific game. The standings dataset contained no missing values, and therefore will require no work for missing data.
Used read_csv function to read in our datasets.
#attendance <- read_csv("attendance.csv")
#games <- read.csv("games.csv")
#standings <- read.csv("standings.csv")
attendance <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/attendance.csv')
standings <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/standings.csv')
games <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/games.csv')
Since most of the data was clean, we did not have much to do. However there were certainly some changes that needed to be made. First, we changed some variable names in the games dataset from home_team_name to team_name and from home_team_city to team in order to match variables in the other datasets for our join.
games_renamed <- rename(games, "team_name" = "home_team_name")
games_renamed <- rename(games_renamed, 'team' = "home_team_city")
Next, we needed to change the data type of the week variables in the games_renamed dataset to numeric. They were originally recorded as strings because they playoff weeks were entered as “WildCard”, “Division”, “ConfChamp”, and “SuperBowl”. In order to deal with this, we removed all of the observations containing these values. We feel that this decision was acceptable because playoff games are not entirely indicative of typical NFL game attendance. Additionally, it does not make complete sense to keep these records since not every team makes the playoffs. We felt that focusing on the regular season would be most important for predicting attendance.
games_renamed <- games_renamed[!games_renamed$week %in% c('WildCard', 'Division', 'ConfChamp', 'SuperBowl'),]
games_renamed$week <- as.numeric(games_renamed$week)
After our data was ready to be merged, we used two left_join lines to combine our datasets.
NFL_join_data <- dplyr::left_join(games_renamed, standings, by = c("year", "team_name", "team"))
NFL_join_data2 <- dplyr::left_join(NFL_join_data, attendance, by = c("year", "team_name", "team", "week"))
After the join, there were still some variable data types that needed to be changed. For example, the time variable, which shows the start of the game was originally a string. We decided to extract the hour out of this variable and convert it to numeric so that we could run future analysis more easily.
NFL_join_data2$time <- format(strptime(NFL_join_data2$time, "%H"),"%H")
NFL_join_data2$time <- as.numeric(NFL_join_data2$time)
Next, We also needed to change the date variable. We decided that the actual date was not as important to know for analysis as the month. Based on this assumption, we decided to drop the day and only record the month that the games took place in.
NFL_join_data2 <- NFL_join_data2 %>%
separate(date, c("month", "dropday"), sep = " ")
NFL_join_data2 <- subset(NFL_join_data2, select = -c(dropday))
After this, we also wanted to change the tie variable so that it would be more useful to us. Originally, the variable was assigned an NA value if the game did not end in a tie. Instead of having this, we decided to change the NA values to 0 and any other values (representing the game ending in a tie) to 1.
NFL_join_data2$tie[is.na(NFL_join_data2$tie)] <- 0
NFL_join_data2$tie[NFL_join_data2$tie != 0] <- 1
# Changing tie variable to numeric
NFL_join_data2$tie <- as.numeric(NFL_join_data2$tie)
We no longer have any missing values in this dataset with this change. There were some outliers in the simple_rating and offensive_ranking variables, but we feel that these should be kept in because they are not erroneous. The teams that had these rankings were truly dominant when they achieved these numbers. We made some box plots to visually examine how extreme the outliers were, and they are shown below.
ggplot(NFL_join_data2) +
aes(simple_rating) +
geom_boxplot() +
ggtitle("Simple Rating Distribution")
ggplot(NFL_join_data2) +
aes(offensive_ranking) +
geom_boxplot() +
ggtitle("Offensive Ranking Distribution")
As we can see from these plots, these outliers do not seem to be too extreme to keep.
Below is a table to show these general structure of our dataset with what we believe will be some key variables.
knitr::kable(head(NFL_join_data2[c("team_name",
"year",
"total",
"wins",
"playoffs",
"sb_winner",
"strength_of_schedule",
"offensive_ranking")]))
| team_name | year | total | wins | playoffs | sb_winner | strength_of_schedule | offensive_ranking |
|---|---|---|---|---|---|---|---|
| Vikings | 2000 | 1029262 | 11 | Playoffs | No Superbowl | 0.3 | 4.3 |
| Chiefs | 2000 | 1115272 | 7 | No Playoffs | No Superbowl | 0.6 | 0.1 |
| Redskins | 2000 | 1174332 | 8 | No Playoffs | No Superbowl | 0.2 | -2.9 |
| Falcons | 2000 | 964579 | 4 | No Playoffs | No Superbowl | 1.5 | -5.7 |
| Steelers | 2000 | 987037 | 9 | No Playoffs | No Superbowl | -0.2 | 0.6 |
| Browns | 2000 | 1057139 | 3 | No Playoffs | No Superbowl | 1.5 | -9.1 |
We also made a table to show summaries of variables that we have deemed important, shown below.
knitr::kable(summary(NFL_join_data2[c("total",
"wins",
"simple_rating",
"defensive_ranking",
"offensive_ranking",
"strength_of_schedule",
"weekly_attendance")]))
| total | wins | simple_rating | defensive_ranking | offensive_ranking | strength_of_schedule | weekly_attendance | |
|---|---|---|---|---|---|---|---|
| Min. : 760644 | Min. : 0.000 | Min. :-17.4 | Min. :-9.800000 | Min. :-1.17e+01 | Min. :-4.600000 | Min. : 23127 | |
| 1st Qu.:1040509 | 1st Qu.: 6.000 | 1st Qu.: -4.5 | 1st Qu.:-2.400000 | 1st Qu.:-3.20e+00 | 1st Qu.:-1.100000 | 1st Qu.: 63246 | |
| Median :1081090 | Median : 8.000 | Median : 0.0 | Median : 0.100000 | Median : 0.00e+00 | Median : 0.000000 | Median : 68334 | |
| Mean :1080910 | Mean : 7.984 | Mean : 0.0 | Mean :-0.001097 | Mean :-1.57e-04 | Mean : 0.001097 | Mean : 67557 | |
| 3rd Qu.:1123230 | 3rd Qu.:10.000 | 3rd Qu.: 4.5 | 3rd Qu.: 2.500000 | 3rd Qu.: 2.70e+00 | 3rd Qu.: 1.200000 | 3rd Qu.: 72545 | |
| Max. :1322087 | Max. :16.000 | Max. : 20.1 | Max. : 9.800000 | Max. : 1.59e+01 | Max. : 4.300000 | Max. :105121 |
We first realized that it would be easier to compare the teams and their attendance numbers if we made them into percentages of stadium capacity. After research about the capacity of each stadium, we found that each team actually exceeded the capacity numbers we found online at some points. This is because many stadiums can squeeze more fans than available seats at events. With this information in mind, we simply divided the attendance for each game by the maximum attendance for that home team to determine percent attendance. We also found the seasonal attendance maximum using a similar technique.
NFL_max_week <-NFL_join_data2 %>%
group_by(home_team) %>%
mutate(weekly_attendance_max=max(weekly_attendance)) %>%
mutate(percentage_attendance = weekly_attendance/weekly_attendance_max*100)
NFL_max_week <- NFL_max_week %>%
group_by(home_team, year) %>%
mutate(weekly_attendance_max_season = max(weekly_attendance)) %>%
mutate(percentage_attendance_season_max = weekly_attendance/weekly_attendance_max_season*100)
We next found the count of home and away games for each team to see what teams had played the most / least. The number of home and away games for each team can be seen below.
ID_NFL_max_week <- NFL_max_week
ID_NFL_max_week$week <- ID_NFL_max_week$week %>% as.character(.)
ID_NFL_max_week$year <- ID_NFL_max_week$year %>% as.character(.)
ID_NFL_max_week <- rowid_to_column(NFL_max_week, "ID")
ID_NFL_max_week %>% mutate(ID = 1:n()) %>% summarise(`Home` = home_team ,`Away` = away_team, ID) %>%
pivot_longer(-c(ID,year),names_to="team",values_to="Team Name") -> all_games
all_games <- all_games[!(all_games$team=="home_team"),]
all_games %>% group_by(`Team Name`) %>% ggplot() +
geom_bar(aes(x=reorder(`Team Name`,`Team Name`,function(x)+length(x)), fill=factor(year), alpha=team), width=.8) +
ggtitle("Games played from 2000-2019") +
xlab("Team") +
ylab("Count") +
labs(fill="Year", alpha="Location") +
scale_alpha_discrete(range = c(0.8, 0.9)) +
coord_flip()
This shows the total amount of home and away games played by each team. Both the Rams and the Chargers moved to Los Angeles during this span, so that is why those teams have fewer games played. In addition, the Houston Texans were not in the league until 2002, giving the a lower game total.
We wanted to see if any weeks had less attendance than others, particularly later weeks when the weather is worse. In order to find this, we plotted the average attendance percentage by week seen below.
NFL_max_week %>%
group_by(week) %>%
summarize(average_p = mean(percentage_attendance)) %>%
ggplot(aes(week, average_p)) +
geom_smooth() +
geom_jitter() +
ggtitle("Average Percentage Attendance by Week") +
labs(x = "Week",
y = "Average Attendance Percentage")
Although we see a slight dip in the later weeks, this is not a significant amount of attendance lost. However, it is interesting to note how the attendance jumps back to being quite high on average for the last game of the year.
We also wanted to see if there was more or less average attendance across the years, so we made a line graph to show how average attendance percentage changed from season to season.
NFL_max_week %>%
group_by(year) %>%
summarise(mean_attendance_percentage = mean(percentage_attendance)) %>%
ggplot() +
geom_line(aes(year, mean_attendance_percentage)) +
ggtitle("Average Attendance Percentage Across Years") +
labs(x = "Year",
y = "Average Attendance Percentage")
As we saw with the week graph, the changes in attendance by year have not been significant. It is interesting to note that mean attendance percentage has been on a decline.
Next, we thought that the amount of wins that a team had in a given season may be a great indicator of how many people attend their games. The graphs for each team’s win percentage and mean attendance percentage by year is shown below.
NFL_max_week %>%
group_by(home_team, year) %>%
summarise(yearly_mean_attendance = mean(percentage_attendance),
win_pct = wins / 16) %>%
ggplot() +
geom_line(aes(year, yearly_mean_attendance / 100, color = "Mean Percentage Attendance")) +
geom_line(aes(year, win_pct, color = "Win Percentage")) +
ggtitle("Average Yearly Attendance Compared to Win Percentage") +
labs(x = "Year",
y = "Percent",
color = "Legend") +
facet_wrap(~ home_team)
This graph does show some clear trends with some teams, but it is not as close as we would have liked. Indeed, fans of teams such as the Bengals, Jaguars, and Bills definitely seem to respond to their team’s performance, but other teams such as the Giants and Jets have consistent attendance not matter the record.
In order to do similar analysis on attendance using other variables, we decided to use standardized scores that would provide a better comparison of values.
standardize <- function(variable) {
(variable - min(variable)) / (max(variable) - min(variable))
}
NFL_max_week <- NFL_max_week %>%
group_by(year) %>%
mutate(standard_points_for = standardize(points_for))
NFL_max_week <- NFL_max_week %>%
group_by(year) %>%
mutate(standard_points_against = standardize(points_against))
We next wanted to determine how these variables relate to attendance, and these relationships can be seen in the graphs below.
NFL_max_week %>%
group_by(home_team, year) %>%
summarise(standard_points_for = standard_points_for,
yearly_mean_attendance = mean(percentage_attendance),
win_pct = wins / 16) %>%
ggplot() +
geom_line(aes(year, yearly_mean_attendance / 100, color = "Mean Percentage Attendance")) +
geom_line(aes(year, standard_points_for, color = "Standardized Points Scored")) +
ggtitle("Average Yearly Attendance Compared to Points Scored") +
labs(x = "Year",
y = "Percent",
color = "Legend") +
facet_wrap(~ home_team)
NFL_max_week %>%
group_by(home_team, year) %>%
summarise(yearly_mean_attendance = mean(percentage_attendance),
standard_points_against = standard_points_against) %>%
ggplot() +
geom_line(aes(year, yearly_mean_attendance / 100, color = "Mean Percentage Attendance")) +
geom_line(aes(year, 1 - standard_points_against, color = "Standardized Points Allowed")) +
ggtitle("Average Yearly Attendance Compared to Points Allowed") +
labs(x = "Year",
y = "Percent",
color = "Legend") +
facet_wrap(~ home_team)
In the points standard_points_allowed graph, we use 1 - standard_points_allowed to give teams who allowed less points a higher score. We feel that this is important to show the relationship more accurately. Both this graph and the previous one show a similar story. Indeed, both the plots show the same relationship that win percentage had with average attendance percentage; some teams seem to be affected by it and other do not.
These next two graphs show the effect of points allowed and points scored on the win percentage. We thought it would be important to see how closely these were related because of how similarly each of these variables related to attendance.
NFL_max_week %>%
group_by(home_team, year) %>%
summarise(win_pct = wins / 16,
standard_points_against = standard_points_against) %>%
ggplot() +
geom_line(aes(year, win_pct, color = "Win Percentage")) +
geom_line(aes(year, 1 - standard_points_against, color = "Standardized Points Allowed")) +
ggtitle("Average Win Percentage Compared to Points Allowed") +
labs(x = "Year",
y = "Percent",
color = "Legend") +
facet_wrap(~ home_team)
NFL_max_week %>%
group_by(home_team, year) %>%
summarise(win_pct = wins / 16,
standard_points_for = standard_points_for) %>%
ggplot() +
geom_line(aes(year, win_pct, color = "Win Percentage")) +
geom_line(aes(year, standard_points_for, color = "Standardized Points Scored")) +
ggtitle("Win Percentage Compared to Points Scored") +
labs(x = "Year",
y = "Percent",
color = "Legend") +
facet_wrap(~ home_team)
These graphs show that both points allowed and points scored and extremely closely related to win percentage. We should, of course, expect this because teams who score more and allow less points are more likely to win. Again, it is important to note that we are using the points allowed on a scale where a team who allowed less points is granted a higher score.
We also wanted to see if games played at a certain time of day had more attendance than other games. Our findings are shown below.
NFL_max_week %>%
group_by(home_team, time) %>%
summarise(mean_attendance = mean(percentage_attendance)) %>%
ggplot() +
geom_col(aes(x = time, y = mean_attendance, fill = time)) +
facet_wrap(~ home_team)
As we can see from this graph, time of day does not appear to have a significant affect on the attendance for a game.
Although we have already seen one graph comparing week to attendance, we wanted to see if this relationship was different for certain teams. Our thinking was that teams in colder parts of the country may see more of a drop in attendance as the season progressed.
NFL_max_week %>%
group_by(home_team, week) %>%
summarise(mean_attendance = mean(percentage_attendance)) %>%
ggplot() +
geom_col(aes(x = week, y = mean_attendance, fill = week)) +
facet_wrap(~ home_team)
As we can see from this graph, the week in which a game is played does not seem to strongly affect the attendance for that game.
We next decided to make a decision tree to see in hopes to find a model that accurately predicts attendance. To perform this analysis, we first got rid of many variables that would not have been useful to us.
NFL_max_week_LM <- NFL_max_week[,-c(5:6,10:25,27,30:34,36:41)]
We then partitioned the data to create our training and test datasets.
# partition data
set.seed(8)
train_lm.index <- sample(rownames(NFL_max_week_LM), nrow(NFL_max_week_LM) * 0.7)
train_lm.df <- NFL_max_week_LM[train_lm.index, ]
valid_lm.index <- setdiff(rownames(NFL_max_week_LM), train_lm.index)
valid_lm.df <- NFL_max_week_LM[valid_lm.index, ]
We loaded in these following packages to perform the decision tree analysis.
#####Regression Tree#####
We then found the best pruned tree.
#best pruned tree
cv.attendance.tree <- rpart(weekly_attendance ~ .,
data = train_lm.df,
cp = 0,
minsplit = 2,
xval = 10)
options(scipen = 999, digits = 8)
minerror <- min(cv.attendance.tree$cptable[ ,4 ]) # find minimum xerror
minerrorstd <- cv.attendance.tree$cptable[cv.attendance.tree$cptable[,4] == minerror, 5] # and its corresponding xstd
#get list of trees where xerror is less than minerror + minerrorstd
simplertrees <- cv.attendance.tree$cptable[cv.attendance.tree$cptable[,4] < minerror + minerrorstd, ]
We then generated the decision tree using the best cp found above.
#use the cp from the simplest of those trees
bestcp <- simplertrees[1, 1]
attendance.pruned <- prune(cv.attendance.tree, cp = bestcp)
prp(attendance.pruned, type = 1, extra = 1, varlen = -10, digits = -3,
box.col = ifelse(attendance.pruned$frame$var == "<leaf>", 'gray', 'white'))
length(attendance.pruned$frame$var[attendance.pruned$frame$var == "<leaf>"])
## [1] 26
pruned.valid.rt.pred <- predict(attendance.pruned, newdata = valid_lm.df)
This tree is not very clear here, but using this link you can download the pdf and zoom in for a clearer picture. Below we can see some the of residuals for a quick glance at how accurate we were.
#look at the first 20 residuals
valid.resid <- valid_lm.df$weekly_attendance - pruned.valid.rt.pred # difference b/w actual and predicted
data.frame("Predicted" = pruned.valid.rt.pred[1:20], "Actual" = valid_lm.df$weekly_attendance[1:20],
"Residual" = valid.resid[1:20])
## Predicted Actual Residual
## 1 62569.467 55049 -7520.46693
## 2 67375.765 72418 5042.23529
## 3 68542.394 77722 9179.60594
## 4 68542.394 65530 -3012.39406
## 5 57275.206 66009 8733.79412
## 6 79615.263 77687 -1928.26316
## 7 60635.647 45653 -14982.64746
## 8 79615.263 77884 -1731.26316
## 9 63508.427 66944 3435.57314
## 10 75480.732 77604 2123.26846
## 11 68542.394 59513 -9029.39406
## 12 62569.467 51769 -10800.46693
## 13 65579.659 68481 2901.34146
## 14 62569.467 65619 3049.53307
## 15 68542.394 72617 4074.60594
## 16 68542.394 68341 -201.39406
## 17 63508.427 64351 842.57314
## 18 56584.041 63406 6821.95918
## 19 60635.647 66944 6308.35254
## 20 75480.732 72192 -3288.73154
We first performed a full linear model that would serve as a baseline for the rest of our regressions.
####### linear model #######
#Full linear Model
attendance.lm <- lm(weekly_attendance ~ ., data = train_lm.df)
options(scipen = 999, digits = 4)
#use predict() to make predictions on a new set
valid.lm.pred <- predict(attendance.lm, # linear model
valid_lm.df) # predict outcomes on validation data
'#calculate the residuals
valid.resid <- valid_lm.df$weekly_attendance - valid.lm.pred # difference b/w actual and predicted
#look at the first 20 residuals
data.frame("Predicted" = valid.lm.pred[1:20], "Actual" = valid_lm.df$weekly_attendance[1:20],
"Residual" = valid.resid[1:20])'
## [1] "#calculate the residuals\nvalid.resid <- valid_lm.df$weekly_attendance - valid.lm.pred # difference b/w actual and predicted\n#look at the first 20 residuals\ndata.frame(\"Predicted\" = valid.lm.pred[1:20], \"Actual\" = valid_lm.df$weekly_attendance[1:20],\n\"Residual\" = valid.resid[1:20])"
Here we can see the accuracy of the full model.
#use accuracy() to compute common accuracy measures
options(digits = 6)
accuracy(valid.lm.pred, # predicted outcomes
valid_lm.df$weekly_attendance)
## ME RMSE MAE MPE MAPE
## Test set 206.618 5745.79 3751.68 -0.69019 6.13995
we next viewed the distribution of the residuals and saw that we had some fairly serious outliers. We feel as though these would not be too detrimental in practice because we would know that the values are off.
#view the distribution of residuals
ggplot(data = as.data.frame(valid.resid), aes(y = valid.resid)) + geom_boxplot()
#full model with all predictors
attendance.lm.null <- lm(weekly_attendance ~ 1, data = train_lm.df) # intercept-only model
Next, we performed foward, backward, and stepwise regressions.
#Forward Selection
attendance.lm.fwd <- step(attendance.lm.null, # initial model
scope = list(attendance.lm.null, upper = attendance.lm), # range of models
direction = "forward") # forward selection
valid.fwd.pred <- predict(attendance.lm.fwd, valid_lm.df)
options(digits = 6)
#Backward Elimination
attendance.lm.back <- step(attendance.lm, # start with full model
direction = "backward") # backward elimination
valid.back.pred <- predict(attendance.lm.back, valid_lm.df)
options(digits = 6)
#Stepwise Regression
attendance.lm.step <- step(attendance.lm.null, # initial model
scope = list(attendance.lm.null, upper = attendance.lm), # range of models
direction = "both") # stepwise regression
valid.step.pred <- predict(attendance.lm.step, valid_lm.df)
options(digits = 6)
We then compared the accuracy of all of these models.
#Prediction Performance
#Regression Tree
accuracy(pruned.valid.rt.pred, valid_lm.df$weekly_attendance)
## ME RMSE MAE MPE MAPE
## Test set -80.928 5313.68 3239.44 -0.896563 5.3456
#Foward
accuracy(valid.fwd.pred, valid_lm.df$weekly_attendance)
## ME RMSE MAE MPE MAPE
## Test set 205.43 5740.86 3750.68 -0.691569 6.13895
#Backward
accuracy(valid.back.pred, valid_lm.df$weekly_attendance)
## ME RMSE MAE MPE MAPE
## Test set 205.43 5740.86 3750.68 -0.691569 6.13895
#Stepwise
accuracy(valid.step.pred, valid_lm.df$weekly_attendance)
## ME RMSE MAE MPE MAPE
## Test set 205.43 5740.86 3750.68 -0.691569 6.13895
# Compare to the original model (Full Model)
accuracy(valid.lm.pred, valid_lm.df$weekly_attendance)
## ME RMSE MAE MPE MAPE
## Test set 206.618 5745.79 3751.68 -0.69019 6.13995
As we can see, the decision tree performed the best out of all of these models. However, even this model did not perform particularly well. Despite the somewhat poor performance, we still feel that this model could help to determine whether attendance will be high or low in general. With the large amount of people at these games, errors even as large as 5,000 may not be extremely detrimental. We believe that this model would most effectively be used to avoid making any extreme mistakes for predicting attendance. It may not be as accurate as we would like, but it can give a broad picture of our expected turnout for each game.
Although our findings were not as conclusive as we would have liked them to be, we do feel that we have made some worthwhile insights on our data. For example, we found that neither offense or defense appears to be more important when determining win percentage. In addition, we identified a few teams whose fans are quite reactive to their performance. It could be quite valuable for those teams to cut down on staffing when expecting to have a poor year. We also found out that teams should not expect to have more or less attendance solely because of the time of day/year.
Finally, we think that our models could be quite effective at simply determining whether a game will be crowded. The model could save teams from taking a big loss by having many staff members or concessions for a game with low attendance. We feel that further analysis with more data would be needed to get more accurate results.