Final Project

NFL Attendance Data

Introduction

Paul Brown Stadium

We were provided with 3 separate datasets with various information about NFL games, attendance, and standings across 20 seasons (from 2000-2019). Using this data, we are interested in trying to find a way to effectively predict attendance numbers for teams in the NFL. In order to do this, we will compare variables such as offensive_ranking, wins, and strenght_of_schedule with the overall attendance for teams with the hope that we will be able to shows relationships between the variables. Our analysis will mainly consist of visualizations to illustrate the relationships between our variables. However, we will also be doing some regression and machine learning to further our analysis. Our hope is that we can help organizations accurately predict attendance so that they can properly prepare their stadiums with concessions, employees, and other necessities.

Packages

library(tidyverse)
library(stats)
library(rpart)
library(forecast)
library(treemap)
library(rpart.plot)

We loaded tidyverse for its many data transformation, visualization, and processing tools such as dplyr. The stats package was loaded for its many regression and statistical tools. rpart, forecast, treemap, and rpart.plot were all loaded to create and visualize a Regression Tree

Our Data

Our original data source was provided through our course website. A link to this dataset’s description and download can be found at this link. The data comes from the pro-football-reference website, and it is used to keep track of data from all current and historic NFL games. The data was collected as over the course of the 20 seasons in our dataset. Every season has 17 weeks, with all 32 teams playing 16 games (8 home, 8 away, 1 bye week).

The original data came in three separate csv files with the following information

The attendance dataset comes with 10,846 observations of 8 variables including information such as weekly attendance and yearly attendance totals (home, away, and total) for each team.
The games dataset has 5,324 observations with 19 variables covering relevant game stats such as turnovers, gained yards, and points scored.
The standings dataset consists of 638 observations of 15 variables including offensive and defensive rankings, strengths of schedule, and each teams simple rating (the teams quality measured by Simple Rating System, which can be calculated using the formula below). \[SRS = MoV + SoS = OSRS + DSRS\]

Missing values were recorded as “NA”, and are cleanly translated in R to the language’s NA value. We found 638 missing values in the attendance dataset, all coming from the weekly_attendance variable. These were missing because the teams were on bye. Additionally, there were 5,314 missing values in the games dataset under the tie variable, which shows that there was no tie in that specific game. The standings dataset contained no missing values, and therefore will require no work for missing data.

Data Prep

Used read_csv function to read in our datasets.

#attendance <- read_csv("attendance.csv")
#games <- read.csv("games.csv")
#standings <- read.csv("standings.csv")

attendance <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/attendance.csv')
standings <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/standings.csv')
games <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/games.csv')

Since most of the data was clean, we did not have much to do. However there were certainly some changes that needed to be made. First, we changed some variable names in the games dataset from home_team_name to team_name and from home_team_city to team in order to match variables in the other datasets for our join.

games_renamed <- rename(games, "team_name" = "home_team_name")

games_renamed <- rename(games_renamed, 'team' = "home_team_city")

Next, we needed to change the data type of the week variables in the games_renamed dataset to numeric. They were originally recorded as strings because they playoff weeks were entered as “WildCard”, “Division”, “ConfChamp”, and “SuperBowl”. In order to deal with this, we removed all of the observations containing these values. We feel that this decision was acceptable because playoff games are not entirely indicative of typical NFL game attendance. Additionally, it does not make complete sense to keep these records since not every team makes the playoffs. We felt that focusing on the regular season would be most important for predicting attendance.

games_renamed <- games_renamed[!games_renamed$week %in% c('WildCard', 'Division', 'ConfChamp', 'SuperBowl'),]

games_renamed$week <- as.numeric(games_renamed$week)

After our data was ready to be merged, we used two left_join lines to combine our datasets.

NFL_join_data <- dplyr::left_join(games_renamed, standings, by = c("year", "team_name", "team"))

NFL_join_data2 <- dplyr::left_join(NFL_join_data, attendance, by = c("year", "team_name", "team", "week"))

After the join, there were still some variable data types that needed to be changed. For example, the time variable, which shows the start of the game was originally a string. We decided to extract the hour out of this variable and convert it to numeric so that we could run future analysis more easily.

NFL_join_data2$time <- format(strptime(NFL_join_data2$time, "%H"),"%H")

NFL_join_data2$time <- as.numeric(NFL_join_data2$time)

Next, We also needed to change the date variable. We decided that the actual date was not as important to know for analysis as the month. Based on this assumption, we decided to drop the day and only record the month that the games took place in.

NFL_join_data2 <- NFL_join_data2 %>%
  separate(date, c("month", "dropday"), sep = " ")

NFL_join_data2 <- subset(NFL_join_data2, select = -c(dropday))

After this, we also wanted to change the tie variable so that it would be more useful to us. Originally, the variable was assigned an NA value if the game did not end in a tie. Instead of having this, we decided to change the NA values to 0 and any other values (representing the game ending in a tie) to 1.

NFL_join_data2$tie[is.na(NFL_join_data2$tie)] <- 0

NFL_join_data2$tie[NFL_join_data2$tie != 0] <- 1

# Changing tie variable to numeric
NFL_join_data2$tie <- as.numeric(NFL_join_data2$tie)

We no longer have any missing values in this dataset with this change. There were some outliers in the simple_rating and offensive_ranking variables, but we feel that these should be kept in because they are not erroneous. The teams that had these rankings were truly dominant when they achieved these numbers. We made some box plots to visually examine how extreme the outliers were, and they are shown below.

ggplot(NFL_join_data2) +
  aes(simple_rating) +
  geom_boxplot() +
  ggtitle("Simple Rating Distribution")

ggplot(NFL_join_data2) +
  aes(offensive_ranking) +
  geom_boxplot() +
  ggtitle("Offensive Ranking Distribution")

As we can see from these plots, these outliers do not seem to be too extreme to keep.

Data Structure

Below is a table to show these general structure of our dataset with what we believe will be some key variables.

knitr::kable(head(NFL_join_data2[c("team_name",
                                   "year",
                                   "total",
                                   "wins",
                                   "playoffs",
                                   "sb_winner",
                                   "strength_of_schedule",
                                   "offensive_ranking")]))

team_name	year	total	wins	playoffs	sb_winner	strength_of_schedule	offensive_ranking
Vikings	2000	1029262	11	Playoffs	No Superbowl	0.3	4.3
Chiefs	2000	1115272	7	No Playoffs	No Superbowl	0.6	0.1
Redskins	2000	1174332	8	No Playoffs	No Superbowl	0.2	-2.9
Falcons	2000	964579	4	No Playoffs	No Superbowl	1.5	-5.7
Steelers	2000	987037	9	No Playoffs	No Superbowl	-0.2	0.6
Browns	2000	1057139	3	No Playoffs	No Superbowl	1.5	-9.1

We also made a table to show summaries of variables that we have deemed important, shown below.

knitr::kable(summary(NFL_join_data2[c("total",
                                      "wins",
                                      "simple_rating",
                                      "defensive_ranking",
                                      "offensive_ranking",
                                      "strength_of_schedule",
                                      "weekly_attendance")]))

total	wins	simple_rating	defensive_ranking	offensive_ranking	strength_of_schedule	weekly_attendance
Min. : 760644	Min. : 0.000	Min. :-17.4	Min. :-9.800000	Min. :-1.17e+01	Min. :-4.600000	Min. : 23127
1st Qu.:1040509	1st Qu.: 6.000	1st Qu.: -4.5	1st Qu.:-2.400000	1st Qu.:-3.20e+00	1st Qu.:-1.100000	1st Qu.: 63246
Median :1081090	Median : 8.000	Median : 0.0	Median : 0.100000	Median : 0.00e+00	Median : 0.000000	Median : 68334
Mean :1080910	Mean : 7.984	Mean : 0.0	Mean :-0.001097	Mean :-1.57e-04	Mean : 0.001097	Mean : 67557
3rd Qu.:1123230	3rd Qu.:10.000	3rd Qu.: 4.5	3rd Qu.: 2.500000	3rd Qu.: 2.70e+00	3rd Qu.: 1.200000	3rd Qu.: 72545
Max. :1322087	Max. :16.000	Max. : 20.1	Max. : 9.800000	Max. : 1.59e+01	Max. : 4.300000	Max. :105121

EDA

Creating New Variables

We first realized that it would be easier to compare the teams and their attendance numbers if we made them into percentages of stadium capacity. After research about the capacity of each stadium, we found that each team actually exceeded the capacity numbers we found online at some points. This is because many stadiums can squeeze more fans than available seats at events. With this information in mind, we simply divided the attendance for each game by the maximum attendance for that home team to determine percent attendance. We also found the seasonal attendance maximum using a similar technique.

NFL_max_week <-NFL_join_data2 %>%
group_by(home_team) %>%
mutate(weekly_attendance_max=max(weekly_attendance)) %>%
mutate(percentage_attendance = weekly_attendance/weekly_attendance_max*100)


NFL_max_week <- NFL_max_week %>% 
  group_by(home_team, year) %>% 
  mutate(weekly_attendance_max_season = max(weekly_attendance)) %>% 
  mutate(percentage_attendance_season_max = weekly_attendance/weekly_attendance_max_season*100)

Analyzing Game Count

We next found the count of home and away games for each team to see what teams had played the most / least. The number of home and away games for each team can be seen below.

ID_NFL_max_week <- NFL_max_week
ID_NFL_max_week$week <- ID_NFL_max_week$week %>% as.character(.)
ID_NFL_max_week$year <- ID_NFL_max_week$year %>% as.character(.)



ID_NFL_max_week <- rowid_to_column(NFL_max_week, "ID")
ID_NFL_max_week %>% mutate(ID = 1:n()) %>% summarise(`Home` = home_team ,`Away` = away_team, ID) %>%
pivot_longer(-c(ID,year),names_to="team",values_to="Team Name") -> all_games

all_games <- all_games[!(all_games$team=="home_team"),]

all_games %>% group_by(`Team Name`) %>% ggplot() +
geom_bar(aes(x=reorder(`Team Name`,`Team Name`,function(x)+length(x)), fill=factor(year), alpha=team), width=.8) +
ggtitle("Games played from 2000-2019") +
xlab("Team") +
ylab("Count") +
labs(fill="Year", alpha="Location") +
scale_alpha_discrete(range = c(0.8, 0.9)) +
coord_flip()

This shows the total amount of home and away games played by each team. Both the Rams and the Chargers moved to Los Angeles during this span, so that is why those teams have fewer games played. In addition, the Houston Texans were not in the league until 2002, giving the a lower game total.

Analyzing Attendance by Week

We wanted to see if any weeks had less attendance than others, particularly later weeks when the weather is worse. In order to find this, we plotted the average attendance percentage by week seen below.

NFL_max_week %>% 
  group_by(week) %>% 
  summarize(average_p = mean(percentage_attendance)) %>%
  
ggplot(aes(week, average_p)) + 
  geom_smooth() + 
  geom_jitter() +
  ggtitle("Average Percentage Attendance by Week") +
  labs(x = "Week",
       y = "Average Attendance Percentage")

Although we see a slight dip in the later weeks, this is not a significant amount of attendance lost. However, it is interesting to note how the attendance jumps back to being quite high on average for the last game of the year.

Analyzing Attendance Percentage Across Years

We also wanted to see if there was more or less average attendance across the years, so we made a line graph to show how average attendance percentage changed from season to season.

NFL_max_week %>%
  group_by(year) %>% 
  summarise(mean_attendance_percentage = mean(percentage_attendance)) %>% 

ggplot() +
  geom_line(aes(year, mean_attendance_percentage)) +
  ggtitle("Average Attendance Percentage Across Years") +
  labs(x = "Year",
       y = "Average Attendance Percentage")

As we saw with the week graph, the changes in attendance by year have not been significant. It is interesting to note that mean attendance percentage has been on a decline.

Analyzing the Effect of Win Percentage on Attendance

Next, we thought that the amount of wins that a team had in a given season may be a great indicator of how many people attend their games. The graphs for each team’s win percentage and mean attendance percentage by year is shown below.

NFL_max_week %>% 
  group_by(home_team, year) %>% 
  summarise(yearly_mean_attendance = mean(percentage_attendance),
            win_pct = wins / 16) %>% 

ggplot() +
  geom_line(aes(year, yearly_mean_attendance / 100, color = "Mean Percentage Attendance")) +
  geom_line(aes(year, win_pct, color = "Win Percentage")) +
  ggtitle("Average Yearly Attendance Compared to Win Percentage") + 
  labs(x = "Year",
     y = "Percent",
     color = "Legend") +
  facet_wrap(~ home_team)

This graph does show some clear trends with some teams, but it is not as close as we would have liked. Indeed, fans of teams such as the Bengals, Jaguars, and Bills definitely seem to respond to their team’s performance, but other teams such as the Giants and Jets have consistent attendance not matter the record.

Standardizing Variables

In order to do similar analysis on attendance using other variables, we decided to use standardized scores that would provide a better comparison of values.

standardize <- function(variable) {
  (variable - min(variable)) / (max(variable) - min(variable))
}

NFL_max_week <- NFL_max_week %>% 
  group_by(year) %>% 
  mutate(standard_points_for = standardize(points_for))

NFL_max_week <- NFL_max_week %>% 
  group_by(year) %>% 
  mutate(standard_points_against = standardize(points_against))

Using These Standardized Variables

We next wanted to determine how these variables relate to attendance, and these relationships can be seen in the graphs below.

NFL_max_week %>% 
  group_by(home_team, year) %>% 
  summarise(standard_points_for = standard_points_for,
            yearly_mean_attendance = mean(percentage_attendance),
            win_pct = wins / 16) %>% 
  
ggplot() +
  geom_line(aes(year, yearly_mean_attendance / 100, color = "Mean Percentage Attendance")) +
  geom_line(aes(year, standard_points_for, color = "Standardized Points Scored")) +
  ggtitle("Average Yearly Attendance Compared to Points Scored") + 
  labs(x = "Year",
     y = "Percent",
     color = "Legend") +
  facet_wrap(~ home_team)

NFL_max_week %>% 
  group_by(home_team, year) %>%  
  summarise(yearly_mean_attendance = mean(percentage_attendance),
            standard_points_against = standard_points_against) %>% 
  
ggplot() +
  geom_line(aes(year, yearly_mean_attendance / 100, color = "Mean Percentage Attendance")) +
  geom_line(aes(year, 1 - standard_points_against, color = "Standardized Points Allowed")) +
  ggtitle("Average Yearly Attendance Compared to Points Allowed") + 
  labs(x = "Year",
     y = "Percent",
     color = "Legend") +
  facet_wrap(~ home_team)

In the points standard_points_allowed graph, we use 1 - standard_points_allowed to give teams who allowed less points a higher score. We feel that this is important to show the relationship more accurately. Both this graph and the previous one show a similar story. Indeed, both the plots show the same relationship that win percentage had with average attendance percentage; some teams seem to be affected by it and other do not.

Comparing Standardized Scores to Win Percentage

These next two graphs show the effect of points allowed and points scored on the win percentage. We thought it would be important to see how closely these were related because of how similarly each of these variables related to attendance.

NFL_max_week %>% 
  group_by(home_team, year) %>%  
  summarise(win_pct = wins / 16,
            standard_points_against = standard_points_against) %>% 
  
ggplot() +
  geom_line(aes(year, win_pct, color = "Win Percentage")) +
  geom_line(aes(year, 1 - standard_points_against, color = "Standardized Points Allowed")) +
  ggtitle("Average Win Percentage Compared to Points Allowed") + 
  labs(x = "Year",
     y = "Percent",
     color = "Legend") +
  facet_wrap(~ home_team)

NFL_max_week %>% 
  group_by(home_team, year) %>%  
  summarise(win_pct = wins / 16,
            standard_points_for = standard_points_for) %>% 
  
ggplot() +
  geom_line(aes(year, win_pct, color = "Win Percentage")) +
  geom_line(aes(year, standard_points_for, color = "Standardized Points Scored")) +
  ggtitle("Win Percentage Compared to Points Scored") + 
  labs(x = "Year",
     y = "Percent",
     color = "Legend") +
  facet_wrap(~ home_team)

These graphs show that both points allowed and points scored and extremely closely related to win percentage. We should, of course, expect this because teams who score more and allow less points are more likely to win. Again, it is important to note that we are using the points allowed on a scale where a team who allowed less points is granted a higher score.

Time of Day Compared to Attendance

We also wanted to see if games played at a certain time of day had more attendance than other games. Our findings are shown below.

NFL_max_week %>% 
  group_by(home_team, time) %>% 
  summarise(mean_attendance = mean(percentage_attendance)) %>% 
  
ggplot() +
  geom_col(aes(x = time, y = mean_attendance, fill = time)) +
  facet_wrap(~ home_team)

As we can see from this graph, time of day does not appear to have a significant affect on the attendance for a game.

Comparing Week to Attendance

Although we have already seen one graph comparing week to attendance, we wanted to see if this relationship was different for certain teams. Our thinking was that teams in colder parts of the country may see more of a drop in attendance as the season progressed.

NFL_max_week %>% 
  group_by(home_team, week) %>% 
  summarise(mean_attendance = mean(percentage_attendance)) %>% 
  
ggplot() +
  geom_col(aes(x = week, y = mean_attendance, fill = week)) +
  facet_wrap(~ home_team)

As we can see from this graph, the week in which a game is played does not seem to strongly affect the attendance for that game.

Machine Learning

Decision Tree

We next decided to make a decision tree to see in hopes to find a model that accurately predicts attendance. To perform this analysis, we first got rid of many variables that would not have been useful to us.

NFL_max_week_LM <- NFL_max_week[,-c(5:6,10:25,27,30:34,36:41)]

We then partitioned the data to create our training and test datasets.

# partition data
set.seed(8)
train_lm.index <- sample(rownames(NFL_max_week_LM), nrow(NFL_max_week_LM) * 0.7)
train_lm.df <- NFL_max_week_LM[train_lm.index, ]
valid_lm.index <- setdiff(rownames(NFL_max_week_LM), train_lm.index)
valid_lm.df <- NFL_max_week_LM[valid_lm.index, ]

We loaded in these following packages to perform the decision tree analysis.

#####Regression Tree#####

We then found the best pruned tree.

#best pruned tree
cv.attendance.tree <- rpart(weekly_attendance ~ .,
data = train_lm.df,
cp = 0,
minsplit = 2,
xval = 10)
options(scipen = 999, digits = 8)


minerror <- min(cv.attendance.tree$cptable[ ,4 ]) # find minimum xerror
minerrorstd <- cv.attendance.tree$cptable[cv.attendance.tree$cptable[,4] == minerror, 5] # and its corresponding xstd

#get list of trees where xerror is less than minerror + minerrorstd
simplertrees <- cv.attendance.tree$cptable[cv.attendance.tree$cptable[,4] < minerror + minerrorstd, ]

We then generated the decision tree using the best cp found above.

#use the cp from the simplest of those trees
bestcp <- simplertrees[1, 1]
attendance.pruned <- prune(cv.attendance.tree, cp = bestcp)
prp(attendance.pruned, type = 1, extra = 1, varlen = -10, digits = -3,
box.col = ifelse(attendance.pruned$frame$var == "<leaf>", 'gray', 'white'))

length(attendance.pruned$frame$var[attendance.pruned$frame$var == "<leaf>"])

## [1] 26

pruned.valid.rt.pred <- predict(attendance.pruned, newdata = valid_lm.df)

This tree is not very clear here, but using this link you can download the pdf and zoom in for a clearer picture. Below we can see some the of residuals for a quick glance at how accurate we were.

#look at the first 20 residuals
valid.resid <- valid_lm.df$weekly_attendance - pruned.valid.rt.pred # difference b/w actual and predicted
data.frame("Predicted" = pruned.valid.rt.pred[1:20], "Actual" = valid_lm.df$weekly_attendance[1:20],
"Residual" = valid.resid[1:20])

##    Predicted Actual     Residual
## 1  62569.467  55049  -7520.46693
## 2  67375.765  72418   5042.23529
## 3  68542.394  77722   9179.60594
## 4  68542.394  65530  -3012.39406
## 5  57275.206  66009   8733.79412
## 6  79615.263  77687  -1928.26316
## 7  60635.647  45653 -14982.64746
## 8  79615.263  77884  -1731.26316
## 9  63508.427  66944   3435.57314
## 10 75480.732  77604   2123.26846
## 11 68542.394  59513  -9029.39406
## 12 62569.467  51769 -10800.46693
## 13 65579.659  68481   2901.34146
## 14 62569.467  65619   3049.53307
## 15 68542.394  72617   4074.60594
## 16 68542.394  68341   -201.39406
## 17 63508.427  64351    842.57314
## 18 56584.041  63406   6821.95918
## 19 60635.647  66944   6308.35254
## 20 75480.732  72192  -3288.73154

Linear Models

We first performed a full linear model that would serve as a baseline for the rest of our regressions.

####### linear model #######

#Full linear Model
attendance.lm <- lm(weekly_attendance ~ ., data = train_lm.df)
options(scipen = 999, digits = 4)

#use predict() to make predictions on a new set
valid.lm.pred <- predict(attendance.lm, # linear model
valid_lm.df) # predict outcomes on validation data
'#calculate the residuals
valid.resid <- valid_lm.df$weekly_attendance - valid.lm.pred # difference b/w actual and predicted
#look at the first 20 residuals
data.frame("Predicted" = valid.lm.pred[1:20], "Actual" = valid_lm.df$weekly_attendance[1:20],
"Residual" = valid.resid[1:20])'

## [1] "#calculate the residuals\nvalid.resid <- valid_lm.df$weekly_attendance - valid.lm.pred # difference b/w actual and predicted\n#look at the first 20 residuals\ndata.frame(\"Predicted\" = valid.lm.pred[1:20], \"Actual\" = valid_lm.df$weekly_attendance[1:20],\n\"Residual\" = valid.resid[1:20])"

Here we can see the accuracy of the full model.

#use accuracy() to compute common accuracy measures
options(digits = 6)
accuracy(valid.lm.pred, # predicted outcomes
valid_lm.df$weekly_attendance)

##               ME    RMSE     MAE      MPE    MAPE
## Test set 206.618 5745.79 3751.68 -0.69019 6.13995

we next viewed the distribution of the residuals and saw that we had some fairly serious outliers. We feel as though these would not be too detrimental in practice because we would know that the values are off.

#view the distribution of residuals
ggplot(data = as.data.frame(valid.resid), aes(y = valid.resid)) + geom_boxplot()

#full model with all predictors
attendance.lm.null <- lm(weekly_attendance ~ 1, data = train_lm.df) # intercept-only model

Next, we performed foward, backward, and stepwise regressions.

#Forward Selection
attendance.lm.fwd <- step(attendance.lm.null, # initial model
scope = list(attendance.lm.null, upper = attendance.lm), # range of models
direction = "forward") # forward selection



valid.fwd.pred <- predict(attendance.lm.fwd, valid_lm.df)
options(digits = 6)

#Backward Elimination
attendance.lm.back <- step(attendance.lm, # start with full model
direction = "backward") # backward elimination



valid.back.pred <- predict(attendance.lm.back, valid_lm.df)
options(digits = 6)

#Stepwise Regression
attendance.lm.step <- step(attendance.lm.null, # initial model
scope = list(attendance.lm.null, upper = attendance.lm), # range of models
direction = "both") # stepwise regression



valid.step.pred <- predict(attendance.lm.step, valid_lm.df)
options(digits = 6)

Accuracy

We then compared the accuracy of all of these models.

#Prediction Performance
#Regression Tree
accuracy(pruned.valid.rt.pred, valid_lm.df$weekly_attendance)

##               ME    RMSE     MAE       MPE   MAPE
## Test set -80.928 5313.68 3239.44 -0.896563 5.3456

#Foward
accuracy(valid.fwd.pred, valid_lm.df$weekly_attendance)

##              ME    RMSE     MAE       MPE    MAPE
## Test set 205.43 5740.86 3750.68 -0.691569 6.13895

#Backward
accuracy(valid.back.pred, valid_lm.df$weekly_attendance)

##              ME    RMSE     MAE       MPE    MAPE
## Test set 205.43 5740.86 3750.68 -0.691569 6.13895

#Stepwise
accuracy(valid.step.pred, valid_lm.df$weekly_attendance)

##              ME    RMSE     MAE       MPE    MAPE
## Test set 205.43 5740.86 3750.68 -0.691569 6.13895

# Compare to the original model (Full Model)
accuracy(valid.lm.pred, valid_lm.df$weekly_attendance)

##               ME    RMSE     MAE      MPE    MAPE
## Test set 206.618 5745.79 3751.68 -0.69019 6.13995

As we can see, the decision tree performed the best out of all of these models. However, even this model did not perform particularly well. Despite the somewhat poor performance, we still feel that this model could help to determine whether attendance will be high or low in general. With the large amount of people at these games, errors even as large as 5,000 may not be extremely detrimental. We believe that this model would most effectively be used to avoid making any extreme mistakes for predicting attendance. It may not be as accurate as we would like, but it can give a broad picture of our expected turnout for each game.

Conclusion

Findings

Although our findings were not as conclusive as we would have liked them to be, we do feel that we have made some worthwhile insights on our data. For example, we found that neither offense or defense appears to be more important when determining win percentage. In addition, we identified a few teams whose fans are quite reactive to their performance. It could be quite valuable for those teams to cut down on staffing when expecting to have a poor year. We also found out that teams should not expect to have more or less attendance solely because of the time of day/year.

Finally, we think that our models could be quite effective at simply determining whether a game will be crowded. The model could save teams from taking a big loss by having many staff members or concessions for a game with low attendance. We feel that further analysis with more data would be needed to get more accurate results.