Problem Statement

Problem Statement Summary

The vendor management company VMC contracts with sports stadiums in order to staff the meal carts at the stadiums of the Washington Redskins, Cincinnati Bengals, and Tampa Bay Buccaneers. VMC is seeking to have a more flexible model for resource planning and would like to understand the variability in attendance based on team results from the prior week. To this end the analytical team reviewed the previous two years of attendance data in order to determine how the actual attendance of home games fluctuate based on a win or loss during the duration of the season.

Assumptions

  • Workers are scheduled on a weekly basis.
  • More staff are necessary when attendance increases, and fewer are required when it decreases.
  • Ticket sales are not a good estimate of attendance potentially due to scalpers, infrequently attending season ticket - holders, and general scheduling conflict
  • People are more likely to attend during winning seasons, and less likely when losing.
  • Fans are more likely to attend when their team wins, and when games are close by.

Proposed Approach

  • Exploratory data analysis with visualizations to compare the results
  • Differing plots to explore the data. Either histogram, scatterplot, or line graph
  • Review weekly attendance as compared to a previous win or loss and observe if trends are present.

Purpose of the Analysis

The purpose is to determine whether or not the outcome of a game (win or a loss) can predict what the attendance may look like any given week based on the previous week’s performance. This will allow the management company to better predict how many people will attend on a weekly basis. It could also be expanded to predict based on the month, or yearly attendance. By determining whether or not a previous game was a win, a loss, an away, or a home game we can better determine what each variable’s impact is on attendance. This will better refine our prediction of the next game’s attendance.

Data Understanding

Data Source

The attendance and standings data was sourced from the Pro Football Reference source at the following link: NFL Attendance

The source data for this project consists of three tables:

  1. attendance.csv
  2. games.csv
  3. standings.csv

The tables provide selected NFL statistics from the 2000 through 2019 football seasons. Each of the important variables are listed below, and the highlighted cells are the same throughout each table.

Data Dictionary

Attendance Data - attendance.csv

The Attendance data set contains the weekly attendance information for a team for the years 2000 to 2019. - The “home” column is the total attendance at home games for that team and will be used to differentiate between the games we care about for this analysis (home) and those which are irrelevant (away). - “Total” is the total overall attendance for that team in that year’s season. - “Weekly_attendance” is the total attendance for that particular week and is the primary variable. This table will be tied to the other tables for analysis. Since the analysis is focusing on Vendor management for home games only, the “home” column will be the variable that will provide the input for attendance.

Variable Class Description
team character Team City
team_name character Team name
year integer Season year
total double Total attendance across 17 week (1 week = no game)
home double Home attendance
away double Away attendance
week character Week number (1-17)
weekly_attendance double Weekly attendance number

Standings Data - standings.csv

The Standings data set contains the Win/loss, points scored, rankings for each team for the season year from 2000-2019. Since the standings table has the win and loss record, that will be used to determine which games were wins, and whether or not that had a consistent and noticeable attendance increase for following weeks.

Variable Class Description
team character Team city
team_name character Team name
year integer Season year
wins double Wins (0 to 16)
loss double Losses (0 to 16)
points_for double Points for (offensive performance)
points_against double Points for (defensive performance)
points_differential double Point differential (points_for - points_against)
margin_of_victory double (Points Scored - Points Allowed)/Games Played
strength_of_schedule double Average quality of opponent as measured by SRS (Simple Rating System)
Simple_rating double Team quality relative to average (0.0) as measured by SRS (Simple Rating System) SRS = MoV + SoS = OSRS + DSRS
offensive_ranking double Team offense quality relative to average (0.0) as measure by SRS (Simple Rating System)
defensive_ranking double Team defense quality relative to average (0.0) as measured by SRS (Simple Rating System)
playoffs character Made playoffs or not
sb_winner character Won superbowl or not

Games Data - games.csv

The Games data set contains details about each game. For this study, the only variables of interest within this set are pts_loss and pts_win. This will be used to establish whether or not the actual “excitement” of the game is a factor, or simply whether a team won or lost is a stronger influence.

Variable Class Description
year integer Season year, note that playoff will still be in the previous season
week character Week number (1-17, plus playoffs)
home_team character Home team
away_team character Away team
winner character Winning team
tie character If a tie, the “losing” team as well
day character Day of week
date character Date minus year
time character Time of game start
pts_win double Points by winning team
pts_loss double Points by losing team
yds_win double Yards by winning team
turnovers_win double Turnovers by winning team
yds_loss double Yards by losing team
turnovers_loss double Turnovers by losing team
home_team_name character Home team name
home_team_city character Home team city
away_team_name character Away team name
away_team_city character Away team city

R Packages and Data Subsets

R Packages

Several packages were used for cleaning, manipulating, and visualization.

Those packages include:

  • Tidyr - Data Cleaning
  • Dplyr - Data Manipulating
  • Readr - Data Manipulation
  • Ggplot2 - Data Graphing and Plotting
library(dplyr)
library(tidyr)
library(readr)
library(ggplot2)

Data Subsetting

Data subsetting and new variable requirements for this analysis included the following:

  1. Subset across all three starting tables (attendance, games, and standings) for:
    1. only Cincinnati, Tampa Bay, & Washington teams,
    2. only 2018 and 2019 seasons,
    3. excluding the “bye” week rows for the subsetted attendance table,
    4. excluding post-season games records from the games table (games$week column contains strings: “ConfChamp”, “Division”, “SuperBowl”, or “WildCard”), and
    5. limiting table column variables to those specified by the project Domain III document
  2. Change the game$week column variable to numeric (allows gamesweek column to be used as a common key during the next step’s join procedure),
  3. Subset and join step 1 attendance, games, and standings data to create separate team tables: Cincinnati_join, Tampa_Bay_join, and Washington_join
  4. Add new calculated column variables, required for the graphic analysis phase, to the team tables from the previous step. The resulting tables are then ready for individual team graphic analysis. New variables created include:
    1. home_away - coded as: H = home, A = away
    2. wl_diff - the points difference for that game, calculated as (pts_win - pts_loss), converted to negative values for losing games
    3. win_lose - coded as: W = win, L = loss, for that record’s game
    4. home_start_diff - the attendance difference between the starting game of the season and the current game attendance
    5. wonprev - indicates whether team won the previous game played; coded as: Y = won previous game, N = lost previous game
    6. gameswon - the total number of games won by that team, prior to that week’s game
    7. gamesplayed - the total number of prior games completed that season
    8. winlossratio - calculated as: gameswon/gamesplayed
  5. Join the team tables from the step above to create the table all_teams_join. This final table allows for “faceted” graphics creation using the ggplot2 package.
# script to import nfl data and create subsets for analysis
# worst attended teams analysis

# load libraries

library(dplyr)
library(readr)

# The packages used for this script are: dplyr, and readr.

# The dplyr package provides a set of tools for efficiently manipulating data sets
# in R. The package uses a consistent set of verbs to help solve such challenges as:
# 1) adding new variables that are functions of existing variables,
# 2) pick variables based on their names
# 3) pick cases based on their values
# 4) reduce multiple values down to a single summary, and
# 5) change the ordering of rows

# The readr package provides a fast and friendly way to read rectangular data 
# (like csv, tsv, and fwf). It is designed to flexibly parse many types of data,
# while still cleanly failing when dat unexpectedly changes.

# load the base project tables; requires that files are located
# in your current working directory of RStudio

attendance <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/attendance.csv')
standings <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/standings.csv')
games <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/games.csv')

# create subsets of each table that: 
# 1) is only Cincinnati, Tampa Bay, & Washington teams
# 2) is only 2018 and 2019 seasons
# 3) excludes bye week for the attendance table
# 4) limits columns to those specified by Domain III document
# 4) is identified by ending "_prj_team"

attend_prj_teams <- attendance %>%
  filter(team_name %in% c("Bengals", "Buccaneers", "Redskins")) %>%
  filter(year %in% c(2018, 2019)) %>%
  filter(!is.na(weekly_attendance)) %>%
  select(team, team_name, year, week, total, weekly_attendance)

games_prj_teams <- games %>%
  group_by(week) %>%
  filter(year %in% c(2018, 2019)) %>%
  filter(home_team %in% c("Cincinnati Bengals", "Tampa Bay Buccaneers", "Washington Redskins") | away_team %in% c("Cincinnati Bengals", "Tampa Bay Buccaneers", "Washington Redskins")) %>%
  select(year, week, home_team, away_team, winner, pts_win, pts_loss)

standings_prj_teams <- standings %>%
  group_by(year) %>%
  filter(team_name %in% c("Bengals", "Buccaneers", "Redskins")) %>%
  filter(year %in% c(2018, 2019)) %>%
  select(team, team_name, year, wins, loss, points_for, points_against, points_differential, playoffs)

# lets add column to attendance table identifying home/away (H/A)

attend_prj_teams <- attend_prj_teams %>%
  mutate(home_away = NA)

# to add data to newly made home_away column:
# 1) create subset of attend_prj_teams specific to each team
# 2) create subset of games_prj_teams specific to each team
# 3) join city attend with city games, by year & week
# 4) use joined table data to write to home_away column for each team
# KEY: joined tables become base tables for graph analyses

Cincinnati_att_prj_teams <- attend_prj_teams %>%
  filter(team == "Cincinnati")

Tampa_Bay_att_prj_teams <- attend_prj_teams %>%
  filter(team == "Tampa Bay")

Washington_att_prj_teams <- attend_prj_teams %>%
  filter(team == "Washington")

Cincinnati_games_prj_teams <- games_prj_teams %>%
  filter(home_team =="Cincinnati Bengals" | away_team == "Cincinnati Bengals")

Tampa_Bay_games_prj_teams <- games_prj_teams %>%
  filter(home_team =="Tampa Bay Buccaneers" | away_team == "Tampa Bay Buccaneers")

Washington_games_prj_teams <- games_prj_teams %>%
  filter(home_team =="Washington Redskins" | away_team == "Washington Redskins")

# can't do join yet...
# attend_prj_teams$week is numeric games_prj_teams$week is character
# games table includes playoff weeks as character strings:
# "ConfChamp", "Division", "SuperBowl", "WildCard"

# build a "$notin%" command to use with the dplyr filter() function

'%notin%' <- Negate('%in%')

# resubset games table to exclude playoff weeks data; use new function

games_prj_teams <- games %>%
  group_by(week) %>%
  filter(year %in% c(2018, 2019)) %>%
  filter(home_team %in% c("Cincinnati Bengals", "Tampa Bay Buccaneers", "Washington Redskins") | away_team %in% c("Cincinnati Bengals", "Tampa Bay Buccaneers", "Washington Redskins")) %>%
  filter(week %notin% c("ConfChamp", "Division", "SuperBowl", "WildCard")) %>%
  select(year, week, home_team, away_team, winner, pts_win, pts_loss)

# change data type for games_prj_teams$week to numeric

games_prj_teams$week <- as.numeric(games_prj_teams$week)

# recreate the teams subsets of this table

Cincinnati_games_prj_teams <- games_prj_teams %>%
  filter(home_team =="Cincinnati Bengals" | away_team == "Cincinnati Bengals")

Tampa_Bay_games_prj_teams <- games_prj_teams %>%
  filter(home_team =="Tampa Bay Buccaneers" | away_team == "Tampa Bay Buccaneers")

Washington_games_prj_teams <- games_prj_teams %>%
  filter(home_team =="Washington Redskins" | away_team == "Washington Redskins")

# join the Cincinnati tables

Cincinnati_join <- left_join(Cincinnati_att_prj_teams, Cincinnati_games_prj_teams, by = c("year", "week"))

# fill home_away column: H = home, A = away

for (i in 1:32) {
  if (Cincinnati_join$home_team[i] == "Cincinnati Bengals") {
    Cincinnati_join$home_away[i] = "H"
  } else {
    Cincinnati_join$home_away[i] = "A"
  }
}

# join Tampa Bay tables

Tampa_Bay_join <- left_join(Tampa_Bay_att_prj_teams, Tampa_Bay_games_prj_teams, by = c("year", "week"))

# fill home_away column: H = home, A = away

for (i in 1:32) {
  if (Tampa_Bay_join$home_team[i] == "Tampa Bay Buccaneers") {
    Tampa_Bay_join$home_away[i] = "H"
  } else {
    Tampa_Bay_join$home_away[i] = "A"
  }
}

# join Washington tables

Washington_join <- left_join(Washington_att_prj_teams, Washington_games_prj_teams, by = c("year", "week"))

# fill home_away column: H = home, A = away

for (i in 1:32) {
  if (Washington_join$home_team[i] == "Washington Redskins") {
    Washington_join$home_away[i] = "H"
  } else {
    Washington_join$home_away[i] = "A"
  }
}

# create a win/loss difference column titled "wl_diff"
# value created as (pts_win - pts_loss)

Cincinnati_join <- Cincinnati_join %>%
  mutate(wl_diff = pts_win - pts_loss)

Tampa_Bay_join <- Tampa_Bay_join %>%
  mutate(wl_diff = pts_win - pts_loss)

Washington_join <- Washington_join %>%
  mutate(wl_diff = pts_win - pts_loss)

# change sign of wl_diff if team lost

for (i in 1:32) {
  if (Cincinnati_join$winner[i] != "Cincinnati Bengals") {
    Cincinnati_join$wl_diff[i] = -1 * Cincinnati_join$wl_diff[i]
  }
  if (Tampa_Bay_join$winner[i] != "Tampa Bay Buccaneers") {
    Tampa_Bay_join$wl_diff[i] = -1 * Tampa_Bay_join$wl_diff[i]
  }
  if (Washington_join$winner[i] != "Washington Redskins") {
    Washington_join$wl_diff[i] = -1 * Washington_join$wl_diff[i]
  }
}

# inspect for whether any of these teams made the playoff in 2018, 2019:

tmp <- standings %>%
  filter(team_name %in% c("Bengals", "Buccaneers", "Redskins")) %>%
  filter(year %in% c(2018, 2019)) %>%
  select(team, team_name, year, playoffs)

# Result: analyzing by playoff is moot: none made it to playoffs

# add a categorical column for win_lose
Cincinnati_join <- Cincinnati_join %>%
  mutate(win_lose = NA)

Tampa_Bay_join <- Tampa_Bay_join %>%
  mutate(win_lose = NA)

Washington_join <- Washington_join %>%
  mutate(win_lose = NA)


# add in the "W" or "L" factor for the week
for (i in 1:32) {
  if (Cincinnati_join$winner[i] == "Cincinnati Bengals") {
    Cincinnati_join$win_lose[i] = "W"
  } else {
    Cincinnati_join$win_lose[i] = "L"
  }
  if (Tampa_Bay_join$winner[i] == "Tampa Bay Buccaneers") {
    Tampa_Bay_join$win_lose[i] = "W"
  } else {
    Tampa_Bay_join$win_lose[i] = "L"
  }
  if (Washington_join$winner[i] == "Washington Redskins") {
    Washington_join$win_lose[i] = "W"
  } else {
    Washington_join$win_lose[i] = "L"
  }
}

# Adding column with each weekly attendance as the difference
# from the opening home game attendance value

# Opening home game attendances:
# 2018: Cincinnati = 50018, Tampa Bay = 56552, Washington = 57013
# 2019: Cincinnati = 50666, Tampa Bay = 55976, Washington = 75128

# Create new column in each of the individual team tables.
# Then post values to tables, followed by join to all_teams_join

Cincinnati_join <- Cincinnati_join %>%
  mutate(home_start_diff = NA)

Tampa_Bay_join <- Tampa_Bay_join %>%
  mutate(home_start_diff = NA)

Washington_join <- Washington_join %>%
  mutate(home_start_diff = NA)

for (i in 1:32) {
  if (Cincinnati_join$year[i] == "2018") {
    Cincinnati_join$home_start_diff[i] = Cincinnati_join$weekly_attendance[i] - 50018
  } else {
    Cincinnati_join$home_start_diff[i] = Cincinnati_join$weekly_attendance[i] - 50666
  }
  if (Tampa_Bay_join$year[i] == "2018") {
    Tampa_Bay_join$home_start_diff[i] = Tampa_Bay_join$weekly_attendance[i] - 56552
  } else {
    Tampa_Bay_join$home_start_diff[i] = Tampa_Bay_join$weekly_attendance[i] - 55976
  }
  if (Washington_join$year[i] == "2018") {
    Washington_join$home_start_diff[i] = Washington_join$weekly_attendance[i] - 57013
  } else {
    Washington_join$home_start_diff[i] = Washington_join$weekly_attendance[i] - 75128
  }
}

#--------------------------------------
# add columns used by Claudia for her graphs; not including
# her "win" column though; adjusting her script to use the 
# existing "win_lose" field.
# Also, adjusted script to update all three team tables
# at the same time

Cincinnati_join <- Cincinnati_join %>% 
  mutate(wonprev = NA, gameswon = 0, gamesplayed = 0, winlossratio = NA)

Tampa_Bay_join <- Tampa_Bay_join %>% 
  mutate(wonprev = NA, gameswon = 0, gamesplayed = 0, winlossratio = NA)

Washington_join <- Washington_join %>% 
  mutate(wonprev = NA, gameswon = 0, gamesplayed = 0, winlossratio = NA)

for (i in 2:16) {
  if (Cincinnati_join$win_lose[i-1] == "W"){
    Cincinnati_join$wonprev[i] <- "Y"
  } else {
    Cincinnati_join$wonprev[i] <- "N"
  }
  if (Tampa_Bay_join$win_lose[i-1] == "W"){
    Tampa_Bay_join$wonprev[i] <- "Y"
  } else {
    Tampa_Bay_join$wonprev[i] <- "N"
  }
  if (Washington_join$win_lose[i-1] == "W"){
    Washington_join$wonprev[i] <- "Y"
  } else {
    Washington_join$wonprev[i] <- "N"
  }
}
for (i in 18:32) {
  if (Cincinnati_join$win_lose[i-1] == "W"){
    Cincinnati_join$wonprev[i] <- "Y"
  } else {
    Cincinnati_join$wonprev[i] <- "N"
  }
  if (Tampa_Bay_join$win_lose[i-1] == "W"){
    Tampa_Bay_join$wonprev[i] <- "Y"
  } else {
    Tampa_Bay_join$wonprev[i] <- "N"
  }
  if (Washington_join$win_lose[i-1] == "W"){
    Washington_join$wonprev[i] <- "Y"
  } else {
    Washington_join$wonprev[i] <- "N"
  }
}

for (i in 1:16) {
  Cincinnati_join$gamesplayed[i] <- i-1
  Tampa_Bay_join$gamesplayed[i] <- i-1
  Washington_join$gamesplayed[i] <- i-1
}
for (i in 17:32) {
  Cincinnati_join$gamesplayed[i] <- i - 17
  Tampa_Bay_join$gamesplayed[i] <- i - 17
  Washington_join$gamesplayed[i] <- i - 17
}

for (i in 2:16) {
  if (Cincinnati_join$win_lose[i-1] == "W"){
    Cincinnati_join$gameswon[i] <- Cincinnati_join$gameswon[i-1] + 1
  } else {
    Cincinnati_join$gameswon[i] <- Cincinnati_join$gameswon[i-1]
  }
  if (Tampa_Bay_join$win_lose[i-1] == "W"){
    Tampa_Bay_join$gameswon[i] <- Tampa_Bay_join$gameswon[i-1] + 1
  } else {
    Tampa_Bay_join$gameswon[i] <- Tampa_Bay_join$gameswon[i-1]
  }
  if (Washington_join$win_lose[i-1] == "W"){
    Washington_join$gameswon[i] <- Washington_join$gameswon[i-1] + 1
  } else {
    Washington_join$gameswon[i] <- Washington_join$gameswon[i-1]
  }
}
for (i in 18:32) {
  if (Cincinnati_join$win_lose[i-1] == "W"){
    Cincinnati_join$gameswon[i] <- Cincinnati_join$gameswon[i-1] + 1
  } else {
    Cincinnati_join$gameswon[i] <- Cincinnati_join$gameswon[i-1]
  }
  if (Tampa_Bay_join$win_lose[i-1] == "W"){
    Tampa_Bay_join$gameswon[i] <- Tampa_Bay_join$gameswon[i-1] + 1
  } else {
    Tampa_Bay_join$gameswon[i] <- Tampa_Bay_join$gameswon[i-1]
  }
  if (Washington_join$win_lose[i-1] == "W"){
    Washington_join$gameswon[i] <- Washington_join$gameswon[i-1] + 1
  } else {
    Washington_join$gameswon[i] <- Washington_join$gameswon[i-1]
  }
}

for (i in 2:16) {
  Cincinnati_join$winlossratio[i] <- Cincinnati_join$gameswon[i] / Cincinnati_join$gamesplayed[i]
  Tampa_Bay_join$winlossratio[i] <- Tampa_Bay_join$gameswon[i] / Tampa_Bay_join$gamesplayed[i]
  Washington_join$winlossratio[i] <- Washington_join$gameswon[i] / Washington_join$gamesplayed[i]
}
for (i in 18:32) {
  Cincinnati_join$winlossratio[i] <- Cincinnati_join$gameswon[i] / Cincinnati_join$gamesplayed[i]
  Tampa_Bay_join$winlossratio[i] <- Tampa_Bay_join$gameswon[i] / Tampa_Bay_join$gamesplayed[i]
  Washington_join$winlossratio[i] <- Washington_join$gameswon[i] / Washington_join$gamesplayed[i]
}

# Lastly, combine the team tables to recreate all_teams_join, now 
# with the new column data values

all_teams_join <- bind_rows(Cincinnati_join, Tampa_Bay_join, Washington_join)

Exploratory Analysis

Stadium Maximum Capacity

Stadium Maximum Capacity

  • Tampa Bay: 74,301
  • Cincinnati: 65,515
  • Washington 82,000

An important variable is to compare each team’s home attendance with their respective stadium’s maximum capacity as there would be no reason to predict attendance at that point. In this case for both years the home attendance never reached the maximum capacity.

# test run to put stadium cap dashed line and annotation into graph
TB_stadium_ann <- data.frame(week = 10, weekly_attendance = 77000, 
                             lab = "Stadium Cap", year = factor(2019, levels = c("2018", "2019")))

Tampa_Bay_join %>%
  ggplot(aes(week, weekly_attendance)) +
  scale_color_manual(values = c("#0072B2", "#000000")) +
  scale_fill_manual(values = c("#F0E442", "#0072B2")) +
  geom_col(aes(fill = home_away, color = win_lose)) +
  xlab("Game Week") +
  ylab("Weekly Game Attendance") +
  labs(color = "Win/Loss", fill = "Home/Away") +
  theme(plot.title = element_text(size = 12)) +
  scale_x_continuous(minor_breaks = 0) +
  ggtitle("Tampa Bay Buccaneers\n2018-2019 Wins Visualization") +
  geom_hline(yintercept = 74301, linetype = "dashed", color = "#6699CC") +
  geom_text(data = TB_stadium_ann, label = "Stadium Cap", color = "#6699CC") +
  facet_wrap(~year)

# create wins visualizations, with stadium cap, for other teams
# Washington
Wash_stadium_ann <- data.frame(week = 9, weekly_attendance = 85000, 
                               lab = "Stadium Cap", year = factor(2019, levels = c("2018", "2019")))

Washington_join %>%
  ggplot(aes(week, weekly_attendance)) +
  scale_color_manual(values = c("#D55E00", "#000000")) +
  scale_fill_manual(values = c("#F0E442", "#D55E00")) +
  geom_col(aes(fill = home_away, color = win_lose)) +
  xlab("Game Week") +
  ylab("Weekly Game Attendance") +
  labs(color = "Win/Loss", fill = "Home/Away") +
  theme(plot.title = element_text(size = 12)) +
  scale_x_continuous(minor_breaks = 0) +
  ggtitle("Washington Redskins\n2018-2019 Wins Visualization") +
  geom_hline(yintercept = 82000, linetype = "dashed", color = "#990000") +
  geom_text(data = Wash_stadium_ann, label = "Stadium Cap", color = "#990000") +
  facet_wrap(~year)

# Cincinnati
Cinc_stadium_ann <- data.frame(week = 10, weekly_attendance = 73000, 
                               lab = "Stadium Cap", year = factor(2019, levels = c("2018", "2019")))

Cincinnati_join %>%
  ggplot(aes(week, weekly_attendance)) +
  scale_color_manual(values = c("#E69F00", "#000000")) +
  scale_fill_manual(values = c("#F0E442", "#E69F00")) +
  geom_col(aes(fill = home_away, color = win_lose)) +
  xlab("Game Week") +
  ylab("Weekly Game Attendance") +
  labs(color = "Win/Loss", fill = "Home/Away") +
  theme(plot.title = element_text(size = 12)) +
  scale_x_continuous(minor_breaks = 0) +
  ggtitle("Cincinnati Bengals\n2018-2019 Wins Visualization") +
  geom_hline(yintercept = 65515, linetype = "dashed", color = "#993300") +
  geom_text(data = Cinc_stadium_ann, label = "Stadium Cap", color = "#993300") +
  facet_wrap(~year)

2018 Overall Attendance

# PUsh All teams 2018 overall - bar chart, color specified for point and line
all_teams_join %>%
  filter(year == 2018) %>%
  ggplot(aes(week, weekly_attendance, fill = team)) +
  xlab("Game Week") +
  ylab("Home Game Attendance") +
  labs(fill = "Team") +
  geom_col(colour = "black") +
  ggtitle("2018 Weekly Attendance by Team") + 
  facet_wrap(~team)

The above graph indicates home and away games weekly attendance data for Cincinnati Bengals, Tampa bay Buccaneers, and Washington Redskins for the 2018 season. The weekly attendance seems to increase towards the end of the season for Tampa bay Buccaneers, and Washington Redskins. Overall, the attendance indicates a mediocre season for 2018. With the exception of a couple outliers such as Tampa Bay’s Game 16, or Washington’s Game 12. Both of these were at the Dallas Cowboys and both had an attendance of approximately 92,000. For the remaining games the attendance appears to be around the 60,000 mark. This is where the home game attendance tends to remain as we will see in the following graph and what VMC would have expected for the purpose of staffing the stadiums.

# PUsh All teams 2018 overall - bar chart, color specified for point and line
all_teams_join %>%
  filter(year == 2018) %>%
  filter(home_away == "H") %>%
  ggplot(aes(week, weekly_attendance, fill = team)) +
  xlab("Game Week") +
  ylab("Home Game Attendance") +
  labs(fill = "Team") +
  geom_col(colour = "black") +
  ggtitle("2018 Season\nHome Attendance by Team") + 
  facet_wrap(~team)

When removing the away games we can clearly see that, while the Redskins did in fact have a median attendance near the 60,000 mark for home games, the remaining teams fell short at home games. This is important as we would expect the attendance to be similar for the following year. Cincinnati had large attendance during the first few games before dropping off considerably, whereas Tampa Bay had a steady number of attendees right around week 6 till the end of the season. The Redskins had an overall increase in attendance as compared to the first game of the season. From this information we can break it down further into the wins and losses of each game to determine whether or not a trend is apparent that leads to these attendance numbers.

2019 Overall Attendance

# Push All teams 2019 overall - bar chart, color specified for point and line
all_teams_join %>%
  filter(year == 2019) %>%
  ggplot(aes(week, weekly_attendance, fill = team)) +
  xlab("Game Week") +
  ylab("Home Game Attendance") +
  labs(fill = "Team") +
  geom_col(colour = "black") +
  ggtitle("2019 Weekly Attendance by Team") + 
  facet_wrap(~team)

The above graph indicates home and away games weekly attendance data for the Cincinnati Bengals, Tampa Bay Buccaneers, and Washington Redskins in 2019. The weekly attendance seems to increase towards the end of the season only for Washington Redskins, unlike in 2018 where the attendance was on rise for both Tampa bay and Washington. The 2019 year overall tends to have more drastic leanings. Cincinnati has a continually declining season until week 15, and then a gradual decrease again. Tampa Bay continually declined in attendance throughout the season with a jump during week 11 or 12 in the season. Washington had a stair-step attendance curve, in that each few weeks the attendance would fall, then rise one week after the other, until falling again. This repeated throughout the entire season until the last week. The specifically-mentioned weeks will be considered to see whether or not that was a winning week and what the potential impact is on the following attendance.

# Push All teams 2019 overall - bar chart, color specified for point and line
all_teams_join %>%
  filter(year == 2019) %>%
  filter(home_away == "H") %>%
  ggplot(aes(week, weekly_attendance, fill = team)) +
  xlab("Game Week") +
  ylab("Home Game Attendance") +
  labs(fill = "Team") +
  geom_col(colour = "black") +
  ggtitle("2019 Season\nHome Attendance by Team") + 
  facet_wrap(~team)

Looking at the home games however, Cincinnati actually tends to increase in attendance slightly as the season progresses for this season whereas Tampa and Washington are more or less consistent with their overall season trajectory. We will look at this in more detail to determine if it is due to a win streak or some other factor.

Attendance Analysis

Tampa Bay Analysis

# USE Tampa Bay - graph as above, bar outline colored for a win
Tampa_Bay_join %>%
  ggplot(aes(week, weekly_attendance, fill = home_away, color = win_lose)) +
  scale_color_manual(values = c("#0072B2", "#000000")) +
  scale_fill_manual(values = c("#F0E442", "#0072B2")) +
  geom_col() +
  xlab("Game Week") +
  ylab("Weekly Game Attendance") +
  labs(color = "Win/Loss", fill = "Home/Away") +
  theme(plot.title = element_text(size = 12)) +
  scale_x_continuous(minor_breaks = 0) +
  ggtitle("Tampa Bay Win Analysis 2018-2019") +
  facet_wrap(~year)

# USE Tampa Bay bar chart, subsetted for home games, faceted by year
Tampa_Bay_join %>%
  filter(home_away == "H") %>%
  ggplot(aes(week, weekly_attendance)) +
  geom_col(fill = "#0072B2", colour = "black") +
  theme(plot.title = element_text(size = 12)) +
  ggtitle("Tampa Bay Home Game Attendance 2018-2019") +
  facet_wrap(~year)

Tampa Bay had a very weak season in 2018 with only 5 wins in the entire season. As we can see the first at-home win was followed by an increase of 6,000 attendees. However after three losses in a row the attendance dropped by 9,000. By the end of the season the final two wins were followed by a small increase of 2,000 and 1,500 attendees each.

In 2019 Tampa Bay had 7 wins, but only 2 were at home as opposed to the 4 of the previous year. The impact of a win does not appear to be strong when Tampa wins an away game. However the largest jump was from week 10 to 11, with an attendance increase of a staggering 14,000 attendees. That game was the first home win of the season and also broke a 5 game loss streak. This is a major point to track as that significant of an increase should trigger an increase of staffing at the stadium. While they continued to win soon after, being late in the season and with the games being in other stadiums, home game attendance dwindled.

ggplot(Tampa_Bay_join, aes(week, weekly_attendance, color = wonprev)) + geom_point(size = 2.5, position = 'jitter') + ggtitle("Tampa Bay Buccaneers\n2018/2019 Weekly Attendance") + labs(x = "Week", y = "Weekly Attendance", color = "Won Previous") + scale_fill_manual(values = c("red", "green", "grey")) + facet_grid(home_away ~ year)

If we were to look at the home games alone, for 2018 we can see a trend occurring. If a team were to win or lose the previous home game, the attendance would rise or fall consistently. This is particularly true of the opening games as they seem to garner more attendance than the later games and therefore have a higher attendance spike. In 2018, the highest attendance for a home game was 62,571, after winning 2 consecutive games at the start of the season. In 2019, the highest attendance was 60,087 after diverse results in the first few games (losing 3 games and winning 2 games). When the team is not a top performing team that year, it will gradually decrease as the weeks go by as we will see in the following graph.

Tampa_Bay_join %>%
      filter(home_away == "H") %>%
      ggplot(aes(winlossratio, weekly_attendance, color = wonprev)) + geom_point(size = 2.5, position = 'jitter') + ggtitle("Tampa Bay Buccaneers\n2018/2019 Weekly Attendance vs. Win Loss Ratio") + labs(x = "Win/Loss Ratio", y = "Weekly Attendance", color = "Won Previous") + scale_fill_manual(values = c("red", "green", "grey")) + facet_grid(home_away ~ year)
## Warning: Removed 1 rows containing missing values (geom_point).

The Win/Loss Ratio is calculated by the number of wins of that particular week divided by the number of weeks that have passed. So we can see a Win/Loss Ratio of 1.0 means that they have won the first game in the first week, and the second game in the second week. This follows the same trend of “The earlier the season, the more impactful a win”.

Washington Analysis

# USE Washington - graph as above, bar outline colored for a win
Washington_join %>%
  ggplot(aes(week, weekly_attendance, fill = home_away, color = win_lose)) +
  scale_color_manual(values = c("#D55E00", "#000000")) +
  scale_fill_manual(values = c("#F0E442", "#D55E00")) +
  geom_col() +
  xlab("Game Week") +
  ylab("Weekly Game Attendance") +
  labs(color = "Win/Loss", fill = "Home/Away") +
  theme(plot.title = element_text(size = 12)) +
  scale_x_continuous(minor_breaks = 0) +
  ggtitle("Washington Win Analysis 2018-2019") +
  facet_wrap(~year)

Washington_join %>%
  ggplot(aes(week, home_start_diff)) +
  geom_point(aes(fill = home_away, color = win_lose), shape = 21, size = 2, stroke = 1) +
  scale_color_manual(values = c("#E69F00", "#000000")) +
  scale_fill_manual(values = c("#F0E442", "#0072B2")) +
  geom_hline(yintercept = 0, color = "#0072B2") +
  geom_line(color = "#999999") +
  xlab("Game Week") +
  ylab("Attendance Relative to First Home Game") +
  labs(color = "Win/Loss", fill = "Home/Away") +
  scale_x_continuous(minor_breaks = 0) +
  ggtitle("Washington Redskins\nHome Attendance Win/Loss Sensitivity") +
  theme(plot.title = element_text(size = 14)) +
  facet_wrap(~year)

The home game attendance for the years 2018 and 2019 is significantly different. 2019 had higher attendance overall, but throughout the season it continually fell off. At the start of the season for 2018 the trend continued, where a win resulted in an increase of attendance for the next home game. However mid-season a win or loss seemed to have no noticeable impact either way on attendance. This season was an unusual season for the Redskins, as their Quarterback was injured in week 11, the backup was injured a couple weeks later, and the team had an all-time-high number of injuries for the season. That is likely the main variable in the attendance numbers for this season.

The 2019 season saw the draft of a new quarterback which also inflated the early attendance. This is another key point to track that, although not related to win or loss, is related to the number of attendees at a given game. A new coach also started in week 6. While the team did win, again the season was lackluster and likely contributed to the lack of attendance regardless of a win or loss.

Washington_join %>%
    filter(home_away == "H") %>%
    ggplot(aes(winlossratio, weekly_attendance, color = wonprev)) + geom_point(size = 2.5, position = 'jitter') + ggtitle("Washington Redskins\n2018/2019 Weekly Attendance vs. Win Loss Ratio") + labs(x = "Win/Loss Ratio", y = "Weekly Attendance", color = "Won Previous") + facet_grid(home_away ~ year)

Washington attendance shows a negative trend for both years, with lower attendance with higher Win/Loss Ratio. These results are significantly different from what were found in Tampa Bay and Cincinnati.

Cincinnati Analysis

Cincinnati_join %>%
  ggplot(aes(week, home_start_diff)) +
  geom_point(aes(fill = home_away, color = win_lose), shape = 21, size = 2, stroke = 1) +
  scale_color_manual(values = c("#E69F00", "#000000")) +
  scale_fill_manual(values = c("#F0E442", "#0072B2")) +
  geom_hline(yintercept = 0, color = "#0072B2") +
  geom_line(color = "#999999") +
  xlab("Game Week") +
  ylab("Attendance Relative to First Home Game") +
  labs(color = "Win/Loss", fill = "Home/Away") +
  scale_x_continuous(minor_breaks = 0) +
  ggtitle("Cincinnati Bengals\nHome Attendance Win/Loss Sensitivity") +
  theme(plot.title = element_text(size = 14)) +
  facet_wrap(~year)

# USE Cincinnati bar chart, subsetted for home games, faceted by year
Cincinnati_join %>%
  filter(home_away == "H") %>%
  ggplot(aes(week, weekly_attendance)) +
  geom_col(fill = "#E69F00", colour = "black") +
  theme(plot.title = element_text(size = 12)) +
  ggtitle("Cincinnati Home Game Attendance 2018-2019") +
  facet_wrap(~year)

The season for Cincinnati home games attendance is quite contrary to what we have been discussing for the Washington Redskins. In 2018 the initial home games of the team saw more attendance than the later games. We can confidently say the game attendance is steady in 2018. It also follows the trend of Tampa Bay, where a win at home would increase the attendance at the next home game. We can see the first three home games had a steady increase in attendance, followed by an equally sharp decrease after the first at-home loss.

2019 was a year nearly devoid of any success. We can see a steady trend of decreasing attendance until nearing week 11, which is when they would become ineligible for the playoffs. An important point in this analysis is the week 12 game that was won and the attendance during week 15. Just like in Tampa Bay’s analysis, we can see a sharp increase in attendance immediately following the breaking of a losing streak. Even though the season is over for Cincinnati the attendance still increased by nearly 18,000.

Cincinnati_join %>%
  filter(home_away == "H") %>%
  ggplot(aes(winlossratio, weekly_attendance, color = wonprev)) + geom_point(size = 2.5, position = 'jitter') + ggtitle("Cincinnati Bengals\n2018/2019 Weekly Attendance vs. Win Loss Ratio") + labs(x = "Win/Loss Ratio", y = "Weekly Attendance", color = "Won Previous") + facet_grid(home_away ~ year)

The home games for the Cincinnati Bengals team in 2018 saw the team start off with successive back to back wins in their first 3 home games, with a loss in week 8 and a reduction in attendance. Attendance then increases over the next 3 games before dropping for the last two home games (against the Denver Broncos and Oakland Raiders respectively) of the 2018 season. These attendance increases and decreases are consistent with the weightings of home games’ Win/Loss Ratio. With the exception of a win to break a losing streak, the later in a season the lower the attendance is impacted by a win or loss.

Insights

Team-Specific Insights

Tampa Bay

The Buccaneers had an increase in attendance of 6,000 after a home win initially, followed by a decrease of 9,000 after a loss. Minor increases and decreases (~1,500) for wins and losses at home towards the end of the season. There was a huge spike in attendance after a home win finally broke a losing streak and subsequently had a staggering increase of 14,000 attendees.

Washington

The Redskins had two very unusual years, with an injured quarterback and overall record number of injuries for a season, as well as a new quarterback and new coach. Unusual attendance pattern with the exception of high attendance as the new quarterback begins a season. While there was an attendance spike after breaking the losing streak, it was a less significant 6,000 increase.

Cincinnati

he Bengals followed many of the before mentioned trends, culminating in an increase of nearly 8,000 attendees following an early 2 home-win lead, and a staggering 18,000 attendee increase after breaking an extended dry streak.

Overall Insights

Attendance at games increases by at least 2,000 attendees per win at the start of the season. Another interesting universal trend was that after breaking a losing streak at home, attendance skyrockets for the following home game. This could be upwards of 15,000 more people over the previous home game, or as low as 8,000 later in the season. However these trends fall apart during unusual seasons such as The Washington Redskins’ second half of 2018 and the entirety of 2019. For these instances ticket sales are most likely a better indicator of attendance as they cause a unique shift in the fan’s willingness to attend for that season.

Conclusions

Conclusions and Impact to VMC

The existing staffing model of VMC is dependent on anticipated attendance at each game, with a minimum number of “x” workers per “y” attendees (for example 10 workers per 1,000 attendees). Therefore VMC should look to higher a proportionate number of staff members based on the expected increase in attendees.

From the available data, the first game of a season serves as a strong benchmark. Attendance levels diminish in the absence of strong opening games, but increase after each subsequent win. Without successive wins at the beginning of the season by a team, the observed trend is a decrease in attendance by a greater and greater margin.

What this means for VMC is if the team were to win their first game, to expect at minimum an additional 2,000 attendees for the next home game. and which correlates to an additional 20 staff members for that particular game. If they were to lose two games in a row, to expect between 4,000 and 9,000 less attendees and the correlated reduction in staff. In addition to this, if a particular team were to lose 4-5 games in a row, and then win a following home game, to expect the next home game to have a massive increase in attendance. Between 8,000 - 15,000 in attendance, depending on how many games were lost.

The impact of these results on the staffing model for VMC suggests that for adequate prediction of attendance levels, the performance of teams at the prior 2 to 3 games at the very least should be considered.

Limitations and Improvements

The data used does not include several factors that could have been used for comparison. The first factor which would have been beneficial as a baseline is the ticket sales. We are assuming that ticket sales do not correlate to a strong attendance (due to the nature of VMC requesting our help), but our data does not include that metric. That should be gathered from the stadium itself as a comparison to the actual attendance.

Another factor is the lack of overarching changes to the team structure throughout the season. This would include roster changes, injuries to key players, or staffing changes across the season. This could have been a factor in the downturn of attendance in the Washington Redskins’ 2018 season. However without data of each team’s roster changes this factor is difficult to quantify. This would increase the complexity immensely however.

A key limitation of this analysis has been the time constraint. This severely impacted on the duration of years selected for the study, and also the number of teams. Future studies may be able to derive more insight by looking at attendance data over more than 2 successive seasons.