MLS Attendance
What factors influence sports attendance? We will look at attendance trends in Major League Soccer (MLS), the North American professional soccer league that began play in 1996. We want to examine attendance in the context of weather, day of the week (weekday vs. weekend or holiday), time of the game (day vs. night), home team record, opponent’s record, novelty of the team (number of years in existence), and novelty of the stadium. Does playing in a smaller soccer-specific stadium promote higher attendance than playing in a large football stadium, which is often partially closed for MLS games? Do promotions such as free t-shirts draw more fans? Do the demographics of a city influence attendance, especially in the first few years of a team’s existence?
Previous Takes
Academics have tackled this question using relatively advanced statistical methods. Perhaps most relevantly, looking specifically at the MLS, John Charles Bradbury from Kennesaw State University found that good regular season team performance in addition to franchise novelty and the use of soccer-specific stadiums are positively associated with attendance. He did not find market population size, Hispanic population share, the presence of other pro sports teams, or a team having prominent players to be significant determinants of attendance.
The most telling sports economics studies examining ticket sales and attendance have used baseball data. Lim and Pedersen showed that successful predictors of attendance include team playoff appearances, day of game, ticket price, season progress, and anticipated match competitiveness, among other factors. McDonald and Rascher found that promotions such as free hats or t-shirts successfully increase attendance but that each additional promotion yields diminishing returns in terms of the attendance bump.
Sports analytics hobbyists have weighed in as well. Pulling baseball attendance from 1999 to 2016, Hepper showed that team performance (number of runs scored, win-loss record, etc.), with a bias toward offense, has the largest effect on attendance. Looking at New York Yankees baseball attendance since 2009, Todisco discovered that weekend games had higher attendance than weekday games and that the 30-40% of games with promotions showed mean attendance 2,000 higher than games without them.
Our approach will not be an academic one. We want to try to predict attendance in the context of an MLS team trying to gauge ticket sales. Hopefully such data could be used to direct marketing efforts.
Attendance Data
To start our analysis, we’re going to grab MLS attendance data since the league’s beginning in 1996, provided by the excellent soccer analytics site OverLapping Run.
To verify the accuracy of the above data, we compared it to a number of other sources, including the Dallas Morning News’ record of 2018 FC Dallas attendance.
library("rvest")
## Loading required package: xml2
## Warning: package 'xml2' was built under R version 3.3.3
url <- "http://www.overlappingrun.com/DispGames.php"
attendance_raw <- url %>%
read_html() %>%
html_nodes(xpath='//*[@id="post-2881"]/div/div[2]/table') %>%
html_table()
attendance_raw <- attendance_raw[[1]]
How does our scraped data look?
str(attendance_raw)
## 'data.frame': 5492 obs. of 7 variables:
## $ X1: chr "Game #" "1" "2" "3" ...
## $ X2: chr "Date" "1996-04-06" "1996-04-13" "1996-04-13" ...
## $ X3: chr "Home" "HOU/SJ" "Tampa Bay" "Columbus" ...
## $ X4: chr "Home" "1" "3" "4" ...
## $ X5: chr "Visitor" "DC United" "New England" "DC United" ...
## $ X6: chr "Visitor" "0" "2" "0" ...
## $ X7: chr "Att" "31,683" "26,473" "25,266" ...
We need to turn the first row into the headers and update the datatypes.
library(dplyr)
library(knitr)
## Warning: package 'knitr' was built under R version 3.3.3
#get headers from row 1 as a plain character vector
headers <- as.character(unlist(attendance_raw[1,]))
#improve header names; "Home" and "Visitor" each appear twice, so rename by position
headers[headers == "Game #"] <- "Season_Game_Num"
headers[3:6] <- c("Home_Team", "Home_Score", "Visiting_Team", "Visiting_Score")
#ditch row 1 and convert to a tibble (tbl_df() is deprecated in current dplyr)
attendance <- as_tibble(attendance_raw[-1,])
colnames(attendance) <- headers
#correct datatypes
attendance$Date <- as.Date(attendance$Date)
attendance$Season_Game_Num <- as.numeric(attendance$Season_Game_Num)
attendance$Home_Score <- as.numeric(attendance$Home_Score)
attendance$Visiting_Score <- as.numeric(attendance$Visiting_Score)
#convert to numeric, handling comma in character attendance number
attendance$Att <- as.numeric(gsub(",","",attendance$Att))
#check tibble
kable(attendance[1:15,], caption = "First 15 records from attendance data")
Season_Game_Num | Date | Home_Team | Home_Score | Visiting_Team | Visiting_Score | Att |
---|---|---|---|---|---|---|
1 | 1996-04-06 | HOU/SJ | 1 | DC United | 0 | 31683 |
2 | 1996-04-13 | Tampa Bay | 3 | New England | 2 | 26473 |
3 | 1996-04-13 | Columbus | 4 | DC United | 0 | 25266 |
4 | 1996-04-13 | Kansas City | 3 | Colorado | 0 | 21141 |
5 | 1996-04-13 | Los Angeles | 2 | NJ/NY | 1 | 69255 |
6 | 1996-04-14 | Dallas | 1 | HOU/SJ | 0 | 27779 |
7 | 1996-04-18 | Dallas | 3 | Kansas City | 0 | 9405 |
8 | 1996-04-20 | NJ/NY | 0 | New England | 1 | 46826 |
9 | 1996-04-20 | DC United | 1 | Los Angeles | 2 | 35032 |
10 | 1996-04-20 | Columbus | 1 | Tampa Bay | 2 | 24434 |
11 | 1996-04-21 | Colorado | 3 | Dallas | 1 | 21711 |
12 | 1996-04-21 | HOU/SJ | 2 | Kansas City | 3 | 17580 |
13 | 1996-04-27 | NJ/NY | 0 | Columbus | 2 | 26416 |
14 | 1996-04-27 | New England | 2 | DC United | 1 | 32864 |
15 | 1996-04-28 | Tampa Bay | 1 | Dallas | 2 | 14084 |
Team Data
This looks better than our initial pull. Now let’s add some more data. We’re going to create a table of each MLS team, including its inception year. We will also create a table of information on each stadium that a team has used, so we can report the percentage of tickets sold alongside stadium attributes such as soccer-specific vs. football or other type.
To create our table of teams, we’re going to scrape team info from the current and defunct team tables on the Major League Soccer Wikipedia page.
url_2 <- "https://en.wikipedia.org/wiki/Major_League_Soccer"
team_raw_df <- url_2 %>%
read_html() %>%
html_nodes(xpath='//*[@id="mw-content-text"]/div/table[2]') %>%
html_table(fill = TRUE)
team_raw_df <- team_raw_df[[1]]
#And then get the defunct teams as well
url_3 <- "https://en.wikipedia.org/wiki/Major_League_Soccer"
team_defunct_raw_df <- url_3 %>%
read_html() %>%
html_nodes(xpath='//*[@id="mw-content-text"]/div/table[4]') %>%
html_table(fill = TRUE)
team_defunct_raw_df <- team_defunct_raw_df[[1]]
Next, we clean the team table data from Wikipedia.
#clean current team data
team_df <- team_raw_df[,1:5]
team_df <- team_df[-1,]
team_df <- team_df[-13,]
#clean superscript numbers
team_df$Stadium <- gsub('[[:digit:]]+', '', team_df$Stadium)
team_df$Capacity <- substr(gsub("\\[.*","",team_df$Capacity),0,6)
team_df$Capacity <- as.numeric(gsub(",","",team_df$Capacity))
#add defunct year column with NA
team_df$Defunct <- as.numeric(NA)
team_df$Joined <- as.numeric(team_df$Joined)
#reorder columns
team_df <- team_df[,c(1,2,3,5,6,4)]
#head(team_df,20)
#clean defunct teams - blend so can be combined
team_defunct_df <- team_defunct_raw_df[,1:4]
#get headers from row 1
colnames(team_defunct_df) <- team_defunct_df[1,]
colnames(team_defunct_df)[4] <- "Years_Active"
#now ditch row 1
team_defunct_df = team_defunct_df[-1,]
#split years active to Joined and Defunct columns
team_defunct_df$Joined <- as.numeric(substr(team_defunct_df$Years_Active,0,4))
team_defunct_df$Defunct <- as.numeric(substr(team_defunct_df$Years_Active,6,9))
#Remove Years_Active column
team_defunct_df <- team_defunct_df[,-4]
#clean superscript numbers
team_defunct_df$Stadium <- gsub('[[:digit:]]+', '',team_defunct_df$Stadium)
#Add NA capacity column
team_defunct_df$Capacity <- as.numeric(NA)
#head(team_defunct_df)
#combine the tables
teams_df <- arrange(rbind(team_df,team_defunct_df),Team)
kable(teams_df)
Team | City | Stadium | Joined | Defunct | Capacity |
---|---|---|---|---|---|
Atlanta United FC | Atlanta, Georgia | Mercedes-Benz Stadium | 2017 | NA | 42500 |
Chicago Fire | Bridgeview, Illinois | SeatGeek Stadium | 1998 | NA | 20000 |
Chivas USA | Carson, California | StubHub Center | 2005 | 2014 | NA |
Colorado Rapids | Commerce City, Colorado | Dick’s Sporting Goods Park | 1996 | NA | 18061 |
Columbus Crew SC | Columbus, Ohio | Mapfre Stadium | 1996 | NA | 19968 |
D.C. United | Washington, D.C. | Audi Field | 1996 | NA | 20000 |
FC Cincinnati | Cincinnati, Ohio | Nippert Stadium | 2019 | NA | 33250 |
FC Dallas | Frisco, Texas | Toyota Stadium | 1996 | NA | 20500 |
Houston Dynamo | Houston, Texas | BBVA Compass Stadium | 2006 | NA | 22039 |
LA Galaxy | Carson, California | Dignity Health Sports Park | 1996 | NA | 27000 |
Los Angeles FC | Los Angeles, California | Banc of California Stadium | 2018 | NA | 22000 |
Miami Fusion | Fort Lauderdale, Florida | Lockhart Stadium | 1998 | 2001 | NA |
Minnesota United FC | Saint Paul, Minnesota | Allianz Field | 2017 | NA | 19400 |
Montreal Impact | Montreal, Quebec | Saputo Stadium | 2012 | NA | 20801 |
New England Revolution | Foxborough, Massachusetts | Gillette Stadium | 1996 | NA | 20000 |
New York City FC | New York City, New York | Yankee Stadium | 2015 | NA | 30321 |
New York Red Bulls | Harrison, New Jersey | Red Bull Arena | 1996 | NA | 25000 |
Orlando City SC | Orlando, Florida | Orlando City Stadium | 2015 | NA | 25500 |
Philadelphia Union | Chester, Pennsylvania | Talen Energy Stadium | 2010 | NA | 18500 |
Portland Timbers | Portland, Oregon | Providence Park | 2011 | NA | 25000 |
Real Salt Lake | Sandy, Utah | Rio Tinto Stadium | 2005 | NA | 20213 |
San Jose Earthquakes | San Jose, California | Avaya Stadium | 1996 | NA | 18000 |
Seattle Sounders FC | Seattle, Washington | CenturyLink Field | 2009 | NA | 39419 |
Sporting Kansas City | Kansas City, Kansas | Children’s Mercy Park | 1996 | NA | 18467 |
Tampa Bay Mutiny | Tampa, Florida | Raymond James Stadium | 1996 | 2001 | NA |
Toronto FC | Toronto, Ontario | BMO Field | 2007 | NA | 30991 |
Vancouver Whitecaps FC | Vancouver, British Columbia | BC Place | 2011 | NA | 22120 |
Finally, we will add a mapping column that links the Team name from the Teams table to the teams in the attendance data.
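One way to build that mapping is a small lookup table joined onto the attendance data. The sketch below is illustrative only: the `team_map` name, the `Att_Name` column, and the specific pairings are assumptions, and ambiguous labels such as `HOU/SJ` would need a deliberate decision before they are mapped.

```r
# Hypothetical lookup: attendance-site label -> Wikipedia team name.
# Only a few pairings are shown; the full map is an assumption to be filled in.
team_map <- data.frame(
  Att_Name = c("Dallas", "Columbus", "Colorado", "New England", "Tampa Bay"),
  Team     = c("FC Dallas", "Columbus Crew SC", "Colorado Rapids",
               "New England Revolution", "Tampa Bay Mutiny"),
  stringsAsFactors = FALSE
)

# A few rows shaped like the attendance table above
games <- data.frame(
  Home_Team = c("Tampa Bay", "Columbus", "Dallas", "HOU/SJ"),
  Att       = c(26473, 25266, 27779, 17580),
  stringsAsFactors = FALSE
)

# Left join so unmapped labels survive as NA and are easy to spot
games_mapped <- merge(games, team_map, by.x = "Home_Team", by.y = "Att_Name",
                      all.x = TRUE)
unmapped <- unique(games_mapped$Home_Team[is.na(games_mapped$Team)])
print(games_mapped)
print(unmapped)
```

Listing the unmapped labels after the join gives a quick check that every home team in the attendance data has been accounted for.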
Stadium Data
Now let’s create a table of stadium data. We will again rely on Wikipedia, which has thorough historical MLS stadium data. Our attributes of interest include capacity, playing surface, opening year, and whether or not it is a soccer-specific facility.
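As a rough sketch of how such a stadium table could feed a percent-of-capacity figure, the example below joins attendance to capacity by stadium name. The column names are assumptions, the capacities come from the team table above, and the attendance numbers are placeholders rather than real game records.

```r
# Illustrative stadium table; column names are assumptions, and attribute
# values should be checked against the scraped Wikipedia data.
stadiums <- data.frame(
  Stadium         = c("Toyota Stadium", "Mapfre Stadium"),
  Capacity        = c(20500, 19968),  # capacities from the team table above
  Soccer_Specific = c(TRUE, TRUE),    # assumed flags for illustration
  stringsAsFactors = FALSE
)

# Placeholder attendance figures, not real game records
games <- data.frame(
  Stadium = c("Toyota Stadium", "Mapfre Stadium", "Toyota Stadium"),
  Att     = c(16400, 19968, 10250),
  stringsAsFactors = FALSE
)

games <- merge(games, stadiums, by = "Stadium")
# Share of seats filled, as a percentage rounded to one decimal place
games$Pct_Capacity <- round(100 * games$Att / games$Capacity, 1)
print(games)
```

Because teams change stadiums over time, a full version would also need to key the join on the season or date range in which each stadium was in use.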
Weather Data
We will use daily historical temperature and precipitation records from the National Oceanic and Atmospheric Administration (NOAA). NOAA generally has data from multiple locations in a major city. For each team, we will use the weather data from the zip code closest to the stadium that has daily data since at least 1996, the year MLS began play.
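The selection rule can be sketched as a filter followed by a nearest-neighbor lookup. This example works from latitude and longitude rather than zip codes, which is an assumption made for simplicity. The station ids, coordinates, and start years are all hypothetical.

```r
# Great-circle distance in km between two lat/lon points (haversine formula)
haversine_km <- function(lat1, lon1, lat2, lon2) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  6371 * 2 * asin(pmin(1, sqrt(a)))
}

# Hypothetical station list: ids, coordinates, and first year of daily data
stations <- data.frame(
  Station    = c("STN_A", "STN_B", "STN_C"),
  Lat        = c(33.16, 33.10, 32.90),
  Lon        = c(-96.83, -96.80, -97.04),
  First_Year = c(2004, 1990, 1975),
  stringsAsFactors = FALSE
)

# Hypothetical stadium coordinates
stadium_lat <- 33.15
stadium_lon <- -96.84

# Keep stations with daily data since at least 1996, then take the closest
eligible <- stations[stations$First_Year <= 1996, ]
eligible$Dist_km <- haversine_km(stadium_lat, stadium_lon,
                                 eligible$Lat, eligible$Lon)
nearest <- eligible[which.min(eligible$Dist_km), ]
print(nearest)
```

Here the physically closest station is skipped because its record starts too late, which is the trade-off the rule is meant to capture.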
Adding Calendar Depth
We want to be able to separate weekday games from weekend and holiday games. We’d also like to look at attendance by month. To accomplish this, we’re going to use SQL to create a date dimension table to allow us to classify our historical MLS attendance data.
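For illustration, here is a minimal R stand-in for such a date dimension; it is not the SQL version described above. The column names and the one-entry holiday list are assumptions.

```r
# One row per calendar day; a short window keeps the example small
date_dim <- data.frame(
  Date = seq(as.Date("1996-07-01"), as.Date("1996-07-07"), by = "day")
)

# Locale-independent weekday number: 0 = Sunday ... 6 = Saturday
date_dim$Wday <- as.POSIXlt(date_dim$Date)$wday
date_dim$Month <- as.integer(format(date_dim$Date, "%m"))
date_dim$Is_Weekend <- date_dim$Wday %in% c(0, 6)

# Hypothetical holiday list; a real one would cover every season
holidays <- as.Date("1996-07-04")
date_dim$Is_Holiday <- date_dim$Date %in% holidays

# Single classification column for weekday vs. weekend or holiday
date_dim$Day_Type <- ifelse(date_dim$Is_Weekend | date_dim$Is_Holiday,
                            "Weekend/Holiday", "Weekday")
print(date_dim)
```

A table like this can then be joined to the attendance data on `Date` to label each game.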
Promotional Nights
Do promotions increase attendance? Previous studies indicate that they do. Our historical data does not include promotional data, so we will limit our examination of promotions to the last two complete seasons (2017-2018) of MLS activity.
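A first-pass comparison could be as simple as a difference in mean attendance between games with and without a promotion. The sketch below uses placeholder numbers and an assumed `Promo` flag column; it is not drawn from real MLS records.

```r
# Placeholder records, not real MLS data: attendance and a promotion flag
promo_games <- data.frame(
  Att   = c(18000, 21000, 17000, 22000, 19000, 20000),
  Promo = c(FALSE, TRUE, FALSE, TRUE, FALSE, TRUE)
)

# Mean attendance with and without a promotion
promo_means <- aggregate(Att ~ Promo, data = promo_games, FUN = mean)
print(promo_means)

# Difference in means (promotion minus no promotion)
promo_lift <- promo_means$Att[promo_means$Promo] -
  promo_means$Att[!promo_means$Promo]
print(promo_lift)
```

A raw difference like this does not control for day of week, opponent, or weather, so it is only a starting point before those factors are brought in.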
Additional Data
We are curious how team performance affects MLS attendance. How does it influence ticket sales in a season, and how does a bad season impact attendance in subsequent seasons? For simplicity’s sake, we will use season win-loss percentages for analysis here. Game-by-game wins and losses might lead to interesting insights as well, but they’re beyond the scope of our current review.
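Season win-loss percentages can be derived from the game-level scores already in the attendance table. The sketch below uses five rows copied from the first 15 records shown earlier and assumes the percentage is wins divided by wins plus losses, with any drawn games ignored. The helper column names are illustrative.

```r
# Five games taken from the first 15 attendance records above
games <- data.frame(
  Date = as.Date(c("1996-04-13", "1996-04-14", "1996-04-18",
                   "1996-04-21", "1996-04-28")),
  Home_Team      = c("Kansas City", "Dallas", "Dallas", "Colorado", "Tampa Bay"),
  Home_Score     = c(3, 1, 3, 3, 1),
  Visiting_Team  = c("Colorado", "HOU/SJ", "Kansas City", "Dallas", "Dallas"),
  Visiting_Score = c(0, 0, 0, 1, 2),
  stringsAsFactors = FALSE
)

# One row per team per game: stack the home and visiting perspectives
results <- rbind(
  data.frame(Team = games$Home_Team, Date = games$Date,
             Won  = games$Home_Score > games$Visiting_Score,
             Lost = games$Home_Score < games$Visiting_Score,
             stringsAsFactors = FALSE),
  data.frame(Team = games$Visiting_Team, Date = games$Date,
             Won  = games$Visiting_Score > games$Home_Score,
             Lost = games$Visiting_Score < games$Home_Score,
             stringsAsFactors = FALSE)
)
results$Season <- as.integer(format(results$Date, "%Y"))

# Count wins and losses per team and season, then compute the percentage
season_record <- aggregate(cbind(Won, Lost) ~ Team + Season,
                           data = results, FUN = sum)
season_record$Win_Pct <- season_record$Won /
  (season_record$Won + season_record$Lost)
print(season_record)
```

Lagging `Win_Pct` by one season within each team would give the prior-season record needed to ask how a bad year affects attendance the following year.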
Additionally, how does team novelty affect ticket sales? How has attendance increased or decreased in the first few years of an MLS franchise’s operation?
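Novelty can be expressed as a franchise-year counter built from the `Joined` column of the team table. The following is a minimal sketch with hypothetical teams and placeholder season averages. Only the idea of subtracting the joining year from the season comes from the data described above.

```r
# Hypothetical teams with joining years in the same form as the team table
teams <- data.frame(
  Team   = c("Team A", "Team B"),
  Joined = c(1996, 2017),
  stringsAsFactors = FALSE
)

# Placeholder season averages, not real attendance figures
season_att <- data.frame(
  Team    = c("Team A", "Team A", "Team B", "Team B"),
  Season  = c(1996, 1997, 2017, 2018),
  Avg_Att = c(15000, 12000, 40000, 44000),
  stringsAsFactors = FALSE
)

season_att <- merge(season_att, teams, by = "Team")
# Franchise year: 1 in the inaugural season, 2 in the next, and so on
season_att$Franchise_Year <- season_att$Season - season_att$Joined + 1

# Percent change versus the same team's inaugural-season average
first_year <- season_att[season_att$Franchise_Year == 1, c("Team", "Avg_Att")]
names(first_year)[2] <- "Year1_Att"
season_att <- merge(season_att, first_year, by = "Team")
season_att$Pct_vs_Year1 <- round(
  100 * (season_att$Avg_Att / season_att$Year1_Att - 1), 1)
print(season_att[order(season_att$Team, season_att$Season), ])
```

Indexing every team to its own first season puts franchises that joined in different eras on a common scale.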