MLS Attendance

What factors influence sports attendance? We will look at attendance trends in Major League Soccer (MLS), the North American professional soccer league that began play in 1996. We want to examine attendance in the context of weather, day of the week (weekday vs. weekend or holiday), time of the game (day vs. night), home team record, opponent’s record, novelty of the team (number of years in existence), and novelty of the stadium. Does using a smaller soccer-specific stadium promote higher attendance than using a large football stadium, which is often partially closed for MLS games? Do promotions such as free t-shirts entice more fans to show up? Do the demographics of a city influence attendance, especially in the first few years of a team’s existence?

Previous Takes

Academics have tackled this question using relatively advanced statistical methods. Most relevant to our purposes, John Charles Bradbury of Kennesaw State University looked specifically at MLS and found that strong regular-season team performance, franchise novelty, and the use of soccer-specific stadiums are positively associated with attendance. He did not find market population size, Hispanic population share, the presence of other pro sports teams, or a team having prominent players to be significant determinants of attendance.

The most telling sports economics studies examining ticket sales and attendance have used baseball data. Lim and Pedersen showed that successful predictors of attendance include team playoff appearances, day of game, ticket price, season progress, and anticipated match competitiveness, among other factors. McDonald and Rascher found that promotions such as free hats or t-shirts successfully increase attendance but that each additional promotion yields diminishing returns in terms of an attendance bump.

Sports analytics hobbyists have weighed in as well. Pulling baseball attendance from 1999 to 2016, Hepper showed that team performance (number of runs scored, win-loss record, etc.), with a bias toward offense, has the largest effect on attendance. Looking at New York Yankees baseball attendance since 2009, Todisco found that weekend games had higher attendance than weekday games and that the 30-40% of games with promotions had a mean attendance 2,000 higher than games without them.

Our approach will not be an academic one. We want to predict attendance from the perspective of an MLS team trying to gauge ticket sales. Ideally, such predictions could be used to direct marketing efforts.

Attendance Data

To start our analysis, we’re going to grab MLS attendance data since the league’s beginning in 1996, provided by the excellent soccer analytics site OverLapping Run.

To verify the accuracy of the above data, we compared it to a number of other sources, including the Dallas Morning News’ record of 2018 FC Dallas attendance.

library("rvest")
## Loading required package: xml2
## Warning: package 'xml2' was built under R version 3.3.3
url <- "http://www.overlappingrun.com/DispGames.php"
attendance_raw <- url %>%
  read_html() %>%
  html_nodes(xpath='//*[@id="post-2881"]/div/div[2]/table') %>%
  html_table()
attendance_raw <- attendance_raw[[1]]

How does our scraped data look?

str(attendance_raw)
## 'data.frame':    5492 obs. of  7 variables:
##  $ X1: chr  "Game #" "1" "2" "3" ...
##  $ X2: chr  "Date" "1996-04-06" "1996-04-13" "1996-04-13" ...
##  $ X3: chr  "Home" "HOU/SJ" "Tampa Bay" "Columbus" ...
##  $ X4: chr  "Home" "1" "3" "4" ...
##  $ X5: chr  "Visitor" "DC United" "New England" "DC United" ...
##  $ X6: chr  "Visitor" "0" "2" "0" ...
##  $ X7: chr  "Att" "31,683" "26,473" "25,266" ...

We need to turn the first row into the headers and update the datatypes.

library(dplyr)
library(knitr)
## Warning: package 'knitr' was built under R version 3.3.3
attendance <- as_tibble(attendance_raw)
#get headers from row 1
colnames(attendance) <- as.character(unlist(attendance[1,]))
#now ditch row 1
attendance <- attendance[-1,]
#improve header names
colnames(attendance)[which(names(attendance) == "Game #")] <- "Season_Game_Num"
colnames(attendance)[3] <- "Home_Team"
colnames(attendance)[4] <- "Home_Score"
colnames(attendance)[5] <- "Visiting_Team"
colnames(attendance)[6] <- "Visiting_Score"
#correct datatypes
attendance$Date <- as.Date(attendance$Date)
attendance$Season_Game_Num <- as.numeric(attendance$Season_Game_Num)
attendance$Home_Score <- as.numeric(attendance$Home_Score)
attendance$Visiting_Score <- as.numeric(attendance$Visiting_Score)
#convert to numeric, handling comma in character attendance number
attendance$Att <- as.numeric(gsub(",","",attendance$Att))
#check tibble
kable(attendance[1:15,], caption = "First 15 records from attendance data")
First 15 records from attendance data

| Season_Game_Num | Date | Home_Team | Home_Score | Visiting_Team | Visiting_Score | Att |
|---:|:---|:---|---:|:---|---:|---:|
| 1 | 1996-04-06 | HOU/SJ | 1 | DC United | 0 | 31683 |
| 2 | 1996-04-13 | Tampa Bay | 3 | New England | 2 | 26473 |
| 3 | 1996-04-13 | Columbus | 4 | DC United | 0 | 25266 |
| 4 | 1996-04-13 | Kansas City | 3 | Colorado | 0 | 21141 |
| 5 | 1996-04-13 | Los Angeles | 2 | NJ/NY | 1 | 69255 |
| 6 | 1996-04-14 | Dallas | 1 | HOU/SJ | 0 | 27779 |
| 7 | 1996-04-18 | Dallas | 3 | Kansas City | 0 | 9405 |
| 8 | 1996-04-20 | NJ/NY | 0 | New England | 1 | 46826 |
| 9 | 1996-04-20 | DC United | 1 | Los Angeles | 2 | 35032 |
| 10 | 1996-04-20 | Columbus | 1 | Tampa Bay | 2 | 24434 |
| 11 | 1996-04-21 | Colorado | 3 | Dallas | 1 | 21711 |
| 12 | 1996-04-21 | HOU/SJ | 2 | Kansas City | 3 | 17580 |
| 13 | 1996-04-27 | NJ/NY | 0 | Columbus | 2 | 26416 |
| 14 | 1996-04-27 | New England | 2 | DC United | 1 | 32864 |
| 15 | 1996-04-28 | Tampa Bay | 1 | Dallas | 2 | 14084 |

Team Data

This looks better than our initial pull. Now let’s add some more data. We’re going to create a table of MLS teams, including each team’s inception year. We will also create a table of information about each stadium a team has used, so we can express attendance as a percentage of stadium capacity and capture stadium attributes such as soccer-specific vs. football or other type.

To create our table of teams, we’re going to scrape team info from the current and defunct team tables on the Major League Soccer Wikipedia page.

url_2 <- "https://en.wikipedia.org/wiki/Major_League_Soccer"
team_raw_df <- url_2 %>%
  read_html() %>%
  html_nodes(xpath='//*[@id="mw-content-text"]/div/table[2]') %>%
  html_table(fill = TRUE)
team_raw_df <- team_raw_df[[1]]

#And then get the defunct teams as well
url_3 <- "https://en.wikipedia.org/wiki/Major_League_Soccer"
team_defunct_raw_df <- url_3 %>%
  read_html() %>%
  html_nodes(xpath='//*[@id="mw-content-text"]/div/table[4]') %>%
  html_table(fill = TRUE)
team_defunct_raw_df <- team_defunct_raw_df[[1]]

Next, we clean the team table data from Wikipedia.

#clean current team data
team_df <- team_raw_df[,1:5]
team_df <- team_df[-1,]
team_df <- team_df[-13,]
#clean superscript numbers
team_df$Stadium <- gsub('[[:digit:]]+', '', team_df$Stadium)
team_df$Capacity <- substr(gsub("\\[.*","",team_df$Capacity),0,6)
team_df$Capacity <- as.numeric(gsub(",","",team_df$Capacity))
#add defunct year column with NA
team_df$Defunct <- as.numeric(NA)
team_df$Joined <- as.numeric(team_df$Joined)
#reorder columns
team_df <- team_df[,c(1,2,3,5,6,4)]
#head(team_df,20)

#clean defunct teams - blend so can be combined
team_defunct_df <- team_defunct_raw_df[,1:4]
#get headers from row 1
colnames(team_defunct_df) <- team_defunct_df[1,]
colnames(team_defunct_df)[4] <- "Years_Active"
#now ditch row 1
team_defunct_df = team_defunct_df[-1,]
#split years active to Joined and Defunct columns
team_defunct_df$Joined <- as.numeric(substr(team_defunct_df$Years_Active,0,4))
team_defunct_df$Defunct <- as.numeric(substr(team_defunct_df$Years_Active,6,9))
#Remove Years_Active column
team_defunct_df <- team_defunct_df[,-4]
#clean superscript numbers
team_defunct_df$Stadium <- gsub('[[:digit:]]+', '',team_defunct_df$Stadium)
#Add NA capacity column
team_defunct_df$Capacity <- as.numeric(NA)
#head(team_defunct_df)

#combine the tables
teams_df <- arrange(rbind(team_df,team_defunct_df),Team)
kable(teams_df)
| Team | City | Stadium | Joined | Defunct | Capacity |
|:---|:---|:---|---:|---:|---:|
| Atlanta United FC | Atlanta, Georgia | Mercedes-Benz Stadium | 2017 | NA | 42500 |
| Chicago Fire | Bridgeview, Illinois | SeatGeek Stadium | 1998 | NA | 20000 |
| Chivas USA | Carson, California | StubHub Center | 2005 | 2014 | NA |
| Colorado Rapids | Commerce City, Colorado | Dick’s Sporting Goods Park | 1996 | NA | 18061 |
| Columbus Crew SC | Columbus, Ohio | Mapfre Stadium | 1996 | NA | 19968 |
| D.C. United | Washington, D.C. | Audi Field | 1996 | NA | 20000 |
| FC Cincinnati | Cincinnati, Ohio | Nippert Stadium | 2019 | NA | 33250 |
| FC Dallas | Frisco, Texas | Toyota Stadium | 1996 | NA | 20500 |
| Houston Dynamo | Houston, Texas | BBVA Compass Stadium | 2006 | NA | 22039 |
| LA Galaxy | Carson, California | Dignity Health Sports Park | 1996 | NA | 27000 |
| Los Angeles FC | Los Angeles, California | Banc of California Stadium | 2018 | NA | 22000 |
| Miami Fusion | Fort Lauderdale, Florida | Lockhart Stadium | 1998 | 2001 | NA |
| Minnesota United FC | Saint Paul, Minnesota | Allianz Field | 2017 | NA | 19400 |
| Montreal Impact | Montreal, Quebec | Saputo Stadium | 2012 | NA | 20801 |
| New England Revolution | Foxborough, Massachusetts | Gillette Stadium | 1996 | NA | 20000 |
| New York City FC | New York City, New York | Yankee Stadium | 2015 | NA | 30321 |
| New York Red Bulls | Harrison, New Jersey | Red Bull Arena | 1996 | NA | 25000 |
| Orlando City SC | Orlando, Florida | Orlando City Stadium | 2015 | NA | 25500 |
| Philadelphia Union | Chester, Pennsylvania | Talen Energy Stadium | 2010 | NA | 18500 |
| Portland Timbers | Portland, Oregon | Providence Park | 2011 | NA | 25000 |
| Real Salt Lake | Sandy, Utah | Rio Tinto Stadium | 2005 | NA | 20213 |
| San Jose Earthquakes | San Jose, California | Avaya Stadium | 1996 | NA | 18000 |
| Seattle Sounders FC | Seattle, Washington | CenturyLink Field | 2009 | NA | 39419 |
| Sporting Kansas City | Kansas City, Kansas | Children’s Mercy Park | 1996 | NA | 18467 |
| Tampa Bay Mutiny | Tampa, Florida | Raymond James Stadium | 1996 | 2001 | NA |
| Toronto FC | Toronto, Ontario | BMO Field | 2007 | NA | 30991 |
| Vancouver Whitecaps FC | Vancouver, British Columbia | BC Place | 2011 | NA | 22120 |

Finally, we will add a mapping column that links the Team name from the Teams table to the teams in the attendance data.
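
One way this mapping could work is sketched below. It is only an illustration: the `team_map` lookup table, the `Att_Team` column name, and the toy `attendance_demo` rows (copied from the sample records above) are assumptions, and only labels with an obvious counterpart are mapped. Combined labels such as "HOU/SJ" are left unmapped on purpose, since deciding which franchise they belong to needs a judgment call.

```r
library(dplyr)

# Hypothetical lookup table: attendance-data labels on the left,
# Wikipedia team names on the right. Only unambiguous labels are included.
team_map <- data.frame(
  Att_Team = c("Columbus", "Dallas", "Colorado", "New England"),
  Team     = c("Columbus Crew SC", "FC Dallas", "Colorado Rapids",
               "New England Revolution"),
  stringsAsFactors = FALSE
)

# Toy stand-in for the attendance table (rows taken from the sample above)
attendance_demo <- data.frame(
  Home_Team = c("Columbus", "Dallas", "Colorado", "HOU/SJ"),
  Att       = c(25266, 27779, 21711, 17580),
  stringsAsFactors = FALSE
)

# left_join keeps every game; labels without a mapping get NA in Team
attendance_mapped <- attendance_demo %>%
  left_join(team_map, by = c("Home_Team" = "Att_Team"))

# List the labels that still need a mapping decision
unmapped <- attendance_mapped %>%
  filter(is.na(Team)) %>%
  distinct(Home_Team)
```

Using a left join rather than an inner join means no games are silently dropped while the lookup table is still incomplete.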

Stadium Data

Now let’s create a table of stadium data. We will again rely on Wikipedia, which has thorough historical MLS stadium data. Our attributes of interest include capacity, playing surface, opening year, and whether or not it is a soccer-specific facility.

Weather Data

We will use daily historical temperature and precipitation records from the National Oceanic and Atmospheric Administration (NOAA). NOAA generally has data from multiple locations in a major city. We will use the weather data from the zip code closest to the arena that has daily data since at least 1996, the year MLS began play.

Adding Calendar Depth

We want to be able to separate weekday games from weekend and holiday games. We’d also like to look at attendance by month. To accomplish this, we’re going to use SQL to create a date dimension table to allow us to classify our historical MLS attendance data.
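
The plan above is to build the date dimension in SQL. For readers following along in R, a rough equivalent might look like the sketch below. The column names, the date range, and the two-entry `holidays` vector are all assumptions made for illustration; a real table would need a complete holiday calendar.

```r
# One row per calendar day, from the first MLS season through 2018 (assumed range)
date_dim <- data.frame(
  Date = seq(as.Date("1996-01-01"), as.Date("2018-12-31"), by = "day")
)

# POSIXlt$wday does not depend on locale: 0 = Sunday, ..., 6 = Saturday
wday_num <- as.POSIXlt(date_dim$Date)$wday
day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")
date_dim$Day_Of_Week <- day_names[wday_num + 1]
date_dim$Is_Weekend  <- wday_num %in% c(0, 6)
date_dim$Month       <- as.integer(format(date_dim$Date, "%m"))
date_dim$Year        <- as.integer(format(date_dim$Date, "%Y"))

# Hypothetical, deliberately incomplete holiday list
holidays <- as.Date(c("1996-07-04", "2018-07-04"))
date_dim$Is_Holiday <- date_dim$Date %in% holidays

# Single classification column for weekday vs. weekend-or-holiday games
date_dim$Day_Type <- ifelse(date_dim$Is_Weekend | date_dim$Is_Holiday,
                            "Weekend/Holiday", "Weekday")
```

Each game could then be classified by joining on the date, for example with `merge(attendance, date_dim, by = "Date")`.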

Promotional Nights

Do promotions increase attendance? Previous studies indicate that they do. Our historical data does not include promotional data, so we will limit our examination of promotions to the last two complete seasons (2017-2018) of MLS activity.

Additional Data

We are curious how team performance affects MLS attendance. How does it influence ticket sales in a season, and how does a bad season impact attendance in subsequent seasons? For simplicity’s sake, we will use season win-loss percentages for analysis here. Game-by-game wins and losses might lead to interesting insights as well, but they’re beyond the scope of our current review.
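
Because the attendance table already carries both scores, a season win percentage could in principle be derived from it directly. The sketch below is one possible approach under stated assumptions: the `games` rows are a handful of records copied from the sample above, a win is any game where a team outscored its opponent, and ties (or games decided by shootout) simply count as non-wins.

```r
library(dplyr)

# Toy games copied from the sample attendance rows above
games <- data.frame(
  Date = as.Date(c("1996-04-13", "1996-04-14", "1996-04-18",
                   "1996-04-21", "1996-04-28")),
  Home_Team      = c("Kansas City", "Dallas", "Dallas", "Colorado", "Tampa Bay"),
  Home_Score     = c(3, 1, 3, 3, 1),
  Visiting_Team  = c("Colorado", "HOU/SJ", "Kansas City", "Dallas", "Dallas"),
  Visiting_Score = c(0, 0, 0, 1, 2),
  stringsAsFactors = FALSE
)
games$Season <- as.integer(format(games$Date, "%Y"))

# Reshape to one row per team per game, from each side's perspective
home <- games %>%
  transmute(Season, Team = Home_Team,
            Goals_For = Home_Score, Goals_Against = Visiting_Score)
away <- games %>%
  transmute(Season, Team = Visiting_Team,
            Goals_For = Visiting_Score, Goals_Against = Home_Score)

# Assumption: anything other than outscoring the opponent is a non-win
season_record <- bind_rows(home, away) %>%
  group_by(Season, Team) %>%
  summarise(Games   = n(),
            Wins    = sum(Goals_For > Goals_Against),
            Win_Pct = mean(Goals_For > Goals_Against)) %>%
  ungroup()
```

The resulting `season_record` has one row per team and season, which makes it straightforward to join a prior season's record onto the current season's games.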

Additionally, how does team novelty affect ticket sales? How has attendance increased or decreased in the first few years of an MLS franchise’s operation?
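
Team novelty could be expressed as the number of seasons a franchise has been in the league. A minimal sketch follows, using three rows from the teams table above. The `Years_In_League` name, the 1996-2018 season range, and the choice to count the `Defunct` year as a team's final season are assumptions.

```r
# Three rows taken from the teams table above
teams_demo <- data.frame(
  Team    = c("FC Dallas", "Atlanta United FC", "Chivas USA"),
  Joined  = c(1996, 2017, 2005),
  Defunct = c(NA, NA, 2014),
  stringsAsFactors = FALSE
)

# Cross join every team with every season (by = NULL gives a Cartesian product)
seasons <- data.frame(Season = 1996:2018)
team_seasons <- merge(teams_demo, seasons, by = NULL)

# Keep only seasons in which the team existed;
# assumption: the Defunct year counts as the last season played
active <- team_seasons$Season >= team_seasons$Joined &
  (is.na(team_seasons$Defunct) | team_seasons$Season <= team_seasons$Defunct)
team_seasons <- team_seasons[active, ]

# A team's first season counts as year 1
team_seasons$Years_In_League <- team_seasons$Season - team_seasons$Joined + 1
```

Joining `team_seasons` to the attendance data by team and season would allow attendance to be plotted against `Years_In_League`.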

Analysis