The four major North American sports leagues are embedded throughout the cultures of the United States and Canada. Venues such as Wrigley Field and Fenway Park are often as recognizable as any landmark in their respective cities. The purpose of this analysis is to explore the venues of the four major North American sports leagues, with the ultimate goal of creating an interactive leaflet map that lets a user click through and explore the venues for themselves.
A rendered copy of this RMarkdown document can be found on my RPubs account.
# Loads relevant libraries
library(rvest)
library(tidyverse)
library(knitr)
library(kableExtra)
library(leaflet)
Wikipedia conveniently has pages containing venue information for the four major North American sports.
# We will pull our venue data from four different Wikipedia pages
urls = c('https://en.wikipedia.org/wiki/List_of_current_Major_League_Baseball_stadiums',
'https://en.wikipedia.org/wiki/List_of_National_Basketball_Association_arenas',
'https://en.wikipedia.org/wiki/List_of_current_National_Football_League_stadiums',
'https://en.wikipedia.org/wiki/List_of_National_Hockey_League_arenas')
We can use the rvest package to scrape the relevant venue information from each page, manually associating each dataset with the appropriate sports league as we go.
# MLB Data
mlbVenues = read_html(urls[1])
mlbVenues = mlbVenues %>%
html_nodes('table') %>%
.[[2]] %>%
html_table() %>%
as_tibble()
mlbVenues = mlbVenues %>%
select(Name, Capacity, Location, Team, Opened) %>%
mutate(League = 'MLB')
# NBA Data
nbaVenues = read_html(urls[2])
nbaVenues = nbaVenues %>%
html_nodes('table') %>%
.[[1]] %>%
html_table() %>%
as_tibble()
nbaVenues = nbaVenues %>%
select(Arena, Capacity, Location, `Team(s)`, Opened) %>%
mutate(League = 'NBA') %>%
rename(Team = `Team(s)`, Name = Arena)
# NFL Data
nflVenues = read_html(urls[3])
nflVenues = nflVenues %>%
html_nodes('table') %>%
.[[2]] %>%
html_table() %>%
as_tibble()
nflVenues = nflVenues %>%
select(Name, Capacity, Location, `Team(s)`, Opened) %>%
mutate(League = 'NFL') %>%
rename(Team = `Team(s)`)
# NHL Data
nhlVenues = read_html(urls[4])
nhlVenues = nhlVenues %>%
html_nodes('table') %>%
.[[1]] %>%
html_table() %>%
as_tibble()
nhlVenues = nhlVenues %>%
select(Arena, Capacity, Location, `Team(s)`, Opened) %>%
mutate(League = 'NHL') %>%
rename(Team = `Team(s)`, Name = Arena)
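The four blocks above repeat the same steps: parse the page, pick one table by position, and tag the rows with a league. As a minimal sketch of that pattern in isolation, the hypothetical helper `read_venue_table` below (not part of the original analysis) runs against a small inline HTML table instead of a live Wikipedia page. The table positions and column names on the real pages are whatever Wikipedia serves at the time and may change.

```r
library(rvest)

# Hypothetical helper: pull the table at position `index` from a parsed page
# and tag every row with its league.
read_venue_table <- function(page, index, league) {
  tbl <- page %>%
    html_nodes("table") %>%
    .[[index]] %>%
    html_table()
  tbl$League <- league
  tbl
}

# A tiny inline page stands in for a Wikipedia article
page <- read_html('
  <table>
    <tr><th>Name</th><th>Capacity</th></tr>
    <tr><td>Fenway Park</td><td>37,755[9]</td></tr>
    <tr><td>Coors Field</td><td>46,897[7]</td></tr>
  </table>')

demo <- read_venue_table(page, 1, "MLB")
print(demo)
```

The capacity values come back as text with their footnote markers attached, which is why a cleaning step is needed later.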
Now that we’ve extracted each table into its own dataframe, we’ll combine them into a single dataset.
# Binds the information into a single tibble
venues = rbind(mlbVenues, nbaVenues, nflVenues, nhlVenues)
venues %>%
head(10) %>%
kable() %>%
kable_styling(bootstrap_options = "striped", full_width = F)
Name | Capacity | Location | Team | Opened | League |
---|---|---|---|---|---|
Angel Stadium | 45,517[1] | Anaheim, California | Los Angeles Angels | 1966 | MLB |
Busch Stadium | 45,494[2] | St. Louis, Missouri | St. Louis Cardinals | 2006 | MLB |
Chase Field | 48,686[3] | Phoenix, Arizona | Arizona Diamondbacks | 1998 | MLB |
Citi Field | 41,922[4] | Queens, New York | New York Mets | 2009 | MLB |
Citizens Bank Park | 42,792[5] | Philadelphia, Pennsylvania | Philadelphia Phillies | 2004 | MLB |
Comerica Park | 41,083[6] | Detroit, Michigan | Detroit Tigers | 2000 | MLB |
Coors Field | 46,897[7] | Denver, Colorado | Colorado Rockies | 1995 | MLB |
Dodger Stadium | 56,000[8] | Los Angeles, California | Los Angeles Dodgers[nb 2] | 1962 | MLB |
Fenway Park | 37,755[9] | Boston, Massachusetts | Boston Red Sox[nb 3] | 1912 | MLB |
Globe Life Park in Arlington | 48,114[10] | Arlington, Texas | Texas Rangers | 1994 | MLB |
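There are some formatting issues that we’ll have to address: the capacity and team values carry footnote markers such as [1] and [nb 2], the capacities are strings with thousands separators, and city and state are combined in a single Location column.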
# Creates city and state features and coerces data into correct types
venues = venues %>%
separate(Location, c('City', 'State'), ', ') %>%
mutate(Name = str_replace(Name, '\\[.*\\]', ''),
Team = str_replace(Team, '\\[.*\\]', ''),
Capacity = as.integer(str_replace_all(str_replace(Capacity, '\\[.*\\]', ''), ',', '')),
Opened = as.integer(str_sub(Opened, 1, 4)))
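To make the effect of these transformations concrete, here is a small sketch that applies the same stringr calls to a few values copied from the table above. The variable names are made up for illustration, and `str_split_fixed` stands in for `separate` so that the example only needs stringr.

```r
library(stringr)

# Raw values as they appear in the scraped table
capacity_raw <- c("45,517[1]", "56,000[8]")
team_raw     <- c("Los Angeles Dodgers[nb 2]", "Texas Rangers")
opened_raw   <- c("1962", "1994")
location_raw <- c("Los Angeles, California", "Arlington, Texas")

# Drop the bracketed footnote marker, then the thousands separator
capacity <- as.integer(str_replace_all(str_replace(capacity_raw, "\\[.*\\]", ""), ",", ""))
team     <- str_replace(team_raw, "\\[.*\\]", "")

# Keep only the first four characters of the opening year
opened <- as.integer(str_sub(opened_raw, 1, 4))

# Split "City, State" into two columns
location <- str_split_fixed(location_raw, ", ", 2)
city  <- location[, 1]
state <- location[, 2]

print(capacity)
print(team)
```

Note that the pattern `\\[.*\\]` is greedy: in a cell holding two markers separated by text, it would also remove everything between them.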
Now that we’ve addressed our basic formatting issues, we’ll perform a sanity check to make sure we’re not missing any venues. At the time of writing, the MLB and the NBA each have 30 teams, the NFL has 32, and the NHL has 31. Our venue counts by league are:
# Summarizes the venue counts by League for sanity checks
venues %>%
group_by(League) %>%
summarise(VenueCount = n()) %>%
kable() %>%
kable_styling(bootstrap_options = "striped", full_width = F)
League | VenueCount |
---|---|
MLB | 30 |
NBA | 30 |
NFL | 31 |
NHL | 32 |
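One way to make this check explicit is to join the venue counts to the expected team counts and keep only the leagues that disagree. In the sketch below, the venue counts are copied from the table above, while the team counts are an assumption entered by hand (30, 30, 32, and 31 at the time of writing), not something scraped from the pages.

```r
library(dplyr)

# Venue counts copied from the summary table above
venue_counts <- tibble(League = c("MLB", "NBA", "NFL", "NHL"),
                       VenueCount = c(30L, 30L, 31L, 32L))

# Assumed team counts at the time of writing (entered by hand, not scraped)
team_counts <- tibble(League = c("MLB", "NBA", "NFL", "NHL"),
                      TeamCount = c(30L, 30L, 32L, 31L))

# Keep only the leagues where the two counts differ
count_check <- venue_counts %>%
  inner_join(team_counts, by = "League") %>%
  mutate(Difference = VenueCount - TeamCount) %>%
  filter(Difference != 0)

print(count_check)
```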
Our team and venue counts for the MLB and the NBA match exactly, but we have one fewer venue than teams for the NFL and one more venue than teams for the NHL. Looking for the missing NFL record shows that the two New York teams share a stadium, so both teams appear in a single row. We’ll split that row into separate records.
# Two NFL teams share the same stadium
venues %>%
filter(League == 'NFL') %>%
mutate(TeamLength = str_length(Team)) %>%
arrange(desc(TeamLength)) %>%
top_n(1, TeamLength) %>%
select(-TeamLength) %>%
kable() %>%
kable_styling(bootstrap_options = "striped", full_width = F)
Name | Capacity | City | State | Team | Opened | League |
---|---|---|---|---|---|---|
MetLife Stadium | 82500 | East Rutherford | New Jersey | New York GiantsNew York Jets | 2010 | NFL |
# Creates separate rows for each team
venues = venues %>%
mutate(Team = ifelse(Name == 'MetLife Stadium',
'New York Giants',
Team))
venues = rbind(venues,
venues %>%
filter(Name == 'MetLife Stadium') %>%
mutate(Team = 'New York Jets'))
In the NHL, the New York Islanders have two active venues. We’ll note this fact but will leave our data as is.
# One NHL team has two stadiums
venues %>%
group_by(Team) %>%
summarise(NumberOfVenues = n()) %>%
filter(NumberOfVenues > 1) %>%
inner_join(venues, by = c('Team' = 'Team')) %>%
select(-NumberOfVenues) %>%
kable() %>%
kable_styling(bootstrap_options = "striped", full_width = F)
Team | Name | Capacity | City | State | Opened | League |
---|---|---|---|---|---|---|
New York Islanders | Barclays Center | 15795 | Brooklyn | New York | 2012 | NHL |
New York Islanders | Nassau Coliseum | 13900 | Uniondale | New York | 1972 | NHL |
Next we’ll perform some descriptive analysis on our dataset. We’ll start with some basic statistics about venue capacity by league.
venues %>%
group_by(League) %>%
summarise(MeanCapacity = mean(Capacity),
SdCapacity = sd(Capacity),
MedianCapacity = median(Capacity),
MaxCapacity = max(Capacity),
MinCapacity = min(Capacity)) %>%
kable() %>%
kable_styling(bootstrap_options = "striped", full_width = F)
League | MeanCapacity | SdCapacity | MedianCapacity | MaxCapacity | MinCapacity |
---|---|---|---|---|---|
MLB | 42517.33 | 5598.775 | 41907.5 | 56000 | 25000 |
NBA | 18930.57 | 1049.816 | 18987.5 | 20917 | 16867 |
NFL | 69472.19 | 10238.696 | 69137.5 | 82500 | 27000 |
NHL | 18082.38 | 1411.292 | 18288.5 | 21302 | 13900 |
ggplot(venues, aes(League, Capacity, fill = League)) +
geom_boxplot() +
labs(title = 'Venue Capacity by League',
x = 'League',
y = 'Capacity') +
theme(plot.title = element_text(hjust = .5))
ggplot(venues, aes(x = Capacity, fill = League)) +
geom_histogram() +
labs(title = 'Venue Capacity by League',
x = 'Capacity',
y = 'Count') +
facet_wrap(~League, ncol = 2, scales = 'free_x') +
theme(plot.title = element_text(hjust = .5))
NFL teams clearly tend to have the largest venues, followed by MLB teams. NBA venues have only a narrowly greater capacity than NHL venues. This shouldn’t be surprising: about a third of NHL and NBA teams share venues, so aside from some small seating rearrangements, the two leagues should have similar capacity characteristics.
sharedVenues = venues %>%
group_by(Name) %>%
summarise(TeamCount = n()) %>%
filter(TeamCount > 1) %>%
inner_join(venues, by = c('Name' = 'Name'))
ggplot(sharedVenues, aes(League, Name, col = League)) +
geom_point(pch = 'x', size = 8) +
labs(title = 'Venues Shared by Multiple Teams',
x = 'League',
y = 'Venue') +
theme(plot.title = element_text(hjust = .5))
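The "about a third" figure can be checked by computing, for each league, the share of teams whose venue also hosts a team from another league. The sketch below does this on a hypothetical six-row miniature of the `venues` table; the arena and team names are placeholders, and the resulting share describes only this toy data.

```r
library(dplyr)

# Hypothetical miniature of `venues`: one row per team
toy <- tibble(
  Name   = c("Arena A", "Arena A", "Arena B", "Arena C", "Arena D", "Arena D"),
  Team   = c("NBA 1", "NHL 1", "NBA 2", "NHL 2", "NBA 3", "NHL 3"),
  League = c("NBA", "NHL", "NBA", "NHL", "NBA", "NHL"))

# Count the leagues present in each venue, then average per league
shared_share <- toy %>%
  group_by(Name) %>%
  mutate(LeaguesInVenue = n_distinct(League)) %>%
  ungroup() %>%
  group_by(League) %>%
  summarise(ShareSharing = mean(LeaguesInVenue > 1))

print(shared_share)
```

Counting distinct leagues rather than rows avoids flagging a venue that hosts two teams from the same league as a cross-league share.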
Next we’ll shift our focus to the year in which each venue opened.
venues %>%
mutate(outlierCaption = ifelse(Opened < 1965, paste0(Name, ', ', Opened), '')) %>%
ggplot(., aes(League, Opened, fill = League)) +
geom_boxplot() +
geom_text(aes(label = outlierCaption), hjust = -.2, size = 2) +
labs(title = 'Year Opened by League',
x = 'League',
y = 'Year Opened') +
scale_y_continuous(breaks = seq(1910,2020,10)) +
theme(plot.title = element_text(hjust = .5))
The MLB has the two oldest stadiums in Fenway Park and Wrigley Field, followed by the Los Angeles Memorial Coliseum and Soldier Field in the NFL.
It would be interesting to see whether the capacity of a venue has anything to do with the year in which it opened. We explore this possibility in the plots below.
ggplot(venues, aes(Opened, Capacity, col = League)) +
geom_point(alpha = .7) +
geom_smooth(method = 'lm', lty = 2, size = .7) +
facet_wrap(~League, ncol = 2, scales = 'free')
There’s no clearly discernible relationship between the year that a venue opened and its seating capacity. Our regression lines for the MLB, NFL, and NHL are nearly flat. Our NBA regression line trends downward, but the confidence bands include the possibility of a slope of zero.
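Reading a zero slope off the confidence bands is approximate; a more direct check is to fit the linear model and look at the confidence interval of the slope itself. The sketch below does this with base R on hypothetical data (six made-up year and capacity pairs, not values from the scraped tables).

```r
# Hypothetical data: six made-up (year, capacity) pairs
toy <- data.frame(
  Opened   = c(1990, 1995, 2000, 2005, 2010, 2015),
  Capacity = c(18500, 19500, 18000, 20000, 19000, 19000))

# Same model that geom_smooth(method = 'lm') fits
fit <- lm(Capacity ~ Opened, data = toy)
slope <- coef(fit)[["Opened"]]

# 95% confidence interval for the slope
slope_ci <- confint(fit, "Opened", level = 0.95)

# TRUE when the interval contains zero, i.e. a flat line is plausible
includes_zero <- slope_ci[1, 1] < 0 && slope_ci[1, 2] > 0

print(slope_ci)
print(includes_zero)
```

Applied to the real data, the same fit would be run once per league, for example on `venues` filtered to a single value of `League`.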
Next we’d like to visualize the locations of the different sports venues. Our existing dataset contains city and state, but we can be more precise by pulling in latitude and longitude coordinates.
# Reads csv of coordinate data
coordPath = 'C:/Users/Rick/Documents/Projects/Sports_Venues/venueCoordinates.csv'
coordinates = read.csv(coordPath, stringsAsFactors = F)
head(coordinates) %>%
kable() %>%
kable_styling(bootstrap_options = "striped", full_width = F)
sport | lat | lon | team_name | channel |
---|---|---|---|---|
NBA | 25.7814 | -80.1878 | Miami Heat | Fox Sports Sun |
NBA | 28.5392 | -81.3836 | Orlando Magic | Fox Sports Florida |
NBA | 29.4269 | -98.4375 | San Antonio Spurs | Fox Sports Southwest |
NBA | 29.7508 | -95.3622 | Houston Rockets | Root Sports Southwest |
NBA | 29.9489 | -90.0822 | New Orleans Pelicans | Fox Sports New Orleans |
NBA | 32.7904 | -96.8103 | Dallas Mavericks | Fox Sports Southwest |
Joining our coordinate data with our venue data allows us to generate an interactive leaflet map of the venue locations, color-coded by league.
venues = venues %>%
left_join(coordinates, by = c('Team' = 'team_name')) %>%
rename(Latitude = lat, Longitude = lon) %>%
select(League, Name, Capacity, City, State, Team, Opened, Latitude, Longitude)
# League is a character column, so levels() returns NULL; pass the column itself as the domain
pal <- colorFactor(c("red", "darkgreen", "darkorange", "blue"), venues$League)
popHTML = paste0('<dl>',
'<dd>Venue: ', venues$Name, '</dd>',
'<dd>Location: ', venues$City, ', ', venues$State, '</dd>',
'<dd>Team: ', venues$Team, '</dd>',
'<dd>Capacity: ', venues$Capacity, '</dd>',
'</dl>')
venues %>%
leaflet() %>%
addTiles() %>%
addCircleMarkers(clusterOptions = markerClusterOptions(),
popup = popHTML,
color = ~pal(venues$League),
fillOpacity = 1,
radius = venues$Capacity/6000) %>%
addLegend(
pal = pal,
values = venues$League)
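Because the join above matches on team name, any team spelled differently in the two sources would end up with missing coordinates and no marker on the map. A quick way to check for this is an anti-join, sketched below on hypothetical miniatures of the two tables; "Example Team" is a made-up name, and the two coordinate rows are copied from the preview above.

```r
library(dplyr)

# Hypothetical miniatures of the two tables being joined
venues_toy <- tibble(Team = c("Miami Heat", "Orlando Magic", "Example Team"))
coords_toy <- tibble(team_name = c("Miami Heat", "Orlando Magic"),
                     lat = c(25.7814, 28.5392),
                     lon = c(-80.1878, -81.3836))

# Teams with no coordinate match; these would get NA latitude and longitude
unmatched <- venues_toy %>%
  anti_join(coords_toy, by = c("Team" = "team_name"))

print(unmatched)
```

An empty result would indicate that every team found a match.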