Abstract

The four major North American sports leagues are embedded throughout the cultures of the United States and Canada. Sports venues such as Wrigley Field and Fenway Park are often as recognizable as any landmark in their respective cities. The purpose of this analysis is to explore the venues of the four major North American sports leagues, with the ultimate goal of creating an interactive leaflet map that lets a user click through and explore the venues for themselves.

A rendered copy of this RMarkdown document can be found on my RPubs account.

# Loads relevant libraries
library(rvest)
library(tidyverse)
library(knitr)
library(kableExtra)
library(leaflet)

Getting and Cleaning Data

Wikipedia conveniently has pages containing venue information for the four major North American sports.

# We will pull our venue data from four different Wikipedia pages
urls = c('https://en.wikipedia.org/wiki/List_of_current_Major_League_Baseball_stadiums',
         'https://en.wikipedia.org/wiki/List_of_National_Basketball_Association_arenas',
         'https://en.wikipedia.org/wiki/List_of_current_National_Football_League_stadiums',
         'https://en.wikipedia.org/wiki/List_of_National_Hockey_League_arenas')

We can use the rvest package to scrape the relevant venue information from each page, manually associating each dataset with the appropriate sports league as we go.

# MLB Data

mlbVenues = read_html(urls[1])

mlbVenues = mlbVenues %>%
    html_nodes('table') %>%
    .[[2]] %>%
    html_table() %>%
    as_tibble()

mlbVenues = mlbVenues %>%
    select(Name, Capacity, Location, Team, Opened) %>%
    mutate(League = 'MLB')

# NBA Data

nbaVenues = read_html(urls[2])

nbaVenues = nbaVenues %>%
    html_nodes('table') %>%
    .[[1]] %>%
    html_table() %>%
    as_tibble()

nbaVenues = nbaVenues %>%
    select(Arena, Capacity, Location, `Team(s)`, Opened) %>%
    mutate(League = 'NBA') %>%
    rename(Team = `Team(s)`, Name = Arena)

# NFL Data

nflVenues = read_html(urls[3])

nflVenues = nflVenues %>%
    html_nodes('table') %>%
    .[[2]] %>%
    html_table() %>%
    as_tibble()

nflVenues = nflVenues %>%
    select(Name, Capacity, Location, `Team(s)`, Opened) %>%
    mutate(League = 'NFL') %>%
    rename(Team = `Team(s)`)

# NHL Data

nhlVenues = read_html(urls[4])

nhlVenues = nhlVenues %>%
    html_nodes('table') %>%
    .[[1]] %>%
    html_table() %>%
    as_tibble()

nhlVenues = nhlVenues %>%
    select(Arena, Capacity, Location, `Team(s)`, Opened) %>%
    mutate(League = 'NHL') %>%
    rename(Team = `Team(s)`, Name = Arena)
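
The four scraping blocks above differ only in the URL, the position of the table on the page, and the league label, so they could be folded into one small function. The following is a minimal sketch rather than the method used in this analysis: `scrape_venue_table` and the inline sample HTML are hypothetical, and the function takes an already parsed page so that it can be tried without a network connection.

```r
library(rvest)

# Hypothetical helper: takes a parsed page, extracts the table at position
# `index`, and tags every row with the league it belongs to.
scrape_venue_table = function(page, index, league) {
    tables = html_nodes(page, 'table')
    tbl = html_table(tables[[index]])
    tbl$League = league
    tbl
}

# Made-up sample page: the first table stands in for a legend that should be
# skipped, the second for the venue listing.
sample_html = '<html><body>
<table><tr><th>Legend</th></tr><tr><td>ignored</td></tr></table>
<table>
<tr><th>Name</th><th>Capacity</th></tr>
<tr><td>Fenway Park</td><td>37,755[9]</td></tr>
</table>
</body></html>'

demo = scrape_venue_table(read_html(sample_html), index = 2, league = 'MLB')
print(demo)
```

With the real pages, a call such as `scrape_venue_table(read_html(urls[1]), 2, 'MLB')` would stand in for the first half of the MLB block; the column selection and renaming would still differ by league.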

Now that we’ve extracted each table into its own separate dataframe, we’ll combine them into a common dataset.

# Binds the information into a single tibble
venues = rbind(mlbVenues, nbaVenues, nflVenues, nhlVenues)
venues %>%
    head(10) %>%
    kable() %>%
    kable_styling(bootstrap_options = "striped", full_width = F)

| Name | Capacity | Location | Team | Opened | League |
|---|---|---|---|---|---|
| Angel Stadium | 45,517[1] | Anaheim, California | Los Angeles Angels | 1966 | MLB |
| Busch Stadium | 45,494[2] | St. Louis, Missouri | St. Louis Cardinals | 2006 | MLB |
| Chase Field | 48,686[3] | Phoenix, Arizona | Arizona Diamondbacks | 1998 | MLB |
| Citi Field | 41,922[4] | Queens, New York | New York Mets | 2009 | MLB |
| Citizens Bank Park | 42,792[5] | Philadelphia, Pennsylvania | Philadelphia Phillies | 2004 | MLB |
| Comerica Park | 41,083[6] | Detroit, Michigan | Detroit Tigers | 2000 | MLB |
| Coors Field | 46,897[7] | Denver, Colorado | Colorado Rockies | 1995 | MLB |
| Dodger Stadium | 56,000[8] | Los Angeles, California | Los Angeles Dodgers[nb 2] | 1962 | MLB |
| Fenway Park | 37,755[9] | Boston, Massachusetts | Boston Red Sox[nb 3] | 1912 | MLB |
| Globe Life Park in Arlington | 48,114[10] | Arlington, Texas | Texas Rangers | 1994 | MLB |

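There are some formatting issues that we’ll have to address: footnote markers such as [1] and [nb 2] trail several values, capacities are stored as text with thousands separators, and city and state share a single Location column.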

# Creates city and state features and coerces data into correct types
venues = venues %>%
    separate(Location, c('City', 'State'), ', ') %>%
    # Removes bracketed footnote markers such as [1] or [nb 2]
    mutate(Name = str_remove_all(Name, '\\[[^\\]]*\\]'),
           Team = str_remove_all(Team, '\\[[^\\]]*\\]'),
           # Drops footnotes and thousands separators before converting
           Capacity = as.integer(str_remove_all(Capacity, '\\[[^\\]]*\\]|,')),
           # Keeps only the four-digit year
           Opened = as.integer(str_sub(Opened, 1, 4)))
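
To see what this cleaning step does to individual values, the sketch below applies the same ideas to a few strings copied from the table above. It is written in base R so that it runs without the tidyverse, and `strip_footnotes` is a hypothetical helper name rather than part of the analysis.

```r
# Sample raw values copied from the scraped table
capacity_raw = c('45,517[1]', '56,000[8]')
team_raw = c('Los Angeles Angels', 'Los Angeles Dodgers[nb 2]')
location_raw = 'Anaheim, California'

# Hypothetical helper: drops bracketed footnote markers such as [1] or [nb 2]
strip_footnotes = function(x) gsub('\\[[^]]*\\]', '', x)

# Removes footnotes, then thousands separators, then converts to integer
capacity = as.integer(gsub(',', '', strip_footnotes(capacity_raw)))
team = strip_footnotes(team_raw)

# Splits "City, State" into its two parts
location = strsplit(location_raw, ', ', fixed = TRUE)[[1]]

print(capacity)
print(team)
print(location)
```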

Now that we’ve addressed our basic formatting issues, we’ll perform a sanity check to make sure we’re not missing any venues. Each league currently has the following number of teams: 30 in the MLB, 30 in the NBA, 32 in the NFL, and 31 in the NHL.

# Summarizes the venue counts by League for sanity checks
venues %>% 
    group_by(League) %>%
    summarise(VenueCount = n()) %>%
    kable() %>%
    kable_styling(bootstrap_options = "striped", full_width = F)

| League | VenueCount |
|---|---|
| MLB | 30 |
| NBA | 30 |
| NFL | 31 |
| NHL | 32 |
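
The comparison between venue counts and team counts can also be made programmatically. The sketch below hard-codes the venue counts from the table above; the team counts (30 MLB, 30 NBA, 32 NFL, 31 NHL) are assumptions taken from the comparison made in this write-up, not values pulled from the scraped data.

```r
# Venue counts as reported in the summary table
venue_counts = c(MLB = 30, NBA = 30, NFL = 31, NHL = 32)

# Team counts per league; assumed for illustration, not scraped
team_counts = c(MLB = 30, NBA = 30, NFL = 32, NHL = 31)

# Positive values mean more venues than teams, negative values mean fewer
difference = venue_counts - team_counts[names(venue_counts)]
mismatched = difference[difference != 0]
print(mismatched)
```

Only the leagues that need a closer look remain in `mismatched`.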

Our team and venue counts for the MLB and the NBA match exactly, but we have one fewer venue than teams for the NFL and one more venue than teams for the NHL. Looking for the missing NFL record shows that the two New York teams share a stadium, and that their names were fused into a single Team value. We’ll convert this record into two separate records, one per team.

# Two NFL teams share the same stadium; the fused team names give
# that record the longest Team value
venues %>% 
    filter(League == 'NFL') %>%
    mutate(TeamLength = str_length(Team)) %>%
    top_n(1, TeamLength) %>%
    select(-TeamLength) %>%
    kable() %>%
    kable_styling(bootstrap_options = "striped", full_width = F)

| Name | Capacity | City | State | Team | Opened | League |
|---|---|---|---|---|---|---|
| MetLife Stadium | 82500 | East Rutherford | New Jersey | New York GiantsNew York Jets | 2010 | NFL |

# Creates separate rows for each team
venues = venues %>%
    mutate(Team = ifelse(Name == 'MetLife Stadium', 
                         'New York Giants', 
                         Team))
venues = rbind(venues,
               venues %>%
                   filter(Name == 'MetLife Stadium') %>%
                   mutate(Team = 'New York Jets'))

In the NHL, the New York Islanders have two active venues. We’ll note this fact but will leave our data as is.

# One NHL team has two stadiums
venues %>%
    group_by(Team) %>%
    summarise(NumberOfVenues = n()) %>%
    filter(NumberOfVenues > 1) %>%
    inner_join(venues, by = c('Team' = 'Team')) %>%
    select(-NumberOfVenues) %>%
    kable() %>%
    kable_styling(bootstrap_options = "striped", full_width = F)

| Team | Name | Capacity | City | State | Opened | League |
|---|---|---|---|---|---|---|
| New York Islanders | Barclays Center | 15795 | Brooklyn | New York | 2012 | NHL |
| New York Islanders | Nassau Coliseum | 13900 | Uniondale | New York | 1972 | NHL |

Exploratory Analysis

Next we’ll perform some descriptive analysis on our dataset. We’ll start with some basic statistics about venue capacity by league.

venues %>%
    group_by(League) %>%
    summarise(MeanCapacity = mean(Capacity),
              SdCapacity = sd(Capacity),
              MedianCapacity = median(Capacity),
              MaxCapacity = max(Capacity),
              MinCapacity = min(Capacity)) %>%
    kable() %>%
    kable_styling(bootstrap_options = "striped", full_width = F)

| League | MeanCapacity | SdCapacity | MedianCapacity | MaxCapacity | MinCapacity |
|---|---|---|---|---|---|
| MLB | 42517.33 | 5598.775 | 41907.5 | 56000 | 25000 |
| NBA | 18930.57 | 1049.816 | 18987.5 | 20917 | 16867 |
| NFL | 69472.19 | 10238.696 | 69137.5 | 82500 | 27000 |
| NHL | 18082.38 | 1411.292 | 18288.5 | 21302 | 13900 |

ggplot(venues, aes(League, Capacity, fill = League)) +
    geom_boxplot() +
    labs(title = 'Venue Capacity by League',
         x = 'League',
         y = 'Capacity') +
    theme(plot.title = element_text(hjust = .5))

ggplot(venues, aes(x = Capacity, fill = League)) +
    geom_histogram(bins = 30) +
    labs(title = 'Venue Capacity by League',
         x = 'Capacity',
         y = 'Count') +
    facet_wrap(~League, ncol = 2, scales = 'free_x') +
    theme(plot.title = element_text(hjust = .5))

NFL teams clearly tend to have the largest venues, followed by MLB teams. NBA venues have only a slightly greater capacity than NHL venues, which should not be surprising: about a third of NHL and NBA teams share venues, so aside from some small seating rearrangements, the two leagues should have similar capacity characteristics.

sharedVenues = venues %>%
    group_by(Name) %>%
    summarise(TeamCount = n()) %>%
    filter(TeamCount > 1) %>%
    inner_join(venues, by = c('Name' = 'Name'))

ggplot(sharedVenues, aes(League, Name, col = League)) +
    geom_point(pch = 'x', size = 8) + 
    labs(title = 'Venues Shared by Multiple Teams',
         x = 'League',
         y = 'Venue') +
    theme(plot.title = element_text(hjust = .5))

Next we’ll shift our focus to the year in which each venue opened.

venues %>%
    mutate(outlierCaption = ifelse(Opened < 1965, paste0(Name, ', ', Opened), '')) %>%
    ggplot(., aes(League, Opened, fill = League)) +
    geom_boxplot() + 
    geom_text(aes(label = outlierCaption), hjust = -.2, size = 2) + 
    labs(title = 'Year Opened by League',
         x = 'League',
         y = 'Year Opened') +
    scale_y_continuous(breaks = seq(1910,2020,10)) + 
    theme(plot.title = element_text(hjust = .5))

The MLB has the two oldest stadiums in Fenway Park and Wrigley Field, followed by the Los Angeles Memorial Coliseum and Soldier Field in the NFL.

It would be interesting to see if the capacity of a stadium has anything to do with the year in which it opened. We explore this possibility in the plots below.

ggplot(venues, aes(Opened, Capacity, col = League)) +
    geom_point(alpha = .7) +
    geom_smooth(method = 'lm', lty = 2, size = .7) +
    facet_wrap(~League, ncol = 2, scales = 'free')

There is no clearly discernible relationship between the year that a venue opened and its seating capacity. Our regression lines for the MLB, NFL, and NHL are nearly flat. Our NBA regression line trends downward, but the confidence bands include the possibility of a slope of zero.
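
Reading the slope off a plotted confidence band is approximate; a more direct check is to fit the model and inspect the confidence interval of the slope itself. The following is a minimal sketch of that check using `lm()` and `confint()`. The `toy` data frame is made up for illustration and is not the scraped venue data, so the numbers it produces say nothing about the actual leagues.

```r
# Made-up toy data standing in for one league's venues; not the scraped values
toy = data.frame(
    Opened = c(1995, 1999, 2003, 2007, 2011, 2015, 2019),
    Capacity = c(19200, 18500, 19400, 18300, 19100, 18400, 18900)
)

# Fits the same kind of linear model that geom_smooth(method = 'lm') draws
fit = lm(Capacity ~ Opened, data = toy)

# 95% confidence interval for the slope on Opened
slope_ci = confint(fit, 'Opened', level = 0.95)

# TRUE when a slope of zero cannot be ruled out at this level
includes_zero = slope_ci[1, 1] <= 0 && slope_ci[1, 2] >= 0

print(coef(fit)[['Opened']])
print(slope_ci)
print(includes_zero)
```

Applied to the real data, the same check could be run once per league, for example on `filter(venues, League == 'NBA')`.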

Stadium Locations of Major American Sports Leagues

Next we’d like to visualize the locations of the different sports venues. Our existing dataset contains information about city and state, but we can be more precise by pulling in latitude and longitude values.

# Reads csv of coordinate data
coordPath = 'C:/Users/Rick/Documents/Projects/Sports_Venues/venueCoordinates.csv'
coordinates = read.csv(coordPath, stringsAsFactors = F)
head(coordinates) %>% 
    kable() %>%
    kable_styling(bootstrap_options = "striped", full_width = F)

| sport | lat | lon | team_name | channel |
|---|---|---|---|---|
| NBA | 25.7814 | -80.1878 | Miami Heat | Fox Sports Sun |
| NBA | 28.5392 | -81.3836 | Orlando Magic | Fox Sports Florida |
| NBA | 29.4269 | -98.4375 | San Antonio Spurs | Fox Sports Southwest |
| NBA | 29.7508 | -95.3622 | Houston Rockets | Root Sports Southwest |
| NBA | 29.9489 | -90.0822 | New Orleans Pelicans | Fox Sports New Orleans |
| NBA | 32.7904 | -96.8103 | Dallas Mavericks | Fox Sports Southwest |

Joining our coordinate data with our venue data allows us to generate an interactive leaflet map of the stadium locations, color-coded by league.

venues = venues %>% 
    left_join(coordinates, by = c('Team' = 'team_name')) %>%
    rename(Latitude = lat, Longitude = lon) %>%
    select(League, Name, Capacity, City, State, Team, Opened, Latitude, Longitude)
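# Maps each league to a fixed color; League is a character column, so it is
# passed as the domain directly (levels() would return NULL here)
pal = colorFactor(c("red", "darkgreen", "darkorange", "blue"), domain = venues$League)
# Builds the popup text shown when a marker is clicked
popHTML = paste0('<dl>',
                 '<dd>Venue: ', venues$Name, '</dd>',
                 '<dd>Location: ', venues$City, ', ', venues$State, '</dd>',
                 '<dd>Team: ', venues$Team, '</dd>',
                 '<dd>Capacity: ',  venues$Capacity, '</dd>',
                 '</dl>')
# Draws one circle per venue, sized by capacity and colored by league
venues %>% 
    leaflet() %>% 
    addTiles() %>% 
    addCircleMarkers(lng = ~Longitude,
                     lat = ~Latitude,
                     clusterOptions = markerClusterOptions(), 
                     popup = popHTML, 
                     color = ~pal(League), 
                     fillOpacity = 1,
                     radius = ~Capacity / 6000) %>% 
    addLegend(pal = pal,
              values = ~League)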
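
Because the coordinates are attached with a left join on team name, any team whose name is spelled differently in the two sources ends up with missing coordinates and no marker on the map. The sketch below shows one way to list such teams. It uses base R's `merge()` with `all.x = TRUE` as a stand-in for `left_join()`, and both toy tables, including the team named 'Hypothetical Team', are made up for illustration.

```r
# Toy stand-ins for the venue and coordinate tables; illustrative values only
venues_demo = data.frame(
    Team = c('Miami Heat', 'Orlando Magic', 'Hypothetical Team'),
    stringsAsFactors = FALSE
)
coords_demo = data.frame(
    team_name = c('Miami Heat', 'Orlando Magic'),
    lat = c(25.7814, 28.5392),
    lon = c(-80.1878, -81.3836),
    stringsAsFactors = FALSE
)

# Keeps every venue row, filling coordinates with NA where no team matches
joined = merge(venues_demo, coords_demo,
               by.x = 'Team', by.y = 'team_name', all.x = TRUE)

# Teams that found no match in the coordinate table
unmatched = joined$Team[is.na(joined$lat)]
print(unmatched)
```

Running the equivalent check on the real tables before mapping would show whether any venue is silently left off the map.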