Competitive Youth Soccer in the D.C. Metropolitan Area

Overview

In the wake of the U.S. men’s national soccer team failing to qualify for the 2018 FIFA World Cup, an unimaginable series of events left supporters, former players, and members of the media trying to make sense of it all. Some responses were measured—others impassioned, delivered in the heat of the moment. But most shared a common thread: frustration with the country’s youth development system.

More specifically, there emerged a collective assault on “pay-to-play,” a structure which, some argue, limits access for low-income children of color to high-quality, competitive soccer. Building off that argument, I decided to study two lesser-discussed, yet closely related, pieces of the puzzle: location and exposure.

A Washington, D.C.-area resident, I decided to limit my sample to this area. The clubs participating in the National Capital Soccer League’s Fall 2017 season set the geographic limits of my study (D.C. and the surrounding counties in northern Virginia, southern Maryland, and eastern West Virginia). Based on this information, I collected field addresses to calculate the mean location of each club, and I collected race and poverty data for each census tract in the region from the U.S. Census Bureau’s 2015 American Community Survey (ACS).

Most notably, the opportunities to play competitive soccer for children in mostly nonwhite neighborhoods are few and far between. When plotting the points on the map (see the Race section below), a 20-plus-mile-wide hole opened up in the area southeast of D.C., where the share of nonwhite residents is 60 percent or greater in a majority of the census tracts. And out of the 380 soccer fields in the sample, fewer than 15 percent are located in majority-minority census tracts. (For reference, out of the 1,600 census tracts in the sample, 32 percent are majority-minority.)

Socioeconomic diversity (by way of poverty rates) also played a role, though less prominent. Out of the 380 soccer fields in the sample, just one is located in a census tract with a poverty rate greater than 20 percent. (For reference, out of the 1,600 census tracts in the sample, 15 percent have poverty rates greater than 20 percent.)

I did try a handful of different regression techniques in an attempt to explain causality within the sample. But the shape of the data (only 177 of the 1,600 census tracts contained at least one field) prevented both the nonwhite share and poverty rate of a census tract from being accurate or reliable predictors of the presence of a soccer field on which competitive games are played.

While no conclusions about the state of U.S. youth soccer should be drawn from this study, it does reveal some interesting, albeit alarming, trends within the region of our nation’s capital.

This is not to say that children growing up in these underserved areas are stripped of all opportunities to enjoy the beautiful game. Organizations like DC SCORES and the U.S. Soccer Foundation work to ensure that they are not completely excluded. But if the stated goal of the soccer community is to develop a more racially and socioeconomically diverse pipeline through the ranks of our senior U.S. national teams, these findings may suggest that there’s still work to be done.

Data Collection

Most youth clubs in the D.C. metropolitan area have at least one team in the National Capital Soccer League (NCSL). Beginning here, I collected the name of each club from the league's listings webpage, as well as the name and street address of each field on which each club plays.

Note: Freedom Soccer Club was removed from this list, as it moved to the Central Maryland Soccer Association prior to the Fall 2017 season. I allowed the reach of the remaining National Capital Soccer League clubs to set the geographic boundaries for this project. Then, other clubs from the states of Maryland and Virginia which participate in other leagues (e.g., Virginia Premier League, Club Champions League, Central Maryland Soccer Association), but which still fell within those geographic boundaries, were also added to the data set.

Then, using the {RDSTK} package, I translated each field address to a set of coordinates.

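library(RDSTK)

## Read every column as character so that addresses are left untouched
fields.df <- read.csv("fields.csv", colClasses = "character")

## Geocode each address, then stack the one-row results into a single data frame
lats.longs <- do.call(rbind, lapply(fields.df$field.address, street2coordinates))
fields.df$lat <- lats.longs$latitude
fields.df$lng <- lats.longs$longitude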

Using functions from the {dplyr} package, I calculated the geographic center of each club’s field locations. All field coordinates were weighted equally, including those clustered together in parks or other recreation areas.

Note: I want to acknowledge that weighting each field location by the number of games played there over the course of the season may have resulted in a more accurate depiction of where clubs play their soccer, “on average.” However, given what little difference this weighting was likely to make, I determined it was not worth the hours it would take to also pull game data off NCSL’s antiquated website.

library(dplyr)

clubs <- fields.df %>% 
        group_by(club.name) %>% 
        dplyr::summarize(lat = mean(lat), lng = mean(lng))
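
The weighting alternative mentioned in the note above can be sketched in base R. This is only an illustration under stated assumptions: the club, its three fields, and the games column are hypothetical (no game counts were collected for this study), and weighted.mean simply stands in for the equal-weight mean used above.

```r
## Hypothetical fields for a single club; `games` is an assumed column
## that does not exist in the original fields.csv
fields.demo <- data.frame(
        club.name = c("Club A", "Club A", "Club A"),
        lat       = c(38.90, 39.00, 39.10),
        lng       = c(-77.00, -77.10, -77.20),
        games     = c(10, 5, 5)
)

## Equal weighting, as in the analysis above
equal.center <- c(lat = mean(fields.demo$lat),
                  lng = mean(fields.demo$lng))

## Weighting each field by the (hypothetical) number of games played there
weighted.center <- c(lat = weighted.mean(fields.demo$lat, w = fields.demo$games),
                     lng = weighted.mean(fields.demo$lng, w = fields.demo$games))

print(equal.center)
print(weighted.center)
```

In this toy case the weighted center shifts toward the busiest field, which is the effect the note describes; how much it would move the real club locations is not something this sketch can say.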

And finally, I cleaned up the messy club names.

clubs[1,1] <- "A3 Soccer"
clubs[10,1] <- "Barca Futebol Clube"
clubs[17,1] <- "Capital FC"
clubs[29,1] <- "FSV Ashburn"
clubs[50,1] <- "Olney Boys & Girls Club"
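
Fixing names by row number works for a one-off run, but it silently breaks if the rows are ever reordered. A possible alternative is to recode by value with a named lookup vector. The sketch below uses made-up "messy" names, since the raw values are not shown in this post; clubs.demo and fixes are hypothetical objects.

```r
## Hypothetical raw names; the real messy values are not shown in the post
clubs.demo <- data.frame(
        club.name = c("A3 SOCCER ", "AC Cugini", "capital fc"),
        stringsAsFactors = FALSE
)

## Names of the vector are the raw values, elements are the cleaned values
fixes <- c("A3 SOCCER " = "A3 Soccer",
           "capital fc" = "Capital FC")

## Replace only the rows whose current name appears in the lookup
hit <- clubs.demo$club.name %in% names(fixes)
clubs.demo$club.name[hit] <- unname(fixes[clubs.demo$club.name[hit]])

print(clubs.demo$club.name)
```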

The first few rows of the clubs data frame look like this.

club.name	lat	lng
A3 Soccer	38.99	-76.67
AC Cugini	39.32	-77.28
Alexandria Soccer Association	38.82	-77.06
Alliance Soccer Club	39.16	-77.19
Annandale Boys & Girls Club	38.83	-77.19
Annandale United FC	38.83	-77.2

Clubs and the Communities They Serve

To put the locations of the clubs in context, I used the {acs} package to gather data on race and poverty (at the census tract level) from the U.S. Census Bureau’s 2015 American Community Survey (ACS). I also used the {tigris} package to collect shape files for each census tract.

Using the table of coordinates generated by the street2coordinates function, I identified the FIPS code of each county for which I would need to gather census data.

library(tigris)

## County fips numbers
counties.md <- c(3, 21, 43, 31, 9, 33, 17, 27, 37)
counties.va <- c(510, 13, 59, 600, 47, 107, 630, 187, 683, 685, 153, 610, 179, 61, 840, 69, 43, 177)
counties.dc <- 1
counties.wv <- c(37, 3)

## Pull shapefiles for each state and combine
tracts.md <- tracts("MD", counties.md, cb = TRUE)
tracts.va <- tracts("VA", counties.va, cb = TRUE)
tracts.dc <- tracts("DC", counties.dc, cb = TRUE)
tracts.wv <- tracts("WV", counties.wv, cb = TRUE)
tracts.total <- rbind(tracts.md, tracts.va, tracts.dc, tracts.wv)

Race

After building a function that pulls the necessary data on race from the ACS, I ran it on each state’s counties and combined those results into a single data frame. (Credit to ZevRoss for some guidance here.)

library(acs)
library(stringr)

## Function that will pull race stats for each state's tracts
pullRace <- function(state, counties) {
        geo <- geo.make(state = state, county = counties, tract = "*")
        race <- acs.fetch(2015, 5, geo, table.number = "B02001", col.names = "pretty")
        df <- data.frame(paste0(str_pad(race@geography$state, 2, "left", pad="0"), 
                                   str_pad(race@geography$county, 3, "left", pad="0"),
                                   str_pad(race@geography$tract, 6, "left", pad="0")),
                            race@estimate[,c("Race: Total:", "Race: White alone")],
                            stringsAsFactors = FALSE)
        rownames(df) <- 1:nrow(df)
        names(df) <- c("GEOID", "total", "white")
        df$nonwhite.perc <- (1-(df$white/df$total))
        df
}

## Run race function for each state and combine
race.md <- pullRace("MD", counties.md)
race.va <- pullRace("VA", counties.va)
race.dc <- pullRace("DC", counties.dc)
race.wv <- pullRace("WV", counties.wv)
race.total <- rbind(race.md, race.va, race.dc, race.wv)
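
The GEOID built inside pullRace is an 11-character string: a 2-digit state code, a 3-digit county code, and a 6-digit tract code, each left-padded with zeros. A base R sketch of the same padding, using sprintf in place of str_pad, might look like the following. The helper name make_geoid and the tract numbers are illustrative assumptions; 11, 24, and 54 are the state FIPS codes for D.C., Maryland, and West Virginia.

```r
## Hypothetical helper: zero-pad and concatenate state, county, and tract codes
make_geoid <- function(state, county, tract) {
        sprintf("%02d%03d%06d",
                as.integer(state), as.integer(county), as.integer(tract))
}

## D.C. (state 11, county 1) with an illustrative tract code of 100
geoid.dc <- make_geoid(11, 1, 100)

## Vectorized over several illustrative tracts in Maryland and West Virginia
geoid.other <- make_geoid(c(24, 54), c(3, 37), c(701101, 972101))

print(geoid.dc)
print(nchar(geoid.dc))
print(geoid.other)
```

Keeping the result as a character string matters here: a numeric GEOID would drop leading zeros and fail to match the GEOID column in the shapefiles.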

To place all of that information on a Leaflet map, I combined the data frame and shape files using the geo_join function.

library(leaflet)

## Join data frames and shapefiles
race.tracts <- geo_join(tracts.total, race.total, "GEOID", "GEOID")
race.tracts$nonwhite.perc <- race.tracts$nonwhite.perc*100

## Set color palette for map
pal1 <- colorBin("Greys", race.tracts$nonwhite.perc, 8, na.color = NA)

## Leaflet map settings
clubs %>% leaflet() %>%
        addProviderTiles("CartoDB.Positron") %>%
        addPolygons(data = race.tracts, 
                    fillColor = ~pal1(nonwhite.perc), 
                    color = "#b2aeae",
                    fillOpacity = 0.7, 
                    weight = 0.5, 
                    smoothFactor = 0.2) %>%
        addCircleMarkers(weight = 0, 
                         radius = 3,
                         fillColor = "#ff7256",
                         fillOpacity = 0.9, 
                         popup = clubs$club.name) %>% 
        addLegend(title = "Share of Nonwhite Residents",
                  pal = pal1, 
                  values = race.tracts$nonwhite.perc, 
                  position = "bottomleft", 
                  labFormat = labelFormat(suffix = "%"))
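
One detail of the palette call above: when colorBin is given a single number of bins, its default behavior (pretty = TRUE) is to choose break points with base R's pretty function, so the legend may not show exactly eight bins. The sketch below only illustrates pretty itself, using an assumed 0 to 100 domain rather than the real tract data.

```r
## Assumed domain: percentages running from 0 to 100
domain <- c(0, 100)

## Ask for roughly 8 intervals; pretty() picks "round" break points,
## so the number of intervals it returns can differ from 8
breaks <- pretty(domain, n = 8)

print(breaks)
print(length(breaks) - 1)  # number of bins actually produced
```

If an exact set of bins is needed, passing a vector of cut points to the bins argument is one way to avoid this.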

This map already paints a pretty clear picture, particularly in the areas to the southeast of D.C. proper. But to look at it another way, I chose to plot a histogram of soccer fields, binned by the nonwhite share of the census tract in which each field sits.

To do so, I removed duplicate soccer fields (i.e., instances in which multiple clubs share a single field) using the distinct function. And using the append_geoid function, I translated the coordinates of each field location to an eleven-digit code which corresponds with a census tract (GEOID).
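
One caveat with deduplicating on field name alone is that two different sites with the same generic name would collapse into one. The base R sketch below shows the difference between keying on the name and keying on name plus address. The rows are hypothetical, and duplicated is used here in place of distinct.

```r
## Hypothetical rows: two clubs share one field, and a third club uses a
## different site that happens to have the same name
fields.demo <- data.frame(
        club.name     = c("Club A", "Club B", "Club C"),
        field.name    = c("Field 1", "Field 1", "Field 1"),
        field.address = c("100 Park Rd", "100 Park Rd", "200 School Ln"),
        stringsAsFactors = FALSE
)

## Keep the first row for each field name (what a name-only key does)
by.name <- fields.demo[!duplicated(fields.demo$field.name), ]

## Keep the first row for each name-and-address pair
by.name.address <- fields.demo[!duplicated(fields.demo[, c("field.name", "field.address")]), ]

print(nrow(by.name))
print(nrow(by.name.address))
```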

library(censusr)

## Reshape fields table
fields.unique <- fields.df %>% 
        distinct(field.name, .keep_all = TRUE)
names(fields.unique)[5] <- "lon"

## Match GEOIDs
fields.unique$GEOID <- append_geoid(fields.unique[,c(4:5)])[,3]
fields.unique$GEOID <- strtrim(fields.unique$GEOID, 11)
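
The strtrim step above works because census geographic codes are nested: a full census block code has 15 digits, and its first 11 digits identify the tract. A small sketch with an illustrative (not real) block code shows how the pieces line up.

```r
## Illustrative 15-digit block code: state (2) + county (3) + tract (6) + block (4)
block.geoid <- "110010001001000"

## Keep the first 11 characters to get the tract-level GEOID
tract.geoid <- strtrim(block.geoid, 11)

## Split the code into its components for inspection
parts <- c(state  = substr(block.geoid, 1, 2),
           county = substr(block.geoid, 3, 5),
           tract  = substr(block.geoid, 6, 11),
           block  = substr(block.geoid, 12, 15))

print(tract.geoid)
print(parts)
```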

In order to match the soccer fields to the census data in a single table for ggplot, I first created a reference table, and then used the merge function.

library(ggplot2)

## Create lookup table
race.lookup <- race.total[,c(1,4)]

## Fill in race for each field's census tract
fields.unique <- merge(fields.unique, race.lookup, by = "GEOID")

## Build histogram
ggplot(fields.unique, aes(nonwhite.perc)) +
        geom_histogram(binwidth = 0.05, fill = "coral1") +
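        ## theme_TCR() is not part of ggplot2 and is not defined in this post;
        ## a built-in theme such as theme_minimal() can be substituted
        theme_TCR() +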
        scale_x_continuous(labels = scales::percent) +
        scale_y_continuous(expand = c(0,0)) +
        xlab("Share of Nonwhite Residents") + ylab("Number of Fields")
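
A note on the merge call above: by default merge performs an inner join, so any field whose GEOID has no match in the lookup table is dropped without warning. The sketch below, built on hypothetical rows, shows how all.x = TRUE keeps those fields and makes them easy to spot.

```r
## Hypothetical fields, one of which has a GEOID missing from the lookup
fields.demo <- data.frame(
        field.name = c("F1", "F2", "F3"),
        GEOID      = c("11001000100", "24003701101", "99999999999"),
        stringsAsFactors = FALSE
)
lookup.demo <- data.frame(
        GEOID         = c("11001000100", "24003701101"),
        nonwhite.perc = c(0.62, 0.18),
        stringsAsFactors = FALSE
)

## Default merge: inner join, unmatched field is dropped
inner <- merge(fields.demo, lookup.demo, by = "GEOID")

## Left join: every field is kept, unmatched ones get NA
left <- merge(fields.demo, lookup.demo, by = "GEOID", all.x = TRUE)
unmatched <- left$field.name[is.na(left$nonwhite.perc)]

print(nrow(inner))
print(nrow(left))
print(unmatched)
```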

Again, a clear pattern emerges. Out of the 380 soccer fields in the sample, fewer than 15 percent are located in majority-minority census tracts. (For reference, out of the 1,600 census tracts in the sample, 32 percent are majority-minority. I’ve also plotted that histogram below.)

sum(fields.unique$nonwhite.perc > 0.5)/nrow(fields.unique)

## [1] 0.1473684
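
As a back-of-the-envelope check, the share printed above can be converted to a count of fields and compared with the share of tracts quoted in the text. The inputs below are taken from this post (380 fields, the 0.1473684 output, and the rounded 32 percent figure), so the ratio is only approximate.

```r
n.fields     <- 380        # fields in the sample, from the text
share.fields <- 0.1473684  # output reported above
share.tracts <- 0.32       # majority-minority share of tracts, rounded in the text

## Number of fields located in majority-minority tracts
n.mm.fields <- round(share.fields * n.fields)

## How the fields' share compares with the tracts' share
## (a value of 1 would mean fields are spread in proportion to tracts)
ratio <- share.fields / share.tracts

print(n.mm.fields)
print(round(ratio, 2))
```

In other words, 56 of the 380 fields sit in majority-minority tracts, a little under half of what a proportional spread across tracts would give.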

Poverty

To gather information on poverty rates, I similarly built a function that pulls those data from the ACS.

## Function that will pull poverty stats for each state's tracts
pullPoverty <- function(state, counties) {
        geo <- geo.make(state = state, county = counties, tract = "*")
        poverty <- acs.fetch(2015, 5, geo, table.number = "C17002", col.names = "pretty")
        df <- data.frame(paste0(str_pad(poverty@geography$state, 2, "left", pad="0"), 
                                str_pad(poverty@geography$county, 3, "left", pad="0"),
                                str_pad(poverty@geography$tract, 6, "left", pad="0")),
                         poverty@estimate[,c("Ratio of Income to Poverty Level in the Past 12 Months: Total:", 
                                             "Ratio of Income to Poverty Level in the Past 12 Months: Under .50",
                                             "Ratio of Income to Poverty Level in the Past 12 Months: .50 to .99")],
                         stringsAsFactors = FALSE)
        rownames(df) <- 1:nrow(df)
        names(df) <- c("GEOID", "total", "poverty1", "poverty2")
        df$poverty.rate <- (df$poverty1 + df$poverty2) / df$total
        df
}

## Run poverty function for each state and combine
poverty.md <- pullPoverty("MD", counties.md)
poverty.va <- pullPoverty("VA", counties.va)
poverty.dc <- pullPoverty("DC", counties.dc)
poverty.wv <- pullPoverty("WV", counties.wv)
poverty.total <- rbind(poverty.md, poverty.va, poverty.dc, poverty.wv)
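
The poverty rate computed in pullPoverty is the number of residents with an income-to-poverty ratio under 1.00 (the "Under .50" and ".50 to .99" columns) divided by the total. The sketch below works through that arithmetic on hypothetical counts, including a tract with a total of zero, which produces NaN and would need na.rm = TRUE in any later sum.

```r
## Hypothetical tract-level counts; the third tract has no measured population
acs.demo <- data.frame(
        total    = c(4000, 2500, 0),
        poverty1 = c(150, 400, 0),   # income-to-poverty ratio under .50
        poverty2 = c(250, 350, 0)    # income-to-poverty ratio .50 to .99
)

acs.demo$poverty.rate <- (acs.demo$poverty1 + acs.demo$poverty2) / acs.demo$total

## 0 / 0 is NaN, so a plain sum over the comparison returns NA
n.high.naive <- sum(acs.demo$poverty.rate > 0.2)

## Dropping missing values gives the count of tracts above 20 percent
n.high <- sum(acs.demo$poverty.rate > 0.2, na.rm = TRUE)

print(acs.demo$poverty.rate)
print(n.high.naive)
print(n.high)
```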

To place all of that information on a Leaflet map, I combined the data frame and shape files using the geo_join function.

## Join data frames and shapefiles
poverty.tracts <- geo_join(tracts.total, poverty.total, "GEOID", "GEOID")
poverty.tracts$poverty.rate <- poverty.tracts$poverty.rate*100
poverty.tracts$poverty.bins <- cut(poverty.tracts$poverty.rate, 
                                   breaks = c(-1, 10, 20, 30, 40, 100), 
                                   labels = c("Less than 10%", "10 - 20%", "20 - 30%", "30 - 40%", "More than 40%"))

## Set color palette for map
pal2 <- colorFactor("Greys", poverty.tracts$poverty.bins, na.color = NA)

## Leaflet map settings
clubs %>% leaflet() %>%
        addProviderTiles("CartoDB.Positron") %>%
        addPolygons(data = poverty.tracts, 
                    fillColor = ~pal2(poverty.bins), 
                    color = "#b2aeae",
                    fillOpacity = 0.7, 
                    weight = 0.5, 
                    smoothFactor = 0.2) %>%
        addCircleMarkers(weight = 0, 
                         radius = 3,
                         fillColor = "#ff7256",
                         fillOpacity = 0.9, 
                         popup = clubs$club.name) %>% 
        addLegend(title = "Share of Residents<br>Below the Poverty Line",
                  pal = pal2, 
                  values = poverty.tracts$poverty.bins, 
                  position = "bottomleft")
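
The cut call above uses right-closed intervals by default, so a tract with a poverty rate of exactly 10 percent lands in "Less than 10%" and one at exactly 20 percent lands in "10 - 20%". A quick sketch on a few made-up rates, using the same breaks and labels, shows where boundary values fall.

```r
## Made-up poverty rates (in percent), including values on the bin edges
rates <- c(0, 10, 10.01, 20, 45, 100)

bins <- cut(rates,
            breaks = c(-1, 10, 20, 30, 40, 100),
            labels = c("Less than 10%", "10 - 20%", "20 - 30%", "30 - 40%", "More than 40%"))

print(data.frame(rate = rates, bin = as.character(bins)))
```

Setting the lowest break at -1 rather than 0 is what keeps a rate of exactly zero inside the first bin.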

Again, to view the data another way, I plotted a histogram of soccer fields, binned by the poverty rate of the census tract in which each field sits. To match the fields to the census data in a single table for ggplot, I first created a reference table and then used the merge function.

## Create lookup table
poverty.lookup <- poverty.total[,c(1,5)]

## Fill in poverty rate for each field's census tract
fields.unique <- merge(fields.unique, poverty.lookup, by = "GEOID")

## Build histogram
ggplot(fields.unique, aes(poverty.rate)) +
        geom_histogram(binwidth = 0.05, fill = "coral1") +
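        ## theme_TCR() is not part of ggplot2 and is not defined in this post;
        ## a built-in theme such as theme_minimal() can be substituted
        theme_TCR() +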
        scale_x_continuous(labels = scales::percent, limits = c(0, 1)) +
        scale_y_continuous(expand = c(0,0)) +
        xlab("Share of Residents Living Below the Poverty Line") + 
        ylab("Number of Fields")

Out of the 380 soccer fields in the sample, just one is located in a census tract with a poverty rate greater than 20 percent. (For reference, out of the 1,600 census tracts in the sample, 15 percent have poverty rates greater than 20 percent. I’ve also plotted that histogram below.)

sum(fields.unique$poverty.rate > 0.2, na.rm = TRUE)

## [1] 1