Coding for Social Science
Workshop: Using GIS to find the Nearest Polling Place in Boston
How close is your nearest polling place? Questions like these are vital to the democratic process, and enabling voter turnout, but are perniciously difficult to answer! This tutorial uses the case of polling places in Boston precincts to demonstrate how to identify your distance from your nearest polling place, focusing on the sf and tidyverse packages in R.
0. Import Data
In this workshop, we’ll use a few files. This includes:
precincts.rds: a dataset of 255 precinct polygons.polling_places.rds: a dataset of 256 polling place locations.train_stops_boston.rds: a dataset of train stops on the Boston T (subway/metro system).train_lines_boston.rds: a dataset of train lines on the Boston T (subway/metro system).
Please click through the buttons below and follow the instructions to make each file!
Load Packages and Projections
First, let’s load our four main packages, and then we’ll load the projections we’ll use for our maps (aea = Albers Equal Area Conic in meters, aed = Lambert Equidistant Conic in meters, and wgs = World Global System in degrees).
# Data wrangling packages
library(tidyverse) # for tidy data wrangling
library(viridis) # for visualization
# Mapping packages
library(sf) # for spatial data.frames
library(rgdal) # for background geospatial operations
# Next, let's also save several mapping projections as objects:
# Equal Area projection
aea <- "+proj=aea +lat_1=20 +lat_2=60 +lat_0=40 +lon_0=-96 +x_0=0 +y_0=0 +ellps=GRS80 +datum=NAD83 +units=m +no_defs"
# Equal Distance projection
aed <- "+proj=eqdc +lat_0=0 +lon_0=0 +lat_1=33 +lat_2=45 +x_0=0 +y_0=0 +ellps=GRS80 +datum=NAD83 +units=m +no_defs "
# Get EPSG:4326 (WGS 84) projection
wgs <- "+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs"
# Look up more projections at spatialreference.org - proj4
#https://spatialreference.org/ref/epsg/wgs-84/Load Data
# Load precinct polygons
precincts <- read_rds("precincts.rds")
# Load polling place points
polls <- read_rds("polling_places.rds")
# Load t-lines
tlines <- read_rds("train_lines_boston.rds")
# Load t-stops
tstops <- read_rds("train_stops_boston.rds")
# All of these already have the Albert Equal Area Conic projectionHow did we get these geospatial data? Click here to find out.
Precincts
Next, let’s gather our precinct polygon data, straight from source on Boston’s open data platform.
We will save most files in this tutorial as .rds files, because it’s a convenient way to compress and work with geospatial data in R, without having to reproject each time we load in the data. However, you could just as easily save them as .kml files, which we will do at the end of this tutorial, using the sf package’s write_sf() function.
# Load in a kml file straight from source.
# You could replace this with your kml file name
read_sf("https://bostonopendata-boston.opendata.arcgis.com/datasets/2bc185ec9b0b478d8c2f7e720d116863_0.kml?outSR=%7B%22latestWkid%22%3A2249%2C%22wkid%22%3A102686%7D") %>%
# grab (and rename) just relevant files
select(ward_precinct = WARD_PRECINCT, geometry) %>%
# transform to an Equal Area projection
st_transform(crs = aea) %>%
# save as a .rds file for easy access
saveRDS("precincts.rds")Polling Places
# Download Polling Places in Boston as Points
read_sf("https://bostonopendata-boston.opendata.arcgis.com/datasets/f7c6dc9eb6b14463a3dd87451beba13f_5.kml?outSR=%7B%22latestWkid%22%3A2249%2C%22wkid%22%3A102686%7D") %>%
# Convert to Equal Area Projection
st_transform(crs = aea) %>%
# Turn our numeric Ward and Precinct codes into 2-digit codes
mutate(Ward = str_pad(Ward, width = 2, side = "left", pad = "0"),
Precinct = str_pad(Precinct, width = 2, side = "left", pad = "0"),
# And then bind them together into a ward-precinct identier
ward_precinct = paste(Ward, Precinct, sep = "")) %>%
# keep (and rename) just columns we need and get rid of extraneous data
select(id = OBJECTID, ward_precinct, geometry) %>%
# Save to file
saveRDS("polling_places.rds")Train Lines
Finally, I downloaded and converted the shape files for the Boston T (metro/subway system here). You can download them yourself from MassGIS, or import my processed version, saved as train_lines_boston.rds and train_stops_boston.rds.
View Data
# View first 3 rows of dataset
precincts %>% head(3)| ward_precinct | geometry |
|---|---|
| 0113 | MULTIPOLYGON (((1911879 535… |
| 0102 | MULTIPOLYGON (((1910977 533… |
| 0105 | MULTIPOLYGON (((1911287 533… |
# View first 3 rows of dataset
polls %>% head(3)| id | ward_precinct | geometry |
|---|---|---|
| 1 | 0101 | POINT (1910854 532402.2) |
| 2 | 0102 | POINT (1910791 532836.1) |
| 3 | 0103 | POINT (1910405 532656.4) |
# View first 3 rows of dataset
tlines %>% head(3)| line | geometry |
|---|---|
| BLUE | LINESTRING (1908975 531357,… |
| BLUE | LINESTRING (1911119 533951…. |
| BLUE | LINESTRING (1910928 533489…. |
# View first 3 rows of dataset
tstops %>% head(3)| station | line | geometry |
|---|---|---|
| Ashmont | RED | POINT (1911230 522507.4) |
| Harvard | RED | POINT (1904312 531677.7) |
| Capen Street | RED | POINT (1909951 520164.8) |
Codebook
In these datasets, our variables mean:
ward_precinct: the 4-digit code of each voting precinct (and the city ward it is located within). Each precinct has one polling place.id: the unique ID number for each polling place. (Appears inpolls)line: the name of the Boston T (metro/subway) line. (Appears intlinesandtstops)stop: the name of the Boston T (metro/subway) stop. (Appears intstops)geometry: the row containing the points, polygons, or lines in our spatial data.frames. Appears in all objects.
Task 1: Visualize
First, we should always visualize our data to understand what’s going on there. We can visualize sf objects using the geom_sf() command in ggplot syntax, from the tidyverse ecosystem of data packages.
# Make and save the visual as 'g1'
g1 <- ggplot() +
# Visualize a nice background outline
geom_sf(data = precincts, color = "darkgrey", size = 5) +
# Visualize tan precinct polygons
geom_sf(data = precincts, color = "white", fill = "tan") +
# Visualize polling places as squares (shape = 22)
geom_sf(data = polls, fill = "black", size = 1, shape = 22) +
# visualize T-lines
geom_sf(data = tlines, mapping = aes(color = line), size = 0.5) +
# Vsiaulize t-stops
geom_sf(data = tstops, mapping = aes(color = line), size = 1.15) +
# Assign a color scheme to the t-lines and stops
scale_color_manual(
# Order the levels as desired
breaks = c("GREEN", "ORANGE", "RED", "BLUE", "SILVER", "MULTIPLE"),
# Now add the labels you want
labels = c("Green", "Orange", "Red", "Blue", "Silver", "Multiple"),
# And give each a color
values = c("seagreen", "orangered", "firebrick",
"dodgerblue", "grey","purple")) +
# Add a clear, simple theme
theme_void(base_size = 30) +
# And a basic legend
labs(color = "Public\nTransit\nLine",
subtitle = "Polling Places vs. Public Transit\nin Boston Precincts")
# Display the visual
g1
Task 2: Measure
Next, let’s figure out how to calculate the distance between each polling place and the nearest t-stop!
polls <- read_rds("polling_places.rds") %>%
# Transform to Equal Distance Projection
st_transform(crs = aed)
tstops <- read_rds("train_stops_boston.rds") %>%
# Transform to Equal Distance Projection
st_transform(crs = aed) %>%
# Save a copy of these train stop's geometry
# in a new column, called geometry_stop
mutate(geometry_stop = geometry)
# Identify which stops are closest to each polling places
nearest <- polls %>%
# Join together the stops nearest
st_join(tstops, join = st_nearest_feature) %>%
# Extract the coordinates of our polls and our new t-stops'
# st_coordinates returns a two-column matrix (x and y),
# so we grab the first column for the x-coordiante,
# and the second column for the y-coordinate
mutate(x_poll = st_coordinates(geometry)[,1],
y_poll = st_coordinates(geometry)[,2],
x_stop = st_coordinates(geometry_stop)[,1],
y_stop = st_coordinates(geometry_stop)[,2]) %>%
# Convert to data.frame
as.data.frame() %>%
# Get rid of our now unnecessary geometry fields
select(-geometry, -geometry_stop)Let’s take a peek at our first few rows.
| id | ward_precinct | station | line | x_poll | y_poll | x_stop | y_stop |
|---|---|---|---|---|---|---|---|
| 1 | 0101 | Maverick | BLUE | -5257178 | 6849656 | -5257184 | 6850197 |
| 2 | 0102 | Maverick | BLUE | -5256851 | 6849908 | -5257184 | 6850197 |
| 3 | 0103 | Maverick | BLUE | -5257215 | 6850171 | -5257184 | 6850197 |
Next, let’s turn these into a line-string object.
mydist <- nearest %>%
# Zoom into just precincts where their coordinates
# and their nearest t-stop coordinates could be obtained
filter(!is.na(x_poll), !is.na(y_poll),
!is.na(x_stop), !is.na(y_stop)) %>%
# For each polling station and its precinct,
group_by(id, ward_precinct) %>%
# Build a new geometry column, with a 4-cell matrix
mutate(geometry = matrix(
# listing our polling place x and our t-stop x coordinate
c(x_poll, x_stop,
# and listing our polling place y and our t-stop y coordinate
y_poll, y_stop),
# Make the matrix have two-rows and two-columns
ncol = 2, nrow = 2) %>%
# convert it to a line-string geometry
st_linestring(dim = "XY") %>%
# And bind it as a spatial data.frame,
# with an Equal Distance Conic projection
st_sfc(crs = aed)) %>%
# Turn the whole data.frame into a spatial data.frame
# with an Equal Distance Conic projection
st_as_sf(crs = aed) %>%
# Finally, calculate the distance of our linestrings!
# It will calculate in the original units of our projections,
# so meters, in this case.
mutate(distance = st_length(geometry) %>% as.numeric())Let’s take a peek at our new distance column, which measures the distance between each polling place and their nearest t-stop in meters!
mydist %>%
head(3) %>%
knitr::kable()| id | ward_precinct | station | line | x_poll | y_poll | x_stop | y_stop | geometry | distance |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0101 | Maverick | BLUE | -5257178 | 6849656 | -5257184 | 6850197 | LINESTRING (-5257178 684965… | 541.52216 |
| 2 | 0102 | Maverick | BLUE | -5256851 | 6849908 | -5257184 | 6850197 | LINESTRING (-5256851 684990… | 440.60423 |
| 3 | 0103 | Maverick | BLUE | -5257215 | 6850171 | -5257184 | 6850197 | LINESTRING (-5257215 685017… | 40.75076 |
Task 3: Analyze
Great! We can visualize these lines too, like this!
g1 +
# Let's layer on top our new lines!
geom_sf(data = mydist %>%
# Remember to transform it to equal area conic
st_transform(crs = aea), color = "black")Or we could even analyze them, like this!
mydist %>%
as.data.frame() %>%
ggplot(mapping = aes(x = distance)) +
geom_histogram(mapping = aes(y = ..density..),
fill = "dodgerblue", color = "white") +
geom_density(fill = "dodgerblue", alpha = 0.5) +
labs(x = "Distance between Polling Place\nand Nearest T-Stop (meters)",
y = "(%) Frequency\n(n = 256 polling places)") +
theme_bw(base_size = 24)And that’s a wrap!
Task 4: Export
Finally, we need to export our file! We can use write_sf() for this.
We can write it to .kml format, like so:
mydist %>%
write_sf("mydistances.kml", driver = "kml", delete_layer = TRUE)Or we can write it to a shapefile!
mydist %>%
write_sf("mydistances.shp", driver = "ESRI Shapefile", delete_layer = TRUE)Hope this helps! Feel free to reach out if you have questions.