Module 13: Visualizing Results II
Workshop 13: Mapping Boston Social Infrastructure
You’ve learned to visualize many types of data and results, but often, readers of your analyses will want to know a key type of information as well: where? In which communities did you find this trend? How do some communities’ traits differ from others? For this, we need to know how to use geographic data, also known as geospatial data.
This workshop examines why some Boston neighborhoods host more social infrastructure than others. Social infrastructure refers to the physical spaces in our communities that build social ties between residents. These include community spaces, like libraries or community centers; places of worship, like mosques, synagogues, and churches; social businesses, like cafes, barbershops, and nail salons; and parks, like green belts, squares, and fountains.
Together with undergraduates and master's students in the School of Public Policy and Urban Affairs at Northeastern University, our team at the Aldrich Resilience Lab mapped social infrastructure throughout core Boston neighborhoods in Fall 2021. Our team has found anecdotal evidence that neighborhoods of color in Boston host much less social infrastructure than others; let's test that claim! In this workshop, you will learn to visualize social outcomes like social infrastructure on a map, work with geospatial data, and build regression models using spatial data. Let's get started!
0. Import Data
Load Packages
In this workshop, we’ll use two new packages: (1) sf lets us put spatial data into data.frames and (2) ggspatial helps us visualize spatial data.
Please load these packages, and then load the mypoints, myshapes, and mylines objects described in the next sections by clicking on each tab.
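If you have never installed these packages before, you may need a one-time install first; for example:
install.packages(c("sf", "ggspatial"))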
library(tidyverse)
library(viridis)
library(GGally)
# Mapping packages
library(sf)
library(ggspatial)
Points
We’re going to use the read_sf() function from the sf package to load in a spatial data.frame called mypoints, from the .geojson file boston_social_infra.geojson. A .geojson is a special file type for working with geographic data. Just like a .csv, it has rows and columns, but they can contain complex information, like points, polygons, and lines.
mypoints <- read_sf("boston_social_infra.geojson")
For example, let's look at the first 3 rows with head(). We see below that mypoints is a normal data.frame, but it has an extra box at the top containing several types of information:
- Geometry type: POINT: telling us that each row in our data.frame represents a distinct point on a map.
- Bounding box: xmin: -71.08489 ymin: 42.29019 xmax: -71.045 ymax: 42.34415: meaning these points all fall within a box between these longitude (x) / latitude (y) coordinates, mapped on a giant map projection called the 1984 World Geodetic System (the WGS 84 listed under Geodetic CRS).
mypoints %>%
head(3)
## Simple feature collection with 3 features and 3 fields
## Geometry type: POINT
## Dimension: XYZ
## Bounding box: xmin: -71.08489 ymin: 42.29019 xmax: -71.045 ymax: 42.34415
## z_range: zmin: 0 zmax: 0
## Geodetic CRS: WGS 84
## # A tibble: 3 × 4
## Name id group geometry
## <chr> <int> <chr> <POINT [°]>
## 1 CORCORAN MULLINS JENNISON COMMUN… 1 Community Sp… Z (-71.045 42.31905 0)
## 2 Christian Science Plaza 2 Parks Z (-71.08489 42.34415 0)
## 3 Franklin Field Elderly Community… 4 Community Sp… Z (-71.08276 42.29019 0)
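If you ever need that header metadata on its own, sf includes helper functions to extract it; here's a quick sketch:
# Get the bounding box of our points
st_bbox(mypoints)
# Get the coordinate reference system (WGS 84)
st_crs(mypoints)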
Finally, check out the rows in our dataset above. Each point has a Name, id, neighborhood, group, and geometry.
This dataset contains the following variables:
Click here for variable definitions
- Name: name of social infrastructure site.
- id: unique identifier number for each site.
- neighborhood: name of the general neighborhood where the site falls.
- group: type of social infrastructure, classified as "Community Spaces", "Places of Worship", "Social Businesses", or "Parks".
- geometry: a column where each cell contains geographic data, in this case, a pair of coordinates where our site is located! In the future, you could make these yourself (see the sketch below), but for now, you'll just use the ones provided in this workshop.
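For instance, if you collected your own coordinates someday, you could convert an ordinary data.frame into a spatial one with st_as_sf(). Here's a minimal sketch, using made-up sites and coordinates:
# Hypothetical example: two made-up sites with longitude/latitude columns
mysites <- tibble(
  Name = c("Site A", "Site B"),
  lon = c(-71.08, -71.06),
  lat = c(42.33, 42.31)) %>%
  # Convert to a spatial data.frame; crs = 4326 means WGS 84
  st_as_sf(coords = c("lon", "lat"), crs = 4326)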
Polygons
Next, we’ll load in a spatial data.frame full of polygons. These polygons are a fishnet grid of squares, each with an area of approximately 1 square kilometer. They roughly approximate a neighborhood, and there are about 73 in core Boston. We’ll load this in and call it myshapes.
myshapes <- read_sf("boston_grid.geojson")
Let's look at the first 3 rows in this spatial data.frame.
myshapes %>% head(3)
| cell | geometry |
|---|---|
| Cell 100 | POLYGON Z ((-71.04504 42.30… |
| Cell 106 | POLYGON Z ((-71.11646 42.33… |
| Cell 107 | POLYGON Z ((-71.10399 42.32… |
Unlike mypoints, myshapes has a new Geometry type: POLYGON. But just like before, it too has a Bounding box specifying the longitude (x) and latitude (y) range these polygons fall into.
When we examine the rows above, we see that each row contains the following variables:
- cell: a unique identifier for each square.
- geometry: a column where each cell contains geographic data; in this case, each row is a polygon representing the boundaries of that square.
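If you want to verify that each square really covers about 1 square kilometer, st_area() computes each polygon's area; here's a quick sketch:
myshapes %>%
  # st_area() returns square meters; divide by 1e6 for square kilometers
  mutate(area_km2 = as.numeric(st_area(geometry)) / 1e6) %>%
  head(3)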
Census Data
We also have a .csv file of census traits for every square, averaged from the census tracts that square overlaps. Each row has a unique cell identifier, which we will use to match with our myshapes data.frame later.
mydata <- read_csv("boston_census_data.csv")
Let's look at the first 3 rows.
mydata %>% head(3)
## # A tibble: 3 × 16
## cell neighborhood pop_density pop_women pop_white pop_black pop_natam
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Cell 100 Dorchester 0 40.9 54.9 20.7 0.4
## 2 Cell 106 Mission Hill 8368. 57 60.3 16.5 0.2
## 3 Cell 107 Roxbury 9357. 52.4 47.2 25.9 0.3
## # … with 9 more variables: pop_asian <dbl>, pop_pacific <dbl>,
## # pop_hisplat <dbl>, pop_some_college <dbl>, median_income <dbl>,
## # income_inequality <dbl>, pop_unemployed <dbl>,
## # median_monthly_housing_cost <dbl>, pop_age_65_plus <dbl>
This dataset contains the following variables:
Click here for variable definitions
- pop: estimated population in 2019, based on the American Community Survey's 5-year average from 2014 to 2019.
- pop_density: estimated density of residents per square kilometer, e.g. 1052 residents per square kilometer.
- pop_women: estimated percentage of residents who are women (written 0-100).
- pop_age_65_plus: estimated percentage of residents ages 65 or greater (elder population) (0-100).
- pop_black: estimated percentage of residents who are African American (0-100).
- pop_white: estimated percentage of residents who are White (0-100).
- pop_asian: estimated percentage of residents who are Asian (0-100).
- pop_hisplat: estimated percentage of Hispanic/Latinx residents (0-100).
- pop_some_college: estimated percentage of residents with some college education or more (0-100).
- income_inequality: Gini coefficient, representing income inequality from 0 (low inequality) to 1 (high inequality).
- median_income: estimated median household income, in dollars.
- median_monthly_housing_cost: median monthly housing cost, in dollars.
- pop_unemployed: estimated percentage of the labor force who are unemployed.
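As a quick sanity check on these scales, summary() will show each variable's range; for example:
# Percentages should fall between 0 and 100
summary(mydata$pop_black)
# The Gini coefficient should fall between 0 and 1
summary(mydata$income_inequality)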
Lines
Finally, for context, we will load in a spatial data.frame of lines. These lines represent Boston's public transit metro lines (known as the T).
mylines <- read_sf("boston_train_lines.geojson")
Let's look at the first 3 rows. We see the Geometry type: LINESTRING, as opposed to POINT or POLYGON, but with coordinates within the same kind of Bounding box as before. We also see the following types of variables:
- line: describes which T line each segment of track belongs to ("BLUE", "RED", "ORANGE", "GREEN", "SILVER", etc.)
- geometry: describes the geographic data for that segment, in this case, a series of points linked together into a line.
mylines %>% head(3)
## Simple feature collection with 3 features and 1 field
## Geometry type: LINESTRING
## Dimension: XY
## Bounding box: xmin: -71.06184 ymin: 42.35871 xmax: -71.02522 ymax: 42.37861
## Geodetic CRS: WGS 84
## # A tibble: 3 × 2
## line geometry
## <chr> <LINESTRING [°]>
## 1 BLUE (-71.06184 42.36116, -71.06173 42.36112, -71.06124 42.36095, -71.06079 …
## 2 BLUE (-71.02638 42.37763, -71.02637 42.37764, -71.02522 42.37861)
## 3 BLUE (-71.03031 42.37428, -71.02988 42.37467, -71.02971 42.37482, -71.02961 …
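As an aside, linestrings support handy measurements too; for instance, st_length() can estimate how much track each T line covers. A quick sketch:
mylines %>%
  # For each T line...
  group_by(line) %>%
  # ...sum up segment lengths; st_length() returns meters
  summarize(track_km = sum(as.numeric(st_length(geometry))) / 1000)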
1. Visualization
Now that we’ve imported each type of spatial data, let’s visualize it! The sf package gives us several special add-on functions to the ggplot2 package that let us work with spatial data.frames. Let’s learn about them below!
geom_sf()
We can use the geom_sf() function to map spatial data.frames. Using geom_sf(), we can add the contents of multiple spatial data.frames to one plot.
geom_sf() looks into our data.frames in the data = argument, identifies the geometry column, and automatically tells ggplot that mapping = aes(geometry = geometry), so we do not have to.
After that, we can tell geom_sf() to do any of the things a geom_ function usually does, like adjust outline color/overall color (color), background fill color (fill), size (size), transparency (alpha), etc.
Here’s our map below!
g1 <- ggplot() +
geom_sf(data = myshapes, fill = "tan", color = "white", size = 0.5) +
geom_sf(data = mypoints, color = "dodgerblue", size = 2, alpha = 0.5) +
geom_sf(data = mylines, color = "black")
# View our map!
g1
coord_sf()
We can already see that our previous map might need some adjustments. For example, our point data was only recorded for neighborhoods in the center of Boston, so let's crop our map to that range using the coord_sf() function.
We'll give it a vector of longitudes to zoom into, called xlim = c(-71.13, -71.01). (Negative, because Boston is west of the prime meridian.) Then, we'll give it a vector of latitudes to zoom into, called ylim = c(42.28, 42.38).
g1 +
# Crop the map
coord_sf(xlim = c(-71.13, -71.01),
ylim = c(42.28, 42.38)) +
# Add a nice clear theme
theme_bw() +
# Add a scale in the corner to help reader measure distance
annotation_scale() +
# Add labels
labs(subtitle = "Social Infrastructure\nin Boston Neighborhoods")
layering
We can even use a few tricks to make prettier maps, like layering. This means placing the same layer on top of itself, but with different sizes of points, polygons, or lines. This can be a helpful trick for getting clear outlines around your geom_ visuals. For example, below, we give myshapes a "darkgrey" outline of size = 10, and then layer our myshapes on top with a much more ordinary size = 0.15 to show our neighborhoods. The result is a nice, clear border around our coastline.
g2 <- ggplot() +
# Add a background layer with a big outline, to get a nice grey outline
geom_sf(data = myshapes, color = "darkgrey", size = 10) +
# Add our polygons on top, shaded steel blue
geom_sf(data = myshapes, fill = "steelblue", color = "white", size = 0.15)
g2
aes()
Even better, we can add aesthetics too, using the mapping = aes() argument, as usual. Lines can take color, so we will use the line variable in our mylines data.frame to denote color.
We can use the breaks = argument in the scale_color_manual function to order the names of each of our lines, and the values = argument can provide colors for each of those names.
g3 <- g2 +
# Add a nice white background for our lines
geom_sf(data = mylines, color = "white",
size = 4, alpha = 0.75) +
# Layer our actual lines on top of them, to get a little halo effect
geom_sf(data = mylines, mapping = aes(color = line),
size = 3, alpha = 0.5) +
# Give our T-lines appropriate colors
scale_color_manual(
# Sort the legend in this order
breaks = c("ORANGE", "RED", "GREEN", "BLUE", "SILVER"),
# Give them the following colors
values = c("darkorange", "firebrick", "seagreen", "deepskyblue", "grey"))
g3
Similarly, points can take color, but if we change them to shape = 21, we can adjust their fill as well. We will use the viridis package's "plasma" palette to get some nice colors, with a "white" outline for each point.
g4 <- g3 +
# Add our points with outlines, filled by group
geom_sf(data = mypoints, mapping = aes(fill = group),
shape = 21, color = "white", size = 3) +
# Give our points appropriate colors
scale_fill_viridis(option = "plasma", discrete = TRUE, begin = 0, end = 0.8,
# Adjust order of legend
breaks = c("Community Spaces", "Places of Worship",
           "Social Businesses", "Parks", "Other"))
g4
theme_void()
Finally, we can actually use theme_void() instead of theme_bw() to get rid of the annoying longitude and latitude lines. We lose that vital information, but it gives us back more plot space and makes the plot less messy.
g4 +
# Add a nice clear theme
theme_void() +
# Crop the map
coord_sf(xlim = c(-71.13, -71.01),
ylim = c(42.28, 42.38)) +
# Add a scale in the corner to help reader measure distance
annotation_scale() +
# Add labels
labs(subtitle = "Social Infrastructure in Boston Groups",
color = "Public Transit",
fill = "Type")
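If you'd like to export a finished map, ggsave() from ggplot2 saves the most recent plot to a file; for example (hypothetical filename):
# Save the last plot as a PNG, 8 x 6 inches
ggsave("boston_social_infra_map.png", width = 8, height = 6, dpi = 300)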
2. Tabulation
Sometimes, however, it's not enough to just visualize our data as is. First, that map above is pretty cluttered! Second, our visual tells us that the Downtown area of Boston has a lot of social infrastructure, but this is no surprise - it is the most populous area of Boston. It would be much more helpful to know how much social infrastructure is in each neighborhood per person, per square kilometer. To do this, we need to spatially join our points and polygons and tabulate the number of points in each polygon.
st_join()
Much like its cousin left_join() from the dplyr package, the st_join() function takes a polygon (e.g. myshapes) and joins in data from the mypoints data.frame for any points located within that polygon. It uses the geometry column to make matches.
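Under the hood, st_join() defaults to matching any geometries that intersect; you can write that default explicitly if you like. A quick sketch of two equivalent calls:
# st_intersects is the default spatial predicate, so these are equivalent
myshapes %>% st_join(mypoints)
myshapes %>% st_join(mypoints, join = st_intersects)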
Let’s do an example.
Let’s grab just two squares on the Northeastern University campus.
twoshapes <- myshapes %>%
filter(cell %in% c("Cell 124", "Cell 123"))
Then, we'll join in mypoints. We can see below that the square with cell = "Cell 123" contained 18 social infrastructure sites, while "Cell 124" contained 21; each site becomes a new record joined to its square's cell. If a square contained no points, it would have NA under Name, id, group, etc.
twoshapes %>%
st_join(mypoints)
## Simple feature collection with 39 features and 4 fields
## Geometry type: POLYGON
## Dimension: XYZ
## Bounding box: xmin: -71.1131 ymin: 42.33433 xmax: -71.08482 ymax: 42.3469
## z_range: zmin: 0 zmax: 0
## Geodetic CRS: WGS 84
## # A tibble: 39 × 5
## cell geometry Name id group
## * <chr> <POLYGON [°]> <chr> <int> <chr>
## 1 Cell 123 Z ((-71.11311 42.33873 0, -71.10064 42.… Fenway Comm… 18 Communi…
## 2 Cell 123 Z ((-71.11311 42.33873 0, -71.10064 42.… Justine Mee… 80 Parks
## 3 Cell 123 Z ((-71.11311 42.33873 0, -71.10064 42.… Monmouth St… 181 Parks
## 4 Cell 123 Z ((-71.11311 42.33873 0, -71.10064 42.… Ramler Park 212 Parks
## 5 Cell 123 Z ((-71.11311 42.33873 0, -71.10064 42.… Joslin Park 223 Parks
## 6 Cell 123 Z ((-71.11311 42.33873 0, -71.10064 42.… Roberto Cle… 254 Parks
## 7 Cell 123 Z ((-71.11311 42.33873 0, -71.10064 42.… Riverway pa… 457 Parks
## 8 Cell 123 Z ((-71.11311 42.33873 0, -71.10064 42.… Holy Trinit… 561 Places …
## 9 Cell 123 Z ((-71.11311 42.33873 0, -71.10064 42.… Peterboroug… 562 Parks
## 10 Cell 123 Z ((-71.11311 42.33873 0, -71.10064 42.… Playground 563 Parks
## # … with 29 more rows
summarize()
We can take that data.frame and summarize it, so that we get just one row per square (cell)! The sf package is smart, so it carries over the geometry for each cell too.
twotallies <- twoshapes %>%
st_join(mypoints) %>%
# For each square,
group_by(cell) %>%
# Tally up...
summarize(
# the total rows where group does not equal NA
total = sum(group != "NA", na.rm = TRUE),
# the rows where group equals "Parks"
parks = sum(group == "Parks", na.rm = TRUE))
twotallies
| cell | total | parks | geometry |
|---|---|---|---|
| Cell 123 | 18 | 8 | POLYGON ((-71.11311 42.3387… |
| Cell 124 | 21 | 11 | POLYGON ((-71.10064 42.3365… |
Above, we exploit a cool trick of R, too. We can count the number of rows that match a condition by writing that condition within the sum() function.
- To get the total points per polygon, we ask how many rows had a valid value for their point's group. Since only points have a group, we know that if a square has no point in it, group will be NA.
- Similarly, we can use conditions like in filter(), asking it to tally up the total parks based on how many cases have group == "Parks". This is called tabulation.
- Finally, we can make squares with 0 points report 0, rather than NA, by writing in na.rm = TRUE. That's an important step too! (See the toy example just below.)
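To see this counting trick in isolation, here's a toy sketch with a made-up vector:
# A made-up vector of groups, where NA means "no point in this square"
groups <- c("Parks", "Community Spaces", NA, "Parks")
# Count rows with any valid group (the NA is dropped by na.rm = TRUE)
sum(groups != "NA", na.rm = TRUE)     # returns 3
# Count rows matching a specific condition
sum(groups == "Parks", na.rm = TRUE)  # returns 2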
left_join()
Last, we still have just tallies of points. We need to join in our polygon data to calculate a rate. For this, we can use left_join() from the dplyr package, adding the census traits of our squares from mydata.
We can control for population and area by dividing our tally by population density, to get sites per person, per square kilometer. That gives a tiny number, so it's best if we first divide population density by 1000, to get more readable numbers. This tells us, in that square, how many social infrastructure sites we would expect per square kilometer, for every 1,000 residents.
twotallies %>%
# join in census traits, with "cell" as the matching id
left_join(by = "cell", y = mydata) %>%
# control for population in 1000s
mutate(total = total / (pop_density / 1000),
parks = parks / (pop_density / 1000))
## Simple feature collection with 2 features and 18 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: -71.1131 ymin: 42.33433 xmax: -71.08482 ymax: 42.3469
## Geodetic CRS: WGS 84
## # A tibble: 2 × 19
## cell total parks geometry neighborhood pop_density pop_women
## * <chr> <dbl> <dbl> <POLYGON [°]> <chr> <dbl> <dbl>
## 1 Cell… 1.81 0.802 ((-71.11311 42.33873, -7… Fenway/Kenm… 9971. 59.5
## 2 Cell… 1.27 0.664 ((-71.10064 42.33653, -7… Fenway/Kenm… 16561. 57.2
## # … with 12 more variables: pop_white <dbl>, pop_black <dbl>, pop_natam <dbl>,
## # pop_asian <dbl>, pop_pacific <dbl>, pop_hisplat <dbl>,
## # pop_some_college <dbl>, median_income <dbl>, income_inequality <dbl>,
## # pop_unemployed <dbl>, median_monthly_housing_cost <dbl>,
## # pop_age_65_plus <dbl>
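We can verify this arithmetic by hand. Per the output above, Cell 123 held 18 total sites with a density of about 9,971 residents per square kilometer:
# 18 sites, divided by density in thousands of residents per km²
18 / (9971 / 1000)  # ~1.81 sites per km², per 1,000 residents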
putting it together
Finally, let’s put that all together to get our final dataset for all mydata! Instead of using twoshapes, we’ll use our original spatial data.frame, myshapes.
myrates <- myshapes %>%
st_join(mypoints) %>%
# For each square,
group_by(cell) %>%
# Tally up...
summarize(
# the total rows where group does not equal NA
total = sum(group != "NA", na.rm = TRUE),
# the rows where group equals "Parks"
parks = sum(group == "Parks", na.rm = TRUE)) %>%
# join in census traits, with "cell" as the matching id
left_join(by = "cell", y = mydata) %>%
# control for population in 1000s
mutate(total = total / (pop_density / 1000),
parks = parks / (pop_density / 1000))
Finally, we can visualize our rate of total social infrastructure.
- We can set mapping = aes(fill = total) in geom_sf(), using our new data.frame myrates.
- We can also import some census block group polygons as mybg and overlay them, to get a better sense of where in the city our grid is.
# Import block-groups
mybg <- read_sf("boston_block_groups.geojson")
ggplot() +
# Add the background block groups, for context
geom_sf(data = mybg, fill = "black", color = "white", size = 0.25) +
# Add a background layer with a big outline, to get a nice grey outline
geom_sf(data = myrates, color = "darkgrey", size = 10) +
# Add our polygons on top, shaded by rate of social infrastructure
geom_sf(data = myrates, mapping = aes(fill = total), color = NA) +
# Give our points appropriate colors
scale_fill_viridis(
option = "plasma", begin = 0, end = 0.8,
# When our data is strongly skewed, it can be helpful to institute a log-color scale.
# This just stretches out the colors, while keeping the values the same
trans = "log", na.value = "black") +
# Overlay the block group outlines on top, for better context
geom_sf(data = mybg, fill = NA, color = "white", size = 0.25) +
# Add a nice clear theme
theme_void() +
# Crop the map
coord_sf(xlim = c(-71.14, -71.02),
ylim = c(42.27, 42.38)) +
# Add a scale in the corner to help reader measure distance
annotation_scale() +
# Add labels
labs(subtitle = "Social Infrastructure\nin Boston Neighborhoods",
fill = "Rate*",
caption = "* Calculated per 1000 residents\nper square kilometer.")
3. Analysis
Finally, we can apply our statistical techniques to this data to understand what trends it may uncover.
data.frame
To analyze it, we must first convert it to a normal data.frame, using the as_tibble() function from the dplyr package. We will call that data.frame mydf.
mydf <- myrates %>%
# Convert to data.frame
as_tibble() %>%
# Get rid of the geometry column
select(-geometry) %>%
# Zoom into populated cells
filter(pop_density > 0)
ggcorr()
We can then examine correlations between each variable very quickly, using the ggcorr() function from our GGally package.
mydf %>%
ggcorr(label = TRUE,
low = "dodgerblue",
mid = "white",
high = "firebrick")
To make the labels easier to read, we can also add theme_bw() with a larger base_size:
mydf %>%
ggcorr(label = TRUE,
low = "dodgerblue",
mid = "white",
high = "firebrick") +
theme_bw(base_size = 24)
geom_col()
We can also explore our data by neighborhood, using group_by() and summarize(), and then plot it with geom_col().
mysum <- mydf %>%
group_by(neighborhood) %>%
summarize(total = mean(total),
income = mean(median_income, na.rm = TRUE))
mysum %>%
ggplot(mapping = aes(x = reorder(neighborhood, total),
y = total, fill = income)) +
geom_col(color = "black") +
scale_fill_viridis(option = "plasma", begin = 0, end = 0.8) +
coord_flip() +
theme_bw() +
labs(y = "Rate of Social Infrastructure*",
x = "Neighborhood",
fill = "Estimated\nMedian\nIncome",
caption = "* per 1000 residents per square kilometer")
geom_jitter()
Or we can compare the associations between social infrastructure and several different covariates.
mydf %>%
pivot_longer(cols = c(pop_white, pop_black, pop_asian, pop_hisplat),
names_to = "ethnicity", values_to = "percent") %>%
ggplot(mapping = aes(x = percent, y = log(total), color = ethnicity)) +
geom_jitter(size = 5, alpha = 0.5) +
facet_wrap(~ethnicity, scales = "free") +
guides(color = "none")
lm()
Or, perhaps we could make a simple model, using just terms that were not collinear in our correlation matrix above, to highlight the effect of income on social infrastructure, even after accounting for other major demographic drivers.
mydf %>%
lm(formula = total ~ pop_density + pop_black +
pop_asian + pop_age_65_plus + median_income) %>%
summary()
##
## Call:
## lm(formula = total ~ pop_density + pop_black + pop_asian + pop_age_65_plus +
## median_income, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.3633 -0.8173 -0.1524 0.7376 4.6767
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.621e+00 1.236e+00 -1.311 0.1953
## pop_density 4.650e-05 4.903e-05 0.948 0.3472
## pop_black 3.165e-02 1.397e-02 2.267 0.0274 *
## pop_asian 3.957e-02 2.933e-02 1.349 0.1830
## pop_age_65_plus 5.142e-02 2.713e-02 1.895 0.0634 .
## median_income 2.032e-05 7.863e-06 2.584 0.0125 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.23 on 54 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.2246, Adjusted R-squared: 0.1528
## F-statistic: 3.129 on 5 and 54 DF, p-value: 0.01494
Conclusion
In summary, this workshop highlights several new skills, including the use of the sf package, .geojson files, geom_sf(), and st_join() to visualize and wrangle geographic data. However, we see by the end that no matter the type of data, we can usually apply the same general set of statistical techniques to analyze it.
By applying our core toolkit of stats to additional areas of data science, like visualization and mapping, you can continue to build up your skills!