Spatial data and thematic maps

This tutorial introduces you to working with spatial data and making thematic maps with the ggplot package.

Spatial data

Spatial data represents the location, size, and shape of an object on the Earth like a building, lake, mountain, or territory. Location information is what separates spatial data from other data types. It’s also known as geospatial data. In R, spatial data is usually handled with packages like sf, sp, terra, and raster.

In the following snippet, we use the sf library, one of the most frequently used library for spatial analysis.

Spatial data are practically like any other datasets you have encountered, with additional information on spatial references (such as longitude and latitude). The following example creates a dataframe, similar to what you have previously done.

Notice the x and y columns. They usually represent coordinates in a two dimensional space. Geospatial coordinates are the same, except the spherical geometry of Earth is projected into a two dimensional plane using various projections.

We use st_as_sf() function to convert that regular dataframe to a spatial dataframe. We specify x and y as spatial coordinates.

The argument coords = c("x", "y") specifies which columns of the data frame contain the coordinates of the points. The crs = 4326 argument sets the Coordinate Reference System (CRS) to EPSG:4326, which is the code for WGS84 latitude-longitude coordinate system.

# install.packages("sf")
library(sf)

# Create a simple feature object
df <- data.frame(
  id = 1:2,
  name = c("Feature 1", "Feature 2"),
  x = c(10, 11),
  y = c(12, 13)
)

sf_object <- st_as_sf(df, coords = c("x", "y"), crs = 4326)
print(sf_object)
Simple feature collection with 2 features and 2 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 10 ymin: 12 xmax: 11 ymax: 13
Geodetic CRS:  WGS 84
  id      name      geometry
1  1 Feature 1 POINT (10 12)
2  2 Feature 2 POINT (11 13)

x and y now become part of the geometry, representing the spatial location of the features.

Obtaining census data

We use tidycensus package developed by Kyle Walker to get census data within R. Being able to analyze and map census data using R can be handy for urban planners. We require a Census API key to download Census data.

Obtaining a Census API key:

  • Go to the U.S. Census Bureau’s API site:

  • Fill out the form with your information.

  • After you’ve submitted the form, you’ll receive an email containing your new API key. This is the key that you can use with the tidycensus package in R.

Remember to keep your API key secure and do not share it with others.

For the following code, remove the overwrite=TRUE, you don’t need it.

The code shows my API code. Use your API key instead.

library(tidycensus)

# Set your Census Bureau API key
census_api_key("95434b8fbba512cd47489e79519845a15524c0c0")

# Download population data for North Carolina
nc_data <- get_acs(
  geography = "tract",
  variables = "B01001_001",
  state = "NC",
  year = 2020,
  geometry = TRUE,
  progress_bar = FALSE
)

We obtain census tract level population for North Carolina. B01001_001 is the specific variable for population.

Exercise

Scroll this list and look at the different variables and estimates that you can access through the American Community Survey (ACS) API.

Making a simple thematic map

We will download decennial census tract level data for Durham county.

library(tidyverse)

durham_race <- get_decennial(
  geography = "tract",
  state = "NC",
  county = "Durham",
  variables = c(
    Hispanic = "P2_002N",
    White = "P2_005N",
    Black = "P2_006N",
    Native = "P2_007N",
    Asian = "P2_008N"
  ),
  summary_var = "P2_001N",
  year = 2020,
  geometry = TRUE,
  progress_bar = FALSE
) %>%
  mutate(percent = 100 * (value / summary_value))

Creating a thematic map using ggplot

We will create a choropleth map using the above data. A choropleth is a type of thematic map in which areas are shaded or patterned in relation to a statistical variable that represents an aggregate summary of a geographic characteristic within each area. In a choropleth map, the intensity of the color or pattern represents the magnitude of the measurement, such as the population density or percentage of an attribute’s occurrence. This provides a way to visualize how a measurement varies across a geographic area.

It’s important to note that choropleth maps can be misleading if not designed carefully. Because they represent data aggregated over areas, their appearance can be influenced by how the areas are defined (a form of the Modifiable Areal Unit Problem, or MAUP) and by the fact that larger areas can visually dominate the map, even if the variable being mapped is not particularly large or significant in those areas. Therefore, care should be taken in interpreting and designing choropleth maps.

We filter data to map distribution of Asian population in Durham County, North Carolina. A minimal amount of code to produce a simple thematic map is presented below. We use geom_sf to map spatial data in ggplot.

durham_asian <- filter(durham_race, 
                         variable == "Asian")

ggplot(data = durham_asian, aes(fill = percent)) + 
  geom_sf()

What could we improve?

  • Coordinate values are usually redundant
  • Lower percentage values are darker. This is unintuitive.
  • It needs a title
  • The gray grid in the background is redundant
  • No information about data source

Let’s modify the plot.

ggplot(data = durham_asian, aes(fill = percent)) + 
  geom_sf()+
  scale_fill_distiller(palette = "RdPu", 
                       direction = 1) + 
  labs(title = "Distribution of Asian population in Durham County census tracts",
       caption = "Data source: 2020 Decennial Census, US Census Bureau",
       fill = "Percentage of population") + 
  theme_void()

This is much neater.

scale_fill_distiller(palette = "RdPu", direction = 1) sets the color scale for the fill color. The palette = "RdPu" means it will use the “Red-Purple” palette from ColorBrewer, and direction = 1 means the palette goes from light to dark.

How else could you improve the plot?

This is a great segue to color palettes and ColorBrewer.

Color palettes using ColorBrewer

ColorBrewer is an online tool designed to help people pick good color schemes for maps and other graphics. It’s especially good for making color palettes that are easily distinguishable (even for people with colorblindness) and that print well in black and white. The RColorBrewer package in R provides access to these color palettes.

In R, you can use the RColorBrewer package to view and use ColorBrewer palettes. Here’s a simple example:

library(RColorBrewer)

# Display all color palettes
display.brewer.all()

In the case of our mapping example, the “RdPu” palette is used. This palette is part of the sequential palettes in ColorBrewer, which are suited to ordered data that progress from low to high.

Color palettes in data visualization can be broadly categorized into three types:

  • Sequential Palettes: These are suitable for ordered data that progresses from low to high. They are often used for representing quantitative data. For example, you might use a sequential palette to represent the population size in different census tracts.

  • Diverging Palettes: These are suitable for data values that deviate around a mean. They are often used when the data has a critical midpoint like an average or zero. For instance, you might use a diverging palette to represent the deviation from the average income level in different census tracts.

  • Qualitative Palettes: These are used for nominal or categorical data. They do not imply magnitude differences between legend classes. They are often used to represent different categories or groups. For instance, you might use a qualitative palette to represent different racial or ethnic groups in the census data.

Exercise

Go to the ColorBrewer website and look at various color palettes suggested for diffferent types of data and numbers of categories/bins.

Remaking the above map with categorical breaks

We convert continuous data to categories using the library classInt. We then use these categories to fill the map.

library(classInt)

# get quantile breaks. Add .00001 offset to catch the lowest value
breaks_qt <- classIntervals(c(min(durham_asian$percent) - .00001, durham_asian$percent), n = 5, style = "quantile")

breaks_qt
style: quantile
  one of 814,385 possible partitions of this variable into 5 classes
  [-1e-05,1.204485)  [1.204485,2.62307)  [2.62307,5.034092) [5.034092,7.288609) 
                 14                  14                  13                  14 
[7.288609,29.01418] 
                 14 
durham_asian <- mutate(durham_asian, percent_cat = cut(percent, breaks_qt$brks)) 

ggplot(durham_asian) + 
    geom_sf(aes(fill=percent_cat)) +
    scale_fill_brewer(palette = "OrRd") +
    theme_void()

Example of diverging color palette

# Obtain ACS data
med_inc_orange <- get_acs(
  geography = "tract",
  variables = "B19013_001", # This is an example variable code for Median Household Income
  state = "NC",
  county = "Orange",
  geometry = TRUE,
  year = 2020,
  progress_bar = FALSE
)
Getting data from the 2016-2020 5-year ACS
Downloading feature geometry from the Census website.  To cache shapefiles for use in future sessions, set `options(tigris_use_cache = TRUE)`.
# estimate divergence from median household income for the county
median_orange <- median(med_inc_orange$estimate, na.rm = TRUE)
med_inc_orange$divergence = (med_inc_orange$estimate - median_orange)/median_orange*100
palette_name <- "RdBu"
palette_colors <- brewer.pal(n = 5, name = palette_name)

# Plot the data
ggplot(med_inc_orange) +
  geom_sf(aes(fill = divergence)) + 
  scale_fill_gradientn(colors = palette_colors) +
  theme_void()

The map shows counties the divergence of household income of census tracts from the median values of the counties. Divergent color schemes are appropriate for this.

Exercise

Make the map more informative.

Faceted maps

Faceted maps (small multiples) can be very useful in comparing distribution of variables across geometries.

ggplot(durham_race, aes(fill = value)) +
  geom_sf() +
  facet_wrap(~ variable)

Exercise
  • Make the map more informative.
  • Additionally, change the color scheme from continuous to categorical. Find an appropriate number of bins by trying out different values. Which number did you ultimately settle on? Why?