Mapping Social Justice Data in R

In this tutorial, we will explore how to create informative and aesthetically pleasing maps in R using the ggplot2 and sf packages. Mapping is a powerful tool for visualizing spatial patterns, especially when dealing with social justice data that often has geographic components. We will delve into the technical aspects of mapping in R and discuss conceptual considerations for producing effective maps.

To follow along, you should have a basic understanding of R and have the necessary packages installed, including tidyverse, tidycensus, sf, tigris, and fuzzyjoin. We will assume you are familiar with obtaining data from sources like the Census and have basic data manipulation skills.

# Load necessary libraries
library(tidyverse)
library(tidycensus)
library(sf)
library(tigris)
library(fuzzyjoin)

Working with Spatial Data

Spatial data in R is often handled using the sf (simple features) package, which provides a standardized way to store and manipulate spatial objects. This integration allows us to use dplyr verbs for data manipulation and ggplot2 for visualization seamlessly.
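To make the idea of an sf object concrete, here is a minimal sketch that builds one by hand and runs dplyr verbs on it. The two squares, the region labels, and the income values are hypothetical toy data, not real boundaries.

```r
library(sf)
library(dplyr)

# Helper (hypothetical) that builds a unit square with its lower-left corner at (x0, y0)
square <- function(x0, y0) {
  st_polygon(list(rbind(
    c(x0, y0), c(x0 + 1, y0), c(x0 + 1, y0 + 1), c(x0, y0 + 1), c(x0, y0)
  )))
}

# An sf object is a data frame with a geometry column attached
toy_regions <- st_sf(
  region = c("A", "B"),
  median_income = c(52000, 71000),
  geometry = st_sfc(square(0, 0), square(1, 0), crs = 4326)
)

# dplyr verbs work on sf objects, and the geometry column comes along
high_income <- toy_regions %>%
  filter(median_income > 60000) %>%
  mutate(income_thousands = median_income / 1000)

class(high_income)  # still includes "sf"
```

Because the result keeps its sf class, it can be passed straight to geom_sf() without further conversion.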

To begin, we might want to obtain geographic boundaries, such as counties or states. The tigris package offers easy access to these shapefiles.

# Retrieve county boundaries for a specific state
state_counties <- counties(state = "CA", cb = TRUE, class = "sf", progress_bar = FALSE)

The cb = TRUE parameter requests cartographic boundary files, which are simplified for mapping purposes, and class = "sf" ensures the data is returned as an sf object.

Preparing Your Data

Suppose we have a dataset containing a metric, like median household income, for each county. Before mapping, we need to ensure our data aligns with the spatial data.

# Sample data frame with social justice data
social_data <- data.frame(
  county = c("Los Angeles", "San Diego", "Orange"),
  median_income = c(65000, 75000, 80000)
)

We need to join this data with our spatial data. The key is to have a common field for the join, such as the county name.

Joining Spatial and Attribute Data

Joining spatial data with attribute data requires careful handling, especially when names might not match perfectly due to differences in formatting or spelling. We can use a standard left_join if we are confident about the matching.

# Join attribute data to spatial data
map_data <- state_counties %>%
  left_join(social_data, by = c("NAME" = "county"))

If discrepancies exist, fuzzy matching techniques can help.
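One way to find out whether discrepancies exist is to list the attribute rows that have no exact partner. The sketch below uses dplyr's anti_join() on two small hypothetical data frames; county_names stands in for the NAME column of the boundary data, and the misspelling is deliberate.

```r
library(dplyr)

# Hypothetical stand-in for the NAME column of the county boundaries
county_names <- data.frame(NAME = c("Los Angeles", "San Diego", "Orange"))

# Hypothetical attribute data with one misspelled county name
attribute_data <- data.frame(
  county = c("Los Angeles", "San Deigo", "Orange"),
  median_income = c(65000, 75000, 80000)
)

# Keep only the attribute rows that have no exact match among the county names
unmatched <- anti_join(attribute_data, county_names, by = c("county" = "NAME"))
unmatched
```

If unmatched comes back empty, an ordinary left_join is probably enough; otherwise the listed names are the ones to repair or to match fuzzily.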

Handling Discrepancies with Fuzzy Matching

When exact matches are not possible, the fuzzyjoin package offers functions to join data based on string similarity. As a minimal example, consider:

# Create a sample data frame with slight discrepancies in 'city' names
data1 <- data.frame(
  city = c("New York City", "Los Angles", "Chicago"),
  population = c(8419000, 3980000, 2716000)
)

# Create another data frame with correct 'city' names
data2 <- data.frame(
  city = c("New York", "Los Angeles", "Chicago"),
  median_income = c(63000, 58000, 53000)
)

# An exact join only matches names that are spelled identically
join1 <- left_join(data1, data2, by = "city")

# Perform a fuzzy join on the 'city' column
join2 <- stringdist_left_join(
  data1, data2,
  by = "city",
  method = "jw",      # Jaro-Winkler family; with stringdist's default p = 0 this is the plain Jaro distance
  max_dist = 0.2,     # Allow for slight discrepancies
  distance_col = "dist" # Include a column showing the string distance
)

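Fuzzy matching FTW! The exact join (join1) finds a median_income only for Chicago, while the fuzzy join (join2) also pairs "New York City" with "New York" and "Los Angles" with "Los Angeles". Let’s go back and, just to be safe, use fuzzyjoin for our social data.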

# Perform a fuzzy join
map_data <- social_data %>%
  stringdist_left_join(
    state_counties,
    by = c("county" = "NAME"),
    method = "jw", # Jaro-Winkler family (plain Jaro distance with the default p = 0)
    max_dist = 0.1
  )

After joining, it’s important to verify the matches to ensure data accuracy.
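A minimal sketch of what such a verification might look like, using plain data frames so that it runs without downloading boundaries. The inputs, the "Atlantis" row, and the names checked, no_match, multi_match, and inexact are all hypothetical.

```r
library(dplyr)
library(fuzzyjoin)

# Hypothetical attribute data: one exact name, one typo, one name that should not match
attribute_data <- data.frame(
  county = c("Los Angeles", "San Deigo", "Atlantis"),
  median_income = c(65000, 75000, 80000)
)
reference <- data.frame(NAME = c("Los Angeles", "San Diego", "Orange"))

checked <- attribute_data %>%
  stringdist_left_join(
    reference,
    by = c("county" = "NAME"),
    method = "jw",
    max_dist = 0.1,
    distance_col = "dist"  # keep the distance so matches can be inspected
  )

# Rows that found no partner within max_dist
no_match <- checked %>% filter(is.na(NAME))

# Input rows that matched more than one reference name
multi_match <- checked %>% count(county) %>% filter(n > 1)

# Matches that are close but not exact, worth reading by eye
inexact <- checked %>% filter(!is.na(dist), dist > 0)
```

Reading the inexact rows by eye is the important step: a small string distance does not guarantee that two names refer to the same place.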

Visualizing the Map

With our data prepared, we can create the map using ggplot2. The geom_sf() function is designed to plot spatial data stored as sf objects. However, when we merge spatial data, like state_counties, with regular data frames that contain attributes such as income or population, the result may come back as an ordinary data frame: the geometry column is still present, but the sf class that marks the object as spatial can be lost. This is especially likely when the non-spatial table is on the left side of the join, as in our fuzzy join above. Therefore, before plotting, we always check and, if necessary, convert the merged data back to an sf object so that the geometry is recognized and ready for visualization.

# Basic map visualization
map_data <- map_data %>%
  st_as_sf()

map_data %>%
  ggplot() +
  geom_sf(aes(fill = median_income))
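The "check and, if necessary, convert" step can be written as a small helper. This is only a sketch: ensure_sf is a hypothetical name, and the two points are toy data standing in for a joined table that still has a geometry column but has lost its sf class.

```r
library(sf)

# Hypothetical helper: convert to sf only when the class was lost
ensure_sf <- function(x) {
  if (inherits(x, "sf")) x else st_as_sf(x)
}

# A plain data frame that still carries a geometry list-column
pts <- st_sfc(st_point(c(-118.24, 34.05)), st_point(c(-117.16, 32.72)), crs = 4326)
plain_df <- data.frame(site = c("a", "b"))
plain_df$geometry <- pts

class(plain_df)                  # "data.frame": the sf class is absent
restored <- ensure_sf(plain_df)
class(restored)                  # now includes "sf"
```

Calling st_as_sf() unconditionally, as in the code above, is also fine; the explicit check mainly documents the intent.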

Enhancing the Map Aesthetics

An effective map is not just about displaying data but also about communicating it clearly. We can enhance the map by adjusting color scales, adding titles, and refining the theme.

Choosing the right color scale is crucial. The viridis color scales are perceptually uniform and colorblind-friendly.

# Apply a color scale
map_data %>%
  ggplot() +
  geom_sf(aes(fill = median_income)) +
  scale_fill_viridis_c(option = "plasma", name = "Median Income")

Including descriptive titles and labels helps contextualize the map.

# Add titles and labels
map_data %>%
  ggplot() +
  geom_sf(aes(fill = median_income)) +
  scale_fill_viridis_c(option = "plasma", name = "Median Income") +
  labs(
    title = "Median Household Income by County",
    subtitle = "An analysis of income distribution",
    caption = "Source: U.S. Census Bureau"
  )

Simplifying the map’s appearance can make it more readable.

# Simplify the map theme
map_data %>%
  ggplot() +
  geom_sf(aes(fill = median_income)) +
  scale_fill_viridis_c(option = "plasma", name = "Median Income") +
  labs(
    title = "Median Household Income by County",
    subtitle = "An analysis of income distribution",
    caption = "Source: U.S. Census Bureau"
  ) +
  theme_minimal() +
  theme(
    axis.text = element_blank(),
    axis.title = element_blank(),
    panel.grid = element_blank()
  )

Mapping Point Data

In addition to choropleth maps (area-based), we can map point data, such as pollution monitoring sites or incident locations.

# Sample point data
point_data <- data.frame(
  longitude = c(-118.2437, -117.1611),
  latitude = c(34.0522, 32.7157),
  pm25 = c(12.5, 10.2)
)

# Convert to spatial object
point_data_sf <- st_as_sf(point_data, coords = c("longitude", "latitude"), crs = 4326)

# Plot point data over the map
ggplot() +
  geom_sf(data = state_counties, fill = "white", color = "black") +
  geom_sf(data = point_data_sf, aes(color = pm25), size = 2) +
  scale_color_viridis_c(name = "PM2.5 Levels", option = "inferno")
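The crs = 4326 argument above declares that the coordinates are longitude and latitude on WGS84. When layers with different coordinate reference systems are drawn together, geom_sf() reprojects them to a common one, but for measurements it can help to inspect and transform the CRS explicitly. A minimal sketch, assuming California Albers (EPSG:3310) as a convenient projected CRS for the state:

```r
library(sf)

sites <- data.frame(
  longitude = c(-118.2437, -117.1611),
  latitude = c(34.0522, 32.7157)
)
sites_sf <- st_as_sf(sites, coords = c("longitude", "latitude"), crs = 4326)

# Inspect the CRS attached to the points
st_crs(sites_sf)$epsg

# Reproject to California Albers (EPSG:3310), a projected CRS measured in metres
sites_albers <- st_transform(sites_sf, 3310)

# Pairwise distances between the sites, now in metres
st_distance(sites_albers)
```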

Conceptual Considerations for Effective Maps

Creating a map involves more than technical execution; it’s also about effective communication.

Ensure that the map is easy to read. This involves choosing appropriate color contrasts, avoiding clutter, and using legible fonts.

Colors should accurately represent the data. For sequential data, use gradients; for diverging data, use a color scheme that highlights the midpoint.

Maps can be misleading if not properly designed. Pay attention to data classification methods and avoid distorting the data’s message.

When dealing with social justice data, be mindful of the potential impact. Present the data respectfully and responsibly, acknowledging any limitations.
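As an illustration of a diverging scheme that highlights the midpoint, the sketch below uses ggplot2's scale_fill_gradient2(). The data frame, the variable pct_change, and the choice of zero as the midpoint are hypothetical; tiles are used instead of a map so the example runs without boundary data.

```r
library(ggplot2)

# Hypothetical data: change in some rate, by region (positive and negative values)
change <- data.frame(
  region = c("A", "B", "C", "D"),
  pct_change = c(-4.2, -1.0, 0.5, 3.8)
)

p <- ggplot(change, aes(x = region, y = 1, fill = pct_change)) +
  geom_tile() +
  scale_fill_gradient2(
    low = "#2166ac", mid = "white", high = "#b2182b",
    midpoint = 0,  # anchor the neutral color at zero change
    name = "Change"
  ) +
  theme_minimal()
p
```

The same scale can be added to a geom_sf() layer whenever the mapped variable has a meaningful center, such as zero change or a national average.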

Extended examples

Median Household Income by County

Let’s think about income inequality and how it manifests across different regions of the United States. Median household income is a powerful indicator of economic opportunity and well-being, often reflecting disparities in access to education, healthcare, and other essential resources. By mapping this data at the county level, we can visualize economic divides that persist across the nation, highlighting areas that may require targeted social and policy interventions.

# Fetch median household income for all counties in the US
income_data <- get_acs(
  geography = "county",
  variables = "B19013_001",  # Variable for median household income
  year = 2022,               # Specify the year (most recent available year)
  survey = "acs5", # Use the 5-year ACS data for better county-level estimates
  geometry = TRUE 
)

# Plot the data using ggplot2
income_data %>%
  filter(!str_detect(NAME, "\\, Alaska|\\, Hawaii")) %>%
  ggplot() +
  geom_sf(aes(fill = estimate), color = "white", linewidth = 0.1) +  # Use 'estimate' for median income; linewidth (ggplot2 >= 3.4.0) sets border thickness
  scale_fill_viridis_c(option = "plasma", name = "Median Income") +  # Use a color scale for income
  theme_minimal() +
  labs(
    title = "Median Household Income by County (2022)",
    subtitle = "Source: ACS 5-Year Estimates",
    caption = "Note: White lines represent county boundaries"
  ) +
  theme(
    axis.text = element_blank(),
    axis.title = element_blank(),
    panel.grid = element_blank()
  )

How to make this map more readable is something that requires a great deal of thought!
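One option to consider, among several, is to group incomes into a handful of classes instead of using a continuous gradient, so that a few very wealthy counties do not stretch the color scale. The sketch below cuts a hypothetical estimate column into quintiles; the eight values are made up for illustration, and a bar chart stands in for the map so the example runs offline.

```r
library(ggplot2)

# Hypothetical county incomes standing in for the `estimate` column
incomes <- data.frame(
  county = paste("County", 1:8),
  estimate = c(38000, 45000, 52000, 58000, 61000, 67000, 82000, 140000)
)

# Quintile breaks, so each class holds roughly the same number of counties
breaks <- quantile(incomes$estimate, probs = seq(0, 1, by = 0.2))
incomes$income_class <- cut(
  incomes$estimate,
  breaks = breaks,
  include.lowest = TRUE,  # keep the minimum value in the first class
  dig.lab = 6             # print class limits without scientific notation
)

p <- ggplot(incomes, aes(x = reorder(county, estimate), y = estimate, fill = income_class)) +
  geom_col() +
  coord_flip() +
  scale_fill_viridis_d(option = "plasma", name = "Median income (quintile)") +
  labs(x = NULL, y = "Median household income") +
  theme_minimal()
p
```

On the map itself, the analogous change would be to map fill to the class variable in geom_sf() and switch to a discrete scale such as scale_fill_viridis_d().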

Average PM2.5 Pollution Levels in California

Air quality is a crucial environmental justice issue, as marginalized communities are disproportionately affected by pollution and its adverse health effects. PM2.5 refers to fine particulate matter that is 2.5 micrometers or smaller in diameter, small enough to be inhaled deeply into the lungs and even enter the bloodstream. These particles come from sources like vehicle emissions, industrial processes, and wildfires, and prolonged exposure to PM2.5 is associated with serious health problems, including respiratory and cardiovascular diseases. Mapping PM2.5 pollution levels across California can reveal spatial patterns of exposure, identifying communities that are most impacted. This type of analysis is vital for advocating policy changes to reduce pollution and to protect vulnerable populations.

# Read in pollution data
p25data <- read.csv("https://drive.google.com/uc?export=download&id=1XYIk6_dn2V5GrnPP77x1dNMoB2YRoxqW")

# Step 1: Calculate the average PM2.5 concentration for each site
p25data <- p25data %>%
  group_by(Site.ID, Site.Latitude, Site.Longitude) %>%
  summarize(average_PM25 = mean(Daily.Mean.PM2.5.Concentration, na.rm = TRUE)) %>%
  ungroup()

# We have California counties from earlier

# Plot California counties with pollution data overlay
ggplot(p25data) +
  geom_sf(data = state_counties, fill = "white", color = "black") +  # Map of California counties
  geom_point(
    aes(x = Site.Longitude, y = Site.Latitude, color = average_PM25),
    size = 1.5, alpha = 0.8  # Small, uniform dot size
  ) +
  scale_color_viridis_c(name = "Avg PM2.5", option = "inferno") +  # Broad color range with "inferno" palette
  theme_minimal() +
  theme(
    panel.grid = element_blank()
  )
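To move from individual monitoring sites toward the communities around them, one possible next step is to attach each site to the area that contains it and average within areas. The sketch below does this with sf's st_join() on toy data: the two squares, the area names, and the PM2.5 values are hypothetical, and the column names simply mirror those used above.

```r
library(sf)
library(dplyr)

# Helper (hypothetical) that builds a unit square with its lower-left corner at (x0, y0)
square <- function(x0, y0) {
  st_polygon(list(rbind(
    c(x0, y0), c(x0 + 1, y0), c(x0 + 1, y0 + 1), c(x0, y0 + 1), c(x0, y0)
  )))
}

# Two toy squares standing in for counties
toy_counties <- st_sf(
  NAME = c("West", "East"),
  geometry = st_sfc(square(0, 0), square(1, 0), crs = 4326)
)

# Hypothetical site-level averages
sites <- data.frame(
  Site.Longitude = c(0.2, 0.7, 1.5),
  Site.Latitude = c(0.5, 0.4, 0.5),
  average_PM25 = c(8, 12, 15)
)
sites_sf <- st_as_sf(sites, coords = c("Site.Longitude", "Site.Latitude"), crs = 4326)

# Attach the containing polygon's name to each site, then average by polygon
county_avg <- sites_sf %>%
  st_join(toy_counties, join = st_within) %>%
  st_drop_geometry() %>%
  group_by(NAME) %>%
  summarize(mean_PM25 = mean(average_PM25), n_sites = n())
county_avg
```

Keeping n_sites alongside the mean matters for interpretation, since an average based on a single monitor says much less about an area than one based on many.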

Presidential Election Activity

To create a comprehensive dataset of election results by county for your assigned state, follow the steps below. This project will involve extracting, organizing, and preparing data for analysis.

Each of you has been assigned a state to focus on. Start with MSNBC’s main presidential election map. Click on your assigned state to view the state-specific election results. Scroll down on your state’s election results page and click the “View All Counties” link to display detailed results for each county. Print the web page to a .pdf file. Ensure that all relevant county data is visible and captured in the document.

Next, go to Google NotebookLM, start a new notebook, and upload the .pdf file as a source document. Use NotebookLM to extract the data in comma-separated (.csv) format. Here are some guidelines for structuring the data:

The first column should be called “state” and should contain the state’s full name repeated for each row.
The second column should be named “county” and should include the name of each county in your state.
The third column should be “harris.votes”, containing the number of votes each county gave to Harris.
The fourth column should be “trump.votes”, containing the number of votes each county gave to Trump.
The fifth column should be “other.votes”, which needs to be calculated as follows:
- Use the total number of votes cast in each county (as provided).
- Subtract the number of Harris votes and Trump votes from the total to get the number of “other” votes.

Either download the .csv file from NotebookLM (if provided) or copy the comma-separated text into a plaintext editor and save it as a .csv file. Name the file after your state, for example: Pennsylvania.csv or New Mexico.csv. Finally, place your .csv file in the shared Google Drive for this project.
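Before uploading, it may be worth checking the extracted table in R. The sketch below computes other.votes as described in the guidelines and runs a few sanity checks. The vote counts are placeholders rather than real results, and total.votes is an assumed name for the county total.

```r
library(dplyr)

# Placeholder extract; `total.votes` is an assumed name for the county total
raw <- data.frame(
  state = "Example State",
  county = c("County A", "County B"),
  harris.votes = c(100, 500),
  trump.votes = c(200, 300),
  total.votes = c(310, 815)
)

# other.votes = total - Harris - Trump, then keep the five required columns in order
results <- raw %>%
  mutate(other.votes = total.votes - harris.votes - trump.votes) %>%
  select(state, county, harris.votes, trump.votes, other.votes)

# Basic sanity checks before saving
stopifnot(
  all(results$other.votes >= 0),
  !anyDuplicated(results$county),
  identical(
    names(results),
    c("state", "county", "harris.votes", "trump.votes", "other.votes")
  )
)

# Written to a temporary folder here; in practice, save it under your state's name
write.csv(results, file.path(tempdir(), "Example State.csv"), row.names = FALSE)
```

A negative other.votes value or a duplicated county name usually points to an extraction error that should be checked against the original .pdf.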

Making Maps in R

Chad M. Topaz

2024-11-12
