This tutorial introduces you to working with spatial data and making thematic maps with the ggplot package.
Spatial data
Spatial data represents the location, size, and shape of an object on the Earth like a building, lake, mountain, or territory. Location information is what separates spatial data from other data types. It’s also known as geospatial data. In R, spatial data is usually handled with packages like sf, sp, terra, and raster.
In the following snippet, we use the sf library, one of the most frequently used library for spatial analysis.
Spatial data are practically like any other datasets you have encountered, with additional information on spatial references (such as longitude and latitude). The following example creates a dataframe, similar to what you have previously done.
Notice the x and y columns. They usually represent coordinates in a two dimensional space. Geospatial coordinates are the same, except the spherical geometry of Earth is projected into a two dimensional plane using various projections.
We use st_as_sf() function to convert that regular dataframe to a spatial dataframe. We specify x and y as spatial coordinates.
The argument coords = c("x", "y") specifies which columns of the data frame contain the coordinates of the points. The crs = 4326 argument sets the Coordinate Reference System (CRS) to EPSG:4326, which is the code for WGS84 latitude-longitude coordinate system.
Simple feature collection with 2 features and 2 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 10 ymin: 12 xmax: 11 ymax: 13
Geodetic CRS: WGS 84
id name geometry
1 1 Feature 1 POINT (10 12)
2 2 Feature 2 POINT (11 13)
x and y now become part of the geometry, representing the spatial location of the features.
Obtaining census data
We use tidycensus package developed by Kyle Walker to get census data within R. Being able to analyze and map census data using R can be handy for urban planners. We require a Census API key to download Census data.
After you’ve submitted the form, you’ll receive an email containing your new API key. This is the key that you can use with the tidycensus package in R.
Remember to keep your API key secure and do not share it with others.
For the following code, remove the overwrite=TRUE, you don’t need it.
The code shows my API code. Use your API key instead.
library(tidycensus)# Set your Census Bureau API keycensus_api_key("95434b8fbba512cd47489e79519845a15524c0c0")# Download population data for North Carolinanc_data <-get_acs(geography ="tract",variables ="B01001_001",state ="NC",year =2020,geometry =TRUE,progress_bar =FALSE)
We obtain census tract level population for North Carolina. B01001_001 is the specific variable for population.
Exercise
Scroll this list and look at the different variables and estimates that you can access through the American Community Survey (ACS) API.
Making a simple thematic map
We will download decennial census tract level data for Durham county.
We will create a choropleth map using the above data. A choropleth is a type of thematic map in which areas are shaded or patterned in relation to a statistical variable that represents an aggregate summary of a geographic characteristic within each area. In a choropleth map, the intensity of the color or pattern represents the magnitude of the measurement, such as the population density or percentage of an attribute’s occurrence. This provides a way to visualize how a measurement varies across a geographic area.
It’s important to note that choropleth maps can be misleading if not designed carefully. Because they represent data aggregated over areas, their appearance can be influenced by how the areas are defined (a form of the Modifiable Areal Unit Problem, or MAUP) and by the fact that larger areas can visually dominate the map, even if the variable being mapped is not particularly large or significant in those areas. Therefore, care should be taken in interpreting and designing choropleth maps.
We filter data to map distribution of Asian population in Durham County, North Carolina. A minimal amount of code to produce a simple thematic map is presented below. We use geom_sf to map spatial data in ggplot.
Lower percentage values are darker. This is unintuitive.
It needs a title
The gray grid in the background is redundant
No information about data source
Let’s modify the plot.
ggplot(data = durham_asian, aes(fill = percent)) +geom_sf()+scale_fill_distiller(palette ="RdPu", direction =1) +labs(title ="Distribution of Asian population in Durham County census tracts",caption ="Data source: 2020 Decennial Census, US Census Bureau",fill ="Percentage of population") +theme_void()
This is much neater.
scale_fill_distiller(palette = "RdPu", direction = 1) sets the color scale for the fill color. The palette = "RdPu" means it will use the “Red-Purple” palette from ColorBrewer, and direction = 1 means the palette goes from light to dark.
How else could you improve the plot?
This is a great segue to color palettes and ColorBrewer.
Color palettes using ColorBrewer
ColorBrewer is an online tool designed to help people pick good color schemes for maps and other graphics. It’s especially good for making color palettes that are easily distinguishable (even for people with colorblindness) and that print well in black and white. The RColorBrewer package in R provides access to these color palettes.
In R, you can use the RColorBrewer package to view and use ColorBrewer palettes. Here’s a simple example:
library(RColorBrewer)# Display all color palettesdisplay.brewer.all()
In the case of our mapping example, the “RdPu” palette is used. This palette is part of the sequential palettes in ColorBrewer, which are suited to ordered data that progress from low to high.
Color palettes in data visualization can be broadly categorized into three types:
Sequential Palettes: These are suitable for ordered data that progresses from low to high. They are often used for representing quantitative data. For example, you might use a sequential palette to represent the population size in different census tracts.
Diverging Palettes: These are suitable for data values that deviate around a mean. They are often used when the data has a critical midpoint like an average or zero. For instance, you might use a diverging palette to represent the deviation from the average income level in different census tracts.
Qualitative Palettes: These are used for nominal or categorical data. They do not imply magnitude differences between legend classes. They are often used to represent different categories or groups. For instance, you might use a qualitative palette to represent different racial or ethnic groups in the census data.
Exercise
Go to the ColorBrewer website and look at various color palettes suggested for diffferent types of data and numbers of categories/bins.
Remaking the above map with categorical breaks
We convert continuous data to categories using the library classInt. We then use these categories to fill the map.
library(classInt)# get quantile breaks. Add .00001 offset to catch the lowest valuebreaks_qt <-classIntervals(c(min(durham_asian$percent) - .00001, durham_asian$percent), n =5, style ="quantile")breaks_qt
style: quantile
one of 814,385 possible partitions of this variable into 5 classes
[-1e-05,1.204485) [1.204485,2.62307) [2.62307,5.034092) [5.034092,7.288609)
14 14 13 14
[7.288609,29.01418]
14
# Obtain ACS datamed_inc_orange <-get_acs(geography ="tract",variables ="B19013_001", # This is an example variable code for Median Household Incomestate ="NC",county ="Orange",geometry =TRUE,year =2020,progress_bar =FALSE)
Getting data from the 2016-2020 5-year ACS
Downloading feature geometry from the Census website. To cache shapefiles for use in future sessions, set `options(tigris_use_cache = TRUE)`.
# estimate divergence from median household income for the countymedian_orange <-median(med_inc_orange$estimate, na.rm =TRUE)
palette_name <-"RdBu"palette_colors <-brewer.pal(n =5, name = palette_name)# Plot the dataggplot(med_inc_orange) +geom_sf(aes(fill = divergence)) +scale_fill_gradientn(colors = palette_colors) +theme_void()
The map shows counties the divergence of household income of census tracts from the median values of the counties. Divergent color schemes are appropriate for this.
Exercise
Make the map more informative.
Faceted maps
Faceted maps (small multiples) can be very useful in comparing distribution of variables across geometries.
Additionally, change the color scheme from continuous to categorical. Find an appropriate number of bins by trying out different values. Which number did you ultimately settle on? Why?