Mapping Raster Data in the Tidyverse

A Brief Tutorial

Open this Project on RStudio.Cloud and follow along!

Introduction

Recently, I saw a great question on ResearchGate of how to find the correlation coefficient for a pair of raster datasets. This tutorial below introduces the functions necessary to perform this test.

First, what is a raster dataset? A raster dataset is a grid of cells, with one value per cell. These cells are measured at specific resolutions and are often used to capture extremely granular geographic data like rainfall, temperature, vegetation, etc.

Second, what is the tidyverse? The tidyverse is a set of packages built over the last several years in R all designed to work together, on tidy functions. They are more or less based around the dplyr package, which introduced the “pipeline” (“%>%”) a key function in R that joins or “pipes” functions together. It has the added effect of making data science easier and more accessible to students and scholars in the social sciences like myself.

A new package that fits into the tidyverse approach is sf, a package which allows for tidy management of geospatial data (Pebesma 2018). We’re going to use this package, and a few others, to convert raster data into rows of data we can analyze and correlate.

1. Data Preparations

1.1 Load Packages

First, let’s load the necessary packages for this exercise. If you don’t have them, you can un-comment (remove the hashtags) the install.packages() function.

#install.packages(c("dplyr", "purrr", "ggplot2", "sf", "rgdal", "tigris", "raster"))

# Load packages
library(dplyr) # for data manipulation and "pipeline" functions (eg. %>%)
library(purrr) # for loops (eg. map_dfr() function)
library(ggplot2) # for visualization
library(viridis) # color palette for easy visualization
library(stringr) # for text editing
library(sf) # for tidy spatial data
library(rgdal) # for spatial data
library(tigris) # for getting US shapefiles
library(raster) # for working with raster files
library(readr) # for reading in files, like .rds or .csv files

1.2 Download raster data

Next, let’s download some raster data! The World Bank runs an amazing project, the Global Solar Atlas, which democratizes access to energy information to help accelerate renewable energy transition. We can download two global rasters from here. These include the photovoltaic output raster, which measures the potential solar power you could produce here given sunlight conditions, in terms of kilowatt-hours per kilowatt-peak per year. These also include a temperature raster, which measures the average temperature in an area at 2 meters above ground level.

We can download the raster files from the following links (or this website) as GEOTIFFs, and import them into R using the raster package’s conveniently named raster function.

Let’s download solar data like this:

# Download photovoltaic output data
download.file(url = "https://api.globalsolaratlas.info/download/World/World_PVOUT_GISdata_LTAy_AvgDailyTotals_GlobalSolarAtlas-v2_GEOTIFF.zip",
              destfile = "pvout.zip") 
# Unzip the folder
unzip("pvout.zip")
# Delete the original zip file, because it's huge.
unlink("pvout.zip")

# Rename the folder to something understandable
file.rename(from = "World_PVOUT_GISdata_LTAy_DailySum_GlobalSolarAtlas_GEOTIFF",
            to = "pvout")

And next, let’s download temperature data like so.

# Download temperature data
download.file(url = "https://api.globalsolaratlas.info/download/World/World_TEMP_GISdata_LTAy_GlobalSolarAtlas-v2_GEOTIFF.zip", 
              destfile = "temp.zip")
# Unzip the file
unzip("temp.zip")
# Delete the original zip file, because it's huge too.
unlink("temp.zip")
# rename the folder to something understandable
file.rename(from = "World_TEMP_GISdata_LTAy_GlobalSolarAtlas_GEOTIFF",
            to = "temp")

1.3 Import Raster Data

Next, using the raster() function from the raster package, we’re going to import our two rasters as objects named pvout and temp.

pvout <- raster("pvout/PVOUT.tif")
temp <- raster("temp/TEMP.tif")

2. Geographic Boundaries

Next, we want to grab a shapefile containing the geographic boundaries we are eventually going to be interested in. For example, I work in Massachusetts in the US, so I want to know about solar and temperature data in cities around Boston. Let’s use the tigris package to obtain county subdivision boundaries (basically cities) as an sf object.

# Import geographic datasets from Massachusetts in the year 2018
shapes <- tigris::county_subdivisions(state = "MA", cb = FALSE, year = 2018) %>%
  # Convert from spatial polygons format to sf (this is key)
  st_as_sf() %>%
  # Let's grab just the unique ID, name, and geometry vector 
  # for each county subdivision
  dplyr::select(geoid = GEOID, name = NAMELSAD, geometry) %>%
  # Let's make sure we use the same projection as our Global Solar Atlas datasets
  st_transform(crs = raster::projection(pvout)) 

# Remove any other rasters to retain memory space
remove(pvout, temp)

2.1 Making a Buffer

Now that we have Massachusetts county subdivision boundaries, we next need to identify just which of those county subdivisions are located within 5 kilometers of Boston. To do so, let’s make a buffer with a radisu 5 kilometers around the center of Boston.

# Let's make a buffer of 0.05 degrees or ~5 kilometers around the Boston city polgyon
buffer <- shapes %>%
  # by zooming into boston
  filter(name == "Boston city") %>%
  # Making the buffer with a distance of 0.05 degrees, 
  # or in whatever units the bounding box is measured
  st_buffer(dist = 0.05) %>%
  # Add a way to identify that this is in the buffer zone
  mutate(zone = "within 5 km of Boston") %>%
  # and keep just those
  dplyr::select(zone, geometry)

2.2 Selecting Locations within the Buffer

Finally, we’re going to spatially join the buffer to our county subdivision dataset, and keep only rows that overlap with the buffer region. Finally, we’ll save it to file so we can use it in the future.

# Here are cities within 5 kilometers of Boston proper.
shapes %>%
  # Spatially join in our buffer region indicator to appropriate cities
  st_join(buffer) %>%
  # Zoom into just the buffer region
  filter(zone == "within 5 km of Boston") %>%
  # And let's kick out any ocean plots
  filter(name != "County subdivisions not defined") %>%
  # and let's just save this just in case
  saveRDS("shapes.rds")

# Get rid of buffer
remove(buffer, shapes)

2.3 Importing our Polygons

Now let’s import our final Boston-area cities from file as shapes.

shapes <- read_rds("shapes.rds")

3. Making a Fishnet Grid

Next, we need to make a grid. Some of this feels redundant (and in this World Solar Atlas example it really is), but if your rasters are of different resolutions or extents, or if there is any reason why they might be different, we need to create a grid in sf where each grid cell gets its own row (what we can tidy format).

3.1 Make the grid

First, let’s decide the size of our grid cells. We’re going to make grid cells based on the county subdivision boundaries. These are measured in degrees, so we need to tell sf and related packages to make a fishnet grid with a cellsize of a certain number of degrees (0.01 degrees = 1.11 km).

# The shapefile is measured in degrees, so when we make the fishnet cellsize 0.1,
# that means 0.1 degrees (10.5 km).
shapes %>%
  # Make a fishnet, where each cell is ~1 km or 0.01 degrees
  st_make_grid(cellsize = 0.01, what = "polygons") %>%
  # Convert to SF
  st_as_sf() %>%
  # rename the geometry field to "geometry"
  rename(geometry = x) %>%
  # add an ID number from 1 to n, the number of rows in the dataet
  mutate(fishnet_id = 1:n()) %>%
  # order the variables
  dplyr::select(fishnet_id, geometry) %>%
  # And finally, let's just save this fishnet, just in case.
  saveRDS("fishnet.rds")

3.2 Import the grid

# Import fishnet grid from file
fishnet <- read_rds("fishnet.rds")
# Let's view the first few rows
fishnet %>% head()

## Simple feature collection with 6 features and 1 field
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: -71.17309 ymin: 42.13551 xmax: -71.11309 ymax: 42.14551
## CRS:            +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
##   fishnet_id                       geometry
## 1          1 POLYGON ((-71.17309 42.1355...
## 2          2 POLYGON ((-71.16309 42.1355...
## 3          3 POLYGON ((-71.15309 42.1355...
## 4          4 POLYGON ((-71.14309 42.1355...
## 5          5 POLYGON ((-71.13309 42.1355...
## 6          6 POLYGON ((-71.12309 42.1355...

3.3 Visualize the grid

Let’s see those boundaries in action, visualizing them using ggplot2!

ggplot() +
  # Visualize fishnet with blue lines with no fill
  geom_sf(data = fishnet, fill = NA, color = "steelblue", size = 0.05) +
  # Visualize county subdivisions with black lines with no fille
  geom_sf(data = shapes, fill = NA, color = "grey") +
  # give it a blank theme, with a font size of 14
  geom_sf_text(data = shapes, 
                mapping = aes(label = name %>% str_remove_all(pattern = " town| city| Town")), check_overlap = TRUE) +
  theme_void(base_size = 14) +
  # Center the title
  theme(plot.title = element_text(hjust = 0.5)) +
  # Give it a title
  labs(title = "Boston Area County Subdivisions (n = 33)\n with ~1 km fishnet grid cells (n = 1520)")

4. Aggregating a Raster to a Fishnet Grid

Now, I’m going to take the raw raster data and average all values located within each fishnet grid cell to get single estimates per square kilometer in Massachusetts of both photovoltaic output and temperature.

4.1 Crop and Transform to sf format

Next, we’re going to convert our rasters from raster format into sf format. However, it’s going to be hard to do that given such big rasters. It would be better if we cropped the rasters first to the extent of our actual fishnet grid of interest. Then, we can join all

# Import pvout tif file
pvout <- raster("pvout/PVOUT.tif") %>%
# Crop it to fit the dimensions of the fishnet
  crop(y = fishnet) %>%
  # Convert to spatial polygons data.frame
  as('SpatialPolygonsDataFrame') %>%
  # Convert to Sf format
  st_as_sf()



# Import pvout tif file
temp <- raster("temp/TEMP.tif") %>%
# Crop it to fit the dimensions of the fishnet
  crop(y = fishnet) %>%
  # Convert to spatial polygons data.frame
  as('SpatialPolygonsDataFrame') %>%
  # Convert to Sf format
  st_as_sf()

# Let's view the output of our pvout sf data.frame.
pvout %>%
  head()

## Simple feature collection with 6 features and 1 field
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: -71.34167 ymin: 42.5 xmax: -71.29167 ymax: 42.50833
## CRS:            +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
##   PVOUT                       geometry
## 1 3.927 POLYGON ((-71.34167 42.5083...
## 2 3.928 POLYGON ((-71.33333 42.5083...
## 3 3.925 POLYGON ((-71.325 42.50833,...
## 4 3.927 POLYGON ((-71.31667 42.5083...
## 5 3.930 POLYGON ((-71.30833 42.5083...
## 6 3.932 POLYGON ((-71.3 42.50833, -...

This produces a data.frame, comprised of as many rows as there were cells in that raster. Note: Since we cropped by a set of irregular polygons (Boston geographies), though there were 3600 cells in the square extent, there were actually fewer final grid cells (~3421).

4.2 Aggregate to Fishnet Grid

pvout <- pvout %>%
  # Join in fishnet cell ids to all points located within them
  st_join(fishnet) %>%
  # convert to data.frame,
  as.data.frame() %>%
  # then for each unique fishnet grid cell
  group_by(fishnet_id) %>%
  # get the mean of the photovoltaic output values
  summarize(pvout = mean(PVOUT, na.rm = TRUE)) %>%
  # and then combine the dataset back together
  ungroup()

# Repeat steps above for temperature
temp <- temp %>%
  st_join(fishnet) %>%
  as.data.frame() %>%
  group_by(fishnet_id) %>%
  summarize(temp = mean(TEMP, na.rm = TRUE)) %>%
  ungroup()

Finally, let’s bind these averages-per-grid-cell into one fishnet grid file and save. Sometimes, we might

fishnet %>%
  # Join in pvout fishnet grid averages
  left_join(by = "fishnet_id", y = pvout) %>%
  # Join in tempreature fishnet grid averages
  left_join(by = "fishnet_id", y = temp) %>%
  # Save to file
  saveRDS("fishnet_values.rds")

remove(pvout, temp)

You could always skip this process if you knew that both your rasters were compatible, cropped to the same extent, of the same size, etc. But if not, then now you’ve made a fishnet grid that proves one row per observations.

5. Analysis

Finally, you can perform any basic operations you want on these grid cells, including mapping or statistics.

# Import fishnet with completed estimates
fishnet <- read_rds("fishnet_values.rds")

5.1 Mapping Temperature

For example, we can map the temperature to each fishnet gridcell, with county subdivisions overlapping to see geographic variation.

ggplot() +
  # Visualize fishnet with blue lines with no fill
  geom_sf(data = fishnet, mapping = aes(fill = temp), color = "steelblue", size = 0.05) +
  scale_fill_viridis(option = "viridis") +
  
  # Visualize county subdivisions with black lines with no fille
  geom_sf(data = shapes, fill = NA, color = "white") +
  # give it a blank theme, with a font size of 14
  geom_sf_text(data = shapes, 
                mapping = aes(label = name %>% str_remove_all(pattern = " town| city| Town")), check_overlap = TRUE) +
  theme_void(base_size = 14) +
  # Center the title
  theme(plot.title = element_text(hjust = 0.5)) +
  # Give it a title
  labs(title = "Boston Area County Subdivisions (n = 33)\n with ~1 km fishnet grid cells (n = 1520)", fill = "Average\nTemperature\n(Celsius)")

5.2 Mapping Photovoltaic Ouput

We can also map the solar photovoltaic ouput to each fishnet gridcell, with county subdivisions overlapping to see geographic variation. This highlights the renewable resources of each city for hosting solar.

ggplot() +
  # Visualize fishnet with blue lines with no fill
  geom_sf(data = fishnet, mapping = aes(fill = pvout), color = "steelblue", size = 0.05) +
  scale_fill_viridis(option = "plasma") +
  
  # Visualize county subdivisions with black lines with no fill
  geom_sf(data = shapes, fill = NA, color = "white") +
  # give it a blank theme, with a font size of 14
  geom_sf_text(data = shapes, 
                mapping = aes(label = name %>% str_remove_all(pattern = " town| city| Town")),
               check_overlap = TRUE) +
  theme_void(base_size = 14) +
  # Center the title
  theme(plot.title = element_text(hjust = 0.5)) +
  # Give it a title
  labs(title = "Boston Area County Subdivisions (n = 33)\n with ~1 km fishnet grid cells (n = 1520)", fill = "Average\nPhotovoltaic\nOuput\n(kWh/kWp/year)")

5.3 Summary Statistics

Further, we can also calculate summary statistics for each community.

# Let's take the original shapes
sumstats <- shapes %>%
  # For each shape, join in the cells within it
  st_join(fishnet) %>%
  group_by(name, geoid) %>%
  summarize(pvout = mean(pvout, na.rm = TRUE),
            temp = mean(temp, na.rm = TRUE)) %>%
  ungroup()

Let’s view the first few rows!

sumstats %>%
  # Convert from sf to data.frame
  as.data.frame() %>%
  # Grab just relevant variables
  dplyr::select(name, pvout, temp) %>%
  # Show first 6 rows
  head()

##                  name    pvout     temp
## 1      Arlington town 3.959927 10.37629
## 2        Belmont town 3.967736 10.39022
## 3         Boston city 3.953918 10.46123
## 4 Braintree Town city 3.956333 10.52872
## 5      Brookline town 3.969602 10.46389
## 6      Cambridge city 3.977275 10.52400

5.4 Visualize Summary Statistics

# And we can visualize that here
ggplot() +
  # Visualize county subdivisions with black lines with pvout fill
  geom_sf(data = sumstats, mapping = aes(fill = pvout), color = "white", alpha = 0.75) +
  scale_fill_viridis(option = "plasma") +
  # give it a blank theme, with a font size of 14
  geom_sf_text(data = shapes, 
                mapping = aes(label = name %>% str_remove_all(pattern = " town| city| Town")),
               check_overlap = TRUE) +
  theme_void(base_size = 14) +
  # Center the title
  theme(plot.title = element_text(hjust = 0.5)) +
  # Give it a title
  labs(title = "Boston Area County Subdivisions (n = 33)", fill = "Average\nPhotovoltaic\nOuput\n(kWh/kWp/year)")

5.5 Correlations

Finally, maybe we want to know to what degree temperature and photovoltaic output are correlated in each city. (It’s a stretch, I know, but in case you’re doing something similar, like the ResearchGate user was, this could help.)

# Let's take the original shapes
mycor <- shapes %>%
  # For each shape, join in the cells within it
  st_join(fishnet) %>%
  group_by(name, geoid) %>%
  # For each city, we can get the correlation between 
  # photovoltaic output and temperature among our grid cells.
  # We use pairwise-complete observations, meaning that we use any rows that are not missing data,
  # which avoids NA correlations
  summarize(cor = cor(pvout, temp, use = "pairwise.complete.obs")) %>%
  ungroup()


# And we can visualize that here
ggplot() +
  # Visualize county subdivisions with black lines with the correlation being the fill
  geom_sf(data = mycor, mapping = aes(fill = cor), color = "darkgrey", alpha = 0.75) +
  # Let's use a better color scale, like this:
  scale_fill_gradient2(low = "firebrick", high = "steelblue", mid = "white", midpoint = 0) +
  # give it a blank theme, with a font size of 14
  geom_sf_text(data = shapes, 
                mapping = aes(label = name %>% str_remove_all(pattern = " town| city| Town")),
               check_overlap = TRUE) +
  theme_void(base_size = 14) +
  # Center the title
  theme(plot.title = element_text(hjust = 0.5)) +
  # Give it a title
  labs(title = "Boston Area County Subdivisions (n = 33)\nCorrelation between\nSolar Photovoltaic Output and Temperature", fill = "Correlation\n(Pearson's r)")

Interesting. So this map reveals that in the city of Boston, these correlate more strongly in some neighborhoods than others, though overall, temperature tends to be negatively correlated with solar potential.

Conclusion

I hope this has provided useful resources for those mapping with raster data in R. Please check out the sf webpage for more information and examples on how to use sf.

For more information on the applications of these datasets, check out my papers below, which compare the effect of solar photovoltaic output with social and political drivers of solar adoption in cities in the US and Japan.

Questions?

Have questions? Want to invite me to give a talk or lead a workshop at your institution? Please email timothy.fraser.1@gmail.com or reach out on Twitter!

References

World Bank. Global Solar Atlas. Accessed January 30, 2021. https://globalsolaratlas.info/map
Pebesma, Edzer. 2018. Simple features for R. Standardized Support for Spatial Vector Data. The R Journal 10(1), 439-446. https://doi.org/10.32614/RJ-2018-009
Fraser Timothy. (2021) Does social capital boost or block renewable energy siting? South African solar politics in comparison. Energy Research & Social Science 71, 1-12.
Fraser Timothy. (2020) Japan’s Resilient, Renewable Cities: How Socioeconomics and Local Policy drive Japan’s Renewable Energy Transition. Environmental Politics 29, 500-523.
Fraser Timothy. (2019) How Governance and Disasters shape Renewable Energy Transitions: The case of Japanese mega-solar. Social Science Quarterly 100, 975-990.