Visualizing and working with spatial data in R is a basic skill needed when working on global change science and communicating your findings to stakeholders. This data tutorial will teach you the basics of spatial data analyses using NEON data.
This data tutorial requires you to have basic understanding of ggplot2 and dplyr from the tidyverse. It is also helpful if you have the basic R data wrangling and plotting knowledge outlined in R for Data Science.
To run the analyses outlined in this tutorial you need the following packages.
install.packages("tidyverse", dependencies = TRUE)
install.packages("sf", dependencies = TRUE)
install.packages("rnaturalearth", dependencies = TRUE)
install.packages("rnaturalearthdata", dependencies = TRUE)
install.packages("tigris", dependencies = TRUE)
install.packages("ggrepel", dependencies = TRUE)
Note that if you do not want to install all of the dependencies, then set this argument to FALSE.
The data used in this module and for the module exercises can be obtained from the installed packages, from the NEON data portal, and from the NEON spatial data website. Specifically you will need the sites and domains shapefiles.
There are two fundamental types of spatial objects: vectors and rasters. Vector data are discrete entities in space and time, such as a lakes, trees, houses, roads, states and countries. Raster data represent continuous data and are divided into pixels at a defined resolution. Rasters are images with associated spatial coordinates and typically are represented in data formats such as a GeoTIFF, but they can also be represented as text files. Raster data can be made from vector data and vice versa depending on the goals of an analysis.
Spatial data may can typically be boiled down to data tables and and often wrangled using methods from tidyverse. In this module we focus on learning how to wrangle and plot vector data.
Figure from: Martin Wegmann
Vector data in R are represented by simple features in the sf package. Features are a standard way to represent discrete spatiotemporal data (i.e., data measured at a particular place at a particular time). Features are simply tidy data tables with observations as rows and variables as columns. Features have both spatial and non-spatial variables.
Feature spatial variables are termed geometries and indicate where on Earth the feature is located. For example, a GPS point would have a geometry made up of a latitude and longitude, and a polygon would have a geometry made up of multiple connected points that define the area encompassed by the polygon. Non-spatial variables are termed metadata, attributes or properties (here we use properties). For example, a single point object can have a name and date associated with it as properties.
Points, lines and polygons are vector data. Points are 0D (zero-dimensional) geometries like a GPS location with a latitude and longitude. Lines are 1D geometries made as a sequence of points connected with lines like a river or a road that do not self intersect to form a closed loop. Polygons are 2D geometries of sequences of connected points (lines) that form closed 2D loops (e.g., squares, donuts).
Vector geometries are all situated in 2D XY space. However, all simple features can also have two more dimensions for a total of 4D (XYZM space). The Z dimension indicates altitude, and the M dimension indicates measure like measurement error or the time the measurement was taken. We do not consider these other dimensions, but see here for more information.
All sf objects are stored as data tables (data.frame or tbl_df) with the number of rows equal to the number of features. For example, an sf object with 10 GPS points would have 10 rows. Geometries are stored as list-columns because all feature geometries require at least X and Y values (e.g., latitude and longitude). Properties can either be stored as standard or list-columns.
sf functions are prefixed by st_ (standing for spatial and temporal).
The sf package can read many file types that contain vector data with the st_read function. Common file types include .csv and .shp files. Comma-delimited files (.csv) are the simplest and a common way to store point features. The shapefile format is a common file type for vector data maintained by the ERSI company.
Here we read in the NEON site locations shapefile with the st_read function.
library(sf)
neon_sites <- st_read("NEONFieldSites-v16/NEON_Field_Sites.shp", as_tibble = TRUE, quiet = TRUE)
neon_sites
Simple feature collection with 81 features and 11 fields
geometry type: POINT
dimension: XY
bbox: xmin: -156.6194 ymin: 17.96955 xmax: -66.79851 ymax: 71.28241
epsg (SRID): 4326
proj4string: +proj=longlat +datum=WGS84 +no_defs
The printed summary indicates that the object neon_sites contains 81 POINT features each with 11 properties. The features are 2D with X and Y dimensions. The area encompassed by all of the points is the bounding box bbox.
The epsg (SRID) value and the proj4string value provide information on the coordinate reference system (CRS) of the object. All spatial data must include a CRS that says how to project the data onto the surface of the earth (e.g., like a sphere or flat like a map). GPS data (like from a phone) treat the earth as a sphere and use the WGS84 CRS. The NEON shapefile is in this projection. Make sure that if you are mapping multiple data sets, all of the data sets are in the same CRS. You can change a CRS with st_transform.
See here for more information on CRS in R. Also see this video that explains what projection are and their historical context. This website is a quick way to see a lot of different projections at once.
Shapefiles .shp are always associated with a set of other supporting files that provide metadata for the shapefile and explain to R how to visualize and map the shapefile. Thus, sf_read not only reads in the .shp file but also the other associated files in the same directory. There are 8 files read by st_read when loading NEON_Field_Sites.shp.
tibble(`Files Read by sf_read` =
str_subset(dir("NEONFieldSites-v16"),
pattern = "NEON_Field_Sites"))
Note that if you want to use st_read for .csv files then you will need to pass along options that say which columns are the x and y geometry variables such as:
st_read("my.csv", options = c("Longitude=x", "Latitude=y")).
Or you could just read in your csv as usual and use st_as_sf() setting the coords arguement (and possibly the crs argument).
mycsv <- read_csv("my.csv")st_as_sf(mycsv, coords = c("Longitude", "Latitude"))
sf objects are data tables and thus many of the functions that are used to wrangle data tables can be used with sf objects. For example, dplyr::select can choose and rearrange variables; dplyr::filter can subset features based on properties; and dplyr::summarize can aggregate and summarize based on properties. First let’s see what variables in the neon_sites object.
colnames(neon_sites)
[1] "DomainNumb" "DomainName" "SiteName" "PMC" "SiteID" "SiteType" "SiteHost"
[8] "State" "Full_State" "Latitude" "Longitude" "geometry"
Select the DomainName property
# selct only the DomainName variable.
neon_sites %>% dplyr::select(DomainName)
Simple feature collection with 81 features and 1 field
geometry type: POINT
dimension: XY
bbox: xmin: -156.6194 ymin: 17.96955 xmax: -66.79851 ymax: 71.28241
epsg (SRID): 4326
proj4string: +proj=longlat +datum=WGS84 +no_defs
Note that because this is an sf object, the geometry list-column is also returned. Subset the sites within the Mid-Atlantic Domain.
# subset sites within the Mid-Atlantic Domain
neon_sites %>%
filter(DomainName == "Mid-Atlantic")
Simple feature collection with 5 features and 11 fields
geometry type: POINT
dimension: XY
bbox: xmin: -78.14678 ymin: 38.89008 xmax: -76.56001 ymax: 39.09564
epsg (SRID): 4326
proj4string: +proj=longlat +datum=WGS84 +no_defs
Summarize the data by counting the number of sites located within each U.S. state.
# count number of sites colocated in each State
neon_sites %>%
group_by(Full_State) %>%
summarize(`# of sites` = n()) %>%
arrange(desc(`# of sites`))
Simple feature collection with 25 features and 2 fields
geometry type: GEOMETRY
dimension: XY
bbox: xmin: -156.6194 ymin: 17.96955 xmax: -66.79851 ymax: 71.28241
epsg (SRID): 4326
proj4string: +proj=longlat +datum=WGS84 +no_defs
Download the site .csv file from the online map of NEON sites. Read in this file. Is the sf object produced by this file identical to the sites shapefile we used above? Use R code to show if and how they are different.
neon_sites that include:Use geom_sf in ggplot2 to plot any sf object.
ggplot() + geom_sf(data = neon_sites, size = 2, shape = 21, fill = "red") + theme_bw()
Like points, polygons can be read in with st_read. Here we read in the NEON domains shapefile that give the outlines of the regions that NEON used to choose their sites. See here for a video that describes the statistical process used.
neon_doms <- st_read("NEONDomains_0/NEON_Domains.shp", as_tibble = TRUE, quiet = TRUE)
Let’s see what is in neon_doms
neon_doms
Simple feature collection with 22 features and 6 fields
geometry type: MULTIPOLYGON
dimension: XY
bbox: xmin: -168.1244 ymin: 17.92621 xmax: -65.59024 ymax: 71.40624
epsg (SRID): 4326
proj4string: +proj=longlat +datum=WGS84 +no_defs
In this object, there are multiple polygons and the DomainName variable provides the name of the domain. The simplest way to make a map of sf objects is to use the geom_sf function in ggplot2. We can map and then label the domains with the following code:
neon_doms %>%
ggplot() +
geom_sf(aes(fill = DomainName)) +
geom_sf(data = neon_sites) +
geom_sf_label(aes(label = DomainName), size = 1.8) +
labs(title = "NEON sites and labeled domains") +
theme(legend.position = "none")
Here we plot both the sites and the polygons and we use geom_sf_labels to label each of the domains with their name. The function geom_sf_text provides similar functionality. There is also a work around to useggrepel for labeling as described here.
library(ggrepel)
neon_doms %>%
ggplot() +
geom_sf(aes(fill = DomainName)) +
geom_label_repel( aes(label = DomainName, geometry = geometry),
stat = "sf_coordinates",
min.segment.length = 0,
size = 1.8) +
labs(title = "Use ggrepel to make popout annotations") +
theme(legend.position = "none")
Compare geom_sf to geom_point using the NEON sites data (i.e., neon_sites). Do the two maps look the same? How can you produce a map with geom_point that looks better? Can you use the same technique with geom_sf? What errors do you get?
Change the projection of one of the above maps into an equal area projection.
Natural Earth is an open access database of global spatial data. It allows you to easily make base maps on which to plot your data. The packages you need are rnaturalearth and rnaturalearthdata
library(rnaturalearth)
library(rnaturalearthdata)
global_basemap <- ne_countries(scale = "small", returnclass = "sf")
ggplot(data = global_basemap) +
geom_sf(fill = "white") +
labs(title = "World map from Natural Earth (low resolution)")
The ne_countries function from rnaturalearth returns country polygons. The returnclass argument just says to return an sf object. The scale argument provides the map resolution, there are small, medium and large resolution base maps. For the large resolution you will need to install rnaturalearthhires and you will be prompted to do so when you set scale = "large" if you have not already installed the package. To obtain even higher resolution base maps, then see the GADM R package GADMTools.
To zoom in on an area, use coord_sf, which is the default coord_ function that geom_sf uses to plot data. See the R4DS chapter on coordinate systems.
ggplot(data = global_basemap) +
geom_sf(fill = "white") +
geom_sf(data = neon_sites) +
coord_sf(xlim = c(-90, -50), ylim = c(10, 50), expand = FALSE) +
theme(panel.background = element_rect(fill = "lightblue")) +
labs(title = "Zoom in with xlim and ylim")
Notice how with this low resolution map, many of the smaller Caribbean Islands are missing.
The globe can be partitioned into hierarchical geopolitical administrative boundaries. rnaturalearth provides data at country and state level boundaries. To illustrate these levels, first let’s get a base map for an individual country.
usa_basemap <- ne_countries(country = "united states of america",
scale = "medium",
returnclass = "sf")
ggplot() +
geom_sf(data = usa_basemap) +
coord_sf(xlim = c(-175, -60)) +
theme_dark() +
labs(title = "U.S.A.")
To get a base-map of individual states use the ne_states function. The first time that you run the code below, you will be asked to install the rnaturalearthhires package.
state_basemap <- ne_states(country = "united states of america",
returnclass = "sf")
ggplot() +
geom_sf(data = state_basemap) +
coord_sf(xlim = c(-175, -60)) +
theme_bw() +
labs(title = "U.S.A. States")
For U.S. county scale data, use the tigris package. This package downloads county (and other) data from the Geography Program of the U.S. Census Bureau. To download county data use the counties function.
library(tigris)
county_basemap <- counties(cb = TRUE) %>% st_as_sf() # cb = TRUE for a smaller downloaded file
county_basemap
Simple feature collection with 3233 features and 9 fields
geometry type: MULTIPOLYGON
dimension: XY
bbox: xmin: -179.1489 ymin: -14.5487 xmax: 179.7785 ymax: 71.36516
epsg (SRID): 4269
proj4string: +proj=longlat +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +no_defs
First 10 features:
STATEFP COUNTYFP COUNTYNS AFFGEOID GEOID NAME LSAD ALAND AWATER
0 19 107 00465242 0500000US19107 19107 Keokuk 06 1500067253 1929323
1 19 189 00465283 0500000US19189 19189 Winnebago 06 1037261946 3182052
2 20 093 00485011 0500000US20093 20093 Kearny 06 2254696689 1133601
3 20 123 00485026 0500000US20123 20123 Mitchell 06 1817632928 44979981
4 20 187 00485055 0500000US20187 20187 Stanton 06 1762104518 178555
5 21 005 00516849 0500000US21005 21005 Anderson 06 522745702 6311537
6 21 029 00516861 0500000US21029 21029 Bullitt 06 769289863 8391940
7 21 049 00516871 0500000US21049 21049 Clark 06 653878894 6972358
8 21 059 00516876 0500000US21059 21059 Daviess 06 1187163562 47158144
9 21 063 00516878 0500000US21063 21063 Elliott 06 606876137 2608208
geometry
0 MULTIPOLYGON (((-92.41199 4...
1 MULTIPOLYGON (((-93.97076 4...
2 MULTIPOLYGON (((-101.5419 3...
3 MULTIPOLYGON (((-98.49007 3...
4 MULTIPOLYGON (((-102.0419 3...
5 MULTIPOLYGON (((-85.16919 3...
6 MULTIPOLYGON (((-85.93606 3...
7 MULTIPOLYGON (((-84.34698 3...
8 MULTIPOLYGON (((-87.40529 3...
9 MULTIPOLYGON (((-83.27005 3...
county_plot <- ggplot() +
geom_sf(data = county_basemap) +
labs(title = "
U.S.A. counties")
county_plot + coord_sf(xlim = c(-175, -60), ylim = c(73, 16))
On this map you can plot other layers like the neon_sites.
county_plot +
geom_sf(data = neon_sites, size = 2, shape = 21, fill = "red") +
coord_sf(xlim = c(-175, -60), ylim = c(73, 16))
If you want to just get the counties for one state, then you have to work with the STATEFP property in the county_basemap object. This property provides a unique numeric code called a FIPS code for each state. To find the state you want, use lookup_code. Let’s look at Colorado, which as we saw above, has 7 NEON sites, the second most of any state. It is also where the NEON headquarters is located.
lookup_code("colorado")
[1] "The code for Colorado is '08'."
county_basemap %>%
filter(STATEFP == '08') %>%
ggplot() +
geom_sf() +
geom_sf(data = filter(neon_sites, Full_State == "Colorado"),
aes(fill = SiteType), shape = 21, size = 3, alpha = .7) +
labs(title = "Colorado and its NEON sites") +
theme_bw() +
guides(fill = guide_legend(override.aes = list( size = 1, alpha = 1)))
Once you are able to map polygons and points, you will want to display your data on the map. Just like any ggplot, you can set aesthetics to plot data. Here, we use fill to color polygons countries according to income grouping. The countries data set from rnaturalearth already contains a lot of data at the per country scale, and income group is one included variable.
ggplot(data = global_basemap) +
geom_sf(aes(fill = income_grp)) +
labs(title = "Income groups")
Calculate the number of NEON sites in each state, join the data to state_basemap and use fill to plot those values on a map.
For each team member, choose one of the NEON taxa that we used in our midterm and make a map of species richness across the domains.
To learn more than what is presented here, see:
This teaching model borrowed from these sources. Be sure to cite the sf package in any product you produce using the sf package.