This data tutorial requires you to have basic understanding of ggplot2
and dplyr
from the tidyverse
. It is also helpful if you have the basic R data wrangling and plotting knowledge outlined in R for Data Science.
To run the analyses outlined in this tutorial you need the following packages.
install.packages("tidyverse", dependencies = TRUE)
install.packages("sf", dependencies = TRUE)
install.packages("rnaturalearth", dependencies = TRUE)
install.packages("rnaturalearthdata", dependencies = TRUE)
install.packages("rnaturalearthhires", dependencies = TRUE)
install.packages("tigris", dependencies = TRUE)
install.packages("ggrepel", dependencies = TRUE)
If you do not want to install all of the dependencies, then set this argument to FALSE
. If you are having problems installing the rnaturalearthhires
try the code below to install directly from GitHub (you must have devtools
installed).
library(devtools)
install_github("ropensci/rnaturalearthhires")
Finally, if you are having difficulty running the code, then update all of the required packages and try again.
The data used in this module and for the module exercises can be obtained from the installed packages, from the NEON data portal, and from the NEON spatial data website. Specifically you will need the sites and domains shapefiles, which are located in zip files that you extract to your working directory.
There are two fundamental types of spatial objects: vectors and rasters. Vector data are discrete entities in space and time, such as a lakes, trees, houses, roads, states and countries. Raster data represent continuous data and are divided into pixels at a defined resolution. Rasters are images with associated spatial coordinates and typically are represented in data formats such as a GeoTIFF, but they can also be represented as text files. Raster data can be made from vector data and vice versa depending on the goals of an analysis.
Spatial data may can typically be boiled down to data tables and and often wrangled using methods from tidyverse
. In this module we focus on learning how to wrangle and plot vector data.
Figure from: Martin Wegmann
Vector data in R are represented by simple features in the sf package
. Features are a standard way to represent discrete spatiotemporal data (i.e., data measured at a particular place at a particular time). Features are simply tidy data tables with observations as rows and variables as columns. Features have both spatial and non-spatial variables.
Feature spatial variables are termed geometries and indicate where on Earth the feature is located. For example, a GPS point would have a geometry made up of a latitude and longitude, and a polygon would have a geometry made up of multiple connected points that define the area encompassed by the polygon. Non-spatial variables are termed metadata, attributes or properties. In this module, we use the term properties. For example, a single point object can have a name and date associated with it as properties.
Points, lines and polygons are vector data. Points are 0D (zero-dimensional) geometries like a GPS location with a latitude and longitude. Lines are 1D geometries made as a sequence of points connected with lines like a river or a road that do not self intersect to form a closed loop. Polygons are 2D geometries of sequences of connected points (lines) that form closed 2D loops (e.g., squares, donuts).
Vector geometries are all situated in 2D XY space. However, all simple features can also have two more dimensions for a total of 4D (XYZM space). The Z dimension indicates altitude, and the M dimension indicates measure like measurement error or the time the measurement was taken. We do not consider these other dimensions, but see here for more information.
All sf
objects are stored as data tables (data.frame
or tbl_df
) with the number of rows equal to the number of features. For example, an sf
object with 10 GPS points would have 10 rows. Geometries are stored as list-columns because all feature geometries require at least X and Y values (e.g., latitude and longitude). Properties can either be stored as standard or list-columns.
sf
functions are prefixed by st_
(standing for spatial and temporal).
The sf package
can read many file types that contain vector data with the st_read
function. Common file types include .csv and .shp files. Comma-delimited files (.csv) are the simplest and a common way to store point features. The shapefile format is a common file type for vector data maintained by the ERSI company.
Here we read in the NEON site locations shapefile with the st_read
function. You will first need to download the zip file and extract it to your working directory.
library(sf)
neon_sites <- st_read("NEON_Field_Sites/NEON_Field_Sites_v16_1.shp", as_tibble = TRUE, quiet = TRUE)
# Make sure that the path argument in st_read is where you saved the .shp file
neon_sites
Simple feature collection with 81 features and 11 fields
geometry type: POINT
dimension: XY
bbox: xmin: -156.6194 ymin: 17.96955 xmax: -66.79851 ymax: 71.28241
CRS: 4326
The printed summary indicates that the object neon_sites
contains 81 POINT features each with 11 properties. The features are 2D with X and Y dimensions. The area encompassed by all of the points is the bounding box bbox
.
The epsg (SRID)
value and the proj4string
value provide information on the coordinate reference system (CRS) of the object. All spatial data must include a CRS that says how to project the data onto the surface of the earth (e.g., circular like on a sphere or flat like on a map). GPS data treat the earth as a sphere and use the WGS84 CRS. Your phone’s GPS uses this CRS. The NEON shapefile is in this projection. Make sure that if you are mapping multiple data sets, all of the data sets are in the same CRS. You can change a CRS with st_transform
.
See here for more information on CRS in R. Also see this video that explains what projection are and their historical context. This website is a quick way to see a lot of different projections at once.
Shapefiles .shp
are always associated with a set of other supporting files that provide metadata for the shapefile and explain to R how to visualize and map the shapefile. Thus, sf_read
not only reads in the .shp
file but also the other associated files in the same directory. There are 8 files read by st_read
when loading NEON_Field_Sites.shp
.
tibble(`Files Read by sf_read` =
str_subset(dir("NEON_Field_Sites"),
pattern = "NEON_Field_Sites_v16"))
Note that if you want to use st_read
for .csv files then you will need to pass along options
that say which columns are the x
and y
geometry variables, for example
mysf <- st_read("my.csv", options = c("Longitude=x", "Latitude=y"))
.
Or you could just read in your csv as usual and use st_as_sf()
setting the coords
argument (and possibly the crs
argument), for example
mysf <- read_csv("my.csv") %>% st_as_sf(coords = c("Longitude", "Latitude"))
sf
objects are data tables and thus many of the functions that are used to wrangle data tables can be used with sf
objects. For example, dplyr::select
can choose and rearrange variables; dplyr::filter
can subset features based on properties; and dplyr::summarize
can aggregate and summarize based on properties. First let’s see what variables are in the neon_sites
object.
colnames(neon_sites)
[1] "domainNumb" "domainName" "siteName" "PMC"
[5] "siteID" "siteType" "siteHost" "stateID"
[9] "stateName" "latitude" "longitude" "geometry"
Select the domainName
property
# selct only the DomainName variable.
neon_sites %>% dplyr::select(domainName)
Simple feature collection with 81 features and 1 field
geometry type: POINT
dimension: XY
bbox: xmin: -156.6194 ymin: 17.96955 xmax: -66.79851 ymax: 71.28241
CRS: 4326
Note that because this is an sf
object, the geometry
list-column is also returned. Now let’s subset the sites within the Mid-Atlantic Domain.
# subset sites within the Mid-Atlantic Domain
neon_sites %>%
filter(domainName == "Mid-Atlantic")
Simple feature collection with 5 features and 11 fields
geometry type: POINT
dimension: XY
bbox: xmin: -78.14678 ymin: 38.89008 xmax: -76.56001 ymax: 39.09564
CRS: 4326
Now, we summarize the data by counting the number of sites located within each U.S. state.
# count number of sites colocated in each State
neon_sites %>%
group_by(stateName) %>%
summarize(`# of sites` = n()) %>%
arrange(desc(`# of sites`))
Simple feature collection with 25 features and 2 fields
geometry type: GEOMETRY
dimension: XY
bbox: xmin: -156.6194 ymin: 17.96955 xmax: -66.79851 ymax: 71.28241
CRS: 4326
Download the site .csv file from the online map of NEON sites. Read in this file. Is the sf
object produced by this file identical to the sites shapefile we used above? Use R code to show if and how they are different.
neon_sites
that include:Use geom_sf
in ggplot2
to plot any sf
object.
ggplot() + geom_sf(data = neon_sites, size = 2, shape = 21, fill = "red") + theme_bw()
Like points, polygons can be read in with st_read
. Here we read in the NEON domains shapefile that give the outlines of the regions that NEON used to choose their sites. See here for a video that describes the statistical process used.
Like for the NEON sites shapefile used above, you have to download the zip file and extract the domains shapefile to your working directory.
neon_doms <- st_read("NEONDomains_0/NEON_Domains.shp", as_tibble = TRUE, quiet = TRUE)
Let’s see what is in neon_doms
neon_doms
Simple feature collection with 22 features and 6 fields
geometry type: MULTIPOLYGON
dimension: XY
bbox: xmin: -168.1244 ymin: 17.92621 xmax: -65.59024 ymax: 71.40624
CRS: 4326
In this object, there are multiple polygons and the DomainName
variable provides the name of the domain. The simplest way to make a map of sf
objects is to use the geom_sf
function in ggplot2
. We can map and then label the domains with the following code:
neon_doms %>%
ggplot() +
geom_sf(aes(fill = DomainName)) +
geom_sf(data = neon_sites) +
geom_sf_label(aes(label = DomainName), size = 1.8) +
labs(title = "NEON sites and labeled domains") +
theme(legend.position = "none")
We plotted both the sites and the polygons and we use geom_sf_labels
to label each of the domains with their name. The function geom_sf_text
provides similar functionality. There is also a way to make the labels better spaced usingggrepel
for labeling as described here.
library(ggrepel)
neon_doms %>%
ggplot() +
geom_sf(aes(fill = DomainName)) +
geom_label_repel( aes(label = DomainName, geometry = geometry),
stat = "sf_coordinates",
min.segment.length = 0,
size = 1.8) +
labs(title = "Use ggrepel to make popout annotations") +
theme(legend.position = "none")
Compare the functions geom_sf
and geom_point
by plotting the NEON sites data (i.e., neon_sites
that we made above). Do the two maps look the same? How can you produce a map with geom_point
that looks better? Can you use the same technique with geom_sf
? What errors do you get?
Change the projection of one of the above maps into an equal area projection.
So far we have just been plotting vector data that we supplied to R, but sometimes you will want to plot your vector data onto a base map, like the outline of the USA. Natural Earth is an open access database of global spatial data. It allows you to easily make base maps on which to plot your data. The packages you need are rnaturalearth
and rnaturalearthdata
.
library(rnaturalearth)
library(rnaturalearthdata)
global_basemap <- ne_countries(scale = "small", returnclass = "sf")
ggplot(data = global_basemap) +
geom_sf(fill = "white") +
labs(title = "World map from Natural Earth (low resolution)")
The ne_countries
function from rnaturalearth
returns country polygons. The returnclass
argument just says to return an sf
object. The scale
argument provides the map resolution, there are small
, medium
and large
resolution base maps. For the large resolution you will need to install rnaturalearthhires
and you will be prompted to do so when you set scale = "large"
if you have not already installed the package. To obtain even higher resolution base maps, then see the GADM R package GADMTools
.
To zoom in on an area, use coord_sf
, which is the default coord_
function that geom_sf
uses to plot data. See the R4DS chapter on coordinate systems.
ggplot(data = global_basemap) +
geom_sf(fill = "white") +
geom_sf(data = neon_sites) +
coord_sf(xlim = c(-90, -50), ylim = c(10, 50), expand = FALSE) +
theme(panel.background = element_rect(fill = "lightblue")) +
labs(title = "Zoom in with xlim and ylim")
Notice how with this low resolution map, many of the smaller Caribbean Islands are missing.
The globe can be partitioned into hierarchical geopolitical administrative boundaries. rnaturalearth
provides data at country and state level boundaries. To illustrate these levels, first let’s get a base map for an individual country.
usa_basemap <- ne_countries(country = "united states of america",
scale = "medium",
returnclass = "sf")
ggplot() +
geom_sf(data = usa_basemap) +
coord_sf(xlim = c(-175, -60)) +
theme_dark() +
labs(title = "U.S.A.")
To get a base-map of individual states use the ne_states
function. The first time that you run the code below, you will be asked to install the rnaturalearthhires
package.
state_basemap <- ne_states(country = "united states of america",
returnclass = "sf")
ggplot() +
geom_sf(data = state_basemap) +
coord_sf(xlim = c(-175, -60)) +
theme_bw() +
labs(title = "U.S.A. States")
But Puerto Rico is not included in this base map, and NEON has a site in Puerto Rico.
pr_basemap <- ne_states(country="puerto rico", returnclass = "sf")
# Combine the maps
state_pr_basemap=rbind(state_basemap, pr_basemap)
ggplot() +
geom_sf(data = state_pr_basemap) +
coord_sf(xlim = c(-175, -60)) +
theme_bw() +
labs(title = "U.S.A. States + Puerto Rico")
What about even lower administration base maps? For U.S. county scale data, use the tigris
package. This package downloads county (and other) data from the Geography Program of the U.S. Census Bureau. To download county data use the counties
function.
county_basemap
Simple feature collection with 3233 features and 9 fields
geometry type: MULTIPOLYGON
dimension: XY
bbox: xmin: -179.1489 ymin: -14.5487 xmax: 179.7785 ymax: 71.36516
CRS: +proj=longlat +datum=NAD83 +no_defs +ellps=GRS80 +towgs84=0,0,0
First 10 features:
STATEFP COUNTYFP COUNTYNS AFFGEOID GEOID NAME LSAD
0 21 007 00516850 0500000US21007 21007 Ballard 06
1 21 017 00516855 0500000US21017 21017 Bourbon 06
2 21 031 00516862 0500000US21031 21031 Butler 06
3 21 065 00516879 0500000US21065 21065 Estill 06
4 21 069 00516881 0500000US21069 21069 Fleming 06
5 21 093 00516893 0500000US21093 21093 Hardin 06
6 21 099 00516896 0500000US21099 21099 Hart 06
7 21 131 00516912 0500000US21131 21131 Leslie 06
8 21 151 00516919 0500000US21151 21151 Madison 06
9 21 155 00516921 0500000US21155 21155 Marion 06
ALAND AWATER geometry
0 639387454 69473325 MULTIPOLYGON (((-89.18137 3...
1 750439351 4829777 MULTIPOLYGON (((-84.44266 3...
2 1103571974 13943044 MULTIPOLYGON (((-86.94486 3...
3 655509930 6516335 MULTIPOLYGON (((-84.12662 3...
4 902727151 7182793 MULTIPOLYGON (((-83.98428 3...
5 1614569777 17463238 MULTIPOLYGON (((-86.27756 3...
6 1068530028 13692536 MULTIPOLYGON (((-86.16112 3...
7 1038206077 9189732 MULTIPOLYGON (((-83.5531 37...
8 1132729653 15306635 MULTIPOLYGON (((-84.52564 3...
9 888463701 9891797 MULTIPOLYGON (((-85.52129 3...
county_plot <- ggplot() +
geom_sf(data = county_basemap) +
labs(title = "
U.S.A. counties")
county_plot + coord_sf(xlim = c(-175, -60), ylim = c(16, 73))
On this map you can plot other layers like the neon_sites
.
county_plot +
geom_sf(data = neon_sites, size = 2, shape = 21, fill = "red") +
coord_sf(xlim = c(-175, -60), ylim = c(16, 73))
If you want to just get the counties for one state, then you have to work with the STATEFP
property in the county_basemap
object. This property provides a unique numeric code called a FIPS code for each state. To find the state you want, use lookup_code
. Let’s look at Colorado, which as we saw above, has 7 NEON sites, the second most of any state. It is also where the NEON headquarters is located.
lookup_code("colorado")
[1] "The code for Colorado is '08'."
county_basemap %>%
filter(STATEFP == '08') %>%
ggplot() +
geom_sf() +
geom_sf(data = filter(neon_sites, stateName == "Colorado"),
aes(fill = siteType), shape = 21, size = 3, alpha = .7) +
labs(title = "Colorado and its NEON sites") +
theme_bw() +
guides(fill = guide_legend(override.aes = list( size = 1, alpha = 1)))
Once you are able to map polygons and points, you will want to display your data on the map. Just like any ggplot
, you can set aesthetics to plot data. Here, we use fill
to color countries according to income grouping. The countries data set from rnaturalearth
already contains a lot of data at the per country scale, and income group is one included variable.
ggplot(data = global_basemap) +
geom_sf(aes(fill = income_grp)) +
labs(title = "Income groups")
Calculate the number of NEON sites in each state, join the data to state_basemap
and use fill
to plot those values on a map.
Choose one of the NEON taxa that we used in our midterm and make a map of species richness across the domains.
To learn more than what is presented here, see:
This teaching model borrowed from these sources. Be sure to cite the sf package in any product you produce using the sf package.
Donal O’Leary a science education specialist at NEON made comments and edits to this module. Thank you!