sf* spatial data objects.sp* spatial data objects.sp* and sf* formats.Spatial data is more complex than simple rectangular attribute data (e.g. data tables where a row is an observation and a column is a variable). Spatial data may (or may not) have attributes, but typically include location information stored in either a vector or raster data model, and then held in memory or stored on disc in some format.
Over the past 10+ years, R has increasingly been used to analyze and visualize spatial data. Early on, investigators tackling the complexities of spatial data analysis developed a number of ad hoc one-off approaches to these data. This helped in the short term but created other problems as users needed to chain together steps and had to convert one data format to another. A result of this tumult was a thoughtful and systematic approach to tackling the unique challenges of spatial data in R. Roger Bivand, Edzer Pebesma and others developed the sp package which defined spatial data classes, and provided functional tools to interact with them.
The sp package defined specifica data classes to hold points, lines, and polygons, as well as raster/grid data; each of these data classes can be geometry only (these have names like SpatialPoints or SpatialPolygons) or could contain geometry plus related data attributes (these have names like SPatialPointsDataFrame or SpatialPolygonsDataFrame). Each spatial object can contain all the information spatial data might include: the spatial extent (min/max x, y values), the coordinate system or spatial projectin, the geometry information, the attribute information, etc.
Because of the flexibility and power of the sp* class of objects, they have become a standard up until the last year or so. They continue to be the only format allowed by many of the packages we will use this semester. However analysts sometimes find the complexity of the sp* objects to be a hindrance to efficient processing of geographic data. Specifically the information is stored in numerous ‘slots’ (not a formal list, but conceptually a little like list elements). As the number of ways to visualize data increases, there was desire to make spatial data behave more like tabular data. This led to the same team (e.g. Bivand, Pebesma, others) to develop the Simple Features set of spatial data classes for R. Loaded with the sf package, this data format is almost surely going to become the standard for the coming years. Recognizing that many users and functions prefer the familiar sp* objects, the sf package includes a number of utility functions for easily converting back and forth. In this class we will use sf* objects as the preferred data format, but because some of the tools we’ll learn require sp* we will go back and forth.
sf* data classes are designed to hold all the essential spatial information (projection, extent, geometry), but do so with an easy to evaluate data frame format that integrates the attribute information and the geometry information together. The result is more intuitive sorting, selecting, aggregating, and visualizing.
sf and spsf over sp spatial class definitions:As Robin Lovelace writes in his online eBook, Gecomputation in R, sf offers an approach to spatial data that is compatible with QGIS and PostGIS, important non-ESRI open source GIS platforms, and sf functionality compared to sp provides:
sf objects can be treated as data frames in most operationssf functions can be combined using %>% operator and works well with the tidyverse collection of R packages (we’ll talk about this more later in the semester)sf function names are relatively consistent and intuitive (all begin with st_)sf packageI will refer to the package names as sp and sf. When I use the asterisk (e.g. sp* and sf*) I am referring to the class of objects defined by these packages. For instance in package sf a polygon vector dataset has class sfc_MULTIPOLYGON whereas a point vector dataset has class sfc_MULTIPOINT.
First, we will look at a spatial polygon file that is currently stored in sf* data class (data available for download on Canvas). The name of the target spatial data is ga_mvc.gpkg. There is more about the meaning of this extension .gpkg below. The ga_mvc dataset is a county polygon vector file with information about motor vehicle fatalities by county in Georgia for 2005, 2014, and 2017. We’ll discuss the relevance of these years more in lab.
rr rr
library(sf)
fp <- '../../DATA/GA_MVC/ga_mvc.gpkg'
mvc <- st_read(fp)
Reading layer `ga_mvc' from data source `H:\SpatialEpidemiology\DATA\GA_MVC\ga_mvc.gpkg' using driver `GPKG'
Simple feature collection with 159 features and 17 fields
geometry type: MULTIPOLYGON
dimension: XY
bbox: xmin: -85.60516 ymin: 30.35785 xmax: -80.83973 ymax: 35.00066
epsg (SRID): 4269
proj4string: +proj=longlat +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +no_defs
rr rr
#Not run but included for illustration
#mvc <- st_read('../../DATA/GA_MVC/ga_mvc.gpkg')
A few things to note here. First, I needed to load the sf package with the library(sf) call. The second thing I did was define where the file exists on my computer. This will obviously be different for you depending on where you saved the data. The reason I put the location in a separate object called fp (for filepath), was to emphasize that this location is specific to your use. In the final call I use the function st_read(fp) to import the file from the location I just named fp; the imported file is held in my computer memory and referenced by the object name mvc.
Another observation we can make at this point is that the spatially-oriented function call st_read() begins with the st_ prefix. Most of the spatial functions in package sf begin with the prefix, making it easier to search for functions specific to spatial data manipulation.
mvc?To explore what this object contains and how it is structured, begin with a few simple checks.
rr rr
class(mvc)
[1] \sf\ \data.frame\
Why are there so many values returned for the class of this object? Because it conforms to many data classes including the very general class data.frame, as well as the closely related table formats tbl_df and tbl, and finally a class sf. Below I use three different functions (names(mvc), summary(mvc), and head(mvc) ) to examine these data. Look at the results of each and explore what information is returned by each. There is overlap, but also some unique information provided particularly by the latter two functions.
rr rr
names(mvc)
[1] \GEOID\ \NAME\ \variable\
[4] \estimate\ \County\ \MVCDEATHS_05\
[7] \MVCDEATHS_14\ \MVCDEATH_17\ \TPOP_05\
[10] \TPOP_14\ \TPOP_17\ \NCHS_RURAL_CODE_2013\
[13] \nchs_code\ \rural\ \MVCRATE_05\
[16] \MVCRATE_14\ \MVCRATE_17\ \geom\
rr rr
summary(mvc)
GEOID NAME variable estimate
13001 : 1 Appling County, Georgia : 1 B00001_001:159 Min. : 260
13003 : 1 Atkinson County, Georgia: 1 1st Qu.: 1093
13005 : 1 Bacon County, Georgia : 1 Median : 1831
13007 : 1 Baker County, Georgia : 1 Mean : 4219
13009 : 1 Baldwin County, Georgia : 1 3rd Qu.: 3762
13011 : 1 Banks County, Georgia : 1 Max. :58814
(Other):153 (Other) :153
County MVCDEATHS_05 MVCDEATHS_14 MVCDEATH_17
Appling County : 1 Min. : 0.000 Min. : 0.000 Min. : 0.000
Atkinson County: 1 1st Qu.: 3.000 1st Qu.: 2.000 1st Qu.: 3.000
Bacon County : 1 Median : 5.000 Median : 4.000 Median : 6.000
Baker County : 1 Mean : 9.862 Mean : 7.667 Mean : 9.774
Baldwin County : 1 3rd Qu.: 11.000 3rd Qu.: 9.000 3rd Qu.:11.000
Banks County : 1 Max. :105.000 Max. :91.000 Max. :98.000
(Other) :153
TPOP_05 TPOP_14 TPOP_17 NCHS_RURAL_CODE_2013
Min. : 1858 Min. : 1693 Min. : 1628 Min. :1.000
1st Qu.: 11492 1st Qu.: 11543 1st Qu.: 11412 1st Qu.:3.000
Median : 22055 Median : 22755 Median : 22736 Median :5.000
Mean : 56138 Mean : 63505 Mean : 65594 Mean :4.428
3rd Qu.: 46712 3rd Qu.: 53725 3rd Qu.: 55067 3rd Qu.:6.000
Max. :818737 Max. :996319 Max. :1041423 Max. :6.000
nchs_code rural MVCRATE_05 MVCRATE_14
Large central metro: 1 non-Rural:102 Min. : 0.00 Min. : 0.000
Large fringe metro :28 Rural : 57 1st Qu.:13.82 1st Qu.: 9.665
Medium metro :15 Median :22.46 Median :14.096
Micropolitan :28 Mean :24.41 Mean :17.912
Non-core :57 3rd Qu.:33.94 3rd Qu.:23.317
Small metro :30 Max. :66.34 Max. :74.944
MVCRATE_17 geom
Min. : 0.00 MULTIPOLYGON :159
1st Qu.: 12.17 epsg:4269 : 0
Median : 20.43 +proj=long...: 0
Mean : 22.57
3rd Qu.: 29.53
Max. :115.16
rr rr
head(mvc)
Simple feature collection with 6 features and 17 fields
geometry type: MULTIPOLYGON
dimension: XY
bbox: xmin: -84.64195 ymin: 31.0784 xmax: -82.04858 ymax: 34.49172
epsg (SRID): 4269
proj4string: +proj=longlat +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +no_defs
GEOID NAME variable estimate County
1 13001 Appling County, Georgia B00001_001 1504 Appling County
2 13003 Atkinson County, Georgia B00001_001 875 Atkinson County
3 13005 Bacon County, Georgia B00001_001 945 Bacon County
4 13007 Baker County, Georgia B00001_001 390 Baker County
5 13009 Baldwin County, Georgia B00001_001 2943 Baldwin County
6 13011 Banks County, Georgia B00001_001 1767 Banks County
MVCDEATHS_05 MVCDEATHS_14 MVCDEATH_17 TPOP_05 TPOP_14 TPOP_17
1 4 4 10 17769 18540 18521
2 5 1 3 8096 8223 8342
3 7 5 0 10552 11281 11319
4 1 1 1 3967 3255 3200
5 6 8 13 46304 45909 44906
6 4 8 6 16683 18295 18634
NCHS_RURAL_CODE_2013 nchs_code rural MVCRATE_05 MVCRATE_14 MVCRATE_17
1 6 Non-core Rural 22.51111 21.57497 53.99276
2 6 Non-core Rural 61.75889 12.16101 35.96260
3 6 Non-core Rural 66.33813 44.32231 0.00000
4 4 Small metro non-Rural 25.20797 30.72197 31.25000
5 5 Micropolitan non-Rural 12.95784 17.42578 28.94936
6 6 Non-core Rural 23.97650 43.72779 32.19921
geom
1 MULTIPOLYGON (((-82.55069 3...
2 MULTIPOLYGON (((-83.141 31....
3 MULTIPOLYGON (((-82.62819 3...
4 MULTIPOLYGON (((-84.64166 3...
5 MULTIPOLYGON (((-83.42674 3...
6 MULTIPOLYGON (((-83.66862 3...
Look through the names of the variables and the summary information. Can you figure out what these variable columns contain/mean? Also notice the general information provided at the top of the output returned from the head() function call. This information is specific to class sf and tells us a great deal about the spatial attributes of data class (MULTIPOLYGON), coordinate system/projection, spatial extent and bounding box. Another way to examine the dataset is to look at the data structure using the str() function.
rr rr
str(mvc)
Classes ‘sf’ and 'data.frame': 159 obs. of 18 variables:
$ GEOID : Factor w/ 159 levels \13001\,\13003\,..: 1 2 3 4 5 6 7 8 9 10 ...
$ NAME : Factor w/ 159 levels \Appling County
One of the advantages of the sf* data classes is that the object is primarily a data table. But then where is the geometry information? Look at the column called ‘geom’. Notice in the str() call how this last field/column is of class sfc_MULTIPOLYGON.
rr rr
class(mvc$geom)
[1] \sfc_MULTIPOLYGON\ \sfc\
rr rr
head(mvc$geom)
Geometry set for 6 features
geometry type: MULTIPOLYGON
dimension: XY
bbox: xmin: -84.64195 ymin: 31.0784 xmax: -82.04858 ymax: 34.49172
epsg (SRID): 4269
proj4string: +proj=longlat +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +no_defs
First 5 geometries:
MULTIPOLYGON (((-82.55069 31.74911, -82.54744 3...
MULTIPOLYGON (((-83.141 31.40673, -83.13898 31....
MULTIPOLYGON (((-82.62819 31.56593, -82.62734 3...
MULTIPOLYGON (((-84.64166 31.3125, -84.63994 31...
MULTIPOLYGON (((-83.42674 33.18273, -83.42496 3...
sfc means simple feature column list. This means that in one column of our data.frame(), each value is actually a list. The list contains the x,y locations of the spatial geometry. So if each is a list, what is in the list? Remember to extract an element of a list we use the double bracket, [[]]. Below I ask R to tell me the class of the mvc$geom that is first in the list, meaning the one that goes with Appling County, the top row of data.
rr rr
class(mvc$geom[[1]])
[1] \XY\ \MULTIPOLYGON\ \sfg\
What does this mean? It means that for each row of data (e.g. each county) there is a list of XY points that make up a polygon. The sfg stands for simple feature geometry, which is the information required to draw a single unit of observation (single county).
sf* data objectNow we know three specific classes of data important to the simple feature format:
sf, the table (data.frame) with feature attributes and feature geometries, which contains…sfc, the list-column with the geometries for each feature (record), which is composed of…sfg, the feature geometry of an individual simple feature (e.g. individual county).Is that confusing? Don’t worry too much about it now. The point is mostly to realize that the geometry part of the spatial data is intricately integrated into the table of attributes. This will provide great benefit for some parts of spatial analysis in epidemiology.
How do we really know what kind of spatial information the geom column contains? One quick way to look is to render a simple plot. Note that if you use the base function plot() with an sf* object, it will plot a map for every column or attribute! In other words there are 17 data columns in the object mvc, and thus 17 maps would be rendered by default. But if we only want to take a quick look we could either plot the geometry alone, or we could specify a single variable to plot. To plot the geometry (e.g. shapes) only I use the function st_geometry().
rr rr
plot(st_geometry(mvc))
rr rr
plot(mvc['MVCRATE_14'])
sp packageIt is easy to convert between the current standard format, sp* Spatial classes and the emerging (and for our purposes, primary) format, sf* Simple Feature classes. In general we can coerce objects to sp* classes using the as(x, 'ClassType') format like this:
rr rr
library(sp)
mvc.spatial <- as(mvc, 'Spatial')
class(mvc.spatial)
[1] \SpatialPolygonsDataFrame\
attr(,\package\)
[1] \sp\
rr rr
summary(mvc.spatial)
Object of class SpatialPolygonsDataFrame
Coordinates:
min max
x -85.60516 -80.83973
y 30.35785 35.00066
Is projected: FALSE
proj4string :
[+proj=longlat +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +no_defs]
Data attributes:
GEOID NAME variable estimate
13001 : 1 Appling County, Georgia : 1 B00001_001:159 Min. : 260
13003 : 1 Atkinson County, Georgia: 1 1st Qu.: 1093
13005 : 1 Bacon County, Georgia : 1 Median : 1831
13007 : 1 Baker County, Georgia : 1 Mean : 4219
13009 : 1 Baldwin County, Georgia : 1 3rd Qu.: 3762
13011 : 1 Banks County, Georgia : 1 Max. :58814
(Other):153 (Other) :153
County MVCDEATHS_05 MVCDEATHS_14 MVCDEATH_17
Appling County : 1 Min. : 0.000 Min. : 0.000 Min. : 0.000
Atkinson County: 1 1st Qu.: 3.000 1st Qu.: 2.000 1st Qu.: 3.000
Bacon County : 1 Median : 5.000 Median : 4.000 Median : 6.000
Baker County : 1 Mean : 9.862 Mean : 7.667 Mean : 9.774
Baldwin County : 1 3rd Qu.: 11.000 3rd Qu.: 9.000 3rd Qu.:11.000
Banks County : 1 Max. :105.000 Max. :91.000 Max. :98.000
(Other) :153
TPOP_05 TPOP_14 TPOP_17 NCHS_RURAL_CODE_2013
Min. : 1858 Min. : 1693 Min. : 1628 Min. :1.000
1st Qu.: 11492 1st Qu.: 11543 1st Qu.: 11412 1st Qu.:3.000
Median : 22055 Median : 22755 Median : 22736 Median :5.000
Mean : 56138 Mean : 63505 Mean : 65594 Mean :4.428
3rd Qu.: 46712 3rd Qu.: 53725 3rd Qu.: 55067 3rd Qu.:6.000
Max. :818737 Max. :996319 Max. :1041423 Max. :6.000
nchs_code rural MVCRATE_05 MVCRATE_14
Large central metro: 1 non-Rural:102 Min. : 0.00 Min. : 0.000
Large fringe metro :28 Rural : 57 1st Qu.:13.82 1st Qu.: 9.665
Medium metro :15 Median :22.46 Median :14.096
Micropolitan :28 Mean :24.41 Mean :17.912
Non-core :57 3rd Qu.:33.94 3rd Qu.:23.317
Small metro :30 Max. :66.34 Max. :74.944
MVCRATE_17
Min. : 0.00
1st Qu.: 12.17
Median : 20.43
Mean : 22.57
3rd Qu.: 29.53
Max. :115.16
Recall the class-type for the original object, mvc was "sf" (as well as "tbl_df" and "data.frame"). Now the mvc.spatial object has class "SpatialPolygonsDataFrame". Notice the summary() call gives the expected summary of attribute data, but just like the sf objects, includes some information about geometry and projection at the top.
To make a simple plot of the geometry of the SpatialPolygonsDataFrame, you can once again use the base plot() functionality:
rr rr
plot(mvc.spatial)
The sp package actually has a function called spplot() which provides a great deal more control for cartography than base plot(). You can review the documentation about use of spplot() by asking for help: ?spplot. While it is a useful function, we will be using a different approach to map-production for most of our work in this course.
sf* and sp*In the previous section we used the as() syntax to convert from sf to sp class. The syntax for going the other way is a little different.
# Go from sf to sp:
mvc.spatial <- as(mvc, Class = "Spatial")
# Go from sp to sf:
mvc_sf <- st_as_sf(mvc.spatial, "sf")
st_read() for importing dataOne of the supposed benefits of using the sf package over sp was faster and more efficient reading/writing (importing/exporting) of data. So what does that mean exactly? Well for one thing, the st_read() function we used to originally import the mvc dataset is not limited to only Simple Feature data, or to only the .gpkg file format. It can import dozens of different data formats including .shp, .kml, and even from the ESRI spatial database format, the geodatabase (.gdb). You can read more about reading and writing spatial data using package sf in one of the online vignettes (click here).
While the st_read() arguments allow you to specify many specifics about the data you wish to import, often the function ‘guesses’ the right one based on the file extension name. For instance:
rr rr
mvc.shp <- st_read('../../DATA/GA_MVC/ga_mvc.shp')
Reading layer `ga_mvc' from data source `H:\SpatialEpidemiology\DATA\GA_MVC\ga_mvc.shp' using driver `ESRI Shapefile'
Simple feature collection with 159 features and 17 fields
geometry type: MULTIPOLYGON
dimension: XY
bbox: xmin: -85.60516 ymin: 30.35785 xmax: -80.83973 ymax: 35.00066
epsg (SRID): NA
proj4string: +proj=longlat +ellps=GRS80 +no_defs
rr rr
class(mvc.shp)
[1] \sf\ \data.frame\
rr rr
names(mvc.shp)
[1] \GEOID\ \NAME\ \varbl\ \estmt\ \Conty\
[6] \MVCDEATHS_\ \MVCDEATH_1\ \MVCDEATH_\ \TPOP_0\ \TPOP_14\
[11] \TPOP_17\ \NCHS_\ \nchs__1\ \rural\ \MVCRATE_0\
[16] \MVCRATE_14\ \MVCRATE_17\ \geometry\
What was different here from our initial st_read()? In this case we actually read in a totally different dataset, call ga_mvc.shp. It was saved in the ESRI shapefile format, but when read in we see the geometry information (multipolygon), projection infromation, and attribute data have all been imported as an sf object. One note about the field names: ESRI has rules (annoyingly restricting rules in my opinion) about the length of variable names…so notice how our variable names are truncated.
st_write() for exporting dataWhat about saving/writing data? It’s just about as easy. The corresponding command is, st_write() and again the function can ‘guess’ what format based on the file name, although you can also specify the arguments to be certain:
# Save as a geopackage .gpkg (this is efficient storage format)
st_write(mvc, 'ga_mvc2.gpkg')
# Save as an ESRI shapefile
st_write(mvc, 'ga_mvc2.shp')
# Save as shapefile, but specify arguments directly
st_write(mvc, dsn = "ga_mvc2.shp", layer = "ga_mvc.shp", driver = "ESRI Shapefile"))
.gpkg?.gpkg is short for the geopackage format, which is an open-source format for storing vector or raster data. Geopackages are stored as a single file on your computer, and in a format that can be read by many other GIS software packages including ArcGIS! You can try navigating within Arc Catalog to whereever you save your .gpkg data and see that it appears as a spatial database in Arc. You can add the dataset into ArcMap directly from this format.
While I recognize that much of the vector spatial data in the world is in the ESRI .shp format, we will primarily save data in the .gpkg format because it is flexible, efficient, and stored as a single file on disk so avoids the potential confusion that occurs when moving data that are contained in 4-8 separate files on disk.
There are two related concepts important to properly managing, analyzing, and visualizing spatial data: the Coordinate System and the Projection. I assume you have learned about each of these in your intro to GIS course, so I only very briefly review them, with attention to how we manage them in R.
The coordinate system refers to an agreed upon reference system for representing location information across datasets in a way that allows them to be properly aligned with one another and have a known relationship with the world. The coordinate system is defined by its geographical shape (spherical or planar), reference points (e.g. the prime meridian), its units of measurement (e.g. degrees latitude/longitude or meters). A common coordinate system is the GCS84 (World Geodetic System 1984).
Map projections are varying ways to represent the spherical world on a flat piece of paper. Because this is inherently impossible to do with complete accuracy, projections vary with respect to what aspect of representation is most important. For instance some projections maintain map distance to be consistent across the map; other maintain area; others maintain angles, etc. You can see online resources to read more about common projections.
Any spatial data we will see in this course was created using a coordinate system, although not all data are ‘projected’. There is built-in functionality to transform a given dataset into another coordinate system or projection, but before we can do that we must know where we are starting!
rr rr
# to see the current coordinate information use st_crs()
st_crs(mvc)
Coordinate Reference System:
EPSG: 4269
proj4string: \+proj=longlat +ellps=GRS80 +towgs84=0
This little bit of code and result opens up a whole set of words and symbols we will see often. Briefly here they are:
Sometimes you know you want a specific projection but don’t know the number. There are many resources online but one that is useful is http://spatialreference.org/.
So in the case of our current Georgia Counties dataset we can see that it is ‘unprojected’. For public health mapping of areal units (like counties) I prefer equal-area projections because it assures each geographic unit is fairly represented to the viewer. To transform the current data to something new we must first define the ‘new’ projected coordinate system. I used one of the online directories (http://www.epsg-registry.org/) to look up the Albers Equal Area USA projection. It is EPSG::5069. To change from old to new we use st_transform().
rr rr
mvc.albers <- st_transform(mvc, 5069)
st_crs(mvc.albers)
Coordinate Reference System:
EPSG: 5069
proj4string: \+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=23 +lon_0=-96 +x_0=0 +y_0=0 +datum=NAD27 +units=m +no_defs\
Now we will plot the ‘old’ and the ‘new’ to see if they differ.
rr rr
plot(st_geometry(mvc), main = 'Unprojected lat/lon')
rr rr
plot(st_geometry(mvc.albers), main = 'Albers projection')