Preparing (Vector) Spatial Data for R

Thinking Spatially

Spatial data is awesome. Data with latitude and longitude are rich in relational information between data observations, and beautiful to visualize. Comparing one place on the earth to another often makes intuitive sense (how far away is A from B? Does C touch/overlap with D?), and the flavors of relationships you can create between data with geometries are nearly endless. Vastly different types of data can be combined and leveraged via space and location as a unique identifier rather than some kind of tabular key. However, the process of visualizing and analyzing geometry can have a steep learning curve; software like ArcGIS was created to make spatial visualization and analysis “easy” (although ArcGIS certainly has its own idyosincracies as well).

Broadly, there are two types of spatial data available– vector data, and raster data.

Points, Polygons, and Images, Oh My

Raster data is when a network or mesh of evenly spaced points are used to create a “surface” that describes some attribute of that space. Remotely sensed data using satellites to take primary data like land cover and climate data are often represented this way. They are mapped such that the center of the points form the center of a graph-paper like surface, where the value in the center of the point applies to the entire graph square. This creates the appearance of a surface of squares. The size of these squares (the amount of space each data point is assumed to represent) is refered to as the resolution– if the squares are 5km by 5km, the raster is said to be at a 5km resolution. However, even this is an abstraction–the points are usually sampled such that they are evenly spaced in latitude and longitude, but these “squares” can look more irregular in some areas (such as at the poles) depending on what projection you are using. These spatial data often carry file extensions like .TIFF, .JPEG, although different software packages and governemnt agencies sometimes have their own proprietary formats.

Vector data is when points, lines, and polygons are used to condense the natural and built world into geometric shapes. However, it is important to note that any given real-world object can be represented by a GIS (geographic information system) in multiple ways– the location of a cabin might be represented by a point, showing its location– or by a square showing the footprint of the structure itself. Rivers and lakes can be abstracted to lines and polygons, but given that the width of a river might also be of importance, rivers might also be represented as polygons. The level of detail describing the line or polygon also influences the precision with which you can make inferences about shape, size, and relationships between data (is your lake represented using a clean oval, or exploring every nook and cranny of the wetlands?). Finding areas of similarity, or connecting the dots between raster data objects often is one way of condensing and simplifying raster data into patches or lines from a surface of points.

It’s relatively rare that researchers need to create the shapefiles and geometries used to abstact the real world themselves– as such, you’re probably stuck with what you’ve recieved, which may or may not be the ideal “representation of the real” for the process you plan to undertake. As such, remember not to take these spatial “truths” but so seriously– your spatial data is probably somewhat imprecise, the locations described are most likely not as concrete as we imagine them to be, and human-created boundaries like borders are often arbitrary anyway. That said, using spatial relationships is usually still more interesting and useful than ignoring them.

From a theoretical standpoint, the goal of most spatial data, and resulting analytic processes is to end up with a simplified version of the real relationships as they exist in the real world. Similar to the way that summary statistics are used to get a sense of what’s going on in a non-spatial data set, summaries of spatial processes can be used to simplify what we know about how a variable exists in space. For example, knowing the spatially weighted average of where a species was observed each year (with weights as number of frogs observed per site) might give you a sense of whether the species is moving due to changing climate pressures. Relationships between two spatial variables, like area of intersections, or matrices of distances between points, are ways to condense complicated patterns (that are sometimes easy to recognize by eye) into data you can use in a regression model.

A little bit on how shapefiles work

One of the most pervasive spatial data types for vector data is the Shapefile. Created by ESRI (makers of the ArcGIS software), the shapefile is a format that consists of multiple different parts, creating one data type that is structured as if it has a multi-component soul. Never move these files separately– keep all of the files with the same name (regardless of extension) in the same directory. The (generally) most important parts/file extensions to know are described below:

Shapefile Horcruxes:

.shp - This contains all of the actual lat/lon locations of each point that goes into your vector data set
.dbf - The data attached to each point. If you want to see what’s there you can open this type of file in excel–however, look but do not touch; NEVER change any of the actual data by hand if you do this; despite appearances, this is not a .csv under the hood, and it’s very possible you will be very sad that you have corrupted your file if you do this.
.prj- This simple text file contains a string that describes what spatial projection your data is presented in. *This may or may not exist, but hopefully it does, or you’ll need to figure out via research or guesswork what the projection information was before you analyze it.
.shx- Shapefiles break without this one (.shp, .shx, and .dbf are the only requried extensions), and it deals with spatial indexing, but there’s no reason you need to interface with this in R.

There are plenty of other shapefile extensions (.atx, .ixs, .qix…) that may or may not exist for your file, and that you probably don’t have to worry about.

Loading in the data

Outside of using specially licenced software such as ArcGIS, most of the programatic ways of manipulating spatial data depend on the GDAL (Geospatial Data Analysis Library) and GEOS (Geometry Engine- Open Source) libraries. Many GIS users start off learning the basics of spatial analysis in ArcGIS, then transition to using more open source software like QGIS, (formerly known as Quantum GIS), Python, and R to accomplish the same goals faster, and with more customizability (without paying through the nose for a license). QGIS, Python, and R all ultimately call on the GEOS and GDAL packages under the hood. For basic, one-time-only spatial operations like buffers, intersections, clips, etc, it might be easier to simply do that operation in QGIS (or ArcGIS if you have it) than to mess with this nonsense.

However, ArcGIS and QGIS don’t have a way to loop and iterate through data when creating a map (although both have functionalities to print “atlases”, this is less useful for researchers that want to see different data portrayed within a similar template quickly). So, at the very least, it’s good to know how to load Shapefiles into R for data visualization alone.

This is designed to be an example of how to read in, manipulate, and prepare spatial data for use in R. This example uses a clean (no messed up geometries) shapefile, with valid topology.

Calling upon appropriate libraries, clearing the work space, and discovering where the files are located (the “root” of the file paths).

  library(data.table) #
  library(maptools) # R package with useful map tools
  library(rgeos) # "Geomegry Engine- Open Source (GEOS)"
  library(rgdal) # "Geospatial Data Analysis Library (GDAL)"
  
# Getting rid of anything saved in your workspace 
  rm(list=ls())

Loading in a shapefile

Using the functions in the woodson library

New as of version 0.1.2 of the woodson library, there is now a function that can bring in a shapefile (polygons) into R. You can use this function to bring in your spatial data as a SpatialPolygonsDataFrame, with a data.table as the @data slot. This function does many of the steps detailed below (like checking for geometric validity, transforming the @data object into a data.table). Using the function should be straighforward– simply give the function the directory, the name of the layer, and the string name of the field that serves as the primary ID for the polygons.

If a field that is not a primary key is provided, the polygons will be collapsed such that there will be 1 polygon object (possibly multi-part) for each of the primary keys you have identified, but the rest of the data table will no longer attach to the shapefile.

Note that the shapefile will be returned in the projection that the shapefile is in, and that there is a new field added called polygon_order– this is to protect useres from scrambling the polygons and the data attributes in later merges by always making sure that the @data object is sorted on this field before plotting later. This topic is discussed below, but is gone through even more exhaustively in the tutorial “The Scrambled Polygon Problem” (also on rpubs.com/BeccaStubbs).

library(woodson)

## Loading required package: ggplot2

## Loading required package: ggthemes

## Loading required package: grid

## Loading required package: rje

## 
## Attaching package: 'rje'

## The following object is masked from 'package:data.table':
## 
##     last

## The following object is masked from 'package:base':
## 
##     arrayInd

## Loading required package: extrafont

## Registering fonts with R

## Loading required package: classInt

counties<-shp_to_Rpolygons("J:/DATA/Incoming Data/USA/sae_data/cb_2013_us_county_20m/","cb_2013_us_county_20m","GEOID")

## OGR data source with driver: ESRI Shapefile 
## Source: "J:/DATA/Incoming Data/USA/sae_data/cb_2013_us_county_20m//cb_2013_us_county_20m.shp", layer: "cb_2013_us_county_20m"
## with 3221 features
## It has 9 fields
## Integer64 fields read as strings:  ALAND AWATER 
## [1] "PSA: There is a new column in the @data slot called polygon_order. Remember that the attributes (values and IDs) for each polygon are only linked to the polygons by sorting order-- if you ever merge anything onto your data, make sure that you re-order the data by this polygon_order field afterwards."

plot(counties)

head(counties@data)

##    GEOID polygon_order STATEFP COUNTYFP COUNTYNS       AFFGEOID    NAME
## 1: 01001             1      01      001 00161526 0500000US01001 Autauga
## 2: 01003             2      01      003 00161527 0500000US01003 Baldwin
## 3: 01005             3      01      005 00161528 0500000US01005 Barbour
## 4: 01007             4      01      007 00161529 0500000US01007    Bibb
## 5: 01009             5      01      009 00161530 0500000US01009  Blount
## 6: 01011             6      01      011 00161531 0500000US01011 Bullock
##    LSAD      ALAND     AWATER
## 1:   06 1539584443   25773561
## 2:   06 4117629064 1133084680
## 3:   06 2291820953   50864677
## 4:   06 1612481768    9287964
## 5:   06 1670041814   15077461
## 6:   06 1613058451    6055020

We can also check the shapefile we’ve brought in to make sure that it meets some of the basic requirements of the wmap() function, using another one of the woodson library’s functions– it returns “TRUE” if the shapefile has projection information, valid geometry, and has a data.table in its @data slot:

check_wmapshp(counties)

## [1] TRUE

Now that we’ve learned the easy way, let’s go into depth about some of aspects of bringing in polygons from shapefiles into R– many of the processes (subsetting data, re-projecting it, etc), can be done on objects you’ve already brought into R using the functions above.

Manipulating and bringing in your shapefiles without the function

There are many ways to load spatial data into R– however, one of the most straightforward is to use the readOGR function, since it also reads in the projection information (if it exists).

shapefile_name<-"cb_2013_us_county_20m"
indir <- "J:/DATA/Incoming Data/USA/sae_data/cb_2013_us_county_20m/" # Defining the directory that contains the shapefile you are interested in
map <- readOGR(paste0(indir, shapefile_name,".shp"), layer=shapefile_name)

## OGR data source with driver: ESRI Shapefile 
## Source: "J:/DATA/Incoming Data/USA/sae_data/cb_2013_us_county_20m/cb_2013_us_county_20m.shp", layer: "cb_2013_us_county_20m"
## with 3221 features
## It has 9 fields
## Integer64 fields read as strings:  ALAND AWATER

Our first check should be to see if the spatial geometry is valid– the gIsValid function looks at each polygon, and makes sure it doesn’t break any topological rules. Ideally, all of the geometry is valid, and the total number of polygons that fail the validity tests is 0.

validity<-data.table(valid=gIsValid(map,byid=TRUE))
sum(validity$valid==F)

## [1] 0

Lucky for us, the sum of the geometry problems is 0– If you start getting errors about invalid geometries, my recommendation is to a.) Find a friend that has an ArcGIS license, and use the “repair geometries” feature, and/or b.) talk to someone that’s good at troubleshooting spatial data— fixing invalid geometries is tough to do in R, or c.) prepare for a long road with many struggles. Another good start is do a UnionSpatialPolygons operation on the geographic ID of interest- that sometimes gets rid of topology errors as well (see below).

At this point, the map object is a Large SpatialPolygonsDataFrame. Just as there are multiple parts to a shapefile, there are multiple components of this data type. There is a polygons object, which we can access using map@polygons, and a data object, which we can access using map@data. A quick summary() will tell us what fields are within the data attributes of this file, what the projection information is, and the bounding box (x/y extent) of the geometry itself. We can also use R’s native plot() function to do a preliminary exploration of what this data looks like over space.

summary(map)

## Object of class SpatialPolygonsDataFrame
## Coordinates:
##          min       max
## x -179.14734 179.77847
## y   17.88481  71.35256
## Is projected: FALSE 
## proj4string :
## [+proj=longlat +datum=NAD83 +no_defs +ellps=GRS80 +towgs84=0,0,0]
## Data attributes:
##     STATEFP        COUNTYFP        COUNTYNS              AFFGEOID   
##  48     : 254   001    :  50   00023901:   1   0500000US01001:   1  
##  13     : 159   003    :  50   00025441:   1   0500000US01003:   1  
##  51     : 134   005    :  50   00025442:   1   0500000US01005:   1  
##  21     : 120   009    :  49   00025443:   1   0500000US01007:   1  
##  29     : 115   007    :  48   00025444:   1   0500000US01009:   1  
##  20     : 105   013    :  48   00025445:   1   0500000US01011:   1  
##  (Other):2334   (Other):2926   (Other) :3215   (Other)       :3215  
##      GEOID              NAME           LSAD             ALAND     
##  01001  :   1   Washington:  31   06     :3007   1000508842:   1  
##  01003  :   1   Franklin  :  26   13     :  78   1001059073:   1  
##  01005  :   1   Jefferson :  26   15     :  64   1001787876:   1  
##  01007  :   1   Jackson   :  24   25     :  41   1002099184:   1  
##  01009  :   1   Lincoln   :  24   04     :  12   1002370084:   1  
##  01011  :   1   Madison   :  20   05     :  11   1003451809:   1  
##  (Other):3215   (Other)   :3070   (Other):   8   (Other)   :3215  
##         AWATER    
##  0         :   1  
##  10017640  :   1  
##  100218683 :   1  
##  100389945 :   1  
##  10040121  :   1  
##  1004179934:   1  
##  (Other)   :3215

plot(map, main="Map")

Looks pretty tiny and awkward right now, but that can be fixed later.

Creating and manipulating data fields within the geography’s attributes

This map is a “Large SpatialPolygonsDataFrame”– an R object that contains both geographic information, and any fields/data attached to those geographies (like data, ID fields, etc). You can do operations that use the geographic information (for example, calculate the area of each polygon), or change, alter, or add to the data attached.

Probably the most crucial thing to realize about this data format is that the shapes (polyons, points, etc) are linked to the data through the sort order of the polygon objects and the data objects. I could scramble the "@data" object, and I could still plot a map with colors–however, I could be plotting Nevada’s values in Massachussetts, and if you don’t have the expert knowlege to know that something seems “off”, your entire analysis can be totally bunk due to sort order issues. Many of the ways we manipulate and merge data might seem overly complex– this is to prevent the sort order from becoming lost, and to preserve the geometry-attribute link’s validity.

You can access the data attached to the shapefile by calling the “data” attribute from the Spatial Polygons Data Frame:

head(map@data)

##   STATEFP COUNTYFP COUNTYNS       AFFGEOID GEOID       NAME LSAD
## 0      06      049 00277289 0500000US06049 06049      Modoc   06
## 1      13      257 00350028 0500000US13257 13257   Stephens   06
## 2      18      127 00450382 0500000US18127 18127     Porter   06
## 3      21      077 00516885 0500000US21077 21077   Gallatin   06
## 4      21      053 00516873 0500000US21053 21053    Clinton   06
## 5      26      149 01625034 0500000US26149 26149 St. Joseph   06
##         ALAND    AWATER
## 0 10140841147 745929078
## 1   463948108  13121937
## 2  1083011641 268393387
## 3   254698298  16519758
## 4   510864239  21164150
## 5  1296531678  52934828

Right now, the data object is a data.frame. It might be useful to change the data object to a data.table: this allows us to use data.table and data.frame methods! To do this, we first convert the data.frame into a data.table by calling data.table(map@data), then take a copy of it, and reassign the “data” slot of the map object to be the data.table version.

We take a copy here because of the way that data.table works– it stores an object in memory, which makes it fast to access, but also means that any operation done to a subset of that data.table (that isn’t explicitly copied as a separate object) will also be done to the original object in memory.

map@data<-copy(data.table(map@data))

We can add columns to the data table based on other columns using data.table syntax (which preserves the current order of the rows):

map@data[, name_caps:=toupper(NAME)] # say we wanted to create a new column that had the county names in all caps-- we can add that using data.table syntax

##       STATEFP COUNTYFP COUNTYNS       AFFGEOID GEOID      NAME LSAD
##    1:      06      049 00277289 0500000US06049 06049     Modoc   06
##    2:      13      257 00350028 0500000US13257 13257  Stephens   06
##    3:      18      127 00450382 0500000US18127 18127    Porter   06
##    4:      21      077 00516885 0500000US21077 21077  Gallatin   06
##    5:      21      053 00516873 0500000US21053 21053   Clinton   06
##   ---                                                              
## 3217:      21      081 00516887 0500000US21081 21081     Grant   06
## 3218:      01      081 00161566 0500000US01081 01081       Lee   06
## 3219:      21      111 00516902 0500000US21111 21111 Jefferson   06
## 3220:      48      489 01384030 0500000US48489 48489   Willacy   06
## 3221:      17      173 01785051 0500000US17173 17173    Shelby   06
##             ALAND    AWATER name_caps
##    1: 10140841147 745929078     MODOC
##    2:   463948108  13121937  STEPHENS
##    3:  1083011641 268393387    PORTER
##    4:   254698298  16519758  GALLATIN
##    5:   510864239  21164150   CLINTON
##   ---                                
## 3217:   668126187   7253995     GRANT
## 3218:  1573552207  21488919       LEE
## 3219:   985312624  44680348 JEFFERSON
## 3220:  1529529770 501714610   WILLACY
## 3221:  1964597878  24693146    SHELBY

map@data[, ALAND:=NULL] # Similarly, we can remove columns...

##       STATEFP COUNTYFP COUNTYNS       AFFGEOID GEOID      NAME LSAD
##    1:      06      049 00277289 0500000US06049 06049     Modoc   06
##    2:      13      257 00350028 0500000US13257 13257  Stephens   06
##    3:      18      127 00450382 0500000US18127 18127    Porter   06
##    4:      21      077 00516885 0500000US21077 21077  Gallatin   06
##    5:      21      053 00516873 0500000US21053 21053   Clinton   06
##   ---                                                              
## 3217:      21      081 00516887 0500000US21081 21081     Grant   06
## 3218:      01      081 00161566 0500000US01081 01081       Lee   06
## 3219:      21      111 00516902 0500000US21111 21111 Jefferson   06
## 3220:      48      489 01384030 0500000US48489 48489   Willacy   06
## 3221:      17      173 01785051 0500000US17173 17173    Shelby   06
##          AWATER name_caps
##    1: 745929078     MODOC
##    2:  13121937  STEPHENS
##    3: 268393387    PORTER
##    4:  16519758  GALLATIN
##    5:  21164150   CLINTON
##   ---                    
## 3217:   7253995     GRANT
## 3218:  21488919       LEE
## 3219:  44680348 JEFFERSON
## 3220: 501714610   WILLACY
## 3221:  24693146    SHELBY

head(map@data)

##    STATEFP COUNTYFP COUNTYNS       AFFGEOID GEOID       NAME LSAD
## 1:      06      049 00277289 0500000US06049 06049      Modoc   06
## 2:      13      257 00350028 0500000US13257 13257   Stephens   06
## 3:      18      127 00450382 0500000US18127 18127     Porter   06
## 4:      21      077 00516885 0500000US21077 21077   Gallatin   06
## 5:      21      053 00516873 0500000US21053 21053    Clinton   06
## 6:      26      149 01625034 0500000US26149 26149 St. Joseph   06
##       AWATER  name_caps
## 1: 745929078      MODOC
## 2:  13121937   STEPHENS
## 3: 268393387     PORTER
## 4:  16519758   GALLATIN
## 5:  21164150    CLINTON
## 6:  52934828 ST. JOSEPH

It also might be useful to simplify the information contained by the spatial polygons data frame: we can save the data itself in another object, and restrict the spatial object to only an ID field, such that we could add the other information back in as desired:

map@data[, geo_id:= as.integer(as.character(map@data$GEOID))] #first, let's make an ID that we can use to join data on later

##       STATEFP COUNTYFP COUNTYNS       AFFGEOID GEOID      NAME LSAD
##    1:      06      049 00277289 0500000US06049 06049     Modoc   06
##    2:      13      257 00350028 0500000US13257 13257  Stephens   06
##    3:      18      127 00450382 0500000US18127 18127    Porter   06
##    4:      21      077 00516885 0500000US21077 21077  Gallatin   06
##    5:      21      053 00516873 0500000US21053 21053   Clinton   06
##   ---                                                              
## 3217:      21      081 00516887 0500000US21081 21081     Grant   06
## 3218:      01      081 00161566 0500000US01081 01081       Lee   06
## 3219:      21      111 00516902 0500000US21111 21111 Jefferson   06
## 3220:      48      489 01384030 0500000US48489 48489   Willacy   06
## 3221:      17      173 01785051 0500000US17173 17173    Shelby   06
##          AWATER name_caps geo_id
##    1: 745929078     MODOC   6049
##    2:  13121937  STEPHENS  13257
##    3: 268393387    PORTER  18127
##    4:  16519758  GALLATIN  21077
##    5:  21164150   CLINTON  21053
##   ---                           
## 3217:   7253995     GRANT  21081
## 3218:  21488919       LEE   1081
## 3219:  44680348 JEFFERSON  21111
## 3220: 501714610   WILLACY  48489
## 3221:  24693146    SHELBY  17173

map_data<-copy(map@data) #now we have a data.table of the county information devoid of geographic attributes
map@data<-copy(map@data[,list(geo_id)]) # Resticting the data.table attached to the spatial information to only relevant fields.
    ## We are taking a copy here before reassigning it because of reasons detailed above (a data.table is an object stored in memory, 
    ## without taking a copy here, we are creating a "shallow copy", which R won't like to change later on)

head(map@data) # Should only include the ID

##    geo_id
## 1:   6049
## 2:  13257
## 3:  18127
## 4:  21077
## 5:  21053
## 6:  26149

head(map_data) # Includes the geo_id, and the rest of the data we may or may not want to use later.

##    STATEFP COUNTYFP COUNTYNS       AFFGEOID GEOID       NAME LSAD
## 1:      06      049 00277289 0500000US06049 06049      Modoc   06
## 2:      13      257 00350028 0500000US13257 13257   Stephens   06
## 3:      18      127 00450382 0500000US18127 18127     Porter   06
## 4:      21      077 00516885 0500000US21077 21077   Gallatin   06
## 5:      21      053 00516873 0500000US21053 21053    Clinton   06
## 6:      26      149 01625034 0500000US26149 26149 St. Joseph   06
##       AWATER  name_caps geo_id
## 1: 745929078      MODOC   6049
## 2:  13121937   STEPHENS  13257
## 3: 268393387     PORTER  18127
## 4:  16519758   GALLATIN  21077
## 5:  21164150    CLINTON  21053
## 6:  52934828 ST. JOSEPH  26149

At this point, we can save that map_data as a separate .csv or .rdata, merge it in later, ditch it altogether– in general, it’s good practice to keep your spatial data “light” on attributes, and only keep variables that help you identify the geometries– after all, you can always merge in the other data later.

Merging other data onto a SpatialPolygonsDataTable

Right now, you have rows of data that corresponds to each spatial record. These data are linked to the spatial information by the order they are in the table, since that corresponds to the order of the relevant polygons. This can get tricky, since merges in R will automatically default to sorting the resulting data.table after a merge based on whatever ID field you used to merge on your secondary data set. This is problematic, because when you merge in new information into the data.table linked to the polygons, the polygon order doesn’t update to mesh with the new data.

As discussed above, if you use this out-of-order data set for analysis, you’ll still get values for each polygon–but they won’t actually correspond to the right geography! Luckily, We can take some defensive steps to make sure we don’t get “out of sync” between the geography and the data. If this were a data.frame, we could assign the row.names() to something meaningful– but rownames aren’t supported in data.tables (we can just use a column!), and we can generate a column to revert back to later on.

One way of achieving this security is to sort your entire spatial data object based on the geo_id you plan to use for any merges down the line.

At this point, the entire spatial object (polygons and data), are ordered by the geog_ id we have specified. This means that at any point in the future, we can use this geo_id field to sort our data.table, and re-arrange our data object to make sure it matches up to the geography.

map<-copy(map[order(map@data$geo_id),]) # Assigning the "map" object to be the sorted (on our geog_id) version of the "map" object.
# Right now ,the data.table being used is considered to be a "shallow copy"
# of the data.table from the unsorted object-- we can overwrite this with a
# copy of the map@data object and get rid of this problem.
map@data<-copy(map@data)
head(map@data)

##    geo_id
## 1:   1001
## 2:   1003
## 3:   1005
## 4:   1007
## 5:   1009
## 6:   1011

If you don’t want to use this method, because you aren’t sure what attribute you’ll be merging on in the future, or some other reason, you can also create a new column to use as a home base and re-sort your data.table on. After any manipulations that you do, to make sure that your values correspond correctly with your geography. As of right now, we haven’t done anything to change the order of the data or geography, so we can create an index from 1:(n_rows) that we can use to “revert” back to normal later on (just in case!).

map@data[,sort_order:= 1:nrow(map@data)] # Creating the index that goes from 1-the length of the data.table

##       geo_id sort_order
##    1:   1001          1
##    2:   1003          2
##    3:   1005          3
##    4:   1007          4
##    5:   1009          5
##   ---                  
## 3217:  72145       3217
## 3218:  72147       3218
## 3219:  72149       3219
## 3220:  72151       3220
## 3221:  72153       3221

head(map@data)

##    geo_id sort_order
## 1:   1001          1
## 2:   1003          2
## 3:   1005          3
## 4:   1007          4
## 5:   1009          5
## 6:   1011          6

Now that we are confident we won’t get the geometry mixed up by adding in some data, let’s add back in a few of the variables we stashed away in the separate data.table called map_data:

map@data<-copy(merge(map@data,map_data,by="geo_id"))
head(map@data)

##    geo_id sort_order STATEFP COUNTYFP COUNTYNS       AFFGEOID GEOID
## 1:   1001          1      01      001 00161526 0500000US01001 01001
## 2:   1003          2      01      003 00161527 0500000US01003 01003
## 3:   1005          3      01      005 00161528 0500000US01005 01005
## 4:   1007          4      01      007 00161529 0500000US01007 01007
## 5:   1009          5      01      009 00161530 0500000US01009 01009
## 6:   1011          6      01      011 00161531 0500000US01011 01011
##       NAME LSAD     AWATER name_caps
## 1: Autauga   06   25773561   AUTAUGA
## 2: Baldwin   06 1133084680   BALDWIN
## 3: Barbour   06   50864677   BARBOUR
## 4:    Bibb   06    9287964      BIBB
## 5:  Blount   06   15077461    BLOUNT
## 6: Bullock   06    6055020   BULLOCK

Note that the data you are adding on should never have more than 1 observation per geog_ id! Remember that the polygons are linked to the data by order– we can’t (sensibly) have more data than we have polygons.

Sub-setting the spatial polygons data frame: picking geographic areas that are of interest to you, or excluding them

The relevant thing here is that you can query the spatial polygons data object’s data, and used it to subset the spatialpolygons object– including the geographic boundary information! Let’s get rid of Puerto Rico:

map <- map[map@data$geo_id < 60000,] # Dropping Puerto Rico by only keeping rows that had geo_id values less than 60000

Similarly, we could plot only some section of this data file, or save a subset of the file to another object- let’s use the attributes to find Washington State’s Counties only.

    washington<-copy(map[map@data$STATEFP==53,])
    plot(washington, main="Washington State")

Changing the Map’s Projection

It’s sometimes useful to know or alter the projection used for the map. To discover what the map’s projection is currently, is you use the command proj4string(map).

proj4string(washington)

## [1] "+proj=longlat +datum=NAD83 +no_defs +ellps=GRS80 +towgs84=0,0,0"

In this case, the projection is “longlat”– basically, not a projection; it’s just using the latitude and longitude as a coordinate system. See: http://www.georeference.org/doc/latitude_longitude_projection.htm as a further reference about different projetions. Plenty of resources are available on geographic projections–the important thing to keep in mind is that all projections (putting an “oblate spheroid” like the earth onto a 2d surface) are frought with choices about what to distort and where.

See: https://en.wikipedia.org/wiki/Map_projection#Which_projection_is_best.3F

What does this projection look like?

plot(washington, main="Original Projection")

Luckily,changing this projection is trivial in R.

  # Defining the new projection you want to use:
    new_projection<-"+proj=laea +lat_0=45 +lon_0=-100 +x_0=0 +y_0=0 +a=6370997 +b=6370997 +units=m +no_defs" 
    # This is the Lambert Azimuthal Equal Area projection-- a projection known for keeping the area of the geographic units true to reality.
    # You can probably find whatever projection you want to use's "PROJ.4 STRING" at: http://spatialreference.org/
  
  # Actually transform the object
    washington <- spTransform(washington, CRS(new_projection)) # spTransform: spatial transform of coordinates from one system to antoher 
  
  # What does it look like now?
    plot(washington,main="New Projection")

However, we still need to create a copy() of the data.table and reassign it to the Washington object, or else it won’t allow us to make changes to the data.table later on (it will throw an error about a “shallow copy” of the data table.)

    washington@data<-copy(washington@data)

Calculating fields based on geographic attributes

Let’s say you want to know how much area is in each county– you can create a new data field based on the geographic information. This gArea function calls from the rgeos package: units of this projection is “meters” (see the +units section of the proj4 string), so this should give us Km when we divide by 1000000:

washington@data[,area_sqkm:=gArea(washington,byid=TRUE)/1000000]

##     geo_id sort_order STATEFP COUNTYFP COUNTYNS       AFFGEOID GEOID
##  1:  53001       2955      53      001 01531601 0500000US53001 53001
##  2:  53003       2956      53      003 01533502 0500000US53003 53003
##  3:  53005       2957      53      005 01513302 0500000US53005 53005
##  4:  53007       2958      53      007 01531932 0500000US53007 53007
##  5:  53009       2959      53      009 01531341 0500000US53009 53009
##  6:  53011       2960      53      011 01531820 0500000US53011 53011
##  7:  53013       2961      53      013 01513273 0500000US53013 53013
##  8:  53015       2962      53      015 01529156 0500000US53015 53015
##  9:  53017       2963      53      017 01531818 0500000US53017 53017
## 10:  53019       2964      53      019 01531821 0500000US53019 53019
## 11:  53021       2965      53      021 01531822 0500000US53021 53021
## 12:  53023       2966      53      023 01533500 0500000US53023 53023
## 13:  53025       2967      53      025 01531924 0500000US53025 53025
## 14:  53027       2968      53      027 01531823 0500000US53027 53027
## 15:  53029       2969      53      029 01513272 0500000US53029 53029
## 16:  53031       2970      53      031 01531936 0500000US53031 53031
## 17:  53033       2971      53      033 01531933 0500000US53033 53033
## 18:  53035       2972      53      035 01529223 0500000US53035 53035
## 19:  53037       2973      53      037 01531926 0500000US53037 53037
## 20:  53039       2974      53      039 01529219 0500000US53039 53039
## 21:  53041       2975      53      041 01531927 0500000US53041 53041
## 22:  53043       2976      53      043 01514052 0500000US53043 53043
## 23:  53045       2977      53      045 01529221 0500000US53045 53045
## 24:  53047       2978      53      047 01531928 0500000US53047 53047
## 25:  53049       2979      53      049 01513274 0500000US53049 53049
## 26:  53051       2980      53      051 01529157 0500000US53051 53051
## 27:  53053       2981      53      053 01529159 0500000US53053 53053
## 28:  53055       2982      53      055 01531931 0500000US53055 53055
## 29:  53057       2983      53      057 01531402 0500000US53057 53057
## 30:  53059       2984      53      059 01529220 0500000US53059 53059
## 31:  53061       2985      53      061 01529222 0500000US53061 53061
## 32:  53063       2986      53      063 01529225 0500000US53063 53063
## 33:  53065       2987      53      065 01531930 0500000US53065 53065
## 34:  53067       2988      53      067 01529226 0500000US53067 53067
## 35:  53069       2989      53      069 01513275 0500000US53069 53069
## 36:  53071       2990      53      071 01531819 0500000US53071 53071
## 37:  53073       2991      53      073 01529224 0500000US53073 53073
## 38:  53075       2992      53      075 01533501 0500000US53075 53075
## 39:  53077       2993      53      077 01531929 0500000US53077 53077
##     geo_id sort_order STATEFP COUNTYFP COUNTYNS       AFFGEOID GEOID
##             NAME LSAD     AWATER    name_caps  area_sqkm
##  1:        Adams   06   12622881        ADAMS  4992.1999
##  2:       Asotin   06   11450483       ASOTIN  1663.1743
##  3:       Benton   06  154325810       BENTON  4514.9275
##  4:       Chelan   06  189998659       CHELAN  7679.7601
##  5:      Clallam   06 2413676105      CLALLAM  4657.4722
##  6:        Clark   06   70459265        CLARK  1694.6318
##  7:     Columbia   06   12566852     COLUMBIA  2272.1166
##  8:      Cowlitz   06   67322703      COWLITZ  2987.8458
##  9:      Douglas   06   76102867      DOUGLAS  4747.3631
## 10:        Ferry   06  140280531        FERRY  5781.4918
## 11:     Franklin   06   60193373     FRANKLIN  3252.6191
## 12:     Garfield   06   19490293     GARFIELD  1831.9577
## 13:        Grant   06  289575831        GRANT  7221.0382
## 14: Grays Harbor   06  834956691 GRAYS HARBOR  5006.0129
## 15:       Island   06  800096249       ISLAND   565.5492
## 16:    Jefferson   06  982867070    JEFFERSON  4798.8707
## 17:         King   06  495743196         KING  5756.2371
## 18:       Kitsap   06  442827389       KITSAP  1159.4203
## 19:     Kittitas   06   92366017     KITTITAS  6037.1778
## 20:    Klickitat   06   85063387    KLICKITAT  4941.5249
## 21:        Lewis   06   86636896        LEWIS  6320.1551
## 22:      Lincoln   06   74744200      LINCOLN  6034.4662
## 23:        Mason   06  237129533        MASON  2687.7418
## 24:     Okanogan   06  121223338     OKANOGAN 13789.1367
## 25:      Pacific   06  752975316      PACIFIC  2485.7630
## 26: Pend Oreille   06   65404063 PEND OREILLE  3678.3810
## 27:       Pierce   06  354564434       PIERCE  4580.4432
## 28:     San Juan   06 1158102373     SAN JUAN   656.1864
## 29:       Skagit   06  489272212       SKAGIT  4561.9018
## 30:     Skamania   06   72853035     SKAMANIA  4332.6359
## 31:    Snohomish   06  282423392    SNOHOMISH  5434.8855
## 32:      Spokane   06   43803216      SPOKANE  4605.1543
## 33:      Stevens   06  163107425      STEVENS  6562.9276
## 34:     Thurston   06  133490871     THURSTON  1973.9133
## 35:    Wahkiakum   06   61658249    WAHKIAKUM   719.0812
## 36:  Walla Walla   06   74827791  WALLA WALLA  3404.1141
## 37:      Whatcom   06 1027129386      WHATCOM  5598.2213
## 38:      Whitman   06   47939274      WHITMAN  5643.9042
## 39:       Yakima   06   41107261       YAKIMA 11173.3366
##             NAME LSAD     AWATER    name_caps  area_sqkm

head(washington@data)

##    geo_id sort_order STATEFP COUNTYFP COUNTYNS       AFFGEOID GEOID
## 1:  53001       2955      53      001 01531601 0500000US53001 53001
## 2:  53003       2956      53      003 01533502 0500000US53003 53003
## 3:  53005       2957      53      005 01513302 0500000US53005 53005
## 4:  53007       2958      53      007 01531932 0500000US53007 53007
## 5:  53009       2959      53      009 01531341 0500000US53009 53009
## 6:  53011       2960      53      011 01531820 0500000US53011 53011
##       NAME LSAD     AWATER name_caps area_sqkm
## 1:   Adams   06   12622881     ADAMS  4992.200
## 2:  Asotin   06   11450483    ASOTIN  1663.174
## 3:  Benton   06  154325810    BENTON  4514.928
## 4:  Chelan   06  189998659    CHELAN  7679.760
## 5: Clallam   06 2413676105   CLALLAM  4657.472
## 6:   Clark   06   70459265     CLARK  1694.632

Simplifying the geometry of the geographic information

Often, shapefiles will have more detail than is necessary for cartographic mapping– this can make things slow to plot, and generally cumbersome. To fix this, we can use gSimplify().

To see the impacts, let’s take a look at the impacts of gSimplify up close– on the counties within Washington only.

object.size(washington) #how big in bytes is this object before simplification?

## 1117888 bytes

plot(washington,main="Original Geometry")

washington.simple<-gSimplify(washington,tol=500,topologyPreserve=TRUE)
object.size(washington.simple) #how big in bytes is this object after simplification?

## 174048 bytes

plot(washington.simple,main="Simplified")

    #tol: tolerance (higher=simpler)
    #topologyPreserve: if true, attempts to keep topology (relationships between counties, etc) the same.

Once you’ve simplified a shape, you no longer should use it for analysis– there might be gaps between borders, and the precision for where things are in the real world has been lost. From the point that you have simplified the geometry onward, it should be used for cartography ONLY.

Let’s apply that amount of smoothing to the rest of the the US, and see how much size we “save”:

object.size(map) #how big in bytes is this object before simplification?

## 13448016 bytes

us.simple<-gSimplify(map,tol=500,topologyPreserve=TRUE)
object.size(us.simple) #how big in bytes is this object after simplification?

## 11390976 bytes

plot(us.simple,main="Simplified")

Holy crow, look at Alaska! Obviously, the tolerance that looks good for some areas is not appropriate for others– keep on the lookout for things that look strange, and use as high of tolerance as still looks good to you.

Merging Polygons Together

Shapefiles can have two types of polygons– single part (just one lake), or multipart (a lake and a tiny lake that count as the same object as far as attributes are concerned). Sometimes, when you load in your data, you might notice thousands of polygons, with lots of duplicates. This indicates that your multipart polygons were read in as individual objects. Luckily, we can collapse these back together into a sensible object!

Let’s take a look at a composite shapefile for both national and subnational areas that are contained within the same shapefile.

gbd15 <- readOGR("C:/Users/stubbsrw/Documents/us_counties_stubbs_gitrepo/r_shared/woodson_mapping_suite/global_ihme_shapefiles/gbd15_all_levels_single_polygon_03032016.shp", layer="gbd15_all_levels_single_polygon_03032016")

## OGR data source with driver: ESRI Shapefile 
## Source: "C:/Users/stubbsrw/Documents/us_counties_stubbs_gitrepo/r_shared/woodson_mapping_suite/global_ihme_shapefiles/gbd15_all_levels_single_polygon_03032016.shp", layer: "gbd15_all_levels_single_polygon_03032016"
## with 6959 features
## It has 8 fields
## Integer64 fields read as strings:  parent_id spr_reg_id region_id level

Over 6 thousand features! Let’s collapse those down to something sensible.

First, let’s convert the data object into a data.table, and save it into a separate data.table for use later.

 gbd15<-gbd15[order(gbd15@data$level,gbd15@data$location_i),] # Re-assigning the "map" object to be the sorted (on location_id) version of the "map" object.
  gbd15@data<-data.table(gbd15@data) #Converting the data object to a data.table
  gbd15data<-unique(copy(gbd15@data))#Saving a copy of the data.table (but only the unique observations!)
  gbd15@data[, ihme_lc_id:=as.character(ihme_lc_id)] # Replacing the location ID as a charecter version of location ID

##       parent_id                            loc_name
##    1:         0                       New Caledonia
##    2:         0                              Kosovo
##    3:         0 French Southern and Antarctic Lands
##    4:         0                      Western Sahara
##    5:         0                          Antarctica
##   ---                                              
## 6955:      4749                  South East England
## 6956:      4749                  South East England
## 6957:      4749                  South East England
## 6958:      4749                  South West England
## 6959:      4749                  South West England
##                                 loc_nm_sh spr_reg_id region_id ihme_lc_id
##    1:                       New Caledonia          0         0       NA_4
##    2:                              Kosovo          0         0       NA_5
##    3: French Southern and Antarctic Lands          0         0       NA_2
##    4:                      Western Sahara          0         0       NA_3
##    5:                          Antarctica          0         0       NA_1
##   ---                                                                    
## 6955:                          SE England         64        73   GBR_4625
## 6956:                          SE England         64        73   GBR_4625
## 6957:                          SE England         64        73   GBR_4625
## 6958:                          SW England         64        73   GBR_4626
## 6959:                          SW England         64        73   GBR_4626
##       location_i level
##    1:         -1     0
##    2:         -1     0
##    3:         -1     0
##    4:         -1     0
##    5:         -1     0
##   ---                 
## 6955:       4625     5
## 6956:       4625     5
## 6957:       4625     5
## 6958:       4626     5
## 6959:       4626     5

Now, we will use a particular field of the attributes (in this case, ihme_ lc_ id) aggregate many single-part polygons into multipart polgyons. You also can want to merge together polygons for more normal reasons– say you had a shapefile with counties, and wanted a shapefile of states– since counties nest within states, you could UnionSpatialPolygons on the State ID and you would end up with a State Shapefile. This is a good strategy if you want to highlight borders of a geography of admin levels above your data of interest– that way, you are guaruanteed to have the same border lines as the smaller geography.

map <- unionSpatialPolygons(gbd15, gbd15@data$ihme_lc_id)

At this point, “map” is a SpatialPolygons object– without any data attributes at all. However, the polygon ID of this new SpatialPolygons is the identifier you used to combine geometry before– we can use this to recover and create a data.table, and add the rest of our data back on. Interestingly, to join together data and polygons to create a SpatialPolygonsDataFrame, you need the row names to match up. As such, we create a data.frame with the same row names.

data <- data.frame(ihme_lc_id = (sapply(map@polygons, function(x) x@ID)))
  rownames(data) <- data$ihme_lc_id

Now, we can create a SpatialPolygonsDataFrame using the map and data objects:

  gbd15 <- SpatialPolygonsDataFrame(map, data)

At this point, we follow familiar steps to merge the data back on.

gbd15 <- gbd15[order(gbd15@data$ihme_lc_id),]
gbd15@data$polygon_sort_order<-1:nrow(gbd15@data) # Creating the index that goes from 1-the length of the data.table, such that order can be preserved!

## Merging back on the attributes after having collapsed by location ID
  gbd15@data<-merge(gbd15@data,gbd15data,by="ihme_lc_id",all.x=TRUE, all.y=FALSE)
  gbd15@data<-data.table(gbd15@data)

# Let's also fix the attribute names:
setnames(gbd15@data,"location_i","location_id") 

head(gbd15@data)

##    ihme_lc_id polygon_sort_order parent_id             loc_name
## 1:        AFG                  1       138          Afghanistan
## 2:        AGO                  2       167               Angola
## 3:        ALB                  3        42              Albania
## 4:        AND                  4        73              Andorra
## 5:        ARE                  5       138 United Arab Emirates
## 6:        ARG                  6        96            Argentina
##      loc_nm_sh spr_reg_id region_id location_id level
## 1: Afghanistan        137       138         160     3
## 2:      Angola        166       167         168     3
## 3:     Albania         31        42          43     3
## 4:     Andorra         64        73          74     3
## 5:         UAE        137       138         156     3
## 6:   Argentina         64        96          97     3

plot(gbd15[gbd15@data$level==3,], main="Nations of the World")

Got anything else you want added to this document? Contact Rebecca Stubbs at stubbsrw@gmail.com.

Preparing (Vector) Spatial Data for R– Shapefiles Edition

Rebecca Stubbs