Spatial data is awesome. Data with latitude and longitude are rich in relational information between data observations, and beautiful to visualize. Comparing one place on the earth to another often makes intuitive sense (how far away is A from B? Does C touch/overlap with D?), and the flavors of relationships you can create between data with geometries are nearly endless. Vastly different types of data can be combined and leveraged via space and location as a unique identifier rather than some kind of tabular key. However, the process of visualizing and analyzing geometry can have a steep learning curve; software like ArcGIS was created to make spatial visualization and analysis “easy” (although ArcGIS certainly has its own idyosincracies as well).
Broadly, there are two types of spatial data available– vector data, and raster data.
Raster data is when a network or mesh of evenly spaced points are used to create a “surface” that describes some attribute of that space. Remotely sensed data using satellites to take primary data like land cover and climate data are often represented this way. They are mapped such that the center of the points form the center of a graph-paper like surface, where the value in the center of the point applies to the entire graph square. This creates the appearance of a surface of squares. The size of these squares (the amount of space each data point is assumed to represent) is refered to as the resolution– if the squares are 5km by 5km, the raster is said to be at a 5km resolution. However, even this is an abstraction–the points are usually sampled such that they are evenly spaced in latitude and longitude, but these “squares” can look more irregular in some areas (such as at the poles) depending on what projection you are using. These spatial data often carry file extensions like .TIFF, .JPEG, although different software packages and governemnt agencies sometimes have their own proprietary formats.
Vector data is when points, lines, and polygons are used to condense the natural and built world into geometric shapes. However, it is important to note that any given real-world object can be represented by a GIS (geographic information system) in multiple ways– the location of a cabin might be represented by a point, showing its location– or by a square showing the footprint of the structure itself. Rivers and lakes can be abstracted to lines and polygons, but given that the width of a river might also be of importance, rivers might also be represented as polygons. The level of detail describing the line or polygon also influences the precision with which you can make inferences about shape, size, and relationships between data (is your lake represented using a clean oval, or exploring every nook and cranny of the wetlands?). Finding areas of similarity, or connecting the dots between raster data objects often is one way of condensing and simplifying raster data into patches or lines from a surface of points.
It’s relatively rare that researchers need to create the shapefiles and geometries used to abstact the real world themselves– as such, you’re probably stuck with what you’ve recieved, which may or may not be the ideal “representation of the real” for the process you plan to undertake. As such, remember not to take these spatial “truths” but so seriously– your spatial data is probably somewhat imprecise, the locations described are most likely not as concrete as we imagine them to be, and human-created boundaries like borders are often arbitrary anyway. That said, using spatial relationships is usually still more interesting and useful than ignoring them.
From a theoretical standpoint, the goal of most spatial data, and resulting analytic processes is to end up with a simplified version of the real relationships as they exist in the real world. Similar to the way that summary statistics are used to get a sense of what’s going on in a non-spatial data set, summaries of spatial processes can be used to simplify what we know about how a variable exists in space. For example, knowing the spatially weighted average of where a species was observed each year (with weights as number of frogs observed per site) might give you a sense of whether the species is moving due to changing climate pressures. Relationships between two spatial variables, like area of intersections, or matrices of distances between points, are ways to condense complicated patterns (that are sometimes easy to recognize by eye) into data you can use in a regression model.
One of the most pervasive spatial data types for vector data is the Shapefile. Created by ESRI (makers of the ArcGIS software), the shapefile is a format that consists of multiple different parts, creating one data type that is structured as if it has a multi-component soul. Never move these files separately– keep all of the files with the same name (regardless of extension) in the same directory. The (generally) most important parts/file extensions to know are described below:
Shapefile Horcruxes:
.shp - This contains all of the actual lat/lon locations of each point that goes into your vector data set
.dbf - The data attached to each point. If you want to see what’s there you can open this type of file in excel–however, look but do not touch; NEVER change any of the actual data by hand if you do this; despite appearances, this is not a .csv under the hood, and it’s very possible you will be very sad that you have corrupted your file if you do this.
.prj- This simple text file contains a string that describes what spatial projection your data is presented in. *This may or may not exist, but hopefully it does, or you’ll need to figure out via research or guesswork what the projection information was before you analyze it.
.shx- Shapefiles break without this one (.shp, .shx, and .dbf are the only requried extensions), and it deals with spatial indexing, but there’s no reason you need to interface with this in R.
There are plenty of other shapefile extensions (.atx, .ixs, .qix…) that may or may not exist for your file, and that you probably don’t have to worry about.
Outside of using specially licenced software such as ArcGIS, most of the programatic ways of manipulating spatial data depend on the GDAL (Geospatial Data Analysis Library) and GEOS (Geometry Engine- Open Source) libraries. Many GIS users start off learning the basics of spatial analysis in ArcGIS, then transition to using more open source software like QGIS, (formerly known as Quantum GIS), Python, and R to accomplish the same goals faster, and with more customizability (without paying through the nose for a license). QGIS, Python, and R all ultimately call on the GEOS and GDAL packages under the hood. For basic, one-time-only spatial operations like buffers, intersections, clips, etc, it might be easier to simply do that operation in QGIS (or ArcGIS if you have it) than to mess with this nonsense.
However, ArcGIS and QGIS don’t have a way to loop and iterate through data when creating a map (although both have functionalities to print “atlases”, this is less useful for researchers that want to see different data portrayed within a similar template quickly). So, at the very least, it’s good to know how to load Shapefiles into R for data visualization alone.
This is designed to be an example of how to read in, manipulate, and prepare spatial data for use in R. This example uses a clean (no messed up geometries) shapefile, with valid topology.
library(data.table) #
library(maptools) # R package with useful map tools
library(rgeos) # "Geomegry Engine- Open Source (GEOS)"
library(rgdal) # "Geospatial Data Analysis Library (GDAL)"
# Getting rid of anything saved in your workspace
rm(list=ls())
New as of version 0.1.2 of the woodson library, there is now a function that can bring in a shapefile (polygons) into R. You can use this function to bring in your spatial data as a SpatialPolygonsDataFrame, with a data.table as the @data slot. This function does many of the steps detailed below (like checking for geometric validity, transforming the @data object into a data.table). Using the function should be straighforward– simply give the function the directory, the name of the layer, and the string name of the field that serves as the primary ID for the polygons.
If a field that is not a primary key is provided, the polygons will be collapsed such that there will be 1 polygon object (possibly multi-part) for each of the primary keys you have identified, but the rest of the data table will no longer attach to the shapefile.
Note that the shapefile will be returned in the projection that the shapefile is in, and that there is a new field added called polygon_order– this is to protect useres from scrambling the polygons and the data attributes in later merges by always making sure that the @data object is sorted on this field before plotting later. This topic is discussed below, but is gone through even more exhaustively in the tutorial “The Scrambled Polygon Problem” (also on rpubs.com/BeccaStubbs).
library(woodson)
## Loading required package: ggplot2
## Loading required package: ggthemes
## Loading required package: grid
## Loading required package: rje
##
## Attaching package: 'rje'
## The following object is masked from 'package:data.table':
##
## last
## The following object is masked from 'package:base':
##
## arrayInd
## Loading required package: extrafont
## Registering fonts with R
## Loading required package: classInt
counties<-shp_to_Rpolygons("J:/DATA/Incoming Data/USA/sae_data/cb_2013_us_county_20m/","cb_2013_us_county_20m","GEOID")
## OGR data source with driver: ESRI Shapefile
## Source: "J:/DATA/Incoming Data/USA/sae_data/cb_2013_us_county_20m//cb_2013_us_county_20m.shp", layer: "cb_2013_us_county_20m"
## with 3221 features
## It has 9 fields
## Integer64 fields read as strings: ALAND AWATER
## [1] "PSA: There is a new column in the @data slot called polygon_order. Remember that the attributes (values and IDs) for each polygon are only linked to the polygons by sorting order-- if you ever merge anything onto your data, make sure that you re-order the data by this polygon_order field afterwards."
plot(counties)
head(counties@data)
## GEOID polygon_order STATEFP COUNTYFP COUNTYNS AFFGEOID NAME
## 1: 01001 1 01 001 00161526 0500000US01001 Autauga
## 2: 01003 2 01 003 00161527 0500000US01003 Baldwin
## 3: 01005 3 01 005 00161528 0500000US01005 Barbour
## 4: 01007 4 01 007 00161529 0500000US01007 Bibb
## 5: 01009 5 01 009 00161530 0500000US01009 Blount
## 6: 01011 6 01 011 00161531 0500000US01011 Bullock
## LSAD ALAND AWATER
## 1: 06 1539584443 25773561
## 2: 06 4117629064 1133084680
## 3: 06 2291820953 50864677
## 4: 06 1612481768 9287964
## 5: 06 1670041814 15077461
## 6: 06 1613058451 6055020
We can also check the shapefile we’ve brought in to make sure that it meets some of the basic requirements of the wmap() function, using another one of the woodson library’s functions– it returns “TRUE” if the shapefile has projection information, valid geometry, and has a data.table in its @data slot:
check_wmapshp(counties)
## [1] TRUE
Now that we’ve learned the easy way, let’s go into depth about some of aspects of bringing in polygons from shapefiles into R– many of the processes (subsetting data, re-projecting it, etc), can be done on objects you’ve already brought into R using the functions above.
There are many ways to load spatial data into R– however, one of the most straightforward is to use the readOGR function, since it also reads in the projection information (if it exists).
shapefile_name<-"cb_2013_us_county_20m"
indir <- "J:/DATA/Incoming Data/USA/sae_data/cb_2013_us_county_20m/" # Defining the directory that contains the shapefile you are interested in
map <- readOGR(paste0(indir, shapefile_name,".shp"), layer=shapefile_name)
## OGR data source with driver: ESRI Shapefile
## Source: "J:/DATA/Incoming Data/USA/sae_data/cb_2013_us_county_20m/cb_2013_us_county_20m.shp", layer: "cb_2013_us_county_20m"
## with 3221 features
## It has 9 fields
## Integer64 fields read as strings: ALAND AWATER
Our first check should be to see if the spatial geometry is valid– the gIsValid function looks at each polygon, and makes sure it doesn’t break any topological rules. Ideally, all of the geometry is valid, and the total number of polygons that fail the validity tests is 0.
validity<-data.table(valid=gIsValid(map,byid=TRUE))
sum(validity$valid==F)
## [1] 0
Lucky for us, the sum of the geometry problems is 0– If you start getting errors about invalid geometries, my recommendation is to a.) Find a friend that has an ArcGIS license, and use the “repair geometries” feature, and/or b.) talk to someone that’s good at troubleshooting spatial data— fixing invalid geometries is tough to do in R, or c.) prepare for a long road with many struggles. Another good start is do a UnionSpatialPolygons operation on the geographic ID of interest- that sometimes gets rid of topology errors as well (see below).
At this point, the map object is a Large SpatialPolygonsDataFrame. Just as there are multiple parts to a shapefile, there are multiple components of this data type. There is a polygons object, which we can access using map@polygons, and a data object, which we can access using map@data. A quick summary() will tell us what fields are within the data attributes of this file, what the projection information is, and the bounding box (x/y extent) of the geometry itself. We can also use R’s native plot() function to do a preliminary exploration of what this data looks like over space.
summary(map)
## Object of class SpatialPolygonsDataFrame
## Coordinates:
## min max
## x -179.14734 179.77847
## y 17.88481 71.35256
## Is projected: FALSE
## proj4string :
## [+proj=longlat +datum=NAD83 +no_defs +ellps=GRS80 +towgs84=0,0,0]
## Data attributes:
## STATEFP COUNTYFP COUNTYNS AFFGEOID
## 48 : 254 001 : 50 00023901: 1 0500000US01001: 1
## 13 : 159 003 : 50 00025441: 1 0500000US01003: 1
## 51 : 134 005 : 50 00025442: 1 0500000US01005: 1
## 21 : 120 009 : 49 00025443: 1 0500000US01007: 1
## 29 : 115 007 : 48 00025444: 1 0500000US01009: 1
## 20 : 105 013 : 48 00025445: 1 0500000US01011: 1
## (Other):2334 (Other):2926 (Other) :3215 (Other) :3215
## GEOID NAME LSAD ALAND
## 01001 : 1 Washington: 31 06 :3007 1000508842: 1
## 01003 : 1 Franklin : 26 13 : 78 1001059073: 1
## 01005 : 1 Jefferson : 26 15 : 64 1001787876: 1
## 01007 : 1 Jackson : 24 25 : 41 1002099184: 1
## 01009 : 1 Lincoln : 24 04 : 12 1002370084: 1
## 01011 : 1 Madison : 20 05 : 11 1003451809: 1
## (Other):3215 (Other) :3070 (Other): 8 (Other) :3215
## AWATER
## 0 : 1
## 10017640 : 1
## 100218683 : 1
## 100389945 : 1
## 10040121 : 1
## 1004179934: 1
## (Other) :3215
plot(map, main="Map")
Looks pretty tiny and awkward right now, but that can be fixed later.
This map is a “Large SpatialPolygonsDataFrame”– an R object that contains both geographic information, and any fields/data attached to those geographies (like data, ID fields, etc). You can do operations that use the geographic information (for example, calculate the area of each polygon), or change, alter, or add to the data attached.
Probably the most crucial thing to realize about this data format is that the shapes (polyons, points, etc) are linked to the data through the sort order of the polygon objects and the data objects. I could scramble the "@data" object, and I could still plot a map with colors–however, I could be plotting Nevada’s values in Massachussetts, and if you don’t have the expert knowlege to know that something seems “off”, your entire analysis can be totally bunk due to sort order issues. Many of the ways we manipulate and merge data might seem overly complex– this is to prevent the sort order from becoming lost, and to preserve the geometry-attribute link’s validity.
You can access the data attached to the shapefile by calling the “data” attribute from the Spatial Polygons Data Frame:
head(map@data)
## STATEFP COUNTYFP COUNTYNS AFFGEOID GEOID NAME LSAD
## 0 06 049 00277289 0500000US06049 06049 Modoc 06
## 1 13 257 00350028 0500000US13257 13257 Stephens 06
## 2 18 127 00450382 0500000US18127 18127 Porter 06
## 3 21 077 00516885 0500000US21077 21077 Gallatin 06
## 4 21 053 00516873 0500000US21053 21053 Clinton 06
## 5 26 149 01625034 0500000US26149 26149 St. Joseph 06
## ALAND AWATER
## 0 10140841147 745929078
## 1 463948108 13121937
## 2 1083011641 268393387
## 3 254698298 16519758
## 4 510864239 21164150
## 5 1296531678 52934828
Right now, the data object is a data.frame. It might be useful to change the data object to a data.table: this allows us to use data.table and data.frame methods! To do this, we first convert the data.frame into a data.table by calling data.table(map@data), then take a copy of it, and reassign the “data” slot of the map object to be the data.table version.
We take a copy here because of the way that data.table works– it stores an object in memory, which makes it fast to access, but also means that any operation done to a subset of that data.table (that isn’t explicitly copied as a separate object) will also be done to the original object in memory.
map@data<-copy(data.table(map@data))
We can add columns to the data table based on other columns using data.table syntax (which preserves the current order of the rows):
map@data[, name_caps:=toupper(NAME)] # say we wanted to create a new column that had the county names in all caps-- we can add that using data.table syntax
## STATEFP COUNTYFP COUNTYNS AFFGEOID GEOID NAME LSAD
## 1: 06 049 00277289 0500000US06049 06049 Modoc 06
## 2: 13 257 00350028 0500000US13257 13257 Stephens 06
## 3: 18 127 00450382 0500000US18127 18127 Porter 06
## 4: 21 077 00516885 0500000US21077 21077 Gallatin 06
## 5: 21 053 00516873 0500000US21053 21053 Clinton 06
## ---
## 3217: 21 081 00516887 0500000US21081 21081 Grant 06
## 3218: 01 081 00161566 0500000US01081 01081 Lee 06
## 3219: 21 111 00516902 0500000US21111 21111 Jefferson 06
## 3220: 48 489 01384030 0500000US48489 48489 Willacy 06
## 3221: 17 173 01785051 0500000US17173 17173 Shelby 06
## ALAND AWATER name_caps
## 1: 10140841147 745929078 MODOC
## 2: 463948108 13121937 STEPHENS
## 3: 1083011641 268393387 PORTER
## 4: 254698298 16519758 GALLATIN
## 5: 510864239 21164150 CLINTON
## ---
## 3217: 668126187 7253995 GRANT
## 3218: 1573552207 21488919 LEE
## 3219: 985312624 44680348 JEFFERSON
## 3220: 1529529770 501714610 WILLACY
## 3221: 1964597878 24693146 SHELBY
map@data[, ALAND:=NULL] # Similarly, we can remove columns...
## STATEFP COUNTYFP COUNTYNS AFFGEOID GEOID NAME LSAD
## 1: 06 049 00277289 0500000US06049 06049 Modoc 06
## 2: 13 257 00350028 0500000US13257 13257 Stephens 06
## 3: 18 127 00450382 0500000US18127 18127 Porter 06
## 4: 21 077 00516885 0500000US21077 21077 Gallatin 06
## 5: 21 053 00516873 0500000US21053 21053 Clinton 06
## ---
## 3217: 21 081 00516887 0500000US21081 21081 Grant 06
## 3218: 01 081 00161566 0500000US01081 01081 Lee 06
## 3219: 21 111 00516902 0500000US21111 21111 Jefferson 06
## 3220: 48 489 01384030 0500000US48489 48489 Willacy 06
## 3221: 17 173 01785051 0500000US17173 17173 Shelby 06
## AWATER name_caps
## 1: 745929078 MODOC
## 2: 13121937 STEPHENS
## 3: 268393387 PORTER
## 4: 16519758 GALLATIN
## 5: 21164150 CLINTON
## ---
## 3217: 7253995 GRANT
## 3218: 21488919 LEE
## 3219: 44680348 JEFFERSON
## 3220: 501714610 WILLACY
## 3221: 24693146 SHELBY
head(map@data)
## STATEFP COUNTYFP COUNTYNS AFFGEOID GEOID NAME LSAD
## 1: 06 049 00277289 0500000US06049 06049 Modoc 06
## 2: 13 257 00350028 0500000US13257 13257 Stephens 06
## 3: 18 127 00450382 0500000US18127 18127 Porter 06
## 4: 21 077 00516885 0500000US21077 21077 Gallatin 06
## 5: 21 053 00516873 0500000US21053 21053 Clinton 06
## 6: 26 149 01625034 0500000US26149 26149 St. Joseph 06
## AWATER name_caps
## 1: 745929078 MODOC
## 2: 13121937 STEPHENS
## 3: 268393387 PORTER
## 4: 16519758 GALLATIN
## 5: 21164150 CLINTON
## 6: 52934828 ST. JOSEPH
It also might be useful to simplify the information contained by the spatial polygons data frame: we can save the data itself in another object, and restrict the spatial object to only an ID field, such that we could add the other information back in as desired:
map@data[, geo_id:= as.integer(as.character(map@data$GEOID))] #first, let's make an ID that we can use to join data on later
## STATEFP COUNTYFP COUNTYNS AFFGEOID GEOID NAME LSAD
## 1: 06 049 00277289 0500000US06049 06049 Modoc 06
## 2: 13 257 00350028 0500000US13257 13257 Stephens 06
## 3: 18 127 00450382 0500000US18127 18127 Porter 06
## 4: 21 077 00516885 0500000US21077 21077 Gallatin 06
## 5: 21 053 00516873 0500000US21053 21053 Clinton 06
## ---
## 3217: 21 081 00516887 0500000US21081 21081 Grant 06
## 3218: 01 081 00161566 0500000US01081 01081 Lee 06
## 3219: 21 111 00516902 0500000US21111 21111 Jefferson 06
## 3220: 48 489 01384030 0500000US48489 48489 Willacy 06
## 3221: 17 173 01785051 0500000US17173 17173 Shelby 06
## AWATER name_caps geo_id
## 1: 745929078 MODOC 6049
## 2: 13121937 STEPHENS 13257
## 3: 268393387 PORTER 18127
## 4: 16519758 GALLATIN 21077
## 5: 21164150 CLINTON 21053
## ---
## 3217: 7253995 GRANT 21081
## 3218: 21488919 LEE 1081
## 3219: 44680348 JEFFERSON 21111
## 3220: 501714610 WILLACY 48489
## 3221: 24693146 SHELBY 17173
map_data<-copy(map@data) #now we have a data.table of the county information devoid of geographic attributes
map@data<-copy(map@data[,list(geo_id)]) # Resticting the data.table attached to the spatial information to only relevant fields.
## We are taking a copy here before reassigning it because of reasons detailed above (a data.table is an object stored in memory,
## without taking a copy here, we are creating a "shallow copy", which R won't like to change later on)
head(map@data) # Should only include the ID
## geo_id
## 1: 6049
## 2: 13257
## 3: 18127
## 4: 21077
## 5: 21053
## 6: 26149
head(map_data) # Includes the geo_id, and the rest of the data we may or may not want to use later.
## STATEFP COUNTYFP COUNTYNS AFFGEOID GEOID NAME LSAD
## 1: 06 049 00277289 0500000US06049 06049 Modoc 06
## 2: 13 257 00350028 0500000US13257 13257 Stephens 06
## 3: 18 127 00450382 0500000US18127 18127 Porter 06
## 4: 21 077 00516885 0500000US21077 21077 Gallatin 06
## 5: 21 053 00516873 0500000US21053 21053 Clinton 06
## 6: 26 149 01625034 0500000US26149 26149 St. Joseph 06
## AWATER name_caps geo_id
## 1: 745929078 MODOC 6049
## 2: 13121937 STEPHENS 13257
## 3: 268393387 PORTER 18127
## 4: 16519758 GALLATIN 21077
## 5: 21164150 CLINTON 21053
## 6: 52934828 ST. JOSEPH 26149
At this point, we can save that map_data as a separate .csv or .rdata, merge it in later, ditch it altogether– in general, it’s good practice to keep your spatial data “light” on attributes, and only keep variables that help you identify the geometries– after all, you can always merge in the other data later.
Right now, you have rows of data that corresponds to each spatial record. These data are linked to the spatial information by the order they are in the table, since that corresponds to the order of the relevant polygons. This can get tricky, since merges in R will automatically default to sorting the resulting data.table after a merge based on whatever ID field you used to merge on your secondary data set. This is problematic, because when you merge in new information into the data.table linked to the polygons, the polygon order doesn’t update to mesh with the new data.
As discussed above, if you use this out-of-order data set for analysis, you’ll still get values for each polygon–but they won’t actually correspond to the right geography! Luckily, We can take some defensive steps to make sure we don’t get “out of sync” between the geography and the data. If this were a data.frame, we could assign the row.names() to something meaningful– but rownames aren’t supported in data.tables (we can just use a column!), and we can generate a column to revert back to later on.
One way of achieving this security is to sort your entire spatial data object based on the geo_id you plan to use for any merges down the line.
At this point, the entire spatial object (polygons and data), are ordered by the geog_ id we have specified. This means that at any point in the future, we can use this geo_id field to sort our data.table, and re-arrange our data object to make sure it matches up to the geography.
map<-copy(map[order(map@data$geo_id),]) # Assigning the "map" object to be the sorted (on our geog_id) version of the "map" object.
# Right now ,the data.table being used is considered to be a "shallow copy"
# of the data.table from the unsorted object-- we can overwrite this with a
# copy of the map@data object and get rid of this problem.
map@data<-copy(map@data)
head(map@data)
## geo_id
## 1: 1001
## 2: 1003
## 3: 1005
## 4: 1007
## 5: 1009
## 6: 1011
If you don’t want to use this method, because you aren’t sure what attribute you’ll be merging on in the future, or some other reason, you can also create a new column to use as a home base and re-sort your data.table on. After any manipulations that you do, to make sure that your values correspond correctly with your geography. As of right now, we haven’t done anything to change the order of the data or geography, so we can create an index from 1:(n_rows) that we can use to “revert” back to normal later on (just in case!).
map@data[,sort_order:= 1:nrow(map@data)] # Creating the index that goes from 1-the length of the data.table
## geo_id sort_order
## 1: 1001 1
## 2: 1003 2
## 3: 1005 3
## 4: 1007 4
## 5: 1009 5
## ---
## 3217: 72145 3217
## 3218: 72147 3218
## 3219: 72149 3219
## 3220: 72151 3220
## 3221: 72153 3221
head(map@data)
## geo_id sort_order
## 1: 1001 1
## 2: 1003 2
## 3: 1005 3
## 4: 1007 4
## 5: 1009 5
## 6: 1011 6
Now that we are confident we won’t get the geometry mixed up by adding in some data, let’s add back in a few of the variables we stashed away in the separate data.table called map_data:
map@data<-copy(merge(map@data,map_data,by="geo_id"))
head(map@data)
## geo_id sort_order STATEFP COUNTYFP COUNTYNS AFFGEOID GEOID
## 1: 1001 1 01 001 00161526 0500000US01001 01001
## 2: 1003 2 01 003 00161527 0500000US01003 01003
## 3: 1005 3 01 005 00161528 0500000US01005 01005
## 4: 1007 4 01 007 00161529 0500000US01007 01007
## 5: 1009 5 01 009 00161530 0500000US01009 01009
## 6: 1011 6 01 011 00161531 0500000US01011 01011
## NAME LSAD AWATER name_caps
## 1: Autauga 06 25773561 AUTAUGA
## 2: Baldwin 06 1133084680 BALDWIN
## 3: Barbour 06 50864677 BARBOUR
## 4: Bibb 06 9287964 BIBB
## 5: Blount 06 15077461 BLOUNT
## 6: Bullock 06 6055020 BULLOCK
Note that the data you are adding on should never have more than 1 observation per geog_ id! Remember that the polygons are linked to the data by order– we can’t (sensibly) have more data than we have polygons.
The relevant thing here is that you can query the spatial polygons data object’s data, and used it to subset the spatialpolygons object– including the geographic boundary information! Let’s get rid of Puerto Rico:
map <- map[map@data$geo_id < 60000,] # Dropping Puerto Rico by only keeping rows that had geo_id values less than 60000
Similarly, we could plot only some section of this data file, or save a subset of the file to another object- let’s use the attributes to find Washington State’s Counties only.
washington<-copy(map[map@data$STATEFP==53,])
plot(washington, main="Washington State")
It’s sometimes useful to know or alter the projection used for the map. To discover what the map’s projection is currently, is you use the command proj4string(map).
proj4string(washington)
## [1] "+proj=longlat +datum=NAD83 +no_defs +ellps=GRS80 +towgs84=0,0,0"
In this case, the projection is “longlat”– basically, not a projection; it’s just using the latitude and longitude as a coordinate system. See: http://www.georeference.org/doc/latitude_longitude_projection.htm as a further reference about different projetions. Plenty of resources are available on geographic projections–the important thing to keep in mind is that all projections (putting an “oblate spheroid” like the earth onto a 2d surface) are frought with choices about what to distort and where.
See: https://en.wikipedia.org/wiki/Map_projection#Which_projection_is_best.3F
What does this projection look like?
plot(washington, main="Original Projection")
Luckily,changing this projection is trivial in R.
# Defining the new projection you want to use:
new_projection<-"+proj=laea +lat_0=45 +lon_0=-100 +x_0=0 +y_0=0 +a=6370997 +b=6370997 +units=m +no_defs"
# This is the Lambert Azimuthal Equal Area projection-- a projection known for keeping the area of the geographic units true to reality.
# You can probably find whatever projection you want to use's "PROJ.4 STRING" at: http://spatialreference.org/
# Actually transform the object
washington <- spTransform(washington, CRS(new_projection)) # spTransform: spatial transform of coordinates from one system to antoher
# What does it look like now?
plot(washington,main="New Projection")
However, we still need to create a copy() of the data.table and reassign it to the Washington object, or else it won’t allow us to make changes to the data.table later on (it will throw an error about a “shallow copy” of the data table.)
washington@data<-copy(washington@data)
Let’s say you want to know how much area is in each county– you can create a new data field based on the geographic information. This gArea function calls from the rgeos package: units of this projection is “meters” (see the +units section of the proj4 string), so this should give us Km when we divide by 1000000:
washington@data[,area_sqkm:=gArea(washington,byid=TRUE)/1000000]
## geo_id sort_order STATEFP COUNTYFP COUNTYNS AFFGEOID GEOID
## 1: 53001 2955 53 001 01531601 0500000US53001 53001
## 2: 53003 2956 53 003 01533502 0500000US53003 53003
## 3: 53005 2957 53 005 01513302 0500000US53005 53005
## 4: 53007 2958 53 007 01531932 0500000US53007 53007
## 5: 53009 2959 53 009 01531341 0500000US53009 53009
## 6: 53011 2960 53 011 01531820 0500000US53011 53011
## 7: 53013 2961 53 013 01513273 0500000US53013 53013
## 8: 53015 2962 53 015 01529156 0500000US53015 53015
## 9: 53017 2963 53 017 01531818 0500000US53017 53017
## 10: 53019 2964 53 019 01531821 0500000US53019 53019
## 11: 53021 2965 53 021 01531822 0500000US53021 53021
## 12: 53023 2966 53 023 01533500 0500000US53023 53023
## 13: 53025 2967 53 025 01531924 0500000US53025 53025
## 14: 53027 2968 53 027 01531823 0500000US53027 53027
## 15: 53029 2969 53 029 01513272 0500000US53029 53029
## 16: 53031 2970 53 031 01531936 0500000US53031 53031
## 17: 53033 2971 53 033 01531933 0500000US53033 53033
## 18: 53035 2972 53 035 01529223 0500000US53035 53035
## 19: 53037 2973 53 037 01531926 0500000US53037 53037
## 20: 53039 2974 53 039 01529219 0500000US53039 53039
## 21: 53041 2975 53 041 01531927 0500000US53041 53041
## 22: 53043 2976 53 043 01514052 0500000US53043 53043
## 23: 53045 2977 53 045 01529221 0500000US53045 53045
## 24: 53047 2978 53 047 01531928 0500000US53047 53047
## 25: 53049 2979 53 049 01513274 0500000US53049 53049
## 26: 53051 2980 53 051 01529157 0500000US53051 53051
## 27: 53053 2981 53 053 01529159 0500000US53053 53053
## 28: 53055 2982 53 055 01531931 0500000US53055 53055
## 29: 53057 2983 53 057 01531402 0500000US53057 53057
## 30: 53059 2984 53 059 01529220 0500000US53059 53059
## 31: 53061 2985 53 061 01529222 0500000US53061 53061
## 32: 53063 2986 53 063 01529225 0500000US53063 53063
## 33: 53065 2987 53 065 01531930 0500000US53065 53065
## 34: 53067 2988 53 067 01529226 0500000US53067 53067
## 35: 53069 2989 53 069 01513275 0500000US53069 53069
## 36: 53071 2990 53 071 01531819 0500000US53071 53071
## 37: 53073 2991 53 073 01529224 0500000US53073 53073
## 38: 53075 2992 53 075 01533501 0500000US53075 53075
## 39: 53077 2993 53 077 01531929 0500000US53077 53077
## geo_id sort_order STATEFP COUNTYFP COUNTYNS AFFGEOID GEOID
## NAME LSAD AWATER name_caps area_sqkm
## 1: Adams 06 12622881 ADAMS 4992.1999
## 2: Asotin 06 11450483 ASOTIN 1663.1743
## 3: Benton 06 154325810 BENTON 4514.9275
## 4: Chelan 06 189998659 CHELAN 7679.7601
## 5: Clallam 06 2413676105 CLALLAM 4657.4722
## 6: Clark 06 70459265 CLARK 1694.6318
## 7: Columbia 06 12566852 COLUMBIA 2272.1166
## 8: Cowlitz 06 67322703 COWLITZ 2987.8458
## 9: Douglas 06 76102867 DOUGLAS 4747.3631
## 10: Ferry 06 140280531 FERRY 5781.4918
## 11: Franklin 06 60193373 FRANKLIN 3252.6191
## 12: Garfield 06 19490293 GARFIELD 1831.9577
## 13: Grant 06 289575831 GRANT 7221.0382
## 14: Grays Harbor 06 834956691 GRAYS HARBOR 5006.0129
## 15: Island 06 800096249 ISLAND 565.5492
## 16: Jefferson 06 982867070 JEFFERSON 4798.8707
## 17: King 06 495743196 KING 5756.2371
## 18: Kitsap 06 442827389 KITSAP 1159.4203
## 19: Kittitas 06 92366017 KITTITAS 6037.1778
## 20: Klickitat 06 85063387 KLICKITAT 4941.5249
## 21: Lewis 06 86636896 LEWIS 6320.1551
## 22: Lincoln 06 74744200 LINCOLN 6034.4662
## 23: Mason 06 237129533 MASON 2687.7418
## 24: Okanogan 06 121223338 OKANOGAN 13789.1367
## 25: Pacific 06 752975316 PACIFIC 2485.7630
## 26: Pend Oreille 06 65404063 PEND OREILLE 3678.3810
## 27: Pierce 06 354564434 PIERCE 4580.4432
## 28: San Juan 06 1158102373 SAN JUAN 656.1864
## 29: Skagit 06 489272212 SKAGIT 4561.9018
## 30: Skamania 06 72853035 SKAMANIA 4332.6359
## 31: Snohomish 06 282423392 SNOHOMISH 5434.8855
## 32: Spokane 06 43803216 SPOKANE 4605.1543
## 33: Stevens 06 163107425 STEVENS 6562.9276
## 34: Thurston 06 133490871 THURSTON 1973.9133
## 35: Wahkiakum 06 61658249 WAHKIAKUM 719.0812
## 36: Walla Walla 06 74827791 WALLA WALLA 3404.1141
## 37: Whatcom 06 1027129386 WHATCOM 5598.2213
## 38: Whitman 06 47939274 WHITMAN 5643.9042
## 39: Yakima 06 41107261 YAKIMA 11173.3366
## NAME LSAD AWATER name_caps area_sqkm
head(washington@data)
## geo_id sort_order STATEFP COUNTYFP COUNTYNS AFFGEOID GEOID
## 1: 53001 2955 53 001 01531601 0500000US53001 53001
## 2: 53003 2956 53 003 01533502 0500000US53003 53003
## 3: 53005 2957 53 005 01513302 0500000US53005 53005
## 4: 53007 2958 53 007 01531932 0500000US53007 53007
## 5: 53009 2959 53 009 01531341 0500000US53009 53009
## 6: 53011 2960 53 011 01531820 0500000US53011 53011
## NAME LSAD AWATER name_caps area_sqkm
## 1: Adams 06 12622881 ADAMS 4992.200
## 2: Asotin 06 11450483 ASOTIN 1663.174
## 3: Benton 06 154325810 BENTON 4514.928
## 4: Chelan 06 189998659 CHELAN 7679.760
## 5: Clallam 06 2413676105 CLALLAM 4657.472
## 6: Clark 06 70459265 CLARK 1694.632
Often, shapefiles will have more detail than is necessary for cartographic mapping– this can make things slow to plot, and generally cumbersome. To fix this, we can use gSimplify().
To see the impacts, let’s take a look at the impacts of gSimplify up close– on the counties within Washington only.
object.size(washington) #how big in bytes is this object before simplification?
## 1117888 bytes
plot(washington,main="Original Geometry")
washington.simple<-gSimplify(washington,tol=500,topologyPreserve=TRUE)
object.size(washington.simple) #how big in bytes is this object after simplification?
## 174048 bytes
plot(washington.simple,main="Simplified")
#tol: tolerance (higher=simpler)
#topologyPreserve: if true, attempts to keep topology (relationships between counties, etc) the same.
Once you’ve simplified a shape, you no longer should use it for analysis– there might be gaps between borders, and the precision for where things are in the real world has been lost. From the point that you have simplified the geometry onward, it should be used for cartography ONLY.
Let’s apply that amount of smoothing to the rest of the the US, and see how much size we “save”:
object.size(map) #how big in bytes is this object before simplification?
## 13448016 bytes
us.simple<-gSimplify(map,tol=500,topologyPreserve=TRUE)
object.size(us.simple) #how big in bytes is this object after simplification?
## 11390976 bytes
plot(us.simple,main="Simplified")
Holy crow, look at Alaska! Obviously, the tolerance that looks good for some areas is not appropriate for others– keep on the lookout for things that look strange, and use as high of tolerance as still looks good to you.
Shapefiles can have two types of polygons– single part (just one lake), or multipart (a lake and a tiny lake that count as the same object as far as attributes are concerned). Sometimes, when you load in your data, you might notice thousands of polygons, with lots of duplicates. This indicates that your multipart polygons were read in as individual objects. Luckily, we can collapse these back together into a sensible object!
Let’s take a look at a composite shapefile for both national and subnational areas that are contained within the same shapefile.
gbd15 <- readOGR("C:/Users/stubbsrw/Documents/us_counties_stubbs_gitrepo/r_shared/woodson_mapping_suite/global_ihme_shapefiles/gbd15_all_levels_single_polygon_03032016.shp", layer="gbd15_all_levels_single_polygon_03032016")
## OGR data source with driver: ESRI Shapefile
## Source: "C:/Users/stubbsrw/Documents/us_counties_stubbs_gitrepo/r_shared/woodson_mapping_suite/global_ihme_shapefiles/gbd15_all_levels_single_polygon_03032016.shp", layer: "gbd15_all_levels_single_polygon_03032016"
## with 6959 features
## It has 8 fields
## Integer64 fields read as strings: parent_id spr_reg_id region_id level
Over 6 thousand features! Let’s collapse those down to something sensible.
First, let’s convert the data object into a data.table, and save it into a separate data.table for use later.
gbd15<-gbd15[order(gbd15@data$level,gbd15@data$location_i),] # Re-assigning the "map" object to be the sorted (on location_id) version of the "map" object.
gbd15@data<-data.table(gbd15@data) #Converting the data object to a data.table
gbd15data<-unique(copy(gbd15@data))#Saving a copy of the data.table (but only the unique observations!)
gbd15@data[, ihme_lc_id:=as.character(ihme_lc_id)] # Replacing the location ID as a charecter version of location ID
## parent_id loc_name
## 1: 0 New Caledonia
## 2: 0 Kosovo
## 3: 0 French Southern and Antarctic Lands
## 4: 0 Western Sahara
## 5: 0 Antarctica
## ---
## 6955: 4749 South East England
## 6956: 4749 South East England
## 6957: 4749 South East England
## 6958: 4749 South West England
## 6959: 4749 South West England
## loc_nm_sh spr_reg_id region_id ihme_lc_id
## 1: New Caledonia 0 0 NA_4
## 2: Kosovo 0 0 NA_5
## 3: French Southern and Antarctic Lands 0 0 NA_2
## 4: Western Sahara 0 0 NA_3
## 5: Antarctica 0 0 NA_1
## ---
## 6955: SE England 64 73 GBR_4625
## 6956: SE England 64 73 GBR_4625
## 6957: SE England 64 73 GBR_4625
## 6958: SW England 64 73 GBR_4626
## 6959: SW England 64 73 GBR_4626
## location_i level
## 1: -1 0
## 2: -1 0
## 3: -1 0
## 4: -1 0
## 5: -1 0
## ---
## 6955: 4625 5
## 6956: 4625 5
## 6957: 4625 5
## 6958: 4626 5
## 6959: 4626 5
Now, we will use a particular field of the attributes (in this case, ihme_ lc_ id) aggregate many single-part polygons into multipart polgyons. You also can want to merge together polygons for more normal reasons– say you had a shapefile with counties, and wanted a shapefile of states– since counties nest within states, you could UnionSpatialPolygons on the State ID and you would end up with a State Shapefile. This is a good strategy if you want to highlight borders of a geography of admin levels above your data of interest– that way, you are guaruanteed to have the same border lines as the smaller geography.
map <- unionSpatialPolygons(gbd15, gbd15@data$ihme_lc_id)
At this point, “map” is a SpatialPolygons object– without any data attributes at all. However, the polygon ID of this new SpatialPolygons is the identifier you used to combine geometry before– we can use this to recover and create a data.table, and add the rest of our data back on. Interestingly, to join together data and polygons to create a SpatialPolygonsDataFrame, you need the row names to match up. As such, we create a data.frame with the same row names.
data <- data.frame(ihme_lc_id = (sapply(map@polygons, function(x) x@ID)))
rownames(data) <- data$ihme_lc_id
Now, we can create a SpatialPolygonsDataFrame using the map and data objects:
gbd15 <- SpatialPolygonsDataFrame(map, data)
At this point, we follow familiar steps to merge the data back on.
gbd15 <- gbd15[order(gbd15@data$ihme_lc_id),]
gbd15@data$polygon_sort_order<-1:nrow(gbd15@data) # Creating the index that goes from 1-the length of the data.table, such that order can be preserved!
## Merging back on the attributes after having collapsed by location ID
gbd15@data<-merge(gbd15@data,gbd15data,by="ihme_lc_id",all.x=TRUE, all.y=FALSE)
gbd15@data<-data.table(gbd15@data)
# Let's also fix the attribute names:
setnames(gbd15@data,"location_i","location_id")
head(gbd15@data)
## ihme_lc_id polygon_sort_order parent_id loc_name
## 1: AFG 1 138 Afghanistan
## 2: AGO 2 167 Angola
## 3: ALB 3 42 Albania
## 4: AND 4 73 Andorra
## 5: ARE 5 138 United Arab Emirates
## 6: ARG 6 96 Argentina
## loc_nm_sh spr_reg_id region_id location_id level
## 1: Afghanistan 137 138 160 3
## 2: Angola 166 167 168 3
## 3: Albania 31 42 43 3
## 4: Andorra 64 73 74 3
## 5: UAE 137 138 156 3
## 6: Argentina 64 96 97 3
plot(gbd15[gbd15@data$level==3,], main="Nations of the World")
Got anything else you want added to this document? Contact Rebecca Stubbs at stubbsrw@gmail.com.