I have always been interested in maps and geography, but I was confused by shapefiles and always relied on packages with built-in maps. I recently took Ari Lamstein’s course Shapefiles for R Programmers at http://courses2.arilamstein.com/courses/shapefiles-for-r-programmers. After taking the course, I can now make use of the many shapefiles available on the web in my own analyses. To demonstrate their usefulness, I would like to share a project I have been working on with my newfound knowledge.
This data comes from The Newberry Library’s Atlas of Historical County Boundaries at http://publications.newberry.org/ahcbp/. The atlas contains data and shapefiles for all U.S. counties, states, and territories from 1629 to 2000. The shapefiles are what I wanted to explore in this project. The data comes in various formats, but I am going to work with the GIS file for the entire US at a resolution of 0.001 degree tolerance. With the unzipped files in my working directory, I can load the shapefile into memory using the readOGR function from the rgdal package:
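If you want to follow along, the zip archive can be fetched and unpacked from within R. Here is a minimal sketch; the URL and file name are placeholders, so substitute the actual link from the atlas’s download page:
# download and unzip the shapefile archive
# (hypothetical URL -- use the real link from the atlas's download page)
url <- "http://publications.newberry.org/ahcbp/US_HistCounties.zip"
download.file(url, destfile = "US_HistCounties.zip", mode = "wb")
unzip("US_HistCounties.zip")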
# load the shapefile data into R
library(rgdal)
us = readOGR(dsn=".", layer="US_HistCounties", stringsAsFactors = FALSE)
## OGR data source with driver: ESRI Shapefile
## Source: ".", layer: "US_HistCounties"
## with 17727 features
## It has 17 fields
The loaded object is a SpatialPolygonsDataFrame with 17 fields of metadata. Let’s see what is in there:
names(us)
## [1] "ID_NUM" "NAME" "ID" "STATE_TERR" "FIPS"
## [6] "VERSION" "START_DATE" "END_DATE" "CHANGE" "CITATION"
## [11] "START_N" "END_N" "AREA_SQMI" "CNTY_TYPE" "FULL_NAME"
## [16] "CROSS_REF" "NAME_START"
Let’s load some packages to help massage the data into a form we can use. The rgeos, maptools, and dplyr packages will help convert the object into a regular data frame, and we will use ggplot2 and ggmap to plot the shapefiles and download reference maps.
# load other packages
library(rgeos)
library(maptools)
library(ggplot2)
library(dplyr)
library(ggmap)
Since this is a historical collection of shapefiles, plotting them all at the same time would be a confusing mess, so we will have to subset by date. To make things easy, we will only work with annual dates, so we will create two new fields, start_year and end_year: the year a set of county borders took effect and the year it was discontinued.
# add numeric start and end years for the boundaries
us$start_year <- as.numeric(substr(us$START_DATE, 1, 4))
us$end_year <- as.numeric(substr(us$END_DATE, 1, 4))
Next we use some dplyr commands to convert the shapefile into a more familiar data frame that we can use with the ggmap package. ggmap is a member of the ggplot2 family and uses the grammar of graphics in its syntax. I’ll admit I’m still unraveling exactly how this code works, but you can see Hadley Wickham’s explanation at https://github.com/hadley/ggplot2/wiki/plotting-polygon-shapefiles.
# convert to a ggplot2 friendly data.frame
us@data$id = rownames(us@data)
us.points = fortify(us, region="id")
us.df = inner_join(us.points, us@data, by="id")
For ease of use later, I’d like to set up some functions to streamline subsetting the data. We will create one function that takes the entire US and a year, and a separate function that takes a state or territory and a year.
# function for pulling out a state/territory and a year
stateYear <- function(df, year, state) {
  subset(df, end_year > year & start_year < year + 1 & STATE_TERR == state)
}
# function for pulling out the entire US and a year
usYear <- function(df, year) {
  subset(df, end_year > year & start_year < year + 1)
}
Armed with our data, packages, and functions, we can finally start looking at shapefiles. Let’s take a look at an early period in the US, say 1700. At that time there were only the original colonies, so those should set the boundaries of our map. I’m going to use the ggmap package instead of plain ggplot2, since it provides an easy function called get_map for grabbing a background map from Google. The get_map function requires a location, source, map type, and zoom, along with some other optional attributes you can define. Since we are looking at the original colonies, the center point should be somewhere near New Jersey. First we will use our function to subset out the year 1700 data:
# let's look at US in 1700
us1700 <- usYear(us.df, 1700)
Now we will grab a background map from Google (warning: there is a limit to the number of queries per day, per Google’s Terms and Conditions at https://developers.google.com/maps/terms). We will then draw the shapefile boundaries on top of the background map using the geom_path function:
# Need a base map of the US, use get_map from ggmap package
basemap <- get_map(location="New Jersey", source='google',
maptype="terrain", color="bw", zoom = 5)
# Now plot the subset of the shapefile on the base map
us1700map <- ggmap(basemap) +
geom_path(aes(long,lat,group=group),
size = 1,
data = us1700) +
ggtitle("United States circa 1700")
print(us1700map)
Now we can start having fun with the history of state and county boundaries. I did not know that in 1700 Georgia consisted of what looks like only 4 counties, or that Pennsylvania was so large compared to the rest of the colonies at that time.
Let’s take a look how things evolved over the course of the next 100 years:
# let's look at US in 1800
us1800 <- usYear(us.df, 1800)
# Need a base map of the US, use get_map from ggmap package
basemap <- get_map(location="West Virginia", source='google',
maptype="terrain", color="bw", zoom = 5)
# Now plot the subset of the shapefile on the base map
us1800map <- ggmap(basemap) +
geom_path(aes(long,lat,group=group),
size = 1,
data = us1800) +
ggtitle("United States circa 1800")
print(us1800map)
The US has greatly expanded, but the Louisiana Purchase has not happened yet. The Southeast is starting to take shape, but there are large swathes of land that extend from the eastern states all the way to the Mississippi.
Now let’s jump ahead another 100 years, to 1900:
# let's look at US in 1900
us1900 <- usYear(us.df, 1900)
# Need a base map of the US, use get_map from ggmap package
basemap <- get_map(location="Kansas", source='google',
maptype="terrain", color="bw", zoom = 4)
# Now plot the subset of the shapefile on the base map
us1900map <- ggmap(basemap) +
geom_path(aes(long,lat,group=group),
size = 1,
data = us1900) +
ggtitle("United States circa 1900")
print(us1900map)
Things look pretty much like they do today, except for Oklahoma. You can see there are still large patches of land that would eventually be divided into smaller counties when Oklahoma joined the Union in 1907.
One of the drawbacks of looking at the entire US is that you lose some of the detail. Let’s use the state/territory function to look at things on a smaller scale. We will use my home state of Tennessee, and I’ll use some additional parameters to spice up the plots a bit:
##############################################################
# Can do the same for states and add some flair
TN1800 <- stateYear(us.df, 1800, "Tennessee")
# Need a base map of TN, use get_map from ggmap package
basemap <- get_map(location="Tennessee", source='google',
maptype="terrain", color="bw", zoom = 6)
# Now plot the subset of the shapefile on the base map
TN1800map <- ggmap(basemap) +
scale_x_continuous(limits = c((min(TN1800$long)-0.5), (max(TN1800$long)+0.5)), expand = c(0, 0)) +
scale_y_continuous(limits = c((min(TN1800$lat)-0.5), (max(TN1800$lat)+0.5)), expand = c(0, 0)) +
geom_path(aes(long,lat,group=group),
size = 1,
color = "orange",
data = TN1800) +
ggtitle("Tennessee circa 1800")
print(TN1800map)
Since Tennessee did not join the Union until 1796, we cannot go all the way back to 1700, but we can look at 1800 and 1900. I used the scale_x_continuous and scale_y_continuous functions to zoom the map in based on the minimum and maximum latitude and longitude in the data frame. We could do the same with the US maps to better center those plots, but we would have to zoom out to cover the entire country, and the zoomed-out tiles would lose detail.
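To avoid repeating that min/max arithmetic for every plot, a small helper can compute the padded limits from any of our fortified subsets. This is just a convenience sketch; zoomLimits is a name I made up, not a ggmap function:
# compute padded x/y limits from a fortified shapefile subset
zoomLimits <- function(df, pad = 0.5) {
  list(x = c(min(df$long) - pad, max(df$long) + pad),
       y = c(min(df$lat) - pad, max(df$lat) + pad))
}
# usage:
# lims <- zoomLimits(TN1800)
# scale_x_continuous(limits = lims$x, expand = c(0, 0)) +
# scale_y_continuous(limits = lims$y, expand = c(0, 0))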
As the state map shows, the areas of East Tennessee and around Nashville were starting to be defined, but the rest of the state consisted of vast tracts. There are 5 major rivers in the state, the Tennessee, Cumberland, Mississippi, Duck, and Clinch, and a good portion of the county borders are defined by them. Let’s look at 1900:
##############################################################
# Same subset for Tennessee in 1900, reusing the TN basemap
TN1900 <- stateYear(us.df, 1900, "Tennessee")
# Now plot the subset of the shapefile on the base map
TN1900map <- ggmap(basemap) +
scale_x_continuous(limits = c((min(TN1900$long)-0.5), (max(TN1900$long)+0.5)), expand = c(0, 0)) +
scale_y_continuous(limits = c((min(TN1900$lat)-0.5), (max(TN1900$lat)+0.5)), expand = c(0, 0)) +
geom_path(aes(long,lat,group=group),
size = 1,
color = "orange",
data = TN1900) +
ggtitle("Tennessee circa 1900")
print(TN1900map)
Again, the 1900 data looks similar to today’s map, but digging into the data reveals slight boundary changes and several additional counties that come into existence over the next 50 or so years.
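A quick way to check that claim is to count the distinct county names in a given year’s subset (this assumes NAME uniquely identifies a county within the state):
# compare the number of distinct Tennessee counties in 1900 and 1999
length(unique(stateYear(us.df, 1900, "Tennessee")$NAME))
length(unique(stateYear(us.df, 1999, "Tennessee")$NAME))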
Now we can start using the real power of ggmap: we can very easily layer different shapefiles on top of one another. Let’s go back to 1800 and see how it compares to the counties in 1999:
# We can even layer for year to year comparison
TN1999 <- stateYear(us.df, 1999, "Tennessee")
With the Tennessee 1999 shapefile, we can plot the 1800 boundaries on top of it and reduce their opacity by 50% so we can better see the regions that overlap from 200 years ago:
# Now layer, with each subsequent layer on top of the previous
TN1800vs1999map <- ggmap(basemap) +
scale_x_continuous(limits = c((min(TN1999$long)-0.5), (max(TN1999$long)+0.5)), expand = c(0, 0)) +
scale_y_continuous(limits = c((min(TN1999$lat)-0.5), (max(TN1999$lat)+0.5)), expand = c(0, 0)) +
geom_path(aes(long,lat,group=group),
size = 1,
color = "black",
data = TN1999) +
geom_path(aes(long,lat,group=group),
size = 1,
color = "orange",
alpha = 0.5,
data = TN1800) +
ggtitle("Tennessee circa 1800 w/ current counties")
print(TN1800vs1999map)
We can now observe how much things have changed in 200 years, and with the animation package you can even reconstruct the videos shown on The Newberry Library’s website, for the US as a whole or for any state or region you can subset.
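For instance, here is a rough sketch of that idea using saveGIF from the animation package (it requires ImageMagick on your system; the decade range, the usbasemap object, and the file name are placeholders I made up):
# write one frame per decade to an animated GIF
library(animation)
saveGIF({
  for (yr in seq(1790, 1900, by = 10)) {
    snapshot <- usYear(us.df, yr)
    p <- ggmap(usbasemap) +  # assumes a US-wide basemap, like the 1900 example
      geom_path(aes(long, lat, group = group), size = 1, data = snapshot) +
      ggtitle(paste("United States circa", yr))
    print(p)
  }
}, movie.name = "us_counties.gif", interval = 0.5)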
As I have demonstrated, ggmap is a great tool for mapping with shapefiles, but it takes some trial and error with centering, zooming, and other attributes. In addition, it produces a static map, and that may not be what your audience wants. I’d now like to demonstrate using shapefiles with the leaflet package. Leaflet is a JavaScript library for open-source, interactive mapping. As a bonus, we can use the original Spatial Polygons object directly; we just have to subset it to the year and area we want to look at. Let’s recreate the subsets of Tennessee from 1800, 1900, and 1999 with the raw shapefile object:
TN1800 <- stateYear(us, 1800, "Tennessee")
TN1900 <- stateYear(us, 1900, "Tennessee")
TN1999 <- stateYear(us, 1999, "Tennessee")
Now, load the leaflet package, and using the magrittr piping syntax, plot the shapefiles onto the base leaflet map. We won’t fill the polygons and we will keep our orange borders:
# Using magrittr %>% syntax
library(leaflet)
leaflet(data = TN1800) %>% addTiles() %>%
addPolygons(fill = FALSE, color = "orange")
Now we have a map that we can pan and zoom. As we zoom in, we get more and more detail, so we can see the rivers and other natural boundaries that defined the early counties.
In addition, leaflet allows us to add markers and popups, so we can start enhancing our maps with the metadata we haven’t used yet. In the code below I create an empty data frame and then pull the center coordinates of each county out of the Spatial Polygons object. Clicking on a marker brings up a popup with the name of the county, and clicking it again makes the popup disappear. I also layered in the county shapefile from 1999, so we can see both the past and the present. Let’s first pull out the label points:
# And leaflet will allow you to place markers and popups
# The shapefile contains quite a bit of metadata, let's pull
# out the county label points for 1800
label_points <- data.frame(long=numeric(),
lat=numeric())
for (i in 1:length(TN1800)) {
label_points[i,] <- TN1800@polygons[[i]]@Polygons[[1]]@labpt
}
I wish I could explain the structure of Spatial Polygons objects, but I am still wrapping my head around how they work. A good explanation can be found at http://eco-data-science.github.io/spatial_analysis2_R/, though it hasn’t quite clicked with me beyond the fact that it is a nested list of various objects.
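One quick way to see the nesting is slotNames, which exposes the slots the extraction loop above walks through:
# peek at the nesting: SpatialPolygonsDataFrame -> polygons -> Polygons -> labpt
slotNames(TN1800)                              # data, polygons, plotOrder, bbox, proj4string
slotNames(TN1800@polygons[[1]])                # Polygons, plotOrder, labpt, ID, area
slotNames(TN1800@polygons[[1]]@Polygons[[1]])  # labpt, area, hole, ringDir, coords
Each feature is a Polygons object that can hold several Polygon rings, and the labpt slot holds the label point we grabbed for the markers.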
Let’s continue on with the mapping and add the popups. I’m going to use the generic markers, but leaflet allows you to create and add custom popups and markers for your projects. We will use the NAME field from the Spatial Polygons object to fill each popup.
# Now plot with current counties under the counties from 1800
# but now with popups that when clicked, give you the name
leaflet() %>% addTiles() %>%
addPolygons(fill = FALSE,
color = "black",
data = TN1999) %>%
addPolygons(fill = FALSE,
color = "orange",
opacity = 0.4,
data = TN1800) %>%
addMarkers(label_points$long,
label_points$lat,
popup = TN1800$NAME)
Finally, given that leaflet is open source, there are all kinds of map tiles out there to play with, and you simply need to use the addProviderTiles function to apply them. One place that lists several is https://leaflet-extras.github.io/leaflet-providers/preview/. Here is one of my favorite tiles, with the previous shapefiles (and no popups) drawn on top of it:
# You can even add custom tiles, here's one of my favorites
leaflet() %>% addProviderTiles("NASAGIBS.ViirsEarthAtNight2012") %>%
addPolygons(fill = FALSE,
color = "white",
data = TN1999) %>%
addPolygons(fill = FALSE,
color = "orange",
opacity = 0.4,
data = TN1800)
This is a popular map tile, with a lot going on in the nighttime lights. With the county borders drawn, the lights no longer look randomly dispersed; you can see they are mostly contained within, and defined by, counties. As you can see, there is a lot of analysis you can do with this shapefile data. The Newberry Library does have some census data for various years, but all the early data consists of scanned images of handwritten documents, so it would take some work to get the early years into a tidy form for mapping. One of the projects on my to-do list is to see if there are historical weather map tiles, and to overlay natural disaster aftermath data on them to see if there is a story there.