Want to work with maps in R, but not sure how to get started? Maps are a great way to convey potentially dense information very quickly, and with a handful of tools you can create compelling maps in R to share with your stakeholders. In this post, we’ll cover the basics of where to find map files, how to work with them, and how to use R packages to create a map that shows geographic data.
Maps are an idiom we’re all familiar with, so the consumer of your information can concentrate on what a map shows, rather than spending time figuring out a complex, novel data visualization. Maps are a fantastic way to convey information about things like your customers’ spending trends, the location of your users, or demographic information that might be useful to know before rolling out an outreach program. They’re especially useful for communicating data insights to non-data-science stakeholders who need to consume a lot of information quickly – stakeholders like lawmakers, members of the C-suite, and journalists.
Working with map files can seem daunting, especially for the beginner. Often, you can find map files (shapefiles, say, or geoJSON files) that could be useful, but don’t contain exactly the region you want to map.
Recently, for example, I was asked to help construct a local map for an organization that wanted to map stakeholder addresses to census tracts in the greater Philadelphia area and display that information in a map. It was easy for me to find ready-made map files for the county of Philadelphia, for the state of Pennsylvania, and for the state of New Jersey, but no ready-made maps for the 7-county area that makes up the Philadelphia commuter community. What to do? Combine and filter some map data to get the area I needed!
Here were my steps:
Want to follow along? I’ve included the complete code that will allow you to reproduce what I did for my client.
The Census Bureau has a number of “shapefiles” – a commonly used file type that includes not only geographic shapes (lines, points, and polygons), but supplementary data, such as the id of the state, county, and census tract. It can take a bit of digging to find the right file, but the Census Bureau is dedicated to the dissemination of this kind of data, and has a free API that you might want to check out.
Here, I’m going to download and unzip shapefile data for Pennsylvania:
download.file("https://www2.census.gov/geo/tiger/TIGER2017/TRACT/tl_2017_42_tract.zip", "tl_2017_42_tract.zip")
unzip("tl_2017_42_tract.zip", exdir = "tl_2017_42_tract")
We want to bring in the geographic data using the rgdal library and limit our selection to a few counties. County codes can be found at https://www2.census.gov/geo/docs/reference/codes/files/st42_pa_cou.txt
library(rgdal)
pa <- readOGR(dsn = "tl_2017_42_tract", verbose = FALSE)
pa_counties <- pa[which(pa$COUNTYFP %in% c("017","045","091", "101")),]
Our pa_counties data is complex, but the most important part for us is in the @data portion. Let’s peek!
head(pa_counties@data)
| STATEFP | COUNTYFP | TRACTCE | GEOID | NAME | NAMELSAD | MTFCC | FUNCSTAT | ALAND | AWATER | INTPTLAT | INTPTLON | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 42 | 017 | 100102 | 42017100102 | 1001.02 | Census Tract 1001.02 | G5020 | S | 7608630 | 2509728 | +40.0745595 | -074.9405810 |
| 1 | 42 | 017 | 100103 | 42017100103 | 1001.03 | Census Tract 1001.03 | G5020 | S | 1572548 | 9817 | +40.0687610 | -074.9713318 |
| 2 | 42 | 091 | 201411 | 42091201411 | 2014.11 | Census Tract 2014.11 | G5020 | S | 2409072 | 35484 | +40.1444849 | -075.2140269 |
| 3 | 42 | 017 | 100104 | 42017100104 | 1001.04 | Census Tract 1001.04 | G5020 | S | 1966895 | 12199 | +40.0918428 | -074.9501748 |
| 4 | 42 | 017 | 100105 | 42017100105 | 1001.05 | Census Tract 1001.05 | G5020 | S | 2663937 | 0 | +40.0821350 | -074.9478227 |
| 5 | 42 | 017 | 100206 | 42017100206 | 1002.06 | Census Tract 1002.06 | G5020 | S | 4735309 | 0 | +40.1347795 | -074.9441709 |
You may have already noticed, but the state, county, and tract codes (the first few columns) are combined into the GEOID column, which uniquely identifies a single census tract from among the over 11 million census tracts in the United States. The reason why we want to use census tracts is that the Census Bureau has lots of information at the tract level – mean and median household income, the racial and age makeup of the tract, the level of unemployment, the number of renters versus homeowners, and more! It’s a great way for us to get some level of insight into what our customers, users, patients, or other people we’re mapping are like. Even if we don’t have specific information about them, we know what their neighborhood is like in aggregate.
We’re going to do the same thing for New Jersey – download shapefile data, unzip it, open the shapefile and limit our selection to a few counties.
Codes from https://www2.census.gov/geo/docs/reference/codes/files/st34_nj_cou.txt
download.file("https://www2.census.gov/geo/tiger/TIGER2017/TRACT/tl_2017_34_tract.zip", "tl_2017_34_tract.zip")
unzip("tl_2017_34_tract.zip", exdir = "tl_2017_34_tract")
nj <- readOGR(dsn="tl_2017_34_tract", verbose = FALSE)
nj_counties <- nj[which(nj$COUNTYFP %in% c("005","007","015")),]
The nj_counties geographic data is similar to pa_counties:
head(nj_counties@data)
| STATEFP | COUNTYFP | TRACTCE | GEOID | NAME | NAMELSAD | MTFCC | FUNCSTAT | ALAND | AWATER | INTPTLAT | INTPTLON | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 189 | 34 | 007 | 600400 | 34007600400 | 6004 | Census Tract 6004 | G5020 | S | 862459 | 0 | +39.9329219 | -075.1180881 |
| 191 | 34 | 007 | 600700 | 34007600700 | 6007 | Census Tract 6007 | G5020 | S | 629533 | 537888 | +39.9560193 | -075.1255342 |
| 192 | 34 | 007 | 600800 | 34007600800 | 6008 | Census Tract 6008 | G5020 | S | 1109780 | 260022 | +39.9513992 | -075.1138309 |
| 194 | 34 | 007 | 608303 | 34007608303 | 6083.03 | Census Tract 6083.03 | G5020 | S | 3780973 | 22719 | +39.8120904 | -075.0725353 |
| 252 | 34 | 007 | 608001 | 34007608001 | 6080.01 | Census Tract 6080.01 | G5020 | S | 1799724 | 0 | +39.8319418 | -075.0078629 |
| 253 | 34 | 007 | 609000 | 34007609000 | 6090 | Census Tract 6090 | G5020 | S | 4442823 | 4000 | +39.7312887 | -074.8799265 |
We can simply stack our two sets of geographic data, since they have the same fields, using rbind, to get a combined dataset that covers the greater Philadelphia area, including neighboring counties:
philly_metro <- rbind(nj_counties, pa_counties)
Let’s create some fabricated latitude and longitude data to represent a handful of geographic points. These could be store locations, customer addresses, crime locations, or any other geographic points of interest. You can easily “geocode” (translate to latitude and longitude) any street addresses you have using a Google Maps API (some cost associated once you reach 40,000 geocodes) or a Census API (free). For our purposes, let’s assume we’ve already done that and have our latitude and longitude coordinates.
point_locations <- data.frame(lat = c(39.9931, 39.7103, 39.8592, 39.9526, 40.4108, 39.9078, 40.2290),
lon=c(-74.7879, -75.1078, -75.0144, -75.1652, -75.2479, -75.3879, -75.3879))
* In Philadelphia, ‘jawn’ is a multipurpose term meaning ‘thing’, ‘place’, ‘situation’, etc. Since we’re talking about a map of Philly, it seems fitting to slip in a bit of non-technical jargon!
You can choose to make a static map using ggplot2, if you’re a dedicated user of ggplot, and that’s often what I do. This map is somewhat large, however, and if we were using real data instead of fabricated data, there could be hundreds or even thousands of points that we want to map on top of our census tracts. So instead of a static map, I chose here to use a dynamic map that users can zoom in or out on, hover over to get more details, and so on. leaflet is a very easy to use package for a use case like this.
I’m going to have two layers to my map:
addPolygons), andaddMarkers)Before I map those layers, I set the view of my map to be centered around the middle of my points, and I fiddle with the zoom level to get the right initial zoom into the map (see setView).
In a real project, I would merge in other data into philly_metro@data, such as socioeconomic or demographic data about the tract, and include those data in the labels, and maybe even color-code my census tract fill color to reflect some numeric value in the data.
library(leaflet)
leaflet(philly_metro) %>%
setView(lng = mean(as.numeric(point_locations$lon), na.rm=TRUE),
lat = mean(as.numeric(point_locations$lat), na.rm=TRUE), zoom = 9) %>%
addPolygons(fillColor = "white",
fillOpacity = 1,
weight = 1,
opacity = 0.5,
color = "black",
label = paste(philly_metro$NAMELSAD,
", GEOID ",
philly_metro$GEOID,
sep=""),
labelOptions = labelOptions(
style = list("font-weight" = "normal",
padding = "3px 8px"),
textsize = "13px",
direction = "auto")) %>%
addMarkers(lng = point_locations$lon,
lat=point_locations$lat,
label = paste("Client at",
point_locations$lat,
",",
point_locations$lon))
As you can see, this map would be great to show in a presentation or include on a website, or even take a quick screenshot of if you were in a hurry and didn’t have time to do a static map. With just a pair of tools (rgdal and leaflet), we’re able to create a simple map that can provide mouseover events, lots of customization of look and feel, and a clean, easy to navigate final product.
Want to learn more? I’ll be at ODSC Boston in a few short weeks to lead a workshop in working with geographic data in R. I hope to see you there!