Crafting Just the Map You Need

Want to work with maps in R, but not sure how to get started? Maps are a great way to convey potentially dense information very quickly, and with a handful of tools you can create compelling maps in R to share with your stakeholders. In this post, we’ll cover the basics of where to find map files, how to work with them, and how to use R packages to create a map that shows geographic data.

Why Maps?

Maps are an idiom we’re all familiar with, so the consumer of your information can concentrate on what a map shows, rather than spending time figuring out a complex, novel data visualization. Maps are a fantastic way to convey information about things like your customers’ spending trends, the location of your users, or demographic information that might be useful to know before rolling out an outreach program. They’re especially useful for communicating data insights to non-data-science stakeholders who need to consume a lot of information quickly – stakeholders like lawmakers, members of the C-suite, and journalists.

Getting Started

Working with map files can seem daunting, especially for the beginner. Often, you can find map files (shapefiles, say, or geoJSON files) that could be useful, but don’t contain exactly the region you want to map.

Recently, for example, I was asked to help construct a local map for an organization that wanted to map stakeholder addresses to census tracts in the greater Philadelphia area and display that information in a map. It was easy for me to find ready-made map files for the county of Philadelphia, for the state of Pennsylvania, and for the state of New Jersey, but no ready-made maps for the 7-county area that makes up the Philadelphia commuter community. What to do? Combine and filter some map data to get the area I needed!

Here were my steps:

  • Get a shapefile of census tracts for Pennsylvania, and filter on just the counties I wanted (Philadelphia, Bucks, Delaware, and Montgomery)
  • Get a shapefile of census tracts for New Jersey, and filter on just the counties I wanted (Burlington, Camden, and Gloucester)
  • Combine those two shapefiles
  • Map the polygons of the census tracts
  • Map the points of the addresses.

Want to follow along? I’ve included the complete code that will allow you to reproduce what I did for my client.

Get Census Bureau Shapefiles

The Census Bureau has a number of “shapefiles” – a commonly used file type that includes not only geographic shapes (lines, points, and polygons), but supplementary data, such as the id of the state, county, and census tract. It can take a bit of digging to find the right file, but the Census Bureau is dedicated to the dissemination of this kind of data, and has a free API that you might want to check out.

Here, I’m going to download and unzip shapefile data for Pennsylvania:

download.file("https://www2.census.gov/geo/tiger/TIGER2017/TRACT/tl_2017_42_tract.zip", "tl_2017_42_tract.zip")
unzip("tl_2017_42_tract.zip", exdir = "tl_2017_42_tract")

We want to bring in the geographic data using the rgdal library and limit our selection to a few counties. County codes can be found at https://www2.census.gov/geo/docs/reference/codes/files/st42_pa_cou.txt

  • Philadelphia (101)
  • Bucks (017)
  • Montgomery (091)
  • Delaware (045)
library(rgdal)
pa <- readOGR(dsn = "tl_2017_42_tract", verbose = FALSE)
pa_counties <- pa[which(pa$COUNTYFP %in% c("017","045","091", "101")),]

Our pa_counties data is complex, but the most important part for us is in the @data portion. Let’s peek!

head(pa_counties@data)
STATEFP COUNTYFP TRACTCE GEOID NAME NAMELSAD MTFCC FUNCSTAT ALAND AWATER INTPTLAT INTPTLON
0 42 017 100102 42017100102 1001.02 Census Tract 1001.02 G5020 S 7608630 2509728 +40.0745595 -074.9405810
1 42 017 100103 42017100103 1001.03 Census Tract 1001.03 G5020 S 1572548 9817 +40.0687610 -074.9713318
2 42 091 201411 42091201411 2014.11 Census Tract 2014.11 G5020 S 2409072 35484 +40.1444849 -075.2140269
3 42 017 100104 42017100104 1001.04 Census Tract 1001.04 G5020 S 1966895 12199 +40.0918428 -074.9501748
4 42 017 100105 42017100105 1001.05 Census Tract 1001.05 G5020 S 2663937 0 +40.0821350 -074.9478227
5 42 017 100206 42017100206 1002.06 Census Tract 1002.06 G5020 S 4735309 0 +40.1347795 -074.9441709

You may have already noticed, but the state, county, and tract codes (the first few columns) are combined into the GEOID column, which uniquely identifies a single census tract from among the over 11 million census tracts in the United States. The reason why we want to use census tracts is that the Census Bureau has lots of information at the tract level – mean and median household income, the racial and age makeup of the tract, the level of unemployment, the number of renters versus homeowners, and more! It’s a great way for us to get some level of insight into what our customers, users, patients, or other people we’re mapping are like. Even if we don’t have specific information about them, we know what their neighborhood is like in aggregate.

We’re going to do the same thing for New Jersey – download shapefile data, unzip it, open the shapefile and limit our selection to a few counties.

Codes from https://www2.census.gov/geo/docs/reference/codes/files/st34_nj_cou.txt

  • Gloucester (015)
  • Burlington (005)
  • Camden (007)
download.file("https://www2.census.gov/geo/tiger/TIGER2017/TRACT/tl_2017_34_tract.zip", "tl_2017_34_tract.zip")
unzip("tl_2017_34_tract.zip", exdir = "tl_2017_34_tract")
nj <- readOGR(dsn="tl_2017_34_tract", verbose = FALSE)
nj_counties <- nj[which(nj$COUNTYFP %in% c("005","007","015")),]

The nj_counties geographic data is similar to pa_counties:

head(nj_counties@data)
STATEFP COUNTYFP TRACTCE GEOID NAME NAMELSAD MTFCC FUNCSTAT ALAND AWATER INTPTLAT INTPTLON
189 34 007 600400 34007600400 6004 Census Tract 6004 G5020 S 862459 0 +39.9329219 -075.1180881
191 34 007 600700 34007600700 6007 Census Tract 6007 G5020 S 629533 537888 +39.9560193 -075.1255342
192 34 007 600800 34007600800 6008 Census Tract 6008 G5020 S 1109780 260022 +39.9513992 -075.1138309
194 34 007 608303 34007608303 6083.03 Census Tract 6083.03 G5020 S 3780973 22719 +39.8120904 -075.0725353
252 34 007 608001 34007608001 6080.01 Census Tract 6080.01 G5020 S 1799724 0 +39.8319418 -075.0078629
253 34 007 609000 34007609000 6090 Census Tract 6090 G5020 S 4442823 4000 +39.7312887 -074.8799265

Create the Geographic Area We Care About

We can simply stack our two sets of geographic data, since they have the same fields, using rbind, to get a combined dataset that covers the greater Philadelphia area, including neighboring counties:

philly_metro <- rbind(nj_counties, pa_counties)

Sample Data

Let’s create some fabricated latitude and longitude data to represent a handful of geographic points. These could be store locations, customer addresses, crime locations, or any other geographic points of interest. You can easily “geocode” (translate to latitude and longitude) any street addresses you have using a Google Maps API (some cost associated once you reach 40,000 geocodes) or a Census API (free). For our purposes, let’s assume we’ve already done that and have our latitude and longitude coordinates.

point_locations <- data.frame(lat = c(39.9931, 39.7103, 39.8592, 39.9526, 40.4108, 39.9078, 40.2290),
                              lon=c(-74.7879, -75.1078, -75.0144, -75.1652, -75.2479, -75.3879, -75.3879))

Map that Jawn*

* In Philadelphia, ‘jawn’ is a multipurpose term meaning ‘thing’, ‘place’, ‘situation’, etc. Since we’re talking about a map of Philly, it seems fitting to slip in a bit of non-technical jargon!

You can choose to make a static map using ggplot2, if you’re a dedicated user of ggplot, and that’s often what I do. This map is somewhat large, however, and if we were using real data instead of fabricated data, there could be hundreds or even thousands of points that we want to map on top of our census tracts. So instead of a static map, I chose here to use a dynamic map that users can zoom in or out on, hover over to get more details, and so on. leaflet is a very easy to use package for a use case like this.

I’m going to have two layers to my map:

  • a polygon layer, where I map my census tracts, with helpful labels (see addPolygons), and
  • a point layer, where I put markers for each point I plotted, with helpful labels (see addMarkers)

Before I map those layers, I set the view of my map to be centered around the middle of my points, and I fiddle with the zoom level to get the right initial zoom into the map (see setView).

In a real project, I would merge in other data into philly_metro@data, such as socioeconomic or demographic data about the tract, and include those data in the labels, and maybe even color-code my census tract fill color to reflect some numeric value in the data.

library(leaflet)

leaflet(philly_metro) %>%
  setView(lng = mean(as.numeric(point_locations$lon), na.rm=TRUE), 
          lat = mean(as.numeric(point_locations$lat), na.rm=TRUE), zoom = 9) %>%
  addPolygons(fillColor = "white",
              fillOpacity = 1,
              weight = 1,
              opacity = 0.5,
              color = "black",
              label = paste(philly_metro$NAMELSAD, 
                            ", GEOID ", 
                            philly_metro$GEOID, 
                            sep=""),
              labelOptions = labelOptions(
                style = list("font-weight" = "normal", 
                             padding = "3px 8px"),
                textsize = "13px",
                direction = "auto")) %>%
  addMarkers(lng = point_locations$lon, 
             lat=point_locations$lat, 
             label = paste("Client at",
                           point_locations$lat, 
                           ",",
                           point_locations$lon))

As you can see, this map would be great to show in a presentation or include on a website, or even take a quick screenshot of if you were in a hurry and didn’t have time to do a static map. With just a pair of tools (rgdal and leaflet), we’re able to create a simple map that can provide mouseover events, lots of customization of look and feel, and a clean, easy to navigate final product.

Want to learn more? I’ll be at ODSC Boston in a few short weeks to lead a workshop in working with geographic data in R. I hope to see you there!