Tutorial outputs

In this Tutorial you will create an interactive map centered on Kendall Square, Cambridge using 2011-2015 5-year ACS data.

  1. Create a map showing Median Income.

Map generation overview

This tutorial will go through how to easily download ACS data and map it, all within R Studio. You can of course use ArcMap to map ACS data, but doing it in R can save some tedious and clunky steps and let you publish interactive versions of your maps.

There are three part to creating the webmap:

  1. Fetch the needed Census shapefiles (from Tigris)
  2. Fetch and processing the ACS data, and linking to the shapefiles
  3. Use Leaflet to generate interactive maps.

You should make sure that you have the following packages installed: tigris, acs, leaflet, and mapview. You can check by looking in the lower right pane and clicking on Packages. If they are not there, you can click on Install to do that now (or refer to Tutorial 0). Note that the ACS package was made by our very own Ezra Glenn!

Here are the help files associated with these packages:

Load Packages

First load the packages using the library function. You have to do this every time you start a session in R. You can load them whenever, but it’s best to do this at the beginning of the document.

library(acs)
library(tigris)
library(leaflet)
library(mapview)
library(stringr)

API ‘Key’

You need to use an API ‘key’ in order to have permission to access ACS remotely. Think of it like a door ‘key’. You can use Jeff’s API key which is available in Stellar and on Dropbox. Or you can get your own https://api.census.gov/data/key_signup.htm. If you publish your results on RPub, be sure to set {r, echo=FALSE} so people can’t see your key (like I did below, you can’t see my command installing the key.)

api.key.install(key="INSERT YOUR API KEY HERE")

Fetching a spatial shapefile containing the needed geography

Just like you would download a shapefile for use in ArcGIS, we need to fetch the spatial polygons of the census tracts we want to include in our map. The geographic extent of the shapefile that is downloaded is based on state and county. It’s ok to have more coverage than the area of interest (in this case, Kendall Square).

If we want to fetch a shapefile containing census tracts for Cambridge, we need to know the state and county code. We can use the geo.lookup function (from the ACS package). (Hey, do you even know what county Cambridge is in?)

https://en.wikipedia.org/wiki/List_of_counties_in_Massachusetts

geo.lookup(state="MA", place="Cambridge")
##   state    state.name      county.name place     place.name
## 1    25 Massachusetts             <NA>    NA           <NA>
## 2    25 Massachusetts Middlesex County 11000 Cambridge city
## 3    25 Massachusetts Middlesex County 11000 Cambridge city

Now we know Cambridge is in Middlesex County, but we don’t know the FIPS code. We can use the geo.lookup function again but this time put in a county parameter. Or use the lookup_code function (from the Tigris package).

geo.lookup(state="MA", county="Middlesex")
##   state    state.name county      county.name
## 1    25 Massachusetts     NA             <NA>
## 2    25 Massachusetts     17 Middlesex County
lookup_code(state="MA",county="Middlesex")
## [1] "The code for Massachusetts is '25' and the code for Middlesex County is '017'."

In the console, we see that the state code for Mass is 25 and the county code for Middlesex County is 17. We’ll use these FIPS codes to first fetch the shapefile of the census tracts and later to fetch the ACS data.

Note that the function from the ACS package says the county is ‘17’ and the function from the Tigris package says the county is ‘017’. This will make a difference later when we’re join the ACS data with the Tigris shapefile.

Now let’s fetch our spatial shapefile data using the tracts function (from the Tigris package):

shapefile <- tracts(state='25', county='17')

We can visualize the shapefile we just fetched:

plot(shapefile)

What if you want to include more than one county in the shapefile? Let’s include Boston as well.

geo.lookup(state="MA", place="Boston")
##   state    state.name    county.name place  place.name
## 1    25 Massachusetts           <NA>    NA        <NA>
## 2    25 Massachusetts Suffolk County  7000 Boston city
## 3    25 Massachusetts Suffolk County  7000 Boston city
lookup_code(state="MA",county="Suffolk")
## [1] "The code for Massachusetts is '25' and the code for Suffolk County is '025'."

To tell the tracts function to fetch for more than one county, you need to make a list of the county codes. Lists in R are done using the c function (which stands for contatenate). To make a list, include each item you want in the list inside the c function.

countylist <- c('17','25')
shapefile <- tracts(state='25', county=countylist)

You can also skip the interim step and do the same thing like this: shapefile <- tracts(state='25', county=c('17','25'))

We can visualize the new shapefile we just fetched:

plot(shapefile)

Fetching ACS Data

Now that we have the shapefile for the census tracts, we need some data to associate with these tracts.

The lookup function allows you to see the different attributes and column names of different tables. You can even search keywords, the same way you would on the advanced search of factfinder. Note that the variable lookup function doesn’t work for 2016. So you should use 2015 for looking up variables and can still use 2016 to fetch data (the 2015 variables will be the same as 2016).

Much like the advanced search in Fact Finder (http://factfinder.census.gov), we need to deterine a handful of things before we can fetch the data.

More about the different “spans”: https://www.census.gov/programs-surveys/acs/guidance/estimates.html

More about which years and spans of data are available https://www.census.gov/data/developers/data-sets.html

Note: Do not compare overlapping spans https://www.census.gov/programs-surveys/acs/guidance/comparing-acs-data.html

More about the different geographies: https://www.census.gov/programs-surveys/acs/geography-acs/concepts-definitions.html

Choose the Census table of interest

The best way to determine which table(s) you are interested in is to use censusreporter: https://censusreporter.org.

One confusing thing is that there are different kinds of tables. You can learn more about the different table types here: https://censusreporter.org/topics/table-codes/. The ACS package only fetches B tables (you cannot fetch C tables, which are “collapsed” tables).

To get started with an easy visualization, let’s look at income. Go to https://censusreporter.org, scroll down, and click on income. Choose the first table, B19013 (Median Household Income). There is only one “column” in this data which is the median income for each census tract. Enter cambridge,ma into the place box to show some sample data.

Here we fetch the table of interest and put the results into the variable called mytable.

mytable <- acs.lookup(endyear=2015, table.number="B19013")

What exactly is in mytable? Technically we call it an “object”, and this object contains all the complexities necessary to represent that census table. You can investigate the structure of this object with the str function. But don’t worry if it all doesn’t make sense.

str(mytable)
## Formal class 'acs.lookup' [package "acs"] with 4 slots
##   ..@ endyear: num 2015
##   ..@ span   : num 5
##   ..@ args   :List of 7
##   .. ..$ endyear       : num 2015
##   .. ..$ span          : num 5
##   .. ..$ dataset       : chr "acs"
##   .. ..$ keyword       : symbol 
##   .. ..$ table.name    : symbol 
##   .. ..$ table.number  : chr "B19013"
##   .. ..$ case.sensitive: logi TRUE
##   ..@ results:'data.frame':  1 obs. of  4 variables:
##   .. ..$ variable.code: chr "B19013_001"
##   .. ..$ table.number : chr "B19013."
##   .. ..$ table.name   : chr "B19013. Median Household Income in the Past 12 Months (in 2015 Inflation-Adjusted Dollars)"
##   .. ..$ variable.name: chr "Median household income in the past 12 months (in 2015 Inflation-adjusted dollars)"

Choose the data columns of interest from the table

Census tables often have many “columns” of data. We can make a list of all the columns:

results(mytable)$variable.name
## [1] "Median household income in the past 12 months (in 2015 Inflation-adjusted dollars)"

B19013 only has one column of data. In this case, we are only planning on fetching this one column of data. It is column 1, so we reference that column using brackets. If there were two columns and you wanted them both, you would use myvars <- mytable[1]+mytable[2] to select them both.

myvars <- mytable[1] 

Span

5 year estimates have the lowest margin of error, so we will specify that by putting the number 5 into the variable myspan for use later.

myspan <- 5

End Year

We will use 2015 (the most recent data available from the API at the time of writing is 2016), so we put 2015 into the variable myendyear for use later.

myendyear <- 2015

Defining Geography

Now we set the geography that we want. For this, we will use the geo.make function. We want to pull all the census tracts in the county (unfortunately we can’t pull just for a city but we can filter after the fact).

We’ve already found the state cose for Massachusetts (25). Ane earlier on we definaed a variable countylist containing the two counties of interest (17 and 25). Small glitch: the geo.make function wants that list as numbers and not text versions of the numbers.

We use the “*" for tracts to denote ‘select all’. We will store this in the variable mygeo so we can use it later in our function that will fetch the data.

countylist
## [1] "17" "25"
countylist2 <- as.numeric(countylist)
countylist2
## [1] 17 25
mygeo <- geo.make(state=25, county=countylist2, tract="*")

Putting it all together and fetch the data

Now we have all the parts to put together the fetch function. The line below will always be the same. What changes is the definition of the variables above.

the variable mydata is a complex object containing lots of information about the census table and the data for each census tract.

mydata <- acs.fetch(endyear=myendyear, span=myspan, geography=mygeo, variable=myvars)
## Warning in acs.fetch(endyear = endyear, span = span, geography =
## geography[[1]], : NAs introduced by coercion

## Warning in acs.fetch(endyear = endyear, span = span, geography =
## geography[[1]], : NAs introduced by coercion

Cleaning the Data

As with data when we download it directly from ACS, it comes a bit messy and needs to be cleaned up. It’s a lot easier in R than in excel.

First, we need to generate a standardized GEOID for the ACS data so it can be matched to the shapefile which already has a standardized GEOID. To do that, we need to paste together the state code, county code, and tract code. The paste0 function pastes things together (note: paste puts spaces between the things it pastes, while paste0 does not.) Everything you want pasted together goes inside the paste0 function separated by commas (e.g., paste0("27", "A", "0", "99") will return "27A099".) The str_pad function adds extra zeros before the number to make sure they are the correct length.

The function as.character ensures that the state, county, and tract numbers are first turned into characters before concatenating them together

Remember when we mentioned that ACS does not have an extra zero before the county number? We need to put in that extra zero so that the GEOID code we generate here matches the GEOID codes in the shapefile.

acsgeoid <- paste0(as.character(mydata@geography$state),
                   str_pad(as.character(mydata@geography$county), 3, pad='0'),
                   str_pad(as.character(mydata@geography$tract), 6, pad='0'))

Now you are ready to put the fetched data into a usable format for R. The best format is a dataframe object which is in essence a spreadsheet with columns (variables) and rows (data for each census tract). You only need two columns: the GEOID and the data itself (in this case the median income variable which is found in the estimate component of the fetched data, identified by mydata$estimate). Note if you chose more than one variable in the myvars step above, then each of those variables would be a new column in the dataframe.

Once we put the columns into a dataframe, you can rename the column names to make them more user-friendly. In this case, there is only one data column median income.

You can use the head function to show the first 6 rows in a dataframe.

mydatadf <- data.frame(acsgeoid, mydata@estimate)
colnames(mydatadf)=c("GEOID", "medianincome")
head(mydatadf)
##                                                             GEOID
## Census Tract 3001, Middlesex County, Massachusetts    25017300100
## Census Tract 3011.01, Middlesex County, Massachusetts 25017301101
## Census Tract 3011.02, Middlesex County, Massachusetts 25017301102
## Census Tract 3101, Middlesex County, Massachusetts    25017310100
## Census Tract 3102, Middlesex County, Massachusetts    25017310200
## Census Tract 3103, Middlesex County, Massachusetts    25017310300
##                                                       medianincome
## Census Tract 3001, Middlesex County, Massachusetts           82019
## Census Tract 3011.01, Middlesex County, Massachusetts        73021
## Census Tract 3011.02, Middlesex County, Massachusetts        93935
## Census Tract 3101, Middlesex County, Massachusetts           31739
## Census Tract 3102, Middlesex County, Massachusetts           56653
## Census Tract 3103, Middlesex County, Massachusetts           42172

Map the data using Leaflet

First, you need to join the ACS data with the shapefile. The geo_join function (from the ACS package) takes the shapefile, the fetched data (mydatadf) and the column names of each to match on (in this case we named them the same thing to make it more intuitive).

Join the data

mydatamerged <- geo_join(shapefile, mydatadf, "GEOID", "GEOID")

Prepare for mapping

We need to:

  • Assign our dataframe to a new variable for simplicity
  • Define a popup
  • Set a color pallete
  • Determine the LAT and LNG for the centerpoint for your map

Dataframe shortcut To make a shortcut here, we’re defining a new variable df and setting it to whatever you called your data, so that we don’t have to change every column reference.

df <- mydatamerged

Popup This defines how a popup window will appear when you click on a census tract on the map. Here, we will have the popup display the GEOID and the Median Income. We round the median income to the nearest ones place using the round function. The <br> adds an extra carriage return to put the label on the second line of the popup.

mypopup <- paste0("GEOID: ", df$GEOID, "<br>", "Median Income: $", round(df$medianincome,0))

Pallete This sets the variation/range of colors. This is similar to how in ArcGIS you would symbolize by an attribute. Here we specify the pallete YlGnBu (note that the second character is a lower-case L). To explore different palettes, you can go here: http://colorbrewer2.org. To find the one we are working with, in Multi-hue, bottom row, click on the fourth one from the left).

mypal <- colorNumeric(
  palette = "YlGnBu",
  domain = df$medianincome
)

** LAT/LNG Centerpoint** You can make your map have a center point for the area of interest. For Kendall Square, you can determine the LAT and LNG: go to google maps in a web browser, type “Kendall Square, Cambridge”, right click on the red pin, select “What’s Here?” and the LAT and LNG will display. Then define those points below.

myLAT <- 42.363951
myLNG <- -71.090011
mycentername <- "Kendall Square"

Define the map using leaflet

Now we load and create the map and all its parts. FOr the legend, note that we added a $ prefix (if you wanted something at the end, like a percent sign, you would use suffix = "%"). A zoom of 12 is a good starting point, but you can tweak that value.

mymap<-leaflet() %>%
  addProviderTiles("CartoDB.Positron") %>%
  setView(myLNG, myLAT, zoom = 12) %>%
  addMarkers(lat=myLAT, lng=myLNG, popup=mycentername) %>%
  addPolygons(data = df, 
              fillColor = ~mypal(medianincome), 
              color = "#b2aeae", # you need to use hex colors
              fillOpacity = 0.7, 
              weight = 1, 
              smoothFactor = 0.2,
              popup = mypopup) %>%
  addLegend(pal = mypal, 
            values = df$medianincome, 
            position = "bottomright", 
            title = "Median Income",
            labFormat = labelFormat(prefix = "$"))

Finally, we draw it! (Whew).

mymap