Tutorial outputs

In this Tutorial you will create interactive maps centered on Kendall Square, Cambridge using 2014-2018 5-year ACS data. Unlike the mandated census collection which happens every 10 years, the ACS continuously collects data. The 5-year ACS dataset provides averages over a 5 year period, producing more accurate data.

  1. Create a map showing the percent of workers who commute by transit.
  2. Create a map showing the percent of “service” workers who commute by transit.
  3. Put those maps side by side.

Map generation overview

There are three part to creating the map:

  1. Fetch the needed Census shapefiles (from Tigris)
  2. Fetch and process the ACS data
  3. Link the ACS data to the shapefile
  4. Use Leaflet to generate interactive maps.

Make sure that you have the following packages installed: tigris, acs, leaflet, and mapview.

##Load Packages First load the packages using the library function.

library(tigris)
library(acs)
library(leaflet)
library(mapview)
library(ggplot2)
library(maptools)
library(stringr)

API Key

You need to use an API ‘key’ in order to have permission to access ACS remotely. Think of it like a door ‘key’. You can use Jeff’s API key which is available in Stellar and on Dropbox. Or you can get your own https://api.census.gov/data/key_signup.htm. If you publish your results on RPub, be sure to set {r, echo=FALSE} so people can’t see your key (like I did below, you can’t see my command installing the key.)

api.key.install(key="INSERT YOUR API KEY HERE")

Fetch a spatial shapefile containing the needed geography

Census data is provided at different geographic scales. Here is a good explanation of them. https://guides.lib.umich.edu/c.php?g=283004&p=1885629#:~:text=U.S.%20Census%20and%20Demographic%20Information,-A%20general%20guide&text=The%20Census%20uses%20its%20own,%2C%22%20which%20divides%20up%20counties.&text=For%20a%20full%20explanation%20of,its%20description%20on%20American%20Factfinder.

For our analysis, We want to limit the map to only include the geography needed in order to make it more manageable. We will limit by the relevant counties in the Boston area.

First, let’s download the shapefile for all the counties for Massachusetts. The counties command comes from the tigris package. The map below shows the shapes of all the counties in Massachusetts.

shapefile <- counties(state='MA', class='sp')
## Warning in proj4string(obj): CRS object has comment, which is lost in output
plot(shapefile)

I want to label the counties with the county numbers. If I use the head command to inspect the shapefile variable, I see there are column names, all in caps. I want to the county number, which is COUNTYFP. You could also use the NAME column if you wanted the map to show the names instead of the number (give that a try to see).

Now redo the plot but add labels using the pointLabel command from the maptools package.

head(shapefile)
## class       : SpatialPolygonsDataFrame 
## features    : 6 
## extent      : -73.06851, -69.85886, 41.18705, 42.72161  (xmin, xmax, ymin, ymax)
## Warning in proj4string(x): CRS object has comment, which is lost in output
## crs         : +proj=longlat +datum=NAD83 +no_defs 
## variables   : 17
## names       : STATEFP, COUNTYFP, COUNTYNS, GEOID,       NAME,          NAMELSAD, LSAD, CLASSFP, MTFCC, CSAFP, CBSAFP, METDIVFP, FUNCSTAT,      ALAND,     AWATER, ... 
## min values  :      25,      001, 00606927, 25001, Barnstable, Barnstable County,   06,      H1, G4020,   148,  12700,    14454,        A, 1021048430, 1004283994, ... 
## max values  :      25,      027, 00606940, 25027,  Worcester,  Worcester County,   06,      H4, G4020,   521,  49340,    14454,        N, 3912564078,  666974178, ...
plot(shapefile)
pointLabel(coordinates(shapefile),labels=shapefile$COUNTYFP)

Now we limit to only the counties around Boston. The counties we will include are Suffolk (25), Norfolk (21), and Middlesex (17), and Essex (09). A list of the counties by state can be found here. https://en.wikipedia.org/wiki/List_of_United_States_FIPS_codes_by_county

We want the shapefile at the census tract level, so we use the tigris command tracts.

countylist <- c('17','21','25', '09')
shapefile <- tracts(state='MA', county=countylist, year=2018, class='sp')
## Warning in proj4string(obj): CRS object has comment, which is lost in output
plot(shapefile)

Fetching ACS Data

Now that we have the shapefile for the census tracts, we need some data to associate with these tracts.

Much like the advanced search in Fact Finder (http://factfinder.census.gov), we need to determine a handful of things before we can fetch the data.

Choose the Census table of interest

https://censusreporter.org/

We want to map the percent of workers who commute by public transit. At censusreporter.org, scroll down, and under Topics click on Commute. Scroll down to the Means of Transportation to Work section and then look for B08301: Means of Transportation to Work and click on it. Enter cambridge,ma into the place box to show results for the city of Cambridge as a whole. Here (for 1-year data from 2018), 25.6% of Cambridge residents commute by public transit. We will download more granular data, at the census tract level.

Toward the right, click on “Switch to Totals.” Unfortunately, the imported data only comes with the totals, we will need to calculate the percentages manually.

We fetch the table of interest and put the results into the variable called mytable.

Choose the span

5 year estimates have the lowest margin of error, so we will specify that by putting the number 5 into the variable myspan for use later.

Choose the end year

We will use the most recent data available from the API, which is 2018, so we put 2018 into the variable myendyear for use later.

Limit the geography for the data download

Now we set the geography that we want to minimize the amount of data we want to download. For this, we will use the geo.make function. We want to pull all the census tracts in the counties we are interested in. The state code for Massachusetts is 25. Earlier, we defined a variable countylist containing the three counties of interest. Small glitch: the geo.make function wants that list as numbers and not text versions of the numbers, so we need to convert it. We want data for all the census tracts in these countires, so we use the wildcard "*" to denote ‘select all’.

# Span
myspan <- 5

# End year
myendyear <- 2018

# Which table
mytable <- acs.lookup(endyear=2018, table.number="B08301")
## Warning in acs.lookup(endyear = 2018, table.number = "B08301"): acs.lookup for endyear>2015: using 2015 variable codes to access 2018 data.
##   (See ?acs.lookup for details)
## Warning in acs.lookup(endyear = 2018, table.number = "B08301"): temporarily downloading and using archived XML variable lookup files;
##   since this is *much* slower, recommend running
##   acs.tables.install()
# Limit the geography
# Convert the list of counties that is in text form to numeric form
countylist2 <- as.numeric(countylist)
# Define the geography to limit the data import
mygeo <- geo.make(state=25, county=countylist2, tract="*")

** Choose the data columns of interest from the table **

Census tables usually have many “columns” of data. We can make a list of all the columns.

results(mytable)$variable.name
##  [1] "Total:"                                                                                            
##  [2] "Car, truck, or van:"                                                                               
##  [3] "Car, truck, or van: Drove alone"                                                                   
##  [4] "Car, truck, or van: Carpooled:"                                                                    
##  [5] "Car, truck, or van: Carpooled: In 2-person carpool"                                                
##  [6] "Car, truck, or van: Carpooled: In 3-person carpool"                                                
##  [7] "Car, truck, or van: Carpooled: In 4-person carpool"                                                
##  [8] "Car, truck, or van: Carpooled: In 5- or 6-person carpool"                                          
##  [9] "Car, truck, or van: Carpooled: In 7-or-more-person carpool"                                        
## [10] "Public transportation (excluding taxicab):"                                                        
## [11] "Public transportation (excluding taxicab): Bus or trolley bus"                                     
## [12] "Public transportation (excluding taxicab): Streetcar or trolley car (carro publico in Puerto Rico)"
## [13] "Public transportation (excluding taxicab): Subway or elevated"                                     
## [14] "Public transportation (excluding taxicab): Railroad"                                               
## [15] "Public transportation (excluding taxicab): Ferryboat"                                              
## [16] "Taxicab"                                                                                           
## [17] "Motorcycle"                                                                                        
## [18] "Bicycle"                                                                                           
## [19] "Walked"                                                                                            
## [20] "Other means"                                                                                       
## [21] "Worked at home"

We want column #1 which gives the absolute total and also column #10 which gives the total commute by Public transportation. Later we will calculate the percent. We indicate that we want those two columns by putting them into the variable myvars.

myvars <- mytable[1]+mytable[10] 

Putting it all together and fetch the data

Now we have all the parts to put together the fetch function and put the results into the variable mydata. The line below will always be the same. What changes is the definition of the variables above.

mydata <- acs.fetch(endyear=myendyear, span=myspan, geography=mygeo, variable=myvars)

Cleaning the Data

There are a few cleaning items It’s a lot easier in R than in excel.

First, we need to generate a standardized GEOID for the ACS data so it can be matched to the shapefile which already has a standardized GEOID. To do that, we need to paste together the state code, county code, and tract code. The paste0 function pastes things together (note: paste puts spaces between the things it pastes, while paste0 does not.) Everything you want pasted together goes inside the paste0 function separated by commas (e.g., paste0("27", "A", "0", "99") will return "27A099".)

The function as.character ensures that the state, county, and tract numbers are first turned into characters before concatenating them together

Small problem. ACS does not pad the county numbers with preceeding zeros, so we need to add those extra zero so that the GEOID code we generate here matches the GEOID codes in the shapefile. We use the str_pad function from the stringer package.

mystate <- as.character(mydata@geography$state)

mycounty <- as.character(mydata@geography$county)

# add leading zeros to make the total number of characters 3
mycountypadded <- str_pad(mycounty, 3, pad="0")

mytract <- as.character(mydata@geography$tract)

acsgeoid <- paste0(mystate,mycountypadded,mytract)

Now you are ready to put the fetched data into a usable format for R. The best format is a dataframe object which is in essence a spreadsheet with columns (variables) and rows (data for each census tract). We use the head command to take a look at the first 6 rows of data.

# Put the dataset into a dataframe format
mydatadf <- data.frame(acsgeoid, mydata@estimate)

# Take a look at the first 6 rows of data
head(mydatadf)
##                                                          acsgeoid B08301_001
## Census Tract 3525, Middlesex County, Massachusetts    25017352500       2249
## Census Tract 3527, Middlesex County, Massachusetts    25017352700       1204
## Census Tract 3534, Middlesex County, Massachusetts    25017353400       2000
## Census Tract 3566.01, Middlesex County, Massachusetts 25017356601       2414
## Census Tract 3571, Middlesex County, Massachusetts    25017357100       2057
## Census Tract 3577, Middlesex County, Massachusetts    25017357700       1731
##                                                       B08301_010
## Census Tract 3525, Middlesex County, Massachusetts           835
## Census Tract 3527, Middlesex County, Massachusetts           222
## Census Tract 3534, Middlesex County, Massachusetts           687
## Census Tract 3566.01, Middlesex County, Massachusetts        447
## Census Tract 3571, Middlesex County, Massachusetts           402
## Census Tract 3577, Middlesex County, Massachusetts           378

Note that the columns are ones with bold headers, that first item that has a long text description of the census tract is actually not a column, but the row number– it would be like changing the actual row number in excel. A little confusing I realize.

So the first column, acsgeoid is the the identifier for the census tract, the second column is the total commuters from that census tract, and the third is the number of transit commuters.

Let’s rename the columns to make them more user-friendly. We choose GEOID specifically because we are going to want to use this column to create a link to the shapefile.

# Rename column names
colnames(mydatadf)=c("GEOID", "total", "transit")

# Take a look at the first 6 rows of data
head(mydatadf)
##                                                             GEOID total transit
## Census Tract 3525, Middlesex County, Massachusetts    25017352500  2249     835
## Census Tract 3527, Middlesex County, Massachusetts    25017352700  1204     222
## Census Tract 3534, Middlesex County, Massachusetts    25017353400  2000     687
## Census Tract 3566.01, Middlesex County, Massachusetts 25017356601  2414     447
## Census Tract 3571, Middlesex County, Massachusetts    25017357100  2057     402
## Census Tract 3577, Middlesex County, Massachusetts    25017357700  1731     378

Calculating the percent

The mydatadf dataframe now has a column for the total commuters and a column for the transit commuters. We need to calculate the percent, and make an additional column for that. We will call it “transitpercent”. We multiply the transit percent by 100 to turn it from a fraction into the percentage.

In R, using a $ after a dataframe variable indicates what column you are referring to. In this case, there isn’t a column that already exists named transitpercent, so R knows to create a new one. R knows to do this operation for each row in the dataframe. In Excel, it would be like using a function in a new column and then copying that function for all rows.

NOTE: if you are looking at a Census item like median income, you don’t need to do any percent calculation

# Create new column and calculate the percent commute trips by transit
mydatadf$transitpercent <- mydatadf$transit/mydatadf$total*100

NOTE1: This is only needed for this transit percent problem

NOTE2: For median income, and maybe some other tables, there are some issues with numbers in the negative. To see:

summary(mydatadf)

To fix, subset the data to include only values that are greater than zero, for example:

mydatadf <- subset(mydatadf, mydatadf$medianincome>0)

To confirm that the data looks ok, use the summary command again:

summary(mydatadf)

# There is some weirdness with some tracts showing up as 100%, so eliminate those by subsetting the results to exclude them.
mydatadf <- subset(mydatadf, mydatadf$transitpercent<100)

Join the data

You need to join the ACS data with the shapefile. The geo_join function (from the ACS package) merges the shapefile (variable: shapefile) with the fetched data (variable: mydatadf). In other words, we want the value of percent commute trips by transit for each census tract to be associated with the respective shape. The merged data is put into the variable mydatamerged.

We need to tell the function which columns are to be used to match on. In this case we named them the same thing to make it more intuitive.

# Merge the ACS data with the shapefile
mydatamerged <- geo_join(shapefile, mydatadf, "GEOID", "GEOID")

Creating the map

Popup This defines how a popup window will appear when you click on a census tract on the map. We will have the popup display the GEOID and the percent commute by transit. The <br> adds an extra carriage return to put the label on the second line of the popup. The round function will round the percent to the tenths place.

**NOTE: you don’t need to

mypopup <- paste0("GEOID: ", mydatamerged$GEOID, "<br>", "Commute by Public transit: ", round(mydatamerged$transitpercent,1), "%")

Palette This sets the variation/range of colors. This is similar to how in ArcGIS you would symbolize by an attribute. Here we specify the pallete YlGnBu (note that the second character is a lower-case L). To explore different palettes, you can go here: http://colorbrewer2.org. To find the one we are working with, in Multi-hue, bottom row, click on the fourth one from the left.

mypal <- colorNumeric(
  palette = "YlGnBu",
  domain = NULL
)

LAT/LNG Centerpoint, zoom, and tiles

you need to determine a centerpoint latitude and longitude. To do that, use googlemaps, single click at a specific spot on the map and the lat,long values will appear on the bottom. Cut and paste those values below. I have used a point in Kendall Square. A zoom value of 12 is a good place to start.

You can choose the map background from a variety of sources. You can explore the options here: http://leaflet-extras.github.io/leaflet-providers/preview/. Move the map so you are over Boston. The default one you see is “OpenStreetMap.Mapnik”. Then on the right, you can Scroll to explore options. I chose “CartoDB.Positron” because it does not overly highlight roadways, making it easier to overlay the census data.

# Store values into variables
myLAT <- 42.366740
myLNG <- -71.084760
myZOOM <- 12
myTILES <- "CartoDB.Positron"

Define the map using leaflet

Now we load and create the map and all its parts.

We added a legend, note that we added a % suffix. NOTE: In the legend, if you are using a different census table that isn’t in percent, you don’t need the line that include labFormat

To change the color of the outlines of the shapes, you need to use a hex code for the color. You can choose colors here: https://www.w3schools.com/colors/colors_picker.asp

You can investigate the options for leaflet here: https://cran.r-project.org/web/packages/leaflet/leaflet.pdf. Search for “addPolygons” in that PDF.

mymap <- leaflet() %>%
  addProviderTiles(myTILES) %>%
  setView(myLNG, myLAT, zoom = myZOOM) %>%
  addPolygons(data = mydatamerged, 
              fillColor = ~mypal(transitpercent), # this makes the fill color on the spectrum of the palette
              color = "white", # this is the color of the outline of the shapes
              fillOpacity = 0.7, # how see through the shapes are
              weight = 1, 
              smoothFactor = 0.2,
              popup = mypopup) %>%
  addLegend(pal = mypal, 
            values = mydatamerged$transitpercent, 
            position = "bottomright", 
            title = "Commute by Public Transit",
            labFormat = labelFormat(suffix = "%"),
            bins=5)

# Draw the map
mymap