TUTORIAL comes from:
I made 7 attempts on this tutorial. My experience was fraught with technical difficulties getting various packages to work (e.g. tmaptools, leaflet.extras, rio, shapefile, scales). They would not install completely or did not work because they could not find relevant files. Finally, adding code to change the root directory for knitr (and not using setwd inside a chunk or in the console) seemed to solve the issues. The chunks ran without a hiccup afterward, and I managed to get up to Step 7.
The knitr code came from a Google search: https://philmikejones.me/tutorials/2015-05-20-set-root-directory-knitr/
Knitr, by default, looks in the same directory as your .Rmd file to find any files you need to draw in, like data sets or script files. If you keep your data files and scripts in separate folders (and you should) you need to configure knitr.
You cannot use setwd() with knitr, so the canonical way to do this is to include an initial code chunk:
{r setup, include=FALSE, echo=FALSE} require(“knitr”) opts_knit$set(root.dir = “~/path/to/folder/”)
This creates an R chunk called setup which isn’t included in the knitted file. It loads the knitr package and sets root.dir to your project folder. Knitr will now look for all files from this root folder rather than the folder it is stored in.
These packages were installed earlier: - tmap - tmaptools - gave me trouble - sf - leaflet - leaflet.extras - gave me trouble - rio - gave me trouble - scales - gave me trouble
# Install libraries of packages.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tmap)
library(tmaptools)
library(sf)
## Linking to GEOS 3.7.2, GDAL 2.4.2, PROJ 5.2.0
library(leaflet)
library(leaflet.extras)
library(rio)
nhdatafile <-"NHD2016.xlsx"
There are several packages for importing Excel files into R; but for ease of use, you can’t beat rio. Install it with:
install.packages(“rio”)
and the run this chunk to store data from the election results spreadsheet into a variable called nhdata.
nhdata <- rio::import(nhdatafile)
There were actually 28 candidates in the results; but to focus on mapping instead of data wrangling, let’s not worry about the many minor candidates and pretend there were just two: Hillary Clinton and Bernie Sanders. Select just the County, Clinton and Sanders columns with:
nhdata <- nhdata[,c("County", "Clinton", "Sanders")]
Now we need to think about what exactly we’d like to color-code on the map. We need to pick one column of data for the map’s county colors, but all we have so far is raw vote totals. We probably want to calculate either the winner’s overall percent of the vote, the winner’s percentage-point margin of victory or, less common, the winner’s margin expressed by number of votes (after all, winning by 5 points in a heavily populated county might be more useful than winning by 10 points in a place with way fewer people if the goal is to win the entire state).
It turns out that Sanders won every county; but if he didn’t, we could still map the Sanders “margin of victory” and use negative values for counties he lost.
Let’s add columns for candidates’ margins of victory (or loss) and percent of the vote, again for now pretending there were votes cast only for the two main candidates. (Here and later in the story, click on the gray box and scroll to the right to see all the code.)
# Add columns for percents and margins
nhdata$SandersMarginVotes <- nhdata$Sanders - nhdata$Clinton
nhdata$SandersPct <- (nhdata$Sanders - nhdata$Clinton) / (nhdata$Sanders + nhdata$Clinton) # Will use formatting later to multiply by a hundred
nhdata$ClintonPct <- (nhdata$Clinton - nhdata$Sanders) / (nhdata$Sanders + nhdata$Clinton)
nhdata$SandersMarginPctgPoints <- nhdata$SandersPct - nhdata$ClintonPct
Whether you’re mapping results for your city, your state or the nation, you need geographic data for the area you’ll be mapping in addition to election results. There are several common formats for such geospatial data; but for this tutorial, we’ll focus on just one: shapefiles, a widely used format developed by Esri.
If you want to map results down to your city or town’s precinct level, you’ll probably need to get files from a local or state GIS office. For mapping by larger areas like cities, counties or states, the U.S. Census Bureau is a good place to find shapefiles.
For this New Hampshire mapping project by county, I downloaded files from the Census Bureau’s Cartographic Boundary shapefiles page — these are smaller, simplified files designed for mapping projects where extraordinarily precise boundaries aren’t needed. (Files for engineering projects or redistricting tend to be considerably larger).
From the Cartographic Boundary Shapefiles link, click Counties, then cb_2014_us_county_5m.zip (same as clicking on the link below).
I chose the national county file at http://www2.census.gov/geo/tiger/GENZ2014/shp/cb_2014_us_county_5m.zip and unzipped it within my data subdirectory. With R, it’s easy to create a subset for just one state, or more; and now I’ve got a file I can reuse for other state maps by county as well.
There are a lot of files in that newly unzipped subdirectory; the one you want has the .shp extension. I’ll store the name of this file in a variable called usshapefile:
usshapefile <- "cb_2014_us_county_5m/cb_2014_us_county_5m.shp"
Several R packages have functions for importing shapefiles into R. I’ll use tmaptools’s read_shape(), which I find quite intuitive:
usgeo <- read_shape(file=usshapefile, as.sf = TRUE)
## Warning: This function is deprecated and has been migrated to github.com/
## mtennekes/oldtmaptools
## Warning in readOGR(dir, base, verbose = FALSE, ...): Z-dimension discarded
If read_shape does not work, for some reason, putting library(tmaptools) before that line seems to make it work (I have no idea why)
as.sf = TRUE means I want usgeo to be a simple features object. The simple features standards were recently implemented in R with the sf package, and that’s made GIS work in R a lot, well, simpler. Now, geospatial objects look similar to “regular” R data frames, with a special complex column for geography. If as.sf is set to FALSE, usgeo will have an overall more complicated structure.
QTM function: Quick Thematic Map Plot
If you want to check to see if the usgeo object looks like geography of the U.S., run tmap’s quick thematic map command: qtm(usgeo). This may take a while to load and appear small and rather boring, but if you’ve got a map of the U.S. with divisions, you’re probably on the right track.
If qtm(usgeo)does not work, for some reason, putting library(tmap) before that line seems to make it work (I have no idea why)
qtm(usgeo)
If you run str(usgeo) to see the usgeo data structure. It will look like a conventional data frame except for the final geometry column with sfc_MULTIPOLYGON information.
str(usgeo)
## Classes 'sf' and 'data.frame': 3233 obs. of 10 variables:
## $ STATEFP : Factor w/ 56 levels "01","02","04",..: 1 11 16 37 39 37 28 26 29 10 ...
## $ COUNTYFP: Factor w/ 328 levels "001","003","005",..: 42 76 74 78 78 38 22 137 291 35 ...
## $ COUNTYNS: Factor w/ 3233 levels "00023901","00025441",..: 120 430 738 1911 2024 1880 1399 1373 1490 298 ...
## $ AFFGEOID: Factor w/ 3233 levels "0500000US01001",..: 30 442 844 2189 2302 2158 1669 1589 1764 344 ...
## $ GEOID : Factor w/ 3233 levels "01001","01003",..: 30 442 844 2189 2302 2158 1669 1589 1764 344 ...
## $ NAME : Factor w/ 1921 levels "A\xf1asco","Abbeville",..: 620 592 945 1291 1665 692 320 1683 284 747 ...
## $ LSAD : Factor w/ 11 levels "00","03","04",..: 5 5 5 5 5 5 5 5 1 5 ...
## $ ALAND : Factor w/ 3233 levels "1000508842","1001064387",..: 1199 5 2047 452 1721 2091 1880 1194 2397 1215 ...
## $ AWATER : Factor w/ 3233 levels "0","10017640",..: 1626 414 1940 1718 1118 2724 2916 2228 1613 497 ...
## $ geometry:sfc_MULTIPOLYGON of length 3233; first list element: List of 1
## ..$ :List of 1
## .. ..$ : num [1:9, 1:2] -88.2 -88.2 -88.2 -88.1 -87.5 ...
## ..- attr(*, "class")= chr "XY" "MULTIPOLYGON" "sfg"
## - attr(*, "sf_column")= chr "geometry"
## - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA NA
## ..- attr(*, "names")= chr "STATEFP" "COUNTYFP" "COUNTYNS" "AFFGEOID" ...
Extracting geodata just for New Hampshire is similar to subsetting any other type of data in R, we just need the state FIPS code for New Hampshire, which turns out to be 33 — or in this case “33,” since the codes are stored as factors, not integers in usgeo.
Use dplyr::filter() on the sf object as you would on a conventional data frame:
nhgeo <- filter(usgeo, STATEFP=="33")
If you want to do a check to see if nhgeo looks correct, run the quick thematic map function again:
qtm(nhgeo)
Still somewhat boring, but it looks like the Granite State with county-sized divisions, so it appears we’ve got the correct file subset.
Like any database join or merge, this has two requirements: 1) a column shared by each data set, and 2) records that refer to the same entity in exactly the same way. (Having a county listed as “Hillsborough” in one file and FIPS code “011” in another wouldn’t give R any idea how to match them up without some sort of translation table.)
Trust me: You will save yourself a lot of time if you run a few R commands to see whether the nhgeo\(NAME vector of county names is the same as the nhdata\)County vector of county names.
Do they have the same structure?
str(nhgeo$NAME)
## Factor w/ 1921 levels "A\xf1asco","Abbeville",..: 684 791 416 138 1470 334 1653 1131 282 1657
str(nhdata$County)
## chr [1:10] "Belknap" "Carroll" "Cheshire" "Coos" "Grafton" ...
Problem number one: The geospatial file lists counties as R factors, while they’re plain character text in the data. Change the factors to character strings with:
nhgeo$NAME <- as.character(nhgeo$NAME)
Next, it can be helpful to sort both data sets by county name and then compare.
nhgeo <- nhgeo[order(nhgeo$NAME),]
nhdata <- nhdata[order(nhdata$County),]
Are the two county columns identical now? They should be; let’s check:
identical(nhgeo$NAME,nhdata$County )
## [1] TRUE
Now we can join the two files. The sp package’s merge function is pretty common for this type of task, but I like tmaptool’s append_data() because of its intuitive syntax and allowing names of the two join columns to be different.
# library(tmaptools)
nhmap <- append_data(nhgeo, nhdata, key.shp = "NAME", key.data="County")
## Warning: This function is deprecated and has been migrated to github.com/
## mtennekes/oldtmaptools
## Keys match perfectly.
You can see the new data structure with:
str(nhmap)
## Classes 'sf' and 'data.frame': 10 obs. of 16 variables:
## $ STATEFP : Factor w/ 56 levels "01","02","04",..: 30 30 30 30 30 30 30 30 30 30
## $ COUNTYFP : Factor w/ 328 levels "001","003","005",..: 1 2 3 5 6 8 10 12 14 15
## $ COUNTYNS : Factor w/ 3233 levels "00023901","00025441",..: 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505
## $ AFFGEOID : Factor w/ 3233 levels "0500000US01001",..: 1765 1766 1767 1768 1769 1770 1771 1772 1773 1774
## $ GEOID : Factor w/ 3233 levels "01001","01003",..: 1765 1766 1767 1768 1769 1770 1771 1772 1773 1774
## $ NAME : chr "Belknap" "Carroll" "Cheshire" "Coos" ...
## $ LSAD : Factor w/ 11 levels "00","03","04",..: 5 5 5 5 5 5 5 5 5 5
## $ ALAND : Factor w/ 3233 levels "1000508842","1001064387",..: 67 1991 1424 2526 2494 1831 1997 1382 3172 730
## $ AWATER : Factor w/ 3233 levels "0","10017640",..: 778 638 2460 3097 55 1942 2454 1327 1852 1817
## $ Clinton : num 3495 3230 5132 2013 6918 ...
## $ Sanders : num 6005 5638 12441 3639 14245 ...
## $ SandersMarginVotes : num 2510 2408 7309 1626 7327 ...
## $ SandersPct : num 0.264 0.272 0.416 0.288 0.346 ...
## $ ClintonPct : num -0.264 -0.272 -0.416 -0.288 -0.346 ...
## $ SandersMarginPctgPoints: num 0.528 0.543 0.832 0.575 0.692 ...
## $ geometry :sfc_POLYGON of length 10; first list element: List of 1
## ..$ : num [1:33, 1:2] -71.7 -71.7 -71.7 -71.7 -71.7 ...
## ..- attr(*, "class")= chr "XY" "POLYGON" "sfg"
## - attr(*, "sf_column")= chr "geometry"
## - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA NA NA ...
## ..- attr(*, "names")= chr "STATEFP" "COUNTYFP" "COUNTYNS" "AFFGEOID" ...
The hard part is done: finding data, getting it into the right format and merging it with geospatial data. Now, creating a simple static map of Sanders’ margins by county in number of votes is as easy as:
qtm(nhmap, "SandersMarginVotes")
## Some legend labels were too wide. These labels have been resized to 0.63, 0.63, 0.63, 0.58, 0.54. Increase legend.width (argument of tm_layout) to make the legend wider and therefore the labels larger.
and mapping margins by percentage:
qtm(nhmap, "SandersMarginPctgPoints")
We can see that there’s some difference between which areas gave Sanders the highest percent win versus which ones were most valuable for largest number-of-votes advantage.
For more control over the map’s colors, borders and such, use the tm_shape() function, which uses a ggplot2-like syntax to set fill, border and other attributes:
tm_shape(nhmap) +
tm_fill("SandersMarginVotes", title="Sanders Margin, Total Votes", palette = "PRGn") +
tm_borders(alpha=.5) +
tm_text("NAME", size=0.8)
## Some legend labels were too wide. These labels have been resized to 0.63, 0.63, 0.63, 0.58, 0.54. Increase legend.width (argument of tm_layout) to make the legend wider and therefore the labels larger.
The first line above sets the geodata file to be mapped, while tm_fill() sets the data column to use for mapping color values. The “PRGn” palette argument is a ColorBrewer palette of purples and greens — if you’re not familiar with ColorBrewer, you can see the various palettes available at colorbrewer2.org.
Don’t like the ColorBrewer choices? You can use built-in R palettes or set your own color HEX values manually instead of using a named ColorBrewer option.
There are also a few built-in tmap themes, such as tm_style_classic:
tm_shape(nhmap) +
tm_fill("SandersMarginVotes", title="Sanders Margin, Total Votes", palette = "PRGn") +
tm_borders(alpha=.5) +
tm_text("NAME", size=0.8) +
tm_style_classic()
## Warning in tm_style_classic(): tm_style_classic is deprecated as of tmap
## version 2.0. Please use tm_style("classic", ...) instead
## Some legend labels were too wide. These labels have been resized to 0.63, 0.63, 0.63, 0.58, 0.54. Increase legend.width (argument of tm_layout) to make the legend wider and therefore the labels larger.
You can save static maps created by tmap by using the save_tmap() function:
nhstaticmap <- tm_shape(nhmap) +
tm_fill("SandersMarginVotes", title="Sanders Margin, Total Votes", palette = "PRGn") +
tm_borders(alpha=.5) +
tm_text("NAME", size=0.8)
save_tmap(nhstaticmap, filename="nhdemprimary.jpg")
## Warning in save_tmap(nhstaticmap, filename = "nhdemprimary.jpg"): save_tmap
## is deprecated as of tmap version 2.0. Please use tmap_save instead
## Map saved to /Users/ginger/Documents/0 - Montgomery College/0 - DATA 110/Datasets/GIS/nhdemprimary.jpg
## Resolution: 1501.336 by 2937.385 pixels
## Size: 5.004452 by 9.791282 inches (300 dpi)
The filename extension can be .jpg, .svg, .pdf, .png and several others; tmap will then produce the appropriate file, defaulting to the size of your current plotting window. There are also arguments for width, height, dpi and more; run ?(“save_tmap”) for more info.
If you’d like to learn more about available tmap options, package creator Martijn Tennekes posted a PDF presentation on creating maps with tmap as well as tmap in a nutshell.
The next map we’ll create will let users click to see underlying data as well as switch between maps, thanks to RStudio’s Leaflet package that gives an R front-end to the open-source JavaScript Leaflet mapping library. Leaflet is one of the most popular open-source JavaScript libraries for interactive maps. It’s used by websites ranging from The New York Times and The Washington Post to GitHub and Flickr, as well as GIS specialists like OpenStreetMap, Mapbox, and CartoDB. For a Leaflet map, there are two extra things we’ll want to create in addition to the data we already have: a color palette and pop-up window contents. For palette, we specify the data range we’re mapping and what kind of color palette we want — both the particular colors and the type of color scale. There are four built-in types: • colorNumeric is for a continuous range of colors from low to high, so you might go from a very pale blue all the way to a deep dark blue, with many gradations in between. • colorBin maps a set of numerical data to a set of discrete bins, either defined by exact breaks or specific number of bins — things like “low,” “medium” and “high”. • colorQuantile maps numerical data into groups where each group (quantile) has the same number of records — often used for income levels, such as bottom 20%, next-lowest 20% and so on. • colorFactor is for non-numerical categories where no numerical value makes sense, such as countries in Europe that are part of the Eurozone and those that aren’t. Just to change things up a bit, I’ll map where Hillary Clinton was strongest, the inverse of the Sanders maps. To map Clinton’s vote percentage, we could use this palette:
clintonPalette <- colorNumeric(palette = "Blues", domain=nhmap$ClintonPct)
where “Blues” is a range of blues from ColorBrewer and domain is the data range of the color scale. This can be the same as the data we’re actually plotting but doesn’t have to be. colorNumeric means we want a continuous range of colors using numeric variables, not specific categories like candidate names. We’ll also want to add a pop-up window — what good is an interactive map without being able to click or tap and see underlying data? Aside: For the pop-up window text display, we’ll want to turn the decimal numbers for votes such as 0.7865 into percentages like 78.7%. We could do it by writing a short formula, but the scales package has a percent() function to make this easier. Install (if you need to) and load the scales package:
# install.packages("scales")
library(scales)
Content for a pop-up window is just a simple combination of HTML and R variables, such as:
nhpopup <- paste0("County: ", nhmap$County,
"Sanders ", percent(nhmap$SandersPct), " - Clinton ", percent(nhmap$ClintonPct))
If you’re not familiar with paste0, it’s a concatenate function that can join text and variable values into a single character string. I’d rather the county name column be County instead of NAME, so I’ll take care of that with dplyr’s rename() function: nhmap <- rename(nhmap, County = NAME)
Now, the map code:
leaflet(nhmap) %>%
addProviderTiles("CartoDB.Positron") %>%
addPolygons(stroke=FALSE,
smoothFactor = 0.2,
fillOpacity = .8,
popup=nhpopup,
color= ~clintonPalette(nhmap$ClintonPct))
## Warning: sf layer has inconsistent datum (+proj=longlat +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +no_defs).
## Need '+proj=longlat +datum=WGS84'
Let’s go over the code. leaflet(nhmap) creates a leaflet map object and sets nhmap as the data source. addProviderTiles(“CartoDB.Positron” ) sets the background map tiles to CartoDB’s attractive Positron design. There’s a list of free background tiles and what they look like on GitHub if you’d like to choose something else. The addPolygons() function does the rest — putting the county shapes on the map and coloring them accordingly. stroke=FALSE says no border around the counties, fillOpacity sets the opacity of the colors, popupsets the contents of the popup window and color sets the palette — I’m not sure why the tilde is needed before the palette name, but that’s the function format — and what data should be mapped to the color.
Here’s the result: Basic interactive map created in R and RStudio’s Leaflet package. Click on an area to see the underlying data.
You may get an error message about an “inconsistent datum” along with your map. Projections are a complicated issue when mapping, but basically, you want your data projection to match that of your the underlying map tiles. This ensures that everything’s consistent in terms of the scheme used to represent points on a 3D sphere in two dimensions. To fix this, you can add the projection recommended in the error message with
nhmap_projected <- sf::st_transform(nhmap, “+proj=longlat +datum=WGS84”)
and then run the map code above with
leaflet(nhmap_projected) %>%
as the first line.
The Leaflet package has a number of other features we haven’t used yet, including adding legends and the ability to turn layers on and off. Both will be very useful when mapping categories with at least three choices, such as the 2016 Republican primary.
Let’s look at the GOP results in South Carolina among the top three candidates. I won’t go over the data wrangling on this, except to say that I downloaded results from the South Carolina State Election Commission as well as Census Bureau data for education levels by county. If you download the project files, you’ll see the initial data as well as the R code I used to add candidate vote percentages and join all that data to the South Carolina shapefile. That creates a geospatial object scmap to map.
There’s so much data for a multi-candidate race that it’s a little more complicated to choose what to color beyond “who won.” I decided to go with one map layer to show the winner in each county, one layer each for the top three candidates (Trump, Rubio and Cruz) and a final layer showing percent of adult population with at least a bachelor’s degree. (Why education level? Some news reports out of South Carolina said that seemed to correlate with levels of Trump’s support; mapping that could help show if there’s a pattern.)
# South Carolina data
# scdata <- rio::import(scdatafile)
# Cannot find this file, it's not in GIS.zip file. So stopping here.