Intro to Spatial Data in R

All data used in this tutorial can be downloaded here. Before starting this tutorial, ensure you have all the requisite packages installed:

packages<-c("choroplethr", "choroplethrMaps", "RColorBrewer", "acs", "WDI",
            "gstat", "sp", "rgdal", "ggplot2", "dplyr", "maptools", "rworldmap",
            "spdep", "latticeExtra", "tigris")
install.packages(packages)
lapply(packages, require, character.only = TRUE)

Example 1: Repeat QGIS Analysis in R

Recap: In QGIS, we added three layers–tract-level data on median household income, subway lines, and grocery store locations. We classified the income data into equal interval categories, and the subway lines by line color. Our final map looked like this:

tracts<-readOGR("./tracts/tracts.shp", stringsAsFactors = FALSE)
subway<-readOGR("./subway/subwaylines.shp", stringsAsFactors = FALSE)
grocerystores<-readOGR("./grocery stores/grocerystores.shp", stringsAsFactors = FALSE)
neighborhoods<-readOGR("./neighborhoods/neighborhoods.shp", stringsAsFactors = FALSE)
#plot(tracts)
ggplot(tracts, aes(x = long, y = lat, group = group)) + geom_path()+ 
  theme_minimal()+ coord_fixed()

Generate a choropleth map of median household income

We generated a blank map, but we can easily “fill in” the features with the Median Household Income field (MED_INC_HH). This is done easily through the ggplot2 package. To map spatial objects using ggplot, you have to fortify the data first, and then plot the resulting data frame. An alternative to this is the spplot package, however it has less documentation and the layering features can be buggy.

tracts$MED_INC_HH<-as.numeric(tracts$MED_INC_HH)   # convert from string to numeric

tracts@data$id=rownames(tracts@data)
tracts_points<-fortify(tracts, region="id")
tracts_df<-join(tracts_points, tracts@data, by="id")

ggplot(tracts_df, aes(long, lat, group=group, fill=MED_INC_HH))+geom_polygon()+
  scale_fill_continuous(low="white", high="blue")+ coord_fixed() + 
  theme(aspect.ratio=0.6) + theme_minimal()+ coord_fixed()

Does this match up with what we saw in QGIS? What’s different? Does this map gives us the same conclusions as observed in QGIS?

This map has a CONTINUOUS legend (colors are assigned based on where they fall on a color spectrum, not in specific numeric categories). The map we made in QGIS has a DISCRETE legend (colors were assigned based on breaks specified in the Equal Interval classification). We can discretize our continuous Median Household Income values as follows:

tracts@data$INC_BREAKS<-cut_interval(tracts@data$MED_INC_HH, n = 5, dig.lab=10)

tracts@data$id=rownames(tracts@data)
tracts_points<-fortify(tracts, region="id")
tracts_df<-join(tracts_points, tracts@data, by="id")

ggplot(tracts_df, aes(long, lat, group=group, fill=INC_BREAKS))+geom_polygon()+
  scale_fill_brewer(palette = 'Blues') + coord_fixed() + 
  theme_minimal()+ coord_fixed()

Subway lines

Our next step is to plot subway lines, and assign the right color to each line. Here is a handy reference chart for built-in colors in R. Remember that factors for characters are assigned in alphabetical order by default, so our colors will reflect the colors for the Blue, Commuter, Green, Orange, and Red lines respectively.

subway_fortify   <- fortify(subway)
subway_df <- data.frame(id=rownames(subway@data), subway@data)
subway_fortify <- join(subway_fortify, subway_df, by="id")

ggplot(subway_fortify, aes(x=long, y=lat, group=group, color=LINE)) + 
  geom_line(lwd=1.5) + theme_minimal() + 
  scale_color_manual(values=c("darkblue", "darkmagenta", "forestgreen", 
                              "chocolate", "red"))+ coord_fixed()

Grocery stores

Since grocery stores are just points, we will add the tracts layer below to gives us some context.

grocerystore_df<-data.frame(grocerystores)

ggplot()+
  geom_polygon(data=tracts_df, aes(long, lat, group=group), fill=NA, color="grey")+
  geom_point(data=grocerystore_df, 
        aes(x=Longitude, y=Latitude), fill="red", color="red", pch=24)+
       theme_minimal()+ coord_fixed()

Bringing it all together

We can add each of these layers individually to a collective ggplot item, and generate a multilayer map with all three pieces of information.

ggplot()+
  
  # Income data
  geom_polygon(data=tracts_df, aes(long, lat, group=group, fill=INC_BREAKS))+
  scale_fill_brewer(palette = 'Blues', name="Median Household Income")+
  
  # Subway data
  geom_line(data=subway_fortify, aes(x=long, y=lat, group=group, color=LINE), lwd=1.5) + 
  scale_color_manual(values=c("darkblue", "darkmagenta", "forestgreen", 
                              "chocolate", "red"), 
                     name="Subway Line", 
                     labels=c("Blue", "Commuter Rail", "Green", "Orange", "Red")) +
  
  # Grocery store data
  geom_point(data=grocerystore_df, 
             aes(x=Longitude, y=Latitude, size=2),
             color="yellow", fill="yellow") + 
            scale_size("Grocery Stores", range=c(1, 1.5), 
                       labels=c("Grocery Stores"))+

  # General plotting properties
  coord_fixed() + theme_minimal()

Example 2: Mapping World Development Indicators using WDI

Next, we will explore the World Bank World Development Indicators dataset. You can use the website, but the World Bank has developed the WDI package that makeks finding indicators and mapping them all the easier!

We will start by mapping malnutrition prevalence (Indicator name “SH.STA.MALN.ZS”)

datamap = WDI(indicator = "SH.STA.MALN.ZS", start=2010, end=2010);
SPDF <- joinCountryData2Map(datamap, joinCode = "ISO2", nameJoinColumn = "iso2c");
mapParams <- mapCountryData(SPDF, nameColumnToPlot="SH.STA.MALN.ZS", mapTitle="",
                            mapRegion="world", catMethod="quantiles",
                            colourPalette='heat', borderCol = "black")

Good map, but it could look better. Let’s prettify it.

cp <- brewer.pal(10,'BrBG')
datamap = WDI(indicator = "SH.STA.MALN.ZS", start=2014, end=2014);
SPDF <- joinCountryData2Map(datamap, joinCode = "ISO2", nameJoinColumn = "iso2c");
mapParams <- mapCountryData(SPDF, nameColumnToPlot="SH.STA.MALN.ZS",
                            addLegend=TRUE, 
                            mapRegion="world", catMethod="quantiles", 
                            colourPalette=cp, mapTitle = "", borderCol = "black")

Let’s try to zoom into Africa now.

cp <- brewer.pal(5,'RdPu')
datamap = WDI(indicator = "SH.STA.MALN.ZS", start=2014, end=2014)
SPDF <- joinCountryData2Map(datamap, joinCode = "ISO2", nameJoinColumn = "iso2c")

par(mar=c(10, 0, 0, 0))
mapParams <- mapCountryData(SPDF, nameColumnToPlot="SH.STA.MALN.ZS", 
                            addLegend=TRUE, mapTitle="", mapRegion="Africa", 
                            catMethod="quantiles", numCats=10, colourPalette=cp, 
                            borderCol="black")

Looks good!!

What if we wanted to search for an indicator? Let’s try.

WDIsearch('nutrition')

##      indicator          
## [1,] "SH.DTH.COMM.ZS"   
## [2,] "SH.STA.MALN.FE.ZS"
## [3,] "SH.STA.MALN.MA.ZS"
## [4,] "SH.STA.MALN.ZS"   
## [5,] "SH.STA.STNT.FE.ZS"
## [6,] "SH.STA.STNT.MA.ZS"
## [7,] "SH.STA.STNT.ZS"   
##      name                                                                                                   
## [1,] "Cause of death, by communicable diseases and maternal, prenatal and nutrition conditions (% of total)"
## [2,] "Malnutrition prevalence, weight for age, female (% of children under 5)"                              
## [3,] "Malnutrition prevalence, weight for age, male (% of children under 5)"                                
## [4,] "Malnutrition prevalence, weight for age (% of children under 5)"                                      
## [5,] "Malnutrition prevalence, height for age, female (% of children under 5)"                              
## [6,] "Malnutrition prevalence, height for age, male (% of children under 5)"                                
## [7,] "Malnutrition prevalence, height for age (% of children under 5)"

We can manipulate the WDI results thus:

nutrition = WDI(indicator="SH.STA.STNT.ZS", country="all", start=2014, end=2014);
nutrition <- subset(nutrition, !is.na(nutrition$SH.STA.STNT.ZS)); 
nutrition <- group_by(nutrition, country)
nutrition <- filter(nutrition, year == max(year))
nutrition

We can bring this data into ggplot just like we did before, and join the nutrition data from WDI to the countries shapefile using each country’s two-letter ISO2C code. Here is a complete list for UN regions/subregions designations

worldmap=readOGR("./world borders/world_borders.shp") 
ggplot(worldmap, aes(x = long, y = lat, group = group)) + geom_path()+ 
  theme(aspect.ratio=0.6, plot.margin = unit(c(0, 0, 4, 0), "cm"))

worldmap@data$id=rownames(worldmap@data);
worldmap_points<-fortify(worldmap, region="id");
worldmap_df<-join(worldmap_points, worldmap@data, by="id");

names(worldmap_df)[9] <- "iso2c"
worldmap_df$iso2c<-as.character((worldmap_df$iso2c))
final <- left_join(worldmap_df, nutrition, by="iso2c")
final <- subset(final, REGION==142)

ggplot(final) + 
  geom_path(aes(long, lat, group = group), color = "white") + 
  geom_polygon(aes(long, lat, group = group, fill = SH.STA.STNT.ZS)) + 
  scale_fill_gradient("Malnutrition prevalence, \nheight for age \n(% of children under 5)", low = "#132B43", high = "#56B1F7", na.value = "grey50")+theme_minimal()

Example 3: Creating census choropleth maps using choroplethR

Let’s start by creating a simple state map of population

data(state.regions)
head(state.regions)

##       region abb fips.numeric fips.character
## 1     alaska  AK            2             02
## 2    alabama  AL            1             01
## 3   arkansas  AR            5             05
## 4    arizona  AZ            4             04
## 5 california  CA            6             06
## 6   colorado  CO            8             08

data(df_pop_state)
head(df_pop_state)

##       region    value
## 1    alabama  4777326
## 2     alaska   711139
## 3    arizona  6410979
## 4   arkansas  2916372
## 5 california 37325068
## 6   colorado  5042853

state_choropleth(df_pop_state, legend = "Population", num_colors=3)

This data also exists at the county level. Let’s create a county map of Percent Hispanic Population by county.

data("df_county_demographics")
df_county_demographics$value = df_county_demographics$percent_hispanic
county_choropleth(df_county_demographics, num_colors=5)

If you know the table name you are interested in mapping, you can call it directly using the acs package. Suppose we know that Table B19301 represents Per Capita Income in the past 12 months. We can use that information to map:

api.key.install(key="") 
par(mar=c(0,0,0,0), mai=c(0,0,0,0))
county_choropleth_acs("B19301", endyear=2015, span=5, num_colors=5)

## Warning in super$initialize(map.df, user.df): Your data.frame contains the
## following regions which are not mappable: 2158, 46102

## Warning in self$bind(): The following regions were missing and are being
## set to NA: 46113, 51515, 2270

We can also zoom into the counties around Boston:

data(county.regions)
boston_county_names=c("suffolk", "middlesex", "essex", "norfolk", "bristol", "plymouth", "barnstable")
boston_county_fips = county.regions %>%
  filter(state.name=="massachusetts" & county.name %in% boston_county_names) %>%
  select(region)
county_choropleth_acs("B19301", endyear=2015, num_colors=5, span=5, county_zoom=boston_county_fips$region)

## Warning in super$initialize(county.map, user.df): Your data.frame contains
## the following regions which are not mappable: 2158, 46102

You can also create queries to find which tables make most sense for you! Let’s try to search for census data matching the word “food”. These queries give us a lot of results, including table names and variable names, which we can use for mapping further.

query<-acs.lookup(2015, span = 5, dataset = "acs", table.name = "food", case.sensitive=F)
str(query); str(query@results)

## Formal class 'acs.lookup' [package "acs"] with 4 slots
##   ..@ endyear: num 2015
##   ..@ span   : num 5
##   ..@ args   :List of 7
##   .. ..$ endyear       : num 2015
##   .. ..$ span          : num 5
##   .. ..$ dataset       : chr "acs"
##   .. ..$ keyword       : symbol 
##   .. ..$ table.name    : chr "food"
##   .. ..$ table.number  : symbol 
##   .. ..$ case.sensitive: logi FALSE
##   ..@ results:'data.frame':  140 obs. of  4 variables:
##   .. ..$ variable.code: chr [1:140] "B09010_001" "B09010_002" "B09010_003" "B09010_004" ...
##   .. ..$ table.number : chr [1:140] "B09010" "B09010" "B09010" "B09010" ...
##   .. ..$ table.name   : chr [1:140] "Receipt of SSI, Public Assistance Income, or Food Stamps/SNAP in the Past 12 Mnths by Hhld Type for Children < 18 in Hhl" "Receipt of SSI, Public Assistance Income, or Food Stamps/SNAP in the Past 12 Mnths by Hhld Type for Children < 18 in Hhl" "Receipt of SSI, Public Assistance Income, or Food Stamps/SNAP in the Past 12 Mnths by Hhld Type for Children < 18 in Hhl" "Receipt of SSI, Public Assistance Income, or Food Stamps/SNAP in the Past 12 Mnths by Hhld Type for Children < 18 in Hhl" ...
##   .. ..$ variable.name: chr [1:140] "Total:" "Living in household with Supplemental Security Income (SSI), cash public assistance income, or Food Stamps/SNAP in the past 12 "| __truncated__ "Living in household with Supplemental Security Income (SSI), cash public assistance income, or Food Stamps/SNAP in the past 12 "| __truncated__ "Living in household with Supplemental Security Income (SSI), cash public assistance income, or Food Stamps/SNAP in the past 12 "| __truncated__ ...

## 'data.frame':    140 obs. of  4 variables:
##  $ variable.code: chr  "B09010_001" "B09010_002" "B09010_003" "B09010_004" ...
##  $ table.number : chr  "B09010" "B09010" "B09010" "B09010" ...
##  $ table.name   : chr  "Receipt of SSI, Public Assistance Income, or Food Stamps/SNAP in the Past 12 Mnths by Hhld Type for Children < 18 in Hhl" "Receipt of SSI, Public Assistance Income, or Food Stamps/SNAP in the Past 12 Mnths by Hhld Type for Children < 18 in Hhl" "Receipt of SSI, Public Assistance Income, or Food Stamps/SNAP in the Past 12 Mnths by Hhld Type for Children < 18 in Hhl" "Receipt of SSI, Public Assistance Income, or Food Stamps/SNAP in the Past 12 Mnths by Hhld Type for Children < 18 in Hhl" ...
##  $ variable.name: chr  "Total:" "Living in household with Supplemental Security Income (SSI), cash public assistance income, or Food Stamps/SNAP in the past 12 "| __truncated__ "Living in household with Supplemental Security Income (SSI), cash public assistance income, or Food Stamps/SNAP in the past 12 "| __truncated__ "Living in household with Supplemental Security Income (SSI), cash public assistance income, or Food Stamps/SNAP in the past 12 "| __truncated__ ...

head(unique(query@results$table.number))

## [1] "B09010"  "B19058"  "B22001"  "B22002"  "B22003"  "B22005A"

Example 4: The long (but efficient!) way to map census data

The “long” way to do this task, which is STILL simpler than how you would do this in ArcMap or QGIS, is to download the spatial data and tabular data separately. Then, you can merge and map them. Here’s a short example to see how you can apply this method. You need to know: your state and counties of interest, as well as the ACS variable name (I will be using B22002_001, which simplifies to Number of Households which received Food Stamps/SNAP in the Past 12 Months

# Define the counties of interest
gba_ct<-c("Suffolk", "Middlesex")

# Get the spatial data using the tigris package
counties <- tracts('MA', county=gba_ct, cb=TRUE)

# Get the tabular data using the acs package
SNAPdata<-acs.fetch(geo=geo.make(state="MA", county=gba_ct, tract="*"), 
                variable="B22002_001", endyear=2015)

# convert to a data.frame for merging
GEOID<-paste0(str_pad(SNAPdata@geography$state, 2, "left", pad="0"),
              str_pad(SNAPdata@geography$county, 3, "left", pad="0"),
              str_pad(SNAPdata@geography$tract, 6, "left", pad="0"))
SNAP_data<-SNAPdata@estimate[,1]
names(SNAP_data)=NULL

SNAP_df<- data.frame(GEOID, SNAP_data)

# Geo-join the two!
SNAP_merged<- geo_join(counties, SNAP_df, "GEOID", "GEOID")

SNAP_merged@data$id=rownames(SNAP_merged@data)
SNAP_merged_points<-fortify(SNAP_merged, region="id")
SNAP_df<-join(SNAP_merged_points, SNAP_merged@data, by="id")

ggplot(SNAP_df, aes(long, lat, group=group, fill=SNAP_data))+geom_polygon()+
  scale_fill_continuous(low="white", high="blue", name = "Households")+
  coord_fixed() + 
  theme(aspect.ratio=0.6) + theme_minimal() + 
  ggtitle("Households which received Food Stamps/SNAP in the Past 12 Months")