DEM 7093 GIS Summer 2018 - Polygon analysis with R and QGIS

This example will use R to download American Community Survey summary file tables using the tidycensus package. The goal of this example is to illustrate how to use QGIS within R to overlap and intersect different data layers so we can apportion population from one geography to another.

This involves obtaining two layers of geometry, one with a population estimate and the other (target) that does not have an estimate, but for which we desire one.

Simple apportionment relies on dividing the population based on area, and typically works ok, but of course unless population is spread evenly across an area, the estimate will be too low or too high.

The example will use data from San Antonio, Texas from the 2015 American Community Survey summary file.

Get a Census developer API Key

Obtain one at http://api.census.gov/data/key_signup.html

Save your API key to your working directory

use census_api_key(key = "yourkeyhere", install = T)

one time to install your key for use in tidycensus

ACS data extract

For this example, I will use the B03002 table from the 2015 5-year ACS summary file. This table contains counts of population by race and Hispanic #ethnicity. I will extract census tracts within the state of Texas. These will be my smaller units that I will aggregate across, within the larger #geographies to calculate the segregation measures.

library(tigris)
library(RQGIS)
library(tidycensus)
library(sf)
library(dplyr)
myenv<-set_env("C:/Program Files/QGIS 2.18/")

options(tigris_class = "sf")

load the ACS data for age by sex

Here we get our original estimates data, at the census tract level in Bexar county, TX. ACS table B01001 is age by sex.

Our goal is to estimate the total number of children under age 5 in school districts. The population estimates are done separately by sex, so we need two tables.

age.tract<-get_acs(geography = "tract",state="48", county="029", year =2015, variable = c("B01001_003", "B01001_027"), output="wide", geometry = T)

## Getting data from the 2011-2015 5-year ACS

## Downloading feature geometry from the Census website.  To cache shapefiles for use in future sessions, set `options(tigris_use_cache = TRUE)`.

#mutate, calculate area of each tract, select and project
age.tract<-age.tract%>%
  mutate(totalkidsu5=B01001_003E+B01001_027E,
         totalu5err= sqrt(B01001_003M^2+B01001_027M^2), 
         area=st_area(.))%>%
  select(GEOID, totalkidsu5, totalu5err, area)%>%
  st_transform(crs=102340)

plot(age.tract["totalkidsu5"], main="Total children under age 5")

Intersecting geographies

When we want to combine information on two spatial layers, we can do a few different things. The one we’re going to do here is a geometric intersection.

This combines the information from two layers, in the areas in which the two layers overlap. Any areas that don’t overlap are removed.

Since we want to get estimates for school districts, we can download those data:

district<-school_districts(state = "TX", type="unified", refresh=T)
district<-st_transform(district,crs=102340 )
district$area_orig<-st_area(district)
plot(district["NAME"])

since school districts aren’t nested necessarily within counties (why, I have no idea), we have to clip the districts to be within the Bexar county border:

Here we get the county polygon for Bexar county:

bexarco<-counties(state="TX", cb=T, year=2015)
bexarco<-bexarco[bexarco$COUNTYFP=="029",]
bexarco<-st_transform(bexarco, crs=102340)
plot(bexarco["COUNTYFP"])

Here we get the districts in Bexar county using an intersection operation:

params <- get_args_man(alg = "qgis:intersection", qgis_env = myenv)
wd<-"C:/Users/ozd504/Google Drive/classes/dem7093/GIS_class_2018/data/"

params$INPUT <- district
params$INPUT2<-bexarco
params$OUTPUT<-file.path(wd, "bexar_districts.shp")

sadistrict<- run_qgis(alg = "qgis:intersection",
                    params = params,
                    load_output = T,
                    qgis_env = myenv)

## $OUTPUT
## [1] "C:/Users/ozd504/Google Drive/classes/dem7093/GIS_class_2018/data/bexar_districts.shp"

plot(sadistrict["NAME"])

Now we have our school district areas within Bexar county, now we need to intersect tracts and districts. This can be a time consuming process, so we’re actually just going to use the geometric operations within the sf library

#How to use QGIS to do it:
# params <- get_args_man(alg = "qgis:intersection", qgis_env = myenv)
# 
# params$INPUT <- sadistrict
# params$INPUT2<-age.tract
# params$OUTPUT<-file.path(wd, "sa_sd_tract.shp")
# path to the output shapefile
# sdtract <- run_qgis(alg = "qgis:intersection",
#                     params = params,
#                     load_output = T,
#                     qgis_env = myenv)

sdtract<-st_intersection(sadistrict, age.tract )

## Warning: attribute variables are assumed to be spatially constant
## throughout all geometries

plot(sdtract["NAME"])

GREAT! And we see the tracts exist within school districts, which was our goal.

Now, some parts of tracts may exist in multiple districts, this is because the two geographies are not nested. We can see this by mapping both layer together:

library(mapview)
mapview(age.tract)+mapview(sadistrict["NAME"])

Apportionment based on area

So, we can find the ratio of each tract’s area within the school district, and divide it by the original area of the tract. This gives us the proportion of the area of a tract within a given district. Once we have this, we can multiply the population by the proportion, and we will have a proportion of the population of that tract within a district:

sdtract$apparea<-st_area(sdtract) #area of intersected tracts
sdtract$app_kids<-sdtract$totalkidsu5*(sdtract$apparea/sdtract$area) #apportion population
hist(sdtract$apparea/sdtract$area, main="Proportion of tract within a district")

plot(sdtract['app_kids'])

So now we have split our tract populations, we can aggregate the apportioned populations across school districts to get a total population estimate for each district:

sd_ests<-aggregate(app_kids~GEOID, data=sdtract, FUN=sum)

head(sd_ests)

##     GEOID      app_kids
## 1 4807590 1.838950e+03 
## 2 4809360 3.849114e-03 
## 3 4810710 6.345800e+02 
## 4 4814730 1.640453e+03 
## 5 4817850 3.529008e+03 
## 6 4818150 4.788234e+03

So, i’ve let you on by telling you that we didn’t have estimates for districts, but in fact, the Census generates population estimates for all school districts. So in this case, we can actually check how our estimates compare with truth.

#truth
dist_tru<-get_acs(state="TX", geography = "school district (unified)", year=2015, output = "wide",variable = c("B01001_003", "B01001_027"))

## Getting data from the 2011-2015 5-year ACS

distgeo<-school_districts(state="TX", type="unified", year=2015)

dist_tru<-left_join(distgeo, dist_tru, by="GEOID")
dist_tru<-st_transform(dist_tru, crs=102340)
sa_dist_tru<-st_intersection(dist_tru, bexarco)

## Warning: attribute variables are assumed to be spatially constant
## throughout all geometries

sa_dist_tru<-sa_dist_tru%>%
  mutate(totalkidsu5=B01001_003E+B01001_027E,
         totalu5err= sqrt(B01001_003M^2+B01001_027M^2))

mdat<-left_join(sa_dist_tru, sd_ests, by="GEOID")
head(mdat)

plot(totalkidsu5~app_kids, data=mdat, main="Observed vs Estimated Populations")

summary(lm(totalkidsu5~app_kids, data=mdat))

## 
## Call:
## lm(formula = totalkidsu5 ~ app_kids, data = mdat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1716.7  -738.7  -344.5   326.9  4106.1 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 710.41194  299.89461   2.369   0.0262 *  
## app_kids      0.99636    0.02713  36.727   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1355 on 24 degrees of freedom
## Multiple R-squared:  0.9825, Adjusted R-squared:  0.9818 
## F-statistic:  1349 on 1 and 24 DF,  p-value: < 2.2e-16

DEM 7093 GIS Summer 2018 - Polygon analysis with R and QGIS - Apportioning Populations

Corey S. Sparks, PhD

June 14, 2018