Automating data pulls from the US Census API

Introduction to the API

The US census has an API that allows you to download data via a webbrowser. Here is an example that returns all Census 2010 population counts for zip code tabulation areas in Alaska:

http://api.census.gov/data/2010/sf1?key={enter key here}&get=P0010001,P0040003&for=zip+code+tabulation+area:*&in=state:02

###Components of the url

  1. A url root describing the dataset to be queried. In the example, that is http://api.census.gov/data/2010/sf1?' which indicates data from the 2010 Census, Summary File 1.
  2. The user's API key. Every user must have their own key. I didn't enter my key in the example. You'll have to request you're own. You can request a key here.
  3. The variables that will be requested. In this case, that is indicated by get=P0010001,P0040003, indicating variables P0010001 (Total Population) and P004003 (Hispanic or Latino Population). There is a limit of 50 variables that can be requested at one time.
  4. The geographic region to pull data for. Here that is for=zip+code+tabulation+area:*&in=state:02 which indicates pull data for zip code tabulation areas (ZCTA), :* indicates for all, in=state:02 indicates in state 02 (which is Alaska). Another example would be for=block:*&in=state:02+county:290+tract:00100 indicating all blocks in tract 00100 in county 290 in state 02 (Alaska).

    You can find all kinds of examples on the left hand side of this page.

The R code

What can't be simplified

There are a number of obstacles to using data from the API. One obstacle for some people will be that the API requires a solid understanding of census questions and geographies. For example, the user must know the proper codes for the data they want (e.g. knowing that population is P0010001. I don't see a solution to that, other than the user becoming familiar with the data documentation. Another obstacle is that the user must understand basic census geographies and fips codes, such as state-county-tract-block group-block. Again, I don't see an easy solution to this other than reading the documentation.

What can be simplified with R?

There are two things that can be mostly simplified with R, and these revolve around the limitations of the API itself. The first limitation is that you can only extract 50 variables at a time. We can fix that in R. The second limitation is that there are only certain geography combinations you can query at once. For example, you can query every block group in a tract, but you cannot query every block group in a county. If you want to query every block group in a county, then you need to first loop through all the tracts in the county, and construct a block group query for each tract. R can help us to automate this procedure. A third limitation of the url is that it returns an html page, it takes some massaging to get that into a usable dataset.

Description of R functions:

I have included here two R functions:

# getCensusApi
# get Census data via the public API: loop through variables if needed
# Inputs:
#   data_url: the url root of the api, including the '?'
#     example: http://api.census.gov/data/2010/sf1?
#   key: your API key
#   vars: a character vector of variables to get.
#     example c("H0110001","H0110002","H0110003")
#     If there are more than 50, then it will be automatically split into separate queries.
#   region: region to get data for.  contains a for:, and possibly an in:
#     example: for=block:1213&in=state:47+county:015+tract:*
# Output:
#   If successful, a data.frame
#   If unsuccessful, prints the url query that caused the error.
getCensusApi <- function(data_url,key, vars, region, numeric=TRUE){
  if(length(vars)>50){
    vars <- vecToChunk(vars) # Split vars into a list
    get <- lapply(vars, function(x) paste(x, sep='', collapse=","))
    data <- lapply(vars, function(x) getCensusApi2(data_url,key, x, region, numeric=TRUE))
    } else {
      get <- paste(vars, sep='', collapse=',')
      data <- list(getCensusApi2(data_url,key, get, region, numeric=TRUE))
      }
  # Format output.  If there were no errors, than paste the data together
  # If there is an error, just return the unformatted list.
  if(all(sapply(data, is.data.frame))){
    colnames <- unlist(lapply(data, names))
    data <- do.call(cbind,data)
    names(data) <- colnames
    # Prettify the output
    # If there are nonunique colums, remove them
    data <- data[,unique(colnames, fromLast=TRUE)]
    # Reorder columns so that numeric fields follow non-numeric fields
    data <- data[,c(which(sapply(data, class)!='numeric'), which(sapply(data, class)=='numeric'))]
    return(data)
  }else{
    print('unable to create single data.frame in getCensusApi')
    return(data)
    }
}


# getCensusApi2 
# get Census data via the public API using a single query
# Inputs:
#   data_url: the url root of the api, including the '?'
#     example: http://api.census.gov/data/2010/sf1?
#   key: your API key
#   get: The variables to get. Separate multiple variables by commas.
#     example 'H0110001,H0110002,H0110003'
#   region: region to get data for.  contains a for:, and possibly an in:
#     example: for=block:1213&in=state:47+county:015+tract:*
# Output:
#   If successful, a data.frame
#   If unsuccessful, prints the url query that was constructed.
getCensusApi2 <- function(data_url,key, get, region, numeric=TRUE){
  if(length(get)>1) get <- paste(get, collapse=',', sep='')
  api_call <- paste(data_url, 
                    'key=', key, 
                    '&get=', get,
                    '&', region,
                    sep='')

  dat_raw <- try(readLines(api_call, warn="F"))
  if(class(dat_raw)=='try-error') {
    print(api_call)
    return}
  dat_df <- data.frame()

  #split the datastream into a list with each row as an element
  # Thanks to roodmichael on github
  tmp <- strsplit(gsub("[^[:alnum:], _]", '', dat_raw), "\\,")
  #dat_df <- rbind(dat_df, t(sapply(tmp, '[')))
  #names(dat_df) <- sapply(dat_df[1,], as.character)
  #dat_df <- dat_df[-1,]
  dat_df <- as.data.frame(do.call(rbind, tmp[-1]), stringsAsFactors=FALSE)
  names(dat_df) <- tmp[[1]]
  # convert to numeric
  # The fips should stay as character... so how to distinguish fips from data?
  # I think all of the data have numbers in the names, the fips do not
  #  Example: field names of B01001_001E vs state
  if(numeric==TRUE){
    value_cols <- grep("[0-9]", names(dat_df), value=TRUE)
    for(col in value_cols) dat_df[,col] <- as.numeric(as.character(dat_df[,col]))
  }
  return(dat_df)
}

vecToChunk <- function(x, max=50){
  s <- seq_along(x)
  x1 <- split(x, ceiling(s/max))
  return(x1)
}

Usage examples

Enter your api key here

# Uncomment the next line and enter your key

# key = 'your key here'

Get all place populations in the US from the 2010 Census.

sf1_2010_api <- 'http://api.census.gov/data/2010/sf1?'
# Get all place pops for the US
us_places <- getCensusApi(sf1_2010_api, key=key, vars=c("P0010001"), region="for=place:*")
head(us_places)
##   state place P0010001
## 1    01 00100      192
## 2    01 00124     2688
## 3    01 00460     4522
## 4    01 00484      758
## 5    01 00676      356
## 6    01 00820    30352

An example involving more than 50 variables

# Get table B04001 from the 2007-2011 ACS.  There are aver 109 vars here (twice that if you include MOEs!).  Pull the data for all places in California
vars <- paste('B04001_', sprintf('%03i', seq(1, 109)), 'E', sep='')
acs_07_11_url <- 'http://api.census.gov/data/2011/acs5?'
us_places_ancestry <- getCensusApi(acs_07_11_url, key=key, vars=vars, region="for=place:*&in=state:06")
head(us_places_ancestry)
##   state place B04001_001E B04001_002E B04001_003E B04001_004E B04001_005E
## 1    06 00135        1431           0           0           0          31
## 2    06 00156         387           0           0           0           0
## 3    06 00212        7138           0           0           4         316
## 4    06 00296       30670           0           0           0         619
## 5    06 00310         418           0           0           0           0
## 6    06 00394       20353          27           0           0         907
##   B04001_006E B04001_007E B04001_008E B04001_009E B04001_010E B04001_011E
## 1           0           0           0           0           0           0
## 2           0           0           0           0           0           0
## 3           0           0           0           0           0           0
## 4          22           0           0           0          14           0
## 5           0           0           0           0           0           0
## 6         299           0          31         116          51           0
##   B04001_012E B04001_013E B04001_014E B04001_015E B04001_016E B04001_017E
## 1           0           0           0           0           0           0
## 2           0           0           0           0           0           0
## 3           0           0           0           0          44           0
## 4           0           0           0           8         124           0
## 5           0           0           0           0           0           0
## 6          18           0          23          60          87           0
##   B04001_018E B04001_019E B04001_020E B04001_021E B04001_022E B04001_023E
## 1           0           0           0           0           0           0
## 2           0           0           0           0           0           0
## 3           0          27           0           0           0          14
## 4           0           0           0           0           0           8
## 5           0           0           0           0           0           0
## 6           0          40          11          55           0          38
##   B04001_024E B04001_025E B04001_026E B04001_027E B04001_028E B04001_029E
## 1           0           0           0           0           0          16
## 2           0           0           0           0           0           0
## 3           0           0         190           0           0           0
## 4           0           0           7           0           0           0
## 5           0           0           0           0           0           0
## 6          19           0          81           0           0          11
##   B04001_030E B04001_031E B04001_032E B04001_033E B04001_034E B04001_035E
## 1           0           0           0           0           0           0
## 2           0           0           0           0           0           0
## 3           0           0           0          72         118           0
## 4           0           0           0           0           0           0
## 5           0           0           0           0           0           0
## 6           0         100          25         170         149         634
##   B04001_036E B04001_037E B04001_038E B04001_039E B04001_040E B04001_041E
## 1         193           0          14           0          25           0
## 2           0           0           0           0           0           0
## 3         502           0         198          45         176          29
## 4         285           0          88           0          62          96
## 5          77           0           0           0          14           0
## 6         874           0         534          22         207          49
##   B04001_042E B04001_043E B04001_044E B04001_045E B04001_046E B04001_047E
## 1          56           0           0           0           0          32
## 2           0           0           0           0           0           0
## 3        1138           0          64           0          29           0
## 4         657           0           0           0          24           0
## 5         169           0           0           0           0           0
## 6        1963           0          65           0         193           0
##   B04001_048E B04001_049E B04001_050E B04001_051E B04001_052E B04001_053E
## 1          72         250           0          57          41           0
## 2           0           0           0           0           0           0
## 3           0         532           0         587           0           8
## 4           0         812           0         424           0           0
## 5           0          49           0           6           0           0
## 6         211        1722         236        1053           0         112
##   B04001_054E B04001_055E B04001_056E B04001_057E B04001_058E B04001_059E
## 1           0           0           0           0           0           0
## 2           0           0           0           0           0           0
## 3           0           0           0           0         132          80
## 4           0           0           0           0           0         460
## 5           0           0           0           0           0           0
## 6           0           0           0           0         189         136
##   B04001_060E B04001_061E B04001_062E B04001_063E B04001_064E B04001_065E
## 1           0          21          15           0          12          15
## 2           0           0          21           0           0           0
## 3           0         139          21          33          52          36
## 4           0         651           0           0          28         113
## 5           0           0           0           0           0          15
## 6           0         707          29          65        1506          14
##   B04001_066E B04001_067E B04001_068E B04001_069E B04001_070E B04001_071E
## 1           0          62           0           0          14           0
## 2           0           0           0           0           0           0
## 3         120         234           0           0           0          13
## 4         111          59           0           0          18           0
## 5           0          29           0           0           0           0
## 6         251         336           0           0           0           0
##   B04001_072E B04001_073E B04001_074E B04001_075E B04001_076E B04001_077E
## 1           0           0           0           0           0           0
## 2           0           0           0           0           0           0
## 3           0          35           0           0           0           0
## 4           0         279           0           0           0           0
## 5           0           0           0           0           0           0
## 6           0          56           0           0           0           0
##   B04001_078E B04001_079E B04001_080E B04001_081E B04001_082E B04001_083E
## 1           0           0           0           0           0           0
## 2           0           0           0           0           0           0
## 3           0           0           0           0           0           0
## 4           0         177           0           0           0           0
## 5           0           0           0           0           0           0
## 6           0           0           0           0           0          28
##   B04001_084E B04001_085E B04001_086E B04001_087E B04001_088E B04001_089E
## 1           0           0           0           0           0          60
## 2           0           0           0           0           0           9
## 3           0           0           0          35           0         127
## 4           0           0           0         102           0          53
## 5           0           0           0           0           0           0
## 6          28           0           0           0           0         430
##   B04001_090E B04001_091E B04001_092E B04001_093E B04001_094E B04001_095E
## 1           0           0          23           0           0           0
## 2           0           0           0           0           0           0
## 3           0           0           0          20           0           0
## 4           0           0          65           0         140           0
## 5           0           0           0           0           0           0
## 6          76          20         143          11           0           0
##   B04001_096E B04001_097E B04001_098E B04001_099E B04001_100E B04001_101E
## 1           0           0           0           0           0           0
## 2           0           0           0           0           0           0
## 3           0           0           0           0           0           0
## 4           0         102           0           0          31           0
## 5           0           0           0           0           0           0
## 6           0           0           0           0           0           0
##   B04001_102E B04001_103E B04001_104E B04001_105E B04001_106E B04001_107E
## 1           0           0           0           0           0           0
## 2           0           0           0           0           0           0
## 3           0           0           0           0           0           0
## 4           7           0           0           0           0           0
## 5           0           0           0           0           0           0
## 6           0           0           0           0           0           0
##   B04001_108E B04001_109E
## 1         350          72
## 2         357           0
## 3        1535         468
## 4       23147        2318
## 5          23          36
## 6        5081        1409

An example requiring a loop through regions

Here, we will extract tract level data from many counties. You can only request tract data one county at a time, so we'll loop through it.

Create a list of counties to loop through. I have both county names and fips codes here. Only the fips codes are used. The names are just for clarity.

study_area <- data.frame(county = c('Cannon', 'Cheatham', 'Davidson', 'Dickson', 'Hickman', 'Macon', 'Maury', 'Robertson', 'Rutherford', 'Smith', 'Sumner', 'Trousdale', 'Williamson', 'Wilson'),
                         fips = c('015', '021', '037', '043', '081', '111', '119', 
                                  '147', '149', '159', '165', '169', '187', '189'),
                         stringsAsFactors=FALSE)

Create a list of variables to extract.

vars <- c('B00001_001E', 'B00002_001E', 'B01001_001E', 'B01001I_001E')

Now, loop through the counties

# Create an empty data.frame to hold the results in:
df <- NULL
for(cty in study_area$fips){# For each county
  #Construct the regions part of the API Call
  region = paste("for=tract:*&in=state:47+county:", cty, sep='')
  # Pull data
  temp.df <- getCensusApi(acs_07_11_url, key=key, vars=vars, region=region)
  df <- rbind(df, temp.df)
}
## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion
rm(region,temp.df)
head(df)
##   state county  tract B00001_001E B00002_001E B01001_001E B01001I_001E
## 1    47    015 960100         413         179        3690           11
## 2    47    015 960200         607         259        6623          103
## 3    47    015 960300         269         111        3415          104
## 4    47    021 070102         310         124        4019           74
## 5    47    021 070103         194          80        3632          171
## 6    47    021 070104         325         134        5270           78

That produced some NAs for county 149, tract 014500. I don't know what's up there. Maybe someone else can do a good turn and look into that?

Dynamically create a loop

I often encounter problems where I want block group data for an entire county, or for many counties. The API only lets you pull block groups for one block at a time. The solution to this is to do a simple API pull at the tract level, pull out the tract names, and then loop through each tract, pulling out the needed block group data.

Here's an example to pull block group pops from Davidson County in Tennessee. You'll see the loop is very similar to the last example. The only difference is that we place an intermediate API call in order to generate the list of tracts.

vars="B01001_001E"
# Set up the tract level API call
region_tract <- 'for=tract:*&in=state:47+county:037'
tract_df <- getCensusApi(acs_07_11_url, key=key, vars=vars, region=region_tract)
tract_list <- tract_df$tract
df <- NULL
for(t in tract_list){# For each tract
  #Construct the regions part of the API Call
  region = paste("for=block+group:*&in=state:47+county:037+tract:", t, sep='')
  # Pull data
  temp.df <- getCensusApi(acs_07_11_url, key=key, vars=vars, region=region)
  df <- rbind(df, temp.df)
}
rm(region,temp.df)
head(df)
##   state county  tract block group B01001_001E
## 1    47    037 010103           1        1111
## 2    47    037 010103           2        1146
## 3    47    037 010104           1        1778
## 4    47    037 010104           2        1653
## 5    47    037 010105           1         770
## 6    47    037 010105           2        2435