The US census has an API that allows you to download data via a webbrowser. Here is an example that returns all Census 2010 population counts for zip code tabulation areas in Alaska:
http://api.census.gov/data/2010/sf1?key={enter key here}&get=P0010001,P0040003&for=zip+code+tabulation+area:*&in=state:02
###Components of the url
get=P0010001,P0040003
, indicating variables P0010001
(Total Population) and P004003
(Hispanic or Latino Population). There is a limit of 50 variables that can be requested at one time.The geographic region to pull data for. Here that is for=zip+code+tabulation+area:*&in=state:02
which indicates pull data for zip code tabulation areas (ZCTA), :*
indicates for all, in=state:02
indicates in state 02 (which is Alaska). Another example would be for=block:*&in=state:02+county:290+tract:00100
indicating all blocks in tract 00100 in county 290 in state 02 (Alaska).
You can find all kinds of examples on the left hand side of this page.
There are a number of obstacles to using data from the API. One obstacle for some people will be that the API requires a solid understanding of census questions and geographies. For example, the user must know the proper codes for the data they want (e.g. knowing that population is P0010001
. I don't see a solution to that, other than the user becoming familiar with the data documentation. Another obstacle is that the user must understand basic census geographies and fips codes, such as state-county-tract-block group-block. Again, I don't see an easy solution to this other than reading the documentation.
There are two things that can be mostly simplified with R, and these revolve around the limitations of the API itself. The first limitation is that you can only extract 50 variables at a time. We can fix that in R. The second limitation is that there are only certain geography combinations you can query at once. For example, you can query every block group in a tract, but you cannot query every block group in a county. If you want to query every block group in a county, then you need to first loop through all the tracts in the county, and construct a block group query for each tract. R can help us to automate this procedure. A third limitation of the url is that it returns an html page, it takes some massaging to get that into a usable dataset.
I have included here two R functions:
getCensusApi
assembles the url components into a single url, then sends that request to the Census server, and converts the return into an R data.frame. It automatically deals with the situation when you want more than 50 variables.getCensusApi2
The user should not need this. The user should work with getCensusApi
, which is a wrapper for this. getCensusApi2
assembles the url components and sends the request to the Census server. getCensusApi2
does not deal with more than 50 variables.vecToChunk
takes a list of variable names, and collects them into chunks with no more than 50 variables. Again, the user shouldn't have to worry about this if they use getCensusApi. This helps to get around requests you may have that require more than 50 variables, when the API only allows you to query 50 variables at a time. The user shouldn't need to call this directly.# getCensusApi
# get Census data via the public API: loop through variables if needed
# Inputs:
# data_url: the url root of the api, including the '?'
# example: http://api.census.gov/data/2010/sf1?
# key: your API key
# vars: a character vector of variables to get.
# example c("H0110001","H0110002","H0110003")
# If there are more than 50, then it will be automatically split into separate queries.
# region: region to get data for. contains a for:, and possibly an in:
# example: for=block:1213&in=state:47+county:015+tract:*
# Output:
# If successful, a data.frame
# If unsuccessful, prints the url query that caused the error.
getCensusApi <- function(data_url,key, vars, region, numeric=TRUE){
if(length(vars)>50){
vars <- vecToChunk(vars) # Split vars into a list
get <- lapply(vars, function(x) paste(x, sep='', collapse=","))
data <- lapply(vars, function(x) getCensusApi2(data_url,key, x, region, numeric=TRUE))
} else {
get <- paste(vars, sep='', collapse=',')
data <- list(getCensusApi2(data_url,key, get, region, numeric=TRUE))
}
# Format output. If there were no errors, than paste the data together
# If there is an error, just return the unformatted list.
if(all(sapply(data, is.data.frame))){
colnames <- unlist(lapply(data, names))
data <- do.call(cbind,data)
names(data) <- colnames
# Prettify the output
# If there are nonunique colums, remove them
data <- data[,unique(colnames, fromLast=TRUE)]
# Reorder columns so that numeric fields follow non-numeric fields
data <- data[,c(which(sapply(data, class)!='numeric'), which(sapply(data, class)=='numeric'))]
return(data)
}else{
print('unable to create single data.frame in getCensusApi')
return(data)
}
}
# getCensusApi2
# get Census data via the public API using a single query
# Inputs:
# data_url: the url root of the api, including the '?'
# example: http://api.census.gov/data/2010/sf1?
# key: your API key
# get: The variables to get. Separate multiple variables by commas.
# example 'H0110001,H0110002,H0110003'
# region: region to get data for. contains a for:, and possibly an in:
# example: for=block:1213&in=state:47+county:015+tract:*
# Output:
# If successful, a data.frame
# If unsuccessful, prints the url query that was constructed.
getCensusApi2 <- function(data_url,key, get, region, numeric=TRUE){
if(length(get)>1) get <- paste(get, collapse=',', sep='')
api_call <- paste(data_url,
'key=', key,
'&get=', get,
'&', region,
sep='')
dat_raw <- try(readLines(api_call, warn="F"))
if(class(dat_raw)=='try-error') {
print(api_call)
return}
dat_df <- data.frame()
#split the datastream into a list with each row as an element
# Thanks to roodmichael on github
tmp <- strsplit(gsub("[^[:alnum:], _]", '', dat_raw), "\\,")
#dat_df <- rbind(dat_df, t(sapply(tmp, '[')))
#names(dat_df) <- sapply(dat_df[1,], as.character)
#dat_df <- dat_df[-1,]
dat_df <- as.data.frame(do.call(rbind, tmp[-1]), stringsAsFactors=FALSE)
names(dat_df) <- tmp[[1]]
# convert to numeric
# The fips should stay as character... so how to distinguish fips from data?
# I think all of the data have numbers in the names, the fips do not
# Example: field names of B01001_001E vs state
if(numeric==TRUE){
value_cols <- grep("[0-9]", names(dat_df), value=TRUE)
for(col in value_cols) dat_df[,col] <- as.numeric(as.character(dat_df[,col]))
}
return(dat_df)
}
vecToChunk <- function(x, max=50){
s <- seq_along(x)
x1 <- split(x, ceiling(s/max))
return(x1)
}
# Uncomment the next line and enter your key
# key = 'your key here'
sf1_2010_api <- 'http://api.census.gov/data/2010/sf1?'
# Get all place pops for the US
us_places <- getCensusApi(sf1_2010_api, key=key, vars=c("P0010001"), region="for=place:*")
head(us_places)
## state place P0010001
## 1 01 00100 192
## 2 01 00124 2688
## 3 01 00460 4522
## 4 01 00484 758
## 5 01 00676 356
## 6 01 00820 30352
# Get table B04001 from the 2007-2011 ACS. There are aver 109 vars here (twice that if you include MOEs!). Pull the data for all places in California
vars <- paste('B04001_', sprintf('%03i', seq(1, 109)), 'E', sep='')
acs_07_11_url <- 'http://api.census.gov/data/2011/acs5?'
us_places_ancestry <- getCensusApi(acs_07_11_url, key=key, vars=vars, region="for=place:*&in=state:06")
head(us_places_ancestry)
## state place B04001_001E B04001_002E B04001_003E B04001_004E B04001_005E
## 1 06 00135 1431 0 0 0 31
## 2 06 00156 387 0 0 0 0
## 3 06 00212 7138 0 0 4 316
## 4 06 00296 30670 0 0 0 619
## 5 06 00310 418 0 0 0 0
## 6 06 00394 20353 27 0 0 907
## B04001_006E B04001_007E B04001_008E B04001_009E B04001_010E B04001_011E
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 22 0 0 0 14 0
## 5 0 0 0 0 0 0
## 6 299 0 31 116 51 0
## B04001_012E B04001_013E B04001_014E B04001_015E B04001_016E B04001_017E
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 44 0
## 4 0 0 0 8 124 0
## 5 0 0 0 0 0 0
## 6 18 0 23 60 87 0
## B04001_018E B04001_019E B04001_020E B04001_021E B04001_022E B04001_023E
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 27 0 0 0 14
## 4 0 0 0 0 0 8
## 5 0 0 0 0 0 0
## 6 0 40 11 55 0 38
## B04001_024E B04001_025E B04001_026E B04001_027E B04001_028E B04001_029E
## 1 0 0 0 0 0 16
## 2 0 0 0 0 0 0
## 3 0 0 190 0 0 0
## 4 0 0 7 0 0 0
## 5 0 0 0 0 0 0
## 6 19 0 81 0 0 11
## B04001_030E B04001_031E B04001_032E B04001_033E B04001_034E B04001_035E
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 72 118 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 100 25 170 149 634
## B04001_036E B04001_037E B04001_038E B04001_039E B04001_040E B04001_041E
## 1 193 0 14 0 25 0
## 2 0 0 0 0 0 0
## 3 502 0 198 45 176 29
## 4 285 0 88 0 62 96
## 5 77 0 0 0 14 0
## 6 874 0 534 22 207 49
## B04001_042E B04001_043E B04001_044E B04001_045E B04001_046E B04001_047E
## 1 56 0 0 0 0 32
## 2 0 0 0 0 0 0
## 3 1138 0 64 0 29 0
## 4 657 0 0 0 24 0
## 5 169 0 0 0 0 0
## 6 1963 0 65 0 193 0
## B04001_048E B04001_049E B04001_050E B04001_051E B04001_052E B04001_053E
## 1 72 250 0 57 41 0
## 2 0 0 0 0 0 0
## 3 0 532 0 587 0 8
## 4 0 812 0 424 0 0
## 5 0 49 0 6 0 0
## 6 211 1722 236 1053 0 112
## B04001_054E B04001_055E B04001_056E B04001_057E B04001_058E B04001_059E
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 132 80
## 4 0 0 0 0 0 460
## 5 0 0 0 0 0 0
## 6 0 0 0 0 189 136
## B04001_060E B04001_061E B04001_062E B04001_063E B04001_064E B04001_065E
## 1 0 21 15 0 12 15
## 2 0 0 21 0 0 0
## 3 0 139 21 33 52 36
## 4 0 651 0 0 28 113
## 5 0 0 0 0 0 15
## 6 0 707 29 65 1506 14
## B04001_066E B04001_067E B04001_068E B04001_069E B04001_070E B04001_071E
## 1 0 62 0 0 14 0
## 2 0 0 0 0 0 0
## 3 120 234 0 0 0 13
## 4 111 59 0 0 18 0
## 5 0 29 0 0 0 0
## 6 251 336 0 0 0 0
## B04001_072E B04001_073E B04001_074E B04001_075E B04001_076E B04001_077E
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 35 0 0 0 0
## 4 0 279 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 56 0 0 0 0
## B04001_078E B04001_079E B04001_080E B04001_081E B04001_082E B04001_083E
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 177 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 28
## B04001_084E B04001_085E B04001_086E B04001_087E B04001_088E B04001_089E
## 1 0 0 0 0 0 60
## 2 0 0 0 0 0 9
## 3 0 0 0 35 0 127
## 4 0 0 0 102 0 53
## 5 0 0 0 0 0 0
## 6 28 0 0 0 0 430
## B04001_090E B04001_091E B04001_092E B04001_093E B04001_094E B04001_095E
## 1 0 0 23 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 20 0 0
## 4 0 0 65 0 140 0
## 5 0 0 0 0 0 0
## 6 76 20 143 11 0 0
## B04001_096E B04001_097E B04001_098E B04001_099E B04001_100E B04001_101E
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 102 0 0 31 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## B04001_102E B04001_103E B04001_104E B04001_105E B04001_106E B04001_107E
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 7 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## B04001_108E B04001_109E
## 1 350 72
## 2 357 0
## 3 1535 468
## 4 23147 2318
## 5 23 36
## 6 5081 1409
Here, we will extract tract level data from many counties. You can only request tract data one county at a time, so we'll loop through it.
study_area <- data.frame(county = c('Cannon', 'Cheatham', 'Davidson', 'Dickson', 'Hickman', 'Macon', 'Maury', 'Robertson', 'Rutherford', 'Smith', 'Sumner', 'Trousdale', 'Williamson', 'Wilson'),
fips = c('015', '021', '037', '043', '081', '111', '119',
'147', '149', '159', '165', '169', '187', '189'),
stringsAsFactors=FALSE)
Create a list of variables to extract.
vars <- c('B00001_001E', 'B00002_001E', 'B01001_001E', 'B01001I_001E')
Now, loop through the counties
# Create an empty data.frame to hold the results in:
df <- NULL
for(cty in study_area$fips){# For each county
#Construct the regions part of the API Call
region = paste("for=tract:*&in=state:47+county:", cty, sep='')
# Pull data
temp.df <- getCensusApi(acs_07_11_url, key=key, vars=vars, region=region)
df <- rbind(df, temp.df)
}
## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion
rm(region,temp.df)
head(df)
## state county tract B00001_001E B00002_001E B01001_001E B01001I_001E
## 1 47 015 960100 413 179 3690 11
## 2 47 015 960200 607 259 6623 103
## 3 47 015 960300 269 111 3415 104
## 4 47 021 070102 310 124 4019 74
## 5 47 021 070103 194 80 3632 171
## 6 47 021 070104 325 134 5270 78
That produced some NAs for county 149, tract 014500. I don't know what's up there. Maybe someone else can do a good turn and look into that?
I often encounter problems where I want block group data for an entire county, or for many counties. The API only lets you pull block groups for one block at a time. The solution to this is to do a simple API pull at the tract level, pull out the tract names, and then loop through each tract, pulling out the needed block group data.
Here's an example to pull block group pops from Davidson County in Tennessee. You'll see the loop is very similar to the last example. The only difference is that we place an intermediate API call in order to generate the list of tracts.
vars="B01001_001E"
# Set up the tract level API call
region_tract <- 'for=tract:*&in=state:47+county:037'
tract_df <- getCensusApi(acs_07_11_url, key=key, vars=vars, region=region_tract)
tract_list <- tract_df$tract
df <- NULL
for(t in tract_list){# For each tract
#Construct the regions part of the API Call
region = paste("for=block+group:*&in=state:47+county:037+tract:", t, sep='')
# Pull data
temp.df <- getCensusApi(acs_07_11_url, key=key, vars=vars, region=region)
df <- rbind(df, temp.df)
}
rm(region,temp.df)
head(df)
## state county tract block group B01001_001E
## 1 47 037 010103 1 1111
## 2 47 037 010103 2 1146
## 3 47 037 010104 1 1778
## 4 47 037 010104 2 1653
## 5 47 037 010105 1 770
## 6 47 037 010105 2 2435