This is in response to a tweet:
Anyone know where to find county-level interracial marriage rates in the US? I'm having a hard time finding it through the census website.
— Matt Blackwell (@matt_blackwell) October 17, 2013
Yes, you can, but you must make an assumption. The ACS counts households and within these there's a special grouping called “family households, married-couple family”. This grouping is broken out by racial group. The assumption you need is that the households described as “Households with a householder who is Two or more races%Family households:%Married-couple family” are all inter-racial couples. Details, and why this assumption may not be too crazy, follow below.
I'm using the ACS 5-year data set to get an answer. I know that there are estimates of total households at the county level. There are also estimates of households by all kinds of racial breakdowns. If there's one that says “interracial household”, I'm good: the ratio of those counts to totals by county should approximate the true prevalence of interracial marriages at the county level.
First, we need to see what's where in the lookup table:
# some housekeeping
rm(list = ls(all = TRUE))
library(xlsx)
## Loading required package: xlsxjars
## Loading required package: rJava
seqct <- 113 # number of ACS sequences (ACS_2007_2011_SF_Tech_Doc.pdf p.11)
# Where I keep my stuff:
root <- "/Volumes/USB_Storage/ACS/"
# What's where in the ACS data: the lookup table
readLookupTable <- function() {
    path <- paste(root, "UserTools/", sep = "")
    fname <- "Sequence_Number_and_Table_Number_Lookup.txt"
    return(read.csv(paste(path, fname, sep = ""), stringsAsFactors = FALSE))
}
lookup <- readLookupTable()
households <- subset(lookup, grepl("household", Long.Table.Title, ignore.case = T))
You download the ACS 5-year data as two massive tarballs; the Census technical name for them is “2 giant files”. They will disgorge 113 * 53 * 2 + 53 * 2 = 12,084 text files.
This is data for 113 sequences (of which we only care about sequence 36) for 53 US entities (the 50 states, DC, Puerto Rico, and the US overall). There are two files for each: one with estimates and one with margins of error. The names of these files start with e and m respectively. The final 53*2 set of files holds geographic information – a .txt and a .csv file for each US entity.
So, for estimates of interest in any given state, we know we're looking for the file named e20115xx0036000.txt, where xx is the two-letter state code, in lower case. We might as well write a generic function that collects this file into a data frame. This function, which I called getData, could get by with only one argument – the state code. However, on the chance that we might reuse it for retrieving data from other sequence numbers, it's prudent to give it a second argument, the sequence number.
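To make the naming pattern concrete, here's a small sketch (the helper name is mine, and it assumes the 2007-2011 5-year release, hence the "20115" prefix):
# build the expected estimate file name for a state and sequence number (a sketch)
estFileName <- function(state, seq) {
    s <- formatC(seq, width = 4, format = "d", flag = "0")
    paste("e20115", state, s, "000.txt", sep = "")
}
estFileName("nc", 36)  # "e20115nc0036000.txt"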
A quick look at the households data frame suggests that we want to look through table IDs that start with B11001, which all sit in sequence 36 and count households. Households are categorized in a few ways, one of which is “family households, married-couple family.” This might be what we want. I don't see any match for the word “interracial”, though.
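The relevant rows of the lookup can be pulled out like this (a sketch; Table.ID and Sequence.Number are what read.csv() makes of the lookup file's column headers on my copy, so treat those names as assumptions):
# which sequence the B11001* tables live in (a sketch)
b11001 <- subset(lookup, grepl("^B11001", Table.ID))
unique(b11001$Sequence.Number)   # should point at sequence 36
unique(b11001$Long.Table.Title)  # the household tables, overall and by racial group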
For reasons of my own, I only care about ACS data for North Carolina (nc) and the United States (us) overall. This saves a lot of disk space, but it will make the explanations below a bit more tedious than they would be otherwise. The gist though is that I write functions that operate on state-level data files. If they work for nc and us, they'll work the same way for any other state.
Let's take a first sweep through the data:
# get data for a given seq, state id, group
getData <- function(id, seq, grp) {
    # set up regex for estimate file name
    s <- formatC(seq, width = 4, format = "d", flag = "0")
    e <- paste("^e[0-9]+", id, s, "000.txt", sep = "")
    # read text file into data frame, no header
    path <- paste(root, "group", grp, "_", id, "/", sep = "")
    files <- dir(path)
    e <- files[grep(e, files)]
    efile <- read.csv(paste(path, e, sep = ""), header = FALSE, stringsAsFactors = FALSE)
    # read in the variable names and labels
    path <- paste(root, "UserTools/2011_Summary_FileTemplates/Seq", sep = "")
    seq.template <- paste(path, s, ".xls", sep = "")
    labels <- t(read.xlsx(seq.template, sheetIndex = 1))
    names(efile) <- row.names(labels)
    # pack it up
    a <- list(efile, labels)
    # return it
    return(a)
}
# collect geographic info
getGeo <- function(id) {
path <- paste(root, "Geography/", sep = "")
gfile <- paste("g20115", id, ".csv", sep = "")
geo <- read.csv(paste(path, gfile, sep = ""), header = FALSE, stringsAsFactors = FALSE)
# read in the variable names and labels
path <- paste(root, "UserTools/2011_Summary_FileTemplates/", sep = "")
geo.template <- paste(path, "2011_SFGeoFileTemplate.xls", sep = "")
labels <- t(read.xlsx(geo.template, sheetIndex = 1))
names(geo) <- row.names(labels)
# keep only what we need
geo <- subset(geo, SUMLEVEL == 50)
geo <- geo[, c("LOGRECNO", "GEOID", "NAME")]
labels <- labels[rownames(labels) %in% c("LOGRECNO", "GEOID", "NAME")]
names(labels) <- c("LOGRECNO", "GEOID", "NAME")
# pack it up
a <- list(geo, labels)
# return it
return(a)
}
On my own computer I saved the text files from the two original tarballs into four folders whose names start with group1 and group2 and end with _nc and _us. As it turns out, both of the _nc folders contain a sequence 36 file for North Carolina. So my getData() function seems to need a third argument, to point it to the right grouping. We'll see later whether it's any use to look in both places.
ACS flat files are too wide to have descriptive variable names. This makes it pointless to pair the names with the variables the usual way, as table headers. Instead, the variable names are coded and paired with English-language labels separately. The getData() function recovers the names and labels and returns them along with the data as a list whose first element is a data frame and second element is a column vector of value-label pairs for the names in this data frame.
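For example (a sketch; foo is just a throwaway name for one such call):
foo <- getData("nc", 36, 1)
dim(foo[[1]])      # the estimates, one row per logical record
head(foo[[2]], 8)  # coded variable names paired with their English labels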
cleanUp <- function(x, tabid) {
    data <- x[[1]]
    labels <- x[[2]]
    keep <- rownames(labels)[c(1:6, grep(tabid, rownames(labels)))]
    data <- data[, keep]
    labels <- t(t(labels[rownames(labels) %in% keep]))
    rownames(labels) <- keep
    return(list(data, labels))
}
# we only want married-couple households, and only the totals and the race
# groupings, not Hispanic ethnicity.
cleanUpMore <- function(x) {
    data <- x[[1]]
    labels <- x[[2]]
    keep <- labels[c(1:6, grep("married-couple", labels, ignore.case = T))]
    keep <- rownames(labels)[labels %in% keep]
    data <- data[, keep]
    labels <- t(t(labels[rownames(labels) %in% keep]))
    rownames(labels) <- keep
    # the last two categories introduce ethnicity. we only care about race, so
    # here's one last round of clean-up:
    data <- subset(data, select = -c(B11001H_003, B11001I_003))
    labels <- t(t(labels[-c(15:16), 1]))
    return(list(data, labels))
}
foo1 <- getData("nc", 36, 1)
foo2 <- getData("nc", 36, 2)
bar1 <- cleanUp(foo1, "B11001")
bar2 <- cleanUp(foo2, "B11001")
bat1 <- cleanUpMore(bar1)
bat2 <- cleanUpMore(bar2)
geo <- getGeo("nc")
# Do we need both giant files, or does the first suffice?
dim(merge(bat1[[1]], geo[[1]], by = "LOGRECNO", sort = TRUE))
## [1] 100 16
dim(merge(bat2[[1]], geo[[1]], by = "LOGRECNO", sort = TRUE))
## [1] 0 16
A quick inspection of this column vector shows that the first 6 columns of the data frame are identifiers. Next come variable names whose part before the underscore matches the Table.ID field in the lookup table. So we want only the identifiers and the 90 variables whose names start with “B11001”. The function cleanUp() does this job.
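A quick check of that, using the foo1 object created above (a sketch):
head(rownames(foo1[[2]]), 8)                  # the identifier columns come first
length(grep("^B11001", rownames(foo1[[2]])))  # the 90 B11001 variables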
Narrowing the tables from 236 to 96 columns is not bad, but we can do better. We only want the married-couple family households, total and by racial grouping. The function cleanUpMore() finds and keeps only them, and now the tables are slimmed down to exactly what we need – 14 columns.
How about the table length? Between the two estimate files combined, we have 21,781 rows for North Carolina alone, which has 100 counties. How do we find the right 100 rows?
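Those numbers are easy to verify from the objects already in hand (a sketch):
ncol(bat1[[1]])                    # 14 columns: 6 identifiers + 8 household counts
nrow(bat1[[1]]) + nrow(bat2[[1]])  # the 21,781 rows mentioned above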
To find the right 100 rows we can use the geography file, which I collect with the getGeo() function. It works much like getData(): the data, the names, and the labels are stored separately, so it puts them back together and returns the lot as a two-item list, just as getData() does.
The county-level aggregation is coded as SUMLEVEL=50. So, for North Carolina we only need the 100 rows of the g20115nc.csv file for which SUMLEVEL=50. The function getGeo() keeps only these in the data frame that is the first element of its output list, and it also keeps only three columns of the original table. The crucial column is LOGRECNO, which will select the correct 100 rows in the estimate file. The columns GEOID and NAME are kept because they might come in handy.
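For North Carolina that leaves a small table (a sketch, using the geo object built above):
dim(geo[[1]])        # 100 county rows, 3 columns
head(geo[[1]]$NAME)  # the county names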
As it turns out, we don't need both giant files: the county-level data are all in the text files bundled in the first tarball. It seems unlikely that this wouldn't also hold for the other states, so let's just count on it.
It's not costly to insure against having to use both files: just concatenate the two data frames before merging them with the geography file. However, the way getRawData() below is written, this insurance breaks down if the second file happens to be empty. For it to work you would need some kind of a check – e.g. if (file.info("my second file")$size > 0) {...} – in any function that reads both groups of files; a sketch of such a guarded version follows the function below.
A state-level data collector might look like this:
getRawData <- function(id, seq, tabid) {
    geo <- getGeo(id)
    foo1 <- cleanUpMore(cleanUp(getData(id, seq, 1), tabid))
    # foo2 <- cleanUpMore(cleanUp(getData(id, seq, 2), tabid))
    # combine data across the 2 groups and keep only county-level data:
    # data <- merge(rbind(foo1[[1]], foo2[[1]]), geo[[1]], by = "LOGRECNO", sort = TRUE)
    # keep only county-level data:
    data <- merge(foo1[[1]], geo[[1]], by = "LOGRECNO", sort = TRUE)
    data.labels <- foo1[[2]]
    geo.labels <- geo[[2]]
    # pack it and return it
    a <- list(data, data.labels, geo.labels)
    names(a) <- c("data", "data.labels", "geo.labels")
    return(a)
}
counts <- getRawData("nc", 36, "B11001")
It would make sense that if the Census guy stopped by a house where the husband and wife were of different races, say white and black, he would count this household both as B11001A_003 and as B11001B_003, but only count one household for the purposes of the B11001_003 column. If he did that, then a row sum of household totals broken out by race, B11001A_003 through B11001G_003, would add up to more than the corresponding observation in column B11001_003. The difference would be the number of inter-racial households, and its ratio to the total in column B11001_003 would give you the prevalence of inter-racial marriages by county.
Unfortunately, that's not what we see: the counts broken out by race neatly sum to the household total counts, as if there were no interracial marriages whatsoever. This can't be right:
byrace <- counts$data[, 8:14]
totals <- counts$data[["B11001_003"]]
sum(totals - rowSums(byrace))
## [1] 0
So we're left with making the following assumption: our Census guy, faced with such a household, books it as one household of the B11001G_003 kind. This category would then apply both to households where the husband and wife are both multiracial, and to households where one of them is of one race, and the other of another race or multiracial. Under this assumption, the prevalence of inter-racial marriages can be estimated as the ratio of columns B11001G_003 and B11001_003, with the disclaimer that this estimate is overstated by the share of marriages where both halves of the couple are multiracial.
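Before wrapping this into a function, here's a quick look at what that ratio comes to for the counts already collected above (a sketch):
# the proposed estimate: "two or more races" married couples over all married couples
summary(counts$data[["B11001G_003"]]/counts$data[["B11001_003"]])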
Would this assumption be reasonable? It might be worth checking with the Census help desk. I see no way around the problem if you're going to use the ACS 5-year summary files. So, under this assumption, the final calculation becomes:
getIRM <- function(id) {
    counts <- getRawData(id, 36, "B11001")
    counts$data[["IRM"]] <- counts$data[["B11001G_003"]]/counts$data[["B11001_003"]]
    labels <- t(t(append(counts$data.labels, "Inter-racial Marriage Prevalence")))
    rownames(labels) <- append(rownames(counts$data.labels), "IRM")
    counts$data.labels <- labels
    return(counts)
}
This function can be run in a loop across all state IDs, with the outputs concatenated accordingly (a sketch of such a loop is at the end of this post). For now, North Carolina will have to do. Let's look at a histogram then:
counts <- getIRM("nc")
hist(counts$data[["IRM"]], main = paste(counts$data.labels["IRM", 1], "in North Carolina by county",
sep = " "))
Might this work for the US, without the need to cycle through states? No. There is no county-level aggregation in the US files. You have to cycle through states:
us.geo <- getGeo("us")
dim(us.geo[[1]])
## [1] 0 3
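So cycling through the states it is. With the files for the states of interest downloaded, the loop might look like this (a sketch; the vector of state codes below is just a hypothetical example):
state.ids <- c("nc", "sc", "va")  # hypothetical: whichever states you have on disk
irm.all <- do.call(rbind, lapply(state.ids, function(id) getIRM(id)$data))
dim(irm.all)  # one row per county across the chosen states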