Social Capital Indices for Census Tracts, Zipcodes, & County Subdivisions
Replication Code for Indices
This RMarkdown document identifies each step necessary for creating a census-tract level social capital index. These indices are the most up-to-date version of the indices presented in our recent article in Nature Scientific Reports (DOI: 10.1038/s41598-022-10275-z), entitled Social capital’s impact on COVID-19 outcomes at local levels.
Want to download the indices? Download them from our repository on the Harvard Dataverse!
Want to run the replication code below yourself? You can do it using this RStudio.Cloud project available at: https://rstudio.cloud/project/3875479.
Version History
Version 1: 02/18/2021 (https://papers.ssrn.com/abstract=3788540)
Version 2: 01/28/2022 (DOI: 10.1038/s41598-022-10275-z)
- Log-transformed bridging indicators, to ensure normal distributions.
- Capped indicators at the 2.5th and 97.5th percentiles, to avoid outlier bias.
Version 3: 04/08/2022 (current document)
- Added years 2019 and 2020 using updated ACS-5 estimates.
- Added estimates for Washington, DC.
- Improved the formulation of the Education Equality indicator.
- Calculated the voting-age population directly from the Census API.
Note: indices are estimated using multiple imputation across the full 2011-2020 dataset, so index scores may change ever so slightly between Version 2 and Version 3, since Version 3 can now draw on years 2019-2020 to make better imputations over time. I encourage using the most up-to-date version.
Process
This process includes four parts:
- Collect indicators of bonding social capital:
Communication Capacity
Educational Similarity
Race Similarity
Ethnicity Similarity
Employment Equality
Gender Income Similarity
Income Inequality
Language
Non-elder Population
- Collect indicators of bridging social capital:
Civic Orgs per 10,000 persons
Charitable Orgs per 10,000 persons
Religious Orgs per 10,000 persons
Advocacy Orgs per 10,000 persons
Unions per 10,000 persons (County level variable)
Fraternal Orgs per 10,000 persons (County level variable)
Unless otherwise specified, these are collected at the zipcode level over 5 years. In Version 2, about 2% of all data points were missing; for Charitable, Advocacy, and Civic Orgs, 11-20% were missing. To deal with this, we replace missing data points with available county-level rates. This reduces missing data down to ~1%!
- Collect indicators of linking social capital:
Voting Age Population
Local Government Employees per capita
State government employees per capita
Federal government employees per capita
Rate of Political Action (ESRI county level variable)
These variables are from Kyne & Aldrich’s index, published in 2020. Each clearly represents an aspect of bonding, bridging, or linking social capital, supported by the literature, as discussed in the original article. This dataset simply extends their methodology to a census-tract index, making the necessary accommodations for moving from the county level to the census tract level.
- Aggregation to Zipcode and County Subdivision Levels:
Finally, at the end of this document, we aggregate these census tract level indices to the zipcode and county subdivision level.
0. Setup
0.1 Load Packages
First, please load the following packages:
# If they are not yet installed, uncomment this line and run it.
# install.packages(c("tidyverse", "tidycensus", "censusapi",
# "sf", "tigris", "Amelia", "viridis", "ggpubr", "ggtext",
# "knitr", "shadowtext", "matrixStats", "readxl"))
# Then load them!
# For data management
library(tidyverse)
# For querying census data
library(tidycensus)
library(censusapi)
# For spatial data management
library(sf)
library(tigris)
# For multiple imputation
library(Amelia)
# For data viz
library(viridis)
library(ggpubr)
library(ggtext)
library(knitr)
0.2 Create Folders
Second, let’s create a few folders to hold our various data points.
# Create a folder to host raw index indicators
dir.create("raw_data")0.3. Load API Key
We’re going to compile the various indicators mentioned above, primarily using Kyle Walker’s tidycensus package, which allows us to query the US Census Bureau’s American Community Survey 5-year estimates from 2011 to 2020. To do so, you’ll need a Census API key - no worries, it’s quick to get. Just register your email address with the Census Bureau here and they will email you the API key immediately.
Once you’ve got an API key, just plug it into the
census_api_key() function from the tidycensus
package.
# Load api key to download data from US Census Bureau
mykey <- "YOUR_API_KEY_HERE"
census_api_key(mykey)
1. Collect Data
Let’s collect some data!
1.1 Units
First, let’s make a list of all year-state combinations we want to query.
myunits <- expand_grid(
state = c(state.abb, "DC"),
year = 2011:2020) %>%
mutate(id = 1:n())
1.2 Indicators
Second, let’s make a list of all the indicators we want to query. We’re going to try and download them all in one fell swoop to simplify the process and reduce time spent waiting for downloads.
myvars <- read_csv("indicators.csv")
We can view the contents of myvars in its entirety using the following code, but it’s probably easier just to click on the tabs below to see the variables we’ll be acquiring for each indicator.
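If you are rebuilding indicators.csv yourself, it is just a four-column table (code, value, survey, geography) matching the tabs below. Here’s a minimal sketch of its structure, using two rows from the race_similarity tab (the full file holds every row shown in these tabs):
# A minimal sketch of the structure of indicators.csv,
# based on the tabs shown below
tibble::tribble(
~code, ~value, ~survey, ~geography,
"B01003_001E", "Total Population", "acs5", "tract",
"B02001_002E", "White alone", "acs5", "tract")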
race_similarity
| code | value | survey | geography |
|---|---|---|---|
| B01003_001E | Total Population | acs5 | tract |
| B02001_002E | White alone | acs5 | tract |
| B02001_003E | Black or African American alone | acs5 | tract |
| B02001_004E | American Indian and Alaska Native alone | acs5 | tract |
| B02001_005E | Asian alone | acs5 | tract |
| B02001_006E | Native Hawaiian and Other Pacific Islander alone | acs5 | tract |
| B02001_007E | Some other race alone | acs5 | tract |
| B02001_008E | Two or more races | acs5 | tract |
ethnicity_similarity
| code | value | survey | geography |
|---|---|---|---|
| B01003_001E | Total Population | acs5 | tract |
| B03001_002E | Not Hispanic or Latino | acs5 | tract |
| B03001_003E | Hispanic or Latino | acs5 | tract |
education_equality
| code | value | survey | geography |
|---|---|---|---|
| B15002_001E | Population 25 years and over | acs5 | tract |
| B15002_002E | Population 25 years and over (Male) | acs5 | tract |
| B15002_011E | High School Graduate, GED, or alternative (Male) | acs5 | tract |
| B15002_012E | Some college, less than 1 year (Male) | acs5 | tract |
| B15002_013E | Some college, 1 or more years, no degree (Male) | acs5 | tract |
| B15002_014E | Associate’s degree (Male) | acs5 | tract |
| B15002_015E | Bachelor’s degree (Male) | acs5 | tract |
| B15002_016E | Master’s degree (Male) | acs5 | tract |
| B15002_017E | Professional school Degree (Male) | acs5 | tract |
| B15002_018E | Doctoral Degree (Male) | acs5 | tract |
| B15002_019E | Population 25 years and over (Female) | acs5 | tract |
| B15002_028E | High School Graduate, GED, or alternative (Female) | acs5 | tract |
| B15002_029E | Some college, less than 1 year (Female) | acs5 | tract |
| B15002_030E | Some college, 1 or more years, no degree (Female) | acs5 | tract |
| B15002_031E | Associate’s degree (Female) | acs5 | tract |
| B15002_032E | Bachelor’s degree (Female) | acs5 | tract |
| B15002_033E | Master’s degree (Female) | acs5 | tract |
| B15002_034E | Professional school Degree (Female) | acs5 | tract |
| B15002_035E | Doctoral Degree (Female) | acs5 | tract |
gini
| code | value | survey | geography |
|---|---|---|---|
| B19083_001E | Gini Index | acs5 | tract |
employment_equality
| code | value | survey | geography |
|---|---|---|---|
| B23025_003E | Civilian labor force | acs5 | tract |
| B23025_004E | Employed (in civilian labor force) | acs5 | tract |
| B23025_005E | Unemployed (in civilian labor force) | acs5 | tract |
gender_income_similarity
| code | value | survey | geography |
|---|---|---|---|
| B19326_001E | Median income in the past 12 months (total) | acs5 | tract |
| B19326_002E | Median income (dollars) in the past 12 months (Male) | acs5 | tract |
| B19326_005E | Median income (dollars) in the past 12 months (Female) | acs5 | tract |
language_competency
| code | value | survey | geography |
|---|---|---|---|
| B16004_024E | total 18 to 64 years old | acs5 | tract |
| B16004_025E | Speak only English | acs5 | tract |
| B16004_027E | Speak Spanish & Speak English ‘very well’ | acs5 | tract |
| B16004_032E | Speak other Indo-European languages & Speak English ‘very well’ | acs5 | tract |
| B16004_037E | Speak Asian and Pacific Island languages & Speak English ‘very well’ | acs5 | tract |
| B16004_042E | Speak other languages & Speak English ‘very well’ | acs5 | tract |
communication_capacity
| code | value | survey | geography |
|---|---|---|---|
| B25043_001E | Total Households | acs5 | tract |
| B25043_003E | Owner occupied, with telephone service available | acs5 | tract |
| B25043_007E | Owner occupied, with no telephone service available | acs5 | tract |
| B25043_012E | Renter occupied, with telephone service available | acs5 | tract |
| B25043_016E | Renter occupied, with no telephone service available | acs5 | tract |
nonelder
| code | value | survey | geography |
|---|---|---|---|
| B01001_020E | Male 65-66 | acs5 | tract |
| B01001_021E | Male 67-69 | acs5 | tract |
| B01001_022E | Male 70-74 | acs5 | tract |
| B01001_023E | Male 75-79 | acs5 | tract |
| B01001_024E | Male 80-84 | acs5 | tract |
| B01001_025E | Male 85+ | acs5 | tract |
| B01001_044E | Female 65-66 | acs5 | tract |
| B01001_045E | Female 67-69 | acs5 | tract |
| B01001_046E | Female 70-74 | acs5 | tract |
| B01001_047E | Female 75-79 | acs5 | tract |
| B01001_048E | Female 80-84 | acs5 | tract |
| B01001_002E | Male | acs5 | tract |
| B01001_026E | Female | acs5 | tract |
voting_age
| code | value | survey | geography |
|---|---|---|---|
| B05003_009E | Male, 18 years and over, Native (US Citizen) | acs5 | tract |
| B05003_011E | Male, 18 years and over, Foreign born, Naturalized U.S. citizen | acs5 | tract |
| B05003_020E | Female, 18 years and over, Native (US Citizen) | acs5 | tract |
| B05003_022E | Female, 18 years and over, Foreign born, Naturalized U.S. citizen | acs5 | tract |
| B05001_001E | Total Population | acs5 | tract |
| B05001_006E | Not a US Citizen | acs5 | tract |
gov_workers
| code | value | survey | geography |
|---|---|---|---|
| B08128_006E | Local government workers | acs5 | tract |
| B08128_007E | State government workers | acs5 | tract |
| B08128_008E | Federal government workers | acs5 | tract |
| B08128_001E | Total Federal, State, and Local Government Workers | acs5 | tract |
organizations
| code | value | survey | geography |
|---|---|---|---|
| 8132 | Charitable Organizations | zpb | zipcode |
| 8133 | Social Advocacy Organizations | zpb | zipcode |
| 8134 | Civic and Social Organizations | zpb | zipcode |
| 8139 | Business, Professional, Labor, Political, and Similar Organizations | zpb | zipcode |
| 8131 | Religious Organizations | zpb | zipcode |
zipcode_pop
| code | value | survey | geography |
|---|---|---|---|
| B01003_001E | Total Population | acs5 | zipcode |
fraternal
| code | value | survey | geography |
|---|---|---|---|
| fraternal | Member of fraternal order (% of total) | ESRI | county |
politicalacts
| code | value | survey | geography |
|---|---|---|---|
| politicalacts | % Attended political rally, speech, or organized protest | ESRI | county |
1.3 Download ACS-5 Estimates
So let’s go download those indicators!
Let’s isolate just the ACS-5 census tract variables as
tract_vars, and download them using the
get_acs() function from tidycensus.
# Get the many variables applicable
tract_vars <- myvars %>%
filter(survey == "acs5" & geography == "tract") %>%
mutate(code = code %>% str_extract(".*_[0-9]{3}")) %>%
select(code, value)
# Now cut off the 'E' at the end for 'Estimate';
# tidycensus just wants the code
We’re going to run a loop to get the data for each state-year pair, which will be helpful in the likely event that this code crashes every once in a while. Heads up: we’re downloading many megabytes of data; this should take ~20 minutes.
# Create a folder to hold our acs5 estimates
dir.create("raw_data/acs5")
# Then write a function to procure them
get_data = function(data, variables){
result <- get_acs(
# Inserting a distinct year
year = data$year,
# and a distinct state each time
state = data$state,
# but always grabbing the same set of tract variables from the acs5
geography = "tract",
variables = variables,
survey = "acs5")
result %>%
# and each time, write it to file
# in the format of raw_data/acs5/YEAR_STATE.rds
saveRDS(paste("raw_data/acs5/", data$year, "_", data$state, ".rds", sep = ""))
}
# Then run the function
myunits %>%
# For each year-state pair
split(.$id) %>%
# Run the following get_acs() function many times,
map(~get_data(., variables = tract_vars$code))
1.4 Download Zipcode Business Patterns Estimates
First, let’s gather the 5 variables applicable, using their NAICS2012 codes.
zip_vars <- myvars %>%
filter(survey == "zpb" & geography == "zipcode") %>%
with(code)
Then, let’s build a grid of NAICS code-year pairs, which we will use to
iteratively query the Zipcode Business Patterns API using the
getCensus() function from the censusapi
package. We will save this data as zip_data.rds.
(Notice how we directly insert mykey, our API key,
into the key = argument in this function; it won’t work
without your API key.)
expand_grid(year = as.character(2012:2016),
var = zip_vars) %>%
mutate(id = 1:n()) %>%
split(.$id) %>%
map_dfr(~getCensus(
name = "zbp",
key = mykey,
vintage = .$year,
vars = c("ESTAB", "GEO_ID", "YEAR"),
region = "zipcode:*",
naics2012 = .$var)) %>%
write_rds("raw_data/zip_data.rds")Next, let’s gather the estimated population for every zipcode
tabulation area, from the American Community Survey 5 year estimates. We
can use
tidycensus``'sget_acs()function like before for this. We'll save this aszip_pop.rds```.
# Get the population for every zipcode tabulation area, 2011-2020
data.frame(year = 2011:2020) %>%
split(.$year) %>%
map_dfr(~get_acs(
year = .$year,
geography = "zcta",
variables = "B01003_001",
survey = "acs5"), .id = "year") %>%
# Get the appropriate zipcode
mutate(geoid = str_sub(GEOID, start = -5, end = -1)) %>%
# and simplify columns
select(year, geoid, pop = estimate) %>%
# and save to file
write_rds("raw_data/zip_pop.rds")Then, let’s combine the two datasets to produce a nice wide matrix of
zipcode-years from 2012 to 2016. Unfortunately, earlier and later
editions are not yet available via the API, so we’re just going to carry
forward the rates from 2016 to subsequent years. We calculate the rate
of establishments per 10,000 residents for religious_orgs,
charitable_orgs, advocacy_orgs,
civic_orgs, and unions, and then save them as
zip_rates.rds.
read_rds("raw_data/zip_pop.rds") %>%
left_join(by = c("year", "geoid"),
y = read_rds("raw_data/zip_data.rds") %>%
select(year = YEAR, geoid = zip_code,
value = ESTAB, variable = NAICS2012) %>%
mutate(variable = variable %>% dplyr::recode(
"8131" = "religious_orgs",
"8132" = "charitable_orgs",
"8133" = "advocacy_orgs",
"8134" = "civic_orgs",
"8139" = "unions")) %>%
pivot_wider(id_cols = c(year, geoid), names_from = variable,
values_from = value)) %>%
# Filter to years of data which are available from Zipcode Business Trends API
filter(year %in% 2012:2016) %>%
mutate_at(vars(charitable_orgs, advocacy_orgs, civic_orgs, unions, religious_orgs),
list(~. / pop * 10000)) %>%
# Set any NaN or infinite values (from zero or missing populations) to NA
mutate_at(vars(charitable_orgs, advocacy_orgs, civic_orgs, unions, religious_orgs),
list(~if_else(is.nan(.) | is.infinite(.), NA_real_, .))) %>%
write_rds("raw_data/zip_rates.rds")Finally, let’s consolidate these zipcode business trend observations to the census tract level, using the HUD crosswalk files. To do so, let’s first gather a compendium of zipcode-to-tract crosswalk files for each year.
# For Zipcode to Tract conversion
get_crosswalk = function(from, to, year){
# Build URL of HUD crosswalk file
myurl <- paste("https://www.huduser.gov/portal/datasets/usps/",
toupper(from), "_", toupper(to), "_03", year, ".xlsx", sep = "")
# Create a temporary file to receive the download
myfile <- tempfile()
# Download the file from the URL to the temporary site
download.file(url = myurl, destfile = myfile)
# Import the file
readxl::read_excel(myfile) %>%
select(from = 1, to = 2) %>%
mutate(year = year,
type = paste(from, "->", to, sep = "")) %>%
return()
}
data.frame(year = 2011:2020) %>%
split(.$year) %>%
# For each year, run the crosswalk function
map_dfr(~get_crosswalk(from = "zip", to = "tract", year = .$year)) %>%
# Save compressed version of file
saveRDS("raw_data/crosswalk_zip_to_tract.rds")
And then let’s use these crosswalk files each year to estimate the average rate of organizations per 10,000 residents per tract, for zipcode-census tract pairs that were in existence in that year.
read_rds("raw_data/crosswalk_zip_to_tract.rds") %>%
# Let's join in the zipcode rates to our crosswalk
left_join(by = c("from" = "geoid", "year"),
y = read_rds("raw_data/zip_rates.rds") %>%
mutate(year = as.numeric(year))) %>%
# Then for each year and tract,
group_by(year, geoid = to) %>%
# Calculate the average
summarize_at(
vars(charitable_orgs, advocacy_orgs, civic_orgs, unions, religious_orgs),
list(~mean(., na.rm = TRUE))) %>%
ungroup() %>%
# Fix any NaN or Inf results to NA
mutate_at(
vars(charitable_orgs, advocacy_orgs, civic_orgs, unions, religious_orgs),
list(~if_else(is.nan(.) | is.infinite(.), NA_real_, as.numeric(.)))) %>%
saveRDS("raw_data/tract_rates.rds")1.5 Download County ESRI Estimates
Next, we’re going to borrow two county variables straight from the original county-level indices. These pertain to fraternal orgs and political activities, and we use them because there aren’t many ideal alternatives available at the tract or zipcode level at the moment.
# Finally, to get ESRI's data on fraternal ties and political acts,
# we just borrow it from the original index at the county level.
read_delim("raw_data/Kyne Aldrich Capturing Dataset 2020.tab") %>%
select(geoid, fraternal, politicalacts) %>%
mutate(geoid = geoid %>% str_sub(-5, nchar(geoid))) %>%
write_rds("raw_data/county_esri.rds")1.6 Produce Raw Dataset
Next, we’re process these raw data to produce 1 dataset containing the base data for our indicators. This produces an absolutely enormous dataset, at 343.9 MB when compressed, containing 50+ variables and 700,000+ observations. Yay big data!
get_format = function(data){
print(data$fileid)
read_rds(data$fileid) %>%
mutate(GEOID = as.character(GEOID),
year = str_extract(data$fileid, pattern = "20[0-9]{2}")) %>%
select(year, geoid = GEOID, variable, estimate) %>%
pivot_wider(id_cols = c(geoid, year),
names_from = variable, values_from = estimate) %>%
return()
}
data.frame(fileid = dir("raw_data/acs5", full.names = TRUE)) %>%
split(.$fileid) %>%
map_dfr(~get_format(.)) %>%
# Create a variable for joining in our 2012-2016 variables to the full range
mutate(yearjoin = case_when(
year < 2012 ~ 2012,
year > 2016 ~ 2016,
TRUE ~ as.numeric(year))) %>%
# Join in zipcode-averaged rates
left_join(by = c("geoid", "yearjoin" = "year"),
y = read_rds("raw_data/tract_rates.rds") %>%
filter(year %in% 2012:2016)) %>%
# Get rid of the now-unnecessary joining variable
select(-yearjoin) %>%
# Join in 2 county variables
mutate(county = str_sub(geoid, 1,5)) %>%
left_join(by = c("county" = "geoid"), y = read_rds("raw_data/county_esri.rds")) %>%
write_rds("raw_data/tract_data.rds")
remove(get_format)
The resulting data, raw_data/tract_data.rds, is the data you need to reproduce any of the indicators used in this study.
1.7 Calculate Indicators
Next, let’s use the raw data contained within
raw_data/tract_data.rds to produce our main indicators for
use in this study.
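Before the code: several of these measures are Herfindahl-style fractionalization indices. For group shares $p_g$ (e.g., the share of the population in each racial group), the code below computes

$$\text{fractionalization} = 1 - \sum_{g} p_g^2, \qquad \text{similarity} = 1 - \text{fractionalization} = \sum_{g} p_g^2,$$

so similarity approaches 1 when a single group dominates (homophily) and falls toward 0 when the population is split evenly across many groups (heterophily).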
read_rds("raw_data/tract_data.rds") %>%
# Calculate the race fractionalization index.
mutate(race_fractionalization = (1 - (
(B02001_002 / B01003_001) ^ 2 + # square of % White
(B02001_003 / B01003_001) ^ 2 + # square of % Black
(B02001_004 / B01003_001) ^ 2 + # square of % American Indian/Alaskan Native
(B02001_005 / B01003_001) ^ 2 + # square of % Asian
(B02001_006 / B01003_001) ^ 2 + # square of % native Hawaiian/Pacific Islander
(B02001_007 / B01003_001) ^ 2 + # square of % some other race
(B02001_008 / B01003_001) ^ 2) # square of % two or more races
),
# Now calculate race similarity index, so that 1 = homophily and 0 = heterophily
race_similarity = 1 - race_fractionalization) %>%
# Calculate the ethnicity fractionalization index
mutate(ethnicity_fractionalization = (1 - (
(B03001_002 / B01003_001) ^ 2 + # square of % Not Hispanic/Latino
(B03001_003 / B01003_001) ^ 2) # square of % Hispanic/Latino
),
ethnicity_similarity = 1 - ethnicity_fractionalization) %>%
#Calculate educational similarity
# get absolute value of difference between college and high school educated persons
# where a large gap indicates homophily and small gap indicates heterophily
#"EDUCATIONAL ATTAINMENT FOR THE POPULATION 25 YEARS AND OVER"
mutate(
# Calculate % with less than high school education
# = 1 minus % with high school degree or more
less_than_high_school = 1 - (
(B15002_011 + #High School Graduate, GED, or alternative (Male),
B15002_012 + #Some college, less than 1 year (Male),
B15002_013 + #Some college, 1 or more years, no degree (Male),
B15002_014 + #Associate's degree (Male),
B15002_015 + #Bachelor's degree (Male),
B15002_016 + #Master's degree (Male),
B15002_017 + #Professional school Degree (Male),
B15002_018 + #Doctoral Degree (Male),
# For women
B15002_028 + #High School Graduate, GED, or alternative (Female),
B15002_029 + #Some college, less than 1 year (Female),
B15002_030 + #Some college, 1 or more years, no degree (Female),
B15002_031 + #Associate's degree (Female),
B15002_032 + #Bachelor's degree (Female),
B15002_033 + #Master's degree (Female),
B15002_034 + #Professional school Degree (Female),
B15002_035 #Doctoral Degree (Female)
) / B15002_001), # total population ages 25 or above
# Calculate % with college education or more
# sum of all types of education at or above college level, divided by total
bachelors_or_more = (
B15002_015 + #Bachelor's degree (Male),
B15002_016 + #Master's degree (Male),
B15002_017 + #Professional school Degree (Male),
B15002_018 + #Doctoral Degree (Male),
B15002_032 + #Bachelor's degree (Female),
B15002_033 + #Master's degree (Female),
B15002_034 + #Professional school Degree (Female),
B15002_035 #Doctoral Degree (Female)
) /
B15002_001, # total population ages 25 or above
EE = abs(less_than_high_school - bachelors_or_more),
# now subtract it from 1, where 1 now = heterophily and 0 = homophily
education_equality = 1-EE) %>%
# Bonding: Race/income equality
# Get gini index
# Gini coefficient (0 = perfect equality to 1 = perfect inequality)
mutate(income_inequality = B19083_001,
income_equality = 1 - income_inequality) %>%
# Bonding: Employment equality
# Absolute difference between % of total employed and % of total unemployed labor force
mutate(
percent_employed = (B23025_004 / B23025_003),
percent_unemployed = (B23025_005 / B23025_003),
employment_inequality = abs(percent_employed - percent_unemployed),
# rescale to range from 0 to 1
employment_inequality = scales::rescale(employment_inequality, to = c(0,1)),
# Here, like with college,
# 1 now = heterophily, while 0 = homophily.
# So, let's rescale it so 1 = homophily
employment_equality = 1 - employment_inequality) %>%
# Get county mean for missing data
group_by(county) %>%
# If values are missing, impute them with the mean value from that variable
mutate_at(vars(B19326_001, B19326_002, B19326_005),
list(~if_else(is.na(.), mean(., na.rm = TRUE), .))) %>%
ungroup() %>%
#Calculate gender income similarity
# Gender Income Similarity
# Median income in the past 12 months (in 20XX inflation-adjusted dollars)
# adjusted by Census Bureau to whatever year that census was (eg. 2010, 2011, 2012, etc.)
# Gender Income Fractionalization (0 = homogeneity, 1 = heterogeneity)
mutate(
# Get percentage of total median income that is for men, squared
MI = (B19326_002 / (B19326_002 + B19326_005)) ^ 2,
# Get percentage of total median income that is for women, squared
FI = (B19326_005 / (B19326_002 + B19326_005)) ^ 2,
# How much space remains? Calculate fractionalization (low = homophily, high = heterophily)
gender_income_fractionalization = (1 - (MI + FI)),
# Now, to get a larger spanning variable, rescale.
# smallest values become 0 (homophily) and
# largest values become 1 (heterophily)
gender_income_fractionalization = (gender_income_fractionalization - min(gender_income_fractionalization, na.rm = TRUE)) /
# Divide by range
(max(gender_income_fractionalization, na.rm = TRUE) - min(gender_income_fractionalization, na.rm = TRUE)),
# Calculate gender income similarity, by subtracting from 1 so that
# 1 = homophily and 0 = heterophily
gender_income_similarity = 1 - gender_income_fractionalization
) %>%
#Bonding: Language competency
# % of total population proficient English speakers
mutate(lang = ((
B16004_025 + # "Speak only English",
B16004_027 + # "Speak Spanish & Speak English 'very well'",
B16004_032 + # "Speak other Indo-European languages & Speak English 'very well'",
B16004_037 + # "Speak Asian and Pacific Island languages & Speak English 'very well'",
B16004_042 # "Speak other languages & Speak English 'very well'"
) / B16004_024) # "total 18 to 64 years old",
) %>%
# Bonding: Communication capacity
# % of total households with a telephone
mutate(telephone = ((B25043_003 + # "Owner occupied, with telephone service available",
B25043_012 # "Renter occupied, with telephone service available",
) /
B25043_001) # total households
) %>%
# Bonding: Non-elder population
# % of total population below 65 years of age
mutate(
# elder population
elder = (
# Sum of elder male residents (65+)
B01001_020 + # Male 65-66
B01001_021 + # Male 67-69
B01001_022 + # Male 70-74
B01001_023 + # Male 75-79
B01001_024 + # Male 80-84
B01001_025 + # Male 85+
# Sum of elder female residents (65+)
B01001_044 + # Female 65-66
B01001_045 + # Female 67-69
B01001_046 + # Female 70-74
B01001_047 + # Female 75-79
B01001_048) # Female 80-84
# Divided by sum of male and female residents
/ (B01001_002 + # Male
B01001_026), # Female
# non-elder population
non_elder = 1 - elder) %>%
# % of Citizens who are of Voting Age
mutate(voting_age = (
# = Citizens of Voting Age / (Total Pop - Not Citizen)
B05003_009 + # "Male, 18 years and over, Native (US Citizen)",
B05003_011 + # "Male, 18 years and over, Foreign born, Naturalized U.S. citizen",
B05003_020 + # "Female, 18 years and over, Native (US Citizen)",
B05003_022 # "Female, 18 years and over, Foreign born, Naturalized U.S. citizen",
) / (
B05001_001 - # "Total Population",
B05001_006) # "Not a US Citizen"
) %>%
# % of government workers at a given level of government
mutate(local_gov = B08128_006 / B08128_001, # Local govt workers / total govt workers
state_gov = B08128_007 / B08128_001, # State govt workers / total govt workers
fed_gov = B08128_008 / B08128_001 # federal govt workers/ total govt workers
) %>%
select(geoid, year, county, race_similarity, ethnicity_similarity, education_equality,
income_equality, employment_equality, gender_income_similarity,
lang, telephone, non_elder,
religious_orgs, charitable_orgs, advocacy_orgs, civic_orgs, unions, fraternal,
voting_age, local_gov, state_gov, fed_gov, politicalacts) %>%
write_rds("raw_data/indicators.rds")1.8 Rescale Indicators
# Several variables are actually rates, so to normalize them, we log-transform them.
# First, in the event of zeros, we're going to replace these with a small-but-realistic value,
# since you can't log-transform the value zero.
# We're going to identify two points - the smallest non-zero value and half that value.
# that lower value will be our replacement for zero.
replace_zero = function(x){
# Take the non-zero values...
xhat <- x[x != 0]
# ...find the smallest of them...
upper <- min(xhat, na.rm = TRUE)
# ...and return half that value as our replacement for zero
lower <- upper / 2
return(lower)
}
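# For example (a quick sanity check; not part of the pipeline):
# replace_zero(c(0, 2, 5)) returns 1, half the smallest non-zero value.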
# Read in our indicators
read_rds("raw_data/indicators.rds") %>%
# Let's impute the county-mean for our rates of organizations,
# to try to reduce missing data in a spatially conscious way,
# since these are spatially aggregated variables from the zipcode level anyways.
group_by(year, county) %>%
mutate_at(vars(religious_orgs, charitable_orgs, advocacy_orgs, civic_orgs),
funs(if_else(
condition = is.na(.),
true = mean(., na.rm = TRUE),
false = .))) %>%
ungroup() %>%
# recode any Infinite responses as NA
mutate_at(vars(race_similarity:politicalacts),
list(~if_else(is.infinite(.) | is.nan(.), NA_real_, .))) %>%
# Next, a few of these variables are actually rates, so they are naturally right-skewed.
# In order to be nice and comparable to other indicators,
# we're going to log-transform them to make them normally distributed again.
# To do so, we need to impute 0s with a very small but realistic value, as described above.
mutate_at(vars(advocacy_orgs, civic_orgs,religious_orgs, unions,
charitable_orgs, fraternal, local_gov, state_gov, fed_gov),
list(~case_when(. == 0 ~ replace_zero(.), TRUE ~ .))) %>%
# Now take these variables and log-transform them
mutate_at(vars(charitable_orgs,advocacy_orgs, civic_orgs,religious_orgs, unions,
fraternal, local_gov, state_gov, fed_gov),
list(~log(.))) %>%
# Cap all values below the 2.5th or above the 97.5th percentile,
# so extreme values do not unduly bias our estimates for the majority of communities
mutate_at(vars(race_similarity:politicalacts),
list(~case_when(
. < quantile(., probs = 0.025, na.rm = TRUE) ~ quantile(., probs = 0.025, na.rm = TRUE),
. > quantile(., probs = 0.975, na.rm = TRUE) ~ quantile(., probs = 0.975, na.rm = TRUE),
TRUE ~ .))) %>%
# Now rescale ALL variables (with their new polarities) from a min of 0 to a max of 1
mutate_at(vars(race_similarity:politicalacts),
list(~scales::rescale(.))) %>%
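# (For example, scales::rescale(c(2, 4, 6)) returns 0.0, 0.5, 1.0:
# each variable's minimum maps to 0 and its maximum maps to 1.)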
saveRDS("raw_data/indicators_rescaled.rds")2. Checking for Missing Data
Next, let’s check levels of missing data in our indicators. We’ll save the missing-data percentages for the original indicators in v1 and for the imputed indicators in v2. When we bind them together, we see that imputation reduces missing data a fair amount.
v1 <- read_rds("raw_data/indicators.rds") %>%
summarize_at(vars(race_similarity:politicalacts),
list(~round(sum(is.na(.)) / n() * 100, 1)))
v2 <- read_rds("raw_data/indicators_rescaled.rds") %>%
summarize_at(vars(race_similarity:politicalacts),
list(~round(sum(is.na(.)) / n() * 100, 1)))
# Compare missing data before imputing county mean for zipcode measures and after
mymissing <- bind_rows(
v1 %>% mutate(type = "original"),
v2 %>% mutate(type = "imputed")) %>%
pivot_longer(
cols = c(race_similarity:politicalacts),
names_to = "measure",
values_to = "value") %>%
pivot_wider(
id_cols = measure,
names_from = type,
values_from = value)
| measure | original | imputed |
|---|---|---|
| race_similarity | 0.9 | 0.9 |
| ethnicity_similarity | 0.9 | 0.9 |
| education_equality | 0.9 | 0.9 |
| income_equality | 1.2 | 1.2 |
| employment_equality | 1.0 | 1.0 |
| gender_income_similarity | 0.0 | 0.0 |
| lang | 0.9 | 0.9 |
| telephone | 1.6 | 1.6 |
| non_elder | 0.9 | 0.9 |
| religious_orgs | 4.2 | 0.1 |
| charitable_orgs | 34.0 | 5.1 |
| advocacy_orgs | 35.6 | 4.8 |
| civic_orgs | 20.7 | 2.0 |
| unions | 10.5 | 10.5 |
| fraternal | 0.0 | 0.0 |
| voting_age | 0.9 | 0.9 |
| local_gov | 1.0 | 1.0 |
| state_gov | 1.0 | 1.0 |
| fed_gov | 1.0 | 1.0 |
| politicalacts | 0.0 | 0.0 |
3. Descriptives
Normality Testing
How normally distributed are our final indicators?
# Get variables
vars_bonding <- c("race_similarity", "ethnicity_similarity", "education_equality",
"income_equality", "employment_equality", "gender_income_similarity",
"lang", "telephone", "non_elder")
# Get variables
vars_bridging <- c("civic_orgs", "advocacy_orgs", "charitable_orgs",
"religious_orgs", "unions", "fraternal")
# Get variables
vars_linking <- c("voting_age", "local_gov", "state_gov", "fed_gov", "politicalacts")
library(tidyverse)
library(shadowtext)
obslong <- read_rds("raw_data/indicators_rescaled.rds") %>%
sample_n(size = 10000, replace = FALSE) %>%
pivot_longer(cols = c(race_similarity:politicalacts),
names_to = "variable", values_to = "value") %>%
mutate(group = case_when(
variable %in% vars_bonding ~ "bonding",
variable %in% vars_bridging ~ "bridging",
variable %in% vars_linking ~ "linking")) %>%
mutate(variable = variable %>% dplyr::recode_factor(
# Bonding
"race_similarity" = "Race\nSimilarity",
"ethnicity_similarity" = "Ethnicity\nSimilarity",
"education_equality" = "Educational\nEquality",
"income_equality" = "Race/Income\nEquality",
"employment_equality" = "Employment\nEquality",
"gender_income_similarity" = "Gender Income\nSimilarity",
"lang" = "Language\nCompetency",
"telephone" = "Communication\nCapacity",
"non_elder" = "Non-Elder\nPopulation",
# Bridging
"religious_orgs" = "Religious\nOrgs",
"civic_orgs" = "Civic\nOrgs",
"advocacy_orgs" = "Social\nAdvocacy Orgs",
"charitable_orgs" = "Charitable\nOrgs",
"fraternal" = "Fraternal\nOrders",
"unions" = "Unions",
# Linking
"voting_age" = "Political\nLinkage",
"local_gov" = "Local Govt.\nEmployees",
"state_gov" = "State Govt.\nEmployees",
"fed_gov" = "Fed. Govt.\nEmployees",
"politicalacts" = "Political\nActivities") %>%
factor(levels = levels(.)))
# Get variables of interest
mylabels <- obslong %>%
group_by(variable, group) %>%
summarize(y = 0.5,
x = min(value, na.rm = TRUE))
g1 <- ggplot(data = obslong) +
geom_density(mapping = aes(x = value, y= ..scaled.., fill = group, group = variable)) +
geom_shadowtext(data = mylabels, mapping = aes(x = x, y = y, sample = NULL,
color = group, label = variable),
bg.color = "white", color = "black", bg.r = 0.2, hjust = 0, size = 3, fontface = "bold") +
facet_wrap(~variable, scales = "free") +
theme_classic(base_size = 14) +
theme(panel.border = element_blank()) +
theme(legend.position = "bottom", strip.text.x = element_blank(),
plot.caption = element_text(hjust= 0.5),
plot.subtitle = element_text(hjust = 0.5)) +
guides(fill = "none", color = "none") +
scale_fill_manual(values = c("#648FFF", "#DC267F", "#FFB000")) +
scale_color_manual(values = c("#648FFF", "#DC267F", "#FFB000")) +
labs(x = "Observed Values",
y = "Frequency (%)",
subtitle = "Distributions of Indicators",
caption = "Based on a random sample of 10,000 census-tract years, due to large sample size.") +
scale_x_continuous(breaks = c(0, 0.25, 0.5, 0.75, 1),
labels = c("0", ".25", ".5", ".75", "1")) +
scale_y_continuous(breaks = c(0, 0.25, 0.5, 0.75, 1),
labels = c("0", ".25", ".5", ".75", "1"))
dir.create("images")
ggsave(g1, filename = "images/densityplot.png", dpi = 500, width = 8, height = 6)
include_graphics("images/densityplot.png")
Great! Fairly normal distributions, except in cases that make sense - e.g., the distribution of the percentage of Americans who speak English proficiently is probably pretty left-skewed.
remove(g1, obslong, mylabels, mymissing)
Mean vs. Median
# Get variables
vars_bonding <- c("race_similarity", "ethnicity_similarity", "education_equality",
"income_equality", "employment_equality", "gender_income_similarity",
"lang", "telephone", "non_elder")
# Get variables
vars_bridging <- c("religious_orgs", "charitable_orgs", "advocacy_orgs",
"civic_orgs", "unions", "fraternal")
# Get variables
vars_linking <- c("voting_age", "local_gov", "state_gov", "fed_gov", "politicalacts")
# Import indicators
total <- read_rds("raw_data/indicators_rescaled.rds")
# Calculate social capital index using mean
data.frame(
year = total$year,
geoid = total$geoid,
bonding = matrixStats::rowMeans2(x = total[, vars_bonding] %>% as.matrix()),
bridging = matrixStats::rowMeans2(x = total[, vars_bridging] %>% as.matrix()),
linking = matrixStats::rowMeans2(x = total[, vars_linking] %>% as.matrix())) %>%
mutate(social_capital = matrixStats::rowMeans2(.[, c("bonding", "bridging", "linking")] %>% as.matrix())) %>%
select(year, geoid, social_capital, bonding, bridging, linking) %>%
saveRDS("raw_data/index_tract_mean.rds")
# Calculate social capital index using medians
data.frame(
year = total$year,
geoid = total$geoid,
bonding = matrixStats::rowMedians(x = total[, vars_bonding] %>% as.matrix()),
bridging = matrixStats::rowMedians(x = total[, vars_bridging] %>% as.matrix()),
linking = matrixStats::rowMedians(x = total[, vars_linking] %>% as.matrix())) %>%
mutate(social_capital = matrixStats::rowMedians(.[, c("bonding", "bridging", "linking")] %>% as.matrix())) %>%
select(year, geoid, social_capital, bonding, bridging, linking) %>%
saveRDS("raw_data/index_tract_median.rds")
remove(total)
bind_rows(
read_rds("raw_data/index_tract_mean.rds") %>% mutate(type = "mean"),
read_rds("raw_data/index_tract_median.rds") %>% mutate(type = "median")) %>%
pivot_longer(cols = c(social_capital:linking), names_to = "variable", values_to = "value") %>%
pivot_wider(id_cols = c(geoid, year, variable), names_from = type, values_from = value) %>%
group_by(variable) %>%
summarize(r = cor(mean, median, use = "pairwise.complete.obs")) %>%
write_rds("raw_data/correlation.rds")How closely do they correlate?
read_rds("raw_data/correlation.rds")| variable | r |
|---|---|
| bonding | 0.8270879 |
| bridging | 0.9623829 |
| linking | 0.8342117 |
| social_capital | 0.7531203 |
Looks like when we adjust to take the medians instead, it gets a little murky. The median produces extremely similar results for bridging (.96), but somewhat more disparate results for bonding (.83), linking (.83), and social_capital (.75). However, all associations remain extremely strong and positive, which is good. The mean is generally better than the median here, because we want to give each indicator equal weight, since each represents a different aspect of connectivity. For example, given 5 indicators, if 4 were higher, a much lower score for the 5th indicator would have no impact on the score if we took the median. Using the mean, we ensure that each indicator has at least some impact on the index.
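To make that concrete, here’s a quick toy illustration of that 5-indicator example:
# Toy example: with 5 indicators, a much lower 5th score
# leaves the median unchanged but lowers the mean.
x_even <- c(0.8, 0.8, 0.8, 0.8, 0.8)
x_low5 <- c(0.8, 0.8, 0.8, 0.8, 0.1)
median(x_even); median(x_low5) # both 0.8
mean(x_even); mean(x_low5) # 0.80 versus 0.66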
Only a few census tracts are missing data after making the index. Roughly 10,000 census tract-year observations is actually tiny when taken in the context of all ~700,000 (about 1.5%).
read_rds("raw_data/index_tract_mean.rds") %>%
filter(year == 2011) %>%
select(social_capital:linking) %>%
summary()
## social_capital bonding bridging linking
## Min. :0.207 Min. :0.1788 Min. :0.000 Min. :0.0188
## 1st Qu.:0.494 1st Qu.:0.5110 1st Qu.:0.375 1st Qu.:0.5246
## Median :0.552 Median :0.5837 Median :0.480 Median :0.5988
## Mean :0.544 Mean :0.5766 Mean :0.478 Mean :0.5850
## 3rd Qu.:0.600 3rd Qu.:0.6480 3rd Qu.:0.577 3rd Qu.:0.6617
## Max. :0.838 Max. :0.9371 Max. :0.997 Max. :0.9431
## NA's :10790 NA's :943 NA's :10355 NA's :734
rm(list = ls())
4. Generate Indices
Next, let’s generate our indices using multiple imputation via the Amelia package in R. We draw from the full universe of census tracts, using imputation over time to help predict the same census tracts across different years. Imputations are bounded between 0 and 1, to match the actual limits of the variables.
library(tidyverse)
library(Amelia)
dir.create("mi")
set.seed(98754)
read_rds("raw_data/indicators_rescaled.rds") %>%
select(-county) %>%
mutate(year = as.numeric(year)) %>%
as.data.frame() %>%
Amelia::amelia(m = 5,
ts = "year", cs = "geoid",
max.resample = 1000, parallel = "multicore",
# Bound the 20 indicator columns between 0 and 1
# (columns 3 to 22; columns 1 and 2 hold geoid and year)
bounds = data.frame(column.number = 3:22,
lower.bound = 0,
upper.bound = 1) %>%
as.matrix()) %>%
saveRDS("mi/mi.rds")
4.1 Building Imputed Indices
Next, let’s combine these imputed indicators into an index.
# We're going to work straight from the imputed data
library(tidyverse)
library(Amelia)
# Import and restructure imputations as tibble
read_rds("mi/mi.rds")$imputations %>%
map_dfr(~as_tibble(.), .id = "imp") %>%
write_rds("mi/mi_imp.rds")Having turned the imputed indicators into a giant
tibble, we are going to now using the
matrixStats package to average these indicators. I
recommend matrixStats in this case because we need to do
these functions on absolutely enormous datasets. (700,000 tract-years x
5 imputations).
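As a quick illustration of what these row-wise functions do:
# rowMeans2() averages each row of a numeric matrix, so each
# tract-year's indicators collapse into one score in a single fast pass.
m <- matrix(1:6, nrow = 2) # rows: (1, 3, 5) and (2, 4, 6)
matrixStats::rowMeans2(m) # returns 3 and 4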
# Load imputed cells
mymi <- read_rds("mi/mi_imp.rds")
################
# Bonding
################
# Get variables
vars_bonding <- c("race_similarity", "ethnicity_similarity", "education_equality",
"income_equality", "employment_equality", "gender_income_similarity",
"lang", "telephone", "non_elder")
# Get stats
mybonding <- tibble(
mean = mymi[,vars_bonding] %>% as.matrix() %>% matrixStats::rowMeans2(., na.rm = TRUE),
sd = mymi[,vars_bonding] %>% as.matrix() %>% matrixStats::rowSds(., na.rm = TRUE))
################
# Bridging
################
# Get variables
vars_bridging <- c("religious_orgs", "charitable_orgs", "advocacy_orgs",
"civic_orgs", "unions", "fraternal")
# Get stats
mybridging <- tibble(
mean = mymi[,vars_bridging] %>% as.matrix() %>% matrixStats::rowMeans2(., na.rm = TRUE),
sd = mymi[,vars_bridging] %>% as.matrix() %>% matrixStats::rowSds(., na.rm = TRUE))
################
# Linking
################
# Get variables
vars_linking <- c("voting_age", "local_gov", "state_gov", "fed_gov", "politicalacts")
# Get stats
mylinking <- tibble(
mean = mymi[,vars_linking] %>% as.matrix() %>% matrixStats::rowMeans2(., na.rm = TRUE),
sd = mymi[,vars_linking] %>% as.matrix() %>% matrixStats::rowSds(., na.rm = TRUE))
# Bind together the results
total <- bind_rows(
mybonding %>%
mutate(imp = mymi$imp,
geoid = mymi$geoid,
year = mymi$year,
type = "bonding"),
mybridging %>%
mutate(imp = mymi$imp,
geoid = mymi$geoid,
year = mymi$year,
type = "bridging"),
mylinking %>%
mutate(imp = mymi$imp,
geoid = mymi$geoid,
year = mymi$year,
type = "linking")) %>%
# Pivot so that imputations are in columns, not rows
pivot_wider(id_cols = c(type, geoid, year), names_from = imp, values_from = c(mean, sd))
# Remove the excess data
remove(mymi, mybonding, mybridging, mylinking,
vars_bonding, vars_bridging, vars_linking)
4.2 Melding Indices
Next, we’re going to meld our 5 imputed indices into a single index, using Rubin’s Rules, as is considered proper with multiple imputation.
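For reference, given $m$ imputations with estimates $\hat{q}_i$ and standard errors $\widehat{se}_i$, the standard formulation of Rubin’s Rules (which mi.meld() applies) combines them as:

$$\bar{q} = \frac{1}{m}\sum_{i=1}^{m}\hat{q}_i, \qquad T = \underbrace{\frac{1}{m}\sum_{i=1}^{m}\widehat{se}_i^{\,2}}_{\text{within-imputation variance}} + \left(1 + \frac{1}{m}\right)\underbrace{\frac{1}{m-1}\sum_{i=1}^{m}\left(\hat{q}_i - \bar{q}\right)^2}_{\text{between-imputation variance}}$$

where the melded standard error is $\sqrt{T}$.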
dir.create("index")
total %>%
# Now meld the results for the five imputations
with(
# Now, we want to generate single bonding, bridging, and linking indices,
# combining them and incorporating error within and between imputations
# by merging them with Rubin's Rules
Amelia::mi.meld(
q = select(., mean_imp1:mean_imp5) %>%
as.matrix(),
se = select(., sd_imp1:sd_imp5) %>%
as.matrix(),
byrow = FALSE) ) %>%
# Now report them in a final tibble
with(
tibble(
# Add in unique identifiers
type = total$type,
geoid = total$geoid,
year = total$year,
# And the vectors of melded results
score = .$q.mi %>% as.vector(),
se = .$se.mi %>% as.vector()
)
) %>%
# Now pivot so each index is a column
pivot_wider(
id_cols = c(year, geoid),
names_from = type,
values_from = c(score, se)) %>%
rename(bonding = score_bonding,
bridging = score_bridging,
linking = score_linking,
bonding_se = se_bonding,
bridging_se = se_bridging,
linking_se = se_linking) %>%
# Get social capital index
mutate(social_capital = (bonding + bridging + linking)/3 ) %>%
select(year, geoid, social_capital, bonding, bridging, linking,
bonding_se, bridging_se, linking_se) %>%
# Save to file
saveRDS("index/index_tract_V3_04_10_2022.rds")
remove(total)
# Delete this hulking dataset of imputations, since the original is stored in mi.rds
unlink("mi/mi_imp.rds")
rm(list = ls())
5. Compare with Past Versions
Let’s compare our new index with the previous version of the index, which we used in our Nature Scientific Reports paper. We will join them together into the dataframe compare, and then verify that they remain closely correlated.
compare <- read_rds("index/index_tract_V3_04_10_2022.rds") %>%
filter(year %in% c(2011:2018)) %>%
select(year, geoid,
sc_new = social_capital,
bonding_new = bonding,
bridging_new = bridging,
linking_new = linking) %>%
left_join(by = c("year", "geoid"),
y = read_rds("index/index_tract_V2_01_31_2022.rds") %>%
filter(year %in% c(2011:2018)) %>%
select(year, geoid, social_capital, bonding, bridging, linking))
Next, we calculate the correlations.
compare %>%
summarize(sc_new = cor(sc_new, social_capital, use = "pairwise.complete.obs"),
bonding_new = cor(bonding_new, bonding, use = "pairwise.complete.obs"),
bridging_new = cor(bridging_new, bridging, use = "pairwise.complete.obs"),
linking_new = cor(linking_new, linking, use = "pairwise.complete.obs"))
## # A tibble: 1 × 4
## sc_new bonding_new bridging_new linking_new
## <dbl> <dbl> <dbl> <dbl>
## 1 0.939 0.820 0.918 0.999
We see that the new edition of the indices is closely correlated with the old edition, with Pearson’s r statistics of +0.82 or greater, which is an extremely strong correlation. The slight variation in correlation for the bonding and bridging indices reflects a few minor modifications I made in this round:
- Using year-by-year zipcode-to-tract crosswalks, to ensure that we take into account changes over time in zipcode and tract boundaries.
- Using better census variables for education. Version 2 focused on the educational patterns of a consistent age group, 18 to 24 year olds, in an attempt to adjust for age, but this meant we didn’t get a good picture of the full educational diversity of the community. Version 3 uses the educational patterns of all residents ages 25 years or above (that is, the adult, mostly out-of-school population), which allows us to better capture the educational background of the entire community.
- Calculating indices only for the census tracts for which the American Community Survey 5-year estimates report data; in other words, populated census tracts. In the past, we calculated them for the full set of census tract geographies in the TIGRIS spatial datasets, but this meant some places with a population of 0 received social capital estimates. To avoid this and maximize the clarity of these indices, we focus on just the census tracts reported in the ACS-5. As a result, the annual sample size may differ slightly from previous versions, but hopefully it is more precise!
This third version of the indices reflects our team’s best efforts at producing a fully reproducible set of indices.
6. Exporting
Finally, let’s convert these to an easier format for use in other programs.
6.1 Making Zipcode Measures
First, let’s calculate our indices for every zipcode, averaging up from the tract level.
We can make a tract to zipcode crosswalk using the HUD crosswalk
files. Let’s bring back our earlier function
get_crosswalk() to do that!
# For Tract to Zipcode conversion
get_crosswalk = function(from, to, year){
# Build URL of HUD crosswalk file
myurl <- paste("https://www.huduser.gov/portal/datasets/usps/",
toupper(from), "_", toupper(to), "_03", year, ".xlsx", sep = "")
# Label the crosswalk direction now, before 'from' and 'to'
# get masked by the column names created below
mytype <- paste(from, "->", to, sep = "")
# Create a temporary file to receive the download
myfile <- tempfile()
# Download the file from the URL to the temporary site
download.file(url = myurl, destfile = myfile)
# Import the file
readxl::read_excel(myfile) %>%
select(from = 1, to = 2) %>%
mutate(year = year,
type = mytype) %>%
return()
}
data.frame(year = 2011:2020) %>%
split(.$year) %>%
# For each year, run the crosswalk function
map_dfr(~get_crosswalk(from = "tract", to = "zip", year = .$year)) %>%
# Save compressed version of file
saveRDS("raw_data/crosswalk_tract_to_zip.rds")And now, we’ll use that massive 2011-2020 crosswalk to aggregate census tract measures to the appropriate zipcodes that existed in each year.
read_rds("raw_data/crosswalk_tract_to_zip.rds") %>%
select(from, to, year) %>%
# Extract the state from each census tract
mutate(state_code = str_sub(from, 1,2)) %>%
# Join in the state name for each tract
left_join(by = "state_code", y = tigris::fips_codes %>%
select(state, state_code) %>%
distinct()) %>%
# Join in the census tract index scores
left_join(by = c("from" = "geoid", "year"),
y = read_rds("index/index_tract_V3_04_10_2022.rds")) %>%
# Get the average score for each zipcode per state-year
group_by(state, geoid = to, year) %>%
summarize(social_capital = mean(social_capital, na.rm = TRUE),
bonding = mean(bonding, na.rm = TRUE),
bridging = mean(bridging, na.rm = TRUE),
linking = mean(linking, na.rm = TRUE)) %>%
ungroup() %>%
# Some zipcodes could not be estimated, because their overlapping census tracts
# had no census data available for measuring indices. We cut them here.
filter(!is.na(social_capital)) %>%
write_rds("index/index_zipcode_V3_04_10_2022.rds")6.2 Making County Subdivision Measures
Second, let’s calculate these indices at the county subdivision level, averaging up from the tract level.
Sadly, no such crosswalk currently exists between tracts and subdivisions for every single year, but the Missouri Census Data Center has created a near equivalent: using their Geocorr 2018 program, you can automatically build your own tract-to-county-subdivision crosswalk according to 2010 geographies. We use this crosswalk to build the county subdivision indices (a sketch of how to construct it follows the notes below). Two notes:
- Subdivision boundaries generally change less frequently than zipcodes, so we should be fine here.
- If boundary change is a concern for your case of interest, I recommend you use the census tract indices, which offer the highest level of granularity and precision currently available.
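This document doesn’t build raw_data/crosswalk_tract_to_csub.rds itself; below is a minimal sketch of how one might construct it from a Geocorr 2018 export. The filename geocorr2018.csv and the column names (county, tract, cousub) are assumptions; check your own Geocorr download.
# A minimal sketch, assuming a hypothetical Geocorr 2018 export
# 'raw_data/geocorr2018.csv' with county FIPS, tract, and county
# subdivision (cousub) codes. Column names are assumptions.
library(tidyverse)
read_csv("raw_data/geocorr2018.csv") %>%
transmute(
# 11-digit tract geoid = 5-digit county FIPS + 6-digit tract code
from = paste0(county, str_remove(tract, "\\.")),
# 10-digit county subdivision geoid = county FIPS + 5-digit cousub code
to = paste0(county, cousub)) %>%
distinct() %>%
write_rds("raw_data/crosswalk_tract_to_csub.rds")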
read_rds("raw_data/crosswalk_tract_to_csub.rds") %>%
select(from, to) %>%
# Extract the state from each census tract
mutate(state_code = str_sub(from, 1,2)) %>%
# Join in the state name for each tract
left_join(by = "state_code", y = tigris::fips_codes %>%
select(state, state_code) %>%
distinct()) %>%
# Join in the census tract index scores
left_join(by = c("from" = "geoid"),
y = read_rds("index/index_tract_V3_04_10_2022.rds")) %>%
# Get average score per census county subdivision, per year, per state
group_by(state, geoid = to, year) %>%
summarize(social_capital = mean(social_capital, na.rm = TRUE),
bonding = mean(bonding, na.rm = TRUE),
bridging = mean(bridging, na.rm = TRUE),
linking = mean(linking, na.rm = TRUE)) %>%
ungroup() %>%
# 10 subdivisions in rural NY end up without a corresponding year or values;
# They were not estimatable from current data. we cut these.
filter(!is.na(year)) %>%
write_rds("index/index_csub_V3_04_10_2022.rds")6.3 Exporting to .csv
Finally, let’s reformat and export these .rds files as .csv files, which can be more easily used across many different platforms. This will unfortunately cause the file sizes to balloon, since our .rds files are compressed by default, but I guess that’s to be expected with a 700,000-row dataset.
library(tidyverse)
# Export census tracts
read_rds("index/index_tract_V3_04_10_2022.rds") %>%
mutate(state_code = str_sub(geoid, 1,2)) %>%
left_join(by = "state_code", y = tigris::fips_codes %>%
select(state, state_code) %>%
distinct()) %>%
select(state, year, geoid, social_capital:linking_se) %>%
write_csv("index/index_tracts_V3_04_10_2022.csv")
# Export zipcodes
read_rds("index/index_zipcode_V3_04_10_2022.rds") %>%
write_csv("index/index_zipcode_V3_04_10_2022.csv")
# Export county subdivisions
read_rds("index/index_csub_V3_04_10_2022.rds") %>%
write_csv("index/index_csub_V3_04_10_2022.csv")6.4 Final Notes
And that’s a wrap! If you’ve read this far, thank you for your interest in these indices! We hope you and your team are able to put these indices to good use! For complete replication code
If you have questions regarding the indices, feel free to shoot me an email at timothy.fraser.1@gmail.com, or check out some of the research our team is doing with these indices at www.timothyfraser.com/research.
Best wishes in your research!