Social Capital Indices for Census Tracts, Zipcodes, & County Subdivisions

Replication Code for Indices

This RMarkdown document identifies each step necessary for creating a census-tract level social capital index. These indices are the most up-to-date version of the indices presented in our recent article in Nature Scientific Reports (DOI: 10.1038/s41598-022-10275-z), entitled Social capital’s impact on COVID-19 outcomes at local levels.


Version History

Version 1: 02/18/2021 (https://papers.ssrn.com/abstract=3788540)

Version 2: 01/28/2022 (DOI: 10.1038/s41598-022-10275-z)

  • Log-transformed bridging indicators, to ensure normal distributions.

  • Capped indicators at 2.5th and 97.5th percentiles, to avoid outlier bias.

Version 3: 04/08/2022 (current document)

  • Added years 2019 and 2020, using updated ACS-5 estimates.

  • Added estimates for Washington, DC.

  • Improved formulation for Education Equality indicator.

  • Calculated the Voting Age Population directly from the Census API.

  • Note: indices are estimated using multiple imputation across the full 2011-2020 dataset, so index scores may change ever so slightly between Version 2 and Version 3, since Version 3 can now draw on years 2019-2020 to make better imputations over time. I encourage using the most up-to-date version.

Process

This process includes four parts:

  1. Collect indicators of bonding social capital:
  • Communication Capacity

  • Educational Similarity

  • Race Similarity

  • Ethnicity Similarity

  • Employment Equality

  • Gender Income Similarity

  • Income Inequality

  • Language

  • Non-elder Population

  2. Collect indicators of bridging social capital:
  • Civic Orgs per 10,000 persons

  • Charitable Orgs per 10,000 persons

  • Religious Orgs per 10,000 persons

  • Advocacy Orgs per 10,000 persons

  • Unions per 10,000 persons

  • Fraternal Orgs per 10,000 persons (County level variable)

Unless otherwise specified, these are collected at the zipcode level over 5 years (2012-2016). In Version 2, about 2% of these data points are missing overall, though for Charitable, Advocacy, and Civic Orgs, 11-20% are missing.

To deal with this, we replace missing data points with available county level rates. This reduces missing data down to ~1%!

  3. Collect indicators of linking social capital:
  • Voting Age Population

  • Local Government Employees per capita

  • State government employees per capita

  • Federal government employees per capita

  • Rate of Political Action (ESRI county level variable)

These variables are from Kyne & Aldrich’s index, published in 2020. Each clearly represents an aspect of bonding, bridging, or linking social capital, supported by the literature discussed in the original article. This dataset simply extends their methodology to a census-tract index, making the accommodations necessary to move from the county level to the census tract level.

  4. Aggregation to Zipcode and County Subdivision Levels:

Finally, at the end of this document, we aggregate these census tract level indices to the zipcode and county subdivision level.


0. Setup

0.1 Load Packages

First, please load the following packages:

# If they are not yet installed, uncomment this line and run it.
# install.packages(c("tidyverse", "tidycensus", "censusapi",
#   "sf", "tigris", "Amelia", "viridis", "ggpubr", "ggtext",
#   "knitr", "shadowtext", "matrixStats"))

# Then load them!

# For data management
library(tidyverse)

# For querying census data
library(tidycensus)
library(censusapi)

# For spatial data management
library(sf)
library(tigris)

# For multiple imputation
library(Amelia)

# For data viz
library(viridis)
library(ggpubr)
library(ggtext)
library(knitr)

0.2 Create Folders

Second, let’s create a folder to hold our raw data; we’ll create a few more folders for outputs later on.

# Create a folder to host raw index indicators
dir.create("raw_data")

0.3. Load API Key

We’re going to compile the various indicators mentioned above, primarily using Kyle Walker’s tidycensus package, which allows us to query the US Census Bureau’s American Community Survey 5-year estimates from 2011 to 2020. To do so, you’ll need a Census API key - no worries, it’s quick to get. Just register your email address with the Census Bureau at https://api.census.gov/data/key_signup.html and they will email you the API key immediately.

Once you’ve got an API key, just plug it into the census_api_key() function from the tidycensus package.

# Load api key to download data from US Census Bureau
mykey <- "YOUR_API_KEY_HERE"
census_api_key(mykey)
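
If you’d rather not paste your key into every script, tidycensus can also store it in your .Renviron file for future sessions (an optional step):

# Optional: store the key for future sessions, then restart R
census_api_key(mykey, install = TRUE)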


1. Collect Data

Let’s collect some data!

1.1 Units

First, let’s make a list of all year-state combinations we want to query.

myunits <- expand_grid(
  state = c(state.abb, "DC"),
  year = 2011:2020) %>%
  mutate(id = 1:n())

1.2 Indicators

Second, let’s make a list of all the indicators we want to query. We’re going to try to download them all in one fell swoop, to simplify the process and reduce time spent waiting for downloads.

myvars <- read_csv("indicators.csv")

We could print the contents of myvars in its entirety, but it’s probably easier just to look at the tables below to see the variables we’ll be acquiring for each indicator.
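
If you’re rebuilding indicators.csv yourself, it’s just a plain four-column table; here’s a minimal sketch of its structure, with two example rows drawn from the race_similarity table below:

# A minimal sketch of indicators.csv's four-column structure
tribble(
  ~code,         ~value,             ~survey, ~geography,
  "B01003_001E", "Total Population", "acs5",  "tract",
  "B02001_002E", "White alone",      "acs5",  "tract")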

race_similarity

code value survey geography
B01003_001E Total Population acs5 tract
B02001_002E White alone acs5 tract
B02001_003E Black or African American alone acs5 tract
B02001_004E American Indian and Alaska Native alone acs5 tract
B02001_005E Asian alone acs5 tract
B02001_006E Native Hawaiian and Other Pacific Islander alone acs5 tract
B02001_007E Some other race alone acs5 tract
B02001_008E Two or more races acs5 tract

ethnicity_similarity

code value survey geography
B01003_001E Total Population acs5 tract
B03001_002E Not Hispanic or Latino acs5 tract
B03001_003E Hispanic or Latino acs5 tract

education_equality

code value survey geography
B15002_001E Population 25 years and over acs5 tract
B15002_002E Population 25 years and over (Male) acs5 tract
B15002_011E High School Graduate, GED, or alternative (Male) acs5 tract
B15002_012E Some college, less than 1 year (Male) acs5 tract
B15002_013E Some college, 1 or more years, no degree (Male) acs5 tract
B15002_014E Associate’s degree (Male) acs5 tract
B15002_015E Bachelor’s degree (Male) acs5 tract
B15002_016E Master’s degree (Male) acs5 tract
B15002_017E Professional school Degree (Male) acs5 tract
B15002_018E Doctoral Degree (Male) acs5 tract
B15002_019E Population 25 years and over (Female) acs5 tract
B15002_028E High School Graduate, GED, or alternative (Female) acs5 tract
B15002_029E Some college, less than 1 year (Female) acs5 tract
B15002_030E Some college, 1 or more years, no degree (Female) acs5 tract
B15002_031E Associate’s degree (Female) acs5 tract
B15002_032E Bachelor’s degree (Female) acs5 tract
B15002_033E Master’s degree (Female) acs5 tract
B15002_034E Professional school Degree (Female) acs5 tract
B15002_035E Doctoral Degree (Female) acs5 tract

gini

code value survey geography
B19083_001E Gini Index acs5 tract

employment_equality

code value survey geography
B23025_003E Civilian labor force acs5 tract
B23025_004E Employed (in civilian labor force) acs5 tract
B23025_005E Unemployed (in civilian labor force) acs5 tract

gender_income_similarity

code value survey geography
B19326_001E Median income in the past 12 months (total) acs5 tract
B19326_002E Median income (dollars) in the past 12 months (Male) acs5 tract
B19326_005E Median income (dollars) in the past 12 months (Female) acs5 tract

language_competency

code value survey geography
B16004_024E total 18 to 64 years old acs5 tract
B16004_025E Speak only English acs5 tract
B16004_027E Speak Spanish & Speak English ‘very well’ acs5 tract
B16004_032E Speak other Indo-European languages & Speak English ‘very well’ acs5 tract
B16004_037E Speak Asian and Pacific Island languages & Speak English ‘very well’ acs5 tract
B16004_042E Speak other languages & Speak English ‘very well’ acs5 tract

communication_capacity

code value survey geography
B25043_001E Total Households acs5 tract
B25043_003E Owner occupied, with telephone service available acs5 tract
B25043_007E Owner occupied, with no telephone service available acs5 tract
B25043_012E Renter occupied, with telephone service available acs5 tract
B25043_016E Renter occupied, with no telephone service available acs5 tract

nonelder

code value survey geography
B01001_020E Male 65-66 acs5 tract
B01001_021E Male 67-69 acs5 tract
B01001_022E Male 70-74 acs5 tract
B01001_023E Male 75-79 acs5 tract
B01001_024E Male 80-84 acs5 tract
B01001_025E Male 85+ acs5 tract
B01001_044E Female 65-66 acs5 tract
B01001_045E Female 67-69 acs5 tract
B01001_046E Female 70-74 acs5 tract
B01001_047E Female 75-79 acs5 tract
B01001_048E Female 80-84 acs5 tract
B01001_049E Female 85+ acs5 tract
B01001_002E Male acs5 tract
B01001_026E Female acs5 tract

voting_age

code value survey geography
B05003_009E Male, 18 years and over, Native (US Citizen) acs5 tract
B05003_011E Male, 18 years and over, Foreign born, Naturalized U.S. citizen acs5 tract
B05003_020E Female, 18 years and over, Native (US Citizen) acs5 tract
B05003_022E Female, 18 years and over, Foreign born, Naturalized U.S. citizen acs5 tract
B05001_001E Total Population acs5 tract
B05001_006E Not a US Citizen acs5 tract

gov_workers

code value survey geography
B08128_006E Local government workers acs5 tract
B08128_007E State government workers acs5 tract
B08128_008E Federal government workers acs5 tract
B08128_001E Total Federal, State, and Local Government Workers acs5 tract

organizations

code value survey geography
8132 Charitable Organizations zpb zipcode
8133 Social Advocacy Organizations zpb zipcode
8134 Civic and Social Organizations zpb zipcode
8139 Business, Professional, Labor, Political, and Similar Organizations zpb zipcode
8131 Religious Organizations zpb zipcode

zipcode_pop

code value survey geography
B01003_001E Total Population acs5 zipcode

fraternal

code value survey geography
fraternal Member of fraternal order (% of total) ESRI county

politicalacts

code value survey geography
politicalacts % Attended political rally, speech, or organized protest ESRI county

1.3 Download ACS-5 Estimates

So let’s go download those indicators!

Let’s isolate just the ACS-5 census tract variables as tract_vars, and download them using the get_acs() function from tidycensus.

# Get the many variables applicable
tract_vars <- myvars %>% 
  filter(survey == "acs5" & geography == "tract") %>%
  # Cut off the 'E' at the end for 'Estimate';
  # tidycensus just wants the base variable code
  mutate(code = code %>% str_extract(".*_[0-9]{3}")) %>%
  select(code, value)

We’re going to run a loop iteratively to get the data for each state-year pair, which will be helpful in the likely event that this code crashes every once in a while. Heads up: we’re downloading many megabytes of data; this should take ~20 minutes.

# Create a folder to hold our acs5 estimates
dir.create("raw_data/acs5")

# Then write a function to procure them
get_data = function(data, variables){
  result <- get_acs(
    # Inserting a distinct year
    year = data$year,
    # and a distinct state each time
    state = data$state,
    # but always grabbing the same set of tract variables from the acs5
    geography = "tract",
    variables = variables,
    survey = "acs5")
  
  result %>%
      # and each time, write it to file 
      # in the format of raw_data/acs5/YEAR_STATE.rds
      saveRDS(paste("raw_data/acs5/", data$year, "_", data$state, ".rds", sep = ""))
}

# Then run the function
myunits %>%
  # For each year-state pair
  split(.$id) %>%
  # run get_data() once per pair, saving each result to file
  walk(~get_data(., variables = tract_vars$code))

1.4 Download Zipcode Business Patterns Estimates

First, let’s gather the 5 variables applicable, using their NAICS2012 codes.

zip_vars <- myvars %>% 
  filter(survey == "zpb" & geography == "zipcode") %>%
  pull(code)

Then, let’s build a grid of NAICS code - years, which we will use to iteratively query the Zipcode Business Patterns API using the getCensus() function from the censusapi package. We will save this data as zip_data.rds.

(Notice how we directly insert mykey, our API key, into the key = argument in this function; it won’t work without your API key.)

expand_grid(year = as.character(2012:2016),
            var = zip_vars) %>% 
  mutate(id = 1:n()) %>%
  split(.$id) %>%
  map_dfr(~getCensus(
        name = "zbp",
        key = mykey,
        vintage = .$year,
        vars = c("ESTAB", "GEO_ID", "YEAR"),
        region = "zipcode:*",
        naics2012 = .$var))  %>%
  write_rds("raw_data/zip_data.rds")

Next, let’s gather the estimated population for every zipcode tabulation area, from the American Community Survey 5-year estimates. We can use tidycensus’s get_acs() function like before for this. We’ll save this as zip_pop.rds.

# Get the population of every zipcode tabulation area, for each year 2011-2020
data.frame(year = 2011:2020) %>%
  split(.$year) %>%
  map_dfr(~get_acs(
    year = .$year,
    geography = "zcta",
    variables = "B01003_001",
    survey = "acs5"), .id = "year") %>%
  # Get the appropriate zipcode
  mutate(geoid = str_sub(GEOID, start = -5, end = -1)) %>%
  # and simplify columns
  select(year, geoid, pop = estimate) %>%
  # and save to file
  write_rds("raw_data/zip_pop.rds")

Then, let’s combine the two datasets to produce a nice wide matrix of zipcode-years from 2012 to 2016. Unfortunately, earlier and later editions are not yet available via the API, so we’re just going to carry forward the rates from 2016 to subsequent years. We calculate the rate of establishments per 10,000 residents for religious_orgs, charitable_orgs, advocacy_orgs, civic_orgs, and unions, and then save them as zip_rates.rds.
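
For intuition, here is that per-10,000 rate formula worked through with toy numbers:

# Worked example of the rate formula, using toy numbers:
# a zipcode with 3 religious orgs and 15,000 residents has
# 3 / 15000 * 10000 = 2 religious orgs per 10,000 residents
3 / 15000 * 10000
#> [1] 2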

read_rds("raw_data/zip_pop.rds") %>% 
  left_join(by = c("year", "geoid"),
            y = read_rds("raw_data/zip_data.rds") %>% 
              select(year = YEAR, geoid = zip_code, 
                     value = ESTAB, variable = NAICS2012) %>%
              mutate(variable = variable %>% dplyr::recode(
                "8131" = "religious_orgs",
                "8132" = "charitable_orgs",
                "8133" = "advocacy_orgs",
                "8134" = "civic_orgs",
                "8139" = "unions")) %>%
              pivot_wider(id_cols = c(year, geoid), names_from = variable, 
                          values_from = value)) %>%
  # Filter to years available from the Zipcode Business Patterns API
  filter(year %in% 2012:2016) %>%
  mutate_at(vars(charitable_orgs, advocacy_orgs, civic_orgs, unions, religious_orgs),
            list(~. / pop * 10000)) %>%
  # Set any NaN or infinite rates (from zero-population zipcodes) to NA
  mutate_at(vars(charitable_orgs, advocacy_orgs, civic_orgs, unions, religious_orgs),
            list(~if_else(is.nan(.) | is.infinite(.), NA_real_, .))) %>%
  write_rds("raw_data/zip_rates.rds")

Finally, let’s consolidate these Zipcode Business Patterns observations at the census tract level, using the HUD crosswalk files. To do so, let’s first gather a compendium of zipcode-to-tract crosswalk files, one for each year.

# For Zipcode to Tract conversion
get_crosswalk = function(from, to, year){
  # Build URL of HUD crosswalk file
  myurl <- paste("https://www.huduser.gov/portal/datasets/usps/", 
                 toupper(from), "_", toupper(to), "_03", year, ".xlsx", sep = "")
  # Create a temporary file to receive the download
  myfile <- tempfile()
  # Download the file from the URL to the temporary site
  download.file(url = myurl, destfile = myfile)
  # Import the file
  readxl::read_excel(myfile) %>% 
    select(from = 1, to = 2) %>% 
    mutate(year = year,
           type = paste(from, "->", to, sep = "")) %>%
    return()
}

data.frame(year = 2011:2020) %>%
  split(.$year) %>%
  # For each year, run the crosswalk function
  map_dfr(~get_crosswalk(from = "zip", to = "tract", year = .$year)) %>%
  # Save compressed version of file
  saveRDS("raw_data/crosswalk_zip_to_tract.rds")


And then let’s use these crosswalk files each year to estimate the average rate of organizations per 10,000 residents per tract, for zipcode-census tract pairs that were in existence in that year.

read_rds("raw_data/crosswalk_zip_to_tract.rds") %>%
  # Let's join in the zipcode rates to our crosswalk
  left_join(by = c("from" = "geoid", "year"), 
            y = read_rds("raw_data/zip_rates.rds") %>%
              mutate(year = as.numeric(year))) %>%
  # Then for each year and tract,
  group_by(year, geoid = to) %>%
  # Calculate the average
  summarize_at(
    vars(charitable_orgs, advocacy_orgs, civic_orgs, unions, religious_orgs),
    list(~mean(., na.rm = TRUE))) %>%
  ungroup() %>%
  # Fix any NaN or Inf results to NA
   mutate_at(
    vars(charitable_orgs, advocacy_orgs, civic_orgs, unions, religious_orgs),
    list(~if_else(is.nan(.) | is.infinite(.), NA_real_, as.numeric(.)))) %>%
  saveRDS("raw_data/tract_rates.rds")

1.5 Download County ESRI Estimates

Next, we’re going to borrow two county variables straight from the original county-level indices. These pertain to fraternal orgs and political activities, and we use them because there aren’t many ideal alternatives available at the tract or zipcode level at the moment.

# Finally, to get ESRI's data on fraternal ties and political acts, 
# we just borrow it from the original index at the county level.
read_delim("raw_data/Kyne Aldrich Capturing Dataset 2020.tab", delim = "\t") %>%
  select(geoid, fraternal, politicalacts) %>%
  # Keep just the last 5 digits (the 5-digit county FIPS code)
  mutate(geoid = str_sub(geoid, -5, -1)) %>%
  write_rds("raw_data/county_esri.rds")

1.6 Produce Raw Dataset

Next, we process these raw data to produce a single dataset containing the base data for our indicators. The result is absolutely enormous: 343.9 MB when compressed, containing 50+ variables and 700,000+ observations. Yay big data!

get_format = function(data){
  
  print(data$fileid)
  
  read_rds(data$fileid) %>%
    mutate(GEOID = as.character(GEOID),
           year = str_extract(data$fileid, pattern = "20[0-9]{2}")) %>%
    select(year, geoid = GEOID, variable, estimate) %>%
    pivot_wider(id_cols = c(geoid, year), 
                names_from = variable, values_from = estimate) %>%
    return()
}

data.frame(fileid = dir("raw_data/acs5", full.names = TRUE)) %>%
  split(.$fileid) %>%
  map_dfr(~get_format(.)) %>% 
  # Create a variable for joining in our 2012-2016 variables to the full range
  mutate(yearjoin = case_when(
    year < 2012 ~ 2012,
    year > 2016 ~ 2016,
    TRUE ~ as.numeric(year))) %>%
  # Join in zipcode-averaged rates
  left_join(by = c("geoid", "yearjoin" = "year"),
            y = read_rds("raw_data/tract_rates.rds") %>% 
              filter(year %in% 2012:2016)) %>%
  # Get rid of the now-unnecessary joining variable
  select(-yearjoin) %>%
  # Join in 2 county variables
  mutate(county = str_sub(geoid, 1,5)) %>%
  left_join(by = c("county" = "geoid"), y = read_rds("raw_data/county_esri.rds")) %>% 
  write_rds("raw_data/tract_data.rds")

remove(get_format)

The resulting data, raw_data/tract_data.rds, is the data you need to reproduce any of the indicators used in this study.
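
Before moving on, it can help to run a quick sanity check on the panel; for example (optional):

# Optional sanity check: confirm the panel's size and coverage
read_rds("raw_data/tract_data.rds") %>%
  summarize(n_obs = n(),
            n_tracts = n_distinct(geoid),
            years = paste(min(year), max(year), sep = "-"))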

1.7 Calculate Indicators

Next, let’s use the raw data contained within raw_data/tract_data.rds to produce our main indicators for use in this study.

read_rds("raw_data/tract_data.rds") %>% 
  # Calculate the race fractionalization index.
  mutate(race_fractionalization = (1 - (
    (B02001_002 / B01003_001) ^ 2 + # square of % White
      (B02001_003 / B01003_001) ^ 2 + # square of % Black
      (B02001_004 / B01003_001) ^ 2 + # square of % American Indian/Alaskan Native 
      (B02001_005 / B01003_001) ^ 2 + # square of % Asian
      (B02001_006 / B01003_001) ^ 2 + # square of % native Hawaiian/Pacific Islander
      (B02001_007 / B01003_001) ^ 2 + # square of % some other race
      (B02001_008 / B01003_001) ^ 2) # square of % two or more races
  ),
  # Now calculate race similarity index, so that 1 = homophily and 0 = heterophily
  race_similarity = 1 - race_fractionalization) %>%
  # Calculate the ethnicity fractionalization index
  mutate(ethnicity_fractionalization = (1 - (
    (B03001_002 / B01003_001) ^ 2 + # square of % Not Hispanic/Latino
      (B03001_003 / B01003_001) ^ 2)  # square of % Hispanic/Latino
  ),
  ethnicity_similarity = 1 - ethnicity_fractionalization) %>% 
  # Calculate educational equality:
  # the absolute value of the difference between the shares of residents
  # with less than a high school education and with a bachelor's degree or more,
  # where a large gap indicates homophily and a small gap indicates heterophily.
  # Source table: "EDUCATIONAL ATTAINMENT FOR THE POPULATION 25 YEARS AND OVER"
  mutate(
    # Calculate % with less than high school education
    #  = 1 minus % with high school degree or more
    less_than_high_school = 1 - (
      (B15002_011 + #High School Graduate, GED, or alternative (Male),
         B15002_012 + #Some college, less than 1 year (Male),
         B15002_013 + #Some college, 1 or more years, no degree (Male),
         B15002_014 + #Associate's degree (Male),
         B15002_015 + #Bachelor's degree (Male),    
         B15002_016 + #Master's degree (Male),
         B15002_017 + #Professional school Degree (Male),
         B15002_018 + #Doctoral Degree (Male),
         # For women   
         B15002_028 + #High School Graduate, GED, or alternative (Female),
         B15002_029 + #Some college, less than 1 year (Female),
         B15002_030 + #Some college, 1 or more years, no degree (Female),
         B15002_031 + #Associate's degree (Female),
         B15002_032 + #Bachelor's degree (Female),    
         B15002_033 + #Master's degree (Female),
         B15002_034 + #Professional school Degree (Female),
         B15002_035 #Doctoral Degree (Female)
   ) / B15002_001), # total ages 25 or above
    # Calculate % with college education or more
    # sum of all types of education at or above college level, divided by total
    bachelors_or_more = (
      B15002_015 + #Bachelor's degree (Male),    
        B15002_016 + #Master's degree (Male),
        B15002_017 + #Professional school Degree (Male),
        B15002_018 + #Doctoral Degree (Male),
        B15002_032 + #Bachelor's degree (Female),    
        B15002_033 + #Master's degree (Female),
        B15002_034 + #Professional school Degree (Female),
        B15002_035 #Doctoral Degree (Female)
      ) /
      B15002_001, # total ages 25 or above
    EE = abs(less_than_high_school - bachelors_or_more),
    # now subtract it from 1, where 1 now = heterophily and 0 = homophily
    education_equality = 1-EE) %>%
  # Bonding: Race/income equality
  # Get gini index
  # Gini coefficient (0 = perfect equality to 1 = perfect inequality)
  mutate(income_inequality = B19083_001,
         income_equality = 1 - income_inequality) %>%
  # Bonding: Employment equality 
  # Absolute difference between % of total employed and % of total unemployed labor force 
  mutate(
   percent_employed = (B23025_004 / B23025_003),
   percent_unemployed = (B23025_005 / B23025_003),
   employment_inequality = abs(percent_employed - percent_unemployed),
   # rescale to range from 0 to 1
   employment_inequality = scales::rescale(employment_inequality, to = c(0,1)),
  # Here, like with college,
  # 1 now = heterophily, while 0 = homophily.
  # So, let's rescale it so 1 = homophily
  employment_equality = 1 - employment_inequality) %>%
  # Get county mean for missing data
  group_by(county) %>%
  # If values are missing, impute them with the mean value from that variable
  mutate_at(vars(B19326_001, B19326_002, B19326_005),
            list(~if_else(is.na(.), mean(., na.rm = TRUE), .))) %>%
  ungroup() %>%
  # Calculate Gender Income Similarity,
  # from median income in the past 12 months (in inflation-adjusted dollars),
  # adjusted by the Census Bureau to each survey year (e.g. 2010, 2011, 2012, etc.)
  # First, Gender Income Fractionalization (0 = homogeneity, 1 = heterogeneity)
  mutate(
    # Get percentage of total median income that is for men, squared
    MI = (B19326_002 / (B19326_002 + B19326_005)) ^ 2,
    # Get percentage of total median income that is for women, squared
    FI = (B19326_005 / (B19326_002 + B19326_005)) ^ 2,
    # How much space remains? Calculate fractionalization (low = homophily, high = heterophily)
    gender_income_fractionalization = (1 - (MI + FI)),
    # Now, to get a larger spanning variable, rescale.
    # smallest values become 0 (homophily) and 
    # largest values become 1 (heterophily)
    gender_income_fractionalization = (gender_income_fractionalization - min(gender_income_fractionalization, na.rm = TRUE)) / 
      # Divide by range
      (max(gender_income_fractionalization, na.rm = TRUE) - min(gender_income_fractionalization, na.rm = TRUE)),
    # Calculate gender income similarity, by subtracting from 1 so that
    # 1 = homophily and 0 = heterophily
    gender_income_similarity = 1 - gender_income_fractionalization
  ) %>%
  
  #Bonding: Language competency 
  # % of total population proficient English speakers
  mutate(lang = ((
      B16004_025 + # "Speak only English",
      B16004_027 + # "Speak Spanish & Speak English 'very well'",
      B16004_032 + # "Speak other Indo-European languages & Speak English 'very well'",
      B16004_037 + # "Speak Asian and Pacific Island languages & Speak English 'very well'",
      B16004_042 # "Speak other languages & Speak English 'very well'"
  ) / B16004_024) # "total 18 to 64 years old",
  ) %>%

  # Bonding: Communication capacity 
  # % of total households with a telephone
  mutate(telephone = ((B25043_003 + # "Owner occupied, with telephone service available",
                       B25043_012 # "Renter occupied, with telephone service available",
                       ) / 
                        B25043_001) # total households
         ) %>%
  # Bonding: Non-elder population 
  # % of total population below 65 years of age
  mutate(
    # elder population
        elder = (
          # Sum of elder male residents (65+)
          B01001_020 + # Male 65-66
           B01001_021 + # Male 67-69
           B01001_022 + # Male 70-74
           B01001_023 + # Male 75-79
           B01001_024 + # Male 80-84
           B01001_025 + # Male 85+
           # Sum of elder female residents (65+)
           B01001_044 + # Female 65-66
           B01001_045 + # Female 67-69
           B01001_046 + # Female 70-74
           B01001_047 + # Female 75-79
           B01001_048 + # Female 80-84
           B01001_049)  # Female 85+
        # Divided by sum of male and female residents
        / (B01001_002 +   # Male
             B01001_026), # Female
        # non-elder population       
        non_elder = 1 - elder) %>%
  # % of Citizens who are of Voting Age
  mutate(voting_age = (
    # = Citizens of Voting Age / (Total Pop - Not Citizen)
      B05003_009 + # "Male, 18 years and over, Native (US Citizen)",
      B05003_011 + # "Male, 18 years and over, Foreign born, Naturalized U.S. citizen",
      B05003_020 + # "Female, 18 years and over, Native (US Citizen)",
      B05003_022   # "Female, 18 years and over, Foreign born, Naturalized U.S. citizen",
    ) / (
      B05001_001 - # "Total Population",
        B05001_006) # "Not a US Citizen"
    ) %>% 
  # % of government workers at given level of government
  mutate(local_gov = B08128_006 / B08128_001, # Local govt workers / total govt workers
         state_gov = B08128_007 / B08128_001, # State govt workers / total govt workers
         fed_gov = B08128_008 / B08128_001 # federal govt workers/ total govt workers
  ) %>%
  select(geoid, year, county, race_similarity, ethnicity_similarity, education_equality, 
         income_equality, employment_equality, gender_income_similarity, 
         lang, telephone, non_elder, 
         religious_orgs, charitable_orgs, advocacy_orgs, civic_orgs, unions, fraternal, 
         voting_age, local_gov, state_gov, fed_gov, politicalacts) %>%
  write_rds("raw_data/indicators.rds")

1.8 Rescale Indicators

# Several variables are actually rates, so to normalize them, we log-transform them.
# First, in the event of zeros, we're going to replace these with a small-but-realistic value,
# since you can't log-transform the value zero.
# We're going to identify two points - the smallest non-zero value and half that value.
# that lower value will be our replacement for zero.
replace_zero = function(x){
  # Find the smallest non-zero value...
  xhat <- x[x != 0 & !is.na(x)]
  upper <- min(xhat)
  # ...and halve it; that lower value is our replacement for zero.
  lower <- upper / 2
  return(lower)
}
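
# For example, given a toy vector whose smallest non-zero value is 2,
# replace_zero() returns 2 / 2 = 1 as the stand-in for zero:
replace_zero(c(0, 2, 4, 10))
#> [1] 1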


# Read in our indicators
read_rds("raw_data/indicators.rds") %>%
 # Let's impute the county-mean for our rates of organizations, 
  # to try to reduce missing data in a spatially conscious way, 
  # since these are spatially aggregated variables from the zipcode level anyways. 
  group_by(year, county) %>%
  mutate_at(vars(religious_orgs, charitable_orgs, advocacy_orgs, civic_orgs),
            list(~if_else(
              condition = is.na(.),
              true = mean(., na.rm = TRUE),
              false = .))) %>%
  ungroup() %>%
  # recode any Infinite responses as NA
  mutate_at(vars(race_similarity:politicalacts),
            list(~if_else(is.infinite(.) | is.nan(.), NA_real_, .))) %>%
  # Next, a few of these variables are actually rates, so they are naturally right-skewed.
  # In order to be nice and comparable to other indicators, 
  # we're going to log-transform them to make them normally distributed again.
  # To do so, we need to impute 0s with a very small but realistic value, as described above.
  mutate_at(vars(advocacy_orgs, civic_orgs,religious_orgs, unions, 
                 charitable_orgs, fraternal, local_gov, state_gov, fed_gov),
            list(~case_when(. == 0 ~  replace_zero(.),   TRUE ~ .)))  %>%
  # Now take these variables and log-transform them
  mutate_at(vars(charitable_orgs,advocacy_orgs, civic_orgs,religious_orgs, unions, 
                 fraternal, local_gov, state_gov, fed_gov),
            list(~log(.))) %>% 
  # Cap all values below the 2.5th or above the 97.5th percentile,
  # so extreme outliers don't unduly bias our estimates for the majority of communities
  mutate_at(vars(race_similarity:politicalacts),
            list(~case_when(
              . < quantile(., probs = 0.025, na.rm = TRUE) ~ quantile(., probs = 0.025, na.rm = TRUE),
              . > quantile(., probs = 0.975, na.rm = TRUE) ~ quantile(., probs = 0.975, na.rm = TRUE),
              TRUE ~ .))) %>%
  # Now rescale ALL variables (with their new polarities) from a min of 0 to a max of 1
  mutate_at(vars(race_similarity:politicalacts),
            list(~scales::rescale(.))) %>%
  saveRDS("raw_data/indicators_rescaled.rds")

2. Checking for Missing Data

Next, let’s check the levels of missing data in our indicators, and save those rates as v1. We’ll save the missing-data percentages for the imputed data as v2. When we bind them together, we see that we’ve reduced missing data a fair amount.

v1 <- read_rds("raw_data/indicators.rds") %>%
  summarize_at(vars(race_similarity:politicalacts),
               list(~round(sum(is.na(.)) / n() * 100, 1)))

v2 <- read_rds("raw_data/indicators_rescaled.rds") %>%
  summarize_at(vars(race_similarity:politicalacts),
               list(~round(sum(is.na(.)) / n() * 100, 1)))

# Compare missing data before imputing county mean for zipcode measures and after
mymissing <- bind_rows(
  v1 %>% mutate(type = "original"),
  v2 %>% mutate(type = "imputed")) %>%
  pivot_longer(
    cols = c(race_similarity:politicalacts),
    names_to = "measure",
    values_to = "value") %>%
  pivot_wider(
    id_cols = measure,
    names_from = type,
    values_from = value)
measure original imputed
race_similarity 0.9 0.9
ethnicity_similarity 0.9 0.9
education_equality 0.9 0.9
income_equality 1.2 1.2
employment_equality 1.0 1.0
gender_income_similarity 0.0 0.0
lang 0.9 0.9
telephone 1.6 1.6
non_elder 0.9 0.9
religious_orgs 4.2 0.1
charitable_orgs 34.0 5.1
advocacy_orgs 35.6 4.8
civic_orgs 20.7 2.0
unions 10.5 10.5
fraternal 0.0 0.0
voting_age 0.9 0.9
local_gov 1.0 1.0
state_gov 1.0 1.0
fed_gov 1.0 1.0
politicalacts 0.0 0.0

3. Descriptives

Normality Testing

How normally distributed are our final indicators?

# Get variables
vars_bonding <- c("race_similarity", "ethnicity_similarity", "education_equality", 
         "income_equality", "employment_equality", "gender_income_similarity", 
         "lang", "telephone", "non_elder") 
# Get variables
vars_bridging <- c("civic_orgs", "advocacy_orgs",  "charitable_orgs",
                      "religious_orgs", "unions", "fraternal")
# Get variables
vars_linking <- c("voting_age", "local_gov", "state_gov", "fed_gov", "politicalacts")


library(tidyverse)
library(shadowtext)

obslong <- read_rds("raw_data/indicators_rescaled.rds") %>%
  sample_n(size = 10000, replace = FALSE) %>%
  pivot_longer(cols = c(race_similarity:politicalacts), 
               names_to = "variable", values_to = "value") %>%
  mutate(group = case_when(
    variable %in% vars_bonding ~ "bonding",
    variable %in% vars_bridging ~ "bridging",
    variable %in% vars_linking ~ "linking")) %>%
  mutate(variable = variable %>% dplyr::recode_factor(
    # Bonding
    "race_similarity" = "Race\nSimilarity",
    "ethnicity_similarity" = "Ethnicity\nSimilarity",
    "education_equality" = "Educational\nEquality",
    "income_equality" = "Race/Income\nEquality",
    "employment_equality" = "Employment\nEquality",
    "gender_income_similarity" = "Gender Income\nSimilarity",
    "lang" = "Language\nCompetency",
    "telephone" = "Communication\nCapacity",
    "non_elder" = "Non-Elder\nPopulation",
    # Bridging
    "religious_orgs" = "Religious\nOrgs", 
    "civic_orgs" = "Civic\nOrgs", 
    "advocacy_orgs" = "Social\nAdvocacy Orgs",
    "charitable_orgs" = "Charitable\nOrgs", 
    "fraternal" = "Fraternal\nOrders", 
    "unions" = "Unions",
    # Linking
    "voting_age" = "Political\nLinkage", 
    "local_gov" = "Local Govt.\nEmployees",
    "state_gov" = "State Govt.\nEmployees", 
    "fed_gov" = "Fed. Govt.\nEmployees",
    "politicalacts" = "Political\nActivities") %>%
      factor(levels = levels(.)))


# Get variables of interest

mylabels <- obslong %>%
  group_by(variable, group) %>%
  summarize(y = 0.5,
            x = min(value, na.rm = TRUE)) 



g1 <- ggplot(data = obslong) +
  geom_density(mapping = aes(x = value, y= ..scaled.., fill = group, group = variable)) +
  geom_shadowtext(data = mylabels, 
                  mapping = aes(x = x, y = y, color = group, label = variable),
                  bg.color = "white", color = "black", bg.r = 0.2, 
                  hjust = 0, size = 3, fontface = "bold") +
  facet_wrap(~variable, scales = "free") + 
  theme_classic(base_size = 14) +
  theme(panel.border = element_blank()) +
  theme(legend.position = "bottom", strip.text.x = element_blank(),
        plot.caption = element_text(hjust=  0.5),
        plot.subtitle = element_text(hjust = 0.5)) +
  guides(fill = "none", color = "none") +
  scale_fill_manual(values = c("#648FFF",  "#DC267F", "#FFB000")) +
  scale_color_manual(values = c("#648FFF",  "#DC267F", "#FFB000")) +
  labs(x = "Observed Values",
       y = "Frequency (%)",
       subtitle = "Distributions of Indicators",
       caption = "Based on a random sample of 10,000 census-tract years, due to large sample size.") +
  scale_x_continuous(breaks = c(0, 0.25, 0.5, 0.75, 1),
                     labels = c("0", ".25", ".5", ".75", "1")) +
  scale_y_continuous(breaks = c(0, 0.25, 0.5, 0.75, 1),
                     labels = c("0", ".25", ".5", ".75", "1"))

# Create an images folder, if it doesn't exist yet
dir.create("images", showWarnings = FALSE)
ggsave(g1, filename = "images/densityplot.png", dpi = 500, width = 8, height = 6)
include_graphics("images/densityplot.png")

Great! These are fairly normal distributions, except in cases which make sense; e.g., the distribution of the percentage of Americans who speak English proficiently is probably pretty left-skewed.

remove(g1, obslong, mylabels, mymissing)

Mean vs. Median

# Get variables
vars_bonding <- c("race_similarity", "ethnicity_similarity", "education_equality", 
         "income_equality", "employment_equality", "gender_income_similarity", 
         "lang", "telephone", "non_elder") 
# Get variables
vars_bridging <- c("religious_orgs", "charitable_orgs", "advocacy_orgs", 
                   "civic_orgs", "unions", "fraternal")

# Get variables
vars_linking <- c("voting_age", "local_gov", "state_gov", "fed_gov", "politicalacts")


# Import indicators
total <- read_rds("raw_data/indicators_rescaled.rds")

# Calculate social capital index using mean
data.frame(
  year = total$year,
  geoid = total$geoid,
  bonding = matrixStats::rowMeans2(x = total[, vars_bonding] %>% as.matrix()),
  bridging = matrixStats::rowMeans2(x = total[, vars_bridging] %>% as.matrix()),
  linking = matrixStats::rowMeans2(x = total[, vars_linking] %>% as.matrix())) %>%
  mutate(social_capital = matrixStats::rowMeans2(.[, c("bonding", "bridging", "linking")] %>% as.matrix())) %>%
  select(year, geoid, social_capital, bonding, bridging, linking) %>%
  saveRDS("raw_data/index_tract_mean.rds")

# Calculate social capital index using medians
data.frame(
  year = total$year,
  geoid = total$geoid,
  bonding = matrixStats::rowMedians(x = total[, vars_bonding] %>% as.matrix()),
  bridging = matrixStats::rowMedians(x = total[, vars_bridging] %>% as.matrix()),
  linking = matrixStats::rowMedians(x = total[, vars_linking] %>% as.matrix())) %>%
  mutate(social_capital = matrixStats::rowMedians(.[, c("bonding", "bridging", "linking")] %>% as.matrix())) %>%
  select(year, geoid, social_capital, bonding, bridging, linking) %>%
  saveRDS("raw_data/index_tract_median.rds")

remove(total)
bind_rows(
  read_rds("raw_data/index_tract_mean.rds") %>% mutate(type = "mean"),
  read_rds("raw_data/index_tract_median.rds") %>% mutate(type = "median")) %>%
  pivot_longer(cols = c(social_capital:linking), names_to = "variable", values_to = "value") %>%
  pivot_wider(id_cols = c(geoid, year, variable), names_from = type, values_from = value) %>%
  group_by(variable) %>%
  summarize(r = cor(mean, median, use = "pairwise.complete.obs")) %>%
  write_rds("raw_data/correlation.rds")

How closely do they correlate?

read_rds("raw_data/correlation.rds")
variable r
bonding 0.8270879
bridging 0.9623829
linking 0.8342117
social_capital 0.7531203

Looks like when we adjust to take the medians instead, it gets a little murky. It produces extremely similar results for bridging (.96), but somewhat more disparate results for bonding (.83), linking (.83), and social_capital (.75). However, all associations remain extremely strong and positive, which is good. The mean is generally better than the median here, because we want to give each indicator equal weight, since each represents a different aspect of connectivity. For example, given 5 indicators, if 4 were higher, a much lower score for the 5th indicator would have no impact on the score if we took the median. Using the mean, we ensure that each indicator has at least some impact on the index.
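
A quick toy example of that point:

# Toy example: 5 indicators, one much lower than the rest
x <- c(0.8, 0.8, 0.8, 0.8, 0.1)
median(x) # 0.8  -- the low indicator has no effect on the index
mean(x)   # 0.66 -- the low indicator pulls the index down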

Only a few census tracts are missing data after making the index. ~10,000 missing census tract-year observations is actually tiny in the context of all ~700,000 (~1.4%).

read_rds("raw_data/index_tract_mean.rds") %>% 
  filter(year == 2011) %>%
  select(social_capital:linking) %>%
  summary()
##  social_capital     bonding          bridging        linking      
##  Min.   :0.207   Min.   :0.1788   Min.   :0.000   Min.   :0.0188  
##  1st Qu.:0.494   1st Qu.:0.5110   1st Qu.:0.375   1st Qu.:0.5246  
##  Median :0.552   Median :0.5837   Median :0.480   Median :0.5988  
##  Mean   :0.544   Mean   :0.5766   Mean   :0.478   Mean   :0.5850  
##  3rd Qu.:0.600   3rd Qu.:0.6480   3rd Qu.:0.577   3rd Qu.:0.6617  
##  Max.   :0.838   Max.   :0.9371   Max.   :0.997   Max.   :0.9431  
##  NA's   :10790   NA's   :943      NA's   :10355   NA's   :734
rm(list = ls())

4. Generate Indices

Next, let’s generate our indices using Multiple Imputation in the Amelia package in R. We draw from the full universe of census tracts, using imputation over time to help predict the same census tracts over different years. Imputations are bounded between 0 and 1, to match actual limits of variables.

library(tidyverse)
library(Amelia)

dir.create("mi")

set.seed(98754)

read_rds("raw_data/indicators_rescaled.rds") %>%
  select(-county) %>%
  mutate(year = as.numeric(year)) %>%
  as.data.frame() %>%
  Amelia::amelia(m = 5,  
                 ts = "year", cs = "geoid",
                 max.resample = 1000, parallel = "multicore",
                 # Columns 3-22 hold the 20 indicators (columns 1-2 are geoid and year)
                 bounds = data.frame(column.number = 3:22,
                                     lower.bound = 0,
                                     upper.bound = 1) %>% 
                   as.matrix()) %>%
  saveRDS("mi/mi.rds")

4.1 Building Imputed Indices

Next, let’s combine these imputed indicators into an index.

# We're going to work straight from the imputed data
library(tidyverse)
library(Amelia)

# Import and restructure imputations as tibble
read_rds("mi/mi.rds")$imputations %>% 
  map_dfr(~as_tibble(.), .id = "imp") %>%
  write_rds("mi/mi_imp.rds")

Having turned the imputed indicators into a giant tibble, we’re now going to use the matrixStats package to average these indicators. I recommend matrixStats in this case because we need to run these computations on absolutely enormous datasets (700,000 tract-years x 5 imputations).

# Load imputed cells
mymi <- read_rds("mi/mi_imp.rds")

################
# Bonding
################

# Get variables
vars_bonding <- c("race_similarity", "ethnicity_similarity", "education_equality", 
         "income_equality", "employment_equality", "gender_income_similarity", 
         "lang", "telephone", "non_elder") 
# Get stats
mybonding <- tibble(
  mean = mymi[,vars_bonding] %>% as.matrix() %>% matrixStats::rowMeans2(., na.rm = TRUE),
  sd = mymi[,vars_bonding] %>% as.matrix() %>% matrixStats::rowSds(., na.rm = TRUE))



################
# Bridging
################

# Get variables
vars_bridging <- c("religious_orgs", "charitable_orgs", "advocacy_orgs", 
                   "civic_orgs", "unions", "fraternal")

# Get stats
mybridging <- tibble(
  mean = mymi[,vars_bridging] %>% as.matrix() %>% matrixStats::rowMeans2(., na.rm = TRUE),
  sd = mymi[,vars_bridging] %>% as.matrix() %>% matrixStats::rowSds(., na.rm = TRUE))


################
# Linking
################

# Get variables
vars_linking <- c("voting_age", "local_gov", "state_gov", "fed_gov", "politicalacts")

# Get stats
mylinking <- tibble(
  mean = mymi[,vars_linking] %>% as.matrix() %>% matrixStats::rowMeans2(., na.rm = TRUE),
  sd = mymi[,vars_linking] %>% as.matrix() %>% matrixStats::rowSds(., na.rm = TRUE))


# Bind together the results
total <- bind_rows(
  mybonding %>%
    mutate(imp = mymi$imp,
           geoid = mymi$geoid,
           year = mymi$year,
           type = "bonding"),
  mybridging %>%
    mutate(imp = mymi$imp,
           geoid = mymi$geoid,
           year = mymi$year,
           type = "bridging"),
  mylinking %>%
    mutate(imp = mymi$imp,
           geoid = mymi$geoid,
           year = mymi$year,
           type = "linking")) %>%
  # Pivot so that imputations are in columns, not rows
  pivot_wider(id_cols = c(type, geoid, year), names_from = imp, values_from = c(mean, sd))

# Remove the excess data
remove(mymi, mybonding, mybridging, mylinking,
       vars_bonding, vars_bridging, vars_linking)

4.2 Melding Indices

Next, we’re going to meld our 5 imputed indices into a single index, using Rubin’s Rules, as is considered proper with multiple imputation.
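
For reference, here is a minimal sketch of what Rubin’s Rules compute for a single tract-year, using toy numbers; mi.meld() below does this for every row at once.

# Rubin's Rules by hand, for one tract-year (toy numbers)
q  <- c(0.52, 0.54, 0.53, 0.55, 0.51)      # index score in each of m = 5 imputations
se <- c(0.020, 0.022, 0.021, 0.019, 0.023) # its standard error in each imputation
m  <- length(q)
q_bar   <- mean(q)    # melded point estimate
within  <- mean(se^2) # average within-imputation variance
between <- var(q)     # between-imputation variance
# The total variance adds a correction for the finite number of imputations
se_bar  <- sqrt(within + (1 + 1/m) * between)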

dir.create("index")

total %>%
  # Now meld the results for the five imputations
  with(
    # Now, we want to generate single bonding, bridging, and linking indices,
    # combining them and incorporating error within and between imputations 
    # by merging them with Rubin's Rules
    Amelia::mi.meld(
      q = select(., mean_imp1:mean_imp5) %>%
        as.matrix(),
      se = select(., sd_imp1:sd_imp5) %>%
        as.matrix(),
      byrow = FALSE) ) %>%
 # Now report them in a final tibble
  with(
    tibble(
      # Add in unique identifiers
      type = total$type,
      geoid = total$geoid,
      year = total$year,
      # And the vectors of melded results
      score = .$q.mi %>% as.vector(),
      se = .$se.mi %>% as.vector()
    )
  )  %>%
  # Now pivot so each index is a column
  pivot_wider(
    id_cols = c(year, geoid),
    names_from = type,
    values_from = c(score, se)) %>%
  rename(bonding = score_bonding,
         bridging = score_bridging,
         linking = score_linking,
         bonding_se = se_bonding,
         bridging_se = se_bridging,
         linking_se = se_linking) %>%
  # Get social capital index
  mutate(social_capital = (bonding + bridging + linking)/3 ) %>%
  select(year, geoid, social_capital, bonding, bridging, linking,
         bonding_se, bridging_se, linking_se) %>%
  # Save to file
  saveRDS("index/index_tract_V3_04_10_2022.rds")

remove(total)

# Delete this hulking dataset of imputations, since the original is stored in mi.rds
unlink("mi/mi_imp.rds")

rm(list = ls())

5. Compare with Past Versions

Let’s compare our new index with the previous version of the index, which was presented in our Nature Scientific Reports paper. We’ll join them together into the dataframe compare, and then verify that they remain closely correlated.

compare <- read_rds("index/index_tract_V3_04_10_2022.rds") %>% 
  filter(year %in% c(2011:2018)) %>%
  select(year, geoid, 
         sc_new = social_capital, 
         bonding_new = bonding, 
         bridging_new = bridging, 
         linking_new = linking) %>%
  left_join(by = c("year", "geoid"),
            y = read_rds("index/index_tract_V2_01_31_2022.rds") %>% 
              filter(year %in% c(2011:2018)) %>%
              select(year, geoid, social_capital, bonding, bridging, linking))

Next, we calculate the correlations.

compare %>% 
  summarize(sc_new = cor(sc_new, social_capital, use = "pairwise.complete.obs"),
            bonding_new = cor(bonding_new, bonding, use = "pairwise.complete.obs"),
            bridging_new = cor(bridging_new, bridging, use = "pairwise.complete.obs"),
            linking_new = cor(linking_new, linking, use = "pairwise.complete.obs"))
## # A tibble: 1 × 4
##   sc_new bonding_new bridging_new linking_new
##    <dbl>       <dbl>        <dbl>       <dbl>
## 1  0.939       0.820        0.918       0.999

We see that the new edition of the indices is closely correlated with the old edition, with Pearson’s r statistics of +0.82 or greater, which is an extremely strong correlation. The slight variation in correlation for the bonding and bridging indices reflects a couple of minor modifications I made in this round, like:

  1. using year-by-year zipcode-to-tract crosswalks to ensure that we take into account changes over time in zipcode and tract boundaries.

  2. using better census variables for education. (Version 2 focused on the educational patterns of a consistent age group, 18 to 24 year olds, in an attempt to adjust for age, but this meant we didn’t get a good picture of the full educational diversity of the community. Version 3 uses the educational patterns of all residents ages 25 years or above, i.e., the adult, mostly-out-of-school population, which allows us to better capture the educational background of the entire community.)

  3. calculating indices just for the census tracts for which the American Community Survey 5-year estimates report data; in other words, populated census tracts. In the past, we calculated them for the full set of census tract geographies in the TIGRIS spatial datasets, but that meant some places with a population of 0 received social capital estimates. To avoid this and maximize the clarity of these indices, we focus on just the census tracts reported in the ACS-5. As a result, the annual sample size may differ slightly from previous versions, but hopefully it is more precise!

The 3rd-gen version of these indices reflects our team’s best efforts at producing a fully-reproducible set of indices.

6. Exporting

Finally, let’s convert these to an easier format for use in other programs.

6.1 Making Zipcode Measures

First, let’s calculate our indices for every zipcode, averaging up from the tract level.

We can make a tract to zipcode crosswalk using the HUD crosswalk files. Let’s bring back our earlier function get_crosswalk() to do that!

# Generic crosswalk builder (this time we'll run it tract-to-zip)
get_crosswalk = function(from, to, year){
  # Build URL of HUD crosswalk file
  myurl <- paste("https://www.huduser.gov/portal/datasets/usps/", 
                 toupper(from), "_", toupper(to), "_03", year, ".xlsx", sep = "")
  # Record the conversion type now, before 'from' and 'to' get masked
  # by the data columns of the same names below
  mytype <- paste(from, "->", to, sep = "")
  # Create a temporary file to receive the download
  myfile <- tempfile()
  # Download the file from the URL to the temporary site
  # (mode = "wb" keeps the .xlsx binary intact on Windows)
  download.file(url = myurl, destfile = myfile, mode = "wb")
  # Import the file
  readxl::read_excel(myfile) %>% 
    select(from = 1, to = 2) %>% 
    mutate(year = year,
           type = mytype) %>%
    return()
}

data.frame(year = 2011:2020) %>%
  split(.$year) %>%
  # For each year, run the crosswalk function
  map_dfr(~get_crosswalk(from = "tract", to = "zip", year = .$year)) %>%
  # Save compressed version of file
  saveRDS("raw_data/crosswalk_tract_to_zip.rds")

And now, we’ll use that massive 2011-2020 crosswalk to aggregate census tract measures to the appropriate zipcodes that existed in each year.

read_rds("raw_data/crosswalk_tract_to_zip.rds") %>% 
  select(from, to, year) %>%
  # Extract the state from each census tract
  mutate(state_code = str_sub(from, 1,2)) %>%
  # Join in the state name for each tract
  left_join(by = "state_code", y = tigris::fips_codes %>%
              select(state, state_code) %>%
              distinct()) %>%
  # Join in the census tract index scores
  left_join(by = c("from" = "geoid", "year"),
            y = read_rds("index/index_tract_V3_04_10_2022.rds")) %>%
  # Get the average score for each zipcode per state-year
  group_by(state, geoid = to, year) %>%
  summarize(social_capital = mean(social_capital, na.rm = TRUE),
            bonding = mean(bonding, na.rm = TRUE),
            bridging = mean(bridging, na.rm = TRUE),
            linking = mean(linking, na.rm = TRUE)) %>%
  ungroup() %>%
  # Some zipcodes could not be estimated, due to their overlapping census tracts 
  # having no census data available for measuring indices. We cut them here.
  filter(!is.na(social_capital)) %>%
  write_rds("index/index_zipcode_V3_04_10_2022.rds")

6.2 Making County Subdivision Measures

Second, let’s calculate these indices at the county subdivision level, averaging up from the tract level.

Sadly, no such crosswalk exists at the moment between tracts and subdivisions for every single year, but the Missouri Census Data Center has created a near equivalent: using their Geocorr 2018 program, you can build your own tract-to-county-subdivision crosswalk according to 2010 geographies. We use this crosswalk to build the county subdivision indices (see the sketch after the notes below for converting a Geocorr export into the crosswalk file). Two notes:

  • Subdivision boundaries generally change less frequently than zipcodes, so we should be fine here.

  • If boundary changes are a concern for your use case, I recommend you use the census tract indices, which offer the highest level of granularity and precision currently available.
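
Below is a minimal sketch of how you might convert a Geocorr 2018 export into the crosswalk file read next. The file name geocorr2018.csv and the column names county, tract, and cousub are placeholders, not Geocorr’s guaranteed headers; check the headers of your own export before running.

# A minimal sketch (assumed file and column names): convert a Geocorr 2018
# tract-to-county-subdivision export into raw_data/crosswalk_tract_to_csub.rds
read_csv("raw_data/geocorr2018.csv") %>%
  transmute(
    # 11-digit tract GEOID = 5-digit county + 6-digit tract code (drop the period)
    from = paste0(county, str_remove(tract, "\\.")),
    # 10-digit county subdivision GEOID = 5-digit county + 5-digit subdivision code
    to = paste0(county, cousub)) %>%
  distinct() %>%
  saveRDS("raw_data/crosswalk_tract_to_csub.rds")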

read_rds("raw_data/crosswalk_tract_to_csub.rds") %>% 
  select(from, to) %>%
  # Extract the state from each census tract
  mutate(state_code = str_sub(from, 1,2)) %>%
  # Join in the state name for each tract
  left_join(by = "state_code", y = tigris::fips_codes %>%
              select(state, state_code) %>%
              distinct()) %>%
  # Join in the census tract index scores
  left_join(by = c("from" = "geoid"),
            y = read_rds("index/index_tract_V3_04_10_2022.rds")) %>%
  # Get average score per census county subdivision, per year, per state
  group_by(state, geoid = to, year) %>%
  summarize(social_capital = mean(social_capital, na.rm = TRUE),
            bonding = mean(bonding, na.rm = TRUE),
            bridging = mean(bridging, na.rm = TRUE),
            linking = mean(linking, na.rm = TRUE)) %>%
  ungroup() %>%
  # 10 subdivisions in rural NY end up without a corresponding year or values; 
  # they were not estimable from current data. We cut these here.
  filter(!is.na(year)) %>%
  write_rds("index/index_csub_V3_04_10_2022.rds")

6.3 Exporting to .csv

Finally, let’s reformat and export these .rds files as .csv files, which can be more easily used in many different platforms. This will unfortunately cause the file size to balloon, as our .rds files have been compressed by default, but I guess that’s to be expected with a 700,000 row dataset.


library(tidyverse)

# Export census tracts
read_rds("index/index_tract_V3_04_10_2022.rds") %>% 
  mutate(state_code = str_sub(geoid, 1,2)) %>%
  left_join(by = "state_code", y = tigris::fips_codes %>%
              select(state, state_code) %>%
              distinct()) %>%
  select(state, year, geoid, social_capital:linking_se) %>%
  write_csv("index/index_tracts_V3_04_10_2022.csv")

# Export zipcodes
read_rds("index/index_zipcode_V3_04_10_2022.rds") %>% 
  write_csv("index/index_zipcode_V3_04_10_2022.csv")

# Export county subdivisions
read_rds("index/index_csub_V3_04_10_2022.rds") %>% 
  write_csv("index/index_csub_V3_04_10_2022.csv")

6.4 Final Notes

And that’s a wrap! If you’ve read this far, thank you for your interest in these indices! We hope you and your team are able to put them to good use!

If you have questions regarding the indices, feel free to shoot me an email, or check out some of the research our team is doing with these indices at www.timothyfraser.com/research.

Best wishes in your research!