Overview: This analysis rectifies the relationship break from the 2010 census tract to 2020 census tract changes. Census tract relationship files obtained from here . The reason for rectifying is to transmit LE data to 2020 census tracts, which is the unit used for a composite index with 18 indicators, of which LE is one. We used areal-weighted interpolation and computed a weighted mean (intensive interpolation) because Life Expectancy is an intensive measure.
Datasets Import & Merging
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 4.0.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
path <-"C:/Users/rayo-garza/OneDrive - Center for Public Policy Priorities/Documents/Data 2025/"le_data <-read_csv(paste0(path, "TX_A (1).csv"))
Rows: 4709 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): CNTY2KX, TRACT2KX
dbl (5): Tract ID, STATE2KX, e(0), se(e(0)), Abridged life table flag
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
relation_tracts_2 <-read_csv(paste0(path, "relation_tracts_2.csv")) #from US census
Rows: 9135 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): NAMELSAD_TRACT_20, MTFCC_TRACT_20, FUNCSTAT_TRACT_20, NAMELSAD_TRA...
dbl (10): OID_TRACT_20, GEOID_TRACT_20, AREALAND_TRACT_20, AREAWATER_TRACT_2...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#used backticks for Tract ID because of the space in the column name. Lots of nuances easy to forget. merged_data <- relation_tracts_2 %>%left_join(le_data, by =c("GEOID_TRACT_10"="Tract ID"))head(merged_data)
Hereis a good example of the application of this technique.
To reconcile 2010 Life Expectancy (LE) estimates with 2020 Census tract boundaries, we used areal-weighted interpolation approach. Using Census relationship files, we identified how much land area from each 2010 tract overlaps with each 2020 tract. So for example then, if 40% of a 2010 tract falls into a particular 2020 tract, then 40% of that tract’s Life Expectancy value contributes to the new tract.
Then, the 2010 LE value was multiplied by this area proportion to allocate partial contributions to the new tract geography. These weighted fragments were then aggregated to produce a single LE estimate for each 2020 tract. This approach preserves spatial continuity and avoids double-counting while ensuring compatibility with the updated census geography. For exampole, if a 2020 tract contains 40% of one 2010 tract with an LE of 78 years and 60% of another 2010 tract with an LE of 82 years, the new estimate would be calculated as (78 × 0.40) + (82 × 0.60). This ensures that each portion contributes proportionally to the final value.
Then, if all contributing fragments lacked valid LE data, the resulting value was retained as missing (NA) rather than set to zero to preserve true data gaps for later imputation. A coverage diagnostic (total_weight_check) was calculated as the sum of area weights contributing to each tract to assess completeness of source data and inform later normalization and imputation decisions.
options(scipen =999) #added after the fact to remove scientific notationfinal_le_2020_1 <- merged_data %>%#calculate weight and handle LE carefullymutate(area_weight = AREALAND_PART / AREALAND_TRACT_10) %>%#area_wirhgt wiil represent the proportion of a 2010 tract contributing to a 2020 tract#calculate weighted LE, but keep it na if the source was Nnamutate(weighted_le =`e(0)`* area_weight) %>%group_by(GEOID_TRACT_20) %>%summarize(#sum the pieces, but if all pieces are na, the result should be na not 0, and we want na so we can later imputelife_expectancy_2020 =if(all(is.na(weighted_le))) NAelsesum(weighted_le, na.rm =TRUE),total_weight_check =sum(area_weight, na.rm =TRUE) #how much of this 2020 tract is covered by valid 2010 source data )head(final_le_2020_1, 20)
Rows: 250 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): NAMELSAD
dbl (3): GEOID_clean, NAME, pct_in_austin
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 6 × 4
GEOID_clean NAME NAMELSAD pct_in_austin
<dbl> <dbl> <chr> <dbl>
1 48209010912 109. Census Tract 109.12 20.3
2 48453000101 1.01 Census Tract 1.01 100
3 48453000102 1.02 Census Tract 1.02 100
4 48453000203 2.03 Census Tract 2.03 100
5 48453000204 2.04 Census Tract 2.04 100
6 48453000205 2.05 Census Tract 2.05 100
Areal Interpolation, Part 2
To get finalized 2020 LE estimates for the study area, the areal-weighted interpolation procedure was repeated with explicit normalization to account for incomplete source coverage. After recalculating the proportion of each 2010 tract contributing to each 2020 tract, the weighted LE fragments were aggregated at the 2020 tract level.
Unlike the initial aggregation step, this version divides the sum of weighted values by the total share of tract area that had valid 2010 EL data.
Normalization Note: This normalization step “rescales” the estimate to represent 100% of the tract area based only on the portion supported by observed data, preventing partial-coverage tracts from being artificially deflated. In plain words, we estimate what the tract’s value would be if the entire tract behaved like the portion we actually observed. Normalization prevents tracts with partial data from appearing artificially low simply because some source data were missing. So, an example would be if a tract has 60% from a 2010 tract with valid LE (80), but then 40% from a 2010 tract that had missing data. So to get the weighted LE, we would do 80 × 0.60, which = 48. However 48 is not an accurate new LE. Why? It is wrong and small because it only represents a portion of the tract as 40% was missing. That’s why we normalize. So again, we estimate what the value would be if the tract behaved like the rest of the tract for which we actually do have observations.
The resulting dataset was then filtered to include only census tracts within the defined study area, and basic validation checks were performed to confirm structure and to identify any remaining missing values requiring further review.
# Normalize and Filterfinal_study_area_le <- merged_data %>%# calculate the area weight again for safety# area_weight represents the proportion of a 2010 tract contributing to a 2020 tractmutate(area_weight = AREALAND_PART / AREALAND_TRACT_10) %>%mutate(weighted_le =`e(0)`* area_weight) %>%group_by(GEOID_TRACT_20) %>%summarize(# data_coverage = share of the 2020 tract that had valid 2010 LE datadata_coverage =sum(area_weight *!is.na(`e(0)`)),# NORMALIZATION (weighted mean / intensive interpolation):# If coverage is 0, return NA (avoids NaN from dividing by 0).# Otherwise, "stretch" the weighted sum back to represent 100% of the tract# based only on the portion with observed LE data.le_2020_normalized =ifelse( data_coverage ==0,NA_real_,sum(weighted_le, na.rm =TRUE) / data_coverage ),.groups ="drop" ) %>%# keep only the tracts in my specific target list# used 'as.character' to ensure the IDs match perfectly as textfilter(as.character(GEOID_TRACT_20) %in%as.character(target_tracts$GEOID_clean))
Map of NAs
library(sf)
Warning: package 'sf' was built under R version 4.4.3
Linking to GEOS 3.13.0, GDAL 3.10.1, PROJ 9.5.1; sf_use_s2() is TRUE
library(ggplot2)library(tigris)
To enable caching of data, set `options(tigris_use_cache = TRUE)`
in your R script or .Rprofile.
options(tigris_use_cache =TRUE)#downloading shapesstudy_area_shapes <-tracts(state ="48", county =c("021", "209", "453", "491"), year =2020, cb =TRUE)#mutate map_data1 <- study_area_shapes %>%left_join(final_study_area_le %>%mutate(GEOID_TRACT_20 =as.character(GEOID_TRACT_20)), by =c("GEOID"="GEOID_TRACT_20")) %>%#here we filter to get only the 250 tracts in our target spreadsheetfilter(GEOID %in%as.character(target_tracts$GEOID_clean))ggplot(data = map_data1) +geom_sf(aes(fill = le_2020_normalized), color ="white", size =0.1) +scale_fill_viridis_c(na.value ="red", name ="Life Expectancy") +theme_minimal() +labs(title ="Life Expectancy (2020 Tracts)",subtitle ="Red areas indicate missing/NaN data (Coverage = 0)")
just making it interactive
library(mapview)
Warning: package 'mapview' was built under R version 4.4.3
library(leafsync)
Warning: package 'leafsync' was built under R version 4.4.3
#the spatial data is in a standard 'web' projection (WGS84)map_data_web1 <-st_transform(map_data1, crs =4326)# notes from helper 'zcol' defines which column colors the map# notes from helper 'na.color' makes missing data easy to spot interactive_map1 <-mapview(map_data_web1, zcol ="le_2020_normalized", col.regions = viridisLite::viridis(100),na.color ="red",layer.name ="Life Expectancy 2020")
Warning: Found less unique colors (100) than unique zcol values (131)!
Interpolating color vector to match number of zcol values.
interactive_map1
Pop join + missingness review
Next, we pull total population counts from the 2020 Decennial Census and attach them to the tract-level LE results to help diagnose missing values. After downloading tract populations (P1_001N) for the four-county study area, the code joins those population totals to the normalized LE dataset and keeps only a small set of fields needed for review: tract ID, normalized LE, data coverage, and total population. The follow-up step filters the dataset to isolate tracts where LE is still missing.
These tracts show data_coverage = 0, meaning none of the 2020 tract area intersected with a 2010 tract that had valid LE data in the source dataset. Because the normalization step divides by the amount of valid coverage, a coverage of zero results in NaN (i.e., the calculation becomes “something divided by 0”). Many of these tracts have substantial populations, which indicates the issue is not just “non-residential tracts,” but instead a gap in LE source coverage after crosswalking, making them candidates for imputation rather than exclusion.
library(tidycensus)#used decennial for more stable pop estimatespop_data <-get_decennial(geography ="tract",variables ="P1_001N", #total pop for 2020 Decennialstate ="TX",county =c("021", "209", "453", "491"),year =2020)
Getting data from the 2020 decennial Census
Using the PL 94-171 Redistricting Data Summary File
Note: 2020 decennial Census data use differential privacy, a technique that
introduces errors into data to preserve respondent confidentiality.
ℹ Small counts should be interpreted with caution.
ℹ See https://www.census.gov/library/fact-sheets/2021/protecting-the-confidentiality-of-the-2020-census-redistricting-data.html for additional guidance.
This message is displayed once per session.
final_review <- final_study_area_le %>%mutate(GEOID_TRACT_20 =as.character(GEOID_TRACT_20)) %>%left_join(pop_data, by =c("GEOID_TRACT_20"="GEOID")) %>%rename(total_pop_2020 = value) %>%#remember for next time decennial uses value, not estimateselect(GEOID_TRACT_20, le_2020_normalized, data_coverage, total_pop_2020)#the missing analysismissing_analysis <- final_review %>%filter(is.na(le_2020_normalized))missing_analysis
# A tibble: 36 × 4
GEOID_TRACT_20 le_2020_normalized data_coverage total_pop_2020
<chr> <dbl> <dbl> <dbl>
1 48453000203 NA 0 2537
2 48453000308 NA 0 2698
3 48453000309 NA 0 5602
4 48453000500 NA 0 4490
5 48453000601 NA 0 8580
6 48453000605 NA 0 4645
7 48453000606 NA 0 5012
8 48453000607 NA 0 4268
9 48453000608 NA 0 5286
10 48453000700 NA 0 1411
# ℹ 26 more rows
Imputation
Finally, we fill in missing Life Expectancy values using a spatial nearest-neighbor imputation approach so that every tract can be included in the composite index.
First, population totals are joined to the spatial tract map, and tracts with very small populations (≤50) are removed from the imputation process to avoid imputing values into non-residential areas where “nearest neighbor” assumptions may not be meaningful. The airport census tract is an example.
The remaining tracts are split into two groups: those with complete LE values and those still missing LE values. For each missing tract, the code identifies the three closest nearby tracts (based on geographic distance between tract centroids) that do have valid LE estimates. It then imputes the missing LE by taking the average of those five neighbors, assigning each tract a flag to indicate whether the value is original or imputed.
library(sf)library(nngeo)
Warning: package 'nngeo' was built under R version 4.4.3
#joinmap_data2 <- map_data1 %>%left_join(final_review %>%select(GEOID_TRACT_20, total_pop_2020), by =c("GEOID"="GEOID_TRACT_20")) impute_prep <- map_data2 %>%filter(total_pop_2020 >50) #separate into Complete and Missing setscomplete_tracts <- impute_prep %>%filter(!is.na(le_2020_normalized)) # missing_tracts <- impute_prep %>%filter(is.na(le_2020_normalized)) # #find the 3 nearest neighborsneighbor_indices <-st_nn(missing_tracts, complete_tracts, k =5, progress =FALSE)
lines or polygons
#calculate the average LE of those neighborsimputed_values <-sapply(neighbor_indices, function(idx) {mean(complete_tracts$le_2020_normalized[idx], na.rm =TRUE)}) # #plug values back and add status flagsmissing_tracts$le_2020_normalized <- imputed_valuescomplete_tracts$imputed_flag <-"Original"missing_tracts$imputed_flag <-"Imputed"#combine back into one final dataset (Removed the duplicate line)final_le_complete <-rbind(complete_tracts, missing_tracts) output_path <-paste0(path, "Final_Life_Expectancy_Imputed_2020.csv")final_le_complete %>%st_drop_geometry() %>%#just taking out to have csv table for arcmutate(GEOID =as.character(GEOID)) %>%write_csv(output_path)message("File saved to: ", output_path)
File saved to: C:/Users/rayo-garza/OneDrive - Center for Public Policy Priorities/Documents/Data 2025/Final_Life_Expectancy_Imputed_2020.csv
View Imputed Tracts
includes excluded tracts due to low pop
library(dplyr)library(sf)library(ggplot2)#object for tracts excluded from imputation due to low population (≤ 50)excluded_tracts <- map_data2 %>%filter(is.na(total_pop_2020) | total_pop_2020 <=50) %>%mutate(imputed_flag ="Excluded (pop ≤ 50)")excluded_tracts2 <- excluded_tracts %>%mutate(le_2020_normalized =NA_real_,data_coverage =NA_real_ )final_le_for_map <-bind_rows( final_le_complete, excluded_tracts2 %>%select(names(final_le_complete)))ggplot(data = final_le_for_map) +geom_sf(aes(fill = imputed_flag), color ="white", size =0.1) +scale_fill_manual(values =c("Original"="#21908C","Imputed"="#FDE725","Excluded (pop ≤ 50)"="#BDBDBD" ),name ="Data Status" ) +theme_minimal() +labs(title ="Life Expectancy Data Status by Census Tract",subtitle ="Imputed tracts use mean of 5 nearest neighbors; gray tracts were excluded from imputation (pop ≤ 50)." )
Here I was just double checking that the excluded tract was airport
After conducting the leave one out validation check, the decision was made to increase k from 3 to 5
# # Leave-One-Out Validation for KNN Imputation# #This tests whether KNN can accurately recover known LE values.# # set.seed(123)# # # Use only original (non-imputed) tracts for validation# validation_pool <- final_le_complete %>%# filter(imputed_flag == "Original")# # # Randomly select 20 tracts to temporarily treat as "missing"# test_sample <- validation_pool %>%# slice_sample(n = 20)# # # Create temporary complete and missing sets# complete_temp <- validation_pool %>%# filter(!GEOID %in% test_sample$GEOID)# # missing_temp <- test_sample# # # Find 3 nearest neighbors# neighbor_indices <- st_nn(missing_temp, complete_temp, k = 3, progress = FALSE)# # # Predict values# predicted <- sapply(neighbor_indices, function(idx) {# mean(complete_temp$le_2020_normalized[idx], na.rm = TRUE)# })# # actual <- missing_temp$le_2020_normalized# # # Calculate error metrics# rmse <- sqrt(mean((predicted - actual)^2))# mae <- mean(abs(predicted - actual))# # tibble(# RMSE = rmse,# MAE = mae# )# # # sd(final_le_complete$le_2020_normalized, na.rm = TRUE)
# # Leave-One-Out Validation: Compare k = 3 vs k = 5 (same holdout tracts)# # set.seed(123)# # # Use only original (non-imputed) tracts for validation# validation_pool <- final_le_complete %>%# filter(imputed_flag == "Original")# # # Hold out the same 20 tracts# test_sample <- validation_pool %>%# slice_sample(n = 20)# # complete_temp <- validation_pool %>%# filter(!GEOID %in% test_sample$GEOID)# # missing_temp <- test_sample# # # Helper to run KNN prediction for a given k# knn_predict <- function(k) {# neighbor_indices <- st_nn(missing_temp, complete_temp, k = k, progress = FALSE)# sapply(neighbor_indices, function(idx) {# mean(complete_temp$le_2020_normalized[idx], na.rm = TRUE)# })# }# # actual <- missing_temp$le_2020_normalized# # # Predictions# pred_k3 <- knn_predict(3)# pred_k5 <- knn_predict(5)# # # Error metrics# rmse <- function(pred, actual) sqrt(mean((pred - actual)^2))# mae <- function(pred, actual) mean(abs(pred - actual))# # results <- tibble(# k = c(3, 5),# RMSE = c(rmse(pred_k3, actual), rmse(pred_k5, actual)),# MAE = c(mae(pred_k3, actual), mae(pred_k5, actual))# )# # results
Mapping out distribution after switching to k=5
library(ggplot2)ggplot(final_le_complete, aes(x = le_2020_normalized, fill = imputed_flag)) +geom_density(alpha =0.4) +theme_minimal() +labs(title ="Distribution of Life Expectancy: Original vs Imputed",x ="Life Expectancy",y ="Density" )
ggplot(final_le_complete) +geom_sf(aes(fill = imputed_flag), color ="white", size =0.1) +theme_minimal()
Warning: package 'spdep' was built under R version 4.4.3
Loading required package: spData
Warning: package 'spData' was built under R version 4.4.3
To access larger datasets in this package, install the spDataLarge
package with: `install.packages('spDataLarge',
repos='https://nowosad.github.io/drat/', type='source')`
# BEFORE: Only original tractsbefore_all_sf <- map_data2 %>%mutate(GEOID =as.character(GEOID)) %>%left_join( final_review %>%mutate(GEOID_TRACT_20 =as.character(GEOID_TRACT_20)) %>%select(GEOID_TRACT_20, le_2020_normalized) %>%rename(le_before = le_2020_normalized),by =c("GEOID"="GEOID_TRACT_20") ) %>%filter(!is.na(le_before)) %>%arrange(GEOID)# AFTER: All tracts (including imputed)after_all_sf <- final_le_complete %>%mutate(GEOID =as.character(GEOID),le_after =as.numeric(le_2020_normalized)) %>%arrange(GEOID)# Build spatial weights separately# BEFORE weightscoords_before <-st_coordinates(st_centroid(st_geometry(before_all_sf)))nb_before <-knn2nb(knearneigh(coords_before, k =5))lw_before <-nb2listw(nb_before, style ="W")# AFTER weightscoords_after <-st_coordinates(st_centroid(st_geometry(after_all_sf)))nb_after <-knn2nb(knearneigh(coords_after, k =5))lw_after <-nb2listw(nb_after, style ="W")# Moran’s Imoran_before_all <-moran.test(before_all_sf$le_before, lw_before)moran_after_all <-moran.test(after_all_sf$le_after, lw_after)moran_before_all
Moran I test under randomisation
data: before_all_sf$le_before
weights: lw_before
Moran I statistic standard deviate = 9.8168, p-value <
0.00000000000000022
alternative hypothesis: greater
sample estimates:
Moran I statistic Expectation Variance
0.387775200 -0.004694836 0.001598359
moran_after_all
Moran I test under randomisation
data: after_all_sf$le_after
weights: lw_after
Moran I statistic standard deviate = 11.65, p-value <
0.00000000000000022
alternative hypothesis: greater
sample estimates:
Moran I statistic Expectation Variance
0.428849648 -0.004032258 0.001380663
Spatial nearest-neighbor imputation slightly increased measured spatial autocorrelation (Moran’s I from 0.388 to 0.429), reflecting the local smoothing nature of the KNN approach. However, the increase was modest and did not materially alter the overall clustering structure of life expectancy across tracts. Global distributional properties (mean, standard deviation, quantiles) remained stable, indicating that imputation preserved both statistical and spatial integrity of the dataset.
See Esri (2024). How Spatial Autocorrelation (Global Moran’s I) works. ArcGIS Pro Documentation.
Notes
Because tract-level LE is not available at finer population units, we applied land-area weighting as a standard approach; population-weighted interpolation was not feasible with available inputs. One tract (GEOID 48453980000; pop=3) was excluded from imputation due to being non-residential (airport tract).