relationship tracts

Author

crg

Overview: This analysis rectifies the relationship break from the 2010 census tract to 2020 census tract changes. Census tract relationship files obtained from here . The reason for rectifying is to transmit LE data to 2020 census tracts, which is the unit used for a composite index with 18 indicators, of which LE is one. We used areal-weighted interpolation and computed a weighted mean (intensive interpolation) because Life Expectancy is an intensive measure.

Datasets Import & Merging

library(tidyverse) 
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   4.0.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
path <- "C:/Users/rayo-garza/OneDrive - Center for Public Policy Priorities/Documents/Data 2025/"


le_data <- read_csv(paste0(path, "TX_A (1).csv")) 
Rows: 4709 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): CNTY2KX, TRACT2KX
dbl (5): Tract ID, STATE2KX, e(0), se(e(0)), Abridged life table flag

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
relation_tracts_2 <- read_csv(paste0(path, "relation_tracts_2.csv")) #from US census
Rows: 9135 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (6): NAMELSAD_TRACT_20, MTFCC_TRACT_20, FUNCSTAT_TRACT_20, NAMELSAD_TRA...
dbl (10): OID_TRACT_20, GEOID_TRACT_20, AREALAND_TRACT_20, AREAWATER_TRACT_2...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#used backticks for Tract ID because of the space in the column name. Lots of nuances easy to forget. 
merged_data <- relation_tracts_2 %>%
  left_join(le_data, by = c("GEOID_TRACT_10" = "Tract ID"))

head(merged_data)
# A tibble: 6 × 22
  OID_TRACT_20 GEOID_TRACT_20 NAMELSAD_TRACT_20    AREALAND_TRACT_20
         <dbl>          <dbl> <chr>                            <dbl>
1      2.08e13    48001950100 Census Tract 9501            483306613
2      2.08e14    48001950401 Census Tract 9504.01          16509268
3      2.08e14    48001950401 Census Tract 9504.01          16509268
4      2.08e14    48001950402 Census Tract 9504.02          71134275
5      2.08e13    48001950500 Census Tract 9505             23132052
6      2.08e13    48001950600 Census Tract 9506             20653882
# ℹ 18 more variables: AREAWATER_TRACT_20 <dbl>, MTFCC_TRACT_20 <chr>,
#   FUNCSTAT_TRACT_20 <chr>, OID_TRACT_10 <dbl>, GEOID_TRACT_10 <dbl>,
#   NAMELSAD_TRACT_10 <chr>, AREALAND_TRACT_10 <dbl>, AREAWATER_TRACT_10 <dbl>,
#   MTFCC_TRACT_10 <chr>, FUNCSTAT_TRACT_10 <chr>, AREALAND_PART <dbl>,
#   AREAWATER_PART <dbl>, STATE2KX <dbl>, CNTY2KX <chr>, TRACT2KX <chr>,
#   `e(0)` <dbl>, `se(e(0))` <dbl>, `Abridged life table flag` <dbl>

Areal Interpolation, Part 1

Hereis a good example of the application of this technique.

To reconcile 2010 Life Expectancy (LE) estimates with 2020 Census tract boundaries, we used areal-weighted interpolation approach. Using Census relationship files, we identified how much land area from each 2010 tract overlaps with each 2020 tract. So for example then, if 40% of a 2010 tract falls into a particular 2020 tract, then 40% of that tract’s Life Expectancy value contributes to the new tract.

Then, the 2010 LE value was multiplied by this area proportion to allocate partial contributions to the new tract geography. These weighted fragments were then aggregated to produce a single LE estimate for each 2020 tract. This approach preserves spatial continuity and avoids double-counting while ensuring compatibility with the updated census geography. For exampole, if a 2020 tract contains 40% of one 2010 tract with an LE of 78 years and 60% of another 2010 tract with an LE of 82 years, the new estimate would be calculated as (78 × 0.40) + (82 × 0.60). This ensures that each portion contributes proportionally to the final value.

Then, if all contributing fragments lacked valid LE data, the resulting value was retained as missing (NA) rather than set to zero to preserve true data gaps for later imputation. A coverage diagnostic (total_weight_check) was calculated as the sum of area weights contributing to each tract to assess completeness of source data and inform later normalization and imputation decisions.

options(scipen = 999) #added after the fact to remove scientific notation

final_le_2020_1 <- merged_data %>%
  #calculate weight and handle LE carefully
  mutate(area_weight = AREALAND_PART / AREALAND_TRACT_10) %>% #area_wirhgt wiil represent the proportion of a 2010 tract contributing to a 2020 tract
  
  #calculate weighted LE, but keep it na if the source was Nna
  mutate(weighted_le = `e(0)` * area_weight) %>%
  
  group_by(GEOID_TRACT_20) %>%
  summarize(
    #sum the pieces, but if all pieces are na, the result should be na not 0, and we want na so we can later impute
    life_expectancy_2020 = if(all(is.na(weighted_le))) NA else sum(weighted_le, na.rm = TRUE),
    total_weight_check = sum(area_weight, na.rm = TRUE) #how much of this 2020 tract is covered by valid 2010 source data
  )

head(final_le_2020_1, 20)
# A tibble: 20 × 3
   GEOID_TRACT_20 life_expectancy_2020 total_weight_check
            <dbl>                <dbl>              <dbl>
 1    48001950100            82.4                   1    
 2    48001950401             0.000184              1.00 
 3    48001950402            NA                     0.982
 4    48001950500            76.9                   1    
 5    48001950600            71.1                   1    
 6    48001950700            71.3                   1    
 7    48001950800            74.7                   0.981
 8    48001950901            77.1                   1.02 
 9    48001950902            86.0                   1.00 
10    48001951001            47.9                   0.581
11    48001951002            34.6                   0.419
12    48001951100            75.1                   1.02 
13    48003950100            78.9                   1    
14    48003950200            76.6                   1.00 
15    48003950300            74.0                   0.997
16    48003950400            74.4                   1.00 
17    48005000102            78.2                   1    
18    48005000103            36.8                   0.491
19    48005000104            38.2                   0.509
20    48005000201            23.6                   0.327

Bring in our target tracts which include Austin Limited and Full Purpose Boundaries. I made a map of them here.

target_tracts <- read_csv(paste0(path, "Austin_LTD_FULL_Tract_List_10pct(in).csv")) 
Rows: 250 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): NAMELSAD
dbl (3): GEOID_clean, NAME, pct_in_austin

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
names(target_tracts)
[1] "GEOID_clean"   "NAME"          "NAMELSAD"      "pct_in_austin"
head(target_tracts)
# A tibble: 6 × 4
  GEOID_clean   NAME NAMELSAD            pct_in_austin
        <dbl>  <dbl> <chr>                       <dbl>
1 48209010912 109.   Census Tract 109.12          20.3
2 48453000101   1.01 Census Tract 1.01           100  
3 48453000102   1.02 Census Tract 1.02           100  
4 48453000203   2.03 Census Tract 2.03           100  
5 48453000204   2.04 Census Tract 2.04           100  
6 48453000205   2.05 Census Tract 2.05           100  

Areal Interpolation, Part 2

To get finalized 2020 LE estimates for the study area, the areal-weighted interpolation procedure was repeated with explicit normalization to account for incomplete source coverage. After recalculating the proportion of each 2010 tract contributing to each 2020 tract, the weighted LE fragments were aggregated at the 2020 tract level.

Unlike the initial aggregation step, this version divides the sum of weighted values by the total share of tract area that had valid 2010 EL data.

Normalization Note: This normalization step “rescales” the estimate to represent 100% of the tract area based only on the portion supported by observed data, preventing partial-coverage tracts from being artificially deflated. In plain words, we estimate what the tract’s value would be if the entire tract behaved like the portion we actually observed. Normalization prevents tracts with partial data from appearing artificially low simply because some source data were missing. So, an example would be if a tract has 60% from a 2010 tract with valid LE (80), but then 40% from a 2010 tract that had missing data. So to get the weighted LE, we would do 80 × 0.60, which = 48. However 48 is not an accurate new LE. Why? It is wrong and small because it only represents a portion of the tract as 40% was missing. That’s why we normalize. So again, we estimate what the value would be if the tract behaved like the rest of the tract for which we actually do have observations.

The resulting dataset was then filtered to include only census tracts within the defined study area, and basic validation checks were performed to confirm structure and to identify any remaining missing values requiring further review.

# Normalize and Filter
final_study_area_le <- merged_data %>%
  # calculate the area weight again for safety
  # area_weight represents the proportion of a 2010 tract contributing to a 2020 tract
  mutate(area_weight = AREALAND_PART / AREALAND_TRACT_10) %>%
  mutate(weighted_le  = `e(0)` * area_weight) %>%
  
  group_by(GEOID_TRACT_20) %>%
  summarize(
    # data_coverage = share of the 2020 tract that had valid 2010 LE data
    data_coverage = sum(area_weight * !is.na(`e(0)`)),
    
    # NORMALIZATION (weighted mean / intensive interpolation):
    # If coverage is 0, return NA (avoids NaN from dividing by 0).
    # Otherwise, "stretch" the weighted sum back to represent 100% of the tract
    # based only on the portion with observed LE data.
    le_2020_normalized = ifelse(
      data_coverage == 0,
      NA_real_,
      sum(weighted_le, na.rm = TRUE) / data_coverage
    ),
    .groups = "drop"
  ) %>%
  
  # keep only the tracts in my specific target list
  # used 'as.character' to ensure the IDs match perfectly as text
  filter(as.character(GEOID_TRACT_20) %in% as.character(target_tracts$GEOID_clean))

Map of NAs

library(sf)
Warning: package 'sf' was built under R version 4.4.3
Linking to GEOS 3.13.0, GDAL 3.10.1, PROJ 9.5.1; sf_use_s2() is TRUE
library(ggplot2)
library(tigris)
To enable caching of data, set `options(tigris_use_cache = TRUE)`
in your R script or .Rprofile.
options(tigris_use_cache = TRUE)

#downloading shapes
study_area_shapes <- tracts(state = "48", 
                            county = c("021", "209", "453", "491"), 
                            year = 2020, 
                            cb = TRUE)


#mutate 
map_data1 <- study_area_shapes %>%
  left_join(final_study_area_le %>% 
              mutate(GEOID_TRACT_20 = as.character(GEOID_TRACT_20)), 
            by = c("GEOID" = "GEOID_TRACT_20")) %>%
  #here we filter to get only the 250 tracts in our target spreadsheet
  filter(GEOID %in% as.character(target_tracts$GEOID_clean))


ggplot(data = map_data1) +
  geom_sf(aes(fill = le_2020_normalized), color = "white", size = 0.1) +
  scale_fill_viridis_c(na.value = "red", name = "Life Expectancy") + 
  theme_minimal() +
  labs(title = "Life Expectancy (2020 Tracts)",
       subtitle = "Red areas indicate missing/NaN data (Coverage = 0)")

just making it interactive

library(mapview)
Warning: package 'mapview' was built under R version 4.4.3
library(leafsync) 
Warning: package 'leafsync' was built under R version 4.4.3
#the spatial data is in a standard 'web' projection (WGS84)
map_data_web1 <- st_transform(map_data1, crs = 4326)


# notes from  helper 'zcol' defines which column colors the map
# notes from helper 'na.color' makes missing data easy to spot 
interactive_map1 <- mapview(map_data_web1, 
                           zcol = "le_2020_normalized", 
                           col.regions = viridisLite::viridis(100),
                           na.color = "red",
                           layer.name = "Life Expectancy 2020")
Warning: Found less unique colors (100) than unique zcol values (131)! 
Interpolating color vector to match number of zcol values.
interactive_map1

Pop join + missingness review

Next, we pull total population counts from the 2020 Decennial Census and attach them to the tract-level LE results to help diagnose missing values. After downloading tract populations (P1_001N) for the four-county study area, the code joins those population totals to the normalized LE dataset and keeps only a small set of fields needed for review: tract ID, normalized LE, data coverage, and total population. The follow-up step filters the dataset to isolate tracts where LE is still missing.

These tracts show data_coverage = 0, meaning none of the 2020 tract area intersected with a 2010 tract that had valid LE data in the source dataset. Because the normalization step divides by the amount of valid coverage, a coverage of zero results in NaN (i.e., the calculation becomes “something divided by 0”). Many of these tracts have substantial populations, which indicates the issue is not just “non-residential tracts,” but instead a gap in LE source coverage after crosswalking, making them candidates for imputation rather than exclusion.

library(tidycensus)


#used decennial for more stable pop estimates
pop_data <- get_decennial(
  geography = "tract",
  variables = "P1_001N", #total pop for 2020 Decennial
  state = "TX",
  county = c("021", "209", "453", "491"),
  year = 2020
)
Getting data from the 2020 decennial Census
Using the PL 94-171 Redistricting Data Summary File
Note: 2020 decennial Census data use differential privacy, a technique that
introduces errors into data to preserve respondent confidentiality.
ℹ Small counts should be interpreted with caution.
ℹ See https://www.census.gov/library/fact-sheets/2021/protecting-the-confidentiality-of-the-2020-census-redistricting-data.html for additional guidance.
This message is displayed once per session.
final_review <- final_study_area_le %>%
  mutate(GEOID_TRACT_20 = as.character(GEOID_TRACT_20)) %>% 
  left_join(pop_data, by = c("GEOID_TRACT_20" = "GEOID")) %>%
  rename(total_pop_2020 = value) %>% #remember for next time decennial uses value, not estimate
  select(GEOID_TRACT_20, le_2020_normalized, data_coverage, total_pop_2020)



#the missing analysis
missing_analysis <- final_review %>%
  filter(is.na(le_2020_normalized))

missing_analysis
# A tibble: 36 × 4
   GEOID_TRACT_20 le_2020_normalized data_coverage total_pop_2020
   <chr>                       <dbl>         <dbl>          <dbl>
 1 48453000203                    NA             0           2537
 2 48453000308                    NA             0           2698
 3 48453000309                    NA             0           5602
 4 48453000500                    NA             0           4490
 5 48453000601                    NA             0           8580
 6 48453000605                    NA             0           4645
 7 48453000606                    NA             0           5012
 8 48453000607                    NA             0           4268
 9 48453000608                    NA             0           5286
10 48453000700                    NA             0           1411
# ℹ 26 more rows

Imputation

Finally, we fill in missing Life Expectancy values using a spatial nearest-neighbor imputation approach so that every tract can be included in the composite index.

First, population totals are joined to the spatial tract map, and tracts with very small populations (≤50) are removed from the imputation process to avoid imputing values into non-residential areas where “nearest neighbor” assumptions may not be meaningful. The airport census tract is an example.

The remaining tracts are split into two groups: those with complete LE values and those still missing LE values. For each missing tract, the code identifies the three closest nearby tracts (based on geographic distance between tract centroids) that do have valid LE estimates. It then imputes the missing LE by taking the average of those five neighbors, assigning each tract a flag to indicate whether the value is original or imputed.

library(sf)
library(nngeo)
Warning: package 'nngeo' was built under R version 4.4.3
#join
map_data2 <- map_data1 %>%
  left_join(final_review %>% 
              select(GEOID_TRACT_20, total_pop_2020), 
            by = c("GEOID" = "GEOID_TRACT_20")) 


impute_prep <- map_data2 %>%
  filter(total_pop_2020 > 50) 

#separate into Complete and Missing sets
complete_tracts <- impute_prep %>% filter(!is.na(le_2020_normalized)) # 
missing_tracts <- impute_prep %>% filter(is.na(le_2020_normalized)) # 

#find the 3 nearest neighbors
neighbor_indices <- st_nn(missing_tracts, complete_tracts, k = 5, progress = FALSE) 
lines or polygons
#calculate the average LE of those neighbors
imputed_values <- sapply(neighbor_indices, function(idx) {
  mean(complete_tracts$le_2020_normalized[idx], na.rm = TRUE)
}) # 

#plug values back and add status flags
missing_tracts$le_2020_normalized <- imputed_values
complete_tracts$imputed_flag <- "Original"
missing_tracts$imputed_flag <- "Imputed"

#combine back into one final dataset (Removed the duplicate line)
final_le_complete <- rbind(complete_tracts, missing_tracts) 


output_path <- paste0(path, "Final_Life_Expectancy_Imputed_2020.csv")

final_le_complete %>%
  st_drop_geometry() %>% #just taking out to have csv table for arc
  mutate(GEOID = as.character(GEOID)) %>%
  write_csv(output_path)

message("File saved to: ", output_path)
File saved to: C:/Users/rayo-garza/OneDrive - Center for Public Policy Priorities/Documents/Data 2025/Final_Life_Expectancy_Imputed_2020.csv

View Imputed Tracts

includes excluded tracts due to low pop

library(dplyr)
library(sf)
library(ggplot2)

#object for tracts excluded from imputation due to low population (≤ 50)
excluded_tracts <- map_data2 %>%
  filter(is.na(total_pop_2020) | total_pop_2020 <= 50) %>%
  mutate(imputed_flag = "Excluded (pop ≤ 50)")


excluded_tracts2 <- excluded_tracts %>%
  mutate(
    le_2020_normalized = NA_real_,
    data_coverage = NA_real_
  )

final_le_for_map <- bind_rows(
  final_le_complete,
  excluded_tracts2 %>% select(names(final_le_complete))
)


ggplot(data = final_le_for_map) +
  geom_sf(aes(fill = imputed_flag), color = "white", size = 0.1) +
  scale_fill_manual(
    values = c(
      "Original" = "#21908C",
      "Imputed" = "#FDE725",
      "Excluded (pop ≤ 50)" = "#BDBDBD"
    ),
    name = "Data Status"
  ) +
  theme_minimal() +
  labs(
    title = "Life Expectancy Data Status by Census Tract",
    subtitle = "Imputed tracts use mean of 5 nearest neighbors; gray tracts were excluded from imputation (pop ≤ 50)."
  )

Here I was just double checking that the excluded tract was airport

# excluded_count <- sum(map_data2$total_pop_2020 <= 50, na.rm = TRUE)
# 
# message("Number of tracts excluded due to population ≤ 50: ", excluded_count)
# message("Missing LE before imputation: ",
#         sum(is.na(final_review$le_2020_normalized)))
# 
# excluded_count <- sum(map_data2$total_pop_2020 <= 50, na.rm = TRUE)
# table(final_le_complete$imputed_flag)
# 
# 
# summary(map_data2$total_pop_2020)
# sum(map_data2$total_pop_2020 <= 50, na.rm = TRUE)
# 
# map_data2 %>%
#   filter(total_pop_2020 <= 50 | is.na(total_pop_2020)) %>%
#   select(GEOID, total_pop_2020)
# 
# 
# excluded_sf <- map_data2 %>%
#   filter(total_pop_2020 <= 50 | is.na(total_pop_2020))
# 
# excluded_sf %>% 
#   st_drop_geometry() %>% 
#   select(GEOID, NAMELSAD, total_pop_2020)
# 
# #is it airport
# excluded_centroid <- st_centroid(excluded_sf)
# st_coordinates(excluded_centroid)
# 
# 
# ggplot() +
#   geom_sf(data = map_data1, fill = "white", color = "gray80") +
#   geom_sf(data = excluded_sf, fill = "red") +
#   theme_minimal() +
#   labs(title = "Excluded Census Tract (Pop ≤ 50)")

Validation Checks

summary(final_le_complete$le_2020_normalized)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  68.71   77.60   80.12   79.76   81.38   88.90 
final_review %>%
  summarize(
    mean_before = mean(le_2020_normalized, na.rm = TRUE),
    sd_before   = sd(le_2020_normalized, na.rm = TRUE)
  )
# A tibble: 1 × 2
  mean_before sd_before
        <dbl>     <dbl>
1        79.8      3.21
final_le_complete %>%
  summarize(
    mean_after = mean(le_2020_normalized, na.rm = TRUE),
    sd_after   = sd(le_2020_normalized, na.rm = TRUE)
  )
Simple feature collection with 1 feature and 2 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: -98.0104 ymin: 30.08853 xmax: -97.45043 ymax: 30.52177
Geodetic CRS:  NAD83
  mean_after sd_after                       geometry
1   79.75549 3.078123 POLYGON ((-97.65604 30.4489...
range(final_le_complete$le_2020_normalized, na.rm = TRUE)
[1] 68.70955 88.90000
final_le_complete %>%
  group_by(imputed_flag) %>%
  summarize(
    min_LE = min(le_2020_normalized, na.rm = TRUE),
    max_LE = max(le_2020_normalized, na.rm = TRUE),
    mean_LE = mean(le_2020_normalized, na.rm = TRUE)
  )
Simple feature collection with 2 features and 4 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -98.0104 ymin: 30.08853 xmax: -97.45043 ymax: 30.52177
Geodetic CRS:  NAD83
# A tibble: 2 × 5
  imputed_flag min_LE max_LE mean_LE                                    geometry
  <chr>         <dbl>  <dbl>   <dbl>                          <MULTIPOLYGON [°]>
1 Imputed        76.2   83.5    79.7 (((-97.63835 30.27447, -97.63357 30.27214,…
2 Original       68.7   88.9    79.8 (((-97.67917 30.45523, -97.67639 30.45397,…

After conducting the leave one out validation check, the decision was made to increase k from 3 to 5

# # Leave-One-Out Validation for KNN Imputation
# #This tests whether KNN can accurately recover known LE values.
# 
# set.seed(123)
# 
# # Use only original (non-imputed) tracts for validation
# validation_pool <- final_le_complete %>%
#   filter(imputed_flag == "Original")
# 
# # Randomly select 20 tracts to temporarily treat as "missing"
# test_sample <- validation_pool %>%
#   slice_sample(n = 20)
# 
# # Create temporary complete and missing sets
# complete_temp <- validation_pool %>%
#   filter(!GEOID %in% test_sample$GEOID)
# 
# missing_temp <- test_sample
# 
# # Find 3 nearest neighbors
# neighbor_indices <- st_nn(missing_temp, complete_temp, k = 3, progress = FALSE)
# 
# # Predict values
# predicted <- sapply(neighbor_indices, function(idx) {
#   mean(complete_temp$le_2020_normalized[idx], na.rm = TRUE)
# })
# 
# actual <- missing_temp$le_2020_normalized
# 
# # Calculate error metrics
# rmse <- sqrt(mean((predicted - actual)^2))
# mae  <- mean(abs(predicted - actual))
# 
# tibble(
#   RMSE = rmse,
#   MAE  = mae
# )
# 
# 
# sd(final_le_complete$le_2020_normalized, na.rm = TRUE)
# # Leave-One-Out Validation: Compare k = 3 vs k = 5 (same holdout tracts)
# 
# set.seed(123)
# 
# # Use only original (non-imputed) tracts for validation
# validation_pool <- final_le_complete %>%
#   filter(imputed_flag == "Original")
# 
# # Hold out the same 20 tracts
# test_sample <- validation_pool %>%
#   slice_sample(n = 20)
# 
# complete_temp <- validation_pool %>%
#   filter(!GEOID %in% test_sample$GEOID)
# 
# missing_temp <- test_sample
# 
# # Helper to run KNN prediction for a given k
# knn_predict <- function(k) {
#   neighbor_indices <- st_nn(missing_temp, complete_temp, k = k, progress = FALSE)
#   sapply(neighbor_indices, function(idx) {
#     mean(complete_temp$le_2020_normalized[idx], na.rm = TRUE)
#   })
# }
# 
# actual <- missing_temp$le_2020_normalized
# 
# # Predictions
# pred_k3 <- knn_predict(3)
# pred_k5 <- knn_predict(5)
# 
# # Error metrics
# rmse <- function(pred, actual) sqrt(mean((pred - actual)^2))
# mae  <- function(pred, actual) mean(abs(pred - actual))
# 
# results <- tibble(
#   k = c(3, 5),
#   RMSE = c(rmse(pred_k3, actual), rmse(pred_k5, actual)),
#   MAE  = c(mae(pred_k3, actual),  mae(pred_k5, actual))
# )
# 
# results

Mapping out distribution after switching to k=5

library(ggplot2)

ggplot(final_le_complete, aes(x = le_2020_normalized, fill = imputed_flag)) +
  geom_density(alpha = 0.4) +
  theme_minimal() +
  labs(
    title = "Distribution of Life Expectancy: Original vs Imputed",
    x = "Life Expectancy",
    y = "Density"
  )

ggplot(final_le_complete) +
  geom_sf(aes(fill = imputed_flag), color = "white", size = 0.1) +
  theme_minimal()

final_review %>%
  summarize(
    q10 = quantile(le_2020_normalized, 0.10, na.rm = TRUE),
    q50 = quantile(le_2020_normalized, 0.50, na.rm = TRUE),
    q90 = quantile(le_2020_normalized, 0.90, na.rm = TRUE)
  )
# A tibble: 1 × 3
    q10   q50   q90
  <dbl> <dbl> <dbl>
1  75.7  80.1  83.5
final_le_complete %>%
  summarize(
    q10 = quantile(le_2020_normalized, 0.10, na.rm = TRUE),
    q50 = quantile(le_2020_normalized, 0.50, na.rm = TRUE),
    q90 = quantile(le_2020_normalized, 0.90, na.rm = TRUE)
  )
Simple feature collection with 1 feature and 3 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: -98.0104 ymin: 30.08853 xmax: -97.45043 ymax: 30.52177
Geodetic CRS:  NAD83
       q10      q50   q90                       geometry
1 75.89751 80.11631 83.32 POLYGON ((-97.65604 30.4489...

Morans I Test

library(sf)
library(dplyr)
library(spdep)
Warning: package 'spdep' was built under R version 4.4.3
Loading required package: spData
Warning: package 'spData' was built under R version 4.4.3
To access larger datasets in this package, install the spDataLarge
package with: `install.packages('spDataLarge',
repos='https://nowosad.github.io/drat/', type='source')`
# BEFORE: Only original tracts
before_all_sf <- map_data2 %>%
  mutate(GEOID = as.character(GEOID)) %>%
  left_join(
    final_review %>%
      mutate(GEOID_TRACT_20 = as.character(GEOID_TRACT_20)) %>%
      select(GEOID_TRACT_20, le_2020_normalized) %>%
      rename(le_before = le_2020_normalized),
    by = c("GEOID" = "GEOID_TRACT_20")
  ) %>%
  filter(!is.na(le_before)) %>%
  arrange(GEOID)

# AFTER: All tracts (including imputed)
after_all_sf <- final_le_complete %>%
  mutate(GEOID = as.character(GEOID),
         le_after = as.numeric(le_2020_normalized)) %>%
  arrange(GEOID)

# Build spatial weights separately

# BEFORE weights
coords_before <- st_coordinates(st_centroid(st_geometry(before_all_sf)))
nb_before <- knn2nb(knearneigh(coords_before, k = 5))
lw_before <- nb2listw(nb_before, style = "W")

# AFTER weights
coords_after <- st_coordinates(st_centroid(st_geometry(after_all_sf)))
nb_after <- knn2nb(knearneigh(coords_after, k = 5))
lw_after <- nb2listw(nb_after, style = "W")

# Moran’s I
moran_before_all <- moran.test(before_all_sf$le_before, lw_before)
moran_after_all  <- moran.test(after_all_sf$le_after,  lw_after)

moran_before_all

    Moran I test under randomisation

data:  before_all_sf$le_before  
weights: lw_before    

Moran I statistic standard deviate = 9.8168, p-value <
0.00000000000000022
alternative hypothesis: greater
sample estimates:
Moran I statistic       Expectation          Variance 
      0.387775200      -0.004694836       0.001598359 
moran_after_all

    Moran I test under randomisation

data:  after_all_sf$le_after  
weights: lw_after    

Moran I statistic standard deviate = 11.65, p-value <
0.00000000000000022
alternative hypothesis: greater
sample estimates:
Moran I statistic       Expectation          Variance 
      0.428849648      -0.004032258       0.001380663 

Spatial nearest-neighbor imputation slightly increased measured spatial autocorrelation (Moran’s I from 0.388 to 0.429), reflecting the local smoothing nature of the KNN approach. However, the increase was modest and did not materially alter the overall clustering structure of life expectancy across tracts. Global distributional properties (mean, standard deviation, quantiles) remained stable, indicating that imputation preserved both statistical and spatial integrity of the dataset.

weighted.mean(
  final_le_complete$le_2020_normalized,
  final_le_complete$total_pop_2020,
  na.rm = TRUE
)
[1] 79.95513

See Esri (2024). How Spatial Autocorrelation (Global Moran’s I) works. ArcGIS Pro Documentation.

Notes

Because tract-level LE is not available at finer population units, we applied land-area weighting as a standard approach; population-weighted interpolation was not feasible with available inputs. One tract (GEOID 48453980000; pop=3) was excluded from imputation due to being non-residential (airport tract).