Full Dataset Imputation

Author

CRG

Full Dataset Imputation

In this analysis, I solve for missing data at the census tract level to ensure that all tracts can be included in the final composite index.

Public pre-K enrollment initially had suppressed values in several census tracts. Rather than excluding those tracts, I used a multiple imputation approach to estimate plausible values based on relationships with other socioeconomic indicators in the dataset (such as poverty, income, health outcomes, and housing conditions).

Because a small number of other index indicators also contained limited missing values (each affecting fewer than 5% of tracts), I implemented a full joint imputation model. This approach allows the variables to inform one another during estimation, preserving the overall structure of the data while filling in missing entries.

Predictive Mean Matching (PMM) was used for imputation. This method replaces missing values with observed values from similar census tracts, ensuring that imputed estimates remain realistic and within the observed range of the data.

After imputation:

All index indicators were complete across 249 census tracts.
Observed values were unchanged.
The overall distribution and correlation structure of the dataset were preserved, based on validation checks towards end of qmd

This process allows the composite index to be calculated using a complete and internally consistent dataset while minimizing bias from listwise deletion.

Note: In this code, the workflow is a little off. First initial chunks focused only on prek imputation. Refer to table of contents for full dataset imputation. Only keeping irrelevant code for notes and process documentation.

I reviewed KWs code: https://rpubs.com/kaitlan/1396846

I asked her to use CS code as guide: https://rpubs.com/corey_sparks/883799

Other resources: https://stats.oarc.ucla.edu/r/faq/how-do-i-perform-multiple-imputation-using-predictive-mean-matching-in-r/

https://stefvanbuuren.name/fimd/sec-pmm.html

Import and Join

I brought together KW’s master dataframe with my LE imputed dataset

library(readr)
library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library(stringr)


data_dir <- "C:/Users/rayo-garza/OneDrive - Center for Public Policy Priorities/Documents/Data 2025"

master_path  <- file.path(data_dir, "Compiled_Data_Austin_LTD_FULL_02.06 (1).csv")
imputed_path <- file.path(data_dir, "Final_Life_Expectancy_Imputed_2020.csv")

#force IDs to character so leading zeros are safe
master <- read_csv(master_path, col_types = cols(.default = col_guess(),
                                                GEOID_clean = col_character()))

imputed <- read_csv(imputed_path, col_types = cols(.default = col_guess(),
                                                  GEOID = col_character(),
                                                  AFFGEOID = col_character(),
                                                  TRACTCE = col_character(),
                                                  COUNTYFP = col_character(),
                                                  STATEFP = col_character()))

#clean/standardize GEOIDs to 11-digit tract GEOID (SSCCCTTTTTT) ---
clean_geoid11 <- function(x) {
  x %>%
    str_trim() %>%
    str_replace_all("[^0-9]", "") %>%  # removes "1400000US"
    str_sub(-11) %>%                   # keep last 11 digits if longer
    str_pad(11, side = "left", pad = "0")
}

master <- master %>%
  mutate(GEOID_clean = clean_geoid11(GEOID_clean))

imputed <- imputed %>%
  mutate(GEOID11 = clean_geoid11(GEOID))

#checks: duplicates in the imputed file 
dups <- imputed %>%
  count(GEOID11) %>%
  filter(n > 1)

if (nrow(dups) > 0) {
  print(dups)
  stop("Duplicate GEOIDs found in imputed file. Resolve before joining.")
}

#Join
master2 <- master %>%
  left_join(
    imputed %>%
      select(GEOID11, le_2020_normalized, imputed_flag, data_coverage) %>%
      rename(
        LE_imputed_2020 = le_2020_normalized,
        LE_imputed_flag = imputed_flag,
        LE_data_coverage = data_coverage
      ),
    by = c("GEOID_clean" = "GEOID11")
  )

#match rate + how many got LE values 
qa <- master2 %>%
  summarise(
    n_master = n(),
    n_with_imputed_LE = sum(!is.na(LE_imputed_2020)),
    n_missing_imputed_LE = sum(is.na(LE_imputed_2020)),
    match_rate = n_with_imputed_LE / n_master
  )

print(qa)

# A tibble: 1 × 4
  n_master n_with_imputed_LE n_missing_imputed_LE match_rate
     <int>             <int>                <int>      <dbl>
1      250               249                    1      0.996

#unmatched tracts
unmatched <- master2 %>%
  filter(is.na(LE_imputed_2020)) %>%
  select(GEOID_clean, `Tract Full Name`) %>%
  head(20)

print(unmatched)

# A tibble: 1 × 2
  GEOID_clean `Tract Full Name`                      
  <chr>       <chr>                                  
1 48453980000 Census Tract 9800; Travis County; Texas

#remove airport
master2 <- master2 %>%
  filter(GEOID_clean != "48453980000")


out_path <- file.path(data_dir, "master_with_imputed_LE.csv")
write_csv(master2, out_path)


#are there other special use tracts? 

master2 <- master2 %>%
  mutate(
    tract_code = str_sub(GEOID_clean, -6)  # last 6 digits
  )

special_use <- master2 %>%
  filter(str_detect(tract_code, "^(98|99)")) %>%
  select(GEOID_clean, `Tract Full Name`, tract_code)

special_use

# A tibble: 0 × 3
# ℹ 3 variables: GEOID_clean <chr>, Tract Full Name <chr>, tract_code <chr>

(Obselete)

The below code was when I was only imputing for prek

# library(mice)
# library(dplyr)
# 
# 
# #create the MICE dataset with the outcome + predictors (different from KW in that I didn't use the whole indicator list for prediction, just select)
# mice_vars <- c(
#   "public_preK_est",
#   "est_belowpov_pct",
#   "est_med_hh_inc",
#   "pct_female_hh_kids",
#   "pct_hh_with_kids",
#   "Total Population",
#   "est_uninsured_rate",
#   "est_no_internet_perc"
# )
# 
# mice_data <- master2 %>%
#   select(all_of(mice_vars))
# 
# # (Optional) Make sure blanks aren't hiding as empty strings (usually not in CSV, but safe)
# mice_data[mice_data == ""] <- NA
# 
# pred <- make.predictorMatrix(mice_data)
# pred[,] <- 0
# pred["public_preK_est", setdiff(colnames(mice_data), "public_preK_est")] <- 1
# 
# meth <- make.method(mice_data)
# meth[] <- ""
# meth["public_preK_est"] <- "pmm"
# 
# set.seed(123)
# imp <- mice(
#   data = mice_data,
#   m = 5,
#   method = meth,
#   predictorMatrix = pred,
#   maxit = 20,
#   printFlag = FALSE
# )
# 
# completed <- complete(imp, action = 1)
# 
# #add to master
# master2 <- master2 %>%
#   mutate(
#     public_preK_imputed = completed$public_preK_est,
#     preK_was_imputed = is.na(public_preK_est),
#     #finalversion keeps observed values exactly and fills only missings
#     public_preK_final = if_else(is.na(public_preK_est), public_preK_imputed, public_preK_est)
#   )
# 
# #observed values should be unchanged
# cor(master2$public_preK_est, master2$public_preK_final, use = "complete.obs")
# table(is.na(master2$public_preK_final))
# 
# out_path <- file.path(data_dir, "master_with_imputed_LE_and_preK2.csv")
# write_csv(master2, out_path)
# 
# out_path

# library(dplyr)
# 
# master2 <- master2 %>%
#   mutate(
#     preK_z_observed = scale(public_preK_est)[,1],
#     preK_z_imputed  = scale(public_preK_imputed)[,1]
#   )
# 
# cor(master2$preK_z_observed,
#     master2$preK_z_imputed,
#     use = "complete.obs")

>.95 means PMM did not change any observed values. It only filled in missing ones and it did not distort the distribution, which you can see in the histogram (there is polarization, which is reflective of reality).

Notes

Missing values for public Pre-K enrollment (25.7% of tracts) were imputed using Predictive Mean Matching (PMM) with multiple imputation (m=5), incorporating tract-level poverty, median household income, population size, household composition, uninsured rate, and internet access as predictors. PMM preserved the original variance (SD_observed = 40.64; SD_imputed = 41.34) and maintained tract ranking among observed cases (r = 1.00).

Full Dataset Imputation

library(mice)

Warning: package 'mice' was built under R version 4.4.3


Attaching package: 'mice'

The following object is masked from 'package:stats':

    filter

The following objects are masked from 'package:base':

    cbind, rbind

library(dplyr)
library(readr)

#define the index variable list (use the ORIGINAL preK variable)
index_vars <- c(
  "public_preK_est",
  "est_uninsured_rate",
  "LE_imputed_2020",
  "est_has_disability_pct",
  "est_65plus_amb_pct",
  "LPA_CrudePrev",
  "est_med_hh_inc",
  "est_underemp_perc",
  "est_belowpov_pct",
  "Eviction filing rate",
  "hh_support_risk_score",
  "Energy Burden (% income)",
  "est_lesh_perc",
  "est_no_internet_perc",
  "lths_pct",
  "Percent Institutionalized GQ",
  "Persistent Poverty",
  "Temperature_Diff"
)

#build MICE dataset (only index variables)
mice_data <- master2 %>% select(all_of(index_vars))

#ensure numeric (important because spaces/special chars can be tricky)
mice_data <- mice_data %>% mutate(across(everything(), as.numeric))

#set methods: impute only variables that actually have NAs, using PMM
meth <- make.method(mice_data)
meth[] <- ""  # <--default: do not impute--Took from CS and KW notes
vars_with_na <- names(which(colSums(is.na(mice_data)) > 0))
meth[vars_with_na] <- "pmm"

#predictor matrix: by default allow all variables to predict each other (do I need to limit any one specific variable? Need to check)
pred <- make.predictorMatrix(mice_data)
pred[,] <- 1
diag(pred) <- 0


set.seed(123)
imp <- mice(
  data = mice_data,
  m = 10,
  method = meth,
  predictorMatrix = pred,
  maxit = 20,
  printFlag = FALSE
)

#extract one completed dataset. Basically telling R to give me the first of the 10 completed datasets
completed <- complete(imp, action = 1)

#merge back into master2 as *_imp and *_flag columns
#keeping original columns untouched for transparency.
for (v in index_vars) {
  imp_col  <- paste0(v, "_imp")
  flag_col <- paste0(v, "_flag")

  master2[[imp_col]] <- completed[[v]]
  master2[[flag_col]] <- ifelse(is.na(mice_data[[v]]), "Imputed", "Original")
}

#any remaining NA in the imputed index matrix?
imputed_cols <- paste0(index_vars, "_imp")
colSums(is.na(master2[, imputed_cols]))

             public_preK_est_imp           est_uninsured_rate_imp 
                               0                                0 
             LE_imputed_2020_imp       est_has_disability_pct_imp 
                               0                                0 
          est_65plus_amb_pct_imp                LPA_CrudePrev_imp 
                               0                                0 
              est_med_hh_inc_imp            est_underemp_perc_imp 
                               0                                0 
            est_belowpov_pct_imp         Eviction filing rate_imp 
                               0                                0 
       hh_support_risk_score_imp     Energy Burden (% income)_imp 
                               0                                0 
               est_lesh_perc_imp         est_no_internet_perc_imp 
                               0                                0 
                    lths_pct_imp Percent Institutionalized GQ_imp 
                               0                                0 
          Persistent Poverty_imp             Temperature_Diff_imp 
                               0                                0

out_path <- file.path(data_dir, "master_FULL_joint_imputed_for_index.csv")
write_csv(master2, out_path)

out_path

[1] "C:/Users/rayo-garza/OneDrive - Center for Public Policy Priorities/Documents/Data 2025/master_FULL_joint_imputed_for_index.csv"

#do we have any nas after imputation
imputed_cols <- paste0(index_vars, "_imp")
colSums(is.na(master2[, imputed_cols]))

             public_preK_est_imp           est_uninsured_rate_imp 
                               0                                0 
             LE_imputed_2020_imp       est_has_disability_pct_imp 
                               0                                0 
          est_65plus_amb_pct_imp                LPA_CrudePrev_imp 
                               0                                0 
              est_med_hh_inc_imp            est_underemp_perc_imp 
                               0                                0 
            est_belowpov_pct_imp         Eviction filing rate_imp 
                               0                                0 
       hh_support_risk_score_imp     Energy Burden (% income)_imp 
                               0                                0 
               est_lesh_perc_imp         est_no_internet_perc_imp 
                               0                                0 
                    lths_pct_imp Percent Institutionalized GQ_imp 
                               0                                0 
          Persistent Poverty_imp             Temperature_Diff_imp 
                               0                                0

Validation Checks

#make sure observed variables were NOT changed
cor(master2$public_preK_est,
    master2$public_preK_est_imp,
    use="complete.obs")

[1] 1

cor(master2$est_med_hh_inc,
    master2$est_med_hh_inc_imp,
    use="complete.obs")

[1] 1

cor(master2$`Eviction filing rate`,
    master2$`Eviction filing rate_imp`,
    use="complete.obs")

[1] 1

#distribution preservation
summary(master2$public_preK_est)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   0.00    0.00   30.70   42.47   80.50  100.00      64

summary(master2$public_preK_est_imp)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    0.00   46.20   47.95  100.00  100.00

summary(master2$est_med_hh_inc)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  18286   78290   95536  105625  123417  250001       2

summary(master2$est_med_hh_inc_imp)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  18286   78026   95536  105783  123798  250001

summary(master2$`Eviction filing rate`)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  0.000   1.625   3.175   5.310   6.570  61.720       7

summary(master2$`Eviction filing rate_imp`)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   1.600   3.180   5.344   6.570  61.720

#variance
sd(master2$public_preK_est, na.rm=TRUE)

[1] 40.6392

sd(master2$public_preK_est_imp)

[1] 41.48397

sd(master2$est_med_hh_inc, na.rm=TRUE)

[1] 46215.07

sd(master2$est_med_hh_inc_imp)

[1] 46290.78

library(dplyr)

rank_before <- percent_rank(master2$public_preK_est)
rank_after  <- percent_rank(master2$public_preK_est_imp)

cor(rank_before, rank_after, use="complete.obs")

[1] 0.9996274

master2 %>%
  filter(public_preK_est_flag == "Imputed") %>%
  select(public_preK_est_imp) %>%
  summary()

 public_preK_est_imp
 Min.   :  0.00     
 1st Qu.: 25.90     
 Median : 77.20     
 Mean   : 63.79     
 3rd Qu.:100.00     
 Max.   :100.00

master2 %>%
  filter(est_med_hh_inc_flag == "Imputed") %>%
  select(est_med_hh_inc_imp) %>%
  summary()

 est_med_hh_inc_imp
 Min.   : 73977    
 1st Qu.: 99590    
 Median :125203    
 Mean   :125203    
 3rd Qu.:150816    
 Max.   :176429

cor(master2[, index_vars], use="complete.obs")

                             public_preK_est est_uninsured_rate LE_imputed_2020
public_preK_est                   1.00000000         0.44424579    -0.260946630
est_uninsured_rate                0.44424579         1.00000000    -0.380712468
LE_imputed_2020                  -0.26094663        -0.38071247     1.000000000
est_has_disability_pct            0.25827067         0.40359270    -0.441964645
est_65plus_amb_pct                0.22977167         0.42897600    -0.400097965
LPA_CrudePrev                     0.46421492         0.70654947    -0.405548159
est_med_hh_inc                   -0.34150893        -0.60248355     0.422620905
est_underemp_perc                 0.05879988        -0.02763129     0.173286383
est_belowpov_pct                  0.26831694         0.57825993    -0.340672337
Eviction filing rate              0.31622218         0.25788419    -0.289151917
hh_support_risk_score             0.09606129         0.10606816    -0.004763105
Energy Burden (% income)          0.41223181         0.57010456    -0.294115672
est_lesh_perc                     0.37304380         0.70115373    -0.223539877
est_no_internet_perc              0.24611549         0.37481867    -0.308701232
lths_pct                          0.48628646         0.72304126    -0.416267462
Percent Institutionalized GQ      0.10353749        -0.01687152    -0.064396911
Persistent Poverty                0.08390720         0.15832105    -0.233754838
Temperature_Diff                  0.07050733         0.26165504    -0.213673773
                             est_has_disability_pct est_65plus_amb_pct
public_preK_est                          0.25827067       0.2297716713
est_uninsured_rate                       0.40359270       0.4289759988
LE_imputed_2020                         -0.44196464      -0.4000979648
est_has_disability_pct                   1.00000000       0.5013507524
est_65plus_amb_pct                       0.50135075       1.0000000000
LPA_CrudePrev                            0.53693813       0.5297073068
est_med_hh_inc                          -0.41211095      -0.3002824716
est_underemp_perc                       -0.05289979      -0.1133425311
est_belowpov_pct                         0.38084466       0.4099607762
Eviction filing rate                     0.30685306       0.2101222086
hh_support_risk_score                    0.08896564       0.1155191053
Energy Burden (% income)                 0.46141149       0.3353493163
est_lesh_perc                            0.25000550       0.3207646305
est_no_internet_perc                     0.36533163       0.3756851390
lths_pct                                 0.43692022       0.4386941572
Percent Institutionalized GQ             0.11070460       0.0002500681
Persistent Poverty                       0.11422317       0.1646435633
Temperature_Diff                         0.16839073       0.1948639636
                             LPA_CrudePrev est_med_hh_inc est_underemp_perc
public_preK_est                 0.46421492    -0.34150893       0.058799885
est_uninsured_rate              0.70654947    -0.60248355      -0.027631289
LE_imputed_2020                -0.40554816     0.42262091       0.173286383
est_has_disability_pct          0.53693813    -0.41211095      -0.052899790
est_65plus_amb_pct              0.52970731    -0.30028247      -0.113342531
LPA_CrudePrev                   1.00000000    -0.54502242      -0.120764423
est_med_hh_inc                 -0.54502242     1.00000000       0.163491021
est_underemp_perc              -0.12076442     0.16349102       1.000000000
est_belowpov_pct                0.65173997    -0.59654508      -0.001474493
Eviction filing rate            0.47327100    -0.32040995      -0.034348132
hh_support_risk_score           0.23838664    -0.18033507       0.050315259
Energy Burden (% income)        0.76606051    -0.45356344      -0.079698732
est_lesh_perc                   0.67058794    -0.46457492      -0.001084096
est_no_internet_perc            0.53286680    -0.30195000      -0.195163097
lths_pct                        0.84666442    -0.45393078      -0.061155315
Percent Institutionalized GQ    0.08702208    -0.02056847      -0.009991584
Persistent Poverty              0.11121638    -0.08941666       0.036939876
Temperature_Diff                0.33438410    -0.43579453      -0.274660206
                             est_belowpov_pct Eviction filing rate
public_preK_est                   0.268316944          0.316222184
est_uninsured_rate                0.578259932          0.257884187
LE_imputed_2020                  -0.340672337         -0.289151917
est_has_disability_pct            0.380844664          0.306853064
est_65plus_amb_pct                0.409960776          0.210122209
LPA_CrudePrev                     0.651739973          0.473270998
est_med_hh_inc                   -0.596545075         -0.320409953
est_underemp_perc                -0.001474493         -0.034348132
est_belowpov_pct                  1.000000000          0.217769082
Eviction filing rate              0.217769082          1.000000000
hh_support_risk_score             0.217935999          0.124982356
Energy Burden (% income)          0.508837463          0.423777529
est_lesh_perc                     0.546749770          0.203220450
est_no_internet_perc              0.488904523          0.146081485
lths_pct                          0.536173717          0.436625321
Percent Institutionalized GQ     -0.020405988          0.038193004
Persistent Poverty                0.285568868         -0.003957288
Temperature_Diff                  0.270229141         -0.027410753
                             hh_support_risk_score Energy Burden (% income)
public_preK_est                        0.096061288               0.41223181
est_uninsured_rate                     0.106068157               0.57010456
LE_imputed_2020                       -0.004763105              -0.29411567
est_has_disability_pct                 0.088965643               0.46141149
est_65plus_amb_pct                     0.115519105               0.33534932
LPA_CrudePrev                          0.238386638               0.76606051
est_med_hh_inc                        -0.180335072              -0.45356344
est_underemp_perc                      0.050315259              -0.07969873
est_belowpov_pct                       0.217935999               0.50883746
Eviction filing rate                   0.124982356               0.42377753
hh_support_risk_score                  1.000000000               0.22930946
Energy Burden (% income)               0.229309464               1.00000000
est_lesh_perc                          0.181352974               0.51756307
est_no_internet_perc                   0.162892228               0.34072914
lths_pct                               0.156710568               0.69863313
Percent Institutionalized GQ           0.027206012               0.09087590
Persistent Poverty                    -0.023694422              -0.06923601
Temperature_Diff                       0.100137564               0.21545279
                             est_lesh_perc est_no_internet_perc    lths_pct
public_preK_est                0.373043797            0.2461155  0.48628646
est_uninsured_rate             0.701153728            0.3748187  0.72304126
LE_imputed_2020               -0.223539877           -0.3087012 -0.41626746
est_has_disability_pct         0.250005500            0.3653316  0.43692022
est_65plus_amb_pct             0.320764630            0.3756851  0.43869416
LPA_CrudePrev                  0.670587937            0.5328668  0.84666442
est_med_hh_inc                -0.464574922           -0.3019500 -0.45393078
est_underemp_perc             -0.001084096           -0.1951631 -0.06115531
est_belowpov_pct               0.546749770            0.4889045  0.53617372
Eviction filing rate           0.203220450            0.1460815  0.43662532
hh_support_risk_score          0.181352974            0.1628922  0.15671057
Energy Burden (% income)       0.517563073            0.3407291  0.69863313
est_lesh_perc                  1.000000000            0.3897666  0.62917178
est_no_internet_perc           0.389766637            1.0000000  0.46387996
lths_pct                       0.629171782            0.4638800  1.00000000
Percent Institutionalized GQ  -0.003118287           -0.0224041  0.12032952
Persistent Poverty            -0.029656895            0.2196679  0.12584844
Temperature_Diff               0.336250849            0.1682932  0.15203082
                             Percent Institutionalized GQ Persistent Poverty
public_preK_est                              0.1035374907        0.083907205
est_uninsured_rate                          -0.0168715171        0.158321049
LE_imputed_2020                             -0.0643969112       -0.233754838
est_has_disability_pct                       0.1107046018        0.114223175
est_65plus_amb_pct                           0.0002500681        0.164643563
LPA_CrudePrev                                0.0870220766        0.111216376
est_med_hh_inc                              -0.0205684698       -0.089416659
est_underemp_perc                           -0.0099915841        0.036939876
est_belowpov_pct                            -0.0204059877        0.285568868
Eviction filing rate                         0.0381930040       -0.003957288
hh_support_risk_score                        0.0272060123       -0.023694422
Energy Burden (% income)                     0.0908758966       -0.069236006
est_lesh_perc                               -0.0031182870       -0.029656895
est_no_internet_perc                        -0.0224041033        0.219667880
lths_pct                                     0.1203295228        0.125848440
Percent Institutionalized GQ                 1.0000000000       -0.038881310
Persistent Poverty                          -0.0388813097        1.000000000
Temperature_Diff                            -0.0510776133       -0.013265470
                             Temperature_Diff
public_preK_est                    0.07050733
est_uninsured_rate                 0.26165504
LE_imputed_2020                   -0.21367377
est_has_disability_pct             0.16839073
est_65plus_amb_pct                 0.19486396
LPA_CrudePrev                      0.33438410
est_med_hh_inc                    -0.43579453
est_underemp_perc                 -0.27466021
est_belowpov_pct                   0.27022914
Eviction filing rate              -0.02741075
hh_support_risk_score              0.10013756
Energy Burden (% income)           0.21545279
est_lesh_perc                      0.33625085
est_no_internet_perc               0.16829323
lths_pct                           0.15203082
Percent Institutionalized GQ      -0.05107761
Persistent Poverty                -0.01326547
Temperature_Diff                   1.00000000

cor(master2[, imputed_cols])

                                 public_preK_est_imp est_uninsured_rate_imp
public_preK_est_imp                       1.00000000            0.440237016
est_uninsured_rate_imp                    0.44023702            1.000000000
LE_imputed_2020_imp                      -0.26716877           -0.390014909
est_has_disability_pct_imp                0.23712246            0.163370748
est_65plus_amb_pct_imp                    0.23718262            0.299673214
LPA_CrudePrev_imp                         0.50049324            0.619440177
est_med_hh_inc_imp                       -0.40214148           -0.594884445
est_underemp_perc_imp                     0.18437855            0.002708014
est_belowpov_pct_imp                      0.34755602            0.326362528
Eviction filing rate_imp                  0.29152280            0.192143770
hh_support_risk_score_imp                 0.03197354            0.094709424
Energy Burden (% income)_imp              0.44072292            0.496633106
est_lesh_perc_imp                         0.33084906            0.698543650
est_no_internet_perc_imp                  0.23589393            0.396225567
lths_pct_imp                              0.44469722            0.642829337
Percent Institutionalized GQ_imp          0.05680903           -0.029294798
Persistent Poverty_imp                   -0.02698461            0.098398270
Temperature_Diff_imp                      0.11902275            0.296228446
                                 LE_imputed_2020_imp est_has_disability_pct_imp
public_preK_est_imp                      -0.26716877                  0.2371225
est_uninsured_rate_imp                   -0.39001491                  0.1633707
LE_imputed_2020_imp                       1.00000000                 -0.2157128
est_has_disability_pct_imp               -0.21571275                  1.0000000
est_65plus_amb_pct_imp                   -0.29708726                  0.4052309
LPA_CrudePrev_imp                        -0.39291597                  0.5231132
est_med_hh_inc_imp                        0.37845982                 -0.1224407
est_underemp_perc_imp                     0.09196638                  0.1776450
est_belowpov_pct_imp                     -0.18219021                  0.4965270
Eviction filing rate_imp                 -0.23150542                  0.2540476
hh_support_risk_score_imp                -0.02062798                  0.1022485
Energy Burden (% income)_imp             -0.22860879                  0.4283241
est_lesh_perc_imp                        -0.22457981                  0.0810751
est_no_internet_perc_imp                 -0.25203379                  0.4729545
lths_pct_imp                             -0.38433132                  0.5087759
Percent Institutionalized GQ_imp         -0.06748717                  0.4031216
Persistent Poverty_imp                   -0.18902894                  0.0760528
Temperature_Diff_imp                     -0.19619495                  0.1297674
                                 est_65plus_amb_pct_imp LPA_CrudePrev_imp
public_preK_est_imp                          0.23718262         0.5004932
est_uninsured_rate_imp                       0.29967321         0.6194402
LE_imputed_2020_imp                         -0.29708726        -0.3929160
est_has_disability_pct_imp                   0.40523093         0.5231132
est_65plus_amb_pct_imp                       1.00000000         0.4215015
LPA_CrudePrev_imp                            0.42150150         1.0000000
est_med_hh_inc_imp                          -0.26623460        -0.4852699
est_underemp_perc_imp                        0.07636548         0.2038324
est_belowpov_pct_imp                         0.35499996         0.6536837
Eviction filing rate_imp                     0.13708439         0.4052103
hh_support_risk_score_imp                    0.04340267         0.1812116
Energy Burden (% income)_imp                 0.29547165         0.7473551
est_lesh_perc_imp                            0.15466648         0.5871815
est_no_internet_perc_imp                     0.27529434         0.5478003
lths_pct_imp                                 0.32212030         0.8213187
Percent Institutionalized GQ_imp             0.15616287         0.2952144
Persistent Poverty_imp                       0.14828495         0.1326349
Temperature_Diff_imp                         0.12503563         0.3456429
                                 est_med_hh_inc_imp est_underemp_perc_imp
public_preK_est_imp                     -0.40214148           0.184378545
est_uninsured_rate_imp                  -0.59488445           0.002708014
LE_imputed_2020_imp                      0.37845982           0.091966382
est_has_disability_pct_imp              -0.12244071           0.177645036
est_65plus_amb_pct_imp                  -0.26623460           0.076365480
LPA_CrudePrev_imp                       -0.48526991           0.203832391
est_med_hh_inc_imp                       1.00000000          -0.045800776
est_underemp_perc_imp                   -0.04580078           1.000000000
est_belowpov_pct_imp                    -0.40891524           0.540089583
Eviction filing rate_imp                -0.23808987          -0.041154943
hh_support_risk_score_imp               -0.10239326          -0.123538226
Energy Burden (% income)_imp            -0.45727395           0.304888099
est_lesh_perc_imp                       -0.42905086          -0.015562773
est_no_internet_perc_imp                -0.24261485          -0.001622255
lths_pct_imp                            -0.35690916           0.063840955
Percent Institutionalized GQ_imp         0.08766797           0.131327325
Persistent Poverty_imp                  -0.14621087           0.189027380
Temperature_Diff_imp                    -0.45395696          -0.046003931
                                 est_belowpov_pct_imp Eviction filing rate_imp
public_preK_est_imp                       0.347556023              0.291522796
est_uninsured_rate_imp                    0.326362528              0.192143770
LE_imputed_2020_imp                      -0.182190214             -0.231505419
est_has_disability_pct_imp                0.496527027              0.254047552
est_65plus_amb_pct_imp                    0.354999960              0.137084390
LPA_CrudePrev_imp                         0.653683718              0.405210336
est_med_hh_inc_imp                       -0.408915242             -0.238089869
est_underemp_perc_imp                     0.540089583             -0.041154943
est_belowpov_pct_imp                      1.000000000              0.168851800
Eviction filing rate_imp                  0.168851800              1.000000000
hh_support_risk_score_imp                -0.005163479              0.030961506
Energy Burden (% income)_imp              0.554572265              0.283828439
est_lesh_perc_imp                         0.283949990              0.153948533
est_no_internet_perc_imp                  0.363048861              0.134171163
lths_pct_imp                              0.424880343              0.321402210
Percent Institutionalized GQ_imp          0.342353638              0.210216815
Persistent Poverty_imp                    0.220944898             -0.026910648
Temperature_Diff_imp                      0.283663259             -0.001515009
                                 hh_support_risk_score_imp
public_preK_est_imp                            0.031973538
est_uninsured_rate_imp                         0.094709424
LE_imputed_2020_imp                           -0.020627979
est_has_disability_pct_imp                     0.102248519
est_65plus_amb_pct_imp                         0.043402665
LPA_CrudePrev_imp                              0.181211633
est_med_hh_inc_imp                            -0.102393262
est_underemp_perc_imp                         -0.123538226
est_belowpov_pct_imp                          -0.005163479
Eviction filing rate_imp                       0.030961506
hh_support_risk_score_imp                      1.000000000
Energy Burden (% income)_imp                   0.171380030
est_lesh_perc_imp                              0.112438380
est_no_internet_perc_imp                       0.227663233
lths_pct_imp                                   0.209858283
Percent Institutionalized GQ_imp              -0.150195712
Persistent Poverty_imp                         0.006254729
Temperature_Diff_imp                           0.032006681
                                 Energy Burden (% income)_imp est_lesh_perc_imp
public_preK_est_imp                                0.44072292       0.330849063
est_uninsured_rate_imp                             0.49663311       0.698543650
LE_imputed_2020_imp                               -0.22860879      -0.224579813
est_has_disability_pct_imp                         0.42832411       0.081075097
est_65plus_amb_pct_imp                             0.29547165       0.154666479
LPA_CrudePrev_imp                                  0.74735506       0.587181515
est_med_hh_inc_imp                                -0.45727395      -0.429050863
est_underemp_perc_imp                              0.30488810      -0.015562773
est_belowpov_pct_imp                               0.55457226       0.283949990
Eviction filing rate_imp                           0.28382844       0.153948533
hh_support_risk_score_imp                          0.17138003       0.112438380
Energy Burden (% income)_imp                       1.00000000       0.394451313
est_lesh_perc_imp                                  0.39445131       1.000000000
est_no_internet_perc_imp                           0.39851225       0.418367178
lths_pct_imp                                       0.63909027       0.605963762
Percent Institutionalized GQ_imp                   0.07304903       0.097344488
Persistent Poverty_imp                             0.08885868      -0.007464553
Temperature_Diff_imp                               0.27871585       0.315620470
                                 est_no_internet_perc_imp lths_pct_imp
public_preK_est_imp                           0.235893930   0.44469722
est_uninsured_rate_imp                        0.396225567   0.64282934
LE_imputed_2020_imp                          -0.252033791  -0.38433132
est_has_disability_pct_imp                    0.472954524   0.50877593
est_65plus_amb_pct_imp                        0.275294337   0.32212030
LPA_CrudePrev_imp                             0.547800283   0.82131870
est_med_hh_inc_imp                           -0.242614850  -0.35690916
est_underemp_perc_imp                        -0.001622255   0.06384095
est_belowpov_pct_imp                          0.363048861   0.42488034
Eviction filing rate_imp                      0.134171163   0.32140221
hh_support_risk_score_imp                     0.227663233   0.20985828
Energy Burden (% income)_imp                  0.398512254   0.63909027
est_lesh_perc_imp                             0.418367178   0.60596376
est_no_internet_perc_imp                      1.000000000   0.61927417
lths_pct_imp                                  0.619274171   1.00000000
Percent Institutionalized GQ_imp              0.039786917   0.17400372
Persistent Poverty_imp                        0.176312718   0.12425939
Temperature_Diff_imp                          0.205239882   0.18166003
                                 Percent Institutionalized GQ_imp
public_preK_est_imp                                  5.680903e-02
est_uninsured_rate_imp                              -2.929480e-02
LE_imputed_2020_imp                                 -6.748717e-02
est_has_disability_pct_imp                           4.031216e-01
est_65plus_amb_pct_imp                               1.561629e-01
LPA_CrudePrev_imp                                    2.952144e-01
est_med_hh_inc_imp                                   8.766797e-02
est_underemp_perc_imp                                1.313273e-01
est_belowpov_pct_imp                                 3.423536e-01
Eviction filing rate_imp                             2.102168e-01
hh_support_risk_score_imp                           -1.501957e-01
Energy Burden (% income)_imp                         7.304903e-02
est_lesh_perc_imp                                    9.734449e-02
est_no_internet_perc_imp                             3.978692e-02
lths_pct_imp                                         1.740037e-01
Percent Institutionalized GQ_imp                     1.000000e+00
Persistent Poverty_imp                              -6.482865e-05
Temperature_Diff_imp                                 3.227010e-02
                                 Persistent Poverty_imp Temperature_Diff_imp
public_preK_est_imp                       -2.698461e-02          0.119022750
est_uninsured_rate_imp                     9.839827e-02          0.296228446
LE_imputed_2020_imp                       -1.890289e-01         -0.196194948
est_has_disability_pct_imp                 7.605280e-02          0.129767431
est_65plus_amb_pct_imp                     1.482849e-01          0.125035632
LPA_CrudePrev_imp                          1.326349e-01          0.345642866
est_med_hh_inc_imp                        -1.462109e-01         -0.453956962
est_underemp_perc_imp                      1.890274e-01         -0.046003931
est_belowpov_pct_imp                       2.209449e-01          0.283663259
Eviction filing rate_imp                  -2.691065e-02         -0.001515009
hh_support_risk_score_imp                  6.254729e-03          0.032006681
Energy Burden (% income)_imp               8.885868e-02          0.278715849
est_lesh_perc_imp                         -7.464553e-03          0.315620470
est_no_internet_perc_imp                   1.763127e-01          0.205239882
lths_pct_imp                               1.242594e-01          0.181660032
Percent Institutionalized GQ_imp          -6.482865e-05          0.032270105
Persistent Poverty_imp                     1.000000e+00          0.101897111
Temperature_Diff_imp                       1.018971e-01          1.000000000

sd(master2$est_med_hh_inc, na.rm=TRUE)

[1] 46215.07

sd(master2$est_med_hh_inc_imp)

[1] 46290.78

Full joint imputation using PMM preserved the correlation structure of the dataset and did not materially alter rank ordering or distributional characteristics of key socioeconomic indicators.

Checking Evictions specifically

master2 %>%
  group_by(`Eviction filing rate_flag`) %>%
  summarise(
    n = n(),
    min = min(`Eviction filing rate_imp`, na.rm=TRUE),
    p10 = quantile(`Eviction filing rate_imp`, 0.10, na.rm=TRUE),
    median = median(`Eviction filing rate_imp`, na.rm=TRUE),
    mean = mean(`Eviction filing rate_imp`, na.rm=TRUE),
    p90 = quantile(`Eviction filing rate_imp`, 0.90, na.rm=TRUE),
    max = max(`Eviction filing rate_imp`, na.rm=TRUE)
  )

# A tibble: 2 × 8
  `Eviction filing rate_flag`     n   min   p10 median  mean   p90   max
  <chr>                       <int> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>
1 Imputed                         7  0.66 0.738   3.8   6.53  15.0  31.1
2 Original                      242  0    0.781   3.18  5.31  10.7  61.7

slightly more right tailed

master2 %>%
  filter(`Eviction filing rate_flag` == "Imputed") %>%
  select(GEOID_clean, `Tract Full Name`, `Eviction filing rate`, `Eviction filing rate_imp`)

# A tibble: 7 × 4
  GEOID_clean `Tract Full Name`    `Eviction filing rate` Eviction filing rate…¹
  <chr>       <chr>                                 <dbl>                  <dbl>
1 48453000102 Census Tract 1.02; …                     NA                   1.21
2 48453001606 Census Tract 16.06;…                     NA                   4.33
3 48453002319 Census Tract 23.19;…                     NA                  31.1 
4 48453034300 Census Tract 343; T…                     NA                   3.8 
5 48453034500 Census Tract 345; T…                     NA                   0.79
6 48453036800 Census Tract 368; T…                     NA                   0.66
7 48453044600 Census Tract 446; T…                     NA                   3.83
# ℹ abbreviated name: ¹`Eviction filing rate_imp`

weird_vals <- c(31.08, 3.8, 0.79, 0.66, 4.33, 3.83) 

master2 %>%
  mutate(Observed = `Eviction filing rate_flag` == "Original") %>%
  filter(Observed, `Eviction filing rate_imp` %in% weird_vals) %>%
  select(GEOID_clean, `Tract Full Name`, `Eviction filing rate_imp`)

# A tibble: 7 × 3
  GEOID_clean `Tract Full Name`                        Eviction filing rate_im…¹
  <chr>       <chr>                                                        <dbl>
1 48453000203 Census Tract 2.03; Travis County; Texas                       0.79
2 48453002219 Census Tract 22.19; Travis County; Texas                     31.1 
3 48453002310 Census Tract 23.10; Travis County; Texas                      4.33
4 48453030700 Census Tract 307; Travis County; Texas                        0.79
5 48453034900 Census Tract 349; Travis County; Texas                        3.8 
6 48453036700 Census Tract 367; Travis County; Texas                        0.66
7 48453041200 Census Tract 412; Travis County; Texas                        3.83
# ℹ abbreviated name: ¹`Eviction filing rate_imp`

Create Index

might as well try here

#helper functions
library(dplyr)

#percentile rank like ArcGIS "rank within study area"
#returns 0..1; ties get averaged like dplyr::percent_rank
pct_rank <- function(x) dplyr::percent_rank(x)

#rescale numeric vector to 1..100
rescale_1_100 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  if (diff(rng) == 0) return(rep(50, length(x)))
  1 + (x - rng[1]) * 99 / (rng[2] - rng[1])
}

df_coa <- master2

#define direction
higher_is_better <- c(
  "public_preK_est_imp",
  "LE_imputed_2020_imp",
  "est_med_hh_inc_imp"
)

higher_is_worse <- c(
  "est_uninsured_rate_imp",
  "est_has_disability_pct_imp",
  "est_65plus_amb_pct_imp",
  "LPA_CrudePrev_imp",
  "est_underemp_perc_imp",
  "est_belowpov_pct_imp",
  "Eviction filing rate_imp",
  "Energy Burden (% income)_imp",
  "est_lesh_perc_imp",
  "est_no_internet_perc_imp",
  "lths_pct_imp",
  "Percent Institutionalized GQ_imp",
  "Persistent Poverty_imp",
  "Temperature_Diff_imp",
  "hh_support_risk_score_imp"
)

#create percentile-rank versions (0..1)
df_coa <- df_coa %>%
  mutate(across(all_of(c(higher_is_better, higher_is_worse)),
                pct_rank,
                .names = "{.col}_pct"))

#reverse the "higher is worse" so that higher percentile always means "more favorable"
df_coa <- df_coa %>%
  mutate(across(ends_with("_pct") & all_of(paste0(higher_is_worse, "_pct")),
                ~ 1 - .x))

Create themes and subthemes

# Subtheme → indicator percentile columns
subthemes <- list(
  # 1) Health & Wellbeing
  health_access_resources = c(
    "est_uninsured_rate_imp_pct",
    "public_preK_est_imp_pct",
    "LE_imputed_2020_imp_pct"
  ),
  wellbeing_activity = c(
    "est_has_disability_pct_imp_pct",
    "est_65plus_amb_pct_imp_pct",
    "LPA_CrudePrev_imp_pct"   # low physical activity
  ),

  # 2) Livelihood & Work
  income_employment = c(
    "est_med_hh_inc_imp_pct",
    "est_underemp_perc_imp_pct",
    "est_belowpov_pct_imp_pct"
  ),
  hh_stability_cost_burdens = c(
    "Eviction filing rate_imp_pct",
    "hh_support_risk_score_imp_pct",
    "Energy Burden (% income)_imp_pct"
  ),

  # 3) Access & Belonging
  edu_language_digital = c(
    "est_lesh_perc_imp_pct",          # LEP
    "est_no_internet_perc_imp_pct",
    "lths_pct_imp_pct"
  ),
  environment_neighborhood = c(
    "Percent Institutionalized GQ_imp_pct",  # group quarters
    "Persistent Poverty_imp_pct",            # concentrated poverty tracts 
    "Temperature_Diff_imp_pct"               # heat risk
  )
)


missing_cols <- setdiff(unlist(subthemes), names(df_coa))
if (length(missing_cols) > 0) {
  stop("These columns are missing from df_coa: ", paste(missing_cols, collapse = ", "))
}


#compute subtheme scores (mean of percentiles)
for (nm in names(subthemes)) {
  cols <- subthemes[[nm]]
  df_coa[[paste0("sub_", nm)]] <- rowMeans(df_coa[, cols, drop = FALSE], na.rm = TRUE)
}

Define themes as mean of subthemes (equal weighting)

themes <- list(
  health_wellbeing = c("sub_health_access_resources", "sub_wellbeing_activity"),
  livelihood_work  = c("sub_income_employment", "sub_hh_stability_cost_burdens"),
  access_belonging = c("sub_edu_language_digital", "sub_environment_neighborhood")
)

for (nm in names(themes)) {
  df_coa[[paste0("theme_", nm)]] <- rowMeans(df_coa[, themes[[nm]]], na.rm = TRUE)
}

final composite score

theme_cols <- paste0("theme_", names(themes))

df_coa$composite_pct_mean <- rowMeans(df_coa[, theme_cols], na.rm = TRUE)

# Rescale 0..1 → 1..100 
rescale_1_100 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  if (diff(rng) == 0) return(rep(50, length(x)))
  1 + (x - rng[1]) * 99 / (rng[2] - rng[1])
}

df_coa$composite_1_100 <- rescale_1_100(df_coa$composite_pct_mean)

summary(df_coa$composite_1_100)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00   38.92   54.38   53.84   71.08  100.00

hist(df_coa$composite_1_100, breaks = 20)

df_coa %>%
  select(GEOID_clean,
         composite_1_100,
         est_belowpov_pct_imp,
         est_med_hh_inc_imp) %>%
  arrange(composite_1_100) %>%
  head(10)

# A tibble: 10 × 4
   GEOID_clean composite_1_100 est_belowpov_pct_imp est_med_hh_inc_imp
   <chr>                 <dbl>                <dbl>              <dbl>
 1 48453002105            1                    10.7              68649
 2 48453040700            1.70                 41.7              41526
 3 48453002413            2.57                 42.8              42629
 4 48453002411            7.06                 23.9              67500
 5 48453002112            8.96                 14                83750
 6 48453043200           12.9                  11.6              35357
 7 48453002319           13.8                  91.4             176429
 8 48453040100           14.8                  29.4              61202
 9 48453002310           18.2                  34                36991
10 48453044000           18.4                  39.8              51866

cor(df_coa$composite_1_100, df_coa$est_belowpov_pct_imp)

[1] -0.5292467

cor(df_coa$composite_1_100, df_coa$est_med_hh_inc_imp)

[1] 0.6730984

write.csv(
  df_coa %>%
    select(GEOID_clean, composite_1_100),
  "coa_index_from_R.csv",
  row.names = FALSE
)

mapping for fun

library(tigris)

To enable caching of data, set `options(tigris_use_cache = TRUE)`
in your R script or .Rprofile.

library(sf)

Warning: package 'sf' was built under R version 4.4.3

Linking to GEOS 3.13.0, GDAL 3.10.1, PROJ 9.5.1; sf_use_s2() is TRUE

library(dplyr)

options(tigris_use_cache = TRUE)

travis <- tracts(state="TX", county="Travis", year=2020, class="sf")
hays   <- tracts(state="TX", county="Hays",   year=2020, class="sf")

tracts_sf <- bind_rows(travis, hays) %>%
  mutate(GEOID_clean = as.character(GEOID))

#join and keep only area of interst
map_sf <- tracts_sf %>%
  inner_join(df_coa %>% select(GEOID_clean, composite_1_100),
             by = "GEOID_clean")



mean(!is.na(map_sf$composite_1_100))

[1] 1

#1=100 join

library(tmap)

Warning: package 'tmap' was built under R version 4.4.3

tmap_mode("view")

ℹ tmap modes "plot" - "view"
ℹ toggle with `tmap::ttm()`

tm_shape(map_sf) +
  tm_polygons(
    col = "composite_1_100",
    palette = "YlGnBu",
    id = "GEOID_clean",
    popup.vars = c("Composite Index (1–100)" = "composite_1_100"),
    alpha = 0.9
  )


── tmap v3 code detected ───────────────────────────────────────────────────────
[v3->v4] `tm_tm_polygons()`: migrate the argument(s) related to the scale of
the visual variable `fill` namely 'palette' (rename to 'values') to fill.scale
= tm_scale(<HERE>).[v3->v4] `tm_polygons()`: use 'fill' for the fill color of polygons/symbols
(instead of 'col'), and 'col' for the outlines (instead of 'border.col').[v3->v4] `tm_polygons()`: use `fill_alpha` instead of `alpha`.[cols4all] color palettes: use palettes from the R package cols4all. Run
`cols4all::c4a_gui()` to explore them. The old palette name "YlGnBu" is named
"brewer.yl_gn_bu"Multiple palettes called "yl_gn_bu" found: "brewer.yl_gn_bu", "matplotlib.yl_gn_bu". The first one, "brewer.yl_gn_bu", is returned.

PCA

library(dplyr)

pca_vars <- c(
  "public_preK_est_imp",
  "est_uninsured_rate_imp",
  "LE_imputed_2020_imp",
  "est_has_disability_pct_imp",
  "est_65plus_amb_pct_imp",
  "LPA_CrudePrev_imp",
  "est_med_hh_inc_imp",
  "est_underemp_perc_imp",
  "est_belowpov_pct_imp",
  "Eviction filing rate_imp",
  "hh_support_risk_score_imp",
  "Energy Burden (% income)_imp",
  "est_lesh_perc_imp",
  "est_no_internet_perc_imp",
  "lths_pct_imp",
  "Percent Institutionalized GQ_imp",
  "Persistent Poverty_imp",
  "Temperature_Diff_imp"
)

setdiff(pca_vars, names(df_coa))

character(0)

matrix

X <- df_coa %>% select(all_of(pca_vars))

X <- X %>% mutate(across(everything(), as.numeric))

#flip higher is worse indicators so higher = better
worse_in_pca <- intersect(higher_is_worse, names(X))

X_oriented <- X
X_oriented[worse_in_pca] <- lapply(X_oriented[worse_in_pca], function(v) -v)

#standardize for PCA (mean 0, sd 1)
X_scaled <- scale(X_oriented)

pca <- prcomp(X_scaled, center = FALSE, scale. = FALSE)  # already scaled

summary(pca)

Importance of components:
                          PC1    PC2     PC3     PC4     PC5     PC6     PC7
Standard deviation     2.4589 1.3694 1.17555 1.10538 1.07793 1.00302 0.93878
Proportion of Variance 0.3359 0.1042 0.07677 0.06788 0.06455 0.05589 0.04896
Cumulative Proportion  0.3359 0.4401 0.51686 0.58474 0.64929 0.70518 0.75415
                           PC8     PC9   PC10    PC11    PC12    PC13    PC14
Standard deviation     0.89668 0.80335 0.7719 0.72181 0.66559 0.62184 0.55186
Proportion of Variance 0.04467 0.03585 0.0331 0.02895 0.02461 0.02148 0.01692
Cumulative Proportion  0.79881 0.83467 0.8678 0.89672 0.92133 0.94281 0.95973
                          PC15    PC16    PC17    PC18
Standard deviation     0.48388 0.45990 0.42983 0.30735
Proportion of Variance 0.01301 0.01175 0.01026 0.00525
Cumulative Proportion  0.97274 0.98449 0.99475 1.00000

plot(pca, type = "l")

loadings_pc1 <- sort(pca$rotation[,1], decreasing = TRUE)
loadings_pc1

             public_preK_est_imp        hh_support_risk_score_imp 
                      0.23852852                      -0.07233206 
          Persistent Poverty_imp            est_underemp_perc_imp 
                     -0.07944712                      -0.08738954 
Percent Institutionalized GQ_imp         Eviction filing rate_imp 
                     -0.09641262                      -0.16083796 
            Temperature_Diff_imp              LE_imputed_2020_imp 
                     -0.16496760                      -0.19796270 
          est_65plus_amb_pct_imp       est_has_disability_pct_imp 
                     -0.20217119                      -0.23369610 
              est_med_hh_inc_imp         est_no_internet_perc_imp 
                     -0.25707138                      -0.25719367 
               est_lesh_perc_imp             est_belowpov_pct_imp 
                     -0.26256519                      -0.28030328 
          est_uninsured_rate_imp     Energy Burden (% income)_imp 
                     -0.29955355                      -0.31369596 
                    lths_pct_imp                LPA_CrudePrev_imp 
                     -0.34220278                      -0.37667452

pc1 <- pca$x[,1]

#align direction to match  composite (higher=more favorable)
if (cor(pc1, df_coa$composite_1_100, use = "complete.obs") < 0) {
  pc1 <- -pc1
}

df_coa$pca_pc1 <- pc1

#rescale PC1 to 1-100 for easier comparison
rescale_1_100 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  if (diff(rng) == 0) return(rep(50, length(x)))
  1 + (x - rng[1]) * 99 / (rng[2] - rng[1])
}

df_coa$pca_1_100 <- rescale_1_100(df_coa$pca_pc1)

compare

cor(df_coa$composite_1_100, df_coa$pca_1_100, use = "complete.obs")

[1] 0.8902637

cor(df_coa$composite_1_100, df_coa$pca_1_100, use = "complete.obs", method = "spearman")

[1] 0.9361097

df_coa %>%
  mutate(
    rank_composite = rank(-composite_1_100, ties.method = "average"),
    rank_pca = rank(-pca_1_100, ties.method = "average"),
    rank_diff = rank_composite - rank_pca
  ) %>%
  arrange(desc(abs(rank_diff))) %>%
  select(GEOID_clean, composite_1_100, pca_1_100, rank_composite, rank_pca, rank_diff) %>%
  head(20)

# A tibble: 20 × 6
   GEOID_clean composite_1_100 pca_1_100 rank_composite rank_pca rank_diff
   <chr>                 <dbl>     <dbl>          <dbl>    <dbl>     <dbl>
 1 48453002446            72.8      72.3             56      143       -87
 2 48453000607            52.6      52.2            131      217       -86
 3 48453000606            55.4      56.9            119      203       -84
 4 48453000601            47.8      45.4            153      230       -77
 5 48453002324            41.9      22.3            175      247       -72
 6 48453000605            52.5      59.6            132      194       -62
 7 48453002323            32.2      72.5            202      140        62
 8 48453000203            38.7      73.6            188      130        58
 9 48453000308            47.5      79.2            155      102        53
10 48453000401            51.6      81.2            140       87        53
11 48453031600            54.6      82.7            122       71        51
12 48491020334            45.3      78.2            164      113        51
13 48453000302            46.5      78.3            160      110        50
14 48453045100            73.0      79.1             54      103       -49
15 48491020409            76.6      80.2             44       93       -49
16 48453000608            46.8      55.6            158      206       -48
17 48453002004            36.6      71.3            192      144        48
18 48453000803            54.4      81.8            125       78        47
19 48453000804            26.4      65.9            214      167        47
20 48453041900            59.8      70.5            105      152       -47

A principal component analysis (PCA) was conducted as a robustness check. The first component explained 33.6% of total variance and reflected a broad structural disadvantage dimension, loading heavily on poverty, health risk, energy burden, and digital access indicators. The strong alignment between PCA-derived scores and the equal-weight composite confirms that the index structure is empirically coherent and not driven by arbitrary weighting choices.