Full Dataset Imputation
In this analysis, I solve for missing data at the census tract level to ensure that all tracts can be included in the final composite index.
Public pre-K enrollment initially had suppressed values in several census tracts. Rather than excluding those tracts, I used a multiple imputation approach to estimate plausible values based on relationships with other socioeconomic indicators in the dataset (such as poverty, income, health outcomes, and housing conditions).
Because a small number of other index indicators also contained limited missing values (each affecting fewer than 5% of tracts), I implemented a full joint imputation model. This approach allows the variables to inform one another during estimation, preserving the overall structure of the data while filling in missing entries.
Predictive Mean Matching (PMM) was used for imputation. This method replaces missing values with observed values from similar census tracts, ensuring that imputed estimates remain realistic and within the observed range of the data.
After imputation:
All index indicators were complete across 249 census tracts.
Observed values were unchanged.
The overall distribution and correlation structure of the dataset were preserved, based on validation checks towards end of qmd
This process allows the composite index to be calculated using a complete and internally consistent dataset while minimizing bias from listwise deletion.
Note: In this code, the workflow is a little off. First initial chunks focused only on prek imputation. Refer to table of contents for full dataset imputation. Only keeping irrelevant code for notes and process documentation.
I reviewed KWs code: https://rpubs.com/kaitlan/1396846
I asked her to use CS code as guide: https://rpubs.com/corey_sparks/883799
Other resources: https://stats.oarc.ucla.edu/r/faq/how-do-i-perform-multiple-imputation-using-predictive-mean-matching-in-r/
https://stefvanbuuren.name/fimd/sec-pmm.html
Import and Join
I brought together KW’s master dataframe with my LE imputed dataset
library (readr)
library (dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
library (stringr)
data_dir <- "C:/Users/rayo-garza/OneDrive - Center for Public Policy Priorities/Documents/Data 2025"
master_path <- file.path (data_dir, "Compiled_Data_Austin_LTD_FULL_02.06 (1).csv" )
imputed_path <- file.path (data_dir, "Final_Life_Expectancy_Imputed_2020.csv" )
#force IDs to character so leading zeros are safe
master <- read_csv (master_path, col_types = cols (.default = col_guess (),
GEOID_clean = col_character ()))
imputed <- read_csv (imputed_path, col_types = cols (.default = col_guess (),
GEOID = col_character (),
AFFGEOID = col_character (),
TRACTCE = col_character (),
COUNTYFP = col_character (),
STATEFP = col_character ()))
#clean/standardize GEOIDs to 11-digit tract GEOID (SSCCCTTTTTT) ---
clean_geoid11 <- function (x) {
x %>%
str_trim () %>%
str_replace_all ("[^0-9]" , "" ) %>% # removes "1400000US"
str_sub (- 11 ) %>% # keep last 11 digits if longer
str_pad (11 , side = "left" , pad = "0" )
}
master <- master %>%
mutate (GEOID_clean = clean_geoid11 (GEOID_clean))
imputed <- imputed %>%
mutate (GEOID11 = clean_geoid11 (GEOID))
#checks: duplicates in the imputed file
dups <- imputed %>%
count (GEOID11) %>%
filter (n > 1 )
if (nrow (dups) > 0 ) {
print (dups)
stop ("Duplicate GEOIDs found in imputed file. Resolve before joining." )
}
#Join
master2 <- master %>%
left_join (
imputed %>%
select (GEOID11, le_2020_normalized, imputed_flag, data_coverage) %>%
rename (
LE_imputed_2020 = le_2020_normalized,
LE_imputed_flag = imputed_flag,
LE_data_coverage = data_coverage
),
by = c ("GEOID_clean" = "GEOID11" )
)
#match rate + how many got LE values
qa <- master2 %>%
summarise (
n_master = n (),
n_with_imputed_LE = sum (! is.na (LE_imputed_2020)),
n_missing_imputed_LE = sum (is.na (LE_imputed_2020)),
match_rate = n_with_imputed_LE / n_master
)
print (qa)
# A tibble: 1 × 4
n_master n_with_imputed_LE n_missing_imputed_LE match_rate
<int> <int> <int> <dbl>
1 250 249 1 0.996
#unmatched tracts
unmatched <- master2 %>%
filter (is.na (LE_imputed_2020)) %>%
select (GEOID_clean, ` Tract Full Name ` ) %>%
head (20 )
print (unmatched)
# A tibble: 1 × 2
GEOID_clean `Tract Full Name`
<chr> <chr>
1 48453980000 Census Tract 9800; Travis County; Texas
#remove airport
master2 <- master2 %>%
filter (GEOID_clean != "48453980000" )
out_path <- file.path (data_dir, "master_with_imputed_LE.csv" )
write_csv (master2, out_path)
#are there other special use tracts?
master2 <- master2 %>%
mutate (
tract_code = str_sub (GEOID_clean, - 6 ) # last 6 digits
)
special_use <- master2 %>%
filter (str_detect (tract_code, "^(98|99)" )) %>%
select (GEOID_clean, ` Tract Full Name ` , tract_code)
special_use
# A tibble: 0 × 3
# ℹ 3 variables: GEOID_clean <chr>, Tract Full Name <chr>, tract_code <chr>
(Obselete)
The below code was when I was only imputing for prek
# library(mice)
# library(dplyr)
#
#
# #create the MICE dataset with the outcome + predictors (different from KW in that I didn't use the whole indicator list for prediction, just select)
# mice_vars <- c(
# "public_preK_est",
# "est_belowpov_pct",
# "est_med_hh_inc",
# "pct_female_hh_kids",
# "pct_hh_with_kids",
# "Total Population",
# "est_uninsured_rate",
# "est_no_internet_perc"
# )
#
# mice_data <- master2 %>%
# select(all_of(mice_vars))
#
# # (Optional) Make sure blanks aren't hiding as empty strings (usually not in CSV, but safe)
# mice_data[mice_data == ""] <- NA
#
# pred <- make.predictorMatrix(mice_data)
# pred[,] <- 0
# pred["public_preK_est", setdiff(colnames(mice_data), "public_preK_est")] <- 1
#
# meth <- make.method(mice_data)
# meth[] <- ""
# meth["public_preK_est"] <- "pmm"
#
# set.seed(123)
# imp <- mice(
# data = mice_data,
# m = 5,
# method = meth,
# predictorMatrix = pred,
# maxit = 20,
# printFlag = FALSE
# )
#
# completed <- complete(imp, action = 1)
#
# #add to master
# master2 <- master2 %>%
# mutate(
# public_preK_imputed = completed$public_preK_est,
# preK_was_imputed = is.na(public_preK_est),
# #finalversion keeps observed values exactly and fills only missings
# public_preK_final = if_else(is.na(public_preK_est), public_preK_imputed, public_preK_est)
# )
#
# #observed values should be unchanged
# cor(master2$public_preK_est, master2$public_preK_final, use = "complete.obs")
# table(is.na(master2$public_preK_final))
#
# out_path <- file.path(data_dir, "master_with_imputed_LE_and_preK2.csv")
# write_csv(master2, out_path)
#
# out_path
# library(dplyr)
#
# master2 <- master2 %>%
# mutate(
# preK_z_observed = scale(public_preK_est)[,1],
# preK_z_imputed = scale(public_preK_imputed)[,1]
# )
#
# cor(master2$preK_z_observed,
# master2$preK_z_imputed,
# use = "complete.obs")
>.95 means PMM did not change any observed values. It only filled in missing ones and it did not distort the distribution, which you can see in the histogram (there is polarization, which is reflective of reality).
Notes
Missing values for public Pre-K enrollment (25.7% of tracts) were imputed using Predictive Mean Matching (PMM) with multiple imputation (m=5), incorporating tract-level poverty, median household income, population size, household composition, uninsured rate, and internet access as predictors. PMM preserved the original variance (SD_observed = 40.64; SD_imputed = 41.34) and maintained tract ranking among observed cases (r = 1.00).
Full Dataset Imputation
Warning: package 'mice' was built under R version 4.4.3
Attaching package: 'mice'
The following object is masked from 'package:stats':
filter
The following objects are masked from 'package:base':
cbind, rbind
library (dplyr)
library (readr)
#define the index variable list (use the ORIGINAL preK variable)
index_vars <- c (
"public_preK_est" ,
"est_uninsured_rate" ,
"LE_imputed_2020" ,
"est_has_disability_pct" ,
"est_65plus_amb_pct" ,
"LPA_CrudePrev" ,
"est_med_hh_inc" ,
"est_underemp_perc" ,
"est_belowpov_pct" ,
"Eviction filing rate" ,
"hh_support_risk_score" ,
"Energy Burden (% income)" ,
"est_lesh_perc" ,
"est_no_internet_perc" ,
"lths_pct" ,
"Percent Institutionalized GQ" ,
"Persistent Poverty" ,
"Temperature_Diff"
)
#build MICE dataset (only index variables)
mice_data <- master2 %>% select (all_of (index_vars))
#ensure numeric (important because spaces/special chars can be tricky)
mice_data <- mice_data %>% mutate (across (everything (), as.numeric))
#set methods: impute only variables that actually have NAs, using PMM
meth <- make.method (mice_data)
meth[] <- "" # <--default: do not impute--Took from CS and KW notes
vars_with_na <- names (which (colSums (is.na (mice_data)) > 0 ))
meth[vars_with_na] <- "pmm"
#predictor matrix: by default allow all variables to predict each other (do I need to limit any one specific variable? Need to check)
pred <- make.predictorMatrix (mice_data)
pred[,] <- 1
diag (pred) <- 0
set.seed (123 )
imp <- mice (
data = mice_data,
m = 10 ,
method = meth,
predictorMatrix = pred,
maxit = 20 ,
printFlag = FALSE
)
#extract one completed dataset. Basically telling R to give me the first of the 10 completed datasets
completed <- complete (imp, action = 1 )
#merge back into master2 as *_imp and *_flag columns
#keeping original columns untouched for transparency.
for (v in index_vars) {
imp_col <- paste0 (v, "_imp" )
flag_col <- paste0 (v, "_flag" )
master2[[imp_col]] <- completed[[v]]
master2[[flag_col]] <- ifelse (is.na (mice_data[[v]]), "Imputed" , "Original" )
}
#any remaining NA in the imputed index matrix?
imputed_cols <- paste0 (index_vars, "_imp" )
colSums (is.na (master2[, imputed_cols]))
public_preK_est_imp est_uninsured_rate_imp
0 0
LE_imputed_2020_imp est_has_disability_pct_imp
0 0
est_65plus_amb_pct_imp LPA_CrudePrev_imp
0 0
est_med_hh_inc_imp est_underemp_perc_imp
0 0
est_belowpov_pct_imp Eviction filing rate_imp
0 0
hh_support_risk_score_imp Energy Burden (% income)_imp
0 0
est_lesh_perc_imp est_no_internet_perc_imp
0 0
lths_pct_imp Percent Institutionalized GQ_imp
0 0
Persistent Poverty_imp Temperature_Diff_imp
0 0
out_path <- file.path (data_dir, "master_FULL_joint_imputed_for_index.csv" )
write_csv (master2, out_path)
out_path
[1] "C:/Users/rayo-garza/OneDrive - Center for Public Policy Priorities/Documents/Data 2025/master_FULL_joint_imputed_for_index.csv"
#do we have any nas after imputation
imputed_cols <- paste0 (index_vars, "_imp" )
colSums (is.na (master2[, imputed_cols]))
public_preK_est_imp est_uninsured_rate_imp
0 0
LE_imputed_2020_imp est_has_disability_pct_imp
0 0
est_65plus_amb_pct_imp LPA_CrudePrev_imp
0 0
est_med_hh_inc_imp est_underemp_perc_imp
0 0
est_belowpov_pct_imp Eviction filing rate_imp
0 0
hh_support_risk_score_imp Energy Burden (% income)_imp
0 0
est_lesh_perc_imp est_no_internet_perc_imp
0 0
lths_pct_imp Percent Institutionalized GQ_imp
0 0
Persistent Poverty_imp Temperature_Diff_imp
0 0
Validation Checks
#make sure observed variables were NOT changed
cor (master2$ public_preK_est,
master2$ public_preK_est_imp,
use= "complete.obs" )
cor (master2$ est_med_hh_inc,
master2$ est_med_hh_inc_imp,
use= "complete.obs" )
cor (master2$ ` Eviction filing rate ` ,
master2$ ` Eviction filing rate_imp ` ,
use= "complete.obs" )
#distribution preservation
summary (master2$ public_preK_est)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.00 0.00 30.70 42.47 80.50 100.00 64
summary (master2$ public_preK_est_imp)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 0.00 46.20 47.95 100.00 100.00
summary (master2$ est_med_hh_inc)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
18286 78290 95536 105625 123417 250001 2
summary (master2$ est_med_hh_inc_imp)
Min. 1st Qu. Median Mean 3rd Qu. Max.
18286 78026 95536 105783 123798 250001
summary (master2$ ` Eviction filing rate ` )
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.000 1.625 3.175 5.310 6.570 61.720 7
summary (master2$ ` Eviction filing rate_imp ` )
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 1.600 3.180 5.344 6.570 61.720
#variance
sd (master2$ public_preK_est, na.rm= TRUE )
sd (master2$ public_preK_est_imp)
sd (master2$ est_med_hh_inc, na.rm= TRUE )
sd (master2$ est_med_hh_inc_imp)
library (dplyr)
rank_before <- percent_rank (master2$ public_preK_est)
rank_after <- percent_rank (master2$ public_preK_est_imp)
cor (rank_before, rank_after, use= "complete.obs" )
master2 %>%
filter (public_preK_est_flag == "Imputed" ) %>%
select (public_preK_est_imp) %>%
summary ()
public_preK_est_imp
Min. : 0.00
1st Qu.: 25.90
Median : 77.20
Mean : 63.79
3rd Qu.:100.00
Max. :100.00
master2 %>%
filter (est_med_hh_inc_flag == "Imputed" ) %>%
select (est_med_hh_inc_imp) %>%
summary ()
est_med_hh_inc_imp
Min. : 73977
1st Qu.: 99590
Median :125203
Mean :125203
3rd Qu.:150816
Max. :176429
cor (master2[, index_vars], use= "complete.obs" )
public_preK_est est_uninsured_rate LE_imputed_2020
public_preK_est 1.00000000 0.44424579 -0.260946630
est_uninsured_rate 0.44424579 1.00000000 -0.380712468
LE_imputed_2020 -0.26094663 -0.38071247 1.000000000
est_has_disability_pct 0.25827067 0.40359270 -0.441964645
est_65plus_amb_pct 0.22977167 0.42897600 -0.400097965
LPA_CrudePrev 0.46421492 0.70654947 -0.405548159
est_med_hh_inc -0.34150893 -0.60248355 0.422620905
est_underemp_perc 0.05879988 -0.02763129 0.173286383
est_belowpov_pct 0.26831694 0.57825993 -0.340672337
Eviction filing rate 0.31622218 0.25788419 -0.289151917
hh_support_risk_score 0.09606129 0.10606816 -0.004763105
Energy Burden (% income) 0.41223181 0.57010456 -0.294115672
est_lesh_perc 0.37304380 0.70115373 -0.223539877
est_no_internet_perc 0.24611549 0.37481867 -0.308701232
lths_pct 0.48628646 0.72304126 -0.416267462
Percent Institutionalized GQ 0.10353749 -0.01687152 -0.064396911
Persistent Poverty 0.08390720 0.15832105 -0.233754838
Temperature_Diff 0.07050733 0.26165504 -0.213673773
est_has_disability_pct est_65plus_amb_pct
public_preK_est 0.25827067 0.2297716713
est_uninsured_rate 0.40359270 0.4289759988
LE_imputed_2020 -0.44196464 -0.4000979648
est_has_disability_pct 1.00000000 0.5013507524
est_65plus_amb_pct 0.50135075 1.0000000000
LPA_CrudePrev 0.53693813 0.5297073068
est_med_hh_inc -0.41211095 -0.3002824716
est_underemp_perc -0.05289979 -0.1133425311
est_belowpov_pct 0.38084466 0.4099607762
Eviction filing rate 0.30685306 0.2101222086
hh_support_risk_score 0.08896564 0.1155191053
Energy Burden (% income) 0.46141149 0.3353493163
est_lesh_perc 0.25000550 0.3207646305
est_no_internet_perc 0.36533163 0.3756851390
lths_pct 0.43692022 0.4386941572
Percent Institutionalized GQ 0.11070460 0.0002500681
Persistent Poverty 0.11422317 0.1646435633
Temperature_Diff 0.16839073 0.1948639636
LPA_CrudePrev est_med_hh_inc est_underemp_perc
public_preK_est 0.46421492 -0.34150893 0.058799885
est_uninsured_rate 0.70654947 -0.60248355 -0.027631289
LE_imputed_2020 -0.40554816 0.42262091 0.173286383
est_has_disability_pct 0.53693813 -0.41211095 -0.052899790
est_65plus_amb_pct 0.52970731 -0.30028247 -0.113342531
LPA_CrudePrev 1.00000000 -0.54502242 -0.120764423
est_med_hh_inc -0.54502242 1.00000000 0.163491021
est_underemp_perc -0.12076442 0.16349102 1.000000000
est_belowpov_pct 0.65173997 -0.59654508 -0.001474493
Eviction filing rate 0.47327100 -0.32040995 -0.034348132
hh_support_risk_score 0.23838664 -0.18033507 0.050315259
Energy Burden (% income) 0.76606051 -0.45356344 -0.079698732
est_lesh_perc 0.67058794 -0.46457492 -0.001084096
est_no_internet_perc 0.53286680 -0.30195000 -0.195163097
lths_pct 0.84666442 -0.45393078 -0.061155315
Percent Institutionalized GQ 0.08702208 -0.02056847 -0.009991584
Persistent Poverty 0.11121638 -0.08941666 0.036939876
Temperature_Diff 0.33438410 -0.43579453 -0.274660206
est_belowpov_pct Eviction filing rate
public_preK_est 0.268316944 0.316222184
est_uninsured_rate 0.578259932 0.257884187
LE_imputed_2020 -0.340672337 -0.289151917
est_has_disability_pct 0.380844664 0.306853064
est_65plus_amb_pct 0.409960776 0.210122209
LPA_CrudePrev 0.651739973 0.473270998
est_med_hh_inc -0.596545075 -0.320409953
est_underemp_perc -0.001474493 -0.034348132
est_belowpov_pct 1.000000000 0.217769082
Eviction filing rate 0.217769082 1.000000000
hh_support_risk_score 0.217935999 0.124982356
Energy Burden (% income) 0.508837463 0.423777529
est_lesh_perc 0.546749770 0.203220450
est_no_internet_perc 0.488904523 0.146081485
lths_pct 0.536173717 0.436625321
Percent Institutionalized GQ -0.020405988 0.038193004
Persistent Poverty 0.285568868 -0.003957288
Temperature_Diff 0.270229141 -0.027410753
hh_support_risk_score Energy Burden (% income)
public_preK_est 0.096061288 0.41223181
est_uninsured_rate 0.106068157 0.57010456
LE_imputed_2020 -0.004763105 -0.29411567
est_has_disability_pct 0.088965643 0.46141149
est_65plus_amb_pct 0.115519105 0.33534932
LPA_CrudePrev 0.238386638 0.76606051
est_med_hh_inc -0.180335072 -0.45356344
est_underemp_perc 0.050315259 -0.07969873
est_belowpov_pct 0.217935999 0.50883746
Eviction filing rate 0.124982356 0.42377753
hh_support_risk_score 1.000000000 0.22930946
Energy Burden (% income) 0.229309464 1.00000000
est_lesh_perc 0.181352974 0.51756307
est_no_internet_perc 0.162892228 0.34072914
lths_pct 0.156710568 0.69863313
Percent Institutionalized GQ 0.027206012 0.09087590
Persistent Poverty -0.023694422 -0.06923601
Temperature_Diff 0.100137564 0.21545279
est_lesh_perc est_no_internet_perc lths_pct
public_preK_est 0.373043797 0.2461155 0.48628646
est_uninsured_rate 0.701153728 0.3748187 0.72304126
LE_imputed_2020 -0.223539877 -0.3087012 -0.41626746
est_has_disability_pct 0.250005500 0.3653316 0.43692022
est_65plus_amb_pct 0.320764630 0.3756851 0.43869416
LPA_CrudePrev 0.670587937 0.5328668 0.84666442
est_med_hh_inc -0.464574922 -0.3019500 -0.45393078
est_underemp_perc -0.001084096 -0.1951631 -0.06115531
est_belowpov_pct 0.546749770 0.4889045 0.53617372
Eviction filing rate 0.203220450 0.1460815 0.43662532
hh_support_risk_score 0.181352974 0.1628922 0.15671057
Energy Burden (% income) 0.517563073 0.3407291 0.69863313
est_lesh_perc 1.000000000 0.3897666 0.62917178
est_no_internet_perc 0.389766637 1.0000000 0.46387996
lths_pct 0.629171782 0.4638800 1.00000000
Percent Institutionalized GQ -0.003118287 -0.0224041 0.12032952
Persistent Poverty -0.029656895 0.2196679 0.12584844
Temperature_Diff 0.336250849 0.1682932 0.15203082
Percent Institutionalized GQ Persistent Poverty
public_preK_est 0.1035374907 0.083907205
est_uninsured_rate -0.0168715171 0.158321049
LE_imputed_2020 -0.0643969112 -0.233754838
est_has_disability_pct 0.1107046018 0.114223175
est_65plus_amb_pct 0.0002500681 0.164643563
LPA_CrudePrev 0.0870220766 0.111216376
est_med_hh_inc -0.0205684698 -0.089416659
est_underemp_perc -0.0099915841 0.036939876
est_belowpov_pct -0.0204059877 0.285568868
Eviction filing rate 0.0381930040 -0.003957288
hh_support_risk_score 0.0272060123 -0.023694422
Energy Burden (% income) 0.0908758966 -0.069236006
est_lesh_perc -0.0031182870 -0.029656895
est_no_internet_perc -0.0224041033 0.219667880
lths_pct 0.1203295228 0.125848440
Percent Institutionalized GQ 1.0000000000 -0.038881310
Persistent Poverty -0.0388813097 1.000000000
Temperature_Diff -0.0510776133 -0.013265470
Temperature_Diff
public_preK_est 0.07050733
est_uninsured_rate 0.26165504
LE_imputed_2020 -0.21367377
est_has_disability_pct 0.16839073
est_65plus_amb_pct 0.19486396
LPA_CrudePrev 0.33438410
est_med_hh_inc -0.43579453
est_underemp_perc -0.27466021
est_belowpov_pct 0.27022914
Eviction filing rate -0.02741075
hh_support_risk_score 0.10013756
Energy Burden (% income) 0.21545279
est_lesh_perc 0.33625085
est_no_internet_perc 0.16829323
lths_pct 0.15203082
Percent Institutionalized GQ -0.05107761
Persistent Poverty -0.01326547
Temperature_Diff 1.00000000
cor (master2[, imputed_cols])
public_preK_est_imp est_uninsured_rate_imp
public_preK_est_imp 1.00000000 0.440237016
est_uninsured_rate_imp 0.44023702 1.000000000
LE_imputed_2020_imp -0.26716877 -0.390014909
est_has_disability_pct_imp 0.23712246 0.163370748
est_65plus_amb_pct_imp 0.23718262 0.299673214
LPA_CrudePrev_imp 0.50049324 0.619440177
est_med_hh_inc_imp -0.40214148 -0.594884445
est_underemp_perc_imp 0.18437855 0.002708014
est_belowpov_pct_imp 0.34755602 0.326362528
Eviction filing rate_imp 0.29152280 0.192143770
hh_support_risk_score_imp 0.03197354 0.094709424
Energy Burden (% income)_imp 0.44072292 0.496633106
est_lesh_perc_imp 0.33084906 0.698543650
est_no_internet_perc_imp 0.23589393 0.396225567
lths_pct_imp 0.44469722 0.642829337
Percent Institutionalized GQ_imp 0.05680903 -0.029294798
Persistent Poverty_imp -0.02698461 0.098398270
Temperature_Diff_imp 0.11902275 0.296228446
LE_imputed_2020_imp est_has_disability_pct_imp
public_preK_est_imp -0.26716877 0.2371225
est_uninsured_rate_imp -0.39001491 0.1633707
LE_imputed_2020_imp 1.00000000 -0.2157128
est_has_disability_pct_imp -0.21571275 1.0000000
est_65plus_amb_pct_imp -0.29708726 0.4052309
LPA_CrudePrev_imp -0.39291597 0.5231132
est_med_hh_inc_imp 0.37845982 -0.1224407
est_underemp_perc_imp 0.09196638 0.1776450
est_belowpov_pct_imp -0.18219021 0.4965270
Eviction filing rate_imp -0.23150542 0.2540476
hh_support_risk_score_imp -0.02062798 0.1022485
Energy Burden (% income)_imp -0.22860879 0.4283241
est_lesh_perc_imp -0.22457981 0.0810751
est_no_internet_perc_imp -0.25203379 0.4729545
lths_pct_imp -0.38433132 0.5087759
Percent Institutionalized GQ_imp -0.06748717 0.4031216
Persistent Poverty_imp -0.18902894 0.0760528
Temperature_Diff_imp -0.19619495 0.1297674
est_65plus_amb_pct_imp LPA_CrudePrev_imp
public_preK_est_imp 0.23718262 0.5004932
est_uninsured_rate_imp 0.29967321 0.6194402
LE_imputed_2020_imp -0.29708726 -0.3929160
est_has_disability_pct_imp 0.40523093 0.5231132
est_65plus_amb_pct_imp 1.00000000 0.4215015
LPA_CrudePrev_imp 0.42150150 1.0000000
est_med_hh_inc_imp -0.26623460 -0.4852699
est_underemp_perc_imp 0.07636548 0.2038324
est_belowpov_pct_imp 0.35499996 0.6536837
Eviction filing rate_imp 0.13708439 0.4052103
hh_support_risk_score_imp 0.04340267 0.1812116
Energy Burden (% income)_imp 0.29547165 0.7473551
est_lesh_perc_imp 0.15466648 0.5871815
est_no_internet_perc_imp 0.27529434 0.5478003
lths_pct_imp 0.32212030 0.8213187
Percent Institutionalized GQ_imp 0.15616287 0.2952144
Persistent Poverty_imp 0.14828495 0.1326349
Temperature_Diff_imp 0.12503563 0.3456429
est_med_hh_inc_imp est_underemp_perc_imp
public_preK_est_imp -0.40214148 0.184378545
est_uninsured_rate_imp -0.59488445 0.002708014
LE_imputed_2020_imp 0.37845982 0.091966382
est_has_disability_pct_imp -0.12244071 0.177645036
est_65plus_amb_pct_imp -0.26623460 0.076365480
LPA_CrudePrev_imp -0.48526991 0.203832391
est_med_hh_inc_imp 1.00000000 -0.045800776
est_underemp_perc_imp -0.04580078 1.000000000
est_belowpov_pct_imp -0.40891524 0.540089583
Eviction filing rate_imp -0.23808987 -0.041154943
hh_support_risk_score_imp -0.10239326 -0.123538226
Energy Burden (% income)_imp -0.45727395 0.304888099
est_lesh_perc_imp -0.42905086 -0.015562773
est_no_internet_perc_imp -0.24261485 -0.001622255
lths_pct_imp -0.35690916 0.063840955
Percent Institutionalized GQ_imp 0.08766797 0.131327325
Persistent Poverty_imp -0.14621087 0.189027380
Temperature_Diff_imp -0.45395696 -0.046003931
est_belowpov_pct_imp Eviction filing rate_imp
public_preK_est_imp 0.347556023 0.291522796
est_uninsured_rate_imp 0.326362528 0.192143770
LE_imputed_2020_imp -0.182190214 -0.231505419
est_has_disability_pct_imp 0.496527027 0.254047552
est_65plus_amb_pct_imp 0.354999960 0.137084390
LPA_CrudePrev_imp 0.653683718 0.405210336
est_med_hh_inc_imp -0.408915242 -0.238089869
est_underemp_perc_imp 0.540089583 -0.041154943
est_belowpov_pct_imp 1.000000000 0.168851800
Eviction filing rate_imp 0.168851800 1.000000000
hh_support_risk_score_imp -0.005163479 0.030961506
Energy Burden (% income)_imp 0.554572265 0.283828439
est_lesh_perc_imp 0.283949990 0.153948533
est_no_internet_perc_imp 0.363048861 0.134171163
lths_pct_imp 0.424880343 0.321402210
Percent Institutionalized GQ_imp 0.342353638 0.210216815
Persistent Poverty_imp 0.220944898 -0.026910648
Temperature_Diff_imp 0.283663259 -0.001515009
hh_support_risk_score_imp
public_preK_est_imp 0.031973538
est_uninsured_rate_imp 0.094709424
LE_imputed_2020_imp -0.020627979
est_has_disability_pct_imp 0.102248519
est_65plus_amb_pct_imp 0.043402665
LPA_CrudePrev_imp 0.181211633
est_med_hh_inc_imp -0.102393262
est_underemp_perc_imp -0.123538226
est_belowpov_pct_imp -0.005163479
Eviction filing rate_imp 0.030961506
hh_support_risk_score_imp 1.000000000
Energy Burden (% income)_imp 0.171380030
est_lesh_perc_imp 0.112438380
est_no_internet_perc_imp 0.227663233
lths_pct_imp 0.209858283
Percent Institutionalized GQ_imp -0.150195712
Persistent Poverty_imp 0.006254729
Temperature_Diff_imp 0.032006681
Energy Burden (% income)_imp est_lesh_perc_imp
public_preK_est_imp 0.44072292 0.330849063
est_uninsured_rate_imp 0.49663311 0.698543650
LE_imputed_2020_imp -0.22860879 -0.224579813
est_has_disability_pct_imp 0.42832411 0.081075097
est_65plus_amb_pct_imp 0.29547165 0.154666479
LPA_CrudePrev_imp 0.74735506 0.587181515
est_med_hh_inc_imp -0.45727395 -0.429050863
est_underemp_perc_imp 0.30488810 -0.015562773
est_belowpov_pct_imp 0.55457226 0.283949990
Eviction filing rate_imp 0.28382844 0.153948533
hh_support_risk_score_imp 0.17138003 0.112438380
Energy Burden (% income)_imp 1.00000000 0.394451313
est_lesh_perc_imp 0.39445131 1.000000000
est_no_internet_perc_imp 0.39851225 0.418367178
lths_pct_imp 0.63909027 0.605963762
Percent Institutionalized GQ_imp 0.07304903 0.097344488
Persistent Poverty_imp 0.08885868 -0.007464553
Temperature_Diff_imp 0.27871585 0.315620470
est_no_internet_perc_imp lths_pct_imp
public_preK_est_imp 0.235893930 0.44469722
est_uninsured_rate_imp 0.396225567 0.64282934
LE_imputed_2020_imp -0.252033791 -0.38433132
est_has_disability_pct_imp 0.472954524 0.50877593
est_65plus_amb_pct_imp 0.275294337 0.32212030
LPA_CrudePrev_imp 0.547800283 0.82131870
est_med_hh_inc_imp -0.242614850 -0.35690916
est_underemp_perc_imp -0.001622255 0.06384095
est_belowpov_pct_imp 0.363048861 0.42488034
Eviction filing rate_imp 0.134171163 0.32140221
hh_support_risk_score_imp 0.227663233 0.20985828
Energy Burden (% income)_imp 0.398512254 0.63909027
est_lesh_perc_imp 0.418367178 0.60596376
est_no_internet_perc_imp 1.000000000 0.61927417
lths_pct_imp 0.619274171 1.00000000
Percent Institutionalized GQ_imp 0.039786917 0.17400372
Persistent Poverty_imp 0.176312718 0.12425939
Temperature_Diff_imp 0.205239882 0.18166003
Percent Institutionalized GQ_imp
public_preK_est_imp 5.680903e-02
est_uninsured_rate_imp -2.929480e-02
LE_imputed_2020_imp -6.748717e-02
est_has_disability_pct_imp 4.031216e-01
est_65plus_amb_pct_imp 1.561629e-01
LPA_CrudePrev_imp 2.952144e-01
est_med_hh_inc_imp 8.766797e-02
est_underemp_perc_imp 1.313273e-01
est_belowpov_pct_imp 3.423536e-01
Eviction filing rate_imp 2.102168e-01
hh_support_risk_score_imp -1.501957e-01
Energy Burden (% income)_imp 7.304903e-02
est_lesh_perc_imp 9.734449e-02
est_no_internet_perc_imp 3.978692e-02
lths_pct_imp 1.740037e-01
Percent Institutionalized GQ_imp 1.000000e+00
Persistent Poverty_imp -6.482865e-05
Temperature_Diff_imp 3.227010e-02
Persistent Poverty_imp Temperature_Diff_imp
public_preK_est_imp -2.698461e-02 0.119022750
est_uninsured_rate_imp 9.839827e-02 0.296228446
LE_imputed_2020_imp -1.890289e-01 -0.196194948
est_has_disability_pct_imp 7.605280e-02 0.129767431
est_65plus_amb_pct_imp 1.482849e-01 0.125035632
LPA_CrudePrev_imp 1.326349e-01 0.345642866
est_med_hh_inc_imp -1.462109e-01 -0.453956962
est_underemp_perc_imp 1.890274e-01 -0.046003931
est_belowpov_pct_imp 2.209449e-01 0.283663259
Eviction filing rate_imp -2.691065e-02 -0.001515009
hh_support_risk_score_imp 6.254729e-03 0.032006681
Energy Burden (% income)_imp 8.885868e-02 0.278715849
est_lesh_perc_imp -7.464553e-03 0.315620470
est_no_internet_perc_imp 1.763127e-01 0.205239882
lths_pct_imp 1.242594e-01 0.181660032
Percent Institutionalized GQ_imp -6.482865e-05 0.032270105
Persistent Poverty_imp 1.000000e+00 0.101897111
Temperature_Diff_imp 1.018971e-01 1.000000000
sd (master2$ est_med_hh_inc, na.rm= TRUE )
sd (master2$ est_med_hh_inc_imp)
Full joint imputation using PMM preserved the correlation structure of the dataset and did not materially alter rank ordering or distributional characteristics of key socioeconomic indicators.
Checking Evictions specifically
master2 %>%
group_by (` Eviction filing rate_flag ` ) %>%
summarise (
n = n (),
min = min (` Eviction filing rate_imp ` , na.rm= TRUE ),
p10 = quantile (` Eviction filing rate_imp ` , 0.10 , na.rm= TRUE ),
median = median (` Eviction filing rate_imp ` , na.rm= TRUE ),
mean = mean (` Eviction filing rate_imp ` , na.rm= TRUE ),
p90 = quantile (` Eviction filing rate_imp ` , 0.90 , na.rm= TRUE ),
max = max (` Eviction filing rate_imp ` , na.rm= TRUE )
)
# A tibble: 2 × 8
`Eviction filing rate_flag` n min p10 median mean p90 max
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Imputed 7 0.66 0.738 3.8 6.53 15.0 31.1
2 Original 242 0 0.781 3.18 5.31 10.7 61.7
slightly more right tailed
master2 %>%
filter (` Eviction filing rate_flag ` == "Imputed" ) %>%
select (GEOID_clean, ` Tract Full Name ` , ` Eviction filing rate ` , ` Eviction filing rate_imp ` )
# A tibble: 7 × 4
GEOID_clean `Tract Full Name` `Eviction filing rate` Eviction filing rate…¹
<chr> <chr> <dbl> <dbl>
1 48453000102 Census Tract 1.02; … NA 1.21
2 48453001606 Census Tract 16.06;… NA 4.33
3 48453002319 Census Tract 23.19;… NA 31.1
4 48453034300 Census Tract 343; T… NA 3.8
5 48453034500 Census Tract 345; T… NA 0.79
6 48453036800 Census Tract 368; T… NA 0.66
7 48453044600 Census Tract 446; T… NA 3.83
# ℹ abbreviated name: ¹`Eviction filing rate_imp`
weird_vals <- c (31.08 , 3.8 , 0.79 , 0.66 , 4.33 , 3.83 )
master2 %>%
mutate (Observed = ` Eviction filing rate_flag ` == "Original" ) %>%
filter (Observed, ` Eviction filing rate_imp ` %in% weird_vals) %>%
select (GEOID_clean, ` Tract Full Name ` , ` Eviction filing rate_imp ` )
# A tibble: 7 × 3
GEOID_clean `Tract Full Name` Eviction filing rate_im…¹
<chr> <chr> <dbl>
1 48453000203 Census Tract 2.03; Travis County; Texas 0.79
2 48453002219 Census Tract 22.19; Travis County; Texas 31.1
3 48453002310 Census Tract 23.10; Travis County; Texas 4.33
4 48453030700 Census Tract 307; Travis County; Texas 0.79
5 48453034900 Census Tract 349; Travis County; Texas 3.8
6 48453036700 Census Tract 367; Travis County; Texas 0.66
7 48453041200 Census Tract 412; Travis County; Texas 3.83
# ℹ abbreviated name: ¹`Eviction filing rate_imp`
Create Index
might as well try here
#helper functions
library (dplyr)
#percentile rank like ArcGIS "rank within study area"
#returns 0..1; ties get averaged like dplyr::percent_rank
pct_rank <- function (x) dplyr:: percent_rank (x)
#rescale numeric vector to 1..100
rescale_1_100 <- function (x) {
rng <- range (x, na.rm = TRUE )
if (diff (rng) == 0 ) return (rep (50 , length (x)))
1 + (x - rng[1 ]) * 99 / (rng[2 ] - rng[1 ])
}
df_coa <- master2
#define direction
higher_is_better <- c (
"public_preK_est_imp" ,
"LE_imputed_2020_imp" ,
"est_med_hh_inc_imp"
)
higher_is_worse <- c (
"est_uninsured_rate_imp" ,
"est_has_disability_pct_imp" ,
"est_65plus_amb_pct_imp" ,
"LPA_CrudePrev_imp" ,
"est_underemp_perc_imp" ,
"est_belowpov_pct_imp" ,
"Eviction filing rate_imp" ,
"Energy Burden (% income)_imp" ,
"est_lesh_perc_imp" ,
"est_no_internet_perc_imp" ,
"lths_pct_imp" ,
"Percent Institutionalized GQ_imp" ,
"Persistent Poverty_imp" ,
"Temperature_Diff_imp" ,
"hh_support_risk_score_imp"
)
#create percentile-rank versions (0..1)
df_coa <- df_coa %>%
mutate (across (all_of (c (higher_is_better, higher_is_worse)),
pct_rank,
.names = "{.col}_pct" ))
#reverse the "higher is worse" so that higher percentile always means "more favorable"
df_coa <- df_coa %>%
mutate (across (ends_with ("_pct" ) & all_of (paste0 (higher_is_worse, "_pct" )),
~ 1 - .x))
Create themes and subthemes
# Subtheme → indicator percentile columns
subthemes <- list (
# 1) Health & Wellbeing
health_access_resources = c (
"est_uninsured_rate_imp_pct" ,
"public_preK_est_imp_pct" ,
"LE_imputed_2020_imp_pct"
),
wellbeing_activity = c (
"est_has_disability_pct_imp_pct" ,
"est_65plus_amb_pct_imp_pct" ,
"LPA_CrudePrev_imp_pct" # low physical activity
),
# 2) Livelihood & Work
income_employment = c (
"est_med_hh_inc_imp_pct" ,
"est_underemp_perc_imp_pct" ,
"est_belowpov_pct_imp_pct"
),
hh_stability_cost_burdens = c (
"Eviction filing rate_imp_pct" ,
"hh_support_risk_score_imp_pct" ,
"Energy Burden (% income)_imp_pct"
),
# 3) Access & Belonging
edu_language_digital = c (
"est_lesh_perc_imp_pct" , # LEP
"est_no_internet_perc_imp_pct" ,
"lths_pct_imp_pct"
),
environment_neighborhood = c (
"Percent Institutionalized GQ_imp_pct" , # group quarters
"Persistent Poverty_imp_pct" , # concentrated poverty tracts
"Temperature_Diff_imp_pct" # heat risk
)
)
missing_cols <- setdiff (unlist (subthemes), names (df_coa))
if (length (missing_cols) > 0 ) {
stop ("These columns are missing from df_coa: " , paste (missing_cols, collapse = ", " ))
}
#compute subtheme scores (mean of percentiles)
for (nm in names (subthemes)) {
cols <- subthemes[[nm]]
df_coa[[paste0 ("sub_" , nm)]] <- rowMeans (df_coa[, cols, drop = FALSE ], na.rm = TRUE )
}
Define themes as mean of subthemes (equal weighting)
themes <- list (
health_wellbeing = c ("sub_health_access_resources" , "sub_wellbeing_activity" ),
livelihood_work = c ("sub_income_employment" , "sub_hh_stability_cost_burdens" ),
access_belonging = c ("sub_edu_language_digital" , "sub_environment_neighborhood" )
)
for (nm in names (themes)) {
df_coa[[paste0 ("theme_" , nm)]] <- rowMeans (df_coa[, themes[[nm]]], na.rm = TRUE )
}
final composite score
theme_cols <- paste0 ("theme_" , names (themes))
df_coa$ composite_pct_mean <- rowMeans (df_coa[, theme_cols], na.rm = TRUE )
# Rescale 0..1 → 1..100
rescale_1_100 <- function (x) {
rng <- range (x, na.rm = TRUE )
if (diff (rng) == 0 ) return (rep (50 , length (x)))
1 + (x - rng[1 ]) * 99 / (rng[2 ] - rng[1 ])
}
df_coa$ composite_1_100 <- rescale_1_100 (df_coa$ composite_pct_mean)
summary (df_coa$ composite_1_100)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 38.92 54.38 53.84 71.08 100.00
hist (df_coa$ composite_1_100, breaks = 20 )
df_coa %>%
select (GEOID_clean,
composite_1_100,
est_belowpov_pct_imp,
est_med_hh_inc_imp) %>%
arrange (composite_1_100) %>%
head (10 )
# A tibble: 10 × 4
GEOID_clean composite_1_100 est_belowpov_pct_imp est_med_hh_inc_imp
<chr> <dbl> <dbl> <dbl>
1 48453002105 1 10.7 68649
2 48453040700 1.70 41.7 41526
3 48453002413 2.57 42.8 42629
4 48453002411 7.06 23.9 67500
5 48453002112 8.96 14 83750
6 48453043200 12.9 11.6 35357
7 48453002319 13.8 91.4 176429
8 48453040100 14.8 29.4 61202
9 48453002310 18.2 34 36991
10 48453044000 18.4 39.8 51866
cor (df_coa$ composite_1_100, df_coa$ est_belowpov_pct_imp)
cor (df_coa$ composite_1_100, df_coa$ est_med_hh_inc_imp)
write.csv (
df_coa %>%
select (GEOID_clean, composite_1_100),
"coa_index_from_R.csv" ,
row.names = FALSE
)
mapping for fun
To enable caching of data, set `options(tigris_use_cache = TRUE)`
in your R script or .Rprofile.
Warning: package 'sf' was built under R version 4.4.3
Linking to GEOS 3.13.0, GDAL 3.10.1, PROJ 9.5.1; sf_use_s2() is TRUE
library (dplyr)
options (tigris_use_cache = TRUE )
travis <- tracts (state= "TX" , county= "Travis" , year= 2020 , class= "sf" )
hays <- tracts (state= "TX" , county= "Hays" , year= 2020 , class= "sf" )
tracts_sf <- bind_rows (travis, hays) %>%
mutate (GEOID_clean = as.character (GEOID))
#join and keep only area of interst
map_sf <- tracts_sf %>%
inner_join (df_coa %>% select (GEOID_clean, composite_1_100),
by = "GEOID_clean" )
mean (! is.na (map_sf$ composite_1_100))
Warning: package 'tmap' was built under R version 4.4.3
ℹ tmap modes "plot" - "view"
ℹ toggle with `tmap::ttm()`
tm_shape (map_sf) +
tm_polygons (
col = "composite_1_100" ,
palette = "YlGnBu" ,
id = "GEOID_clean" ,
popup.vars = c ("Composite Index (1–100)" = "composite_1_100" ),
alpha = 0.9
)
── tmap v3 code detected ───────────────────────────────────────────────────────
[v3->v4] `tm_tm_polygons()`: migrate the argument(s) related to the scale of
the visual variable `fill` namely 'palette' (rename to 'values') to fill.scale
= tm_scale(<HERE>).[v3->v4] `tm_polygons()`: use 'fill' for the fill color of polygons/symbols
(instead of 'col'), and 'col' for the outlines (instead of 'border.col').[v3->v4] `tm_polygons()`: use `fill_alpha` instead of `alpha`.[cols4all] color palettes: use palettes from the R package cols4all. Run
`cols4all::c4a_gui()` to explore them. The old palette name "YlGnBu" is named
"brewer.yl_gn_bu"Multiple palettes called "yl_gn_bu" found: "brewer.yl_gn_bu", "matplotlib.yl_gn_bu". The first one, "brewer.yl_gn_bu", is returned.
PCA
library (dplyr)
pca_vars <- c (
"public_preK_est_imp" ,
"est_uninsured_rate_imp" ,
"LE_imputed_2020_imp" ,
"est_has_disability_pct_imp" ,
"est_65plus_amb_pct_imp" ,
"LPA_CrudePrev_imp" ,
"est_med_hh_inc_imp" ,
"est_underemp_perc_imp" ,
"est_belowpov_pct_imp" ,
"Eviction filing rate_imp" ,
"hh_support_risk_score_imp" ,
"Energy Burden (% income)_imp" ,
"est_lesh_perc_imp" ,
"est_no_internet_perc_imp" ,
"lths_pct_imp" ,
"Percent Institutionalized GQ_imp" ,
"Persistent Poverty_imp" ,
"Temperature_Diff_imp"
)
setdiff (pca_vars, names (df_coa))
matrix
X <- df_coa %>% select (all_of (pca_vars))
X <- X %>% mutate (across (everything (), as.numeric))
#flip higher is worse indicators so higher = better
worse_in_pca <- intersect (higher_is_worse, names (X))
X_oriented <- X
X_oriented[worse_in_pca] <- lapply (X_oriented[worse_in_pca], function (v) - v)
#standardize for PCA (mean 0, sd 1)
X_scaled <- scale (X_oriented)
pca <- prcomp (X_scaled, center = FALSE , scale. = FALSE ) # already scaled
summary (pca)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7
Standard deviation 2.4589 1.3694 1.17555 1.10538 1.07793 1.00302 0.93878
Proportion of Variance 0.3359 0.1042 0.07677 0.06788 0.06455 0.05589 0.04896
Cumulative Proportion 0.3359 0.4401 0.51686 0.58474 0.64929 0.70518 0.75415
PC8 PC9 PC10 PC11 PC12 PC13 PC14
Standard deviation 0.89668 0.80335 0.7719 0.72181 0.66559 0.62184 0.55186
Proportion of Variance 0.04467 0.03585 0.0331 0.02895 0.02461 0.02148 0.01692
Cumulative Proportion 0.79881 0.83467 0.8678 0.89672 0.92133 0.94281 0.95973
PC15 PC16 PC17 PC18
Standard deviation 0.48388 0.45990 0.42983 0.30735
Proportion of Variance 0.01301 0.01175 0.01026 0.00525
Cumulative Proportion 0.97274 0.98449 0.99475 1.00000
loadings_pc1 <- sort (pca$ rotation[,1 ], decreasing = TRUE )
loadings_pc1
public_preK_est_imp hh_support_risk_score_imp
0.23852852 -0.07233206
Persistent Poverty_imp est_underemp_perc_imp
-0.07944712 -0.08738954
Percent Institutionalized GQ_imp Eviction filing rate_imp
-0.09641262 -0.16083796
Temperature_Diff_imp LE_imputed_2020_imp
-0.16496760 -0.19796270
est_65plus_amb_pct_imp est_has_disability_pct_imp
-0.20217119 -0.23369610
est_med_hh_inc_imp est_no_internet_perc_imp
-0.25707138 -0.25719367
est_lesh_perc_imp est_belowpov_pct_imp
-0.26256519 -0.28030328
est_uninsured_rate_imp Energy Burden (% income)_imp
-0.29955355 -0.31369596
lths_pct_imp LPA_CrudePrev_imp
-0.34220278 -0.37667452
pc1 <- pca$ x[,1 ]
#align direction to match composite (higher=more favorable)
if (cor (pc1, df_coa$ composite_1_100, use = "complete.obs" ) < 0 ) {
pc1 <- - pc1
}
df_coa$ pca_pc1 <- pc1
#rescale PC1 to 1-100 for easier comparison
rescale_1_100 <- function (x) {
rng <- range (x, na.rm = TRUE )
if (diff (rng) == 0 ) return (rep (50 , length (x)))
1 + (x - rng[1 ]) * 99 / (rng[2 ] - rng[1 ])
}
df_coa$ pca_1_100 <- rescale_1_100 (df_coa$ pca_pc1)
compare
cor (df_coa$ composite_1_100, df_coa$ pca_1_100, use = "complete.obs" )
cor (df_coa$ composite_1_100, df_coa$ pca_1_100, use = "complete.obs" , method = "spearman" )
df_coa %>%
mutate (
rank_composite = rank (- composite_1_100, ties.method = "average" ),
rank_pca = rank (- pca_1_100, ties.method = "average" ),
rank_diff = rank_composite - rank_pca
) %>%
arrange (desc (abs (rank_diff))) %>%
select (GEOID_clean, composite_1_100, pca_1_100, rank_composite, rank_pca, rank_diff) %>%
head (20 )
# A tibble: 20 × 6
GEOID_clean composite_1_100 pca_1_100 rank_composite rank_pca rank_diff
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 48453002446 72.8 72.3 56 143 -87
2 48453000607 52.6 52.2 131 217 -86
3 48453000606 55.4 56.9 119 203 -84
4 48453000601 47.8 45.4 153 230 -77
5 48453002324 41.9 22.3 175 247 -72
6 48453000605 52.5 59.6 132 194 -62
7 48453002323 32.2 72.5 202 140 62
8 48453000203 38.7 73.6 188 130 58
9 48453000308 47.5 79.2 155 102 53
10 48453000401 51.6 81.2 140 87 53
11 48453031600 54.6 82.7 122 71 51
12 48491020334 45.3 78.2 164 113 51
13 48453000302 46.5 78.3 160 110 50
14 48453045100 73.0 79.1 54 103 -49
15 48491020409 76.6 80.2 44 93 -49
16 48453000608 46.8 55.6 158 206 -48
17 48453002004 36.6 71.3 192 144 48
18 48453000803 54.4 81.8 125 78 47
19 48453000804 26.4 65.9 214 167 47
20 48453041900 59.8 70.5 105 152 -47
A principal component analysis (PCA) was conducted as a robustness check. The first component explained 33.6% of total variance and reflected a broad structural disadvantage dimension, loading heavily on poverty, health risk, energy burden, and digital access indicators. The strong alignment between PCA-derived scores and the equal-weight composite confirms that the index structure is empirically coherent and not driven by arbitrary weighting choices.