Introduction/Motivation
The intricate world of healthcare data has always held a certain fascination for me. Most of my immediate family are nurses (my dad, brother, and sister are nurses and my second sister is currently in schooling to be a nurse). While my professional journey has primarily navigated the data-driven landscapes of political campaigns, the potential of leveraging analytical tools like Ecological Inference (EI) to improve public health outcomes feels like a calling.
When I learned Ecological Inference for my role working on redistricting, I was in awe. This method, which can take aggregate-level data and estimate individual-level results, seemed to hold the answers to so many questions that the previous teams I had worked on in politics had been unable to answer. This invaluable tool could unlock so much potential for improved campaign performance and wasn’t being used. Indeed, I had volunteered to learn it because it was a service we needed to provide and no one else on the team, despite having previous experience in redistricting, knew how to do it. I believe this method is also valuable in the field of healthcare.
As Ruth Salway and Jonathan Wakefield aptly state in their chapter of the book, “Ecological Inference: New Methodological Strategies,”
“Ecological studies are not only useful hypothesis-generating mechanisms, but can also add to the totality of evidence when building a case for a disease risk-exposure relationship (Morgenstern 1998). The appeal of ecological studies is that they can utilize routinely available data (and so are relatively inexpensive to carry out) and can cover a broad geographical area, thus taking advantage of large exposure contrasts and large populations; both of these factors increase power”
This struck a chord with me. So often, the data I worked with was in aggregate: precinct-level data, census-block-level data, county-level data, etc. The answers I needed to find, though, required determining individual-level responses or behaviors. I expect that, similarly, in healthcare, understanding individual-level risk factors from aggregate data could unlock immense understanding and more efficient practices.
Imagine tackling an infectious disease outbreak. Aggregated data might tell us the general location and demographics of affected individuals, but EI could unlock crucial details. Were residents in a specific neighborhood disproportionately exposed? Did certain age groups exhibit higher susceptibility? Such nuanced understanding, gleaned from seemingly opaque group data, could guide targeted interventions and save lives.
This appeal isn’t limited to epidemiology. My experience in political campaigns suggests a parallel: a powerful tool that goes underused because it is seen as a niche method, needed only once a decade during redistricting. In campaigns, questions about voter preferences within specific demographics often went unanswered, simply because the methods already familiar to the team couldn’t handle aggregated data. At the top level, polls were tossed aside after the election in favor of the actual results, because the polls were known to be inaccurate, sometimes significantly so. At the demographic level, on the other hand, polls were considered the only option. EI could instead have provided actionable insights, optimizing campaign strategies and influencing election outcomes.
Perhaps the same scenario unfolds in healthcare. A wealth of readily available data in hospitals, clinics, and public health records sits waiting to be deciphered. EI could unlock hidden patterns within these aggregates, illuminating individual-level risk factors, treatment efficacy, and even patient behavior.
The following aims to be a bridge between the analytical expertise I gained in my previous career and the vast, data-rich world of healthcare. By demonstrating the transferability of EI and its potential impact on public health outcomes, I hope to not only showcase its immense value but also pave the way for my own journey into this critical field. My unwavering conviction is that my dedication to data and my proficiency in advanced analytical tools like EI could be instrumental in shaping the future of healthcare in my community.
Ecological Inference: A Brief Overview
Ecological Inference (EI) is a powerful statistical technique that addresses a fundamental problem in social science research: inferring individual-level behavior from aggregate data. This method is particularly relevant in fields like political science and healthcare, where individual-level data is often unavailable or impractical to collect.
The challenge of EI lies in the ecological fallacy: the potential error in assuming that relationships observed at the aggregate level also hold true at the individual level. The methodology of EI, as detailed in King’s book “A Solution to the Ecological Inference Problem: Reconstructing Individual Behavior from Aggregate Data” (King, 1997), focuses on overcoming this fallacy by developing statistical models that estimate individual behaviors and characteristics from group-level data.
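To make the problem concrete, here is a minimal Python sketch of the classic deterministic "method of bounds" (often attributed to Duncan and Davis), which shows how much the aggregate numbers alone can pin down about a subgroup. It is not the model used later in this analysis, and the function name and example figures are hypothetical, chosen only for illustration.

```python
def duncan_davis_bounds(x, t):
    """Deterministic bounds on the outcome rate inside a subgroup.

    x: share of the unit's population that belongs to the subgroup (0 < x <= 1)
    t: share of the unit's population that has the outcome (0 <= t <= 1)
    Returns (lower, upper) bounds on the outcome rate within the subgroup.
    """
    if not 0 < x <= 1 or not 0 <= t <= 1:
        raise ValueError("x must be in (0, 1] and t must be in [0, 1]")
    # Lower bound: assume everyone outside the subgroup has the outcome first.
    lower = max(0.0, (t - (1 - x)) / x)
    # Upper bound: assume every case of the outcome falls inside the subgroup.
    upper = min(1.0, t / x)
    return lower, upper


# Hypothetical unit: 50% of residents are in the subgroup, 30% have the outcome.
low, high = duncan_davis_bounds(0.5, 0.3)
print(f"subgroup outcome rate lies between {low:.2f} and {high:.2f}")
```

The bounds are often wide, which is why statistical EI models add assumptions and pool information across many units to narrow them.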
King’s innovation was crucial in the advancement of EI, particularly in political science. His approach, based on a combination of statistical theory and empirical validation, has provided a more robust understanding of voter behavior and political preferences from aggregate election data. This has been instrumental in fields like redistricting and political campaign strategy, where understanding specific demographic voting patterns is crucial.
In healthcare, the application of EI has evolved to address public health challenges. The method allows researchers to derive individual-level risk factors and disease prevalence from aggregated health data, as highlighted in the paper “ecoreg: Conducting ecological regression using aggregate and individual data” (Lumley, 2013). This approach is particularly beneficial in epidemiological studies, where it aids in identifying risk factors and disease spread patterns within populations, using data aggregated by regions or demographics.
The progress in EI methodology has been marked by its increasing accuracy and applicability in these fields, driven by advancements in statistical models and computational power. The method’s ability to unlock individual-level insights from aggregated data continues to be a game-changer, providing valuable information for decision-making in both political science and healthcare.
The Data
Description and Sources
This analysis utilizes a comprehensive dataset on asthma hospitalizations, sourced from the California Department of Public Health. Interested readers can find the dataset here. Initially collected at the zip code level, the data is meticulously converted to block group format to better serve the objectives of this analysis.
Additionally, this dataset is enriched with environmental data, capturing key pollutants like carbon monoxide, nitrogen dioxide, and particulate matter (PM 2.5). This environmental data is obtained through the RAQSAPI R package from the EPA, accessible here. In the methodology, each census block group is assigned environmental measures from the closest EPA measurement site, providing a nuanced understanding of the local environmental conditions.
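The nearest-site assignment can be sketched as follows. This is an illustrative Python version only: the original work was done in R, the exact distance calculation it used is not stated, and the site names, coordinates, and centroid below are hypothetical placeholders rather than real EPA monitors.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def nearest_site(lat, lon, sites):
    """Return the monitoring site closest to the given point."""
    return min(sites, key=lambda s: haversine_km(lat, lon, s["lat"], s["lon"]))


# Hypothetical monitoring sites (names and coordinates are made up).
sites = [
    {"name": "site_a", "lat": 34.00, "lon": -117.40},
    {"name": "site_b", "lat": 34.50, "lon": -117.30},
]

# Hypothetical block group centroid.
closest = nearest_site(33.995, -117.36, sites)
print(closest["name"])
```

One consequence of this design is that neighboring block groups share identical pollutant values whenever they fall closest to the same site, which is visible in the data preview further down.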
Further depth is added to the analysis with data from the U.S. Census Bureau, accessed using the tidycensus package. The data, derived from the ACS 5-year survey at the block group level, includes critical demographic variables such as total population, children under 15, individuals commuting by car, those without health insurance, and the total female population. Data is limited to only Riverside and San Bernardino counties.
In essence, the dataset for this analysis is a rich blend of healthcare, environmental, and demographic information. Each data source contributes to a multi-faceted view, enabling a thorough and intricate understanding of the factors impacting asthma hospitalizations in California’s Inland Empire.
Exploring the Data
A preview of the resulting dataset is visible below. Using the
tidycensus package we were also able to pull shapefiles so
that this object contains the geographic information for mapping. We
also have a column indicating the urban area assigned by the Census
Bureau.
## Simple feature collection with 6 features and 12 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: -117.3803 ymin: 33.9865 xmax: -117.3456 ymax: 34.01944
## Geodetic CRS: WGS 84
## county geoid urban_area
## 1 Riverside 060650301011 Riverside--San Bernardino, CA
## 2 Riverside 060650301031 Riverside--San Bernardino, CA
## 3 Riverside 060650301032 Riverside--San Bernardino, CA
## 4 Riverside 060650301041 Riverside--San Bernardino, CA
## 5 Riverside 060650301042 Riverside--San Bernardino, CA
## 6 Riverside 060650301043 Riverside--San Bernardino, CA
## number_of_asthma_ed_visits total_pop children_under_15 drive_alone_commuters
## 1 2 1329 215 494
## 2 3 751 109 289
## 3 8 1477 463 374
## 4 14 2685 525 953
## 5 9 817 187 277
## 6 11 4143 973 1609
## no_health_insurance total_female carbon_monoxide nitrogen_dioxide_no2
## 1 119 654 0.368377 27.04485
## 2 72 302 0.368377 27.04485
## 3 94 697 0.368377 27.04485
## 4 320 1258 0.368377 27.04485
## 5 171 338 0.368377 27.04485
## 6 545 1999 0.368377 27.04485
## pm2_5_local_conditions geometry
## 1 12.18348 POLYGON ((-117.3574 33.9935...
## 2 12.18348 POLYGON ((-117.3654 33.9929...
## 3 12.18348 POLYGON ((-117.3751 34.0008...
## 4 12.18348 POLYGON ((-117.3803 34.0071...
## 5 12.18348 POLYGON ((-117.3761 34.0021...
## 6 12.18348 POLYGON ((-117.3698 34.0058...
A closer look at the urban areas in San Bernardino and Riverside counties can be viewed below. A more in-depth description of what constitutes an urban area can be found here.
## urban_area
## 1 Barstow, CA
## 2 Big Bear, CA
## 3 Blythe, CA--AZ
## 4 Crestline--Lake Arrowhead, CA
## 5 Desert Hot Springs, CA
## 6 Hemet, CA
## 7 Indio--Palm Desert--Palm Springs, CA
## 8 Joshua Tree, CA
## 9 Los Angeles--Long Beach--Anaheim, CA
## 10 Mecca, CA
## 11 Needles, CA--AZ
## 12 Riverside--San Bernardino, CA
## 13 Running Springs, CA
## 14 Silver Lakes, CA
## 15 Temecula--Murrieta--Menifee, CA
## 16 Twentynine Palms North, CA
## 17 Twentynine Palms, CA
## 18 Victorville--Hesperia--Apple Valley, CA
## 19 Yucca Valley, CA
## 20 <NA>
A more complete overview of the data can be viewed using the
skimr::skim() function. Here we see that most of our data
is complete, with negligible missing values. With so few missing values,
the following analysis will handle them by removing them. We also get a
quick look at the distributions of our numeric variables. For example, we
can see that the arithmetic mean for the total_pop column
is 2,143 while the median is 1,844, meaning our data for that variable is
right skewed (a long tail of large values pulls the mean above the
median), visible in the histogram included in the function's output.
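As a quick check on the direction of skew, the summary figures for total_pop can be plugged into Pearson's median skewness coefficient, 3 × (mean − median) / sd. This is a small Python sketch rather than part of the original R workflow; the function name is my own, and the inputs are the mean, median (p50), and standard deviation reported in the skim output.

```python
def pearson_median_skewness(mean, median, sd):
    """Pearson's second skewness coefficient: 3 * (mean - median) / sd."""
    return 3 * (mean - median) / sd


# Summary statistics for total_pop, taken from the skim output.
skew = pearson_median_skewness(mean=2143.24, median=1844.00, sd=1425.96)

if skew > 0:
    direction = "right"
elif skew < 0:
    direction = "left"
else:
    direction = "none"

print(f"skewness coefficient: {skew:.2f} ({direction} skew)")
```

A positive coefficient means the mean sits above the median, which is the signature of a right-skewed distribution.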
| Name | asthma_ed_2018 |
| Number of rows | 2139 |
| Number of columns | 13 |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| numeric | 9 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| county | 0 | 1.00 | 9 | 14 | 0 | 2 | 0 |
| geoid | 0 | 1.00 | 12 | 12 | 0 | 2115 | 0 |
| urban_area | 81 | 0.96 | 9 | 39 | 0 | 19 | 0 |
| geometry | 0 | 1.00 | 6 | 12575 | 0 | 2102 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| number_of_asthma_ed_visits | 0 | 1.00 | 8.85 | 19.54 | 0.00 | 2.00 | 4.00 | 8.00 | 322.00 | ▇▁▁▁▁ |
| total_pop | 14 | 0.99 | 2143.24 | 1425.96 | 0.00 | 1290.00 | 1844.00 | 2566.00 | 19699.00 | ▇▁▁▁▁ |
| children_under_15 | 14 | 0.99 | 464.82 | 392.70 | 0.00 | 218.00 | 378.00 | 595.00 | 3259.00 | ▇▂▁▁▁ |
| drive_alone_commuters | 14 | 0.99 | 695.38 | 490.31 | 0.00 | 387.00 | 584.00 | 863.00 | 4319.00 | ▇▂▁▁▁ |
| no_health_insurance | 14 | 0.99 | 204.47 | 182.80 | 0.00 | 74.00 | 155.00 | 285.00 | 1650.00 | ▇▂▁▁▁ |
| total_female | 14 | 0.99 | 1076.34 | 695.60 | 0.00 | 635.00 | 925.00 | 1291.00 | 6333.00 | ▇▂▁▁▁ |
| carbon_monoxide | 14 | 0.99 | 0.30 | 0.08 | 0.15 | 0.24 | 0.33 | 0.37 | 0.43 | ▃▃▃▇▃ |
| nitrogen_dioxide_no2 | 14 | 0.99 | 25.38 | 7.49 | 8.64 | 19.06 | 27.04 | 28.53 | 43.58 | ▁▂▇▂▁ |
| pm2_5_local_conditions | 14 | 0.99 | 11.38 | 2.98 | 2.56 | 11.19 | 11.33 | 13.90 | 16.02 | ▁▃▂▇▆ |
Looking at correlations, we can see below that the population measures are all closely correlated. This is because they are still raw counts, and block groups with larger populations will naturally have larger counts for each sub-population. For our analysis, though, we will need them to be proportions.
In the plot below we convert these sub-population counts to proportions and check correlations again. This significantly reduces the correlation between our census numbers. The EPA measures also show some correlation. Unfortunately, there is no comparable transformation to apply to these variables, but the highest correlation we observe among them is ~0.66 (0.65797 between carbon monoxide and nitrogen dioxide), which will be tolerable for our analysis.
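The count-to-proportion step can be sketched in a few lines of Python. This is only an illustration of the idea, not the original R code: it uses the six block groups shown in the data preview, so the resulting correlations will not match the full-dataset values, and the helper names are my own.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def to_proportions(counts, totals):
    """Divide each count by its total, skipping zero-population rows."""
    return [c / t for c, t in zip(counts, totals) if t > 0]


# The six block groups from the data preview.
total_pop = [1329, 751, 1477, 2685, 817, 4143]
children_under_15 = [215, 109, 463, 525, 187, 973]
no_health_insurance = [119, 72, 94, 320, 171, 545]

# Raw counts both scale with block group size.
r_counts = pearson(children_under_15, no_health_insurance)

# Proportions remove the shared dependence on population size.
prop_children = to_proportions(children_under_15, total_pop)
prop_uninsured = to_proportions(no_health_insurance, total_pop)
r_props = pearson(prop_children, prop_uninsured)

print(f"correlation of counts:      {r_counts:.2f}")
print(f"correlation of proportions: {r_props:.2f}")
```

The zero-population guard matters here because the skim summary shows a minimum total_pop of 0, and dividing by it would fail.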
Mapping the Data
The map below shows the different values of the EPA measures and the locations of the sites where the different measurements were taken. The measurements are on different scales, but for all of them, higher values move closer to yellow (#FDE725FF) while lower values move toward dark purple (#440154FF) on the scale.
The maps below allow us to compare results for asthma ED visits in the map on the left with various variables we can map on the right.
Ecological Inference Analysis
Now we get into the actual results of our Ecological Inference model.
Briefly we can find some interesting results from our models. We have
three different models where we keep our individual level variables
constant. For our modeling (using the ecoreg package) the
individual level variables need to be binary variables that can be
plugged in as proportions. Every model has a different aggregate level
variable. These variables do not need to be binary. In this particular
Ecological Inference model, we are not measuring the number of people as
we would in political campaigns, but rather the effect on the likelihood
of a disease, or in this case, effect on the likelihood of an asthma
emergency department visit.
The most interesting finding is that our individual-level odds ratios are fairly consistent across models. We observe approximately 70% higher risk of an asthma ED visit for women, and approximately 60% higher risk for people without health insurance, in our individual-level covariates across all models. This is consistent with findings that women (after puberty) are more likely to have asthma than men (after puberty). We can also hypothesize that people without health insurance are more likely to go without treatment for asthma until it becomes severe enough to need a visit to the emergency department.
The odds ratios for all of our aggregate-level covariates are also greater than 1, suggesting that a one-unit increase in carbon monoxide, nitrogen dioxide, or PM2.5 increases the risk of asthma-related ED visits by roughly 0.42% to 3%.
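For readers less used to odds ratios, the conversion to a percentage change can be sketched as below. The odds ratios here are approximate values implied by the percentages in the text (about 1.70, 1.60, 1.0042, and 1.03), not exact model output, and the multi-unit calculation assumes effects multiply on the odds scale, as they do in a logistic-type model. Strictly speaking, an odds ratio describes a change in odds, which is close to a change in risk only when the outcome is rare.

```python
def pct_change_in_odds(odds_ratio, units=1.0):
    """Percent change in odds for a given increase in the covariate.

    Assumes a multiplicative (logistic-type) model, so an increase of
    `units` multiplies the odds by odds_ratio ** units.
    """
    return (odds_ratio ** units - 1.0) * 100.0


# Approximate odds ratios implied by the text (not exact model output).
approx_odds_ratios = {
    "female": 1.70,
    "no health insurance": 1.60,
    "smallest environmental effect": 1.0042,
    "largest environmental effect": 1.03,
}

for label, odds_ratio in approx_odds_ratios.items():
    print(f"{label}: {pct_change_in_odds(odds_ratio):+.2f}% per unit")

# Effects compound over larger changes, e.g. a 5-unit increase at OR = 1.03.
print(f"5-unit increase: {pct_change_in_odds(1.03, units=5):+.2f}%")
```

Because the pollutants are measured on different scales, a "one-unit" change is not comparable across them, so the per-unit percentages should be read alongside each variable's range.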
For all of our models, the -2 x log-likelihood is ~2.6, suggesting consistent performance across models.
It’s important to note that this model does not measure the risk of asthma, but rather measures the risk of emergency department visits for asthma.
Conclusion
In conclusion, this journey from the world of political campaigns to healthcare analytics has been enlightening and fruitful. The findings of our study on asthma ED visits in California reinforce the power of Ecological Inference (EI) as a versatile analytical tool. The consistent odds ratios across models reveal crucial insights about asthma risks related to gender and insurance status. Notably, the transferable skills from political data analysis to healthcare have proven valuable, underscoring the potential of interdisciplinary approaches in solving complex public health issues. As we continue to explore and leverage data in healthcare, the principles and methods honed in political campaigns will undoubtedly contribute to deeper understanding and innovative solutions in public health.