NOTE: You can download this
.Rmdfile by clicking on thecodetab on the upper-right.
For this project, synthetic data for two different states is going to be used. This data has been generated using Synthea, currently regarded as the state-of-the-art engine for generating longitudinal electronic health records. Synthea generates realistic-but-not-real longitudinal patient records and is commonly used for workflow development, testing, and training when real EHR data cannot be used.
Our target audience is going to be a clinical operations analytics group within a pediatric hospital network that faces limited access to real-world EHR data. The team’s objective is to develop, validate, and stress-test operational metrics and dashboards using synthetic EHR data as a privacy-safe surrogate while real data access is insufficient.
From the whole Synthea dataset, and keeping our goal in mind, the only dataframes we are going to be working with are:
patients.csv. A dataframe containing one row per
patient with their demographic information.encounters.csv. A dataframe containing one row per
patient encounter.conditions.csv. A dataframe containing one row per
patient condition found in any encounter.Our cohort is made of under-aged patients from different states (taken from the patient’s state field as generated by Synthea, not encounter location) with at least one encounter.
It is important to note that Synthea can contain outpatient/ED class encounters occurring “inside” an inpatient stay. Given that “clinical operations” is our target audience, we will count these encounters as their assigned category even if they lead to an admission, and the admission is also going to be counted. This is intended to reflect real operational load (ED/outpatient throughout + inpatient capacity) even if this implies that we could not be counting unique episodes of care.
It is essential to define what “mortality” and “morbidity” mean in terms of the data we have.
We decided to define “morbidity” as the percentage of pediatric patients with at least one record of a given medical condition (condition prevalence), and as the distribution of the number of distinct condition per patient or chronic-condition count (multi-morbidity).
As for the “mortality”, we decided to look at age-specific mortality to avoid misleading aggregate comparisons.
Because the datasets for California and Mississippi had been created
independently, some patient and/or encounter IDs could appear in both
states for different people and encounter. That is why, all the IDs will
be prefixed with CA: and MS: for California
and Mississippi entries respectively.
From the patients.csv dataframe, we are only interested
in the patient’s ID, date of birth, date of death, race and gender.
In order to compute the patient’s current age, the “end date” used on
the dataset creation (2025-11-13) is going to be used. In
addition, the ages are going to be grouped in the age bands
<1, 1-4, 5-12 and
13-17.
A new flag (is.alive) is going to be created to record
whether the patient is or not alive, as well as a new variable
(state) indicating the patient’s origin.
The two dataframes are going to be merged and written to a new file.
patients.ca.raw <- readr::read_csv( "./data/raw/california/patients.csv",
show_col_types = FALSE )
patients.ca <- patients.ca.raw %>%
dplyr::select( patient.id = Id,
birth.date = BIRTHDATE,
death.date = DEATHDATE,
race = RACE,
gender = GENDER ) %>%
dplyr::mutate(
patient.id = paste0( "CA:", patient.id ),
gender = dplyr::if_else( gender == "F", "Female", "Male"),
is.alive = dplyr::if_else( is.na( death.date ), 1, 0),
age.current = dplyr::if_else(
is.na( death.date ),
as.integer(
as.numeric( difftime( as.Date( "2025-11-13" ),
as.Date( birth.date ),
units = "weeks" )) / 52.25 ),
as.integer(
as.numeric( difftime( as.Date( death.date ),
as.Date( birth.date ),
units = "weeks")) / 52.25 )
),
age.current.range = dplyr::case_when(
age.current < 1 ~ "<1",
age.current < 5 ~ "1-4",
age.current < 13 ~ "5-12",
age.current < 18 ~ "13-17"
),
state = "California"
)patients.ms.raw <- readr::read_csv( "./data/raw/miss/patients.csv",
show_col_types = FALSE )
patients.ms <- patients.ms.raw %>%
dplyr::select( patient.id = Id,
birth.date = BIRTHDATE,
death.date = DEATHDATE,
race = RACE,
gender = GENDER ) %>%
dplyr::mutate(
patient.id = paste0( "MS:", patient.id ),
gender = dplyr::if_else( gender == "F", "Female", "Male"),
is.alive = dplyr::if_else( is.na( death.date ), 1, 0),
age.current = dplyr::if_else(
is.na( death.date ),
as.integer(
as.numeric( difftime( as.Date( "2025-11-13" ),
as.Date( birth.date ),
units = "weeks" )) / 52.25 ),
as.integer(
as.numeric( difftime( as.Date( death.date ),
as.Date( birth.date ),
units = "weeks")) / 52.25 )
),
age.current.range = dplyr::case_when(
age.current < 1 ~ "<1",
age.current < 5 ~ "1-4",
age.current < 13 ~ "5-12",
age.current < 18 ~ "13-17"
),
state = "Mississippi"
)From the encounters.csv dataframe, we are only
interested in the encounter and patient’s ID, the encounter start and
stop date, the encounter class, and the description and reason of the
encounter.
In Synthea, encounters are divided in 9 different categories. For
simplification, those encounters classified as emergency or
urgentcare are going to be grouped into the new
ED visit category, and those classified as
ambulatory or outpatient are going to be
grouped into the new Outpatient (in-person) category. The
rest is going to be renamed but left as they are.
It is important to know the patient’s age and age band at the encounter time. To retrieve this information, the encounter start date is going to be subtracted the patient’s birth date, and the same age bands as before are going to be used. In addition, and in order to be able to stratify the data later on, the patient’s gender is also going to be appended.
As for the encounter’s reason and description strings, these have the
format REASON (REASON TYPE) and
DESCRIPTION (DESCRIPTION TYPE) respectively. We are going
to split the fields so we have the type on a separate column.
encounters.ca.raw <- readr::read_csv( "./data/raw/california/encounters.csv",
show_col_types = FALSE )
encounters.ca <- encounters.ca.raw %>%
dplyr::select(
encounter.id = Id,
patient.id = PATIENT,
encounter.start = START,
encounter.stop = STOP,
encounter.class = ENCOUNTERCLASS,
encounter.description = DESCRIPTION,
encounter.reason = REASONDESCRIPTION
) %>%
dplyr::mutate(
encounter.id = paste0( "CA:", encounter.id ),
patient.id = paste0( "CA:", patient.id ),
encounter.description.type =
stringr::str_extract(encounter.description, "(?<=\\()[^()]*(?=\\)\\s*$)"),
encounter.description =
stringr::str_remove( encounter.description, "\\s*\\([^()]*\\)\\s*$" ),
encounter.reason.type =
stringr::str_extract( encounter.reason, "(?<=\\()[^()]*(?=\\)\\s*$)" ),
encounter.reason =
stringr::str_remove( encounter.reason, "\\s*\\([^()]*\\)\\s*$" ),
encounter.group = dplyr::case_when(
encounter.class %in% c( "emergency", "urgentcare") ~ "ED visit",
encounter.class == "inpatient" ~ "Admissions",
encounter.class %in% c( "ambulatory", "outpatient" ) ~
"Outpatient (in-person)",
encounter.class == "wellness" ~ "Wellness",
encounter.class == "virtual" ~ "Outpatient (virtual)",
encounter.class == "snf" ~ "Post-acute/facility",
encounter.class == "hospice" ~ "End-of-life",
TRUE ~ "Other/Unknown"
),
state = "California"
)encounters.ms.raw <- readr::read_csv( "./data/raw/miss/encounters.csv",
show_col_types = FALSE )
encounters.ms <- encounters.ms.raw %>%
dplyr::select(
encounter.id = Id,
patient.id = PATIENT,
encounter.start = START,
encounter.stop = STOP,
encounter.class = ENCOUNTERCLASS,
encounter.description = DESCRIPTION,
encounter.reason = REASONDESCRIPTION
) %>%
dplyr::mutate(
encounter.id = paste0( "MS:", encounter.id ),
patient.id = paste0( "MS:", patient.id ),
encounter.description.type =
stringr::str_extract(encounter.description, "(?<=\\()[^()]*(?=\\)\\s*$)"),
encounter.description =
stringr::str_remove( encounter.description, "\\s*\\([^()]*\\)\\s*$" ),
encounter.reason.type =
stringr::str_extract( encounter.reason, "(?<=\\()[^()]*(?=\\)\\s*$)" ),
encounter.reason =
stringr::str_remove( encounter.reason, "\\s*\\([^()]*\\)\\s*$" ),
encounter.group = dplyr::case_when(
encounter.class %in% c( "emergency", "urgentcare") ~ "ED visit",
encounter.class == "inpatient" ~ "Admissions",
encounter.class %in% c( "ambulatory", "outpatient" ) ~
"Outpatient (in-person)",
encounter.class == "wellness" ~ "Wellness",
encounter.class == "virtual" ~ "Outpatient (virtual)",
encounter.class == "snf" ~ "Post-acute/facility",
encounter.class == "hospice" ~ "End-of-life",
TRUE ~ "Other/Unknown"
),
state = "Mississippi"
)encounters.inter <- rbind( encounters.ca, encounters.ms )
encounters <- patients %>%
dplyr::select( patient.id, birth.date, gender ) %>%
dplyr::right_join( encounters.inter, by = "patient.id" ) %>%
dplyr::mutate(
encounter.age =
as.integer( as.numeric( difftime( as.Date( encounter.start ),
as.Date( birth.date ),
units = "weeks")) / 52.25),
encounter.age.range = dplyr::case_when(
encounter.age < 1 ~ "<1",
encounter.age < 5 ~ "1-4",
encounter.age < 13 ~ "5-12",
encounter.age < 18 ~ "13-17"
)
)
readr::write_csv( encounters, "./data/clean/encounters.csv")From the conditions.csv dataframe, we are only
interested in the encounter and patient’s ID, the start and end date of
the condition and its description.
The condition description string, in the format
DESCRIPTION (CONDITION TYPE), is going to be split in order
to have the type in a separate field.
Once more, it is important to know the patient’s age and age band at the condition record time. To retrieve this information, the same process as with the age at encounter time is going to be done. In addition, and in order to stratify the data later on, the patient’s gender is also going to be appended.
Given the nature of our data, we have to remove administrative noise. That is, we have to filter out records that aren’t true clinical conditions such as “medication review due”, housing status, or “situation” codes because these reflect administrative workflows rather than patient health.
conditions.ca.raw <- readr::read_csv( "./data/raw/california/conditions.csv",
show_col_types = FALSE )
conditions.ca <- conditions.ca.raw %>%
dplyr::select(
encounter.id = ENCOUNTER,
patient.id = PATIENT,
condition.start = START,
condition.stop = STOP,
condition.description = DESCRIPTION
) %>%
dplyr::mutate(
encounter.id = paste0( "CA:", encounter.id ),
patient.id = paste0( "CA:", patient.id ),
condition.type = stringr::str_extract( condition.description,
"(?<=\\()[^()]*(?=\\)\\s*$)" ),
condition.description = stringr::str_remove( condition.description,
"\\s*\\([^()]*\\)\\s*$" ),
state = "California"
) %>%
dplyr::filter(
!condition.type %in% c( "situation", "person"),
!condition.description
%in% c( "Risk activity involvement", "Homeless", "Died in hospice",
"Received higher education", "Full-time employment",
"Part-time employment", "Housing unsatisfactory",
"Educated to high school level", "Not in labor force",
"Limited social contact", "Reports of violence in the environment",
"Lack of access to transportation",
"Victim of intimate partner abuse", "Unemployed",
"Only received primary school education", "Transport problem",
"Has a criminal record")
)conditions.ms.raw <- readr::read_csv( "./data/raw/miss/conditions.csv",
show_col_types = FALSE )
conditions.ms <- conditions.ms.raw %>%
dplyr::select(
encounter.id = ENCOUNTER,
patient.id = PATIENT,
condition.start = START,
condition.stop = STOP,
condition.description = DESCRIPTION
) %>%
dplyr::mutate(
encounter.id = paste0( "MS:", encounter.id ),
patient.id = paste0( "MS:", patient.id ),
condition.type = stringr::str_extract( condition.description,
"(?<=\\()[^()]*(?=\\)\\s*$)" ),
condition.description = stringr::str_remove( condition.description,
"\\s*\\([^()]*\\)\\s*$" ),
state = "Mississippi"
) %>%
dplyr::filter(
!condition.type %in% c( "situation", "person"),
!condition.description
%in% c( "Risk activity involvement", "Homeless", "Died in hospice",
"Received higher education", "Full-time employment",
"Part-time employment", "Housing unsatisfactory",
"Educated to high school level", "Not in labor force",
"Limited social contact", "Reports of violence in the environment",
"Lack of access to transportation",
"Victim of intimate partner abuse", "Unemployed",
"Only received primary school education", "Transport problem",
"Has a criminal record")
)conditions.inter <- rbind( conditions.ca, conditions.ms )
conditions <- patients %>%
dplyr::select( patient.id, birth.date, gender ) %>%
dplyr::right_join( conditions.inter, by = "patient.id" ) %>%
dplyr::mutate(
condition.age =
as.integer( as.numeric( difftime( as.Date( condition.start ),
as.Date( birth.date ),
units = "weeks")) / 52.25),
condition.age.range = dplyr::case_when(
condition.age < 1 ~ "<1",
condition.age < 5 ~ "1-4",
condition.age < 13 ~ "5-12",
condition.age < 18 ~ "13-17"
)
)
readr::write_csv( conditions, "./data/clean/conditions.csv" )As our aim is to make comparisons between states across the different age bands, the first think to check is if the age bands distribution across states is similar or if, otherwise, it needs some normalization.
patients %>%
dplyr::mutate(
age.current.range = factor( age.current.range, levels = age.levels )
) %>%
dplyr::group_by( state, gender ) %>%
dplyr::count( state, gender, age.current.range, name = "counts" ) %>%
dplyr::mutate(
percentage = counts / sum(counts)
) %>%
dplyr::ungroup() %>%
ggplot( aes( x = age.current.range, y = percentage, fill = state )) +
geom_col( position = "dodge", color = "#FFFFFF", alpha = 0.85 ) +
facet_grid( ~ gender ) +
labs(
title = "Age Band Distribution Across States",
subtitle = "Percentage over 51,439 females and 49,299 males",
x = "",
y = "",
fill = ""
) +
scale_fill_manual( values = state.colors ) +
scale_y_continuous( labels = scales::percent_format( accuracy = 1 )) +
theme_minimal() +
theme(
panel.grid.major.x = element_blank(),
axis.text = element_text( size = 10 ),
strip.text = element_text( size = 11 ),
legend.text = element_text( size = 10 )
)From the plot, we can state that:
To further confirm there is no need to normalize data, we are going to perform a chi-square test of homogeneity within each gender.
patients %>%
dplyr::mutate(
age.current.range = factor( age.current.range, levels = age.levels )
) %>%
dplyr::group_by( gender ) %>%
dplyr::count( state, gender, age.current.range, name = "counts" ) %>%
dplyr::group_modify(~{
tab <- xtabs( counts ~ state + age.current.range, data = .x )
test <- chisq.test( tab )
tibble::tibble(
statistic = unname( test$statistic ),
df = unname( test$parameter ),
p_value = test$p.value
)
}) %>%
dplyr::ungroup()## # A tibble: 2 × 4
## gender statistic df p_value
## <chr> <dbl> <int> <dbl>
## 1 Female 0.145 3 0.986
## 2 Male 0.299 3 0.960
Given that both p-values are high (> 0.05 ) there is no evidence that age distributions differ by state, and therefore, these are comparable.
Another distribution we are going to look at is the patient’s race.
patients %>%
dplyr::mutate( race = factor( race, levels = race.levels )) %>%
dplyr::group_by( state ) %>%
dplyr::count( state, race, name = "counts" ) %>%
dplyr::mutate( percentage = counts / sum( counts )) %>%
dplyr::ungroup() %>%
ggplot( aes( x = race, y = percentage, fill = state )) +
geom_col( position = "dodge", color = "#FFFFFF", alpha = 0.85 ) +
labs(
title = "Race Distribution Across States",
subtitle = "Percentage over 100,738 patients",
x = "",
y = "",
fill = ""
) +
scale_fill_manual( values = state.colors ) +
scale_y_continuous( labels = scales::percent_format( accuracy = 1 )) +
theme_minimal() +
theme(
panel.grid.major.x = element_blank(),
axis.text = element_text( size = 10 ),
legend.text = element_text( size = 10 ),
legend.position = "inside",
legend.position.inside = c( 0.85, 0.9 ),
legend.background = element_rect( color = "#FFFFFF")
)From this plot we can see that California is substantially more
White, whereas Mississippi is substantially more White and Black. The
remaining categories are small in both states with California holding
the highest percentage. We could re-plot it grouping the three smallest
races into a single other category for a less cluttered
visualization.
patients %>%
dplyr::mutate( race =
dplyr::if_else( race %in% c( "hawaiian", "native", "other" ),
"other",
race ),
race = factor( race, levels = race.levels )) %>%
dplyr::group_by( state ) %>%
dplyr::count( state, race, name = "counts" ) %>%
dplyr::mutate( percentage = counts / sum( counts )) %>%
dplyr::ungroup() %>%
ggplot( aes( x = race, y = percentage, fill = state )) +
geom_col( position = "dodge", color = "#FFFFFF", alpha = 0.85, width = 0.7 ) +
labs(
title = "Race Distribution Across States",
subtitle = "Percentage over 100,738 patients",
x = "",
y = "",
fill = ""
) +
scale_fill_manual( values = state.colors ) +
scale_y_continuous( labels = scales::percent_format( accuracy = 1 )) +
theme_minimal() +
theme(
panel.grid.major.x = element_blank(),
axis.text.y = element_text( size = 10 ),
axis.text.x = element_text( size = 11 ),
legend.text = element_text( size = 10 ),
legend.position = "inside",
legend.position.inside = c( 0.85, 0.9 ),
legend.background = element_rect( color = "#FFFFFF")
)The difference is large enough that un-adjusted state comparisons could reflect race-mix differences rather than “state-pattern differences”.
The next step is to look “where” this difference within states sits. That is, between ages and genders.
patients %>%
dplyr::mutate( race =
dplyr::if_else( race %in% c( "hawaiian", "native", "other" ),
"other",
race ),
race = factor( race, levels = race.levels ),
age.current.range = factor( age.current.range, levels = age.levels )) %>%
dplyr::group_by( state, gender, age.current.range ) %>%
dplyr::count( state, race, gender, age.current.range, name = "counts" ) %>%
dplyr::mutate( percentage = counts / sum( counts )) %>%
dplyr::ungroup() %>%
ggplot( aes( x = race, y = percentage, fill = state )) +
geom_col( position = "dodge", color = "#FFFFFF", alpha = 0.85 ) +
facet_grid( gender ~ age.current.range ) +
labs(
title = "Race Distribution Across States, Genders, and Age Bands",
x = "",
y = "",
fill = ""
) +
scale_fill_manual( values = state.colors ) +
scale_y_continuous( labels = scales::percent_format( accuracy = 1 )) +
theme_minimal() +
theme(
legend.position = "bottom",
panel.grid.major.x = element_blank(),
strip.text = element_text( size = 9 ),
axis.text = element_text( size = 9 ),
legend.text = element_text( size = 9 ),
)In this plot we can see that state differences are consistent across age bands and gender, which strengthens the conclusion that this is a systematic compositional difference, not an artifact of one subgroup.
Even if race is not our main lens, it is a potential confounder. If utilization, morbidity or mortality differs by race in the Synthea simulation, then differences between states in race composition can shift the overall rates even when within-race care patterns are identical.
For our purpose, we will treat race as a cohort description and, if necessary, we will later on perform a sensitivity check. Making race a formal stratifier in every primary plot will explode the number of panels, increase small-cell instability, and dilute our core message making the visualization harder to interpret. Using race as a sensitivity check is the most efficient way to show that we are not mistaking compositional differences for operational ones.
As our target audience is a clinical operational team, this section aims to explore which service lines are carrying the highest load, where does the operational load concentrates by age, and whether the patterns (if any) are consistent across states.
encounters %>%
dplyr::group_by( state, encounter.age.range ) %>%
dplyr::count(state, encounter.age.range, encounter.group, name = "counts" ) %>%
dplyr::mutate( percentage = counts / sum( counts )) %>%
dplyr::ungroup() %>%
ggplot( aes( x = encounter.age.range, y = percentage, fill = encounter.group )) +
geom_col( position = "dodge", color = "#FFFFFF", alpha = 0.85 ) +
facet_grid( ~ state ) +
labs(
title = "Service Line Load Distribution Across States and Age Bands",
x = "",
y = "",
fill = ""
) +
scale_y_continuous( labels = scales::percent_format( accuracy = 1 )) +
theme_minimal() +
theme(
legend.position = "bottom",
panel.grid.major.x = element_blank(),
strip.text = element_text( size = 9 ),
axis.text = element_text( size = 9 ),
legend.text = element_text( size = 9 ),
)From this plot, we can see that there are three main encounter types: wellness, in-person outpatients and ED visits.
encounters %>%
dplyr::filter(
encounter.group %in% c( "Wellness", "Outpatient (in-person)", "ED visit")
) %>%
dplyr::group_by( state, encounter.age.range ) %>%
dplyr::count(state, encounter.age.range, encounter.group, name = "counts" ) %>%
dplyr::mutate(
percentage = counts / sum( counts ),
encounter.age.range = factor( encounter.age.range, levels = age.levels )
) %>%
dplyr::ungroup() %>%
ggplot( aes( x = encounter.age.range, y = percentage, fill = state )) +
geom_col( position = "dodge", color = "#FFFFFF", alpha = 0.85 ) +
facet_grid( ~ encounter.group ) +
labs(
title = "Major Service Line Distribution Across States and Age Bands",
subtitle = "Within-age composition of service lines",
caption = "For each state x age band, ED + Outpatient + Wellness = 100%",
x = "",
y = "",
fill = ""
) +
scale_fill_manual( values = state.colors ) +
scale_y_continuous( labels = scales::percent_format( accuracy = 1 )) +
theme_minimal() +
theme(
panel.grid.major.x = element_blank(),
strip.text = element_text( size = 10 ),
axis.text = element_text( size = 10 ),
legend.text = element_text( size = 10 ),
)In this simplified plot, the first thing to notice is that state differences are minimal; California and Mississippi bars are nearly overlapping in each age band and service line, implying similar service-line mix. Regarding the difference between service line, we can see that the ED share is very small in all pediatric age bands. Wellness dominates in infants (<1), then declines with age, and outpatient increases with age and becomes the dominant share in adolescents (13-17).
Overall, this plot pictures the pediatric landscape you would likely find if you were to use real data.
The next thing to explore is the encounter rate. We will focus on the three main service lines (Wellness, Outpatient (in-person), and ED visits).
For this, we decided to compute the encounter rate per 1,000 patient-years, being the patient-years the sum of overlap time across all contributing patients in that band:
\[ \displaystyle PY(state,gender,age\_band) = \sum_{i\in patients} \frac{overlap\_days_{i,band}}{365.25} \]
Where \(overlap\_days_{i,band}\) is the overlap between the patient’s observation window and the age band interval.
Then:
\[ rate_{1,000\_PY} = 1000 \cdot \frac{n_{encounters}}{PY} \]
encounters.end <- max( as.Date( encounters$encounter.stop ),
as.Date( encounters$encounter.start ),
na.rm = TRUE )
age.bands <- tibble::tibble(
encounter.age.range = c( "<1", "1-4", "5-12", "13-17" ),
start.yr = c( 0, 1, 5, 13 ),
end.yr = c( 1, 5, 13, 18 )
)
denoms <- encounters %>%
dplyr::group_by( patient.id ) %>%
dplyr::summarise(
state = first( state ),
gender = first( gender ),
birth.date = first( birth.date ),
.groups = "drop"
) %>%
dplyr::mutate(
obs.start = birth.date,
obs.end = encounters.end
) %>%
tidyr::crossing( age.bands ) %>%
dplyr::mutate(
band.start = birth.date %m+% lubridate::years( start.yr ),
band.end = birth.date %m+% lubridate::years( end.yr ),
overlap.start = pmax( obs.start, band.start ),
overlap.end = pmin( obs.end, band.end ),
days = as.numeric( overlap.end - overlap.start ),
days = dplyr::if_else( is.na( days ) | days < 0, 0, days ),
patient.years = days / 365.25
) %>%
dplyr::group_by( state, gender, encounter.age.range ) %>%
dplyr::summarise(
patient.years = sum( patient.years ),
.groups = "drop"
)
nums <- encounters %>%
dplyr::count(
state, gender, encounter.age.range, encounter.group, name = "n.encounters"
)
rates.1000.py <- nums %>%
dplyr::left_join(
denoms, by = c( "state", "gender", "encounter.age.range" )
) %>%
dplyr::filter( patient.years > 0 ) %>%
dplyr::mutate(
rate.1000.py = 1000 * n.encounters / patient.years,
)rates.1000.py %>%
dplyr::mutate(
encounter.age.range = factor( encounter.age.range, levels = age.levels ),
) %>%
dplyr::filter(
encounter.group %in% c( "Wellness", "Outpatient (in-person)", "ED visit" )
) %>%
ggplot( aes(x = encounter.age.range,
y = rate.1000.py,
color = state,
group = state )
) +
geom_line( linewidth = 0.8, alpha = 0.7 ) +
geom_point( size = 2.5, alpha = 0.7 ) +
facet_grid( gender ~ encounter.group ) +
labs(
title = "Encounter Rates per 1,000 patient-years (log scale) by Age at Encounter",
x = "",
y = "",
color = ""
) +
scale_color_manual( values = state.colors ) +
scale_y_log10() +
theme_minimal() +
theme(
legend.position = "bottom",
legend.text = element_text( size = 10 ),
panel.border = element_rect(color = "#A9A9A9", fill = NA),
axis.text = element_text( size = 10 ),
strip.text = element_text( size = 11 )
)Across all three service lines, age band is the dominant driver of rate differences, while between-state differences appear small in absolute terms. However, because California and Mississippi curves are so close, this figure is not ideal for answering the “difference” question directly. In addition, this plot does not explicitly show statistical uncertainty from finite event counts.
Because the state-specific rate curves largely overlap, we are going to summarize between-state differences directly using “Mississippi-to-California” rate ratios with 95% confidence intervals under a poisson count model.
rates.1000.py %>%
dplyr::select(
state, gender, encounter.age.range,
encounter.group, n.encounters, patient.years
) %>%
tidyr::pivot_wider(
names_from = state,
values_from = c(n.encounters, patient.years),
) %>%
dplyr::rename(
n.CA = n.encounters_California,
T.CA = patient.years_California,
n.MS = n.encounters_Mississippi,
T.MS = patient.years_Mississippi
) %>%
dplyr::mutate(
rr = ( n.MS / T.MS ) / ( n.CA / T.CA ),
se.log.rr = sqrt( 1 / n.MS + 1 / n.CA ),
rr.ci.low = exp( log( rr ) - 1.96 * se.log.rr ),
rr.ci.high = exp( log( rr ) + 1.96 * se.log.rr ),
encounter.age.range = factor( encounter.age.range, levels = age.levels )
) %>%
dplyr::filter(
encounter.group %in% c( "Wellness", "Outpatient (in-person)", "ED visit" )
) %>%
ggplot( aes( x = encounter.age.range, y = rr, color = gender )) +
annotate( "rect",
xmin = -Inf, xmax = Inf,
ymin = 0.95, ymax = 1.05,
alpha = 0.03 ) +
geom_hline( yintercept = 1,
linewidth = 0.5,
linetype = "dashed",
color = "#696969" ) +
geom_linerange( aes( ymin = rr.ci.low, ymax = rr.ci.high ),
linewidth = 0.6, position = position_dodge( width = 0.5 )) +
geom_point( size = 2, position = position_dodge( width = 0.5 )) +
facet_grid(~ encounter.group ) +
scale_y_log10() +
labs(
title = "Rate ratio (Mississippi / California); log scale (95% CI)",
x = "",
y = "",
color = ""
) +
scale_color_manual( values = gender.colors ) +
theme_minimal() +
theme(
legend.position = "bottom",
legend.text = element_text( size = 10 ),
axis.text = element_text( size = 10 ),
strip.text = element_text( size = 11 )
)Across all three service lines, the point estimates cluster very tightly around a rate ratio = 1. This indicate that, once standardized by patient-years and stratified by age band and gender, utilization intensity is broadly similar in Mississippi and California.
In ED visits, we see no consistent state difference and have the highest uncertainty, specially for infants (<1) and teenagers (13-17).
Wellness rate ratios are extremely close to 1 across all age bands and both genders, with relatively tight CIs. Therefore, we can state that the intensity is essentially the same in both states.
On the contrary, the outpatient panel is the only one where a patterned deviation appears; female’s rate ratio is slightly above 1 in younger bands, then very close to 1 in the older ones. Males, on the other hand, behaves in the opposite way. However, these departures are still modest and there is not a large between-state divergence.
In this section we seek to analyze pediatric conditions across California and Mississippi. Our goal is straightforward: we want to understand the baseline burden ( i.e. conditions that are common everywhere) and identify material differences ( i.e. conditions that might drive different utilization patterns).
We will compute the prevalence per 1,000 patients as the ratio between the number of cases for a given condition stratified by state, age band and gender, and the total number of patients in that stratum multiplied by 1,000.
To ensure statistical robustness, we are going to apply two key rules to our prevalence calculation:
NOTE: A minor limitation exists due to right-censoring. A child currently aged 2 contributes to the 1-4 denominator but has not yet lived through ages 3 and 4. If they were to develop a condition at age 3, our current snapshot would miss it. This may slightly underestimate prevalence in younger cohorts but affects both states equally.
condition.denom <- patients %>%
dplyr::select( patient.id, state, gender, age.current.range ) %>%
dplyr::mutate(
is.denom.less_1 = TRUE,
is.denom.1_4 = age.current.range %in% c("1-4", "5-12", "13-17"),
is.denom.5_12 = age.current.range %in% c("5-12", "13-17"),
is.denom.13_17 = age.current.range == "13-17"
) %>%
tidyr::pivot_longer(
cols = starts_with( "is.denom." ),
names_to = "condition.age.range",
values_to = "is.eligible"
) %>%
dplyr::filter( is.eligible ) %>%
dplyr::mutate(
condition.age.range = dplyr::case_when(
condition.age.range == "is.denom.less_1" ~ "<1",
condition.age.range == "is.denom.1_4" ~ "1-4",
condition.age.range == "is.denom.5_12" ~ "5-12",
condition.age.range == "is.denom.13_17" ~ "13-17"
)
) %>%
dplyr::group_by( state, gender, condition.age.range ) %>%
dplyr::summarise(
total.patients = n(),
.groups = "drop"
)
condition.nums <- conditions %>%
dplyr::group_by(
state, gender, condition.age.range, condition.description
) %>%
dplyr::summarise(
n.cases = dplyr::n_distinct( patient.id ),
.groups = "drop"
)
condition.prevalence <- condition.nums %>%
dplyr::left_join(
condition.denom,
by = c( "state", "gender", "condition.age.range")
) %>%
dplyr::mutate(
n.cases = tidyr::replace_na( n.cases, 0 ),
prevalence = n.cases / total.patients,
prevalence.1000 = prevalence * 1000
)First, we are going to visualize the highest prevalence pr 1,000 patients on a table in order to understand the magnitudes we are working on.
A difference of 5 cases/1,000 is negligible if the base rate is 800/1,000 (common cold), but critical if the base rate is 10/1,000 (leukemia). This table anchors our analysis in absolute volume.
In addition, the table reveals a clear insight: in terms of sheer volume, dental health is the primary driver of recorded conditions for children over 5 years old. Gingivitis appears as the top condition for age groups 5-12 and 13-17, with prevalence rates exceeding 500 per 1,000 patients (over 50% of the population).
Crucially, this high burden is symmetrical across states ( e.g., Gingivitis in teens is ~549/1000 in California and ~546/1000 in Mississippi). This confirms that “maintenance” diagnoses are universal and not driven by geography.
Now that the magnitude is understood, a diverging dot plot is going to be used to isolate the gap between states, allowing us to see directional patterns without the visual clutter of absolute bars.
top.5.condition.prevalence <- condition.prevalence %>%
dplyr::group_by( condition.age.range, condition.description ) %>%
dplyr::summarise(
total.burden = sum( prevalence.1000 ),
.groups = "drop"
) %>%
dplyr::group_by( condition.age.range ) %>%
dplyr::slice_max( order_by = total.burden, n = 5 ) %>%
dplyr::select( condition.age.range, condition.description )
condition.prevalence %>%
dplyr::inner_join( top.5.condition.prevalence,
by = c( "condition.age.range", "condition.description" )
) %>%
dplyr::select(
state, gender, condition.age.range, condition.description, prevalence.1000
) %>%
tidyr::pivot_wider(
names_from = state,
values_from = prevalence.1000
) %>%
dplyr::mutate(
prevalence.diff = California - Mississippi,
condition.age.range = factor( condition.age.range, levels = age.levels )
) %>%
ggplot( aes( x = prevalence.diff,
y = reorder( condition.description, prevalence.diff) )) +
geom_vline( xintercept = 0, linetype = "dashed", color = "#696969" ) +
geom_point( aes( color = prevalence.diff > 0 ), size = 3 ) +
facet_grid( condition.age.range ~ gender, scales = "free_y", space = "free_y" ) +
scale_color_manual(values = state.diff.colors,
labels = c( "TRUE" = "Higher in CA",
"FALSE" = "Higher in MS" )) +
labs(
title = "Condition Prevalence Difference (California - Mississippi)",
subtitle = "Difference in cases per 1,000 population (Top 5 conditions)",
x = "",
y = "",
color = "Direction"
) +
theme_minimal() +
theme(
legend.position = "bottom",
panel.border = element_rect(color = "#A9A9A9", fill = NA),
axis.text = element_text( size = 10 ),
strip.text = element_text( size = 11 )
)While most age-gender panels show high similarity (points clustering around the zero line), the stratified view reveals that the largest and most consistent directional shifts are concentrated at the extremes of the pediatric age range, rather than being uniform across childhood.
In the infant band (<1), the plotted conditions are consistently shifted toward the Mississippi side (negative California - Mississippi differences). This indicates a directional tendency for common early-childhood diagnoses to be more prevalent among Mississippi infants, even if the absolute differences are small.
Moving into ages 1-4 and 5-12, the dots generally compress toward zero and the direction becomes mixed, suggesting that the broad burden of common pediatric conditions is largely comparable across states once children age out of infancy. Where differences do appear, these are condition-specific rather than systematic.
In the teenage band, the plot shifts from “near-identical” to “selectively divergent”. Here differences are larger in magnitude in a few strata, particularly among males, indicating that the state contrast can be pronounced for specific condition categories in older adolescents.
Overall, we can conclude that the general pediatric morbidity profile is broadly similar between California and Mississippi and the broader pediatric burden appears comparable.
Multi-morbidity is a practical proxy for care complexity: higher condition burden is typically associates with more frequent ED utilization, higher admission probability, longer inpatient length of stay, and higher coordination needs. In this section we seek to quantify how pediatric patients distribute across “condition-burden” strata by state, age band, and gender.
We will define “patient-level multi-morbidity” as the number of distinct conditions ever recorded for each patient, treating multiple occurrences of the same condition as one to avoid artifacts. This “ever-recorded distinct condition” measure mixes acute and chronic diagnoses. For the first-pass exploratory data analysis that is not an issue, but we will need to compute chronic-only multi-morbididty, declaring as “chronic” those conditions that do not have a “stop” date and/or has a chronic condition.
Even if we are going to use the age at diagnosis time, for this analysis we are going to only use those patients who are not deceased. We decided to proceed as such
To obtain core patient-level multi-morbidity measures, for each patient we will compute:
NOTE: The following vector of chronic conditions ONLY apply to out particular dataset. If you generate your own data using Synthea, you will need to ensure that all your chronic diseases are listed here, or otherwise add them to the vector.
chronic.conditions <- c(
"Child attention deficit disorder",
"Chiari malformation type II",
"Meningomyelocele",
"Congenital deformity of foot",
"Scoliosis deformity of spine",
"Kyphosis deformity of spine",
"Epilepsy",
"Cerebral palsy",
"Acute myeloid leukemia",
"Mitral valve stenosis",
"Intellectual disability",
"Dystonia",
"Chronic kidney disease stage 1",
"Chronic kidney disease stage 2",
"Osteoporosis",
"Disorder of kidney due to diabetes mellitus",
"Microalbuminuria due to type 2 diabetes mellitus",
"Malignant neoplasm of breast",
"Meningocele",
"Human immunodeficiency virus infection",
"Male infertility due to cystic fibrosis",
"Cystic fibrosis",
"Diabetes mellitus due to cystic fibrosis",
"Diabetes mellitus type 2",
"Chronic hepatitis C",
"Chronic paralysis due to lesion of spinal cord",
"Female infertility due to cystic fibrosis",
"Chronic type B viral hepatitis",
"Spina bifida occulta",
"Chronic kidney disease stage 3",
"Chronic kidney disease stage 4",
"Tricuspid valve stenosis",
"Heart failure",
"Congenital uterine anomaly"
)burden.ever <- conditions %>%
dplyr::filter(
patient.id %in% ( patients %>% dplyr::filter( is.alive == 1 ))$patient.id
) %>%
dplyr::group_by( patient.id ) %>%
dplyr::summarise(
n.conditions.ever = dplyr::n_distinct( condition.description ),
.groups = "drop"
)
burden.active <- conditions %>%
dplyr::filter(
patient.id %in% ( patients %>% dplyr::filter( is.alive == 1 ))$patient.id,
is.na( condition.stop ) | condition.description %in% chronic.conditions
) %>%
dplyr::group_by( patient.id ) %>%
dplyr::summarise(
n.conditions.active = dplyr::n_distinct( condition.description ),
.groups = "drop"
)
multimorbidity <- patients %>%
dplyr::select( patient.id, state, gender, age.current.range ) %>%
dplyr::left_join( burden.ever, by = "patient.id" ) %>%
dplyr::left_join( burden.active, by = "patient.id" ) %>%
dplyr::mutate(
n.conditions.active = tidyr::replace_na( n.conditions.active, 0 ),
n.conditions.ever = tidyr::replace_na( n.conditions.ever, 0 )
)## n.conditions.ever n.conditions.active
## Min. : 0.000 Min. : 0.0000
## 1st Qu.: 2.000 1st Qu.: 0.0000
## Median : 5.000 Median : 0.0000
## Mean : 5.323 Mean : 0.4667
## 3rd Qu.: 8.000 3rd Qu.: 1.0000
## Max. :26.000 Max. :13.0000
Looking at the summary, we can state that
n.conditions.active shows a plausible shape for an
“active-at-snapshot” definition in pediatrics: most children have 0
currently active or chronic conditions but a meaningful subset has at
least 1 active or chronic condition, with a small tail up to 13. When
looking at the conditions for those patients with at least 5 different
active conditions, we observed that 289 out of the 577 patients had
cerebral palsy; this chronic neurological disorder has several symptoms
such as poor muscle tone, intellectual disability or spasticity among
others. All these “findings” are counted towards the total amount of
active conditions. Acknowledging this as an issue to properly assess
multi-morbidity, any condition of the type finding is going
to be removed. We are well aware that this could lead to miss some
conditions, but we consider that having inflated values is a worst
approach.
burden.ever <- conditions %>%
dplyr::filter(
patient.id %in% ( patients %>% dplyr::filter( is.alive == 1 ))$patient.id,
condition.type != "finding"
) %>%
dplyr::group_by( patient.id ) %>%
dplyr::summarise(
n.conditions.ever = dplyr::n_distinct( condition.description ),
.groups = "drop"
)
burden.active <- conditions %>%
dplyr::filter(
patient.id %in% ( patients %>% dplyr::filter( is.alive == 1 ))$patient.id,
is.na( condition.stop ) | condition.description %in% chronic.conditions,
condition.type != "finding"
) %>%
dplyr::group_by( patient.id ) %>%
dplyr::summarise(
n.conditions.active = dplyr::n_distinct( condition.description ),
.groups = "drop"
)
multimorbidity <- patients %>%
dplyr::select( patient.id, state, gender, age.current.range ) %>%
dplyr::left_join( burden.ever, by = "patient.id" ) %>%
dplyr::left_join( burden.active, by = "patient.id" ) %>%
dplyr::mutate(
n.conditions.active = tidyr::replace_na( n.conditions.active, 0 ),
n.conditions.ever = tidyr::replace_na( n.conditions.ever, 0 )
)## n.conditions.ever n.conditions.active
## Min. : 0.000 Min. : 0.0000
## 1st Qu.: 2.000 1st Qu.: 0.0000
## Median : 5.000 Median : 0.0000
## Mean : 4.867 Mean : 0.4381
## 3rd Qu.: 7.000 3rd Qu.: 1.0000
## Max. :22.000 Max. :11.0000
The summary statistics did not change much, but we are more confident about its interpretation.
As before, at least 50% of the pediatric patients have no active conditions and at least 25% have at least 1 with a meaningful but not extreme tail up to 11 suggesting a small subset with substantial current complexity.
Regarding the number of distinct ever-recorded conditions
(n.conditions.ever), at least 25% of pediatric patients
have had between 0 and 2 conditions (under our previous definition of
“clinical condition”). The right tail extends to 22, implying a smaller
high-burden segment.
In order to better visualize and understand the data, the number of conditions are going to be binned into clinically interpretable cohorts:
n.conditions.ever), the bins
are going to be 0, 1, 2-3, 4-5, 6-9, 10+n.conditions.active), the
bins are going to be 0, 1, 2, 3-4, 5+multimorbidity.bins <- multimorbidity %>%
dplyr::mutate(
bin.conditions.ever = dplyr::case_when(
n.conditions.ever == 0 ~ "0",
n.conditions.ever == 1 ~ "1",
n.conditions.ever < 4 ~ "2-3",
n.conditions.ever < 6 ~ "4-5",
n.conditions.ever < 10 ~ "6-9",
n.conditions.ever >= 10 ~ "10+"
),
bin.conditions.active = dplyr::case_when(
n.conditions.active == 0 ~ "0",
n.conditions.active == 1 ~ "1",
n.conditions.active == 2 ~ "2",
n.conditions.active < 5 ~ "3-4",
n.conditions.active >= 5 ~ "5+"
)
) multimorbidity.bins %>%
dplyr::count(
state, age.current.range, bin.conditions.active,
name = "counts" ) %>%
dplyr::group_by( state, age.current.range ) %>%
dplyr::mutate(
percentage = counts / sum( counts ),
age.current.range = factor( age.current.range, levels = age.levels ),
bin.conditions.active =
factor( bin.conditions.active, levels = bin.active.levels )
) %>%
ungroup() %>%
ggplot(
aes(
x = bin.conditions.active,
y = percentage,
color = state,
group = state
)
) +
geom_line(linewidth = 0.9) +
geom_point(size = 2.2) +
facet_grid(~age.current.range) +
labs(
title = "Active Condition Distribution Across Age Bands",
subtitle = "Percent distribution of active-condition bins within each state × age band",
x = "",
y = "",
color = ""
) +
scale_color_manual(values = state.colors) +
scale_y_continuous(labels = percent_format(accuracy = 1)) +
theme_minimal() +
theme(
panel.grid.major.x = element_blank(),
legend.position = "bottom",
axis.text = element_text(size = 10),
strip.text = element_text(size = 11),
legend.text = element_text(size = 10)
)The first thing to notice is that in the age band 5-12, the distribution is exactly the same. This odd result has to be further inspected before taking any conclusion.
In general, in every age band, we can see that the distribution is front-loaded; the largest share of patients sits in 0 or 1 active conditions. Percentages are really similar except for infants and teenagers, where we can see the highest between-state differences.
multimorbidity.bins %>%
dplyr::count(
state, age.current.range, bin.conditions.ever,
name = "counts" ) %>%
dplyr::group_by( state, age.current.range ) %>%
dplyr::mutate(
percentage = counts / sum( counts ),
age.current.range = factor( age.current.range, levels = age.levels ),
bin.conditions.ever =
factor( bin.conditions.ever, levels = bin.ever.levels )
) %>%
ungroup() %>%
ggplot(
aes(
x = bin.conditions.ever,
y = percentage,
color = state,
group = state
)
) +
geom_line(linewidth = 0.9) +
geom_point(size = 2.2) +
facet_grid(~age.current.range) +
labs(
title = "Condition Distribution Across Age Bands",
subtitle =
"Percent distribution of ever-condition bins within each state × age band",
x = "",
y = "",
color = ""
) +
scale_color_manual(values = state.colors) +
scale_y_continuous(labels = percent_format(accuracy = 1)) +
theme_minimal() +
theme(
panel.grid.major.x = element_blank(),
legend.position = "bottom",
axis.text = element_text(size = 10),
strip.text = element_text(size = 11),
legend.text = element_text(size = 10)
)When looking at the amount of conditions ever diagnosed, lines overlap even more than with the conditions currently active with the younger age bands showing the only apparent differences.
Given the huge similarities, we decided to move on and examine the tail concentration which measures inequality in disease burden in order to answer which percentage of the total diagnosis volume is accounted for by the sickest 5% of the pediatrics patients.
This table reveals that care complexity is significantly more concentrated in the active phase, with the top 5% of patients accounting for between ~28% - 89% of the active burden compared to only between ~9% - 38% of the historical burden. Moreover, highest share burden is seen in the younger age bands decreasing as age increases. The difference between states appear not to be relevant except, maybe, for the on-going conditions in infants (<1).
As a final metric prior to perform statistical comparisons, we want to explore the shape comparisons across state using the empirical cumulative distribution function (ECDF). The ECDF can be described as the “fingerprint” of the population. If the lines for California and Mississippi perfectly overlap, the populations are identical in complexity. If one line is “lower” than the other, that state has a heavier burden (more patients with higher counts).
multimorbidity %>%
ggplot( aes( x = n.conditions.ever, color = state, group = state )) +
stat_ecdf( geom = "step", linewidth = 0.5, alpha = 0.6 ) +
facet_grid( gender ~ age.current.range ) +
labs(
title = "Cumulative Distribution of Condition Burden (Ever)",
y = "Cumulative Probability (Pct of patients < X)",
x = "Number of Distinct Conditions",
color = ""
) +
scale_color_manual( values = state.colors ) +
theme_minimal() +
theme(
panel.grid.major.x = element_blank(),
legend.position = "bottom",
axis.text = element_text(size = 10),
strip.text = element_text(size = 9),
legend.text = element_text(size = 10)
)The nearly perfect overlap of the California and Mississippi curves across every facet visually confirms that state of residence has no impact on the distribution of condition burden. The primary signal here is demographic rather than geographic: as we look at older age groups (particularly moving to the 13-17 band), the curves shift significantly to the right and flatten out, illustrating how the “ever-recorded” count naturally accumulates over a patient’s lifetime regardless of their location.
To conclude with this section, we are going to perform a negative binomial regression. We are going to use the negative binomial instead of Poisson because condition counts usually have high variance (“overdispersion”).
nb.model <- MASS::glm.nb(
n.conditions.ever ~ state * age.current.range + state * gender,
data = multimorbidity )
summary(nb.model)##
## Call:
## MASS::glm.nb(formula = n.conditions.ever ~ state * age.current.range +
## state * gender, data = multimorbidity, init.theta = 33.74609672,
## link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.291739 0.036321 -35.565 <2e-16
## stateMississippi 0.077363 0.050418 1.534 0.125
## age.current.range1-4 1.991019 0.036893 53.967 <2e-16
## age.current.range13-17 3.224605 0.036426 88.525 <2e-16
## age.current.range5-12 3.016732 0.036383 82.916 <2e-16
## genderMale -0.003425 0.004370 -0.784 0.433
## stateMississippi:age.current.range1-4 -0.071741 0.051245 -1.400 0.162
## stateMississippi:age.current.range13-17 -0.072632 0.050573 -1.436 0.151
## stateMississippi:age.current.range5-12 -0.081860 0.050512 -1.621 0.105
## stateMississippi:genderMale 0.002584 0.006180 0.418 0.676
##
## (Intercept) ***
## stateMississippi
## age.current.range1-4 ***
## age.current.range13-17 ***
## age.current.range5-12 ***
## genderMale
## stateMississippi:age.current.range1-4
## stateMississippi:age.current.range13-17
## stateMississippi:age.current.range5-12
## stateMississippi:genderMale
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(33.7461) family taken to be 1)
##
## Null deviance: 218019 on 100737 degrees of freedom
## Residual deviance: 113065 on 100728 degrees of freedom
## AIC: 435427
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 33.75
## Std. Err.: 1.19
##
## 2 x log-likelihood: -435405.15
This regression backs up our earlier findings. Even when we count the exact number of conditions for every patient, there is no difference between Mississippi and California. The only significant driver of higher condition counts is age given that older children naturally accumulate more diagnoses over time, regardless of where they live.
In this final section, we are going to focus on all-cause pediatric mortality across states. Unlike other sections where we looked at the patient’s cumulative history, this analysis is a snapshot of the current status.
patients %>%
dplyr::filter( is.alive == 0 ) %>%
dplyr::mutate(
age.current.range = factor( age.current.range, levels = age.levels )
) %>%
ggplot( aes( x = age.current.range, fill = state )) +
geom_bar( position = "dodge", color = "#FFFFFF", alpha = 0.85 ) +
labs(
title = "Mortality Distribution (Total Counts) Across States and Genders",
x = "",
y = "",
color = ""
) +
facet_grid( ~ gender ) +
ylim( 0, 150 ) +
scale_fill_manual(values = state.colors ) +
theme_minimal() +
theme(
panel.grid.major.x = element_blank(),
axis.text = element_text(size = 10),
strip.text = element_text(size = 11),
legend.text = element_text(size = 10)
)## Ignoring unknown labels:
## • colour : ""
At a first glance, the total counts appear pretty similar across states, genders and age bands except for males in California between 1 and 4 years old. In that age band, we find a difference of ~25 deceased males that could lead to distorted results given the small stratum sizes.
Pediatric mortality is a rare event. This poses a statistical challenge, as standard methods can be misleading when death counts are very low. To address this, we will use the Exact Binomial Test (Clopper-Pearson interval) to calculate confidence intervals. This method is conservative and ensures we do not overestimate the certainty of our rates in strata with small sample sizes.
mortality <- patients %>%
dplyr::group_by( state, gender, age.current.range ) %>%
dplyr::summarise(
n.total = n(),
n.deaths = sum( is.alive == 0 ),
.groups = "drop"
) %>%
dplyr::rowwise() %>%
dplyr::mutate(
binom = list( binom.test( n.deaths, n.total )),
rate.percent = binom$estimate * 100,
ci.low.percent = binom$conf.int[1] * 100,
ci.high.percent = binom$conf.int[2] * 100
) %>%
dplyr::ungroup() %>%
dplyr::select( -binom )mortality %>%
dplyr::mutate(
age.current.range = factor( age.current.range, levels = age.levels )
) %>%
ggplot( aes( x = rate.percent,
y = age.current.range,
color = state,
group = state )) +
geom_pointrange(
aes( xmin = ci.low.percent, xmax = ci.high.percent ),
position = position_dodge( width = 0.6 ),
size = 0.8
) +
geom_text(
aes( label = sprintf("%.2f%%", rate.percent ),
vjust = ifelse(state == "California", 2, -1.5)),
position = position_dodge( width = 0.6 ),
size = 3.5,
fontface = "bold",
show.legend = FALSE
) +
facet_grid( ~ gender ) +
labs(
title = "All-Cause Pediatric Mortality Rates (%)",
subtitle = "Percentage of deceased patients; 95% Exact Binomial CI.",
x = "",
y = "Age Group",
color = ""
) +
scale_colour_manual( values = state.colors ) +
theme_minimal() +
theme(
panel.grid.minor = element_blank(),
legend.position = "bottom",
axis.text = element_text(size = 10),
strip.text = element_text(size = 11),
panel.border = element_rect(color = "#A9A9A9", fill = NA),
legend.text = element_text(size = 10)
)Strictly speaking, the 95% exact binomial confidence intervals for California and Mississippi overlap across all age–gendere strata, so the plot alone does not provide clear visual separation between states. Nevertheless, the point estimates reveal meaningful variation by age and, in one case, a notable state contrast.
In males aged 1–4, California’s mortality estimate (2.30%; 130/5644) exceeds Mississippi’s (1.75%; 96/5474), an absolute difference of 0.55 percentage points ( \(\approx\) 31% relative to Mississippi). Because this comparison is supported by substantial event counts and similar denominators in both states, the high rate in California is unlikely to be driven by sparse-data instability, although.
More broadly, the dominant signal is the age pattern: contrary to typical infant mortality expectations, the highest mortality burden in this synthetic dataset is concentrated in the 1–4 age band ( \(\approx\) 1.65%–2.30%), while the lowest burden is consistently observed in adolescents aged 13–17, where rates remain below 0.20% in both states and genders.
## R version 4.5.2 (2025-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Arch Linux
##
## Matrix products: default
## BLAS: /usr/lib/libblas.so.3.12.0
## LAPACK: /usr/lib/liblapack.so.3.12.0 LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=es_ES.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=es_ES.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=es_ES.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Europe/Madrid
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] lubridate_1.9.4 forcats_1.0.1 stringr_1.5.2 dplyr_1.1.4
## [5] purrr_1.2.0 readr_2.1.5 tidyr_1.3.1 tibble_3.3.0
## [9] ggplot2_4.0.0 tidyverse_2.0.0 scales_1.4.0 MASS_7.3-65
## [13] DT_0.34.0
##
## loaded via a namespace (and not attached):
## [1] utf8_1.2.6 sass_0.4.10 generics_0.1.4 stringi_1.8.7
## [5] hms_1.1.4 digest_0.6.38 magrittr_2.0.4 evaluate_1.0.5
## [9] grid_4.5.2 timechange_0.3.0 RColorBrewer_1.1-3 fastmap_1.2.0
## [13] jsonlite_2.0.0 crosstalk_1.2.2 jquerylib_0.1.4 cli_3.6.5
## [17] rlang_1.1.6 crayon_1.5.3 bit64_4.6.0-1 withr_3.0.2
## [21] cachem_1.1.0 yaml_2.3.10 tools_4.5.2 parallel_4.5.2
## [25] tzdb_0.5.0 vctrs_0.6.5 R6_2.6.1 lifecycle_1.0.4
## [29] htmlwidgets_1.6.4 bit_4.6.0 vroom_1.6.6 pkgconfig_2.0.3
## [33] pillar_1.11.1 bslib_0.9.0 gtable_0.3.6 glue_1.8.0
## [37] xfun_0.54 tidyselect_1.2.1 rstudioapi_0.17.1 knitr_1.50
## [41] dichromat_2.0-0.1 farver_2.1.2 htmltools_0.5.8.1 labeling_0.4.3
## [45] rmarkdown_2.30 compiler_4.5.2 S7_0.2.0