The sim_novelid_CA.csv file and
sim_novelid_LACounty.csv file together represent simulated
morbidity for the entire state of California. This analysis combines and
analyzes these datasets to understand the temporal and demographic
patterns of the outbreak.
California is experiencing an outbreak of a novel infectious respiratory disease. Data were collected in 2023 and cover cases and case severity by demographic category (age group, race/ethnicity, sex) and by geography (county), together with population estimates for California counties. As part of the California Department of Public Health's surveillance of the outbreak, we examine the course of the outbreak, whether it disproportionately affects particular demographic or geographic populations, and how prevention and treatment resources should be allocated.
Data for this analysis came from three simulated 2023 datasets. The sim_novelid_CA.csv file contained weekly morbidity data (new infections and severe cases) by county, age group, sex, and race/ethnicity for all California counties except Los Angeles. The sim_novelid_LACounty.csv file provided equivalent morbidity data for Los Angeles County, which operates a separate surveillance system. The ca_pop_2023.csv file provided 2023 population estimates by county and demographic categories, which served as denominators for rate calculations.
Data preparation involved standardizing variable names and coding schemes across datasets. Race/ethnicity categories were recoded from numeric values to descriptive labels matching California Department of Finance classifications. Age categories in the population data (0-4, 5-11, 12-17) were collapsed to match the broader categories in the morbidity data (0-17). County names were standardized to enable successful joins. Rows with missing values were excluded to ensure complete records for stratified analyses.
Weekly incidence rates per 100,000 population were calculated by dividing new infections by the population denominator and multiplying by 100,000. This standardization enables fair comparisons across demographic groups of different sizes. Attack rates (cumulative proportion infected) and severity rates (proportion of cases with severe outcomes) were calculated for summary statistics. All measures were stratified by age group and race/ethnicity to identify disparities. CDC epidemiological weeks (Sunday start) were used as the temporal unit for time series analysis.
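As a worked illustration of the three measures defined in the methods, the sketch below computes them for a single stratum in base R. The totals are the 65+ row of the age-group summary table in this report; the weekly count `weekly_cases` is hypothetical and is there only to illustrate the weekly rate.

```r
# Worked example of the rate definitions (base R, no packages needed).
# Totals are the 65+ row of the age-group summary table in this report.
total_cases  <- 1188608   # cumulative infections in the stratum
total_severe <- 94416     # cumulative severe cases
population   <- 6221657   # population denominator

# Hypothetical weekly count, used only to illustrate the weekly rate
weekly_cases <- 5000

# Weekly incidence rate per 100,000 population
weekly_rate_per100k <- weekly_cases / population * 100000

# Attack rate: cumulative proportion infected, as a percentage
attack_rate_pct <- total_cases / population * 100

# Severity rate: proportion of cases with severe outcomes, as a percentage
severe_rate_pct <- total_severe / total_cases * 100

round(c(weekly_rate_per100k = weekly_rate_per100k,
        attack_rate_pct     = attack_rate_pct,
        severe_rate_pct     = severe_rate_pct), 2)
```

The attack and severity rates computed this way round to 19.10% and 7.94%, matching the 65+ row of the summary table.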
First, we loaded the data.
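# Load packages: tidyverse (readr, dplyr, tidyr, stringr, ggplot2), here (paths),
# janitor (clean_names), lubridate and aweek (dates), knitr and kableExtra (tables)
library(tidyverse); library(here); library(janitor)
library(lubridate); library(aweek)
library(knitr); library(kableExtra)
# Load datasets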
ca_pop <- read_csv(here("outbreak_data", "ca_pop_2023.csv"))
sim_novelid_ca <- read_csv(here("outbreak_data", "sim_novelid_CA.csv"))
sim_novelid_la <- read_csv(here("outbreak_data", "sim_novelid_LACounty.csv"))
# Clean column names to snakecase
sim_novelid_ca <- sim_novelid_ca %>% clean_names()
sim_novelid_la <- sim_novelid_la %>% clean_names()
For each data source, we selected the variables relevant to our analysis and gave them common names.
# Standardize CA dataset
simca1 <- sim_novelid_ca %>%
  select(
    county,
    age_cat,
    sex,
    race_eth = race_ethnicity,
    dt_diagnosis,
    time_int,
    new_infections,
    cumulative_infected,
    new_severe,
    cumulative_severe
  )
# Standardize LA County dataset
simla1 <- sim_novelid_la %>%
  mutate(county = "Los Angeles") %>%
  select(
    county,
    age_cat = age_category,
    sex,
    race_eth,
    dt_diagnosis = dt_dx,
    new_infections = dx_new,
    cumulative_infected = infected_cumulative,
    new_severe = severe_new,
    cumulative_severe = severe_cumulative
  )
# Remove rows with missing values to ensure clean joins
simca1 <- simca1 %>% drop_na()
simla1 <- simla1 %>% drop_na()
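Because dropping incomplete rows can bias results when missingness is not random, it may be worth counting what a complete-case filter removes before applying it. The sketch below does this in base R on a small made-up data frame; `toy` and its values are hypothetical and are not one of the study datasets.

```r
# Toy data standing in for a morbidity extract (values are made up)
toy <- data.frame(
  county         = c("Alameda", "Alameda", "Fresno", "Fresno"),
  age_cat        = c("0-17", "65+", NA, "18-49"),
  new_infections = c(10, 4, 7, NA)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))

# Rows that a complete-case filter (such as tidyr::drop_na()) would remove
n_dropped   <- sum(!complete.cases(toy))
pct_dropped <- 100 * n_dropped / nrow(toy)

na_per_column
n_dropped
pct_dropped
```

If the share of dropped rows is large or concentrated in one demographic group, that would be worth reporting alongside the results.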
CDC epidemiological weeks start on Sunday and are the standard for disease surveillance reporting.
# Fix date format for LA County
simla1 <- simla1 %>%
  mutate(dt_diagnosis = dmy(dt_diagnosis))
# CDC has a Sunday start date
set_week_start(7)
# Create CDC week for CA dataset from time_int variable
simca2 <- simca1 %>%
  mutate(
    year = substr(time_int, 1, 4),
    week = substr(time_int, 5, 6),
    cdc_week = aweek::get_date(week = week, year = year)
  ) %>%
  select(-year, -week)
# Create CDC week for LA County from diagnosis date
# (epiweek()/epiyear() follow the CDC Sunday-start convention; isoweek() is Monday-start)
simla2 <- simla1 %>%
  mutate(
    cdc_week = get_date(
      week = epiweek(dt_diagnosis),
      year = epiyear(dt_diagnosis)
    )
  )
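Week-numbering conventions differ: ISO 8601 weeks start on Monday, while CDC epidemiological weeks start on Sunday, so the two can disagree, most visibly around the turn of the year. The short sketch below, which assumes lubridate is installed, compares the two for a single date and shows one way to get the Sunday that starts a given week.

```r
library(lubridate)

# 1 January 2023 fell on a Sunday
d <- as.Date("2023-01-01")

isoweek(d)   # ISO 8601 week (Monday start): still the last week of 2022
epiweek(d)   # US CDC epidemiological week (Sunday start)
epiyear(d)   # year that the CDC week belongs to

# Start date (Sunday) of the week containing a mid-week date
week_start <- floor_date(as.Date("2023-01-04"), unit = "week", week_start = 7)
week_start
```

For this date the ISO week is 52 (of 2022) while the CDC week is 1 (of 2023), which is why the week function has to match the convention used for the rest of the data.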
# Convert numeric race codes to descriptive labels for CA data
simca3 <- simca2 %>%
  mutate(
    race_eth = recode_factor(race_eth,
      `1` = "White, Non-Hispanic",
      `2` = "Black, Non-Hispanic",
      `3` = "American Indian or Alaska Native, Non-Hispanic",
      `4` = "Asian, Non-Hispanic",
      `5` = "Native Hawaiian or Pacific Islander, Non-Hispanic",
      `6` = "Multiracial (two or more of above races), Non-Hispanic",
      `7` = "Hispanic (any race)",
      `9` = "Unknown"
    )
  ) %>%
  select(-time_int)
# Ensure race is factor type for LA data
simla3 <- simla2 %>%
  mutate(race_eth = factor(race_eth))
# Merge CA and LA County morbidity data
sim_all <- bind_rows(simca3, simla3)
# Select and rename columns to match morbidity data
ca_pop1 <- ca_pop %>%
  select(
    county,
    age_cat,
    sex,
    race_eth = race7,
    pop
  )
# Standardize race/ethnicity labels to match morbidity data
ca_pop2 <- ca_pop1 %>%
  mutate(
    race_eth = recode_factor(race_eth,
      "WhiteTE NH" = "White, Non-Hispanic",
      "Black NH" = "Black, Non-Hispanic",
      "AIAN NH" = "American Indian or Alaska Native, Non-Hispanic",
      "Asian NH" = "Asian, Non-Hispanic",
      "NHPI NH" = "Native Hawaiian or Pacific Islander, Non-Hispanic",
      "MR NH" = "Multiracial (two or more of above races), Non-Hispanic",
      "Hispanic" = "Hispanic (any race)"
    )
  )
# Reclassify age categories to match morbidity data (e.g., 0-17 instead of 0-4, 5-11, 12-17)
ca_pop3 <- ca_pop2 %>%
  mutate(age_cat = case_when(
    age_cat %in% c("0-4", "5-11", "12-17") ~ "0-17",
    TRUE ~ age_cat
  )) %>%
  group_by(county, age_cat, sex, race_eth) %>%
  summarise(pop = sum(pop, na.rm = TRUE)) %>%
  ungroup()
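A quick sanity check for this kind of recoding is that the total population is unchanged after the finer age bands are collapsed. The sketch below shows the check on a made-up table; `toy_pop` and its counts are hypothetical, and the example assumes dplyr is installed.

```r
library(dplyr)

# Made-up population counts with the finer child age bands
toy_pop <- tibble(
  county  = "Alameda",
  age_cat = c("0-4", "5-11", "12-17", "18-49"),
  pop     = c(100, 150, 120, 900)
)

# Collapse the three child bands into "0-17" and re-sum the population
collapsed <- toy_pop %>%
  mutate(age_cat = if_else(age_cat %in% c("0-4", "5-11", "12-17"),
                           "0-17", age_cat)) %>%
  group_by(county, age_cat) %>%
  summarise(pop = sum(pop), .groups = "drop")

collapsed

# Collapsing categories must not change the overall total
stopifnot(sum(collapsed$pop) == sum(toy_pop$pop))
```

The same comparison of totals before and after could be run on the real population table.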
# Remove "County" suffix from morbidity data to match population data
sim_all1 <- sim_all %>%
  mutate(
    county = str_to_title(str_remove(county, " County$")),
    county = str_trim(county)
  )
# Ensure consistent formatting in population data
ca_pop4 <- ca_pop3 %>%
  mutate(
    county = str_to_title(str_trim(county))
  )
# Left join to add population data to morbidity data
sim_pop <- sim_all1 %>%
  left_join(ca_pop4,
    by = c("county", "age_cat", "sex", "race_eth"))
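A left join keeps every morbidity row even when no population row matches, so unmatched keys show up only as missing denominators. One way to surface them is `anti_join()`, sketched below on made-up data; the `toy_cases` and `toy_pop` tables are hypothetical, and the example assumes dplyr is installed.

```r
library(dplyr)

# Made-up morbidity and population keys; "Kern County" is deliberately
# left unstandardized so that it fails to match
toy_cases <- tibble(county = c("Alameda", "Kern County"),
                    new_infections = c(12, 8))
toy_pop   <- tibble(county = c("Alameda", "Kern"),
                    pop = c(1000, 500))

# Rows in the morbidity data with no matching population denominator
unmatched <- anti_join(toy_cases, toy_pop, by = "county")
unmatched

# After a left join, the same rows appear with a missing denominator
joined <- left_join(toy_cases, toy_pop, by = "county")
sum(is.na(joined$pop))
```

An empty `unmatched` result on the real data would indicate that every stratum found a denominator, apart from categories that have no population counterpart by design.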
# Create dataset with weekly cumulative incidence rates and drop columns
fun1 <- sim_pop %>%
  mutate(
    ci_per100k = (new_infections / pop) * 100000,
    severe_per100k = (new_severe / pop) * 100000
  ) %>%
  select(-c(dt_diagnosis, cumulative_infected, cumulative_severe))
# Aggregate to state level by week
ca_overall <- fun1 %>%
  group_by(cdc_week) %>%
  summarise(
    total_cases = sum(new_infections, na.rm = TRUE),
    total_pop = sum(pop, na.rm = TRUE),
    total_severe = sum(new_severe, na.rm = TRUE)
  ) %>%
  mutate(
    ci_per100k = (total_cases / total_pop) * 100000
  ) %>%
  ungroup()
# Aggregate by age group and week
ca_by_age <- fun1 %>%
  group_by(cdc_week, age_cat) %>%
  summarise(
    total_cases = sum(new_infections, na.rm = TRUE),
    total_pop = sum(pop, na.rm = TRUE),
    total_severe = sum(new_severe, na.rm = TRUE)
  ) %>%
  mutate(
    ci_per100k = (total_cases / total_pop) * 100000
  ) %>%
  ungroup()
# Aggregate by race/ethnicity and week
ca_by_race <- fun1 %>%
  group_by(cdc_week, race_eth) %>%
  summarise(
    total_cases = sum(new_infections, na.rm = TRUE),
    total_pop = sum(pop, na.rm = TRUE),
    total_severe = sum(new_severe, na.rm = TRUE)
  ) %>%
  mutate(
    ci_per100k = (total_cases / total_pop) * 100000
  ) %>%
  ungroup()
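The aggregations above divide summed cases by summed population rather than averaging stratum-level rates. A small example with made-up numbers illustrates why the distinction matters when strata differ greatly in size; all values below are hypothetical.

```r
# Two hypothetical strata of very different size
cases <- c(10, 500)
pop   <- c(1000, 1000000)

# Stratum-specific rates per 100,000
stratum_rates <- cases / pop * 100000

# Pooled rate: total cases over total population
pooled_rate <- sum(cases) / sum(pop) * 100000

# Unweighted mean of stratum rates gives the small stratum equal weight
mean_of_rates <- mean(stratum_rates)

round(c(pooled = pooled_rate, mean_of_rates = mean_of_rates), 2)
```

Here the unweighted mean of the two rates is 525 per 100,000, while the pooled rate is about 51 per 100,000, because the pooled calculation weights each stratum by its population.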
The outbreak affected all age groups, with clear differences in disease burden by age.
ggplot(ca_by_age, aes(x = cdc_week,
                      y = ci_per100k,
                      color = age_cat)) +
  geom_line() +
  labs(
    title = "Infectious Disease Outbreak in California by Age Group 2023",
    x = "Week",
    y = "Weekly Cumulative Incidence (per 100,000)",
    color = "Age Group"
  ) +
  theme_minimal()
Key Findings:
ggplot(ca_by_race, aes(x = cdc_week,
                       y = ci_per100k,
                       color = race_eth)) +
  geom_line() +
  scale_color_brewer(palette = "Set1") +
  labs(
    title = "Infectious Disease Outbreak in California by Race/Ethnicity 2023",
    x = "Week",
    y = "Weekly Cumulative Incidence (per 100,000)",
    color = "Race/Ethnicity"
  ) +
  theme_minimal()
Key Findings:
summary_age <- ca_by_age %>%
  group_by(age_cat) %>%
  summarise(
    total_cases = sum(total_cases, na.rm = TRUE),
    avg_population = mean(total_pop, na.rm = TRUE), # Total population is the same every week
    peak_weekly_ci = max(ci_per100k, na.rm = TRUE),
    total_severe = sum(total_severe, na.rm = TRUE)
  ) %>%
  mutate(
    attack_rate_percent = (total_cases / avg_population) * 100,
    severe_rate_percent = (total_severe / total_cases) * 100
  )
kable(summary_age,
      digits = 2,
      col.names = c("Age Group", "Total Cases", "Average Population",
                    "Peak Weekly CI (per 100k)", "Total Severe Cases",
                    "Attack Rate (%)", "Severe Rate (%)"),
      caption = "Outbreak Summary Statistics by Age Group") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE)
| Age Group | Total Cases | Average Population | Peak Weekly CI (per 100k) | Total Severe Cases | Attack Rate (%) | Severe Rate (%) |
|---|---|---|---|---|---|---|
| 0-17 | 353210 | 8694111 | 354.22 | 363 | 4.06 | 0.10 |
| 18-49 | 2467284 | 17106505 | 1250.22 | 29325 | 14.42 | 1.19 |
| 50-64 | 540481 | 7086797 | 668.40 | 2822 | 7.63 | 0.52 |
| 65+ | 1188608 | 6221657 | 1662.00 | 94416 | 19.10 | 7.94 |
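Key Findings: Adults aged 65+ had the highest attack rate (19.10%), peak weekly incidence (1,662.00 per 100,000), and severe rate (7.94%). Adults aged 18-49 accounted for the most cases (2,467,284) and had the second-highest attack rate (14.42%), nearly double that of adults aged 50-64 (7.63%). Children aged 0-17 had the lowest attack rate (4.06%) and severe rate (0.10%).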
summary_race <- ca_by_race %>%
  group_by(race_eth) %>%
  summarise(
    total_cases = sum(total_cases, na.rm = TRUE),
    avg_population = mean(total_pop, na.rm = TRUE),
    peak_weekly_ci = max(ci_per100k, na.rm = TRUE),
    total_severe = sum(total_severe, na.rm = TRUE)
  ) %>%
  mutate(
    attack_rate_percent = (total_cases / avg_population) * 100,
    severe_rate_percent = (total_severe / total_cases) * 100
  ) %>%
  arrange(desc(attack_rate_percent))
kable(summary_race,
      digits = 2,
      col.names = c("Race/Ethnicity", "Total Cases", "Average Population",
                    "Peak Weekly CI (per 100k)", "Total Severe Cases",
                    "Attack Rate (%)", "Severe Rate (%)"),
      caption = "Outbreak Summary Statistics by Race/Ethnicity") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE)
| Race/Ethnicity | Total Cases | Average Population | Peak Weekly CI (per 100k) | Total Severe Cases | Attack Rate (%) | Severe Rate (%) |
|---|---|---|---|---|---|---|
| American Indian or Alaska Native, Non-Hispanic | 22195 | 158672 | 1230.21 | 671 | 13.99 | 3.02 |
| White, Non-Hispanic | 1778774 | 13848282 | 1117.81 | 63448 | 12.84 | 3.57 |
| Black, Non-Hispanic | 271836 | 2211518 | 1060.00 | 7118 | 12.29 | 2.62 |
| Hispanic (any race) | 1796696 | 14829946 | 1052.14 | 36484 | 12.12 | 2.03 |
| Native Hawaiian or Pacific Islander, Non-Hispanic | 16921 | 153729 | 962.08 | 410 | 11.01 | 2.42 |
| Asian, Non-Hispanic | 546770 | 6295420 | 760.47 | 16568 | 8.69 | 3.03 |
| Multiracial (two or more of above races), Non-Hispanic | 116391 | 1611503 | 617.87 | 2227 | 7.22 | 1.91 |
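Key Findings: Attack rates ranged from 7.22% (Multiracial, Non-Hispanic) to 13.99% (American Indian or Alaska Native, Non-Hispanic). Severe rates were highest among White, Non-Hispanic cases (3.57%), followed by Asian, Non-Hispanic (3.03%) and American Indian or Alaska Native, Non-Hispanic (3.02%) cases, and lowest among Multiracial, Non-Hispanic cases (1.91%).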
The final analytical dataset (fun1) contains the following key variables:
| Variable | Data Type | Description |
|---|---|---|
| county | Character | California counties |
| age_cat | Character | Age category: 0-17, 18-49, 50-64, 65+ |
| sex | Character | Sex: FEMALE, MALE |
| race_eth | Factor | Race/ethnicity category defined by California Department of Finance |
| new_infections | Numeric | Number of newly diagnosed individuals during the week |
| new_severe | Numeric | Newly identified individuals with severe disease requiring hospitalization |
| cdc_week | Date | Start date of CDC epidemiological week (Sunday) |
| pop | Numeric | CA Dept of Finance population estimates |
| ci_per100k | Numeric | Weekly cumulative incidence rate per 100,000 persons |
| severe_per100k | Numeric | Weekly severe disease rate per 100,000 persons |
The outbreak peaked in August–September 2023 before declining. The observed disparities have important implications for ongoing preparedness and resource allocation.
Age showed the widest disparities in disease burden. Adults aged 65 and older experienced the highest attack rate (19.1%) and severity rate (7.9%), indicating this group should be prioritized for both prevention efforts and treatment. The unexpectedly high attack rate among adults aged 18–49 (14.4%) compared with those aged 50–64 (7.6%) may reflect occupational exposures and warrants further investigation.
Racial and ethnic disparities were also evident. American Indian or Alaska Native (Non-Hispanic) populations experienced an attack rate (14.0%) nearly double that of the lowest-affected group (Multiracial, Non-Hispanic; 7.2%), highlighting the need for targeted outreach and culturally appropriate prevention strategies. Asian (Non-Hispanic) populations showed a low attack rate (8.7%) but relatively high case severity (3.0%), a pattern potentially driven by the age distribution within this population.
Several limitations should be noted. This analysis used simulated data and may not capture real-world complexities. Socioeconomic factors, which often drive health disparities, were not available in the dataset. Additionally, rows with missing demographic data were excluded, which could introduce bias if missingness was not random. Finally, this analysis did not examine county-level variation, which could reveal geographic clustering of cases that requires localized response efforts.