Milestone 6
Problem statement
For this project, we worked within the scenario of a statewide public health response to a simulated outbreak of a novel respiratory pathogen in California in 2023. The available data included weekly morbidity counts for all counties (with Los Angeles County provided in a separate file) and 2023 population estimates for the corresponding demographic strata. These datasets allowed us to examine disease occurrence by age group, sex, race/ethnicity, and county.
The primary objective of our analysis was to describe how infections were distributed across demographic and geographic groups. By calculating case counts and incidence rates for key populations, we aimed to identify disparities in disease burden and provide evidence to guide resource allocation during the outbreak. This descriptive epidemiologic assessment supports understanding which communities and demographic groups may be most affected and where public health interventions may need to be concentrated.
Methods
We used three datasets provided for Scenario 1: weekly morbidity data for all California counties except Los Angeles, a parallel morbidity file for Los Angeles County, and 2023 population estimates for all demographic strata. Together, these sources enabled statewide analysis of new infections and incidence rates by age, sex, race/ethnicity, and county.
All analyses were conducted in R using a workflow centered around the tidyverse. We used readr for data import and dplyr, tidyr, stringr, and forcats for cleaning, recoding, grouping, and reshaping the data. The janitor package standardized column names, and scales and formattable supported numeric formatting. Final tables were produced using knitr and kableExtra, and all figures were created with ggplot2; plotly was used only during development to inspect patterns interactively. Tibbles were used throughout for consistent data handling.
Data cleaning focused on harmonizing demographic variables across files so the datasets could be merged reliably. We aligned variable names, recoded race/ethnicity and age categories to resolve inconsistencies, and ensured county names matched across morbidity and population sources. Records with missing or unusable demographic information were removed. The California and Los Angeles morbidity files were combined into a single statewide dataset and merged with population estimates using shared demographic keys.
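The combine-and-merge step described above can be sketched roughly as follows. This is a minimal illustration, not our actual script: the toy tibbles, the column names (`county`, `age_cat`, `sex`, `new_infections`, `pop`), and the helper `clean_keys()` are hypothetical stand-ins for the real file schemas.

```r
library(dplyr)
library(stringr)

# Hypothetical toy inputs; the real files have more columns and strata.
ca_morbidity <- tibble(
  county = c("Alameda", "alameda ", "Alameda"),
  age_cat = c("0-17", "65+", "65+"),
  sex = c("FEMALE", "male", NA),
  new_infections = c(10, 20, 5)
)
la_morbidity <- tibble(
  county = "Los Angeles",
  age_cat = "0-17",
  sex = "FEMALE",
  new_infections = 30
)
population <- tibble(
  county = c("Alameda", "Alameda", "Los Angeles"),
  age_cat = c("0-17", "65+", "0-17"),
  sex = c("Female", "Male", "Female"),
  pop = c(1000, 2000, 3000)
)

# Hypothetical helper: trim whitespace and standardize case on the join keys
# so that "alameda " and "Alameda" (or "FEMALE" and "Female") match.
clean_keys <- function(df) {
  df %>%
    mutate(
      county = str_to_title(str_squish(county)),
      sex = str_to_title(str_squish(sex))
    )
}

statewide <- bind_rows(clean_keys(ca_morbidity), clean_keys(la_morbidity)) %>%
  # Drop records with missing demographic information.
  filter(!is.na(age_cat), !is.na(sex)) %>%
  # Attach the population denominator for each stratum.
  left_join(clean_keys(population), by = c("county", "age_cat", "sex"))
```

Checking that no row in `statewide` has a missing `pop` after the join is a quick way to confirm that the keys were harmonized successfully.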
To support analysis, we created new variables including total case counts per demographic stratum and incidence rates per 100,000 population. We reordered factor levels and refined category labels to ensure that visualizations were interpretable and aligned with assignment requirements. As a team, we chose to focus on one primary visualization (race/ethnicity by sex) and one primary table (age by sex) to meet the scenario requirements while maintaining clarity. Additional wrangling, such as numeric formatting, ordering categories by rate, and applying consistent labels, was performed to produce publication-ready outputs.
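As an illustration of the rate calculation, the sketch below sums weekly cases within each stratum and divides by that stratum's population. The data and column names are hypothetical, and it assumes the population value repeats unchanged on every weekly row of a stratum.

```r
library(dplyr)

# Hypothetical merged weekly data: two strata, two weeks each.
merged <- tibble(
  week = c(1, 2, 1, 2),
  age_cat = c("0-17", "0-17", "65+", "65+"),
  sex = "Female",
  new_infections = c(5, 15, 40, 60),
  pop = c(10000, 10000, 5000, 5000)
)

rates <- merged %>%
  group_by(age_cat, sex) %>%
  summarise(
    cases = sum(new_infections),  # total cases across all weeks
    pop = first(pop),             # assumed constant across weeks; do not sum
    .groups = "drop"
  ) %>%
  mutate(rate_per_100k = cases / pop * 1e5) %>%
  arrange(desc(rate_per_100k))    # order strata from highest to lowest rate
```

Taking `first(pop)` rather than `sum(pop)` matters here: summing the denominator over weeks would count each person once per week and understate the rate.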
Results
Our analysis revealed clear patterns in the distribution of new infections across demographic groups in California. After merging morbidity and population data, we calculated incidence rates per 100,000 population, which were summarized by race/ethnicity and sex in the primary visualization. This horizontal bar chart showed notable disparities: American Indian or Alaska Native and White, Non-Hispanic populations had the highest infection rates, followed by Black and Hispanic populations. Asian, Native Hawaiian or Pacific Islander, and Multiracial groups showed substantially lower rates. Differences between males and females within each racial/ethnic group were relatively small compared with the wider disparities across groups.
The table below summarizes infection rates per 100,000 population by age group and sex. Adults aged 65 and older experienced the highest infection rates overall, followed by adults aged 18–49. Rates were lower among those aged 50–64 and lowest among children aged 0–17. The table used a color gradient to highlight relative magnitudes of infection burden across age groups. Together, the table and the visualization provide a cohesive summary of how racial/ethnic identity and age shape the distribution of infections in this simulated outbreak.
| Age Group | Female | Male | Overall |
|---|---|---|---|
| 0-17 | 4,008.20 | 4,116.00 | 4,062.10 |
| 18-49 | 13,984.40 | 14,859.90 | 14,422.20 |
| 50-64 | 7,555.60 | 7,701.20 | 7,628.40 |
| 65+ | 18,892.20 | 19,370.20 | 19,131.20 |

Source: SIM CA and LA Morbidity Files + CA DOF Population Estimates. Rates are per 100,000 population.
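
As a reading aid for the Overall column: the displayed values are numerically consistent with the unweighted mean of the Female and Male rates. This relationship is inferred from the numbers in the table rather than stated in the methods. For the 0-17 row:

$$
\frac{4{,}008.20 + 4{,}116.00}{2} = 4{,}062.10
$$

A pooled rate would instead divide combined cases by combined population. Writing $C_F, C_M$ for case counts and $N_F, N_M$ for population sizes (symbols introduced here for illustration):

$$
\text{Rate}_{\text{pooled}} = \frac{C_F + C_M}{N_F + N_M} \times 100{,}000
$$

The two quantities coincide only when $N_F = N_M$ or when the two sex-specific rates are equal, so the Overall values may differ slightly from a pooled rate if the sexes differ in population size within an age group.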
Discussion
The findings from this project indicate that the burden of the simulated respiratory outbreak was not evenly distributed across California’s population. The race/ethnicity–sex visualization highlights pronounced disparities, with certain racial and ethnic groups experiencing substantially higher infection rates than others. These patterns likely reflect a combination of structural, occupational, and environmental factors rather than biological differences alone. The minimal differences by sex support the interpretation that social and structural factors, such as access to preventive resources, crowded living or working conditions, and local transmission dynamics, play a larger role than sex in shaping infection risk.
The age-specific results further illuminate these differences. Older adults had markedly higher infection rates, consistent with greater biological susceptibility, underlying health conditions, and potential differences in exposure or health-seeking behavior. The substantial burden among adults aged 18–49 suggests that working-age individuals may represent an important driver of community transmission, potentially due to workplace exposures or caregiving responsibilities. Another factor that could affect the rates for all age groups is testing behavior. If certain age groups are tested more frequently (e.g., through a work-based program), the measured incidence for those groups would increase even if true incidence were more homogeneous across age groups. Additional data on testing rates by age group would be necessary to assess the impact of this factor. Age-based differences in infection rates do not appear to vary appreciably by sex.
Taken together, these results indicate that an effective outbreak response would require targeted outreach to the most affected racial/ethnic groups and increased support for older adults. Enhancing access to testing, treatment, and prevention resources, along with tailored communication strategies, would likely help reduce both overall transmission and the disparities observed in this simulated scenario. These findings illustrate the importance of demographic-specific analyses in guiding equitable public health action.