Project Milestone 5
Group 1: Project Milestone #5
Scenario 1: Infectious disease outbreak (simulated) in California
Problem Statement
The California Department of Public Health (CDPH) works to protect public health in California and fosters positive health outcomes for individuals, families, and communities. The Department’s programs and services, implemented in collaboration with local health departments and state, federal, and private partners, impact the lives of every Californian. The CDPH is responsible for monitoring and responding to a simulated outbreak of a novel infectious respiratory disease within the state. Outbreak data shows variations in case counts and severity by demographic categories (age, race, sex) and geographic regions (counties). However, the data is fragmented across three separate datasets, creating challenges in understanding the disease’s progression, its disproportionate impact on specific populations or regions, and the allocation of limited prevention and treatment resources.
To address this issue, three data sources must be integrated: case data from counties across California (excluding Los Angeles County), population and morbidity data specific to Los Angeles County, and 2023 population estimates for all counties. Without harmonizing and analyzing these datasets, the CDPH cannot effectively identify high-risk populations and geographic hotspots or establish equitable strategies for resource allocation. Closing these gaps is critical to ensuring timely and targeted public health interventions during this simulated outbreak.
Methods
Source & Dates
Dataset one (sim_novelid_CA.csv) contains weekly simulated case reports for the novel infectious respiratory disease in all California counties except Los Angeles. It records cases and case severity by demographic category (age category, race, sex) and by county for the period from May 28, 2023 to December 30, 2023.
Dataset two (sim_novelid_LACounty.csv) contains morbidity and population data for Los Angeles County by similar demographic categories: sex, race/ethnicity, and age category. Morbidity data includes newly diagnosed cases of the respiratory infection, cumulative infected cases, new unrecovered cases, cumulative unrecovered cases, new severe cases, and cumulative severe cases for the period from May 29, 2023 to December 25, 2023.
Dataset three (ca_pop_2023.csv) contains population estimate data by demographic category and county for 2023. County-level demographic data includes health officer region, age category, sex, and race.
Description of Cleaning Activities
For dataset one, the race_ethnicity variable was converted from numeric codes to descriptive text, and the date was reformatted from epidemiological weeks to a standard date format. The word “County” was removed from values in the county column. Column names in dataset one and dataset two were also standardized for consistency. For dataset two, a new column named county was added and populated with “Los Angeles” as the value, and the date was reformatted to match the standard format used in the other datasets. Datasets one and two were then merged using a full join, and the merged data were aggregated by race/ethnicity and by county. In dataset three, race/ethnicity group names were adjusted to match those in datasets one and two, and age categories were recoded to align with the four age groups used in the other datasets. Finally, the merged output of datasets one and two was combined with dataset three, and rate columns were added. We also prepared a data dictionary and a summary of data element statistics to make the variables in the merged dataset easier to interpret.
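The cleaning steps described above can be sketched in pandas as follows. This is a minimal illustration, not the code used for the project: the column names (`year`, `epi_week`, `new_infections`), the `RACE_LABELS` codebook, and the toy rows are all assumptions, and the sketch further assumes that epidemiological weeks follow the CDC MMWR convention (Sunday to Saturday, with week 1 being the week that contains January 4). The recoding of dataset three is not shown.

```python
from datetime import date, timedelta

import pandas as pd

# Hypothetical code-to-label map; the real codebook is not given in the report.
RACE_LABELS = {1: "White, Non-Hispanic", 2: "Black, Non-Hispanic"}


def mmwr_week_start(year: int, week: int) -> date:
    """Return the Sunday that starts the given MMWR week (assumed convention)."""
    jan4 = date(year, 1, 4)
    # weekday() is Monday=0 ... Sunday=6, so (weekday + 1) % 7 is days since Sunday.
    week1_start = jan4 - timedelta(days=(jan4.weekday() + 1) % 7)
    return week1_start + timedelta(weeks=week - 1)


def clean_ca(df: pd.DataFrame) -> pd.DataFrame:
    """Dataset one: label race codes, strip 'County', convert epi week to a date."""
    out = df.copy()
    out["race_ethnicity"] = out["race_ethnicity"].map(RACE_LABELS)
    out["county"] = out["county"].str.replace(r"\s*County$", "", regex=True)
    out["date"] = [mmwr_week_start(y, w) for y, w in zip(out["year"], out["epi_week"])]
    return out.drop(columns=["year", "epi_week"])


def clean_la(df: pd.DataFrame) -> pd.DataFrame:
    """Dataset two: add the county column and parse the date strings."""
    out = df.copy()
    out["county"] = "Los Angeles"
    out["date"] = pd.to_datetime(out["date"]).dt.date
    return out


# Toy rows standing in for the real CSV files (values are made up).
ca_raw = pd.DataFrame({
    "county": ["Kern County", "Imperial County"],
    "race_ethnicity": [1, 2],
    "year": [2023, 2023],
    "epi_week": [22, 22],
    "new_infections": [10, 7],
})
la_raw = pd.DataFrame({
    "race_ethnicity": ["White, Non-Hispanic"],
    "date": ["2023-05-29"],
    "new_infections": [25],
})

# Full (outer) join on every shared column; because the two cleaned tables
# share all columns and no rows, this keeps every row from both.
combined = pd.merge(clean_ca(ca_raw), clean_la(la_raw), how="outer")

# Aggregate by county and race/ethnicity.
by_group = combined.groupby(["county", "race_ethnicity"], as_index=False)["new_infections"].sum()
```

Under the assumed MMWR convention, week 22 of 2023 begins on May 28, 2023, which agrees with the start date reported for dataset one.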
Analytic Methods
For the final dataset, we added rate columns for each county and each race/ethnicity group: the percentage of the population infected, and the percentages of infected individuals who were severe cases and who were unrecovered. We visualized the data with bar graphs of these three percentages by race/ethnicity. Similar visualizations were created for the most affected counties.
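A minimal sketch of how such rate columns might be computed is shown below. The column names, the `add_rates` helper, and the counts are illustrative assumptions rather than the project's actual code or data; the denominators (population for percent infected, infected cases for the severe and unrecovered percentages) follow the column headers of the results table.

```python
import pandas as pd

# Made-up aggregated counts for two placeholder groups.
cases = pd.DataFrame({
    "race_ethnicity": ["Group A", "Group B"],
    "infected": [1400, 870],
    "severe": [42, 26],
    "unrecovered": [186, 117],
})
population = pd.DataFrame({
    "race_ethnicity": ["Group A", "Group B"],
    "population": [10000, 10000],
})


def add_rates(cases: pd.DataFrame, population: pd.DataFrame, key: str) -> pd.DataFrame:
    """Join population onto case counts and add percentage columns."""
    # validate= raises an error if either table has duplicate keys.
    merged = cases.merge(population, on=key, how="left", validate="one_to_one")
    merged["pct_infected"] = (100 * merged["infected"] / merged["population"]).round(1)
    merged["pct_severe_of_infected"] = (100 * merged["severe"] / merged["infected"]).round(1)
    merged["pct_unrecovered_of_infected"] = (
        100 * merged["unrecovered"] / merged["infected"]
    ).round(1)
    return merged


rates = add_rates(cases, population, "race_ethnicity")
```

With matplotlib installed, a bar graph of one rate can then be drawn with `rates.plot.bar(x="race_ethnicity", y="pct_infected")`. The same helper applies to county-level rates by passing a county key instead.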
Results
By race/ethnicity, the American Indian/Alaska Native group had the highest infection rate at 14.0%, followed by the White group at 12.8%. The Asian and Multiracial groups had the lowest infection rates at 8.7% and 7.2%, respectively. Rates of severe and unrecovered cases among infected individuals tracked each other closely but did not mirror the infection ranking. Among infected individuals, the White group had the highest rate of severe infection at 3.6%, followed by the American Indian/Alaska Native and Asian groups, both at 3.0%. The Multiracial and Hispanic (any race) groups had the lowest rates of severe infection at 1.9% and 2.0%, respectively. Similarly, among infected individuals, the White and Asian groups had the highest unrecovered rates at 15.6% and 13.4%, while the Multiracial and Hispanic groups had the lowest at 8.6% and 9.2%, respectively.
When comparing infection rates across counties, Imperial County, located in Southern California, had the highest infection rate at nearly 45%. Counties in the Central Valley also had some of the highest infection rates, including Kern (30%), Tulare (24%), Kings (23%), Merced (23%), Colusa (21%), Tehama (19%), and Stanislaus (19%).
| Race/Ethnicity | Percent Infected | Percentage of Severe Cases among Infected | Percentage of Unrecovered Cases among Infected |
|---|---|---|---|
| American Indian Or Alaska Native, Non-Hispanic | 14.0 | 3.0 | 13.3 |
| Asian, Non-Hispanic | 8.7 | 3.0 | 13.4 |
| Black, Non-Hispanic | 12.3 | 2.6 | 11.8 |
| Hispanic (Any Race) | 12.1 | 2.0 | 9.2 |
| Multiracial (Two Or More Of Above Races), Non-Hispanic | 7.2 | 1.9 | 8.6 |
| Native Hawaiian Or Pacific Islander, Non-Hispanic | 11.0 | 2.4 | 10.5 |
| White, Non-Hispanic | 12.8 | 3.6 | 15.6 |

Interpretation: The group with the highest percentage of infection is American Indian or Alaska Native. However, the group with the highest percentage of severe cases and unrecovered cases is White. The Multiracial group has the lowest percentage of infection, severe infections, and unrecovered infections.
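
The rankings stated in the interpretation can be checked directly against the table values. The short sketch below re-enters the percentages from the table and reports the highest and lowest group for each measure; the column names and shortened group labels are our own.

```python
import pandas as pd

# Percentages copied from the results table above.
table = pd.DataFrame(
    {
        "pct_infected": [14.0, 8.7, 12.3, 12.1, 7.2, 11.0, 12.8],
        "pct_severe": [3.0, 3.0, 2.6, 2.0, 1.9, 2.4, 3.6],
        "pct_unrecovered": [13.3, 13.4, 11.8, 9.2, 8.6, 10.5, 15.6],
    },
    index=[
        "American Indian or Alaska Native",
        "Asian",
        "Black",
        "Hispanic (any race)",
        "Multiracial",
        "Native Hawaiian or Pacific Islander",
        "White",
    ],
)

highest = table.idxmax()  # group with the largest value in each column
lowest = table.idxmin()   # group with the smallest value in each column
print(pd.DataFrame({"highest": highest, "lowest": lowest}))
```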
Discussion
In analyzing the 2023 novel respiratory disease outbreak across California counties, several notable patterns emerged. Infection rates were disproportionately high among specific racial and ethnic groups, particularly the American Indian/Alaska Native and White populations. These disparities may stem from socioeconomic factors, education level, and the resulting differences in access to healthcare. Socioeconomic status may influence the type of work individuals perform, which in turn affects their proximity to others, their ability to stay home when sick, and the physical exertion required. Together, these factors may affect the risk of contracting the disease, the severity of symptoms, and recovery outcomes.
Geographically, infection rates differed substantially, with Imperial County and areas within the Central Valley showing notably high prevalence. Agricultural practices, including pesticide use and exposure to environmental toxins, may exacerbate respiratory symptoms and contribute to increased disease transmission and severity in the Central Valley. Given these findings, we recommend that prevention and treatment resources be focused on the 10 counties with the highest infection rates, with particular attention to the needs of vulnerable populations, in order to reduce disparities and improve health outcomes.