Project Milestone 6

Author

Group 1 - Elizabeth Hawkes, Innocent Menyo, Sarah Chin

Group 1 - Project Milestone #6

Scenario 1: Infectious disease outbreak (simulated) in California

Problem Statement

The California Department of Public Health (CDPH) works to protect public health in California and fosters positive health outcomes for individuals, families, and communities. The Department’s programs and services, implemented in collaboration with local health departments and state, federal, and private partners, impact the lives of every Californian. The CDPH is responsible for monitoring and responding to a simulated outbreak of a novel infectious respiratory disease within the state. Outbreak data shows variations in case counts and severity by demographic categories (age, race, sex) and geographic regions (counties). However, the data is fragmented across three separate datasets, creating challenges in understanding the disease’s progression, its disproportionate impact on specific populations or regions, and the allocation of limited prevention and treatment resources.

Addressing this issue requires integrating three sources: morbidity data from counties across California (excluding Los Angeles County), population and morbidity data specific to Los Angeles County, and 2023 population estimates for all counties. Without harmonizing and analyzing these datasets, the CDPH cannot effectively identify high-risk populations and geographic hotspots or establish equitable strategies for resource allocation. Closing these gaps is critical to ensuring timely and targeted public health interventions during this simulated outbreak.

Methods

Source & Dates

Dataset one (sim_novelid_CA.csv) contains weekly simulated case reports of the novel infectious respiratory disease for all California counties except Los Angeles. Cases and case severity are broken down by demographic category (age category, race, sex) and geographic category (county) for the period from May 28, 2023 to December 30, 2023.

Dataset two (sim_novelid_LACounty.csv) contains morbidity and population data for Los Angeles County, broken down by similar demographic categories: sex, race/ethnicity, and age category. Morbidity data includes newly diagnosed cases of the respiratory infection, cumulative infected cases, new unrecovered cases, cumulative unrecovered cases, new severe cases, and cumulative severe cases for the period from May 29, 2023 to December 25, 2023.

Dataset three (ca_pop_2023.csv) contains population estimate data by demographic category and county for 2023. County-level demographic data includes health officer region, age category, sex, and race.

Description of Cleaning Activities/Analytic Methods

For all datasets, the race_ethnicity variable was converted from numeric codes to descriptive text, age groups were standardized, and dates were reformatted from epidemiological weeks to a standard date format. Column names were also standardized, with columns added or removed as needed for consistency. The datasets were then merged and aggregated by race/ethnicity and by county. In the final dataset, we added rate columns for each county and race/ethnicity group, expressing total infections, unrecovered cases, and severe cases as percentages.
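Two of the cleaning steps described above, recoding race_ethnicity and converting epidemiological weeks to dates, might look roughly like the following Python sketch. The numeric-to-label mapping, the function names, and the choice of the CDC MMWR week convention (weeks start on Sunday, and week 1 is the week containing January 4) are assumptions made for illustration; the project's actual codebook and code are not reproduced in this report.

```python
from datetime import date, timedelta

# HYPOTHETICAL mapping: the real numeric codes are not given in this report.
# Only the labels are taken from Table 1.
RACE_ETHNICITY_LABELS = {
    1: "White, Non-Hispanic",
    2: "Black, Non-Hispanic",
    3: "American Indian Or Alaska Native, Non-Hispanic",
    4: "Asian, Non-Hispanic",
    5: "Native Hawaiian Or Pacific Islander, Non-Hispanic",
    6: "Multiracial (Two Or More Of Above Races), Non-Hispanic",
    7: "Hispanic (Any Race)",
}


def recode_race_ethnicity(code: int) -> str:
    """Return the descriptive label for a numeric code, or 'Unknown' if unmapped."""
    return RACE_ETHNICITY_LABELS.get(code, "Unknown")


def mmwr_week_start(year: int, week: int) -> date:
    """Return the Sunday that starts the given MMWR epidemiological week.

    Assumes the MMWR convention: weeks run Sunday to Saturday, and week 1
    is the week that contains January 4.
    """
    jan4 = date(year, 1, 4)
    # date.weekday() gives Monday=0 ... Sunday=6; shift so that Sunday=0.
    days_since_sunday = (jan4.weekday() + 1) % 7
    week1_start = jan4 - timedelta(days=days_since_sunday)
    return week1_start + timedelta(weeks=week - 1)


first_week = mmwr_week_start(2023, 22)
last_week_end = mmwr_week_start(2023, 52) + timedelta(days=6)
```

Under this assumed convention, week 22 of 2023 begins on May 28, 2023 and week 52 ends on December 30, 2023, which is consistent with the date range reported for dataset one.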

Results

Figure 1 shows the percentage of each race/ethnicity group's total population in each infection category. The most affected groups are American Indian/Alaska Native, White, Black, and Hispanic; the Asian and Multiracial groups are the least affected. The percentages of severe and unrecovered cases follow the same pattern.

Table 1 shows that the American Indian/Alaska Native group had the highest incidence rate of infection at 14.0 per 100 people, followed by the White group at approximately 12.8 per 100 people. The Asian and Multiracial groups had the lowest infection rates at about 8.7 and 7.2 per 100 people, respectively. Among infected individuals, the White group had the highest rate of severe infection at 3.6 per 100 infected people, followed by the American Indian/Alaska Native and Asian groups, both at 3.0 per 100 infected people. The Multiracial and Hispanic (any race) groups had the lowest rates of severe infection at 1.9 and 2.0 per 100 infected people, respectively. Similarly, among infected individuals, the White and Asian groups had the highest rates of remaining unrecovered, at 15.6 and 13.4 per 100 infected people, while the Multiracial and Hispanic groups had the lowest, at 8.6 and 9.2 per 100 infected people, respectively.
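The rates above use two different denominators: infection rates are per 100 people in the population, while severe and unrecovered rates are per 100 infected people. The sketch below makes that explicit. It is an illustration only: the field names (infected, severe, unrecovered, pop), the function names, and the toy rows are hypothetical and are not values from the project datasets.

```python
from collections import defaultdict

COUNT_FIELDS = ("infected", "severe", "unrecovered")  # hypothetical column names


def per_100(numerator, denominator):
    """Rate per 100, rounded to one decimal; None when the denominator is zero."""
    if not denominator:
        return None
    return round(100 * numerator / denominator, 1)


def rates_by_group(morbidity_rows, population_rows, key):
    """Sum counts and population by `key`, then compute the three rates."""
    counts = defaultdict(lambda: dict.fromkeys(COUNT_FIELDS, 0))
    for row in morbidity_rows:
        for field in COUNT_FIELDS:
            counts[row[key]][field] += row[field]

    population = defaultdict(int)
    for row in population_rows:
        population[row[key]] += row["pop"]

    rates = {}
    for group, c in counts.items():
        rates[group] = {
            # Denominator: total population of the group.
            "infection_rate": per_100(c["infected"], population[group]),
            # Denominator: infected individuals in the group.
            "severe_rate": per_100(c["severe"], c["infected"]),
            "unrecovered_rate": per_100(c["unrecovered"], c["infected"]),
        }
    return rates


# Toy rows for illustration only; NOT taken from the project datasets.
morbidity = [
    {"county": "County X", "race_ethnicity": "Group A", "infected": 80, "severe": 3, "unrecovered": 12},
    {"county": "County Y", "race_ethnicity": "Group A", "infected": 60, "severe": 2, "unrecovered": 10},
    {"county": "County Y", "race_ethnicity": "Group B", "infected": 20, "severe": 1, "unrecovered": 2},
]
population = [
    {"county": "County X", "race_ethnicity": "Group A", "pop": 600},
    {"county": "County Y", "race_ethnicity": "Group A", "pop": 400},
    {"county": "County Y", "race_ethnicity": "Group B", "pop": 250},
]

by_race = rates_by_group(morbidity, population, "race_ethnicity")
by_county = rates_by_group(morbidity, population, "county")
```

Passing "county" instead of "race_ethnicity" as the key yields the county-level rates with the same logic, which is one way the two breakdowns in this section could share a single aggregation step.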

Figure 2 shows the ten most affected counties. Imperial County, located in Southern California, had the highest incidence rate at nearly 45 cases per 100 people, as well as the highest rates of severe and unrecovered cases. Counties in the Central Valley also had some of the highest incidence rates of infection, including Kern (30 per 100 people), Tulare (24 per 100 people), Kings (23 per 100 people), Merced (23 per 100 people), Colusa (21 per 100 people), Tehama (19 per 100 people), and Stanislaus (19 per 100 people).
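A ranking like the one behind Figure 2 can be sketched as a simple sort on county-level rates. The function name and the tie-breaking rule (alphabetical by county) are assumptions; the rates are the eight values quoted in the paragraph above, with Imperial entered as 45 even though the text gives it as "nearly 45".

```python
# Incidence rates per 100 people as quoted in the text above.
# Imperial is reported as "nearly 45", so 45 is an approximation.
reported_rates = {
    "Imperial": 45,
    "Kern": 30,
    "Tulare": 24,
    "Kings": 23,
    "Merced": 23,
    "Colusa": 21,
    "Tehama": 19,
    "Stanislaus": 19,
}


def top_counties(rate_by_county, n=10):
    """Return (county, rate) pairs sorted by rate descending, ties broken by name."""
    ranked = sorted(rate_by_county.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]


top_three = top_counties(reported_rates, n=3)
```

Only eight counties are named with rates in the text, so calling top_counties with the default n returns those eight rather than a full top ten.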

Table 1. California Novel Respiratory Disease Incidence Rates (per 100 people) by Race/Ethnicity, 2023
Race/Ethnicity | IR of Disease | IR of Severe Disease among Infected | IR of Unrecovered Disease among Infected
American Indian Or Alaska Native, Non-Hispanic | 14.0 | 3.0 | 13.3
Asian, Non-Hispanic | 8.7 | 3.0 | 13.4
Black, Non-Hispanic | 12.3 | 2.6 | 11.8
Hispanic (Any Race) | 12.1 | 2.0 | 9.2
Multiracial (Two Or More Of Above Races), Non-Hispanic | 7.2 | 1.9 | 8.6
Native Hawaiian Or Pacific Islander, Non-Hispanic | 11.0 | 2.4 | 10.5
White, Non-Hispanic | 12.8 | 3.6 | 15.6

Discussion

Our analysis of the 2023 novel respiratory disease outbreak across California counties revealed several notable patterns. Infection rates were disproportionately high among specific racial and ethnic groups, particularly the American Indian/Alaska Native and White populations. These disparities may stem from socioeconomic factors, education level, and resulting differences in access to healthcare. Socioeconomic status may shape the type of work individuals perform, which in turn affects their proximity to others, their ability to stay home when sick, and the physical exertion required of them. Together, these factors may influence the risk of contracting the disease, the severity of symptoms, and recovery outcomes.

Geographically, infection rates differed substantially, with Imperial County and areas within the Central Valley showing notably high incidence. Agricultural practices, including pesticide use and exposure to environmental toxins, may exacerbate respiratory symptoms and contribute to increased disease transmission and severity in the Central Valley. Given these findings, we recommend that prevention and treatment resources be focused on the ten most affected counties, with particular attention to the needs of vulnerable populations, in order to reduce disparities and improve health outcomes.