Milestone #6 Details

Scenario 1: Infectious disease outbreak (simulated) in California

Objective: Visualizations complete.

Link to RPubs: https://rpubs.com/vincentgdoanberkeley/1257292

Problem Statement

In 2023, a novel infectious disease was discovered in the United States. The disease rapidly spread to all 50 states through direct contact, indirect contact, and respiratory droplets. As members of California’s Infectious Disease Surveillance Team, we play a vital role in tracking the course of this infectious respiratory disease outbreak. Three datasets inform us about the outbreak: cases and case severity by demographic category for CA counties, the same for LA County, and CA population estimates for 2023. With these data, we can determine the optimal allocation of prevention and treatment resources and where they will have the largest impact on reducing morbidity in California. We additionally focus on LA County to study demographic disparities in order to make demographic-specific generalizations for the state.

Our research questions are: “Which demographic populations have the highest burden of the novel infectious respiratory disease?” and “Do race and ethnicity affect infection rate in LA County?”

Methods

Three datasets were used: simulated Novel Infectious Disease case reports for California, simulated Novel Infectious Disease case reports for Los Angeles County, and estimated California state population data. All datasets covered the year 2023, with the simulated case data spanning May to December.

Most cleaning steps were performed with tidyverse packages: readxl (reading raw data), dplyr (renaming, modifying, or merging columns), and lubridate (handling date columns).

Standard good practices were followed across all datasets, including consistent naming conventions and capitalization and consistent column types (dbl, chr, or Date). Similar steps, namely column refactoring, reformatting, and renaming, were performed consecutively, while other steps (e.g. creation of visualizations) were performed independently to ensure they functioned properly.

A complete case analysis was performed on all datasets in this project; any row containing an NA entry was removed. A left_join() was then used to combine the two simulated datasets (LA County and California) once both had been brought into the same format. The CA population data was then incorporated by creating new columns in the joined dataset, because appending rows was not feasible: the data types of the two tables did not overlap. New variables were created whenever a dataset was heavily transformed or a new visual was created.
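The two steps described above, complete-case filtering followed by a left join, can be sketched in a few lines. This is a minimal illustration in plain Python, not the project's R code: the column names (week, group, ca_infections, la_infections), the sample values, the join keys, and the choice of the California table as the left side are all assumptions made for the example.

```python
# Illustrative only: column names, values, and join keys are hypothetical,
# not taken from the project's datasets.

def complete_cases(rows):
    """Keep only rows in which no field is missing (None plays the role of NA)."""
    return [row for row in rows if all(value is not None for value in row.values())]


def left_join(left, right, keys):
    """Keep every left row and attach fields from right rows that match on `keys`.

    Left rows without a match are kept, with the right-only columns set to None.
    Assumes every row in `right` has the same columns.
    """
    right_only = [col for col in (right[0] if right else {}) if col not in keys]
    index = {}
    for row in right:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)

    joined = []
    for row in left:
        matches = index.get(tuple(row[k] for k in keys))
        if matches:
            for match in matches:
                joined.append({**row, **{col: match[col] for col in right_only}})
        else:
            joined.append({**row, **{col: None for col in right_only}})
    return joined


ca_rows = [
    {"week": "2023-05-29", "group": "A", "ca_infections": 120},
    {"week": "2023-05-29", "group": "B", "ca_infections": None},  # incomplete row
    {"week": "2023-06-05", "group": "A", "ca_infections": 150},
]
la_rows = [
    {"week": "2023-05-29", "group": "A", "la_infections": 40},
]

joined = left_join(complete_cases(ca_rows), complete_cases(la_rows), keys=["week", "group"])
for row in joined:
    print(row)
```

Note that a left join can reintroduce missing values: the second California row here has no LA County match, so its la_infections field is empty even though both inputs were complete before joining.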

Additional wrangling was done for visualization, including removing unnecessary columns to focus on the subset of data relevant to each visualization.

Visualizations

Figure 1: Infection Rates by Race & Ethnicity in California

Comparison of Infection Rates by Race & Ethnicity
Race & Ethnicity Category | Total Infections | Total Population | Infection Rate (per 100,000 persons)
American Indian or Alaska Native, Non-Hispanic | 20450 | 155922 | 13115.53
Asian, Non-Hispanic | 424800 | 3896784 | 10901.30
Black, Non-Hispanic | 199553 | 1502119 | 13284.77
Hispanic (any race) | 1367531 | 10924674 | 12517.82
Multiracial (two or more of above races), Non-Hispanic | 99210 | 929795 | 10670.09
Native Hawaiian or Pacific Islander, Non-Hispanic | 14831 | 120583 | 12299.41
White, Non-Hispanic | 1537052 | 12601592 | 12197.28

Figure 2: Racial & Ethnic Distribution of California’s Cumulative Infections

Figure 3: Infection Rates in LA County by Age

Results

Figure 1 shows the total infections, total population, and infection rate per 100,000 persons by race & ethnicity from May 29, 2023, to December 25, 2023:
American Indian or Alaska Native (Non-Hispanic): 20450 infections, 155922 individuals, infection rate of 13115.53 per 100,000 persons.
Asian (Non-Hispanic): 424800 infections, 3896784 individuals, infection rate of 10901.30 per 100,000 persons.
Black (Non-Hispanic): 199553 infections, 1502119 individuals, infection rate of 13284.77 per 100,000 persons.
Hispanic (any race): 1367531 infections, 10924674 individuals, infection rate of 12517.82 per 100,000 persons.
Multiracial (Non-Hispanic): 99210 infections, 929795 individuals, infection rate of 10670.09 per 100,000 persons.
Native Hawaiian or Pacific Islander (Non-Hispanic): 14831 infections, 120583 individuals, infection rate of 12299.41 per 100,000 persons.
White (Non-Hispanic): 1537052 infections, 12601592 individuals, infection rate of 12197.28 per 100,000 persons.
The Black (Non-Hispanic) and American Indian or Alaska Native (Non-Hispanic) groups have the highest infection rates, while the Multiracial (Non-Hispanic) and Asian (Non-Hispanic) groups have the lowest.
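The rates reported for Figure 1 follow from dividing each group's total infections by its total population and scaling to 100,000 persons. The short Python check below reproduces that arithmetic from the counts given above; it is a sketch for verification rather than the project's R code, and the names FIGURE_1 and rate_per_100k are introduced here for illustration.

```python
# Counts are copied from Figure 1; the names used here are illustrative.
FIGURE_1 = {
    "American Indian or Alaska Native, Non-Hispanic": (20450, 155922),
    "Asian, Non-Hispanic": (424800, 3896784),
    "Black, Non-Hispanic": (199553, 1502119),
    "Hispanic (any race)": (1367531, 10924674),
    "Multiracial, Non-Hispanic": (99210, 929795),
    "Native Hawaiian or Pacific Islander, Non-Hispanic": (14831, 120583),
    "White, Non-Hispanic": (1537052, 12601592),
}


def rate_per_100k(infections, population):
    """Infections per 100,000 persons in the population."""
    return infections / population * 100_000


rates = {
    group: round(rate_per_100k(infections, population), 2)
    for group, (infections, population) in FIGURE_1.items()
}

# List groups from highest to lowest infection rate.
for group, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(f"{group}: {rate:.2f} per 100,000")
```

Sorting by rate rather than by count changes the picture: the groups with the most infections (White, Non-Hispanic and Hispanic) fall in the middle of the ranking, while Black, Non-Hispanic ranks first.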

Figure 2 shows the distribution of cumulative infections by race/ethnicity. White, Non-Hispanic individuals account for the largest share of infections (33.8%), followed by Hispanic (any race) with 30.2%. Asian, Non-Hispanic represents 9.3%, and Black, Non-Hispanic accounts for 4.4%. Smaller groups include Multiracial, Non-Hispanic (2.2%), American Indian/Alaska Native, Non-Hispanic (0.5%), and Native Hawaiian/Pacific Islander, Non-Hispanic (0.3%). This distribution shows how cumulative infections are divided among racial/ethnic groups, with White and Hispanic groups accounting for the largest shares.

Figure 3 depicts the infection rate per 100,000 by age, accounting for race/ethnicity and gender. All races/ethnicities follow very similar trends in infection rates and have similar values per 100,000 (ages 0-17: ~0.025 infections per 100,000, 18-49: ~0.120, 50-64: ~0.055, 65+: ~0.130). The infection rate does not increase steadily with age: the 18-49 age group has a higher infection rate than the 50-64 age group, while among the remaining age groups a lower age corresponds to a lower infection rate.

Discussion

Race, ethnicity, and age are among the key social determinants of health that play pivotal roles in infection rates in California’s outbreak. Race & ethnicity can significantly influence disease distribution through their intersections with systemic inequalities, cultural factors, and access to resources, while age can directly affect individual resilience and exposure to disease.

The distribution of cumulative infections by race/ethnicity in this dataset highlights differences across groups (Fig 2). White, Non-Hispanic and Hispanic populations account for 33.8% and 30.2% of infections respectively, consistent with their being the two largest populations in the dataset (Fig 1). Hispanic populations often experience higher infection rates due to barriers such as limited access to healthcare, socioeconomic challenges, and systemic inequities. Black, Non-Hispanic and Asian, Non-Hispanic groups account for moderate shares of infections, which may vary depending on regional differences and healthcare access. The relatively low percentages among American Indian/Alaska Native and Native Hawaiian/Pacific Islander populations may reflect their smaller population sizes, but they also emphasize the need for targeted public health interventions to address disparities in these communities.

Age category appears to play a role in infection rates in LA County (Fig 3), with all demographic groups showing very similar age-related infection trends. This observation is likely to generalize to the entire California population, as LA County is reasonably representative of the state (diverse subpopulations and a large sample size).

Rather than allocating resources in strict proportion to subpopulation size, we recommend an equity-based approach that over-allocates to disproportionately affected subpopulations. While White and Hispanic subpopulations account for the largest number of infections, Black and American Indian or Alaska Native subpopulations, especially the elderly (ages 65+) and working-age (18-49) groups, are disproportionately affected by California’s outbreak. As such, we suggest dedicating a relatively larger share of resources to those subpopulations in order to equitably address California’s outbreak.