Project Milestone #6

Authors

Angela Bartolo

Lucy Lu

Tyana Michelle Perera

Problem statement

Working under the California Department of Public Health, our team identified health disparities caused by a novel infectious respiratory disease outbreak in California. We were provided three datasets: two datasets contained weekly data about cases and case severity by demographic information in non-LA counties (sim_novelid_CA.csv) and LA County (sim_novelid_LACounty.csv), respectively; a third dataset contained population estimate data and demographic information across all CA counties (ca_pop_2023.csv). Demographic information across all datasets included age, sex, and race/ethnicity.

The project leveraged these datasets to understand whether the disease outbreak was disproportionately affecting certain demographic or geographic populations. This was achieved by processing and joining the datasets, creating new variables, and then visualizing the data in multiple modalities. The findings provided evidence of health disparities, which will inform equitable resource allocation.

Methods

Creation of the joined dataset

To analyze the data, a single dataset was created by combining all three datasets provided. The following steps were taken to create the joined dataset for visualization:

  1. Read in the two datasets containing simulated novel infectious disease case data: sim_novelid_CA.csv and sim_novelid_LACounty.csv. They contain weekly data about a new infectious respiratory disease in 2023 for all non-LA CA counties and LA County, respectively.

  2. Clean both datasets to ensure consistency before joining them. This includes making the race/ethnicity values consistent, renaming columns, and removing columns not required for further analysis.

  3. Row bind the two datasets such that one dataset contains infectious disease data from all CA counties.

  4. Group by county and race_ethnicity to compute the sum of cases and severe cases in each racial/ethnic group per county to create the case dataset df_morbidity_agg.

  5. Read the dataset ca_pop_2023.csv that contains the population estimate.

  6. Perform data cleaning with the population dataset, such as recoding values to be consistent with the case dataset.

  7. Create the population dataset df_population_agg by grouping by health_officer_region, county, and race_ethnicity to compute the total population in each racial/ethnic group per county. This is used as the denominator when calculating rate metrics later.

  8. Create the final dataset df_joined by joining the case dataset df_morbidity_agg and the population dataset df_population_agg.
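The steps above can be sketched in Python using only the standard library. This is a minimal illustration under stated assumptions, not the code used for the report. The dataset names df_morbidity_agg, df_population_agg, and df_joined and the columns county, race_ethnicity, and health_officer_region come from the text. The function names, the input columns new_infections, new_severe, and population, and all of the rows are hypothetical. The join type is also an assumption: the sketch keeps every population stratum and fills missing case counts with zero.

```python
from collections import defaultdict

def aggregate_cases(rows):
    """Sum infections and severe cases per (county, race_ethnicity)."""
    totals = defaultdict(lambda: {"total_infections": 0, "total_severe": 0})
    for r in rows:
        key = (r["county"], r["race_ethnicity"])
        totals[key]["total_infections"] += r["new_infections"]
        totals[key]["total_severe"] += r["new_severe"]
    return dict(totals)

def aggregate_population(rows):
    """Sum population per (county, race_ethnicity), keeping the region."""
    totals = {}
    for r in rows:
        key = (r["county"], r["race_ethnicity"])
        entry = totals.setdefault(
            key,
            {"health_officer_region": r["health_officer_region"],
             "total_population": 0},
        )
        entry["total_population"] += r["population"]
    return totals

def join(morbidity, population):
    """Assumed left join from population to cases; no cases means zeros."""
    joined = []
    for key in sorted(population):
        cases = morbidity.get(key, {"total_infections": 0, "total_severe": 0})
        joined.append({"county": key[0], "race_ethnicity": key[1],
                       **population[key], **cases})
    return joined

# Hypothetical rows standing in for the cleaned CSV contents.
ca_rows = [
    {"county": "Alameda", "race_ethnicity": "Asian, Non-Hispanic",
     "new_infections": 5, "new_severe": 1},
    {"county": "Alameda", "race_ethnicity": "Asian, Non-Hispanic",
     "new_infections": 7, "new_severe": 0},
]
la_rows = [
    {"county": "Los Angeles", "race_ethnicity": "Asian, Non-Hispanic",
     "new_infections": 10, "new_severe": 2},
]
pop_rows = [
    {"county": "Alameda", "race_ethnicity": "Asian, Non-Hispanic",
     "health_officer_region": "Bay Area", "population": 1000},
    {"county": "Alameda", "race_ethnicity": "Asian, Non-Hispanic",
     "health_officer_region": "Bay Area", "population": 500},
    {"county": "Los Angeles", "race_ethnicity": "Asian, Non-Hispanic",
     "health_officer_region": "Los Angeles", "population": 2000},
    {"county": "Los Angeles", "race_ethnicity": "Black, Non-Hispanic",
     "health_officer_region": "Los Angeles", "population": 800},
]

all_rows = ca_rows + la_rows                        # step 3: row bind
df_morbidity_agg = aggregate_cases(all_rows)        # step 4
df_population_agg = aggregate_population(pop_rows)  # step 7
df_joined = join(df_morbidity_agg, df_population_agg)  # step 8
```

Keeping strata with no reported cases matters for the rate calculations that follow, because dropping them would remove the zero-rate strata from the summary statistics.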

Create New Variables

Two new rate variables were constructed within the joined dataset df_joined. infection_rate is total_infections divided by total_population, expressed per 100 people, and severe_infection_rate is total_severe divided by total_population, expressed per 1,000 people. Both are computed for each stratum defined by county and race_ethnicity.
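As a worked illustration of the rate calculation, the sketch below applies the same formula to the Central California row of Table 3. That row is a regional total rather than a county-by-race stratum, but the arithmetic is identical. The function name rate and the choice to return None for an empty population are assumptions made for this example.

```python
def rate(numerator, population, per):
    """Return cases per `per` people, or None when the population is zero."""
    if population == 0:
        return None  # assumed handling; avoids a division by zero
    return numerator / population * per

# Counts taken from the Central California row of Table 3.
infection_rate = rate(804517, 4432134, per=100)
severe_infection_rate = rate(19897, 4432134, per=1000)

print(round(infection_rate, 2), round(severe_infection_rate, 2))
```

Rounded to two decimals, these reproduce the published values of 18.15 per 100 people and 4.49 per 1,000 people.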

A second dataset for time-chart visualization was also created (the last visualization in this report):

  • df_morbidity_time_race was created by grouping the morbidity data for all CA counties by dt_diagnosis and race_ethnicity. df_population_race was created by grouping the granular population data by race_ethnicity alone.

  • These two datasets were joined into df_joined_time_race, where we also created severe_infection_rate as the ratio of new severe cases to total population. df_joined_time_race therefore contains the race/ethnicity-specific severe infection rate over time.
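The time-series dataset can be sketched the same way. This is an illustration only: the function name, the input columns new_severe and population, and the sample rows are hypothetical, and the scaling to cases per 1,000 people is an assumption taken from the Data Dictionary rather than from the description above.

```python
from collections import defaultdict

def severe_rate_by_week_and_race(morbidity_rows, population_rows, per=1000):
    """Build rows shaped like df_joined_time_race from raw rows."""
    # df_population_race: population summed by race_ethnicity alone.
    pop_by_race = defaultdict(int)
    for r in population_rows:
        pop_by_race[r["race_ethnicity"]] += r["population"]

    # df_morbidity_time_race: severe cases by dt_diagnosis and race_ethnicity.
    severe = defaultdict(int)
    for r in morbidity_rows:
        severe[(r["dt_diagnosis"], r["race_ethnicity"])] += r["new_severe"]

    # Join on race_ethnicity and compute the rate for each week.
    return [
        {"dt_diagnosis": dt, "race_ethnicity": race, "new_severe": n,
         "severe_infection_rate": n / pop_by_race[race] * per}
        for (dt, race), n in sorted(severe.items())
        if pop_by_race.get(race)  # skip groups without a population estimate
    ]

# Hypothetical rows; ISO date strings sort chronologically.
morbidity = [
    {"dt_diagnosis": "2023-05-29", "race_ethnicity": "A", "new_severe": 2},
    {"dt_diagnosis": "2023-05-29", "race_ethnicity": "A", "new_severe": 1},
    {"dt_diagnosis": "2023-06-05", "race_ethnicity": "A", "new_severe": 4},
    {"dt_diagnosis": "2023-05-29", "race_ethnicity": "B", "new_severe": 1},
]
population = [
    {"race_ethnicity": "A", "population": 1000},
    {"race_ethnicity": "A", "population": 2000},
    {"race_ethnicity": "B", "population": 500},
]

df_joined_time_race = severe_rate_by_week_and_race(morbidity, population)
```

Because the denominator is the statewide population of each group, every weekly rate for a group shares the same denominator, which keeps the weekly values comparable across the year.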

Results

The following tables and bar charts summarize the infectious disease data for California in 2023 from the joined dataset, df_joined, whose columns are described in the Data Dictionary below.

Data Dictionary

Variable | Type | Description
health_officer_region | character | California Health Officer Region: Central California, Greater Sierra Sacramento, Southern California, Rural North, Bay Area, Los Angeles
county | character | County of residence of novel ID cases
race_ethnicity | character | Race-ethnicity categorization as defined by CA Department of Finance
total_population | numeric | Total population estimates from the CA Department of Finance for 2023
total_infections | numeric | Number of newly diagnosed individuals
total_severe | numeric | Number of newly identified individuals having severe disease requiring hospitalization
infection_rate | numeric | Rate of newly diagnosed individuals per 100 people
severe_infection_rate | numeric | Rate of newly diagnosed individuals having severe disease requiring hospitalization per 1,000 people
Columns for the joined dataset df_joined

Descriptive Statistics

Table 1. Descriptive Statistics for Infection Rate (per 100 people)
n | population | mean | sd | median | IQR | min | max
406 | 39109070 | 12.55 | 8.43 | 9.89 | 7.24 | 0.00 | 66.64
Rates calculated per 100 people. N = 406 strata.

Across the 406 strata, which together cover a total population of 39,109,070 people, the mean infection rate for California in 2023 was 12.55 cases per 100 people, with a standard deviation of 8.43 cases per 100 people. The median was 9.89 cases per 100 people, with an interquartile range of 7.24 cases per 100 people. Stratum-level rates ranged from a minimum of 0 to a maximum of 66.64 cases per 100 people.

Table 2. Descriptive Statistics for Severe Infection Rate (per 1000 people)
n | population | mean | sd | median | IQR | min | max
406 | 39109070 | 3.35 | 3.17 | 2.60 | 2.94 | 0.00 | 25.98
Rates calculated per 1,000 people. N = 406 strata.

Across the same 406 strata, the mean severe infection rate for California in 2023 was 3.35 cases per 1,000 people, with a standard deviation of 3.17 cases per 1,000 people. The median was 2.60 cases per 1,000 people, with an interquartile range of 2.94 cases per 1,000 people. Stratum-level rates ranged from a minimum of 0 to a maximum of 25.98 cases per 1,000 people.
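The summary columns in Tables 1 and 2 can be reproduced from a list of stratum-level rates along the following lines. This is a sketch with hypothetical values, not the report's code. The function name describe is an assumption, and the quartile method ("inclusive" in Python's statistics module) is also an assumption, since the report does not state which quantile definition was used and different definitions give slightly different IQRs.

```python
import statistics

def describe(rates):
    """Summary statistics over stratum-level rates, one value per stratum."""
    q1, _, q3 = statistics.quantiles(rates, n=4, method="inclusive")
    return {
        "n": len(rates),
        "mean": statistics.mean(rates),
        "sd": statistics.stdev(rates),   # sample standard deviation
        "median": statistics.median(rates),
        "IQR": q3 - q1,
        "min": min(rates),
        "max": max(rates),
    }

# Hypothetical stratum rates per 100 people.
example_rates = [0.0, 5.0, 10.0, 15.0, 20.0]
summary = describe(example_rates)
```

Each stratum counts equally in these statistics regardless of its population, so the mean of the stratum rates is not the same quantity as a statewide rate computed from total cases over total population.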

Regional Comparison

Table 3. California Infection Data Grouped by Region
Region | Total Population | Total Infections | Infection Rate (per 100) | Total Severe Infections | Severe Infection Rate (per 1,000)
Central California | 4432134 | 804517 | 18.15 | 19897 | 4.49
Greater Sierra Sacramento | 2973210 | 460390 | 15.48 | 13134 | 4.42
Southern California | 12802429 | 1503964 | 11.75 | 41650 | 3.25
Rural North | 683715 | 72896 | 10.66 | 2559 | 3.74
Bay Area | 8391874 | 821660 | 9.79 | 24577 | 2.93
Los Angeles | 9825708 | 886156 | 9.02 | 25109 | 2.56
Infection Rate is per 100 people. Severe Infection Rate is per 1000 people.

California infection rates and severe infection rates in 2023 varied by region. Central California had the highest infection rates and severe infection rates, followed by the Greater Sierra Sacramento region.
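The regional rates in Table 3 can be checked directly from the published counts. The sketch below recomputes each rate and also derives a pooled statewide rate from the column totals. The variable and function names are chosen for this example only.

```python
# (total population, total infections, total severe infections) from Table 3
table3 = {
    "Central California": (4432134, 804517, 19897),
    "Greater Sierra Sacramento": (2973210, 460390, 13134),
    "Southern California": (12802429, 1503964, 41650),
    "Rural North": (683715, 72896, 2559),
    "Bay Area": (8391874, 821660, 24577),
    "Los Angeles": (9825708, 886156, 25109),
}

def rates(population, infections, severe):
    """Infection rate per 100 and severe infection rate per 1,000."""
    return (round(infections / population * 100, 2),
            round(severe / population * 1000, 2))

regional = {region: rates(*counts) for region, counts in table3.items()}

# Pooled statewide rates: sum the counts first, then divide.
total_population = sum(c[0] for c in table3.values())
total_infections = sum(c[1] for c in table3.values())
total_severe = sum(c[2] for c in table3.values())
pooled = rates(total_population, total_infections, total_severe)

for region, (inf_rate, sev_rate) in regional.items():
    print(f"{region}: {inf_rate} per 100, {sev_rate} per 1,000")
print("Pooled:", pooled)
```

The regional populations sum to 39,109,070, matching Tables 1 and 2. The pooled rates come to about 11.63 per 100 people and 3.25 per 1,000 people, somewhat below the means of the stratum-level rates (12.55 and 3.35), which weight every stratum equally.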

Racial & Ethnic Comparison

Table 4. California Infection Data Grouped by Race/Ethnicity
Race/Ethnicity | Total Population | Total Infections | Infection Rate (per 100) | Total Severe Infections | Severe Infection Rate (per 1,000)
American Indian or Alaska Native, Non-Hispanic | 158672 | 22195 | 13.99 | 671 | 4.23
White, Non-Hispanic | 13848282 | 1778774 | 12.84 | 63448 | 4.58
Black, Non-Hispanic | 2211518 | 271836 | 12.29 | 7118 | 3.22
Hispanic (any race) | 14829946 | 1796696 | 12.12 | 36484 | 2.46
Native Hawaiian or Pacific Islander, Non-Hispanic | 153729 | 16921 | 11.01 | 410 | 2.67
Asian, Non-Hispanic | 6295420 | 546770 | 8.69 | 16568 | 2.63
Multiracial (two or more of above races), Non-Hispanic | 1611503 | 116391 | 7.22 | 2227 | 1.38
Infection Rate is per 100 people. Severe Infection Rate is per 1000 people.

The bar chart and Table 4 show 2023 infection rates per 100 people across racial & ethnic groups in California. Rates vary by group, with the American Indian or Alaska Native (Non-Hispanic) group showing the highest infection rate in this dataset, followed by the White (Non-Hispanic) group. Hispanic (any race) and Black (Non-Hispanic) groups also have relatively high infection rates, while Asian (Non-Hispanic) and Multiracial (Non-Hispanic) groups show lower infection rates. The figure is descriptive: it highlights differences across groups but does not explain them.
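Because the ordering of groups depends on which rate is used, a short ranking over the published Table 4 values makes the comparison explicit. The function name top_groups and the choice of the top two groups are assumptions for this sketch.

```python
# (infection rate per 100, severe infection rate per 1,000) from Table 4
table4 = {
    "American Indian or Alaska Native, Non-Hispanic": (13.99, 4.23),
    "White, Non-Hispanic": (12.84, 4.58),
    "Black, Non-Hispanic": (12.29, 3.22),
    "Hispanic (any race)": (12.12, 2.46),
    "Native Hawaiian or Pacific Islander, Non-Hispanic": (11.01, 2.67),
    "Asian, Non-Hispanic": (8.69, 2.63),
    "Multiracial (two or more of above races), Non-Hispanic": (7.22, 1.38),
}

def top_groups(table, column, k=2):
    """Names of the k groups with the highest value in the given column."""
    return sorted(table, key=lambda g: table[g][column], reverse=True)[:k]

top_infection = top_groups(table4, column=0)
top_severe = top_groups(table4, column=1)

print("Highest infection rates:", top_infection)
print("Highest severe infection rates:", top_severe)
```

The same two groups lead on both measures, but in opposite order: the American Indian or Alaska Native (Non-Hispanic) group has the highest infection rate, while the White (Non-Hispanic) group has the highest severe infection rate.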

Course of Pandemic

This figure shows weekly trends in severe infection rates by race & ethnicity in California during 2023. Across all race & ethnicity groups, the rate of severe infection increased from May to October and then decreased from October to December. Consistent with the previous visualizations, there is substantial variation across race & ethnicity groups, with the White (Non-Hispanic) and American Indian or Alaska Native (Non-Hispanic) groups having the highest severe infection rates over the study period.

Discussion

By region, Central California and Greater Sierra Sacramento would require greater resource allocation. These regions have the two highest infection rates and severe infection rates, and their rates are similar to each other. Both regions’ rates also exceed the mean and median of the stratum-level infection rates and severe infection rates for the state (Tables 1 and 2). This suggests clear regional disparities that require attention.

By race & ethnicity, the American Indian or Alaska Native (Non-Hispanic) and White (Non-Hispanic) groups would require greater resource allocation. These two groups have the highest infection rates and severe infection rates, and their rates exceed the mean and median of the stratum-level infection rates and severe infection rates for the state. The time trend also shows that these two groups have consistently higher severe infection rates over time, which indicates clear racial & ethnic disparities and a need for greater resource allocation.