Data Dictionary

| Variable | Type | Description |
|---|---|---|
| health_officer_region | character | California Health Officer Region: Central California, Greater Sierra Sacramento, Southern California, Rural North, Bay Area, Los Angeles |
| county | character | County of residence of novel ID cases |
| race_ethnicity | character | Race-ethnicity categorization as defined by CA Department of Finance |
| total_population | numeric | Total population estimates from the CA Department of Finance for 2023 |
| total_infections | numeric | Number of newly diagnosed individuals |
| total_severe | numeric | Number of newly identified individuals having severe disease requiring hospitalization |
| infection_rate | numeric | Rate of newly diagnosed individuals per 100 people |
| severe_infection_rate | numeric | Rate of newly diagnosed individuals having severe disease requiring hospitalization per 1000 people |
| Columns for the joined dataset df_joined | ||
Project Milestone #6
Problem statement
Working under the California Department of Public Health, our team identified health disparities caused by a novel infectious respiratory disease outbreak in California. We were provided three datasets: two datasets contained weekly data about cases and case severity by demographic information in non-LA counties (sim_novelid_CA.csv) and LA County (sim_novelid_LACounty.csv), respectively; a third dataset contained population estimate data and demographic information across all CA counties (ca_pop_2023.csv). Demographic information across all datasets included age, sex, and race/ethnicity.
The project leveraged these datasets to understand whether the disease outbreak was disproportionately affecting certain demographic or geographic populations. This was achieved by processing and joining the datasets, creating new variables, and then visualizing the data in multiple modalities. The findings provided evidence of health disparities, which will inform equitable resource allocation.
Methods
Creation of the joined dataset
To analyze the data, a single dataset was created by combining all three datasets provided. The following steps were taken to create the joined dataset for visualization:
1. Read in the two datasets containing simulated novel infectious disease case data: sim_novelid_CA.csv and sim_novelid_LACounty.csv. They contain weekly data about a new infectious respiratory disease in 2023 for all non-LA CA counties and LA County, respectively.
2. Data was cleaned to ensure consistency before joining the two datasets. This included: making ordinal values for race/ethnicity consistent, renaming columns, and removing columns not required for further analysis.
3. Row bind the two datasets such that one dataset contains infectious disease data from all CA counties.
4. Group by county and race_ethnicity to compute the sum of cases and severe cases in each racial/ethnic group per county to create the case dataset df_morbidity_agg.
5. Read the dataset ca_pop_2023.csv that contains the population estimate.
6. Perform data cleaning with the population dataset, such as recoding values to be consistent with the case dataset.
7. Create the population dataset df_population_agg by grouping health_officer_region, county, race_ethnicity to compute the total population in each racial/ethnic group per county. This will be used as the denominator in calculating rate metrics later.
8. The final dataset df_joined is created by joining the case dataset df_morbidity_agg and the population dataset df_population_agg.
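The steps above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the code used for the report. The input rows are made up, the column names new_infections, new_severe and pop are assumptions (the source does not name the raw columns), and the join is assumed to keep every population stratum and fill missing case counts with zero.

```python
from collections import defaultdict

# Hypothetical miniature inputs standing in for the cleaned files (values are made up).
cases_ca = [
    {"county": "Alameda", "race_ethnicity": "Asian, Non-Hispanic", "new_infections": 10, "new_severe": 1},
    {"county": "Alameda", "race_ethnicity": "Asian, Non-Hispanic", "new_infections": 5, "new_severe": 0},
    {"county": "Alameda", "race_ethnicity": "White, Non-Hispanic", "new_infections": 8, "new_severe": 2},
]
cases_la = [
    {"county": "Los Angeles", "race_ethnicity": "Asian, Non-Hispanic", "new_infections": 20, "new_severe": 3},
]
population = [
    {"health_officer_region": "Bay Area", "county": "Alameda", "race_ethnicity": "Asian, Non-Hispanic", "pop": 600},
    {"health_officer_region": "Bay Area", "county": "Alameda", "race_ethnicity": "Asian, Non-Hispanic", "pop": 400},
    {"health_officer_region": "Bay Area", "county": "Alameda", "race_ethnicity": "White, Non-Hispanic", "pop": 500},
    {"health_officer_region": "Los Angeles", "county": "Los Angeles", "race_ethnicity": "Asian, Non-Hispanic", "pop": 2000},
    {"health_officer_region": "Los Angeles", "county": "Los Angeles", "race_ethnicity": "White, Non-Hispanic", "pop": 1500},
]


def aggregate(rows, keys, value_cols):
    """Sum value_cols within each combination of keys (a group-by and sum)."""
    totals = defaultdict(lambda: dict.fromkeys(value_cols, 0))
    for row in rows:
        group = tuple(row[k] for k in keys)
        for col in value_cols:
            totals[group][col] += row[col]
    return dict(totals)


# Step 3: row bind the non-LA and LA case data.
cases_all = cases_ca + cases_la

# Step 4: cases and severe cases per county and race/ethnicity.
df_morbidity_agg = aggregate(cases_all, ("county", "race_ethnicity"), ("new_infections", "new_severe"))

# Step 7: population per region, county and race/ethnicity.
df_population_agg = aggregate(population, ("health_officer_region", "county", "race_ethnicity"), ("pop",))

# Step 8: join on county and race_ethnicity (assumed: keep all population strata).
df_joined = []
for (region, county, race), pop in df_population_agg.items():
    counts = df_morbidity_agg.get((county, race), {"new_infections": 0, "new_severe": 0})
    df_joined.append({
        "health_officer_region": region,
        "county": county,
        "race_ethnicity": race,
        "total_population": pop["pop"],
        "total_infections": counts["new_infections"],
        "total_severe": counts["new_severe"],
    })

for row in df_joined:
    print(row)
```

In this toy input the Los Angeles, White (Non-Hispanic) stratum has population but no case rows, so under the zero-fill assumption it appears in df_joined with zero infections rather than being dropped.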
Create New Variables
Two new rate variables were constructed within the joined dataset df_joined. infection_rate is total_infections divided by total_population, expressed per 100 people; severe_infection_rate is total_severe divided by total_population, expressed per 1,000 people. Both are computed for each stratum defined by county and race_ethnicity.
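As a worked check of the rate definitions, the short sketch below recomputes both rates from counts. The helper name rate_per is hypothetical, and the inputs are the Central California totals reported in Table 3.

```python
def rate_per(count, population, per):
    """Return count / population scaled to `per` people, rounded to 2 decimals."""
    if population <= 0:
        raise ValueError("population must be positive")
    return round(count / population * per, 2)


# Central California totals from Table 3.
infection_rate = rate_per(804517, 4432134, per=100)
severe_infection_rate = rate_per(19897, 4432134, per=1000)

print(infection_rate, severe_infection_rate)  # 18.15 4.49
```

Both values match the Central California row of Table 3.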
A second dataset for time-chart visualization was also created (the last visualization in this report):
1. df_morbidity_time_race was created by grouping the morbidity data for all CA counties by dt_diagnosis and race_ethnicity.
2. df_population_race was created by grouping granular population data by race_ethnicity alone.
3. These two datasets were joined into df_joined_time_race, where we also created severe_infection_rate as the ratio of new severe cases to total population. df_joined_time_race therefore contains the race/ethnicity-specific severe infection rate over time.
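The time-series dataset follows the same pattern, with the diagnosis date in the grouping key and population taken at the race/ethnicity level only. The sketch below is illustrative: the rows, dates, group labels and the new_severe column name are made up, and the per-1,000 scaling is an assumption carried over from the data dictionary definition of severe_infection_rate.

```python
from collections import defaultdict

# Hypothetical weekly case rows for all CA counties (values are made up).
weekly_cases = [
    {"dt_diagnosis": "2023-05-29", "county": "A", "race_ethnicity": "Group X", "new_severe": 2},
    {"dt_diagnosis": "2023-05-29", "county": "B", "race_ethnicity": "Group X", "new_severe": 3},
    {"dt_diagnosis": "2023-05-29", "county": "A", "race_ethnicity": "Group Y", "new_severe": 1},
    {"dt_diagnosis": "2023-06-05", "county": "A", "race_ethnicity": "Group X", "new_severe": 4},
]

# Stand-in for df_population_race: statewide population by race/ethnicity alone.
df_population_race = {"Group X": 10000, "Group Y": 5000}

# Stand-in for df_morbidity_time_race: severe cases per week and race/ethnicity.
df_morbidity_time_race = defaultdict(int)
for row in weekly_cases:
    df_morbidity_time_race[(row["dt_diagnosis"], row["race_ethnicity"])] += row["new_severe"]

# Join on race_ethnicity and compute the weekly severe infection rate per 1,000 people.
df_joined_time_race = [
    {
        "dt_diagnosis": week,
        "race_ethnicity": race,
        "new_severe": n_severe,
        "total_population": df_population_race[race],
        "severe_infection_rate": n_severe * 1000 / df_population_race[race],
    }
    for (week, race), n_severe in sorted(df_morbidity_time_race.items())
]

for row in df_joined_time_race:
    print(row)
```

Because the denominator is the same for every week within a group, differences in the weekly rate for a group reflect changes in case counts only.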
Results
The following tables and bar charts summarize the 2023 infectious disease data for California from the joined dataset, df_joined, whose columns are described in the Data Dictionary.
Data Dictionary
Descriptive Statistics
Table 1. Descriptive Statistics for Infection Rate (per 100 people)

| n | population | mean | sd | median | IQR | min | max |
|---|---|---|---|---|---|---|---|
| 406 | 39109070 | 12.55 | 8.43 | 9.89 | 7.24 | 0.00 | 66.64 |
| Rates calculated per 100 people. N = 406 strata. | |||||||
Across the 406 county-by-race/ethnicity strata in California in 2023 (total population 39,109,070), the mean infection rate was 12.55 cases per 100 people, with a standard deviation of 8.43 cases per 100 people. The median was 9.89 cases per 100 people, with an interquartile range of 7.24 cases per 100 people. Stratum rates ranged from a minimum of 0 to a maximum of 66.64 cases per 100 people.
Table 2. Descriptive Statistics for Severe Infection Rate (per 1,000 people)

| n | population | mean | sd | median | IQR | min | max |
|---|---|---|---|---|---|---|---|
| 406 | 39109070 | 3.35 | 3.17 | 2.60 | 2.94 | 0.00 | 25.98 |
| Rates calculated per 1,000 people. N = 406 strata. | |||||||
Across the same 406 strata (total population 39,109,070), the mean severe infection rate was 3.35 cases per 1,000 people, with a standard deviation of 3.17 cases per 1,000 people. The median was 2.60 cases per 1,000 people, with an interquartile range of 2.94 cases per 1,000 people. Stratum rates ranged from a minimum of 0 to a maximum of 25.98 cases per 1,000 people.
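The means in Tables 1 and 2 appear to be averages of stratum-level rates, so they need not equal the pooled statewide rate, in which large strata carry more weight. As a hedged illustration (not part of the report's own code), the sketch below pools the regional totals reported in Table 3 into statewide rates.

```python
# (total_population, total_infections, total_severe) per region, copied from Table 3.
regions = {
    "Central California": (4432134, 804517, 19897),
    "Greater Sierra Sacramento": (2973210, 460390, 13134),
    "Southern California": (12802429, 1503964, 41650),
    "Rural North": (683715, 72896, 2559),
    "Bay Area": (8391874, 821660, 24577),
    "Los Angeles": (9825708, 886156, 25109),
}

total_population = sum(pop for pop, _, _ in regions.values())
total_infections = sum(inf for _, inf, _ in regions.values())
total_severe = sum(sev for _, _, sev in regions.values())

# Pooled rates: statewide counts over statewide population.
pooled_infection_rate = total_infections / total_population * 100   # per 100 people
pooled_severe_rate = total_severe / total_population * 1000         # per 1,000 people

print(total_population)                 # 39109070, matching Tables 1 and 2
print(round(pooled_infection_rate, 2))  # about 11.63
print(round(pooled_severe_rate, 2))     # about 3.25
```

The pooled values (about 11.63 per 100 and 3.25 per 1,000) sit slightly below the stratum means of 12.55 and 3.35, which is what one would expect if some small strata have high rates.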
Regional Comparison
Table 3. California Infection Data Grouped by Region

| Region | Total Population | Total Infections | Infection Rate (per 100) | Total Severe Infections | Severe Infection Rate (per 1,000) |
|---|---|---|---|---|---|
| Central California | 4432134 | 804517 | 18.15 | 19897 | 4.49 |
| Greater Sierra Sacramento | 2973210 | 460390 | 15.48 | 13134 | 4.42 |
| Southern California | 12802429 | 1503964 | 11.75 | 41650 | 3.25 |
| Rural North | 683715 | 72896 | 10.66 | 2559 | 3.74 |
| Bay Area | 8391874 | 821660 | 9.79 | 24577 | 2.93 |
| Los Angeles | 9825708 | 886156 | 9.02 | 25109 | 2.56 |
| Infection Rate is per 100 people. Severe Infection Rate is per 1000 people. | |||||
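California infection rates and severe infection rates in 2023 varied by region (Table 3). Central California had the highest infection rate (18.15 per 100 people) and severe infection rate (4.49 per 1,000 people), followed by Greater Sierra Sacramento (15.48 per 100 people and 4.42 per 1,000 people). Los Angeles had the lowest rates on both measures (9.02 per 100 people and 2.56 per 1,000 people).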
Racial & Ethnic Comparison
Table 4. California Infection Data Grouped by Race/Ethnicity

| Race/Ethnicity | Total Population | Total Infections | Infection Rate (per 100) | Total Severe Infections | Severe Infection Rate (per 1,000) |
|---|---|---|---|---|---|
| American Indian or Alaska Native, Non-Hispanic | 158672 | 22195 | 13.99 | 671 | 4.23 |
| White, Non-Hispanic | 13848282 | 1778774 | 12.84 | 63448 | 4.58 |
| Black, Non-Hispanic | 2211518 | 271836 | 12.29 | 7118 | 3.22 |
| Hispanic (any race) | 14829946 | 1796696 | 12.12 | 36484 | 2.46 |
| Native Hawaiian or Pacific Islander, Non-Hispanic | 153729 | 16921 | 11.01 | 410 | 2.67 |
| Asian, Non-Hispanic | 6295420 | 546770 | 8.69 | 16568 | 2.63 |
| Multiracial (two or more of above races), Non-Hispanic | 1611503 | 116391 | 7.22 | 2227 | 1.38 |
| Infection Rate is per 100 people. Severe Infection Rate is per 1000 people. | |||||
The bar chart and Table 4 show 2023 infection rates per 100 people across racial & ethnic groups in California. The American Indian or Alaska Native (Non-Hispanic) group had the highest infection rate (13.99), followed by the White (Non-Hispanic) group (12.84). The Black (Non-Hispanic) and Hispanic (any race) groups also had relatively high infection rates (12.29 and 12.12), while the Asian (Non-Hispanic) and Multiracial (Non-Hispanic) groups had the lowest (8.69 and 7.22). The figure is descriptive: it highlights differences across groups but does not explain them.
Course of Pandemic
This figure shows weekly trends in severe infection rates by race & ethnicity in California during 2023. Across all race & ethnicity groups, the rate of severe infection increased from May to October and then decreased from October to December. Consistent with the previous visualizations, rates varied substantially across race & ethnicity groups, with the White (Non-Hispanic) and American Indian or Alaska Native (Non-Hispanic) groups having the highest severe infection rates over the study period.
Discussion
By region, Central California and Greater Sierra Sacramento warrant greater resource allocation. These regions have the two highest infection rates (18.15 and 15.48 per 100 people) and the two highest severe infection rates (4.49 and 4.42 per 1,000 people), with the severe infection rates being nearly identical. Both regions exceed the mean and median stratum rates for the state on both measures (mean 12.55 and median 9.89 per 100 people; mean 3.35 and median 2.60 per 1,000 people). This points to regional disparities that require attention.
By race & ethnicity, the American Indian or Alaska Native (Non-Hispanic) and White (Non-Hispanic) groups warrant greater resource allocation. These groups have the two highest infection rates (13.99 and 12.84 per 100 people) and the two highest severe infection rates (4.23 and 4.58 per 1,000 people), all above the mean and median stratum rates for the state. The time trend also suggests that these two groups had consistently higher severe infection rates over the study period, which indicates racial & ethnic disparities and a need for greater resource allocation.