Project Milestone 6
2023 Bay Region Respiratory Disease Outbreak Analysis
RPubs link saved/submitted: https://rpubs.com/Juliana-Verruck/Project-Milestone-4-2025
VISUALIZATION #1:
This figure displays epidemic curves, drawn as histograms, of the incidence rate of infection in the Bay Area among adults 65+ compared with those under 65. Across the entire observation period, adults 65+ consistently experienced higher incidence rates, indicating a disproportionate burden of disease in older populations, regardless of sex or race.
VISUALIZATION #2:
This table summarizes the total number of new infections, population size, and incidence rates across Bay Area counties. By comparing incidence per 100,000 residents, it shows which counties are experiencing higher relative infection burdens within the region.
| County | Total Cases | Total Population | Incidence per 100,000 |
|---|---|---|---|
| Alameda County | 149,041 | 51,337,147 | 290.3 |
| Contra Costa County | 103,042 | 35,542,616 | 289.9 |
| Marin County | 24,318 | 7,897,064 | 307.9 |
| Monterey County | 44,136 | 13,530,756 | 326.2 |
| Napa County | 12,673 | 4,169,004 | 304.0 |
| San Benito County | 5,291 | 2,017,542 | 262.2 |
| San Francisco County | 85,116 | 26,288,589 | 323.8 |
| San Mateo County | 67,986 | 23,221,325 | 292.8 |
| Santa Clara County | 173,074 | 59,264,219 | 292.0 |
| Santa Cruz County | 26,333 | 8,156,131 | 322.9 |
| Solano County | 84,328 | 13,849,095 | 608.9 |
| Sonoma County | 46,322 | 14,874,606 | 311.4 |
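
The incidence column can be reproduced from the two count columns. The following is a minimal Python sketch, assuming the rate is simply total cases divided by the population denominator, multiplied by 100,000 and rounded to one decimal place; the function and variable names are illustrative and are not taken from the project code.

```python
# (county, total cases, population denominator, published incidence per 100,000),
# copied from the table above.
TABLE = [
    ("Alameda County", 149_041, 51_337_147, 290.3),
    ("Contra Costa County", 103_042, 35_542_616, 289.9),
    ("Marin County", 24_318, 7_897_064, 307.9),
    ("Monterey County", 44_136, 13_530_756, 326.2),
    ("Napa County", 12_673, 4_169_004, 304.0),
    ("San Benito County", 5_291, 2_017_542, 262.2),
    ("San Francisco County", 85_116, 26_288_589, 323.8),
    ("San Mateo County", 67_986, 23_221_325, 292.8),
    ("Santa Clara County", 173_074, 59_264_219, 292.0),
    ("Santa Cruz County", 26_333, 8_156_131, 322.9),
    ("Solano County", 84_328, 13_849_095, 608.9),
    ("Sonoma County", 46_322, 14_874_606, 311.4),
]


def incidence_per_100k(cases: int, population: int) -> float:
    """Cases per 100,000 population, rounded to one decimal place."""
    return round(cases / population * 100_000, 1)


# Recompute each county's rate and measure how far it sits from the published value.
recomputed = {county: incidence_per_100k(cases, pop) for county, cases, pop, _ in TABLE}
max_gap = max(abs(recomputed[county] - published) for county, _, _, published in TABLE)

for county, rate in recomputed.items():
    print(f"{county}: {rate}")
print(f"largest difference from the published column: {max_gap:.2f}")
```

Under that assumption, every recomputed rate agrees with the published column to one decimal place.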
VISUALIZATION #3:
In the previous two plots, we used data from across the health officer region to examine health disparities among age groups and counties. In the third and final plot, we build on those visualizations to explore the relationship between race/ethnicity and health outcomes among people in the Bay Area. For visualization purposes, the data are aggregated by month rather than by week.
Problem Statement
Effective infectious disease surveillance depends on the ability to integrate data from different sources, correct inconsistencies, and draw quality insights from the collected data. In this project, we worked with three simulated datasets representing morbidity in California, morbidity in Los Angeles County specifically, and California population estimates. These datasets all describe components of the same infectious disease outbreak, but each was formatted differently: they had different variable names, inconsistent formatting, and mismatched demographic information. This typically happens when data are collected by different agencies or systems, or under different documentation standards. As a result, the datasets required extensive cleaning and recoding before any meaningful analysis was possible. This cleaning included unifying column names, recoding demographic categories, and aligning temporal variables so that the datasets could be combined into a single statewide dataset.
While the initial goal was to conduct a statewide assessment, our team ultimately selected the Bay Area region as the analytical focus. We made this choice to simulate a realistic epidemiological scenario: during a widespread outbreak, agencies often prioritize regions with early transmission patterns or high population density. Infectious diseases can spread quickly in the Bay Area because of the close connections between counties, the diverse population, and the volume of cross-county traffic. Narrowing our scope allowed us to examine patterns within these specific counties in greater depth. After cleaning and standardizing the morbidity and population datasets, we calculated incidence rates per 100,000 residents for each county and organized all the data into one clean dataset. This showed us how consistent data formatting and cleaning make public health analyses more accurate and easier to interpret.
Methods
Data Sources
We analyzed three datasets provided as part of a simulated 2023 infectious respiratory disease outbreak in California. Weekly morbidity data for all counties except Los Angeles County were obtained from sim_novelid_CA.csv, with matching Los Angeles County data from sim_novelid_LACounty.csv. Population denominators were drawn from ca_pop_2023.csv, which contains 2023 population estimates by county, health officer region, age group, sex, and race/ethnicity. All datasets covered the outbreak period beginning in late May 2023.
Data Cleaning and Harmonization
Population age categories were first standardized to correspond to the age structure used in the morbidity datasets. The categories 0–4, 5–11, and 12–17 were collapsed into a single 0–17 group, while the remaining categories (18–49, 50–64, and 65+) were retained as reported. Population counts were then aggregated within each county, health officer region, age category, sex, and race/ethnicity to create mutually exclusive strata suitable for merging. The two morbidity datasets were combined into a single statewide file. Non-informative variables such as time_int were removed. The unified morbidity dataset was then merged with the collapsed population dataset using county, age category, sex, and race/ethnicity as linking fields. Only exact matches were retained, producing a final analytic dataset containing morbidity indicators alongside the associated population denominators for each demographic and geographic stratum.
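
The harmonization steps described above can be sketched in Python. This is a minimal illustration rather than the project's actual code: the field names (`county`, `age_cat`, `sex`, `race_eth`, `pop`, `new_infections`), the spelling of the age labels, and the toy rows are all hypothetical, and the health officer region is omitted from the grouping key for brevity.

```python
from collections import defaultdict

# Hypothetical age labels: the three youngest bands collapse into "0-17".
AGE_MAP = {
    "0-4": "0-17", "5-11": "0-17", "12-17": "0-17",
    "18-49": "18-49", "50-64": "50-64", "65+": "65+",
}


def collapse_population(pop_rows):
    """Sum population within each (county, age, sex, race/ethnicity) stratum."""
    totals = defaultdict(int)
    for row in pop_rows:
        key = (row["county"], AGE_MAP[row["age_cat"]], row["sex"], row["race_eth"])
        totals[key] += row["pop"]
    return dict(totals)


def merge_exact(morbidity_rows, pop_totals):
    """Inner join: keep only morbidity rows whose stratum has a denominator."""
    merged = []
    for row in morbidity_rows:
        key = (row["county"], row["age_cat"], row["sex"], row["race_eth"])
        if key in pop_totals:
            merged.append({**row, "pop": pop_totals[key]})
    return merged


# Toy rows with made-up numbers, only to exercise the functions.
pop_rows = [
    {"county": "Solano County", "age_cat": "0-4", "sex": "Female", "race_eth": "Asian", "pop": 100},
    {"county": "Solano County", "age_cat": "5-11", "sex": "Female", "race_eth": "Asian", "pop": 150},
    {"county": "Solano County", "age_cat": "12-17", "sex": "Female", "race_eth": "Asian", "pop": 250},
]
ca_rows = [
    {"county": "Solano County", "age_cat": "0-17", "sex": "Female", "race_eth": "Asian", "new_infections": 3},
]
la_rows = [
    {"county": "Los Angeles County", "age_cat": "0-17", "sex": "Female", "race_eth": "Asian", "new_infections": 7},
]

pop_totals = collapse_population(pop_rows)
morbidity = ca_rows + la_rows  # stack the two morbidity files into one
merged = merge_exact(morbidity, pop_totals)
print(merged)
```

In this toy example the Los Angeles County row is dropped because no population stratum matches it, which is the behavior implied by retaining only exact matches.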
Variable Construction
For each observation, we calculated incidence and severity rates per 100,000 population using standard epidemiologic formulas. Specifically, we generated weekly rates for: new infections; new unrecovered cases; new severe cases; cumulative infections; cumulative unrecovered cases; cumulative severe cases. All rates were calculated as the count divided by the corresponding population denominator multiplied by 100,000 and rounded to two decimal places. Additional analytic variables included a binary age grouping (65+ vs. <65), the health officer region, and a monthly date variable created using floor_date() to facilitate temporal aggregation of race/ethnicity data.
Analytic Methods
The cleaned and merged dataset was used to characterize the outbreak trajectory and assess disparities across demographic and geographic subgroups. Analyses were descriptive and focused on (1) temporal trends, (2) age-specific disease burden, (3) county-level variation within the Bay Area, and (4) race/ethnicity-specific incidence patterns.
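
The derived variables described under Variable Construction can be illustrated with a short Python sketch. The helper names, the record fields (`week_start`, `age_cat`, `new_infections`, `pop`), and the example values are hypothetical; `month_floor` only approximates what a monthly `floor_date()` call does by resetting a date to the first day of its month.

```python
from datetime import date


def rate_per_100k(count: int, population: int) -> float:
    """Count divided by the population denominator, times 100,000, to two decimals."""
    return round(count / population * 100_000, 2)


def age_group_binary(age_cat: str) -> str:
    """Collapse the four age categories into 65+ versus <65."""
    return "65+" if age_cat == "65+" else "<65"


def month_floor(d: date) -> date:
    """Round a date down to the first day of its month."""
    return d.replace(day=1)


# Hypothetical weekly record with made-up values.
record = {
    "week_start": date(2023, 9, 17),
    "age_cat": "50-64",
    "new_infections": 12,
    "pop": 48_000,
}

record["new_infections_rate"] = rate_per_100k(record["new_infections"], record["pop"])
record["age_65_plus"] = age_group_binary(record["age_cat"])
record["month"] = month_floor(record["week_start"])
print(record)
```

The same `rate_per_100k` helper would apply unchanged to the other five count variables, since each uses the same population denominator.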
Results and Interpretation
After cleaning, we created three visualizations exploring incidence rates across age groups, counties, and racial/ethnic groups in the Bay Area.
Our first visualization revealed that adults aged 65 and older consistently experienced much higher incidence rates than individuals under 65. Both groups followed a similar overall incidence pattern, with a peak around September. However, stratifying by age revealed a key difference in magnitude: during peak incidence, the infection rate in the 65+ population was almost twice that of the under-65 population. This indicates an age-related disparity in which older adults were more affected, potentially because of biological differences between the groups, such as increased susceptibility in the older population. This trend emphasizes the need for intervention and prevention efforts focused on older populations living in dense metropolitan areas such as the Bay Area. (See Visualization 1)
When stratified by geography, incidence rates were generally similar across Bay Area counties, with one outlier. Solano County had an incidence rate of 608.9 per 100,000, much higher than the other Bay Area counties, whose incidence rates fell between 250 and 350 per 100,000. This may suggest a localized effect or county-level variation in practices. (See Visualization 2)
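
To put the size of the Solano County outlier in context, the published rates from Visualization 2 can be compared directly. The sketch below is illustrative only; using the median of the other eleven counties as the reference point is our assumption, not a method stated in the analysis.

```python
from statistics import median

# Incidence per 100,000, copied from the Visualization 2 table.
rates = {
    "Alameda County": 290.3,
    "Contra Costa County": 289.9,
    "Marin County": 307.9,
    "Monterey County": 326.2,
    "Napa County": 304.0,
    "San Benito County": 262.2,
    "San Francisco County": 323.8,
    "San Mateo County": 292.8,
    "Santa Clara County": 292.0,
    "Santa Cruz County": 322.9,
    "Solano County": 608.9,
    "Sonoma County": 311.4,
}

# Compare Solano County against the spread of the remaining counties.
others = [rate for county, rate in rates.items() if county != "Solano County"]
other_median = median(others)
ratio = rates["Solano County"] / other_median

print(f"other counties: {min(others)} to {max(others)}, median {other_median}")
print(f"Solano County relative to that median: {ratio:.2f}x")
```

The other eleven counties range from 262.2 to 326.2 with a median of 304.0, so Solano County's rate is roughly twice the typical value for the region.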
We also examined the distribution by race and ethnicity. All groups mirrored the curve described in the first visualization, but incidence rates varied substantially when stratified by race. The groups with the consistently highest incidence rates across the surveillance period were the American Indian or Alaska Native, Black, and White populations, while the Native Hawaiian or Pacific Islander and Hispanic populations had slightly lower rates. Asian and Multiracial populations experienced the lowest rates among all racial groups. These differences may reflect structural or social factors in the Bay Area, such as localized racial disparities in access to health care. (See Visualization 3)
Overall, the results show how demographic and geographic factors affect disease distribution. They also highlight the importance of data in identifying populations at increased risk and informing appropriate interventions for them.