Group 6 Milestone #6

Problem Statement

The California Department of Public Health is monitoring a simulated outbreak of a novel infectious respiratory disease and must quickly assess its impact across the state. Three data sources have been provided: weekly morbidity and severity counts for all California counties except Los Angeles, detailed demographic and population data for Los Angeles County, and 2023 county-level population estimates (to support statewide surveillance activities). To effectively guide the public health response, these datasets must be merged and analyzed to produce a unified view of the outbreak.

Public health leadership seeks to understand the trajectory of the outbreak, identify whether certain demographic or geographic groups are disproportionately affected, and determine how prevention and treatment resources should be allocated. The task requires integrating the three datasets, generating summary measures such as morbidity rates and severity distributions, and producing clear, print-ready tables and graphical visualizations that communicate key trends across counties and demographic categories. This analysis will inform data-driven decision-making for targeted interventions and resource deployment.

Methods for Each Data Source

The sim_novelid_CA.csv dataset contained weekly morbidity records for all counties in California except Los Angeles County. It included cases, severity indicators, and demographic variables, and was stratified by age category, county, race, sex, and week start date. We cleaned and standardized the dataset by converting labels to a consistent letter case, renaming the race variable, and ensuring that category labels matched the format of the population data. We removed duplicate columns and checked for missing values before joining it with the other datasets.
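The cleaning steps described above can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the code used for the report; the column names (`age_cat`, `race_ethnicity`, `new_infections`) and the sample rows are hypothetical stand-ins for the real file contents.

```python
import csv
import io

# Hypothetical sample rows; the real file's column names and labels may differ.
RAW = """county,age_cat,sex,race_ethnicity,new_infections
Alameda,0-17,FEMALE,White,12
Alameda,0-17,MALE,White,
"""

def clean_rows(text, rename=None):
    """Trim whitespace, title-case the sex label, rename columns, flag blanks."""
    rename = rename or {}
    cleaned, missing = [], []
    # Data rows start on file line 2 because line 1 is the header.
    for line_no, row in enumerate(csv.DictReader(io.StringIO(text)), start=2):
        row = {rename.get(k, k): v.strip() for k, v in row.items()}
        row["sex"] = row["sex"].title()
        if any(v == "" for v in row.values()):
            missing.append(line_no)
        cleaned.append(row)
    return cleaned, missing

rows, rows_with_missing = clean_rows(RAW, rename={"race_ethnicity": "race"})
```

Recording the line numbers of incomplete rows, rather than dropping them silently, keeps the decision about how to handle missing values visible before any join.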

The sim_novelid_LACounty.csv dataset was formatted in the same manner as the California state dataset, but it contained data for Los Angeles County only. We applied the same cleaning steps so that the two morbidity datasets were consistent. After we standardized column names and categories, we combined this dataset with the California dataset to create a morbidity dataset that included all counties.
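One way to picture the combining step is a row-wise stack that first checks that the two datasets share the same columns. The sketch below is written in Python for illustration only; the column names and values are hypothetical.

```python
def combine(ca_rows, la_rows):
    """Stack two lists of row dicts after checking that their columns match."""
    ca_cols = set(ca_rows[0])
    la_cols = set(la_rows[0])
    if ca_cols != la_cols:
        # Report the columns that appear in only one of the two datasets.
        raise ValueError(f"column mismatch: {sorted(ca_cols ^ la_cols)}")
    return ca_rows + la_rows

# Hypothetical rows standing in for the two cleaned morbidity datasets.
ca = [
    {"county": "Alameda", "age_cat": "0-17", "sex": "Female", "race": "White", "cases": 12},
    {"county": "Fresno", "age_cat": "65+", "sex": "Male", "race": "Asian", "cases": 30},
]
la = [
    {"county": "Los Angeles", "age_cat": "0-17", "sex": "Female", "race": "White", "cases": 40},
]
all_counties = combine(ca, la)
```

Checking the column sets before stacking surfaces any leftover naming difference at the point where it is easiest to fix.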

The ca_pop_2023.csv dataset contained population estimates by age group, county, race, and sex. The labels in this dataset did not match those in the morbidity data, so it required more cleaning. We added County and recoded the population age categories, sex labels, and race labels to match those used in the morbidity datasets.
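Recoding of this kind is often expressed as a lookup table from the source labels to the target labels. The Python sketch below shows the idea for age categories; the age bands on both sides of the mapping and the column names are invented for illustration and are not taken from the datasets.

```python
# Hypothetical mapping from finer population age bands to the broader
# morbidity age groups; the real labels in both files may differ.
AGE_MAP = {
    "0-4": "0-17",
    "5-11": "0-17",
    "12-17": "0-17",
    "18-49": "18-49",
    "50-64": "50-64",
    "65+": "65+",
}

def recode_age(rows, mapping, column="age_cat"):
    """Return copies of rows with the age label replaced via mapping."""
    recoded = []
    for row in rows:
        label = row[column]
        if label not in mapping:
            # Fail loudly so an unexpected label never drops out of the join silently.
            raise KeyError(f"unmapped age label: {label!r}")
        recoded.append({**row, column: mapping[label]})
    return recoded

population = [
    {"county": "Alameda", "age_cat": "0-4", "pop": 1000},
    {"county": "Alameda", "age_cat": "5-11", "pop": 1500},
    {"county": "Alameda", "age_cat": "65+", "pop": 800},
]
recoded_population = recode_age(population, AGE_MAP)
```

The same pattern would apply to sex and race labels, each with its own mapping.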

After all of the datasets had uniform variable names and categories, we grouped and summed the population, then joined it to the morbidity data using county, age group, sex, and race. Then we created a new variable for incidence per 100,000 (cases divided by population, multiplied by 100,000). Other minor cleaning steps were completed to prepare for the development of tables and plots.
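The group, sum, join, and rate steps can be sketched in Python as follows. This is an illustration under stated assumptions rather than the report's code: the column names (`age_cat`, `cases`, `pop`, `week`) and the sample values are hypothetical.

```python
from collections import defaultdict

KEYS = ("county", "age_cat", "sex", "race")  # hypothetical column names

def sum_by_stratum(rows, value):
    """Sum one numeric column within each county/age/sex/race stratum."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[k] for k in KEYS)] += row[value]
    return dict(totals)

def incidence_per_100k(case_rows, pop_rows):
    """Join summed cases to summed population and compute the rate."""
    cases = sum_by_stratum(case_rows, "cases")
    pop = sum_by_stratum(pop_rows, "pop")
    rates = {}
    for stratum, n_cases in cases.items():
        if stratum not in pop:
            raise KeyError(f"no population for stratum {stratum}")
        rates[stratum] = n_cases / pop[stratum] * 100_000
    return rates

stratum = {"county": "Alameda", "age_cat": "0-17", "sex": "Female", "race": "White"}
weekly_cases = [{**stratum, "week": 1, "cases": 5}, {**stratum, "week": 2, "cases": 10}]
population = [{**stratum, "pop": 1000}, {**stratum, "pop": 1500}]
rates = incidence_per_100k(weekly_cases, population)
```

In this sketch the population is summed once per stratum and cases are summed across weeks before the join. If the population were instead attached to every weekly row and then summed, the denominator would be multiplied by the number of weeks.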

Summary Measures and Visualizations

Table 1.

Incidence Rate per 100,000 by Sex
Simulated Infectious Disease Outbreak, California & LA County
Sex	Total cases	Total population	Incidence rate per 100,000
Female	2,294,166	540,075,676	424.8
Male	2,255,417	525,013,644	429.6

Total cases, population counts, and incidence rates for males and females during the simulated infectious disease outbreak across California and Los Angeles County. Incidence rates were nearly identical for both groups, suggesting that sex did not influence infection risk.
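As a quick arithmetic check, the rates in Table 1 can be reproduced from its case and population totals. The short Python sketch below uses only the figures printed in the table; the function name is illustrative.

```python
def rate_per_100k(cases, population):
    """Incidence per 100,000: cases divided by population, times 100,000."""
    return cases / population * 100_000

# Totals copied from Table 1.
female = rate_per_100k(2_294_166, 540_075_676)
male = rate_per_100k(2_255_417, 525_013_644)

# Ratio of the male rate to the female rate.
rate_ratio = male / female

print(f"{female:.1f} {male:.1f} {rate_ratio:.3f}")
```

The male rate is roughly 1 percent higher than the female rate, which is the basis for describing the two rates as nearly identical.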

Figure 1.

Incidence rates across age groups during the simulated outbreak in California and Los Angeles County. Rates increased steadily with age, with the highest burden seen among adults 65 years and older.

Figure 2.

Total number of reported cases across racial and ethnic groups during the simulated outbreak. The distribution shows clear differences in case burden across groups, which may reflect variations in population size, exposure patterns, or access to care.

Figure 3.

Incidence rates across racial and ethnic groups during the simulated outbreak. The figure highlights meaningful differences in infection rates, suggesting that certain groups may face higher exposure or barriers to prevention and care.

Results

The incidence rates for males and females were nearly identical, at 424.8 per 100,000 for females and 429.6 per 100,000 for males (Table 1). Sex therefore did not appear to influence infection risk in this simulated outbreak.

Incidence rates increased steadily across age groups, with the highest burden observed among adults 65 years and older (Figure 1). Children and younger adults had much lower rates, which highlights the greater vulnerability of older adults in this setting.

There were also clear differences across racial and ethnic groups. Total case counts varied widely (Figure 2), and incidence rates varied as well (Figure 3), with some groups experiencing much higher rates than others. These differences may reflect variations in exposure, underlying health conditions, work environments, or access to care.

Overall, age and race or ethnicity were important demographic factors associated with differences in incidence, while sex showed no meaningful effect.

Discussion

The results indicate that sex did not have a major influence on infection risk, since males and females had almost identical incidence rates (Table 1). Much stronger patterns were observed across age and racial or ethnic groups.

Older adults showed the highest incidence rates by a considerable margin (Figure 1). This is consistent with many respiratory illnesses, where adults 65 years and older often experience more severe disease or greater susceptibility due to underlying medical conditions or changes in immune function.

The variation across racial and ethnic groups in both total cases (Figure 2) and incidence rates (Figure 3) highlights important differences that may be shaped by social, environmental, and healthcare-related factors. These patterns suggest areas where additional attention may be needed to understand community-level risks and opportunities for targeted prevention.

Together, these findings provide a clear early picture of the groups most affected within this simulated outbreak and support future work focused on prevention strategies, resource allocation, and further analyses to understand the reasons behind these demographic patterns.