Introduction

Scenario 1: Infectious Disease Outbreak (Simulated) in California

The sim_novelid_CA.csv and sim_novelid_LACounty.csv files together represent simulated morbidity for the entire state of California. This analysis combines the two datasets and examines the temporal and demographic patterns of the outbreak.

Problem Statement

There is a novel infectious respiratory disease outbreak in California. Data were collected in 2023 and include cases and case severity by demographic category (age category, race, sex) and geographic category (county), along with population data for California counties. As part of the California Department of Public Health team surveilling the outbreak, we are interested in the course of the outbreak, whether it is disproportionately affecting certain demographic or geographic populations, and how prevention and treatment resources should be allocated.

Methods

Data for this analysis came from three simulated 2023 datasets. The sim_novelid_CA.csv file contained weekly morbidity data (new infections and severe cases) by county, age group, sex, and race/ethnicity for all California counties except Los Angeles. The sim_novelid_LACounty.csv file provided equivalent morbidity data for Los Angeles County, which operates a separate surveillance system. The ca_pop_2023.csv file provided 2023 population estimates by county and demographic categories, which served as denominators for rate calculations.

Data preparation involved standardizing variable names and coding schemes across datasets. Race/ethnicity categories were recoded from numeric values to descriptive labels matching California Department of Finance classifications. Age categories in the population data (0-4, 5-11, 12-17) were collapsed to match the broader categories in the morbidity data (0-17). County names were standardized to enable successful joins. Rows with missing values were excluded to ensure complete records for stratified analyses.

Weekly incidence rates per 100,000 population were calculated by dividing new infections by the population denominator and multiplying by 100,000. This standardization enables fair comparisons across demographic groups of different sizes. Attack rates (cumulative proportion infected) and severity rates (proportion of cases with severe outcomes) were calculated for summary statistics. All measures were stratified by age group and race/ethnicity to identify disparities. CDC epidemiological weeks (Sunday start) were used as the temporal unit for time series analysis.

Data Preparation

Load and Clean Morbidity Data

First, we loaded the data.

Standardize Column Names and Select Variables

We selected the variables relevant to our analysis from each data source.

Handle Missing Data

We removed rows with missing values to ensure clean joins.
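
The report does not show its code, so the following is only a minimal Python sketch of this step. The helper name drop_incomplete and the set of values treated as missing (None, empty string, "NA") are assumptions; the column names come from the data dictionary.

```python
def drop_incomplete(rows, required):
    """Keep only rows where every required field is present and non-empty."""
    missing_markers = (None, "", "NA")  # assumed encodings of a missing value
    complete = []
    for row in rows:
        if all(row.get(col) not in missing_markers for col in required):
            complete.append(row)
    return complete


# Illustrative rows only, not taken from the simulated datasets.
rows = [
    {"county": "Alameda", "age_cat": "0-17", "new_infections": "12"},
    {"county": "Alameda", "age_cat": "", "new_infections": "7"},
    {"county": "Fresno", "age_cat": "65+", "new_infections": None},
]
kept = drop_incomplete(rows, ["county", "age_cat", "new_infections"])
print(f"kept {len(kept)} of {len(rows)} rows")
```

Reporting how many rows were dropped, as the last line does, makes it easier to judge whether the exclusion could bias later comparisons.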

Dates and CDC Week

We first corrected the date format in the LA County data. CDC epidemiological weeks start on Sunday and are the standard unit for disease surveillance reporting, so we set weeks to start on Sunday. We then derived the CDC week from the time_int variable for the CA dataset and from the diagnosis date for the LA County dataset.
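
As an illustration of the Sunday-start convention, the sketch below floors a date to the Sunday that begins its week. It is a Python approximation, not the code used in the analysis; the function name cdc_week_start is hypothetical, and the encoding of time_int is not described in the report, so only the date-based case is shown.

```python
from datetime import date, timedelta


def cdc_week_start(d):
    """Return the Sunday that begins the week containing date d."""
    # date.weekday() is 0 for Monday through 6 for Sunday, so shifting by one
    # gives the number of days elapsed since the most recent Sunday.
    days_since_sunday = (d.weekday() + 1) % 7
    return d - timedelta(days=days_since_sunday)


print(cdc_week_start(date(2023, 8, 16)))  # a Wednesday; prints 2023-08-13
```

This yields the week's start date, which matches how cdc_week is described in the data dictionary. It does not compute CDC week numbers, which follow additional rules for the first week of the year.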

Recode Race/Ethnicity Categories

We converted the numeric race codes in the CA data to descriptive labels and ensured that race was stored as a factor in the LA County data.

Combine Datasets

We merged the CA and LA County morbidity data.

Population Data Preparation

Clean and Standardize Population Data

We selected and renamed relevant columns to match morbidity data. We then standardized the race/ethnicity labels to match morbidity data. We reclassified the age categories to match morbidity data (e.g., 0-17 instead of 0-4, 5-11, 12-17).
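
The age reclassification can be sketched as follows, assuming the population rows carry the same key columns as the morbidity data (county, age_cat, sex, race_eth) plus pop. This is a Python illustration with made-up population values; AGE_MAP and collapse_age are hypothetical names.

```python
from collections import defaultdict

# Fine-grained population age bands folded into the morbidity data's 0-17 band.
AGE_MAP = {"0-4": "0-17", "5-11": "0-17", "12-17": "0-17"}


def collapse_age(pop_rows):
    """Sum population over the fine age bands that map to one category."""
    totals = defaultdict(int)
    for row in pop_rows:
        # Bands not listed in AGE_MAP are assumed to match already.
        age = AGE_MAP.get(row["age_cat"], row["age_cat"])
        key = (row["county"], age, row["sex"], row["race_eth"])
        totals[key] += row["pop"]
    return dict(totals)


# Illustrative values only.
pop_rows = [
    {"county": "Alameda", "age_cat": "0-4", "sex": "FEMALE", "race_eth": "Asian, Non-Hispanic", "pop": 1000},
    {"county": "Alameda", "age_cat": "5-11", "sex": "FEMALE", "race_eth": "Asian, Non-Hispanic", "pop": 1500},
    {"county": "Alameda", "age_cat": "12-17", "sex": "FEMALE", "race_eth": "Asian, Non-Hispanic", "pop": 1200},
    {"county": "Alameda", "age_cat": "65+", "sex": "FEMALE", "race_eth": "Asian, Non-Hispanic", "pop": 900},
]
collapsed = collapse_age(pop_rows)
print(collapsed[("Alameda", "0-17", "FEMALE", "Asian, Non-Hispanic")])
```

Populations must be summed, not averaged, when bands are merged; otherwise the 0-17 denominator would be too small and the resulting rates too high.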

Standardize County Names

To standardize the county names, we removed the “County” suffix from morbidity data to match population data. We also made sure there was consistent formatting in population data.
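
A minimal Python sketch of this kind of name cleanup is shown below. The report does not say what "consistent formatting" involved, so the whitespace trimming and title casing here are assumptions, and clean_county is a hypothetical helper.

```python
import re


def clean_county(name):
    """Trim whitespace, drop a trailing 'County', and apply title case."""
    name = re.sub(r"\s+", " ", name.strip())
    name = re.sub(r"\s+County$", "", name, flags=re.IGNORECASE)
    return name.title()


for raw in ["Los Angeles County", "  alameda ", "San  Luis Obispo county"]:
    print(repr(raw), "->", repr(clean_county(raw)))
```

Applying the same function to both datasets before joining means a mismatch can only come from a genuinely different name, not from spacing or capitalization.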

Join Morbidity and Population Data

We used a left join to add population data to morbidity data.
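
The join can be pictured as a keyed lookup that keeps every morbidity row whether or not a population match exists. The Python sketch below assumes the join keys are county, age_cat, sex, and race_eth, which the report does not state explicitly; left_join and the example values are illustrative.

```python
JOIN_KEYS = ("county", "age_cat", "sex", "race_eth")  # assumed join keys


def left_join(morbidity, pop_lookup):
    """Attach pop to each morbidity row; unmatched rows keep pop=None."""
    joined = []
    for row in morbidity:
        key = tuple(row[k] for k in JOIN_KEYS)
        joined.append({**row, "pop": pop_lookup.get(key)})
    return joined


# Illustrative values only.
pop_lookup = {("Alameda", "0-17", "FEMALE", "Asian, Non-Hispanic"): 3700}
morbidity = [
    {"county": "Alameda", "age_cat": "0-17", "sex": "FEMALE",
     "race_eth": "Asian, Non-Hispanic", "new_infections": 12},
    {"county": "Alamdea", "age_cat": "0-17", "sex": "FEMALE",
     "race_eth": "Asian, Non-Hispanic", "new_infections": 5},  # misspelled on purpose
]
joined = left_join(morbidity, pop_lookup)
unmatched = [r for r in joined if r["pop"] is None]
print(f"{len(unmatched)} row(s) without a population match")
```

Counting unmatched rows after the join is a quick check that the county and category standardization worked, since a left join keeps such rows silently.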

Time Series Analysis

Calculate Weekly Rates

We created a dataset with weekly cumulative incidence rates and dropped columns not relevant to our time series analysis.

Overall California Outbreak Trajectory

We aggregated to state level by week to determine the overall California outbreak trajectory.

Stratify by Age Group

We aggregated by age group and week to stratify by age group.

Stratify by Race/Ethnicity

We aggregated by race/ethnicity and week to stratify by race/ethnicity.

Results

Epidemic Curve by Age Group

The outbreak affected all age groups, with clear differences in disease burden by age.

Key Findings:

All age groups peaked simultaneously around August-September 2023.
Incidence generally increased with age, except that cumulative incidence was higher in the 18-49 group than in the 50-64 group.
Persons aged 65+ experienced the highest burden (~1,600 per 100,000 at peak).
Persons aged 0-17 had the lowest incidence (~350 per 100,000 at peak).

Epidemic Curve by Race/Ethnicity

Key Findings:

American Indian or Alaska Native (Non-Hispanic) populations experienced the highest peak incidence (~1,250 per 100,000).
All racial/ethnic groups followed similar temporal patterns, peaking around the same time.

Summary Statistics

Summary by Age Group

Outbreak Summary Statistics by Age Group
Age Group	Total Cases	Average Population	Peak Weekly CI (per 100k)	Total Severe Cases	Attack Rate (%)	Severe Rate (%)
0-17	353210	8694111	354.22	363	4.06	0.10
18-49	2467284	17106505	1250.22	29325	14.42	1.19
50-64	540481	7086797	668.40	2822	7.63	0.52
65+	1188608	6221657	1662.00	94416	19.10	7.94

Key Findings:

Overall, peak weekly cumulative incidence (per 100k), attack rate, and severe rate generally increased with age, with the exception that the 18-49 age group had higher values than the 50-64 group.
Persons aged 65+ experienced the highest burden across all 3 measures: peak weekly cumulative incidence (per 100k), attack rate, and severe rate.
Persons aged 0-17 had the lowest burden across all 3 measures: peak weekly cumulative incidence (per 100k), attack rate, and severe rate.
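
As a worked check, the attack rate and severe rate columns in the age-group table can be recomputed from its totals. The Python sketch below assumes attack rate is total cases divided by average population and severe rate is total severe cases divided by total cases, each times 100; these definitions are inferred from the Methods wording and do reproduce the tabulated values to two decimals.

```python
def attack_rate_pct(total_cases, population):
    """Cumulative cases as a percentage of the population."""
    return total_cases / population * 100


def severe_rate_pct(severe_cases, total_cases):
    """Severe cases as a percentage of all cases."""
    return severe_cases / total_cases * 100


# (total cases, average population, total severe cases) from the age-group table
by_age = {
    "0-17": (353210, 8694111, 363),
    "18-49": (2467284, 17106505, 29325),
    "50-64": (540481, 7086797, 2822),
    "65+": (1188608, 6221657, 94416),
}
summary = {
    age: (round(attack_rate_pct(cases, pop), 2), round(severe_rate_pct(severe, cases), 2))
    for age, (cases, pop, severe) in by_age.items()
}
for age, (attack, severe) in summary.items():
    print(age, attack, severe)
```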

Summary by Race/Ethnicity

Outbreak Summary Statistics by Race/Ethnicity
Race/Ethnicity	Total Cases	Average Population	Peak Weekly CI (per 100k)	Total Severe Cases	Attack Rate (%)	Severe Rate (%)
American Indian or Alaska Native, Non-Hispanic	22195	158672	1230.21	671	13.99	3.02
White, Non-Hispanic	1778774	13848282	1117.81	63448	12.84	3.57
Black, Non-Hispanic	271836	2211518	1060.00	7118	12.29	2.62
Hispanic (any race)	1796696	14829946	1052.14	36484	12.12	2.03
Native Hawaiian or Pacific Islander, Non-Hispanic	16921	153729	962.08	410	11.01	2.42
Asian, Non-Hispanic	546770	6295420	760.47	16568	8.69	3.03
Multiracial (two or more of above races), Non-Hispanic	116391	1611503	617.87	2227	7.22	1.91

Key Findings:

American Indian or Alaska Native (Non-Hispanic) populations experienced the highest attack rate (14.0%), nearly double that of the lowest group (Multiracial, 7.2%).
Asian (Non-Hispanic) populations showed low incidence (8.7% attack rate) but relatively high case severity (3.03% severe rate, second only to White, Non-Hispanic at 3.57%), a pattern potentially explained by an older age distribution in this population.

Data Dictionary

Clean Dataset Variables

The final analytical dataset (fun1) contains the following key variables:

Variable	Data Type	Description
county	Character	California counties
age_cat	Character	Age category: 0-17, 18-49, 50-64, 65+
sex	Character	Sex: FEMALE, MALE
race_eth	Factor	Race/ethnicity category defined by California Department of Finance
new_infections	Numeric	Number of newly diagnosed individuals during the week
new_severe	Numeric	Newly identified individuals with severe disease requiring hospitalization
cdc_week	Date	Start date of CDC epidemiological week (Sunday)
pop	Numeric	CA Dept of Finance population estimates
ci_per100k	Numeric	Weekly cumulative incidence rate per 100,000 persons
severe_per100k	Numeric	Weekly severe disease rate per 100,000 persons

Discussion

The outbreak peaked in August-September 2023 before declining. The observed disparities have important implications for ongoing preparedness and resource allocation.

Age was the strongest predictor of disease burden. Adults aged 65 and older experienced the highest attack rate (19.1%) and severe rate (7.9%), indicating that this group should be prioritized for both prevention efforts and treatment. The unexpectedly high incidence among adults aged 18-49 compared to those aged 50-64 may reflect occupational exposures and warrants further investigation.

Racial and ethnic disparities were also evident. American Indian or Alaska Native (Non-Hispanic) populations experienced attack rates nearly double those of the lowest-affected group, highlighting the need for targeted outreach and culturally appropriate prevention strategies. Asian (Non-Hispanic) populations showed low incidence but relatively high case severity, a pattern potentially driven by the age distribution within this population.

Several limitations should be noted. This analysis used simulated data and may not capture real-world complexities. Socioeconomic factors, which often drive health disparities, were not available in the dataset. Additionally, rows with missing demographic data were excluded, which could introduce bias if missingness was not random. Finally, this analysis did not examine county-level variation, which could reveal geographic clustering of cases that would require localized response efforts.

Milestone 6: Infectious Disease Outbreak in California

Nilson Palma, Michelle Zhen Huang

2025-12-14
