Introduction
Data
- Data Sources
Data Cleaning
- Cleaning morbidity data
- Cleaning population data
Analysis Element 1
Analysis Element 2
Analysis Element 3
Descriptive Statistics
Additional tables
Data Dictionary

Introduction

This report presents an analysis of an infectious disease outbreak in the state of California. The analysis is based on three distinct datasets sourced from reputable public health databases. Each dataset contains valuable information relevant to understanding the dynamics of the outbreak, including case counts, demographic information, and geographic distributions.

To keep our analyses consistent, we merge these datasets into a single data frame. This involves recoding values so that each variable is represented the same way in every source, which makes comparisons across datasets valid. We carry out the integration in R (using RStudio) and prepare the combined dataset for exploratory and statistical analysis.

Our ultimate goal is to identify patterns and insights that can inform public health responses and policy decisions. This analysis aims to provide a clearer understanding of the outbreak’s impact across different communities in California and to support efforts in mitigating its effects.

Data

sim_novelid_CA.csv
sim_novelid_LACounty.csv
ca_pop_2023.csv

Data Sources

Dataset novel_id_ca contains weekly counts of cases and severe cases from an infectious disease outbreak, broken down by demographic and geographic categories for California counties.
Dataset novel_id_la contains morbidity and population data by demographic and geographic categories for Los Angeles county.
Dataset ca_pop contains population estimates for 2023 by California county and demographic categories.

Data Cleaning

Cleaning morbidity data

Added “county” column to LA dataset.

Renamed columns in LA dataset to match CA dataset.
Dropped “DT_REPORT” variable.

Reclassified CA dt_diagnosis column as date.

Reclassified LA dt_diagnosis column as date.

Reclassified CA race_ethnicity column from numeric to character.

Combined CA and LA datasets and reclassified time_int as date (epi week).
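
The morbidity cleaning steps listed above can be sketched roughly as follows. This is a minimal illustration on toy data, not the report's actual code. The raw LA column names (DX_DATE, RACE_ETH, INFECTED), the numeric race/ethnicity codes, and the use of the ISO week as a stand-in for the epi week are all assumptions made for the example; only DT_REPORT, dt_diagnosis, race_ethnicity, county, and time_int come from the text.

```r
library(dplyr)

# Toy stand-ins for the two morbidity files (all values are made up).
ca_raw <- data.frame(
  county = c("Alameda County", "Alameda County"),
  race_ethnicity = c(1, 2),                      # numeric codes (assumed coding)
  dt_diagnosis = c("2023-05-29", "2023-06-05"),  # stored as text in this toy
  new_infections = c(6L, 1L)
)
la_raw <- data.frame(
  DX_DATE = c("2023-05-29", "2023-06-05"),       # hypothetical raw column names
  RACE_ETH = c("White, Non-Hispanic", "Hispanic (any race)"),
  INFECTED = c(12L, 30L),
  DT_REPORT = c("2023-06-02", "2023-06-09")
)

# Hypothetical code-to-label lookup; the real coding scheme is not shown here.
race_labels <- c("1" = "White, Non-Hispanic", "2" = "Hispanic (any race)")

ca_clean <- ca_raw %>%
  mutate(
    dt_diagnosis = as.Date(dt_diagnosis),
    race_ethnicity = unname(race_labels[as.character(race_ethnicity)])
  )

la_clean <- la_raw %>%
  mutate(county = "Los Angeles County") %>%      # add the missing county column
  rename(dt_diagnosis = DX_DATE,
         race_ethnicity = RACE_ETH,
         new_infections = INFECTED) %>%
  select(-DT_REPORT) %>%                         # drop the report-date variable
  mutate(dt_diagnosis = as.Date(dt_diagnosis))

# Stack both sources, then derive a numeric week from the diagnosis date
# (ISO week here, as an assumed stand-in for the epi week).
all_counties <- bind_rows(ca_clean, la_clean) %>%
  mutate(time_int = as.numeric(format(dt_diagnosis, "%V")))
```

Converting both date columns and the race/ethnicity codes before stacking matters because bind_rows() refuses to combine columns of incompatible types, such as a numeric race_ethnicity with a character one.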

Cleaning population data

Reclassified CA_pop race_ethnicity column from numeric to character.

Grouped by county and race_ethnicity, the strata of interest.

Created population columns - population by race and county, total county population.
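
A minimal sketch of how the two population columns could be built, using toy numbers. The layout of the population file (one row per county, race/ethnicity, and sex) and the pop column name are assumptions; pop_race_county and total_county_pop are the column names that appear in the joined dataset.

```r
library(dplyr)

# Toy population file: one row per county x race/ethnicity x sex (made-up numbers).
ca_pop_toy <- data.frame(
  county = rep(c("Alameda County", "Inyo County"), each = 4),
  race_ethnicity = rep(rep(c("White, Non-Hispanic", "Hispanic (any race)"),
                           each = 2), 2),
  sex = rep(c("FEMALE", "MALE"), 4),
  pop = c(300, 280, 200, 220, 40, 45, 10, 5)
)

pop_by_stratum <- ca_pop_toy %>%
  group_by(county, race_ethnicity) %>%
  summarize(pop_race_county = sum(pop), .groups = "drop") %>%  # population per stratum
  group_by(county) %>%
  mutate(total_county_pop = sum(pop_race_county)) %>%          # county-wide total
  ungroup()
```

Collapsing to one row per county and race/ethnicity before joining keeps a left join on those two keys from duplicating morbidity rows.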

Joined morbidity dataset with population dataset grouped by race_ethnicity and county

full_dataset_race_county <- all_ca_counties %>%
  left_join(ca_pop3, by = c("county", "race_ethnicity")) 

str(full_dataset_race_county)

## 'data.frame':    100688 obs. of  13 variables:
##  $ county             : chr  "Alameda County" "Alameda County" "Alameda County" "Alameda County" ...
##  $ age_cat            : chr  "0-17" "0-17" "0-17" "0-17" ...
##  $ sex                : chr  "FEMALE" "FEMALE" "FEMALE" "FEMALE" ...
##  $ race_ethnicity     : chr  "White, Non-Hispanic" "White, Non-Hispanic" "White, Non-Hispanic" "White, Non-Hispanic" ...
##  $ dt_diagnosis       : Date, format: "2023-05-29" "2023-06-05" ...
##  $ time_int           : num  22 23 24 25 26 27 28 29 30 31 ...
##  $ new_infections     : int  6 1 2 10 19 25 23 18 22 35 ...
##  $ cumulative_infected: int  6 7 9 19 38 63 86 104 126 161 ...
##  $ new_severe         : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ cumulative_severe  : int  0 0 1 1 1 1 1 1 1 1 ...
##  $ month              : num  5 6 6 6 6 7 7 7 7 7 ...
##  $ pop_race_county    : int  567943 567943 567943 567943 567943 567943 567943 567943 567943 567943 ...
##  $ total_county_pop   : int  1701203 1701203 1701203 1701203 1701203 1701203 1701203 1701203 1701203 1701203 ...

#save cleaned dataset as csv
write_csv(full_dataset_race_county,
          "../full_dataset_race_county.csv")

Analysis Element 1

County and racial/ethnic groups most affected by the outbreak

Top 10 Populations Infected with Disease by Race/Ethnicity and County

# Total infections table setup

inc_new_table <- full_dataset_race_county %>% 
  group_by(county, race_ethnicity) %>% 
  mutate(total_infections_race_county = sum(new_infections)) %>% 
  ungroup() %>%
  mutate(percent_race_county =
           round((total_infections_race_county/pop_race_county)*100,2)) %>%
  ungroup() %>% 
  distinct(county,race_ethnicity, percent_race_county) %>%
  arrange(desc(percent_race_county)) %>% 
  slice_head(n = 10)  # keep the 10 highest percentages

# Total infections Kable table

inc_new_table %>%
  drop_na() %>%
  kable(
      booktabs = TRUE,
      caption = "<center><strong>Top 10 Populations Infected with Disease by Race/Ethnicity and County</strong></center>",
      col.names = c("County", "Race/Ethnicity", "Population Infected (%)"),
      align = "c") %>%
      kable_styling(full_width = FALSE, position = "center") %>%
  footnote(
    general = "*Source: CDPH",
    general_title = "",
    footnote_as_chunk = TRUE)

**Top 10 Populations Infected with Disease by Race/Ethnicity and County**
County	Race/Ethnicity	Population Infected (%)
Imperial County	Black, Non-Hispanic	46.40
Imperial County	Asian, Non-Hispanic	43.19
Imperial County	American Indian or Alaska Native, Non-Hispanic	43.17
Imperial County	Native Hawaiian or Pacific Islander, Non-Hispanic	43.00
Imperial County	White, Non-Hispanic	42.27
Imperial County	Multiracial (two or more of above races), Non-Hispanic	40.40
Inyo County	Native Hawaiian or Pacific Islander, Non-Hispanic	40.00
Imperial County	Hispanic (any race)	39.89
Inyo County	Black, Non-Hispanic	33.00
Inyo County	Asian, Non-Hispanic	32.07
*Source: CDPH

Interpretation: This table lists the 10 county and racial/ethnic group combinations with the highest cumulative infection percentage, calculated as total infections in a group divided by that group's population in the county. Imperial and Inyo counties account for all ten entries, so the burden appears geographically concentrated. Within those counties, the Black, Asian, and Native Hawaiian/Pacific Islander populations show some of the highest percentages.
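
To make the percentage calculation concrete, here is a small worked sketch on invented numbers. The counties, groups, and counts are illustrative only and are not taken from the outbreak data.

```r
library(dplyr)

# Made-up weekly counts for two strata; populations are illustrative only.
toy <- data.frame(
  county = c("County A", "County A", "County B", "County B"),
  race_ethnicity = c("Group 1", "Group 1", "Group 1", "Group 1"),
  new_infections = c(10L, 30L, 5L, 5L),
  pop_race_county = c(200L, 200L, 1000L, 1000L)
)

pct_infected <- toy %>%
  group_by(county, race_ethnicity) %>%
  summarize(
    total_infections = sum(new_infections),
    pop_race_county = first(pop_race_county),   # constant within a stratum
    .groups = "drop"
  ) %>%
  mutate(percent_infected = round(total_infections / pop_race_county * 100, 2)) %>%
  arrange(desc(percent_infected))
```

In this toy, County A has 40 infections in a population of 200 (20%), while County B has 10 in 1,000 (1%). Percentages based on small populations can swing widely with a handful of cases, which is worth keeping in mind when reading the table.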

Interactive table: California Population Infected with Disease by Race/Ethnicity and County

#Total infections datatable setup

inc_new_table2 <- full_dataset_race_county %>%
   group_by(county, race_ethnicity) %>%
   mutate(total_infections_race_county = sum(new_infections)) %>%
   ungroup() %>%
   mutate(percent_race_county =
            round((total_infections_race_county/pop_race_county)*100,2)) %>%
   ungroup() %>%
   distinct(county,race_ethnicity, percent_race_county) %>%
   arrange(desc(percent_race_county))

#Total infections datatable

 inc_new_table2 %>%
   na.omit() %>%
   datatable(
            caption = tags$caption(
             style = 'caption-side: top; text-align: center; font-weight: bold; font-size: 17px;',
             'CA Population Infected with Disease by Race/Ethnicity and County'),
           rownames = FALSE,
           extensions = 'Buttons',  # needed for the 'B' (buttons) element in dom
           colnames = c("County", "Race/Ethnicity", "Population Infected (%)"),
           options = list(
             pageLength = 10,
             lengthMenu = c(10, 25, 50, 100),
             dom = 'Blfrtip',
             buttons = c('copy', 'csv', 'excel', 'pdf', 'print'),
             columnDefs = list(
               list(className = 'dt-left', targets = 0),
               list(className = 'dt-right', targets = 1:2),
               list(className = 'dt-center', targets = '_all', headerOnly = TRUE)))) %>%
   formatRound('percent_race_county', digits = 2)

Analysis Element 2

County and racial/ethnic groups most affected by severe infections

Top 10 Populations Infected with Severe Disease by Race/Ethnicity and County

# Severe infections table setup
  
inc_new_sev_table <- full_dataset_race_county %>% 
  group_by(county, race_ethnicity) %>% 
  mutate(sev_infections_race_county = sum(new_severe)) %>% 
  ungroup() %>%
  mutate(percent_race_county =
           round((sev_infections_race_county/pop_race_county)*100,2)) %>% 
  ungroup() %>% 
  distinct(county,race_ethnicity, percent_race_county) %>% 
  arrange(desc(percent_race_county)) %>% 
  slice_head(n = 10)  # keep the 10 highest percentages


# Severe infection Kable table

inc_new_sev_table %>% 
  drop_na() %>%
  kable(
      booktabs = TRUE,
      caption = "<center><strong>Top 10 Populations Infected with Severe Disease by Race/Ethnicity and County</strong></center>",
      col.names = c("County", "Race/Ethnicity", "Population with Severe Infection (%)"),
      align = "c") %>%
      kable_styling(full_width = FALSE, position = "center") %>%
  footnote(
    general = "*Source: CDPH",
    general_title = "",
    footnote_as_chunk = TRUE)

**Top 10 Populations Infected with Severe Disease by Race/Ethnicity and County**
County	Race/Ethnicity	Population with Severe Infection (%)
Plumas County	Asian, Non-Hispanic	1.89
Madera County	Native Hawaiian or Pacific Islander, Non-Hispanic	1.74
Imperial County	White, Non-Hispanic	1.67
Imperial County	Multiracial (two or more of above races), Non-Hispanic	1.61
Tehama County	Native Hawaiian or Pacific Islander, Non-Hispanic	1.49
Colusa County	Asian, Non-Hispanic	1.39
Inyo County	White, Non-Hispanic	1.38
Sutter County	Native Hawaiian or Pacific Islander, Non-Hispanic	1.31
Inyo County	Asian, Non-Hispanic	1.27
Colusa County	American Indian or Alaska Native, Non-Hispanic	1.24
*Source: CDPH

Interpretation: This table lists the 10 county and racial/ethnic group combinations with the highest percentage of severe infection, calculated as total severe infections in a group divided by that group's population in the county. The Asian population in Plumas County ranks highest, yet only 1.89% of this population contracted severe disease. These data therefore suggest that severe infection remains uncommon and is not concentrated in any one geographic or demographic population.

Interactive table: California Population Infected with Severe Disease by Race/Ethnicity and County

#Severe infections datatable setup

 inc_new_sev_table2 <- full_dataset_race_county %>%
   group_by(county, race_ethnicity) %>%
   mutate(sev_infections_race_county = sum(new_severe)) %>%
   ungroup() %>%
   mutate(percent_race_county =
            round((sev_infections_race_county/pop_race_county)*100,2)) %>%
   ungroup() %>%
   distinct(county,race_ethnicity, percent_race_county) %>%
   arrange(desc(percent_race_county))

#Severe infections datatable

 inc_new_sev_table2 %>%
   na.omit() %>%
   datatable(
           caption = tags$caption(
             style = 'caption-side: top; text-align: center; font-weight: bold; font-size: 17px;',
             'CA Population Infected with Severe Disease by Race/Ethnicity and County'),
           rownames = FALSE,
           extensions = 'Buttons',  # needed for the 'B' (buttons) element in dom
           colnames = c("County", "Race/Ethnicity", "Population with Severe Infection (%)"),
           options = list(
             pageLength = 10,
             lengthMenu = c(10, 25, 50, 100),
             dom = 'Blfrtip',
             buttons = c('copy', 'csv', 'excel', 'pdf', 'print'),
             columnDefs = list(
               list(className = 'dt-left', targets = 0),
               list(className = 'dt-right', targets = 1:2),
               list(className = 'dt-center', targets = '_all', headerOnly = TRUE)))) %>%
   formatRound('percent_race_county', digits = 2)

Plots of Analysis Elements 1 & 2: California Population Infected with Disease and Severe Disease by Race/Ethnicity and County

#Code for boxplots, using inc_new_table2 dataframes.
inf_box <- plot_ly(inc_new_table2,
        x= ~race_ethnicity,
        y= ~percent_race_county,
        color=~race_ethnicity, type = "box") %>%
  layout(
    title = "Plot 1: CA Populations Impacted by Disease, by Race/Ethnicity",
    yaxis = list(title = "% of County Population Infected", range = c(0, 50)),
    xaxis = list(title = "Race/Ethnicity group", showticklabels = FALSE),
    legend = list(orientation = 'h')
  )
#using inc_new_sev_table2 dataframe
sev_box <- plot_ly(inc_new_sev_table2,
        x= ~race_ethnicity,
        y= ~percent_race_county,
        color=~race_ethnicity, type = "box") %>%
  layout(
    title = "Plot 2: CA Populations Infected with Severe Disease, by Race/Ethnicity",
    yaxis = list(title = "% of County Population with Severe Infection", range = c(0, 2)),
    xaxis = list(title = "Race/Ethnicity group", showticklabels = FALSE),
    legend = list(x = 1.02, y = 0.5),  # place legend just right of the plot area
    margin = list(r=160)
  )

inf_box

sev_box

Interpretation: These plots display the distribution, across counties, of cases as a percentage of each racial/ethnic group's county population. Distributions are shown for all cases and for severe cases (defined as cases requiring hospitalization). For every racial/ethnic group, the IQR of percent infected lies between roughly 8% and 20%, with a median of about 10% and a right-skewed distribution. This suggests a broadly similar distribution of infection across groups, apart from the counties and demographics of concern identified in the top-10 table in Analysis Element 1. In contrast, the distribution of severe infection varies slightly: while Asian and Pacific Islander populations are most impacted by severe disease in certain counties, the IQR and median of severe infection are highest among White populations. As noted above, the rate of severe infection is low overall and does not suggest that action beyond standard surveillance is needed.
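
The medians and IQRs read off the boxplots can also be tabulated directly. The sketch below does this on a toy stand-in for inc_new_table2; the values are invented, and the real data frame would be passed in the same way.

```r
library(dplyr)

# Toy version of inc_new_table2: one percent value per county x race/ethnicity.
toy_pct <- data.frame(
  county = rep(paste("County", 1:5), times = 2),
  race_ethnicity = rep(c("Group 1", "Group 2"), each = 5),
  percent_race_county = c(8, 10, 12, 20, 40,   # Group 1
                          5, 9, 10, 11, 15)    # Group 2
)

box_stats <- toy_pct %>%
  group_by(race_ethnicity) %>%
  summarize(
    q1 = quantile(percent_race_county, 0.25, na.rm = TRUE),
    median = median(percent_race_county, na.rm = TRUE),
    q3 = quantile(percent_race_county, 0.75, na.rm = TRUE),
    iqr = IQR(percent_race_county, na.rm = TRUE),
    .groups = "drop"
  )
```

Note that plotly may compute box quartiles with a slightly different method than R's default quantile(), so tabulated values can differ marginally from the plot's hover text.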

Analysis Element 3

Incidence Rate by Epi Week

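# Denominator: statewide population per race/ethnicity (each county counted once),
# so it matches the statewide case counts in the numerator.
pop_by_race <- full_dataset_race_county %>%
  distinct(county, race_ethnicity, pop_race_county) %>%
  group_by(race_ethnicity) %>%
  summarize(pop_race = sum(pop_race_county, na.rm = TRUE), .groups = "drop")

epi_week <- full_dataset_race_county %>%
  group_by(time_int, race_ethnicity) %>%
  summarize(new_infections = sum(new_infections, na.rm = TRUE), .groups = "drop") %>%
  left_join(pop_by_race, by = "race_ethnicity") %>%
  mutate(incidence_week = round((new_infections / pop_race) * 1000, 2))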

bar_epi <-ggplot(epi_week, aes(x= time_int, y = incidence_week, fill = race_ethnicity,
                               text = paste("Ethnicity: ", race_ethnicity,
                                            "<br>Incidence: ", incidence_week)))+
  geom_bar(stat = "identity",position = "stack", na.rm = TRUE)+
  scale_x_continuous(breaks = seq(22, 52, 1))+
  theme_classic()+
  theme(legend.text = element_text(size = 5),
        legend.title = element_text(size = 8),
        legend.key.size = unit(0.3, "cm"),
        axis.text.x = element_text(size=8))+
  labs(title = "Incidence per 1000 distributed by Epi Week",
       x = "Epi Week", y = "Incidence per 1000")
  
#ggplotly(bar_epi, tooltip = "text") %>%
  #layout(showlegend = FALSE)

bar_epi

Interpretation: This plot depicts weekly incidence (new cases per 1,000 people) by epi week of the outbreak, stratified by racial/ethnic group. Each stacked segment is the incidence within one group, so the total bar height is a sum of group-specific rates and not the incidence in the overall population.
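
Incidence per 1,000 is the number of new cases divided by the population at risk, multiplied by 1,000. The sketch below works this through on invented numbers. It assumes the denominator is the statewide population of each racial/ethnic group, obtained by adding each county's stratum population exactly once; the counties and counts are hypothetical.

```r
library(dplyr)

# Made-up data: one group, two counties, two epi weeks.
toy <- data.frame(
  county = c("County A", "County B", "County A", "County B"),
  race_ethnicity = "Group 1",
  time_int = c(22, 22, 23, 23),
  new_infections = c(4L, 6L, 10L, 20L),
  pop_race_county = c(1000L, 3000L, 1000L, 3000L)
)

# Population at risk: add each county's stratum population exactly once.
pop_race <- toy %>%
  distinct(county, race_ethnicity, pop_race_county) %>%
  group_by(race_ethnicity) %>%
  summarize(pop_race = sum(pop_race_county), .groups = "drop")

weekly_incidence <- toy %>%
  group_by(time_int, race_ethnicity) %>%
  summarize(new_infections = sum(new_infections), .groups = "drop") %>%
  left_join(pop_race, by = "race_ethnicity") %>%
  mutate(incidence_week = round(new_infections / pop_race * 1000, 2))
```

Here week 22 has 10 cases in a population of 4,000 (2.5 per 1,000) and week 23 has 30 cases (7.5 per 1,000). Taking distinct() before summing keeps the repeated weekly rows from inflating the denominator.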

Descriptive Statistics

Counties with the highest percent of each racial group

The results are presented in Table 1.

Describe age and sex variables

The results are presented in Table 2 and Table 3.

Table of summary statistics for new infections

Table of summary statistics for severe infections

Additional tables

Data Dictionary

**Data Dictionary**
Variable	Type	Description
age_cat	character	Age group; ages are categorized into bands (for example, 0-17) for simplicity
time_int	double	Epi week of diagnosis; used to align records on a common weekly time scale
new_infections	integer	Number of newly reported cases of the infectious disease per epi week
pop_race_county	integer	Population of a given racial/ethnic group within a county
total_county_pop	integer	Total population of a county in the state of CA

Milestone 4

Ariana Waters, Brooke Stimmel, Imana Waseem

November 24, 2024