Introduction

This report presents an analysis of an infectious disease outbreak in the state of California. The analysis is based on three distinct datasets sourced from reputable public health databases. Each dataset contains valuable information relevant to understanding the dynamics of the outbreak, including case counts, demographic information, and geographic distributions.

To keep our analyses consistent, we merge these datasets into a single data frame. This requires recoding values so that variables are represented the same way in every dataset, which allows valid comparisons across sources. We perform the integration in R using RStudio and prepare the merged dataset for exploratory and statistical analysis.

Our ultimate goal is to identify patterns and insights that can inform public health responses and policy decisions. This analysis aims to provide a clearer understanding of the outbreak’s impact across different communities in California and to support efforts in mitigating its effects.

Data

  1. sim_novelid_CA.csv
  2. sim_novelid_LACounty.csv
  3. ca_pop_2023.csv

Data Sources

  1. Dataset novel_id_ca contains weekly data on an infectious disease outbreak: the number of cases and case severity by demographic and geographic categories for California counties.
  2. Dataset novel_id_la contains morbidity and population data by demographic and geographic categories for Los Angeles County.
  3. Dataset ca_pop contains population estimates for 2023 by California county and demographic categories.

Data Cleaning

Cleaning morbidity data

Added “county” column to LA dataset.

  1. Renamed columns in LA dataset to match CA dataset.
  2. Dropped “DT_REPORT” variable.

Reclassified CA dt_diagnosis column as date.

Reclassified LA dt_diagnosis column as date.

Reclassified CA race_ethnicity column from numeric to character.

Combined CA and LA datasets and recoded time_int as the epidemiological (epi) week number.
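
The morbidity cleaning steps above can be sketched on toy data as follows. This is a minimal illustration, not the report's actual code: the raw LA column names (DT_DX, RACE_ETH, NEW_INF), the numeric race codes, the lookup table, and all values are hypothetical stand-ins, since the raw files are not shown here.

```r
library(dplyr)

# Toy stand-in for the CA file (hypothetical values; race stored as numeric codes)
ca_raw <- data.frame(
  county         = c("Alameda County", "Alameda County"),
  race_ethnicity = c(1, 2),
  dt_diagnosis   = c("2023-05-29", "2023-06-05"),
  new_infections = c(6L, 1L),
  stringsAsFactors = FALSE
)

# Toy stand-in for the LA file (hypothetical column names, no county column)
la_raw <- data.frame(
  DT_DX     = c("2023-05-29", "2023-06-05"),
  RACE_ETH  = c("White, Non-Hispanic", "White, Non-Hispanic"),
  NEW_INF   = c(40L, 12L),
  DT_REPORT = c("2023-06-02", "2023-06-09"),
  stringsAsFactors = FALSE
)

# Hypothetical code-to-label lookup for the numeric CA race codes
race_labels <- c("1" = "White, Non-Hispanic", "2" = "Black, Non-Hispanic")

# CA: parse the diagnosis date and recode race from numeric to character
ca_clean <- ca_raw %>%
  mutate(
    dt_diagnosis   = as.Date(dt_diagnosis),
    race_ethnicity = unname(race_labels[as.character(race_ethnicity)])
  )

# LA: add county, rename to match CA, drop DT_REPORT, parse the diagnosis date
la_clean <- la_raw %>%
  mutate(county = "Los Angeles County") %>%
  rename(dt_diagnosis = DT_DX,
         race_ethnicity = RACE_ETH,
         new_infections = NEW_INF) %>%
  select(-DT_REPORT) %>%
  mutate(dt_diagnosis = as.Date(dt_diagnosis))

# Stack the two cleaned datasets; columns are matched by name
combined_toy <- bind_rows(ca_clean, la_clean)
```

Matching column names and types before stacking matters because bind_rows() aligns columns by name and refuses to combine incompatible types, such as a numeric and a character race_ethnicity.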

Cleaning population data

Reclassified CA_pop race_ethnicity column from numeric to character.

Grouped by county and race_ethnicity

These are the strata of interest

Created population columns - population by race and county, total county population.
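
As a rough sketch of how the two population columns could be derived, the example below aggregates a toy population table to the county and race/ethnicity strata and then adds a county total. The input layout (one row per county, race/ethnicity, and sex, with a column named pop) is an assumption for illustration; only the output column names come from this report.

```r
library(dplyr)

# Hypothetical toy population table; the real file layout may differ
pop_raw <- data.frame(
  county         = rep(c("County A", "County B"), each = 4),
  race_ethnicity = rep(c("Group 1", "Group 1", "Group 2", "Group 2"), times = 2),
  sex            = rep(c("FEMALE", "MALE"), times = 4),
  pop            = c(100L, 110L, 50L, 40L, 300L, 290L, 10L, 0L),
  stringsAsFactors = FALSE
)

pop_strata <- pop_raw %>%
  # Population of each racial/ethnic group within each county
  group_by(county, race_ethnicity) %>%
  summarize(pop_race_county = sum(pop), .groups = "drop") %>%
  # Total county population, repeated on every row of that county
  group_by(county) %>%
  mutate(total_county_pop = sum(pop_race_county)) %>%
  ungroup()
```

In this toy example, County A has 210 and 90 people in the two groups and a county total of 300 on both rows.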

Joined morbidity dataset with population dataset grouped by race_ethnicity and county

full_dataset_race_county <- all_ca_counties %>%
  left_join(ca_pop3, by = c("county", "race_ethnicity")) 

str(full_dataset_race_county)
## 'data.frame':    100688 obs. of  13 variables:
##  $ county             : chr  "Alameda County" "Alameda County" "Alameda County" "Alameda County" ...
##  $ age_cat            : chr  "0-17" "0-17" "0-17" "0-17" ...
##  $ sex                : chr  "FEMALE" "FEMALE" "FEMALE" "FEMALE" ...
##  $ race_ethnicity     : chr  "White, Non-Hispanic" "White, Non-Hispanic" "White, Non-Hispanic" "White, Non-Hispanic" ...
##  $ dt_diagnosis       : Date, format: "2023-05-29" "2023-06-05" ...
##  $ time_int           : num  22 23 24 25 26 27 28 29 30 31 ...
##  $ new_infections     : int  6 1 2 10 19 25 23 18 22 35 ...
##  $ cumulative_infected: int  6 7 9 19 38 63 86 104 126 161 ...
##  $ new_severe         : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ cumulative_severe  : int  0 0 1 1 1 1 1 1 1 1 ...
##  $ month              : num  5 6 6 6 6 7 7 7 7 7 ...
##  $ pop_race_county    : int  567943 567943 567943 567943 567943 567943 567943 567943 567943 567943 ...
##  $ total_county_pop   : int  1701203 1701203 1701203 1701203 1701203 1701203 1701203 1701203 1701203 1701203 ...
#save cleaned dataset as csv
write_csv(full_dataset_race_county,
          "../full_dataset_race_county.csv")

Analysis Element 1

County and racial/ethnic groups most affected by the outbreak

Top 10 Populations Infected with Disease by Race/Ethnicity and County

# Total infections table setup

inc_new_table <- full_dataset_race_county %>% 
  group_by(county, race_ethnicity) %>% 
  mutate(total_infections_race_county = sum(new_infections)) %>% 
  ungroup() %>%
  mutate(percent_race_county =
           round((total_infections_race_county/pop_race_county)*100,2)) %>%
  distinct(county, race_ethnicity, percent_race_county) %>%
  arrange(desc(percent_race_county)) %>%
  slice_head(n = 10)  # keep the 10 highest percentages

# Total infections Kable table

inc_new_table %>%
  drop_na() %>%
  kable(
      booktabs = TRUE,
      caption = "<center><strong>Top 10 Populations Infected with Disease by Race/Ethnicity and County</strong></center>",
      col.names = c("County", "Race/Ethnicity", "Population Infected (%)"),
      align = "c") %>%
      kable_styling(full_width = FALSE, position = "center") %>%
  footnote(
    general = "*Source: CDPH",
    general_title = "",
    footnote_as_chunk = TRUE)

Top 10 Populations Infected with Disease by Race/Ethnicity and County

County           Race/Ethnicity                                          Population Infected (%)
Imperial County  Black, Non-Hispanic                                     46.40
Imperial County  Asian, Non-Hispanic                                     43.19
Imperial County  American Indian or Alaska Native, Non-Hispanic          43.17
Imperial County  Native Hawaiian or Pacific Islander, Non-Hispanic       43.00
Imperial County  White, Non-Hispanic                                     42.27
Imperial County  Multiracial (two or more of above races), Non-Hispanic  40.40
Inyo County      Native Hawaiian or Pacific Islander, Non-Hispanic       40.00
Imperial County  Hispanic (any race)                                     39.89
Inyo County      Black, Non-Hispanic                                     33.00
Inyo County      Asian, Non-Hispanic                                     32.07

*Source: CDPH

Interpretation: This table displays the 10 populations most affected by the outbreak, measured as total infections as a percentage of each racial/ethnic group's population in each California county. Imperial and Inyo counties appear to be disproportionately affected geographically, with the Black, Asian, and Native Hawaiian/Pacific Islander populations most affected within those counties.
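
To make the percentage calculation concrete, here is a small sketch on toy data. All values and group labels are hypothetical; it uses summarize() rather than mutate() plus distinct(), which should give the same result whenever pop_race_county is constant within each county and race/ethnicity group.

```r
library(dplyr)

# Hypothetical weekly rows: the stratum population repeats on every row
toy_cases <- data.frame(
  county          = c("County A", "County A", "County A", "County B", "County B"),
  race_ethnicity  = c("Group 1", "Group 1", "Group 2", "Group 1", "Group 1"),
  new_infections  = c(6L, 4L, 3L, 20L, 30L),
  pop_race_county = c(200L, 200L, 50L, 1000L, 1000L),
  stringsAsFactors = FALSE
)

pct_infected <- toy_cases %>%
  group_by(county, race_ethnicity) %>%
  summarize(
    total_infections = sum(new_infections),    # cumulative cases in the stratum
    pop_race_county  = first(pop_race_county), # constant within the stratum
    .groups = "drop"
  ) %>%
  mutate(percent_race_county =
           round(total_infections / pop_race_county * 100, 2)) %>%
  arrange(desc(percent_race_county))
```

Here Group 2 in County A ranks first (3 cases in a population of 50, or 6%) even though it has the fewest cases, which illustrates how strata with small populations can reach the top of a percentage ranking.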

Interactive table: California Population Infected with Disease by Race/Ethnicity and County

#Total infections datatable setup

inc_new_table2 <- full_dataset_race_county %>%
   group_by(county, race_ethnicity) %>%
   mutate(total_infections_race_county = sum(new_infections)) %>%
   ungroup() %>%
   mutate(percent_race_county =
            round((total_infections_race_county/pop_race_county)*100,2)) %>%
   ungroup() %>%
   distinct(county,race_ethnicity, percent_race_county) %>%
   arrange(desc(percent_race_county))

#Total infections datatable

 inc_new_table2 %>%
   na.omit() %>%
   datatable(
            extensions = 'Buttons',  # needed for the 'B' in dom and the buttons option
            caption = tags$caption(
             style = 'caption-side: top; text-align: center; font-weight: bold; font-size: 17px;',
             'CA Population Infected with Disease by Race/Ethnicity and County'),
           rownames = FALSE,
           colnames = c("County", "Race/Ethnicity", "Population Infected (%)"),
           options = list(
             pageLength = 10,
             lengthMenu = c(10, 25, 50, 100),
             dom = 'Blfrtip',
             buttons = c('copy', 'csv', 'excel', 'pdf', 'print'),
             columnDefs = list(
               list(className = 'dt-left', targets = 0),
               list(className = 'dt-right', targets = 1:2),
               list(className = 'dt-center', targets = '_all', headerOnly = TRUE)))) %>%
   formatRound('percent_race_county', digits = 2)

Analysis Element 2

County and racial/ethnic groups most affected by severe infections

Top 10 Populations Infected with Severe Disease by Race/Ethnicity and County

# Severe infections table setup
  
inc_new_sev_table <- full_dataset_race_county %>% 
  group_by(county, race_ethnicity) %>% 
  mutate(sev_infections_race_county = sum(new_severe)) %>% 
  ungroup() %>%
  mutate(percent_race_county =
           round((sev_infections_race_county/pop_race_county)*100,2)) %>% 
  distinct(county, race_ethnicity, percent_race_county) %>%
  arrange(desc(percent_race_county)) %>%
  slice_head(n = 10)  # keep the 10 highest percentages


# Severe infection Kable table

inc_new_sev_table %>% 
  drop_na() %>%
  kable(
      booktabs = TRUE,
      caption = "<center><strong>Top 10 Populations Infected with Severe Disease by Race/Ethnicity and County</strong></center>",
      col.names = c("County", "Race/Ethnicity", "Population with Severe Infection (%)"),
      align = "c") %>%
      kable_styling(full_width = FALSE, position = "center") %>%
  footnote(
    general = "*Source: CDPH",
    general_title = "",
    footnote_as_chunk = TRUE)

Top 10 Populations Infected with Severe Disease by Race/Ethnicity and County

County           Race/Ethnicity                                          Population with Severe Infection (%)
Plumas County    Asian, Non-Hispanic                                     1.89
Madera County    Native Hawaiian or Pacific Islander, Non-Hispanic       1.74
Imperial County  White, Non-Hispanic                                     1.67
Imperial County  Multiracial (two or more of above races), Non-Hispanic  1.61
Tehama County    Native Hawaiian or Pacific Islander, Non-Hispanic       1.49
Colusa County    Asian, Non-Hispanic                                     1.39
Inyo County      White, Non-Hispanic                                     1.38
Sutter County    Native Hawaiian or Pacific Islander, Non-Hispanic       1.31
Inyo County      Asian, Non-Hispanic                                     1.27
Colusa County    American Indian or Alaska Native, Non-Hispanic          1.24

*Source: CDPH

Interpretation: This table displays the 10 populations most affected by severe cases during the outbreak, measured as total severe infections as a percentage of each racial/ethnic group's population in each California county. The Asian population in Plumas County has been most affected, though only 1.89% of this population contracted severe disease. These data therefore suggest that the rate of severe infection remains low and is not concentrated in any particular geographic or demographic population.

Interactive table: California Population Infected with Severe Disease by Race/Ethnicity and County

#Severe infections datatable setup

 inc_new_sev_table2 <- full_dataset_race_county %>%
   group_by(county, race_ethnicity) %>%
   mutate(sev_infections_race_county = sum(new_severe)) %>%
   ungroup() %>%
   mutate(percent_race_county =
            round((sev_infections_race_county/pop_race_county)*100,2)) %>%
   ungroup() %>%
   distinct(county,race_ethnicity, percent_race_county) %>%
   arrange(desc(percent_race_county))

#Severe infections datatable

 inc_new_sev_table2 %>%
   na.omit() %>%
   datatable(
           extensions = 'Buttons',  # needed for the 'B' in dom and the buttons option
           caption = tags$caption(
             style = 'caption-side: top; text-align: center; font-weight: bold; font-size: 17px;',
             'CA Population Infected with Severe Disease by Race/Ethnicity and County'),
           rownames = FALSE,
           colnames = c("County", "Race/Ethnicity", "Population with Severe Infection (%)"),
           options = list(
             pageLength = 10,
             lengthMenu = c(10, 25, 50, 100),
             dom = 'Blfrtip',
             buttons = c('copy', 'csv', 'excel', 'pdf', 'print'),
             columnDefs = list(
               list(className = 'dt-left', targets = 0),
               list(className = 'dt-right', targets = 1:2),
               list(className = 'dt-center', targets = '_all', headerOnly = TRUE)))) %>%
   formatRound('percent_race_county', digits = 2)

Plots of Analysis Elements 1 & 2: California Population Infected with Disease and Severe Disease by Race/Ethnicity and County

#Code for boxplots, using inc_new_table2 dataframes.
inf_box <- plot_ly(inc_new_table2,
        x= ~race_ethnicity,
        y= ~percent_race_county,
        color=~race_ethnicity, type = "box") %>%
  layout(
    title = "Plot 1: CA Populations Impacted by Disease, by Race/Ethnicity",
    yaxis = list(title = "% of County Population Infected", range = c(0, 50)),
    xaxis = list(title = "Race/Ethnicity group", showticklabels = FALSE),
    legend = list(orientation = 'h')
  )
#using inc_new_sev_table2 dataframe
sev_box <- plot_ly(inc_new_sev_table2,
        x= ~race_ethnicity,
        y= ~percent_race_county,
        color=~race_ethnicity, type = "box") %>%
  layout(
    title = "Plot 2: CA Populations Infected with Severe Disease, by Race/Ethnicity",
    yaxis = list(title = "% of County Population with Severe Infection", range = c(0, 2)),
    xaxis = list(title = "Race/Ethnicity group", showticklabels = FALSE),
    legend = list(x=200, y=0.5),
    margin = list(r=160)
  )

inf_box
sev_box
Interpretation: These plots display the distribution, across counties, of the percentage of each racial/ethnic group's county population that was infected. Distributions are shown for all cases (Plot 1) and for severe cases, defined as cases requiring hospitalization (Plot 2). For all racial/ethnic groups, the IQR of percent infected lies between 8% and 20%, with a median of ~10% and a right-skewed distribution. This suggests a fairly similar distribution of infection across groups, apart from the counties and demographics of concern identified in Table 1. The distribution of severe infection varies slightly more: while Asian and Pacific Islander populations are most impacted by severe disease in certain counties, the IQR and median of severe infection are highest among White populations. As noted above, the rate of severe infection is low overall and does not suggest that further action needs to be taken beyond standard surveillance.
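
The median and IQR read off the box plots can also be computed directly, which makes the comparison between groups easier to check. The sketch below uses hypothetical percentages for two made-up groups and is an illustration only. Quartile conventions differ between tools, so values from quantile() may not match the plot's hover labels exactly.

```r
library(dplyr)

# Hypothetical county-level percentages for two groups
toy_pct <- data.frame(
  race_ethnicity      = rep(c("Group 1", "Group 2"), each = 5),
  percent_race_county = c(8, 9, 10, 15, 40, 5, 10, 12, 20, 22),
  stringsAsFactors = FALSE
)

box_stats <- toy_pct %>%
  group_by(race_ethnicity) %>%
  summarize(
    q1     = unname(quantile(percent_race_county, 0.25)),  # lower quartile
    median = median(percent_race_county),
    q3     = unname(quantile(percent_race_county, 0.75)),  # upper quartile
    .groups = "drop"
  ) %>%
  mutate(iqr = q3 - q1)
```

In the toy data, Group 1 has a single high value (40) that pulls the upper tail out without moving the median, which is the right-skewed pattern described above.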

Analysis Element 3

Incidence Rate by Epi Week

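# Statewide population per racial/ethnic group. Each county is counted once,
# because pop_race_county repeats across age, sex, and week rows.
# (Taking max() instead would use only the largest county as the denominator.)
pop_by_race <- full_dataset_race_county %>%
  distinct(county, race_ethnicity, pop_race_county) %>%
  group_by(race_ethnicity) %>%
  summarize(pop_race = sum(pop_race_county, na.rm = TRUE), .groups = "drop")

# Weekly statewide cases per group, converted to incidence per 1,000
epi_week <- full_dataset_race_county %>%
  group_by(time_int, race_ethnicity) %>%
  summarize(new_infections = sum(new_infections, na.rm = TRUE), .groups = "drop") %>%
  left_join(pop_by_race, by = "race_ethnicity") %>%
  mutate(incidence_week = round((new_infections / pop_race) * 1000, 2))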
bar_epi <-ggplot(epi_week, aes(x= time_int, y = incidence_week, fill = race_ethnicity,
                               text = paste("Ethnicity", race_ethnicity,
                                            "<br>Incidence: ", incidence_week)))+
  geom_bar(stat = "identity",position = "stack", na.rm = TRUE)+
  scale_x_continuous(breaks = seq(22, 52, 1))+
  theme_classic()+
  theme(legend.text = element_text(size = 5),
        legend.title = element_text(size = 8),
        legend.key.size = unit(0.3, "cm"),
        axis.text.x = element_text(size=8))+
  labs(title = "Incidence per 1000 distributed by Epi Week",
       x = "Epi Week", y = "Incidence per 1000")
  
#ggplotly(bar_epi, tooltip = "text") %>%
  #layout(showlegend = FALSE)

bar_epi

Interpretation: This plot depicts new cases (incidence) per 1,000 people by epi week of the outbreak, stratified by race/ethnicity. Each bar segment is the incidence for one racial/ethnic group in that week. Because the segments are stacked, the total bar height is the sum of the group-specific rates rather than the overall statewide incidence.
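
A minimal sketch of the weekly incidence calculation on toy data follows. It assumes the denominator for each racial/ethnic group is that group's population summed across counties, counting each county once; the counties, group label, and values are hypothetical.

```r
library(dplyr)

# Hypothetical toy data: two counties, one group, two epi weeks
toy_weekly <- data.frame(
  county          = rep(c("County A", "County B"), each = 2),
  race_ethnicity  = "Group 1",
  time_int        = rep(c(22, 23), times = 2),
  new_infections  = c(5L, 10L, 15L, 30L),
  pop_race_county = rep(c(1000L, 3000L), each = 2),
  stringsAsFactors = FALSE
)

# Denominator: each county's group population counted once, then summed
group_pop <- toy_weekly %>%
  distinct(county, race_ethnicity, pop_race_county) %>%
  group_by(race_ethnicity) %>%
  summarize(pop_race = sum(pop_race_county), .groups = "drop")

# Numerator: cases summed across counties for each week and group
incidence_toy <- toy_weekly %>%
  group_by(time_int, race_ethnicity) %>%
  summarize(new_infections = sum(new_infections), .groups = "drop") %>%
  left_join(group_pop, by = "race_ethnicity") %>%
  mutate(incidence_week = round(new_infections / pop_race * 1000, 2))
```

With 20 cases in week 22 and 40 in week 23 against a combined population of 4,000, the incidence is 5 and 10 per 1,000 respectively.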

Descriptive Statistics

Counties with the highest percent of each racial group

The results are presented in table 1

Describe age and sex variables

The results are presented in table 2 and table 3

Table of summary statistics for new infections

Table of summary statistics for severe infections

Additional tables

Data Dictionary

Variable          Type       Description
age_cat           character  Age group; ages are categorized for simplicity
time_int          double     Epi week; standardizes the time scale of the data
new_infections    integer    Newly reported cases of the infectious disease per epi week
pop_race_county   integer    County population by racial/ethnic group
total_county_pop  integer    Total population of the California county