1 Introduction

Investigating the Early Life Airway Fungal Microbiome is an exploratory study examining the potential role of the fungal microbiome (mycobiome) in the development of recurrent wheezing (≥3 episodes/year) in children. While most microbiome research has focused on bacterial communities, this study specifically investigates the understudied airway mycobiome and its association with respiratory outcomes. We are asking a simple but important question: Are certain fungi more common in kids who develop recurrent wheezing? Our study includes children between the ages 2 and 5.5 years old. During early childhood, they had nasal swabs collected, which was later used to identify fungi through DNA sequencing. Logistic Regression models were used to predict the presence of the fungi with recurrent wheezing as the main predictor. In addition we adjusted for race/ethnicity, season, maternal atopy.

Subject IDs and other identifying information are fully anonymized in this study to preserve privacy

library(ggplot2)
library(vcd)
library(Hmisc)
library(gtsummary)
library(dplyr)
library(tidyr)
library(RColorBrewer)
library(viridis)
library(plotly)
library(logistf)
library(ggrepel)
library(gt)
library(gtExtras)
library(forcats)
library(knitr)
library(patchwork)
# Load data
metadata <- read.csv("metadata.csv")
fungaldata <- read.csv("genus_table.csv")

# Convert recurrent_wheezing variable from numeric (1/0) to character ("yes"/"no")
metadata[["recurrent_wheezing"]] <- ifelse(metadata[["recurrent_wheezing"]] == 1, "yes", "no")
metadata <- metadata[!is.na(metadata[,"recurrent_wheezing"]),]


2 Data Description

We used two datasets for this analysis. The first includes participant characteristics such as age, race/ethnicity, season, and whether the child has recurrent wheezing. Below is the first few rows of the dataset.

# Presentable metadata table

metadata_table <- head(metadata) %>%
  gt() %>%
  tab_header(title = "Preview of Metadata") %>%
  fmt_number(columns = where(is.numeric), decimals = 2)

metadata_table
Preview of Metadata
subject_id recurrent_wheezing sex race_ethnicity age warm_season maternal_atopy maternal_asthma
1.00 no female white_nh 3.01 warm (apr-oct) no no
2.00 no female black_nh 2.20 warm (apr-oct) yes no
3.00 no male white_nh 2.14 warm (apr-oct) yes yes
5.00 no female white_nh 2.10 warm (apr-oct) no no
6.00 yes male multiple 2.25 warm (apr-oct) yes yes
7.00 no male white_nh 2.09 warm (apr-oct) no no

The second dataset we are working with contains detailed fungal abundance data for 73 different fungi, measured from each child’s nasal swab. We can notice something important already only looking at a few rows and a few fungi, this is that most of the fungi abundance is 0. Another important detail to note is that for each subject the abundance of all the fungi combined should add up to 1.

# Presentable fungal abundance table

presentable_fungi <- c("Didymellaceae_unclassified", "Talaromyces", "Ascomycota_unclassified", "Aspergillus", "subject_id")
present_subset <- fungaldata[,presentable_fungi]

fungi_table <- present_subset %>% 
  head(6) %>% 
  gt() %>% 
  tab_header(title = "Preview of Fungaldata (4 Selected Fungi)") %>%
  fmt_number(columns = where(is.numeric), decimals = 3)

fungi_table
Preview of Fungaldata (4 Selected Fungi)
Didymellaceae_unclassified Talaromyces Ascomycota_unclassified Aspergillus subject_id
0.000 0.000 0.000 0.000 1.000
0.000 0.000 0.000 0.986 2.000
0.000 0.000 0.000 0.000 3.000
0.000 0.000 0.000 0.000 4.000
0.000 0.000 0.143 0.020 5.000
0.126 0.000 0.058 0.000 6.000

Since our main predictor for the presence of a fungi is recurrent wheezing we exclude the children who are missing this data. Therefore from our 147 children we are left with 140 children to analyze. We merge the two datasets by subject id.

fungalmetadata <- merge(metadata, fungaldata, by = "subject_id", all = TRUE)
cleandata <- fungalmetadata[!is.na(fungalmetadata[,"recurrent_wheezing"]),]

3 Characteristics of Children in Study Stratified by Recurrent Wheezing Status

# Recode categorical variables for clarity and consistency in analysis:

cleandata <- cleandata %>%
  mutate(race_ethnicity = fct_recode(race_ethnicity,
                                     "Asian" = "asian",
                                     "Black (Non-Hispanic)" = "black_nh",
                                     "Hispanic" = "hispanic",
                                     "Multiple Races" = "multiple",
                                     "White (Non-Hispanic)" = "white_nh"))

cleandata <- cleandata %>% 
  mutate(sex = fct_recode(sex,
                          "Female" = "female",
                          "Male" = "male"))

cleandata <- cleandata %>% 
  mutate(warm_season = fct_recode(warm_season,
                          "Cold (Nov-Mar)" = "cold (nov-mar)",
                          "Warm (Apr - Oct)" = "warm (apr-oct)"))
# Create TableOne
table_1 <- tbl_summary(
  data = cleandata,
  by = recurrent_wheezing,
  include = c(sex, race_ethnicity, age, warm_season, maternal_atopy, maternal_asthma, Ascomycota_unclassified, Talaromyces),
  statistic = list(all_categorical() ~ "{n} ({p}%)"),
  label = list(
    sex ~ "Sex",
    race_ethnicity ~ "Race/Ethnicity",
    age ~ "Age",
    warm_season ~ "Month microbiome was sampled",
    maternal_atopy ~ "Maternal Atopy",
    maternal_asthma ~ "Maternal Asthma",
    Ascomycota_unclassified ~ "Ascomycota (Unclassified)",
    Talaromyces ~ "Talaromyces"
  )
) %>%
  add_overall(last = TRUE) %>%
  add_p() %>%
  bold_labels() %>%
  modify_header(label ~ "**Variable**") %>%
  modify_caption("**Table 1. Demographic Characteristics of Children With and Without Recurrent Wheezing**") %>%
  modify_spanning_header(all_stat_cols() ~ "**Recurrent Wheezing Status**") %>%
  modify_header(stat_1 ~ "**No Recurrent Wheezing (N = {n})**", stat_2 ~ "**Recurrent Wheezing (N = {n})**")

table_1
Table 1. Demographic Characteristics of Children With and Without Recurrent Wheezing
Variable
Recurrent Wheezing Status
p-value2
No Recurrent Wheezing (N = 120)1 Recurrent Wheezing (N = 20)1 Overall
N = 140
1
Sex


0.2
    Female 54 (45%) 6 (30%) 60 (43%)
    Male 66 (55%) 14 (70%) 80 (57%)
Race/Ethnicity


0.033
    Asian 1 (0.8%) 2 (10%) 3 (2.1%)
    Black (Non-Hispanic) 16 (13%) 5 (25%) 21 (15%)
    Hispanic 10 (8.3%) 2 (10%) 12 (8.6%)
    Multiple Races 8 (6.7%) 2 (10%) 10 (7.1%)
    White (Non-Hispanic) 85 (71%) 9 (45%) 94 (67%)
Age 3.09 (3.02, 3.20) 3.08 (3.03, 3.31) 3.09 (3.02, 3.21) 0.8
Month microbiome was sampled


0.7
    Cold (Nov-Mar) 48 (40%) 9 (45%) 57 (41%)
    Warm (Apr - Oct) 72 (60%) 11 (55%) 83 (59%)
Maternal Atopy 44 (37%) 14 (70%) 58 (41%) 0.005
Maternal Asthma 20 (17%) 11 (55%) 31 (22%) <0.001
Ascomycota (Unclassified) 0.00 (0.00, 0.00) 0.00 (0.00, 0.47) 0.00 (0.00, 0.01) 0.061
Talaromyces 0.0000 (0.0000, 0.0000) 0.0000 (0.0000, 0.0000) 0.0000 (0.0000, 0.0000) 0.6
1 n (%); Median (Q1, Q3)
2 Pearson’s Chi-squared test; Fisher’s exact test; Wilcoxon rank sum test

Important observations from Table 1: Males had higher wheezing prevalence (70% vs. 55% in non-wheezers), though not statistically significant (p=0.2). Race/Ethnicity have significant differences (p=0.033). Age distribution is relatively the same for children with recurrent wheezing and children without. No significant difference for month microbiome data was collected, but slight warm-season predominance (55-60%). For maternal atopy 70% of wheezers had atopic mothers vs. 37% (p=0.005). Similar for maternal asthma wqheezers had 3× higher prevalence (55% vs. 17%, p<0.001). Two fungal taxa were included at the bottom of Table 1 to illustrate the typical data structure: one showed a statistically significant difference between groups, while the other did not. It is important to note that the wheezing group is relatively small (N = 20 out of 140 subjects), which limits statistical power.

In terms of actionable takeaways, maternal asthma was prioritized in this study given its strong association with recurrent wheezing. Additionally, the observed differences across race/ethnicity highlight the need to explore this varaible.

4 Inspection of Potential Covariates

soft_palette <- c("#A6CEE3", "#B2DF8A", "#FDBF6F", "#CAB2D6", "#FFB3B3")

p1 <- ggplot(cleandata, aes(x = recurrent_wheezing, y = age)) +
  geom_boxplot(fill = "#A6CEE3", color = "#1F78B4") +
  labs(title = "Age by Recurrent Wheezing", x = "Recurrent Wheezing", y = "Age") +
  theme_classic() +
  theme(panel.border = element_rect(color = "black", fill = NA))

p2 <- ggplot(cleandata, aes(x = recurrent_wheezing, fill = sex)) +
  geom_bar(position = "dodge", color = "black") +
  scale_fill_manual(values = soft_palette[1:2]) +
  labs(title = "Recurrent Wheezing by Sex", x = "Recurrent Wheezing", y = "Count", fill = "Sex") +
  theme_classic() +
  theme(panel.border = element_rect(color = "black", fill = NA))

p3 <- ggplot(cleandata, aes(x = recurrent_wheezing, fill = race_ethnicity)) +
  geom_bar(position = "dodge", color = "black") +
  scale_fill_manual(values = soft_palette) +
  labs(title = "Recurrent Wheezing by Race/Ethnicity", x = "Recurrent Wheezing", y = "Count", fill = "Race/Ethnicity") +
  theme_classic() +
  theme(panel.border = element_rect(color = "black", fill = NA))

p4 <- ggplot(cleandata, aes(x = recurrent_wheezing, fill = warm_season)) +
  geom_bar(position = "dodge", color = "black") +
  scale_fill_manual(values = soft_palette[1:2]) +
  labs(title = "Recurrent Wheezing by Season", x = "Recurrent Wheezing", y = "Count", fill = "Season") +
  theme_classic() +
  theme(panel.border = element_rect(color = "black", fill = NA))

p5 <- ggplot(cleandata, aes(x = recurrent_wheezing, fill = maternal_atopy)) +
  geom_bar(position = "dodge", color = "black") +
  scale_fill_manual(values = soft_palette[1:2]) +
  labs(title = "Recurrent Wheezing by Maternal Atopy", x = "Recurrent Wheezing", y = "Count", fill = "Maternal Atopy") +
  theme_classic() +
  theme(panel.border = element_rect(color = "black", fill = NA))

p6 <- ggplot(cleandata, aes(x = recurrent_wheezing, fill = maternal_asthma)) +
  geom_bar(position = "dodge", color = "black") +
  scale_fill_manual(values = soft_palette[1:2]) +
  labs(title = "Recurrent Wheezing by Maternal Asthma", x = "Recurrent Wheezing", y = "Count", fill = "Maternal Asthma") +
  theme_classic() +
  theme(panel.border = element_rect(color = "black", fill = NA))

(p1 | p2)

(p3 | p4)

(p5 | p6)

metadata_no_id <- metadata[, !names(metadata) %in% "subject_id"]
kable(summary(metadata_no_id), caption = "Summary of Metadata (excluding subject_id)")
Summary of Metadata (excluding subject_id)
recurrent_wheezing sex race_ethnicity age warm_season maternal_atopy maternal_asthma
Length:140 Length:140 Length:140 Min. :2.086 Length:140 Length:140 Length:140
Class :character Class :character Class :character 1st Qu.:3.022 Class :character Class :character Class :character
Mode :character Mode :character Mode :character Median :3.087 Mode :character Mode :character Mode :character
NA NA NA Mean :3.155 NA NA NA
NA NA NA 3rd Qu.:3.205 NA NA NA
NA NA NA Max. :5.018 NA NA NA

5 Inspection of Fungi Variables

plot_histogram <- function(var, label){
  ggplot(cleandata, aes(x = .data[[var]])) +
    geom_histogram(fill = "skyblue", color = "black", bins = 10) +
    labs(title = paste(label, "Distribution"), x = label, y = "Count") +
    theme_classic() +
    theme(panel.border = element_rect(color = "black", fill = NA))
}

p7 <- plot_histogram("Ascomycota_unclassified", "Ascomycota_unclassified")
p8 <- plot_histogram("Curvularia", "Curvularia")
p9 <- plot_histogram("Talaromyces", "Talaromyces")
(p7 | p8 | p9)


Something obvious that these distribution plots tell us is that fungi abundance for most fungi is very often 0. Only three plots were shown, because they all look very similar.

6 Exploring Average Abundance and Prevalence among all Fungi

funginames <- colnames(cleandata)[9:81]
fungiinfo <- data.frame(
  "Fungi" = funginames
)

fungicolumns <- cleandata[,9:81]
fungiprevalence <- sapply(fungicolumns, function(col){
  (sum(col != 0) / nrow(cleandata)) * 100
})

fungiinfo[,"Fungi Prevalence"] <- fungiprevalence[fungiinfo[,"Fungi"]]

fungiabundance <- sapply(fungicolumns, function(col){
  sum(col) / nrow(cleandata)
})

fungiinfo[, "Average.abundance"] <- fungiabundance[fungiinfo[,"Fungi"]]

sort_fungi <- fungiinfo[order(-fungiinfo[, "Average.abundance"]), ]

top_15 <- sort_fungi[1:15, c("Fungi", "Average.abundance")]

palette <- c(
  "#264653","#2a9d8f","#e76f51","#e9c46a","#8d99ae","#d90429","#6a994e",
  "#9a031e","#007f5f","#5f0f40","#f08a5d","#588157","#ffba08","#3d405b","#bc6c25"
)

fungi_colors <- setNames(palette, top_15$Fungi)

full_palette <- c(fungi_colors, "All Other" = "#999999")

filtered_fungi <- fungiinfo[fungiinfo$Fungi %in% top_15$Fungi, ]

top_fungi <- filtered_fungi[, c("Fungi", "Average.abundance")]

other_abundance <- sum(fungiinfo$Average.abundance[!fungiinfo$Fungi %in% top_15$Fungi])

other_row <- data.frame(
  Fungi = "All Other",
  Average.abundance = other_abundance
)

plot_data <- rbind(top_fungi, other_row)

plot_data$Fungi <- factor(
  plot_data$Fungi,
  levels = c(top_15$Fungi, "All Other")
)
fungaldatalong <- cleandata[, c("subject_id", "recurrent_wheezing", funginames)] %>% 
  pivot_longer(cols = all_of(funginames),
               names_to = "Fungi",
               values_to = "Abundance")

fungaldatalong <- fungaldatalong %>% 
  mutate(Fungus2 = ifelse(Fungi %in% top_15$Fungi, Fungi, "All Other"))

fungus2_summary <- fungaldatalong %>%
  group_by(subject_id, recurrent_wheezing, Fungus2) %>%
  summarise(Total_abundance = sum(Abundance), .groups = "drop")

avg_data <- fungus2_summary %>%
  group_by(recurrent_wheezing, Fungus2) %>%
  summarise(AvgAbundance = mean(Total_abundance), .groups = "drop")

avg_data <- avg_data %>%
  group_by(recurrent_wheezing) %>%
  mutate(Rank = ifelse(Fungus2 != "All Other", rank(-AvgAbundance, ties.method = "first"), NA)) %>%
  ungroup()

avg_data <- avg_data %>%
  arrange(recurrent_wheezing, Rank) %>%
  mutate(Fungus2 = factor(Fungus2, levels = unique(Fungus2)))

fungaldatalong_filtered <- fungaldatalong %>%
  filter(Fungi %in% top_15$Fungi) %>%
  mutate(Fungi = factor(Fungi, levels = top_15$Fungi))
fungi_subject_summary <- fungaldatalong %>%
  group_by(subject_id, recurrent_wheezing, Fungus2) %>%
  summarise(TotalAbundance = sum(Abundance), .groups = "drop") %>%
  mutate(Fungus2 = factor(Fungus2, levels = c(top_15$Fungi, "All Other")))

cleandata[["race_ethnicity"]] <- ifelse(cleandata[["race_ethnicity"]] == "White (Non-Hispanic)","White (Non-Hispanic)", "Other")
ggplot(plot_data, aes(x = "", y = Average.abundance, fill = Fungi)) +
  geom_bar(stat = "identity", width = 0.6) +  
  labs(title = "Top 15 fungi by Average Abundance Across All Samples",
       x = "Fungi", y = "Average Abundance") +
  theme_classic() +
  theme(panel.border = element_rect(color = "black", fill = NA)) +
  scale_fill_manual(values = full_palette)


We want to explore the top 15 fungi by average abundance across all samples. We can see that the top two bars are much larger than the rest of the other bars. Also something else to note is that the bottom half of the fungi abundances are very small. Therefore there is no reason to plot the other fungi, as it will not be informative. Therefore for the 58 other fungi we combine them into an “All Other” bar.

ggplot(avg_data, aes(x = recurrent_wheezing, y = AvgAbundance, fill = Fungus2)) +
  geom_bar(stat = "identity", position = "stack", width = .8) +
  scale_fill_manual(values = c(fungi_colors, "All Other" = "#999999")) +
  labs(title = "Average Fungal Abundance by Wheezing Status", x = "Recurrent Wheezing", y = "Average Abundance", fill = "Fungi") +
  theme_classic() +
  theme(panel.border = element_rect(color = "black", fill = NA))


Using this plot we want to look if there are any potential fungi abundance differences for recurrent wheezing status. Something we can notice immediately is that Ascomycota unclassified seems to be more abundant in children with recurrent wheezing. We can also see that there is less diversity of fungi in children with recurrent wheezing. This is potentially because we only have 20 subjects that have recurrent wheezing.

ggplot(fungi_subject_summary, aes(x = as.character(subject_id), y = TotalAbundance, fill = Fungus2)) +
  geom_bar(stat = "identity") +
  facet_grid(. ~ recurrent_wheezing, space = "free_x", scales = "free_x") +
  scale_fill_manual(values = c(fungi_colors, "All Other" = "#999999")) +
  labs(title = "Fungal Abundance by Subject and Wheezing Satus",x = "Subject", y = "Abundance", fill = "Fungi") +
  theme_classic() +
  theme(panel.border = element_rect(color = "black", fill = NA), axis.text.x = element_blank())


Each bar represents a subject. Each different color in the bar are fungi abundances, and each subject stacks up to 1. The plot is divided into two groups, children who do not have recurrent wheezing and children that do have it. We plotted the top 15 most abundant fungi and for the other fungi we put them all in the bar “All Other” since their abundances are very small. From this plot it is not obvious there is one (or few) dominant fungi that are “always there”. In addition it is not super obvious how the fungal mycobiome is different between wheezing vs. Non-Wheezing.

fungiinfo[, "Average.abundance"] <- fungiabundance[fungiinfo[,"Fungi"]]

hist(fungiinfo[,"Average.abundance"],
     col = "skyblue",
     border = "black",
     main = "Average abundance Distribution",
     xlab = "Average abundance Prevalence",
     ylab = "Count",
)
box()


Average abundance is the mean amount or level of that fungus measured among all subjects. We can see that most average abundance is between 0.00 to 0.02.

hist(fungiinfo[,"Fungi Prevalence"],
     col = "skyblue",
     border = "black",
     main = "Fungi Prevalence Distribution",
     xlab = "Fungi Prevalence",
     ylab = "Count",
)
box()


Prevalence is calculated by how often a fungus is detected across all subjects (i.e., the proportion of subjects where the fungus is present). Most fungi have very small prevalence. There are only two fungi that have a prevalence higher than 25%.

7 Logistic Regression Model

We will examine the presence/absence of fungi vs recurrent wheezing, along with some covariates (logistic regression). For that we changed the fungi abundance to presence if it is greater than 0 it is present otherwise not present. The covariates we explored are race/ethnicity, season and maternal atopy. We ignore the warning of separation for now, we will address it later.
Let \(p\) represent the probability that fungi are present. The logistic regression model is then: \[ \log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1(\text{Recurrent Wheezing}) + \beta_2(\text{Race/Ethnicity}) + \beta_3(\text{Warm Season}) + \beta_4(\text{Maternal Atopy}) \]

logisticregressiondata1 <- cleandata
for(fungus in funginames){
  logisticregressiondata1[[fungus]] <- ifelse(cleandata[[fungus]] > 0, 1, 0)
}

logisticregressionresults <- data.frame()

for(fungus in funginames){
  formula <- as.formula(paste(fungus, "~ recurrent_wheezing + race_ethnicity + warm_season + maternal_atopy"))
  model <- glm(formula, data = logisticregressiondata1, family = binomial)
  model_info <- coef(summary(model))
  df <- data.frame(
    Term = rownames(model_info),
    Fungi = fungus,
    Coefficient = model_info[,"Estimate"],
    SE = model_info[,"Std. Error"],
    P_Value = model_info[,"Pr(>|z|)"]
  )
  logisticregressionresults <- rbind(logisticregressionresults, df)
}
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

We put the results of the 73 tests conducted into a table removing the intercept term and keeping only the p-values that are below 0.05.

logisticregressionresults <- logisticregressionresults %>%
  filter(!grepl("Intercept", Term)) %>%
  mutate(Odds_Ratio = exp(Coefficient))

# Filter p < 0.05
significant_logistic <- logisticregressionresults %>% filter(P_Value <= 0.05)

table_sig <- significant_logistic %>%
  gt() %>%
  fmt_number(columns = where(is.numeric), decimals = 3) %>%
  tab_header(title = "Significant Logistic Regression Results (p<0.05)")

table_sig
Significant Logistic Regression Results (p<0.05)
Term Fungi Coefficient SE P_Value Odds_Ratio
race_ethnicityWhite (Non-Hispanic) Alternaria 1.755 0.794 0.027 5.785
warm_seasonWarm (Apr - Oct) Phlebia 2.470 1.060 0.020 11.823
maternal_atopyyes Phlebia 1.372 0.610 0.024 3.945
warm_seasonWarm (Apr - Oct) Athelia −1.903 0.816 0.020 0.149
recurrent_wheezingyes Wallemia 2.765 1.153 0.017 15.877
warm_seasonWarm (Apr - Oct) Pleosporales_unclassified 1.928 0.780 0.013 6.876
recurrent_wheezingyes Ganoderma 4.493 1.463 0.002 89.400
recurrent_wheezingyes Vishniacozyma 2.741 1.349 0.042 15.496
recurrent_wheezingyes Curvularia 1.614 0.771 0.036 5.020
warm_seasonWarm (Apr - Oct) Curvularia 2.411 1.075 0.025 11.143
recurrent_wheezingyes Peniophora 2.135 1.038 0.040 8.454

From this table we can see that recurrent wheezing status and the season the microbiome was collected have the most “significant” p-values (p<0.05). For the other covariates we only have two terms that are “significant”.

8 Issues in The Logistic Regression Model

We have two main issues when we ran all of our statistical tests. The first issue is of Quasi-complete separation. This occurs when a predictor perfectly predicts the outcome. Therefore when you run the model, the model tries to assign an infinite weight to that predictor, leading to unstable estimates and model convergence issues.

ggplot(logisticregressionresults, aes(x = Coefficient)) +
  geom_histogram(fill = "skyblue", color = "black", bins = 7) +
  labs(title = "Log Odds Ratio Distribution", x = "Log Odds Ratio", y = "Count") +
  theme_classic() +
  theme(panel.border = element_rect(color = "black", fill = NA))

From this plot we can see that there are log odds ratio (coefficients) that have extreme values.
We can visualize this separation in the following plot.

ggplot(logisticregressiondata1, aes(x = Ganoderma, y = warm_season)) +
  geom_jitter(width = 0.1, height = 0.1, color = "steelblue") +
  scale_x_continuous(breaks = c(0, 1)) +
  labs(x = "Ganoderma Present (0/1)", y = "Season",
       title = "Visualizing Separation in Logistic Regression") +
  theme_classic()+
  theme(panel.border = element_rect(color = "black", fill = NA))

We can see an example of quasi-complete separation. Specifically, during the Cold season (Nov–Mar), Ganoderma is never observed, resulting in no observed presence during the cold season. In contrast, during the Warm season (Apr–Oct), Ganoderma presence varies, with both presences and absences recorded. This imbalance can lead to instability in logistic regression estimates, particularly inflating the effect size of the Cold season predictor. To fix this issue we can use Firth’s Penalized Logistic Regression Model.
The second issue is that with multiple hypothesis tests (73 tests), the likelihood of false positives (Type 1 errors) increases substantially. A Type 1 error occurs when a test incorrectly rejects a true null hypothesis, suggesting a significant effect where none exists. For instance, with 73 tests, the chance of getting at least one false positive rises to over 97%. To mitigate this risk we will use Benjamini-Hochberg procedure, which adjust the threshold for statistical significance to account for the number of tests performed.

9 Firth’s Penalized Logistic Regression Model

Firth’s Penalized Logistic Regression model applies a bias-reducing penalty to the likelihood. It produces finite, stable coefficient estimates, avoiding inflation seen in standard logistic regression models that have separation.
Let \(p\) represent the probability that fungi are present. The model is:

\[ \log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1(\text{Recurrent Wheezing}) + \beta_2(\text{Race/Ethnicity}) + \beta_3(\text{Warm Season}) + \beta_4(\text{Maternal Atopy}) \]

This model is estimated using Firth’s penalized likelihood approach, which reduces small-sample bias and handles separation by modifying the likelihood function with a Jeffreys prior.

logistf_data <- cleandata
for(f in funginames){
  logistf_data[[f]] <- ifelse(cleandata[[f]] > 0, 1, 0)
}

penalized_results <- data.frame()

for(f in funginames){
  fmodel <- logistf(as.formula(paste(f, "~ recurrent_wheezing + race_ethnicity + warm_season + maternal_atopy")), data = logistf_data)
  temp <- data.frame(
    Term = names(fmodel$coefficients),
    Fungi = f,
    Coefficient = fmodel$coefficients,
    SE = sqrt(diag(fmodel$var)),
    P_Value = fmodel$prob
  )
  penalized_results <- rbind(penalized_results, temp)
}

penalized_results <- penalized_results %>%
  filter(!grepl("Intercept", Term)) %>%
  mutate(Odds_Ratio = exp(Coefficient))

df <- logisticregressionresults %>%
  mutate(ID = paste(Term, Fungi, sep = "_")) %>%
  select(ID, Coefficient) %>%
  rename(Coefficient_Regular = Coefficient)

df_df <- penalized_results %>%
  mutate(ID = paste(Term, Fungi, sep = "_")) %>%
  select(ID, Coefficient) %>%
  rename(Coefficient_Penalized = Coefficient)

combined_df <- merge(df, df_df, by = "ID", all = TRUE)

combined_df <- combined_df %>%
  mutate(highlight = ifelse(Coefficient_Regular > 5 | Coefficient_Regular < -5, "Extreme", "Normal"))

ggplot(combined_df, aes(x = Coefficient_Regular, y = Coefficient_Penalized)) +
  geom_point(aes(color = highlight), size = 3) +
  scale_color_manual(values = c("Extreme" = "red", "Normal" = "skyblue")) +
  labs(
    title = "Scatterplot of Coefficients",
    x = "Coefficient for Logistic Regression",
    y = "Coefficient for Firth's Penalized Logistic Regression",
    color = "Coefficient Type"
  ) +
  theme_classic() +
  theme(panel.border = element_rect(color = "black", fill = NA))


Using this scatterplot we compare the standard logistic regression to Firth’s penalized logistic regression. We can see the key effect of penalization; it stabilizes the coefficient estimates (log odds ratios) by shrinking the values and preventing extreme of inflated estimates that arise with separation.
Doing Firth’s penalized logistic regression p-values are affected therefore we can look at the new “significant” p-values.

# Filter p < 0.05
significant_logistic_pen <- penalized_results %>% filter(P_Value <= 0.05)

table_sig_pen <- significant_logistic_pen %>%
  gt() %>%
  fmt_number(columns = where(is.numeric), decimals = 3) %>%
  tab_header(title = "Significant Results from Firth Logistic Regression (p<0.05)")

table_sig_pen
Significant Results from Firth Logistic Regression (p<0.05)
Term Fungi Coefficient SE P_Value Odds_Ratio
maternal_atopyyes Trichosporon 2.560 1.345 0.036 12.935
race_ethnicityWhite (Non-Hispanic) Alternaria 1.522 0.693 0.014 4.582
warm_seasonWarm (Apr - Oct) Phlebia 2.058 0.841 0.003 7.833
maternal_atopyyes Phlebia 1.284 0.565 0.023 3.610
warm_seasonWarm (Apr - Oct) Athelia −1.694 0.700 0.011 0.184
recurrent_wheezingyes Wallemia 2.436 0.910 0.020 11.428
warm_seasonWarm (Apr - Oct) Pleosporales_unclassified 1.695 0.680 0.005 5.449
warm_seasonWarm (Apr - Oct) Ceramothyrium 2.692 1.343 0.007 14.760
warm_seasonWarm (Apr - Oct) Candida 2.428 1.330 0.020 11.335
warm_seasonWarm (Apr - Oct) Cladosporium 1.704 0.840 0.023 5.498
maternal_atopyyes Teratosphaeriaceae_unclassified 2.641 1.328 0.019 14.022
recurrent_wheezingyes Ganoderma 3.691 1.091 0.001 40.069
race_ethnicityWhite (Non-Hispanic) Ganoderma 3.783 1.766 0.014 43.965
warm_seasonWarm (Apr - Oct) Ganoderma 3.254 1.579 0.014 25.904
recurrent_wheezingyes Vishniacozyma 2.320 0.973 0.042 10.175
recurrent_wheezingyes Curvularia 1.532 0.696 0.039 4.628
warm_seasonWarm (Apr - Oct) Curvularia 1.973 0.846 0.006 7.194
recurrent_wheezingyes Peniophora 1.965 0.853 0.046 7.134
race_ethnicityWhite (Non-Hispanic) Peniophora 2.714 1.469 0.027 15.082

10 P-value Correction

To deal with our second issue of multiple hypothesis testing we implemented the Benjamini-Hochberg procedure to control the fase discovery rate (FDR). Given the limited sample size of our study (only 20 subjects in the wheezing group), we adopt a less stringent FDR threshold of < 0.2. This corresponds, roughly, to that, out of every five discoveries, we expect (at most) one to be false positive (the rest are true discoveries).
The Benjamini-Hochberg procedure controls the false discovery rate (FDR) by comparing each ordered p-value \(p_{(i)}\) to a threshold:

\[ p_{(i)} \leq \frac{i}{m} \cdot \alpha \] where \(i\) is the rank, \(m\) is the total number of tests, and \(\alpha\) is the desired FDR level.

penalized_results_adj <- penalized_results %>%
  group_by(Term) %>%
  mutate(FDR_P_Value = p.adjust(P_Value, method = "fdr"))

penalized_significant_.2 <- penalized_results_adj %>% filter(FDR_P_Value <= 0.2)

penalized_significant_.2 %>%
  gt() %>%
  fmt_number(columns = where(is.numeric), decimals = 3) %>%
  tab_header(title = "Significant Results from FDR adjustment (p<0.2)")
Significant Results from FDR adjustment (p<0.2)
Fungi Coefficient SE P_Value Odds_Ratio FDR_P_Value
warm_seasonWarm (Apr - Oct)
Phlebia 2.058 0.841 0.003 7.833 0.122
Athelia −1.694 0.700 0.011 0.184 0.168
Pleosporales_unclassified 1.695 0.680 0.005 5.449 0.122
Ceramothyrium 2.692 1.343 0.007 14.760 0.122
Ganoderma 3.254 1.579 0.014 25.904 0.171
Curvularia 1.973 0.846 0.006 7.194 0.122
recurrent_wheezingyes
Ganoderma 3.691 1.091 0.001 40.069 0.052

11 Conclusion

This study investigated the relationship between the airway fungal microbiome and recurrent wheezing while adjusting for various covariates. The fungal microbiome data were highly sparse, and the limited sample size brought up issues for conventional logistic regression, including issues of perfect separation that led to instability in the estimation of log odds ratios. To deal with this issue we employed Firth’s penalized logistic regression, which provided more reliable coefficient estimates under these conditions. Additionally, we applied the Benjamini-Hochberg procedure to control the false discovery rate across multiple hypothesis tests. One main finding is that the genus Ganoderma was found to be associated with recurrent wheezing (FDR-adjusted p < 0.052), suggesting a potential worth of further investigation. Several fungal taxa also showed significant associations with seasonal variation, aligning with our expectation. These finding highlight that there may be potential of further possible research with more data.