Investigating the Early Life Airway Fungal Microbiome

1 Introduction
2 Data Description
3 Characteristics of Children in Study Stratified by Recurrent Wheezing Status
4 Inspection of Potential Covariates
5 Inspection of Fungi variables
6 Exploring Average Abundance and Prevalence among all Fungi
7 Logistic Regression Model
8 Issues in The Logistic Regression Model
9 Firth’s Penalized Logistic Regression Model
10 P-value Correction
11 Conclusion

1 Introduction

Investigating the Early Life Airway Fungal Microbiome is an exploratory study examining the potential role of the fungal microbiome (mycobiome) in the development of recurrent wheezing (≥3 episodes/year) in children. While most microbiome research has focused on bacterial communities, this study specifically investigates the understudied airway mycobiome and its association with respiratory outcomes. We are asking a simple but important question: Are certain fungi more common in kids who develop recurrent wheezing? Our study includes children between the ages 2 and 5.5 years old. During early childhood, they had nasal swabs collected, which was later used to identify fungi through DNA sequencing. Logistic Regression models were used to predict the presence of the fungi with recurrent wheezing as the main predictor. In addition we adjusted for race/ethnicity, season, maternal atopy.

Subject IDs and other identifying information are fully anonymized in this study to preserve privacy

library(ggplot2)
library(vcd)
library(Hmisc)
library(gtsummary)
library(dplyr)
library(tidyr)
library(RColorBrewer)
library(viridis)
library(plotly)
library(logistf)
library(ggrepel)
library(gt)
library(gtExtras)
library(forcats)
library(knitr)
library(patchwork)

# Load data
metadata <- read.csv("metadata.csv")
fungaldata <- read.csv("genus_table.csv")

# Convert recurrent_wheezing variable from numeric (1/0) to character ("yes"/"no")
metadata[["recurrent_wheezing"]] <- ifelse(metadata[["recurrent_wheezing"]] == 1, "yes", "no")
metadata <- metadata[!is.na(metadata[,"recurrent_wheezing"]),]

2 Data Description

We used two datasets for this analysis. The first includes participant characteristics such as age, race/ethnicity, season, and whether the child has recurrent wheezing. Below is the first few rows of the dataset.

# Presentable metadata table

metadata_table <- head(metadata) %>%
  gt() %>%
  tab_header(title = "Preview of Metadata") %>%
  fmt_number(columns = where(is.numeric), decimals = 2)

metadata_table

subject_id	recurrent_wheezing	sex	race_ethnicity	age	warm_season	maternal_atopy	maternal_asthma
Preview of Metadata
1.00	no	female	white_nh	3.01	warm (apr-oct)	no	no
2.00	no	female	black_nh	2.20	warm (apr-oct)	yes	no
3.00	no	male	white_nh	2.14	warm (apr-oct)	yes	yes
5.00	no	female	white_nh	2.10	warm (apr-oct)	no	no
6.00	yes	male	multiple	2.25	warm (apr-oct)	yes	yes
7.00	no	male	white_nh	2.09	warm (apr-oct)	no	no

The second dataset we are working with contains detailed fungal abundance data for 73 different fungi, measured from each child’s nasal swab. We can notice something important already only looking at a few rows and a few fungi, this is that most of the fungi abundance is 0. Another important detail to note is that for each subject the abundance of all the fungi combined should add up to 1.

# Presentable fungal abundance table

presentable_fungi <- c("Didymellaceae_unclassified", "Talaromyces", "Ascomycota_unclassified", "Aspergillus", "subject_id")
present_subset <- fungaldata[,presentable_fungi]

fungi_table <- present_subset %>% 
  head(6) %>% 
  gt() %>% 
  tab_header(title = "Preview of Fungaldata (4 Selected Fungi)") %>%
  fmt_number(columns = where(is.numeric), decimals = 3)

fungi_table

Didymellaceae_unclassified	Talaromyces	Ascomycota_unclassified	Aspergillus	subject_id
Preview of Fungaldata (4 Selected Fungi)
0.000	0.000	0.000	0.000	1.000
0.000	0.000	0.000	0.986	2.000
0.000	0.000	0.000	0.000	3.000
0.000	0.000	0.000	0.000	4.000
0.000	0.000	0.143	0.020	5.000
0.126	0.000	0.058	0.000	6.000

Since our main predictor for the presence of a fungi is recurrent wheezing we exclude the children who are missing this data. Therefore from our 147 children we are left with 140 children to analyze. We merge the two datasets by subject id.

fungalmetadata <- merge(metadata, fungaldata, by = "subject_id", all = TRUE)
cleandata <- fungalmetadata[!is.na(fungalmetadata[,"recurrent_wheezing"]),]

3 Characteristics of Children in Study Stratified by Recurrent Wheezing Status

# Recode categorical variables for clarity and consistency in analysis:

cleandata <- cleandata %>%
  mutate(race_ethnicity = fct_recode(race_ethnicity,
                                     "Asian" = "asian",
                                     "Black (Non-Hispanic)" = "black_nh",
                                     "Hispanic" = "hispanic",
                                     "Multiple Races" = "multiple",
                                     "White (Non-Hispanic)" = "white_nh"))

cleandata <- cleandata %>% 
  mutate(sex = fct_recode(sex,
                          "Female" = "female",
                          "Male" = "male"))

cleandata <- cleandata %>% 
  mutate(warm_season = fct_recode(warm_season,
                          "Cold (Nov-Mar)" = "cold (nov-mar)",
                          "Warm (Apr - Oct)" = "warm (apr-oct)"))

# Create TableOne
table_1 <- tbl_summary(
  data = cleandata,
  by = recurrent_wheezing,
  include = c(sex, race_ethnicity, age, warm_season, maternal_atopy, maternal_asthma, Ascomycota_unclassified, Talaromyces),
  statistic = list(all_categorical() ~ "{n} ({p}%)"),
  label = list(
    sex ~ "Sex",
    race_ethnicity ~ "Race/Ethnicity",
    age ~ "Age",
    warm_season ~ "Month microbiome was sampled",
    maternal_atopy ~ "Maternal Atopy",
    maternal_asthma ~ "Maternal Asthma",
    Ascomycota_unclassified ~ "Ascomycota (Unclassified)",
    Talaromyces ~ "Talaromyces"
  )
) %>%
  add_overall(last = TRUE) %>%
  add_p() %>%
  bold_labels() %>%
  modify_header(label ~ "**Variable**") %>%
  modify_caption("**Table 1. Demographic Characteristics of Children With and Without Recurrent Wheezing**") %>%
  modify_spanning_header(all_stat_cols() ~ "**Recurrent Wheezing Status**") %>%
  modify_header(stat_1 ~ "**No Recurrent Wheezing (N = {n})**", stat_2 ~ "**Recurrent Wheezing (N = {n})**")

table_1

**Table 1. Demographic Characteristics of Children With and Without Recurrent Wheezing**
Variable	Recurrent Wheezing Status			p-value²
Variable	No Recurrent Wheezing (N = 120)¹	Recurrent Wheezing (N = 20)¹	Overall N = 140¹	p-value²
Sex				0.2
Female	54 (45%)	6 (30%)	60 (43%)
Male	66 (55%)	14 (70%)	80 (57%)
Race/Ethnicity				0.033
Asian	1 (0.8%)	2 (10%)	3 (2.1%)
Black (Non-Hispanic)	16 (13%)	5 (25%)	21 (15%)
Hispanic	10 (8.3%)	2 (10%)	12 (8.6%)
Multiple Races	8 (6.7%)	2 (10%)	10 (7.1%)
White (Non-Hispanic)	85 (71%)	9 (45%)	94 (67%)
Age	3.09 (3.02, 3.20)	3.08 (3.03, 3.31)	3.09 (3.02, 3.21)	0.8
Month microbiome was sampled				0.7
Cold (Nov-Mar)	48 (40%)	9 (45%)	57 (41%)
Warm (Apr - Oct)	72 (60%)	11 (55%)	83 (59%)
Maternal Atopy	44 (37%)	14 (70%)	58 (41%)	0.005
Maternal Asthma	20 (17%)	11 (55%)	31 (22%)	<0.001
Ascomycota (Unclassified)	0.00 (0.00, 0.00)	0.00 (0.00, 0.47)	0.00 (0.00, 0.01)	0.061
Talaromyces	0.0000 (0.0000, 0.0000)	0.0000 (0.0000, 0.0000)	0.0000 (0.0000, 0.0000)	0.6
¹ n (%); Median (Q1, Q3)
² Pearson’s Chi-squared test; Fisher’s exact test; Wilcoxon rank sum test

Important observations from Table 1: Males had higher wheezing prevalence (70% vs. 55% in non-wheezers), though not statistically significant (p=0.2). Race/Ethnicity have significant differences (p=0.033). Age distribution is relatively the same for children with recurrent wheezing and children without. No significant difference for month microbiome data was collected, but slight warm-season predominance (55-60%). For maternal atopy 70% of wheezers had atopic mothers vs. 37% (p=0.005). Similar for maternal asthma wqheezers had 3× higher prevalence (55% vs. 17%, p<0.001). Two fungal taxa were included at the bottom of Table 1 to illustrate the typical data structure: one showed a statistically significant difference between groups, while the other did not. It is important to note that the wheezing group is relatively small (N = 20 out of 140 subjects), which limits statistical power.

In terms of actionable takeaways, maternal asthma was prioritized in this study given its strong association with recurrent wheezing. Additionally, the observed differences across race/ethnicity highlight the need to explore this varaible.

4 Inspection of Potential Covariates

soft_palette <- c("#A6CEE3", "#B2DF8A", "#FDBF6F", "#CAB2D6", "#FFB3B3")

p1 <- ggplot(cleandata, aes(x = recurrent_wheezing, y = age)) +
  geom_boxplot(fill = "#A6CEE3", color = "#1F78B4") +
  labs(title = "Age by Recurrent Wheezing", x = "Recurrent Wheezing", y = "Age") +
  theme_classic() +
  theme(panel.border = element_rect(color = "black", fill = NA))

p2 <- ggplot(cleandata, aes(x = recurrent_wheezing, fill = sex)) +
  geom_bar(position = "dodge", color = "black") +
  scale_fill_manual(values = soft_palette[1:2]) +
  labs(title = "Recurrent Wheezing by Sex", x = "Recurrent Wheezing", y = "Count", fill = "Sex") +
  theme_classic() +
  theme(panel.border = element_rect(color = "black", fill = NA))

p3 <- ggplot(cleandata, aes(x = recurrent_wheezing, fill = race_ethnicity)) +
  geom_bar(position = "dodge", color = "black") +
  scale_fill_manual(values = soft_palette) +
  labs(title = "Recurrent Wheezing by Race/Ethnicity", x = "Recurrent Wheezing", y = "Count", fill = "Race/Ethnicity") +
  theme_classic() +
  theme(panel.border = element_rect(color = "black", fill = NA))

p4 <- ggplot(cleandata, aes(x = recurrent_wheezing, fill = warm_season)) +
  geom_bar(position = "dodge", color = "black") +
  scale_fill_manual(values = soft_palette[1:2]) +
  labs(title = "Recurrent Wheezing by Season", x = "Recurrent Wheezing", y = "Count", fill = "Season") +
  theme_classic() +
  theme(panel.border = element_rect(color = "black", fill = NA))

p5 <- ggplot(cleandata, aes(x = recurrent_wheezing, fill = maternal_atopy)) +
  geom_bar(position = "dodge", color = "black") +
  scale_fill_manual(values = soft_palette[1:2]) +
  labs(title = "Recurrent Wheezing by Maternal Atopy", x = "Recurrent Wheezing", y = "Count", fill = "Maternal Atopy") +
  theme_classic() +
  theme(panel.border = element_rect(color = "black", fill = NA))

p6 <- ggplot(cleandata, aes(x = recurrent_wheezing, fill = maternal_asthma)) +
  geom_bar(position = "dodge", color = "black") +
  scale_fill_manual(values = soft_palette[1:2]) +
  labs(title = "Recurrent Wheezing by Maternal Asthma", x = "Recurrent Wheezing", y = "Count", fill = "Maternal Asthma") +
  theme_classic() +
  theme(panel.border = element_rect(color = "black", fill = NA))

(p1 | p2)

(p3 | p4)

(p5 | p6)

metadata_no_id <- metadata[, !names(metadata) %in% "subject_id"]
kable(summary(metadata_no_id), caption = "Summary of Metadata (excluding subject_id)")

Summary of Metadata (excluding subject_id)
recurrent_wheezing	sex	race_ethnicity	age	warm_season	maternal_atopy	maternal_asthma
Length:140	Length:140	Length:140	Min. :2.086	Length:140	Length:140	Length:140
Class :character	Class :character	Class :character	1st Qu.:3.022	Class :character	Class :character	Class :character
Mode :character	Mode :character	Mode :character	Median :3.087	Mode :character	Mode :character	Mode :character
NA	NA	NA	Mean :3.155	NA	NA	NA
NA	NA	NA	3rd Qu.:3.205	NA	NA	NA
NA	NA	NA	Max. :5.018	NA	NA	NA

5 Inspection of Fungi Variables

plot_histogram <- function(var, label){
  ggplot(cleandata, aes(x = .data[[var]])) +
    geom_histogram(fill = "skyblue", color = "black", bins = 10) +
    labs(title = paste(label, "Distribution"), x = label, y = "Count") +
    theme_classic() +
    theme(panel.border = element_rect(color = "black", fill = NA))
}

p7 <- plot_histogram("Ascomycota_unclassified", "Ascomycota_unclassified")
p8 <- plot_histogram("Curvularia", "Curvularia")
p9 <- plot_histogram("Talaromyces", "Talaromyces")
(p7 | p8 | p9)

Something obvious that these distribution plots tell us is that fungi abundance for most fungi is very often 0. Only three plots were shown, because they all look very similar.

6 Exploring Average Abundance and Prevalence among all Fungi

funginames <- colnames(cleandata)[9:81]
fungiinfo <- data.frame(
  "Fungi" = funginames
)

fungicolumns <- cleandata[,9:81]
fungiprevalence <- sapply(fungicolumns, function(col){
  (sum(col != 0) / nrow(cleandata)) * 100
})

fungiinfo[,"Fungi Prevalence"] <- fungiprevalence[fungiinfo[,"Fungi"]]

fungiabundance <- sapply(fungicolumns, function(col){
  sum(col) / nrow(cleandata)
})

fungiinfo[, "Average.abundance"] <- fungiabundance[fungiinfo[,"Fungi"]]

sort_fungi <- fungiinfo[order(-fungiinfo[, "Average.abundance"]), ]

top_15 <- sort_fungi[1:15, c("Fungi", "Average.abundance")]

palette <- c(
  "#264653","#2a9d8f","#e76f51","#e9c46a","#8d99ae","#d90429","#6a994e",
  "#9a031e","#007f5f","#5f0f40","#f08a5d","#588157","#ffba08","#3d405b","#bc6c25"
)

fungi_colors <- setNames(palette, top_15$Fungi)

full_palette <- c(fungi_colors, "All Other" = "#999999")

filtered_fungi <- fungiinfo[fungiinfo$Fungi %in% top_15$Fungi, ]

top_fungi <- filtered_fungi[, c("Fungi", "Average.abundance")]

other_abundance <- sum(fungiinfo$Average.abundance[!fungiinfo$Fungi %in% top_15$Fungi])

other_row <- data.frame(
  Fungi = "All Other",
  Average.abundance = other_abundance
)

plot_data <- rbind(top_fungi, other_row)

plot_data$Fungi <- factor(
  plot_data$Fungi,
  levels = c(top_15$Fungi, "All Other")
)

fungaldatalong <- cleandata[, c("subject_id", "recurrent_wheezing", funginames)] %>% 
  pivot_longer(cols = all_of(funginames),
               names_to = "Fungi",
               values_to = "Abundance")

fungaldatalong <- fungaldatalong %>% 
  mutate(Fungus2 = ifelse(Fungi %in% top_15$Fungi, Fungi, "All Other"))

fungus2_summary <- fungaldatalong %>%
  group_by(subject_id, recurrent_wheezing, Fungus2) %>%
  summarise(Total_abundance = sum(Abundance), .groups = "drop")

avg_data <- fungus2_summary %>%
  group_by(recurrent_wheezing, Fungus2) %>%
  summarise(AvgAbundance = mean(Total_abundance), .groups = "drop")

avg_data <- avg_data %>%
  group_by(recurrent_wheezing) %>%
  mutate(Rank = ifelse(Fungus2 != "All Other", rank(-AvgAbundance, ties.method = "first"), NA)) %>%
  ungroup()

avg_data <- avg_data %>%
  arrange(recurrent_wheezing, Rank) %>%
  mutate(Fungus2 = factor(Fungus2, levels = unique(Fungus2)))

fungaldatalong_filtered <- fungaldatalong %>%
  filter(Fungi %in% top_15$Fungi) %>%
  mutate(Fungi = factor(Fungi, levels = top_15$Fungi))

fungi_subject_summary <- fungaldatalong %>%
  group_by(subject_id, recurrent_wheezing, Fungus2) %>%
  summarise(TotalAbundance = sum(Abundance), .groups = "drop") %>%
  mutate(Fungus2 = factor(Fungus2, levels = c(top_15$Fungi, "All Other")))

cleandata[["race_ethnicity"]] <- ifelse(cleandata[["race_ethnicity"]] == "White (Non-Hispanic)","White (Non-Hispanic)", "Other")

ggplot(plot_data, aes(x = "", y = Average.abundance, fill = Fungi)) +
  geom_bar(stat = "identity", width = 0.6) +  
  labs(title = "Top 15 fungi by Average Abundance Across All Samples",
       x = "Fungi", y = "Average Abundance") +
  theme_classic() +
  theme(panel.border = element_rect(color = "black", fill = NA)) +
  scale_fill_manual(values = full_palette)

We want to explore the top 15 fungi by average abundance across all samples. We can see that the top two bars are much larger than the rest of the other bars. Also something else to note is that the bottom half of the fungi abundances are very small. Therefore there is no reason to plot the other fungi, as it will not be informative. Therefore for the 58 other fungi we combine them into an “All Other” bar.

ggplot(avg_data, aes(x = recurrent_wheezing, y = AvgAbundance, fill = Fungus2)) +
  geom_bar(stat = "identity", position = "stack", width = .8) +
  scale_fill_manual(values = c(fungi_colors, "All Other" = "#999999")) +
  labs(title = "Average Fungal Abundance by Wheezing Status", x = "Recurrent Wheezing", y = "Average Abundance", fill = "Fungi") +
  theme_classic() +
  theme(panel.border = element_rect(color = "black", fill = NA))

Using this plot we want to look if there are any potential fungi abundance differences for recurrent wheezing status. Something we can notice immediately is that Ascomycota unclassified seems to be more abundant in children with recurrent wheezing. We can also see that there is less diversity of fungi in children with recurrent wheezing. This is potentially because we only have 20 subjects that have recurrent wheezing.

ggplot(fungi_subject_summary, aes(x = as.character(subject_id), y = TotalAbundance, fill = Fungus2)) +
  geom_bar(stat = "identity") +
  facet_grid(. ~ recurrent_wheezing, space = "free_x", scales = "free_x") +
  scale_fill_manual(values = c(fungi_colors, "All Other" = "#999999")) +
  labs(title = "Fungal Abundance by Subject and Wheezing Satus",x = "Subject", y = "Abundance", fill = "Fungi") +
  theme_classic() +
  theme(panel.border = element_rect(color = "black", fill = NA), axis.text.x = element_blank())

Each bar represents a subject. Each different color in the bar are fungi abundances, and each subject stacks up to 1. The plot is divided into two groups, children who do not have recurrent wheezing and children that do have it. We plotted the top 15 most abundant fungi and for the other fungi we put them all in the bar “All Other” since their abundances are very small. From this plot it is not obvious there is one (or few) dominant fungi that are “always there”. In addition it is not super obvious how the fungal mycobiome is different between wheezing vs. Non-Wheezing.

fungiinfo[, "Average.abundance"] <- fungiabundance[fungiinfo[,"Fungi"]]

hist(fungiinfo[,"Average.abundance"],
     col = "skyblue",
     border = "black",
     main = "Average abundance Distribution",
     xlab = "Average abundance Prevalence",
     ylab = "Count",
)
box()

Average abundance is the mean amount or level of that fungus measured among all subjects. We can see that most average abundance is between 0.00 to 0.02.

hist(fungiinfo[,"Fungi Prevalence"],
     col = "skyblue",
     border = "black",
     main = "Fungi Prevalence Distribution",
     xlab = "Fungi Prevalence",
     ylab = "Count",
)
box()

Prevalence is calculated by how often a fungus is detected across all subjects (i.e., the proportion of subjects where the fungus is present). Most fungi have very small prevalence. There are only two fungi that have a prevalence higher than 25%.

7 Logistic Regression Model

We will examine the presence/absence of fungi vs recurrent wheezing, along with some covariates (logistic regression). For that we changed the fungi abundance to presence if it is greater than 0 it is present otherwise not present. The covariates we explored are race/ethnicity, season and maternal atopy. We ignore the warning of separation for now, we will address it later.
Let \(p\) represent the probability that fungi are present. The logistic regression model is then: \[ \log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1(\text{Recurrent Wheezing}) + \beta_2(\text{Race/Ethnicity}) + \beta_3(\text{Warm Season}) + \beta_4(\text{Maternal Atopy}) \]

logisticregressiondata1 <- cleandata
for(fungus in funginames){
  logisticregressiondata1[[fungus]] <- ifelse(cleandata[[fungus]] > 0, 1, 0)
}

logisticregressionresults <- data.frame()

for(fungus in funginames){
  formula <- as.formula(paste(fungus, "~ recurrent_wheezing + race_ethnicity + warm_season + maternal_atopy"))
  model <- glm(formula, data = logisticregressiondata1, family = binomial)
  model_info <- coef(summary(model))
  df <- data.frame(
    Term = rownames(model_info),
    Fungi = fungus,
    Coefficient = model_info[,"Estimate"],
    SE = model_info[,"Std. Error"],
    P_Value = model_info[,"Pr(>|z|)"]
  )
  logisticregressionresults <- rbind(logisticregressionresults, df)
}

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

We put the results of the 73 tests conducted into a table removing the intercept term and keeping only the p-values that are below 0.05.

logisticregressionresults <- logisticregressionresults %>%
  filter(!grepl("Intercept", Term)) %>%
  mutate(Odds_Ratio = exp(Coefficient))

# Filter p < 0.05
significant_logistic <- logisticregressionresults %>% filter(P_Value <= 0.05)

table_sig <- significant_logistic %>%
  gt() %>%
  fmt_number(columns = where(is.numeric), decimals = 3) %>%
  tab_header(title = "Significant Logistic Regression Results (p<0.05)")

table_sig

Term	Fungi	Coefficient	SE	P_Value	Odds_Ratio
Significant Logistic Regression Results (p<0.05)
race_ethnicityWhite (Non-Hispanic)	Alternaria	1.755	0.794	0.027	5.785
warm_seasonWarm (Apr - Oct)	Phlebia	2.470	1.060	0.020	11.823
maternal_atopyyes	Phlebia	1.372	0.610	0.024	3.945
warm_seasonWarm (Apr - Oct)	Athelia	−1.903	0.816	0.020	0.149
recurrent_wheezingyes	Wallemia	2.765	1.153	0.017	15.877
warm_seasonWarm (Apr - Oct)	Pleosporales_unclassified	1.928	0.780	0.013	6.876
recurrent_wheezingyes	Ganoderma	4.493	1.463	0.002	89.400
recurrent_wheezingyes	Vishniacozyma	2.741	1.349	0.042	15.496
recurrent_wheezingyes	Curvularia	1.614	0.771	0.036	5.020
warm_seasonWarm (Apr - Oct)	Curvularia	2.411	1.075	0.025	11.143
recurrent_wheezingyes	Peniophora	2.135	1.038	0.040	8.454

From this table we can see that recurrent wheezing status and the season the microbiome was collected have the most “significant” p-values (p<0.05). For the other covariates we only have two terms that are “significant”.

8 Issues in The Logistic Regression Model

We have two main issues when we ran all of our statistical tests. The first issue is of Quasi-complete separation. This occurs when a predictor perfectly predicts the outcome. Therefore when you run the model, the model tries to assign an infinite weight to that predictor, leading to unstable estimates and model convergence issues.

ggplot(logisticregressionresults, aes(x = Coefficient)) +
  geom_histogram(fill = "skyblue", color = "black", bins = 7) +
  labs(title = "Log Odds Ratio Distribution", x = "Log Odds Ratio", y = "Count") +
  theme_classic() +
  theme(panel.border = element_rect(color = "black", fill = NA))

From this plot we can see that there are log odds ratio (coefficients) that have extreme values.
We can visualize this separation in the following plot.

ggplot(logisticregressiondata1, aes(x = Ganoderma, y = warm_season)) +
  geom_jitter(width = 0.1, height = 0.1, color = "steelblue") +
  scale_x_continuous(breaks = c(0, 1)) +
  labs(x = "Ganoderma Present (0/1)", y = "Season",
       title = "Visualizing Separation in Logistic Regression") +
  theme_classic()+
  theme(panel.border = element_rect(color = "black", fill = NA))

We can see an example of quasi-complete separation. Specifically, during the Cold season (Nov–Mar), Ganoderma is never observed, resulting in no observed presence during the cold season. In contrast, during the Warm season (Apr–Oct), Ganoderma presence varies, with both presences and absences recorded. This imbalance can lead to instability in logistic regression estimates, particularly inflating the effect size of the Cold season predictor. To fix this issue we can use Firth’s Penalized Logistic Regression Model.
The second issue is that with multiple hypothesis tests (73 tests), the likelihood of false positives (Type 1 errors) increases substantially. A Type 1 error occurs when a test incorrectly rejects a true null hypothesis, suggesting a significant effect where none exists. For instance, with 73 tests, the chance of getting at least one false positive rises to over 97%. To mitigate this risk we will use Benjamini-Hochberg procedure, which adjust the threshold for statistical significance to account for the number of tests performed.

9 Firth’s Penalized Logistic Regression Model

Firth’s Penalized Logistic Regression model applies a bias-reducing penalty to the likelihood. It produces finite, stable coefficient estimates, avoiding inflation seen in standard logistic regression models that have separation.
Let \(p\) represent the probability that fungi are present. The model is:

\[ \log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1(\text{Recurrent Wheezing}) + \beta_2(\text{Race/Ethnicity}) + \beta_3(\text{Warm Season}) + \beta_4(\text{Maternal Atopy}) \]

This model is estimated using Firth’s penalized likelihood approach, which reduces small-sample bias and handles separation by modifying the likelihood function with a Jeffreys prior.

logistf_data <- cleandata
for(f in funginames){
  logistf_data[[f]] <- ifelse(cleandata[[f]] > 0, 1, 0)
}

penalized_results <- data.frame()

for(f in funginames){
  fmodel <- logistf(as.formula(paste(f, "~ recurrent_wheezing + race_ethnicity + warm_season + maternal_atopy")), data = logistf_data)
  temp <- data.frame(
    Term = names(fmodel$coefficients),
    Fungi = f,
    Coefficient = fmodel$coefficients,
    SE = sqrt(diag(fmodel$var)),
    P_Value = fmodel$prob
  )
  penalized_results <- rbind(penalized_results, temp)
}

penalized_results <- penalized_results %>%
  filter(!grepl("Intercept", Term)) %>%
  mutate(Odds_Ratio = exp(Coefficient))

df <- logisticregressionresults %>%
  mutate(ID = paste(Term, Fungi, sep = "_")) %>%
  select(ID, Coefficient) %>%
  rename(Coefficient_Regular = Coefficient)

df_df <- penalized_results %>%
  mutate(ID = paste(Term, Fungi, sep = "_")) %>%
  select(ID, Coefficient) %>%
  rename(Coefficient_Penalized = Coefficient)

combined_df <- merge(df, df_df, by = "ID", all = TRUE)

combined_df <- combined_df %>%
  mutate(highlight = ifelse(Coefficient_Regular > 5 | Coefficient_Regular < -5, "Extreme", "Normal"))

ggplot(combined_df, aes(x = Coefficient_Regular, y = Coefficient_Penalized)) +
  geom_point(aes(color = highlight), size = 3) +
  scale_color_manual(values = c("Extreme" = "red", "Normal" = "skyblue")) +
  labs(
    title = "Scatterplot of Coefficients",
    x = "Coefficient for Logistic Regression",
    y = "Coefficient for Firth's Penalized Logistic Regression",
    color = "Coefficient Type"
  ) +
  theme_classic() +
  theme(panel.border = element_rect(color = "black", fill = NA))

Using this scatterplot we compare the standard logistic regression to Firth’s penalized logistic regression. We can see the key effect of penalization; it stabilizes the coefficient estimates (log odds ratios) by shrinking the values and preventing extreme of inflated estimates that arise with separation.
Doing Firth’s penalized logistic regression p-values are affected therefore we can look at the new “significant” p-values.

# Filter p < 0.05
significant_logistic_pen <- penalized_results %>% filter(P_Value <= 0.05)

table_sig_pen <- significant_logistic_pen %>%
  gt() %>%
  fmt_number(columns = where(is.numeric), decimals = 3) %>%
  tab_header(title = "Significant Results from Firth Logistic Regression (p<0.05)")

table_sig_pen

Term	Fungi	Coefficient	SE	P_Value	Odds_Ratio
Significant Results from Firth Logistic Regression (p<0.05)
maternal_atopyyes	Trichosporon	2.560	1.345	0.036	12.935
race_ethnicityWhite (Non-Hispanic)	Alternaria	1.522	0.693	0.014	4.582
warm_seasonWarm (Apr - Oct)	Phlebia	2.058	0.841	0.003	7.833
maternal_atopyyes	Phlebia	1.284	0.565	0.023	3.610
warm_seasonWarm (Apr - Oct)	Athelia	−1.694	0.700	0.011	0.184
recurrent_wheezingyes	Wallemia	2.436	0.910	0.020	11.428
warm_seasonWarm (Apr - Oct)	Pleosporales_unclassified	1.695	0.680	0.005	5.449
warm_seasonWarm (Apr - Oct)	Ceramothyrium	2.692	1.343	0.007	14.760
warm_seasonWarm (Apr - Oct)	Candida	2.428	1.330	0.020	11.335
warm_seasonWarm (Apr - Oct)	Cladosporium	1.704	0.840	0.023	5.498
maternal_atopyyes	Teratosphaeriaceae_unclassified	2.641	1.328	0.019	14.022
recurrent_wheezingyes	Ganoderma	3.691	1.091	0.001	40.069
race_ethnicityWhite (Non-Hispanic)	Ganoderma	3.783	1.766	0.014	43.965
warm_seasonWarm (Apr - Oct)	Ganoderma	3.254	1.579	0.014	25.904
recurrent_wheezingyes	Vishniacozyma	2.320	0.973	0.042	10.175
recurrent_wheezingyes	Curvularia	1.532	0.696	0.039	4.628
warm_seasonWarm (Apr - Oct)	Curvularia	1.973	0.846	0.006	7.194
recurrent_wheezingyes	Peniophora	1.965	0.853	0.046	7.134
race_ethnicityWhite (Non-Hispanic)	Peniophora	2.714	1.469	0.027	15.082

10 P-value Correction

To deal with our second issue of multiple hypothesis testing we implemented the Benjamini-Hochberg procedure to control the fase discovery rate (FDR). Given the limited sample size of our study (only 20 subjects in the wheezing group), we adopt a less stringent FDR threshold of < 0.2. This corresponds, roughly, to that, out of every five discoveries, we expect (at most) one to be false positive (the rest are true discoveries).
The Benjamini-Hochberg procedure controls the false discovery rate (FDR) by comparing each ordered p-value \(p_{(i)}\) to a threshold:

\[ p_{(i)} \leq \frac{i}{m} \cdot \alpha \] where \(i\) is the rank, \(m\) is the total number of tests, and \(\alpha\) is the desired FDR level.

penalized_results_adj <- penalized_results %>%
  group_by(Term) %>%
  mutate(FDR_P_Value = p.adjust(P_Value, method = "fdr"))

penalized_significant_.2 <- penalized_results_adj %>% filter(FDR_P_Value <= 0.2)

penalized_significant_.2 %>%
  gt() %>%
  fmt_number(columns = where(is.numeric), decimals = 3) %>%
  tab_header(title = "Significant Results from FDR adjustment (p<0.2)")

Fungi	Coefficient	SE	P_Value	Odds_Ratio	FDR_P_Value
Significant Results from FDR adjustment (p<0.2)
warm_seasonWarm (Apr - Oct)
Phlebia	2.058	0.841	0.003	7.833	0.122
Athelia	−1.694	0.700	0.011	0.184	0.168
Pleosporales_unclassified	1.695	0.680	0.005	5.449	0.122
Ceramothyrium	2.692	1.343	0.007	14.760	0.122
Ganoderma	3.254	1.579	0.014	25.904	0.171
Curvularia	1.973	0.846	0.006	7.194	0.122
recurrent_wheezingyes
Ganoderma	3.691	1.091	0.001	40.069	0.052

11 Conclusion

This study investigated the relationship between the airway fungal microbiome and recurrent wheezing while adjusting for various covariates. The fungal microbiome data were highly sparse, and the limited sample size brought up issues for conventional logistic regression, including issues of perfect separation that led to instability in the estimation of log odds ratios. To deal with this issue we employed Firth’s penalized logistic regression, which provided more reliable coefficient estimates under these conditions. Additionally, we applied the Benjamini-Hochberg procedure to control the false discovery rate across multiple hypothesis tests. One main finding is that the genus Ganoderma was found to be associated with recurrent wheezing (FDR-adjusted p < 0.052), suggesting a potential worth of further investigation. Several fungal taxa also showed significant associations with seasonal variation, aligning with our expectation. These finding highlight that there may be potential of further possible research with more data.

Investigating the Early Life Airway Fungal Microbiome

Juan Carlos Guevara
Siyuan Ma, Ph.D.
Department of Biostatistics, Vanderbilt University Medical Center
Christian Rosas-Salazar, MD, MPH
Department of Pediatrics, Vanderbilt University Medical Center

June 17, 2025

1 Introduction

2 Data Description

3 Characteristics of Children in Study Stratified by Recurrent Wheezing Status

4 Inspection of Potential Covariates

5 Inspection of Fungi Variables

6 Exploring Average Abundance and Prevalence among all Fungi

7 Logistic Regression Model

8 Issues in The Logistic Regression Model

9 Firth’s Penalized Logistic Regression Model

10 P-value Correction

11 Conclusion

Investigating the Early Life Airway Fungal Microbiome

Juan Carlos GuevaraSiyuan Ma, Ph.D.Department of Biostatistics, Vanderbilt University Medical CenterChristian Rosas-Salazar, MD, MPHDepartment of Pediatrics, Vanderbilt University Medical Center

June 17, 2025

1 Introduction

2 Data Description

3 Characteristics of Children in Study Stratified by Recurrent Wheezing Status

4 Inspection of Potential Covariates

5 Inspection of Fungi Variables

6 Exploring Average Abundance and Prevalence among all Fungi

7 Logistic Regression Model

8 Issues in The Logistic Regression Model

9 Firth’s Penalized Logistic Regression Model

10 P-value Correction

11 Conclusion

Juan Carlos Guevara
Siyuan Ma, Ph.D.
Department of Biostatistics, Vanderbilt University Medical Center
Christian Rosas-Salazar, MD, MPH
Department of Pediatrics, Vanderbilt University Medical Center