This instruction requires a data frame, a trait of interest, and a specific SNP with its corresponding minor allele as inputs. It generates a table displaying mean, variance, and bimodality test results, while also producing a violin-histogram plot for visual analysis.
suppressPackageStartupMessages({
library(tidyverse)
library(dplyr)
library(stringr)
library(mixR)
library(car)
library(patchwork)
library(ggplot2)
library(tidyr)
library(gridExtra)
library(grid)
library(rlang)
})
Before proceeding, ensure the data frame is accessible. Our original dataset contains each row as the genotype for a single SNP for one individual, along with their phenotype, resulting 1,752,394 observations from 4,403 individuals and 398 SNPs. We convert this data from a long to a wide format for better demonstration. The columns in the wide table include: Indiv_ID, SNP_1_Minor_Allele, SNP_2_Minor_Allele, SNP_3_Minor_Allele, …, SEX, AGE, DIABETES, and other traits of interest.
SEX, AGE, and DIABETES are covariates used in the mean effect analysis. We’ll demonstrate the trait of interest using Log_BMI, but feel free to add more traits as needed.
The name of each SNP_Minor_Allele column identifies a specific SNP using its rs ID followed by the minor allele. You might need to refer to cohort genotype data to determine these information. The value in these columns indicates the inheritance origin of the minor allele, which follows four possible categories: maternal, paternal, both, or none.
For example, in the Framingham Heart Study cohort, individual ID 9995 has the minor allele “C” at SNP rs8058618 inherited from the mother, while the minor allele “G” at SNP rs4389136 is inherited from the father (The first row of the data overview below). This inheritance pattern can be determined by comparing the individual’s genotype with the parental genotypes.
The following function performs mean, variance, and bimodality tests for a specified SNP and trait of interest. It requires three parameters: df, which is your dataset; SNP_minor_allele, representing the variant with its rs ID and the corresponding minor allele; and trait, which is the trait of interest.
It consists of three sections:
Mean Effect: To investigate how the maternally or paternally inherited minor allele affects trait means, we perform an mQTL POE contrast between two heterozygous genotype (Maternal vs Paternal) groups using a linear regression model, adjusting for age, sex, and diabetes. This analysis will return the mean effect p-value and the effect size, which is mean difference of the trait of interest from maternal minus paternal inheritance, adjusting for the covariates. If the difference is greater than 0, it indicates a stronger maternal effect, while a negative value indicates a stronger paternal effect.
Variance Effect: To evaluate how these inherited minor alleles influence trait dispersion, we apply Levene’s test to assess variance equality across the two heterozygous genotypes (Maternal vs. Paternal). The analysis returns the variance effect p-value and the effect size, which is the standard deviation ratio of the trait of interest from maternal divided by paternal inheritance. If the ratio is greater than 1, it indicates greater variability in the maternal inheritance group, while a ratio less than 1 indicates greater variability in the paternal inheritance group.
Bimodality Test: We use the mixR package to evaluate bimodality in the trait of interest across genotype groups from a specific SNP. This analysis generates the mean and standard deviation of both components, the proportion of the first component, and the p-value of the test for all four genotype groups respectively. Note that a minimum sample size of 30 is required for effective mixR fitting; results will return NA if this requirement is not met. We employ the default settings of the mixR methods (mixR::mixfit, mixR::bs.test). If these do not work for specific data sets, please refer to the mixR documentation for parameter tuning: https://cran.r-project.org/web/packages/mixR/mixR.pdf .
##################################################################
### Create a function to perform mQTL vQTL and Bimodality Test
##################################################################
mQTL_vQTL_Bimodality <- function(df, SNP_minor_allele, trait) {
# Convert string column names to symbols
SNP_minor_allele_sym <- rlang::sym(SNP_minor_allele)
trait_sym <- rlang::sym(trait)
# Filter the dataframe
df_filtered <- df %>%
dplyr::select(!!SNP_minor_allele_sym, SEX, AGE, !!trait_sym, DIABETES) ## !!SNP_minor_allele_sym evaluate the symbols stored in SNP_minor_allele_sym and injects it as a column name
###################################
## Mean Effect ##
###################################
# Set the reference level as "Paternal" to ensure "Maternal - Paternal" in the mean effect analysis
df_filtered[[SNP_minor_allele]] <- relevel(factor(df_filtered[[SNP_minor_allele]]), ref = "Paternal")
# Create formula dynamically
formula_str <- paste0(trait, " ~ AGE + SEX + DIABETES + ", SNP_minor_allele)
lm_model <- lm(as.formula(formula_str), data = df_filtered)
# Extract p-value and effect
mQTL_p <- round(summary(lm_model)$coefficients[paste0(SNP_minor_allele, "Maternal"), "Pr(>|t|)"], 3)
mQTL_Eff <- round(lm_model$coefficients[paste0(SNP_minor_allele, "Maternal")], 3)
#################################
## Variance Effect ##
#################################
# Filter for heterozygous cases
df_filtered_heter <- df_filtered %>%
dplyr::filter(!!rlang::sym(SNP_minor_allele) %in% c("Maternal", "Paternal"))
# Extract data for different genotypes
data_ma <- df_filtered %>%
dplyr::filter(!!rlang::sym(SNP_minor_allele) == "Maternal") %>%
dplyr::select(!!trait_sym) %>%
dplyr::pull() %>%
na.omit()
data_pa <- df_filtered %>%
dplyr::filter(!!rlang::sym(SNP_minor_allele) == "Paternal") %>%
dplyr::select(!!trait_sym) %>%
dplyr::pull() %>%
na.omit()
data_both <- df_filtered %>%
dplyr::filter(!!rlang::sym(SNP_minor_allele) == "Both") %>%
dplyr::select(!!trait_sym) %>%
dplyr::pull() %>%
na.omit()
data_none <- df_filtered %>%
dplyr::filter(!!rlang::sym(SNP_minor_allele) == "None") %>%
dplyr::select(!!trait_sym) %>%
dplyr::pull() %>%
na.omit()
# Calculate variance QTL statistics
vQTL_formula <- as.formula(paste0(trait, " ~ ", SNP_minor_allele))
vQTL_p <- round(car::leveneTest(vQTL_formula, data = df_filtered_heter)$`Pr(>F)`[1], 3)
vQTL_Eff <- round(sd(data_ma)/sd(data_pa), 3)
####################################################
## Bimodality Test ##
###################################################
# Define mixR test function
mixR_test <- function(x) {
# Initialize results list with NAs
results <- list(
fit_mu1 = NA,
fit_mu2 = NA,
fit_sd1 = NA,
fit_sd2 = NA,
fit_prop1 = NA,
fit_p_value = NA
)
# Check if length of x is greater than 30
if (length(x) <= 30) {
warning("mixR warning: the sample size should be greater than 30.")
return(results)
}
# Try to fit the model and perform the test
tryCatch({
bimo_fit <- mixR::mixfit(x, ncomp = 2)
bimo_test <- mixR::bs.test(x, ncomp = c(1, 2))
# Update results with values from the fit
results$fit_mu1 <- round(bimo_fit$mu[1], 2)
results$fit_mu2 <- round(bimo_fit$mu[2], 2)
results$fit_sd1 <- round(bimo_fit$sd[1], 2)
results$fit_sd2 <- round(bimo_fit$sd[2], 2)
results$fit_prop1 <- round(bimo_fit$pi[1], 2)
results$fit_p_value <- bimo_test$pvalue
return(results)
}, error = function(e) {
# Return the list with NAs if there's an error
message("Error in mixR test: ", e$message)
return(results)
})
}
# Apply mixR test to each genotype
ma <- mixR_test(data_ma)
pa <- mixR_test(data_pa)
both <- mixR_test(data_both)
none <- mixR_test(data_none)
# Collect results
results <- data.frame(
SNP_Minor_Allele = SNP_minor_allele,
ma_pa_mQTL_p = mQTL_p,
ma_pa_mQTL_Eff = mQTL_Eff,
ma_pa_vQTL_p = vQTL_p,
ma_pa_vQTL_Eff = vQTL_Eff,
mixR_ma_mu1_mu2 = paste0(ma$fit_mu1, "_", ma$fit_mu2),
mixR_ma_sd1_sd2 = paste0(ma$fit_sd1, "_", ma$fit_sd2),
mixR_ma_prop1 = ma$fit_prop1,
mixR_ma_p = ma$fit_p_value,
mixR_pa_mu1_mu2 = paste0(pa$fit_mu1, "_", pa$fit_mu2),
mixR_pa_sd1_sd2 = paste0(pa$fit_sd1, "_", pa$fit_sd2),
mixR_pa_prop1 = pa$fit_prop1,
mixR_pa_p = pa$fit_p_value,
mixR_both_mu1_mu2 = paste0(both$fit_mu1, "_", both$fit_mu2),
mixR_both_sd1_sd2 = paste0(both$fit_sd1, "_", both$fit_sd2),
mixR_both_prop1 = both$fit_prop1,
mixR_both_p = both$fit_p_value,
mixR_none_mu1_mu2 = paste0(none$fit_mu1, "_", none$fit_mu2),
mixR_none_sd1_sd2 = paste0(none$fit_sd1, "_", none$fit_sd2),
mixR_none_prop1 = none$fit_prop1,
mixR_none_p = none$fit_p_value
)
# Transpose results and set column names
results <- as.data.frame(t(results))
colnames(results)[1] <- results[1,1]
rownames(results)[1] <- "SNP_Minor_Allele"
return(results)
}
The following function displays the violin plot and histogram of the SNP and trait of interest across 6 genotype groups(“Maternal”, “Paternal”, “None_Maternal_Heter”, “None_Paternal_Heter”, “None_Both_Parents_Homo”, “None_Both_Parents_Heter”). Note if the count of any group is less than 5, it won’t appear in the plot. It requires 4 parameters: df, which is your dataset; SNP_minor_allele, representing the variant with its rs ID and the minor allele; trait, which is the trait of interest; trait_display_name is optional, when it’s not specified, the function will display the trait instead.
violin_histogram_combined <- function(df, snp_name, trait, trait_display_name = NULL) {
# If trait_display_name is not provided, use trait
if(is.null(trait_display_name)) {
trait_display_name <- trait
}
df_filtered = df %>%
dplyr::filter(SNP_Minor_Allele == snp_name, Minor_Allele_Inheritance_Interested_Subgroups != "Both") %>%
dplyr::select(!!sym(trait), Minor_Allele_Inheritance_Interested_Subgroups)
df_filtered$Minor_Allele_Inheritance_Interested_Subgroups = factor(df_filtered$Minor_Allele_Inheritance_Interested_Subgroups,
levels = c("Maternal", "Paternal", "None_Maternal_Heter", "None_Paternal_Heter", "None_Both_Parents_Homo", "None_Both_Parents_Heter"))
# Store the original levels and their corresponding colors for consistent mapping
# Assuming the order in genotype_number matches the expected groups
all_inheritance_types <- c("Maternal", "Paternal", "None_Maternal_Heter", "None_Paternal_Heter", "None_Both_Parents_Homo", "None_Both_Parents_Heter")
genotype_number = c(2,3,4,5,6,7)
color_mapping <- setNames(genotype_number, all_inheritance_types)
# Count observations in each Minor_Allele_Inheritance_Interested_Subgroups group
group_counts <- table(df_filtered$Minor_Allele_Inheritance_Interested_Subgroups)
# Filter out groups with fewer than 5 observations
groups_to_keep <- names(group_counts[group_counts >= 5])
#groups_to_keep = names(group_counts)
# Filter the dataframe to only include groups with sufficient observations
df_filtered <- df_filtered[df_filtered$Minor_Allele_Inheritance_Interested_Subgroups %in% groups_to_keep, ]
# If no groups remain after filtering, return a blank plot with a message
if(nrow(df_filtered) == 0) {
empty_plot <- ggplot() +
annotate("text", x = 0.5, y = 0.5, label = "No groups with 5 or more observations") +
theme_void()
return(arrangeGrob(empty_plot))
}
# Ensure Minor_Allele_Inheritance_Interested_Subgroups is a factor with only the levels that have 5+ observations
df_filtered$Minor_Allele_Inheritance_Interested_Subgroups <- factor(df_filtered$Minor_Allele_Inheritance_Interested_Subgroups)
# Get the subset of color mapping for the groups that remain after filtering
filtered_color_mapping <- color_mapping[groups_to_keep]
# Create a named vector for secondary labels
secondary_labels <- c(
"Maternal" = "Offspring Ref|Alt",
"Paternal" = "Offspring Alt|Ref",
"Both" = "Offspring Alt|Alt",
"None_Maternal_Heter" = "Offspring Ref|Ref Mother is Heter",
"None_Paternal_Heter" = "Offspring Ref|Ref Father is Heter",
"None_Both_Parents_Homo" = "Offspring Ref|Ref Parents Ref|Ref ",
"None_Both_Parents_Heter" = "Offspring Ref|Ref Parents are Heter"
)
# Get secondary labels for groups that remain after filtering
filtered_secondary_labels <- secondary_labels[groups_to_keep]
# Calculate the maximum value for positioning the top labels
y_max <- max(df_filtered[[trait]], na.rm = TRUE)
y_min <- min(df_filtered[[trait]], na.rm = TRUE)
y_range <- y_max - y_min
# Violin Plot
violin_plot <- ggplot(df_filtered, aes(x = Minor_Allele_Inheritance_Interested_Subgroups, y = .data[[trait]],
fill = Minor_Allele_Inheritance_Interested_Subgroups)) +
geom_violin(trim = FALSE, alpha = 0.3) +
geom_point(color = "darkgrey", position = position_jitter(width = 0.1, height = 0),
alpha = 0.5) +
stat_summary(fun.data = "mean_sdl", geom = "pointrange", color = "black", na.rm = TRUE) +
labs(title = paste(trait_display_name, " - ", snp_name), # Combined title with trait and SNP
x = "", # Remove x-axis label as we're using custom annotations
y = paste(trait_display_name)) +
scale_fill_manual(values = filtered_color_mapping) +
scale_color_manual(values = filtered_color_mapping) +
theme_minimal(base_size = 30) +
theme(
plot.title = element_text(size = 30),
axis.title.x = element_text(size = 30),
axis.title.y = element_text(size = 30),
axis.text.x = element_text(size = 30),
axis.text.y = element_text(size = 30),
legend.title = element_text(size = 30),
legend.text = element_text(size = 30),
# Add margin at the top of the plot for secondary labels
plot.margin = margin(t = 40, r = 5, b = 5, l = 5, unit = "pt")
)
# Add secondary labels at the top of each violin
x_positions <- 1:length(groups_to_keep)
# Add text at the top for offspring labels
violin_plot <- violin_plot +
annotate("text",
x = x_positions,
y = rep(y_max + y_range * 0.05, length(x_positions)),
label = filtered_secondary_labels,
size = 8)
# Get the range of the trait column for density plot
trait_range <- range(df_filtered[[trait]], na.rm = TRUE)
trait_min <- floor(trait_range[1] * 100) / 100
trait_max <- ceiling(trait_range[2] * 100) / 100
step_size <- (trait_max - trait_min) / 4
breaks <- seq(trait_min, trait_max, by = step_size)
# Density Plot with histogram - explicitly move facet labels to the bottom
# Create a modified theme that ensures strip position at bottom
bottom_strip_theme <- theme_minimal(base_size = 30) +
theme(
plot.title = element_text(size = 30),
axis.title.x = element_text(size = 30),
axis.title.y = element_text(size = 30),
axis.text.x = element_text(size = 30),
axis.text.y = element_text(size = 30),
legend.title = element_text(size = 30),
legend.text = element_text(size = 30),
# Critical settings for bottom strips
strip.placement = "outside",
strip.text = element_text(size = 30),
strip.background = element_rect(fill = "white", color = NA),
panel.spacing = unit(1, "lines"),
# Explicitly set strip position to bottom
strip.position = "bottom"
)
density_plot <- ggplot(df_filtered, aes(x = .data[[trait]], fill = Minor_Allele_Inheritance_Interested_Subgroups,
color = Minor_Allele_Inheritance_Interested_Subgroups)) +
geom_histogram(aes(y = after_stat(density)), position = "identity",
alpha = 0.2, bins = 30) +
geom_density(alpha = 0.3) +
scale_fill_manual(values = filtered_color_mapping) +
scale_color_manual(values = filtered_color_mapping) +
scale_x_continuous(limits = c(trait_min, trait_max),
breaks = breaks,
labels = round(breaks, 1)) +
# Use facet_grid instead of facet_wrap to have more control over strip placement
facet_grid(cols = vars(Minor_Allele_Inheritance_Interested_Subgroups),
switch = "x") + # This is key - switches the strip to the bottom
ggtitle(paste(trait_display_name, " - ", snp_name)) + # Combined title
ylab("Density") +
xlab("") + # Remove x-axis label
bottom_strip_theme
# Remove legends from both plots
violin_plot <- violin_plot + theme(legend.position = "none")
density_plot <- density_plot + theme(legend.position = "none")
# Use grid.arrange to display the plot immediately
grid.arrange(violin_plot, density_plot, nrow = 2)
}
The code below demonstrates how to call the mQTL_vQTL_Bimodality function using a SNP and Log_BMI as examples. A comprehensive result will be generated.
mQTL_vQTL_mixR_result = mQTL_vQTL_Bimodality(geno_pheno_snp_df_wide, "rs115140649_G", "Log_BMI")
## rs115140649_G
## SNP_Minor_Allele rs115140649_G
## ma_pa_mQTL_p 0.026
## ma_pa_mQTL_Eff 0.066
## ma_pa_vQTL_p 0.038
## ma_pa_vQTL_Eff 1.347
## mixR_ma_mu1_mu2 3.22_3.7
## mixR_ma_sd1_sd2 0.15_0.04
## mixR_ma_prop1 0.94
## mixR_ma_p 0.02
## mixR_pa_mu1_mu2 3.13_3.24
## mixR_pa_sd1_sd2 0.14_0.12
## mixR_pa_prop1 0.44
## mixR_pa_p 0.94
## mixR_both_mu1_mu2 NA_NA
## mixR_both_sd1_sd2 NA_NA
## mixR_both_prop1 <NA>
## mixR_both_p <NA>
## mixR_none_mu1_mu2 3.19_3.39
## mixR_none_sd1_sd2 0.14_0.21
## mixR_none_prop1 0.69
## mixR_none_p 0
The code below generates a violin plot and histogram of Log_BMI for a SNP of interest across different genotype groups. We’ll use the original long format data here; alternatively, you can use the code below to convert the wide data back to the long format. Note that we use “Minor_Allele_Inheritance_Interested_Subgroups” instead of “Minor_Allele_Inheritance”, because in addition to the four original groups (“Paternal”, “Maternal”, “None”, “Both”) from “Minor_Allele_Inheritance”, “Minor_Allele_Inheritance_Interested_Subgroups” includes 4 more subgroups within the offspring “None” category that are associated with parental genotypes (“None_Maternal_Heter”, “None_Paternal_Heter”, “None_Both_Parents_Homo”, “None_Both_Parents_Heter”).
geno_pheno_snp_df_long <- geno_pheno_snp_df_wide %>%
pivot_longer(
cols = -c(Indiv_ID, SEX, AGE, Log_BMI, DIABETES),
names_to = "SNP_Minor_Allele",
values_to = "Minor_Allele_Inheritance_Interested_Subgroups"
)
violin_histogram_combined(df = geno_pheno_snp_df_long,
snp_name = "rs115140649_G",
trait = "Log_BMI",
trait_display_name = "log(BMI)")
## rs1421085_C
## SNP_Minor_Allele rs1421085_C
## ma_pa_mQTL_p 0.045
## ma_pa_mQTL_Eff 0.015
## ma_pa_vQTL_p 0.374
## ma_pa_vQTL_Eff 0.997
## mixR_ma_mu1_mu2 3.2_3.45
## mixR_ma_sd1_sd2 0.15_0.21
## mixR_ma_prop1 0.79
## mixR_ma_p 0
## mixR_pa_mu1_mu2 3.12_3.32
## mixR_pa_sd1_sd2 0.12_0.19
## mixR_pa_prop1 0.39
## mixR_pa_p 0
## mixR_both_mu1_mu2 3.23_3.46
## mixR_both_sd1_sd2 0.15_0.22
## mixR_both_prop1 0.79
## mixR_both_p 0
## mixR_none_mu1_mu2 NA_NA
## mixR_none_sd1_sd2 NA_NA
## mixR_none_prop1 <NA>
## mixR_none_p <NA>
violin_histogram_combined(df = geno_pheno_snp_df_long,
snp_name = "rs1421085_C",
trait = "Log_BMI",
trait_display_name = "log(BMI)")