Introduction

MicroRNAs (miRNAs) are small, non-coding RNA molecules that regulate gene expression and play an essential role in various biological processes. They are increasingly recognised as biomarkers for numerous diseases, including colorectal cancer, due to their involvement in tumour progression, metastasis, and immune response.

In cancer epidemiology, miRNAs hold promise as diagnostic tools, prognostic markers, and therapeutic targets. This analysis re-make the result from the reports in Clinical Cancer Research (2014), British Journal of Cancer (2014) and Oncotarget (2016) which highlighted miRNAs’ significance in understanding cancer pathways and outcomes. Identifying miRNAs with notable expression changes could advance their use as biomarkers for early disease detection and treatment.

(1) qPCR Array Analysis

This analyses qPCR array data to identify the most highly expressed miRNAs in a given pre-processed dataset deposited in Clinical Cancer Research (2014). Here, I include statistical analysis of fold changes and calculate medians and visualising the key miRNAs that warrant further investigation for public health and clinical applications.

Loading Required Libraries

The libraries necessary for this analysis are loaded below. Missing libraries will be installed automatically.

# Install packages if not already installed
if (!requireNamespace("pROC", quietly = TRUE)) install.packages("pROC")
if (!requireNamespace("ggplot2", quietly = TRUE)) install.packages("ggplot2")
if (!requireNamespace("dplyr", quietly = TRUE)) install.packages("dplyr")
if (!requireNamespace("openxlsx", quietly = TRUE)) install.packages("openxlsx")
if (!requireNamespace("rstatix", quietly = TRUE)) install.packages("rstatix")
if (!requireNamespace("ggpubr", quietly = TRUE)) install.packages("ggpubr")
if (!requireNamespace("RColorBrewer", quietly = TRUE)) install.packages("RColorBrewer")

# Load Libraries
library(pROC)
library(ggplot2)
library(dplyr)
library(openxlsx)
library(rstatix)
library(ggpubr)
library(RColorBrewer)

Data Preprocessing

The pre-processed data is cleaned and formatted to ensure accurate and meaningful analysis. Key steps include:

  • Removing extraneous data, such as metadata and null values.

  • Standardising the dataset to facilitate calculations and visualisations.

The dataset is structured for statistical computations, avoiding potential biases caused by unprocessed or incomplete data.

# URL of the array data file published in CCR (Supplementary Table 3)
url <- "https://ndownloader.figstatic.com/files/39900485"

# Load data from Excel file without column names
qPCR.array <- read.xlsx(url, colNames = FALSE)

# Clean the first column by removing the "-#number" suffix
qPCR.array[[1]] <- gsub("-\\d+$", "", qPCR.array[[1]])

# Set the value "miRNA" to the second row of the first column
qPCR.array[2, 1] <- "miRNA"

# Use the second row as column names
colnames(qPCR.array) <- qPCR.array[2, ]

# Remove the first three rows and last three columns
qPCR.array <- qPCR.array[-c(1:3), ]
qPCR.array <- qPCR.array[, -c(7:9)]

# Remove rows containing "Null"
qPCR.array <- qPCR.array[apply(qPCR.array, 1, function(x) !any(x == "Null")), ]

# Convert columns 2 to 6 to numeric
qPCR.array[, 2:6] <- lapply(qPCR.array[, 2:6], as.numeric)

Statistical Analysis

To identify candidate miRNAs, we calculate the median fold change (a measure of central tendency) from 5 pairs of cancer-adjacent normal colonic tissue samples for each miRNA. These metrics are essential for evaluating the fold change and variability across samples.

# Extract only columns 2 to 6 for calculations
data_numeric <- qPCR.array[, 2:6]

# Calculate median for each miRNA
Medians <- apply(data_numeric, 1, median, na.rm = TRUE)

# Create a data frame with miRNA names, medians, and SEM
miRNA_stats <- data.frame(
  miRNA = qPCR.array[[1]], 
  Medians = Medians
)

# Get the top 12 miRNAs by median
top_miRNA_stats <- miRNA_stats[order(-miRNA_stats$Medians), ][1:12, ]

Data Visualisation - Bar Plot

A visual representation of the top 12 miRNAs is provided through bar plots using the median.

Note that including error bars could be misleading due to the small sample size (5 paired samples), and omitting them helps avoid presenting potentially misleading information.

# Plot the top 12 miRNAs
ggplot(top_miRNA_stats, aes(x = reorder(miRNA, Medians), y = Medians, fill = miRNA)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = "microRNA", y = "Fold Change") +
  theme_classic() +
  theme(
    text = element_text(size = 15, colour = "black"), 
    axis.ticks = element_line(colour = "black", size = 1.25),
    axis.line = element_line(colour = 'black', size = 1.25),
    axis.text.x = element_text(angle = 0, hjust = 0.5, colour = "black", size = 12),
    axis.text.y = element_text(angle = 30, hjust = 1, colour = "black", size = 12),
    axis.title.y = element_text(color = "black", size = 14, face = "bold"), 
    axis.title.x = element_text(color = "black", size = 14, face = "bold"), 
    legend.position = "none"
  )

Remove unwanted objects

Clean up the unwanted objects to keep the R environment tidy.

rm(data_numeric, miRNA_stats, qPCR.array, top_miRNA_stats, Medians, url)

(2) ROC analysis

The predictive performance of specific miRNAs for differentiating between cancer and normal samples is assessed using Receiver Operating Characteristic (ROC) curves. This analysis examines:

The study identifies miRNAs with potential utility as cancer detection biomarkers.

Data Loading

# Load Data
data <- read.xlsx("20241201 List of stool samples.xlsx")

Data Filtering

We load the dataset and filter it to include only the “Cancer” and “Normal” groups. The true labels (1 for cancer, 0 for normal) are extracted for use in ROC curve analysis.

# Filter data for "Cancer" and "Normal" groups
filtered_data <- data %>%
  filter(Group %in% c("Cancer", "Normal"))

# Extract true labels
true_labels <- ifelse(filtered_data$Group == "Cancer", 1, 0)

# Define predictive parameters
predictive_params <- c("miR-18a.copies/ng", "20a.copies/ng", 
                       "135b.copies/ng", "221.copies/ng")

ROC Curve Analysis

For each predictive parameter, we calculate ROC curves and extract sensitivity, specificity, and AUC values. Annotations and arrow positions for visualisation are also prepared.

# Initialize data containers
roc_data_list <- data.frame()
annotation_points <- data.frame()
arrow_positions <- data.frame()

# Calculate ROC Curves and Annotations
for (param in predictive_params) {
  # Predicted scores
  predicted_scores <- filtered_data[[param]]
  
  # Generate ROC curve
  roc_curve <- roc(true_labels, predicted_scores)
  
  # Extract ROC curve data for ggplot
  roc_data <- data.frame(
    Specificity = rev(roc_curve$specificities),
    Sensitivity = rev(roc_curve$sensitivities),
    Parameter = param
  )
  
  # Append to the overall data frame
  roc_data_list <- rbind(roc_data_list, roc_data)
  
  # Sensitivity and specificity at 80% specificity
  coords_at_80_spec <- coords(roc_curve, x = 0.8, input = "specificity", ret = c("sensitivity", "specificity"))
  sen <- coords_at_80_spec["sensitivity"]
  spe <- coords_at_80_spec["specificity"]
  
  # Annotation points
  annotation_points <- rbind(annotation_points, data.frame(
    x = c(0.25, 0.5, 0.125, 0.8)[which(predictive_params == param)],
    y = c(0.15, 0.4, 0.85, 0.4)[which(predictive_params == param)],
    label = sprintf("%s\nSen=%.2f\nSpe=%.2f", param, sen, spe),
    Parameter = param
  ))
  
  # Arrow positions
  annotation_x <- 1 - roc_data$Specificity
  annotation_y <- roc_data$Sensitivity
  target_x <- annotation_points$x[nrow(annotation_points)]
  target_y <- annotation_points$y[nrow(annotation_points)]
  distances <- (annotation_x - target_x)^2 + (annotation_y - target_y)^2
  closest_index <- which.min(distances)
  
  # Store arrow starting positions  
  arrow_positions <- rbind(arrow_positions, data.frame(
    x = annotation_x[closest_index],
    y = annotation_y[closest_index],
    Parameter = param
  ))
}

# Adjust arrow end positions
adjusted_arrow_positions <- annotation_points %>%
  mutate(
    xend = ifelse(x < 0.5, x - 0.05, x + 0.05),
    yend = ifelse(y < 0.5, y - 0.05, y + 0.05)
  )

# Combine arrow start and end positions
arrows <- merge(adjusted_arrow_positions, arrow_positions, by = "Parameter")

Data Visualisation

We plot the ROC curves with sensitivity, specificity, and AUC annotations using ggplot2, this includes, renaming, calculating sensitivity and specificity at 80% specificity for a better explaination.

# Combine arrow start and end positions correctly
arrows <- merge(
  adjusted_arrow_positions, 
  arrow_positions, 
  by = "Parameter",
  suffixes = c(".label", ".curve")  # Clear distinction between label and curve columns
)

# Check the content of `arrows` to ensure alignment
print(arrows)
##           Parameter x.label y.label                                 label  xend
## 1    135b.copies/ng   0.125    0.85    135b.copies/ng\nSen=0.58\nSpe=0.80 0.075
## 2     20a.copies/ng   0.500    0.40     20a.copies/ng\nSen=0.55\nSpe=0.80 0.550
## 3     221.copies/ng   0.800    0.40     221.copies/ng\nSen=0.51\nSpe=0.80 0.850
## 4 miR-18a.copies/ng   0.250    0.15 miR-18a.copies/ng\nSen=0.46\nSpe=0.80 0.200
##   yend    x.curve   y.curve
## 1 0.90 0.29797980 0.6969697
## 2 0.35 0.33333333 0.6313131
## 3 0.35 0.40909091 0.6868687
## 4 0.10 0.03030303 0.2070707
# Modify the labels in the annotation_points data frame and retain the ROC-related values
annotation_points$label <- dplyr::recode(
  annotation_points$Parameter,
  "miR-18a.copies/ng" = "miR-18a",
  "20a.copies/ng" = "miR-20a",
  "135b.copies/ng" = "miR-135b",
  "221.copies/ng" = "miR-221"
)

# Recalculate sensitivity, specificity, and AUC for each parameter and update the labels
for (param in predictive_params) {
  # Extract the ROC curve for each parameter
  roc_curve <- roc(true_labels, filtered_data[[param]])
  
  # Calculate sensitivity and specificity at 80% specificity
  coords_at_80_spec <- coords(roc_curve, x = 0.8, input = "specificity", ret = c("sensitivity", "specificity"))
  sen <- coords_at_80_spec["sensitivity"]
  spe <- coords_at_80_spec["specificity"]
  
  # Calculate AUC for the current ROC curve
  auc_value <- auc(roc_curve)
  
  # Update the annotation_points data frame with the calculated sen, spe, and AUC for each parameter
  annotation_points[annotation_points$Parameter == param, "sen"] <- sen
  annotation_points[annotation_points$Parameter == param, "spe"] <- spe
  annotation_points[annotation_points$Parameter == param, "AUC"] <- auc_value
  
  # Update the label with sen, spe, and AUC values
  annotation_points[annotation_points$Parameter == param, "label"] <- 
    sprintf("%s\nSen: %.2f\nSpe: %.2f\nAUC: %.2f", annotation_points[annotation_points$Parameter == param, "label"], sen, spe, auc_value)
}

# Reformat labels for clarity
ggplot(data = roc_data_list, aes(x = 1 - Specificity, y = Sensitivity, color = Parameter)) +
  geom_line(size = 1) +                            # Plot ROC curves
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "gray") +  # Diagonal line
  labs(
    # title = "ROC Curves with Sensitivity, Specificity, and AUC",
    x = "1 - Specificity (False Positive Rate)",
    y = "Sensitivity (True Positive Rate)"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    legend.position = "none",
    legend.title = element_text(face = "bold"
                                )
  ) +
  scale_color_discrete(name = "Predictive Parameters") +  # Legend title
  # Add arrows above the labels
  geom_segment(
    data = arrows,
    aes(
      x = x.label,    # Start x-coordinate (from the label)
      y = y.label,    # Start y-coordinate (from the label)
      xend = x.curve,    # End x-coordinate (closest point on the curve)
      yend = y.curve,    # End y-coordinate (closest point on the curve)
      color = Parameter  # Match arrow color to the curve
    ),
    inherit.aes = FALSE,
    arrow = arrow(length = unit(0.02, "npc")),
    size = 1.5
  ) +
  # Use geom_label to add background color to labels
  geom_label(
    data = annotation_points, 
    aes(x = x, y = y, label = label, fill = Parameter),  # Use 'fill' for background color
    inherit.aes = FALSE, 
    size = 4,
    label.size = 0.5,    # Border size around the label
    label.padding = unit(0.3, "lines"),  # Padding inside the label
    color = "white",  # Text color
    alpha = 1       # Transparency of the label background
  )

(3) Combined ROC analysis

A logistic regression model was used for the combined ROC curve analysis, which predicts the probability of a binary outcome (Cancer vs. Normal) based on multiple predictors, including the expression levels of four microRNAs (miR-18a, miR-221, miR-20a, miR-135b).

The logistic regression model estimates the likelihood of the outcome occurring (Cancer) for each observation. Predicted probabilities from this model are used to plot a ROC curve, which evaluates the model’s performance by showing the trade-off between sensitivity (true positive rate) and specificity (1 - false positive rate) at different decision thresholds.

The model formula:

logit(P)=β0​+β1​(miR-18a)+β2​(miR-221)+β3​(miR-20a)+β4​(miR-135b)

# Create the binary outcome variable (1 for Cancer, 0 for Normal)
true_labels <- ifelse(filtered_data$Group == "Cancer", 1, 0)

# Fit a logistic regression model with multiple predictors
model_multi <- glm(true_labels ~ `miR-18a.copies/ng` + 
                     `221.copies/ng` + 
                     `20a.copies/ng` + 
                     `135b.copies/ng`,
                   data = filtered_data,
                   family = binomial)

# Get predicted probabilities for the new model
predicted_probs_multi <- predict(model_multi, type = "response")

# Compute ROC curve using pROC
roc_curve_multi <- roc(true_labels, predicted_probs_multi)

# Extract data for ggplot
roc_data <- data.frame(
  Specificity = rev(roc_curve_multi$specificities),
  Sensitivity = rev(roc_curve_multi$sensitivities)
)

# Find the sensitivity at 80% specificity
target_specificity <- 0.8

# Find the closest specificity value to 0.8
closest_index <- which.min(abs(roc_data$Specificity - target_specificity))

# Extract the corresponding sensitivity and specificity
sensitivity_at_80_spec <- roc_data$Sensitivity[closest_index]
specificity_at_80_spec <- roc_data$Specificity[closest_index]

# Display the result
print(paste("Sensitivity at 80% specificity: ", round(sensitivity_at_80_spec, 3)))
## [1] "Sensitivity at 80% specificity:  0.621"
print(paste("Specificity at 80% specificity: ", round(specificity_at_80_spec, 3)))
## [1] "Specificity at 80% specificity:  0.798"
# Calculate AUC
auc_multi <- auc(roc_curve_multi)
auc_label <- paste0("AUC: ", round(auc_multi, 3))

# Create the annotation points for labeling on the ROC plot at (0.5, 0.5)
annotation_points <- data.frame(
  x = 0.6,  # Label at (0.5, 0.5)
  y = 0.4,  # Label at (0.5, 0.5)
  label = sprintf("Multi-Predictor\nSen: %.2f\nSpe: %.2f\nAUC: %.2f", sensitivity_at_80_spec, specificity_at_80_spec, auc_multi),
  Parameter = "Multi-Predictor"
)

# Define custom arrow positions, making the arrow point to the closest point on the curve
arrow_positions <- data.frame(
  x = 0.375,  # Start of the arrow at (0.5, 0.5)
  y = 0.75,  # Start of the arrow at (0.5, 0.5)
  Parameter = "Multi-Predictor"
)

# Combine arrow start and end positions
arrows <- merge(
  annotation_points, 
  arrow_positions, 
  by = "Parameter",
  suffixes = c(".label", ".curve")
)

# Plot the ROC curve using ggplot with AUC and sensitivity annotations
ggplot(roc_data, aes(x = 1 - Specificity, y = Sensitivity)) +
  geom_line(color = "saddlebrown", size = 1.2) +  # ROC curve
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "gray") +  # Diagonal line
  labs(
    x = "1 - Specificity (False Positive Rate)",
    y = "Sensitivity (True Positive Rate)"
  ) +
  # Add arrows pointing to the labels
  geom_segment(
    data = arrows,
    aes(
      x = x.label,    # Start x-coordinate (from the label)
      y = y.label,    # Start y-coordinate (from the label)
      xend = x.curve,    # End x-coordinate (closest point on the curve)
      yend = y.curve,    # End y-coordinate (closest point on the curve)
      color = Parameter  # Match arrow color to the curve
    ),
    inherit.aes = FALSE,
    arrow = arrow(length = unit(0.02, "npc")),
    size = 1.5
  ) +
  # Use geom_label to add background color to labels
  geom_label(
    data = annotation_points, 
    aes(x = x, y = y, label = label, fill = Parameter),  # Use 'fill' for background color
    inherit.aes = FALSE, 
    size = 4,
    label.size = 0.5,    # Border size around the label
    label.padding = unit(0.3, "lines"),  # Padding inside the label
    color = "white",  # Text color
    alpha = 1       # Transparency of the label background
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    legend.position = "none",
    legend.title = element_text(face = "bold")
  ) +
  xlim(0, 1) +  # Adjust axis limits to make sure the arrows and labels are within view
  ylim(0, 1)

Remove unwanted objects

Remove all the unwanted objects except the data object for the next analysis.

# Remove all objects except "data"
rm(list = setdiff(ls(), "data"))

Group Comparisons

Log Transformation of microRNA Levels

To address the non-normal distribution of miR-135b levels, a log10 transformation is applied. A small constant (+1) is added to avoid taking the logarithm of zero.

# Apply log10 transformation to miR-135b data
data$log_miR_135b <- log10(data$`135b.copies/ng` + 1)  # Add 1 to avoid log(0)

# Set the order of the groups
data$Group <- factor(data$Group, levels = c("Normal", "Adenoma", "AN", "Cancer"))

Statistical Testing - Wilcoxon Rank-Sum Test

The Wilcoxon rank-sum test is used for the group comparisons of the transformed miR-135b levels. P-values are adjusted using the Bonferroni correction method.

# Perform pairwise Wilcoxon test for log-transformed miR-135b values
stat.test <- data %>%
  wilcox_test(log_miR_135b ~ Group, p.adjust.method = "bonferroni") %>%
  add_significance("p") %>%
  filter(p.adj.signif != "ns")  # Keep only significant comparisons

# Adjust the positions for the p-value labels
stat.test <- stat.test %>%
  mutate(y.position = seq(max(data$log_miR_135b) + 0.1, by = 0.3, length.out = nrow(stat.test)))

Visualization - Create Scatter and Box Plot

Scatter plots, overlaid with box plots, illustrate the differences in miR-135b levels across groups. Annotated p-values highlight significant findings, supporting transparent interpretation.

# Plot the scatter plot with log-transformed miR-135b values and box plots
ggplot(data, aes(x = Group, y = log_miR_135b, color = Group)) +  
  geom_point(position = "jitter", size = 3, alpha = 0.7) +  
  geom_boxplot(
    aes(x = Group, y = log_miR_135b, fill = Group),
    alpha = 0.3, 
    color = "black", 
    outlier.shape = NA,  
    width = 0.3  
  ) +
  stat_pvalue_manual(stat.test, label = "p.signif", size = 4, face = "bold") +
  labs(
    x = "Group",
    y = "Stool miR-135b level, log10"
  ) +
  theme_classic() + 
  theme(
    text = element_text(size = 15, colour = "black"), 
    axis.ticks = element_line(colour = "black", size = 1),
    axis.line = element_line(colour = 'black', size = 1),
    axis.text.x = element_text(angle = 0, hjust = 0.5, colour = "black", size = 13, face = "bold"),
    axis.text.y = element_text(angle = 0, hjust = 0.5, colour = "black", size = 13, face = "bold"),
    axis.title.y = element_text(color = "black", size = 15, face = "bold"),
    axis.title.x = element_text(color = "black", size = 15, face = "bold"),
    legend.position = "none"
  ) +
  scale_y_continuous(limits = c(-0.5, 5.25), breaks = c(0, 1, 2, 3, 4, 5))

(4) Analysis by Tumour Location

We also applied a similar approach to analyse miR-135b levels in cancer patients stratified by tumour location (proximal vs. distal), considering that Fecal Immunochemical Test also shows varying sensitivity and specificity depending on the context.

Data Filtering and Statistical Testing

The analysis focuses on cancer group data, ensuring only relevant comparisons are performed. Statistical significance is tested.

# Filter the data to include only the Cancer group and remove rows with NA values
cancer_data <- data %>%
  filter(Group == "Cancer") %>%       # Retain rows where Group is "Cancer"
  drop_na(Location)                   # Remove rows where 'Location' is NA

# Rename the Location groups
cancer_data$Location <- recode(cancer_data$Location, "1" = "PROXIMAL", "2" = "DISTAL")

# Set the order of the Location groups
cancer_data$Location <- factor(cancer_data$Location, levels = c("PROXIMAL", "DISTAL"))
# Perform Wilcoxon rank-sum test for log-transformed miR-135b values between two independent groups
stat.test <- cancer_data %>%
  wilcox_test(log_miR_135b ~ Location) %>%
  add_significance("p")


# Adjust the positions for the p-value labels
stat.test <- stat.test %>%
  mutate(y.position = seq(max(cancer_data$log_miR_135b) + 0.1, by = 0.3, length.out = nrow(stat.test)))

Visualissation - Scatter Plot with Box Plots

Scatter plots with overlaid box plots are used to visualise miR-135b levels by tumour location. Significant differences are annotated to inform potential clinical implications.

# Plot the scatter plot with log-transformed miR-135b values and box plots for Location

ggplot(cancer_data, 
       aes(x = Location, y = log_miR_135b, fill = Location)) +  
  geom_point(aes(color = Location),  
             position = position_jitter(width = 0.2), size = 2.5, alpha = 1) +  
  geom_violin(trim = TRUE, color = NA, alpha = 0.5) +  
  geom_boxplot(width = 0.12, fill = "white", , alpha = 0.7) +  
  stat_pvalue_manual(stat.test, label = "p.signif", 
                     tip.length = 0.03,   
                     inherit.aes = FALSE, 
                     y.position = 4.2,    
                     size = 6,
                     face = "bold",
                     vjust = 0,           
                     hjust = 0) +  
  labs(
    x = "Location",
    y = "Stool miR-135b level, log10"
  ) +
  scale_fill_brewer(palette = "Set1") +  
  scale_color_brewer(palette = "Set1") +  
  scale_y_continuous(limits = c(-0.05, 4.3), 
                     breaks = c(0, 1, 2, 3, 4)) +  
  theme_classic() +   
  theme(
    text = element_text(size = 15, colour = "black"), 
    axis.ticks = element_line(colour = "black", size = 1.25),
    axis.line = element_line(colour = 'black', size = 1.25),
    axis.text.x = element_text(angle = 45, hjust = 1, colour = "black", size = 13, face = "bold"),
    axis.text.y = element_text(angle = 0, hjust = 0.5, colour = "black", size = 13, face = "bold"),
    axis.title.y = element_text(color = "black", size = 15, face = "bold"),
    axis.title.x = element_text(color = "black", size = 15, face = "bold"),
    legend.position = "none"
  )

(5) The Use of Antibiotics

Antibiotics can alter the composition of the intestinal microbiota, which may, in turn, affect miRNA expression in feces. To investigate this, we examined the impact of antibiotics on fecal miR-20a levels by comparing patients who had taken antibiotics within 30 days prior to faecal sample collection with those who had not.

Filter and Categorise Data

We filter the dataset to retain only cancer patients and categorise the Antibiotics variable into “YES” and “NO.”

# Filter the data to include only the Cancer group
cancer_data <- data %>% filter(Group == "Cancer")

# Rename the Antibiotics groups
cancer_data$Antibiotics <- recode(cancer_data$Antibiotics, "N" = "NO", "Y" = "YES")

# Set the order of the Antibiotics groups
cancer_data$Antibiotics <- factor(cancer_data$Antibiotics, levels = c("NO", "YES"))

Statistical Analysis - Wilcoxon Test

The Wilcoxon rank-sum test compares log-transformed miR-135b levels between antibiotic groups. Bonferroni correction adjusts for multiple comparisons, and annotation positions are customised.

# Perform pairwise Wilcoxon test for log10-transformed miR-135b values
stat.test <- cancer_data %>%
  wilcox_test(log_miR_135b ~ Antibiotics) %>%
  add_significance("p")

# Adjust the positions for the p-value labels
stat.test <- stat.test %>%
  mutate(y.position = seq(max(cancer_data$log_miR_135b) + 0.1, by = 0.3, length.out = nrow(stat.test)))

Visualisation - Scatter and Box Plot

We use a scatter plot with overlaid boxplots to visualise log-transformed miR-135b levels for antibiotic usage groups. Significant differences are annotated.

# Plot the scatter plot with log-transformed miR-135b values and box plots for Antibiotics
ggplot(cancer_data, 
       aes(x = Antibiotics, y = log_miR_135b, fill = Antibiotics)) + 
  geom_point(aes(color = Antibiotics),  # Points mapped to Antibiotics
             position = position_jitter(width = 0.2), size = 2.5, alpha = 1) +  # Jitter for spreading
  geom_violin(trim = TRUE, color = NA, alpha = 0.5) +  # Violin plot
  geom_boxplot(width = 0.12, fill = "white", alpha = 0.7) +  # Boxplot inside the violin
  stat_pvalue_manual(stat.test, label = "p.signif", 
                     tip.length = 0.03,   
                     inherit.aes = FALSE, 
                     y.position = 4.2,    
                     size = 6,
                     face = "bold",
                     vjust = 0,           
                     hjust = 0) +  
  labs(
    x = "Antibiotics", 
    y = "Stool miR-135b level, log10"
  ) +
  scale_fill_brewer(palette = "Set1") +  # Violin fill colors
  scale_color_brewer(palette = "Set1") +  # Point colors
  scale_y_continuous(limits = c(-0.05, 4.3), 
                     breaks = c(0, 1, 2, 3, 4)) +  
  theme_classic() +   
  theme(
    text = element_text(size = 15, colour = "black"), 
    axis.ticks = element_line(colour = "black", size = 1.25),
    axis.line = element_line(colour = 'black', size = 1.25),
    axis.text.x = element_text(angle = 45, hjust = 1, colour = "black", size = 13, face = "bold"),
    axis.text.y = element_text(angle = 0, hjust = 0.5, colour = "black", size = 13, face = "bold"),
    axis.title.y = element_text(color = "black", size = 15, face = "bold"),
    axis.title.x = element_text(color = "black", size = 15, face = "bold"),
    legend.position = "none"
  )

sessionInfo()
## R version 4.4.2 (2024-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26100)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=English_United Kingdom.utf8 
## [2] LC_CTYPE=English_United Kingdom.utf8   
## [3] LC_MONETARY=English_United Kingdom.utf8
## [4] LC_NUMERIC=C                           
## [5] LC_TIME=English_United Kingdom.utf8    
## 
## time zone: Europe/London
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] RColorBrewer_1.1-3 ggpubr_0.6.0       rstatix_0.7.2      openxlsx_4.2.7.1  
## [5] dplyr_1.1.4        ggplot2_3.5.1      pROC_1.18.5       
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.9        utf8_1.2.4        generics_0.1.3    tidyr_1.3.1      
##  [5] stringi_1.8.4     digest_0.6.36     magrittr_2.0.3    evaluate_0.24.0  
##  [9] grid_4.4.2        fastmap_1.2.0     plyr_1.8.9        jsonlite_1.8.8   
## [13] zip_2.3.1         backports_1.5.0   purrr_1.0.2       fansi_1.0.6      
## [17] scales_1.3.0      jquerylib_0.1.4   abind_1.4-5       cli_3.6.3        
## [21] rlang_1.1.4       munsell_0.5.1     withr_3.0.0       cachem_1.1.0     
## [25] yaml_2.3.9        tools_4.4.2       ggsignif_0.6.4    colorspace_2.1-0 
## [29] broom_1.0.6       vctrs_0.6.5       R6_2.5.1          lifecycle_1.0.4  
## [33] car_3.1-2         pkgconfig_2.0.3   pillar_1.9.0      bslib_0.7.0      
## [37] gtable_0.3.5      glue_1.7.0        Rcpp_1.0.12       highr_0.11       
## [41] xfun_0.45         tibble_3.2.1      tidyselect_1.2.1  rstudioapi_0.16.0
## [45] knitr_1.48        farver_2.1.2      htmltools_0.5.8.1 rmarkdown_2.27   
## [49] carData_3.0-5     labeling_0.4.3    compiler_4.4.2