1 Introduction

Metabolomics is an emerging field of biomedical sciences, allowing the definition of biomolecular patterns associated with relevant biochemical processes or pathophysiological mechanisms.

However, the large amounts of data often generated by some analytical methods can be daunting to those who are not data analysts or have no experience in multivariate analysis.

In this article, I will guide you in conducting simple multivariate analysis for metabolomics, including dimension reduction methods and cluster analysis. For this purpose, I will use a dummy breathomics dataset that can be downloaded as a .csv file here.

2 Multivariate analysis of the metabolome

2.1 Loading required packages and preparing data

In this example, we will mostly use functions included in base R, as well as some tools contained in the factoextra package. The latter, created by Alboukadel Kassambara and Fabian Mundt, will be useful for plotting and visualizing results for our multivariate analysis. You can learn more about the factoextra package here. We will also use the dplyr package to handle data more fluently.

library(dplyr)
library(factoextra)

Next, we load the example csv file as df. Be sure to replace the path in the file argument with your own path.

# latin-1 encoding will be used due to compound names
df <- read.csv(file = "path_to_your_file/breathomics_example.csv", encoding = "latin-1", check.names = FALSE)

We can check the data structure using the glimpse() function from the dplyr package.

glimpse(df)
## Rows: 200
## Columns: 31
## $ Patient_ID                  <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,…
## $ Asthma                      <int> 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, …
## $ `1,2-propanediol`           <dbl> 65.69225, 95.88098, 61.27993, 73.83496, 53…
## $ `2-butylcyclohexanol`       <dbl> 115.5899, 145.7228, 111.8286, 123.6831, 10…
## $ `2-propen-1-ol`             <dbl> 77.56422, 108.15147, 74.33091, 86.25945, 6…
## $ `Hexadecan-2-ol`            <dbl> 125.1097, 155.7785, 121.0401, 133.3455, 11…
## $ Isopropanol                 <dbl> 131.5955, 161.6312, 126.6553, 138.9938, 11…
## $ `2-octenal`                 <dbl> 41.85882, 71.40023, 37.48268, 49.52501, 28…
## $ `2-undecenal`               <dbl> 90.38280, 120.22582, 86.29556, 98.00580, 7…
## $ `4-methyl-2-pentenal`       <dbl> 126.7383, 156.3094, 122.3860, 134.5346, 11…
## $ Decanal                     <dbl> 92.26829, 122.75126, 88.76091, 100.15101, …
## $ Dodecanal                   <dbl> 82.58492, 112.93067, 78.09373, 90.14973, 7…
## $ Nonanal                     <dbl> 133.0892, 162.7918, 128.8052, 140.2590, 12…
## $ Pentadecanal                <dbl> 82.81600, 113.15896, 79.07945, 90.31172, 7…
## $ `1-(methylsulfanyl)propane` <dbl> 104.8449, 134.7973, 100.7607, 112.4918, 92…
## $ `2,2,4-trimethylheptane`    <dbl> 94.97019, 124.29846, 90.34674, 102.42008, …
## $ `2,2-dimethylhexane`        <dbl> 47.47850, 77.71796, 44.02195, 55.07499, 35…
## $ `2,3,6-trimethyldecane`     <dbl> 147.3337, 156.1348, 137.3461, 145.3459, 13…
## $ `2,3,6-trimethyloctane`     <dbl> 126.4192, 135.1605, 115.5992, 124.3296, 11…
## $ `2,3-dimethylheptane`       <dbl> 87.08962, 94.97783, 76.61197, 84.74034, 74…
## $ `2,4,6-trimethyldecane`     <dbl> 101.72661, 109.09558, 91.36103, 100.38671,…
## $ `2,4-dimethylheptane`       <dbl> 149.6408, 159.5379, 141.3893, 148.2968, 13…
## $ `2,5-dimethylhexane`        <dbl> 143.8372, 154.4330, 135.8000, 143.6583, 13…
## $ `2,4-dimethyloctane`        <dbl> 77.63026, 83.74699, 66.46336, 73.81485, 62…
## $ `2,4-dimethylpentane`       <dbl> 112.20408, 81.20440, 130.76727, 109.36703,…
## $ `2,6,10-trimethyldodecane`  <dbl> 74.25147, 76.64534, 111.48966, 90.74491, 1…
## $ `3-methylpentane`           <dbl> 130.98608, 102.13766, 105.19133, 95.68023,…
## $ `2,6,11-trimethyldodecane`  <dbl> 124.77345, 111.98984, 97.82532, 77.22286, …
## $ `3-ethylhexane`             <dbl> 103.33876, 90.12203, 100.36632, 96.45270, …
## $ `2-methyl-1-butene`         <dbl> 91.17490, 112.00639, 57.12484, 125.90886, …
## $ `4-ethyl-2-hexanol`         <dbl> 96.10588, 129.84190, 82.70370, 69.06879, 8…

As we can see, the data frame consists of information from 200 different patients. A variable with the patient ID number has been included, as well as a variable representing the asthma diagnosis status of each patient. Most importantly, there are 29 columns representing the concentrations of exhaled volatile organic compounds, in \(mg/m^3\).

To prepare the diagnosis variable for analysis and improve clarity, we’ll convert it to a factor and rename its levels:

df$Asthma <- factor(df$Asthma, levels = c(0, 1), labels = c("No asthma", "Asthma"))

Important! This dataset contains no missing values. Most of the multivariate analysis methods used in this article assume that there will be no missing values in the provided data. If your dataset contains missing values, make sure to remove or handle them appropriately:

df <- na.omit(df)
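
Before dropping rows, it can be useful to check how many values are actually missing in each column (there are none in this example dataset). If many values were missing, imputation (for example, with column medians or a dedicated package) might be preferable to simply discarding rows.

# Number of missing values per column (should be all zeros here)
colSums(is.na(df))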

2.2 Dimensionality reduction using Principal Component Analysis

In metabolomics, we mostly deal with continuous data, such as the concentrations of multiple metabolites. Depending on our research question, the relevance of the measured compounds will vary, and some might be more determinant in identifying a given condition, either individually or in combination with other metabolites. Principal component analysis (PCA) is a good starting point to condense our variables into components that help distinguish between different patient groups in our data.

In our example, we are trying to identify a profile of compounds associated with asthma. We will first use PCA to linearly transform breathomics data onto a 2D coordinate system that will capture the largest variation in the data, with the goal of separating individuals with and without asthma.

# Extracting breathomics data only
omics <- df[,3:length(df)]

# Performing PCA
pca_results <- prcomp(omics, retx = TRUE, center = TRUE, scale. = TRUE)

Next, we will use the factoextra package to plot our results.

We start by examining how much of the data variance is explained by each principal component.

fviz_eig(pca_results, addlabels = TRUE, ylim=c(0,100), ncp = 6)

By observing the plot, we can see that the first two components, PC1 and PC2, explain more than 75% of the total variance.
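
If you prefer exact figures over reading them off the plot, the eigenvalues and explained-variance percentages can also be extracted directly with factoextra's get_eig() function:

# Eigenvalues and variance explained by the first three components
get_eig(pca_results)[1:3,]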

Therefore, it is a good idea to observe the distribution of patients in a two-dimensional space defined by these components.

fviz_pca_biplot(pca_results, 
                axes = c(1,2), 
                geom.ind = "point",
                geom.var = "arrow",
                habillage = df$Asthma)

Patients appear to be clustered according to the asthma diagnosis. We can even create ellipses to better illustrate these results:

fviz_pca_biplot(pca_results, 
                axes = c(1,2), 
                geom.ind = "point",
                geom.var = "",
                addEllipses = TRUE,
                habillage = df$Asthma)

There is a strong separation of individuals with asthma from healthy subjects, mostly defined by metabolites represented in PC1.

To check the contribution of each metabolite to PC1, we can use the get_pca_var() function from factoextra.

# First, we need to get the variables from PCA
pca_vars <- get_pca_var(pca_results)

print(sort(pca_vars$contrib[,1], decreasing = TRUE))
##                 Dodecanal       4-methyl-2-pentenal        2,2-dimethylhexane 
##               6.451068828               6.448859323               6.447945886 
##               Isopropanol              Pentadecanal                   Nonanal 
##               6.446625696               6.445957724               6.445549010 
##                 2-octenal           1,2-propanediol       2-butylcyclohexanol 
##               6.444860224               6.444608139               6.444027357 
##    2,2,4-trimethylheptane                   Decanal               2-undecenal 
##               6.443972721               6.442862156               6.441715016 
## 1-(methylsulfanyl)propane             2-propen-1-ol            Hexadecan-2-ol 
##               6.441298629               6.439472540               6.438893725 
##     2,3,6-trimethyloctane     2,4,6-trimethyldecane     2,3,6-trimethyldecane 
##               0.472201053               0.450461845               0.425591114 
##        2,5-dimethylhexane        2,4-dimethyloctane       2,3-dimethylheptane 
##               0.420761427               0.420636173               0.413713667 
##       2,4-dimethylheptane           3-methylpentane         2-methyl-1-butene 
##               0.404926489               0.098437371               0.088603043 
##         4-ethyl-2-hexanol  2,6,11-trimethyldodecane             3-ethylhexane 
##               0.063966529               0.050765584               0.012989225 
##  2,6,10-trimethyldodecane       2,4-dimethylpentane 
##               0.005878496               0.003351011

In our example, dodecanal, 4-methyl-2-pentenal and 2,2-dimethylhexane are the three compounds that contribute most to PC1.

Next, we can check the direction of the correlation:

print(sort(pca_vars$cor[,1], decreasing = TRUE))
##                 Dodecanal       4-methyl-2-pentenal        2,2-dimethylhexane 
##                0.99320976                0.99303966                0.99296933 
##               Isopropanol              Pentadecanal                   Nonanal 
##                0.99286767                0.99281623                0.99278475 
##                 2-octenal           1,2-propanediol       2-butylcyclohexanol 
##                0.99273171                0.99271229                0.99266756 
##    2,2,4-trimethylheptane                   Decanal               2-undecenal 
##                0.99266335                0.99257781                0.99248944 
## 1-(methylsulfanyl)propane             2-propen-1-ol            Hexadecan-2-ol 
##                0.99245736                0.99231667                0.99227208 
##     2,3,6-trimethyloctane     2,4,6-trimethyldecane     2,3,6-trimethyldecane 
##                0.26871296                0.26245457                0.25510644 
##        2,5-dimethylhexane        2,4-dimethyloctane       2,3-dimethylheptane 
##                0.25365481                0.25361705                0.25152148 
##       2,4-dimethylheptane         2-methyl-1-butene         4-ethyl-2-hexanol 
##                0.24883601                0.11639906                0.09890119 
##  2,6,11-trimethyldodecane       2,4-dimethylpentane  2,6,10-trimethyldodecane 
##                0.08810688                0.02263670               -0.02998184 
##             3-ethylhexane           3-methylpentane 
##               -0.04456735               -0.12268886

These results show that dodecanal, 4-methyl-2-pentenal and 2,2-dimethylhexane have the strongest positive correlations with PC1, while 2,6,10-trimethyldodecane, 3-ethylhexane and 3-methylpentane are the only compounds with negative (albeit weak) correlations. Variables with correlation coefficients close to 1 or -1 contribute strongly to a principal component, while those near zero have little impact.

If you want to graphically represent these results, you may use a correlation plot and set the is.corr argument to FALSE:

library(corrplot)

corr_matrix <- pca_vars$cor[,c(1,2)]

colnames(corr_matrix) <- c("PC1", "PC2")

corrplot(t(corr_matrix), is.corr = FALSE, win.asp = 1.8, cl.cex = 0.4)

Next, we can fit a logistic regression model based on the two main components:

PC1 <- pca_results$x[,1]
PC2 <- pca_results$x[,2]

mod <- glm(data = df, Asthma ~ PC1 + PC2, family = "binomial")


summary(mod)
## 
## Call:
## glm(formula = Asthma ~ PC1 + PC2, family = "binomial", data = df)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.11257    0.16960   0.664 0.506846    
## PC1          0.36486    0.06045   6.036 1.58e-09 ***
## PC2         -0.28111    0.07266  -3.869 0.000109 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 276.94  on 199  degrees of freedom
## Residual deviance: 208.12  on 197  degrees of freedom
## AIC: 214.12
## 
## Number of Fisher Scoring iterations: 5
exp(mod$coefficients)
## (Intercept)         PC1         PC2 
##   1.1191513   1.4403141   0.7549433
exp(confint(mod))
##                 2.5 %    97.5 %
## (Intercept) 0.8030269 1.5649014
## PC1         1.2889511 1.6349354
## PC2         0.6499659 0.8654576

We can see that the two main components are significantly associated with the asthma classification.

Constructing analytical models and plotting the spatial distribution of our data is an interesting approach to metabolomics. But imagine that we now wish to check how well our model differentiates individuals with asthma. The ideal process would be to collect an independent sample of data to be classified using the developed model, checking the results against a gold standard (for instance, specialist diagnosis). Since we do not have an external population to validate our model, we will instead obtain predicted probabilities for the same individuals using the predict() function in R. Next, we will draw the ROC curve based on the fitted model:

library(pROC)

predicted_probs <- predict(mod, type = "response")

roc_curve <- roc(df$Asthma, predicted_probs)

plot(roc_curve)

To evaluate the effectiveness of our model in distinguishing individuals with asthma, we can extract the information directly from the roc object. Relevant parameters include sensitivity, specificity and area under the curve (AUC).

print(roc_curve$auc)
## Area under the curve: 0.8203
best_threshold <- coords(roc_curve, "best", ret = c("threshold", "sensitivity", "specificity", "ppv", "npv"))

print(best_threshold)
##           threshold sensitivity specificity       ppv      npv
## threshold 0.5533116   0.7596154   0.8020833 0.8061224 0.754902

In this particular example, our model showed a sensitivity of 76%, a specificity of 80% and an AUC of 82%. Additional parameters, such as Negative Predictive Value and Positive Predictive Value may also be extracted from the roc object. Note that this ROC curve is based on predicted probabilities from our fitted model using PC1 and PC2. True validation would require testing on an independent dataset.
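
As a rough illustration of what such a validation could look like (the 80/20 split, the seed and the object names below are arbitrary choices for this sketch, not part of the original analysis), we could hold out part of our own sample, fit the PCA and the logistic model on the training portion only, and then evaluate the model on the held-out patients:

set.seed(123)
train_idx <- sample(seq_len(nrow(df)), size = 0.8 * nrow(df))

# Fit PCA on the training subset only, then project the held-out patients onto it
pca_train <- prcomp(omics[train_idx, ], center = TRUE, scale. = TRUE)
scores_train <- as.data.frame(pca_train$x[, 1:2])
scores_test <- as.data.frame(predict(pca_train, newdata = omics[-train_idx, ])[, 1:2])

# Logistic model on the training scores, evaluated on the held-out scores
mod_holdout <- glm(df$Asthma[train_idx] ~ PC1 + PC2, data = scores_train, family = "binomial")
holdout_probs <- predict(mod_holdout, newdata = scores_test, type = "response")
roc(df$Asthma[-train_idx], holdout_probs)$auc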

Important! These results are just an example. If you are interested in knowing which compounds are actually associated with asthma, please consider reading this article.

2.3 Using volcano plots to illustrate the importance of variables

Although PCA allows us to reduce the dimensionality of our data, the convergence into principal components often shifts the focus away from the individual metabolites that characterize the profile. In this scenario, volcano plots emerge as an appealing solution to illustrate the relevance of each metabolite for differentiating a given condition (asthma, in this case). A volcano plot is a type of scatter plot that displays a measure of effect (here, the odds ratio) against statistical significance, allowing us to visualize changes across many variables at once.

To start, we must first create a function that fits a univariable logistic regression for each metabolite and returns the corresponding odds ratio (the exponentiated coefficient) together with its p-value.

get_log_odds <- function(x, y, data) {
  model <- glm(data = data, y ~ x, family = "binomial")
  summary_model <- summary(model)
  odds_ratio <- exp(summary_model$coefficients[2, 1]) # Exponentiated coefficient (odds ratio) for the metabolite
  p_value <- summary_model$coefficients[2, 4]         # p-value for the metabolite coefficient
  return(c(odds_ratio, p_value))
}

Next, we would apply the get_log_odds() function to each variable to obtain its odds ratio and p-value. However, we only have 29 variables in our data frame, which is a relatively small number for generating an informative volcano plot. Therefore, to enhance our example, we will first simulate 60 additional variables based on the existing ones.

# Setting seed for reproducibility
set.seed(123)

n_variables <- 60

new_vars <- paste0("sim_var_", seq_len(n_variables))

new_omics <- omics

for(var in new_vars) {
  
  base_col <- sample(names(omics), 1)
  
  # Apply a random ±20% multiplicative variation to the selected column...
  random_gen <- runif(nrow(omics), 0.8, 1.2)
  
  # ...and add Gaussian noise with an SD equal to 20% of that column's SD
  new_omics[[var]] <- new_omics[[base_col]] * random_gen + rnorm(nrow(new_omics), mean = 0, sd = sd(new_omics[[base_col]]) * 0.2)
}

With our new set of 89 variables, we can now apply the get_log_odds() function to our data.

volcan_variables <- c()
pvalue_vector <- c()
for (i in seq_along(new_omics)) {
  # Fit the model once per metabolite and keep the odds ratio and the p-value
  model_res <- get_log_odds(new_omics[,i], df$Asthma, new_omics)
  volcan_variables <- c(volcan_variables, model_res[1])
  pvalue_vector <- c(pvalue_vector, model_res[2])
}

volcano_results <- data.frame(
  variables = colnames(new_omics),
  odds_ratio = volcan_variables,
  p_value = pvalue_vector
)

volcano_results$logP <- -log10(volcano_results$p_value)
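
Because we are fitting one model per metabolite, you may also want to account for multiple testing. One option (not used for the plots below; the p_adj column name is just an example) is to add Benjamini-Hochberg adjusted p-values alongside the raw ones:

# Benjamini-Hochberg adjustment for multiple comparisons (optional)
volcano_results$p_adj <- p.adjust(volcano_results$p_value, method = "BH")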

Finally, we create a plot using the obtained results.

volcano_results$diffexpressed <- "NO ASSOCIATION"

volcano_results$diffexpressed[volcano_results$p_value < 0.05] <- "SIGNIFICANT ASSOCIATION"
  
ggplot(volcano_results, aes(x = odds_ratio, y = logP, color = diffexpressed))+ 
  geom_point()+
  theme_classic()+
  scale_color_manual(values=c("red", "darkgreen", "black", "navy"))+
  labs(
    title = "Volcano plot",
    x = "Odds ratio for asthma",
    y = "-log10 p-value",
    color = "Legend"
  )+
  geom_vline(xintercept = 1, col = "red", linetype = "dashed") +
  geom_hline(yintercept = -log10(0.05), col = "blue", linetype = "dashed")

To allow a better interpretation of the determinant metabolites, we can add their names to the plot, using the ggrepel package to avoid overlapping labels. We will also remove labels from non-significant metabolites.

volcano_results$varlabels <- NA
volcano_results$varlabels[volcano_results$diffexpressed != "NO ASSOCIATION"] <- volcano_results$variables[volcano_results$diffexpressed !="NO ASSOCIATION"]
  
ggplot(volcano_results, aes(x = odds_ratio, y = logP, color = diffexpressed, label = varlabels))+ 
  geom_point()+
  ggrepel::geom_text_repel(max.overlaps = 50, size = 2)+
  theme_classic()+
  scale_color_manual(values=c("red", "darkgreen", "black", "navy"))+
  labs(
    title = "Volcano plot",
    x = "Odds ratio for asthma",
    y = "-log10 p-value",
    color = "Legend"
  )+
  geom_vline(xintercept = 1, col = "red", linetype = "dashed") +
  geom_hline(yintercept = -log10(0.05), col = "blue", linetype = "dashed")

2.4 Cluster analysis

Cluster analysis aims to identify patterns or structures in a dataset by organizing data points into clusters that reflect natural groupings, where members of the same cluster are more similar to each other than to those in other clusters. For this purpose, we can use several clustering methods, including partitioning methods (e.g., K-Means, PAM), hierarchical clustering or density-based clustering (e.g., DBSCAN).
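
As a quick illustration of the density-based option (shown only for completeness and not used later in this tutorial; it assumes the dbscan package is installed, and the eps and minPts values are arbitrary and would need tuning, for instance with dbscan::kNNdistplot()):

# Density-based clustering on scaled data; points labelled 0 are treated as noise
db_res <- dbscan::dbscan(scale(omics), eps = 3, minPts = 5)
table(db_res$cluster)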

Cluster analysis is widely used in metabolomics to uncover natural groupings in metabolite data. Before clustering our example data, it is a good idea to check whether they show any clustering tendency at all. For this, we can use the Hopkins statistic, which compares the distribution of the actual data points to randomly generated points within the same feature space. With larger datasets, we would typically assess clustering tendency on a random sample to reduce the computational load; since we are dealing with just 200 cases, we can use the entire dataset.

library(hopkins)

sample_size <- nrow(omics)-1

hop_result <- hopkins(omics, m = sample_size)

print(hop_result)
## [1] 0.9999976

As expected, the Hopkins statistic (H) is very close to 1, suggesting the data has a strong clustering tendency. Next, we estimate the optimal number of natural clusters in our data using multiple methods. In this example, we will check hierarchical clustering, K-Means and PAM algorithms. A maximum of 6 clusters will be used to save resources.

library(clValid)


inter_val <- clValid(
  t(omics),
  nClust = c(2:6), # Setting this to a larger vector may significantly increase the processing time!
  clMethods = c("hierarchical", "kmeans", "pam"),
  validation = "internal")

summary(inter_val)
## 
## Clustering Methods:
##  hierarchical kmeans pam 
## 
## Cluster sizes:
##  2 3 4 5 6 
## 
## Validation Measures:
##                                  2       3       4       5       6
##                                                                   
## hierarchical Connectivity   4.0734 12.0968 15.0008 17.8298 20.7337
##              Dunn           0.1983  0.3247  0.3247  0.3247  0.3247
##              Silhouette     0.5171  0.4585  0.3783  0.3744  0.3728
## kmeans       Connectivity   4.0734 12.1218 30.0651 31.4012 33.9623
##              Dunn           0.1983  0.3440  0.3701  0.3841  0.4005
##              Silhouette     0.5171  0.4639  0.3395  0.3457  0.3437
## pam          Connectivity   4.0734 12.4476 15.0996 21.8663 36.9429
##              Dunn           0.1983  0.3168  0.3704  0.3704  0.2468
##              Silhouette     0.5171  0.4782  0.3998  0.3097  0.2484
## 
## Optimal Scores:
## 
##              Score  Method       Clusters
## Connectivity 4.0734 hierarchical 2       
## Dunn         0.4005 kmeans       6       
## Silhouette   0.5171 hierarchical 2

As we can see, all three methods perform identically at k = 2, while hierarchical clustering yields the best overall connectivity and silhouette scores.

To confirm this observation, let’s visualize the optimal number of clusters using the Silhouette method, which measures how well each data point fits within its assigned cluster compared to other clusters.

fviz_nbclust(omics, FUN = hcut, method="silhouette")

The plot clearly shows an optimal number of clusters at k = 2, where the silhouette score peaks.

In this example, we will use the Euclidean distance as the dissimilarity metric for cluster analysis. The Euclidean distance is a common measure of dissimilarity between two data points and is often used in clustering algorithms (such as hierarchical clustering) to determine how “close” or “far apart” two points are in a multi-dimensional space. Other possible distance metrics include the Manhattan and Minkowski distances.
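
To make the distance concept concrete, base R's dist() function computes these dissimilarity matrices directly; a quick look at the first few patients (just an illustration, not needed for the clustering below):

# Pairwise distances between the first 4 patients, using different metrics
dist(omics[1:4, ], method = "euclidean")
dist(omics[1:4, ], method = "manhattan")
dist(omics[1:4, ], method = "minkowski", p = 3)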

hclust <- eclust(omics, FUNcluster = "hclust", k= 2, hc_metric = "euclidean", 
                 hc_method = "ward.D2", graph = FALSE)

hclust$labels <- df$Asthma

fviz_dend(hclust, label_cols = as.numeric(df$Asthma)[hclust$order], show_labels = TRUE, palette= "lancet", 
          as.ggplot=TRUE, rect = TRUE, type = "rectangle", cex = 0.4)

The plot illustrates the distribution of our asthma cases across the two clusters. There appears to be a higher density of asthma cases in the right branch (marked in red). We should therefore test this hypothesis using inferential statistics; in this example, we will use the chi-square test to check whether the proportion of asthma cases indeed differs between clusters.

# We first extract the cluster results
cluster_res <- hclust$cluster %>% as.factor()

chisq.test(cluster_res, df$Asthma, correct = TRUE)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  cluster_res and df$Asthma
## X-squared = 40.352, df = 1, p-value = 2.12e-10

As expected, the proportion of asthma cases is significantly different between clusters.
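
To see the actual proportions behind this test, we can cross-tabulate cluster membership and diagnosis:

# Proportion of asthma cases within each cluster
prop.table(table(cluster_res, df$Asthma), margin = 1)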

An appealing way to present hierarchical clustering results is a heatmap displaying the concentration of each metabolite across the cluster leaves. For this, we will need the pheatmap package. We will also use some color ramps from RColorBrewer.

library(RColorBrewer)
library(pheatmap)



pheatmap(omics,
         cluster_rows = TRUE,
         cutree_rows = 2,
         cluster_cols = FALSE,  
         display_numbers = FALSE, 
         color = colorRampPalette(brewer.pal(9, "RdYlBu"))(100), 
         fontsize = 6, 
         fontsize_number = 8,
         labels_row = df$Asthma,
         border_color = "gray", 
         main = "Heatmap", 
         angle_col = 45) 

It is important to keep in mind that hierarchical clustering may scale poorly with larger datasets. As an alternative, let’s try cluster analysis using k-means, here with k = 6, the number of clusters favored by the Dunn index above.

kmean_clust <- eclust(omics, FUNcluster = "kmeans", k= 6, hc_metric = "euclidean", 
                 hc_method = "ward.D2", graph = FALSE)

fviz_silhouette(kmean_clust)
##   cluster size ave.sil.width
## 1       1   20          0.11
## 2       2   44          0.08
## 3       3   34          0.10
## 4       4   41          0.09
## 5       5   28          0.09
## 6       6   33          0.11

fviz_cluster(kmean_clust, geom = "point")

We can then label each point according to the asthma classification:

fviz_cluster(kmean_clust, geom = "point", aes(color = df$Asthma)) +
  ggrepel::geom_text_repel(
    aes(label = df$Asthma), 
    color = as.numeric(df$Asthma), 
    size = 3,
    max.overlaps = 40
    ) +
  theme_classic()

The plot becomes visually cluttered due to overlapping text labels. To provide a cleaner visualization, and considering we already have ellipses representing the cluster areas, we can replace the labels with simpler markers. For illustration, we will use emojis as symbolic markers: 🫁 for asthma and 💚 for healthy individuals.

fviz_cluster(kmean_clust, geom = "point", aes(color = df$Asthma)) +
  ggrepel::geom_text_repel(
    aes(label = ifelse(df$Asthma == "Asthma", "🫁", "💚")), 
    color = as.numeric(df$Asthma), 
    size = 3,
    max.overlaps = 40
    ) +
  theme_classic()

2.5 Recursive partitioning

Another interesting approach to metabolomics is performing recursive partitioning regression. Partitioning regression is a type of regression analysis that involves splitting the data into distinct regions or subsets (called partitions) based on predictor variables and then fitting a separate regression model to each partition. The goal is to capture different relationships between the predictor variables and the outcome variable in different regions of the predictor space.

We can also use this opportunity to introduce the concept of supervised machine learning. This type of model is trained on a labeled dataset, with the algorithm identifying the most predictive variables (and their respective thresholds or levels) for the given outcomes. After training the algorithm, models are tested on an additional data sample and performance measures are calculated.

In this example, we will use 80% of data points to train our algorithm, with the remaining 20% being used to test its accuracy:

set.seed(123) # For reproducibility

df_tree <- cbind(df$Asthma, omics) 
df_tree <- df_tree[sample(1:nrow(df_tree)),]

train_sample <- 1:(0.8 * nrow(df_tree)) # To use 80% of sample for training

data_train <- df_tree[train_sample,]
data_test <- df_tree[-train_sample,]

prop.table(table(data_train[,1]))
## 
## No asthma    Asthma 
##    0.4875    0.5125
prop.table(table(data_test[,1]))
## 
## No asthma    Asthma 
##      0.45      0.55

The class proportions (e.g., asthma vs. no asthma) are relatively similar between the training and test sets.
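
If you want to guarantee very similar class proportions, a stratified split is an option. A minimal sketch using the caret package (the strat_* names are just examples, and this split is not applied in the rest of the tutorial, so the results below are unaffected):

# Stratified 80/20 split preserving the asthma proportion (alternative, not used below)
strat_idx <- caret::createDataPartition(df_tree[,1], p = 0.8, list = FALSE)
strat_train <- df_tree[strat_idx, ]
strat_test <- df_tree[-strat_idx, ]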

To create the recursive partitioning model, we will need the rpart package. The formula structure to use in the model is very similar to a regular logistic regression expression:

library(rpart)

# We use asthma classification as our dependent variable.
# Using "." incorporates all variables in the model. To select specific compounds, just replace "." with the variable name.
mod_rpart <- rpart(data = data_train[,-1], data_train[,1] ~., method = "class")

With the fitted model, let’s plot the classification tree:

rpart.plot::rpart.plot(mod_rpart, type = 4, extra = 2)

The figure illustrates the classification tree. The discriminant variables and their respective cutoffs are displayed at each node. To use different display options for the tree, you can change the extra and type arguments of the rpart.plot() function; check the function’s documentation for additional information on plot formatting.
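
For instance, the following call (display options chosen purely as an illustration) shows the per-class probabilities and the percentage of observations in each node:

# Alternative display: class probabilities and % of observations per node
rpart.plot::rpart.plot(mod_rpart, type = 2, extra = 104)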

Now that we have our recursive partitioning model, let’s test it against our test set. There are several ways to show the effectiveness of a model in predicting an outcome. In this example, given that the outcome is a clinical condition, sensitivity and specificity are particularly relevant performance metrics. We will also calculate overall accuracy and the positive and negative predictive values; the positive and negative likelihood ratios can then be derived from sensitivity and specificity, as shown after the function below.

# Basic diagnostic accuracy measures from the counts of a 2x2 confusion matrix
accuracy_measurements <- function(TP, FP, FN, TN) {
  
  sensitivity <- TP/(TP+FN)
  specificity <- TN/(TN+FP)
  accuracy <- (TP+TN)/(TP+TN+FP+FN)
  
  PPV <- TP/(TP+FP)
  NPV <- TN/(TN+FN)
  
  z <- list(sensitivity, specificity, accuracy, PPV, NPV)
  names(z) <- c("Sensitivity", "Specificity", "Accuracy", "PPV", "NPV")
  
  return(z)
}
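
As mentioned above, the positive and negative likelihood ratios can be derived from sensitivity and specificity; here is a minimal helper (the function name is just an example):

# LR+ = sensitivity / (1 - specificity); LR- = (1 - sensitivity) / specificity
likelihood_ratios <- function(sensitivity, specificity) {
  list(LR_positive = sensitivity / (1 - specificity),
       LR_negative = (1 - sensitivity) / specificity)
}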


predict_tree <- predict(mod_rpart, data_test, type = "class")
cases_table <- table(data_test[,1], predict_tree)

# In cases_table, rows are the true status ("No asthma", "Asthma") and columns the predicted class,
# so TP = [2,2], FN = [2,1], FP = [1,2] and TN = [1,1]
accuracy_measurements(TP = cases_table[2,2], FP = cases_table[1,2], FN = cases_table[2,1], TN = cases_table[1,1])
## $Sensitivity
## [1] 0.6818182
## 
## $Specificity
## [1] 0.8888889
## 
## $Accuracy
## [1] 0.775
## 
## $PPV
## [1] 0.8823529
## 
## $NPV
## [1] 0.6956522

The overall accuracy of our recursive partitioning model was 77.5%, which is not great. The model is highly specific (88.9%) but misses a considerable share of asthma cases (sensitivity of 68.2%). Of course, we could also plot a ROC curve, as we have done previously with the PCA results; a brief sketch follows.
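
As a minimal sketch of that idea (the object names here are just examples), we could use the class probabilities returned by predict() together with pROC, as before:

# ROC curve for the classification tree on the test set
tree_probs <- predict(mod_rpart, data_test, type = "prob")[, "Asthma"]
tree_roc <- roc(data_test[,1], tree_probs)
plot(tree_roc)
print(tree_roc$auc)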

The training and testing approach that was used in this example can also be used for the other mentioned algorithms, such as hierarchical clustering.

2.6 Neural networks

Using neural networks to identify metabolic profiles is less common than the aforementioned methods. However, there are many cases where this approach would be useful, particularly when dealing with large datasets containing a large number of variables. The downside of using neural networks, besides the need for significant computational power, is that the model is generally a black box: unlike with recursive partitioning, we will not be able to perceive the contribution of each variable to the decision process.

The following example will help you get started with neural networks, but it should serve as a basic introduction rather than a comprehensive guide. For a more detailed reading, I would recommend the dedicated guide by Ashutosh Nanda, that can be accessed here.

However, we can create a simple example to illustrate how neural networks can be applied in metabolomics. For this, we will use the neuralnet package. In the neuralnet() function, the input and output layers are defined by the model formula, while the hidden argument sets the structure of the hidden layers. We will also need to set linear.output to FALSE, since we are aiming for a binary classification output (Asthma: “yes” or “no”).

library(neuralnet)

# We can convert the variables' names to something more appropriate for analysis

set.seed(123)
df2 <- data_train

names(df2) <- make.names(names(df2)) 

# Create a dynamic formula to handle all metabolites
vars <- names(df2)[-1]
formula <- as.formula(paste("df.Asthma ~", paste(vars, collapse = " + ")))

# Define the neural network structure based on input variables and hidden layers
nn_model <- neuralnet(formula, data = df2, hidden = c(6,3,1,3), linear.output = FALSE)

To plot the neural network, one could simply use plot(nn_model). However, to improve control over the visualization, we will use the plotnet() function, from the NeuralNetTools package.

NeuralNetTools::plotnet(nn_model, cex_val = 0.6, circle_cex = 3, pos_col = "darkgreen", neg_col = "firebrick")

Next, we need to check the efficacy of the neural network. As an example, we dichotomize the neural network’s output by classifying values below the median as “Asthma” and those above as “No asthma”.

df_test2 <- data_test
names(df_test2) <- make.names(names(df_test2))

predicted_prob <- predict(nn_model, newdata = df_test2)

predicted_class <- ifelse(predicted_prob[,1] <= median(predicted_prob[,1]), "Asthma", "No asthma") %>%
  as.factor()

nn.table <- table(df_test2$df.Asthma, predicted_class)

accuracy_measurements(nn.table[2,2], nn.table[2,1], nn.table[1,2], nn.table[1,1])
## $Sensitivity
## [1] 0.5
## 
## $Specificity
## [1] 0.4285714
## 
## $Accuracy
## [1] 0.45
## 
## $PPV
## [1] 0.2727273
## 
## $NPV
## [1] 0.6666667

As we can see, our neural net was not able to accurately differentiate individuals with asthma based on the measured metabolites. Neural nets tend to perform better with larger training sets. Nevertheless, we can iterate through several neural net structures to find the most robust model.

Warning! The following snippet might require significant computational resources to complete. It should not take longer than 1 minute on modern hardware.

set.seed(123)

best_accuracy <- 0
best_model <- NULL

for(j in c(1:3)) { # number of nodes in layer 1 (if it exists)
  for(k in c(0:3)) { # number of nodes in layer 2 (if it exists)
    for(l in c(0:3)) { # number of nodes in layer 3 (if it exists)

      nn.structure <- c(j,k,l)
      nn.structure <- nn.structure[nn.structure != 0]
      
      nn_model <- neuralnet(formula, data = df2, hidden = nn.structure, stepmax = 1e6,
                            linear.output = FALSE)
      
      predicted_prob <- predict(nn_model, newdata = df_test2)
      
      predicted_class <- ifelse(
        predicted_prob[,1] <= median(predicted_prob[,1]), "Asthma", "No asthma"
        ) %>% as.factor()
      
      nn.table <- table(df_test2$df.Asthma, predicted_class)
      
      if(ncol(nn.table) > 1) { # ignore if there is no differentiation
        acm <- accuracy_measurements(nn.table[2,2], nn.table[2,1], nn.table[1,2], nn.table[1,1])
        
        if(acm$Accuracy > best_accuracy) {
          best_accuracy <- acm$Accuracy
          best_model <- nn.structure
        }
      }
    }
  } 
}


print(paste0("Accuracy: ", best_accuracy, ", NN structure: ", paste(best_model, collapse = "-")))
## [1] "Accuracy: 0.55, NN structure: 1"

In this example, using a single hidden layer with one node provided the “best” accuracy value (55%). This would not be adequate for a correct classification and, therefore, this model would have to be retrained using a larger data set or different input variables.

3 Final thoughts from the author

I hope this short tutorial helps you get started with metabolomics. While we used simulated data in this demonstration, the same approaches can be applied to real-world datasets with clinical impact. If you’re working in metabolomics research, I encourage you to adapt these techniques to your context and validate your findings through robust experimental design.

