The notebook is also available on GitHub. There you can view the interactive plots for the dimensionality reduction, since the workspace used here doesn't support them yet.

The goal of the project is to present an analysis of clinical data by leveraging a multitude of computational approaches. The objectives of the analysis are the sensitive prediction of sick patients using machine learning (ML) and the possible identification of new biomarkers for inflammatory bowel diseases (IBD) by using the feature importance of the ML models and/or differential expression analysis.

This project is based on high-dimensional data obtained by mass spectrometry (MS). Different dimensionality reduction and ML methods will be used and discussed. The script is divided into several sections, starting with data wrangling, followed by exploratory data analysis. Different methods are used for dimension reduction with subsequent classification from a variety of ML models. Finally, the results will be discussed. Moreover, a differential expression analysis is planned.

Imports

A variety of packages are required for the diverse ML methods. The main framework is tidymodels, and many tidyverse principles are used.

suppressPackageStartupMessages({
library(mixOmics, quietly = TRUE)
library(plsmod)
library(readxl, quietly = TRUE)
library(vip, quietly = TRUE)
library(tictoc, quietly = TRUE)
library(stacks, quietly = TRUE)
library(MASS, exclude = "select")
library(Rtsne, quietly = TRUE)
library(tidyverse, quietly = TRUE)
library(tidymodels, quietly = TRUE)
library(plotly, quietly = TRUE)
library(embed, quietly = TRUE)
library(gtools, quietly = TRUE)
library(cowplot, quietly = TRUE)
library(visdat, quietly = TRUE)
})

Initial Data Analysis

The data

The data set can be retrieved from the supplementary material of the published study. In the study of E. A. Franzosa et al., untargeted metabolomic profiling of stool samples from patients with Crohn’s disease (CD), ulcerative colitis (UC), and non-IBD controls was performed. Previous studies have found differences in faecal metabolite composition in IBD patients.
In their study, metabolomic and shotgun metagenomic profiling was conducted; in this analysis only the metabolomic features derived by liquid chromatography-mass spectrometry (LC-MS) will be considered. The data set includes untargeted metabolomic profiles of stool samples from a 155-member discovery cohort and a 65-member independent validation cohort. Four different LC-MS methods were used to cover different metabolite classes such as polar metabolites, lipids, free fatty acids, and bile acids. 3829 uncharacterized metabolite features were assigned to putative molecular classes based on comparison to the Human Metabolome Database (HMDB), and 466 features were annotated by reference to an in-house compound library. Metabolites were also clustered based on their covariation across samples. The discovery cohort includes 68 CD, 53 UC, and 34 non-IBD patients, whereas the validation cohort is more balanced with 20 CD, 23 UC, and 22 control patients.

download.file("https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6342642/bin/NIHMS1510763-supplement-Dataset_2.xlsx", destfile = "/tmp/file.xlsx")
df_orig <- read_excel("/tmp/file.xlsx", skip=1)
download.file("https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6342642/bin/NIHMS1510763-supplement-Dataset_1.xlsx", destfile = "/tmp/file2.xlsx")
df_met_feature <- read_excel("/tmp/file2.xlsx", skip=1)
head(df_orig)
## # A tibble: 6 × 221
##   `# Feature / Sample` `PRISM|7122` `PRISM|7147` `PRISM|7150`       `PRISM|7153`
##   <chr>                <chr>        <chr>        <chr>              <chr>       
## 1 Age                  38           50           41                 51          
## 2 Diagnosis            CD           CD           CD                 CD          
## 3 Fecal.Calprotectin   207.4844286  <NA>         218.33451690000001 <NA>        
## 4 antibiotic           No           No           No                 No          
## 5 immunosuppressant    Yes          No           Yes                No          
## 6 mesalamine           No           Yes          No                 Yes         
## # … with 216 more variables: `PRISM|7184` <chr>, `PRISM|7238` <chr>,
## #   `PRISM|7406` <chr>, `PRISM|7408` <chr>, `PRISM|7421` <chr>,
## #   `PRISM|7445` <chr>, `PRISM|7486` <chr>, `PRISM|7496` <chr>,
## #   `PRISM|7506` <chr>, `PRISM|7547` <chr>, `PRISM|7658` <chr>,
## #   `PRISM|7662` <chr>, `PRISM|7744` <chr>, `PRISM|7759` <chr>,
## #   `PRISM|7762` <chr>, `PRISM|7776` <chr>, `PRISM|7791` <chr>,
## #   `PRISM|7843` <chr>, `PRISM|7847` <chr>, `PRISM|7852` <chr>, …

For ML algorithms, the data must be in a form where the rows correspond to the sample individuals and the columns to the variables/features. Therefore, the data set is pivoted.

df_long <- df_orig %>% pivot_longer(-`# Feature / Sample`, names_to="PatientID") %>%
  pivot_wider(names_from=`# Feature / Sample`, values_from="value") 
head(df_long)
## # A tibble: 6 × 8,856
##   PatientID  Age   Diagnosis Fecal.Calprotectin antibiotic immunosuppressant
##   <chr>      <chr> <chr>     <chr>              <chr>      <chr>            
## 1 PRISM|7122 38    CD        207.4844286        No         Yes              
## 2 PRISM|7147 50    CD        <NA>               No         No               
## 3 PRISM|7150 41    CD        218.33451690000001 No         Yes              
## 4 PRISM|7153 51    CD        <NA>               No         No               
## 5 PRISM|7184 68    CD        20.16795055        No         No               
## 6 PRISM|7238 67    CD        2.5862468359999999 Yes        Yes              
## # … with 8,850 more variables: mesalamine <chr>, steroids <chr>,
## #   `C18-neg_Cluster_0001` <chr>, `C18-neg_Cluster_0002` <chr>,
## #   `C18-neg_Cluster_0003` <chr>, `C18-neg_Cluster_0004` <chr>,
## #   `C18-neg_Cluster_0005` <chr>, `C18-neg_Cluster_0006` <chr>,
## #   `C18-neg_Cluster_0007` <chr>, `C18-neg_Cluster_0008` <chr>,
## #   `C18-neg_Cluster_0009` <chr>, `C18-neg_Cluster_0010` <chr>,
## #   `C18-neg_Cluster_0011` <chr>, `C18-neg_Cluster_0012` <chr>, …

Handling Missing Data

We can directly see NA values in the Fecal.Calprotectin column. Most ML methods require the absence of NA values or remove them on their own. There are different ways to handle missing values. The two primary solutions are deletion or imputation. Deletion should be avoided or minimized as much as possible because it comes at the cost of losing data/information. The removal can be done by either dropping the affected rows or the entire column; the latter might be necessary if many values are missing for one variable. In contrast, there are many options to impute missing values, ranging from simple statistics like the mean or median to regression or ML methods like the k-nearest neighbors algorithm. First, the columns with NA values will be visualized using the visdat package to find possible relationships or causes of the missing data. The package allows for a heatmap-like visualization of missing data.
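
As a point of reference, imputation can also be expressed as preprocessing steps in a tidymodels recipe. The following is only a sketch and is not used in this analysis; it assumes a hypothetical data frame df_typed in which the variables already have their proper types:

# hypothetical sketch: impute numeric features with the median and
# nominal features with k-nearest neighbours (not run in this analysis)
impute_rec <- recipe(Diagnosis ~ ., data = df_typed) %>%
  step_impute_median(all_numeric_predictors()) %>%
  step_impute_knn(all_nominal_predictors(), neighbors = 5)
df_imputed <- impute_rec %>% prep() %>% bake(new_data = NULL)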

na_cols <- unlist(lapply(df_long, function(x) any(is.na(x))))
df_long[na_cols] %>% vis_dat()
## Warning: `gather_()` was deprecated in tidyr 1.2.0.
## Please use `gather()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.

There are multiple NA values apparent. The Fecal.Calprotectin column has too many missing values, and therefore the column will be removed. The immunosuppressant, mesalamine, and steroids variables display a similar distribution of NA values. The majority of the missing values cluster together and probably have something in common. For further investigation, the nominal variable Diagnosis might give further details on the possible relationship.

p1 <- df_long %>% 
  ggplot(aes(Diagnosis, fill=is.na(steroids))) +
  geom_bar() +
  theme(legend.position="none")+
  labs(title="Missing values in steroids variable")

p2 <- df_long %>% 
  ggplot(aes(Diagnosis, fill=is.na(immunosuppressant))) +
  geom_bar() +
  theme(legend.position="none")+
  labs(title="immunosuppressant variable")

p3 <- df_long %>% 
  ggplot(aes(Diagnosis, fill=is.na(mesalamine))) +
  geom_bar() +
  labs(title="mesalamine variable",fill="Missing")

plot_grid(p1,p2, p3)

The majority of missing values for the three variables are in the control group. Here, we can assume that the healthy control group doesn’t take steroids, immunosuppressants, or mesalamine. Therefore, the missing values in the control group are replaced with “No”.

df_long <- df_long  %>% select(-Fecal.Calprotectin)
df_long[df_long$Diagnosis == "Control",] <- df_long[df_long$Diagnosis == "Control",] %>% replace_na(list(immunosuppressant="No", mesalamine="No", steroids="No"))
df_long[!complete.cases(df_long), ]
## # A tibble: 5 × 8,855
##   PatientID  Age   Diagnosis antibiotic immunosuppressant mesalamine steroids
##   <chr>      <chr> <chr>     <chr>      <chr>             <chr>      <chr>   
## 1 PRISM|7762 40    UC        <NA>       No                Yes        <NA>    
## 2 PRISM|7948 61    CD        <NA>       No                Yes        No      
## 3 PRISM|8573 31    CD        <NA>       No                Yes        Yes     
## 4 PRISM|8794 24    CD        No         No                <NA>       <NA>    
## 5 PRISM|8802 73    CD        No         No                <NA>       No      
## # … with 8,848 more variables: `C18-neg_Cluster_0001` <chr>,
## #   `C18-neg_Cluster_0002` <chr>, `C18-neg_Cluster_0003` <chr>,
## #   `C18-neg_Cluster_0004` <chr>, `C18-neg_Cluster_0005` <chr>,
## #   `C18-neg_Cluster_0006` <chr>, `C18-neg_Cluster_0007` <chr>,
## #   `C18-neg_Cluster_0008` <chr>, `C18-neg_Cluster_0009` <chr>,
## #   `C18-neg_Cluster_0010` <chr>, `C18-neg_Cluster_0011` <chr>,
## #   `C18-neg_Cluster_0012` <chr>, `C18-neg_Cluster_0013` <chr>, …

There are five rows remaining with missing values. For the sake of simplicity, these rows will be removed in this project. However, clinical studies have limited subjects and the methods used can be expensive and strenuous. Therefore, within the framework of a study evaluation, one should ask for further information to avoid losing precious data.

df_long <- df_long %>% na.omit()
dim(df_long)
## [1]  215 8855

Data Splitting & Data Preparation for ML

Now, the data will be divided into the discovery group and the validation group. Fortunately, the study provides three independent cohorts. The PRISM cohort will be used to train and optimize the models and two groups are available for the validation set, Deep and UMCG. This procedure ensures that the final models will never have seen the validation data, and we can gauge the generalisability of the models.

PRISM <- df_orig %>% select(starts_with('PRISM')) %>% names()
data_prism <- df_long %>% filter(PatientID %in% PRISM)
data_validate <- df_long %>% filter(!PatientID %in% PRISM)

Since all variables are encoded as characters, we will format the data so that the ML algorithms can use it. The nominal variables, e.g. Diagnosis or antibiotic, are transformed into factors, and the age and mass spectrometry features into numeric variables. Moreover, the PatientID is excluded from the training set.

tidy_data <- function(untidy) {
  formatted <- untidy %>%
    mutate(PatientID = as.factor(PatientID),
           Age = as.numeric(Age),
           Diagnosis = as.factor(Diagnosis),
           antibiotic = as.factor(antibiotic),
           mesalamine = as.factor(mesalamine),
           steroids = as.factor(steroids),
           immunosuppressant = as.factor(immunosuppressant))

  # convert the remaining character columns (the MS features) to numeric
  w <- which(sapply(formatted, class) == "character")
  formatted[w] <- lapply(formatted[w], as.numeric)
  return(formatted)
}

data_prism_formatted <- tidy_data(data_prism) 
data_validate_formatted <- tidy_data(data_validate)

colnames(data_prism_formatted) <- make.names(colnames(data_prism_formatted)) # compatible feature names for ML
colnames(data_validate_formatted) <- make.names(colnames(data_validate_formatted))
train_df <- data_prism_formatted %>% select(-PatientID) %>%
  mutate(Diagnosis = factor(Diagnosis, levels =c("Control","CD", "UC"), labels =c(1,2,3)))
test_df <- data_validate_formatted %>% select(-PatientID,-Diagnosis) 
test_df_truth <- data_validate_formatted %>% select(PatientID, Diagnosis) %>% # for performance metrics
  mutate(Diagnosis = factor(Diagnosis, levels =c("Control","CD", "UC"), labels =c(1,2,3))) %>% 
  mutate(Cohort = case_when(str_detect(PatientID,"LLDeep") == TRUE ~ "Deep",
                                                             TRUE ~ "UMCG")) # for error analysis based on cohort affiliation
folds <- train_df %>% vfold_cv(5, strata = Diagnosis) # for model training
                       
data_combined <- bind_rows(data_prism_formatted, data_validate_formatted) %>% # for EDA
  mutate(Cohort = case_when(
      str_detect(PatientID, "LLDeep") == TRUE ~ "Deep",
      str_detect(PatientID, "PRISM") == TRUE ~ "PRISM", 
      TRUE ~ "UMCG"))
data_combined_train <- data_combined %>% select(-c(PatientID, Diagnosis, Cohort)) # for t-SNE 
head(data_combined %>% select(Cohort, everything()))
## # A tibble: 6 × 8,856
##   Cohort PatientID    Age Diagnosis antibiotic immunosuppressant mesalamine
##   <chr>  <fct>      <dbl> <fct>     <fct>      <fct>             <fct>     
## 1 PRISM  PRISM|7122    38 CD        No         Yes               No        
## 2 PRISM  PRISM|7147    50 CD        No         No                Yes       
## 3 PRISM  PRISM|7150    41 CD        No         Yes               No        
## 4 PRISM  PRISM|7153    51 CD        No         No                Yes       
## 5 PRISM  PRISM|7184    68 CD        No         No                No        
## 6 PRISM  PRISM|7238    67 CD        Yes        Yes               No        
## # … with 8,849 more variables: steroids <fct>, C18.neg_Cluster_0001 <dbl>,
## #   C18.neg_Cluster_0002 <dbl>, C18.neg_Cluster_0003 <dbl>,
## #   C18.neg_Cluster_0004 <dbl>, C18.neg_Cluster_0005 <dbl>,
## #   C18.neg_Cluster_0006 <dbl>, C18.neg_Cluster_0007 <dbl>,
## #   C18.neg_Cluster_0008 <dbl>, C18.neg_Cluster_0009 <dbl>,
## #   C18.neg_Cluster_0010 <dbl>, C18.neg_Cluster_0011 <dbl>,
## #   C18.neg_Cluster_0012 <dbl>, C18.neg_Cluster_0013 <dbl>, …

Exploratory Data Analysis

Exploratory data analysis (EDA) is not a formal process. Instead, it is used to gain an overview of the data. At this stage, you can feel free to explore every idea that comes to mind. Those insights are typically expressed as meaningful visualizations. During this stage, data transformations might be necessary, and not all ideas will deliver the expected results or lead to new questions.

First, the Group variable is created to distinguish between the discovery and validation data.

data_combined_exp <- data_combined %>% mutate(Group = case_when(Cohort == "Deep" ~ "Validation",
                                           Cohort == "PRISM" ~ "Discovery",
                                           TRUE ~ "Validation")) 
p4 <- ggplot(data_combined_exp, aes(x = Diagnosis, fill=Diagnosis)) + 
  geom_bar()+
  facet_wrap(~Group) +
  theme(legend.position = "none")

Next, the naming patterns of the four different LC columns are needed to count the features per column type.

names(data_combined) %>% sample(10) # four different column prefixes should be found
##  [1] "C8.pos_Cluster_1373"    "HILIC.pos_Cluster_1847" "C8.pos_Cluster_2426"   
##  [4] "HILIC.neg_Cluster_0827" "C18.neg_Cluster_0337"   "HILIC.neg_Cluster_1130"
##  [7] "HILIC.neg_Cluster_0705" "HILIC.pos_Cluster_1446" "HILIC.pos_Cluster_1618"
## [10] "C18.neg_Cluster_0086"
column_names <- c("C18.neg", "C8.pos", "HILIC.neg", "HILIC.pos")
column_count <- tibble(Column = column_names, count = c(NA))
for (i in 1:4) {
  x <- column_names[[i]]
column_count$count[[i]] <- data_combined %>% select(starts_with(x)) %>% ncol()
}
p5 <- column_count %>% unnest(count) %>%
  ggplot(aes(Column, count, fill=Column)) +
  geom_col() +
  theme(legend.position = "none")

plot_grid(p4,p5, labels=c("A", "B"))

The discovery group displays an uneven distribution of the patients’ diagnoses in plot A. The validation group, however, displays an almost even distribution.
Fortunately, the five rows removed earlier were CD and UC patients in the discovery group, which led to a slightly more balanced distribution. Class imbalances can lead to poor performance of ML models by exhibiting bias toward the majority class. However, the minority class is usually the reason the model is built in the first place.
Here, however, the majority classes CD and UC are the more interesting ones, as the ML model should identify patients with IBD. There are a variety of techniques to deal with class imbalances, e.g. up- or downsampling, or the Synthetic Minority Oversampling Technique (SMOTE). Here, no correction of the imbalances will be done, even though SMOTE could easily be integrated into the tidymodels workflow (see the sketch below).
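
For illustration, a minimal sketch of how SMOTE could be added to a recipe via the themis package is shown here; it is not applied in this analysis, and themis is not loaded in the imports above:

# not applied here: SMOTE requires numeric predictors, hence dummy coding first
library(themis)
recipe_smote <- recipe(Diagnosis ~ ., data = train_df) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_smote(Diagnosis)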

In plot B, the number of metabolic features per column type is displayed. LC in positive mode tended to generate more features than in negative mode. The different column types probably also identified some identical compounds. In the study, 466 metabolites were identified more precisely, and 3829 features were linked to putative identifiers by searching against the Human Metabolome Database. The names of the identified metabolites or substance classes are not included in this data set, but they can later be combined and used for further analysis. For example, the metabolites with the highest variable importance of some models can be retrieved.

data_combined_exp <- data_combined %>% mutate(Group = case_when(Cohort == "Deep" ~ "Validation",
                                           Cohort == "PRISM" ~ "Discovery",
                                           TRUE ~ "Validation")) 
p_age <-ggplot(data_combined_exp,aes(x = Age,fill = Diagnosis)) + 
  geom_density(alpha=.5)+
  labs(x="Age",
       y="Density",
       fill="Diagnosis",
       title="Age distribution of subjects")+
  facet_grid(~ Group, labeller = label_both, scales = "free_y")+
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        panel.background = element_rect(fill='transparent'),
        plot.background = element_rect(fill='transparent', color=NA),
        panel.grid.major = element_blank(),
         panel.grid.minor = element_blank(),
        legend.background = element_rect(fill='transparent'))
vars = c("antibiotic","immunosuppressant","mesalamine","steroids")
for (i in vars) {
  assign(
    paste0("p_", i), 
    ggplot(data_combined_exp, aes_string(i, fill = "Diagnosis")) + 
      geom_bar(position="fill")+
      facet_wrap(~Group)+
      theme(legend.position="none")+
      labs(y="")
  )
}
plot_grid(p_age,plot_grid(p_antibiotic, p_immunosuppressant, p_mesalamine, p_steroids, labels = c('A', 'B', "C", "D"), label_size = 12))

The age distributions of the discovery and validation cohorts have similar characteristics. The CD and UC patients’ ages range from 20 to 80 years. Only the control group’s age in the validation cohort reaches merely up to 60 years. Moreover, the control group patients are mostly younger than 40 in the discovery cohort. The UC patients’ age displays a bimodal distribution, with higher densities of patients younger and older than 60.

The barplots display a similar percentage of antibiotic, immunosuppressant, mesalamine, and steroid intake for the discovery and validation cohorts across the different patient groups. The biggest difference is that there was no antibiotic intake in the validation cohort. Moreover, there are moderate differences in the percentage of immunosuppressant and mesalamine intake for the validation groups.

Feature Correlation

Data from biological samples are often highly correlated. Explanatory features will not only correlate with the target variable, Diagnosis, but the features will also correlate with each other, which is called multicollinearity in the case of a linear relationship. Multicollinearity may not influence the accuracy of the models’ predictions, but it impairs their interpretability. In the case of highly correlated features, the coefficients, p-values, or SHAP values of the model may not represent the actual importance of the biomarkers, because the features are redundant and the model might have used only one of them. Therefore, the simpler the model (fewer features), the higher the interpretability.
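
One common remedy, not applied here, is a correlation filter that drops one feature from every highly correlated pair. A minimal sketch with recipes (the threshold of 0.9 is hypothetical) could look like the following; note that it computes the full correlation matrix and therefore runs into the same memory limits described below:

# not applied here: remove one feature of each pair with |r| above the threshold
recipe_corr_filter <- recipe(Diagnosis ~ ., data = train_df) %>%
  step_corr(all_numeric_predictors(), threshold = 0.9)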

Since the data used here contain about 9000 features, the correlation matrix for all of them is too big to fit into the memory (RAM) of most computers. Therefore, a subset of 2500 MS features with equal amounts from every column type will be used for the correlation analysis.

sample_func <- function(sample_size, data_combined = data_combined) { # sampling N features from every column type
set.seed(42)
for (i in 1:nrow(column_count)) {    
    x = column_count$Column[[i]]
    assign(paste0("sample_", x),
    data_combined %>% select(starts_with(x)) %>% select(sample(1:column_count$count[[i]], sample_size, replace=FALSE)) 
           )
    }
    sample_C18.neg %>% add_column(sample_C8.pos,
                                             sample_HILIC.neg,
                                             sample_HILIC.pos) -> sample
    return(sample)
    }
sample <- sample_func(625, data_combined)
corr_func <- function(threshold, sample = sample) { # correlating and cleaning matrix with threshold for correlation coefficient
    set.seed(42)
    cor_all <- sample %>% cor()
    cor_all[!lower.tri(cor_all)] <- NA # remove diagonal and redundant values
    data.frame(cor_all) %>%
              rownames_to_column() %>%
              gather(key="variable", value="correlation", -rowname) %>%
              filter(!is.na(correlation)) %>%
              filter(abs(correlation) >= threshold) -> df_corr_sample
    return(df_corr_sample)
    }
thresholds <- c(.1,.4,.6,.9,.95,.99) # correlation cofficient 
df_count <- tibble(Threshold=thresholds, count=c(NA))
for (i in seq_along(thresholds)) {
    x <- thresholds[[i]]
df_count$count[[i]] <- corr_func(.0, sample) %>%
    filter(abs(correlation) > x) %>%
    nrow()
    }
n <- ncol(sample) # sample size
df_count %>% mutate(Percentage = count/((n**2 - n)/2)) %>% # divided by all possible combinations to get percentages
ggplot(aes(factor(Threshold), Percentage)) +
geom_point() + 
scale_y_continuous(trans='log10',labels = label_percent(), breaks = c(5 %o% 10^(-5:1))) +
coord_flip() +
labs(x = "Threshold of Pearson's correlation coefficient", y = "Proportion", title="Correlation of features")

About 50% of the feature pairs in the subset show at least a negligible linear correlation (0.1). 2.5% of the pairs display at least a weak correlation (0.4). A very strong correlation (0.9) and an almost perfect correlation (0.99) exist between only about 0.05% and 0.0025% of the feature pairs, respectively.

Next, it would be interesting to know which column types contribute the most correlated features.

comb_grid <- combinations(n=4, r=2, v = column_count$Column, repeats.allowed=T)
df_comb <- data.frame(comb_grid,count=c(NA)) %>% select(Column_1 = X1, Column_2 = X2, count) 
threshold <- 0.4
df_corr_sample <- corr_func(threshold, sample)

for (i in 1:nrow(df_comb)) {
    x <- df_comb$Column_1[[i]]
    y <- df_comb$Column_2[[i]]
    df_corr_sample %>%
    filter((str_detect(rowname, x) & str_detect(variable ,y)) | (str_detect(rowname, y) & str_detect(variable,x))) %>%
    nrow() -> df_comb$count[[i]]
    }
sum(df_comb$count) == nrow(df_corr_sample) # check if all combinations have been counted 
## [1] TRUE
n_column <- n/4 # samples per col
n_equal <- (n_column**2 - n_column)/2 # all possible combinations
n_unequal <- n_column**2 
df_comb_prop <- df_comb %>% mutate(Combination = paste0(Column_1, " x ", Column_2),
                                   Proportion = case_when(Column_1 != Column_2 ~ count/n_unequal, 
                                                          TRUE ~ count/n_equal
                                                          )
                   )
df_comb_prop %>% ggplot(aes(fct_reorder(Combination, Proportion), Proportion)) +
                       geom_point() +
                       scale_y_continuous(labels = scales::percent) +
                       coord_flip() +
                       labs(x = "Combination of column types",
                           y = "Proportion of correlated features with r >= 0.4")

Not surprisingly, the highest percentage of features with a moderate correlation (> 0.4) is contributed by identical column types. This means that features originating from the same column type are more likely to correlate. This is possibly caused by adjacent elution peaks during LC, which produce adjacent features that could represent identical or nearly identical compounds.
The lowest correlation of features was found between features of the C8 column and the HILIC column in positive mode. The HILIC setup in positive mode was used to capture polar metabolites (e.g. amino acids), whereas the C8 column was used for polar and non-polar lipids.

Variability

Some features may have zero or near-zero variance across the samples, and hence they are likely non-informative. Here, the tidymodels preprocessing functions are used with the default values for near-zero variance. By default, a feature is classified as having near-zero variance if the percentage of unique values in the samples is less than 10% and the frequency ratio, i.e. the ratio of the frequency of the most common value to the frequency of the second most common value, is greater than 19 (95/5).
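
To make the two criteria concrete, the following toy example computes them by hand for a single hypothetical feature (96 identical values and 4 distinct ones), which would be flagged because the frequency ratio exceeds 19 and fewer than 10% of the values are unique:

x <- c(rep(0, 96), rnorm(4))                        # hypothetical near-zero variance feature
freq <- sort(table(x), decreasing = TRUE)
freq_ratio <- freq[[1]] / freq[[2]]                 # most common vs. second most common value
pct_unique <- 100 * length(unique(x)) / length(x)   # percentage of unique values
c(freq_ratio = freq_ratio, pct_unique = pct_unique)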

nzv <- recipe(Diagnosis ~.,data_combined) %>%
step_nzv(all_predictors()) %>%
prep()

nzv_obj <- bake(nzv, new_data = NULL)
nzv_cols <- names(data_combined)[!(names(data_combined) %in% names(nzv_obj))]
nzv_cols
## [1] "HILIC.pos_Cluster_1433"
data_combined %>% select(HILIC.pos_Cluster_1433) %>% unique()
## # A tibble: 14 × 1
##    HILIC.pos_Cluster_1433
##                     <dbl>
##  1                   0   
##  2                 228.  
##  3                   4.58
##  4                 455.  
##  5                 673.  
##  6                 774.  
##  7                1505.  
##  8                   6.51
##  9                  10.5 
## 10                   6.69
## 11                 716.  
## 12                 511.  
## 13                1817.  
## 14                 266.

The only feature that showed near-zero variance was the HILIC.pos_Cluster_1433 column. There are still some unique values in the column, which might provide information for the predictive models. Therefore, the column will be retained.

Dimensionality Reduction

Dimensionality reduction (DR) is frequently applied during the analysis of high-dimensional data. Because of “the curse of dimensionality,” the volume of the data increases so fast with additional dimensions (variables) that the data becomes sparse. This phenomenon can result in a loss of accuracy of ML models. DR can be beneficial for the majority of modern biological datasets, in which the data often contain measurements on uninformative or redundant variables. For example, redundant variables from overlapping peaks or compounds found in more than one LC-MS setup.

DR can be viewed as a method for latent variable extraction and is also frequently used for data compression, exploration, visualization, and data pre-processing for ML. There is no best technique for dimensionality reduction. Instead, the choice depends on the nature of the data and the intended purpose. In the case of ML, the best DR method could be chosen on the basis of the performance of the resulting model.

There are a multitude of dimensionality reduction algorithms which can reduce the dimensions and allow for visualization in two-dimensional (2D) or three-dimensional (3D) space. These techniques can be divided into linear and non-linear transformations. The most common linear transformation is principal component analysis (PCA). PCA can also be extended with the kernel trick to kernel PCA (kPCA), which allows non-linearly separable data to be separated in a higher-dimensional space. Both methods allow the learnt mapping function to be applied to unseen test data. Examples of non-linear transformations are t-distributed stochastic neighbour embedding (t-SNE) and uniform manifold approximation and projection (UMAP).

t-Distributed Stochastic Neighbor Embedding

t-SNE is used for the visualization of high-dimensional data, which is projected into a low-dimensional space. Thereby, similar objects are modelled by nearby points and dissimilar objects by distant points with high probability. The visual appearance is influenced by hyperparameters, of which perplexity is the most important. Perplexity is comparable to a nearest-neighbours hyperparameter. Lower values tend to spread out the projected data, and higher values tend to preserve the global structure.

Larger and denser data sets require a higher perplexity, which in turn requires longer computation time. Typical values range from 5 to 50. t-SNE learns a non-parametric mapping and not an explicit function. Therefore, it is not possible to directly embed test data into an existing map, which is crucial for supervised machine learning. However, it would be possible to fit a multivariate regression model between the original data and the embedded dimensions (see the sketch below). Thus, t-SNE is used for visualization purposes rather than as pre-processing for ML.
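
A rough sketch of this workaround (not pursued here) would be to learn one regression model per embedded dimension on the training data and use it to place unseen samples on the map; lm below is only a placeholder for any regressor:

# not run: approximate the t-SNE mapping with one regression model per dimension
tsne_fit <- Rtsne(data_combined_train, dims = 2, perplexity = 30)
emb <- as.data.frame(tsne_fit$Y)                    # embedded coordinates of the training data
map_dim1 <- lm(emb$V1 ~ ., data = data_combined_train)
map_dim2 <- lm(emb$V2 ~ ., data = data_combined_train)
# predict(map_dim1, newdata = new_samples) would then project unseen samples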

Here, different values of the perplexity parameter are used and visualized in two-dimensional space. The model was trained with the whole data set, including the discovery and validation groups.

data_tsne <- tibble(perplexity = c(5, 10, 15, 30, 45))
data_tsne <- data_tsne %>% mutate(TSNE = map(perplexity, ~Rtsne(data_combined_train, dims=2, PCA=F, perplexity=.x, max_iter=2000)))

for (i in 1:nrow(data_tsne)){
  x = data_tsne$perplexity[[i]]
assign(paste0("tsne_df_tune_", x),
       tibble(
              X = data_tsne$TSNE[[i]]$Y[,1],
              Y = data_tsne$TSNE[[i]]$Y[,2],
              #Z = data_tsne$TSNE[[i]]$Y[,3],
              Perplexity = data_tsne$perplexity[[i]],
              Diagnosis = data_combined$Diagnosis,
              Cohort = data_combined$Cohort,
              ID = data_combined$PatientID
             )
      )
}
list_tsne_tune <- ls(pattern = "tsne_df_tune_")
tsne_combined <- bind_rows(mget(ls(pattern = "tsne_df_tune_")))

tsne_combined %>% 
plot_ly(x = ~X, y = ~Y, frame = ~Perplexity, color = ~Diagnosis, symbol = ~Cohort,
        colors = "Set1", type="scatter", mode="markers", ids=~ID,  text = ~paste0("Sample: ", ID, "<br>",
                                                                                  "Diagnosis: ", Diagnosis),
        hoverinfo = "text",
        marker = list(sizemode = "diameter")) %>%

layout(xaxis = list(showgrid = F,
                    zeroline = F
                   ),
       yaxis = list(showgrid = F,
                    zeroline = F
                   )
       ) %>%
animation_opts(
    redraw=FALSE,
    frame = 2000,
    transition = 500) %>%

animation_slider(
     currentvalue = list(
       prefix = "Perplexity: ", 
       font = list(color = "red")))

Note: The interactive visualization is not supported in this workspace yet. The interactive version can be seen here.

The t-SNE dimension reduction has been performed with different values of the perplexity parameter. The coordinate values are in different ranges for the different perplexity settings and can be autoscaled with the control in the top-right corner. The individuals (samples) are divided into the discovery and validation groups and the three different cohorts. The objective of the approach is to segment the patients with different diagnoses (colors).

Here, all settings of the perplexity parameter segmented the patients into three clusters to some extent. However, in all setups patients with different diagnoses are found in the same cluster. This intertwining hints at possible effects on the subsequent analysis steps. First, some patients with dissimilar diagnoses could have very similar metabolic characteristics. Moreover, the exact differentiation of patients by ML methods and differential expression analysis could be impaired. Based on visual assessment, a perplexity value of 30 yielded the best separation of patients. Patients with CD and UC seem to be more intertwined and hence could possess similar disease characteristics. The cluster of the control group is penetrated by patients with CD and UC, but no control patients are found in the clusters of the CD and UC patients, except one control patient in the CD cluster.

The largest drawback of t-SNE is that it cannot be used as a preprocessing method for supervised learning, since no mapping function is calculated that could be applied to the test or validation set. Therefore, other dimension reduction methods are used in the subsequent steps.

Principal Component Analysis and Uniform Manifold Approximation and Projection

PCA is a linear transformation that transforms a large set of variables into a smaller set (principal components) while retaining as much information (variability) as possible. After standardization, the eigenvalues and eigenvectors are determined from the covariance matrix in order to derive the principal components (PCs). These are new variables constructed as linear combinations so that as much variability as possible is squeezed into the first PC (PC1). The second PC (PC2) holds the second largest amount of variability and is orthogonal (uncorrelated) to PC1.

PCA outputs the same number of PCs as input variables. For dimension reduction, a cutoff, e.g. 90% of the explained variance, can be chosen, and as many PCs are retained as are needed to reach it. This also applies to kPCA, which is better suited for non-linear DR. In contrast to PCA, the data are transformed into a higher-dimensional space before the covariance matrix is calculated. Using the kernel trick, much computation is avoided, since the coordinates of the data in that space don’t need to be calculated explicitly.
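
As a sketch of this cutoff approach in tidymodels (not the tuning strategy used below), step_pca() can retain as many components as are needed to reach a target fraction of the total variance instead of a fixed number:

# sketch: keep as many PCs as needed to cover 90% of the variance
recipe_pca_var <- recipe(Diagnosis ~ ., data = train_df) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_pca(all_predictors(), threshold = 0.9)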

Similar to t-SNE, UMAP employs graph layout algorithms. A high-dimensional graph of the data is constructed, and then a low-dimensional graph is fitted to be as structurally similar as possible. Two common parameters are n_neighbors and min_dist. They control the balance between local and global structure. Low values of n_neighbors lead to a more local structure, whereas a higher value represents a more big-picture structure while losing finer details. min_dist controls how tightly the points are packed together. Lower values bring points closer together and allow for more clustering. In contrast, larger values will preserve more of the broader topological structure. Moreover, the algorithm allows for both unsupervised and supervised DR. Here, the latter will be used.

In contrast to t-SNE, the hyperparameters were selected based on the best area under the receiver operating characteristic curve (AUROC) score of a cross-validated ML classification model. Moreover, the optimal number of dimensions for classification was also determined by cross-validation in a separate script using tidymodels. Three different kinds of ML models were used for each DR algorithm: KNN, SVM, and a single-layer neural network. The UMAP and kPCA hyperparameters were tuned with a single-layer neural network, with AUROC scores of 0.80 and 0.87, respectively. The number of principal components in PCA was selected based on the radial basis function SVM (RBF-SVM) classifier with an AUROC score of 0.88.

set.seed(42)
recipe_umap <- recipe(Diagnosis~., data = train_df) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_umap(all_predictors(), outcome = vars(Diagnosis), num_comp = 6, min_dist = 0.94396076, learn_rate = 6.193164e-03)

set.seed(42)
recipe_kpca <- recipe(Diagnosis~., data = train_df) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_kpca_rbf(all_predictors(), num_comp = 61, sigma = 1.782724e-06)

set.seed(42)
recipe_pca <- recipe(Diagnosis~., data = train_df) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_pca(all_predictors(), num_comp = 60)

train_umap <- bake(recipe_umap %>% prep(), train_df)
## as(<dgCMatrix>, "dgTMatrix") is deprecated since Matrix 1.5-0; do as(., "TsparseMatrix") instead
train_kpca <- bake(recipe_kpca %>% prep(), train_df)
train_pca <- bake(recipe_pca %>% prep(), train_df)

test_umap <- bake(recipe_umap %>% prep(), test_df)
test_kpca <- bake(recipe_kpca %>% prep(), test_df)
test_pca <- bake(recipe_pca %>% prep(), test_df)
train_1 <- train_umap %>% select(UMAP = UMAP1, Diagnosis) %>%
add_column(PC = train_pca$PC01) %>%
add_column(kPC = train_kpca$kPC01)

train_2 <- train_umap %>% select(UMAP2) %>%
add_column(PatientID = data_prism_formatted$PatientID) %>%
add_column(PC = train_pca$PC02) %>%
add_column(kPC = train_kpca$kPC02)

test_1 <- test_umap %>% select(UMAP = UMAP1) %>%
add_column(Diagnosis = test_df_truth$Diagnosis) %>%
add_column(kPC = test_kpca$kPC01) %>%
add_column(PC = test_pca$PC01)

test_2 <- test_umap %>% select(UMAP = UMAP2) %>%
add_column(PatientID = data_validate_formatted$PatientID) %>%
add_column(kPC = test_kpca$kPC02) %>%
add_column(PC = test_pca$PC02)
train_combined <- train_1 %>% pivot_longer(!Diagnosis, names_to = "Preprocessing", values_to = "Dim1") %>%
add_column(train_2 %>% pivot_longer(!PatientID,names_to = "Preprocessing", values_to = "Dim2") %>%
           select(Dim2, PatientID)) %>%
mutate(Diagnosis = factor(Diagnosis, levels = c(1,2,3), labels = c("Control","CD", "UC")))

test_combined <- test_1 %>% pivot_longer(!Diagnosis, names_to = "Preprocessing", values_to = "Dim1") %>%
add_column(test_2 %>% pivot_longer(!PatientID,names_to = "Preprocessing", values_to = "Dim2") %>%
           select(Dim2, PatientID)) %>%
mutate(Diagnosis = factor(Diagnosis, levels = c(1,2,3), labels = c("Control","CD", "UC")))

combined <- train_combined %>% bind_rows(test_combined, .id = c("Group")) %>%
mutate(Group = case_when(Group==1 ~ "Discovery",
                         TRUE ~ "Validate"))
head(combined)
## # A tibble: 6 × 6
##   Group     Diagnosis Preprocessing    Dim1    Dim2 PatientID 
##   <chr>     <fct>     <chr>           <dbl>   <dbl> <fct>     
## 1 Discovery CD        UMAP           -1.40   -1.39  PRISM|7122
## 2 Discovery CD        PC            -18.2   -16.7   PRISM|7122
## 3 Discovery CD        kPC             0.414   0.386 PRISM|7122
## 4 Discovery CD        UMAP           -0.528  -0.130 PRISM|7147
## 5 Discovery CD        PC            -15.0    26.0   PRISM|7147
## 6 Discovery CD        kPC             0.344  -0.583 PRISM|7147
combined %>% 
plot_ly(x = ~Dim1, y = ~Dim2, frame = ~Preprocessing, color = ~Diagnosis, symbol = ~Group,
        colors = "Set1", type="scatter", mode="markers") %>%

layout(xaxis = list(showgrid = F,
                    zeroline = F,
                    autorange = T
                   ),
       yaxis = list(showgrid = F,
                    zeroline = F,
                    autorange = T
                   )
       ) %>%
animation_opts(
    redraw=FALSE,
    frame = 2000,
    transition = 500) %>%

animation_slider(
     currentvalue = list(
       prefix = "Preprocessing: ", 
       font = list(color = "red")))

In contrast to t-SNE, the data was split into the discovery group for training the mapping functions and the validation group to which the mapping functions were applied. Both groups are visible in the 2D scatterplot with different symbols. Since more than two dimensions were calculated for the methods used here, an assessment based on visualization alone is less useful. Instead, the meaningfulness of the DR methods will be discussed later in combination with the diagnosis classification. The final conclusion is made when the DR method is applied to the validation group and judged by metrics like the AUROC score.

Here, PCA and kPCA show only a moderate amount of separation in the first two dimensions. Note that there are 58 or 59 additional dimensions, respectively, so a good separation in the first two dimensions is not required. By contrast, a sharper separation of the patient groups is visible using the first two dimensions of UMAP. Similar to t-SNE, the UC and CD patients are more intertwined with each other than with the control group. However, the CD patients tend to mix with the healthy controls more than the UC patients do.

In the following chapter, it will be assessed whether the DR methods can increase the performance of ML models.

ML Disease Classification

Herein, several ML algorithms are used for the prediction of the patient’s diagnosis. First, all three potential diagnoses are considered (multiclass classification). Then, the UC and CD patients are grouped together as IBD patients for the binary classification of sick patients (IBD) and healthy controls.

Biological data are often not only high-dimensional, but also involve non-linear relationships, for example the non-linear Michaelis-Menten model for enzyme kinetics. Therefore, it can be assumed that metabolomic data also incorporate non-linear relationships. This would impair the performance of strictly linear classifiers like partial least-squares discriminant analysis (PLS-DA), which is often used by the metabolomics community. Therefore, non-linear classifiers like random forest (RF), support vector machines (SVM) with the kernel trick, the k-nearest neighbors algorithm (k-NN), and gradient boosted trees (XGBoost) will be used in comparison to the linear algorithms PLS-DA and multinomial logistic regression. The latter is a subtype of generalized linear models (GLM) for multiclass prediction with a logit link function.

Here, the basis for the machine learning is the tidymodels package. In particular, the list-column workflow (LCW) will be used, in which the map function of the purrr package and the nest/unnest functions of the tidyr package are integral components. All objects, including data frames, parameters, and the different stages of the workflow, are stored in one nested data frame, and operations like training and testing can easily be done using the map function. This workflow has many advantages, such as keeping all data in one place, reducing code (DRY principle), and a tidy structure for easier subsequent manipulation. An interactive course on the LCW can be found on DataCamp and an example with further literature recommendations here.
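
As a toy illustration of the pattern (independent of this data set), the classic nest/map example on mtcars shows how one model per group lives inside a single nested data frame:

# toy example of the list-column workflow: one fitted model per row
mtcars %>%
  group_by(cyl) %>%
  nest() %>%                                              # data per group in a list-column
  mutate(model = map(data, ~ lm(mpg ~ wt, data = .x)),    # fit a model for every row
         r_squared = map_dbl(model, ~ summary(.x)$r.squared))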

Multiclass prediction

The training of the classifiers was performed with five-fold cross-validation using the PRISM discovery group. Hyperparameter tuning was done for all models in a separate notebook, and the best hyperparameters based on the AUROC score were chosen. Using the recipe workflow, pre-processing such as dummy coding or DR is easily integrated. DR increased the performance on the validation set for only one model, the k-NN classifier. Hyperparameters were also tuned for a multilayer perceptron (MLP) with one hidden layer and prior dimension reduction. The cross-validated AUROC score was 0.87, but the AUROC score for the validation set was 0.5. This means that the model generalizes poorly to unseen data. Moreover, training an MLP with high-dimensional data is not a feasible process at scale, since the large input layer severely increases the number of weights of the network. Here, DR was necessary for tuning the MLP so that training remained possible on a standard notebook or cloud workspace.

For future model development, workflowsets from the tidymodels universe could be used for screening different combinations of pre-processing methods (e.g. DR) and model specifications (e.g. random forest or k-NN) with their specific hyperparameters, as sketched below.
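
A hypothetical sketch of such a screening with workflowsets is shown here; the recipes and model specifications referenced below are defined in the following chunks, and the tuning call is left commented out:

library(workflowsets)  # in case it is not already attached by tidymodels
wf_set <- workflow_set(
  preproc = list(plain = recipe_rf, pca = recipe_pca_knn),   # preprocessing candidates
  models  = list(rf = model_RF, knn = model_KNN),            # model candidates
  cross   = TRUE                                             # all combinations
)
# wf_set %>% workflow_map("tune_grid", resamples = folds, grid = 10)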

First, the tibble with the names of the models and the train and test data is produced.

data_ml <- tibble(classifier = c("RF", "PLSDA", "KNN", "SVM", "XGB", "GLM"),
                  train = list(train_df,train_df,train_df,train_df,train_df,train_df),
                  test = list(test_df,test_df,test_df,test_df,test_df,test_df))

Next, the recipes and models are specified with the hyperparameters found by grid search using 5-fold cross-validation. The grid parameters were generated with the grid_max_entropy function of the dials package. This function covers the parameter space so that every combination is about equally distant from the others.
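
As an illustration, such a space-filling grid can be generated as follows (the parameter ranges are hypothetical; the actual tuning was done in a separate notebook):

# sketch: max-entropy grid for two random forest hyperparameters
grid_rf <- grid_max_entropy(
  trees(range = c(100, 1000)),
  min_n(range = c(2, 20)),
  size = 20
)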

recipe_rf <- recipe(Diagnosis ~ ., data=train_df)

recipe_pls_svm <- recipe(Diagnosis ~ ., data=train_df) %>% 
  step_normalize(all_numeric(),-all_outcomes()) %>%
  step_dummy(all_nominal(), -all_outcomes())

recipe_pca_knn <- recipe(Diagnosis ~ ., data=train_df) %>% 
  step_normalize(all_numeric(),-all_outcomes()) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_pca(all_predictors(), num_comp = 19) 

recipe_xgb_glm <- recipe(Diagnosis ~ ., data=train_df) %>% 
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_mutate_at(all_predictors(), fn = ~ as.numeric(.)) # required by XGBoost 

model_PLSDA <- pls(num_comp=3) %>%
  set_engine("mixOmics") %>%
  set_mode("classification")

model_RF <- rand_forest(trees = 154,
                        min_n = 8) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")

model_KNN <- nearest_neighbor(neighbors = 88) %>%
  set_engine("kknn") %>%
  set_mode("classification")

model_SVM <- svm_rbf(cost=27.695185,
                     rbf_sigma = 3.779591e-06,
                     margin = 0.01473722) %>%
  set_engine("kernlab") %>%
  set_mode("classification")

model_XGB <- boost_tree(trees = 426,
                       min_n = 5,
                       tree_depth = 12,
                       learn_rate = 0.090994590,
                       loss_reduction = 1.094324e-10) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

model_GLM <- multinom_reg(penalty = 0.67959909,
                          mixture = 0.0592261) %>%
  set_engine("glmnet") %>%
  set_mode("classification")

Now, the workflows are constructed in a separate tibble.

df_wflow <- tibble(classifier=data_ml$classifier, wflow=c(),
                   recipe=list(recipe_rf, recipe_pls_svm, recipe_pca_knn, recipe_pls_svm, recipe_xgb_glm, recipe_xgb_glm))
for (i in df_wflow$classifier) {
  x <- paste("model_",i, sep="")
  j <- which(df_wflow$classifier == i)
  df_wflow$wflow[[j]] <- workflow() %>%
    add_model(eval(parse(text = x))) %>%
    add_recipe(df_wflow$recipe[[j]])
}
## Warning: Unknown or uninitialised column: `wflow`.

Next, the LCW is used for the first time to train the models and subsequently predict the diagnosis for the validation group.

set.seed(42)
tic()
data_ml_fit <- data_ml %>% add_column(workflow = df_wflow$wflow) %>%
  mutate(fitted = map2(workflow, train, ~fit(.x,.y))) %>%
  mutate(pred_prob = map2(fitted, test, ~predict(.x,.y, type="prob"))) %>%
  mutate(pred_class = map2(fitted, test, ~predict(.x,.y, type="class")))
toc()
## 6264.639 sec elapsed

The models are compared by several metrics: the AUROC score (here ROCAUC), the precision-recall AUC (PRAUC), balanced accuracy, sensitivity, and specificity. The ROCAUC and PRAUC scores are calculated from the class probabilities at different decision thresholds. The other metrics are derived from the confusion matrix. In case of class imbalance, the common metric accuracy should be avoided, since it can be very misleading in terms of model performance. The sensitivity (or recall, true positive rate) and specificity (true negative rate) are not affected by the data distribution and are therefore robust to class imbalances. The balanced accuracy is the average of sensitivity and specificity and a better choice for imbalanced data sets. The AUROC score is also sensitive to class imbalances. In contrast, the PRAUC score is robust to class imbalances and the better choice when the positive class is of interest.

For the multi-class case, the metrics are calculated by averaging multiple binary cases. Therefore, each class is compared as positive (one) against all other classes as negatives (rest). This method is also known as one-vs-rest classification.
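
A small toy example illustrates the macro averaging: the sensitivity is computed for each class against all others and the three values are averaged (here the per-class sensitivities are 1/2 for CD, 1/2 for Control, and 1 for UC):

# toy illustration of one-vs-rest (macro) averaging with yardstick
toy_truth <- factor(c("CD", "CD", "UC", "UC", "Control", "Control"))
toy_pred  <- factor(c("CD", "UC", "UC", "UC", "Control", "CD"), levels = levels(toy_truth))
sens_vec(toy_truth, toy_pred, estimator = "macro")  # equals mean(c(1/2, 1/2, 1))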

calc_metrics <- function(data_ml_fit) {
df <- tibble(Classifier=data_ml_fit$classifier, ROCAUC=c(NA),PRAUC=c(NA),`Balanced Accuracy`=c(NA), Sensitivity=c(NA), Specificity=c(NA))
for (i in 1:nrow(data_ml_fit)) { 
df$ROCAUC[[i]] <- data_ml_fit$pred_prob[[i]] %>% roc_auc(estimate=c(names(data_ml_fit$pred_prob[[i]])), truth= test_df_truth$Diagnosis,estimator="hand_till") %>% pull(.estimate) %>% round(2)

df$PRAUC[[i]] <- data_ml_fit$pred_prob[[i]] %>% pr_auc(estimate=c(names(data_ml_fit$pred_prob[[i]])), truth= test_df_truth$Diagnosis,estimator="macro") %>% pull(.estimate) %>% round(2)
    
df$`Balanced Accuracy`[[i]] <- data_ml_fit$pred_class[[i]] %>% bal_accuracy(estimate=data_ml_fit$pred_class[[i]] %>% pull(),
                                                                            truth=test_df_truth$Diagnosis,
                                                                            estimator="macro") %>% pull(.estimate) %>% round(2) # average of sensitivity and specificity

df$Sensitivity[[i]] <- data_ml_fit$pred_class[[i]] %>% sens(estimate=data_ml_fit$pred_class[[i]] %>% pull(), truth=test_df_truth$Diagnosis) %>% pull(.estimate) %>% round(2)

df$Specificity[[i]] <- data_ml_fit$pred_class[[i]] %>% spec(estimate=data_ml_fit$pred_class[[i]] %>% pull(), truth=test_df_truth$Diagnosis) %>% pull(.estimate) %>% round(2)
}
return(df %>% unnest(cols = c(ROCAUC, PRAUC, `Balanced Accuracy`, Sensitivity, Specificity)))
}
calc_metrics(data_ml_fit)
## # A tibble: 6 × 6
##   Classifier ROCAUC PRAUC `Balanced Accuracy` Sensitivity Specificity
##   <chr>       <dbl> <dbl>               <dbl>       <dbl>       <dbl>
## 1 RF           0.9   0.84                0.83        0.77        0.89
## 2 PLSDA        0.89  0.81                0.79        0.72        0.86
## 3 KNN          0.87  0.81                0.61        0.48        0.74
## 4 SVM          0.84  0.76                0.81        0.75        0.88
## 5 XGB          0.9   0.85                0.82        0.76        0.88
## 6 GLM          0.85  0.73                0.79        0.72        0.86

In contrast to the binary case, in multi-class prediction the class of interest doesn’t need to be specified, because the average will be used for the metrics.

The random forest and the gradient boosted trees are the best models based on all metrics. Based on the ROCAUC and PRAUC scores, the PLSDA and KNN classifiers follow close behind. But in terms of balanced accuracy, sensitivity, and specificity both perform worse, particularly the KNN. By contrast, the GLM and the SVM model performed better than the KNN and similar to the PLSDA model based on balanced accuracy, sensitivity, and specificity.

Next, the output of the classifiers will be used to build an ensemble model with the stacks package. The top three models are selected to fit the stacked model. This usually increases the model performance a bit.

set.seed(42)
resamples <- vfold_cv(train_df, v=5)
ctrl_res <- control_stack_resamples()
tic()
data_ml_fit_stack <- data_ml_fit %>% filter(classifier %in% c("RF", "PLSDA", "XGB")) %>%
  mutate(resamples = list(resamples, resamples, resamples)) %>%
  mutate(stacks = map2(workflow, resamples,  ~fit_resamples(.x,.y, control=ctrl_res)))
## ℹ The workflow being saved contains a recipe, which is 13.25 Mb in
## ℹ memory. If this was not intentional, please set the control setting
## ℹ `save_workflow = FALSE`.
## ℹ The workflow being saved contains a recipe, which is 13.28 Mb in
## ℹ memory. If this was not intentional, please set the control setting
## ℹ `save_workflow = FALSE`.
## ℹ The workflow being saved contains a recipe, which is 13.28 Mb in
## ℹ memory. If this was not intentional, please set the control setting
## ℹ `save_workflow = FALSE`.
toc()
## 326.517 sec elapsed
mdl_stack_fit <- stacks() %>%
  add_candidates(data_ml_fit_stack$stacks[[1]]) %>%
  add_candidates(data_ml_fit_stack$stacks[[2]]) %>%
  add_candidates(data_ml_fit_stack$stacks[[3]]) %>%
  blend_predictions(non_negative = TRUE) %>%
  fit_members() 
## The inputted `name` argument cannot prefix a valid column name. The
## • data stack will use "data_ml_fit_stack.stacks..1.." rather than
## • "data_ml_fit_stack$stacks[[1]]" in constructing candidate names.
## The inputted `name` argument cannot prefix a valid column name. The
## • data stack will use "data_ml_fit_stack.stacks..2.." rather than
## • "data_ml_fit_stack$stacks[[2]]" in constructing candidate names.
## The inputted `name` argument cannot prefix a valid column name. The
## • data stack will use "data_ml_fit_stack.stacks..3.." rather than
## • "data_ml_fit_stack$stacks[[3]]" in constructing candidate names.
data_stack <- tibble(classifier = "Stack")
data_stack$pred_class[[1]] <- predict(mdl_stack_fit, new_data=test_df, type="class")
## Warning: Unknown or uninitialised column: `pred_class`.
data_stack_prob <- predict(mdl_stack_fit,test_df, type="prob")
data_stack$pred_prob[[1]] <- tibble(pred_prob=list(data_stack_prob)) %>% unnest(pred_prob)
## Warning: Unknown or uninitialised column: `pred_prob`.
data_ml_fit <- bind_rows(data_ml_fit, data_stack)
calc_metrics(data_ml_fit)
## # A tibble: 7 × 6
##   Classifier ROCAUC PRAUC `Balanced Accuracy` Sensitivity Specificity
##   <chr>       <dbl> <dbl>               <dbl>       <dbl>       <dbl>
## 1 RF           0.9   0.84                0.83        0.77        0.89
## 2 PLSDA        0.89  0.81                0.79        0.72        0.86
## 3 KNN          0.87  0.81                0.61        0.48        0.74
## 4 SVM          0.84  0.76                0.81        0.75        0.88
## 5 XGB          0.9   0.85                0.82        0.76        0.88
## 6 GLM          0.85  0.73                0.79        0.72        0.86
## 7 Stack        0.88  0.81                0.85        0.8         0.9

The stacked model performed best in terms of balanced accuracy, sensitivity, and specificity. However, it is slightly worse than the RF and XGBoost models based on the ROCAUC score. An even larger decrease in performance is observed in the PRAUC score of 0.81.

Next, the ROC curve will be plotted for every class against all other classes.

df <- tibble(Classifier=data_ml_fit$classifier)
for (i in 1:nrow(data_ml_fit)) { 
df$AUC[[i]] <- data_ml_fit$pred_prob[[i]] %>% 
  roc_curve(estimate=c(names(data_ml_fit$pred_prob[[i]])), truth=test_df_truth$Diagnosis)
}
## Warning: Unknown or uninitialised column: `AUC`.
bind_rows_func <- function(df){
df_auc <- bind_rows(RF = df$AUC[[1]],
                    PLSDA = df$AUC[[2]],
                    KNN = df$AUC[[3]],
                    SVM = df$AUC[[4]],
                    XGB = df$AUC[[5]],
                    GLM = df$AUC[[6]],
                    Stack = df$AUC[[7]],
                    .id= "Model")
return(df_auc)
}
bind_rows_func(df) %>% group_by(Model) %>% mutate(.level = case_when(.level == 1 ~ "Control",
                                                         .level == 2 ~ "CD",
                                                         TRUE ~ "UC")) %>%
  mutate(.level = fct_relevel(as.factor(.level), c("Control", "CD", "UC")),
         Model = as.factor(Model),
         Model = fct_relevel(Model, data_ml_fit$classifier)) %>%
  
  ggplot(aes(x = 1 - specificity, y = sensitivity, color=Model,group=Model)) +
  geom_path() +
  geom_abline(lty = 3) +
  coord_equal() +
  facet_wrap(~.level) +
  theme_bw() 

The best model would have a sensitivity of 1 and specificity (true negative rate) of 1. The false-positive rate is 1 - specificity and is zero for a perfect classifier. Thus, the ROC curve of a perfect classifier would start at (0,0) and go straight to (0,1) and then straight to (1,1). This results in an area under the curve of 1 (AUROC score).

Here, the ROC curve of the control class is the best: a high sensitivity is achieved at a low false-positive rate for all models. The ROC curve for the UC class is slightly worse. In contrast, the CD curve is clearly worse. This means that, at a low false-positive rate, control and UC patients are more often correctly assigned to their condition than CD patients.

However, the sick patients are of interest. Therefore, we would like to increase the sensitivity for the CD and UC groups. Also, the specificity is less important, since false positives are likely to be detected in follow-up checks. Perhaps the sensitive detection of CD and UC patients could be improved when the UC and CD diagnoses are combined. Thus, further models are created for the binary case of sick patients against healthy controls.

Next, the confusion matrix of the best-performing model, the RF, is plotted. This plot visualizes the correct and incorrect classifications.

df_conf <- tibble(pred = data_ml_fit$pred_class[[1]] %>% pull(), truth = test_df_truth$Diagnosis) %>%
  mutate(pred = case_when(pred == 1 ~ "Control",
                          pred == 2 ~ "CD",
                          TRUE ~ "UC"),
         truth = case_when(truth == 1 ~ "Control",
                           truth == 2 ~ "CD",
                           TRUE ~ "UC"),
         truth = fct_relevel(as.factor(truth), c("Control", "CD", "UC")),
         pred = fct_relevel(as.factor(pred), c("Control", "CD", "UC")))
conf_mat(df_conf, truth=pred, estimate = truth, dnn = c("Truth","Prediction")) %>% autoplot(type="heatmap") # Predicted and test labels were switched to transpose the matrix so that the true labels are the rows

accuracy(df_conf, truth=truth, estimate=pred)
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy multiclass     0.769

The highest sensitivity was achieved for the control diagnosis (0.86). The CD and UC patients were detected with a sensitivity of 0.75 and 0.70, respectively. All misclassified patients of the UC group were predicted as CD patients, whereas the misclassified patients of the CD and control groups are distributed more evenly across the classes. The difficult separation of UC and CD patients was to some extent anticipated from the dimension reduction plots.

The overall accuracy of the RF is 0.77, which is identical to the performance of the RF model published by Franzosa et al. In contrast to the model presented here, their model misclassified UC patients more evenly across all groups, and all misclassified CD patients were predicted as control patients.

In a comparative study by Mendez et al., the data set was also used to build different classifiers. However, they used only the data from the C18 negative-mode column for the binary case of CD vs. UC patients, and their model achieved an AUROC score of only 0.73 on the validation set. Even though they used a different data split and a different subset of the data, this also highlights the difficulty of discriminating UC and CD patients from each other.

Discrimination between healthy and IBD patients

For a better prediction of sick patients, the binary case of sick patients versus healthy controls will be evaluated. For the sake of simplicity, the predicted probabilities and predicted class labels of the UC and CD classes are combined into an IBD class instead of training and tuning the models again.

data_ml_fit_ibd <- data_ml_fit
for (i in 1:nrow(data_ml_fit_ibd)) {
  data_ml_fit_ibd$pred_prob[[i]] <- data_ml_fit$pred_prob[[i]] %>%
    mutate(.pred_IBD = .pred_2 + .pred_3) %>%
    select(.pred_Control = .pred_1, -c(.pred_2, .pred_3)) %>%
    add_column(test_df_truth %>% mutate(Diagnosis = case_when(Diagnosis == "2" ~ "IBD",
                                                              Diagnosis == "3" ~ "IBD",
                                                              TRUE ~ "Control"))) %>%
    mutate(Diagnosis = as.factor(Diagnosis)) %>%
    select(truth = Diagnosis, everything()) %>% droplevels()
}
for (i in 1:nrow(data_ml_fit_ibd)) {
  data_ml_fit_ibd$pred_class[[i]] <- data_ml_fit$pred_class[[i]] %>%
    mutate(.pred_class = case_when(.pred_class == "1" ~ "Control",
                                   .pred_class == "3" ~ "IBD",
                                   TRUE ~ "IBD")) %>%
    mutate(.pred_class = as.factor(.pred_class)) %>%
    add_column(test_df_truth %>% mutate(Diagnosis = case_when(Diagnosis == "2" ~ "IBD",
                                                              Diagnosis == "3" ~ "IBD",
                                                              TRUE ~ "Control"))) %>%
    mutate(Diagnosis = as.factor(Diagnosis)) %>%
    select(truth = Diagnosis, everything())
}

Typically, in a binary classification problem the class of interest is called the positive class. Herein, the class of interest is IBD, the sick patients; therefore, IBD is the positive class.
Since sensitivity, specificity, and balanced accuracy are calculated from the confusion matrix, it is necessary to specify the order (levels) of the classes. This can be done by manipulating the data or by choosing the right event_level for the truth column in the yardstick functions, e.g. sens().

cat("Levels of predicted column:",levels(data_ml_fit_ibd$pred_class[[1]]$.pred_class),
    "\nLevels of truth column:",levels(data_ml_fit_ibd$pred_class[[1]]$truth))
## Levels of predicted column: Control IBD 
## Levels of truth column: Control IBD

Since the class of interest, IBD, is the second level, the event_level should be specified as “second”.
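
As a quick toy illustration (made-up data, not part of this analysis), the event_level argument decides which class sens() treats as the event:

toy <- tibble(truth = factor(c("Control", "IBD", "IBD", "IBD"), levels = c("Control", "IBD")),
              pred  = factor(c("Control", "IBD", "Control", "IBD"), levels = c("Control", "IBD")))
sens(toy, truth = truth, estimate = pred)                          # event is "Control" (first level), sensitivity = 1
sens(toy, truth = truth, estimate = pred, event_level = "second")  # event is "IBD", sensitivity = 2/3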

calc_metrics_ibd <- function(data_ml_fit_ibd) {
df <- tibble(Classifier = data_ml_fit_ibd$classifier)
for (i in 1:nrow(data_ml_fit_ibd)) { 
df$AUC[[i]] <- data_ml_fit_ibd$pred_prob[[i]] %>% roc_auc(estimate=.pred_Control, truth=truth) %>% pull(.estimate) %>% round(2)
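# note: the PRAUC below is taken from the original multiclass probabilities (macro-weighted), not from the merged Control/IBD probabilities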
df$PRAUC[[i]] <- data_ml_fit$pred_prob[[i]] %>% pr_auc(estimate=c(names(data_ml_fit$pred_prob[[i]])), truth= test_df_truth$Diagnosis,estimator="macro_weighted") %>% pull(.estimate) %>% round(2)
df$`Balanced Accuracy`[[i]] <- data_ml_fit_ibd$pred_class[[i]] %>% bal_accuracy(estimate=.pred_class, truth=truth, event_level = "second") %>% pull(.estimate) %>% round(2) # average of sensitivity and specificity

df$Sensitivity[[i]] <- data_ml_fit_ibd$pred_class[[i]] %>% sens(estimate = .pred_class, truth=truth, event_level = "second") %>% pull(.estimate) %>% round(2) # IBD is the event

df$Specificity[[i]] <- data_ml_fit_ibd$pred_class[[i]] %>% spec(estimate = .pred_class, truth=truth, event_level = "second") %>% pull(.estimate) %>% round(2)
}
return(df %>% unnest())
}
df_ibd_curve <- tibble(Classifier=data_ml_fit$classifier)
for (i in 1:nrow(data_ml_fit_ibd)) {
df_ibd_curve$AUC[[i]] <- data_ml_fit_ibd$pred_prob[[i]] %>% roc_curve(estimator=.pred_Control, truth=truth)
}
## Warning: Unknown or uninitialised column: `AUC`.
df_roc_curve_ibd <- bind_rows_func(df_ibd_curve) %>%
  mutate(Model = as.factor(Model)) %>%
  mutate(Model = fct_relevel(Model, data_ml_fit_ibd$classifier))

df_ibd_score <- tibble(Classifier=data_ml_fit_ibd$classifier)
for (i in 1:nrow(data_ml_fit_ibd)) { 
df_ibd_score$AUC[[i]] <- data_ml_fit_ibd$pred_prob[[i]] %>% roc_auc(estimate=.pred_Control, truth=truth)
}
## Warning: Unknown or uninitialised column: `AUC`.
df_roc_score <- bind_rows_func(df_ibd_score) %>%
  select(Model, AUC = .estimate) %>% mutate(AUC = round(AUC, 2))
calc_metrics_ibd(data_ml_fit_ibd)
## Warning: Unknown or uninitialised column: `AUC`.
## Warning: Unknown or uninitialised column: `PRAUC`.
## Warning: Unknown or uninitialised column: `Balanced Accuracy`.
## Warning: Unknown or uninitialised column: `Sensitivity`.
## Warning: Unknown or uninitialised column: `Specificity`.
## Warning: `cols` is now required when using unnest().
## Please use `cols = c(AUC, PRAUC, `Balanced Accuracy`, Sensitivity, Specificity)`
## # A tibble: 7 × 6
##   Classifier   AUC PRAUC `Balanced Accuracy` Sensitivity Specificity
##   <chr>      <dbl> <dbl>               <dbl>       <dbl>       <dbl>
## 1 RF          0.93  0.85                0.91        0.95        0.86
## 2 PLSDA       0.89  0.81                0.86        0.77        0.95
## 3 KNN         0.91  0.81                0.72        0.93        0.5 
## 4 SVM         0.88  0.76                0.86        0.77        0.95
## 5 XGB         0.91  0.85                0.84        0.91        0.77
## 6 GLM         0.88  0.74                0.84        0.91        0.77
## 7 Stack       0.93  0.82                0.9         0.88        0.91

The performance of all models increased in almost every metric for the binary case. The random forest is the best model based on the most important metrics, PRAUC and sensitivity (0.85 and 0.95, respectively). The KNN, XGB, and GLM also displayed a high sensitivity for the sick patients, but among them only the XGB scored high on the PRAUC (0.85). Based on the AUC score, the RF, KNN, XGB, and Stack performed best. The stacked model exceeded the specificity of the best model, the RF, but its sensitivity was lower.

truth <- data_ml_fit_ibd$pred_class[[1]]$truth
pred <- data_ml_fit_ibd$pred_class[[1]]$.pred_class
df_conf_ibd <- tibble(truth=truth, pred=pred)
conf_mat(df_conf_ibd, truth=pred, estimate=truth,dnn = c("Truth","Prediction")) %>% autoplot(type="heatmap") # Predicted and test labels were switched to transpose the matrix so that the true labels are the rows

The confusion matrix of the RF model clearly displays the low rate of misclassification for both classes. Almost all IBD patients were detected (sensitivity of 0.95), which is the most important purpose of the model.

df_geom <- data.frame(Model = data_ml_fit_ibd$classifier, Score = (paste0(" (AUC = ", df_roc_score$AUC, ")"))) %>%
  mutate(Model = as.factor(Model),
         Model = fct_relevel(Model, data_ml_fit_ibd$classifier))

df_roc_curve_ibd %>% 
  ggplot(aes(x = 1 - specificity, y = sensitivity, color=Model, group=Model)) +
  geom_path() +
  geom_abline(lty = 3) +
  coord_equal() +
  theme_bw() +
  theme(legend.position="none")+
  geom_text(data=df_geom, aes(x = .63, y = .075, label = Model), size= 3.5, position="stack",hjust=0, show.legend=FALSE)+
  geom_text(data=df_geom, aes(x = .76, y = .075, label = Score, group=Model, color=NULL),size= 3.5,position="stack", hjust=0, show.legend=FALSE)

Overall, all models can be classified as A models (AUC >= 0.9) or as very good B models (AUC >= 0.8). Nevertheless, the models differ across the individual metrics, and depending on the intended application different models are preferable. For example, the RF is the best choice when a high sensitivity for IBD patients is required.
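
As a simple illustration, this rough AUC grading can be applied directly to the df_roc_score table created above:

df_roc_score %>%
  mutate(Grade = case_when(AUC >= 0.9 ~ "A",   # excellent discrimination
                           AUC >= 0.8 ~ "B",   # very good discrimination
                           TRUE ~ "C"))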

For comparison, Franzosa et al. also used an RF model for the discrimination of IBD patients and healthy controls. Their model achieved a ROCAUC score of 0.89 for the validation cohort, which is outperformed by the RF used here (0.93).

Variable importance

Variable importance allows us to infer which features influence the predicted outcome and to what extent. Some algorithms, such as random forests or gradient boosted trees, have an inherent (model-based) way to quantify the relative influence of each feature. In generalized linear models and PLS-DA, the variable importance is represented by the magnitude of the standardized coefficients. Unfortunately, not all model algorithms provide model-based feature importance, e.g. KNN or SVM. However, model-agnostic interpretation methods work for every model, e.g. permutation-based importance: a single feature is shuffled, and the resulting change in model performance is measured, indicating how much the model relies on that feature. However, permutation-based variable importance does not reflect the intrinsic significance of a feature but rather how important the feature is for that particular model. Moreover, in the presence of multicollinearity, this method can flag none of the correlated features as important because the other features can stand in for the shuffled one. This issue could be mitigated by clustering features based on Spearman rank-order correlations and selecting only one feature from each cluster.
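
As a minimal sketch of such a permutation-based approach, the fitted RF could be passed to vip::vi_permute(). Here, train_df is only a placeholder for the preprocessed training data (predictors plus the Diagnosis outcome); the actual object names in this notebook differ, so this outlines the idea rather than reproducing the analysis.

rf_parsnip <- data_ml_fit$fitted[[1]] %>% extract_fit_parsnip()
vi_permute(rf_parsnip,
           train = train_df %>% select(-Diagnosis),   # predictors only (placeholder object)
           target = train_df$Diagnosis,               # true class labels
           metric = "accuracy",                       # importance = drop in accuracy after shuffling
           nsim = 10,                                 # average over several permutations
           pred_wrapper = function(object, newdata) {
             predict(object, new_data = newdata, type = "class")$.pred_class
           }) %>%
  arrange(desc(Importance)) %>% head(10)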

Herein, only the variable importance of the RF, XGB, and PLSDA models is considered for the sake of simplicity. The ten most important features of each model are selected and grouped by their putative chemical class.

vi_rf <- data_ml_fit$fitted[[1]] %>% extract_fit_parsnip() %>% vi() %>% rename(Feature=Variable)
vi_xgb <- data_ml_fit$fitted[[5]] %>% extract_fit_parsnip() %>% vi() %>% rename(Feature=Variable)
vi_plsda <-data_ml_fit$fitted[[2]] %>% extract_fit_parsnip() %>% tidy() %>% filter(type=="predictors")

df_fet <- df_met_feature %>% select(Feature = `Metabolomic Feature`, everything()) %>% as.data.frame()
df_fet$Feature <- str_replace(df_fet$Feature, "-", ".")
top_10_rf <- left_join(df_fet, vi_rf, by=c("Feature")) %>% drop_na(c(`Putative Chemical Class`)) %>% top_n(wt=Importance, 10)
top_10_xgb <- left_join(df_fet, vi_xgb, by=c("Feature")) %>% drop_na(c(`Putative Chemical Class`)) %>% top_n(wt=Importance, 10)
top_10_plsda <- left_join(df_fet, vi_plsda, by=c("Feature" = "term")) %>% drop_na(c(`Putative Chemical Class`)) %>% top_n(wt=value, 10) %>% select(Importance = value, everything(), -type, -component)

top_30 <- rbind(top_10_rf, top_10_xgb, top_10_plsda)
top_30 %>% count(`Putative Chemical Class`) %>% filter(n > 1) %>% ggplot(aes(x=fct_reorder(`Putative Chemical Class`, n), y=n)) +
  geom_col() +
  coord_flip() +
  labs(x="Putative chemical class",
       y="Count")

Tetrapyrrole derivatives are the most important chemical class for the prediction of the diagnosis across the three selected models: they appeared a total of eight times among the ten most important features. All other classes shown appeared at least twice in total. Tetrapyrrole derivatives could originate from vitamin B12 or heme. Vitamin B12 could be produced by bacteria or ingested. Moreover, heme, which could originate from food or from bleeding in the gastrointestinal tract, could lead to gut dysbiosis. Also, the alpha-branched alpha,beta-unsaturated ketone class is interesting and could potentially hint at microbial (xenobiotic) metabolism. In contrast to host enzymes, which usually perform oxidative and conjugative reactions, gut microbes primarily use hydrolytic and reductive reactions (https://www.science.org/doi/10.1126/science.aag2770). Due to their inherent electrophilic character, compounds of this chemical class have the capacity to cause adverse and toxic effects (https://pubmed.ncbi.nlm.nih.gov/15199949/).

These results could be refined further with a more accurate identification of the chemical compounds. Moreover, differential abundance analysis would be a complementary method to elucidate the role of these compounds in IBD; a possible approach is sketched below.
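
As a rough sketch of what such a differential abundance screen could look like, a non-parametric Wilcoxon rank-sum test could be run per metabolite feature for IBD vs. control, followed by Benjamini-Hochberg correction. Here, df_abund is a hypothetical data frame with one column per feature plus a binary Diagnosis column; the object names do not appear in this notebook.

df_abund %>%
  pivot_longer(-Diagnosis, names_to = "Feature", values_to = "Abundance") %>%
  group_by(Feature) %>%
  summarise(p_value = wilcox.test(Abundance ~ Diagnosis)$p.value, .groups = "drop") %>%  # per-feature rank-sum test
  mutate(p_adj = p.adjust(p_value, method = "BH")) %>%                                   # correct for multiple testing
  arrange(p_adj)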

Conclusion

After data wrangling and exploratory data analysis, the dimension reduction visualizations gave a first impression of how difficult the discrimination of the patient groups would be. Both linear and non-linear classifiers achieved satisfactory results, which could indicate that non-linear relationships are not predominant in this case. The UC and CD patients can be grouped as IBD patients to detect sick patients with a high sensitivity. Further work is required to discriminate appropriately between CD and UC patients, either by tuning the classification models or by finding a better-suited LC-MS analysis method. For example, different dimension reductions could be combined with a variety of classifiers and screened using workflowsets, as sketched below. Moreover, variable importance analysis and/or differential expression analysis could be performed for the binary case of CD vs. UC. The LC-MS method could then be optimized for the predominant discriminating substance class.
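
A minimal sketch (with hypothetical object names) of how such a screen could be set up with workflowsets: rec_pca and rec_pls stand for assumed preprocessing recipes, rf_spec and svm_spec for assumed parsnip model specifications, and cv_folds for an assumed resampling object.

library(workflowsets)
wf_set <- workflow_set(preproc = list(pca = rec_pca, pls = rec_pls),     # dimension reduction recipes (placeholders)
                       models = list(rf = rf_spec, svm = svm_spec))      # model specifications (placeholders)
wf_results <- wf_set %>%
  workflow_map("tune_grid", resamples = cv_folds, grid = 10,
               metrics = metric_set(roc_auc, pr_auc), verbose = TRUE)    # tune every recipe/model combination
rank_results(wf_results, rank_metric = "roc_auc")                        # compare the combinations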