The notebook is also available on GitHub. There you can retrieve the interactive plots for the dimension reduction, since the workspace used here doesn’t support them yet.
The goal of the project is to present an analysis of clinical data by leveraging a multitude of computational approaches. The objectives of the analysis are the sensitive prediction of sick patients using machine learning (ML) and the possible identification of new biomarkers for inflammatory bowel diseases (IBD) by using the feature importance of the ML models and/or differential expression analysis.
This project is based on high-dimensional data obtained by mass spectrometry (MS). Different dimensionality reduction and ML methods will be used and discussed. The script is divided into several sections, starting with data wrangling, followed by exploratory data analysis. Different methods are used for dimension reduction with subsequent classification from a variety of ML models. Finally, the results will be discussed. Moreover, a differential expression analysis is planned.
A variety of packages are required for the diverse ML methods. The main framework is tidymodels, and many tidyverse principles are used.
suppressPackageStartupMessages({library(mixOmics, quietly = TRUE)
library(plsmod)
library(readxl, quietly = TRUE)
library(vip, quietly = TRUE)
library(tictoc, quietly = TRUE)
library(stacks, quietly = TRUE)
library(MASS,exclude = "select")
library(Rtsne, quietly = TRUE)
library(tidyverse, quietly = TRUE)
library(tidymodels, quietly = TRUE)
library(plotly, quietly = TRUE)
library(embed, quietly = TRUE)
library(gtools, quietly = TRUE)
library(cowplot, quietly = TRUE)
library(visdat, quietly = TRUE)
})
The data set can be retrieved from the supplementary material of the published study. Untargeted metabolomic profiling of stool samples from patients with Crohn’s disease (CD), ulcerative colitis (UC), and non-inflammatory bowel disease (non-IBD) control subjects was performed in the study of E. A. Franzosa et al. Previous studies have found differences in faecal metabolite composition in IBD patients. In their study, metabolomic and shotgun metagenomic profiling was conducted; in this analysis, only the metabolomic features derived by liquid chromatography-mass spectrometry (LC-MS) will be considered. The
data set includes untargeted metabolomic profiles of stool samples from
a 155-member discovery cohort and a 65-member independent validation
cohort. Four different LC-MS methods were used to cover different metabolite classes like polar metabolites, lipids, free fatty acids, and bile acids. 3829 uncharacterized metabolite features were assigned to putative molecular classes based on comparison to the Human Metabolome Database (HMDB), and 566 features were annotated by reference to an in-house compound library. Also, metabolites were clustered based on their covariation across samples. The discovery cohort includes 68 CD, 53 UC, and 34 non-IBD patients, whereas the validation cohort is more balanced with 20 CD, 23 UC, and 22 control patients.
download.file("https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6342642/bin/NIHMS1510763-supplement-Dataset_2.xlsx", destfile = "/tmp/file.xlsx")
df_orig <- read_excel("/tmp/file.xlsx", skip=1)
download.file("https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6342642/bin/NIHMS1510763-supplement-Dataset_1.xlsx", destfile = "/tmp/file2.xlsx")
df_met_feature <- read_excel("/tmp/file2.xlsx", skip=1)
head(df_orig)
## # A tibble: 6 × 221
## `# Feature / Sample` `PRISM|7122` `PRISM|7147` `PRISM|7150` `PRISM|7153`
## <chr> <chr> <chr> <chr> <chr>
## 1 Age 38 50 41 51
## 2 Diagnosis CD CD CD CD
## 3 Fecal.Calprotectin 207.4844286 <NA> 218.33451690000001 <NA>
## 4 antibiotic No No No No
## 5 immunosuppressant Yes No Yes No
## 6 mesalamine No Yes No Yes
## # … with 216 more variables: `PRISM|7184` <chr>, `PRISM|7238` <chr>,
## # `PRISM|7406` <chr>, `PRISM|7408` <chr>, `PRISM|7421` <chr>,
## # `PRISM|7445` <chr>, `PRISM|7486` <chr>, `PRISM|7496` <chr>,
## # `PRISM|7506` <chr>, `PRISM|7547` <chr>, `PRISM|7658` <chr>,
## # `PRISM|7662` <chr>, `PRISM|7744` <chr>, `PRISM|7759` <chr>,
## # `PRISM|7762` <chr>, `PRISM|7776` <chr>, `PRISM|7791` <chr>,
## # `PRISM|7843` <chr>, `PRISM|7847` <chr>, `PRISM|7852` <chr>, …
For ML algorithms, the data must be in a form where the rows correspond to the individual samples and the columns to the variables/features. Therefore, the data set is pivoted.
df_long <- df_orig %>% pivot_longer(-`# Feature / Sample`, names_to="PatientID") %>%
pivot_wider(names_from=`# Feature / Sample`, values_from="value")
head(df_long)
## # A tibble: 6 × 8,856
## PatientID Age Diagnosis Fecal.Calprotectin antibiotic immunosuppressant
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 PRISM|7122 38 CD 207.4844286 No Yes
## 2 PRISM|7147 50 CD <NA> No No
## 3 PRISM|7150 41 CD 218.33451690000001 No Yes
## 4 PRISM|7153 51 CD <NA> No No
## 5 PRISM|7184 68 CD 20.16795055 No No
## 6 PRISM|7238 67 CD 2.5862468359999999 Yes Yes
## # … with 8,850 more variables: mesalamine <chr>, steroids <chr>,
## # `C18-neg_Cluster_0001` <chr>, `C18-neg_Cluster_0002` <chr>,
## # `C18-neg_Cluster_0003` <chr>, `C18-neg_Cluster_0004` <chr>,
## # `C18-neg_Cluster_0005` <chr>, `C18-neg_Cluster_0006` <chr>,
## # `C18-neg_Cluster_0007` <chr>, `C18-neg_Cluster_0008` <chr>,
## # `C18-neg_Cluster_0009` <chr>, `C18-neg_Cluster_0010` <chr>,
## # `C18-neg_Cluster_0011` <chr>, `C18-neg_Cluster_0012` <chr>, …
We can directly see NA values in the Fecal.Calprotectin
column. Most ML methods require the absence of NA values or remove them
on their own. There are different ways to handle missing values. The two
primary solutions are deletion or imputation. Deletion should be avoided
or minimized as much as possible because it comes at the cost of losing
data/information. This removal can be done by either removing the row or
the entire column. The latter might be necessary if many values are
missing for one variable. In contrast, there are many options to impute missing values, ranging from simple statistics like the mean or median to regression models or ML methods like the k-nearest neighbors algorithm. First, the columns with NA
values will be visualized using the visdat package to find
some possible relationships or causes of the missing data. The package
allows for a heatmap-like visualization of missing data.
na_cols <- unlist(lapply(df_long, function(x) any(is.na(x))))
df_long[na_cols] %>% vis_dat()
## Warning: `gather_()` was deprecated in tidyr 1.2.0.
## Please use `gather()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
There are multiple NA values apparent. The
Fecal.Calprotectin column has too many missing values, and
therefore the column will be removed. The
immunosuppressant, mesalamine, and
steroids variables display a similar distribution of NA
values. The majority of the missing values cluster together and probably
have something in common. For further investigation, the
nominal variable Diagnosis might give further details about
the possible relationship.
p1 <- df_long %>%
ggplot(aes(Diagnosis, fill=is.na(steroids))) +
geom_bar() +
theme(legend.position="none")+
labs(title="Missing values in steroids variable")
p2 <- df_long %>%
ggplot(aes(Diagnosis, fill=is.na(immunosuppressant))) +
geom_bar() +
theme(legend.position="none")+
labs(title="immunosuppressant variable")
p3 <- df_long %>%
ggplot(aes(Diagnosis, fill=is.na(mesalamine))) +
geom_bar() +
labs(title="mesalamine variable",fill="Missing")
plot_grid(p1,p2, p3)
The majority of missing values for the three variables are in the control group. Here, we can assume that the healthy control group doesn’t take steroids, immunosuppressants, or mesalamine. Therefore, the missing values in the control group are replaced with “No”.
df_long <- df_long %>% select(-Fecal.Calprotectin)
df_long[df_long$Diagnosis == "Control",] <- df_long[df_long$Diagnosis == "Control",] %>% replace_na(list(immunosuppressant="No", mesalamine="No", steroids="No"))
df_long[!complete.cases(df_long), ]
## # A tibble: 5 × 8,855
## PatientID Age Diagnosis antibiotic immunosuppressant mesalamine steroids
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 PRISM|7762 40 UC <NA> No Yes <NA>
## 2 PRISM|7948 61 CD <NA> No Yes No
## 3 PRISM|8573 31 CD <NA> No Yes Yes
## 4 PRISM|8794 24 CD No No <NA> <NA>
## 5 PRISM|8802 73 CD No No <NA> No
## # … with 8,848 more variables: `C18-neg_Cluster_0001` <chr>,
## # `C18-neg_Cluster_0002` <chr>, `C18-neg_Cluster_0003` <chr>,
## # `C18-neg_Cluster_0004` <chr>, `C18-neg_Cluster_0005` <chr>,
## # `C18-neg_Cluster_0006` <chr>, `C18-neg_Cluster_0007` <chr>,
## # `C18-neg_Cluster_0008` <chr>, `C18-neg_Cluster_0009` <chr>,
## # `C18-neg_Cluster_0010` <chr>, `C18-neg_Cluster_0011` <chr>,
## # `C18-neg_Cluster_0012` <chr>, `C18-neg_Cluster_0013` <chr>, …
There are five rows remaining with missing values. For the sake of simplicity, these rows will be removed in this project. However, clinical studies have a limited number of subjects, and the methods used can be expensive and strenuous. Therefore, within the framework of the study evaluation, one should ask for further information to avoid losing precious data.
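As an aside, the remaining NAs in the nominal drug variables could also be imputed instead of deleted, e.g. with the mode (or, for numeric columns, with step_impute_knn). The following chunk is only a minimal sketch under that assumption and is not applied in this project:
impute_sketch <- recipe(Diagnosis ~ antibiotic + mesalamine + steroids, data = df_long) %>%
  step_impute_mode(antibiotic, mesalamine, steroids) %>% # replace NAs with the most frequent value
  prep() %>%
  bake(new_data = NULL)
anyNA(impute_sketch) # FALSE: the three columns no longer contain NAs
In this project, however, the remaining rows are simply dropped: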
df_long <- df_long %>% na.omit()
dim(df_long)
## [1] 215 8855
Now, the data will be divided into the discovery group and the validation group. Fortunately, the study provides three independent cohorts. The PRISM cohort will be used to train and optimize the models, and two cohorts are available for the validation set, Deep and UMCG. This procedure ensures that the final models will never have seen the validation data, so we can gauge the generalisability of the models.
PRISM <- df_orig %>% select(starts_with('PRISM')) %>% names()
data_prism <- df_long %>% filter(PatientID %in% PRISM)
data_validate <- df_long %>% filter(!PatientID %in% PRISM)
Since all variables are encoded as characters, we will format the data so that the ML algorithms can use it. The nominal variables, e.g. Diagnosis or antibiotic, are transformed into factors, and the age and mass spectrometry features into numerical variables. Moreover, the PatientID is excluded from the training set.
tidy_data <- function(untidy) {
untidy %>%
mutate(PatientID = as.factor(PatientID),
Age = as.numeric(Age),
Diagnosis = as.factor(Diagnosis),
antibiotic = as.factor(antibiotic),
mesalamine = as.factor(mesalamine),
steroids = as.factor(steroids),
immunosuppressant = as.factor(immunosuppressant)) -> formatted
w <- which(sapply(formatted, class ) == 'character' )
formatted[w] <- lapply(formatted[w], function(x) as.numeric(x))
return(formatted)
}
data_prism_formatted <- tidy_data(data_prism)
data_validate_formatted <- tidy_data(data_validate)
colnames(data_prism_formatted) <- make.names(colnames(data_prism_formatted)) # compatible feature names for ML
colnames(data_validate_formatted) <- make.names(colnames(data_validate_formatted))
train_df <- data_prism_formatted %>% select(-PatientID) %>%
mutate(Diagnosis = factor(Diagnosis, levels =c("Control","CD", "UC"), labels =c(1,2,3)))
test_df <- data_validate_formatted %>% select(-PatientID,-Diagnosis)
test_df_truth <- data_validate_formatted %>% select(PatientID, Diagnosis) %>% # for performance metrics
mutate(Diagnosis = factor(Diagnosis, levels =c("Control","CD", "UC"), labels =c(1,2,3))) %>%
mutate(Cohort = case_when(str_detect(PatientID,"LLDeep") == TRUE ~ "Deep",
TRUE ~ "UMCG")) # for error analysis based on cohort affiliation
folds <- train_df %>% vfold_cv(5, strata = Diagnosis) # for model training
data_combined <- bind_rows(data_prism_formatted, data_validate_formatted) %>% # for EDA
mutate(Cohort = case_when(
str_detect(PatientID, "LLDeep") == TRUE ~ "Deep",
str_detect(PatientID, "PRISM") == TRUE ~ "PRISM",
TRUE ~ "UMCG"))
data_combined_train <- data_combined %>% select(-c(PatientID, Diagnosis, Cohort)) # for t-SNE
head(data_combined %>% select(Cohort, everything()))
## # A tibble: 6 × 8,856
## Cohort PatientID Age Diagnosis antibiotic immunosuppressant mesalamine
## <chr> <fct> <dbl> <fct> <fct> <fct> <fct>
## 1 PRISM PRISM|7122 38 CD No Yes No
## 2 PRISM PRISM|7147 50 CD No No Yes
## 3 PRISM PRISM|7150 41 CD No Yes No
## 4 PRISM PRISM|7153 51 CD No No Yes
## 5 PRISM PRISM|7184 68 CD No No No
## 6 PRISM PRISM|7238 67 CD Yes Yes No
## # … with 8,849 more variables: steroids <fct>, C18.neg_Cluster_0001 <dbl>,
## # C18.neg_Cluster_0002 <dbl>, C18.neg_Cluster_0003 <dbl>,
## # C18.neg_Cluster_0004 <dbl>, C18.neg_Cluster_0005 <dbl>,
## # C18.neg_Cluster_0006 <dbl>, C18.neg_Cluster_0007 <dbl>,
## # C18.neg_Cluster_0008 <dbl>, C18.neg_Cluster_0009 <dbl>,
## # C18.neg_Cluster_0010 <dbl>, C18.neg_Cluster_0011 <dbl>,
## # C18.neg_Cluster_0012 <dbl>, C18.neg_Cluster_0013 <dbl>, …
Exploratory data analysis (EDA) is not a formal process. Instead, it is used to gain an overview of your data. At this stage, you can feel free to explore every idea that comes to your mind. Those insights are typically expressed as meaningful visualizations. During this stage, data transformations might be necessary, not all ideas will deliver the expected results, and some will lead to different questions.
First, the Group variable is created to distinguish
between the discovery and validation data.
data_combined_exp <- data_combined %>% mutate(Group = case_when(Cohort == "Deep" ~ "Validation",
Cohort == "PRISM" ~ "Discovery",
TRUE ~ "Validation"))
p4 <- ggplot(data_combined_exp, aes(x = Diagnosis, fill=Diagnosis)) +
geom_bar()+
facet_wrap(~Group) +
theme(legend.position = "none")
Next, the naming pattern of the four different LC column types is required to count the features per type.
names(data_combined) %>% sample(10) # four different column prefixes should be found
## [1] "C8.pos_Cluster_1373" "HILIC.pos_Cluster_1847" "C8.pos_Cluster_2426"
## [4] "HILIC.neg_Cluster_0827" "C18.neg_Cluster_0337" "HILIC.neg_Cluster_1130"
## [7] "HILIC.neg_Cluster_0705" "HILIC.pos_Cluster_1446" "HILIC.pos_Cluster_1618"
## [10] "C18.neg_Cluster_0086"
column_names <- c("C18.neg", "C8.pos", "HILIC.neg", "HILIC.pos")
column_count <- tibble(Column = column_names, count = c(NA))
for (i in 1:4) {
x <- column_names[[i]]
column_count$count[[i]] <- data_combined %>% select(starts_with(x)) %>% ncol()
}
p5 <- column_count %>% unnest(count) %>%
ggplot(aes(Column, count, fill=Column)) +
geom_col() +
theme(legend.position = "none")
plot_grid(p4,p5, labels=c("A", "B"))
The discovery group displays an uneven distribution of the patients’ diagnoses in plot A. However, the validation group displays an almost even distribution. Fortunately, the five rows removed earlier were CD and UC patients of the discovery group and led to a slightly more balanced distribution. Class imbalances can lead to poor performance of ML models by exhibiting bias toward the majority class, yet the minority class is usually the reason to build the model in the first place. Herein, however, the majority classes CD and UC are more interesting, as the ML model should identify patients with IBD. There are a variety of techniques to deal with class imbalances, e.g. up- or downsampling, or the Synthetic Minority Oversampling Technique (SMOTE). Here, no correction of the imbalances will be done, even though SMOTE could easily be integrated into the tidymodels workflow, as shown in the sketch below.
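A minimal sketch of how SMOTE could be slotted into a preprocessing recipe, assuming the themis package (which is neither loaded nor used in this analysis):
library(themis)
# Hypothetical sketch only: oversample the minority class with SMOTE; not applied here.
recipe_smote <- recipe(Diagnosis ~ ., data = train_df) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>% # SMOTE requires numeric predictors
  step_smote(Diagnosis) # synthesizes new minority-class samples during prep/bake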
In plot B, the number of metabolic features per column type is displayed. LC in positive mode tended to generate more features than in negative mode. The different column types probably also identified some identical compounds. 466 metabolites were identified more precisely in the study, and 3829 features were linked to putative identifiers by searching against the Human Metabolome Database. The names of the identified metabolites or substance classes are not included in this dataset, but they can later be combined with it and used for further analysis. For example, the metabolites with the highest variable importance of some models can be retrieved.
data_combined_exp <- data_combined %>% mutate(Group = case_when(Cohort == "Deep" ~ "Validation",
Cohort == "PRISM" ~ "Discovery",
TRUE ~ "Validation"))
p_age <-ggplot(data_combined_exp,aes(x = Age,fill = Diagnosis)) +
geom_density(alpha=.5)+
labs(x="Age",
y="Density",
fill="Diagnosis",
title="Age distribution of subjects")+
facet_grid(~ Group, labeller = label_both, scales = "free_y")+
theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
panel.background = element_rect(fill='transparent'),
plot.background = element_rect(fill='transparent', color=NA),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
legend.background = element_rect(fill='transparent'))
vars = c("antibiotic","immunosuppressant","mesalamine","steroids")
for (i in vars) {
assign(
paste0("p_", i),
ggplot(data_combined_exp, aes_string(i, fill = "Diagnosis")) +
geom_bar(position="fill")+
facet_wrap(~Group)+
theme(legend.position="none")+
labs(y="")
)
}
plot_grid(p_age,plot_grid(p_antibiotic, p_immunosuppressant, p_mesalamine, p_steroids, labels = c('A', 'B', "C", "D"), label_size = 12))
The age distributions of the discovery and validation cohorts have similar characteristics. The CD and UC patients’ ages range from 20 to 80 years. Only the control group’s age in the validation cohort reaches merely up to 60 years. Moreover, the control group patients are mostly younger than 40 in the discovery cohort. The UC patients’ age displays a bimodal distribution with higher densities of patients younger and older than 60.
The barplots display a similar percentage of antibiotic, immunosuppressant, mesalamine, and steroid intake for the discovery and validation cohorts across the different patient groups. The biggest difference is that there was no antibiotic intake in the validation cohort. Moreover, there are moderate differences in the percentage of immunosuppressant and mesalamine intake for the validation groups.
Data from biological samples are often highly correlated. Explanatory features will not only correlate with the target variable, Diagnosis, but the features will also correlate with each other, so-called multicollinearity in the case of a linear relationship. Multicollinearity may not influence the predictive accuracy of the models, but it can impair their interpretability. In the case of highly correlated features, the coefficients, p-values, or SHAP values of the model may not represent the actual importance of the biomarkers, because the features are redundant and the model might have used only one of them. Therefore, the simpler the model (fewer features), the higher the interpretability.
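If one wanted to reduce this redundancy before modelling, recipes offers a correlation filter; a minimal sketch (not applied in this analysis, the 0.9 threshold is an arbitrary example value):
recipe_corr_filter <- recipe(Diagnosis ~ ., data = train_df) %>%
  step_corr(all_numeric_predictors(), threshold = 0.9) # drops one feature of each pair with |r| above the threshold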
Since the data used here contains about 9000 features, the correlation matrix for all of them is too big to fit in the memory (RAM) of most computers. Therefore, a subset of 2500 MS features with equal amounts from every column type will be used for the correlation analysis.
sample_func <- function(sample_size, data_combined = data_combined) { # sampling N features from every column type
set.seed(42)
for (i in 1:nrow(column_count)) {
x = column_count$Column[[i]]
assign(paste0("sample_", x),
data_combined %>% select(starts_with(x)) %>% select(sample(1:column_count$count[[i]], sample_size, replace=FALSE))
)
}
sample_C18.neg %>% add_column(sample_C8.pos,
sample_HILIC.neg,
sample_HILIC.pos) -> sample
return(sample)
}
sample <- sample_func(625, data_combined)
corr_func <- function(threshold, sample = sample) { # correlating and cleaning matrix with threshold for correlation coefficient
set.seed(42)
cor_all <- sample %>% cor()
cor_all[!lower.tri(cor_all)] <- NA # remove diagonal and redundant values
data.frame(cor_all) %>%
rownames_to_column() %>%
gather(key="variable", value="correlation", -rowname) %>%
filter(!is.na(correlation)) %>%
filter(abs(correlation) >= threshold) -> df_corr_sample
return(df_corr_sample)
}
thresholds <- c(.1,.4,.6,.9,.95,.99) # correlation coefficient thresholds
df_count <- tibble(Threshold=thresholds, count=c(NA))
for (i in seq_along(thresholds)) {
x <- thresholds[[i]]
df_count$count[[i]] <- corr_func(.0, sample) %>%
filter(abs(correlation) > x) %>%
nrow()
}
n <- ncol(sample) # sample size
df_count %>% mutate(Percentage = count/((n**2 - n)/2)) %>% # divided by all possible combinations to get percentages
ggplot(aes(factor(Threshold), Percentage)) +
geom_point() +
scale_y_continuous(trans='log10',labels = label_percent(), breaks = c(5 %o% 10^(-5:1))) +
coord_flip() +
labs(x = "Threshold of Pearson's correlation coefficient", y = "Proportion", title="Correlation of features")
About 50% of the feature pairs in the subset show a correlation above the negligible level (0.1). 2.5% of the pairs display at least a weak correlation (0.4). A very strong correlation (0.9) and an almost perfect correlation (0.99) exist between only about 0.05% and 0.0025% of the feature pairs, respectively.
Next, it would be interesting to know which column types contribute the most correlated features.
comb_grid <- combinations(n=4, r=2, v = column_count$Column, repeats.allowed=T)
df_comb <- data.frame(comb_grid,count=c(NA)) %>% select(Column_1 = X1, Column_2 = X2, count)
threshold <- 0.4
df_corr_sample <- corr_func(threshold, sample)
for (i in 1:nrow(df_comb)) {
x <- df_comb$Column_1[[i]]
y <- df_comb$Column_2[[i]]
df_corr_sample %>%
filter((str_detect(rowname, x) & str_detect(variable ,y)) | (str_detect(rowname, y) & str_detect(variable,x))) %>%
nrow() -> df_comb$count[[i]]
}
sum(df_comb$count) == nrow(df_corr_sample) # check if all combinations have been counted
## [1] TRUE
n_column <- n/4 # samples per col
n_equal <- (n_column**2 - n_column)/2 # all possible combinations
n_unequal <- n_column**2
df_comb_prop <- df_comb %>% mutate(Combination = paste0(Column_1, " x ", Column_2),
Proportion = case_when(Column_1 != Column_2 ~ count/n_unequal,
TRUE ~ count/n_equal
)
)
df_comb_prop %>% ggplot(aes(fct_reorder(Combination, Proportion), Proportion)) +
geom_point() +
scale_y_continuous(labels = scales::percent) +
coord_flip() +
labs(x = "Combination of column types",
y = "Proportion of correlated features with r >= 0.4")
Not surprisingly, the highest percentage of features with a moderate correlation (> 0.4) is contributed by identical column types. This means that features originating from the same column type are more likely to correlate. This is possibly caused by adjacent elution peaks during LC, which could represent identical or nearly identical compounds. The lowest correlation of features was found between features of the C8 column and the HILIC column in positive mode. The C8 column setup was used to capture polar and non-polar lipids, and the HILIC.pos mode polar metabolites (e.g. amino acids).
Some features may have zero or near-zero variance across the samples, and hence they are likely non-informative. Herein, the tidymodels preprocessing functions are used with the default values for near-zero variance. By default, a feature is classified as near-zero variance if the percentage of unique values in the samples is less than 10% and the frequency ratio, i.e. the ratio of the frequency of the most common value to the frequency of the second most common value, is greater than 19 (95/5).
nzv <- recipe(Diagnosis ~.,data_combined) %>%
step_nzv(all_predictors()) %>%
prep()
nzv_obj <- bake(nzv, new_data = NULL)
nzv_cols <- names(data_combined)[!(names(data_combined) %in% names(nzv_obj))]
nzv_cols
## [1] "HILIC.pos_Cluster_1433"
data_combined %>% select(HILIC.pos_Cluster_1433) %>% unique()
## # A tibble: 14 × 1
## HILIC.pos_Cluster_1433
## <dbl>
## 1 0
## 2 228.
## 3 4.58
## 4 455.
## 5 673.
## 6 774.
## 7 1505.
## 8 6.51
## 9 10.5
## 10 6.69
## 11 716.
## 12 511.
## 13 1817.
## 14 266.
The only feature that showed near-zero variance was the
HILIC.pos_Cluster_1433 column. There are still some unique
values in the column, which might provide information for the predictive
models. Therefore, the column will be retained.
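To double-check the decision, the two near-zero-variance criteria can also be computed manually for this column; a quick sketch equivalent to the step_nzv defaults (95/5 frequency ratio, 10% unique values):
x <- data_combined$HILIC.pos_Cluster_1433
tab <- sort(table(x), decreasing = TRUE)
freq_ratio <- as.numeric(tab[1] / tab[2]) # most common vs. second most common value
pct_unique <- 100 * length(unique(x)) / length(x) # percentage of distinct values
c(freq_ratio = freq_ratio, pct_unique = pct_unique)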
Dimensionality reduction (DR) is frequently applied during the
analysis of high-dimensional data. Because of “the curse of
dimensionality,” the volume of the data increases so fast with
additional dimensions (variables) that the data becomes sparse. This
phenomenon can result in a loss of accuracy of ML models. DR can be
beneficial for the majority of modern biological datasets, in which the
data often contain measurements on uninformative or redundant variables.
For example, redundant variables from overlapping peaks or compounds
found in more than one LC-MS setup.
DR can be viewed as a method for latent variable extraction and is also
frequently used for data compression, exploration, visualization, and
data pre-processing for ML. There is no best technique for
dimensionality reduction. Instead, the choice depends on the nature of
the data and the intended purpose. In the case of ML, the best DR method
could be chosen on the basis of the performance of the resulting
model.
There are a multitude of dimensionality reduction algorithms that can reduce
the dimensions and allow for visualization in the two-dimensional (2D)
or three-dimensional (3D) space. Those techniques can also be divided
into linear and non-linear transformations. The most common linear
transformation is the principal component analysis (PCA). PCA can also
be extended with the kernel trick to kernel PCA (kPCA), which allows for
non-linear separable data to be separated in a higher-dimensional space.
Both methods allow for the application of the learnt mapping function to
unseen test data. More examples of non-linear transformations are
t-distributed stochastic neighbour embedding (t-SNE) and uniform
manifold approximation and projection (UMAP).
t-SNE is used for visualization of high-dimensional data
which can be projected into the low-dimensional space. Thereby, similar
objects are modelled by nearby points and dissimilar objects by distant
points with high probability. The visual appearance is influenced by
hyperparameters, where perplexity is the most important. Perplexity is
comparable to the nearest neighbours hyperparameter. Lower values tend
to spread out the projected data and higher values tend to preserve the
global structure.
Larger and denser data sets require a higher perplexity, which in turn requires
a longer computation time. Typical values range from 5 to 50.
t-SNE learns a non-parametric mapping and not an explicit
function. Therefore, it is not possible to directly embed test data into
an existing map, which is crucial for supervised machine learning.
However, it would be possible to fit a multivariate regression model
between the original data and the embedded dimensions. Thus,
t-SNE is used for visualization purposes rather than as a
pre-processing step for ML.
Herein, different parameters of perplexity are used and visualized in
the two-dimensional space. The model was trained with the whole data
set, including the discovery and validation group.
data_tsne <- tibble(perplexity = c(5, 10, 15, 30, 45))
data_tsne <- data_tsne %>% mutate(TSNE = map(perplexity, ~Rtsne(data_combined_train, dims=2, PCA=F, perplexity=.x, max_iter=2000)))
for (i in 1:nrow(data_tsne)){
x = data_tsne$perplexity[[i]]
assign(paste0("tsne_df_tune_", x),
tibble(
X = data_tsne$TSNE[[i]]$Y[,1],
Y = data_tsne$TSNE[[i]]$Y[,2],
#Z = data_tsne$TSNE[[i]]$Y[,3],
Perplexity = data_tsne$perplexity[[i]],
Diagnosis = data_combined$Diagnosis,
Cohort = data_combined$Cohort,
ID = data_combined$PatientID
)
)
}
list_tsne_tune <- ls(pattern = "tsne_df_tune_")
tsne_combined <- bind_rows(mget(ls(pattern = "tsne_df_tune_")))
tsne_combined %>%
plot_ly(x = ~X, y = ~Y, frame = ~Perplexity, color = ~Diagnosis, symbol = ~Cohort,
colors = "Set1", type="scatter", mode="markers", ids=~ID, text = ~paste0("Sample: ", ID, "<br>",
"Diagnosis: ", Diagnosis),
hoverinfo = "text",
marker = list(sizemode = "diameter")) %>%
layout(xaxis = list(showgrid = F,
zeroline = F
),
yaxis = list(showgrid = F,
zeroline = F
)
) %>%
animation_opts(
redraw=FALSE,
frame = 2000,
transition = 500) %>%
animation_slider(
currentvalue = list(
prefix = "Perplexity: ",
font = list(color = "red")))
Note: The interactive visualization is not supported in this
workspace yet. The interactive version can be seen here.
The t-SNE dimension reduction has been performed with different values of the perplexity parameter. The coordinate values are in different ranges for the different perplexity settings and can be autoscaled via the control in the top-right corner. The individuals (samples) are divided into the discovery and validation groups and the three different cohorts. The objective of the approach is to segment the patients with different diagnoses (colors).
Herein, all runs with different perplexity values segmented the patients into three clusters to some extent. However, in all setups patients with different diagnoses are found in the same cluster. This intertwining hints at some possible effects in the subsequent analysis steps. First, some patients with dissimilar diagnoses could have very similar metabolic characteristics. Moreover, the exact differentiation of patients by ML methods and differential expression analysis could be impaired. Based on visual assessment, a perplexity value of 30 yielded the best separation of patients. Patients with CD and UC seem to be more intertwined and hence could possess similar disease characteristics. The cluster of the control group is penetrated by patients with CD and UC, but no control patients are found in the clusters of the CD and UC patients, except one control patient in the CD cluster.
The largest drawback of t-SNE is that it cannot serve as a preprocessing method for supervised learning, since no mapping function is calculated that could be applied to the test or validation set. Therefore, other dimension reduction methods are used in the subsequent steps.
First, PCA is a linear transformation that transforms a large set of
variables into a smaller set (principal components) while retaining as much
information (variability) as possible. After standardization, the
eigenvalues and eigenvectors are determined from the covariance matrix
in order to derive the principal components (PCs). These are new
variables constructed as linear combinations of the original variables so that
as much variability as possible is squeezed into the first PC (PC1). The second PC (PC2)
holds the second largest amount of variability orthogonal (uncorrelated)
to PC1.
PCA outputs the same number of PCs as input variables. For dimension reduction, a cutoff value, e.g. 0.9 of the cumulative explained variance, can be chosen, and as many PCs are retained as are needed to reach the desired amount of information. This also applies to kPCA, which is more suited for non-linear DR. In contrast to PCA, the data is transformed into a higher-dimensional space before the covariance matrix is calculated. Using the kernel trick, much computation is avoided since the coordinates of the data in that space don’t need to be calculated explicitly.
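As an illustration of such a variance-based cutoff (outside the actual modelling pipeline, which uses a fixed number of components tuned by cross-validation), the cumulative explained variance can be inspected with prcomp. This sketch assumes that no numeric feature has zero variance, which the near-zero-variance check above supports:
num_feats <- data_combined_train %>% select(where(is.numeric))
pca_demo <- prcomp(num_feats, center = TRUE, scale. = TRUE)
explained <- cumsum(pca_demo$sdev^2) / sum(pca_demo$sdev^2) # cumulative explained variance
which(explained >= 0.9)[1] # first number of PCs reaching 90% cumulative variance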
Similar to t-SNE, UMAP employs graph layout algorithms. A high-dimensional graph is constructed from the data, and then a low-dimensional graph is fitted to be as structurally similar as possible. Two common parameters are n_neighbors and min_dist. They control the balance between local and global structure. Low values of n_neighbors lead to a more local structure, whereas higher values represent a more big-picture structure while losing finer details. min_dist controls how tightly the points are packed together. Lower values bring points closer together and allow for more clustering. In contrast, larger values will preserve more of the broader topological structure. Moreover, the algorithm allows for both unsupervised and supervised DR. Herein, the latter will be used.
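To illustrate the two parameters directly, a small unsupervised sketch with the uwot package (the backend that embed::step_umap builds on, not part of the modelling pipeline; the parameter values are arbitrary examples) could look like this:
library(uwot)
set.seed(42)
# Unsupervised 2D embedding of the numeric features with explicit n_neighbors and min_dist.
umap_demo <- umap(data_combined_train %>% select(where(is.numeric)) %>% as.matrix(),
                  n_neighbors = 15, min_dist = 0.1, n_components = 2)
head(umap_demo) # one row of 2D coordinates per sample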
In contrast to t-SNE, the hyperparameters were selected based on the best area under the receiver operating characteristic curve (AUROC) score of a cross-validated ML classification model. Moreover, the optimal number of dimensions for classification was also determined by cross-validation in a separate script using tidymodels. Three different kinds of ML models were used for each DR algorithm: KNN, SVM, and a single-layer neural network. UMAP and kPCA hyperparameters were tuned with a single-layer neural network with AUROC scores of 0.80 and 0.87, respectively. The number of principal components in PCA was selected based on the radial basis function SVM (RBF-SVM) classifier with an AUROC score of 0.88.
set.seed(42)
recipe_umap <- recipe(Diagnosis~., data = train_df) %>%
step_normalize(all_numeric_predictors()) %>%
step_dummy(all_nominal_predictors()) %>%
step_umap(all_predictors(), outcome = vars(Diagnosis), num_comp = 6, min_dist = 0.94396076, learn_rate = 6.193164e-03)
set.seed(42)
recipe_kpca <- recipe(Diagnosis~., data = train_df) %>%
step_normalize(all_numeric_predictors()) %>%
step_dummy(all_nominal_predictors()) %>%
step_kpca_rbf(all_predictors(), num_comp = 61, sigma = 1.782724e-06)
set.seed(42)
recipe_pca <- recipe(Diagnosis~., data = train_df) %>%
step_normalize(all_numeric_predictors()) %>%
step_dummy(all_nominal_predictors()) %>%
step_pca(all_predictors(), num_comp = 60)
train_umap <- bake(recipe_umap %>% prep(), train_df)
## as(<dgCMatrix>, "dgTMatrix") is deprecated since Matrix 1.5-0; do as(., "TsparseMatrix") instead
train_kpca <- bake(recipe_kpca %>% prep(), train_df)
train_pca <- bake(recipe_pca %>% prep(), train_df)
test_umap <- bake(recipe_umap %>% prep(), test_df)
test_kpca <- bake(recipe_kpca %>% prep(), test_df)
test_pca <- bake(recipe_pca %>% prep(), test_df)
train_1 <- train_umap %>% select(UMAP = UMAP1, Diagnosis) %>%
add_column(PC = train_pca$PC01) %>%
add_column(kPC = train_kpca$kPC01)
train_2 <- train_umap %>% select(UMAP2) %>%
add_column(PatientID = data_prism_formatted$PatientID) %>%
add_column(PC = train_pca$PC02) %>%
add_column(kPC = train_kpca$kPC02)
test_1 <- test_umap %>% select(UMAP = UMAP1) %>%
add_column(Diagnosis = test_df_truth$Diagnosis) %>%
add_column(kPC = test_kpca$kPC01) %>%
add_column(PC = test_pca$PC01)
test_2 <- test_umap %>% select(UMAP = UMAP2) %>%
add_column(PatientID = data_validate_formatted$PatientID) %>%
add_column(kPC = test_kpca$kPC02) %>%
add_column(PC = test_pca$PC02)
train_combined <- train_1 %>% pivot_longer(!Diagnosis, names_to = "Preprocessing", values_to = "Dim1") %>%
add_column(train_2 %>% pivot_longer(!PatientID,names_to = "Preprocessing", values_to = "Dim2") %>%
select(Dim2, PatientID)) %>%
mutate(Diagnosis = factor(Diagnosis, levels = c(1,2,3), labels = c("Control","CD", "UC")))
test_combined <- test_1 %>% pivot_longer(!Diagnosis, names_to = "Preprocessing", values_to = "Dim1") %>%
add_column(test_2 %>% pivot_longer(!PatientID,names_to = "Preprocessing", values_to = "Dim2") %>%
select(Dim2, PatientID)) %>%
mutate(Diagnosis = factor(Diagnosis, levels = c(1,2,3), labels = c("Control","CD", "UC")))
combined <- train_combined %>% bind_rows(test_combined, .id = c("Group")) %>%
mutate(Group = case_when(Group==1 ~ "Discovery",
TRUE ~ "Validate"))
head(combined)
## # A tibble: 6 × 6
## Group Diagnosis Preprocessing Dim1 Dim2 PatientID
## <chr> <fct> <chr> <dbl> <dbl> <fct>
## 1 Discovery CD UMAP -1.40 -1.39 PRISM|7122
## 2 Discovery CD PC -18.2 -16.7 PRISM|7122
## 3 Discovery CD kPC 0.414 0.386 PRISM|7122
## 4 Discovery CD UMAP -0.528 -0.130 PRISM|7147
## 5 Discovery CD PC -15.0 26.0 PRISM|7147
## 6 Discovery CD kPC 0.344 -0.583 PRISM|7147
combined %>%
plot_ly(x = ~Dim1, y = ~Dim2, frame = ~Preprocessing, color = ~Diagnosis, symbol = ~Group,
colors = "Set1", type="scatter", mode="markers") %>%
layout(xaxis = list(showgrid = F,
zeroline = F,
autorange = T
),
yaxis = list(showgrid = F,
zeroline = F,
autorange = T
)
) %>%
animation_opts(
redraw=FALSE,
frame = 2000,
transition = 500) %>%
animation_slider(
currentvalue = list(
prefix = "Preprocessing: ",
font = list(color = "red")))
In contrast to t-SNE, the data was split into the discovery group for training the mapping functions and the validation group to which the mapping functions were applied. Both groups are visible in the 2D scatterplot with different symbols. Since more than two dimensions were calculated for the methods used here, an assessment based on visualization is less useful. Instead, the meaningfulness of the DR methods will be discussed later in combination with the diagnosis classification. The final conclusion is made when the DR methods are applied to the validation group, based on metrics like the AUROC score.
Herein, PCA and kPCA show only a moderate amount of separation in the first two dimensions. Note that there are 58 or 59 additional dimensions, respectively, so a good separation does not need to appear in the first two dimensions. By contrast, a sharper separation of patient groups is visible using the first two dimensions of UMAP. Similar to t-SNE, the UC and CD patients are more intertwined with each other than with the control group. However, the CD patients tend to mix with the healthy controls more than the UC patients do.
In the following chapter, it will be assessed whether the DR methods can increase the performance of ML models.
Herein, several ML algorithms are used for the prediction of the
patient’s diagnosis. First, all three potential diagnoses are considered
(multiclass classification). Then, the UC and CD patients are grouped
together as IBD patients for the binary classification of sick patients
(IBD) and healthy controls.
Biological data are often not only high-dimensional, but also involve non-linear relationships, for example the non-linear Michaelis-Menten model for enzyme kinetics. Therefore, it can be assumed that metabolomic data also incorporate non-linear relationships. This would impair the performance of strictly linear classifiers like partial least-squares discriminant analysis (PLS-DA), which is often used by the metabolomics community. Therefore, non-linear classifiers like random forest (RF), support vector machines (SVM) with the kernel trick, the k-nearest neighbors algorithm (k-NN), and gradient boosted trees (XGBoost) will be used in comparison to the linear algorithms PLS-DA and multinomial logistic regression. The latter is a subtype of generalized linear models (GLM) for multiclass prediction with a logit link function.
Herein, the basis for the machine learning is the tidymodels package. In particular, the list-column workflow (LCW) will be used, in which the map function of the purrr package and the nest/unnest functions of the tidyr package are integral components. Here, all objects, including dataframes, parameters, and different stages of the workflow, are stored in one nested dataframe, and operations like training and testing can easily be done by using the map function. This workflow has many advantages, such as keeping all data in one place, reducing code (DRY principle), and a tidy structure for easier subsequent manipulations. An interactive course for the LCW can be found on DataCamp and an example with further literature recommendations here.
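As a generic illustration of the LCW (a toy example with the built-in mtcars data, unrelated to the study data), one model is fitted per nested group with map:
mtcars %>%
  group_by(cyl) %>%
  nest() %>% # one nested data frame per cylinder group
  mutate(model = map(data, ~ lm(mpg ~ wt, data = .x)), # fit one linear model per row
         slope = map_dbl(model, ~ coef(.x)[["wt"]])) %>% # extract one number per model
  select(cyl, slope)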
The training of the classifiers was performed with five-fold cross-validation using the PRISM discovery group. Hyperparameter tuning was done for all models in a separate notebook, and the best hyperparameters based on the AUROC score were chosen. Using the recipe workflow, pre-processing such as dummy coding or DR is easily integrated. DR increased the performance on the validation set for only one model, the k-NN classifier. Hyperparameters were also tuned for a multilayer perceptron (MLP) with one hidden layer with prior dimension reduction. The cross-validated AUROC score was 0.87, but the AUROC score for the validation set was 0.5. This means that the model generalizes poorly to unseen data. Moreover, training an MLP with high-dimensional data is not a feasible process at scale, since the large input layer severely increases the number of weights of the network. Herein, DR was necessary for tuning the MLP so that training remained feasible on a standard notebook or cloud workspace.
For future model development, workflowsets of the tidymodels universe could be used for screening different combinations of pre-processing methods (e.g. DR) and model specifications (e.g. random forest or k-NN and their specific hyperparameters), as sketched below.
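A minimal sketch of such a screen, assuming the workflowsets package (not used in this project; the recipes, model specifications, and the number of PCA components are simplified examples):
library(workflowsets)
# Hypothetical sketch only: cross two recipes with two model specifications (2 x 2 = 4 workflows)
# and evaluate each workflow on the cross-validation folds.
wf_set <- workflow_set(
  preproc = list(
    plain = recipe(Diagnosis ~ ., data = train_df),
    pca   = recipe(Diagnosis ~ ., data = train_df) %>%
      step_normalize(all_numeric_predictors()) %>%
      step_pca(all_numeric_predictors(), num_comp = 20)
  ),
  models = list(
    rf  = rand_forest() %>% set_engine("ranger") %>% set_mode("classification"),
    knn = nearest_neighbor() %>% set_engine("kknn") %>% set_mode("classification")
  ),
  cross = TRUE
)
# wf_set %>% workflow_map("fit_resamples", resamples = folds)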
First, the tibble with the names of the models and the train and test
data is produced.
data_ml <- tibble(classifier = c("RF", "PLSDA", "KNN", "SVM", "XGB", "GLM"),
train = list(train_df,train_df,train_df,train_df,train_df,train_df),
test = list(test_df,test_df,test_df,test_df,test_df,test_df))
Next, the recipes and the models are specified with the hyperparameters found by grid search using 5-fold cross-validation. The parameter grids were generated with the grid_max_entropy function of the dials package. This function covers the parameter space so that every combination has about the same distance from the others.
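For example, the SVM grid could have been generated like this; a sketch of the tuning done in the separate notebook, where the chosen parameter objects and the grid size are illustrative assumptions:
# Space-filling (max-entropy) grid for the RBF-SVM hyperparameters.
svm_grid <- grid_max_entropy(cost(), rbf_sigma(), svm_margin(), size = 20)
head(svm_grid)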
recipe_rf <- recipe(Diagnosis ~ ., data=train_df)
recipe_pls_svm <- recipe(Diagnosis ~ ., data=train_df) %>%
step_normalize(all_numeric(),-all_outcomes()) %>%
step_dummy(all_nominal(), -all_outcomes())
recipe_pca_knn <- recipe(Diagnosis ~ ., data=train_df) %>%
step_normalize(all_numeric(),-all_outcomes()) %>%
step_dummy(all_nominal(), -all_outcomes()) %>%
step_pca(all_predictors(), num_comp = 19)
recipe_xgb_glm <- recipe(Diagnosis ~ ., data=train_df) %>%
step_dummy(all_nominal(), -all_outcomes()) %>%
step_mutate_at(all_predictors(), fn = ~ as.numeric(.)) # required by XGBoost
model_PLSDA <- pls(num_comp=3) %>%
set_engine("mixOmics") %>%
set_mode("classification")
model_RF <- rand_forest(trees = 154,
min_n = 8) %>%
set_engine("ranger", importance = "impurity") %>%
set_mode("classification")
model_KNN <- nearest_neighbor(neighbors = 88) %>%
set_engine("kknn") %>%
set_mode("classification")
model_SVM <- svm_rbf(cost=27.695185,
rbf_sigma = 3.779591e-06,
margin = 0.01473722) %>%
set_engine("kernlab") %>%
set_mode("classification")
model_XGB <- boost_tree(trees = 426,
min_n = 5,
tree_depth = 12,
learn_rate = 0.090994590,
loss_reduction = 1.094324e-10) %>%
set_engine("xgboost") %>%
set_mode("classification")
model_GLM <- multinom_reg(penalty = 0.67959909,
mixture = 0.0592261) %>%
set_engine("glmnet") %>%
set_mode("classification")
Now, the workflows are constructed in a separate tibble.
df_wflow <- tibble(classifier=data_ml$classifier, wflow=c(),
recipe=list(recipe_rf, recipe_pls_svm, recipe_pca_knn, recipe_pls_svm, recipe_xgb_glm, recipe_xgb_glm))
for (i in df_wflow$classifier) {
x <- paste("model_",i, sep="")
j <- which(df_wflow$classifier == i)
df_wflow$wflow[[j]] <- workflow() %>%
add_model(eval(parse(text = x))) %>%
add_recipe(df_wflow$recipe[[j]])
}
## Warning: Unknown or uninitialised column: `wflow`.
Next, the LCW is used for the first time to train the models and subsequently predict the diagnosis for the validation group.
set.seed(42)
tic()
data_ml_fit <- data_ml %>% add_column(workflow = df_wflow$wflow) %>%
mutate(fitted = map2(workflow, train, ~fit(.x,.y))) %>%
mutate(pred_prob = map2(fitted, test, ~predict(.x,.y, type="prob"))) %>%
mutate(pred_class = map2(fitted, test, ~predict(.x,.y, type="class")))
toc()
## 6264.639 sec elapsed
The models are compared by several metrics: the AUROC score (here ROCAUC), the precision-recall AUC (PRAUC), the balanced accuracy, sensitivity, and specificity. The ROCAUC and the PRAUC score are calculated from the class probabilities at different decision thresholds. The other metrics are derived from the confusion matrix. In case of class imbalance, the common metric accuracy should be avoided, since it can be very misleading in terms of model performance. The sensitivity (or recall, true positive rate) and specificity (true negative rate) are not affected by the class distribution and are therefore robust to class imbalances. The balanced accuracy is the average of the sensitivity and specificity and a better choice for imbalanced data sets. The AUROC score, however, is sensitive to class imbalances. In contrast, the PRAUC score is robust to class imbalances and the better choice when the positive class is of interest.
For the multi-class case, the metrics are calculated by averaging multiple binary cases. Therefore, each class is compared as positive (one) against all other classes as negatives (rest). This method is also known as one-vs-rest classification; a toy example is sketched below.
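A tiny toy example (hypothetical labels, not the study data) shows how yardstick macro-averages the one-vs-rest sensitivities:
toy <- tibble(truth = factor(c("A", "A", "B", "B", "C", "C"), levels = c("A", "B", "C")),
              pred  = factor(c("A", "B", "B", "B", "C", "A"), levels = c("A", "B", "C")))
sens(toy, truth = truth, estimate = pred, estimator = "macro")
# equals the mean of the one-vs-rest sensitivities: (1/2 + 2/2 + 1/2) / 3 = 0.67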
calc_metrics <- function(data_ml_fit) {
df <- tibble(Classifier=data_ml_fit$classifier, ROCAUC=c(NA),PRAUC=c(NA),`Balanced Accuracy`=c(NA), Sensitivity=c(NA), Specificity=c(NA))
for (i in 1:nrow(data_ml_fit)) {
df$ROCAUC[[i]] <- data_ml_fit$pred_prob[[i]] %>% roc_auc(estimate=c(names(data_ml_fit$pred_prob[[i]])), truth= test_df_truth$Diagnosis,estimator="hand_till") %>% pull(.estimate) %>% round(2)
df$PRAUC[[i]] <- data_ml_fit$pred_prob[[i]] %>% pr_auc(estimate=c(names(data_ml_fit$pred_prob[[i]])), truth= test_df_truth$Diagnosis,estimator="macro") %>% pull(.estimate) %>% round(2)
df$`Balanced Accuracy`[[i]] <- data_ml_fit$pred_class[[i]] %>% bal_accuracy(estimate=data_ml_fit$pred_class[[i]] %>% pull(),
truth=test_df_truth$Diagnosis,
estimator="macro") %>% pull(.estimate) %>% round(2) # average of sensitivity and specificity
df$Sensitivity[[i]] <- data_ml_fit$pred_class[[i]] %>% sens(estimate=data_ml_fit$pred_class[[i]] %>% pull(), truth=test_df_truth$Diagnosis) %>% pull(.estimate) %>% round(2)
df$Specificity[[i]] <- data_ml_fit$pred_class[[i]] %>% spec(estimate=data_ml_fit$pred_class[[i]] %>% pull(), truth=test_df_truth$Diagnosis) %>% pull(.estimate) %>% round(2)
}
return(df %>% unnest(cols = c(ROCAUC, PRAUC, `Balanced Accuracy`, Sensitivity, Specificity)))
}
calc_metrics(data_ml_fit)
## # A tibble: 6 × 6
## Classifier ROCAUC PRAUC `Balanced Accuracy` Sensitivity Specificity
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 RF 0.9 0.84 0.83 0.77 0.89
## 2 PLSDA 0.89 0.81 0.79 0.72 0.86
## 3 KNN 0.87 0.81 0.61 0.48 0.74
## 4 SVM 0.84 0.76 0.81 0.75 0.88
## 5 XGB 0.9 0.85 0.82 0.76 0.88
## 6 GLM 0.85 0.73 0.79 0.72 0.86
In contrast to the binary case, in multi-class prediction the class of interest doesn’t need to be specified, because the average will be used for the metrics.
The random forest and the gradient boosted trees are the best models based on all metrics. Based on the ROCAUC and PRAUC scores, the PLSDA and KNN classifiers follow close behind. But in terms of balanced accuracy, sensitivity, and specificity, both perform worse, particularly the KNN. By contrast, the GLM and the SVM model performed better than the KNN and similar to the PLSDA model based on the balanced accuracy, sensitivity, and specificity.
Next, the output of the classifiers will be used to build an ensemble model with the stacks package. For this, the top three models are selected to fit the stacked model. This usually increases the model performance incrementally.
set.seed(42)
resamples <- vfold_cv(train_df, v=5)
ctrl_res <- control_stack_resamples()
tic()
data_ml_fit_stack <- data_ml_fit %>% filter(classifier %in% c("RF", "PLSDA", "XGB")) %>%
mutate(resamples = list(resamples, resamples, resamples)) %>%
mutate(stacks = map2(workflow, resamples, ~fit_resamples(.x,.y, control=ctrl_res)))
## ℹ The workflow being saved contains a recipe, which is 13.25 Mb in
## ℹ memory. If this was not intentional, please set the control setting
## ℹ `save_workflow = FALSE`.
## ℹ The workflow being saved contains a recipe, which is 13.28 Mb in
## ℹ memory. If this was not intentional, please set the control setting
## ℹ `save_workflow = FALSE`.
## ℹ The workflow being saved contains a recipe, which is 13.28 Mb in
## ℹ memory. If this was not intentional, please set the control setting
## ℹ `save_workflow = FALSE`.
toc()
## 326.517 sec elapsed
mdl_stack_fit <- stacks() %>%
add_candidates(data_ml_fit_stack$stacks[[1]]) %>%
add_candidates(data_ml_fit_stack$stacks[[2]]) %>%
add_candidates(data_ml_fit_stack$stacks[[3]]) %>%
blend_predictions(non_negative = TRUE) %>%
fit_members()
## The inputted `name` argument cannot prefix a valid column name. The
## • data stack will use "data_ml_fit_stack.stacks..1.." rather than
## • "data_ml_fit_stack$stacks[[1]]" in constructing candidate names.
## The inputted `name` argument cannot prefix a valid column name. The
## • data stack will use "data_ml_fit_stack.stacks..2.." rather than
## • "data_ml_fit_stack$stacks[[2]]" in constructing candidate names.
## The inputted `name` argument cannot prefix a valid column name. The
## • data stack will use "data_ml_fit_stack.stacks..3.." rather than
## • "data_ml_fit_stack$stacks[[3]]" in constructing candidate names.
data_stack <- tibble(classifier = "Stack")
data_stack$pred_class[[1]] <- predict(mdl_stack_fit, new_data=test_df, type="class")
## Warning: Unknown or uninitialised column: `pred_class`.
data_stack_prob <- predict(mdl_stack_fit,test_df, type="prob")
data_stack$pred_prob[[1]] <- tibble(pred_prob=list(data_stack_prob)) %>% unnest(pred_prob)
## Warning: Unknown or uninitialised column: `pred_prob`.
data_ml_fit <- bind_rows(data_ml_fit, data_stack)
calc_metrics(data_ml_fit)
## # A tibble: 7 × 6
## Classifier ROCAUC PRAUC `Balanced Accuracy` Sensitivity Specificity
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 RF 0.9 0.84 0.83 0.77 0.89
## 2 PLSDA 0.89 0.81 0.79 0.72 0.86
## 3 KNN 0.87 0.81 0.61 0.48 0.74
## 4 SVM 0.84 0.76 0.81 0.75 0.88
## 5 XGB 0.9 0.85 0.82 0.76 0.88
## 6 GLM 0.85 0.73 0.79 0.72 0.86
## 7 Stack 0.88 0.81 0.85 0.8 0.9
The stacked model performed best in terms of balanced accuracy, sensitivity, and specificity. However, it is slightly worse than the RF and XGBoost models based on the ROCAUC score. An even larger drop in performance is observed in its PRAUC score of 0.81.
Next, the ROC curve will be plotted for every class against all other classes.
df <- tibble(Classifier=data_ml_fit$classifier)
for (i in 1:nrow(data_ml_fit)) {
df$AUC[[i]] <- data_ml_fit$pred_prob[[i]] %>%
roc_curve(estimate=c(names(data_ml_fit$pred_prob[[i]])), truth=test_df_truth$Diagnosis)
}
## Warning: Unknown or uninitialised column: `AUC`.
bind_rows_func <- function(df){
df_auc <- bind_rows(RF = df$AUC[[1]],
PLSDA = df$AUC[[2]],
KNN = df$AUC[[3]],
SVM = df$AUC[[4]],
XGB = df$AUC[[5]],
GLM = df$AUC[[6]],
Stack = df$AUC[[7]],
.id= "Model")
return(df_auc)
}
bind_rows_func(df) %>% group_by(Model) %>% mutate(.level = case_when(.level == 1 ~ "Control",
.level == 2 ~ "CD",
TRUE ~ "UC")) %>%
mutate(.level = fct_relevel(as.factor(.level), c("Control", "CD", "UC")),
Model = as.factor(Model),
Model = fct_relevel(Model, data_ml_fit$classifier)) %>%
ggplot(aes(x = 1 - specificity, y = sensitivity, color=Model,group=Model)) +
geom_path() +
geom_abline(lty = 3) +
coord_equal() +
facet_wrap(~.level) +
theme_bw()
The best model would have a sensitivity of 1 and specificity (true negative rate) of 1. The false-positive rate is 1 - specificity and is zero for a perfect classifier. Thus, the ROC curve of a perfect classifier would start at (0,0) and go straight to (0,1) and then straight to (1,1). This results in an area under the curve of 1 (AUROC score).
Herein, the ROC curve of the control class is the best: at a low false-positive rate, a high sensitivity is achieved for all models. The ROC curve for the UC class is slightly worse. In contrast, the CD curve is clearly worse. This means that, at a low false-positive rate, control and UC patients are more often correctly assigned to their respective condition (out of all patients with that condition) than CD patients.
However, the sick patients are of interest. Therefore, we would like to increase the sensitivity for the CD and UC groups. Also, the specificity is less important since false positives are likely to be detected in follow-up checks. Perhaps the sensitive detection of CD and UC patients could be improved when the UC and CD diagnoses are combined. Thus, further models are created for the binary case of sick patients against healthy controls.
Next, the confusion matrix of the best performing model, the RF, is plotted. This plot will visualize the correct and incorrect classifications.
df_conf <- tibble(pred = data_ml_fit$pred_class[[1]] %>% pull(), truth = test_df_truth$Diagnosis) %>%
mutate(pred = case_when(pred == 1 ~ "Control",
pred == 2 ~ "CD",
TRUE ~ "UC"),
truth = case_when(truth == 1 ~ "Control",
truth == 2 ~ "CD",
TRUE ~ "UC"),
truth = fct_relevel(as.factor(truth), c("Control", "CD", "UC")),
pred = fct_relevel(as.factor(pred), c("Control", "CD", "UC")))
conf_mat(df_conf, truth=pred, estimate = truth, dnn = c("Truth","Prediction")) %>% autoplot(type="heatmap") # Predicted and test labels were switched to transpose the matrix so that the true labels are the rows
accuracy(df_conf, truth=truth, estimate=pred)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy multiclass 0.769
The highest sensitivity was achieved for the control diagnosis (0.86). The CD and UC patients were detected with a sensitivity of 0.75 and 0.70, respectively. All misclassified patients of the UC group were predicted as CD patients, whereas the misclassified patients of the CD and control groups are more evenly distributed across the classes. The difficult separation of the UC and CD patients was to some extent anticipated from the dimension reduction plots.
The overall accuracy of the RF is 0.77, which is identical to the performance of the RF model published by Franzosa et al. In contrast to this model, their model misclassified UC patients more evenly across all groups, and all misclassified CD patients were predicted as control patients.
In a comparative study by Mendez et al., the data set was also used to build different classifiers. However, they used only the data from the negative C18 column for the binary case of CD vs UC patients. Their model achieved only an AUROC score of 0.73 on the validation set. Even though they used a different data split and overall data set, this also highlights the difficulty of discriminating UC and CD patients from each other.
For a better prediction of sick patients, the binary case of sick patients versus healthy controls will be evaluated. For the sake of simplicity, the probabilities and the predicted classes of the UC and CD classes are combined into an IBD class, instead of training and tuning the models again.
data_ml_fit_ibd <- data_ml_fit
for (i in 1:nrow(data_ml_fit_ibd)) {
data_ml_fit_ibd$pred_prob[[i]] <- data_ml_fit$pred_prob[[i]] %>% mutate(.pred_IBD = .pred_2 + .pred_3) %>%
select(.pred_Control = .pred_1, -c(.pred_2, .pred_3)) %>%
add_column(test_df_truth %>% mutate(Diagnosis = case_when(Diagnosis == "2" ~ "IBD",
Diagnosis == "3" ~ "IBD",
TRUE ~ "Control"))) %>%
mutate(Diagnosis = as.factor(Diagnosis)) %>%
select(truth = Diagnosis, everything()) %>% droplevels()
}
for (i in 1:nrow(data_ml_fit_ibd)) {
data_ml_fit_ibd$pred_class[[i]] <- data_ml_fit$pred_class[[i]] %>% mutate(.pred_class = case_when(
.pred_class == "1" ~ "Control",
.pred_class == "3" ~ "IBD",
TRUE ~ "IBD")) %>%
mutate(.pred_class = as.factor(.pred_class)) %>%
add_column(test_df_truth %>% mutate(Diagnosis = case_when(Diagnosis == "2" ~ "IBD",
Diagnosis == "3" ~ "IBD",
TRUE ~ "Control"))) %>%
mutate(Diagnosis = as.factor(Diagnosis)) %>%
select(truth = Diagnosis, everything())
}
Typically, in a binary classification problem, the class of interest is called the positive class. Herein, the class of interest is IBD, the sick patients. Therefore, IBD is the positive class. Since sensitivity, specificity, and balanced accuracy are calculated from the confusion matrix, it is necessary to specify the order (level) of the classes. This can be done by manipulating the data or by choosing the right event_level for the truth column in the yardstick functions, e.g. sens().
cat("Levels of predicted column:",levels(data_ml_fit_ibd$pred_class[[1]]$.pred_class),
"\nLevels of truth column:",levels(data_ml_fit_ibd$pred_class[[1]]$truth))
## Levels of predicted column: Control IBD
## Levels of truth column: Control IBD
Since the class of interest, IBD, has the second level, the
event_level should be specified as “second”.
calc_metrics_ibd <- function(data_ml_fit_ibd) {
  df <- tibble(Classifier = data_ml_fit_ibd$classifier,
               AUC = NA_real_, PRAUC = NA_real_, `Balanced Accuracy` = NA_real_,
               Sensitivity = NA_real_, Specificity = NA_real_)
  for (i in 1:nrow(data_ml_fit_ibd)) {
    # ROC AUC based on the Control class probability (for the binary case this equals the AUC based on .pred_IBD)
    df$AUC[[i]] <- data_ml_fit_ibd$pred_prob[[i]] %>% roc_auc(truth, .pred_Control) %>% pull(.estimate) %>% round(2)
    # Macro-weighted PR AUC calculated from the original multiclass predictions
    df$PRAUC[[i]] <- data_ml_fit$pred_prob[[i]] %>% pr_auc(estimate = c(names(data_ml_fit$pred_prob[[i]])), truth = test_df_truth$Diagnosis, estimator = "macro_weighted") %>% pull(.estimate) %>% round(2)
    # Balanced accuracy is the average of sensitivity and specificity; IBD (second level) is the event
    df$`Balanced Accuracy`[[i]] <- data_ml_fit_ibd$pred_class[[i]] %>% bal_accuracy(estimate = .pred_class, truth = truth, event_level = "second") %>% pull(.estimate) %>% round(2)
    df$Sensitivity[[i]] <- data_ml_fit_ibd$pred_class[[i]] %>% sens(estimate = .pred_class, truth = truth, event_level = "second") %>% pull(.estimate) %>% round(2)
    df$Specificity[[i]] <- data_ml_fit_ibd$pred_class[[i]] %>% spec(estimate = .pred_class, truth = truth, event_level = "second") %>% pull(.estimate) %>% round(2)
  }
  return(df)
}
# ROC curves of each classifier for the binary case (stored in a list column)
df_ibd_curve <- tibble(Classifier = data_ml_fit$classifier,
                       AUC = vector("list", nrow(data_ml_fit)))
for (i in 1:nrow(data_ml_fit_ibd)) {
  df_ibd_curve$AUC[[i]] <- data_ml_fit_ibd$pred_prob[[i]] %>% roc_curve(truth, .pred_Control)
}
df_roc_curve_ibd <- bind_rows_func(df_ibd_curve) %>%
mutate(Model = as.factor(Model)) %>%
mutate(Model = fct_relevel(Model, data_ml_fit_ibd$classifier))
# ROC AUC scores of each classifier for the binary case (used for the plot annotation)
df_ibd_score <- tibble(Classifier = data_ml_fit_ibd$classifier,
                       AUC = vector("list", nrow(data_ml_fit_ibd)))
for (i in 1:nrow(data_ml_fit_ibd)) {
  df_ibd_score$AUC[[i]] <- data_ml_fit_ibd$pred_prob[[i]] %>% roc_auc(truth, .pred_Control)
}
df_roc_score <- bind_rows_func(df_ibd_score) %>%
select(Model, AUC = .estimate) %>% mutate(AUC = round(AUC, 2))
calc_metrics_ibd(data_ml_fit_ibd)
## # A tibble: 7 × 6
## Classifier AUC PRAUC `Balanced Accuracy` Sensitivity Specificity
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 RF 0.93 0.85 0.91 0.95 0.86
## 2 PLSDA 0.89 0.81 0.86 0.77 0.95
## 3 KNN 0.91 0.81 0.72 0.93 0.5
## 4 SVM 0.88 0.76 0.86 0.77 0.95
## 5 XGB 0.91 0.85 0.84 0.91 0.77
## 6 GLM 0.88 0.74 0.84 0.91 0.77
## 7 Stack 0.93 0.82 0.9 0.88 0.91
The performance of all models increased in almost every metric for the binary case. The random forest is the best model based on the most important metrics, PRAUC and sensitivity (0.85 and 0.95, respectively). The KNN, XGB, and GLM also displayed a high sensitivity for the sick patients, but of these only the XGB scored high on PRAUC (0.85). Based on the AUC, the RF, KNN, XGB, and Stack performed best. The stacked model exceeded the specificity of the best model, the RF, but its sensitivity was lower.
truth <- data_ml_fit_ibd$pred_class[[1]]$truth
pred <- data_ml_fit_ibd$pred_class[[1]]$.pred_class
df_conf_ibd <- tibble(truth=truth, pred=pred)
# Predicted and test labels are switched to transpose the matrix so that the true labels are the rows
conf_mat(df_conf_ibd, truth = pred, estimate = truth, dnn = c("Truth", "Prediction")) %>%
  autoplot(type = "heatmap")
The confusion matrix of the RF model clearly displays the low rate of misclassification for both classes. Almost all IBD patients were detected (sensitivity of 0.95), which is the main purpose of the model.
df_geom <- data.frame(Model = data_ml_fit_ibd$classifier, Score = (paste0(" (AUC = ", df_roc_score$AUC, ")"))) %>%
mutate(Model = as.factor(Model),
Model = fct_relevel(Model, data_ml_fit_ibd$classifier))
df_roc_curve_ibd %>%
ggplot(aes(x = 1 - specificity, y = sensitivity, color=Model, group=Model)) +
geom_path() +
geom_abline(lty = 3) +
coord_equal() +
theme_bw() +
theme(legend.position="none")+
geom_text(data=df_geom, aes(x = .63, y = .075, label = Model), size= 3.5, position="stack",hjust=0, show.legend=FALSE)+
geom_text(data=df_geom, aes(x = .76, y = .075, label = Score, group=Model, color=NULL),size= 3.5,position="stack", hjust=0, show.legend=FALSE)
Overall, all models can be classified as A models (AUC >= 0.9) or as very good B models (AUC >= 0.8). There are differences across the metrics, and depending on the intended application, some models are superior. For example, the RF is the best choice when a high sensitivity for IBD patients is required.
For comparison, Franzosa et al. also utilized an RF model for the discrimination of IBD patients and healthy controls. Their model achieved an AUROC score of 0.89 for the validation cohort, which was outperformed by the RF used herein (0.93).
Variable importance allows us to infer which features influence the predicted outcome and to what extent. Some algorithms, like random forest or gradient boosted trees, have an inherent (model-based) way to quantify the relative influence of each feature. In generalized linear models and PLS-DA, the variable importance is represented by the magnitude of the standardized coefficients. Unfortunately, not all algorithms provide model-based feature importance, e.g. KNN or SVM. However, model-agnostic interpretation methods work for every model, e.g. permutation-based importance: a single feature is shuffled, and the resulting change in model performance is measured, indicating how much the model relies on that feature. Note that permutation-based variable importance does not reflect the intrinsic significance of a feature but how important it is for that particular model. Moreover, in the presence of multicollinearity, this method may rate none of the correlated features as important, because the remaining features can stand in for the shuffled one. This issue could be mitigated by clustering features based on Spearman rank-order correlations and selecting only one feature from each cluster.
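As an illustration, the sketch below shows how permutation importance could be computed for any of the fitted models; fit, test_df, and outcome are hypothetical placeholders (a fitted workflow, a test set, and the name of the outcome column), and the snippet is not run as part of this analysis. Alternatively, the already loaded vip package provides vi_permute() for the same purpose.
# Minimal sketch of permutation-based variable importance (hypothetical objects
# `fit`, `test_df`, and `outcome`; not run in this analysis).
permutation_importance <- function(fit, test_df, outcome, n_rep = 10) {
  # Accuracy on the unmodified test set
  baseline <- mean(predict(fit, test_df)$.pred_class == test_df[[outcome]])
  features <- setdiff(names(test_df), outcome)
  map_dfr(features, function(feat) {
    drops <- map_dbl(seq_len(n_rep), function(r) {
      shuffled <- test_df
      shuffled[[feat]] <- sample(shuffled[[feat]])  # break the feature-outcome association
      baseline - mean(predict(fit, shuffled)$.pred_class == shuffled[[outcome]])
    })
    tibble(Feature = feat, Importance = mean(drops))  # average accuracy drop over n_rep shuffles
  }) %>%
    arrange(desc(Importance))
}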
Herein, only the variable importance of the RF, XGB, and PLSDA models is considered for the sake of simplicity. The ten most important features of each model are selected and grouped by putative chemical class.
# Model-based variable importance via the vip package (RF and XGB) and the
# predictor loadings of the PLS-DA model extracted with tidy()
vi_rf <- data_ml_fit$fitted[[1]] %>% extract_fit_parsnip() %>% vi() %>% rename(Feature = Variable)
vi_xgb <- data_ml_fit$fitted[[5]] %>% extract_fit_parsnip() %>% vi() %>% rename(Feature = Variable)
vi_plsda <- data_ml_fit$fitted[[2]] %>% extract_fit_parsnip() %>% tidy() %>% filter(type == "predictors")
# Harmonize the feature names, join the annotations, and keep the ten most important features per model
df_fet <- df_met_feature %>% select(Feature = `Metabolomic Feature`, everything()) %>% as.data.frame()
df_fet$Feature <- str_replace(df_fet$Feature, "-", ".")
top_10_rf <- left_join(df_fet, vi_rf, by = c("Feature")) %>% drop_na(`Putative Chemical Class`) %>% top_n(wt = Importance, 10)
top_10_xgb <- left_join(df_fet, vi_xgb, by = c("Feature")) %>% drop_na(`Putative Chemical Class`) %>% top_n(wt = Importance, 10)
top_10_plsda <- left_join(df_fet, vi_plsda, by = c("Feature" = "term")) %>% drop_na(`Putative Chemical Class`) %>% top_n(wt = value, 10) %>% select(Importance = value, everything(), -type, -component)
top_30 <- rbind(top_10_rf, top_10_xgb, top_10_plsda)
top_30 %>% count(`Putative Chemical Class`) %>% filter(n > 1) %>% ggplot(aes(x=fct_reorder(`Putative Chemical Class`, n), y=n)) +
geom_col() +
coord_flip() +
labs(x="Putative chemical class",
y="Count")
Tetrapyrrole derivatives are the most important chemical class for the prediction of the diagnosis across the three selected models: they appeared eight times in total among the ten most important features, whereas the other classes shown appeared at least twice. Tetrapyrrole derivatives could originate from vitamin B12 or heme. Vitamin B12 could be produced by bacteria or ingested. Moreover, heme, which could originate from food or from bleeding in the gastrointestinal tract, could lead to gut dysbiosis. The alpha-branched alpha,beta-unsaturated ketone class is also interesting and could hint at microbial (xenobiotic) metabolism: in contrast to host enzymes, which usually rely on oxidative and conjugative reactions, gut microbes primarily use hydrolytic and reductive reactions (https://www.science.org/doi/10.1126/science.aag2770). Due to the inherently electrophilic character of this chemical class, these compounds have the capacity to cause adverse and toxic effects (https://pubmed.ncbi.nlm.nih.gov/15199949/).
These results could be refined further with a more accurate identification of the chemical compounds. Moreover, differential abundance analysis would be a complementary method to elucidate the role of these compounds in IBD.
After data wrangling and exploratory data analysis, first impressions of how difficult it is to discriminate the patient groups were already gathered from the dimension reduction visualizations. Both linear and non-linear classifiers achieved satisfactory results, which could indicate that non-linear relationships are not predominant in this case. The UC and CD patients can be grouped as IBD patients to detect sick patients with a high sensitivity. More studies are required to discriminate between CD and UC patients appropriately. This could be done either by further tuning the classification models or by finding a better LC-MS analysis method. For example, different dimension reduction methods could be combined with a variety of classifiers and screened using workflowsets (see the sketch below). Moreover, variable importance analysis and/or differential expression analysis could be performed for the binary case of CD vs. UC. The LC-MS method could then be optimized for the predominant discriminating substance class.
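A minimal sketch of such a screening with workflowsets is shown below; the recipes (rec_pca, rec_pls), model specifications (rf_spec, svm_spec), and resamples (cv_folds) are hypothetical placeholders that would have to be defined analogously to the earlier sections.
# Minimal sketch of screening dimension reduction/classifier combinations with
# workflowsets (rec_pca, rec_pls, rf_spec, svm_spec, and cv_folds are placeholders).
library(workflowsets)
wf_set <- workflow_set(
  preproc = list(pca = rec_pca, pls = rec_pls),   # candidate dimension reductions
  models  = list(rf = rf_spec, svm = svm_spec)    # candidate classifiers
)
wf_results <- wf_set %>%
  workflow_map("tune_grid",
               resamples = cv_folds,
               grid = 10,
               metrics = metric_set(roc_auc, sens))
rank_results(wf_results, rank_metric = "roc_auc")  # rank the combinations by ROC AUC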