Final Project

Introduction

The incessant contraction of cardiomyocytes demands tremendous amount of energy supply, thus, maintaining a robust metabolic homeostasis is essential for the heart. For decades, accumulating evidence suggests that perturbation of cardiac metabolism plays an important role in the pathological progression of HF. Accordingly, comprehensive quantitation of metabolite abundances in blood plasma, termed metabolite or metabolomics profiling, can provide mechanistic insights into the molecular alterations underlying maladaptive remodeling and HF. The large-scale quantification of circulating metabolites across multiple pathways may also identify metabolic changes that reflect clinical classification and outcomes of HF patients before and after optimal medical therapy and surgical interventions, thereby advancing important venues of biomarker and drug-target development together with validation of treatments.

Our workflow involves taking whole blood samples from 26 patients receiving MCSD implantation. The whole blood was extracted at various timepoints, and the plasma was isolated in order to determine the metabolic profile. Temporal profiling of plasma metabolites enabled the identification of significantly altered metabolites as well as metabolites that maintain a relative stable concentration in plasma regardless of pathophysiological changes (housekeeping metabolites). Two methods were used to identify significantly altered metabolites: differential expression analysis and a linear mixed model. In total, the metabolites selected through these processes were subjected to the following statistical tests.

Setting up the Data

In order to work with my data, I had to pull it from a data object created by another person working on my project. The manipulations I performed are below, but essentially, my process was as follows:

Create a matrix of the quantification of every metabolite(610) at each time point (11)
Select the metabolites that have been identified through differential expression (as shown in Table 1), linear mixed model (Table 2), or both (Table 3)
Attach the patient number, timepoint, and survival status to each row of data
Reorganize so the data was more manageable
Write each as its own .csv

#put all the metabolite quantifications into a workable file
metabolite_new <- Data_Obj_RMRedundancy$quantification

#1.create overlapping metabolites matrix
metabolite_select <- metabolite_new[,c("C12_p180","C14_p180","C14:2_p180", "C16:2_p180", "PC aa C36:1_p180","PE aa C26:4_Lipid", "PE aa C36:1_Lipid", "PE aa C36:2_Lipid", "PE aa C36:3_Lipid", "SM C19:0_Lipid", "Testosterone_Steroids")]

#2. transform the data and add the necessary identifiers
metabolite_select = data.frame(metabolite_select)
metabolite_select$Patient = Data_Obj_RMRedundancy$subjectNumber
metabolite_select$Time = Data_Obj_RMRedundancy$timepoint
metabolite_select$Surv = Data_Obj_RMRedundancy$deathStatus

#3. reshape the data so it looks good
select <- metabolite_select[c(12:14,1:11)]
row.names(select) <- paste0(select$Patient, select$Time)

select$Type <- substr(select$Patient, 1, 1)
select <- select[c(15,1:14)]

#4. write the csv
write.csv(select, file = "Selected Metabolites")




#set up other DF - differential expression
#1. select out the relevant metabolites
metabolite_select_diff <- metabolite_new[,c("PC aa C32:2_Lipid", "PC aa C36:1_Lipid", "PG aa C36:3_Lipid", "PE aa C36:1_Lipid", "PE aa C36:2_Lipid", "PE aa C36:3_Lipid", "PE ae C34:1_Lipid", "PE ae C36:3_Lipid", "LPC a C18:2_Lipid", "PC aa C36:3_Lipid", "LPE a C16:0_Lipid", "LPE e C18:0_Lipid", "PE aa C26:4_Lipid", "PC ae C36:3_Lipid", "PE ae C34:3_Lipid", "Glu_p180", "Dopamine_p180",
"PC aa C42:0_p180", "N-C20:0-Cer(2H)_Lipid", "SM C19:0_Lipid", "SM C23:3_Lipid", "Testosterone_Steroids", "C12_p180", "C14_p180", "C16:1_p180", "C16:2_p180", "alpha-KGA_Energy Metabolism", "Fum_Energy Metabolism", "DHA_PUFAs", "AA_PUFAs", "N-C9:0-Cer_Lipid",
"N-C15:1-Cer_Lipid", "N-C20:0(OH)-Cer_Lipid", "N-C20:0(OH)-Cer(2H)_Lipid", "C14:0_FFA", "C14:2_p180")]

#2. apply the correct identifiers
metabolite_select_diff = data.frame(metabolite_select_diff)
metabolite_select_diff$Patient = Data_Obj_RMRedundancy$subjectNumber
metabolite_select_diff$Time = Data_Obj_RMRedundancy$timepoint
metabolite_select_diff$Surv = Data_Obj_RMRedundancy$deathStatus

#3. reorganize
select_diff <- metabolite_select_diff[c(37:39,1:36)]
row.names(select_diff) <- paste0(select_diff$Patient, select_diff$Time)

select_diff$Type <- substr(select_diff$Patient, 1, 1)
select_diff <- select_diff[c(40,1:39)]

#4. write the csv
write.csv(select_diff, file = "Selected Metabolites Diff")



#set up other DF - LMM
#1. select out the relevant metabolites
metabolite_select_LMM <- metabolite_new[,c("Kyn_p180", "C0_p180", "C2_p180", "C4_p180", "C4:1_p180", "C4-OH (C3-DC)_p180", "C5_p180", "C5:1_p180", "C5:1-DC_p180","C5-DC (C6-OH)_p180", "C5-OH (C3-DC-M)_p180", "C6 (C4:1-DC)_p180", "C7-DC_p180", "C8_p180", "C9_p180", "C10_p180", "C10:1_p180", "C10:2_p180", "C12_p180", "C12:1_p180", "C14_p180", "C14:1-OH_p180", "C14:2_p180", "C14:2-OH_p180", "C16:2_p180", "C16-OH_p180", "LPC a C16:0_p180", "LPC a C17:0_p180", "LPC a C18:0_p180", "PC aa C32:0_p180", "PC aa C32:1_p180", "PC aa C34:1_p180", "SM (OH) C16:1_p180", "SM C18:0_p180", "SM C18:1_p180", "LPC e C18:0_Lipid", "PC aa C36:1_Lipid", "PG aa C32:1_Lipid", "PG ae C32:0_Lipid", "PE aa C26:4_Lipid", "PE aa C38:3_Lipid", "PE aa C34:0_Lipid", "PE aa C34:1_Lipid", "PE aa C34:2_Lipid", "PE aa C34:3_Lipid", "PE aa C36:1_Lipid", "PE aa C40:4_Lipid", "PE aa C36:2_Lipid", "PE aa C36:3_Lipid", "PC ae C36:5_Lipid", "PE ae C36:1_Lipid", "PE ae C36:2_Lipid", "PE ae C40:4_Lipid", "PE ae C36:4_Lipid", "PE ae C36:5_Lipid", "PE ae C38:1_Lipid", "PE ae C38:2_Lipid", "PE ae C38:3_Lipid", "N-C21:0(OH)-Cer_Lipid", "N-C22:0(OH)-Cer_Lipid", "N-C23:0(OH)-Cer_Lipid", "N-C24:0(OH)-Cer_Lipid", "N-C7:0-Cer(2H)_Lipid", "SM C17:0_Lipid",  "SM C19:0_Lipid", "SM C19:1_Lipid", "SM C19:2_Lipid", "SM C20:1_Lipid", "11-Deoxycorticosterone_Steroids", "11-Deoxycortisol_Steroids", "Testosterone_Steroids", "DOPA_p180")]

#2. apply the correct identifiers
metabolite_select_LMM = data.frame(metabolite_select_LMM)
metabolite_select_LMM$Patient = Data_Obj_RMRedundancy$subjectNumber
metabolite_select_LMM$Time = Data_Obj_RMRedundancy$timepoint
metabolite_select_LMM$Surv = Data_Obj_RMRedundancy$deathStatus

#3. reorganize
select_diff <- metabolite_select_LMM[c(73:75,1:72)]
row.names(select_diff) <- paste0(select_diff$Patient, select_diff$Time)

select_diff$Type <- substr(select_diff$Patient, 1, 1)
select_diff <- select_diff[c(76,1:75)]

#4. write the csv
write.csv(select_diff, file = "Selected Metabolites LMM")

Differentially Expressed Metabolites

The metabolites identified through this statistical method are summarized below.

Table 1: Metabolites Identified to be Upregulated and Downregulated

Upregulated	Downregulated
PC aa C32:2	Glu
PC aa C36:1	Dopamine
PG aa C36:3	PC aa C42:0
PE aa C36:1	N-C20:0-Cer(2H)
PE aa C36:2	SM C19:0
PE aa C36:3	SM C23:3
PE ae C34:1	Testosterone
PE ae C36:3	C12
LPC a C18:2	C14
PC aa C36:3	C16:1
LPE a C16:0	C16:2
LPE e C18:0	alpha-KGA
PE aa C26:4	Fum
PC ae C36:3	DHA
PE ae C34:3	AA
-	N-C9:0-Cer
-	N-C15:1-Cer
-	N-C20:0(OH)-Cer
-	N-C20:0(OH)-Cer(2H)
-	C14:0
-	C14:2

setwd("/Users/jenniferpolson/Desktop/Metabolomics")
DF <- read.csv("Selected Metabolites Diff")
row.names(DF) <- DF$X
DF <- DF[,2:41]

#set patients and control
DF_patient <- subset(DF, Type == "P")
DF_control <- subset(DF, Type == "C")

#set survival stats
DF_surv <- subset(DF, Surv == "Live")
DF_dead <- subset(DF, Surv == "Died")

nmetab = length(metabolites_diff)

PCA Visualization using Differentially Expressed Metabolites

I first wanted to see how well these metabolites would cluster the patients vs. controls, as well as the patients who survived vs. those who did not. I normalized the metabolite abundances, and performed a Principal Component Analysis (PCA). Using the dudi.pca function, I gathered the necessary analysis to visualize how these groups would cluster. The results are summarized in Figure 1

NA2Mean <- function(x){
  aColMean = colMeans(x, na.rm = TRUE);
  for (i in c(1:length(colnames(x)))){
    x[is.na(x[,i]),i] = aColMean[i];
  }
  y = x;
  return(y)
}

NormByCol <- function(x){
  aColMean = apply(x,2,mean, na.rm = TRUE);
  aColStd = apply(x,2,sd, na.rm = TRUE);
  y = sweep(x,2,aColMean, "-");
  y = sweep(y, 2, aColStd, "/");
  
  return(y);
}

par(mfrow = c(1,2), oma = c(0,0,2,0))
DF$Type <- as.character(DF$Type)
DF$Type[DF$Type == "P"] <- "Patient"
DF$Type[DF$Type == "C"] <- "Control"

diff0 <- DF[,6:ncol(DF)]
diff1 <- NA2Mean(diff0);
norm_diff <- NormByCol(diff1);

pcaData = dudi.pca(norm_diff, center = TRUE, scale = TRUE, scan = FALSE)

s.class(pcaData$li, factor(DF$Type), 
        clabel = 1.5,  col = c("cornflowerblue","darkorchid"), cpoint = 0.2,
        sub = "1A: Clustering Based on Type of Sample", csub = 0.8,
        xlim = c(-10,10), ylim = c(-10,10));



diff0 <- DF_patient[,6:ncol(DF_patient)]
diff1 <- NA2Mean(diff0);
norm_diff <- NormByCol(diff1);

pcaData = dudi.pca(norm_diff, center = TRUE, scale = TRUE, scan = FALSE)

s.class(pcaData$li, factor(DF_patient$Surv), 
        clabel = 1.5,  col = c("forestgreen","gold"), cpoint = 0.2,
        sub = "1B: Clustering Based on Survival Status", csub = 0.8,
        xlim = c(-10,10), ylim = c(-10,10));

mtext("Figure 1: PCA and Clustering of Patients vs. Controls - Differential Metabolites", outer = TRUE, cex = 2, font = 2)

Figure 1 shows the Principal Component Analyses clustering the patients and controls (Figure 1A) as well as the patients based on survival status (Figure 1B) from differentially expressed metabolites.

The results of this show that the patients are well clustered when compared to the controls, but the patients do not cluster well based on survival status. As such, I chose to move forward with performing recursive t-tests for each metabolite at every time point, for both two-group comparisons.

Heatmaps Comparing Patients and Controls

The overall goal of this analysis is to see the statistical differences between groups for the aforementioned selected metabolites at each timepoint for which there is a measurement. The reason for this comparison is that, if one can find a statistical difference at every time point, then that would greatly buttress the metabolites ability to predict status. Understandably, this is much more possible in theory than in practice. This requires subsetting the previously created file into the groups that will be tested, according to type (patient vs. control) and survival (alive vs. dead). Due to the fact that there is unequal variance between the population sizes, the t-test that will be performed is Welch’s t-test.

To perform this, a function was created to return the numeric p-value of the test results; this was applied to every subset for every metabolite. The results are visualized in a heatmap below. Figure 1 depicts the p-values for each test, while Figure 2 shows the adjusted p-values (q-values), using the false discovery rate method of adjustment.

#t-test function that will return Welch's statistic as numeric value
t_test <- function(x, y){
  result = t.test(x,y);
  return(as.numeric(result$p.value));
}

#create control vector
vector <- unique(DF_control$Time)

#creating the p-values for every timepoint and metabolite
result = c();
for (i in vector){
  #print(i);
  
  DF_S <- subset(DF_control, Time == i)[,-(1:4)];
  DF_D <- subset(DF_patient, Time == i)[,-(1:4)];
  
  tmp = lapply(DF_S, DF_D, FUN = t_test)
  result = rbind(result, tmp)
  
}


#set the timepoints
rownames(result) = vector

#change to numeric
result <- round(as.numeric(result),5)
p.result <- matrix(result, ncol = nmetab, byrow=T)
rownames(p.result) <- vector
colnames(p.result) <- metabolites_diff

#set the q-values
qvalues <- round(p.adjust(p.result, method = "fdr"), 5)
q.result <- matrix(qvalues, ncol = nmetab, byrow = T)
rownames(q.result) <- vector
colnames(q.result) <- metabolites_diff

#setup the color palette
my_palette <- colorRampPalette(c("tomato4", "sienna1","white"))(n = 10)

p.result <- t(p.result)
q.result <- t(q.result)

#create the graphs
par(mfrow = c(1,1), oma = c(2,2,2,8))
heatmap.2(p.result, Colv = FALSE,
          trace = "none", cellnote=ifelse(p.result>0.01, NA, p.result),
          notecol = "white", tracecol = "black", col = my_palette, 
          main = "Fig. 2: p-values of Differential Metabolites at Each Timepoint",
          dendrogram = "none")

Figure 2 illustrates the results of Welch’s t-test when comparing the patients and controls for each metabolite, at every time point. The values below 0.01 are listed in the corresponding box.

Figure 3 illustrates the results of the adjusted p-values (q-values) from Figure 2 using the false discovery rate method. The values below 0.01 are listed in the corresponding box.

Heatmaps Comparing Survival Status

When comparing patients and controls, the differentially expressed metabolites were all statistically different for each group. Moving forward, I chose to perform the same statistical analyses when comparing patients who survived, and patients who did not. The results are summarized in Figures 4 & 5.

#create timepoint vector to assign
vector <- unique(DF$Time)
vector = vector[-12]


#creating the p-values for every timepoint and metabolite
result = c();
for (i in vector){
  #print(i);
  
  DF_S <- subset(DF_surv, Time == i)[,-(1:4)];
  DF_D <- subset(DF_dead, Time == i)[,-(1:4)];
  
  tmp = lapply(DF_S, DF_D, FUN = t_test)
  
  result = rbind(result, tmp)
}


#set the timepoints
rownames(result) = vector

#reorder to get proper order
result <- result[c(1,6,2,7,8,3,4,5,9,10,11), ]

vector <- vector[c(1,6,2,7,8,3,4,5,9,10,11) ]
#change to numeric
result <- round(as.numeric(result),5)
p.result <- matrix(result, ncol = nmetab, byrow=T)
rownames(p.result) <- vector
colnames(p.result) <- metabolites_diff


qvalues <- round(p.adjust(p.result, method = "fdr"), 5)
q.result <- matrix(qvalues, ncol = nmetab, byrow = T)
rownames(q.result) <- vector
colnames(q.result) <- metabolites_diff

#setup the color palette
my_palette <- colorRampPalette(c("goldenrod3","lightgoldenrod1", "white"))(n = 10)

p.result <- t(p.result)
q.result <- t(q.result)

#create the graphs4
par(mfrow = c(1,1), oma = c(2,2,2,8))
heatmap.2(p.result, Colv = FALSE,
          trace = "none", cellnote=ifelse(p.result>0.01, NA, p.result),
          notecol = "white", tracecol = "black", col = my_palette, 
          main = "Fig. 4: p-values of Differential Metabolites at Each Timepoint",
          dendrogram = "none")

Figure 4 illustrates the results of Welch’s t-test when comparing the patients who survived and patients who did not for each metabolite, at every time point. The values below 0.01 are listed in the corresponding box.

Figure 5 illustrates the results of the adjusted p-values (q-values) from Figure 4 using the false discovery rate method. The values below 0.01 are listed in the corresponding box.

Metabolites Identified through Linear Mixed Model

Similar to the q-values generated for the comparison of patients vs. controls (Figure 3), the comparison of survived vs. died (Figure 3) had q-value results proved to be significant at many timepoints; indeed, there were differential q-values. That being said, it was not possible to glean conclusive evidence as to the functional pathways that are perturbed at each time point (functional analysis, using Reactome, not pictured) due to the fact that there was not one metabolite that was significant across all time points. Therefore, I chose to look at the metabolites that were identified for the linear mixed model, which totaled 72 (Table 2). I performed the same analyses as above for this list of metabolites; the results are summarized in Figures 6-10.

Table 2: Metabolites Identified through the Linear Mixed Model (LMM)

Metabolite	Metabolite	Metabolite
Kyn	C16:2	PE aa C36:3
C0	C16-OH	PC ae C36:5
C2	LPC a C16:0	PE ae C36:1
C4	LPC a C17:0	PE ae C36:2
C4:1	LPC a C18:0	PE ae C40:4
C4-OH (C3-DC)	PC aa C32:0	PE ae C36:4
C5	PC aa C32:1	PE ae C36:5
C5:1	PC aa C34:1	PE ae C38:1
C5:1-DC SM (OH)	C16:1	PE ae C38:2
C5-DC (C6-OH)	SM C18:0	PE ae C38:3
C5-OH (C3-DC-M)	SM C18:1	N-C21:0(OH)-Cer
C6 (C4:1-DC)	LPC e C18:0	N-C22:0(OH)-Cer
C7-DC	PC aa C36:1	N-C23:0(OH)-Cer
C8	PG aa C32:1	N-C24:0(OH)-Cer
C9	PG ae C32:0	N-C7:0-Cer(2H)
C10	PE aa C26:4	SM C17:0
C10:1	PE aa C38:3	SM C19:0
C10:2	PE aa C34:0	SM C19:1
C12	PE aa C34:1	SM C19:2
C12:1	PE aa C34:2	SM C20:1
C14	PE aa C34:3	11-Deoxycorticosterone
C14:1-OH	PE aa C36:1	11-Deoxycortisol
C14:2	PE aa C40:4	Testosterone
C14:2-OH	PE aa C36:2	DOPA

Figure 6 shows the Principal Component Analyses clustering the patients and controls (Figure 6A) as well as the patients based on survival status (Figure 6B) from differentially expressed metabolites.

The PCA in Figure 5 showed better clustering among the patients based on the survival status factor. This could potentially mean that there might be metabolites that are significantly different across all time points.

LMM Metabolites Comparing Patients vs. Controls

Figure 7 illustrates the results of Welch’s t-test when comparing the patients and controls for each metabolite, at every time point. The values below 0.01 are listed in the corresponding box.

Figure 8 illustrates the results of the adjusted p-values (q-values) from Figure 7 using the false discovery rate method. The values below 0.01 are listed in the corresponding box.

LMM Metabolites Comparing Survival Status

Figure 9 illustrates the results of Welch’s t-test when comparing the patients who survived and patients who did not for each metabolite, at every time point. The values below 0.01 are listed in the corresponding box.

Figure 10 illustrates the results of the adjusted p-values (q-values) from Figure 9 using the false discovery rate method. The values below 0.01 are listed in the corresponding box.

Data Setup - Overlapping Metabolites

In total, there were 8 metabolites that were identified to have a statistically significant level of abundance between the two groups. This will inform pathway analysis and annotation, in order to determine the changes that are occurring. As a final comparison, I also chose to perform the same series of analysis, but with the metabolites that were identified through both differential expression analysis and the linear mixed model. The overlapping metabolites are listed in Table 3.

Metabolite	Metabolite
Dodecanoylcarnitine (C12)	Phosphatidylethanolamine aa C36:1
Tetradecanoylcarnitine (C14)	Phosphatidylethanolamine aa C36:2
Tetradecadienylcarnitine (C14:2)	Phosphatidylethanolamine aa C36:3
Hexadecadienylcarnitine (C16:2)	Sphingomyelin C19:0
Phosphatidylcholine aa C36:1	Testosterone
Phosphatidylethanolamine aa C26:4	–

Figure 11 shows the Principal Component Analyses clustering the patients and controls (Figure 11A) as well as the patients based on survival status (Figure 11B) from differentially expressed metabolites.

The separation of clusters here is poor; this is likely due to the small number of metabolites used for this comparison (only 11).

Comparing Patients and Controls

Figure 12 illustrates the results of Welch’s t-test when comparing the patients and controls for each metabolite, at every time point. The values below 0.01 are listed in the corresponding box.

Figure 13 illustrates the results of the adjusted p-values (q-values) from Figure 12 using the false discovery rate method. The values below 0.01 are listed in the corresponding box.

Comparing Survival Status

Figure 14 illustrates the results of Welch’s t-test when comparing the patients who survived and patients who did not for each metabolite, at every time point. The values below 0.01 are listed in the corresponding box.

Figure 15 illustrates the results of the adjusted p-values (q-values) from Figure 14 using the false discovery rate method. The values below 0.01 are listed in the corresponding box.

By examining the adjusted p-values of this comparison, one interesting result came about: there was not a single metabolite that was statistically different between the two survival groups for every single time point. However, 8 metabolites were identified through the linear mixed model. Due to the very small sample sizes, it is difficult to conclude if there are metabolites that stratify patients. This does not mean that it does not exist; in fact, there are many statistical limitations to this study. Individual variability in genetic assembly and metabolic homeostasis has the propensity to preclude deduction of generalized clinical insights, especially when an intended study contains human subjects with other disease comorbidities. For plasma metabolomics investigations, long-term, large-scale clinical studies take years to construct and data acquisition and analysis require significant economical and scientific commitments. Studies using small cohorts provide broader possibility for “data-driven” biomarker candidate discovery. However, a small subject number limits the statistical power for in-depth characterization by using standard computational approaches. Generating temporal metabolomics profiling datasets offers unique sensitivity in response to pathological stress or therapeutic regiments, which enable microscopic identification of biomarker candidates. This is why the statistical threshold was established as p < 0.01.