1 Introduction

This R Markdown document describes the analysis of the Quartile data from the TOF-MS analysis by running sPLSDA, PCA, Cluster Analysis, Calinski and Harabasz index, and Univariate Non-Parametric analyses based on the demo script provided by Paul Brinkman.

2 Packages and data

First, we load the required R packages and BiocManager packages. Next, the data is read from a csv file and imported into R. We save the number of columns in the data as n.

if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")

library(mixOmics)
library(pROC)
library(epiR)
library(stringr)
library(readr)
library(ConsensusClusterPlus) 
library(fpc)
library(tableone)
library(splitstackshape)

rm(list = ls()) #clears workspace / global enviroment

setwd("~/Google Drive/My Drive/5 TOF-MS/Paul Demo") 
GCMS <- read_delim("half_area.csv", delim = ";", escape_double = FALSE, trim_ws = TRUE, locale=locale(encoding="latin1"))[1:52,]
GCMS_ori=GCMS
n <- ncol(GCMS)

We change negative features to 0 (artifact of retention time correction), and check for contamination. This is uninimportant for now, but will become relevant when containants have been identified from the results, and we rerun the analysis without the contaminants.

for(i in 1:ncol(GCMS)){
  negative_peaks=which(GCMS[,i]<0)
  #print(negative_peaks)
  if(length(negative_peaks)>0){
    GCMS[negative_peaks,i]=0
  }
}

# Code to check for contamination

3 sPLSDA

‘sPLSDA’ stands for sparse partial least squares discriminant analysis. In sPLS-DA, the objective is to find linear combinations of predictors (independent variables) that best discriminate between different classes or groups (dependent variables).

We create a vector with the maximum number of variables to be used in the sPLSDA and store it in list.keepX. The tune.splsda' function determines what the optimal number of components is to describe the variability in the data.ncompwithin this function determines the number of components we want to reduce the model into, in this case 2. In this type of application, it is reasonable to expect to be able to summarize the model into two components.validation = ‘loo’stands for the method of testing internal validation ("loo" is short for leave-one-out, which is the prefered method for a small dataset).distselects the distance metric to use for the function to estimate the classification error rate and is set tomax.dist, meaning maximum distance between the centers of the two groups in the figure.measure = “BER”` selects a method of measuring the misclassification error (“BER” stands for Balanced Error Rate).

From the analysis, two figures emerge. The first figure shows the two groups we want to distinguish from each other with a circle representing the 95% confidence interval. If you include the second and third quartiles, you would expect them to lie between these two circles. Figure 2 illustrates which two variables from the dataset explain the most variation in component 1 and component 2, respectively; in this case, cyclopentane and 1-Phenoxypropan-2-ol. The interpretation is that these two substances have the greatest predictive value for distinguishing the groups. Because most of the variability is in the y-axis (by inspection of figure 1), component 2 is the most important for distinguishing between the two groups (see also the box plots under the PCA paragraph). Because component 2 is best represented by cyclopentane, cyclopentane is the substance of interest based on the sPLSDA and the next step would be to see if a biological mechanism can explain this.

#select groups
groups=GCMS$Quart
table(groups)

## groups
##  1  4 
## 26 26

#select data
sPLSDA_data=GCMS[,c(3:n)]

list.keepX = c(1:10)
tune.splsda = tune.splsda(X = sPLSDA_data, Y=groups, ncomp = 2, validation = 'loo',
                          progressBar = TRUE, dist = 'max.dist', measure = "BER",
                          test.keepX = list.keepX, cpus = 4)

## 
## comp 1 
## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================================================================| 100%
## comp 2 
## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================================================================| 100%

select.keepX = tune.splsda[["choice.keepX"]]  # optimal number of variables to select
select.keepX

## comp1 comp2 
##     1     1

splsda.training=splsda(X = sPLSDA_data, Y=groups, ncomp = 2, keepX = select.keepX) 
plotIndiv(splsda.training, ellipse = TRUE, star = TRUE, legend=TRUE)

plotVar(splsda.training, cex=3.0)

#selectVar(splsda.training, comp=1)
#selectVar(splsda.training, comp=2)

4 PCA

PCA stands for Principal Component Analysis. It’s a statistical technique used to reduce the dimensionality of data while preserving most of the variability present in the data set. PCA transforms the original variables into a new set of uncorrelated variables called principal components.

omics.PCA fits a PCA model on the data, where the parameter scale=TRUE normalizes the data (to prevent unwanted effects of scale between variables). PCA analysis shows which substances have the greatest variance without considering patient labels (blind analysis). If a substance emerges from PCA analysis, it is a stronger result because it comes solely from the data (data-driven), i.e., the risk of overfitting is much smaller.

The dataframe generated by the code below shows which 5 compounds have the highest contribution to the principal components. 3 out of 5 results are siloxanes, which are compounds directly attributable to the plastics used for the samples. We can leave these out and rerun the analysis.

PCA_data=GCMS[,c(3:99)]

groups=GCMS$Quart
omics.PCA=pca(PCA_data, scale=TRUE)

#plot PCA results
plotIndiv(omics.PCA , comp = c(1,2), scale=TRUE,
          group = groups, ind.names = FALSE, star = TRUE, size.title = rel(0.8),
          ellipse = TRUE, centroid = FALSE, legend = TRUE, title = "PCA - Half Area Quartile")

plotVar(omics.PCA, cex=3.0, cutoff = 0.8)

boxplot(PCA_data$Cyclopentane ~ GCMS$Quart)

boxplot(PCA_data$`1-Phenoxypropan-2-ol` ~ GCMS$Quart)

head(selectVar(omics.PCA, comp=1)$value, 5)

##                                                                              value.var
## 5-Amino-1-methyl-1H-pyrazole-4-carboxamide, 3TMS                             0.2146700
## 3-Isopropoxy-1,1,1,7,7,7-hexamethyl-3,5,5-tris(trimethylsiloxy)tetrasiloxane 0.2093067
## [(4-Hexylbenzene-1,3-diyl)bis(oxy)]bis(trimethylsilane)                      0.1961191
## Cyclotetrasiloxane, octamethyl-                                              0.1840281
## Acetophenone                                                                 0.1778584

PCAresult=data.frame(omics.PCA[["variates"]][["X"]])

wilcox.test(PCAresult$PC1[1:52], as.numeric(as.factor(GCMS$Quart[1:52]))) #2 groups

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  PCAresult$PC1[1:52] and as.numeric(as.factor(GCMS$Quart[1:52]))
## W = 962, p-value = 0.01009
## alternative hypothesis: true location shift is not equal to 0

#Note1: [1:10] is to make sure that in this demo only 2 groups are selected (A & B).
#Note2: as.numeric(as.factor(GCMS$PID[1:10])) converts A,B into 1,2 ,since this Wilcox-function works with numeric y-vectors only.  
#Note3: as.numeric(as.factor(GCMS$PID[1:10])) should be replaced with a (clinical) characteristic of interest.

#kruskal.test(PCAresult$PC1, as.factor(GCMS$PID)) #>2 groups
#Note1: as.factor(GCMS$PID) should be replaced with a (clinical) characteristic of interest.

5 Univariate Non-Parametric

Working on it.

TOFMS-Quartile

Mahunda Sinyangwe

2024-04-08

1 Introduction

2 Packages and data

3 sPLSDA

4 PCA

5 Univariate Non-Parametric