1 Purpose

The purpose of the dsOmicsClient package is to provide a set of functions to perform omic association analyses when data are stored on federated databases or, more generally, in different repositories. In particular the package utilizes DataSHIELD infrastructure which is a software solution that allows simultaneous co-analysis of data from multiple studies stored on different servers without the need to physically pool data or disclose sensitive information (Wilson et al. 2017). DataSHIELD uses Opal servers to properly perform such analyses. Our bookdown introduces Opal, DaaSHIELD and other related features. Here, we describe the most relevant ones to be able to reproduce this document.

At a high level DataSHIELD is set up as a client-server model which houses the data for a particular study. A request is made from the client to run specific functions on the remote servers where the analysis is performed. Non-sensitive and pre-approved summary statistics are returned from each study to the client where they can be combined for an overall analysis. An overview of what a single-site DataSHIELD architecture would look like is illustrated in Figure 1.

Figure 1: Single Server DataSHIELD Architecture (Wilson et al 2017)

One of the main limitations of DataSHIELD is how to deal with large data given the restrictions of Opal with databases. Nonetheless, the recent development of the resourcer R package allows DataSHIELD developers to overcome this drawback by granting the Opal servers to deal with any type of data (e.g. resources). So far, Opal can register access to different types of data resources in different formats (csv, tsv, R data, SQL, tiddy, ..) that can also be located in different places (local, http, ssh, AWS S3 or Mongodb file stores, …). This is another important advancement since the resourcer addresses another important issue that is having duplicated data in different research centers or hospitals.

The resourcer package permits to work with specific R data classes. This is highly important in our setting since it will allow to use Bioconductor classes to properly manage omic data using efficient infrastructures such as ExpressionSet or RangedSummarizedExperiment among others. Another important asset of the resourcer package is that it can be extended to new data types by writting specific functions (see how to extending resources. We have used this feature and created some functions for the analysis of Variant Calling Format (VCF files) that are loaded into R as Genomic Data Storage objects. These functions along with others that allow the managment of Bioconductor classes in DataSHIELD have been included in a new DataSHIELD package, the dsOmics, which is able to manage different Bioconductor data infrastructures that are required to perform omic association analyses. These including ExpressionSet, RangedSummarizedExperiment or GDS among others. Generaly speaking, any data format and storage that can be read by R can be expressed as a resource.

In the next sections we first describe how to deal with Opal servers and resources. We illustre how we prepared a test environment to describe how Opal must be setup as well as how to provide the appropiate R/DataSHIELD configuration in both the Opal server and the client side to perform omic association analyses. Then, the different types of omic data analyses that can be performed with the dsOmicsClient functionality are described and further illustrated using real data examples including epigenome, transcriptome and genomic data analyses.

2 Setup

In this section we describe how to configure the Opal server and the needed packages to carry out omic association analyses from the client side. Basically, the resources must be defined in the Opal along with the required information that includes the url where data is located, the format (e.g., SPSS, R class, GDS …) and the credentials which are not visible to the DataSHIELD users (Figure 2). The permission to use a resource for DataSHIELD operations is granted (to a user or a group of users) in Opal.

Figure 2: Information required for Resources

A description of the pre-requisites can be found here. At the time of the writing of this vignette, the resource capabilities of Opal, DataSHIELD and related R packages have not been released yet. Basically, what is needed is:

DSI
DSOpal
opalr
OPTIONAL (used for development): DSLite
dsBaseClient in the v6.0-dev branch
dsOmicsClient

and in the server side: - resourcer - dsOmics - dsBase in the v6.0-dev branch

Notice that the dsOmics package includes new extensions of the resourcer package to deal with new types of resources such as file in VCF format to converted to a file in GDS format (VCF2GDS). Next subsections further describe what is required along with some examples.

2.1 Required Opal server with resources

Resources are datasets or computation units which are located under a URL and their access is protected by some credentials. When resources are assigned to a R/DataSHIELD server session, remote big/complex datasets or high performance computers are being accessible to data analysts.

Instead of storing the data in Opal databases, only the way to access them needs to be defined: the datasets are kept in their original format and location (e.g., an R object, a SQL database, a SPSS file, etc.) and are read directly from the R/DataSHIELD server-side session. Then as soon as there is a R reader for the dataset or a connector for the analysis services, a resource can be defined. Opal takes care of the DataSHIELD permissions (a DataSHIELD user cannot see the resource’s credentials) and of the resources assignment to a R/DataSHIELD session (see Figure 3)

Figure 3: Resources: a new DataSHIELD infrastructure

As previously mentioned, the resourcer R package allows to deal with the main data sources (using tidyverse, DBI, dplyr, sparklyr, MongoDB, AWS S3, SSH etc.) and is easily extensible to new ones including specific data infrastructure in R or Bioconductor. So far ExpressionSet and RangedSummarizedExperiment objects saved in .rdata files are accesible through the resourcer package. The dsOmics package contains a new extension that deals with VCF (Variant Calling Format) files which are coerced to a GDS (Genomic Data Storage) format (VCF2GDS).

In order to achive this resourcer extension, two R6 classes have been implemented:

GDSFileResourceResolver class which handles file-base resources with data in GDS or VCF formats. This class is responsible for creating a GDSFileResourceClient object instance from an assigned resource.
GDSFileResourceClient class which is responsible for getting the referenced file and making a connection (created by GWASTools) to the GDS file (will also convert the VCF file to a GDS file on the fly, using SNPRelate). For the subsequent analysis, it’s this connection handle to the GDS file that will be used.

We have prepared a test environment, with the Opal implementation of Resources and an appropriate R/DataSHIELD configuration that is available at: opal-demo.obiba.org. This figure illustrate the resources which are available for the RSRC project:

Figure 4: Resources from a test enviroment available at https://opal-demo.obiba.org

It is possible to declare a resource that is to be resolved by an R package that uses the resourcer API

Figure 5: Declaration of a resource corresponding to a VCF2GDS format

2.2 Required DataSHIELD packages in the opal server

Required DataSHIELD packages must be uploaded in the opal server through the Administration site by accessing to DataSHIELD tab. In our case, both dsBase and dsOmics and resourcer packages must be installed as is illustrated in the figure (NOTE: dsGeo is uploaded for other type of analyses and it is not necesary for omics).

Figure 6: Installed packages in the test opal server

The tab +Add package can be used to install a new package. The figure depicts how dsOmics was intalled into the opal server

Figure 7: Description how dsOmics package was intalled into the test opal server

2.3 Required R Packages in the client site (e.g. local machine)

In order to use the functions contained within this package the following R packages must be installed in the client side:

install.packages("DSIOpal")
install.packages('dsBaseClient', repos=c(getOption('repos'), 
                                         'http://cran.obiba.org'), dependencies=TRUE)
devtools::install_github("isglobal-brge/dsOmicsClient", dependencies = TRUE)

The package dependencies are then loaded as follows:

library(DSOpal)
library(dsBaseClient)
library(dsOmicsClient)

Notes:

For advanced users willing to use DSLite, the server side packages needs to be installed as well:

install.packages(c("resourcer", "DSLite"), dependencies = TRUE)
install.packages("dsBase", repos = c("https://cloud.r-project.org", 
                                     "https://cran.obiba.org"), dependencies = TRUE)

We refer to this chapter of our bookdown to a more detail description about how to work with DataSHIELD in a serverless environment.

3 Transcriptomic and Epigenomic data analysis

3.1 Single study analysis

Let us start by illustrating a simple example where a researcher may be interested in performing differential gene expression analysis (DGE) having data in a single repository (e.g. one study).

3.1.1 Illustrative example: differential gene expression (DGE) analysis

Let us illustrate how to perform transcriptomic data analysis using data from TCGA project. We have uploaded to the opal server a resource called tcga_liver whose URL is http://duffel.rail.bio/recount/TCGA/rse_gene_liver.Rdata which is available through the recount project. This resource contains the RangeSummarizedExperiment with the RNAseq profiling of liver cancer data from TCGA. Next, we illustrate how a differential expression analysis to compare RNAseq profiling of women vs men (variable gdc_cases.demographic.gender). The DGE analysis is normally performed using limma package. In that case, as we are analyzing RNA-seq data, limma + voom method will be required.

Let us start by creating the connection to the opal server:

builder <- newDSLoginBuilder()
builder$append(server = "study1", url = "https://opal-demo.obiba.org", 
               user = "dsuser", password = "password", 
               resource = "RSRC.tcga_liver")

logindata <- builder$build()

conns <- datashield.login(logins = logindata, assign = TRUE, 
                          symbol = "res")

Then, let us coerce the resource to a RangedSummarizedExperiment which is the type of object that is available in the recount project.

datashield.assign.expr(conns, symbol = "rse", 
                       expr = quote(as.resource.object(res)))
ds.class("rse")

$study1
[1] "RangedSummarizedExperiment"
attr(,"package")
[1] "SummarizedExperiment"

The number of features and samples can be inspected by

ds.dim("rse")

$`dimensions of rse in study1`
[1] 58037   424

$`dimensions of rse in combined studies`
[1] 58037   424

And the names of the features using the same function used in the case of analyzing an ExpressionSet

name.features <- ds.featureNames("rse")
lapply(name.features, head)

$study1
[1] "ENSG00000000003.14" "ENSG00000000005.5" 
[3] "ENSG00000000419.12" "ENSG00000000457.13"
[5] "ENSG00000000460.16" "ENSG00000000938.12"

Also the covariate names can be inspected by

name.vars <- ds.featureData("rse")
lapply(name.vars, head, n=15)

$study1
 [1] "project"                                       
 [2] "sample"                                        
 [3] "experiment"                                    
 [4] "run"                                           
 [5] "read_count_as_reported_by_sra"                 
 [6] "reads_downloaded"                              
 [7] "proportion_of_reads_reported_by_sra_downloaded"
 [8] "paired_end"                                    
 [9] "sra_misreported_paired_end"                    
[10] "mapped_read_count"                             
[11] "auc"                                           
[12] "sharq_beta_tissue"                             
[13] "sharq_beta_cell_type"                          
[14] "biosample_submission_date"                     
[15] "biosample_publication_date"

We can visualize the levels of the variable having gender information that will be our condition (i.e., we are interested in obtaining genes that are differentially expressed between males and females)

ds.table("rse$gdc_cases.demographic.gender")


 Data in all studies were valid 

Study 1 :  No errors reported from this study

$output.list
$output.list$TABLE_rvar.by.study_row.props
                                study
rse$gdc_cases.demographic.gender 1
                          female 1
                          male   1

$output.list$TABLE_rvar.by.study_col.props
                                study
rse$gdc_cases.demographic.gender         1
                          female 0.3372642
                          male   0.6627358

$output.list$TABLE_rvar.by.study_counts
                                study
rse$gdc_cases.demographic.gender   1
                          female 143
                          male   281

$output.list$TABLES.COMBINED_all.sources_proportions
rse$gdc_cases.demographic.gender
female   male 
 0.337  0.663 

$output.list$TABLES.COMBINED_all.sources_counts
rse$gdc_cases.demographic.gender
female   male 
   143    281 


$validity.message
[1] "Data in all studies were valid"

We have implemented a function called ds.RNAseqPreproc() to perform RNAseq data pre-processing that includes:

transforming data into log2 CPM units
filtering lowly-expressed genes
data normalization

ds.RNAseqPreproc('rse', group= 'gdc_cases.demographic.gender', 
                 newobj.name = 'rse.pre')

Note that it is recommended to indicate the grouping variable (i.e., condition). Once data have been pre-processed, we can perform differential expression analysis. Notice how dimensions have changed given the fact that we have removed genes with low expression which are expected to do not be differentially expressed.

ds.dim('rse')

$`dimensions of rse in study1`
[1] 58037   424

$`dimensions of rse in combined studies`
[1] 58037   424

ds.dim('rse.pre')

$`dimensions of rse.pre in study1`
[1] 40363   424

$`dimensions of rse.pre in combined studies`
[1] 40363   424

The differential expression analysis is ´dsOmicsClient/dsOmics´ is implemented in the funcion ds.limma(). This functions runs a limma-pipeline for microarray data and for RNAseq data allows:

voom + limma
DESeq2
edgeR

We recommend to use the voom + limma pipeline proposed here given its versatility and that limma is much faster than DESeq2 and edgeR. By default, the function consider that data are obtained from a microarray experiment (type.data = "microarray"). Therefore, as we are analyzing RNAseq data, we much indicate that type.data = "RNAse"

ans.gender <- ds.limma(model =  ~ gdc_cases.demographic.gender, 
                       Set = "rse.pre", type.data = "RNAseq")

The top differentially expressed genes can be visualized by:

ans.gender

$study1
# A tibble: 40,363 x 7
   id             n   beta     SE     t   P.Value adj.P.Val
   <chr>      <int>  <dbl>  <dbl> <dbl>     <dbl>     <dbl>
 1 ENSG00000~   424 -12.4  0.0761 -52.1 2.74e-187 1.11e-182
 2 ENSG00000~   424 -10.2  0.461  -43.8 5.21e-160 1.05e-155
 3 ENSG00000~   424 -11.0  0.0603 -39.5 1.08e-144 1.45e-140
 4 ENSG00000~   424 -11.3  0.0651 -36.0 2.27e-131 2.29e-127
 5 ENSG00000~   424  10.9  0.0885  35.5 1.85e-129 1.49e-125
 6 ENSG00000~   424  10.2  0.118   32.9 3.72e-119 2.50e-115
 7 ENSG00000~   424  11.4  0.128   31.9 5.57e-115 3.21e-111
 8 ENSG00000~   424  -7.78 0.0812 -28.8 3.85e-102 1.94e- 98
 9 ENSG00000~   424   9.62 0.0894  27.4 4.72e- 96 2.12e- 92
10 ENSG00000~   424  11.4  0.0924  27.3 9.63e- 96 3.89e- 92
# ... with 40,353 more rows

attr(,"class")
[1] "dsLimma" "list"

We can verify whether the distribution of the observed p-values are the ones we expect in this type of analyses

hist(ans.gender$study1$P.Value, xlab="Raw p-value gender effect",
     main="", las=1, cex.lab=1.5, cex.axis=1.2, col="gray")

We can also check whether there is inflation just executing

qqplot(ans.gender$study1$P.Value)

So, in that case, the model needs to remove unwanted variability (\(\lambda>2\)). If so, we can use surrogate variable analysis just changing the argument sva=TRUE

ans.gender.sva <- ds.limma(model =  ~ gdc_cases.demographic.gender, 
                       Set = "rse.pre", type.data = "RNAseq",
                       sva = TRUE)

Now the inflation has dramatically been reduced (\(\lambda>1.12\))

qqplot(ans.gender.sva$study1$P.Value)

We can add annotation to the output that is available in our RSE object. We can have access to this information by

ds.fvarLabels('rse.pre')

$study1
[1] "chromosome" "start"      "end"        "width"     
[5] "strand"     "gene_id"    "bp_length"  "symbol"    

attr(,"class")
[1] "dsfvarLabels" "list"

So, we can run

ans.gender.sva <- ds.limma(model =  ~ gdc_cases.demographic.gender, 
                       Set = "rse.pre", type.data = "RNAseq",
                       sva = TRUE, annotCols = c("chromosome"))

The results are:

ans.gender.sva

$study1
# A tibble: 40,363 x 8
   id             n   beta     SE     t   P.Value adj.P.Val
   <chr>      <int>  <dbl>  <dbl> <dbl>     <dbl>     <dbl>
 1 ENSG00000~   424 -12.1  0.0713 -51.3 1.82e-178 7.33e-174
 2 ENSG00000~   424  -9.95 0.467  -44.9 1.52e-158 3.07e-154
 3 ENSG00000~   424 -10.7  0.0513 -40.3 2.19e-143 2.95e-139
 4 ENSG00000~   424 -10.7  0.0461 -37.4 6.78e-133 6.84e-129
 5 ENSG00000~   424  10.8  0.0571  36.4 2.43e-129 1.96e-125
 6 ENSG00000~   424  10.3  0.0774  34.3 1.98e-121 1.33e-117
 7 ENSG00000~   424   9.64 0.101   31.6 5.70e-111 2.87e-107
 8 ENSG00000~   424  11.2  0.0691  31.7 2.34e-111 1.35e-107
 9 ENSG00000~   424   4.48 0.0678  30.2 2.13e-105 8.61e-102
10 ENSG00000~   424  11.2  0.0547  30.9 2.17e-108 9.72e-105
# ... with 40,353 more rows, and 1 more variable:
#   chromosome <chr>

attr(,"class")
[1] "dsLimma" "list"

The function has another arguments that can be used to fit other type of models:

sva: estimate surrogate variables
annotCols: to add annotation available in the
method: Linear regression (“ls”) or robust regression (“robust”) used in limma (lmFit)
robust: robust method used for outlier sample variances used in limma (eBayes)
normalization: normalization method used in the voom transformation (default “none”)
voomQualityWeights: should voomQualityWeights function be used instead of voom? (default FALSE)
big: should SmartSVA be used instead of SVA (useful for big sample size or when analyzing epigenome data. Default FALSE)

We have also implemented two other functions ds.DESeq2 and ds.edgeR that perform DGE analysis using DESeq2 and edgeR methods. This is the R code used to that purpose:

To be supplied

We close the DataSHIELD session by:

datashield.logout(conns)

3.2 Analysis from multiple studies

The Figure 8 describes the different types of omic association analyses that can be performed using DataSHIELD client functions implemented in the dsOmicsClient package. Basically, data (omic and phenotypes/covariates) can be stored in different sites (http, ssh, AWS S3, local, …) and are managed with Opal through the resourcer package and their extensions implemented in dsOmics.

Non-disclosive omic data analysis with DataSHIELD and Bioconductor. The figure illustrates how the `resourcer` package is used to get access to omic data through the Opal servers. Then DataSHIELD is used in the client side to perform non-disclosive data analyses.

Figure 8: Non-disclosive omic data analysis with DataSHIELD and Bioconductor
The figure illustrates how the resourcer package is used to get access to omic data through the Opal servers. Then DataSHIELD is used in the client side to perform non-disclosive data analyses.

Then, dsOmicsClient package allows different types of analyses: “virtually” pooled and federated meta-analysis. Both methods are based on fitting different generalized linear models (GLMs) for each feature when assesing association between omic data and the phenotype/trait/condition of interest. Of course non-disclosive omic data analysis from a single study can also be performed.

The “virtually” pooled approach (Figure 9) is recommended when the user wants to analyze omic data from different sources and obtain results as if the data were located in a single computer. It should be noticed that this can be very time consuming when analyzing multiple features since it calls repeatedly to a base function in DataSHIELD (ds.glm) and that it cannot be recommended when data are not properly harmonized (e.g. gene expression normalized using different methods, GWAS data having different platforms, …). Also when it is necesary to remove unwanted variability (for transcriptomic and epigenomica analysis) or control for population stratification (for GWAS analysis), this approach cannot be used since we need to develop methods to compute surrogate variables (to remove unwanted variability) or PCAs (to to address population stratification) in a non-disclosive way.

The federated meta-analysis approach Figure 10 overcomes the limitations raised when performing pooled analyses. First, the computation issue is addressed by using scalable and fast methods to perform data analysis at whole-genome level at each server. The transcriptomic and epigenomic data analyses make use of the widely used limma package that uses ExpressionSet or RangedSummarizedExperiment Bioc infrastructures to deal with omic and phenotypic (e.g covariates). The genomic data are analyzed using GWASTools and GENESIS that are designed to perform quality control (QC) and GWAS using GDS infrastructure.

Next, we describe how both approaches are implemented:

Virtually Pooled approach: Figure 9 illustrate how this analysis is performed. This corresponds to generalized linear models (glm) on data from single or multiple sources. It makes use of ds.glm() function which is a DataSHIELD function that uses an approach that is mathematically equivalent to placing all individual-level data froma all sources in one central warehouse and analysing those data using the conventional glm() function in R. The user can select one (or multiple) features (i.e., genes, transcripts, CpGs, SNPs, …)

Non-disclosive omic data analysis with DataSHIELD and Bioconductor. The figure illustrates how to perform single pooled omic data analysis. The analyses are performed by using a generalized linear model (glm) on data from one or multiple sources. It makes use of `ds.glm()`, a DataSHIELD function, that uses an approach that is mathematically equivalent to placing all individual-level data from all sources in one central warehouse and analysing those data using the conventional `glm()` function in R.

Figure 9: Non-disclosive omic data analysis with DataSHIELD and Bioconductor
The figure illustrates how to perform single pooled omic data analysis. The analyses are performed by using a generalized linear model (glm) on data from one or multiple sources. It makes use of ds.glm(), a DataSHIELD function, that uses an approach that is mathematically equivalent to placing all individual-level data from all sources in one central warehouse and analysing those data using the conventional glm() function in R.

Meta-analysis: Figure 10 illustrate how this analysis is performed. This corresponds to perform a genome-wide analysis at each server using functions that are specifically design to that purpose and that are scalable. Then the results of each server can be meta-analyzed using method that meta-analyze either effects or p-values.

Non-disclosive omic data analysis with DataSHIELD and Bioconductor. The figure illustrates how to perform anlyses at genome-wide level from one or multiple sources. It runs standard Bioconductor functions at each server independently to speed up the analyses and in the case of having multiple sources, results can be meta-analyzed uning standar R functions.

Figure 10: Non-disclosive omic data analysis with DataSHIELD and Bioconductor
The figure illustrates how to perform anlyses at genome-wide level from one or multiple sources. It runs standard Bioconductor functions at each server independently to speed up the analyses and in the case of having multiple sources, results can be meta-analyzed uning standar R functions.

3.2.1 Illustrative example: Epigenome-wide association analysis (EWAS)

EWAS requires basically the same statistical methods as those used in DGE. It should be notice that the pooled analysis we are going to illustrate here can also be performed with transcriptomic data since each study must have different range values. If so, gene expression harmonization should be performed, for instance, by standardizing the data at each study. For EWAS where methylation is measured using beta values (e.g CpG data are in the range 0-1) this is not a problem. In any case, adopting the meta-analysis approach could be a safe option.

We have downloaded data from GEO corresponding to the accesion number GSE66351 which includes DNA methylation profiling (Illumina 450K array) of 190 individuals. Data corresponds to CpGs beta values measured in the superior temporal gyrus and prefrontal cortex brain regions of patients with Alzheimer’s. Data have been downloaded using GEOquery package that gets GEO data as ExpressionSet objects. Researchers who are not familiar with ExpressionSets can read this Section. Notice that data are encoded as beta-values that ensure data harmonization across studies.

In order to illustrate how to perform data analyses using federated data, we have split the data into two ExpressionSets having 100 and 90 samples as if they were two different studies. Figure 4 shows the two resources defined for both studies (GSE66351_1 and GSE66351_2)

In order to perform omic data analyses, we need first to login and assign resources to DataSHIELD. This can be performed using the as.resource.object() function

builder <- DSI::newDSLoginBuilder()
builder$append(server = "study1", url = "https://opal-demo.obiba.org", 
               user = "dsuser", password = "password", 
               resource = "RSRC.GSE66351_1")
builder$append(server = "study2", url = "https://opal-demo.obiba.org", 
               user = "dsuser", password = "password", 
               resource = "RSRC.GSE66351_2")

logindata <- builder$build()

conns <- DSI::datashield.login(logins = logindata, assign = TRUE, 
                               symbol = "res")


# Assign to the original R class (e.g ExpressionSet)
datashield.assign.expr(conns, symbol = "methy", 
                       expr = quote(as.resource.object(res)))

Now, we can see that the resources are actually loaded into the R servers as their original class

ds.class("methy")

$study1
[1] "ExpressionSet"
attr(,"package")
[1] "Biobase"

$study2
[1] "ExpressionSet"
attr(,"package")
[1] "Biobase"

Then, some Bioconductor-type functions can be use to return non-disclosive information of ExpressionSets from each server to the client, using similar functions as those defined in the dsBaseClient package. For example, feature names can be returned by

fn <- ds.featureNames("methy")
lapply(fn, head)

$study1
[1] "cg00000029" "cg00000108" "cg00000109" "cg00000165"
[5] "cg00000236" "cg00000289"

$study2
[1] "cg00000029" "cg00000108" "cg00000109" "cg00000165"
[5] "cg00000236" "cg00000289"

Experimental phenotypes variables can be obtained by

ds.varLabels("methy")

$study1
 [1] "title"                   "geo_accession"          
 [3] "status"                  "submission_date"        
 [5] "last_update_date"        "type"                   
 [7] "channel_count"           "source_name_ch1"        
 [9] "organism_ch1"            "characteristics_ch1"    
[11] "characteristics_ch1.1"   "characteristics_ch1.2"  
[13] "characteristics_ch1.3"   "characteristics_ch1.4"  
[15] "characteristics_ch1.5"   "characteristics_ch1.6"  
[17] "characteristics_ch1.7"   "characteristics_ch1.8"  
[19] "molecule_ch1"            "extract_protocol_ch1"   
[21] "label_ch1"               "label_protocol_ch1"     
[23] "taxid_ch1"               "hyb_protocol"           
[25] "scan_protocol"           "description"            
[27] "data_processing"         "platform_id"            
[29] "contact_name"            "contact_email"          
[31] "contact_phone"           "contact_laboratory"     
[33] "contact_institute"       "contact_address"        
[35] "contact_city"            "contact_zip/postal_code"
[37] "contact_country"         "supplementary_file"     
[39] "supplementary_file.1"    "data_row_count"         
[41] "age"                     "braak_stage"            
[43] "brain_region"            "cell type"              
[45] "diagnosis"               "donor_id"               
[47] "sentrix_id"              "sentrix_position"       
[49] "Sex"                    

$study2
 [1] "title"                   "geo_accession"          
 [3] "status"                  "submission_date"        
 [5] "last_update_date"        "type"                   
 [7] "channel_count"           "source_name_ch1"        
 [9] "organism_ch1"            "characteristics_ch1"    
[11] "characteristics_ch1.1"   "characteristics_ch1.2"  
[13] "characteristics_ch1.3"   "characteristics_ch1.4"  
[15] "characteristics_ch1.5"   "characteristics_ch1.6"  
[17] "characteristics_ch1.7"   "characteristics_ch1.8"  
[19] "molecule_ch1"            "extract_protocol_ch1"   
[21] "label_ch1"               "label_protocol_ch1"     
[23] "taxid_ch1"               "hyb_protocol"           
[25] "scan_protocol"           "description"            
[27] "data_processing"         "platform_id"            
[29] "contact_name"            "contact_email"          
[31] "contact_phone"           "contact_laboratory"     
[33] "contact_institute"       "contact_address"        
[35] "contact_city"            "contact_zip/postal_code"
[37] "contact_country"         "supplementary_file"     
[39] "supplementary_file.1"    "data_row_count"         
[41] "age"                     "braak_stage"            
[43] "brain_region"            "cell type"              
[45] "diagnosis"               "donor_id"               
[47] "sentrix_id"              "sentrix_position"       
[49] "Sex"                    

attr(,"class")
[1] "dsvarLabels" "list"

3.3 Single CpG analysis

Once the methylation data have been loaded into the opal server, we can perform different type of analyses using functions from the dsOmicsClient package. Let us start by illustrating how to analyze a single CpG from two studies by using an approach that is mathematically equivalent to placing all individual-level.

ans <- ds.lmFeature(feature = "cg07363416", 
                    model = ~ diagnosis + Sex, 
                    Set = "methy",
                    datasources = conns)
ans

             Estimate Std. Error   p-value
cg07363416 0.03459886 0.02504291 0.1670998
attr(,"class")
[1] "dsLmFeature" "matrix"      "array"

3.4 Multiple CpG analysis

The same analysis can be performed for all features (e.g. CpGs) just avoiding the feature argument. This process can be parallelized using mclapply function from the multicore package.

ans <- ds.lmFeature(model = ~ diagnosis + Sex, 
                    Set = "methy",
                    datasources = conns,
                    mc.cores = 20)

This method corresponds to the pooled analysis approach and can be very time consiming since the function repeatedly calls the DataSHIELD function ds.glm(). We can adopt another strategy that is to run a glm of each feature independently at each study using limma package (which is really fast) and then combine the results (i.e. meta-analysis approach).

ans.limma <- ds.limma(model = ~ diagnosis + Sex,
                      Set = "methy", 
                      datasources = conns)

Then, we can visualize the top genes at each study (i.e server) by

lapply(ans.limma, head)

$study1
# A tibble: 6 x 7
  id           n    beta      SE     t    P.Value adj.P.Val
  <chr>    <int>   <dbl>   <dbl> <dbl>      <dbl>     <dbl>
1 cg13138~   100 -0.147  0.0122  -6.62    1.90e-9  0.000466
2 cg23859~   100 -0.0569 0.00520 -6.58    2.32e-9  0.000466
3 cg13772~   100 -0.0820 0.0135  -6.50    3.27e-9  0.000466
4 cg12706~   100 -0.0519 0.00872 -6.45    4.25e-9  0.000466
5 cg24724~   100 -0.0452 0.00775 -6.39    5.47e-9  0.000466
6 cg02812~   100 -0.125  0.0163  -6.33    7.31e-9  0.000466

$study2
# A tibble: 6 x 7
  id           n    beta      SE     t    P.Value adj.P.Val
  <chr>    <int>   <dbl>   <dbl> <dbl>      <dbl>     <dbl>
1 cg04046~    90 -0.101  0.0128  -5.91    6.21e-8    0.0172
2 cg07664~    90 -0.0431 0.00390 -5.85    8.22e-8    0.0172
3 cg27098~    90 -0.0688 0.0147  -5.79    1.07e-7    0.0172
4 cg08933~    90 -0.0461 0.00791 -5.55    2.98e-7    0.0360
5 cg18349~    90 -0.0491 0.00848 -5.42    5.07e-7    0.0489
6 cg02182~    90 -0.0199 0.0155  -5.36    6.70e-7    0.0538

The annotation can be added by using the argument annotCols. It should be a vector with the columns of the annotation available in the ExpressionSet or RangedSummarizedExperiment that want to be showed. The columns of the annotation can be obtained by

ds.fvarLabels("methy")

$study1
 [1] "ID"                         
 [2] "Name"                       
 [3] "AddressA_ID"                
 [4] "AlleleA_ProbeSeq"           
 [5] "AddressB_ID"                
 [6] "AlleleB_ProbeSeq"           
 [7] "Infinium_Design_Type"       
 [8] "Next_Base"                  
 [9] "Color_Channel"              
[10] "Forward_Sequence"           
[11] "Genome_Build"               
[12] "CHR"                        
[13] "MAPINFO"                    
[14] "SourceSeq"                  
[15] "Chromosome_36"              
[16] "Coordinate_36"              
[17] "Strand"                     
[18] "Probe_SNPs"                 
[19] "Probe_SNPs_10"              
[20] "Random_Loci"                
[21] "Methyl27_Loci"              
[22] "UCSC_RefGene_Name"          
[23] "UCSC_RefGene_Accession"     
[24] "UCSC_RefGene_Group"         
[25] "UCSC_CpG_Islands_Name"      
[26] "Relation_to_UCSC_CpG_Island"
[27] "Phantom"                    
[28] "DMR"                        
[29] "Enhancer"                   
[30] "HMM_Island"                 
[31] "Regulatory_Feature_Name"    
[32] "Regulatory_Feature_Group"   
[33] "DHS"                        
[34] "RANGE_START"                
[35] "RANGE_END"                  
[36] "RANGE_GB"                   
[37] "SPOT_ID"                    

$study2
 [1] "ID"                         
 [2] "Name"                       
 [3] "AddressA_ID"                
 [4] "AlleleA_ProbeSeq"           
 [5] "AddressB_ID"                
 [6] "AlleleB_ProbeSeq"           
 [7] "Infinium_Design_Type"       
 [8] "Next_Base"                  
 [9] "Color_Channel"              
[10] "Forward_Sequence"           
[11] "Genome_Build"               
[12] "CHR"                        
[13] "MAPINFO"                    
[14] "SourceSeq"                  
[15] "Chromosome_36"              
[16] "Coordinate_36"              
[17] "Strand"                     
[18] "Probe_SNPs"                 
[19] "Probe_SNPs_10"              
[20] "Random_Loci"                
[21] "Methyl27_Loci"              
[22] "UCSC_RefGene_Name"          
[23] "UCSC_RefGene_Accession"     
[24] "UCSC_RefGene_Group"         
[25] "UCSC_CpG_Islands_Name"      
[26] "Relation_to_UCSC_CpG_Island"
[27] "Phantom"                    
[28] "DMR"                        
[29] "Enhancer"                   
[30] "HMM_Island"                 
[31] "Regulatory_Feature_Name"    
[32] "Regulatory_Feature_Group"   
[33] "DHS"                        
[34] "RANGE_START"                
[35] "RANGE_END"                  
[36] "RANGE_GB"                   
[37] "SPOT_ID"                    

attr(,"class")
[1] "dsfvarLabels" "list"

Then we can run the analysis and obtain the output with the chromosome and gene symbol by:

ans.limma.annot <- ds.limma(model = ~ diagnosis + Sex,
                            Set = "methy", 
                            annotCols = c("CHR", "UCSC_RefGene_Name"),
                            datasources = conns)

lapply(ans.limma.annot, head)

$study1
# A tibble: 6 x 9
  id        n    beta      SE     t P.Value adj.P.Val CHR  
  <chr> <int>   <dbl>   <dbl> <dbl>   <dbl>     <dbl> <chr>
1 cg13~   100 -0.147  0.0122  -6.62 1.90e-9  0.000466 2    
2 cg23~   100 -0.0569 0.00520 -6.58 2.32e-9  0.000466 2    
3 cg13~   100 -0.0820 0.0135  -6.50 3.27e-9  0.000466 17   
4 cg12~   100 -0.0519 0.00872 -6.45 4.25e-9  0.000466 19   
5 cg24~   100 -0.0452 0.00775 -6.39 5.47e-9  0.000466 19   
6 cg02~   100 -0.125  0.0163  -6.33 7.31e-9  0.000466 2    
# ... with 1 more variable: UCSC_RefGene_Name <chr>

$study2
# A tibble: 6 x 9
  id        n    beta      SE     t P.Value adj.P.Val CHR  
  <chr> <int>   <dbl>   <dbl> <dbl>   <dbl>     <dbl> <chr>
1 cg04~    90 -0.101  0.0128  -5.91 6.21e-8    0.0172 11   
2 cg07~    90 -0.0431 0.00390 -5.85 8.22e-8    0.0172 6    
3 cg27~    90 -0.0688 0.0147  -5.79 1.07e-7    0.0172 11   
4 cg08~    90 -0.0461 0.00791 -5.55 2.98e-7    0.0360 1    
5 cg18~    90 -0.0491 0.00848 -5.42 5.07e-7    0.0489 3    
6 cg02~    90 -0.0199 0.0155  -5.36 6.70e-7    0.0538 8    
# ... with 1 more variable: UCSC_RefGene_Name <chr>

Then, the last step is to meta-analyze the results. Different methods can be used to this end. We have implemented a method that meta-analyze the p-pvalues of each study as follows:

ans.meta <- metaPvalues(ans.limma)
ans.meta

# A tibble: 481,868 x 4
   id                study1      study2   p.meta
   <chr>              <dbl>       <dbl>    <dbl>
 1 cg13138089 0.00000000190 0.00000763  4.78e-13
 2 cg25317941 0.0000000179  0.00000196  1.12e-12
 3 cg02812891 0.00000000731 0.00000707  1.63e-12
 4 cg12706938 0.00000000425 0.0000161   2.14e-12
 5 cg16026647 0.000000101   0.000000797 2.51e-12
 6 cg12695465 0.00000000985 0.0000144   4.33e-12
 7 cg21171625 0.000000146   0.00000225  9.78e-12
 8 cg13772815 0.00000000327 0.000122    1.18e-11
 9 cg00228891 0.000000166   0.00000283  1.38e-11
10 cg21488617 0.0000000186  0.0000299   1.62e-11
# ... with 481,858 more rows

We can verify that the results are pretty similar to those obtained using pooled analyses. Here we compute the association for two of the top-CpGs:

res1 <- ds.lmFeature(feature = "cg13138089", 
                     model = ~ diagnosis + Sex, 
                     Set = "methy",
                     datasources = conns)
res1

             Estimate Std. Error      p-value
cg13138089 -0.1373348 0.01712405 1.057482e-15
attr(,"class")
[1] "dsLmFeature" "matrix"      "array"

res2 <- ds.lmFeature(feature = "cg13772815", 
                     model = ~ diagnosis + Sex, 
                     Set = "methy",
                     datasources = conns)
res2

              Estimate  Std. Error      p-value
cg13772815 -0.06786137 0.009128915 1.056225e-13
attr(,"class")
[1] "dsLmFeature" "matrix"      "array"

We can create a QQ-plot by using the function qqplot available in our package.

qqplot(ans.meta$p.meta)

Here In some cases inflation can be observed, so that, correction for cell-type or surrogate variables must be performed. We describe how we can do that in the next two sections.

3.5 Adjusting for Surrogate Variables

The vast majority of omic studies require to control for unwanted variability. The surrogate variable analysis (SVA) can address this issue by estimating some hidden covariates that capture differences across individuals due to some artifacts such as batch effects or sample quality sam among others. The method is implemented in SVA package.

Performing this type of analysis using the ds.lmFeature function is not allowed since estimating SVA would require to implement a non-disclosive method that computes SVA from the different servers. This will be a future topic of the dsOmicsClient. NOTE that, estimating SVA separately at each server would not be a good idea since the aim of SVA is to capture differences mainly due to experimental issues among ALL individuals. What we can do instead is to use the ds.limma function to perform the analyses adjusted for SVA at each study.

ans.sva <- ds.limma(model = ~ diagnosis + Sex, 
                    Set = "methy",
                    sva = TRUE, annotCols = c("CHR", "UCSC_RefGene_Name"))
ans.sva

$study1
# A tibble: 481,868 x 9
   id            n    beta      SE     t  P.Value adj.P.Val
   <chr>     <int>   <dbl>   <dbl> <dbl>    <dbl>     <dbl>
 1 cg035381~   100 -0.0300 0.0106  -6.99 3.64e-10  0.000102
 2 cg247245~   100 -0.0397 0.00452 -6.88 6.12e-10  0.000102
 3 cg021567~   100 -0.0497 0.00947 -6.87 6.35e-10  0.000102
 4 cg238596~   100 -0.0498 0.00715 -6.79 9.36e-10  0.000113
 5 cg137728~   100 -0.0767 0.00697 -6.62 2.01e- 9  0.000168
 6 cg052138~   100 -0.0803 0.0160  -6.62 2.10e- 9  0.000168
 7 cg194060~   100 -0.0409 0.00779 -6.52 3.29e- 9  0.000201
 8 cg261524~   100 -0.0840 0.00947 -6.52 3.33e- 9  0.000201
 9 cg009799~   100 -0.0723 0.00539 -6.44 4.81e- 9  0.000242
10 cg276606~   100 -0.0514 0.00128 -6.43 5.03e- 9  0.000242
# ... with 481,858 more rows, and 2 more variables:
#   CHR <chr>, UCSC_RefGene_Name <chr>

$study2
# A tibble: 481,868 x 9
   id           n    beta      SE     t   P.Value adj.P.Val
   <chr>    <int>   <dbl>   <dbl> <dbl>     <dbl>     <dbl>
 1 cg03470~    90 -0.0502 0.0104  -6.41   7.30e-9   0.00352
 2 cg02293~    90 -0.0394 0.00372 -6.02   4.04e-8   0.00663
 3 cg24302~    90 -0.0370 0.0111  -5.87   7.90e-8   0.00663
 4 cg07664~    90 -0.0416 0.00719 -5.85   8.43e-8   0.00663
 5 cg16766~    90 -0.0436 0.00671 -5.83   9.38e-8   0.00663
 6 cg04046~    90 -0.100  0.0142  -5.81   9.90e-8   0.00663
 7 cg00848~    90 -0.0414 0.00836 -5.79   1.07e-7   0.00663
 8 cg08862~    90 -0.0289 0.00900 -5.79   1.10e-7   0.00663
 9 cg10850~    90 -0.0700 0.00445 -5.74   1.35e-7   0.00672
10 cg27098~    90 -0.0664 0.00156 -5.72   1.49e-7   0.00672
# ... with 481,858 more rows, and 2 more variables:
#   CHR <chr>, UCSC_RefGene_Name <chr>

attr(,"class")
[1] "dsLimma" "list"

Then, data can be combined meta-anlyzed as follows:

ans.meta.sv <- metaPvalues(ans.sva)
ans.meta.sv

# A tibble: 481,868 x 4
   id           study1       study2   p.meta
   <chr>         <dbl>        <dbl>    <dbl>
 1 cg24302412 1.94e- 8 0.0000000790 5.37e-14
 2 cg01301319 8.29e- 9 0.000000769  2.14e-13
 3 cg05213896 2.10e- 9 0.00000763   5.25e-13
 4 cg02812891 1.07e- 8 0.00000153   5.37e-13
 5 cg00228891 5.61e- 8 0.000000301  5.53e-13
 6 cg12695465 5.90e- 9 0.00000360   6.90e-13
 7 cg13138089 6.16e- 9 0.00000571   1.12e-12
 8 cg24724506 6.12e-10 0.0000829    1.60e-12
 9 cg19406036 3.29e- 9 0.0000331    3.37e-12
10 cg16026647 2.82e- 7 0.000000449  3.89e-12
# ... with 481,858 more rows

The DataSHIELD session must by closed by:

datashield.logout(conns)

4 Genomic data analysis: GWAS

Genomic data can be stored in different formats. PLINK and VCF files are commonly used in genetic epidemiology studies. We have a GWAS example available at BRGE data repository that aims to find SNPs associated with asthma. We have data stored in VCF (brge.vcf) with several covariates and phenotypes available in the file brge.txt (gender, age, obesity, smoking, country and asthma status). The same data is also available in PLINK format (brge.bed, brge.bim, brge.fam) with covariates in the file brge.phe.

4.1 Analysis with Bioconductor packages: example how to extend the resources

In order to deal with this type of data, we have extended the resources available at the resourcer package to VCF files. NOTE: PLINK files can be translated into VCF files using different pipelines. In R you can use SeqArray to get VCF files.

We use the Genomic Data Storage (GDS) format which efficiently manage VCF files into the R environment. This extension requires to create a Client and a Resolver function that are located into the dsOmics package. The client function uses snpgdsVCF2GDS function implemented in SNPrelate to coerce the VCF file to a GDS object. Then the GDS object is loaded into R as an object of class GdsGenotypeReader from GWASTools package that facilitates downstream analyses.

The opal API server allows to incorporate this new type of resource as illustrated in Figure 11.

Figure 11: Description of how a VCF file can be added to the opal resources

It is important to notice that the URL should contain the tag method=biallelic.only&snpfirstdim=TRUE since these are required parameters of snpgdsVCF2GDS function. This is an example:

https://raw.githubusercontent.com/isglobal-brge/scoreInvHap/master/inst/extdata/example.vcf?method=biallelic.only&snpfirstdim=TRUE

In that case we indicate that only biallelic SNPs are considered (‘method=biallelic.only’) and that genotypes are stored in the individual-major mode, (i.e., list all SNPs for the first individual, and then list all SNPs for the second individual, etc) (‘snpfirstdim=TRUE’).

We have created a resource having the VCF file of our study on asthma as previously described. The name of the resource is brge_vcf the phenotypes are available in another resource called brge that is a .txt file. The GWAS analysis is then perform as follows

We first start by preparing login data

builder <- newDSLoginBuilder()
builder$append(server = "study1", url = "https://opal-demo.obiba.org",
               user = "dsuser", password = "password",
               resource = "RSRC.brge_vcf")
logindata <- builder$build()

conns <- datashield.login(logins = logindata, assign = TRUE,
                          symbol = "res")

In this case we have to assign to different resources. One for the VCF (obesity_vcf) and another one for the phenotypic data (obesity). To this end, the datashield.assign.resource function is required before assigning any object to the specific resource. Notice that the VCF resource can be load into R as a GDS thanks to our extension of existing resources in the reourcer

datashield.assign.resource(conns, symbol = "vcf.res", 
                           resource = list(study1 = "RSRC.brge_vcf"))
datashield.assign.expr(conns, symbol = "gds", 
                       expr = quote(as.resource.object(vcf.res)))


datashield.assign.resource(conns, symbol = "covars.res", 
                           resource = list(study1 = "RSRC.brge"))
datashield.assign.expr(conns, symbol = "covars", 
                       expr = quote(as.resource.data.frame(covars.res)))

These are the objects available in the Opal server

ds.ls()

$study1
$study1$environment.searched
[1] "R_GlobalEnv"

$study1$objects.found
[1] "covars"     "covars.res" "gds"        "res"       
[5] "vcf.res"

We can use dsBaseClient functions to inspect the variables that are in the covars data.frame. The variables are

ds.colnames("covars")

$study1
[1] "scanID"  "gender"  "obese"   "age"     "smoke"  
[6] "country" "asthma"

The asthma variable has this number of individuals at each level (0: controls, 1: cases)

ds.table("covars$asthma")


 Data in all studies were valid 

Study 1 :  No errors reported from this study

$output.list
$output.list$TABLE_rvar.by.study_row.props
             study
covars$asthma 1
            0 1
            1 1

$output.list$TABLE_rvar.by.study_col.props
             study
covars$asthma         1
            0 0.6864187
            1 0.3135813

$output.list$TABLE_rvar.by.study_counts
             study
covars$asthma    1
            0 1587
            1  725

$output.list$TABLES.COMBINED_all.sources_proportions
covars$asthma
    0     1 
0.686 0.314 

$output.list$TABLES.COMBINED_all.sources_counts
covars$asthma
   0    1 
1587  725 


$validity.message
[1] "Data in all studies were valid"

There may be interest in only studying certain genes, for that matter, the loaded VCF resource can be subsetting as follows

genes <- c("A1BG","A2MP1")
ds.getSNPSbyGen("gds", genes)

The previous code will over-write the VCF with the SNPs corresponding to the selected genes, if the intention is to perform studies with both the complete VCF and a subsetted one, the argument name can be used to create a new object on the server with the subsetted VCF, preserving the complete one.

genes <- c("A1BG","A2MP1")
ds.getSNPSbyGen("gds", genes = genes, name = "subset.vcf")

Then, an object of class GenotypeData must be created at the server side to perform genetic data analyses. This is a container defined in the GWASTools package for storing genotype and phenotypic data from genetic association studies. By doing that we will also verify whether individuals in the GDS (e.g VCF) and covariates files have the same individuals and are in the same order. This can be performed by

ds.GenotypeData(x='gds', covars = 'covars', columnId = 1, sexId = 2, male_encoding = "2", female_encoding = "1", newobj.name = 'gds.Data')

Before performing the association analyses, quality control (QC) can be performed to the loaded data. Three methodologies are available; 1) Principal Component Analysis (PCA) of the genomic data, 2) Hardy-Weinberg Equilibrium (HWE) testing and 3) Allelic frequency estimates. The QC methods 2 and 3 have as inputs a GenotypeData object, created with a covariates file that has a gender column; while method 1 has as input a VCF.

To perform the PCA, a pruning functionality is built inside so that redundant SNPs are discarted (there is an extra argument ld.threshold which controls the pruning, more information about it at the SNPRelate documentation), speeding up the execution time

pca <- ds.PCASNPS("gds", prune = TRUE)
plotPCASNPS(pca)

To perform QC methodologies 2 and 3, the name of the gender column as well as the keys to describe male or female have to be provided. Remember that we can visualize the names of the variables from our data by executing ds.colnames("covars"). In our case, this variable is called “gender,” and the levels of this variable are 1 for male and 2 for female as we can see here (NOTE: we cannot use ds.levels since gender variable is not a factor):

ds.table1D("covars$gender")$counts

      covars$gender
1              1215
2              1097
Total          2312

The HWE test can be performed to selected chromosomes using the argument chromosomes, only the autosomes can be selected when performing a HWE test, the encoding of the autosomes present on the object can be fetched with

ds.genoDimensions('gds.Data')$chromosomes

NULL

Therefore, HWE can be performed by:

ds.exactHWE("gds.Data", chromosome = "20")

$study1
# A tibble: 2,384 x 8
   rs     chr     nAA   nAB   nBB   MAF minor.allele   pval
   <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <chr>         <dbl>
 1 rs613~ 20      123   815  1351 0.232 A            1     
 2 rs603~ 20      114   782  1410 0.219 A            0.670 
 3 rs603~ 20      286  1013  1009 0.343 A            0.197 
 4 rs608~ 20      384  1085   821 0.405 A            0.435 
 5 rs607~ 20      496  1131   679 0.460 A            0.557 
 6 rs169~ 20      358  1100   841 0.395 A            1.00  
 7 rs448~ 20      330  1087   888 0.379 A            0.965 
 8 rs608~ 20      309  1070   927 0.366 A            1     
 9 rs209~ 20       20   446  1840 0.105 A            0.268 
10 rs605~ 20      546  1088   656 0.476 A            0.0237
# ... with 2,374 more rows

attr(,"class")
[1] "dsexactHWE" "list"

The same test can be conducted only for the control individuals (on this example taking the asthma covariates columnas as a cas/control) by:

ds.exactHWE("gds.Data", chromosome = "20", controls = TRUE, controls_column = "asthma")

$study1
# A tibble: 2,384 x 8
   rs      chr     nAA   nAB   nBB   MAF minor.allele  pval
   <chr>   <chr> <dbl> <dbl> <dbl> <dbl> <chr>        <dbl>
 1 rs6139~ 20       90   559   927 0.234 A            0.624
 2 rs6038~ 20       76   532   975 0.216 A            0.767
 3 rs6039~ 20      200   708   677 0.350 A            0.473
 4 rs6086~ 20      266   761   548 0.410 A            0.958
 5 rs6074~ 20      358   770   457 0.469 A            0.338
 6 rs1699~ 20      245   767   567 0.398 A            0.600
 7 rs4482~ 20      224   760   599 0.382 A            0.523
 8 rs6082~ 20      215   736   633 0.368 A            0.957
 9 rs2092~ 20       13   308  1260 0.106 A            0.285
10 rs6051~ 20      382   758   431 0.484 A            0.189
# ... with 2,374 more rows

attr(,"class")
[1] "dsexactHWE" "list"

Similarly, allele frequencies estimates can be estimated by:

ds.alleleFrequency("gds.Data")

$study1
# A tibble: 99,289 x 8
   rs           M       F     all   n.M   n.F     n     MAF
   <chr>    <dbl>   <dbl>   <dbl> <dbl> <dbl> <dbl>   <dbl>
 1 MitoC~ 0.0147  1.5 e-2 1.49e-2  1086  1200  2286 1.49e-2
 2 MitoG~ 0.00276 8.37e-4 1.75e-3  1087  1195  2282 1.75e-3
 3 MitoG~ 0.00548 3.30e-3 4.33e-3  1095  1212  2307 4.33e-3
 4 MitoT~ 0.00868 1.24e-2 1.06e-2  1095  1213  2308 1.06e-2
 5 MitoC~ 0       1.65e-3 8.66e-4  1095  1214  2309 8.66e-4
 6 MitoT~ 0.0754  7.12e-2 7.32e-2  1094  1208  2302 7.32e-2
 7 MitoA~ 0.00913 6.60e-3 7.80e-3  1095  1213  2308 7.80e-3
 8 MitoG~ 0.00274 2.48e-3 2.61e-3  1093  1209  2302 2.61e-3
 9 MitoA~ 0.181   1.82e-1 1.81e-1  1071  1188  2259 1.81e-1
10 MitoT~ 0       1.65e-3 8.68e-4  1094  1210  2304 8.68e-4
# ... with 99,279 more rows

attr(,"class")
[1] "dsalleleFrequency" "list"

In the future, more functions will be created to perform quality control (QC) for both, SNPs and inviduals.

Association analysis for a given SNP is performed by simply

ds.glmSNP(snps.fit = "rs11247693", model = asthma ~ gender + age, genoData='gds.Data')

             Estimate Std. Error   p-value    n     p.adj
rs11247693 -0.1543215  0.2309585 0.5040196 2311 0.5040196
attr(,"class")
[1] "dsGlmSNP" "matrix"   "array"

The analysis of all available SNPs is performed when the argument snps.fit is missing. The function performs the analysis of the selected SNPs in a single repository or in multiple repositories as performing pooled analyses (it uses ds.glm DataSHIELD function). As in the case of transcriptomic data, analyzing all the SNPs in the genome (e.g GWAS) will be high time-consuming. We can adopt a similar approach as the one adopted using the limma at each server. That is, we run GWAS at each repository using specific and scalable packages available in R/Bioc. In that case we use the GWASTools and GENESIS packages. The complete pipeline is implemented in this function

ans.bioC <- ds.GWAS('gds.Data', model=asthma~age+country)

This close the DataSHIELD session

datashield.logout(conns)

4.2 GWAS with PLINK

Here we illustrate how to perform the same GWAS analyses on the asthma using PLINK secure shell commands. This can be performed thanks to the posibility of having ssh resources as described here.

It is worth to notice that this workflow and the new R functions implemented in dsOmicsClient could be used as a guideline to carry out similar analyses using existing analysis tools in genomics such as IMPUTE, SAMtools or BEDtools among many others.

We start by assigning login resources

  library(DSOpal)
  library(dsBaseClient)
  library(dsOmicsClient)
  builder <- newDSLoginBuilder()
  builder$append(server = "study1", url = "https://opal-demo.obiba.org",
                 user = "dsuser", password = "password",
                 resource = "RSRC.brge_plink")
  logindata <- builder$build()

Then we assign the resource to a symbol (i.e. R object) called client which is a ssh resource

  conns <- datashield.login(logins = logindata, assign = TRUE,
                            symbol = "client")
  ds.class("client")

$study1
[1] "SshResourceClient"     "CommandResourceClient"
[3] "ResourceClient"        "R6"

Now, we are ready to run any PLINK command from the client site. Notice that in this case we want to assess association between the genotype data in bed format and use as phenotype the variable ‘asthma’ that is in the file ‘brge.phe’ in the 6th column. The sentence in a PLINK command would be (NOTE: we avoid –out to indicate the output file since the file will be available in R as a tibble).

  plink --bfile brge --logistic --pheno brge.phe --mpheno 6 --covar brge.phe --covar-name gender,age

The arguments must be encapsulated in a single character without the command ‘plink’

  plink.arguments <- "--bfile brge --logistic --pheno brge.phe --mpheno 6 --covar brge.phe --covar-name gender,age"

the analyses are then performed by

  ans.plink <- ds.PLINK("client", plink.arguments)

The object ans.plink contains the PLINK results at each server as well as the outuput provided by PLINK

  lapply(ans.plink, names)

$study1
[1] "results"   "plink.out"

  head(ans.plink$study1$results)

# A tibble: 6 x 9
    CHR SNP       BP A1    TEST  NMISS    OR   STAT       P
  <dbl> <chr>  <dbl> <chr> <chr> <dbl> <dbl>  <dbl>   <dbl>
1     0 MitoC~  3993 T     ADD    2286 0.752 -1.33  1.82e-1
2     0 MitoC~  3993 T     gend~  2286 0.742 -3.27  1.07e-3
3     0 MitoC~  3993 T     age    2286 1.00   0.565 5.72e-1
4     0 MitoG~  4821 A     ADD    2282 2.68   1.71  8.79e-2
5     0 MitoG~  4821 A     gend~  2282 0.740 -3.31  9.40e-4
6     0 MitoG~  4821 A     age    2282 1.00   0.465 6.42e-1

  ans.plink$study$plink.out

$status
[1] 0

$output
 [1] ""                                                                                   
 [2] "@----------------------------------------------------------@"                       
 [3] "|        PLINK!       |     v1.07      |   10/Aug/2009     |"                       
 [4] "|----------------------------------------------------------|"                       
 [5] "|  (C) 2009 Shaun Purcell, GNU General Public License, v2  |"                       
 [6] "|----------------------------------------------------------|"                       
 [7] "|  For documentation, citation & bug-report instructions:  |"                       
 [8] "|        http://pngu.mgh.harvard.edu/purcell/plink/        |"                       
 [9] "@----------------------------------------------------------@"                       
[10] ""                                                                                   
[11] "Skipping web check... [ --noweb ] "                                                 
[12] "Writing this text to log file [ /tmp/ssh-3254/out.log ]"                            
[13] "Analysis started: Mon May 10 12:23:29 2021"                                         
[14] ""                                                                                   
[15] "Options in effect:"                                                                 
[16] "\t--bfile brge"                                                                      
[17] "\t--logistic"                                                                        
[18] "\t--pheno brge.phe"                                                                  
[19] "\t--mpheno 6"                                                                        
[20] "\t--covar brge.phe"                                                                  
[21] "\t--covar-name gender,age"                                                           
[22] "\t--noweb"                                                                           
[23] "\t--out /tmp/ssh-3254/out"                                                           
[24] ""                                                                                   
[25] "Reading map (extended format) from [ brge.bim ] "                                   
[26] "100000 markers to be included from [ brge.bim ]"                                    
[27] "Reading pedigree information from [ brge.fam ] "                                    
[28] "2312 individuals read from [ brge.fam ] "                                           
[29] "2312 individuals with nonmissing phenotypes"                                        
[30] "Assuming a disease phenotype (1=unaff, 2=aff, 0=miss)"                              
[31] "Missing phenotype value is also -9"                                                 
[32] "725 cases, 1587 controls and 0 missing"                                             
[33] "1097 males, 1215 females, and 0 of unspecified sex"                                 
[34] "Reading genotype bitfile from [ brge.bed ] "                                        
[35] "Detected that binary PED file is v1.00 SNP-major mode"                              
[36] "Reading alternate phenotype from [ brge.phe ] "                                     
[37] "2312 individuals with non-missing alternate phenotype"                              
[38] "Assuming a disease phenotype (1=unaff, 2=aff, 0=miss)"                              
[39] "Missing phenotype value is also -9"                                                 
[40] "725 cases, 1587 controls and 0 missing"                                             
[41] "Reading 6 covariates from [ brge.phe ] with nonmissing values for 2199 individuals" 
[42] "Selected subset of 2 from 6 covariates"                                             
[43] "For these, nonmissing covariate values for 2312 individuals"                        
[44] "Before frequency and genotyping pruning, there are 100000 SNPs"                     
[45] "2312 founders and 0 non-founders found"                                             
[46] "6009 heterozygous haploid genotypes; set to missing"                                
[47] "Writing list of heterozygous haploid genotypes to [ /tmp/ssh-3254/out.hh ]"         
[48] "7 SNPs with no founder genotypes observed"                                          
[49] "Warning, MAF set to 0 for these SNPs (see --nonfounders)"                           
[50] "Writing list of these SNPs to [ /tmp/ssh-3254/out.nof ]"                            
[51] "Total genotyping rate in remaining individuals is 0.994408"                         
[52] "0 SNPs failed missingness test ( GENO > 1 )"                                        
[53] "0 SNPs failed frequency test ( MAF < 0 )"                                           
[54] "After frequency and genotyping pruning, there are 100000 SNPs"                      
[55] "After filtering, 725 cases, 1587 controls and 0 missing"                            
[56] "After filtering, 1097 males, 1215 females, and 0 of unspecified sex"                
[57] "Converting data to Individual-major format"                                         
[58] "Writing logistic model association results to [ /tmp/ssh-3254/out.assoc.logistic ] "
[59] ""                                                                                   
[60] "Analysis finished: Mon May 10 12:25:20 2021"                                        
[61] ""                                                                                   

$error
character(0)

$command
[1] "cd /home/master/brge && plink1 --bfile brge --logistic --pheno brge.phe --mpheno 6 --covar brge.phe --covar-name gender,age --noweb --out /tmp/ssh-3254/out"

attr(,"class")
[1] "resource.exec"

We can compare the p-values obtained using PLINK with Bioconductor-based packages for the top-10 SNPs as follows:

  library(tidyverse)
  # get SNP p.values (additive model - ADD)
  res.plink <- ans.plink$study1$results %>% filter(TEST=="ADD") %>%
    arrange(P)
  # compare top-10 with Biocoductor's results
  snps <- res.plink$SNP[1:10]
  plink <- res.plink %>% filter(SNP%in%snps) %>% dplyr::select(SNP, P)
  bioC <- ans.bioC$study1 %>% filter(rs%in%snps) %>% dplyr::select(rs, p.value)
  left_join(plink, bioC, by=c("SNP" = "rs"))

# A tibble: 10 x 3
   SNP                 P p.value
   <chr>           <dbl>   <dbl>
 1 rs2267914  0.00000151 0.0105 
 2 rs6097326  0.00000424 0.00881
 3 rs7153     0.00000440 0.00900
 4 rs3732410  0.00000817 0.00849
 5 rs7995146  0.0000170  0.00789
 6 rs6495788  0.0000213  0.00760
 7 rs1602679  0.0000268  0.00764
 8 rs11055608 0.0000270  0.00805
 9 rs7098143  0.0000313  0.00781
10 rs7676164  0.0000543  0.00706

As expected, the p-values are in the same order of magnitud having little variations due to the implemented methods of each software.

We can do the same comparisons of minor allele frequency (MAF) estimation performed with Bioconductor and PLINK. To this end, we need first to estimate MAF using PLINK

  plink.arguments <- "--bfile brge --freq"
  ans.plink2 <- ds.PLINK("client", plink.arguments)
  maf.plink <- ans.plink2$study1$results
  
  plink <- maf.plink %>% filter(SNP%in%snps) %>% dplyr::select(SNP, MAF)
  bioC <- ans.bioC$study1 %>% filter(rs%in%snps) %>% dplyr::select(rs, freq)
  left_join(plink, bioC, by=c("SNP" = "rs"))

# A tibble: 10 x 3
   SNP           MAF   freq
   <chr>       <dbl>  <dbl>
 1 rs7153     0.256  0.256 
 2 rs3732410  0.254  0.254 
 3 rs7676164  0.304  0.304 
 4 rs1602679  0.101  0.101 
 5 rs7098143  0.210  0.210 
 6 rs11055608 0.446  0.446 
 7 rs7995146  0.0527 0.0527
 8 rs6495788  0.267  0.267 
 9 rs2267914  0.104  0.104 
10 rs6097326  0.125  0.125

This close the DataSHIELD session

  datashield.logout(conns)

5 Acknowledgments

JRG want to thank Deroshan Padotan for having worked on a preliminary version of dsOmicsClient package developed before the resourcer package was created. We also thank Demetris Avraam for his feedback while programming DataSHIELD functions and writing the vignette. Leire Abarrategui is acknowledged for implementing RNA-seq data analysis using DESeq2 and edgeR.

References

Wilson, Rebecca, Oliver Butters, Demetris Avraam, James Baker, Jonathan Tedds, Andrew Turner, Madeleine Murtagh, and Paul Burton. 2017. “DataSHIELD – New Directions and Dimensions.” Data Science Journal 16 (0): 21. https://doi.org/10.5334/dsj-2017-021.

Non-disclosive federated omic data analysis with DataSHIELD and Bioconductor

2021-05-14

Package

1 Purpose

2 Setup

2.1 Required Opal server with resources

2.2 Required DataSHIELD packages in the opal server

2.3 Required R Packages in the client site (e.g. local machine)

3 Transcriptomic and Epigenomic data analysis

3.1 Single study analysis

3.1.1 Illustrative example: differential gene expression (DGE) analysis

3.2 Analysis from multiple studies

3.2.1 Illustrative example: Epigenome-wide association analysis (EWAS)

3.3 Single CpG analysis

3.4 Multiple CpG analysis

3.5 Adjusting for Surrogate Variables

4 Genomic data analysis: GWAS

4.1 Analysis with Bioconductor packages: example how to extend the resources

4.2 GWAS with PLINK

5 Acknowledgments

References