dsOmics 1.0.4
The purpose of the dsOmicsClient package is to provide a set of functions to perform omic association analyses when data are stored on federated databases or, more generally, in different repositories. In particular the package utilizes DataSHIELD infrastructure which is a software solution that allows simultaneous co-analysis of data from multiple studies stored on different servers without the need to physically pool data or disclose sensitive information (Wilson et al. 2017). DataSHIELD uses Opal servers to properly perform such analyses. Our bookdown introduces Opal, DaaSHIELD and other related features. Here, we describe the most relevant ones to be able to reproduce this document.
At a high level DataSHIELD is set up as a client-server model which houses the data for a particular study. A request is made from the client to run specific functions on the remote servers where the analysis is performed. Non-sensitive and pre-approved summary statistics are returned from each study to the client where they can be combined for an overall analysis. An overview of what a single-site DataSHIELD architecture would look like is illustrated in Figure 1.
Figure 1: Single Server DataSHIELD Architecture (Wilson et al 2017)
One of the main limitations of DataSHIELD is how to deal with large data given the restrictions of Opal with databases. Nonetheless, the recent development of the resourcer R package allows DataSHIELD developers to overcome this drawback by granting the Opal servers to deal with any type of data (e.g. resources). So far, Opal can register access to different types of data resources in different formats (csv, tsv, R data, SQL, tiddy, ..) that can also be located in different places (local, http, ssh, AWS S3 or Mongodb file stores, …). This is another important advancement since the resourcer addresses another important issue that is having duplicated data in different research centers or hospitals.
The resourcer package permits to work with specific R data classes. This is highly important in our setting since it will allow to use Bioconductor classes to properly manage omic data using efficient infrastructures such as ExpressionSet or RangedSummarizedExperiment among others. Another important asset of the resourcer package is that it can be extended to new data types by writting specific functions (see how to extending resources. We have used this feature and created some functions for the analysis of Variant Calling Format (VCF files) that are loaded into R as Genomic Data Storage objects. These functions along with others that allow the managment of Bioconductor classes in DataSHIELD have been included in a new DataSHIELD package, the dsOmics, which is able to manage different Bioconductor data infrastructures that are required to perform omic association analyses. These including ExpressionSet, RangedSummarizedExperiment or GDS among others. Generaly speaking, any data format and storage that can be read by R can be expressed as a resource.
In the next sections we first describe how to deal with Opal servers and resources. We illustre how we prepared a test environment to describe how Opal must be setup as well as how to provide the appropiate R/DataSHIELD configuration in both the Opal server and the client side to perform omic association analyses. Then, the different types of omic data analyses that can be performed with the dsOmicsClient functionality are described and further illustrated using real data examples including epigenome, transcriptome and genomic data analyses.
In this section we describe how to configure the Opal server and the needed packages to carry out omic association analyses from the client side. Basically, the resources must be defined in the Opal along with the required information that includes the url where data is located, the format (e.g., SPSS, R class, GDS …) and the credentials which are not visible to the DataSHIELD users (Figure 2). The permission to use a resource for DataSHIELD operations is granted (to a user or a group of users) in Opal.
Figure 2: Information required for Resources
A description of the pre-requisites can be found here. At the time of the writing of this vignette, the resource capabilities of Opal, DataSHIELD and related R packages have not been released yet. Basically, what is needed is:
OPTIONAL (used for development): DSLite
dsBaseClient in the v6.0-dev branch
and in the server side: - resourcer - dsOmics - dsBase in the v6.0-dev branch
Notice that the dsOmics package includes new extensions of the resourcer package to deal with new types of resources such as file in VCF format to converted to a file in GDS format (VCF2GDS). Next subsections further describe what is required along with some examples.
Resources are datasets or computation units which are located under a URL and their access is protected by some credentials. When resources are assigned to a R/DataSHIELD server session, remote big/complex datasets or high performance computers are being accessible to data analysts.
Instead of storing the data in Opal databases, only the way to access them needs to be defined: the datasets are kept in their original format and location (e.g., an R object, a SQL database, a SPSS file, etc.) and are read directly from the R/DataSHIELD server-side session. Then as soon as there is a R reader for the dataset or a connector for the analysis services, a resource can be defined. Opal takes care of the DataSHIELD permissions (a DataSHIELD user cannot see the resource’s credentials) and of the resources assignment to a R/DataSHIELD session (see Figure 3)
Figure 3: Resources: a new DataSHIELD infrastructure
As previously mentioned, the resourcer R package allows to deal with the main data sources (using tidyverse, DBI, dplyr, sparklyr, MongoDB, AWS S3, SSH etc.) and is easily extensible to new ones including specific data infrastructure in R or Bioconductor. So far ExpressionSet and RangedSummarizedExperiment objects saved in .rdata files are accesible through the resourcer package. The dsOmics package contains a new extension that deals with VCF (Variant Calling Format) files which are coerced to a GDS (Genomic Data Storage) format (VCF2GDS).
In order to achive this resourcer extension, two R6 classes have been implemented:
GDSFileResourceResolver class which handles file-base resources with data in GDS or VCF formats. This class is responsible for creating a GDSFileResourceClient object instance from an assigned resource.GDSFileResourceClient class which is responsible for getting the referenced file and making a connection (created by GWASTools) to the GDS file (will also convert the VCF file to a GDS file on the fly, using SNPRelate). For the subsequent analysis, it’s this connection handle to the GDS file that will be used.We have prepared a test environment, with the Opal implementation of Resources and an appropriate R/DataSHIELD configuration that is available at: opal-demo.obiba.org. This figure illustrate the resources which are available for the RSRC project:
Figure 4: Resources from a test enviroment available at https://opal-demo.obiba.org
It is possible to declare a resource that is to be resolved by an R package that uses the resourcer API
Figure 5: Declaration of a resource corresponding to a VCF2GDS format
Required DataSHIELD packages must be uploaded in the opal server through the Administration site by accessing to DataSHIELD tab. In our case, both dsBase and dsOmics and resourcer packages must be installed as is illustrated in the figure (NOTE: dsGeo is uploaded for other type of analyses and it is not necesary for omics).
Figure 6: Installed packages in the test opal server
The tab +Add package can be used to install a new package. The figure depicts how dsOmics was intalled into the opal server
Figure 7: Description how dsOmics package was intalled into the test opal server
In order to use the functions contained within this package the following R packages must be installed in the client side:
install.packages("DSIOpal")
install.packages('dsBaseClient', repos=c(getOption('repos'),
'http://cran.obiba.org'), dependencies=TRUE)
devtools::install_github("isglobal-brge/dsOmicsClient", dependencies = TRUE)
The package dependencies are then loaded as follows:
library(DSOpal)
library(dsBaseClient)
library(dsOmicsClient)
Notes:
For advanced users willing to use DSLite, the server side packages needs to be installed as well:
install.packages(c("resourcer", "DSLite"), dependencies = TRUE)
install.packages("dsBase", repos = c("https://cloud.r-project.org",
"https://cran.obiba.org"), dependencies = TRUE)
We refer to this chapter of our bookdown to a more detail description about how to work with DataSHIELD in a serverless environment.
Let us start by illustrating a simple example where a researcher may be interested in performing differential gene expression analysis (DGE) having data in a single repository (e.g. one study).
Let us illustrate how to perform transcriptomic data analysis using data from TCGA project. We have uploaded to the opal server a resource called tcga_liver whose URL is http://duffel.rail.bio/recount/TCGA/rse_gene_liver.Rdata which is available through the recount project. This resource contains the RangeSummarizedExperiment with the RNAseq profiling of liver cancer data from TCGA. Next, we illustrate how a differential expression analysis to compare RNAseq profiling of women vs men (variable gdc_cases.demographic.gender). The DGE analysis is normally performed using limma package. In that case, as we are analyzing RNA-seq data, limma + voom method will be required.
Let us start by creating the connection to the opal server:
builder <- newDSLoginBuilder()
builder$append(server = "study1", url = "https://opal-demo.obiba.org",
user = "dsuser", password = "password",
resource = "RSRC.tcga_liver")
logindata <- builder$build()
conns <- datashield.login(logins = logindata, assign = TRUE,
symbol = "res")
Then, let us coerce the resource to a RangedSummarizedExperiment which is the type of object that is available in the recount project.
datashield.assign.expr(conns, symbol = "rse",
expr = quote(as.resource.object(res)))
ds.class("rse")
$study1
[1] "RangedSummarizedExperiment"
attr(,"package")
[1] "SummarizedExperiment"
The number of features and samples can be inspected by
ds.dim("rse")
$`dimensions of rse in study1`
[1] 58037 424
$`dimensions of rse in combined studies`
[1] 58037 424
And the names of the features using the same function used in the case of analyzing an ExpressionSet
name.features <- ds.featureNames("rse")
lapply(name.features, head)
$study1
[1] "ENSG00000000003.14" "ENSG00000000005.5"
[3] "ENSG00000000419.12" "ENSG00000000457.13"
[5] "ENSG00000000460.16" "ENSG00000000938.12"
Also the covariate names can be inspected by
name.vars <- ds.featureData("rse")
lapply(name.vars, head, n=15)
$study1
[1] "project"
[2] "sample"
[3] "experiment"
[4] "run"
[5] "read_count_as_reported_by_sra"
[6] "reads_downloaded"
[7] "proportion_of_reads_reported_by_sra_downloaded"
[8] "paired_end"
[9] "sra_misreported_paired_end"
[10] "mapped_read_count"
[11] "auc"
[12] "sharq_beta_tissue"
[13] "sharq_beta_cell_type"
[14] "biosample_submission_date"
[15] "biosample_publication_date"
We can visualize the levels of the variable having gender information that will be our condition (i.e., we are interested in obtaining genes that are differentially expressed between males and females)
ds.table("rse$gdc_cases.demographic.gender")
Data in all studies were valid
Study 1 : No errors reported from this study
$output.list
$output.list$TABLE_rvar.by.study_row.props
study
rse$gdc_cases.demographic.gender 1
female 1
male 1
$output.list$TABLE_rvar.by.study_col.props
study
rse$gdc_cases.demographic.gender 1
female 0.3372642
male 0.6627358
$output.list$TABLE_rvar.by.study_counts
study
rse$gdc_cases.demographic.gender 1
female 143
male 281
$output.list$TABLES.COMBINED_all.sources_proportions
rse$gdc_cases.demographic.gender
female male
0.337 0.663
$output.list$TABLES.COMBINED_all.sources_counts
rse$gdc_cases.demographic.gender
female male
143 281
$validity.message
[1] "Data in all studies were valid"
We have implemented a function called ds.RNAseqPreproc() to perform RNAseq data pre-processing that includes:
ds.RNAseqPreproc('rse', group= 'gdc_cases.demographic.gender',
newobj.name = 'rse.pre')
Note that it is recommended to indicate the grouping variable (i.e., condition). Once data have been pre-processed, we can perform differential expression analysis. Notice how dimensions have changed given the fact that we have removed genes with low expression which are expected to do not be differentially expressed.
ds.dim('rse')
$`dimensions of rse in study1`
[1] 58037 424
$`dimensions of rse in combined studies`
[1] 58037 424
ds.dim('rse.pre')
$`dimensions of rse.pre in study1`
[1] 40363 424
$`dimensions of rse.pre in combined studies`
[1] 40363 424
The differential expression analysis is ´dsOmicsClient/dsOmics´ is implemented in the funcion ds.limma(). This functions runs a limma-pipeline for microarray data and for RNAseq data allows:
We recommend to use the voom + limma pipeline proposed here given its versatility and that limma is much faster than DESeq2 and edgeR. By default, the function consider that data are obtained from a microarray experiment (type.data = "microarray"). Therefore, as we are analyzing RNAseq data, we much indicate that type.data = "RNAse"
ans.gender <- ds.limma(model = ~ gdc_cases.demographic.gender,
Set = "rse.pre", type.data = "RNAseq")
The top differentially expressed genes can be visualized by:
ans.gender
$study1
# A tibble: 40,363 x 7
id n beta SE t P.Value adj.P.Val
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ENSG00000~ 424 -12.4 0.0761 -52.1 2.74e-187 1.11e-182
2 ENSG00000~ 424 -10.2 0.461 -43.8 5.21e-160 1.05e-155
3 ENSG00000~ 424 -11.0 0.0603 -39.5 1.08e-144 1.45e-140
4 ENSG00000~ 424 -11.3 0.0651 -36.0 2.27e-131 2.29e-127
5 ENSG00000~ 424 10.9 0.0885 35.5 1.85e-129 1.49e-125
6 ENSG00000~ 424 10.2 0.118 32.9 3.72e-119 2.50e-115
7 ENSG00000~ 424 11.4 0.128 31.9 5.57e-115 3.21e-111
8 ENSG00000~ 424 -7.78 0.0812 -28.8 3.85e-102 1.94e- 98
9 ENSG00000~ 424 9.62 0.0894 27.4 4.72e- 96 2.12e- 92
10 ENSG00000~ 424 11.4 0.0924 27.3 9.63e- 96 3.89e- 92
# ... with 40,353 more rows
attr(,"class")
[1] "dsLimma" "list"
We can verify whether the distribution of the observed p-values are the ones we expect in this type of analyses
hist(ans.gender$study1$P.Value, xlab="Raw p-value gender effect",
main="", las=1, cex.lab=1.5, cex.axis=1.2, col="gray")
We can also check whether there is inflation just executing
qqplot(ans.gender$study1$P.Value)
So, in that case, the model needs to remove unwanted variability (\(\lambda>2\)). If so, we can use surrogate variable analysis just changing the argument sva=TRUE
ans.gender.sva <- ds.limma(model = ~ gdc_cases.demographic.gender,
Set = "rse.pre", type.data = "RNAseq",
sva = TRUE)
Now the inflation has dramatically been reduced (\(\lambda>1.12\))
qqplot(ans.gender.sva$study1$P.Value)
We can add annotation to the output that is available in our RSE object. We can have access to this information by
ds.fvarLabels('rse.pre')
$study1
[1] "chromosome" "start" "end" "width"
[5] "strand" "gene_id" "bp_length" "symbol"
attr(,"class")
[1] "dsfvarLabels" "list"
So, we can run
ans.gender.sva <- ds.limma(model = ~ gdc_cases.demographic.gender,
Set = "rse.pre", type.data = "RNAseq",
sva = TRUE, annotCols = c("chromosome"))
The results are:
ans.gender.sva
$study1
# A tibble: 40,363 x 8
id n beta SE t P.Value adj.P.Val
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ENSG00000~ 424 -12.1 0.0713 -51.3 1.82e-178 7.33e-174
2 ENSG00000~ 424 -9.95 0.467 -44.9 1.52e-158 3.07e-154
3 ENSG00000~ 424 -10.7 0.0513 -40.3 2.19e-143 2.95e-139
4 ENSG00000~ 424 -10.7 0.0461 -37.4 6.78e-133 6.84e-129
5 ENSG00000~ 424 10.8 0.0571 36.4 2.43e-129 1.96e-125
6 ENSG00000~ 424 10.3 0.0774 34.3 1.98e-121 1.33e-117
7 ENSG00000~ 424 9.64 0.101 31.6 5.70e-111 2.87e-107
8 ENSG00000~ 424 11.2 0.0691 31.7 2.34e-111 1.35e-107
9 ENSG00000~ 424 4.48 0.0678 30.2 2.13e-105 8.61e-102
10 ENSG00000~ 424 11.2 0.0547 30.9 2.17e-108 9.72e-105
# ... with 40,353 more rows, and 1 more variable:
# chromosome <chr>
attr(,"class")
[1] "dsLimma" "list"
The function has another arguments that can be used to fit other type of models:
lmFit)eBayes)voom transformation (default “none”)voomQualityWeights function be used instead of voom? (default FALSE)We have also implemented two other functions ds.DESeq2 and ds.edgeR that perform DGE analysis using DESeq2 and edgeR methods. This is the R code used to that purpose:
To be supplied
We close the DataSHIELD session by:
datashield.logout(conns)
The Figure 8 describes the different types of omic association analyses that can be performed using DataSHIELD client functions implemented in the dsOmicsClient package. Basically, data (omic and phenotypes/covariates) can be stored in different sites (http, ssh, AWS S3, local, …) and are managed with Opal through the resourcer package and their extensions implemented in dsOmics.
Figure 8: Non-disclosive omic data analysis with DataSHIELD and Bioconductor
The figure illustrates how the resourcer package is used to get access to omic data through the Opal servers. Then DataSHIELD is used in the client side to perform non-disclosive data analyses.
Then, dsOmicsClient package allows different types of analyses: “virtually” pooled and federated meta-analysis. Both methods are based on fitting different generalized linear models (GLMs) for each feature when assesing association between omic data and the phenotype/trait/condition of interest. Of course non-disclosive omic data analysis from a single study can also be performed.
The “virtually” pooled approach (Figure 9) is recommended when the user wants to analyze omic data from different sources and obtain results as if the data were located in a single computer. It should be noticed that this can be very time consuming when analyzing multiple features since it calls repeatedly to a base function in DataSHIELD (ds.glm) and that it cannot be recommended when data are not properly harmonized (e.g. gene expression normalized using different methods, GWAS data having different platforms, …). Also when it is necesary to remove unwanted variability (for transcriptomic and epigenomica analysis) or control for population stratification (for GWAS analysis), this approach cannot be used since we need to develop methods to compute surrogate variables (to remove unwanted variability) or PCAs (to to address population stratification) in a non-disclosive way.
The federated meta-analysis approach Figure 10 overcomes the limitations raised when performing pooled analyses. First, the computation issue is addressed by using scalable and fast methods to perform data analysis at whole-genome level at each server. The transcriptomic and epigenomic data analyses make use of the widely used limma package that uses ExpressionSet or RangedSummarizedExperiment Bioc infrastructures to deal with omic and phenotypic (e.g covariates). The genomic data are analyzed using GWASTools and GENESIS that are designed to perform quality control (QC) and GWAS using GDS infrastructure.
Next, we describe how both approaches are implemented:
ds.glm() function which is a DataSHIELD function that uses an approach that is mathematically equivalent to placing all individual-level data froma all sources in one central warehouse and analysing those data using the conventional glm() function in R. The user can select one (or multiple) features (i.e., genes, transcripts, CpGs, SNPs, …)
Figure 9: Non-disclosive omic data analysis with DataSHIELD and Bioconductor
The figure illustrates how to perform single pooled omic data analysis. The analyses are performed by using a generalized linear model (glm) on data from one or multiple sources. It makes use of ds.glm(), a DataSHIELD function, that uses an approach that is mathematically equivalent to placing all individual-level data from all sources in one central warehouse and analysing those data using the conventional glm() function in R.
Figure 10: Non-disclosive omic data analysis with DataSHIELD and Bioconductor
The figure illustrates how to perform anlyses at genome-wide level from one or multiple sources. It runs standard Bioconductor functions at each server independently to speed up the analyses and in the case of having multiple sources, results can be meta-analyzed uning standar R functions.
EWAS requires basically the same statistical methods as those used in DGE. It should be notice that the pooled analysis we are going to illustrate here can also be performed with transcriptomic data since each study must have different range values. If so, gene expression harmonization should be performed, for instance, by standardizing the data at each study. For EWAS where methylation is measured using beta values (e.g CpG data are in the range 0-1) this is not a problem. In any case, adopting the meta-analysis approach could be a safe option.
We have downloaded data from GEO corresponding to the accesion number GSE66351 which includes DNA methylation profiling (Illumina 450K array) of 190 individuals. Data corresponds to CpGs beta values measured in the superior temporal gyrus and prefrontal cortex brain regions of patients with Alzheimer’s. Data have been downloaded using GEOquery package that gets GEO data as ExpressionSet objects. Researchers who are not familiar with ExpressionSets can read this Section. Notice that data are encoded as beta-values that ensure data harmonization across studies.
In order to illustrate how to perform data analyses using federated data, we have split the data into two ExpressionSets having 100 and 90 samples as if they were two different studies. Figure 4 shows the two resources defined for both studies (GSE66351_1 and GSE66351_2)
In order to perform omic data analyses, we need first to login and assign resources to DataSHIELD. This can be performed using the as.resource.object() function
builder <- DSI::newDSLoginBuilder()
builder$append(server = "study1", url = "https://opal-demo.obiba.org",
user = "dsuser", password = "password",
resource = "RSRC.GSE66351_1")
builder$append(server = "study2", url = "https://opal-demo.obiba.org",
user = "dsuser", password = "password",
resource = "RSRC.GSE66351_2")
logindata <- builder$build()
conns <- DSI::datashield.login(logins = logindata, assign = TRUE,
symbol = "res")
# Assign to the original R class (e.g ExpressionSet)
datashield.assign.expr(conns, symbol = "methy",
expr = quote(as.resource.object(res)))
Now, we can see that the resources are actually loaded into the R servers as their original class
ds.class("methy")
$study1
[1] "ExpressionSet"
attr(,"package")
[1] "Biobase"
$study2
[1] "ExpressionSet"
attr(,"package")
[1] "Biobase"
Then, some Bioconductor-type functions can be use to return non-disclosive information of ExpressionSets from each server to the client, using similar functions as those defined in the dsBaseClient package. For example, feature names can be returned by
fn <- ds.featureNames("methy")
lapply(fn, head)
$study1
[1] "cg00000029" "cg00000108" "cg00000109" "cg00000165"
[5] "cg00000236" "cg00000289"
$study2
[1] "cg00000029" "cg00000108" "cg00000109" "cg00000165"
[5] "cg00000236" "cg00000289"
Experimental phenotypes variables can be obtained by
ds.varLabels("methy")
$study1
[1] "title" "geo_accession"
[3] "status" "submission_date"
[5] "last_update_date" "type"
[7] "channel_count" "source_name_ch1"
[9] "organism_ch1" "characteristics_ch1"
[11] "characteristics_ch1.1" "characteristics_ch1.2"
[13] "characteristics_ch1.3" "characteristics_ch1.4"
[15] "characteristics_ch1.5" "characteristics_ch1.6"
[17] "characteristics_ch1.7" "characteristics_ch1.8"
[19] "molecule_ch1" "extract_protocol_ch1"
[21] "label_ch1" "label_protocol_ch1"
[23] "taxid_ch1" "hyb_protocol"
[25] "scan_protocol" "description"
[27] "data_processing" "platform_id"
[29] "contact_name" "contact_email"
[31] "contact_phone" "contact_laboratory"
[33] "contact_institute" "contact_address"
[35] "contact_city" "contact_zip/postal_code"
[37] "contact_country" "supplementary_file"
[39] "supplementary_file.1" "data_row_count"
[41] "age" "braak_stage"
[43] "brain_region" "cell type"
[45] "diagnosis" "donor_id"
[47] "sentrix_id" "sentrix_position"
[49] "Sex"
$study2
[1] "title" "geo_accession"
[3] "status" "submission_date"
[5] "last_update_date" "type"
[7] "channel_count" "source_name_ch1"
[9] "organism_ch1" "characteristics_ch1"
[11] "characteristics_ch1.1" "characteristics_ch1.2"
[13] "characteristics_ch1.3" "characteristics_ch1.4"
[15] "characteristics_ch1.5" "characteristics_ch1.6"
[17] "characteristics_ch1.7" "characteristics_ch1.8"
[19] "molecule_ch1" "extract_protocol_ch1"
[21] "label_ch1" "label_protocol_ch1"
[23] "taxid_ch1" "hyb_protocol"
[25] "scan_protocol" "description"
[27] "data_processing" "platform_id"
[29] "contact_name" "contact_email"
[31] "contact_phone" "contact_laboratory"
[33] "contact_institute" "contact_address"
[35] "contact_city" "contact_zip/postal_code"
[37] "contact_country" "supplementary_file"
[39] "supplementary_file.1" "data_row_count"
[41] "age" "braak_stage"
[43] "brain_region" "cell type"
[45] "diagnosis" "donor_id"
[47] "sentrix_id" "sentrix_position"
[49] "Sex"
attr(,"class")
[1] "dsvarLabels" "list"
Once the methylation data have been loaded into the opal server, we can perform different type of analyses using functions from the dsOmicsClient package. Let us start by illustrating how to analyze a single CpG from two studies by using an approach that is mathematically equivalent to placing all individual-level.
ans <- ds.lmFeature(feature = "cg07363416",
model = ~ diagnosis + Sex,
Set = "methy",
datasources = conns)
ans
Estimate Std. Error p-value
cg07363416 0.03459886 0.02504291 0.1670998
attr(,"class")
[1] "dsLmFeature" "matrix" "array"
The same analysis can be performed for all features (e.g. CpGs) just avoiding the feature argument. This process can be parallelized using mclapply function from the multicore package.
ans <- ds.lmFeature(model = ~ diagnosis + Sex,
Set = "methy",
datasources = conns,
mc.cores = 20)
This method corresponds to the pooled analysis approach and can be very time consiming since the function repeatedly calls the DataSHIELD function ds.glm(). We can adopt another strategy that is to run a glm of each feature independently at each study using limma package (which is really fast) and then combine the results (i.e. meta-analysis approach).
ans.limma <- ds.limma(model = ~ diagnosis + Sex,
Set = "methy",
datasources = conns)
Then, we can visualize the top genes at each study (i.e server) by
lapply(ans.limma, head)
$study1
# A tibble: 6 x 7
id n beta SE t P.Value adj.P.Val
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 cg13138~ 100 -0.147 0.0122 -6.62 1.90e-9 0.000466
2 cg23859~ 100 -0.0569 0.00520 -6.58 2.32e-9 0.000466
3 cg13772~ 100 -0.0820 0.0135 -6.50 3.27e-9 0.000466
4 cg12706~ 100 -0.0519 0.00872 -6.45 4.25e-9 0.000466
5 cg24724~ 100 -0.0452 0.00775 -6.39 5.47e-9 0.000466
6 cg02812~ 100 -0.125 0.0163 -6.33 7.31e-9 0.000466
$study2
# A tibble: 6 x 7
id n beta SE t P.Value adj.P.Val
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 cg04046~ 90 -0.101 0.0128 -5.91 6.21e-8 0.0172
2 cg07664~ 90 -0.0431 0.00390 -5.85 8.22e-8 0.0172
3 cg27098~ 90 -0.0688 0.0147 -5.79 1.07e-7 0.0172
4 cg08933~ 90 -0.0461 0.00791 -5.55 2.98e-7 0.0360
5 cg18349~ 90 -0.0491 0.00848 -5.42 5.07e-7 0.0489
6 cg02182~ 90 -0.0199 0.0155 -5.36 6.70e-7 0.0538
The annotation can be added by using the argument annotCols. It should be a vector with the columns of the annotation available in the ExpressionSet or RangedSummarizedExperiment that want to be showed. The columns of the annotation can be obtained by
ds.fvarLabels("methy")
$study1
[1] "ID"
[2] "Name"
[3] "AddressA_ID"
[4] "AlleleA_ProbeSeq"
[5] "AddressB_ID"
[6] "AlleleB_ProbeSeq"
[7] "Infinium_Design_Type"
[8] "Next_Base"
[9] "Color_Channel"
[10] "Forward_Sequence"
[11] "Genome_Build"
[12] "CHR"
[13] "MAPINFO"
[14] "SourceSeq"
[15] "Chromosome_36"
[16] "Coordinate_36"
[17] "Strand"
[18] "Probe_SNPs"
[19] "Probe_SNPs_10"
[20] "Random_Loci"
[21] "Methyl27_Loci"
[22] "UCSC_RefGene_Name"
[23] "UCSC_RefGene_Accession"
[24] "UCSC_RefGene_Group"
[25] "UCSC_CpG_Islands_Name"
[26] "Relation_to_UCSC_CpG_Island"
[27] "Phantom"
[28] "DMR"
[29] "Enhancer"
[30] "HMM_Island"
[31] "Regulatory_Feature_Name"
[32] "Regulatory_Feature_Group"
[33] "DHS"
[34] "RANGE_START"
[35] "RANGE_END"
[36] "RANGE_GB"
[37] "SPOT_ID"
$study2
[1] "ID"
[2] "Name"
[3] "AddressA_ID"
[4] "AlleleA_ProbeSeq"
[5] "AddressB_ID"
[6] "AlleleB_ProbeSeq"
[7] "Infinium_Design_Type"
[8] "Next_Base"
[9] "Color_Channel"
[10] "Forward_Sequence"
[11] "Genome_Build"
[12] "CHR"
[13] "MAPINFO"
[14] "SourceSeq"
[15] "Chromosome_36"
[16] "Coordinate_36"
[17] "Strand"
[18] "Probe_SNPs"
[19] "Probe_SNPs_10"
[20] "Random_Loci"
[21] "Methyl27_Loci"
[22] "UCSC_RefGene_Name"
[23] "UCSC_RefGene_Accession"
[24] "UCSC_RefGene_Group"
[25] "UCSC_CpG_Islands_Name"
[26] "Relation_to_UCSC_CpG_Island"
[27] "Phantom"
[28] "DMR"
[29] "Enhancer"
[30] "HMM_Island"
[31] "Regulatory_Feature_Name"
[32] "Regulatory_Feature_Group"
[33] "DHS"
[34] "RANGE_START"
[35] "RANGE_END"
[36] "RANGE_GB"
[37] "SPOT_ID"
attr(,"class")
[1] "dsfvarLabels" "list"
Then we can run the analysis and obtain the output with the chromosome and gene symbol by:
ans.limma.annot <- ds.limma(model = ~ diagnosis + Sex,
Set = "methy",
annotCols = c("CHR", "UCSC_RefGene_Name"),
datasources = conns)
lapply(ans.limma.annot, head)
$study1
# A tibble: 6 x 9
id n beta SE t P.Value adj.P.Val CHR
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 cg13~ 100 -0.147 0.0122 -6.62 1.90e-9 0.000466 2
2 cg23~ 100 -0.0569 0.00520 -6.58 2.32e-9 0.000466 2
3 cg13~ 100 -0.0820 0.0135 -6.50 3.27e-9 0.000466 17
4 cg12~ 100 -0.0519 0.00872 -6.45 4.25e-9 0.000466 19
5 cg24~ 100 -0.0452 0.00775 -6.39 5.47e-9 0.000466 19
6 cg02~ 100 -0.125 0.0163 -6.33 7.31e-9 0.000466 2
# ... with 1 more variable: UCSC_RefGene_Name <chr>
$study2
# A tibble: 6 x 9
id n beta SE t P.Value adj.P.Val CHR
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 cg04~ 90 -0.101 0.0128 -5.91 6.21e-8 0.0172 11
2 cg07~ 90 -0.0431 0.00390 -5.85 8.22e-8 0.0172 6
3 cg27~ 90 -0.0688 0.0147 -5.79 1.07e-7 0.0172 11
4 cg08~ 90 -0.0461 0.00791 -5.55 2.98e-7 0.0360 1
5 cg18~ 90 -0.0491 0.00848 -5.42 5.07e-7 0.0489 3
6 cg02~ 90 -0.0199 0.0155 -5.36 6.70e-7 0.0538 8
# ... with 1 more variable: UCSC_RefGene_Name <chr>
Then, the last step is to meta-analyze the results. Different methods can be used to this end. We have implemented a method that meta-analyze the p-pvalues of each study as follows:
ans.meta <- metaPvalues(ans.limma)
ans.meta
# A tibble: 481,868 x 4
id study1 study2 p.meta
<chr> <dbl> <dbl> <dbl>
1 cg13138089 0.00000000190 0.00000763 4.78e-13
2 cg25317941 0.0000000179 0.00000196 1.12e-12
3 cg02812891 0.00000000731 0.00000707 1.63e-12
4 cg12706938 0.00000000425 0.0000161 2.14e-12
5 cg16026647 0.000000101 0.000000797 2.51e-12
6 cg12695465 0.00000000985 0.0000144 4.33e-12
7 cg21171625 0.000000146 0.00000225 9.78e-12
8 cg13772815 0.00000000327 0.000122 1.18e-11
9 cg00228891 0.000000166 0.00000283 1.38e-11
10 cg21488617 0.0000000186 0.0000299 1.62e-11
# ... with 481,858 more rows
We can verify that the results are pretty similar to those obtained using pooled analyses. Here we compute the association for two of the top-CpGs:
res1 <- ds.lmFeature(feature = "cg13138089",
model = ~ diagnosis + Sex,
Set = "methy",
datasources = conns)
res1
Estimate Std. Error p-value
cg13138089 -0.1373348 0.01712405 1.057482e-15
attr(,"class")
[1] "dsLmFeature" "matrix" "array"
res2 <- ds.lmFeature(feature = "cg13772815",
model = ~ diagnosis + Sex,
Set = "methy",
datasources = conns)
res2
Estimate Std. Error p-value
cg13772815 -0.06786137 0.009128915 1.056225e-13
attr(,"class")
[1] "dsLmFeature" "matrix" "array"
We can create a QQ-plot by using the function qqplot available in our package.
qqplot(ans.meta$p.meta)
Here In some cases inflation can be observed, so that, correction for cell-type or surrogate variables must be performed. We describe how we can do that in the next two sections.
The vast majority of omic studies require to control for unwanted variability. The surrogate variable analysis (SVA) can address this issue by estimating some hidden covariates that capture differences across individuals due to some artifacts such as batch effects or sample quality sam among others. The method is implemented in SVA package.
Performing this type of analysis using the ds.lmFeature function is not allowed since estimating SVA would require to implement a non-disclosive method that computes SVA from the different servers. This will be a future topic of the dsOmicsClient. NOTE that, estimating SVA separately at each server would not be a good idea since the aim of SVA is to capture differences mainly due to experimental issues among ALL individuals. What we can do instead is to use the ds.limma function to perform the analyses adjusted for SVA at each study.
ans.sva <- ds.limma(model = ~ diagnosis + Sex,
Set = "methy",
sva = TRUE, annotCols = c("CHR", "UCSC_RefGene_Name"))
ans.sva
$study1
# A tibble: 481,868 x 9
id n beta SE t P.Value adj.P.Val
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 cg035381~ 100 -0.0300 0.0106 -6.99 3.64e-10 0.000102
2 cg247245~ 100 -0.0397 0.00452 -6.88 6.12e-10 0.000102
3 cg021567~ 100 -0.0497 0.00947 -6.87 6.35e-10 0.000102
4 cg238596~ 100 -0.0498 0.00715 -6.79 9.36e-10 0.000113
5 cg137728~ 100 -0.0767 0.00697 -6.62 2.01e- 9 0.000168
6 cg052138~ 100 -0.0803 0.0160 -6.62 2.10e- 9 0.000168
7 cg194060~ 100 -0.0409 0.00779 -6.52 3.29e- 9 0.000201
8 cg261524~ 100 -0.0840 0.00947 -6.52 3.33e- 9 0.000201
9 cg009799~ 100 -0.0723 0.00539 -6.44 4.81e- 9 0.000242
10 cg276606~ 100 -0.0514 0.00128 -6.43 5.03e- 9 0.000242
# ... with 481,858 more rows, and 2 more variables:
# CHR <chr>, UCSC_RefGene_Name <chr>
$study2
# A tibble: 481,868 x 9
id n beta SE t P.Value adj.P.Val
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 cg03470~ 90 -0.0502 0.0104 -6.41 7.30e-9 0.00352
2 cg02293~ 90 -0.0394 0.00372 -6.02 4.04e-8 0.00663
3 cg24302~ 90 -0.0370 0.0111 -5.87 7.90e-8 0.00663
4 cg07664~ 90 -0.0416 0.00719 -5.85 8.43e-8 0.00663
5 cg16766~ 90 -0.0436 0.00671 -5.83 9.38e-8 0.00663
6 cg04046~ 90 -0.100 0.0142 -5.81 9.90e-8 0.00663
7 cg00848~ 90 -0.0414 0.00836 -5.79 1.07e-7 0.00663
8 cg08862~ 90 -0.0289 0.00900 -5.79 1.10e-7 0.00663
9 cg10850~ 90 -0.0700 0.00445 -5.74 1.35e-7 0.00672
10 cg27098~ 90 -0.0664 0.00156 -5.72 1.49e-7 0.00672
# ... with 481,858 more rows, and 2 more variables:
# CHR <chr>, UCSC_RefGene_Name <chr>
attr(,"class")
[1] "dsLimma" "list"
Then, data can be combined meta-anlyzed as follows:
ans.meta.sv <- metaPvalues(ans.sva)
ans.meta.sv
# A tibble: 481,868 x 4
id study1 study2 p.meta
<chr> <dbl> <dbl> <dbl>
1 cg24302412 1.94e- 8 0.0000000790 5.37e-14
2 cg01301319 8.29e- 9 0.000000769 2.14e-13
3 cg05213896 2.10e- 9 0.00000763 5.25e-13
4 cg02812891 1.07e- 8 0.00000153 5.37e-13
5 cg00228891 5.61e- 8 0.000000301 5.53e-13
6 cg12695465 5.90e- 9 0.00000360 6.90e-13
7 cg13138089 6.16e- 9 0.00000571 1.12e-12
8 cg24724506 6.12e-10 0.0000829 1.60e-12
9 cg19406036 3.29e- 9 0.0000331 3.37e-12
10 cg16026647 2.82e- 7 0.000000449 3.89e-12
# ... with 481,858 more rows
The DataSHIELD session must by closed by:
datashield.logout(conns)
Genomic data can be stored in different formats. PLINK and VCF files are commonly used in genetic epidemiology studies. We have a GWAS example available at BRGE data repository that aims to find SNPs associated with asthma. We have data stored in VCF (brge.vcf) with several covariates and phenotypes available in the file brge.txt (gender, age, obesity, smoking, country and asthma status). The same data is also available in PLINK format (brge.bed, brge.bim, brge.fam) with covariates in the file brge.phe.
In order to deal with this type of data, we have extended the resources available at the resourcer package to VCF files. NOTE: PLINK files can be translated into VCF files using different pipelines. In R you can use SeqArray to get VCF files.
We use the Genomic Data Storage (GDS) format which efficiently manage VCF files into the R environment. This extension requires to create a Client and a Resolver function that are located into the dsOmics package. The client function uses snpgdsVCF2GDS function implemented in SNPrelate to coerce the VCF file to a GDS object. Then the GDS object is loaded into R as an object of class GdsGenotypeReader from GWASTools package that facilitates downstream analyses.
The opal API server allows to incorporate this new type of resource as illustrated in Figure 11.
Figure 11: Description of how a VCF file can be added to the opal resources
It is important to notice that the URL should contain the tag method=biallelic.only&snpfirstdim=TRUE since these are required parameters of snpgdsVCF2GDS function. This is an example:
https://raw.githubusercontent.com/isglobal-brge/scoreInvHap/master/inst/extdata/example.vcf?method=biallelic.only&snpfirstdim=TRUE
In that case we indicate that only biallelic SNPs are considered (‘method=biallelic.only’) and that genotypes are stored in the individual-major mode, (i.e., list all SNPs for the first individual, and then list all SNPs for the second individual, etc) (‘snpfirstdim=TRUE’).
We have created a resource having the VCF file of our study on asthma as previously described. The name of the resource is brge_vcf the phenotypes are available in another resource called brge that is a .txt file. The GWAS analysis is then perform as follows
We first start by preparing login data
builder <- newDSLoginBuilder()
builder$append(server = "study1", url = "https://opal-demo.obiba.org",
user = "dsuser", password = "password",
resource = "RSRC.brge_vcf")
logindata <- builder$build()
conns <- datashield.login(logins = logindata, assign = TRUE,
symbol = "res")
In this case we have to assign to different resources. One for the VCF (obesity_vcf) and another one for the phenotypic data (obesity). To this end, the datashield.assign.resource function is required before assigning any object to the specific resource. Notice that the VCF resource can be load into R as a GDS thanks to our extension of existing resources in the reourcer
datashield.assign.resource(conns, symbol = "vcf.res",
resource = list(study1 = "RSRC.brge_vcf"))
datashield.assign.expr(conns, symbol = "gds",
expr = quote(as.resource.object(vcf.res)))
datashield.assign.resource(conns, symbol = "covars.res",
resource = list(study1 = "RSRC.brge"))
datashield.assign.expr(conns, symbol = "covars",
expr = quote(as.resource.data.frame(covars.res)))
These are the objects available in the Opal server
ds.ls()
$study1
$study1$environment.searched
[1] "R_GlobalEnv"
$study1$objects.found
[1] "covars" "covars.res" "gds" "res"
[5] "vcf.res"
We can use dsBaseClient functions to inspect the variables that are in the covars data.frame. The variables are
ds.colnames("covars")
$study1
[1] "scanID" "gender" "obese" "age" "smoke"
[6] "country" "asthma"
The asthma variable has this number of individuals at each level (0: controls, 1: cases)
ds.table("covars$asthma")
Data in all studies were valid
Study 1 : No errors reported from this study
$output.list
$output.list$TABLE_rvar.by.study_row.props
study
covars$asthma 1
0 1
1 1
$output.list$TABLE_rvar.by.study_col.props
study
covars$asthma 1
0 0.6864187
1 0.3135813
$output.list$TABLE_rvar.by.study_counts
study
covars$asthma 1
0 1587
1 725
$output.list$TABLES.COMBINED_all.sources_proportions
covars$asthma
0 1
0.686 0.314
$output.list$TABLES.COMBINED_all.sources_counts
covars$asthma
0 1
1587 725
$validity.message
[1] "Data in all studies were valid"
There may be interest in only studying certain genes, for that matter, the loaded VCF resource can be subsetting as follows
genes <- c("A1BG","A2MP1")
ds.getSNPSbyGen("gds", genes)
The previous code will over-write the VCF with the SNPs corresponding to the selected genes, if the intention is to perform studies with both the complete VCF and a subsetted one, the argument name can be used to create a new object on the server with the subsetted VCF, preserving the complete one.
genes <- c("A1BG","A2MP1")
ds.getSNPSbyGen("gds", genes = genes, name = "subset.vcf")
Then, an object of class GenotypeData must be created at the server side to perform genetic data analyses. This is a container defined in the GWASTools package for storing genotype and phenotypic data from genetic association studies. By doing that we will also verify whether individuals in the GDS (e.g VCF) and covariates files have the same individuals and are in the same order. This can be performed by
ds.GenotypeData(x='gds', covars = 'covars', columnId = 1, sexId = 2, male_encoding = "2", female_encoding = "1", newobj.name = 'gds.Data')
Before performing the association analyses, quality control (QC) can be performed to the loaded data. Three methodologies are available; 1) Principal Component Analysis (PCA) of the genomic data, 2) Hardy-Weinberg Equilibrium (HWE) testing and 3) Allelic frequency estimates. The QC methods 2 and 3 have as inputs a GenotypeData object, created with a covariates file that has a gender column; while method 1 has as input a VCF.
To perform the PCA, a pruning functionality is built inside so that redundant SNPs are discarted (there is an extra argument ld.threshold which controls the pruning, more information about it at the SNPRelate documentation), speeding up the execution time
pca <- ds.PCASNPS("gds", prune = TRUE)
plotPCASNPS(pca)
To perform QC methodologies 2 and 3, the name of the gender column as well as the keys to describe male or female have to be provided. Remember that we can visualize the names of the variables from our data by executing ds.colnames("covars"). In our case, this variable is called “gender,” and the levels of this variable are 1 for male and 2 for female as we can see here (NOTE: we cannot use ds.levels since gender variable is not a factor):
ds.table1D("covars$gender")$counts
covars$gender
1 1215
2 1097
Total 2312
The HWE test can be performed to selected chromosomes using the argument chromosomes, only the autosomes can be selected when performing a HWE test, the encoding of the autosomes present on the object can be fetched with
ds.genoDimensions('gds.Data')$chromosomes
NULL
Therefore, HWE can be performed by:
ds.exactHWE("gds.Data", chromosome = "20")
$study1
# A tibble: 2,384 x 8
rs chr nAA nAB nBB MAF minor.allele pval
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 rs613~ 20 123 815 1351 0.232 A 1
2 rs603~ 20 114 782 1410 0.219 A 0.670
3 rs603~ 20 286 1013 1009 0.343 A 0.197
4 rs608~ 20 384 1085 821 0.405 A 0.435
5 rs607~ 20 496 1131 679 0.460 A 0.557
6 rs169~ 20 358 1100 841 0.395 A 1.00
7 rs448~ 20 330 1087 888 0.379 A 0.965
8 rs608~ 20 309 1070 927 0.366 A 1
9 rs209~ 20 20 446 1840 0.105 A 0.268
10 rs605~ 20 546 1088 656 0.476 A 0.0237
# ... with 2,374 more rows
attr(,"class")
[1] "dsexactHWE" "list"
The same test can be conducted only for the control individuals (on this example taking the asthma covariates columnas as a cas/control) by:
ds.exactHWE("gds.Data", chromosome = "20", controls = TRUE, controls_column = "asthma")
$study1
# A tibble: 2,384 x 8
rs chr nAA nAB nBB MAF minor.allele pval
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 rs6139~ 20 90 559 927 0.234 A 0.624
2 rs6038~ 20 76 532 975 0.216 A 0.767
3 rs6039~ 20 200 708 677 0.350 A 0.473
4 rs6086~ 20 266 761 548 0.410 A 0.958
5 rs6074~ 20 358 770 457 0.469 A 0.338
6 rs1699~ 20 245 767 567 0.398 A 0.600
7 rs4482~ 20 224 760 599 0.382 A 0.523
8 rs6082~ 20 215 736 633 0.368 A 0.957
9 rs2092~ 20 13 308 1260 0.106 A 0.285
10 rs6051~ 20 382 758 431 0.484 A 0.189
# ... with 2,374 more rows
attr(,"class")
[1] "dsexactHWE" "list"
Similarly, allele frequencies estimates can be estimated by:
ds.alleleFrequency("gds.Data")
$study1
# A tibble: 99,289 x 8
rs M F all n.M n.F n MAF
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 MitoC~ 0.0147 1.5 e-2 1.49e-2 1086 1200 2286 1.49e-2
2 MitoG~ 0.00276 8.37e-4 1.75e-3 1087 1195 2282 1.75e-3
3 MitoG~ 0.00548 3.30e-3 4.33e-3 1095 1212 2307 4.33e-3
4 MitoT~ 0.00868 1.24e-2 1.06e-2 1095 1213 2308 1.06e-2
5 MitoC~ 0 1.65e-3 8.66e-4 1095 1214 2309 8.66e-4
6 MitoT~ 0.0754 7.12e-2 7.32e-2 1094 1208 2302 7.32e-2
7 MitoA~ 0.00913 6.60e-3 7.80e-3 1095 1213 2308 7.80e-3
8 MitoG~ 0.00274 2.48e-3 2.61e-3 1093 1209 2302 2.61e-3
9 MitoA~ 0.181 1.82e-1 1.81e-1 1071 1188 2259 1.81e-1
10 MitoT~ 0 1.65e-3 8.68e-4 1094 1210 2304 8.68e-4
# ... with 99,279 more rows
attr(,"class")
[1] "dsalleleFrequency" "list"
In the future, more functions will be created to perform quality control (QC) for both, SNPs and inviduals.
Association analysis for a given SNP is performed by simply
ds.glmSNP(snps.fit = "rs11247693", model = asthma ~ gender + age, genoData='gds.Data')
Estimate Std. Error p-value n p.adj
rs11247693 -0.1543215 0.2309585 0.5040196 2311 0.5040196
attr(,"class")
[1] "dsGlmSNP" "matrix" "array"
The analysis of all available SNPs is performed when the argument snps.fit is missing. The function performs the analysis of the selected SNPs in a single repository or in multiple repositories as performing pooled analyses (it uses ds.glm DataSHIELD function). As in the case of transcriptomic data, analyzing all the SNPs in the genome (e.g GWAS) will be high time-consuming. We can adopt a similar approach as the one adopted using the limma at each server. That is, we run GWAS at each repository using specific and scalable packages available in R/Bioc. In that case we use the GWASTools and GENESIS packages. The complete pipeline is implemented in this function
ans.bioC <- ds.GWAS('gds.Data', model=asthma~age+country)
This close the DataSHIELD session
datashield.logout(conns)
Here we illustrate how to perform the same GWAS analyses on the asthma using PLINK secure shell commands. This can be performed thanks to the posibility of having ssh resources as described here.
It is worth to notice that this workflow and the new R functions implemented in dsOmicsClient could be used as a guideline to carry out similar analyses using existing analysis tools in genomics such as IMPUTE, SAMtools or BEDtools among many others.
We start by assigning login resources
library(DSOpal)
library(dsBaseClient)
library(dsOmicsClient)
builder <- newDSLoginBuilder()
builder$append(server = "study1", url = "https://opal-demo.obiba.org",
user = "dsuser", password = "password",
resource = "RSRC.brge_plink")
logindata <- builder$build()
Then we assign the resource to a symbol (i.e. R object) called client which is a ssh resource
conns <- datashield.login(logins = logindata, assign = TRUE,
symbol = "client")
ds.class("client")
$study1
[1] "SshResourceClient" "CommandResourceClient"
[3] "ResourceClient" "R6"
Now, we are ready to run any PLINK command from the client site. Notice that in this case we want to assess association between the genotype data in bed format and use as phenotype the variable ‘asthma’ that is in the file ‘brge.phe’ in the 6th column. The sentence in a PLINK command would be (NOTE: we avoid –out to indicate the output file since the file will be available in R as a tibble).
plink --bfile brge --logistic --pheno brge.phe --mpheno 6 --covar brge.phe --covar-name gender,age
The arguments must be encapsulated in a single character without the command ‘plink’
plink.arguments <- "--bfile brge --logistic --pheno brge.phe --mpheno 6 --covar brge.phe --covar-name gender,age"
the analyses are then performed by
ans.plink <- ds.PLINK("client", plink.arguments)
The object ans.plink contains the PLINK results at each server as well as the outuput provided by PLINK
lapply(ans.plink, names)
$study1
[1] "results" "plink.out"
head(ans.plink$study1$results)
# A tibble: 6 x 9
CHR SNP BP A1 TEST NMISS OR STAT P
<dbl> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 0 MitoC~ 3993 T ADD 2286 0.752 -1.33 1.82e-1
2 0 MitoC~ 3993 T gend~ 2286 0.742 -3.27 1.07e-3
3 0 MitoC~ 3993 T age 2286 1.00 0.565 5.72e-1
4 0 MitoG~ 4821 A ADD 2282 2.68 1.71 8.79e-2
5 0 MitoG~ 4821 A gend~ 2282 0.740 -3.31 9.40e-4
6 0 MitoG~ 4821 A age 2282 1.00 0.465 6.42e-1
ans.plink$study$plink.out
$status
[1] 0
$output
[1] ""
[2] "@----------------------------------------------------------@"
[3] "| PLINK! | v1.07 | 10/Aug/2009 |"
[4] "|----------------------------------------------------------|"
[5] "| (C) 2009 Shaun Purcell, GNU General Public License, v2 |"
[6] "|----------------------------------------------------------|"
[7] "| For documentation, citation & bug-report instructions: |"
[8] "| http://pngu.mgh.harvard.edu/purcell/plink/ |"
[9] "@----------------------------------------------------------@"
[10] ""
[11] "Skipping web check... [ --noweb ] "
[12] "Writing this text to log file [ /tmp/ssh-3254/out.log ]"
[13] "Analysis started: Mon May 10 12:23:29 2021"
[14] ""
[15] "Options in effect:"
[16] "\t--bfile brge"
[17] "\t--logistic"
[18] "\t--pheno brge.phe"
[19] "\t--mpheno 6"
[20] "\t--covar brge.phe"
[21] "\t--covar-name gender,age"
[22] "\t--noweb"
[23] "\t--out /tmp/ssh-3254/out"
[24] ""
[25] "Reading map (extended format) from [ brge.bim ] "
[26] "100000 markers to be included from [ brge.bim ]"
[27] "Reading pedigree information from [ brge.fam ] "
[28] "2312 individuals read from [ brge.fam ] "
[29] "2312 individuals with nonmissing phenotypes"
[30] "Assuming a disease phenotype (1=unaff, 2=aff, 0=miss)"
[31] "Missing phenotype value is also -9"
[32] "725 cases, 1587 controls and 0 missing"
[33] "1097 males, 1215 females, and 0 of unspecified sex"
[34] "Reading genotype bitfile from [ brge.bed ] "
[35] "Detected that binary PED file is v1.00 SNP-major mode"
[36] "Reading alternate phenotype from [ brge.phe ] "
[37] "2312 individuals with non-missing alternate phenotype"
[38] "Assuming a disease phenotype (1=unaff, 2=aff, 0=miss)"
[39] "Missing phenotype value is also -9"
[40] "725 cases, 1587 controls and 0 missing"
[41] "Reading 6 covariates from [ brge.phe ] with nonmissing values for 2199 individuals"
[42] "Selected subset of 2 from 6 covariates"
[43] "For these, nonmissing covariate values for 2312 individuals"
[44] "Before frequency and genotyping pruning, there are 100000 SNPs"
[45] "2312 founders and 0 non-founders found"
[46] "6009 heterozygous haploid genotypes; set to missing"
[47] "Writing list of heterozygous haploid genotypes to [ /tmp/ssh-3254/out.hh ]"
[48] "7 SNPs with no founder genotypes observed"
[49] "Warning, MAF set to 0 for these SNPs (see --nonfounders)"
[50] "Writing list of these SNPs to [ /tmp/ssh-3254/out.nof ]"
[51] "Total genotyping rate in remaining individuals is 0.994408"
[52] "0 SNPs failed missingness test ( GENO > 1 )"
[53] "0 SNPs failed frequency test ( MAF < 0 )"
[54] "After frequency and genotyping pruning, there are 100000 SNPs"
[55] "After filtering, 725 cases, 1587 controls and 0 missing"
[56] "After filtering, 1097 males, 1215 females, and 0 of unspecified sex"
[57] "Converting data to Individual-major format"
[58] "Writing logistic model association results to [ /tmp/ssh-3254/out.assoc.logistic ] "
[59] ""
[60] "Analysis finished: Mon May 10 12:25:20 2021"
[61] ""
$error
character(0)
$command
[1] "cd /home/master/brge && plink1 --bfile brge --logistic --pheno brge.phe --mpheno 6 --covar brge.phe --covar-name gender,age --noweb --out /tmp/ssh-3254/out"
attr(,"class")
[1] "resource.exec"
We can compare the p-values obtained using PLINK with Bioconductor-based packages for the top-10 SNPs as follows:
library(tidyverse)
# get SNP p.values (additive model - ADD)
res.plink <- ans.plink$study1$results %>% filter(TEST=="ADD") %>%
arrange(P)
# compare top-10 with Biocoductor's results
snps <- res.plink$SNP[1:10]
plink <- res.plink %>% filter(SNP%in%snps) %>% dplyr::select(SNP, P)
bioC <- ans.bioC$study1 %>% filter(rs%in%snps) %>% dplyr::select(rs, p.value)
left_join(plink, bioC, by=c("SNP" = "rs"))
# A tibble: 10 x 3
SNP P p.value
<chr> <dbl> <dbl>
1 rs2267914 0.00000151 0.0105
2 rs6097326 0.00000424 0.00881
3 rs7153 0.00000440 0.00900
4 rs3732410 0.00000817 0.00849
5 rs7995146 0.0000170 0.00789
6 rs6495788 0.0000213 0.00760
7 rs1602679 0.0000268 0.00764
8 rs11055608 0.0000270 0.00805
9 rs7098143 0.0000313 0.00781
10 rs7676164 0.0000543 0.00706
As expected, the p-values are in the same order of magnitud having little variations due to the implemented methods of each software.
We can do the same comparisons of minor allele frequency (MAF) estimation performed with Bioconductor and PLINK. To this end, we need first to estimate MAF using PLINK
plink.arguments <- "--bfile brge --freq"
ans.plink2 <- ds.PLINK("client", plink.arguments)
maf.plink <- ans.plink2$study1$results
plink <- maf.plink %>% filter(SNP%in%snps) %>% dplyr::select(SNP, MAF)
bioC <- ans.bioC$study1 %>% filter(rs%in%snps) %>% dplyr::select(rs, freq)
left_join(plink, bioC, by=c("SNP" = "rs"))
# A tibble: 10 x 3
SNP MAF freq
<chr> <dbl> <dbl>
1 rs7153 0.256 0.256
2 rs3732410 0.254 0.254
3 rs7676164 0.304 0.304
4 rs1602679 0.101 0.101
5 rs7098143 0.210 0.210
6 rs11055608 0.446 0.446
7 rs7995146 0.0527 0.0527
8 rs6495788 0.267 0.267
9 rs2267914 0.104 0.104
10 rs6097326 0.125 0.125
This close the DataSHIELD session
datashield.logout(conns)
JRG want to thank Deroshan Padotan for having worked on a preliminary version of dsOmicsClient package developed before the resourcer package was created. We also thank Demetris Avraam for his feedback while programming DataSHIELD functions and writing the vignette. Leire Abarrategui is acknowledged for implementing RNA-seq data analysis using DESeq2 and edgeR.