Contents

1 Introduction

This tutorial aims to provide examples about how to analyze omics data in DataSHIELD. More detailed information can be found in our bookdown Orchestrating privacy-protected non-disclosive big data analyses of data from different resources with R and DataSHIELD.

2 Getting started

This document can be reproduced by installing the following packages

install.packages("DSOpal")
devtools::install_github("datashield/dsBaseClient")
devtools::install_github("isglobal-brge/dsOmicsClient") 

Then the packages are loaded as usual

library(DSOpal) 
library(dsBaseClient)
library(dsOmicsClient)

We have set up an Opal demo site to illustrate how to perform different omic data analyses using DataSHIELD. The DataSHIELD user credentials are

We also illustrate how to deal with different resources for ’omic data. For this, the Opal server can be accessed with the credentials:

This tutorial will mainly make use of the resources available at RSRC project

Resources available at Opal demo site of RSRC project

Figure 1: Resources available at Opal demo site of RSRC project

and we will cover how to perform:

  1. Transcriptomic data analysis
  2. Epigenomic data analysis
  3. GWAS

For those who are not familiar with omic data analysis or do not know the state-of-the-art methods, you can read this bookdown on Omic Data Analysis with R/Bioconductor.

3 Types of omic data analyses implemented

The Figure 2 describes the different types of ’omic association analyses that can be performed using DataSHIELD client functions implemented in the dsOmicsClient package. Basically, data (’omic and phenotypes/covariates) can be stored in different sites (http, ssh, AWS S3, local, …) and are managed with Opal through the resourcer package and their extensions implemented in dsOmics.

Non-disclosive omic data analysis with DataSHIELD and Bioconductor. The figure illustrates how the `resourcer` package is used to get access to omic data through the Opal servers. Then DataSHIELD is used in the client side to perform non-disclosive data analyses.

Figure 2: Non-disclosive omic data analysis with DataSHIELD and Bioconductor
The figure illustrates how the resourcer package is used to get access to omic data through the Opal servers. Then DataSHIELD is used in the client side to perform non-disclosive data analyses.

The dsOmicsClient package allows different types of analyses: pooled and meta-analysis. Both methods are based on fitting different Generalized Linear Models (GLMs) for each feature when assessing association between ’omic data and the phenotype/trait/condition of interest. Of course, non-disclosive ’omic data analysis from a single study can also be performed.

The pooled approach (Figure 3) is recommended when the user wants to analyze ’omic data from different sources and obtain results as if the data were located in a single computer. It should be noted that this can be very time consuming when analyzing multiple features since it calls a base function in DataSHIELD (ds.glm) repeatedly. It also cannot be recommended when data are not properly harmonized (e.g. gene expression normalized using different methods, GWAS data having different platforms, …). Furthermore when it is necesary to remove unwanted variability (for transcriptomic and epigenomica analysis) or control for population stratification (for GWAS analysis), this approach cannot be used since we need to develop methods to compute surrogate variables (to remove unwanted variability) or PCAs (to to address population stratification) in a non-disclosive way.

The meta-analysis approach Figure 4 overcomes the limitations raised when performing pooled analyses. First, the computation issue is addressed by using scalable and fast methods to perform data analysis at whole-genome level at each location The transcriptomic and epigenomic data analyses make use of the widely used limma package that uses ExpressionSet or RangedSummarizedExperiment Bioc infrastructures to deal with ’omic and phenotypic (e.g covariates). The genomic data are analyzed using GWASTools and GENESIS that are designed to perform quality control (QC) and GWAS using GDS infrastructure.

Next, we describe how both approaches are implemented:

Non-disclosive omic data analysis with DataSHIELD and Bioconductor. The figure illustrates how to perform single pooled omic data analysis. The analyses are performed by using a generalized linear model (glm) on data from one or multiple sources. It makes use of `ds.glm()`, a DataSHIELD function, that uses an approach that is mathematically equivalent to placing all individual-level data from all sources in one central warehouse and analysing those data using the conventional `glm()` function in R.

Figure 3: Non-disclosive omic data analysis with DataSHIELD and Bioconductor
The figure illustrates how to perform single pooled omic data analysis. The analyses are performed by using a generalized linear model (glm) on data from one or multiple sources. It makes use of ds.glm(), a DataSHIELD function, that uses an approach that is mathematically equivalent to placing all individual-level data from all sources in one central warehouse and analysing those data using the conventional glm() function in R.

Non-disclosive omic data analysis with DataSHIELD and Bioconductor. The figure illustrates how to perform anlyses at genome-wide level from one or multiple sources. It runs standard Bioconductor functions at each server independently to speed up the analyses and in the case of having multiple sources, results can be meta-analyzed uning standar R functions.

Figure 4: Non-disclosive omic data analysis with DataSHIELD and Bioconductor
The figure illustrates how to perform anlyses at genome-wide level from one or multiple sources. It runs standard Bioconductor functions at each server independently to speed up the analyses and in the case of having multiple sources, results can be meta-analyzed uning standar R functions.

3.1 Differential gene expression (DGE) analysis

Let us illustrate how to perform transcriptomic data analysis using data from TCGA project. We have uploaded to the opal server a resource called tcga_liver whose URL is http://duffel.rail.bio/recount/TCGA/rse_gene_liver.Rdata which is available through the recount project. This resource contains the RangeSummarizedExperiment with the RNAseq profiling of liver cancer data from TCGA. Next, we illustrate how a differential expression analysis to compare RNAseq profiling of women vs men (variable gdc_cases.demographic.gender). The DGE analysis is normally performed using limma package. In that case, as we are analyzing RNA-seq data, limma + voom method will be required.

Let us start by creating the connection to the opal server:

builder <- newDSLoginBuilder()
builder$append(server = "study1", url = "https://opal-demo.obiba.org", 
               user = "dsuser", password = "password", 
               resource = "RSRC.tcga_liver", driver = "OpalDriver")

logindata <- builder$build()

conns <- datashield.login(logins = logindata, assign = TRUE, 
                          symbol = "res")

Then, let us coerce the resource to a RangedSummarizedExperiment which is the type of object that is available in the recount project.

datashield.assign.expr(conns, symbol = "rse", 
                       expr = quote(as.resource.object(res)))
ds.class("rse")
$study1
[1] "RangedSummarizedExperiment"
attr(,"package")
[1] "SummarizedExperiment"

The number of features and samples can be inspected by

ds.dim("rse")
$`dimensions of rse in study1`
[1] 58037   424

$`dimensions of rse in combined studies`
[1] 58037   424

And the names of the features using the same function used in the case of analyzing an ExpressionSet

name.features <- ds.featureNames("rse")
lapply(name.features, head)
$study1
[1] "ENSG00000000003.14" "ENSG00000000005.5"  "ENSG00000000419.12" "ENSG00000000457.13"
[5] "ENSG00000000460.16" "ENSG00000000938.12"

Also the covariate names can be inspected by

name.vars <- ds.featureData("rse")
lapply(name.vars, head, n=15)
$study1
 [1] "project"                                       
 [2] "sample"                                        
 [3] "experiment"                                    
 [4] "run"                                           
 [5] "read_count_as_reported_by_sra"                 
 [6] "reads_downloaded"                              
 [7] "proportion_of_reads_reported_by_sra_downloaded"
 [8] "paired_end"                                    
 [9] "sra_misreported_paired_end"                    
[10] "mapped_read_count"                             
[11] "auc"                                           
[12] "sharq_beta_tissue"                             
[13] "sharq_beta_cell_type"                          
[14] "biosample_submission_date"                     
[15] "biosample_publication_date"                    

We can visualize the levels of the variable having gender information

ds.table("rse$gdc_cases.demographic.gender")

 Data in all studies were valid 

Study 1 :  No errors reported from this study
$output.list
$output.list$TABLE_rvar.by.study_row.props
                                study
rse$gdc_cases.demographic.gender 1
                          female 1
                          male   1

$output.list$TABLE_rvar.by.study_col.props
                                study
rse$gdc_cases.demographic.gender         1
                          female 0.3372642
                          male   0.6627358

$output.list$TABLE_rvar.by.study_counts
                                study
rse$gdc_cases.demographic.gender   1
                          female 143
                          male   281

$output.list$TABLES.COMBINED_all.sources_proportions
rse$gdc_cases.demographic.gender
female   male 
 0.337  0.663 

$output.list$TABLES.COMBINED_all.sources_counts
rse$gdc_cases.demographic.gender
female   male 
   143    281 


$validity.message
[1] "Data in all studies were valid"

The differential expression analysis is then performed by:

ans.gender <- ds.limma(model =  ~ gdc_cases.demographic.gender, 
                       Set = "rse", type.data = "RNAseq", 
                       sva = FALSE)

Notice that we have set type.data='RNAseq' to consider that our data are counts obtained from a RNA-seq experiment. By indicating so, the differential analysis is performed by using voom + limma as previously mention.

The top differentially expressed genes can be visualized by:

ans.gender
$study1
# A tibble: 58,037 x 10
   id                  logFC   CI.L   CI.R AveExpr     t   P.Value adj.P.Val     B     SE
   <chr>               <dbl>  <dbl>  <dbl>   <dbl> <dbl>     <dbl>     <dbl> <dbl>  <dbl>
 1 ENSG00000233070.1   10.8   10.4   11.2  -5.59    55.2 2.89e-196 1.68e-191  418. 0.0763
 2 ENSG00000274655.1  -12.4  -12.8  -11.9  -7.71   -54.8 5.47e-195 1.59e-190  416. 0.460 
 3 ENSG00000260197.1   10.1    9.74  10.6  -5.88    48.4 2.45e-175 4.73e-171  373. 0.0601
 4 ENSG00000213318.4   11.4   10.9   11.9  -4.35    45.0 5.62e-164 8.16e-160  345. 0.0649
 5 ENSG00000231535.5    8.98   8.57   9.39 -7.38    42.9 9.49e-157 1.10e-152  334. 0.0886
 6 ENSG00000270641.1  -10.2  -10.7   -9.73  0.0618 -41.7 2.19e-152 2.12e-148  331. 0.118 
 7 ENSG00000277577.1  -11.3  -11.9  -10.8  -8.08   -41.1 3.13e-150 2.59e-146  321. 0.127 
 8 ENSG00000229807.10 -11.0  -11.6  -10.5  -0.211  -37.6 1.69e-137 1.09e-133  297. 0.0814
 9 ENSG00000278847.1    7.91   7.50   8.32 -7.94    37.9 1.12e-138 8.14e-135  294. 0.0895
10 ENSG00000176728.7   11.1   10.5   11.8  -4.40    36.1 1.46e-131 8.48e-128  277. 0.0923
# ... with 58,027 more rows

attr(,"class")
[1] "dsLimma" "list"   

We have also implemented two other functions ds.DESeq2 and ds.edgeR that perform DGE analysis using DESeq2 and edgeR methods. This is the R code used to that purpose:

To be supplied

We close the DataSHIELD session by:

datashield.logout(conns)

3.2 Epigenome-wide association analysis (EWAS)

EWAS requires basically the same statistical methods as those used in DGE. It should be notice that the pooled analysis we are going to illustrate here can also be performed with transcriptomic data since each study must have different range values. If so, gene expression harmonization should be performed, for instance, by standardizing the data at each study. For EWAS where methylation is measured using beta values (e.g CpG data are in the range 0-1) this is not a problem. In any case, adopting the meta-analysis approach could be a safe option.

We have downloaded data from GEO corresponding to the accesion number GSE66351 which includes DNA methylation profiling (Illumina 450K array) of 190 individuals. Data corresponds to CpGs beta values measured in the superior temporal gyrus and prefrontal cortex brain regions of patients with Alzheimer’s. Data have been downloaded using GEOquery package that gets GEO data as ExpressionSet objects. Researchers who are not familiar with ExpressionSets can read this Section. Notice that data are encoded as beta-values that ensure data harmonization across studies.

In order to illustrate how to perform data analyses using federated data, we have split the data into two ExpressionSets having 100 and 90 samples as if they were two different studies. Figure ?? shows the two resources defined for both studies (GSE66351_1 and GSE66351_2)

In order to perform omic data analyses, we need first to login and assign resources to DataSHIELD. This can be performed using the as.resource.object() function

builder <- DSI::newDSLoginBuilder()
builder$append(server = "study1", url = "https://opal-demo.obiba.org", 
               user = "dsuser", password = "password", 
               resource = "RSRC.GSE66351_1", driver = "OpalDriver")
builder$append(server = "study2", url = "https://opal-demo.obiba.org", 
               user = "dsuser", password = "password", 
               resource = "RSRC.GSE66351_2", driver = "OpalDriver")

logindata <- builder$build()

conns <- DSI::datashield.login(logins = logindata, assign = TRUE, 
                               symbol = "res")


# Assign to the original R class (e.g ExpressionSet)
datashield.assign.expr(conns, symbol = "methy", 
                       expr = quote(as.resource.object(res)))

Now, we can see that the resources are actually loaded into the R servers as their original class

ds.class("methy")
$study1
[1] "ExpressionSet"
attr(,"package")
[1] "Biobase"

$study2
[1] "ExpressionSet"
attr(,"package")
[1] "Biobase"

Then, some Bioconductor-type functions can be use to return non-disclosive information of ExpressionSets from each server to the client, using similar functions as those defined in the dsBaseClient package. For example, feature names can be returned by

fn <- ds.featureNames("methy")
lapply(fn, head)
$study1
[1] "cg00000029" "cg00000108" "cg00000109" "cg00000165" "cg00000236" "cg00000289"

$study2
[1] "cg00000029" "cg00000108" "cg00000109" "cg00000165" "cg00000236" "cg00000289"

Experimental phenotypes variables can be obtained by

ds.varLabels("methy")
$study1
 [1] "title"                   "geo_accession"           "status"                 
 [4] "submission_date"         "last_update_date"        "type"                   
 [7] "channel_count"           "source_name_ch1"         "organism_ch1"           
[10] "characteristics_ch1"     "characteristics_ch1.1"   "characteristics_ch1.2"  
[13] "characteristics_ch1.3"   "characteristics_ch1.4"   "characteristics_ch1.5"  
[16] "characteristics_ch1.6"   "characteristics_ch1.7"   "characteristics_ch1.8"  
[19] "molecule_ch1"            "extract_protocol_ch1"    "label_ch1"              
[22] "label_protocol_ch1"      "taxid_ch1"               "hyb_protocol"           
[25] "scan_protocol"           "description"             "data_processing"        
[28] "platform_id"             "contact_name"            "contact_email"          
[31] "contact_phone"           "contact_laboratory"      "contact_institute"      
[34] "contact_address"         "contact_city"            "contact_zip/postal_code"
[37] "contact_country"         "supplementary_file"      "supplementary_file.1"   
[40] "data_row_count"          "age"                     "braak_stage"            
[43] "brain_region"            "cell type"               "diagnosis"              
[46] "donor_id"                "sentrix_id"              "sentrix_position"       
[49] "Sex"                    

$study2
 [1] "title"                   "geo_accession"           "status"                 
 [4] "submission_date"         "last_update_date"        "type"                   
 [7] "channel_count"           "source_name_ch1"         "organism_ch1"           
[10] "characteristics_ch1"     "characteristics_ch1.1"   "characteristics_ch1.2"  
[13] "characteristics_ch1.3"   "characteristics_ch1.4"   "characteristics_ch1.5"  
[16] "characteristics_ch1.6"   "characteristics_ch1.7"   "characteristics_ch1.8"  
[19] "molecule_ch1"            "extract_protocol_ch1"    "label_ch1"              
[22] "label_protocol_ch1"      "taxid_ch1"               "hyb_protocol"           
[25] "scan_protocol"           "description"             "data_processing"        
[28] "platform_id"             "contact_name"            "contact_email"          
[31] "contact_phone"           "contact_laboratory"      "contact_institute"      
[34] "contact_address"         "contact_city"            "contact_zip/postal_code"
[37] "contact_country"         "supplementary_file"      "supplementary_file.1"   
[40] "data_row_count"          "age"                     "braak_stage"            
[43] "brain_region"            "cell type"               "diagnosis"              
[46] "donor_id"                "sentrix_id"              "sentrix_position"       
[49] "Sex"                    

attr(,"class")
[1] "dsvarLabels" "list"       

3.2.1 Single CpG analysis

Once the methylation data have been loaded into the opal server, we can perform different type of analyses using functions from the dsOmicsClient package. Let us start by illustrating how to analyze a single CpG from two studies by using an approach that is mathematically equivalent to placing all individual-level.

ans <- ds.lmFeature(feature = "cg07363416", 
                    model = ~ diagnosis + Sex, 
                    Set = "methy",
                    datasources = conns)
ans
             Estimate Std. Error   p-value
cg07363416 0.03459886 0.02504291 0.1670998
attr(,"class")
[1] "dsLmFeature" "matrix"      "array"      

3.2.2 Multiple CpG analysis

The same analysis can be performed for all features (e.g. CpGs) just avoiding the feature argument. This process can be parallelized using mclapply function from the multicore package.

ans <- ds.lmFeature(model = ~ diagnosis + Sex, 
                    Set = "methy",
                    datasources = conns,
                    mc.cores = 20)

This method corresponds to the pooled analysis approach and can be very time consiming since the function repeatedly calls the DataSHIELD function ds.glm(). We can adopt another strategy that is to run a glm of each feature independently at each study using limma package (which is really fast) and then combine the results (i.e. meta-analysis approach).

ans.limma <- ds.limma(model = ~ diagnosis + Sex,
                      Set = "methy", 
                      datasources = conns)

Then, we can visualize the top genes at each study (i.e server) by

lapply(ans.limma, head)
$study1
# A tibble: 6 x 10
  id           logFC    CI.L    CI.R AveExpr     t       P.Value adj.P.Val     B      SE
  <chr>        <dbl>   <dbl>   <dbl>   <dbl> <dbl>         <dbl>     <dbl> <dbl>   <dbl>
1 cg13138089 -0.147  -0.191  -0.103    0.380 -6.62 0.00000000190  0.000466 10.6  0.0122 
2 cg23859635 -0.0569 -0.0741 -0.0397   0.200 -6.58 0.00000000232  0.000466 10.4  0.00520
3 cg13772815 -0.0820 -0.107  -0.0570   0.437 -6.50 0.00000000327  0.000466 10.0  0.0135 
4 cg12706938 -0.0519 -0.0678 -0.0359   0.145 -6.45 0.00000000425  0.000466  9.76 0.00872
5 cg24724506 -0.0452 -0.0593 -0.0312   0.139 -6.39 0.00000000547  0.000466  9.51 0.00775
6 cg02812891 -0.125  -0.165  -0.0860   0.247 -6.33 0.00000000731  0.000466  9.23 0.0163 

$study2
# A tibble: 6 x 10
  id           logFC    CI.L    CI.R AveExpr     t      P.Value adj.P.Val     B      SE
  <chr>        <dbl>   <dbl>   <dbl>   <dbl> <dbl>        <dbl>     <dbl> <dbl>   <dbl>
1 cg04046629 -0.101  -0.135  -0.0669  0.345  -5.91 0.0000000621    0.0172  7.18 0.0128 
2 cg07664323 -0.0431 -0.0577 -0.0284  0.776  -5.85 0.0000000822    0.0172  6.90 0.00390
3 cg27098804 -0.0688 -0.0924 -0.0452  0.277  -5.79 0.000000107     0.0172  6.64 0.0147 
4 cg08933615 -0.0461 -0.0627 -0.0296  0.166  -5.55 0.000000298     0.0360  5.64 0.00791
5 cg18349298 -0.0491 -0.0671 -0.0311  0.157  -5.42 0.000000507     0.0489  5.12 0.00848
6 cg02182795 -0.0199 -0.0272 -0.0125  0.0947 -5.36 0.000000670     0.0538  4.84 0.0155 

The annotation can be added by using the argument annotCols. It should be a vector with the columns of the annotation available in the ExpressionSet or RangedSummarizedExperiment that want to be showed. The columns of the annotation can be obtained by

ds.fvarLabels("methy")
$study1
 [1] "title"                   "geo_accession"           "status"                 
 [4] "submission_date"         "last_update_date"        "type"                   
 [7] "channel_count"           "source_name_ch1"         "organism_ch1"           
[10] "characteristics_ch1"     "characteristics_ch1.1"   "characteristics_ch1.2"  
[13] "characteristics_ch1.3"   "characteristics_ch1.4"   "characteristics_ch1.5"  
[16] "characteristics_ch1.6"   "characteristics_ch1.7"   "characteristics_ch1.8"  
[19] "molecule_ch1"            "extract_protocol_ch1"    "label_ch1"              
[22] "label_protocol_ch1"      "taxid_ch1"               "hyb_protocol"           
[25] "scan_protocol"           "description"             "data_processing"        
[28] "platform_id"             "contact_name"            "contact_email"          
[31] "contact_phone"           "contact_laboratory"      "contact_institute"      
[34] "contact_address"         "contact_city"            "contact_zip/postal_code"
[37] "contact_country"         "supplementary_file"      "supplementary_file.1"   
[40] "data_row_count"          "age"                     "braak_stage"            
[43] "brain_region"            "cell type"               "diagnosis"              
[46] "donor_id"                "sentrix_id"              "sentrix_position"       
[49] "Sex"                    

$study2
 [1] "title"                   "geo_accession"           "status"                 
 [4] "submission_date"         "last_update_date"        "type"                   
 [7] "channel_count"           "source_name_ch1"         "organism_ch1"           
[10] "characteristics_ch1"     "characteristics_ch1.1"   "characteristics_ch1.2"  
[13] "characteristics_ch1.3"   "characteristics_ch1.4"   "characteristics_ch1.5"  
[16] "characteristics_ch1.6"   "characteristics_ch1.7"   "characteristics_ch1.8"  
[19] "molecule_ch1"            "extract_protocol_ch1"    "label_ch1"              
[22] "label_protocol_ch1"      "taxid_ch1"               "hyb_protocol"           
[25] "scan_protocol"           "description"             "data_processing"        
[28] "platform_id"             "contact_name"            "contact_email"          
[31] "contact_phone"           "contact_laboratory"      "contact_institute"      
[34] "contact_address"         "contact_city"            "contact_zip/postal_code"
[37] "contact_country"         "supplementary_file"      "supplementary_file.1"   
[40] "data_row_count"          "age"                     "braak_stage"            
[43] "brain_region"            "cell type"               "diagnosis"              
[46] "donor_id"                "sentrix_id"              "sentrix_position"       
[49] "Sex"                    

attr(,"class")
[1] "dsfvarLabels" "list"        

Then we can run the analysis and obtain the output with the chromosome and gene symbol by:

ans.limma.annot <- ds.limma(model = ~ diagnosis + Sex,
                            Set = "methy", 
                            annotCols = c("CHR", "UCSC_RefGene_Name"),
                            datasources = conns)
lapply(ans.limma.annot, head)
$study1
# A tibble: 6 x 12
  id    CHR   UCSC_RefGene_Na~   logFC    CI.L    CI.R AveExpr     t P.Value adj.P.Val     B      SE
  <chr> <chr> <chr>              <dbl>   <dbl>   <dbl>   <dbl> <dbl>   <dbl>     <dbl> <dbl>   <dbl>
1 cg13~ 2     "ECEL1P2"        -0.147  -0.191  -0.103    0.380 -6.62 1.90e-9  0.000466 10.6  0.0122 
2 cg23~ 2     "MTA3"           -0.0569 -0.0741 -0.0397   0.200 -6.58 2.32e-9  0.000466 10.4  0.00520
3 cg13~ 17    ""               -0.0820 -0.107  -0.0570   0.437 -6.50 3.27e-9  0.000466 10.0  0.0135 
4 cg12~ 19    "MEX3D"          -0.0519 -0.0678 -0.0359   0.145 -6.45 4.25e-9  0.000466  9.76 0.00872
5 cg24~ 19    "ISOC2;ISOC2;IS~ -0.0452 -0.0593 -0.0312   0.139 -6.39 5.47e-9  0.000466  9.51 0.00775
6 cg02~ 2     "ECEL1P2"        -0.125  -0.165  -0.0860   0.247 -6.33 7.31e-9  0.000466  9.23 0.0163 

$study2
# A tibble: 6 x 12
  id    CHR   UCSC_RefGene_Na~   logFC    CI.L    CI.R AveExpr     t P.Value adj.P.Val     B      SE
  <chr> <chr> <chr>              <dbl>   <dbl>   <dbl>   <dbl> <dbl>   <dbl>     <dbl> <dbl>   <dbl>
1 cg04~ 11    "CD6"            -0.101  -0.135  -0.0669  0.345  -5.91 6.21e-8    0.0172  7.18 0.0128 
2 cg07~ 6     "MUC21"          -0.0431 -0.0577 -0.0284  0.776  -5.85 8.22e-8    0.0172  6.90 0.00390
3 cg27~ 11    "CD6"            -0.0688 -0.0924 -0.0452  0.277  -5.79 1.07e-7    0.0172  6.64 0.0147 
4 cg08~ 1     ""               -0.0461 -0.0627 -0.0296  0.166  -5.55 2.98e-7    0.0360  5.64 0.00791
5 cg18~ 3     "RARRES1;RARRES~ -0.0491 -0.0671 -0.0311  0.157  -5.42 5.07e-7    0.0489  5.12 0.00848
6 cg02~ 8     ""               -0.0199 -0.0272 -0.0125  0.0947 -5.36 6.70e-7    0.0538  4.84 0.0155 

Then, the last step is to meta-analyze the results. Different methods can be used to this end. We have implemented a method that meta-analyze the p-pvalues of each study as follows:

ans.meta <- metaPvalues(ans.limma)
ans.meta
# A tibble: 481,868 x 4
   id                study1      study2   p.meta
   <chr>              <dbl>       <dbl>    <dbl>
 1 cg13138089 0.00000000190 0.00000763  4.78e-13
 2 cg25317941 0.0000000179  0.00000196  1.12e-12
 3 cg02812891 0.00000000731 0.00000707  1.63e-12
 4 cg12706938 0.00000000425 0.0000161   2.14e-12
 5 cg16026647 0.000000101   0.000000797 2.51e-12
 6 cg12695465 0.00000000985 0.0000144   4.33e-12
 7 cg21171625 0.000000146   0.00000225  9.78e-12
 8 cg13772815 0.00000000327 0.000122    1.18e-11
 9 cg00228891 0.000000166   0.00000283  1.38e-11
10 cg21488617 0.0000000186  0.0000299   1.62e-11
# ... with 481,858 more rows

We can verify that the results are pretty similar to those obtained using pooled analyses. Here we compute the association for two of the top-CpGs:

res1 <- ds.lmFeature(feature = "cg13138089", 
                     model = ~ diagnosis + Sex, 
                     Set = "methy",
                     datasources = conns)
res1
             Estimate Std. Error      p-value
cg13138089 -0.1373348 0.01712405 1.057482e-15
attr(,"class")
[1] "dsLmFeature" "matrix"      "array"      
res2 <- ds.lmFeature(feature = "cg13772815", 
                     model = ~ diagnosis + Sex, 
                     Set = "methy",
                     datasources = conns)
res2
              Estimate  Std. Error      p-value
cg13772815 -0.06786137 0.009128915 1.056225e-13
attr(,"class")
[1] "dsLmFeature" "matrix"      "array"      

We can create a QQ-plot by using the function qqplot available in our package.

qqplot(ans.meta$p.meta)

Here In some cases inflation can be observed, so that, correction for cell-type or surrogate variables must be performed. We describe how we can do that in the next two sections.

3.2.3 Adjusting for Surrogate Variables

The vast majority of omic studies require to control for unwanted variability. The surrogate variable analysis (SVA) can address this issue by estimating some hidden covariates that capture differences across individuals due to some artifacts such as batch effects or sample quality sam among others. The method is implemented in SVA package.

Performing this type of analysis using the ds.lmFeature function is not allowed since estimating SVA would require to implement a non-disclosive method that computes SVA from the different servers. This will be a future topic of the dsOmicsClient. NOTE that, estimating SVA separately at each server would not be a good idea since the aim of SVA is to capture differences mainly due to experimental issues among ALL individuals. What we can do instead is to use the ds.limma function to perform the analyses adjusted for SVA at each study.

ans.sva <- ds.limma(model = ~ diagnosis + Sex, 
                    Set = "methy",
                    sva = TRUE, annotCols = c("CHR", "UCSC_RefGene_Name"))
ans.sva
$study1
# A tibble: 481,868 x 12
   id    CHR   UCSC_RefGene_Na~   logFC    CI.L    CI.R AveExpr     t P.Value adj.P.Val     B
   <chr> <chr> <chr>              <dbl>   <dbl>   <dbl>   <dbl> <dbl>   <dbl>     <dbl> <dbl>
 1 cg10~ 19    "GNG7"           -0.0547 -0.0721 -0.0373   0.338 -6.25 1.31e-8   0.00591  8.51
 2 cg13~ 17    ""               -0.0569 -0.0757 -0.0381   0.437 -6.00 3.91e-8   0.00591  7.43
 3 cg11~ 19    "PODNL1;PODNL1;~  0.0334  0.0223  0.0445   0.568  5.97 4.53e-8   0.00591  7.29
 4 cg27~ 17    "SLC47A2;SLC47A~  0.0274  0.0182  0.0365   0.548  5.95 4.91e-8   0.00591  7.21
 5 cg21~ 6     ""               -0.0453 -0.0609 -0.0297   0.799 -5.78 1.05e-7   0.00666  6.46
 6 cg23~ 2     "MTA3"           -0.0327 -0.0440 -0.0215   0.200 -5.77 1.09e-7   0.00666  6.42
 7 cg10~ 16    "SALL1;SALL1"     0.0366  0.0240  0.0492   0.137  5.77 1.10e-7   0.00666  6.42
 8 cg24~ 1     ""                0.0317  0.0208  0.0427   0.821  5.76 1.11e-7   0.00666  6.41
 9 cg13~ 9     ""               -0.0326 -0.0440 -0.0213   0.367 -5.72 1.32e-7   0.00705  6.24
10 cg13~ 2     "ECEL1P2"        -0.106  -0.144  -0.0686   0.380 -5.61 2.15e-7   0.00835  5.76
# ... with 481,858 more rows, and 1 more variable: SE <dbl>

$study2
# A tibble: 481,868 x 12
   id    CHR   UCSC_RefGene_Na~   logFC    CI.L    CI.R AveExpr     t P.Value adj.P.Val     B
   <chr> <chr> <chr>              <dbl>   <dbl>   <dbl>   <dbl> <dbl>   <dbl>     <dbl> <dbl>
 1 cg16~ 12    "LRP1"           -0.0418 -0.0548 -0.0289   0.388 -6.42 7.82e-9   0.00374  9.05
 2 cg25~ 1     ""               -0.0664 -0.0874 -0.0453   0.569 -6.27 1.55e-8   0.00374  8.37
 3 cg12~ 11    "NRXN2;NRXN2"    -0.0273 -0.0367 -0.0179   0.728 -5.81 1.13e-7   0.0149   6.42
 4 cg07~ 3     ""               -0.0393 -0.0529 -0.0258   0.160 -5.78 1.24e-7   0.0149   6.32
 5 cg07~ 6     "MUC21"          -0.0427 -0.0579 -0.0276   0.776 -5.61 2.63e-7   0.0163   5.59
 6 cg00~ 1     "CR1L"           -0.0568 -0.0771 -0.0365   0.350 -5.56 3.18e-7   0.0163   5.40
 7 cg11~ 12    "CNTN1;CNTN1"    -0.0443 -0.0601 -0.0284   0.152 -5.56 3.18e-7   0.0163   5.40
 8 cg03~ 7     "PGAM2;PGAM2"    -0.0442 -0.0600 -0.0283   0.211 -5.56 3.26e-7   0.0163   5.38
 9 cg07~ 1     "KCNAB2;KCNAB2"   0.0611  0.0392  0.0831   0.659  5.54 3.47e-7   0.0163   5.31
10 cg25~ 17    "WNK4"           -0.0392 -0.0533 -0.0251   0.512 -5.52 3.70e-7   0.0163   5.25
# ... with 481,858 more rows, and 1 more variable: SE <dbl>

attr(,"class")
[1] "dsLimma" "list"   

Then, data can be combined meta-anlyzed as follows:

ans.meta.sv <- metaPvalues(ans.sva)
ans.meta.sv
# A tibble: 481,868 x 4
   id               study1      study2   p.meta
   <chr>             <dbl>       <dbl>    <dbl>
 1 cg00228891 0.00000397   0.000000318 3.58e-11
 2 cg01301319 0.00000609   0.00000123  2.00e-10
 3 cg22962123 0.00000106   0.00000767  2.17e-10
 4 cg24302412 0.0000139    0.000000762 2.77e-10
 5 cg02812891 0.000000408  0.0000327   3.47e-10
 6 cg23859635 0.000000109  0.000190    5.29e-10
 7 cg13138089 0.000000215  0.000105    5.74e-10
 8 cg24938077 0.0000125    0.00000254  8.02e-10
 9 cg13772815 0.0000000391 0.00132     1.27e- 9
10 cg21212881 0.000000340  0.000248    2.04e- 9
# ... with 481,858 more rows

The DataSHIELD session must by closed by:

datashield.logout(conns)

3.3 Exercise 1

He have collected information on methylation data (200 more variable CpGs) in order to compare methylation patterns in inner-city children with persistent atopic asthma (n=97) versus healthy controls (n=97). The authors used DNA from peripheral blood mononuclear cells (PBMCs) from inner city children aged 6-12 years with persistent atopic asthma children and healthy controls. Comparing asthmatics (N=97) to controls (N=97). Original data are available at GEO under the accession number GSE40732. The ExpressionSet with 194 samples is available at:

This data can be use later to check that DataSHIELD (using dsOmics) is providing the same results as if we had the whole data in a single computer

We have split this data set in two and have created two resources in the Opal demo site (https://opal-demo.obiba.org/) that can be access through DataSHIELD using the credentials:

- user: dsuser
- password: password

The two resources are in the project called workshop and their names are GSE40732_1 and GSE40732_2

  1. Login the connection to the two resources to DataSHIELD.
  2. Assign the resources to an R object called ‘methy’.
  3. Use ds.nFeatures () and ds.nSamples () functions to verify that the loaded objects are having the proper dimensions.
  4. Get the names of the variables available in the sample metadata (use ds.varLabels ()).
  5. Get the distribution of samples in the variable asthma (cases/controls).
  6. Perform an EWAS considering asthma as the condition (e.g. grouping variable).
  7. Do a simple meta-analysis using metaPvalues () function.

3.4 Genome-wide association analyses (GWAS)

Genomic data can be stored in different formats. VCF and PLINK files are commonly used in genetic epidemiology studies. We have a GWAS example available at BRGE data repository that aims to find SNPs associated with asthma. We have data stored in VCF (brge.vcf) with several covariates and phenotypes available in the file brge.txt (gender, age, obesity, smoking, country and asthma status). The same data is also available in PLINK format (brge.bed, brge.bim, brge.fam) with covariates in the file brge.phe.

Here we illustrate how to perform GWAS using R and Bioconductor packages or PLINK shell command line.

3.4.1 GWAS with Bioconductor

We have created a resource having the VCF file of our study on asthma as previously described. The name of the resource is brge_vcf the phenotypes are available in another resource called brge that is a .txt file. The GWAS analysis is then perform as follows

We first start by preparing login data

builder <- newDSLoginBuilder()
builder$append(server = "study1", url = "https://opal-demo.obiba.org",
               user = "dsuser", password = "password",
               resource = "RSRC.brge_vcf", driver = "OpalDriver")
logindata <- builder$build()

conns <- datashield.login(logins = logindata, assign = TRUE,
                          symbol = "res")

In this case we have to assign to different resources. One for the VCF (obesity_vcf) and another one for the phenotypic data (obesity). To this end, the datashield.assign.resource function is required before assigning any object to the specific resource. Notice that the VCF resource can be load into R as a GDS thanks to our extension of existing resources in the reourcer

datashield.assign.resource(conns, symbol = "vcf.res", 
                           resource = list(study1 = "RSRC.brge_vcf"))
datashield.assign.expr(conns, symbol = "gds", 
                       expr = quote(as.resource.object(vcf.res)))


datashield.assign.resource(conns, symbol = "covars.res", 
                           resource = list(study1 = "RSRC.brge"))
datashield.assign.expr(conns, symbol = "covars", 
                       expr = quote(as.resource.data.frame(covars.res)))

These are the objects available in the Opal server

ds.ls()
$study1
$study1$environment.searched
[1] "R_GlobalEnv"

$study1$objects.found
[1] "covars"     "covars.res" "gds"        "res"        "vcf.res"   

We can use dsBaseClient functions to inspect the variables that are in the covars data.frame. The variables are

ds.colnames("covars")
$study1
[1] "scanID"  "gender"  "obese"   "age"     "smoke"   "country" "asthma" 

The asthma variable has this number of individuals at each level (0: controls, 1: cases)

ds.table("covars$asthma")

 Data in all studies were valid 

Study 1 :  No errors reported from this study
$output.list
$output.list$TABLE_rvar.by.study_row.props
             study
covars$asthma 1
            0 1
            1 1

$output.list$TABLE_rvar.by.study_col.props
             study
covars$asthma         1
            0 0.6864187
            1 0.3135813

$output.list$TABLE_rvar.by.study_counts
             study
covars$asthma    1
            0 1587
            1  725

$output.list$TABLES.COMBINED_all.sources_proportions
covars$asthma
    0     1 
0.686 0.314 

$output.list$TABLES.COMBINED_all.sources_counts
covars$asthma
   0    1 
1587  725 


$validity.message
[1] "Data in all studies were valid"

There may be interest in only studying certain genes, for that matter, the loaded VCF resource can be subsetting as follows

genes <- c("A1BG","A2MP1")
ds.getSNPSbyGen("gds", genes)

The previous code will over-write the VCF with the SNPs corresponding to the selected genes, if the intention is to perform studies with both the complete VCF and a subsetted one, the argument name can be used to create a new object on the server with the subsetted VCF, preserving the complete one.

genes <- c("A1BG","A2MP1")
ds.getSNPSbyGen("gds", genes = genes, name = "subset.vcf")

Then, an object of class GenotypeData must be created at the server side to perform genetic data analyses. This is a container defined in the GWASTools package for storing genotype and phenotypic data from genetic association studies. By doing that we will also verify whether individuals in the GDS (e.g VCF) and covariates files have the same individuals and are in the same order. This can be performed by

ds.GenotypeData(x='gds', covars = 'covars', columnId = 1, newobj.name = 'gds.Data')

Before performing the association analyses, quality control (QC) can be performed to the loaded data. Three methodologies are available; 1) Principal Component Analysis (PCA) of the genomic data, 2) Hardy-Weinberg Equilibrium (HWE) testing and 3) Allelic frequency estimates. The QC methods 2 and 3 have as inputs a GenotypeData object, created with a covariates file that has a gender column; while method 1 has as input a VCF.

To perform the PCA, a pruning functionality is built inside so that redundant SNPs are discarted (there is an extra argument ld.threshold which controls the pruning, more information about it at the SNPRelate documentation), speeding up the execution time

ds.PCASNPS("gds", prune = TRUE)

To perform QC methodologies 2 and 3, the name of the gender column as well as the keys to describe male or female have to be provided. Remember that we can visualize the names of the variables from our data by executing ds.colnames("covars"). In our case, this variable is called “gender”, and the levels of this variable are 1 for male and 2 for female as we can see here (NOTE: we cannot use ds.levels since gender variable is not a factor):

ds.table1D("covars$gender")$counts
      covars$gender
1              1215
2              1097
Total          2312

The HWE test can be performed to selected chromosomes using the argument chromosomes, only the autosomes can be selected when performing a HWE test, the encoding of the autosomes can be fetched with

ds.getChromosomeNames("gds.Data")$autosomes
NULL

Therefore, HWE can be performed by:

ds.exactHWE("gds.Data", sexcol = "gender", male = "1", female = "2", chromosome = "22")
$study1
# A tibble: 1,581 x 9
   snpID chr   nAA   nAB   nBB   MAF               minor.allele f                 pval              
   <int> <chr> <chr> <chr> <chr> <chr>             <chr>        <chr>             <chr>             
 1 95140 22    531   1082  640   0.475810031069685 A            0.03724945341034~ 0.0762378420092437
 2 95141 22    580   1092  590   0.497789566755084 A            0.03446388812448~ 0.101115839682148 
 3 95142 22    0     333   1951  0.07289842381786~ A            -0.0786304604486~ 3.29321452000707e~
 4 95143 22    0     225   2080  0.04880694143167~ A            -0.0513112884834~ 0.005713534221738~
 5 95144 22    176   831   1249  0.262189716312057 A            0.04792409337548~ 0.025554465064758 
 6 95145 22    249   1003  1055  0.32531426094495  A            0.00958157673233~ 0.635607524466151 
 7 95146 22    55    516   1738  0.135556517973149 A            0.04646033280618~ 0.0325126363699498
 8 95147 22    203   929   1164  0.290722996515679 A            0.01888804177461~ 0.362631409282596 
 9 95148 22    291   1035  985   0.349848550411077 A            0.01549983175844~ 0.464409674690009 
10 95149 22    11    276   2013  0.06478260869565~ A            0.00966929694008~ 0.604847402951204 
# ... with 1,571 more rows

attr(,"class")
[1] "dsexactHWE" "list"      

Similarly, allele frequencies estimates can be estimated by:

ds.alleleFrequency("gds.Data", sexcol = "gender", male = "1", female = "2")
$study1
# A tibble: 99,289 x 7
          M       F      all   n.M   n.F     n      MAF
      <dbl>   <dbl>    <dbl> <dbl> <dbl> <dbl>    <dbl>
 1 0.015    0.0147  0.0149    1200  1086  2286 0.0149  
 2 0.000837 0.00276 0.00175   1195  1087  2282 0.00175 
 3 0.00330  0.00548 0.00433   1212  1095  2307 0.00433 
 4 0.0124   0.00868 0.0106    1213  1095  2308 0.0106  
 5 0.00165  0       0.000866  1214  1095  2309 0.000866
 6 0.0712   0.0754  0.0732    1208  1094  2302 0.0732  
 7 0.00660  0.00913 0.00780   1213  1095  2308 0.00780 
 8 0.00248  0.00274 0.00261   1209  1093  2302 0.00261 
 9 0.182    0.181   0.181     1188  1071  2259 0.181   
10 0.00165  0       0.000868  1210  1094  2304 0.000868
# ... with 99,279 more rows

attr(,"class")
[1] "dsalleleFrequency" "list"             

In the future, more functions will be created to perform quality control (QC) for both, SNPs and inviduals.

Association analysis for a given SNP is performed by simply

ds.glmSNP(snps.fit = "rs11247693", model = asthma ~ gender + age, genoData='gds.Data')
             Estimate Std. Error   p-value
rs11247693 -0.1543215  0.2309585 0.5040196
attr(,"class")
[1] "dsGlmSNP" "matrix"   "array"   

The analysis of all available SNPs is performed when the argument snps.fit is missing. The function performs the analysis of the selected SNPs in a single repository or in multiple repositories as performing pooled analyses (it uses ds.glm DataSHIELD function). As in the case of transcriptomic data, analyzing all the SNPs in the genome (e.g GWAS) will be high time-consuming. We can adopt a similar approach as the one adopted using the limma at each server. That is, we run GWAS at each repository using specific and scalable packages available in R/Bioc. In that case we use the GWASTools and GENESIS packages. The complete pipeline is implemented in this function

ans.bioC <- ds.GWAS('gds.Data', model=asthma~age+country)

This close the DataSHIELD session

datashield.logout(conns)

4 Real application: Epigenomic changes and green spaces

Our main aim is to train ATHLETE researchers to perform omic data analysis linking the epigenome with the exposome. WP3 and WP4 have setup a project (GREENSPACE) that will be used as proof-of-concept of our developed infrastructure. This figure summarized that project

GREENSPACE project overview. Linking the exposome (green space) with the epigenome

Figure 5: GREENSPACE project overview
Linking the exposome (green space) with the epigenome

We prepared some datasets with this information:

GREENSPACE project available data

Figure 6: GREENSPACE project available data

The variables we have are:

So, the requirements to perform such analyses are:

  1. To install dsBaseClient and dsOmicsClient R packages into analyst computer.
  2. To have harmonized exposome data in a opal server for each cohort (already done in LifeCycle) (Tables).
  3. To have a resource with an ExpressioSet encapsulating methylation data for each cohort (Resources) [see here for a description of an ExpressionSet and how to deal with it as a resource]
    • WP4 is providing an R code to create such ExpressionSet
    • Each cohort will provide the URL having an .Rdata file with the required object.
  4. To install required DataSHIELD packages into the Opal server: dsOmics, dsBase and resourcer

For this tutorial, we have tried to mimic real situations we have a project called GREENSPACE in our Opal demo server having exposome data for 3 cohorts (simulated data).

Tables available in the project GREENSPACE of Opal demo site

Figure 7: Tables available in the project GREENSPACE of Opal demo site

In the next figure we can see that there are 7 variables at each table and different number of samples (117, 178, 160). Using what you have learn in the DataSHIELD beginners course, you can obtain this information from the client site.

We start by creating the connection to our server and assigning the tables to an R object called exposome:

library(DSOpal)
library(dsBaseClient)


builder <- newDSLoginBuilder()
builder$append(server = 'study1', url = 'https://opal-demo.obiba.org', 
               user = 'dsuser', password = 'password')
builder$append(server = 'study2', url = 'https://opal-demo.obiba.org', 
               user = 'dsuser', password = 'password')
builder$append(server = 'study3', url = 'https://opal-demo.obiba.org', 
               user = 'dsuser', password = 'password')



logindata <- builder$build()
conns <- datashield.login(logins = logindata)

datashield.assign.table(conns, symbol = "exposome",
                        table = list(study1 = "GREENSPACE.Cohort1_exposome",
                                     study2 = "GREENSPACE.Cohort2_exposome",
                                     study3 = "GREENSPACE.Cohort3_exposome"))

The we can run basic DataSHIELD commands

ds.class('exposome')
$study1
[1] "data.frame"

$study2
[1] "data.frame"

$study3
[1] "data.frame"
ds.dim('exposome')
$`dimensions of exposome in study1`
[1] 117   7

$`dimensions of exposome in study2`
[1] 160   7

$`dimensions of exposome in study3`
[1] 178   7

$`dimensions of exposome in combined studies`
[1] 455   7
ds.colnames('exposome')
$study1
[1] "CODE"         "greenyn300_0" "greenyn300_s" "greenyn300_4" "ndvi100_0"    "ndvi100_s"   
[7] "ndvi100_4"   

$study2
[1] "CODE"         "greenyn300_0" "greenyn300_s" "greenyn300_4" "ndvi100_0"    "ndvi100_s"   
[7] "ndvi100_4"   

$study3
[1] "CODE"         "greenyn300_0" "greenyn300_s" "greenyn300_4" "ndvi100_0"    "ndvi100_s"   
[7] "ndvi100_4"   

The three .Rdata file containing the different ExpressionSets are available at:

So, we need to create the resources into the Opal server as we leant in the section 3.3 of the resources workshop. Our GREENSPACE project already have those resources created:

Resources available in the project GREENSPACE of Opal demo site

Figure 8: Resources available in the project GREENSPACE of Opal demo site

Then, we can load those resources into our DataSHIELD session as an R object called methy which is an object of class ExpressionSet by:

datashield.assign.resource(conns, symbol = "eSet",
                           resource = list(study1 = "GREENSPACE.Cohort1_Methyl_C_blood_0y_450K",
                                           study2 = "GREENSPACE.Cohort2_Methyl_C_blood_0y_450K",
                                           study3 = "GREENSPACE.Cohort3_Methyl_C_blood_0y_450K"))

datashield.assign.expr(conns, symbol = "methy", expr = quote(as.resource.object(eSet)))

Then, using dsOmicsClient library we can first check that our methylation data have properly been integrated into the DataSHIELD R server

library(dsOmicsClient)
ds.class('methy')
$study1
[1] "ExpressionSet"
attr(,"package")
[1] "Biobase"

$study2
[1] "ExpressionSet"
attr(,"package")
[1] "Biobase"

$study3
[1] "ExpressionSet"
attr(,"package")
[1] "Biobase"
ds.nFeatures('methy')
$study1
Features 
     100 

$study2
Features 
     100 

$study3
Features 
     100 

attr(,"class")
[1] "dsnFeatures" "list"       
ds.nSamples('methy')
$study1
Samples 
    100 

$study2
Samples 
    150 

$study3
Samples 
    135 

attr(,"class")
[1] "dsnSamples" "list"      

Then, in order to be able to use ds.limma () we need first to add the exposome data to the ExpressionSet object. To this end, the function ds.addPhenoData () has been created

ds.addPhenoData('methy', 'exposome', identifier = "CODE", name = 'methy2')

This function creates a new object in the R server side that is called methy2

ds.ls()
$study1
$study1$environment.searched
[1] "R_GlobalEnv"

$study1$objects.found
[1] "eSet"     "exposome" "methy"    "methy2"  


$study2
$study2$environment.searched
[1] "R_GlobalEnv"

$study2$objects.found
[1] "eSet"     "exposome" "methy"    "methy2"  


$study3
$study3$environment.searched
[1] "R_GlobalEnv"

$study3$objects.found
[1] "eSet"     "exposome" "methy"    "methy2"  

This is an object of class ExpressionSet

ds.class('methy2')
$study1
[1] "ExpressionSet"
attr(,"package")
[1] "Biobase"

$study2
[1] "ExpressionSet"
attr(,"package")
[1] "Biobase"

$study3
[1] "ExpressionSet"
attr(,"package")
[1] "Biobase"

Having the exposures in the phenotypic data

ds.fvarLabels('methy')
$study1
 [1] "X"          "id_methyl"  "id"         "age_methyl" "sex_methyl" "anc_methyl" "cohort"    
 [8] "Bcell"      "CD4T"       "CD8T"       "Gran"       "Mono"       "NK"         "nRBC"      

$study2
 [1] "X"          "id_methyl"  "id"         "age_methyl" "sex_methyl" "anc_methyl" "cohort"    
 [8] "Bcell"      "CD4T"       "CD8T"       "Gran"       "Mono"       "NK"         "nRBC"      

$study3
 [1] "X"          "id_methyl"  "id"         "age_methyl" "sex_methyl" "anc_methyl" "cohort"    
 [8] "Bcell"      "CD4T"       "CD8T"       "Gran"       "Mono"       "NK"         "nRBC"      

attr(,"class")
[1] "dsfvarLabels" "list"        
ds.fvarLabels('methy2')
$study1
 [1] "X"            "id_methyl"    "id"           "age_methyl"   "sex_methyl"   "anc_methyl"  
 [7] "cohort"       "Bcell"        "CD4T"         "CD8T"         "Gran"         "Mono"        
[13] "NK"           "nRBC"         "greenyn300_0" "greenyn300_s" "greenyn300_4" "ndvi100_0"   
[19] "ndvi100_s"    "ndvi100_4"   

$study2
 [1] "X"            "id_methyl"    "id"           "age_methyl"   "sex_methyl"   "anc_methyl"  
 [7] "cohort"       "Bcell"        "CD4T"         "CD8T"         "Gran"         "Mono"        
[13] "NK"           "nRBC"         "greenyn300_0" "greenyn300_s" "greenyn300_4" "ndvi100_0"   
[19] "ndvi100_s"    "ndvi100_4"   

$study3
 [1] "X"            "id_methyl"    "id"           "age_methyl"   "sex_methyl"   "anc_methyl"  
 [7] "cohort"       "Bcell"        "CD4T"         "CD8T"         "Gran"         "Mono"        
[13] "NK"           "nRBC"         "greenyn300_0" "greenyn300_s" "greenyn300_4" "ndvi100_0"   
[19] "ndvi100_s"    "ndvi100_4"   

attr(,"class")
[1] "dsfvarLabels" "list"        

we can verify that matching has properly been performed

ds.nSamples('methy2')
$study1
Samples 
     97 

$study2
Samples 
    145 

$study3
Samples 
    133 

attr(,"class")
[1] "dsnSamples" "list"      

Now, we are ready to perform the required EXWAS by

ewas <- ds.limma(model = ~ greenyn300_0,
                      Set = "methy2")

We can meta-analyze our results as we want, since we have the required information in our computer

ewas
$study1
# A tibble: 100 x 10
   id            logFC      CI.L      CI.R AveExpr     t P.Value adj.P.Val     B      SE
   <chr>         <dbl>     <dbl>     <dbl>   <dbl> <dbl>   <dbl>     <dbl> <dbl>   <dbl>
 1 cg14008030  0.0288   0.00537   0.0523    0.526   2.44  0.0165     0.613 -4.66 0.0100 
 2 cg23999112  0.0258   0.00428   0.0473    0.752   2.38  0.0193     0.613 -4.80 0.0118 
 3 cg18762746  0.00589  0.000781  0.0110    0.967   2.29  0.0243     0.613 -5.00 0.00594
 4 cg05597748 -0.0195  -0.0365   -0.00257   0.813  -2.29  0.0245     0.613 -5.01 0.0170 
 5 cg04496485 -0.00869 -0.0175    0.000171  0.0525 -1.95  0.0545     0.989 -5.70 0.00431
 6 cg12285627  0.0128  -0.00161   0.0272    0.615   1.76  0.0812     0.989 -6.03 0.0231 
 7 cg01097950  0.0160  -0.00253   0.0345    0.814   1.71  0.0899     0.989 -6.11 0.0119 
 8 cg08900511  0.0149  -0.00325   0.0330    0.845   1.63  0.107      0.989 -6.25 0.0142 
 9 cg05475702 -0.0212  -0.0480    0.00557   0.703  -1.57  0.119      0.989 -6.34 0.0133 
10 cg16736630 -0.0335  -0.0781    0.0112    0.501  -1.49  0.140      0.989 -6.47 0.0281 
# ... with 90 more rows

$study2
# A tibble: 100 x 10
   id            logFC      CI.L    CI.R AveExpr     t P.Value adj.P.Val     B      SE
   <chr>         <dbl>     <dbl>   <dbl>   <dbl> <dbl>   <dbl>     <dbl> <dbl>   <dbl>
 1 cg14008030 0.0163    0.000457 0.0322   0.531   2.03  0.0438     0.989 -6.06 0.00660
 2 cg11422233 0.00539  -0.000365 0.0111   0.0830  1.85  0.0662     0.989 -6.41 0.00802
 3 cg05662829 0.0165   -0.00126  0.0342   0.815   1.84  0.0684     0.989 -6.44 0.00322
 4 cg06402284 0.00306  -0.000481 0.00660  0.0408  1.71  0.0897     0.989 -6.66 0.00993
 5 cg00381604 0.00282  -0.000583 0.00621  0.0306  1.64  0.104      0.989 -6.78 0.00172
 6 cg01551879 0.00129  -0.000291 0.00287  0.0181  1.61  0.109      0.989 -6.82 0.0147 
 7 cg12045430 0.00492  -0.00144  0.0113   0.0361  1.53  0.129      0.989 -6.95 0.00723
 8 cg24699430 0.000768 -0.000275 0.00181  0.0145  1.46  0.148      0.989 -7.05 0.00878
 9 cg04407431 0.000908 -0.000327 0.00214  0.0173  1.45  0.148      0.989 -7.06 0.00859
10 cg08167039 0.00203  -0.000735 0.00479  0.0352  1.45  0.149      0.989 -7.06 0.0150 
# ... with 90 more rows

$study3
# A tibble: 100 x 10
   id            logFC      CI.L      CI.R AveExpr     t P.Value adj.P.Val     B      SE
   <chr>         <dbl>     <dbl>     <dbl>   <dbl> <dbl>   <dbl>     <dbl> <dbl>   <dbl>
 1 cg09989996  0.00735  0.00288   0.0118    0.950   3.25 0.00146     0.146 -2.75 0.00815
 2 cg13938959 -0.0254  -0.0423   -0.00853   0.731  -2.98 0.00347     0.173 -3.54 0.00907
 3 cg10988531  0.0193   0.00539   0.0331    0.834   2.75 0.00686     0.229 -4.17 0.00818
 4 cg14057946  0.0114   0.00225   0.0206    0.0945  2.46 0.0151      0.377 -4.87 0.0121 
 5 cg03213877 -0.00174 -0.00333  -0.000160  0.0174 -2.18 0.0312      0.624 -5.51 0.00536
 6 cg01068023 -0.00801 -0.0156   -0.000416  0.888  -2.09 0.0388      0.647 -5.70 0.0186 
 7 cg21996134  0.00995 -0.000355  0.0203    0.954   1.91 0.0583      0.790 -6.05 0.00899
 8 cg21870274 -0.0165  -0.0343    0.00127   0.814  -1.84 0.0685      0.790 -6.18 0.0105 
 9 cg11954957  0.0272  -0.00282   0.0573    0.746   1.79 0.0753      0.790 -6.26 0.0130 
10 cg18263059  0.0107  -0.00125   0.0226    0.0381  1.77 0.0790      0.790 -6.30 0.0169 
# ... with 90 more rows

attr(,"class")
[1] "dsLimma" "list"   

We have implemented a very simple function that makes use of metap package to combine p-values from microarray experiments. For both ATHELETE and LifeCycle researchers we highly recommend to use our EASIER package

metaPvalues(ewas)
# A tibble: 100 x 5
   id         study1 study2  study3 p.meta
   <chr>       <dbl>  <dbl>   <dbl>  <dbl>
 1 cg14008030 0.0165 0.0438 0.795   0.0209
 2 cg14057946 0.263  0.213  0.0151  0.0280
 3 cg09989996 0.976  0.955  0.00146 0.0401
 4 cg23999112 0.0193 0.771  0.102   0.0433
 5 cg13938959 0.967  0.829  0.00347 0.0672
 6 cg05597748 0.0245 0.353  0.334   0.0691
 7 cg16047670 0.158  0.182  0.105   0.0713
 8 cg10988531 0.935  0.544  0.00686 0.0791
 9 cg18762746 0.0243 0.402  0.463   0.0949
10 cg01068023 0.693  0.255  0.0388  0.126 
# ... with 90 more rows

We can also run analyses adjusted for other covariates or even for cell types if metadata is available in the ExpressionSet

ds.fvarLabels('methy2')
$study1
 [1] "X"            "id_methyl"    "id"           "age_methyl"   "sex_methyl"   "anc_methyl"  
 [7] "cohort"       "Bcell"        "CD4T"         "CD8T"         "Gran"         "Mono"        
[13] "NK"           "nRBC"         "greenyn300_0" "greenyn300_s" "greenyn300_4" "ndvi100_0"   
[19] "ndvi100_s"    "ndvi100_4"   

$study2
 [1] "X"            "id_methyl"    "id"           "age_methyl"   "sex_methyl"   "anc_methyl"  
 [7] "cohort"       "Bcell"        "CD4T"         "CD8T"         "Gran"         "Mono"        
[13] "NK"           "nRBC"         "greenyn300_0" "greenyn300_s" "greenyn300_4" "ndvi100_0"   
[19] "ndvi100_s"    "ndvi100_4"   

$study3
 [1] "X"            "id_methyl"    "id"           "age_methyl"   "sex_methyl"   "anc_methyl"  
 [7] "cohort"       "Bcell"        "CD4T"         "CD8T"         "Gran"         "Mono"        
[13] "NK"           "nRBC"         "greenyn300_0" "greenyn300_s" "greenyn300_4" "ndvi100_0"   
[19] "ndvi100_s"    "ndvi100_4"   

attr(,"class")
[1] "dsfvarLabels" "list"        

For example, let us run the analysis adjusted for gender

ewas.adj <- ds.limma(model = ~ greenyn300_0 + sex_methyl, Set = "methy2")
metaPvalues(ewas.adj)
# A tibble: 100 x 5
   id         study1 study2  study3 p.meta
   <chr>       <dbl>  <dbl>   <dbl>  <dbl>
 1 cg14008030 0.0230 0.0463 0.824   0.0288
 2 cg14057946 0.261  0.242  0.0145  0.0297
 3 cg09989996 0.990  0.945  0.00153 0.0415
 4 cg23999112 0.0214 0.785  0.111   0.0503
 5 cg05597748 0.0270 0.344  0.290   0.0658
 6 cg13938959 0.942  0.817  0.00373 0.0689
 7 cg10988531 0.959  0.550  0.00558 0.0701
 8 cg16047670 0.156  0.184  0.104   0.0707
 9 cg18762746 0.0293 0.395  0.476   0.109 
10 cg01068023 0.665  0.275  0.0371  0.125 
# ... with 90 more rows

Do not forget to logout the connection

datashield.logout(conns)    

4.1 Exercise 2

Peform an EWAS using the same data that is available in the GREENSPACE project to determine which CpGs are differentially methylated for the exposures: greenyn300_4, ndvi100_0 and ndvi100_4.

EWAS requires to adjust the analyses for cell type composition. The ExpressionSet contains the variables Bcell, CD4T, CD8T, Gran, Mono and NK in the sample metadata. Re-run the models adjustin for those variables

5 Session Info

sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale:
[1] LC_COLLATE=Spanish_Spain.1252  LC_CTYPE=Spanish_Spain.1252    LC_MONETARY=Spanish_Spain.1252
[4] LC_NUMERIC=C                   LC_TIME=Spanish_Spain.1252    

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] forcats_0.5.0       stringr_1.4.0       dplyr_1.0.2         purrr_0.3.4        
 [5] readr_1.3.1         tidyr_1.1.2         tibble_3.0.3        ggplot2_3.3.2      
 [9] tidyverse_1.3.0     dsOmicsClient_1.1.0 dsBaseClient_6.1.0  Biobase_2.48.0     
[13] BiocGenerics_0.34.0 DSOpal_1.1.0        DSI_1.1.0           R6_2.4.1           
[17] progress_1.2.2      opalr_1.5.1         httr_1.4.2          knitr_1.29         
[21] BiocStyle_2.16.0   

loaded via a namespace (and not attached):
 [1] fs_1.5.0            lubridate_1.7.9     RColorBrewer_1.1-2  numDeriv_2016.8-1.1
 [5] tools_4.0.2         backports_1.1.9     utf8_1.1.4          DBI_1.1.0          
 [9] colorspace_1.4-1    sn_1.6-2            withr_2.2.0         tidyselect_1.1.0   
[13] prettyunits_1.1.1   mnormt_2.0.2        curl_4.3            compiler_4.0.2     
[17] cli_2.0.2           rvest_0.3.6         xml2_1.3.2          TFisher_0.2.0      
[21] sandwich_2.5-1      bookdown_0.20       scales_1.1.1        mvtnorm_1.1-1      
[25] digest_0.6.25       rmarkdown_2.3       pkgconfig_2.0.3     htmltools_0.5.0    
[29] bibtex_0.4.2.2      plotrix_3.7-8       highr_0.8           dbplyr_1.4.4       
[33] rlang_0.4.7         readxl_1.3.1        rstudioapi_0.11     generics_0.0.2     
[37] zoo_1.8-8           jsonlite_1.7.0      magrittr_1.5        Matrix_1.2-18      
[41] fansi_0.4.1         Rcpp_1.0.5          munsell_0.5.0       lifecycle_0.2.0    
[45] stringi_1.4.6       multcomp_1.4-13     yaml_2.2.1          mathjaxr_1.0-1     
[49] gbRd_0.4-11         MASS_7.3-52         grid_4.0.2          blob_1.2.1         
[53] ggrepel_0.8.2       crayon_1.3.4        lattice_0.20-41     haven_2.3.1        
[57] splines_4.0.2       multtest_2.44.0     hms_0.5.3           tmvnsim_1.0-2      
[61] pillar_1.4.6        codetools_0.2-16    stats4_4.0.2        reprex_0.3.0       
[65] mutoss_0.1-12       glue_1.4.2          evaluate_0.14       metap_1.4          
[69] BiocManager_1.30.10 modelr_0.1.8        vctrs_0.3.3         Rdpack_1.0.0       
[73] cellranger_1.1.0    gtable_0.3.0        assertthat_0.2.1    xfun_0.16          
[77] mime_0.9            broom_0.7.0         survival_3.2-3      TH.data_1.0-10     
[81] ellipsis_0.3.1