This tutorial aims to provide examples about how to analyze omics data in DataSHIELD. More detailed information can be found in our bookdown Orchestrating privacy-protected non-disclosive big data analyses of data from different resources with R and DataSHIELD.
This document can be reproduced by installing the following packages
install.packages("DSOpal")
devtools::install_github("datashield/dsBaseClient")
devtools::install_github("isglobal-brge/dsOmicsClient")
Then the packages are loaded as usual
library(DSOpal)
library(dsBaseClient)
library(dsOmicsClient)
We have set up an Opal demo site to illustrate how to perform different omic data analyses using DataSHIELD. The DataSHIELD user credentials are
dsuserpasswordWe also illustrate how to deal with different resources for ’omic data. For this, the Opal server can be accessed with the credentials:
administratorpasswordThis tutorial will mainly make use of the resources available at RSRC project
Figure 1: Resources available at Opal demo site of RSRC project
and we will cover how to perform:
- Transcriptomic data analysis
- Epigenomic data analysis
- GWAS
For those who are not familiar with omic data analysis or do not know the state-of-the-art methods, you can read this bookdown on Omic Data Analysis with R/Bioconductor.
The Figure 2 describes the different types of ’omic association analyses that can be performed using DataSHIELD client functions implemented in the dsOmicsClient package. Basically, data (’omic and phenotypes/covariates) can be stored in different sites (http, ssh, AWS S3, local, …) and are managed with Opal through the resourcer package and their extensions implemented in dsOmics.
Figure 2: Non-disclosive omic data analysis with DataSHIELD and Bioconductor
The figure illustrates how the resourcer package is used to get access to omic data through the Opal servers. Then DataSHIELD is used in the client side to perform non-disclosive data analyses.
The dsOmicsClient package allows different types of analyses: pooled and meta-analysis. Both methods are based on fitting different Generalized Linear Models (GLMs) for each feature when assessing association between ’omic data and the phenotype/trait/condition of interest. Of course, non-disclosive ’omic data analysis from a single study can also be performed.
The pooled approach (Figure 3) is recommended when the user wants to analyze ’omic data from different sources and obtain results as if the data were located in a single computer. It should be noted that this can be very time consuming when analyzing multiple features since it calls a base function in DataSHIELD (ds.glm) repeatedly. It also cannot be recommended when data are not properly harmonized (e.g. gene expression normalized using different methods, GWAS data having different platforms, …). Furthermore when it is necesary to remove unwanted variability (for transcriptomic and epigenomica analysis) or control for population stratification (for GWAS analysis), this approach cannot be used since we need to develop methods to compute surrogate variables (to remove unwanted variability) or PCAs (to to address population stratification) in a non-disclosive way.
The meta-analysis approach Figure 4 overcomes the limitations raised when performing pooled analyses. First, the computation issue is addressed by using scalable and fast methods to perform data analysis at whole-genome level at each location The transcriptomic and epigenomic data analyses make use of the widely used limma package that uses ExpressionSet or RangedSummarizedExperiment Bioc infrastructures to deal with ’omic and phenotypic (e.g covariates). The genomic data are analyzed using GWASTools and GENESIS that are designed to perform quality control (QC) and GWAS using GDS infrastructure.
Next, we describe how both approaches are implemented:
ds.glm() function which is a DataSHIELD function that uses an approach that is mathematically equivalent to placing all individual-level data froma all sources in one central warehouse and analysing those data using the conventional glm() function in R. The user can select one (or multiple) features (i.e., genes, transcripts, CpGs, SNPs, …)
Figure 3: Non-disclosive omic data analysis with DataSHIELD and Bioconductor
The figure illustrates how to perform single pooled omic data analysis. The analyses are performed by using a generalized linear model (glm) on data from one or multiple sources. It makes use of ds.glm(), a DataSHIELD function, that uses an approach that is mathematically equivalent to placing all individual-level data from all sources in one central warehouse and analysing those data using the conventional glm() function in R.
Figure 4: Non-disclosive omic data analysis with DataSHIELD and Bioconductor
The figure illustrates how to perform anlyses at genome-wide level from one or multiple sources. It runs standard Bioconductor functions at each server independently to speed up the analyses and in the case of having multiple sources, results can be meta-analyzed uning standar R functions.
Let us illustrate how to perform transcriptomic data analysis using data from TCGA project. We have uploaded to the opal server a resource called tcga_liver whose URL is http://duffel.rail.bio/recount/TCGA/rse_gene_liver.Rdata which is available through the recount project. This resource contains the RangeSummarizedExperiment with the RNAseq profiling of liver cancer data from TCGA. Next, we illustrate how a differential expression analysis to compare RNAseq profiling of women vs men (variable gdc_cases.demographic.gender). The DGE analysis is normally performed using limma package. In that case, as we are analyzing RNA-seq data, limma + voom method will be required.
Let us start by creating the connection to the opal server:
builder <- newDSLoginBuilder()
builder$append(server = "study1", url = "https://opal-demo.obiba.org",
user = "dsuser", password = "password",
resource = "RSRC.tcga_liver", driver = "OpalDriver")
logindata <- builder$build()
conns <- datashield.login(logins = logindata, assign = TRUE,
symbol = "res")
Then, let us coerce the resource to a RangedSummarizedExperiment which is the type of object that is available in the recount project.
datashield.assign.expr(conns, symbol = "rse",
expr = quote(as.resource.object(res)))
ds.class("rse")
$study1
[1] "RangedSummarizedExperiment"
attr(,"package")
[1] "SummarizedExperiment"
The number of features and samples can be inspected by
ds.dim("rse")
$`dimensions of rse in study1`
[1] 58037 424
$`dimensions of rse in combined studies`
[1] 58037 424
And the names of the features using the same function used in the case of analyzing an ExpressionSet
name.features <- ds.featureNames("rse")
lapply(name.features, head)
$study1
[1] "ENSG00000000003.14" "ENSG00000000005.5" "ENSG00000000419.12" "ENSG00000000457.13"
[5] "ENSG00000000460.16" "ENSG00000000938.12"
Also the covariate names can be inspected by
name.vars <- ds.featureData("rse")
lapply(name.vars, head, n=15)
$study1
[1] "project"
[2] "sample"
[3] "experiment"
[4] "run"
[5] "read_count_as_reported_by_sra"
[6] "reads_downloaded"
[7] "proportion_of_reads_reported_by_sra_downloaded"
[8] "paired_end"
[9] "sra_misreported_paired_end"
[10] "mapped_read_count"
[11] "auc"
[12] "sharq_beta_tissue"
[13] "sharq_beta_cell_type"
[14] "biosample_submission_date"
[15] "biosample_publication_date"
We can visualize the levels of the variable having gender information
ds.table("rse$gdc_cases.demographic.gender")
Data in all studies were valid
Study 1 : No errors reported from this study
$output.list
$output.list$TABLE_rvar.by.study_row.props
study
rse$gdc_cases.demographic.gender 1
female 1
male 1
$output.list$TABLE_rvar.by.study_col.props
study
rse$gdc_cases.demographic.gender 1
female 0.3372642
male 0.6627358
$output.list$TABLE_rvar.by.study_counts
study
rse$gdc_cases.demographic.gender 1
female 143
male 281
$output.list$TABLES.COMBINED_all.sources_proportions
rse$gdc_cases.demographic.gender
female male
0.337 0.663
$output.list$TABLES.COMBINED_all.sources_counts
rse$gdc_cases.demographic.gender
female male
143 281
$validity.message
[1] "Data in all studies were valid"
The differential expression analysis is then performed by:
ans.gender <- ds.limma(model = ~ gdc_cases.demographic.gender,
Set = "rse", type.data = "RNAseq",
sva = FALSE)
Notice that we have set type.data='RNAseq' to consider that our data are counts obtained from a RNA-seq experiment. By indicating so, the differential analysis is performed by using voom + limma as previously mention.
The top differentially expressed genes can be visualized by:
ans.gender
$study1
# A tibble: 58,037 x 10
id logFC CI.L CI.R AveExpr t P.Value adj.P.Val B SE
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ENSG00000233070.1 10.8 10.4 11.2 -5.59 55.2 2.89e-196 1.68e-191 418. 0.0763
2 ENSG00000274655.1 -12.4 -12.8 -11.9 -7.71 -54.8 5.47e-195 1.59e-190 416. 0.460
3 ENSG00000260197.1 10.1 9.74 10.6 -5.88 48.4 2.45e-175 4.73e-171 373. 0.0601
4 ENSG00000213318.4 11.4 10.9 11.9 -4.35 45.0 5.62e-164 8.16e-160 345. 0.0649
5 ENSG00000231535.5 8.98 8.57 9.39 -7.38 42.9 9.49e-157 1.10e-152 334. 0.0886
6 ENSG00000270641.1 -10.2 -10.7 -9.73 0.0618 -41.7 2.19e-152 2.12e-148 331. 0.118
7 ENSG00000277577.1 -11.3 -11.9 -10.8 -8.08 -41.1 3.13e-150 2.59e-146 321. 0.127
8 ENSG00000229807.10 -11.0 -11.6 -10.5 -0.211 -37.6 1.69e-137 1.09e-133 297. 0.0814
9 ENSG00000278847.1 7.91 7.50 8.32 -7.94 37.9 1.12e-138 8.14e-135 294. 0.0895
10 ENSG00000176728.7 11.1 10.5 11.8 -4.40 36.1 1.46e-131 8.48e-128 277. 0.0923
# ... with 58,027 more rows
attr(,"class")
[1] "dsLimma" "list"
We have also implemented two other functions ds.DESeq2 and ds.edgeR that perform DGE analysis using DESeq2 and edgeR methods. This is the R code used to that purpose:
To be supplied
We close the DataSHIELD session by:
datashield.logout(conns)
EWAS requires basically the same statistical methods as those used in DGE. It should be notice that the pooled analysis we are going to illustrate here can also be performed with transcriptomic data since each study must have different range values. If so, gene expression harmonization should be performed, for instance, by standardizing the data at each study. For EWAS where methylation is measured using beta values (e.g CpG data are in the range 0-1) this is not a problem. In any case, adopting the meta-analysis approach could be a safe option.
We have downloaded data from GEO corresponding to the accesion number GSE66351 which includes DNA methylation profiling (Illumina 450K array) of 190 individuals. Data corresponds to CpGs beta values measured in the superior temporal gyrus and prefrontal cortex brain regions of patients with Alzheimer’s. Data have been downloaded using GEOquery package that gets GEO data as ExpressionSet objects. Researchers who are not familiar with ExpressionSets can read this Section. Notice that data are encoded as beta-values that ensure data harmonization across studies.
In order to illustrate how to perform data analyses using federated data, we have split the data into two ExpressionSets having 100 and 90 samples as if they were two different studies. Figure ?? shows the two resources defined for both studies (GSE66351_1 and GSE66351_2)
In order to perform omic data analyses, we need first to login and assign resources to DataSHIELD. This can be performed using the as.resource.object() function
builder <- DSI::newDSLoginBuilder()
builder$append(server = "study1", url = "https://opal-demo.obiba.org",
user = "dsuser", password = "password",
resource = "RSRC.GSE66351_1", driver = "OpalDriver")
builder$append(server = "study2", url = "https://opal-demo.obiba.org",
user = "dsuser", password = "password",
resource = "RSRC.GSE66351_2", driver = "OpalDriver")
logindata <- builder$build()
conns <- DSI::datashield.login(logins = logindata, assign = TRUE,
symbol = "res")
# Assign to the original R class (e.g ExpressionSet)
datashield.assign.expr(conns, symbol = "methy",
expr = quote(as.resource.object(res)))
Now, we can see that the resources are actually loaded into the R servers as their original class
ds.class("methy")
$study1
[1] "ExpressionSet"
attr(,"package")
[1] "Biobase"
$study2
[1] "ExpressionSet"
attr(,"package")
[1] "Biobase"
Then, some Bioconductor-type functions can be use to return non-disclosive information of ExpressionSets from each server to the client, using similar functions as those defined in the dsBaseClient package. For example, feature names can be returned by
fn <- ds.featureNames("methy")
lapply(fn, head)
$study1
[1] "cg00000029" "cg00000108" "cg00000109" "cg00000165" "cg00000236" "cg00000289"
$study2
[1] "cg00000029" "cg00000108" "cg00000109" "cg00000165" "cg00000236" "cg00000289"
Experimental phenotypes variables can be obtained by
ds.varLabels("methy")
$study1
[1] "title" "geo_accession" "status"
[4] "submission_date" "last_update_date" "type"
[7] "channel_count" "source_name_ch1" "organism_ch1"
[10] "characteristics_ch1" "characteristics_ch1.1" "characteristics_ch1.2"
[13] "characteristics_ch1.3" "characteristics_ch1.4" "characteristics_ch1.5"
[16] "characteristics_ch1.6" "characteristics_ch1.7" "characteristics_ch1.8"
[19] "molecule_ch1" "extract_protocol_ch1" "label_ch1"
[22] "label_protocol_ch1" "taxid_ch1" "hyb_protocol"
[25] "scan_protocol" "description" "data_processing"
[28] "platform_id" "contact_name" "contact_email"
[31] "contact_phone" "contact_laboratory" "contact_institute"
[34] "contact_address" "contact_city" "contact_zip/postal_code"
[37] "contact_country" "supplementary_file" "supplementary_file.1"
[40] "data_row_count" "age" "braak_stage"
[43] "brain_region" "cell type" "diagnosis"
[46] "donor_id" "sentrix_id" "sentrix_position"
[49] "Sex"
$study2
[1] "title" "geo_accession" "status"
[4] "submission_date" "last_update_date" "type"
[7] "channel_count" "source_name_ch1" "organism_ch1"
[10] "characteristics_ch1" "characteristics_ch1.1" "characteristics_ch1.2"
[13] "characteristics_ch1.3" "characteristics_ch1.4" "characteristics_ch1.5"
[16] "characteristics_ch1.6" "characteristics_ch1.7" "characteristics_ch1.8"
[19] "molecule_ch1" "extract_protocol_ch1" "label_ch1"
[22] "label_protocol_ch1" "taxid_ch1" "hyb_protocol"
[25] "scan_protocol" "description" "data_processing"
[28] "platform_id" "contact_name" "contact_email"
[31] "contact_phone" "contact_laboratory" "contact_institute"
[34] "contact_address" "contact_city" "contact_zip/postal_code"
[37] "contact_country" "supplementary_file" "supplementary_file.1"
[40] "data_row_count" "age" "braak_stage"
[43] "brain_region" "cell type" "diagnosis"
[46] "donor_id" "sentrix_id" "sentrix_position"
[49] "Sex"
attr(,"class")
[1] "dsvarLabels" "list"
Once the methylation data have been loaded into the opal server, we can perform different type of analyses using functions from the dsOmicsClient package. Let us start by illustrating how to analyze a single CpG from two studies by using an approach that is mathematically equivalent to placing all individual-level.
ans <- ds.lmFeature(feature = "cg07363416",
model = ~ diagnosis + Sex,
Set = "methy",
datasources = conns)
ans
Estimate Std. Error p-value
cg07363416 0.03459886 0.02504291 0.1670998
attr(,"class")
[1] "dsLmFeature" "matrix" "array"
The same analysis can be performed for all features (e.g. CpGs) just avoiding the feature argument. This process can be parallelized using mclapply function from the multicore package.
ans <- ds.lmFeature(model = ~ diagnosis + Sex,
Set = "methy",
datasources = conns,
mc.cores = 20)
This method corresponds to the pooled analysis approach and can be very time consiming since the function repeatedly calls the DataSHIELD function ds.glm(). We can adopt another strategy that is to run a glm of each feature independently at each study using limma package (which is really fast) and then combine the results (i.e. meta-analysis approach).
ans.limma <- ds.limma(model = ~ diagnosis + Sex,
Set = "methy",
datasources = conns)
Then, we can visualize the top genes at each study (i.e server) by
lapply(ans.limma, head)
$study1
# A tibble: 6 x 10
id logFC CI.L CI.R AveExpr t P.Value adj.P.Val B SE
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 cg13138089 -0.147 -0.191 -0.103 0.380 -6.62 0.00000000190 0.000466 10.6 0.0122
2 cg23859635 -0.0569 -0.0741 -0.0397 0.200 -6.58 0.00000000232 0.000466 10.4 0.00520
3 cg13772815 -0.0820 -0.107 -0.0570 0.437 -6.50 0.00000000327 0.000466 10.0 0.0135
4 cg12706938 -0.0519 -0.0678 -0.0359 0.145 -6.45 0.00000000425 0.000466 9.76 0.00872
5 cg24724506 -0.0452 -0.0593 -0.0312 0.139 -6.39 0.00000000547 0.000466 9.51 0.00775
6 cg02812891 -0.125 -0.165 -0.0860 0.247 -6.33 0.00000000731 0.000466 9.23 0.0163
$study2
# A tibble: 6 x 10
id logFC CI.L CI.R AveExpr t P.Value adj.P.Val B SE
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 cg04046629 -0.101 -0.135 -0.0669 0.345 -5.91 0.0000000621 0.0172 7.18 0.0128
2 cg07664323 -0.0431 -0.0577 -0.0284 0.776 -5.85 0.0000000822 0.0172 6.90 0.00390
3 cg27098804 -0.0688 -0.0924 -0.0452 0.277 -5.79 0.000000107 0.0172 6.64 0.0147
4 cg08933615 -0.0461 -0.0627 -0.0296 0.166 -5.55 0.000000298 0.0360 5.64 0.00791
5 cg18349298 -0.0491 -0.0671 -0.0311 0.157 -5.42 0.000000507 0.0489 5.12 0.00848
6 cg02182795 -0.0199 -0.0272 -0.0125 0.0947 -5.36 0.000000670 0.0538 4.84 0.0155
The annotation can be added by using the argument annotCols. It should be a vector with the columns of the annotation available in the ExpressionSet or RangedSummarizedExperiment that want to be showed. The columns of the annotation can be obtained by
ds.fvarLabels("methy")
$study1
[1] "title" "geo_accession" "status"
[4] "submission_date" "last_update_date" "type"
[7] "channel_count" "source_name_ch1" "organism_ch1"
[10] "characteristics_ch1" "characteristics_ch1.1" "characteristics_ch1.2"
[13] "characteristics_ch1.3" "characteristics_ch1.4" "characteristics_ch1.5"
[16] "characteristics_ch1.6" "characteristics_ch1.7" "characteristics_ch1.8"
[19] "molecule_ch1" "extract_protocol_ch1" "label_ch1"
[22] "label_protocol_ch1" "taxid_ch1" "hyb_protocol"
[25] "scan_protocol" "description" "data_processing"
[28] "platform_id" "contact_name" "contact_email"
[31] "contact_phone" "contact_laboratory" "contact_institute"
[34] "contact_address" "contact_city" "contact_zip/postal_code"
[37] "contact_country" "supplementary_file" "supplementary_file.1"
[40] "data_row_count" "age" "braak_stage"
[43] "brain_region" "cell type" "diagnosis"
[46] "donor_id" "sentrix_id" "sentrix_position"
[49] "Sex"
$study2
[1] "title" "geo_accession" "status"
[4] "submission_date" "last_update_date" "type"
[7] "channel_count" "source_name_ch1" "organism_ch1"
[10] "characteristics_ch1" "characteristics_ch1.1" "characteristics_ch1.2"
[13] "characteristics_ch1.3" "characteristics_ch1.4" "characteristics_ch1.5"
[16] "characteristics_ch1.6" "characteristics_ch1.7" "characteristics_ch1.8"
[19] "molecule_ch1" "extract_protocol_ch1" "label_ch1"
[22] "label_protocol_ch1" "taxid_ch1" "hyb_protocol"
[25] "scan_protocol" "description" "data_processing"
[28] "platform_id" "contact_name" "contact_email"
[31] "contact_phone" "contact_laboratory" "contact_institute"
[34] "contact_address" "contact_city" "contact_zip/postal_code"
[37] "contact_country" "supplementary_file" "supplementary_file.1"
[40] "data_row_count" "age" "braak_stage"
[43] "brain_region" "cell type" "diagnosis"
[46] "donor_id" "sentrix_id" "sentrix_position"
[49] "Sex"
attr(,"class")
[1] "dsfvarLabels" "list"
Then we can run the analysis and obtain the output with the chromosome and gene symbol by:
ans.limma.annot <- ds.limma(model = ~ diagnosis + Sex,
Set = "methy",
annotCols = c("CHR", "UCSC_RefGene_Name"),
datasources = conns)
lapply(ans.limma.annot, head)
$study1
# A tibble: 6 x 12
id CHR UCSC_RefGene_Na~ logFC CI.L CI.R AveExpr t P.Value adj.P.Val B SE
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 cg13~ 2 "ECEL1P2" -0.147 -0.191 -0.103 0.380 -6.62 1.90e-9 0.000466 10.6 0.0122
2 cg23~ 2 "MTA3" -0.0569 -0.0741 -0.0397 0.200 -6.58 2.32e-9 0.000466 10.4 0.00520
3 cg13~ 17 "" -0.0820 -0.107 -0.0570 0.437 -6.50 3.27e-9 0.000466 10.0 0.0135
4 cg12~ 19 "MEX3D" -0.0519 -0.0678 -0.0359 0.145 -6.45 4.25e-9 0.000466 9.76 0.00872
5 cg24~ 19 "ISOC2;ISOC2;IS~ -0.0452 -0.0593 -0.0312 0.139 -6.39 5.47e-9 0.000466 9.51 0.00775
6 cg02~ 2 "ECEL1P2" -0.125 -0.165 -0.0860 0.247 -6.33 7.31e-9 0.000466 9.23 0.0163
$study2
# A tibble: 6 x 12
id CHR UCSC_RefGene_Na~ logFC CI.L CI.R AveExpr t P.Value adj.P.Val B SE
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 cg04~ 11 "CD6" -0.101 -0.135 -0.0669 0.345 -5.91 6.21e-8 0.0172 7.18 0.0128
2 cg07~ 6 "MUC21" -0.0431 -0.0577 -0.0284 0.776 -5.85 8.22e-8 0.0172 6.90 0.00390
3 cg27~ 11 "CD6" -0.0688 -0.0924 -0.0452 0.277 -5.79 1.07e-7 0.0172 6.64 0.0147
4 cg08~ 1 "" -0.0461 -0.0627 -0.0296 0.166 -5.55 2.98e-7 0.0360 5.64 0.00791
5 cg18~ 3 "RARRES1;RARRES~ -0.0491 -0.0671 -0.0311 0.157 -5.42 5.07e-7 0.0489 5.12 0.00848
6 cg02~ 8 "" -0.0199 -0.0272 -0.0125 0.0947 -5.36 6.70e-7 0.0538 4.84 0.0155
Then, the last step is to meta-analyze the results. Different methods can be used to this end. We have implemented a method that meta-analyze the p-pvalues of each study as follows:
ans.meta <- metaPvalues(ans.limma)
ans.meta
# A tibble: 481,868 x 4
id study1 study2 p.meta
<chr> <dbl> <dbl> <dbl>
1 cg13138089 0.00000000190 0.00000763 4.78e-13
2 cg25317941 0.0000000179 0.00000196 1.12e-12
3 cg02812891 0.00000000731 0.00000707 1.63e-12
4 cg12706938 0.00000000425 0.0000161 2.14e-12
5 cg16026647 0.000000101 0.000000797 2.51e-12
6 cg12695465 0.00000000985 0.0000144 4.33e-12
7 cg21171625 0.000000146 0.00000225 9.78e-12
8 cg13772815 0.00000000327 0.000122 1.18e-11
9 cg00228891 0.000000166 0.00000283 1.38e-11
10 cg21488617 0.0000000186 0.0000299 1.62e-11
# ... with 481,858 more rows
We can verify that the results are pretty similar to those obtained using pooled analyses. Here we compute the association for two of the top-CpGs:
res1 <- ds.lmFeature(feature = "cg13138089",
model = ~ diagnosis + Sex,
Set = "methy",
datasources = conns)
res1
Estimate Std. Error p-value
cg13138089 -0.1373348 0.01712405 1.057482e-15
attr(,"class")
[1] "dsLmFeature" "matrix" "array"
res2 <- ds.lmFeature(feature = "cg13772815",
model = ~ diagnosis + Sex,
Set = "methy",
datasources = conns)
res2
Estimate Std. Error p-value
cg13772815 -0.06786137 0.009128915 1.056225e-13
attr(,"class")
[1] "dsLmFeature" "matrix" "array"
We can create a QQ-plot by using the function qqplot available in our package.
qqplot(ans.meta$p.meta)
Here In some cases inflation can be observed, so that, correction for cell-type or surrogate variables must be performed. We describe how we can do that in the next two sections.
The vast majority of omic studies require to control for unwanted variability. The surrogate variable analysis (SVA) can address this issue by estimating some hidden covariates that capture differences across individuals due to some artifacts such as batch effects or sample quality sam among others. The method is implemented in SVA package.
Performing this type of analysis using the ds.lmFeature function is not allowed since estimating SVA would require to implement a non-disclosive method that computes SVA from the different servers. This will be a future topic of the dsOmicsClient. NOTE that, estimating SVA separately at each server would not be a good idea since the aim of SVA is to capture differences mainly due to experimental issues among ALL individuals. What we can do instead is to use the ds.limma function to perform the analyses adjusted for SVA at each study.
ans.sva <- ds.limma(model = ~ diagnosis + Sex,
Set = "methy",
sva = TRUE, annotCols = c("CHR", "UCSC_RefGene_Name"))
ans.sva
$study1
# A tibble: 481,868 x 12
id CHR UCSC_RefGene_Na~ logFC CI.L CI.R AveExpr t P.Value adj.P.Val B
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 cg10~ 19 "GNG7" -0.0547 -0.0721 -0.0373 0.338 -6.25 1.31e-8 0.00591 8.51
2 cg13~ 17 "" -0.0569 -0.0757 -0.0381 0.437 -6.00 3.91e-8 0.00591 7.43
3 cg11~ 19 "PODNL1;PODNL1;~ 0.0334 0.0223 0.0445 0.568 5.97 4.53e-8 0.00591 7.29
4 cg27~ 17 "SLC47A2;SLC47A~ 0.0274 0.0182 0.0365 0.548 5.95 4.91e-8 0.00591 7.21
5 cg21~ 6 "" -0.0453 -0.0609 -0.0297 0.799 -5.78 1.05e-7 0.00666 6.46
6 cg23~ 2 "MTA3" -0.0327 -0.0440 -0.0215 0.200 -5.77 1.09e-7 0.00666 6.42
7 cg10~ 16 "SALL1;SALL1" 0.0366 0.0240 0.0492 0.137 5.77 1.10e-7 0.00666 6.42
8 cg24~ 1 "" 0.0317 0.0208 0.0427 0.821 5.76 1.11e-7 0.00666 6.41
9 cg13~ 9 "" -0.0326 -0.0440 -0.0213 0.367 -5.72 1.32e-7 0.00705 6.24
10 cg13~ 2 "ECEL1P2" -0.106 -0.144 -0.0686 0.380 -5.61 2.15e-7 0.00835 5.76
# ... with 481,858 more rows, and 1 more variable: SE <dbl>
$study2
# A tibble: 481,868 x 12
id CHR UCSC_RefGene_Na~ logFC CI.L CI.R AveExpr t P.Value adj.P.Val B
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 cg16~ 12 "LRP1" -0.0418 -0.0548 -0.0289 0.388 -6.42 7.82e-9 0.00374 9.05
2 cg25~ 1 "" -0.0664 -0.0874 -0.0453 0.569 -6.27 1.55e-8 0.00374 8.37
3 cg12~ 11 "NRXN2;NRXN2" -0.0273 -0.0367 -0.0179 0.728 -5.81 1.13e-7 0.0149 6.42
4 cg07~ 3 "" -0.0393 -0.0529 -0.0258 0.160 -5.78 1.24e-7 0.0149 6.32
5 cg07~ 6 "MUC21" -0.0427 -0.0579 -0.0276 0.776 -5.61 2.63e-7 0.0163 5.59
6 cg00~ 1 "CR1L" -0.0568 -0.0771 -0.0365 0.350 -5.56 3.18e-7 0.0163 5.40
7 cg11~ 12 "CNTN1;CNTN1" -0.0443 -0.0601 -0.0284 0.152 -5.56 3.18e-7 0.0163 5.40
8 cg03~ 7 "PGAM2;PGAM2" -0.0442 -0.0600 -0.0283 0.211 -5.56 3.26e-7 0.0163 5.38
9 cg07~ 1 "KCNAB2;KCNAB2" 0.0611 0.0392 0.0831 0.659 5.54 3.47e-7 0.0163 5.31
10 cg25~ 17 "WNK4" -0.0392 -0.0533 -0.0251 0.512 -5.52 3.70e-7 0.0163 5.25
# ... with 481,858 more rows, and 1 more variable: SE <dbl>
attr(,"class")
[1] "dsLimma" "list"
Then, data can be combined meta-anlyzed as follows:
ans.meta.sv <- metaPvalues(ans.sva)
ans.meta.sv
# A tibble: 481,868 x 4
id study1 study2 p.meta
<chr> <dbl> <dbl> <dbl>
1 cg00228891 0.00000397 0.000000318 3.58e-11
2 cg01301319 0.00000609 0.00000123 2.00e-10
3 cg22962123 0.00000106 0.00000767 2.17e-10
4 cg24302412 0.0000139 0.000000762 2.77e-10
5 cg02812891 0.000000408 0.0000327 3.47e-10
6 cg23859635 0.000000109 0.000190 5.29e-10
7 cg13138089 0.000000215 0.000105 5.74e-10
8 cg24938077 0.0000125 0.00000254 8.02e-10
9 cg13772815 0.0000000391 0.00132 1.27e- 9
10 cg21212881 0.000000340 0.000248 2.04e- 9
# ... with 481,858 more rows
The DataSHIELD session must by closed by:
datashield.logout(conns)
He have collected information on methylation data (200 more variable CpGs) in order to compare methylation patterns in inner-city children with persistent atopic asthma (n=97) versus healthy controls (n=97). The authors used DNA from peripheral blood mononuclear cells (PBMCs) from inner city children aged 6-12 years with persistent atopic asthma children and healthy controls. Comparing asthmatics (N=97) to controls (N=97). Original data are available at GEO under the accession number GSE40732. The
ExpressionSetwith 194 samples is available at:This data can be use later to check that DataSHIELD (using
dsOmics) is providing the same results as if we had the whole data in a single computerWe have split this data set in two and have created two resources in the Opal demo site (https://opal-demo.obiba.org/) that can be access through DataSHIELD using the credentials:
- user: dsuser - password: passwordThe two resources are in the project called
workshopand their names areGSE40732_1andGSE40732_2
- Login the connection to the two resources to DataSHIELD.
- Assign the resources to an R object called ‘methy’.
- Use
ds.nFeatures ()andds.nSamples ()functions to verify that the loaded objects are having the proper dimensions.- Get the names of the variables available in the sample metadata (use
ds.varLabels ()).- Get the distribution of samples in the variable
asthma(cases/controls).- Perform an EWAS considering
asthmaas the condition (e.g. grouping variable).- Do a simple meta-analysis using
metaPvalues ()function.
Genomic data can be stored in different formats. VCF and PLINK files are commonly used in genetic epidemiology studies. We have a GWAS example available at BRGE data repository that aims to find SNPs associated with asthma. We have data stored in VCF (brge.vcf) with several covariates and phenotypes available in the file brge.txt (gender, age, obesity, smoking, country and asthma status). The same data is also available in PLINK format (brge.bed, brge.bim, brge.fam) with covariates in the file brge.phe.
Here we illustrate how to perform GWAS using R and Bioconductor packages or PLINK shell command line.
We have created a resource having the VCF file of our study on asthma as previously described. The name of the resource is brge_vcf the phenotypes are available in another resource called brge that is a .txt file. The GWAS analysis is then perform as follows
We first start by preparing login data
builder <- newDSLoginBuilder()
builder$append(server = "study1", url = "https://opal-demo.obiba.org",
user = "dsuser", password = "password",
resource = "RSRC.brge_vcf", driver = "OpalDriver")
logindata <- builder$build()
conns <- datashield.login(logins = logindata, assign = TRUE,
symbol = "res")
In this case we have to assign to different resources. One for the VCF (obesity_vcf) and another one for the phenotypic data (obesity). To this end, the datashield.assign.resource function is required before assigning any object to the specific resource. Notice that the VCF resource can be load into R as a GDS thanks to our extension of existing resources in the reourcer
datashield.assign.resource(conns, symbol = "vcf.res",
resource = list(study1 = "RSRC.brge_vcf"))
datashield.assign.expr(conns, symbol = "gds",
expr = quote(as.resource.object(vcf.res)))
datashield.assign.resource(conns, symbol = "covars.res",
resource = list(study1 = "RSRC.brge"))
datashield.assign.expr(conns, symbol = "covars",
expr = quote(as.resource.data.frame(covars.res)))
These are the objects available in the Opal server
ds.ls()
$study1
$study1$environment.searched
[1] "R_GlobalEnv"
$study1$objects.found
[1] "covars" "covars.res" "gds" "res" "vcf.res"
We can use dsBaseClient functions to inspect the variables that are in the covars data.frame. The variables are
ds.colnames("covars")
$study1
[1] "scanID" "gender" "obese" "age" "smoke" "country" "asthma"
The asthma variable has this number of individuals at each level (0: controls, 1: cases)
ds.table("covars$asthma")
Data in all studies were valid
Study 1 : No errors reported from this study
$output.list
$output.list$TABLE_rvar.by.study_row.props
study
covars$asthma 1
0 1
1 1
$output.list$TABLE_rvar.by.study_col.props
study
covars$asthma 1
0 0.6864187
1 0.3135813
$output.list$TABLE_rvar.by.study_counts
study
covars$asthma 1
0 1587
1 725
$output.list$TABLES.COMBINED_all.sources_proportions
covars$asthma
0 1
0.686 0.314
$output.list$TABLES.COMBINED_all.sources_counts
covars$asthma
0 1
1587 725
$validity.message
[1] "Data in all studies were valid"
There may be interest in only studying certain genes, for that matter, the loaded VCF resource can be subsetting as follows
genes <- c("A1BG","A2MP1")
ds.getSNPSbyGen("gds", genes)
The previous code will over-write the VCF with the SNPs corresponding to the selected genes, if the intention is to perform studies with both the complete VCF and a subsetted one, the argument name can be used to create a new object on the server with the subsetted VCF, preserving the complete one.
genes <- c("A1BG","A2MP1")
ds.getSNPSbyGen("gds", genes = genes, name = "subset.vcf")
Then, an object of class GenotypeData must be created at the server side to perform genetic data analyses. This is a container defined in the GWASTools package for storing genotype and phenotypic data from genetic association studies. By doing that we will also verify whether individuals in the GDS (e.g VCF) and covariates files have the same individuals and are in the same order. This can be performed by
ds.GenotypeData(x='gds', covars = 'covars', columnId = 1, newobj.name = 'gds.Data')
Before performing the association analyses, quality control (QC) can be performed to the loaded data. Three methodologies are available; 1) Principal Component Analysis (PCA) of the genomic data, 2) Hardy-Weinberg Equilibrium (HWE) testing and 3) Allelic frequency estimates. The QC methods 2 and 3 have as inputs a GenotypeData object, created with a covariates file that has a gender column; while method 1 has as input a VCF.
To perform the PCA, a pruning functionality is built inside so that redundant SNPs are discarted (there is an extra argument ld.threshold which controls the pruning, more information about it at the SNPRelate documentation), speeding up the execution time
ds.PCASNPS("gds", prune = TRUE)
To perform QC methodologies 2 and 3, the name of the gender column as well as the keys to describe male or female have to be provided. Remember that we can visualize the names of the variables from our data by executing ds.colnames("covars"). In our case, this variable is called “gender”, and the levels of this variable are 1 for male and 2 for female as we can see here (NOTE: we cannot use ds.levels since gender variable is not a factor):
ds.table1D("covars$gender")$counts
covars$gender
1 1215
2 1097
Total 2312
The HWE test can be performed to selected chromosomes using the argument chromosomes, only the autosomes can be selected when performing a HWE test, the encoding of the autosomes can be fetched with
ds.getChromosomeNames("gds.Data")$autosomes
NULL
Therefore, HWE can be performed by:
ds.exactHWE("gds.Data", sexcol = "gender", male = "1", female = "2", chromosome = "22")
$study1
# A tibble: 1,581 x 9
snpID chr nAA nAB nBB MAF minor.allele f pval
<int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 95140 22 531 1082 640 0.475810031069685 A 0.03724945341034~ 0.0762378420092437
2 95141 22 580 1092 590 0.497789566755084 A 0.03446388812448~ 0.101115839682148
3 95142 22 0 333 1951 0.07289842381786~ A -0.0786304604486~ 3.29321452000707e~
4 95143 22 0 225 2080 0.04880694143167~ A -0.0513112884834~ 0.005713534221738~
5 95144 22 176 831 1249 0.262189716312057 A 0.04792409337548~ 0.025554465064758
6 95145 22 249 1003 1055 0.32531426094495 A 0.00958157673233~ 0.635607524466151
7 95146 22 55 516 1738 0.135556517973149 A 0.04646033280618~ 0.0325126363699498
8 95147 22 203 929 1164 0.290722996515679 A 0.01888804177461~ 0.362631409282596
9 95148 22 291 1035 985 0.349848550411077 A 0.01549983175844~ 0.464409674690009
10 95149 22 11 276 2013 0.06478260869565~ A 0.00966929694008~ 0.604847402951204
# ... with 1,571 more rows
attr(,"class")
[1] "dsexactHWE" "list"
Similarly, allele frequencies estimates can be estimated by:
ds.alleleFrequency("gds.Data", sexcol = "gender", male = "1", female = "2")
$study1
# A tibble: 99,289 x 7
M F all n.M n.F n MAF
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.015 0.0147 0.0149 1200 1086 2286 0.0149
2 0.000837 0.00276 0.00175 1195 1087 2282 0.00175
3 0.00330 0.00548 0.00433 1212 1095 2307 0.00433
4 0.0124 0.00868 0.0106 1213 1095 2308 0.0106
5 0.00165 0 0.000866 1214 1095 2309 0.000866
6 0.0712 0.0754 0.0732 1208 1094 2302 0.0732
7 0.00660 0.00913 0.00780 1213 1095 2308 0.00780
8 0.00248 0.00274 0.00261 1209 1093 2302 0.00261
9 0.182 0.181 0.181 1188 1071 2259 0.181
10 0.00165 0 0.000868 1210 1094 2304 0.000868
# ... with 99,279 more rows
attr(,"class")
[1] "dsalleleFrequency" "list"
In the future, more functions will be created to perform quality control (QC) for both, SNPs and inviduals.
Association analysis for a given SNP is performed by simply
ds.glmSNP(snps.fit = "rs11247693", model = asthma ~ gender + age, genoData='gds.Data')
Estimate Std. Error p-value
rs11247693 -0.1543215 0.2309585 0.5040196
attr(,"class")
[1] "dsGlmSNP" "matrix" "array"
The analysis of all available SNPs is performed when the argument snps.fit is missing. The function performs the analysis of the selected SNPs in a single repository or in multiple repositories as performing pooled analyses (it uses ds.glm DataSHIELD function). As in the case of transcriptomic data, analyzing all the SNPs in the genome (e.g GWAS) will be high time-consuming. We can adopt a similar approach as the one adopted using the limma at each server. That is, we run GWAS at each repository using specific and scalable packages available in R/Bioc. In that case we use the GWASTools and GENESIS packages. The complete pipeline is implemented in this function
ans.bioC <- ds.GWAS('gds.Data', model=asthma~age+country)
This close the DataSHIELD session
datashield.logout(conns)
Here we illustrate how to perform the same GWAS analyses on the asthma using PLINK secure shell commands. This can be performed thanks to the posibility of having ssh resources as described here.
It is worth to notice that this workflow and the new R functions implemented in dsOmicsClient could be used as a guideline to carry out similar analyses using existing analysis tools in genomics such as IMPUTE, SAMtools or BEDtools among many others.
We start by assigning login resources
library(DSOpal)
library(dsBaseClient)
library(dsOmicsClient)
builder <- newDSLoginBuilder()
builder$append(server = "study1", url = "https://opal-demo.obiba.org",
user = "dsuser", password = "password",
resource = "RSRC.brge_plink", driver = "OpalDriver")
logindata <- builder$build()
Then we assign the resource to a symbol (i.e. R object) called client which is a ssh resource
conns <- datashield.login(logins = logindata, assign = TRUE,
symbol = "client")
ds.class("client")
$study1
[1] "SshResourceClient" "CommandResourceClient" "ResourceClient" "R6"
Now, we are ready to run any PLINK command from the client site. Notice that in this case we want to assess association between the genotype data in bed format and use as phenotype the variable ‘asthma’ that is in the file ‘brge.phe’ in the 6th column. The sentence in a PLINK command would be (NOTE: we avoid –out to indicate the output file since the file will be available in R as a tibble).
plink --bfile brge --logistic --pheno brge.phe --mpheno 6 --covar brge.phe --covar-name gender,age
The arguments must be encapsulated in a single character without the command ‘plink’
plink.arguments <- "--bfile brge --logistic --pheno brge.phe --mpheno 6 --covar brge.phe --covar-name gender,age"
the analyses are then performed by
ans.plink <- ds.PLINK("client", plink.arguments)
The object ans.plink contains the PLINK results at each server as well as the outuput provided by PLINK
lapply(ans.plink, names)
$study1
[1] "results" "plink.out"
head(ans.plink$study1$results)
# A tibble: 6 x 9
CHR SNP BP A1 TEST NMISS OR STAT P
<dbl> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 0 MitoC3993T 3993 T ADD 2286 0.752 -1.33 0.182
2 0 MitoC3993T 3993 T gender 2286 0.742 -3.27 0.00107
3 0 MitoC3993T 3993 T age 2286 1.00 0.565 0.572
4 0 MitoG4821A 4821 A ADD 2282 2.68 1.71 0.0879
5 0 MitoG4821A 4821 A gender 2282 0.740 -3.31 0.000940
6 0 MitoG4821A 4821 A age 2282 1.00 0.465 0.642
ans.plink$study$plink.out
$status
[1] 0
$output
[1] ""
[2] "@----------------------------------------------------------@"
[3] "| PLINK! | v1.07 | 10/Aug/2009 |"
[4] "|----------------------------------------------------------|"
[5] "| (C) 2009 Shaun Purcell, GNU General Public License, v2 |"
[6] "|----------------------------------------------------------|"
[7] "| For documentation, citation & bug-report instructions: |"
[8] "| http://pngu.mgh.harvard.edu/purcell/plink/ |"
[9] "@----------------------------------------------------------@"
[10] ""
[11] "Skipping web check... [ --noweb ] "
[12] "Writing this text to log file [ /tmp/ssh-8558/out.log ]"
[13] "Analysis started: Tue Jan 26 17:12:57 2021"
[14] ""
[15] "Options in effect:"
[16] "\t--bfile brge"
[17] "\t--logistic"
[18] "\t--pheno brge.phe"
[19] "\t--mpheno 6"
[20] "\t--covar brge.phe"
[21] "\t--covar-name gender,age"
[22] "\t--noweb"
[23] "\t--out /tmp/ssh-8558/out"
[24] ""
[25] "Reading map (extended format) from [ brge.bim ] "
[26] "100000 markers to be included from [ brge.bim ]"
[27] "Reading pedigree information from [ brge.fam ] "
[28] "2312 individuals read from [ brge.fam ] "
[29] "2312 individuals with nonmissing phenotypes"
[30] "Assuming a disease phenotype (1=unaff, 2=aff, 0=miss)"
[31] "Missing phenotype value is also -9"
[32] "725 cases, 1587 controls and 0 missing"
[33] "1097 males, 1215 females, and 0 of unspecified sex"
[34] "Reading genotype bitfile from [ brge.bed ] "
[35] "Detected that binary PED file is v1.00 SNP-major mode"
[36] "Reading alternate phenotype from [ brge.phe ] "
[37] "2312 individuals with non-missing alternate phenotype"
[38] "Assuming a disease phenotype (1=unaff, 2=aff, 0=miss)"
[39] "Missing phenotype value is also -9"
[40] "725 cases, 1587 controls and 0 missing"
[41] "Reading 6 covariates from [ brge.phe ] with nonmissing values for 2199 individuals"
[42] "Selected subset of 2 from 6 covariates"
[43] "For these, nonmissing covariate values for 2312 individuals"
[44] "Before frequency and genotyping pruning, there are 100000 SNPs"
[45] "2312 founders and 0 non-founders found"
[46] "6009 heterozygous haploid genotypes; set to missing"
[47] "Writing list of heterozygous haploid genotypes to [ /tmp/ssh-8558/out.hh ]"
[48] "7 SNPs with no founder genotypes observed"
[49] "Warning, MAF set to 0 for these SNPs (see --nonfounders)"
[50] "Writing list of these SNPs to [ /tmp/ssh-8558/out.nof ]"
[51] "Total genotyping rate in remaining individuals is 0.994408"
[52] "0 SNPs failed missingness test ( GENO > 1 )"
[53] "0 SNPs failed frequency test ( MAF < 0 )"
[54] "After frequency and genotyping pruning, there are 100000 SNPs"
[55] "After filtering, 725 cases, 1587 controls and 0 missing"
[56] "After filtering, 1097 males, 1215 females, and 0 of unspecified sex"
[57] "Converting data to Individual-major format"
[58] "Writing logistic model association results to [ /tmp/ssh-8558/out.assoc.logistic ] "
[59] ""
[60] "Analysis finished: Tue Jan 26 17:14:56 2021"
[61] ""
$error
character(0)
$command
[1] "cd /home/master/brge && plink1 --bfile brge --logistic --pheno brge.phe --mpheno 6 --covar brge.phe --covar-name gender,age --noweb --out /tmp/ssh-8558/out"
attr(,"class")
[1] "resource.exec"
We can compare the p-values obtained using PLINK with Bioconductor-based packages for the top-10 SNPs as follows:
library(tidyverse)
# get SNP p.values (additive model - ADD)
res.plink <- ans.plink$study1$results %>% filter(TEST=="ADD") %>%
arrange(P)
# compare top-10 with Biocoductor's results
snps <- res.plink$SNP[1:10]
plink <- res.plink %>% filter(SNP%in%snps) %>% dplyr::select(SNP, P)
bioC <- ans.bioC$study1 %>% filter(rs%in%snps) %>% dplyr::select(rs, Score.pval)
left_join(plink, bioC, by=c("SNP" = "rs"))
# A tibble: 10 x 3
SNP P Score.pval
<chr> <dbl> <dbl>
1 rs2267914 0.00000151 0.000000809
2 rs6097326 0.00000424 0.00000642
3 rs7153 0.00000440 0.00000508
4 rs3732410 0.00000817 0.00000940
5 rs7995146 0.0000170 0.0000195
6 rs6495788 0.0000213 0.0000278
7 rs1602679 0.0000268 0.0000264
8 rs11055608 0.0000270 0.0000161
9 rs7098143 0.0000313 0.0000214
10 rs7676164 0.0000543 0.0000537
As expected, the p-values are in the same order of magnitud having little variations due to the implemented methods of each software.
We can do the same comparions of minor allele frequency (MAF) estimation performed with Bioconductor and PLINK. To this end, we need first to estimate MAF using PLINK
plink.arguments <- "--bfile brge --freq"
ans.plink2 <- ds.PLINK("client", plink.arguments)
maf.plink <- ans.plink2$study1$results
plink <- maf.plink %>% filter(SNP%in%snps) %>% dplyr::select(SNP, MAF)
bioC <- ans.bioC$study1 %>% filter(rs%in%snps) %>% dplyr::select(rs, freq)
left_join(plink, bioC, by=c("SNP" = "rs"))
# A tibble: 10 x 3
SNP MAF freq
<chr> <dbl> <dbl>
1 rs7153 0.256 0.256
2 rs3732410 0.254 0.254
3 rs7676164 0.304 0.304
4 rs1602679 0.101 0.101
5 rs7098143 0.210 0.210
6 rs11055608 0.446 0.446
7 rs7995146 0.0527 0.0527
8 rs6495788 0.267 0.267
9 rs2267914 0.104 0.104
10 rs6097326 0.125 0.125
This close the DataSHIELD session
datashield.logout(conns)
Our main aim is to train ATHLETE researchers to perform omic data analysis linking the epigenome with the exposome. WP3 and WP4 have setup a project (GREENSPACE) that will be used as proof-of-concept of our developed infrastructure. This figure summarized that project
Figure 5: GREENSPACE project overview
Linking the exposome (green space) with the epigenome
We prepared some datasets with this information:
Figure 6: GREENSPACE project available data
The variables we have are:
ExposomeSetSo, the requirements to perform such analyses are:
dsBaseClient and dsOmicsClient R packages into analyst computer.ExpressioSet encapsulating methylation data for each cohort (Resources) [see here for a description of an ExpressionSet and how to deal with it as a resource]
ExpressionSet.Rdata file with the required object.dsOmics, dsBase and resourcerFor this tutorial, we have tried to mimic real situations we have a project called GREENSPACE in our Opal demo server having exposome data for 3 cohorts (simulated data).
Figure 7: Tables available in the project GREENSPACE of Opal demo site
In the next figure we can see that there are 7 variables at each table and different number of samples (117, 178, 160). Using what you have learn in the DataSHIELD beginners course, you can obtain this information from the client site.
We start by creating the connection to our server and assigning the tables to an R object called exposome:
library(DSOpal)
library(dsBaseClient)
builder <- newDSLoginBuilder()
builder$append(server = 'study1', url = 'https://opal-demo.obiba.org',
user = 'dsuser', password = 'password')
builder$append(server = 'study2', url = 'https://opal-demo.obiba.org',
user = 'dsuser', password = 'password')
builder$append(server = 'study3', url = 'https://opal-demo.obiba.org',
user = 'dsuser', password = 'password')
logindata <- builder$build()
conns <- datashield.login(logins = logindata)
datashield.assign.table(conns, symbol = "exposome",
table = list(study1 = "GREENSPACE.Cohort1_exposome",
study2 = "GREENSPACE.Cohort2_exposome",
study3 = "GREENSPACE.Cohort3_exposome"))
The we can run basic DataSHIELD commands
ds.class('exposome')
$study1
[1] "data.frame"
$study2
[1] "data.frame"
$study3
[1] "data.frame"
ds.dim('exposome')
$`dimensions of exposome in study1`
[1] 117 7
$`dimensions of exposome in study2`
[1] 160 7
$`dimensions of exposome in study3`
[1] 178 7
$`dimensions of exposome in combined studies`
[1] 455 7
ds.colnames('exposome')
$study1
[1] "CODE" "greenyn300_0" "greenyn300_s" "greenyn300_4" "ndvi100_0" "ndvi100_s"
[7] "ndvi100_4"
$study2
[1] "CODE" "greenyn300_0" "greenyn300_s" "greenyn300_4" "ndvi100_0" "ndvi100_s"
[7] "ndvi100_4"
$study3
[1] "CODE" "greenyn300_0" "greenyn300_s" "greenyn300_4" "ndvi100_0" "ndvi100_s"
[7] "ndvi100_4"
The three .Rdata file containing the different ExpressionSets are available at:
So, we need to create the resources into the Opal server as we leant in the section 3.3 of the resources workshop. Our GREENSPACE project already have those resources created:
Figure 8: Resources available in the project GREENSPACE of Opal demo site
Then, we can load those resources into our DataSHIELD session as an R object called methy which is an object of class ExpressionSet by:
datashield.assign.resource(conns, symbol = "eSet",
resource = list(study1 = "GREENSPACE.Cohort1_Methyl_C_blood_0y_450K",
study2 = "GREENSPACE.Cohort2_Methyl_C_blood_0y_450K",
study3 = "GREENSPACE.Cohort3_Methyl_C_blood_0y_450K"))
datashield.assign.expr(conns, symbol = "methy", expr = quote(as.resource.object(eSet)))
Then, using dsOmicsClient library we can first check that our methylation data have properly been integrated into the DataSHIELD R server
library(dsOmicsClient)
ds.class('methy')
$study1
[1] "ExpressionSet"
attr(,"package")
[1] "Biobase"
$study2
[1] "ExpressionSet"
attr(,"package")
[1] "Biobase"
$study3
[1] "ExpressionSet"
attr(,"package")
[1] "Biobase"
ds.nFeatures('methy')
$study1
Features
100
$study2
Features
100
$study3
Features
100
attr(,"class")
[1] "dsnFeatures" "list"
ds.nSamples('methy')
$study1
Samples
100
$study2
Samples
150
$study3
Samples
135
attr(,"class")
[1] "dsnSamples" "list"
Then, in order to be able to use ds.limma () we need first to add the exposome data to the ExpressionSet object. To this end, the function ds.addPhenoData () has been created
ds.addPhenoData('methy', 'exposome', identifier = "CODE", name = 'methy2')
This function creates a new object in the R server side that is called methy2
ds.ls()
$study1
$study1$environment.searched
[1] "R_GlobalEnv"
$study1$objects.found
[1] "eSet" "exposome" "methy" "methy2"
$study2
$study2$environment.searched
[1] "R_GlobalEnv"
$study2$objects.found
[1] "eSet" "exposome" "methy" "methy2"
$study3
$study3$environment.searched
[1] "R_GlobalEnv"
$study3$objects.found
[1] "eSet" "exposome" "methy" "methy2"
This is an object of class ExpressionSet
ds.class('methy2')
$study1
[1] "ExpressionSet"
attr(,"package")
[1] "Biobase"
$study2
[1] "ExpressionSet"
attr(,"package")
[1] "Biobase"
$study3
[1] "ExpressionSet"
attr(,"package")
[1] "Biobase"
Having the exposures in the phenotypic data
ds.fvarLabels('methy')
$study1
[1] "X" "id_methyl" "id" "age_methyl" "sex_methyl" "anc_methyl" "cohort"
[8] "Bcell" "CD4T" "CD8T" "Gran" "Mono" "NK" "nRBC"
$study2
[1] "X" "id_methyl" "id" "age_methyl" "sex_methyl" "anc_methyl" "cohort"
[8] "Bcell" "CD4T" "CD8T" "Gran" "Mono" "NK" "nRBC"
$study3
[1] "X" "id_methyl" "id" "age_methyl" "sex_methyl" "anc_methyl" "cohort"
[8] "Bcell" "CD4T" "CD8T" "Gran" "Mono" "NK" "nRBC"
attr(,"class")
[1] "dsfvarLabels" "list"
ds.fvarLabels('methy2')
$study1
[1] "X" "id_methyl" "id" "age_methyl" "sex_methyl" "anc_methyl"
[7] "cohort" "Bcell" "CD4T" "CD8T" "Gran" "Mono"
[13] "NK" "nRBC" "greenyn300_0" "greenyn300_s" "greenyn300_4" "ndvi100_0"
[19] "ndvi100_s" "ndvi100_4"
$study2
[1] "X" "id_methyl" "id" "age_methyl" "sex_methyl" "anc_methyl"
[7] "cohort" "Bcell" "CD4T" "CD8T" "Gran" "Mono"
[13] "NK" "nRBC" "greenyn300_0" "greenyn300_s" "greenyn300_4" "ndvi100_0"
[19] "ndvi100_s" "ndvi100_4"
$study3
[1] "X" "id_methyl" "id" "age_methyl" "sex_methyl" "anc_methyl"
[7] "cohort" "Bcell" "CD4T" "CD8T" "Gran" "Mono"
[13] "NK" "nRBC" "greenyn300_0" "greenyn300_s" "greenyn300_4" "ndvi100_0"
[19] "ndvi100_s" "ndvi100_4"
attr(,"class")
[1] "dsfvarLabels" "list"
we can verify that matching has properly been performed
ds.nSamples('methy2')
$study1
Samples
97
$study2
Samples
145
$study3
Samples
133
attr(,"class")
[1] "dsnSamples" "list"
Now, we are ready to perform the required EXWAS by
ewas <- ds.limma(model = ~ greenyn300_0,
Set = "methy2")
We can meta-analyze our results as we want, since we have the required information in our computer
ewas
$study1
# A tibble: 100 x 10
id logFC CI.L CI.R AveExpr t P.Value adj.P.Val B SE
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 cg14008030 0.0288 0.00537 0.0523 0.526 2.44 0.0165 0.613 -4.66 0.0100
2 cg23999112 0.0258 0.00428 0.0473 0.752 2.38 0.0193 0.613 -4.80 0.0118
3 cg18762746 0.00589 0.000781 0.0110 0.967 2.29 0.0243 0.613 -5.00 0.00594
4 cg05597748 -0.0195 -0.0365 -0.00257 0.813 -2.29 0.0245 0.613 -5.01 0.0170
5 cg04496485 -0.00869 -0.0175 0.000171 0.0525 -1.95 0.0545 0.989 -5.70 0.00431
6 cg12285627 0.0128 -0.00161 0.0272 0.615 1.76 0.0812 0.989 -6.03 0.0231
7 cg01097950 0.0160 -0.00253 0.0345 0.814 1.71 0.0899 0.989 -6.11 0.0119
8 cg08900511 0.0149 -0.00325 0.0330 0.845 1.63 0.107 0.989 -6.25 0.0142
9 cg05475702 -0.0212 -0.0480 0.00557 0.703 -1.57 0.119 0.989 -6.34 0.0133
10 cg16736630 -0.0335 -0.0781 0.0112 0.501 -1.49 0.140 0.989 -6.47 0.0281
# ... with 90 more rows
$study2
# A tibble: 100 x 10
id logFC CI.L CI.R AveExpr t P.Value adj.P.Val B SE
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 cg14008030 0.0163 0.000457 0.0322 0.531 2.03 0.0438 0.989 -6.06 0.00660
2 cg11422233 0.00539 -0.000365 0.0111 0.0830 1.85 0.0662 0.989 -6.41 0.00802
3 cg05662829 0.0165 -0.00126 0.0342 0.815 1.84 0.0684 0.989 -6.44 0.00322
4 cg06402284 0.00306 -0.000481 0.00660 0.0408 1.71 0.0897 0.989 -6.66 0.00993
5 cg00381604 0.00282 -0.000583 0.00621 0.0306 1.64 0.104 0.989 -6.78 0.00172
6 cg01551879 0.00129 -0.000291 0.00287 0.0181 1.61 0.109 0.989 -6.82 0.0147
7 cg12045430 0.00492 -0.00144 0.0113 0.0361 1.53 0.129 0.989 -6.95 0.00723
8 cg24699430 0.000768 -0.000275 0.00181 0.0145 1.46 0.148 0.989 -7.05 0.00878
9 cg04407431 0.000908 -0.000327 0.00214 0.0173 1.45 0.148 0.989 -7.06 0.00859
10 cg08167039 0.00203 -0.000735 0.00479 0.0352 1.45 0.149 0.989 -7.06 0.0150
# ... with 90 more rows
$study3
# A tibble: 100 x 10
id logFC CI.L CI.R AveExpr t P.Value adj.P.Val B SE
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 cg09989996 0.00735 0.00288 0.0118 0.950 3.25 0.00146 0.146 -2.75 0.00815
2 cg13938959 -0.0254 -0.0423 -0.00853 0.731 -2.98 0.00347 0.173 -3.54 0.00907
3 cg10988531 0.0193 0.00539 0.0331 0.834 2.75 0.00686 0.229 -4.17 0.00818
4 cg14057946 0.0114 0.00225 0.0206 0.0945 2.46 0.0151 0.377 -4.87 0.0121
5 cg03213877 -0.00174 -0.00333 -0.000160 0.0174 -2.18 0.0312 0.624 -5.51 0.00536
6 cg01068023 -0.00801 -0.0156 -0.000416 0.888 -2.09 0.0388 0.647 -5.70 0.0186
7 cg21996134 0.00995 -0.000355 0.0203 0.954 1.91 0.0583 0.790 -6.05 0.00899
8 cg21870274 -0.0165 -0.0343 0.00127 0.814 -1.84 0.0685 0.790 -6.18 0.0105
9 cg11954957 0.0272 -0.00282 0.0573 0.746 1.79 0.0753 0.790 -6.26 0.0130
10 cg18263059 0.0107 -0.00125 0.0226 0.0381 1.77 0.0790 0.790 -6.30 0.0169
# ... with 90 more rows
attr(,"class")
[1] "dsLimma" "list"
We have implemented a very simple function that makes use of metap package to combine p-values from microarray experiments. For both ATHELETE and LifeCycle researchers we highly recommend to use our EASIER package
metaPvalues(ewas)
# A tibble: 100 x 5
id study1 study2 study3 p.meta
<chr> <dbl> <dbl> <dbl> <dbl>
1 cg14008030 0.0165 0.0438 0.795 0.0209
2 cg14057946 0.263 0.213 0.0151 0.0280
3 cg09989996 0.976 0.955 0.00146 0.0401
4 cg23999112 0.0193 0.771 0.102 0.0433
5 cg13938959 0.967 0.829 0.00347 0.0672
6 cg05597748 0.0245 0.353 0.334 0.0691
7 cg16047670 0.158 0.182 0.105 0.0713
8 cg10988531 0.935 0.544 0.00686 0.0791
9 cg18762746 0.0243 0.402 0.463 0.0949
10 cg01068023 0.693 0.255 0.0388 0.126
# ... with 90 more rows
We can also run analyses adjusted for other covariates or even for cell types if metadata is available in the ExpressionSet
ds.fvarLabels('methy2')
$study1
[1] "X" "id_methyl" "id" "age_methyl" "sex_methyl" "anc_methyl"
[7] "cohort" "Bcell" "CD4T" "CD8T" "Gran" "Mono"
[13] "NK" "nRBC" "greenyn300_0" "greenyn300_s" "greenyn300_4" "ndvi100_0"
[19] "ndvi100_s" "ndvi100_4"
$study2
[1] "X" "id_methyl" "id" "age_methyl" "sex_methyl" "anc_methyl"
[7] "cohort" "Bcell" "CD4T" "CD8T" "Gran" "Mono"
[13] "NK" "nRBC" "greenyn300_0" "greenyn300_s" "greenyn300_4" "ndvi100_0"
[19] "ndvi100_s" "ndvi100_4"
$study3
[1] "X" "id_methyl" "id" "age_methyl" "sex_methyl" "anc_methyl"
[7] "cohort" "Bcell" "CD4T" "CD8T" "Gran" "Mono"
[13] "NK" "nRBC" "greenyn300_0" "greenyn300_s" "greenyn300_4" "ndvi100_0"
[19] "ndvi100_s" "ndvi100_4"
attr(,"class")
[1] "dsfvarLabels" "list"
For example, let us run the analysis adjusted for gender
ewas.adj <- ds.limma(model = ~ greenyn300_0 + sex_methyl, Set = "methy2")
metaPvalues(ewas.adj)
# A tibble: 100 x 5
id study1 study2 study3 p.meta
<chr> <dbl> <dbl> <dbl> <dbl>
1 cg14008030 0.0230 0.0463 0.824 0.0288
2 cg14057946 0.261 0.242 0.0145 0.0297
3 cg09989996 0.990 0.945 0.00153 0.0415
4 cg23999112 0.0214 0.785 0.111 0.0503
5 cg05597748 0.0270 0.344 0.290 0.0658
6 cg13938959 0.942 0.817 0.00373 0.0689
7 cg10988531 0.959 0.550 0.00558 0.0701
8 cg16047670 0.156 0.184 0.104 0.0707
9 cg18762746 0.0293 0.395 0.476 0.109
10 cg01068023 0.665 0.275 0.0371 0.125
# ... with 90 more rows
Do not forget to logout the connection
datashield.logout(conns)
Peform an EWAS using the same data that is available in the GREENSPACE project to determine which CpGs are differentially methylated for the exposures:
greenyn300_4,ndvi100_0andndvi100_4.
EWAS requires to adjust the analyses for cell type composition. The
ExpressionSetcontains the variables Bcell, CD4T, CD8T, Gran, Mono and NK in the sample metadata. Re-run the models adjustin for those variables
sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)
Matrix products: default
locale:
[1] LC_COLLATE=Spanish_Spain.1252 LC_CTYPE=Spanish_Spain.1252 LC_MONETARY=Spanish_Spain.1252
[4] LC_NUMERIC=C LC_TIME=Spanish_Spain.1252
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] forcats_0.5.0 stringr_1.4.0 dplyr_1.0.2 purrr_0.3.4
[5] readr_1.3.1 tidyr_1.1.2 tibble_3.0.3 ggplot2_3.3.2
[9] tidyverse_1.3.0 dsOmicsClient_1.1.0 dsBaseClient_6.1.0 Biobase_2.48.0
[13] BiocGenerics_0.34.0 DSOpal_1.1.0 DSI_1.1.0 R6_2.4.1
[17] progress_1.2.2 opalr_1.5.1 httr_1.4.2 knitr_1.29
[21] BiocStyle_2.16.0
loaded via a namespace (and not attached):
[1] fs_1.5.0 lubridate_1.7.9 RColorBrewer_1.1-2 numDeriv_2016.8-1.1
[5] tools_4.0.2 backports_1.1.9 utf8_1.1.4 DBI_1.1.0
[9] colorspace_1.4-1 sn_1.6-2 withr_2.2.0 tidyselect_1.1.0
[13] prettyunits_1.1.1 mnormt_2.0.2 curl_4.3 compiler_4.0.2
[17] cli_2.0.2 rvest_0.3.6 xml2_1.3.2 TFisher_0.2.0
[21] sandwich_2.5-1 bookdown_0.20 scales_1.1.1 mvtnorm_1.1-1
[25] digest_0.6.25 rmarkdown_2.3 pkgconfig_2.0.3 htmltools_0.5.0
[29] bibtex_0.4.2.2 plotrix_3.7-8 highr_0.8 dbplyr_1.4.4
[33] rlang_0.4.7 readxl_1.3.1 rstudioapi_0.11 generics_0.0.2
[37] zoo_1.8-8 jsonlite_1.7.0 magrittr_1.5 Matrix_1.2-18
[41] fansi_0.4.1 Rcpp_1.0.5 munsell_0.5.0 lifecycle_0.2.0
[45] stringi_1.4.6 multcomp_1.4-13 yaml_2.2.1 mathjaxr_1.0-1
[49] gbRd_0.4-11 MASS_7.3-52 grid_4.0.2 blob_1.2.1
[53] ggrepel_0.8.2 crayon_1.3.4 lattice_0.20-41 haven_2.3.1
[57] splines_4.0.2 multtest_2.44.0 hms_0.5.3 tmvnsim_1.0-2
[61] pillar_1.4.6 codetools_0.2-16 stats4_4.0.2 reprex_0.3.0
[65] mutoss_0.1-12 glue_1.4.2 evaluate_0.14 metap_1.4
[69] BiocManager_1.30.10 modelr_0.1.8 vctrs_0.3.3 Rdpack_1.0.0
[73] cellranger_1.1.0 gtable_0.3.0 assertthat_0.2.1 xfun_0.16
[77] mime_0.9 broom_0.7.0 survival_3.2-3 TH.data_1.0-10
[81] ellipsis_0.3.1