Contents

1 Introduction

This tutorial provides examples of how to work with resources in Opal and DataSHIELD. More detailed information can be found in our bookdown Orchestrating privacy-protected non-disclosive big data analyses of data from different resources with R and DataSHIELD.

A quick demo of the resources is also available in Tutorial: Resources in R.

2 Getting started

This document can be reproduced by installing the following packages

install.packages("DSOpal")
devtools::install_github("datashield/dsBaseClient")
devtools::install_github("isglobal-brge/dsOmicsClient") 

Then the packages are loaded as usual

library(DSOpal) 
library(dsBaseClient)
library(dsOmicsClient)

We have set up an Opal demo site to illustrate how to perform some basic analyses using DataSHIELD as well as how to deal with different resources for 'omic data. The Opal server can be accessed with the credentials used throughout this tutorial: administrator/password for managing resources and dsuser/password for DataSHIELD analyses.

In this figure we can see all the projects available.

Figure 1: Opal demo site available projects

This tutorial will mainly make use of the resources available in the RSRC project

Figure 2: Resources available at Opal demo site of RSRC project

To become familiar with Opal, we recommend visiting the Opal online documentation.

3 Adding a new resource to the Opal server

Resources can be added to the Opal server manually, as described here. However, this can also be done with R code, as described in the following subsections.

3.1 Resource as a text file

Let us imagine that we have a tsv file stored in our hospital, on a server, in the cloud, in a GitHub repository or at any other site. This file contains information on several variables we want to analyze using DataSHIELD. Let us also imagine that these data are available at this URL: http://duffel.rail.bio/recount/TCGA/TCGA.tsv. This file encodes the phenotypes corresponding to 11,287 samples from TCGA that are available in the Recount project.

Thanks to the resources, this dataset no longer needs to be uploaded into the Opal server as a table. We can analyze it with DataSHIELD packages by creating a new resource as follows.

Let us start by logging into the Opal server. NOTE that creating resources requires full permissions, so we log in with administrator rights

o <- opal.login(username = 'administrator',
                password = 'password', 
                url = 'https://opal-demo.obiba.org')

The TCGA dataset can be added as a new resource simply by typing:

opal.resource_create(opal = o, 
                     project = 'RSRC', 
                     name = 'pheno_TCGA', 
                     url = 'http://duffel.rail.bio/recount/TCGA/TCGA.tsv', 
                     format = 'tsv')

We can check that this resource has been added to our project by

opal.resources(o, project='RSRC')
              name project
1     1000G_covars    RSRC
2        1000G_vcf    RSRC
3           CNSIM1    RSRC
4           CNSIM2    RSRC
5           CNSIM3    RSRC
6              EGA    RSRC
7       GSE66351_1    RSRC
8       GSE66351_2    RSRC
9         GSE80970    RSRC
10            brge    RSRC
11      brge_plink    RSRC
12        brge_vcf    RSRC
13     example.vcf    RSRC
14     ga4gh_1000g    RSRC
15        gps_data    RSRC
16 gps_participant    RSRC
17      pheno_TCGA    RSRC
18    takeaway_gps    RSRC
19      tcga_liver    RSRC
                                                                                                                                           url
1                                                       https://raw.githubusercontent.com/isglobal-brge/brgeUtils/master/data/covars_1000G.txt
2  https://raw.githubusercontent.com/isglobal-brge/brgeUtils/master/data/1KG_phase3_subset_chr22.vcf.gz?method=biallelic.only&snpfirstdim=TRUE
3                                                                                                           mysql://mysqldata:3306/opal/CNSIM1
4                                                                                                                 file:///srv/data/CNSIM2.zsav
5                                                                           opal+https://opal-demo.obiba.org/ws/files/projects/RSRC/CNSIM3.zip
6                                       https://ega.ebi.ac.uk:8052/elixir/tickets/tickets/EGAF00001753756?&referenceName=chr21&start=1&end=100
7                                                                     https://github.com/isglobal-brge/brgedata/raw/master/data/gse66351_1.rda
8                                                                     https://github.com/isglobal-brge/brgedata/raw/master/data/gse66351_2.rda
9                                                                                                              file:///srv/data/GSE80970.Rdata
10                                                       https://raw.githubusercontent.com/isglobal-brge/brgedata/master/inst/extdata/brge.txt
11                                                                       ssh://plink-demo.obiba.org:2222/home/master/brge?exec=ls,plink1,plink
12             https://raw.githubusercontent.com/isglobal-brge/brgedata/master/inst/extdata/brge.vcf.gz?method=biallelic.only&snpfirstdim=TRUE
13          https://raw.githubusercontent.com/isglobal-brge/scoreInvHap/master/inst/extdata/example.vcf?method=biallelic.only&snpfirstdim=TRUE
14                                     https://htsget.ga4gh.org/variants/1000genomes.phase1.chr1?format=VCF&referenceName=1&start=1&end=100000
15                                                                opal+https://opal-demo.obiba.org/ws/files/projects/RSRC/gps_data_final.RData
16                                                               opal+https://opal-demo.obiba.org/ws/files/projects/RSRC/gps_participant.RData
17                                                                                                http://duffel.rail.bio/recount/TCGA/TCGA.tsv
18                                                                 opal+https://opal-demo.obiba.org/ws/files/projects/RSRC/takeaway_data.RData
19                                                              https://github.com/isglobal-brge/brgedata/raw/master/data/rse_gene_liver.Rdata
                       format              created              updated
1                         csv 2021-01-23T05:47:34Z 2021-01-23T05:47:34Z
2                     VCF2GDS 2021-01-23T05:47:34Z 2021-01-23T05:47:34Z
3                        <NA> 2021-01-23T05:47:33Z 2021-01-23T05:47:33Z
4                         sav 2021-01-23T05:47:33Z 2021-01-23T05:47:33Z
5                         csv 2021-01-23T05:47:34Z 2021-01-23T05:47:34Z
6                EGAhtsgetBAM 2021-01-23T05:47:37Z 2021-01-23T05:47:37Z
7               ExpressionSet 2021-01-23T05:47:32Z 2021-01-23T05:47:32Z
8               ExpressionSet 2021-01-23T05:47:33Z 2021-01-23T05:47:33Z
9               ExpressionSet 2021-01-23T05:47:33Z 2021-01-23T05:47:33Z
10                        tsv 2021-01-23T05:47:35Z 2021-01-23T05:47:35Z
11                       <NA> 2021-01-23T05:47:36Z 2021-01-23T05:47:36Z
12                    VCF2GDS 2021-01-23T05:47:35Z 2021-01-23T05:47:35Z
13                    VCF2GDS 2021-01-23T05:47:34Z 2021-01-23T05:47:34Z
14                   GA4GHVCF 2021-01-23T05:47:37Z 2021-01-23T05:47:37Z
15      SpatialLinesDataFrame 2021-01-23T05:47:36Z 2021-01-23T05:47:36Z
16                 data.frame 2021-01-23T05:47:36Z 2021-01-23T05:47:36Z
17                        tsv 2021-01-23T19:50:04Z 2021-01-23T19:50:04Z
18     SpatialPointsDataFrame 2021-01-23T05:47:36Z 2021-01-23T05:47:36Z
19 RangedSummarizedExperiment 2021-01-23T05:47:35Z 2021-01-23T05:47:35Z

We can test the resource assignment. First we assign the resource to an object called client

opal.assign.resource(o, 'client', 'RSRC.pheno_TCGA')
opal.execute(o, 'class(client)')
[1] "TidyFileResourceClient" "FileResourceClient"     "ResourceClient"        
[4] "R6"                    

We see that this object is of class TidyFileResourceClient. The resourcer package will then be used to “resolve” this resource and load it into the R server, as we will see later.
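
To give an idea of what “resolving” means, below is a minimal sketch, in plain R, of what happens on the server side; it assumes the resourcer package is available locally and, as in the DataSHIELD calls shown later, that as.resource.data.frame() accepts a ResourceClient object.

# Illustrative sketch (run locally, not on the Opal server)
library(resourcer)

# Describe the resource: where the file lives and how it should be read
res <- newResource(name = 'pheno_TCGA',
                   url = 'http://duffel.rail.bio/recount/TCGA/TCGA.tsv',
                   format = 'tsv')

# "Resolve" the resource: a TidyFileResourceClient knows how to fetch and parse a tsv
client <- newResourceClient(res)
class(client)

# Coerce the client to a data frame, as done later with the 'pheno' object
pheno <- as.resource.data.frame(client)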

We log out of the connection

opal.logout(o)

Then, we can analyze the data using DataSHIELD by making use of the created resource. We start by building the login data with a user who has DataSHIELD permissions on our Opal server (dsuser).

builder <- newDSLoginBuilder()
builder$append(server = 'study1', url = 'https://opal-demo.obiba.org', 
               user = 'dsuser', password = 'password', 
               resource = 'RSRC.pheno_TCGA')
logindata <- builder$build()

Then, we log in and assign the resource to an R object called res

conns <- datashield.login(logins = logindata, 
                          assign = TRUE, 
                          symbol = 'res')

The resourcer package, installed on the server side, contains functions that facilitate access to these data in the R server. In particular, we will access the resource as a data.frame called pheno. To this end, the as.resource.data.frame() function is used to coerce the resource (i.e. the ResourceClient object) to a data frame.

datashield.assign.expr(conns, symbol = 'pheno', 
                       expr = quote(as.resource.data.frame(res)))
ds.class('pheno')
$study1
[1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame" 
ds.dim('pheno')
$`dimensions of pheno in study1`
[1] 11287   859

$`dimensions of pheno in combined studies`
[1] 11287   859
datashield.logout(conns)

3.2 Resource as a Rdata file (data frame)

Now, let us assume that our resource is a data frame saved in a .Rdata file available at: https://github.com/isglobal-brge/brgedata/raw/master/data/asthma.rda. We can do similar steps to load the resource into the Opal server:

o <- opal.login('administrator','password', 
                url='https://opal-demo.obiba.org')

opal.resource_create(opal = o, 
                     project = 'RSRC', 
                     name = 'asthma', 
                     url = 'https://github.com/isglobal-brge/brgedata/raw/master/data/asthma.rda', 
                     format = 'data.frame')
     
opal.assign.resource(o, 'client', 'RSRC.asthma')
opal.execute(o, 'class(client)')
[1] "RDataFileResourceClient" "FileResourceClient"      "ResourceClient"         
[4] "R6"                     
opal.logout(o)

In this case the class of the client object is RDataFileResourceClient, whose resolver is also implemented in the resourcer package.

Then, we can analyze these data using DataSHIELD functions as usual. In this case, instead of using as.resource.data.frame() (which could also be used here), we make use of the function as.resource.object(), which coerces the resource to an internal data object whose class depends on the implementation of the resolver.

builder <- newDSLoginBuilder()
builder$append(server = 'study1', url = 'https://opal-demo.obiba.org', 
               user = 'dsuser', password = 'password', 
               resource = 'RSRC.asthma')
logindata <- builder$build()

conns <- datashield.login(logins = logindata, 
                          assign = TRUE, 
                          symbol = 'res')

datashield.assign.expr(conns, symbol = 'asthma', 
                       expr = quote(as.resource.object(res)))

Now, we can perform some standard statistical analysis in DataSHIELD as usual:

ds.class('asthma')
$study1
[1] "data.frame"
ds.colnames('asthma')
$study1
 [1] "country"     "gender"      "age"         "bmi"         "smoke"       "casecontrol"
 [7] "rs4490198"   "rs4849332"   "rs1367179"   "rs11123242"  "rs13014858"  "rs1430094"  
[13] "rs1430093"   "rs746710"    "rs1430090"   "rs6737251"   "rs11685217"  "rs1430097"  
[19] "rs10496465"  "rs3756688"   "rs2303063"   "rs1422993"   "rs2400478"   "rs714588"   
[25] "rs1023555"   "rs898070"    "rs963218"    "rs1419835"   "rs765023"    "rs1345267"  
[31] "rs324381"    "hopo546333"  "rs184448"    "rs324396"    "rs324957"    "rs324960"   
[37] "rs10486657"  "rs324981"    "rs1419780"   "rs325462"    "rs727162"    "rs10250709" 
[43] "rs6958905"   "rs10238983"  "rs4941643"   "rs3794381"   "rs2031532"   "rs2247119"  
[49] "rs8000149"   "rs2274276"   "rs7332573"   "rs3829366"   "rs6084432"   "rs512625"   
[55] "rs3918395"   "rs2787095"   "rs2853215"  
ds.glm(casecontrol ~ rs1422993 + smoke + bmi, data='asthma', family='binomial')
$Nvalid
[1] 1559

$Nmissing
[1] 19

$Ntotal
[1] 1578

$disclosure.risk
       RISK OF DISCLOSURE
study1                  0

$errorMessage
       ERROR MESSAGES
study1 "No errors"   

$nsubs
[1] 1559

$iter
[1] 5

$family

Family: binomial 
Link function: logit 


$formula
[1] "casecontrol ~ rs1422993 + smoke + bmi"

$coefficients
               Estimate Std. Error    z-value      p-value low0.95CI.LP high0.95CI.LP       P_OR
(Intercept) -2.45698463 0.36442291 -6.7421245 1.560873e-11  -3.17124042   -1.74272885 0.07892927
rs1422993GT  0.35228783 0.13087079  2.6918752 7.105152e-03   0.09578580    0.60878987 1.42231786
rs1422993TT  0.14131809 0.25709215  0.5496787 5.825397e-01  -0.36257326    0.64520944 1.15179096
smoke       -0.36267146 0.14370585 -2.5237069 1.161247e-02  -0.64432976   -0.08101316 0.69581500
bmi          0.04234646 0.01349121  3.1388188 1.696303e-03   0.01590418    0.06878875 1.04325586
            low0.95CI.P_OR high0.95CI.P_OR
(Intercept)     0.04026246       0.1489667
rs1422993GT     1.10052331       1.8382056
rs1422993TT     0.69588333       1.9063863
smoke           0.52501431       0.9221816
bmi             1.01603132       1.0712099

$dev
[1] 1579.802

$df
[1] 1554

$output.information
[1] "SEE TOP OF OUTPUT FOR INFORMATION ON MISSING DATA AND ERROR MESSAGES"

Do not forget to close the connection after finishing the analyses

datashield.logout(conns)

3.3 Resource as an Rdata file (ExpressionSet)

One of the main advantages of using a resource that is an R file is that it may contain any type of R object. Let us illustrate this with an R file containing an ExpressionSet, a Bioconductor class that encapsulates omic data along with phenotypic information and annotation. See here for a description of this type of object, or watch this video.

The next figure provides a visual idea of how these objects are organized (for epidemiologists)

Figure 3: Data frame vs ExpressionSet organization

This is how an ExpressionSet can be summarized

Figure 4: ExpressionSet infrastructure
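
For readers who have never used this class, the following toy example (a sketch only, assuming the Biobase package is installed) builds a tiny ExpressionSet from scratch and shows its main accessors.

# Illustrative only: a toy ExpressionSet built from random data
library(Biobase)

expr <- matrix(rnorm(20), nrow = 4, ncol = 5,
               dimnames = list(paste0('gene', 1:4), paste0('sample', 1:5)))
pheno <- data.frame(sex = c('M', 'F', 'F', 'M', 'F'),
                    row.names = colnames(expr))

eset <- ExpressionSet(assayData = expr,
                      phenoData = AnnotatedDataFrame(pheno))

dim(eset)    # features x samples
exprs(eset)  # the expression matrix
pData(eset)  # the phenotypic metadata (one row per sample)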

We have collected gene expression data for the 200 most variable genes measured in 194 samples from the Asthma inner city study, available in GEO under accession number GSE40732. This is an ExpressionSet that is available at:

Let us illustrate how to create the resource in the Opal server that gives access to this dataset

o <- opal.login('administrator','password', 
                url='https://opal-demo.obiba.org')

opal.resource_create(opal = o, 
                     project = 'RSRC', 
                     name = 'genexpr', 
                     url = 'https://github.com/isglobal-brge/brgedata/raw/master/data/GSE40732.Rdata', 
                     format = 'ExpressionSet')
opal.logout(o)

Now, we are ready to perform any data analysis using DataSHIELD as follows. NOTE that in this case the R server must have the specific packages needed to handle our class of objects installed; ExpressionSets are managed with the Biobase Bioconductor package. If so, the function as.resource.object() coerces the resource to an ExpressionSet that, in this case, will be available in the R server as an object called eSet.

builder <- newDSLoginBuilder()
builder$append(server = 'study1', url = 'https://opal-demo.obiba.org', 
               user = 'dsuser', password = 'password', 
               resource = 'RSRC.genexpr')
logindata <- builder$build()

conns <- datashield.login(logins = logindata, 
                          assign = TRUE, 
                          symbol = 'res')

datashield.assign.expr(conns, symbol = 'eSet', 
                       expr = quote(as.resource.object(res)))

Then, we can use DataSHIELD functions as usual

ds.class('eSet')
$study1
[1] "ExpressionSet"
attr(,"package")
[1] "Biobase"

In this case, if we execute

ds.dim('eSet')

we get an error. However, we have developed a pair of DataSHIELD packages, called dsOmics and dsOmicsClient, that allow us to deal with this type of object in DataSHIELD and that are able to retrieve the desired information (i.e. the dimensions of the ExpressionSet).

dsOmicsClient::ds.nFeatures('eSet')
$study1
Features 
     200 

attr(,"class")
[1] "dsnFeatures" "list"       
dsOmicsClient::ds.nSamples('eSet')
$study1
Samples 
    194 

attr(,"class")
[1] "dsnSamples" "list"      

We finish the example by logging out

datashield.logout(conns)

4 Exercise 1

We have collected gene expression data for the 1000 most variable genes measured in skin biopsies from 10 patients with active atopic eczema and 10 healthy controls (GEO accession number GSE6012). The ExpressionSet is available at:

  1. Create a resource in the project workshop available at the demo Opal site (https://opal-demo.obiba.org/)
    • user: administrator
    • password: password

IMPORTANT NOTE: In order to avoid problems with duplicated names (all of you are using the same server) give the resource a name like eSet_XXXXXX, replacing XXXXXX by random lower- and upper-case letters (for example: eSet_DrMMuZ).

  2. Log in to DataSHIELD and connect to the resource using these credentials:

    • user: dsuser
    • password: password
  3. Assign the resource to an R object having the same name as the resource (in my case eSet_DrMMuZ).

  4. Load the dsOmicsClient package and use the ds.nFeatures() and ds.nSamples() functions to verify that the loaded object has the proper dimensions.

  5. Investigate the dsOmicsClient package (type ls("package:dsOmicsClient") in the R console) to find a function that returns the names of the covariates in the sample metadata.

A minimal sketch of these steps is shown below.
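
The sketch that follows is not a definitive solution: the resource name and the URL of the GSE6012 ExpressionSet file are placeholders that you must replace with your own values.

# Sketch only: replace 'eSet_XXXXXX' and the URL placeholder with your own values
library(DSOpal)
library(dsOmicsClient)

# 1. Create the resource (requires administrator rights)
o <- opal.login(username = 'administrator', password = 'password',
                url = 'https://opal-demo.obiba.org')
opal.resource_create(opal = o, project = 'workshop',
                     name = 'eSet_XXXXXX',
                     url = '<URL of the GSE6012 ExpressionSet file>',
                     format = 'ExpressionSet')
opal.logout(o)

# 2-3. Log in with DataSHIELD permissions and assign the resource
builder <- newDSLoginBuilder()
builder$append(server = 'study1', url = 'https://opal-demo.obiba.org',
               user = 'dsuser', password = 'password',
               resource = 'workshop.eSet_XXXXXX')
conns <- datashield.login(logins = builder$build(), assign = TRUE, symbol = 'res')
datashield.assign.expr(conns, symbol = 'eSet_XXXXXX',
                       expr = quote(as.resource.object(res)))

# 4. Verify the dimensions of the loaded ExpressionSet
ds.nFeatures('eSet_XXXXXX')
ds.nSamples('eSet_XXXXXX')

# 5. Explore the package to find a function returning the sample covariate names
ls("package:dsOmicsClient")

datashield.logout(conns)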

5 Using the resources in DataSHIELD

5.1 Data analysis combining different types of resources

Here, we will use data from three studies that are available in our Opal demo repository. The three databases are called CNSIM1, CNSIM2 and CNSIM3 and are available as three different resources: a MySQL database, an SPSS file and a CSV file (see Figure 2). This example mimics real situations where different hospitals or research centers manage their own databases containing harmonized data. The data correspond to three simulated datasets with different numbers of observations of 11 harmonized variables. They contain synthetic data based on a model derived from the participants of the 1958 Birth Cohort, as part of an obesity methodological development project. The available variables are:

  • LAB_TSC: Total Serum Cholesterol (numeric, mmol/L)
  • LAB_TRIG: Triglycerides (numeric, mmol/L)
  • LAB_HDL: HDL Cholesterol (numeric, mmol/L)
  • LAB_GLUC_ADJUSTED: Non-Fasting Glucose (numeric, mmol/L)
  • PM_BMI_CONTINUOUS: Body Mass Index, continuous (numeric, kg/m2)
  • DIS_CVA: History of Stroke (factor; 0 = Never had stroke, 1 = Has had stroke)
  • MEDI_LPD: Current Use of Lipid Lowering Medication, from categorical assessment item (factor; 0 = Not currently using lipid lowering medication, 1 = Currently using lipid lowering medication)
  • DIS_DIAB: History of Diabetes (factor; 0 = Never had diabetes, 1 = Has had diabetes)
  • DIS_AMI: History of Myocardial Infarction (factor; 0 = Never had myocardial infarction, 1 = Has had myocardial infarction)
  • GENDER: Gender (factor; 0 = Female, 1 = Male)
  • PM_BMI_CATEGORICAL: Body Mass Index, categorical (factor; 1 = Less than 25 kg/m2, 2 = 25 to 30 kg/m2, 3 = Over 30 kg/m2)

The aggregated analysis can be performed as follows. We start by preparing the login data and the resources to assign

builder <- DSI::newDSLoginBuilder()
builder$append(server = "study1", url = "https://opal-demo.obiba.org", 
               user = "dsuser", password = "password", 
               resource = "RSRC.CNSIM1")
builder$append(server = "study2", url = "https://opal-demo.obiba.org", 
               user = "dsuser", password = "password", 
               resource = "RSRC.CNSIM2")
builder$append(server = "study3", url = "https://opal-demo.obiba.org", 
               user = "dsuser", password = "password", 
               resource = "RSRC.CNSIM3")
logindata <- builder$build()

Then, we login and assign resources

conns <- datashield.login(logins = logindata, assign = TRUE, symbol = "res")

The assigned objects are of class ResourceClient (and others)

ds.class("res")
$study1
[1] "SQLResourceClient" "ResourceClient"    "R6"               

$study2
[1] "TidyFileResourceClient" "FileResourceClient"     "ResourceClient"        
[4] "R6"                    

$study3
[1] "TidyFileResourceClient" "FileResourceClient"     "ResourceClient"        
[4] "R6"                    

We then coerce the ResourceClient objects to data frames

datashield.assign.expr(conns, symbol = "D", 
                       expr = quote(as.resource.data.frame(res, strict = TRUE)))
ds.class("D")
$study1
[1] "data.frame"

$study2
[1] "data.frame"

$study3
[1] "data.frame"

Now, we are ready to run the usual DataSHIELD analyses

ds.summary('D$LAB_HDL')
$study1
$study1$class
[1] "numeric"

$study1$length
[1] 2163

$study1$`quantiles & mean`
      5%      10%      25%      50%      75%      90%      95%     Mean 
0.875240 1.047400 1.300000 1.581000 1.844500 2.090000 2.210900 1.569416 


$study2
$study2$class
[1] "numeric"

$study2$length
[1] 3088

$study2$`quantiles & mean`
      5%      10%      25%      50%      75%      90%      95%     Mean 
0.850280 1.032200 1.294000 1.563000 1.840000 2.077000 2.225000 1.556648 


$study3
$study3$class
[1] "numeric"

$study3$length
[1] 4128

$study3$`quantiles & mean`
      5%      10%      25%      50%      75%      90%      95%     Mean 
0.876760 1.039200 1.304000 1.589000 1.856000 2.098800 2.244200 1.574687 

NOTE: vector types are not necessarily the same across studies; they depend on the data reader that was used

ds.class('D$GENDER')
$study1
[1] "integer"

$study2
[1] "haven_labelled" "vctrs_vctr"     "double"        

$study3
[1] "numeric"
ds.asFactor('D$GENDER', 'GENDER')
$all.unique.levels
[1] "0" "1"

$return.message
[1] "Data object <GENDER> correctly created in all specified data sources"
ds.summary('GENDER')
$study1
$study1$class
[1] "factor"

$study1$length
[1] 2163

$study1$categories
[1] "0" "1"

$study1$`count of '0'`
[1] 1092

$study1$`count of '1'`
[1] 1071


$study2
$study2$class
[1] "factor"

$study2$length
[1] 3088

$study2$categories
[1] "0" "1"

$study2$`count of '0'`
[1] 1585

$study2$`count of '1'`
[1] 1503


$study3
$study3$class
[1] "factor"

$study3$length
[1] 4128

$study3$categories
[1] "0" "1"

$study3$`count of '0'`
[1] 2091

$study3$`count of '1'`
[1] 2037

A logistic regression model can be fitted using privacy-protected analyses as if the three datasets were located on a single computer

mod <- ds.glm("DIS_DIAB ~ LAB_TRIG + GENDER", data = "D" , family="binomial")
mod$coeff
              Estimate Std. Error    z-value       p-value low0.95CI.LP high0.95CI.LP       P_OR
(Intercept) -4.7792110 0.21081170 -22.670521 8.755236e-114   -5.1923944   -4.36602770 0.00833261
LAB_TRIG     0.3035931 0.05487436   5.532514  3.156737e-08    0.1960414    0.41114488 1.35471774
GENDER      -0.4455989 0.20797931  -2.142516  3.215202e-02   -0.8532309   -0.03796695 0.64044060
            low0.95CI.P_OR high0.95CI.P_OR
(Intercept)    0.005527953      0.01254229
LAB_TRIG       1.216577226      1.50854390
GENDER         0.426036242      0.96274475

Finally, we log out of the connection

datashield.logout(conns)

5.2 Data analysis using a remote computation server

Computation resources are resources on which tasks/commands can be triggered and from which the resulting data can be retrieved. We can define an SSH resource, which is accessible through a secure shell. The path part of the URL is the remote working directory, and the available commands are defined by the exec query parameter, as described in the resources demo tutorial.
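
As a concrete example, the URL of the brge_plink resource used later in this section can be read as follows:

# Anatomy of an SSH computation resource URL
#   ssh://plink-demo.obiba.org:2222/home/master/brge?exec=ls,plink1,plink
#   - host and port ........ plink-demo.obiba.org:2222
#   - path ................. /home/master/brge (the remote working directory)
#   - exec query parameter . ls,plink1,plink (the shell commands allowed to run)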

PLINK is a program for performing whole genome association analyses (GWAS) in a computationally efficient manner. So, it may make sense to use this program to run the analyses instead of using R.

PLINK data analysis requires genotype data in three files (.bed, .bim, .fam) and phenotype data in a separate file. Therefore, the computation resource must point to the data and declare the commands that are allowed to be run on the server. This can be set up manually in the Opal server as indicated in the next figure

Figure 5: Computation resource corresponding to PLINK example

Notice that this resource is hosted at plink-demo.obiba.org in the folder /home/master/brge. This folder contains brge.bed, brge.bim and brge.fam, which are the three PLINK files, plus the brge.phe file that encodes the phenotypic variables. The server also has PLINK installed.

This computation resource is already available at our demo Opal server (brge_plink), as you can see in Figure 2

The computation resource is a server-side feature, so we cannot see its functionality using DataSHIELD. However, in order to illustrate how it works, let us create the resource in plain R

library(resourcer)
ssh.res <- newResource(
  url = "ssh://plink-demo.obiba.org:2222/home/master/brge?exec=ls,plink1,plink",
  identity = "master",
  secret = "master"
)

The resource connection client is resolved as follows:

ssh.client <- newResourceClient(ssh.res)
class(ssh.client)
[1] "SshResourceClient"     "CommandResourceClient" "ResourceClient"        "R6"                   

This type of client allows us to issue shell commands through an SSH connection. In our case we can use:

ssh.client$getAllowedCommands()
[1] "ls"     "plink1" "plink" 

We can also see that the computation resource not only has the PLINK program installed, but also contains our genomic and phenotypic data

ssh.client$exec("ls")
$status
[1] 0

$output
[1] "brge.bed" "brge.bim" "brge.fam" "brge.gds" "brge.phe" "brge.txt"

$error
character(0)

$command
[1] "cd /home/master/brge && ls"

attr(,"class")
[1] "resource.exec"

Now, we are ready to run any PLINK command from R. Notice that in this case we want to assess the association between the genotype data in bed format and the phenotype variable ‘obese’, which is in the file ‘brge.phe’. The corresponding PLINK command would be (NOTE: we omit --out to indicate the output file since the results will be available in R as a tibble).

  plink1 --bfile brge --assoc --pheno brge.phe --pheno-name obese --noweb

This could be done simply by executing

ans <- ssh.client$exec('plink1', c('--bfile', 'brge', '--assoc', '--pheno', 'brge.phe', '--pheno-name', 'obese', '--noweb'))
ans
$status
[1] 0

$output
 [1] ""                                                                   
 [2] "@----------------------------------------------------------@"       
 [3] "|        PLINK!       |     v1.07      |   10/Aug/2009     |"       
 [4] "|----------------------------------------------------------|"       
 [5] "|  (C) 2009 Shaun Purcell, GNU General Public License, v2  |"       
 [6] "|----------------------------------------------------------|"       
 [7] "|  For documentation, citation & bug-report instructions:  |"       
 [8] "|        http://pngu.mgh.harvard.edu/purcell/plink/        |"       
 [9] "@----------------------------------------------------------@"       
[10] ""                                                                   
[11] "Skipping web check... [ --noweb ] "                                 
[12] "Writing this text to log file [ plink.log ]"                        
[13] "Analysis started: Sat Jan 23 19:50:52 2021"                         
[14] ""                                                                   
[15] "Options in effect:"                                                 
[16] "\t--bfile brge"                                                      
[17] "\t--assoc"                                                           
[18] "\t--pheno brge.phe"                                                  
[19] "\t--pheno-name obese"                                                
[20] "\t--noweb"                                                           
[21] ""                                                                   
[22] "Reading map (extended format) from [ brge.bim ] "                   
[23] "100000 markers to be included from [ brge.bim ]"                    
[24] "Reading pedigree information from [ brge.fam ] "                    
[25] "2312 individuals read from [ brge.fam ] "                           
[26] "2312 individuals with nonmissing phenotypes"                        
[27] "Assuming a disease phenotype (1=unaff, 2=aff, 0=miss)"              
[28] "Missing phenotype value is also -9"                                 
[29] "725 cases, 1587 controls and 0 missing"                             
[30] "1097 males, 1215 females, and 0 of unspecified sex"                 
[31] "Reading genotype bitfile from [ brge.bed ] "                        
[32] "Detected that binary PED file is v1.00 SNP-major mode"              
[33] "Reading alternate phenotype from [ brge.phe ] "                     
[34] "2204 individuals with non-missing alternate phenotype"              
[35] "Assuming a disease phenotype (1=unaff, 2=aff, 0=miss)"              
[36] "Missing phenotype value is also -9"                                 
[37] "402 cases, 1802 controls and 108 missing"                           
[38] "Before frequency and genotyping pruning, there are 100000 SNPs"     
[39] "2312 founders and 0 non-founders found"                             
[40] "6009 heterozygous haploid genotypes; set to missing"                
[41] "Writing list of heterozygous haploid genotypes to [ plink.hh ]"     
[42] "7 SNPs with no founder genotypes observed"                          
[43] "Warning, MAF set to 0 for these SNPs (see --nonfounders)"           
[44] "Writing list of these SNPs to [ plink.nof ]"                        
[45] "Total genotyping rate in remaining individuals is 0.994408"         
[46] "0 SNPs failed missingness test ( GENO > 1 )"                        
[47] "0 SNPs failed frequency test ( MAF < 0 )"                           
[48] "After frequency and genotyping pruning, there are 100000 SNPs"      
[49] "After filtering, 402 cases, 1802 controls and 108 missing"          
[50] "After filtering, 1097 males, 1215 females, and 0 of unspecified sex"
[51] "Writing main association results to [ plink.assoc ] "               
[52] ""                                                                   
[53] "Analysis finished: Sat Jan 23 19:50:58 2021"                        
[54] ""                                                                   

$error
character(0)

$command
[1] "cd /home/master/brge && plink1 --bfile brge --assoc --pheno brge.phe --pheno-name obese --noweb"

attr(,"class")
[1] "resource.exec"

This is the output provided by PLINK, and new output files have been created in the computation resource. These files can be downloaded to the R server by using the function

ssh.client$downloadFile()
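
For instance, the main association results written by PLINK (the plink.assoc file mentioned in the log above) could presumably be retrieved as sketched below; the exact arguments of downloadFile() are an assumption here, so check the resourcer documentation before relying on it.

# Assumed usage: download the remote plink.assoc file to the local working directory
ssh.client$downloadFile("plink.assoc")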

We have implemented all these procedures in the dsOmics package, whose DataSHIELD client package is dsOmicsClient. Therefore, once the computation resource is created in the Opal server, we can perform such analyses using DataSHIELD by simply

  builder <- newDSLoginBuilder()
  builder$append(server = 'study1', url = 'https://opal-demo.obiba.org',
                 user = 'dsuser', password = 'password',
                 resource = 'RSRC.brge_plink')
  logindata <- builder$build()

Then we assign the resource to a symbol (i.e. R object) called client, which is an SSH resource

  conns <- datashield.login(logins = logindata, assign = TRUE,
                            symbol = "client")
  ds.class("client")
$study1
[1] "SshResourceClient"     "CommandResourceClient" "ResourceClient"        "R6"                   

and then call the ds.PLINK() function. The arguments must be encapsulated in a single character string without the command ‘plink1’. The ‘--noweb’ option is not necessary either.

plink.arguments <- "--bfile brge --assoc --pheno brge.phe --pheno-name obese"

and then, the analyses are performed by

library(dsOmicsClient)
ans.plink <- ds.PLINK("client", plink.arguments)

The object ans.plink contains the PLINK results from each server as well as the output provided by PLINK

lapply(ans.plink, names)
$study1
[1] "results"   "plink.out"
head(ans.plink$study1$results)
# A tibble: 6 x 10
    CHR SNP           BP A1        F_A     F_U A2      CHISQ      P    OR
  <dbl> <chr>      <dbl> <chr>   <dbl>   <dbl> <chr>   <dbl>  <dbl> <dbl>
1     0 MitoC3993T  3993 T     0.0150  0.0146  C     0.00836 0.927  1.03 
2     0 MitoG4821A  4821 A     0       0.00169 G     1.34    0.248  0    
3     0 MitoG6027A  6027 A     0.00249 0.00501 G     0.922   0.337  0.495
4     0 MitoT6153C  6153 C     0.00873 0.0117  T     0.516   0.472  0.746
5     0 MitoC7275T  7275 T     0       0.00111 C     0.894   0.344  0    
6     0 MitoT9699C  9699 C     0.0908  0.0669  T     5.67    0.0172 1.39 
ans.plink$study$plink.out
$status
[1] 0

$output
 [1] ""                                                                          
 [2] "@----------------------------------------------------------@"              
 [3] "|        PLINK!       |     v1.07      |   10/Aug/2009     |"              
 [4] "|----------------------------------------------------------|"              
 [5] "|  (C) 2009 Shaun Purcell, GNU General Public License, v2  |"              
 [6] "|----------------------------------------------------------|"              
 [7] "|  For documentation, citation & bug-report instructions:  |"              
 [8] "|        http://pngu.mgh.harvard.edu/purcell/plink/        |"              
 [9] "@----------------------------------------------------------@"              
[10] ""                                                                          
[11] "Skipping web check... [ --noweb ] "                                        
[12] "Writing this text to log file [ /tmp/ssh-1148/out.log ]"                   
[13] "Analysis started: Sat Jan 23 19:51:01 2021"                                
[14] ""                                                                          
[15] "Options in effect:"                                                        
[16] "\t--bfile brge"                                                             
[17] "\t--assoc"                                                                  
[18] "\t--pheno brge.phe"                                                         
[19] "\t--pheno-name obese"                                                       
[20] "\t--noweb"                                                                  
[21] "\t--out /tmp/ssh-1148/out"                                                  
[22] ""                                                                          
[23] "Reading map (extended format) from [ brge.bim ] "                          
[24] "100000 markers to be included from [ brge.bim ]"                           
[25] "Reading pedigree information from [ brge.fam ] "                           
[26] "2312 individuals read from [ brge.fam ] "                                  
[27] "2312 individuals with nonmissing phenotypes"                               
[28] "Assuming a disease phenotype (1=unaff, 2=aff, 0=miss)"                     
[29] "Missing phenotype value is also -9"                                        
[30] "725 cases, 1587 controls and 0 missing"                                    
[31] "1097 males, 1215 females, and 0 of unspecified sex"                        
[32] "Reading genotype bitfile from [ brge.bed ] "                               
[33] "Detected that binary PED file is v1.00 SNP-major mode"                     
[34] "Reading alternate phenotype from [ brge.phe ] "                            
[35] "2204 individuals with non-missing alternate phenotype"                     
[36] "Assuming a disease phenotype (1=unaff, 2=aff, 0=miss)"                     
[37] "Missing phenotype value is also -9"                                        
[38] "402 cases, 1802 controls and 108 missing"                                  
[39] "Before frequency and genotyping pruning, there are 100000 SNPs"            
[40] "2312 founders and 0 non-founders found"                                    
[41] "6009 heterozygous haploid genotypes; set to missing"                       
[42] "Writing list of heterozygous haploid genotypes to [ /tmp/ssh-1148/out.hh ]"
[43] "7 SNPs with no founder genotypes observed"                                 
[44] "Warning, MAF set to 0 for these SNPs (see --nonfounders)"                  
[45] "Writing list of these SNPs to [ /tmp/ssh-1148/out.nof ]"                   
[46] "Total genotyping rate in remaining individuals is 0.994408"                
[47] "0 SNPs failed missingness test ( GENO > 1 )"                               
[48] "0 SNPs failed frequency test ( MAF < 0 )"                                  
[49] "After frequency and genotyping pruning, there are 100000 SNPs"             
[50] "After filtering, 402 cases, 1802 controls and 108 missing"                 
[51] "After filtering, 1097 males, 1215 females, and 0 of unspecified sex"       
[52] "Writing main association results to [ /tmp/ssh-1148/out.assoc ] "          
[53] ""                                                                          
[54] "Analysis finished: Sat Jan 23 19:51:07 2021"                               
[55] ""                                                                          

$error
character(0)

$command
[1] "cd /home/master/brge && plink1 --bfile brge --assoc --pheno brge.phe --pheno-name obese --noweb --out /tmp/ssh-1148/out"

attr(,"class")
[1] "resource.exec"

Let us finish our tutorial by removing the resources we created. Note that we need to log in again with administrator rights, since the previous Opal connection was closed.

o <- opal.login('administrator','password', 
                url='https://opal-demo.obiba.org')
opal.resource_delete(opal=o, project='RSRC', resource='pheno_TCGA')
opal.resource_delete(opal=o, project='RSRC', resource='asthma')
opal.resource_delete(opal=o, project='RSRC', resource='genexpr')
opal.logout(o)

6 Exercise 2

Physiological ecologists often analyze the responses of physiological or biochemical traits to environmental factors such as temperature, irradiance, water potential, or the concentrations of CO2, O2, and inorganic nutrients. Some researchers were interested in the relationship between CO2 uptake (variable uptake) and ambient CO2 concentration (variable conc) for six Echinochloa crus-galli plants from Quebec and six plants from Mississippi.

The CO2 uptake of six plants from Quebec and six plants from Mississippi was measured at several levels of ambient CO2 concentration. Half the plants of each type were chilled overnight before the experiment was conducted.

Data cannot be shared between the two locations due to confidentiality restrictions, and the researchers are interested in pooling both datasets via a virtual pooled analysis. The scheme of the project can be seen in the next figure

Figure 6: Carbon Dioxide Uptake in Quebec and Mississippi Grass Plants
Scheme of the proposed data analysis using DataSHIELD. Data of two locations are stored in different repositories and are available as .tsv files.

Both datasets are .tsv files that are available in two different locations:

Do the following steps to perform the required analysis:

  1. Upload the two resources to the project workshop available at the demo Opal site (https://opal-demo.obiba.org/, u: administrator, p: password). IMPORTANT NOTE: In order to avoid problems with duplicated names (all of you are using the same server) give the two resources names like boston_XXXXXX and mississippi_XXXXXX, replacing XXXXXX by random lower- and upper-case letters (for example: boston_QrTyUZ and mississippi_AgbnYJ).
  2. Log in to DataSHIELD and connect to the resources using these credentials:
    • user: dsuser
    • password: password
  3. Assign the resources to objects having the same names as the resources (in our case boston_QrTyUZ and mississippi_AgbnYJ).
  4. Using dsBaseClient functions (for those new to DataSHIELD, see ls("package:dsBaseClient")):
  • Check the names of the variables of each data frame.
  • Check that the number of observations in each study (variable Type) is 42.
  • Create a “combined” scatter plot of the CO2 uptake (Y-axis) vs CO2 concentration (variable conc, X-axis).
  • Run a linear model to investigate whether CO2 concentration predicts CO2 uptake, adjusting for the overnight-chilling factor (Treatment).

A minimal sketch of these steps is shown below.
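
The sketch below is not a definitive solution and makes two assumptions: the resource names and the two .tsv URLs are placeholders to be replaced by your own values, and ds.scatterPlot() and ds.glm() are used for the last two steps (other dsBaseClient functions could serve as well).

# Sketch only: replace the resource names and URL placeholders with your own values
library(DSOpal)
library(dsBaseClient)

# 1. Create the two resources (requires administrator rights)
o <- opal.login('administrator', 'password', url = 'https://opal-demo.obiba.org')
opal.resource_create(opal = o, project = 'workshop', name = 'boston_XXXXXX',
                     url = '<URL of the first .tsv file>', format = 'tsv')
opal.resource_create(opal = o, project = 'workshop', name = 'mississippi_XXXXXX',
                     url = '<URL of the second .tsv file>', format = 'tsv')
opal.logout(o)

# 2-3. Log in with DataSHIELD permissions and assign each resource as a data frame
builder <- newDSLoginBuilder()
builder$append(server = 'study1', url = 'https://opal-demo.obiba.org',
               user = 'dsuser', password = 'password',
               resource = 'workshop.boston_XXXXXX')
builder$append(server = 'study2', url = 'https://opal-demo.obiba.org',
               user = 'dsuser', password = 'password',
               resource = 'workshop.mississippi_XXXXXX')
conns <- datashield.login(logins = builder$build(), assign = TRUE, symbol = 'res')
datashield.assign.expr(conns, symbol = 'D',
                       expr = quote(as.resource.data.frame(res)))

# 4. Usual dsBaseClient analyses
ds.colnames('D')
ds.dim('D')                                   # 42 observations per study
ds.scatterPlot(x = 'D$conc', y = 'D$uptake')  # combined scatter plot
ds.glm('uptake ~ conc + Treatment', data = 'D', family = 'gaussian')

datashield.logout(conns)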

7 Session Info

sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale:
[1] LC_COLLATE=Spanish_Spain.1252  LC_CTYPE=Spanish_Spain.1252    LC_MONETARY=Spanish_Spain.1252
[4] LC_NUMERIC=C                   LC_TIME=Spanish_Spain.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ssh_0.7.0           resourcer_1.1.0     dsOmicsClient_1.1.0 dsBaseClient_6.1.0 
 [5] DSOpal_1.1.0        DSI_1.1.0           R6_2.4.1            progress_1.2.2     
 [9] opalr_1.5.1         httr_1.4.2          knitr_1.29          BiocStyle_2.16.0   

loaded via a namespace (and not attached):
 [1] Biobase_2.48.0      jsonlite_1.7.0      splines_4.0.2       tmvnsim_1.0-2      
 [5] assertthat_0.2.1    sn_1.6-2            Rdpack_1.0.0        askpass_1.1        
 [9] BiocManager_1.30.10 highr_0.8           stats4_4.0.2        TFisher_0.2.0      
[13] yaml_2.2.1          ggrepel_0.8.2       numDeriv_2016.8-1.1 pillar_1.4.6       
[17] lattice_0.20-41     glue_1.4.2          digest_0.6.25       RColorBrewer_1.1-2 
[21] colorspace_1.4-1    sandwich_2.5-1      htmltools_0.5.0     Matrix_1.2-18      
[25] pkgconfig_2.0.3     bibtex_0.4.2.2      bookdown_0.20       purrr_0.3.4        
[29] mvtnorm_1.1-1       scales_1.1.1        openssl_1.4.2       tibble_3.0.3       
[33] generics_0.0.2      ggplot2_3.3.2       ellipsis_0.3.1      TH.data_1.0-10     
[37] credentials_1.3.0   BiocGenerics_0.34.0 cli_2.0.2           mnormt_2.0.2       
[41] survival_3.2-3      magrittr_1.5        crayon_1.3.4        mime_0.9           
[45] evaluate_0.14       fansi_0.4.1         MASS_7.3-52         tools_4.0.2        
[49] prettyunits_1.1.1   hms_0.5.3           gbRd_0.4-11         lifecycle_0.2.0    
[53] multcomp_1.4-13     mutoss_0.1-12       stringr_1.4.0       munsell_0.5.0      
[57] plotrix_3.7-8       compiler_4.0.2      rlang_0.4.7         grid_4.0.2         
[61] rmarkdown_2.3       gtable_0.3.0        codetools_0.2-16    multtest_2.44.0    
[65] curl_4.3            zoo_1.8-8           dplyr_1.0.2         utf8_1.1.4         
[69] mathjaxr_1.0-1      readr_1.3.1         metap_1.4           stringi_1.4.6      
[73] parallel_4.0.2      Rcpp_1.0.5          vctrs_0.3.3         tidyselect_1.1.0   
[77] xfun_0.16