This tutorial aims to provide examples about how to deal with resources in Opal and DataSHIELD. More detailed information can be found in our bookdown Orchestrating privacy-protected non-disclosive big data analyses of data from different resources with R and DataSHIELD.
A quick demo of the resources is also available in Tutorial: Resources in R.
This document can be reproduced by installing the following packages
install.packages("DSOpal")
devtools::install_github("datashield/dsBaseClient")
devtools::install_github("isglobal-brge/dsOmicsClient")
Then the packages are loaded as usual
library(DSOpal)
library(dsBaseClient)
library(dsOmicsClient)
We have set up an Opal demo site to illustrate how to perform some basic analyses using DataSHIELD as well as how to deal with different resources for ’omic data. The Opal server can be accessed with the credentials:
administratorpasswordIn this figure we can see all the projects available.
Figure 1: Opal demo site available projects
This tutorial will mainly make use of the resources available at RSRC project
Figure 2: Resources available at Opal demo site of RSRC project
In order to make the reader familiar with Opal we recommend visiting the Opal online documentation.
The resources can be uploaded in the Opal server manually as described here. However, it can also be done using R code as it is described in the following subsections.
Let us imagine that we have a tsv file that is stored in our hospital, server, cloud, GitHub repository or any other site. This file is containing information on several variables we want to analyze using DataSHIELD. Let us also imagine that this data is available at this URL: http://duffel.rail.bio/recount/TCGA/TCGA.tsv. This file encodes the phenotypes corresponding to 11,287 samples from TCGA that are available in the Recount project.
Thanks to the resources this dataset is not necessary to be uploaded into the Opal server as a table anymore. We can analyze this data with DataSHIELD packages by creating a new resource as following.
Let us start by login the Opal server. NOTE that this requires full permissions and, hence, we access with administrator rights
o <- opal.login(username = 'administrator',
password = 'password',
url = 'https://opal-demo.obiba.org')
TCGA dataset can be added as a new resource simply by typing:
opal.resource_create(opal = o,
project = 'RSRC',
name = 'pheno_TCGA',
url = 'http://duffel.rail.bio/recount/TCGA/TCGA.tsv',
format = 'tsv')
We can see that this resource have been added to our project by
opal.resources(o, project='RSRC')
name project
1 1000G_covars RSRC
2 1000G_vcf RSRC
3 CNSIM1 RSRC
4 CNSIM2 RSRC
5 CNSIM3 RSRC
6 EGA RSRC
7 GSE66351_1 RSRC
8 GSE66351_2 RSRC
9 GSE80970 RSRC
10 brge RSRC
11 brge_plink RSRC
12 brge_vcf RSRC
13 example.vcf RSRC
14 ga4gh_1000g RSRC
15 gps_data RSRC
16 gps_participant RSRC
17 pheno_TCGA RSRC
18 takeaway_gps RSRC
19 tcga_liver RSRC
url
1 https://raw.githubusercontent.com/isglobal-brge/brgeUtils/master/data/covars_1000G.txt
2 https://raw.githubusercontent.com/isglobal-brge/brgeUtils/master/data/1KG_phase3_subset_chr22.vcf.gz?method=biallelic.only&snpfirstdim=TRUE
3 mysql://mysqldata:3306/opal/CNSIM1
4 file:///srv/data/CNSIM2.zsav
5 opal+https://opal-demo.obiba.org/ws/files/projects/RSRC/CNSIM3.zip
6 https://ega.ebi.ac.uk:8052/elixir/tickets/tickets/EGAF00001753756?&referenceName=chr21&start=1&end=100
7 https://github.com/isglobal-brge/brgedata/raw/master/data/gse66351_1.rda
8 https://github.com/isglobal-brge/brgedata/raw/master/data/gse66351_2.rda
9 file:///srv/data/GSE80970.Rdata
10 https://raw.githubusercontent.com/isglobal-brge/brgedata/master/inst/extdata/brge.txt
11 ssh://plink-demo.obiba.org:2222/home/master/brge?exec=ls,plink1,plink
12 https://raw.githubusercontent.com/isglobal-brge/brgedata/master/inst/extdata/brge.vcf.gz?method=biallelic.only&snpfirstdim=TRUE
13 https://raw.githubusercontent.com/isglobal-brge/scoreInvHap/master/inst/extdata/example.vcf?method=biallelic.only&snpfirstdim=TRUE
14 https://htsget.ga4gh.org/variants/1000genomes.phase1.chr1?format=VCF&referenceName=1&start=1&end=100000
15 opal+https://opal-demo.obiba.org/ws/files/projects/RSRC/gps_data_final.RData
16 opal+https://opal-demo.obiba.org/ws/files/projects/RSRC/gps_participant.RData
17 http://duffel.rail.bio/recount/TCGA/TCGA.tsv
18 opal+https://opal-demo.obiba.org/ws/files/projects/RSRC/takeaway_data.RData
19 https://github.com/isglobal-brge/brgedata/raw/master/data/rse_gene_liver.Rdata
format created updated
1 csv 2021-01-23T05:47:34Z 2021-01-23T05:47:34Z
2 VCF2GDS 2021-01-23T05:47:34Z 2021-01-23T05:47:34Z
3 <NA> 2021-01-23T05:47:33Z 2021-01-23T05:47:33Z
4 sav 2021-01-23T05:47:33Z 2021-01-23T05:47:33Z
5 csv 2021-01-23T05:47:34Z 2021-01-23T05:47:34Z
6 EGAhtsgetBAM 2021-01-23T05:47:37Z 2021-01-23T05:47:37Z
7 ExpressionSet 2021-01-23T05:47:32Z 2021-01-23T05:47:32Z
8 ExpressionSet 2021-01-23T05:47:33Z 2021-01-23T05:47:33Z
9 ExpressionSet 2021-01-23T05:47:33Z 2021-01-23T05:47:33Z
10 tsv 2021-01-23T05:47:35Z 2021-01-23T05:47:35Z
11 <NA> 2021-01-23T05:47:36Z 2021-01-23T05:47:36Z
12 VCF2GDS 2021-01-23T05:47:35Z 2021-01-23T05:47:35Z
13 VCF2GDS 2021-01-23T05:47:34Z 2021-01-23T05:47:34Z
14 GA4GHVCF 2021-01-23T05:47:37Z 2021-01-23T05:47:37Z
15 SpatialLinesDataFrame 2021-01-23T05:47:36Z 2021-01-23T05:47:36Z
16 data.frame 2021-01-23T05:47:36Z 2021-01-23T05:47:36Z
17 tsv 2021-01-23T19:50:04Z 2021-01-23T19:50:04Z
18 SpatialPointsDataFrame 2021-01-23T05:47:36Z 2021-01-23T05:47:36Z
19 RangedSummarizedExperiment 2021-01-23T05:47:35Z 2021-01-23T05:47:35Z
We can test the resource assignment. First we assign the resource to an object called client
opal.assign.resource(o, 'client', 'RSRC.pheno_TCGA')
opal.execute(o, 'class(client)')
[1] "TidyFileResourceClient" "FileResourceClient" "ResourceClient"
[4] "R6"
We see that this object is of class TidyFileResourceClient. The resourcer package will be use then to “resolve” this resource and to load it into the R server as we will see later.
We logout the connection
opal.logout(o)
Then, we can analyze the data using DataSHIELD by making use of the created resource. We start by login the resource using an user who have DataSHIELD permissions to our Opal server (dsuser).
builder <- newDSLoginBuilder()
builder$append(server = 'study1', url = 'https://opal-demo.obiba.org',
user = 'dsuser', password = 'password',
resource = 'RSRC.pheno_TCGA')
logindata <- builder$build()
Then, we login the resource to asssign (R object called res)
conns <- datashield.login(logins = logindata,
assign = TRUE,
symbol = 'res')
The resourcer package which is installed in the Opal server contains functions that facilitates the access of this data in the R server. In particular, we will have access to the resource as a data.frame called pheno. To this end, as.resource.data.frame () function is used to coerce the resource (e.g. ResourceClient object) to a data frame.
datashield.assign.expr(conns, symbol = 'pheno',
expr = quote(as.resource.data.frame(res)))
ds.class('pheno')
$study1
[1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
ds.dim('pheno')
$`dimensions of pheno in study1`
[1] 11287 859
$`dimensions of pheno in combined studies`
[1] 11287 859
datashield.logout(conns)
Now, let us assume that our resource is a data frame saved in a .Rdata file available at: https://github.com/isglobal-brge/brgedata/raw/master/data/asthma.rda. We can do similar steps to load the resource into the Opal server:
o <- opal.login('administrator','password',
url='https://opal-demo.obiba.org')
opal.resource_create(opal = o,
project = 'RSRC',
name = 'asthma',
url = 'https://github.com/isglobal-brge/brgedata/raw/master/data/asthma.rda',
format = 'data.frame')
opal.assign.resource(o, 'client', 'RSRC.asthma')
opal.execute(o, 'class(client)')
[1] "RDataFileResourceClient" "FileResourceClient" "ResourceClient"
[4] "R6"
opal.logout(o)
In that case the class of the client object is RdataFileResourceClient whose resolver is also implemented in the resourcer package.
Then, we can analyze this data using DataSHIELD functions as usual. In that case, instead of using as.resource.data.frame () (it is also possible to use it) we make use of the function as.resource.object () which coerce the resource to an internal data object that depends on the implementation of this object.
builder <- newDSLoginBuilder()
builder$append(server = 'study1', url = 'https://opal-demo.obiba.org',
user = 'dsuser', password = 'password',
resource = 'RSRC.asthma')
logindata <- builder$build()
conns <- datashield.login(logins = logindata,
assign = TRUE,
symbol = 'res')
datashield.assign.expr(conns, symbol = 'asthma',
expr = quote(as.resource.object(res)))
Now, we can perform some standard statistical analysis in DataSHIELD as usual:
ds.class('asthma')
$study1
[1] "data.frame"
ds.colnames('asthma')
$study1
[1] "country" "gender" "age" "bmi" "smoke" "casecontrol"
[7] "rs4490198" "rs4849332" "rs1367179" "rs11123242" "rs13014858" "rs1430094"
[13] "rs1430093" "rs746710" "rs1430090" "rs6737251" "rs11685217" "rs1430097"
[19] "rs10496465" "rs3756688" "rs2303063" "rs1422993" "rs2400478" "rs714588"
[25] "rs1023555" "rs898070" "rs963218" "rs1419835" "rs765023" "rs1345267"
[31] "rs324381" "hopo546333" "rs184448" "rs324396" "rs324957" "rs324960"
[37] "rs10486657" "rs324981" "rs1419780" "rs325462" "rs727162" "rs10250709"
[43] "rs6958905" "rs10238983" "rs4941643" "rs3794381" "rs2031532" "rs2247119"
[49] "rs8000149" "rs2274276" "rs7332573" "rs3829366" "rs6084432" "rs512625"
[55] "rs3918395" "rs2787095" "rs2853215"
ds.glm(casecontrol ~ rs1422993 + smoke + bmi, data='asthma', family='binomial')
$Nvalid
[1] 1559
$Nmissing
[1] 19
$Ntotal
[1] 1578
$disclosure.risk
RISK OF DISCLOSURE
study1 0
$errorMessage
ERROR MESSAGES
study1 "No errors"
$nsubs
[1] 1559
$iter
[1] 5
$family
Family: binomial
Link function: logit
$formula
[1] "casecontrol ~ rs1422993 + smoke + bmi"
$coefficients
Estimate Std. Error z-value p-value low0.95CI.LP high0.95CI.LP P_OR
(Intercept) -2.45698463 0.36442291 -6.7421245 1.560873e-11 -3.17124042 -1.74272885 0.07892927
rs1422993GT 0.35228783 0.13087079 2.6918752 7.105152e-03 0.09578580 0.60878987 1.42231786
rs1422993TT 0.14131809 0.25709215 0.5496787 5.825397e-01 -0.36257326 0.64520944 1.15179096
smoke -0.36267146 0.14370585 -2.5237069 1.161247e-02 -0.64432976 -0.08101316 0.69581500
bmi 0.04234646 0.01349121 3.1388188 1.696303e-03 0.01590418 0.06878875 1.04325586
low0.95CI.P_OR high0.95CI.P_OR
(Intercept) 0.04026246 0.1489667
rs1422993GT 1.10052331 1.8382056
rs1422993TT 0.69588333 1.9063863
smoke 0.52501431 0.9221816
bmi 1.01603132 1.0712099
$dev
[1] 1579.802
$df
[1] 1554
$output.information
[1] "SEE TOP OF OUTPUT FOR INFORMATION ON MISSING DATA AND ERROR MESSAGES"
Do not forget to close the connection after finishing the analyses
datashield.logout(conns)
One of the main advantages of using the resources as an R file is that it may contains any type of R object. Let us illustrate this by having an R file with an ExpressionSet object which is a [Bioconductor] class to encapsulate omic data along with phenotypic information and annotation. See here for a description of this type of objects or to this video.
This figure provides a visual idea about how these objects are organized (for epidemiologists)
Figure 3: Data frame vs ExpressionSet organization
This is how an ExpressionSet can be summarized
Figure 4: ExpressionSet infrastructure
He have collected information on gene expression of the 200 more variable genes measured in 194 samples corresponding to the Asthma inner city study available in GEO with the accesion numberGSE40732. This is an ExpressionSet that is available at:
Let us illustrate how to create the resources in the Opal server having access to these two datasets
o <- opal.login('administrator','password',
url='https://opal-demo.obiba.org')
opal.resource_create(opal = o,
project = 'RSRC',
name = 'genexpr',
url = 'https://github.com/isglobal-brge/brgedata/raw/master/data/GSE40732.Rdata',
format = 'ExpressionSet')
opal.logout(o)
Now, we are ready to perform any data analysis using DataSHIELD as following. NOTE that in this case the R server must have installed those specific packages to deal with our specific class of objects. In that case, ExpressionSets are managed with Biobase Bioconductor package. If so, the function as.resource.object () which coerce the resource to an ExpressionSet that, in this case, will be available in the R server as an object called eSet.
builder <- newDSLoginBuilder()
builder$append(server = 'study1', url = 'https://opal-demo.obiba.org',
user = 'dsuser', password = 'password',
resource = 'RSRC.genexpr')
logindata <- builder$build()
conns <- datashield.login(logins = logindata,
assign = TRUE,
symbol = 'res')
datashield.assign.expr(conns, symbol = 'eSet',
expr = quote(as.resource.object(res)))
Then, we can use DataSHIELD fuctions as usual
ds.class('eSet')
$study1
[1] "ExpressionSet"
attr(,"package")
[1] "Biobase"
In that case, if we execute
ds.dim('eSet')
we get an error, but … we have developed a pair of DataSHIELD packages, called dsOmics and dsOmicsClient that allow us to deal with this type of objects in DataSHIELD and that are able to get the desired information (i.e. the dimension of the ExpressionSet).
dsOmicsClient::ds.nFeatures('eSet')
$study1
Features
200
attr(,"class")
[1] "dsnFeatures" "list"
dsOmicsClient::ds.nSamples('eSet')
$study1
Samples
194
attr(,"class")
[1] "dsnSamples" "list"
We finish the example by login out
datashield.logout(conns)
He have collected information on gene expression of the 1000 more variable genes measured in skin biopsies from 10 patients with active atopic eczema and 10 healthy controls (GEO accesion number GSE6012. The
ExpressionSetis available at:
- Create a resource into the project
workshopavailable at the demo Opal site (https://opal-demo.obiba.org/)
- user: administrator
- password: password
IMPORTANT NOTE: In order to avoid problems with duplicated names (all of you are using the same server) give a name to the resources like
eSet_XXXXXXand replaceXXXXXXby random lower and upper letters (for example: eSet_DrMMuZ).
Login the connection to the resource to DataSHIELD using these credentials:
- user: dsuser
- password: password
Assign the resources to an R object having the same name as the resource name (in my case eSet_DrMMuZ).
Load
dsOmicsClientpackage and useds.nFeatures ()andds.nSamples ()functions to verify that the loaded object is having the proper dimensions.Investigate the
dsOmicsClientpackage (typels("package:dsOmicsClient")in the R console) to find a function for knowing the names of the covariates in the metadata on samples
Here, we will use data from three studies that are available in our Opal demo repository. The three databases are called CNSIM1, CNSIM2, CNSIM3 and are available as three different resources: mySQL database, SPSS file and CSV file (see Figure 2). This example mimics real situations where different hospitals or research centers manage their own databases containing harmonized data. Data correspond to three simulated datasets with different numbers of observations of 11 harmonized variables. They contain synthetic data based on a model derived from the participants of the 1958 Birth Cohort, as part of an obesity methodological development project. The available variables are:
| Variable | Description | Type | Note |
|---|---|---|---|
| LAB_TSC | Total Serum Cholesterol | numeric | mmol/L |
| LAB_TRIG | Triglycerides | numeric | mmol/L |
| LAB_HDL | HDL Cholesterol | numeric | mmol/L |
| LAB_GLUC_ADJUSTED | Non-Fasting Glucose | numeric | mmol/L |
| PM_BMI_CONTINUOUS | Body Mass Index (continuous) | numeric | kg/m2 |
| DIS_CVA | History of Stroke | factor | 0 = Never had stroke; 1 = Has had stroke |
| MEDI_LPD | Current Use of Lipid Lowering Medication (from categorical assessment item) | factor | 0 = Not currently using lipid lowering medication; 1 = Currently using lipid lowering medication |
| DIS_DIAB | History of Diabetes | factor | 0 = Never had diabetes; 1 = Has had diabetes |
| DIS_AMI | History of Myocardial Infarction | factor | 0 = Never had myocardial infarction; 1 = Has had myocardial infarction |
| GENDER | Gender | factor | 0 = Female; 1 = Male |
| PM_BMI_CATEGORICAL | Body Mass Index (categorical) | factor | 1 = Less than 25 kg/m2; 2 = 25 to 30 kg/m2; 3 = Over 30 kg/m2 |
The aggregated analysis can be performed as follows. We first start by preparing the login data and the resources to assign
builder <- DSI::newDSLoginBuilder()
builder$append(server = "study1", url = "https://opal-demo.obiba.org",
user = "dsuser", password = "password",
resource = "RSRC.CNSIM1")
builder$append(server = "study2", url = "https://opal-demo.obiba.org",
user = "dsuser", password = "password",
resource = "RSRC.CNSIM2")
builder$append(server = "study3", url = "https://opal-demo.obiba.org",
user = "dsuser", password = "password",
resource = "RSRC.CNSIM3")
logindata <- builder$build()
Then, we login and assign resources
conns <- datashield.login(logins = logindata, assign = TRUE, symbol = "res")
The assigned objects are of class ResourceClient (and others)
ds.class("res")
$study1
[1] "SQLResourceClient" "ResourceClient" "R6"
$study2
[1] "TidyFileResourceClient" "FileResourceClient" "ResourceClient"
[4] "R6"
$study3
[1] "TidyFileResourceClient" "FileResourceClient" "ResourceClient"
[4] "R6"
We then coerce the ResourceClient objects to data frames
datashield.assign.expr(conns, symbol = "D",
expr = quote(as.resource.data.frame(res, strict = TRUE)))
ds.class("D")
$study1
[1] "data.frame"
$study2
[1] "data.frame"
$study3
[1] "data.frame"
Now, we are ready to do usual DataSHIELD analyses
ds.summary('D$LAB_HDL')
$study1
$study1$class
[1] "numeric"
$study1$length
[1] 2163
$study1$`quantiles & mean`
5% 10% 25% 50% 75% 90% 95% Mean
0.875240 1.047400 1.300000 1.581000 1.844500 2.090000 2.210900 1.569416
$study2
$study2$class
[1] "numeric"
$study2$length
[1] 3088
$study2$`quantiles & mean`
5% 10% 25% 50% 75% 90% 95% Mean
0.850280 1.032200 1.294000 1.563000 1.840000 2.077000 2.225000 1.556648
$study3
$study3$class
[1] "numeric"
$study3$length
[1] 4128
$study3$`quantiles & mean`
5% 10% 25% 50% 75% 90% 95% Mean
0.876760 1.039200 1.304000 1.589000 1.856000 2.098800 2.244200 1.574687
NOTE: vector types are not necessarily the same depending on the data reader that was used
ds.class('D$GENDER')
$study1
[1] "integer"
$study2
[1] "haven_labelled" "vctrs_vctr" "double"
$study3
[1] "numeric"
ds.asFactor('D$GENDER', 'GENDER')
$all.unique.levels
[1] "0" "1"
$return.message
[1] "Data object <GENDER> correctly created in all specified data sources"
ds.summary('GENDER')
$study1
$study1$class
[1] "factor"
$study1$length
[1] 2163
$study1$categories
[1] "0" "1"
$study1$`count of '0'`
[1] 1092
$study1$`count of '1'`
[1] 1071
$study2
$study2$class
[1] "factor"
$study2$length
[1] 3088
$study2$categories
[1] "0" "1"
$study2$`count of '0'`
[1] 1585
$study2$`count of '1'`
[1] 1503
$study3
$study3$class
[1] "factor"
$study3$length
[1] 4128
$study3$categories
[1] "0" "1"
$study3$`count of '0'`
[1] 2091
$study3$`count of '1'`
[1] 2037
A logistic regression model can be fitted using privacy-protected analyses as if we would have the three datasets located in a single computer
mod <- ds.glm("DIS_DIAB ~ LAB_TRIG + GENDER", data = "D" , family="binomial")
mod$coeff
Estimate Std. Error z-value p-value low0.95CI.LP high0.95CI.LP P_OR
(Intercept) -4.7792110 0.21081170 -22.670521 8.755236e-114 -5.1923944 -4.36602770 0.00833261
LAB_TRIG 0.3035931 0.05487436 5.532514 3.156737e-08 0.1960414 0.41114488 1.35471774
GENDER -0.4455989 0.20797931 -2.142516 3.215202e-02 -0.8532309 -0.03796695 0.64044060
low0.95CI.P_OR high0.95CI.P_OR
(Intercept) 0.005527953 0.01254229
LAB_TRIG 1.216577226 1.50854390
GENDER 0.426036242 0.96274475
Logout the connection
datashield.logout(conns)
Computation resources are resources on which tasks/commands can be triggered and from which resulting data can be retrieved. We can define a SSH Resource which is accessible through a secure shell. The path part of the URL is the remote working directory. The available commands are defined by the exec query parameter as described in the resources demo tutorial.
PLINK is a program to perform whole genome association analysis (GWAS) in a computationally efficient manner. So, it may have sense to use this program to run the analyses instead of using R.
PLINK data analysis requires to have genotype data in three files (.bed, .bim, .fam) and phenotype data in a different file. Therefore, the computation resource must contain the data and the commands that are allowed to be run in the server. This can be set up manually in the Opal server as indicated in the next figure
Figure 5: Computation resource corresponding to PLINK example
Notice that this resource are hosted at plink-demo.obiba.org in the folder /home/master/brge. This folder contains brge.bed, brge.bim, brge.fam, which are the three PLINK files and brge.phe file that encodes the phenotypic variables. The server also have installed PLINK.
This computation resource is already available at our demo Opal server (brge_plink) as you can see in figure 2
The computation resource is a server side feature. So, we cannot see its functionallity using DataSHIELD. However, in order to easily illustrate how it works, let us create the resource in plain R
library(resourcer)
ssh.res <- newResource(
url = "ssh://plink-demo.obiba.org:2222/home/master/brge?exec=ls,plink1,plink",
identity = "master",
secret = "master"
)
The resource connection client is resolved as follow:
ssh.client <- newResourceClient(ssh.res)
class(ssh.client)
[1] "SshResourceClient" "CommandResourceClient" "ResourceClient" "R6"
This type of client allows to issue shell commands through a SSH connection. In our case we can use:
ssh.client$getAllowedCommands()
[1] "ls" "plink1" "plink"
We can also see that the computation resource not only have PLINK program installed, but also our genomic and phenotypic data
ssh.client$exec("ls")
$status
[1] 0
$output
[1] "brge.bed" "brge.bim" "brge.fam" "brge.gds" "brge.phe" "brge.txt"
$error
character(0)
$command
[1] "cd /home/master/brge && ls"
attr(,"class")
[1] "resource.exec"
Now, we are ready to run any PLINK command from R. Notice that in this case we want to assess association between the genotype data in bed format and use as phenotype the variable ‘obese’ that is in the file ‘obesity.phe’. The sentence in a PLINK command would be (NOTE: we avoid –out to indicate the output file since the file will be available in R as a tibble).
plink1 --bfile brge --assoc --pheno brge.phe --pheno-name obese --noweb
This could be done simply by executing
ans <- ssh.client$exec('plink1', c('--bfile', 'brge', '--assoc', '--pheno', 'brge.phe', '--pheno-name', 'obese', '--noweb'))
ans
$status
[1] 0
$output
[1] ""
[2] "@----------------------------------------------------------@"
[3] "| PLINK! | v1.07 | 10/Aug/2009 |"
[4] "|----------------------------------------------------------|"
[5] "| (C) 2009 Shaun Purcell, GNU General Public License, v2 |"
[6] "|----------------------------------------------------------|"
[7] "| For documentation, citation & bug-report instructions: |"
[8] "| http://pngu.mgh.harvard.edu/purcell/plink/ |"
[9] "@----------------------------------------------------------@"
[10] ""
[11] "Skipping web check... [ --noweb ] "
[12] "Writing this text to log file [ plink.log ]"
[13] "Analysis started: Sat Jan 23 19:50:52 2021"
[14] ""
[15] "Options in effect:"
[16] "\t--bfile brge"
[17] "\t--assoc"
[18] "\t--pheno brge.phe"
[19] "\t--pheno-name obese"
[20] "\t--noweb"
[21] ""
[22] "Reading map (extended format) from [ brge.bim ] "
[23] "100000 markers to be included from [ brge.bim ]"
[24] "Reading pedigree information from [ brge.fam ] "
[25] "2312 individuals read from [ brge.fam ] "
[26] "2312 individuals with nonmissing phenotypes"
[27] "Assuming a disease phenotype (1=unaff, 2=aff, 0=miss)"
[28] "Missing phenotype value is also -9"
[29] "725 cases, 1587 controls and 0 missing"
[30] "1097 males, 1215 females, and 0 of unspecified sex"
[31] "Reading genotype bitfile from [ brge.bed ] "
[32] "Detected that binary PED file is v1.00 SNP-major mode"
[33] "Reading alternate phenotype from [ brge.phe ] "
[34] "2204 individuals with non-missing alternate phenotype"
[35] "Assuming a disease phenotype (1=unaff, 2=aff, 0=miss)"
[36] "Missing phenotype value is also -9"
[37] "402 cases, 1802 controls and 108 missing"
[38] "Before frequency and genotyping pruning, there are 100000 SNPs"
[39] "2312 founders and 0 non-founders found"
[40] "6009 heterozygous haploid genotypes; set to missing"
[41] "Writing list of heterozygous haploid genotypes to [ plink.hh ]"
[42] "7 SNPs with no founder genotypes observed"
[43] "Warning, MAF set to 0 for these SNPs (see --nonfounders)"
[44] "Writing list of these SNPs to [ plink.nof ]"
[45] "Total genotyping rate in remaining individuals is 0.994408"
[46] "0 SNPs failed missingness test ( GENO > 1 )"
[47] "0 SNPs failed frequency test ( MAF < 0 )"
[48] "After frequency and genotyping pruning, there are 100000 SNPs"
[49] "After filtering, 402 cases, 1802 controls and 108 missing"
[50] "After filtering, 1097 males, 1215 females, and 0 of unspecified sex"
[51] "Writing main association results to [ plink.assoc ] "
[52] ""
[53] "Analysis finished: Sat Jan 23 19:50:58 2021"
[54] ""
$error
character(0)
$command
[1] "cd /home/master/brge && plink1 --bfile brge --assoc --pheno brge.phe --pheno-name obese --noweb"
attr(,"class")
[1] "resource.exec"
This is the output that is provided by PLINK and new output files have been created in the computation resource. These files can be downloaded in the R server by using the function
ssh.client$downloadFile()
We have implemented all these procedures in dsOmics package whose DataSHIELD client package is dsOmicsClient. Therefore, once the compuation resource is created in the Opal server we can perform such analyses using DataSHIELD by simply
builder <- newDSLoginBuilder()
builder$append(server = 'study1', url = 'https://opal-demo.obiba.org',
user = 'dsuser', password = 'password',
resource = 'RSRC.brge_plink')
logindata <- builder$build()
Then we assign the resource to a symbol (i.e. R object) called client which is a ssh resource
conns <- datashield.login(logins = logindata, assign = TRUE,
symbol = "client")
ds.class("client")
$study1
[1] "SshResourceClient" "CommandResourceClient" "ResourceClient" "R6"
and then call to ds.PLINK () function. The arguments must be encapsulated in a single character without the command ‘plink1’. The ‘–noweb’ option is not necessary either.
plink.arguments <- "--bfile brge --assoc --pheno brge.phe --pheno-name obese"
and then, the analyses are performed by
library(dsOmicsClient)
ans.plink <- ds.PLINK("client", plink.arguments)
The object ans.plink contains the PLINK results at each server as well as the outuput provided by PLINK
lapply(ans.plink, names)
$study1
[1] "results" "plink.out"
head(ans.plink$study1$results)
# A tibble: 6 x 10
CHR SNP BP A1 F_A F_U A2 CHISQ P OR
<dbl> <chr> <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 0 MitoC3993T 3993 T 0.0150 0.0146 C 0.00836 0.927 1.03
2 0 MitoG4821A 4821 A 0 0.00169 G 1.34 0.248 0
3 0 MitoG6027A 6027 A 0.00249 0.00501 G 0.922 0.337 0.495
4 0 MitoT6153C 6153 C 0.00873 0.0117 T 0.516 0.472 0.746
5 0 MitoC7275T 7275 T 0 0.00111 C 0.894 0.344 0
6 0 MitoT9699C 9699 C 0.0908 0.0669 T 5.67 0.0172 1.39
ans.plink$study$plink.out
$status
[1] 0
$output
[1] ""
[2] "@----------------------------------------------------------@"
[3] "| PLINK! | v1.07 | 10/Aug/2009 |"
[4] "|----------------------------------------------------------|"
[5] "| (C) 2009 Shaun Purcell, GNU General Public License, v2 |"
[6] "|----------------------------------------------------------|"
[7] "| For documentation, citation & bug-report instructions: |"
[8] "| http://pngu.mgh.harvard.edu/purcell/plink/ |"
[9] "@----------------------------------------------------------@"
[10] ""
[11] "Skipping web check... [ --noweb ] "
[12] "Writing this text to log file [ /tmp/ssh-1148/out.log ]"
[13] "Analysis started: Sat Jan 23 19:51:01 2021"
[14] ""
[15] "Options in effect:"
[16] "\t--bfile brge"
[17] "\t--assoc"
[18] "\t--pheno brge.phe"
[19] "\t--pheno-name obese"
[20] "\t--noweb"
[21] "\t--out /tmp/ssh-1148/out"
[22] ""
[23] "Reading map (extended format) from [ brge.bim ] "
[24] "100000 markers to be included from [ brge.bim ]"
[25] "Reading pedigree information from [ brge.fam ] "
[26] "2312 individuals read from [ brge.fam ] "
[27] "2312 individuals with nonmissing phenotypes"
[28] "Assuming a disease phenotype (1=unaff, 2=aff, 0=miss)"
[29] "Missing phenotype value is also -9"
[30] "725 cases, 1587 controls and 0 missing"
[31] "1097 males, 1215 females, and 0 of unspecified sex"
[32] "Reading genotype bitfile from [ brge.bed ] "
[33] "Detected that binary PED file is v1.00 SNP-major mode"
[34] "Reading alternate phenotype from [ brge.phe ] "
[35] "2204 individuals with non-missing alternate phenotype"
[36] "Assuming a disease phenotype (1=unaff, 2=aff, 0=miss)"
[37] "Missing phenotype value is also -9"
[38] "402 cases, 1802 controls and 108 missing"
[39] "Before frequency and genotyping pruning, there are 100000 SNPs"
[40] "2312 founders and 0 non-founders found"
[41] "6009 heterozygous haploid genotypes; set to missing"
[42] "Writing list of heterozygous haploid genotypes to [ /tmp/ssh-1148/out.hh ]"
[43] "7 SNPs with no founder genotypes observed"
[44] "Warning, MAF set to 0 for these SNPs (see --nonfounders)"
[45] "Writing list of these SNPs to [ /tmp/ssh-1148/out.nof ]"
[46] "Total genotyping rate in remaining individuals is 0.994408"
[47] "0 SNPs failed missingness test ( GENO > 1 )"
[48] "0 SNPs failed frequency test ( MAF < 0 )"
[49] "After frequency and genotyping pruning, there are 100000 SNPs"
[50] "After filtering, 402 cases, 1802 controls and 108 missing"
[51] "After filtering, 1097 males, 1215 females, and 0 of unspecified sex"
[52] "Writing main association results to [ /tmp/ssh-1148/out.assoc ] "
[53] ""
[54] "Analysis finished: Sat Jan 23 19:51:07 2021"
[55] ""
$error
character(0)
$command
[1] "cd /home/master/brge && plink1 --bfile brge --assoc --pheno brge.phe --pheno-name obese --noweb --out /tmp/ssh-1148/out"
attr(,"class")
[1] "resource.exec"
Let us finish our tutorial by removing the created resources
opal.resource_delete(opal=o, project='RSRC', resource='pheno_TCGA')
opal.resource_delete(opal=o, project='RSRC', resource='asthma')
opal.resource_delete(opal=o, project='RSRC', resource='genexpr')
Physiological ecologists often analyze the responses of physiological or biochemical traits to environmental factors such as temperature, irradiance, water potential, or the concentrations of CO2, O2, and inorganic nutrients. Some researchers were interested in knowing the relationship between CO2 uptake (variable
uptake) for six Echinochloa crus-galli plants from Quebec and six plants from Mississippi as a function of ambient CO2 concentration (variableconc).The CO2 uptake of six plants from Quebec and six plants from Mississippi was measured at several levels of ambient CO2 concentration. Half the plants of each type were chilled overnight before the experiment was conducted.
Data cannot be shared between plants given confidentiality restrictions and researchers are interested in pooling both datasets via a virtually pooled-analysis. The scheme of the project can be seen in the next figure
Figure 6: Carbon Dioxide Uptake in Quebec and Mississippi Grass Plants
Scheme of the proposed data analysis using DataSHIELD. Data of two locations are stored in different repositories and are available as .tsv files.
Both datasets are .tsv files that are available in two different locations:
Quebec (IGSlobal repository): https://raw.githubusercontent.com/isglobal-brge/brgedata/master/inst/extdata/co2_quebec.tsv
Mississippi (OBiBa repository): https://raw.githubusercontent.com/obiba/resources-workshop/main/data/co2_mississippi.tsv
Do the following steps to perform the required analysis:
- Upload the two resources to the project
workshopavailable at the demo Opal site (https://opal-demo.obiba.org/, u:administrator, p:password). IMPORTANT NOTE: In order to avoid problems with duplicated names (all you are using the same server) give a name to the two resources likeboston_XXXXXXandmississippi_XXXXXXand replaceXXXXXXby random lower and upper letters (for example: boston_QrTyUZ and mississippi_AgbnYJ).- Login the connection to the resource to DataSHIELD using these credentials:
- user: dsuser
- password: password
- Assign the resources to objects having the same name as the resource name (in our case boston_QrTyUZ and mississippi_AgbnYJ).
- Using
dsBaseClientfunctions (for those new in DataSHIELD seels("package:dsBaseClient")):
- Check the names of the variables of each data frame.
- Check that the number of observations at each study (variable
Type) is 42.- Create a “combined” scatterplot of the CO2 uptake (Y-axis) vs CO2 concentration (variable
conc) (X-axix).- Run a linear model to investigate whether CO2 concentration can predicts CO2 uptake adjusting for chilled overnight factor (
Treatment).
sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)
Matrix products: default
locale:
[1] LC_COLLATE=Spanish_Spain.1252 LC_CTYPE=Spanish_Spain.1252 LC_MONETARY=Spanish_Spain.1252
[4] LC_NUMERIC=C LC_TIME=Spanish_Spain.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ssh_0.7.0 resourcer_1.1.0 dsOmicsClient_1.1.0 dsBaseClient_6.1.0
[5] DSOpal_1.1.0 DSI_1.1.0 R6_2.4.1 progress_1.2.2
[9] opalr_1.5.1 httr_1.4.2 knitr_1.29 BiocStyle_2.16.0
loaded via a namespace (and not attached):
[1] Biobase_2.48.0 jsonlite_1.7.0 splines_4.0.2 tmvnsim_1.0-2
[5] assertthat_0.2.1 sn_1.6-2 Rdpack_1.0.0 askpass_1.1
[9] BiocManager_1.30.10 highr_0.8 stats4_4.0.2 TFisher_0.2.0
[13] yaml_2.2.1 ggrepel_0.8.2 numDeriv_2016.8-1.1 pillar_1.4.6
[17] lattice_0.20-41 glue_1.4.2 digest_0.6.25 RColorBrewer_1.1-2
[21] colorspace_1.4-1 sandwich_2.5-1 htmltools_0.5.0 Matrix_1.2-18
[25] pkgconfig_2.0.3 bibtex_0.4.2.2 bookdown_0.20 purrr_0.3.4
[29] mvtnorm_1.1-1 scales_1.1.1 openssl_1.4.2 tibble_3.0.3
[33] generics_0.0.2 ggplot2_3.3.2 ellipsis_0.3.1 TH.data_1.0-10
[37] credentials_1.3.0 BiocGenerics_0.34.0 cli_2.0.2 mnormt_2.0.2
[41] survival_3.2-3 magrittr_1.5 crayon_1.3.4 mime_0.9
[45] evaluate_0.14 fansi_0.4.1 MASS_7.3-52 tools_4.0.2
[49] prettyunits_1.1.1 hms_0.5.3 gbRd_0.4-11 lifecycle_0.2.0
[53] multcomp_1.4-13 mutoss_0.1-12 stringr_1.4.0 munsell_0.5.0
[57] plotrix_3.7-8 compiler_4.0.2 rlang_0.4.7 grid_4.0.2
[61] rmarkdown_2.3 gtable_0.3.0 codetools_0.2-16 multtest_2.44.0
[65] curl_4.3 zoo_1.8-8 dplyr_1.0.2 utf8_1.1.4
[69] mathjaxr_1.0-1 readr_1.3.1 metap_1.4 stringi_1.4.6
[73] parallel_4.0.2 Rcpp_1.0.5 vctrs_0.3.3 tidyselect_1.1.0
[77] xfun_0.16