1 Introduction

The DataBiosphere project includes a vision schema of which AnVIL/Terra forms a part.

The terra-notebook-utils Python module is described as a “Python API and CLI providing utilities for working with DRS objects, VCF files, and the Terra notebook environment.”

This R package aims to provide a regulated interface between R and terra-notebook-utils for use in AnVIL.

By “regulated” we mean that the entire Python ecosystem used to work with terra-notebook-utils is defined in a virtual environment. We make some exceptions for the sake of demonstration, but the drs_access command, for example, reaches Python only through a specific interface managed by the Bioconductor basilisk package.
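To make “regulated” concrete, here is a minimal sketch of the basilisk pattern, assuming the basilisk and reticulate packages are installed. The function names make_demo_env and signed_url_in_env, the environment name, and the choice of pinned packages are hypothetical; this is not BiocTNU's actual implementation. BasiliskEnvironment, basiliskStart, basiliskRun and basiliskStop are basilisk functions, and access is the function listed in the help for the terra_notebook_utils.drs module.

```r
# Sketch only: names and pins below are illustrative, not BiocTNU's own spec.

# Describe a pinned Python environment. The terra-notebook-utils version is
# left to the caller because this document does not state which one is used.
make_demo_env <- function(tnu_version) {
  basilisk::BasiliskEnvironment(
    envname  = "demo_tnu_env",    # hypothetical environment name
    pkgname  = "BiocTNU",         # installed R package that owns the environment
    packages = "pandas==1.3.5",   # pin that appears in the conda log in this document
    pip      = paste0("terra-notebook-utils==", tnu_version)
  )
}

# Run one Python call inside that environment and return the result to R.
signed_url_in_env <- function(env, drs_uri) {
  proc <- basilisk::basiliskStart(env)
  on.exit(basilisk::basiliskStop(proc), add = TRUE)
  basilisk::basiliskRun(proc, function(uri) {
    # Import the submodule explicitly rather than relying on the top-level package.
    drs <- reticulate::import("terra_notebook_utils.drs")
    drs$access(uri)
  }, uri = drs_uri)
}
```

Nothing is provisioned until signed_url_in_env is called, so the definitions can be sourced anywhere; the call itself only makes sense in a workspace that can resolve the DRS URI.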

2 Basic concepts

2.1 Installing in an AnVIL workspace, Oct 2022

As of October 2022, BiocTNU is available from a GitHub repository. To install it and use it properly with R in AnVIL:

  • ensure there is a .Renviron file in the home folder with content PIP_USER=false
  • use BiocManager::install("vjcitn/BiocTNU")
  • verify that library(BiocTNU); example(drs_access) produces a signed URL
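The steps above can be sketched in R as follows. The helper name ensure_pip_user_false is hypothetical (it is not part of BiocTNU); it rewrites a .Renviron file so that it contains exactly one PIP_USER=false line and keeps every other entry. The installation itself is guarded so that it only runs in an interactive session.

```r
# Hypothetical helper (not part of BiocTNU): make sure the given .Renviron
# file contains the line PIP_USER=false, keeping all other entries.
ensure_pip_user_false <- function(path = file.path(Sys.getenv("HOME"), ".Renviron")) {
  lines <- if (file.exists(path)) readLines(path, warn = FALSE) else character()
  lines <- lines[!grepl("^\\s*PIP_USER\\s*=", lines)]  # drop any earlier setting
  writeLines(c(lines, "PIP_USER=false"), path)
  invisible(path)
}

# Only touch the real home directory and install packages interactively.
if (interactive()) {
  ensure_pip_user_false()
  readRenviron("~/.Renviron")  # apply the setting to the current session
  # Installing from GitHub through BiocManager relies on the remotes package.
  for (p in c("BiocManager", "remotes")) {
    if (!requireNamespace(p, quietly = TRUE)) install.packages(p)
  }
  BiocManager::install("vjcitn/BiocTNU")
  library(BiocTNU)
  example(drs_access)  # should finish by producing a signed URL
}
```

Because .Renviron is normally read only at startup, restarting R after the edit is the safest way to be sure the setting is in effect.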

2.2 Probing available features

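Once installation has succeeded, we use basilisk-mediated commands defined in the BiocTNU package to probe or use terra-notebook-utils. We can list the top-level names that are available after importing terra-notebook-utils. The first call also provisions the conda environment, which accounts for the conda command lines in the output below.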

library(BiocTNU)
## Loading required package: tools
## Loading required package: reticulate
## Loading required package: basilisk
tnu_top()
## + '/home/rstudio/.cache/R/basilisk/1.8.1/0/bin/conda' 'create' '--yes' '--prefix' '/home/rstudio/.cache/R/basilisk/1.8.1/BiocTNU/0.0.7/bsklenv' 'python=3.7.7' '--quiet' '-c' 'conda-forge'
## + '/home/rstudio/.cache/R/basilisk/1.8.1/0/bin/conda' 'install' '--yes' '--prefix' '/home/rstudio/.cache/R/basilisk/1.8.1/BiocTNU/0.0.7/bsklenv' 'python=3.7.7'
## + '/home/rstudio/.cache/R/basilisk/1.8.1/0/bin/conda' 'install' '--yes' '--prefix' '/home/rstudio/.cache/R/basilisk/1.8.1/BiocTNU/0.0.7/bsklenv' '-c' 'conda-forge' 'python=3.7.7' 'pandas=1.3.5'
## [1] "IO_CONCURRENCY"           "MARTHA_URL"              
## [3] "MARTHA_URL_VERSION"       "os"                      
## [5] "TERRA_DEPLOYMENT_ENV"     "WORKSPACE_BUCKET"        
## [7] "WORKSPACE_GOOGLE_PROJECT" "WORKSPACE_NAME"          
## [9] "WORKSPACE_NAMESPACE"

We can also retrieve the help content for the terra_notebook_utils package, which lists the Python modules subordinate to it.

cat(tnu_help())
## Help on package terra_notebook_utils:
## 
## NAME
##     terra_notebook_utils
## 
## PACKAGE CONTENTS
##     blobstore (package)
##     cli (package)
##     costs
##     drs
##     gs
##     http
##     logger
##     profile
##     table
##     tar_gz
##     utils
##     vcf
##     version
##     workflows
##     workspace
##     xprofile
## 
## DATA
##     IO_CONCURRENCY = 3
##     MARTHA_URL = 'https://us-central1-broad-dsde-prod.cloudfunctions.net/m...
##     MARTHA_URL_VERSION = 'martha_v3'
##     TERRA_DEPLOYMENT_ENV = 'prod'
##     WORKSPACE_BUCKET = 'fc-48f42333-d659-4762-845e-5dbe7e00ef1b'
##     WORKSPACE_GOOGLE_PROJECT = 'terra-91e8a8e4'
##     WORKSPACE_NAME = 'Bioconductor-Package-BiocTNU'
##     WORKSPACE_NAMESPACE = 'landmarkanvil2'
## 
## FILE
##     /home/rstudio/.cache/R/basilisk/1.8.1/BiocTNU/0.0.7/bsklenv/lib/python3.7/site-packages/terra_notebook_utils/__init__.py

2.3 Generating signed URLs

The default argument to drs_access is the Google Storage location of a CRAI file. We display only the first 80 characters of the resulting signed URL.

substr(drs_access(), 1, 80)
## [1] "https://nih-nhlbi-biodata-catalyst-1000-genomes.storage.googleapis.com/CCDG_1360"
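A signed URL carries its signature in the query string, which is presumably why only a prefix is printed above. As a minimal sketch, the hypothetical helper below (not part of BiocTNU) hides the query string before printing, and the guarded usage shows one way to download the object with base R.

```r
# Hypothetical helper (not part of BiocTNU): hide the query string of a
# signed URL so it can be printed or logged without exposing the signature.
redact_signed_url <- function(url) {
  sub("\\?.*$", "?<redacted>", url)
}

# Guarded usage: needs BiocTNU and an AnVIL session with access to the object.
if (interactive() && requireNamespace("BiocTNU", quietly = TRUE)) {
  url <- BiocTNU::drs_access()
  print(redact_signed_url(url))
  dest <- tempfile(fileext = ".crai")
  download.file(url, dest, mode = "wb")  # a signed URL is an ordinary HTTPS link
  print(file.size(dest))
}
```

Signed URLs expire, so it is usually better to request a fresh one just before the download than to store one for later use.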

More features related to DRS will become available as basilisk interfaces are added to this package. The help for the underlying Python module shows what could be exposed.

cat(tnu_drs_help())
## Help on module terra_notebook_utils.drs in terra_notebook_utils:
## 
## NAME
##     terra_notebook_utils.drs - Utilities for working with DRS objects.
## 
## CLASSES
##     builtins.Exception(builtins.BaseException)
##         DRSResolutionError
##     builtins.tuple(builtins.object)
##         DRSInfo
##     terra_notebook_utils.blobstore.copy_client.CopyClient(builtins.object)
##         DRSCopyClient
##     
##     class DRSCopyClient(terra_notebook_utils.blobstore.copy_client.CopyClient)
##      |  DRSCopyClient(concurrency: int = 4, raise_on_error: bool = False, indicator_type: Union[terra_notebook_utils.blobstore.progress.Indicator, NoneType] = None)
##      |  
##      |  Method resolution order:
##      |      DRSCopyClient
##      |      terra_notebook_utils.blobstore.copy_client.CopyClient
##      |      builtins.object
##      |  
##      |  Methods defined here:
##      |  
##      |  copy(self, drs_uri: str, dst: str)
##      |  
##      |  ----------------------------------------------------------------------
##      |  Data and other attributes defined here:
##      |  
##      |  __annotations__ = {'workspace': typing.Union[str, NoneType], 'workspac...
##      |  
##      |  workspace = None
##      |  
##      |  workspace_namespace = None
##      |  
##      |  ----------------------------------------------------------------------
##      |  Methods inherited from terra_notebook_utils.blobstore.copy_client.CopyClient:
##      |  
##      |  __enter__(self)
##      |  
##      |  __exit__(self, *args, **kwargs)
##      |  
##      |  __init__(self, concurrency: int = 4, raise_on_error: bool = False, indicator_type: Union[terra_notebook_utils.blobstore.progress.Indicator, NoneType] = None)
##      |      If 'raise_on_error' is False, all copy operations will be attempted even if one or more operations error. If
##      |      'raise_on_error' is True, the first error encountered will be raise and all scheduled operations will be
##      |      canceled.
##      |  
##      |  ----------------------------------------------------------------------
##      |  Data descriptors inherited from terra_notebook_utils.blobstore.copy_client.CopyClient:
##      |  
##      |  __dict__
##      |      dictionary for instance variables (if defined)
##      |  
##      |  __weakref__
##      |      list of weak references to the object (if defined)
##      |  
##      |  ----------------------------------------------------------------------
##      |  Data and other attributes inherited from terra_notebook_utils.blobstore.copy_client.CopyClient:
##      |  
##      |  multipart_threshold = 134217728
##     
##     class DRSInfo(builtins.tuple)
##      |  DRSInfo(credentials, access_url, bucket_name, key, name, size, updated, checksums)
##      |  
##      |  DRSInfo(credentials, access_url, bucket_name, key, name, size, updated, checksums)
##      |  
##      |  Method resolution order:
##      |      DRSInfo
##      |      builtins.tuple
##      |      builtins.object
##      |  
##      |  Methods defined here:
##      |  
##      |  __getnewargs__(self)
##      |      Return self as a plain tuple.  Used by copy and pickle.
##      |  
##      |  __repr__(self)
##      |      Return a nicely formatted representation string
##      |  
##      |  _asdict(self)
##      |      Return a new OrderedDict which maps field names to their values.
##      |  
##      |  _replace(_self, **kwds)
##      |      Return a new DRSInfo object replacing specified fields with new values
##      |  
##      |  ----------------------------------------------------------------------
##      |  Class methods defined here:
##      |  
##      |  _make(iterable) from builtins.type
##      |      Make a new DRSInfo object from a sequence or iterable
##      |  
##      |  ----------------------------------------------------------------------
##      |  Static methods defined here:
##      |  
##      |  __new__(_cls, credentials, access_url, bucket_name, key, name, size, updated, checksums)
##      |      Create new instance of DRSInfo(credentials, access_url, bucket_name, key, name, size, updated, checksums)
##      |  
##      |  ----------------------------------------------------------------------
##      |  Data descriptors defined here:
##      |  
##      |  credentials
##      |      Alias for field number 0
##      |  
##      |  access_url
##      |      Alias for field number 1
##      |  
##      |  bucket_name
##      |      Alias for field number 2
##      |  
##      |  key
##      |      Alias for field number 3
##      |  
##      |  name
##      |      Alias for field number 4
##      |  
##      |  size
##      |      Alias for field number 5
##      |  
##      |  updated
##      |      Alias for field number 6
##      |  
##      |  checksums
##      |      Alias for field number 7
##      |  
##      |  ----------------------------------------------------------------------
##      |  Data and other attributes defined here:
##      |  
##      |  _field_defaults = {}
##      |  
##      |  _fields = ('credentials', 'access_url', 'bucket_name', 'key', 'name', ...
##      |  
##      |  _fields_defaults = {}
##      |  
##      |  ----------------------------------------------------------------------
##      |  Methods inherited from builtins.tuple:
##      |  
##      |  __add__(self, value, /)
##      |      Return self+value.
##      |  
##      |  __contains__(self, key, /)
##      |      Return key in self.
##      |  
##      |  __eq__(self, value, /)
##      |      Return self==value.
##      |  
##      |  __ge__(self, value, /)
##      |      Return self>=value.
##      |  
##      |  __getattribute__(self, name, /)
##      |      Return getattr(self, name).
##      |  
##      |  __getitem__(self, key, /)
##      |      Return self[key].
##      |  
##      |  __gt__(self, value, /)
##      |      Return self>value.
##      |  
##      |  __hash__(self, /)
##      |      Return hash(self).
##      |  
##      |  __iter__(self, /)
##      |      Implement iter(self).
##      |  
##      |  __le__(self, value, /)
##      |      Return self<=value.
##      |  
##      |  __len__(self, /)
##      |      Return len(self).
##      |  
##      |  __lt__(self, value, /)
##      |      Return self<value.
##      |  
##      |  __mul__(self, value, /)
##      |      Return self*value.
##      |  
##      |  __ne__(self, value, /)
##      |      Return self!=value.
##      |  
##      |  __rmul__(self, value, /)
##      |      Return value*self.
##      |  
##      |  count(self, value, /)
##      |      Return number of occurrences of value.
##      |  
##      |  index(self, value, start=0, stop=9223372036854775807, /)
##      |      Return first index of value.
##      |      
##      |      Raises ValueError if the value is not present.
##     
##     class DRSResolutionError(builtins.Exception)
##      |  Common base class for all non-exit exceptions.
##      |  
##      |  Method resolution order:
##      |      DRSResolutionError
##      |      builtins.Exception
##      |      builtins.BaseException
##      |      builtins.object
##      |  
##      |  Data descriptors defined here:
##      |  
##      |  __weakref__
##      |      list of weak references to the object (if defined)
##      |  
##      |  ----------------------------------------------------------------------
##      |  Methods inherited from builtins.Exception:
##      |  
##      |  __init__(self, /, *args, **kwargs)
##      |      Initialize self.  See help(type(self)) for accurate signature.
##      |  
##      |  ----------------------------------------------------------------------
##      |  Static methods inherited from builtins.Exception:
##      |  
##      |  __new__(*args, **kwargs) from builtins.type
##      |      Create and return a new object.  See help(type) for accurate signature.
##      |  
##      |  ----------------------------------------------------------------------
##      |  Methods inherited from builtins.BaseException:
##      |  
##      |  __delattr__(self, name, /)
##      |      Implement delattr(self, name).
##      |  
##      |  __getattribute__(self, name, /)
##      |      Return getattr(self, name).
##      |  
##      |  __reduce__(...)
##      |      Helper for pickle.
##      |  
##      |  __repr__(self, /)
##      |      Return repr(self).
##      |  
##      |  __setattr__(self, name, value, /)
##      |      Implement setattr(self, name, value).
##      |  
##      |  __setstate__(...)
##      |  
##      |  __str__(self, /)
##      |      Return str(self).
##      |  
##      |  with_traceback(...)
##      |      Exception.with_traceback(tb) --
##      |      set self.__traceback__ to tb and return self.
##      |  
##      |  ----------------------------------------------------------------------
##      |  Data descriptors inherited from builtins.BaseException:
##      |  
##      |  __cause__
##      |      exception cause
##      |  
##      |  __context__
##      |      exception context
##      |  
##      |  __dict__
##      |  
##      |  __suppress_context__
##      |  
##      |  __traceback__
##      |  
##      |  args
## 
## FUNCTIONS
##     access(drs_url: str, workspace_name: Union[str, NoneType] = 'Bioconductor-Package-BiocTNU', workspace_namespace: Union[str, NoneType] = 'landmarkanvil2', billing_project: Union[str, NoneType] = 'terra-91e8a8e4') -> str
##         Return a signed url for a drs:// URI, if available.
##     
##     blob_for_url(url: str, billing_project: Union[str, NoneType] = 'terra-91e8a8e4') -> terra_notebook_utils.blobstore.Blob
##     
##     copy(drs_uri: str, dst: str, indicator_type: terra_notebook_utils.blobstore.progress.Indicator = <Indicator.bar: <class 'getm.progress.ProgressBar'>>, workspace_name: Union[str, NoneType] = 'Bioconductor-Package-BiocTNU', workspace_namespace: Union[str, NoneType] = 'landmarkanvil2')
##         Copy a DRS object to either the local filesystem, or to a Google Storage location if `dst` starts with
##         "gs://".
##     
##     copy_batch(drs_urls: Union[Iterable[str], NoneType] = None, dst_pfx: Union[str, NoneType] = None, workspace_name: Union[str, NoneType] = 'Bioconductor-Package-BiocTNU', workspace_namespace: Union[str, NoneType] = 'landmarkanvil2', indicator_type: terra_notebook_utils.blobstore.progress.Indicator = <Indicator.log: <class 'getm.progress.ProgressLogger'>>, manifest: Union[List[Dict[str, str]], NoneType] = None)
##     
##     copy_batch_manifest(manifest: List[Dict[str, str]], indicator_type: terra_notebook_utils.blobstore.progress.Indicator = <Indicator.log: <class 'getm.progress.ProgressLogger'>>, workspace_name: Union[str, NoneType] = 'Bioconductor-Package-BiocTNU', workspace_namespace: Union[str, NoneType] = 'landmarkanvil2')
##     
##     copy_batch_urls(drs_urls: Iterable[str], dst_pfx: str, indicator_type: terra_notebook_utils.blobstore.progress.Indicator = <Indicator.log: <class 'getm.progress.ProgressLogger'>>, workspace_name: Union[str, NoneType] = 'Bioconductor-Package-BiocTNU', workspace_namespace: Union[str, NoneType] = 'landmarkanvil2')
##     
##     copy_to_bucket(drs_uri: str, dst_key: str = '', dst_bucket_name: Union[str, NoneType] = None, indicator_type: terra_notebook_utils.blobstore.progress.Indicator = <Indicator.bar: <class 'getm.progress.ProgressBar'>>, workspace_name: Union[str, NoneType] = 'Bioconductor-Package-BiocTNU', workspace_namespace: Union[str, NoneType] = 'landmarkanvil2')
##         Resolve `drs_url` and copy into user-specified bucket `dst_bucket`.  If `dst_bucket` is None, copy into
##         workspace bucket.
##     
##     enable_requester_pays(workspace_name: Union[str, NoneType] = 'Bioconductor-Package-BiocTNU', workspace_namespace: Union[str, NoneType] = 'landmarkanvil2')
##     
##     extract_tar_gz(drs_url: str, dst: Union[str, NoneType] = None, workspace_name: Union[str, NoneType] = 'Bioconductor-Package-BiocTNU', workspace_namespace: Union[str, NoneType] = 'landmarkanvil2', billing_project: Union[str, NoneType] = 'terra-91e8a8e4')
##         Extract a `.tar.gz` archive resolved by a DRS url. 'dst' may be either a local filepath or a 'gs://' url.
##         Default extraction is to the bucket for 'workspace'.
##     
##     get_drs(drs_url: str, fields: List[str]) -> requests.models.Response
##         Request DRS information from martha.
##     
##     get_drs_blob(drs_url_or_info: Union[str, terra_notebook_utils.drs.DRSInfo], billing_project: Union[str, NoneType] = 'terra-91e8a8e4') -> Union[terra_notebook_utils.blobstore.gs.GSBlob, terra_notebook_utils.blobstore.url.URLBlob]
##     
##     get_drs_info(drs_url: str, access_url: bool = False) -> terra_notebook_utils.drs.DRSInfo
##         Attempt to resolve gs:// url and credentials for a DRS object.
##     
##     head(drs_url: str, num_bytes: int = 1, workspace_name: Union[str, NoneType] = 'Bioconductor-Package-BiocTNU', workspace_namespace: Union[str, NoneType] = 'landmarkanvil2', billing_project: Union[str, NoneType] = 'terra-91e8a8e4')
##         Head a DRS object by byte.
##     
##     info(drs_url: str) -> dict
##         Return a curated subset of data from `get_drs`.
## 
## DATA
##     Dict = typing.Dict
##     Iterable = typing.Iterable
##     List = typing.List
##     MARTHA_URL = 'https://us-central1-broad-dsde-prod.cloudfunctions.net/m...
##     Optional = typing.Optional
##     TERRA_DEPLOYMENT_ENV = 'prod'
##     Tuple = typing.Tuple
##     Union = typing.Union
##     WORKSPACE_BUCKET = 'fc-48f42333-d659-4762-845e-5dbe7e00ef1b'
##     WORKSPACE_GOOGLE_PROJECT = 'terra-91e8a8e4'
##     WORKSPACE_NAME = 'Bioconductor-Package-BiocTNU'
##     WORKSPACE_NAMESPACE = 'landmarkanvil2'
##     http = <requests.sessions.Session object>
##     logger = <Logger terra_notebook_utils.logger (INFO)>
##     manifest_schema = {'items': {'properties': {'drs_uri': {'type': 'strin...
## 
## FILE
##     /home/rstudio/.cache/R/basilisk/1.8.1/BiocTNU/0.0.7/bsklenv/lib/python3.7/site-packages/terra_notebook_utils/drs.py