enrichR is a simple R package designed to interface with the Local Contexts Hub API.

It allows you to fetch metadata from the TK Hub and append it to DSI (in the form of a FASTA file)

enrichR will allow you to:

Getting started

Install and load devtools and enrichR

# install.packages("devtools")
library(devtools)
#devtools::install_github("jacobgolan/enrichR")
library(enrichR)
library(dplyr)

enrichR is a simple and easy to use package. Here is an index of its key functions:

?index.all.projects() # create a table of all projects in the LC Hub

?find.project.id() #search for a project's unique ID

?find.projects() #find and retrieve projects' metadata, notices and TK/BC labels

?readFASTA() #import a FASTA file into R

?outputFASTA() #export a FASTA file from R

?testFASTA() #create a dummy FASTA for learning

Step 1: Importing Sequences

The function testFASTA() allows us to create a dummy FASTA for testing. You must specify the following arguments:

testFASTA(
  seqlength = c(75,100), # minimum and maximum sequence length range
  DNA_prob = rep(0.25,4), # relative freq of A, T, G, C
  no_seqs = 10, # how many sequences to simulate
  out.fasta = TRUE, #TRUE if you want to save a .fasta file to your computer
  file_name = "dsi", #only specify if out.fasta=TRUE
  loc = "LocusX" # locus name
)

When working with real data you will likely need to import a file from your computer into R. This can be done easily using readFASTA()

For simplicity, we will just read in the dummy FASTA we just created with testFASTA(). But in practice you can read in an FASTA file downloaded from NCBI, custom created, etc.

dsi<-readFASTA("dsi.fasta")

Running either testFASTA() or readFASTA() results in a dataframe with two columns. The first column is the sequence name (header) and the second column is the sequence itself.

For example:
name sequence
Sample_1_LocusX GCCGCCAAGTCGCCAATGTGGCCCTCAGGTAACCATGGGGTCTGTTGCCGCATCCGAGTTGGCGGCATCGCCATGCGCTTCACACT
Sample_2_LocusX CAACCAGAGGGAGTTATCGATCTGATCATACGGTATTATGGCCATTACAGCCAGTAGCTCGGGACAAGAGCCGGGAAGCCTTTAGA
Sample_3_LocusX GCTGAGGACGTCGCTGTAGGCATGGGCACTCTATGACCATACTAGGTTGGCCTAATATCCCCGGATTAGCGGAACCCCGGAGAG
Sample_4_LocusX CTAGGAGACCGCCAGCTGGTCTGACTGCCTGTTACCATGGTGCCCGTTAAGTAACACAATAATGGACATCATATAACGTGGACGAGTGGGC
Sample_5_LocusX TTCTGTATGCGCACCGTCCACTGTTGGCCCTATCATACATGATCGGGGTTAGAGTTGGCTAAGCGATGCCTACAAACCCCGGGTCCAATGCTG
Sample_6_LocusX CTAAACGGACGTTTGATTAACTTTAATTAACGAGCACGTGACGAACAGTCGCGAGCGTTGATAGAGCGAAAGTGTGGGAGGCAAGCTCATGATTGA
Sample_7_LocusX GTTATCTTACATCGCACGGTCGGTGAAAGCATCAGCAGTTATTTTTCGGTACCGATGTGTTCAAAAAAAAATCCATCTGGCATCCATTT
Sample_8_LocusX TGGGTTTCGTAGCCAGCGGTAGAAACGCCCATCTAAGGTTGCCTGGCCGCACGTTCCCAAGGCAGCGTGTTCAGACGGCGGTG
Sample_9_LocusX TCTCTGAGGCCTTGTTCTATATTCACCCAATTAAGATTGTATCCATCGTCATGCTGTGCGGGGCGAGGACCGGAACGAAACACAGACGGAAGATGAT
Sample_10_LocusX AATAGCAGTTGATTGGACGACATAGGGGCATCGATATTGAAAAATTATACTCCCGAGCTGCCGATTGTGGAGGACAAACAGCCCTACT

Step 2: Find Project Data

enrichR allows you to index all of the projects in the LC Hub. You can also search by unique ID, or if you are unsure of the uniqe ID, you can find it by searching for matching titles, etc.

To start, let’s say we already know the unique ID of the project we are interested in:

sample.proj<-find.projects("259854f7-b261-4c8c-8556-4b153deebc18")
lapply(sample.proj, names)[[1]]
#> [1] "unique_id"                "providers_id"            
#> [3] "title"                    "project_privacy"         
#> [5] "date_added"               "date_modified"           
#> [7] "bc_labels"                "tk_labels"               
#> [9] "project_boundary_geojson"

By searching by a project’s unique ID we get to all the project metadata. Above displays all of the metadata fields.

sample.proj[[1]]$unique_id #displays the unique ID we searned for
#> [1] "259854f7-b261-4c8c-8556-4b153deebc18"
sample.proj[[1]]$providers_id
#> NULL
sample.proj[[1]]$title # project title
#> [1] "Sample Project"
sample.proj[[1]]$project_privacy # project privacy
#> [1] "Public"
sample.proj[[1]]$date_added
#> [1] "2021-10-22T18:15:41.507481Z"
sample.proj[[1]]$date_modified
#> [1] "2021-10-22T18:15:41.507513Z"

We can also see that there are fields in sample.proj for BC and TK labels

sample.proj[[1]]$bc_labels %>% names()
#>  [1] "name"         "label_type"   "language_tag" "language"     "default_text"
#>  [6] "img_url"      "svg_url"      "community"    "translations" "created"     
#> [11] "updated"
sample.proj[[1]]$tk_labels %>% names()
#>  [1] "name"         "label_type"   "language_tag" "language"     "default_text"
#>  [6] "img_url"      "svg_url"      "community"    "translations" "created"     
#> [11] "updated"

There are many ways to parse the image files associated with TK and BC labels. Below is just one example of how to do so. You can see a BC Provenance label is displayed. All we had to do was specify the uniquqe ID and the API will take care of the rest!

imager::load.image(sample.proj[[1]]$bc_labels$img_url) %>% plot(axes=FALSE)

But what happens if you don’t know the unique ID for a project? We can still find it!

tmp<-index.all.projects() # Import a table of all Hub projects
id.search.out<-find.project.id(tmp, title="Two") #Search for projects with the word 'Two' in their title

Looks like there is one project that matches our criterion.

UQID<-as.character(id.search.out$unique_id)

Gets us the unique ID of the project we were looking for!

Step 3: Appending Unique IDs to our Sequences

Now that we know the relevant unique ID, we can append it to our FASTA headers. We will use the FASTA we read in before in Step 1.

outputFASTA(
  seqs = dsi$seqstr,
  seqid = dsi$seqid,
  uqID = UQID, # the uniqe ID for the project we found above in Step 3
  filename = "dsi.with.hub.metadata" # .fasta is automatically added
)