Disclaimer: Most of this material is shamelessly copied or adapted from the Bioconductor How-to-Guide and Hadley Wickham’s book on R packages. Other sources are mentioned in the text.

What, why, who, when and how

What is a Bioconductor package 📦?

A Bioconductor package is an R package 📦 that provides tools 🔨 for the analysis and comprehension of high-throughput genomic data and is available on the Bioconductor repository. Like any R package 📦 , Bioconductor package bundle together code (in functions), data, documentation and tests in order to share these with others. Broadly speaking in Bioconductor, there are packages 📦 are of three main types:

  • Annotation: Data-base like packages 📦 that provide information linking identifiers to other information
  • Experiment data: Provide data sets that are used to illustrate particular analyses.
  • Software: Provide implementation of algorithms, access to resources or visualizations.
  • Workflow: Long-form vignettes that illustrate how to analyse a particular type of data, such as RNA-seq.

For the purposes of this workshop, we will only consider software packages.

Why make a Bioconductor package 📦?

Bioconductor packages 📦 provide a simple way to distribute R code and documentation related to analysis and comprehension of high throughput genomic data. Packages on Bioconductor 📦 are basically guaranteed to be installable, as they are regularly built, installed, and tested on multiple systems. They are also required to be high-quality, well maintained and thoroughly documented. By creating such a package and making it available via Bioconductor, you are contributing to open science. Open science is a movement that tries to ensure that all aspects of the scientific process, which includes software, are accessible. This ensures reproducible research and increases efficiency by reducing replication of work.

Besides these lofty reasons for making a Bioconductor package 📦, being the creator and maintainer of a Bioconductor package 📦 is good for your career. It increases the reach and significance of your work, as it allows other scientists 👩‍🔬 to make direct use of your research.

Who should make Bioconductor packages 📦?

Absolutely anyone with some R programming experience can make a Bioconductor package 📦.

When should I make a Bioconductor packages 📦?

You are probably ready to make a Bioconductor package 📦 when you have a set of cohesive functions that address one or multiple problems in the analyses or comprehension of high-throughput genomic data. It is important that your package 📦 does not merely present an alternative to existing solutions, but constitutes an advance. However, do not be discouraged if your idea is already implemented. In such cases consider approaching the author of the package that has implemented your work and offer to collaborate 👨‍💻 and help maintain their package.

How do I make a Bioconductor package 📦?

Well it all starts with an R package 📦. This will be the main focus of the rest of the workshop.

Design principles for Bioconductor packages 📦

Reuse

There are 1,823 software packages on Bioconductor packages 📦 currently available. Many of these packages 📦 have implemented thoughtful data structures and built infrastructure around these. In particular, the Bioconductor Core Team have spent considerable resources designing and developing well-tested packages 📦 that are central to the Bioconductor project. It is vital that your package makes use of these data structures and infrastructures whenever possible. For example, high-throughput genomic data is commonly stored in the SummarizedExperiment object class. If your package makes use of such data, you should consider interoperating with the SummarizedExperiment package. Here is a list of core packages that you should try to incorporate if appropriate:

Modularity

Packages 📦 in Bioconductor are meant to be modular. That means that you should try to break down your functions into smaller parts. This has multiple advantages:

  • Shorter functions are easier to comprehend
  • Functions can be used across multiple problems
  • Users can check intermediate results and adjust analysis

In particular avoid copy pasting code. Instead just write a function and apply this function.

Note that the concepts of modularity and reuse are sometimes referred to as interoperability.

Reproducible research

Documentation of your package 📦, in particular example use cases provided through vignettes, are the corner stone of Bioconductor. This ensures that users know how to correctly apply your package 📦. This is part of enabling reproducible research and use.

Making a Bioconductor package 📦

We will now get to the hands-on part of the workshop. For this you require RStudio, as it is a great place to get started because RStudio has already added tools 🔨 to make package 📦 creation and dissemination easier for end users.

For this workshop the following packages are required:

library(usethis)
library(roxygen2)
library(devtools)
library(goodpractice)
library(BiocCheck)

Always start with version control 🐱

Version control is particularly important for software development. This is because you will want to keep track of every change, so in case you accidentally break something you can go back in time and fix your errors. Essentially just think of version control as the “Track Changes” feature in Microsoft Word on steroids.

How to version control 🐱?

There are multiple ways to handle version control, we will default to Git. Git is a software that facilitates version control. It was designed particularly for coordinating work among software developers. Git-based projects are hosted on cloud-based services, such as GitHub, Bitbucket and GitLab. You can think of these as Google Drive, but much more organized. These allow you to store your projects, share your work with other people and even allow others to make changes.

Version control 🐱 with GitHub and RStudio is easy

Here we will work with Git Hub for the sake of specificity. In order for you to be able to work with Git Hub you will need to get the following:

  • Get a GitHub account
  • Install git on your local machine
  • Connect your git to R Studio

Lucky for this tutorial you will only need to get a GitHub account and connect it to RStudio, because we will be working on AWS where git is already installed. If you were wondering how to install git, just follow this link.

To connect RStudio to your git, we will be using the usethis package:

use_git_config(user.name = "Jane Doe", user.email = "jane@example.org")

Simply replace “Jane Doe” with your GitHub username and enter your email instead of “jane@example.org”. Make sure to use the one that was used when you signed up to GitHub.

If you are feeling a bit overwhelmed with the whole version control 🐱 concept, don’t worry. There is Jenny Bryan and Jim Hester’s excellent book “Happy Git with R” which is available for free online.

Let’s get started

Now we are almost ready to start. Note that in this workshop we will each write a little package 📦 that gives praise to the user. This does obviously not constitute a Bioconductor tool, however we want you to focus on the package developing part instead of thinking of high-throughput genomic applications.

Set some layout parameters

However before we initialize our package 📦 we want to make some layout configurations to our RStudio session, so your code will be formatted the way that Bioconductor prefers it. Just think of this step as setting the layout parameters on a Word document.

  • Set up the tab to be 4 spaces. You can find this in the ‘Tools’ menu if you select the ‘Global Options…’ and then look at the ‘Code’ panel under the ‘Editing’ tab.