Functions and data structures in R…

…and how to use them!

Author

Dr Hannah Jackson

Learning Objectives

  • To understand the use of functions in R
  • To describe the 4 main data structures in R
  • To understand how functions can be used with various data structures in bioinformatics

R in bioinformatics

R is an incredibly useful tool in bioinformatics:

  • A wide range of specialised packages (so you don’t have to reinvent the wheel)
  • Powerful datavis packages
  • Integration with other tools (i.e. Python, packages like Rsamtools)
  • Reproducibility
  • Community support

Functions 🔧

  • Functions are reusable blocks of code which simplify repetitive tasks

  • Allow automation of complex workflows

  • Either pre-defined in packages, or user defined

  • From very simple ➡ complex multi-step tools

The anatomy of a function 🧠

function name(what to perform function on)

mean(1:5)
[1] 3
print("Hello world!")
[1] "Hello world!"

User defined functions are created with the function() function:

add_ten <- function(x) # This assigns our new function the name "add_ten", 
                       # and tells R to expect a single input, x.
  {
  y <- x + 10          # give y the value of x + 10
  return(y)            # and then return the result
}                      # curly brackets define where the function starts and ends


add_ten(1)
[1] 11

Calculating GC content in sequences 🧬

This is a custom function created to calculate the GC content of some sequence data.


gc_content <- function(dna_sequence) {
  # Remove non-ACGT bases
  valid_bases <- dna_sequence[dna_sequence %in% c("A", "T", "G", "C")]
  
  # Count Gs and Cs
  g_count <- sum(valid_bases == "G")
  c_count <- sum(valid_bases == "C")
  
  # Calculate GC content as a percentage
  gc_total <- g_count + c_count
  sequence_length <- length(valid_bases)
  
  # Calculate GC percentage
  gc_percentage <- (gc_total / sequence_length) * 100
  
  # Return the GC content as a percentage
  return(gc_percentage)
}

We’ll come back to this in a minute 🕑

Data structures 🧱

4 main ways that R stores data, these are:

  1. Vectors

  2. Matrices

  3. Data frames

  4. Lists

Important

Lists in R are different to lists in Python!

But first…

Remember 💡

R can handle three main classes of data;

  • numeric,
  • character,
  • logical (TRUE or FALSE).

This is really important when we’re thinking about data structures!

Vectors

  • 1-dimensional series of values.
  • This is the most basic data structure you’ll find in R.
  • All elements are of a single class (i.e. all numeric, all character)
  • Accessing the contents is different to other languages.
Numeric vectors
x <- c(1, 2, 3, 4, 5)

note that c() here is the concatenate function; it’s small but mighty, and combines all your inputs into a single vector

Important

Vectors of different dimensions iterate when combined for operations.

a short vector will “loop” over itself if combined with a longer vector

For e.g.:

x <- c(1, 2, 3, 4, 5)
y <- c(1, 10)

x + y
Warning in x + y: longer object length is not a multiple of shorter object
length
[1]  2 12  4 14  6
Character vectors

We can also create vectors with character values, for e.g. a string of bases

v1 <- c("C", "A", "T", "G", "T", "C")

We need the c() function here, without it, R will throw us an error.

Logical vectors

Or we can create logical/boolean vectors, useful for filtering data

v2 <- c(TRUE, FALSE, TRUE, TRUE)

Accesing vector elements

Important

R is 1 indexed, not 0 indexed.

To access the elements in a vector, we use [ ].

To access the fourth element of x:

x <- c(1, 2, 3, 4, 5)
x[4]
[1] 4

The first and third elements:

x <- c(1, 2, 3, 4, 5)
x[c(1, 3)]
[1] 1 3

From the first to the fourth:

x <- c(1, 2, 3, 4, 5)
x[1:4]
[1] 1 2 3 4

To remove the third we use negative indexing:

x <- c(1, 2, 3, 4, 5)
x[-3]
[1] 1 2 4 5

Why vectors? ❓

  • Computationally efficient
  • Element-wise operations (i.e. perform an operation on everything all at once)
  • Building blocks of more complicated data structures

Let’s try our GC content function out…

Try it here: https://tinyurl.com/r-input

Code to paste into above:


gc_content <- function(dna_sequence) {
  # Remove non-ACGT bases
  valid_bases <- dna_sequence[dna_sequence %in% c("A", "T", "G", "C")]
  
  # Count Gs and Cs
  g_count <- sum(valid_bases == "G")
  c_count <- sum(valid_bases == "C")
  
  # Calculate GC content as a percentage
  gc_total <- g_count + c_count
  sequence_length <- length(valid_bases)
  
  # Calculate GC percentage
  gc_percentage <- (gc_total / sequence_length) * 100
  
  # Return the GC content as a percentage
  return(gc_percentage)
}


v1 <- c("C", "A", "T", "G", "T", "C")

gc_content(v1)

Matricies

  • 2-dimensional
  • Data stored in rows and columns
  • Can be built from vectors
  • Must contain a single data type
  • Dimensions can be found with dim(), or nrow() & ncol()

Back to our example…

What if we’ve got multiple sequences?

We could create a matrix by sticking vectors together using cbind() or rbind(), or by using the matrix() function

seq_matrix <- matrix(c("A", "T", "G", "C", "G", "A", 
                       "A", "C", "G", "T", "T", "T", 
                       "G", "C", "C", "C", "G", "T"), 
                           nrow = 3, byrow = TRUE) #byrow = TRUE gives the fill direction

# Assign row names (sequence IDs)
rownames(seq_matrix) <- c("seq1", "seq2", "seq3")
print(seq_matrix)
     [,1] [,2] [,3] [,4] [,5] [,6]
seq1 "A"  "T"  "G"  "C"  "G"  "A" 
seq2 "A"  "C"  "G"  "T"  "T"  "T" 
seq3 "G"  "C"  "C"  "C"  "G"  "T" 

Performing the GC content function on a matrix

Code to test above


gc_content <- function(dna_sequence) {
  # Remove non-ACGT bases
  valid_bases <- dna_sequence[dna_sequence %in% c("A", "T", "G", "C")]
  
  # Count Gs and Cs
  g_count <- sum(valid_bases == "G")
  c_count <- sum(valid_bases == "C")
  
  # Calculate GC content as a percentage
  gc_total <- g_count + c_count
  sequence_length <- length(valid_bases)
  
  # Calculate GC percentage
  gc_percentage <- (gc_total / sequence_length) * 100
  
  # Return the GC content as a percentage
  return(gc_percentage)
}

seq_matrix <- matrix(c("A", "T", "G", "C", "G", "A", 
                       "A", "C", "G", "T", "T", "T", 
                       "G", "C", "C", "C", "G", "T"), 
                           nrow = 3, byrow = TRUE) 
          #byrow = TRUE specifies the fill direction

# Assign row names (sequence IDs)
rownames(seq_matrix) <- c("seq1", "seq2", "seq3")

gc_content(seq_matrix)

But that doesn’t work?!

The function gave us the GC content of all of the bases in the matrix, what if we want to find them all individually?

We index our matrix using [row, column]

gc_content(seq_matrix[1,])
gc_content(seq_matrix[2,])
gc_content(seq_matrix[3,])

But what if we’ve got loads of sequences?

We can use an apply() function.

apply() functions

Note

apply() are a set of functions that allow you to loop over columns/rows in multidimensional data structures. We can use apply() to do all of this at once (but don’t worry about the details of this).

apply(seq_matrix, 1, gc_content)

# note that the 2nd part of the function is '1' because we are working by row.

Why matricies? ❓

  • Lots of packages and functions expect matrix inputs

  • Efficient row-wise or column-wise operations

  • Faster than lots of vectors (and allow row-wise operations)

  • Assignment of meaningful row and column names

  • Can be stacked to n-dimensions and stored as an array (for multiple “slices”)

Dataframes

  • Like a matrix, but capable of storing different data classes
  • Most “real” data will be a data.frame object
  • Must have named columns
  • Accessed the same way as matricies (plus with column names)

Let’s make one

We can combine our matrix from earlier with some metadata about our samples by using as.data.frame().

seq_matrix <- matrix(c("A", "T", "G", "C", "G", "A","A", "C", "G", "T", "T", "T","G", "C", "C", "C", "G", "T"), nrow = 3, byrow = TRUE) 
rownames(seq_matrix) <- c("seq1", "seq2", "seq3") # Assign row names (sequence IDs)
colnames(seq_matrix) <-c("pos1", "pos2", "pos3", "pos4", "pos5", "pos6")

presence <- c(TRUE, FALSE, TRUE)
expression <- c(3.0, 0, 1.3)

df <- as.data.frame(cbind(seq_matrix, presence, expression))
class(df)
[1] "data.frame"
df
     pos1 pos2 pos3 pos4 pos5 pos6 presence expression
seq1    A    T    G    C    G    A     TRUE          3
seq2    A    C    G    T    T    T    FALSE          0
seq3    G    C    C    C    G    T     TRUE        1.3

Accessing information

  • This works the same way as a matrix
  • You can also use the column names to acces, with the $ operator

$ means “look inside”, and extracts a column, which is subsequently treated as a vector

df$presence

Back to the example…

  • What if we want to use our metadata to subset our sequences to only contain those with values of TRUE for presence?
  • We can subset based on column values using the subset() function.
df_subset <- subset(df, presence == TRUE)
     pos1 pos2 pos3 pos4 pos5 pos6 presence expression
seq1    A    T    G    C    G    A     TRUE          3
seq3    G    C    C    C    G    T     TRUE        1.3
apply(df_subset, 1, gc_content)
    seq1     seq3 
50.00000 83.33333 

Why data.frames?❓

  • Allows storage of metadata alongside other data due to ability to handle mixed datatypes
  • Column naming can force organisation
  • Easy to manipulate or subset
  • Easy to import/export with read. and write. functions

Lists

  • More complex 🤯
  • Allow storage of objects of different lengths in a self-contained object
    • in a data.frame, each component part (column/row) must be the same length - this is not true of lists
  • Can contain vectors, matricies, data.frames, and lists
  • Incredibly flexible, but less structured

Lists

Let’s create a list. First, we create vectors, and then use the list() function to stick them together.

species <- c("polymita_picta", "cepaea_nemoralis", "auriculella_pulchra")
numbers <- 1:50
logic <- c(TRUE, FALSE, FALSE, TRUE)

my_list <- list(species, numbers, logic)

my_list
[[1]]
[1] "polymita_picta"      "cepaea_nemoralis"    "auriculella_pulchra"

[[2]]
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50

[[3]]
[1]  TRUE FALSE FALSE  TRUE

Accessing lists

  • Note the double square brackets [[ ]]
  • If we select items with my_list[1] for example, R will return us a list (just containing the first element, the species)
  • If we select the same with double brackets my_list[[1]], R will return it in it’s original vector form.
  • The syntax is a bit confusing - my_list[1:2] versus my_list[[1:2]]

Accessing lists

my_list[1:2] returns the 1st and 2nd vectors (as lists)

[[1]]
[1] "polymita_picta"      "cepaea_nemoralis"    "auriculella_pulchra"

[[2]]
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50

my_list[[1:2]] returns the 2nd item of the 1st element

[1] "cepaea_nemoralis"

Perhaps more intuitively, we can combine [ ] and [[ ]]. To access the 2nd item of the 1st element, use my_list[[1]][2]:

[1] "cepaea_nemoralis"

Why lists?❓

  • Useful for managaging complex datasets where elements vary in structure and type
  • Lots of functions will output as a list
  • Hierarchical or nested data
  • Storing results from multiple outputs (bootstrapping, iterative procedures)

R is not all powerful

There are some things that R is not the best tool for due to efficiency and scalability:

  • Sequence alignment
  • de novo assembly
  • Variant calling (although R can process VCF files and perform downstream analysis)

It is however very good at downstream analysis, datavis, and statistics.

Things to read 📚

There are loads of brilliant R resources online, including:

Advanced R - Hadley Wickham - available online, here: http://adv-r.had.co.nz/ - lots more detail about functions and data structures

A little book of R for Bioinformatics - Avril Coghlan & Nathan Brouwer - available online, here: https://brouwern.github.io/lbrb/ - incredibly useful for things like Bioconductor package installs

R for Biologists - - available online, here: https://omicstutorials.com/r-for-biologists-an-introductory-guide-to-bioinformatics-analysis/ - loads of worked examples of using R for bioinformatics, perhaps requires more basic R knowledge than the others

More help please!!! 🆘

  • R for dummies
  • StackOverflow
  • Online forums (but use these with caution…)
  • The “Awesome R learning” github repository https://github.com/iamericfletcher/awesome-r-learning-resources

Useful packages 📦

  1. Sequence analysis:
  • Biostrings; msa
  1. Genomic analysis:
  • rtracklayer; Genomicranges
  1. Transcriptomics
  • DEseq2; limma; edgeR
  1. Phylogenetics
  • ape; ggtree
  1. DataViz
  • ggplot2; circlize; plotly
  1. Machine Learning
  • randomForest; caret