Functions and data structures in R…

…and how to use them!

Author

Dr Hannah Jackson

Learning Objectives

To understand the use of functions in R
To describe the 4 main data structures in R
To understand how functions can be used with various data structures in bioinformatics

R in bioinformatics

R is an incredibly useful tool in bioinformatics:

A wide range of specialised packages (so you don’t have to reinvent the wheel)
Powerful datavis packages
Integration with other tools (i.e. Python, packages like Rsamtools)
Reproducibility
Community support

Functions 🔧

Functions are reusable blocks of code which simplify repetitive tasks
Allow automation of complex workflows
Either pre-defined in packages, or user defined
From very simple ➡ complex multi-step tools

The anatomy of a function 🧠

function name(what to perform function on)

mean(1:5)

[1] 3

print("Hello world!")

[1] "Hello world!"

User defined functions are created with the function() function:

add_ten <- function(x) # This assigns our new function the name "add_ten", 
                       # and tells R to expect a single input, x.
  {
  y <- x + 10          # give y the value of x + 10
  return(y)            # and then return the result
}                      # curly brackets define where the function starts and ends


add_ten(1)

[1] 11

Calculating GC content in sequences 🧬

This is a custom function created to calculate the GC content of some sequence data.


gc_content <- function(dna_sequence) {
  # Remove non-ACGT bases
  valid_bases <- dna_sequence[dna_sequence %in% c("A", "T", "G", "C")]
  
  # Count Gs and Cs
  g_count <- sum(valid_bases == "G")
  c_count <- sum(valid_bases == "C")
  
  # Calculate GC content as a percentage
  gc_total <- g_count + c_count
  sequence_length <- length(valid_bases)
  
  # Calculate GC percentage
  gc_percentage <- (gc_total / sequence_length) * 100
  
  # Return the GC content as a percentage
  return(gc_percentage)
}

We’ll come back to this in a minute 🕑

Data structures 🧱

4 main ways that R stores data, these are:

Vectors
Matrices
Data frames
Lists

Important

Lists in R are different to lists in Python!

But first…

Remember 💡

R can handle three main classes of data;

numeric,
character,
logical (TRUE or FALSE).

This is really important when we’re thinking about data structures!

Vectors

1-dimensional series of values.
This is the most basic data structure you’ll find in R.
All elements are of a single class (i.e. all numeric, all character)
Accessing the contents is different to other languages.

Numeric vectors

x <- c(1, 2, 3, 4, 5)

note that c() here is the concatenate function; it’s small but mighty, and combines all your inputs into a single vector

Important

Vectors of different dimensions iterate when combined for operations.

a short vector will “loop” over itself if combined with a longer vector

For e.g.:

x <- c(1, 2, 3, 4, 5)
y <- c(1, 10)

x + y

Warning in x + y: longer object length is not a multiple of shorter object
length

[1]  2 12  4 14  6

Character vectors

We can also create vectors with character values, for e.g. a string of bases

v1 <- c("C", "A", "T", "G", "T", "C")

We need the c() function here, without it, R will throw us an error.

Logical vectors

Or we can create logical/boolean vectors, useful for filtering data

v2 <- c(TRUE, FALSE, TRUE, TRUE)

Accesing vector elements

Important

R is 1 indexed, not 0 indexed.

To access the elements in a vector, we use [ ].

To access the fourth element of x:

x <- c(1, 2, 3, 4, 5)
x[4]

[1] 4

The first and third elements:

x <- c(1, 2, 3, 4, 5)
x[c(1, 3)]

[1] 1 3

From the first to the fourth:

x <- c(1, 2, 3, 4, 5)
x[1:4]

[1] 1 2 3 4

To remove the third we use negative indexing:

x <- c(1, 2, 3, 4, 5)
x[-3]

[1] 1 2 4 5

Why vectors? ❓

Computationally efficient
Element-wise operations (i.e. perform an operation on everything all at once)
Building blocks of more complicated data structures

Let’s try our GC content function out…

Try it here: https://tinyurl.com/r-input

Code to paste into above:


gc_content <- function(dna_sequence) {
  # Remove non-ACGT bases
  valid_bases <- dna_sequence[dna_sequence %in% c("A", "T", "G", "C")]
  
  # Count Gs and Cs
  g_count <- sum(valid_bases == "G")
  c_count <- sum(valid_bases == "C")
  
  # Calculate GC content as a percentage
  gc_total <- g_count + c_count
  sequence_length <- length(valid_bases)
  
  # Calculate GC percentage
  gc_percentage <- (gc_total / sequence_length) * 100
  
  # Return the GC content as a percentage
  return(gc_percentage)
}


v1 <- c("C", "A", "T", "G", "T", "C")

gc_content(v1)

Matricies

2-dimensional
Data stored in rows and columns
Can be built from vectors
Must contain a single data type
Dimensions can be found with dim(), or nrow() & ncol()

Back to our example…

What if we’ve got multiple sequences?

We could create a matrix by sticking vectors together using cbind() or rbind(), or by using the matrix() function

seq_matrix <- matrix(c("A", "T", "G", "C", "G", "A", 
                       "A", "C", "G", "T", "T", "T", 
                       "G", "C", "C", "C", "G", "T"), 
                           nrow = 3, byrow = TRUE) #byrow = TRUE gives the fill direction

# Assign row names (sequence IDs)
rownames(seq_matrix) <- c("seq1", "seq2", "seq3")
print(seq_matrix)

     [,1] [,2] [,3] [,4] [,5] [,6]
seq1 "A"  "T"  "G"  "C"  "G"  "A" 
seq2 "A"  "C"  "G"  "T"  "T"  "T" 
seq3 "G"  "C"  "C"  "C"  "G"  "T"

Performing the GC content function on a matrix

Code to test above


gc_content <- function(dna_sequence) {
  # Remove non-ACGT bases
  valid_bases <- dna_sequence[dna_sequence %in% c("A", "T", "G", "C")]
  
  # Count Gs and Cs
  g_count <- sum(valid_bases == "G")
  c_count <- sum(valid_bases == "C")
  
  # Calculate GC content as a percentage
  gc_total <- g_count + c_count
  sequence_length <- length(valid_bases)
  
  # Calculate GC percentage
  gc_percentage <- (gc_total / sequence_length) * 100
  
  # Return the GC content as a percentage
  return(gc_percentage)
}

seq_matrix <- matrix(c("A", "T", "G", "C", "G", "A", 
                       "A", "C", "G", "T", "T", "T", 
                       "G", "C", "C", "C", "G", "T"), 
                           nrow = 3, byrow = TRUE) 
          #byrow = TRUE specifies the fill direction

# Assign row names (sequence IDs)
rownames(seq_matrix) <- c("seq1", "seq2", "seq3")

gc_content(seq_matrix)

But that doesn’t work?!

The function gave us the GC content of all of the bases in the matrix, what if we want to find them all individually?

We index our matrix using [row, column]

gc_content(seq_matrix[1,])
gc_content(seq_matrix[2,])
gc_content(seq_matrix[3,])

But what if we’ve got loads of sequences?

We can use an apply() function.

`apply()` functions

Note

apply() are a set of functions that allow you to loop over columns/rows in multidimensional data structures. We can use apply() to do all of this at once (but don’t worry about the details of this).

apply(seq_matrix, 1, gc_content)

# note that the 2nd part of the function is '1' because we are working by row.

Why matricies? ❓

Lots of packages and functions expect matrix inputs
Efficient row-wise or column-wise operations
Faster than lots of vectors (and allow row-wise operations)
Assignment of meaningful row and column names
Can be stacked to n-dimensions and stored as an array (for multiple “slices”)

Dataframes

Like a matrix, but capable of storing different data classes
Most “real” data will be a data.frame object
Must have named columns
Accessed the same way as matricies (plus with column names)

Let’s make one

We can combine our matrix from earlier with some metadata about our samples by using as.data.frame().

seq_matrix <- matrix(c("A", "T", "G", "C", "G", "A","A", "C", "G", "T", "T", "T","G", "C", "C", "C", "G", "T"), nrow = 3, byrow = TRUE) 
rownames(seq_matrix) <- c("seq1", "seq2", "seq3") # Assign row names (sequence IDs)
colnames(seq_matrix) <-c("pos1", "pos2", "pos3", "pos4", "pos5", "pos6")

presence <- c(TRUE, FALSE, TRUE)
expression <- c(3.0, 0, 1.3)

df <- as.data.frame(cbind(seq_matrix, presence, expression))
class(df)

[1] "data.frame"

df

     pos1 pos2 pos3 pos4 pos5 pos6 presence expression
seq1    A    T    G    C    G    A     TRUE          3
seq2    A    C    G    T    T    T    FALSE          0
seq3    G    C    C    C    G    T     TRUE        1.3

Accessing information

This works the same way as a matrix
You can also use the column names to acces, with the $ operator

$ means “look inside”, and extracts a column, which is subsequently treated as a vector

df$presence

Back to the example…

What if we want to use our metadata to subset our sequences to only contain those with values of TRUE for presence?
We can subset based on column values using the subset() function.

df_subset <- subset(df, presence == TRUE)

     pos1 pos2 pos3 pos4 pos5 pos6 presence expression
seq1    A    T    G    C    G    A     TRUE          3
seq3    G    C    C    C    G    T     TRUE        1.3

apply(df_subset, 1, gc_content)

    seq1     seq3 
50.00000 83.33333

Why data.frames?❓

Allows storage of metadata alongside other data due to ability to handle mixed datatypes
Column naming can force organisation
Easy to manipulate or subset
Easy to import/export with read. and write. functions

Lists

More complex 🤯
Allow storage of objects of different lengths in a self-contained object
- in a data.frame, each component part (column/row) must be the same length - this is not true of lists
Can contain vectors, matricies, data.frames, and lists
Incredibly flexible, but less structured

Lists

Let’s create a list. First, we create vectors, and then use the list() function to stick them together.

species <- c("polymita_picta", "cepaea_nemoralis", "auriculella_pulchra")
numbers <- 1:50
logic <- c(TRUE, FALSE, FALSE, TRUE)

my_list <- list(species, numbers, logic)

my_list

[[1]]
[1] "polymita_picta"      "cepaea_nemoralis"    "auriculella_pulchra"

[[2]]
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50

[[3]]
[1]  TRUE FALSE FALSE  TRUE

Accessing lists

Note the double square brackets [[ ]]
If we select items with my_list[1] for example, R will return us a list (just containing the first element, the species)
If we select the same with double brackets my_list[[1]], R will return it in it’s original vector form.
The syntax is a bit confusing - my_list[1:2] versus my_list[[1:2]]

Accessing lists

my_list[1:2] returns the 1st and 2nd vectors (as lists)

[[1]]
[1] "polymita_picta"      "cepaea_nemoralis"    "auriculella_pulchra"

[[2]]
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50

my_list[[1:2]] returns the 2nd item of the 1st element

[1] "cepaea_nemoralis"

Perhaps more intuitively, we can combine [ ] and [[ ]]. To access the 2nd item of the 1st element, use my_list[[1]][2]:

[1] "cepaea_nemoralis"

Why lists?❓

Useful for managaging complex datasets where elements vary in structure and type
Lots of functions will output as a list
Hierarchical or nested data
Storing results from multiple outputs (bootstrapping, iterative procedures)

R is not all powerful

There are some things that R is not the best tool for due to efficiency and scalability:

Sequence alignment
de novo assembly
Variant calling (although R can process VCF files and perform downstream analysis)

It is however very good at downstream analysis, datavis, and statistics.

Things to read 📚

There are loads of brilliant R resources online, including:

Advanced R - Hadley Wickham - available online, here: http://adv-r.had.co.nz/ - lots more detail about functions and data structures

A little book of R for Bioinformatics - Avril Coghlan & Nathan Brouwer - available online, here: https://brouwern.github.io/lbrb/ - incredibly useful for things like Bioconductor package installs

R for Biologists - - available online, here: https://omicstutorials.com/r-for-biologists-an-introductory-guide-to-bioinformatics-analysis/ - loads of worked examples of using R for bioinformatics, perhaps requires more basic R knowledge than the others

More help please!!! 🆘

R for dummies
StackOverflow
Online forums (but use these with caution…)
The “Awesome R learning” github repository https://github.com/iamericfletcher/awesome-r-learning-resources

Useful packages 📦

Sequence analysis:

Biostrings; msa

Genomic analysis:

rtracklayer; Genomicranges

Transcriptomics

DEseq2; limma; edgeR

Phylogenetics

ape; ggtree

DataViz

ggplot2; circlize; plotly

Machine Learning

randomForest; caret

Learning Objectives

R in bioinformatics

Functions 🔧

The anatomy of a function 🧠

Calculating GC content in sequences 🧬

Data structures 🧱

But first…

Vectors

Numeric vectors

Character vectors

Logical vectors

Accesing vector elements

Why vectors? ❓

Let’s try our GC content function out…

Code to paste into above:

Matricies

Back to our example…

Performing the GC content function on a matrix

But that doesn’t work?!

apply() functions

Why matricies? ❓

Dataframes

Let’s make one

Accessing information

Back to the example…

Why data.frames?❓

Lists

Lists

Accessing lists

Accessing lists

Why lists?❓

R is not all powerful

Things to read 📚

More help please!!! 🆘

Useful packages 📦

`apply()` functions