Functions and data structures in R…

…and how to use them!

Dr Hannah Jackson

hannah.j.jackson@kcl.ac.uk

Learning Objectives

To understand the use of functions in R
To describe the 4 main data structures in R
To understand how functions can be used with various data structures in bioinformatics

To follow along:

https://tinyurl.com/functions-and-structures

R in bioinformatics

R is an incredibly useful tool in bioinformatics:

A wide range of specialised packages (so you don’t have to reinvent the wheel)
Powerful datavis packages
Integration with other tools (i.e. Python, packages like Rsamtools)
Reproducibility
Community support

Functions 🔧

Functions are reusable blocks of code which simplify repetitive tasks
Allow automation of complex workflows
Either pre-defined in packages, or user defined
From very simple ➡ complex multi-step tools

The anatomy of a function 🧠

function name(what to perform function on)

mean(1:5)

[1] 3

print("Hello world!")

[1] "Hello world!"

User defined functions are created with the function() function:

add_ten <- function(x) # This assigns our new function the name "add_ten", 
                       # and tells R to expect a single input, x.
  {
  y <- x + 10          # give y the value of x + 10
  return(y)            # and then return the result
}                      # curly brackets define where the function starts and ends


add_ten(1)

[1] 11

Calculating GC content in sequences 🧬


gc_content <- function(dna_sequence) {
  # Remove non-ACGT bases
  valid_bases <- dna_sequence[dna_sequence %in% c("A", "T", "G", "C")]
  
  # Count Gs and Cs
  g_count <- sum(valid_bases == "G")
  c_count <- sum(valid_bases == "C")
  
  # Calculate GC content as a percentage
  gc_total <- g_count + c_count
  sequence_length <- length(valid_bases)
  
  # Calculate GC percentage
  gc_percentage <- (gc_total / sequence_length) * 100
  
  # Return the GC content as a percentage
  return(gc_percentage)
}

We’ll come back to this in a minute 🕑

Data structures 🧱

4 main ways that R stores data, these are:

Vectors
Matrices
Data frames
Lists

Important

Lists in R are different to lists in Python!

But first…

Remember 💡

R can handle three main classes of data;

numeric,
character,
logical (TRUE or FALSE).

This is really important when we’re thinking about data structures!

Vectors

1-dimensional series of values.
This is the most basic data structure you’ll find in R.
All elements are of a single class (i.e. all numeric, all character)
Accessing the contents is different to other languages.

A numeric vector:

x <- c(1, 2, 3, 4, 5)

note that c() here is the concatenate function; it’s small but mighty, and combines all your inputs into a single vector

Vectors

Important

Vectors of different dimensions iterate when combined for operations.

a short vector will “loop” over itself if combined with a longer vector

For e.g.:

x <- c(1, 2, 3, 4, 5)
y <- c(1, 10)

x + y

[1]  2 12  4 14  6

Vectors

Character vectors

We can also create vectors with character values, for e.g. a string of bases

v1 <- c("C", "A", "T", "G", "T", "C")

We need the c() function here, without it, R will throw us an error.

Logical vectors

Or we can create logical/boolean vectors, useful for filtering data

v2 <- c(TRUE, FALSE, TRUE, TRUE)

Accessing elements

Important

R is 1 indexed, not 0 indexed.

To access the elements in a vector, we use [ ].

To access the fourth element of x:

x <- c(1, 2, 3, 4, 5)
x[4]

[1] 4

Accessing elements

Important

R is 1 indexed, not 0 indexed.

To access the elements in a vector, we use [ ].

The first and third elements:

x <- c(1, 2, 3, 4, 5)
x[c(1, 3)]

[1] 1 3

Accessing elements

Important

R is 1 indexed, not 0 indexed.

To access the elements in a vector, we use [ ].

From the first to the fourth:

x <- c(1, 2, 3, 4, 5)
x[1:4]

[1] 1 2 3 4

Accessing elements

Important

R is 1 indexed, not 0 indexed.

To access the elements in a vector, we use [ ].

To remove the third:

x <- c(1, 2, 3, 4, 5)
x[-3]

[1] 1 2 4 5

Why vectors? ❓

Computationally efficient
Element-wise operations (i.e. perform an operation on everything all at once)
Building blocks of more complicated data structures

Let’s try our GC content function out…

Try it here: https://tinyurl.com/r-input

Code for previous slide:


gc_content <- function(dna_sequence) {
  # Remove non-ACGT bases
  valid_bases <- dna_sequence[dna_sequence %in% c("A", "T", "G", "C")]
  
  # Count Gs and Cs
  g_count <- sum(valid_bases == "G")
  c_count <- sum(valid_bases == "C")
  
  # Calculate GC content as a percentage
  gc_total <- g_count + c_count
  sequence_length <- length(valid_bases)
  
  # Calculate GC percentage
  gc_percentage <- (gc_total / sequence_length) * 100
  
  # Return the GC content as a percentage
  return(gc_percentage)
}


v1 <- c("C", "A", "T", "G", "T", "C")

gc_content(v1)

Matrices

2-dimensional
Data stored in rows and columns
Can be built from vectors
Must contain a single data type
Dimensions can be found with dim(), or nrow() & ncol()

Back to our example…

What if we’ve got multiple sequences?

We could create a matrix by sticking vectors together using cbind() or rbind(), or by using the matrix() function

seq_matrix <- matrix(c("A", "T", "G", "C", "G", "A", 
                       "A", "C", "G", "T", "T", "T", 
                       "G", "C", "C", "C", "G", "T"), 
                           nrow = 3, byrow = TRUE) #byrow = TRUE gives the fill direction

# Assign row names (sequence IDs)
rownames(seq_matrix) <- c("seq1", "seq2", "seq3")
print(seq_matrix)

     [,1] [,2] [,3] [,4] [,5] [,6]
seq1 "A"  "T"  "G"  "C"  "G"  "A" 
seq2 "A"  "C"  "G"  "T"  "T"  "T" 
seq3 "G"  "C"  "C"  "C"  "G"  "T"

Performing the GC content function on a matrix

Code for previous slide


gc_content <- function(dna_sequence) {
  # Remove non-ACGT bases
  valid_bases <- dna_sequence[dna_sequence %in% c("A", "T", "G", "C")]
  
  # Count Gs and Cs
  g_count <- sum(valid_bases == "G")
  c_count <- sum(valid_bases == "C")
  
  # Calculate GC content as a percentage
  gc_total <- g_count + c_count
  sequence_length <- length(valid_bases)
  
  # Calculate GC percentage
  gc_percentage <- (gc_total / sequence_length) * 100
  
  # Return the GC content as a percentage
  return(gc_percentage)
}

seq_matrix <- matrix(c("A", "T", "G", "C", "G", "A", 
                       "A", "C", "G", "T", "T", "T", 
                       "G", "C", "C", "C", "G", "T"), 
                           nrow = 3, byrow = TRUE) 
          #byrow = TRUE specifies the fill direction

# Assign row names (sequence IDs)
rownames(seq_matrix) <- c("seq1", "seq2", "seq3")

gc_content(seq_matrix)

But that doesn’t work?!

The function gave us the GC content of all of the bases in the matrix, what if we want to find them all individually?

We index our matrix using [row, column]

gc_content(seq_matrix[1,])
gc_content(seq_matrix[2,])
gc_content(seq_matrix[3,])

But what if we’ve got loads of sequences?

We can use an apply() function.

`apply()` functions

Note

apply() are a set of functions that allow you to loop over columns/rows in multidimensional data structures. We can use apply() to do all of this at once (but don’t worry about the details of this).

apply(seq_matrix, 1, gc_content)

# note that the 2nd part of the function is '1' because we are working by row.

Why matricies? ❓

Lots of packages and functions expect matrix inputs
Efficient row-wise or column-wise operations
Faster than lots of vectors (and allow row-wise operations)
Assignment of meaningful row and column names
Can be stacked to n-dimensions and stored as an array (for multiple “slices”)

Dataframes

Like a matrix, but capable of storing different data classes
Most “real” data will be a data.frame object
Must have named columns
Accessed the same way as matrices (plus with column names)

Let’s make one

We can combine our matrix from earlier with some metadata about our samples by using as.data.frame().

seq_matrix <- matrix(c("A", "T", "G", "C", "G", "A","A", "C", "G", "T", "T", "T","G", "C", "C", "C", "G", "T"), nrow = 3, byrow = TRUE) 
rownames(seq_matrix) <- c("seq1", "seq2", "seq3") # Assign row names (sequence IDs)
colnames(seq_matrix) <-c("pos1", "pos2", "pos3", "pos4", "pos5", "pos6")

presence <- c(TRUE, FALSE, TRUE)
expression <- c(3.0, 0, 1.3)

df <- as.data.frame(cbind(seq_matrix, presence, expression))
class(df)

[1] "data.frame"

df

     pos1 pos2 pos3 pos4 pos5 pos6 presence expression
seq1    A    T    G    C    G    A     TRUE          3
seq2    A    C    G    T    T    T    FALSE          0
seq3    G    C    C    C    G    T     TRUE        1.3

Accessing information

This works the same way as a matrix
You can also use the column names to acces, with the $ operator

$ means “look inside”, and extracts a column, which is subsequently treated as a vector

df$presence

Back to the example…

What if we want to use our metadata to subset our sequences to only contain those with values of TRUE for presence?
We can subset based on column values using the subset() function.

df_subset <- subset(df, presence == TRUE)

     pos1 pos2 pos3 pos4 pos5 pos6 presence expression
seq1    A    T    G    C    G    A     TRUE          3
seq3    G    C    C    C    G    T     TRUE        1.3

apply(df_subset, 1, gc_content)

    seq1     seq3 
50.00000 83.33333

Why data.frames?❓

Allows storage of metadata alongside other data due to ability to handle mixed datatypes
Column naming can force organisation
Easy to manipulate or subset
Easy to import/export with read. and write. functions

Lists

More complex 🤯
Allow storage of objects of different lengths in a self-contained object
- in a data.frame, each component part (column/row) must be the same length - this is not true of lists
Can contain vectors, matricies, data.frames, and lists
Incredibly flexible, but less structured

Lists

Let’s create a list. First, we create vectors, and then use the list() function to stick them together.

species <- c("polymita_picta", "cepaea_nemoralis", "auriculella_pulchra")
numbers <- 1:50
logic <- c(TRUE, FALSE, FALSE, TRUE)

my_list <- list(species, numbers, logic)

my_list

[[1]]
[1] "polymita_picta"      "cepaea_nemoralis"    "auriculella_pulchra"

[[2]]
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50

[[3]]
[1]  TRUE FALSE FALSE  TRUE

Accessing lists

Note the double square brackets [[ ]]
If we select items with my_list[1] for example, R will return us a list (just containing the first element, the species)
If we select the same with double brackets my_list[[1]], R will return it in it’s original vector form.
The syntax is a bit confusing - my_list[1:2] versus my_list[[1:2]]

Accessing lists

my_list[1:2] returns the 1st and 2nd vectors (as lists)

[[1]]
[1] "polymita_picta"      "cepaea_nemoralis"    "auriculella_pulchra"

[[2]]
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50

my_list[[1:2]] returns the 2nd item of the 1st element

[1] "cepaea_nemoralis"

Perhaps more intuitively, we can combine [ ] and [[ ]]. To access the 2nd item of the 1st element, use my_list[[1]][2]:

[1] "cepaea_nemoralis"

Why lists?❓

Useful for managaging complex datasets where elements vary in structure and type
Lots of functions will output as a list
Hierarchical or nested data
Storing results from multiple outputs (bootstrapping, iterative procedures)

R is not all powerful

There are some things that R is not the best tool for due to efficiency and scalability:

Sequence alignment
de novo assembly
Variant calling (although R can process VCF files and perform downstream analysis)

It is however very good at downstream analysis, datavis, and statistics.

Things to read 📚

There are loads of brilliant R resources online, including:

Advanced R - Hadley Wickham - available online, here: http://adv-r.had.co.nz/ - lots more detail about functions and data structures

A little book of R for Bioinformatics - Avril Coghlan & Nathan Brouwer - available online, here: https://brouwern.github.io/lbrb/ - incredibly useful for things like Bioconductor package installs

R for Biologists - - available online, here: https://omicstutorials.com/r-for-biologists-an-introductory-guide-to-bioinformatics-analysis/ - loads of worked examples of using R for bioinformatics, perhaps requires more basic R knowledge than the others

Functions and data structures in R…

Learning Objectives

To follow along:

R in bioinformatics

Functions 🔧

The anatomy of a function 🧠

Calculating GC content in sequences 🧬

Data structures 🧱

But first…

Vectors

Vectors

Vectors

Character vectors

Logical vectors

Accessing elements

Accessing elements

Accessing elements

Accessing elements

Why vectors? ❓

Let’s try our GC content function out…

Code for previous slide:

Matrices

Back to our example…

Performing the GC content function on a matrix

Code for previous slide

But that doesn’t work?!

apply() functions

Why matricies? ❓

Dataframes

Let’s make one

Accessing information

Back to the example…

Why data.frames?❓

Lists

Lists

Accessing lists

Accessing lists

Why lists?❓

R is not all powerful

Things to read 📚

`apply()` functions