mean(1:5)[1] 3
…and how to use them!
R is an incredibly useful tool in bioinformatics:
Functions are reusable blocks of code which simplify repetitive tasks
Allow automation of complex workflows
Either pre-defined in packages, or user defined
From very simple ➡ complex multi-step tools
function name(what to perform function on)
mean(1:5)[1] 3
print("Hello world!")[1] "Hello world!"
User defined functions are created with the function() function:
add_ten <- function(x) # This assigns our new function the name "add_ten",
# and tells R to expect a single input, x.
{
y <- x + 10 # give y the value of x + 10
return(y) # and then return the result
} # curly brackets define where the function starts and ends
add_ten(1)[1] 11
This is a custom function created to calculate the GC content of some sequence data.
gc_content <- function(dna_sequence) {
# Remove non-ACGT bases
valid_bases <- dna_sequence[dna_sequence %in% c("A", "T", "G", "C")]
# Count Gs and Cs
g_count <- sum(valid_bases == "G")
c_count <- sum(valid_bases == "C")
# Calculate GC content as a percentage
gc_total <- g_count + c_count
sequence_length <- length(valid_bases)
# Calculate GC percentage
gc_percentage <- (gc_total / sequence_length) * 100
# Return the GC content as a percentage
return(gc_percentage)
}We’ll come back to this in a minute 🕑
4 main ways that R stores data, these are:
Vectors
Matrices
Data frames
Lists
Lists in R are different to lists in Python!
R can handle three main classes of data;
TRUE or FALSE).This is really important when we’re thinking about data structures!
x <- c(1, 2, 3, 4, 5)note that c() here is the concatenate function; it’s small but mighty, and combines all your inputs into a single vector
Vectors of different dimensions iterate when combined for operations.
a short vector will “loop” over itself if combined with a longer vector
For e.g.:
x <- c(1, 2, 3, 4, 5)
y <- c(1, 10)
x + yWarning in x + y: longer object length is not a multiple of shorter object
length
[1] 2 12 4 14 6
We can also create vectors with character values, for e.g. a string of bases
v1 <- c("C", "A", "T", "G", "T", "C")We need the c() function here, without it, R will throw us an error.
Or we can create logical/boolean vectors, useful for filtering data
v2 <- c(TRUE, FALSE, TRUE, TRUE)R is 1 indexed, not 0 indexed.
To access the elements in a vector, we use [ ].
To access the fourth element of x:
x <- c(1, 2, 3, 4, 5)
x[4][1] 4
The first and third elements:
x <- c(1, 2, 3, 4, 5)
x[c(1, 3)][1] 1 3
From the first to the fourth:
x <- c(1, 2, 3, 4, 5)
x[1:4][1] 1 2 3 4
To remove the third we use negative indexing:
x <- c(1, 2, 3, 4, 5)
x[-3][1] 1 2 4 5
Try it here: https://tinyurl.com/r-input
gc_content <- function(dna_sequence) {
# Remove non-ACGT bases
valid_bases <- dna_sequence[dna_sequence %in% c("A", "T", "G", "C")]
# Count Gs and Cs
g_count <- sum(valid_bases == "G")
c_count <- sum(valid_bases == "C")
# Calculate GC content as a percentage
gc_total <- g_count + c_count
sequence_length <- length(valid_bases)
# Calculate GC percentage
gc_percentage <- (gc_total / sequence_length) * 100
# Return the GC content as a percentage
return(gc_percentage)
}
v1 <- c("C", "A", "T", "G", "T", "C")
gc_content(v1)dim(), or nrow() & ncol()What if we’ve got multiple sequences?
We could create a matrix by sticking vectors together using cbind() or rbind(), or by using the matrix() function
seq_matrix <- matrix(c("A", "T", "G", "C", "G", "A",
"A", "C", "G", "T", "T", "T",
"G", "C", "C", "C", "G", "T"),
nrow = 3, byrow = TRUE) #byrow = TRUE gives the fill direction
# Assign row names (sequence IDs)
rownames(seq_matrix) <- c("seq1", "seq2", "seq3")
print(seq_matrix) [,1] [,2] [,3] [,4] [,5] [,6]
seq1 "A" "T" "G" "C" "G" "A"
seq2 "A" "C" "G" "T" "T" "T"
seq3 "G" "C" "C" "C" "G" "T"
Code to test above
gc_content <- function(dna_sequence) {
# Remove non-ACGT bases
valid_bases <- dna_sequence[dna_sequence %in% c("A", "T", "G", "C")]
# Count Gs and Cs
g_count <- sum(valid_bases == "G")
c_count <- sum(valid_bases == "C")
# Calculate GC content as a percentage
gc_total <- g_count + c_count
sequence_length <- length(valid_bases)
# Calculate GC percentage
gc_percentage <- (gc_total / sequence_length) * 100
# Return the GC content as a percentage
return(gc_percentage)
}
seq_matrix <- matrix(c("A", "T", "G", "C", "G", "A",
"A", "C", "G", "T", "T", "T",
"G", "C", "C", "C", "G", "T"),
nrow = 3, byrow = TRUE)
#byrow = TRUE specifies the fill direction
# Assign row names (sequence IDs)
rownames(seq_matrix) <- c("seq1", "seq2", "seq3")
gc_content(seq_matrix)The function gave us the GC content of all of the bases in the matrix, what if we want to find them all individually?
We index our matrix using [row, column]
gc_content(seq_matrix[1,])
gc_content(seq_matrix[2,])
gc_content(seq_matrix[3,])
But what if we’ve got loads of sequences?
We can use an apply() function.
apply() functionsapply() are a set of functions that allow you to loop over columns/rows in multidimensional data structures. We can use apply() to do all of this at once (but don’t worry about the details of this).
apply(seq_matrix, 1, gc_content)
# note that the 2nd part of the function is '1' because we are working by row.Lots of packages and functions expect matrix inputs
Efficient row-wise or column-wise operations
Faster than lots of vectors (and allow row-wise operations)
Assignment of meaningful row and column names
Can be stacked to n-dimensions and stored as an array (for multiple “slices”)
data.frame objectWe can combine our matrix from earlier with some metadata about our samples by using as.data.frame().
seq_matrix <- matrix(c("A", "T", "G", "C", "G", "A","A", "C", "G", "T", "T", "T","G", "C", "C", "C", "G", "T"), nrow = 3, byrow = TRUE)
rownames(seq_matrix) <- c("seq1", "seq2", "seq3") # Assign row names (sequence IDs)
colnames(seq_matrix) <-c("pos1", "pos2", "pos3", "pos4", "pos5", "pos6")
presence <- c(TRUE, FALSE, TRUE)
expression <- c(3.0, 0, 1.3)
df <- as.data.frame(cbind(seq_matrix, presence, expression))
class(df)[1] "data.frame"
df pos1 pos2 pos3 pos4 pos5 pos6 presence expression
seq1 A T G C G A TRUE 3
seq2 A C G T T T FALSE 0
seq3 G C C C G T TRUE 1.3
$ operator$ means “look inside”, and extracts a column, which is subsequently treated as a vector
df$presenceTRUE for presence?subset() function.df_subset <- subset(df, presence == TRUE) pos1 pos2 pos3 pos4 pos5 pos6 presence expression
seq1 A T G C G A TRUE 3
seq3 G C C C G T TRUE 1.3
apply(df_subset, 1, gc_content) seq1 seq3
50.00000 83.33333
read. and write. functionsLet’s create a list. First, we create vectors, and then use the list() function to stick them together.
species <- c("polymita_picta", "cepaea_nemoralis", "auriculella_pulchra")
numbers <- 1:50
logic <- c(TRUE, FALSE, FALSE, TRUE)
my_list <- list(species, numbers, logic)
my_list[[1]]
[1] "polymita_picta" "cepaea_nemoralis" "auriculella_pulchra"
[[2]]
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
[[3]]
[1] TRUE FALSE FALSE TRUE
[[ ]]my_list[1] for example, R will return us a list (just containing the first element, the species)my_list[[1]], R will return it in it’s original vector form.my_list[1:2] versus my_list[[1:2]]my_list[1:2] returns the 1st and 2nd vectors (as lists)
[[1]]
[1] "polymita_picta" "cepaea_nemoralis" "auriculella_pulchra"
[[2]]
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
my_list[[1:2]] returns the 2nd item of the 1st element
[1] "cepaea_nemoralis"
Perhaps more intuitively, we can combine [ ] and [[ ]]. To access the 2nd item of the 1st element, use my_list[[1]][2]:
[1] "cepaea_nemoralis"
There are some things that R is not the best tool for due to efficiency and scalability:
It is however very good at downstream analysis, datavis, and statistics.
There are loads of brilliant R resources online, including:
Advanced R - Hadley Wickham - available online, here: http://adv-r.had.co.nz/ - lots more detail about functions and data structures
A little book of R for Bioinformatics - Avril Coghlan & Nathan Brouwer - available online, here: https://brouwern.github.io/lbrb/ - incredibly useful for things like Bioconductor package installs
R for Biologists - - available online, here: https://omicstutorials.com/r-for-biologists-an-introductory-guide-to-bioinformatics-analysis/ - loads of worked examples of using R for bioinformatics, perhaps requires more basic R knowledge than the others