mean(1:5)[1] 3
…and how to use them!
https://tinyurl.com/functions-and-structures
R is an incredibly useful tool in bioinformatics:
Functions are reusable blocks of code which simplify repetitive tasks
Allow automation of complex workflows
Either pre-defined in packages, or user defined
From very simple ➡ complex multi-step tools
function name(what to perform function on)
User defined functions are created with the function() function:
gc_content <- function(dna_sequence) {
# Remove non-ACGT bases
valid_bases <- dna_sequence[dna_sequence %in% c("A", "T", "G", "C")]
# Count Gs and Cs
g_count <- sum(valid_bases == "G")
c_count <- sum(valid_bases == "C")
# Calculate GC content as a percentage
gc_total <- g_count + c_count
sequence_length <- length(valid_bases)
# Calculate GC percentage
gc_percentage <- (gc_total / sequence_length) * 100
# Return the GC content as a percentage
return(gc_percentage)
}We’ll come back to this in a minute 🕑
4 main ways that R stores data, these are:
Vectors
Matrices
Data frames
Lists
Important
Lists in R are different to lists in Python!
Remember 💡
R can handle three main classes of data;
TRUE or FALSE).This is really important when we’re thinking about data structures!
Important
Vectors of different dimensions iterate when combined for operations.
a short vector will “loop” over itself if combined with a longer vector
For e.g.:
We can also create vectors with character values, for e.g. a string of bases
We need the c() function here, without it, R will throw us an error.
Important
R is 1 indexed, not 0 indexed.
To access the elements in a vector, we use [ ].
To access the fourth element of x:
Important
R is 1 indexed, not 0 indexed.
To access the elements in a vector, we use [ ].
The first and third elements:
Important
R is 1 indexed, not 0 indexed.
To access the elements in a vector, we use [ ].
From the first to the fourth:
Important
R is 1 indexed, not 0 indexed.
To access the elements in a vector, we use [ ].
To remove the third:
Try it here: https://tinyurl.com/r-input
gc_content <- function(dna_sequence) {
# Remove non-ACGT bases
valid_bases <- dna_sequence[dna_sequence %in% c("A", "T", "G", "C")]
# Count Gs and Cs
g_count <- sum(valid_bases == "G")
c_count <- sum(valid_bases == "C")
# Calculate GC content as a percentage
gc_total <- g_count + c_count
sequence_length <- length(valid_bases)
# Calculate GC percentage
gc_percentage <- (gc_total / sequence_length) * 100
# Return the GC content as a percentage
return(gc_percentage)
}
v1 <- c("C", "A", "T", "G", "T", "C")
gc_content(v1)dim(), or nrow() & ncol()What if we’ve got multiple sequences?
We could create a matrix by sticking vectors together using cbind() or rbind(), or by using the matrix() function
seq_matrix <- matrix(c("A", "T", "G", "C", "G", "A",
"A", "C", "G", "T", "T", "T",
"G", "C", "C", "C", "G", "T"),
nrow = 3, byrow = TRUE) #byrow = TRUE gives the fill direction
# Assign row names (sequence IDs)
rownames(seq_matrix) <- c("seq1", "seq2", "seq3")
print(seq_matrix) [,1] [,2] [,3] [,4] [,5] [,6]
seq1 "A" "T" "G" "C" "G" "A"
seq2 "A" "C" "G" "T" "T" "T"
seq3 "G" "C" "C" "C" "G" "T"
gc_content <- function(dna_sequence) {
# Remove non-ACGT bases
valid_bases <- dna_sequence[dna_sequence %in% c("A", "T", "G", "C")]
# Count Gs and Cs
g_count <- sum(valid_bases == "G")
c_count <- sum(valid_bases == "C")
# Calculate GC content as a percentage
gc_total <- g_count + c_count
sequence_length <- length(valid_bases)
# Calculate GC percentage
gc_percentage <- (gc_total / sequence_length) * 100
# Return the GC content as a percentage
return(gc_percentage)
}
seq_matrix <- matrix(c("A", "T", "G", "C", "G", "A",
"A", "C", "G", "T", "T", "T",
"G", "C", "C", "C", "G", "T"),
nrow = 3, byrow = TRUE)
#byrow = TRUE specifies the fill direction
# Assign row names (sequence IDs)
rownames(seq_matrix) <- c("seq1", "seq2", "seq3")
gc_content(seq_matrix)The function gave us the GC content of all of the bases in the matrix, what if we want to find them all individually?
We index our matrix using [row, column]
gc_content(seq_matrix[1,])
gc_content(seq_matrix[2,])
gc_content(seq_matrix[3,])
But what if we’ve got loads of sequences?
We can use an apply() function.
apply() functionsNote
apply() are a set of functions that allow you to loop over columns/rows in multidimensional data structures. We can use apply() to do all of this at once (but don’t worry about the details of this).
Lots of packages and functions expect matrix inputs
Efficient row-wise or column-wise operations
Faster than lots of vectors (and allow row-wise operations)
Assignment of meaningful row and column names
Can be stacked to n-dimensions and stored as an array (for multiple “slices”)
data.frame objectWe can combine our matrix from earlier with some metadata about our samples by using as.data.frame().
seq_matrix <- matrix(c("A", "T", "G", "C", "G", "A","A", "C", "G", "T", "T", "T","G", "C", "C", "C", "G", "T"), nrow = 3, byrow = TRUE)
rownames(seq_matrix) <- c("seq1", "seq2", "seq3") # Assign row names (sequence IDs)
colnames(seq_matrix) <-c("pos1", "pos2", "pos3", "pos4", "pos5", "pos6")
presence <- c(TRUE, FALSE, TRUE)
expression <- c(3.0, 0, 1.3)
df <- as.data.frame(cbind(seq_matrix, presence, expression))
class(df)[1] "data.frame"
pos1 pos2 pos3 pos4 pos5 pos6 presence expression
seq1 A T G C G A TRUE 3
seq2 A C G T T T FALSE 0
seq3 G C C C G T TRUE 1.3
$ operator$ means “look inside”, and extracts a column, which is subsequently treated as a vector
TRUE for presence?subset() function. pos1 pos2 pos3 pos4 pos5 pos6 presence expression
seq1 A T G C G A TRUE 3
seq3 G C C C G T TRUE 1.3
seq1 seq3
50.00000 83.33333
read. and write. functionsLet’s create a list. First, we create vectors, and then use the list() function to stick them together.
species <- c("polymita_picta", "cepaea_nemoralis", "auriculella_pulchra")
numbers <- 1:50
logic <- c(TRUE, FALSE, FALSE, TRUE)
my_list <- list(species, numbers, logic)
my_list[[1]]
[1] "polymita_picta" "cepaea_nemoralis" "auriculella_pulchra"
[[2]]
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
[[3]]
[1] TRUE FALSE FALSE TRUE
[[ ]]my_list[1] for example, R will return us a list (just containing the first element, the species)my_list[[1]], R will return it in it’s original vector form.my_list[1:2] versus my_list[[1:2]]my_list[1:2] returns the 1st and 2nd vectors (as lists)
[[1]]
[1] "polymita_picta" "cepaea_nemoralis" "auriculella_pulchra"
[[2]]
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
my_list[[1:2]] returns the 2nd item of the 1st element
[1] "cepaea_nemoralis"
Perhaps more intuitively, we can combine [ ] and [[ ]]. To access the 2nd item of the 1st element, use my_list[[1]][2]:
[1] "cepaea_nemoralis"
There are some things that R is not the best tool for due to efficiency and scalability:
It is however very good at downstream analysis, datavis, and statistics.
There are loads of brilliant R resources online, including:
Advanced R - Hadley Wickham - available online, here: http://adv-r.had.co.nz/ - lots more detail about functions and data structures
A little book of R for Bioinformatics - Avril Coghlan & Nathan Brouwer - available online, here: https://brouwern.github.io/lbrb/ - incredibly useful for things like Bioconductor package installs
R for Biologists - - available online, here: https://omicstutorials.com/r-for-biologists-an-introductory-guide-to-bioinformatics-analysis/ - loads of worked examples of using R for bioinformatics, perhaps requires more basic R knowledge than the others