This is a guide on using R for computational genomics. This guideline aims at pinpointing the P/R locus in the basidiomycete Coprinopsis cinerea as well as calculating the GC content. However, this pipeline could be adapted for other loci in other organisms.
Recombination is one of the major evolutionary processes and therefore plays a crucial role in the evolution of genomes (1). Several correlations between recombination and a large number of functional and structural properties, such as the length of intros (2) and the length of genes (3). However, the strongest found correlation is between recombination and the GC content in many organisms including. Examples of these organisms included: Drosophila melanogaster (4), Saccharomyces cerevisiae(5), Caenorhabditis elegans (4) and mammals (6). Despite the fact that it is not yet clear whether GC content drives recombination or the converse, it is clear that both are correlated and therefore, by assessing the GC content across the genome of any species, we may infer the recombination rate of specific regions in an indirect fashion (7).
Supergenes are genetic structures made up of a group of neighboring genes which are inherited together (8) and for this reason we would expect to see a different GC content when comparing them to other regions in the genome (9). The PR locus meets the definition of supergene and plays a very important role in the mating type identity of tetrapolar basidiomicetes (10). This locus consists of several pheromones and pheromone receptors that are used to recognize other individuals of the same species (11). For that reason, we will use as an example one of the pheromones within the P/R locus in order to identify this locus in the genome of Coprinopsis cinerea and to calculate the GC content in a sliding window manner by using R (12).
Firstly, we will install and load the packages that we will need during this guideline in order to run the different analyses. In this case, since the packages are already installed after having used the function install.packages(‘name of the package’), we will simply load the packages by using the function library(package).
library(ape)
library(Biostrings)
library(rBLAST)
library(magrittr)
library(tidyr)
library(readr)
library(seqinr)
library(ggplot2)
library(gggenes)
library(dplyr)
library(R.utils)
We need a Whole Genome Sequence (WGS) (13) and one known gene (14) within the locus of interest of Coprinopsis cinerea in fasta format.
In our case we will use exclusively the EnsemblFungi databank (https://fungi.ensembl.org/index.html).
This is the link (ftp://ftp.ensemblgenomes.org/pub/fungi/release-48/fasta/fungi_basidiomycota1_collection/coprinopsis_cinerea_okayama7_130_gca_000182895/dna/) which contains all chromosomes in fasta format. Since we already know that the P/R locus in Coprinopsis cinerea is located within chromosome 10 (1), we will simply download the following file:
Coprinopsis_cinerea_okayama7_130_gca_000182895.CC3.dna.chromosome.10.fa.gz.
After downloading this file we have to unzip it and keep it in an R object (WGS_database).
unzip_file = "gunzip ~/PROJECT/Coprinopsis_cinerea/Data/Coprinopsis_cinerea_okayama7_130_gca_000182895.CC3.dna.chromosome.10.fa.gz"
system(unzip_file)
WGS_database = "~/PROJECT/Coprinopsis_cinerea/Data/Coprinopsis_cinerea_okayama7_130_gca_000182895.CC3.dna.chromosome.10.fa"
Once we have unzipped the WGS, it is time to download a known gene that is found within the P/R locus.
Since we know that within this locus there are pheromone and pheromone receptor genes, we search in the EnsemblFungi databank for one pheromone or pheromone receptor. In this case we will use the pheromone Phb2.1 that can be found in: https://fungi.ensembl.org/Coprinopsis_cinerea_okayama7_130_gca_000182895/Gene/Sequence?db=core;g=CC1G_02135;r=10:1728857-1729042;t=EAU87376
After downloading the gene we will keep it in an R object (pheromone_phb2.1):
pheromone_phb2.1 = "~/PROJECT/Coprinopsis_cinerea/Data/C_cinerea_phb2.1.fasta"
BLAST is a bioinformatic software, maintained by the National Center for Biotechnology Information (NCBI), that allows you to find regions of similarity between biological sequences (15). This program compares protein or nucleotide sequences to sequence databases and calculates the statistical significance. BLAST is not a single tool, but rather a suite of tools, so, in this guideline we will make use of the blastn tool as we are using nucleotide data in fasta format.
We will take advantage of this software by finding the sequence of the downloaded gene (pheromone Phb2.1) in the WGS of Coprinopsis cinerea. Thus, the query sequence will be the gene and the subject sequence the WGS. The blastn tool uses subject sequences which are by default the online available databases so, since we want to use the downloaded WGS as the subject sequence, we will use a tool called makeblastdb provided by BLAST+. This tool converts a subject fasta file into an indexed version that will be used as a database.
makeblastdb(file = WGS_database, dbtype = "nucl")
We will use the function system() to run OS commands in R. Here below is the complete script that will allow us to run the tool blastn in R.
# Setting variables for blastn function
blastn = "/usr/bin/blastn"
input = "input_file"
evalue = 10
format = 6
# Running blastn function
blastn_function <- system2(command = blastn, args = c("-db",
WGS_database, "-query", pheromone_phb2.1, "-outfmt", format,
"-evalue", evalue, "-ungapped"), wait = TRUE, stdout = TRUE)
# Setting column names
colnames <- c("qseqid", "sseqid", "pident", "length", "mismatch",
"gapopen", "qstart", "qend", "sstart", "send", "evalue",
"bitscore")
# Wrangling the data obtained after using the blastn function
# in a human readable format
blast_output <- blastn_function %>% as_tibble() %>% separate(col = value,
into = colnames, sep = "\t", convert = TRUE)
blast_output
As we can see, the gene that codes for the pheromone phb2.1 starts at the position 1729042 and ends at the position 1728857 of chromosome 10 in Coprinopsis cinerea. This info will be useful to determine the position of the P/R locus by using an annotated genome of Coprinopsis cinerea, the Integrative Genomics Viewer (IGV) and previous reported data (1,2).
The code below illustrates a visual figure of the P/R locus in Coprinopsis cinerea:
genes_arrows = read.delim(file = "~/PROJECT/Coprinopsis_cinerea/Data/filtered_4.gff3",
header = TRUE, sep = "\t")
genes_arrows <- genes_arrows %>% mutate(newV1 = ifelse(strand ==
"reverse", start, end), newV3 = ifelse(strand == "reverse",
end, start), end = newV1, start = newV3) %>% select(-newV1,
-newV3)
dummies <- make_alignment_dummies(genes_arrows, ggplot2::aes(xmin = start,
xmax = end, y = molecule, id = gene), on = "genE")
PR_locus_plotting = ggplot2::ggplot(genes_arrows, ggplot2::aes(xmin = start,
xmax = end, y = molecule, fill = gene)) + geom_gene_arrow() +
ggplot2::geom_blank(data = dummies) + theme_genes() + xlab("Genome Location in Chr.10") +
theme(axis.title.x = element_text(margin = margin(t = 25,
r = 0, b = 20, l = 0))) + ylab("") + theme(axis.title = element_text(size = 16))
PR_locus_plotting
Fig. 1 Schematic showing the genomic structure of P/R locus in Coprinopsis cinerea. P/R genes (represented by arrows) are colored as indicated in the legend with different color grades. Long arrows represent pheromone receptors while short arrows represent pheromones.
Now we will extract the position of the gene within the WGS from the output obtained after running blast in R
position_pheromone = blast_output %>%
select(c("sstart", "send"))
position_pheromone
In order to asses the GC content in a sliding window manner of a zoomed-in region of chromosome 10, we will extract a genomic region of 200kb: 100kb upstream and downstream from the pheromone phb2.1. By doing so we can easily compare the GC content of the P/R locus with the two proximal regions.
# adding +/- 100k to get the 200kb genomic sequence
starting = position_pheromone$sstart - 1e+05
ending = position_pheromone$send + 1e+05
chromosome = 10
position_pheromone_200kb <- data.frame(chromosome, starting,
ending)
# transforming object position_pheromone_200kb to a txt file
write_tsv(position_pheromone_200kb, path = "~/PROJECT/Coprinopsis_cinerea/Data/location_phb2.1.bed")
Now we will extract this genomic sequence of 200kb that contains the P/R locus from the WGS and we will keep it in a fasta file (location_phb2.1.fa):
y = "sed '1d' ~/PROJECT/Coprinopsis_cinerea/Data/location_phb2.1.bed > ~/PROJECT/Coprinopsis_cinerea/Data/location_phb2.1_modified.bed"
z = "bedtools getfasta -fi ~/PROJECT/Coprinopsis_cinerea/Data/Coprinopsis_cinerea_okayama7_130_gca_000182895.CC3.dna.chromosome.10.fa -bed ~/PROJECT/Coprinopsis_cinerea/Data/location_phb2.1_modified.bed -fo ~/PROJECT/Coprinopsis_cinerea/Data/location_phb2.1.fa"
system(y)
system(z)
The code chunk found below will calculate the GC content in a sliding window manner with a window size of 2000 bp.
refseq = read.fasta("~/PROJECT/Coprinopsis_cinerea/Data/location_phb1.1.fa")
wiggle_output <- file("~/PROJECT/Coprinopsis_cinerea/Data/GC_content/output5.txt",
"a")
printf("track type=print wiggle_0 name=fileName description=fileName\n",
file = wiggle_output)
window_size = 2000
refseq_names = names(refseq)
for (i in 1:length(refseq)) {
fasta = refseq[[i]]
fasta_name = refseq_names[[i]]
fasta_length = length(fasta) - 1
mod_window_size = ifelse(fasta_length < window_size, fasta_length,
window_size)
print(mod_window_size)
starts <- seq(1, length(fasta) - mod_window_size, by = window_size)
n <- length(starts) # Find the length of the vector 'starts'
chunkGCs <- numeric(n) # Make a vector of the same length
# as vector 'starts', but just containing zeroes
printf("variableStep chrom=%s span=%d\n", fasta_name, mod_window_size,
file = wiggle_output)
for (i in 1:n) {
chunk <- fasta[starts[i]:(starts[i] + mod_window_size -
1)]
chunkGC <- GC(chunk)
chunkGCs[i] <- chunkGC
printf("%s %s\n", starts[i], ifelse(is.na(chunkGC), 0,
chunkGC), file = wiggle_output)
}
}
count_table <- read.csv("~/PROJECT/Coprinopsis_cinerea/Data/GC_content/output5.txt",
header = TRUE, sep = "")
c_table <- count_table[-1, ]
c_table_updated = c_table %>% select(c("track", "type.print"))
c_table_updated_2 = c_table_updated %>% rename(Position = track,
GC_content = type.print)
rownames(c_table_updated_2) <- NULL
write.table(c_table_updated_2, "~/PROJECT/Coprinopsis_cinerea/Data/GC_content/output5_new.txt",
append = FALSE, sep = " ", dec = ".", row.names = TRUE, col.names = TRUE)
my_data_200kb <- read.delim(file = "~/PROJECT/Coprinopsis_cinerea/Data/GC_content/output5_new.txt",
header = TRUE, sep = " ")
# Now we have to change the position values for the real
# position in chromosome 10.
x = "cut -f2 ~/PROJECT/Coprinopsis_cinerea/Data/location_phb1.1_modified.bed"
beginning_1 <- system(x, intern = TRUE)
beginning_2 = as.numeric(beginning_1)
Location = mapply(function(x, y) x + y, my_data_200kb$Position,
beginning_2)
my_data_200kb_new = cbind(my_data_200kb, Location)
number_rows = nrow(my_data_200kb_new)
PR_locus = rep(0, number_rows)
my_data_200kb_new_1 <- data.frame(my_data_200kb_new, PR_locus = PR_locus)
write.table(my_data_200kb_new_1, file = "~/PROJECT/my_data_200kb_new.txt",
sep = "\t", row.names = FALSE)
GC_content_200kb = read.delim("~/PROJECT/my_data_200kb_new.txt",
header = TRUE, sep = "\t")
q = c(46:57)
GC_content_200kb[q, "PR_locus"] = 1
GC_content_200kb$GC_content <- GC_content_200kb$GC_content *
100
Now that we have calculated the GC content across a zoomed-in region of chromosome 10, it is time to plot it.
Below we will plot the GC content in two different ways. The first plot will show a zoomed-in Y-axis (from a GC content of 47.5% to 0.60%). The second plot will show a zoomed-out Y-axis (from a GC content of 0% to 100%)
For zoomed-in Y-axis:
a1 = ggplot(GC_content_200kb, aes(Location, GC_content, color = PR_locus)) +
geom_line(size = 1) + stat_smooth(col = "red") + theme(legend.position = "none") +
xlab("Genome Location in Chr.10 (bp)") + ylab("GC Content (%)") +
theme(axis.title.x = element_text(margin = margin(t = 20,
r = 20, b = 20, l = 0))) + theme(axis.title.y = element_text(margin = margin(t = 20,
r = 20, b = 20, l = 20))) + theme(axis.title = element_text(size = 16))
a1
Fig. 2 GC content (%) across a genomic region located in chromosome 10 in Coprinopsis cinerea. The region in which the P/R locus is located is shown in blue. The Y-axis has been zoomed-in (it goes from 47.5% to 60% )
For zoomed-out Y-axis:
a2 = ggplot(GC_content_200kb, aes(x = Location, y = GC_content,
color = PR_locus)) + geom_line(size = 1) + ylim(0, 100) +
theme(legend.position = "none") + xlab("Genome Location in Chr.10 (bp)") +
ylab("GC Content (%)") + theme(axis.title.x = element_text(margin = margin(t = 20,
r = 20, b = 20, l = 0))) + theme(axis.title.y = element_text(margin = margin(t = 20,
r = 20, b = 20, l = 20))) + theme(axis.title = element_text(size = 16))
a2
Fig. 3 GC content (%) across a genomic region located in chromosome 10 in Coprinopsis cinerea. The region in which the P/R locus is located is shown in blue. The Y-axis has not been zoomed-in (it goes from 0% to 100% )
Now we will calculate the GC content in a sliding window manner across the entire chromosome 10 and we will highlight the region in which the PR locus is located.
library("seqinr")
refseq = read.fasta("~/PROJECT/Coprinopsis_cinerea/Data/GC_content/WGS_GC_content/Coprinopsis_cinerea_okayama7_130_gca_000182895.CC3.dna.chromosome.10.fa")
wiggle_output <- file("~/PROJECT/Coprinopsis_cinerea/Data/GC_content/WGS_GC_content/output5.txt",
"a")
printf("track type=print wiggle_0 name=fileName description=fileName\n",
file = wiggle_output)
window_size = 2000
refseq_names = names(refseq)
for (i in 1:length(refseq)) {
fasta = refseq[[i]]
fasta_name = refseq_names[[i]]
fasta_length = length(fasta) - 1
mod_window_size = ifelse(fasta_length < window_size, fasta_length,
window_size)
print(mod_window_size)
starts <- seq(1, length(fasta) - mod_window_size, by = window_size)
n <- length(starts)
chunkGCs <- numeric(n)
printf("variableStep chrom=%s span=%d\n", fasta_name, mod_window_size,
file = wiggle_output)
for (i in 1:n) {
chunk <- fasta[starts[i]:(starts[i] + mod_window_size -
1)]
chunkGC <- GC(chunk)
chunkGCs[i] <- chunkGC
printf("%s %s\n", starts[i], ifelse(is.na(chunkGC), 0,
chunkGC), file = wiggle_output)
}
}
count_table <- read.csv("~/PROJECT/Coprinopsis_cinerea/Data/GC_content/WGS_GC_content/output5.txt",
header = TRUE, sep = "")
c_table <- count_table[-1, ]
c_table_updated = c_table %>% select(c("track", "type.print"))
c_table_updated_2 = c_table_updated %>% rename(Position = track,
GC_content = type.print)
rownames(c_table_updated_2) <- NULL
write.table(c_table_updated_2, "~/PROJECT/Coprinopsis_cinerea/Data/GC_content/WGS_GC_content/output5_new.txt",
append = FALSE, sep = " ", dec = ".", row.names = TRUE, col.names = TRUE)
my_data <- read.delim(file = "~/PROJECT/Coprinopsis_cinerea/Data/GC_content/WGS_GC_content/output5_new.txt",
header = TRUE, sep = " ")
number_rows = nrow(my_data)
PR_locus = rep(0, number_rows)
my_da <- data.frame(my_data, PR_locus = PR_locus)
write.table(my_da, file = "~/PROJECT/my_data_200kb_new_2.txt",
sep = "\t", row.names = FALSE)
v = read.delim("~/PROJECT/my_data_200kb_new_2.txt", header = TRUE,
sep = "\t")
b = "egrep -n '1720001|1740001' ~/PROJECT/my_data_200kb_new_2.txt"
system(b)
q = c(862:872)
v[q, "PR_locus"] = 1
v$GC_content <- v$GC_content * 100
For zoomed-in Y-axis:
b1 = ggplot(v, aes(Position, GC_content, col = PR_locus)) + geom_line() +
stat_smooth(col = "red") + theme(legend.position = "none") +
xlab("Genome Location in Chr.10 (bp)") + ylab("GC Content (%)") +
theme(axis.title.x = element_text(margin = margin(t = 20,
r = 20, b = 20, l = 0))) + theme(axis.title.y = element_text(margin = margin(t = 20,
r = 20, b = 20, l = 20))) + theme(axis.title = element_text(size = 16))
b1
Fig. 4 GC content (%) across the entire chromosome 10 in Coprinopsis cinerea. The region in which the P/R locus is located is shown in blue. The Y-axis has been zoomed-in (it goes from 45% to 60% )
For zoomed-out Y-axis:
b2 = ggplot(v, aes(Position, GC_content, col = PR_locus)) + geom_line() +
ylim(0, 100) + theme(legend.position = "none") + xlab("Genome Location in Chr.10 (bp)") +
ylab("GC Content (%)") + theme(axis.title.x = element_text(margin = margin(t = 20,
r = 20, b = 20, l = 0))) + theme(axis.title.y = element_text(margin = margin(t = 20,
r = 20, b = 20, l = 20))) + theme(axis.title = element_text(size = 16))
b2
Fig. 5 GC content (%) across the entire chromosome 10 in Coprinopsis cinerea. The region in which the P/R locus is located is shown in blue. The Y-axis has been zoomed-in (it goes from 0% to 100%)
1.- Stapley, J., Feulner, P., Johnston, S. E., Santure, A. W., & Smadja, C. M. (2017). Recombination: the good, the bad and the variable. Philosophical transactions of the Royal Society of London. Series B, Biological sciences, 372(1736), 20170279.
2.- Comeron, J. M., & Kreitman, M. (2000). The correlation between intron length and recombination in drosophila. Dynamic equilibrium between mutational and selective forces. Genetics, 156(3), 1175–1190.
3.- Prachumwat, A., DeVincentis, L., & Palopoli, M. F. (2004). Intron size correlates positively with recombination rate in Caenorhabditis elegans. Genetics, 166(3), 1585–1590.
4.- Marais, G., Mouchiroud, D., & Duret, L. (2001). Does recombination improve selection on codon usage? Lessons from nematode and fly complete genomes. Proceedings of the National Academy of Sciences of the United States of America, 98(10), 5688–5692.
5.- Gerton, J. L., DeRisi, J., Shroff, R., Lichten, M., Brown, P. O., & Petes, T. D. (2000). Global mapping of meiotic recombination hotspots and coldspots in the yeast Saccharomyces cerevisiae. Proceedings of the National Academy of Sciences of the United States of America, 97(21), 11383–11390.
6.- Duret, L., & Arndt, P. F. (2008). The impact of recombination on nucleotide substitutions in the human genome. PLoS genetics, 4(5), e1000071
7.- Marsolier-Kergoat, M. C., & Yeramian, E. (2009). GC content and recombination: reassessing the causal effects for the Saccharomyces cerevisiae genome. Genetics, 183(1), 31–38
8.- Thompson, M. J., & Jiggins, C. D. (2014). Supergenes and their role in evolution. Heredity, 113(1), 1–8.
9.- Campos, J. L., Charlesworth, B., & Haddrill, P. R. (2012). Molecular evolution in nonrecombining regions of the Drosophila melanogaster genome. Genome biology and evolution, 4(3), 278–288.
10.- Jones, S. K., Jr, & Bennett, R. J. (2011). Fungal mating pheromones: choreographing the dating game. Fungal genetics and biology : FG & B, 48(7), 668–676.
11.- Sun, S., Coelho, M. A., Heitman, J., & Nowrousian, M. (2019). Convergent evolution of linked mating-type loci in basidiomycete fungi. PLoS genetics, 15(9), e1008365.
12.- R Core Team (2018) R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.
13.- Stajich, J. E., Wilke, S. K., Ahrén, D., Au, C. H., Birren, B. W., Borodovsky, M., Burns, C., Canbäck, B., Casselton, L. A., Cheng, C. K., Deng, J., Dietrich, F. S., Fargo, D. C., Farman, M. L., Gathman, A. C., Goldberg, J., Guigó, R., Hoegger, P. J., Hooker, J. B., Huggins, A., … Pukkila, P. J. (2010). Insights into evolution of multicellular fungi from the assembled chromosomes of the mushroom Coprinopsis cinerea (Coprinus cinereus). Proceedings of the National Academy of Sciences of the United States of America, 107(26), 11889–11894.
14.- Coelho, M. A., Bakkeren, G., Sun, S., Hood, M. E., & Giraud, T. (2017). Fungal Sex: The Basidiomycota. Microbiology spectrum, 5(3), 10.1128/microbiolspec.FUNK-0046-2016.
15.- Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990). Basic local alignment search tool. Journal of molecular biology, 215(3), 403–410.