We have 20 different mice. Some of them have a genetic mutation that causes a rare disease called wobbly whisker syndrome (wws)! It’s our job to figure out which mutation is the culprit. We’ve been given some genotype and RNA-seq data for all of the mice. Lets dig into it to see if we can figure out where the mutation is, and how it affects the their transcriptional profiles.
Before we can get started with the analysis, we have to go over some basic R commands. We are going to be using two different data structures in this tutorial. The first is a vector:
#make the vectors
x <- c(1, 2, 3)
y <- c(4, 5, 6)
print(x)
#> [1] 1 2 3
print(y)
#> [1] 4 5 6
Vectors are indexed with square brackets []. Unlike most other languages, indexing starts at 1, not 0.
print(x[1])
#> [1] 1
We will also use dataframes. Think of dataframes as a 2-dimensional table, similar to an excel spreadsheet. Here is how you define one:
df = data.frame(
column1 = c(1, 2, 3),
column2 = c(4, 5, 6),
column3 = c("x", "y", "z")
)
print(df)
#> column1 column2 column3
#> 1 1 4 x
#> 2 2 5 y
#> 3 3 6 z
All rows and columns of a dataframe can be numerically indexed similarly to vectors with the format df[rows, columns]. Specific columns can also be accessed with the $ operator.
df[1,] #get the first row
#> column1 column2 column3
#> 1 1 4 x
df[,2:3] #get second and third column
#> column2 column3
#> 1 4 x
#> 2 5 y
#> 3 6 z
df$column1 #return values in first column as a vector
#> [1] 1 2 3
Finally, we’re quickly going to go over control structures and loops. Sometimes, we’ll only want a code block to execute if another condition is true. To do this, we’ll use an if statement:
x = 3
if(x == 3){
print("x is 3")
}else{
print("x is not 3")
}
#> [1] "x is 3"
Sometimes, we’ll want to repeat the same task a set number of times. A common example of this is running the same operation on every row of a data frame. To do this, we will use a “for” loop:
#let's add one to every value in x
x = c(1, 2, 3)
for(i in 1:length(x)){
x[i] <- x[i] + 1
}
print(x)
#> [1] 2 3 4
That’s it for the basics! We’ll go more in depth about them as we get further into this tutorial.
Typically, at the start of every R script, we load in any libraries we will need for the analysis. Libraries are chunks of code that other people have already written for us, so we can save time! We will be using two in this tutorial:
library(ggplot2)
library(ggpubr)
The wet lab researchers for this project have sent us two csv files. One for the genotype data, and one for the RNA-seq data. Lets load them in and take a look at them!
gdf <- read.csv("genotype_data.csv")
rna.counts <- read.csv("rnaseq_counts.csv")
rownames(rna.counts) <- rna.counts$X #set the rownames to the first column! read.csv won't do it for you
rna.counts <- rna.counts[,-1] #get rid of the extra gene name column once row names are set
de.analysis <- read.csv("rnaseq_DE_results.csv")
First, lets check out the genotype data. Each row represents a site where mice commonly have mutations. The first two columns are the chromosome and coordinate of each site. The third column is the name of the gene that the mutation affects, and the fourth column is a fishers exact pvalue representing the probability that the mutation is associated with the wws mice. The remaining columns are the nucleotides that each mouse has at each site.
head(gdf, 2)
#> chrom pos gene p.val wws.mouse.1 wws.mouse.2 wws.mouse.3 wws.mouse.4
#> 1 1 30201 nc3 0.08668731 T N T N
#> 2 1 30231 w92l 0.58204334 T T T T
#> wws.mouse.5 wws.mouse.6 wws.mouse.7 wws.mouse.8 wws.mouse.9 wws.mouse.10
#> 1 T C T T T N
#> 2 T T T T T A
#> wt.mouse.1 wt.mouse.2 wt.mouse.3 wt.mouse.4 wt.mouse.5 wt.mouse.6 wt.mouse.7
#> 1 T T T T T T T
#> 2 T T A T T T A
#> wt.mouse.8 wt.mouse.9 wt.mouse.10
#> 1 T T T
#> 2 T A T
The RNA-sequencing data shows the number of transcripts sequenced for each gene within each mouse sample:
head(rna.counts, 2)
#> wws.mouse.1.sample.1 wws.mouse.1.sample.2 wws.mouse.1.sample.3
#> nc3 9 4 12
#> w92l 18 32 17
#> wws.mouse.2.sample.1 wws.mouse.2.sample.2 wws.mouse.2.sample.3
#> nc3 8 2 11
#> w92l 28 20 42
#> wws.mouse.3.sample.1 wws.mouse.3.sample.2 wws.mouse.3.sample.3
#> nc3 14 11 7
#> w92l 28 18 25
#> wws.mouse.4.sample.1 wws.mouse.4.sample.2 wws.mouse.4.sample.3
#> nc3 9 10 4
#> w92l 55 40 5
#> wws.mouse.5.sample.1 wws.mouse.5.sample.2 wws.mouse.5.sample.3
#> nc3 8 9 18
#> w92l 39 5 30
#> wws.mouse.6.sample.1 wws.mouse.6.sample.2 wws.mouse.6.sample.3
#> nc3 9 6 17
#> w92l 49 22 14
#> wws.mouse.7.sample.1 wws.mouse.7.sample.2 wws.mouse.7.sample.3
#> nc3 10 7 3
#> w92l 26 26 31
#> wws.mouse.8.sample.1 wws.mouse.8.sample.2 wws.mouse.8.sample.3
#> nc3 11 10 6
#> w92l 32 38 9
#> wws.mouse.9.sample.1 wws.mouse.9.sample.2 wws.mouse.9.sample.3
#> nc3 6 10 2
#> w92l 21 21 36
#> wws.mouse.10.sample.1 wws.mouse.10.sample.2 wws.mouse.10.sample.3
#> nc3 9 7 17
#> w92l 16 14 21
#> wt.mouse.1.sample.1 wt.mouse.1.sample.2 wt.mouse.1.sample.3
#> nc3 8 7 5
#> w92l 29 14 21
#> wt.mouse.2.sample.1 wt.mouse.2.sample.2 wt.mouse.2.sample.3
#> nc3 20 9 10
#> w92l 25 20 9
#> wt.mouse.3.sample.1 wt.mouse.3.sample.2 wt.mouse.3.sample.3
#> nc3 9 14 13
#> w92l 36 62 26
#> wt.mouse.4.sample.1 wt.mouse.4.sample.2 wt.mouse.4.sample.3
#> nc3 14 1 13
#> w92l 36 7 24
#> wt.mouse.5.sample.1 wt.mouse.5.sample.2 wt.mouse.5.sample.3
#> nc3 11 8 17
#> w92l 22 70 38
#> wt.mouse.6.sample.1 wt.mouse.6.sample.2 wt.mouse.6.sample.3
#> nc3 11 10 13
#> w92l 12 45 38
#> wt.mouse.7.sample.1 wt.mouse.7.sample.2 wt.mouse.7.sample.3
#> nc3 3 5 19
#> w92l 27 46 21
#> wt.mouse.8.sample.1 wt.mouse.8.sample.2 wt.mouse.8.sample.3
#> nc3 12 5 9
#> w92l 5 2 39
#> wt.mouse.9.sample.1 wt.mouse.9.sample.2 wt.mouse.9.sample.3
#> nc3 9 5 20
#> w92l 44 46 55
#> wt.mouse.10.sample.1 wt.mouse.10.sample.2 wt.mouse.10.sample.3
#> nc3 5 17 8
#> w92l 23 16 15
We also have some differential expression analysis results for the RNA-seq data. This dataframe has the log2 fold change for each gene, as well as a p-value generated by DESeq2. We can use this to find genes that are significantly up or down regulated between the wws and the wt mice.
head(de.analysis, 2)
#> X log2FoldChange padj
#> 1 nc3 -0.2840112 0.5699358
#> 2 w92l -0.2304174 0.7566018
Usually, bioinformaticians will check the frequencies of each nucelotide for QC purposes. We can do this with a simple bar graph. First, copy and paste this block of code into your R session to get a dataframe that has the frequecy of each nucelotide. If you want to see how each line works, you can highlight certain parts of each command and press ctrl+enter to see their output.
nuc.freqs <- table(unlist(gdf[5:ncol(gdf)]))
nuc.freqs <- as.data.frame(nuc.freqs)
nuc.freqs$Freq <- nuc.freqs$Freq/sum(nuc.freqs$Freq)
nuc.freqs$Var1 <- factor(nuc.freqs$Var1,
levels = c("A", "T", "C", "G", "H", "N"))
print(nuc.freqs)
#> Var1 Freq
#> 1 A 0.286170
#> 2 C 0.153455
#> 3 G 0.155425
#> 4 H 0.077155
#> 5 N 0.040230
#> 6 T 0.287565
Now, let’s make a bar graph that shows us the relative proportions of each nucleotide. You should notice that the A/T and G/C have similar frequencies, and Heterozygous/Deletion calls are much less frequent.
ggplot(nuc.freqs, aes(x = Var1, y = Freq))+
geom_bar(stat = "identity", fill = "grey20") +
theme_classic() +
scale_y_continuous(expand = c(0,0)) +
xlab("Nucleotide") +
ylab("Proportion")
## Looking at significant genotype hits Lets filter our genotype sites
for hits that could be associated with wws.
targets <- gdf[gdf$p.val == min(gdf$p.val),]
print(targets)
#> chrom pos gene p.val wws.mouse.1 wws.mouse.2 wws.mouse.3
#> 855 2 69773 vhf0y3 1.08e-05 A A A
#> 2586 5 89529 79vqc 1.08e-05 G G G
#> 5923 12 45940 ndagq 1.08e-05 C C C
#> 9285 18 75156 e7x 1.08e-05 T T T
#> 9462 18 97072 w6dfxj 1.08e-05 T T C
#> 9741 19 64259 Wbbl1 1.08e-05 N N N
#> wws.mouse.4 wws.mouse.5 wws.mouse.6 wws.mouse.7 wws.mouse.8 wws.mouse.9
#> 855 A A A A A A
#> 2586 G G G G G G
#> 5923 C C C C C C
#> 9285 T T T T T T
#> 9462 T C T C T T
#> 9741 N N N N N N
#> wws.mouse.10 wt.mouse.1 wt.mouse.2 wt.mouse.3 wt.mouse.4 wt.mouse.5
#> 855 A C C C C C
#> 2586 G C C C C C
#> 5923 C A G G G A
#> 9285 T N N N N N
#> 9462 T H H H H H
#> 9741 N A A A A A
#> wt.mouse.6 wt.mouse.7 wt.mouse.8 wt.mouse.9 wt.mouse.10
#> 855 C C C C C
#> 2586 C C C C C
#> 5923 G G G A G
#> 9285 N N N N N
#> 9462 H H H H H
#> 9741 A A A A A
We have some interesting hits! Lets look at the histograms for their corresponding RNA-seq data. First, we have to retrieve the counts for these genes:
target.counts <- rna.counts[rownames(rna.counts) %in% targets$gene, ]
Now, we have to “melt” the counts down so that they’re compatible with ggplot2:
histdf <- data.frame()
for(i in 1:nrow(target.counts)){
counts <- as.numeric(target.counts[i,])
gname <- rownames(target.counts)[i]
conds <- c(rep("wws", 30), rep("wt", 30)) #30 because there are 3 samples for each mouse
temp <- data.frame(
gene = gname,
condition = conds,
counts = counts
)
histdf <- rbind(histdf, temp)
}
head(histdf)
#> gene condition counts
#> 1 vhf0y3 wws 26
#> 2 vhf0y3 wws 50
#> 3 vhf0y3 wws 36
#> 4 vhf0y3 wws 36
#> 5 vhf0y3 wws 35
#> 6 vhf0y3 wws 47
Let’s make a histogram for the first gene in the list, vhf0y3. Following our last example, this is is how we would make the histogram:
ggplot(histdf[histdf$gene == "vhf0y3",],
aes(x = counts, fill = condition)) +
geom_histogram(alpha = 0.5)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
However, this doesn’t do what we want! This stacks the histogram bars on
top of eachother. We want them to be overlapping, not stacked, so we
have to separate them out into two separate layers.
subhistdf <- histdf[histdf$gene == "vhf0y3",]
ggplot() +
geom_histogram(data = subhistdf[subhistdf$condition == "wws",],
aes(x = counts),
alpha = 0.5,
fill = "darkblue",
inherit.aes = F) +
geom_histogram(data = subhistdf[subhistdf$condition == "wt",],
aes(x = counts),
alpha = 0.5,
fill = "darkred",
inherit.aes = F) +
theme_classic() +
scale_y_continuous(expand = c(0,0)) +
xlab("Expression Level") +
ggtitle("Wbbl1") +
theme(plot.title = element_text(hjust = 0.5))
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
This makes the plot we are looking for, but only for a single gene! We want to look at all six candidates, so we are going to write a function that will replicate this plot for any gene we want to look at. The name of this function will be “plot_expression_hist”. We are going to give it our data and the gene we want to look at, and it will return the plot.
plot_expression_hist <- function(histdf, gene){
subhistdf <- histdf[histdf$gene == gene,] #get the gene we want
p <- ggplot() +
geom_histogram(data = subhistdf[subhistdf$condition == "wws",],
aes(x = counts),
alpha = 0.5,
fill = "darkblue") +
geom_histogram(data = subhistdf[subhistdf$condition == "wt",],
aes(x = counts),
alpha = 0.5,
fill = "darkred") +
theme_classic() +
scale_y_continuous(expand = c(0,0)) +
xlab("Expression Level") +
ggtitle(gene) +
theme(plot.title = element_text(hjust = 0.5))
return(p) #return the plot
}
Now that our function is written, we can make a plot for each of our six genes:
hist1 <- plot_expression_hist(histdf, "vhf0y3")
hist2 <- plot_expression_hist(histdf, "79vqc")
hist3 <- plot_expression_hist(histdf, "ndagq")
hist4 <- plot_expression_hist(histdf, "e7x")
hist5 <- plot_expression_hist(histdf, "w6dfxj")
hist6 <- plot_expression_hist(histdf, "Wbbl1")
Finally, we can use ggarrange from the ggpubr package to merge all of these plots together:
expression_hists <- ggarrange(hist1, hist2, hist3, hist4, hist5, hist6, nrow = 2, ncol = 3)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
expression_hists
By looking at the histograms, we can see that Wbbl1 has a huge
expression level difference between the wws and the wt mice. The single
blue bar indicates that wws mice have little to no expression of Wbbl1.
Because we are good biologists, we know that Wbbl1 is short for whisker
balance breakdown locus 1, and that this gene plays a major role in
whisker stabilization!
Now that we know that Wbbl1 might be our culprit for wws in these mice, lets look into the formal differential expression analysis data. We are going to make a volcano plot, which is what bioinformatians use to compare the magnitude of a genes up/down regulation (log2 foldchange, or lfc for short) against the probability that any up/down regulation is statistically significant.
de.analysis$color <- "up"
de.analysis$color[de.analysis$log2FoldChange < 0] <- "down"
de.analysis$padj[de.analysis$padj == 0] <- 10^-300
volcano <- ggplot(de.analysis, aes(x = log2FoldChange, y = -log10(padj), color = color)) +
geom_point() +
scale_color_manual(values = c("darkblue", "darkred")) +
xlim(-10, 10) +
ylim(0, 325)+
geom_hline(yintercept = -log(0.05), linetype = "dashed") +
geom_text(data = de.analysis[de.analysis$log2FoldChange < -5, ],
aes(label = X ),
color = "black",
check_overlap = T,
nudge_y = 10) +
theme_classic() +
theme(legend.position = "none")
volcano
Bioinforamticians love to use heatmaps as an effective visualization, but making them isn’t always the most straightforward. To demonstrate how to make a heatmap, we will make an expression heatmap from the RNA-seq data. First, we have to melt our data again. The code to do this is shown below, but don’t run it! It will take a few minutes, so load in the pre-made csv file containing the output instead.
heatmapdf <- data.frame()
for(i in 1:nrow(rna.counts)){
counts <- as.numeric(rna.counts[i,])
gname <- rownames(rna.counts)[i]
conds <- c(rep("wws", 30), rep("wt", 30))
if(de.analysis$log2FoldChange[i] < -5){
cluster <- "Downregulated"
}else{
cluster <- "Other"
}
temp <- data.frame(
gene = gname,
condition = conds,
counts = counts,
mouse = colnames(rna.counts),
cluster = cluster
)
heatmapdf <- rbind(heatmapdf, temp)
}
heatmapdf$counts <- log(heatmapdf$counts +1)
Here is the code to make the heatmap:
heatmapdf <- read.csv("heatmapdf.csv")
geneorder <- de.analysis$X[order(de.analysis$log2FoldChange)]
heatmapdf$gene <- factor(heatmapdf$gene, levels = geneorder)
ggplot(heatmapdf, aes(x = mouse, y = gene, fill = counts)) +
geom_tile()+
scale_fill_viridis_b(option = "A") +
facet_grid(cluster~condition, scales = "free")+
theme(panel.grid = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank())