1 Scenario

We have 20 different mice. Some of them have a genetic mutation that causes a rare disease called wobbly whisker syndrome (wws)! It’s our job to figure out which mutation is the culprit. We’ve been given some genotype and RNA-seq data for all of the mice. Lets dig into it to see if we can figure out where the mutation is, and how it affects the their transcriptional profiles.

2 R Basics

Before we can get started with the analysis, we have to go over some basic R commands. We are going to be using two different data structures in this tutorial. The first is a vector:

#make the vectors
x <- c(1, 2, 3) 
y <- c(4, 5, 6)

print(x)

#> [1] 1 2 3

print(y)

#> [1] 4 5 6

Vectors are indexed with square brackets []. Unlike most other languages, indexing starts at 1, not 0.

print(x[1])

#> [1] 1

We will also use dataframes. Think of dataframes as a 2-dimensional table, similar to an excel spreadsheet. Here is how you define one:

df = data.frame(
  column1 = c(1, 2, 3),
  column2 = c(4, 5, 6), 
  column3 = c("x", "y", "z")
)
print(df)

#>   column1 column2 column3
#> 1       1       4       x
#> 2       2       5       y
#> 3       3       6       z

All rows and columns of a dataframe can be numerically indexed similarly to vectors with the format df[rows, columns]. Specific columns can also be accessed with the $ operator.

df[1,] #get the first row

#>   column1 column2 column3
#> 1       1       4       x

df[,2:3] #get second and third column

#>   column2 column3
#> 1       4       x
#> 2       5       y
#> 3       6       z

df$column1 #return values in first column as a vector

#> [1] 1 2 3

Finally, we’re quickly going to go over control structures and loops. Sometimes, we’ll only want a code block to execute if another condition is true. To do this, we’ll use an if statement:

x = 3

if(x == 3){
  print("x is 3")
}else{
  print("x is not 3")
}

#> [1] "x is 3"

Sometimes, we’ll want to repeat the same task a set number of times. A common example of this is running the same operation on every row of a data frame. To do this, we will use a “for” loop:

#let's add one to every value in x
x = c(1, 2, 3)

for(i in 1:length(x)){
  x[i] <- x[i] + 1
}
print(x)

#> [1] 2 3 4

That’s it for the basics! We’ll go more in depth about them as we get further into this tutorial.

3 Analysis

3.1 Loading in our data

Typically, at the start of every R script, we load in any libraries we will need for the analysis. Libraries are chunks of code that other people have already written for us, so we can save time! We will be using two in this tutorial:

library(ggplot2)
library(ggpubr)

The wet lab researchers for this project have sent us two csv files. One for the genotype data, and one for the RNA-seq data. Lets load them in and take a look at them!

gdf <- read.csv("genotype_data.csv")
rna.counts <- read.csv("rnaseq_counts.csv")
rownames(rna.counts) <- rna.counts$X #set the rownames to the first column! read.csv won't do it for you
rna.counts <- rna.counts[,-1] #get rid of the extra gene name column once row names are set
de.analysis <- read.csv("rnaseq_DE_results.csv")

First, lets check out the genotype data. Each row represents a site where mice commonly have mutations. The first two columns are the chromosome and coordinate of each site. The third column is the name of the gene that the mutation affects, and the fourth column is a fishers exact pvalue representing the probability that the mutation is associated with the wws mice. The remaining columns are the nucleotides that each mouse has at each site.

head(gdf, 2)

#>   chrom   pos gene      p.val wws.mouse.1 wws.mouse.2 wws.mouse.3 wws.mouse.4
#> 1     1 30201  nc3 0.08668731           T           N           T           N
#> 2     1 30231 w92l 0.58204334           T           T           T           T
#>   wws.mouse.5 wws.mouse.6 wws.mouse.7 wws.mouse.8 wws.mouse.9 wws.mouse.10
#> 1           T           C           T           T           T            N
#> 2           T           T           T           T           T            A
#>   wt.mouse.1 wt.mouse.2 wt.mouse.3 wt.mouse.4 wt.mouse.5 wt.mouse.6 wt.mouse.7
#> 1          T          T          T          T          T          T          T
#> 2          T          T          A          T          T          T          A
#>   wt.mouse.8 wt.mouse.9 wt.mouse.10
#> 1          T          T           T
#> 2          T          A           T

The RNA-sequencing data shows the number of transcripts sequenced for each gene within each mouse sample:

head(rna.counts, 2)

#>      wws.mouse.1.sample.1 wws.mouse.1.sample.2 wws.mouse.1.sample.3
#> nc3                     9                    4                   12
#> w92l                   18                   32                   17
#>      wws.mouse.2.sample.1 wws.mouse.2.sample.2 wws.mouse.2.sample.3
#> nc3                     8                    2                   11
#> w92l                   28                   20                   42
#>      wws.mouse.3.sample.1 wws.mouse.3.sample.2 wws.mouse.3.sample.3
#> nc3                    14                   11                    7
#> w92l                   28                   18                   25
#>      wws.mouse.4.sample.1 wws.mouse.4.sample.2 wws.mouse.4.sample.3
#> nc3                     9                   10                    4
#> w92l                   55                   40                    5
#>      wws.mouse.5.sample.1 wws.mouse.5.sample.2 wws.mouse.5.sample.3
#> nc3                     8                    9                   18
#> w92l                   39                    5                   30
#>      wws.mouse.6.sample.1 wws.mouse.6.sample.2 wws.mouse.6.sample.3
#> nc3                     9                    6                   17
#> w92l                   49                   22                   14
#>      wws.mouse.7.sample.1 wws.mouse.7.sample.2 wws.mouse.7.sample.3
#> nc3                    10                    7                    3
#> w92l                   26                   26                   31
#>      wws.mouse.8.sample.1 wws.mouse.8.sample.2 wws.mouse.8.sample.3
#> nc3                    11                   10                    6
#> w92l                   32                   38                    9
#>      wws.mouse.9.sample.1 wws.mouse.9.sample.2 wws.mouse.9.sample.3
#> nc3                     6                   10                    2
#> w92l                   21                   21                   36
#>      wws.mouse.10.sample.1 wws.mouse.10.sample.2 wws.mouse.10.sample.3
#> nc3                      9                     7                    17
#> w92l                    16                    14                    21
#>      wt.mouse.1.sample.1 wt.mouse.1.sample.2 wt.mouse.1.sample.3
#> nc3                    8                   7                   5
#> w92l                  29                  14                  21
#>      wt.mouse.2.sample.1 wt.mouse.2.sample.2 wt.mouse.2.sample.3
#> nc3                   20                   9                  10
#> w92l                  25                  20                   9
#>      wt.mouse.3.sample.1 wt.mouse.3.sample.2 wt.mouse.3.sample.3
#> nc3                    9                  14                  13
#> w92l                  36                  62                  26
#>      wt.mouse.4.sample.1 wt.mouse.4.sample.2 wt.mouse.4.sample.3
#> nc3                   14                   1                  13
#> w92l                  36                   7                  24
#>      wt.mouse.5.sample.1 wt.mouse.5.sample.2 wt.mouse.5.sample.3
#> nc3                   11                   8                  17
#> w92l                  22                  70                  38
#>      wt.mouse.6.sample.1 wt.mouse.6.sample.2 wt.mouse.6.sample.3
#> nc3                   11                  10                  13
#> w92l                  12                  45                  38
#>      wt.mouse.7.sample.1 wt.mouse.7.sample.2 wt.mouse.7.sample.3
#> nc3                    3                   5                  19
#> w92l                  27                  46                  21
#>      wt.mouse.8.sample.1 wt.mouse.8.sample.2 wt.mouse.8.sample.3
#> nc3                   12                   5                   9
#> w92l                   5                   2                  39
#>      wt.mouse.9.sample.1 wt.mouse.9.sample.2 wt.mouse.9.sample.3
#> nc3                    9                   5                  20
#> w92l                  44                  46                  55
#>      wt.mouse.10.sample.1 wt.mouse.10.sample.2 wt.mouse.10.sample.3
#> nc3                     5                   17                    8
#> w92l                   23                   16                   15

We also have some differential expression analysis results for the RNA-seq data. This dataframe has the log2 fold change for each gene, as well as a p-value generated by DESeq2. We can use this to find genes that are significantly up or down regulated between the wws and the wt mice.

head(de.analysis, 2)

#>      X log2FoldChange      padj
#> 1  nc3     -0.2840112 0.5699358
#> 2 w92l     -0.2304174 0.7566018

3.2 Basic Genotyping QC

Usually, bioinformaticians will check the frequencies of each nucelotide for QC purposes. We can do this with a simple bar graph. First, copy and paste this block of code into your R session to get a dataframe that has the frequecy of each nucelotide. If you want to see how each line works, you can highlight certain parts of each command and press ctrl+enter to see their output.

nuc.freqs <- table(unlist(gdf[5:ncol(gdf)])) 
nuc.freqs <- as.data.frame(nuc.freqs)
nuc.freqs$Freq <- nuc.freqs$Freq/sum(nuc.freqs$Freq)
nuc.freqs$Var1 <- factor(nuc.freqs$Var1,
                         levels = c("A", "T", "C", "G", "H", "N"))
print(nuc.freqs)

#>   Var1     Freq
#> 1    A 0.286170
#> 2    C 0.153455
#> 3    G 0.155425
#> 4    H 0.077155
#> 5    N 0.040230
#> 6    T 0.287565

Now, let’s make a bar graph that shows us the relative proportions of each nucleotide. You should notice that the A/T and G/C have similar frequencies, and Heterozygous/Deletion calls are much less frequent.

ggplot(nuc.freqs, aes(x = Var1, y = Freq))+
  geom_bar(stat = "identity", fill = "grey20") +
  theme_classic() +
  scale_y_continuous(expand = c(0,0)) +
  xlab("Nucleotide")  +
  ylab("Proportion")

## Looking at significant genotype hits Lets filter our genotype sites for hits that could be associated with wws.

targets <- gdf[gdf$p.val == min(gdf$p.val),]
print(targets)

#>      chrom   pos   gene    p.val wws.mouse.1 wws.mouse.2 wws.mouse.3
#> 855      2 69773 vhf0y3 1.08e-05           A           A           A
#> 2586     5 89529  79vqc 1.08e-05           G           G           G
#> 5923    12 45940  ndagq 1.08e-05           C           C           C
#> 9285    18 75156    e7x 1.08e-05           T           T           T
#> 9462    18 97072 w6dfxj 1.08e-05           T           T           C
#> 9741    19 64259  Wbbl1 1.08e-05           N           N           N
#>      wws.mouse.4 wws.mouse.5 wws.mouse.6 wws.mouse.7 wws.mouse.8 wws.mouse.9
#> 855            A           A           A           A           A           A
#> 2586           G           G           G           G           G           G
#> 5923           C           C           C           C           C           C
#> 9285           T           T           T           T           T           T
#> 9462           T           C           T           C           T           T
#> 9741           N           N           N           N           N           N
#>      wws.mouse.10 wt.mouse.1 wt.mouse.2 wt.mouse.3 wt.mouse.4 wt.mouse.5
#> 855             A          C          C          C          C          C
#> 2586            G          C          C          C          C          C
#> 5923            C          A          G          G          G          A
#> 9285            T          N          N          N          N          N
#> 9462            T          H          H          H          H          H
#> 9741            N          A          A          A          A          A
#>      wt.mouse.6 wt.mouse.7 wt.mouse.8 wt.mouse.9 wt.mouse.10
#> 855           C          C          C          C           C
#> 2586          C          C          C          C           C
#> 5923          G          G          G          A           G
#> 9285          N          N          N          N           N
#> 9462          H          H          H          H           H
#> 9741          A          A          A          A           A

We have some interesting hits! Lets look at the histograms for their corresponding RNA-seq data. First, we have to retrieve the counts for these genes:

target.counts <- rna.counts[rownames(rna.counts) %in% targets$gene, ]

Now, we have to “melt” the counts down so that they’re compatible with ggplot2:

histdf <- data.frame()
for(i in 1:nrow(target.counts)){
  counts <- as.numeric(target.counts[i,])
  gname <- rownames(target.counts)[i]
  conds <- c(rep("wws", 30), rep("wt", 30))  #30 because there are 3 samples for each mouse
  temp <- data.frame(
    gene = gname,
    condition = conds,
    counts = counts
  )
  histdf <- rbind(histdf, temp)
}
head(histdf)

#>     gene condition counts
#> 1 vhf0y3       wws     26
#> 2 vhf0y3       wws     50
#> 3 vhf0y3       wws     36
#> 4 vhf0y3       wws     36
#> 5 vhf0y3       wws     35
#> 6 vhf0y3       wws     47

Let’s make a histogram for the first gene in the list, vhf0y3. Following our last example, this is is how we would make the histogram:

ggplot(histdf[histdf$gene == "vhf0y3",],
       aes(x = counts, fill = condition)) +
  geom_histogram(alpha = 0.5)

#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

However, this doesn’t do what we want! This stacks the histogram bars on top of eachother. We want them to be overlapping, not stacked, so we have to separate them out into two separate layers.

subhistdf <- histdf[histdf$gene == "vhf0y3",]

ggplot() +
  geom_histogram(data = subhistdf[subhistdf$condition == "wws",],
                 aes(x = counts),
                 alpha = 0.5,
                 fill = "darkblue",
                 inherit.aes = F) +
  geom_histogram(data = subhistdf[subhistdf$condition == "wt",],
                 aes(x = counts),
                 alpha = 0.5,
                 fill = "darkred",
                 inherit.aes = F) +
  theme_classic() +
  scale_y_continuous(expand = c(0,0))  +
  xlab("Expression Level") +
  ggtitle("Wbbl1") +
  theme(plot.title = element_text(hjust = 0.5))

#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This makes the plot we are looking for, but only for a single gene! We want to look at all six candidates, so we are going to write a function that will replicate this plot for any gene we want to look at. The name of this function will be “plot_expression_hist”. We are going to give it our data and the gene we want to look at, and it will return the plot.

plot_expression_hist <- function(histdf, gene){
  subhistdf <- histdf[histdf$gene == gene,] #get the gene we want
  p <- ggplot() +
    geom_histogram(data = subhistdf[subhistdf$condition == "wws",],
                   aes(x = counts),
                   alpha = 0.5,
                   fill = "darkblue") +
    geom_histogram(data = subhistdf[subhistdf$condition == "wt",],
                   aes(x = counts),
                   alpha = 0.5,
                   fill = "darkred") +
    theme_classic() +
    scale_y_continuous(expand = c(0,0))  +
    xlab("Expression Level") +
    ggtitle(gene) +
    theme(plot.title = element_text(hjust = 0.5))
  return(p) #return the plot
}

Now that our function is written, we can make a plot for each of our six genes:

hist1 <- plot_expression_hist(histdf, "vhf0y3")
hist2 <- plot_expression_hist(histdf, "79vqc")
hist3 <- plot_expression_hist(histdf, "ndagq")
hist4 <- plot_expression_hist(histdf, "e7x")
hist5 <- plot_expression_hist(histdf, "w6dfxj")
hist6 <- plot_expression_hist(histdf, "Wbbl1")

Finally, we can use ggarrange from the ggpubr package to merge all of these plots together:

expression_hists <- ggarrange(hist1, hist2, hist3, hist4, hist5, hist6, nrow = 2, ncol = 3)

#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

expression_hists

By looking at the histograms, we can see that Wbbl1 has a huge expression level difference between the wws and the wt mice. The single blue bar indicates that wws mice have little to no expression of Wbbl1. Because we are good biologists, we know that Wbbl1 is short for whisker balance breakdown locus 1, and that this gene plays a major role in whisker stabilization!

3.3 Confirming Up And Downregulated Genes

Now that we know that Wbbl1 might be our culprit for wws in these mice, lets look into the formal differential expression analysis data. We are going to make a volcano plot, which is what bioinformatians use to compare the magnitude of a genes up/down regulation (log2 foldchange, or lfc for short) against the probability that any up/down regulation is statistically significant.

de.analysis$color <- "up"
de.analysis$color[de.analysis$log2FoldChange < 0] <- "down"
de.analysis$padj[de.analysis$padj == 0] <- 10^-300

volcano <- ggplot(de.analysis, aes(x = log2FoldChange, y = -log10(padj), color = color)) +
  geom_point() +
  scale_color_manual(values = c("darkblue", "darkred")) +
  xlim(-10, 10) +
  ylim(0, 325)+
  geom_hline(yintercept = -log(0.05), linetype = "dashed") +
  geom_text(data = de.analysis[de.analysis$log2FoldChange < -5, ],
            aes(label = X ),
            color = "black",
            check_overlap = T,
            nudge_y = 10) +
  theme_classic() +
  theme(legend.position = "none")
volcano

3.4 Tackling the dreaded heatmap

Bioinforamticians love to use heatmaps as an effective visualization, but making them isn’t always the most straightforward. To demonstrate how to make a heatmap, we will make an expression heatmap from the RNA-seq data. First, we have to melt our data again. The code to do this is shown below, but don’t run it! It will take a few minutes, so load in the pre-made csv file containing the output instead.

heatmapdf <- data.frame()
for(i in 1:nrow(rna.counts)){
  counts <- as.numeric(rna.counts[i,])
  gname <- rownames(rna.counts)[i]
  conds <- c(rep("wws", 30), rep("wt", 30))

  if(de.analysis$log2FoldChange[i] < -5){
    cluster <- "Downregulated"
  }else{
    cluster <- "Other"
  }
  temp <- data.frame(
    gene = gname,
    condition = conds,
    counts = counts,
    mouse = colnames(rna.counts),
    cluster = cluster
  )
  heatmapdf <- rbind(heatmapdf, temp)
}
heatmapdf$counts <- log(heatmapdf$counts +1)

Here is the code to make the heatmap:

heatmapdf <- read.csv("heatmapdf.csv")
geneorder <- de.analysis$X[order(de.analysis$log2FoldChange)]


heatmapdf$gene <- factor(heatmapdf$gene, levels = geneorder)

ggplot(heatmapdf, aes(x = mouse, y = gene, fill = counts)) +
  geom_tile()+
  scale_fill_viridis_b(option = "A") +
  facet_grid(cluster~condition, scales = "free")+
  theme(panel.grid = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank())