Step 5: Merge the Annotated Transcriptome with the List of Differentially Expressed Genes

Merge the table of differentially expressed genes (generated with DESeq2 in step 3) to the table of annotated genes (generated using Blastin step 4) to explore the functions of the differentially expressed genes

Author

Sarah Tanja

Published

June 1, 2023

Install Packages

r
if ("tidyverse" %in% rownames(installed.packages()) == 'FALSE') install.packages('tidyverse')
if ("ggplot2" %in% rownames(installed.packages()) == 'FALSE') install.packages('ggplot2')
if ("RColorBrewer" %in% rownames(installed.packages()) == 'FALSE') install.packages('RColorBrewer')

Load Libraries

r
library(tidyverse)
library(ggplot2)
library(RColorBrewer)

Merge Annotated Table with DEGlist

If you take a look at the DEGlist.tab file below, you will see that the row names themselves are the gene ID’s.

r
diff_genes <- read.delim(file = "../output/deseq/DEGlist.tab", header = TRUE)
head (diff_genes)

This contrasts with the blast GO annotated table blast_annot_go.tab that has a whole column (V1) dedicated to the gene ID in the same format. Additionally, notice that the column V3contains the UniProt accession number, which is a unique identifier for each protein entry.

r
blast_go <- read.delim(file = "../output/blast/blast_annot_go.tab", header = TRUE)
head(blast_go)

As such, we first need to take the row names of the DEGlist.tab and make them into a new column.

diff_genes$gene_ID <- row.names(diff_genes) # adds the gene name into data frame as new column called gene_ID
rownames(diff_genes) <- diff_genes$X # renames all the rows to 1-n instead of the gene name

Relocate the gene_ID column back to the first column

diff_genes %>% 
  relocate(gene_ID, .before= baseMean) # relocate gene_ID to the first column

Rename V1in the blast_go dataframe to gene_ID

r
blast_go <- blast_go %>%
  rename('V1' = 'gene_ID')

Rename V3in the blast_go dataframe to something more descriptive of the column, like UniProt_acess_num

r
blast_go <- blast_go %>%
  rename('V3' = 'UniProt_access_num')

Checkout new descriptive column names in blast_go dataframe

r
head(blast_go)

Now that we have both the blast go and differential gene list in the right format, we can merge the two so that we have information regarding gene ID and function paired with differentially expressed genes.

r
# Perform inner join between diff_genes and blast_go data frames based on shared "gene_ID" column
annote_DEG <- merge(diff_genes, blast_go, by = "gene_ID", all = FALSE)

# Print the merged table
kable(annote_DEG)

Finally, let’s write out the annotated DEG list to save it to our output directory.

write.table(annote_DEG, "../output/annote_DEG.tab", row.names = T)

Merge annotated DEG dataframe with count matrix

Read in count matrix

r
countmatrix <- read.delim("../output/trinity-matrix/mcap.isoform.counts.matrix", header = TRUE, sep = '\t')
rownames(countmatrix) <- countmatrix$X
countmatrix <- countmatrix[,-1]
head(countmatrix)

This count matrix has 54,384 rows (genes) and 8 columns (samples) Each column represents a sample, and each row represents a gene. The ‘count’ is the number of times that gene was found in the RNA sequence of the sample.

Make countmatrix a dataframe we can merge

Round integers up to whole numbers (for analysis)

r
countmatrix <- round(countmatrix, 0)
head(countmatrix)

Row names are gene identifiers, let’s make them their own column that we can use to match countmatrix rows with annote_DEG rows. Take the row names of the countmatrix and make them into a new column.

countmatrix$gene_ID <- row.names(countmatrix) # adds the gene name into data frame as new column called gene_ID
rownames(countmatrix) <- countmatrix$X # renames all the rows to 1-n instead of the gene name

Relocate the gene_ID column back to the first column

countmatrix %>% 
  relocate(gene_ID, .before= 1) # relocate gene_ID to the first column

Merge the DEG with the countmatrix using an inner join

This new dataframe annote_counts has both gene annotations and count data for each biological sample!

annote_counts <- merge(annote_DEG, countmatrix, by="gene_ID", all=FALSE)
dim(annote_counts)
Tip

Notice that the annote_counts has 20427 observations, the same as annote_DEG… indicating that every gene that has an annotation from the reference genome (saved in annote_DEG) was kept and is part of the new dataframe, annote_counts

Warning

Watchout! Some of these may be duplicates…

Check for duplicates

head(which(duplicated(annote_counts)))

Eliminate duplicates

annote_counts <- distinct(annote_counts)
head(which(duplicated(annote_counts)))

Finally, let’s write out the annotated DEG list to save it to our output directory.

write.table(annote_counts, "../output/annote_counts.tab", row.names = T)