Step 5: Merge the Annotated Transcriptome with the List of Differentially Expressed Genes

Merge the table of differentially expressed genes (generated with DESeq2 in step 3) to the table of annotated genes (generated using Blastin step 4) to explore the functions of the differentially expressed genes

Author

Sarah Tanja

Published

June 1, 2023

Install Packages

if ("tidyverse" %in% rownames(installed.packages()) == 'FALSE') install.packages('tidyverse')
if ("ggplot2" %in% rownames(installed.packages()) == 'FALSE') install.packages('ggplot2')
if ("RColorBrewer" %in% rownames(installed.packages()) == 'FALSE') install.packages('RColorBrewer')

Load Libraries

library(tidyverse)
library(ggplot2)
library(RColorBrewer)

Merge Annotated Table with DEGlist

If you take a look at the DEGlist.tab file below, you will see that the row names themselves are the gene ID’s.

diff_genes <- read.delim(file = "../output/deseq/DEGlist.tab", header = TRUE)
head (diff_genes)

This contrasts with the blast GO annotated table blast_annot_go.tab that has a whole column (V1) dedicated to the gene ID in the same format. Additionally, notice that the column V3contains the UniProt accession number, which is a unique identifier for each protein entry.

blast_go <- read.delim(file = "../output/blast/blast_annot_go.tab", header = TRUE)
head(blast_go)

As such, we first need to take the row names of the DEGlist.tab and make them into a new column.

diff_genes$gene_ID <- row.names(diff_genes) # adds the gene name into data frame as new column called gene_ID
rownames(diff_genes) <- diff_genes$X # renames all the rows to 1-n instead of the gene name

Relocate the gene_ID column back to the first column

diff_genes %>% 
  relocate(gene_ID, .before= baseMean) # relocate gene_ID to the first column

Rename V1in the blast_go dataframe to gene_ID

blast_go <- blast_go %>%
  rename('V1' = 'gene_ID')

Rename V3in the blast_go dataframe to something more descriptive of the column, like UniProt_acess_num

blast_go <- blast_go %>%
  rename('V3' = 'UniProt_access_num')

Checkout new descriptive column names in blast_go dataframe

head(blast_go)

Now that we have both the blast go and differential gene list in the right format, we can merge the two so that we have information regarding gene ID and function paired with differentially expressed genes.

# Perform inner join between diff_genes and blast_go data frames based on shared "gene_ID" column
annote_DEG <- merge(diff_genes, blast_go, by = "gene_ID", all = FALSE)

# Print the merged table
kable(annote_DEG)

Finally, let’s write out the annotated DEG list to save it to our output directory.

write.table(annote_DEG, "../output/annote_DEG.tab", row.names = T)

Merge annotated DEG dataframe with count matrix

Read in count matrix

countmatrix <- read.delim("../output/trinity-matrix/mcap.isoform.counts.matrix", header = TRUE, sep = '\t')
rownames(countmatrix) <- countmatrix$X
countmatrix <- countmatrix[,-1]
head(countmatrix)

This count matrix has 54,384 rows (genes) and 8 columns (samples) Each column represents a sample, and each row represents a gene. The ‘count’ is the number of times that gene was found in the RNA sequence of the sample.

Make `countmatrix` a dataframe we can merge

Round integers up to whole numbers (for analysis)

countmatrix <- round(countmatrix, 0)
head(countmatrix)

Row names are gene identifiers, let’s make them their own column that we can use to match countmatrix rows with annote_DEG rows. Take the row names of the countmatrix and make them into a new column.

countmatrix$gene_ID <- row.names(countmatrix) # adds the gene name into data frame as new column called gene_ID
rownames(countmatrix) <- countmatrix$X # renames all the rows to 1-n instead of the gene name

Relocate the gene_ID column back to the first column

countmatrix %>% 
  relocate(gene_ID, .before= 1) # relocate gene_ID to the first column

Merge the DEG with the countmatrix using an inner join

This new dataframe annote_counts has both gene annotations and count data for each biological sample!

annote_counts <- merge(annote_DEG, countmatrix, by="gene_ID", all=FALSE)
dim(annote_counts)

Tip

Notice that the annote_counts has 20427 observations, the same as annote_DEG… indicating that every gene that has an annotation from the reference genome (saved in annote_DEG) was kept and is part of the new dataframe, annote_counts

Warning

Watchout! Some of these may be duplicates…

Check for duplicates

head(which(duplicated(annote_counts)))

Eliminate duplicates

annote_counts <- distinct(annote_counts)
head(which(duplicated(annote_counts)))

Finally, let’s write out the annotated DEG list to save it to our output directory.

write.table(annote_counts, "../output/annote_counts.tab", row.names = T)

--- title: "Step 5: Merge the Annotated Transcriptome with the List of Differentially Expressed Genes" subtitle: "Merge the table of differentially expressed genes **(generated with `DESeq2` in step 3)** to the table of annotated genes **(generated using `Blast`in step 4)** to explore the functions of the differentially expressed genes" author: "Sarah Tanja" date: "`r format(Sys.time(), '%d %B, %Y')`" format: html: df-print: paged toc: true toc-location: left smooth-scroll: true link-external-icon: true link-external-newwindow: true code-fold: false code-tools: true code-copy: true highlight-style: breeze code-overflow: wrap theme: minty --- ```{r setup, include=FALSE} knitr::opts_chunk$set( echo = TRUE, # Display code chunks eval = FALSE, # Evaluate code chunks warning = FALSE, # Hide warnings message = FALSE, # Hide messages fig.width = 6, # Set plot width in inches fig.height = 4, # Set plot height in inches fig.align = "center" # Align plots to the center ) ``` ### Install Packages ```{r install packages, filename='r'} if ("tidyverse" %in% rownames(installed.packages()) == 'FALSE') install.packages('tidyverse') if ("ggplot2" %in% rownames(installed.packages()) == 'FALSE') install.packages('ggplot2') if ("RColorBrewer" %in% rownames(installed.packages()) == 'FALSE') install.packages('RColorBrewer') ``` ### Load Libraries ```{r load libraries, filename='r'} library(tidyverse) library(ggplot2) library(RColorBrewer) ``` ## Merge Annotated Table with DEGlist If you take a look at the DEGlist.tab file below, you will see that the row names themselves are the gene ID's. ```{r, filename='r', eval=TRUE} diff_genes <- read.delim(file = "../output/deseq/DEGlist.tab", header = TRUE) head (diff_genes) ``` This contrasts with the blast GO annotated table `blast_annot_go.tab` that has a whole column (V1) dedicated to the gene ID in the same format. Additionally, notice that the column `V3`contains the UniProt accession number, which is a unique identifier for each protein entry. ```{r, filename='r', eval=TRUE} blast_go <- read.delim(file = "../output/blast/blast_annot_go.tab", header = TRUE) head(blast_go) ``` As such, we first need to take the row names of the DEGlist.tab and make them into a new column. ```{r} diff_genes$gene_ID <- row.names(diff_genes) # adds the gene name into data frame as new column called gene_ID rownames(diff_genes) <- diff_genes$X # renames all the rows to 1-n instead of the gene name ``` Relocate the `gene_ID` column back to the first column ```{r} diff_genes %>% relocate(gene_ID, .before= baseMean) # relocate gene_ID to the first column ``` Rename `V1`in the `blast_go` dataframe to `gene_ID` ```{r, filename='r', eval=FALSE} blast_go <- blast_go %>% rename('V1' = 'gene_ID') ``` Rename `V3`in the `blast_go` dataframe to something more descriptive of the column, like `UniProt_acess_num` ```{r, filename='r', eval=FALSE} blast_go <- blast_go %>% rename('V3' = 'UniProt_access_num') ``` Checkout new descriptive column names in blast_go dataframe ```{r, filename='r'} head(blast_go) ``` Now that we have both the blast go and differential gene list in the right format, we can merge the two so that we have information regarding gene ID and function paired with differentially expressed genes. ```{r, filename='r', eval=FALSE} # Perform inner join between diff_genes and blast_go data frames based on shared "gene_ID" column annote_DEG <- merge(diff_genes, blast_go, by = "gene_ID", all = FALSE) # Print the merged table kable(annote_DEG) ``` Finally, let's write out the annotated DEG list to save it to our output directory. ```{r} write.table(annote_DEG, "../output/annote_DEG.tab", row.names = T) ``` # Merge annotated DEG dataframe with count matrix ## Read in count matrix ```{r, filename = 'r'} countmatrix <- read.delim("../output/trinity-matrix/mcap.isoform.counts.matrix", header = TRUE, sep = '\t') rownames(countmatrix) <- countmatrix$X countmatrix <- countmatrix[,-1] head(countmatrix) ``` This count matrix has 54,384 rows (genes) and 8 columns (samples) Each column represents a sample, and each row represents a gene. The 'count' is the number of times that gene was found in the RNA sequence of the sample. ### Make `countmatrix` a dataframe we can merge Round integers up to whole numbers (for analysis) ```{r, filename = 'r'} countmatrix <- round(countmatrix, 0) head(countmatrix) ``` Row names are gene identifiers, let's make them their own column that we can use to match `countmatrix` rows with `annote_DEG` rows. Take the row names of the `countmatrix` and make them into a new column. ```{r} countmatrix$gene_ID <- row.names(countmatrix) # adds the gene name into data frame as new column called gene_ID rownames(countmatrix) <- countmatrix$X # renames all the rows to 1-n instead of the gene name ``` Relocate the `gene_ID` column back to the first column ```{r} countmatrix %>% relocate(gene_ID, .before= 1) # relocate gene_ID to the first column ``` #### Merge the DEG with the countmatrix using an inner join This new dataframe `annote_counts` has both gene annotations and count data for each biological sample! ```{r} annote_counts <- merge(annote_DEG, countmatrix, by="gene_ID", all=FALSE) dim(annote_counts) ``` :::callout-tip Notice that the annote_counts has 20427 observations, the same as `annote_DEG`... indicating that every gene that has an annotation from the reference genome (saved in `annote_DEG`) was kept and is part of the new dataframe, `annote_counts` ::: ::: callout-warning Watchout! Some of these may be duplicates... ::: Check for duplicates ```{r} head(which(duplicated(annote_counts))) ``` Eliminate duplicates ```{r} annote_counts <- distinct(annote_counts) head(which(duplicated(annote_counts))) ``` Finally, let's write out the annotated DEG list to save it to our output directory. ```{r} write.table(annote_counts, "../output/annote_counts.tab", row.names = T) ```

Install Packages

Load Libraries

Merge Annotated Table with DEGlist

Merge annotated DEG dataframe with count matrix

Read in count matrix

Make countmatrix a dataframe we can merge

Merge the DEG with the countmatrix using an inner join

Make `countmatrix` a dataframe we can merge