Step 5: Merge the Annotated Transcriptome with the List of Differentially Expressed Genes
Merge the table of differentially expressed genes (generated with DESeq2 in step 3) to the table of annotated genes (generated using Blastin step 4) to explore the functions of the differentially expressed genes
Author
Sarah Tanja
Published
June 1, 2023
Install Packages
r
if ("tidyverse"%in%rownames(installed.packages()) =='FALSE') install.packages('tidyverse')if ("ggplot2"%in%rownames(installed.packages()) =='FALSE') install.packages('ggplot2')if ("RColorBrewer"%in%rownames(installed.packages()) =='FALSE') install.packages('RColorBrewer')
This contrasts with the blast GO annotated table blast_annot_go.tab that has a whole column (V1) dedicated to the gene ID in the same format. Additionally, notice that the column V3contains the UniProt accession number, which is a unique identifier for each protein entry.
As such, we first need to take the row names of the DEGlist.tab and make them into a new column.
diff_genes$gene_ID <-row.names(diff_genes) # adds the gene name into data frame as new column called gene_IDrownames(diff_genes) <- diff_genes$X # renames all the rows to 1-n instead of the gene name
Relocate the gene_ID column back to the first column
diff_genes %>%relocate(gene_ID, .before= baseMean) # relocate gene_ID to the first column
Rename V1in the blast_go dataframe to gene_ID
r
blast_go <- blast_go %>%rename('V1'='gene_ID')
Rename V3in the blast_go dataframe to something more descriptive of the column, like UniProt_acess_num
Checkout new descriptive column names in blast_go dataframe
r
head(blast_go)
Now that we have both the blast go and differential gene list in the right format, we can merge the two so that we have information regarding gene ID and function paired with differentially expressed genes.
r
# Perform inner join between diff_genes and blast_go data frames based on shared "gene_ID" columnannote_DEG <-merge(diff_genes, blast_go, by ="gene_ID", all =FALSE)# Print the merged tablekable(annote_DEG)
Finally, let’s write out the annotated DEG list to save it to our output directory.
This count matrix has 54,384 rows (genes) and 8 columns (samples) Each column represents a sample, and each row represents a gene. The ‘count’ is the number of times that gene was found in the RNA sequence of the sample.
Row names are gene identifiers, let’s make them their own column that we can use to match countmatrix rows with annote_DEG rows. Take the row names of the countmatrix and make them into a new column.
countmatrix$gene_ID <-row.names(countmatrix) # adds the gene name into data frame as new column called gene_IDrownames(countmatrix) <- countmatrix$X # renames all the rows to 1-n instead of the gene name
Relocate the gene_ID column back to the first column
countmatrix %>%relocate(gene_ID, .before=1) # relocate gene_ID to the first column
Merge the DEG with the countmatrix using an inner join
This new dataframe annote_counts has both gene annotations and count data for each biological sample!
Notice that the annote_counts has 20427 observations, the same as annote_DEG… indicating that every gene that has an annotation from the reference genome (saved in annote_DEG) was kept and is part of the new dataframe, annote_counts
---title: "Step 5: Merge the Annotated Transcriptome with the List of Differentially Expressed Genes"subtitle: "Merge the table of differentially expressed genes **(generated with `DESeq2` in step 3)** to the table of annotated genes **(generated using `Blast`in step 4)** to explore the functions of the differentially expressed genes"author: "Sarah Tanja"date: "`r format(Sys.time(), '%d %B, %Y')`"format: html: df-print: paged toc: true toc-location: left smooth-scroll: true link-external-icon: true link-external-newwindow: true code-fold: false code-tools: true code-copy: true highlight-style: breeze code-overflow: wrap theme: minty---```{r setup, include=FALSE}knitr::opts_chunk$set(echo =TRUE, # Display code chunkseval =FALSE, # Evaluate code chunkswarning =FALSE, # Hide warningsmessage =FALSE, # Hide messagesfig.width =6, # Set plot width in inchesfig.height =4, # Set plot height in inchesfig.align ="center"# Align plots to the center)```### Install Packages```{r install packages, filename='r'}if ("tidyverse"%in%rownames(installed.packages()) =='FALSE') install.packages('tidyverse')if ("ggplot2"%in%rownames(installed.packages()) =='FALSE') install.packages('ggplot2')if ("RColorBrewer"%in%rownames(installed.packages()) =='FALSE') install.packages('RColorBrewer')```### Load Libraries```{r load libraries, filename='r'}library(tidyverse)library(ggplot2)library(RColorBrewer)```## Merge Annotated Table with DEGlistIf you take a look at the DEGlist.tab file below, you will see that the row names themselves are the gene ID's.```{r, filename='r', eval=TRUE}diff_genes <-read.delim(file ="../output/deseq/DEGlist.tab", header =TRUE)head (diff_genes)```This contrasts with the blast GO annotated table `blast_annot_go.tab` that has a whole column (V1) dedicated to the gene ID in the same format. Additionally, notice that the column `V3`contains the UniProt accession number, which is a unique identifier for each protein entry.```{r, filename='r', eval=TRUE}blast_go <-read.delim(file ="../output/blast/blast_annot_go.tab", header =TRUE)head(blast_go)```As such, we first need to take the row names of the DEGlist.tab and make them into a new column.```{r}diff_genes$gene_ID <-row.names(diff_genes) # adds the gene name into data frame as new column called gene_IDrownames(diff_genes) <- diff_genes$X # renames all the rows to 1-n instead of the gene name```Relocate the `gene_ID` column back to the first column```{r}diff_genes %>%relocate(gene_ID, .before= baseMean) # relocate gene_ID to the first column```Rename `V1`in the `blast_go` dataframe to `gene_ID````{r, filename='r', eval=FALSE}blast_go <- blast_go %>%rename('V1'='gene_ID')```Rename `V3`in the `blast_go` dataframe to something more descriptive of the column, like `UniProt_acess_num````{r, filename='r', eval=FALSE}blast_go <- blast_go %>%rename('V3'='UniProt_access_num')```Checkout new descriptive column names in blast_go dataframe```{r, filename='r'}head(blast_go)```Now that we have both the blast go and differential gene list in the right format, we can merge the two so that we have information regarding gene ID and function paired with differentially expressed genes.```{r, filename='r', eval=FALSE}# Perform inner join between diff_genes and blast_go data frames based on shared "gene_ID" columnannote_DEG <-merge(diff_genes, blast_go, by ="gene_ID", all =FALSE)# Print the merged tablekable(annote_DEG)```Finally, let's write out the annotated DEG list to save it to our output directory.```{r}write.table(annote_DEG, "../output/annote_DEG.tab", row.names = T)```# Merge annotated DEG dataframe with count matrix## Read in count matrix```{r, filename = 'r'}countmatrix <-read.delim("../output/trinity-matrix/mcap.isoform.counts.matrix", header =TRUE, sep ='\t')rownames(countmatrix) <- countmatrix$Xcountmatrix <- countmatrix[,-1]head(countmatrix)```This count matrix has 54,384 rows (genes) and 8 columns (samples) Each column represents a sample, and each row represents a gene. The 'count' is the number of times that gene was found in the RNA sequence of the sample.### Make `countmatrix` a dataframe we can mergeRound integers up to whole numbers (for analysis)```{r, filename = 'r'}countmatrix <-round(countmatrix, 0)head(countmatrix)```Row names are gene identifiers, let's make them their own column that we can use to match `countmatrix` rows with `annote_DEG` rows. Take the row names of the `countmatrix` and make them into a new column.```{r}countmatrix$gene_ID <-row.names(countmatrix) # adds the gene name into data frame as new column called gene_IDrownames(countmatrix) <- countmatrix$X # renames all the rows to 1-n instead of the gene name```Relocate the `gene_ID` column back to the first column```{r}countmatrix %>%relocate(gene_ID, .before=1) # relocate gene_ID to the first column```#### Merge the DEG with the countmatrix using an inner joinThis new dataframe `annote_counts` has both gene annotations and count data for each biological sample!```{r}annote_counts <-merge(annote_DEG, countmatrix, by="gene_ID", all=FALSE)dim(annote_counts)```:::callout-tipNotice that the annote_counts has 20427 observations, the same as `annote_DEG`... indicating that every gene that has an annotation from the reference genome (saved in `annote_DEG`) was kept and is part of the new dataframe, `annote_counts`:::::: callout-warningWatchout! Some of these may be duplicates...:::Check for duplicates```{r}head(which(duplicated(annote_counts)))```Eliminate duplicates```{r}annote_counts <-distinct(annote_counts)head(which(duplicated(annote_counts)))```Finally, let's write out the annotated DEG list to save it to our output directory.```{r}write.table(annote_counts, "../output/annote_counts.tab", row.names = T)```