Mytilus galloprovincialis differential gene expression in response to contaminant exposure

Author

Chris

Published

June 2, 2023

Project Description

Climate change is an unavoidable suite of stressors for aquatic organisms and its effects are exacerbated by the personal care products and pharmaceuticals we use regularly. Thinking about what that multiple-stressor environment looks like and understanding its impact on aquatic organisms, we will turn our attention to marine mussels, Mytilus spp. They are an indicator organism for environmental health and inferences to those impacts to human health. I am exploring just how Mytilus spp. respond to man-made contaminants and to do that I’m using differential gene expression to help me draw conclusions about what common or unique physiological responses are displayed by M. galloprovincials when exposed to various anthropogenic stressors. This molecular-based technique ill help me get ot the heart of what physiological functions could be impaired.

Data

Since I want to explore a ‘hunch’ before making larger experimental choices, it makes sense to look at what’s out there. I chose four different types of stressors and a solid reference genome from existing sequenced data in the NCBI & Uniprot databases. Listed below are the data I chose:

NCBI & Uniprot Databases

Existing RNA-Seq data was retrieved from complete RNASeq data available in the NCBI database along wiht the reference genome. These very large, ‘crowd sourced’, and free sites provide the foundation for many types of molecular level work - give it a browse!

From this database I was able to select a reference, and examples of broad categories of stressors; organic pharmaceutical contaminants like Valsartan (blood pressure medication) and Carbemazepine (seizure), endocrine disruptoring compounds like synthetic Estrogen, biotoxins like those created by harmful algal blooms, and finally abiotic stressors like hypoxia.

Workflow

flowchart LR
  A[Study Organism] --> B[Data Search] --> C[Reproducible Bioinformatics w/ GitHub]
  C --> D{Tools for Analysis}
  D --> E[RNASeq -> Data type]
  D --> F[Kallisto & DESeq -> Quantifyng DEGs]
  D --> G[RStudio -> Visualizing Results]
  D --> H[Uniprot & Blast -> What *are* the DEGs?]
  D --> I[Annotation & Functional Gene Enrichment -> Next Steps]

We’ve already covered the study organism and the data search - lets give a shoutout to the workhorse platforms of the show: GitHub and RStudio/ R. Both polatforms are free, can be connected, and keep you from accidentally ‘losing’ your work when you aren’t using the best file structure practices. Check out Happy Git for more information on how this symbiotic relationship can be attained.

RNASeq

This is the type of sequence data I have available to work with from NCBI so I have to retrieve it. ### Pull out all of the sequences from the exposures.

Code


/home/shared/sratoolkit.3.0.2-ubuntu64/bin/./fasterq-dump \
--outdir /home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/output/ncbi \
--progress \
SRR19782039 \
SRR13013756 \
SRR7725722 \
SRR16771870

Pulling my reference genome from prefab code provided by NCBI & verify where I am working

Code


cd ../data
ls

Code


# grab the reference
cd ../data

curl -OJX GET "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/GCA_900618805.1/download?include_annotation_type=GENOME_FASTA,GENOME_GFF,RNA_FASTA,CDS_FASTA,PROT_FASTA,SEQUENCE_REPORT&filename=GCA_900618805.1.zip" -H "Accept: application/zip"

unzip GCA_900618805.1.zip /chris-musselcon/data

QC Note

Even though I didn’t do this step, it is important and how it could be done with MultiQC.

Kallisto

This process aligns my sequences to the reference to support DEG quantification.

Code


 cd /home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/data/ncbi_dataset/data/GCA_900618805.1
 head cds_from_genomic.fna

>lcl|UYJE01000001.1_cds_VDH88688.1_1 [locus_tag=MGAL_10B017214] [protein=Hypothetical predicted protein] [protein_id=VDH88688.1] [location=complement(join(55975..56224,59152..59365,60239..60337,61332..61522,64535..64608))] [gbkey=CDS]
ATGAATAGAATTACTGATAGGGACTACGACTACTATGACTTTGAAGATGACAGTGACCACGAGCCTTGCGATAGTTCTGA
TGATGATATCGAGGTTATTTTACATGGAACACCTGAACAGAAGCGTAAATTACAGACCAAAGTCCAACAAAGACATGATT
CTTCAAGTGAAGATGACTTTGAAAAGGAAATGAATAATGAACTTAACAAACATATTAAAGGACTGGTAAATGAAAGATCA
AGTAATGTTGCAGAAACTGTTCAAGGTAGTAGCAAAGCTCAAGACCAAGAGAAACCAACAGAACAACAACAATTTTATGA
TGATATTTATTACGATTCAGAAGAAGAGGAAATGGTTTTACAAGGTGATGAACGTGTCAAAAGAAGACAACCTGTTCAAA
GCAATGATGACTTATTGTACGATCCTGACCTAGACGAAGAAGACCAGCGATGGGTTGATGCTGAACGACAAGCTTATCAG
CTGCCTGTACCCTCAGGATCCAAATCAAAACGTCAAAACAGTGATGCAGTTTTAAACTGTCCCGCTTGTATGACATTACT
GTGTCTTGATTGTCAGGGGCATGATGTTTATGAAAACCAGTACAGAGCTATGTTTGTTAAGAACTGTCGTGTCGATACAT
CAGAATTATTAAAACAGCCGTTACAGAAGAAAAAACGTAAAAAAAAACAGAAGACATTGGACACTACAAATAATGAAACA

Create the index file to align my short read files to the genes from the MGAL_10 (reference genome)

Code

/home/shared/kallisto/kallisto \
index \
-i ../data/MGAL_cds.index \
../data/ncbi_dataset/data/GCA_900618805.1/cds_from_genomic.fna

Let’s take a look

Code

head ../data/ncbi_dataset/data/GCA_900618805.1/cds_from_genomic.fna

Code

/home/shared/kallisto/kallisto quant

kallisto 0.46.1
Computes equivalence classes for reads and quantifies abundances

Usage: kallisto quant [arguments] FASTQ-files

Required arguments:
-i, --index=STRING            Filename for the kallisto index to be used for
                              quantification
-o, --output-dir=STRING       Directory to write output to

Optional arguments:
    --bias                    Perform sequence based bias correction
-b, --bootstrap-samples=INT   Number of bootstrap samples (default: 0)
    --seed=INT                Seed for the bootstrap sampling (default: 42)
    --plaintext               Output plaintext instead of HDF5
    --fusion                  Search for fusions for Pizzly
    --single                  Quantify single-end reads
    --single-overhang         Include reads where unobserved rest of fragment is
                              predicted to lie outside a transcript
    --fr-stranded             Strand specific reads, first read forward
    --rf-stranded             Strand specific reads, first read reverse
-l, --fragment-length=DOUBLE  Estimated average fragment length
-s, --sd=DOUBLE               Estimated standard deviation of fragment length
                              (default: -l, -s values are estimated from paired
                               end data, but are required when using --single)
-t, --threads=INT             Number of threads to use (default: 1)
    --pseudobam               Save pseudoalignments to transcriptome to BAM file
    --genomebam               Project pseudoalignments to genome sorted BAM file
-g, --gtf                     GTF file for transcriptome information
                              (required for --genomebam)
-c, --chromosomes             Tab separated file with chromosome names and lengths
                              (optional for --genomebam, but recommended)

Code

ls /home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/output/ncbi/

SRR13013756.fastq
SRR13013756_fastqc.html
SRR13013756_fastqc.zip
SRR16771870_1.fastq
SRR16771870_1_fastqc.html
SRR16771870_1_fastqc.zip
SRR16771870_2.fastq
SRR16771870_2_fastqc.html
SRR16771870_2_fastqc.zip
SRR19782039.fastq
SRR19782039_fastqc.html
SRR19782039_fastqc.zip
SRR7725722_1.fastq
SRR7725722_1_fastqc.html
SRR7725722_1_fastqc.zip
SRR7725722_2.fastq
SRR7725722_2_fastqc.html
SRR7725722_2_fastqc.zip

Code


#mkdir ../output
mkdir ../output/kallisto_01

#pulling all files for comparison to my index

find /home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/output/ncbi/*_1.fastq \
| xargs basename -s _1.fastq  | xargs -I{} /home/shared/kallisto/kallisto \
quant -i ../data/MGAL_cds.index \
-o ../output/kallisto_01/{} \
-t 4 \
/home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/output/ncbi/{}_1.fastq \
/home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/output/ncbi/{}_2.fastq \

Getting length mean & SD for the single read sequences in the .fastq files. Code taken from: https://www.biostars.org/p/243552/

Code


awk 'BEGIN { t=0.0;sq=0.0; n=0;} ;NR%4==2 {n++;L=length($0);t+=L;sq+=L*L;}END{m=t/n;printf("total %d avg=%f stddev=%f\n",n,m,sq/n-m*m);}' /home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/output/ncbi/*.fastq

Grabbing only the fastq files

Code


find /home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/output/ncbi/*.fastq \
| xargs basename -s .fastq  | xargs -I{} /home/shared/kallisto/kallisto \
quant -i ../data/MGAL_cds.index \
--single -l 92 -s 405 \
-o ../output/kallisto_01/{} \
-t 4 \
/home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/output/ncbi/{}.fastq

Code


perl /home/shared/trinityrnaseq-v2.12.0/util/abundance_estimates_to_matrix.pl \
--est_method kallisto \
    --gene_trans_map none \
    --out_prefix /home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/output/kallisto_01 \
    --name_sample_by_basedir \
    /home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/output/kallisto_01/SRR7725722/abundance.tsv \
    /home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/output/kallisto_01/SRR7725722_1/abundance.tsv \
    /home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/output/kallisto_01/SRR7725722_2/abundance.tsv \
    /home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/output/kallisto_01/SRR13013756/abundance.tsv \
    /home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/output/kallisto_01/SRR16771870/abundance.tsv \
    /home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/output/kallisto_01/SRR16771870_1/abundance.tsv \
    /home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/output/kallisto_01/SRR16771870_2/abundance.tsv \
    /home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/output/kallisto_01/SRR19782039/abundance.tsv

Let’s get this into a manageable format we can feed to DESeq and followup with visualization.

Code

countmatrix <- read.delim("../output/kallisto_01.isoform.counts.matrix", header = TRUE, sep = '\t')
rownames(countmatrix) <- countmatrix$X
countmatrix <- countmatrix[,-1]
head(countmatrix)
#tail(countmatrix)
#dim(countmatrix)

Code

countmatrix <- round(countmatrix, 0)
str(countmatrix)

Code

dim(countmatrix)
#dim(deseq2.colData)

Code

length(colnames(data))

DESeq2

This quantifies how many differentially expressed genes exist based on the parameters we set. We will get the data into a dataframe for review.

Code

deseq2.colData <- data.frame(condition=factor(c("DSP", "DSP", "Hypoxia", "EE2", "EE2", "EE2", "Valsartan")), 
                             type=factor(rep("single-read", 7)))
rownames(deseq2.colData) <- colnames(data)
dim(deseq2.colData)
deseq2.dds <- DESeqDataSetFromMatrix(countData = countmatrix,
                                     colData = deseq2.colData, 
                                     design = ~ condition)

Code

deseq2.dds <- DESeq(deseq2.dds)
deseq2.res <- results(deseq2.dds)
deseq2.res <- deseq2.res[order(rownames(deseq2.res)), ]

head(deseq2.dds)

Visualization

This is an interim step to look at your DESeq data and see what is interesting/ make any general inferences. ### PCA Plot

Code

vsd <- vst(deseq2.dds, blind = FALSE)
plotPCA(vsd, intgroup = "condition")

Code

# Select top 50 differentially expressed genes - is it valuable to select more than 50?
res <- results(deseq2.dds)
res_ordered <- res[order(res$padj), ]
top_genes <- row.names(res_ordered)[1:50]

# Extract counts and normalize
counts <- counts(deseq2.dds, normalized = TRUE)
counts_top <- counts[top_genes, ]

# Log-transform counts
log_counts_top <- log2(counts_top + 1)

# Generate heatmap
pheatmap(log_counts_top, scale = "row")

Our PCA shows strong clustering and warrants further review. ### Heatmap

Code

#what are we doing here and how do I get it to compare the others against each other?
head(deseq2.res)

Code

# Count number of hits with adjusted p-value less then 0.05
dim(deseq2.res[!is.na(deseq2.res$padj) & deseq2.res$padj <= 0.05, ])

Our heatmap shows us the contaminant sequences labeled below the color blocks on the X-axis, and up-regulated (red) or down-regulated (blue) genes and the y-axis. When we go back and match the sequence numbers we see confirmation of the result of the PCA. Now we want to look at how many DEGs are statistically significant. ### Volcano Plots

Code

tmp <- deseq2.res
# The main plot
plot(tmp$baseMean, tmp$log2FoldChange, pch=20, cex=0.45, ylim=c(-3, 3), log="x", col="darkgray",
     main="DEG Valsartan vs DSP (pval <= 0.05)",
     xlab="mean of normalized counts",
     ylab="Log2 Fold Change")
# Getting the significant points and plotting them again so they're a different color
tmp.sig <- deseq2.res[!is.na(deseq2.res$padj) & deseq2.res$padj <= 0.05, ]
points(tmp.sig$baseMean, tmp.sig$log2FoldChange, pch=20, cex=0.45, col="red")
# 2 FC lines
abline(h=c(-1,1), col="blue")

Code

# Prepare the data for plotting
res_df <- as.data.frame(deseq2.res)
res_df$gene <- row.names(res_df)

# Create volcano plot
volcano_plot <- ggplot(res_df, aes(x = log2FoldChange, y = -log10(padj), color = padj < 0.05)) +
  geom_point(alpha = 0.6, size = 1.5) +
  scale_color_manual(values = c("grey", "red")) +
  labs(title = "Volcano Plot",
       x = "Log2 Fold Change",
       y = "-Log10 Adjusted P-value",
       color = "Significantly\nDifferentially Expressed") +
  theme_minimal() +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        legend.position = "top")

print(volcano_plot)

Now we have something to work with, so we create a result table for further work.

Code

write.table(tmp.sig, "../output/DEGlist.tab", sep = '\t', row.names = T)

Annotation

This will take my annotated reference gene sequence (cds file) and compare it to the result table of differentially expressed genes we created above and a reference of all the known proteins in the Uniprot protein database so we can blast them against each other and pick up what the functions of the DEGs may be.

Using BLAST

This is the tool to compare my cds file and the uniprot database. Going to call my existing uniprot database made in week 2 to blast my cds file

Code



/home/shared/8TB_HDD_02/cnmntgna/Applications/bioinfo/ncbi-blast-2.13.0+/bin/blastx \
-query /home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/data/ncbi_dataset/data/GCA_900618805.1/cds_from_genomic.fna \
-db /home/shared/8TB_HDD_02/cnmntgna/output/blastdb/uniprot_sprot_r2023_01 \
-out /home/shared/8TB_HDD_02/cnmntgna/data/cds_blastx.tab \
-evalue 1E-20 \
-num_threads 20 \
-max_target_seqs 1 \
-outfmt 6

Code


head /home/shared/8TB_HDD_02/cnmntgna/data/cds_blastx.tab

Tables on tables on tables!

This is my annotated reference genome.

Code


head -2 ../code/cds.csv
wc -l ../code/cds.csv

Fixing the format for a clean merge/ join.

Code

tr '|' '\t' < ../code/cds.csv | head -2

Code

tr '|' '\t' < ../code/cds.csv \
> ../data/cds.csv

Code


head ../data/cds.csv

This is the list of DEGs generated from DESeq and will also be prepared for merging/ joining

Code

head -2 ../output/DEGlist.tab
wc -l ../output/DEGlist.tab

Code

tr '|' '\t' < ../output/DEGlist.tab | head -2

Code

tr '|' '\t' < ../output/DEGlist.tab \
> ../data/DEGlist.tab

Code


head ../data/DEGlist.tab

Code


head ../data/cds.tab

Verify everything is formatted correctly

Code

getwd()

[1] "/home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/code"

Code

cds <- read.table("/home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/data/cds.tab", sep = "\t", header = FALSE)
deg <- read.table("/home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/data/DEGlist.tab", sep = "\t", header = TRUE)

Extract the actual term I need from cds before I merge with deg.

Code

library(tidyverse)

# Read the TSV file into a data frame
cds1 <- read.table("/home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/data/cds.tab", sep = "\t", header = FALSE)

# Separate the column into multiple columns based on a delimiter
cds2 <- separate(cds1, V2, into = c("Column 1", "Column 2", "Column 3"), sep = "\t")

# Write the updated data frame back to a TSV file
write.table(cds2, file = "/home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/data/newcds.tab", sep = "\t", row.names = FALSE)

Fixing deg column data being used as row names

Code

# Create a new column with the values from row names
deg$column_name <- rownames(deg)

# Reset the row names to default numeric indices
rownames(deg) <- NULL

Merge cds and DEGlist before merging with Uniprot protein table

Code

cdsDEG <- merge(cds2, deg, by.x = 1, by.y = 7 )
head(cdsDEG)

Prep Uniprot table for joining

Code

head -2 ../data/uniprot_table_r2023_01.tab
wc -l ../data/uniprot_table_r2023_01.tab

Entry   Reviewed    Entry Name  Protein names   Gene Names  Organism    Length  Gene Ontology (molecular function)  Gene Ontology (GO)  Gene Ontology (biological process)  Gene Ontology (cellular component)  Gene Ontology IDs   Interacts with  EC number   Reactome    UniPathway  InterPro
A0A023I7E1  reviewed    ENG1_RHIMI  Glucan endo-1,3-beta-D-glucosidase 1 (Endo-1,3-beta-glucanase 1) (EC 3.2.1.39) (Laminarinase) (RmLam81A)    ENG1 LAM81A Rhizomucor miehei   796 glucan endo-1,3-beta-D-glucosidase activity [GO:0042973]; glucan endo-1,3-beta-glucanase activity, C-3 substituted reducing group [GO:0052861]; glucan endo-1,4-beta-glucanase activity, C-3 substituted reducing group [GO:0052862]    extracellular region [GO:0005576]; glucan endo-1,3-beta-D-glucosidase activity [GO:0042973]; glucan endo-1,3-beta-glucanase activity, C-3 substituted reducing group [GO:0052861]; glucan endo-1,4-beta-glucanase activity, C-3 substituted reducing group [GO:0052862]; cell wall organization [GO:0071555]; polysaccharide catabolic process [GO:0000272] cell wall organization [GO:0071555]; polysaccharide catabolic process [GO:0000272]  extracellular region [GO:0005576]   GO:0000272; GO:0005576; GO:0042973; GO:0052861; GO:0052862; GO:0071555      3.2.1.39            IPR005200;IPR040720;IPR040451;
569214 ../data/uniprot_table_r2023_01.tab

Code

uni<- read.csv("../data/uniprot_table_r2023_01.tab", sep = '\t', header = TRUE)

Join all three tables and take a look at the list.

Code

library(kableExtra)
library(knitr)
library(dplyr)
library(readr)
#new table process
full <-left_join(cdsDEG, uni, by = c("Column 2" = "Entry"))
write.table(full, file = "/home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/data/annoblast.tab", sep = "\t", row.names = FALSE)

My table is acting up, so it will be a mystery until resolved. mystery table

Next Steps

Functional Gene Enrichment
Physiological inference
Preliminary conclusions & any further evaluation of targeted processes
If you want to follow along as I continue, please check out the Rpubs link or the Zenodo data repository linked below
Rpubs link