DOI Githup release June 6, 2023
This project contains a workflow for examining the effect of ocean acidification on gene expression of snow crabs. It runs using paired end RNA-Seq data prepared and provided by Laura Spencer. 63 juvenile crabs were exposed to the following pH treatments:
On 7/20/2021, after exposure, all crabs were sacrificed by puncturing the carapace through the cardiac region. Crabs were preserved overnight at 4C prior to being transferred to -80C. mRNA samples were then extracted. Each of the 63 mRNA samples were run in both lanes of NovaSeq 6000, for a grand total of 126 paired-end RNA-Seq data sets. Sequence data received January 3, 2022.
Volcano plot of snow crab gene expression when exposed to pH 7.5 for 12 weeks hours
Download these packages for analysis and figure generation.
library(DESeq2)
library(tidyverse)
library(pheatmap)
library(RColorBrewer)
library(data.table)
library(gplots)
library(knitr)
Check that blast software has been downloaded
ls /home/shared/
There should be 63 samples, each with a forward and reverse sequence (designated by R1 and R2) for a total of 126 files.
ls /home/shared/8TB_HDD_01/snow_crab/5010
There are three available transcriptomes for opilio crabs: molting gland, eyestalk ganglia, and hepatopancreas. Here, we will be using the molting gland transcriptome.
cd ../data
curl -o "c_opilio_mg_transcriptome.fasta" -H "Accept: text/plain; format=tsv" "https://gannet.fish.washington.edu/seashell/snaps/HBXI01.fasta"
Examine the molting gland (mg) transcriptome:
head ../data/c_opilio_mg_transcriptome.fasta
echo "How many sequences are there?"
grep -c ">" ../data/c_opilio_mg_transcriptome.fasta
## >ENA|HBXI01000001|HBXI01000001.1 TSA: Chionoecetes opilio, transcript: TRINITY_DN22351_c0_g1_i1
## GGAGGCGGTAGCAGAGTGGTTAGCGTCACTGAGTGGCATTCTCGAAGACCCGAGTTCGAT
## CCCGAGTGAGACCCATAAATCCACTTAAAATATTTCTGCACCGTCGAGTGTCCCAAGAAC
## ACCCACATGCCCATTAACTCTACCTTCGACTGCAAGTCGAGTGCATCGGAGGACACAGGG
## TGGCGGCATGAGCCTCTAATCACAGGCGTTACTATAGAAATCCTGCCTGCGCTACCACGG
## GCAAGGCAGGATTGATGGCCAACCATCATAGCCCTCAAGTGACTTAATTAACGTGCCATC
## CCACCCCTGAAGGTTACTGCCTACACGCACTATAGGCCAAGATGTGATAAAAAAAAGCTT
## GCAAATGAAGCCTAAGCTGAGTTAGTAAGCTTGTCTCTGATTGGCCAGGGGCTCAGGCAG
## GCTGTCGCCCCGCTCATCAGATTGCTCTTTCT
## >ENA|HBXI01000002|HBXI01000002.1 TSA: Chionoecetes opilio, transcript: TRINITY_DN22351_c5_g1_i1
## How many sequences are there?
## 136190
Download UniProt/Swiss-prot fasta, rename with year and version, then unzip the file.
cd ../data
curl -O https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
mv uniprot_sprot.fasta.gz uniprot_sprot_r2023_01.fasta.gz
gunzip -k uniprot_sprot_r2023_01.fasta.gz
ls ../data
Make blast database, save the Uniprot/Swiss-Prot data there.
/home/shared/ncbi-blast-2.11.0+/bin/makeblastdb \
-in ../data/uniprot_sprot_r2023_01.fasta \
-dbtype prot \
-out ../blastdb/uniprot_sprot_r2023_01
Blast the opilio molting gland transcriptome
/home/shared/ncbi-blast-2.11.0+/bin/blastx \
-query ../data/c_opilio_mg_transcriptome.fasta \
-db ../blastdb/uniprot_sprot_r2023_01 \
-out ../output/c_opilio_mg_uniprot_blastx.tab \
-evalue 1E-20 \
-num_threads 20 \
-max_target_seqs 1 \
-outfmt 6
Examine blastx table
head -2 ../output/c_opilio_mg_uniprot_blastx.tab
wc -l ../output/c_opilio_mg_uniprot_blastx.tab
## ENA|HBXI01000004|HBXI01000004.1 sp|Q86VS8|HOOK3_HUMAN 44.444 351 190 2 1044 1 178 526 2.00e-73 244
## ENA|HBXI01000009|HBXI01000009.1 sp|Q16UN6|NCBP1_AEDAE 55.016 309 115 3 1289 432 506 813 3.19e-97 311
## 27723 ../output/c_opilio_mg_uniprot_blastx.tab
Download UniProt table into data directory, rename as “uniprot_table_r2023_01.tab”
cd ../data
curl -o "uniprot_table_r2023_01.tab" -H "Accept: text/plain; format=tsv" "https://gannet.fish.washington.edu/seashell/snaps/uniprot_table_r2023_01.tab"
Examine UniProt table
head -2 ../data/uniprot_table_r2023_01.tab
## Entry Reviewed Entry Name Protein names Gene Names Organism Length Gene Ontology (molecular function) Gene Ontology (GO) Gene Ontology (biological process) Gene Ontology (cellular component) Gene Ontology IDs Interacts with EC number Reactome UniPathway InterPro
## A0A023I7E1 reviewed ENG1_RHIMI Glucan endo-1,3-beta-D-glucosidase 1 (Endo-1,3-beta-glucanase 1) (EC 3.2.1.39) (Laminarinase) (RmLam81A) ENG1 LAM81A Rhizomucor miehei 796 glucan endo-1,3-beta-D-glucosidase activity [GO:0042973]; glucan endo-1,3-beta-glucanase activity, C-3 substituted reducing group [GO:0052861]; glucan endo-1,4-beta-glucanase activity, C-3 substituted reducing group [GO:0052862] extracellular region [GO:0005576]; glucan endo-1,3-beta-D-glucosidase activity [GO:0042973]; glucan endo-1,3-beta-glucanase activity, C-3 substituted reducing group [GO:0052861]; glucan endo-1,4-beta-glucanase activity, C-3 substituted reducing group [GO:0052862]; cell wall organization [GO:0071555]; polysaccharide catabolic process [GO:0000272] cell wall organization [GO:0071555]; polysaccharide catabolic process [GO:0000272] extracellular region [GO:0005576] GO:0000272; GO:0005576; GO:0042973; GO:0052861; GO:0052862; GO:0071555 3.2.1.39 IPR005200;IPR040720;IPR040451;
Save annotated opilio transcriptome as local objects for later use.
annot_tab <- read.csv("../output/blast_annot_go_mg.tab", sep = '\t', header = TRUE)
We want to index “c_opilio_mg_transcriptome.fasta” and rename it as “c_opilio_mg_transcriptome.index” using kallisto.
/home/shared/kallisto/kallisto \
index -i \
../data/c_opilio_mg_transcriptome.index \
../data/c_opilio_mg_transcriptome.fasta
First, we want to make a new directory called “kallisto_mg” in our output folder. We then create a folder within “kallisto_mg” for each crab sample that will each hold the abundance estimates generated by kallisto. The name of each folder will match the basename of the original RNAseq file. This will generate 126 folders (two for each sample–both forward and reverse) so we will later have to remove one folder (will be empty) from each pair.
Then, we tell Kallisto which files to quantify. Kallisto does a psuedo-alignment of each file using an index (“c_opilio_mg_transcriptome.index”), then quantifies the abundance of each contig. We will look for all fastq.gz files in the directory “/5010” (our 126 RNA seq files), using wildcard (*) to find files with any base name.
We ask kallisto to generate abundance estimates using both forward and reverse reads, bundle them into a file, and save them into a directory that we already created and named accordingly.
Now, we can remove all folders in our “kallisto_mg” folder that are empty, which will be those ending in “R2_001.fastq.gz”.
#make a new directory in output
mkdir ../output/kallisto_mg
#find specific files and create abundance estimates
find /home/shared/8TB_HDD_01/snow_crab/5010/*fastq.gz \
| xargs basename -s _L003_R1_001.fastq.gz | xargs -I{} \
/home/shared/kallisto/kallisto \
quant -i ../data/c_opilio_mg_transcriptome.index \
-o ../output/kallisto_mg/{} \
-t 40 \
/home/shared/8TB_HDD_01/snow_crab/5010/{}_L003_R1_001.fastq.gz \
/home/shared/8TB_HDD_01/snow_crab/5010/{}_L003_R2_001.fastq.gz
# delete empty files
rm -r ../output/kallisto_mg/*R2_001.fastq.gz/
First, we can list all folder names in our kallisto_mg file, then use this list in our next step.
ls ~/courtney_RNAseq_crabs/output/kallisto_mg
#use these folder names for perl input
Now, we use abundance_estimates_to_matrix.pl script from the Trinity RNA-seq assembly software package to create a matrix with all of the abundance.tsv files. This will generate a matrix called “kallisto_paired.isoform.counts.matrix” in our output folder.
#Sample name input for perl
perl /home/shared/trinityrnaseq-v2.12.0/util/abundance_estimates_to_matrix.pl \
--est_method kallisto \
--gene_trans_map none \
--out_prefix ../output/kallisto_mg \
--name_sample_by_basedir \
../output/kallisto_mg/5010_1_S1/abundance.tsv \
../output/kallisto_mg/5010_2_S2/abundance.tsv \
../output/kallisto_mg/5010_3_S3/abundance.tsv \
../output/kallisto_mg/5010_4_S4/abundance.tsv \
../output/kallisto_mg/5010_5_S5/abundance.tsv \
../output/kallisto_mg/5010_6_S6/abundance.tsv \
../output/kallisto_mg/5010_7_S7/abundance.tsv \
../output/kallisto_mg/5010_8_S8/abundance.tsv \
../output/kallisto_mg/5010_9_S9/abundance.tsv \
../output/kallisto_mg/5010_10_S10/abundance.tsv \
../output/kallisto_mg/5010_11_S11/abundance.tsv \
../output/kallisto_mg/5010_12_S12/abundance.tsv \
../output/kallisto_mg/5010_13_S13/abundance.tsv \
../output/kallisto_mg/5010_14_S14/abundance.tsv \
../output/kallisto_mg/5010_15_S15/abundance.tsv \
../output/kallisto_mg/5010_16_S16/abundance.tsv \
../output/kallisto_mg/5010_17_S17/abundance.tsv \
../output/kallisto_mg/5010_18_S18/abundance.tsv \
../output/kallisto_mg/5010_19_S19/abundance.tsv \
../output/kallisto_mg/5010_20_S20/abundance.tsv \
../output/kallisto_mg/5010_21_S21/abundance.tsv \
../output/kallisto_mg/5010_22_S22/abundance.tsv \
../output/kallisto_mg/5010_23_S23/abundance.tsv \
../output/kallisto_mg/5010_24_S24/abundance.tsv \
../output/kallisto_mg/5010_25_S25/abundance.tsv \
../output/kallisto_mg/5010_26_S26/abundance.tsv \
../output/kallisto_mg/5010_27_S27/abundance.tsv \
../output/kallisto_mg/5010_28_S28/abundance.tsv \
../output/kallisto_mg/5010_29_S29/abundance.tsv \
../output/kallisto_mg/5010_30_S30/abundance.tsv \
../output/kallisto_mg/5010_31_S31/abundance.tsv \
../output/kallisto_mg/5010_32_S32/abundance.tsv \
../output/kallisto_mg/5010_33_S33/abundance.tsv \
../output/kallisto_mg/5010_34_S34/abundance.tsv \
../output/kallisto_mg/5010_35_S35/abundance.tsv \
../output/kallisto_mg/5010_36_S36/abundance.tsv \
../output/kallisto_mg/5010_37_S37/abundance.tsv \
../output/kallisto_mg/5010_38_S38/abundance.tsv \
../output/kallisto_mg/5010_39_S39/abundance.tsv \
../output/kallisto_mg/5010_40_S40/abundance.tsv \
../output/kallisto_mg/5010_41_S41/abundance.tsv \
../output/kallisto_mg/5010_42_S42/abundance.tsv \
../output/kallisto_mg/5010_43_S43/abundance.tsv \
../output/kallisto_mg/5010_44_S44/abundance.tsv \
../output/kallisto_mg/5010_45_S45/abundance.tsv \
../output/kallisto_mg/5010_46_S46/abundance.tsv \
../output/kallisto_mg/5010_47_S47/abundance.tsv \
../output/kallisto_mg/5010_48_S48/abundance.tsv \
../output/kallisto_mg/5010_49_S49/abundance.tsv \
../output/kallisto_mg/5010_50_S50/abundance.tsv \
../output/kallisto_mg/5010_51_S51/abundance.tsv \
../output/kallisto_mg/5010_52_S52/abundance.tsv \
../output/kallisto_mg/5010_53_S53/abundance.tsv \
../output/kallisto_mg/5010_54_S54/abundance.tsv \
../output/kallisto_mg/5010_55_S55/abundance.tsv \
../output/kallisto_mg/5010_56_S56/abundance.tsv \
../output/kallisto_mg/5010_57_S57/abundance.tsv \
../output/kallisto_mg/5010_58_S58/abundance.tsv \
../output/kallisto_mg/5010_59_S59/abundance.tsv \
../output/kallisto_mg/5010_60_S60/abundance.tsv \
../output/kallisto_mg/5010_61_S61/abundance.tsv \
../output/kallisto_mg/5010_62_S62/abundance.tsv \
../output/kallisto_mg/5010_63_S63/abundance.tsv
Save our count matrix from its location into a local object in R. Label the rows and columns of the matrix. Looking at metadata, we can see that the first 12 samples were the “control”, followed by 14 “pH 7.5-long exposure”, 12 “pH 7.5-short exposure”, 14 “pH 7.8-long exposure”, then 11 “pH 7.8-short exposure”. This gives us a total of 63 paired-end reads. This information can be used to label the matrix accordingly.
Here, we select columns 2:27 which contain the control and pH 7.5 long exposure sequences. Locations of other pH treatments:
countmatrix <- read.delim("../output/kallisto_mg.isoform.counts.matrix", header = TRUE, sep = '\t')
rownames(countmatrix) <- countmatrix$X
countmatrix_pH7.5_long <- countmatrix[,2:27] #select only the control and pH 7.5 (long exposure)
head(countmatrix_pH7.5_long, 2)
## X5010_1_S1 X5010_2_S2 X5010_3_S3 X5010_4_S4
## ENA|HBXI01058009|HBXI01058009.1 2.02523 5.13229 0 0
## ENA|HBXI01123645|HBXI01123645.1 3.00000 12.00000 5 5
## X5010_5_S5 X5010_6_S6 X5010_7_S7 X5010_8_S8
## ENA|HBXI01058009|HBXI01058009.1 1.02115 1.01759 4.0102 3.25025
## ENA|HBXI01123645|HBXI01123645.1 7.00000 2.00000 2.0000 3.00000
## X5010_9_S9 X5010_10_S10 X5010_11_S11
## ENA|HBXI01058009|HBXI01058009.1 0 5.17748 1.04771
## ENA|HBXI01123645|HBXI01123645.1 8 15.00000 4.00000
## X5010_12_S12 X5010_13_S13 X5010_14_S14
## ENA|HBXI01058009|HBXI01058009.1 1.08902 1.01784 1.01729
## ENA|HBXI01123645|HBXI01123645.1 5.00000 9.00000 6.00000
## X5010_15_S15 X5010_16_S16 X5010_17_S17
## ENA|HBXI01058009|HBXI01058009.1 1.04906 0.00000 3.06902
## ENA|HBXI01123645|HBXI01123645.1 3.00000 7.63786 6.00000
## X5010_18_S18 X5010_19_S19 X5010_20_S20
## ENA|HBXI01058009|HBXI01058009.1 0 0 1.01147
## ENA|HBXI01123645|HBXI01123645.1 4 13 2.00000
## X5010_21_S21 X5010_22_S22 X5010_23_S23
## ENA|HBXI01058009|HBXI01058009.1 1.01424 1.02296 0
## ENA|HBXI01123645|HBXI01123645.1 1.00000 7.00000 7
## X5010_24_S24 X5010_25_S25 X5010_26_S26
## ENA|HBXI01058009|HBXI01058009.1 1.01176 4.07845 6.12139
## ENA|HBXI01123645|HBXI01123645.1 3.00000 5.00000 3.00000
Round up to whole numbers.
Use DESeq2 to see differential gene expression based on pH exposure.
deseq2.colData <- data.frame(condition=factor(c(rep("control", 12), rep("pH 7.5-long exposure", 14))))
rownames(deseq2.colData) <- colnames(data)
deseq2.dds <- DESeqDataSetFromMatrix(countData = countmatrix_pH7.5_long,
colData = deseq2.colData,
design = ~ condition)
dim(countmatrix_pH7.5_long)
## [1] 136190 26
dim(deseq2.colData)
## [1] 26 1
Here, we save the results to local object “deseq2.dds” so that we can create plots in R.
deseq2.dds <- DESeq(deseq2.dds)
deseq2.res <- results(deseq2.dds) #save results of DESeq as a new variable
deseq2.res <- deseq2.res[order(rownames(deseq2.res)), ]
Look at deseq2.res to make sure that everything looks good.
head(deseq2.res)
## log2 fold change (MLE): condition pH.7.5.long.exposure vs control
## Wald test p-value: condition pH.7.5.long.exposure vs control
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE stat
## <numeric> <numeric> <numeric> <numeric>
## ENA|HBXI01000001|HBXI01000001.1 1.726586 -2.407688 1.080991 -2.227297
## ENA|HBXI01000002|HBXI01000002.1 92.584796 -0.265826 0.245711 -1.081865
## ENA|HBXI01000003|HBXI01000003.1 32.559274 0.153707 0.247062 0.622138
## ENA|HBXI01000004|HBXI01000004.1 138.252859 -0.439818 0.219847 -2.000559
## ENA|HBXI01000005|HBXI01000005.1 10.396522 0.725797 0.313181 2.317499
## ENA|HBXI01000006|HBXI01000006.1 0.784975 0.169092 1.052060 0.160724
## pvalue padj
## <numeric> <numeric>
## ENA|HBXI01000001|HBXI01000001.1 0.0259275 NA
## ENA|HBXI01000002|HBXI01000002.1 0.2793124 0.674916
## ENA|HBXI01000003|HBXI01000003.1 0.5338510 0.848168
## ENA|HBXI01000004|HBXI01000004.1 0.0454399 0.323964
## ENA|HBXI01000005|HBXI01000005.1 0.0204765 0.226191
## ENA|HBXI01000006|HBXI01000006.1 0.8723104 NA
Count the number of hits with adjusted p-value less then 0.05
dim(deseq2.res[!is.na(deseq2.res$padj) & deseq2.res$padj <= 0.05, ])
## [1] 662 6
Save significant DESeq analysis results as local object for later use in figures and create a table in “output”.
deseq2.res.sig <- deseq2.res[!is.na(deseq2.res$padj) & deseq2.res$padj <= 0.05, ]
write.table(deseq2.res.sig, "../output/DEGlist.tab", row.names = T)
Here, we make sure that the columns of DEGlist.tab are formatted in a way that is compatible with the columns of annot_tab. We need to separate “ENA|HBXI01000282|HBXI01000282.1” into individual elements “ENA,” “HBXI01000282,” and “HBXI01000282.1” by replacing pipes with tabs.
cat ../output/DEGlist.tab | tr '|' '\t' | tr -d \" | sed '1d' \
> ../output/DEGlist_sep.tab #rename with "_sep"
Compare the original table to separated table to make sure that columns were correctly separated.
head -2 ../output/DEGlist.tab
## "baseMean" "log2FoldChange" "lfcSE" "stat" "pvalue" "padj"
## "ENA|HBXI01000282|HBXI01000282.1" 10785.1007252411 0.390401357378876 0.110184336461187 3.54316566144951 0.000395354264658291 0.0375869027719842
head -2 ../output/DEGlist_sep.tab
## ENA HBXI01000282 HBXI01000282.1 10785.1007252411 0.390401357378876 0.110184336461187 3.54316566144951 0.000395354264658291 0.0375869027719842
## ENA HBXI01001137 HBXI01001137.1 265.786801907878 -0.441150615577295 0.107666506562387 -4.0973802314434 4.17852238082971e-05 0.0114411698762291
Read the separated DEG table as a local variable “DEGtab” so that we can use it in “left_join” in the next step.
DEGtab <- read.csv("../output/DEGlist_sep.tab", sep = "", header = FALSE)
# head(DEGtab)
Join the DEG table and the annotated table where the columns match. Here, V3 in DEGtab matches V3 in annot_tab. We write the output into a file in the “output” directory.
DEG_annot_tab <- left_join(DEGtab, annot_tab, by = c("V3" = "V3")) %>%
write_delim("../output/DEG_annot.tab", delim = '\t')
# look at annotated table of differentially expressed genes
head(DEG_annot_tab, 2)
## V1.x V2.x V3 V4.x V5.x V6.x V7.x
## 1 ENA HBXI01000282 HBXI01000282.1 10785.1007 0.3904014 0.1101843 3.543166
## 2 ENA HBXI01001137 HBXI01001137.1 265.7868 -0.4411506 0.1076665 -4.097380
## V8.x V9.x V1.y V2.y V4.y V5.y V6.y V7.y V8.y
## 1 3.953543e-04 0.03758690 <NA> <NA> <NA> <NA> <NA> NA NA
## 2 4.178522e-05 0.01144117 ENA HBXI01001137 sp P48444 COPD_HUMAN 47.222 216
## V9.y V10 V11 V12 V13 V14 V15 V16 Reviewed Entry.Name
## 1 NA NA NA NA NA NA NA NA <NA> <NA>
## 2 113 1 1 648 296 510 7.34e-67 219 reviewed COPD_HUMAN
## Protein.names Gene.Names
## 1 <NA> <NA>
## 2 Coatomer subunit delta (Archain) (Delta-coat protein) (Delta-COP) ARCN1 COPD
## Organism Length Gene.Ontology..molecular.function.
## 1 <NA> NA <NA>
## 2 Homo sapiens (Human) 511 RNA binding [GO:0003723]
## Gene.Ontology..GO.
## 1 <NA>
## 2 COPI vesicle coat [GO:0030126]; cytosol [GO:0005829]; endoplasmic reticulum membrane [GO:0005789]; Golgi membrane [GO:0000139]; membrane [GO:0016020]; transport vesicle [GO:0030133]; RNA binding [GO:0003723]; adult locomotory behavior [GO:0008344]; cerebellar Purkinje cell layer maturation [GO:0021691]; endoplasmic reticulum to Golgi vesicle-mediated transport [GO:0006888]; Golgi localization [GO:0051645]; intracellular protein transport [GO:0006886]; pigmentation [GO:0043473]; retrograde vesicle-mediated transport, Golgi to endoplasmic reticulum [GO:0006890]
## Gene.Ontology..biological.process.
## 1 <NA>
## 2 adult locomotory behavior [GO:0008344]; cerebellar Purkinje cell layer maturation [GO:0021691]; endoplasmic reticulum to Golgi vesicle-mediated transport [GO:0006888]; Golgi localization [GO:0051645]; intracellular protein transport [GO:0006886]; pigmentation [GO:0043473]; retrograde vesicle-mediated transport, Golgi to endoplasmic reticulum [GO:0006890]
## Gene.Ontology..cellular.component.
## 1 <NA>
## 2 COPI vesicle coat [GO:0030126]; cytosol [GO:0005829]; endoplasmic reticulum membrane [GO:0005789]; Golgi membrane [GO:0000139]; membrane [GO:0016020]; transport vesicle [GO:0030133]
## Gene.Ontology.IDs
## 1 <NA>
## 2 GO:0000139; GO:0003723; GO:0005789; GO:0005829; GO:0006886; GO:0006888; GO:0006890; GO:0008344; GO:0016020; GO:0021691; GO:0030126; GO:0030133; GO:0043473; GO:0051645
## Interacts.with EC.number Reactome UniPathway
## 1 <NA> <NA> <NA> <NA>
## 2 P53618; P42858; Q9BQE6-2 R-HSA-6807878;R-HSA-6811434;
## InterPro
## 1 <NA>
## 2 IPR036168;IPR022775;IPR027059;IPR011012;IPR028565;
Take counts from deseq2.dds, save as local object and as file.
#extract counts and normalizes them
counts <- counts(deseq2.dds, normalized = TRUE)
#save counts as file for use in bash manipulation
write.table(counts, file = "../output/DESeq_counts.tab", sep = "\t", quote = TRUE, row.names = TRUE)
head(counts)
#we can see that the rows are combined "ENA|HBXI01058009|HBXI01058009.1"
# they need to be separated for future steps
Here, we format DESeq_counts.tab are in a way that can be used in heatmaps. We need to separate “ENA|HBXI01000282|HBXI01000282.1” into individual elements “ENA,” “HBXI01000282,” and “HBXI01000282.1” by replacing pipes with tabs.
cat ../output/DESeq_counts.tab | tr '|' '\t' | tr -d \" | sed '1d' \
> ../output/DESeq_counts_sep.tab # rename with "_sep"
# examine new table
head -1 ../output/DESeq_counts_sep.tab
# compare to old table
head -2 ../output/DESeq_counts.tab
Clean up counts_sep and annot_tab with appropriate names and column organization.
#read in separeted DESeq_counts_sep.tab as local variable
counts_sep <- read.csv("../output/DESeq_counts_sep.tab", sep = '\t', header = FALSE)
#rename columns
colnames(counts_sep) <- c(1,"access_num",
"access_num_ver",
paste0("S", 1:12, " (control)"),
paste0("S", 13:26, " (pH 7.5-long)"))
#reorder columns so that accession numbers are first
counts_sep <- counts_sep[ , c(3,2,1,4:29)]
#read in annot_tab (created earlier for blast)
annot_tab <- read.csv("../output/blast_annot_go_mg.tab", sep = '\t', header = TRUE)
#rename columns to match counts_sep
colnames(annot_tab)[2] <- "access_num"
colnames(annot_tab)[3] <- "access_num_ver"
#reorder columns so that accession numbers are first
annot_tab <- annot_tab[,c(3,2,1,4:32)]
head(annot_tab,1)
## access_num_ver access_num V1 V4 V5 V6 V7 V8 V9 V10 V11
## 1 HBXI01000004.1 HBXI01000004 ENA sp Q86VS8 HOOK3_HUMAN 44.444 351 190 2 1044
## V12 V13 V14 V15 V16 Reviewed Entry.Name
## 1 1 178 526 2e-73 244 reviewed HOOK3_HUMAN
## Protein.names Gene.Names Organism
## 1 Protein Hook homolog 3 (h-hook3) (hHK3) HOOK3 Homo sapiens (Human)
## Length
## 1 718
## Gene.Ontology..molecular.function.
## 1 dynactin binding [GO:0034452]; dynein intermediate chain binding [GO:0045505]; dynein light chain binding [GO:0045503]; dynein light intermediate chain binding [GO:0051959]; identical protein binding [GO:0042802]; microtubule binding [GO:0008017]
## Gene.Ontology..GO.
## 1 centriolar satellite [GO:0034451]; centrosome [GO:0005813]; cis-Golgi network [GO:0005801]; cytoplasm [GO:0005737]; cytosol [GO:0005829]; FHF complex [GO:0070695]; microtubule [GO:0005874]; pericentriolar material [GO:0000242]; dynactin binding [GO:0034452]; dynein intermediate chain binding [GO:0045505]; dynein light chain binding [GO:0045503]; dynein light intermediate chain binding [GO:0051959]; identical protein binding [GO:0042802]; microtubule binding [GO:0008017]; cytoplasmic microtubule organization [GO:0031122]; cytoskeleton-dependent intracellular transport [GO:0030705]; early endosome to late endosome transport [GO:0045022]; endosome organization [GO:0007032]; endosome to lysosome transport [GO:0008333]; Golgi localization [GO:0051645]; interkinetic nuclear migration [GO:0022027]; lysosome organization [GO:0007040]; microtubule anchoring at centrosome [GO:0034454]; negative regulation of neurogenesis [GO:0050768]; neuronal stem cell population maintenance [GO:0097150]; protein localization to centrosome [GO:0071539]; protein localization to perinuclear region of cytoplasm [GO:1905719]; protein transport [GO:0015031]
## Gene.Ontology..biological.process.
## 1 cytoplasmic microtubule organization [GO:0031122]; cytoskeleton-dependent intracellular transport [GO:0030705]; early endosome to late endosome transport [GO:0045022]; endosome organization [GO:0007032]; endosome to lysosome transport [GO:0008333]; Golgi localization [GO:0051645]; interkinetic nuclear migration [GO:0022027]; lysosome organization [GO:0007040]; microtubule anchoring at centrosome [GO:0034454]; negative regulation of neurogenesis [GO:0050768]; neuronal stem cell population maintenance [GO:0097150]; protein localization to centrosome [GO:0071539]; protein localization to perinuclear region of cytoplasm [GO:1905719]; protein transport [GO:0015031]
## Gene.Ontology..cellular.component.
## 1 centriolar satellite [GO:0034451]; centrosome [GO:0005813]; cis-Golgi network [GO:0005801]; cytoplasm [GO:0005737]; cytosol [GO:0005829]; FHF complex [GO:0070695]; microtubule [GO:0005874]; pericentriolar material [GO:0000242]
## Gene.Ontology.IDs
## 1 GO:0000242; GO:0005737; GO:0005801; GO:0005813; GO:0005829; GO:0005874; GO:0007032; GO:0007040; GO:0008017; GO:0008333; GO:0015031; GO:0022027; GO:0030705; GO:0031122; GO:0034451; GO:0034452; GO:0034454; GO:0042802; GO:0045022; GO:0045503; GO:0045505; GO:0050768; GO:0051645; GO:0051959; GO:0070695; GO:0071539; GO:0097150; GO:1905719
## Interacts.with EC.number Reactome
## 1 O43310-2; Q9UJC3; P13646; Q7Z3Y8; Q6FHY5; P21757; Q96PV4
## UniPathway InterPro
## 1 IPR001715;IPR036872;IPR008636;IPR043936;
Merge the blast table and counts table together.
#both must be data frames
counts_merged <- merge(counts_sep, annot_tab, by = "access_num_ver", all=FALSE)
# head(counts_merged)
#check to see if there any duplicates
head(which(duplicated(counts_merged)))
## integer(0)
Convert counts_merged into a matrix and rename rows.
#make this a matrix so it can take the rownames()
counts_merged.mat <- as.matrix(counts_merged)
#make a vector of names from the first column (access_num_ver)
access_num_ver <- counts_merged.mat[,1]
#rename all rows with access_num_ver
rownames(counts_merged.mat) <- access_num_ver
Pick the top 1000 differentially expressed genes and find where those intersect with the abundance counts matrix (counts_merged.mat). Because each element in top_genes is also connected by pipes, we need to separate them into columns then select which columns contain the correct data.
#designate top 1000 genes
res_ordered <- deseq2.res[order(deseq2.res$padj), ]
top_genes <- row.names(res_ordered)[1:1000]
# Fixing top_genes to be in the correct format for interesect()
# Input string
input_vector <- top_genes
# Separate the vector into separate columns using the pipe symbol as the delimiter
split_columns <- do.call(rbind, strsplit(input_vector, "\\|"))
# Separate by column 3, which matches "access_num_ver" in annot_tab and counts_sep
top_genes.sep <- split_columns[1:length(top_genes),3]
# Identify top genes that are present in the counts_merged.mat row names
present_genes <- intersect(top_genes.sep, row.names(counts_merged.mat))
# Extract the rows corresponding to the present genes
counts_top <- counts_merged.mat[present_genes, ]
head(counts_top, 1)
## access_num_ver access_num.x 1 S1 (control)
## HBXI01001950.1 "HBXI01001950.1" "HBXI01001950" "ENA" "5.656291e+02"
## S2 (control) S3 (control) S4 (control) S5 (control)
## HBXI01001950.1 "4.802489e+02" "7.096624e+02" "4.199234e+02" "4.626127e+02"
## S6 (control) S7 (control) S8 (control) S9 (control)
## HBXI01001950.1 "5.978378e+02" "6.734937e+02" "5.945974e+02" "4.067192e+02"
## S10 (control) S11 (control) S12 (control) S13 (pH 7.5-long)
## HBXI01001950.1 "5.496532e+02" "4.374703e+02" "4.683099e+02" "2.296729e+02"
## S14 (pH 7.5-long) S15 (pH 7.5-long) S16 (pH 7.5-long)
## HBXI01001950.1 "3.210739e+02" "3.497218e+02" "3.066595e+02"
## S17 (pH 7.5-long) S18 (pH 7.5-long) S19 (pH 7.5-long)
## HBXI01001950.1 "4.078034e+02" "3.385568e+02" "3.367583e+02"
## S20 (pH 7.5-long) S21 (pH 7.5-long) S22 (pH 7.5-long)
## HBXI01001950.1 "4.312495e+02" " 304.882320" "3.735225e+02"
## S23 (pH 7.5-long) S24 (pH 7.5-long) S25 (pH 7.5-long)
## HBXI01001950.1 "4.897509e+02" "2.303789e+02" "3.310385e+02"
## S26 (pH 7.5-long) access_num.y V1 V4 V5
## HBXI01001950.1 "2.448149e+02" "HBXI01001950" "ENA" "sp" "Q5ZMJ9"
## V6 V7 V8 V9 V10 V11 V12
## HBXI01001950.1 "SRRM1_CHICK" " 74.400" " 125" " 31" " 1" " 813" " 439"
## V13 V14 V15 V16 Reviewed Entry.Name
## HBXI01001950.1 " 2" " 125" " 5.42e-52" " 186.0" "reviewed" "SRRM1_CHICK"
## Protein.names
## HBXI01001950.1 "Serine/arginine repetitive matrix protein 1"
## Gene.Names Organism Length
## HBXI01001950.1 "SRRM1 RCJMB04_1m21" "Gallus gallus (Chicken)" " 888"
## Gene.Ontology..molecular.function.
## HBXI01001950.1 ""
## Gene.Ontology..GO.
## HBXI01001950.1 "spliceosomal complex [GO:0005681]; mRNA processing [GO:0006397]; regulation of mRNA splicing, via spliceosome [GO:0048024]; RNA splicing [GO:0008380]"
## Gene.Ontology..biological.process.
## HBXI01001950.1 "mRNA processing [GO:0006397]; regulation of mRNA splicing, via spliceosome [GO:0048024]; RNA splicing [GO:0008380]"
## Gene.Ontology..cellular.component.
## HBXI01001950.1 "spliceosomal complex [GO:0005681]"
## Gene.Ontology.IDs Interacts.with
## HBXI01001950.1 "GO:0005681; GO:0006397; GO:0008380; GO:0048024" ""
## EC.number Reactome UniPathway InterPro
## HBXI01001950.1 "" "" "" "IPR002483;IPR036483;"
#check how many DEGs will be plotted
nrow(counts_top) #43
## [1] 43
Organizing by protein name
#trim down data frame
counts_top_pro <- counts_top[, c(1,4:29,47)] #protein names are in column 47
#rename rows as protein names
rownames(counts_top_pro) <- counts_top_pro[,"Protein.names"]
#remove protein column
counts_top_pro <- counts_top_pro[ ,-47]
counts_top_pro <- as.matrix(counts_top_pro)
# head(counts_top_pro)
#convert the counts into numbers from character into new matrix and renaming the col and row names of that new matrix
counts_top_pro2 <- matrix(as.numeric(counts_top_pro),
ncol = ncol(counts_top_pro))
colnames(counts_top_pro2) <- colnames(counts_top_pro)
protein_names <- rownames(counts_top_pro)
#clean up protein names list so that each entry is not as long
protein_names2 <- protein_names %>%
sub(' \\([^)]+\\).*$', '', .) %>%
sub('\\[.*?\\]', ' ', .)
rownames(counts_top_pro2) <- protein_names2
# Perform log transformation
log_counts_top2 <- log2(counts_top_pro2 + 1)
Organizing by “Gene.Ontology..biological.process.”
#trim down data frame
counts_top_bio_ont <- counts_top[, c(1,4:29,53)]
#rename rows as biological process
rownames(counts_top_bio_ont) <- counts_top_bio_ont[,"Gene.Ontology..biological.process."]
#delete Gene.Ontology..biological.process. column
counts_top_bio_ont <- counts_top_bio_ont[ ,-53]
counts_top_bio_ont <- as.matrix(counts_top_bio_ont)
head(counts_top_bio_ont)
#convert the counts into numbers from characters
#into new matrix and renaming the col and row names of that new matrix
counts_top_bio_ont2 <- matrix(as.numeric(counts_top_bio_ont),
ncol = ncol(counts_top_bio_ont))
colnames(counts_top_bio_ont2) <- colnames(counts_top_bio_ont)
bio_ont <- rownames(counts_top_bio_ont)
head(bio_ont)
#need to find a way to clean up biological process list so that each entry is not as long
rownames(counts_top_bio_ont2) <- bio_ont2
# Perform log transformation
log_counts_top_bio2 <- log2(counts_top_bio_ont2 + 1)
# Define custom palette of shades of light orange to dark orange
custom_palette <- c('#fff7fb','#ece2f0','#d0d1e6',
'#a6bddb','#67a9cf','#3690c0',
'#02818a','#016c59','#014636')
par(mar=c(7,4,4,2)+0.1)
png(filename='../output/03-heatmap-01.png', width=800, height=750)
# Customize the heatmap using heatmap.2
heatmap.2(log_counts_top2,
scale="column",
cexRow=0.9,
margins =c(10,20),
col=custom_palette,
Colv = FALSE,
ylab = "Protein Names",
xlab = "Sample Run ID",
key=TRUE,
keysize = 0.7,
key.title = "Log Counts",
key.xlab = "Expression",
key.ylab = "Frequency",
trace="none",
cexCol = 1)
graphics.off()

This generates a volcano plot that shows the log change of gene expression between the control and a treatment of pH 7.5 for 12 weeks.
par(mar=c(7,4,4,2)+0.1)
png(filename='../output/03-volcano-plot.png', width=800, height=750)
# The main plot
plot(deseq2.res$baseMean, deseq2.res$log2FoldChange, pch=20, cex=0.45, ylim=c(-3, 3), log="x", col="darkgray",
main="DEG Snow Crab Treatments (pval <= 0.05)",
xlab="mean of normalized counts",
ylab="Log2 Fold Change")
# Getting the significant points and plotting them again so they're a different color
points(deseq2.res.sig$baseMean, deseq2.res.sig$log2FoldChange, pch=20, cex=0.7, col="#3690c0")
# 2 FC lines
abline(h=c(-1,1), col="#ec7014")
graphics.off()

par(mar=c(7,4,4,2)+0.1)
png(filename='../output/03-PCA.png', width=800, height=750)
vst.data <- vst(deseq2.dds, blind = FALSE)
plotPCA(vst.data, intgroup = "condition")
graphics.off()

Separate out pipes into tabs.
cat ../output/log_counts_top.tab | tr '|' '\t' | tr -d \" | sed '1d' \
> ../output/log_counts_top_sep.tab # rename with "_sep"
# examine new table
head -1 ../output/log_counts_top_sep.tab
Clean up counts_sep and annot_tab with appropriate names and column organization.
# read in separeted log_counts_top_sep.tab as local variable
log_counts_top_sep <- read.csv("../output/log_counts_top_sep.tab", sep = '\t', header = FALSE)
head(log_counts_top_sep)
## V1 V2 V3 V4 V5 V6 V7 V8 V9
## 1 ENA HBXI01089505 HBXI01089505.1 0 0.00000 0 0.000000 0.000000 0.000000
## 2 ENA HBXI01097869 HBXI01097869.1 0 0.00000 0 0.000000 0.000000 0.000000
## 3 ENA HBXI01061771 HBXI01061771.1 0 0.00000 0 0.000000 0.000000 0.000000
## 4 ENA HBXI01135226 HBXI01135226.1 0 10.01203 0 0.000000 0.000000 4.023605
## 5 ENA HBXI01004523 HBXI01004523.1 0 0.00000 0 6.211255 7.552041 0.000000
## 6 ENA HBXI01011121 HBXI01011121.1 0 0.00000 0 0.000000 0.000000 0.000000
## V10 V11 V12 V13 V14 V15 V16 V17 V18 V19
## 1 0.000000 0.000000 0 0.000000 0 0 0.000000 0 6.366676 9.663929
## 2 0.000000 0.000000 0 0.000000 0 0 0.000000 0 0.000000 0.000000
## 3 0.000000 0.000000 0 7.400073 0 0 7.432794 0 0.000000 0.000000
## 4 0.000000 9.494103 0 9.280617 0 0 0.000000 0 0.000000 0.000000
## 5 5.695707 0.000000 0 6.710848 0 0 0.000000 0 0.000000 8.305476
## 6 7.995194 0.000000 0 0.000000 0 0 0.000000 0 0.000000 8.618519
## V20 V21 V22 V23 V24 V25 V26 V27
## 1 7.344208 5.345521 0.000000 0.000000 8.713539 0.000000 10.061665 0.000000
## 2 0.000000 7.797745 0.000000 8.885313 7.576322 8.356288 5.479004 5.998076
## 3 6.248607 0.000000 0.000000 0.000000 0.000000 5.293224 7.267822 0.000000
## 4 0.000000 0.000000 0.000000 0.000000 8.791967 0.000000 0.000000 0.000000
## 5 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## 6 4.304435 0.000000 4.709676 0.000000 0.000000 0.000000 0.000000 9.121676
## V28 V29
## 1 8.859392 9.347096
## 2 0.000000 0.000000
## 3 2.811241 0.000000
## 4 0.000000 0.000000
## 5 0.000000 0.000000
## 6 10.981913 0.000000
# rename columns
colnames(log_counts_top_sep) <- c(1,"access_num",
"access_num_ver",
paste0("S", 1:12, " (control)"),
paste0("S", 13:26, " (pH 7.5-long)"))
# make a vector of names from the first column (access_num_ver)
access_num_ver <- log_counts_top_sep[,3]
#rename all rows with access_num_ver
rownames(log_counts_top_sep) <- access_num_ver
# removes ENA and accession numbers
log_counts_top_sep <- log_counts_top_sep[,c(4:29)]
#converts to matrix for use in heatmap()
log_counts_top_sep <- as.matrix(log_counts_top_sep)
png(file="../output/03-heatmap-02.png",
width=900, height=650)
pheatmap(log_counts_top_sep,
scale = "row",
color = custom_palette,
fontsize_row = 8,
cellheight = 8)
dev.off()
## png
## 3
