Compendium of FISH546 project
2025-06-05
Differential Gene Expression Analysis Project
I am using RNA-seq data taken from Sea Cucumbers (Apostichopus japonicus ) that were treated under 2 different temperatures (26°C & 30°C). The purpose being to conduct DGE analysis to determine the biological responses that heat stress induces on this organism. Data was obtained from NCBI, done by researchers Xu et al. at Qingdao Agricultural University (2023).
Methods
Variables: Time and Temperature.
3 groups: control (18°C), 26°C (sub-lethal), and 30°C (lethal).
Time: 6 hrs and 48 hrs.
26°C kept at 6 hrs and 48 hrs, 3 replicates each.
30°C kept at 6 hrs, 3 replicates.
Researchers note 30°C at 48 hrs killed their sea cucumbers.
Code
3 parts, 3 scripts.
1.1: Obtain data & conduct quality check (QC)
FastQ files were obtained from NCBI. Accession code being: PRJNA848687. More information about the data can be found here . Go here if you want to access information about the individual files (12 total).
The quality check was done using FastQC:
/home/shared/8TB_HDD_02/hannia/SeaCucumber/FastQC/fastqc \
/home/shared/8TB_HDD_02/hannia/SeaCucumber/PRJNA848687_fastq/* .fastq \
-o /home/shared/8TB_HDD_02/hannia/SeaCucumber/output
Results of FastQC
Results looked normal for paired-end reads. Generally, all files looked like this.
Refer to ~Hannia-jelly/Project/output/output_fastqc for the html files.
1.2: Pseudo-alignment using Kallisto
The reference genome of Apostichopus japonicus for the pseudo-alignment was obtained from NCBI and can be accessed here . The NCBI RefSeq assembly ID is: GCF_037975245.1.
Creating the index using Kallisto from the rna.fna of the reference genome
/home/shared/kallisto/kallisto index \
-i /home/shared/8TB_HDD_02/hannia/SeaCucumber/index.idx \
/home/shared/8TB_HDD_02/hannia/SeaCucumber/GCF_037975245.1_ref/ncbi_dataset/data/GCF_037975245.1/rna.fna
1.3 . Using Kallisto quant to complete the pseudo-alignment.
find /home/shared/8TB_HDD_02/hannia/SeaCucumber/PRJNA848687_fastq/* _1.fastq \
| xargs -n1 basename \
| sed 's/_1\.fastq$//' \
| xargs -I{} /home/shared/kallisto/kallisto quant \
-i /home/shared/8TB_HDD_02/hannia/SeaCucumber/index.idx \
-o /home/shared/8TB_HDD_02/hannia/SeaCucumber/output/kallisto_01/{} \
-t 40 \
--paired \
/home/shared/8TB_HDD_02/hannia/SeaCucumber/PRJNA848687_fastq/{}_1.fastq \
/home/shared/8TB_HDD_02/hannia/SeaCucumber/PRJNA848687_fastq/{}_2.fastq
1.4: Creating a gene expresison matrix
perl /home/shared/trinityrnaseq-v2.12.0/util/abundance_estimates_to_matrix.pl \
--est_method kallisto \
--gene_trans_map none \
--out_prefix /home/shared/8TB_HDD_02/hannia/SeaCucumber/output/kallisto_01 \
--name_sample_by_basedir \
/home/shared/8TB_HDD_02/hannia/SeaCucumber/output/kallisto_01/SRR19635628/abundance.tsv \
/home/shared/8TB_HDD_02/hannia/SeaCucumber/output/kallisto_01/SRR19635629/abundance.tsv \
/home/shared/8TB_HDD_02/hannia/SeaCucumber/output/kallisto_01/SRR19635630/abundance.tsv \
/home/shared/8TB_HDD_02/hannia/SeaCucumber/output/kallisto_01/SRR19635631/abundance.tsv \
/home/shared/8TB_HDD_02/hannia/SeaCucumber/output/kallisto_01/SRR19635632/abundance.tsv \
/home/shared/8TB_HDD_02/hannia/SeaCucumber/output/kallisto_01/SRR19635633/abundance.tsv \
/home/shared/8TB_HDD_02/hannia/SeaCucumber/output/kallisto_01/SRR19635634/abundance.tsv \
/home/shared/8TB_HDD_02/hannia/SeaCucumber/output/kallisto_01/SRR19635635/abundance.tsv \
/home/shared/8TB_HDD_02/hannia/SeaCucumber/output/kallisto_01/SRR19635636/abundance.tsv \
/home/shared/8TB_HDD_02/hannia/SeaCucumber/output/kallisto_01/SRR19635637/abundance.tsv \
/home/shared/8TB_HDD_02/hannia/SeaCucumber/output/kallisto_01/SRR19635638/abundance.tsv \
/home/shared/8TB_HDD_02/hannia/SeaCucumber/output/kallisto_01/SRR19635639/abundance.tsv
1.4: head of .matrix file
head ~/SeaCucumber/output/kallisto_01.isoform.counts.matrix
SRR19635628 SRR19635629 SRR19635630 SRR19635631 SRR19635632 SRR19635633 SRR19635634 SRR19635635 SRR19635636 SRR19635637 SRR19635638 SRR19635639
XM_071976267.1 0 9.00484 9 79.0623 56 4 46.0767 10.0067 47 43 27 8.01381
XR_011789836.1 0 0 0 0 0 0 0 0 0 0 0 0.52118
XM_071997363.1 4.14253 16.8493 281.989 0 0 6.8272 0 6.66699 51.2241 6.68807 0 0
XM_071955467.1 93.3973 0 53.8949 0 0 0 0 92.5794 92.807 46.5277 0 0
XR_011796061.1 0 0 0 0 0 0 0 0 0 0 0 0
XM_071967761.1 142 56 209 88 71 7 67 80 87 3 8 28
XM_071984305.1 0 0 73.3343 0 0 0 0 0 0 0 0 0
XM_071987644.1 36.1874 309.39 129.686 700.958 207.801 374.889 117.255 104.78 155.507 0.144135 0.0026576 1.25093e-06
XM_071964952.1 0 0 0 0 0 5.9977 0 0 0 0 0 0
Part 2.0: BLAST rna.fna to Swiss-prot for functional analysis
/home/shared/8TB_HDD_02/hannia/applications/ncbi-blast-2.16.0+/bin/blastx \
> -query /home/shared/8TB_HDD_02/hannia/SeaCucumber/GCF_037975245.1_ref/ncbi_dataset/data/GCF_037975245.1/rna.fna \
> -db /home/shared/8TB_HDD_02/hannia/blast/uniprot_sprot_r2025_01 \
> -out /home/shared/8TB_HDD_02/hannia/SeaCucumber/output/Seacuc_blastx_sprotresults.tab \
> -evalue 1E-20 \
> -num_threads 20 \
> -max_target_seqs 5 \
> -outfmt 6
2.0 head of blastx results
head /home/shared/8TB_HDD_02/hannia/SeaCucumber/output/Seacuc_blastx_sprotresults.tab
XM_071954493.1 sp|Q9UGM1|ACHA9_HUMAN 28.107 338 218 7 134 1120 10 331 1.13e-42 162
XM_071954493.1 sp|P43144|ACHA9_RAT 28.339 307 198 6 227 1120 38 331 3.99e-41 157
XM_071954493.1 sp|Q9PTS8|ACHA9_CHICK 28.155 309 200 6 221 1120 38 333 8.28e-39 151
XM_071954493.1 sp|Q9I8C7|ACH10_CHICK 29.316 307 195 7 227 1120 39 332 2.94e-37 146
XM_071954493.1 sp|Q8JFN7|ACH91_ONCMY 28.013 307 199 6 227 1120 32 325 4.36e-36 145
XM_071954496.1 sp|Q2TBP5|TMM53_BOVIN 37.184 277 155 6 67 858 5 275 3.57e-34 136
XM_071954496.1 sp|Q5PPS7|TM53A_XENLA 37.091 275 155 7 61 858 3 268 9.21e-34 135
XM_071954496.1 sp|Q6DJC8|TM53B_XENLA 35.531 273 158 6 67 858 5 268 2.31e-33 134
XM_071954496.1 sp|Q6DHN0|TMM53_DANRE 34.066 273 168 5 61 861 3 269 3.15e-33 133
XM_071954496.1 sp|Q9D0Z3|TMM53_MOUSE 37.956 274 155 6 67 858 5 273 8.07e-33 132
Part 3: Data Analysis
3.1: DESeq2 for DGE analysis
Reading in counts and filtering to keep counts that have more than 5 counts in at least 3 samples.
# Load counts
counts <- read.table ("~/SeaCucumber/output/kallisto_01.isoform.counts.matrix" , header= TRUE , row.names= 1 )
# Filter: keep rows with counts >5 in at least 3 samples
keep <- rowSums (counts > 5 ) >= 3
filtered.counts <- counts[keep, ]
# Summary
cat ("Before filtering:" , nrow (counts), "isoforms \n " )
Before filtering: 56281 isoforms
cat ("After filtering:" , nrow (filtered.counts), "isoforms \n " )
After filtering: 29310 isoforms
3.1 Final filtered.counts data frame
This is rounded to nearest integer & and has proper meaningful column names.
#rounding to nearest integer
filtered.counts <- round (filtered.counts, 0 )
str (filtered.counts)
'data.frame': 29310 obs. of 12 variables:
$ SRR19635628: num 0 4 93 142 36 0 130 16 0 2 ...
$ SRR19635629: num 9 17 0 56 309 0 18 27 0 63 ...
$ SRR19635630: num 9 282 54 209 130 0 36 26 11 19 ...
$ SRR19635631: num 79 0 0 88 701 61 482 153 7 90 ...
$ SRR19635632: num 56 0 0 71 208 0 144 42 5 139 ...
$ SRR19635633: num 4 7 0 7 375 0 41 16 3 2 ...
$ SRR19635634: num 46 0 0 67 117 0 19 35 0 10 ...
$ SRR19635635: num 10 7 93 80 105 0 47 115 1 26 ...
$ SRR19635636: num 47 51 93 87 156 0 17 112 13 1 ...
$ SRR19635637: num 43 7 47 3 0 31 253 17 21 31 ...
$ SRR19635638: num 27 0 0 8 0 0 30 34 4 27 ...
$ SRR19635639: num 8 0 0 28 0 123 11 0 1 60 ...
# Vector of new sample names in the correct order
new.names <- c ("Control 1" , "Control 2" , "Control 3" , "26.a1" , "26.a2" , "26.a3" , "26.b1" , "26.b2" , "26.b3" , "30.a1" ,"30.a2" ,"30.a3" )
# Apply new names to the columns
colnames (filtered.counts) <- new.names
head (filtered.counts)
Control 1 Control 2 Control 3 26.a1 26.a2 26.a3 26.b1 26.b2
XM_071976267.1 0 9 9 79 56 4 46 10
XM_071997363.1 4 17 282 0 0 7 0 7
XM_071955467.1 93 0 54 0 0 0 0 93
XM_071967761.1 142 56 209 88 71 7 67 80
XM_071987644.1 36 309 130 701 208 375 117 105
XM_071957402.1 0 0 0 61 0 0 0 0
26.b3 30.a1 30.a2 30.a3
XM_071976267.1 47 43 27 8
XM_071997363.1 51 7 0 0
XM_071955467.1 93 47 0 0
XM_071967761.1 87 3 8 28
XM_071987644.1 156 0 0 0
XM_071957402.1 0 31 0 123
Libraries used for downstream analysis
library (DESeq2)
library (tidyverse)
library (pheatmap)
library (RColorBrewer)
library (data.table)
library (gprofiler2)
3.2 Setting up data frames for DESeq2( )
condition <- factor (rep (c ("Control" , "26.a" , "26.b" , "30.a" ), each = 3 ),
levels = c ("Control" , "26.a" , "26.b" , "30.a" ))
condition
[1] Control Control Control 26.a 26.a 26.a 26.b 26.b 26.b
[10] 30.a 30.a 30.a
Levels: Control 26.a 26.b 30.a
deseq2.colData <- data.frame (condition = condition,
type= factor (rep ("paired" , 12 )))
rownames (deseq2.colData) <- colnames (filtered.counts)
deseq2.data <- DESeqDataSetFromMatrix (countData = filtered.counts,
colData = deseq2.colData,
design = ~ condition)
3.3 Running DESeq( ) command
deseq2.data <- DESeq (deseq2.data)
deseq2.res <- results (deseq2.data)
deseq2.res <- deseq2.res[order (rownames (deseq2.data)), ]
3.4 DESeq results
# Count number of hits with adjusted p-value less then 0.05
dim (deseq2.res[! is.na (deseq2.res$ padj) & deseq2.res$ padj <= 0.05 , ])
head (deseq2.res[order (deseq2.res$ padj), ]) # sorted by adjusted p-value
log2 fold change (MLE): condition 30.a vs Control
Wald test p-value: condition 30.a vs Control
DataFrame with 6 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue
<numeric> <numeric> <numeric> <numeric> <numeric>
XM_071987485.1 52098.76 7.91277 0.624320 12.67422 8.21714e-37
XM_071976661.1 160742.38 10.91870 0.953902 11.44635 2.45258e-30
XM_071971899.1 177059.26 11.97290 1.073711 11.15096 7.08404e-29
XM_071978956.1 94012.69 13.16145 1.227628 10.72104 8.10871e-27
XM_071997912.1 71359.22 9.46740 0.908432 10.42169 1.97408e-25
XM_071980612.1 2758.95 10.22529 1.033384 9.89495 4.37816e-23
padj
<numeric>
XM_071987485.1 2.04714e-32
XM_071976661.1 3.05505e-26
XM_071971899.1 5.88283e-25
XM_071978956.1 5.05031e-23
XM_071997912.1 9.83603e-22
XM_071980612.1 1.81789e-19
3.4 Visualizing DESeq( ) results
tmp <- deseq2.res
# The main plot
plot (tmp$ baseMean, tmp$ log2FoldChange, pch= 20 , cex= 0.45 , ylim= c (- 3 , 3 ), log= "x" , col= "darkgray" ,
main= "DEG of control and 30degC data (pval <= 0.05)" ,
xlab= "mean of normalized counts" ,
ylab= "Log2 Fold Change" )
# Getting the significant points and plotting them again so they're a different color
tmp.sig <- deseq2.res[! is.na (deseq2.res$ padj) & deseq2.res$ padj <= 0.05 , ]
points (tmp.sig$ baseMean, tmp.sig$ log2FoldChange, pch= 20 , cex= 0.45 , col= "red" )
# 2 FC lines
abline (h= c (- 1 ,1 ), col= "blue" )
3.4 The Main Plot
3.4 More visualization of DESeq( ) results
Figure of Top 50 Deferentially Expressed Genes:
3.5 Merging BLAST & DESeq2 Results for functional analysis
Go to ~/Hannia-jelly/Project/scripts/03-SeaCucDGEAnalysis.Rmd for full code of this step.
# Merge the data
deg.annotated <- merge (DEGtable, blastresults, by = "transcriptID" , all.x = TRUE )
# Display only the first 5 rows
head (read.csv ("~/SeaCucumber/output/deg_annotated.csv" ), 5 )
transcriptID baseMean log2FoldChange lfcSE stat pvalue
1 XM_071954564.1 34.38615 21.40793 3.703689 5.780163 7.462832e-09
2 XM_071954564.1 34.38615 21.40793 3.703689 5.780163 7.462832e-09
3 XM_071954564.1 34.38615 21.40793 3.703689 5.780163 7.462832e-09
4 XM_071954564.1 34.38615 21.40793 3.703689 5.780163 7.462832e-09
5 XM_071954564.1 34.38615 21.40793 3.703689 5.780163 7.462832e-09
padj subject_id percent_identity alignment_length
1 1.309307e-06 sp|Q5BL81|SC5A8_XENTR 40.381 473
2 1.309307e-06 sp|Q8BYF6|SC5A8_MOUSE 34.633 641
3 1.309307e-06 sp|Q49B93|SC5AC_MOUSE 37.705 488
4 1.309307e-06 sp|Q8N695|SC5A8_HUMAN 37.056 591
5 1.309307e-06 sp|A7MBD8|SC5AC_BOVIN 38.285 478
mismatches gap_opens q_start q_end s_start s_end evalue bit_score
1 278 2 405 1811 9 481 2.84e-94 313
2 366 7 387 2297 3 594 3.59e-97 320
3 300 2 405 1856 5 492 1.21e-92 308
4 331 6 405 2171 9 560 5.78e-100 328
5 291 2 405 1826 5 482 1.31e-93 311
3.6 Uniprot ID mapping
First, created a text file of the subject_id column to paste into the Uniprot ID mapping website to conduct functional analysis.
deg.annotated$ accession <- sub ("sp \\ |(.*?) \\ |.*" , " \\ 1" , deg.annotated$ subject_id)
write.table (deg.annotated$ accession, file = "~/SeaCucumber/output/uniprot_ids.txt" , quote = FALSE , row.names = FALSE , col.names = FALSE )
Downloaded .tsv to directory and merged with deg.annotated table
functional.merge <- merge (deg.annotated, uniprot.annot, by.x = "accession" , by.y = "From" , all.x = TRUE )
Head of functional.merge data frame
head (read.csv ("~/SeaCucumber/output/functional_merge.csv" ), 5 )
accession transcriptID baseMean log2FoldChange lfcSE stat
1 A0A061I403 XM_071996487.1 413.60029 1.045867 0.3225155 3.242843
2 A0A088MLT8 XM_071983001.1 49.74936 -21.294231 3.8514087 -5.528946
3 A0A096X8J7 XM_071966355.1 17.24461 20.221448 4.2518977 4.755864
4 A0A0B4KEE4 XM_071958907.1 23.44183 8.840447 2.8686350 3.081761
5 A0A0G2JZ79 XM_071972825.1 202.67727 -1.559612 0.4417978 -3.530148
pvalue padj subject_id percent_identity
1 1.183433e-03 2.656115e-02 sp|A0A061I403|FICD_CRIGR 55.660
2 3.221603e-08 4.321509e-06 sp|A0A088MLT8|IQIP1_MOUSE 42.205
3 1.975993e-06 1.103765e-04 sp|A0A096X8J7|B3A3_DANRE 38.755
4 2.057802e-03 4.014566e-02 sp|A0A0B4KEE4|KOI_DROME 57.672
5 4.153272e-04 1.145852e-02 sp|A0A0G2JZ79|SIR1_RAT 65.046
alignment_length mismatches gap_opens q_start q_end s_start s_end evalue
1 318 140 1 432 1382 100 417 2.01e-115
2 263 136 7 1748 2521 292 543 9.09e-36
3 916 497 9 1034 3736 273 1139 0.00e+00
4 189 79 1 3731 4294 772 960 2.05e-67
5 329 108 4 600 1586 1 322 6.40e-136
bit_score Entry Reviewed Entry.Name
1 369 A0A061I403 reviewed FICD_CRIGR
2 150 A0A088MLT8 reviewed IQIP1_MOUSE
3 595 A0A096X8J7 reviewed B3A3_DANRE
4 252 A0A0B4KEE4 reviewed KOI_DROME
5 441 A0A0G2JZ79 reviewed SIR1_RAT
Protein.names
1 Protein adenylyltransferase FICD (EC 2.7.7.108) (AMPylator FICD) (De-AMPylase FICD) (EC 3.1.4.-) (FIC domain-containing protein) (Huntingtin-interacting protein E)
2 IQCJ-SCHIP1 readthrough transcript protein
3 Anion exchange protein 3 (AE 3) (Solute carrier family 4 member 3)
4 Klaroid protein
5 NAD-dependent protein deacetylase sirtuin-1 (EC 2.3.1.286) (NAD-dependent protein deacylase sirtuin-1) (EC 2.3.1.-)
Gene.Names
1 FICD HYPE H671_4g11989 I79_014982
2 Iqcj-Schip1 Iqschfp Schip1
3 Slc4a3 Ae3
4 koi CG44154
5 Sirt1
Organism Length
1 Cricetulus griseus (Chinese hamster) (Cricetulus barabensis griseus) 455
2 Mus musculus (Mouse) 559
3 Danio rerio (Zebrafish) (Brachydanio rerio) 1170
4 Drosophila melanogaster (Fruit fly) 965
5 Rattus norvegicus (Rat) 555
3.7 gprofiler2 package to use gost( ) command
Performing a Gene Ontology (GO)/ Pathway enrichment analysis using UniProt accession IDs.
deg_ids <- unique (deg.annotated$ accession) #getting accession ID (used for uniprot id mapping)
library (gprofiler2)
# Use cleaned UniProt accessions of DEGs
gost.res <- gost (query = deg_ids, organism = "hsapiens" )
3.7 Visualizing Gene Ontology (GO)/ Pathway enrichment analysis
# Select top 10 significant terms
top_terms <- head (gost.res$ result[order (gost.res$ result$ p_value), ], 10 )
ggplot (top_terms, aes (x = reorder (term_name, - log10 (p_value)),
y = - log10 (p_value),
fill = source)) +
geom_col () +
coord_flip () +
labs (title = "Top Enriched Terms" , x = "GO/Pathway Term" , y = "-log10(p-value)" ) +
theme_minimal ()
Interpretation
2 groups : Gene ontology & Biologcial Pathways
CC = Cellular components, MF = Molecular function, TF = Transcription factors.
Most DEG genes in the 30degC group correspond to genes relating to: CF, MF, and TFs involved biological pathways.
Main finding
The lethal temperature of 30°C was found to cause the most stat. significant biological changes in gene expression of Japanese Sea Cucumber. These genes correspond to CF, MF, and TF. Additionally, the time variable within the 26 °C group (6hr vs 48) was not found to have a significant impact on gene expression. The differences of DEG between the 26 and 30 °C groups were also not significantly different.
How to access my scripts and output folder
My scripts and small-size ouput files were uploaded to Github Hannia-jelly repository.
Zenodo DOI: https://doi.org/10.5281/zenodo.15599622
References
Xu, D., Zhang, J., Song, W., Sun, L., Liu, J., Gu, Y., Chen, Y., & Xia, B. (2023). Analysis of differentially expressed genes in the sea cucumber Apostichopus japonicus under heat stress. Acta Oceanologica Sinica, 42(11), 117–126. https://doi.org/10.1007/s13131-023-2196-4