Compendium of FISH546 project

Hannia Larino

2025-06-05

Differential Gene Expression Analysis Project

I am using RNA-seq data taken from Sea Cucumbers (Apostichopus japonicus) that were treated under 2 different temperatures (26°C & 30°C). The purpose being to conduct DGE analysis to determine the biological responses that heat stress induces on this organism. Data was obtained from NCBI, done by researchers Xu et al. at Qingdao Agricultural University (2023).

Methods

  • Variables: Time and Temperature.

  • 3 groups: control (18°C), 26°C (sub-lethal), and 30°C (lethal).

  • Time: 6 hrs and 48 hrs.

  • 26°C kept at 6 hrs and 48 hrs, 3 replicates each.

  • 30°C kept at 6 hrs, 3 replicates.

  • Researchers note 30°C at 48 hrs killed their sea cucumbers.

Code

3 parts, 3 scripts.

Part 1: Pseudo-alignment

1.1: Obtain data & conduct quality check (QC)

FastQ files were obtained from NCBI. Accession code being: PRJNA848687. More information about the data can be found here. Go here if you want to access information about the individual files (12 total).

The quality check was done using FastQC:

/home/shared/8TB_HDD_02/hannia/SeaCucumber/FastQC/fastqc \
/home/shared/8TB_HDD_02/hannia/SeaCucumber/PRJNA848687_fastq/*.fastq \
-o /home/shared/8TB_HDD_02/hannia/SeaCucumber/output

Results of FastQC

Results looked normal for paired-end reads. Generally, all files looked like this.

Refer to ~Hannia-jelly/Project/output/output_fastqc for the html files.

1.2: Pseudo-alignment using Kallisto

The reference genome of Apostichopus japonicus for the pseudo-alignment was obtained from NCBI and can be accessed here. The NCBI RefSeq assembly ID is: GCF_037975245.1.

Creating the index using Kallisto from the rna.fna of the reference genome

/home/shared/kallisto/kallisto index \
-i /home/shared/8TB_HDD_02/hannia/SeaCucumber/index.idx \
/home/shared/8TB_HDD_02/hannia/SeaCucumber/GCF_037975245.1_ref/ncbi_dataset/data/GCF_037975245.1/rna.fna

1.3 . Using Kallisto quant to complete the pseudo-alignment.

find /home/shared/8TB_HDD_02/hannia/SeaCucumber/PRJNA848687_fastq/*_1.fastq \
| xargs -n1 basename \
| sed 's/_1\.fastq$//' \
| xargs -I{} /home/shared/kallisto/kallisto quant \
-i /home/shared/8TB_HDD_02/hannia/SeaCucumber/index.idx \
-o /home/shared/8TB_HDD_02/hannia/SeaCucumber/output/kallisto_01/{} \
-t 40 \
--paired \
/home/shared/8TB_HDD_02/hannia/SeaCucumber/PRJNA848687_fastq/{}_1.fastq \
/home/shared/8TB_HDD_02/hannia/SeaCucumber/PRJNA848687_fastq/{}_2.fastq

1.4: Creating a gene expresison matrix

perl /home/shared/trinityrnaseq-v2.12.0/util/abundance_estimates_to_matrix.pl \
  --est_method kallisto \
  --gene_trans_map none \
  --out_prefix /home/shared/8TB_HDD_02/hannia/SeaCucumber/output/kallisto_01 \
  --name_sample_by_basedir \
  /home/shared/8TB_HDD_02/hannia/SeaCucumber/output/kallisto_01/SRR19635628/abundance.tsv \
  /home/shared/8TB_HDD_02/hannia/SeaCucumber/output/kallisto_01/SRR19635629/abundance.tsv \
  /home/shared/8TB_HDD_02/hannia/SeaCucumber/output/kallisto_01/SRR19635630/abundance.tsv \
  /home/shared/8TB_HDD_02/hannia/SeaCucumber/output/kallisto_01/SRR19635631/abundance.tsv \
  /home/shared/8TB_HDD_02/hannia/SeaCucumber/output/kallisto_01/SRR19635632/abundance.tsv \
  /home/shared/8TB_HDD_02/hannia/SeaCucumber/output/kallisto_01/SRR19635633/abundance.tsv \
  /home/shared/8TB_HDD_02/hannia/SeaCucumber/output/kallisto_01/SRR19635634/abundance.tsv \
  /home/shared/8TB_HDD_02/hannia/SeaCucumber/output/kallisto_01/SRR19635635/abundance.tsv \
  /home/shared/8TB_HDD_02/hannia/SeaCucumber/output/kallisto_01/SRR19635636/abundance.tsv \
  /home/shared/8TB_HDD_02/hannia/SeaCucumber/output/kallisto_01/SRR19635637/abundance.tsv \
  /home/shared/8TB_HDD_02/hannia/SeaCucumber/output/kallisto_01/SRR19635638/abundance.tsv \
 /home/shared/8TB_HDD_02/hannia/SeaCucumber/output/kallisto_01/SRR19635639/abundance.tsv

1.4: head of .matrix file

head ~/SeaCucumber/output/kallisto_01.isoform.counts.matrix
    SRR19635628 SRR19635629 SRR19635630 SRR19635631 SRR19635632 SRR19635633 SRR19635634 SRR19635635 SRR19635636 SRR19635637 SRR19635638 SRR19635639
XM_071976267.1  0   9.00484 9   79.0623 56  4   46.0767 10.0067 47  43  27  8.01381
XR_011789836.1  0   0   0   0   0   0   0   0   0   0   0   0.52118
XM_071997363.1  4.14253 16.8493 281.989 0   0   6.8272  0   6.66699 51.2241 6.68807 0   0
XM_071955467.1  93.3973 0   53.8949 0   0   0   0   92.5794 92.807  46.5277 0   0
XR_011796061.1  0   0   0   0   0   0   0   0   0   0   0   0
XM_071967761.1  142 56  209 88  71  7   67  80  87  3   8   28
XM_071984305.1  0   0   73.3343 0   0   0   0   0   0   0   0   0
XM_071987644.1  36.1874 309.39  129.686 700.958 207.801 374.889 117.255 104.78  155.507 0.144135    0.0026576   1.25093e-06
XM_071964952.1  0   0   0   0   0   5.9977  0   0   0   0   0   0

Part 2.0: BLAST rna.fna to Swiss-prot for functional analysis

/home/shared/8TB_HDD_02/hannia/applications/ncbi-blast-2.16.0+/bin/blastx \
> -query /home/shared/8TB_HDD_02/hannia/SeaCucumber/GCF_037975245.1_ref/ncbi_dataset/data/GCF_037975245.1/rna.fna \
> -db /home/shared/8TB_HDD_02/hannia/blast/uniprot_sprot_r2025_01 \
> -out /home/shared/8TB_HDD_02/hannia/SeaCucumber/output/Seacuc_blastx_sprotresults.tab \
> -evalue 1E-20 \
> -num_threads 20 \
> -max_target_seqs 5 \
> -outfmt 6

2.0 head of blastx results

head /home/shared/8TB_HDD_02/hannia/SeaCucumber/output/Seacuc_blastx_sprotresults.tab
XM_071954493.1  sp|Q9UGM1|ACHA9_HUMAN   28.107  338 218 7   134 1120    10  331 1.13e-42    162
XM_071954493.1  sp|P43144|ACHA9_RAT 28.339  307 198 6   227 1120    38  331 3.99e-41    157
XM_071954493.1  sp|Q9PTS8|ACHA9_CHICK   28.155  309 200 6   221 1120    38  333 8.28e-39    151
XM_071954493.1  sp|Q9I8C7|ACH10_CHICK   29.316  307 195 7   227 1120    39  332 2.94e-37    146
XM_071954493.1  sp|Q8JFN7|ACH91_ONCMY   28.013  307 199 6   227 1120    32  325 4.36e-36    145
XM_071954496.1  sp|Q2TBP5|TMM53_BOVIN   37.184  277 155 6   67  858 5   275 3.57e-34    136
XM_071954496.1  sp|Q5PPS7|TM53A_XENLA   37.091  275 155 7   61  858 3   268 9.21e-34    135
XM_071954496.1  sp|Q6DJC8|TM53B_XENLA   35.531  273 158 6   67  858 5   268 2.31e-33    134
XM_071954496.1  sp|Q6DHN0|TMM53_DANRE   34.066  273 168 5   61  861 3   269 3.15e-33    133
XM_071954496.1  sp|Q9D0Z3|TMM53_MOUSE   37.956  274 155 6   67  858 5   273 8.07e-33    132

Part 3: Data Analysis

3.1: DESeq2 for DGE analysis

Reading in counts and filtering to keep counts that have more than 5 counts in at least 3 samples.

# Load counts
counts <- read.table("~/SeaCucumber/output/kallisto_01.isoform.counts.matrix", header=TRUE, row.names=1)

# Filter: keep rows with counts >5 in at least 3 samples
keep <- rowSums(counts > 5) >= 3
filtered.counts <- counts[keep, ]

# Summary
cat("Before filtering:", nrow(counts), "isoforms\n")
Before filtering: 56281 isoforms
cat("After filtering:", nrow(filtered.counts), "isoforms\n")
After filtering: 29310 isoforms

3.1 Final filtered.counts data frame

This is rounded to nearest integer & and has proper meaningful column names.

#rounding to nearest integer
filtered.counts <- round(filtered.counts, 0)
str(filtered.counts)
'data.frame':   29310 obs. of  12 variables:
 $ SRR19635628: num  0 4 93 142 36 0 130 16 0 2 ...
 $ SRR19635629: num  9 17 0 56 309 0 18 27 0 63 ...
 $ SRR19635630: num  9 282 54 209 130 0 36 26 11 19 ...
 $ SRR19635631: num  79 0 0 88 701 61 482 153 7 90 ...
 $ SRR19635632: num  56 0 0 71 208 0 144 42 5 139 ...
 $ SRR19635633: num  4 7 0 7 375 0 41 16 3 2 ...
 $ SRR19635634: num  46 0 0 67 117 0 19 35 0 10 ...
 $ SRR19635635: num  10 7 93 80 105 0 47 115 1 26 ...
 $ SRR19635636: num  47 51 93 87 156 0 17 112 13 1 ...
 $ SRR19635637: num  43 7 47 3 0 31 253 17 21 31 ...
 $ SRR19635638: num  27 0 0 8 0 0 30 34 4 27 ...
 $ SRR19635639: num  8 0 0 28 0 123 11 0 1 60 ...
# Vector of new sample names in the correct order
new.names <- c("Control 1", "Control 2", "Control 3", "26.a1", "26.a2", "26.a3", "26.b1", "26.b2", "26.b3", "30.a1","30.a2","30.a3")
# Apply new names to the columns
colnames(filtered.counts) <- new.names
head(filtered.counts)
               Control 1 Control 2 Control 3 26.a1 26.a2 26.a3 26.b1 26.b2
XM_071976267.1         0         9         9    79    56     4    46    10
XM_071997363.1         4        17       282     0     0     7     0     7
XM_071955467.1        93         0        54     0     0     0     0    93
XM_071967761.1       142        56       209    88    71     7    67    80
XM_071987644.1        36       309       130   701   208   375   117   105
XM_071957402.1         0         0         0    61     0     0     0     0
               26.b3 30.a1 30.a2 30.a3
XM_071976267.1    47    43    27     8
XM_071997363.1    51     7     0     0
XM_071955467.1    93    47     0     0
XM_071967761.1    87     3     8    28
XM_071987644.1   156     0     0     0
XM_071957402.1     0    31     0   123

Libraries used for downstream analysis

library(DESeq2)
library(tidyverse)
library(pheatmap)
library(RColorBrewer)
library(data.table)
library(gprofiler2)

3.2 Setting up data frames for DESeq2( )

condition <- factor(rep(c("Control", "26.a", "26.b", "30.a"), each = 3),
                    levels = c("Control", "26.a", "26.b", "30.a"))
condition
 [1] Control Control Control 26.a    26.a    26.a    26.b    26.b    26.b   
[10] 30.a    30.a    30.a   
Levels: Control 26.a 26.b 30.a
deseq2.colData <- data.frame(condition = condition, 
                             type=factor(rep("paired", 12)))

rownames(deseq2.colData) <- colnames(filtered.counts)

deseq2.data <- DESeqDataSetFromMatrix(countData = filtered.counts,
                                     colData = deseq2.colData, 
                                     design = ~ condition)

3.3 Running DESeq( ) command

deseq2.data <- DESeq(deseq2.data)
deseq2.res <- results(deseq2.data)
deseq2.res <- deseq2.res[order(rownames(deseq2.data)), ]

3.4 DESeq results

# Count number of hits with adjusted p-value less then 0.05
dim(deseq2.res[!is.na(deseq2.res$padj) & deseq2.res$padj <= 0.05, ])
[1] 1388    6
head(deseq2.res[order(deseq2.res$padj), ])  # sorted by adjusted p-value
log2 fold change (MLE): condition 30.a vs Control 
Wald test p-value: condition 30.a vs Control 
DataFrame with 6 rows and 6 columns
                baseMean log2FoldChange     lfcSE      stat      pvalue
               <numeric>      <numeric> <numeric> <numeric>   <numeric>
XM_071987485.1  52098.76        7.91277  0.624320  12.67422 8.21714e-37
XM_071976661.1 160742.38       10.91870  0.953902  11.44635 2.45258e-30
XM_071971899.1 177059.26       11.97290  1.073711  11.15096 7.08404e-29
XM_071978956.1  94012.69       13.16145  1.227628  10.72104 8.10871e-27
XM_071997912.1  71359.22        9.46740  0.908432  10.42169 1.97408e-25
XM_071980612.1   2758.95       10.22529  1.033384   9.89495 4.37816e-23
                      padj
                 <numeric>
XM_071987485.1 2.04714e-32
XM_071976661.1 3.05505e-26
XM_071971899.1 5.88283e-25
XM_071978956.1 5.05031e-23
XM_071997912.1 9.83603e-22
XM_071980612.1 1.81789e-19

3.4 Visualizing DESeq( ) results

tmp <- deseq2.res
# The main plot
plot(tmp$baseMean, tmp$log2FoldChange, pch=20, cex=0.45, ylim=c(-3, 3), log="x", col="darkgray",
     main="DEG of control and 30degC data (pval <= 0.05)",
     xlab="mean of normalized counts",
     ylab="Log2 Fold Change")

# Getting the significant points and plotting them again so they're a different color
tmp.sig <- deseq2.res[!is.na(deseq2.res$padj) & deseq2.res$padj <= 0.05, ]
points(tmp.sig$baseMean, tmp.sig$log2FoldChange, pch=20, cex=0.45, col="red")
# 2 FC lines
abline(h=c(-1,1), col="blue")

3.4 The Main Plot

3.4 More visualization of DESeq( ) results

Figure of Top 50 Deferentially Expressed Genes:

3.5 Merging BLAST & DESeq2 Results for functional analysis

Go to ~/Hannia-jelly/Project/scripts/03-SeaCucDGEAnalysis.Rmd for full code of this step.

# Merge the data
deg.annotated <- merge(DEGtable, blastresults, by = "transcriptID", all.x = TRUE)
# Display only the first 5 rows
head(read.csv("~/SeaCucumber/output/deg_annotated.csv"), 5)
    transcriptID baseMean log2FoldChange    lfcSE     stat       pvalue
1 XM_071954564.1 34.38615       21.40793 3.703689 5.780163 7.462832e-09
2 XM_071954564.1 34.38615       21.40793 3.703689 5.780163 7.462832e-09
3 XM_071954564.1 34.38615       21.40793 3.703689 5.780163 7.462832e-09
4 XM_071954564.1 34.38615       21.40793 3.703689 5.780163 7.462832e-09
5 XM_071954564.1 34.38615       21.40793 3.703689 5.780163 7.462832e-09
          padj            subject_id percent_identity alignment_length
1 1.309307e-06 sp|Q5BL81|SC5A8_XENTR           40.381              473
2 1.309307e-06 sp|Q8BYF6|SC5A8_MOUSE           34.633              641
3 1.309307e-06 sp|Q49B93|SC5AC_MOUSE           37.705              488
4 1.309307e-06 sp|Q8N695|SC5A8_HUMAN           37.056              591
5 1.309307e-06 sp|A7MBD8|SC5AC_BOVIN           38.285              478
  mismatches gap_opens q_start q_end s_start s_end    evalue bit_score
1        278         2     405  1811       9   481  2.84e-94       313
2        366         7     387  2297       3   594  3.59e-97       320
3        300         2     405  1856       5   492  1.21e-92       308
4        331         6     405  2171       9   560 5.78e-100       328
5        291         2     405  1826       5   482  1.31e-93       311

3.6 Uniprot ID mapping

First, created a text file of the subject_id column to paste into the Uniprot ID mapping websiteto conduct functional analysis.

deg.annotated$accession <- sub("sp\\|(.*?)\\|.*", "\\1", deg.annotated$subject_id)
write.table(deg.annotated$accession, file = "~/SeaCucumber/output/uniprot_ids.txt", quote = FALSE, row.names = FALSE, col.names = FALSE)

Downloaded .tsv to directory and merged with deg.annotated table

functional.merge <- merge(deg.annotated, uniprot.annot, by.x = "accession", by.y = "From", all.x = TRUE)

Head of functional.merge data frame

head(read.csv("~/SeaCucumber/output/functional_merge.csv"), 5)
   accession   transcriptID  baseMean log2FoldChange     lfcSE      stat
1 A0A061I403 XM_071996487.1 413.60029       1.045867 0.3225155  3.242843
2 A0A088MLT8 XM_071983001.1  49.74936     -21.294231 3.8514087 -5.528946
3 A0A096X8J7 XM_071966355.1  17.24461      20.221448 4.2518977  4.755864
4 A0A0B4KEE4 XM_071958907.1  23.44183       8.840447 2.8686350  3.081761
5 A0A0G2JZ79 XM_071972825.1 202.67727      -1.559612 0.4417978 -3.530148
        pvalue         padj                subject_id percent_identity
1 1.183433e-03 2.656115e-02  sp|A0A061I403|FICD_CRIGR           55.660
2 3.221603e-08 4.321509e-06 sp|A0A088MLT8|IQIP1_MOUSE           42.205
3 1.975993e-06 1.103765e-04  sp|A0A096X8J7|B3A3_DANRE           38.755
4 2.057802e-03 4.014566e-02   sp|A0A0B4KEE4|KOI_DROME           57.672
5 4.153272e-04 1.145852e-02    sp|A0A0G2JZ79|SIR1_RAT           65.046
  alignment_length mismatches gap_opens q_start q_end s_start s_end    evalue
1              318        140         1     432  1382     100   417 2.01e-115
2              263        136         7    1748  2521     292   543  9.09e-36
3              916        497         9    1034  3736     273  1139  0.00e+00
4              189         79         1    3731  4294     772   960  2.05e-67
5              329        108         4     600  1586       1   322 6.40e-136
  bit_score      Entry Reviewed  Entry.Name
1       369 A0A061I403 reviewed  FICD_CRIGR
2       150 A0A088MLT8 reviewed IQIP1_MOUSE
3       595 A0A096X8J7 reviewed  B3A3_DANRE
4       252 A0A0B4KEE4 reviewed   KOI_DROME
5       441 A0A0G2JZ79 reviewed    SIR1_RAT
                                                                                                                                                        Protein.names
1 Protein adenylyltransferase FICD (EC 2.7.7.108) (AMPylator FICD) (De-AMPylase FICD) (EC 3.1.4.-) (FIC domain-containing protein) (Huntingtin-interacting protein E)
2                                                                                                                          IQCJ-SCHIP1 readthrough transcript protein
3                                                                                                  Anion exchange protein 3 (AE 3) (Solute carrier family 4 member 3)
4                                                                                                                                                     Klaroid protein
5                                                 NAD-dependent protein deacetylase sirtuin-1 (EC 2.3.1.286) (NAD-dependent protein deacylase sirtuin-1) (EC 2.3.1.-)
                         Gene.Names
1 FICD HYPE H671_4g11989 I79_014982
2        Iqcj-Schip1 Iqschfp Schip1
3                        Slc4a3 Ae3
4                       koi CG44154
5                             Sirt1
                                                              Organism Length
1 Cricetulus griseus (Chinese hamster) (Cricetulus barabensis griseus)    455
2                                                 Mus musculus (Mouse)    559
3                          Danio rerio (Zebrafish) (Brachydanio rerio)   1170
4                                  Drosophila melanogaster (Fruit fly)    965
5                                              Rattus norvegicus (Rat)    555

3.7 gprofiler2 package to use gost( ) command

Performing a Gene Ontology (GO)/ Pathway enrichment analysis using UniProt accession IDs.

deg_ids <- unique(deg.annotated$accession) #getting accession ID (used for uniprot id mapping)

library(gprofiler2)

# Use cleaned UniProt accessions of DEGs
gost.res <- gost(query = deg_ids, organism = "hsapiens") 

3.7 Visualizing Gene Ontology (GO)/ Pathway enrichment analysis

# Select top 10 significant terms
top_terms <- head(gost.res$result[order(gost.res$result$p_value), ], 10)

ggplot(top_terms, aes(x = reorder(term_name, -log10(p_value)), 
                      y = -log10(p_value), 
                      fill = source)) +
  geom_col() +
  coord_flip() +
  labs(title = "Top Enriched Terms", x = "GO/Pathway Term", y = "-log10(p-value)") +
  theme_minimal()

Interpretation

  • 2 groups: Gene ontology & Biologcial Pathways

  • CC = Cellular components, MF = Molecular function, TF = Transcription factors.

  • Most DEG genes in the 30degC group correspond to genes relating to: CF, MF, and TFs involved biological pathways.

Main finding

The lethal temperature of 30°C was found to cause the most stat. significant biological changes in gene expression of Japanese Sea Cucumber. These genes correspond to CF, MF, and TF. Additionally, the time variable within the 26 °C group (6hr vs 48) was not found to have a significant impact on gene expression. The differences of DEG between the 26 and 30 °C groups were also not significantly different.

How to access my scripts and output folder

My scripts and small-size ouput files were uploaded to Github Hannia-jelly repository.

Zenodo DOI: https://doi.org/10.5281/zenodo.15599622

References

Xu, D., Zhang, J., Song, W., Sun, L., Liu, J., Gu, Y., Chen, Y., & Xia, B. (2023). Analysis of differentially expressed genes in the sea cucumber Apostichopus japonicus under heat stress. Acta Oceanologica Sinica, 42(11), 117–126. https://doi.org/10.1007/s13131-023-2196-4