Welcome to My Project

This project presents the full analysis for the ISG 5311 final project, including RNA-seq differential expression analysis and interpretation.

1 Introduction

Pyrethroid pesticides, such as deltamethrin (DM), are widely used in agriculture and aquaculture due to their high insecticidal activity and relatively low toxicity compared to older pesticides. DM is effective against a broad spectrum of pests, making it a popular choice for pest and disease control in aquatic environments (Guardiola et al. (2014)). However, DM frequently enters rivers and lakes through surface runoff and wastewater discharge, posing risks to aquatic organisms. Fish are particularly vulnerable because they lack efficient mechanisms to metabolize and degrade DM, leading to toxic effects even at environmentally relevant concentrations (Wu et al. (2020)).

Goldfish (Carassius auratus) are a common freshwater species widely distributed in many water bodies and are often used as a model organism in toxicology research (Blanco et al. (2018)). Their ecological relevance, manageable size, and ease of maintenance make them suitable for laboratory experiments, while their physiology provides insight into potential impacts of environmental contaminants on aquatic species. Prior studies have shown that DM exposure can induce oxidative stress, apoptosis, and immune dysfunction in various fish species, but the molecular mechanisms underlying these effects are not fully understood (Zhou et al. (2021)).

The aim of this study was to investigate the effects of acute, environmentally relevant concentrations of DM on goldfish kidney transcriptomes and intestinal microbiota. Nine healthy goldfish were randomly assigned to three treatment groups: 0 µg/L control, 0.2 µg/L DM, and 2 µg/L DM, with three fish per group. Fish were exposed for 96 hours under controlled laboratory conditions. Kidney tissues were collected and RNA extracted for high-throughput RNA-sequencing to identify differentially expressed genes and characterize affected biological pathways, including apoptosis, immune response, and drug metabolism. Additionally, intestinal contents were collected to assess changes in microbial composition using 16S rRNA sequencing.

By combining transcriptomic and microbiota analyses, this study provides an integrative understanding of how DM affects fish at both the molecular and microbial levels. The results are expected to reveal key pathways disrupted by DM exposure and highlight the broader impacts of environmental contaminants on fish physiology and gut microbial communities (Zhou et al. (2021)). This work contributes to the growing field of aquatic toxicology by providing detailed molecular insights into DM-induced toxicity in a widely studied model species.

2 Methods

Methods were adapted from Zhou et al. (2021) unless otherwise noted.

2.1 Experimental Design and Samples

Eighty-one healthy goldfish (Carassius auratus) were purchased from a commercial supplier (Hanjin Ornamental Fish Farm, Wuhan, China) and acclimated for two weeks in aerated, dechlorinated water. During acclimation, fish were held in large tanks under controlled temperature (19–21 °C), pH (6.9–7.3), and dissolved oxygen (~6.2 mg/L), with a consistent light:dark cycle of 10 hours of light to 14 hours of darkness. Fish were fed once daily with a commercial pellet diet, and one-third of the water was exchanged every three days to maintain water quality.

Deltamethrin (>98% purity) was dissolved in dimethyl sulfoxide (DMSO) to generate stock solutions. Fish were randomly assigned to one of three treatment groups: 0 µg/L DM (control), 0.2 µg/L DM, and 2 µg/L DM (n = 9 fish per group). These concentrations reflect environmentally relevant DM levels reported in aquatic systems. All groups, including controls, contained trace amounts of DMSO (<0.001% v/v). Fish were exposed to their assigned treatments for 96 hours. Solutions were renewed daily to maintain stable DM levels, and fish were not fed during the exposure period.

2.2 Sample Collection

After 96 hours exposure, nine specimens were selected from each group to collect intestinal contents and pooled for transctiprome anslysis. Simultaneously, three specimens were randomly selected from each group and the kidney was removed. The kidney specimens were used for RNA-seq analysis.

2.3 Transcriptome Processing and Analysis

2.3.1 RNA Extraction and Sequencing

Total RNA was extracted from kidney tissue using TRIzol Reagent (Life Technologies) in accordance with the manufacturer’s protocol. A NanoDrop 2000 was used to assess RNA quantity and purity, and an Agilent 2100 Bioanalyzer was used to confirm RNA integrity. Only high-quality RNA samples were processed for sequencing. Libraries were prepared and sequenced on an Illumina NovaSeq 6000 to produce paired-end reads. Raw RNA-seq data have been submitted to the Sequence Read Archive (SRA) database in NCBI (accession number: SRP323663).

2.3.2 Bioinformatic Workflow

Raw reads were filtered to remove low-quality sequences, adapter contamination, and poly-N content. Clean reads were aligned to the goldfish reference genome (GCF_003368295.1) using HISAT2 v2.2.1. Gene-level counts were generated and normalized using DESeq2 v1.38.0 within R. Differentially expressed genes (DEGs) were identified based on false discovery rate (FDR) < 0.05 and |log2 fold change| > 1. Functional enrichment of DEGs was performed using GOseq for Gene Ontology (GO) enrichment and KOBAS for Encyclopedia of Genes and Genomes (KEGG) pathway enrichment. These analyses were used to identify biological pathways associated with apoptosis, immune function, and metabolism.

2.3.3 qPCR Validation

To confirm the reliability of RNA-seq results, a subset of ten DEGs were selected for quantitative real-time PCR (qPCR). Primers were designed using Primer3, and qPCR was conducted using a QuantStudio 3 Real-Time PCR System with ChamQ Universal SYBR qPCR Master Mix. Each reaction contained SYBR qPCR Master Mix, forward and reverse primers, cDNA template, and nuclease-free water. Thermal cycling conditions were: 95 °C for 30 seconds followed by 40 cycles of denaturing at 95 °C for 10 seconds and annealing at 60 °C for 30 seconds. The β-actin gene was used as a reference (housekeeping) gene, its expression level was assumed to remain stable across all samples. This allows for comparison of the expression of the target DEGs relative to a consistent baseline. To quantify gene expression differences, the 2⁻ΔΔCt method was applied. This method uses the cycle threshold (Ct) values from qPCR to calculate relative fold changes in gene expression between experimental groups.

2.4 Intestinal Microbiota Analysis

2.4.1 DNA Extraction and Sequencing

DNA was extracted from intestinal contents using the QIAamp DNA Stool Mini Kit (Qiagen). DNA concentration and purity were measured using a NanoDrop spectrophotometer. Full-length 16S rRNA genes were amplified using primers 27F and 1492R. PCR amplification was performed using Premix Taq (Takara) with the following thermal cycling program: an initial denaturation at 95 °C for 5 minutes; 25 cycles of denaturation at 95 °C for 30 seconds, annealing at 50 °C for 30 seconds, and extension at 72 °C for 1 minute; followed by a final extension at 72 °C for 7 minutes. Libraries were prepped from purified DNA using the SMRTbell Template Prep Kit 1.0, and sequenced on a PacBio SMRT RS II platform.

2.4.2 Sequence Processing and Statistical Analysis

Raw PacBio reads were processed with the SMRT Analysis Suite to generate high-quality circular consensus sequences. Sequences were clustered into operational taxonomic units at 97% similarity using USEARCH v10.0. Taxonomic classification was performed using the RDP Classifier with an 80% confidence threshold.

Microbial diversity analyses were completed in QIIME2, including: alpha diversity (Chao1, ACE, Shannon, and Simpson indices), Good’s coverage to estimate sequencing completeness, and beta diversity (PCoA and UPGMA clustering). All statistical analyses were performed in R. Data were checked for normality and homogeneity of variance. Differences among treatment groups were evaluated using one-way ANOVA followed by Duncan’s multiple range test, with significance set at p < 0.05.

3 Results

3.1 Raw sequencing and preprocessing

Nine kidney RNA-seq libraries, three per treatment group, were sequenced. After quality control, the number of clean reads per library was ~41.36 million and per-library Q30 exceeded 95.09% (see Table 1). Clean reads mapped to the goldfish reference genome (GCF_003368295.1) with mapping rates between 88.37% and 90.71%, indicating high-quality sequencing and good genome coverage.

I performed quality control on the trimmed FastQC data using MultiQC. Overall, the sequencing reads appear to be high quality and suitable for downstream analysis. Per-base quality scores were consistently high (mean Phred >30), per-base N content was effectively zero, and GC content across all samples was uniform at 45–47%, comparable to the 46–47% reported in Table S1. Adapter contamination was negligible, and overrepresented sequences accounted for less than 1% of reads, indicating successful trimming.

Table 1: Summary of RNA-seq library quality and mapping statistics
# A tibble: 9 × 8
  Sample `Clean reads` `Clean bases` `GC content (%)` `Q30 (%)`
  <chr>          <dbl>         <dbl>            <dbl>     <dbl>
1 C-1         54858738    8189592344             46.7      96.0
2 C-2         51384524    7673744890             46.7      95.9
3 C-3         55474754    8260159928             47.0      95.6
4 L-1         48550446    7251033918             46.6      95.1
5 L-2         41360330    6184981010             45.9      95.7
6 L-3         54107400    8091356676             46.4      95.6
7 H-1         56013224    8350534022             46.7      95.7
8 H-2         46984324    7034325400             46.6      95.2
9 H-3         48665048    7255381648             46.6      95.1
# ℹ 3 more variables: `Total mapped reads (ratio)` <chr>,
#   `Multiple mapped reads (ratio)` <chr>,
#   `Uniquely mapped reads (ratio)` <chr>
Table 1

3.2 Differential expression analysis

After alignment and normalization, DESeq2 identified 270 DEGs, 160 upregulated and 110 downregulated in the 0.2 µg/L versus control comparison, and 711 DEGs, 309 upregulated and 402 downregulated in the 2 µg/L versus control comparison (Table 2). A set of DEGs overlapped between treatments; the intersection is shown in a Venn diagram (Figure 2D).

I analyzed RNA-seq data from goldfish kidneys exposed to deltamethrin (DM) and observed clear dose-dependent transcriptional changes. In the low-dose treatment (0.2 µg/L) compared to control, 173 genes were upregulated and 123 genes were downregulated, for a total of 296 differentially expressed genes (DEGs), with approximately 25% of genes filtered out due to low counts. In the high-dose treatment (2 µg/L) versus control, 531 genes were upregulated and 622 were downregulated, totaling 1,153 DEGs, with ~31% of genes filtered. The comparison between high and low doses (2 µg/L vs 0.2 µg/L) revealed 639 upregulated and 626 downregulated genes, totaling 1,265 DEGs, with ~33% of genes filtered.

Table 2: Summary of differentially expressed genes (DEGs) from Original Paper
Warning: package 'knitr' was built under R version 4.5.2
Warning: package 'ggplot2' was built under R version 4.5.2
Summary of differentially expressed genes (DEGs) from Original Paper
Comparison Total DEGs Upregulated Downregulated
0.2 µg/L vs Control 270 160 110
2 µg/L vs Control 711 309 402
Table 3: Summary of differentially expressed genes (DEGs) in Re-Analysis
Summary of differentially expressed genes (DEGs) in Re-Analysis
Comparison Total DEGs Upregulated Downregulated
0.2 µg/L vs Control 296 173 123
2 µg/L vs Control 1153 531 622
Figure 1: Differentially Expressed Genes by Treatment

Figure 2: Summary of differentially expressed genes in goldfish (Carassius auratus) after exposure to 0.2 and 2 µg/L deltamethrin Zhou et al. (2021).

3.3 Functional enrichment (GO & KEGG)

GO enrichment for the 0.2 µg/L group assigned 228 DEGs to GO categories: biological process (BP), cellular component (CC) and molecular function (MF) (Figure 3). The most-enriched BP terms included regulation of programmed cell death and regulation of humoral immune response. MF enrichment highlighted sequence-specific DNA-binding transcription factor activity. For the 2 µg/L group, enriched terms included myeloid leukocyte activation and several cellular component terms related to mitochondria and granules.

KEGG enrichment identified 92 pathways significantly affected at 0.2 µg/L and 151 pathways at 2 µg/L. The top enriched pathway at 0.2 µg/L was apoptosis (ko04210) (14 DEGs), and at 2 µg/L phagosome (ko04145) had the highest enrichment (27 DEGs) (Figure 4).

The functional interpretation of the DEGs in my analysis of the goldfish kidney dataset aligns well with the findings reported in the original publication, showing clear dose-dependent transcriptional responses to deltamethrin (DM) exposure. In our analysis, low-dose (0.2 μg/L) exposure primarily affected genes involved in transcription regulation, DNA replication, and protein dimerization, with enriched GO terms such as transcription regulator complex, DNA duplex unwinding, and negative regulation of transcription by RNA polymerase II. These results correspond closely to the paper, where the 0.2 μg/L treatment showed enrichment in transcription factor complexes (GO:0005667, GO:0090575) and molecular functions such as transcription factor activity and sequence-specific DNA binding (GO:0003700), as well as biological processes related to regulation of programmed cell death and immune responses. For high-dose (2 μg/L) exposure, our data identified enriched terms related to mitochondrial function (protein import into the mitochondrial matrix), protein folding, rRNA processing, and nucleolus activity, which are consistent with the paper’s observation of enrichment in the mitochondrial outer membrane translocase complex, nucleolus, and other cellular components involved in stress and protein homeostasis. While KEGG pathway enrichment was not explicitly performed in our current analysis, the paper reported apoptosis and phagosome pathways as significantly affected, supporting the idea that DM exposure perturbs both cellular stress responses and immune-related pathways. Overall, our results recapitulate the main trends observed in the publication, with low-dose exposure impacting transcriptional regulation and high-dose exposure affecting protein processing and mitochondrial pathways, confirming dose-dependent effects of DM on goldfish kidney gene expression.

Figure 3: Gene ontology annotation of the differentially expressed genes in the three main GO categories: biological process, cellular component and molecular function. A: 0 μg/L vs 0.2 μg/L deltamethrin; B: 0 μg/L vs 2 μg/L deltamethrin Zhou et al. (2021).

Figure 4: KEGG enrichment based on the DEGs after exposure to (A) 0.2 and (B) 2 µg/L DM compared to the control group (0 µg/L DM). The vertical and horizontal axes represent different pathways and rich factor, respectively Zhou et al. (2021).

3.4 qPCR validation

Ten DEGs were selected for qPCR validation, five in the 0.2 µg/L group and five in the 2 µg/L group. IFI44L, CYP27C1, CTSL, IFIG1 and GIMAP7 are evaluated in the 0.2 µg/L DM-treated group and are compared with control samples. LYS, ALP, C3, GST and CYP1A1 genes are evaluated in the 2 µg/L DM-treated group and are compared with control samples. qPCR fold-change results were consistent with RNA-seq fold changes for all ten genes, supporting the reliability of the transcriptomic data (Figure 5). Error bars indicate mean ± SD (n = 3 biological replicates).

Figure 5: Differentially expressed genes in goldfish kidneys confirmed by qPCR. The X-axis displays 10 DEGs and the Y-axis represents relative fold change. The data are expressed as the means ± SD (n = 3) Zhou et al. (2021).

3.5 Microbiome sequencing and alpha/beta diversity

A total of 88,633 effective circular consensus reads (CCS) were obtained from intestinal samples, with average read lengths between 1,418 and 1,456 bp. Rarefaction curves indicated that the sequencing depth was sufficient to capture the majority of microbial diversity, allowing for reliable downstream analysis. Analysis of alpha-diversity indices showed a slight decrease in microbial diversity in the 0.2 µg/L deltamethrin (DM) group and an increase in the 2 µg/L DM group; however, these changes were not statistically significant (P > 0.05).

Proteobacteria dominated at the phylum level, followed by Firmicutes, Bacteroidetes, Fusobacteria, and Verrucomicrobia. Exposure to 2 µg/L DM increased Proteobacteria, Firmicutes, and Fusobacteria, while Bacteroidetes decreased in both DM-treated groups; Verrucomicrobia increased in the 0.2 µg/L group. At the genus level, Aeromonas, Cetobacterium, Dielma, Pseudorhodobacter, and Akkermansia were most abundant. DM exposure decreased Aeromonas, Cetobacterium, Dielma, and Pseudorhodobacter, but increased Akkermansia. These results indicate that deltamethrin exposure alters intestinal microbial composition in a dose-dependent manner.

Figure 6: Hierarchical cluster analysis based on the unweighted pair group method with arithmetic mean; Relative abundances of dominant microbial genera. (CM: 0 μg/L deltamethrin; LM: 0.2 μg/L deltamethrin; HM: 2 μg/L deltamethrin) Zhou et al. (2021).

All results for R scripts are embedded in the scripts portion of the document

4 Discussion

This project applied a comprehensive RNA-seq bioinformatics pipeline to evaluate the transcriptional effects of DM exposure in goldfish kidney tissue. Beginning with raw sequencing data, each analytical step—from quality control to gene ontology enrichment—contributed to understanding how DM alters gene expression in a dose-dependent manner. Overall, the results strongly align with the original publication while providing additional insight.

The raw RNA-seq dataset and the corresponding goldfish reference genome and annotation (GCF_003368295.1) were downloaded to ensure accurate mapping and quantification. The genome assembly matches that used in the published study, ensuring compatibility between analyses.

Quality control (QC) of the FASTQ files indicated high-quality sequencing suitable for downstream analysis. Per-base Phred quality scores were consistently above 30, N content was negligible, and GC content ranged from 45–47%, closely matching the 46–47% reported in the original study. Adapter contamination was minimal, and overrepresented sequences accounted for less than 1% of reads. Duplication levels were relatively high (68–72%), likely reflecting biological duplication or highly expressed transcripts, as the pattern was consistent across all replicates. No sample displayed abnormal metrics, suggesting the dataset was robust and no libraries required removal.

Reads were aligned to the goldfish genome using a splice-aware aligner (HISAT2). Mapping rates ranged from approximately 88–91%, consistent with the original study, confirming the reference genome’s suitability for kidney transcriptomics. Uniform mapping rates across treatment groups suggest consistent library preparation.

Gene expression quantification was performed using featureCounts to assign reads to annotated genes. Each sample yielded ~41 million clean reads, aligning with the sequencing depth in the original article. Approximately 25–33% of genes were filtered out due to low counts, reflecting typical tissue- and condition-specific expression patterns in fish transcriptomes.

Differential expression analysis using DESeq2 revealed clear, dose-dependent transcriptional responses to DM. In the 0.2 μg/L treatment versus control, 173 genes were upregulated and 123 downregulated (296 total DEGs). In the 2 μg/L group versus control, 531 genes were upregulated and 622 downregulated (1,153 DEGs). The high-versus-low dose comparison identified 1,265 DEGs, demonstrating that increasing DM concentration substantially alters transcriptional profiles. Principal component analysis (PCA) showed clear clustering by treatment with no outliers, confirming data reliability. Heatmap visualization displayed less distinct grouping, likely due to the large number of DEGs and gene-specific expression variability.

These results parallel the original publication, which reported 270 DEGs for the low dose and 711 DEGs for the high dose. Minor differences in DEG counts likely reflect variations in filtering thresholds, DESeq2 versions, or QC cutoffs, but the biological trend is consistent: DM induces a dose-dependent transcriptional response in goldfish kidneys.

Principal component analysis (PCA) demonstrated clear separation of samples by dose, indicating that deltamethrin exposure was the primary driver of transcriptional variation. Heatmap visualization of the top 500 most variable genes did not have distinct clustering of high-exposure, low-exposure, and/or control samples. These results indicate that acute exposure to deltamethrin induces measurable changes in the goldfish kidney transcriptome, with stronger effects observed at higher exposure levels.

Gene Ontology (GO) enrichment analysis contextualized the DEGs within biological processes. Enriched terms included immune activation, apoptosis, complement signaling, programmed cell death regulation, and membrane trafficking, consistent with the original study. These findings reinforce that DM disrupts immune function and cellular homeostasis, even at low exposure levels. Expansion of apoptotic and immune-related GO categories at higher doses suggests a model in which DM induces progressive cellular stress that intensifies with concentration.

Overall, the bioinformatics pipeline—from data download and QC through alignment, quantification, differential expression analysis, and GO enrichment—produced results that are technically sound and biologically meaningful. The findings align with the central conclusions of the original study.

5 Scripts

Download SRA

This script automates the download of raw RNA-seq data from the NCBI Sequence Read Archive (SRA) using fasterq-dump. It loops through a list of SRR accession numbers, downloads the corresponding FASTQ files, and compresses them with gzip.

#!/bin/bash
#SBATCH --job-name=fasterq_dump_goldfish
#SBATCH -n 1
#SBATCH -N 1
#SBATCH -c 1
#SBATCH --mem=15G
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH --mail-type=ALL
#SBATCH --mail-user=stephanie.ballas@uconn.edu
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err

# Load software
module load sratoolkit/3.0.1

# Data is from:
#  https://www.ncbi.nlm.nih.gov/sra/?term=SRP323663
#  https://doi.org/10.1016/j.ecoenv.2021.112716

# Output folder
OUTDIR=~/Goldfish_Deltamethrin_Project/data/fastq
mkdir -p ${OUTDIR}

# Accession list
ACCLIST=~/Goldfish_Deltamethrin_Project/metadata/accessionlist.txt

# Loop through each SRR in the accession list and download
while read SRR; do
    echo "Downloading ${SRR}..."
    fasterq-dump ${SRR} -O ${OUTDIR}
done < ${ACCLIST}

# Compress all FASTQ files
echo "Compressing FASTQ files..."
gzip ${OUTDIR}/*.fastq

date
echo "Download and compression complete."

Download Genome

This script retrieves the goldfish reference genome, GTF annotation, and cDNA sequences from Ensembl using wget. Files are stored in a designated genome directory, decompressed, and indexed using samtools faidx to facilitate downstream alignment and analysis.

#!/bin/bash
#SBATCH --job-name=get_genome
#SBATCH -n 1
#SBATCH -N 1
#SBATCH -c 4
#SBATCH --mem=4G
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH --mail-type=ALL
#SBATCH --mail-user=stephanie.ballas@uconn.edu
#SBATCH -o /home/FCAM/sballas/Goldfish_Deltamethrin_Project/logs/%x_%j.out
#SBATCH -e /home/FCAM/sballas/Goldfish_Deltamethrin_Project/logs/%x_%j.err

# Load samtools
module load samtools/1.16.1

# Make genome directory
mkdir -p /home/FCAM/sballas/Goldfish_Deltamethrin_Project/data/genome

# Download genome, GTF, and cDNA from Ensembl
wget -c -P /home/FCAM/sballas/Goldfish_Deltamethrin_Project/data/genome \
http://ftp.ensembl.org/pub/release-115/fasta/carassius_auratus/dna/Carassius_auratus.ASM336829v1.dna_sm.toplevel.fa.gz

wget -c -P /home/FCAM/sballas/Goldfish_Deltamethrin_Project/data/genome \
http://ftp.ensembl.org/pub/release-115/gtf/carassius_auratus/Carassius_auratus.ASM336829v1.115.gtf.gz

wget -c -P /home/FCAM/sballas/Goldfish_Deltamethrin_Project/data/genome \
http://ftp.ensembl.org/pub/release-115/fasta/carassius_auratus/cdna/Carassius_auratus.ASM336829v1.cdna.all.fa.gz

# Decompress
gunzip -f /home/FCAM/sballas/Goldfish_Deltamethrin_Project/data/genome/*.gz

# Create samtools index
samtools faidx /home/FCAM/sballas/Goldfish_Deltamethrin_Project/data/genome/Carassius_auratus.ASM336829v1.dna_sm.toplevel.fa
samtools faidx /home/FCAM/sballas/Goldfish_Deltamethrin_Project/data/genome/Carassius_auratus.ASM336829v1.cdna.all.fa

FastQC

The FastQC script performs quality control on raw FASTQ files. It reads a list of sample accessions, runs FastQC on paired-end files, and outputs detailed per-sample quality reports, including read quality, adapter content, and sequence duplication levels.

#!/bin/bash
#SBATCH --job-name=fastqc
#SBATCH -n 1
#SBATCH -N 1
#SBATCH -c 10
#SBATCH --mem=20G
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH --mail-type=ALL
#SBATCH --mail-user=stephanie.ballas@uconn.edu
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err

echo `hostname`

#################################################################
# FastQC
#################################################################
module load fastqc/0.12.1

# set input/output directory variables
INDIR=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/data/fastq
REPORTDIR=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/results/fastqc
mkdir -p $REPORTDIR


# accession list
ACCLIST=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/metadata/accessionlist.txt

# Run FastQC on all FASTQ pairs (no parallel needed)
for ACC in $(cat $ACCLIST); do
    echo "Running FastQC on $ACC..."
    fastqc --outdir $REPORTDIR ${INDIR}/${ACC}_1.fastq.gz ${INDIR}/${ACC}_2.fastq.gz
done

MultiQC

This script aggregates individual FastQC reports into a single summary using MultiQC. It collects output from all samples and generates an integrated HTML report, simplifying the evaluation of overall sequencing quality.

#!/bin/bash
#SBATCH --job-name=multiqc
#SBATCH -n 1
#SBATCH -N 1
#SBATCH -c 4
#SBATCH --mem=10G
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH --mail-type=ALL
#SBATCH --mail-user=stephanie.ballas@uconn.edu
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err

# Load module
module load MultiQC/1.15

# Set directories
INDIR=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/results/fastqc
OUTDIR=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/results/multiqc

mkdir -p $OUTDIR

# run on fastqc output
multiqc -f -o ${OUTDIR} ${INDIR}

Trimmomatic

Trimmomatic is used here to remove adapter sequences and low-quality bases from raw FASTQ files. The script loops over all samples, performing paired-end trimming with specified parameters, generating both trimmed reads and orphaned reads for downstream analysis.


#!/bin/bash
#SBATCH --job-name=trimmomatic
#SBATCH -n 1
#SBATCH -N 1
#SBATCH -c 4
#SBATCH --mem=20G
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH --mail-type=ALL
#SBATCH --mail-user=stephanie.ballas@uconn.edu
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err

# Record where and when this ran
hostname
date

#################################################################
# Trimmomatic
#################################################################

# Load Trimmomatic module
module load Trimmomatic/0.39
module list

# Set directories
INDIR=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/data/fastq
TRIMDIR=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/results/trimmed_fastq
mkdir -p $TRIMDIR

# Adapter file
ADAPTERS=/isg/shared/apps/Trimmomatic/0.39/adapters/TruSeq3-PE.fa

# Sample list
SAMPLES=(SRR14783835 SRR14783836 SRR14783837 SRR14783838 SRR14783839 SRR14783840 SRR14783841 SRR14783$

# Loop through samples
for SAMPLE in "${SAMPLES[@]}"
do
    echo "Processing ${SAMPLE}..."
    java -jar /isg/shared/apps/Trimmomatic/0.39/trimmomatic-0.39.jar PE -phred33 \
        -threads 4 \
        ${INDIR}/${SAMPLE}_1.fastq.gz ${INDIR}/${SAMPLE}_2.fastq.gz \
        ${TRIMDIR}/${SAMPLE}_trim_1.fastq.gz ${TRIMDIR}/${SAMPLE}_trim_orphans_1.fastq.gz \
        ${TRIMDIR}/${SAMPLE}_trim_2.fastq.gz ${TRIMDIR}/${SAMPLE}_trim_orphans_2.fastq.gz \
        ILLUMINACLIP:${ADAPTERS}:2:30:10 \
        SLIDINGWINDOW:4:25 MINLEN:45
done

FastQC Trimmed

After trimming, this script re-runs FastQC on the trimmed FASTQ files. It ensures that adapter removal and quality trimming were successful and provides updated quality metrics for the processed reads.

#!/bin/bash
#SBATCH --job-name=fastqc_trimmed
#SBATCH -n 1
#SBATCH -N 1
#SBATCH -c 4
#SBATCH --mem=10G
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH --mail-type=ALL
#SBATCH --mail-user=stephanie.ballas@uconn.edu
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err

# Load FastQC
module load fastqc/0.12.1

# Directories
TRIMDIR=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/results/trimmed_fastq
OUTDIR=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/results/fastqc_trimmed
mkdir -p ${OUTDIR}

# Run FastQC on all trimmed paired-end files
fastqc -t 4 -o ${OUTDIR} ${TRIMDIR}/*_trim_[12].fastq.gz

MultiQC Trimmed

This script aggregates the post-trimming FastQC reports into a unified MultiQC summary. It provides a comprehensive view of quality improvements and any remaining issues after read trimming.

#!/bin/bash
#SBATCH --job-name=multiqc_trimmed
#SBATCH -n 1
#SBATCH -N 1
#SBATCH -c 4
#SBATCH --mem=10G
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH --mail-type=ALL
#SBATCH --mail-user=stephanie.ballas@uconn.edu
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err

# Load MultiQC
module load MultiQC/1.15

# Directories
FASTQCDIR=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/results/fastqc_trimmed
OUTDIR=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/results/multiqc_trimmed

# Run MultiQC to summarize all FastQC reports
multiqc ${FASTQCDIR} -o ${OUTDIR}
MultiQC Results

Index

The indexing script builds a HISAT2 genome index from the goldfish reference FASTA. The indexed genome enables efficient alignment of RNA-seq reads in subsequent steps.

#!/bin/bash
#SBATCH --job-name=hisat2_index
#SBATCH -n 1
#SBATCH -N 1
#SBATCH -c 16
#SBATCH --mem=10G
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH --mail-type=ALL
#SBATCH --mail-user=stephanie.ballas@uconn.edu
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err

hostname
date

#################################################################
# Index the Genome
#################################################################

# Load HISAT2
module load hisat2/2.2.1

# Define absolute input/output directories
OUTDIR=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/genome/hisat2_index
mkdir -p $OUTDIR

GENOME=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/data/genome/Carassius_auratus.ASM336829v1.dna_sm.toplevel.fa

# Build the HISAT2 index
hisat2-build -p 16 $GENOME $OUTDIR/Fhet

Alignment

This script aligns trimmed RNA-seq reads to the indexed goldfish genome using HISAT2. Each sample is processed as a SLURM array job, producing sorted and indexed BAM files suitable for downstream analysis.

#!/bin/bash
#SBATCH --job-name=align
#SBATCH -n 1
#SBATCH -N 1
#SBATCH -c 3
#SBATCH --mem=15G
#SBATCH --partition=xeon
#SBATCH --qos=general
#SBATCH --mail-type=ALL
#SBATCH --mail-user=stephanie.ballas@uconn.edu
#SBATCH -o /home/FCAM/sballas/Goldfish_Deltamethrin_Project/logs/%x_%A_%a.out
#SBATCH -e /home/FCAM/sballas/Goldfish_Deltamethrin_Project/logs/%x_%A_%a.err
#SBATCH --array=1-9%3   # Adjust to the number of samples, max concurrent jobs 3

# Print hostname and date
hostname
date

#################################################################
# Align reads to genome (HISAT2)
#################################################################

module load hisat2/2.2.1
module load samtools/1.16.1

# Absolute paths
FASTQ_DIR=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/results/trimmed_fastq
OUTDIR=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/results/align
mkdir -p ${OUTDIR}

GENOME_DIR=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/data/genome
HISAT2_INDEX=${GENOME_DIR}/Carassius_auratus_index  # We'll build index below if needed

ACCLIST=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/metadata/accessionlist.txt

# Build HISAT2 index if it doesn't exist
if [ ! -f "${HISAT2_INDEX}.1.ht2" ]; then
    hisat2-build ${GENOME_DIR}/Carassius_auratus.ASM336829v1.dna_sm.toplevel.fa ${HISAT2_INDEX}
fi

# Get sample for this array task
SAMPLE=$(sed -n ${SLURM_ARRAY_TASK_ID}p ${ACCLIST})

# Run HISAT2 alignment and pipe to sorted BAM
hisat2 \
    -p 3 \
    -x ${HISAT2_INDEX} \
    -1 ${FASTQ_DIR}/${SAMPLE}_trim_1.fastq.gz \
    -2 ${FASTQ_DIR}/${SAMPLE}_trim_2.fastq.gz | \
samtools sort -@ 3 -T ${OUTDIR}/${SAMPLE} -o ${OUTDIR}/${SAMPLE}.bam

# Index BAM file
samtools index ${OUTDIR}/${SAMPLE}.bam

Samtools Stats

Samtools is used to calculate alignment statistics for each BAM file. The script runs in parallel across samples, generates per-sample metrics, and compiles a summary table of key alignment statistics, providing an overview of mapping quality and coverage.


#!/bin/bash
#SBATCH --job-name=samstats
#SBATCH --mail-user=stephanie.ballas@uconn.edu
#SBATCH --mail-type=ALL
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=10
#SBATCH --mem=15G
#SBATCH --qos=general
#SBATCH --partition=general

hostname
date

##################################
# Calculate alignment statistics using Samtools
##################################

# Load software
module load samtools/1.16.1
module load parallel/20180122

# Define directories
INDIR=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/results/align
OUTDIR=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/results/samstats
mkdir -p $OUTDIR

# Accession list
ACCLIST=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/metadata/accessionlist.txt

# Run samtools stats for each BAM file
cat $ACCLIST | parallel -j 10 \
    "samtools stats $INDIR/{}.bam > $OUTDIR/{}.stats"

# Combine key stats into one table
FILES=($(find $OUTDIR -name "*.stats" | sort))

# Initialize summary table with metric names (SN lines)
grep "^SN" ${FILES[0]} | cut -f 2 > $OUTDIR/SN.txt

# Loop over each stats file and append columns
for file in ${FILES[@]}; do
    paste $OUTDIR/SN.txt <(grep ^SN $file | cut -f 3) > $OUTDIR/SN2.txt && \
    mv $OUTDIR/SN2.txt $OUTDIR/SN.txt
    echo "Processed $file"
done

# Add header with sample names
cat \
<(echo ${FILES[@]} | sed 's,.*/,,g' | sed 's/.stats//g' | tr ' ' '\t') \
$OUTDIR/SN.txt \
> $OUTDIR/Samtools_Summary.txt

Qualimap

Qualimap is employed to evaluate RNA-seq alignment quality against the reference genome and GTF annotation. Running in parallel across samples, it generates detailed reports on coverage, gene body distribution, and mapping performance.

#!/bin/bash
#SBATCH --job-name=qualimap
#SBATCH --mail-user=stephanie.ballas@uconn.edu
#SBATCH --mail-type=ALL
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=5
#SBATCH --mem=10G
#SBATCH --qos=general
#SBATCH --partition=general

hostname
date

##################################
# Calculate stats on alignments using Qualimap
##################################

# Load required software
module load qualimap/2.2.1
module load parallel/20180122

# Define directories
INDIR=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/results/align
OUTDIR=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/results/qualimap
mkdir -p $OUTDIR

# GTF annotation file
GTF=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/data/genome/Carassius_auratus.ASM336829v1.115.gtf

# Accession list
ACCLIST=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/metadata/accessionlist.txt

# Run Qualimap in parallel
cat $ACCLIST | \
parallel -j 5 \
    qualimap rnaseq \
        -bam $INDIR/{}.bam \
        -gtf $GTF \
        -outdir $OUTDIR/{} \
        --java-mem-size=2G

MultiQC

This MultiQC script consolidates both Samtools and Qualimap outputs into integrated reports. It allows for a high-level assessment of read alignment quality across all samples.

#!/bin/bash
#SBATCH --job-name=multiqc
#SBATCH -n 1
#SBATCH -N 1
#SBATCH -c 4
#SBATCH --mem=5G
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH --mail-type=ALL
#SBATCH --mail-user=stephanie.ballas@uconn.edu
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err

hostname
date

#################################################################
# Aggregate reports using MultiQC
#################################################################

module load MultiQC/1.9

# Define input and output directories
STATS=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/results/samtools_stats
QUALIMAP=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/results/qualimap
OUTDIR=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/results/multiqc

# Create output directories if needed
mkdir -p ${OUTDIR}/samtools_stats
mkdir -p ${OUTDIR}/qualimap

# Run MultiQC on Samtools stats output
multiqc -f -o ${OUTDIR}/samtools_stats ${STATS}

# Run MultiQC on Qualimap output
multiqc -f -o ${OUTDIR}/qualimap ${QUALIMAP}

Counts

HTSeq-count is used to quantify gene-level expression from aligned BAM files. The script processes each sample using the GTF annotation to generate raw read counts per gene, producing files suitable for downstream differential expression analysis in R.

#SBATCH --job-name=htseq_count
#SBATCH -n 1
#SBATCH -N 1
#SBATCH -c 4
#SBATCH --mem=20G
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH --mail-type=ALL
#SBATCH --mail-user=stephanie.ballas@uconn.edu
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err

echo `hostname`
date

#################################################################
# Generate Counts
#################################################################
module load htseq/0.13.5

# Directories
INDIR=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/results/align
OUTDIR=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/results/counts
mkdir -p $OUTDIR

# Accession list
ACCLIST=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/metadata/accessionlist.txt

# GTF annotation file
GTF=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/data/genome/Carassius_auratus.ASM336829v1.115.gtf

echo "Starting HTSeq-count..."

while read SAMPLE; do
    echo "Processing $SAMPLE..."
    htseq-count \
        -s no \
        -r pos \
        -f bam \
        $INDIR/${SAMPLE}.bam \
        $GTF \
        > $OUTDIR/${SAMPLE}.counts
done < $ACCLIST

echo "Done!"
date

Read in the Data to R

This section performs a basic differential expression analysis in R using the DESeq2 package. The scripe loads data into R and prepares RNA-seq count data in order to run the first statistical analysis.

library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.5.2
Warning: package 'tidyr' was built under R version 4.5.2
Warning: package 'readr' was built under R version 4.5.2
Warning: package 'purrr' was built under R version 4.5.2
Warning: package 'stringr' was built under R version 4.5.2
Warning: package 'forcats' was built under R version 4.5.2
Warning: package 'lubridate' was built under R version 4.5.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.2.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(DESeq2)
Loading required package: S4Vectors
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: generics
Warning: package 'generics' was built under R version 4.5.2

Attaching package: 'generics'

The following object is masked from 'package:lubridate':

    as.difftime

The following object is masked from 'package:dplyr':

    explain

The following objects are masked from 'package:base':

    as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
    setequal, union


Attaching package: 'BiocGenerics'

The following object is masked from 'package:dplyr':

    combine

The following objects are masked from 'package:stats':

    IQR, mad, sd, var, xtabs

The following objects are masked from 'package:base':

    anyDuplicated, aperm, append, as.data.frame, basename, cbind,
    colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
    get, grep, grepl, is.unsorted, lapply, Map, mapply, match, mget,
    order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
    rbind, Reduce, rownames, sapply, saveRDS, table, tapply, unique,
    unsplit, which.max, which.min


Attaching package: 'S4Vectors'

The following objects are masked from 'package:lubridate':

    second, second<-

The following objects are masked from 'package:dplyr':

    first, rename

The following object is masked from 'package:tidyr':

    expand

The following object is masked from 'package:utils':

    findMatches

The following objects are masked from 'package:base':

    expand.grid, I, unname

Loading required package: IRanges

Attaching package: 'IRanges'

The following object is masked from 'package:lubridate':

    %within%

The following objects are masked from 'package:dplyr':

    collapse, desc, slice

The following object is masked from 'package:purrr':

    reduce

The following object is masked from 'package:grDevices':

    windows

Loading required package: GenomicRanges
Loading required package: Seqinfo
Loading required package: SummarizedExperiment
Loading required package: MatrixGenerics
Loading required package: matrixStats
Warning: package 'matrixStats' was built under R version 4.5.2

Attaching package: 'matrixStats'

The following object is masked from 'package:dplyr':

    count


Attaching package: 'MatrixGenerics'

The following objects are masked from 'package:matrixStats':

    colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
    colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
    colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
    colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
    colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
    colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
    colWeightedMeans, colWeightedMedians, colWeightedSds,
    colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
    rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
    rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
    rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
    rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
    rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
    rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
    rowWeightedSds, rowWeightedVars

Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.


Attaching package: 'Biobase'

The following object is masked from 'package:MatrixGenerics':

    rowMedians

The following objects are masked from 'package:matrixStats':

    anyMissing, rowMedians
# Read in data
meta <- read.csv("C:/Users/steph/OneDrive/Documents/Goldfish_Deltamethrin_Project/metadata/SraRunTable.csv") %>%
  select(Run, dose, fileName)

# Directory with counts
directory <- "C:/Users/steph/OneDrive/Documents/Goldfish_Deltamethrin_Project/results/04_counts"

# Build sample table directly from metadata
sampleTable <- data.frame(
  sampleName = meta$Run,
  fileName = meta$fileName,
  dose = meta$dose
)
sampleTable
   sampleName           fileName          dose
1 SRR14783835 SRR14783835.counts high exposure
2 SRR14783836 SRR14783836.counts high exposure
3 SRR14783837 SRR14783837.counts high exposure
4 SRR14783838 SRR14783838.counts  low exposure
5 SRR14783839 SRR14783839.counts  low exposure
6 SRR14783840 SRR14783840.counts  low exposure
7 SRR14783841 SRR14783841.counts       control
8 SRR14783842 SRR14783842.counts       control
9 SRR14783843 SRR14783843.counts       control
# Build DESeq dataset
ddsHTSeq <- DESeqDataSetFromHTSeqCount(
  sampleTable = sampleTable,
  directory = directory,
  design = ~ dose
)
Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
design formula are characters, converting to factors
  Note: levels of factors in the design contain characters other than
  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters
  for column names in R. [This is a message, not a warning or an error]
head(counts(ddsHTSeq))
                   SRR14783835 SRR14783836 SRR14783837 SRR14783838 SRR14783839
ENSCARG00000000002           0           0           0           0           0
ENSCARG00000000003         969         443         517         339         265
ENSCARG00000000004           0           0           0           0           0
ENSCARG00000000005        3721        3492        3036        2892        2695
ENSCARG00000000006           0           0           0           0           0
ENSCARG00000000007           9           6           4           2           2
                   SRR14783840 SRR14783841 SRR14783842 SRR14783843
ENSCARG00000000002           0           0           0           0
ENSCARG00000000003         312         372         744        1003
ENSCARG00000000004           0           0           0           0
ENSCARG00000000005        1744        2763        3111        4713
ENSCARG00000000006           0           0           0           0
ENSCARG00000000007           1           9           5       12481
dds <- DESeq(ddsHTSeq)
estimating size factors
  Note: levels of factors in the design contain characters other than
  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters
  for column names in R. [This is a message, not a warning or an error]
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
  Note: levels of factors in the design contain characters other than
  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters
  for column names in R. [This is a message, not a warning or an error]
final dispersion estimates
fitting model and testing
res <- results(dds, contrast = c("dose", "high exposure", "control"))

summary(res)

out of 51556 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)       : 531, 1%
LFC < 0 (down)     : 622, 1.2%
outliers [1]       : 419, 0.81%
low counts [2]     : 15814, 31%
(mean count < 6)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

Differential Expression Analysis

Differential expression analysis was performed using DESeq2 to evaluate the transcriptional impact of deltamethrin (DM) exposure in goldfish kidney tissue. First, sample metadata was linked to HTSeq count files, and a DESeqDataSet was created with treatment dose as the design factor. DESeq2 normalized the counts, estimated dispersions, and performed statistical testing for differential expression. Comparisons included high-dose (2 μg/L) versus control, low-dose (0.2 μg/L) versus control, and high versus low doses. Variance-stabilizing transformation (VST) was applied to the count data, and the top 500 most variable genes were selected for heatmap visualization. PCA analysis demonstrated clear clustering by treatment without outliers, while heatmap clustering was less distinct due to gene-specific variability. Overall, the analysis identified genes and pathways that responded in a dose-dependent manner to DM exposure, providing a foundation for functional interpretation and comparison to the original publication.

# Load required packages
library(tidyverse)
library(DESeq2)
library(ggrepel)
Warning: package 'ggrepel' was built under R version 4.5.2
library(pheatmap)
Warning: package 'pheatmap' was built under R version 4.5.2
# read in metadata
meta <- read.csv("C:/Users/steph/OneDrive/Documents/Goldfish_Deltamethrin_Project/metadata/SraRunTable.csv") %>%
  select(Run, dose, fileName)

# sample table creation
directory <- "C:/Users/steph/OneDrive/Documents/Goldfish_Deltamethrin_Project/results/04_counts"

sampleTable <- data.frame(
  sampleName = meta$Run,
  fileName   = meta$fileName,
  dose       = meta$dose
)

sampleTable
   sampleName           fileName          dose
1 SRR14783835 SRR14783835.counts high exposure
2 SRR14783836 SRR14783836.counts high exposure
3 SRR14783837 SRR14783837.counts high exposure
4 SRR14783838 SRR14783838.counts  low exposure
5 SRR14783839 SRR14783839.counts  low exposure
6 SRR14783840 SRR14783840.counts  low exposure
7 SRR14783841 SRR14783841.counts       control
8 SRR14783842 SRR14783842.counts       control
9 SRR14783843 SRR14783843.counts       control
# DESeq2 Data Set
ddsHTSeq <- DESeqDataSetFromHTSeqCount(
  sampleTable = sampleTable,
  directory   = directory,
  design      = ~ dose
)
Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
design formula are characters, converting to factors
  Note: levels of factors in the design contain characters other than
  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters
  for column names in R. [This is a message, not a warning or an error]
head(counts(ddsHTSeq))
                   SRR14783835 SRR14783836 SRR14783837 SRR14783838 SRR14783839
ENSCARG00000000002           0           0           0           0           0
ENSCARG00000000003         969         443         517         339         265
ENSCARG00000000004           0           0           0           0           0
ENSCARG00000000005        3721        3492        3036        2892        2695
ENSCARG00000000006           0           0           0           0           0
ENSCARG00000000007           9           6           4           2           2
                   SRR14783840 SRR14783841 SRR14783842 SRR14783843
ENSCARG00000000002           0           0           0           0
ENSCARG00000000003         312         372         744        1003
ENSCARG00000000004           0           0           0           0
ENSCARG00000000005        1744        2763        3111        4713
ENSCARG00000000006           0           0           0           0
ENSCARG00000000007           1           9           5       12481
dds <- DESeq(ddsHTSeq)
estimating size factors
  Note: levels of factors in the design contain characters other than
  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters
  for column names in R. [This is a message, not a warning or an error]
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
  Note: levels of factors in the design contain characters other than
  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters
  for column names in R. [This is a message, not a warning or an error]
final dispersion estimates
fitting model and testing
# DIFFERENTIAL EXPRESSION: High Exposure vs Control

res_high <- results(dds, contrast = c("dose", "high exposure", "control"))

summary(res_high)

out of 51556 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)       : 531, 1%
LFC < 0 (down)     : 622, 1.2%
outliers [1]       : 419, 0.81%
low counts [2]     : 15814, 31%
(mean count < 6)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
# DIFFERENTIAL EXPRESSION: Low Exposure vs Control

res_low <- results(dds, contrast = c("dose", "low exposure", "control"))

summary (res_low)

out of 51556 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)       : 173, 0.34%
LFC < 0 (down)     : 123, 0.24%
outliers [1]       : 419, 0.81%
low counts [2]     : 12854, 25%
(mean count < 3)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
plotDispEsts(dds)

res_high <- results(dds, name = "dose_high.exposure_vs_control")
head(res_high)
log2 fold change (MLE): dose high.exposure vs control 
Wald test p-value: dose high.exposure vs control 
DataFrame with 6 rows and 6 columns
                    baseMean log2FoldChange     lfcSE       stat    pvalue
                   <numeric>      <numeric> <numeric>  <numeric> <numeric>
ENSCARG00000000002     0.000             NA        NA         NA        NA
ENSCARG00000000003   534.941     -0.0797024  0.450349 -0.1769790  0.859525
ENSCARG00000000004     0.000             NA        NA         NA        NA
ENSCARG00000000005  3071.952      0.0233631  0.336682  0.0693923  0.944677
ENSCARG00000000006     0.000             NA        NA         NA        NA
ENSCARG00000000007  1289.695     -9.2725840  2.010418 -4.6122670        NA
                        padj
                   <numeric>
ENSCARG00000000002        NA
ENSCARG00000000003  0.977899
ENSCARG00000000004        NA
ENSCARG00000000005  0.991337
ENSCARG00000000006        NA
ENSCARG00000000007        NA
res_low <- results(dds, name = "dose_low.exposure_vs_control")
head(res_low)
log2 fold change (MLE): dose low.exposure vs control 
Wald test p-value: dose low.exposure vs control 
DataFrame with 6 rows and 6 columns
                    baseMean log2FoldChange     lfcSE      stat    pvalue
                   <numeric>      <numeric> <numeric> <numeric> <numeric>
ENSCARG00000000002     0.000             NA        NA        NA        NA
ENSCARG00000000003   534.941      -1.052080  0.451695  -2.32918 0.0198494
ENSCARG00000000004     0.000             NA        NA        NA        NA
ENSCARG00000000005  3071.952      -0.357187  0.336804  -1.06052 0.2889093
ENSCARG00000000006     0.000             NA        NA        NA        NA
ENSCARG00000000007  1289.695     -11.101779  2.085484  -5.32336        NA
                        padj
                   <numeric>
ENSCARG00000000002        NA
ENSCARG00000000003  0.509289
ENSCARG00000000004        NA
ENSCARG00000000005  0.999619
ENSCARG00000000006        NA
ENSCARG00000000007        NA
gene_ids_high <- rownames(res_high)
gene_ids_low  <- rownames(res_low)
library(pheatmap)
library(DESeq2)
library(matrixStats)

# Variance-stabilizing transform (if not already done)
vsd <- vst(dds)

# Select top 500 most variable genes
topVarGenes <- head(order(rowVars(assay(vsd)), decreasing = TRUE), 500)
vsd_top <- assay(vsd)[topVarGenes, ]

# Scale rows (genes) to mean=0, sd=1 for better visualization
vsd_scaled <- t(scale(t(vsd_top)))

# Optional: create annotation for columns (samples)
annotation_col <- data.frame(
  Dose = colData(vsd)$dose
)
rownames(annotation_col) <- colnames(vsd)

# Plot heatmap
pheatmap(
  vsd_scaled,
  annotation_col = annotation_col,
  show_rownames = FALSE,
  cluster_rows = TRUE,
  cluster_cols = TRUE,
  color = colorRampPalette(c("navy", "white", "firebrick3"))(50),
  fontsize_col = 10,
  main = "Top 500 Most Variable Genes"
)

library(DESeq2)
library(ggplot2)
library(ggrepel)
library(matrixStats)

# Variance-stabilizing transform
vsd <- vst(dds)

# Use top 500 most variable genes 
topVarGenes <- head(order(rowVars(assay(vsd)), decreasing = TRUE), 500)
vsd_top <- vsd[topVarGenes, ]

# Extract PCA data
pca_dat <- plotPCA(vsd_top, returnData = TRUE, intgroup = c("dose"))
using ntop=500 top features by variance
# Plot
p <- ggplot(pca_dat, aes(x = PC1, y = PC2, color = dose)) +
  geom_point(size = 4) +
  geom_label_repel(aes(label = name), size = 3) +
  xlab(paste0("PC1: ", round(attr(pca_dat, "percentVar")[1] * 100, 1), "% variance")) +
  ylab(paste0("PC2: ", round(attr(pca_dat, "percentVar")[2] * 100, 1), "% variance")) +
  theme_minimal()

p

Functional Interpretation

library(biomaRt)
Warning: package 'biomaRt' was built under R version 4.5.2
# Use the Ensembl archive host
ensemblhost <- "https://may2024.archive.ensembl.org"

# Dataset name for goldfish
goldfish_dataset <- "cauratus_gene_ensembl"

# Create mart object
goldfish_mart <- useMart(
  biomart = "ENSEMBL_MART_ENSEMBL",
  dataset = goldfish_dataset,
  host = ensemblhost
)

# Load libraries
library(clusterProfiler)
clusterProfiler v4.18.2 Learn more at https://yulab-smu.top/contribution-knowledge-mining/

Please cite:

Guangchuang Yu, Li-Gen Wang, Yanyan Han and Qing-Yu He.
clusterProfiler: an R package for comparing biological themes among
gene clusters. OMICS: A Journal of Integrative Biology. 2012,
16(5):284-287

Attaching package: 'clusterProfiler'
The following object is masked from 'package:biomaRt':

    select
The following object is masked from 'package:IRanges':

    slice
The following object is masked from 'package:S4Vectors':

    rename
The following object is masked from 'package:purrr':

    simplify
The following object is masked from 'package:stats':

    filter
library(enrichplot)
enrichplot v1.30.4 Learn more at https://yulab-smu.top/contribution-knowledge-mining/

Please cite:

Guangchuang Yu, Li-Gen Wang, Guang-Rong Yan, Qing-Yu He. DOSE: an
R/Bioconductor package for Disease Ontology Semantic and Enrichment
analysis. Bioinformatics. 2015, 31(4):608-609
library(dplyr)
library(biomaRt)

######## Pull GO annotations from BioMart ########

# High exposure GO annotations
go_ann_high <- getBM(
  filter = "ensembl_gene_id",
  value = gene_ids_high,
  attributes = c("ensembl_gene_id", "go_id", "name_1006", "namespace_1003"),
  mart = goldfish_mart
) %>% 
  filter(go_id != "")   # remove rows without GO terms

# Low exposure GO annotations
go_ann_low <- getBM(
  filter = "ensembl_gene_id",
  value = gene_ids_low,
  attributes = c("ensembl_gene_id", "go_id", "name_1006", "namespace_1003"),
  mart = goldfish_mart
) %>% 
  filter(go_id != "")

######### Prepare unique GO term table ########

gonames_high <- unique(go_ann_high[, c("go_id", "name_1006")])
gonames_low  <- unique(go_ann_low[, c("go_id", "name_1006")])

######## Define DE genes and universe ########

# High exposure
genes_high <- rownames(res_high[which(res_high$padj < 0.2), ])
univ_high  <- rownames(res_high[!is.na(res_high$padj), ])

# Low exposure
genes_low <- rownames(res_low[which(res_low$padj < 0.2), ])
univ_low  <- rownames(res_low[!is.na(res_low$padj), ])

######## Run over-representation analysis ########

# High exposure ORA
kcdose_enrich_high <- enricher(
  gene = genes_high,
  universe = univ_high,
  TERM2GENE = go_ann_high[, c("go_id", "ensembl_gene_id")],
  TERM2NAME = gonames_high
)

# Low exposure ORA
kcdose_enrich_low <- enricher(
  gene = genes_low,
  universe = univ_low,
  TERM2GENE = go_ann_low[, c("go_id", "ensembl_gene_id")],
  TERM2NAME = gonames_low
)

######## Convert to data frames for inspection ########
enrich_df_high <- as.data.frame(kcdose_enrich_high)
enrich_df_low  <- as.data.frame(kcdose_enrich_low)

# Inspect top 6 enriched GO terms
head(enrich_df_high)
                   ID                              Description GeneRatio
GO:0030150 GO:0030150 protein import into mitochondrial matrix    9/1697
GO:0006457 GO:0006457                          protein folding   26/1697
GO:0006364 GO:0006364                          rRNA processing   15/1697
GO:0005730 GO:0005730                                nucleolus    9/1697
GO:0140662 GO:0140662  ATP-dependent protein folding chaperone   15/1697
GO:0019843 GO:0019843                             rRNA binding    8/1697
             BgRatio RichFactor FoldEnrichment   zScore       pvalue
GO:0030150  18/25478  0.5000000       7.506777 7.376884 7.080653e-07
GO:0006457 139/25478  0.1870504       2.808291 5.710589 1.482336e-06
GO:0006364  61/25478  0.2459016       3.691857 5.622828 8.060261e-06
GO:0005730  23/25478  0.3913043       5.874869 6.247990 8.769946e-06
GO:0140662  65/25478  0.2307692       3.464666 5.314799 1.841148e-05
GO:0019843  21/25478  0.3809524       5.719449 5.779594 3.536876e-05
               p.adjust       qvalue
GO:0030150 0.0003979327 0.0003667033
GO:0006457 0.0004165365 0.0003838471
GO:0006364 0.0012321774 0.0011354773
GO:0005730 0.0012321774 0.0011354773
GO:0140662 0.0020694501 0.0019070415
GO:0019843 0.0033128742 0.0030528828
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  geneID
GO:0030150                                                                                                                                                                                                                                                                                                                                    ENSCARG00000005809/ENSCARG00000007422/ENSCARG00000008998/ENSCARG00000017108/ENSCARG00000017811/ENSCARG00000030512/ENSCARG00000053129/ENSCARG00000053910/ENSCARG00000061945
GO:0006457 ENSCARG00000001396/ENSCARG00000002333/ENSCARG00000006436/ENSCARG00000007967/ENSCARG00000009008/ENSCARG00000010109/ENSCARG00000010236/ENSCARG00000012578/ENSCARG00000013517/ENSCARG00000015467/ENSCARG00000016248/ENSCARG00000019003/ENSCARG00000020750/ENSCARG00000022668/ENSCARG00000023453/ENSCARG00000023609/ENSCARG00000023635/ENSCARG00000026999/ENSCARG00000039930/ENSCARG00000049729/ENSCARG00000052496/ENSCARG00000052635/ENSCARG00000060825/ENSCARG00000065920/ENSCARG00000068371/ENSCARG00000069141
GO:0006364                                                                                                                                                                                                                  ENSCARG00000002386/ENSCARG00000003325/ENSCARG00000006035/ENSCARG00000006922/ENSCARG00000010084/ENSCARG00000010917/ENSCARG00000011697/ENSCARG00000011903/ENSCARG00000014109/ENSCARG00000018073/ENSCARG00000018154/ENSCARG00000024712/ENSCARG00000030494/ENSCARG00000034601/ENSCARG00000045672
GO:0005730                                                                                                                                                                                                                                                                                                                                    ENSCARG00000007643/ENSCARG00000014249/ENSCARG00000017763/ENSCARG00000018707/ENSCARG00000019018/ENSCARG00000024950/ENSCARG00000052698/ENSCARG00000063661/ENSCARG00000065781
GO:0140662                                                                                                                                                                                                                  ENSCARG00000001396/ENSCARG00000007967/ENSCARG00000009283/ENSCARG00000012578/ENSCARG00000014253/ENSCARG00000018490/ENSCARG00000019003/ENSCARG00000022508/ENSCARG00000023453/ENSCARG00000023635/ENSCARG00000026999/ENSCARG00000036190/ENSCARG00000039930/ENSCARG00000052496/ENSCARG00000068371
GO:0019843                                                                                                                                                                                                                                                                                                                                                       ENSCARG00000002761/ENSCARG00000010917/ENSCARG00000011903/ENSCARG00000027860/ENSCARG00000034601/ENSCARG00000039744/ENSCARG00000045672/ENSCARG00000047140
           Count
GO:0030150     9
GO:0006457    26
GO:0006364    15
GO:0005730     9
GO:0140662    15
GO:0019843     8
head(enrich_df_low)
                   ID                                               Description
GO:0005667 GO:0005667                           transcription regulator complex
GO:0042555 GO:0042555                                               MCM complex
GO:0032508 GO:0032508                                      DNA duplex unwinding
GO:0006270 GO:0006270                                DNA replication initiation
GO:0046983 GO:0046983                             protein dimerization activity
GO:0000122 GO:0000122 negative regulation of transcription by RNA polymerase II
           GeneRatio   BgRatio RichFactor FoldEnrichment   zScore       pvalue
GO:0005667    10/373  91/27245 0.10989011       8.026692 7.910329 4.926321e-07
GO:0042555     4/373  17/27245 0.23529412      17.186564 7.865224 7.144912e-05
GO:0032508     4/373  22/27245 0.18181818      13.280526 6.788912 2.080316e-04
GO:0006270     4/373  23/27245 0.17391304      12.703112 6.615233 2.491202e-04
GO:0046983    12/373 281/27245 0.04270463       3.119270 4.207141 5.407630e-04
GO:0000122     4/373  31/27245 0.12903226       9.424890 5.529537 8.120220e-04
               p.adjust       qvalue
GO:0005667 0.0001261138 0.0001151203
GO:0042555 0.0091454868 0.0083482650
GO:0032508 0.0159436902 0.0145538620
GO:0006270 0.0159436902 0.0145538620
GO:0046983 0.0276870639 0.0252735534
GO:0000122 0.0346462732 0.0316261211
                                                                                                                                                                                                                                        geneID
GO:0005667                                       ENSCARG00000003598/ENSCARG00000019497/ENSCARG00000020564/ENSCARG00000021883/ENSCARG00000031187/ENSCARG00000046635/ENSCARG00000055717/ENSCARG00000058081/ENSCARG00000064397/ENSCARG00000070014
GO:0042555                                                                                                                                                         ENSCARG00000007468/ENSCARG00000011956/ENSCARG00000034639/ENSCARG00000049591
GO:0032508                                                                                                                                                         ENSCARG00000007468/ENSCARG00000011956/ENSCARG00000034639/ENSCARG00000049591
GO:0006270                                                                                                                                                         ENSCARG00000007468/ENSCARG00000011956/ENSCARG00000034639/ENSCARG00000049591
GO:0046983 ENSCARG00000019325/ENSCARG00000019497/ENSCARG00000020564/ENSCARG00000031187/ENSCARG00000031303/ENSCARG00000038860/ENSCARG00000044627/ENSCARG00000046635/ENSCARG00000055717/ENSCARG00000058081/ENSCARG00000063633/ENSCARG00000070014
GO:0000122                                                                                                                                                         ENSCARG00000001711/ENSCARG00000038860/ENSCARG00000044627/ENSCARG00000063633
           Count
GO:0005667    10
GO:0042555     4
GO:0032508     4
GO:0006270     4
GO:0046983    12
GO:0000122     4
library(enrichplot)
library(ggplot2)

# Dotplots

dotplot(kcdose_enrich_high, showCategory = 10, title = "Top Enriched GO Terms - High Exposure ORA")

dotplot(kcdose_enrich_low,  showCategory = 10, title = "Top Enriched GO Terms - Low Exposure ORA")

# Gene Concept Network
# Named vector of log2 fold changes for foldChange
l2fcvec_high <- res_high$log2FoldChange
names(l2fcvec_high) <- rownames(res_high)

l2fcvec_low <- res_low$log2FoldChange
names(l2fcvec_low) <- rownames(res_low)

# ORA gene-concept network
cnetplot(kcdose_enrich_high, showCategory = 10, foldChange = l2fcvec_high, cex_label_gene = 0.5)
Warning: ggrepel: 15 unlabeled data points (too many overlaps). Consider
increasing max.overlaps

cnetplot(kcdose_enrich_low,  showCategory = 10, foldChange = l2fcvec_low,  cex_label_gene = 0.5)

References

Blanco, Ayelén Melisa, Lakshminarasimhan Sundarrajan, Juan Ignacio Bertucci, and Suraj Unniappan. 2018. “Why Goldfish? Merits and Challenges in Employing Goldfish as a Model Organism in Comparative Endocrinology Research.” General and Comparative Endocrinology 257: 13–28. https://doi.org/https://doi.org/10.1016/j.ygcen.2017.02.001.
Guardiola, F. A., P. Gónzalez-Párraga, J. Meseguer, A. Cuesta, and M. A. Esteban. 2014. “Modulatory Effects of Deltamethrin-Exposure on the Immune Status, Metabolism and Oxidative Stress in Gilthead Seabream (Sparus Aurata l.).” Fish & Shellfish Immunology 36 (1): 120–29. https://doi.org/https://doi.org/10.1016/j.fsi.2013.10.020.
Wu, Yaqin, Wenhua Li, Mingrui Yuan, and Xuan Liu. 2020. “The Synthetic Pyrethroid Deltamethrin Impairs Zebrafish (Danio Rerio) Swim Bladder Development.” Science of The Total Environment 701: 134870. https://doi.org/https://doi.org/10.1016/j.scitotenv.2019.134870.
Zhou, Shun, Jing Dong, Yongtao Liu, Qiuhong Yang, Ning Xu, Yibin Yang, and Xiaohui Ai. 2021. “Effects of Acute Deltamethrin Exposure on Kidney Transcriptome and Intestinal Microbiota in Goldfish (Carassius Auratus).” Ecotoxicology and Environmental Safety 225: 112716. https://doi.org/https://doi.org/10.1016/j.ecoenv.2021.112716.