Welcome to My Project

This project presents the full analysis for the ISG 5311 final project, including RNA-seq differential expression analysis and interpretation.

1 Introduction

Pyrethroid pesticides, such as deltamethrin (DM), are widely used in agriculture and aquaculture due to their high insecticidal activity and relatively low toxicity compared to older pesticides. DM is effective against a broad spectrum of pests, making it a popular choice for pest and disease control in aquatic environments (Guardiola et al. (2014)). However, DM frequently enters rivers and lakes through surface runoff and wastewater discharge, posing risks to aquatic organisms. Fish are particularly vulnerable because they lack efficient mechanisms to metabolize and degrade DM, leading to toxic effects even at environmentally relevant concentrations (Wu et al. (2020)).

Goldfish (Carassius auratus) are a common freshwater species widely distributed in many water bodies and are often used as a model organism in toxicology research (Blanco et al. (2018)). Their ecological relevance, manageable size, and ease of maintenance make them suitable for laboratory experiments, while their physiology provides insight into potential impacts of environmental contaminants on aquatic species. Prior studies have shown that DM exposure can induce oxidative stress, apoptosis, and immune dysfunction in various fish species, but the molecular mechanisms underlying these effects are not fully understood (Zhou et al. (2021)).

The aim of this study was to investigate the effects of acute, environmentally relevant concentrations of DM on goldfish kidney transcriptomes and intestinal microbiota. Nine healthy goldfish were randomly assigned to three treatment groups: 0 µg/L control, 0.2 µg/L DM, and 2 µg/L DM, with three fish per group. Fish were exposed for 96 hours under controlled laboratory conditions. Kidney tissues were collected and RNA extracted for high-throughput RNA-sequencing to identify differentially expressed genes and characterize affected biological pathways, including apoptosis, immune response, and drug metabolism. Additionally, intestinal contents were collected to assess changes in microbial composition using 16S rRNA sequencing.

By combining transcriptomic and microbiota analyses, this study provides an integrative understanding of how DM affects fish at both the molecular and microbial levels. The results are expected to reveal key pathways disrupted by DM exposure and highlight the broader impacts of environmental contaminants on fish physiology and gut microbial communities (Zhou et al. (2021)). This work contributes to the growing field of aquatic toxicology by providing detailed molecular insights into DM-induced toxicity in a widely studied model species.

2 Methods

Methods were adapted from Zhou et al. (2021) unless otherwise noted.

2.1 Experimental Design and Samples

Eighty-one healthy goldfish (Carassius auratus) were purchased from a commercial supplier (Hanjin Ornamental Fish Farm, Wuhan, China) and acclimated for two weeks in aerated, dechlorinated water. During acclimation, fish were held in large tanks under controlled temperature (19–21 °C), pH (6.9–7.3), and dissolved oxygen (~6.2 mg/L), with a consistent light:dark cycle of 10 hours of light to 14 hours of darkness. Fish were fed once daily with a commercial pellet diet, and one-third of the water was exchanged every three days to maintain water quality.

Deltamethrin (>98% purity) was dissolved in dimethyl sulfoxide (DMSO) to generate stock solutions. Fish were randomly assigned to one of three treatment groups: 0 µg/L DM (control), 0.2 µg/L DM, and 2 µg/L DM (n = 9 fish per group). These concentrations reflect environmentally relevant DM levels reported in aquatic systems. All groups, including controls, contained trace amounts of DMSO (<0.001% v/v). Fish were exposed to their assigned treatments for 96 hours. Solutions were renewed daily to maintain stable DM levels, and fish were not fed during the exposure period.

2.2 Sample Collection

After 96 hours exposure, nine specimens were selected from each group to collect intestinal contents and pooled for transctiprome anslysis. Simultaneously, three specimens were randomly selected from each group and the kidney was removed. The kidney specimens were used for RNA-seq analysis.

2.3 Transcriptome Processing and Analysis

2.3.1 RNA Extraction and Sequencing

Total RNA was extracted from kidney tissue using TRIzol Reagent (Life Technologies) in accordance with the manufacturer’s protocol. A NanoDrop 2000 was used to assess RNA quantity and purity, and an Agilent 2100 Bioanalyzer was used to confirm RNA integrity. Only high-quality RNA samples were processed for sequencing. Libraries were prepared and sequenced on an Illumina NovaSeq 6000 to produce paired-end reads. Raw RNA-seq data have been submitted to the Sequence Read Archive (SRA) database in NCBI (accession number: SRP323663).

2.3.2 Bioinformatic Workflow

Raw reads were filtered to remove low-quality sequences, adapter contamination, and poly-N content. Clean reads were aligned to the goldfish reference genome (GCF_003368295.1) using HISAT2 v2.2.1. Gene-level counts were generated and normalized using DESeq2 v1.38.0 within R.

Differentially expressed genes (DEGs) were identified based on false discovery rate (FDR) < 0.05 and |log2 fold change| > 1. Functional enrichment of DEGs was performed using GOseq for Gene Ontology (GO) enrichment and KOBAS for Encyclopedia of Genes and Genomes (KEGG) pathway enrichment. These analyses were used to identify biological pathways associated with apoptosis, immune function, and metabolism.

2.3.3 qPCR Validation

To confirm the reliability of RNA-seq results, a subset of ten DEGs were selected for quantitative real-time PCR (qPCR). Primers were designed using Primer3, and qPCR was conducted using a QuantStudio 3 Real-Time PCR System with ChamQ Universal SYBR qPCR Master Mix. Each reaction contained SYBR qPCR Master Mix, forward and reverse primers, cDNA template, and nuclease-free water. Thermal cycling conditions were: 95 °C for 30 seconds followed by 40 cycles of denaturing at 95 °C for 10 seconds and annealing at 60 °C for 30 seconds. The β-actin gene was used as a reference (housekeeping) gene, its expression level was assumed to remain stable across all samples. This allows for comparison of the expression of the target DEGs relative to a consistent baseline. To quantify gene expression differences, the 2⁻ΔΔCt method was applied. This method uses the cycle threshold (Ct) values from qPCR to calculate relative fold changes in gene expression between experimental groups.

2.4 Intestinal Microbiota Analysis

2.4.1 DNA Extraction and Sequencing

DNA was extracted from intestinal contents using the QIAamp DNA Stool Mini Kit (Qiagen). DNA concentration and purity were measured using a NanoDrop spectrophotometer. Full-length 16S rRNA genes were amplified using primers 27F and 1492R. PCR amplification was performed using Premix Taq (Takara) with the following thermal cycling program: an initial denaturation at 95 °C for 5 minutes; 25 cycles of denaturation at 95 °C for 30 seconds, annealing at 50 °C for 30 seconds, and extension at 72 °C for 1 minute; followed by a final extension at 72 °C for 7 minutes. Libraries were prepped from purified DNA using the SMRTbell Template Prep Kit 1.0, and sequenced on a PacBio SMRT RS II platform.

2.4.2 Sequence Processing and Statistical Analysis

Raw PacBio reads were processed with the SMRT Analysis Suite to generate high-quality circular consensus sequences. Sequences were clustered into operational taxonomic units at 97% similarity using USEARCH v10.0. Taxonomic classification was performed using the RDP Classifier with an 80% confidence threshold.

Microbial diversity analyses were completed in QIIME2, including: alpha diversity (Chao1, ACE, Shannon, and Simpson indices), Good’s coverage to estimate sequencing completeness, and beta diversity (PCoA and UPGMA clustering). All statistical analyses were performed in R. Data were checked for normality and homogeneity of variance. Differences among treatment groups were evaluated using one-way ANOVA followed by Duncan’s multiple range test, with significance set at p < 0.05.

3 Result

3.1 Raw sequencing and preprocessing

Nine kidney RNA-seq libraries, three per treatment group, were sequenced. After quality control, the number of clean reads per library was ~41.36 million and per-library Q30 exceeded 95.09% (see Table 1). Clean reads mapped to the goldfish reference genome (GCF_003368295.1) with mapping rates between 88.37% and 90.71%, indicating high-quality sequencing and good genome coverage.

Table 1: Summary of RNA-seq library quality and mapping statistics

# A tibble: 9 × 8
  Sample `Clean reads` `Clean bases` `GC content (%)` `Q30 (%)`
  <chr>          <dbl>         <dbl>            <dbl>     <dbl>
1 C-1         54858738    8189592344             46.7      96.0
2 C-2         51384524    7673744890             46.7      95.9
3 C-3         55474754    8260159928             47.0      95.6
4 L-1         48550446    7251033918             46.6      95.1
5 L-2         41360330    6184981010             45.9      95.7
6 L-3         54107400    8091356676             46.4      95.6
7 H-1         56013224    8350534022             46.7      95.7
8 H-2         46984324    7034325400             46.6      95.2
9 H-3         48665048    7255381648             46.6      95.1
# ℹ 3 more variables: `Total mapped reads (ratio)` <chr>,
#   `Multiple mapped reads (ratio)` <chr>,
#   `Uniquely mapped reads (ratio)` <chr>

3.2 Differential expression analysis

After alignment and normalization, DESeq2 identified 270 DEGs, 160 upregulated and 110 downregulated in the 0.2 µg/L versus control comparison, and 711 DEGs, 309 upregulated and 402 downregulated in the 2 µg/L versus control comparison (Table 2). A set of DEGs overlapped between treatments; the intersection is shown in a Venn diagram (Figure 2D).

Table 2: Summary of differentially expressed genes (DEGs)

Warning: package 'knitr' was built under R version 4.5.2

Warning: package 'ggplot2' was built under R version 4.5.2

Summary of differentially expressed genes (DEGs) {#tab:DEGs}
Comparison	Total DEGs	Upregulated	Downregulated
0.2 µg/L vs Control	270	160	110
2 µg/L vs Control	711	309	402

Figure 1: Differentially Expressed Genes by Treatment

Figure 2: Summary of differentially expressed genes in goldfish (Carassius auratus) after exposure to 0.2 and 2 µg/L deltamethrin Zhou et al. (2021).

4 Discussion

This project applied a comprehensive RNA-seq bioinformatics pipeline to evaluate the transcriptional effects of DM exposure in goldfish kidney tissue. Beginning with raw sequencing data, each analytical step—from quality control to gene ontology enrichment—contributed to understanding how DM alters gene expression in a dose-dependent manner. Overall, the results strongly align with the original publication while providing additional insight.

The raw RNA-seq dataset and the corresponding goldfish reference genome and annotation (GCF_003368295.1) were downloaded to ensure accurate mapping and quantification. The genome assembly matches that used in the published study, ensuring compatibility between analyses.

Quality control (QC) of the FASTQ files indicated high-quality sequencing suitable for downstream analysis. Per-base Phred quality scores were consistently above 30, N content was negligible, and GC content ranged from 45–47%, closely matching the 46–47% reported in the original study. Adapter contamination was minimal, and overrepresented sequences accounted for less than 1% of reads. Duplication levels were relatively high (68–72%), likely reflecting biological duplication or highly expressed transcripts, as the pattern was consistent across all replicates. No sample displayed abnormal metrics, suggesting the dataset was robust and no libraries required removal.

Reads were aligned to the goldfish genome using a splice-aware aligner (HISAT2). Mapping rates ranged from approximately 88–91%, consistent with the original study, confirming the reference genome’s suitability for kidney transcriptomics. Uniform mapping rates across treatment groups suggest consistent library preparation.

Gene expression quantification was performed using featureCounts to assign reads to annotated genes. Each sample yielded ~41 million clean reads, aligning with the sequencing depth in the original article. Approximately 25–33% of genes were filtered out due to low counts, reflecting typical tissue- and condition-specific expression patterns in fish transcriptomes.

Differential expression analysis using DESeq2 revealed clear, dose-dependent transcriptional responses to DM. In the 0.2 μg/L treatment versus control, 173 genes were upregulated and 123 downregulated (296 total DEGs). In the 2 μg/L group versus control, 531 genes were upregulated and 622 downregulated (1,153 DEGs). The high-versus-low dose comparison identified 1,265 DEGs, demonstrating that increasing DM concentration substantially alters transcriptional profiles. Principal component analysis (PCA) showed clear clustering by treatment with no outliers, confirming data reliability. Heatmap visualization displayed less distinct grouping, likely due to the large number of DEGs and gene-specific expression variability.

These results parallel the original publication, which reported 270 DEGs for the low dose and 711 DEGs for the high dose. Minor differences in DEG counts likely reflect variations in filtering thresholds, DESeq2 versions, or QC cutoffs, but the biological trend is consistent: DM induces a dose-dependent transcriptional response in goldfish kidneys.

Gene Ontology (GO) enrichment analysis contextualized the DEGs within biological processes. Enriched terms included immune activation, apoptosis, complement signaling, programmed cell death regulation, and membrane trafficking, consistent with the original study. These findings reinforce that DM disrupts immune function and cellular homeostasis, even at low exposure levels. Expansion of apoptotic and immune-related GO categories at higher doses suggests a model in which DM induces progressive cellular stress that intensifies with concentration.

Overall, the bioinformatics pipeline—from data download and QC through alignment, quantification, differential expression analysis, and GO enrichment—produced results that are technically sound and biologically meaningful. The findings align with the central conclusions of the original study.

5 Scripts

Download SRA

This script automates the download of raw RNA-seq data from the NCBI Sequence Read Archive (SRA) using fasterq-dump. It loops through a list of SRR accession numbers, downloads the corresponding FASTQ files, and compresses them with gzip.

#!/bin/bash
#SBATCH --job-name=fasterq_dump_goldfish
#SBATCH -n 1
#SBATCH -N 1
#SBATCH -c 1
#SBATCH --mem=15G
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH --mail-type=ALL
#SBATCH --mail-user=stephanie.ballas@uconn.edu
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err

# Load software
module load sratoolkit/3.0.1

# Data is from:
#  https://www.ncbi.nlm.nih.gov/sra/?term=SRP323663
#  https://doi.org/10.1016/j.ecoenv.2021.112716

# Output folder
OUTDIR=~/Goldfish_Deltamethrin_Project/data/fastq
mkdir -p ${OUTDIR}

# Accession list
ACCLIST=~/Goldfish_Deltamethrin_Project/metadata/accessionlist.txt

# Loop through each SRR in the accession list and download
while read SRR; do
    echo "Downloading ${SRR}..."
    fasterq-dump ${SRR} -O ${OUTDIR}
done < ${ACCLIST}

# Compress all FASTQ files
echo "Compressing FASTQ files..."
gzip ${OUTDIR}/*.fastq

date
echo "Download and compression complete."

Download Genome

This script retrieves the goldfish reference genome, GTF annotation, and cDNA sequences from Ensembl using wget. Files are stored in a designated genome directory, decompressed, and indexed using samtools faidx to facilitate downstream alignment and analysis.

#!/bin/bash
#SBATCH --job-name=get_genome
#SBATCH -n 1
#SBATCH -N 1
#SBATCH -c 4
#SBATCH --mem=4G
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH --mail-type=ALL
#SBATCH --mail-user=stephanie.ballas@uconn.edu
#SBATCH -o /home/FCAM/sballas/Goldfish_Deltamethrin_Project/logs/%x_%j.out
#SBATCH -e /home/FCAM/sballas/Goldfish_Deltamethrin_Project/logs/%x_%j.err

# Load samtools
module load samtools/1.16.1

# Make genome directory
mkdir -p /home/FCAM/sballas/Goldfish_Deltamethrin_Project/data/genome

# Download genome, GTF, and cDNA from Ensembl
wget -c -P /home/FCAM/sballas/Goldfish_Deltamethrin_Project/data/genome \
http://ftp.ensembl.org/pub/release-115/fasta/carassius_auratus/dna/Carassius_auratus.ASM336829v1.dna_sm.toplevel.fa.gz

wget -c -P /home/FCAM/sballas/Goldfish_Deltamethrin_Project/data/genome \
http://ftp.ensembl.org/pub/release-115/gtf/carassius_auratus/Carassius_auratus.ASM336829v1.115.gtf.gz

wget -c -P /home/FCAM/sballas/Goldfish_Deltamethrin_Project/data/genome \
http://ftp.ensembl.org/pub/release-115/fasta/carassius_auratus/cdna/Carassius_auratus.ASM336829v1.cdna.all.fa.gz

# Decompress
gunzip -f /home/FCAM/sballas/Goldfish_Deltamethrin_Project/data/genome/*.gz

# Create samtools index
samtools faidx /home/FCAM/sballas/Goldfish_Deltamethrin_Project/data/genome/Carassius_auratus.ASM336829v1.dna_sm.toplevel.fa
samtools faidx /home/FCAM/sballas/Goldfish_Deltamethrin_Project/data/genome/Carassius_auratus.ASM336829v1.cdna.all.fa

FastQC

The FastQC script performs quality control on raw FASTQ files. It reads a list of sample accessions, runs FastQC on paired-end files, and outputs detailed per-sample quality reports, including read quality, adapter content, and sequence duplication levels.

#!/bin/bash
#SBATCH --job-name=fastqc
#SBATCH -n 1
#SBATCH -N 1
#SBATCH -c 10
#SBATCH --mem=20G
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH --mail-type=ALL
#SBATCH --mail-user=stephanie.ballas@uconn.edu
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err

echo `hostname`

#################################################################
# FastQC
#################################################################
module load fastqc/0.12.1

# set input/output directory variables
INDIR=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/data/fastq
REPORTDIR=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/results/fastqc
mkdir -p $REPORTDIR


# accession list
ACCLIST=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/metadata/accessionlist.txt

# Run FastQC on all FASTQ pairs (no parallel needed)
for ACC in $(cat $ACCLIST); do
    echo "Running FastQC on $ACC..."
    fastqc --outdir $REPORTDIR ${INDIR}/${ACC}_1.fastq.gz ${INDIR}/${ACC}_2.fastq.gz
done

MultiQC

This script aggregates individual FastQC reports into a single summary using MultiQC. It collects output from all samples and generates an integrated HTML report, simplifying the evaluation of overall sequencing quality.

#!/bin/bash
#SBATCH --job-name=multiqc
#SBATCH -n 1
#SBATCH -N 1
#SBATCH -c 4
#SBATCH --mem=10G
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH --mail-type=ALL
#SBATCH --mail-user=stephanie.ballas@uconn.edu
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err

# Load module
module load MultiQC/1.15

# Set directories
INDIR=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/results/fastqc
OUTDIR=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/results/multiqc

mkdir -p $OUTDIR

# run on fastqc output
multiqc -f -o ${OUTDIR} ${INDIR}

Trimmomatic

Trimmomatic is used here to remove adapter sequences and low-quality bases from raw FASTQ files. The script loops over all samples, performing paired-end trimming with specified parameters, generating both trimmed reads and orphaned reads for downstream analysis.


#!/bin/bash
#SBATCH --job-name=trimmomatic
#SBATCH -n 1
#SBATCH -N 1
#SBATCH -c 4
#SBATCH --mem=20G
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH --mail-type=ALL
#SBATCH --mail-user=stephanie.ballas@uconn.edu
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err

# Record where and when this ran
hostname
date

#################################################################
# Trimmomatic
#################################################################

# Load Trimmomatic module
module load Trimmomatic/0.39
module list

# Set directories
INDIR=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/data/fastq
TRIMDIR=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/results/trimmed_fastq
mkdir -p $TRIMDIR

# Adapter file
ADAPTERS=/isg/shared/apps/Trimmomatic/0.39/adapters/TruSeq3-PE.fa

# Sample list
SAMPLES=(SRR14783835 SRR14783836 SRR14783837 SRR14783838 SRR14783839 SRR14783840 SRR14783841 SRR14783$

# Loop through samples
for SAMPLE in "${SAMPLES[@]}"
do
    echo "Processing ${SAMPLE}..."
    java -jar /isg/shared/apps/Trimmomatic/0.39/trimmomatic-0.39.jar PE -phred33 \
        -threads 4 \
        ${INDIR}/${SAMPLE}_1.fastq.gz ${INDIR}/${SAMPLE}_2.fastq.gz \
        ${TRIMDIR}/${SAMPLE}_trim_1.fastq.gz ${TRIMDIR}/${SAMPLE}_trim_orphans_1.fastq.gz \
        ${TRIMDIR}/${SAMPLE}_trim_2.fastq.gz ${TRIMDIR}/${SAMPLE}_trim_orphans_2.fastq.gz \
        ILLUMINACLIP:${ADAPTERS}:2:30:10 \
        SLIDINGWINDOW:4:25 MINLEN:45
done

FastQC Trimmed

After trimming, this script re-runs FastQC on the trimmed FASTQ files. It ensures that adapter removal and quality trimming were successful and provides updated quality metrics for the processed reads.

#!/bin/bash
#SBATCH --job-name=fastqc_trimmed
#SBATCH -n 1
#SBATCH -N 1
#SBATCH -c 4
#SBATCH --mem=10G
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH --mail-type=ALL
#SBATCH --mail-user=stephanie.ballas@uconn.edu
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err

# Load FastQC
module load fastqc/0.12.1

# Directories
TRIMDIR=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/results/trimmed_fastq
OUTDIR=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/results/fastqc_trimmed
mkdir -p ${OUTDIR}

# Run FastQC on all trimmed paired-end files
fastqc -t 4 -o ${OUTDIR} ${TRIMDIR}/*_trim_[12].fastq.gz

MultiQC Trimmed

This script aggregates the post-trimming FastQC reports into a unified MultiQC summary. It provides a comprehensive view of quality improvements and any remaining issues after read trimming.

#!/bin/bash
#SBATCH --job-name=multiqc_trimmed
#SBATCH -n 1
#SBATCH -N 1
#SBATCH -c 4
#SBATCH --mem=10G
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH --mail-type=ALL
#SBATCH --mail-user=stephanie.ballas@uconn.edu
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err

# Load MultiQC
module load MultiQC/1.15

# Directories
FASTQCDIR=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/results/fastqc_trimmed
OUTDIR=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/results/multiqc_trimmed

# Run MultiQC to summarize all FastQC reports
multiqc ${FASTQCDIR} -o ${OUTDIR}

MultiQC Results

Index

The indexing script builds a HISAT2 genome index from the goldfish reference FASTA. The indexed genome enables efficient alignment of RNA-seq reads in subsequent steps.

#!/bin/bash
#SBATCH --job-name=hisat2_index
#SBATCH -n 1
#SBATCH -N 1
#SBATCH -c 16
#SBATCH --mem=10G
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH --mail-type=ALL
#SBATCH --mail-user=stephanie.ballas@uconn.edu
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err

hostname
date

#################################################################
# Index the Genome
#################################################################

# Load HISAT2
module load hisat2/2.2.1

# Define absolute input/output directories
OUTDIR=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/genome/hisat2_index
mkdir -p $OUTDIR

GENOME=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/data/genome/Carassius_auratus.ASM336829v1.dna_sm.toplevel.fa

# Build the HISAT2 index
hisat2-build -p 16 $GENOME $OUTDIR/Fhet

Alignment

This script aligns trimmed RNA-seq reads to the indexed goldfish genome using HISAT2. Each sample is processed as a SLURM array job, producing sorted and indexed BAM files suitable for downstream analysis.

#!/bin/bash
#SBATCH --job-name=align
#SBATCH -n 1
#SBATCH -N 1
#SBATCH -c 3
#SBATCH --mem=15G
#SBATCH --partition=xeon
#SBATCH --qos=general
#SBATCH --mail-type=ALL
#SBATCH --mail-user=stephanie.ballas@uconn.edu
#SBATCH -o /home/FCAM/sballas/Goldfish_Deltamethrin_Project/logs/%x_%A_%a.out
#SBATCH -e /home/FCAM/sballas/Goldfish_Deltamethrin_Project/logs/%x_%A_%a.err
#SBATCH --array=1-9%3   # Adjust to the number of samples, max concurrent jobs 3

# Print hostname and date
hostname
date

#################################################################
# Align reads to genome (HISAT2)
#################################################################

module load hisat2/2.2.1
module load samtools/1.16.1

# Absolute paths
FASTQ_DIR=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/results/trimmed_fastq
OUTDIR=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/results/align
mkdir -p ${OUTDIR}

GENOME_DIR=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/data/genome
HISAT2_INDEX=${GENOME_DIR}/Carassius_auratus_index  # We'll build index below if needed

ACCLIST=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/metadata/accessionlist.txt

# Build HISAT2 index if it doesn't exist
if [ ! -f "${HISAT2_INDEX}.1.ht2" ]; then
    hisat2-build ${GENOME_DIR}/Carassius_auratus.ASM336829v1.dna_sm.toplevel.fa ${HISAT2_INDEX}
fi

# Get sample for this array task
SAMPLE=$(sed -n ${SLURM_ARRAY_TASK_ID}p ${ACCLIST})

# Run HISAT2 alignment and pipe to sorted BAM
hisat2 \
    -p 3 \
    -x ${HISAT2_INDEX} \
    -1 ${FASTQ_DIR}/${SAMPLE}_trim_1.fastq.gz \
    -2 ${FASTQ_DIR}/${SAMPLE}_trim_2.fastq.gz | \
samtools sort -@ 3 -T ${OUTDIR}/${SAMPLE} -o ${OUTDIR}/${SAMPLE}.bam

# Index BAM file
samtools index ${OUTDIR}/${SAMPLE}.bam

Samtools Stats

Samtools is used to calculate alignment statistics for each BAM file. The script runs in parallel across samples, generates per-sample metrics, and compiles a summary table of key alignment statistics, providing an overview of mapping quality and coverage.


#!/bin/bash
#SBATCH --job-name=samstats
#SBATCH --mail-user=stephanie.ballas@uconn.edu
#SBATCH --mail-type=ALL
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=10
#SBATCH --mem=15G
#SBATCH --qos=general
#SBATCH --partition=general

hostname
date

##################################
# Calculate alignment statistics using Samtools
##################################

# Load software
module load samtools/1.16.1
module load parallel/20180122

# Define directories
INDIR=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/results/align
OUTDIR=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/results/samstats
mkdir -p $OUTDIR

# Accession list
ACCLIST=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/metadata/accessionlist.txt

# Run samtools stats for each BAM file
cat $ACCLIST | parallel -j 10 \
    "samtools stats $INDIR/{}.bam > $OUTDIR/{}.stats"

# Combine key stats into one table
FILES=($(find $OUTDIR -name "*.stats" | sort))

# Initialize summary table with metric names (SN lines)
grep "^SN" ${FILES[0]} | cut -f 2 > $OUTDIR/SN.txt

# Loop over each stats file and append columns
for file in ${FILES[@]}; do
    paste $OUTDIR/SN.txt <(grep ^SN $file | cut -f 3) > $OUTDIR/SN2.txt && \
    mv $OUTDIR/SN2.txt $OUTDIR/SN.txt
    echo "Processed $file"
done

# Add header with sample names
cat \
<(echo ${FILES[@]} | sed 's,.*/,,g' | sed 's/.stats//g' | tr ' ' '\t') \
$OUTDIR/SN.txt \
> $OUTDIR/Samtools_Summary.txt

Qualimap

Qualimap is employed to evaluate RNA-seq alignment quality against the reference genome and GTF annotation. Running in parallel across samples, it generates detailed reports on coverage, gene body distribution, and mapping performance.

#!/bin/bash
#SBATCH --job-name=qualimap
#SBATCH --mail-user=stephanie.ballas@uconn.edu
#SBATCH --mail-type=ALL
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=5
#SBATCH --mem=10G
#SBATCH --qos=general
#SBATCH --partition=general

hostname
date

##################################
# Calculate stats on alignments using Qualimap
##################################

# Load required software
module load qualimap/2.2.1
module load parallel/20180122

# Define directories
INDIR=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/results/align
OUTDIR=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/results/qualimap
mkdir -p $OUTDIR

# GTF annotation file
GTF=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/data/genome/Carassius_auratus.ASM336829v1.115.gtf

# Accession list
ACCLIST=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/metadata/accessionlist.txt

# Run Qualimap in parallel
cat $ACCLIST | \
parallel -j 5 \
    qualimap rnaseq \
        -bam $INDIR/{}.bam \
        -gtf $GTF \
        -outdir $OUTDIR/{} \
        --java-mem-size=2G

MultiQC

This MultiQC script consolidates both Samtools and Qualimap outputs into integrated reports. It allows for a high-level assessment of read alignment quality across all samples.

#!/bin/bash
#SBATCH --job-name=multiqc
#SBATCH -n 1
#SBATCH -N 1
#SBATCH -c 4
#SBATCH --mem=5G
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH --mail-type=ALL
#SBATCH --mail-user=stephanie.ballas@uconn.edu
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err

hostname
date

#################################################################
# Aggregate reports using MultiQC
#################################################################

module load MultiQC/1.9

# Define input and output directories
STATS=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/results/samtools_stats
QUALIMAP=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/results/qualimap
OUTDIR=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/results/multiqc

# Create output directories if needed
mkdir -p ${OUTDIR}/samtools_stats
mkdir -p ${OUTDIR}/qualimap

# Run MultiQC on Samtools stats output
multiqc -f -o ${OUTDIR}/samtools_stats ${STATS}

# Run MultiQC on Qualimap output
multiqc -f -o ${OUTDIR}/qualimap ${QUALIMAP}

Counts

HTSeq-count is used to quantify gene-level expression from aligned BAM files. The script processes each sample using the GTF annotation to generate raw read counts per gene, producing files suitable for downstream differential expression analysis in R.

#SBATCH --job-name=htseq_count
#SBATCH -n 1
#SBATCH -N 1
#SBATCH -c 4
#SBATCH --mem=20G
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH --mail-type=ALL
#SBATCH --mail-user=stephanie.ballas@uconn.edu
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err

echo `hostname`
date

#################################################################
# Generate Counts
#################################################################
module load htseq/0.13.5

# Directories
INDIR=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/results/align
OUTDIR=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/results/counts
mkdir -p $OUTDIR

# Accession list
ACCLIST=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/metadata/accessionlist.txt

# GTF annotation file
GTF=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/data/genome/Carassius_auratus.ASM336829v1.115.gtf

echo "Starting HTSeq-count..."

while read SAMPLE; do
    echo "Processing $SAMPLE..."
    htseq-count \
        -s no \
        -r pos \
        -f bam \
        $INDIR/${SAMPLE}.bam \
        $GTF \
        > $OUTDIR/${SAMPLE}.counts
done < $ACCLIST

echo "Done!"
date

Read in the Data to R

This section performs a basic differential expression analysis in R using the DESeq2 package. The scripe loads data into R and prepares RNA-seq count data in order to run the first statistical analysis.

library(tidyverse)
library(DESeq2)

# Read in data
meta <- read.csv("C:/Users/steph/OneDrive/Documents/Goldfish_Deltamethrin_Project/metadata/SraRunTable.csv") %>%
  select(Run, dose, fileName)

# Directory with counts
directory <- "C:/Users/steph/OneDrive/Documents/Goldfish_Deltamethrin_Project/results/04_counts"

# Build sample table directly from metadata
sampleTable <- data.frame(
  sampleName = meta$Run,
  fileName = meta$fileName,
  dose = meta$dose
)

sampleTable

   sampleName           fileName          dose
1 SRR14783835 SRR14783835.counts high exposure
2 SRR14783836 SRR14783836.counts high exposure
3 SRR14783837 SRR14783837.counts high exposure
4 SRR14783838 SRR14783838.counts  low exposure
5 SRR14783839 SRR14783839.counts  low exposure
6 SRR14783840 SRR14783840.counts  low exposure
7 SRR14783841 SRR14783841.counts       control
8 SRR14783842 SRR14783842.counts       control
9 SRR14783843 SRR14783843.counts       control

# Build DESeq dataset
ddsHTSeq <- DESeqDataSetFromHTSeqCount(
  sampleTable = sampleTable,
  directory = directory,
  design = ~ dose
)

Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
design formula are characters, converting to factors

  Note: levels of factors in the design contain characters other than
  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters
  for column names in R. [This is a message, not a warning or an error]

head(counts(ddsHTSeq))

                   SRR14783835 SRR14783836 SRR14783837 SRR14783838 SRR14783839
ENSCARG00000000002           0           0           0           0           0
ENSCARG00000000003         969         443         517         339         265
ENSCARG00000000004           0           0           0           0           0
ENSCARG00000000005        3721        3492        3036        2892        2695
ENSCARG00000000006           0           0           0           0           0
ENSCARG00000000007           9           6           4           2           2
                   SRR14783840 SRR14783841 SRR14783842 SRR14783843
ENSCARG00000000002           0           0           0           0
ENSCARG00000000003         312         372         744        1003
ENSCARG00000000004           0           0           0           0
ENSCARG00000000005        1744        2763        3111        4713
ENSCARG00000000006           0           0           0           0
ENSCARG00000000007           1           9           5       12481

dds <- DESeq(ddsHTSeq)

estimating size factors

  Note: levels of factors in the design contain characters other than
  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters
  for column names in R. [This is a message, not a warning or an error]

estimating dispersions

gene-wise dispersion estimates

mean-dispersion relationship

  Note: levels of factors in the design contain characters other than
  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters
  for column names in R. [This is a message, not a warning or an error]

final dispersion estimates

fitting model and testing

res <- results(dds, contrast = c("dose", "high exposure", "control"))

summary(res)


out of 51556 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)       : 531, 1%
LFC < 0 (down)     : 622, 1.2%
outliers [1]       : 419, 0.81%
low counts [2]     : 15814, 31%
(mean count < 6)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

Filtering

Heatmap

PCA Plot

Module 22

Module 23

References

Blanco, Ayelén Melisa, Lakshminarasimhan Sundarrajan, Juan Ignacio Bertucci, and Suraj Unniappan. 2018. “Why Goldfish? Merits and Challenges in Employing Goldfish as a Model Organism in Comparative Endocrinology Research.” General and Comparative Endocrinology 257: 13–28. https://doi.org/https://doi.org/10.1016/j.ygcen.2017.02.001.

Guardiola, F. A., P. Gónzalez-Párraga, J. Meseguer, A. Cuesta, and M. A. Esteban. 2014. “Modulatory Effects of Deltamethrin-Exposure on the Immune Status, Metabolism and Oxidative Stress in Gilthead Seabream (Sparus Aurata l.).” Fish & Shellfish Immunology 36 (1): 120–29. https://doi.org/https://doi.org/10.1016/j.fsi.2013.10.020.

Wu, Yaqin, Wenhua Li, Mingrui Yuan, and Xuan Liu. 2020. “The Synthetic Pyrethroid Deltamethrin Impairs Zebrafish (Danio Rerio) Swim Bladder Development.” Science of The Total Environment 701: 134870. https://doi.org/https://doi.org/10.1016/j.scitotenv.2019.134870.

Zhou, Shun, Jing Dong, Yongtao Liu, Qiuhong Yang, Ning Xu, Yibin Yang, and Xiaohui Ai. 2021. “Effects of Acute Deltamethrin Exposure on Kidney Transcriptome and Intestinal Microbiota in Goldfish (Carassius Auratus).” Ecotoxicology and Environmental Safety 225: 112716. https://doi.org/https://doi.org/10.1016/j.ecoenv.2021.112716.

Other Formats

Welcome to My Project

1 Introduction

2 Methods

2.1 Experimental Design and Samples

2.2 Sample Collection

2.3 Transcriptome Processing and Analysis

2.3.1 RNA Extraction and Sequencing

2.3.2 Bioinformatic Workflow

2.3.3 qPCR Validation

2.4 Intestinal Microbiota Analysis

2.4.1 DNA Extraction and Sequencing

2.4.2 Sequence Processing and Statistical Analysis

3 Result

3.1 Raw sequencing and preprocessing

Table 1: Summary of RNA-seq library quality and mapping statistics

3.2 Differential expression analysis

Table 2: Summary of differentially expressed genes (DEGs)

Figure 1: Differentially Expressed Genes by Treatment

Figure 2: Summary of differentially expressed genes in goldfish (Carassius auratus) after exposure to 0.2 and 2 µg/L deltamethrin Zhou et al. (2021).

4 Discussion

5 Scripts

Download SRA

Download Genome

FastQC

MultiQC

Trimmomatic

FastQC Trimmed

MultiQC Trimmed

MultiQC Results

Index

Alignment

Samtools Stats

Qualimap

MultiQC

Counts

Read in the Data to R

Filtering

Heatmap

PCA Plot

Module 22

Module 23

References