# A tibble: 9 × 8
Sample `Clean reads` `Clean bases` `GC content (%)` `Q30 (%)`
<chr> <dbl> <dbl> <dbl> <dbl>
1 C-1 54858738 8189592344 46.7 96.0
2 C-2 51384524 7673744890 46.7 95.9
3 C-3 55474754 8260159928 47.0 95.6
4 L-1 48550446 7251033918 46.6 95.1
5 L-2 41360330 6184981010 45.9 95.7
6 L-3 54107400 8091356676 46.4 95.6
7 H-1 56013224 8350534022 46.7 95.7
8 H-2 46984324 7034325400 46.6 95.2
9 H-3 48665048 7255381648 46.6 95.1
# ℹ 3 more variables: `Total mapped reads (ratio)` <chr>,
# `Multiple mapped reads (ratio)` <chr>,
# `Uniquely mapped reads (ratio)` <chr>
Welcome to My Project
This project presents the full analysis for the ISG 5311 final project, including RNA-seq differential expression analysis and interpretation.
1 Introduction
Pyrethroid pesticides, such as deltamethrin (DM), are widely used in agriculture and aquaculture due to their high insecticidal activity and relatively low toxicity compared to older pesticides. DM is effective against a broad spectrum of pests, making it a popular choice for pest and disease control in aquatic environments (Guardiola et al. (2014)). However, DM frequently enters rivers and lakes through surface runoff and wastewater discharge, posing risks to aquatic organisms. Fish are particularly vulnerable because they lack efficient mechanisms to metabolize and degrade DM, leading to toxic effects even at environmentally relevant concentrations (Wu et al. (2020)).
Goldfish (Carassius auratus) are a common freshwater species widely distributed in many water bodies and are often used as a model organism in toxicology research (Blanco et al. (2018)). Their ecological relevance, manageable size, and ease of maintenance make them suitable for laboratory experiments, while their physiology provides insight into potential impacts of environmental contaminants on aquatic species. Prior studies have shown that DM exposure can induce oxidative stress, apoptosis, and immune dysfunction in various fish species, but the molecular mechanisms underlying these effects are not fully understood (Zhou et al. (2021)).
The aim of this study was to investigate the effects of acute, environmentally relevant concentrations of DM on goldfish kidney transcriptomes and intestinal microbiota. Nine healthy goldfish were randomly assigned to three treatment groups: 0 µg/L control, 0.2 µg/L DM, and 2 µg/L DM, with three fish per group. Fish were exposed for 96 hours under controlled laboratory conditions. Kidney tissues were collected and RNA extracted for high-throughput RNA-sequencing to identify differentially expressed genes and characterize affected biological pathways, including apoptosis, immune response, and drug metabolism. Additionally, intestinal contents were collected to assess changes in microbial composition using 16S rRNA sequencing.
By combining transcriptomic and microbiota analyses, this study provides an integrative understanding of how DM affects fish at both the molecular and microbial levels. The results are expected to reveal key pathways disrupted by DM exposure and highlight the broader impacts of environmental contaminants on fish physiology and gut microbial communities (Zhou et al. (2021)). This work contributes to the growing field of aquatic toxicology by providing detailed molecular insights into DM-induced toxicity in a widely studied model species.
2 Methods
Methods were adapted from Zhou et al. (2021) unless otherwise noted.
2.1 Experimental Design and Samples
Eighty-one healthy goldfish (Carassius auratus) were purchased from a commercial supplier (Hanjin Ornamental Fish Farm, Wuhan, China) and acclimated for two weeks in aerated, dechlorinated water. During acclimation, fish were held in large tanks under controlled temperature (19–21 °C), pH (6.9–7.3), and dissolved oxygen (~6.2 mg/L), with a consistent light:dark cycle of 10 hours of light to 14 hours of darkness. Fish were fed once daily with a commercial pellet diet, and one-third of the water was exchanged every three days to maintain water quality.
Deltamethrin (>98% purity) was dissolved in dimethyl sulfoxide (DMSO) to generate stock solutions. Fish were randomly assigned to one of three treatment groups: 0 µg/L DM (control), 0.2 µg/L DM, and 2 µg/L DM (n = 9 fish per group). These concentrations reflect environmentally relevant DM levels reported in aquatic systems. All groups, including controls, contained trace amounts of DMSO (<0.001% v/v). Fish were exposed to their assigned treatments for 96 hours. Solutions were renewed daily to maintain stable DM levels, and fish were not fed during the exposure period.
2.2 Sample Collection
After 96 hours exposure, nine specimens were selected from each group to collect intestinal contents and pooled for transctiprome anslysis. Simultaneously, three specimens were randomly selected from each group and the kidney was removed. The kidney specimens were used for RNA-seq analysis.
2.3 Transcriptome Processing and Analysis
2.3.1 RNA Extraction and Sequencing
Total RNA was extracted from kidney tissue using TRIzol Reagent (Life Technologies) in accordance with the manufacturer’s protocol. A NanoDrop 2000 was used to assess RNA quantity and purity, and an Agilent 2100 Bioanalyzer was used to confirm RNA integrity. Only high-quality RNA samples were processed for sequencing. Libraries were prepared and sequenced on an Illumina NovaSeq 6000 to produce paired-end reads. Raw RNA-seq data have been submitted to the Sequence Read Archive (SRA) database in NCBI (accession number: SRP323663).
2.3.2 Bioinformatic Workflow
Raw reads were filtered to remove low-quality sequences, adapter contamination, and poly-N content. Clean reads were aligned to the goldfish reference genome (GCF_003368295.1) using HISAT2 v2.2.1. Gene-level counts were generated and normalized using DESeq2 v1.38.0 within R.
Differentially expressed genes (DEGs) were identified based on false discovery rate (FDR) < 0.05 and |log2 fold change| > 1. Functional enrichment of DEGs was performed using GOseq for Gene Ontology (GO) enrichment and KOBAS for Encyclopedia of Genes and Genomes (KEGG) pathway enrichment. These analyses were used to identify biological pathways associated with apoptosis, immune function, and metabolism.
2.3.3 qPCR Validation
To confirm the reliability of RNA-seq results, a subset of ten DEGs were selected for quantitative real-time PCR (qPCR). Primers were designed using Primer3, and qPCR was conducted using a QuantStudio 3 Real-Time PCR System with ChamQ Universal SYBR qPCR Master Mix. Each reaction contained SYBR qPCR Master Mix, forward and reverse primers, cDNA template, and nuclease-free water. Thermal cycling conditions were: 95 °C for 30 seconds followed by 40 cycles of denaturing at 95 °C for 10 seconds and annealing at 60 °C for 30 seconds. The β-actin gene was used as a reference (housekeeping) gene, its expression level was assumed to remain stable across all samples. This allows for comparison of the expression of the target DEGs relative to a consistent baseline. To quantify gene expression differences, the 2⁻ΔΔCt method was applied. This method uses the cycle threshold (Ct) values from qPCR to calculate relative fold changes in gene expression between experimental groups.
2.4 Intestinal Microbiota Analysis
2.4.1 DNA Extraction and Sequencing
DNA was extracted from intestinal contents using the QIAamp DNA Stool Mini Kit (Qiagen). DNA concentration and purity were measured using a NanoDrop spectrophotometer. Full-length 16S rRNA genes were amplified using primers 27F and 1492R. PCR amplification was performed using Premix Taq (Takara) with the following thermal cycling program: an initial denaturation at 95 °C for 5 minutes; 25 cycles of denaturation at 95 °C for 30 seconds, annealing at 50 °C for 30 seconds, and extension at 72 °C for 1 minute; followed by a final extension at 72 °C for 7 minutes. Libraries were prepped from purified DNA using the SMRTbell Template Prep Kit 1.0, and sequenced on a PacBio SMRT RS II platform.
2.4.2 Sequence Processing and Statistical Analysis
Raw PacBio reads were processed with the SMRT Analysis Suite to generate high-quality circular consensus sequences. Sequences were clustered into operational taxonomic units at 97% similarity using USEARCH v10.0. Taxonomic classification was performed using the RDP Classifier with an 80% confidence threshold.
Microbial diversity analyses were completed in QIIME2, including: alpha diversity (Chao1, ACE, Shannon, and Simpson indices), Good’s coverage to estimate sequencing completeness, and beta diversity (PCoA and UPGMA clustering). All statistical analyses were performed in R. Data were checked for normality and homogeneity of variance. Differences among treatment groups were evaluated using one-way ANOVA followed by Duncan’s multiple range test, with significance set at p < 0.05.
3 Result
3.1 Raw sequencing and preprocessing
Nine kidney RNA-seq libraries, three per treatment group, were sequenced. After quality control, the number of clean reads per library was ~41.36 million and per-library Q30 exceeded 95.09% (see Table 1). Clean reads mapped to the goldfish reference genome (GCF_003368295.1) with mapping rates between 88.37% and 90.71%, indicating high-quality sequencing and good genome coverage.
Table 1: Summary of RNA-seq library quality and mapping statistics
3.2 Differential expression analysis
After alignment and normalization, DESeq2 identified 270 DEGs, 160 upregulated and 110 downregulated in the 0.2 µg/L versus control comparison, and 711 DEGs, 309 upregulated and 402 downregulated in the 2 µg/L versus control comparison (Table 2). A set of DEGs overlapped between treatments; the intersection is shown in a Venn diagram (Figure 2D).
Table 2: Summary of differentially expressed genes (DEGs)
Warning: package 'knitr' was built under R version 4.5.2
Warning: package 'ggplot2' was built under R version 4.5.2
| Comparison | Total DEGs | Upregulated | Downregulated |
|---|---|---|---|
| 0.2 µg/L vs Control | 270 | 160 | 110 |
| 2 µg/L vs Control | 711 | 309 | 402 |
Figure 1: Differentially Expressed Genes by Treatment
Figure 2: Summary of differentially expressed genes in goldfish (Carassius auratus) after exposure to 0.2 and 2 µg/L deltamethrin Zhou et al. (2021).
4 Discussion
This project applied a comprehensive RNA-seq bioinformatics pipeline to evaluate the transcriptional effects of DM exposure in goldfish kidney tissue. Beginning with raw sequencing data, each analytical step—from quality control to gene ontology enrichment—contributed to understanding how DM alters gene expression in a dose-dependent manner. Overall, the results strongly align with the original publication while providing additional insight.
The raw RNA-seq dataset and the corresponding goldfish reference genome and annotation (GCF_003368295.1) were downloaded to ensure accurate mapping and quantification. The genome assembly matches that used in the published study, ensuring compatibility between analyses.
Quality control (QC) of the FASTQ files indicated high-quality sequencing suitable for downstream analysis. Per-base Phred quality scores were consistently above 30, N content was negligible, and GC content ranged from 45–47%, closely matching the 46–47% reported in the original study. Adapter contamination was minimal, and overrepresented sequences accounted for less than 1% of reads. Duplication levels were relatively high (68–72%), likely reflecting biological duplication or highly expressed transcripts, as the pattern was consistent across all replicates. No sample displayed abnormal metrics, suggesting the dataset was robust and no libraries required removal.
Reads were aligned to the goldfish genome using a splice-aware aligner (HISAT2). Mapping rates ranged from approximately 88–91%, consistent with the original study, confirming the reference genome’s suitability for kidney transcriptomics. Uniform mapping rates across treatment groups suggest consistent library preparation.
Gene expression quantification was performed using featureCounts to assign reads to annotated genes. Each sample yielded ~41 million clean reads, aligning with the sequencing depth in the original article. Approximately 25–33% of genes were filtered out due to low counts, reflecting typical tissue- and condition-specific expression patterns in fish transcriptomes.
Differential expression analysis using DESeq2 revealed clear, dose-dependent transcriptional responses to DM. In the 0.2 μg/L treatment versus control, 173 genes were upregulated and 123 downregulated (296 total DEGs). In the 2 μg/L group versus control, 531 genes were upregulated and 622 downregulated (1,153 DEGs). The high-versus-low dose comparison identified 1,265 DEGs, demonstrating that increasing DM concentration substantially alters transcriptional profiles. Principal component analysis (PCA) showed clear clustering by treatment with no outliers, confirming data reliability. Heatmap visualization displayed less distinct grouping, likely due to the large number of DEGs and gene-specific expression variability.
These results parallel the original publication, which reported 270 DEGs for the low dose and 711 DEGs for the high dose. Minor differences in DEG counts likely reflect variations in filtering thresholds, DESeq2 versions, or QC cutoffs, but the biological trend is consistent: DM induces a dose-dependent transcriptional response in goldfish kidneys.
Gene Ontology (GO) enrichment analysis contextualized the DEGs within biological processes. Enriched terms included immune activation, apoptosis, complement signaling, programmed cell death regulation, and membrane trafficking, consistent with the original study. These findings reinforce that DM disrupts immune function and cellular homeostasis, even at low exposure levels. Expansion of apoptotic and immune-related GO categories at higher doses suggests a model in which DM induces progressive cellular stress that intensifies with concentration.
Overall, the bioinformatics pipeline—from data download and QC through alignment, quantification, differential expression analysis, and GO enrichment—produced results that are technically sound and biologically meaningful. The findings align with the central conclusions of the original study.
5 Scripts
Download SRA
This script automates the download of raw RNA-seq data from the NCBI Sequence Read Archive (SRA) using fasterq-dump. It loops through a list of SRR accession numbers, downloads the corresponding FASTQ files, and compresses them with gzip.
#!/bin/bash
#SBATCH --job-name=fasterq_dump_goldfish
#SBATCH -n 1
#SBATCH -N 1
#SBATCH -c 1
#SBATCH --mem=15G
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH --mail-type=ALL
#SBATCH --mail-user=stephanie.ballas@uconn.edu
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err
# Load software
module load sratoolkit/3.0.1
# Data is from:
# https://www.ncbi.nlm.nih.gov/sra/?term=SRP323663
# https://doi.org/10.1016/j.ecoenv.2021.112716
# Output folder
OUTDIR=~/Goldfish_Deltamethrin_Project/data/fastq
mkdir -p ${OUTDIR}
# Accession list
ACCLIST=~/Goldfish_Deltamethrin_Project/metadata/accessionlist.txt
# Loop through each SRR in the accession list and download
while read SRR; do
echo "Downloading ${SRR}..."
fasterq-dump ${SRR} -O ${OUTDIR}
done < ${ACCLIST}
# Compress all FASTQ files
echo "Compressing FASTQ files..."
gzip ${OUTDIR}/*.fastq
date
echo "Download and compression complete."Download Genome
This script retrieves the goldfish reference genome, GTF annotation, and cDNA sequences from Ensembl using wget. Files are stored in a designated genome directory, decompressed, and indexed using samtools faidx to facilitate downstream alignment and analysis.
#!/bin/bash
#SBATCH --job-name=get_genome
#SBATCH -n 1
#SBATCH -N 1
#SBATCH -c 4
#SBATCH --mem=4G
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH --mail-type=ALL
#SBATCH --mail-user=stephanie.ballas@uconn.edu
#SBATCH -o /home/FCAM/sballas/Goldfish_Deltamethrin_Project/logs/%x_%j.out
#SBATCH -e /home/FCAM/sballas/Goldfish_Deltamethrin_Project/logs/%x_%j.err
# Load samtools
module load samtools/1.16.1
# Make genome directory
mkdir -p /home/FCAM/sballas/Goldfish_Deltamethrin_Project/data/genome
# Download genome, GTF, and cDNA from Ensembl
wget -c -P /home/FCAM/sballas/Goldfish_Deltamethrin_Project/data/genome \
http://ftp.ensembl.org/pub/release-115/fasta/carassius_auratus/dna/Carassius_auratus.ASM336829v1.dna_sm.toplevel.fa.gz
wget -c -P /home/FCAM/sballas/Goldfish_Deltamethrin_Project/data/genome \
http://ftp.ensembl.org/pub/release-115/gtf/carassius_auratus/Carassius_auratus.ASM336829v1.115.gtf.gz
wget -c -P /home/FCAM/sballas/Goldfish_Deltamethrin_Project/data/genome \
http://ftp.ensembl.org/pub/release-115/fasta/carassius_auratus/cdna/Carassius_auratus.ASM336829v1.cdna.all.fa.gz
# Decompress
gunzip -f /home/FCAM/sballas/Goldfish_Deltamethrin_Project/data/genome/*.gz
# Create samtools index
samtools faidx /home/FCAM/sballas/Goldfish_Deltamethrin_Project/data/genome/Carassius_auratus.ASM336829v1.dna_sm.toplevel.fa
samtools faidx /home/FCAM/sballas/Goldfish_Deltamethrin_Project/data/genome/Carassius_auratus.ASM336829v1.cdna.all.faFastQC
The FastQC script performs quality control on raw FASTQ files. It reads a list of sample accessions, runs FastQC on paired-end files, and outputs detailed per-sample quality reports, including read quality, adapter content, and sequence duplication levels.
#!/bin/bash
#SBATCH --job-name=fastqc
#SBATCH -n 1
#SBATCH -N 1
#SBATCH -c 10
#SBATCH --mem=20G
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH --mail-type=ALL
#SBATCH --mail-user=stephanie.ballas@uconn.edu
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err
echo `hostname`
#################################################################
# FastQC
#################################################################
module load fastqc/0.12.1
# set input/output directory variables
INDIR=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/data/fastq
REPORTDIR=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/results/fastqc
mkdir -p $REPORTDIR
# accession list
ACCLIST=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/metadata/accessionlist.txt
# Run FastQC on all FASTQ pairs (no parallel needed)
for ACC in $(cat $ACCLIST); do
echo "Running FastQC on $ACC..."
fastqc --outdir $REPORTDIR ${INDIR}/${ACC}_1.fastq.gz ${INDIR}/${ACC}_2.fastq.gz
doneMultiQC
This script aggregates individual FastQC reports into a single summary using MultiQC. It collects output from all samples and generates an integrated HTML report, simplifying the evaluation of overall sequencing quality.
#!/bin/bash
#SBATCH --job-name=multiqc
#SBATCH -n 1
#SBATCH -N 1
#SBATCH -c 4
#SBATCH --mem=10G
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH --mail-type=ALL
#SBATCH --mail-user=stephanie.ballas@uconn.edu
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err
# Load module
module load MultiQC/1.15
# Set directories
INDIR=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/results/fastqc
OUTDIR=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/results/multiqc
mkdir -p $OUTDIR
# run on fastqc output
multiqc -f -o ${OUTDIR} ${INDIR}Trimmomatic
Trimmomatic is used here to remove adapter sequences and low-quality bases from raw FASTQ files. The script loops over all samples, performing paired-end trimming with specified parameters, generating both trimmed reads and orphaned reads for downstream analysis.
#!/bin/bash
#SBATCH --job-name=trimmomatic
#SBATCH -n 1
#SBATCH -N 1
#SBATCH -c 4
#SBATCH --mem=20G
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH --mail-type=ALL
#SBATCH --mail-user=stephanie.ballas@uconn.edu
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err
# Record where and when this ran
hostname
date
#################################################################
# Trimmomatic
#################################################################
# Load Trimmomatic module
module load Trimmomatic/0.39
module list
# Set directories
INDIR=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/data/fastq
TRIMDIR=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/results/trimmed_fastq
mkdir -p $TRIMDIR
# Adapter file
ADAPTERS=/isg/shared/apps/Trimmomatic/0.39/adapters/TruSeq3-PE.fa
# Sample list
SAMPLES=(SRR14783835 SRR14783836 SRR14783837 SRR14783838 SRR14783839 SRR14783840 SRR14783841 SRR14783$
# Loop through samples
for SAMPLE in "${SAMPLES[@]}"
do
echo "Processing ${SAMPLE}..."
java -jar /isg/shared/apps/Trimmomatic/0.39/trimmomatic-0.39.jar PE -phred33 \
-threads 4 \
${INDIR}/${SAMPLE}_1.fastq.gz ${INDIR}/${SAMPLE}_2.fastq.gz \
${TRIMDIR}/${SAMPLE}_trim_1.fastq.gz ${TRIMDIR}/${SAMPLE}_trim_orphans_1.fastq.gz \
${TRIMDIR}/${SAMPLE}_trim_2.fastq.gz ${TRIMDIR}/${SAMPLE}_trim_orphans_2.fastq.gz \
ILLUMINACLIP:${ADAPTERS}:2:30:10 \
SLIDINGWINDOW:4:25 MINLEN:45
doneFastQC Trimmed
After trimming, this script re-runs FastQC on the trimmed FASTQ files. It ensures that adapter removal and quality trimming were successful and provides updated quality metrics for the processed reads.
#!/bin/bash
#SBATCH --job-name=fastqc_trimmed
#SBATCH -n 1
#SBATCH -N 1
#SBATCH -c 4
#SBATCH --mem=10G
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH --mail-type=ALL
#SBATCH --mail-user=stephanie.ballas@uconn.edu
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err
# Load FastQC
module load fastqc/0.12.1
# Directories
TRIMDIR=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/results/trimmed_fastq
OUTDIR=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/results/fastqc_trimmed
mkdir -p ${OUTDIR}
# Run FastQC on all trimmed paired-end files
fastqc -t 4 -o ${OUTDIR} ${TRIMDIR}/*_trim_[12].fastq.gzMultiQC Trimmed
This script aggregates the post-trimming FastQC reports into a unified MultiQC summary. It provides a comprehensive view of quality improvements and any remaining issues after read trimming.
#!/bin/bash
#SBATCH --job-name=multiqc_trimmed
#SBATCH -n 1
#SBATCH -N 1
#SBATCH -c 4
#SBATCH --mem=10G
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH --mail-type=ALL
#SBATCH --mail-user=stephanie.ballas@uconn.edu
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err
# Load MultiQC
module load MultiQC/1.15
# Directories
FASTQCDIR=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/results/fastqc_trimmed
OUTDIR=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/results/multiqc_trimmed
# Run MultiQC to summarize all FastQC reports
multiqc ${FASTQCDIR} -o ${OUTDIR}MultiQC Results
Index
The indexing script builds a HISAT2 genome index from the goldfish reference FASTA. The indexed genome enables efficient alignment of RNA-seq reads in subsequent steps.
#!/bin/bash
#SBATCH --job-name=hisat2_index
#SBATCH -n 1
#SBATCH -N 1
#SBATCH -c 16
#SBATCH --mem=10G
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH --mail-type=ALL
#SBATCH --mail-user=stephanie.ballas@uconn.edu
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err
hostname
date
#################################################################
# Index the Genome
#################################################################
# Load HISAT2
module load hisat2/2.2.1
# Define absolute input/output directories
OUTDIR=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/genome/hisat2_index
mkdir -p $OUTDIR
GENOME=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/data/genome/Carassius_auratus.ASM336829v1.dna_sm.toplevel.fa
# Build the HISAT2 index
hisat2-build -p 16 $GENOME $OUTDIR/FhetAlignment
This script aligns trimmed RNA-seq reads to the indexed goldfish genome using HISAT2. Each sample is processed as a SLURM array job, producing sorted and indexed BAM files suitable for downstream analysis.
#!/bin/bash
#SBATCH --job-name=align
#SBATCH -n 1
#SBATCH -N 1
#SBATCH -c 3
#SBATCH --mem=15G
#SBATCH --partition=xeon
#SBATCH --qos=general
#SBATCH --mail-type=ALL
#SBATCH --mail-user=stephanie.ballas@uconn.edu
#SBATCH -o /home/FCAM/sballas/Goldfish_Deltamethrin_Project/logs/%x_%A_%a.out
#SBATCH -e /home/FCAM/sballas/Goldfish_Deltamethrin_Project/logs/%x_%A_%a.err
#SBATCH --array=1-9%3 # Adjust to the number of samples, max concurrent jobs 3
# Print hostname and date
hostname
date
#################################################################
# Align reads to genome (HISAT2)
#################################################################
module load hisat2/2.2.1
module load samtools/1.16.1
# Absolute paths
FASTQ_DIR=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/results/trimmed_fastq
OUTDIR=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/results/align
mkdir -p ${OUTDIR}
GENOME_DIR=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/data/genome
HISAT2_INDEX=${GENOME_DIR}/Carassius_auratus_index # We'll build index below if needed
ACCLIST=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/metadata/accessionlist.txt
# Build HISAT2 index if it doesn't exist
if [ ! -f "${HISAT2_INDEX}.1.ht2" ]; then
hisat2-build ${GENOME_DIR}/Carassius_auratus.ASM336829v1.dna_sm.toplevel.fa ${HISAT2_INDEX}
fi
# Get sample for this array task
SAMPLE=$(sed -n ${SLURM_ARRAY_TASK_ID}p ${ACCLIST})
# Run HISAT2 alignment and pipe to sorted BAM
hisat2 \
-p 3 \
-x ${HISAT2_INDEX} \
-1 ${FASTQ_DIR}/${SAMPLE}_trim_1.fastq.gz \
-2 ${FASTQ_DIR}/${SAMPLE}_trim_2.fastq.gz | \
samtools sort -@ 3 -T ${OUTDIR}/${SAMPLE} -o ${OUTDIR}/${SAMPLE}.bam
# Index BAM file
samtools index ${OUTDIR}/${SAMPLE}.bamSamtools Stats
Samtools is used to calculate alignment statistics for each BAM file. The script runs in parallel across samples, generates per-sample metrics, and compiles a summary table of key alignment statistics, providing an overview of mapping quality and coverage.
#!/bin/bash
#SBATCH --job-name=samstats
#SBATCH --mail-user=stephanie.ballas@uconn.edu
#SBATCH --mail-type=ALL
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=10
#SBATCH --mem=15G
#SBATCH --qos=general
#SBATCH --partition=general
hostname
date
##################################
# Calculate alignment statistics using Samtools
##################################
# Load software
module load samtools/1.16.1
module load parallel/20180122
# Define directories
INDIR=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/results/align
OUTDIR=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/results/samstats
mkdir -p $OUTDIR
# Accession list
ACCLIST=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/metadata/accessionlist.txt
# Run samtools stats for each BAM file
cat $ACCLIST | parallel -j 10 \
"samtools stats $INDIR/{}.bam > $OUTDIR/{}.stats"
# Combine key stats into one table
FILES=($(find $OUTDIR -name "*.stats" | sort))
# Initialize summary table with metric names (SN lines)
grep "^SN" ${FILES[0]} | cut -f 2 > $OUTDIR/SN.txt
# Loop over each stats file and append columns
for file in ${FILES[@]}; do
paste $OUTDIR/SN.txt <(grep ^SN $file | cut -f 3) > $OUTDIR/SN2.txt && \
mv $OUTDIR/SN2.txt $OUTDIR/SN.txt
echo "Processed $file"
done
# Add header with sample names
cat \
<(echo ${FILES[@]} | sed 's,.*/,,g' | sed 's/.stats//g' | tr ' ' '\t') \
$OUTDIR/SN.txt \
> $OUTDIR/Samtools_Summary.txtQualimap
Qualimap is employed to evaluate RNA-seq alignment quality against the reference genome and GTF annotation. Running in parallel across samples, it generates detailed reports on coverage, gene body distribution, and mapping performance.
#!/bin/bash
#SBATCH --job-name=qualimap
#SBATCH --mail-user=stephanie.ballas@uconn.edu
#SBATCH --mail-type=ALL
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=5
#SBATCH --mem=10G
#SBATCH --qos=general
#SBATCH --partition=general
hostname
date
##################################
# Calculate stats on alignments using Qualimap
##################################
# Load required software
module load qualimap/2.2.1
module load parallel/20180122
# Define directories
INDIR=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/results/align
OUTDIR=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/results/qualimap
mkdir -p $OUTDIR
# GTF annotation file
GTF=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/data/genome/Carassius_auratus.ASM336829v1.115.gtf
# Accession list
ACCLIST=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/metadata/accessionlist.txt
# Run Qualimap in parallel
cat $ACCLIST | \
parallel -j 5 \
qualimap rnaseq \
-bam $INDIR/{}.bam \
-gtf $GTF \
-outdir $OUTDIR/{} \
--java-mem-size=2GMultiQC
This MultiQC script consolidates both Samtools and Qualimap outputs into integrated reports. It allows for a high-level assessment of read alignment quality across all samples.
#!/bin/bash
#SBATCH --job-name=multiqc
#SBATCH -n 1
#SBATCH -N 1
#SBATCH -c 4
#SBATCH --mem=5G
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH --mail-type=ALL
#SBATCH --mail-user=stephanie.ballas@uconn.edu
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err
hostname
date
#################################################################
# Aggregate reports using MultiQC
#################################################################
module load MultiQC/1.9
# Define input and output directories
STATS=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/results/samtools_stats
QUALIMAP=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/results/qualimap
OUTDIR=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/results/multiqc
# Create output directories if needed
mkdir -p ${OUTDIR}/samtools_stats
mkdir -p ${OUTDIR}/qualimap
# Run MultiQC on Samtools stats output
multiqc -f -o ${OUTDIR}/samtools_stats ${STATS}
# Run MultiQC on Qualimap output
multiqc -f -o ${OUTDIR}/qualimap ${QUALIMAP}Counts
HTSeq-count is used to quantify gene-level expression from aligned BAM files. The script processes each sample using the GTF annotation to generate raw read counts per gene, producing files suitable for downstream differential expression analysis in R.
#SBATCH --job-name=htseq_count
#SBATCH -n 1
#SBATCH -N 1
#SBATCH -c 4
#SBATCH --mem=20G
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH --mail-type=ALL
#SBATCH --mail-user=stephanie.ballas@uconn.edu
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err
echo `hostname`
date
#################################################################
# Generate Counts
#################################################################
module load htseq/0.13.5
# Directories
INDIR=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/results/align
OUTDIR=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/results/counts
mkdir -p $OUTDIR
# Accession list
ACCLIST=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/metadata/accessionlist.txt
# GTF annotation file
GTF=/home/FCAM/sballas/Goldfish_Deltamethrin_Project/data/genome/Carassius_auratus.ASM336829v1.115.gtf
echo "Starting HTSeq-count..."
while read SAMPLE; do
echo "Processing $SAMPLE..."
htseq-count \
-s no \
-r pos \
-f bam \
$INDIR/${SAMPLE}.bam \
$GTF \
> $OUTDIR/${SAMPLE}.counts
done < $ACCLIST
echo "Done!"
dateRead in the Data to R
This section performs a basic differential expression analysis in R using the DESeq2 package. The scripe loads data into R and prepares RNA-seq count data in order to run the first statistical analysis.
library(tidyverse)
library(DESeq2)# Read in data
meta <- read.csv("C:/Users/steph/OneDrive/Documents/Goldfish_Deltamethrin_Project/metadata/SraRunTable.csv") %>%
select(Run, dose, fileName)
# Directory with counts
directory <- "C:/Users/steph/OneDrive/Documents/Goldfish_Deltamethrin_Project/results/04_counts"
# Build sample table directly from metadata
sampleTable <- data.frame(
sampleName = meta$Run,
fileName = meta$fileName,
dose = meta$dose
)sampleTable sampleName fileName dose
1 SRR14783835 SRR14783835.counts high exposure
2 SRR14783836 SRR14783836.counts high exposure
3 SRR14783837 SRR14783837.counts high exposure
4 SRR14783838 SRR14783838.counts low exposure
5 SRR14783839 SRR14783839.counts low exposure
6 SRR14783840 SRR14783840.counts low exposure
7 SRR14783841 SRR14783841.counts control
8 SRR14783842 SRR14783842.counts control
9 SRR14783843 SRR14783843.counts control
# Build DESeq dataset
ddsHTSeq <- DESeqDataSetFromHTSeqCount(
sampleTable = sampleTable,
directory = directory,
design = ~ dose
)Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
design formula are characters, converting to factors
Note: levels of factors in the design contain characters other than
letters, numbers, '_' and '.'. It is recommended (but not required) to use
only letters, numbers, and delimiters '_' or '.', as these are safe characters
for column names in R. [This is a message, not a warning or an error]
head(counts(ddsHTSeq)) SRR14783835 SRR14783836 SRR14783837 SRR14783838 SRR14783839
ENSCARG00000000002 0 0 0 0 0
ENSCARG00000000003 969 443 517 339 265
ENSCARG00000000004 0 0 0 0 0
ENSCARG00000000005 3721 3492 3036 2892 2695
ENSCARG00000000006 0 0 0 0 0
ENSCARG00000000007 9 6 4 2 2
SRR14783840 SRR14783841 SRR14783842 SRR14783843
ENSCARG00000000002 0 0 0 0
ENSCARG00000000003 312 372 744 1003
ENSCARG00000000004 0 0 0 0
ENSCARG00000000005 1744 2763 3111 4713
ENSCARG00000000006 0 0 0 0
ENSCARG00000000007 1 9 5 12481
dds <- DESeq(ddsHTSeq)estimating size factors
Note: levels of factors in the design contain characters other than
letters, numbers, '_' and '.'. It is recommended (but not required) to use
only letters, numbers, and delimiters '_' or '.', as these are safe characters
for column names in R. [This is a message, not a warning or an error]
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
Note: levels of factors in the design contain characters other than
letters, numbers, '_' and '.'. It is recommended (but not required) to use
only letters, numbers, and delimiters '_' or '.', as these are safe characters
for column names in R. [This is a message, not a warning or an error]
final dispersion estimates
fitting model and testing
res <- results(dds, contrast = c("dose", "high exposure", "control"))
summary(res)
out of 51556 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up) : 531, 1%
LFC < 0 (down) : 622, 1.2%
outliers [1] : 419, 0.81%
low counts [2] : 15814, 31%
(mean count < 6)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results