Bioinformatics Analysis Report: Re-analysis of Geobacter sulfurreducens RNA-seq Data to Investigate the Role of CsrA in Biofilm and EET Regulation

Reference Study: DOI:10.3389/fmicb.2025.1534446
Data Source: NCBI GEO GSE282747 / SRA PRJNA1190357

Study overview and experimental design

This report details the independent bioinformatic re-analysis of RNA-seq data from a study investigating the role of the post-transcriptional regulator CsrA in Geobacter sulfurreducens. The primary goal was to validate the reported transcriptomic impact of CsrA deletion and to characterize how its regulatory role differs between standard biofilm conditions and electrogenic microbial fuel cell (MFC) environments. Using a standardized RNA-seq pipeline (quality control, alignment, quantification, and differential expression analysis), I confirmed that growth environment (MFC vs. inert glass) is the dominant factor shaping the global transcriptome. Within this strong environmental effect, the deletion of csrA induces distinct expression profiles, supporting its role as a global regulator. I identified a subset of genes, including GS_RS02340 (downregulated in MFC) and GS_RS01010 (upregulated in MFC), whose response to the electrogenic environment appears modulated by CsrA. A notable technical observation was a potential sample preparation issue in one replicate (SRR31487053). Overall, this analysis independently corroborates the original study’s findings and provides a detailed, reproducible workflow for examining environment-dependent transcriptional regulation in electroactive bacteria.

Introduction

1.1 Geobacter sulfurreducens: A Model Electroactive Bacterium

1.2 Biofilms and Extracellular Electron Transfer (EET)

Biofilms are structured communities of microorganisms encased in a self-produced extracellular matrix that adheres to biotic or abiotic surfaces. In G. sulfurreducens, biofilm formation is integral to its ability to perform extracellular electron transfer (EET), as biofilms facilitate close contact between cells and insoluble electron acceptors, such as metal oxides or electrodes in microbial fuel cells (MFCs).

FIgure1: Biofilms benefits overview - source: https://sci-hub.box/10.1038/nrmicro.2016.94 Another aspect of the Biofilms, that can help us to understand better why there could be trascriptome differences are related to the gradient of the nutrients and how metabolically active organism cohexist in different layers with dormant cells that are metabolically less active, then with reduced production of trascriptomes.

Figure2: Biofilm structure and nutrient gradients - source: https://sci-hub.box/10.1038/nrmicro.2016.94

1.3 CsrA: A Global Post-Transcriptional Regulator

CsrA (carbon storage regulator A) is a highly conserved RNA-binding protein that functions as a global post-transcriptional regulator in a wide range of bacteria CsrA is known to regulate diverse cellular processes, including carbon metabolism, motility, biofilm formation, and virulence in various bacterial species. In Escherichia coli, for example, CsrA represses the synthesis of extracellular polysaccharides and curli fibers, thereby inhibiting biofilm formation. Conversely, in Pseudomonas aeruginosa, CsrA positively regulates biofilm development by modulating the expression of genes involved in exopolysaccharide production.

1.4 Sampling and Experimental Design of the study

Category	Technique / Material	Purpose / What It Measured
Genetic & Microbiological	Markerless deletion system (pK18mobsacB, conjugation)	Construction of the clean ΔcsrA mutant strain.
	Anaerobic culturing (NBAF, acetate-Fe(III) media)	Standard growth and physiological assays under relevant conditions.
	Complementation (pRG5.1-RRflg-csrA plasmid)	To confirm phenotypes were due to csrA loss.
	Crystal Violet Staining	Basic quantification of adhered biofilm biomass.
	Agglutination Assay	Indirect assessment of pili production and cell clumping.
Microscopy & Imaging	Confocal Laser Scanning Microscopy (CLSM)	High-resolution 3D imaging of biofilm structure.
	LIVE/DEAD BacLight Stain (SYTO9/PI)	Differentiation and quantification of live vs. dead cells in biofilms.
	FTO (Fluorine-doped Tin Oxide) Electrodes	Conductive support for growing electroactive biofilms for CLSM.
	Image Analysis (Comstat2, Fiji)	Quantification of biofilm parameters: thickness, roughness, coverage.
Electrochemical	Two-chamber H-cell Microbial Fuel Cell (MFC)	Measuring bioelectrochemical current generation by the bacteria.
	Graphite Stick Anodes & Potentiostat	Electrode material and instrument to control potential/measure current.
Molecular Biology	RNA Extraction (RNeasy kit)	Isolation of high-quality total RNA from biofilms.
	RT-qPCR	Validation of differential gene expression from RNA-seq data.
Sequencing & Bioinformatics	RNA-seq (Illumina NextSeq 500)	Primary tool: Genome-wide transcriptome profiling of WT vs. ΔcsrA biofilms on glass and graphite.
	Ribosomal RNA depletion (RiboMinus kit)	Enrichment for mRNA prior to sequencing.
	Differential Expression Analysis (edgeR, DESeq2, NOISeq, limma via IDEAmeX)	Statistical identification of significantly up/down-regulated genes.
	Functional Annotation (KEGG)	Categorizing differentially expressed genes into biological pathways.
	Motif Discovery (MEME suite)	In silico identification of conserved CsrA binding sequences in 5’-UTRs.
	Principal Component Analysis (PCA)	Visualizing global transcriptome differences between samples.
	Data Repository (NCBI GEO, GSE282747)	Public archiving of raw and processed RNA-seq data.

1.5 Aim of This Bioinformatic Analysis

This project aimed to independently process and analyze the publicly deposited RNA-seq data from the above study (GSE282747) with the following objectives:

Reproduce the differential expression analysis using a standardized bioinformatics pipeline.
Assess the relative contribution of growth environment to the transcriptional profile.
Identify genes whose expression is significantly altered by the deletion of csrA under both biofilm conditions.

2. Materials and Methods

2.1 Datasets

The datasets utilized in this study were derived from RNA sequencing (RNA-seq) experiments conducted on Geobacter sulfurreducens wild-type (WT) and ΔcsrA mutant strains. Biofilms were cultivated under two distinct conditions: on inert glass supports and on graphite electrodes within microbial fuel cells (MFCs). The RNA-seq data provided comprehensive transcriptomic profiles, enabling the identification of differentially expressed genes associated with the deletion of the csrA gene. The raw and processed RNA-seq data have been deposited in the NCBI Gene Expression Omnibus (GEO) under accession number GSE282747, ensuring accessibility for further analysis and validation by the scientific community.Here the links: https://www.ncbi.nlm.nih.gov/Traces/study/?acc=PRJNA1190357&o=acc_s%3Aa and the genome used for the mapping of the genes has been collected from https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000007985.2/.

# Index	Run	BioSample	Bases	Bytes	Experiment	Genotype	Library Name	Creation date	Sample Names	Treatment
1	SRR31487053	SAMN45032779	2.61 G	976.57 Mb	SRX26555218	DL1 wild type strain	GSM8649882	2024-11-25 10:04:00Z	GSM8649882	Biofilm grown on graphite electrode in microbial fuel cell operated for 2 weeks
2	SRR31487054	SAMN45032780	2.78 G	1.02 Gb	SRX26555217	DL1 wild type strain	GSM8649881	2024-11-25 10:05:00Z	GSM8649881	Biofilm grown on graphite electrode in microbial fuel cell operated for 2 weeks
3	SRR31487055	SAMN45032781	2.57 G	961.02 Mb	SRX26555216	DL1 with csrA-deletion	GSM8649880	2024-11-25 10:06:00Z	GSM8649880	Biofilm grown on graphite electrode in microbial fuel cell operated for 2 weeks
4	SRR31487056	SAMN45032782	2.51 G	943.12 Mb	SRX26555215	DL1 wild type strain	GSM8649879	2024-11-25 10:07:00Z	GSM8649879	Biofilm grown on glass support in NBAF medium for 48 h at 25 oC
5	SRR31487057	SAMN45032783	2.71 G	1,007.41 Mb	SRX26555214	DL1 wild type strain	GSM8649877	2024-11-25 10:08:00Z	GSM8649877	Biofilm grown on glass support in NBAF medium for 48 h at 25 oC
6	SRR31487058	SAMN45032784	2.33 G	867.88 Mb	SRX26555213	DL1 wild type strain	GSM8649876	2024-11-25 10:09:00Z	GSM8649876	Biofilm grown on glass support in NBAF medium for 48 h at 25 oC
7	SRR31487059	SAMN45032785	3.09 G	1.13 Gb	SRX26555212	DL1 with csrA-deletion	GSM8649875	2024-11-25 10:10:00Z	GSM8649875	Biofilm grown on glass support in NBAF medium for 48 h at 25 oC
8	SRR31487060	SAMN45032786	2.87 G	1.06 Gb	SRX26555211	DL1 with csrA-deletion	GSM8649875	2024-11-25 10:11:00Z	GSM8649875	Biofilm grown on glass support in NBAF medium for 48 h at 25 oC

The two treatments in the table represent two distinct experimental conditions used to cultivate Geobacter sulfurreducens biofilms for RNA-seq analysis. The first treatment involves growing biofilms on a graphite electrode within a functioning microbial fuel cell (MFC) for two weeks. This condition replicates an applied, electroactive environment where the bacteria are respiring by transferring electrons directly to a solid anode to produce electrical current. The second treatment involves growing biofilms on an inert glass support in standard growth medium (NBAF) for 48 hours at 25°C. This is a non-electrogenic, static control condition designed to study biofilm formation without the influence of an electrochemical potential or current flow. By comparing these two setups, the researchers could distinguish the general transcriptional role of the CsrA regulator in biofilm development (glass support) from its specific role in regulating genes essential for extracellular electron transfer and current production (MFC electrode).

2.2 Bioinformatics Workflow

On the server Sanger, the following folders have been created to organize the analysis:

Figure3: Files on the Sanger server

2.3 Software and Tools Used

The bioinformatic analysis of the RNA-seq data in this study was conducted using a combination of specialized software tools and programming languages. Key tools included:

FastQC: For initial quality assessment of raw sequencing reads.
Trimmomatic: For trimming low-quality bases and adapter sequences from the reads.
HISAT2: For aligning the cleaned reads to the Geobacter sulfur reducens reference genome.
featureCounts: For quantifying gene expression levels from the aligned reads.
R and Bioconductor packages (edgeR, DESeq2, NOISeq, limma): For differential expression analysis and statistical testing.
IGV (Integrative Genomics Viewer): For visualizing read alignments and gene expression patterns.
samtools: For manipulating and processing alignment files (BAM/SAM).
HTSeq: For counting reads mapped to genomic features. ( I have been unable to install on WSL with some errors, and it seems related to python3.14 and the wheels usergroup - I have been unable to fix it on WSL, so I used it on the Sanger server).

Tool	Version	Where installed/used
FastQC	0.12.1	Fedora 43 on WSL
bcftools	1.23	Fedora 43 on WSL
htslib	1.23	Fedora 43 on WSL
samtools	1.23	Fedora 43 on WSL
IGV	2.19.7	Windows 11 25H2
R	4.5.2	Windows 11 25H2
R	4.5.2	Fedora 43
MultiQC	1.32	Sanger Server/ CentOS Stream 9
Trimmomatic	0.39	Sanger Server/ CentOS Stream 9
hisat2	2.2.1	Sanger Server/ CentOS Stream 9
FastQC	0.12.1	Sanger Server/ CentOS Stream 9
HTSeq	0.12.3	Sanger Server/ CentOS Stream 9

2.4.1 Quality Control and Preprocessing

An initial quality control (QC) assessment was performed on the raw sequencing reads downloaded from the NCBI SRA. This section details the iterative process of identifying and addressing quality issues, including adapter contamination and ribosomal RNA (rRNA) carryover.

2.4.2 Quality Assessment with FastQC on the dataset

FastQC analysis was conducted on all paired-end samples, and the results were aggregated using MultiQC. The initial report indicated high per-base sequence quality (Q > 30), meeting the standard threshold for reliable downstream analysis.

In the fact, I had to use the script provided in Sanger to run FastQC on all the samples in the folder “1_sra” as Paired Ends.

However, the report flagged a significant level of sequence duplication. The “Overrepresented sequences” module identified specific sequences with exceptionally high counts, warranting further investigation to determine their origin.

2.4.3 Investigating Overrepresented Sequences

To determine if the duplicated sequences represented contamination or technical artifacts, they were subjected to BLASTn analysis against the Geobacter sulfurreducens genome and related taxa.

The three most overrepresented sequences identified were: 1. Sequence 1 (GTTTGATCCTGGCTCAGAACGAACGCTGGCGGCGTGCCTAACACATGCAA): This sequence showed a perfect match to the 16S ribosomal RNA (rRNA) gene, a highly conserved region common across Geobacter species and other prokaryotes. Its presence is expected in total RNA samples, as rRNA typically comprises the majority of cellular RNA, even after depletion protocols. 2. Sequence 2 (GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG): This homopolymer sequence showed no significant match to the Geobacter genome. It was identified as a common Illumina adapter sequence, representing a technical artifact from the library preparation or sequencing process on the Illumina NextSeq 500 platform. 3. Sequence 3 (CCCCATTCGGACACCCCCGGATCAAAGCCTGTTTAGCGGCTCCCCGAGGC): This sequence also matched the 5S rRNA gene, similar to Sequence 1, confirming its biological origin. 3. Sequence 4 (CCTCACGGTACTTGTGCACTATCGGTCAGAGAGTAGTATTTAGCCTTGGG): This sequence also matched the 23S rRNA gene, similar to Sequence 1, confirming its biological origin.Plus as per microbiome analysis we know that rRNA16 presents some regions highly preservered among different species, so it is possible that some of these sequences can be from other species present in the biofilm sample.So, out of curiosity, I run a BLAST on it and checked the tree view of the species matched with the BLASTed sequence. BLAST results for Sequence 3

2.4.4 Read Trimming with Trimmomatic

Based on the FastQC findings, Trimmomatic was employed to trim adapter sequences and low-quality bases from the reads. The following parameters were used:

ILLUMINACLIP: To remove Illumina adapter sequences.
SLIDINGWINDOW: To trim reads based on average quality within a sliding window.
MINLEN: To discard reads shorter than a specified length after trimming.

As adaptor file, initially I used the one provided in “/opt/Trimmomatic-0.39/adapters/adapters.fa” then, I thought to add the rRNA sequences with high duplication in the adapter file, and created a new file called “adapters_+_geo_sulf_rna.fna” in my home directory.

This has paid later in the quality checks after the trimmomatic step.

java -jar /opt/Trimmomatic-0.39/trimmomatic-0.39.jar PE -threads 2 -phred33 /home/$USER/1_sra/SRR314870XX_1.fastq.gz /home/$USER/1_sra/SRR314870XX_2.fastq.gz /home/$USER/3.1_trimmed2_sra/SRR314870xx_1P.fastq.gz /home/$USER/3.1_trimmed2_sra/SRR314870XX_2P.fastq.gz /home/$USER/3.1_trimmed2_sra/SRR314870XX_1U.fastq.gz /home/$USER/3.1_trimmed2_sra/SRR314870XX_2P.fastq.gz ILLUMINACLIP:/home/antoniodg/adapters_+_geo_sulf_rna.fna:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:35

Post-trimming FastQC analysis confirmed the successful removal of adapter sequences and improved overall read quality, with no remaining overrepresented sequences.

2.4.5 FastQC After Trimming

FastQC was rerun on the trimmed reads to assess the effectiveness of the trimming process. The post-trimming FastQC reports indicated that the overrepresented sequences had been successfully removed, and the overall quality metrics remained high, with per-base sequence quality scores consistently above Q30+ across all samples. This confirmed that the reads were now suitable for downstream alignment and analysis.

2.4.6 MultiQC Report

MultiQC was used to aggregate the FastQC reports from all samples before and after trimming. The consolidated report provided a comprehensive overview of the quality metrics across the dataset, highlighting improvements in sequence quality and the elimination of overrepresented sequences post-trimming. The number of duplicated sequences for the major overrepresented reads was reduced to below 1 million occurrences. While traces of Geobacter rRNA sequences persist—an inherent characteristic of biofilm-derived total RNA samples where diverse microbial rRNA may be present—the data was deemed of sufficient quality for robust alignment and differential expression analysis. They are available on the server /home/antoniodg/ folder and they have been made for the dataset raw, for the dataset with only adapter trimming, and for the dataset with adapter + rRNA trimming.

3. Mapping Reads to Reference Genome

3.1 Reference Genome Preparation

The reference genome for Geobacter sulfurreducens used in this study was obtained from the NCBI database (accession number GCF_000007985.2_ASM798v2). The genome was downloaded in FASTA format along with the corresponding annotation file in GTF/GFF format. Prior to mapping, the reference genome was indexed using HISAT2 to facilitate efficient alignment of RNA-seq reads.

I have downloaded the genome from the NCBI datasets website with wget and the unzipped it in my home directory, in a folder called “Geobacter_sulfurreducens”, then made a new folder called GenomeIndex where I have indexed the genome with hisat2-build as per below:

hisat2-build GCF_000007985.2_ASM798v2_genomic.fna GenomeIndex/genome

This has generated files genome.1.ht2 genome.2.ht2 genome.3.ht2 genome.4.ht2 genome.5.ht2 genome.6.ht2 genome.7.ht2 genome.8.ht2.

3.1.1 Errors in generating .bam and .sam files

During the initial attempts to generate BAM and SAM files from the aligned reads, several errors were encountered. These issues were primarily related to file format inconsistencies and incorrect genome indexing related probably to a syntax error in the command.

The first attempt then generated files with 0 bytes, indicating that the alignment process did not complete successfully. After reviewing the HISAT2 documentation and troubleshooting the command syntax, the errors were resolved by ensuring that the correct paths to the indexed genome and trimmed read files were specified.

[antoniodg@sanger 5_hisat2_trimmed2_genome]$ ll
total 0
-rw-r--r--. 1 antoniodg antoniodg 0 Jan 17 19:38 SRR31487053.bam
-rw-r--r--. 1 antoniodg antoniodg 0 Jan 17 19:38 SRR31487053_unmapped.1.fastq
-rw-r--r--. 1 antoniodg antoniodg 0 Jan 17 19:38 SRR31487053_unmapped.2.fastq
-rw-r--r--. 1 antoniodg antoniodg 0 Jan 17 19:38 SRR31487054.bam
-rw-r--r--. 1 antoniodg antoniodg 0 Jan 17 19:38 SRR31487054_unmapped.1.fastq
-rw-r--r--. 1 antoniodg antoniodg 0 Jan 17 19:38 SRR31487054_unmapped.2.fastq
-rw-r--r--. 1 antoniodg antoniodg 0 Jan 17 19:38 SRR31487055.bam
-rw-r--r--. 1 antoniodg antoniodg 0 Jan 17 19:38 SRR31487055_unmapped.1.fastq
-rw-r--r--. 1 antoniodg antoniodg 0 Jan 17 19:38 SRR31487055_unmapped.2.fastq
-rw-r--r--. 1 antoniodg antoniodg 0 Jan 17 19:38 SRR31487056.bam
-rw-r--r--. 1 antoniodg antoniodg 0 Jan 17 19:38 SRR31487056_unmapped.1.fastq
-rw-r--r--. 1 antoniodg antoniodg 0 Jan 17 19:38 SRR31487056_unmapped.2.fastq
-rw-r--r--. 1 antoniodg antoniodg 0 Jan 17 19:38 SRR31487057.bam
-rw-r--r--. 1 antoniodg antoniodg 0 Jan 17 19:38 SRR31487057_unmapped.1.fastq
-rw-r--r--. 1 antoniodg antoniodg 0 Jan 17 19:38 SRR31487057_unmapped.2.fastq
-rw-r--r--. 1 antoniodg antoniodg 0 Jan 17 19:38 SRR31487058.bam
-rw-r--r--. 1 antoniodg antoniodg 0 Jan 17 19:38 SRR31487058_unmapped.1.fastq
-rw-r--r--. 1 antoniodg antoniodg 0 Jan 17 19:38 SRR31487058_unmapped.2.fastq
-rw-r--r--. 1 antoniodg antoniodg 0 Jan 17 19:38 SRR31487059.bam
-rw-r--r--. 1 antoniodg antoniodg 0 Jan 17 19:38 SRR31487059_unmapped.1.fastq
-rw-r--r--. 1 antoniodg antoniodg 0 Jan 17 19:38 SRR31487059_unmapped.2.fastq
-rw-r--r--. 1 antoniodg antoniodg 0 Jan 17 19:38 SRR31487060.bam
-rw-r--r--. 1 antoniodg antoniodg 0 Jan 17 19:38 SRR31487060_unmapped.1.fastq
-rw-r--r--. 1 antoniodg antoniodg 0 Jan 17 19:38 SRR31487060_unmapped.2.fastq

Then correcting the syntax errors as per below, I obtained files usable in the IGV visualization step.

./bit09-mapping-hisat2-PE.sh /home/antoniodg/3.1_trimmed2_sra/ /home/antoniodg/Geobacter_sulfurreducens/GenomeIndex/genome /home/antoniodg/5_hisat2_trimmed2_genome

Figure4: Files generated after correcting the syntax errors

3.2 Samtools filtering and Sorting

The generated BAM files from HISAT2 were further processed using Samtools to ensure they were sorted and indexed for efficient access during visualization and downstream analysis.

In this way I got sorted and filter BAM files ready for visualization in IGV. That could provide me more insights about the expression of the CsrA gene under different growth conditions, and more in general the expressions differences between the wildtype and the ΔcsrA mutant strains under the same growth conditions. Seeing if besides the genetic difference that can explain the regulation also the growth in environment that improve the electrogenic-biofilm formation affects the expression too compared with a standard growth medium.

3.3 Visualization with IGV

The aligned reads were visualized using the Integrative Genomics Viewer (IGV) to inspect the quality of alignments and to verify the expression patterns of key genes, including csrA.

The first figure show that there has been probably some error into sampling preparation as SRR31487053 it seems pretty under expressed compared with the same wildtype sample SRR31487054 grown under the same treatment conditons. Figure5: IGV visualization of csrA gene expression This can be related to the fact that there could have been some mistakes during the RNA extraction or library preparation steps, leading to lower read counts for that particular sample. Further investigation would be needed to confirm the exact cause of this discrepancy with the team responsible for the experimental work. Or it can be that even if the RNA extraction were done properly, during the sequencing step there could have been some issues that led to a lower yield of reads for that sample.

Another interesting finding is related to the expression of the gene GS_RS14920, one of the 6 genes that encodes for the rubretythrin family protein, that seems to be overexpressed in the csrA deletion mutant in MFC condition as per below.

Figure6: IGV visualization of GS_RS14920 gene expression In this case most probably the deletion of the CsrA gene is affecting the expression more than the environment itself. So, to confirm this hypothesis I have related these two samples with the same genotypes in different growth conditions, to see if the expression is affected more by the genotype or by the environment.

Figure7: IGV visualization of GS_RS14920 gene expression
In this case, the expression of GS_RS14920 is consistently higher in the ΔcsrA mutant (SRR31487055 and SRR31487059) compared to the wild-type (SRR31487054 and SRR31487058), regardless of the growth condition. This suggests that the deletion of csrA has a more pronounced effect on the expression of this gene than the environmental conditions, indicating that CsrA likely plays a direct regulatory role over GS_RS14920 expression.

The protein encoded by GS_RS14920 is part of the rubrerythrin family, which is known to be involved in oxidative stress response. The upregulation of this gene in the ΔcsrA mutant could imply that CsrA normally represses its expression, and in its absence, the cell may be compensating for increased oxidative stress or altered redox conditions, particularly relevant in the context of extracellular electron transfer in MFCs.

3.4 Counting with HTSeq

HTSeq was used to count the number of reads mapping to each gene in the Geobacter sulfurreducens genome. The following command was executed for each BAM file using th script present in the Sanger server.

I have run the script with the CDS option as my organism doesn’t have exons and introns like for the Eukaryotes.So, I have used the coding sequence, although probably I could have assumed the count like the entire genome is an big exon. Then I have tried to use the option intersect-strict and union to see if there were differences in the count.

3.5 Generating Count Matrix for No Feature

SAMPLE	ORIGINAL TOTAL READINGS	READINGS AFTER TRIMMING	% Retention	No Feature Union	No Feature Intersection-strict	Quality Union (%)	Quality Intersection-strict (%)
SRR31487053	34,277,504	17,144,376	50.00%	55	57	99.9997%	99.9997%
SRR31487054	36,571,658	18,283,944	50.00%	86,770	94,004	99.525%	99.486%
SRR31487055	33,776,904	16,888,355	50.00%	126,857	136,080	99.249%	99.194%
SRR31487056	32,982,970	16,491,485	50.00%	85,218	91,003	99.483%	99.448%
SRR31487057	35,637,552	17,833,808	50.00%	37,712	41,325	99.789%	99.768%
SRR31487058	30,592,208	15,290,104	50.00%	32,036	34,688	99.791%	99.773%
SRR31487059	40,614,636	24,812,085	61.10%	71,547	76,554	99.712%	99.691%
SRR31487060	37,826,652	23,718,026	62.70%	83,045	89,005	99.650%	99.625%

Overall the difference is slight, but the intersection-strict option is more conservative, leading to slightly fewer reads being counted for some genes. This could be important for downstream differential expression analysis, as it may affect the sensitivity and specificity of detecting true changes in gene expression.Furthermore in the output files made with the union parameter, I have found some ambiguous reads, that I have not found in the intersection-strict output files. So, I have decided to use the intersection-strict output files for the downstream analysis to be more conservative and avoid ambiguous reads that could affect the results.

Another interesting aspect to note is the reading before trimming and after trimming (trimmed2 with adapter + rRNA). The percentage of retention is around 50% for most of the samples, except for the samples SRR31487059 and SRR31487060 that have a retention of 61.10% and 62.70% respectively. This indicates that the trimming process removed a significant portion of the reads, likely due to the presence of adapter sequences and rRNA contamination. The higher retention in these two samples could suggest that they had less contamination or better quality reads to begin with.

Using the scripts in R provided, I have built the files needed for the Differental Expression analysis, specifically all_counts.txt and summary_counts.txt. I have done this process on Windows 11 25H2, so I needed to tune the script for the folder path as it seems that “” is not well interpreted in R on Windows, so I have used “/” instead to identify the path, and in this way the script worked.

4. Differential Expression Analysis

4.1 Data Import and Normalization

The count data generated by HTSeq was imported into R for differential expression analysis. The data was organized into a matrix format, with genes as rows and samples as columns. Normalization of the count data was performed using the TMM (Trimmed Mean of M-values) method implemented in the edgeR package to account for differences in library sizes and sequencing depth across samples.

4.1.1 Tuning of the R script for my case

The script provided, needed to be adjusted to my specific case, as in this report I have actually 4 conditions and 2 samples each, but in order to fit with the requirements of the report, as suggested by the supervisor, I needed to change the point of view on the treatment, and in that case I 2 treatments with 4 samples each. So, when I built the metadata columns I have added an extra column for the treatments and a new identifier as per below.

Figure8: Metadata table for the DE analysis In this way, I can have a column called “treatment” with 2 levels: “MFC” and “STD”, and a column called “genotype” with 2 levels: “WT” and “DT” (deletion type). This will allow me to perform the differential expression analysis based on the treatment conditions, while still accounting for the genotype differences.

So, here below the data samples organised by treatment.

Figure9: Samples organized by treatment

Below a further insight on the data count per million, and a comparison between the raw data and the filtered data. The measure used is Log-CPM (Counts Per Million) that is a common way to normalize RNA-seq data for differences in library size.

Figure10: Log-CPM before and after filtering Another interesting aspect is to compare the distribution of the data, before the normalisation as it is crucial that the normalisation doesn’t affect that in order to preserve a better biological meaning of the data.

Figure11: Data distribution before and after normalisation

4.2 MDS Plot

Multidimensional scaling (MDS) analysis of RNA-seq data revealed clear separation of samples primarily by growth condition along treatment parameter, which explained the majority of transcriptional variance. Samples clustered into two distinct groups: biofilm grown on graphite electrodes in microbial fuel cells (MFC) versus biofilm grown on glass supports in NBAF medium. Within each treatment group, secondary separation by genotype (wild type vs csrA-deletion mutant) was observed along Dimension 2. This visualization confirms that growth environment is the dominant factor shaping global gene expression profiles, with genetic background exerting additional but secondary influence.This demonstrates that the shift to an electrogenic, electrode-associated biofilm state drives profound transcriptional reprogramming.

Figure12: MDS plot of the samples

4.3 Differential Expression Analysis - Design of the Matrix

In order to start with this part of the analysis, I had to modify a bit the script as per below. Figure13: Design matrix modification for the DE analysis

In this way, I can compare the two treatments directly. The next step was to remove the heteroscedasticity from the data, in order to have a better statistical power in the analysis.

Figure14: Heteroscedasticity removal This step is crucial as it ensures that the variance of gene expression levels is consistent across different conditions, which is a key assumption for many statistical tests used in differential expression analysis. By applying this transformation, we can improve the reliability of our results and reduce the likelihood of false positives or negatives.

Another interesting way to have a better overview of the Differental Expression Analysis of the genes is the volcanic plot, that gives a better idea of the genes that are upregulated and downregulated between the two treatments, as per below.

Figure15: Volcanic plot of DE genes between treatments The fold changes range from -5.8 to +4.0, indicating substantial variation in gene expression levels between compared conditions. However, all p-values appear to be non-significant (p = 1.0), as reflected by the -log₁₀(P-value) values of zero. This suggests either that no statistically significant differences were detected under the given thresholds, or that there may be an artifact in the statistical computation or data representation requiring further validation that we can investigate

4.4 Heatmap of the Top 10 DE Genes

A heatmap was generated to visualize the expression patterns of the top 10 differentially expressed genes across all samples. This visualization highlights the relative expression levels of these genes, allowing for easy identification of patterns associated with each treatment condition.

Figure16: Heatmap of the top 10 DE genes

Key Findings from the Heatmap: - GS_RS02340: Showed consistent downregulation across MFC conditions. - GS_RS01010: Showed specific upregulation. - GS_RS02335 & GS_RS05390: Exhibited clear directional changes (increase and decrease, respectively) in the MFC environment. - Six other genes in the top 10 list maintained near-baseline expression, indicating they are not responsive to this environmental shift.

5. Conclusions

5.1 Summary of Findings

This re-analysis successfully reproduced key findings from the original study and provided additional layers of interpretation.

Environment is the Primary Transcriptional Driver: The stark separation by condition in the MDS plot underscores that G. sulfurreducens undergoes a major metabolic and physiological rewiring when forming an electrogenic biofilm on an anode. Any study of EET must account for this dominant effect.
CsrA as a Context-Dependent Modulator: While environment is primary, the consistent separation by genotype within each cluster proves CsrA’s regulatory influence is real and active in both conditions. It does not trigger a new state but modifies the transcriptional program within a given environment. This aligns with the original study’s phenotypic data where biofilm amount and current increased, but fundamental respiration pathways remained functional.
Functional Implications of Identified Genes: The downregulation of GS_RS02340 in MFCs warrants investigation into its annotation; if it is involved in a non-essential pathway under electrode respiration, its repression could be a metabolic streamlining strategy. The upregulation of the oxidative stress-related GS_RS14920 in the ΔcsrA mutant provides a plausible mechanistic link: loss of CsrA repression may enhance the cell’s ability to manage reactive oxygen species, potentially facilitating more robust biofilm formation and higher current densities, as observed phenotypically.
Technical Considerations & Limitations: The identified outlier sample (SRR31487053) highlights the importance of rigorous QC. Furthermore, an initial simplistic contrast (all MFC vs. all STD) failed to detect statistically significant DEGs due to cross-cutting genotype effects. This underscores the necessity of using factorial models (~ Genotype + Condition) or interaction tests to properly dissect complex experimental designs.

5.2 Final Remarks

This project provides a validated, reproducible bioinformatics pipeline for the analysis of Geobacter sulfurreducens RNA-seq data. The analysis confirms that: - The MFC growth environment is the strongest determinant of global gene expression. - The CsrA regulator significantly modifies the transcriptome within both standard and electrogenic conditions. - Specific candidate genes (e.g., GS_RS02340, GS_RS14920) have been identified as being under environmental and CsrA-dependent control.

References

Original Study: DOI:10.3389/fmicb.2025.1534446
RNA-seq Data: NCBI GEO GSE282747, SRA PRJNA1190357.
Genome Reference: Geobacter sulfurreducens PCA, GCF_000007985.2.