1_Filtering

This script uses the “Rfastp” package from the Bioconductor project to filter and trim raw sequencing files.

First, you will need to set your working directory. This should be the folder where your raw sequencing files are stored.

setwd("E:/RNA-seq analysis/Raw files")
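Before changing directory, it can help to confirm that the folder exists and actually contains compressed FASTQ files. The following is a minimal sketch in base R; the helper name `check_fastq_dir` and the temporary demo folder are hypothetical and used only for illustration.

```r
# Hypothetical helper: stop early if the folder is missing,
# and warn if it holds no .fastq.gz files.
check_fastq_dir <- function(path) {
  if (!dir.exists(path)) stop("Folder not found: ", path)
  n <- length(list.files(path, pattern = "\\.fastq\\.gz$"))
  if (n == 0) warning("No .fastq.gz files found in: ", path)
  invisible(n)  # number of FASTQ files found
}

# Demonstration on a temporary folder with two empty placeholder files
demo_dir <- file.path(tempdir(), "raw_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("Sample_1_R1.fastq.gz", "Sample_1_R2.fastq.gz")))
n_found <- check_fastq_dir(demo_dir)
```

In real use, you would call the helper on your own raw-data folder and then pass the same path to setwd().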

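If this is your first time running this code, you will need to install BiocManager (the Bioconductor package installer) as well as the Rfastp package. Lines beginning with ## show console output from the run that produced this document; your output will differ depending on your R and Bioconductor versions.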

if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager")

## Bioconductor version '3.14' is out-of-date; the current release version '3.15'
##   is available with R version '4.2'; see https://bioconductor.org/install

BiocManager::install()

## Bioconductor version 3.14 (BiocManager 1.30.17), R 4.1.3 (2022-03-10)

## Installation paths not writeable, unable to update packages
##   path: C:/Program Files/R/R-4.1.3/library
##   packages:
##     cluster, MASS, Matrix, mgcv, nlme, survival

## Old packages: 'cli', 'matrixStats'

BiocManager::install("Rfastp")

## Bioconductor version 3.14 (BiocManager 1.30.17), R 4.1.3 (2022-03-10)

## Warning: package(s) not installed when version(s) same as current; use `force = TRUE` to
##   re-install: 'Rfastp'

## Installation paths not writeable, unable to update packages
##   path: C:/Program Files/R/R-4.1.3/library
##   packages:
##     cluster, MASS, Matrix, mgcv, nlme, survival
## Old packages: 'cli', 'matrixStats'

After the package has been installed, load it into R with the following command.

library("Rfastp")

## Rfastp is a wrapper of fastp project: https://github.com/OpenGene/fastp
## 
## Please cite fastp in your publication:
## Shifu Chen, Yanqing Zhou, Yaru Chen, Jia Gu; fastp: an ultra-fast all-in-one 
##     FASTQ preprocessor, Bioinformatics, Volume 34, Issue 17, 1 September 2018,
##     Pages i884-i890, https://doi.org/10.1093/bioinformatics/bty560

Next, we want to list our sequencing files in R. All the sequencing files should be compressed FASTQ files, so their names should end in “.fastq.gz”. The rest of the file name does not matter. However, if these are paired reads, the two files of a pair must share the same name and differ only in the R1 or R2 label, e.g. “Sample_1_R1.fastq.gz” and “Sample_1_R2.fastq.gz”. The code below pairs files by their position in the alphabetically sorted list, so the pairs are only matched correctly if their names are otherwise identical.

# Escape the dots so the pattern matches a literal ".fastq.gz" at the end of the name
fastq.files <- list.files(pattern = "\\.fastq\\.gz$", full.names = FALSE)
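Pairing by position works as long as every R1 file sorts directly before its R2 mate. A more explicit alternative is to match the files by sample name and fail loudly when a mate is missing. The following is a minimal sketch in base R, assuming the `_R1`/`_R2` naming convention described above; the helper name `pair_fastq_files` is hypothetical and not part of Rfastp.

```r
# Hypothetical helper (not part of Rfastp): pair R1/R2 files by sample name.
# Assumes file names end in "_R1.fastq.gz" or "_R2.fastq.gz".
pair_fastq_files <- function(files) {
  r1 <- grep("_R1\\.fastq\\.gz$", files, value = TRUE)
  r2 <- grep("_R2\\.fastq\\.gz$", files, value = TRUE)
  # Sample name = file name with the read suffix removed
  s1 <- sub("_R1\\.fastq\\.gz$", "", r1)
  s2 <- sub("_R2\\.fastq\\.gz$", "", r2)
  unpaired <- c(setdiff(s1, s2), setdiff(s2, s1))
  if (length(unpaired) > 0) {
    stop("Samples missing a mate file: ", paste(unpaired, collapse = ", "))
  }
  # Sort by sample name and align each R2 file to its R1 mate
  o <- order(s1)
  data.frame(sample = s1[o],
             read1  = r1[o],
             read2  = r2[match(s1[o], s2)],
             stringsAsFactors = FALSE)
}

# Example with made-up file names in scrambled order
example_files <- c("Sample_2_R2.fastq.gz", "Sample_1_R1.fastq.gz",
                   "Sample_1_R2.fastq.gz", "Sample_2_R1.fastq.gz")
pairs <- pair_fastq_files(example_files)
print(pairs)
```

Each row of the resulting data frame holds one sample with its two read files, so a loop over the rows no longer depends on the order in which the files were listed.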

Then, the loop below steps through the FASTQ files two at a time and runs the rfastp() function on each pair. This function filters and trims the sequencing files and performs quality checks on each one. The output of rfastp() is:

1. An HTML file containing the QC report
2. A JSON file containing QC summary statistics, which rfastp() also returns to R as an object
3. FASTQ files - these are the filtered files and will have the prefix “filtered_” to differentiate them from the original FASTQ files

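# list.files() returns names in alphabetical order, so each R1 file is followed by its R2 mate
stopifnot(length(fastq.files) > 0, length(fastq.files) %% 2 == 0)
json_reports <- list()  # one QC report per sample pair
for (i in seq(1, length(fastq.files), 2)){
  pair1 <- fastq.files[i]
  pair2 <- fastq.files[i+1]
  output_name <- paste0("filtered_", fastq.files[i])
  # Keep every report instead of overwriting it on each pass
  json_reports[[output_name]] <- rfastp(read1 = pair1, read2 = pair2, outputFastq = output_name)
}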
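The filtered files written by rfastp() also end in “.fastq.gz” and sit in the same folder as the raw files, so listing the folder again after a run would pick them up alongside the originals. The following is a minimal sketch of excluding them by their “filtered_” prefix before a rerun; the file names are made up for illustration.

```r
# Made-up listing: raw files plus outputs left over from an earlier run
all_files <- c("Sample_1_R1.fastq.gz", "Sample_1_R2.fastq.gz",
               "filtered_Sample_1_R1.fastq.gz", "filtered_Sample_1_R2.fastq.gz")

# Keep only the files that do not start with the "filtered_" prefix
raw_files <- all_files[!startsWith(all_files, "filtered_")]
print(raw_files)
```

Applied to this script, the same test would be run on fastq.files right after the list.files() call.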


2022-04-27