Accessing fastq files from Sequence Read Archive (SRA)

0.1 Run selector
0.2 Downloading a single SRR
0.3 Parallelizing the SRR download of multiple FASTQ files
0.4 GitHub
0.5 wget
0.6 tar
0.7 Rpubs: https://rpubs.com/Charleen_D_Adams/761244

0.1 Run selector

On GEO, search for an accession number (e.g., “GSE146620” for the chicken methylation clock) and open the SRA Run Selector for that series.

In the Run Selector, download the Accession List to your desktop. Copy the contents of the downloaded file into a new file on the cluster:

$ mkdir -p /n/holyscratch01/lemos_lab/Lab/chicken    # make a new directory

$ cd /n/holyscratch01/lemos_lab/Lab/chicken  # change to that directory

$ vim SRR_Acc_List.txt   # paste into this new file and save
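
Pasting a list from a desktop file can carry over Windows line endings, blank lines, or stray spaces, any of which would later be passed to fastq-dump as part of the accession. The sketch below is one way to tidy the list first. The function name `clean_acc_list` is hypothetical, and the pattern assumes run accessions of the form SRR, ERR, or DRR followed by digits.

```shell
# Hypothetical helper: print one clean run accession per line from a pasted list.
# Usage: clean_acc_list SRR_Acc_List.txt
clean_acc_list() {
    # Remove carriage returns, trim surrounding whitespace, and keep only
    # lines that look like run accessions (SRR/ERR/DRR followed by digits).
    tr -d '\r' < "$1" \
        | sed 's/^[[:space:]]*//; s/[[:space:]]*$//' \
        | grep -E '^[SED]RR[0-9]+$'
}
```

The output can be redirected to a new file and compared with the original to see whether anything was dropped.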

SRA Toolkit also writes cache files, and by default these go to your home directory even when you are working in the lab’s scratch folders. To avoid running out of home-directory storage, write a short configuration file that tells SRA Toolkit to put its cache in scratch space instead.

## navigate to the lab's scratch space
cd /n/holyscratch01/lemos_lab/Lab/chicken

## make a directory for ncbi configuration settings
## (SRA Toolkit reads its user settings from ~/.ncbi, so the file goes in your home directory)
mkdir -p ~/.ncbi

## write configuration file with a line that redirects the cache to scratch
echo '/repository/user/main/public/root = "/n/holyscratch01/lemos_lab/Lab/chicken/sra-cache"' > ~/.ncbi/user-settings.mkfg
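
To confirm the redirect line was written as intended, the settings file can be read back. This is a minimal sketch: `sra_cache_root` is a hypothetical helper, not part of SRA Toolkit, and it only parses the file. It does not prove that the toolkit is actually using that file.

```shell
# Hypothetical helper: print the cache root recorded in an SRA Toolkit settings file.
# Usage: sra_cache_root /path/to/user-settings.mkfg
sra_cache_root() {
    # Extract the quoted value from the /repository/user/main/public/root line.
    sed -n 's|^/repository/user/main/public/root *= *"\(.*\)" *$|\1|p' "$1"
}
```

Pass it the path of the settings file you wrote. Empty output means the line is missing or malformed.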

0.2 Downloading a single SRR

For a single SRR, use the fastq-dump command from NCBI’s SRA Toolkit.

module load sratoolkit/2.8.0-fasrc01 

## fastq-dump downloads the run and writes it as FASTQ; it needs only the SRR number and an internet connection
fastq-dump SRR11278846
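
Once the download finishes, a quick sanity check is to count the reads in the resulting file. The sketch below assumes an uncompressed FASTQ with the usual four lines per record. The function name `count_fastq_reads` is hypothetical.

```shell
# Hypothetical helper: count reads in an uncompressed FASTQ file.
# Assumes four lines per record (header, sequence, '+', qualities).
# Usage: count_fastq_reads SRR11278846.fastq
count_fastq_reads() {
    local lines
    [ -r "$1" ] || { echo "cannot read $1" >&2; return 1; }
    # wc may pad its output with spaces, so strip whitespace before doing arithmetic.
    lines=$(wc -l < "$1" | tr -d '[:space:]')
    if [ $((lines % 4)) -ne 0 ]; then
        echo "warning: $1 has $lines lines, which is not a multiple of 4" >&2
        return 1
    fi
    echo $((lines / 4))
}
```

A line count that is not a multiple of four usually points to a truncated or interrupted download.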

0.3 Parallelizing the SRR download of multiple FASTQ files

The first script runs fastq-dump on a single SRR number, which it receives as its first argument.

$ vim inner_script.slurm

#!/bin/bash
#SBATCH -t 6:00:00           # Runtime - asking for 6 hours
#SBATCH -p shared            # Partition (queue) - asking for shared queue
#SBATCH -J sra_download      # Job name
#SBATCH -o run_%j.o          # Standard out (%j = job ID, so parallel jobs do not overwrite each other)
#SBATCH -e run_%j.e          # Standard error
#SBATCH --cpus-per-task=1    # CPUs per task
#SBATCH --mem-per-cpu=8G     # Memory needed per core
#SBATCH --mail-type=NONE     # No email notifications

module load sratoolkit/2.8.0-fasrc01 

# for single end reads only
fastq-dump "$1"

# for paired end reads only
# fastq-dump --split-3 "$1"

NOTE: Downloading Paired End Data: SRA can store both reads of a paired-end run in a single SRR record. By default, fastq-dump writes the two mates of each pair concatenated in one file, so paired-end data need to be split at download.

SRA Toolkit has an option for this called --split-files. With it, a single SRR downloads as SRRxxx_1.fastq and SRRxxx_2.fastq.

The --split-3 option splits an SRR into up to three files: one for read 1, one for read 2, and one for any orphan reads (i.e., reads whose mate is missing).
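
After a paired-end download with --split-3, the two mate files are expected to contain the same number of reads, since unpaired reads go to the third file. The sketch below compares them. `check_pairs` is a hypothetical helper, and it assumes uncompressed files named with the SRRxxx_1.fastq and SRRxxx_2.fastq pattern described above.

```shell
# Hypothetical helper: compare line counts of the two mate files for one SRR.
# Usage: check_pairs SRRxxx [directory]   (directory defaults to the current one)
check_pairs() {
    local srr="$1" dir="${2:-.}" n1 n2
    local f1="$dir/${srr}_1.fastq" f2="$dir/${srr}_2.fastq"
    # Both mate files must exist and be readable before counting.
    if [ ! -r "$f1" ] || [ ! -r "$f2" ]; then
        echo "$srr: missing mate file" >&2
        return 1
    fi
    n1=$(wc -l < "$f1" | tr -d '[:space:]')
    n2=$(wc -l < "$f2" | tr -d '[:space:]')
    if [ "$n1" = "$n2" ]; then
        echo "$srr: OK ($((n1 / 4)) read pairs)"
    else
        echo "$srr: MISMATCH ($n1 vs $n2 lines)" >&2
        return 1
    fi
}
```

With --split-files a mismatch does not necessarily mean a failed download, because reads without a mate are not separated out.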

The second script loops through the list of SRRs by calling the first script from within the loop.

$ vim sra_fqdump.slurm

#!/bin/bash
#SBATCH -t 6:00:00           # Runtime
#SBATCH -p test              # Partition (queue)
#SBATCH -J your_job_name     # Job name
#SBATCH -o run.o             # Standard out
#SBATCH -e run.e             # Standard error
#SBATCH --cpus-per-task=1    # CPUs per task
#SBATCH --mem-per-cpu=8G     # Memory needed per core
#SBATCH --mail-type=NONE     # No email notifications

# for every SRR in the list of SRRs file
for srr in $(cat SRR_Acc_List.txt)
do
    # submit the inner script as its own job, passing it the next SRR number in the file
    sbatch inner_script.slurm "$srr"
done

Because the outer script submits the inner script once per SRR, each download runs as its own job, so the files download in parallel as far as the scheduler allows.

The following will run the main (second) script on Odyssey (now called “Cannon”):

$ sbatch sra_fqdump.slurm
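
With many jobs running at once, some downloads may fail without being noticed. The sketch below lists accessions that have no FASTQ output yet, so they can be resubmitted. `missing_srrs` is a hypothetical helper. It assumes outputs are named SRRxxx.fastq (single-end) or SRRxxx_1.fastq (paired-end), and it treats any non-empty file as present.

```shell
# Hypothetical helper: print accessions from a list that have no FASTQ output yet.
# Usage: missing_srrs SRR_Acc_List.txt [directory]   (directory defaults to the current one)
missing_srrs() {
    local list="$1" dir="${2:-.}" srr
    # The extra test lets the loop handle a last line that has no trailing newline.
    while IFS= read -r srr || [ -n "$srr" ]; do
        # Drop any carriage return left over from a Windows-formatted list.
        srr=$(printf '%s' "$srr" | tr -d '\r')
        [ -z "$srr" ] && continue
        # Report the accession if neither a single-end nor a paired-end file exists.
        if [ ! -s "$dir/$srr.fastq" ] && [ ! -s "$dir/${srr}_1.fastq" ]; then
            echo "$srr"
        fi
    done < "$list"
}
```

Redirecting the output to a file gives a shorter accession list that the same loop can be pointed at. A non-empty file does not guarantee a complete download, so this is only a first check.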

0.4 GitHub

git init

echo "## Getting SRA files into the cluster

This repo contains the scripts for obtaining 'GSE146620' fastq files from the chicken methylation clock paper. 51 files (for all tissues and timepoints) are now available at: 
/n/holyscratch01/lemos_lab/Lab/chicken 

<hr style='border:2px solid green'>

![chicken paper](https://github.com/LabLemos/SRA_fastq/blob/main/chicken.png)

<hr style='border:2px solid green'>" >> README.md

git add README.md
git commit -m "fastqs from SRA"

git branch -M main
git remote add origin https://github.com/LabLemos/SRA_fastq.git
git push -u origin main

git add chicken.png
git commit -m "chicken.png"
git push --all

git add SRA_fastq_lablemos.Rmd
git commit -m "SRA_fastq_lablemos.Rmd"
git push --all

0.5 wget

For data that were not deposited in SRA but are hosted on an FTP site (here, a GEO supplementary file), download the files with wget:

wget --recursive --no-parent -nd https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM393nnn/GSM393182/suppl/GSM393182_input_ArrayG_080522.CEL.gz
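
The URL above follows GEO's directory layout, in which a sample's parent directory is the accession with its last three digits replaced by "nnn". The sketch below builds the supplementary-file directory URL from a GSM accession. `geo_suppl_url` is a hypothetical helper, and it assumes an accession with at least four digits.

```shell
# Hypothetical helper: build the supplementary-file directory URL for a GEO sample.
# Assumes the parent directory is the accession with its last three digits replaced by "nnn".
# Usage: geo_suppl_url GSM393182
geo_suppl_url() {
    local acc="$1"
    # Drop the last three characters and append "nnn" (GSM393182 becomes GSM393nnn).
    local parent="${acc%???}nnn"
    echo "https://ftp.ncbi.nlm.nih.gov/geo/samples/${parent}/${acc}/suppl/"
}
```

Passing the resulting directory URL to wget with the same --recursive --no-parent -nd options should fetch that sample's supplementary files rather than one named file.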

0.6 tar


tar -xvf GSE111889_RAW.tar 

for all in *.gz; do gunzip "$all"; done
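
The two commands above unpack into the current directory. As a variation, the sketch below extracts an archive into its own directory and then decompresses any .gz files inside it, which keeps the extracted files separate from scripts and other data. `unpack_raw` is a hypothetical helper, and the default output directory (the archive name without .tar) is an assumption.

```shell
# Hypothetical helper: extract a tar archive into its own directory and gunzip its contents.
# Usage: unpack_raw GSE111889_RAW.tar [output_directory]
unpack_raw() {
    # Default output directory: the archive path with the .tar suffix removed.
    local archive="$1" outdir="${2:-${1%.tar}}"
    mkdir -p "$outdir" || return 1
    tar -xf "$archive" -C "$outdir" || return 1
    # Decompress every .gz file found, including any in subdirectories.
    find "$outdir" -type f -name '*.gz' -exec gunzip {} +
}
```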

Script adapted from a public file by Kayleigh Rutherford & Mary Piper

2021-04-27
