The SRA pipeline

The SRA_pipeline.py script is a custom Python script written by Alex Chitsazan that can be used on the main Reh lab computer. It downloads the FASTQ files of interest, places them in a project directory, and aligns them using our standard ATAC-seq pipeline. Some starter information about the script can be accessed by typing SRA_pipeline.py -h in the terminal.

Step 1 - Downloading your SRR list for the GEO repository of interest

The first thing you need to do is download a list of accession numbers corresponding to the project of interest.

A. Go to the SRA Run Selector website

Go to https://trace.ncbi.nlm.nih.gov/Traces/study/?go=home

[Screenshot: SRA_RunSelectorOverview]

B. Find your Project ID

Go to the GEO search engine, find the project you are working on, and locate its project ID. An example is shown below (either of the labeled IDs works).

[Screenshot: GEORun]

C. Type in your Project ID to Run Selector

Type the project ID found in step B into the Run Selector page. There you can download the SRR_Acc_List.txt file by clicking the button labeled in red under download. This file should contain a newline-separated list of SRR accession numbers corresponding to the samples of interest. Save it in a project directory of your choosing (I have created a folder called SRA on the Genomics_Master_Drive hard drive if you want to put your project directory there).

[Screenshot: RunSelector]
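Before starting a long download, it can be worth checking that the saved file really has the format described above (one SRR accession per line). The following is a minimal sketch of such a check, not part of SRA_pipeline.py. The function name parse_accession_list is hypothetical, and the pattern assumes every entry starts with SRR, as in this example project.

```python
import re

# Assumption: every run accession is "SRR" followed by digits (e.g. SRR3662499).
SRR_PATTERN = re.compile(r"SRR\d+")


def parse_accession_list(text):
    """Return the SRR accessions found in newline-separated text.

    Blank lines are skipped; any other line that is not an SRR accession
    raises ValueError so a bad download is caught early.
    """
    accessions = []
    for line_number, raw_line in enumerate(text.splitlines(), start=1):
        entry = raw_line.strip()
        if not entry:
            continue  # tolerate empty lines, e.g. a trailing newline
        if not SRR_PATTERN.fullmatch(entry):
            raise ValueError(
                f"line {line_number}: {entry!r} is not an SRR accession"
            )
        accessions.append(entry)
    return accessions
```

To check a downloaded file, read it into a string first (for example with open("SRR_Acc_List.txt").read()) and pass the contents to parse_accession_list.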

D. Select specific files you care about (OPTIONAL)

Note that the download labeled in red in step C contains ALL samples from the GEO project. If you only care about some of the files (in the example case, only the rod ATAC-seq files), select the files of interest and download only that subset (download labeled in red).

[Screenshot: RunSelector]

Step 2 - Run the pipeline!

A. Run the SRA_pipeline.py command in the terminal

Now you should have everything you need to run the SRA_pipeline.py command! The most basic form of the command is:

SRA_pipeline.py -o /Volumes/Genomics_Master_Drive/Genomics/SRA/RodTest -s SRR_Acc_List.txt
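If you want to launch the pipeline from another Python script rather than typing the command, the basic call can be assembled as an argument list. This is only a sketch: build_pipeline_command is a hypothetical helper, and it uses just the -o and -s options shown in the example above.

```python
import shlex


def build_pipeline_command(project_dir, accession_file):
    """Return the argument list for the most basic SRA_pipeline.py call."""
    return ["SRA_pipeline.py", "-o", str(project_dir), "-s", str(accession_file)]


command = build_pipeline_command(
    "/Volumes/Genomics_Master_Drive/Genomics/SRA/RodTest",
    "SRR_Acc_List.txt",
)

# Print the command as it would be typed in the terminal.
print(shlex.join(command))
```

Passing the list to subprocess.run(command, check=True) would then run it, assuming SRA_pipeline.py is on the PATH of the lab computer. Keeping the arguments as a list avoids quoting problems if a project directory name ever contains spaces.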

Here is the output of SRA_pipeline.py -h, which shows all the available parameters.

[Screenshot: Help]

Details

Here are the actual programs that SRA_pipeline.py runs under the hood, in case you want to run them manually or are curious:

 ## Fetching SRAs using prefetch
prefetch --option-file SRR_Acc_List.txt


 ## Converting to fastq
mkdir /Volumes/Genomics_Master_Drive/Genomics/SRA/RodTest/rawFastq
fastq-dump /Volumes/Genomics_Master_Drive/Genomics/SRA/raw/sra/SRR3662499.sra --outdir /Volumes/Genomics_Master_Drive/Genomics/SRA/RodTest/rawFastq/ --split-files --gzip -v
fastq-dump /Volumes/Genomics_Master_Drive/Genomics/SRA/raw/sra/SRR3662500.sra --outdir /Volumes/Genomics_Master_Drive/Genomics/SRA/RodTest/rawFastq/ --split-files --gzip -v
fastq-dump /Volumes/Genomics_Master_Drive/Genomics/SRA/raw/sra/SRR3662501.sra --outdir /Volumes/Genomics_Master_Drive/Genomics/SRA/RodTest/rawFastq/ --split-files --gzip -v
fastq-dump /Volumes/Genomics_Master_Drive/Genomics/SRA/raw/sra/SRR3662502.sra --outdir /Volumes/Genomics_Master_Drive/Genomics/SRA/RodTest/rawFastq/ --split-files --gzip -v

 ## Fixing Naming Convention
fixSRAPairedNames.sh

 ## Generating Alignment Scripts
align_paired_ATAC.py -a nextera -f /Volumes/Genomics_Master_Drive/Genomics/SRA/RodTest/rawFastq/ -p 2 -b /Volumes/Genomics_Master_Drive/Genomics/SRA/RodTest/rawFastq/../BAM/ -i /Applications/bowtie2/mm9bt2/mm9

 ## Running Alignment
/Volumes/Genomics_Master_Drive/Genomics/SRA/RodTest/rawFastq/runAll.sh
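
The fastq-dump step above repeats the same command once per accession. When doing the conversion manually for a longer list, the commands can be generated instead of typed out. The sketch below is an assumption about how one might do that, not the code inside SRA_pipeline.py. The function name fastq_dump_commands is hypothetical; the fastq-dump options are the ones shown in the listing above.

```python
from pathlib import PurePosixPath


def fastq_dump_commands(accessions, sra_dir, out_dir):
    """Build one fastq-dump argument list per accession.

    sra_dir is the directory holding the prefetched .sra files and
    out_dir is the rawFastq directory the FASTQ files are written to.
    """
    commands = []
    for accession in accessions:
        sra_file = PurePosixPath(sra_dir) / f"{accession}.sra"
        commands.append([
            "fastq-dump", str(sra_file),
            "--outdir", str(out_dir),
            "--split-files",  # write paired-end mates to separate files
            "--gzip",         # compress the output
            "-v",             # verbose, as in the listing above
        ])
    return commands
```

Each returned list can be passed to subprocess.run(cmd, check=True) after the rawFastq directory has been created.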

SRA_pipeline for paired end ATAC-seq Files

Alex Chitsazan