01-get-data

Author

Sarah Tanja

Published

May 16, 2023

Downloading RNA sequence files

Goals: - Identify samples (accession numbers) I want to analyze in SRA Run Selector - curl sample RNA sequences data files (.fastq) from NCBI BioProject into a large storage drive - Check the files for integrity using md5sum

Environment:

RStudio Server hosted on a Unix OS (Roberts Lab raven server)

Install Packages

Code

if ("tidyverse" %in% rownames(installed.packages()) == 'FALSE') install.packages('tidyverse')
if ("SRAdb" %in% rownames(installed.packages()) == 'FALSE') BiocManager::install("SRAdb")

Load packages

Code

library(SRAdb)
library(tidyverse)

Get sample metadata

First we want to get an idea of the files we are downloading and the samples that generated the data. We will start by looking at the metadata for the samples.

Code

# navigate to data directory
cd ../data

# download metadata from the git repo into the data directory
curl -O https://raw.githubusercontent.com/AHuffmyer/EarlyLifeHistory_Energetics/master/Mcap2020/Data/TagSeq/Sample_Info.csv

Code

#pull data into R and rename it metadata 
metadata <- read_csv("../data/Sample_Info.csv")

Check file integrity with `md5sum`

Info

Learn How to Generate and Verify Files with MD5 Checksum in Linux

Code

cd ../data
md5sum Sample_Info.csv > md5.transferred

Code

cd ../data
cmp Sample_Info.csv md5.transferred

These files differ by 1 byte, and I haven’t yet figured out why… possibly a windows vs unix thing

Code

#look at the metadata
head(metadata)

There are 39 samples (rows) with 8 metadata columns in this tibble dataset (AH1 - AH39). These samples are Montipora capitata coral taken at different life-stages (denoted by column names time-stage and code ), and RNA extracted and sequenced using Tag-Seq.

M. capitata development diagram by A. Huffmyer

Get TagSeq FASTQ files from National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA)

Info

Checkout some background info on The Sequence Read Archive (SRA) and the SRA factsheet

NCBI sequences have been trimmed and cleaned and are ready for alignment and analysis.

Use SRA Run Selector to make a .txt file list of Run (SRR) numbers

“Results are called runs (SRR). Runs comprise the data gathered for a sample or sample bundle and refer to a defining experiment.”

First, we need to obtain RNAseq files from NCBI from the project PRJNA900235. In order to do this we need a list of SRR numbers that identify the specific sequence files for RNA. Go to the SRA Run Selector for BioProject 900235 and select the files of interest.

Info

These sequence files are single-stranded RNA

To make things easier and faster, we’re going to compare 2 life-history groups, with an n=4 for each group. In this case, let’s compare 4 Embryos (an early stage) to 4 Attached Recruits (a later stage).

SRA run selector, downloading 8 fastq files (4 Embryo, 4 Attached Recruits)

Select the files of interest, click ‘Accession List’ and the list of file run ID’s will be downloaded in a text file named SRR_Acc_list.txt , upload this list into the data folder.

Navigate to where you will download the fastq files

Make sure this space has lots of storage

Code

# move to large data hard-drive 
cd /home/shared/8TB_HDD_01
# make a new directory named 'mcap', short for Montipora capitata
mkdir -p mcap

Download the fastq files

Roberts Lab Resources Github issue#1569 thread

Using sratoolkit.3.0.2-ubuntu64 which is already downloaded in /home/shared folder

Danger

The following code will take some time, run it and go take a wee break

Code

/home/shared/sratoolkit.3.0.2-ubuntu64/bin/./fasterq-dump \
--outdir /home/shared/8TB_HDD_01/mcap \
--progress \
SRR22293447 \
SRR22293448 \
SRR22293449 \
SRR22293450 \
SRR22293451 \
SRR22293452 \
SRR22293453 \
SRR22293454

Absolute path to fastq files in raven:

/home/shared/8TB_HDD_01/mcap/

Relative path to fastq files in raven:

cd ../../../../../8TB_HDD_01/mcap/

Check that the fastq files are downloaded:

Code

cd /home/shared/8TB_HDD_01/mcap/
ls

Tip

The fastq files have been downloaded from NCBI!

Download Annotated Reference Genome for Montipora capitata

Deep Dive project with genomes of interest: https://github.com/urol-e5/deep-dive

Montipora capitata Genome version V3, Rutgers University: http://cyanophora.rutgers.edu/montipora/

Genome publication: https://academic.oup.com/gigascience/article/doi/10.1093/gigascience/giac098/6815755

Nucleotide Coding Sequence (CDS): http://cyanophora.rutgers.edu/montipora/Montipora_capitata_HIv3.genes.cds.fna.gz

This code grabs the Montipora capitata fasta file (rna.fna) of genes.

Code

# change to work in data directory
cd ../data
# download the rna.fna file to data directory from the gannet server
curl -O http://cyanophora.rutgers.edu/montipora/Montipora_capitata_HIv3.genes.cds.fna.gz

wget http://cyanophora.rutgers.edu/montipora/Montipora_capitata_HIv3.assembly.fasta.gz

wget http://cyanophora.rutgers.edu/montipora/Montipora_capitata_HIv3.genes.gff3.gz

Obtain Montipora_capitata_HIv3.genes_fixed.gff3 file by downloading from GitHub.

Code

cd ../data
wget https://github.com/AHuffmyer/EarlyLifeHistory_Energetics/raw/master/Mcap2020/Data/TagSeq/Montipora_capitata_HIv3.genes_fixed.gff3.gz

This was generated by running the original gff file Montipora_capitata_HIv3.genes.gff3.gz through this script in R: https://github.com/AHuffmyer/EarlyLifeHistory_Energetics/blob/master/Mcap2020/Scripts/TagSeq/Genome_V3/fix_gff_format.Rmd

Unzip gff and genome file

Code

cd ../data
gunzip Montipora_capitata_HIv3.genes.gff3.gz
gunzip Montipora_capitata_HIv3.genes_fixed.gff3.gz
gunzip Montipora_capitata_HIv3.assembly.fasta.gz

So we now have the sample sequence files located at absolute path /home/shared/8TB_HDD_01/mcap/ And we have the reference genome V3 files located at ../data/

Summary & Next Steps

Info

Next Step: Go to 02s-pseudo-align-kallisto to align reads to the reference genome

--- title: "01-get-data" author: "Sarah Tanja" date: "`r format(Sys.time(), '%d %B, %Y')`" format: html: df-print: paged toc: true smooth-scroll: true link-external-icon: true link-external-newwindow: true code-fold: show code-tools: true code-copy: true highlight-style: arrow code-overflow: wrap theme: light: sandstone dark: darkly --- ```{r setup, include=FALSE} knitr::opts_chunk$set( echo = TRUE, # Display code chunks eval = FALSE, # Evaluate code chunks warning = FALSE, # Hide warnings message = FALSE, # Hide messages fig.width = 6, # Set plot width in inches fig.height = 4, # Set plot height in inches fig.align = "center" # Align plots to the center ) ``` # Downloading RNA sequence files Goals: - Identify samples (accession numbers) I want to analyze in SRA Run Selector - `curl` sample RNA sequences data files (.fastq) from NCBI BioProject into a large storage drive - Check the files for integrity using `md5sum` Environment: - RStudio Server hosted on a Unix OS (Roberts Lab raven server) ## Install Packages ```{r install-packages, eval=TRUE, cache=TRUE} if ("tidyverse" %in% rownames(installed.packages()) == 'FALSE') install.packages('tidyverse') if ("SRAdb" %in% rownames(installed.packages()) == 'FALSE') BiocManager::install("SRAdb") ``` ## Load packages ```{r load-packages, cache=TRUE} library(SRAdb) library(tidyverse) ``` ## Get sample metadata First we want to get an idea of the files we are downloading and the samples that generated the data. We will start by looking at the metadata for the samples. ```{r, engine='bash'} # navigate to data directory cd ../data # download metadata from the git repo into the data directory curl -O https://raw.githubusercontent.com/AHuffmyer/EarlyLifeHistory_Energetics/master/Mcap2020/Data/TagSeq/Sample_Info.csv ``` ```{r} #pull data into R and rename it metadata metadata <- read_csv("../data/Sample_Info.csv") ``` ### Check file integrity with `md5sum` ::: callout-info [Learn How to Generate and Verify Files with MD5 Checksum in Linux](https://www.tecmint.com/generate-verify-check-files-md5-checksum-linux/) ::: ```{r, engine='bash'} cd ../data md5sum Sample_Info.csv > md5.transferred ``` ```{r, engine='bash'} cd ../data cmp Sample_Info.csv md5.transferred ``` > These files differ by 1 byte, and I haven't yet figured out why... possibly a windows vs unix thing ```{r} #look at the metadata head(metadata) ``` There are 39 samples (rows) with 8 metadata columns in this tibble dataset (AH1 - AH39). These samples are *Montipora capitata* coral taken at different life-stages (denoted by column names `time-stage` and `code` ), and RNA extracted and sequenced using Tag-Seq. [![M. capitata development diagram by A. Huffmyer](https://user-images.githubusercontent.com/32178010/211181816-cf21abb7-7038-4f86-9aca-3ca326a958ce.png)](https://github.com/AHuffmyer/EarlyLifeHistory_Energetics) # Get TagSeq FASTQ files from National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) ::: callout-info Checkout some background info on [The Sequence Read Archive (SRA)](https://linsalrob.github.io/ComputationalGenomicsManual/Databases/SRA.html#:~:text=A%20Study%20(SRP)%20has%20one,want%20to%20download%20from%20NCBI.) and the [SRA factsheet](https://www.ncbi.nlm.nih.gov/core/assets/sra/files/Factsheet_SRA.pdf) ::: NCBI sequences have been trimmed and cleaned and are ready for alignment and analysis. ### Use SRA Run Selector to make a .txt file list of Run (SRR) numbers > *"Results are called runs (SRR). Runs comprise the data gathered for a sample or sample bundle and refer to a defining experiment."* First, we need to obtain RNAseq files from NCBI from the project [PRJNA900235](https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA900235). In order to do this we need a list of SRR numbers that identify the specific sequence files for RNA. Go to the [SRA Run Selector](https://www.ncbi.nlm.nih.gov/Traces/study/?query_key=9&WebEnv=MCID_64358c3e49f6486f57b185c1&o=acc_s%3Aa) for BioProject 900235 and select the files of interest. ::: callout-info These sequence files are single-stranded RNA ::: To make things easier and faster, we're going to compare 2 life-history groups, with an n=4 for each group. In this case, let's compare 4 Embryos (an early stage) to 4 Attached Recruits (a later stage). ![SRA run selector, downloading 8 fastq files (4 Embryo, 4 Attached Recruits)](images/sra-run-selector.png){fig-align="center"} Select the files of interest, click 'Accession List' and the list of file run ID's will be downloaded in a text file named `SRR_Acc_list.txt` , upload this list into the `data` folder. ## Navigate to where you will download the fastq files Make sure this space has lots of storage ```{r, engine='bash'} # move to large data hard-drive cd /home/shared/8TB_HDD_01 # make a new directory named 'mcap', short for Montipora capitata mkdir -p mcap ``` # Download the fastq files Roberts Lab Resources Github [issue#1569](https://github.com/RobertsLab/resources/issues/1569) thread Using `sratoolkit.3.0.2-ubuntu64` which is already downloaded in `/home/shared` folder ::: callout-caution The following code will take some time, run it and go take a wee break ::: ```{r, engine='bash', cache=TRUE} /home/shared/sratoolkit.3.0.2-ubuntu64/bin/./fasterq-dump \ --outdir /home/shared/8TB_HDD_01/mcap \ --progress \ SRR22293447 \ SRR22293448 \ SRR22293449 \ SRR22293450 \ SRR22293451 \ SRR22293452 \ SRR22293453 \ SRR22293454 ``` Absolute path to fastq files in raven: `/home/shared/8TB_HDD_01/mcap/` Relative path to fastq files in raven: `cd ../../../../../8TB_HDD_01/mcap/` Check that the fastq files are downloaded: ```{r, engine='bash'} cd /home/shared/8TB_HDD_01/mcap/ ls ``` ::: callout-tip The fastq files have been downloaded from NCBI! ::: # Download Annotated Reference Genome for *Montipora capitata* Deep Dive project with genomes of interest: <https://github.com/urol-e5/deep-dive> *Montipora capitata* Genome version V3, Rutgers University: <http://cyanophora.rutgers.edu/montipora/> Genome publication: <https://academic.oup.com/gigascience/article/doi/10.1093/gigascience/giac098/6815755> Nucleotide Coding Sequence (CDS): <http://cyanophora.rutgers.edu/montipora/Montipora_capitata_HIv3.genes.cds.fna.gz> This code grabs the *Montipora capitata* fasta file (rna.fna) of genes. ```{r, engine='bash'} # change to work in data directory cd ../data # download the rna.fna file to data directory from the gannet server curl -O http://cyanophora.rutgers.edu/montipora/Montipora_capitata_HIv3.genes.cds.fna.gz wget http://cyanophora.rutgers.edu/montipora/Montipora_capitata_HIv3.assembly.fasta.gz wget http://cyanophora.rutgers.edu/montipora/Montipora_capitata_HIv3.genes.gff3.gz ``` Obtain `Montipora_capitata_HIv3.genes_fixed.gff3` file by downloading from GitHub. ```{r, engine='bash'} cd ../data wget https://github.com/AHuffmyer/EarlyLifeHistory_Energetics/raw/master/Mcap2020/Data/TagSeq/Montipora_capitata_HIv3.genes_fixed.gff3.gz ``` This was generated by running the original gff file `Montipora_capitata_HIv3.genes.gff3.gz` through this script in R: <https://github.com/AHuffmyer/EarlyLifeHistory_Energetics/blob/master/Mcap2020/Scripts/TagSeq/Genome_V3/fix_gff_format.Rmd> Unzip gff and genome file ```{r, engine='bash'} cd ../data gunzip Montipora_capitata_HIv3.genes.gff3.gz gunzip Montipora_capitata_HIv3.genes_fixed.gff3.gz gunzip Montipora_capitata_HIv3.assembly.fasta.gz ``` So we now have the sample sequence files located at absolute path `/home/shared/8TB_HDD_01/mcap/` And we have the reference genome V3 files located at `../data/` # Summary & Next Steps ::: callout-info **Next Step:** Go to `02s-pseudo-align-kallisto` to align reads to the reference genome :::

Downloading RNA sequence files

Install Packages

Load packages

Get sample metadata

Check file integrity with md5sum

Get TagSeq FASTQ files from National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA)

Use SRA Run Selector to make a .txt file list of Run (SRR) numbers

Navigate to where you will download the fastq files

Download the fastq files

Download Annotated Reference Genome for Montipora capitata

Summary & Next Steps

Check file integrity with `md5sum`