Goals:
- Identify the samples (accession numbers) I want to analyze in SRA Run Selector
- Download (curl) the sample RNA sequence data files (.fastq) from the NCBI BioProject onto a large storage drive
- Check the files for integrity using md5sum
Environment:
# install packages only if they are missing
if (!requireNamespace("tidyverse", quietly = TRUE)) install.packages("tidyverse")
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
if (!requireNamespace("SRAdb", quietly = TRUE)) BiocManager::install("SRAdb")
library(SRAdb)
library(tidyverse)

First we want to get an idea of the files we are downloading and the samples that generated the data. We will start by looking at the metadata for the samples.
# navigate to data directory
cd ../data
# download metadata from the git repo into the data directory
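curl -O https://raw.githubusercontent.com/AHuffmyer/EarlyLifeHistory_Energetics/master/Mcap2020/Data/TagSeq/Sample_Info.csv
# pull data into R and rename it metadata
metadata <- read_csv("../data/Sample_Info.csv")

md5sum
cd ../data
# record the checksum of the downloaded file
md5sum Sample_Info.csv > md5.transferred
cd ../data
# verify the file against the recorded checksum
md5sum -c md5.transferred

My first attempt used cmp Sample_Info.csv md5.transferred, which reported a difference at byte 1. That is most likely expected rather than a Windows vs Unix issue: cmp was comparing the CSV itself against the file holding its checksum, so the two differ from the first byte. md5sum -c re-computes the checksum and compares it with the recorded one. Note that this only confirms the file has not changed since the checksum was recorded; confirming the transfer itself needs a checksum published by the source.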
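One way to check a file against a recorded checksum is md5sum -c, which re-hashes the file and compares the result with the stored value. Below is a minimal sketch, assuming GNU coreutils md5sum. The file name example.csv and its contents are made up for illustration, and everything happens in a temporary directory.

```shell
# Hypothetical scratch directory and file, so nothing in ../data is touched.
workdir="$(mktemp -d)"
printf 'sample_id,code\nAH1,A\n' > "$workdir/example.csv"

# Record the checksum (in a real transfer this would be done at the source).
( cd "$workdir" && md5sum example.csv > example.md5 )

# Verify: md5sum -c re-hashes the file and compares it with the recorded value.
if ( cd "$workdir" && md5sum -c example.md5 >/dev/null 2>&1 ); then
  intact_status="OK"
else
  intact_status="FAILED"
fi

# Simulate a corrupted transfer by appending one byte, then verify again.
printf 'x' >> "$workdir/example.csv"
if ( cd "$workdir" && md5sum -c example.md5 >/dev/null 2>&1 ); then
  corrupt_status="OK"
else
  corrupt_status="FAILED"
fi

echo "intact: $intact_status, corrupted: $corrupt_status"
rm -rf "$workdir"
```

The exit status of md5sum -c is what the if statements test, so the same pattern can gate later steps in a script.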
# look at the metadata
head(metadata)

There are 39 samples (rows, AH1 - AH39) with 8 metadata columns in this tibble. These samples are Montipora capitata coral taken at different life stages (denoted by the columns time-stage and code), with RNA extracted and sequenced using Tag-Seq.
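The row and column counts can also be checked from the shell without loading R. This is a rough sketch on a made-up stand-in file. The column names here are hypothetical, not those of Sample_Info.csv, and the simple comma split assumes no quoted fields containing commas.

```shell
# Hypothetical stand-in for Sample_Info.csv; the real column names may differ.
csv="$(mktemp)"
printf 'sample_id,code\nAH1,A\nAH2,B\nAH3,C\n' > "$csv"

# Number of columns: count comma-separated fields in the header line.
n_cols="$(head -n 1 "$csv" | awk -F',' '{ print NF }')"

# Number of samples: count non-empty lines after the header.
n_rows="$(tail -n +2 "$csv" | awk 'NF > 0' | wc -l | tr -d ' ')"

echo "$n_rows samples, $n_cols columns"
rm -f "$csv"
```

Pointing csv at ../data/Sample_Info.csv (and dropping the printf and rm lines) should give counts to compare with what head(metadata) shows in R.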
Roberts Lab Resources Github issue#1569 thread
Using sratoolkit.3.0.2-ubuntu64 which is already downloaded in /home/shared folder
The following code will take some time, run it and go take a wee break
/home/shared/sratoolkit.3.0.2-ubuntu64/bin/./fasterq-dump \
--outdir /home/shared/8TB_HDD_01/mcap \
--progress \
SRR22293447 \
SRR22293448 \
SRR22293449 \
SRR22293450 \
SRR22293451 \
SRR22293452 \
SRR22293453 \
SRR22293454

Absolute path to fastq files in raven:
/home/shared/8TB_HDD_01/mcap/
Relative path to fastq files in raven:
cd ../../../../../8TB_HDD_01/mcap/
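The eight accessions passed to fasterq-dump above form a contiguous range, so the list can be generated rather than typed by hand. Here is a sketch that only prints the commands (a dry run). The per-accession loop is my own variation, not the method used above, and the paths are the ones already given in this post.

```shell
# Build the accession list SRR22293447 ... SRR22293454 from its numeric range.
accessions=""
n=22293447
while [ "$n" -le 22293454 ]; do
  accessions="$accessions SRR$n"
  n=$((n + 1))
done

# Dry run: print one fasterq-dump command per accession instead of running it.
# Removing the leading "echo" would start the actual (slow) downloads.
dry_run="$(
  for acc in $accessions; do
    echo /home/shared/sratoolkit.3.0.2-ubuntu64/bin/fasterq-dump \
      --outdir /home/shared/8TB_HDD_01/mcap --progress "$acc"
  done
)"

n_acc="$(echo $accessions | wc -w | tr -d ' ')"
echo "$dry_run"
echo "$n_acc accessions"
```

Running one accession per call makes it easier to resume after a failure, at the cost of a little startup overhead per call.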
Check that the fastq files are downloaded:
cd /home/shared/8TB_HDD_01/mcap/
ls

The fastq files have been downloaded from NCBI!
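Listing the directory shows the files exist but not that they are complete. A quick sanity check is to confirm that the line count is a multiple of four and to derive the read count from it. The sketch below assumes four-line FASTQ records (which, as far as I know, is what fasterq-dump writes by default) and uses a made-up two-read file rather than the real data.

```shell
# Hypothetical two-read FASTQ file standing in for one of the downloaded files.
fq="$(mktemp)"
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' > "$fq"

# Four lines per read: header, sequence, separator, quality string.
n_lines="$(wc -l < "$fq" | tr -d ' ')"

if [ $((n_lines % 4)) -eq 0 ]; then
  n_reads=$((n_lines / 4))
  fastq_status="ok"
else
  # A remainder suggests a truncated or otherwise malformed file.
  n_reads=0
  fastq_status="truncated"
fi

echo "$fastq_status: $n_reads reads"
rm -f "$fq"
```

On the real files in /home/shared/8TB_HDD_01/mcap/ the wc -l step may take a while, since each file has to be read in full.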
Deep Dive project with genomes of interest: https://github.com/urol-e5/deep-dive
Montipora capitata Genome version V3, Rutgers University: http://cyanophora.rutgers.edu/montipora/
Genome publication: https://academic.oup.com/gigascience/article/doi/10.1093/gigascience/giac098/6815755
Nucleotide Coding Sequence (CDS): http://cyanophora.rutgers.edu/montipora/Montipora_capitata_HIv3.genes.cds.fna.gz
This code grabs the Montipora capitata coding sequences (CDS fasta), the genome assembly, and the gene annotation (gff3) files.
# change to work in data directory
cd ../data
# download the CDS, assembly, and annotation files to the data directory from the Rutgers server
curl -O http://cyanophora.rutgers.edu/montipora/Montipora_capitata_HIv3.genes.cds.fna.gz
wget http://cyanophora.rutgers.edu/montipora/Montipora_capitata_HIv3.assembly.fasta.gz
wget http://cyanophora.rutgers.edu/montipora/Montipora_capitata_HIv3.genes.gff3.gz

Obtain the Montipora_capitata_HIv3.genes_fixed.gff3 file by downloading it from GitHub.
cd ../data
wget https://github.com/AHuffmyer/EarlyLifeHistory_Energetics/raw/master/Mcap2020/Data/TagSeq/Montipora_capitata_HIv3.genes_fixed.gff3.gz

This file was generated by running the original gff file Montipora_capitata_HIv3.genes.gff3.gz through this script in R: https://github.com/AHuffmyer/EarlyLifeHistory_Energetics/blob/master/Mcap2020/Scripts/TagSeq/Genome_V3/fix_gff_format.Rmd
Unzip gff and genome file
cd ../data
gunzip Montipora_capitata_HIv3.genes.gff3.gz
gunzip Montipora_capitata_HIv3.genes_fixed.gff3.gz
gunzip Montipora_capitata_HIv3.assembly.fasta.gz

So we now have the sample sequence files located at the absolute path /home/shared/8TB_HDD_01/mcap/, and the reference genome V3 files located at ../data/.
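Before the unzip step above, it can be worth testing each .gz file, since an interrupted download leaves a truncated archive. gzip -t checks an archive without writing anything to disk. The sketch below uses a made-up example.fasta.gz in a temporary directory and simulates a truncated download by keeping only the first 20 bytes.

```shell
# Hypothetical small FASTA file, compressed to stand in for a downloaded archive.
workdir="$(mktemp -d)"
printf '>seq1\nACGT\n' > "$workdir/example.fasta"
gzip "$workdir/example.fasta"   # produces example.fasta.gz

# gzip -t tests the archive's integrity without extracting it.
if gzip -t "$workdir/example.fasta.gz" 2>/dev/null; then
  gz_status="ok"
else
  gz_status="corrupt"
fi

# Simulate an interrupted download: keep only the first 20 bytes.
head -c 20 "$workdir/example.fasta.gz" > "$workdir/truncated.fasta.gz"
if gzip -t "$workdir/truncated.fasta.gz" 2>/dev/null; then
  trunc_status="ok"
else
  trunc_status="corrupt"
fi

echo "complete: $gz_status, truncated: $trunc_status"
rm -rf "$workdir"
```

This only checks the gzip container; it says nothing about whether the FASTA or gff3 content inside is what was expected.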