I get .fastq data from NBCI using coding as follow:
Set working directory:
setwd("/Users/nnthieu/Downloads/GenomeProject1000/Blast/")
Firstly, create a “accession.txt” file containing fastq file on NBCI I need to download: ‘nano accession.txt’
SRR30352484
SRR30298594
SRR30298562
SRR722403
SRR30352543
SRR30352544
SRR30352545
Then create a “download_fastq.sh” file containing code as below:
‘nano download_fastq.sh’
while IFS= read -r accession; do fastq-dump –gzip “$accession” done < accessions.txt
# system("bash download_fastq.sh")
Finally, I have files as
SRR30352484.fastq
SRR30298594.fast
SRR30298562.fastq
SRR722403.fastq
SRR30352544.fastq
SRR30352543.fastq
SRR30352545.fastq
Another way I can manually download .fastq data files from NBCI as:
# system("fastq-dump --gzip SRR30352484")
Now I have a file as SRR772403.fastq
See the sequences in the file using ‘cat and head’
system("cat SRR772403.fastq | awk '(NR%4==2) {print $0}' | head -n 10")
For showing the results in the html Knite outputs, I need to add some more codes as follow:
# Capture the output of the command using system()
output <- system("cat SRR772403.fastq | awk '(NR%4==2) {print $0}' | head -n 10", intern = TRUE)
# Print the captured output
cat(output, sep = "\n")
## TCCCTGATACCCTAACTTGTGATGGAATTCTCGGGTGCCAAGGAACTCCA
## TGAGGTAGTAGGTTGTATAGTTTGGAATTCTCGGGTGCCAAGGAACTCCA
## TACCCTGTAGATCCGAATTTGTTGGAATTCTCGGGTGCCAAGGAACTCCA
## TCCTGTACTGAGCTGCCCCGAGATGGAATTCTCGGGTGCCAAGGAACTCC
## TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACATGTCAATCTCGTATGC
## TGAGATGAAGCACTGTAGCTTGGAATTCTCGGGTGCCAAGGAACTCCAGT
## TACCCTGTAGAACCGAATTTGTTGGAATTCTCGGGTGCCAAGGAACTCCA
## TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACATGTCAATCTCGTATGC
## AAGCTGCCAGTTGAAGAACTGTTGGAATTCTCGGGTGCCAAGGAACTCCA
## TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACATGTCAATCTCGTATGC
Now I count the length of reads
output <- system("cat SRR772403.fastq | awk '(NR%4==2) {print length($0)}' | head -n 10", intern = TRUE)
cat(output, sep = "\n")
## 50
## 50
## 50
## 50
## 50
## 50
## 50
## 50
## 50
## 50
sort the number of reads.
output <- system("cat SRR772403.fastq | awk '(NR%4==2) {print length($0)}' | sort | uniq" , intern = TRUE)
cat(output, sep = "\n")
## 50
All reads have their length of 50
Now I count the total read number or sequencing depth
output <- system("cat SRR772403.fastq|grep '@S' | wc -l", intern = TRUE)
cat(output, sep = "\n")
## 581450
I have 581450 reads.
I list the files of .fastq in the folder that I want to analyze and count the numbers of their reads.
These below code lines do not work in R because of the syntax but they work in Linux:
# system("for sample in $(ls *.fastq | sort | uniq); do
# n_read=$(cat "${sample}" | grep "^@S" | wc -l)
# echo -e ${sample}: "\t" ${n_read}
# done > n_reads.tsv" )
output <- system("cat n_reads.tsv | sort -nk 2" , intern = TRUE) # sort asc; -nk 1: desc
cat(output, sep = "\n")
## SRR30352543.fastq: 159310
## SRR30352545.fastq: 160470
## SRR30352544.fastq: 163192
## SRR772403.fastq: 581450
## SRR30298594.fastq: 1469094
## SRR30298562.fastq: 1500670
## SRR30352484.fastq: 3770332
Ploting the number of reads:
library(ggplot2)
# Read the data from the TSV file without headers
file_path <- "/Users/nnthieu/Downloads/GenomeProject1000/Blast/n_reads.tsv"
n_reads <- read.table(file_path, header = FALSE, sep = "\t", col.names = c("Sample", "Reads"))
# Create a bar plot
ggplot(n_reads, aes(x = reorder(Sample, -Reads), y = Reads)) +
geom_bar(stat = "identity", fill = "steelblue") +
labs(title = "Number of Reads per Sample",
x = "Samples",
y = "Number of Reads") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotates x-axis labels for better readability