Analyzing .fastq Genome Data Using Linux Combined with R

Download .fastq data from NBCI

I get .fastq data from NBCI using coding as follow:

Set working directory:

setwd("/Users/nnthieu/Downloads/GenomeProject1000/Blast/")

Firstly, create a “accession.txt” file containing fastq file on NBCI I need to download: ‘nano accession.txt’

SRR30352484

SRR30298594

SRR30298562

SRR722403

SRR30352543

SRR30352544

SRR30352545

Then create a “download_fastq.sh” file containing code as below:

‘nano download_fastq.sh’

while IFS= read -r accession; do fastq-dump –gzip “$accession” done < accessions.txt

# system("bash download_fastq.sh")

Finally, I have files as

SRR30352484.fastq

SRR30298594.fast

SRR30298562.fastq

SRR722403.fastq

SRR30352544.fastq

SRR30352543.fastq

SRR30352545.fastq

Another way I can manually download .fastq data files from NBCI as:

# system("fastq-dump --gzip SRR30352484")

Now I have a file as SRR772403.fastq

Analyzing data

See the sequences in the file using ‘cat and head’

system("cat SRR772403.fastq | awk '(NR%4==2) {print $0}' | head -n 10")

For showing the results in the html Knite outputs, I need to add some more codes as follow:

# Capture the output of the command using system()
output <- system("cat SRR772403.fastq | awk '(NR%4==2) {print $0}' | head -n 10", intern = TRUE)
# Print the captured output
cat(output, sep = "\n")

## TCCCTGATACCCTAACTTGTGATGGAATTCTCGGGTGCCAAGGAACTCCA
## TGAGGTAGTAGGTTGTATAGTTTGGAATTCTCGGGTGCCAAGGAACTCCA
## TACCCTGTAGATCCGAATTTGTTGGAATTCTCGGGTGCCAAGGAACTCCA
## TCCTGTACTGAGCTGCCCCGAGATGGAATTCTCGGGTGCCAAGGAACTCC
## TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACATGTCAATCTCGTATGC
## TGAGATGAAGCACTGTAGCTTGGAATTCTCGGGTGCCAAGGAACTCCAGT
## TACCCTGTAGAACCGAATTTGTTGGAATTCTCGGGTGCCAAGGAACTCCA
## TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACATGTCAATCTCGTATGC
## AAGCTGCCAGTTGAAGAACTGTTGGAATTCTCGGGTGCCAAGGAACTCCA
## TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACATGTCAATCTCGTATGC

Now I count the length of reads

output <- system("cat SRR772403.fastq | awk '(NR%4==2) {print length($0)}' | head -n 10", intern = TRUE)
cat(output, sep = "\n")

## 50
## 50
## 50
## 50
## 50
## 50
## 50
## 50
## 50
## 50

sort the number of reads.

output <- system("cat SRR772403.fastq | awk '(NR%4==2) {print length($0)}' | sort | uniq" , intern = TRUE)
cat(output, sep = "\n")

## 50

All reads have their length of 50

Now I count the total read number or sequencing depth

output <- system("cat SRR772403.fastq|grep '@S' | wc -l", intern = TRUE)
cat(output, sep = "\n")

##   581450

I have 581450 reads.

I list the files of .fastq in the folder that I want to analyze and count the numbers of their reads.

These below code lines do not work in R because of the syntax but they work in Linux:

# system("for sample in $(ls *.fastq | sort | uniq); do 
#    n_read=$(cat "${sample}" | grep "^@S" | wc -l)
#    echo -e ${sample}: "\t"  ${n_read}
# done > n_reads.tsv" ) 

output <- system("cat n_reads.tsv | sort -nk 2" , intern = TRUE) # sort asc; -nk 1: desc
cat(output, sep = "\n")

## SRR30352543.fastq:    159310
## SRR30352545.fastq:    160470
## SRR30352544.fastq:    163192
## SRR772403.fastq:      581450
## SRR30298594.fastq:    1469094
## SRR30298562.fastq:    1500670
## SRR30352484.fastq:    3770332

Ploting the number of reads:

library(ggplot2)
# Read the data from the TSV file without headers
file_path <- "/Users/nnthieu/Downloads/GenomeProject1000/Blast/n_reads.tsv"
n_reads <- read.table(file_path, header = FALSE, sep = "\t", col.names = c("Sample", "Reads"))

# Create a bar plot
ggplot(n_reads, aes(x = reorder(Sample, -Reads), y = Reads)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(title = "Number of Reads per Sample", 
       x = "Samples", 
       y = "Number of Reads") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Rotates x-axis labels for better readability

Analyzing .fastq Genome Data Using Linux Combined with R

Thieu Nguyen

8/27/2024

Download .fastq data from NBCI

Analyzing data