SEQUENCING DATA FASTQ QUALITY CHECK WITH FASTQC
- Checking the sequence data quality
Download the data from NCBI using this command:
prefetch SRR576933
It downloads the file SRR576933.sra into the local directory SRR576933. This is a single-end sequencing file.
Then I convert this .sra file into SRR576933.fastq as follows:
!fasterq-dump --outdir ./SRR576933 SRR576933.sra
It returns the SRR576933.fastq file. I check the number of lines in the file: 14,414,176.
!cat ./SRR576933/SRR576933.fastq | wc -l
14414176
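The line count converts directly into a read count. A minimal sketch, assuming the file uses the standard unwrapped layout of four lines per record (header, sequence, '+', quality); the helper name is my own:

```python
# A FASTQ record spans four lines: header, sequence, '+', quality string.
LINES_PER_RECORD = 4


def reads_from_line_count(line_count):
    """Convert a FASTQ line count into a read count."""
    if line_count % LINES_PER_RECORD != 0:
        raise ValueError("line count is not a multiple of 4; the file may be truncated")
    return line_count // LINES_PER_RECORD


n_reads = reads_from_line_count(14_414_176)
print(n_reads)  # 3603544
```

This matches the count of 3603544 that fastx_quality_stats reports later for every base position.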
I check the quality of the sequencing data in this fastq file using 'fastqc':
!mkdir qc
!fastqc ./SRR576933/SRR576933.fastq --outdir ./qc --threads 3
null
Started analysis of SRR576933.fastq
Approx 5% complete for SRR576933.fastq
Approx 10% complete for SRR576933.fastq
Approx 15% complete for SRR576933.fastq
Approx 20% complete for SRR576933.fastq
Approx 25% complete for SRR576933.fastq
Approx 30% complete for SRR576933.fastq
Approx 35% complete for SRR576933.fastq
Approx 40% complete for SRR576933.fastq
Approx 45% complete for SRR576933.fastq
Approx 50% complete for SRR576933.fastq
Approx 55% complete for SRR576933.fastq
Approx 60% complete for SRR576933.fastq
Approx 65% complete for SRR576933.fastq
Approx 70% complete for SRR576933.fastq
Approx 75% complete for SRR576933.fastq
Approx 80% complete for SRR576933.fastq
Approx 85% complete for SRR576933.fastq
Approx 90% complete for SRR576933.fastq
Approx 95% complete for SRR576933.fastq
Analysis complete for SRR576933.fastq
Outputs in the directory qc are the files:
SRR576933_fastqc.html
SRR576933_fastqc.zip
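The per-module verdicts can also be read without opening the HTML report. A minimal sketch, assuming the zip archive contains a tab-separated summary.txt (status, module name, file name) inside a SRR576933_fastqc/ folder; the function name is my own:

```python
import zipfile


def read_fastqc_summary(zip_path):
    """Return {module name: status} from the summary.txt inside a FastQC zip."""
    with zipfile.ZipFile(zip_path) as archive:
        # Assumed layout: <sample>_fastqc/summary.txt with tab-separated
        # columns: status (PASS/WARN/FAIL), module name, input file name.
        name = next(n for n in archive.namelist() if n.endswith("summary.txt"))
        text = archive.read(name).decode("utf-8")
    summary = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        status, module, _filename = line.split("\t")
        summary[module] = status
    return summary


# Usage, with the path produced by the fastqc run above:
# for module, status in read_fastqc_summary("./qc/SRR576933_fastqc.zip").items():
#     print(f"{status}\t{module}")
```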
SRR576933_fastqc.html shows the summary below; see the basic statistics first.
The sequencing data file has some problems, flagged with red and pink marks in the left panel, that need to be fixed before analyzing the data.
First, Per base sequence quality shows low Phred scores (< Q20) at the ends of the reads.
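To make the Q20 threshold concrete: a Phred score Q corresponds to an error probability of 10^(-Q/10), so Q20 means about 1 wrong base call in 100. A minimal sketch of that conversion and of decoding a quality string, assuming the Phred+33 encoding (the one the -Q33 flag refers to later); both function names are my own:

```python
def phred_to_error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def decode_quality(quality_string, offset=33):
    """Turn a FASTQ quality string into a list of Phred scores."""
    return [ord(ch) - offset for ch in quality_string]


print(phred_to_error_probability(20))  # about 0.01, i.e. 1 error in 100
print(decode_quality("I5#"))           # [40, 20, 2]
```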
Per tile sequence quality also has problems, with randomly scattered hot areas (errors). Such quality problems may introduce false insertions into the reads, and they are hard to fix with sequence trimming because the low-quality bases are not at the ends of the reads.
Per sequence quality scores are reasonable because almost all reads have high scores.
Per base sequence content has problems.
The per sequence GC content curve does not fit the theoretical distribution.
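The quantity behind this plot is the GC percentage of each read. A minimal sketch of that per-read value; excluding N bases from the denominator is my own assumption, not necessarily what FastQC does:

```python
def gc_percent(sequence):
    """Percentage of G and C among the called (A/C/G/T) bases of one read."""
    seq = sequence.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0  # read with no called bases (assumption: report 0)
    gc = seq.count("G") + seq.count("C")
    return 100.0 * gc / called


print(gc_percent("GGCCAATT"))  # 50.0
```

A histogram of this value over all reads is what gets compared with the theoretical curve.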
Per base N content
Sequence length distribution is OK
Sequence duplication levels show some problems
Overrepresented sequences
Adapter content is OK
Processing of fastq reads
Before moving on to the next step for data analysis, errors and biases should be adjusted to avoid incorrect results and misleading interpretation.
I use the FASTX-Toolkit, installed with:
'conda install -c bioconda fastx_toolkit'
#!fastx_quality_stats -i ./SRR576933.fastq -o ./SRR576933_quality_stats.txt
!cat /Users/nnthieu/fastqs/SRR576933/SRR576933_quality_stats.txt | head -n 10
column  count    min  max  sum        mean   Q1  med  Q3  IQR  lW  rW  A_Count  C_Count  G_Count  T_Count  N_Count  Max_count
1       3603544  4    38   128770036  35.73  36  38   38  2    33  38  652900   516510   1611485  770608   52041    3603544
2       3603544  4    38   128233592  35.59  35  37   38  3    31  38  1760276  572752   707395   563121   0        3603544
3       3603544  4    38   128988631  35.79  36  38   38  2    33  38  680597   530116   669666   1723165  0        3603544
4       3603544  4    38   129084672  35.82  36  38   38  2    33  38  680520   1670693  667003   585328   0        3603544
5       3603544  4    38   129057062  35.81  36  38   38  2    33  38  664586   570355   1742363  626240   0        3603544
6       3603544  4    38   129270122  35.87  36  38   38  2    33  38  659409   604600   1724323  615211   1        3603544
7       3603544  4    38   128530519  35.67  35  38   38  3    31  38  1808967  564091   582675   647811   0        3603544
8       3603544  4    38   128900977  35.77  35  38   38  3    31  38  1810094  585542   578694   629205   9        3603544
9       3603544  4    38   129071292  35.82  36  38   38  2    33  38  647732   586600   1722981  646231   0        3603544
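The columns are internally consistent, which is a quick sanity check on the file. A small sketch using the values of base position 1 copied from the output above (the variable names are my own):

```python
# Values for base position 1, copied from the fastx_quality_stats output.
count = 3_603_544
quality_sum = 128_770_036
a, c, g, t, n = 652_900, 516_510, 1_611_485, 770_608, 52_041

# The mean column is sum / count.
print(round(quality_sum / count, 2))  # 35.73

# The five base counts add up to the number of reads.
print(a + c + g + t + n == count)  # True

# Percentage of reads with an uncalled base (N) at position 1.
print(round(100 * n / count, 2))  # 1.44
```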
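Filtering: I keep only reads in which at least 80% of the bases (-p 80) have a Phred score >= 28 (-q 28), and save them into the new file SRR576933_filtered.fastq. The -Q33 option tells the tool that the quality scores use the Phred+33 encoding.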
!fastq_quality_filter -i /Users/nnthieu/fastqs/SRR576933/SRR576933.fastq -q 28 -p 80 -o SRR576933_filtered.fastq -Q33
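The rule applied by -q 28 -p 80 can be sketched per read as follows. This is an illustration of the documented meaning of the two options, not the tool's actual code; the function name is my own and the exact rounding inside fastq_quality_filter may differ:

```python
def passes_quality_filter(quality_string, min_quality=28, min_percent=80, offset=33):
    """True if at least min_percent of the bases have Phred score >= min_quality."""
    scores = [ord(ch) - offset for ch in quality_string]
    if not scores:
        return False  # assumption: an empty read is rejected
    good = sum(1 for q in scores if q >= min_quality)
    # Integer comparison avoids floating-point issues at exactly 80%.
    return 100 * good >= min_percent * len(scores)


print(passes_quality_filter("IIIIIIII##"))  # True: 8 of 10 bases are Q40
print(passes_quality_filter("IIIIIII###"))  # False: only 7 of 10
```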
Trimming low-quality bases (Phred score < 28) from the ends of the reads.
!fastq_quality_trimmer -i /Users/nnthieu/fastqs/SRR576933/SRR576933_filtered.fastq -t 28 -o SRR576933_filt_trim.fastq -Q33
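End trimming with a threshold of 28 can be sketched as below: bases are removed from the 3' end until one with a score of at least 28 is reached. This is my reading of the -t option rather than the tool's source, and the function name is hypothetical:

```python
def trim_low_quality_tail(sequence, quality_string, threshold=28, offset=33):
    """Drop trailing bases whose Phred score is below threshold."""
    end = len(quality_string)
    # Walk back from the 3' end while the last kept base is below threshold.
    while end > 0 and ord(quality_string[end - 1]) - offset < threshold:
        end -= 1
    return sequence[:end], quality_string[:end]


print(trim_low_quality_tail("ACGTACGT", "IIIII#5#"))  # ('ACGTA', 'IIIII')
```

Low-quality bases in the middle of a read are left untouched, which is why trimming cannot repair the scattered per-tile errors noted earlier.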
Per base sequence quality is now improved; base position 36 has been removed.
Clipping adapters
!fastx_clipper -a GATCGGAAGAGCACACGTCTGAACTCCAGTCACACA -i SRR576933_filt_trim.fastq -o SRR576933_filt_trim_clip.fastq -v -Q33
Clipping Adapter: GATCGGAAGAGCACACGTCTGAACTCCAGTCACACA
Min. Length: 5
Input: 3148034 reads.
Output: 2054907 reads.
discarded 37338 too-short reads.
discarded 1054048 adapter-only reads.
discarded 1741 N reads.
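The counts reported by fastx_clipper add up, and they show how much of the data the clipping step removes. A small check with the numbers copied from the output above (variable names are my own):

```python
# Numbers copied from the fastx_clipper report.
input_reads = 3_148_034
too_short = 37_338
adapter_only = 1_054_048
reads_with_n = 1_741

output_reads = input_reads - too_short - adapter_only - reads_with_n
print(output_reads)  # 2054907, the reported output

# Share of input reads that were adapter only, and share that was kept.
print(round(100 * adapter_only / input_reads, 1))  # 33.5
print(round(100 * output_reads / input_reads, 1))  # 65.3
```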
Now several problems have been fixed, and some red or pink marks have turned green.
Sequence length distribution shows that some reads in the file now have unequal lengths, but not many, which is reasonable. Some aligners do not accept unequal sequence lengths, so in this case I remove reads shorter than 34 bases.
!paste - - - - < SRR576933_filt_trim_clip.fastq | awk -F '\t' 'length($2) >= 34 && length($4) >= 34' | tr '\t' '\n' > SRR576933_filt_trim_clip_eq.fastq
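The same length filter can be written in Python for readers who prefer to stay in the notebook. A minimal sketch, assuming four-line FASTQ records; the function name is my own:

```python
from itertools import islice


def filter_by_length(lines, min_length=34):
    """Yield 4-line FASTQ records whose sequence has at least min_length bases."""
    it = iter(lines)
    while True:
        record = list(islice(it, 4))
        if len(record) < 4:
            break  # end of input (an incomplete trailing record is dropped)
        header, sequence, plus, quality = (line.rstrip("\n") for line in record)
        if len(sequence) >= min_length:
            yield header, sequence, plus, quality


# Usage, with the file names from the steps above:
# with open("SRR576933_filt_trim_clip.fastq") as src, \
#         open("SRR576933_filt_trim_clip_eq.fastq", "w") as dst:
#     for record in filter_by_length(src):
#         dst.write("\n".join(record) + "\n")
```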
The data are now good enough to be analyzed.