1 Introduction

The FastQC output contains 11 analysis modules:
1. Basic Statistics
2. Per base sequence quality
3. Per tile sequence quality
4. Per sequence quality scores
5. Per base sequence content
6. Per sequence GC content
7. Per base N content
8. Sequence Length Distribution
9. Sequence Duplication Levels
10. Overrepresented sequences
11. Adapter Content

Each module is introduced in turn below.

1.1 Basic Statistics

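Basic summary statistics for the file:

Filename: the name of the file analysed
File type: the type of the file (base calls or colorspace data)
Encoding: the ASCII encoding of the quality values, which reflects the sequencing platform and pipeline version
Total Sequences: the total number of sequences processed
Sequences flagged as poor quality: the number of sequences flagged as poor quality
Sequence length: the read length (the shortest and longest lengths if they differ)
%GC: the overall GC content across all sequences

Note: Basic Statistics never raises a warning or a failure.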

1.2 Per base sequence quality

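The distribution of quality values at each base position, computed over all reads together.

The x-axis is the base position within the read (from the first base to the n-th, where n = Sequence length). The y-axis is the base quality score, Q = -10*log10(P), where P is the probability that the base call is wrong; higher is better.
The background is split into three colour bands: green (very good quality), orange (reasonable quality) and red (poor quality).
A warning is raised if the lower quartile at any position is below 10, or if the median at any position is below 25. A failure is raised if the lower quartile at any position is below 5, or if the median at any position is below 20.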
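
As a minimal sketch of the quality formula and the warn/fail rule (lower quartile below 10 or median below 25 for a warning; lower quartile below 5 or median below 20 for a failure), the Python below converts a Phred score to an error probability and classifies positions. The function names are hypothetical, and the quartile method of Python's `statistics` module is an assumption that may differ slightly from what FastQC computes internally.

```python
import statistics


def phred_to_error_probability(q: float) -> float:
    """Invert Q = -10 * log10(P) to get the error probability P."""
    return 10 ** (-q / 10)


def classify_position(qualities: list[int]) -> str:
    """Classify one read position from the quality scores observed there."""
    # quantiles(n=4) returns the three quartile cut points; it needs >= 2 values.
    lower_quartile = statistics.quantiles(qualities, n=4)[0]
    median = statistics.median(qualities)
    if lower_quartile < 5 or median < 20:
        return "FAIL"
    if lower_quartile < 10 or median < 25:
        return "WARN"
    return "PASS"


def classify_per_base_quality(qualities_by_position: list[list[int]]) -> str:
    """The module status is the worst status seen at any position."""
    statuses = {classify_position(q) for q in qualities_by_position}
    for status in ("FAIL", "WARN"):
        if status in statuses:
            return status
    return "PASS"
```

For example, Q = 20 corresponds to an error probability of 0.01 and Q = 30 to 0.001.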

1.3 Per tile sequence quality

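Checks, for each base position in the reads, how far the quality on each flowcell tile deviates from the average across all tiles.

Blue means a small deviation and good quality; the closer the colour is to red, the larger the deviation and the worse the quality on that tile.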

1.4 Per sequence quality scores

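Quality statistics per sequence: the distribution of the mean quality of each read, used to check whether a subset of sequences has universally low quality values.

The x-axis is the mean quality score Q of a read; the y-axis is the number of reads.
It is common for a small fraction of sequences to have universally poor quality, usually because they were poorly imaged (for example at the edge of the field of view). If a large proportion of reads have low quality, this may point to a systematic problem.
A warning is raised if the most frequently observed mean quality is below 27 (an error rate of about 0.2%); a failure is raised if it is below 20 (an error rate of 1%).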
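
The per-read mean quality and the thresholds of 27 and 20 can be sketched as follows. This is an illustration only: it assumes Phred+33 encoded quality strings, bins mean qualities by truncating to an integer, and uses hypothetical function names. The quoted error rates follow from P = 10^(-Q/10): Q = 27 gives about 0.002 and Q = 20 gives 0.01.

```python
from collections import Counter


def mean_quality(quality_string: str, offset: int = 33) -> float:
    """Mean Phred score of one read, assuming Phred+33 ASCII encoding."""
    scores = [ord(ch) - offset for ch in quality_string]
    return sum(scores) / len(scores)


def classify_per_sequence_quality(quality_strings: list[str]) -> str:
    """Classify by the most frequently observed (integer-binned) mean quality."""
    histogram = Counter(int(mean_quality(q)) for q in quality_strings)
    most_frequent_mean = histogram.most_common(1)[0][0]
    if most_frequent_mean < 20:
        return "FAIL"
    if most_frequent_mean < 27:
        return "WARN"
    return "PASS"
```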

1.5 Per base sequence content

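For each position across all reads, plots the proportion of each of the four bases A, C, G and T.

The x-axis is the base position; the y-axis is the percentage.
The four lines show the percentage of A, C, G and T at each position.
In a random library A and T should be roughly equal, as should C and G. The proportions can still be uneven, typically at the start of the reads, either because sequencing is less stable there or because of biased priming during library preparation. Ideally the base composition barely changes from one position to the next, so the lines in the plot should run parallel to each other.
A warning is raised if the difference between A and T, or between G and C, exceeds 10% at any position; a failure is raised if that difference exceeds 20% at any position.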
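
A rough sketch of this check: compute the percentage of each base per position, then compare the A/T and G/C differences with the 10% and 20% thresholds. The function names are hypothetical, and excluding N calls from the denominator is an assumption.

```python
def base_percentages(reads: list[str]) -> list[dict[str, float]]:
    """Percentage of A, C, G and T at each position (N calls are excluded)."""
    longest = max(len(read) for read in reads)
    table = []
    for pos in range(longest):
        column = [read[pos] for read in reads if pos < len(read)]
        called = sum(column.count(base) for base in "ACGT")
        table.append(
            {base: (100 * column.count(base) / called if called else 0.0)
             for base in "ACGT"}
        )
    return table


def classify_base_content(reads: list[str]) -> str:
    """Apply the 10% / 20% rule to the largest A-T or G-C gap at any position."""
    worst = max(
        max(abs(p["A"] - p["T"]), abs(p["G"] - p["C"]))
        for p in base_percentages(reads)
    )
    if worst > 20:
        return "FAIL"
    if worst > 10:
        return "WARN"
    return "PASS"
```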

1.6 Per sequence GC content

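The distribution of GC content over all sequences, used to judge whether the sequencing is sufficiently random.

The x-axis is the GC content, that is, the percentage of G and C bases in a read; the y-axis is the number of reads.
The blue line is the theoretical distribution (GC content is expected to be roughly normally distributed, with its centre estimated from the observed data); the red line is the distribution calculated from the actual data.
Ideally the observed GC distribution closely follows the theoretical one. A red curve that is normal in shape but shifted away from the theoretical curve suggests a systematic bias.
A warning is raised if the deviations from the theoretical distribution add up to more than 15% of the reads; a failure is raised if they add up to more than 30% of the reads.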
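
One way to picture the comparison between the observed and theoretical curves is sketched below. This is a simplified illustration, not FastQC's actual implementation: it assumes the reference normal curve is centred on the most common observed GC value, estimates the spread around that value, and sums the absolute differences between observed and expected read counts. All names are hypothetical.

```python
import math
from collections import Counter


def gc_percent(read: str) -> int:
    """GC content of one read, rounded to a whole percent."""
    return round(100 * sum(base in "GC" for base in read) / len(read))


def gc_deviation_percent(reads: list[str]) -> float:
    """Summed |observed - expected| read counts, as a percentage of all reads."""
    observed = Counter(gc_percent(read) for read in reads)
    total = sum(observed.values())
    mode = max(observed, key=observed.get)
    sd = math.sqrt(sum(n * (gc - mode) ** 2 for gc, n in observed.items()) / total)
    if sd == 0:
        return 0.0  # every read has the same GC content
    deviation = 0.0
    for gc in range(101):
        density = math.exp(-0.5 * ((gc - mode) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))
        deviation += abs(observed.get(gc, 0) - total * density)
    return 100 * deviation / total


def classify_gc_content(reads: list[str]) -> str:
    """Apply the 15% / 30% rule to the summed deviation."""
    deviation = gc_deviation_percent(reads)
    if deviation > 30:
        return "FAIL"
    if deviation > 15:
        return "WARN"
    return "PASS"
```

A strongly bimodal library, for instance half of the reads at 20% GC and half at 80% GC, lies far from any single normal curve and is classified as a failure by this sketch.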

1.7 Per base N content

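When the sequencer cannot tell which base occupies a given position of a read, it writes an N, which indicates that the call at that position is of very poor quality.

The x-axis is the position in the read; the y-axis is the percentage of N calls.
Normally the proportion of N is very small. A warning is raised if N exceeds 5% at any position; a failure is raised if it exceeds 20%.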
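
To locate the positions responsible for a warning, the N percentage can be computed per position and filtered by threshold. A small sketch with hypothetical function names, reporting 1-based positions:

```python
def n_percent_by_position(reads: list[str]) -> list[float]:
    """Percentage of reads with an N call at each position."""
    longest = max(len(read) for read in reads)
    percents = []
    for pos in range(longest):
        column = [read[pos] for read in reads if pos < len(read)]
        percents.append(100 * column.count("N") / len(column))
    return percents


def positions_above(reads: list[str], threshold: float = 5.0) -> list[int]:
    """1-based positions whose N percentage exceeds the threshold."""
    return [
        index + 1
        for index, percent in enumerate(n_percent_by_position(reads))
        if percent > threshold
    ]
```

Calling `positions_above(reads, 5.0)` lists the positions behind a warning, and `positions_above(reads, 20.0)` those behind a failure.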

1.8 Sequence Length Distribution

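The distribution of read lengths.

In theory every read produced in a run has the same length, but in practice there is always some variation.
A warning is raised if the reads are not all the same length. On its own this does not mean the data are unreliable, because some platforms, and any prior trimming, produce reads of varying length. A failure is raised if any read has length 0.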
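
The rule for this module is simple enough to state in a few lines. A sketch with a hypothetical function name:

```python
from collections import Counter


def classify_lengths(reads: list[str]) -> str:
    """FAIL if any read is empty, WARN if lengths differ, PASS otherwise."""
    lengths = Counter(len(read) for read in reads)
    if 0 in lengths:
        return "FAIL"
    if len(lengths) > 1:
        return "WARN"
    return "PASS"
```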

1.9 Sequence Duplication Levels

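Measures how often each sequence is duplicated.

The x-axis is the duplication level, that is, the number of times a sequence occurs. The y-axis is the share of sequences at each duplication level.
Deeper sequencing naturally produces some duplication, but a very high level of duplication suggests a possible bias.
The blue line shows the percentage of all reads at each duplication level; the red line shows the percentage of distinct sequences that remain after deduplication.
To keep the computation efficient, this module only analyses sequences that appear within the first 100,000 sequences.
In a diverse library most sequences should occur only once, so the higher the share with a duplication level of one, the better. A warning is raised if non-unique sequences make up more than 20% of all reads; a failure is raised if they make up more than 50%.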
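
The duplication levels and the 20% / 50% rule can be sketched as below. Two simplifications are assumptions of this example: the sample is just the first 100,000 reads, and "non-unique" is read as the share of reads that deduplication would remove. The function names are hypothetical.

```python
from collections import Counter


def duplication_levels(reads: list[str], sample_size: int = 100_000) -> Counter:
    """Map duplication level -> number of distinct sequences seen that often."""
    per_sequence = Counter(reads[:sample_size])
    return Counter(per_sequence.values())


def removed_by_dedup_percent(reads: list[str], sample_size: int = 100_000) -> float:
    """Percentage of sampled reads that are extra copies of an earlier read."""
    sample = reads[:sample_size]
    return 100 * (len(sample) - len(set(sample))) / len(sample)


def classify_duplication(reads: list[str]) -> str:
    """Apply the 20% / 50% rule to the share of non-unique reads."""
    percent = removed_by_dedup_percent(reads)
    if percent > 50:
        return "FAIL"
    if percent > 20:
        return "WARN"
    return "PASS"
```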

1.10 Overrepresented sequences

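Lists sequences that occur far more often than expected.

A warning is raised if any single sequence accounts for more than 0.1% of all reads; a failure is raised if any sequence accounts for more than 1%.
To save memory and computation, only sequences that appear within the first 200,000 reads are counted, so an overrepresented sequence that first appears later in the file can be missed.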
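
The sketch below illustrates why limiting the candidates to the first reads can miss a sequence: only sequences seen among the first `track_first` reads are counted, so anything that first appears later is never reported. The function and parameter names are hypothetical, and the cutoff is taken from the text above.

```python
from collections import Counter


def overrepresented(
    reads: list[str], track_first: int = 200_000, warn_percent: float = 0.1
) -> dict[str, float]:
    """Percentage of all reads for each tracked sequence above the threshold."""
    tracked = set(reads[:track_first])  # candidates come only from the early reads
    counts = Counter(read for read in reads if read in tracked)
    total = len(reads)
    return {
        sequence: 100 * count / total
        for sequence, count in counts.items()
        if 100 * count / total > warn_percent
    }
```

With `reads = ["AAAA"] * 3 + ["CCCC"] * 5` and `track_first=3`, only `AAAA` is reported, even though `CCCC` is the more abundant sequence.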

1.11 Adapter Content

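The proportion of reads containing common adapter sequences at each position.

The x-axis is the base position; the y-axis is the percentage of reads.
The aim is to measure how much adapter sequence is present at the ends of the reads. Adapters are known sequences ligated to the target fragments, whose own sequence is unknown, so that they can be sequenced.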
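
As an illustration of what the curve represents, the sketch below computes, for each position, the cumulative percentage of reads in which an adapter has started at or before that position. The function name is hypothetical, the exact-match search is a simplification, and the adapter string passed in is for the caller to choose.

```python
def adapter_content(reads: list[str], adapter: str) -> list[float]:
    """Cumulative % of reads whose first adapter match starts at or before each position."""
    longest = max(len(read) for read in reads)
    first_seen = [read.find(adapter) for read in reads]  # -1 when absent
    cumulative = []
    for pos in range(longest):
        hits = sum(1 for start in first_seen if 0 <= start <= pos)
        cumulative.append(100 * hits / len(reads))
    return cumulative
```

Because the percentage is cumulative, the curve can only rise along the read: once an adapter has been seen in a read, that read counts at every later position.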

2 Analysis

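2.1 Summary of results across files

The table below summarises the FastQC results for the 12 files. A green light means pass, an orange light means warning, and a red light means failure.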
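
A summary table like this can be assembled from the `summary.txt` file that FastQC writes for each input, which holds one tab-separated line per module (status, module name, file name). The sketch below parses that text and maps each status to its light colour. The function name and the example file name are hypothetical.

```python
LIGHTS = {"PASS": "green", "WARN": "orange", "FAIL": "red"}


def parse_summary(text: str) -> dict[str, str]:
    """Map module name -> light colour from the contents of a summary.txt file."""
    lights = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        status, module, _filename = line.split("\t")
        lights[module] = LIGHTS[status]
    return lights


# Illustrative input only; the file name is an assumption.
example = (
    "PASS\tBasic Statistics\tSRR1552444.fastq\n"
    "FAIL\tPer base sequence content\tSRR1552444.fastq\n"
    "WARN\tPer sequence GC content\tSRR1552444.fastq\n"
)
print(parse_summary(example))
```

Collecting one such dictionary per file gives the rows of the table.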

2.2 Conclusions

Based on the table above, the modules that show orange or red lights are discussed in turn.

2.2.1 Per base sequence content

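Per base sequence content fails for every file, meaning the difference between A and T, or between G and C, exceeds 20% at one or more positions.
The variation is most pronounced at positions 1 to 9.

The figures below show, from left to right and top to bottom, the Per base sequence content results for SRR1552444, SRR1552445, SRR1552446, SRR1552447, SRR1552448 and SRR1552449.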

knitr::include_graphics("E:\\graduate institute\\108_1\\Bioinformatics\\Homework7 fastq\\file plot\\44\\Per base sequence content.png")
knitr::include_graphics("E:\\graduate institute\\108_1\\Bioinformatics\\Homework7 fastq\\file plot\\45\\Per base sequence content.png")
knitr::include_graphics("E:\\graduate institute\\108_1\\Bioinformatics\\Homework7 fastq\\file plot\\46\\Per base sequence content.png")
knitr::include_graphics("E:\\graduate institute\\108_1\\Bioinformatics\\Homework7 fastq\\file plot\\47\\Per base sequence content.png")
knitr::include_graphics("E:\\graduate institute\\108_1\\Bioinformatics\\Homework7 fastq\\file plot\\48\\Per base sequence content.png")
knitr::include_graphics("E:\\graduate institute\\108_1\\Bioinformatics\\Homework7 fastq\\file plot\\49\\Per base sequence content.png")

The figures below show, from left to right and top to bottom, the Per base sequence content results for SRR1552450, SRR1552451, SRR1552452, SRR1552453, SRR1552454 and SRR1552455.

knitr::include_graphics("E:\\graduate institute\\108_1\\Bioinformatics\\Homework7 fastq\\file plot\\50\\Per base sequence content.png")
knitr::include_graphics("E:\\graduate institute\\108_1\\Bioinformatics\\Homework7 fastq\\file plot\\51\\Per base sequence content.png")
knitr::include_graphics("E:\\graduate institute\\108_1\\Bioinformatics\\Homework7 fastq\\file plot\\52\\Per base sequence content.png")
knitr::include_graphics("E:\\graduate institute\\108_1\\Bioinformatics\\Homework7 fastq\\file plot\\53\\Per base sequence content.png")
knitr::include_graphics("E:\\graduate institute\\108_1\\Bioinformatics\\Homework7 fastq\\file plot\\54\\Per base sequence content.png")
knitr::include_graphics("E:\\graduate institute\\108_1\\Bioinformatics\\Homework7 fastq\\file plot\\55\\Per base sequence content.png")

2.2.2 Per sequence GC content

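Nine of the 12 files raise a warning for Per sequence GC content, meaning the deviations from the theoretical distribution add up to more than 15% of reads. Three files, SRR1552448, SRR1552450 and SRR1552454, exceed 30% and are marked as failures.

The files that raise a warning are shown below.

From left to right and top to bottom, the figures show the Per sequence GC content results for SRR1552444, SRR1552445, SRR1552446 and SRR1552447.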

knitr::include_graphics("E:\\graduate institute\\108_1\\Bioinformatics\\Homework7 fastq\\file plot\\44\\Per sequence GC content.png")
knitr::include_graphics("E:\\graduate institute\\108_1\\Bioinformatics\\Homework7 fastq\\file plot\\45\\Per sequence GC content.png")
knitr::include_graphics("E:\\graduate institute\\108_1\\Bioinformatics\\Homework7 fastq\\file plot\\46\\Per sequence GC content.png")
knitr::include_graphics("E:\\graduate institute\\108_1\\Bioinformatics\\Homework7 fastq\\file plot\\47\\Per sequence GC content.png")

From left to right and top to bottom, the figures show the Per sequence GC content results for SRR1552449, SRR1552451, SRR1552452, SRR1552453 and SRR1552455.

knitr::include_graphics("E:\\graduate institute\\108_1\\Bioinformatics\\Homework7 fastq\\file plot\\49\\Per sequence GC content.png")
knitr::include_graphics("E:\\graduate institute\\108_1\\Bioinformatics\\Homework7 fastq\\file plot\\51\\Per sequence GC content.png")
knitr::include_graphics("E:\\graduate institute\\108_1\\Bioinformatics\\Homework7 fastq\\file plot\\52\\Per sequence GC content.png")
knitr::include_graphics("E:\\graduate institute\\108_1\\Bioinformatics\\Homework7 fastq\\file plot\\53\\Per sequence GC content.png")
knitr::include_graphics("E:\\graduate institute\\108_1\\Bioinformatics\\Homework7 fastq\\file plot\\55\\Per sequence GC content.png")

The files marked as failures are shown below.

From left to right and top to bottom, the figures show the Per sequence GC content results for SRR1552448, SRR1552450 and SRR1552454.

knitr::include_graphics("E:\\graduate institute\\108_1\\Bioinformatics\\Homework7 fastq\\file plot\\48\\Per sequence GC content.png")
knitr::include_graphics("E:\\graduate institute\\108_1\\Bioinformatics\\Homework7 fastq\\file plot\\50\\Per sequence GC content.png")
knitr::include_graphics("E:\\graduate institute\\108_1\\Bioinformatics\\Homework7 fastq\\file plot\\54\\Per sequence GC content.png")

2.2.3 Per base N content

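Five files, SRR1552444 through SRR1552448, raise a warning for Per base N content.
In each of these files there are positions where the sequencer could not call the base in more than 5% of reads.
The poorly called positions lie at roughly 76 to 84.
The plots for the five files are shown below, from left to right and top to bottom (SRR1552444 at the top left, SRR1552448 at the bottom left):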

knitr::include_graphics("E:\\graduate institute\\108_1\\Bioinformatics\\Homework7 fastq\\file plot\\44\\Per base N content.png")
knitr::include_graphics("E:\\graduate institute\\108_1\\Bioinformatics\\Homework7 fastq\\file plot\\45\\Per base N content.png")
knitr::include_graphics("E:\\graduate institute\\108_1\\Bioinformatics\\Homework7 fastq\\file plot\\46\\Per base N content.png")
knitr::include_graphics("E:\\graduate institute\\108_1\\Bioinformatics\\Homework7 fastq\\file plot\\47\\Per base N content.png")
knitr::include_graphics("E:\\graduate institute\\108_1\\Bioinformatics\\Homework7 fastq\\file plot\\48\\Per base N content.png")

2.2.4 Overrepresented sequences

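For Overrepresented sequences, only SRR1552454 passes; every other file contains sequences that each account for more than 0.1% of total reads.
Because only the first 200,000 reads are examined, other overrepresented sequences may exist that are not reported here.

The table below shows the result for SRR1552444, where 2 sequences each exceed 0.1% of total reads.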

knitr::include_graphics("E:\\graduate institute\\108_1\\Bioinformatics\\Homework7 fastq\\file plot\\44\\Overrepresented sequences.png")

The table below shows the result for SRR1552445, where 2 sequences each exceed 0.1% of total reads.

knitr::include_graphics("E:\\graduate institute\\108_1\\Bioinformatics\\Homework7 fastq\\file plot\\45\\Overrepresented sequences.png")

The table below shows the result for SRR1552446, where 13 sequences each exceed 0.1% of total reads.

knitr::include_graphics("E:\\graduate institute\\108_1\\Bioinformatics\\Homework7 fastq\\file plot\\46\\Overrepresented sequences.png")

The table below shows the result for SRR1552447, which contains sequences that exceed 0.1% of total reads.

knitr::include_graphics("E:\\graduate institute\\108_1\\Bioinformatics\\Homework7 fastq\\file plot\\47\\Overrepresented sequences.png")

The table below shows the result for SRR1552448, where 31 sequences each exceed 0.1% of total reads.

knitr::include_graphics("E:\\graduate institute\\108_1\\Bioinformatics\\Homework7 fastq\\file plot\\48\\Overrepresented sequences.png")

The table below shows the result for SRR1552449, where 16 sequences each exceed 0.1% of total reads.

knitr::include_graphics("E:\\graduate institute\\108_1\\Bioinformatics\\Homework7 fastq\\file plot\\49\\Overrepresented sequences.png")

The table below shows the result for SRR1552450, where 1 sequence exceeds 0.1% of total reads.

knitr::include_graphics("E:\\graduate institute\\108_1\\Bioinformatics\\Homework7 fastq\\file plot\\50\\Overrepresented sequences.png")

The table below shows the result for SRR1552451, where 1 sequence exceeds 0.1% of total reads.

knitr::include_graphics("E:\\graduate institute\\108_1\\Bioinformatics\\Homework7 fastq\\file plot\\51\\Overrepresented sequences.png")

The table below shows the result for SRR1552452, where 1 sequence exceeds 0.1% of total reads.

knitr::include_graphics("E:\\graduate institute\\108_1\\Bioinformatics\\Homework7 fastq\\file plot\\52\\Overrepresented sequences.png")

The table below shows the result for SRR1552453, where 2 sequences each exceed 0.1% of total reads.

knitr::include_graphics("E:\\graduate institute\\108_1\\Bioinformatics\\Homework7 fastq\\file plot\\53\\Overrepresented sequences.png")

The table below shows the result for SRR1552455, where 2 sequences each exceed 0.1% of total reads.

knitr::include_graphics("E:\\graduate institute\\108_1\\Bioinformatics\\Homework7 fastq\\file plot\\55\\Overrepresented sequences.png")

Bioinformatics-FastQC Report

Pin-Hsuan Chiu

2019/11/23