Alignment QC, and Variant Calling

Author

Charlie Clarke

Alignment QC and Variant Calling

The alignment of the the 5 samples to the CanFam genome experienced a handful of hiccups, but ultimately was successful. The variant calling however was not able to be completed due I think in part to how the genome was formatted.

Alignment QC

Using SRR24300331.bam as an example we can see that the samstats and multiQC files show an average quality of 38.9 and 24800473590 total bases mapped.

Sample Name	Error rate	M Reads Mapped	% Mapped	% Proper Pairs	M Total seqs
SRR24300331	0.50%	171.1	97.6%	95.4%	175.3
SRR24300332	0.51%	145.4	97.6%	95.3%	149.0
SRR24300333	0.59%	153.6	96.4%	90.5%	159.3
SRR24300335	0.50%	171.7	97.6%	95.4%	176.0
SRR24300349	5.10%	416.2	99.4%	71.1%	418.9
SRR24300350	3.96%	434.5	99.5%	77.1%	436.9
SRR24300357	0.51%	244.5	96.2%	93.4%	254.2

MultiQC of allignments

Variant Calling

After creating the freebayes.query.txt file with the genome parameters listed in the paper chr37:18000042-20145745, it became clear to me that something went wrong with the way that the genome was labeled.

This generated an empty data frame with 17 headers and no outputs. After playing around with the parameters I was able to figure out that it has something to do with the nomenclature of the genome and potentially the way that it is indexed.

This is how each chromosome is labeled :

NC_006583.4 Canis lupus familiaris isolate Tasha breed boxer chromosome 1, alternate assembly Dog10K_Boxer_Tasha, whole genome shotgun sequence.

For chromosome 37 the unique identifier was NC_006589.4

library (tidyverse)

Warning: package 'tidyverse' was built under R version 4.4.2

Warning: package 'ggplot2' was built under R version 4.4.2

Warning: package 'tibble' was built under R version 4.4.2

Warning: package 'tidyr' was built under R version 4.4.2

Warning: package 'readr' was built under R version 4.4.2

Warning: package 'purrr' was built under R version 4.4.2

Warning: package 'dplyr' was built under R version 4.4.2

Warning: package 'stringr' was built under R version 4.4.2

Warning: package 'forcats' was built under R version 4.4.2

Warning: package 'lubridate' was built under R version 4.4.2

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library (knitr)

Warning: package 'knitr' was built under R version 4.4.3

df <- read.table("freebayes.query_final.txt.gz", header=TRUE, comment.char="")
colnames(df) <- colnames(df) %>% str_remove("^X[.0-9]+")
df <- filter(df, !str_detect(ALT, ",")) %>%
  mutate(AB=as.numeric(AB))

kable(head(df))

CHROM	POS	REF	ALT	QUAL	DP	AB	SRR24300357.GT	SRR24300335.GT	SRR24300333.GT	SRR24300331.GT	SRR24300332.GT	SRR24300357.DP	SRR24300335.DP	SRR24300333.DP	SRR24300331.DP	SRR24300332.DP
NC_006589.4	18000187	G	A	382.3830000	55	0.535714	0/0	0/0	0/0	0/1	0/1	11	10	6	17	11
NC_006589.4	18000408	AT	CC	0.0000009	53	0.000000	0/0	0/0	0/0	0/0	0/0	9	16	5	14	9
NC_006589.4	18000422	TAAAC	AAAAT	0.0000014	48	0.000000	0/0	0/0	0/0	0/0	0/0	6	14	6	14	8
NC_006589.4	18000434	T	C	0.0000009	52	0.000000	0/0	0/0	0/0	0/0	0/0	6	18	6	13	9
NC_006589.4	18000730	A	G	427.2360000	71	0.485714	0/0	0/0	0/0	0/1	0/1	10	22	4	21	14
NC_006589.4	18001008	GACTTTA	GACTTTACTTTA	94.6521000	45	0.500000	0/0	0/0	0/0	0/1	0/1	19	12	2	8	4

First I measured the allele balance:

#qual
ggplot(df, aes(x=POS, y=AB, color=QUAL > 30)) +
  geom_point(size=0.2)

Then depth colored by quality:

#Depth
ggplot(df, aes(x=POS, y=DP, color=QUAL > 30)) +
  geom_point(size=0.2) +
  ylim(0,400)

Finally after annotating all of the outputs I still am unclear exactly which samples are long lived and which samples are short lived.