Alignment QC, and Variant Calling

Author

Charlie Clarke

Alignment QC and Variant Calling

The alignment of the the 5 samples to the CanFam genome experienced a handful of hiccups, but ultimately was successful. The variant calling however was not able to be completed due I think in part to how the genome was formatted.

Alignment QC

Using SRR24300331.bam as an example we can see that the samstats and multiQC files show an average quality of 38.9 and 24800473590 total bases mapped.

Sample Name Error rate M Non-Primary M Reads Mapped % Mapped % Proper Pairs M Total seqs
SRR24300331 0.50% 0.0 171.1 97.6% 95.4% 175.3
SRR24300332 0.51% 0.0 145.4 97.6% 95.3% 149.0
SRR24300333 0.59% 0.0 153.6 96.4% 90.5% 159.3
SRR24300335 0.50% 0.0 171.7 97.6% 95.4% 176.0
SRR24300349 5.10% 0.0 416.2 99.4% 71.1% 418.9
SRR24300350 3.96% 0.0 434.5 99.5% 77.1% 436.9
SRR24300357 0.51% 0.0 244.5 96.2% 93.4% 254.2

MultiQC of allignments

Alignment Plot

Variant Calling

After creating the freebayes.query.txt file with the genome parameters listed in the paper chr37:18000042-20145745, it became clear to me that something went wrong with the way that the genome was labeled.

This generated an empty data frame with 17 headers and no outputs. After playing around with the parameters I was able to figure out that it has something to do with the nomenclature of the genome and potentially the way that it is indexed.

This is how each chromosome is labeled :

NC_006583.4 Canis lupus familiaris isolate Tasha breed boxer chromosome 1, alternate assembly Dog10K_Boxer_Tasha, whole genome shotgun sequence.

For chromosome 37 the unique identifier was NC_006589.4

library (tidyverse)
Warning: package 'tidyverse' was built under R version 4.4.2
Warning: package 'ggplot2' was built under R version 4.4.2
Warning: package 'tibble' was built under R version 4.4.2
Warning: package 'tidyr' was built under R version 4.4.2
Warning: package 'readr' was built under R version 4.4.2
Warning: package 'purrr' was built under R version 4.4.2
Warning: package 'dplyr' was built under R version 4.4.2
Warning: package 'stringr' was built under R version 4.4.2
Warning: package 'forcats' was built under R version 4.4.2
Warning: package 'lubridate' was built under R version 4.4.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library (knitr)
Warning: package 'knitr' was built under R version 4.4.3
df <- read.table("freebayes.query_final.txt.gz", header=TRUE, comment.char="")
colnames(df) <- colnames(df) %>% str_remove("^X[.0-9]+")
df <- filter(df, !str_detect(ALT, ",")) %>%
  mutate(AB=as.numeric(AB))

kable(head(df))
CHROM POS REF ALT QUAL DP AB SRR24300357.GT SRR24300335.GT SRR24300333.GT SRR24300331.GT SRR24300332.GT SRR24300357.DP SRR24300335.DP SRR24300333.DP SRR24300331.DP SRR24300332.DP
NC_006589.4 18000187 G A 382.3830000 55 0.535714 0/0 0/0 0/0 0/1 0/1 11 10 6 17 11
NC_006589.4 18000408 AT CC 0.0000009 53 0.000000 0/0 0/0 0/0 0/0 0/0 9 16 5 14 9
NC_006589.4 18000422 TAAAC AAAAT 0.0000014 48 0.000000 0/0 0/0 0/0 0/0 0/0 6 14 6 14 8
NC_006589.4 18000434 T C 0.0000009 52 0.000000 0/0 0/0 0/0 0/0 0/0 6 18 6 13 9
NC_006589.4 18000730 A G 427.2360000 71 0.485714 0/0 0/0 0/0 0/1 0/1 10 22 4 21 14
NC_006589.4 18001008 GACTTTA GACTTTACTTTA 94.6521000 45 0.500000 0/0 0/0 0/0 0/1 0/1 19 12 2 8 4

First I measured the allele balance:

#qual
ggplot(df, aes(x=POS, y=AB, color=QUAL > 30)) +
  geom_point(size=0.2)

Then depth colored by quality:

#Depth
ggplot(df, aes(x=POS, y=DP, color=QUAL > 30)) +
  geom_point(size=0.2) +
  ylim(0,400)

Finally after annotating all of the outputs I still am unclear exactly which samples are long lived and which samples are short lived.