The coverage depth for sample CR2342 is calculated to be ~82.691x coverage. I arrived at the answer using the equation from the lecture/video and the thought process was shown below:
# Coverage = LN/G where C=coverage, L= read length, N = number of reads and G= haploid size of genome
# L=51
# N=194567996
# G=120,000,000 (or 120 Mb)
# C = (51*194567996)/120,000,000 which is equal to ~82.691x coverage
It might not be accurate because (1) reads varied in length based on soft- or hard-clipping and trimming due to adapters resulting in reads with different lengths. Also (2) coverage depth is reported based on read length and number of reads without actually mapping the reads to the genome which is not accurate as unmapped reads are also accounted for in this calculation. Therefore, this formula is only used for making theoretical calculations.
MQ0 for reads mapped with BWA means that the read mapped to more than one location in the reference genome. I would expect (c) situation to find MQO reads mapped to gene A pr A’ in the reference.
From the samtools stat output file, insert size and the corresponding number of pairs falling into each insert category (number of reads) are extracted. After extraction of these values, the number of reads (counts) are plotted against the insert size to devise a visual histogram representation of insert size distribution. This graph is a visual representation of the distribution of the counts of the number of reads that have a certain insert size. A copy of the graph is attached as a pdf file to this submission. We would be looking at the tail of the distribution which helps infer a certain event including deletions/insertions depending on which end of the tail we will be looking at. The right end of the tail (with increasing size of insert sizes) would be enriched for deletions. Deletions happen in which genomic regions that are present in the reference are absent from the sample. When this happens, when we align the sample reads to the reference genome, we would expect a larger insert size as it spans a deletion. Based on the distribution, it seems that a majority of insert sizes are around 300-400. After insert size of 466, the number of reads that have insert size of greater than 466 are significantly less (huge jump from 11942 reads with insert size of 466 to only 4268 reads with insert size of 467). If i were to determine a threshod insert size above which are likely to be enriched for deletions, I would choose insert size of 466 as my threshold, where anything above this insert size may possibly be a deletion event.
The read depth at position 10,001 on chromosome_1 for sample CR2342 is 241:
samtools depth -r chromosome_1:10001-10001 /scratch/work/courses/BI7653/hw6.2022/CR2342.bam
# This identifies read depth for only position 10,001 on chromosome_1 for sample CR2342.
The coverage in the interval chromosome_1:10001-10020 for CR2342 is 282
samtools depth -r chromosome_1:10001-10020 /scratch/work/courses/BI7653/hw6.2022/CR2342.bam
chromosome_1 10001 241
chromosome_1 10002 251
chromosome_1 10003 260
chromosome_1 10004 266
chromosome_1 10005 268
chromosome_1 10006 268
chromosome_1 10007 274
chromosome_1 10008 275
chromosome_1 10009 276
chromosome_1 10010 283
chromosome_1 10011 289
chromosome_1 10012 294
chromosome_1 10013 298
chromosome_1 10014 302
chromosome_1 10015 301
chromosome_1 10016 298
chromosome_1 10017 299
chromosome_1 10018 297
chromosome_1 10019 299
chromosome_1 10020 302
# Coverage is calculated by taking the average of depth of positions 10001-10020 from the sample alignment for Chlamydomonas strain CR2342.
#!/bin/bash
#
#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=8:00:00
#SBATCH --mem=8GB
#SBATCH --job-name=bedcov_coverage
#SBATCH --mail-type=FAIL,END
#SBATCH --mail-user=bl2477@nyu.edu
module purge
module load samtools/intel/1.14
samtools bedcov /scratch/work/courses/BI7653/hw6.2022/chromosome_1.500bp_intervals.bed /scratch/work/courses/BI7653/hw6.2022/CR2342.bam /scratch/work/courses/BI7653/hw6.2022/CR407.bam
chromosome_1 8029000 8029500 13544 21495
chromosome_1 8029500 8030000 3065 17839
chromosome_1 8030000 8030500 2383 19434
chromosome_1 8030500 8031000 7392 28084
chromosome_1 8031000 8031500 11040 38217
chromosome_1 8031500 8032000 9891 26758
chromosome_1 8032000 8032500 10937 26281
chromosome_1 8032500 8033000 19142 31353
chromosome_1 8033000 8033500 17844 28520
chromosome_1 8033500 8033585 1323 2457
library(magrittr)
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
##
## set_names
## The following object is masked from 'package:tidyr':
##
## extract
library(dplyr)
library(tidyverse)
bedcov.tbl_df <- read_tsv("slurm-15920505.out",col_names=F)
## Rows: 16068 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (1): X1
## dbl (4): X2, X3, X4, X5
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
names(bedcov.tbl_df) <- c('chr','start','end','CR2342_dp','CR407_dp')
bedcov.tbl_df <- bedcov.tbl_df %>%
mutate(CR2342_normdp = log2( CR2342_dp / median(CR2342_dp,na.rm=T)))
bedcov.tbl_df <- bedcov.tbl_df %>%
mutate(CR407_normdp = log2( CR407_dp / median(CR407_dp,na.rm=T)))
bedcov_pivoted.tbl_df <- bedcov.tbl_df %>%
select(-CR2342_dp,-CR407_dp) %>%
pivot_longer(cols = c(-chr,-start,-end),
names_to = 'sample',
values_to = 'normalized_depth')
bedcov_pivoted.tbl_df %>%
ggplot(aes(x = start,y = normalized_depth)) + geom_point(color="#0072B2") +
facet_wrap(~ sample,nrow=2) + theme_bw()
Sample CR407 has a large duplication on chromosome 1 approximately at starting position 4,250,000 (4.25e+06), ~400 kb duplication.
The approxiate log2 value in this duplicated region is ~2.4 (little less than 2.5). Based on this log2 value, a value of 2.4 on the log2 scale is close to 5 (5.278) which I believe is the copy number of this duplication given that Chlamydomonas are haploid.