Task 1: Summarize coverage and insert size distribution from BAM alignments

Q1.1 Report the coverage depth for sample CR2342. The genome size for Chlamydomonas is 120 Mb. Explain how you arrived at this answer [ 1 points ]:

The coverage depth for sample CR2342 is calculated to be ~82.691x coverage. I arrived at the answer using the equation from the lecture/video and the thought process was shown below:

# Coverage = LN/G where C=coverage, L= read length, N = number of reads and G= haploid size of genome
# L=51
# N=194567996
# G=120,000,000 (or 120 Mb)
# C = (51*194567996)/120,000,000 which is equal to ~82.691x coverage

Q1.1b. When reporting coverage depth genomewide for a BAM, what are two reasons why it might it not be accurate to simply count the number of reads in the alignment (e.g., samtools view -c ), multiply by the read length from the sequencing (e.g., 100 in a 2 X 100 PE run), and divide by the genome size? [ 1 point ]

It might not be accurate because (1) reads varied in length based on soft- or hard-clipping and trimming due to adapters resulting in reads with different lengths. Also (2) coverage depth is reported based on read length and number of reads without actually mapping the reads to the genome which is not accurate as unmapped reads are also accounted for in this calculation. Therefore, this formula is only used for making theoretical calculations.

Q1.1c Explain what MQ0 (=mapping quality of zero) represents in the stats output for reads mapped with BWA. Then answer the following multiple-choice question. Which of the following situations would you expect to find MQ0 reads mapped to gene A (or A’) in the reference? Choose the single best answer. (a) gene A is duplicated in the reference to form identical copies gene A and A’, but is single copy gene A in the sequenced sample (b) gene A is single copy in the reference but duplicated to form identifical copies gene A and gene A’ in the sequenced sample (c) both situations should produce MQ0 reads mapped to the reference [ 1 point ].

MQ0 for reads mapped with BWA means that the read mapped to more than one location in the reference genome. I would expect (c) situation to find MQO reads mapped to gene A pr A’ in the reference.

Q1.1d The lines in the samtools stat output beginning with “IS” contain the insert size and the corresponding number of pairs falling into each insert size category. Use these data to devise a crude method to predict deletions using this empirical insert size distribution [ 2 points ].

From the samtools stat output file, insert size and the corresponding number of pairs falling into each insert category (number of reads) are extracted. After extraction of these values, the number of reads (counts) are plotted against the insert size to devise a visual histogram representation of insert size distribution. This graph is a visual representation of the distribution of the counts of the number of reads that have a certain insert size. A copy of the graph is attached as a pdf file to this submission. We would be looking at the tail of the distribution which helps infer a certain event including deletions/insertions depending on which end of the tail we will be looking at. The right end of the tail (with increasing size of insert sizes) would be enriched for deletions. Deletions happen in which genomic regions that are present in the reference are absent from the sample. When this happens, when we align the sample reads to the reference genome, we would expect a larger insert size as it spans a deletion. Based on the distribution, it seems that a majority of insert sizes are around 300-400. After insert size of 466, the number of reads that have insert size of greater than 466 are significantly less (huge jump from 11942 reads with insert size of 466 to only 4268 reads with insert size of 467). If i were to determine a threshod insert size above which are likely to be enriched for deletions, I would choose insert size of 466 as my threshold, where anything above this insert size may possibly be a deletion event.

Task 2: Coverage depth in genomic regions and copy number variant discovery

Q2.1a: What is the read depth at position 10,001 on chromosome_1 for sample CR2342? Please show your command with your answer.

The read depth at position 10,001 on chromosome_1 for sample CR2342 is 241:

samtools depth -r chromosome_1:10001-10001 /scratch/work/courses/BI7653/hw6.2022/CR2342.bam
# This identifies read depth for only position 10,001 on chromosome_1 for sample CR2342.

Q2.1b: What is the coverage in the interval chromosome_1:10001-10020 for CR2342?

The coverage in the interval chromosome_1:10001-10020 for CR2342 is 282

samtools depth -r chromosome_1:10001-10020 /scratch/work/courses/BI7653/hw6.2022/CR2342.bam
chromosome_1    10001   241
chromosome_1    10002   251
chromosome_1    10003   260
chromosome_1    10004   266
chromosome_1    10005   268
chromosome_1    10006   268
chromosome_1    10007   274
chromosome_1    10008   275
chromosome_1    10009   276
chromosome_1    10010   283
chromosome_1    10011   289
chromosome_1    10012   294
chromosome_1    10013   298
chromosome_1    10014   302
chromosome_1    10015   301
chromosome_1    10016   298
chromosome_1    10017   299
chromosome_1    10018   297
chromosome_1    10019   299
chromosome_1    10020   302
# Coverage is calculated by taking the average of depth of positions 10001-10020 from the sample alignment for Chlamydomonas strain CR2342.

Q2.2a Paste the contents of your job submission script into your assignment document [ 1 point ].

#!/bin/bash
# 
#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=8:00:00
#SBATCH --mem=8GB
#SBATCH --job-name=bedcov_coverage
#SBATCH --mail-type=FAIL,END
#SBATCH --mail-user=bl2477@nyu.edu

module purge
module load samtools/intel/1.14

samtools bedcov /scratch/work/courses/BI7653/hw6.2022/chromosome_1.500bp_intervals.bed /scratch/work/courses/BI7653/hw6.2022/CR2342.bam /scratch/work/courses/BI7653/hw6.2022/CR407.bam

Q2.2b What is the coverage of the last 10 intervals CR407 and CR2342 in the output file: (tail -n 10 ) in your answers file [ 1 point ].

chromosome_1    8029000 8029500 13544   21495
chromosome_1    8029500 8030000 3065    17839
chromosome_1    8030000 8030500 2383    19434
chromosome_1    8030500 8031000 7392    28084
chromosome_1    8031000 8031500 11040   38217
chromosome_1    8031500 8032000 9891    26758
chromosome_1    8032000 8032500 10937   26281
chromosome_1    8032500 8033000 19142   31353
chromosome_1    8033000 8033500 17844   28520
chromosome_1    8033500 8033585 1323    2457

Q2.3 Include your plot in your MarkDown report or use the example code to create a pdf (which you must submit with your answer) [ 1 point ]:

library(magrittr)
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
## 
##     set_names
## The following object is masked from 'package:tidyr':
## 
##     extract
library(dplyr)
library(tidyverse) 
bedcov.tbl_df <- read_tsv("slurm-15920505.out",col_names=F)
## Rows: 16068 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (1): X1
## dbl (4): X2, X3, X4, X5
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
names(bedcov.tbl_df) <- c('chr','start','end','CR2342_dp','CR407_dp')

bedcov.tbl_df <- bedcov.tbl_df %>% 
  mutate(CR2342_normdp = log2( CR2342_dp / median(CR2342_dp,na.rm=T)))
bedcov.tbl_df <- bedcov.tbl_df %>% 
  mutate(CR407_normdp = log2( CR407_dp / median(CR407_dp,na.rm=T)))

bedcov_pivoted.tbl_df <- bedcov.tbl_df %>%
  select(-CR2342_dp,-CR407_dp) %>%
  pivot_longer(cols = c(-chr,-start,-end), 
               names_to = 'sample', 
               values_to = 'normalized_depth')

bedcov_pivoted.tbl_df %>%
  ggplot(aes(x = start,y = normalized_depth)) + geom_point(color="#0072B2") + 
  facet_wrap(~ sample,nrow=2) + theme_bw()

Q2.4a Which sample has a large (~ 400 kb) duplication on chromosome 1? Approximately what position on the chromosome is the duplication?

Sample CR407 has a large duplication on chromosome 1 approximately at starting position 4,250,000 (4.25e+06), ~400 kb duplication.

Q2.4b What is the approximate log2 value in this duplicated region? Based on this log2 value, what do you think the copy number of this duplication might be given that Chlamydomonas are haploid?

The approxiate log2 value in this duplicated region is ~2.4 (little less than 2.5). Based on this log2 value, a value of 2.4 on the log2 scale is close to 5 (5.278) which I believe is the copy number of this duplication given that Chlamydomonas are haploid.