Midterm project Data Science

# programs
library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.2.1 --

## v ggplot2 2.2.1     v purrr   0.2.4
## v tibble  1.4.2     v dplyr   0.7.4
## v tidyr   0.8.0     v stringr 1.3.0
## v readr   1.1.1     v forcats 0.3.0

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Q0

Import the data (warning the file is big, consider using skip and nmax when reading in data) Describe briefly the structure of the data and what the different variables mean - how many different chromosomes do you have?

# importing the data with read, setting nmax

d = read_delim(file = "dgrp2.tgeno", delim = c(" "),
               col_type = cols_only(chr='c', pos='i', id='c', ref='c', alt='c', refc='i', altc='i', cov='i')
)

Answering the Q0 questions

names(d)

## [1] "chr"  "pos"  "id"   "ref"  "alt"  "refc" "altc" "cov"

chr: describes which chromosome pos: describes the genomic position id: The ID of the snp(chromosome, position and type of change/SNP) ref: Reference allel (Which nucleotide) alt: Alternative allel (Which nucleotide) refc: Frequency of the reference allel altc: Frequency of the alternative allel. qual: Quality of the read cov: Coverage. Number of reads that include the SNP.

# For at finde de forskellige kromosomer, som er i datasettet bruger jeg unique funktionen.
unique(select(d, chr), incomparables= FALSE)

## # A tibble: 6 x 1
##   chr  
##   <chr>
## 1 2L   
## 2 2R   
## 3 3L   
## 4 3R   
## 5 4    
## 6 X

Q1

Extract the first 100,000 SNPs located on chromosome 3L, Locate the genomic position of the first and last SNP filter out all SNPs for which less than 75%, 80%, 90%, 95% of the individuals could be genotyped, report how many SNPs are left based on the filter you apply In all subsequent questions, restrict your attention to SNP genotyped in at least 95% of individuals

# To ensure that I only use 3L i filter these, and ensuring there's difference between reference og alternative i dictate that they should be higher than 0.

d_100k = d %>%
  filter(chr=="3L", nchar(ref)==1, nchar(alt)==1, refc > 0, altc > 0)

# To work with the first 10^5 SNPs i use the slice function.
d_100k <-slice(d_100k, 1:100000)

#To show the first and last SNP i use the head and tail function
head(d_100k)

## # A tibble: 6 x 8
##   chr     pos id           ref   alt    refc  altc   cov
##   <chr> <int> <chr>        <chr> <chr> <int> <int> <int>
## 1 3L    19766 3L_19766_SNP G     T        78     2     2
## 2 3L    19809 3L_19809_SNP C     G        99     1     6
## 3 3L    19834 3L_19834_SNP C     A        82     6     9
## 4 3L    19855 3L_19855_SNP C     T        79    13     9
## 5 3L    19862 3L_19862_SNP T     A        75    12     9
## 6 3L    19917 3L_19917_SNP A     G        94     2    10

tail(d_100k)

## # A tibble: 6 x 8
##   chr       pos id             ref   alt    refc  altc   cov
##   <chr>   <int> <chr>          <chr> <chr> <int> <int> <int>
## 1 3L    2484075 3L_2484075_SNP T     C       161    35    25
## 2 3L    2484078 3L_2484078_SNP G     A       195     8    26
## 3 3L    2484093 3L_2484093_SNP G     T       189    11    31
## 4 3L    2484113 3L_2484113_SNP C     A       175    25    34
## 5 3L    2484121 3L_2484121_SNP A     T       197     5    35
## 6 3L    2484135 3L_2484135_SNP T     C       203     2    34

q1: The first genomic position is 19766 and the last is 2484135 for the SNPs.

# Filtering out the individiuals which less than 75%, 80%, 85%, 90% and 95% 
# So if (refc + altc)/205 is below 75%, 80%, 85%, 90% or 95% not enough individuals could be sequenced.

nrow(filter(d_100k, (refc + altc)/205 >= 0.75))

## [1] 98195

nrow(filter(d_100k, (refc + altc)/205 >= 0.80))

## [1] 97673

nrow(filter(d_100k, (refc + altc)/205 >= 0.85))

## [1] 96942

nrow(filter(d_100k, (refc + altc)/205 >= 0.90))

## [1] 95625

nrow(filter(d_100k, (refc + altc)/205 >= 0.95))

## [1] 90725

# Restricting my attention to SNP genotyped at least 95% of the individuals.
d_100k <- filter(d_100k, (refc + altc)/205 >= 0.95)

Q2.

Graph the distribution of SNP allele frequencies, graph the distribution of coverage of SNPs (“cov”) and the distribution of number of lines genotyped for each snp Is there a statistical association the different types of SNPs and the allele frequency of SNPs? The type of snp is defined in the last part of the “id” column.

We define a SNP as either rare (maf < 0.05) or common (maf >= 0.05).

We define coverage as either “low coverage” (coverage in the lowest 5% quantile) or “high coverage”.

# Programs
library(ggplot2)

Graphin the distribution of SNP allel frequencies

# Distribution of SNP allele frequency
ggplot(d_100k) +
  geom_histogram(aes(altc, fill = "alternative"), bins=30, alpha = 0.6) +
  geom_histogram(aes(refc, fill = "reference"), bins=30, alpha = 0.7) +
  xlim(c(0,100)) +
  ggtitle ("Distribution of SNP allele frequencies") +
  theme(plot.title = element_text(hjust = 0.5)) +
  xlab("Allele Frequency") +
  ylab("Number of SNPs")

## Warning: Removed 9526 rows containing non-finite values (stat_bin).

## Warning: Removed 80891 rows containing non-finite values (stat_bin).

# Distribution of coverage of SNPs.
ggplot(d_100k, aes(x=cov)) +
  geom_histogram(binwidth = 0.5, fill = "chartreuse3") +
  xlim(c(10,55)) +
  ggtitle ("Distribution of coverage of SNPs") +
  theme(plot.title = element_text(hjust = 0.5)) +
  xlab("Coverage") +
  ylab("Number of SNPs")

## Warning: Removed 34 rows containing non-finite values (stat_bin).

The coverage is normal distributed(bell-shaped)

# Distribution of number of lines genotyped for each SNP. Using geom_bar because we work with discrete variables.
ggplot(d_100k, aes(x= refc + altc),) +
  geom_bar(color = "brown", fill = "brown") +
  ggtitle ("Distribution of lines genotyped for each SNP") +
  theme(plot.title = element_text(hjust = 0.5)) +
  xlab("Number of lines genotyped") +
  ylab("Number of SNPs")

# To calculate and insert MAF(Minor Allele Frequency as a column I have to mute the dataset with a pipefunction)
d_100k <- d_100k %>%
  mutate(maf=ifelse(altc < refc, altc/(altc+refc), refc/(altc+refc)))

head(d_100k)

## # A tibble: 6 x 9
##   chr     pos id           ref   alt    refc  altc   cov     maf
##   <chr> <int> <chr>        <chr> <chr> <int> <int> <int>   <dbl>
## 1 3L    51289 3L_51289_SNP T     A       194     2    33 0.0102 
## 2 3L    51953 3L_51953_SNP G     C       195     1    37 0.00510
## 3 3L    51970 3L_51970_SNP T     A       195     1    37 0.00510
## 4 3L    52060 3L_52060_SNP A     G       193     2    35 0.0103 
## 5 3L    52438 3L_52438_SNP A     T       188     8    24 0.0408 
## 6 3L    52703 3L_52703_SNP T     A       190     5    24 0.0256

Succesfully added the maf column

nrow(filter(d_100k, maf >= 0.05))

## [1] 39940

There is 39940 commen mafs

nrow(filter(d_100k, maf < 0.05))

## [1] 50785

There is 50785 rare mafs.

Q3. If coverage was homogenous throughout the genome (by that we mean that on average the coverage is the same for any given position), what probability distribution is expected to capture well the coverage ?

A poisson distribution is expected to capture well since the coverage is the same for any given position.

Q4.

Make goodness of fit test of the coverage data: is the theoretical distribution proposed in Q3 a good fit for the data?

# To make a good of fit test of the coverage data I have to mutate my d_100k dataframe by adding two new columns.
# Using the ppois log distribution and theoretical/expected coverage i can afterwards use in a chisquared test.
d_100k <- d_100k %>%
  mutate(exp_pr=ppois(q=cov, lambda=mean(cov))) %>% 
  mutate(n=n(), exp_cov=exp_pr*n) 

obs <- d_100k$cov
exp <- d_100k$exp_cov 
chisq.test(obs, exp)

## Warning in chisq.test(obs, exp): Chi-squared approximation may be incorrect

## 
##  Pearson's Chi-squared test
## 
## data:  obs and exp
## X-squared = 5262000, df = 3364, p-value < 2.2e-16

I can reject H0 since p_value is very low, and rejecting the H0 means that the data does not fit a poisson distribution.

Q5.

Pick yourself a small (100 kb) region on chromosome 3L Graph the frequency of “0” alleles along that region Calculate the statistical association between allelic states of each pair of neighboring SNPS (this association is also called LD) Explore visually how LD varies with the physical distance beween SNPs (measured in bp) Compare and discuss with the Fig 1 of the paper.

# Selecting a 100kb region on 3L
f <- d_100k %>%
  mutate(bin = (pos %/% 100) * 100, ld=0, bp=0) %>%
  select(pos,altc,refc,bin,ld, bp) %>%
  group_by(bin)

# plot of frequency 0 alleles in the region. 
ggplot(f) +
  geom_histogram(aes(altc), bins=30, fill = "chocolate1") +
  ggtitle ("Allele frequencies on 3L") +
  theme(plot.title = element_text(hjust = 0.5)) +
  xlab("\"0\" Allele Frequency") +
  ylab("Number of SNPs")

Investigating for LD refc is the frequency for 0s altc is the frequency for 2s LD = f(22) - f_1(2) * f_2(2) LD = (n22/n) - (n22 + n20/n) * (n22 + n02/n)

n = sum(f$altc) + sum(f$refc)

for(i in 1:nrow(f)-1){
  n22 = (f$altc[i]*f$altc[i+1])
  n20 = (f$altc[i]*f$refc[i+1])
  n02 = (f$refc[i]*f$altc[i+1])
  f22 = n22/n
  f1 = (n22+n20)/n
  f2 = (n22+n02)/n
  ld = f22-(f1*f2)
  f$ld[i] = ld
}

# Calculating distance (bp) between SNPs, take 10 seconds
for(i in 2:nrow(f)){
  dist = (f$pos[i]-f$pos[i-1])
  if(dist <= 1000){
    f$bp[i] = dist
  } else {
    f$bp[i] = 0
  }
}

ggplot(f, aes(x=bp, y=ld)) +
  geom_point() +
  ggtitle ("Decay of LD") +
  theme(plot.title = element_text(hjust = 0.5)) +
  xlab("Distance between SNPs (bp)") +
  ylab("LD")

As can be seen the LD is much higher when the distance between the SNPs are shorter. This agrees with the figure1 in the paper.

Midterm project Data Science

Mikkel Tschenka Skyum Eriksen

9 apr 2018

Q0

Q1

Q2.

Q4.

Q5.