Project

Topic 1

Download all sequences from the link: Get Sequences

The data are in fasta format. This means that lines starting with an ‘>’ contain the names of the sequences and you are not interested in them.

Assume the motif “CGTCAC” (6 positions).

Find all instances of this motif in the sequences, for which the Hamming distance between the substring and the motif is less than 2, i.e., 0 or 1. Please, consider only the sequences from the file which are longer than or equal to the motif and reject sequences that are shorter than the motif itself.
Construct the PWM using all substring (see above, question 1) with distance less or equal to 1.
Construct the logo for the PWM. Useful R packages are the following. ggseqlogo https://omarwagih.github.io/ggseqlogo/ or seqLogo https://bioconductor.org/packages/release/bioc/vignettes/seqLogo/inst/doc/seqLogo.html
Get the starting position on each sequence of the substring that differs from the motif by maximum 1 (see question 1). How far is this position from the end of the sequence?. For example, if at position 993 of sequence1 (which has length 1000), you find a substring that differs from the motif by 1 position (ie. Hamming distance = 1), then is position is 7 bp (1000-993) far away from the end of the sequence. Get all such distances from the end of the sequence (if any). Construct a density plot using all these distances from each end and the function density in R. What do you observe? For example, if in the first sequence you get the number 7, from sequence3, the numbers 8 and 10, from sequence 4 nothing, from sequence 5 the numbers 12, then use the vector c(7, 8, 10, 12) to construct the density plot.

Topic 2

Download the expression dataset from here: https://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS3713. Remember you need the soft file NOT THE FULL SOFT.

Clean the data (remove all lines that start with ^, #, !)
Log the file (use log10) in R.
See the design of the experiment https://www.ncbi.nlm.nih.gov/geo/tools/profileGraph.cgi?ID=GDS3713. This will tell you which samples are smokers and which not.
Find all genes with expression value statistically higher in smokers than non-smokers. Use t.test for this and the appropriate alternative hypothesis. The significance threshold should be 0.05 after FDR correction for multiple testing.

Topic 3

Assume the datasets in the folder http://139.91.162.101//teaching/project2022/R/ms/. The file ms.obs.out should be treated as an “observation dataset” (i.e., as a real dataset).

The files ms.X.out (where X:1..10000) should be treated as simulated datasets. You can also use the file ms.sim.out which contain ALL ms.X.out one under the other. Probably, it will be easier and faster to use this total file instead of each file separately.

Our goal is to infer the population growth rate parameter g, that describes the growth rate of a population. Each simulated dataset has been generated with a different value of growth rate. These values are given in the file params.txt, in the same folder and they correspond 1-1 to the datasets (i.e., the first value corresponds to the dataset ms.1.out).

Follow the ABC methodology with the Euclidean Distance to get parameter estimates. Use the best 500 simulated datasets (i.e. the 500 datasets which are closer to the observation) to infer the parameter value for the growth rate.

Use the observation and the simulated datasets to infer the growth rate for the ms.obs.out. Use as summary statistics the i) average number of pairwise differences between sequences, ii) number of SNPs.
Use the observation and the simulated datasets to infer the growth rate for the ms.obs.out. Use as summary statistics the i) average number of pairwise differences between sequences, ii) number of SNPs and iii) Tajima’s D as it is described in https://en.wikipedia.org/wiki/Tajima%27s_D.
Which of the inferences (1 or 2) is more accurate if the true value for g is 100? (i.e. which inferred value is closer to 100).

IMPORTANT

Prepare an R code for your solutions. The code should have comments and explain each step of your algorithm.
Embed the code either in a google doc (save it in PDF) or in an RMarkDown. Google-doc has an extension for code blocks, use it to provide a visually better document.
Prepare the code AND the Report in 2 separate files. However, zip them before submission in order to submit eventually 1 zipped file.
Your grade will depend on i) correctness of the results, ii) how R-like the code is (for example, don’t use for loops to obtain the maximum of a vector), iii) how nice and detailed the report is, iv) how fast the code runs.

IMPORTANT II

if you read a file, this should be IN THE SAME FOLDER as your R script (i.e., don’t use paths in your computer, it will not be possible to assess whether your code is correct)

Project

Pavlos Pavlidis

5/23/2022

Topic 1

Topic 2

Topic 3