Download all sequences from the link: Get Sequences
The data are in fasta format. This means that lines starting with an ‘>’ contain the names of the sequences and you are not interested in them.
Assume the motif “CGTCAC” (6 positions).
Find all instances of this motif in the sequences, for which the Hamming distance between the substring and the motif is less than 2, i.e., 0 or 1. Please, consider only the sequences from the file which are longer than or equal to the motif and reject sequences that are shorter than the motif itself.
Construct the PWM using all substring (see above, question 1) with distance less or equal to 1.
Construct the logo for the PWM. Useful R packages are the following. ggseqlogo https://omarwagih.github.io/ggseqlogo/ or seqLogo https://bioconductor.org/packages/release/bioc/vignettes/seqLogo/inst/doc/seqLogo.html
Get the starting position on each sequence of the substring that differs from the motif by maximum 1 (see question 1). How far is this position from the end of the sequence?. For example, if at position 993 of sequence1 (which has length 1000), you find a substring that differs from the motif by 1 position (ie. Hamming distance = 1), then is position is 7 bp (1000-993) far away from the end of the sequence. Get all such distances from the end of the sequence (if any). Construct a density plot using all these distances from each end and the function density in R. What do you observe? For example, if in the first sequence you get the number 7, from sequence3, the numbers 8 and 10, from sequence 4 nothing, from sequence 5 the numbers 12, then use the vector c(7, 8, 10, 12) to construct the density plot.
Download the expression dataset from here: https://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS3713. Remember you need the soft file NOT THE FULL SOFT.
Assume the datasets in the folder http://139.91.162.101//teaching/project2022/R/ms/. The file ms.obs.out should be treated as an “observation dataset” (i.e., as a real dataset).
The files ms.X.out (where X:1..10000) should be treated as simulated datasets. You can also use the file ms.sim.out which contain ALL ms.X.out one under the other. Probably, it will be easier and faster to use this total file instead of each file separately.
Our goal is to infer the population growth rate parameter g, that describes the growth rate of a population. Each simulated dataset has been generated with a different value of growth rate. These values are given in the file params.txt, in the same folder and they correspond 1-1 to the datasets (i.e., the first value corresponds to the dataset ms.1.out).
Follow the ABC methodology with the Euclidean Distance to get parameter estimates. Use the best 500 simulated datasets (i.e. the 500 datasets which are closer to the observation) to infer the parameter value for the growth rate.
g is 100? (i.e. which inferred value is closer to 100).IMPORTANT
IMPORTANT II
if you read a file, this should be IN THE SAME FOLDER as your R script (i.e., don’t use paths in your computer, it will not be possible to assess whether your code is correct)