Data Analysis and Bioinformatics (9308055232020.1)

Finals

Total points = 100

Time: 90min

Note:

• Please do not share these questions with anyone!
• Please do not share your answers with anyone!
• Using the internet and any online resources is encouraged!
• Please pay particular attention to plagiarism! Provide citations when necessary. Do not copy and paste text from sources. Paraphrase!
• Asking questions about the problems is discouraged, except for clarification.

Question 1:

In your experiment, you identified a new protein, and you are curious about which protein it is. You determined its complete amino acid sequence, and here is what you got:

>Unknown putative dab-interacting protein 1
MPKKSIEEWEEDAIESVPYLASDEKGSNYKEATQIPLNLKQSEIENHPTVKPWVHFVAGGIGGMAGAVVTCPFDLVKTRLQSDIFLKAYKSQAVNISKGSTRPKSINYVIQAGTHFKETLGIIGNVYKQEGFRSLFKGLGPNLVGVIPARSINFFTYGTTKDMYAKAFNNGQETPMIHLMAAATAGWATATATNPIWLIKTRVQLDKAGKTSVRQYKNSWDCLKSVIRNEGFTGLYKGLSASYLGSVEGILQWLLYEQMKRLIKERSIEKFGYQAEGTKSTSEKVKEWCQRSGSAGLAKFVASIATYPHEVVRTRLRQTPKENGKRKYTGLVQSFKVIIKEEGLFSMYSGLTPHLMRTVPNSIIMFGTWEIVIRLLS* 

Determine the identity of this protein using the BLAST website (http://blast.ncbi.nlm.nih.gov/). From which organism's genome is it most likely to have originated? Why? (20 points)

Question 2:

You are designing a whole-genome sequencing experiment using an Illumina instrument. Your goal is to achieve at least 100X uniform coverage across the genome, and your organism of interest has a genome length of 173 kb.

To obtain such coverage, how many paired-end reads (read pairs) do you need to generate in total for your sequencing library? (Note: read length is 2 x 150 bp.) (20 points)
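
As a reminder of the arithmetic involved: expected coverage is commonly estimated as the total number of sequenced bases divided by the genome length. The following is a minimal Python sketch of that relation, assuming perfectly uniform coverage and that every base of both mates maps to the genome. The function name and the example values (a 5 Mb genome, 30X, 2 x 100 bp) are illustrative assumptions and are deliberately not the values from this question.

```python
def required_read_pairs(genome_length_bp, target_coverage, read_length_bp):
    """Smallest number of read pairs that reaches the target coverage.

    Assumes coverage = (pairs * 2 * read_length) / genome_length,
    i.e. uniform coverage with no duplicates or unmapped reads.
    """
    bases_needed = genome_length_bp * target_coverage
    bases_per_pair = 2 * read_length_bp
    # Integer ceiling division: round up to a whole read pair.
    return (bases_needed + bases_per_pair - 1) // bases_per_pair


# Illustrative values only (hypothetical genome, not the one in the question).
pairs = required_read_pairs(genome_length_bp=5_000_000,
                            target_coverage=30,
                            read_length_bp=100)
print(pairs)
```

In practice, some extra margin is usually planned for, because duplicates, low-quality reads, and uneven coverage reduce the effective depth.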

Question 3:

When mapping RNA-seq reads to a reference genome, why should we use a splice-aware aligner? What would happen if we did not? (20 points)

Question 4:

Please explain the critical points to consider when designing an experiment in which you plan to conduct a differential expression analysis between two conditions. Which factors would severely affect your analysis if not accounted for? Please discuss why. (20 points)

Question 5:

In your RNA sequencing experiment, how would you determine whether your sequencing depth is sufficient to perform a fair differential gene expression analysis, even for lowly expressed genes? What kind of analysis would you do to assess this? (20 points)
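
One kind of analysis that is often used for this purpose is a saturation (subsampling) curve: reads are randomly downsampled to several fractions of the full depth, and the number of genes detected above a count threshold is recorded at each fraction. The following is a minimal Python sketch under stated assumptions: per-gene counts are thinned by keeping each read independently with the given probability, and the function names, the threshold of 10 reads, and the toy counts are all hypothetical choices for illustration rather than a prescribed method.

```python
import random


def detected_genes(counts, fraction, min_count=10, seed=0):
    """Count genes with at least min_count reads after downsampling.

    Each read is kept independently with probability `fraction`
    (binomial thinning of the per-gene counts).
    """
    rng = random.Random(seed)
    detected = 0
    for gene_count in counts:
        kept = sum(1 for _ in range(gene_count) if rng.random() < fraction)
        if kept >= min_count:
            detected += 1
    return detected


def saturation_curve(counts, fractions=(0.1, 0.25, 0.5, 0.75, 1.0),
                     min_count=10, seed=0):
    """Return (fraction, detected gene count) pairs for each fraction."""
    return [(f, detected_genes(counts, f, min_count, seed)) for f in fractions]


# Toy per-gene read counts (made up for illustration).
toy_counts = [0, 3, 12, 40, 150, 900, 5000]
curve = saturation_curve(toy_counts)
for fraction, n_genes in curve:
    print(f"{fraction:.2f} of reads -> {n_genes} genes detected")
```

If the curve is still rising steeply at the full depth, lowly expressed genes are probably undersampled; if it has flattened, additional depth would add few newly detected genes.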