• Please do not share these questions with anyone!
• Please do not share your answers with anyone!
• Using the internet and any online resources is encouraged!
• Please pay particular attention to plagiarism! Provide citations when necessary. Do not copy and paste text from sources. Paraphrase!
• Asking questions about the problems is discouraged, except for clarification!
In your experiment, you identified a new protein, and you are curious about which protein it is. You determined its entire amino acid sequence, and here is what you got:
>Unknown putative dab-interacting protein 1
MPKKSIEEWEEDAIESVPYLASDEKGSNYKEATQIPLNLKQSEIENHPTVKPWVHFVAGGIGGMAGAVVTCPFDLVKTRLQSDIFLKAYKSQAVNISKGSTRPKSINYVIQAGTHFKETLGIIGNVYKQEGFRSLFKGLGPNLVGVIPARSINFFTYGTTKDMYAKAFNNGQETPMIHLMAAATAGWATATATNPIWLIKTRVQLDKAGKTSVRQYKNSWDCLKSVIRNEGFTGLYKGLSASYLGSVEGILQWLLYEQMKRLIKERSIEKFGYQAEGTKSTSEKVKEWCQRSGSAGLAKFVASIATYPHEVVRTRLRQTPKENGKRKYTGLVQSFKVIIKEEGLFSMYSGLTPHLMRTVPNSIIMFGTWEIVIRLLS*
Determine the identity of this protein using the BLAST website (http://blast.ncbi.nlm.nih.gov/). From which genome is it most likely to have originated? Why? (20 points)
You are designing a whole-genome sequencing experiment using an Illumina instrument. Your goal is to achieve at least 100X uniform coverage throughout the genome, and your organism of interest has a genome length of 173 kb.
To obtain such coverage, how many "paired-end" reads do you need to generate in total for your sequencing library? (Note: read length is 2 x 150 bp.) (20 points)
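The usual relationship behind this kind of calculation is mean coverage ≈ (number of reads x read length) / genome length. The sketch below illustrates it under simplifying assumptions: every sequenced base maps to the genome, and "coverage" means average coverage, not guaranteed uniform coverage. The function name and the example numbers are hypothetical and deliberately differ from the values in the question.

```python
def read_pairs_for_coverage(genome_length_bp, target_coverage, read_length_bp):
    """Smallest number of read pairs whose total bases reach the target mean coverage.

    Assumes each pair contributes 2 * read_length_bp usable bases.
    """
    bases_needed = genome_length_bp * target_coverage
    bases_per_pair = 2 * read_length_bp
    # Integer ceiling division, so the target is met rather than narrowly missed.
    return -(-bases_needed // bases_per_pair)


# Hypothetical example: a 5 Mb genome at 30X mean coverage with 2 x 100 bp reads.
pairs = read_pairs_for_coverage(5_000_000, 30, 100)
print(pairs, "read pairs, i.e.", 2 * pairs, "individual reads")
```

In practice a safety margin is usually added on top of this figure, because duplicates, low-quality reads, and uneven coverage all reduce the usable depth.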
For RNA-seq read mapping to a reference genome, why should we use a splice-aware aligner? What would happen if we did not? (20 points)
Please explain the critical points to consider when designing an experiment in which you plan to conduct a differential expression analysis between two conditions. Which factors would severely affect your analysis if not accounted for? Please discuss why. (20 points)
In your RNA sequencing experiment, how would you check whether your sequencing depth is sufficient to perform a fair differential gene expression analysis, even for lowly expressed genes? What kind of analysis would you do to assess this? (20 points)
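One way to probe depth of this kind is a saturation (rarefaction) check: subsample the reads at several fractions and see whether the number of genes detected above a read threshold levels off. The following is only a minimal sketch of that idea, not a prescribed method. The gene names, counts, the `min_reads` threshold, and the function name are all made up for illustration, and reads are thinned independently with a fixed random seed.

```python
import random


def genes_detected(counts, fraction, min_reads=10, seed=0):
    """Count genes that keep at least min_reads after random subsampling.

    Each read is kept independently with probability `fraction`.
    """
    rng = random.Random(seed)
    detected = 0
    for gene_count in counts.values():
        kept = sum(1 for _ in range(gene_count) if rng.random() < fraction)
        if kept >= min_reads:
            detected += 1
    return detected


# Hypothetical per-gene read counts from one sample.
counts = {"geneA": 5000, "geneB": 400, "geneC": 40, "geneD": 12, "geneE": 3}

for fraction in (0.1, 0.25, 0.5, 0.75, 1.0):
    print(fraction, genes_detected(counts, fraction))
```

If the number of detected genes is still rising steeply at the full depth, lowly expressed genes are probably under-sampled; a curve that flattens suggests that extra reads would add little.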