Introduction

Genes are considered the ‘blueprint for life’ because they code for proteins, which perform a vast array of functions within living organisms. In this portion of the BIO231-CS125 collaborative team project, we will use the skills we have just learned about Python Strings and Data Reformatting to parse and reformat a file containing a list of all the proteins coded for by your organism’s genome.

We will then query these data to test whether your organism’s genome codes for specific enzymes of interest. Enzymes are biological catalysts responsible for performing thousands of metabolic processes that sustain life. For example, urease is an enzyme that enables an organism to break down urea into carbon dioxide and ammonia. Your microbiology team members have conducted a series of biochemical experiments in the laboratory to test for the presence or absence of certain enzymes in your organism. In this team project, you will make computational predictions and compare them to the experimental results obtained by your BIO231 team members. The overall goal is to characterize the metabolism of your organism and validate your scientific predictions.

Deliverables

Your team will turn in your Python code (with a .py extension). Attach this file to an email to your TAs with the subject ‘Your name, Team Project 4: Python part.’

To get the .py file: in your IPython Notebook, go to File –> Download as –> Python (.py). If there are two CS125 students on your team, make sure to complete the R portion of the team project as well! If you are the only CS student on your team, the R portion is optional.

In addition, your team will synthesize both lab and computational results to create a final poster presentation on your organism. More details for this can be found on the BIO231 Collaboration Home Page.

Data

Your data will come from a .ptt file from the National Center for Biotechnology Information (NCBI). A .ptt file is an NCBI Protein Table file, which is a tab-delimited file containing the following columns:

| Location | Strand | Length | PID | Gene | Synonym | Code | COG | Product |
|---|---|---|---|---|---|---|---|---|
| 190..255 | + | 21 | 16127995 | thrL | b0001 | - | - | thr operon leader peptide |

These columns correspond to the following:
  • Location is the start and end coordinates for the gene
  • Strand is (+) if the gene is on the template DNA strand, (-) if it is on the complementary DNA strand
  • Length is the length of the amino acid chain (minus the stop codon)
  • PID, Gene, Synonym, Code, and COG are all identifiers used by NCBI to keep track of genes and proteins
  • Product is the full name of the protein

Ask your BIO231 team members to make sure you understand these columns, and go through the process of acquiring the data together using the steps below.
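As a quick check on how the Location and Length columns relate, the arithmetic can be sketched as below. The function name `expected_length` is a hypothetical helper, and the sketch assumes the start and end coordinates are inclusive, which is consistent with the rows shown in this handout.

```python
def expected_length(location):
    """Amino acid count implied by a 'start..end' Location string.

    Hypothetical helper; assumes inclusive coordinates.
    """
    start, end = [int(x) for x in location.split("..")]
    nucleotides = end - start + 1   # both endpoints are counted
    codons = nucleotides // 3       # three nucleotides per codon
    return codons - 1               # the stop codon adds no amino acid


# thrL from the sample row: 66 nucleotides -> 22 codons -> 21 amino acids
print(expected_length("190..255"))
```

The same arithmetic reproduces the Length values of the other rows quoted in this handout (820 for `337..2799` and 288 for `26479..27345`).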

    1. Go to this document and follow the link to your organism’s NCBI Protein table file.

Requirements

    1. Now that you have obtained your .ptt file, you will need to import it into Python using the following commands:

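    import urllib

    # Python 2 syntax. In Python 3, import urllib.request and call urllib.request.urlopen(...) instead.
    protTable = urllib.urlopen('ftp://ftp.ncbi.nih.gov/genomes/ASSEMBLY_BACTERIA/your_organism_here.ptt').read()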

    After you have imported the file, use the skills you learned in your Reformatting homework to split the contents of the table by line, and then by entry. Remember this is a tab-delimited file. One line of your result should look something like this:
      ['26479..27345', '+', '288', '12044873', 'fba', 'MG_023', '-', 'COG0191G', 'fructose-1,6-bisphosphate aldolase, class II']  
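The two-level split described above can be sketched on a small stand-in string, so it can be tried without downloading anything. The variable names (`prot_table_sample`, `rows`) are illustrative, and the layout of the first three lines (title, protein count, column names) is an assumption based on the examples in this handout.

```python
# Stand-in for the protTable string read from the .ptt file (assumed layout).
prot_table_sample = (
    "Escherichia coli str. K-12 substr. MG1655, complete genome. - 1..4641652\n"
    "4140 proteins\n"
    "Location\tStrand\tLength\tPID\tGene\tSynonym\tCode\tCOG\tProduct\n"
    "190..255\t+\t21\t16127995\tthrL\tb0001\t-\t-\tthr operon leader peptide\n"
)

# First split the text into lines, dropping the empty string left by the
# trailing newline, then split each line into its tab-delimited entries.
lines = [line for line in prot_table_sample.split("\n") if line != ""]
rows = [line.split("\t") for line in lines]

# The last row is the thrL entry as a list of nine strings.
print(rows[-1])
```

Note that the two title lines contain no tabs, so each of them becomes a one-element list.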

    2. Once you have obtained a list of all entries, you will need to separate each line into its corresponding columns. Some of the columns, like Strand, Synonym, Code, and COG, should be omitted because they aren’t important for the purpose of this investigation. You will also want to split the Location entry at '..' into two columns called Start and Stop. Print out the entries neatly so they look like the output below. Make sure to account for the two lines in the header. Your output should look something like this:

    Escherichia coli str. K-12 substr. MG1655, complete genome. - 1..4641652
    4140 proteins 
    Start   Stop    Length  Gene    PID     Product 
    190 255 21  thrL    16127995        thr operon leader peptide

    If you are having trouble with this, just consider that you will be printing one line at a time. For example, to print the header, you could use a simple line of code like:

    print "Start", "Stop", "Length", "Gene", "PID", "Product"

    Then use a loop to filter and print each row one at a time.
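One possible shape for that loop is sketched below. It is only a starting point: `reformat_row` and `sample_lines` are hypothetical names, and the sketch assumes the file begins with two title lines followed by a row of column names.

```python
# Stand-in for the lines of a .ptt file (assumed layout: two title lines,
# one row of column names, then one line per protein).
sample_lines = [
    "Escherichia coli str. K-12 substr. MG1655, complete genome. - 1..4641652",
    "4140 proteins",
    "Location\tStrand\tLength\tPID\tGene\tSynonym\tCode\tCOG\tProduct",
    "190..255\t+\t21\t16127995\tthrL\tb0001\t-\t-\tthr operon leader peptide",
]


def reformat_row(fields):
    """Keep only the columns of interest and split Location into Start/Stop."""
    location, strand, length, pid, gene, synonym, code, cog, product = fields
    start, stop = location.split("..")
    return [start, stop, length, gene, pid, product]


output = list(sample_lines[:2])   # the two title lines are kept unchanged
output.append("\t".join(["Start", "Stop", "Length", "Gene", "PID", "Product"]))
for line in sample_lines[3:]:     # skip the original row of column names
    output.append("\t".join(reformat_row(line.split("\t"))))

print("\n".join(output))
```

Collecting the lines in a list before printing keeps the formatting logic separate from the output, which makes it easier to reuse in step 4.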

    To make things clearer for you, this is an example table (created using a Markdown table so it looks nice) showing which element goes in which column:

    | Start | Stop | Length | Gene | PID | Product |
    |---|---|---|---|---|---|
    | 190 | 255 | 21 | thrL | 16127995 | thr operon leader peptide |
    | 337 | 2799 | 820 | thrA | 16127996 | Bifunctional aspartokinase/homoserine dehydrogenase 1 |

    3. This is the list of enzymes you will be searching for. You can copy and paste the following line into your code.
    enzList = ['citrate lyase', 'cytochrome C', 'catalase', 'nitrate reductase', 'gelatinase', 'urease', 'phenylalanine deaminase', 'ornithine decarboxylase', 'cysteine desulfhydrase', 'tryptophanase']

    If you and your BIO231 team members decide to test for more enzymes, feel free to add them to your list.

    Consult your BIO231 team members to determine the function of three of these enzymes. Include their functions in your final poster presentation. Why are they important?

    4. Now modify your code from step 2 to make a new table. This one will search through the Product column and identify enzymes of interest that perform a specific metabolic function. Your microbiology team members experimentally tested for the presence of these enzymes, so now it is your job to see whether the experimental results are supported by the genome of your organism. Your final result should be a table with three columns:

    | Enzyme type | Product | Protein ID |
    |---|---|---|
    | citrate lyase | citrate lyase, citrate-ACP transferase (alpha) subunit | 16128598 |
    | catalase | catalase HPII, heme d-containing | 49176140 |

    Note that each enzyme you search for will most likely have multiple matches, and a match does not need to be exact. You will notice that in some cases the match is actually the opposite of what we’d expect, e.g. catalase inhibitor protein instead of catalase. Work with your microbiology team members to determine whether each enzyme is present or absent.

    Escherichia coli str. K-12 substr. MG1655, complete genome. - 1..4641652
    Enzyme type Product Protein ID
    citrate lyase   citrate lyase, citrate-ACP transferase (alpha) subunit  16128598
    citrate lyase   citrate lyase, citryl-ACP lyase (beta) subunit  90111153
    citrate lyase   citrate lyase, acyl carrier (gamma) subunit 16128600
    catalase    catalase HPII, heme d-containing    49176140
    catalase    catalase-peroxidase HPI, heme b-containing  16131780 
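A search of this kind can be sketched with a simple substring test, as below. This is one possible approach rather than the required one: `find_enzymes` and `proteins` are hypothetical names, the enzyme list is shortened, and the protein rows are a handful of (Product, PID) pairs taken from the examples in this handout.

```python
enz_list = ["citrate lyase", "catalase", "urease"]   # shortened for the example

# (Product, PID) pairs taken from the example rows in this handout.
proteins = [
    ("thr operon leader peptide", "16127995"),
    ("citrate lyase, citrate-ACP transferase (alpha) subunit", "16128598"),
    ("catalase HPII, heme d-containing", "49176140"),
    ("catalase-peroxidase HPI, heme b-containing", "16131780"),
]


def find_enzymes(enzymes, protein_rows):
    """Return (enzyme, product, pid) for every product that mentions an enzyme."""
    matches = []
    for enzyme in enzymes:
        for product, pid in protein_rows:
            # Compare in lower case so that capitalization differences
            # between the list and the table do not hide a match.
            if enzyme.lower() in product.lower():
                matches.append((enzyme, product, pid))
    return matches


matches = find_enzymes(enz_list, proteins)
print("\t".join(["Enzyme type", "Product", "Protein ID"]))
for row in matches:
    print("\t".join(row))
```

Because this is a plain substring test, a product such as a catalase inhibitor would also be reported under catalase, so the matches still need to be reviewed by hand with your BIO231 team members.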

    5. Work with your team to fill out the table in the Results section of your poster. Compare your computational predictions from step 4 with laboratory results and work together to discuss any discrepancies.