Protein Table Reformatting using Python: Team Project

Introduction

Genes are considered the ‘blueprint for life’ because they code for proteins, which perform a vast array of functions within living organisms. In this portion of the BIO231-CS125 collaborative team project, we will use the skills we have just learned about Python Strings and Data Reformatting to parse and reformat a file listing all the proteins coded for by your organism’s genome.

We will then query these data to test whether your organism’s genome codes for specific enzymes of interest. Enzymes are biological catalysts responsible for performing thousands of metabolic processes that sustain life. For example, urease is an enzyme that enables an organism to break down urea into carbon dioxide and ammonia. Your microbiology team members have conducted a series of biochemical experiments in the laboratory to test for the presence or absence of certain enzymes in your organism. In this team project, you will make computational predictions and compare them to the experimental results obtained by your BIO231 team members. The overall goal is to characterize the metabolism of your organism and validate your scientific predictions.

Deliverables

Your team will turn in your Python code (with a .py extension). Attach this file to an email to your TAs with the subject ‘Your name, Team Project 4: Python part.’

To get the .py file: in your IPython Notebook, go to File -> Download as -> Python (.py). If there are two CS125 students on your team, make sure to complete the R portion of the team project as well! If you are the only CS student on your team, the R portion is optional.

In addition, your team will synthesize both lab and computational results to create a final poster presentation on your organism. More details for this can be found on the BIO231 Collaboration Home Page.

Data

Your data will come from a .ptt file from the National Center for Biotechnology Information (NCBI). A .ptt file is an NCBI Protein Table file, which is a tab-delimited file containing the following columns (shown here with one example row):

Location	Strand	Length	PID	Gene	Synonym	Code	COG	Product
190..255	+	21	16127995	thrL	b0001	-	-	thr operon leader peptide

These columns correspond to the following:

Location is the start and end coordinates for the gene

Strand is (+) if the gene is on the template DNA strand, (-) if it is on the complementary DNA strand

Length is the length of the amino acid chain (minus the stop codon)

PID, Gene, Synonym, Code, and COG are all identifiers used by NCBI to keep track of genes and proteins

Product is the full name of the protein
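As a quick sanity check on how Location and Length relate, the sketch below recomputes Length from the coordinates of the sample row above. It assumes the coordinates are inclusive and that each gene ends with a single stop codon; the function name expected_length is ours for illustration and is not part of any NCBI tool.

```python
def expected_length(location):
    """Return the amino acid count implied by a 'start..stop' Location string.

    Assumes inclusive coordinates and a single trailing stop codon.
    """
    start, stop = (int(x) for x in location.split(".."))
    nucleotides = stop - start + 1   # inclusive coordinates
    codons = nucleotides // 3        # three nucleotides per codon
    return codons - 1                # the stop codon adds no amino acid


print(expected_length("190..255"))   # thrL row above: 66 nt -> 22 codons -> 21
```

If the value computed this way disagrees with the Length column for a row, that is worth raising with your BIO231 team members.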

Ask your BIO231 team members to help you understand these columns, and go through the process of acquiring the data together using the following steps.

Go to this document and follow the link to your organism’s NCBI Protein table file.

Requirements

1. Now that you have obtained your .ptt file, you will need to import it into Python using the following commands:

import urllib

# Python 2 call; replace the placeholder file name with your organism's .ptt file.
protTable = urllib.urlopen('ftp://ftp.ncbi.nih.gov/genomes/ASSEMBLY_BACTERIA/your_organism_here.ptt').read()
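The commands above use the Python 2 urllib module. If your notebook runs Python 3 instead, a roughly equivalent sketch is shown below. The helper name fetch_table is hypothetical, and decoding the bytes as UTF-8 is an assumption about the file's encoding. The URL in the comment is the same placeholder as above and must be replaced with your organism's link.

```python
from urllib.request import urlopen


def fetch_table(url):
    """Download a .ptt file and return its contents as one text string."""
    with urlopen(url) as response:
        raw = response.read()          # bytes in Python 3
    # Assumption: the table is plain text; undecodable bytes are replaced.
    return raw.decode("utf-8", errors="replace")


# Example call (placeholder URL, replace with your organism's file):
# protTable = fetch_table("ftp://ftp.ncbi.nih.gov/genomes/ASSEMBLY_BACTERIA/your_organism_here.ptt")
```

Returning text rather than bytes means the later split-by-line and split-by-tab steps work with ordinary strings.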

After you have imported the file, use the skills you learned in your Reformatting homework to split the contents of the table by line, and then by entry. Remember this is a tab-delimited file. One line of your result should look something like this:

  ['26479..27345', '+', '288', '12044873', 'fba', 'MG_023', '-', 'COG0191G', 'fructose-1,6-bisphosphate aldolase, class II']
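The two-stage split can be sketched on a tiny stand-in for protTable. The sample string below is assembled from the header lines and example row shown elsewhere in this handout; the assumption that a real file begins with two header lines followed by the column names is based on the expected output in step 2.

```python
# A tiny stand-in for protTable: two header lines, the column names, one row.
sample = (
    "Escherichia coli str. K-12 substr. MG1655, complete genome. - 1..4641652\n"
    "4140 proteins\n"
    "Location\tStrand\tLength\tPID\tGene\tSynonym\tCode\tCOG\tProduct\n"
    "190..255\t+\t21\t16127995\tthrL\tb0001\t-\t-\tthr operon leader peptide\n"
)

lines = sample.splitlines()                     # split the table by line
entries = [line.split("\t") for line in lines]  # then split each line by tab

print(entries[3])
```

Note that the first two entries contain a single element each, because the header lines have no tabs in them.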

2. Once you have obtained a list of all entries, you will need to split the lines into their corresponding columns. Some of the columns, like Strand, Synonym, Code, and COG, should be omitted because they aren’t important for the purpose of this investigation. You will also want to split the Location entry by '..' into two columns called Start and Stop. Print out the entries neatly so they look like the output below. Make sure to account for the two lines in the header. Your output should look something like this:

Escherichia coli str. K-12 substr. MG1655, complete genome. - 1..4641652
4140 proteins
Start	Stop	Length	Gene	PID	Product
190	255	21	thrL	16127995	thr operon leader peptide

If you are having trouble with this, just consider that you will be printing one line at a time. For example, to print the header, you could use a simple line of code like:

print "Start", "Stop", "Length", "Gene", "PID", "Product"

Then use a loop to filter and print each row one at a time.

To make things clearer for you, this is an example table (created using a Markdown table so it looks nice) showing which element goes in which column:

Start	Stop	Length	Gene	PID	Product
190	255	21	thrL	16127995	thr operon leader peptide
337	2799	820	thrA	16127996	Bifunctional aspartokinase/homoserine dehydrogenase 1
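One possible way to arrange the filtering and printing is sketched below, using the rows from this handout as input. The helper names is_data_row and reformat_row are our own, and the rule for recognizing protein rows (nine fields, with '..' in the first) is an assumption about the file layout rather than part of the .ptt format. Each print call takes a single argument so the sketch does not depend on the Python 2 print statement.

```python
HEADER = ["Start", "Stop", "Length", "Gene", "PID", "Product"]


def is_data_row(fields):
    """True for protein rows; False for the title, count, and column-name lines."""
    return len(fields) == 9 and ".." in fields[0]


def reformat_row(fields):
    """Map one 9-column .ptt row to [Start, Stop, Length, Gene, PID, Product]."""
    location, _strand, length, pid, gene, _synonym, _code, _cog, product = fields
    start, stop = location.split("..")
    return [start, stop, length, gene, pid, product]


# Entries as they might look after splitting by line and by tab.
entries = [
    ["Escherichia coli str. K-12 substr. MG1655, complete genome. - 1..4641652"],
    ["4140 proteins"],
    ["Location", "Strand", "Length", "PID", "Gene", "Synonym", "Code", "COG", "Product"],
    ["190..255", "+", "21", "16127995", "thrL", "b0001", "-", "-", "thr operon leader peptide"],
]

table = [reformat_row(fields) for fields in entries if is_data_row(fields)]

print(entries[0][0])             # title line
print(entries[1][0])             # protein count
print("\t".join(HEADER))         # new column names
for row in table:
    print("\t".join(row))
```

Joining with tabs keeps the output tab-delimited like the input, so it can be pasted into a spreadsheet or parsed again later.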

3. This is the list of enzymes you will be searching for. You can copy and paste the following line into your code.
enzList = ['citrate lyase', 'cytochrome C', 'catalase', 'nitrate reductase', 'gelatinase', 'urease', 'phenylalanine deaminase', 'ornithine decarboxylase', 'cysteine desulfhydrase', 'tryptophanase']

If you and your BIO231 team members decide to test for more enzymes, feel free to add them to your list.

Consult your BIO231 team members to determine the function of three of these enzymes. Include their functions in your final poster presentation. Why are they important?

Note that cytochrome C is not an enzyme, but instead is a protein that is part of the electron transport chain. Consult your microbiology team partners to understand what this is if you don’t already.

4. Now modify your code from step 2 to make a new table. This one will search through the Product column and identify enzymes of interest that perform a specific metabolic function. Your microbiology team members experimentally tested for the presence of these enzymes, so now it is your job to see whether the experimental results are supported by your organism’s genome. Your final result should be a table with three columns:

Enzyme type	Product	Protein ID
citrate lyase	citrate lyase, citrate-ACP transferase (alpha) subunit	16128598
catalase	catalase HPII, heme d-containing	49176140

Note that each enzyme you search for will most likely have multiple matches. A match does not need to be exact. You will notice that in some cases the match is actually the opposite of what we’d expect, e.g. catalase inhibitor protein instead of catalase. Work with your microbiology team members to determine whether each enzyme is present or absent. A fuller example of the output:

Escherichia coli str. K-12 substr. MG1655, complete genome. - 1..4641652
Enzyme type	Product	Protein ID
citrate lyase	citrate lyase, citrate-ACP transferase (alpha) subunit	16128598
citrate lyase	citrate lyase, citryl-ACP lyase (beta) subunit	90111153
citrate lyase	citrate lyase, acyl carrier (gamma) subunit	16128600
catalase	catalase HPII, heme d-containing	49176140
catalase	catalase-peroxidase HPI, heme b-containing	16131780
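A minimal sketch of the search is shown below. It treats a match as a case-insensitive substring of the Product text, which is one reasonable reading of "does not need to be an exact match" but is our assumption. The function name find_enzymes is hypothetical, the enzyme list is shortened for the example, and the protein rows are taken from the sample lines in this handout.

```python
enzList = ['citrate lyase', 'catalase', 'urease']   # shortened copy of the list in step 3

# (Product, PID) pairs copied from sample rows in this handout.
proteins = [
    ("thr operon leader peptide", "16127995"),
    ("citrate lyase, citrate-ACP transferase (alpha) subunit", "16128598"),
    ("catalase HPII, heme d-containing", "49176140"),
]


def find_enzymes(enzymes, protein_rows):
    """Return (enzyme, product, pid) for every product containing an enzyme name."""
    matches = []
    for enzyme in enzymes:
        for product, pid in protein_rows:
            if enzyme.lower() in product.lower():   # substring, ignoring case
                matches.append((enzyme, product, pid))
    return matches


matches = find_enzymes(enzList, proteins)

print("\t".join(["Enzyme type", "Product", "Protein ID"]))
for enzyme, product, pid in matches:
    print("\t".join([enzyme, product, pid]))
```

Because this is a plain substring test, it will also report products such as an inhibitor of the enzyme, so the matches still need to be reviewed by hand with your microbiology team members.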

5. Work with your team to fill out the table in the Results section of your poster. Compare your computational predictions from step 4 with laboratory results and work together to discuss any discrepancies.
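One way to line up the two sets of results is sketched below. All values in it are placeholders: the predicted set stands in for whichever enzymes had at least one match in step 4, and the lab_results dictionary is HYPOTHETICAL, made up purely for illustration and not a result for any real organism. Replace both with your team's own data.

```python
enzList = ['citrate lyase', 'catalase', 'urease']   # shortened copy of the list in step 3

# Placeholder: enzymes with at least one match in your step 4 table.
predicted = {'citrate lyase', 'catalase'}

# HYPOTHETICAL lab results (True = enzyme detected); use your team's real data.
lab_results = {'citrate lyase': True, 'catalase': True, 'urease': True}

comparison = []
for enzyme in enzList:
    in_genome = enzyme in predicted
    in_lab = lab_results[enzyme]
    status = "agree" if in_genome == in_lab else "DISCREPANCY"
    comparison.append((enzyme, in_genome, in_lab, status))

print("\t".join(["Enzyme", "Predicted", "Lab", "Status"]))
for enzyme, in_genome, in_lab, status in comparison:
    print("\t".join([enzyme, str(in_genome), str(in_lab), status]))
```

Rows flagged as a discrepancy are the ones to discuss with your team.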
