Overview

In today’s workshop we will be looking to:

Explore the Protein Data Bank (PDB) and NCBI Structure databases.
Use a protein structure prediction system to determine structures of some unknowns.

Protein structural databases

As we’ve discussed, there are three primary ways to determine the 3D (tertiary/quaternary) structure of biological molecules, typically proteins.

X-ray crystallography
Nuclear magnetic resonance (NMR)
Cryogenic electron microscopy (Cryo-EM)

Each technique has its pros and cons, such as how well it captures the dynamic states of molecules. NMR and Cryo-EM are suitable for studying how dynamic a structure may be, whereas X-ray crystallography cannot really be used for this purpose. The trade-off is often resolution: most crystal structures have considerably higher resolution, although advances in NMR and Cryo-EM methods are diminishing these differences.

Protein data bank (PDB)

Navigate to: http://www.rcsb.org

Let’s go ahead and check out the “November Molecule of the Month”! Feel free to read through a bit about this enzyme. Let’s navigate to the entry 6VZ8 (the plant acetohydroxyacid synthase complex).

Let’s look at the entry’s information. A few key variables should be assessed for any structure prior to use.

Method - This is how the data was collected. X-ray crystallography is common. More recently, various electron microscopy and NMR methods have been successful in resolving protein structures.
Resolution - The minimum distance at which two atoms in the structure can be reliably distinguished, on average. Usually given in Angstroms (1 Angstrom = $10^{-10}$ meters).
wwPDB validation - How do these scores rank relative to other structures?

There are many more parameters in the full report, but these are a good starting point for any general inquiry. If the validation scores are poor or the resolution value is large (a larger number of Angstroms means less detail), proceed with caution.
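The same fields can also be read programmatically. Below is a minimal sketch, assuming the RCSB Data API serves entry metadata as JSON at the URL pattern shown and that the method and resolution live under the `exptl` and `rcsb_entry_info.resolution_combined` keys. The function names `fetch_entry` and `summarize_entry` are our own, not part of any RCSB library.

```python
import json
from urllib.request import urlopen

# Assumed URL pattern for the RCSB Data API entry endpoint.
RCSB_ENTRY_URL = "https://data.rcsb.org/rest/v1/core/entry/{entry_id}"


def fetch_entry(entry_id):
    """Download the JSON metadata for one PDB entry (needs network access)."""
    with urlopen(RCSB_ENTRY_URL.format(entry_id=entry_id)) as response:
        return json.load(response)


def summarize_entry(entry):
    """Pull out the experimental method(s) and resolution from entry metadata.

    Missing keys are tolerated: NMR entries, for example, may report no
    resolution at all, in which case the value is None.
    """
    methods = [experiment.get("method") for experiment in entry.get("exptl", [])]
    resolution = entry.get("rcsb_entry_info", {}).get("resolution_combined")
    return {"methods": methods, "resolution_angstrom": resolution}
```

With a network connection, `summarize_entry(fetch_entry("6VZ8"))` should return the same method and resolution shown on the entry page, which is a handy way to check your answer to Question 1.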

Question 1: What method was used to generate 6VZ8 and what is the resolution?

We can also visualize the data in a web-based structure viewer. Alternatively, you can download the PDB file and open it with any number of locally installed viewer options. A traditional viewer is RasMol.

Use your mouse/trackpad to move this structure around. Quite beautiful, no? By default, the viewer shows a cartoon depiction. This should seem somewhat intuitive since it is a common depiction in a lot of literature. We can modify this in the settings window, under Components. Here, we can add, remove, or change visualized components. You will need to play with these settings to develop some intuition for them and the associated menu.

Now, let’s hover our pointer over some of the subunits. You’ll note you can highlight specific amino acids on each subunit. As you hover over these subunits, you can see in the bottom right which peptide chain these amino acids belong to. This can be very important in working out how genes/proteins correspond to the tertiary/quaternary structure of a multi-subunit protein.
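The chain information shown in the viewer is also stored in the downloaded file. Here is a minimal sketch that counts residues per chain, assuming the legacy fixed-column PDB format (chain ID in column 22, residue number in columns 23-26, insertion code in column 27). The function name `residues_per_chain` is our own, and very large entries may only be distributed as mmCIF, which this sketch does not parse.

```python
from collections import defaultdict


def residues_per_chain(pdb_text):
    """Count distinct residues per chain ID using ATOM records only.

    HETATM records (waters, ligands) are skipped. Residues are stored in a
    set, so repeated models in a multi-model file are not double counted.
    """
    residues_seen = defaultdict(set)
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM") or len(line) < 27:
            continue
        chain_id = line[21]                             # column 22
        residue_key = (line[22:26].strip(), line[26])   # columns 23-26 and 27
        residues_seen[chain_id].add(residue_key)
    return {chain: len(keys) for chain, keys in residues_seen.items()}
```

To use it, read a downloaded file into a string (for example `open("my_structure.pdb").read()`, where the file name is a placeholder) and pass that string to `residues_per_chain`. Each key in the result is a chain ID like the one shown in the bottom right of the viewer.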

Finally, we can use our viewer to produce nice animations of the structure based on the settings we have made in the Components menu. These are often nice for presentations or your own viewing.

NCBI Structure database

Navigate to: https://www.ncbi.nlm.nih.gov/Structure/index.shtml

NCBI is also a great resource for protein structures, though almost everything is redundant with PDB. However, the NCBI Structure database is strongly integrated with other NCBI tools like BLAST, as well as with the search integration we have seen in environments like R, Python, etc.

Go ahead and search for: 6VZ8
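The same search can be scripted. Below is a minimal sketch using only the Python standard library, assuming the NCBI E-utilities `esearch` endpoint accepts `db=structure` and returns its hits under `esearchresult.idlist` when `retmode=json` is requested. The function names are our own.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def structure_search_url(term):
    """Build an ESearch URL that queries the NCBI Structure database."""
    query = urlencode({"db": "structure", "term": term, "retmode": "json"})
    return f"{ESEARCH_URL}?{query}"


def extract_ids(search_result):
    """Return the list of record IDs from a parsed ESearch JSON response."""
    return search_result.get("esearchresult", {}).get("idlist", [])


def search_structure(term):
    """Run the search (needs network access) and return matching record IDs."""
    with urlopen(structure_search_url(term)) as response:
        return extract_ids(json.load(response))
```

Calling `search_structure("6VZ8")` with a network connection should return the NCBI record ID(s) behind the page you just opened in the browser.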

Like PDB, NCBI provides a 3D viewer. However, NCBI also provides some additional information such as domains and an interaction network. These can help you better understand the function of each subunit of a protein and how that function relates to its structure.

Structure prediction

Navigate to: https://robetta.bakerlab.org/

Use your username and password to log in. If you do not have an account yet, register with Robetta so you can proceed.

Robetta is an interface to a number of protein structure tools developed by David Baker’s lab at the University of Washington in Seattle, WA.

Once you are logged in, there are two menus at the top that are relevant for our work: Project and Structure Prediction.

Let’s first look at the Example Results in the Structure Prediction dropdown menu. You will see a 3D rendering of the structure and typically 4 or 5 additional model outputs for your protein. Each model comes with a confidence score (1 = awesome, 0 = terrible). Error estimates along the length of the predicted peptide are also provided: low numbers indicate good predictions, high numbers poorer ones. How to interpret them depends on the problem being addressed, but a lot of bad scores generally means a poorly predicted structure.

Note, bad scores may actually have biological meaning! Various proteins have intrinsically disordered regions (or IDRs). These cannot be resolved very well since they do not form ordered crystals or adopt any stable, long-term configuration. However, IDRs are very important in the self-assembly of various biological macromolecules. Learn More!
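One way to make the per-residue error estimates easier to read is to collapse them into stretches of consecutive high-error residues, which are the candidates for poorly predicted or disordered regions. The sketch below assumes you have copied the error estimates into a plain Python list. The function name and the cutoff are our own illustrative choices, not values defined by Robetta.

```python
def high_error_regions(errors, cutoff):
    """Return (start, end) residue ranges, 1-based and inclusive, where the
    per-residue error estimate stays above the chosen cutoff."""
    regions = []
    start = None
    for position, error in enumerate(errors, start=1):
        if error > cutoff:
            if start is None:
                start = position          # a new high-error stretch begins
        elif start is not None:
            regions.append((start, position - 1))
            start = None
    if start is not None:                 # stretch runs to the end of the chain
        regions.append((start, len(errors)))
    return regions
```

A long high-error stretch at a terminus or in a linker is worth checking against what is known about the protein before dismissing the whole model.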

Let’s go ahead and submit a job.

Go to Submit in the Structure Prediction menu. Submit the following protein sequence for prediction. Only paste in the amino acid sequence, not the header.

>tr|B5XXV1|B5XXV1_KLEP3 Regulatory protein SoxS OS=Klebsiella pneumoniae (strain 342) OX=507522 GN=soxS PE=4 SV=1
MSHQDIIQTLIEWIDEHIDQPLNIDIVARKSGYSKWYLQRMFRTVMHQTLGDYIRQRRLL
LAAEALRTTQRPIFDIAMDLGYVSQQTFSRVFRREFDRTPSDYRHQISA
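If you would rather not trim the record by hand, the header can be removed in a few lines of Python. This is a minimal sketch for a single-record FASTA string. The helper name `sequence_from_fasta` is our own.

```python
def sequence_from_fasta(fasta_text):
    """Drop the '>' header line and join the remaining lines into one sequence.

    Assumes the text holds a single FASTA record.
    """
    lines = [line.strip() for line in fasta_text.splitlines()]
    return "".join(line for line in lines if line and not line.startswith(">"))


fasta = """>tr|B5XXV1|B5XXV1_KLEP3 Regulatory protein SoxS OS=Klebsiella pneumoniae (strain 342) OX=507522 GN=soxS PE=4 SV=1
MSHQDIIQTLIEWIDEHIDQPLNIDIVARKSGYSKWYLQRMFRTVMHQTLGDYIRQRRLL
LAAEALRTTQRPIFDIAMDLGYVSQQTFSRVFRREFDRTPSDYRHQISA
"""

sequence = sequence_from_fasta(fasta)
print(sequence)  # one unbroken line of amino acids, ready to paste
```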

Below the sequence box you will see a number of options to select. For now, we’ll stick to RoseTTAFold. This is an integrated homology-based predictive algorithm. Once you have submitted, you can find your job in the Queue. Navigate to the Queue in the Structure Prediction menu.

While we wait, what does RoseTTAFold do?

Searches for known homology in databases like NCBI and correlated structural information in databases like PDB. This is an HMM-based approach.
Reconstructs structure based on “known” pieces from databases and performs ab initio folding to fill in gaps.
These different modes can be altered based on input settings (CM vs AB settings).

While your prediction is calculated, let’s take a look at some of my outputs. Search for “lchubiz” in the queue. You can search by user name at the top of the queue. This can help you find your jobs.

Take a look at the MarA, MarA-Rob, and MarA-Rob modified Ps structures. The MarA-Rob structure is a fusion between the N-terminal and C-terminal regions of two regulatory proteins in E. coli (both have known structures). This fusion is reasonably accurate. To modify this structure, I altered a number of prolines in the primary AA sequence to glycines. This is a traditional method to disrupt structures, experimentally. Let’s look closely at the predicted effects.

Question 2: Did changing P to G make any difference in the MarA-Rob structure prediction? Why do you think this is the case? (Hint: Think about how RoseTTAFold is determining the structure.)

Here is the CLUSTALO alignment of these proteins so you can see where the prolines (P) were modified.

CLUSTAL O(1.2.4) multiple sequence alignment


MarA                      MSRRNTDAITIHSILDWIEDNLESPLSLEKVSERSGYSKWHLQRMFKKETGHSLGQYIRS  60
MarA-Rob_Fusion           MSRRNTDAITIHSILDWIEDNLESPLSLEKVSERSGYSKWHLQRMFKKETGHSLGQYIRS  60
MarA-Rob_Fusion_ModP      MSRRNTDAITIHSILDWIEDNLESGLSLEKVSERSGYSKWHLQRMFKKETGHSLGQYIRS  60
                          ************************ ***********************************

MarA                      RKMTEIAQKLKESNEPILYLAERYGFESQQTLTRTFKNYFDVPPHKYRMTNMQGESRFLH  120
MarA-Rob_Fusion           RKMTEIAQKLKESNEPILYLAERYGFESQQTLTRTFKNYFDVPPHKYRRSPEWSAFGIRP  120
MarA-Rob_Fusion_ModP      RKMTEIAQKLKESNEGILYLAERYGFESQQTLTRTFKNYFDVGGHKYRRSPEWSAFGIRG  120
                          *************** **************************  **** :   .   :  

MarA                      PLNH---------YNS--------------------------------------------  127
MarA-Rob_Fusion           PLRLGEFTMPEHKFVTLEDTPLIGVTQSYSCSLEQISDFRHEMRYQFWHDFLGNAPTIPP  180
MarA-Rob_Fusion_ModP      GLRLGEFTMPEHKFVTLEDTPLIGVTQSYSCSLEQISDFRHEMRYQFWHDFLGNAPTIPP  180
                           *.          : :                                            

MarA                      ------------------------------------------------------------  127
MarA-Rob_Fusion           VLYGLNETRPSQDKDDEQEVFYTTALAQDQADGYVLTGHPVMLQGGEYVMFTYEGLGTGV  240
MarA-Rob_Fusion_ModP      VLYGLNETRPSQDKDDEQEVFYTTALAQDQADGYVLTGHPVMLQGGEYVMFTYEGLGTGV  240
                                                                                      

MarA                      ----------------------------------------------------- 127
MarA-Rob_Fusion           QEFILTVYGTCMPMLNLTRRKGQDIERYYPAEDAKAGDRPINLRCELLIPIRR 293
MarA-Rob_Fusion_ModP      QEFILTVYGTCMPMLNLTRRKGQDIERYYPAEDAKAGDRPINLRCELLIPIRR 293
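Spotting single-residue changes by eye in an alignment is error prone. As a minimal sketch, the function below lists every position where two aligned sequences of equal length differ. It is shown on the first 60-column block of the MarA-Rob_Fusion and MarA-Rob_Fusion_ModP rows above. The function name `substitutions` is our own.

```python
def substitutions(reference, variant):
    """List (position, reference_residue, variant_residue) for each mismatch.

    Positions are 1-based alignment columns. They equal residue numbers only
    when neither sequence contains gaps before that column.
    """
    if len(reference) != len(variant):
        raise ValueError("aligned sequences must have the same length")
    return [
        (position, ref, var)
        for position, (ref, var) in enumerate(zip(reference, variant), start=1)
        if ref != var
    ]


# First alignment block (columns 1-60), copied from the alignment above.
fusion_block = "MSRRNTDAITIHSILDWIEDNLESPLSLEKVSERSGYSKWHLQRMFKKETGHSLGQYIRS"
modp_block = "MSRRNTDAITIHSILDWIEDNLESGLSLEKVSERSGYSKWHLQRMFKKETGHSLGQYIRS"

for position, ref, var in substitutions(fusion_block, modp_block):
    print(f"{ref}{position}{var}")
```

To cover the whole protein, concatenate each sequence's rows from all alignment blocks before comparing.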

Module 10 Lab Exercises
