Team:TU Darmstadt/Results/Modeling/Theory

From 2014.igem.org

Home


Theory

Homology Modeling

All the proteins used by us have already been described before. Therefore their function and origin are already known. But if these enzymes are integrated into another organism, they will show changes in behavior based on different cytosol composition. Simulation of the enzyme activity helps to estimate their behavior. Therefore you need a 3D structure. These structures can be observed by e.g. X-Ray Crystallography or Nuclear Magnetic Resonance Spectroscopy (NMR). But all this methods are expensive both at time and cost, especially for an iGEM team. The computer prediction of a 3D structure is a low cost alternative. With the 3D structure you can run simulations in order to predict parameters like e.g. stability and docking.

Here we use Homology modeling for the 3D structure prediction of our proteins. At first a BLAST search is made to find all related proteins. After that the PDB (Protein Database) is searched after existing 3D-Structures from the related proteins. The best structure is used as a template and the primary structure of the target protein is lined along the template. This step is followed by algorithms to adapt the structure with respect to the differences between the primary structure of the target and the template. After that you got a predicted 3D structure of your protein.

General Workflow

We used Yasara Structure [1] to calculate our 3D Models. We used a script which does the following steps [7]:

  1. Sequence is PSI-BLASTed against Uniprot [2]
  2. Calculation of a position-specific scoring matrix (PSSM) from related sequences
  3. Using the PSSM to search the PDB for potential modeling templates
  4. The Templates are ranked based on the alignment score and the structural quality [3]
  5. Deriving additional informations for template and target (prediction of secondary structure, structure-based alignment correction by using SSALN scoring matrices [4])
  6. A graph of the side-chain rotamer network is built, dead-end elimination is used to find an initial rotamer solution in the context of a simple repulsive energy function [5]
  7. The loop-network is optimized using a high amount of different orientations
  8. Side-chain rotamers are fine-tuned considering electrostatic and knowledge-based packing interactions as well as solvation effects
  9. An unrestrained high-resolution refinement with explicit solvent molecules is run, using the latest knowledge-based force fields [6]

Information Theoretical Analysis

Entropy

In Information Theory the entropy (H) is referred as the average information content. It is computed with the following formula developed by Claude E. Shannon in 1948: 

\[ H(x)= \sum p(x) log_2( p(x) ) \]

Where x is a discrete, random variable and p(x) is the probability of x. It shows the number of states of a variable with respect to their possibility. This is similar to the physical entropy which shows the number of energy states and their possibility. In simple words, the entropy indicates how much information you can store in a message, e.g. a word. Intuitive the information content in this example is based on two parameters: The length of the word and the number of different letters which can be used. So if you want to create a word of three letters (e.g. a codon) and have four letters to choose (e.g. the amino acids) you got 43 (64) options (or states). As prior mentioned the entropy  depends on the possibility of the states. If the possibilities of all states are equally distributed, the entropy gets maximal. For intuitive Understanding, let?s assume someone else has a word (the information) in his mind. The word can be formed out of the letters A and B. If the chance for A is 100%, the word is AA. You don?t have to guess and therefore the information stored in this word is zero. If the chance for A is 90%, you have to guess but the chance to hit with an A is high. If a word contains 6 letters, guessing 'AAAAAA' would have still a chance of 53% (0,9 6). But if the chance for A is 50% guessing becomes more difficult. By a word containing six letters, your chance of guessing right at the first time is only 1,5% (0,56). The entropy of the information gets maximal since the possibilities of all states are equally distributed.

Mutual Information

The mutual information (MI) of two variables shows their correlation. Therefore you add the entropies of both variables minus the symmetric difference of both entropies:

\[ MI(X,Y)=H(X)+H(Y)-H(X,Y) \]



If the symmetric difference [H(x,y)] is as big as the sum of both singular entropies, then the MI is zero thus there is no correlation between them. If the symmetric difference is zero, then the MI gets maximal (not 1!), meaning that both variables are the same.

Application of MI to sequence alignments

The MI can be used to measure co-evolution signals in protein sequences [2]. The MI can be computed with the following equation in form of a Kullback-Leibler-Divergence (DKL):

\[ MI(X,Y)=\sum_{i=1}^Q \sum_{j=1}^Q p(x_i) p(y_j) log_2(\frac{p(x_i,y_j)}{p(x_i) p(y_j)})   \]

With this formula it is possible to compare two different rows (X and Y) of a multiple sequence alignment (MSA). Here p(x) and p(y) are the occurrence probabilities of symbols in column X and Y of the MSA. The joint probability p(xi, yj) describe the occurrence of one specific amino acid pair (e.g. x= Alanin and y= Glutamin) . Q is the set of symbols (AA or Bases), which depends on whether you analyze proteins or genes in your MSA. The result of these calculations is asymmetric matrix M which includes all MI values for any two columns in an MSA. A dependency of two columns shows high MI values and indicates that the evolution of those two amino acids is connected.

Normalization

Normalization is made with a standard score (Z-score). This score measures how many standard deviations a particular value differs from the mean value. This is important because every two columns have an MI value and normalization is needed to see whether this value is high or low. MI dependent Z-scores were calculated with a null model [3]:

\[ Z_{i,j}=\frac{M_{i,j}-E(M_{i,j})}{\sqrt{Var(M_{i,j})}} \]

Molecular dynamics simulation

Molecular Dynamics (MD) is one of the most common tools in computational biology. They are used to predict the mechanical behavior of a protein. At first a 3D Model of the protein is needed. This model gets transferred into a 3D enviroment normaly consisting of water molecules. A force field is chosen which simulates all occuring forces during the simulation.

Here, a force field is a collection of mathematical functions and parameters which describe the potential energy of particles (atoms and molecules) in a defined system. For illustration the potential derived from the Assisted Model Building with Energy Refinement (AMBER) can be divided in four terms:

\[ V(r_1...r_n)=\sum_{ij} \epsilon [(\frac{\sigma_{ij}}{r_{ij}})^{12}-(\frac{\eta_{ij}}{r_{ij}})^6] + \sum_{ij} \frac{1}{4\pi \epsilon_0} \frac{q_i q_j}{r_{ij}^2}  \]

\[ + \sum_{dihedral} K_{\Phi} [1+\cos(n \Phi - \delta)] + \sum_{bonds} \frac{1}{2} K_d(d-d_0) +\sum_{angles} \frac{1}{2} K_{\theta}(\theta-\theta_0) \]

These terms are the core components of a force field. Thus the sum of all these terms approximates the total of a given system. Since we can approximate the potential energy of a given system, force can be computed as the derivative of this potential with respect to position. The first equation represents the non bonded interactions within a system. It can be divided into the Lennard-Jones potential, which approximates the interaction between a  pair of neutral atoms or molecules. 

The second part is a Coulomb potential, approximate the electrostatics between a pair of atoms i and j with the partial charges. The next term describes the stretching of a chemical bonds as an harmonic spring. The last term describes the binding angels ? as well as the dihedral angels are approximated with harmonic springs.

Due to experimental data and quantum chemical calculations the force field parameters can be derived. The quality of MD simulations is largely depending on the applied force field.

Gaussian network model

With Molecular Dynamics (MD) simulations the mechanical behavior of a protein can be predicted. However, this approach needs a high computation capacity. A computational affordable alternative is the usage of the Gaussian network model (GNM).

A GNM is a simplified representation of a protein in which amino acids (alpha Carbon) are pictured as nodes and the bonds and interactions between them are treated like springs. This approach isn't as sophisticated as MD simulations but in exchange it is calculated in short time.

Computation

The structure dynamics in the GNM is described by the topology of contacts within the Kirchhoff matrix G. The elements of the matrix G are computed with the following formular:

\[ \Gamma_{ij}= \begin{cases}{\quad \ -1,\qquad \quad if \ \ i \neq j \ and \ R_{ij} \leq r_c }\\{\qquad 0,\qquad \quad \ \ if \ \ i \neq j \ and \ R_{ij} \geq r_c}\\{- \sum_{j,j \neq i}^N \Gamma_{ij}, \quad if \ \ i = j }\end{cases}     \]

Where N is the number of nodes and Rij is the distance between the point i and the point j. Gamma is the Ca-based contact matrix. The choice of rc defines the degree of cross-linking within the network.

In a simulation you can see which parts of the molecule are flexible. At those sites interactions and conformational changes will occur. Thus reaction sites can be discovered.  We computed the GNM with the Programm R [2] using the BioPhysConnector [3] library.

Anisotropic Network Model

An ANM is an improvement of the GNM [Atilgan et al]. In a GNM only the amplitude of movements can be observed but not the directions, because in a Kirchhoff matrix no coordinates are stored. In an ANM the Kirchhoff matrix is extended to the Hesse matrix:  

\[ H_{ij}= \begin{bmatrix}{\frac{\partial^2 V_{ij}}{\partial x_i \partial x_j}} & {\frac{\partial^2 V_{ij}}{\partial x_i \partial y_j}} & {\frac{\partial^2 V_{ij}}{\partial x_i \partial z_j}} \\ {\frac{\partial^2 V_{ij}}{\partial y_i \partial x_j}} & {\frac{\partial^2 V_{ij}}{\partial y_i \partial y_j}} & {\frac{\partial^2 V_{ij}}{\partial y_i \partial z_j}} \\ {\frac{\partial^2 V_{ij}}{\partial z_i \partial x_j}} & {\frac{\partial^2 V_{ij}}{\partial z_i \partial y_j}} & {\frac{\partial^2 V_{ij}}{\partial z_i \partial z_j}} \end{bmatrix} \]


Here, H is a matrix with the dimensions 3N × 3N . Where N denotes for the sum of residues within a protein. Each element in H is a 3 X 3 matrix which holds the anisotropic information regarding the orientation of nodes i,j. Each super element of the Hessian is defined as:

\[ V_{ij}=\frac{\gamma}{2}(s_{ij}-s{ij}^0)^2 \]

References

Homology Modeling

  1. E. Krieger, G. Koraimann, and G. Vriend, ?Increasing the precision of comparative models with YASARA NOVA--a self-parameterizing force field.,? Proteins, vol. 47, no. 3, pp. 393?402, 2002.
  2. S. F. Altschul, T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman, ?Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.,? Nucleic Acids Res, vol. 25, no. 17, pp. 3389?3402, Sep. 1997.
  3. R. W. Hooft, G. Vriend, C. Sander, and E. E. Abola, ?Errors in protein structures.,? Nature, vol. 381, no. 6580. Nature Publishing Group, p. 272, 1996.
  4. D. T. Jones, ?Protein secondary structure prediction based on position-specific scoring matrices,? Journal of Molecular Biology, vol. 292, no. 2, pp. 195?202, 1999.
  5. A. A. Canutescu, A. A. Shelenkov, and R. L. Dunbrack, ?A graph-theory algorithm for rapid protein side-chain prediction.,? Protein Science, vol. 12, no. 9, pp. 2001?2014, 2003.
  6. E. Krieger, K. Joo, J. Lee, J. Lee, S. Raman, J. Thompson, M. Tyka, D. Baker, and K. Karplus, ?Improving physical realism, stereochemistry, and side-chain accuracy in homology modeling: Four approaches that performed well in  CASP8.,? Proteins, vol. 77 Suppl 9, no. June, pp. 114?122, 2009.
  7. www.yasara.org/homologymodeling.htm

Information Theoretical Analysis

  1. Shanon, C. E. (1948). A Mathematical Theory of Communication. The Bell System Technical Journal, 27, 379-423.
  2. Hamacher, K. (2008). Relating sequence evolution of HIV1-protease to its underlying molecular mechanics. Gene, 422 (1-2), 30-36. doi:10.1016/j.gene.2008.06.007
  3. Weil, P., Hoffgaard, F., & Hamacher, K. (2009). Estimating sufficient statistics in co-evolutionary analysis by mutual information. Comput Biol Chem, 33(6), 440-444. doi:10.1016/j.compbiolchem.2009.10.003
  4. Altschul, S. F., Madden, T. L., Schäffer, A. A., Zhang, J., Zhang, Z., Miller, W., & Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res, 25(17), 3389-3402.
  5. Sievers, F., Wilm, A., Dineen, D., Gibson, T. J., Karplus, K., Li, W., Lopez, R., et al. (2011). Fast, scalable 
  6. generation of high-quality protein multiple sequence alignments using Clustal Omega. Molecular Systems Biology, 7 (539), 539. Nature Publishing Group. doi:10.1038/msb.2011.75
  7. Hoffgaard, F., Weil, P., & Hamacher, K. (2010). BioPhysConnectoR: Connecting sequence information and biophysical models. BMC Bioinformatics, 11, 199. doi:10.1186/1471-2105-11-199.