Team:Heidelberg/pages/Linker Modeling

From 2014.igem.org

(Difference between revisions)
(In silico refinement)
(In silico refinement)
 
(83 intermediate revisions not shown)
Line 1: Line 1:
-
=Abstract=
 
-
Peptidic linkers are widely used tools in protein modification, not only for connecting protein subdomains but also for connecting their extremities to circularize them. Traditionally, flexible linkers like Glycine-Serine peptides are used for this purpose [[#References | [1]]]. However, to keep domains of chimeric proteins at a certain distance, rigid peptides built of helical patterns are also often applied [[#References | [4]]] [[#References | [0]]].
 
-
In the following we show a novel approach to build customized rigid linkers that follow a desired shape. This is achieved by connecting peptides forming rigid helices with amino acids that produce a certain angle between them. We herein describe building blocks with which one can build customized linkers. The angle patterns were obtained from statistical analysis of structural databases normally used for protein structure prediction. The potential linkers were tested in a large screening in silico, refining the initial properties of the patterns. Additionally, they were tested in vitro for circularizing lysozyme from bacteriophage lambda as a model enzyme. Both the modularity and the reliability of our linkers are major advantages compared to the commonly used poly-Alanine or Glycine-Serine linkers.
 
-
The aim of this part was to identify building-block-like patterns that can be used to computationally build linkers following a certain path.
 
-
=Introduction=
 
-
Artificially circularized proteins can gain heat stability due to the constraint on the relative positioning of the C- and the N-termini [[#References | [10]]]. If the ends are too far from each other, circularization requires a linker that does not change the natural conformation of the protein but restrains the relative position of the ends and thus restricts the degrees of freedom.
 
-
Reference to software page
 
-
By combining the advantages of these different approaches for linker design, we have set up a model to build rigid linkers with alpha helices following a certain path. The main achievement is the modularity of our system for building.
 
-
§§§don't know yet where to put this information§§§Also for artificial protein engineering it is most important being able to define the conformation of the single helical building blocks by defining a supersecondary structure.
 
=Background=
=Background=
-
Primary, secondary, tertiary and quaternary structure are the main levels of protein structure characterization. Primary structure designates the amino acid sequence, while the secondary structure describes the arrangement of consecutive amino acids  through their two dihedral angles $\phi$ and $\psi$. The Ramachandran plot, which represents the amino acid position in the space of those two angles, shows two particular arrangement commonly found in proteins: alpha helices and beta sheets. The next level of protein organization is the tertiary structure, which describes how the protein is organized in the three spatial dimensions, whereas the quaternary structure describes how different subunits of proteins cluster.
+
Primary, secondary, tertiary and quaternary structures are the main levels of protein structure characterization. Primary structure designates the amino acid sequence, while the secondary structure describes the arrangement of consecutive amino acids through their two dihedral angles $\phi$ and $\psi$. The Ramachandran plot, which represents the amino acid position in the space of those two angles, shows two particular arrangements commonly found in proteins: alpha helices and beta sheets. The next level of protein organization is the tertiary structure, which describes how the protein is organized in the three spatial dimensions, whereas the quaternary structure describes how different subunits of proteins cluster.
-
Finally, closely related to these standard structures is the supersecondary structure, that describes how secondary structure elements are connected to each other by on first sight undefined conformations. Further analysis revealed that this wide variety of supersecondary structure motifs can be clustered to certain patterns. [[#References|[5]]]
+
Finally, closely related to these standard structures, the supersecondary structure describes how secondary structure elements are connected to each other. While these connections look undefined at first sight, further analysis revealed that this wide variety of supersecondary structure motifs can be clustered into certain patterns [[#References|[5]]].
 +
 
==Supersecondary structure==
==Supersecondary structure==
-
When the properties of supersecondary structures were first described, only very few patterns were identified, mainly due to the lack of highly resolved protein structures. At that time the structures were mainly classified by the Ramachandran plot regions ($\alpha, \beta, \gamma$ etc. ) where the amino acids could be found. [[#References|[6]]] With growing amount of known crystal structures, the analysis of supersecondary structure became better and better leading to databases with about 150 000 classified loop structures and elaborate clustering. [[#References|[7]]] Nowadays supersecondary structures are just the structures built when two secondary structure elements are combined by a small peptide that is not clustered into one of the secondary structures. These loop peptides range from 1 to 9 amino acids.
+
 
-
=Defining the structure=
+
When the properties of supersecondary structures were first described, only very few patterns were identified, mainly due to the lack of highly resolved protein structures. At that time, the structures were mainly classified by the Ramachandran plot regions where the amino acids could be found [[#References|[6]]]. With the growing number of known crystal structures, the analysis of supersecondary structure improved and led to databases with about 150 000 classified loop structures and elaborate clustering [[#References|[7]]]. Nowadays supersecondary structures are defined as the structures built when two secondary structure elements are joined by a small peptide that is not itself assigned to one of the secondary structures. These loop peptides range from 1 to 9 amino acids.
-
The aim was to build reliable stable linkers out of alpha helices connected by supersecondary structure motifs that produce certain angles. To achieve that, we searched for the most reliable alpha helix patterns and angle patterns covering the whole range of angles from 0 to 180 degrees.  
+
Our aim was to build reliable stable linkers out of alpha helices connected by supersecondary structure motifs that produce certain angles. To achieve that, we searched for the most reliable alpha helix patterns that would form rigid rods and angle patterns covering the whole range of angles from 0 to 180 degrees.
-
==Helix patterns==
+
 
-
Various different patterns have been used to build helical linkers between for connecting ends of proteins. [[References | [3]]] Also in known protein structures linkers between subparts can be identified and their properties are well analyzed. [[#References | [2]]] We have chosen that our helix patterns should be as safe as they can be, so we decided to go for aminoacids, that the most likely form helices. On the other hand, as our linker always needed to be soluble in aqueous medium. Therefore we could not just use linkers built of Alanine. So we decided to add some charged aminoacids to the pattern, also stabilizing themselves by coulomb interaction. These aminoacids needed to be 4 aminoacids apart from each other, as a helical turn takes about 3.6 aminoacids. The pattern we chose as most suitable for our purpose was also described to be one of the most stable[[#References | [4]]]
+
=Linker building block design=
-
==Angle patterns==
+
 
-
The angle patterns for our model were obtained from the ArchDB database [[#References | [5]]], which classifies loops from known proteins structures. About 17 000 non-homologous proteins from PDB database were analyzed and thus over 300 000 loop structures, i.e. regions connecting two secondary structure elements, were identified. ###Numbers to be checked###. The classification took into account not only the length of the loop, its conformation, meaning φ and ψ backbone dihedral angles of the residues in the loop, but also the distance between the attachments of the loop to the surrounding secondary structures. Furthermore the surrounding secondary structures of the loop and the geometry defined by the super-secondary structure motifs can be found in the database.
+
===Helix patterns===
-
To extract from ArchDB the relevant supersecondary structure motifs for our linker design, the complete database was downloaded and helix-loop-helix motifs were extracted using a self-written script in python programming language. Furthermore we only took into account loops composed of 1 and 2 amino acids, because the longer the loops, the less frequent and therefore the less reliable a single loop is, and the further the ends are from each other.
+
Various patterns have been used to build helical linkers to connect protein ends [[#References | [8]]]. Moreover, in known protein structures, linkers between subdomains can be identified and their properties have already been analyzed [[#References | [9]]]. Two main criteria were used to build the alpha helix patterns: they should robustly form alpha helices, and they should be soluble in aqueous solution. Because of the solubility requirement, we could not just use linkers built of Alanine. We therefore decided to add some charged amino acids to the pattern, and to position them physically close to each other so that they could stabilize each other by Coulomb interaction. These amino acids needed to be separated by 3 amino acids, as a helical turn takes about 3.6 amino acids. The pattern we chose as most suitable for our purpose was also described to be one of the most stable [[#References | [3]]].
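The spacing rule can be checked directly on a candidate pattern. The following is a minimal sketch, not the team's script: the function name `charge_pairs` is hypothetical, and it assumes that the stabilizing pairs are a glutamate (E) followed four positions later by a lysine (K), i.e. with three residues in between.

```python
def charge_pairs(seq, spacing=4):
    """Return (i, j) index pairs where an E at position i is followed
    by a K at position i + spacing (0-based indices)."""
    pairs = []
    for i in range(len(seq) - spacing):
        if seq[i] == "E" and seq[i + spacing] == "K":
            pairs.append((i, i + spacing))
    return pairs


# Two repeats of the helix-forming unit used in this work.
pairs = charge_pairs("AEAAAKEAAAK")
```

For `AEAAAKEAAAK` this finds one E/K pair per repeat, each separated by three alanines, which places the two charges roughly one helical turn apart.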
-
The interesting information for us was the angle produced between the vectors defining the bracing alpha helices, the distance between the ends of the loop, and the type of amino acids surrounding the loop. Furthermore we analyzed the statistical significance of the conformation. ###still to be done###
+
8 alpha helix building blocks were eventually chosen: AEAAAK, AEAAAKA, AEAAAKAA, AEAAAKEAAAK, AEAAAKEAAAKA, AEAAAKEAAAKEAAAKA, AEAAAKEAAAKEAAAKEAAAKA, AEAAAKEAAAKEAAAKEAAAKEAAAKA, with respective estimated lengths of 9, 10.5, 12, 16.5, 18, 25.5 and 33 Å.
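The listed estimates are consistent with a rise of 1.5 Å per residue along the helix axis, which is the textbook value for an ideal alpha helix. The text does not state how the estimates were obtained, so the constant and the function name below are assumptions; the sketch only shows that this simple rule reproduces the seven values given.

```python
RISE_PER_RESIDUE = 1.5  # Å per residue in an ideal alpha helix (assumed here)

HELIX_BLOCKS = [
    "AEAAAK",
    "AEAAAKA",
    "AEAAAKAA",
    "AEAAAKEAAAK",
    "AEAAAKEAAAKA",
    "AEAAAKEAAAKEAAAKA",
    "AEAAAKEAAAKEAAAKEAAAKA",
    "AEAAAKEAAAKEAAAKEAAAKEAAAKA",
]


def estimated_length(block, rise=RISE_PER_RESIDUE):
    """Estimate the rod length in Å as (number of residues) x (rise per residue)."""
    return len(block) * rise


lengths = [estimated_length(block) for block in HELIX_BLOCKS]
```

Under the same assumption the eighth block (27 residues) would come out at 40.5 Å; that figure is a consequence of the rule, not a number reported on this page.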
-
For each amino acid combination in the loop region, the angle distribution between the embracing alpha helices, the loop length distribution and a 2d heatmap of the embracing amino acids were automatically plotted.
+
 
-
These distributions were then visually analyzed to identify loops of interest for linker design. We focused on loops that show a narrow angle distribution and that appear frequently in the database. The corresponding  amino acids were further analyzed, by enlarging the amino acid pattern with the amino acids occurring the most next to them. [[See fig ###]] In practical terms this means when we have identified, that K in a turn produces an interesting distribution, we would elongate it by K in front and elongate it by A in the end and see how the distribution behaves. In the end we would look at the final pattern. Of course by restraining the possibilites the occurrences go down tremendously, but still the behavior is interesting. [[###plot_of__T_.png### ]] [[plot_of_K_T_.png]] [[plot_of__T_A.png]] [[plot_of_K_T_A.png]]. This allowed us to narrow down the angle distribution, and also to select loops where no preference for the surrounding amino acids could be seen anymore. Thus we can claim, that the angle distribution is not due to the surrounding structure, but because of the identified pattern itself. On this way 10 different angle motifs could be identified producing different angles.  
+
===Angle patterns===
 +
 
 +
The angle patterns for our model were obtained from the ArchDB database [[#References | [5]]], which classifies loops from known protein structures. About 17 000 non-homologous proteins from the PDB database were analyzed and, from them, 150 000 loop structures, i.e. regions connecting two secondary structure elements, were identified. The classification took into account not only the length of the loop and its conformation, meaning the φ and ψ backbone dihedral angles of the residues in the loop, but also the distance between the attachments of the loop to the surrounding secondary structures. Furthermore, the secondary structures surrounding the loop and the geometry defined by the supersecondary structure motifs can be found in the database.
 +
 
 +
To extract from ArchDB the relevant supersecondary structure motifs for our linker design, the complete database was downloaded and helix-loop-helix motifs were extracted using a [https://github.com/igemsoftware/Heidelberg_2014 self-written script] in the Python programming language. Of these, we only took into account loops composed of 1 or 2 amino acids, because the longer the loops, the less frequent and therefore the less reliable they are, and the further the ends are from each other.
 +
The information of interest to us was (1) the angle produced between the vectors defining the surrounding alpha helices, (2) the distance between the ends of the loop, and (3) the type of amino acids surrounding the loop. Furthermore we analyzed the statistical significance of the conformation. For each amino acid combination in the loop region, the angle distribution between the surrounding alpha helices, the loop length distribution and a 2D heatmap of the surrounding amino acids were plotted automatically (Figures 1-4).
 +
 
 +
These distributions were then visually analyzed to identify loops of interest for linker design. We focused on loops that showed a narrow angle distribution and that appeared frequently in the database, and we extended the distribution analysis to the surrounding amino acids. For example, once we had identified that the amino acid T in a turn produces an interesting distribution (Figure 1), we would extend it with the amino acid K in front (Figure 2), with the amino acid A at the end (Figure 3), and with both (Figure 4), and investigate how the distributions behave. Restricting the possibilities in this way reduces the number of occurrences tremendously, but the properties become more interesting.
 +
 
 +
{{:Team:Heidelberg/templates/image-full|
 +
caption = |
 +
file = plot_of__T_.png|
 +
descr= Figure 1) A loop composed of the amino acid T produces an interesting well-defined angle distribution (left panel). The middle panel represents the distribution of lengths found with this loop. The right panel represents the distribution of amino acids found before and after the amino acid T in all the corresponding loops. One can see that the amino acid K is the most frequent before and A the most frequent after.}}
 +
 
 +
{{:Team:Heidelberg/templates/image-full|
 +
caption = |
 +
file = plot_of_K_T_.png|
 +
descr= Figure 2) These distributions are subsets of the ones of Figure 1, with only the loops composed of the amino acid T preceded by the amino acid K. This constraint considerably narrows down the angle and length distributions.}}
 +
 
 +
{{:Team:Heidelberg/templates/image-full|
 +
caption = |
 +
file = plot_of__T_A.png|
 +
descr= Figure 3) Same procedure as in Figure 2, with the loops composed of the amino acid T followed by the amino acid A. This also narrows down the angle and length distributions compared to Figure 1, although not as much as in Figure 2.}}
 +
 
 +
{{:Team:Heidelberg/templates/image-full|
 +
caption = |
 +
file = plot_of_K_T_A.png|
 +
descr= Figure 4) The distributions correspond to loops composed of the amino acids KTA and therefore to subsets of all the preceding ones. One can notice that the angle distribution becomes very narrow, but also that the frequency is reduced compared to the preceding ones.}}
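The narrowing shown in Figures 1-4 amounts to filtering the loop records by their flanking residues and comparing the spread of the resulting angle distributions. The sketch below illustrates the idea only: the record layout (`before`, `loop`, `after`, `angle`), the function names and all angle values are invented for illustration and are not taken from ArchDB or from the team's script.

```python
from statistics import mean, pstdev

# Toy records: one dict per helix-loop-helix motif (values are made up).
loops = [
    {"before": "K", "loop": "T", "after": "A", "angle": 58.0},
    {"before": "K", "loop": "T", "after": "A", "angle": 61.0},
    {"before": "K", "loop": "T", "after": "A", "angle": 60.0},
    {"before": "K", "loop": "T", "after": "G", "angle": 75.0},
    {"before": "E", "loop": "T", "after": "A", "angle": 40.0},
    {"before": "L", "loop": "T", "after": "S", "angle": 120.0},
    {"before": "A", "loop": "T", "after": "A", "angle": 95.0},
    {"before": "K", "loop": "S", "after": "A", "angle": 30.0},
]


def select(records, loop, before=None, after=None):
    """Keep records with the given loop residue(s), optionally restricted
    to a given residue directly before and/or after the loop."""
    return [
        r for r in records
        if r["loop"] == loop
        and (before is None or r["before"] == before)
        and (after is None or r["after"] == after)
    ]


def summarize(records):
    """Return (count, mean angle, population standard deviation)."""
    angles = [r["angle"] for r in records]
    return len(angles), mean(angles), pstdev(angles)


t_all = summarize(select(loops, "T"))
k_t = summarize(select(loops, "T", before="K"))
k_t_a = summarize(select(loops, "T", before="K", after="A"))
```

Each added constraint lowers the count and, in this toy data set, also the spread of the angles, which is the trade-off between frequency and definition discussed for the KTA pattern.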
 +
 
 +
This step had two main goals: narrowing down the angle distribution and finding loops with no preference for the amino acids surrounding them. This last point was important for the modularity of our approach: the angle blocks should not be affected by the surrounding alpha helices. Using this approach, 10 different angle motifs (Table 1) producing different angles could be identified. Importantly, all these motifs were chosen so that the length of the two segments starting from the turning point was the same for all of them (put figure turning point). It should also be noted that 3 of the patterns, KTA, LVA and AAIAP, produce the same angles. They were nevertheless all kept for the ''in silico'' refinement described below, but only one was used for the [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software CRAUT software].
 +
 +
 
{| class="table table-hover" style="text-align: center;"
{| class="table table-hover" style="text-align: center;"
-
|+'''table 1''': The span of parameters.
+
|+'''Table 1''': The span of parameters.
!colspan="10"|Angle Patterns
!colspan="10"|Angle Patterns
|-
|-
Line 54: Line 77:
|    29     
|    29     
|      27             
|      27             
-
|  12.        
+
|  12       
-
|  27.        
+
|  27       
-
|  12.      
+
|  12     
-
|  15 .     
+
|  15      
-
|    5.      
+
|    5       
|}
|}
-
==Attached sequences==
 
-
Helical patterns often distract the folding of the attached sequences. But our linkers should fit at other sequences, without perturbating them. Therefore we thought how to prevent the helix from continuing into the next part. Therefore we analyzed the effects of various aminoacids in silico using online tools like [[http://bioserv.rpbs.univ-paris-diderot.fr/services/PEP-FOLD/ |pep-fold]] on helix formation. We identified glycines and prolines to interrupt helix formation reliably. We then decided to use GG pairs, because they give more flexibility to the initial orientation of the initial helix. This point is somehow also important for linkers.
 
-
==In silico refinement==
 
-
As some of the interesting patterns could not be found often enough to be statistically significant, we decided to make a further refinement in silico by modeling the structure of proteins with circularizing linkers. To perform this for realistic situations, we selected from the RCSB database structures of non-homologous target proteins with extremities that are separated enough to require a linker for circularization. For setting up an environment as close as possible to the application of the patterns were we designed the following workflow.
 
-
First, the ###linker software### generates possible fitting linkers for various proteins. From these possible linkers, the 100 shortest were taken and the angle patterns that should produce the same angles. Like you can easily see from [[table 1 ###]] KTA, LVA and AAIAP nearly produce the same angles. These then should be tested in comparison. On the other hand the software could not handle different patterns for the same angle, so afterwards we exchanged the sequences in the predicted linkers. The linkers connect the ends of the protein without setting tension on the protein, so that the protein can fold in its natural way. For further information on the generation of the linker sequences please follow to the [[Linkergenerator|###link###]]
 
-
After this the circularized proteins with the specific linkers are modelled using a software called Modeller.[[#References|[8]]] This software is widely used for comparative structure prediction. It is well established in the scientific community and should be most suitable for prediction of loop regions attached to existing structures. [[#References|[9]]] Modeller is a program that is able to predict the 3d structure of a given sequence based on an alignment with a given structure. In our script modeller needed to be provided with to things, a sequence with the linker attached and the PDB file of the protein of interest. The result of a prediction for lambda-lysozyme can be seen in [[circ_lam_lys_nils.png]] It is freely available for academical usage from the  [http://salilab.org/modeller/ salilab] webpages. As we just want to determine the properties of our linker patterns attached to proteins, it perfectly fitted our purpose. Most important for our purpose is that Modeller does not rely on structural databases like ArchDB database, but does an ab initio modelling of our linkers by minimizing energy functions with different methods like conjugate gradients and molecular dynamics. Even though it recommends only to simulate loops of up to 8 aminoacids, we chose to use it, because the similarity of the sequence with the provided PDB is as a matter of fact at about 90%. Modeller is recommended for usage from about 30% sequence similarity.
 
-
Each modeled structure is provided with energy values, thanks to which different models of the same structure can be compared. From Modeller we received about 8 different models and choose the one with the best energy scores to further proceed. At first modeller makes an alignment between the provided structure and the sequence identifying the regions, that can not be found in the structure . Based on that modeller generates 4 initial models. One of the strengths of modeller is it's capability to further refine only certain parts of the protein. Thus we let modeller refine the loops. A loop for modeller is defined as any part of the protein, that could not be found in the structure file. For these refinement steps, one can choose different levels of optimization. We always decided for accuracy instead of velocity of the program.
 
-
Modeller was run  via the [[iGEM@home|###link to i@h###]] system, calculating distributedly the structures of various proteins at the same time. The modelling of one linker took about 10 hours of calculation time on average via the iGEM@home system. Actually this value is highly depending on the size of the protein. Then the best model is evaluated  by another self-written program to analyze the behaviour of the linker patterns in their natural surroundings.
 
-
 
-
In the third step all the models for the different structures and the different linkers are analyzed for their properties like the length of the helical patterns, the shape of the attachment structures of the linker and the angles produced by the angle patterns. [[ Figure helix_winkel_messung.png]] First the modeled structure and the natural structure are fitted, to see how big the differences between those are. If the protein has been disturbed too much, the model is discarded.
 
-
For the analysis of of the different patterns, the connection between the $C_{\alpha}$ of the first amino acid and the $C_{\alpha}$ of the last aminoacid is defining a vector. For the attachment sequences and the helices the length of this vector is calculated, for the  Otherwise length of attachment sequences are calculated just by calculating the distance of the atoms. For the helical patterns a vector is fitted to the C$\alpha$s. For these vectors always distance between the ends, the length and the angles are calculated. Furthermore a possible crossing point is estimated. Afterwards for each helical pattern and for each angle-pattern we obtain a distribution for the different properties, so that we can refine our assumptions on the behaviour of the patterns. With the coordinates of the estimation of crossing points, on can furthermore see, whether the linker really follows a software predicted path and thus verifying the results of the linker-software.
 
-
=Results=
+
===Sequences to connect the alpha helix to the protein extremity===
-
Out of this, we decided to set up a modular system for our linkers. All linkers start with two amino acids, that guarantee some flexibility to the ends of the protein and that prohibit the attached helix to continue into the protein and thus making non-helical regions helical. The next building block is one of the alpha helix forming patterns AEAAAK, AEAAAKA, AEAAAKAA, AEAAAKEAAAK, AEAAAKEAAAKA, AEAAAKEAAAKEAAAKA, AEAAAKEAAAKEAAAKEAAAKA, AEAAAKEAAAKEAAAKEAAAKEAAAKA with a well-defined length and shape. Then an angle pattern is attached. All the angle patterns chosen by us, have the same distance from the actual turning point.[[figure ###turning point###]]. Thus one can easily exchange different angle patterns and easily calculate the distances between the following turning points, like it is used in our [[###software###]]. To this angle pattern, another helix pattern can easily be attached again. ###figure needed###. All our linkers end with the two exteins because of circularization or the sortase scar, treated both as rather unstructured flexible regions. On the other side we have introduced two amino acids, that prevent the helix from disturbing the protein ends by helix formation. Therefore we identified GG as a suitable pattern.
+
 
-
==Application==
+
Helical patterns often affect the folding of the attached sequences. To prevent them from affecting the structure of the protein of interest, we analyzed the effects of various amino acids on helix formation in silico using online tools like [http://bioserv.rpbs.univ-paris-diderot.fr/services/PEP-FOLD/ pep-fold]. We identified glycines and prolines as reliable amino acids to interrupt helix formation. We then decided to use glycine pairs to connect the protein of interest to the linker, because they give more flexibility to the initial orientation of the first helix.
-
-DNMT1
+
This only concerns the protein end that was connected to the linker through the coding sequence. The other end is ligated through exteins or a sortase scar, both treated as unstructured flexible regions.
-
-Lysozyme
+
 
-
==Verification of patterns==
+
===''Conclusion''===
-
The whole process for the verification of the different linker patterns was set up on the distributed computing system. [[  #### iGEM@home]] But unfortunately due to lack of time only few results could be analyzed, resulting in distributions for the different helices, see for example ### fig 5. and 6.###
+
 
-
[[plot_of_AEAAAKA.png]] [[plot of AEAAAKEAAAKA]]
+
The last three parts show how we could design alpha helix and angle pattern blocks and connect them to each other and to the protein of interest. They provide the material that allowed us to transform a linker defined as a geometrical path into a real amino acid sequence in our [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software software].
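Turning a sequence of rods and turns into an amino acid string can be sketched as simple concatenation. This is an illustrative assumption and not the CRAUT implementation: the function name `assemble_linker` is hypothetical, the glycine pair is placed only at the end that is fused through the coding sequence, and the extein or sortase scar at the other end is left out.

```python
def assemble_linker(helices, angles, cap="GG"):
    """Concatenate cap + helix + angle + helix + ... + helix.

    helices: helix building blocks, in path order
    angles:  angle patterns placed between consecutive helices
    cap:     helix-breaking residues between the protein and the first helix
    """
    if len(helices) != len(angles) + 1:
        raise ValueError("need exactly one angle pattern between each pair of helices")
    parts = [cap]
    for helix, angle in zip(helices, angles):
        parts.extend([helix, angle])
    parts.append(helices[-1])
    return "".join(parts)


# Two rods joined by the KTA angle pattern.
linker = assemble_linker(["AEAAAK", "AEAAAKA"], ["KTA"])
# linker == "GGAEAAAKKTAAEAAAKA"
```

Because every angle pattern was chosen with the same segment lengths around its turning point, swapping one angle pattern for another in such a list would not change the rod lengths used to plan the path.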
 +
 
 +
Thanks to this we could design linkers to circularize [https://2014.igem.org/Team:Heidelberg/Project/PCR_2.0 DNMT1] and [https://2014.igem.org/Team:Heidelberg/Project/Linker_Screening lysozyme]. Additionally, it was shown for lysozyme that a custom-tailored linker enhances heat stability compared to a badly designed linker and even to the flexible linker.
 +
 
 +
=''In silico'' refinement=
 +
 
 +
As some of the interesting patterns could not be found often enough to be statistically significant, we decided to make a further refinement ''in silico'' by modeling the structure of proteins with circularizing linkers. To perform this for realistic situations, we selected, from the RCSB database, structures of non-homologous target proteins with extremities that are separated enough to require a linker for circularization. First, the [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software CRAUT software] generates possible fitting linkers for various proteins. From these possible linkers, the 100 shortest were taken. Among the 3 possible angle patterns that generate the same angle, the software provides only one, but the linker refinement developed here used all three of them for comparison. We assume that the linkers connect the ends of the protein without putting tension on the protein, so that the protein can fold in its natural way.
 +
 
 +
After this, the circularized proteins with the specific linkers are modelled using the software Modeller [[#References|[10]]]. This software is widely used for comparative structure prediction. It is well established in the scientific community and should be most suitable for prediction of loop regions attached to existing structures [[#References|[11]]]. It is freely available for academic use from the [http://salilab.org/modeller/ salilab] webpages. The program is able to predict the 3D structure of a given sequence based on an alignment with a given structure. In our script, Modeller needed to be provided with a sequence with the linker attached and with the PDB file of the protein of interest. This latter file is the only structural information used here. Modeller is recommended when at least 30% sequence similarity exists between the provided structure and the one that the user wishes to model. Here, only the linker is different, so this similarity is around 90% in our case. At first, Modeller makes an alignment between the provided structure and the sequence of our linked protein, identifying the regions that cannot be found in the structure. Based on that, Modeller generates 4 initial models. One of the strengths of the software is its capacity to further refine only certain parts of the protein. Thus we let Modeller refine the loops, which are defined as any part of the protein that could not be found in the structure file. The ''ab initio'' modelling of our linkers is made by minimizing energy functions with different methods like conjugate gradients and molecular dynamics. Eventually, each modeled structure is provided with energy values, thanks to which different models of the same structure can be compared. From Modeller we received about 8 different models and chose the one with the best energy scores to proceed further. For these refinement steps, one can choose different levels of optimization.
We always chose accuracy over speed. The result of a prediction for lysozyme of bacteriophage lambda can be seen in Figure 5.
 +
 
 +
{{:Team:Heidelberg/templates/image-quarter|
 +
align=right|
 +
caption=Figure 5) Circular lysozyme|
 +
descr=The structure of circular lysozyme (yellow) was predicted and aligned to the linear structure (purple). The linker calculated by the software was marked in green. |
 +
file=circ_lam_lys_nils.png}}
 +
 
 +
Modeller was run by distributing the calculations via the [https://2014.igem.org/Team:Heidelberg/Software/igemathome iGEM@home system]. The modelling of one linker took about 10 hours of calculation time on average, a value that depends strongly on the size of the protein. Then the best model is evaluated by another self-written program to analyze the behaviour of the linker patterns in their natural surroundings.
 +
Finally, all the models for the different structures and the different linkers are analyzed for their properties, like the length of the helical patterns, the shape of the attachment structures of the linker and the angles produced by the angle patterns [[ Figure helix_winkel_messung.png]]. First the modeled structure and the natural structure are fitted, to see how big the differences between them are. If the protein has been disturbed too much, the model is discarded. We check that the alpha helix patterns really adopt the expected conformation by measuring the distance between the first and the last amino acids of each of them and checking whether it is compatible with an alpha helix. The first and last amino acids of each rod define a vector that is then used to calculate the angles between consecutive helical patterns. The frequency of all the determined lengths and angles is then further analyzed, using a strategy similar to the ArchDB analysis presented above.
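The geometric part of this evaluation can be sketched with plain vector arithmetic. The helper names, the 1.5 Å rise per residue and the 20% tolerance below are assumptions chosen for illustration; the coordinates are toy values standing in for the first and last amino acid positions of two consecutive rods, and the angle is taken between the two direction vectors (the page does not say which angle convention the team's program used).

```python
import math


def vector(start, end):
    """Vector pointing from the first to the last residue of a rod."""
    return tuple(e - s for s, e in zip(start, end))


def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(c * c for c in v))


def angle_deg(u, v):
    """Angle between two direction vectors, in degrees (0 to 180)."""
    cos = sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))
    # Clamp to [-1, 1] to guard against rounding errors before acos.
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))


def helix_compatible(measured, n_residues, rise=1.5, tolerance=0.2):
    """True if the measured end-to-end distance is within a relative
    tolerance of the ideal helix length n_residues * rise."""
    expected = n_residues * rise
    return abs(measured - expected) <= tolerance * expected


# Toy coordinates (Å) for two consecutive rods meeting at a turn.
rod1 = vector((0.0, 0.0, 0.0), (9.0, 0.0, 0.0))   # e.g. AEAAAK
rod2 = vector((9.0, 0.0, 0.0), (9.0, 10.5, 0.0))  # e.g. AEAAAKA

turn_angle = angle_deg(rod1, rod2)
rod1_ok = helix_compatible(norm(rod1), n_residues=6)
```

Collecting `norm(...)` and `angle_deg(...)` over all models of a given pattern yields the length and angle distributions that are then inspected in the same way as the ArchDB distributions.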
 +
 
 +
The whole process for the verification of the different linker patterns was set up on the [https://2014.igem.org/Team:Heidelberg/Software/igemathome distributed computing system]. Due to lack of time, only a few results could be analyzed, resulting in distributions for the different helices; see for example figures 6 and 7.
 +
This led to a refinement of the lengths of the 8 different alpha helix blocks presented above to 8.7, 10, 10.8, 15.6, 16.8, 24.8 and 32.3 Å.
 +
 
 +
{{:Team:Heidelberg/templates/image-quarter|
 +
align=right|
 +
descr=|
 +
caption=Fig. 7. Length distribution of AEAAAKEAAAK|
 +
file=plot_of_AEAAAKEAAAK.png}}
 +
{{:Team:Heidelberg/templates/image-quarter|
 +
align=right|
 +
descr=|
 +
caption=Fig. 6. Length distribution of AEAAAKA    |
 +
file=plot_of_AEAAAKA.png}}
 +
 
=Conclusion=
-
The patterns introduced provide an easy and fast tool to build customly shaped peptides. The main achievement is the identification of the angle patterns. These are created in a building block like manner for enhanced applicability. The shapes were identified from a database of non-homologous proteins. The patterns were refined until the distribution of attached amino acids looked randomly distributed. Thus we can exclude to a certain amount, that the angle distributions we have observed is not due to the attached sequences, but due to the identified patterns. We have not observed any evidence for certain helix patterns being preferred in the database.
+
The patterns that we identified by analyzing structure databases provide an easy and fast tool to build custom-shaped peptides. The main achievement is the identification of the angle patterns. These are designed as building blocks for enhanced applicability. The shapes were identified from a database of non-homologous proteins, and the patterns were refined until the surrounding amino acids appeared randomly distributed. Thus we can exclude that the angle distributions we observed are due to the surrounding sequences rather than to the identified patterns. We have not observed any evidence for certain helix patterns being preferred in the database.
-
From figures ###5. and 6### we have learned, that the lengths we have assumed for the helices needed to be adjusted to the new values. For example we had assumed the AEAAAKA motif to span a distance of 10.5 ###Å ### but have observed it to be only 10 Å long. This has found direct influence to the [[Linker_Software ###CRAUT]], the software for generating linkers, we have introduced.  
+
From figures 6 and 7 we learned that the lengths we had assumed for the helices needed to be refined. For example, we had assumed the AEAAAKA motif to span a distance of 10.5 Å but observed it to be only 10 Å long. Our [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software CRAUT software] was corrected accordingly.
 +
 
=References=
-
[0] Wang,  C.K.L., Kaas, Q., Chiche, L. & Craik, D.J. CyBase: A database  of  cyclic protein sequences and structures, with applications in protein  discovery and engineering. Nucleic Acids Research 36, (2008).
+
 
-
[10]  Vieille, C. & Zeikus, G.J. Hyperthermophilic enzymes: sources, uses, and molecular mechanisms for thermostability. Microbiology and molecular biology reviews : MMBR 65, 1-43 (2001).
+
[1]  Vieille, C. & Zeikus, G.J. Hyperthermophilic enzymes: sources, uses, and molecular mechanisms for thermostability. Microbiology and molecular biology reviews : MMBR 65, 1-43 (2001).
-
[1] Yu, Y. & Lutz, S. Circular permutation: a different way to engineer enzyme structure and function. Trends Biotechnol. 29, 18-–25 (2011).
+
 
-
[2] Chen, X., Zaro, J. L. & Shen, W.-C. Fusion protein linkers: property, design and functionality. Adv. Drug Deliv. Rev. 65, 1357-13–69 (2013).
+
[2] Yu, Y. & Lutz, S. Circular permutation: a different way to engineer enzyme structure and function. Trends Biotechnol. 29, 18–25 (2011).
-
[3] George, R. a & Heringa, J. An analysis of protein domain linkers: their classification and role in protein folding. Protein Eng. 15, 871-87–9 (2002).
+
 
-
[4] Arai, R., Ueda, H., Kitayama, a, Kamiya, N. & Nagamune, T. Design of the linkers which effectively separate domains of a bifunctional fusion protein. Protein Eng. 14, 529–-532 (2001).
+
[3] Arai, R., Ueda, H., Kitayama, A., Kamiya, N. & Nagamune, T. Design of the linkers which effectively separate domains of a bifunctional fusion protein. Protein Eng. 14, 529–532 (2001).
 +
 
 +
[4] Wang,  C.K.L., Kaas, Q., Chiche, L. & Craik, D.J. CyBase: A database  of  cyclic protein sequences and structures, with applications in protein  discovery and engineering. Nucleic Acids Research 36, (2008).
 +
 
[5] Efimov, A. V. Standard structures in proteins. Prog. Biophys. Mol. Biol. 60, 201–239 (1993).
 +
[6] Donate, L. E., Rufino, S. D., Canard, L. H. & Blundell, T. L. Conformational analysis and clustering of short and medium size loops connecting regular secondary structures: a database for modeling and prediction. Protein Sci. 5, 2600–2616 (1996).
 +
[7] Bonet, J. et al. ArchDB 2014: structural classification of loops in proteins. Nucleic Acids Res. 42, D315–D319 (2014).
-
[8] Fiser,  a et al. Modeling of loops in protein structures. Protein science : a publication of the Protein Society 9, 1753-73 (2000).
+
 
-
[9] Fiser, A. & Sali, A. ModLoop: Automated modeling of loops in protein structures. Bioinformatics 19, 2500-2501 (2003).
+
[8] George, R. A. & Heringa, J. An analysis of protein domain linkers: their classification and role in protein folding. Protein Eng. 15, 871–879 (2002).
 +
 
 +
[9] Chen, X., Zaro, J. L. & Shen, W.-C. Fusion protein linkers: property, design and functionality. Adv. Drug Deliv. Rev. 65, 1357–1369 (2013).
 +
 
 +
[10] Fiser, A. et al. Modeling of loops in protein structures. Protein Sci. 9, 1753–1773 (2000).
 +
 
 +
[11] Fiser, A. & Sali, A. ModLoop: Automated modeling of loops in protein structures. Bioinformatics 19, 2500-2501 (2003).

Latest revision as of 21:15, 17 October 2014

Background

Primary, secondary, tertiary and quaternary structures are the main levels of protein structure characterization. Primary structure designates the amino acid sequence, while the secondary structure describes the arrangement of consecutive amino acids through their two dihedral angles $\phi$ and $\psi$. The Ramachandran plot, which represents the position of each amino acid in the space of those two angles, shows two particular arrangements commonly found in proteins: alpha helices and beta sheets. The next level of protein organization is the tertiary structure, which describes how the protein is organized in the three spatial dimensions, whereas the quaternary structure describes how different protein subunits assemble. Finally, closely related to these standard levels, the supersecondary structure describes how secondary structure elements are connected to each other. While these connections look undefined at first sight, further analysis revealed that the wide variety of supersecondary structure motifs can be clustered into certain patterns [5].

Supersecondary structure

When the properties of supersecondary structures were first described, only very few patterns were identified, mainly due to the lack of highly resolved protein structures. At that time, the structures were mainly classified by the Ramachandran plot regions in which the amino acids could be found [6]. With the growing number of known crystal structures, the analysis of supersecondary structure improved and led to databases with about 150 000 classified loop structures and elaborate clustering [7]. Nowadays supersecondary structures are defined as the structures built when two secondary structure elements are joined by a short peptide that is not itself assigned to one of the secondary structures. These loop peptides range from 1 to 9 amino acids. Our aim was to build reliable, stable linkers out of alpha helices connected by supersecondary structure motifs that produce certain angles. To achieve that, we searched for the most reliable alpha helix patterns that would form rigid rods, and for angle patterns covering the whole range of angles from 0 to 180 degrees.

Linker building block design

Helix patterns

Various patterns have been used to build helical linkers to connect protein ends [8]. Moreover, in known protein structures, linkers between subdomains can be identified and their properties have already been analyzed [9]. Two main criteria were used to build the alpha helix patterns: they should robustly form alpha helices, and they should be soluble in aqueous solution. Therefore, we could not simply use linkers built of alanine. Instead, we decided to add some charged amino acids to the pattern and to position them physically close to each other so that they could stabilize each other by Coulomb interaction. These amino acids needed to be separated by 3 amino acids, as a helical turn takes about 3.6 amino acids. The pattern we chose as most suitable for our purpose has also been described as one of the most stable [3]. 8 alpha helix building blocks were eventually chosen: AEAAAK, AEAAAKA, AEAAAKAA, AEAAAKEAAAK, AEAAAKEAAAKA, AEAAAKEAAAKEAAAKA, AEAAAKEAAAKEAAAKEAAAKA, AEAAAKEAAAKEAAAKEAAAKEAAAKA, with respective estimated lengths of 9, 10.5, 12, 16.5, 18, 25.5 and 33 Å.
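
The estimated lengths listed above coincide with a rise of 1.5 Å per residue, the textbook value for an ideal alpha helix. The following minimal Python sketch reproduces them under that assumption and also locates the E/K pairs that are separated by 3 amino acids. The constant and the function names are ours, chosen for illustration only; they are not taken from the CRAUT software.

```python
# Assumption: an ideal alpha helix rises about 1.5 Å per residue.
RISE_PER_RESIDUE = 1.5  # Å

# The helix building blocks named in the text.
HELIX_BLOCKS = [
    "AEAAAK",
    "AEAAAKA",
    "AEAAAKAA",
    "AEAAAKEAAAK",
    "AEAAAKEAAAKA",
    "AEAAAKEAAAKEAAAKA",
    "AEAAAKEAAAKEAAAKEAAAKA",
    "AEAAAKEAAAKEAAAKEAAAKEAAAKA",
]


def estimated_length(block, rise=RISE_PER_RESIDUE):
    """Estimated rod length in Å: number of residues times the rise per residue."""
    return rise * len(block)


def charged_pairs(block):
    """Return (i, i + 4) index pairs with E at i and K at i + 4.

    Three residues lie between the two, so with about 3.6 residues per
    turn the charged side chains end up close to each other.
    """
    return [
        (i, i + 4)
        for i in range(len(block) - 4)
        if block[i] == "E" and block[i + 4] == "K"
    ]


if __name__ == "__main__":
    for block in HELIX_BLOCKS:
        print(block, estimated_length(block), charged_pairs(block))
```

These values are only starting estimates; the in silico refinement described further down adjusts them.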

Angle patterns

The angle patterns for our model were obtained from the ArchDB database [7], which classifies loops from known protein structures. About 17 000 non-homologous proteins from the PDB database were analyzed and, from them, 150 000 loop structures, i.e. regions connecting two secondary structure elements, were identified. The classification took into account not only the length of the loop and its conformation, meaning the φ and ψ backbone dihedral angles of the residues in the loop, but also the distance between the attachments of the loop to the surrounding secondary structures. Furthermore, the secondary structures surrounding the loop and the geometry defined by the supersecondary structure motifs can be found in the database.

To extract the supersecondary structure motifs relevant for our linker design from ArchDB, the complete database was downloaded and helix-loop-helix motifs were extracted using a self-written Python script. From them we only took into account loops composed of 1 and 2 amino acids, because the longer the loops, the less frequent and therefore the less reliable they are, and the further their ends are from each other. The information of interest to us was (1) the angle between the vectors defining the surrounding alpha helices, (2) the distance between the ends of the loop, and (3) the type of amino acids surrounding the loop. Furthermore, we analyzed the statistical significance of the conformation. For each amino acid combination in the loop region, the angle distribution between the flanking alpha helices, the loop length distribution and a 2D heatmap of the surrounding amino acids were plotted automatically (Figures 1-4).

These distributions were then visually analyzed to identify loops of interest for linker design. We focused on loops that showed a narrow angle distribution and that appeared frequently in the database, and we extended the distribution analysis to the surrounding amino acids. For example, once we had identified that the amino acid T in a turn produces an interesting distribution (Figure 1), we extended it with the amino acid K in front (Figure 2), with the amino acid A at the end (Figure 3), and with both (Figure 4), and investigated how the distributions behaved. Restricting the possibilities in this way reduces the number of occurrences tremendously, but the properties become more interesting.


Figure 1) A loop composed of the amino acid T produces an interesting, well-defined angle distribution (left panel). The middle panel represents the distribution of lengths found with this loop. The right panel represents the distribution of amino acids found before and after the amino acid T in all the corresponding loops. One can see that the amino acid K is the most frequent before and A the most frequent after.


Figure 2) These distributions are subsets of the ones in Figure 1, with only the loops composed of the amino acid T preceded by the amino acid K. This constraint considerably narrows down the angle and length distributions.


Figure 3) Same procedure as in Figure 2, with the loops composed of the amino acid T followed by the amino acid A. This also narrows down the angle and length distributions compared to Figure 1, although not as much as in Figure 2.


Figure 4) The distributions correspond to loops composed of the amino acids KTA and therefore to subsets of all the preceding ones. One can notice that the angle distribution becomes very narrow, but also that the frequency is reduced compared to the preceding ones.

This step had two main goals: narrowing down the angle distribution and finding loops with no preference for the amino acids surrounding them. The latter point was important for the modularity of our approach: the angle blocks should not be affected by the surrounding alpha helices. Using this approach, 10 different angle motifs producing different angles could be identified (Table 1). Importantly, all these motifs were chosen so that the length of the two segments starting from the turning point was the same for all of them. It should also be noted that 3 of the patterns, KTA, LVA and AAIAP, produce the same angle. They were nevertheless all kept for the in silico refinement described below, but only one was used for the CRAUT software.


Table 1: The span of parameters of the angle patterns (angles in degrees).

Pattern    Mean    Variation
NVL        29.7    8.5
KTA        38.7    30
LVA        35      29
AAIAP      36.5    27
AADGTL     60      12
VNLTA      74.5    27
AAAHPEA    117     12
ASLPAA     140     15
ATGDLA     160     5
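
The stepwise narrowing illustrated in Figures 1-4 can be sketched as follows. This is only an illustration: the record layout, the function name `angle_stats` and all numbers in `records` are invented for the example and are not taken from ArchDB or from our extraction script. The sketch counts how often a loop occurs and how widely its angles spread, first for the loop alone and then with the preceding and following amino acids fixed.

```python
from statistics import mean, pstdev

# Hypothetical records, one per helix-loop-helix motif:
# (residue before loop, loop sequence, residue after loop, angle in degrees)
records = [
    ("K", "T", "A", 38.0),
    ("K", "T", "A", 40.0),
    ("K", "T", "G", 45.0),
    ("L", "T", "A", 70.0),
    ("S", "T", "D", 120.0),
    ("K", "G", "A", 90.0),
]


def angle_stats(records, loop, before=None, after=None):
    """Return (count, mean angle, spread) for one loop.

    `before` and `after` optionally fix the flanking residues; None
    means that any residue is accepted at that position.
    """
    angles = [
        angle
        for prev, seq, nxt, angle in records
        if seq == loop
        and (before is None or prev == before)
        and (after is None or nxt == after)
    ]
    if not angles:
        return 0, None, None
    return len(angles), mean(angles), pstdev(angles)


if __name__ == "__main__":
    print("T  ", angle_stats(records, "T"))
    print("KT ", angle_stats(records, "T", before="K"))
    print("TA ", angle_stats(records, "T", after="A"))
    print("KTA", angle_stats(records, "T", before="K", after="A"))
```

With the invented data, each added constraint lowers the count and narrows the spread, which is the trade-off between frequency and precision described above.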

Sequences to connect the alpha helix to the protein extremity

Helical patterns often affect the folding of the attached sequences. To prevent them from affecting the structure of the protein of interest, we analyzed the effects of various amino acids on helix formation in silico using online tools like [http://bioserv.rpbs.univ-paris-diderot.fr/services/PEP-FOLD/ pep-fold]. We identified glycines and prolines as reliable amino acids to interrupt helix formation. We then decided to use glycine pairs to connect the protein of interest to the linker, because they give more flexibility to the orientation of the first helix. This only concerns the protein end that is connected to the linker through the coding sequence. The other end is ligated through exteins or a sortase scar, both of which are treated as unstructured flexible regions.

Conclusion

The last three parts show how we could design alpha helix and angle pattern blocks and connect them to each other and to the protein of interest. They provide the material that allowed us to transform a linker defined as a geometrical path into a real amino acid sequence in our software.
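
As a rough illustration of how the building blocks can be turned into a sequence, the sketch below concatenates a glycine pair with alternating helix and angle blocks. The function name `build_linker`, the strict helix/angle alternation and the example path are assumptions made for this example; the actual CRAUT implementation may differ.

```python
# Building blocks named in the text (helix patterns and Table 1).
HELIX_BLOCKS = {
    "AEAAAK", "AEAAAKA", "AEAAAKAA", "AEAAAKEAAAK", "AEAAAKEAAAKA",
    "AEAAAKEAAAKEAAAKA", "AEAAAKEAAAKEAAAKEAAAKA",
    "AEAAAKEAAAKEAAAKEAAAKEAAAKA",
}
ANGLE_BLOCKS = {
    "NVL", "KTA", "LVA", "AAIAP", "AADGTL",
    "VNLTA", "AAAHPEA", "ASLPAA", "ATGDLA",
}


def build_linker(path, start_connector="GG"):
    """Join helix and angle blocks into one linker sequence.

    Assumption: `path` alternates helix, angle, helix, ... and both
    starts and ends with a helix block. The glycine pair is placed on
    the side that is fused to the protein through the coding sequence.
    """
    if len(path) % 2 == 0:
        raise ValueError("path must start and end with a helix block")
    for i, block in enumerate(path):
        allowed = HELIX_BLOCKS if i % 2 == 0 else ANGLE_BLOCKS
        if block not in allowed:
            kind = "helix" if i % 2 == 0 else "angle"
            raise ValueError(f"{block!r} is not a known {kind} block")
    return start_connector + "".join(path)


if __name__ == "__main__":
    # Hypothetical path: a short rod, a turn, and a second rod.
    print(build_linker(["AEAAAKA", "KTA", "AEAAAK"]))
```

Rejecting an unknown block instead of silently accepting it keeps the sequence restricted to patterns whose geometry has been characterized.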

Thanks to this, we could design linkers to circularize DNMT1 and lysozyme. Additionally, it was shown for lysozyme that a custom-tailored linker enhances heat stability compared to a badly designed linker and even to the flexible linker.

In silico refinement

As some of the interesting patterns could not be found often enough to be statistically significant, we decided to refine them further in silico by modeling the structure of proteins with circularizing linkers. To do this for realistic situations, we selected from the RCSB database structures of non-homologous target proteins whose extremities are far enough apart to require a linker for circularization. First, the CRAUT software generates possible fitting linkers for the various proteins. From these possible linkers, the 100 shortest were taken. Among the 3 angle patterns that generate the same angle, the software provides only one, but the linker refinement developed here used all three of them for comparison. We assume that the linkers connect the ends of the protein without putting tension on the protein, so that the protein can fold in its natural way.

After this, the circularized proteins with their specific linkers are modelled using the software Modeller [10]. This software is widely used for comparative structure prediction, is well established in the scientific community, and should be well suited to predicting loop regions attached to existing structures [11]. It is freely available for academic use from the [http://salilab.org/modeller/ salilab] webpages. The program predicts the 3D structure of a given sequence based on an alignment with a known structure. In our script, Modeller is provided with the sequence with the linker attached and with the PDB file of the protein of interest. The latter file is the only structural information used here. Modeller is recommended when at least 30% sequence similarity exists between the provided structure and the one that the user wishes to model. Here, only the linker differs, so the similarity is around 90% in our case. Modeller first aligns the provided structure with the sequence of our linked protein, identifying the regions that cannot be found in the structure. Based on that, Modeller generates 4 initial models. One of the strengths of the software is its capacity to further refine only certain parts of the protein. We therefore let Modeller refine the loops, which are defined as any part of the protein that could not be found in the structure file. The ab initio modelling of our linkers is done by minimizing energy functions with different methods such as conjugate gradients and molecular dynamics. Each modelled structure is then assigned energy values, which allow different models of the same structure to be compared. From Modeller we received about 8 different models and chose the one with the best energy scores to proceed with. For these refinement steps, one can choose different levels of optimization. We always favoured accuracy over speed. 
The result of a prediction for lysozyme of bacteriophage lambda can be seen in figure 5.


Figure 5) Circular lysozyme

The structure of circular lysozyme (yellow) was predicted and aligned to the linear structure (purple). The linker calculated by the software was marked in green.

Modeller was run by distributing the calculations via the iGEM@home system. The modelling of one linker took about 10 hours of calculation time on average, a value that depends strongly on the size of the protein. The best model is then evaluated by another self-written program to analyze the behaviour of the linker patterns in their natural surroundings. Finally, all the models for the different structures and linkers are analyzed for properties such as the length of the helical patterns, the shape of the attachment structures of the linker and the angles produced by the angle patterns (Figure helix_winkel_messung.png). First, the modelled structure and the natural structure are superimposed to see how large the differences between them are. If the protein has been disturbed too much, the model is discarded. We check that the alpha helix patterns really adopt the expected conformation by measuring the distance between the first and the last amino acid of each pattern and checking whether it is compatible with an alpha helix. The first and last amino acids of each rod define a vector that is then used to calculate the angles between consecutive helical patterns. The frequencies of all the determined lengths and angles are then analyzed further, using a strategy similar to the ArchDB analysis presented above.
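
The two geometric checks described above, the end-to-end length of each helix pattern and the angle between consecutive helix vectors, can be sketched as follows. This is a simplified illustration and not our evaluation program: the function names, the use of one coordinate per amino acid (for instance the C-alpha atom), the 20% tolerance and the angle convention (the plain angle between the two direction vectors) are all assumptions made for this example.

```python
import math


def vector(start, end):
    """Vector from `start` to `end`, both given as (x, y, z) in Å."""
    return tuple(e - s for s, e in zip(start, end))


def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(c * c for c in v))


def is_helix_like(first, last, expected_length, tolerance=0.2):
    """Check the end-to-end distance of a helix pattern.

    `first` and `last` are the coordinates of its first and last amino
    acid. The pattern passes if the measured distance deviates from
    `expected_length` by at most the given fraction (assumed: 20%).
    """
    measured = norm(vector(first, last))
    return abs(measured - expected_length) <= tolerance * expected_length


def angle_between(v1, v2):
    """Angle in degrees between two helix vectors, in the range 0 to 180."""
    cos_angle = sum(a * b for a, b in zip(v1, v2)) / (norm(v1) * norm(v2))
    cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding
    return math.degrees(math.acos(cos_angle))


if __name__ == "__main__":
    # Hypothetical coordinates of two consecutive helix patterns.
    helix_1 = ((0.0, 0.0, 0.0), (10.0, 0.0, 0.0))
    helix_2 = ((12.0, 1.0, 0.0), (12.0, 10.0, 0.0))
    print(is_helix_like(*helix_1, expected_length=10.5))
    print(angle_between(vector(*helix_1), vector(*helix_2)))
```

Collecting these lengths and angles over all accepted models gives the distributions that are then compared with the initial estimates.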

The whole process for the verification of the different linker patterns was set up on the distributed computing system. Due to lack of time, only a few results could be analyzed, resulting in distributions for the different helices; see for example figures 6 and 7. This led to a refinement of the lengths of the 8 different alpha helix blocks presented above to 8.7, 10, 10.8, 15.6, 16.8, 24.8 and 32.3 Å.


Figure 6) Length distribution of AEAAAKA

Figure 7) Length distribution of AEAAAKEAAAK

Conclusion

The patterns that we identified by analyzing structure databases provide an easy and fast tool to build custom-shaped peptides. The main achievement is the identification of the angle patterns. These are designed as building blocks for enhanced applicability. The shapes were identified from a database of non-homologous proteins, and the patterns were refined until the surrounding amino acids appeared randomly distributed. Thus we can exclude that the angle distributions we observed are due to the surrounding sequences rather than to the identified patterns. We have not observed any evidence for certain helix patterns being preferred in the database. From figures 6 and 7 we learned that the lengths we had assumed for the helices needed to be refined. For example, we had assumed the AEAAAKA motif to span a distance of 10.5 Å but observed it to be only 10 Å long. Our CRAUT software was corrected accordingly.

References

[1] Vieille, C. & Zeikus, G. J. Hyperthermophilic enzymes: sources, uses, and molecular mechanisms for thermostability. Microbiol. Mol. Biol. Rev. 65, 1–43 (2001).

[2] Yu, Y. & Lutz, S. Circular permutation: a different way to engineer enzyme structure and function. Trends Biotechnol. 29, 18–25 (2011).

[3] Arai, R., Ueda, H., Kitayama, A., Kamiya, N. & Nagamune, T. Design of the linkers which effectively separate domains of a bifunctional fusion protein. Protein Eng. 14, 529–532 (2001).

[4] Wang, C. K. L., Kaas, Q., Chiche, L. & Craik, D. J. CyBase: A database of cyclic protein sequences and structures, with applications in protein discovery and engineering. Nucleic Acids Res. 36 (2008).

[5] Efimov, A. V. Standard structures in proteins. Prog. Biophys. Mol. Biol. 60, 201–239 (1993).

[6] Donate, L. E., Rufino, S. D., Canard, L. H. & Blundell, T. L. Conformational analysis and clustering of short and medium size loops connecting regular secondary structures: a database for modeling and prediction. Protein Sci. 5, 2600–2616 (1996).

[7] Bonet, J. et al. ArchDB 2014: structural classification of loops in proteins. Nucleic Acids Res. 42, D315–D319 (2014).

[8] George, R. A. & Heringa, J. An analysis of protein domain linkers: their classification and role in protein folding. Protein Eng. 15, 871–879 (2002).

[9] Chen, X., Zaro, J. L. & Shen, W.-C. Fusion protein linkers: property, design and functionality. Adv. Drug Deliv. Rev. 65, 1357–1369 (2013).

[10] Fiser, A. et al. Modeling of loops in protein structures. Protein Sci. 9, 1753–1773 (2000).

[11] Fiser, A. & Sali, A. ModLoop: Automated modeling of loops in protein structures. Bioinformatics 19, 2500–2501 (2003).