Team:Heidelberg/pages/Linker Software


Revision as of 21:21, 17 October 2014 by Igemnils



Artificially circularized proteins gain heat stability because their C- and N-termini are restrained from moving freely. If the termini are too far apart, a linker is needed to connect them, so that the natural conformation of the protein changes as little as possible while the relative position of the ends is fixed and their degrees of freedom are restricted. These linkers must not hinder the protein's function in any way. Consequently it is important that a linker neither passes through the active site nor covers a cavity of the protein, for example. In the modelling part we have shown that the shape of our linkers can be defined by applying our model of rigid helical rods connected by well-defined angle regions. But even with the ability to define the path the linker should take, one still needs to know which path to choose. Especially for larger proteins with complex shapes this can be very difficult. Furthermore, one would like to make sure that the active sites are avoided. And even taking all this into account, one could never also consider by hand all the other paths that the same linker is able to take because of a rotational degree of freedom inherent in the linker model. ###figure needed### We therefore decided to provide the scientific community with powerful open-source software that, for every protein with a known structure, can calculate the sequence of a rigid linker that circularizes the protein with minimal inhibition of its function. Until now every scientist had to estimate the length of the linker on their own, ###check for flexlinker### so our software is a completely novel approach to circularization.

General procedure

First the protein structure is analysed, and then all possible paths that connect the two ends with fewer than three additional edges are found. In the end all these paths are to be ranked by how well the protein would be circularized along each of them. But as the final weighting is computationally intensive, unsuitable paths are discarded first. A path is discarded only if it breaks one of the rules for linkers; for example, a path must never pass through the protein. After all paths have been generated, they are refined by shifting their points according to the underlying linker model, so that no path is considered that could not be built with our building blocks. The remaining paths are then weighted by several factors, and from these a single weighting value is calculated for each path that expresses how good this path is. Next, the path is translated into an amino acid sequence by means of our amino acid patterns. As one sequence can follow more than one path, all paths built from the same sequence are clustered. Finally the weightings within each cluster are averaged, so that every linker receives a single weighting value with contributions from all paths it represents.

PDB parsing

First the PDB file containing the structure of the target protein is parsed and the coordinates of the atoms are stored. From this data the coordinate system of the PDB file is calibrated to metric units. For this purpose the distance between certain atoms in all glycines is measured. As this distance is well known, a calibration can be made, which normally yields 100 pm per unit in the PDB file. After this, some first tests are made with the protein structure. First it is tested whether the C- and N-termini lie on the solvent-accessible surface of the protein, which is crucial for circularization. The angles from which the ends are accessible are stored for the subsequent linker generation. An end is accessible at a certain angle if the axis that is rotated away from the z-axis by this angle does not come too close to any of the atom positions. The minimal distance a connection must keep from the protein is set to the radius of an alpha helix, 5 Å. ###fig needed###
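The calibration step can be illustrated with a minimal sketch. It assumes the standard fixed-column layout of PDB ATOM records and, since the text does not say which atoms are measured, takes the backbone N-CA bond of glycine with an assumed reference length of roughly 146 pm. The function names `parse_atoms` and `picometres_per_unit` are hypothetical and not taken from our software.

```python
import math


def parse_atoms(pdb_text):
    """Return (atom_name, res_name, chain, res_seq, (x, y, z)) for every ATOM record."""
    atoms = []
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        name = line[12:16].strip()       # columns 13-16: atom name
        res_name = line[17:20].strip()   # columns 18-20: residue name
        chain = line[21]                 # column 22: chain identifier
        res_seq = int(line[22:26])       # columns 23-26: residue number
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        atoms.append((name, res_name, chain, res_seq, xyz))
    return atoms


def picometres_per_unit(atoms, reference_pm=146.0):
    """Estimate picometres per coordinate unit from the glycine N-CA distance.

    reference_pm is an assumed reference bond length, not a value from the text.
    """
    by_residue = {}
    for name, res_name, chain, res_seq, xyz in atoms:
        if res_name == "GLY" and name in ("N", "CA"):
            by_residue.setdefault((chain, res_seq), {})[name] = xyz
    distances = [math.dist(r["N"], r["CA"]) for r in by_residue.values() if len(r) == 2]
    if not distances:
        raise ValueError("no glycine with both N and CA atoms found")
    return reference_pm / (sum(distances) / len(distances))
```

For a regular PDB file, whose coordinates are given in Å, this should return a value close to 100 pm per unit.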

Generation of paths

As our model builds linkers from helical rods and connecting angles, a path is completely defined by the coordinates of its angle points. Thus only the angle points are generated, and the unsuitable ones are discarded afterwards. As shifting existing points is very efficient in our programming style, this was easier than generating only those points that represent good connections. In the modelling part we have already described the patterns we use to build our linkers. Their modularity was crucial for the success of the software. Therefore the angle patterns were always chosen so that the end of the angle pattern is ???8??? Å away from the turning point ###figure needed###. Thus only displacements of a certain length had to be made, independent of the direction from which the linker arrives and of the direction in which it continues. This modularity makes the calculations more efficient than generating points at random.

Flexible ends

Most proteins have flexible regions at their ends that do not point in a defined direction. Often these flexible ends are even missing in the structure files, but our software still estimates how they could behave. Furthermore, the circularization itself leaves non-helical sequences at the ends of the protein. This offers a great opportunity to insert fitting linkers. But it is also a big problem, as the flexible parts are not easy to estimate with our brute-force approach. So far, two functions have been included for generating linkers that take the flexibility of the ends into account.

One helix at flexible ends

The only variable we have in this custom-tailored linker is the length of the helix. Most likely no suitable linker can be found, because if there is an obstacle between the ends, the linker cannot bend around it. But if such a linker is possible, it should be found, because it will be one of the best linkers predicted. ###figure needed### So for all accessible angles that were calculated before, we calculate all points that lie at certain distances from the N-terminus and from the C-terminus. The points are spread over distances from 4.5 Å to the maximum length of the flexible part. ###figure### Then all possible connections between the first points and the last points are tested for whether they come too close to the protein or even pass through it. This is done with higher accuracy, because none of these linkers should be lost by error. Therefore another function is used here for calculating the distance from the protein to the connection, which is more time-consuming but also more accurate than the one normally used.
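A minimal sketch of this search, under stated assumptions: the accessible angles are given as direction vectors, the candidate distances are spaced by 1.5 Å (an assumed spacing, chosen here only because it is the rise per residue of an alpha helix), and the accurate clearance test is passed in as a predicate. The names `points_along`, `single_helix_candidates` and `is_clear` are hypothetical.

```python
import math
from itertools import product


def points_along(terminus, directions, d_min=4.5, d_max=12.0, step=1.5):
    """Candidate points at distances d_min, d_min + step, ... <= d_max along each direction.

    d_max stands for the maximum length of the flexible part; step is an assumption.
    """
    count = int(math.floor((d_max - d_min) / step + 1e-9)) + 1
    points = []
    for direction in directions:
        norm = math.sqrt(sum(c * c for c in direction))
        for i in range(count):
            d = d_min + i * step
            points.append(tuple(terminus[k] + d * direction[k] / norm for k in range(3)))
    return points


def single_helix_candidates(first_points, last_points, is_clear):
    """Keep every straight connection that the supplied clearance test accepts."""
    return [(p, q) for p, q in product(first_points, last_points) if is_clear(p, q)]
```

Passing the clearance test as a function keeps the sketch independent of how the more accurate distance calculation is actually implemented.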

Linker with one angle at flexible ends

The other possibility for taking flexible ends into account without blowing up the calculation time was to restrict the number of angles in the helical pattern to exactly one. Linkers with flexible ends and one angle are built as right triangles. ###figure### Right triangles are chosen because a right angle can be built well with the angle patterns we have, and because the right angle helps a lot in the further calculation. First, it is determined by use of Pythagoras' theorem which triangles can be built out of all possible linker parts. It is important that both legs can be built with our linker patterns and that the hypotenuse has the correct length to fit between the ends. Afterwards all possible right triangles are constructed that have the two ends of the hypotenuse on the N- and C-terminus. For this purpose Thales' theorem is used, as all possible right-angle vertices lie on a sphere whose radius is half the hypotenuse. Then these right triangles are checked for whether they can be built with our linker parts, by comparing their angles with the set of possible triangles from Pythagoras' theorem. These triangles are now all shifted by each displacement of the possible angles at the C-terminus, resulting in the right-angle points ###figure needed###. Now the connections to the possible points from the N-terminus are generated. At this step the triangles no longer need to be exactly right-angled but can have slightly different angles; in the next steps the paths are checked for whether they still fit. First the lengths of the legs of the triangles are checked, and then it is checked that the paths do not disturb the protein. If they would disturb the protein in any way, they are simply deleted.
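The geometry of this construction can be sketched as follows. The sketch assumes a list of buildable leg lengths and a length tolerance, both of which are illustrative, and the function names are hypothetical. For legs a and b that satisfy Pythagoras' theorem with the end-to-end distance as hypotenuse, the right-angle vertices form a ring on the Thales sphere around the midpoint of the two termini.

```python
import math


def buildable_right_triangles(leg_lengths, hypotenuse, tol=0.5):
    """Leg pairs (a, b) whose Pythagorean hypotenuse matches the end-to-end distance."""
    return [(a, b) for a in leg_lengths for b in leg_lengths
            if abs(math.hypot(a, b) - hypotenuse) <= tol]


def _cross(u, v):
    return (u[1] * v[2] - u[2] * v[1], u[2] * v[0] - u[0] * v[2], u[0] * v[1] - u[1] * v[0])


def _unit(v):
    norm = math.sqrt(sum(c * c for c in v))
    return tuple(c / norm for c in v)


def right_angle_vertices(n_term, c_term, a, b, steps=36):
    """Points P with |P - n_term| = a and |P - c_term| = b, sampled around the N-C axis."""
    axis = tuple(c_term[i] - n_term[i] for i in range(3))
    c = math.sqrt(sum(x * x for x in axis))
    u = _unit(axis)
    t = (c * c + a * a - b * b) / (2.0 * c)   # foot of the vertex on the N-C axis
    h_sq = a * a - t * t                      # squared radius of the ring
    if h_sq < 0.0:
        return []                             # legs cannot span the two termini
    h = math.sqrt(h_sq)
    helper = (1.0, 0.0, 0.0) if abs(u[0]) < 0.9 else (0.0, 1.0, 0.0)
    v = _unit(_cross(u, helper))              # two unit vectors perpendicular to the axis
    w = _cross(u, v)
    points = []
    for k in range(steps):
        phi = 2.0 * math.pi * k / steps
        points.append(tuple(
            n_term[i] + t * u[i] + h * (math.cos(phi) * v[i] + math.sin(phi) * w[i])
            for i in range(3)))
    return points
```

When a and b only approximately fulfil Pythagoras' theorem, the ring lies slightly off the Thales sphere, which corresponds to triangles whose angle deviates slightly from 90 degrees.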

Rigid paths

If the first two possibilities for finding suitable paths did not work, the software can also find paths with up to three edges. Thus even one of the worst shapes for circularization, a torus with the two termini in the pits, could be circularized with our linker system without needing infinitely long straight segments. On the other hand, this was the maximum number of possibilities that was still feasible to calculate. Here the flexible parts of the protein are assumed to point in the same direction as the following helix. In this way the number of possibilities is kept fixed, but of course this is quite a rough estimate. First, points at the right distances from each end are generated. ###figure needed, explains point names### Now, for each of the first points, next points are generated in all directions in steps of 5 degrees. Of these, all points that do not fit are immediately discarded. Then all connections from the second points to the last points are checked for whether they have the right length. After this the normal sorting steps are applied to these new connections.
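The generation of next points in 5 degree steps could look like the following sketch. It assumes a plain spherical-coordinate grid, which is one possible reading of "all directions" and samples the poles more densely than the equator; `direction_grid` and `next_points` are hypothetical names.

```python
import math


def direction_grid(step_deg=5):
    """Unit vectors on a polar/azimuthal grid with the given angular step in degrees."""
    directions = [(0.0, 0.0, 1.0), (0.0, 0.0, -1.0)]   # the two poles, once each
    for theta_deg in range(step_deg, 180, step_deg):
        theta = math.radians(theta_deg)
        for phi_deg in range(0, 360, step_deg):
            phi = math.radians(phi_deg)
            directions.append((math.sin(theta) * math.cos(phi),
                               math.sin(theta) * math.sin(phi),
                               math.cos(theta)))
    return directions


def next_points(point, distance, step_deg=5):
    """All candidate points at a fixed distance from point, one per grid direction."""
    return [tuple(point[i] + distance * d[i] for i in range(3))
            for d in direction_grid(step_deg)]
```

With a 5 degree step this yields 2522 candidates per starting point, which illustrates why the number of paths grows so quickly with every additional edge.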

Sorting out of paths

Most of the calculation time is consumed by sorting out unsuitable paths, because every path needs to be checked. With the brute-force approach the number of possible paths is about 10^9. There are three main functions that discard paths that do not fit. The simplest one just discards paths that would require an angle we cannot produce with our angle patterns. But as we can produce nearly all angles from ???20 -170??? degrees, only very few paths are discarded by this function. The next function checks whether the end point lies inside the protein. If so, the path is deleted. Otherwise the connection from the preceding point to this point is checked for whether it passes too close to the protein.
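Two of these checks can be sketched as follows, assuming atoms are treated as points and a connection as a straight segment. The angle limits are placeholders that mirror the tentative range given above, and 5 Å is the helix radius used as minimal distance; all function names are hypothetical. The test for end points inside the protein is not covered, and a real implementation would also have to exempt the atoms right at the termini.

```python
import math


def angle_deg(a, b, c):
    """Angle at vertex b between the segments b->a and b->c, in degrees."""
    u = [a[i] - b[i] for i in range(3)]
    v = [c[i] - b[i] for i in range(3)]
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))


def point_segment_distance(p, a, b):
    """Shortest distance from point p to the segment from a to b."""
    ab = [b[i] - a[i] for i in range(3)]
    ap = [p[i] - a[i] for i in range(3)]
    denom = sum(x * x for x in ab)
    t = 0.0 if denom == 0.0 else max(0.0, min(1.0, sum(x * y for x, y in zip(ap, ab)) / denom))
    return math.sqrt(sum((a[i] + t * ab[i] - p[i]) ** 2 for i in range(3)))


def segment_is_clear(a, b, atoms, min_dist=5.0):
    """True if no atom lies closer than min_dist to the connection a-b."""
    return all(point_segment_distance(atom, a, b) >= min_dist for atom in atoms)


def path_is_valid(path, atoms, min_angle=20.0, max_angle=170.0, min_dist=5.0):
    """Apply the angle check at every inner point, then the clearance check per edge."""
    for i in range(1, len(path) - 1):
        if not min_angle <= angle_deg(path[i - 1], path[i], path[i + 1]) <= max_angle:
            return False
    return all(segment_is_clear(path[i], path[i + 1], atoms, min_dist)
               for i in range(len(path) - 1))
```

Running the cheap angle check first means that the more expensive loop over all atoms is only executed for paths that can be built at all.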

Shifting paths to the patterns

Because of rounding errors and other inaccuracies, it can always happen that the generated paths do not fit the helix and angle patterns perfectly. Therefore the paths need to be refined before they are translated into an amino acid sequence. This is also done before the weighting, so that the paths no longer change after weighting. For the refinement the distances between the points are calculated, and the points are then shifted until they fit the patterns. The shifts never exceed a certain length, so that no path passes through the protein after refinement if it did not before.
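As a simplified illustration of such a shift, the following sketch moves a single point along its incoming segment until the segment length matches the nearest buildable length. The list of allowed lengths, the shift limit and the name `snap_to_pattern` are assumptions made for the example; the actual refinement has to treat all points of a path together, because moving one point also changes the following segment.

```python
import math


def snap_to_pattern(prev_point, point, allowed_lengths, max_shift=1.0):
    """Shift point along the segment from prev_point so its length is buildable.

    Returns the shifted point, or None if the required shift exceeds max_shift.
    """
    current = math.sqrt(sum((point[i] - prev_point[i]) ** 2 for i in range(3)))
    target = min(allowed_lengths, key=lambda length: abs(length - current))
    if abs(target - current) > max_shift:
        return None
    scale = target / current
    return tuple(prev_point[i] + scale * (point[i] - prev_point[i]) for i in range(3))
```

Limiting the shift keeps the refined point close to a position that has already passed the clearance checks.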

Weighting of paths

Before the paths are translated into sequences and thus into expressible linkers, the value of each path needs to be determined. For this we have identified four contributions that determine how well a path will work to enhance the heat stability of a target protein by circularization. In the end all these contributions are summed to obtain one final value for the goodness of the connection. For each single contribution, lower weighting values always refer to a better path. All contributions need to be normalized by some property of the protein, so that each contribution on its own is independent of the target protein and generality is achieved. First the length of the linker is evaluated. We assumed that a shorter linker is normally better for restraining the ends, as a longer helix might give more flexibility to the ends. Because we want this value to be independent of the size of the protein, the length of the linker is normalized by the distance between the two termini. This ensures that the length contribution does not outweigh all other contributions to the weighting function merely because the two termini are far apart and a longer linker is inevitably needed. The next value is a measure for the goodness of the angles used in the linker. Each of our potential angle patterns is distributed around its mean with a certain standard deviation. The narrower this distribution is, the likelier it is that the pattern will in the end produce the assumed angle between the adjoining helices. Therefore path angles that fit the provided angle patterns perfectly get a lower weighting value. Then the distance from the protein is taken into account. Because the linkers should not disturb the protein in its normal environment, linkers that pass close to the protein's surface are considered better linkers. Of course no remaining linker is so close to the surface that it would interact with the protein too much. This value is normalized by the minimal distance a linker should keep from the surface. After this the regions a linker should avoid are evaluated. Each protein binds some ligands or substrates. The user can specify how big they are and where they attach. If a linker passes through a potential ligand, this value goes to infinity, so that the linker is discarded. Otherwise, the farther a linker is from a potential ligand, the better this value gets. The user can also specify the importance of certain regions. In the end the total weighting is independent of the number of binding sites.

Calibrating the weighting function

???keep?### Every contribution has its own distribution. An example is shown in figure ### figure histogram_length_lys.png, but all of them have different shapes. The aim is to find the paths that minimize all of these contributions. Therefore the four contributions mentioned above are combined linearly in the weighting function: \[ W(p) = \alpha L(p) + \beta A(p) + \gamma D(p) + \delta u(p) \] where W is the final weighting, p the path, L the length contribution, A the angle contribution, D the distance contribution and u the contribution from the forbidden regions. $\alpha, \beta, \gamma, \delta$ are the weighting constants that had to be found. The normalization of each contribution was chosen so that all of them are dimensionless and have reasonably similar values. The weighting constants were obtained from the linker-screening ###lin performed with lysozyme and the enzyme-modeling ###link. Please see below ###link for a detailed explanation of how the values were obtained.
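The linear combination can be written down directly. In the following sketch the default constants of 1.0 are placeholders and not the calibrated values; the function names are hypothetical, and the length contribution is computed as described above, as the path length divided by the distance between the two termini.

```python
import math


def path_length(path):
    """Sum of the lengths of all edges of a path given as a list of 3D points."""
    return sum(math.dist(path[i], path[i + 1]) for i in range(len(path) - 1))


def length_contribution(path):
    """L(p): path length normalized by the distance between the termini (dimensionless)."""
    return path_length(path) / math.dist(path[0], path[-1])


def total_weight(L, A, D, u, alpha=1.0, beta=1.0, gamma=1.0, delta=1.0):
    """W(p) = alpha*L + beta*A + gamma*D + delta*u; lower values mean better paths.

    alpha..delta default to placeholder values, not the calibrated constants.
    """
    if math.isinf(u):
        return math.inf   # path passes through a forbidden region and is discarded
    return alpha * L + beta * A + gamma * D + delta * u
```

The explicit check for an infinite u makes sure that a path through a forbidden region is rejected even if delta were set to zero.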

Translating paths to sequence

As mentioned before, the software is provided with two databases, one for the possible angle patterns and one for the helix patterns. The choice of the patterns was inspired by known crystal structures extracted from databases and described in different papers. A large in silico screening was then set up using the ###Link system to refine the preferences of the patterns. For the complete description of the search for suitable patterns, see the ###Link to patternspart page. All possible paths are now split at the angles and compared with the patterns in the databases. ###Figure needed to explain, how the path is translated### The most suitable patterns are identified and joined to build the sequence of the path. It is important to note that this is only possible because of the modularity of the linker patterns used as building blocks: each block, whether an alpha helix or an angle pattern, is not affected by the others. Thus one sequence is produced for each possible path.

Clustering of paths

Many different paths are represented by the same sequence ###Figure that shows, different paths have same properties, already before in the text ### and we therefore clustered such paths. The weights of the clustered paths were then calculated by averaging the weights of the individual paths that make up a cluster.
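The clustering amounts to grouping by sequence and averaging, as in this minimal sketch. It assumes that every path has already been translated and weighted, so the input is a list of (sequence, weight) pairs; the function names are hypothetical.

```python
from collections import defaultdict


def cluster_by_sequence(weighted_paths):
    """Average the weights of all paths that translate to the same sequence."""
    clusters = defaultdict(list)
    for sequence, weight in weighted_paths:
        clusters[sequence].append(weight)
    return {sequence: sum(weights) / len(weights) for sequence, weights in clusters.items()}


def rank_linkers(weighted_paths):
    """Linkers sorted from best (lowest average weight) to worst."""
    return sorted(cluster_by_sequence(weighted_paths).items(), key=lambda item: item[1])
```

Each linker sequence then appears exactly once in the ranking, with a weight that reflects all paths it can follow.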



A major motivation of our effort to design rigid linkers with angles was the circularization of the DNA methyltransferase DNMT1 (link). The truncated form used in our project is composed of 900 amino acids, and its N- and C-terminal ends are well separated. To circularize it, two linkers were designed: a flexible one made of glycine and serine, and a rigid one designed by the software. The rigid linkers for DNMT1 were obtained from an early version of the software. At that time the calculation took 11 days on a laptop computer with an Intel i5 processor and 8 GB of RAM, which shows the importance of a distributed computing system for large proteins. Since then the software has improved considerably, reducing the calculation time for DNMT1 to about 1 day.

      1. DNMT1 could be made heatstable, still missing###

From the linker-screening and the enzyme modeling we obtained the activities of the different linkers after heat shock. With these, the weighting function was calibrated; see table 1 for the results.

Table 1: Linkers and their amino acid sequences. Green: attachment sequences to prevent the flexible regions from being perturbed; Blue: angle; Purple: extein.
Linker | Amino acid sequence | Activity | Length contribution | Angle contribution | Binding site contribution | Distance from surface | Weighting value after calibration
Very good linkers
sgt2 | GGAEAAAKAAAHPEAAEAAAKRGTCWE | 0.7477 | 1.9205 | 6.7789 | 0.002259 | 10.525 | 114912
Average linkers
ord1 | GGAEAAAKEAAAKATGDLAAEAAAKAARGTCWE | 0.956 | 4.936 | 4.639 | 0.00055708 | 220.8 | 27985
ord3 | GGAEAAAKEAAAKASLPAAAEAAAKEAAAKRGTCWE | 1.390 | 4.949 | 7.116 | 0.000545 | 261.2 | 28557
Short linkers
sho1 | GGRGTCWE | 0.7087
flexible linker | GGSGGGSGRGKCWE | 0.6851
linear lysozyme | no linker | 0.7039

circ_lam_lys_nils.png comparison with nice_linker_lysozyme_flexible_ends.png

      1. About predictions of software###

In the end we obtained a ranking of the linkers tested in vitro in the linker-screening and chose the parameters $\alpha, \beta, \gamma, \delta$ of the weighting function such that the ranking from the software reproduced the ranking from the assays. The final values were: ...


Will always be refined with more data from i@h...

