=Results=

==DNMT1==

A major motivation for designing rigid linkers with angles was the circularization of the DNA methyltransferase Dnmt1 (link). The truncated form used in our project comprises 900 amino acids, and its N- and C-terminal extremities are well separated. To circularize it, two linkers were designed: a flexible one made of glycine and serine, and a rigid one designed by the software. The rigid linkers for DNMT1 were obtained from an early version of the software. At that time the calculation took 11 days on a laptop computer with an Intel i5 processor and 8 GB of RAM, which shows the importance of a distributed computing system for large proteins. Since then the software has improved considerably, reducing the calculation time for DNMT1 to about 1 day.

IMPORTANT!

paths -> geometrical

linkers -> helices...

sequences -> seq of linkers

insist on huge number of data produced

We definitely need to cite [0]!!!

=Abstract=

=Overview=

As already introduced [###link to circularization idea in the toolbox], artificially circularized proteins may gain some heat stability by restraining the C- and N-termini from moving around freely.
This circularization may be trivial when the protein termini are very close to each other, which seems to be reasonably common [[#References|[1]]]. However, if the ends are too far apart, a long linker is needed to connect them. This linker should not change the natural conformation of the protein, and it should constrain the relative position of the ends to restrict their degrees of freedom and thus stabilize the structure even when heated. In addition, these linkers should not affect any of the protein's functions. Consequently, it is important to prevent linkers from passing through the active site or from covering domains that bind other molecules, for example. Therefore one needs to be able to define the shape of possible linkers. This section describes the software we developed to design such linkers. We would like to stress that this work was made possible by the feedback between computer modeling and experimental work: we could first design linkers in silico, test them experimentally and use the results to further calibrate the software. To our knowledge, this is the first time that such an approach has been used to custom-design linkers to connect protein extremities.

In short, the software provides a weighted list of linkers to circularize any protein of interest with a known structure. These linkers are made of rigid alpha-helical segments connected at defined angles. In contrast to flexible linkers, these rigid linkers were expected to constrain the protein extremities and to confer better heat stability. To generate them, we first defined the geometrical paths, with segments and angles, that they should follow. The geometrical paths that are biologically feasible are afterwards translated into amino acid sequences. Both the compatibility of paths with possible structures and the translation were made possible by our modeling approaches.
The first approach consisted of a statistical analysis of more than 17000 known non-homologous structures containing alpha helices connected with angles. For the second approach, we modeled the conformation of linkers circularizing proteins of known structure and analyzed them for certain properties. This second approach was run for a large number of proteins thanks to our distributed computing system igemathome. The software provides different possible linkers with weights that rank the linkers according to their capacity to maintain protein activity at higher temperatures. They were generated in an extensive [[linker screening| ### ]] on the target protein lambda-lysozyme, using the first modeling approach.

The software was checked for running stability in a large test on [[iGEM@home]].

Furthermore, our software provided linkers for circularizing [[DNMT1| ###]], that could be made more heat-stable due to circularization???.

For detailed information on the implementation and the practical use of the software, please see the [[documentation software-docu]] page.

==What does the software do==

still missing [[figure 0, graph abstract]]

=Background=

What are linkers, why should ours work better?

Linkers, References to the modeling,

Classically, protein linkers were designed in three different manners. ###REFERENCE### The easiest way is to define the length that a linker should cover and then simply use a flexible glycine-serine peptide with the right number of amino acids to match this length. Glycine is used for flexibility, as it has no side chain and does not produce any steric hindrance, while serine is used for solubility, as it has a small polar side chain. This solubility is important, as the linkers should not pass through the hydrophobic core of the protein, but should be dissolved in the surrounding medium.
These flexible linkers were normally used for circularization but also for connecting different proteins, when what matters most is that the parts are connected, not how they are connected, or when the flexibility of the linker was required for specific applications.

A second strategy consists in using rigid helical linkers to keep proteins or protein domains at a certain distance from each other. This is especially important for signalling proteins and fluorescent proteins. ###TODO: Reference### One major property of alpha helices is that they always fold in a defined way with well-defined angles and lengths. There are also many different helical patterns that differ in stability and solubility. One big disadvantage of this strategy is that one can only build straight linkers with helices. So in the context of circularization, if the straight line connecting the protein extremities crosses the protein, this strategy is not an option.

The third option, which served as a basis for our approach, consists in designing custom-tailored linkers for each specific application. These linkers can be obtained from protein structure prediction. First, one needs to define the path that the linker should take to connect two amino acids. Afterwards, one designs a possible linker sequence that might fit well. Next, one makes a structure prediction of the linker attached to the proteins to validate the design. Several different linkers, with slight changes, can be compared. This is repeated several times until the linker effectively follows the expected path. ###TODO: Reference, WADE paper### This method is time-consuming, as it is not only computationally intensive but also requires strong knowledge of protein folding and protein structure prediction.
On the other hand, the benefit can be important, as the interaction of the linker with the protein's surface can be taken into account and as one can define the path taken by the linker accurately, down to the resolution of the protein structure.

We have set up a completely new strategy to design rigid linkers. As further detailed in the [[###link###|modelling]] part, it is possible to define the shape of a linker by combining rigid alpha-helical rods with well-defined angle patterns. Therefore, by defining in a geometrical way the possible paths of the circularizing linkers for a given protein, we can propose potential linkers. This definition of the geometrical path can be very difficult, especially for large proteins with complex shapes. Moreover, it is further constrained by the fact that linkers must avoid hiding active sites of the protein of interest. Finally, the paths have rotational degrees of freedom at the extremities of the protein, and depending on their orientation, they may or may not match the geometry of the protein. The tool we present here covers both steps: defining weighted geometrical paths and translating them into feasible linkers, also with weights. This tool is universal, as it can design circularizing linkers for any protein with a known structure. Moreover it is modular as, thanks to our modeling approach [link], we have designed linkers as exchangeable blocks of rods of different lengths and of angle patterns. The following sections detail the different steps followed by our software to design proper linkers.
But even taking all this into account, one could never also consider the other paths that the same kind of linker is able to take because of the rotational degrees of freedom inherent in the linker model ###figure needed###

Thus we decided to provide the scientific community with powerful open-source software that, for every protein with a given structure, can calculate the sequence of a rigid linker needed to circularize the protein with minimal inhibition of the protein's function.

Until now each scientist had to estimate the length of the linker themselves, ###check for flexlinker### so our software is a completely novel approach to circularization.

=General procedure=

Our general approach: rods and angles are chosen in a discrete manner, so there are only finitely many possibilities to connect the ends, and all of them are checked.

I think this paragraph is just too much, as anyway there was already an overview before.

At first the protein structure is analysed in a geometrical way: paths composed of no more than four straight segments connected by angles are computationally generated. A path is always represented by straight lines and connecting angles between them. ###In the end all these paths should be sorted by how well it would be, if we circularize the protein using this path.### But as the final weighting is quite computationally intensive, the paths first need to be filtered. A path is discarded only if it breaches any rule for linkers. For example, paths should never pass through the protein. After all the paths have been generated, they are refined by shifting the angle points according to the underlying linker model, so that no paths are taken into account that could not be built with our building blocks.

After all paths have been generated, they are weighted by several factors. These are then combined into one weighting per path, which reflects how good that path is.
Then the path is translated, using our amino acid patterns, into an amino acid sequence. But as one sequence can follow more than one path, all the paths that this sequence can form are clustered. In the end the average weighting of each path cluster is calculated, so that every linker receives only one weighting value that includes the contributions of all paths it can possibly take.

==PDB analysis==

First, the PDB file containing the structure of the target protein is parsed and the coordinates of the atoms are stored, in metric units. After this, some initial tests are made on the protein structure. First, we checked whether the C- and N-termini lie on the surface of the protein and are accessible to the solvent, which is crucial for circularization. We defined a line originating from an extremity of the protein by the two angles of spherical coordinates around the z-axis. From that, we could determine the accessible angles by rejecting all the lines that are too close to the protein. As the future linker will be made of alpha helices and will therefore have a radius of 5 Å, we used this length as the minimal allowed distance.

Those allowed angles are stored for the coming linker generation. ###fig needed###

==Generation of geometric paths==

As our strategy consists in building linkers from helical rods and connecting angles, a path is completely defined by the coordinates of the angle points. Advancing one step from an existing point is always done by adding a displacement vector to this point. This vector is defined by the two spherical angles, chosen here in a discrete manner with an increment of 5 degrees, and by a length, also chosen in a discrete manner. This discrete length was used in two different contexts: it may correspond to the length of an alpha helix or to the length of the flexible part that appears at the extremity of the protein.
The coordinates reached with this vector define the new angle point, no matter whether the vector corresponds to an alpha helix or to a flexible part. ###figure, points + vectors###. As we screen all possible angles in a discrete manner, the coordinates of those angle points are regularly distributed on a sphere. As further detailed in the next sections, those spheres are defined from both ends, either once or in several steps. Then the software checks for possible straight connections of given lengths for each pair of angle points originating from the two extremities. ###figure needed, with all these possible steps, one sphere around each point and checking for the connections, 2D, different lengths in two different figures, ###

The linkers are built in a modular way, with blocks of well-defined size. From the modeling of potential linkers [link], we could derive 8 different alpha-helical rods, all with different lengths. In addition, the length of the two segments inside an angle block was always 8 Å, so exchanging angle blocks does not affect the length of the linker. This means that the distance between the angle points is well defined, an essential aspect of our strategy of linker design.

The software proceeds in three steps. First, it checks for the possibility of a direct single alpha helix linker. For this, it applies the procedure just mentioned with spheres whose radius reasonably corresponds to the length of the short parts at the extremity of the protein. Second, it tests whether a linker containing two alpha helices connected with a right angle allows the circularization. Finally, it searches for possible linkers with three angle points. The next parts explain those three steps in detail.

This method was chosen because it could be implemented easily and efficiently in our program. However, this strategy generated paths that crossed the protein. Therefore we put great effort into sorting out the paths.
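The displacement step described above (two spherical angles in 5-degree increments plus a discrete length) can be illustrated with a minimal Python sketch. The function name `sphere_points` and the handling of the poles are assumptions made for illustration; they are not taken from our software.

```python
import math


def sphere_points(origin, radius, step_deg=5):
    """Return points at distance `radius` from `origin` on a regular angular grid.

    The polar angle runs from 0 to 180 degrees and the azimuth from 0 to
    360 degrees (exclusive), both in steps of `step_deg`.
    """
    ox, oy, oz = origin
    points = []
    for theta_deg in range(0, 181, step_deg):
        theta = math.radians(theta_deg)
        # At the two poles every azimuth gives the same point, so emit it once.
        phi_range = [0] if theta_deg in (0, 180) else range(0, 360, step_deg)
        for phi_deg in phi_range:
            phi = math.radians(phi_deg)
            points.append((
                ox + radius * math.sin(theta) * math.cos(phi),
                oy + radius * math.sin(theta) * math.sin(phi),
                oz + radius * math.cos(theta),
            ))
    return points


# Example: candidate angle points 20 Å away from a terminus placed at the origin.
candidates = sphere_points((0.0, 0.0, 0.0), 20.0)
```

Such a grid is regular in the two angles rather than uniform in area, so points are denser near the poles. Calling the function once per allowed length gives the concentric spheres of candidate angle points mentioned in the text.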
In the [[###link###|modelling]] part we have already described the patterns we use to build our linkers. The modularity was crucial for the success of the software. Therefore the angle patterns were always chosen so that the end of the angle pattern was 8 Å away from the turning point ###figure needed###. Thus a displacement only needs a certain length, which depends neither on the direction from which the linker arrives nor on the direction in which it continues. This modularity makes the calculations more efficient than generating points randomly.

We still have to rephrase that.

===Step 1===

As a simple rigid linker with no angle would be easier to design and likely more thermostable than the ones containing angles, the software first checks if this simple solution is possible ###figure needed###. For this, we took into account the fact that proteins have some flexible amino acids at their extremities. This flexible part may come from the protein itself, but also from the 2 glycines that are included at the N-terminal part and from the extein at the C-terminal part. These two latter parts come from our linkers. They have no preferential angles and offer a large number of possibilities to insert fitting linkers. But this flexibility is also a drawback, as we have to include this large number of possible angles and lengths in our path search.

In this first step, the software explicitly takes these flexible parts into account to check for the possibility of straight linkers. As the angles and the length of the flexible parts are variable, the software positions their extremity on a sphere centered on the last fixed position of the structure, as explained above. The radius of this sphere is incremented in a discrete manner, in 4 steps, from 5.25 Å to the maximum length of the flexible part.
###figure###

Then all possible straight segments between the first points and the last points are tested. If they are closer than 5 Å to the protein, or if they cross it, they are rejected. If they are kept, the software checks whether the length of the segment is compatible with the feasible alpha helices: if the length of a given segment equals one of the 8 alpha helix lengths plus or minus 0.75 Å, the path is saved.

"Therefore here another function for calculating the distance from the protein to the connection is used, that is more time-consuming, but also more accurate, than the ones used normally." ??

"This is done with a higher accuracy, because none of these linkers should be lost by error". ??

For the generation of linkers that take the flexibility of the ends into account, two functions have been included so far.

The only variability we have in the custom-tailored linker here is the length of the helix. Most likely no suitable linker can be found here, because if there is some obstacle between the ends, the linker cannot bend around it. But if such a linker is possible, it should be found, because it will be one of the best linkers predicted.

So for all accessible angles that were calculated before, we calculate all points from the N-terminus and from the C-terminus that lie at certain distances from the terminus. The points are spread over varying distances from ; to the .

===Step 2===

The next possibility for designing more complex rigid linkers, while still taking flexible ends into account within a reasonable calculation time, was to reduce the number of possible angles. As we originally thought that a 90° angle would be practically feasible, the software was designed to generate linkers with flexible ends and one 90° angle. ###figure### This choice was notably made because the edge lengths of a right triangle are simple to calculate.
We already saw in Step 1 that the length of an edge can only take 8 different values. As the linkers have to start from the extremities of the protein, and as we impose a right angle, the number of possible paths is low, making them easy to compute. In practice, the extremities of the protein are positioned in a flexible way as in Step 1. From each of the positions allowed by this flexibility, the software searches for all the allowed right triangles.

some words on degree of freedom...

###newly written###

Again, we only have discrete possibilities to build right triangles with our helical patterns. Therefore, all combinations of two helical patterns are first searched that could form a right triangle whose hypotenuse has the length of the distance between the termini.

Now we shift back to 3D and apply Thales's theorem, which says that when A, B and C lie on a circle and the line between A and C is a diameter of the circle, the angle at B is a right angle. ### fig, thales### Thus in 3D we can discretize the positions where the angle point (B) can lie relative to the starting point (A). This set is cross-checked against the set of possible right triangles from before, so that we only keep paths that can be built with our patterns.

As in Step 1, the software creates spheres of points around the start (first points) and around the end (last points). Then for each point from the first points

===Step 3===

Finally, the software can also find paths with up to 4 edges, meaning 4 alpha helices and 3 angles. Thanks to the modularity of the possible linkers, such paths can theoretically circularize any kind of protein. ###figure of torus###

To keep the calculation feasible in a reasonable time, we designed the search strategy so that the flexible parts at the extremities are oriented in the same direction as the adjacent alpha helix.
This obviously restricts the search, but as these orientations are allowed for the flexible part, the approach remains valid.

First, potential end points of the first alpha-helical rod are calculated from the N-terminal point of the protein. The orientation is chosen in a discrete manner, with an increment of 5 degrees for the two angles of the spherical coordinates. The distance from the origin corresponds to the 8 possible lengths allowed by the alpha helices, as already seen in Step 2, plus a length of 0, which mimics a linker with 3 instead of 4 edges. The exact same procedure is repeated to define all the potential end points of the second alpha-helical rod, starting from all the possible end points of the first rod. Thanks to the possibility of a length of 0 for the first and the second rods, the software also calculates paths with 2 edges. Then the same is done only once from the C-terminal point of the protein, defining 1 edge. The final step consists in checking whether the points originating from the N- and C-terminal points can be linked by a potential alpha helix, i.e. whether they are separated by the appropriate distance. If any of the potential alpha helix lengths lies within the distance between two points plus or minus 0.75 Å, the path is saved. In the same way, if two points are closer than 0.75 Å to each other, the path is also saved.

one starts from omega, 2 from alpha

Thus even one of the worst shapes for circularization, a torus with the two termini in the pits, could be circularized with our linker system without needing infinitely long straight segments. On the other hand, this was the maximum number of possibilities that was still feasible to calculate.

Here the flexible parts of the protein are assumed to point in the same direction as the following helix. In this way the number of possibilities is kept fixed, but of course this is quite a rough estimate.
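The closing check of Step 3 (two points are connected if their distance matches a helix length within 0.75 Å, or if they nearly coincide) can be sketched as follows. The eight rod lengths in `HELIX_LENGTHS` are placeholder values chosen only for illustration, since the real lengths come from the linker modeling; the function name `matching_helix` is likewise hypothetical.

```python
import math

# Placeholder rod lengths in Å (assumption); the real eight values are
# derived from the linker modeling and are not reproduced here.
HELIX_LENGTHS = (10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0, 45.0)
TOLERANCE = 0.75  # Å, tolerance stated in the text


def matching_helix(p, q, lengths=HELIX_LENGTHS, tol=TOLERANCE):
    """Return the rod length that can connect points p and q, or None.

    A return value of 0.0 means the two points are closer than `tol`,
    so no closing rod is needed.
    """
    d = math.dist(p, q)
    if d < tol:
        return 0.0
    fitting = [length for length in lengths if abs(length - d) <= tol]
    if not fitting:
        return None
    # If several rods fit, prefer the one closest to the measured distance.
    return min(fitting, key=lambda length: abs(length - d))
```

With the placeholder lengths, two points 20.5 Å apart would be joined by the 20 Å rod, while two points 22.5 Å apart would be rejected.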
Let's check the next paragraph together in detail

At first, points at the right distances from each end are generated. ###figure needed, explains point names### Now for each of the first points, next points are generated in all directions with an angle difference of 5 degrees. Of these, all points that don't fit are immediately sorted out. Then all connections from the second points to the last points are checked to see whether they have the right length. After this the normal sorting steps are applied to these new connections.

==Sorting out of paths==

The previous part described the generation of paths that can connect the two extremities of the protein irrespective of the position of these paths relative to the protein. While this allows fast computation of the geometrical paths, it also implies that the paths that are not practically feasible need to be sorted out. This is the most time-consuming part of the computation, as about 1 billion paths are generated. Three criteria are considered for the sorting. The first one is the feasibility of the linker: can the software find angle patterns that correspond to those defined by the geometrical path? This question was part of the motivation for a large modeling effort (link) to determine the possible angles between consecutive helices. This was achieved by analyzing the distribution of angles between alpha helices found in the ArchDB database (link). As nearly any angle between 20 and 170 degrees could be found, only a few paths were actually rejected at that step. The next criterion was the position of the angle points: if they lie inside the protein, the path is rejected. Finally, the software checks whether any atom of the protein is less than 5 Å away from any of the alpha helices; if so, the path is also rejected.
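The last sorting criterion, the 5 Å clearance between every helix and every atom, comes down to a point-to-segment distance test. The following Python sketch shows one way to write it; the function names are hypothetical, the brute-force loop over all atoms is an assumption, and the sketch ignores the fact that the ends of a real path necessarily start next to the protein termini.

```python
import math


def point_segment_distance(p, a, b):
    """Shortest distance between point p and the segment from a to b (3D)."""
    ab = [b[i] - a[i] for i in range(3)]
    ap = [p[i] - a[i] for i in range(3)]
    ab2 = sum(c * c for c in ab)
    if ab2 == 0.0:
        # Degenerate segment (zero-length rod): distance to the single point.
        return math.dist(p, a)
    # Project p onto the line through a and b, clamped to the segment.
    t = max(0.0, min(1.0, sum(ap[i] * ab[i] for i in range(3)) / ab2))
    closest = [a[i] + t * ab[i] for i in range(3)]
    return math.dist(p, closest)


def path_is_clear(angle_points, atoms, min_distance=5.0):
    """Return True if no atom is closer than `min_distance` to any segment."""
    segments = list(zip(angle_points, angle_points[1:]))
    return all(
        point_segment_distance(atom, a, b) >= min_distance
        for a, b in segments
        for atom in atoms
    )
```

Testing every atom against every segment is the simplest possible formulation; with about 1 billion candidate paths, a spatial grid or a similar acceleration structure would likely be needed in practice.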
==Shifting paths to the patterns==

The strategy described in Step 3 gives a certain freedom to the rod that connects the last two angle points generated from the N- and C-terminal points. As this freedom is actually not permitted by the alpha helix and the angle pattern, but is permitted by the flexible part, for example at the C-terminal end, the software slightly refines the path by rotating the segment that originates from the C-terminal point.

In particular, because of the discretization, the generated paths may not fit the helical and angle patterns perfectly. Therefore, before translating them into amino acid sequences, the paths need to be refined.

Every path that has survived is analyzed step by step, starting from the starting point and advancing one angle point at a time. For each step from point to point, the length of the step is calculated and compared to the lengths we can build with the helix patterns. If the step is too long, the next point is shifted towards the previous one until it fits. If it is too short, it is shifted away. These shifts do not exceed a certain length, so that the paths don't shift too much and suddenly pass through restricted areas.

This is also done before the weighting, so that the paths don't change anymore after weighting.

==Weighting of paths==

Before translating the paths into sequences and thus into linkers that can be expressed, each path needs to be evaluated for its potential capacity to enhance heat stability. For this we have identified different contributions that are combined into one value that defines how good the linker may be and consequently its ranking among all possible linkers. The smaller this value, the more we expect the linker to enhance thermostability. An important step for all these contributions is the normalization, as explained in the next paragraphs.
The first contribution we considered was the linker length. We assumed that a short linker is better at constraining the protein extremities, and that a long helix might give more flexibility. Because we wanted this value to be independent of the size of the protein, the length of the linker is normalized to the distance between the two termini.

The second contribution relates to the angles used in the linker. We learned from the [[modeling ###link###]] that angles formed by a certain angle pattern follow a certain distribution. First, we assumed that the narrower the distribution, the more likely the alpha helices would actually produce this angle. Second, the angles found by the software should be as close as possible to those well-defined angles. In this case, the weight value from this contribution should be low.

Then the distance of the linker to the protein is taken into account. Because the linkers should not disturb the protein in its normal environment, linkers that pass close to the protein surface are considered better linkers. The distance was defined as the minimal distance between the linker and all the atoms of the protein. As already mentioned for the sorting of the paths, a linker cannot come closer than 5 Å, and this distance was used to normalize the calculated distances.

As linkers should also not be too close to the protein surface, this value is normalized with the minimal distance a linker should have from the surface. The distance is calculated as the minimal distance the atoms of the protein have from the connection.

After this, the places a linker should avoid are calculated. Each protein can interact with other molecules on some parts of its surface. The user can specify where and how big those parts are. If a linker passes in front of a potential molecule-binding domain, the value of the corresponding path goes to infinity, so that the linker is discarded.
Conversely, the farther a linker is from a potential ligand-binding domain, the smaller its weighting value. The user can also specify the importance of certain regions. In the end the total weighting is normalized to the number of binding domains.

===Calibrating the weighting function===

???keep?###

Every contribution has its own distribution. You can see an example in figure ### [[figure histogram_length_lys.png]], but all of them have different shapes. The aim is to find the paths that minimize all of these contributions. Therefore, in the weighting function the four contributions mentioned above are combined linearly:

$W(p) = \alpha L(p) + \beta A(p) + \gamma D(p) + \delta u(p)$

where W is the final weighting, p the path, L the length contribution, A the angle contribution, D the distance contribution and u the contribution from the forbidden regions. $\alpha, \beta, \gamma, \delta$ are the weighting constants that needed to be found. The normalization performed for each contribution was chosen so that each of them is dimensionless and all have reasonably similar values.

The weighting constants were obtained from the [[linker-screening ###lin]] performed with lysozyme and the [[enzyme-modeling ###link]]

Please see [[below ###link]] for a detailed explanation of how the values were obtained.

==Translating paths to sequence==

As mentioned before, the software is provided with two databases, one for the possible angle patterns and one for the helix patterns. The choice of the patterns was inspired by known crystal structures extracted from databases and described in different papers.

A large in silico screening to refine the preferences of the patterns was then set up using the [[iGEM@home|###Link]] system. For the complete description of the search for suitable patterns, see the [[modeling|###Link to patternspart]] page.
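Before the patterns are matched, it may help to see the linear weighting function from the calibration subsection written out. This is a minimal Python sketch: the default constants of 1.0 and the example contribution values are assumptions made for illustration, not the calibrated values, and the function names are hypothetical.

```python
import math


def path_weight(length, angle, distance, forbidden,
                alpha=1.0, beta=1.0, gamma=1.0, delta=1.0):
    """W(p) = alpha*L(p) + beta*A(p) + gamma*D(p) + delta*u(p)."""
    return alpha * length + beta * angle + gamma * distance + delta * forbidden


def rank_paths(contributions, **constants):
    """Return path names sorted by increasing weight, dropping infinite weights."""
    weights = {
        name: path_weight(*values, **constants)
        for name, values in contributions.items()
    }
    finite = {name: w for name, w in weights.items() if math.isfinite(w)}
    return sorted(finite, key=finite.get)


# Illustrative, made-up normalized contributions (L, A, D, u) for three paths.
example = {
    "a": (1.2, 0.3, 1.0, 0.1),
    "b": (1.0, 0.2, 1.0, math.inf),  # passes in front of a binding region
    "c": (1.0, 0.2, 1.0, 0.1),
}
ranking = rank_paths(example)
```

Because a path that passes in front of a binding region receives an infinite forbidden-region contribution, it drops out of the ranking automatically, provided that delta is strictly positive.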
- All  the possible paths are now split up at the angles and compared with the  possible patterns in the databases. ###Figure needed to explain, how  the path is translated### The most suitable patterns are identified and  added together to build the paths sequence. It is important to notice that this is only possible because of the modularity of our linker patterns used as building blocks: each block, being an alpha helix or an angle pattern, is not affected by the other. Thus for each possible path, one sequence is produced. - ==Clustering of paths== - Many  different paths are represented by the same sequence [[###Figure that  shows, different paths have same properties, already before in the text  ###]] and we therefore clustered such paths. The weigths for those clustered paths were then calculated by averaging the weights of the different paths that compose a cluster. =Results= =Results= ==DNMT1== ==DNMT1==

# Abstract

As already introduced, artificially circularized proteins may gain some heat stability because their C- and N-termini are restrained from moving around freely. Circularization may be trivial when the protein termini are very close to each other, which seems to be reasonably common [1]. However, if the ends are too far from each other, a long linker is needed to connect them. This linker should not change the natural conformation of the protein, and it should constrain the relative position of the ends to restrict the degrees of freedom and thus stabilize the structure even when heated. In addition, these linkers should not affect any of the protein functions. Consequently, it is important to prevent linkers from passing through the active site or from covering binding domains for other molecules, for example. Therefore one needs to be able to define the shape of possible linkers. This section describes the software we developed to design such linkers. We would like to stress that this work was made possible by the feedback between computer modeling and experimental work: we could first design linkers in silico, test them experimentally and use the results to further calibrate the software. To our knowledge, this is the first time that such an approach has been used to custom-design rigid linkers with angles to connect protein extremities.

# Background

Classically, protein linkers have been designed in three different ways. The easiest way is to define the length that a linker should cover and then simply use a flexible glycine-serine peptide with the right number of amino acids to match this length. Glycine is used for flexibility, as it has no side chain and does not produce any steric hindrance, while serine is used for solubility, as it has a small polar side chain. This solubility is important, as the linkers should not pass through the hydrophobic core of the protein, but should be dissolved in the surrounding medium. These flexible linkers are normally used for circularization but also for connecting different proteins, when the main goal is that the different parts are connected, regardless of how they are connected, or when the flexibility of the linker is required for specific applications.

A second strategy consists in using rigid helical linkers to keep proteins or protein domains at a certain distance from each other. This is especially important for signalling proteins and fluorescent proteins. One major property of alpha helices is that they always fold in a defined way with well-defined angles and lengths. There are also many different helical patterns that differ in stability and solubility. Although helices have been used to design circularizing linkers [2], one big disadvantage of this strategy is that one can only build straight linkers with them. So in the context of circularization, if the straight line that connects the protein extremities crosses the protein, this strategy is not an option.

The third option, which served as a basis for our approach and which came from discussions with the group of Rebecca Wade in Heidelberg, Germany, consists in designing custom-tailored linkers for each specific application. These linkers can be obtained from protein structure prediction. First, one needs to define the path that the linker should take to connect the protein ends. Afterwards, one designs a possible linker sequence that might fit well. Next, one makes a structure prediction of the linker attached to the protein to validate the design. Several different linkers, with slight changes, can be compared. This is repeated several times until the linker effectively follows the expected path. This method requires a strong knowledge of protein folding and protein structure prediction and is computationally intensive. On the other hand, the benefit can be substantial, as the interaction of the linker with the protein surface can be taken into account and as one can accurately define the path taken by the linker down to the resolution of the protein structure.

We have set up a completely new strategy to design rigid linkers. As further detailed in the modeling part, it is possible to define the shape of a linker by combining rigid alpha helical rods with well-defined angle patterns. Therefore, by defining, in a geometrical way, the possible paths of the circularizing linkers for a given protein, we can then propose potential linkers. This definition of the geometrical path can be very difficult, especially for large proteins with complex shapes. Moreover, this definition is further constrained by the fact that linkers must avoid hiding active sites of the protein of interest. Finally, the paths have rotational degrees of freedom at the extremities of the protein, and depending on their orientation, they may or may not match the geometry of the protein. The tool we present here covers the two steps: defining geometrical paths with some weights and translating them into feasible linkers, also with weights. This tool is universal, as it has the capacity to design circularizing linkers for any protein with a known structure. Moreover, it is modular because, thanks to our modeling approach, we have designed linkers as exchangeable blocks of rods of different lengths and of angle patterns. The following sections detail the different steps followed by our software to design proper linkers.

## PDB parsing

First, the PDB file containing the structure of the target protein is parsed and the coordinates of the atoms are stored in metric units. After this, some initial tests are made on the protein structure. The software checks whether the C- and N-termini lie on the surface of the protein and are accessible to the solvent, which is crucial for circularization. To do so, we define a line originating from an extremity of the protein by the two angles of the spherical coordinates around the z-axis. From that, the accessible angles are determined by rejecting all the lines that are too close to the protein. As the future linker will be made of alpha helices and will therefore have a radius of 5 Å, we used this length as the minimal allowed distance. Those allowed angles are stored for the coming linker generation. ###fig needed###
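The accessibility test described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code of our software: the function names `parse_atoms`, `direction` and `ray_is_clear` are hypothetical, and so is the `ignore_radius` parameter, which skips the atoms directly around the terminus (otherwise every direction would be rejected by the terminus itself). Only the 5 Å minimal distance comes from the text.

```python
import math

def parse_atoms(pdb_text):
    """Return (x, y, z) tuples in Å for every ATOM record of a PDB file."""
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            # Fixed-width PDB columns 31-38, 39-46 and 47-54 hold x, y, z.
            atoms.append((float(line[30:38]),
                          float(line[38:46]),
                          float(line[46:54])))
    return atoms

def direction(theta, phi):
    """Unit vector for polar angle theta (from the z-axis) and azimuth phi, in radians."""
    return (math.sin(theta) * math.cos(phi),
            math.sin(theta) * math.sin(phi),
            math.cos(theta))

def ray_is_clear(origin, d, atoms, min_dist=5.0, ignore_radius=6.0):
    """True if no atom is closer than min_dist to the half-line origin + t*d, t >= 0.

    Assumption: atoms within ignore_radius of the origin are skipped, because
    the terminus and its direct neighbours would reject every direction.
    """
    for atom in atoms:
        v = [atom[i] - origin[i] for i in range(3)]
        if math.sqrt(sum(c * c for c in v)) < ignore_radius:
            continue
        # Projection of the atom onto the ray, clamped to the ray's start.
        t = max(0.0, sum(v[i] * d[i] for i in range(3)))
        gap = [v[i] - t * d[i] for i in range(3)]
        if math.sqrt(sum(c * c for c in gap)) < min_dist:
            return False
    return True
```

Looping `ray_is_clear` over a grid of `(theta, phi)` values would then give the list of allowed angles for a terminus.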

## Generation of geometric paths

### Step 1

As a simple rigid linker with no angle would be easier to design and likely more thermostable than the ones containing angles, the software first checks whether this simple solution is possible ###figure needed###. For this, we took into account the fact that proteins have some flexible amino acids at their extremities. This flexible part may come from the protein itself, but also from the 2 glycines that are included at the N-terminal part and from the extein at the C-terminal part. These two latter parts come from our linkers. Flexible parts have no preferential angles and offer a large number of possibilities to insert fitting linkers. But this flexibility is also a drawback, as we have to include this large number of possible angles and lengths in our path search. In this first step, the software explicitly takes these flexible parts into account to check for the possibility of straight linkers. As the angles and the length of the flexible parts are variable, the software positions their extremity on a sphere centered on the last fixed position of the structure, as explained above. The radius of this sphere is incremented in a discrete manner, in 4 steps, from 5.25 Å to the maximum length of the flexible part. ###figure### Then all possible straight segments between the points on the two spheres are tested. If they are closer than 5 Å to the protein, or if they cross it, they are rejected. If they are kept, the software checks whether the length of the segment is compatible with the feasible alpha helices: if the length of a given segment equals one of the 8 alpha helix lengths plus or minus 0.75 Å, the path is saved.
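The two numerical ingredients of Step 1 can be illustrated as follows. This is a minimal sketch, not our implementation: the names `sphere_radii` and `matching_helix` are hypothetical, "in 4 steps" is interpreted here as 4 evenly spaced radii (an assumption), and the 8 actual helix lengths are not listed in this section, so they are passed in as a parameter.

```python
import math

def sphere_radii(max_flexible_length, start=5.25, steps=4):
    """Radii (Å) of the spheres on which the flexible end is placed.

    Assumption: 4 evenly spaced values from 5.25 Å to the maximum length.
    """
    if max_flexible_length <= start:
        return [start]
    step = (max_flexible_length - start) / (steps - 1)
    return [start + i * step for i in range(steps)]

def matching_helix(p, q, helix_lengths, tol=0.75):
    """Return the first helix length within tol (Å) of the distance |pq|, else None."""
    length = math.dist(p, q)
    for h in helix_lengths:
        if abs(length - h) <= tol:
            return h
    return None
```

A straight path between two sphere points `p` and `q` would be saved only if `matching_helix(p, q, ...)` returns a length and the segment also passes the 5 Å clearance test.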

### Step 2

The next possibility to design more complex rigid linkers, while still taking the flexible ends into account within a reasonable calculation time, was to reduce the number of possible angles. As we originally thought that a 90° angle would be practically feasible, the software was designed to generate linkers with flexible ends and one 90° angle. ###figure, protein_one_90° angle### This choice was notably made because the edge lengths of a right triangle are simple to calculate. We already saw in Step 1 that the length of an edge can only take 8 different values. As the linkers have to start from the extremities of the protein, and as we impose a right angle, the number of possible paths is low, making them easy to compute. In practice, the extremities of the protein are positioned in a flexible way as in Step 1. From each of the positions allowed by this flexibility, the software searches for all the allowed right triangles. The restriction to right angles was mainly made because the degrees of freedom needed to be limited to keep the calculations feasible.
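Why the right angle keeps the search small can be seen from the Pythagorean relation: for two end positions at distance $d$, only pairs of helix lengths $(a, b)$ with $a^2 + b^2 \approx d^2$ can form the two edges. The sketch below enumerates such pairs. The function name is hypothetical, and reusing the 0.75 Å tolerance of Step 1 is an assumption, as no tolerance is stated for Step 2.

```python
import math

def right_angle_leg_pairs(p, q, helix_lengths, tol=0.75):
    """Pairs (a, b) of helix lengths that can join p and q with one 90° angle.

    The hypotenuse of the right triangle is the segment pq, so a pair fits
    when sqrt(a^2 + b^2) matches |pq| within tol (assumed tolerance, in Å).
    """
    d = math.dist(p, q)
    pairs = []
    for a in helix_lengths:
        for b in helix_lengths:
            if abs(math.hypot(a, b) - d) <= tol:
                pairs.append((a, b))
    return pairs
```

With 8 helix lengths there are at most 64 pairs to test per pair of end positions. For each fitting pair, the corner point still has to be placed on the circle of points that see `p` and `q` under a right angle, and the resulting edges have to pass the clearance test.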

### Step 3

Finally, the software also provides the possibility to find paths with up to 4 edges, meaning 4 alpha helices and 3 angles. Thanks to the modularity of the possible linkers, such paths can theoretically circularize any kind of protein. ###figure of torus### To keep the calculation time reasonable, we designed the search strategy so that the flexible parts at the extremities are oriented in the same direction as the adjacent alpha helix. This obviously restricts the search, but as these orientations are allowed for the flexible parts, the approach remains correct. First, potential ending points of the first alpha helical rod are calculated from the N-terminal point of the protein. The orientation is chosen in a discrete manner, with an increment of 5 degrees for the two angles of the spherical coordinates. The distance from the origin corresponds to the 8 possible lengths allowed by the alpha helices, as already seen in Step 2, plus a length of 0, which mimics a linker with 3 instead of 4 edges. The exact same procedure is repeated to define all the potential ending points of the second alpha helical rod, starting from all the possible ending points of the first alpha helical rod. Thanks to the possibility of a length of 0 for the first and the second rods, the software also calculates paths with 2 edges. Then, the same is done only once from the C-terminal point of the protein, defining 1 edge. The final step consists in checking whether the points originating from the N- and C-terminal points can be linked by a potential alpha helix, i.e. whether they are separated by the appropriate distance. If the distance between two points matches any of the potential alpha helix lengths plus or minus 0.75 Å, the path is saved. In the same way, if two points are closer than 0.75 Å to each other, the path is also saved.
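The discrete enumeration of rod end points and the final closing test can be sketched as below. This is an illustrative sketch under assumptions: the function names `rod_endpoints` and `can_close` are hypothetical, and the helix lengths are passed in because their actual values are not listed here. The 5 degree increment, the extra length of 0 and the 0.75 Å tolerance are taken from the text.

```python
import math

def rod_endpoints(origin, helix_lengths, step_deg=5, allow_zero=True):
    """Candidate end points of one rod starting at origin.

    Directions are sampled every step_deg degrees in both spherical angles.
    A rod of length 0 (the origin itself) is added once if allow_zero is set.
    Note: at the poles (theta = 0 or 180) all azimuths give the same point,
    so the list contains duplicates there.
    """
    points = [tuple(origin)] if allow_zero else []
    for length in helix_lengths:
        for theta_deg in range(0, 181, step_deg):
            for phi_deg in range(0, 360, step_deg):
                theta = math.radians(theta_deg)
                phi = math.radians(phi_deg)
                points.append((
                    origin[0] + length * math.sin(theta) * math.cos(phi),
                    origin[1] + length * math.sin(theta) * math.sin(phi),
                    origin[2] + length * math.cos(theta),
                ))
    return points

def can_close(p, q, helix_lengths, tol=0.75):
    """True if p and q coincide within tol or are one helix length apart (± tol)."""
    gap = math.dist(p, q)
    return gap < tol or any(abs(gap - h) <= tol for h in helix_lengths)
```

Each length contributes 37 × 72 = 2664 sampled directions, so two nested rods from the N-terminal side already give several hundred million combinations for 8 lengths, which is consistent with the large number of paths that must be sorted afterwards.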

## Sorting out of paths

The previous part described the generation of paths that can connect the two extremities of the protein irrespective of the position of these paths relative to the protein. While this allows a fast computation of the geometrical paths, it also implies that the paths that are not practically feasible need to be sorted out. This is the most time-consuming part of the computation, as about 1 billion paths are generated. Three criteria are considered for the sorting. The first one is the feasibility of the linker: can the software find angle patterns that correspond to the ones defined by the geometrical path? This question was part of the motivation for a large modeling effort (link) to determine the possible angles between consecutive alpha helices. This was achieved by analyzing the distribution of angles between alpha helices found in the ArchDB database. As nearly any angle between 20 and 170 degrees could be found, only few paths were actually rejected at that step. The next criterion is the position of the angle points: if they lie inside the protein, the path is rejected. Finally, the software checks whether any atom of the protein is less than 5 Å away from any of the alpha helices; if so, the path is also rejected.
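Two of the three criteria are purely geometric and can be sketched as follows. This is an illustration only: the function names are hypothetical, the angle is taken here as the angle at a vertex between the two edges meeting there (an assumption about the convention), and `atoms` is assumed not to contain the atoms directly at the anchoring points of the path. The test for angle points lying inside the protein is not shown.

```python
import math

def angle_at_vertex_deg(a, b, c):
    """Angle in degrees at vertex b between the edges b->a and b->c."""
    u = [a[i] - b[i] for i in range(3)]
    v = [c[i] - b[i] for i in range(3)]
    cos_angle = sum(x * y for x, y in zip(u, v)) / (math.hypot(*u) * math.hypot(*v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))

def point_segment_distance(p, a, b):
    """Shortest distance from point p to the segment ab."""
    ab = [b[i] - a[i] for i in range(3)]
    ap = [p[i] - a[i] for i in range(3)]
    denom = sum(x * x for x in ab)
    t = 0.0 if denom == 0 else max(0.0, min(1.0, sum(x * y for x, y in zip(ap, ab)) / denom))
    closest = [a[i] + t * ab[i] for i in range(3)]
    return math.dist(p, closest)

def path_is_kept(vertices, atoms, min_dist=5.0, angle_range=(20.0, 170.0)):
    """Apply the angle criterion and the 5 Å clearance criterion to one path."""
    for a, b, c in zip(vertices, vertices[1:], vertices[2:]):
        if not angle_range[0] <= angle_at_vertex_deg(a, b, c) <= angle_range[1]:
            return False
    for a, b in zip(vertices, vertices[1:]):
        if any(point_segment_distance(atom, a, b) < min_dist for atom in atoms):
            return False
    return True
```

Checking the angles first is a deliberate ordering in this sketch: it is much cheaper than the clearance test, which loops over all atoms for every edge.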

## Shifting paths to the patterns

The strategy described in Step 3 gives a certain freedom to the rod that connects the last two angle points generated from the N- and C-terminal points. As this freedom is not actually permitted by the alpha helix and the angle pattern, but is permitted by the flexible part, for example at the C-terminal end, the software slightly refines the path by rotating the segment that originates from the C-terminal point.

## Weighting of paths

### Calibrating the weighting function

Every contribution has its own distribution. An example is shown in figure ### figure histogram_length_lys.png, but all of them have different shapes. The aim is to find the paths that minimize all of these distributions. Therefore the four mentioned contributions are combined linearly in the weighting function:

$W(p) = \alpha L(p) + \beta A(p) + \gamma D(p) + \delta u(p)$

where $W$ is the final weighting, $p$ the path, $L$ the length contribution, $A$ the angle contribution, $D$ the distance contribution and $u$ the contribution from the forbidden regions. $\alpha, \beta, \gamma, \delta$ are the weighting constants that needed to be found. The normalization of each contribution was chosen so that each of them is dimensionless and that all have reasonably similar values.

The weighting constants were obtained from the linker-screening ###lin performed with lysozyme and the enzyme-modeling ###link. Please see below ###link for a detailed explanation of how the values were obtained.
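As a small illustration of the weighting function, the following sketch computes $W(p)$ from already normalized contributions and ranks paths by increasing weight. The function names and the input layout (a dictionary of 4-tuples) are assumptions made for this example. Paths whose forbidden-region contribution is infinite are dropped, as described for linkers that pass in front of a binding domain.

```python
import math

def path_weight(L, A, D, u, alpha, beta, gamma, delta):
    """W(p) = alpha*L + beta*A + gamma*D + delta*u for normalized contributions."""
    if math.isinf(u):
        # A path through a forbidden region is discarded whatever delta is.
        return math.inf
    return alpha * L + beta * A + gamma * D + delta * u

def rank_paths(paths, alpha, beta, gamma, delta):
    """Return path names sorted from best (lowest weight) to worst.

    paths maps a name to its (L, A, D, u) tuple (hypothetical layout).
    """
    scored = {name: path_weight(*c, alpha, beta, gamma, delta)
              for name, c in paths.items()}
    kept = {name: w for name, w in scored.items() if math.isfinite(w)}
    return sorted(kept, key=kept.get)
```

Handling the infinite case explicitly avoids the undefined product `0 * inf` that would occur if $\delta$ were set to 0.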

## Translating paths to sequence

As already mentioned, the software is provided with two databases, one for the possible angle patterns and one for the helix patterns. The choice of the patterns was inspired by known crystal structures extracted from databases and described in different papers. A huge in silico screening for refining the preferences of the patterns was then set up using the iGEM@home ###Link system. For the complete description of the search for suitable patterns, one can read the modeling ###Link to patternspart page. All the possible paths are now split up at the angles and compared with the possible patterns in the databases. ###Figure needed to explain, how the path is translated### The most suitable patterns are identified and joined together to build the path's sequence. It is important to note that this is only possible because of the modularity of our linker patterns used as building blocks: each block, whether an alpha helix or an angle pattern, is not affected by the others. Thus, for each possible path, one sequence is produced.

## Clustering of paths

Many different paths are represented by the same sequence ###Figure that shows, different paths have same properties, already before in the text ### and we therefore clustered such paths. The weights of those clustered paths were then calculated by averaging the weights of the different paths that compose a cluster.
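The clustering step amounts to grouping paths by their sequence and averaging their weights. A minimal sketch, assuming the paths are available as `(sequence, weight)` pairs (a hypothetical layout) and using a hypothetical function name:

```python
from collections import defaultdict

def cluster_by_sequence(paths):
    """Group (sequence, weight) pairs by sequence and average the weights.

    Returns a dict mapping each sequence to the mean weight of its cluster.
    """
    groups = defaultdict(list)
    for sequence, weight in paths:
        groups[sequence].append(weight)
    return {seq: sum(ws) / len(ws) for seq, ws in groups.items()}
```

For example, three paths `[("SEQ_A", 1.0), ("SEQ_B", 5.0), ("SEQ_A", 3.0)]` collapse into two clusters with weights 2.0 and 5.0.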

# Results

## DNMT1

A major motivation of our effort to design rigid linkers with angles was the circularization of the DNA methyltransferase Dnmt1 (link). The truncated form used in our project is composed of 900 amino acids, and its N- and C-terminal extremities are well separated. To circularize it, two linkers were designed: a flexible one made of glycine and serine, and a rigid one designed by the software. The rigid linkers for DNMT1 were obtained from an early version of the software. At that time the calculation took 11 days on a laptop computer with an Intel i5 processor and 8 GB of RAM, which shows the importance of a distributed computing system for large proteins. Since then, the software has improved considerably, reducing the calculation time to about 1 day for DNMT1.

 ###DNMT1 could be made heatstable, still missing###


## Feedback from wet lab

The results from the software for lysozyme were tested as described in the linker-screening part and evaluated as described in the enzyme modeling part. We have performed a large linker screening on 10 different lysozymes with different linkers. As the purpose of the lysozyme screen was the calibration of the software, those linkers were designed according to the four contributions previously mentioned. One of them was the shortest possible, one had the best possible angle, and so on.

Table 1: Linkers and their amino acid sequences. Green: attachment sequences to prevent the flexible regions from being perturbed; Blue: angle; Purple: extein.

| Linker | Amino acid sequence |
|---|---|
| sgt2 | GGAEAAAKAAAHPEAAEAAAKRGTCWE |
| rigid | GGAEAAAKEAAAKAAPRGKCWE |
| may1 | GGAEAAAKEAAAKAAAAHPEAAEAAAKEAAAKAKTAAEAAAKEAAAKARGTCWE |
| ord1 | GGAEAAAKEAAAKATGDLAAEAAAKAARGTCWE |
| ord3 | GGAEAAAKEAAAKASLPAAAEAAAKEAAAKRGTCWE |
| sho1 | GGRGTCWE |
| sho2 | GGAEAAAKRGTCWE |

In the end we obtained a ranking of the in vitro tested linkers from the linker-screening and chose the parameters $\alpha, \beta, \gamma, \delta$ of the weighting function so that the ranking from the software reproduced the ranking from the assays. To this end, we set up a function that is minimized when the order of the weighting output is the same as the order from the wet lab. The final values were: $\alpha = 1.85 \times 10^{-6}$, $\beta = 0.57$, $\gamma = 50.8 \times 10^{6}$.
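One possible way to write a function that is minimal when the two orders agree is to count the pairs of linkers whose order differs between software and wet lab. The sketch below does this and searches a small parameter grid. It is an assumption for illustration only: the exact objective function and optimizer we used are not described in this section, and the function names and the grid search are hypothetical.

```python
from itertools import combinations, product

def order_mismatches(software_weights, lab_ranking):
    """Number of linker pairs ordered differently by the software and the wet lab.

    lab_ranking lists linker names from best to worst; a lower software
    weight means a better linker. Ties count as mismatches.
    """
    mismatches = 0
    for better, worse in combinations(lab_ranking, 2):
        if software_weights[better] >= software_weights[worse]:
            mismatches += 1
    return mismatches

def calibrate(contributions, lab_ranking, grid):
    """Grid search for (alpha, beta, gamma, delta) minimizing the mismatches.

    contributions maps a linker name to its (L, A, D, u) tuple.
    Returns (best_score, best_parameters).
    """
    best = None
    for alpha, beta, gamma, delta in product(grid, repeat=4):
        weights = {name: alpha * L + beta * A + gamma * D + delta * u
                   for name, (L, A, D, u) in contributions.items()}
        score = order_mismatches(weights, lab_ranking)
        if best is None or score < best[0]:
            best = (score, (alpha, beta, gamma, delta))
    return best
```

A score of 0 means that the software ranking reproduces the wet lab ranking exactly. Since only the order matters, many parameter sets can reach the same score, so additional criteria would be needed to select a unique set.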