http://2014.igem.org/wiki/index.php?title=Special:Contributions/Igemnils&feed=atom&limit=50&target=Igemnils&year=&month=2014.igem.org - User contributions [en]2024-03-29T07:15:10ZFrom 2014.igem.orgMediaWiki 1.16.5http://2014.igem.org/Team:Heidelberg/Team/MembersTeam:Heidelberg/Team/Members2014-10-18T00:25:54Z<p>Igemnils: </p>
<hr />
<div>{{:Team:Heidelberg/templates/wikipage_new|<br />
title=TEAM<br />
|<br />
red=<br />
|<br />
subtitle=<br />
|<br />
red-logo=<br />
|<br />
white-logo=true<br />
|<br />
container-style=background-color:white;<br />
|<br />
header-img=<br />
|<br />
header-bg=black<br />
|<br />
body-style=background-color:black;<br />
|<br />
content=<br />
<html><br />
<div class="row"><br />
<div class="col-lg-offset-9 col-lg-3 col-md-offset-6 col-md-6 col-sm-offset-6 col-sm-6" style="margin-bottom: 15px; text-align:right;"><button id="Teammembers-btn" class="btn btn-lg btn-hd active" style="margin-right:15px;">Students</button><button id="Supervisors-btn" class="btn btn-lg btn-hd">Supervisors</button></div><br />
</div><br />
<div class="row" style="position:relative"><br />
<div id="memberSelector" class="col-lg-3 col-md-6 col-sm-4 col-xs-3"><br />
<div class="row Team"><br />
<div class="col-lg-6 col-md-6 col-sm-6"><br />
<a id="Member-Anna" href="#Anna" class="thumbnail"><br />
<img src="/wiki/images/3/36/Heidelberg_Anna_Thumbnail.jpg" alt="..."><br />
</a><br />
</div><br />
<div class="col-lg-6 col-md-6 col-sm-6"><br />
<a id="Member-Nils" href="#Nils" class="thumbnail"><br />
<img src="/wiki/images/e/ed/Heidelberg_Nils_Thumbnail.jpg" alt="..."><br />
</a><br />
</div><br />
</div><br />
<div class="row Team"><br />
<div class="col-lg-6 col-md-6 col-sm-6"><br />
<a id="Member-Elisabeth" href="#Elisabeth" class="thumbnail"><br />
<img src="/wiki/images/1/15/Heidelberg_Elisabeth_Thumbnail.jpg" alt="..."><br />
</a><br />
</div><br />
<div class="col-lg-6 col-md-6 col-sm-6"><br />
<a id="Member-Max_H" href="#MaxH" class="thumbnail"><br />
<img src="/wiki/images/9/9b/Heidelberg_MaxH_Thumbnail.jpg" alt="..."><br />
</a><br />
</div><br />
</div><br />
<div class="row Team"><br />
<div class="col-lg-6 col-md-6 col-sm-6"><br />
<a id="Member-Charlotte" href="#Charlotte" class="thumbnail"><br />
<img src="/wiki/images/1/18/Heidelberg_Charlotte_Thumbnail.jpg" alt="..."><br />
</a><br />
</div><br />
<div class="col-lg-6 col-md-6 col-sm-6"><br />
<a id="Member-Jakob" href="#Jakob" class="thumbnail"><br />
<img src="/wiki/images/3/3e/Heidelberg_Jakob_Thumbnail.jpg" alt="..."><br />
</a><br />
</div><br />
</div><br />
<div class="row Team"><br />
<div class="col-lg-6 col-md-6 col-sm-6"><br />
<a id="Member-Magdalena" href="#Magdalena" class="thumbnail"><br />
<img src="/wiki/images/1/1b/Heidelberg_Magdalena_Thumbnail.jpg" alt="..."><br />
</a><br />
</div><br />
<div class="col-lg-6 col-md-6 col-sm-6"><br />
<a id="Member-Silvan" href="#Silvan" class="thumbnail"><br />
<img src="/wiki/images/4/4f/Heidelberg_Silvan_Thumbnail.jpg" alt="..."><br />
</a><br />
</div><br />
</div><br />
<div class="row Team"><br />
<div class="col-lg-6 col-md-6 col-sm-6"><br />
<a id="Member-Max_W" href="#MaxW" class="thumbnail"><br />
<img src="/wiki/images/4/4a/Heidelberg_MaxW_Thumbnail.jpg" alt="..."><br />
</a><br />
</div><br />
<div class="col-lg-6 col-md-6 col-sm-6"><br />
<a id="Member-Carolin" href="#Carolin" class="thumbnail"><br />
<img src="/wiki/images/5/51/Heidelberg_Carolin_Thumbnail.jpg" alt="..."><br />
</a><br />
</div><br />
</div><br />
<div class="row Team"><br />
<div class="col-lg-6 col-md-6 col-sm-6"><br />
<a id="Member-Constantin" href="#Constantin" class="thumbnail"><br />
<img src="/wiki/images/d/d6/Heidelberg_Constantin_Thumbnail.jpg" alt="..."><br />
</a><br />
</div><br />
<div class="col-lg-6 col-md-6 col-sm-6"><br />
<a id="Member-Jan" href="#Jan" class="thumbnail"><br />
<img src="/wiki/images/2/23/Heidelberg_Jan_Thumbnail.jpg" alt="..."><br />
</a><br />
</div><br />
</div><br />
<div class="row Supervisor"><br />
<div class="col-lg-6 col-md-6 col-sm-6"><br />
<a id="Member-Eils" href="#Roland_Eils" class="thumbnail"><br />
<img src="/wiki/images/b/be/Heidelberg_Roland_Eils_Thumbnail.jpg" alt="..."><br />
</a><br />
</div><br />
<div class="col-lg-6 col-md-6 col-sm-6"><br />
<a id="Member-Barbara" href="#Barbara_DiVentura" class="thumbnail"><br />
<img src="/wiki/images/3/3a/Heidelberg_Barbara_DiVentura_Thumbnail.jpg" alt="..."><br />
</a><br />
</div><br />
</div><br />
<div class="row Supervisor"><br />
<div class="col-lg-6 col-md-6 col-sm-6"><br />
<a id="Member-Julia" href="#Julia" class="thumbnail"><br />
<img src="/wiki/images/1/1b/Heidelberg_Julia_Thumbnail.jpg" alt="..."><br />
</a><br />
</div><br />
<div class="col-lg-6 col-md-6 col-sm-6"><br />
<a id="Member-Philipp" href="#Philipp" class="thumbnail"><br />
<img src="/wiki/images/9/95/Heidelberg_Philipp_Thumbnail.jpg" alt="..."><br />
</a><br />
</div><br />
</div><br />
<div class="row Supervisor"><br />
<div class="col-lg-6 col-md-6 col-sm-6"><br />
<a id="Member-Joel" href="#Joel" class="thumbnail"><br />
<img src="/wiki/images/1/12/Heidelberg_Joel_Thumbnail.jpg" alt="..."><br />
</a><br />
</div><br />
<div class="col-lg-6 col-md-6 col-sm-6"><br />
<a id="Member-Stefen" href="#Stefen" class="thumbnail"><br />
<img src="/wiki/images/a/a2/Heidelberg_Stefen_Thumbnail.jpg" alt="..."><br />
</a><br />
</div><br />
</div><br />
<div class="row Supervisor"><br />
<div class="col-lg-6 col-md-6 col-sm-6"><br />
<a id="Member-Pierre" href="#Pierre" class="thumbnail"><br />
<img src="/wiki/images/1/1d/Heidelberg_Piere_Thumbnail.jpg" alt="..."><br />
</a><br />
</div><br />
</div><br />
</div><br />
<div class="col-lg-4 col-md-6 col-sm-8 col-xs-9 col-lg-push-5 memberview"><br />
<div class="imageBorder"><br />
<img id="memberImageOverlay" src="/wiki/images/7/71/Heidelberg_Placeholder.jpg" alt="Image Overlay" class="img-responsive"/><br />
<img id="memberImage" src="/wiki/images/7/71/Heidelberg_Placeholder.jpg" alt="placeholder" class="img-responsive"/><br />
</div><br />
</div><br />
<div class="col-lg-5 col-md-12 col-sm-12 col-xs-12 col-lg-pull-4 memberview well"><br />
<h2 id="Name"></h2><br />
<p id="Description">Placeholder<br />
</p><br />
</div><br />
<div class="col-lg-9 col-md-6 col-sm-8 col-xs-9 col-md-offset-6 col-sm-offset-4 col-xs-offset-3 col-lg-offset-3 team-overlay"><br />
<div class="row" ><br />
<div class="col-lg-12"><br />
<img class="img-responsive border" src="/wiki/images/b/be/Heidelberg_Team.jpg" /><br />
</div><br />
<div class="col-lg-12"><br />
<h2>Our Team</h2><br />
<p><br />
We are a group of young, motivated students and patient, motivated <br />
supervisors who love synthetic biology and wish to make an <br />
important contribution to its advancement and acceptance by society. <br />
This year we have a nice mix of wet-lab experts (who ran tons and <br />
tons of assays) and computer freaks (who wrote millions of lines of <br />
code). Plus a crazy physicist (who did many, many things – but none <br />
involving a pipette!).<br />
We would be a perfect German team, if it weren't for one <br />
Italian and one French supervisor.<br />
Like good split inteins, we like to work only when assembled in a team!<br />
</p><br />
</div><br />
</div><br />
</div><br />
<div class="clearfix"></div><br />
</div><br />
</html><br />
|<br />
titles=<br />
|<br />
white=true<br />
|<br />
abstract=<br />
}}<br />
{{:Team:Heidelberg/Templates/IncludeCSS|:Team:Heidelberg/css/slick}}<br />
{{:Team:Heidelberg/Templates/IncludeCSS|:Team:Heidelberg/css/teampage}}<br />
{{:Team:Heidelberg/Templates/IncludeJS|:Team:Heidelberg/js/teampage}}<br />
{{:Team:Heidelberg/Templates/IncludeJS|:Team:Heidelberg/js/slick}}</div>Igemnilshttp://2014.igem.org/Team:Heidelberg/Project/Linker_ScreeningTeam:Heidelberg/Project/Linker Screening2014-10-18T00:06:08Z<p>Igemnils: </p>
<hr />
<div>{{:Team:Heidelberg/templates/wikipage_new|<br />
title=LINKER SCREENING<br />
|<br />
white=true<br />
|<br />
red-logo=true<br />
|<br />
header-img=<br />
|<br />
header=background-color:#DE4230<br />
|<br />
header-bg=black<br />
|<br />
subtitle= Proof and calibration of the CRAUT linker software in the wet-lab<br />
|<br />
container-style=background-color:white;<br />
|<br />
titles={{:Team:Heidelberg/templates/title|Introduction}}{{:Team:Heidelberg/templates/title|Materials and Methods|Materials_and_Methods}}{{:Team:Heidelberg/templates/title|Results}}{{:Team:Heidelberg/templates/title|Discussion}}{{:Team:Heidelberg/templates/title|References}}<br />
|<br />
abstract=<br />
|<br />
content=<br />
<div class="col-lg-12"><br />
{{:Team:Heidelberg/pages/Linker_Screening}}<br />
</div><br />
|<br />
white-logo=<br />
|<br />
red=<br />
}}</div>Igemnilshttp://2014.igem.org/Team:Heidelberg/Toolbox/CircularizationTeam:Heidelberg/Toolbox/Circularization2014-10-18T00:04:57Z<p>Igemnils: </p>
<hr />
<div>{{:Team:Heidelberg/templates/wikipage_new|<br />
title=CIRCULARIZATION<br />
|<br />
white=true<br />
|<br />
red-logo=true<br />
|<br />
header-img=/wiki/images/a/aa/Header_circularization.jpg<br />
|<br />
header-bg=black<br />
|<br />
subtitle= Transforming an enzyme into a ring of fire<br />
|<br />
container-style=background-color:white;<br />
|<br />
titles=<br />
|<br />
abstract=<br />
|<br />
content=<br />
<div class="col-lg-12"><br />
{{:Team:Heidelberg/pages/Circularization}}<br />
</div><br />
|<br />
white-logo=<br />
|<br />
red=<br />
}}</div>Igemnilshttp://2014.igem.org/Team:Heidelberg/pages/PartsTeam:Heidelberg/pages/Parts2014-10-18T00:01:34Z<p>Igemnils: </p>
<hr />
<div><html><br />
<style type="text/css"><br />
div.margin-top { margin-top: 100px; }<br />
</style><br />
<br />
<br />
<h1 id="Favorite Parts">Favorite Parts.</h1><br />
<p>The iGEM Team Heidelberg 2014 has built a new biological system for the iGEM community that integrates split inteins. <br />
Intein splicing is a natural process that excises one part of a protein and leaves the remaining parts irreversibly attached. This function allows you to modify your protein in numerous ways.</p><br />
<p>To create a toolbox that covers the functions and possibilities of inteins, we need a new standard for the iGEM community. This standard, the RFC of the iGEM Team Heidelberg 2014, allows us to work with split inteins easily and modularly.</p><br />
<br />
<p>Our favorite parts represent the basic constructs of our toolbox – the assembly and the circularization constructs, both of which have been tested with many methods and applications. </p><br />
<p>In the following we present <br />
<a href="http://parts.igem.org/Part:BBa_K1362000">BBa_K1362000</a>, the construct for circularization, <br />
<a href="http://parts.igem.org/Part:BBa_K1362100">BBa_K1362100</a> and <br />
<a href="http://parts.igem.org/BBa_K1362101">BBa_K1362101</a>, the N- and the C-construct for assembly. Take a look and visit the Parts Registry to read the associated documentation.</p><br />
<br/><br />
<br/><br />
<h3> Circularization Construct. BBa_K1362000 </h3> <br />
<br />
<div class="row"><br />
<div class="col-md-4 col-sm-12 col-xs-12"> <br />
<h4>BBa_K1362000</h4><br />
Placeholder<br />
<br />
</div><br />
<div class="col-md-8 col-sm-12 col-xs-12"><br />
<img src="/wiki/images/7/7c/BBa_K1362000.png" class="img-responsive" alt="Circularization Construct"><br />
<br />
</div><br />
</div><br />
<br />
<br/><br />
<br />
<h3> Assembly Constructs. BBa_K1362100 and BBa_K1362101 </h3> <br />
<br />
<div class="row"><br />
<div class="col-md-4 col-sm-12 col-xs-12"> <br />
<br />
<h4>BBa_K1362100</h4><br />
<p>This intein assembly construct is part of our strategy for cloning with split inteins. Inteins are naturally occurring peptide sequences that splice out of a precursor protein and join the remaining ends to form a new protein. Splitting such an intein sequence into an N-terminal and a C-terminal split intein leaves one with a powerful tool to post-translationally modify whole proteins at the amino-acid sequence level. This construct was designed to express any protein of interest fused to the Nostoc punctiforme DnaE N-terminal split intein. </p><br />
</div><br />
<div class="col-md-8 col-sm-12 col-xs-12"><br />
<img src="/wiki/images/8/81/BBa_K1362100.png" class="img-responsive" alt="Assembly Constructs"><br />
</div><br />
<br />
</div><br />
<br />
<br/><br />
<div class="row"><br />
<div class="col-md-4 col-sm-12 col-xs-12"> <br />
<br />
<h4>BBa_K1362101</h4><br />
<p>BBa_K1362101 is the C-terminal construct corresponding to BBa_K1362100. Upon coexpression or mixing of the N- and C-constructs, protein splicing takes place and the N- and C-terminal proteins of interest are irreversibly assembled via a newly formed peptide bond.</p><p><br />
This mechanism can be applied to a variety of uses, such as the activation of a protein through reconstitution of individually expressed split halves. See our split sfGFP experiment and the respective parts in the registry for more information. Protein splicing offers many new possibilities and we hope to have set a foundation that you guys can build on!</p><br />
</div><br />
<div class="col-md-8 col-sm-12 col-xs-12"><br />
<img src="/wiki/images/3/3f/BBa_K1362101.png" class="img-responsive" alt="Assembly Constructs"><br />
</div><br />
</div><br />
<br />
<br />
<div class="col-md-12 col-sm-12 col-xs-12" style="margin: 100px 0;"><br />
<img src="/wiki/images/9/9a/Heidelberg_dna.png" class="img-responsive" alt="Circularization Construct"><br />
</div><br />
<br />
<br />
<br />
<h1 id="Sample Data Page">Sample Data Page for our favorite Parts.</h1><br />
<br />
<h3> Circularization Construct. BBa_K1362000 </h3> <br />
<br />
<br />
<br/><br />
<br />
<div class="row"><br />
<div class="col-md-7 col-sm-12 col-xs-12"> <br />
<img src="/wiki/images/9/9b/SampleData_Circularization.png" class="img-responsive" alt="Circularization Construct"><br />
</div><br />
<div class="col-md-5 col-sm-12 col-xs-12"><br />
<!-- <div class="margin-top"> --><br />
<div class="well well-sm"><br />
This part represents an easy way to circularize any protein. In a single step you can clone your protein into the split intein circularization construct. Exteins, RFC [i] standard overhangs and BsaI sites have to be added by PCR to the coding sequence of the protein to be circularized, without start and stop codons. By Golden Gate assembly, the mRFP selection marker is then replaced with the protein insert.<br />
If the ends of your protein of interest are not close enough to be connected directly, you will need a linker. <a href="http://parts.igem.org/Part:BBa_K1362000">BBa_K1362000</a>, the split intein circularization construct, includes a strong T7 RBS (<a href="http://parts.igem.org/wiki/index.php?title=Part:BBa_K1362090">BBa_K1362090</a>), which we also sent to the Parts Registry, and the split intein Npu DnaE. The T7 RBS is derived from the T7 phage gene 10a (major capsid protein). </div> <br />
<!-- </div> --><br />
<div class="well well-sm"><br />
The resulting plasmid can be used to express the protein of interest with the obligatory linker and the N- and C-intein.<br />
</div><br />
<div class="well well-sm"><br />
In an autocatalytic in vivo reaction, the circular protein is formed. To read more about the trans-splicing reaction visit our <a href="https://2014.igem.org/Team:Heidelberg/Project/Background">Intein Background</a> page. If corresponding split inteins are added to both termini of a protein, the trans-splicing reaction results in a circular backbone. <br />
</div><br />
<div class="well well-sm"><br />
Circular proteins offer many advantages. While conserving the functionality of their linear counterparts, circular proteins can be superior in terms of thermostability, resistance against chemical denaturation and protection from exopeptidases. Moreover, a circular backbone can improve the in vivo stability of therapeutic proteins and peptides.<br />
</div><br />
<br />
</div><br />
</div><br />
<br />
<br />
<h3> Assembly Construct. BBa_K1362100 and BBa_K1362101 </h3> <br />
<br />
<br />
<br />
<br/><br />
<br/><br />
<br />
<div class="row"><br />
<div class="col-md-7 col-sm-12 col-xs-12"> <br />
<img src="/wiki/images/5/5c/SampleData_Assembly.png" class="img-responsive" alt="Assembly Constructs"><br />
</div><br />
<div class="col-md-5 col-sm-12 col-xs-12"><br />
<br />
<div class="well well-sm"><br />
This part represents an easy way to circularize any protein. In a single step you can clone your protein into the split intein circularization construct. Exteins, RFC [i] standard overhangs and BsaI sites have to be added by PCR to the coding sequence of the protein to be circularized, without start and stop codons. By Golden Gate assembly, the mRFP selection marker is then replaced with the protein insert.<br />
If the ends of your protein of interest are not close enough to be connected directly, you will need a linker. <a href="http://parts.igem.org/Part:BBa_K1362000">BBa_K1362000</a>, the split intein circularization construct, includes a strong T7 RBS (<a href="http://parts.igem.org/wiki/index.php?title=Part:BBa_K1362090">BBa_K1362090</a>), which we also sent to the Parts Registry, and the split intein Npu DnaE. The T7 RBS is derived from the T7 phage gene 10a (major capsid protein). </div> <br />
<!-- </div> --><br />
<div class="well well-sm"><br />
The resulting plasmid can be used to express the protein of interest with the obligatory linker and the N- and C-intein.<br />
</div><br />
<div class="well well-sm"><br />
In an autocatalytic in vivo reaction, the circular protein is formed. To read more about the trans-splicing reaction visit our <a href="https://2014.igem.org/Team:Heidelberg/Project/Background">Intein Background</a> page. If corresponding split inteins are added to both termini of a protein, the trans-splicing reaction results in a circular backbone. <br />
</div><br />
<div class="well well-sm"><br />
Circular proteins offer many advantages. While conserving the functionality of their linear counterparts, circular proteins can be superior in terms of thermostability, resistance against chemical denaturation and protection from exopeptidases. Moreover, a circular backbone can improve the in vivo stability of therapeutic proteins and peptides.<br />
</div><br />
</div><br />
</div><br />
<br />
<div class="col-md-12 col-sm-12 col-xs-12" style="margin: 100px 0;"><br />
<img src="/wiki/images/9/9a/Heidelberg_dna.png" class="img-responsive" alt="Circularization Construct"><br />
</div><br />
<br />
<br/><br />
<br />
<h1 id="Intein Library">Intein Library.</h1><br />
<br/><br />
</html><br />
Inteins are the basic unit of our toolbox. They are integrated as extraneous polypeptide sequences into host proteins and do not contribute to the original protein function. Inteins perform an autocatalytic splicing reaction in which they excise themselves from the host protein while reconnecting the remaining chains on both sides, the so-called N- and C-exteins, via a new peptide bond. Read more about it in our [https://2014.igem.org/Team:Heidelberg/Project/Background project background]!<br />
<br />
To characterize the different types and groups of inteins and split inteins, we collected many details about them in an intein library. It gives a clear overview of the most important facts.<br />
<br />
{| class="table table-hover"<br />
|-<br />
!Split intein<br />
!Special features<br />
!Nint<br />
!Cint<br />
!Reaction properties<br />
!Origin<br />
!References<br />
|-<br />
| Npu DnaE||fast; robust at high temperature range and high-yielding trans-splicing activity, well characterised requirements||102||36||t1/2 = 63s , 37°C , k=~1x10^-2 (s^-1); activity range 6 to 37°C||S1 natural split intein, Nostoc punctiforme||[[#References|[1]]] [[#References|[2]]] <br />
|-<br />
| Ssp DnaX||cross-reactivity with other N-inteins, trans-splicing in vivo and in vitro, high yields||||||k=~1.7x10^-4(s^-1); efficiency 96%||engineered from Synechocystis species||[[#References|[3]]] [[#References|[4]]] <br />
|-<br />
| Ssp GyrB|| very short Nint facilitates trans-splicing of synthetic peptides||6||150||k=~1x10^-4(s^-1), efficiency 40-80%||S11 split intein engineered from Synechocystis species, strain PCC6803||[[#References|[4]]] [[#References|[5]]] <br />
|-<br />
| Ter DnaE3||trans-splicing activity with high yields||102||36||k=~2x10^-4(s^-1), efficiency 87%||natural split intein, Trichodesmium erythraeum||[[#References|[4]]] [[#References|[6]]] <br />
|-<br />
| Ssp DnaB||relatively fast||||||t1/2=12min, 25°C, k=~1x10^-3(s^-1)||engineered from Synechocystis species, strain PCC6803||[[#References|[2]]] <br />
|-<br />
| Gp41-1||fastest known reaction ||88||38||t1/2=20-30s, 37°C, k=~1.8x10^-1 (s^-1); activity range 0 to 60°C||natural split intein, Cyanophage||[[#References|[7]]] [[#References|[8]]] <br />
|-<br />
|}<br />
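The half-lives and rate constants in the table can be cross-checked against each other. The following is a minimal sketch, assuming the trans-splicing reactions follow first-order kinetics (which the rate constants given in s^-1 suggest), so that t1/2 = ln(2)/k. The function names are hypothetical and chosen only for illustration.

```javascript
// Assumption: first-order kinetics, so t1/2 = ln(2) / k.
// Both function names are hypothetical helpers, not part of any library.

// Convert a rate constant k (in s^-1) to a half-life (in s).
function halfLifeFromRateConstant(k) {
  if (!(k > 0)) {
    throw new RangeError('rate constant must be positive');
  }
  return Math.LN2 / k;
}

// Convert a half-life (in s) to a rate constant k (in s^-1).
function rateConstantFromHalfLife(halfLife) {
  if (!(halfLife > 0)) {
    throw new RangeError('half-life must be positive');
  }
  return Math.LN2 / halfLife;
}

// Npu DnaE: the table lists t1/2 = 63 s.
console.log(rateConstantFromHalfLife(63).toFixed(4)); // "0.0110"
```

For the Npu DnaE entry, a half-life of 63 s corresponds to a rate constant of roughly 1.1x10^-2 s^-1, which agrees with the tabulated value of about 1x10^-2 s^-1.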
<br />
<html><br />
<h3>References</h3><br />
<p>[1] Iwai, H., Züger, S., Jin, J. & Tam, P.-H. Highly efficient protein trans-splicing by a naturally split DnaE intein from Nostoc punctiforme. FEBS Lett. 580, 1853–8 (2006).</p><br />
<br />
<p>[2] Zettler, J., Schütz, V. & Mootz, H. D. The naturally split Npu DnaE intein exhibits an extraordinarily high rate in the protein trans-splicing reaction. FEBS Lett. 583, 909–14 (2009).</p><br />
<br />
<p>[3] Song, H., Meng, Q. & Liu, X.-Q. Protein trans-splicing of an atypical split intein showing structural flexibility and cross-reactivity. PLoS One 7, e45355 (2012).</p><br />
<br />
<p>[4] Lin, Y. et al. Protein trans-splicing of multiple atypical split inteins engineered from natural inteins. PLoS One 8, e59516 (2013).</p><br />
<br />
<p>[5] Appleby, J. H., Zhou, K., Volkmann, G. & Liu, X.-Q. Novel Split Intein for trans-Splicing Synthetic Peptide onto C Terminus of Protein. J. Biol. Chem. 284, 6194–6199 (2009).</p><br />
<br />
<p>[6] Liu, X.-Q. & Yang, J. Split dnaE genes encoding multiple novel inteins in Trichodesmium erythraeum. J. Biol. Chem. 278, 26315–8 (2003).</p><br />
<br />
<p>[7] Carvajal-Vallejos, P., Pallissé, R., Mootz, H. D. & Schmidt, S. R. Unprecedented rates and efficiencies revealed for new natural split inteins from metagenomic sources. J. Biol. Chem. 287, 28686–96 (2012).</p><br />
<br />
<p>[8] Dassa, B., London, N., Stoddard, B. L., Schueler-Furman, O. & Pietrokovski, S. Fractured genes: a novel genomic arrangement involving new split inteins and a new homing endonuclease family. Nucleic Acids Res. 37, 2560–73 (2009).</p><br />
<br />
<div class="col-md-12 col-sm-12 col-xs-12" style="margin: 100px 0;"><br />
<img src="/wiki/images/9/9a/Heidelberg_dna.png" class="img-responsive"><br />
</div><br />
<br />
<br/><br />
<br />
<h1 id="allParts"><span style="font-size:170%;">List of Parts</span> <!-- – <span style="font-size":50%">Placeholder --></h1><br />
<div class="col-xs-12"><br />
<div id="partsTable">Loading...</div><br />
</div><br />
<br />
<br />
<script type="text/javascript"><br />
var keepParts = ['BBa_K1362000', 'BBa_K1362001', 'BBa_K1362003', 'BBa_K1362004', 'BBa_K1362005', 'BBa_K1362011', 'BBa_K1362012', 'BBa_K1362013', 'BBa_K1362020', 'BBa_K1362021', 'BBa_K1362022', 'BBa_K1362023', 'BBa_K1362050', 'BBa_K1362051', 'BBa_K1362052', 'BBa_K1362053', 'BBa_K1362054', 'BBa_K1362055', 'BBa_K1362056', 'BBa_K1362057', 'BBa_K1362058', 'BBa_K1362059', 'BBa_K1362060', 'BBa_K1362090', 'BBa_K1362091', 'BBa_K1362092', 'BBa_K1362093', 'BBa_K1362094', 'BBa_K1362095', 'BBa_K1362096', 'BBa_K1362097', 'BBa_K1362100', 'BBa_K1362101', 'BBa_K1362102', 'BBa_K1362103', 'BBa_K1362104', 'BBa_K1362105', 'BBa_K1362106', 'BBa_K1362107', 'BBa_K1362108', 'BBa_K1362109', 'BBa_K1362110', 'BBa_K1362111', 'BBa_K1362120', 'BBa_K1362121', 'BBa_K1362130', 'BBa_K1362131', 'BBa_K1362140', 'BBa_K1362141', 'BBa_K1362142', 'BBa_K1362143', 'BBa_K1362150', 'BBa_K1362151', 'BBa_K1362160', 'BBa_K1362161', 'BBa_K1362166', 'BBa_K1362167', 'BBa_K1362170', 'BBa_K1362171', 'BBa_K1362172', 'BBa_K1362173', 'BBa_K1362174', 'BBa_K1362202', 'BBa_K1362203', 'BBa_K1362204', 'BBa_K1362205', 'BBa_K1362500'];<br />
<br />
$( document ).ready(function() {<br />
// Load the team's parts table from the iGEM API, then reduce it to the parts listed in keepParts<br />
jQuery("#partsTable").load("https://2014.igem.org/cgi/api/groupparts.cgi?t=iGEM014&amp;g=Heidelberg", function(){<br />
// The link in column 4 carries the part name; remove every row whose part is not in keepParts<br />
$('.pgrouptable tr td:nth-child(4) a').each(function(){<br />
var text = $(this).text();<br />
if($.inArray(text, keepParts) === -1){<br />
$(this).parent().parent().remove();<br />
}<br />
});<br />
// Drop the seventh column (cells and header)<br />
$('.pgrouptable tr td:nth-child(7)').remove();<br />
$('.pgrouptable tr th:nth-child(7)').remove();<br />
// Swap the registry's table and icon classes for Bootstrap ones<br />
$('.pgrouptable').removeClass('pgrouptable tablesorter').addClass('table table-hover');<br />
$('.heart13').removeClass('heart13').addClass('glyphicon glyphicon-heart');<br />
<br />
});<br />
<br />
});<br />
<br />
</script><br />
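The filtering step in the script above can also be written without the DOM, which makes it easier to test. This is a minimal sketch, assuming each table row has been reduced to an object with a `name` field. `filterPartRows`, the row shape and the part name `BBa_K0000000` are hypothetical and not part of the iGEM API.

```javascript
// Hypothetical helper: keep only the rows whose part name is on the whitelist.
// A Set gives constant-time lookups instead of scanning the array for every row.
function filterPartRows(rows, keepNames) {
  var keep = new Set(keepNames);
  return rows.filter(function (row) {
    return keep.has(row.name);
  });
}

// Example with one listed part and one made-up part name.
var exampleRows = [
  { name: 'BBa_K1362000' },
  { name: 'BBa_K0000000' } // hypothetical, not in the whitelist
];
var kept = filterPartRows(exampleRows, ['BBa_K1362000', 'BBa_K1362100']);
console.log(kept.length); // 1
```

The original script removes the unwanted rows in place with jQuery. Returning a new array instead leaves the input untouched, so the same whitelist can be reused on a fresh copy of the table.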
<div class="col-md-12 col-sm-12 col-xs-12" style="margin: 100px 0;"><br />
<img src="/wiki/images/9/9a/Heidelberg_dna.png" class="img-responsive"><br />
</div><br />
</html></div>Igemnilshttp://2014.igem.org/Team:Heidelberg/Team/MembersTeam:Heidelberg/Team/Members2014-10-17T23:49:59Z<p>Igemnils: </p>
<hr />
<div>{{:Team:Heidelberg/templates/wikipage_new|<br />
title=TEAM<br />
|<br />
red=<br />
|<br />
subtitle=<br />
|<br />
red-logo=<br />
|<br />
white-logo=true<br />
|<br />
container-style=background-color:white;<br />
|<br />
header-img=<br />
|<br />
header-bg=black<br />
|<br />
body-style=background-color:black;<br />
|<br />
content=<br />
<html><br />
<div class="row"><br />
<div class="col-lg-offset-9 col-lg-3 col-md-offset-6 col-md-6 col-sm-offset-6 col-sm-6" style="margin-bottom: 15px; text-align:right;"><button id="Teammembers-btn" class="btn btn-lg btn-hd active" style="margin-right:15px;">Students</button><button id="Supervisors-btn" class="btn btn-lg btn-hd">Supervisors</button></div><br />
</div><br />
<div class="row" style="position:relative"><br />
<div id="memberSelector" class="col-lg-3 col-md-6 col-sm-4 col-xs-3"><br />
<div class="row Team"><br />
<div class="col-lg-6 col-md-6 col-sm-6"><br />
<a id="Member-Anna" href="#Anna" class="thumbnail"><br />
<img src="/wiki/images/3/36/Heidelberg_Anna_Thumbnail.jpg" alt="..."><br />
</a><br />
</div><br />
<div class="col-lg-6 col-md-6 col-sm-6"><br />
<a id="Member-Nils" href="#Nils" class="thumbnail"><br />
<img src="/wiki/images/e/ed/Heidelberg_Nils_Thumbnail.jpg" alt="..."><br />
</a><br />
</div><br />
</div><br />
<div class="row Team"><br />
<div class="col-lg-6 col-md-6 col-sm-6"><br />
<a id="Member-Elisabeth" href="#Elisabeth" class="thumbnail"><br />
<img src="/wiki/images/1/15/Heidelberg_Elisabeth_Thumbnail.jpg" alt="..."><br />
</a><br />
</div><br />
<div class="col-lg-6 col-md-6 col-sm-6"><br />
<a id="Member-Max_H" href="#MaxH" class="thumbnail"><br />
<img src="/wiki/images/9/9b/Heidelberg_MaxH_Thumbnail.jpg" alt="..."><br />
</a><br />
</div><br />
</div><br />
<div class="row Team"><br />
<div class="col-lg-6 col-md-6 col-sm-6"><br />
<a id="Member-Charlotte" href="#Charlotte" class="thumbnail"><br />
<img src="/wiki/images/1/18/Heidelberg_Charlotte_Thumbnail.jpg" alt="..."><br />
</a><br />
</div><br />
<div class="col-lg-6 col-md-6 col-sm-6"><br />
<a id="Member-Jakob" href="#Jakob" class="thumbnail"><br />
<img src="/wiki/images/3/3e/Heidelberg_Jakob_Thumbnail.jpg" alt="..."><br />
</a><br />
</div><br />
</div><br />
<div class="row Team"><br />
<div class="col-lg-6 col-md-6 col-sm-6"><br />
<a id="Member-Magdalena" href="#Magdalena" class="thumbnail"><br />
<img src="/wiki/images/1/1b/Heidelberg_Magdalena_Thumbnail.jpg" alt="..."><br />
</a><br />
</div><br />
<div class="col-lg-6 col-md-6 col-sm-6"><br />
<a id="Member-Silvan" href="#Silvan" class="thumbnail"><br />
<img src="/wiki/images/4/4f/Heidelberg_Silvan_Thumbnail.jpg" alt="..."><br />
</a><br />
</div><br />
</div><br />
<div class="row Team"><br />
<div class="col-lg-6 col-md-6 col-sm-6"><br />
<a id="Member-Max_W" href="#MaxW" class="thumbnail"><br />
<img src="/wiki/images/4/4a/Heidelberg_MaxW_Thumbnail.jpg" alt="..."><br />
</a><br />
</div><br />
<div class="col-lg-6 col-md-6 col-sm-6"><br />
<a id="Member-Carolin" href="#Carolin" class="thumbnail"><br />
<img src="/wiki/images/5/51/Heidelberg_Carolin_Thumbnail.jpg" alt="..."><br />
</a><br />
</div><br />
</div><br />
<div class="row Team"><br />
<div class="col-lg-6 col-md-6 col-sm-6"><br />
<a id="Member-Constantin" href="#Constantin" class="thumbnail"><br />
<img src="/wiki/images/d/d6/Heidelberg_Constantin_Thumbnail.jpg" alt="..."><br />
</a><br />
</div><br />
<div class="col-lg-6 col-md-6 col-sm-6"><br />
<a id="Member-Jan" href="#Jan" class="thumbnail"><br />
<img src="/wiki/images/2/23/Heidelberg_Jan_Thumbnail.jpg" alt="..."><br />
</a><br />
</div><br />
</div><br />
<div class="row Supervisor"><br />
<div class="col-lg-6 col-md-6 col-sm-6"><br />
<a id="Member-Eils" href="#Roland_Eils" class="thumbnail"><br />
<img src="/wiki/images/b/be/Heidelberg_Roland_Eils_Thumbnail.jpg" alt="..."><br />
</a><br />
</div><br />
<div class="col-lg-6 col-md-6 col-sm-6"><br />
<a id="Member-Barbara" href="#Barbara_DiVentura" class="thumbnail"><br />
<img src="/wiki/images/3/3a/Heidelberg_Barbara_DiVentura_Thumbnail.jpg" alt="..."><br />
</a><br />
</div><br />
</div><br />
<div class="row Supervisor"><br />
<div class="col-lg-6 col-md-6 col-sm-6"><br />
<a id="Member-Julia" href="#Julia" class="thumbnail"><br />
<img src="/wiki/images/1/1b/Heidelberg_Julia_Thumbnail.jpg" alt="..."><br />
</a><br />
</div><br />
<div class="col-lg-6 col-md-6 col-sm-6"><br />
<a id="Member-Philipp" href="#Philipp" class="thumbnail"><br />
<img src="/wiki/images/9/95/Heidelberg_Philipp_Thumbnail.jpg" alt="..."><br />
</a><br />
</div><br />
</div><br />
<div class="row Supervisor"><br />
<div class="col-lg-6 col-md-6 col-sm-6"><br />
<a id="Member-Joel" href="#Joel" class="thumbnail"><br />
<img src="/wiki/images/1/12/Heidelberg_Joel_Thumbnail.jpg" alt="..."><br />
</a><br />
</div><br />
<div class="col-lg-6 col-md-6 col-sm-6"><br />
<a id="Member-Stefen" href="#Stefen" class="thumbnail"><br />
<img src="/wiki/images/a/a2/Heidelberg_Stefen_Thumbnail.jpg" alt="..."><br />
</a><br />
</div><br />
</div><br />
<div class="row Supervisor"><br />
<div class="col-lg-6 col-md-6 col-sm-6"><br />
<a id="Member-Pierre" href="#Pierre" class="thumbnail"><br />
<img src="/wiki/images/1/1d/Heidelberg_Piere_Thumbnail.jpg" alt="..."><br />
</a><br />
</div><br />
</div><br />
</div><br />
<div class="col-lg-4 col-md-6 col-sm-8 col-xs-9 col-lg-push-5 memberview"><br />
<div class="imageBorder"><br />
<img id="memberImageOverlay" src="/wiki/images/7/71/Heidelberg_Placeholder.jpg" alt="Image Overlay" class="img-responsive"/><br />
<img id="memberImage" src="/wiki/images/7/71/Heidelberg_Placeholder.jpg" alt="placeholder" class="img-responsive"/><br />
</div><br />
</div><br />
<div class="col-lg-5 col-md-12 col-sm-12 col-xs-12 col-lg-pull-4 memberview well"><br />
<h2 id="Name"></h2><br />
<p id="Description">Placeholder<br />
</p><br />
</div><br />
<div class="col-lg-9 col-md-6 col-sm-8 col-xs-9 col-md-offset-6 col-sm-offset-4 col-xs-offset-3 col-lg-offset-3 team-overlay"><br />
<div class="row" ><br />
<div class="col-lg-12"><br />
<img class="img-responsive border" src="/wiki/images/b/be/Heidelberg_Team.jpg" /><br />
</div><br />
<div class="col-lg-12"><br />
<h2>Our Team</h2><br />
<p><br />
test<br />
</p><br />
</div><br />
</div><br />
</div><br />
<div class="clearfix"></div><br />
</div><br />
</html><br />
|<br />
titles=<br />
|<br />
white=true<br />
|<br />
abstract=<br />
}}<br />
{{:Team:Heidelberg/Templates/IncludeCSS|:Team:Heidelberg/css/slick}}<br />
{{:Team:Heidelberg/Templates/IncludeCSS|:Team:Heidelberg/css/teampage}}<br />
{{:Team:Heidelberg/Templates/IncludeJS|:Team:Heidelberg/js/teampage}}<br />
{{:Team:Heidelberg/Templates/IncludeJS|:Team:Heidelberg/js/slick}}</div>Igemnilshttp://2014.igem.org/Team:Heidelberg/Team/MembersTeam:Heidelberg/Team/Members2014-10-17T23:45:45Z<p>Igemnils: </p>
<hr />
<div>{{:Team:Heidelberg/templates/wikipage_new|<br />
title=TEAM<br />
|<br />
red=<br />
|<br />
subtitle=<br />
|<br />
red-logo=<br />
|<br />
white-logo=true<br />
|<br />
container-style=background-color:white;<br />
|<br />
header-img=<br />
|<br />
header-bg=black<br />
|<br />
body-style=background-color:black;<br />
|<br />
content=<br />
<html><br />
<div class="row"><br />
<div class="col-lg-offset-9 col-lg-3 col-md-offset-6 col-md-6 col-sm-offset-6 col-sm-6" style="margin-bottom: 15px; text-align:right;"><button id="Teammembers-btn" class="btn btn-lg btn-hd active" style="margin-right:15px;">Students</button><button id="Supervisors-btn" class="btn btn-lg btn-hd">Supervisors</button></div><br />
</div><br />
<div class="row" style="position:relative"><br />
<div id="memberSelector" class="col-lg-3 col-md-6 col-sm-4 col-xs-3"><br />
<div class="row Team"><br />
<div class="col-lg-6 col-md-6 col-sm-6"><br />
<a id="Member-Anna" href="#Anna" class="thumbnail"><br />
<img src="/wiki/images/3/36/Heidelberg_Anna_Thumbnail.jpg" alt="..."><br />
</a><br />
</div><br />
<div class="col-lg-6 col-md-6 col-sm-6"><br />
<a id="Member-Nils" href="#Nils" class="thumbnail"><br />
<img src="/wiki/images/e/ed/Heidelberg_Nils_Thumbnail.jpg" alt="..."><br />
</a><br />
</div><br />
</div><br />
<div class="row Team"><br />
<div class="col-lg-6 col-md-6 col-sm-6"><br />
<a id="Member-Elisabeth" href="#Elisabeth" class="thumbnail"><br />
<img src="/wiki/images/1/15/Heidelberg_Elisabeth_Thumbnail.jpg" alt="..."><br />
</a><br />
</div><br />
<div class="col-lg-6 col-md-6 col-sm-6"><br />
<a id="Member-Max_H" href="#MaxH" class="thumbnail"><br />
<img src="/wiki/images/9/9b/Heidelberg_MaxH_Thumbnail.jpg" alt="..."><br />
</a><br />
</div><br />
</div><br />
<div class="row Team"><br />
<div class="col-lg-6 col-md-6 col-sm-6"><br />
<a id="Member-Charlotte" href="#Charlotte" class="thumbnail"><br />
<img src="/wiki/images/1/18/Heidelberg_Charlotte_Thumbnail.jpg" alt="..."><br />
</a><br />
</div><br />
<div class="col-lg-6 col-md-6 col-sm-6"><br />
<a id="Member-Jakob" href="#Jakob" class="thumbnail"><br />
<img src="/wiki/images/3/3e/Heidelberg_Jakob_Thumbnail.jpg" alt="..."><br />
</a><br />
</div><br />
</div><br />
<div class="row Team"><br />
<div class="col-lg-6 col-md-6 col-sm-6"><br />
<a id="Member-Magdalena" href="#Magdalena" class="thumbnail"><br />
<img src="/wiki/images/1/1b/Heidelberg_Magdalena_Thumbnail.jpg" alt="..."><br />
</a><br />
</div><br />
<div class="col-lg-6 col-md-6 col-sm-6"><br />
<a id="Member-Silvan" href="#Silvan" class="thumbnail"><br />
<img src="/wiki/images/4/4f/Heidelberg_Silvan_Thumbnail.jpg" alt="..."><br />
</a><br />
</div><br />
</div><br />
<div class="row Team"><br />
<div class="col-lg-6 col-md-6 col-sm-6"><br />
<a id="Member-Max_W" href="#MaxW" class="thumbnail"><br />
<img src="/wiki/images/4/4a/Heidelberg_MaxW_Thumbnail.jpg" alt="..."><br />
</a><br />
</div><br />
<div class="col-lg-6 col-md-6 col-sm-6"><br />
<a id="Member-Carolin" href="#Carolin" class="thumbnail"><br />
<img src="/wiki/images/5/51/Heidelberg_Carolin_Thumbnail.jpg" alt="..."><br />
</a><br />
</div><br />
</div><br />
<div class="row Team"><br />
<div class="col-lg-6 col-md-6 col-sm-6"><br />
<a id="Member-Constantin" href="#Constantin" class="thumbnail"><br />
<img src="/wiki/images/d/d6/Heidelberg_Constantin_Thumbnail.jpg" alt="..."><br />
</a><br />
</div><br />
<div class="col-lg-6 col-md-6 col-sm-6"><br />
<a id="Member-Jan" href="#Jan" class="thumbnail"><br />
<img src="/wiki/images/2/23/Heidelberg_Jan_Thumbnail.jpg" alt="..."><br />
</a><br />
</div><br />
</div><br />
<div class="row Supervisor"><br />
<div class="col-lg-6 col-md-6 col-sm-6"><br />
<a id="Member-Eils" href="#Roland_Eils" class="thumbnail"><br />
<img src="/wiki/images/b/be/Heidelberg_Roland_Eils_Thumbnail.jpg" alt="..."><br />
</a><br />
</div><br />
<div class="col-lg-6 col-md-6 col-sm-6"><br />
<a id="Member-Barbara" href="#Barbara_DiVentura" class="thumbnail"><br />
<img src="/wiki/images/3/3a/Heidelberg_Barbara_DiVentura_Thumbnail.jpg" alt="..."><br />
</a><br />
</div><br />
</div><br />
<div class="row Supervisor"><br />
<div class="col-lg-6 col-md-6 col-sm-6"><br />
<a id="Member-Julia" href="#Julia" class="thumbnail"><br />
<img src="/wiki/images/1/1b/Heidelberg_Julia_Thumbnail.jpg" alt="..."><br />
</a><br />
</div><br />
<div class="col-lg-6 col-md-6 col-sm-6"><br />
<a id="Member-Philipp" href="#Philipp" class="thumbnail"><br />
<img src="/wiki/images/9/95/Heidelberg_Philipp_Thumbnail.jpg" alt="..."><br />
</a><br />
</div><br />
</div><br />
<div class="row Supervisor"><br />
<div class="col-lg-6 col-md-6 col-sm-6"><br />
<a id="Member-Joel" href="#Joel" class="thumbnail"><br />
<img src="/wiki/images/1/12/Heidelberg_Joel_Thumbnail.jpg" alt="..."><br />
</a><br />
</div><br />
<div class="col-lg-6 col-md-6 col-sm-6"><br />
<a id="Member-Stefen" href="#Stefen" class="thumbnail"><br />
<img src="/wiki/images/a/a2/Heidelberg_Stefen_Thumbnail.jpg" alt="..."><br />
</a><br />
</div><br />
</div><br />
<div class="row Supervisor"><br />
<div class="col-lg-6 col-md-6 col-sm-6"><br />
<a id="Member-Pierre" href="#Pierre" class="thumbnail"><br />
<img src="/wiki/images/1/1d/Heidelberg_Piere_Thumbnail.jpg" alt="..."><br />
</a><br />
</div><br />
</div><br />
</div><br />
<div class="col-lg-4 col-md-6 col-sm-8 col-xs-9 col-lg-push-5 memberview"><br />
<div class="imageBorder"><br />
<img id="memberImageOverlay" src="/wiki/images/7/71/Heidelberg_Placeholder.jpg" alt="Image Overlay" class="img-responsive"/><br />
<img id="memberImage" src="/wiki/images/7/71/Heidelberg_Placeholder.jpg" alt="placeholder" class="img-responsive"/><br />
</div><br />
</div><br />
<div class="col-lg-5 col-md-12 col-sm-12 col-xs-12 col-lg-pull-4 memberview well"><br />
<h2 id="Name"></h2><br />
<p id="Description">Placeholder<br />
</p><br />
</div><br />
<div class="col-lg-9 col-md-6 col-sm-8 col-xs-9 col-md-offset-6 col-sm-offset-4 col-xs-offset-3 col-lg-offset-3 team-overlay"><br />
<div class="row" ><br />
<div class="col-lg-12"><br />
<img class="img-responsive border" src="/wiki/images/b/be/Heidelberg_Team.jpg" /><br />
</div><br />
<div class="col-lg-12"><br />
<h2>Our Team</h2><br />
<p><br />
Group leader<br />
<br />
I'm an experimentalist, originally a physicist and now specialized in quantitative cell biology using imaging techniques. I've advised the students on the design of their experiments and on the techniques that have allowed them to quantitatively characterize their new proteins. I was also involved in the modeling parts of the project and their feedback to the experimental data. <br />
</p><br />
</div><br />
</div><br />
</div><br />
<div class="clearfix"></div><br />
</div><br />
</html><br />
|<br />
titles=<br />
|<br />
white=true<br />
|<br />
abstract=<br />
}}<br />
{{:Team:Heidelberg/Templates/IncludeCSS|:Team:Heidelberg/css/slick}}<br />
{{:Team:Heidelberg/Templates/IncludeCSS|:Team:Heidelberg/css/teampage}}<br />
{{:Team:Heidelberg/Templates/IncludeJS|:Team:Heidelberg/js/teampage}}<br />
{{:Team:Heidelberg/Templates/IncludeJS|:Team:Heidelberg/js/slick}}</div>Igemnilshttp://2014.igem.org/Team:Heidelberg/SoftwareTeam:Heidelberg/Software2014-10-17T23:40:43Z<p>Igemnils: </p>
<hr />
<div>{{:Team:Heidelberg/Templates/IncludeCSS|:Team:Heidelberg/css/boxes}}<br />
<html><br />
<style><br />
.box {<br />
background-color: rgba(81,81,81,0.7);<br />
color: white;<br />
}<br />
.box:hover {<br />
text-decoration: none;<br />
color: white;<br />
}<br />
<br />
.box.nohover h2 {<br />
position: relative;<br />
top:10px;<br />
background-color: transparent;<br />
}<br />
<br />
.box.nohover {<br />
background-color: transparent;<br />
}<br />
.box.nohover:hover {<br />
color: white;<br />
background-color: transparent;<br />
}<br />
.box2.nohover:hover {<br />
color: white;<br />
}<br />
<br />
</style><br />
</html><br />
{{:Team:Heidelberg/templates/wikipage_new|<br />
|<br />
container-style=background-color: black; background-image: url(/wiki/images/6/6a/Heidelberg_epic_background.jpg); background-repeat: no-repeat; background-size: 100% auto; color: white;<br />
|<br />
title=SOFTWARE<br />
|<br />
white=true<br />
|<br />
red-logo=true<br />
|<br />
subtitle=Placeholder<br />
|<br />
abstract=<br />
|<br />
content=<br />
<html><br />
<div class="col-xs-12"><br />
<div class="boxes-table"><br />
<div class="boxes-row"><br />
<div class="cell box nohover" style="width:33.333333%; text-align:center;"><br />
<a href="https://2014.igem.org/Team:Heidelberg/Software/igemathome"><img src="/wiki/images/9/9f/Heidelberg_Igemathome_bg.png" class="img-responsive"/></a><br />
<h2>iGEM@home</h2><br />
</div><br />
<div class="cell box nohover" style="width:33.333333%; text-align:center;"><br />
<a href="https://2014.igem.org/Team:Heidelberg/Software/Linker_Software"><img src="/wiki/images/a/ad/Logo_fuer_Nilsi_weiß.png" style="height:200px;" class="img-responsive"/></a><br />
<h2>Linker Software</h2> <br />
</div><br />
<div class="cell box nohover" style="width:33.333333%; text-align:center;"><br />
<div style="width: 50%;left: 0;right: 0;position: relative;margin: auto;"><br />
<a href="https://2014.igem.org/Team:Heidelberg/Software/MidnightDoc"><img src="/wiki/images/d/d2/Software_MD_yellow.png" class="img-responsive"/></a><br />
</div><br />
<h2>MidnightDoc</h2> <br />
</div><br />
</div><br />
<div class="boxes-row"><br />
<a href="https://2014.igem.org/Team:Heidelberg/Software/igemathome" class="cell box"><br />
<h3>The Software</h3><br />
iGEM@home is software that divides extensive computing tasks into many packages and allows everybody to get involved with our science. Read more about it!<br />
</a><br />
<a href="https://2014.igem.org/Team:Heidelberg/Software/Linker_Software" class="cell box"><br />
<h3>The Software</h3><br />
Circularization is a narrow path between gaining heat stability and losing function due to deformation. <br />
We developed linker software that predicts the perfect linker depending on the folded structure of each protein.<br />
</a><br />
<a href="https://2014.igem.org/Team:Heidelberg/Software/MidnightDoc" class="cell box"><br />
<h3>The Software</h3><br />
MidnightDoc is the new way of lab documentation: it enables backtracing of experiments and provides an easy-to-use platform for protocol management and result logging! </a><br />
</div><br />
<div class="boxes-row"><br />
<a href="https://2014.igem.org/Team:Heidelberg/Software/igemathome/implementation" class="cell box"><br />
<h4>The Implementation</h4><br />
Here you can find a detailed description of the implementation of iGEM@home. Click here to read more about Java and Python embedding for distribution via the BOINC platform. <br />
</a><br />
<a href="https://2014.igem.org/Team:Heidelberg/Software/Linker_Software/Documentation" class="cell box"><br />
<h4>The Documentation</h4><br />
Here is the documentation of our CRAUT (Circularization with Rods and Angles of Unlinked Termini) software that predicts and ranks linkers built of rigid helical patterns and angles.<br />
</a><br />
<a href="https://2014.igem.org/Team:Heidelberg/Software/MidnightDoc/Documentation" class="cell box"><br />
<h4>The Documentation</h4><br />
Placeholder<br />
</a><br />
</div><br />
</div><br />
<br />
</div><br />
<br />
</html><br />
|<br />
header-img=<br />
|<br />
header-bg=<br />
|<br />
red=<br />
|<br />
titles=<br />
|<br />
white-logo=<br />
}}</div>Igemnilshttp://2014.igem.org/Team:Heidelberg/pages/Linker_SoftwareTeam:Heidelberg/pages/Linker Software2014-10-17T23:37:40Z<p>Igemnils: /* General procedure */</p>
<hr />
<div>=Abstract=<br />
As already [https://2014.igem.org/Team:Heidelberg/Toolbox/Circularization introduced], artificially circularized proteins may gain heat stability because the C- and N-termini are restrained from moving freely. Circularization may be trivial when the protein termini are very close to each other, which seems to be reasonably common [[#References|[1]]]. However, if the ends are too far apart, a long linker is needed to connect them. This linker should not change the natural conformation of the protein, and it should constrain the relative position of the ends to restrict the degrees of freedom and thus stabilize the structure even when heated. In addition, the linker should not affect any of the protein's functions. Consequently, it is important to prevent linkers from passing through the active site or from covering domains that bind other molecules, for example. Therefore one needs to be able to define the shape of possible linkers. This section describes the software we developed to design such linkers. We would like to stress that this work was made possible by the feedback between computer modeling and experimental work: we first designed linkers in silico, tested them experimentally, and used the results to further calibrate the software. To our knowledge, this is the first time that such an approach has been used to custom-design rigid linkers with angles to connect protein extremities.<br />
<br />
=General procedure=<br />
In short, the software provides a weighted list of linkers to circularize any protein of interest with a known structure. These linkers are made of rigid alpha-helical segments connected at defined angles. In contrast to flexible linkers, these rigid linkers were expected to constrain the protein extremities and to confer better heat stability. Such an idea was already developed [[#References|[2]]], but only with alpha helices defining simple rods and without any possibility of introducing angles. To generate the linkers, we first defined the geometrical paths, made of segments and angles, that they should follow. The geometrical paths that are biologically feasible are afterwards translated into amino acid sequences. Both the compatibility of paths with possible structures and the translation were made possible by our [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling approaches]. The first approach consisted of a statistical analysis of more than 17000 known non-homologous structures containing alpha helices connected with angles. For the second approach, we modeled the conformation of linkers circularizing proteins of known structure and analyzed them for certain properties. This second approach was run for a large number of proteins thanks to our distributed computing system [https://2014.igem.org/Team:Heidelberg/Software/igemathome igemathome]. The software provides different possible linkers with weights that rank the linkers according to their capacity to maintain protein activity at higher temperatures. These weights were derived from an extensive [https://2014.igem.org/Team:Heidelberg/Project/Linker_Screening linker screening] on the target protein lambda-lysozyme, using the first modeling approach.<br />
The documentation of our CRAUT software can be found [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software/Documentation here].<br />
<br />
=Background=<br />
Classically, protein linkers have been designed in three different ways. The easiest way is to define the length that a linker should cover and then simply use a flexible glycine-serine peptide with the right number of amino acids to match this length. Glycine is used for flexibility, as it has no side chain and does not produce any steric hindrance, while serine is used for solubility, as it has a small polar side chain. This solubility is important, as the linkers should not pass through the hydrophobic core of the protein, but should be dissolved in the surrounding medium. These flexible linkers are normally used for circularization, but also for connecting different proteins when the main goal is that the different parts are connected, not how they are connected, or when the flexibility of the linker is required for specific applications. <br />
<br />
A second strategy consists in using rigid helical linkers to keep proteins or protein domains at a certain distance from each other. This is especially important for signalling proteins and fluorescent proteins. One major property of alpha helices is that they always fold in a defined way, with well-defined angles and lengths. There are also many different helical patterns that differ in stability and solubility. Although helices have been used to design circularizing linkers [[#References|[2]]], one big disadvantage of this strategy is that one can only build straight linkers with them. So in the context of circularization, if the straight line connecting the protein extremities crosses the protein, this strategy is not an option.<br />
<br />
The third option, which served as the basis for our approach and which came from discussions with the group of Rebecca Wade in Heidelberg, Germany, consists in designing custom-tailored linkers for each specific application. These linkers can be obtained from protein structure prediction. First, one needs to define the path that the linker should take to connect the protein ends. Afterwards, one designs a possible linker sequence that might fit well. Next, one makes a structure prediction of the linker attached to the protein to validate the design. Several different linkers, with slight changes, can be compared. This is repeated several times until the linker effectively follows the expected path. This method requires strong knowledge of protein folding and protein structure prediction and is computationally intensive. On the other hand, the benefit can be important, as the interaction of the linker with the protein surface can be taken into account and as one can accurately define the path taken by the linker down to the resolution of the protein structure.<br />
<br />
We have set up a completely new strategy to design rigid linkers. As further detailed in the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling] part, it is possible to define the shape of a linker by combining rigid alpha helical rods with well-defined angle patterns. Therefore, by defining, in a geometrical way, the possible paths of the circularizing linkers for a given protein, we can then propose potential linkers. This definition of the geometrical path can be very difficult, especially for large proteins with complex shapes. Moreover, this definition is further constrained by the fact that linkers must avoid hiding active sites of the protein of interest. Finally, the paths have rotational degrees of freedom at the extremities of the protein, and depending on their orientation, they may or may not match the geometry of the protein. The tool we present here covers the two steps: defining geometrical paths with some weights, and translating them into feasible linkers, also with weights. This tool is universal, as it can design circularizing linkers for any protein with a known structure. Moreover it is modular, as, thanks to our [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling approach], we have designed linkers as exchangeable blocks of rods of different lengths and of angle patterns. The following sections detail the different steps followed by our software to design proper linkers.<br />
<br />
<br />
==PDB parsing==<br />
At first, the PDB file containing the structure of the target protein is parsed and the coordinates of the atoms are stored in metric units. After this, some initial tests are made on the protein structure. First, we check whether the C- and N-termini lie on the surface of the protein and are accessible to the solvent, which is crucial for circularization. We define a line originating from an extremity of the protein by the two angles of the spherical coordinates around the z-axis. From that, we determine the accessible angles by rejecting all the lines that are too close to the protein. As the future linker will be made of alpha helices and will therefore have a radius of 5 &Aring;, we use this length as the minimal allowed distance.<br />
Those allowed angles are stored for the coming linker generation. ###fig needed### <br />
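The accessibility test described above can be sketched roughly as follows. This is only an illustration, not the actual CRAUT code: the function names, the 5-degree grid and the <code>skip_radius</code> parameter (used to ignore atoms sitting right next to the terminus, which would otherwise block every direction) are assumptions made for the example.<br />

```python
import math

def parse_pdb_atoms(pdb_text):
    """Return (x, y, z) coordinates, in angstroms, of all ATOM records."""
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            # Fixed-width PDB columns (1-based): x = 31-38, y = 39-46, z = 47-54.
            atoms.append((float(line[30:38]), float(line[38:46]), float(line[46:54])))
    return atoms

def direction(theta_deg, phi_deg):
    """Unit vector for polar angle theta (from the z-axis) and azimuth phi."""
    t, p = math.radians(theta_deg), math.radians(phi_deg)
    return (math.sin(t) * math.cos(p), math.sin(t) * math.sin(p), math.cos(t))

def ray_is_clear(origin, unit_dir, atoms, min_dist=5.0, skip_radius=5.0):
    """True if no atom lies closer than min_dist to the ray leaving origin."""
    for atom in atoms:
        rel = tuple(a - o for a, o in zip(atom, origin))
        if math.sqrt(sum(c * c for c in rel)) < skip_radius:
            continue  # assumption: atoms belonging to the terminus itself are ignored
        # Project the atom onto the ray; clamp so points behind the origin map to it.
        along = max(sum(r * d for r, d in zip(rel, unit_dir)), 0.0)
        closest = tuple(o + along * d for o, d in zip(origin, unit_dir))
        if math.dist(atom, closest) < min_dist:
            return False
    return True

def accessible_angles(origin, atoms, step_deg=5):
    """All (theta, phi) pairs on the grid whose ray keeps the minimal distance."""
    return [(theta, phi)
            for theta in range(0, 181, step_deg)
            for phi in range(0, 360, step_deg)
            if ray_is_clear(origin, direction(theta, phi), atoms)]
```

Treating each direction as an infinite ray is the simplest reading of the text; a real implementation might limit the test to the length of the first rod.<br />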
<br />
==Generation of geometric paths==<br />
As our strategy consists in building linkers from helical rods and connecting angles, a path is completely defined by the coordinates of the angle points. Advancing one step from an existing point is always done by adding a displacement vector to this point. This vector is defined by the two spherical angles, chosen here in a discrete manner with an increment of 5 degrees, and by a length, also chosen in a discrete manner. This discrete length is used in two different contexts: it may correspond to the length of an alpha helix or to the length of the flexible part that appears at the extremity of the protein. The coordinates reached by this vector define a new angle point, no matter whether the vector corresponds to an alpha helix or to a flexible part. ###figure, points + vectors###. As we screen all possible angles in a discrete manner, the coordinates of those angle points are regularly distributed on a sphere. As further detailed in the next sections, those spheres are defined from both ends, either once or in several steps. Then the software checks for possible straight connections of given lengths for each pair of angle points originating from the two extremities. ###figure needed, with all these possible steps, one sphere around each point and checking for the connections, 2D, different lengths in two different figures, ###<br />
The linkers are built in a modular way, with blocks of well-defined size. From the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling] of potential linkers, we could derive 8 different alpha helical rods, all with different lengths. In addition, the length of the two segments inside an angle block was always 8&Aring;, so exchanging angle blocks does not affect the length of the linker. This means that the distance between the angle points is well defined, an essential aspect of our strategy of linker design.<br />
The software proceeds in three steps. First, it checks for the possibility of a direct linker made of a single alpha helix. For this, it applies the procedure just mentioned, with spheres whose radius reasonably corresponds to the length of the short flexible parts at the extremities of the protein. Second, it tests whether a linker containing two alpha helices connected at a right angle allows circularization. Finally, it searches for possible linkers with three angle points. The next parts explain those three steps in detail.<br />
This method was chosen because it could be implemented easily and efficiently in our program. However, this strategy also generates paths that cross the protein. Therefore we put considerable effort into sorting out the paths.<br />
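As a minimal sketch of the displacement step described in this section, the following code advances from a point by a vector given in spherical coordinates and enumerates the resulting sphere of candidate angle points. The function names are illustrative assumptions and do not come from the CRAUT source.<br />

```python
import math

def advance(point, theta_deg, phi_deg, length):
    """Add a displacement vector, given by two spherical angles and a length."""
    t, p = math.radians(theta_deg), math.radians(phi_deg)
    return (point[0] + length * math.sin(t) * math.cos(p),
            point[1] + length * math.sin(t) * math.sin(p),
            point[2] + length * math.cos(t))

def sphere_points(center, radius, step_deg=5):
    """Candidate angle points at a fixed distance, on a discrete angular grid."""
    return [advance(center, theta, phi, radius)
            for theta in range(0, 181, step_deg)
            for phi in range(0, 360, step_deg)]
```

With the 5-degree increment this grid has 37 polar and 72 azimuthal values, so each sphere carries 2664 candidate points (the poles are counted once per azimuth).<br />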
<br />
===Step 1===<br />
As a simple rigid linker with no angle would be easier to design and likely more thermostable than the ones containing angles, the software first checks whether this simple solution is possible ###figure needed###. For this, we took into account the fact that proteins have some flexible amino acids at their extremities. This flexible part may come from the protein itself, but also from the 2 glycines that are included at the N-terminal part and from the extein at the C-terminal part. The latter two parts come from our linkers. These parts have no preferential angles and offer a large number of possibilities to insert fitting linkers. But this flexibility is also a drawback, as we have to include this large number of possible angles and lengths in our path search.<br />
In this first step, the software explicitly takes these flexible parts into account to check for the possibility of straight linkers. As the angles and the length of the flexible parts are variable, the software positions their extremities on a sphere centered on the last fixed position of the structure, as explained above. The radius of this sphere is incremented in a discrete manner, in 4 steps, from 5.25 &Aring; to the maximum length of the flexible part. ###figure###<br />
Then all possible straight segments between the points originating from the two extremities are tested. If they are closer than 5 &Aring; to the protein, or if they cross it, they are rejected. If they are kept, the software checks whether the length of the segment is compatible with the feasible alpha helices: if the length of a given segment equals one of the 8 alpha helix lengths plus or minus 0.75 &Aring;, the path is saved.<br />
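The length test at the end of Step 1 could look like the sketch below. The 8 rod lengths are not listed on this page, so the values in <code>HELIX_LENGTHS</code> are placeholders chosen purely for illustration; the function names are likewise hypothetical, and the 5 &Aring; clearance check is left out here.<br />

```python
import math

# Placeholder values only: the real 8 rod lengths come from the modeling page.
HELIX_LENGTHS = (12.0, 18.0, 24.0, 30.0, 36.0, 42.0, 48.0, 54.0)
TOLERANCE = 0.75  # angstroms, as stated in the text

def matching_helix_length(start, end, helix_lengths=HELIX_LENGTHS, tol=TOLERANCE):
    """Return the rod length that fits the segment start-end, or None."""
    distance = math.dist(start, end)
    for length in helix_lengths:
        if abs(distance - length) <= tol:
            return length
    return None

def straight_linker_candidates(n_points, c_points, helix_lengths=HELIX_LENGTHS):
    """Pair up points from both flexible ends and keep the pairs a rod can span."""
    candidates = []
    for a in n_points:
        for b in c_points:
            length = matching_helix_length(a, b, helix_lengths)
            if length is not None:
                candidates.append((a, b, length))
    return candidates
```

Here <code>n_points</code> and <code>c_points</code> stand for the sphere points generated around the last fixed position at each terminus.<br />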
===Step 2===<br />
The next possibility for designing more complex rigid linkers, while still taking flexible ends into account within a reasonable calculation time, was to reduce the number of possible angles. As we originally thought that a 90° angle would be practically feasible, the software was designed to generate linkers with flexible ends and one 90° angle. ###figure, protein_one_90° angle### This choice was notably made because the edge lengths of a right triangle are simple to calculate. We already saw in Step 1 that the length of an edge can only take 8 different values. As the linkers have to start from the extremities of the protein, and as we impose a right angle, the number of possible paths is low, making them easy to compute. In practice, the extremities of the protein are positioned in a flexible way as in Step 1. From each of the positions allowed by this flexibility, the software searches for all the allowed right triangles. This restriction was mainly made because the degrees of freedom needed to be limited to keep the calculations feasible. <br />
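One plausible reading of the right-triangle search is sketched below: for two end positions separated by a given distance, it lists the pairs of rod lengths that form a right angle, and locates the corner by elementary geometry. This is an assumption about how the search could work, not the documented algorithm, and the function names are invented for the example.<br />

```python
import math
from itertools import product

def right_angle_pairs(end_to_end, helix_lengths, tol=0.75):
    """Pairs (a, b) of rod lengths whose right-angle hypotenuse spans end_to_end."""
    return [(a, b) for a, b in product(helix_lengths, repeat=2)
            if abs(math.hypot(a, b) - end_to_end) <= tol]

def corner_circle(end_to_end, a, b):
    """Locate the 90-degree corner relative to the axis joining the two ends.

    Returns (offset, radius): the corner lies on a circle of this radius around
    the axis, centred `offset` along the axis from the end where rod `a` starts.
    The formulas are exact when a*a + b*b equals end_to_end squared.
    """
    offset = a * a / end_to_end
    radius = a * b / end_to_end
    return offset, radius
```

Because the corner can sit anywhere on that circle, each accepted pair of lengths still leaves one rotational degree of freedom to be screened against the protein.<br />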
===Step 3===<br />
Finally, the software can also find paths with up to 4 edges, meaning 4 alpha helices and 3 angles. Thanks to the modularity of the possible linkers, such paths can in theory circularize any kind of protein. ###figure of torus###<br />
To keep the calculation feasible in a reasonable time, we designed the search strategy so that the flexible parts at the extremities are oriented in the same direction as the adjacent alpha helix. This obviously restricts the search, but as these orientations are allowed for the flexible parts, the approach remains valid. <br />
First, potential end points of the first alpha helical rod are calculated from the N-terminal point of the protein. The orientation is chosen in a discrete manner, with an increment of 5 degrees for the two angles of the spherical coordinates. The distance from the origin corresponds to the 8 possible lengths allowed by the alpha helices, as already seen in Step 2, plus a length of 0, which mimics a linker with 3 instead of 4 edges. The exact same procedure is repeated to define all the potential end points of the second alpha helical rod, starting from all the possible end points of the first rod. Thanks to the possibility of a length of 0 for both the first and the second rods, the software also calculates paths with 2 edges. Then the same is done only once from the C-terminal point of the protein, defining 1 edge. The final step consists in checking whether the points originating from the N- and C-terminal points can be linked by a potential alpha helix, i.e. whether they are separated by the appropriate distance. If the distance between two points matches any of the potential alpha helix lengths to within 0.75 &Aring;, the path is saved. In the same way, if two points are closer than 0.75 &Aring; to each other, the path is also saved.<br />
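The two-sided search of Step 3 can be illustrated with the toy sketch below. It is a brute-force illustration under stated assumptions: the function names are invented, the C-terminal side is expanded with helix lengths only (whether a length of 0 is also allowed there is not stated), and no clearance filtering is done.<br />

```python
import math

def directions(step_deg):
    """Unit vectors on a discrete grid of spherical angles."""
    dirs = []
    for theta in range(0, 181, step_deg):
        for phi in range(0, 360, step_deg):
            t, p = math.radians(theta), math.radians(phi)
            dirs.append((math.sin(t) * math.cos(p),
                         math.sin(t) * math.sin(p),
                         math.cos(t)))
    return dirs

def extend_paths(paths, lengths, dirs):
    """Append one more angle point to every path, for every length and direction."""
    extended = []
    for path in paths:
        x, y, z = path[-1]
        for length in lengths:
            if length == 0.0:
                extended.append(path + (path[-1],))  # zero-length rod: edge skipped
                continue
            for dx, dy, dz in dirs:
                extended.append(path + ((x + length * dx,
                                         y + length * dy,
                                         z + length * dz),))
    return extended

def find_paths(n_term, c_term, helix_lengths, step_deg=5, tol=0.75):
    """Expand twice from the N-terminus, once from the C-terminus, then join."""
    dirs = directions(step_deg)
    with_zero = (0.0,) + tuple(helix_lengths)
    n_side = extend_paths(extend_paths([(n_term,)], with_zero, dirs), with_zero, dirs)
    c_side = extend_paths([(c_term,)], tuple(helix_lengths), dirs)
    found = []
    for a in n_side:
        for b in c_side:
            gap = math.dist(a[-1], b[-1])
            # Either the two points coincide, or one rod spans the gap.
            if gap <= tol or any(abs(gap - length) <= tol for length in helix_lengths):
                found.append(a + b[::-1])
    return found
```

With the 5-degree grid and 8 lengths this enumeration becomes very large, which is consistent with the roughly 1 billion paths mentioned in the sorting section; for a quick experiment a coarse grid such as <code>step_deg=90</code> keeps it small.<br />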
<br />
==Sorting out of paths==<br />
The previous part described the generation of paths that can connect the two extremities of the protein irrespective of the position of these paths relative to the protein. While this allows fast computation of the geometrical paths, it also implies that the paths that are not practically feasible need to be sorted out. This is the most time-consuming part of the computation, as about 1 billion paths are generated. Three criteria are considered for the sorting. The first one is the feasibility of the linker: can the software find angle patterns that correspond to the ones defined by the geometrical path? This question was part of the motivation for a large modeling effort (link) to determine the possible angles between consecutive alpha helices. This was achieved by analyzing the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling distribution of angles] between alpha helices found in the ArchDB database. As nearly any angle between 20 and 170 degrees could be found, only few paths were actually rejected at that step. The next criterion is the position of the angle points: if they lie inside the protein, the path is rejected. Finally, the software checks whether any atom of the protein is less than 5 &Aring; away from any of the alpha helices; if so, the path is also rejected.<br />
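The last criterion, the 5 &Aring; clearance between rods and protein atoms, amounts to a point-to-segment distance test. The sketch below shows one straightforward way to write it; the names are illustrative, and the special handling needed near the termini (where the path necessarily starts on the protein) is deliberately left out.<br />

```python
import math

def point_segment_distance(p, a, b):
    """Shortest distance between point p and the segment from a to b."""
    ab = [bi - ai for ai, bi in zip(a, b)]
    ap = [pi - ai for ai, pi in zip(a, p)]
    ab2 = sum(c * c for c in ab)
    if ab2 == 0.0:
        return math.dist(p, a)  # degenerate segment (zero-length rod)
    # Parameter of the orthogonal projection, clamped to stay on the segment.
    t = max(0.0, min(1.0, sum(x * y for x, y in zip(ap, ab)) / ab2))
    closest = [ai + t * c for ai, c in zip(a, ab)]
    return math.dist(p, closest)

def path_is_clear(angle_points, atoms, min_dist=5.0):
    """True if every rod of the path stays at least min_dist from every atom."""
    for a, b in zip(angle_points, angle_points[1:]):
        for atom in atoms:
            if point_segment_distance(atom, a, b) < min_dist:
                return False
    return True
```

Testing every atom against every rod is quadratic in the worst case; with about 1 billion candidate paths a spatial grid or similar index would presumably be needed in practice.<br />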
<br />
==Shifting paths to the patterns==<br />
<br />
The strategy described in Step 3 gives a certain freedom to the rod that connects the last two angle points generated from the N- and C-terminal points. As this freedom is not permitted by the alpha helix and the angle pattern, but is permitted by the flexible part, for example at the C-terminal end, the software slightly refines the path by rotating the segment that originates from the C-terminal point.<br />
<br />
==Weighting of paths==<br />
Before translating the paths into sequences and thus into linkers that can be expressed, each path needs to be evaluated for its potential capacity to enhance heat stability. For this we have identified different contributions that should be combined into one value that defines how good the linker may be and that consequently defines its ranking among all possible linkers. The smaller this value, the more we expect that the linker will enhance thermostability. An important step for all these contributions is the normalization, as explained in the next paragraphs.<br />
The first contribution we considered was the linker length. We assumed that a short linker is better to constrain the protein extremities, and that a long helix might give more flexibility. Because we wanted this value to be independent of the size of the protein, the length of the linker is normalized to the distance between the two termini.<br />
The second contribution relates to the angles used in the linker. We learned from the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling] that angles formed by a certain angle pattern follow a certain distribution. First, we assumed that the narrower the distribution, the more likely the alpha helices would actually produce this angle. Second, the angles found by the software should be as close as possible to those well-defined angles. In this case, the weight value from this contribution should be low.<br />
Then the distance of the linker to the protein is taken into account. Because the linkers should not disturb the protein in its normal environment, linkers that pass close to the protein surface are considered better linkers. The distance was defined as the minimal distance between the linker and all the atoms of the protein. As already mentioned for the sorting of the paths, a linker cannot come closer than 5 &Aring; and this distance was used for normalization of calculated distances.<br />
After this, the places a linker should avoid are taken into account. Each protein can interact with other molecules on some parts of its surface. The user can specify where and how big those parts are. If a linker passes in front of a potential binding domain, the value of the corresponding path goes to infinity, so that the linker is discarded. Conversely, the farther a linker is from a potential ligand binding domain, the smaller its weighting value. The user can also specify the importance of certain regions. In the end the total weighting is normalized to the number of binding domains.<br />
<br />
===Calibrating the weighting function===<br />
Every contribution has its own distribution. You can see an example in figure ### [[figure histogram_length_lys.png]], but all of them have different shapes. The aim is to find the paths that globally minimize all of these contributions. Therefore, for simplicity, the four contributions mentioned above were combined linearly in the weighting function:<br />
\[ W(p) = \alpha L(p) + \beta A(p) + \gamma D(p) + \delta u(p) \]<br />
where W is the final weighting, p the path, L the length contribution, A the angle contribution, D the distance contribution and u the contribution from the forbidden regions. $\alpha, \beta, \gamma, \delta$ are the weighting constants that needed to be found. The normalization of each contribution was chosen so that each of them is dimensionless and that all of them have reasonably similar values.<br />
The weighting constants were obtained from the [https://2014.igem.org/Team:Heidelberg/Project/Linker_Screening linker screening] performed with lysozyme and the [https://2014.igem.org/Team:Heidelberg/Modeling/Enzyme_Modeling modeling of the enzyme activity]. Their calculation is presented in the results below.<br />
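The linear combination above can be sketched in a few lines of Python. This is only an illustration and not the code of our software: the names <code>Contributions</code> and <code>path_weight</code> are hypothetical, and the four contributions are assumed to be already normalized as described above. A path crossing a forbidden region is represented by an infinite value of u(p), so it always ends up at the bottom of the ranking.<br />

```python
import math
from dataclasses import dataclass


@dataclass
class Contributions:
    """Normalized, dimensionless contributions of one path (hypothetical container)."""
    length: float     # L(p)
    angle: float      # A(p)
    distance: float   # D(p)
    forbidden: float  # u(p); math.inf if the path passes in front of a binding region


def path_weight(c: Contributions, alpha: float, beta: float,
                gamma: float, delta: float) -> float:
    """W(p) = alpha*L(p) + beta*A(p) + gamma*D(p) + delta*u(p); smaller is better."""
    if math.isinf(c.forbidden):
        # Discard the path regardless of delta (also avoids 0 * inf = nan).
        return math.inf
    return (alpha * c.length + beta * c.angle
            + gamma * c.distance + delta * c.forbidden)
```

Sorting all paths by increasing <code>path_weight</code> then gives the ranking, with the best candidate first.<br />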
<br />
==Translating paths to sequence==<br />
As already mentioned before the software is provided with two databases, one for the possible angle patterns and one for the helix patterns. The choice of the patterns was inspired by known crystal structures extracted from databases and described in different papers.<br />
A large in silico screening for refining the preferences of the patterns was then set up using the [https://2014.igem.org/Team:Heidelberg/Software/igemathome distributed computing] system. For the complete description of the search for suitable patterns, one can read the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling] page. <br />
All the possible paths are now split up at the angles and compared with the possible patterns in the databases. ###Figure needed to explain, how the path is translated### The most suitable patterns are identified and concatenated to build the path's sequence. It is important to notice that this is only possible because of the modularity of our linker patterns used as building blocks: each block, being an alpha helix or an angle pattern, is not affected by the others. Thus for each possible path, one sequence is produced.<br />
<br />
==Clustering of paths==<br />
Many different paths are represented by the same sequence [[###Figure that shows, different paths have same properties, already before in the text ###]] and we therefore clustered such paths. The weights for those clustered paths were then calculated by averaging the weights of the different paths that compose a cluster.<br />
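A minimal sketch of this clustering step, assuming each path has already been translated and weighted and is available as a (sequence, weight) pair. The function name <code>cluster_by_sequence</code> is hypothetical and the code only illustrates the averaging described above.<br />

```python
from collections import defaultdict
from statistics import mean


def cluster_by_sequence(paths):
    """Group (sequence, weight) pairs by sequence and average the weights.

    Returns a list of (sequence, mean weight, number of paths),
    sorted so that the smallest (best) mean weight comes first.
    """
    groups = defaultdict(list)
    for sequence, weight in paths:
        groups[sequence].append(weight)
    clusters = [(seq, mean(ws), len(ws)) for seq, ws in groups.items()]
    return sorted(clusters, key=lambda cluster: cluster[1])
```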
<br />
<br />
=Results=<br />
==DNMT1==<br />
A major motivation of our effort to design rigid linkers with angles was the [https://2014.igem.org/Team:Heidelberg/Project/PCR_2.0 circularization of the DNA methyltransferase Dnmt1]. The truncated form used in our project is composed of 900 amino acids and the N- and C-terminal extremities are well separated. To circularize it, two linkers were designed: a flexible one made of glycine and serine, and a rigid one designed by the software. The rigid linkers for DNMT1 were obtained from an early version of the software. At that time the calculation took 11 days on a laptop computer with an Intel i5 processor and 8 GB of RAM, which shows the importance of a distributed computing system for large proteins. Since then, the software has been improved considerably, reducing the calculation time for DNMT1 to about 1 day.<br />
<br />
==Feedback from wet lab==<br />
The results from the software for lysozyme were tested as described in the [https://2014.igem.org/Team:Heidelberg/Project/Linker_Screening linker-screening] part and evaluated as described in the [https://2014.igem.org/Team:Heidelberg/Modeling/Enzyme_Modeling enzyme modeling] part. We have performed a large linker screening on 10 different lysozymes with different linkers. As the purpose of the lysozyme screen was the calibration of the software, those linkers (Table 1) were designed according to the four contributions previously mentioned. One of them was the shortest possible, one had the best possible angle, and so on.<br />
{| class="table table-hover"<br />
|+ '''Table 1''': Linkers and their amino acid sequence. Green: attachment sequences to prevent the flexible regions from being perturbed; Blue: angle; Purple: extein.<br />
! Linker<br />
! Amino acid sequence<br />
|- <br />
|<br />
|<br />
|-<br />
|'''Very good linkers'''<br />
|<br />
|-<br />
| sgt2<br />
| <span style="color:#3ADF00;">GG</span>AEAAAK<span style="color:#00BFFF;">AAAHPEA</span>AEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| rigid<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAKAA<span style="color:#3ADF00;">P</span><span style="color:#A901DB;">RGKCWE</span><br />
|-<br />
| <br />
| <br />
|-<br />
| '''Average linkers'''<br />
|<br />
|-<br />
| may1<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAKA<span style="color:#00BFFF;">AAAHPEA</span>AEAAAKEAAAKA<span style="color:#00BFFF;">KTA</span>AEAAAKEAAAKA<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| ord1<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAK<span style="color:#00BFFF;">ATGDLA</span>AEAAAKAA<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| ord3<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAK<span style="color:#00BFFF;">ASLPAA</span>AEAAAKEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
|<br />
| <br />
|-<br />
| '''Short linkers'''<br />
| <br />
|-<br />
| sho1<br />
| <span style="color:#3ADF00;">GG</span><span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| sho2<br />
| <span style="color:#3ADF00;">GG</span>AEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
|}<br />
[[circ_lam_lys_nils.png]] comparison with [[nice_linker_lysozyme_flexible_ends.png]]<br />
###About predictions of software###<br />
In the end we obtained a ranking of the in vitro tested linkers from the [https://2014.igem.org/Team:Heidelberg/Project/Linker_Screening linker-screening] and chose the parameters $\alpha, \beta, \gamma, \delta$ of the weighting function so that the ranking from the software reproduced the ranking from the assays. The final values, $\alpha = 1.85 \times 10^{-6}, \beta = 0.57, \gamma = 50.8 \times 10^{6}$, set up a function that could reproduce the ranking observed in the wet lab experiments.<br />
<br />
=Discussion=<br />
The software described here allowed us to design rigid linkers with well-defined angles. This represents a major advance compared to previous approaches like [[#References|[2]]] as these linkers can circularize any protein of known structure with any complex geometry.<br />
The feedback between the modeling and the experimental work on lysozyme activity was a crucial step in the development of the software. It allowed the testing of our approach and the calibration of the contribution of different features of the linkers to heat stability. This calibration was performed on one enzyme, and can be improved in the future by testing more enzymes. It will also be refined thanks to a complete modeling and analysis of protein structures with linkers.<br />
<br />
=References=<br />
<br />
[1] Thornton, J.M. & Sibanda, B.L. Amino and carboxy-terminal regions in globular proteins. Journal of molecular biology 167, 443-460 (1983).<br />
<br />
[2] Wang, C.K.L., Kaas, Q., Chiche, L. & Craik, D.J. CyBase: A database of cyclic protein sequences and structures, with applications in protein discovery and engineering. Nucleic Acids Research 36, (2008).</div>Igemnilshttp://2014.igem.org/Team:Heidelberg/Modeling/Enzyme_ModelingTeam:Heidelberg/Modeling/Enzyme Modeling2014-10-17T23:34:49Z<p>Igemnils: </p>
<hr />
<div>{{:Team:Heidelberg/templates/wikipage_new|<br />
title=ENZYME MODELING<br />
|<br />
white=true<br />
|<br />
red-logo=true<br />
|<br />
header-img=<br />
|<br />
header=background-color:#DE4230<br />
|<br />
header-bg=black<br />
|<br />
subtitle= Modeling of lysozyme activity with product inhibition<br />
|<br />
container-style=background-color:white;<br />
|<br />
titles={{:Team:Heidelberg/templates/title|Introduction}}<br />
|<br />
abstract=<br />
....<br />
|<br />
content=<br />
<div class="col-lg-12"><br />
{{:Team:Heidelberg/pages/Enzyme_Modeling}}<br />
</div><br />
|<br />
white-logo=<br />
|<br />
red=<br />
}}<br />
<br />
{{:Team:Heidelberg/templates/mathjax}}</div>Igemnilshttp://2014.igem.org/Team:Heidelberg/Modeling/Enzyme_ModelingTeam:Heidelberg/Modeling/Enzyme Modeling2014-10-17T23:34:02Z<p>Igemnils: </p>
<hr />
<div>{{:Team:Heidelberg/templates/wikipage_new|<br />
title=ENZYME MODELING<br />
|<br />
white=true<br />
|<br />
red-logo=true<br />
|<br />
header-img=<br />
|<br />
header=background-color:#DE4230<br />
|<br />
header-bg=black<br />
|<br />
subtitle= modeling of lysozyme activity using enzymatic activity modeling with product inhibition<br />
|<br />
container-style=background-color:white;<br />
|<br />
titles={{:Team:Heidelberg/templates/title|Introduction}}<br />
|<br />
abstract=<br />
....<br />
|<br />
content=<br />
<div class="col-lg-12"><br />
{{:Team:Heidelberg/pages/Enzyme_Modeling}}<br />
</div><br />
|<br />
white-logo=<br />
|<br />
red=<br />
}}<br />
<br />
{{:Team:Heidelberg/templates/mathjax}}</div>Igemnilshttp://2014.igem.org/Team:Heidelberg/pages/Linker_SoftwareTeam:Heidelberg/pages/Linker Software2014-10-17T23:31:15Z<p>Igemnils: /* Discussion */</p>
<hr />
<div>=Abstract=<br />
As already [https://2014.igem.org/Team:Heidelberg/Toolbox/Circularization introduced], artificially circularized proteins may gain some heat stability by restraining the C- and N-terminus from moving around freely. This circularization may be trivial when the protein termini are very close to each other, which seems to be reasonably common [[#References|[1]]]. However, if the ends are too far from each other, a long linker is needed to connect them. This linker should not change the natural conformation of the protein and should constrain the relative position of the ends to restrict the degrees of freedom and thus to stabilize the structure even when heated up. In addition, these linkers should not affect any of the protein functions. Consequently it is important to prevent linkers from passing through the active site or from covering binding domains for other molecules, for example. Therefore one needs to be able to define the shape of possible linkers. This section describes the software we developed to design such linkers. We would like to stress that this work has been made possible thanks to the feedback between computer modeling and experimental work: we could first design linkers in silico, test them experimentally and use the results to further calibrate the software. To our knowledge, this is the first time that such an approach has been used to custom-design rigid linkers with angles to connect protein extremities.<br />
<br />
=General procedure=<br />
In short, the software can provide a weighted list of linkers to circularize any protein of interest with a known structure. Those linkers are made of rigid alpha-helical segments connected with defined angles. Contrary to flexible linkers, those rigid linkers were expected to constrain the protein extremities and to confer better heat stability. Such an idea was already developed [[#References|[2]]] but only with alpha helices defining simple rods, and without any possibility to introduce angles. To generate those linkers, we first defined the geometrical paths, with segments and angles, that they should follow. The geometrical paths that are biologically feasible are afterwards translated into amino acid sequences. Both the compatibility of paths with possible structures and the translation were made possible thanks to our [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling approaches]. The first approach consisted in performing a statistical analysis of more than 17000 known non-homologous structures containing alpha helices connected with angles. For the second approach, we modeled the conformation of linkers circularizing proteins of known structure and analyzed them for certain properties. This second approach was run for a large number of proteins thanks to our distributed computing system [https://2014.igem.org/Team:Heidelberg/Software/igemathome igemathome]. The software provides different possible linkers with weights that rank the linkers according to their capacity to maintain protein activity at higher temperatures. These weights were calibrated thanks to an extensive [https://2014.igem.org/Team:Heidelberg/Project/Linker_Screening linker screening] on the target protein lambda-lysozyme, using the first modeling approach.<br />
<br />
=Background=<br />
Classically, protein linkers were designed in three different manners. The easiest way is to define the length that a linker should cover and then simply use a flexible glycine-serine peptide with the right number of amino acids to match this length. Glycine is used for flexibility, as it has no side chain and does not produce any steric hindrance, while serine is used for solubility, as it has a small polar side chain. This solubility is important, as the linkers should not pass through the hydrophobic core of the protein, but should be dissolved in the surrounding medium. These flexible linkers were normally used for circularization but also for connecting different proteins, when the main goal is that the different parts are connected, but not how they are connected, or when the flexibility of the linker was required for specific applications. <br />
<br />
A second strategy consists in using rigid helical linkers to keep proteins or protein domains at a certain distance from each other. This is especially important for signalling proteins and fluorescent proteins. One major property of alpha helices is that they always fold in a defined way with well-defined angles and lengths. There are also many different helical patterns that differ in stability and solubility. Helices have been used to design circularizing linkers [[#References|[2]]], but one big disadvantage of this strategy is that one can only build straight linkers with helices. So in the context of circularization, if the straight line connecting the protein extremities crosses the protein, this strategy is not an option.<br />
<br />
The third option, which served as a base to develop our approach and which came from discussions with the group of Rebecca Wade in Heidelberg, Germany, consists in designing custom-tailored linkers for each specific application. These linkers can be obtained from protein structure prediction. At first one needs to define the path that the linker should take to connect the protein ends. Afterwards, one designs a possible linker sequence that might fit well. Next one makes a structure prediction of the linker attached to the protein to validate the design. Several different linkers, with slight changes, can be compared. This is repeated several times until the linker effectively follows the expected path. This method requires a strong knowledge of protein folding and protein structure prediction and is computationally intensive. On the other hand, the benefit can be important, as the interaction of the linker with the protein surface can be taken into account and as one can accurately define the path taken by the linker to the resolution of the protein structure.<br />
<br />
We have set up a completely new strategy to design rigid linkers. As further detailed in the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling] part, it is possible to define the shape of a linker by combining rigid alpha-helical rods with well-defined angle patterns. Therefore, by defining, in a geometrical way, the possible paths of the circularizing linkers for a given protein, we can then propose potential linkers. This definition of the geometrical path can be very difficult, especially for large proteins with complex shapes. Moreover, this definition is further constrained by the fact that linkers must avoid hiding active sites of the protein of interest. Finally, the paths have rotational degrees of freedom at the extremities of the protein, and depending on their orientation, they may or may not match the geometry of the protein. The tool we present here covers the two steps: defining geometrical paths with some weights and translating them into feasible linkers, also with weights. This tool is universal as it has the capacity to design circularizing linkers for any protein with a known structure. Moreover it is modular as, thanks to our [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling approach], we have designed linkers as exchangeable blocks of rods of different lengths and of angle patterns. The following sections detail the different steps followed by our software to design proper linkers.<br />
<br />
<br />
==PDB parsing==<br />
At first, the PDB file containing the structure of the target protein is parsed and the coordinates of the atoms are stored in metric units. After this, some initial tests are performed on the protein structure. First, we checked whether the C- and N-termini lie on the surface of the protein and are accessible to the solvent, which is crucial for circularization. We defined a line originating from an extremity of the protein with the two angles of the spherical coordinates around the z-axis. From that, we could determine the accessible angles by rejecting all the lines that are too close to the protein. As the future linker will be made of alpha helices and will therefore have a radius of 5 &Aring;, we used this length as the minimal allowed distance.<br />
Those allowed angles are stored for the coming linker generation. ###fig needed### <br />
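As an illustration of the parsing step, the following minimal Python sketch reads the atom coordinates from the fixed-width <code>ATOM</code> records of a PDB file (x, y and z occupy columns 31-38, 39-46 and 47-54 of the record). The function name <code>parse_pdb_atoms</code> is hypothetical and this is not the parser used in our software. Heteroatoms such as water molecules are ignored here, which is an assumption of this sketch.<br />

```python
from typing import List, Tuple

Atom = Tuple[str, float, float, float]


def parse_pdb_atoms(pdb_text: str) -> List[Atom]:
    """Return (atom name, x, y, z) for every ATOM record.

    Coordinates are kept in angstroms, the unit used by the PDB format.
    """
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            name = line[12:16].strip()   # columns 13-16: atom name
            x = float(line[30:38])       # columns 31-38
            y = float(line[38:46])       # columns 39-46
            z = float(line[46:54])       # columns 47-54
            atoms.append((name, x, y, z))
    return atoms
```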
<br />
==Generation of geometric paths==<br />
As our strategy consists in building linkers with helical rods and connecting angles, a path is completely defined by the coordinates of the angle points. Advancing one step from an existing point is always done by adding a displacement vector to this point. This vector is defined by the two spherical angles, chosen here in a discrete manner with an increment of 5 degrees, and by a length, also chosen in a discrete manner. This discrete length was used in two different contexts: it may correspond to the length of an alpha helix or to the length of the flexible part that appears at the extremity of the protein. The coordinates that are reached thanks to this vector define the new coordinates of an angle point, no matter if the vector corresponds to an alpha helix or to a flexible part. ###figure, points + vectors###. As we screen for all possible angles in a discrete manner, those angle point coordinates are regularly distributed on a sphere. As further detailed in the next sections, those spheres are defined from both ends, either once or in several steps. Then the software checks for possible straight connections of given lengths for each pair of angle points originating from both extremities. ###figure needed, with all these possible steps, one sphere around each point and checking for the connections, 2D, different lengths in two different figures, ###<br />
The linkers are built in a modular way, with blocks of well-defined size. From the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling] of potential linkers, we could derive 8 different alpha-helical rods, all with different lengths. In addition, the length of the two segments inside an angle block was always 8&Aring;, so exchanging angle blocks does not affect the length of the linker. This means that the distance between the angle points is well defined, an essential aspect of our strategy of linker design.<br />
The software proceeds in three steps. First, it checks for the possibility of a direct single alpha helix linker. For this, it applies the procedure just mentioned with spheres whose radius corresponds to the length of the short flexible parts at the extremities of the protein. Second, it tests if a linker containing two alpha helices connected with a right angle allows the circularization. Finally it searches for the possible linkers with three angle points. The next parts will explain those three steps in detail.<br />
This method has been chosen, because it could be implemented easily and efficiently in our program. However, this strategy generated paths that crossed the protein. Therefore we put big efforts in the sorting out of the paths.<br />
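The generation of candidate angle points around one starting point can be sketched as follows. This is a simplified illustration, not our implementation: the function name <code>sphere_points</code> is hypothetical, and merging the duplicated points at the two poles is a choice made for this sketch only.<br />

```python
import math
from typing import List, Tuple

Vec = Tuple[float, float, float]


def sphere_points(origin: Vec, radius: float, step_deg: int = 5) -> List[Vec]:
    """Candidate angle points at a fixed distance from origin.

    The two spherical angles are sampled on a regular grid with the
    given increment (5 degrees in the text).
    """
    ox, oy, oz = origin
    points = []
    for theta_deg in range(0, 181, step_deg):       # polar angle from the z-axis
        theta = math.radians(theta_deg)
        # At the poles every azimuth gives the same point, so keep only one.
        phis = [0] if theta_deg in (0, 180) else range(0, 360, step_deg)
        for phi_deg in phis:
            phi = math.radians(phi_deg)
            points.append((
                ox + radius * math.sin(theta) * math.cos(phi),
                oy + radius * math.sin(theta) * math.sin(phi),
                oz + radius * math.cos(theta),
            ))
    return points
```

With an increment of 5 degrees this yields 2522 points per sphere and per length, which illustrates why the number of paths grows so quickly when several steps are chained.<br />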
<br />
===Step 1===<br />
As a simple rigid linker with no angle would be easier to design and likely more thermostable than the ones containing angles, the software first checks if this simple solution is possible ###figure needed###. For this, we took into account the fact that proteins have some flexible amino acids at their extremities. This flexible part may come from the protein itself, but also from the 2 glycines that are included at the N-terminal part and from the extein at the C-terminal part. The latter two parts come from our linkers. Those parts have no preferential angles and offer a large number of possibilities to insert fitting linkers. But this flexibility is also a drawback, as we have to include this large number of possible angles and lengths in our path search.<br />
In this first step, the software explicitly takes these flexible parts into account to check for the possibility of straight linkers. As the angles and the length of the flexible parts are variable, the software positions their extremity on a sphere centered on the last fixed position of the structure as explained above. The radius of this sphere is incremented in a discrete manner, in 4 steps, from 5.25 &Aring; to the maximum length of the flexible part. ###figure###<br />
Then all possible straight segments between the points generated from both extremities are tested. If they are closer than 5 &Aring; to the protein, or if they cross it, then they are rejected. If they are kept, then the software checks whether the length of the segments is compatible with the feasible alpha helices: if the length of a given segment equals one of the 8 alpha helix lengths plus or minus 0.75 &Aring;, then the path is eventually saved.<br />
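The length test at the end of Step 1 can be illustrated with the short Python sketch below. The 8 helix lengths are not listed on this page, so the values in <code>HYPOTHETICAL_HELIX_LENGTHS</code> are placeholders chosen only for the example. The tolerance of 0.75 &Aring; is the one given above. The function name <code>matching_helix</code> is also hypothetical.<br />

```python
from typing import Optional, Sequence

# Placeholder values in angstroms: the text states that 8 helix lengths
# exist but does not list them, so these numbers are purely illustrative.
HYPOTHETICAL_HELIX_LENGTHS = (10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0, 45.0)
TOLERANCE = 0.75  # angstroms, as stated in the text


def matching_helix(segment_length: float,
                   helix_lengths: Sequence[float] = HYPOTHETICAL_HELIX_LENGTHS,
                   tolerance: float = TOLERANCE) -> Optional[float]:
    """Return the helix length closest to the segment if it lies within
    the tolerance, otherwise None (the segment is then not saved)."""
    best = min(helix_lengths, key=lambda h: abs(h - segment_length))
    return best if abs(best - segment_length) <= tolerance else None
```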
===Step 2===<br />
The next possibility to design more complex rigid linkers, while still taking flexible ends into account with a reasonable calculation time, was to reduce the number of possible angles. As we originally thought that a 90° angle would be practically feasible, the software was designed to generate linkers with flexible ends and one 90° angle. ###figure, protein_one_90° angle### This choice was notably made because the edge lengths of a right triangle are simple to calculate. We already saw in Step 1 that the length of an edge can only take 8 different values. As the linkers have to start from the extremities of the protein, and as we impose a right angle, the number of possible paths is therefore low, making them easy to compute. Practically, the extremities of the protein are positioned in a flexible way as in Step 1. From each of the positions allowed by this flexibility, the software searches for all the allowed right triangles. This restriction was mainly introduced to limit the degrees of freedom and keep the calculations feasible. <br />
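The right-triangle search can be sketched as follows: the two helices are the legs of the triangle and the distance between the two flexible end positions is the hypotenuse, so a pair of helix lengths (a, b) fits if the square root of a&sup2; + b&sup2; matches this distance. The function name <code>right_angle_pairs</code> is hypothetical, and applying the 0.75 &Aring; tolerance of Step 1 here is an assumption of this sketch.<br />

```python
import math
from itertools import product
from typing import List, Sequence, Tuple


def right_angle_pairs(end_to_end: float,
                      helix_lengths: Sequence[float],
                      tolerance: float = 0.75) -> List[Tuple[float, float]]:
    """Ordered pairs of helix lengths (a, b) that, joined by a 90 degree
    angle, span the given end-to-end distance within the tolerance."""
    return [(a, b)
            for a, b in product(helix_lengths, repeat=2)
            if abs(math.hypot(a, b) - end_to_end) <= tolerance]
```

For each accepted pair, the position of the angle point still has to be placed in space and checked against the protein, as described in the sorting section.<br />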
===Step 3===<br />
Finally the software also provides the possibility to find paths with up to 4 edges, meaning 4 alpha helices and 3 angles. Thanks to the modularity of the possible linkers, such paths can offer the possibility to circularize theoretically any kind of protein. ###figure of torus###<br />
To keep the calculation feasible in a reasonable time, we designed the search strategy so that the flexible parts at the extremities are oriented in the same direction as the consecutive alpha helix. This obviously restricts the search, but as these orientations are allowed for the flexible part, the resulting paths remain valid. <br />
First, potential ending points of the first alpha-helical rod are calculated from the N-terminal point of the protein. The orientation is chosen in a discrete manner, with an increment of 5 degrees for the two angles of the spherical coordinates. The distance from the origin corresponds to the 8 possible lengths allowed by the alpha helices, as already seen in Step 2, plus a length of 0, which mimics a linker with 3 instead of 4 edges. The exact same procedure is repeated to define all the potential ending points of the second alpha-helical rod starting from all the possible ending points of the first alpha-helical rod. Thanks to the possibility of a length of 0 for the first and the second rods, the software also calculates paths with 2 edges. Then, the same is done only once from the C-terminal point of the protein, defining 1 edge. The final step consists in checking if the points originating from the N- and C-terminal points can be linked by a potential alpha helix, i.e. if they are separated by the appropriate distance. If any of the potential alpha helix lengths lies within the distance between two points plus or minus 0.75 &Aring;, then the path is eventually saved. In the same way, if two points are closer than 0.75 &Aring; to each other, then the path is also saved.<br />
<br />
==Sorting out of paths==<br />
The previous part described the generation of paths that can connect the two extremities of the protein irrespective of the position of these paths relative to the protein. While this allows a fast computing of the geometrical paths, this also implies that the paths that are not practically feasible need to be sorted out. This is the most time-consuming part of the computing, as about 1 billion paths are generated. Three criteria are considered for the sorting. The first one is the feasibility of the linker: can the software find angle patterns that correspond to the ones defined by the geometrical path? This question was part of the motivation for a large modeling effort to determine the possible angles between consecutive helices. This was achieved by analyzing the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling distribution of angles] between alpha helices found in the ArchDB database. As nearly any angle could be found between 20 and 170 degrees, only few paths were actually rejected at that step. The next criterion was the position of the angle points: if they appear inside the protein, then the path is rejected. Finally, the software checks if any of the atoms of the protein is less than 5 &Aring; away from any of the alpha helices, in which case the path is also rejected.<br />
<br />
==Shifting paths to the patterns==<br />
<br />
The strategy described in Step 3 gives a certain freedom for the rod that connects the last two angle points that were generated from the N- and C-terminal points. As this freedom is actually not permitted by the alpha helix and the angle pattern, but is permitted by the flexible part, for example at the C-terminal end, the software slightly refines the path by rotating the segment that originates from the C-terminal point.<br />
<br />
==Weighting of paths==<br />
Before translating the paths into sequences and thus into linkers that can be expressed, each path needs to be evaluated for its potential capacity to enhance heat stability. For this we have identified different contributions that should be combined into one value that defines how good the linker may be and that consequently defines its ranking among all possible linkers. The smaller this value, the more we expect that the linker will enhance thermostability. An important step for all these contributions is the normalization, as explained in the next paragraphs.<br />
The first contribution we considered was the linker length. We assumed that a short linker is better to constrain the protein extremities, and that a long helix might give more flexibility. Because we wanted this value to be independent of the size of the protein, the length of the linker is normalized to the distance between the two termini.<br />
The second contribution relates to the angles used in the linker. We learned from the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling] that angles formed by a certain angle pattern follow a certain distribution. First, we assumed that the narrower the distribution, the more likely the alpha helices would actually produce this angle. Second, the angles found by the software should be as close as possible to those well-defined angles. In this case, the weight value from this contribution should be low.<br />
Then the distance of the linker to the protein is taken into account. Because the linkers should not disturb the protein in its normal environment, linkers that pass close to the protein surface are considered better linkers. The distance was defined as the minimal distance between the linker and all the atoms of the protein. As already mentioned for the sorting of the paths, a linker cannot come closer than 5 &Aring; and this distance was used for normalization of calculated distances.<br />
After this, the regions that a linker should avoid are taken into account. Each protein can interact with other molecules on some parts of its surface. The user can specify where and how big those parts are. If a linker passes in front of a potential binding domain, the value of the corresponding path goes to infinity, so that the linker is discarded. Conversely, the farther a linker is from a potential ligand-binding domain, the smaller its weighting value. The user can also specify the importance of certain regions. In the end, the total weighting is normalized to the number of binding domains.<br />
<br />
===Calibrating the weighting function===<br />
Every contribution has its own distribution. You can see an example in figure ### [[figure histogram_length_lys.png]], but all of them have different shapes. The aim is to find the paths that globally minimize all of these contributions. Therefore, for simplicity, the four contributions mentioned above were combined linearly in the weighting function:<br />
\[ W(p) = \alpha L(p) + \beta A(p) + \gamma D(p) + \delta u(p) \]<br />
where W is the final weighting, p the path, L the length contribution, A the angle contribution, D the distance contribution and u the contribution from the forbidden regions. $\alpha, \beta, \gamma, \delta$ are the weighting constants that needed to be found. The normalization of each contribution was chosen so that each of them is dimensionless and that all of them have reasonably similar values.<br />
The weighting constants were obtained from the [https://2014.igem.org/Team:Heidelberg/Project/Linker_Screening linker screening] performed with lysozyme and the [https://2014.igem.org/Team:Heidelberg/Modeling/Enzyme_Modeling modeling of the enzyme activity]. Their calculation is presented in the results below.<br />
<br />
==Translating paths to sequence==<br />
As already mentioned before the software is provided with two databases, one for the possible angle patterns and one for the helix patterns. The choice of the patterns was inspired by known crystal structures extracted from databases and described in different papers.<br />
A large in silico screening for refining the preferences of the patterns was then set up using the [https://2014.igem.org/Team:Heidelberg/Software/igemathome distributed computing] system. For the complete description of the search for suitable patterns, one can read the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling] page. <br />
All the possible paths are now split up at the angles and compared with the possible patterns in the databases. ###Figure needed to explain, how the path is translated### The most suitable patterns are identified and concatenated to build the path's sequence. It is important to notice that this is only possible because of the modularity of our linker patterns used as building blocks: each block, being an alpha helix or an angle pattern, is not affected by the others. Thus for each possible path, one sequence is produced.<br />
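The translation can be illustrated with the following Python sketch. The amino acid blocks are taken from the linkers of Table 1, but the helix lengths and angle values used as keys in <code>HELIX_DB</code> and <code>ANGLE_DB</code> are hypothetical placeholders, and choosing the nearest entry is a simplification of the selection of the most suitable pattern. All names in the sketch are illustrative.<br />

```python
from typing import Dict, Sequence

# Sequences appear in Table 1; the numeric keys (angstroms, degrees) are
# placeholders invented for this sketch, not values from our databases.
HELIX_DB: Dict[float, str] = {10.0: "AEAAAK", 17.5: "AEAAAKEAAAK"}
ANGLE_DB: Dict[float, str] = {60.0: "AAAHPEA", 120.0: "ASLPAA"}


def nearest(db: Dict[float, str], value: float) -> str:
    """Pattern whose key is closest to the requested length or angle."""
    return db[min(db, key=lambda key: abs(key - value))]


def path_to_sequence(helix_lengths: Sequence[float], angles: Sequence[float],
                     n_attachment: str = "GG", extein: str = "RGTCWE") -> str:
    """Concatenate helix and angle blocks in path order, framed by the
    N-terminal attachment and the C-terminal extein."""
    if len(angles) != len(helix_lengths) - 1:
        raise ValueError("a path with n helices needs n - 1 angles")
    parts = [n_attachment]
    for i, length in enumerate(helix_lengths):
        parts.append(nearest(HELIX_DB, length))
        if i < len(angles):
            parts.append(nearest(ANGLE_DB, angles[i]))
    parts.append(extein)
    return "".join(parts)
```

With these placeholder databases, a path made of two short helices and one angle, <code>path_to_sequence([10.2, 9.8], [58.0])</code>, produces the same block arrangement as the linker sgt2 of Table 1.<br />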
<br />
==Clustering of paths==<br />
Many different paths are represented by the same sequence [[###Figure that shows, different paths have same properties, already before in the text ###]] and we therefore clustered such paths. The weights for those clustered paths were then calculated by averaging the weights of the different paths that compose a cluster.<br />
<br />
<br />
=Results=<br />
==DNMT1==<br />
A major motivation of our effort to design rigid linkers with angles was the [https://2014.igem.org/Team:Heidelberg/Project/PCR_2.0 circularization of the DNA methyltransferase Dnmt1]. The truncated form used in our project is composed of 900 amino acids and the N- and C-terminal extremities are well separated. To circularize it, two linkers were designed: a flexible one made of glycine and serine, and a rigid one designed by the software. The rigid linkers for DNMT1 were obtained from an early state of the software. At that time the calculation took 11 days on a laptop computer with an Intel i5 processor and 8 GB of RAM, which shows the importance of a distributed computing system for large proteins. Since then, the software has improved considerably, reducing the calculation time to about 1 day for DNMT1.<br />
<br />
==Feedback from wet lab==<br />
The results from the software for lysozyme were tested as described in the [https://2014.igem.org/Team:Heidelberg/Project/Linker_Screening linker-screening] part and evaluated as described in the [https://2014.igem.org/Team:Heidelberg/Modeling/Enzyme_Modeling enzyme modeling] part. We have performed a large linker screening on 10 different lysozymes with different linkers. As the purpose of the lysozyme screen was the calibration of the software, those linkers (Table 1) were designed according to the four contributions previously mentioned. One of them was the shortest possible, one had the best possible angle, and so on.<br />
{| class="table table-hover"<br />
|+ '''Table 1''': Linkers and their amino acid sequence. Green: attachment sequences to prevent the flexible regions from being perturbed; Blue: angle; Purple: extein.<br />
! Linker<br />
! Amino acid sequence<br />
|- <br />
|<br />
|<br />
|-<br />
|'''Very good linkers'''<br />
|<br />
|-<br />
| sgt2<br />
| <span style="color:#3ADF00;">GG</span>AEAAAK<span style="color:#00BFFF;">AAAHPEA</span>AEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| rigid<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAKAA<span style="color:#3ADF00;">P</span><span style="color:#A901DB;">RGKCWE</span><br />
|-<br />
| <br />
| <br />
|-<br />
| '''Average linkers'''<br />
|<br />
|-<br />
| may1<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAKA<span style="color:#00BFFF;">AAAHPEA</span>AEAAAKEAAAKA<span style="color:#00BFFF;">KTA</span>AEAAAKEAAAKA<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| ord1<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAK<span style="color:#00BFFF;">ATGDLA</span>AEAAAKAA<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| ord3<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAK<span style="color:#00BFFF;">ASLPAA</span>AEAAAKEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
|<br />
| <br />
|-<br />
| '''Short linkers'''<br />
| <br />
|-<br />
| sho1<br />
| <span style="color:#3ADF00;">GG</span><span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| sho2<br />
| <span style="color:#3ADF00;">GG</span>AEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
|}<br />
[[circ_lam_lys_nils.png]] comparison with [[nice_linker_lysozyme_flexible_ends.png]]<br />
###About predictions of software###<br />
In the end we obtained a ranking of the in vitro tested linkers from the [https://2014.igem.org/Team:Heidelberg/Project/Linker_Screening linker-screening] and chose the parameters $\alpha, \beta, \gamma, \delta$ of the weighting function so that the ranking from the software represented the ranking from the assays. The final values, $\alpha = 1.85 \times 10^{-6}, \beta = 0.57, \gamma = 50.8 \times 10^{6}$, set up a function that could reproduce the ranking observed in the wet lab experiments.<br />
<br />
=Discussion=<br />
The software described here allowed us to design rigid linkers with well-defined angles. This represents a major advance compared to previous approaches like [[#References|[2]]] as these linkers can circularize any protein of known structure with any complex geometry.<br />
The feedback between the modeling and the experimental work on lysozyme activity was a crucial step in the development of the software. It allowed the testing of our approach and the calibration of the contribution of different features of the linkers to heat stability. This calibration was performed on one enzyme, and can improve in the future with the testing of more enzymes. This will also be refined thanks to a complete modeling and analysis of protein structures with linkers.<br />
<br />
=References=<br />
<br />
[1] Thornton, J.M. & Sibanda, B.L. Amino and carboxy-terminal regions in globular proteins. Journal of molecular biology 167, 443-460 (1983).<br />
<br />
[2] Wang, C.K.L., Kaas, Q., Chiche, L. & Craik, D.J. CyBase: A database of cyclic protein sequences and structures, with applications in protein discovery and engineering. Nucleic Acids Research 36, (2008).</div>Igemnilshttp://2014.igem.org/Team:Heidelberg/pages/Linker_SoftwareTeam:Heidelberg/pages/Linker Software2014-10-17T23:16:08Z<p>Igemnils: /* Feedback from wet lab */</p>
<hr />
<div>=Abstract=<br />
As already [https://2014.igem.org/Team:Heidelberg/Toolbox/Circularization introduced], artificially circularized proteins may gain some heat stability by restraining the C- and N-terminus from moving around freely. This circularization may be trivial when the protein termini are very close to each other, which seems to be reasonably common [[#References|[1]]]. However, if the ends are too far from each other, a long linker is needed to connect them. This linker should not change the natural conformation of the protein and should constrain the relative position of the ends to restrict the degrees of freedom and thus to stabilize the structure even when heated up. In addition, these linkers should not affect any of the protein functions. Consequently it is important to prevent linkers from passing through the active site or from covering binding domains to other molecules, for example. Therefore one needs to be able to define the shape of possible linkers. This section describes the software we developed to design such linkers. We would like to stress that this work has been made possible thanks to the feedback between computer modeling and experimental work: we could first design linkers in silico, test them experimentally and use the results to further calibrate the software. To our knowledge, this is the first time that such an approach has been used to custom-design rigid linkers with angles to connect protein extremities.<br />
<br />
=General procedure=<br />
In short, the software can provide a weighted list of linkers to circularize any protein of interest with a known structure. Those linkers are made of rigid alpha-helical segments connected with defined angles. Contrary to flexible linkers, those rigid linkers were expected to constrain the protein extremities and to confer better heat stability. Such an idea was already developed [[#References|[2]]] but only with alpha helices defining simple rods, and without any possibility to introduce angles. To generate those linkers, we first defined the geometrical paths, with segments and angles, that they should follow. The geometrical paths that are biologically feasible are afterwards translated into amino acid sequences. Both the compatibility of paths with possible structures and the translation were made possible thanks to our [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling approaches]. The first approach consisted in performing a statistical analysis of more than 17000 known non-homologous structures containing alpha helices connected with angles. For the second approach, we modeled the conformation of linkers circularizing proteins of known structure and analyzed them for certain properties. This second approach was run for a large number of proteins thanks to our distributed computing system [https://2014.igem.org/Team:Heidelberg/Software/igemathome igemathome]. The software provides different possible linkers with weights that rank the linkers according to their capacity to maintain protein activity at higher temperatures. They were generated thanks to an extensive [https://2014.igem.org/Team:Heidelberg/Project/Linker_Screening linker screening] on the target protein lambda-lysozyme, using the first modeling approach.<br />
<br />
=Background=<br />
Classically, protein linkers were designed in three different manners. The easiest way is to define the length that a linker should cover and then simply use a flexible glycine-serine peptide with the right number of amino acids to match this length. Glycine is used for flexibility, as it has no side chain and does not produce any steric hindrance, while serine is used for solubility, as it has a small polar side chain. This solubility is important, as the linkers should not pass through the hydrophobic core of the protein, but should be dissolved in the surrounding medium. These flexible linkers were normally used for circularization but also for connecting different proteins, when the main goal is that the different parts are connected, but not how they are connected, or when the flexibility of the linker was required for specific applications. <br />
<br />
A second strategy consists in using rigid helical linkers to keep proteins or protein domains at a certain distance from each other. This is especially important for signalling proteins and fluorescent proteins. One major property of alpha helices is that they always fold in a defined way with well-defined angles and lengths. There are also many different helical patterns that differ in stability and solubility. Although helices have been used to design circularizing linkers [[#References|[2]]], one big disadvantage of this strategy is that one can only build straight linkers with them. So in the context of circularization, if the straight line connecting the protein extremities crosses the protein, this strategy is not an option.<br />
<br />
The third option, which served as a base to develop our approach and which came from discussions with the group of Rebecca Wade in Heidelberg, Germany, consists in designing custom-tailored linkers for each specific application. These linkers can be obtained from protein structure prediction. At first one needs to define the path that the linker should take to connect the protein ends. Afterwards, one designs a possible linker sequence that might fit well. Next one makes a structure prediction of the linker attached to the proteins to validate the prediction. Several different linkers, with slight changes, can be compared. This is repeated several times until the linker effectively follows the expected path. This method requires a strong knowledge of protein folding and protein structure prediction and is computationally intensive. On the other hand, the benefit can be important as the interaction of the linker with the protein surface can be taken into account and as one can accurately define the path taken by the linker to the resolution of the protein structure.<br />
<br />
We have set up a completely new strategy to design rigid linkers. As further detailed in the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling] part, it is possible to define the shape of a linker by combining rigid alpha-helical rods with well-defined angle patterns. Therefore, by defining, in a geometrical way, the possible paths of the circularizing linkers for a given protein, we can then propose potential linkers. This definition of the geometrical path can be very difficult, especially for large proteins with complex shapes. Moreover, this definition is further constrained by the fact that linkers must avoid hiding active sites of the protein of interest. Finally the paths have rotational degrees of freedom at the extremities of the protein, and depending on their orientation, they may or may not match the geometry of the protein. The tool we present here covers the two steps: defining geometrical paths with some weights and translating them into feasible linkers, also with weights. This tool is universal as it has the capacity to design circularizing linkers for any protein with a known structure. Moreover it is modular as, thanks to our [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling approach], we have designed linkers as exchangeable blocks of rods of different lengths and of angle patterns. The following sections detail the different steps followed by our software to design proper linkers.<br />
<br />
<br />
==PDB parsing==<br />
At first, the PDB file containing the structure of the target protein is parsed and the coordinates of the atoms are stored, in the metric unit. After this, some initial tests are made with the protein structure. First, we checked whether the C- and N-termini lie on the surface of the protein and are accessible to the solvent, which is crucial for circularization. We defined a line originating from an extremity of the protein with the two angles of the spherical coordinates around the z-axis. From that, we could determine the accessible angles by rejecting all the lines that are too close to the protein. As the future linker will be made of alpha helices and will therefore have a radius of 5 &Aring;, we used this length as the minimal allowed distance.<br />
Those allowed angles are stored for the coming linker generation. ###fig needed### <br />
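The parsing step described above can be sketched as follows. This is a minimal illustration and not the team's actual parser: the function name `parse_atom_coordinates` is hypothetical, and the sketch assumes the standard fixed-column PDB layout in which the x, y and z coordinates of an `ATOM` record occupy columns 31-38, 39-46 and 47-54.

```python
def parse_atom_coordinates(pdb_text):
    """Return a list of (x, y, z) tuples, one per ATOM record.

    Assumes the fixed-column PDB format: x in columns 31-38,
    y in columns 39-46, z in columns 47-54 (1-indexed).
    """
    coordinates = []
    for line in pdb_text.splitlines():
        # Only protein atoms are kept; HETATM, TER, END etc. are skipped.
        if line.startswith("ATOM"):
            x = float(line[30:38])
            y = float(line[38:46])
            z = float(line[46:54])
            coordinates.append((x, y, z))
    return coordinates
```

The returned list can then be reused by every later geometric test, so the file only has to be read once.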
<br />
==Generation of geometric paths==<br />
As our strategy consists in building linkers with helical rods and connecting angles, a path is completely defined by the coordinates of the angle points. Advancing one step from an existing point is always done by adding a displacement vector to this point. This vector is defined by the two spherical angles, chosen here in a discrete manner with an increment of 5 degrees, and by a length, also chosen in a discrete manner. This discrete length was used in two different contexts: it may correspond to the length of an alpha helix or to the length of the flexible part that appears at the extremity of the protein. The coordinates that are reached thanks to this vector define the new coordinates of an angle point, no matter if the vector corresponds to an alpha helix or to a flexible part. ###figure, points + vectors###. As we screen for all possible angles in a discrete manner, those angle point coordinates are regularly distributed on a sphere. As further detailed in the next sections, those spheres are defined from both ends, either once or in several steps. Then the software checks for possible straight connections of given lengths for each pair of angle points originating from both extremities. ###figure needed, with all these possible steps, one sphere around each point and checking for the connections, 2D, different lengths in two different figures, ###<br />
The linkers are built in a modular way, with blocks of well-defined size. From the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling] of potential linkers, we could derive 8 different alpha helical rods, all with different lengths. In addition, the length of the two segments inside an angle block was always 8 &Aring;, so exchanging angle blocks does not affect the length of the linker. This means that the distance between the angle points is well defined, an essential aspect of our strategy of linker design.<br />
The software proceeds in three steps. First, it checks for the possibility of a direct single alpha helix linker. For this, it applies the procedure just mentioned with spheres whose radius reasonably corresponds to the length of the short parts at the extremity of the protein. Second, it tests if a linker containing two alpha helices connected with a right angle allows the circularization. Finally it searches the possible linkers with three angle points. The next parts will explain those three steps in detail.<br />
This method has been chosen, because it could be implemented easily and efficiently in our program. However, this strategy generated paths that crossed the protein. Therefore we put big efforts in the sorting out of the paths.<br />
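The displacement step described in this section can be sketched as below. This is an illustrative sketch under stated assumptions, not the original implementation: the function names are hypothetical, the polar angle is assumed to be measured from the z-axis, and the segment lengths passed in are placeholders for the discrete lengths used by the software.

```python
import math

def displacement(theta_deg, phi_deg, length):
    """Vector of the given length for polar angle theta and azimuth phi (degrees)."""
    theta = math.radians(theta_deg)
    phi = math.radians(phi_deg)
    return (length * math.sin(theta) * math.cos(phi),
            length * math.sin(theta) * math.sin(phi),
            length * math.cos(theta))

def candidate_points(origin, lengths, step_deg=5):
    """All angle points reachable from origin on the discrete angular grid."""
    points = []
    for length in lengths:
        for theta_deg in range(0, 181, step_deg):      # 0..180 inclusive
            for phi_deg in range(0, 360, step_deg):    # 0..355
                dx, dy, dz = displacement(theta_deg, phi_deg, length)
                points.append((origin[0] + dx, origin[1] + dy, origin[2] + dz))
    return points
```

With a 5 degree increment this grid yields 37 x 72 = 2664 directions per length (the poles are visited repeatedly, which a real implementation might deduplicate), which illustrates why the number of paths grows so quickly when several such steps are chained.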
<br />
===Step 1===<br />
As a simple rigid linker with no angle would be easier to design and likely more thermostable than the ones containing angles, the software first checks if this simple solution is possible ###figure needed###. For this, we took into account the fact that proteins have some flexible amino acids at their extremities. This flexible part may come from the protein itself, but also from the 2 glycines that are included at the N-terminal part and from the extein at the C-terminal part. Those two latter parts come from our linkers. Those parts have no preferential angles and offer a large number of possibilities to insert fitting linkers. But this flexibility is also a drawback as we have to include this large number of possible angles and lengths in our path search.<br />
In this first step, the software explicitly takes these flexible parts into account to check for the possibility of straight linkers. As the angles and the length of the flexible parts are variable, the software positions their extremity on a sphere centered on the last fixed position of the structure as explained above. The radius of this sphere is incremented in a discrete manner, in 4 steps, from 5.25 &Aring; to the maximum length of the flexible part. ###figure###<br />
Then all possible straight segments between the points originating from the two extremities are tested. If they are closer than 5 &Aring; to the protein, or if they cross it, then they are rejected. If they are kept, then the software checks whether the length of the segments is compatible with the feasible alpha helices: if the length of a given segment equals one of the 8 alpha helix lengths plus or minus 0.75 &Aring;, then the path is eventually saved.<br />
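The length test at the end of Step 1 can be illustrated with the short sketch below. The eight values in `HELIX_LENGTHS` are purely hypothetical placeholders (the real lengths come from the modeling work), and the function name is an assumption made for this example; only the tolerance of 0.75 &Aring; is taken from the text.

```python
import math

# Hypothetical helix lengths in angstroms, for illustration only;
# the eight real values are derived on the modeling page.
HELIX_LENGTHS = [10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0, 45.0]
TOLERANCE = 0.75  # angstroms, as stated in the text

def matching_helix_length(start, end, helix_lengths=HELIX_LENGTHS,
                          tolerance=TOLERANCE):
    """Return the first helix length compatible with the segment, or None."""
    segment_length = math.dist(start, end)
    for length in helix_lengths:
        if abs(segment_length - length) <= tolerance:
            return length
    return None
```

A segment for which this returns `None` cannot be built from a single helical block and is therefore discarded in this step.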
===Step 2===<br />
The next possibility to design more complex rigid linkers while still taking flexible ends into account with a reasonable calculation time was to reduce the number of possible angles. As we originally thought that a 90° angle would be practically feasible, the software was designed to generate linkers with flexible ends and one 90° angle. ###figure, protein_one_90° angle### This choice was notably made because the edge lengths of a right triangle are simple to calculate. We already saw in Step 1 that the length of an edge can only take 8 different values. As the linkers have to start from the extremities of the protein, and as we impose a right angle, the number of possible paths is therefore low, making them easy to compute. Practically, the extremities of the proteins are positioned in a flexible way as in Step 1. From each of the positions allowed by this flexibility, the software searches for all the allowed right triangles. This restriction of the degrees of freedom was mainly introduced to keep the calculations feasible. <br />
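The right-triangle argument of Step 2 can be made concrete with a small sketch. It assumes, as a simplification, that the two helices form the legs of a right triangle whose hypotenuse is the distance between the two flexible end positions, and that the 0.75 &Aring; tolerance from Step 1 is applied to that hypotenuse; the function name and the lengths in the usage note are hypothetical.

```python
import math
from itertools import product

def right_angle_leg_pairs(end_to_end, helix_lengths, tolerance=0.75):
    """Ordered pairs of helix lengths that span end_to_end with a 90 degree angle.

    By Pythagoras, legs a and b joined at a right angle connect two points
    separated by sqrt(a^2 + b^2).
    """
    pairs = []
    for first, second in product(helix_lengths, repeat=2):
        hypotenuse = math.hypot(first, second)
        if abs(hypotenuse - end_to_end) <= tolerance:
            pairs.append((first, second))
    return pairs
```

For example, with made-up lengths of 6, 8 and 20 &Aring; and an end-to-end distance of 10 &Aring;, only the combinations (6, 8) and (8, 6) qualify, which shows how few candidates survive once the angle is fixed.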
===Step 3===<br />
Finally the software also provides the possibility to find paths with up to 4 edges, meaning 4 alpha helices and 3 angles. Thanks to the modularity of the possible linkers, such paths can offer the possibility to circularize theoretically any kind of protein. ###figure of torus###<br />
To keep the calculation feasible in a reasonable time, we designed the search strategy so that the flexible parts at the extremities are oriented in the same direction as the consecutive alpha helix. This obviously restricts the search, but as these orientations are allowed for the flexible part, this approach remains valid. <br />
First, potential ending points of the first alpha helical rod are calculated from the N-terminal point of the protein. The orientation is chosen in a discrete manner, with an increment of 5 degrees for the two angles of the spherical coordinates. The distance from the origin corresponds to the 8 possible lengths allowed by the alpha helices, as already seen in Step 2, plus a length of 0, which mimics a linker with 3 instead of 4 edges. The exact same procedure is repeated to define all the potential ending points of the second alpha helical rod starting from all the possible ending points of the first alpha helical rod. Thanks to the possibility of a length of 0 for the first and the second rods, the software also calculates paths with 2 edges. Then, the same is done only once from the C-terminal point of the protein, defining 1 edge. The final step consists in checking if the points originating from the N- and C-terminal points can be linked by a potential alpha helix, i.e. if they are separated by the appropriate distance. If any of the potential alpha helix lengths lies within the distance between two points plus or minus 0.75 &Aring;, then the path is eventually saved. In the same way, if two points are directly closer than 0.75 &Aring;, then the path is also saved.<br />
<br />
==Sorting out of paths==<br />
The previous part described the generation of paths that can connect the two extremities of the protein irrespective of the position of these paths relative to the protein. While this allows a fast computing of the geometrical paths, this also implies that the paths that are not practically feasible need to be sorted out. This is the most time-consuming part of the computing as about 1 billion paths are generated. Three criteria are considered for the sorting. The first one is the feasibility of the linker: can the software find angle patterns that correspond to the ones defined by the geometrical path? This question was part of the motivation for a large modeling effort (link) to determine the possible angles between consecutive alpha helices. This was achieved by analyzing the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling distribution of angles] between alpha helices found in the ArchDB database. As nearly any angle could be found between 20 and 170 degrees, only few paths were actually rejected at that step. The next criterion was the position of the angle points: if they appear inside the protein, then the path is rejected. Finally, if any atom of the protein is less than 5 &Aring; away from any of the alpha helices, the path is also rejected.<br />
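The last criterion, the 5 &Aring; clearance between every helix and every protein atom, amounts to a point-to-segment distance test. The sketch below shows one straightforward way to write it; it is an assumption about how such a test could look, not the code used by the team, and the function names are hypothetical. A brute-force loop like this over all atoms and all segments also hints at why sorting is the most expensive part of the computation.

```python
import math

def point_segment_distance(p, a, b):
    """Shortest distance between point p and the segment from a to b."""
    ab = [b[i] - a[i] for i in range(3)]
    ap = [p[i] - a[i] for i in range(3)]
    ab_len2 = sum(c * c for c in ab)
    if ab_len2 == 0.0:
        # Degenerate segment: both ends coincide.
        return math.dist(p, a)
    # Projection parameter of p onto the line, clamped to the segment.
    t = sum(ap[i] * ab[i] for i in range(3)) / ab_len2
    t = max(0.0, min(1.0, t))
    closest = [a[i] + t * ab[i] for i in range(3)]
    return math.dist(p, closest)

def segment_is_clear(a, b, atoms, clearance=5.0):
    """True if no atom lies closer than the clearance to the segment a-b."""
    return all(point_segment_distance(atom, a, b) >= clearance for atom in atoms)
```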
<br />
==Shifting paths to the patterns==<br />
<br />
The strategy described in Step 3 gives a certain freedom for the rod that connects the last two angle points that were generated from the N- and C-terminal points. As this freedom is actually not permitted by the alpha helix and the angle pattern, but is permitted by the flexible part, for example at the C-terminal end, the software slightly refines the path by rotating the segment that originates from the C-terminal point.<br />
<br />
==Weighting of paths==<br />
Before translating the paths into sequences and thus into linkers that can be expressed, each path needs to be evaluated for its potential capacity to enhance heat stability. For this we have identified different contributions that should be combined into one value that defines how good the linker may be and that consequently defines its ranking among all possible linkers. The smaller this value, the more we expect that the linker will enhance thermostability. An important step for all these contributions is the normalization, as explained in the next paragraphs.<br />
The first contribution we considered was the linker length. We assumed that a short linker is better to constrain the protein extremities, and that a long helix might give more flexibility. Because we wanted this value to be independent of the size of the protein, the length of the linker is normalized to the distance between the two termini.<br />
The second contribution relates to the angles used in the linker. We learned from the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling] that angles formed by a certain angle pattern follow a certain distribution. First, we assumed that the narrower the distribution, the more likely the alpha helices would actually produce this angle. Second, the angles found by the software should be as close as possible to those well-defined angles. In this case, the weight value from this contribution should be low.<br />
Then the distance of the linker to the protein is taken into account. Because the linkers should not disturb the protein in its normal environment, linkers that pass close to the protein surface are considered better linkers. The distance was defined as the minimal distance between the linker and all the atoms of the protein. As already mentioned for the sorting of the paths, a linker cannot come closer than 5 &Aring; and this distance was used for normalization of calculated distances.<br />
After this, the places a linker should avoid are calculated. Each protein can interact with other molecules on some parts of its surface. The user can specify where and how big those parts are. If a linker passes in front of a potential molecule-binding domain, the value of the corresponding path goes to infinity, so that the linker is discarded. Conversely, the farther a linker is from a potential ligand-binding domain, the smaller its weighting value. The user can also specify the importance of certain regions. In the end the total weighting is normalized to the number of binding domains.<br />
<br />
===Calibrating the weighting function===<br />
Every contribution has its own distribution. You can see an example in figure ### [[figure histogram_length_lys.png]], but all of them have different shapes. The aim is to find the paths that globally minimize all of these distributions. Therefore, for simplicity, in the weighting function the four mentioned contributions were combined in a linear manner:<br />
\[ W(p) = \alpha L(p) + \beta A(p) + \gamma D(p) + \delta u(p) \]<br />
where W is the final weighting, p the path, L the length contribution, A the angle contribution, D the distance contribution and u the contribution from the forbidden regions. $\alpha, \beta, \gamma, \delta$ are the weighting constants that needed to be found. Each contribution was normalized so that it is dimensionless and so that all contributions have reasonably similar values.<br />
The weighting constants were obtained from the [https://2014.igem.org/Team:Heidelberg/Project/Linker_Screening linker screening] performed with lysozyme and the [https://2014.igem.org/Team:Heidelberg/Modeling/Enzyme_Modeling modeling of the enzyme activity]. Their calculation is presented in the results below.<br />
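The linear combination above translates directly into code. The sketch below is only an illustration: the function names and the tuple layout of a path's contributions are assumptions made for this example, and the four contributions are taken as already normalized. Because a path crossing a forbidden region has an infinite u, it automatically ends up at the bottom of the ranking (provided $\delta$ is positive).

```python
import math

def path_weight(length_term, angle_term, distance_term, forbidden_term,
                alpha, beta, gamma, delta):
    """W(p) = alpha*L(p) + beta*A(p) + gamma*D(p) + delta*u(p); lower is better."""
    return (alpha * length_term + beta * angle_term
            + gamma * distance_term + delta * forbidden_term)

def rank_paths(contributions, alpha, beta, gamma, delta):
    """Sort path names by increasing weight.

    contributions maps a path name to its (L, A, D, u) tuple.
    """
    return sorted(
        contributions,
        key=lambda name: path_weight(*contributions[name],
                                     alpha, beta, gamma, delta),
    )
```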
<br />
==Translating paths to sequence==<br />
As mentioned before, the software is provided with two databases, one for the possible angle patterns and one for the helix patterns. The choice of the patterns was inspired by known crystal structures extracted from databases and described in different papers.<br />
A large in silico screening to refine the preferences of the patterns was then set up using the [https://2014.igem.org/Team:Heidelberg/Software/igemathome distributed computing] system. For the complete description of the search for suitable patterns, see the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling] page. <br />
All the possible paths are now split up at the angles and compared with the possible patterns in the databases. ###Figure needed to explain, how the path is translated### The most suitable patterns are identified and concatenated to build the path's sequence. It is important to note that this is only possible because of the modularity of our linker patterns used as building blocks: each block, whether an alpha helix or an angle pattern, is not affected by the others. Thus for each possible path, one sequence is produced.<br />
<br />
==Clustering of paths==<br />
Many different paths are represented by the same sequence [[###Figure that shows, different paths have same properties, already before in the text ###]] and we therefore clustered such paths. The weights for those clustered paths were then calculated by averaging the weights of the different paths that compose a cluster.<br />
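The clustering described above is a group-by on the sequence followed by an average. A minimal sketch, assuming each path has already been reduced to a (sequence, weight) pair; the function name is hypothetical:

```python
from collections import defaultdict

def cluster_by_sequence(paths):
    """Group (sequence, weight) pairs by sequence and average the weights."""
    grouped = defaultdict(list)
    for sequence, weight in paths:
        grouped[sequence].append(weight)
    return {sequence: sum(weights) / len(weights)
            for sequence, weights in grouped.items()}
```

Each key of the returned dictionary is one candidate linker, and its value is the cluster weight used for the final ranking.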
<br />
<br />
=Results=<br />
==DNMT1==<br />
A major motivation of our effort to design rigid linkers with angles was the [https://2014.igem.org/Team:Heidelberg/Project/PCR_2.0 circularization of the DNA methyltransferase Dnmt1]. The truncated form used in our project is composed of 900 amino acids and the N- and C-terminal extremities are well separated. To circularize it, two linkers were designed: a flexible one made of glycine and serine, and a rigid one designed by the software. The rigid linkers for DNMT1 were obtained from an early state of the software. At that time the calculation took 11 days on a laptop computer with an Intel i5 processor and 8 GB of RAM, which shows the importance of a distributed computing system for large proteins. Since then, the software has improved considerably, reducing the calculation time to about 1 day for DNMT1.<br />
<br />
==Feedback from wet lab==<br />
The results from the software for lysozyme were tested as described in the [https://2014.igem.org/Team:Heidelberg/Project/Linker_Screening linker-screening] part and evaluated as described in the [https://2014.igem.org/Team:Heidelberg/Modeling/Enzyme_Modeling enzyme modeling] part. We have performed a large linker screening on 10 different lysozymes with different linkers. As the purpose of the lysozyme screen was the calibration of the software, those linkers (Table 1) were designed according to the four contributions previously mentioned. One of them was the shortest possible, one had the best possible angle, and so on.<br />
{| class="table table-hover"<br />
|+ '''Table 1''': Linkers and their amino acid sequence. Green: attachment sequences to prevent the flexible regions from being perturbed; Blue: angle; Purple: extein.<br />
! Linker<br />
! Amino acid sequence<br />
|- <br />
|<br />
|<br />
|-<br />
|'''Very good linkers'''<br />
|<br />
|-<br />
| sgt2<br />
| <span style="color:#3ADF00;">GG</span>AEAAAK<span style="color:#00BFFF;">AAAHPEA</span>AEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| rigid<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAKAA<span style="color:#3ADF00;">P</span><span style="color:#A901DB;">RGKCWE</span><br />
|-<br />
| <br />
| <br />
|-<br />
| '''Average linkers'''<br />
|<br />
|-<br />
| may1<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAKA<span style="color:#00BFFF;">AAAHPEA</span>AEAAAKEAAAKA<span style="color:#00BFFF;">KTA</span>AEAAAKEAAAKA<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| ord1<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAK<span style="color:#00BFFF;">ATGDLA</span>AEAAAKAA<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| ord3<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAK<span style="color:#00BFFF;">ASLPAA</span>AEAAAKEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
|<br />
| <br />
|-<br />
| '''Short linkers'''<br />
| <br />
|-<br />
| sho1<br />
| <span style="color:#3ADF00;">GG</span><span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| sho2<br />
| <span style="color:#3ADF00;">GG</span>AEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
|}<br />
[[circ_lam_lys_nils.png]] comparison with [[nice_linker_lysozyme_flexible_ends.png]]<br />
###About predictions of software###<br />
In the end we obtained a ranking of the in vitro tested linkers from the [https://2014.igem.org/Team:Heidelberg/Project/Linker_Screening linker-screening] and chose the parameters $\alpha, \beta, \gamma, \delta$ of the weighting function so that the ranking from the software represented the ranking from the assays. The final values, $\alpha = 1.85 \times 10^{-6}, \beta = 0.57, \gamma = 50.8 \times 10^{6}$, set up a function that could reproduce the ranking observed in the wet lab experiments.<br />
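The calibration criterion, that the software ranking must match the assay ranking, can be expressed as a simple check. The sketch below is an assumption about how such a check could be written and does not reproduce the fitting procedure actually used; the function names are hypothetical and any contribution values fed to it in an example would be placeholders, not measured data.

```python
def software_ranking(contributions, constants):
    """Order linker names by increasing weight.

    contributions maps a linker name to its (L, A, D, u) tuple;
    constants is the tuple (alpha, beta, gamma, delta).
    """
    def weight(name):
        return sum(c * x for c, x in zip(constants, contributions[name]))
    return sorted(contributions, key=weight)

def reproduces_ranking(contributions, constants, observed_order):
    """True if the constants make the software ranking equal the assay ranking."""
    return software_ranking(contributions, constants) == list(observed_order)
```

A candidate set of constants is acceptable when `reproduces_ranking` returns `True` for the ranking observed in the screening; several sets may satisfy this, so the check constrains the constants without determining them uniquely.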
<br />
=Discussion=<br />
The software described here allowed us to design<br />
<br />
=References=<br />
<br />
[1] Thornton, J.M. & Sibanda, B.L. Amino and carboxy-terminal regions in globular proteins. Journal of molecular biology 167, 443-460 (1983).<br />
<br />
[2] Wang, C.K.L., Kaas, Q., Chiche, L. & Craik, D.J. CyBase: A database of cyclic protein sequences and structures, with applications in protein discovery and engineering. Nucleic Acids Research 36, (2008).</div>Igemnilshttp://2014.igem.org/Team:Heidelberg/pages/Linker_SoftwareTeam:Heidelberg/pages/Linker Software2014-10-17T23:15:35Z<p>Igemnils: /* DNMT1 */</p>
<hr />
<div>=Abstract=<br />
As already [https://2014.igem.org/Team:Heidelberg/Toolbox/Circularization introduced], artificially circularized proteins may gain heat stability because their C- and N-termini are restrained from moving freely. Circularization may be trivial when the protein termini are very close to each other, which seems to be reasonably common [[#References|[1]]]. However, if the ends are too far from each other, a long linker is needed to connect them. This linker should not change the natural conformation of the protein, and it should constrain the relative position of the ends so that the degrees of freedom are restricted and the structure stays stable even when heated. In addition, the linker should not affect any of the protein's functions. Consequently, it is important to prevent linkers from passing through the active site or from covering domains that bind other molecules, for example. Therefore, one needs to be able to define the shape of possible linkers. This section describes the software we developed to design such linkers. We would like to stress that this work was made possible by the feedback between computer modeling and experimental work: we first designed linkers in silico, tested them experimentally and used the results to further calibrate the software. To our knowledge, this is the first time that such an approach has been used to custom-design rigid linkers with angles to connect protein extremities.<br />
<br />
=General procedure=<br />
In short, the software provides a weighted list of linkers to circularize any protein of interest with a known structure. These linkers are made of rigid alpha-helical segments connected at defined angles. In contrast to flexible linkers, these rigid linkers were expected to constrain the protein extremities and to confer better heat stability. Such an idea was already developed [[#References|[2]]], but only with alpha helices forming simple rods and without any possibility to introduce angles. To generate the linkers, we first defined the geometrical paths, with segments and angles, that they should follow. The geometrical paths that are biologically feasible are afterwards translated into amino acid sequences. Both the check of the paths against possible structures and the translation were made possible by our [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling approaches]. The first approach consisted in a statistical analysis of more than 17,000 known non-homologous structures containing alpha helices connected with angles. For the second approach, we modeled the conformation of linkers circularizing proteins of known structure and analyzed them for certain properties. This second approach was run for a large number of proteins thanks to our distributed computing system [https://2014.igem.org/Team:Heidelberg/Software/igemathome igemathome]. The software provides different possible linkers with weights that rank them according to their capacity to maintain protein activity at higher temperatures. The weights were derived from an extensive [https://2014.igem.org/Team:Heidelberg/Project/Linker_Screening linker screening] on the target protein lambda-lysozyme, using the first modeling approach.<br />
<br />
=Background=<br />
Classically, protein linkers have been designed in three different ways. The easiest way is to define the distance that a linker should cover and then simply use a flexible glycine-serine peptide with the right number of amino acids to match this distance. Glycine is used for flexibility, as it has no side chain and does not produce any steric hindrance, while serine is used for solubility, as it has a small polar side chain. This solubility is important, as the linkers should not pass through the hydrophobic core of the protein, but should be dissolved in the surrounding medium. These flexible linkers have typically been used for circularization, but also for connecting different proteins when the main goal is that the parts are connected, not how they are connected, or when the flexibility of the linker is required for specific applications. <br />
<br />
A second strategy consists in using rigid helical linkers to keep proteins or protein domains at a certain distance from each other. This is especially important for signalling proteins and fluorescent proteins. One major property of alpha helices is that they always fold in a defined way with well-defined angles and lengths. There are also many different helical patterns that differ in stability and solubility. Although such helices have been used to design circularizing linkers [[#References|[2]]], one big disadvantage of this strategy is that one can only build straight linkers with helices. So in the context of circularization, if the straight line connecting the protein extremities crosses the protein, this strategy is not an option.<br />
<br />
The third option, which served as a basis for our approach and which came from discussions with the group of Rebecca Wade in Heidelberg, Germany, consists in designing custom-tailored linkers for each specific application. These linkers can be obtained from protein structure prediction. First, one needs to define the path that the linker should take to connect the protein ends. Afterwards, one designs a possible linker sequence that might fit well. Next, one makes a structure prediction of the linker attached to the protein to validate the design. Several different linkers, with slight changes, can be compared. This is repeated several times until the linker effectively follows the expected path. This method requires a strong knowledge of protein folding and protein structure prediction and is computationally intensive. On the other hand, the benefit can be substantial, as the interaction of the linker with the protein surface can be taken into account and the path taken by the linker can be defined accurately, down to the resolution of the protein structure.<br />
<br />
We have set up a completely new strategy to design rigid linkers. As further detailed in the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling] part, it is possible to define the shape of a linker by combining rigid alpha-helical rods with well-defined angle patterns. Therefore, by defining, in a geometrical way, the possible paths of the circularizing linkers for a given protein, we can then propose potential linkers. This definition of the geometrical path can be very difficult, especially for large proteins with complex shapes. Moreover, it is further constrained by the fact that linkers must avoid hiding active sites of the protein of interest. Finally, the paths have rotational degrees of freedom at the extremities of the protein, and depending on their orientation, they may or may not match the geometry of the protein. The tool we present here covers both steps: defining geometrical paths with weights and translating them into feasible linkers, also with weights. This tool is universal, as it can design circularizing linkers for any protein with a known structure. Moreover, it is modular since, thanks to our [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling approach], we have designed linkers as exchangeable blocks of rods of different lengths and of angle patterns. The following sections detail the different steps followed by our software to design proper linkers.<br />
<br />
<br />
==PDB parsing==<br />
First, the PDB file containing the structure of the target protein is parsed and the coordinates of the atoms are stored in metric units. After this, some initial tests are made on the protein structure. First, we checked whether the C- and N-termini lie on the surface of the protein and are accessible to the solvent, which is crucial for circularization. We defined a line originating from an extremity of the protein by the two angles of the spherical coordinates around the z-axis. From that, we could determine the accessible angles by rejecting all the lines that are too close to the protein. As the future linker will be made of alpha helices and will therefore have a radius of 5 &Aring;, we used this length as the minimal allowed distance.<br />
Those allowed angles are stored for the coming linker generation. ###fig needed### <br />
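The accessibility test described above can be sketched as follows. This is a minimal illustration and not the team's actual code: the function names, the coarse angular grid in the usage note and the <code>ignore_within</code> cutoff (needed so that atoms directly bonded to the terminus do not block every direction) are assumptions of this sketch.<br />

```python
import math

def parse_pdb_atoms(pdb_text):
    """Return atom coordinates (x, y, z) in angstroms from ATOM records."""
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            # Fixed-column PDB format: x, y, z sit in columns 31-38, 39-46, 47-54.
            atoms.append((float(line[30:38]), float(line[38:46]), float(line[46:54])))
    return atoms

def unit_vector(theta_deg, phi_deg):
    """Direction for polar angle theta (from the z-axis) and azimuth phi."""
    theta, phi = math.radians(theta_deg), math.radians(phi_deg)
    return (math.sin(theta) * math.cos(phi),
            math.sin(theta) * math.sin(phi),
            math.cos(theta))

def ray_distance(origin, direction, atom):
    """Shortest distance between an atom and the half-line origin + t*direction, t >= 0."""
    rel = [a - o for a, o in zip(atom, origin)]
    t = max(0.0, sum(r * d for r, d in zip(rel, direction)))
    closest = [o + t * d for o, d in zip(origin, direction)]
    return math.dist(closest, atom)

def accessible_angles(terminus, atoms, min_dist=5.0, step_deg=5, ignore_within=6.0):
    """List the (theta, phi) pairs whose ray keeps min_dist from every atom.

    ignore_within is an assumption of this sketch: atoms right next to the
    terminus would otherwise reject every direction.
    """
    far_atoms = [a for a in atoms if math.dist(a, terminus) > ignore_within]
    allowed = []
    for theta in range(0, 181, step_deg):
        for phi in range(0, 360, step_deg):
            direction = unit_vector(theta, phi)
            if all(ray_distance(terminus, direction, a) >= min_dist for a in far_atoms):
                allowed.append((theta, phi))
    return allowed
```

With a single atom placed 10 &Aring; below the terminus, the direction pointing straight at the atom is rejected while the opposite direction is kept.<br />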
<br />
==Generation of geometric paths==<br />
As our strategy consists in building linkers with helical rods and connecting angles, a path is completely defined by the coordinates of the angle points. Advancing one step from an existing point is always done by adding a displacement vector to this point. This vector is defined by the two spherical angles, chosen here in a discrete manner with an increment of 5 degrees, and by a length, also chosen in a discrete manner. This discrete length is used in two different contexts: it may correspond to the length of an alpha helix or to the length of the flexible part that appears at the extremity of the protein. The coordinates reached with this vector define a new angle point, no matter whether the vector corresponds to an alpha helix or to a flexible part. ###figure, points + vectors###. As we screen all possible angles in a discrete manner, the coordinates of those angle points are regularly distributed on a sphere. As further detailed in the next sections, those spheres are defined from both ends, either once or in several steps. Then the software checks for possible straight connections of given lengths for each pair of angle points originating from the two extremities. ###figure needed, with all these possible steps, one sphere around each point and checking for the connections, 2D, different lengths in two different figures, ###<br />
The linkers are built in a modular way, with blocks of well-defined size. From the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling] of potential linkers, we could derive 8 different alpha-helical rods, all with different lengths. In addition, the length of the two segments inside an angle block was always 8&Aring;, so exchanging angle blocks does not affect the length of the linker. This means that the distance between the angle points is well defined, an essential aspect of our strategy of linker design.<br />
The software proceeds in three steps. First, it checks for the possibility of a direct linker made of a single alpha helix. For this, it applies the procedure just mentioned with spheres whose radius reasonably corresponds to the length of the short parts at the extremity of the protein. Second, it tests whether a linker containing two alpha helices connected with a right angle allows the circularization. Finally, it searches for possible linkers with three angle points. The next parts explain those three steps in detail.<br />
This method was chosen because it could be implemented easily and efficiently in our program. However, this strategy generates paths that cross the protein. Therefore we put considerable effort into sorting out the paths.<br />
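A minimal sketch of how the angle points on one sphere could be enumerated with the 5 degree increment mentioned above. The function name and the handling of the two poles (where the azimuth is degenerate, so only one point is kept) are assumptions of this illustration.<br />

```python
import math

def sphere_points(center, radius, step_deg=5):
    """Candidate angle points at a fixed distance from center on a discrete angular grid."""
    cx, cy, cz = center
    points = []
    for theta_deg in range(0, 181, step_deg):
        # At the poles every azimuth gives the same point, so keep only one.
        phis = [0] if theta_deg in (0, 180) else range(0, 360, step_deg)
        theta = math.radians(theta_deg)
        for phi_deg in phis:
            phi = math.radians(phi_deg)
            points.append((cx + radius * math.sin(theta) * math.cos(phi),
                           cy + radius * math.sin(theta) * math.sin(phi),
                           cz + radius * math.cos(theta)))
    return points
```

With a 5 degree step this gives 2522 points per sphere and per length, which illustrates why chaining several such steps quickly produces a very large number of candidate paths.<br />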
<br />
===Step 1===<br />
As a simple rigid linker with no angle would be easier to design and likely more thermostable than linkers containing angles, the software first checks whether this simple solution is possible ###figure needed###. For this, we took into account the fact that proteins have some flexible amino acids at their extremities. This flexible part may come from the protein itself, but also from the 2 glycines that are included at the N-terminal part and from the extein at the C-terminal part. These two latter parts come from our linkers. The flexible parts have no preferential angles and offer a large number of possibilities to insert fitting linkers. But this flexibility is also a drawback, as we have to include this large number of possible angles and lengths in our path search.<br />
In this first step, the software explicitly takes these flexible parts into account to check for the possibility of straight linkers. As the angles and the length of the flexible parts are variable, the software positions their extremity on a sphere centered on the last fixed position of the structure, as explained above. The radius of this sphere is incremented in a discrete manner, in 4 steps, from 5.25 &Aring; to the maximum length of the flexible part. ###figure###<br />
Then all possible straight segments between the points and the last points are tested. If they are closer than 5 &Aring; to the protein, or if they cross it, they are rejected. If they are kept, the software checks whether the length of the segments is compatible with the feasible alpha helices: if the length of a given segment equals one of the 8 alpha helix lengths plus or minus 0.75 &Aring;, the path is saved.<br />
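The two tests of Step 1 can be sketched as below. The 5 &Aring; clearance and the 0.75 &Aring; tolerance come from the text; the eight helix lengths are not listed on this page, so the values in <code>HYPOTHETICAL_HELIX_LENGTHS</code> are placeholders, and the function names are likewise assumptions of this sketch.<br />

```python
import math

# Placeholder values in angstroms: the 8 real helix lengths are not given on this page.
HYPOTHETICAL_HELIX_LENGTHS = [10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0, 45.0]

def point_segment_distance(p, a, b):
    """Shortest distance between point p and the segment from a to b."""
    ab = [bi - ai for ai, bi in zip(a, b)]
    ap = [pi - ai for ai, pi in zip(a, p)]
    ab_len2 = sum(c * c for c in ab)
    if ab_len2 == 0.0:
        return math.dist(p, a)  # degenerate segment
    t = sum(x * y for x, y in zip(ap, ab)) / ab_len2
    t = min(1.0, max(0.0, t))   # clamp the projection onto the segment
    closest = [ai + t * c for ai, c in zip(a, ab)]
    return math.dist(p, closest)

def segment_is_clear(a, b, atoms, min_dist=5.0):
    """True if no atom comes closer than min_dist to the segment (this also excludes crossing)."""
    return all(point_segment_distance(atom, a, b) >= min_dist for atom in atoms)

def matching_helix(length, helix_lengths=HYPOTHETICAL_HELIX_LENGTHS, tol=0.75):
    """Return the first helix length within tol of the segment length, or None."""
    for h in helix_lengths:
        if abs(length - h) <= tol:
            return h
    return None
```

A segment would be kept only if <code>segment_is_clear</code> holds for all protein atoms and <code>matching_helix</code> returns a length for the distance between its end points.<br />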
===Step 2===<br />
The next possibility to design more complex rigid linkers, while still taking flexible ends into account within a reasonable calculation time, was to reduce the number of possible angles. As we originally thought that a 90° angle would be practically feasible, the software was designed to generate linkers with flexible ends and one 90° angle. ###figure, protein_one_90° angle### This choice was notably made because the edge lengths of a right triangle are simple to calculate. We already saw in Step 1 that the length of an edge can only take 8 different values. As the linkers have to start from the extremities of the protein, and as we impose a right angle, the number of possible paths is low, making them easy to compute. In practice, the extremities of the protein are positioned in a flexible way as in Step 1. From each of the positions allowed by this flexibility, the software searches for all the allowed right triangles. This restriction was mainly introduced to limit the degrees of freedom and keep the calculations feasible. <br />
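The right-triangle search can be illustrated with Pythagoras' theorem: two helices of lengths a and b joined at 90° connect two points separated by d only if a² + b² ≈ d². The sketch below enumerates the admissible pairs; the helix lengths are placeholder values (the real ones are not listed here), and applying the 0.75 &Aring; tolerance to the hypotenuse is an assumption of this illustration.<br />

```python
import math

# Placeholder values in angstroms: the 8 real helix lengths are not given on this page.
HYPOTHETICAL_HELIX_LENGTHS = [10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0, 45.0]

def right_angle_pairs(end_to_end, helix_lengths=HYPOTHETICAL_HELIX_LENGTHS, tol=0.75):
    """Ordered pairs (a, b) of helix lengths whose right triangle spans end_to_end."""
    pairs = []
    for a in helix_lengths:
        for b in helix_lengths:
            # Hypotenuse of a right triangle with legs a and b.
            if abs(math.hypot(a, b) - end_to_end) <= tol:
                pairs.append((a, b))
    return pairs
```

For each admissible pair, the corner point still has one rotational degree of freedom around the axis joining the two end points, so every pair stands for a circle of candidate corner positions that must be checked against the protein.<br />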
===Step 3===<br />
Finally the software also provides the possibility to find paths with up to 4 edges, meaning 4 alpha helices and 3 angles. Thanks to the modularity of the possible linkers, such paths can offer the possibility to circularize theoretically any kind of protein. ###figure of torus###<br />
To keep the calculation feasible in a reasonable time, we designed the search strategy so that the flexible parts at the extremities are oriented in the same direction as the adjacent alpha helix. This obviously restricts the search, but as these orientations are allowed for the flexible part, the approach remains valid. <br />
First, potential end points of the first alpha-helical rod are calculated from the N-terminal point of the protein. The orientation is chosen in a discrete manner, with an increment of 5 degrees for the two angles of the spherical coordinates. The distance from the origin takes one of the 8 possible lengths allowed by the alpha helices, as already seen in Step 2, or a length of 0, which mimics a linker with 3 instead of 4 edges. The exact same procedure is repeated to define all the potential end points of the second alpha-helical rod, starting from all the possible end points of the first alpha-helical rod. Thanks to the possibility of a length of 0 for the first and the second rods, the software also calculates paths with 2 edges. Then, the same is done only once from the C-terminal point of the protein, defining 1 edge. The final step consists in checking whether the points originating from the N- and C-terminal points can be linked by a potential alpha helix, i.e. whether they are separated by an appropriate distance. If any of the potential alpha helix lengths lies within 0.75 &Aring; of the distance between two points, the path is saved. In the same way, if two points are closer than 0.75 &Aring; to each other, the path is also saved.<br />
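The final matching of Step 3 can be sketched as a plain comparison of the points grown from the N-terminal side with those grown from the C-terminal side. The function name, the brute-force double loop and the helix lengths are assumptions of this illustration; the page does not describe how the actual software organizes this search.<br />

```python
import math

# Placeholder values in angstroms: the 8 real helix lengths are not given on this page.
HYPOTHETICAL_HELIX_LENGTHS = [10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0, 45.0]

def connect_endpoints(n_points, c_points, helix_lengths=HYPOTHETICAL_HELIX_LENGTHS, tol=0.75):
    """Return (i, j, helix_length) for every N-side/C-side pair that can be closed.

    A helix_length of 0.0 means the two points already coincide within tol.
    """
    matches = []
    for i, p in enumerate(n_points):
        for j, q in enumerate(c_points):
            d = math.dist(p, q)
            if d < tol:
                matches.append((i, j, 0.0))
                continue
            for h in helix_lengths:
                if abs(d - h) <= tol:
                    matches.append((i, j, h))
                    break
    return matches
```

Because the number of pairs grows with the product of both point sets, this closing step is a natural place for the large path counts mentioned in the next section to arise.<br />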
<br />
==Sorting out of paths==<br />
The previous part described the generation of paths that can connect the two extremities of the protein irrespective of the position of these paths relative to the protein. While this allows fast computation of the geometrical paths, it also implies that the paths that are not practically feasible need to be sorted out. This is the most time-consuming part of the computation, as about 1 billion paths are generated. Three criteria are considered for the sorting. The first one is the feasibility of the linker: can the software find angle patterns that correspond to the ones defined by the geometrical path? This question was part of the motivation for a large modeling effort (link) to determine the possible angles between consecutive alpha helices. This was achieved by analyzing the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling distribution of angles] between alpha helices found in the ArchDB database. As nearly any angle between 20 and 170 degrees could be found, only a few paths were actually rejected at that step. The next criterion was the position of the angle points: if they lie inside the protein, the path is rejected. Finally, the software checks whether any atom of the protein is less than 5 &Aring; away from any of the alpha helices; if so, the path is also rejected.<br />
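The first sorting criterion can be sketched as a check of the angle at every angle point of a path. The 20 to 170 degree window is taken from the text; measuring the angle between the two rods meeting at the joint (so that 180° would be a straight continuation) and the function names are assumptions of this sketch.<br />

```python
import math

def joint_angles(path):
    """Angles in degrees at the interior points of a path given as a list of 3D points."""
    angles = []
    for prev, joint, nxt in zip(path, path[1:], path[2:]):
        u = [p - j for p, j in zip(prev, joint)]
        v = [n - j for n, j in zip(nxt, joint)]
        norm = math.hypot(*u) * math.hypot(*v)
        cos_angle = sum(a * b for a, b in zip(u, v)) / norm
        # Clamp against rounding errors before taking the arc cosine.
        angles.append(math.degrees(math.acos(max(-1.0, min(1.0, cos_angle)))))
    return angles

def angles_feasible(path, low=20.0, high=170.0):
    """True if every joint angle lies inside the window observed for helix pairs."""
    return all(low <= angle <= high for angle in joint_angles(path))
```

The two remaining criteria (angle points outside the protein and a 5 &Aring; clearance for every helix) amount to the same point-to-atom and segment-to-atom distance tests used in Step 1.<br />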
<br />
==Shifting paths to the patterns==<br />
<br />
The strategy described in Step 3 gives a certain freedom to the rod that connects the last two angle points generated from the N- and C-terminal points. As this freedom is actually not permitted by the alpha helix and the angle pattern, but is permitted by the flexible part, for example at the C-terminal end, the software slightly refines the path by rotating the segment that originates from the C-terminal point.<br />
<br />
==Weighting of paths==<br />
Before translating the paths into sequences and thus into linkers that can be expressed, each path needs to be evaluated for its potential capacity to enhance heat stability. For this we have identified different contributions that should be combined into one value that defines how good the linker may be and that consequently defines its ranking among all possible linkers. The smaller this value, the more we expect that the linker will enhance thermostability. An important step for all these contributions is the normalization, as explained in the next paragraphs.<br />
The first contribution we considered was the linker length. We assumed that a short linker is better to constrain the protein extremities, and that a long helix might give more flexibility. Because we wanted this value to be independent of the size of the protein, the length of the linker is normalized to the distance between the two termini.<br />
The second contribution relates to the angles used in the linker. We learned from the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling] that angles formed by a certain angle pattern follow a certain distribution. First, we assumed that the narrower the distribution, the more likely the alpha helices would actually produce this angle. Second, the angles found by the software should be as close as possible to those well-defined angles. In this case, the weight value from this contribution should be low.<br />
Then the distance of the linker to the protein is taken into account. Because the linkers should not disturb the protein in its normal environment, linkers that pass close to the protein surface are considered better linkers. The distance was defined as the minimal distance between the linker and all the atoms of the protein. As already mentioned for the sorting of the paths, a linker cannot come closer than 5 &Aring; and this distance was used for normalization of calculated distances.<br />
After this, the places a linker should avoid are calculated. Each protein can interact with other molecules on some parts of its surface. The user can specify where those parts are and how big they are. If a linker passes in front of a potential binding domain, the value of the corresponding path goes to infinity, so that the linker is discarded. Conversely, the farther a linker is from a potential ligand binding domain, the smaller its weighting value. The user can also specify the importance of certain regions. In the end the total weighting is normalized to the number of binding domains.<br />
<br />
===Calibrating the weighting function===<br />
Every contribution has its own distribution. An example is shown in figure ### [[figure histogram_length_lys.png]], but all of them have different shapes. The aim is to find the paths that globally minimize all of these contributions. Therefore, for simplicity, the four contributions mentioned above were combined linearly in the weighting function:<br />
\[ W(p) = \alpha L(p) + \beta A(p) + \gamma D(p) + \delta u(p) \]<br />
where W is the final weighting, p the path, L the length contribution, A the angle contribution, D the distance contribution and u the contribution from the forbidden regions. $\alpha, \beta, \gamma, \delta$ are the weighting constants that needed to be found. The normalization of each contribution was chosen so that each of them is dimensionless and all of them have reasonably similar values.<br />
The weighting constants were obtained from the [https://2014.igem.org/Team:Heidelberg/Project/Linker_Screening linker screening] performed with lysozyme and the [https://2014.igem.org/Team:Heidelberg/Modeling/Enzyme_Modeling modeling of the enzyme activity]. Their calculation is presented in the results below.<br />
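The linear weighting function can be written down directly. In this sketch the four contributions are assumed to be precomputed, normalized numbers; the function names and the explicit handling of an infinite forbidden-region term (so that a zero coefficient cannot turn it into an undefined value) are choices of this illustration.<br />

```python
import math

def path_weight(length_c, angle_c, distance_c, region_c, alpha, beta, gamma, delta):
    """W(p) = alpha*L(p) + beta*A(p) + gamma*D(p) + delta*u(p); smaller is better."""
    if math.isinf(region_c):
        # A path in front of a binding region is discarded regardless of delta.
        return math.inf
    return alpha * length_c + beta * angle_c + gamma * distance_c + delta * region_c

def rank_paths(contributions, alpha, beta, gamma, delta):
    """Sort path names by increasing weight; contributions maps name -> (L, A, D, u)."""
    return sorted(contributions,
                  key=lambda name: path_weight(*contributions[name],
                                               alpha, beta, gamma, delta))
```

For example, with all four constants set to 1, a path with contributions (1.2, 0.1, 1.1, 0.0) ranks ahead of one with (1.0, 0.2, 1.5, 0.1), and any path with an infinite forbidden-region term comes last.<br />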
<br />
==Translating paths to sequence==<br />
As already mentioned, the software is provided with two databases, one for the possible angle patterns and one for the helix patterns. The choice of the patterns was inspired by known crystal structures extracted from databases and described in different papers.<br />
A huge in silico screening for refining the preferences of the patterns was then set up using the [https://2014.igem.org/Team:Heidelberg/Software/igemathome distributed computing] system. For the complete description of the search for suitable patterns, one can read the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling] page. <br />
All the possible paths are now split up at the angles and compared with the possible patterns in the databases. ###Figure needed to explain, how the path is translated### The most suitable patterns are identified and joined together to build the path's sequence. It is important to note that this is only possible because of the modularity of our linker patterns used as building blocks: each block, be it an alpha helix or an angle pattern, is not affected by the others. Thus for each possible path, one sequence is produced.<br />
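The block-wise translation can be sketched as a lookup of the closest helix and angle pattern for each piece of the path. The sequences below are blocks that appear in Table 1 of this page; the lengths and angles assigned to them are placeholders chosen only to make the example run, not measured values, and the nearest-match rule is an assumption of this sketch.<br />

```python
# Sequence blocks as they appear in Table 1 of this page.
ATTACHMENT = "GG"      # N-terminal attachment sequence
EXTEIN = "RGTCWE"      # C-terminal extein

# PLACEHOLDER numbers: the real length (angstroms) and angle (degrees) of each
# pattern come from the modeling databases and are not listed on this page.
HYPOTHETICAL_HELIX_BLOCKS = {9.0: "AEAAAK", 16.5: "AEAAAKEAAAK"}
HYPOTHETICAL_ANGLE_BLOCKS = {90.0: "AAAHPEA", 120.0: "ATGDLA", 150.0: "ASLPAA"}

def nearest_block(value, blocks):
    """Pick the block whose key is closest to the requested length or angle."""
    return blocks[min(blocks, key=lambda key: abs(key - value))]

def path_to_sequence(helix_lengths, angles):
    """Concatenate attachment, alternating helix and angle blocks, and extein."""
    if len(angles) != len(helix_lengths) - 1:
        raise ValueError("a path with n helices needs n - 1 angles")
    parts = [ATTACHMENT]
    for i, length in enumerate(helix_lengths):
        parts.append(nearest_block(length, HYPOTHETICAL_HELIX_BLOCKS))
        if i < len(angles):
            parts.append(nearest_block(angles[i], HYPOTHETICAL_ANGLE_BLOCKS))
    parts.append(EXTEIN)
    return "".join(parts)
```

With these placeholder tables, two short helices joined by the first angle block reproduce the block order of the sgt2 linker in Table 1, and a single short helix reproduces sho2.<br />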
<br />
==Clustering of paths==<br />
Many different paths are represented by the same sequence [[###Figure that shows, different paths have same properties, already before in the text ###]] and we therefore clustered such paths. The weight of each cluster was then calculated by averaging the weights of the paths that compose it.<br />
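The clustering step amounts to grouping paths by their sequence and averaging the weights within each group. A minimal sketch, with an assumed input format of (sequence, weight) pairs:<br />

```python
from collections import defaultdict

def cluster_by_sequence(paths):
    """Group (sequence, weight) pairs by sequence and return {sequence: mean weight}."""
    groups = defaultdict(list)
    for sequence, weight in paths:
        groups[sequence].append(weight)
    return {sequence: sum(weights) / len(weights)
            for sequence, weights in groups.items()}
```

The resulting dictionary can be sorted by value to obtain the final ranking of distinct linker sequences.<br />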
<br />
<br />
=Results=<br />
==DNMT1==<br />
A major motivation of our effort to design rigid linkers with angles was the [https://2014.igem.org/Team:Heidelberg/Project/PCR_2.0 circularization of the DNA methyltransferase Dnmt1]. The truncated form used in our project is composed of 900 amino acids, and the N- and C-terminal extremities are well separated. To circularize it, two linkers were designed: a flexible one made of glycine and serine, and a rigid one designed by the software. The rigid linkers for DNMT1 were obtained from an early version of the software. At that time the calculation took 11 days on a laptop computer with an Intel i5 processor and 8GB of RAM, which shows the importance of a distributed computing system for large proteins. Since then, the software has improved considerably, reducing the calculation time for DNMT1 to about 1 day.<br />
<br />
==Feedback from wet lab==<br />
The results from the software for lysozyme were tested as described in the [https://2014.igem.org/Team:Heidelberg/Project/Linker_Screening linker-screening] part and evaluated as described in the [https://2014.igem.org/Team:Heidelberg/Modeling/Enzyme_Modeling enzyme modeling] part. We performed a large linker screening on 10 different lysozymes with different linkers. As the purpose of the lysozyme screen was the calibration of the software, those linkers were designed according to the four contributions previously mentioned. One of them was the shortest possible, one had the best possible angle, and so on.<br />
{| class="table table-hover"<br />
|+ '''Table 1''': Linkers and their amino acid sequence. Green: attachment sequences to prevent the flexible regions from being perturbed; Blue: angle; Purple: extein.<br />
! Linker<br />
! Amino acid sequence<br />
|- <br />
|<br />
|<br />
|-<br />
|'''Very good linkers'''<br />
|<br />
|-<br />
| sgt2<br />
| <span style="color:#3ADF00;">GG</span>AEAAAK<span style="color:#00BFFF;">AAAHPEA</span>AEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| rigid<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAKAA<span style="color:#3ADF00;">P</span><span style="color:#A901DB;">RGKCWE</span><br />
|-<br />
| <br />
| <br />
|-<br />
| '''Average linkers'''<br />
|<br />
|-<br />
| may1<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAKA<span style="color:#00BFFF;">AAAHPEA</span>AEAAAKEAAAKA<span style="color:#00BFFF;">KTA</span>AEAAAKEAAAKA<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| ord1<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAK<span style="color:#00BFFF;">ATGDLA</span>AEAAAKAA<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| ord3<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAK<span style="color:#00BFFF;">ASLPAA</span>AEAAAKEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
|<br />
| <br />
|-<br />
| '''Short linkers'''<br />
| <br />
|-<br />
| sho1<br />
| <span style="color:#3ADF00;">GG</span><span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| sho2<br />
| <span style="color:#3ADF00;">GG</span>AEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
|}<br />
[[circ_lam_lys_nils.png]] comparison with [[nice_linker_lysozyme_flexible_ends.png]]<br />
###About predictions of software###<br />
In the end we obtained a ranking of the in vitro tested linkers from the [https://2014.igem.org/Team:Heidelberg/Project/Linker_Screening linker-screening] and chose the parameters $\alpha, \beta, \gamma, \delta$ of the weighting function so that the ranking from the software reproduced the ranking from the assays. The final values, $\alpha = 1.85 \times 10^{-6}, \beta = 0.57, \gamma = 50.8 \times 10^{6}$, define a function that reproduces the ranking observed in the wet lab experiments.<br />
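The page does not state how the parameter values were searched, so the following is only one possible way to illustrate the idea of calibration: try candidate constants on a coarse grid and keep the first set whose predicted ranking equals the observed one. The function names, the grid and the toy contributions in the usage note are all assumptions of this sketch.<br />

```python
import itertools

def reproduces_ranking(params, contributions, observed_order):
    """True if sorting by W = alpha*L + beta*A + gamma*D + delta*u gives observed_order."""
    alpha, beta, gamma, delta = params

    def weight(name):
        length_c, angle_c, distance_c, region_c = contributions[name]
        return alpha * length_c + beta * angle_c + gamma * distance_c + delta * region_c

    return sorted(contributions, key=weight) == list(observed_order)

def grid_search(contributions, observed_order, grid):
    """Return the first (alpha, beta, gamma, delta) on the grid that reproduces the ranking."""
    for params in itertools.product(grid, repeat=4):
        if reproduces_ranking(params, contributions, observed_order):
            return params
    return None
```

With toy contributions <code>{"x": (1, 0, 0, 0), "y": (0, 1, 0, 0), "z": (0, 0, 1, 0)}</code> and an observed order z, y, x, any solution must satisfy γ &lt; β &lt; α, and the search returns such a set if the grid contains one.<br />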
<br />
=Discussion=<br />
The software described here allowed us to design<br />
<br />
=References=<br />
<br />
[1] Thornton, J.M. & Sibanda, B.L. Amino and carboxy-terminal regions in globular proteins. Journal of molecular biology 167, 443-460 (1983).<br />
<br />
[2] Wang, C.K.L., Kaas, Q., Chiche, L. & Craik, D.J. CyBase: A database of cyclic protein sequences and structures, with applications in protein discovery and engineering. Nucleic Acids Research 36, (2008).</div>Igemnilshttp://2014.igem.org/Team:Heidelberg/pages/Linker_SoftwareTeam:Heidelberg/pages/Linker Software2014-10-17T23:15:12Z<p>Igemnils: /* Discussion */</p>
<hr />
<div>=Abstract=<br />
As already [https://2014.igem.org/Team:Heidelberg/Toolbox/Circularization introduced], artificially circularized proteins may gain heat stability because their C- and N-termini are restrained from moving freely. Circularization may be trivial when the protein termini are very close to each other, which seems to be reasonably common [[#References|[1]]]. However, if the ends are too far from each other, a long linker is needed to connect them. This linker should not change the natural conformation of the protein, and it should constrain the relative position of the ends so that the degrees of freedom are restricted and the structure stays stable even when heated. In addition, the linker should not affect any of the protein's functions. Consequently, it is important to prevent linkers from passing through the active site or from covering domains that bind other molecules, for example. Therefore, one needs to be able to define the shape of possible linkers. This section describes the software we developed to design such linkers. We would like to stress that this work was made possible by the feedback between computer modeling and experimental work: we first designed linkers in silico, tested them experimentally and used the results to further calibrate the software. To our knowledge, this is the first time that such an approach has been used to custom-design rigid linkers with angles to connect protein extremities.<br />
<br />
=General procedure=<br />
In short, the software provides a weighted list of linkers to circularize any protein of interest with a known structure. These linkers are made of rigid alpha-helical segments connected at defined angles. In contrast to flexible linkers, these rigid linkers were expected to constrain the protein extremities and to confer better heat stability. Such an idea was already developed [[#References|[2]]], but only with alpha helices forming simple rods and without any possibility to introduce angles. To generate the linkers, we first defined the geometrical paths, with segments and angles, that they should follow. The geometrical paths that are biologically feasible are afterwards translated into amino acid sequences. Both the check of the paths against possible structures and the translation were made possible by our [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling approaches]. The first approach consisted in a statistical analysis of more than 17,000 known non-homologous structures containing alpha helices connected with angles. For the second approach, we modeled the conformation of linkers circularizing proteins of known structure and analyzed them for certain properties. This second approach was run for a large number of proteins thanks to our distributed computing system [https://2014.igem.org/Team:Heidelberg/Software/igemathome igemathome]. The software provides different possible linkers with weights that rank them according to their capacity to maintain protein activity at higher temperatures. The weights were derived from an extensive [https://2014.igem.org/Team:Heidelberg/Project/Linker_Screening linker screening] on the target protein lambda-lysozyme, using the first modeling approach.<br />
<br />
=Background=<br />
Classically, protein linkers have been designed in three different ways. The easiest way is to define the distance that a linker should cover and then simply use a flexible glycine-serine peptide with the right number of amino acids to match this distance. Glycine is used for flexibility, as it has no side chain and does not produce any steric hindrance, while serine is used for solubility, as it has a small polar side chain. This solubility is important, as the linkers should not pass through the hydrophobic core of the protein, but should be dissolved in the surrounding medium. These flexible linkers have typically been used for circularization, but also for connecting different proteins when the main goal is that the parts are connected, not how they are connected, or when the flexibility of the linker is required for specific applications. <br />
<br />
A second strategy consists of using rigid helical linkers to keep proteins or protein domains at a certain distance from each other. This is especially important for signalling proteins and fluorescent proteins. One major property of alpha helices is that they always fold in a defined way with well-defined angles and lengths. There are also many different helical patterns that differ in stability and solubility. Although they have been used to design circularizing linkers [[#References|[2]]], one big disadvantage of this strategy is that one can only build straight linkers with helices. So in the context of circularization, if the straight line connecting the protein extremities crosses the protein, this strategy is not an option.<br />
<br />
The third option, which served as a basis for our approach and which came from discussions with the group of Rebecca Wade in Heidelberg, Germany, consists of designing custom-tailored linkers for each specific application. These linkers can be obtained from protein structure prediction. First, one needs to define the path that the linker should take to connect the protein ends. Afterwards, one designs a possible linker sequence that might fit well. Next, one makes a structure prediction of the linker attached to the protein to validate the design. Several different linkers, with slight changes, can be compared. This is repeated several times until the linker effectively follows the expected path. This method requires strong knowledge of protein folding and protein structure prediction and is computationally intensive. On the other hand, the benefit can be substantial, as the interaction of the linker with the protein surface can be taken into account and as one can accurately define the path taken by the linker down to the resolution of the protein structure.<br />
<br />
We have set up a completely new strategy to design rigid linkers. As further detailed in the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling] part, it is possible to define the shape of a linker by combining rigid alpha-helical rods with well-defined angle patterns. Therefore, by defining, in a geometrical way, the possible paths of the circularizing linkers for a given protein, we can then propose potential linkers. This definition of the geometrical path can be very difficult, especially for large proteins with complex shapes. Moreover, this definition is further constrained by the fact that linkers must avoid hiding active sites of the protein of interest. Finally, the paths have rotational degrees of freedom at the extremities of the protein, and depending on their orientation, they may or may not match the geometry of the protein. The tool we present here covers the two steps: defining geometrical paths with some weights and translating them into feasible linkers, also with weights. This tool is universal as it has the capacity to design circularizing linkers for any protein with a known structure. Moreover, it is modular as, thanks to our [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling approach], we have designed linkers as exchangeable blocks of rods of different lengths and of angle patterns. The following sections detail the different steps followed by our software to design proper linkers.<br />
<br />
<br />
==PDB parsing==<br />
To start, the PDB file containing the structure of the target protein is parsed and the coordinates of the atoms are stored in metric units. After this, some initial tests are made on the protein structure. First, we checked whether the C- and N-termini lie on the surface of the protein and are accessible to the solvent, which is crucial for circularization. We defined a line originating from an extremity of the protein using the two angles of the spherical coordinates around the z-axis. From that, we could determine the accessible angles by rejecting all the lines that are too close to the protein. As the future linker will be made of alpha helices and will therefore have a radius of 5 &Aring;, we used this length as the minimal allowed distance.<br />
Those allowed angles are stored for the coming linker generation. ###fig needed### <br />
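The parsing and accessibility test described above can be sketched as follows. This is only an illustration under stated assumptions, not the team's implementation: the function names are hypothetical, coordinates are kept in &Aring; as they appear in the PDB file, the 5-degree scan step is borrowed from the path generation described later, and the atom list passed in is assumed to exclude the atoms of the terminus itself (otherwise every line would be rejected).<br />

```python
import math

def parse_pdb_atoms(pdb_text):
    """Return (x, y, z) tuples, in angstroms, for every ATOM record."""
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            # Fixed-width PDB columns (1-based): x 31-38, y 39-46, z 47-54.
            atoms.append((float(line[30:38]), float(line[38:46]), float(line[46:54])))
    return atoms

def direction(theta_deg, phi_deg):
    """Unit vector for polar angle theta (from the z-axis) and azimuth phi."""
    t, p = math.radians(theta_deg), math.radians(phi_deg)
    return (math.sin(t) * math.cos(p), math.sin(t) * math.sin(p), math.cos(t))

def ray_clearance(origin, unit_dir, atoms):
    """Smallest distance between any atom and the half-line origin + s * unit_dir, s >= 0."""
    best = math.inf
    for atom in atoms:
        rel = [a - o for a, o in zip(atom, origin)]
        # Projection onto the ray, clamped so points behind the origin map to the origin.
        s = max(0.0, sum(r * d for r, d in zip(rel, unit_dir)))
        closest = [o + s * d for o, d in zip(origin, unit_dir)]
        best = min(best, math.dist(atom, closest))
    return best

def accessible_angles(origin, atoms, min_clearance=5.0, step_deg=5):
    """All (theta, phi) pairs whose line stays at least min_clearance away from every atom."""
    allowed = []
    for theta in range(0, 181, step_deg):
        for phi in range(0, 360, step_deg):
            if ray_clearance(origin, direction(theta, phi), atoms) >= min_clearance:
                allowed.append((theta, phi))
    return allowed
```

The default `min_clearance` of 5.0 corresponds to the helix radius quoted above; every other default is a placeholder.<br />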
<br />
==Generation of geometric paths==<br />
As our strategy consists of building linkers with helical rods and connecting angles, a path is completely defined by the coordinates of the angle points. Advancing one step from an existing point is always done by adding a displacement vector to this point. This vector is defined by the two spherical angles, chosen here in a discrete manner with an increment of 5 degrees, and by a length, also chosen in a discrete manner. This discrete length is used in two different contexts: it may correspond to the length of an alpha helix or to the length of the flexible part that appears at the extremity of the protein. The coordinates reached with this vector define the new coordinates of an angle point, regardless of whether the vector corresponds to an alpha helix or to a flexible part. ###figure, points + vectors###. As we screen all possible angles in a discrete manner, the coordinates of those angle points are regularly distributed on a sphere. As further detailed in the next sections, those spheres are defined from both ends, either once or in several steps. Then the software checks for possible straight connections of given lengths for each pair of angle points originating from the two extremities. ###figure needed, with all these possible steps, one sphere around each point and checking for the connections, 2D, different lengths in two different figures, ###<br />
The linkers are built in a modular way, with blocks of well-defined size. From the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling] of potential linkers, we could derive 8 different alpha-helical rods, all with different lengths. In addition, the length of the two segments inside an angle block was always 8 &Aring;, so exchanging angle blocks does not affect the length of the linker. This means that the distance between the angle points is well defined, an essential aspect of our strategy of linker design.<br />
The software proceeds in three steps. First, it checks for the possibility of a direct single alpha helix linker. For this, it applies the procedure just mentioned with spheres whose radius reasonably corresponds to the length of the short flexible parts at the extremities of the protein. Second, it tests if a linker containing two alpha helices connected with a right angle allows the circularization. Finally, it searches for the possible linkers with three angle points. The next parts explain those three steps in detail.<br />
This method was chosen because it could be implemented easily and efficiently in our program. However, this strategy also generates paths that cross the protein. Therefore we put considerable effort into sorting out the paths.<br />
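As an illustration of the displacement-vector construction described in this section, the following minimal sketch generates the candidate angle points around one existing point. The function name and signature are hypothetical; only the 5-degree increment comes from the text.<br />

```python
import math

def sphere_points(origin, length, step_deg=5):
    """Points reached from origin by displacement vectors of the given length,
    with both spherical angles sampled every step_deg degrees."""
    points = []
    for theta in range(0, 181, step_deg):
        # At the poles every azimuth gives the same point, so keep a single one.
        phis = [0] if theta in (0, 180) else range(0, 360, step_deg)
        for phi in phis:
            t, p = math.radians(theta), math.radians(phi)
            points.append((
                origin[0] + length * math.sin(t) * math.cos(p),
                origin[1] + length * math.sin(t) * math.sin(p),
                origin[2] + length * math.cos(t),
            ))
    return points
```

With a 5-degree step this yields 35 x 72 + 2 = 2522 points per sphere. The grid is regular in the two angles, so the points lie closer together near the poles than near the equator.<br />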
<br />
===Step 1===<br />
As a simple rigid linker with no angle would be easier to design and likely more thermostable than the ones containing angles, the software first checks if this simple solution is possible ###figure needed###. For this, we took into account the fact that proteins have some flexible amino acids at their extremities. This flexible part may come from the protein itself, but also from the 2 glycines that are included at the N-terminal part and from the extein at the C-terminal part. These two latter parts come from our linkers. They have no preferential angles and offer a large number of possibilities to insert fitting linkers. But this flexibility is also a drawback, as we have to include this large number of possible angles and lengths in our path search.<br />
In this first step, the software explicitly takes these flexible parts into account to check for the possibility of straight linkers. As the angles and the length of the flexible parts are variable, the software positions their extremity on a sphere centered on the last fixed position of the structure, as explained above. The radius of this sphere is incremented in a discrete manner, in 4 steps, from 5.25 &Aring; to the maximum length of the flexible part. ###figure###<br />
Then all possible straight segments between the points originating from the two extremities are tested. If they are closer than 5 &Aring; to the protein, or if they cross it, they are rejected. If they are kept, the software checks whether the length of the segment is compatible with the feasible alpha helices: if the length of a given segment equals one of the 8 alpha helix lengths plus or minus 0.75 &Aring;, the path is saved.<br />
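The two tests applied to each candidate segment in Step 1 can be sketched as below. This is a simplified illustration: the function names are hypothetical, the helix lengths in `EXAMPLE_HELIX_LENGTHS` are placeholders and not the 8 rod lengths derived from our modeling, and the atoms next to the segment ends are assumed to have been removed from `atoms` beforehand.<br />

```python
import math

# Placeholder values for illustration only, NOT the actual 8 rod lengths.
EXAMPLE_HELIX_LENGTHS = [9.0, 12.0, 15.0, 18.0, 21.0, 24.0, 27.0, 30.0]

def point_segment_distance(p, a, b):
    """Distance from point p to the closest point of the segment a-b."""
    ab = [bi - ai for ai, bi in zip(a, b)]
    ap = [pi - ai for ai, pi in zip(a, p)]
    denom = sum(c * c for c in ab)
    # Fraction along the segment of the closest point, clamped to [0, 1].
    s = 0.0 if denom == 0 else max(0.0, min(1.0, sum(x * y for x, y in zip(ap, ab)) / denom))
    closest = [ai + s * c for ai, c in zip(a, ab)]
    return math.dist(p, closest)

def segment_is_clear(a, b, atoms, min_clearance=5.0):
    """True if no atom comes closer than min_clearance to the segment a-b."""
    return all(point_segment_distance(atom, a, b) >= min_clearance for atom in atoms)

def matching_helix_length(a, b, helix_lengths, tol=0.75):
    """Return the first helix length within tol of |a - b|, or None if there is none."""
    gap = math.dist(a, b)
    for length in helix_lengths:
        if abs(gap - length) <= tol:
            return length
    return None

def straight_linker(a, b, atoms, helix_lengths, min_clearance=5.0, tol=0.75):
    """Helix length of a feasible straight linker between a and b, or None."""
    if not segment_is_clear(a, b, atoms, min_clearance):
        return None
    return matching_helix_length(a, b, helix_lengths, tol)
```

A segment that crosses the protein necessarily passes within 5 &Aring; of some atom, so in this sketch the clearance test also covers the crossing case.<br />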
===Step 2===<br />
The next possibility to design more complex rigid linkers, while still taking flexible ends into account with a reasonable calculation time, was to reduce the number of possible angles. As we originally thought that a 90° angle would be practically feasible, the software was designed to generate linkers with flexible ends and one 90° angle. ###figure, protein_one_90° angle### This choice was notably made because the edge lengths of a right triangle are simple to calculate. We already saw in Step 1 that the length of an edge can only take 8 different values. As the linkers have to start from the extremities of the protein, and as we impose a right angle, the number of possible paths is low, making them easy to compute. In practice, the extremities of the protein are positioned in a flexible way as in Step 1. From each of the positions allowed by this flexibility, the software searches for all the allowed right triangles. This restriction was mainly introduced because the degrees of freedom needed to be limited to keep the calculations feasible. <br />
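One way to picture the right-triangle search: for two end positions separated by a distance d, two helices of lengths a and b joined at 90° can connect them only if a² + b² is close to d². The sketch below enumerates such pairs. It is an assumption-laden illustration, not the actual routine: the function name is hypothetical, and applying the 0.75 &Aring; tolerance of Step 1 to the hypotenuse is our guess.<br />

```python
import math

def right_angle_pairs(end_distance, helix_lengths, tol=0.75):
    """Ordered pairs (first, second) of helix lengths that, joined at a right angle,
    span end_distance within tol (Pythagorean test on the hypotenuse)."""
    pairs = []
    for first in helix_lengths:
        for second in helix_lengths:
            if abs(math.hypot(first, second) - end_distance) <= tol:
                pairs.append((first, second))
    return pairs
```

With 8 possible helix lengths there are only 64 ordered pairs to test per pair of end positions, which is why this step is cheap compared with a free angle search.<br />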
===Step 3===<br />
Finally the software also provides the possibility to find paths with up to 4 edges, meaning 4 alpha helices and 3 angles. Thanks to the modularity of the possible linkers, such paths can offer the possibility to circularize theoretically any kind of protein. ###figure of torus###<br />
To keep the calculation feasible in a reasonable time, we designed the search strategy so that the flexible parts at the extremities are oriented in the same direction as the adjacent alpha helix. This obviously restricts the search, but as these orientations are allowed for the flexible parts, the approach remains valid. <br />
First, potential end points of the first alpha-helical rod are calculated from the N-terminal point of the protein. The orientation is chosen in a discrete manner, with an increment of 5 degrees for the two angles of the spherical coordinates. The distance from the origin corresponds to the 8 possible lengths allowed by the alpha helices, as already seen in Step 2, plus a length of 0, which mimics a linker with 3 instead of 4 edges. The exact same procedure is repeated to define all the potential end points of the second alpha-helical rod, starting from all the possible end points of the first alpha-helical rod. Thanks to the possibility of a length of 0 for the first and the second rods, the software also calculates paths with 2 edges. Then, the same is done only once from the C-terminal point of the protein, defining 1 edge. The final step consists of checking whether the points originating from the N- and C-terminal points can be linked by a potential alpha helix, i.e. whether they are separated by an appropriate distance. If the distance between two points equals any of the possible alpha helix lengths plus or minus 0.75 &Aring;, the path is saved. In the same way, if two points are closer than 0.75 &Aring; to each other, the path is also saved.<br />
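The search of Step 3 can be summarized by the following sketch: two rounds of extension from the N-terminal point (with a zero length allowed), one round from the C-terminal point, and a final distance test between the two sets of end points. All names are hypothetical, the flexible parts are ignored, and the candidate paths are held in memory, which is only practical for coarse angular steps; at 5 degrees the number of combinations requires a streaming implementation.<br />

```python
import math

def directions(step_deg):
    """Unit vectors on a regular grid of the two spherical angles."""
    dirs = []
    for theta in range(0, 181, step_deg):
        phis = [0] if theta in (0, 180) else range(0, 360, step_deg)
        for phi in phis:
            t, p = math.radians(theta), math.radians(phi)
            dirs.append((math.sin(t) * math.cos(p), math.sin(t) * math.sin(p), math.cos(t)))
    return dirs

def extend(paths, lengths, step_deg):
    """Append one rod to every path, for every length and every direction."""
    dirs = directions(step_deg)
    new_paths = []
    for path in paths:
        last = path[-1]
        for length in lengths:
            if length == 0:
                new_paths.append(path)  # zero-length rod: the path is unchanged
                continue
            for d in dirs:
                new_paths.append(path + (tuple(c + length * u for c, u in zip(last, d)),))
    return new_paths

def find_paths(n_term, c_term, helix_lengths, step_deg, tol=0.75):
    """Paths from n_term to c_term with at most 4 rods, as tuples of points."""
    with_zero = [0.0] + list(helix_lengths)
    n_paths = extend(extend([(n_term,)], with_zero, step_deg), with_zero, step_deg)
    c_paths = extend([(c_term,)], helix_lengths, step_deg)
    found = []
    for n_path in n_paths:
        for c_path in c_paths:
            gap = math.dist(n_path[-1], c_path[-1])
            # Keep the pair if the end points coincide or a helix fits between them.
            if gap < tol or any(abs(gap - length) <= tol for length in helix_lengths):
                found.append(n_path + tuple(reversed(c_path)))
    return found
```

For example, `find_paths((0.0, 0.0, 0.0), (0.0, 0.0, 30.0), [10.0], step_deg=90)` returns, among others, the straight path made of three 10 &Aring; rods along the z-axis.<br />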
<br />
==Sorting out of paths==<br />
The previous part described the generation of paths that can connect the two extremities of the protein irrespective of the position of these paths relative to the protein. While this allows fast computation of the geometrical paths, it also implies that the paths that are not practically feasible need to be sorted out. This is the most time-consuming part of the computation, as about 1 billion paths are generated. Three criteria are considered for the sorting. The first one is the feasibility of the linker: can the software find angle patterns that correspond to the ones defined by the geometrical path? This question was part of the motivation for a large modeling effort (link) to determine the possible angles between consecutive helices. This was achieved by analyzing the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling distribution of angles] between alpha helices found in the ArchDB database. As nearly any angle between 20 and 170 degrees could be found, only few paths were actually rejected at that step. The next criterion is the position of the angle points: if one of them lies inside the protein, the path is rejected. Finally, the software checks whether any atom of the protein is less than 5 &Aring; away from any of the alpha helices; if so, the path is also rejected.<br />
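The first sorting criterion can be approximated by a simple range test on the angle at each angle point. The sketch below is a simplification under assumptions: the real check compares against the angle patterns in the database, the angle is taken here as the interior angle at the vertex (the convention used by the software is not stated above), and the function names are hypothetical.<br />

```python
import math

def vertex_angle(prev_pt, vertex, next_pt):
    """Interior angle in degrees at vertex between the rods to prev_pt and next_pt."""
    u = [a - b for a, b in zip(prev_pt, vertex)]
    v = [a - b for a, b in zip(next_pt, vertex)]
    norm = math.sqrt(sum(c * c for c in u)) * math.sqrt(sum(c * c for c in v))
    cos_angle = sum(a * b for a, b in zip(u, v)) / norm
    # Clamp against rounding errors before taking the arccosine.
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))

def angles_feasible(path, lo=20.0, hi=170.0):
    """True if every angle point of the path has an angle within [lo, hi] degrees."""
    return all(
        lo <= vertex_angle(path[i - 1], path[i], path[i + 1]) <= hi
        for i in range(1, len(path) - 1)
    )
```

Paths containing zero-length rods must have their duplicate points removed first, otherwise the angle is undefined.<br />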
<br />
==Shifting paths to the patterns==<br />
<br />
The strategy described in Step 3 gives a certain freedom to the rod that connects the last two angle points generated from the N- and C-terminal points. As this freedom is actually not permitted by the alpha helix and the angle pattern, but is permitted by the flexible part, for example at the C-terminal end, the software slightly refines the path by rotating the segment that originates from the C-terminal point.<br />
<br />
==Weighting of paths==<br />
Before translating the paths into sequences and thus into linkers that can be expressed, each path needs to be evaluated for its potential capacity to enhance heat stability. For this we have identified different contributions that should be combined into one value that defines how good the linker may be and that consequently defines its ranking among all possible linkers. The smaller this value, the more we expect that the linker will enhance thermostability. An important step for all these contributions is the normalization, as explained in the next paragraphs.<br />
The first contribution we considered was the linker length. We assumed that a short linker is better to constrain the protein extremities, and that a long helix might give more flexibility. Because we wanted this value to be independent of the size of the protein, the length of the linker is normalized to the distance between the two termini.<br />
The second contribution relates to the angles used in the linker. We learned from the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling] that angles formed by a certain angle pattern follow a certain distribution. First, we assumed that the narrower the distribution, the more likely the alpha helices would actually produce this angle. Second, the angles found by the software should be as close as possible to those well-defined angles. In this case, the weight value from this contribution should be low.<br />
Then the distance of the linker to the protein is taken into account. Because the linkers should not disturb the protein in its normal environment, linkers that pass close to the protein surface are considered better linkers. The distance was defined as the minimal distance between the linker and all the atoms of the protein. As already mentioned for the sorting of the paths, a linker cannot come closer than 5 &Aring; and this distance was used for normalization of calculated distances.<br />
After this, the places a linker should avoid are calculated. Each protein can interact with other molecules on some parts of its surface. The user can specify where and how big those parts are. If a linker passes in front of a potential binding domain, the value of the corresponding path goes to infinity, so that the linker is discarded. Conversely, the farther a linker is from a potential ligand binding domain, the smaller its weighting value. The user can also specify the importance of certain regions. In the end, the total weighting is normalized to the number of binding domains.<br />
<br />
===Calibrating the weighting function===<br />
Every contribution has its own distribution. An example is shown in figure ### [[figure histogram_length_lys.png]], but all of them have different shapes. The aim is to find the paths that globally minimize all of these contributions. Therefore, for simplicity, the four contributions mentioned above were combined in a linear manner in the weighting function:<br />
\[ W(p) = \alpha L(p) + \beta A(p) + \gamma D(p) + \delta u(p) \]<br />
where W is the final weighting, p the path, L the length contribution, A the angle contribution, D the distance contribution and u the contribution from the forbidden regions. $\alpha, \beta, \gamma, \delta$ are the weighting constants that needed to be found. The normalization performed for each contribution was chosen so that each of them is dimensionless and that all have reasonably similar values.<br />
The weighting constants were obtained from the [https://2014.igem.org/Team:Heidelberg/Project/Linker_Screening linker screening] performed with lysozyme and the [https://2014.igem.org/Team:Heidelberg/Modeling/Enzyme_Modeling modeling of the enzyme activity]. Their calculation is presented in the results below.<br />
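The linear weighting function and the resulting ranking can be sketched as follows. The names are hypothetical, the contributions A, D and u are taken as already computed and normalized, and the length contribution uses the two end points of the path as a stand-in for the two termini.<br />

```python
import math

def path_length(path):
    """Sum of the rod lengths along a path given as a sequence of points."""
    return sum(math.dist(a, b) for a, b in zip(path, path[1:]))

def length_contribution(path):
    """L(p): linker length divided by the straight-line distance between its ends."""
    return path_length(path) / math.dist(path[0], path[-1])

def total_weight(L, A, D, u, alpha, beta, gamma, delta):
    """W(p) = alpha*L + beta*A + gamma*D + delta*u; an infinite u discards the path."""
    if math.isinf(u):
        return math.inf  # avoids 0 * inf = nan when delta is zero
    return alpha * L + beta * A + gamma * D + delta * u

def rank(candidates, alpha, beta, gamma, delta):
    """candidates: list of (name, L, A, D, u). Returns names, best (smallest W) first."""
    scored = [
        (total_weight(L, A, D, u, alpha, beta, gamma, delta), name)
        for name, L, A, D, u in candidates
    ]
    return [name for weight, name in sorted(scored) if not math.isinf(weight)]
```

Because the function is linear, changing one constant shifts the ranking only through the corresponding contribution, which keeps the calibration against the screening data tractable.<br />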
<br />
==Translating paths to sequence==<br />
As already mentioned before the software is provided with two databases, one for the possible angle patterns and one for the helix patterns. The choice of the patterns was inspired by known crystal structures extracted from databases and described in different papers.<br />
A large in silico screening to refine the preferences of the patterns was then set up using the [https://2014.igem.org/Team:Heidelberg/Software/igemathome distributed computing] system. For the complete description of the search for suitable patterns, see the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling] page. <br />
All the possible paths are now split up at the angles and compared with the possible patterns in the databases. ###Figure needed to explain, how the path is translated### The most suitable patterns are identified and concatenated to build the path's sequence. It is important to note that this is only possible because of the modularity of our linker patterns used as building blocks: each block, be it an alpha helix or an angle pattern, is not affected by the others. Thus for each possible path, one sequence is produced.<br />
<br />
==Clustering of paths==<br />
Many different paths are represented by the same sequence [[###Figure that shows, different paths have same properties, already before in the text ###]] and we therefore clustered such paths. The weights for those clustered paths were then calculated by averaging the weights of the different paths that compose a cluster.<br />
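The clustering step amounts to grouping paths by their sequence and averaging their weights, as in this minimal sketch (hypothetical function name; the input is assumed to be a list of (sequence, weight) pairs produced by the previous steps).<br />

```python
from collections import defaultdict

def cluster_by_sequence(weighted_paths):
    """Group (sequence, weight) pairs by sequence and average the weights.
    Returns (sequence, mean_weight, number_of_paths) tuples, best (smallest) first."""
    groups = defaultdict(list)
    for sequence, weight in weighted_paths:
        groups[sequence].append(weight)
    clusters = [(seq, sum(ws) / len(ws), len(ws)) for seq, ws in groups.items()]
    return sorted(clusters, key=lambda cluster: cluster[1])
```

Keeping the number of paths per cluster alongside the mean makes it possible to see how many geometrical solutions support each sequence.<br />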
<br />
<br />
=Results=<br />
==DNMT1==<br />
A major motivation of our effort to design rigid linkers with angles was the [https://2014.igem.org/Team:Heidelberg/Project/PCR_2.0 circularization of the DNA methyltransferase Dnmt1]. The truncated form used in our project is composed of 900 amino acids, and the N- and C-terminal extremities are well separated. To circularize it, two linkers were designed: a flexible one made of glycine and serine, and a rigid one designed by the software. The rigid linkers for DNMT1 were obtained from an early version of the software. At that time the calculation took 11 days on a laptop computer with an Intel i5 processor and 8 GB of RAM, which shows the importance of a distributed computing system for large proteins. Since then, the software has improved a lot, reducing the calculation time to about 1 day for DNMT1.<br />
<br />
==Feedback from wet lab==<br />
The results from the software for lysozyme were tested as described in the [https://2014.igem.org/Team:Heidelberg/Project/Linker_Screening linker-screening] part and evaluated as described in the [https://2014.igem.org/Team:Heidelberg/Modeling/Enzyme_Modeling enzyme modeling] part. We have performed a large linker screening on 10 different lysozymes with different linkers. As the purpose of the lysozyme screen was the calibration of the software, those linkers were designed according to the four contributions previously mentioned. One of them was the shortest possible, one had the best possible angle, and so on.<br />
{| class="table table-hover"<br />
|+ '''table 1''': Linkers and their amino acid sequence. Green: attachment sequences to prevent the flexible regions from being perturbed; Blue: angle; Purple: extein.<br />
! Linker<br />
! Amino acid sequence<br />
|- <br />
|<br />
|<br />
|-<br />
|'''Very good linkers'''<br />
|<br />
|-<br />
| sgt2<br />
| <span style="color:#3ADF00;">GG</span>AEAAAK<span style="color:#00BFFF;">AAAHPEA</span>AEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| rigid<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAKAA<span style="color:#3ADF00;">P</span><span style="color:#A901DB;">RGKCWE</span><br />
|-<br />
| <br />
| <br />
|-<br />
| '''Average linkers'''<br />
|<br />
|-<br />
| may1<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAKA<span style="color:#00BFFF;">AAAHPEA</span>AEAAAKEAAAKA<span style="color:#00BFFF;">KTA</span>AEAAAKEAAAKA<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| ord1<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAK<span style="color:#00BFFF;">ATGDLA</span>AEAAAKAA<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| ord3<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAK<span style="color:#00BFFF;">ASLPAA</span>AEAAAKEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
|<br />
| <br />
|-<br />
| '''Short linkers'''<br />
| <br />
|-<br />
| sho1<br />
| <span style="color:#3ADF00;">GG</span><span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| sho2<br />
| <span style="color:#3ADF00;">GG</span>AEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
|}<br />
[[circ_lam_lys_nils.png]] comparison with [[nice_linker_lysozyme_flexible_ends.png]]<br />
###About predictions of software###<br />
In the end we obtained a ranking of the in vitro tested linkers from the [https://2014.igem.org/Team:Heidelberg/Project/Linker_Screening linker-screening] and chose the parameters $\alpha, \beta, \gamma, \delta$ of the weighting function so that the ranking from the software reproduced the ranking from the assays. The final values, $\alpha = 1.85 \times 10^{-6}, \beta = 0.57, \gamma = 50.8 \times 10^{6}$, set up a function that could reproduce the ranking observed in the wet lab experiments.<br />
<br />
=Discussion=<br />
The software described here allowed us to design<br />
<br />
=References=<br />
<br />
[1] Thornton, J.M. & Sibanda, B.L. Amino and carboxy-terminal regions in globular proteins. Journal of molecular biology 167, 443-460 (1983).<br />
<br />
[2] Wang, C.K.L., Kaas, Q., Chiche, L. & Craik, D.J. CyBase: A database of cyclic protein sequences and structures, with applications in protein discovery and engineering. Nucleic Acids Research 36, (2008).</div>Igemnilshttp://2014.igem.org/Team:Heidelberg/pages/Linker_SoftwareTeam:Heidelberg/pages/Linker Software2014-10-17T23:13:33Z<p>Igemnils: /* Feedback from wet lab */</p>
<hr />
<div>=Abstract=<br />
As already [https://2014.igem.org/Team:Heidelberg/Toolbox/Circularization introduced], artificially circularized proteins may gain some heat stability by restraining the C- and N-terminus from moving around freely. This circularization may be trivial when the protein termini are very close to each other, which seems to be reasonably common [[#References|[1]]]. However, if the ends are too far from each other, a long linker is needed to connect them. This linker should not change the natural conformation of the protein and should constrain the relative position of the ends to restrict the degrees of freedom and thus stabilize the structure even when heated. In addition, these linkers should not affect any of the protein functions. Consequently, it is important to prevent linkers from passing through the active site or from covering binding domains for other molecules, for example. Therefore one needs to be able to define the shape of possible linkers. This section describes the software we developed to design such linkers. We would like to stress that this work was made possible by the feedback between computer modeling and experimental work: we could first design linkers in silico, test them experimentally and use the results to further calibrate the software. To our knowledge, this is the first time that such an approach has been used to custom-design rigid linkers with angles to connect protein extremities.<br />
<br />
=General procedure=<br />
In short, the software can provide a weighted list of linkers to circularize any protein of interest with a known structure. Those linkers are made of rigid alpha-helical segments connected with defined angles. In contrast to flexible linkers, those rigid linkers were expected to constrain the protein extremities and to confer better heat stability. Such an idea was already developed [[#References|[2]]] but only with alpha helices defining simple rods, and without any possibility to introduce angles. To generate those linkers, we first defined the geometrical paths, with segments and angles, that they should follow. The geometrical paths that are biologically feasible are afterwards translated into amino acid sequences. Both the compatibility of paths with possible structures and the translation were made possible thanks to our [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling approaches]. The first approach consisted of a statistical analysis of more than 17000 known non-homologous structures containing alpha helices connected with angles. For the second approach, we modeled the conformation of linkers circularizing proteins of known structure and analyzed them for certain properties. This second approach was run for a large number of proteins thanks to our distributed computing system [https://2014.igem.org/Team:Heidelberg/Software/igemathome igemathome]. The software provides different possible linkers with weights that rank the linkers according to their capacity to maintain protein activity at higher temperatures. They were generated thanks to an extensive [https://2014.igem.org/Team:Heidelberg/Project/Linker_Screening linker screening] on the target protein lambda-lysozyme, using the first modeling approach.<br />
<br />
=Background=<br />
Classically, protein linkers have been designed in three different ways. The easiest is to define the length that a linker should cover and then simply use a flexible glycine-serine peptide with the right number of amino acids to match this length. Glycine is used for flexibility, as it has no side chain and does not produce any steric hindrance, while serine is used for solubility, as it has a small polar side chain. This solubility is important, as the linkers should not pass through the hydrophobic core of the protein, but should be dissolved in the surrounding medium. These flexible linkers are normally used for circularization but also for connecting different proteins, when the main goal is that the different parts are connected, regardless of how they are connected, or when the flexibility of the linker is required for specific applications. <br />
<br />
A second strategy consists of using rigid helical linkers to keep proteins or protein domains at a certain distance from each other. This is especially important for signalling proteins and fluorescent proteins. One major property of alpha helices is that they always fold in a defined way with well-defined angles and lengths. There are also many different helical patterns that differ in stability and solubility. Although they have been used to design circularizing linkers [[#References|[2]]], one big disadvantage of this strategy is that one can only build straight linkers with helices. So in the context of circularization, if the straight line connecting the protein extremities crosses the protein, this strategy is not an option.<br />
<br />
The third option, which served as a basis for our approach and which came from discussions with the group of Rebecca Wade in Heidelberg, Germany, consists of designing custom-tailored linkers for each specific application. These linkers can be obtained from protein structure prediction. First, one needs to define the path that the linker should take to connect the protein ends. Afterwards, one designs a possible linker sequence that might fit well. Next, one makes a structure prediction of the linker attached to the protein to validate the design. Several different linkers, with slight changes, can be compared. This is repeated several times until the linker effectively follows the expected path. This method requires strong knowledge of protein folding and protein structure prediction and is computationally intensive. On the other hand, the benefit can be substantial, as the interaction of the linker with the protein surface can be taken into account and as one can accurately define the path taken by the linker down to the resolution of the protein structure.<br />
<br />
We have set up a completely new strategy to design rigid linkers. As further detailed in the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling] part, it is possible to define the shape of a linker by combining rigid alpha-helical rods with well-defined angle patterns. Therefore, by defining, in a geometrical way, the possible paths of the circularizing linkers for a given protein, we can then propose potential linkers. This definition of the geometrical path can be very difficult, especially for large proteins with complex shapes. Moreover, this definition is further constrained by the fact that linkers must avoid hiding active sites of the protein of interest. Finally, the paths have rotational degrees of freedom at the extremities of the protein, and depending on their orientation, they may or may not match the geometry of the protein. The tool we present here covers the two steps: defining geometrical paths with some weights and translating them into feasible linkers, also with weights. This tool is universal as it has the capacity to design circularizing linkers for any protein with a known structure. Moreover, it is modular as, thanks to our [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling approach], we have designed linkers as exchangeable blocks of rods of different lengths and of angle patterns. The following sections detail the different steps followed by our software to design proper linkers.<br />
<br />
<br />
==PDB parsing==<br />
To start, the PDB file containing the structure of the target protein is parsed and the coordinates of the atoms are stored in metric units. After this, some initial tests are made on the protein structure. First, we checked whether the C- and N-termini lie on the surface of the protein and are accessible to the solvent, which is crucial for circularization. We defined a line originating from an extremity of the protein using the two angles of the spherical coordinates around the z-axis. From that, we could determine the accessible angles by rejecting all the lines that are too close to the protein. As the future linker will be made of alpha helices and will therefore have a radius of 5 &Aring;, we used this length as the minimal allowed distance.<br />
Those allowed angles are stored for the coming linker generation. ###fig needed### <br />
<br />
==Generation of geometric paths==<br />
As our strategy consists in building linkers with helical rods and connecting angles, a path is completely defined by the coordinates of the angle points. Advancing one step from an existing point is always done by adding a displacement vector on this point. This vector is defined by the two spherical angles, chosen here in a discrete manner with an increment of 5 degrees, and by a length, also chosen in a discrete manner. This discrete length was used in two different contexts: it may correspond to the length of an alpha helix or to the length of the flexible part that appears at the extremity of the protein. The coordinates that are reached thanks to this vector defines the new coordinates of an angle point, no matter if the vector corresponds to an alpha helix or to a flexible part. ###figure, points + vectors###. As we screen for all possible angles in a discrete manner, those angle points coordinates are regularly distributed on a sphere. As further detailed in the next sections, those spheres are defined from both ends, either once or in several steps. Then the software checks for possible straight connections of given lengths for each pair of angle points originated from both extremities. ###figure needed, with all these possible steps, one sphere around each point and checking for the connections, 2D, different lengths in two different figures, ###<br />
The linkers are built in a modular way, with blocks of well-defined size. From the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling] of potential linkers, we could derive 8 different alpha helical rods, all with different lengths. In addition, the length of the two segments inside an angle block was always 8 &Aring;, so exchanging angle blocks does not affect the length of the linker. This means that the distance between the angle points is well defined, an essential aspect of our strategy of linker design.<br />
The software proceeds in three steps. First, it checks whether a direct linker made of a single alpha helix is possible. For this, it applies the procedure just described, with spheres whose radii correspond to plausible lengths of the short flexible parts at the extremities of the protein. Second, it tests whether a linker containing two alpha helices connected by a right angle allows circularization. Finally, it searches for possible linkers with up to three angle points. The next parts explain these three steps in detail.<br />
This method was chosen because it could be implemented easily and efficiently in our program. However, it also generates paths that cross the protein, so we put considerable effort into sorting out the paths.<br />
<br />
===Step 1===<br />
As a simple rigid linker with no angle would be easier to design and likely more thermostable than linkers containing angles, the software first checks if this simple solution is possible ###figure needed###. For this, we took into account the fact that proteins have some flexible amino acids at their extremities. This flexible part may come from the protein itself, but also from the 2 glycines that are included at the N-terminal part and from the extein at the C-terminal part. The latter two parts come from our linkers. These parts have no preferential angles and offer a large number of possibilities to insert fitting linkers. But this flexibility is also a drawback, as we have to include this large number of possible angles and lengths in our path search.<br />
In this first step, the software explicitly takes these flexible parts into account to check for the possibility of straight linkers. As the angles and the length of the flexible parts are variable, the software positions their extremity on a sphere centered on the last fixed position of the structure, as explained above. The radius of this sphere is incremented in a discrete manner, in 4 steps, from 5.25 &Aring; to the maximum length of the flexible part. ###figure###<br />
Then all possible straight segments between the points generated from one extremity and those generated from the other are tested. If they are closer than 5 &Aring; to the protein, or if they cross it, they are rejected. If they are kept, the software checks whether the length of the segment is compatible with the feasible alpha helices: if the length of a given segment equals one of the 8 alpha helix lengths plus or minus 0.75 &Aring;, the path is saved.<br />
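The two numerical rules of Step 1 can be sketched as follows. This is an illustration only: the rod lengths in <code>HELIX_LENGTHS</code> are hypothetical placeholders (the real 8 values come from the modeling), and reading "4 steps" as 4 evenly spaced radii is an assumption.<br />

```python
import math

# Hypothetical rod lengths in angstroms, used only for this example.
HELIX_LENGTHS = (7.5, 15.0, 22.5, 30.0, 37.5, 45.0, 52.5, 60.0)
TOLERANCE = 0.75  # allowed mismatch in angstroms, as stated in the text

def sphere_radii(max_flexible_length, start=5.25, steps=4):
    """Evenly spaced radii for the flexible-end spheres, from start up to the maximum length."""
    if max_flexible_length <= start:
        return [start]
    delta = (max_flexible_length - start) / (steps - 1)
    return [start + i * delta for i in range(steps)]

def matching_helix(p, q, lengths=HELIX_LENGTHS, tol=TOLERANCE):
    """Return the first helix length within tol of the distance between p and q, else None."""
    d = math.dist(p, q)
    for length in lengths:
        if abs(d - length) <= tol:
            return length
    return None
```

A segment for which <code>matching_helix</code> returns <code>None</code> cannot be realized by a single rod and would be dropped in this sketch.<br />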
===Step 2===<br />
The next possibility to design more complex rigid linkers, while still taking flexible ends into account within a reasonable calculation time, was to reduce the number of possible angles. As we originally thought that a 90° angle would be practically feasible, the software was designed to generate linkers with flexible ends and one 90° angle. ###figure, protein_one_90° angle### This choice was notably made because the edge lengths of a right triangle are simple to calculate. We already saw in Step 1 that the length of an edge can only take 8 different values. As the linkers have to start from the extremities of the protein, and as we impose a right angle, the number of possible paths is low, making them easy to compute. In practice, the extremities of the protein are positioned in a flexible way as in Step 1. From each of the positions allowed by this flexibility, the software searches for all the allowed right triangles. Restricting the search in this way was mainly necessary to limit the degrees of freedom and keep the calculations feasible. <br />
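The right-triangle search reduces to the Pythagorean relation: two rods of lengths a and b meeting at 90° span a distance of sqrt(a² + b²). A minimal sketch, assuming the same 0.75 &Aring; tolerance as in Step 1 (the text does not state the tolerance used here) and using hypothetical rod lengths in the usage example:<br />

```python
import math
from itertools import product

def right_angle_pairs(p, q, lengths, tol=0.75):
    """Ordered pairs (a, b) of rod lengths whose right-angle combination spans |pq| within tol."""
    d = math.dist(p, q)
    return [(a, b) for a, b in product(lengths, repeat=2)
            if abs(math.hypot(a, b) - d) <= tol]
```

For two end positions 25 &Aring; apart and hypothetical rod lengths of 15, 20 and 30 &Aring;, only the combinations (15, 20) and (20, 15) are returned. For each pair, the corner point still has one rotational degree of freedom around the axis joining the two ends.<br />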
===Step 3===<br />
Finally, the software can also find paths with up to 4 edges, meaning 4 alpha helices and 3 angles. Thanks to the modularity of the possible linkers, such paths can theoretically circularize any kind of protein. ###figure of torus###<br />
To keep the calculation feasible in a reasonable time, we designed the search strategy so that the flexible part at each extremity is oriented in the same direction as the adjacent alpha helix. This obviously restricts the search, but as these orientations are allowed for the flexible part, the approach remains valid. <br />
First, potential ending points of the first alpha helical rod are calculated from the N-terminal point of the protein. The orientation is chosen in a discrete manner, with an increment of 5 degrees for the two angles of the spherical coordinates. The distance from the origin corresponds to the 8 possible lengths allowed by the alpha helices, as already seen in Step 2, plus a length of 0, which mimics a linker with 3 instead of 4 edges. The exact same procedure is repeated to define all the potential ending points of the second alpha helical rod, starting from all the possible ending points of the first alpha helical rod. Thanks to the possibility of a length of 0 for the first and the second rods, the software also calculates paths with 2 edges. Then, the same is done only once from the C-terminal point of the protein, defining 1 edge. The final step consists in checking whether the points originating from the N- and C-terminal points can be linked by a potential alpha helix, i.e. whether they are separated by an appropriate distance. If any of the potential alpha helix lengths lies within the distance between two points plus or minus 0.75 &Aring;, the path is saved. In the same way, if two points are closer than 0.75 &Aring; to each other, the path is also saved.<br />
<br />
==Sorting out of paths==<br />
The previous part described the generation of paths that can connect the two extremities of the protein irrespective of the position of these paths relative to the protein. While this allows fast computation of the geometrical paths, it also implies that the paths that are not practically feasible need to be sorted out. This is the most time-consuming part of the computation, as about 1 billion paths are generated. Three criteria are considered for the sorting. The first one is the feasibility of the linker: can the software find angle patterns that correspond to the ones defined by the geometrical path? This question was part of the motivation for a large modeling effort to determine the possible angles between consecutive alpha helices. This was achieved by analyzing the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling distribution of angles] between alpha helices found in the ArchDB database. As nearly any angle between 20 and 170 degrees could be found, only few paths were actually rejected at that step. The next criterion is the position of the angle points: if any of them lies inside the protein, the path is rejected. Finally, the software checks whether any atom of the protein is less than 5 &Aring; away from any of the alpha helices; if so, the path is also rejected.<br />
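The first sorting criterion can be illustrated with a short sketch that measures the angle at each angle point and compares it with the range found in the modeling. The angle convention (interior angle between the two rods meeting at a corner) and the function names are assumptions of this example, and zero-length rods are expected to have been removed from the path beforehand.<br />

```python
import math

def corner_angle(prev_pt, corner, next_pt):
    """Interior angle in degrees between the two rods meeting at `corner`."""
    u = [a - c for a, c in zip(prev_pt, corner)]
    v = [b - c for b, c in zip(next_pt, corner)]
    cos_a = sum(x * y for x, y in zip(u, v)) / (math.hypot(*u) * math.hypot(*v))
    # Clamp against rounding errors before taking the arc cosine.
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_a))))

def angles_feasible(path, low=20.0, high=170.0):
    """True if every interior corner of the path has an angle within [low, high] degrees."""
    return all(low <= corner_angle(path[i - 1], path[i], path[i + 1]) <= high
               for i in range(1, len(path) - 1))
```

A path with a right-angle corner passes this test, whereas a corner of 180° (two rods in a straight line) falls outside the range in this convention.<br />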
<br />
==Shifting paths to the patterns==<br />
<br />
The strategy described in Step 3 gives a certain freedom to the rod that connects the last two angle points generated from the N- and C-terminal points. As this freedom is not actually permitted by the alpha helix and the angle pattern, but is permitted by the flexible part, for example at the C-terminal end, the software slightly refines the path by rotating the segment that originates from the C-terminal point.<br />
<br />
==Weighting of paths==<br />
Before translating the paths into sequences and thus into linkers that can be expressed, each path needs to be evaluated for its potential capacity to enhance heat stability. For this we have identified different contributions that should be combined into one value that defines how good the linker may be and that consequently defines its ranking among all possible linkers. The smaller this value, the more we expect that the linker will enhance thermostability. An important step for all these contributions is the normalization, as explained in the next paragraphs.<br />
The first contribution we considered was the linker length. We assumed that a short linker is better to constrain the protein extremities, and that a long helix might give more flexibility. Because we wanted this value to be independent of the size of the protein, the length of the linker is normalized to the distance between the two termini.<br />
The second contribution relates to the angles used in the linker. We learned from the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling] that angles formed by a certain angle pattern follow a certain distribution. First, we assumed that the narrower the distribution, the more likely the alpha helices would actually produce this angle. Second, the angles found by the software should be as close as possible to those well-defined angles. In this case, the weight value from this contribution should be low.<br />
Then the distance of the linker to the protein is taken into account. Because the linkers should not disturb the protein in its normal environment, linkers that pass close to the protein surface are considered better linkers. The distance was defined as the minimal distance between the linker and all the atoms of the protein. As already mentioned for the sorting of the paths, a linker cannot come closer than 5 &Aring; and this distance was used for normalization of calculated distances.<br />
After this, the places a linker should avoid are calculated. Each protein can interact with other molecules on some parts of its surface. The user can specify where and how big those parts are. If a linker passes in front of a potential binding domain for another molecule, the value of the corresponding path goes to infinity, so that the linker is discarded. Conversely, the farther a linker is from a potential ligand binding domain, the smaller its weighting value. The user can also specify the importance of certain regions. In the end, the total weighting is normalized to the number of binding domains.<br />
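Two of the normalizations described above can be written down directly. The sketch below is only an illustration: it assumes that a path is a list of points running from one terminus to the other, and the function names are hypothetical.<br />

```python
import math

def length_contribution(path):
    """Total path length divided by the distance between its two end points."""
    total = sum(math.dist(a, b) for a, b in zip(path, path[1:]))
    return total / math.dist(path[0], path[-1])

def distance_contribution(min_linker_protein_distance, min_allowed=5.0):
    """Minimal linker-protein distance expressed in units of the 5 angstrom lower bound."""
    return min_linker_protein_distance / min_allowed
```

Both values are dimensionless. The length contribution equals 1 for a straight linker and grows with every detour, and the distance contribution equals 1 for a linker that passes at exactly 5 &Aring; from the protein.<br />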
<br />
===Calibrating the weighting function===<br />
Every contribution has its own distribution. You can see an example in figure ### [[figure histogram_length_lys.png]], but all of them have different shapes. The aim is to find the paths that globally minimize all of these contributions. Therefore, for simplicity, the four contributions mentioned above were combined in the weighting function in a linear manner:<br />
\[ W(p) = \alpha L(p) + \beta A(p) + \gamma D(p) + \delta u(p) \]<br />
where W is the final weighting, p the path, L the length contribution, A the angle contribution, D the distance contribution and u the contribution from the forbidden regions. $\alpha, \beta, \gamma, \delta$ are the weighting constants that needed to be found. Each contribution was normalized so that it is dimensionless and so that all contributions have reasonably similar values.<br />
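The linear combination and the resulting ranking can be sketched in a few lines. This is an illustration of the formula above, not the software's code. The function names are hypothetical, and dropping paths with a non-finite weight is an assumption that mirrors the rule that linkers crossing a forbidden region are discarded.<br />

```python
import math

def weight(contributions, alpha, beta, gamma, delta):
    """W(p) = alpha*L + beta*A + gamma*D + delta*u for one path."""
    length, angle, distance, forbidden = contributions
    return alpha * length + beta * angle + gamma * distance + delta * forbidden

def rank_paths(paths, constants):
    """paths: {name: (L, A, D, u)}. Names sorted by increasing W, non-finite weights dropped."""
    scored = {name: weight(c, *constants) for name, c in paths.items()}
    return sorted((n for n, w in scored.items() if math.isfinite(w)), key=scored.get)
```

With all four constants set to 1 for the sake of the example, a path with contributions (1.1, 0.2, 1.0, 0.1) ranks ahead of one with (1.2, 0.5, 1.0, 0.3), and a path whose forbidden-region term is infinite does not appear in the ranking.<br />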
The weighting constants were obtained from the [https://2014.igem.org/Team:Heidelberg/Project/Linker_Screening linker screening] performed with lysozyme and the [https://2014.igem.org/Team:Heidelberg/Modeling/Enzyme_Modeling modeling of the enzyme activity]. Their calculation is presented in the results below.<br />
<br />
==Translating paths to sequence==<br />
As mentioned above, the software is provided with two databases, one for the possible angle patterns and one for the helix patterns. The choice of the patterns was inspired by known crystal structures extracted from databases and described in different papers.<br />
A large in silico screening to refine the preferences of the patterns was then set up using the [https://2014.igem.org/Team:Heidelberg/Software/igemathome distributed computing] system. For the complete description of the search for suitable patterns, see the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling] page. <br />
All the possible paths are now split up at the angles and compared with the possible patterns in the databases. ###Figure needed to explain, how the path is translated### The most suitable patterns are identified and joined together to build the path's sequence. It is important to note that this is only possible because of the modularity of our linker patterns used as building blocks: each block, whether an alpha helix or an angle pattern, is not affected by the others. Thus for each possible path, one sequence is produced.<br />
<br />
==Clustering of paths==<br />
Many different paths are represented by the same sequence [[###Figure that shows, different paths have same properties, already before in the text ###]] and we therefore clustered such paths. The weights for those clustered paths were then calculated by averaging the weights of the different paths that compose a cluster.<br />
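The clustering step amounts to grouping paths by their sequence and averaging the weights within each group. A minimal sketch, assuming each path is available as a (sequence, weight) pair. The function name is hypothetical and the weights in the usage example are invented for illustration.<br />

```python
from collections import defaultdict

def cluster_by_sequence(paths):
    """paths: iterable of (sequence, weight). Returns {sequence: mean weight of its paths}."""
    groups = defaultdict(list)
    for sequence, w in paths:
        groups[sequence].append(w)
    return {sequence: sum(ws) / len(ws) for sequence, ws in groups.items()}
```

For example, two paths that both translate to GGAEAAAKRGTCWE with weights 1.0 and 3.0 form one cluster with weight 2.0.<br />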
<br />
<br />
=Results=<br />
==DNMT1==<br />
A major motivation of our effort to design rigid linkers with angles was the [https://2014.igem.org/Team:Heidelberg/Project/PCR_2.0 circularization of the DNA methyltransferase Dnmt1]. The truncated form used in our project is composed of 900 amino acids and the N- and C-terminal extremities are well separated. To circularize it, two linkers were designed: a flexible one made of glycine and serine, and a rigid one designed by the software. The rigid linkers for DNMT1 were obtained from an early state of the software. At that time the calculation took 11 days on a laptop computer with an Intel i5 processor and 8 GB of RAM, which shows the importance of a distributed computing system for large proteins. Since then, the software has improved considerably, reducing the calculation time for DNMT1 to about 1 day.<br />
<br />
==Feedback from wet lab==<br />
The results from the software for lysozyme were tested as described in the [https://2014.igem.org/Team:Heidelberg/Project/Linker_Screening linker-screening] part and evaluated as described in the [https://2014.igem.org/Team:Heidelberg/Modeling/Enzyme_Modeling enzyme modeling] part. We performed a large linker screening on 10 different lysozymes with different linkers. As the purpose of the lysozyme screen was the calibration of the software, those linkers were designed according to the four contributions previously mentioned. One of them was the shortest possible, one had the best possible angle, and so on.<br />
{| class="table table-hover"<br />
|+ '''Table 1''': Linkers and their amino acid sequence. Green: attachment sequences to prevent the flexible regions from being perturbed; Blue: angle; Purple: extein.<br />
! Linker<br />
! Amino acid sequence<br />
|- <br />
|<br />
|<br />
|-<br />
|'''Very good linkers'''<br />
|<br />
|-<br />
| sgt2<br />
| <span style="color:#3ADF00;">GG</span>AEAAAK<span style="color:#00BFFF;">AAAHPEA</span>AEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| rigid<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAKAA<span style="color:#3ADF00;">P</span><span style="color:#A901DB;">RGKCWE</span><br />
|-<br />
| <br />
| <br />
|-<br />
| '''Average linkers'''<br />
|<br />
|-<br />
| may1<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAKA<span style="color:#00BFFF;">AAAHPEA</span>AEAAAKEAAAKA<span style="color:#00BFFF;">KTA</span>AEAAAKEAAAKA<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| ord1<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAK<span style="color:#00BFFF;">ATGDLA</span>AEAAAKAA<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| ord3<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAK<span style="color:#00BFFF;">ASLPAA</span>AEAAAKEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
|<br />
| <br />
|-<br />
| '''Short linkers'''<br />
| <br />
|-<br />
| sho1<br />
| <span style="color:#3ADF00;">GG</span><span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| sho2<br />
| <span style="color:#3ADF00;">GG</span>AEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
|}<br />
[[circ_lam_lys_nils.png]] comparison with [[nice_linker_lysozyme_flexible_ends.png]]<br />
###About predictions of software###<br />
In the end we obtained a ranking of the in vitro tested linkers from the [https://2014.igem.org/Team:Heidelberg/Project/Linker_Screening linker-screening] and chose the parameters $\alpha, \beta, \gamma, \delta$ of the weighting function so that the ranking from the software reproduced the ranking from the assays. The final values, $\alpha = 1.85 \times 10^{-6}, \beta = 0.57, \gamma = 50.8 \times 10^{6}$, set up a function that could reproduce the ranking observed in the wet lab experiments.<br />
<br />
=Discussion=<br />
Will always be refined with more data from i@h...<br />
=References=<br />
<br />
[1] Thornton, J.M. & Sibanda, B.L. Amino and carboxy-terminal regions in globular proteins. Journal of molecular biology 167, 443-460 (1983).<br />
<br />
[2] Wang, C.K.L., Kaas, Q., Chiche, L. & Craik, D.J. CyBase: A database of cyclic protein sequences and structures, with applications in protein discovery and engineering. Nucleic Acids Research 36, (2008).</div>Igemnilshttp://2014.igem.org/Team:Heidelberg/pages/Linker_SoftwareTeam:Heidelberg/pages/Linker Software2014-10-17T23:06:23Z<p>Igemnils: /* DNMT1 */</p>
<hr />
<div>=Abstract=<br />
As already [https://2014.igem.org/Team:Heidelberg/Toolbox/Circularization introduced], artificially circularized proteins may gain some heat stability by restraining the C- and N-terminus from moving around freely. This circularization may be trivial when the protein termini are very close to each other, which seems to be reasonably common [[#References|[1]]]. However, if the ends are too far from each other, a long linker is needed to connect them. This linker should not change the natural conformation of the protein and should constrain the relative position of the ends to restrict the degrees of freedom and thus stabilize the structure even when heated. In addition, these linkers should not affect any of the protein functions. Consequently, it is important to prevent linkers from passing through the active site or from covering domains that bind other molecules, for example. Therefore one needs to be able to define the shape of possible linkers. This section describes the software we developed to design such linkers. We would like to stress that this work was made possible by the feedback between computer modeling and experimental work: we could first design linkers in silico, test them experimentally and use the results to further calibrate the software. To our knowledge, this is the first time that such an approach has been used to custom-design rigid linkers with angles to connect protein extremities.<br />
<br />
=General procedure=<br />
In short, the software can provide a weighted list of linkers to circularize any protein of interest with a known structure. Those linkers are made of rigid alpha helical segments connected with defined angles. Contrary to flexible linkers, those rigid linkers were expected to constrain the protein extremities and to confer better heat stability. Such an idea was already developed [[#References|[2]]] but only with alpha helices defining simple rods, and without any possibility to introduce angles. To generate those linkers, we first defined the geometrical paths, with segments and angles, that they should follow. The geometrical paths that are biologically feasible are afterwards translated into amino acid sequences. Both the compatibility of paths with possible structures and the translation were made possible by our [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling approaches]. The first approach consisted in performing a statistical analysis of more than 17000 known non-homologous structures containing alpha helices connected with angles. For the second approach, we modeled the conformation of linkers circularizing proteins of known structure and analyzed them for certain properties. This second approach was run for a large number of proteins thanks to our distributed computing system [https://2014.igem.org/Team:Heidelberg/Software/igemathome igemathome]. The software provides different possible linkers with weights that rank the linkers by their capacity to maintain protein activity at higher temperatures. These weights were generated thanks to an extensive [https://2014.igem.org/Team:Heidelberg/Project/Linker_Screening linker screening] on the target protein lambda-lysozyme, using the first modeling approach.<br />
<br />
=Background=<br />
Classically, protein linkers have been designed in three different ways. The easiest way is to define the length that a linker should cover and then simply use a flexible glycine-serine peptide with the right number of amino acids to match this length. Glycine is used for flexibility, as it has no side chain and does not produce any steric hindrance, while serine is used for solubility, as it has a small polar side chain. This solubility is important, as the linkers should not pass through the hydrophobic core of the protein, but should be dissolved in the surrounding medium. These flexible linkers are normally used for circularization but also for connecting different proteins, when the main goal is that the different parts are connected, regardless of how they are connected, or when the flexibility of the linker is required for specific applications. <br />
<br />
A second strategy consists in using rigid helical linkers to keep proteins or protein domains at a certain distance from each other. This is especially important for signalling proteins and fluorescent proteins. One major property of alpha helices is that they always fold in a defined way with well-defined angles and lengths. There are also many different helical patterns that differ in stability and solubility. Although they have been used to design circularizing linkers [[#References|[2]]], one big disadvantage of this strategy is that one can only build straight linkers with helices. So in the context of circularization, if a straight line connecting the protein extremities crosses the protein, this strategy is not an option.<br />
<br />
The third option, which served as a basis for our approach and which came from discussions with the group of Rebecca Wade in Heidelberg, Germany, consists in designing custom-tailored linkers for each specific application. These linkers can be obtained from protein structure prediction. First, one needs to define the path that the linker should take to connect the protein ends. Afterwards, one designs a possible linker sequence that might fit well. Next, one makes a structure prediction of the linker attached to the protein to validate the design. Several different linkers, with slight changes, can be compared. This is repeated several times until the linker effectively follows the expected path. This method requires strong knowledge of protein folding and protein structure prediction and is computationally intensive. On the other hand, the benefit can be substantial, as the interaction of the linker with the protein surface can be taken into account and as one can accurately define the path taken by the linker down to the resolution of the protein structure.<br />
<br />
We have set up a completely new strategy to design rigid linkers. As further detailed in the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling] part, it is possible to define the shape of a linker by combining rigid alpha helical rods with well-defined angle patterns. Therefore, by defining, in a geometrical way, the possible paths of the circularizing linkers for a given protein, we can then propose potential linkers. This definition of the geometrical path can be very difficult, especially for large proteins with complex shapes. Moreover, this definition is further constrained by the fact that linkers must avoid hiding active sites of the protein of interest. Finally, the paths have rotational degrees of freedom at the extremities of the protein, and depending on their orientation, they may or may not match the geometry of the protein. The tool we present here covers the two steps: defining geometrical paths with some weights and translating them into feasible linkers, also with weights. This tool is universal, as it has the capacity to design circularizing linkers for any protein with a known structure. Moreover, it is modular as, thanks to our [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling approach], we have designed linkers as exchangeable blocks of rods of different lengths and of angle patterns. The following sections detail the different steps followed by our software to design proper linkers.<br />
<br />
<br />
==PDB parsing==<br />
First, the PDB file containing the structure of the target protein is parsed and the atom coordinates are stored in metric units. Some initial tests are then run on the protein structure. We first check whether the C- and N-termini lie on the surface of the protein and are accessible to the solvent, which is crucial for circularization. To do so, we define lines originating from an extremity of the protein, each parameterized by the two angles of spherical coordinates around the z-axis. The accessible angles are then determined by rejecting all lines that pass too close to the protein. As the future linker will be made of alpha helices and will therefore have a radius of 5 &Aring;, we used this length as the minimal allowed distance.<br />
Those allowed angles are stored for the coming linker generation. ###fig needed### <br />
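The parsing step can be illustrated with a minimal sketch that reads the fixed-width <code>ATOM</code> records of a PDB file, in which the x, y and z coordinates occupy columns 31 to 54 and are given in angstroms. This is not the parser of our software: the function name and the returned dictionary layout are assumptions, and alternate locations, <code>HETATM</code> records and multiple models are ignored.<br />

```python
def parse_atoms(pdb_text):
    """Extract name, residue, chain and coordinates from the ATOM records of a PDB file."""
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            atoms.append({
                "name": line[12:16].strip(),      # columns 13-16
                "res_name": line[17:20].strip(),  # columns 18-20
                "chain": line[21],                # column 22
                "res_seq": int(line[22:26]),      # columns 23-26
                # columns 31-38, 39-46 and 47-54 hold x, y and z
                "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
            })
    return atoms
```

The coordinates of the first and last residues returned by such a parser give the N- and C-terminal points used in the following steps.<br />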
<br />
==Generation of geometric paths==<br />
As our strategy consists in building linkers from helical rods and connecting angles, a path is completely defined by the coordinates of its angle points. Advancing one step from an existing point is always done by adding a displacement vector to this point. This vector is defined by the two spherical angles, chosen here in a discrete manner with an increment of 5 degrees, and by a length, also chosen in a discrete manner. This discrete length is used in two different contexts: it may correspond to the length of an alpha helix or to the length of the flexible part at the extremity of the protein. The coordinates reached with this vector define a new angle point, regardless of whether the vector corresponds to an alpha helix or to a flexible part. ###figure, points + vectors###. As we screen all possible angles in a discrete manner, the coordinates of these angle points are regularly distributed on a sphere. As further detailed in the next sections, those spheres are defined from both ends, either once or in several steps. The software then checks for possible straight connections of given lengths between each pair of angle points originating from the two extremities. ###figure needed, with all these possible steps, one sphere around each point and checking for the connections, 2D, different lengths in two different figures, ###<br />
The linkers are built in a modular way, with blocks of well-defined size. From the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling] of potential linkers, we could derive 8 different alpha helical rods, all with different lengths. In addition, the length of the two segments inside an angle block was always 8 &Aring;, so exchanging angle blocks does not affect the length of the linker. This means that the distance between the angle points is well defined, an essential aspect of our strategy of linker design.<br />
The software proceeds in three steps. First, it checks whether a direct linker made of a single alpha helix is possible. For this, it applies the procedure just described, with spheres whose radii correspond to plausible lengths of the short flexible parts at the extremities of the protein. Second, it tests whether a linker containing two alpha helices connected by a right angle allows circularization. Finally, it searches for possible linkers with up to three angle points. The next parts explain these three steps in detail.<br />
This method was chosen because it could be implemented easily and efficiently in our program. However, it also generates paths that cross the protein, so we put considerable effort into sorting out the paths.<br />
<br />
===Step 1===<br />
As a simple rigid linker with no angle would be easier to design and likely more thermostable than linkers containing angles, the software first checks if this simple solution is possible ###figure needed###. For this, we took into account the fact that proteins have some flexible amino acids at their extremities. This flexible part may come from the protein itself, but also from the 2 glycines that are included at the N-terminal part and from the extein at the C-terminal part. The latter two parts come from our linkers. These parts have no preferential angles and offer a large number of possibilities to insert fitting linkers. But this flexibility is also a drawback, as we have to include this large number of possible angles and lengths in our path search.<br />
In this first step, the software explicitly takes these flexible parts into account to check for the possibility of straight linkers. As the angles and the length of the flexible parts are variable, the software positions their extremity on a sphere centered on the last fixed position of the structure, as explained above. The radius of this sphere is incremented in a discrete manner, in 4 steps, from 5.25 &Aring; to the maximum length of the flexible part. ###figure###<br />
Then all possible straight segments between the points generated from one extremity and those generated from the other are tested. If they are closer than 5 &Aring; to the protein, or if they cross it, they are rejected. If they are kept, the software checks whether the length of the segment is compatible with the feasible alpha helices: if the length of a given segment equals one of the 8 alpha helix lengths plus or minus 0.75 &Aring;, the path is saved.<br />
===Step 2===<br />
The next possibility to design more complex rigid linkers, while still taking flexible ends into account within a reasonable calculation time, was to reduce the number of possible angles. As we originally thought that a 90° angle would be practically feasible, the software was designed to generate linkers with flexible ends and one 90° angle. ###figure, protein_one_90° angle### This choice was notably made because the edge lengths of a right triangle are simple to calculate. We already saw in Step 1 that the length of an edge can only take 8 different values. As the linkers have to start from the extremities of the protein, and as we impose a right angle, the number of possible paths is low, making them easy to compute. In practice, the extremities of the protein are positioned in a flexible way as in Step 1. From each of the positions allowed by this flexibility, the software searches for all the allowed right triangles. Restricting the search in this way was mainly necessary to limit the degrees of freedom and keep the calculations feasible. <br />
===Step 3===<br />
Finally the software also provides the possibility to find paths with up to 4 edges, meaning 4 alpha helices and 3 angles. Thanks to the modularity of the possible linkers, such paths can offer the possibility to circularize theoretically any kind of protein. ###figure of torus###<br />
To keep the calculation feasible in a reasonable time, we design the searching strategy so that the flexible part at the extremity are oriented in the same direction as the consecutive alpha helix. This is obviously restricting the search but as these orientations are allowed for the flexible part, this approach remains fully correct. <br />
First, potential ending points of the first alpha helical rod are calculated from the N-terminal point of the protein. The orientation is chosen in a discrete manner, with an increment of 5 degrees for the two angles of the spherical coordinates. The distance from the origin corresponds to the 8 possible lengths allowed by the alpha helices, as already seen in Step 2, plus a length of 0, which mimics a linker with 3 instead of 4 edges. The exact same procedure is repeated to define all the potential ending points of the second alpha helical rod, starting from all the possible ending points of the first alpha helical rod. Thanks to the possibility of a length of 0 for the first and the second rods, the software also calculates paths with 2 edges. Then, the same is done only once from the C-terminal point of the protein, defining 1 edge. The final step consists in checking whether the points originating from the N- and C-terminal points can be linked by a potential alpha helix, i.e. whether they are separated by an appropriate distance. If any of the potential alpha helix lengths lies within 0.75 &Aring; of the distance between two points, the path is saved. In the same way, if two points are closer than 0.75 &Aring; to each other, the path is also saved.<br />
<br />
==Sorting out of paths==<br />
The previous part described the generation of paths that can connect the two extremities of the protein irrespective of the position of these paths relative to the protein. While this allows fast computation of the geometrical paths, it also implies that the paths that are not practically feasible need to be sorted out. This is the most time-consuming part of the computation, as about 1 billion paths are generated. Three criteria are considered for the sorting. The first one is the feasibility of the linker: can the software find angle patterns that correspond to the ones defined by the geometrical path? This question was part of the motivation for a large modeling effort to determine the possible angles between consecutive alpha helices. This was achieved by analyzing the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling distribution of angles] between alpha helices found in the ArchDB database. As nearly any angle between 20 and 170 degrees could be found, only few paths were actually rejected at that step. The next criterion is the position of the angle points: if one of them lies inside the protein, the path is rejected. Finally, the software checks whether any atom of the protein is less than 5 &Aring; away from any of the alpha helices; if so, the path is also rejected.<br />
<br />
==Shifting paths to the patterns==<br />
<br />
The strategy described in Step 3 gives a certain freedom to the rod that connects the last two angle points generated from the N- and C-terminal points. As this freedom is actually not permitted by the alpha helix and the angle pattern, but is permitted by the flexible part, for example at the C-terminal end, the software slightly refines the path by rotating the segment that originates from the C-terminal point.<br />
<br />
==Weighting of paths==<br />
Before translating the paths into sequences and thus into linkers that can be expressed, each path needs to be evaluated for its potential capacity to enhance heat stability. For this we have identified different contributions that should be combined into one value that defines how good the linker may be and that consequently defines its ranking among all possible linkers. The smaller this value, the more we expect that the linker will enhance thermostability. An important step for all these contributions is the normalization, as explained in the next paragraphs.<br />
The first contribution we considered was the linker length. We assumed that a short linker is better to constrain the protein extremities, and that a long helix might give more flexibility. Because we wanted this value to be independent of the size of the protein, the length of the linker is normalized to the distance between the two termini.<br />
The second contribution relates to the angles used in the linker. We learned from the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling] that angles formed by a certain angle pattern follow a certain distribution. First, we assumed that the narrower the distribution, the more likely the alpha helices would actually produce this angle. Second, the angles found by the software should be as close as possible to those well-defined angles. In this case, the weight value from this contribution should be low.<br />
Then the distance of the linker to the protein is taken into account. Because the linkers should not disturb the protein in its normal environment, linkers that pass close to the protein surface are considered better linkers. The distance was defined as the minimal distance between the linker and all the atoms of the protein. As already mentioned for the sorting of the paths, a linker cannot come closer than 5 &Aring; and this distance was used for normalization of calculated distances.<br />
After this, the places a linker should avoid are taken into account. Each protein can interact with other molecules on some parts of its surface. The user can specify where and how big those parts are. If a linker passes in front of a potential binding domain, the value of the corresponding path goes to infinity, so that the linker is discarded. Conversely, the farther a linker is from a potential ligand binding domain, the smaller its weighting value. The user can also specify the importance of certain regions. In the end, the total weighting is normalized to the number of binding domains.<br />
<br />
===Calibrating the weighting function===<br />
Every contribution has its own distribution. An example is shown in figure ### [[figure histogram_length_lys.png]], but all of them have different shapes. The aim is to find the paths that globally minimize all of these contributions. Therefore, for simplicity, the four contributions were combined linearly in the weighting function:<br />
\[ W(p) = \alpha L(p) + \beta A(p) + \gamma D(p) + \delta u(p) \]<br />
where W is the final weighting, p the path, L the length contribution, A the angle contribution, D the distance contribution and u the contribution from the forbidden regions. $\alpha, \beta, \gamma, \delta$ are the weighting constants that needed to be found. The normalization of each contribution was performed so that each of them is dimensionless and all of them have reasonably similar values.<br />
The weighting constants were obtained from the [https://2014.igem.org/Team:Heidelberg/Project/Linker_Screening linker screening] performed with lysozyme and the [https://2014.igem.org/Team:Heidelberg/Modeling/Enzyme_Modeling modeling of the enzyme activity]. Their calculation is presented in the results below.<br />
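The linear combination above can be written out in a few lines. The following is only a minimal sketch: the function names, the tuple layout <code>(L, A, D, u)</code> and the example numbers are assumptions made for illustration and are not taken from the actual software.<br />

```python
def weight(contributions, alpha, beta, gamma, delta):
    """W(p) = alpha*L(p) + beta*A(p) + gamma*D(p) + delta*u(p) for one path."""
    L, A, D, u = contributions
    return alpha * L + beta * A + gamma * D + delta * u


def rank_paths(paths, constants):
    """Sort path names from best (smallest W) to worst.

    `paths` maps a hypothetical path name to its (L, A, D, u) tuple;
    `constants` is the tuple (alpha, beta, gamma, delta).
    """
    return sorted(paths, key=lambda name: weight(paths[name], *constants))


# Made-up contribution values, only to show the call pattern.
example = {"short": (1.1, 0.2, 1.0, 0.1), "long": (2.5, 0.2, 1.0, 0.1)}
ranking = rank_paths(example, (1.0, 1.0, 1.0, 1.0))
```

Because a smaller value of W is considered better, the ranking is an ascending sort.<br />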
<br />
==Translating paths to sequence==<br />
As already mentioned before the software is provided with two databases, one for the possible angle patterns and one for the helix patterns. The choice of the patterns was inspired by known crystal structures extracted from databases and described in different papers.<br />
A huge in silico screening for refining the preferences of the patterns was then set up using the [https://2014.igem.org/Team:Heidelberg/Software/igemathome distributed computing] system. For the complete description of the search for suitable patterns, one can read the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling] page. <br />
All the possible paths are now split up at the angles and compared with the possible patterns in the databases. ###Figure needed to explain, how the path is translated### The most suitable patterns are identified and concatenated to build the path's sequence. It is important to notice that this is only possible because of the modularity of our linker patterns used as building blocks: each block, being an alpha helix or an angle pattern, is not affected by the others. Thus for each possible path, one sequence is produced.<br />
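Because the blocks are independent, building a sequence amounts to concatenating strings. The sketch below illustrates this with block sequences read off Table 1; the function name <code>build_linker</code> is an assumption, and the selection of the most suitable pattern from the databases is not shown.<br />

```python
ATTACHMENT = "GG"   # N-terminal attachment glycines (green in Table 1)
EXTEIN = "RGTCWE"   # C-terminal extein (purple in Table 1)


def build_linker(blocks, attachment=ATTACHMENT, extein=EXTEIN):
    """Concatenate helix and angle blocks between attachment and extein."""
    return attachment + "".join(blocks) + extein


# helix block, angle block, helix block: reproduces linker sgt2 of Table 1
sgt2 = build_linker(["AEAAAK", "AAAHPEA", "AEAAAK"])
# a single short helix block: reproduces linker sho2 of Table 1
sho2 = build_linker(["AEAAAK"])
```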
<br />
==Clustering of paths==<br />
Many different paths are represented by the same sequence [[###Figure that shows, different paths have same properties, already before in the text ###]] and we therefore clustered such paths. The weights for those clustered paths were then calculated by averaging the weights of the different paths that compose a cluster.<br />
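The clustering step can be sketched as grouping by sequence and averaging. The input layout, a list of <code>(sequence, weight)</code> pairs, is an assumption made for this illustration.<br />

```python
from collections import defaultdict


def cluster_by_sequence(paths):
    """Group (sequence, weight) pairs by sequence and average the weights."""
    groups = defaultdict(list)
    for sequence, weight in paths:
        groups[sequence].append(weight)
    return {seq: sum(ws) / len(ws) for seq, ws in groups.items()}


# Placeholder sequences and weights, only to show the call pattern.
clusters = cluster_by_sequence([("AAA", 1.0), ("AAA", 3.0), ("BBB", 2.0)])
```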
<br />
<br />
=Results=<br />
==DNMT1==<br />
A major motivation of our effort to design rigid linkers with angles was the [https://2014.igem.org/Team:Heidelberg/Project/PCR_2.0 circularization of the DNA methyltransferase Dnmt1]. The truncated form used in our project is composed of 900 amino acids and the N- and C-terminal extremities are well separated. To circularize it, two linkers were designed: a flexible one made of glycine and serine, and a rigid one designed by the software. The rigid linkers for DNMT1 were obtained from an early version of the software. At that time the calculation took 11 days on a laptop computer with an Intel i5 processor and 8 GB of RAM, which shows the importance of a distributed computing system for large proteins. Since then, the software has improved considerably, reducing the calculation time to about 1 day for DNMT1.<br />
<br />
==Feedback from wet lab==<br />
The results from the software for lysozyme were tested as described in the [[linker-screening]] part and evaluated as described in the [[enzyme modeling]] part. We have performed a large linker screening on 10 different lysozymes with different linkers. As the purpose of the lysozyme screen was the calibration of the software, those linkers were designed according to the four contributions previously mentioned. One of them was the shortest possible, one had the best possible angle, and so on.<br />
{| class="table table-hover"<br />
|+ '''Table 1''': Linkers and their amino acid sequences. Green: attachment sequences to prevent the flexible regions from being perturbed; Blue: angle; Purple: extein.<br />
! Linker<br />
! Amino acid sequence<br />
|- <br />
|<br />
|<br />
|-<br />
|'''Very good linkers'''<br />
|<br />
|-<br />
| sgt2<br />
| <span style="color:#3ADF00;">GG</span>AEAAAK<span style="color:#00BFFF;">AAAHPEA</span>AEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| rigid<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAKAA<span style="color:#3ADF00;">P</span><span style="color:#A901DB;">RGKCWE</span><br />
|-<br />
| <br />
| <br />
|-<br />
| '''Average linkers'''<br />
|<br />
|-<br />
| may1<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAKA<span style="color:#00BFFF;">AAAHPEA</span>AEAAAKEAAAKA<span style="color:#00BFFF;">KTA</span>AEAAAKEAAAKA<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| ord1<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAK<span style="color:#00BFFF;">ATGDLA</span>AEAAAKAA<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| ord3<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAK<span style="color:#00BFFF;">ASLPAA</span>AEAAAKEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
|<br />
| <br />
|-<br />
| '''Short linkers'''<br />
| <br />
|-<br />
| sho1<br />
| <span style="color:#3ADF00;">GG</span><span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| sho2<br />
| <span style="color:#3ADF00;">GG</span>AEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
|}<br />
[[circ_lam_lys_nils.png]] comparison with [[nice_linker_lysozyme_flexible_ends.png]]<br />
###About predictions of software###<br />
In the end we obtained a ranking of the in vitro tested linkers from the [[linker-screening]] and chose the parameters $\alpha, \beta, \gamma, \delta$ of the weighting function so that the ranking from the software reproduced the ranking from the assays. The final values were: $\alpha = 1.85 \times 10^{-6}, \beta = 0.57, \gamma = 50.8 \times 10^{6}$.<br />
To this end, we set up a function that is minimized when the order of the weighting output is the same as the order from the wet lab.<br />
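The exact form of this function is not given here. One possible choice, shown purely as an assumption, is to count the pairs of linkers that the software and the wet lab put in opposite order; this count is zero exactly when both rankings agree. The linker names and numbers below are placeholders, not measured results.<br />

```python
from itertools import combinations


def ranking_disagreement(weights, lab_rank):
    """Count linker pairs ordered differently by software and wet lab.

    `weights` maps linker name -> W (smaller is better);
    `lab_rank` maps linker name -> experimental rank (1 is best).
    """
    count = 0
    for a, b in combinations(weights, 2):
        software = weights[a] - weights[b]
        lab = lab_rank[a] - lab_rank[b]
        if software * lab < 0:  # opposite signs: the pair is discordant
            count += 1
    return count


lab = {"a": 1, "b": 2, "c": 3}
agree = ranking_disagreement({"a": 0.1, "b": 0.5, "c": 0.9}, lab)
reverse = ranking_disagreement({"a": 0.9, "b": 0.5, "c": 0.1}, lab)
```

Such a count is piecewise constant in the weighting constants, so a search over candidate values rather than a gradient method would be the natural way to minimize it.<br />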
<br />
=Discussion=<br />
Will always be refined with more data from i@h...<br />
=References=<br />
<br />
[1] Thornton, J.M. & Sibanda, B.L. Amino and carboxy-terminal regions in globular proteins. Journal of molecular biology 167, 443-460 (1983).<br />
<br />
[2] Wang, C.K.L., Kaas, Q., Chiche, L. & Craik, D.J. CyBase: A database of cyclic protein sequences and structures, with applications in protein discovery and engineering. Nucleic Acids Research 36, (2008).</div>Igemnilshttp://2014.igem.org/Team:Heidelberg/pages/Linker_SoftwareTeam:Heidelberg/pages/Linker Software2014-10-17T23:03:27Z<p>Igemnils: /* Translating paths to sequence */</p>
<hr />
<div>=Abstract=<br />
As already [https://2014.igem.org/Team:Heidelberg/Toolbox/Circularization introduced], artificially circularized proteins may gain some heat stability because their C- and N-termini are restrained from moving around freely. This circularization may be trivial when the protein termini are very close to each other, which seems to be reasonably common [[#References|[1]]]. However, if the ends are too far from each other, a long linker is needed to connect them. This linker should not change the natural conformation of the protein and should constrain the relative position of the ends, restricting the degrees of freedom and thus stabilizing the structure even when heated. In addition, these linkers should not affect any of the protein functions. Consequently it is important to prevent linkers from passing through the active site or from covering binding domains for other molecules, for example. Therefore one needs to be able to define the shape of possible linkers. This section describes the software we developed to design such linkers. We would like to stress that this work has been made possible by the feedback between computer modeling and experimental work: we could first design linkers in silico, test them experimentally and use the results to further calibrate the software. To our knowledge, this is the first time that such an approach has been used to custom-design rigid linkers with angles to connect protein extremities.<br />
<br />
=General procedure=<br />
In short, the software can provide a weighted list of linkers to circularize any protein of interest with a known structure. Those linkers are made of rigid alpha-helical segments connected with defined angles. Contrary to flexible linkers, those rigid linkers were expected to constrain the protein extremities and to confer better heat stability. Such an idea was already developed [[#References|[2]]] but only with alpha helices defining simple rods, and without any possibility to introduce angles. To generate those linkers, we first defined the geometrical paths, with segments and angles, that they should follow. The geometrical paths that are biologically feasible are afterwards translated into amino acid sequences. Both the compatibility of paths with possible structures and the translation were made possible by our [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling approaches]. The first approach consisted in performing a statistical analysis of more than 17000 known non-homologous structures containing alpha helices connected with angles. For the second approach, we modeled the conformation of linkers circularizing proteins of known structure and analyzed them for certain properties. This second approach was run for a large number of proteins thanks to our distributed computing system [https://2014.igem.org/Team:Heidelberg/Software/igemathome igemathome]. The software provides different possible linkers with weights that rank the linkers according to their capacity to maintain protein activity at higher temperatures. These weights were generated thanks to an extensive [https://2014.igem.org/Team:Heidelberg/Project/Linker_Screening linker screening] on the target protein lambda-lysozyme, using the first modeling approach.<br />
<br />
=Background=<br />
Classically, protein linkers have been designed in three different manners. The easiest way is to define the length that a linker should cover and then simply use a flexible glycine-serine peptide with the right number of amino acids to match this length. Glycine is used for flexibility, as it has no side chain and does not produce any steric hindrance, while serine is used for solubility, as it has a small polar side chain. This solubility is important, as the linkers should not pass through the hydrophobic core of the protein, but should be dissolved in the surrounding medium. These flexible linkers are normally used for circularization but also for connecting different proteins, when the main goal is that the different parts are connected, but not how they are connected, or when the flexibility of the linker is required for specific applications. <br />
<br />
A second strategy consists in using rigid helical linkers to keep proteins or protein domains at a certain distance from each other. This is especially important for signalling proteins and fluorescent proteins. One major property of alpha helices is that they always fold in a defined way with well defined angles and lengths. There are also many different helical patterns that differ in stability and solubility. Although they have been used to design circularizing linkers [[#References|[2]]], one big disadvantage of this strategy is that one can only build straight linkers with helices. So in the context of circularization, if a straight line connecting the protein extremities crosses the protein, this strategy is not an option.<br />
<br />
The third option, which served as a base to develop our approach and which came from discussions with the group of Rebecca Wade in Heidelberg, Germany, consists in designing custom-tailored linkers for each specific application. These linkers can be obtained from protein structure prediction. At first one needs to define the path that the linker should take to connect the protein ends. Afterwards, one designs a possible linker sequence that might fit well. Next one makes a structure prediction of the linker attached to the protein to validate the design. Several different linkers, with slight changes, can be compared. This is repeated several times until the linker effectively follows the expected path. This method requires a strong knowledge of protein folding and protein structure prediction and is computationally intensive. On the other hand, the benefit can be important, as the interaction of the linker with the protein surface can be taken into account and as one can accurately define the path taken by the linker down to the resolution of the protein structure.<br />
<br />
We have set up a completely new strategy to design rigid linkers. As further detailed in the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling] part, it is possible to define the shape of a linker by combining rigid alpha helical rods with well-defined angle patterns. Therefore, by defining, in a geometrical way, the possible paths of the circularizing linkers for a given protein, we can then propose potential linkers. This definition of the geometrical path can be very difficult, especially for large proteins with complex shapes. Moreover, this definition is further constrained by the fact that linkers must avoid hiding active sites of the protein of interest. Finally the paths have rotational degrees of freedom at the extremities of the protein, and depending on their orientation, they may or may not match the geometry of the protein. The tool we present here covers the two steps: defining geometrical paths with some weights and translating them into feasible linkers, also with weights. This tool is universal, as it has the capacity to design circularizing linkers for any protein with a known structure. Moreover it is modular as, thanks to our [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling approach], we have designed linkers as exchangeable blocks of rods of different lengths and of angle patterns. The following sections detail the different steps followed by our software to design proper linkers.<br />
<br />
<br />
==PDB parsing==<br />
At first, the PDB file containing the structure of the target protein is parsed and the coordinates of the atoms are stored in metric units. After this, some initial tests are made on the protein structure. First, we check whether the C- and N-termini lie on the surface of the protein and are accessible to the solvent, which is crucial for circularization. We define a line originating from an extremity of the protein with the two angles of the spherical coordinates around the z-axis. From that, we can determine the accessible angles by rejecting all the lines that are too close to the protein. As the future linker will be made of alpha helices and will therefore have a radius of 5 &Aring;, we used this length as the minimal allowed distance.<br />
Those allowed angles are stored for the coming linker generation. ###fig needed### <br />
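As an illustration of the parsing step, the following minimal sketch reads the atom coordinates from the fixed-width <code>ATOM</code> records of a PDB file, where x, y and z occupy columns 31-38, 39-46 and 47-54 and are given in &Aring;. The function name is an assumption; the actual parser of the software may differ, for example in how it treats <code>HETATM</code> records or unit conversion.<br />

```python
def parse_pdb_atoms(pdb_text):
    """Return a list of (x, y, z) tuples in angstroms, one per ATOM record."""
    coords = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            # Fixed-width PDB columns (1-based): x 31-38, y 39-46, z 47-54.
            x = float(line[30:38])
            y = float(line[38:46])
            z = float(line[46:54])
            coords.append((x, y, z))
    return coords
```

Only <code>ATOM</code> records are kept in this sketch, so ligands and water molecules stored as <code>HETATM</code> are ignored.<br />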
<br />
==Generation of geometric paths==<br />
As our strategy consists in building linkers with helical rods and connecting angles, a path is completely defined by the coordinates of the angle points. Advancing one step from an existing point is always done by adding a displacement vector to this point. This vector is defined by the two spherical angles, chosen here in a discrete manner with an increment of 5 degrees, and by a length, also chosen in a discrete manner. This discrete length is used in two different contexts: it may correspond to the length of an alpha helix or to the length of the flexible part that appears at the extremity of the protein. The coordinates that are reached with this vector define the new coordinates of an angle point, no matter whether the vector corresponds to an alpha helix or to a flexible part. ###figure, points + vectors###. As we screen for all possible angles in a discrete manner, those angle point coordinates are regularly distributed on a sphere. As further detailed in the next sections, those spheres are defined from both ends, either once or in several steps. Then the software checks for possible straight connections of given lengths for each pair of angle points originating from the two extremities. ###figure needed, with all these possible steps, one sphere around each point and checking for the connections, 2D, different lengths in two different figures, ###<br />
The linkers are built in a modular way, with blocks of well-defined size. From the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling] of potential linkers, we could derive 8 different alpha helical rods, all with different lengths. In addition, the length of the two segments inside an angle block was always 8 &Aring;, so exchanging angle blocks does not affect the length of the linker. This means that the distance between the angle points is well defined, an essential aspect of our strategy of linker design.<br />
The software proceeds in three steps. First, it checks for the possibility of a direct linker made of a single alpha helix. For this, it applies the procedure just mentioned with spheres whose radii correspond to the length of the short flexible parts at the extremities of the protein. Second, it tests whether a linker containing two alpha helices connected with a right angle allows the circularization. Finally it searches for the possible linkers with three angle points. The next parts explain those three steps in detail.<br />
This method was chosen because it could be implemented easily and efficiently in our program. However, this strategy generates paths that cross the protein. Therefore we put considerable effort into sorting out such paths.<br />
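The displacement step can be sketched as follows: both spherical angles are sampled every 5 degrees and the resulting vectors of a given length are added to the starting point. The function name and the convention (polar angle measured from the z-axis, azimuth around it) are assumptions made for this example.<br />

```python
import math


def sphere_points(origin, radius, step_deg=5):
    """Points reached from `origin` by vectors of length `radius`,
    with both spherical angles sampled every `step_deg` degrees."""
    ox, oy, oz = origin
    points = []
    for theta_deg in range(0, 181, step_deg):      # polar angle from the z-axis
        theta = math.radians(theta_deg)
        for phi_deg in range(0, 360, step_deg):    # azimuth around the z-axis
            phi = math.radians(phi_deg)
            points.append((
                ox + radius * math.sin(theta) * math.cos(phi),
                oy + radius * math.sin(theta) * math.sin(phi),
                oz + radius * math.cos(theta),
            ))
    return points
```

With a 5 degree increment this yields 37 x 72 directions per radius. In this simple sampling the two poles are generated repeatedly, which a real implementation might want to deduplicate.<br />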
<br />
===Step 1===<br />
As a simple rigid linker with no angle would be easier to design and likely more thermostable than the ones containing angles, the software first checks whether this simple solution is possible ###figure needed###. For this, we took into account the fact that proteins have some flexible amino acids at their extremities. This flexible part may come from the protein itself, but also from the 2 glycines that are included at the N-terminal part and from the extein at the C-terminal part. The two latter parts come from our linkers. Those parts have no preferential angles and offer a large number of possibilities to insert fitting linkers. But this flexibility is also a drawback, as we have to include this large number of possible angles and lengths in our path search.<br />
In this first step, the software explicitly takes these flexible parts into account to check for the possibility of straight linkers. As the angles and the length of the flexible parts are variable, the software positions their extremities on a sphere centered on the last fixed position of the structure, as explained above. The radius of this sphere is incremented in a discrete manner, in 4 steps, from 5.25 &Aring; to the maximum length of the flexible part. ###figure###<br />
Then all possible straight segments between the sphere points of the two extremities are tested. If they come closer than 5 &Aring; to the protein, or if they cross it, they are rejected. If they are kept, the software checks whether the length of each segment is compatible with the feasible alpha helices: if the length of a given segment equals one of the 8 alpha helix lengths plus or minus 0.75 &Aring;, the path is saved.<br />
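The two numerical ingredients of Step 1, the sphere radii and the length test, can be sketched as below. The eight helix lengths used here are placeholder values, not the ones derived from the modeling, and the even spacing of the radii is an assumption.<br />

```python
import math

# Hypothetical helix lengths in angstroms; the real eight values come
# from the linker modeling and are not reproduced here.
HELIX_LENGTHS = [10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0, 45.0]
TOLERANCE = 0.75  # angstroms, as stated in the text


def flexible_radii(max_length, start=5.25, steps=4):
    """Sphere radii from `start` to `max_length` in `steps` even increments."""
    if steps == 1:
        return [start]
    increment = (max_length - start) / (steps - 1)
    return [start + i * increment for i in range(steps)]


def matching_helix(p, q, helix_lengths=HELIX_LENGTHS, tol=TOLERANCE):
    """Return the first helix length within `tol` of |pq|, or None."""
    d = math.dist(p, q)
    for length in helix_lengths:
        if abs(d - length) <= tol:
            return length
    return None
```

A segment whose length falls between two allowed helix lengths by more than the tolerance is simply dropped.<br />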
===Step 2===<br />
The next possibility to design more complex rigid linkers, while still taking flexible ends into account within a reasonable calculation time, was to reduce the number of possible angles. As we originally thought that a 90° angle would be practically feasible, the software was designed to generate linkers with flexible ends and one 90° angle. ###figure, protein_one_90° angle### This choice was notably made because the edge lengths of a right triangle are simple to calculate. We already saw in Step 1 that the length of an edge can only take 8 different values. As the linkers have to start from the extremities of the protein, and as we impose a right angle, the number of possible paths is low, making them easy to compute. In practice, the extremities of the protein are positioned in a flexible way as in Step 1. From each of the positions allowed by this flexibility, the software searches for all the allowed right triangles. This restriction was mainly chosen because the degrees of freedom needed to be limited to keep the calculations feasible. <br />
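With a right angle at the corner, the two helices are the legs of a right triangle whose hypotenuse is the distance between the two end positions. A minimal sketch of the resulting search is given below; the function name, the placeholder helix lengths and the reuse of the 0.75 &Aring; tolerance on the hypotenuse are assumptions.<br />

```python
import math
from itertools import product


def right_angle_pairs(d, helix_lengths, tol=0.75):
    """Pairs (a, b) of helix lengths whose right-triangle hypotenuse
    matches the end-to-end distance `d` within `tol`."""
    pairs = []
    for a, b in product(helix_lengths, repeat=2):
        if abs(math.hypot(a, b) - d) <= tol:
            pairs.append((a, b))
    return pairs


# Placeholder lengths: a 15 and a 20 angstrom helix span 25 angstroms.
pairs = right_angle_pairs(25.0, [15.0, 20.0, 30.0, 40.0])
```

With 8 helix lengths only 64 ordered pairs have to be tested per pair of end positions, which is why this case is cheap to compute.<br />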
===Step 3===<br />
Finally the software also provides the possibility to find paths with up to 4 edges, meaning 4 alpha helices and 3 angles. Thanks to the modularity of the possible linkers, such paths can offer the possibility to circularize theoretically any kind of protein. ###figure of torus###<br />
To keep the calculation feasible in a reasonable time, we designed the search strategy so that the flexible parts at the extremities are oriented in the same direction as the adjacent alpha helix. This obviously restricts the search, but as these orientations are allowed for the flexible parts, the paths found this way remain valid. <br />
First, potential ending points of the first alpha helical rod are calculated from the N-terminal point of the protein. The orientation is chosen in a discrete manner, with an increment of 5 degrees for the two angles of the spherical coordinates. The distance from the origin corresponds to the 8 possible lengths allowed by the alpha helices, as already seen in Step 2, plus a length of 0, which mimics a linker with 3 instead of 4 edges. The exact same procedure is repeated to define all the potential ending points of the second alpha helical rod, starting from all the possible ending points of the first alpha helical rod. Thanks to the possibility of a length of 0 for the first and the second rods, the software also calculates paths with 2 edges. Then, the same is done only once from the C-terminal point of the protein, defining 1 edge. The final step consists in checking whether the points originating from the N- and C-terminal points can be linked by a potential alpha helix, i.e. whether they are separated by an appropriate distance. If any of the potential alpha helix lengths lies within 0.75 &Aring; of the distance between two points, the path is saved. In the same way, if two points are closer than 0.75 &Aring; to each other, the path is also saved.<br />
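The search described in this step can be sketched as a brute-force enumeration from both ends followed by a distance test. All names below are assumptions, the helix length in the usage example is a placeholder, and only non-zero lengths are used on the C-terminal side, which the text does not specify.<br />

```python
import math


def shell(origin, lengths, step_deg):
    """End points of rods of the given lengths leaving `origin`;
    a length of 0 keeps the origin itself."""
    pts = []
    for r in lengths:
        if r == 0:
            pts.append(origin)
            continue
        for t in range(0, 181, step_deg):
            for p in range(0, 360, step_deg):
                th, ph = math.radians(t), math.radians(p)
                pts.append((origin[0] + r * math.sin(th) * math.cos(ph),
                            origin[1] + r * math.sin(th) * math.sin(ph),
                            origin[2] + r * math.cos(th)))
    return pts


def find_paths(n_term, c_term, helix_lengths, step_deg=5, tol=0.75):
    """Enumerate paths N -> a -> b -> c -> C whose connecting rod b-c fits a helix."""
    with_zero = [0] + list(helix_lengths)
    first = shell(n_term, with_zero, step_deg)            # ends of rod 1
    second = [(a, b) for a in first
              for b in shell(a, with_zero, step_deg)]     # ends of rod 2
    from_c = shell(c_term, helix_lengths, step_deg)       # one rod from the C-terminus
    paths = []
    for a, b in second:
        for c in from_c:
            d = math.dist(b, c)
            if d <= tol or any(abs(d - h) <= tol for h in helix_lengths):
                paths.append((n_term, a, b, c, c_term))
    return paths


# Coarse 90 degree sampling and a single placeholder helix length keep the example small.
paths = find_paths((0.0, 0.0, 0.0), (30.0, 0.0, 0.0), [10.0], step_deg=90)
```

At the 5 degree increment used by the software this enumeration becomes very large, which is consistent with the roughly one billion paths mentioned in the sorting section.<br />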
<br />
==Sorting out of paths==<br />
The previous part described the generation of paths that can connect the two extremities of the protein irrespective of the position of these paths relative to the protein. While this allows fast computation of the geometrical paths, it also implies that the paths that are not practically feasible need to be sorted out. This is the most time-consuming part of the computation, as about 1 billion paths are generated. Three criteria are considered for the sorting. The first one is the feasibility of the linker: can the software find angle patterns that correspond to the ones defined by the geometrical path? This question was part of the motivation for a large modeling effort to determine the possible angles between consecutive alpha helices. This was achieved by analyzing the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling distribution of angles] between alpha helices found in the ArchDB database. As nearly any angle between 20 and 170 degrees could be found, only few paths were actually rejected at that step. The next criterion is the position of the angle points: if one of them lies inside the protein, the path is rejected. Finally, the software checks whether any atom of the protein is less than 5 &Aring; away from any of the alpha helices; if so, the path is also rejected.<br />
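Two of these criteria can be sketched geometrically: the angle at each angle point and the clearance of each segment from the atoms. This is a simplified illustration under stated assumptions: the angle test is reduced to a plain 20 to 170 degree window instead of a lookup in the pattern database, the test for angle points inside the protein is omitted, and atoms next to the termini would need special handling in practice.<br />

```python
import math


def angle_deg(a, b, c):
    """Angle at point b between the segments b->a and b->c, in degrees."""
    u = [a[i] - b[i] for i in range(3)]
    v = [c[i] - b[i] for i in range(3)]
    cos_t = sum(x * y for x, y in zip(u, v)) / (math.hypot(*u) * math.hypot(*v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))


def segment_point_distance(p, q, x):
    """Shortest distance from point x to the segment pq."""
    d = [q[i] - p[i] for i in range(3)]
    w = [x[i] - p[i] for i in range(3)]
    dd = sum(k * k for k in d)
    t = 0.0 if dd == 0 else max(0.0, min(1.0, sum(m * n for m, n in zip(w, d)) / dd))
    closest = [p[i] + t * d[i] for i in range(3)]
    return math.dist(closest, x)


def path_is_kept(path, atoms, min_angle=20.0, max_angle=170.0, clearance=5.0):
    """Reject a path with an out-of-range angle or a segment too close to an atom."""
    for a, b, c in zip(path, path[1:], path[2:]):
        if not (min_angle <= angle_deg(a, b, c) <= max_angle):
            return False
    for p, q in zip(path, path[1:]):
        if any(segment_point_distance(p, q, atom) < clearance for atom in atoms):
            return False
    return True
```

Since the clearance test loops over all atoms for every segment, it dominates the running time, in line with the remark that sorting is the most time-consuming part.<br />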
<br />
==Shifting paths to the patterns==<br />
<br />
The strategy described in Step 3 gives a certain freedom to the rod that connects the last two angle points generated from the N- and C-terminal points. As this freedom is actually not permitted by the alpha helix and the angle pattern, but is permitted by the flexible part, for example at the C-terminal end, the software slightly refines the path by rotating the segment that originates from the C-terminal point.<br />
<br />
==Weighting of paths==<br />
Before translating the paths into sequences and thus into linkers that can be expressed, each path needs to be evaluated for its potential capacity to enhance heat stability. For this we have identified different contributions that should be combined into one value that defines how good the linker may be and that consequently defines its ranking among all possible linkers. The smaller this value, the more we expect that the linker will enhance thermostability. An important step for all these contributions is the normalization, as explained in the next paragraphs.<br />
The first contribution we considered was the linker length. We assumed that a short linker is better to constrain the protein extremities, and that a long helix might give more flexibility. Because we wanted this value to be independent of the size of the protein, the length of the linker is normalized to the distance between the two termini.<br />
The second contribution relates to the angles used in the linker. We learned from the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling] that angles formed by a certain angle pattern follow a certain distribution. First, we assumed that the narrower the distribution, the more likely the alpha helices would actually produce this angle. Second, the angles found by the software should be as close as possible to those well-defined angles. In this case, the weight value from this contribution should be low.<br />
Then the distance of the linker to the protein is taken into account. Because the linkers should not disturb the protein in its normal environment, linkers that pass close to the protein surface are considered better linkers. The distance was defined as the minimal distance between the linker and all the atoms of the protein. As already mentioned for the sorting of the paths, a linker cannot come closer than 5 &Aring; and this distance was used for normalization of calculated distances.<br />
After this, the places a linker should avoid are calculated. Each protein can interact with other molecules on some parts of its surface. The user can specify where and how big those parts are. If a linker passes in front of a potential molecule-binding domain, the value of the corresponding path goes to infinity, so that the linker is discarded. Conversely, the farther a linker is from a potential ligand-binding domain, the smaller its weighting value. The user can also specify the importance of certain regions. In the end, the total weighting is normalized to the number of binding domains.<br />
<br />
===Calibrating the weighting function===<br />
Every contribution has its own distribution. You can see an example in figure ### [[figure histogram_length_lys.png]], but all of them have different shapes. The aim is to find the paths that globally minimize all of these distributions. Therefore, for simplicity, the four mentioned contributions were combined in a linear manner in the weighting function:<br />
\[ W(p) = \alpha L(p) + \beta A(p) + \gamma D(p) + \delta u(p) \]<br />
where W is the final weighting, p the path, L the length contribution, A the angle contribution, D the distance contribution and u the contribution from the forbidden regions. $\alpha, \beta, \gamma, \delta$ are the weighting constants that needed to be found. The normalization performed for each of the contributions was chosen so that each of them is dimensionless and that all have reasonably similar values.<br />
The weighting constants were obtained from the [https://2014.igem.org/Team:Heidelberg/Project/Linker_Screening linker screening] performed with lysozyme and the [https://2014.igem.org/Team:Heidelberg/Modeling/Enzyme_Modeling modeling of the enzyme activity]. Their calculation is presented in the results below.<br />
<br />
==Translating paths to sequence==<br />
As already mentioned before the software is provided with two databases, one for the possible angle patterns and one for the helix patterns. The choice of the patterns was inspired by known crystal structures extracted from databases and described in different papers.<br />
A huge in silico screening for refining the preferences of the patterns was then set up using the [https://2014.igem.org/Team:Heidelberg/Software/igemathome distribution calculation] system. For the complete description of search for suitable patterns, one can read the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling] page. <br />
All the possible paths are now split up at the angles and compared with the possible patterns in the databases. ###Figure needed to explain, how the path is translated### The most suitable patterns are identified and joined together to build the path's sequence. It is important to note that this is only possible because of the modularity of our linker patterns used as building blocks: each block, being an alpha helix or an angle pattern, is not affected by the others. Thus, for each possible path, one sequence is produced.<br />
<br />
==Clustering of paths==<br />
Many different paths are represented by the same sequence [[###Figure that shows, different paths have same properties, already before in the text ###]] and we therefore clustered such paths. The weights for those clustered paths were then calculated by averaging the weights of the different paths that compose a cluster.<br />
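The clustering step can be sketched as follows, assuming that every path has already been translated into a sequence and given a weight. The function name and the example weights are hypothetical.<br />

```python
from collections import defaultdict

def cluster_by_sequence(paths):
    """Group (sequence, weight) pairs by sequence and average the weights.

    Returns a dict {sequence: mean weight of all paths with that sequence}.
    """
    groups = defaultdict(list)
    for sequence, weight in paths:
        groups[sequence].append(weight)
    return {seq: sum(ws) / len(ws) for seq, ws in groups.items()}

# Illustrative input: three paths, two of which share a sequence.
example = [("GGRGTCWE", 1.0), ("GGRGTCWE", 3.0), ("GGAEAAAKRGTCWE", 0.5)]
clusters = cluster_by_sequence(example)
```

Here the two paths that translate into the same sequence collapse into one cluster whose weight is the mean of their weights.<br />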
<br />
<br />
=Results=<br />
==DNMT1==<br />
A major motivation of our effort to design rigid linkers with angles was the circularization of the DNA methyltransferase Dnmt1 (link). The truncated form used in our project is composed of 900 amino acids and the N- and C-terminal extremities are well separated. To circularize it, two linkers were designed: a flexible one made of glycine and serine, and a rigid one designed by the software. The rigid linkers for DNMT1 were obtained from an early version of the software. At that time the calculation took 11 days on a laptop computer with an Intel i5 processor and 8 GB of RAM, which shows the importance of a distributed computing system for large proteins. Since then, the software has improved considerably, reducing the calculation time to about 1 day for DNMT1.<br />
###DNMT1 could be made heatstable, still missing###<br />
==Feedback from wet lab==<br />
The results from the software for lysozyme were tested as described in the [[linker-screening]] part and evaluated as described in the [[enzyme modeling]] part. We have performed a large linker screening on 10 different lysozymes with different linkers. As the purpose of the lysozyme screen was the calibration of the software, those linkers were designed according to the four contributions previously mentioned. One of them was the shortest possible, one had the best possible angle, and so on.<br />
{| class="table table-hover"<br />
|+ '''Table 1''': Linkers and their amino acid sequences. Green: attachment sequences to prevent the flexible regions from being perturbed; Blue: angle; Purple: extein.<br />
! Linker<br />
! Amino acid sequence<br />
|- <br />
|<br />
|<br />
|-<br />
|'''Very good linkers'''<br />
|<br />
|-<br />
| sgt2<br />
| <span style="color:#3ADF00;">GG</span>AEAAAK<span style="color:#00BFFF;">AAAHPEA</span>AEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| rigid<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAKAA<span style="color:#3ADF00;">P</span><span style="color:#A901DB;">RGKCWE</span><br />
|-<br />
| <br />
| <br />
|-<br />
| '''Average linkers'''<br />
|<br />
|-<br />
| may1<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAKA<span style="color:#00BFFF;">AAAHPEA</span>AEAAAKEAAAKA<span style="color:#00BFFF;">KTA</span>AEAAAKEAAAKA<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| ord1<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAK<span style="color:#00BFFF;">ATGDLA</span>AEAAAKAA<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| ord3<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAK<span style="color:#00BFFF;">ASLPAA</span>AEAAAKEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
|<br />
| <br />
|-<br />
| '''Short linkers'''<br />
| <br />
|-<br />
| sho1<br />
| <span style="color:#3ADF00;">GG</span><span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| sho2<br />
| <span style="color:#3ADF00;">GG</span>AEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
|}<br />
[[circ_lam_lys_nils.png]] comparison with [[nice_linker_lysozyme_flexible_ends.png]]<br />
###About predictions of software###<br />
In the end we obtained a ranking of the in vitro tested linkers from the [[linker-screening]] and chose the parameters $\alpha, \beta, \gamma, \delta$ of the weighting function such that the ranking from the software reproduced the ranking from the assays. The final values were: $\alpha = 1.85 \times 10^{-6}, \beta = 0.57, \gamma = 50.8 \times 10^{6}$.<br />
To this end, we set up a function that is minimized when the order of the weighting output is the same as the order from the wet lab.<br />
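One possible form of such a function is sketched below. It simply counts the pairs of linkers that the software orders differently from the assay, so it reaches its minimum of 0 when both rankings agree. This is only an assumption about how the objective could look; the function actually used is not described here, and the linker names and weights in the example are made up.<br />

```python
from itertools import combinations

def rank_disagreement(weights, assay_order):
    """Count linker pairs ordered differently by the software and the assay.

    weights:     {linker name: software weight, smaller means better}
    assay_order: linker names sorted from best to worst in the wet lab
    """
    disagreements = 0
    for better, worse in combinations(assay_order, 2):
        # The linker that performed better in the assay should get the smaller weight.
        if weights[better] >= weights[worse]:
            disagreements += 1
    return disagreements

# Made-up example with three hypothetical linkers x, y and z.
order = ["x", "y", "z"]
perfect = rank_disagreement({"x": 0.1, "y": 0.5, "z": 0.9}, order)
reversed_ranking = rank_disagreement({"x": 0.9, "y": 0.5, "z": 0.1}, order)
```

Minimizing this count over $\alpha, \beta, \gamma, \delta$, for example with a grid search, would select constants whose ranking matches the wet lab results as closely as possible.<br />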
<br />
=Discussion=<br />
Will always be refined with more data from i@h...<br />
=References=<br />
<br />
[1] Thornton, J.M. & Sibanda, B.L. Amino and carboxy-terminal regions in globular proteins. Journal of molecular biology 167, 443-460 (1983).<br />
<br />
[2] Wang, C.K.L., Kaas, Q., Chiche, L. & Craik, D.J. CyBase: A database of cyclic protein sequences and structures, with applications in protein discovery and engineering. Nucleic Acids Research 36, (2008).</div>Igemnilshttp://2014.igem.org/Team:Heidelberg/pages/Linker_SoftwareTeam:Heidelberg/pages/Linker Software2014-10-17T23:00:20Z<p>Igemnils: /* Calibrating the weighting function */</p>
<hr />
<div>=Abstract=<br />
As already [https://2014.igem.org/Team:Heidelberg/Toolbox/Circularization introduced], artificially circularized proteins may gain some heat stability by restraining the C- and N-terminus from moving around freely. This circularization may be trivial when the protein termini are very close to each other, which seems to be reasonably common [[#References|[1]]]. However, if the ends are too far from each other, a long linker is needed to connect them. This linker should not change the natural conformation of the protein and should constrain the relative position of the ends to restrict the degrees of freedom and thus to stabilize the structure even when heated up. On top of that, these linkers should not affect any of the protein functions. Consequently, it is important to prevent linkers from passing through the active site or from covering binding domains for other molecules, for example. Therefore one needs to be able to define the shape of possible linkers. This section describes the software we developed to design such linkers. We would like to stress that this work has been made possible thanks to the feedback between computer modeling and experimental work: we could first design linkers in silico, test them experimentally and use the results to further calibrate the software. To our knowledge, this is the first time that such an approach has been used to custom-design rigid linkers with angles to connect protein extremities.<br />
<br />
=General procedure=<br />
In short, the software can provide a weighted list of linkers to circularize any protein of interest with a known structure. Those linkers are made of rigid alpha-helical segments connected with defined angles. Contrary to flexible linkers, those rigid linkers were expected to constrain the protein extremities and to confer better heat stability. Such an idea was already developed [[#References|[2]]] but only with alpha helices defining simple rods, and without any possibility to introduce angles. To generate those linkers, we first defined the geometrical paths, with segments and angles, that they should follow. The geometrical paths that are biologically feasible are afterwards translated into amino acid sequences. Both the compatibility of paths with possible structures and the translation were made possible thanks to our [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling approaches]. The first approach consisted in performing a statistical analysis of more than 17000 known non-homologous structures containing alpha helices connected with angles. For the second approach, we modeled the conformation of linkers circularizing proteins of known structure and analyzed them for certain properties. This second approach was run for a large number of proteins thanks to our distributed computing system [https://2014.igem.org/Team:Heidelberg/Software/igemathome igemathome]. The software provides different possible linkers with weights that rank the linkers according to their capacity to maintain protein activity at higher temperatures. These weights were generated thanks to an extensive [https://2014.igem.org/Team:Heidelberg/Project/Linker_Screening linker screening] on the target protein lambda-lysozyme, using the first modeling approach.<br />
<br />
=Background=<br />
Classically, protein linkers were designed in three different manners. The easiest way is to define the length that a linker should cover and then simply use a flexible glycine-serine peptide with the right number of amino acids to match this length. Glycine is used for flexibility, as it has no side chain and does not produce any steric hindrance, while serine is used for solubility, as it has a small polar side chain. This solubility is important, as the linkers should not pass through the hydrophobic core of the protein, but should be dissolved in the surrounding medium. These flexible linkers were normally used for circularization but also for connecting different proteins, when the main goal is that the different parts are connected, but not how they are connected, or when the flexibility of the linker was required for specific applications. <br />
<br />
A second strategy consists in using rigid helical linkers to keep proteins or protein domains at a certain distance from each other. This is especially important for signalling proteins and fluorescent proteins. One major property of alpha helices is that they always fold in a defined way with well-defined angles and lengths. There are also many different helical patterns that differ in stability and solubility. Although they have been used to design circularizing linkers [[#References|[2]]], one big disadvantage of this strategy is that one can only build straight linkers with helices. So in the context of circularization, if the straight line connecting the protein extremities crosses the protein, this strategy is not an option.<br />
<br />
The third option, which served as a basis for developing our approach and which came from discussions with the group of Rebecca Wade in Heidelberg, Germany, consists in designing custom-tailored linkers for each specific application. These linkers can be obtained from protein structure prediction. At first one needs to define the path that the linker should take to connect the protein ends. Afterwards, one designs a possible linker sequence that might fit well. Next one makes a structure prediction of the linker attached to the protein to validate the design. Several different linkers, with slight changes, can be compared. This is repeated several times until the linker effectively follows the expected path. This method requires a strong knowledge of protein folding and protein structure prediction and is computationally intensive. On the other hand, the benefit can be important, as the interaction of the linker with the protein surface can be taken into account and as one can accurately define the path taken by the linker to the resolution of the protein structure.<br />
<br />
We have set up a completely new strategy to design rigid linkers. As further detailed in the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling] part, it is possible to define the shape of a linker by combining rigid alpha-helical rods with well-defined angle patterns. Therefore, by defining, in a geometrical way, the possible paths of the circularizing linkers for a given protein, we can then propose potential linkers. This definition of the geometrical path can be very difficult, especially for large proteins with complex shapes. Moreover, this definition is further constrained by the fact that linkers must avoid hiding active sites of the protein of interest. Finally, the paths have rotational degrees of freedom at the extremities of the protein, and depending on their orientation, they may or may not match the geometry of the protein. The tool we present here covers the two steps: defining geometrical paths with some weights and translating them into feasible linkers, also with weights. This tool is universal, as it has the capacity to design circularizing linkers for any protein with a known structure. Moreover, it is modular as, thanks to our [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling approach], we have designed linkers as exchangeable blocks of rods of different lengths and of angle patterns. The following sections detail the different steps followed by our software to design proper linkers.<br />
<br />
<br />
==PDB parsing==<br />
At first, the PDB file containing the structure of the target protein is parsed and the coordinates of the atoms are stored in metric units. After this, some initial tests are made on the protein structure. First, we check whether the C- and N-termini lie on the surface of the protein and are accessible to the solvent, which is crucial for circularization. We define a line originating from an extremity of the protein with the two angles of the spherical coordinates around the z-axis. From that, we can determine the accessible angles by rejecting all the lines that are too close to the protein. As the future linker will be made of alpha helices and will therefore have a radius of 5 &Aring;, we used this length as the minimal allowed distance.<br />
Those allowed angles are stored for the coming linker generation. ###fig needed### <br />
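The parsing and the accessibility test can be sketched as follows. This is a minimal illustration, not the actual implementation: it reads only the fixed-column coordinates of ATOM records, keeps them in &Aring;, and ignores atoms within a hypothetical `skip_radius` of the terminus, since those would otherwise block every direction. All function and parameter names are assumptions.<br />

```python
import math

def parse_pdb_atoms(pdb_text):
    """Return (x, y, z) tuples in Angstrom for every ATOM record."""
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            # Fixed-column PDB format: x, y, z occupy columns 31-38, 39-46, 47-54.
            atoms.append((float(line[30:38]), float(line[38:46]), float(line[46:54])))
    return atoms

def direction(theta_deg, phi_deg):
    """Unit vector for polar angle theta (from the z-axis) and azimuth phi."""
    t, p = math.radians(theta_deg), math.radians(phi_deg)
    return (math.sin(t) * math.cos(p), math.sin(t) * math.sin(p), math.cos(t))

def ray_is_clear(origin, d, atoms, min_dist=5.0, skip_radius=6.0):
    """True if no atom is closer than min_dist to the ray origin + s*d, s >= 0."""
    for atom in atoms:
        v = [atom[i] - origin[i] for i in range(3)]
        if math.sqrt(sum(c * c for c in v)) < skip_radius:
            continue  # assumption: atoms of the terminus itself are ignored
        s = max(0.0, sum(v[i] * d[i] for i in range(3)))  # closest point on the ray
        perp = [v[i] - s * d[i] for i in range(3)]
        if math.sqrt(sum(c * c for c in perp)) < min_dist:
            return False
    return True

def accessible_directions(origin, atoms, step_deg=5):
    """All (theta, phi) pairs, in degrees, whose ray stays clear of the protein."""
    return [(theta, phi)
            for theta in range(0, 181, step_deg)
            for phi in range(0, 360, step_deg)
            if ray_is_clear(origin, direction(theta, phi), atoms)]
```

The list returned by `accessible_directions` plays the role of the stored allowed angles; a terminus for which this list is empty would not be accessible to the solvent in this simplified picture.<br />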
<br />
==Generation of geometric paths==<br />
As our strategy consists in building linkers with helical rods and connecting angles, a path is completely defined by the coordinates of the angle points. Advancing one step from an existing point is always done by adding a displacement vector to this point. This vector is defined by the two spherical angles, chosen here in a discrete manner with an increment of 5 degrees, and by a length, also chosen in a discrete manner. This discrete length was used in two different contexts: it may correspond to the length of an alpha helix or to the length of the flexible part that appears at the extremity of the protein. The coordinates that are reached thanks to this vector define the new coordinates of an angle point, no matter if the vector corresponds to an alpha helix or to a flexible part. ###figure, points + vectors###. As we screen for all possible angles in a discrete manner, those angle point coordinates are regularly distributed on a sphere. As further detailed in the next sections, those spheres are defined from both ends, either once or in several steps. Then the software checks for possible straight connections of given lengths for each pair of angle points originating from both extremities. ###figure needed, with all these possible steps, one sphere around each point and checking for the connections, 2D, different lengths in two different figures, ###<br />
The linkers are built in a modular way, with blocks of well-defined size. From the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling] of potential linkers, we could derive 8 different alpha-helical rods, all with different lengths. On top of that, the length of the two segments inside an angle block was always 8 &Aring;, so exchanging angle blocks does not affect the length of the linker. This means that the distance between the angle points is well defined, an essential aspect of our strategy of linker design.<br />
The software proceeds in three steps. First, it checks for the possibility of a direct single alpha helix linker. For this, it applies the procedure just mentioned with spheres whose radii reasonably correspond to the length of the short parts at the extremities of the protein. Second, it tests whether a linker containing two alpha helices connected with a right angle allows the circularization. Finally, it searches for possible linkers with three angle points. The next parts will explain those three steps in detail.<br />
This method has been chosen because it could be implemented easily and efficiently in our program. However, this strategy generated paths that crossed the protein. Therefore we put considerable effort into sorting out the paths.<br />
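The discrete sampling of angle points on a sphere can be sketched like this, assuming a polar angle from 0 to 180 degrees and an azimuth from 0 to 355 degrees, both in steps of 5 degrees. The function name is hypothetical, and the treatment of the poles (one point each instead of 72 identical ones) is a choice made for this sketch.<br />

```python
import math

def sphere_points(center, radius, step_deg=5):
    """Points reached from center by a displacement of the given length.

    The two spherical angles are sampled in increments of step_deg.
    """
    points = []
    for theta in range(0, 181, step_deg):
        # At the poles every azimuth gives the same point, so keep only one.
        phis = [0] if theta in (0, 180) else range(0, 360, step_deg)
        for phi in phis:
            t, p = math.radians(theta), math.radians(phi)
            points.append((center[0] + radius * math.sin(t) * math.cos(p),
                           center[1] + radius * math.sin(t) * math.sin(p),
                           center[2] + radius * math.cos(t)))
    return points

# Example: candidate angle points 5.25 Angstrom away from an arbitrary position.
candidates = sphere_points((1.0, 2.0, 3.0), 5.25)
```

With a 5 degree increment this gives 35 x 72 + 2 = 2522 points per sphere, which illustrates why the number of paths grows quickly when several rods are chained.<br />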
<br />
===Step 1===<br />
As a simple rigid linker with no angle would be easier to design and likely more thermostable than the ones containing angles, the software first checks if this simple solution is possible ###figure needed###. For this, we took into account the fact that proteins have some flexible amino acids at their extremities. This flexible part may come from the protein itself, but also from the 2 glycines that are included at the N-terminal part and from the extein at the C-terminal part. Those two latter parts come from our linkers. Those parts have no preferential angles and offer a large number of possibilities to insert fitting linkers. But this flexibility is also a drawback, as we have to include this large number of possible angles and lengths in our path search.<br />
In this first step, the software explicitly takes these flexible parts into account to check for the possibility of straight linkers. As the angles and the length of the flexible parts are variable, the software positions their extremities on a sphere centered on the last fixed position of the structure, as explained above. The radius of this sphere is incremented in a discrete manner, in 4 steps, from 5.25 &Aring; to the maximum length of the flexible part. ###figure###<br />
Then all possible straight segments between the points on the spheres around the two extremities are tested. If they are closer than 5 &Aring; to the protein, or if they cross it, they are rejected. If they are kept, the software checks whether the length of the segments is compatible with the feasible alpha helices: if the length of a given segment equals one of the 8 alpha helix lengths plus or minus 0.75 &Aring;, the path is saved.<br />
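The two numerical rules of Step 1 can be sketched as follows. The equal spacing of the four radii is an assumption, and the eight values in `HELIX_LENGTHS` are placeholders, since the real rod lengths come from the modeling and are not listed here.<br />

```python
import math

# Placeholder values in Angstrom; the 8 real rod lengths are not given in this text.
HELIX_LENGTHS = [9.0, 12.0, 15.0, 18.0, 21.0, 24.0, 27.0, 30.0]

def flexible_radii(max_length, start=5.25, steps=4):
    """Sphere radii from start to max_length, assumed to be equally spaced."""
    if steps == 1:
        return [start]
    increment = (max_length - start) / (steps - 1)
    return [start + i * increment for i in range(steps)]

def matching_helix(p, q, helix_lengths=HELIX_LENGTHS, tol=0.75):
    """Return the helix length that fits the segment p-q within tol, else None."""
    d = math.dist(p, q)
    for length in helix_lengths:
        if abs(d - length) <= tol:
            return length
    return None
```

For a flexible part of at most 11.25 &Aring;, `flexible_radii(11.25)` gives the radii 5.25, 7.25, 9.25 and 11.25 &Aring;; a segment is then kept only if `matching_helix` returns a length for its two end points.<br />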
===Step 2===<br />
The next possibility for designing more complex rigid linkers, while still taking flexible ends into account within a reasonable calculation time, was to reduce the number of possible angles. As we originally thought that a 90° angle would be practically feasible, the software was designed to generate linkers with flexible ends and one 90° angle. ###figure, protein_one_90° angle### This choice was notably made because the edge lengths of a right triangle are simple to calculate. We already saw in Step 1 that the length of an edge can only take 8 different values. As the linkers have to start from the extremities of the protein, and as we impose a right angle, the number of possible paths is low, making them easy to compute. In practice, the extremities of the protein are positioned in a flexible way as in Step 1. From each of the positions allowed by this flexibility, the software searches for all the allowed right triangles. This restriction was mainly introduced because the degrees of freedom needed to be limited to keep calculations feasible. <br />
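The right-triangle search can be illustrated with a short sketch based on the Pythagorean theorem. The tolerance of 0.75 &Aring; is borrowed from Steps 1 and 3 and is an assumption for this step, as are the function names and the rod lengths in the example.<br />

```python
import math
from itertools import product

def right_angle_pairs(start, end, helix_lengths, tol=0.75):
    """Pairs (a, b) of rod lengths that join start and end with a 90 degree angle.

    The two rods are the legs of a right triangle whose hypotenuse is start-end.
    """
    d = math.dist(start, end)
    return [(a, b) for a, b in product(helix_lengths, repeat=2)
            if abs(math.hypot(a, b) - d) <= tol]

def corner_circle(a, b, d):
    """Where the angle point can sit for legs a, b and hypotenuse d.

    Returns (offset along the start-end axis, radius of the circle around it).
    """
    return (a * a / d, a * b / d)

# Hypothetical example: two flexible-end positions 15 Angstrom apart.
pairs = right_angle_pairs((0, 0, 0), (15, 0, 0), [9.0, 12.0])
```

For each accepted pair, the angle point is not unique: it can lie anywhere on the circle described by `corner_circle`, so the sorting step still has to decide which of these positions avoid the protein.<br />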
===Step 3===<br />
Finally the software also provides the possibility to find paths with up to 4 edges, meaning 4 alpha helices and 3 angles. Thanks to the modularity of the possible linkers, such paths can offer the possibility to circularize theoretically any kind of protein. ###figure of torus###<br />
To keep the calculation feasible in a reasonable time, we designed the search strategy so that the flexible parts at the extremities are oriented in the same direction as the adjacent alpha helix. This obviously restricts the search, but as these orientations are allowed for the flexible part, this approach remains valid. <br />
First, potential ending points of the first alpha-helical rod are calculated from the N-terminal point of the protein. The orientation is chosen in a discrete manner, with an increment of 5 degrees for the two angles of the spherical coordinates. The distance from the origin corresponds to the 8 possible lengths allowed by the alpha helices, as already seen in Step 2, plus a length of 0, which mimics a linker with 3 instead of 4 edges. The exact same procedure is repeated to define all the potential ending points of the second alpha-helical rod starting from all the possible ending points of the first alpha-helical rod. Thanks to the possibility of a length of 0 for the first and the second rods, the software also calculates paths with 2 edges. Then, the same is done only once from the C-terminal point of the protein, defining 1 edge. The final step consists in checking if the points originating from the N- and C-terminal points can be linked by a potential alpha helix, i.e. if they are separated by the appropriate distance. If any of the potential alpha helix lengths lies within 0.75 &Aring; of the distance between two points, the path is saved. In the same way, if two points are closer than 0.75 &Aring; to each other, the path is also saved.<br />
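The search from both ends can be sketched as below. To keep the example small, the directions are passed in as a list of unit vectors instead of the full 5 degree grid, and the rod lengths are placeholders; the function names are hypothetical.<br />

```python
import math

def extend(points, lengths, directions):
    """End points reached from each point by one rod.

    A length of 0 keeps the point unchanged, which mimics a missing rod.
    """
    reached = []
    for (x, y, z) in points:
        for length in lengths:
            if length == 0:
                reached.append((x, y, z))
                continue
            for (dx, dy, dz) in directions:
                reached.append((x + length * dx, y + length * dy, z + length * dz))
    return reached

def connection(n_point, c_point, helix_lengths, tol=0.75):
    """Length of the closing rod between the two points, or None.

    Returns 0.0 if the points already coincide within tol.
    """
    d = math.dist(n_point, c_point)
    if d < tol:
        return 0.0
    for length in helix_lengths:
        if abs(d - length) <= tol:
            return length
    return None

# Coarse illustration with six axis-aligned directions and one placeholder length.
axes = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
first = extend([(0.0, 0.0, 0.0)], [0, 10.0], axes)   # first rod from the N-terminus
second = extend(first, [0, 10.0], axes)              # second rod
```

Every point in `second` would then be compared with every point generated from the C-terminal side using `connection`, and the path is saved whenever the result is not `None`.<br />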
<br />
==Sorting out of paths==<br />
The previous part described the generation of paths that can connect the two extremities of the protein irrespective of the position of these paths relative to the protein. While this allows fast computation of the geometrical paths, it also implies that the paths that are not practically feasible need to be sorted out. This is the most time-consuming part of the computation, as about 1 billion paths are generated. Three criteria are considered for the sorting. The first one is the feasibility of the linker: can the software find angle patterns that correspond to the ones defined by the geometrical path? This question was part of the motivation for a large modeling effort (link) to determine the possible angles between consecutive alpha helices. This was achieved by analyzing the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling distribution of angles] between alpha helices found in the ArchDB database. As nearly any angle between 20 and 170 degrees could be found, only a few paths were actually rejected at that step. The second criterion is the position of the angle points: if any of them lies inside the protein, the path is rejected. Finally, the software checks whether any atom of the protein is less than 5 &Aring; away from any of the alpha helices; if so, the path is also rejected.<br />
<br />
==Shifting paths to the patterns==<br />
<br />
The strategy described in Step 3 gives a certain freedom to the rod that connects the last two angle points generated from the N- and C-terminal points. As this freedom is actually not permitted by the alpha helix and the angle pattern, but is permitted by the flexible part, for example at the C-terminal end, the software slightly refines the path by rotating the segment that originates from the C-terminal point.<br />
<br />
==Weighting of paths==<br />
Before translating the paths into sequences and thus into linkers that can be expressed, each path needs to be evaluated for its potential capacity to enhance heat stability. For this we have identified different contributions that should be combined into one value that defines how good the linker may be and that consequently defines its ranking among all possible linkers. The smaller this value, the more we expect that the linker will enhance thermostability. An important step for all these contributions is the normalization, as explained in the next paragraphs.<br />
The first contribution we considered was the linker length. We assumed that a short linker is better to constrain the protein extremities, and that a long helix might give more flexibility. Because we wanted this value to be independent of the size of the protein, the length of the linker is normalized to the distance between the two termini.<br />
The second contribution relates to the angles used in the linker. We learned from the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling] that angles formed by a certain angle pattern follow a certain distribution. First, we assumed that the narrower the distribution, the more likely the alpha helices would actually produce this angle. Second, the angles found by the software should be as close as possible to those well-defined angles. In this case, the weight value from this contribution should be low.<br />
Then the distance of the linker to the protein is taken into account. Because the linkers should not disturb the protein in its normal environment, linkers that pass close to the protein surface are considered better linkers. The distance was defined as the minimal distance between the linker and all the atoms of the protein. As already mentioned for the sorting of the paths, a linker cannot come closer than 5 &Aring; and this distance was used for normalization of calculated distances.<br />
After this, the places a linker should avoid are calculated. Each protein can interact with other molecules on some parts of its surface. The user can specify where and how big those parts are. If a linker passes in front of a potential molecule-binding domain, the value of the corresponding path goes to infinity, so that the linker is discarded. Conversely, the farther a linker is from a potential ligand-binding domain, the smaller its weighting value. The user can also specify the importance of certain regions. In the end, the total weighting is normalized to the number of binding domains.<br />
<br />
===Calibrating the weighting function===<br />
Every contribution has its own distribution. You can see an example in figure ### [[figure histogram_length_lys.png]], but all of them have different shapes. The aim is to find the paths that globally minimize all of these distributions. Therefore, for simplicity, the four mentioned contributions were combined in a linear manner in the weighting function:<br />
\[ W(p) = \alpha L(p) + \beta A(p) + \gamma D(p) + \delta u(p) \]<br />
where W is the final weighting, p the path, L the length contribution, A the angle contribution, D the distance contribution and u the contribution from the forbidden regions. $\alpha, \beta, \gamma, \delta$ are the weighting constants that needed to be found. The normalization performed for each of the contributions was chosen so that each of them is dimensionless and that all have reasonably similar values.<br />
The weighting constants were obtained from the [https://2014.igem.org/Team:Heidelberg/Project/Linker_Screening linker screening] performed with lysozyme and the [https://2014.igem.org/Team:Heidelberg/Modeling/Enzyme_Modeling modeling of the enzyme activity]. Their calculation is presented in the results below.<br />
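The linear combination and the resulting ranking can be written down in a few lines. This is a sketch under the assumption that the four normalized contributions of each path are already available; the function names, the example values and the constants used in the example are placeholders, not the calibrated ones.<br />

```python
import math

def weight(length, angle, distance, forbidden, alpha, beta, gamma, delta):
    """W(p) = alpha*L(p) + beta*A(p) + gamma*D(p) + delta*u(p)."""
    if math.isinf(forbidden):
        # A path crossing a forbidden region is discarded whatever delta is.
        return math.inf
    return alpha * length + beta * angle + gamma * distance + delta * forbidden

def rank_paths(contributions, alpha, beta, gamma, delta):
    """Sort path names from best (smallest W) to worst, dropping discarded paths.

    contributions: {name: (L, A, D, u)} with normalized, dimensionless values.
    """
    scored = {name: weight(*c, alpha, beta, gamma, delta)
              for name, c in contributions.items()}
    return sorted((n for n, w in scored.items() if math.isfinite(w)), key=scored.get)

# Placeholder contributions for three hypothetical paths a, b and c.
example = {"a": (1.0, 1.0, 1.0, 0.0),
           "b": (0.5, 0.5, 0.5, 0.0),
           "c": (0.1, 0.1, 0.1, math.inf)}
ranking = rank_paths(example, 1.0, 1.0, 1.0, 1.0)
```

Path c has the smallest individual contributions but passes in front of a forbidden region, so it is removed, and b is ranked ahead of a.<br />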
<br />
==Translating paths to sequence==<br />
As already mentioned before the software is provided with two databases, one for the possible angle patterns and one for the helix patterns. The choice of the patterns was inspired by known crystal structures extracted from databases and described in different papers.<br />
A huge in silico screening for refining the preferences of the patterns was then set up using the [[iGEM@home|###Link]] system. For the complete description of search for suitable patterns, one can read the [[modeling|###Link to patternspart]] page. <br />
All the possible paths are now split up at the angles and compared with the possible patterns in the databases. ###Figure needed to explain, how the path is translated### The most suitable patterns are identified and joined together to build the path's sequence. It is important to note that this is only possible because of the modularity of our linker patterns used as building blocks: each block, being an alpha helix or an angle pattern, is not affected by the others. Thus, for each possible path, one sequence is produced.<br />
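The modular translation can be sketched as a lookup followed by a concatenation. The block sequences, the GG attachment and the RGTCWE extein are taken from Table 1 of the results, but the rod lengths and angles used as keys of the two small databases are invented for the example, and choosing the nearest key is only one plausible reading of "most suitable".<br />

```python
def nearest(table, value):
    """Pattern whose key (a length or an angle) is closest to value."""
    key = min(table, key=lambda k: abs(k - value))
    return table[key]

def translate_path(rod_lengths, angles, helix_db, angle_db,
                   attachment="GG", extein="RGTCWE"):
    """Build the linker sequence: attachment, alternating rods and angles, extein."""
    blocks = []
    for i, length in enumerate(rod_lengths):
        blocks.append(nearest(helix_db, length))
        if i < len(angles):
            blocks.append(nearest(angle_db, angles[i]))
    return attachment + "".join(blocks) + extein

# Hypothetical keys (Angstrom and degrees); only the sequences come from Table 1.
helix_db = {9.0: "AEAAAK", 16.5: "AEAAAKEAAAK"}
angle_db = {60.0: "AAAHPEA", 120.0: "ATGDLA"}
sequence = translate_path([9.2, 8.8], [65.0], helix_db, angle_db)
```

With these made-up keys, a path of two short rods joined by one angle gives the same sequence as the linker sgt2 in Table 1, and a path without any rod gives the sequence of sho1.<br />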
==Clustering of paths==<br />
Many different paths are represented by the same sequence [[###Figure that shows, different paths have same properties, already before in the text ###]] and we therefore clustered such paths. The weights for those clustered paths were then calculated by averaging the weights of the different paths that compose a cluster.<br />
<br />
<br />
=Results=<br />
==DNMT1==<br />
A major motivation of our effort to design rigid linkers with angles was the circularization of the DNA methyltransferase Dnmt1 (link). The truncated form used in our project is composed of 900 amino acids and the N- and C-terminal extremities are well separated. To circularize it, two linkers were designed: a flexible one made of glycine and serine, and a rigid one designed by the software. The rigid linkers for DNMT1 were obtained from an early version of the software. At that time the calculation took 11 days on a laptop computer with an Intel i5 processor and 8 GB of RAM, which shows the importance of a distributed computing system for large proteins. Since then, the software has improved considerably, reducing the calculation time to about 1 day for DNMT1.<br />
###DNMT1 could be made heatstable, still missing###<br />
==Feedback from wet lab==<br />
The results from the software for lysozyme were tested as described in the [[linker-screening]] part and evaluated as described in the [[enzyme modeling]] part. We have performed a large linker screening on 10 different lysozymes with different linkers. As the purpose of the lysozyme screen was the calibration of the software, those linkers were designed according to the four contributions previously mentioned. One of them was the shortest possible, one had the best possible angle, and so on.<br />
{| class="table table-hover"<br />
|+ '''Table 1''': Linkers and their amino acid sequences. Green: attachment sequences to prevent the flexible regions from being perturbed; Blue: angle; Purple: extein.<br />
! Linker<br />
! Amino acid sequence<br />
|- <br />
|<br />
|<br />
|-<br />
|'''Very good linkers'''<br />
|<br />
|-<br />
| sgt2<br />
| <span style="color:#3ADF00;">GG</span>AEAAAK<span style="color:#00BFFF;">AAAHPEA</span>AEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| rigid<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAKAA<span style="color:#3ADF00;">P</span><span style="color:#A901DB;">RGKCWE</span><br />
|-<br />
| <br />
| <br />
|-<br />
| '''Average linkers'''<br />
|<br />
|-<br />
| may1<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAKA<span style="color:#00BFFF;">AAAHPEA</span>AEAAAKEAAAKA<span style="color:#00BFFF;">KTA</span>AEAAAKEAAAKA<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| ord1<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAK<span style="color:#00BFFF;">ATGDLA</span>AEAAAKAA<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| ord3<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAK<span style="color:#00BFFF;">ASLPAA</span>AEAAAKEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
|<br />
| <br />
|-<br />
| '''Short linkers'''<br />
| <br />
|-<br />
| sho1<br />
| <span style="color:#3ADF00;">GG</span><span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| sho2<br />
| <span style="color:#3ADF00;">GG</span>AEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
|}<br />
[[circ_lam_lys_nils.png]] comparison with [[nice_linker_lysozyme_flexible_ends.png]]<br />
###About predictions of software###<br />
In the end we obtained a ranking of the in vitro tested linkers from the [[linker-screening]] and chose the parameters $\alpha, \beta, \gamma, \delta$ of the weighting function such that the ranking from the software reproduced the ranking from the assays. The final values were: $\alpha = 1.85 \times 10^{-6}, \beta = 0.57, \gamma = 50.8 \times 10^{6}$.<br />
To this end, we set up a function that is minimized when the order of the weighting output is the same as the order from the wet lab.<br />
<br />
=Discussion=<br />
Will always be refined with more data from i@h...<br />
=References=<br />
<br />
[1] Thornton, J.M. & Sibanda, B.L. Amino and carboxy-terminal regions in globular proteins. Journal of molecular biology 167, 443-460 (1983).<br />
<br />
[2] Wang, C.K.L., Kaas, Q., Chiche, L. & Craik, D.J. CyBase: A database of cyclic protein sequences and structures, with applications in protein discovery and engineering. Nucleic Acids Research 36, (2008).</div>Igemnilshttp://2014.igem.org/Team:Heidelberg/pages/Linker_SoftwareTeam:Heidelberg/pages/Linker Software2014-10-17T22:58:55Z<p>Igemnils: /* Calibrating the weighting function */</p>
<hr />
<div>=Abstract=<br />
As already [https://2014.igem.org/Team:Heidelberg/Toolbox/Circularization introduced], artificially circularized proteins may gain some heat stability by restraining the C- and N-termini from moving around freely. This circularization may be trivial when the protein termini are very close to each other, which seems to be reasonably common [[#References|[1]]]. However, if the ends are too far from each other, a long linker is needed to connect them. This linker should not change the natural conformation of the protein and should constrain the relative position of the ends to restrict the degrees of freedom and thus stabilize the structure even when heated. In addition, these linkers should not affect any of the protein functions. Consequently it is important to prevent linkers from passing through the active site or from covering binding domains for other molecules, for example. Therefore one needs to be able to define the shape of possible linkers. This section describes the software we developed to design such linkers. We would like to stress that this work has been made possible thanks to the feedback between computer modeling and experimental work: we could first design linkers in silico, test them experimentally and use the results to further calibrate the software. To our knowledge, this is the first time that such an approach has been used to custom-design rigid linkers with angles to connect protein extremities.<br />
<br />
=General procedure=<br />
In short, the software can provide a weighted list of linkers to circularize any protein of interest with a known structure. Those linkers are made of rigid alpha-helical segments connected with defined angles. Contrary to flexible linkers, those rigid linkers were expected to constrain the protein extremities and to confer better heat stability. Such an idea was already developed [[#References|[2]]] but only with alpha helices defining simple rods, and without any possibility to introduce angles. To generate those linkers, we first defined the geometrical paths, with segments and angles, that they should follow. The geometrical paths that are biologically feasible are afterwards translated into amino acid sequences. Both the compatibility of paths with possible structures and the translation were made possible thanks to our [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling approaches]. The first approach consisted in performing a statistical analysis of more than 17000 known non-homologous structures containing alpha helices connected with angles. For the second approach, we modeled the conformation of linkers circularizing proteins of known structure and analyzed them for certain properties. This second approach was run for a large number of proteins thanks to our distributed computing system [https://2014.igem.org/Team:Heidelberg/Software/igemathome igemathome]. The software provides different possible linkers with weights that rank the linkers according to their capacity to maintain protein activity at higher temperatures. The weights were calibrated thanks to an extensive [https://2014.igem.org/Team:Heidelberg/Project/Linker_Screening linker screening] on the target protein lambda-lysozyme, using the first modeling approach.<br />
<br />
=Background=<br />
Classically, protein linkers were designed in three different ways. The easiest way is to define the length that a linker should cover and then simply use a flexible glycine-serine peptide with the right number of amino acids to match this length. Glycine is used for flexibility, as it has no side chain and does not produce any steric hindrance, while serine is used for solubility, as it has a small polar side chain. This solubility is important, as the linkers should not pass through the hydrophobic core of the protein, but should be dissolved in the surrounding medium. These flexible linkers were normally used for circularization but also for connecting different proteins, when the main goal is that the different parts are connected, but not how they are connected, or when the flexibility of the linker is required for specific applications. <br />
<br />
A second strategy consists in using rigid helical linkers to keep proteins or protein domains at a certain distance from each other. This is especially important for signalling proteins and fluorescent proteins. One major property of alpha helices is that they always fold in a defined way with well-defined angles and lengths. There are also many different helical patterns that differ in stability and solubility. Although helices have been used to design circularizing linkers [[#References|[2]]], one big disadvantage of this strategy is that one can only build straight linkers with them. So in the context of circularization, if the straight line connecting the protein extremities crosses the protein, this strategy is not an option.<br />
<br />
The third option, which served as a basis for our approach and which came from discussions with the group of Rebecca Wade in Heidelberg, Germany, consists in designing custom-tailored linkers for each specific application. These linkers can be obtained from protein structure prediction. First, one needs to define the path that the linker should take to connect the protein ends. Afterwards, one designs a possible linker sequence that might fit well. Next, one makes a structure prediction of the linker attached to the protein to validate the design. Several different linkers, with slight changes, can be compared. This is repeated until the linker effectively follows the expected path. This method requires a strong knowledge of protein folding and protein structure prediction and is computationally intensive. On the other hand, the benefit can be substantial, as the interaction of the linker with the protein surface can be taken into account and the path taken by the linker can be defined accurately, down to the resolution of the protein structure.<br />
<br />
We have set up a completely new strategy to design rigid linkers. As further detailed in the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling] part, it is possible to define the shape of a linker by combining rigid alpha-helical rods with well-defined angle patterns. Therefore, by defining, in a geometrical way, the possible paths of the circularizing linkers for a given protein, we can then propose potential linkers. This definition of the geometrical path can be very difficult, especially for large proteins with complex shapes. Moreover, this definition is further constrained by the fact that linkers must avoid hiding active sites of the protein of interest. Finally, the paths have rotational degrees of freedom at the extremities of the protein, and depending on their orientation, they may or may not match the geometry of the protein. The tool we present here covers the two steps: defining geometrical paths with weights and translating them into feasible linkers, also with weights. This tool is universal, as it can design circularizing linkers for any protein with a known structure. Moreover, it is modular: thanks to our [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling approach], we have designed linkers as exchangeable blocks of rods of different lengths and of angle patterns. The following sections detail the different steps followed by our software to design proper linkers.<br />
<br />
<br />
==PDB parsing==<br />
First, the PDB file containing the structure of the target protein is parsed and the coordinates of the atoms are stored in metric units. After this, some initial tests are made on the protein structure. We check whether the C- and N-termini lie on the surface of the protein and are accessible to the solvent, which is crucial for circularization. For this, we define a line originating from an extremity of the protein using the two angles of the spherical coordinates around the z-axis. From that, we determine the accessible angles by rejecting all the lines that are too close to the protein. As the future linker will be made of alpha helices and will therefore have a radius of 5 &Aring;, we used this length as the minimal allowed distance.<br />
Those allowed angles are stored for the coming linker generation. ###fig needed### <br />
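The accessibility test described above can be sketched as follows. This is only a minimal illustration, not the team's implementation: the function names, the 5 degree grid used here and the <code>ignore_within</code> parameter (which skips atoms directly next to the terminus) are assumptions made for the example.<br />

```python
import math

def parse_pdb_atoms(pdb_text):
    """Collect (x, y, z) in Angstrom from ATOM records (fixed PDB columns 31-54)."""
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            atoms.append((float(line[30:38]), float(line[38:46]), float(line[46:54])))
    return atoms

def direction(theta_deg, phi_deg):
    """Unit vector for polar angle theta (from the z-axis) and azimuth phi."""
    t, p = math.radians(theta_deg), math.radians(phi_deg)
    return (math.sin(t) * math.cos(p), math.sin(t) * math.sin(p), math.cos(t))

def ray_is_free(origin, d, atoms, min_dist=5.0, ignore_within=0.0):
    """True if no atom is closer than min_dist to the half-line origin + s*d, s >= 0.

    ignore_within is a hypothetical parameter: atoms nearer than this to the
    origin (the direct neighbours of the terminus) are skipped.
    """
    for atom in atoms:
        v = tuple(a - o for a, o in zip(atom, origin))
        if math.sqrt(sum(c * c for c in v)) < ignore_within:
            continue
        s = max(0.0, sum(vc * dc for vc, dc in zip(v, d)))  # projection onto the ray
        closest = tuple(o + s * dc for o, dc in zip(origin, d))
        if math.dist(atom, closest) < min_dist:
            return False
    return True

def accessible_angles(origin, atoms, step_deg=5, **kwargs):
    """All (theta, phi) grid directions whose ray keeps the minimal distance."""
    return [(t, p)
            for t in range(0, 181, step_deg)
            for p in range(0, 360, step_deg)
            if ray_is_free(origin, direction(t, p), atoms, **kwargs)]
```

The returned list of angle pairs plays the role of the stored "allowed angles" for the later linker generation.<br />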
<br />
==Generation of geometric paths==<br />
As our strategy consists in building linkers with helical rods and connecting angles, a path is completely defined by the coordinates of the angle points. Advancing one step from an existing point is always done by adding a displacement vector to this point. This vector is defined by the two spherical angles, chosen here in a discrete manner with an increment of 5 degrees, and by a length, also chosen in a discrete manner. This discrete length is used in two different contexts: it may correspond to the length of an alpha helix or to the length of the flexible part that appears at the extremity of the protein. The coordinates reached with this vector define the new coordinates of an angle point, no matter whether the vector corresponds to an alpha helix or to a flexible part. ###figure, points + vectors###. As we screen all possible angles in a discrete manner, the angle point coordinates are regularly distributed on a sphere. As further detailed in the next sections, those spheres are defined from both ends, either once or in several steps. Then the software checks for possible straight connections of given lengths for each pair of angle points originating from the two extremities. ###figure needed, with all these possible steps, one sphere around each point and checking for the connections, 2D, different lengths in two different figures, ###<br />
The linkers are built in a modular way, with blocks of well-defined size. From the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling] of potential linkers, we could derive 8 different alpha helical rods, all with different lengths. In addition, the length of the two segments inside an angle block was always 8 &Aring;, so exchanging angle blocks does not affect the length of the linker. This means that the distance between the angle points is well defined, an essential aspect of our strategy of linker design.<br />
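The construction of candidate angle points on a sphere can be illustrated with a short sketch. The function name and the treatment of the poles (which are simply generated once per azimuth value) are assumptions for the example; only the 5 degree increment comes from the description above.<br />

```python
import math

def sphere_points(center, radius, step_deg=5):
    """Candidate angle points: center + radius * unit vector on a regular angular grid."""
    points = []
    for theta in range(0, 181, step_deg):       # polar angle from the z-axis
        for phi in range(0, 360, step_deg):     # azimuth around the z-axis
            t, p = math.radians(theta), math.radians(phi)
            points.append((center[0] + radius * math.sin(t) * math.cos(p),
                           center[1] + radius * math.sin(t) * math.sin(p),
                           center[2] + radius * math.cos(t)))
    return points
```

Calling <code>sphere_points</code> once per allowed length (helix or flexible part) from an existing point gives all positions that can be reached in one step.<br />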
The software proceeds in three steps. First, it checks for the possibility of a direct single alpha helix linker. For this, it applies the procedure just mentioned with spheres whose radii correspond to the lengths of the short flexible parts at the extremities of the protein. Second, it tests whether a linker containing two alpha helices connected with a right angle allows the circularization. Finally, it searches for the possible linkers with three angle points. The next parts explain those three steps in detail.<br />
This method was chosen because it could be implemented easily and efficiently in our program. However, this strategy generates paths that cross the protein. Therefore we put considerable effort into sorting out the paths.<br />
<br />
===Step 1===<br />
As a simple rigid linker with no angle would be easier to design and likely more thermostable than the ones containing angles, the software first checks if this simple solution is possible ###figure needed###. For this, we took into account the fact that proteins have some flexible amino acids at their extremities. This flexible part may come from the protein itself, but also from the 2 glycines that are included at the N-terminal part and from the extein at the C-terminal part. Those two latter parts come from our linkers. Those parts have no preferential angles and offer a large number of possibilities to insert fitting linkers. But this flexibility is also a drawback, as we have to include this large number of possible angles and lengths in our path search.<br />
In this first step, the software explicitly takes these flexible parts into account to check for the possibility of straight linkers. As the angles and the length of the flexible parts are variable, the software positions their extremity on a sphere centered on the last fixed position of the structure, as explained above. The radius of this sphere is incremented in a discrete manner, in 4 steps, from 5.25 &Aring; to the maximum length of the flexible part. ###figure###<br />
Then all possible straight segments between the points generated from the two extremities are tested. If they are closer than 5 &Aring; to the protein, or if they cross it, then they are rejected. If they are kept, then the software checks whether the length of the segment is compatible with the feasible alpha helices: if the length of a given segment equals one of the 8 alpha helix lengths plus or minus 0.75 &Aring;, then the path is saved.<br />
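The length check at the end of Step 1 amounts to a simple tolerance comparison, sketched below. The 8 helix lengths are not listed in this text, so <code>EXAMPLE_HELIX_LENGTHS</code> contains placeholder values chosen only for illustration; the function name is also an assumption.<br />

```python
# Placeholder values (in Angstrom) for illustration only, NOT the real 8 helix lengths.
EXAMPLE_HELIX_LENGTHS = [7.5, 15.0, 22.5, 30.0, 37.5, 45.0, 52.5, 60.0]

def matching_helices(segment_length, helix_lengths, tol=0.75):
    """Helix lengths that fit a straight segment within plus or minus tol."""
    return [h for h in helix_lengths if abs(h - segment_length) <= tol]
```

A segment is kept as a straight linker candidate only if <code>matching_helices</code> returns at least one length.<br />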
===Step 2===<br />
The next possibility to design more complex rigid linkers, while still taking flexible ends into account with a reasonable calculation time, was to reduce the number of possible angles. As we originally thought that a 90° angle would be practically feasible, the software was designed to generate linkers with flexible ends and one 90° angle. ###figure, protein_one_90° angle### This choice was notably made because the edge lengths of a right triangle are simple to calculate. We already saw in Step 1 that the length of an edge can only take 8 different values. As the linkers have to start from the extremities of the protein, and as we impose a right angle, the number of possible paths is low, making them easy to compute. In practice, the extremities of the protein are positioned in a flexible way as in Step 1. From each of the positions allowed by this flexibility, the software searches for all the allowed right triangles. This restriction was mainly introduced because the degrees of freedom needed to be limited to keep the calculations feasible. <br />
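For a right angle, the two helix lengths and the distance between the two flexible end points are linked by the Pythagorean theorem, which is what makes this case cheap to enumerate. The sketch below lists the compatible length pairs; the function name and the use of a tolerance on the hypotenuse are assumptions, as the text does not state how the match is tested in this step.<br />

```python
import math

def right_angle_pairs(end_distance, helix_lengths, tol=0.75):
    """Pairs (a, b) of helix lengths with sqrt(a^2 + b^2) within tol of end_distance."""
    return [(a, b)
            for a in helix_lengths
            for b in helix_lengths
            if abs(math.hypot(a, b) - end_distance) <= tol]
```

With 8 allowed lengths there are at most 64 pairs to test for each pair of end points, and the corner point for an accepted pair lies on a circle around the axis joining the two end points.<br />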
===Step 3===<br />
Finally the software also provides the possibility to find paths with up to 4 edges, meaning 4 alpha helices and 3 angles. Thanks to the modularity of the possible linkers, such paths can in theory circularize any kind of protein. ###figure of torus###<br />
To keep the calculation feasible in a reasonable time, we designed the search strategy so that the flexible parts at the extremities are oriented in the same direction as the adjacent alpha helix. This obviously restricts the search, but as these orientations are allowed for the flexible parts, the approach remains valid. <br />
First, potential ending points of the first alpha helical rod are calculated from the N-terminal point of the protein. The orientation is chosen in a discrete manner, with an increment of 5 degrees for the two angles of the spherical coordinates. The distance from the origin corresponds to the 8 possible lengths allowed by the alpha helices, as already seen in Step 2, plus a length of 0, which mimics a linker with 3 instead of 4 edges. The exact same procedure is repeated to define all the potential ending points of the second alpha helical rod, starting from all the possible ending points of the first alpha helical rod. Thanks to the possibility of a length of 0 for the first and the second rods, the software also calculates paths with 2 edges. Then, the same is done only once from the C-terminal point of the protein, defining 1 edge. The final step consists in checking whether the points originating from the N- and C-terminal points can be linked by a potential alpha helix, i.e. whether they are separated by the appropriate distance. If any of the potential alpha helix lengths lies within the distance between two points plus or minus 0.75 &Aring;, then the path is saved. In the same way, if two points are closer than 0.75 &Aring; to each other, then the path is also saved.<br />
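The final connection test of Step 3 can be sketched as a pairwise comparison between the points grown from the N-terminal side and those grown from the C-terminal side. The function names and the return format are assumptions for the example; a closing length of 0.0 is used here to mark two points that coincide within the tolerance.<br />

```python
import math

def closing_options(n_point, c_point, helix_lengths, tol=0.75):
    """Helix lengths that can bridge two angle points; 0.0 if the points coincide."""
    d = math.dist(n_point, c_point)
    options = [h for h in helix_lengths if abs(h - d) <= tol]
    if d < tol:
        options.append(0.0)   # the two points are directly closer than tol
    return options

def connect(n_points, c_points, helix_lengths, tol=0.75):
    """All (index on N side, index on C side, closing length) combinations to save."""
    paths = []
    for i, p in enumerate(n_points):
        for j, q in enumerate(c_points):
            for h in closing_options(p, q, helix_lengths, tol):
                paths.append((i, j, h))
    return paths
```

Because every N-side point is compared with every C-side point, this step grows with the product of the two point sets, which is consistent with the large number of paths reported in the next section.<br />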
<br />
==Sorting out of paths==<br />
The previous part described the generation of paths that can connect the two extremities of the protein irrespective of the position of these paths relative to the protein. While this allows a fast computation of the geometrical paths, it also implies that the paths that are not practically feasible need to be sorted out. This is the most time-consuming part of the computation, as about 1 billion paths are generated. Three criteria are considered for the sorting. The first one is the feasibility of the linker: can the software find angle patterns that correspond to the ones defined by the geometrical path? This question was part of the motivation for a large modeling effort (link) to determine the possible angles between consecutive alpha helices. This was achieved by analyzing the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling distribution of angles] between alpha helices found in the ArchDB database. As nearly any angle between 20 and 170 degrees could be found, only a few paths were actually rejected at that step. The next criterion was the position of the angle points: if they lie inside the protein, then the path is rejected. Finally, the software checks whether any of the atoms of the protein is less than 5 &Aring; away from any of the alpha helices; if so, the path is also rejected.<br />
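The last criterion, the minimal distance between a helix and the protein atoms, is a point-to-segment distance test. A minimal sketch is given below, with the helix idealized as the straight segment between two angle points; the function names are assumptions.<br />

```python
import math

def point_segment_distance(p, a, b):
    """Shortest distance from point p to the segment from a to b."""
    ab = tuple(bc - ac for ac, bc in zip(a, b))
    ap = tuple(pc - ac for ac, pc in zip(a, p))
    ab2 = sum(c * c for c in ab)
    # Relative position of the projection of p on the line, clamped to the segment.
    t = 0.0 if ab2 == 0.0 else max(0.0, min(1.0, sum(x * y for x, y in zip(ap, ab)) / ab2))
    closest = tuple(ac + t * c for ac, c in zip(a, ab))
    return math.dist(p, closest)

def helix_is_clear(a, b, atoms, min_dist=5.0):
    """True if no atom is less than min_dist away from the segment a-b."""
    return all(point_segment_distance(atom, a, b) >= min_dist for atom in atoms)
```

A path would be rejected as soon as one of its segments fails <code>helix_is_clear</code>.<br />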
<br />
==Shifting paths to the patterns==<br />
<br />
The strategy described in step 3 gives a certain freedom for the rod that connects the last two angle points that were generated from the N- and C-terminal points. As this freedom is actually not permitted by the alpha helix and the angle pattern, but is permitted by the flexible part, for example at the C-terminal end, the software slightly refines the path by rotating the segment that originates from the C-terminal point.<br />
<br />
==Weighting of paths==<br />
Before translating the paths into sequences and thus into linkers that can be expressed, each path needs to be evaluated for its potential capacity to enhance heat stability. For this we have identified different contributions that should be combined into one value that defines how good the linker may be and that consequently defines its ranking among all possible linkers. The smaller this value, the more we expect that the linker will enhance thermostability. An important step for all these contributions is the normalization, as explained in the next paragraphs.<br />
The first contribution we considered was the linker length. We assumed that a short linker is better to constrain the protein extremities, and that a long helix might give more flexibility. Because we wanted this value to be independent of the size of the protein, the length of the linker is normalized to the distance between the two termini.<br />
The second contribution relates to the angles used in the linker. We learned from the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling] that angles formed by a certain angle pattern follow a certain distribution. First, we assumed that the narrower the distribution, the more likely the alpha helices would actually produce this angle. Second, the angles found by the software should be as close as possible to those well-defined angles. In this case, the weight value from this contribution should be low.<br />
Then the distance of the linker to the protein is taken into account. Because the linkers should not disturb the protein in its normal environment, linkers that pass close to the protein surface are considered better linkers. The distance was defined as the minimal distance between the linker and all the atoms of the protein. As already mentioned for the sorting of the paths, a linker cannot come closer than 5 &Aring; and this distance was used for normalization of calculated distances.<br />
After this, the places a linker should avoid are calculated. Each protein can interact with other molecules on some parts of its surface. The user can specify where and how big those parts are. If a linker passes in front of a potential binding domain, the value of the corresponding path goes to infinity, so that the linker is discarded. Conversely, the farther a linker is from a potential ligand binding domain, the smaller its weighting value. The user can also specify the importance of certain regions. In the end the total weighting is normalized to the number of binding domains.<br />
<br />
===Calibrating the weighting function===<br />
Every contribution has its own distribution. You can see an example in figure ### [[figure histogram_length_lys.png]], but all of them have different shapes. The aim is to find the paths that minimize all of these contributions overall. Therefore, for simplicity, the four contributions mentioned above were combined linearly in the weighting function:<br />
\[ W(p) = \alpha L(p) + \beta A(p) + \gamma D(p) + \delta u(p) \]<br />
where W is the final weighting, p the path, L the length contribution, A the angle contribution, D the distance contribution and u the contribution from the forbidden regions. $\alpha, \beta, \gamma, \delta$ are the weighting constants that needed to be found. The normalization performed for each contribution was chosen so that each of them is dimensionless and all have reasonably similar values.<br />
The weighting constants were obtained from the [https://2014.igem.org/Team:Heidelberg/Project/Linker_Screening linker screening] performed with lysozyme and the [https://2014.igem.org/Team:Heidelberg/Modeling/Enzyme_Modeling modeling of the enzyme activity].<br />
Please see [[below ###link]] for a detailed explanation of how the values were obtained.<br />
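The linear weighting function and the resulting ranking can be written in a few lines. This is a sketch under the assumption that the four normalized contributions are already available for each path; the function names are illustrative, and paths with an infinite weight (forbidden regions) are dropped as described above.<br />

```python
import math

def path_weight(length, angle, distance, forbidden, alpha, beta, gamma, delta):
    """W(p) = alpha*L(p) + beta*A(p) + gamma*D(p) + delta*u(p); smaller is better."""
    return alpha * length + beta * angle + gamma * distance + delta * forbidden

def rank_paths(contributions, alpha, beta, gamma, delta):
    """Indices of the paths sorted from best to worst, without discarded paths.

    contributions: list of (L, A, D, u) tuples, one per path.
    delta is assumed to be positive so that an infinite u gives an infinite W.
    """
    weighted = [(path_weight(*c, alpha, beta, gamma, delta), i)
                for i, c in enumerate(contributions)]
    return [i for w, i in sorted(weighted) if math.isfinite(w)]
```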
<br />
==Translating paths to sequence==<br />
As mentioned above, the software is provided with two databases, one for the possible angle patterns and one for the helix patterns. The choice of the patterns was inspired by known crystal structures extracted from databases and described in different papers.<br />
A large in silico screening to refine the preferences of the patterns was then set up using the [[iGEM@home|###Link]] system. For a complete description of the search for suitable patterns, see the [[modeling|###Link to patternspart]] page. <br />
All the possible paths are now split up at the angles and compared with the possible patterns in the databases. ###Figure needed to explain, how the path is translated### The most suitable patterns are identified and joined together to build the path's sequence. It is important to note that this is only possible because of the modularity of our linker patterns used as building blocks: each block, whether an alpha helix or an angle pattern, is not affected by the others. Thus for each possible path, one sequence is produced.<br />
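Because the blocks are independent, building the sequence of a path reduces to concatenation. The sketch below assumes that the helix and angle patterns have already been selected for a path; the block sequences in the usage note are those of the sgt2 linker listed in Table 1 of the Results, while the function name and the default attachment and extein arguments are illustrative.<br />

```python
def assemble_linker(blocks, attachment="GG", extein="RGTCWE"):
    """Concatenate attachment, alternating helix/angle blocks and extein into one sequence."""
    return attachment + "".join(blocks) + extein
```

For example, <code>assemble_linker(["AEAAAK", "AAAHPEA", "AEAAAK"])</code> combines two helix blocks and one angle block into the sgt2 sequence.<br />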
==Clustering of paths==<br />
Many different paths are represented by the same sequence [[###Figure that shows, different paths have same properties, already before in the text ###]] and we therefore clustered such paths. The weights for those clustered paths were then calculated by averaging the weights of the different paths that compose a cluster.<br />
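The clustering described here is a grouping by sequence followed by an average, as in the following sketch (the function name and the input format are assumptions).<br />

```python
from collections import defaultdict

def cluster_by_sequence(paths):
    """paths: iterable of (sequence, weight) pairs. Returns {sequence: mean weight}."""
    groups = defaultdict(list)
    for sequence, weight in paths:
        groups[sequence].append(weight)
    return {seq: sum(ws) / len(ws) for seq, ws in groups.items()}
```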
<br />
<br />
=Results=<br />
==DNMT1==<br />
A major motivation of our effort to design rigid linkers with angles was the circularization of the DNA methyltransferase Dnmt1 (link). The truncated form used in our project is composed of 900 amino acids and the N- and C-terminal extremities are well separated. To circularize it, two linkers were designed: a flexible one made of glycine and serine, and a rigid one designed by the software. The rigid linkers for DNMT1 were obtained from an early state of the software. At that time the calculation took 11 days on a laptop computer with an Intel i5 processor and 8 GB of RAM, which shows the importance of a distributed computing system for large proteins. Since then, the software has improved considerably, reducing the calculation time to about 1 day for DNMT1.<br />
###DNMT1 could be made heatstable, still missing###<br />
==Feedback from wet lab==<br />
The results from the software for lysozyme were tested as described in the [[linker-screening]] part and evaluated as described in the [[enzyme modeling]] part. We have performed a large linker screening on 10 different lysozymes with different linkers. As the purpose of the lysozyme screen was the calibration of the software, those linkers were designed according to the four contributions previously mentioned. One of them was the shortest possible, one had the best possible angle, and so on.<br />
{| class="table table-hover"<br />
|+ '''Table 1''': Linkers and their amino acid sequences. Green: attachment sequences to prevent the flexible regions from being perturbed; Blue: angle; Purple: extein.<br />
! Linker<br />
! Amino acid sequence<br />
|- <br />
|<br />
|<br />
|-<br />
|'''Very good linkers'''<br />
|<br />
|-<br />
| sgt2<br />
| <span style="color:#3ADF00;">GG</span>AEAAAK<span style="color:#00BFFF;">AAAHPEA</span>AEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| rigid<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAKAA<span style="color:#3ADF00;">P</span><span style="color:#A901DB;">RGKCWE</span><br />
|-<br />
| <br />
| <br />
|-<br />
| '''Average linkers'''<br />
|<br />
|-<br />
| may1<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAKA<span style="color:#00BFFF;">AAAHPEA</span>AEAAAKEAAAKA<span style="color:#00BFFF;">KTA</span>AEAAAKEAAAKA<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| ord1<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAK<span style="color:#00BFFF;">ATGDLA</span>AEAAAKAA<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| ord3<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAK<span style="color:#00BFFF;">ASLPAA</span>AEAAAKEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
|<br />
| <br />
|-<br />
| '''Short linkers'''<br />
| <br />
|-<br />
| sho1<br />
| <span style="color:#3ADF00;">GG</span><span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| sho2<br />
| <span style="color:#3ADF00;">GG</span>AEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
|}<br />
[[circ_lam_lys_nils.png]] comparison with [[nice_linker_lysozyme_flexible_ends.png]]<br />
###About predictions of software###<br />
In the end we obtained a ranking of the in vitro tested linkers from the [[linker-screening]] and chose the parameters $\alpha, \beta, \gamma, \delta$ of the weighting function such that the ranking from the software reproduced the ranking from the assays. The final values were: $\alpha = 1.85 \times 10^{-6}, \beta = 0.57, \gamma = 50.8 \times 10^{6}$.<br />
To this end, we set up a function that is minimized when the order of the weighting output is the same as the order from the wet lab.<br />
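The exact form of this function is not given here. One possible objective with the stated property, shown purely as an illustration, is the number of linker pairs that the software and the assay order differently: it is zero, and therefore minimal, exactly when both rankings agree. The function name and the input format are assumptions.<br />

```python
from itertools import combinations

def rank_disagreement(weights, lab_order):
    """Number of linker pairs ordered differently by the software and the assay.

    weights: {linker name: W}, where a smaller W means a better predicted linker.
    lab_order: linker names sorted from best to worst in the wet lab.
    """
    lab_pos = {name: i for i, name in enumerate(lab_order)}
    return sum(1 for a, b in combinations(lab_order, 2)
               if (weights[a] - weights[b]) * (lab_pos[a] - lab_pos[b]) < 0)
```

Such a count could then be minimized over $\alpha, \beta, \gamma, \delta$, for instance by a simple parameter scan.<br />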
<br />
=Discussion=<br />
Will always be refined with more data from i@h...<br />
=References=<br />
<br />
[1] Thornton, J.M. & Sibanda, B.L. Amino and carboxy-terminal regions in globular proteins. Journal of molecular biology 167, 443-460 (1983).<br />
<br />
[2] Wang, C.K.L., Kaas, Q., Chiche, L. & Craik, D.J. CyBase: A database of cyclic protein sequences and structures, with applications in protein discovery and engineering. Nucleic Acids Research 36, (2008).</div>Igemnilshttp://2014.igem.org/Team:Heidelberg/pages/Linker_SoftwareTeam:Heidelberg/pages/Linker Software2014-10-17T22:54:21Z<p>Igemnils: /* Weighting of paths */</p>
<hr />
<div>=Abstract=<br />
As already [https://2014.igem.org/Team:Heidelberg/Toolbox/Circularization introduced], artificially circularized proteins may gain some heat stability by restraining the C- and N-termini from moving around freely. This circularization may be trivial when the protein termini are very close to each other, which seems to be reasonably common [[#References|[1]]]. However, if the ends are too far from each other, a long linker is needed to connect them. This linker should not change the natural conformation of the protein and should constrain the relative position of the ends to restrict the degrees of freedom and thus stabilize the structure even when heated. In addition, these linkers should not affect any of the protein functions. Consequently it is important to prevent linkers from passing through the active site or from covering binding domains for other molecules, for example. Therefore one needs to be able to define the shape of possible linkers. This section describes the software we developed to design such linkers. We would like to stress that this work has been made possible thanks to the feedback between computer modeling and experimental work: we could first design linkers in silico, test them experimentally and use the results to further calibrate the software. To our knowledge, this is the first time that such an approach has been used to custom-design rigid linkers with angles to connect protein extremities.<br />
<br />
=General procedure=<br />
In short, the software provides a weighted list of linkers to circularize any protein of interest with a known structure. Those linkers are made of rigid alpha-helical segments connected at defined angles. Contrary to flexible linkers, those rigid linkers were expected to constrain the protein extremities and to confer better heat stability. Such an idea was already developed [[#References|[2]]], but only with alpha helices forming simple rods, and without any possibility to introduce angles. To generate those linkers, we first defined the geometrical paths, with segments and angles, that they should follow. The geometrical paths that are biologically feasible are afterwards translated into amino acid sequences. Both the compatibility of paths with possible structures and the translation were made possible by our [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling approaches]. The first approach consisted of a statistical analysis of more than 17000 known non-homologous structures containing alpha helices connected at angles. For the second approach, we modeled the conformation of linkers circularizing proteins of known structure and analyzed them for certain properties. This second approach was run for a large number of proteins thanks to our distributed computing system [https://2014.igem.org/Team:Heidelberg/Software/igemathome igemathome]. The software provides different possible linkers with weights that rank the linkers according to their capacity to maintain protein activity at higher temperatures. These weights were generated thanks to an extensive [https://2014.igem.org/Team:Heidelberg/Project/Linker_Screening linker screening] on the target protein lambda-lysozyme, using the first modeling approach.<br />
<br />
=Background=<br />
Classically, protein linkers were designed in three different ways. The easiest way is to define the length that a linker should cover and then simply use a flexible glycine-serine peptide with the right number of amino acids to match this length. Glycine is used for flexibility, as it has no side chain and does not produce any steric hindrance, while serine is used for solubility, as it has a small polar side chain. This solubility is important, as the linkers should not pass through the hydrophobic core of the protein, but should be dissolved in the surrounding medium. These flexible linkers were normally used for circularization but also for connecting different proteins, when the main goal is that the different parts are connected, but not how they are connected, or when the flexibility of the linker is required for specific applications. <br />
<br />
A second strategy consists in using rigid helical linkers to keep proteins or protein domains at a certain distance from each other. This is especially important for signalling proteins and fluorescent proteins. One major property of alpha helices is that they always fold in a defined way with well-defined angles and lengths. There are also many different helical patterns that differ in stability and solubility. Although they have been used to design circularizing linkers [[#References|[2]]], one big disadvantage of this strategy is that one can only build straight linkers with helices. So in the context of circularization, if the straight line connecting the protein extremities crosses the protein, this strategy is not an option.<br />
<br />
The third option, which served as a basis for our approach and which came from discussions with the group of Rebecca Wade in Heidelberg, Germany, consists in designing custom-tailored linkers for each specific application. These linkers can be obtained from protein structure prediction. First, one needs to define the path that the linker should take to connect the protein ends. Afterwards, one designs a possible linker sequence that might fit well. Next, one makes a structure prediction of the linker attached to the protein to validate the design. Several different linkers, with slight changes, can be compared. This is repeated several times until the linker effectively follows the expected path. This method requires a strong knowledge of protein folding and protein structure prediction and is computationally intensive. On the other hand, the benefit can be important, as the interaction of the linker with the protein surface can be taken into account and as one can accurately define the path taken by the linker down to the resolution of the protein structure.<br />
<br />
We have set up a completely new strategy to design rigid linkers. As further detailed in the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling] part, it is possible to define the shape of a linker by combining rigid alpha helical rods with well-defined angle patterns. Therefore, by defining in a geometrical way the possible paths of the circularizing linkers for a given protein, we can then propose potential linkers. This definition of the geometrical path can be very difficult, especially for large proteins with complex shapes. Moreover, this definition is further constrained by the fact that linkers must avoid hiding active sites of the protein of interest. Finally, the paths have rotational degrees of freedom at the extremities of the protein and, depending on their orientation, they may or may not match the geometry of the protein. The tool we present here covers the two steps: defining geometrical paths with some weights and translating them into feasible linkers, also with weights. This tool is universal, as it has the capacity to design circularizing linkers for any protein with a known structure. Moreover, it is modular as, thanks to our [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling approach], we have designed linkers as exchangeable blocks of rods of different lengths and of angle patterns. The following sections detail the different steps followed by our software to design proper linkers.<br />
<br />
<br />
==PDB parsing==<br />
At first, the PDB file containing the structure of the target protein is parsed and the coordinates of the atoms are stored in metric units. After this, some initial tests are made on the protein structure. First, we check whether the C- and N-termini lie on the surface of the protein and are accessible to the solvent, which is crucial for circularization. A line originating from an extremity of the protein is defined by the two angles of spherical coordinates around the z-axis. From that, we can determine the accessible angles by rejecting all the lines that come too close to the protein. As the future linker will be made of alpha helices and will therefore have a radius of 5 &Aring;, we used this length as the minimal allowed distance.<br />
Those allowed angles are stored for the coming linker generation. ###fig needed### <br />
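The accessibility test described above can be sketched as follows. This is only a minimal illustration, not the team's implementation: the function names and the <code>skip_radius</code> parameter (used to ignore atoms right next to the terminus) are assumptions introduced here, while the column positions follow the fixed-width PDB ATOM record and the 5 &Aring; threshold comes from the text.<br />

```python
import math

def parse_pdb_atoms(pdb_text):
    """Return (x, y, z) tuples in Angstrom for every ATOM record."""
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            # Fixed-width PDB format: x, y, z occupy columns 31-38, 39-46, 47-54.
            atoms.append((float(line[30:38]), float(line[38:46]), float(line[46:54])))
    return atoms

def direction(theta_deg, phi_deg):
    """Unit vector for polar angle theta (from the z-axis) and azimuth phi."""
    t, p = math.radians(theta_deg), math.radians(phi_deg)
    return (math.sin(t) * math.cos(p), math.sin(t) * math.sin(p), math.cos(t))

def ray_is_accessible(origin, unit_dir, atoms, min_dist=5.0, skip_radius=6.0):
    """True if the ray leaving `origin` along `unit_dir` stays at least
    `min_dist` away from every atom. Atoms closer than `skip_radius` to the
    origin are ignored (hypothetical choice: the terminus always has neighbours)."""
    for atom in atoms:
        rel = tuple(a - o for a, o in zip(atom, origin))
        if math.hypot(*rel) < skip_radius:
            continue
        # Project the atom onto the ray; negative projections clamp to the origin.
        along = max(0.0, sum(r * d for r, d in zip(rel, unit_dir)))
        closest = tuple(d * along for d in unit_dir)
        if math.dist(rel, closest) < min_dist:
            return False
    return True
```

Scanning <code>direction(theta, phi)</code> over a 5 degree grid and keeping the directions for which <code>ray_is_accessible</code> returns <code>True</code> would give the list of allowed angles mentioned above.<br />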
<br />
==Generation of geometric paths==<br />
As our strategy consists in building linkers with helical rods and connecting angles, a path is completely defined by the coordinates of the angle points. Advancing one step from an existing point is always done by adding a displacement vector to this point. This vector is defined by the two spherical angles, chosen here in a discrete manner with an increment of 5 degrees, and by a length, also chosen in a discrete manner. This discrete length is used in two different contexts: it may correspond to the length of an alpha helix or to the length of the flexible part that appears at the extremity of the protein. The coordinates reached with this vector define the new coordinates of an angle point, no matter whether the vector corresponds to an alpha helix or to a flexible part. ###figure, points + vectors###. As we screen all possible angles in a discrete manner, the coordinates of those angle points are regularly distributed on a sphere. As further detailed in the next sections, those spheres are defined from both ends, either once or in several steps. Then the software checks for possible straight connections of given lengths for each pair of angle points originating from the two extremities. ###figure needed, with all these possible steps, one sphere around each point and checking for the connections, 2D, different lengths in two different figures, ###<br />
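As an illustration of the displacement step, the following sketch enumerates the candidate angle points reached from one point with a fixed length and a 5 degree grid on both spherical angles. The function name and the handling of the poles are assumptions made for this example.<br />

```python
import math

def sphere_points(center, radius, step_deg=5):
    """Candidate angle points reached from `center` by a displacement of
    length `radius`, scanning both spherical angles in `step_deg` increments."""
    points = []
    for theta_deg in range(0, 181, step_deg):
        # At the poles all azimuths give the same point, so keep only one.
        phis = [0] if theta_deg in (0, 180) else range(0, 360, step_deg)
        t = math.radians(theta_deg)
        for phi_deg in phis:
            p = math.radians(phi_deg)
            points.append((
                center[0] + radius * math.sin(t) * math.cos(p),
                center[1] + radius * math.sin(t) * math.sin(p),
                center[2] + radius * math.cos(t),
            ))
    return points
```

With a 5 degree increment this yields 2522 points per sphere and per length, which gives a feeling for how quickly the number of paths grows when several steps are chained.<br />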
The linkers are built in a modular way, with blocks of well-defined size. From the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling] of potential linkers, we could derive 8 different alpha helical rods, all with different lengths. In addition, the length of the two segments inside an angle block was always 8&Aring;, so exchanging angle blocks does not affect the length of the linker. This means that the distance between the angle points is well defined, an essential aspect of our strategy of linker design.<br />
The software proceeds in three steps. First, it checks whether a direct linker made of a single alpha helix is possible. For this, it applies the procedure just mentioned with spheres whose radii correspond to the lengths of the short flexible parts at the extremities of the protein. Second, it tests whether a linker containing two alpha helices connected at a right angle allows circularization. Finally, it searches for possible linkers with three angle points. The next parts explain those three steps in detail.<br />
This method was chosen because it could be implemented easily and efficiently in our program. However, this strategy also generates paths that cross the protein. Therefore we put considerable effort into sorting out the paths.<br />
<br />
===Step 1===<br />
As a simple rigid linker with no angle would be easier to design and likely more thermostable than the ones containing angles, the software first checks if this simple solution is possible ###figure needed###. For this, we took into account the fact that proteins have some flexible amino acids at their extremities. This flexible part may come from the protein itself, but also from the 2 glycines that are included at the N-terminal part and from the extein at the C-terminal part. Those two latter parts come from our linkers. These parts have no preferential angles and offer a large number of possibilities to insert fitting linkers. But this flexibility is also a drawback, as we have to include this large number of possible angles and lengths in our path search.<br />
In this first step, the software explicitly takes these flexible parts into account to check for the possibility of straight linkers. As the angles and the length of the flexible parts are variable, the software positions their extremities on a sphere centered on the last fixed position of the structure, as explained above. The radius of this sphere is incremented in a discrete manner, in 4 steps, from 5.25 &Aring; to the maximum length of the flexible part. ###figure###<br />
Then all possible straight segments between the points generated from the two extremities are tested. If they are closer than 5 &Aring; to the protein, or if they cross it, they are rejected. If they are kept, the software checks whether the length of the segment is compatible with the feasible alpha helices: if the length of a given segment equals one of the 8 alpha helix lengths plus or minus 0.75 &Aring;, the path is saved.<br />
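The length test at the end of Step 1 can be sketched as below. The 8 rod lengths are not listed in this text, so the values in <code>HELIX_LENGTHS</code> are hypothetical placeholders; only the tolerance of 0.75 &Aring; is taken from the description above.<br />

```python
import math

# Hypothetical placeholder values in Angstrom; the actual 8 rod lengths come
# from the modeling and are not given here.
HELIX_LENGTHS = [9.0, 12.0, 15.0, 18.0, 21.0, 24.0, 27.0, 30.0]
TOLERANCE = 0.75  # Angstrom, as stated in the text

def matching_helix(start, end, lengths=HELIX_LENGTHS, tol=TOLERANCE):
    """Return the rod length that fits the segment start-end, or None."""
    d = math.dist(start, end)
    best = min(lengths, key=lambda length: abs(length - d))
    return best if abs(best - d) <= tol else None
```

A segment is only kept as a straight linker candidate if <code>matching_helix</code> returns a length and the segment also passes the 5 &Aring; distance test.<br />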
===Step 2===<br />
The next possibility to design more complex rigid linkers, while still taking flexible ends into account with a reasonable calculation time, was to reduce the number of possible angles. As we originally thought that a 90° angle would be practically feasible, the software was designed to generate linkers with flexible ends and one 90° angle. ###figure, protein_one_90° angle### This choice was notably made because the edge lengths of a right triangle are simple to calculate. We already saw in Step 1 that the length of an edge can only take 8 different values. As the linkers have to start from the extremities of the protein, and as we impose a right angle, the number of possible paths is low, making them easy to compute. In practice, the extremities of the protein are positioned in a flexible way as in Step 1. From each of the positions allowed by this flexibility, the software searches for all the allowed right triangles. This restriction was mainly introduced because the degrees of freedom needed to be limited to keep the calculations feasible. <br />
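The right-triangle search can be illustrated with the Pythagorean relation: two rods of lengths a and b joined at 90° span a distance of sqrt(a² + b²). The sketch below lists the pairs of rod lengths whose hypotenuse matches the distance between two end positions. Applying the 0.75 &Aring; tolerance to the hypotenuse is an assumption of this example, and the function name is hypothetical.<br />

```python
import math
from itertools import product

def right_angle_pairs(start, end, lengths, tol=0.75):
    """Pairs (a, b) of rod lengths that can join start and end with one
    90 degree corner, i.e. whose hypotenuse matches the end-to-end distance."""
    d = math.dist(start, end)
    return [(a, b) for a, b in product(lengths, repeat=2)
            if abs(math.hypot(a, b) - d) <= tol]
```

For each accepted pair, the corner itself lies on a circle around the axis joining the two end positions, so its position still has to be scanned and checked against the protein.<br />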
===Step 3===<br />
Finally the software also provides the possibility to find paths with up to 4 edges, meaning 4 alpha helices and 3 angles. Thanks to the modularity of the possible linkers, such paths can offer the possibility to circularize theoretically any kind of protein. ###figure of torus###<br />
To keep the calculation feasible in a reasonable time, we designed the search strategy so that the flexible parts at the extremities are oriented in the same direction as the adjacent alpha helix. This obviously restricts the search, but as these orientations are allowed for the flexible parts, the approach remains valid. <br />
First, potential end points of the first alpha helical rod are calculated from the N-terminal point of the protein. The orientation is chosen in a discrete manner, with an increment of 5 degrees for the two angles of the spherical coordinates. The distance from the origin takes the 8 possible lengths allowed by the alpha helices, as already seen in Step 2, plus a length of 0, which mimics a linker with 3 instead of 4 edges. The exact same procedure is repeated to define all the potential end points of the second alpha helical rod, starting from all the possible end points of the first one. Thanks to the possibility of a length of 0 for the first and the second rods, the software also calculates paths with 2 edges. Then, the same is done only once from the C-terminal point of the protein, defining 1 edge. The final step consists in checking whether the points originating from the N- and C-terminal points can be linked by a potential alpha helix, i.e. whether they are separated by an appropriate distance. If any of the potential alpha helix lengths lies within the distance between two points plus or minus 0.75 &Aring;, the path is saved. In the same way, if two points are closer than 0.75 &Aring; to each other, the path is also saved.<br />
<br />
==Sorting out of paths==<br />
The previous part described the generation of paths that can connect the two extremities of the protein irrespective of the position of these paths relative to the protein. While this allows fast computation of the geometrical paths, it also implies that the paths that are not practically feasible need to be sorted out. This is the most time-consuming part of the computation, as about 1 billion paths are generated. Three criteria are considered for the sorting. The first one is the feasibility of the linker: can the software find angle patterns that correspond to the ones defined by the geometrical path? This question was part of the motivation for a large modeling effort (link) to determine the possible angles between consecutive alpha helices. This was achieved by analyzing the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling distribution of angles] between alpha helices found in the ArchDB database. As nearly any angle between 20 and 170 degrees could be found, only few paths were actually rejected at that step. The next criterion is the position of the angle points: if they lie inside the protein, the path is rejected. Finally, the software checks whether any atom of the protein is less than 5 &Aring; away from any of the alpha helices; if so, the path is also rejected.<br />
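The last criterion, the 5 &Aring; clearance between the protein atoms and the helical segments, amounts to a point-to-segment distance test. The sketch below is one straightforward way to write it and is not taken from the team's code; the function names are hypothetical, and atoms directly around the termini would have to be excluded in practice, which is not shown here.<br />

```python
import math

def point_segment_distance(p, a, b):
    """Shortest distance from point p to the segment a-b (3D tuples)."""
    ab = [bi - ai for ai, bi in zip(a, b)]
    ap = [pi - ai for ai, pi in zip(a, p)]
    ab_len2 = sum(c * c for c in ab)
    if ab_len2 == 0:
        return math.dist(p, a)  # degenerate segment of length 0
    # Projection parameter along the segment, clamped to its two ends.
    t = max(0.0, min(1.0, sum(x * y for x, y in zip(ap, ab)) / ab_len2))
    closest = [ai + t * c for ai, c in zip(a, ab)]
    return math.dist(p, closest)

def path_is_clear(angle_points, atoms, min_dist=5.0):
    """True if no atom is closer than min_dist to any segment of the path."""
    segments = list(zip(angle_points, angle_points[1:]))
    return all(point_segment_distance(atom, a, b) >= min_dist
               for a, b in segments for atom in atoms)
```

Because this test has to be evaluated for every segment of every path against all atoms, it is the natural candidate for the most expensive part of the sorting.<br />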
<br />
==Shifting paths to the patterns==<br />
<br />
The strategy described in Step 3 gives a certain freedom to the rod that connects the last two angle points generated from the N- and C-terminal points. As this freedom is actually not permitted by the alpha helix and the angle pattern, but is permitted by the flexible part, for example at the C-terminal end, the software slightly refines the path by rotating the segment that originates from the C-terminal point.<br />
<br />
==Weighting of paths==<br />
Before translating the paths into sequences and thus into linkers that can be expressed, each path needs to be evaluated for its potential capacity to enhance heat stability. For this we have identified different contributions that should be combined into one value that defines how good the linker may be and that consequently defines its ranking among all possible linkers. The smaller this value, the more we expect that the linker will enhance thermostability. An important step for all these contributions is the normalization, as explained in the next paragraphs.<br />
The first contribution we considered was the linker length. We assumed that a short linker is better to constrain the protein extremities, and that a long helix might give more flexibility. Because we wanted this value to be independent of the size of the protein, the length of the linker is normalized to the distance between the two termini.<br />
The second contribution relates to the angles used in the linker. We learned from the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling] that angles formed by a certain angle pattern follow a certain distribution. First, we assumed that the narrower the distribution, the more likely the alpha helices would actually produce this angle. Second, the angles found by the software should be as close as possible to those well-defined angles. In this case, the weight value from this contribution should be low.<br />
Then the distance of the linker to the protein is taken into account. Because the linkers should not disturb the protein in its normal environment, linkers that pass close to the protein surface are considered better linkers. The distance was defined as the minimal distance between the linker and all the atoms of the protein. As already mentioned for the sorting of the paths, a linker cannot come closer than 5 &Aring; and this distance was used for normalization of calculated distances.<br />
After this, the places a linker should avoid are calculated. Each protein can interact with other molecules on some parts of its surface. The user can specify where and how big those parts are. If a linker passes in front of a potential molecule binding domain, the value of the corresponding path goes to infinity, so that the linker is discarded. Conversely, the farther a linker is from a potential ligand binding domain, the smaller its weighting value. The user can also specify the importance of certain regions. In the end, the total weighting is normalized to the number of binding domains.<br />
<br />
===Calibrating the weighting function===<br />
Every contribution has its own distribution. You can see an example in figure ### [[figure histogram_length_lys.png]], but all of them have different shapes. The aim is to find the paths that minimize all of these contributions. Therefore, in the weighting function the four contributions mentioned above are combined in a linear manner:<br />
\[ W(p) = \alpha L(p) + \beta A(p) + \gamma D(p) + \delta u(p) \]<br />
where W is the final weighting, p the path, L the length contribution, A the angle contribution, D the distance contribution and u the contribution from the forbidden regions. $\alpha, \beta, \gamma, \delta$ are the weighting constants that needed to be found. The normalization performed for each of the contributions was chosen so that each of them is dimensionless and all have reasonably similar values.<br />
The weighting constants were obtained from the [[linker-screening ###lin]] performed with lysozyme and the [[enzyme-modeling ###link]]<br />
Please see [[below ###link]] for the detailed explanation, how the values were obtained.<br />
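The linear weighting function W(p) can be written down directly. In the sketch below, the class and function names are assumptions made for illustration, and the four contributions are expected to be already normalized as described above. A path crossing a forbidden region carries an infinite u(p) and is ranked last whatever the constants are.<br />

```python
import math
from dataclasses import dataclass

@dataclass
class Contributions:
    length: float     # L(p), linker length normalized to the termini distance
    angle: float      # A(p), angle contribution
    distance: float   # D(p), distance contribution
    forbidden: float  # u(p), float("inf") if a forbidden region is crossed

def weight(c, alpha, beta, gamma, delta):
    """W(p) = alpha*L(p) + beta*A(p) + gamma*D(p) + delta*u(p)."""
    if math.isinf(c.forbidden):
        return math.inf  # avoids 0 * inf = nan when delta is 0
    return (alpha * c.length + beta * c.angle
            + gamma * c.distance + delta * c.forbidden)

def rank_paths(paths, alpha, beta, gamma, delta):
    """paths: dict name -> Contributions. Returns names, lowest weight first."""
    return sorted(paths, key=lambda n: weight(paths[n], alpha, beta, gamma, delta))
```

The smaller the returned value, the better the linker is expected to be, so the first entry of <code>rank_paths</code> is the preferred candidate.<br />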
<br />
==Translating paths to sequence==<br />
As already mentioned before the software is provided with two databases, one for the possible angle patterns and one for the helix patterns. The choice of the patterns was inspired by known crystal structures extracted from databases and described in different papers.<br />
A huge in silico screening for refining the preferences of the patterns was then set up using the [[iGEM@home|###Link]] system. For the complete description of search for suitable patterns, one can read the [[modeling|###Link to patternspart]] page. <br />
All the possible paths are now split up at the angles and compared with the possible patterns in the databases. ###Figure needed to explain, how the path is translated### The most suitable patterns are identified and joined together to build the path's sequence. It is important to notice that this is only possible because of the modularity of our linker patterns used as building blocks: each block, whether an alpha helix or an angle pattern, is not affected by the others. Thus for each possible path, one sequence is produced.<br />
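The modular assembly can be illustrated with the block sequences that appear in the linker table of the Results section. In this sketch the function name and the argument layout are assumptions; the attachment sequence GG, the helix block AEAAAK, the angle block AAAHPEA and the extein RGTCWE are taken from the sgt2 linker of that table.<br />

```python
def assemble_linker(blocks, attachment="GG", extein="RGTCWE"):
    """Concatenate the N-terminal attachment, the alternating helix and
    angle blocks, and the C-terminal extein into one linker sequence."""
    return attachment + "".join(blocks) + extein

# Helix, angle, helix: the block order of the sgt2 linker.
sgt2 = assemble_linker(["AEAAAK", "AAAHPEA", "AEAAAK"])
```

Because each block is assumed to be independent of its neighbours, translating a path reduces to a lookup of the best helix and angle pattern for each segment and corner, followed by this concatenation.<br />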
==Clustering of paths==<br />
Many different paths are represented by the same sequence [[###Figure that shows, different paths have same properties, already before in the text ###]] and we therefore clustered such paths. The weights for those clustered paths were then calculated by averaging the weights of the different paths that compose a cluster.<br />
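A minimal sketch of this clustering step, assuming each path is available as a (sequence, weight) pair; the function name is hypothetical.<br />

```python
from collections import defaultdict
from statistics import mean

def cluster_by_sequence(paths):
    """paths: iterable of (sequence, weight) pairs.
    Returns {sequence: mean weight of all paths sharing that sequence}."""
    groups = defaultdict(list)
    for sequence, w in paths:
        groups[sequence].append(w)
    return {sequence: mean(ws) for sequence, ws in groups.items()}
```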
<br />
<br />
=Results=<br />
==DNMT1==<br />
A major motivation of our effort to design rigid linkers with angles was the circularization of the DNA methyltransferase Dnmt1 (link). The truncated form used in our project is composed of 900 amino acids and the N- and C-terminal extremities are well separated. To circularize it, two linkers were designed: a flexible one made of glycine and serine, and a rigid one designed by the software. The rigid linkers for DNMT1 were obtained from an early state of the software. At that time the calculation took 11 days on a laptop computer with an Intel i5 processor and 8GB of RAM, which shows the importance of a distributed computing system for large proteins. Since then, the software has improved a lot, reducing the calculation time to about 1 day for DNMT1.<br />
###DNMT1 could be made heatstable, still missing###<br />
==Feedback from wet lab==<br />
The results from the software for lysozyme were tested as described in the [[linker-screening]] part and evaluated as described in the [[enzyme modeling]] part. We have performed a large linker screening on 10 different lysozymes with different linkers. As the purpose of the lysozyme screen was the calibration of the software, those linkers were designed according to the four contributions previously mentioned. One of them was the shortest possible, one had the best possible angle, and so on.<br />
{| class="table table-hover"<br />
|+ '''Table 1''': Linkers and their amino acid sequences. Green: attachment sequences to prevent the flexible regions from being perturbed; Blue: angle; Purple: extein.<br />
! Linker<br />
! Amino acid sequence<br />
|- <br />
|<br />
|<br />
|-<br />
|'''Very good linkers'''<br />
|<br />
|-<br />
| sgt2<br />
| <span style="color:#3ADF00;">GG</span>AEAAAK<span style="color:#00BFFF;">AAAHPEA</span>AEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| rigid<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAKAA<span style="color:#3ADF00;">P</span><span style="color:#A901DB;">RGKCWE</span><br />
|-<br />
| <br />
| <br />
|-<br />
| '''Average linkers'''<br />
|<br />
|-<br />
| may1<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAKA<span style="color:#00BFFF;">AAAHPEA</span>AEAAAKEAAAKA<span style="color:#00BFFF;">KTA</span>AEAAAKEAAAKA<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| ord1<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAK<span style="color:#00BFFF;">ATGDLA</span>AEAAAKAA<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| ord3<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAK<span style="color:#00BFFF;">ASLPAA</span>AEAAAKEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
|<br />
| <br />
|-<br />
| '''Short linkers'''<br />
| <br />
|-<br />
| sho1<br />
| <span style="color:#3ADF00;">GG</span><span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| sho2<br />
| <span style="color:#3ADF00;">GG</span>AEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
|}<br />
[[circ_lam_lys_nils.png]] comparison with [[nice_linker_lysozyme_flexible_ends.png]]<br />
###About predictions of software###<br />
In the end we obtained a ranking of the linkers tested in vitro in the [[linker-screening]] and chose the parameters $\alpha, \beta, \gamma, \delta$ of the weighting function so that the ranking from the software reproduced the ranking from the assays. To do so, we set up a function that is minimized when the order of the weighting output is the same as the order from the wet lab. The final values were: $\alpha = 1.85 \times 10^{-6}, \beta = 0.57, \gamma = 50.8 \times 10^{6}$.<br />
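The exact form of the minimized function is not given here. One possible choice, shown purely as an assumption, is to count the pairs of linkers that the software ranks in the opposite order to the wet lab; this count is 0 exactly when both rankings agree. The function and argument names are hypothetical.<br />

```python
from itertools import combinations

def rank_disagreement(software_weights, wetlab_order):
    """Number of linker pairs ordered differently by software and wet lab.
    software_weights: dict linker -> W (smaller is better).
    wetlab_order: list of linker names, best first."""
    return sum(
        1
        # combinations keeps list order, so `better` precedes `worse` in the assay.
        for better, worse in combinations(wetlab_order, 2)
        if software_weights[better] >= software_weights[worse]
    )
```

Evaluating this count for many candidate values of the weighting constants and keeping the set with the lowest count would be one simple way to calibrate them against the screening data.<br />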
<br />
=Discussion=<br />
Will always be refined with more data from i@h...<br />
=References=<br />
<br />
[1] Thornton, J.M. & Sibanda, B.L. Amino and carboxy-terminal regions in globular proteins. Journal of molecular biology 167, 443-460 (1983).<br />
<br />
[2] Wang, C.K.L., Kaas, Q., Chiche, L. & Craik, D.J. CyBase: A database of cyclic protein sequences and structures, with applications in protein discovery and engineering. Nucleic Acids Research 36, (2008).</div>Igemnilshttp://2014.igem.org/Team:Heidelberg/pages/Linker_SoftwareTeam:Heidelberg/pages/Linker Software2014-10-17T22:49:46Z<p>Igemnils: </p>
<hr />
<div>=Abstract=<br />
As already [https://2014.igem.org/Team:Heidelberg/Toolbox/Circularization introduced], artificially circularized proteins may gain heat stability because the C- and N-termini are restrained from moving around freely. Circularization may be trivial when the protein termini are very close to each other, which seems to be reasonably common [[#References|[1]]]. However, if the ends are too far from each other, a long linker is needed to connect them. This linker should not change the natural conformation of the protein and should constrain the relative position of the ends, restricting their degrees of freedom and thus stabilizing the structure even when heated. In addition, these linkers should not affect any of the protein's functions. Consequently, it is important to prevent linkers from passing through the active site or from covering domains that bind other molecules, for example. Therefore one needs to be able to define the shape of possible linkers. This section describes the software we developed to design such linkers. We would like to stress that this work was made possible by the feedback between computer modeling and experimental work: we could first design linkers in silico, test them experimentally and use the results to further calibrate the software. To our knowledge, this is the first time that such an approach has been used to custom-design rigid linkers with angles to connect protein extremities.<br />
<br />
=General procedure=<br />
In short, the software provides a weighted list of linkers to circularize any protein of interest with a known structure. Those linkers are made of rigid alpha-helical segments connected at defined angles. Contrary to flexible linkers, those rigid linkers were expected to constrain the protein extremities and to confer better heat stability. Such an idea was already developed [[#References|[2]]], but only with alpha helices forming simple rods, and without any possibility to introduce angles. To generate those linkers, we first defined the geometrical paths, with segments and angles, that they should follow. The geometrical paths that are biologically feasible are afterwards translated into amino acid sequences. Both the compatibility of paths with possible structures and the translation were made possible by our [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling approaches]. The first approach consisted of a statistical analysis of more than 17000 known non-homologous structures containing alpha helices connected at angles. For the second approach, we modeled the conformation of linkers circularizing proteins of known structure and analyzed them for certain properties. This second approach was run for a large number of proteins thanks to our distributed computing system [https://2014.igem.org/Team:Heidelberg/Software/igemathome igemathome]. The software provides different possible linkers with weights that rank the linkers according to their capacity to maintain protein activity at higher temperatures. These weights were generated thanks to an extensive [https://2014.igem.org/Team:Heidelberg/Project/Linker_Screening linker screening] on the target protein lambda-lysozyme, using the first modeling approach.<br />
<br />
=Background=<br />
Classically, protein linkers were designed in three different ways. The easiest way is to define the length that a linker should cover and then simply use a flexible glycine-serine peptide with the right number of amino acids to match this length. Glycine is used for flexibility, as it has no side chain and does not produce any steric hindrance, while serine is used for solubility, as it has a small polar side chain. This solubility is important, as the linkers should not pass through the hydrophobic core of the protein, but should be dissolved in the surrounding medium. These flexible linkers were normally used for circularization but also for connecting different proteins, when the main goal is that the different parts are connected, but not how they are connected, or when the flexibility of the linker is required for specific applications. <br />
<br />
A second strategy consists in using rigid helical linkers to keep proteins or protein domains at a certain distance from each other. This is especially important for signalling proteins and fluorescent proteins. One major property of alpha helices is that they always fold in a defined way with well-defined angles and lengths. There are also many different helical patterns that differ in stability and solubility. Although they have been used to design circularizing linkers [[#References|[2]]], one big disadvantage of this strategy is that one can only build straight linkers with helices. So in the context of circularization, if the straight line connecting the protein extremities crosses the protein, this strategy is not an option.<br />
<br />
The third option, which served as a basis for our approach and which came from discussions with the group of Rebecca Wade in Heidelberg, Germany, consists in designing custom-tailored linkers for each specific application. These linkers can be obtained from protein structure prediction. First, one needs to define the path that the linker should take to connect the protein ends. Afterwards, one designs a possible linker sequence that might fit well. Next, one makes a structure prediction of the linker attached to the protein to validate the design. Several different linkers, with slight changes, can be compared. This is repeated several times until the linker effectively follows the expected path. This method requires a strong knowledge of protein folding and protein structure prediction and is computationally intensive. On the other hand, the benefit can be important, as the interaction of the linker with the protein surface can be taken into account and as one can accurately define the path taken by the linker down to the resolution of the protein structure.<br />
<br />
We have set up a completely new strategy to design rigid linkers. As further detailed in the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling] part, it is possible to define the shape of a linker by combining rigid alpha helical rods with well-defined angle patterns. Therefore, by defining in a geometrical way the possible paths of the circularizing linkers for a given protein, we can then propose potential linkers. This definition of the geometrical path can be very difficult, especially for large proteins with complex shapes. Moreover, this definition is further constrained by the fact that linkers must avoid hiding active sites of the protein of interest. Finally, the paths have rotational degrees of freedom at the extremities of the protein and, depending on their orientation, they may or may not match the geometry of the protein. The tool we present here covers the two steps: defining geometrical paths with some weights and translating them into feasible linkers, also with weights. This tool is universal, as it has the capacity to design circularizing linkers for any protein with a known structure. Moreover, it is modular as, thanks to our [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling approach], we have designed linkers as exchangeable blocks of rods of different lengths and of angle patterns. The following sections detail the different steps followed by our software to design proper linkers.<br />
<br />
<br />
==PDB parsing==<br />
First, the PDB file containing the structure of the target protein is parsed and the coordinates of the atoms are stored in metric units. After this, some initial tests are made on the protein structure. First, we checked whether the C- and N-termini lie on the surface of the protein and are accessible to the solvent, which is crucial for circularization. We defined a line originating from an extremity of the protein, with its direction given by the two angles of the spherical coordinates around the z-axis. From that, we could determine the accessible angles by rejecting all the lines that are too close to the protein. As the future linker will be made of alpha helices and will therefore have a radius of 5 &Aring;, we used this length as the minimal allowed distance.<br />
Those allowed angles are stored for the coming linker generation. ###fig needed### <br />
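The parsing and the accessibility test can be illustrated with a minimal Python sketch. It is not the team's implementation: the function names, the coarse 30 degree grid and the choice to ignore atoms within 5 &Aring; of the terminus are assumptions made for illustration. Only the fixed-width ATOM columns of the PDB format and the 5 &Aring; clearance come from the format specification and the text above.<br />

```python
import math

def parse_pdb_atoms(pdb_text):
    """Return a list of (x, y, z) tuples, in angstroms, for every ATOM record."""
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            # Fixed-width PDB columns (1-based): x = 31-38, y = 39-46, z = 47-54.
            atoms.append((float(line[30:38]), float(line[38:46]), float(line[46:54])))
    return atoms

def direction(theta_deg, phi_deg):
    """Unit vector for polar angle theta (measured from the z-axis) and azimuth phi."""
    t, p = math.radians(theta_deg), math.radians(phi_deg)
    return (math.sin(t) * math.cos(p), math.sin(t) * math.sin(p), math.cos(t))

def ray_is_free(origin, d, atoms, min_dist=5.0, skip_radius=5.0):
    """True if no atom lies closer than min_dist to the ray origin + s*d, s >= 0.

    Atoms within skip_radius of the origin are ignored (an assumption of this
    sketch), because the terminus is always close to its own neighbours.
    """
    for a in atoms:
        if math.dist(a, origin) <= skip_radius:
            continue
        v = tuple(a[i] - origin[i] for i in range(3))
        s = max(0.0, sum(v[i] * d[i] for i in range(3)))  # projection onto the ray
        closest = tuple(origin[i] + s * d[i] for i in range(3))
        if math.dist(a, closest) < min_dist:
            return False
    return True

def accessible_angles(origin, atoms, step=30):
    """(theta, phi) pairs on a regular grid whose ray stays clear of the protein."""
    return [(t, p) for t in range(0, 181, step) for p in range(0, 360, step)
            if ray_is_free(origin, direction(t, p), atoms)]
```

A finer grid, such as the 5 degree increment used for the path generation, only requires changing `step`.<br />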
<br />
==Generation of geometric paths==<br />
As our strategy consists in building linkers with helical rods and connecting angles, a path is completely defined by the coordinates of the angle points. Advancing one step from an existing point is always done by adding a displacement vector to this point. This vector is defined by the two spherical angles, chosen here in a discrete manner with an increment of 5 degrees, and by a length, also chosen in a discrete manner. This discrete length is used in two different contexts: it may correspond to the length of an alpha helix or to the length of the flexible part that appears at the extremity of the protein. The coordinates reached with this vector define a new angle point, no matter whether the vector corresponds to an alpha helix or to a flexible part. ###figure, points + vectors###. As we screen all possible angles in a discrete manner, the coordinates of those angle points are regularly distributed on a sphere. As further detailed in the next sections, those spheres are defined from both ends, either once or in several steps. Then the software checks for possible straight connections of given lengths for each pair of angle points originating from the two extremities. ###figure needed, with all these possible steps, one sphere around each point and checking for the connections, 2D, different lengths in two different figures, ###<br />
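As an illustration of the displacement step, the following Python sketch enumerates the candidate angle points around one existing point. The function name and signature are hypothetical; only the 5 degree increment and the use of two spherical angles plus a length come from the description above.<br />

```python
import math

def sphere_points(center, radius, step_deg=5):
    """Candidate angle points at distance `radius` from `center`.

    The polar angle theta runs over 0..180 degrees and the azimuth phi over
    0..355 degrees, both in increments of step_deg.
    """
    cx, cy, cz = center
    points = []
    for theta in range(0, 181, step_deg):
        for phi in range(0, 360, step_deg):
            t, p = math.radians(theta), math.radians(phi)
            points.append((cx + radius * math.sin(t) * math.cos(p),
                           cy + radius * math.sin(t) * math.sin(p),
                           cz + radius * math.cos(t)))
    return points
```

With a 5 degree increment this gives 37 x 72 = 2664 points per sphere and per length. The grid is regular in the angles, so the two poles are generated once for every azimuth; a real implementation might remove those duplicates.<br />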
The linkers are built in a modular way, with blocks of well-defined size. From the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling] of potential linkers, we could derive 8 different alpha helical rods, all with different lengths. In addition, the length of the two segments inside an angle block was always 8 &Aring;, so exchanging angle blocks does not affect the length of the linker. This means that the distance between the angle points is well defined, an essential aspect of our strategy of linker design.<br />
The software proceeds in three steps. First, it checks whether a direct linker made of a single alpha helix is possible. For this, it applies the procedure just mentioned with spheres whose radius corresponds to the length of the short flexible parts at the extremities of the protein. Second, it tests whether a linker containing two alpha helices connected with a right angle allows the circularization. Finally, it searches for possible linkers with three angle points. The next parts explain those three steps in detail.<br />
This method was chosen because it could be implemented easily and efficiently in our program. However, this strategy generates paths that cross the protein. Therefore we put considerable effort into sorting out the paths.<br />
<br />
===Step 1===<br />
As a simple rigid linker with no angle would be easier to design and likely more thermostable than the ones containing angles, the software first checks if this simple solution is possible ###figure needed###. For this, we took into account the fact that proteins have some flexible amino acids at their extremities. This flexible part may come from the protein itself, but also from the 2 glycines that are included at the N-terminal part and from the extein at the C-terminal part. Those two latter parts come from our linkers. These flexible parts have no preferred angles and offer a large number of possibilities to insert fitting linkers. But this flexibility is also a drawback, as we have to include this large number of possible angles and lengths in our path search.<br />
In this first step, the software explicitly takes these flexible parts into account to check for the possibility of straight linkers. As the angles and the length of the flexible parts are variable, the software positions their extremity on a sphere centered on the last fixed position of the structure, as explained above. The radius of this sphere is incremented in a discrete manner, in 4 steps, from 5.25 &Aring; to the maximum length of the flexible part. ###figure###<br />
Then all possible straight segments between the points generated from the two extremities are tested. If they are closer than 5 &Aring; to the protein, or if they cross it, they are rejected. If they are kept, the software checks whether the length of the segment is compatible with the feasible alpha helices: if the length of a given segment equals one of the 8 alpha helix lengths plus or minus 0.75 &Aring;, then the path is saved.<br />
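The length test of Step 1 can be sketched in a few lines of Python. The eight rod lengths are not listed in this section, so the values in `HELIX_LENGTHS` are placeholders chosen only for illustration; the 0.75 &Aring; tolerance is the one stated above, and the function name is hypothetical.<br />

```python
import math

# Placeholder rod lengths in angstroms (assumption): the real eight values
# come from the modeling page and are not given in this section.
HELIX_LENGTHS = [10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0, 45.0]
TOLERANCE = 0.75  # angstroms, as stated in the text

def matching_helices(p, q, lengths=HELIX_LENGTHS, tol=TOLERANCE):
    """Rod lengths that can bridge the two points p and q within the tolerance."""
    d = math.dist(p, q)
    return [length for length in lengths if abs(d - length) <= tol]
```

A segment is kept as a straight-linker candidate only if this list is not empty and the clearance test against the protein also passes.<br />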
===Step 2===<br />
The next possibility to design more complex rigid linkers, while still taking flexible ends into account with a reasonable calculation time, was to reduce the number of possible angles. As we originally thought that a 90° angle would be practically feasible, the software was designed to generate linkers with flexible ends and one 90° angle. ###figure, protein_one_90° angle### This choice was notably made because the edge lengths of a right triangle are simple to calculate. We already saw in Step 1 that the length of an edge can only take 8 different values. As the linkers have to start from the extremities of the protein, and as we impose a right angle, the number of possible paths is low, making them easy to compute. In practice, the extremities of the protein are positioned in a flexible way as in Step 1. From each of the positions allowed by this flexibility, the software searches for all the allowed right triangles. This restriction was mainly introduced because the degrees of freedom needed to be limited to keep the calculations feasible. <br />
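The right-triangle search reduces to the Pythagorean theorem: two rods of lengths a and b that meet at 90° span end points separated by sqrt(a^2 + b^2). The Python sketch below lists the rod pairs compatible with a given end-to-end distance. Reusing the 0.75 &Aring; tolerance of Step 1 here is an assumption, as are the function name and the rod lengths used in the example.<br />

```python
import math
from itertools import product

def right_angle_pairs(d, lengths, tol=0.75):
    """Pairs (a, b) of rod lengths whose right-angle combination spans distance d."""
    return [(a, b) for a, b in product(lengths, repeat=2)
            if abs(math.hypot(a, b) - d) <= tol]

# Example with placeholder rod lengths (not the real ones):
rods = [10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0, 45.0]
pairs = right_angle_pairs(25.0, rods)
```

For each compatible pair, the corner point lies on a circle around the line joining the two end points, so the remaining freedom is a single rotation angle.<br />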
===Step 3===<br />
Finally, the software also provides the possibility to find paths with up to 4 edges, meaning 4 alpha helices and 3 angles. Thanks to the modularity of the possible linkers, such paths can, in theory, circularize any kind of protein. ###figure of torus###<br />
To keep the calculation feasible in a reasonable time, we designed the search strategy so that the flexible parts at the extremities are oriented in the same direction as the adjacent alpha helix. This obviously restricts the search, but as these orientations are allowed for the flexible part, the approach remains valid. <br />
First, potential end points of the first alpha helical rod are calculated from the N-terminal point of the protein. The orientation is chosen in a discrete manner, with an increment of 5 degrees for the two angles of the spherical coordinates. The distance from the origin corresponds to the 8 possible lengths allowed by the alpha helices, as already seen in Step 2, plus a length of 0, which mimics a linker with 3 instead of 4 edges. The exact same procedure is repeated to define all the potential end points of the second alpha helical rod, starting from all the possible end points of the first alpha helical rod. Thanks to the possibility of a length of 0 for the first and the second rods, the software also calculates paths with 2 edges. Then, the same is done only once from the C-terminal point of the protein, defining 1 edge. The final step consists in checking whether the points originating from the N- and C-terminal points can be linked by a potential alpha helix, i.e. whether they are separated by the appropriate distance. If any of the potential alpha helix lengths lies within the distance between two points plus or minus 0.75 &Aring;, then the path is saved. In the same way, if two points are closer than 0.75 &Aring; to each other, then the path is also saved.<br />
<br />
==Sorting out of paths==<br />
The previous part described the generation of paths that can connect the two extremities of the protein irrespective of the position of these paths relative to the protein. While this allows fast computation of the geometrical paths, it also implies that the paths that are not practically feasible need to be sorted out. This is the most time-consuming part of the computation, as about 1 billion paths are generated. Three criteria are considered for the sorting. The first one is the feasibility of the linker: can the software find angle patterns that correspond to the ones defined by the geometrical path? This question was part of the motivation for a large modeling effort (link) to determine the possible angles between consecutive alpha helices. This was achieved by analyzing the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling distribution of angles] between alpha helices found in the ArchDB database. As nearly any angle between 20 and 170 degrees could be found, only a few paths were actually rejected at that step. The next criterion was the position of the angle points: if any of them lies inside the protein, the path is rejected. Finally, the software checks whether any atom of the protein is less than 5 &Aring; away from any of the alpha helices; if so, the path is also rejected.<br />
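The third criterion is a point-to-segment distance test. A minimal Python sketch follows, with hypothetical function names. It assumes that only the rigid rod segments are passed in, since the flexible ends necessarily start at the protein surface and would always fail a 5 &Aring; clearance test.<br />

```python
import math

def point_segment_distance(p, a, b):
    """Shortest distance from point p to the segment from a to b."""
    ab = tuple(b[i] - a[i] for i in range(3))
    ap = tuple(p[i] - a[i] for i in range(3))
    ab2 = sum(c * c for c in ab)
    if ab2 == 0:
        t = 0.0  # degenerate segment: a and b coincide
    else:
        # Clamp the projection so the closest point stays on the segment.
        t = max(0.0, min(1.0, sum(ap[i] * ab[i] for i in range(3)) / ab2))
    closest = tuple(a[i] + t * ab[i] for i in range(3))
    return math.dist(p, closest)

def path_is_clear(angle_points, atoms, min_dist=5.0):
    """False if any atom is closer than min_dist to any rod of the path."""
    for a, b in zip(angle_points, angle_points[1:]):
        for atom in atoms:
            if point_segment_distance(atom, a, b) < min_dist:
                return False
    return True
```

This brute-force loop costs one distance evaluation per atom and per rod. With about 1 billion candidate paths, a spatial grid over the atoms would probably be needed in practice, but that optimisation is outside the scope of this sketch.<br />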
<br />
==Shifting paths to the patterns==<br />
<br />
The strategy described in Step 3 gives a certain freedom to the rod that connects the last two angle points generated from the N- and C-terminal points. As this freedom is actually not permitted by the alpha helix and the angle pattern, but is permitted by the flexible part, for example at the C-terminal end, the software slightly refines the path by rotating the segment that originates from the C-terminal point.<br />
<br />
==Weighting of paths==<br />
Before translating the paths into sequences and thus into linkers that can be expressed, each path needs to be evaluated for its potential capacity to enhance heat stability. For this we have identified different contributions that should be combined into one value that defines how good the linker may be and that consequently defines its ranking among all possible linkers. The smaller this value, the more we expect that the linker will enhance thermostability. An important step for all these contributions is the normalization, as explained in the next paragraphs.<br />
The first contribution we considered was the linker length. We assumed that a short linker is better to constrain the protein extremities, and that a long helix might give more flexibility. Because we wanted this value to be independent of the size of the protein, the length of the linker is normalized to the distance between the two termini.<br />
The second contribution relates to the angles used in the linker. We learned from the [[modeling ###link###]] that angles formed by a certain angle pattern follow a certain distribution. First, we assumed that the narrower the distribution, the more likely the alpha helices would actually produce this angle. Second, the angles found by the software should be as close as possible to those well-defined angles. In this case, the weight value from this contribution should be low.<br />
Then the distance of the linker to the protein is taken into account. Because the linkers should not disturb the protein in its normal environment, linkers that pass close to the protein surface are considered better linkers. The distance was defined as the minimal distance between the linker and all the atoms of the protein. As already mentioned for the sorting of the paths, a linker cannot come closer than 5 &Aring; and this distance was used for normalization of calculated distances.<br />
After this, the places a linker should avoid are taken into account. Each protein can interact with other molecules on some parts of its surface. The user can specify where and how big those parts are. If a linker passes in front of a potential binding domain, the value of the corresponding path goes to infinity, so that the linker is discarded. Conversely, the farther a linker is from a potential ligand binding domain, the smaller its weighting value. The user can also specify the importance of certain regions. In the end, the total weighting is normalized to the number of binding domains.<br />
<br />
===Calibrating the weighting function===<br />
Every contribution has its own distribution. You can see an example in figure ### [[figure histogram_length_lys.png]], but all of them have different shapes. The aim is to find the paths that minimize all of these contributions. Therefore, in the weighting function the four contributions mentioned above are combined in a linear manner:<br />
\[ W(p) = \alpha L(p) + \beta A(p) + \gamma D(p) + \delta u(p) \]<br />
where W is the final weighting, p the path, L the length contribution, A the angle contribution, D the distance contribution and u the contribution from the forbidden regions. $\alpha, \beta, \gamma, \delta$ are the weighting constants that needed to be found. The normalization of each contribution was chosen so that each of them is dimensionless and all have reasonably similar values.<br />
The weighting constants were obtained from the [[linker-screening ###lin]] performed with lysozyme and the [[enzyme-modeling ###link]].<br />
Please see [[below ###link]] for a detailed explanation of how the values were obtained.<br />
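The linear combination W(p) can be written as a short Python sketch. The class and function names are hypothetical, and the coefficients passed in the usage are arbitrary. The sketch treats an infinite forbidden-region contribution explicitly, so that a path passing in front of a binding domain is discarded whatever the value of $\delta$.<br />

```python
import math
from dataclasses import dataclass

@dataclass
class PathScores:
    length: float     # L(p), linker length normalized to the termini distance
    angle: float      # A(p), angle contribution
    distance: float   # D(p), distance-to-protein contribution
    forbidden: float  # u(p), forbidden-region contribution, may be infinite

def weight(s, alpha, beta, gamma, delta):
    """W(p) = alpha*L(p) + beta*A(p) + gamma*D(p) + delta*u(p); smaller is better."""
    if math.isinf(s.forbidden):
        # Avoids 0 * inf = nan when delta is 0 and keeps the path discarded.
        return math.inf
    return (alpha * s.length + beta * s.angle
            + gamma * s.distance + delta * s.forbidden)

def rank_paths(scored_paths, alpha, beta, gamma, delta):
    """Sort (name, PathScores) pairs from best (smallest W) to worst."""
    return sorted(scored_paths, key=lambda item: weight(item[1], alpha, beta, gamma, delta))
```

Because the four contributions are normalized to be dimensionless, the constants only express their relative importance.<br />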
<br />
==Translating paths to sequence==<br />
As mentioned before, the software is provided with two databases, one for the possible angle patterns and one for the helix patterns. The choice of the patterns was inspired by known crystal structures extracted from databases and described in different papers.<br />
A large in silico screening for refining the preferences of the patterns was then set up using the [[iGEM@home|###Link]] system. For a complete description of the search for suitable patterns, see the [[modeling|###Link to patternspart]] page. <br />
All the possible paths are now split up at the angles and compared with the possible patterns in the databases. ###Figure needed to explain, how the path is translated### The most suitable patterns are identified and concatenated to build the path's sequence. It is important to note that this is only possible because of the modularity of our linker patterns used as building blocks: each block, whether an alpha helix or an angle pattern, is not affected by the others. Thus, for each possible path, one sequence is produced.<br />
==Clustering of paths==<br />
Many different paths are represented by the same sequence [[###Figure that shows, different paths have same properties, already before in the text ###]] and we therefore clustered such paths. The weights for those clustered paths were then calculated by averaging the weights of the different paths that compose a cluster.<br />
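The clustering amounts to grouping the paths by their sequence and averaging the weights in each group. A minimal Python sketch, with a hypothetical function name and made-up weights in the usage:<br />

```python
from collections import defaultdict

def cluster_by_sequence(paths):
    """paths: iterable of (sequence, weight) pairs. Returns {sequence: mean weight}."""
    groups = defaultdict(list)
    for sequence, w in paths:
        groups[sequence].append(w)
    return {sequence: sum(ws) / len(ws) for sequence, ws in groups.items()}

# Usage with made-up weights: two geometric paths share one sequence.
clusters = cluster_by_sequence([("AEAAAK", 1.0), ("AEAAAK", 3.0), ("GG", 2.0)])
```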
<br />
<br />
=Results=<br />
==DNMT1==<br />
A major motivation of our effort to design rigid linkers with angles was the circularization of the DNA methyltransferase Dnmt1 (link). The truncated form used in our project is composed of 900 amino acids, and the N- and C-terminal extremities are well separated. To circularize it, two linkers were designed: a flexible one made of glycine and serine, and a rigid one designed by the software. The rigid linkers for DNMT1 were obtained from an early version of the software. At that time the calculation took 11 days on a laptop computer with an Intel i5 processor and 8 GB of RAM, which shows the importance of a distributed computing system for large proteins. Since then, the software has improved considerably, reducing the calculation time to about 1 day for DNMT1.<br />
###DNMT1 could be made heatstable, still missing###<br />
==Feedback from wet lab==<br />
The results from the software for lysozyme were tested as described in the [[linker-screening]] part and evaluated as described in the [[enzyme modeling]] part. We have performed a large linker screening on 10 different lysozymes with different linkers. As the purpose of the lysozyme screen was the calibration of the software, those linkers were designed according to the four contributions previously mentioned. One of them was the shortest possible, one had the best possible angle, and so on.<br />
{| class="table table-hover"<br />
|+ '''Table 1''': Linkers and their amino acid sequences. Green: attachment sequences to prevent the flexible regions from being perturbed; Blue: angle; Purple: extein.<br />
! Linker<br />
! Amino acid sequence<br />
|- <br />
|<br />
|<br />
|-<br />
|'''Very good linkers'''<br />
|<br />
|-<br />
| sgt2<br />
| <span style="color:#3ADF00;">GG</span>AEAAAK<span style="color:#00BFFF;">AAAHPEA</span>AEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| rigid<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAKAA<span style="color:#3ADF00;">P</span><span style="color:#A901DB;">RGKCWE</span><br />
|-<br />
| <br />
| <br />
|-<br />
| '''Average linkers'''<br />
|<br />
|-<br />
| may1<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAKA<span style="color:#00BFFF;">AAAHPEA</span>AEAAAKEAAAKA<span style="color:#00BFFF;">KTA</span>AEAAAKEAAAKA<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| ord1<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAK<span style="color:#00BFFF;">ATGDLA</span>AEAAAKAA<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| ord3<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAK<span style="color:#00BFFF;">ASLPAA</span>AEAAAKEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
|<br />
| <br />
|-<br />
| '''Short linkers'''<br />
| <br />
|-<br />
| sho1<br />
| <span style="color:#3ADF00;">GG</span><span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| sho2<br />
| <span style="color:#3ADF00;">GG</span>AEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
|}<br />
[[circ_lam_lys_nils.png]] comparison with [[nice_linker_lysozyme_flexible_ends.png]]<br />
###About predictions of software###<br />
In the end we obtained a ranking of the in vitro tested linkers from the [[linker-screening]] and chose the parameters $\alpha, \beta, \gamma, \delta$ of the weighting function so that the ranking from the software reproduced the ranking from the assays. The final values were: $\alpha = 1.85 \times 10^{-6}, \beta = 0.57, \gamma = 50.8 \times 10^{6}$.<br />
To this end, we set up a function that is minimized when the order of the weighting output is the same as the order from the wet lab.<br />
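The exact function used for the calibration is not given here. One possible choice, shown purely as an assumption, is to count the pairs of linkers that the software orders differently from the assay; this count is zero, and therefore minimal, exactly when both rankings agree. The linker names and numbers in the usage are invented.<br />

```python
from itertools import combinations

def rank_disagreement(weights, lab_rank):
    """Number of linker pairs that the software orders differently from the lab.

    weights:  {linker: W(p)}, smaller is better
    lab_rank: {linker: rank from the assay}, 1 is best
    """
    discordant = 0
    for a, b in combinations(sorted(weights), 2):
        # A negative product means the two orderings disagree on this pair.
        if (weights[a] - weights[b]) * (lab_rank[a] - lab_rank[b]) < 0:
            discordant += 1
    return discordant

# Usage with invented values for three hypothetical linkers:
software = {"linker_a": 0.1, "linker_b": 0.5, "linker_c": 0.9}
agree = rank_disagreement(software, {"linker_a": 1, "linker_b": 2, "linker_c": 3})
swapped = rank_disagreement(software, {"linker_a": 1, "linker_b": 3, "linker_c": 2})
```

The constants $\alpha, \beta, \gamma, \delta$ could then be searched, for example on a grid, for the combination that gives the smallest count.<br />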
<br />
=Discussion=<br />
Will always be refined with more data from i@h...<br />
=References=<br />
<br />
[1] Thornton, J.M. & Sibanda, B.L. Amino and carboxy-terminal regions in globular proteins. Journal of molecular biology 167, 443-460 (1983).<br />
<br />
[2] Wang, C.K.L., Kaas, Q., Chiche, L. & Craik, D.J. CyBase: A database of cyclic protein sequences and structures, with applications in protein discovery and engineering. Nucleic Acids Research 36, (2008).</div>Igemnilshttp://2014.igem.org/Team:Heidelberg/pages/Linker_SoftwareTeam:Heidelberg/pages/Linker Software2014-10-17T22:47:09Z<p>Igemnils: </p>
<hr />
<div>=Abstract=<br />
As already [https://2014.igem.org/Team:Heidelberg/Toolbox/Circularization introduced], artificially circularized proteins may gain some heat stability by restraining the C- and N-terminus from moving around freely. This circularization may be trivial when the protein termini are very close to each other, which seems to be reasonably common [[#References|[1]]]. However, if the ends are too far from each other, a long linker is needed to connect them. This linker should not change the natural conformation of the protein and should constrain the relative position of the ends to restrict the degrees of freedom and thus to stabilize the structure even when heated up. In addition, these linkers should not affect any of the protein functions. Consequently, it is important to prevent linkers from passing through the active site or from covering domains that bind other molecules, for example. Therefore one needs to be able to define the shape of possible linkers. This section describes the software we developed to design such linkers. We would like to stress that this work has been made possible thanks to the feedback between computer modeling and experimental work: we could first design linkers in silico, test them experimentally and use the results to further calibrate the software. To our knowledge, this is the first time that such an approach has been used to custom-design rigid linkers with angles to connect protein extremities.<br />
<br />
=General procedure=<br />
In short, the software can provide a weighted list of linkers to circularize any protein of interest with a known structure. Those linkers are made of rigid alpha-helical segments connected with defined angles. Contrary to flexible linkers, those rigid linkers were expected to constrain the protein extremities and to confer better heat stability. Such an idea was already developed [[#References|[2]]] but only with alpha helices defining simple rods, and without any possibility to introduce angles. To generate those linkers, we first defined the geometrical paths, with segments and angles, that they should follow. The geometrical paths that are biologically feasible are afterwards translated into amino acid sequences. Both the compatibility of paths with possible structures and the translation were made possible thanks to our [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling approaches]. The first approach consisted in performing a statistical analysis of more than 17000 known non-homologous structures containing alpha helices connected with angles. For the second approach, we modeled the conformation of linkers circularizing proteins of known structure and analyzed them for certain properties. This second approach was run for a large number of proteins thanks to our distributed computing system [https://2014.igem.org/Team:Heidelberg/Software/igemathome igemathome]. The software provides different possible linkers with weights that rank the linkers according to their capacity to maintain protein activity at higher temperatures. These weights were generated thanks to an extensive [https://2014.igem.org/Team:Heidelberg/Project/Linker_Screening linker screening] on the target protein lambda-lysozyme, using the first modeling approach.<br />
<br />
=Background=<br />
Classically, protein linkers were designed in three different manners. The easiest way is to define the length that a linker should cover and then simply use a flexible glycine-serine peptide with the right number of amino acids to match this length. Glycine is used for flexibility, as it has no side chain and does not produce any steric hindrance, while serine is used for solubility, as it has a small polar side chain. This solubility is important, as the linkers should not pass through the hydrophobic core of the protein, but should be dissolved in the surrounding medium. These flexible linkers were normally used for circularization but also for connecting different proteins, when the main goal is that the different parts are connected, not how they are connected, or when the flexibility of the linker is required for specific applications. <br />
<br />
A second strategy consists in using rigid helical linkers to keep proteins or protein domains at a certain distance from each other. This is especially important for signalling proteins and fluorescent proteins. One major property of alpha helices is that they always fold in a defined way with well-defined angles and lengths. There are also many different helical patterns that differ in stability and solubility. Although they have been used to design circularizing linkers [[#References|[2]]], one big disadvantage of this strategy is that one can only build straight linkers with helices. So in the context of circularization, if an artificial line that would connect the protein extremities crosses the protein, this strategy is not an option.<br />
<br />
The third option, which served as the basis for our approach and which came from discussions with the group of Rebecca Wade in Heidelberg, Germany, consists in designing custom-tailored linkers for each specific application. These linkers can be obtained from protein structure prediction. First, one defines the path that the linker should take to connect the protein ends. Next, one designs a candidate linker sequence that might fit this path. Then one predicts the structure of the linker attached to the protein to validate the design. Several linkers with slight changes can be compared. This cycle is repeated until the linker effectively follows the expected path. This method requires a solid knowledge of protein folding and protein structure prediction and is computationally intensive. On the other hand, the benefit can be substantial, as the interaction of the linker with the protein surface can be taken into account and the path taken by the linker can be defined down to the resolution of the protein structure.<br />
<br />
We have set up a completely new strategy to design rigid linkers. As further detailed in the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling] part, it is possible to define the shape of a linker by combining rigid alpha helical rods with well-defined angle patterns. Therefore, by defining in a geometrical way the possible paths of the circularizing linkers for a given protein, we can propose potential linkers. Defining the geometrical path can be very difficult, especially for large proteins with complex shapes. Moreover, it is further constrained by the fact that linkers must avoid hiding active sites of the protein of interest. Finally, the paths have rotational degrees of freedom at the extremities of the protein, and depending on their orientation, they may or may not match the geometry of the protein. The tool we present here covers both steps: defining geometrical paths with weights and translating them into feasible linkers, also with weights. This tool is universal, as it can design circularizing linkers for any protein with a known structure. Moreover, it is modular: thanks to our [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling approach], we have designed linkers as exchangeable blocks of rods of different lengths and of angle patterns. The following sections detail the steps followed by our software to design proper linkers.<br />
<br />
<br />
==PDB parsing==<br />
First, the PDB file containing the structure of the target protein is parsed and the coordinates of the atoms are stored in metric units. After this, some initial tests are made on the protein structure. First, we checked whether the C- and N-termini lie on the surface of the protein and are accessible to the solvent, which is crucial for circularization. We defined a line originating from an extremity of the protein, with its direction given by the two angles of the spherical coordinates around the z-axis. From that, we could determine the accessible angles by rejecting all the lines that are too close to the protein. As the future linker will be made of alpha helices and will therefore have a radius of 5 &Aring;, we used this length as the minimal allowed distance.<br />
Those allowed angles are stored for the coming linker generation. ###fig needed### <br />
<br />
==Generation of geometric paths==<br />
As our strategy consists in building linkers with helical rods and connecting angles, a path is completely defined by the coordinates of the angle points. Advancing one step from an existing point is always done by adding a displacement vector to this point. This vector is defined by the two spherical angles, chosen here in a discrete manner with an increment of 5 degrees, and by a length, also chosen in a discrete manner. This discrete length is used in two different contexts: it may correspond to the length of an alpha helix or to the length of the flexible part that appears at the extremity of the protein. The coordinates reached with this vector define a new angle point, no matter whether the vector corresponds to an alpha helix or to a flexible part. ###figure, points + vectors###. As we screen all possible angles in a discrete manner, the coordinates of those angle points are regularly distributed on a sphere. As further detailed in the next sections, those spheres are defined from both ends, either once or in several steps. Then the software checks for possible straight connections of given lengths for each pair of angle points originating from the two extremities. ###figure needed, with all these possible steps, one sphere around each point and checking for the connections, 2D, different lengths in two different figures, ###<br />
The linkers are built in a modular way, with blocks of well-defined size. From the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling] of potential linkers, we could derive 8 different alpha helical rods, all with different lengths. In addition, the length of the two segments inside an angle block was always 8 &Aring;, so exchanging angle blocks does not affect the length of the linker. This means that the distance between the angle points is well defined, an essential aspect of our strategy of linker design.<br />
The software proceeds in three steps. First, it checks whether a direct linker made of a single alpha helix is possible. For this, it applies the procedure just mentioned with spheres whose radius corresponds to the length of the short flexible parts at the extremities of the protein. Second, it tests whether a linker containing two alpha helices connected with a right angle allows the circularization. Finally, it searches for possible linkers with three angle points. The next parts explain those three steps in detail.<br />
This method was chosen because it could be implemented easily and efficiently in our program. However, this strategy generates paths that cross the protein. Therefore we put considerable effort into sorting out the paths.<br />
<br />
===Step 1===<br />
As a simple rigid linker with no angle would be easier to design and likely more thermostable than linkers containing angles, the software first checks whether this simple solution is possible ###figure needed###. For this, we took into account the fact that proteins have some flexible amino acids at their extremities. This flexible part may come from the protein itself, but also from the 2 glycines that are included at the N-terminal part and from the extein at the C-terminal part. These two latter parts come from our linkers. They have no preferential angles and offer a large number of possibilities to insert fitting linkers. But this flexibility is also a drawback, as we have to include this large number of possible angles and lengths in our path search.<br />
In this first step, the software explicitly takes these flexible parts into account to check for the possibility of straight linkers. As the angles and the length of the flexible parts are variable, the software positions their extremity on a sphere centered on the last fixed position of the structure, as explained above. The radius of this sphere is incremented in a discrete manner, in 4 steps, from 5.25 &Aring; to the maximum length of the flexible part. ###figure###<br />
Then all possible straight segments between the first points and the last points are tested. If they come closer than 5 &Aring; to the protein, or if they cross it, they are rejected. If they are kept, the software checks whether the length of the segment is compatible with the feasible alpha helices: if the length of a given segment equals one of the 8 alpha helix lengths plus or minus 0.75 &Aring;, the path is saved.<br />
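The length test described above can be sketched as follows. This is only a minimal illustration: the 8 rod lengths in <code>HELIX_LENGTHS</code> are hypothetical placeholders (the real values come from the linker modeling), and the function name <code>matching_helix</code> is ours, not taken from the software.<br />

```python
import math

# Hypothetical rod lengths in angstrom; the real 8 values come from the linker modeling.
HELIX_LENGTHS = [10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0, 45.0]
TOLERANCE = 0.75  # angstrom, as stated in the text


def matching_helix(start, end, lengths=HELIX_LENGTHS, tol=TOLERANCE):
    """Return the first helix length that fits the segment start-end, or None."""
    d = math.dist(start, end)  # Euclidean length of the straight segment
    for length in lengths:
        if abs(d - length) <= tol:
            return length
    return None
```

A segment of 20.5 &Aring; would be matched to the hypothetical 20 &Aring; rod, while a segment of 22 &Aring; would be rejected.<br />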
===Step 2===<br />
The next option for designing more complex rigid linkers, while still taking the flexible ends into account within a reasonable calculation time, was to reduce the number of possible angles. As we originally thought that a 90° angle would be practically feasible, the software was designed to generate linkers with flexible ends and one 90° angle. ###figure, protein_one_90° angle### This choice was made notably because the edge lengths of right triangles are simple to calculate. We already saw in Step 1 that the length of an edge can only take 8 different values. As the linkers have to start from the extremities of the protein, and as we impose a right angle, the number of possible paths is low, making them easy to compute. In practice, the extremities of the protein are positioned in a flexible way as in Step 1. From each of the positions allowed by this flexibility, the software searches for all the allowed right triangles. This restriction was mainly introduced because the degrees of freedom had to be limited to keep the calculations feasible. <br />
===Step 3===<br />
Finally the software also provides the possibility to find paths with up to 4 edges, meaning 4 alpha helices and 3 angles. Thanks to the modularity of the possible linkers, such paths can offer the possibility to circularize theoretically any kind of protein. ###figure of torus###<br />
To keep the calculation feasible in a reasonable time, we designed the search strategy so that the flexible parts at the extremities are oriented in the same direction as the adjacent alpha helix. This obviously restricts the search, but as these orientations are allowed for the flexible parts, the approach remains valid. <br />
First, potential ending points of the first alpha helical rod are calculated from the N-terminal point of the protein. The orientation is chosen in a discrete manner, with an increment of 5 degrees for the two angles of the spherical coordinates. The distance from the origin corresponds to the 8 possible lengths allowed by the alpha helices, as already seen in Step 2, plus a length of 0, which mimics a linker with 3 instead of 4 edges. The exact same procedure is repeated to define all the potential ending points of the second alpha helical rod, starting from all the possible ending points of the first alpha helical rod. Thanks to the possibility of a length of 0 for the first and the second rods, the software also calculates paths with 2 edges. Then, the same is done only once from the C-terminal point of the protein, defining 1 edge. The final step consists in checking whether the points originating from the N- and C-terminal points can be linked by a potential alpha helix, i.e. whether they are separated by an appropriate distance. If any of the possible alpha helix lengths matches the distance between the two points to within plus or minus 0.75 &Aring;, the path is saved. In the same way, if the two points are themselves closer than 0.75 &Aring; to each other, the path is also saved.<br />
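The final connection check of Step 3 can be illustrated with the short sketch below. It assumes that the candidate points grown from the N-terminal side and from the C-terminal side are already available as lists of coordinates; the function name <code>connect</code> and the rod lengths passed in the test are hypothetical.<br />

```python
import math
from itertools import product


def connect(n_points, c_points, lengths, tol=0.75):
    """Yield (n_point, c_point, rod_length) for every pair a rod can bridge.

    A rod_length of 0.0 marks two points that coincide within the tolerance,
    so that no additional helix is needed between them.
    """
    for p, q in product(n_points, c_points):
        d = math.dist(p, q)
        if d < tol:
            yield p, q, 0.0
            continue
        for length in lengths:
            if abs(d - length) <= tol:
                yield p, q, length
                break
```

Because every N-side point is compared with every C-side point, the number of tests grows with the product of the two point sets, which is one reason why the number of generated paths becomes so large.<br />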
<br />
==Sorting out of paths==<br />
The previous part described the generation of paths that can connect the two extremities of the protein, irrespective of the position of these paths relative to the protein. While this allows a fast computation of the geometrical paths, it also implies that the paths that are not practically feasible need to be sorted out. This is the most time-consuming part of the computation, as about 1 billion paths are generated. Three criteria are considered for the sorting. The first one is the feasibility of the linker: can the software find angle patterns that correspond to the ones defined by the geometrical path? This question was part of the motivation for a large modeling effort (link) to determine the possible angles between consecutive alpha helices. This was achieved by analyzing the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling distribution of angles] between alpha helices found in the ArchDB database. As nearly any angle between 20 and 170 degrees could be found, only a few paths were actually rejected at that step. The second criterion is the position of the angle points: if they lie inside the protein, the path is rejected. Finally, the software checks whether any atom of the protein is less than 5 &Aring; away from any of the alpha helices; if so, the path is also rejected.<br />
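The third criterion amounts to a point-to-segment distance test for every atom and every rod. The sketch below shows one straightforward way to write it; it is not the software's actual routine, and the names <code>point_segment_distance</code> and <code>segment_is_clear</code> are ours.<br />

```python
import math


def point_segment_distance(p, a, b):
    """Shortest distance between point p and the segment from a to b."""
    ab = [b[i] - a[i] for i in range(3)]
    ap = [p[i] - a[i] for i in range(3)]
    denom = sum(c * c for c in ab)
    if denom == 0.0:
        t = 0.0  # degenerate segment: a and b coincide
    else:
        # Projection of p onto the line, clamped to the segment ends.
        t = max(0.0, min(1.0, sum(ap[i] * ab[i] for i in range(3)) / denom))
    closest = [a[i] + t * ab[i] for i in range(3)]
    return math.dist(p, closest)


def segment_is_clear(a, b, atoms, min_dist=5.0):
    """True if no atom lies closer than min_dist (angstrom) to the rod a-b."""
    return all(point_segment_distance(atom, a, b) >= min_dist for atom in atoms)
```

A path would then be kept only if <code>segment_is_clear</code> holds for each of its rods.<br />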
<br />
==Shifting paths to the patterns==<br />
<br />
The strategy described in Step 3 gives a certain freedom to the rod that connects the last two angle points generated from the N- and C-terminal points. As this freedom is actually not permitted by the alpha helix and the angle pattern, but is permitted by the flexible part, for example at the C-terminal end, the software slightly refines the path by rotating the segment that originates from the C-terminal point.<br />
<br />
==Weighting of paths==<br />
Before translating the paths into sequences and thus into linkers that can be expressed, each path needs to be evaluated for its potential capacity to enhance heat stability. For this we have identified different contributions that should be combined into one value that defines how good the linker may be and that consequently defines its ranking among all possible linkers. The smaller this value, the more we expect that the linker will enhance thermostability. An important step for all these contributions is the normalization, as explained in the next paragraphs.<br />
The first contribution we considered was the linker length. We assumed that a short linker is better to constrain the protein extremities, and that a long helix might give more flexibility. Because we wanted this value to be independent of the size of the protein, the length of the linker is normalized to the distance between the two termini.<br />
The second contribution relates to the angles used in the linker. We learned from the [[modeling ###link###]] that angles formed by a certain angle pattern follow a certain distribution. First, we assumed that the narrower the distribution, the more likely the alpha helices would actually produce this angle. Second, the angles found by the software should be as close as possible to those well-defined angles. In this case, the weight value from this contribution should be low.<br />
Then the distance of the linker to the protein is taken into account. Because the linkers should not disturb the protein in its normal environment, linkers that pass close to the protein surface are considered better linkers. The distance was defined as the minimal distance between the linker and all the atoms of the protein. As already mentioned for the sorting of the paths, a linker cannot come closer than 5 &Aring; and this distance was used for normalization of calculated distances.<br />
After this, the regions that a linker should avoid are taken into account. Each protein can interact with other molecules on some parts of its surface. The user can specify where these parts are and how big they are. If a linker passes in front of a potential binding domain, the value of the corresponding path goes to infinity, so that the linker is discarded. Conversely, the farther a linker is from a potential ligand binding domain, the smaller its weighting value. The user can also specify the importance of certain regions. In the end, the total weighting is normalized to the number of binding domains.<br />
<br />
===Calibrating the weighting function===<br />
???keep?###<br />
Every contribution has its own distribution. An example is shown in figure ### [[figure histogram_length_lys.png]]; all of them have different shapes. The aim is to find the paths that minimize all of these contributions. Therefore, in the weighting function the four contributions mentioned above are combined in a linear manner:<br />
\[ W(p) = \alpha L(p) + \beta A(p) + \gamma D(p) + \delta u(p) \]<br />
where W is the final weighting, p the path, L the length contribution, A the angle contribution, D the distance contribution and u the contribution from the forbidden regions. $\alpha, \beta, \gamma, \delta$ are the weighting constants that needed to be found. The normalization of each contribution was chosen so that each of them is dimensionless and that all have reasonably similar values.<br />
The weighting constants were obtained from the [[linker-screening ###lin]] performed with lysozyme and the [[enzyme-modeling ###link]]<br />
Please see [[below ###link]] for a detailed explanation of how the values were obtained.<br />
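As an illustration of the linear combination W(p), the following sketch evaluates the weight of one path from its four normalized contributions. The class name <code>Contributions</code>, the function name <code>weight</code> and the constants used in any call are hypothetical; the calibrated values of the constants are not reproduced here.<br />

```python
import math
from dataclasses import dataclass


@dataclass
class Contributions:
    length: float     # L(p): linker length / distance between the termini
    angle: float      # A(p): angle contribution
    distance: float   # D(p): distance contribution, normalized with 5 angstrom
    forbidden: float  # u(p): forbidden regions, math.inf if one is crossed


def weight(c, alpha, beta, gamma, delta):
    """W(p) = alpha*L(p) + beta*A(p) + gamma*D(p) + delta*u(p); smaller is better."""
    return (alpha * c.length + beta * c.angle
            + gamma * c.distance + delta * c.forbidden)
```

With a strictly positive <code>delta</code>, a path that crosses a forbidden region receives an infinite weight and therefore always ends up at the bottom of the ranking.<br />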
<br />
==Translating paths to sequence==<br />
As already mentioned before the software is provided with two databases, one for the possible angle patterns and one for the helix patterns. The choice of the patterns was inspired by known crystal structures extracted from databases and described in different papers.<br />
A huge in silico screening for refining the preferences of the patterns was then set up using the [[iGEM@home|###Link]] system. For the complete description of search for suitable patterns, one can read the [[modeling|###Link to patternspart]] page. <br />
All the possible paths are now split up at the angle points and compared with the possible patterns in the databases. ###Figure needed to explain, how the path is translated### The most suitable patterns are identified and joined together to build the path's sequence. It is important to note that this is only possible because of the modularity of the linker patterns we use as building blocks: each block, whether an alpha helix or an angle pattern, is not affected by the others. Thus, for each possible path, one sequence is produced.<br />
==Clustering of paths==<br />
Many different paths are represented by the same sequence [[###Figure that shows, different paths have same properties, already before in the text ###]] and we therefore clustered such paths. The weights for those clustered paths were then calculated by averaging the weights of the different paths that compose a cluster.<br />
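The clustering step can be pictured as grouping paths by their sequence and averaging the weights within each group. The sketch below assumes that every path is available as a (sequence, weight) pair; the function name and the sequences used in any example are placeholders, not actual linker sequences.<br />

```python
from collections import defaultdict


def cluster_by_sequence(paths):
    """Map each linker sequence to the mean weight of all paths it represents.

    paths: iterable of (sequence, weight) pairs.
    """
    groups = defaultdict(list)
    for sequence, weight in paths:
        groups[sequence].append(weight)
    return {seq: sum(ws) / len(ws) for seq, ws in groups.items()}
```

Sorting the resulting dictionary by value then gives one ranked entry per linker sequence.<br />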
<br />
<br />
=Results=<br />
==DNMT1==<br />
A major motivation of our effort to design rigid linkers with angles was the circularization of the DNA methyltransferase Dnmt1 (link). The truncated form used in our project is composed of 900 amino acids, and the N- and C-terminal extremities are well separated. To circularize it, two linkers were designed: a flexible one made of glycine and serine, and a rigid one designed by the software. The rigid linkers for DNMT1 were obtained from an early version of the software. At that time the calculation took 11 days on a laptop computer with an Intel i5 processor and 8 GB of RAM, which shows the importance of a distributed computing system for large proteins. Since then, the software has improved considerably, reducing the calculation time for DNMT1 to about 1 day.<br />
IMPORTANT!<br />
paths-> geometrical<br />
linkers-> helices...<br />
sequences-> seq of linkers<br />
insist on huge number of data produced<br />
We definitely need to cite [0]!!!<br />
=Abstract=<br />
=Overview=<br />
As already introduced [###link to circularization idea in the toolbox], artificially circularized proteins may gain some heat stability by restraining the C- and N-terminus from moving around freely. This circularization may be trivial when the protein termini are very close to each other, which seems to be reasonably common [[#References|[1]]]. However, if the ends are too far from each other, a long linker is needed to connect them. This linker should not change the natural conformation of the protein and should constrain the relative position of the ends, to restrict the degrees of freedom and thus to stabilize the structure even when heated up. In addition, these linkers should not affect any of the protein functions. Consequently, it is important to prevent linkers from passing through the active site or from covering binding domains for other molecules, for example. Therefore one needs to be able to define the shape of possible linkers. This section describes the software we developed to design such linkers. We would like to stress that this work has been made possible by the feedback between computer modeling and experimental work: we could first design linkers in silico, test them experimentally and use the results to further calibrate the software. To our knowledge, this is the first time that such an approach has been used to custom-design linkers that connect protein extremities.<br />
In short, the software can provide a weighted list of linkers to circularize any protein of interest with a known structure. Those linkers are made of rigid alpha-helical segments connected with defined angles. Contrary to flexible linkers, those rigid linkers were expected to constrain the protein extremities and to confer better heat stability. To generate those linkers, we first defined the geometrical paths, with segments and angles, that they should follow. The geometrical paths that are biologically feasible are afterwards translated into amino acid sequences. Both the compatibility of paths with possible structures and the translation were made possible by our modeling approaches. The first approach consisted in performing a statistical analysis of more than 17000 known non-homologous structures containing alpha helices connected with angles. For the second approach, we modeled the conformation of linkers circularizing proteins of known structure and analyzed them for certain properties. This second approach was run for a large number of proteins thanks to our distributed computing system igemathome. The software provides different possible linkers with weights that rank the linkers according to their capacity to maintain protein activity at higher temperatures. These weights were calibrated in an extensive [[linker screening| ### ]] on the target protein lambda-lysozyme, using the first modeling approach.<br />
The software was checked for running stability in a huge test over the [[iGEM@home]]. <br />
Furthermore our software provided linkers for circularizing [[DNMT1| ###]], that could be made more heatstable due to circularization???.<br />
For detailed information on the implementation and the practical use of the software, please see the [[documentation software-docu]] page.<br />
==What does the software do==<br />
still missing [[figure 0, graph abstract]]<br />
=Background=<br />
What are linkers, why should ours work better?<br />
Linkers, References to the modeling, <br />
Classically, protein linkers were designed in three different manners. ###REFERENCE### The easiest way is to define the length that a linker should cover and then simply use a flexible glycine-serine peptide with the right number of amino acids to match this length. Glycine is used for flexibility, as it has no side chain and does not produce any steric hindrance, while serine is used for solubility, as it has a small polar side chain. This solubility is important, as the linkers should not pass through the hydrophobic core of the protein, but should be dissolved in the surrounding medium. These flexible linkers were normally used for circularization, but also for connecting different proteins when the main point is that the different parts are connected, not how they are connected, or when the flexibility of the linker was required for specific applications.<br />
A second strategy consists in using rigid helical linkers to keep proteins or protein domains at a certain distance from each other. This is especially important for signalling proteins and fluorescent proteins. ###TODO: Reference### One major property of alpha helices is that they always fold in a defined way, with well-defined angles and lengths. There are also many different helical patterns that differ in stability and solubility. One big disadvantage of this strategy is that one can only build straight linkers with helices. So in the context of circularization, if a straight line connecting the protein extremities crosses the protein, this strategy is not an option.<br />
The third option, which served as a basis for our approach, consists in designing custom-tailored linkers for each specific application. These linkers can be obtained from protein structure prediction. First, one needs to define the path that the linker should take to connect two amino acids. Then one designs a possible linker sequence that might fit well. Next, one makes a structure prediction of the linker attached to the proteins to validate the design. Several different linkers, with slight changes, can be compared. This is repeated several times until the linker effectively follows the expected path. ###TODO: Reference, WADE paper### This method is time-consuming, as it is not only computationally intensive but also requires a strong knowledge of protein folding and protein structure prediction. On the other hand, the benefit can be important, as the interaction of the linker with the protein's surface can be taken into account and as one can accurately define the path taken by the linker down to the resolution of the protein structure.<br />
We have set up a completely new strategy to design rigid linkers. As further detailed in the [[###link###|modelling]] part, it is possible to define the shape of a linker by combining rigid alpha helical rods with well-defined angle patterns. Therefore, by defining in a geometrical way the possible paths of the circularizing linkers for a given protein, we can then propose potential linkers. This definition of the geometrical path can be very difficult, especially for large proteins with complex shapes. Moreover, this definition is further constrained by the fact that linkers must avoid hiding active sites of the protein of interest. Finally, the paths have rotational degrees of freedom at the extremities of the protein and, depending on their orientation, they may or may not match the geometry of the protein. The tool we present here covers the two steps: defining geometrical paths with some weights and translating them into feasible linkers, also with weights. This tool is universal, as it has the capacity to design circularizing linkers for any protein with a known structure. Moreover it is modular, as, thanks to our modeling approach [link], we have designed linkers as exchangeable blocks of rods of different lengths and of angle patterns. The following sections detail the different steps followed by our software to design proper linkers.<br />
But even when taking all this into account, one could never manually consider all the alternative paths that the same kind of linker is able to take because of the rotational degrees of freedom inherent in the linker model ###figure needed###<br />
Thus we decided to provide the scientific community with a powerful open-source software that, for every protein with a given structure, can calculate the sequence of the linker needed to circularize the protein with a rigid linker and with minimal inhibition of the protein's function. <br />
Until now, scientists had to estimate the length of the linker themselves, ###check for flexlinker### so our software is a completely novel approach to circularization.<br />
=General procedure=<br />
Our general approach: rods and angles are chosen in a discrete manner, so there is only a finite number of possibilities to connect the ends, and all of them are checked. <br />
I think this paragraph is just too much, as anyway there was already an overview before.<br />
First, the protein structure is analysed in a geometrical way: paths composed of no more than four straight segments connected by angles are generated computationally. A path is always represented by straight lines and the connecting angles between them. ###In the end all these paths should be sorted by how well it would be, if we circularize the protein using this path.### But as the final weighting is quite computationally intensive, the paths first need to be sorted out. A path is only discarded if it breaches one of the rules for linkers. For example, paths should never pass through the protein. After all the paths have been generated, they are improved by shifting the angle points according to the underlying linker model, so that no paths are taken into account that could not be built with our building blocks.<br />
Once all paths have been generated, they are weighted by several factors. From these, a single weighting is calculated for each path, which reflects how good this particular path is. Then the path is translated, using our amino acid patterns, into an amino acid sequence. But as one sequence can follow more than one path, all the paths built up by this sequence are clustered. In the end, the average of the weightings within each path cluster is calculated, so that for every linker only one weighting value is produced, with contributions from all paths possibly taken by this linker.<br />
==PDB analysis==<br />
First, the PDB file containing the structure of the target protein is parsed and the coordinates of the atoms are stored, in the metric unit. After this, some initial tests are made with the protein structure. First, we checked whether the C- and N-termini lie on the surface of the protein and are accessible to the solvent, which is crucial for circularization. We defined a line originating from an extremity of the protein with the two angles of the spherical coordinates around the z-axis. From that, we could determine the accessible angles by rejecting all the lines that are too close to the protein. As the future linker will be made of alpha helices and will therefore have a radius of 5 &Aring;, we used this length as the minimal allowed distance.<br />
Those allowed angles are stored for the coming linker generation. ###fig needed### <br />
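The parsing step can be sketched as below. This is a minimal reader, not the software's actual parser: it relies only on the fixed-column layout of ATOM records in the PDB format, where the x, y and z coordinates (in &Aring;) occupy columns 31-38, 39-46 and 47-54. The function name <code>parse_atoms</code> is ours.<br />

```python
def parse_atoms(pdb_text):
    """Return the (x, y, z) coordinates of all ATOM records in a PDB file's text."""
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            # Fixed-width columns of the PDB format (0-based slices).
            x = float(line[30:38])
            y = float(line[38:46])
            z = float(line[46:54])
            atoms.append((x, y, z))
    return atoms
```

HETATM records (ligands, water) are deliberately ignored in this sketch; whether the software includes them is not stated here.<br />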
==Generation of geometric paths==<br />
As our strategy consists in building linkers with helical rods and connecting angles, a path is completely defined by the coordinates of the angle points. Advancing one step from an existing point is always done by adding a displacement vector to this point. This vector is defined by the two spherical angles, chosen here in a discrete manner with an increment of 5 degrees, and by a length, also chosen in a discrete manner. This discrete length is used in two different contexts: it may correspond to the length of an alpha helix or to the length of the flexible part that appears at the extremity of the protein. The coordinates reached with this vector define a new angle point, no matter whether the vector corresponds to an alpha helix or to a flexible part. ###figure, points + vectors###. As we screen all possible angles in a discrete manner, the coordinates of those angle points are regularly distributed on a sphere. As further detailed in the next sections, those spheres are defined from both ends, either once or in several steps. Then the software checks for possible straight connections of given lengths for each pair of angle points originating from the two extremities. ###figure needed, with all these possible steps, one sphere around each point and checking for the connections, 2D, different lengths in two different figures, ###<br />
The linkers are built in a modular way, with blocks of well-defined size. From the modeling of potential linkers [link], we could derive 8 different alpha helical rods, all with different lengths. In addition, the length of the two segments inside an angle block was always 8 &Aring;, so exchanging angle blocks does not affect the length of the linker. This means that the distance between the angle points is well defined, an essential aspect of our strategy of linker design.<br />
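The displacement step can be sketched as follows. The convention used here (polar angle measured from the z-axis, azimuth in the x-y plane) is an assumption, as are the function names <code>displace</code> and <code>sphere_points</code>; only the 5 degree increment is taken from the text.<br />

```python
import math


def displace(point, theta_deg, phi_deg, length):
    """Add a displacement vector given by two spherical angles and a length."""
    theta = math.radians(theta_deg)  # polar angle, measured from the z-axis
    phi = math.radians(phi_deg)      # azimuthal angle in the x-y plane
    return (
        point[0] + length * math.sin(theta) * math.cos(phi),
        point[1] + length * math.sin(theta) * math.sin(phi),
        point[2] + length * math.cos(theta),
    )


def sphere_points(center, length, step_deg=5):
    """All candidate angle points at a given distance from center."""
    return [
        displace(center, theta, phi, length)
        for theta in range(0, 181, step_deg)
        for phi in range(0, 360, step_deg)
    ]
```

With a 5 degree increment this yields 37 x 72 = 2664 candidate points per sphere and per length (the two poles are generated repeatedly in this naive version).<br />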
The software proceeds in three steps. First, it checks whether a direct linker made of a single alpha helix is possible. For this, it applies the procedure just described, using spheres whose radii correspond to the length of the short flexible parts at the extremities of the protein. Second, it tests whether a linker containing two alpha helices connected by a right angle allows the circularization. Finally, it searches for possible linkers with three angle points. The next parts explain these three steps in detail.<br />
This method was chosen because it could be implemented easily and efficiently in our program. However, this strategy also generates paths that cross the protein. We therefore put considerable effort into sorting out such paths.<br />
In the [[###link###|modelling]] part we have already described the patterns we use to build up our linkers. The modularity was crucial for the success of the software. Therefore the angle patterns were always chosen so that the end of the angle pattern is 8 &Aring; away from the turning point ###figure needed###. Thus each displacement only has to be made with a certain length, which depends neither on the direction in which it goes nor on the direction in which the linker will continue. This modularity makes the calculations more efficient than simply generating points at random.<br />
We still have to rephrase that.<br />
===Step 1===<br />
As a simple rigid linker with no angle would be easier to design and likely more thermostable than linkers containing angles, the software first checks whether this simple solution is possible ###figure needed###. For this, we took into account the fact that proteins have some flexible amino acids at their extremities. This flexible part may come from the protein itself, but also from the 2 glycines that are included at the N-terminal part and from the extein at the C-terminal part. These two latter parts come from our linkers. They have no preferential angles and offer a large number of possibilities to insert fitting linkers. But this flexibility is also a drawback, as we have to include this large number of possible angles and lengths in our path search.<br />
In this first step, the software explicitly takes these flexible parts into account to check for the possibility of straight linkers. As the angles and the length of the flexible parts are variable, the software positions their extremity on a sphere centered on the last fixed position of the structure, as explained above. The radius of this sphere is incremented in a discrete manner, in 4 steps, from 5.25 &Aring; to the maximum length of the flexible part. ###figure###<br />
Then all possible straight segments between the first points and the last points are tested. If they come closer than 5 &Aring; to the protein, or if they cross it, they are rejected. If they are kept, the software checks whether the length of the segment is compatible with the feasible alpha helices: if the length of a given segment equals one of the 8 alpha helix lengths plus or minus 0.75 &Aring;, the path is saved.<br />
" Therefore here another function for calculating the distance from the protein to the connection is used, that is more time-consuming, but also more accurate, than the ones used normally. " ??<br />
"This is done with a higher accuracy, because none of these linkers should be lost by error". ??<br />
Two functions have so far been included for generating linkers that take the flexibility of the ends into account.<br />
The only variable we have in this custom-tailored linker is the length of the helix. Most likely no suitable linker can be found here, because if there is an obstacle between the ends, the linker cannot bend around it. But if such a linker is possible, it should be found, because it will be one of the best linkers predicted. <br />
So for all accessible angles that were calculated before, we calculate all points from the N-terminus and from the C-terminus that lie in certain distances from the terminus. The points are spread over varying distances from ; to the .<br />
===Step 2===<br />
The next option for designing more complex rigid linkers, while still taking the flexible ends into account within a reasonable calculation time, was to reduce the number of possible angles. As we originally thought that a 90° angle would be practically feasible, the software was designed to generate linkers with flexible ends and one 90° angle. ###figure### This choice was made notably because the edge lengths of right triangles are simple to calculate. We already saw in Step 1 that the length of an edge can only take 8 different values. As the linkers have to start from the extremities of the protein, and as we impose a right angle, the number of possible paths is low, making them easy to compute. In practice, the extremities of the protein are positioned in a flexible way as in Step 1. From each of the positions allowed by this flexibility, the software searches for all the allowed right triangles. <br />
some words on degree of freedom...<br />
###newly written###<br />
Again here we only have discrete possibilities to build right triangles by use of our helical patterns. Therefore at first all combinations of two helical patterns are searched, that could build a right triangle, that's hypotenuse has the length of the distance between the termini.<br />
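The search for suitable pairs of rods can be sketched with the Pythagorean relation. The tolerance applied to the hypotenuse and the function name <code>right_triangle_pairs</code> are assumptions made for this illustration, and the rod lengths in any example are placeholders.<br />

```python
import math
from itertools import combinations_with_replacement


def right_triangle_pairs(lengths, hypotenuse, tol=0.75):
    """Pairs of rod lengths (a, b) with sqrt(a^2 + b^2) close to hypotenuse."""
    pairs = []
    for a, b in combinations_with_replacement(lengths, 2):
        if abs(math.hypot(a, b) - hypotenuse) <= tol:
            pairs.append((a, b))
    return pairs
```

Each unordered pair returned here corresponds to two candidate linkers when the two rods differ, since either rod can start from the N-terminal side.<br />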
We then return to 3D and apply Thales's theorem, which states that if A, B and C lie on a circle and the line between A and C is a diameter of that circle, then the angle at B is a right angle. ### fig, thales### Thus in 3D we can discretize the positions where the angle point (B) can lie relative to the starting point (A). This set is cross-checked against the possible right triangles found before, so that we only keep paths that can be built with our patterns.<br />
Like in Step 1 the software creates spheres of points around the start (first points) and around the end (last points). Then for each point from the first points<br />
===Step 3===<br />
Finally the software also provides the possibility to find paths with up to 4 edges, meaning 4 alpha helices and 3 angles. Thanks to the modularity of the possible linkers, such paths can offer the possibility to circularize theoretically any kind of protein. ###figure of torus###<br />
To keep the calculation feasible in a reasonable time, we designed the search strategy so that the flexible part at each extremity is oriented in the same direction as the adjacent alpha helix. This obviously restricts the search, but as these orientations are allowed for the flexible part, the approach remains valid. <br />
First, potential end points of the first alpha helical rod are calculated from the N-terminal point of the protein. The orientation is chosen in a discrete manner, with an increment of 5 degrees for the two angles of the spherical coordinates. The distance from the origin corresponds to the 8 possible lengths allowed by the alpha helices, as already seen in Step 2, plus a length of 0, which mimics a linker with 3 instead of 4 edges. The exact same procedure is repeated to define all the potential end points of the second alpha helical rod, starting from all the possible end points of the first alpha helical rod. Thanks to the possibility of a length of 0 for the first and the second rods, the software also calculates paths with 2 edges. Then, the same is done only once from the C-terminal point of the protein, defining 1 edge. The final step consists in checking whether the points originating from the N- and C-terminal points can be linked by a potential alpha helix, i.e. whether they are separated by an appropriate distance. If the distance between two points lies within 0.75 &Aring; of any of the potential alpha helix lengths, then the path is saved. In the same way, if two points are closer than 0.75 &Aring; to each other, then the path is also saved.<br />
one starts from omega, 2 from alpha<br />
Thus even one of the worst shapes for circularization, a torus with the two termini in the pits, could be circularized with our linker system without needing infinitely long straight segments. On the other hand, this was the largest number of possibilities that was still feasible to calculate.<br />
Here the flexible parts of the protein are assumed to point in the same direction as the following helix. In this way the number of possibilities is kept fixed, but this is of course a rough approximation.<br />
Let's check the next paragraph together in detail<br />
First, points at the right distances from each end are generated. ###figure needed, explains point names### Then, for each of the first points, next points are generated in all directions with an angular increment of 5 degrees. Of these, all points that do not fit are immediately sorted out. Then all connections from the second points to the last points are checked for whether they have the right length. After this, the normal sorting steps are applied to these new connections.<br />
==Sorting out of paths==<br />
The previous part described the generation of paths that can connect the two extremities of the protein, irrespective of the position of these paths relative to the protein. While this allows fast computation of the geometrical paths, it also implies that the paths that are not practically feasible need to be sorted out. This is the most time-consuming part of the computation, as about 1 billion paths are generated. Three criteria are considered for the sorting. The first one is the feasibility of the linker: can the software find angle patterns that correspond to those defined by the geometrical path? This question was part of the motivation for a large modeling effort (link) to determine the possible angles between consecutive alpha helices. This was achieved by analyzing the distribution of angles between alpha helices found in the ArchDB database (link). As nearly any angle between 20 and 170 degrees could be found, only a few paths were actually rejected at that step. The next criterion is the position of the angle points: if any of them lies inside the protein, the path is rejected. Finally, the software checks whether any atom of the protein is less than 5 &Aring; away from any of the alpha helices; if so, the path is also rejected.<br />
==Shifting paths to the patterns==<br />
The strategy described in Step 3 gives a certain freedom to the rod that connects the last two angle points generated from the N- and C-terminal points. As this freedom is not permitted by the alpha helix and the angle pattern, but is permitted by the flexible part, for example at the C-terminal end, the software slightly refines the path by rotating the segment that originates from the C-terminal point.<br />
In particular, because of the discretization, the generated paths may not fit the helical and angle patterns perfectly. Therefore the paths need to be refined before they are translated into an amino acid sequence.<br />
Every path that has survived is analyzed step by step, starting from the starting point and advancing one angle point at a time. For each step from point to point, the length of the step is calculated and compared to the lengths we can build with the helix patterns. If the step is too long, the next point is shifted towards the previous one until it fits. If it is too short, the point is shifted away. These shifts do not exceed a certain length, so that the paths do not move so far that they pass through restricted areas.<br />
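The shifting of a single angle point can be sketched as below. This is an illustrative sketch only: the function name <code>shift_to_length</code>, the choice of the nearest buildable length as target and the <code>max_shift</code> parameter are assumptions, not the actual implementation.<br />

```python
import math


def shift_to_length(prev_point, next_point, lengths, max_shift):
    """Move next_point along the line through prev_point so that the step
    length matches the nearest buildable helix length.

    Returns the shifted point, or None if the required shift would exceed
    max_shift (assumed cap that keeps paths out of restricted areas)."""
    step = math.dist(prev_point, next_point)
    if step == 0.0:
        return None  # direction undefined, nothing to shift along
    target = min(lengths, key=lambda length: abs(length - step))
    if abs(target - step) > max_shift:
        return None
    scale = target / step  # < 1 shifts towards prev_point, > 1 shifts away
    return tuple(p + (n - p) * scale for p, n in zip(prev_point, next_point))
```

A step of 13 &Aring; with buildable lengths of 12 and 16 &Aring; would, for example, be shortened to 12 &Aring;, provided the cap allows a shift of 1 &Aring;.<br />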
This is also done before the weighting, so that the paths don't change after weighting anymore.<br />
==Weighting of paths==<br />
Before translating the paths into sequences and thus into linkers that can be expressed, each path needs to be evaluated for its potential capacity to enhance heat stability. For this we have identified different contributions that should be combined into one value that defines how good the linker may be and that consequently defines its ranking among all possible linkers. The smaller this value, the more we expect that the linker will enhance thermostability. An important step for all these contributions is the normalization, as explained in the next paragraphs.<br />
The first contribution we considered was the linker length. We assumed that a short linker is better to constrain the protein extremities, and that a long helix might give more flexibility. Because we wanted this value to be independent of the size of the protein, the length of the linker is normalized to the distance between the two termini.<br />
The second contribution relates to the angles used in the linker. We learned from the [[modeling ###link###]] that angles formed by a certain angle pattern follow a certain distribution. First, we assumed that the narrower the distribution, the more likely the alpha helices would actually produce this angle. Second, the angles found by the software should be as close as possible to those well-defined angles. In this case, the weight value from this contribution should be low.<br />
Then the distance of the linker to the protein is taken into account. Because the linkers should not disturb the protein in its normal environment, linkers that pass close to the protein surface are considered better linkers. The distance was defined as the minimal distance between the linker and all the atoms of the protein. As already mentioned for the sorting of the paths, a linker cannot come closer than 5 &Aring; and this distance was used for normalization of calculated distances.<br />
As linkers should also not be too close to the protein surface, this value is normalized with the minimal distance a linker should keep from the surface. The distance is calculated as the minimal distance between the atoms of the protein and the connection.<br />
After this, the places a linker should avoid are considered. Each protein can interact with other molecules on some parts of its surface. The user can specify where those parts are and how big they are. If a linker passes in front of a potential molecule binding domain, the value of the corresponding path goes to infinity, so that the linker is discarded. Conversely, the farther a linker is from a potential ligand binding domain, the smaller its weighting value. The user can also specify the importance of certain regions. In the end, the total weighting is normalized to the number of binding domains.<br />
===Calibrating the weighting function===<br />
???keep?###<br />
Every contribution has its own distribution. An example is shown in figure ### [[figure histogram_length_lys.png]]; the distributions all have different shapes. The aim is to find the paths that minimize all of these contributions. Therefore in the weighting function the four mentioned contributions are combined in a linear manner:<br />
\[ W(p) = \alpha L(p) + \beta A(p) + \gamma D(p) + u(p) \]<br />
where W is the final weighting, p the path, L the length contribution, A the angle contribution, D the distance contribution and u the contribution from the forbidden regions. $\alpha, \beta, \gamma$ are the weighting constants that needed to be found. The normalization of each contribution was chosen so that each of them is dimensionless and all of them have reasonably similar values.<br />
The weighting constants were obtained from the [[linker-screening ###lin]] performed with lysozyme and the [[enzyme-modeling ###link]]<br />
Please see [[below ###link]] for the detailed explanation, how the values were obtained.<br />
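As a minimal sketch of how the linear weighting function can be evaluated and used for ranking, assuming the four normalized contributions of a path have already been computed (the function and argument names are illustrative, not taken from the actual software):<br />

```python
import math


def path_weight(length_term, angle_term, distance_term, forbidden_term,
                alpha, beta, gamma):
    """W(p) = alpha * L(p) + beta * A(p) + gamma * D(p) + u(p).

    forbidden_term may be math.inf for a path that passes in front of a
    binding region, which pushes that path to the end of any ranking."""
    return (alpha * length_term + beta * angle_term
            + gamma * distance_term + forbidden_term)


def rank_paths(weights_by_path):
    """Return path identifiers sorted from best (smallest W) to worst."""
    return sorted(weights_by_path, key=weights_by_path.get)
```

Because a smaller value means a better linker, sorting in ascending order directly gives the ranking, and discarded paths with an infinite weight end up last.<br />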
==Translating paths to sequence==<br />
As already mentioned before the software is provided with two databases, one for the possible angle patterns and one for the helix patterns. The choice of the patterns was inspired by known crystal structures extracted from databases and described in different papers.<br />
A large in silico screening to refine the preferences of the patterns was then set up using the [[iGEM@home|###Link]] system. For the complete description of the search for suitable patterns, see the [[modeling|###Link to patternspart]] page. <br />
All the possible paths are now split up at the angles and compared with the possible patterns in the databases. ###Figure needed to explain, how the path is translated### The most suitable patterns are identified and joined to build the path's sequence. It is important to note that this is only possible because of the modularity of the linker patterns used as building blocks: each block, be it an alpha helix or an angle pattern, is not affected by the others. Thus for each possible path, one sequence is produced.<br />
==Clustering of paths==<br />
Many different paths are represented by the same sequence [[###Figure that shows, different paths have same properties, already before in the text ###]] and we therefore clustered such paths. The weights for those clustered paths were then calculated by averaging the weights of the different paths that compose a cluster.<br />
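The clustering can be illustrated with a short sketch, assuming each path is available as a pair of its translated sequence and its weight (this data layout is an assumption made for the example):<br />

```python
from collections import defaultdict


def cluster_by_sequence(paths):
    """Group (sequence, weight) pairs by sequence and average the weights.

    Returns a dict mapping each sequence to the mean weight of its paths."""
    weights_by_sequence = defaultdict(list)
    for sequence, weight in paths:
        weights_by_sequence[sequence].append(weight)
    return {sequence: sum(weights) / len(weights)
            for sequence, weights in weights_by_sequence.items()}
```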
=Results=<br />
==DNMT1==<br />
A major motivation of our effort to design rigid linkers with angles was the circularization of the DNA methyltransferase Dnmt1 (link). The truncated form used in our project is composed of 900 amino acids and the N- and C-terminal extremities are well separated. To circularize it, two linkers were designed: a flexible one made of glycine and serine, and a rigid one designed by the software. The rigid linkers for DNMT1 were obtained from an early version of the software. At that time the calculation took 11 days on a laptop computer with an Intel i5 processor and 8 GB of RAM, which shows the importance of a distributed computing system for large proteins. Since then the software has improved considerably, reducing the calculation time for DNMT1 to about 1 day.<br />
###DNMT1 could be made heatstable, still missing###<br />
==Feedback from wet lab==<br />
The results from the software for lysozyme were tested as described in the [[linker-screening]] part and evaluated as described in the [[enzyme modeling]] part. We have performed a large linker screening on 10 different lysozymes with different linkers. As the purpose of the lysozyme screen was the calibration of the software, those linkers were designed according to the four contributions previously mentioned. One of them was the shortest possible, one had the best possible angle, and so on.<br />
{| class="table table-hover"<br />
|+ '''Table 1''': Linkers and their amino acid sequences. Green: attachment sequences to prevent the flexible regions from being perturbed; Blue: angle; Purple: extein.<br />
! Linker<br />
! Amino acid sequence<br />
|- <br />
|<br />
|<br />
|-<br />
|'''Very good linkers'''<br />
|<br />
|-<br />
| sgt2<br />
| <span style="color:#3ADF00;">GG</span>AEAAAK<span style="color:#00BFFF;">AAAHPEA</span>AEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| rigid<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAKAA<span style="color:#3ADF00;">P</span><span style="color:#A901DB;">RGKCWE</span><br />
|-<br />
| <br />
| <br />
|-<br />
| '''Average linkers'''<br />
|<br />
|-<br />
| may1<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAKA<span style="color:#00BFFF;">AAAHPEA</span>AEAAAKEAAAKA<span style="color:#00BFFF;">KTA</span>AEAAAKEAAAKA<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| ord1<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAK<span style="color:#00BFFF;">ATGDLA</span>AEAAAKAA<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| ord3<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAK<span style="color:#00BFFF;">ASLPAA</span>AEAAAKEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
|<br />
| <br />
|-<br />
| '''Short linkers'''<br />
| <br />
|-<br />
| sho1<br />
| <span style="color:#3ADF00;">GG</span><span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| sho2<br />
| <span style="color:#3ADF00;">GG</span>AEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
|}<br />
[[circ_lam_lys_nils.png]] comparison with [[nice_linker_lysozyme_flexible_ends.png]]<br />
###About predictions of software###<br />
In the end we obtained a ranking of the in vitro tested linkers from the [[linker-screening]] and chose the parameters $\alpha, \beta, \gamma$ of the weighting function such that the ranking from the software reproduced the ranking from the assays. The final values were: $\alpha = 1.85 \times 10^{-6}, \beta = 0.57, \gamma = 50.8 \times 10^{6}$.<br />
To this end we set up a function that is minimized when the order of the weighting output is the same as the order from the wet lab.<br />
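The exact form of this function is not given here. One possible choice, shown purely as an assumption, is to count the pairs of linkers that the software and the wet lab order differently; this count is zero exactly when both rankings agree. The linker names are taken from Table 1, while the numbers used with it would be whatever weights and ranks the screening produced.<br />

```python
from itertools import combinations


def ranking_disagreement(software_weights, wetlab_ranks):
    """Count pairs of linkers ordered differently by the software weights
    and by the wet lab ranks (smaller value = better linker in both).

    Hypothetical objective for illustration: 0 means identical ordering."""
    discordant = 0
    for a, b in combinations(sorted(software_weights), 2):
        software_order = software_weights[a] - software_weights[b]
        wetlab_order = wetlab_ranks[a] - wetlab_ranks[b]
        if software_order * wetlab_order < 0:
            discordant += 1
    return discordant
```

The weighting constants would then be varied until this count reaches its minimum.<br />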
<br />
=Discussion=<br />
Will always be refined with more data from i@h...<br />
=References=<br />
<br />
[1] Thornton, J.M. & Sibanda, B.L. Amino and carboxy-terminal regions in globular proteins. Journal of molecular biology 167, 443-460 (1983).<br />
<br />
[2] Wang, C.K.L., Kaas, Q., Chiche, L. & Craik, D.J. CyBase: A database of cyclic protein sequences and structures, with applications in protein discovery and engineering. Nucleic Acids Research 36, (2008).</div>Igemnilshttp://2014.igem.org/Team:Heidelberg/pages/Linker_SoftwareTeam:Heidelberg/pages/Linker Software2014-10-17T22:42:16Z<p>Igemnils: /* References */</p>
<hr />
<div>=Abstract=<br />
As already [https://2014.igem.org/Team:Heidelberg/Toolbox/Circularization introduced], artificially circularized proteins may gain some heat stability because the C- and N-termini are restrained from moving around freely. This circularization may be trivial when the protein termini are very close to each other, which seems to be reasonably common [[#References|[1]]]. However, if the ends are too far from each other, a long linker is needed to connect them. This linker should not change the natural conformation of the protein and should constrain the relative position of the ends to restrict the degrees of freedom and thus to stabilize the structure even when heated up. In addition, these linkers should not affect any of the protein functions. Consequently it is important to prevent linkers from passing through the active site or from covering binding domains for other molecules, for example. Therefore one needs to be able to define the shape of possible linkers. This section describes the software we developed to design such linkers. We would like to stress that this work has been made possible thanks to the feedback between computer modeling and experimental work: we could first design linkers in silico, test them experimentally and use the results to further calibrate the software. To our knowledge, this is the first time that such an approach has been used to custom-design rigid linkers with angles to connect protein extremities.<br />
<br />
=General procedure=<br />
In short, the software can provide a weighted list of linkers to circularize any protein of interest with a known structure. Those linkers are made of rigid alpha helical segments connected with defined angles. Contrary to flexible linkers, those rigid linkers were expected to constrain the protein extremities and to confer better heat stability. Such an idea was already developed [[#References|[2]]] but only with alpha helices defining simple rods, and without any possibility to introduce angles. To generate those linkers, we first defined the geometrical paths, with segments and angles, that they should follow. The geometrical paths that are biologically feasible are afterwards translated into amino acid sequences. Both the compatibility of paths with possible structures and the translation were made possible thanks to our [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling approaches]. The first approach consisted in performing a statistical analysis of more than 17000 known non-homologous structures containing alpha helices connected with angles. For the second approach, we modeled the conformation of linkers circularizing proteins of known structure and analyzed them for certain properties. This second approach was run for a large number of proteins thanks to our distributed computing system [https://2014.igem.org/Team:Heidelberg/Software/igemathome igemathome]. The software provides different possible linkers with weights that rank the linkers according to their capacity to maintain protein activity at higher temperatures. The weights were calibrated thanks to an extensive [https://2014.igem.org/Team:Heidelberg/Project/Linker_Screening linker screening] on the target protein lambda-lysozyme, using the first modeling approach.<br />
<br />
=Background=<br />
Classically, protein linkers have been designed in three different ways. The easiest way is to define the length that a linker should cover and then simply use a flexible glycine-serine peptide with the right number of amino acids to match this length. Glycine is used for flexibility, as it has no side chain and does not produce any steric hindrance, while serine is used for solubility, as it has a small polar side chain. This solubility is important, as the linkers should not pass through the hydrophobic core of the protein, but should be dissolved in the surrounding medium. These flexible linkers have normally been used for circularization, but also for connecting different proteins when the main goal is that the different parts are connected and not how they are connected, or when the flexibility of the linker is required for specific applications. <br />
<br />
A second strategy consists in using rigid helical linkers to keep proteins or protein domains at a certain distance from each other. This is especially important for signalling proteins and fluorescent proteins. One major property of alpha helices is that they always fold in a defined way with well-defined angles and lengths. There are also many different helical patterns that differ in stability and solubility. They have also been used to design circularizing linkers [[#References|[2]]]. One big disadvantage of this strategy is that one can only build straight linkers with helices. So in the context of circularization, if a straight line connecting the protein extremities crosses the protein, this strategy is not an option.<br />
<br />
The third option, which served as a basis for our approach and which came from discussions with the group of Rebecca Wade in Heidelberg, Germany, consists in designing custom-tailored linkers for each specific application. These linkers can be obtained from protein structure prediction. At first one needs to define the path that the linker should take to connect the protein ends. Afterwards, one designs a possible linker sequence that might fit well. Next one makes a structure prediction of the linker attached to the protein to validate the design. Several different linkers, with slight changes, can be compared. This is repeated several times until the linker effectively follows the expected path. This method requires strong knowledge of protein folding and protein structure prediction and is computationally intensive. On the other hand, the benefit can be important, as the interaction of the linker with the protein surface can be taken into account and as one can accurately define the path taken by the linker to the resolution of the protein structure.<br />
<br />
We have set up a completely new strategy to design rigid linkers. As further detailed in the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling] part, it is possible to define the shape of a linker by combining rigid alpha helical rods with well-defined angle patterns. Therefore, by defining, in a geometrical way, the possible paths of the circularizing linkers for a given protein, we can then propose potential linkers. This definition of the geometrical path can be very difficult, especially for large proteins with complex shapes. Moreover, this definition is further constrained by the fact that linkers must avoid hiding active sites of the protein of interest. Finally the paths have rotational degrees of freedom at the extremities of the protein, and depending on their orientation, they may or may not match the geometry of the protein. The tool we present here covers the two steps: defining geometrical paths with some weights and translating them into feasible linkers, also with weights. This tool is universal, as it has the capacity to design circularizing linkers for any protein with a known structure. Moreover it is modular as, thanks to our [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling approach], we have designed linkers as exchangeable blocks of rods of different lengths and of angle patterns. The following sections detail the different steps followed by our software to design proper linkers.<br />
<br />
<br />
==PDB parsing==<br />
At first, the PDB file containing the structure of the target protein is parsed and the coordinates of the atoms are stored, in metric units. After this, some initial tests are made on the protein structure. First, we check whether the C- and N-termini lie on the surface of the protein and are accessible to the solvent, which is crucial for circularization. We define a line originating from an extremity of the protein with the two angles of the spherical coordinates around the z-axis. From that, we determine the accessible angles by rejecting all the lines that are too close to the protein. As the future linker will be made of alpha helices and will therefore have a radius of 5 &Aring;, we used this length as the minimal allowed distance.<br />
Those allowed angles are stored for the coming linker generation. ###fig needed### <br />
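The coordinate extraction at the start of this section can be sketched as follows. This minimal example relies only on the fixed-column layout of ATOM records in the PDB format (x, y and z in columns 31 to 54) and keeps the coordinates in &Aring;ngstr&ouml;m as given in the file; the function name is illustrative and any unit conversion done by the actual software is not shown.<br />

```python
def parse_pdb_atoms(pdb_text):
    """Return a list of (x, y, z) tuples, in Angstrom, for all ATOM records.

    PDB is a fixed-column format: x occupies columns 31-38, y columns 39-46
    and z columns 47-54 (1-based), i.e. the slices used below."""
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            x = float(line[30:38])
            y = float(line[38:46])
            z = float(line[46:54])
            atoms.append((x, y, z))
    return atoms
```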
<br />
==Generation of geometric paths==<br />
As our strategy consists in building linkers with helical rods and connecting angles, a path is completely defined by the coordinates of the angle points. Advancing one step from an existing point is always done by adding a displacement vector to this point. This vector is defined by the two spherical angles, chosen here in a discrete manner with an increment of 5 degrees, and by a length, also chosen in a discrete manner. This discrete length is used in two different contexts: it may correspond to the length of an alpha helix or to the length of the flexible part that appears at the extremity of the protein. The coordinates that are reached thanks to this vector define the new coordinates of an angle point, no matter whether the vector corresponds to an alpha helix or to a flexible part. ###figure, points + vectors###. As we screen all possible angles in a discrete manner, the coordinates of those angle points are regularly distributed on a sphere. As further detailed in the next sections, those spheres are defined from both ends, either once or in several steps. Then the software checks for possible straight connections of given lengths for each pair of angle points originating from both extremities. ###figure needed, with all these possible steps, one sphere around each point and checking for the connections, 2D, different lengths in two different figures, ###<br />
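The displacement step described above can be sketched as follows, using the usual spherical coordinates with the polar angle measured from the z-axis. The function names and the exact angular ranges are assumptions made for illustration.<br />

```python
import math


def displace(point, theta_deg, phi_deg, length):
    """Add a displacement vector of the given length to `point`.

    theta is the polar angle from the z-axis, phi the azimuthal angle."""
    theta = math.radians(theta_deg)
    phi = math.radians(phi_deg)
    x, y, z = point
    return (x + length * math.sin(theta) * math.cos(phi),
            y + length * math.sin(theta) * math.sin(phi),
            z + length * math.cos(theta))


def sphere_points(center, length, step_deg=5):
    """Generate candidate angle points on a sphere around `center`,
    scanning both spherical angles in increments of step_deg degrees."""
    points = []
    for theta_deg in range(0, 181, step_deg):
        for phi_deg in range(0, 360, step_deg):
            points.append(displace(center, theta_deg, phi_deg, length))
    return points
```

With a 5 degree increment this naive scan produces 37 x 72 = 2664 points per sphere and length, including repeated points at the two poles, which illustrates why the number of paths grows so quickly with each additional edge.<br />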
The linkers are built in a modular way, with blocks of well-defined size. From the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling] of potential linkers, we could derive 8 different alpha helical rods, all with different lengths. In addition, the length of the two segments inside an angle block was always 8 &Aring;, so exchanging angle blocks does not affect the length of the linker. This means that the distance between the angle points is well defined, an essential aspect of our strategy of linker design.<br />
The software proceeds in three steps. First, it checks for the possibility of a direct single alpha helix linker. For this, it applies the procedure just mentioned with spheres whose radius reasonably corresponds to the length of the short parts at the extremities of the protein. Second, it tests whether a linker containing two alpha helices connected with a right angle allows the circularization. Finally it searches for possible linkers with three angle points. The next parts explain those three steps in detail.<br />
This method was chosen because it could be implemented easily and efficiently in our program. However, this strategy generates paths that cross the protein. Therefore we put considerable effort into sorting out the paths.<br />
<br />
===Step 1===<br />
As a simple rigid linker with no angle would be easier to design and likely more thermostable than the ones containing angles, the software first checks if this simple solution is possible ###figure needed###. For this, we took into account the fact that proteins have some flexible amino acids at their extremities. This flexible part may come from the protein itself, but also from the 2 glycines that are included at the N-terminal part and from the extein at the C-terminal part. The latter two parts come from our linkers. Those parts have no preferential angles and offer a large number of possibilities to insert fitting linkers. But this flexibility is also a drawback, as we have to include this large number of possible angles and lengths in our path search.<br />
In this first step, the software explicitly takes these flexible parts into account to check for the possibility of straight linkers. As the angles and the length of the flexible parts are variable, the software positions their extremities on a sphere centered on the last fixed position of the structure, as explained above. The radius of this sphere is incremented in a discrete manner, in 4 steps, from 5.25 &Aring; to the maximum length of the flexible part. ###figure###<br />
Then all possible straight segments between the first points and the last points are tested. If they are closer than 5 &Aring; to the protein, or if they cross it, then they are rejected. If they are kept, then the software checks whether the length of the segments is compatible with the feasible alpha helices: if the length of a given segment equals one of the 8 alpha helix lengths plus or minus 0.75 &Aring;, then the path is saved.<br />
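The length check at the end of this step can be sketched as follows. The 0.75 &Aring; tolerance is taken from the text above, whereas the function name and the helix lengths passed to it are placeholders, since the 8 real lengths come from the pattern database.<br />

```python
import math


def matching_helix_length(start, end, helix_lengths, tolerance=0.75):
    """Return the helix length closest to the start-end distance if it lies
    within `tolerance` (in Angstrom), otherwise None."""
    distance = math.dist(start, end)
    closest = min(helix_lengths, key=lambda length: abs(length - distance))
    if abs(closest - distance) <= tolerance:
        return closest
    return None
```

A segment is kept as a candidate straight linker only if this function returns a length and the segment also passes the distance test against the protein atoms.<br />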
===Step 2===<br />
The next way to design more complex rigid linkers, while still taking the flexible ends into account within a reasonable calculation time, was to reduce the number of possible angles. As we originally thought that a 90° angle would be practically feasible, the software was designed to generate linkers with flexible ends and one 90° angle. ###figure, protein_one_90° angle### This choice was made notably because the edge lengths of a right triangle are simple to calculate. We already saw in Step 1 that the length of an edge can only take 8 different values. As the linkers have to start from the extremities of the protein, and as we impose a right angle, the number of possible paths is low, making them easy to compute. In practice, the extremities of the protein are positioned in a flexible way as in Step 1. From each of the positions allowed by this flexibility, the software searches for all the allowed right triangles. This restriction was mainly introduced because the degrees of freedom needed to be limited to keep the calculations feasible. <br />
===Step 3===<br />
Finally the software also provides the possibility to find paths with up to 4 edges, meaning 4 alpha helices and 3 angles. Thanks to the modularity of the possible linkers, such paths can offer the possibility to circularize theoretically any kind of protein. ###figure of torus###<br />
To keep the calculation feasible in a reasonable time, we designed the search strategy so that the flexible part at each extremity is oriented in the same direction as the adjacent alpha helix. This obviously restricts the search, but as these orientations are allowed for the flexible part, the approach remains valid. <br />
First, potential end points of the first alpha helical rod are calculated from the N-terminal point of the protein. The orientation is chosen in a discrete manner, with an increment of 5 degrees for the two angles of the spherical coordinates. The distance from the origin corresponds to the 8 possible lengths allowed by the alpha helices, as already seen in Step 2, plus a length of 0, which mimics a linker with 3 instead of 4 edges. The exact same procedure is repeated to define all the potential end points of the second alpha helical rod, starting from all the possible end points of the first alpha helical rod. Thanks to the possibility of a length of 0 for the first and the second rods, the software also calculates paths with 2 edges. Then, the same is done only once from the C-terminal point of the protein, defining 1 edge. The final step consists in checking whether the points originating from the N- and C-terminal points can be linked by a potential alpha helix, i.e. whether they are separated by an appropriate distance. If the distance between two points lies within 0.75 &Aring; of any of the potential alpha helix lengths, then the path is saved. In the same way, if two points are closer than 0.75 &Aring; to each other, then the path is also saved.<br />
<br />
==Sorting out of paths==<br />
The previous part described the generation of paths that can connect the two extremities of the protein, irrespective of the position of these paths relative to the protein. While this allows fast computation of the geometrical paths, it also implies that the paths that are not practically feasible need to be sorted out. This is the most time-consuming part of the computation, as about 1 billion paths are generated. Three criteria are considered for the sorting. The first one is the feasibility of the linker: can the software find angle patterns that correspond to those defined by the geometrical path? This question was part of the motivation for a large modeling effort (link) to determine the possible angles between consecutive alpha helices. This was achieved by analyzing the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling distribution of angles] between alpha helices found in the ArchDB database. As nearly any angle between 20 and 170 degrees could be found, only a few paths were actually rejected at that step. The next criterion is the position of the angle points: if any of them lies inside the protein, the path is rejected. Finally, the software checks whether any atom of the protein is less than 5 &Aring; away from any of the alpha helices; if so, the path is also rejected.<br />
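The third criterion amounts to a point-to-segment distance test for every atom and every helix. A minimal sketch, treating each helix as the straight segment between two angle points and using the 5 &Aring; threshold mentioned above (function names are illustrative):<br />

```python
import math


def point_segment_distance(point, seg_start, seg_end):
    """Shortest distance between a point and the segment seg_start-seg_end."""
    direction = [e - s for s, e in zip(seg_start, seg_end)]
    to_point = [p - s for s, p in zip(seg_start, point)]
    length_sq = sum(c * c for c in direction)
    if length_sq == 0.0:
        return math.dist(point, seg_start)  # degenerate segment
    # Projection parameter, clamped so the closest point stays on the segment
    t = sum(a * b for a, b in zip(to_point, direction)) / length_sq
    t = max(0.0, min(1.0, t))
    closest = [s + t * c for s, c in zip(seg_start, direction)]
    return math.dist(point, closest)


def segment_is_clear(seg_start, seg_end, atoms, min_distance=5.0):
    """True if no atom is closer than min_distance (Angstrom) to the segment."""
    return all(point_segment_distance(atom, seg_start, seg_end) >= min_distance
               for atom in atoms)
```

Since this test runs over all atoms for every segment of every candidate path, it is plausible that it dominates the running time of the sorting.<br />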
<br />
==Shifting paths to the patterns==<br />
<br />
The strategy described in step 3 gives a certain freedom to the rod that connects the last two angle points that were generated from the N- and C-terminal points. As this freedom is actually not permitted by the alpha helix and the angle pattern, but is permitted by the flexible part, for example at the C-terminal end, the software slightly refines the path by rotating the segment that originates from the C-terminal point.<br />
<br />
==Weighting of paths==<br />
Before translating the paths into sequences and thus into linkers that can be expressed, each path needs to be evaluated for its potential capacity to enhance heat stability. For this we have identified different contributions that should be combined into one value that defines how good the linker may be and that consequently defines its ranking among all possible linkers. The smaller this value, the more we expect that the linker will enhance thermostability. An important step for all these contributions is the normalization, as explained in the next paragraphs.<br />
The first contribution we considered was the linker length. We assumed that a short linker is better to constrain the protein extremities, and that a long helix might give more flexibility. Because we wanted this value to be independent of the size of the protein, the length of the linker is normalized to the distance between the two termini.<br />
The second contribution relates to the angles used in the linker. We learned from the [[modeling ###link###]] that angles formed by a certain angle pattern follow a certain distribution. First, we assumed that the narrower the distribution, the more likely the alpha helices would actually produce this angle. Second, the angles found by the software should be as close as possible to those well-defined angles. In this case, the weight value from this contribution should be low.<br />
Then the distance of the linker to the protein is taken into account. Because the linkers should not disturb the protein in its normal environment, linkers that pass close to the protein surface are considered better linkers. The distance was defined as the minimal distance between the linker and all the atoms of the protein. As already mentioned for the sorting of the paths, a linker cannot come closer than 5 &Aring; and this distance was used for normalization of calculated distances.<br />
After this, the places a linker should avoid are calculated. Each protein can interact with other molecules on some parts of its surface. The user can specify where and how big those parts are. If a linker passes in front of a potential binding domain, the value of the corresponding path goes to infinity, so that the linker is discarded. Conversely, the farther a linker is from a potential ligand binding domain, the smaller its weighting value. The user can also specify the importance of certain regions. In the end, the total weighting is normalized to the number of binding domains.<br />
<br />
===Calibrating the weighting function===<br />
???keep?###<br />
Every contribution has its own distribution. You can see an example in figure ### [[figure histogram_length_lys.png]], but all of them have different shapes. The aim is to find the paths that minimize all of these contributions. Therefore, in the weighting function, the four mentioned contributions are combined in a linear manner:<br />
\[ W(p) = \alpha L(p) + \beta A(p) + \gamma D(p) + \delta u(p) \]<br />
where W is the final weighting, p the path, L the length contribution, A the angle contribution, D the distance contribution and u the contribution from the forbidden regions. $\alpha, \beta, \gamma, \delta$ are the weighting constants that needed to be found. The normalization of each contribution was chosen so that each of them is dimensionless and that all have reasonably similar values.<br />
The weighting constants were obtained from the [[linker-screening ###lin]] performed with lysozyme and the [[enzyme-modeling ###link]].<br />
Please see [[below ###link]] for a detailed explanation of how the values were obtained.<br />
<br />
==Translating paths to sequence==<br />
As already mentioned before the software is provided with two databases, one for the possible angle patterns and one for the helix patterns. The choice of the patterns was inspired by known crystal structures extracted from databases and described in different papers.<br />
A huge in silico screening for refining the preferences of the patterns was then set up using the [[iGEM@home|###Link]] system. For the complete description of the search for suitable patterns, one can read the [[modeling|###Link to patternspart]] page.<br />
All the possible paths are now split up at the angles and compared with the possible patterns in the databases. ###Figure needed to explain, how the path is translated### The most suitable patterns are identified and joined together to build the path's sequence. It is important to notice that this is only possible because of the modularity of our linker patterns used as building blocks: each block, be it an alpha helix or an angle pattern, is not affected by the others. Thus for each possible path, one sequence is produced.<br />
==Clustering of paths==<br />
Many different paths are represented by the same sequence [[###Figure that shows, different paths have same properties, already before in the text ###]] and we therefore clustered such paths. The weights for those clustered paths were then calculated by averaging the weights of the different paths that compose a cluster.<br />
<br />
=Results=<br />
IMPORTANT!<br />
paths-> geometrical<br />
linkers-> helices...<br />
sequences-> seq of linkers<br />
insist on huge number of data produced<br />
We definitely need to cite [0]!!!<br />
=Abstract=<br />
=Overview=<br />
As already introduced [###link to circularization idea in the toolbox], artificially circularized proteins may gain some heat stability by restraining the C- and N-terminus from moving around freely. This circularization may be trivial when the protein termini are very close to each other, which seems to be reasonably common [[#References|[1]]]. However, if the ends are too far from each other, a long linker is needed to connect them. This linker should not change the natural conformation of the protein and should constrain the relative position of the ends to restrict the degrees of freedom and thus stabilize the structure even when heated up. In addition, these linkers should not affect any of the protein functions. Consequently it is important to prevent linkers from passing through the active site or from covering domains that bind other molecules, for example. Therefore one needs to be able to define the shape of possible linkers. This section describes the software we developed to design such linkers. We would like to stress that this work has been made possible thanks to the feedback between computer modeling and experimental work: we could first design linkers in silico, test them experimentally and use the results to further calibrate the software. To our knowledge, this is the first time that such an approach has been used to custom-design linkers that connect protein extremities.<br />
In short, the software can provide a weighted list of linkers to circularize any protein of interest with a known structure. Those linkers are made of rigid alpha-helical segments connected with defined angles. Contrary to flexible linkers, those rigid linkers were expected to constrain the protein extremities and to confer better heat stability. To generate those linkers, we first defined the geometrical paths, with segments and angles, that they should follow. The geometrical paths that are biologically feasible are afterwards translated into amino acid sequences. Both the compatibility of paths with possible structures and the translation were made possible thanks to our modeling approaches. The first approach consisted in performing a statistical analysis of more than 17000 known non-homologous structures containing alpha helices connected with angles. For the second approach, we modeled the conformation of linkers circularizing proteins of known structure and analyzed them for certain properties. This second approach was run for a large number of proteins thanks to our distributed computing system igemathome. The software provides different possible linkers with weights that rank the linkers according to their capacity to maintain protein activity at higher temperatures. These weights were generated in an extensive [[linker screening| ### ]] on the target protein lambda-lysozyme, using the first modeling approach.<br />
The software was checked for running stability in a huge test over the [[iGEM@home]]. <br />
Furthermore our software provided linkers for circularizing [[DNMT1| ###]], that could be made more heatstable due to circularization???.<br />
For detailed information on the implementation and the practical use of the software, please see the [[documentation software-docu]] page.<br />
==What does the software do==<br />
still missing [[figure 0, graph abstract]]<br />
=Background=<br />
What are linkers, why should ours work better?<br />
Linkers, References to the modeling, <br />
Classically, protein linkers were designed in three different manners. ###REFERENCE### The easiest way is to define the length that a linker should cover and then simply use a flexible glycine-serine peptide with the right number of amino acids to match this length. Glycine is used for flexibility, as it has no side chain and does not produce any steric hindrance, while serine is used for solubility, as it has a small polar side chain. This solubility is important, as the linkers should not pass through the hydrophobic core of the protein, but should be dissolved in the surrounding medium. These flexible linkers were normally used for circularization but also for connecting different proteins, when the main aspect is that the different parts are connected, but not how they are connected, or when the flexibility of the linker was required for specific applications.<br />
A second strategy consists in using rigid helical linkers to keep proteins or protein domains at a certain distance from each other. This is especially important for signalling proteins and fluorescent proteins. ###TODO: Reference### One major property of alpha helices is that they always fold in a defined way with well-defined angles and lengths. There are also many different helical patterns that differ in stability and solubility. One big disadvantage of this strategy is that one can only build straight linkers with helices. So in the context of circularization, if a straight line connecting the protein extremities crosses the protein, this strategy is not an option.<br />
The third option, which served as a base to develop our approach, consists in designing custom-tailored linkers for each specific application. These linkers can be obtained from protein structure prediction. At first one needs to define the path that the linker should take to connect two amino acids. Afterwards one designs a possible linker sequence that might fit well. Next one makes a structure prediction of the linker attached to the proteins to validate the prediction. Several different linkers, with slight changes, can be compared. This is repeated several times until the linker effectively follows the expected path. ###TODO: Reference, WADE paper### This method is time-consuming, as it is not only computationally intensive but also requires strong knowledge of protein folding and protein structure prediction. On the other hand, the benefit can be important, as the interaction of the linker with the protein's surface can be taken into account and as one can accurately define the path taken by the linker to the resolution of the protein structure.<br />
We have set up a completely new strategy to design rigid linkers. As further detailed in the [[###link###|modelling]] part, it is possible to define the shape of a linker by combining rigid alpha helical rods with well-defined angle patterns. Therefore, by defining, in a geometrical way, the possible paths of the circularizing linkers for a given protein, we can then propose potential linkers. This definition of the geometrical path can be very difficult, especially for large proteins with complex shapes. Moreover, this definition is further constrained by the fact that linkers must avoid hiding active sites of the protein of interest. Finally, the paths have rotational degrees of freedom at the extremities of the protein, and depending on their orientation, they may or may not match the geometry of the protein. The tool we present here covers the two steps: defining geometrical paths with some weights and translating them into feasible linkers, also with weights. This tool is universal, as it has the capacity to design circularizing linkers for any protein with a known structure. Moreover it is modular as, thanks to our modeling approach [link], we have designed linkers as exchangeable blocks of rods of different lengths and of angle patterns. The following sections detail the different steps followed by our software to design proper linkers.<br />
==PDB analysis==<br />
At first, the PDB file containing the structure of the target protein is parsed and the coordinates of the atoms are stored in metric units. After this, some initial tests are made with the protein structure. First, we check whether the C- and N-termini lie on the surface of the protein and are accessible to the solvent, which is crucial for circularization. We define a line originating from an extremity of the protein with the two angles of the spherical coordinates around the z-axis. From that, we can determine the accessible angles by rejecting all the lines that are too close to the protein. As the future linker will be made of alpha helices and will therefore have a radius of 5 &Aring;, we used this length as the minimal allowed distance.<br />
Those allowed angles are stored for the coming linker generation. ###fig needed###<br />
==Generation of geometric paths==<br />
As our strategy consists in building linkers with helical rods and connecting angles, a path is completely defined by the coordinates of the angle points. Advancing one step from an existing point is always done by adding a displacement vector to this point. This vector is defined by the two spherical angles, chosen here in a discrete manner with an increment of 5 degrees, and by a length, also chosen in a discrete manner. This discrete length is used in two different contexts: it may correspond to the length of an alpha helix or to the length of the flexible part that appears at the extremity of the protein. The coordinates that are reached thanks to this vector define the new coordinates of an angle point, no matter if the vector corresponds to an alpha helix or to a flexible part. ###figure, points + vectors###. As we screen for all possible angles in a discrete manner, those angle point coordinates are regularly distributed on a sphere. As further detailed in the next sections, those spheres are defined from both ends, either once or in several steps. Then the software checks for possible straight connections of given lengths for each pair of angle points originating from both extremities. ###figure needed, with all these possible steps, one sphere around each point and checking for the connections, 2D, different lengths in two different figures, ###<br />
The linkers are built in a modular way, with blocks of well-defined size. From the modeling of potential linkers [link], we could derive 8 different alpha helical rods, all with different lengths. In addition, the length of the two segments inside an angle block was always 8 &Aring;, so exchanging angle blocks does not affect the length of the linker. This means that the distance between the angle points is well defined, an essential aspect of our strategy of linker design.<br />
The software proceeds in three steps. First, it checks for the possibility of a direct linker made of a single alpha helix. For this, it applies the procedure just mentioned with spheres whose radius reasonably corresponds to the length of the short parts at the extremity of the protein. Second, it tests whether a linker containing two alpha helices connected with a right angle allows the circularization. Finally, it searches for the possible linkers with three angle points. The next parts will explain those three steps in detail.<br />
This method was chosen because it could be implemented easily and efficiently in our program. However, this strategy generates paths that cross the protein. Therefore we put considerable effort into sorting out the paths.<br />
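The construction of candidate angle points on a sphere can be illustrated with a minimal Python sketch. It is not the code of our software: the function name <code>sphere_points</code> and the convention of polar and azimuthal angles are assumptions chosen for illustration, and only the 5 degree increment is taken from the text above.<br />

```python
import math


def sphere_points(origin, radius, step_deg=5):
    """Return candidate angle points at `radius` from `origin`.

    The two spherical angles are scanned on a regular grid with
    `step_deg` degrees between neighbouring values.
    """
    ox, oy, oz = origin
    points = []
    for theta_deg in range(0, 181, step_deg):  # polar angle, 0..180
        theta = math.radians(theta_deg)
        # At the two poles every azimuth gives the same point, so keep one.
        if theta_deg in (0, 180):
            phi_values = [0]
        else:
            phi_values = range(0, 360, step_deg)
        for phi_deg in phi_values:  # azimuthal angle, 0..355
            phi = math.radians(phi_deg)
            # Displacement vector added to the existing point.
            points.append((
                ox + radius * math.sin(theta) * math.cos(phi),
                oy + radius * math.sin(theta) * math.sin(phi),
                oz + radius * math.cos(theta),
            ))
    return points
```

With a 5 degree increment this grid contains 2522 points per sphere and per length, which gives an idea of how quickly the number of paths grows when several spheres are chained.<br />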
===Step 1===<br />
As a simple rigid linker with no angle would be easier to design and likely more thermostable than the ones containing angles, the software first checks if this simple solution is possible ###figure needed###. For this, we took into account the fact that proteins have some flexible amino acids at their extremities. This flexible part may come from the protein itself, but also from the 2 glycines that are included at the N-terminal part and from the extein at the C-terminal part. The latter two parts come from our linkers. Those parts have no preferential angles and offer a large number of possibilities to insert fitting linkers. But this flexibility is also a drawback, as we have to include this large number of possible angles and lengths in our path search.<br />
In this first step, the software explicitly takes these flexible parts into account to check for the possibility of straight linkers. As the angles and the length of the flexible parts are variable, the software positions their extremities on a sphere centered on the last fixed position of the structure, as explained above. The radius of this sphere is incremented in a discrete manner, in 4 steps, from 5.25 &Aring; to the maximum length of the flexible part. ###figure###<br />
Then all possible straight segments between the points generated from the two extremities are tested. If they are closer than 5 &Aring; to the protein, or if they cross it, then they are rejected. If they are kept, then the software checks whether the length of the segments is compatible with the feasible alpha helices: if the length of a given segment equals one of the 8 alpha helix lengths plus or minus 0.75 &Aring;, then the path is saved.<br />
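The length test of Step 1 can be sketched as follows. This is only an illustration under stated assumptions: the 8 rod lengths in <code>HELIX_LENGTHS</code> are placeholder values, not the lengths derived from our modeling, and the function names are hypothetical. Only the tolerance of 0.75 &Aring; comes from the text.<br />

```python
import math

# Placeholder values in Angstrom: the real 8 rod lengths come from the
# linker modeling and are not reproduced here.
HELIX_LENGTHS = [10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0, 45.0]
TOLERANCE = 0.75  # Angstrom, as stated in the text


def matching_helix(distance, lengths=HELIX_LENGTHS, tol=TOLERANCE):
    """Return the rod length closest to `distance` within `tol`, else None."""
    candidates = [length for length in lengths if abs(length - distance) <= tol]
    if not candidates:
        return None
    return min(candidates, key=lambda length: abs(length - distance))


def segment_fits(point_a, point_b):
    """True if the straight segment between two 3D points matches a rod."""
    return matching_helix(math.dist(point_a, point_b)) is not None
```

A segment is kept only if some rod length falls inside the tolerance window around its length, so most of the randomly oriented segments are discarded at this stage.<br />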
===Step 2===<br />
The next possibility to design more complex rigid linkers, while still taking flexible ends into account within a reasonable calculation time, was to reduce the number of possible angles. As we originally thought that a 90° angle would be practically feasible, the software was designed to generate linkers with flexible ends and one 90° angle. ###figure, protein_one_90° angle### This choice was notably made because of the simplicity of calculating the lengths of right-triangle edges. We already saw in Step 1 that the length of an edge can only take 8 different values. As the linkers have to start from the extremities of the protein, and as we impose a right angle, the number of possible paths is therefore low, making them easy to compute. Practically, the extremities of the proteins are positioned in a flexible way as in Step 1. From each of the positions allowed by this flexibility, the software searches for all the allowed right triangles. This restriction was mainly introduced because the degrees of freedom needed to be limited to keep the calculations feasible.<br />
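The right-triangle search of Step 2 reduces to the Pythagorean relation between the two rod lengths and the distance between the two end positions. The sketch below only enumerates the compatible pairs of lengths; it does not place the angle point in space. The rod lengths are again placeholder values and the function name is hypothetical.<br />

```python
import math

# Placeholder values in Angstrom, not the rod lengths from the modeling.
HELIX_LENGTHS = [10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0, 45.0]
TOLERANCE = 0.75  # Angstrom


def right_angle_pairs(end_to_end, lengths=HELIX_LENGTHS, tol=TOLERANCE):
    """List the (first, second) rod lengths that span `end_to_end`.

    Two rods joined at 90 degrees form the legs of a right triangle whose
    hypotenuse is the distance between the two end positions.
    """
    pairs = []
    for first in lengths:
        for second in lengths:
            if abs(math.hypot(first, second) - end_to_end) <= tol:
                pairs.append((first, second))
    return pairs
```

With 8 rod lengths there are at most 64 ordered pairs to test per pair of end positions, which is why this step stays cheap.<br />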
===Step 3===<br />
Finally the software also provides the possibility to find paths with up to 4 edges, meaning 4 alpha helices and 3 angles. Thanks to the modularity of the possible linkers, such paths can offer the possibility to circularize theoretically any kind of protein. ###figure of torus###<br />
To keep the calculation feasible in a reasonable time, we designed the search strategy so that the flexible parts at the extremities are oriented in the same direction as the adjacent alpha helix. This obviously restricts the search, but as these orientations are allowed for the flexible parts, the approach remains valid.<br />
First, potential ending points of the first alpha helical rod are calculated from the N-terminal point of the protein. The orientation is chosen in a discrete manner, with an increment of 5 degrees for the two angles of the spherical coordinates. The distance from the origin corresponds to one of the 8 possible lengths allowed by the alpha helices, as already seen in Step 2, plus a length of 0, which mimics a linker with 3 instead of 4 edges. The exact same procedure is repeated to define all the potential ending points of the second alpha helical rod, starting from all the possible ending points of the first alpha helical rod. Thanks to the possibility of a length of 0 for the first and the second rods, the software also calculates paths with 2 edges. Then, the same is done only once from the C-terminal point of the protein, defining 1 edge. The final step consists in checking whether the points originating from the N- and C-terminal points can be linked by a potential alpha helix, i.e. whether they are separated by an appropriate distance. If any of the potential alpha helix lengths lies within plus or minus 0.75 &Aring; of the distance between two points, then the path is saved. In the same way, if two points are closer than 0.75 &Aring; to each other, then the path is also saved.<br />
==Sorting out of paths==<br />
The previous part described the generation of paths that can connect the two extremities of the protein irrespective of the position of these paths relative to the protein. While this allows fast computation of the geometrical paths, it also implies that the paths that are not practically feasible need to be sorted out. This is the most time-consuming part of the computation, as about 1 billion paths are generated. Three criteria are considered for the sorting. The first one is the feasibility of the linker: can the software find angle patterns that correspond to the ones defined by the geometrical path? This question was part of the motivation for a large modeling effort (link) to determine the possible angles between consecutive helices. This was achieved by analyzing the distribution of angles between alpha helices found in the ArchDB database (link). As nearly any angle between 20 and 170 degrees could be found, only few paths were actually rejected at that step. The second criterion is the position of the angle points: if any of them lies inside the protein, then the path is rejected. Finally, the software checks the distance to the protein: if any atom of the protein is less than 5 &Aring; away from any of the alpha helices, then the path is also rejected.<br />
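The third criterion amounts to a point-to-segment distance test between every atom and every rod. A minimal sketch is given below, treating each alpha helix as a straight segment between two angle points. The function names are hypothetical, and the sketch assumes that atoms belonging to the termini themselves have already been removed from <code>atoms</code> by the caller, since the first and last segments necessarily start at the protein.<br />

```python
import math


def point_segment_distance(p, a, b):
    """Shortest distance from point `p` to the segment from `a` to `b`."""
    ab = [b[i] - a[i] for i in range(3)]
    ap = [p[i] - a[i] for i in range(3)]
    ab_sq = sum(c * c for c in ab)
    if ab_sq == 0.0:  # degenerate segment (rod of length 0)
        return math.dist(p, a)
    # Projection of p onto the line, clamped to the segment ends.
    t = sum(ap[i] * ab[i] for i in range(3)) / ab_sq
    t = max(0.0, min(1.0, t))
    closest = [a[i] + t * ab[i] for i in range(3)]
    return math.dist(p, closest)


def path_clashes(atoms, angle_points, min_dist=5.0):
    """True if any atom is closer than `min_dist` to any rod of the path."""
    for start, end in zip(angle_points, angle_points[1:]):
        for atom in atoms:
            if point_segment_distance(atom, start, end) < min_dist:
                return True
    return False
```

This naive double loop scales with the number of atoms times the number of rods for every path, which is consistent with the sorting being the most time-consuming part of the computation.<br />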
==Shifting paths to the patterns==<br />
The strategy described in step 3 gives a certain freedom to the rod that connects the last two angle points that were generated from the N- and C-terminal points. As this freedom is actually not permitted by the alpha helix and the angle pattern, but is permitted by the flexible part, for example at the C-terminal end, the software slightly refines the path by rotating the segment that originates from the C-terminal point.<br />
==Weighting of paths==<br />
Before translating the paths into sequences and thus into linkers that can be expressed, each path needs to be evaluated for its potential capacity to enhance heat stability. For this we have identified different contributions that should be combined into one value that defines how good the linker may be and that consequently defines its ranking among all possible linkers. The smaller this value, the more we expect that the linker will enhance thermostability. An important step for all these contributions is the normalization, as explained in the next paragraphs.<br />
The first contribution we considered was the linker length. We assumed that a short linker is better to constrain the protein extremities, and that a long helix might give more flexibility. Because we wanted this value to be independent of the size of the protein, the length of the linker is normalized to the distance between the two termini.<br />
The second contribution relates to the angles used in the linker. We learned from the [[modeling ###link###]] that angles formed by a certain angle pattern follow a certain distribution. First, we assumed that the narrower the distribution, the more likely the alpha helices would actually produce this angle. Second, the angles found by the software should be as close as possible to those well-defined angles. In this case, the weight value from this contribution should be low.<br />
Then the distance of the linker to the protein is taken into account. Because the linkers should not disturb the protein in its normal environment, linkers that pass close to the protein surface are considered better linkers. The distance was defined as the minimal distance between the linker and all the atoms of the protein. As already mentioned for the sorting of the paths, a linker cannot come closer than 5 &Aring; and this distance was used for normalization of calculated distances.<br />
After this, the places a linker should avoid are calculated. Each protein can interact with other molecules on some parts of its surface. The user can specify where and how big those parts are. If a linker passes in front of a potential binding domain, the value of the corresponding path goes to infinity, so that the linker is discarded. Conversely, the farther a linker is from a potential ligand binding domain, the smaller its weighting value. The user can also specify the importance of certain regions. In the end, the total weighting is normalized to the number of binding domains.<br />
===Calibrating the weighting function===<br />
Every contribution has its own distribution. You can see an example in figure ### [[figure histogram_length_lys.png]], but all of them have different shapes. The aim is to find the paths that minimize all of these contributions. Therefore, in the weighting function, the four mentioned contributions are combined in a linear manner:<br />
\[ W(p) = \alpha L(p) + \beta A(p) + \gamma D(p) + \delta u(p) \]<br />
where W is the final weighting, p the path, L the length contribution, A the angle contribution, D the distance contribution and u the contribution from the forbidden regions. $\alpha, \beta, \gamma, \delta$ are the weighting constants that needed to be found. The normalization of each contribution was chosen so that each of them is dimensionless and that all have reasonably similar values.<br />
The weighting constants were obtained from the [[linker-screening ###lin]] performed with lysozyme and the [[enzyme-modeling ###link]].<br />
Please see [[below ###link]] for a detailed explanation of how the values were obtained.<br />
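The linear combination W(p) can be written in a few lines. The sketch below is illustrative only: the default constants of 1.0 are placeholders and not the calibrated values, and the four inputs are assumed to be the already normalized contributions L(p), A(p), D(p) and u(p).<br />

```python
import math


def path_weight(length, angle, distance, forbidden,
                alpha=1.0, beta=1.0, gamma=1.0, delta=1.0):
    """Combine the four normalized contributions into one weight W(p).

    The constants default to 1.0 as placeholders; the calibrated values
    are not reproduced here. A smaller result means a better linker.
    """
    # A path crossing a forbidden region has an infinite contribution and
    # must stay discarded even if `delta` were set to 0.
    if math.isinf(forbidden):
        return math.inf
    return alpha * length + beta * angle + gamma * distance + delta * forbidden
```

Sorting the paths by this value in ascending order then gives the ranking, with discarded paths ending up last.<br />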
==Translating paths to sequence==<br />
As already mentioned before the software is provided with two databases, one for the possible angle patterns and one for the helix patterns. The choice of the patterns was inspired by known crystal structures extracted from databases and described in different papers.<br />
A huge in silico screening for refining the preferences of the patterns was then set up using the [[iGEM@home|###Link]] system. For the complete description of the search for suitable patterns, one can read the [[modeling|###Link to patternspart]] page.<br />
All the possible paths are now split up at the angles and compared with the possible patterns in the databases. ###Figure needed to explain, how the path is translated### The most suitable patterns are identified and joined together to build the path's sequence. It is important to notice that this is only possible because of the modularity of our linker patterns used as building blocks: each block, be it an alpha helix or an angle pattern, is not affected by the others. Thus for each possible path, one sequence is produced.<br />
==Clustering of paths==<br />
Many different paths are represented by the same sequence [[###Figure that shows, different paths have same properties, already before in the text ###]] and we therefore clustered such paths. The weights for those clustered paths were then calculated by averaging the weights of the different paths that compose a cluster.<br />
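The clustering step can be sketched as a grouping by sequence followed by an average. The function name and the input format, a list of (sequence, weight) pairs, are assumptions made for this illustration.<br />

```python
from collections import defaultdict


def cluster_by_sequence(paths):
    """Group (sequence, weight) pairs by sequence and average the weights.

    Returns a list of (sequence, mean_weight) sorted from best (smallest
    weight) to worst.
    """
    groups = defaultdict(list)
    for sequence, weight in paths:
        groups[sequence].append(weight)
    averaged = {seq: sum(ws) / len(ws) for seq, ws in groups.items()}
    return sorted(averaged.items(), key=lambda item: item[1])
```

Each cluster then stands for one linker that can actually be expressed, ranked by the mean weight of all the geometrical paths it represents.<br />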
=Results=<br />
==DNMT1==<br />
A major motivation of our effort to design rigid linkers with angles was the circularization of the DNA methyltransferase Dnmt1 (link). The truncated form used in our project is composed of 900 amino acids, and the N- and C-terminal extremities are well separated. To circularize it, two linkers were designed: a flexible one made of glycine and serine, and a rigid one designed by the software. The rigid linkers for DNMT1 were obtained from an early state of the software. At that time the calculation took 11 days on a laptop computer with an Intel i5 processor and 8 GB of RAM, which shows the importance of a distributed computing system for large proteins. Since then the software has improved considerably, reducing the calculation time to about 1 day for DNMT1.<br />
IMPORTANT!<br />
paths-> geometrical<br />
linkers-> helices...<br />
sequences-> seq of linkers<br />
insist on huge number of data produced<br />
We definitely need to cite [0]!!!<br />
=Abstract=<br />
=Overview=<br />
As already introduced [###link to circularization idea in the toolbox], artificially circularized proteins may gain some heat stability by restraining the C- and N-terminus from moving around freely. This circularization may be trivial when the protein termini are very close to each other, which seems to be reasonably common [[#References|[1]]]. However, if the ends are too far from each other, a long linker is needed to connect them. This linker should not change the natural conformation of the protein and should constrain the relative position of the ends to restrict the degrees of freedom and thus stabilize the structure even when heated up. In addition, these linkers should not affect any of the protein functions. Consequently it is important to prevent linkers from passing through the active site or from covering domains that bind other molecules, for example. Therefore one needs to be able to define the shape of possible linkers. This section describes the software we developed to design such linkers. We would like to stress that this work has been made possible thanks to the feedback between computer modeling and experimental work: we could first design linkers in silico, test them experimentally and use the results to further calibrate the software. To our knowledge, this is the first time that such an approach has been used to custom-design linkers that connect protein extremities.<br />
In short, the software can provide a weighted list of linkers to circularize any protein of interest with a known structure. Those linkers are made of rigid alpha-helical segments connected with defined angles. Contrary to flexible linkers, those rigid linkers were expected to constrain the protein extremities and to confer better heat stability. To generate those linkers, we first defined the geometrical paths, with segments and angles, that they should follow. The geometrical paths that are biologically feasible are afterwards translated into amino acid sequences. Both the compatibility of paths with possible structures and the translation were made possible thanks to our modeling approaches. The first approach consisted in performing a statistical analysis of more than 17000 known non-homologous structures containing alpha helices connected with angles. For the second approach, we modeled the conformation of linkers circularizing proteins of known structure and analyzed them for certain properties. This second approach was run for a large number of proteins thanks to our distributed computing system igemathome. The software provides different possible linkers with weights that rank the linkers according to their capacity to maintain protein activity at higher temperatures. These weights were generated in an extensive [[linker screening| ### ]] on the target protein lambda-lysozyme, using the first modeling approach.<br />
The software was checked for running stability in a huge test over the [[iGEM@home]]. <br />
Furthermore our software provided linkers for circularizing [[DNMT1| ###]], that could be made more heatstable due to circularization???.<br />
For detailed information on the implementation and the practical use of the software, please see the [[documentation software-docu]] page.<br />
==What does the software do==<br />
still missing [[figure 0, graph abstract]]<br />
=Background=<br />
What are linkers, why should ours work better?<br />
Linkers, References to the modeling, <br />
Classically, protein linkers were designed in three different manners. ###REFERENCE### The easiest way is to define the length that a linker should cover and then simply use a flexible glycine-serine peptide with the number of amino acids needed to match this length. Glycine is used for flexibility, as it has no side chain and does not produce any steric hindrance, while serine is used for solubility, as it has a small polar side chain. This solubility is important, as the linkers should not pass through the hydrophobic core of the protein but should be dissolved in the surrounding medium. Such flexible linkers are normally used for circularization, but also for connecting different proteins when what matters is that the parts are connected rather than how they are connected, or when the flexibility of the linker is required for a specific application.<br />
A second strategy consists of using rigid helical linkers to keep proteins or protein domains at a certain distance from each other. This is especially important for signalling proteins and fluorescent proteins. ###TODO: Reference### One major property of alpha helices is that they always fold in a defined way with well-defined angles and lengths. There are also many different helical patterns that differ in stability and solubility. One big disadvantage of this strategy is that one can only build straight linkers with helices. So, in the context of circularization, this strategy is not an option if the straight line connecting the protein extremities crosses the protein.<br />
The third option, which served as the basis for our approach, consists of designing custom-tailored linkers for each specific application. These linkers can be obtained from protein structure prediction. First, one needs to define the path that the linker should take to connect two amino acids. Afterwards, one designs a linker sequence that might fit this path. Next, one predicts the structure of the linker attached to the proteins to validate the design. Several different linkers, with slight changes, can be compared. This is repeated several times until the linker effectively follows the expected path. ###TODO: Reference, WADE paper### This method is time-consuming, as it is not only computationally intensive but also requires a strong knowledge of protein folding and protein structure prediction. On the other hand, the benefit can be substantial, as the interaction of the linker with the protein's surface can be taken into account and the path taken by the linker can be defined accurately, down to the resolution of the protein structure.<br />
We have set up a completely new strategy to design rigid linkers. As further detailed in the [[###link###|modelling]] part, it is possible to define the shape of a linker by combining rigid alpha-helical rods with well-defined angle patterns. Therefore, by defining geometrically the possible paths of the circularizing linkers for a given protein, we can propose potential linkers. Defining the geometrical path can be very difficult, especially for large proteins with complex shapes. Moreover, this definition is further constrained by the fact that linkers must avoid hiding the active sites of the protein of interest. Finally, the paths have rotational degrees of freedom at the extremities of the protein and, depending on their orientation, they may or may not match the geometry of the protein. The tool we present here covers both steps: defining weighted geometrical paths and translating them into feasible linkers, also with weights. This tool is universal, as it can design circularizing linkers for any protein with a known structure. Moreover, it is modular: thanks to our modeling approach [link], we have designed linkers as exchangeable blocks consisting of rods of different lengths and of angle patterns. The following sections detail the different steps followed by our software to design proper linkers.<br />
But even taking all this into account, one could never manually consider all the alternative paths that the same linker can also take because of the rotational degrees of freedom inherent in the linker model. ###figure needed###<br />
Thus we decided to provide the scientific community with powerful open-source software that, for every protein with a known structure, can calculate the sequence of a rigid linker that circularizes the protein with minimal inhibition of the protein's function.<br />
Until now, scientists had to estimate the length of the linker themselves, ###check for flexlinker### so our software offers a completely novel approach to circularization.<br />
=General procedure=<br />
Our general approach to finding linkers: rods and angles are chosen in a discrete manner, so there are only finitely many possibilities to connect the ends, and all of them are checked.<br />
I think this paragraph is just too much, as anyway there was already an overview before.<br />
First, the protein structure is analysed geometrically: paths composed of no more than four straight segments connected by angles are computationally generated. A path is always represented by straight lines and the connecting angles between them. ###In the end all these paths should be sorted by how good it would be to circularize the protein using this path.### But as the final weighting is quite computationally intensive, unsuitable paths first need to be sorted out. A path is sorted out only if it breaks one of the rules for linkers. For example, paths should never pass through the protein. After all the paths have been generated, they are refined by shifting the angle points according to the underlying linker model, so that no path is taken into account that could not be built with our building blocks.<br />
After all paths have been correctly generated, they are weighted by several factors. These are then combined into a single weighting per path, which reflects the quality of that path. Then the path is translated, using our amino acid patterns, into an amino acid sequence. But as one sequence can follow more than one path, all the paths that this sequence can build are clustered. Finally, the average of the weightings within each path cluster is calculated, so that every linker receives a single weighting value with contributions from all the paths it can possibly take.<br />
==PDB analysis==<br />
First, the PDB file containing the structure of the target protein is parsed and the coordinates of the atoms are stored in metric units. Some initial tests are then made on the protein structure. First, we check whether the C- and N-termini lie on the surface of the protein and are accessible to the solvent, which is crucial for circularization. We define a line originating from an extremity of the protein by the two angles of the spherical coordinates around the z-axis. We can then determine the accessible angles by rejecting all lines that are too close to the protein. As the future linker will be made of alpha helices and will therefore have a radius of 5 &Aring;, we use this length as the minimal allowed distance.<br />
These allowed angles are stored for the subsequent linker generation. ###fig needed###<br />
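The accessibility test described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the team's implementation: the function names, the 5 degree scan and the convention that the atom list excludes the terminal residue itself (which would otherwise always lie within 5 &Aring; of the line's origin) are all hypothetical.

```python
import math

def direction(theta_deg, phi_deg):
    """Unit vector for polar angle theta (measured from the z-axis) and azimuth phi."""
    t, p = math.radians(theta_deg), math.radians(phi_deg)
    return (math.sin(t) * math.cos(p), math.sin(t) * math.sin(p), math.cos(t))

def distance_to_ray(atom, origin, unit_dir):
    """Distance from an atom to the half-line that starts at origin along unit_dir."""
    v = [a - o for a, o in zip(atom, origin)]
    # Clamp at 0: the line only extends outward from the terminus.
    s = max(0.0, sum(vi * di for vi, di in zip(v, unit_dir)))
    closest = [o + s * di for o, di in zip(origin, unit_dir)]
    return math.dist(atom, closest)

def accessible_angles(terminus, atoms, step_deg=5, min_dist=5.0):
    """Return the (theta, phi) pairs whose line keeps at least min_dist from all atoms.

    Assumption: `atoms` excludes the terminal residue itself.
    """
    allowed = []
    for theta in range(0, 181, step_deg):
        for phi in range(0, 360, step_deg):
            d = direction(theta, phi)
            if all(distance_to_ray(atom, terminus, d) >= min_dist for atom in atoms):
                allowed.append((theta, phi))
    return allowed
```

Note that at the poles (theta of 0 or 180 degrees) all azimuths describe the same direction, so this simple scan stores some duplicates; a production version would probably remove them.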
==Generation of geometric paths==<br />
As our strategy consists of building linkers from helical rods and connecting angles, a path is completely defined by the coordinates of the angle points. Advancing one step from an existing point is always done by adding a displacement vector to this point. This vector is defined by the two spherical angles, chosen here in a discrete manner with an increment of 5 degrees, and by a length, also chosen in a discrete manner. This discrete length is used in two different contexts: it may correspond to the length of an alpha helix or to the length of the flexible part at the extremity of the protein. The coordinates reached with this vector define a new angle point, no matter whether the vector corresponds to an alpha helix or to a flexible part. ###figure, points + vectors### As we screen all possible angles in a discrete manner, the coordinates of these angle points are regularly distributed on a sphere. As further detailed in the next sections, such spheres are defined from both ends, either once or in several steps. Then the software checks for possible straight connections of given lengths between each pair of angle points originating from the two extremities. ###figure needed, with all these possible steps, one sphere around each point and checking for the connections, 2D, different lengths in two different figures, ###<br />
The linkers are built in a modular way, from blocks of well-defined size. From the modeling of potential linkers [link], we derived 8 different alpha-helical rods, all with different lengths. In addition, the length of the two segments inside an angle block is always 8 &Aring;, so exchanging angle blocks does not affect the length of the linker. This means that the distance between the angle points is well defined, an essential aspect of our linker design strategy.<br />
The software proceeds in three steps. First, it checks whether a direct linker made of a single alpha helix is possible. For this, it applies the procedure just described, with spheres whose radii correspond to reasonable lengths of the short flexible parts at the extremities of the protein. Second, it tests whether a linker containing two alpha helices connected at a right angle allows circularization. Finally, it searches for possible linkers with up to three angle points. The next parts explain these three steps in detail.<br />
This method was chosen because it could be implemented easily and efficiently in our program. However, this strategy also generates paths that cross the protein. We therefore put a great deal of effort into sorting out such paths.<br />
In the [[###link###|modelling]] part we have already described the patterns we use to build our linkers. Their modularity was crucial for the success of the software. The angle patterns were therefore always chosen so that the end of the angle pattern is 8 &Aring; away from the turning point ###figure needed###. Thus a displacement only needs a certain length, which depends neither on the direction from which the linker arrives nor on the direction in which it continues. This modularity makes the calculations more efficient than they would be if points were simply generated at random.<br />
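As a minimal sketch of the displacement step described above, the following Python snippet generates all candidate angle points reachable from one point with a rod of a given length, scanning both spherical angles in 5 degree increments. The function names are hypothetical and the snippet is not taken from the team's software.

```python
import math

def displace(point, theta_deg, phi_deg, length):
    """Add a displacement vector given by two spherical angles and a length."""
    t, p = math.radians(theta_deg), math.radians(phi_deg)
    return (
        point[0] + length * math.sin(t) * math.cos(p),
        point[1] + length * math.sin(t) * math.sin(p),
        point[2] + length * math.cos(t),
    )

def sphere_points(center, length, step_deg=5):
    """All candidate angle points at distance `length` from `center`."""
    return [
        displace(center, theta, phi, length)
        for theta in range(0, 181, step_deg)   # polar angle, 0..180 degrees
        for phi in range(0, 360, step_deg)     # azimuth, 0..355 degrees
    ]
```

With a 5 degree increment this yields 37 x 72 = 2664 candidate points per sphere, all lying at exactly the requested distance from the center.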
We still have to rephrase that.<br />
===Step 1===<br />
As a simple rigid linker with no angle would be easier to design and likely more thermostable than one containing angles, the software first checks whether this simple solution is possible ###figure needed###. For this, we took into account the fact that proteins have some flexible amino acids at their extremities. This flexible part may come from the protein itself, but also from the 2 glycines that are included at the N-terminal end and from the extein at the C-terminal end. The latter two parts come from our linkers. These parts have no preferred angles and offer a large number of possibilities for inserting fitting linkers. But this flexibility is also a drawback, as we have to include this large number of possible angles and lengths in our path search.<br />
In this first step, the software explicitly takes these flexible parts into account when checking for possible straight linkers. As the angles and the lengths of the flexible parts are variable, the software positions their extremities on a sphere centered on the last fixed position of the structure, as explained above. The radius of this sphere is incremented in a discrete manner, in 4 steps, from 5.25 &Aring; to the maximum length of the flexible part. ###figure###<br />
Then all possible straight segments between the points around the start and the points around the end (the last points) are tested. If they come closer than 5 &Aring; to the protein, or if they cross it, they are rejected. If they are kept, the software checks whether the length of the segment is compatible with the feasible alpha helices: if the length of a given segment equals one of the 8 alpha helix lengths plus or minus 0.75 &Aring;, the path is saved.<br />
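The two tests of Step 1 (clearance from the protein and compatibility with a helix length) might look as follows in Python. This is a hedged sketch: the function names are invented for illustration, and the helix lengths are passed in as a parameter because the actual 8 values are not listed in this text.

```python
import math

def point_segment_distance(p, a, b):
    """Shortest distance from point p to the straight segment a-b."""
    ab = [bi - ai for ai, bi in zip(a, b)]
    ap = [pi - ai for ai, pi in zip(a, p)]
    denom = sum(x * x for x in ab)
    # Projection parameter, clamped to the segment; a == b degenerates to a point.
    t = 0.0 if denom == 0 else max(0.0, min(1.0, sum(x * y for x, y in zip(ap, ab)) / denom))
    closest = [ai + t * x for ai, x in zip(a, ab)]
    return math.dist(p, closest)

def segment_is_clear(a, b, atoms, min_dist=5.0):
    """True if no atom comes closer than min_dist (5 Angstrom) to the segment."""
    return all(point_segment_distance(atom, a, b) >= min_dist for atom in atoms)

def matching_helix(distance, helix_lengths, tol=0.75):
    """Closest available helix length within +/- tol of the distance, else None."""
    best = min(helix_lengths, key=lambda length: abs(length - distance))
    return best if abs(best - distance) <= tol else None

def straight_linkers(first_points, last_points, atoms, helix_lengths):
    """All (start, end, helix length) triples that pass both Step 1 tests."""
    found = []
    for p in first_points:
        for q in last_points:
            length = matching_helix(math.dist(p, q), helix_lengths)
            if length is not None and segment_is_clear(p, q, atoms):
                found.append((p, q, length))
    return found
```

The cheap length test is done before the clearance test here, since the latter has to loop over all atoms; whether the original software uses the same order is not stated.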
" Therefore here another function for calculating the distance from the protein to the connection is used, that is more time-consuming, but also more accurate, than the ones used normally. " ??<br />
"This is done with a higher accuracy, because none of these linkers should be lost by error". ??<br />
So far, two functions have been included for generating linkers that take the flexibility of the ends into account.<br />
The only variable we have in this custom-tailored linker is the length of the helix. Most likely no suitable linker can be found here, because if there is an obstacle between the ends, the linker cannot bend around it. But if such a linker is possible, it should be found, because it will be one of the best linkers predicted.<br />
So for all accessible angles that were calculated before, we calculate all points from the N-terminus and from the C-terminus that lie in certain distances from the terminus. The points are spread over varying distances from ; to the .<br />
===Step 2===<br />
The next possibility for designing more complex rigid linkers, while still taking flexible ends into account within a reasonable calculation time, was to reduce the number of possible angles. As we originally thought that a 90° angle would be practically feasible, the software was designed to generate linkers with flexible ends and one 90° angle. ###figure### This choice was made notably because the edge lengths of a right triangle are simple to calculate. We already saw in Step 1 that the length of an edge can only take 8 different values. As the linkers have to start from the extremities of the protein, and as we impose a right angle, the number of possible paths is low, which makes them easy to compute. In practice, the extremities of the protein are positioned in a flexible way as in Step 1. From each of the positions allowed by this flexibility, the software searches for all the allowed right triangles.<br />
some words on degree of freedom...<br />
###newly written###<br />
Again, we only have discrete possibilities to build right triangles with our helical patterns. Therefore, all combinations of two helical patterns are first searched that could build a right triangle whose hypotenuse has the length of the distance between the termini.<br />
Now we shift back to 3D and apply Thales's theorem, which says that if A, B and C lie on a circle and the line between A and C is a diameter of the circle, then the angle at B is a right angle. ### fig, thales### Thus in 3D we can discretize the positions where the angle point (B) can lie relative to the starting point (A). This set is cross-checked against the possible right triangles found before, so that we only keep paths that can be built with our patterns.<br />
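The right-triangle search can be illustrated with the following Python sketch. It is an assumption-laden example, not the original code: the function names, the tolerance argument and the 5 degree rotation step around the A-C axis are hypothetical. For a leg of length a and a hypotenuse of length c, the angle point B lies on a circle around the A-C axis, at axial distance a²/c from A and with radius a·sqrt(c² - a²)/c.

```python
import math
from itertools import combinations_with_replacement

def right_triangle_pairs(lengths, hypotenuse, tol=0.75):
    """Pairs (a, b) of rod lengths with sqrt(a^2 + b^2) within tol of the hypotenuse."""
    return [
        (a, b)
        for a, b in combinations_with_replacement(sorted(lengths), 2)
        if abs(math.hypot(a, b) - hypotenuse) <= tol
    ]

def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1], u[2] * v[0] - u[0] * v[2], u[0] * v[1] - u[1] * v[0])

def unit(v):
    n = math.sqrt(sum(x * x for x in v))
    return tuple(x / n for x in v)

def apex_points(A, C, a, step_deg=5):
    """Candidate positions of B with |AB| = a and an exact right angle at B."""
    c = math.dist(A, C)
    if not 0 < a < c:
        return []
    u = unit(tuple(ci - ai for ai, ci in zip(A, C)))               # axis A -> C
    helper = (1.0, 0.0, 0.0) if abs(u[0]) < 0.9 else (0.0, 1.0, 0.0)
    v = unit(cross(u, helper))                                     # perpendicular to the axis
    w = cross(u, v)                                                # completes the basis
    foot = a * a / c                                               # axial position of the circle
    radius = a * math.sqrt(c * c - a * a) / c                      # circle radius
    points = []
    for phi_deg in range(0, 360, step_deg):
        phi = math.radians(phi_deg)
        points.append(tuple(
            ai + foot * ui + radius * (math.cos(phi) * vi + math.sin(phi) * wi)
            for ai, ui, vi, wi in zip(A, u, v, w)
        ))
    return points
```

Each pair returned by `right_triangle_pairs` is unordered, so either rod may be the one that starts at A; both assignments would have to be tried.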
Like in Step 1 the software creates spheres of points around the start (first points) and around the end (last points). Then for each point from the first points<br />
===Step 3===<br />
Finally, the software can also find paths with up to 4 edges, meaning 4 alpha helices and 3 angles. Thanks to the modularity of the possible linkers, such paths can in theory circularize any kind of protein. ###figure of torus###<br />
To keep the calculation feasible in a reasonable time, we designed the search strategy so that the flexible parts at the extremities are oriented in the same direction as the adjacent alpha helix. This obviously restricts the search, but as these orientations are allowed for the flexible parts, the approach remains correct.<br />
First, potential end points of the first alpha-helical rod are calculated from the N-terminal point of the protein. The orientation is chosen in a discrete manner, with an increment of 5 degrees for the two angles of the spherical coordinates. The distance from the origin takes the 8 possible lengths allowed by the alpha helices, as already seen in Step 2, plus a length of 0, which mimics a linker with 3 instead of 4 edges. Exactly the same procedure is repeated to define all the potential end points of the second alpha-helical rod, starting from all the possible end points of the first rod. Thanks to the possibility of a length of 0 for the first and the second rods, the software also calculates paths with 2 edges. Then the same is done only once from the C-terminal point of the protein, defining 1 edge. The final step consists of checking whether the points originating from the N- and C-terminal points can be linked by a potential alpha helix, i.e. whether they are separated by an appropriate distance. If any of the possible alpha helix lengths lies within plus or minus 0.75 &Aring; of the distance between two points, the path is saved. In the same way, if two points are closer to each other than 0.75 &Aring;, the path is also saved.<br />
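The closing rule at the end of Step 3 can be sketched as below. This is only an illustration with hypothetical function names; the list of helix lengths is again a parameter, and returning 0.0 for coinciding points is one possible way to encode "no additional rod needed".

```python
import math

def closing_edge(p, q, helix_lengths, tol=0.75):
    """Length of the rod that closes the gap between p and q.

    Returns 0.0 if the points already coincide within tol, the closest helix
    length if it lies within +/- tol of the distance, and None otherwise.
    """
    d = math.dist(p, q)
    if d < tol:
        return 0.0
    best = min(helix_lengths, key=lambda length: abs(length - d))
    return best if abs(best - d) <= tol else None

def close_paths(n_side_points, c_side_points, helix_lengths, tol=0.75):
    """All pairs of N-side and C-side points that can be joined, with the closing rod."""
    paths = []
    for p in n_side_points:
        for q in c_side_points:
            edge = closing_edge(p, q, helix_lengths, tol)
            if edge is not None:
                paths.append((p, q, edge))
    return paths
```

The double loop makes the cost proportional to the product of the two point sets, which is why the number of points generated from each terminus has to be kept under control.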
one starts from omega, 2 from alpha<br />
Thus even one of the worst shapes for circularization, a torus with the two termini in the pits, could be circularized with our linker system without needing infinitely long straight segments. On the other hand, this was the maximum number of possibilities that was still feasible to calculate.<br />
Here the flexible parts of the protein are assumed to point in the same direction as the following helix. In this way the number of possibilities stays bounded, but of course this is a rather rough approximation.<br />
Let's check the next paragraph together in detail<br />
First, points at the right distances from each end are generated. ###figure needed, explains point names### Then, from each of the first points, next points are generated in all directions in steps of 5 degrees. Of these, all points that do not fit are immediately sorted out. Next, all connections from the second points to the last points are checked for whether they have an appropriate length. After this, the normal sorting steps are applied to these new connections.<br />
==Sorting out of paths==<br />
The previous part described the generation of paths that connect the two extremities of the protein irrespective of the position of these paths relative to the protein. While this allows fast computation of the geometrical paths, it also implies that the paths that are not practically feasible need to be sorted out. This is the most time-consuming part of the computation, as about 1 billion paths are generated. Three criteria are considered for the sorting. The first one is the feasibility of the linker: can the software find angle patterns that correspond to the angles defined by the geometrical path? This question was part of the motivation for a large modeling effort (link) to determine the possible angles between consecutive alpha helices. This was achieved by analyzing the distribution of angles between alpha helices found in the ArchDB database (link). As nearly any angle between 20 and 170 degrees could be found, only few paths were actually rejected at this step. The next criterion is the position of the angle points: if they lie inside the protein, the path is rejected. Finally, the software checks whether any atom of the protein is less than 5 &Aring; away from any of the alpha helices; if so, the path is also rejected.<br />
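The first sorting criterion, the feasibility of the angles, can be sketched as follows. The snippet assumes that the angle is measured at the angle point between the two adjoining segments (so a straight continuation corresponds to 180 degrees) and that consecutive points are distinct; both conventions, like the function names, are assumptions made for this example.

```python
import math

def vertex_angle(prev, vertex, nxt):
    """Angle in degrees at `vertex` between the segments to `prev` and to `nxt`."""
    u = [a - b for a, b in zip(prev, vertex)]
    v = [a - b for a, b in zip(nxt, vertex)]
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    # Clamp to [-1, 1] to guard against rounding errors before acos.
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cosine))

def angles_feasible(points, lowest=20.0, highest=170.0):
    """True if every angle point of the path has an angle in the buildable range."""
    return all(
        lowest <= vertex_angle(a, b, c) <= highest
        for a, b, c in zip(points, points[1:], points[2:])
    )
```

The clash criteria (angle points inside the protein, atoms closer than 5 &Aring; to a helix) would be applied in the same filtering pass; they are omitted here to keep the example short.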
==Shifting paths to the patterns==<br />
The strategy described in Step 3 gives a certain freedom to the rod that connects the last two angle points generated from the N- and C-terminal points. As this freedom is actually not permitted by the alpha helix and the angle pattern, but is permitted by the flexible part, for example at the C-terminal end, the software slightly refines the path by rotating the segment that originates from the C-terminal point.<br />
In particular, because of the discretization, the generated paths may not fit the helical and angle patterns perfectly. Therefore the paths need to be refined before they are translated into amino acid sequences.<br />
Every path that has survived is analyzed step by step, beginning at the starting point and always advancing by one angle point. For each step from point to point, the length of the step is calculated and compared to the lengths we can build with the helix patterns. If the step is too long, the next point is shifted towards the previous one until it fits. If it is too short, the point is shifted away. These shifts do not exceed a certain length, so that the paths do not move so far that they suddenly pass through restricted areas.<br />
This is also done before the weighting, so that the paths no longer change after weighting.<br />
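One refinement step, shifting an angle point along the line to its predecessor until the segment matches a buildable helix length, could be sketched like this. The function name, the default maximal shift of 1 &Aring; and the choice to return `None` when the required shift is too large are assumptions for illustration; the text does not state the actual limit or what happens when it is exceeded.

```python
import math

def snap_to_helix(prev, nxt, helix_lengths, max_shift=1.0):
    """Move `nxt` along the line prev -> nxt so the step matches a helix length.

    Returns the shifted point, or None if the closest helix length would
    require a shift larger than max_shift (hypothetical handling).
    """
    d = math.dist(prev, nxt)
    if d == 0:
        return None  # no direction to shift along
    target = min(helix_lengths, key=lambda length: abs(length - d))
    if abs(target - d) > max_shift:
        return None
    scale = target / d  # < 1 shifts towards prev, > 1 shifts away from it
    return tuple(p + (n - p) * scale for p, n in zip(prev, nxt))
```

For example, with available lengths of 10 and 20 &Aring;, a step of 10.5 &Aring; is shortened to 10 &Aring;, whereas a step of 15 &Aring; cannot be repaired within the 1 &Aring; limit.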
==Weighting of paths==<br />
Before translating the paths into sequences and thus into linkers that can be expressed, each path needs to be evaluated for its potential capacity to enhance heat stability. For this we have identified different contributions that should be combined into one value that defines how good the linker may be and that consequently defines its ranking among all possible linkers. The smaller this value, the more we expect that the linker will enhance thermostability. An important step for all these contributions is the normalization, as explained in the next paragraphs.<br />
The first contribution we considered was the linker length. We assumed that a short linker is better to constrain the protein extremities, and that a long helix might give more flexibility. Because we wanted this value to be independent of the size of the protein, the length of the linker is normalized to the distance between the two termini.<br />
The second contribution relates to the angles used in the linker. We learned from the [[modeling ###link###]] that angles formed by a certain angle pattern follow a certain distribution. First, we assumed that the narrower the distribution, the more likely the alpha helices would actually produce this angle. Second, the angles found by the software should be as close as possible to those well-defined angles. In this case, the weight value from this contribution should be low.<br />
Then the distance of the linker to the protein is taken into account. Because the linkers should not disturb the protein in its normal environment, linkers that pass close to the protein surface are considered better. The distance is defined as the minimal distance between the linker and all the atoms of the protein. As linkers should also not be too close to the protein surface and, as already mentioned for the sorting of the paths, cannot come closer than 5 &Aring;, the calculated distance is normalized by this minimal distance.<br />
After this, the regions a linker should avoid are taken into account. Each protein can interact with other molecules on some parts of its surface. The user can specify where these parts are and how big they are. If a linker passes in front of a potential binding domain, the value of the corresponding path goes to infinity, so that the linker is discarded. Conversely, the farther a linker is from a potential ligand-binding domain, the smaller its weighting value. The user can also specify the importance of certain regions. In the end the total weighting is normalized by the number of binding domains.<br />
===Calibrating the weighting function===<br />
???keep?###<br />
Every contribution has its own distribution. An example is shown in figure ### [[figure histogram_length_lys.png]]; the distributions all have different shapes. The aim is to find the paths that minimize all of these contributions. Therefore, the four contributions mentioned above are combined linearly in the weighting function:<br />
\[ W(p) = \alpha L(p) + \beta A(p) + \gamma D(p) + u(p) \]<br />
where W is the final weighting, p the path, L the length contribution, A the angle contribution, D the distance contribution and u the contribution from the forbidden regions. $\alpha, \beta, \gamma, \delta$ are the weighting constants that needed to be found. The normalization of each contribution was chosen so that each of them is dimensionless and all have reasonably similar values.<br />
The weighting constants were obtained from the [[linker-screening ###lin]] performed with lysozyme and the [[enzyme-modeling ###link]].<br />
Please see [[below ###link]] for a detailed explanation of how the values were obtained.<br />
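The linear combination W(p) given above can be written as a small Python function. This is a sketch under assumptions: the normalizations are implemented as simple ratios (linker length divided by the distance between the termini, minimal distance divided by 5 &Aring;), which follows the description but may differ in detail from the software, and the angle and forbidden-region terms are taken as already computed inputs. All names are hypothetical.

```python
import math

def path_weight(linker_length, termini_distance, angle_term, min_distance,
                forbidden_term, alpha, beta, gamma, min_allowed=5.0):
    """W(p) = alpha * L(p) + beta * A(p) + gamma * D(p) + u(p); smaller is better."""
    L = linker_length / termini_distance   # length, normalized to the terminus distance
    D = min_distance / min_allowed         # distance to the protein, in units of 5 Angstrom
    return alpha * L + beta * angle_term + gamma * D + forbidden_term

# A path blocking a binding site gets an infinite forbidden term and is
# therefore ranked last, whatever its other contributions are.
blocked = path_weight(20.0, 10.0, 0.5, 10.0, math.inf, alpha=1.0, beta=2.0, gamma=3.0)
```

The constants are left as arguments on purpose, since they are the quantities fitted in the calibration against the wet lab results.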
==Translating paths to sequence==<br />
As mentioned before, the software is provided with two databases, one for the possible angle patterns and one for the helix patterns. The choice of the patterns was inspired by known crystal structures extracted from databases and described in different papers.<br />
A large in silico screening to refine the preferences of the patterns was then set up using the [[iGEM@home|###Link]] system. For the complete description of the search for suitable patterns, see the [[modeling|###Link to patternspart]] page.<br />
All the possible paths are now split up at the angles and compared with the possible patterns in the databases. ###Figure needed to explain, how the path is translated### The most suitable patterns are identified and concatenated to build the sequence of the path. It is important to note that this is only possible because of the modularity of the linker patterns used as building blocks: each block, whether an alpha helix or an angle pattern, is not affected by the others. Thus one sequence is produced for each possible path.<br />
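Because the blocks are independent, translating a path amounts to concatenating block sequences. The sketch below illustrates this with block sequences taken from Table 1 in the Results section; the dictionary keys, the function name and the way a path is encoded as a list of (kind, key) pairs are hypothetical, as the layout of the two pattern databases is not described here.

```python
# Hypothetical miniature "databases"; the sequences are taken from Table 1.
HELIX_PATTERNS = {"helix_1": "AEAAAK", "helix_2": "AEAAAKEAAAK"}
ANGLE_PATTERNS = {"angle_1": "AAAHPEA", "angle_2": "ASLPAA"}

ATTACHMENT = "GG"     # flexible attachment at the N-terminal side
EXTEIN = "RGTCWE"     # extein at the C-terminal side

def translate_path(blocks, attachment=ATTACHMENT, extein=EXTEIN):
    """Concatenate the block sequences of a path into one linker sequence.

    `blocks` is a list of (kind, key) pairs, kind being "helix" or "angle".
    """
    databases = {"helix": HELIX_PATTERNS, "angle": ANGLE_PATTERNS}
    body = "".join(databases[kind][key] for kind, key in blocks)
    return attachment + body + extein

# Rebuilds the sequence listed for linker sgt2 in Table 1.
sgt2 = translate_path([("helix", "helix_1"), ("angle", "angle_1"), ("helix", "helix_1")])
```

A path without any helix or angle block reduces to the attachment followed by the extein, which corresponds to the shortest linker in the table.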
==Clustering of paths==<br />
Many different paths are represented by the same sequence [[###Figure that shows, different paths have same properties, already before in the text ###]], and we therefore clustered such paths. The weight of each cluster was then calculated by averaging the weights of the paths that compose it.<br />
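The clustering step is essentially a group-by on the sequence followed by an average. A minimal Python sketch, with hypothetical function names and placeholder sequences, could look like this:

```python
from collections import defaultdict

def cluster_weights(weighted_paths):
    """Average the weights of all paths that translate to the same sequence.

    `weighted_paths` is an iterable of (sequence, weight) pairs.
    """
    clusters = defaultdict(list)
    for sequence, weight in weighted_paths:
        clusters[sequence].append(weight)
    return {seq: sum(ws) / len(ws) for seq, ws in clusters.items()}

def rank_linkers(weighted_paths):
    """Sequences sorted from best (smallest average weight) to worst."""
    averaged = cluster_weights(weighted_paths)
    return sorted(averaged, key=averaged.get)
```

Sorting in ascending order follows the convention used for the weighting: the smaller the value, the better the linker is expected to be.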
=Results=<br />
==DNMT1==<br />
A major motivation for our effort to design rigid linkers with angles was the circularization of the DNA methyltransferase Dnmt1 (link). The truncated form used in our project is composed of 900 amino acids, and its N- and C-terminal extremities are well separated. To circularize it, two linkers were designed: a flexible one made of glycine and serine, and a rigid one designed by the software. The rigid linkers for DNMT1 were obtained from an early version of the software. At that time the calculation took 11 days on a laptop computer with an Intel i5 processor and 8 GB of RAM, which shows the importance of a distributed computing system for large proteins. Since then, the software has improved considerably, reducing the calculation time for DNMT1 to about 1 day.<br />
###DNMT1 could be made heatstable, still missing###<br />
==Feedback from wet lab==<br />
The results from the software for lysozyme were tested as described in the [[linker-screening]] part and evaluated as described in the [[enzyme modeling]] part. We performed a large linker screening on 10 different lysozymes with different linkers. As the purpose of the lysozyme screen was the calibration of the software, these linkers were designed according to the four contributions previously mentioned: one of them was the shortest possible, one had the best possible angle, and so on.<br />
{| class="table table-hover"<br />
|+ '''Table 1''': Linkers and their amino acid sequences. Green: attachment sequences to prevent the flexible regions from being perturbed; Blue: angle; Purple: extein.<br />
! Linker<br />
! Amino acid sequence<br />
|- <br />
|<br />
|<br />
|-<br />
|'''Very good linkers'''<br />
|<br />
|-<br />
| sgt2<br />
| <span style="color:#3ADF00;">GG</span>AEAAAK<span style="color:#00BFFF;">AAAHPEA</span>AEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| rigid<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAKAA<span style="color:#3ADF00;">P</span><span style="color:#A901DB;">RGKCWE</span><br />
|-<br />
| <br />
| <br />
|-<br />
| '''Average linkers'''<br />
|<br />
|-<br />
| may1<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAKA<span style="color:#00BFFF;">AAAHPEA</span>AEAAAKEAAAKA<span style="color:#00BFFF;">KTA</span>AEAAAKEAAAKA<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| ord1<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAK<span style="color:#00BFFF;">ATGDLA</span>AEAAAKAA<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| ord3<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAK<span style="color:#00BFFF;">ASLPAA</span>AEAAAKEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
|<br />
| <br />
|-<br />
| '''Short linkers'''<br />
| <br />
|-<br />
| sho1<br />
| <span style="color:#3ADF00;">GG</span><span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| sho2<br />
| <span style="color:#3ADF00;">GG</span>AEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
|}<br />
[[circ_lam_lys_nils.png]] comparison with [[nice_linker_lysozyme_flexible_ends.png]]<br />
###About predictions of software###<br />
In the end we obtained a ranking of the linkers tested in vitro in the [[linker-screening]] and chose the parameters $\alpha, \beta, \gamma, \delta$ of the weighting function so that the ranking from the software reproduced the ranking from the assays. To this end we set up a function that is minimized when the order of the weighting output is the same as the order from the wet lab. The final values were: $\alpha = 1.85 \cdot 10^{-6}, \beta = 0.57, \gamma = 50.8 \cdot 10^{6}$.<br />
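The exact function used for the calibration is not given here. One simple candidate with the stated property, being minimal when both rankings agree, is the number of linker pairs that the software and the wet lab order differently. The following Python sketch implements this candidate; it is an assumption for illustration, not the function actually used, and the linker names in the example are placeholders.

```python
from itertools import combinations

def rank_disagreement(software_weights, wetlab_rank):
    """Number of linker pairs that are ordered differently by software and wet lab.

    Both inputs map linker names to numbers; a lower weight and a lower rank
    number both mean "better". The result is 0 when the two orders agree.
    """
    names = sorted(software_weights)
    return sum(
        1
        for a, b in combinations(names, 2)
        if (software_weights[a] - software_weights[b]) * (wetlab_rank[a] - wetlab_rank[b]) < 0
    )

# Placeholder data: three linkers ranked 1 (best) to 3 (worst) in the wet lab.
wetlab = {"x": 1, "y": 2, "z": 3}
agreeing = rank_disagreement({"x": 0.1, "y": 0.2, "z": 0.3}, wetlab)
reversed_order = rank_disagreement({"x": 0.3, "y": 0.2, "z": 0.1}, wetlab)
```

A calibration along these lines would evaluate this count for many candidate sets of weighting constants and keep the set with the smallest value.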
<br />
=Discussion=<br />
Will always be refined with more data from i@h...<br />
=References=<br />
<br />
[1] Thornton, J.M. & Sibanda, B.L. Amino and carboxy-terminal regions in globular proteins. Journal of molecular biology 167, 443-460 (1983).<br />
<br />
[2] Wang, C.K.L., Kaas, Q., Chiche, L. & Craik, D.J. CyBase: A database of cyclic protein sequences and structures, with applications in protein discovery and engineering. Nucleic Acids Research 36, (2008).</div>Igemnilshttp://2014.igem.org/Team:Heidelberg/pages/Linker_SoftwareTeam:Heidelberg/pages/Linker Software2014-10-17T22:41:31Z<p>Igemnils: /* Weighting of paths */</p>
<hr />
<div>=Abstract=<br />
As already [https://2014.igem.org/Team:Heidelberg/Toolbox/Circularization introduced], artificially circularized proteins may gain heat stability because the C- and N-termini are restrained from moving freely. Circularization may be trivial when the protein termini are very close to each other, which seems to be reasonably common [[#References|[1]]]. However, if the ends are too far apart, a long linker is needed to connect them. This linker should not change the natural conformation of the protein, and it should constrain the relative position of the ends to restrict their degrees of freedom and thus stabilize the structure even when heated. In addition, the linker should not affect any of the protein's functions. Consequently, it is important to prevent linkers from passing through the active site or from covering domains that bind other molecules, for example. One therefore needs to be able to define the shape of possible linkers. This section describes the software we developed to design such linkers. We would like to stress that this work was made possible by the feedback between computer modeling and experimental work: we could first design linkers in silico, test them experimentally and use the results to further calibrate the software. To our knowledge, this is the first time that such an approach has been used to custom-design rigid linkers with angles that connect protein extremities.<br />
<br />
=General procedure=<br />
In short, the software provides a weighted list of linkers to circularize any protein of interest with a known structure. These linkers are made of rigid alpha-helical segments connected at defined angles. In contrast to flexible linkers, these rigid linkers were expected to constrain the protein extremities and to confer better heat stability. Such an idea has already been developed [[#References|[2]]], but only with alpha helices forming simple rods and without any possibility of introducing angles. To generate the linkers, we first defined the geometrical paths, with segments and angles, that they should follow. The geometrical paths that are biologically feasible are then translated into amino acid sequences. Both the check of paths against possible structures and the translation were made possible by our [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling approaches]. The first approach consisted of a statistical analysis of more than 17000 known non-homologous structures containing alpha helices connected at angles. For the second approach, we modeled the conformation of linkers circularizing proteins of known structure and analyzed them for certain properties. This second approach was run for a large number of proteins thanks to our distributed computing system [https://2014.igem.org/Team:Heidelberg/Software/igemathome igemathome]. The software outputs the possible linkers with weights that rank them according to their expected capacity to maintain protein activity at higher temperatures. These weights were obtained from an extensive [https://2014.igem.org/Team:Heidelberg/Project/Linker_Screening linker screening] on the target protein lambda-lysozyme, using the first modeling approach.<br />
<br />
=Background=<br />
Classically, protein linkers have been designed in three different ways. The easiest is to determine the length that the linker should cover and then simply use a flexible glycine-serine peptide with the number of amino acids needed to match this length. Glycine is used for flexibility, as it has no side chain and does not produce any steric hindrance, while serine is used for solubility, as it has a small polar side chain. This solubility is important, as the linker should not pass through the hydrophobic core of the protein but should be dissolved in the surrounding medium. Such flexible linkers have typically been used for circularization, but also for connecting different proteins when the main goal is that the parts are connected rather than how they are connected, or when the flexibility of the linker is required for a specific application. <br />
<br />
A second strategy consists in using rigid helical linkers to keep proteins or protein domains at a certain distance from each other. This is especially important for signalling proteins and fluorescent proteins. One major property of alpha helices is that they always fold in a defined way, with well-defined angles and lengths. There are also many different helical patterns that differ in stability and solubility. Helical linkers have also been used to design circularizing linkers [[#References|[2]]]. One big disadvantage of this strategy is that one can only build straight linkers with helices. So, in the context of circularization, if the straight line connecting the protein extremities crosses the protein, this strategy is not an option.<br />
<br />
The third option, which served as a basis for our approach and which came from discussions with the group of Rebecca Wade in Heidelberg, Germany, consists in designing custom-tailored linkers for each specific application. These linkers can be obtained from protein structure prediction. First, one defines the path that the linker should take to connect the protein ends. Then one designs a possible linker sequence that might fit well. Next, one makes a structure prediction of the linker attached to the protein to check whether it follows the intended path. Several different linkers, with slight changes, can be compared. This is repeated until the linker effectively follows the expected path. This method requires strong knowledge of protein folding and protein structure prediction and is computationally intensive. On the other hand, the benefit can be substantial, as the interaction of the linker with the protein surface can be taken into account and the path taken by the linker can be defined accurately, down to the resolution of the protein structure.<br />
<br />
We have set up a completely new strategy to design rigid linkers. As further detailed in the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling] part, it is possible to define the shape of a linker by combining rigid alpha-helical rods with well-defined angle patterns. Therefore, by defining geometrically the possible paths of the circularizing linkers for a given protein, we can propose potential linkers. Defining these geometrical paths can be very difficult, especially for large proteins with complex shapes. Moreover, the definition is further constrained by the fact that linkers must avoid hiding the active sites of the protein of interest. Finally, the paths have rotational degrees of freedom at the extremities of the protein and, depending on their orientation, they may or may not match the geometry of the protein. The tool we present here covers both steps: defining geometrical paths with weights, and translating them into feasible linkers, also with weights. The tool is universal, as it can design circularizing linkers for any protein with a known structure. Moreover, it is modular: thanks to our [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling approach], we have designed linkers as exchangeable blocks, namely rods of different lengths and angle patterns. The following sections detail the steps followed by our software to design suitable linkers.<br />
<br />
<br />
==PDB parsing==<br />
First, the PDB file containing the structure of the target protein is parsed and the coordinates of the atoms are stored in metric units. Then some initial tests are made on the protein structure. We check whether the C- and N-termini lie on the surface of the protein and are accessible to the solvent, which is crucial for circularization. We define a line originating from an extremity of the protein by the two angles of spherical coordinates around the z-axis. We then determine the accessible angles by rejecting all lines that come too close to the protein. As the future linker will be made of alpha helices and will therefore have a radius of 5 &Aring;, we use this length as the minimal allowed distance.<br />
These allowed angles are stored for the subsequent linker generation. ###fig needed### <br />
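The accessibility test described above can be sketched as follows. This is a minimal illustration and not the team's implementation: the function names, the 5-degree grid and the <code>skip_radius</code> parameter (which ignores atoms in the immediate neighbourhood of the terminus, since they would otherwise block every direction) are assumptions made for the example.<br />

```python
import math

def parse_atoms(pdb_text):
    """Collect (x, y, z) coordinates, in angstroms, from ATOM/HETATM records."""
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            # Fixed-width PDB columns 31-38, 39-46 and 47-54 hold x, y and z.
            atoms.append((float(line[30:38]), float(line[38:46]), float(line[46:54])))
    return atoms

def direction(theta_deg, phi_deg):
    """Unit vector for polar angle theta (from the z-axis) and azimuth phi."""
    t, p = math.radians(theta_deg), math.radians(phi_deg)
    return (math.sin(t) * math.cos(p), math.sin(t) * math.sin(p), math.cos(t))

def ray_is_clear(origin, unit_dir, atoms, min_dist=5.0, skip_radius=6.0):
    """True if no atom is closer than min_dist to the half-line from origin."""
    for atom in atoms:
        v = [atom[i] - origin[i] for i in range(3)]
        if math.sqrt(sum(c * c for c in v)) < skip_radius:
            continue  # hypothetical rule: ignore the terminus's own neighbourhood
        # Project the atom onto the half-line (s >= 0) to find the closest point.
        s = max(0.0, sum(v[i] * unit_dir[i] for i in range(3)))
        closest = [origin[i] + s * unit_dir[i] for i in range(3)]
        if math.dist(atom, closest) < min_dist:
            return False
    return True

def accessible_angles(origin, atoms, step_deg=5):
    """All (theta, phi) pairs on the grid whose line stays clear of the protein."""
    return [(t, p)
            for t in range(0, 181, step_deg)
            for p in range(0, 360, step_deg)
            if ray_is_clear(origin, direction(t, p), atoms)]
```

Calling <code>accessible_angles(terminus, parse_atoms(text))</code> would return the list of directions that can be stored for the later path generation.<br />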
<br />
==Generation of geometric paths==<br />
As our strategy consists in building linkers from helical rods and connecting angles, a path is completely defined by the coordinates of its angle points. Advancing one step from an existing point is always done by adding a displacement vector to this point. This vector is defined by two spherical angles, chosen discretely with an increment of 5 degrees, and by a length, also chosen discretely. This discrete length is used in two different contexts: it may correspond to the length of an alpha helix or to the length of the flexible part at the extremity of the protein. The coordinates reached by this vector define a new angle point, regardless of whether the vector corresponds to an alpha helix or to a flexible part. ###figure, points + vectors###. As we screen all possible angles discretely, the coordinates of these angle points are regularly distributed on a sphere. As further detailed in the next sections, these spheres are defined from both ends, either once or in several steps. The software then checks, for each pair of angle points originating from the two extremities, whether a straight connection of one of the given lengths is possible. ###figure needed, with all these possible steps, one sphere around each point and checking for the connections, 2D, different lengths in two different figures, ###<br />
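A minimal sketch of one such step, assuming a plain grid over the two spherical angles (the function name and the handling of the poles are choices made for this example):<br />

```python
import math

def sphere_points(center, radius, step_deg=5):
    """Candidate angle points reached from `center` by a vector of length `radius`.

    Both spherical angles are sampled every `step_deg` degrees.
    """
    points = []
    for theta in range(0, 181, step_deg):
        # At the two poles every azimuth gives the same point, so keep only one.
        phis = [0] if theta in (0, 180) else range(0, 360, step_deg)
        for phi in phis:
            t, p = math.radians(theta), math.radians(phi)
            points.append((center[0] + radius * math.sin(t) * math.cos(p),
                           center[1] + radius * math.sin(t) * math.sin(p),
                           center[2] + radius * math.cos(t)))
    return points
```

With a 5-degree increment this gives 2522 points per sphere. Note that such a grid is regular in the angles, so the points lie closer together near the poles than near the equator.<br />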
The linkers are built in a modular way, from blocks of well-defined size. From the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling] of potential linkers, we derived 8 different alpha-helical rods, all with different lengths. In addition, the length of the two segments inside an angle block is always 8&Aring;, so exchanging angle blocks does not affect the length of the linker. This means that the distance between the angle points is well defined, an essential aspect of our linker design strategy.<br />
The software proceeds in three steps. First, it checks whether a direct linker made of a single alpha helix is possible. For this, it applies the procedure just described, with spheres whose radii correspond to the length of the short flexible parts at the extremities of the protein. Second, it tests whether a linker containing two alpha helices connected at a right angle allows circularization. Finally, it searches for possible linkers with up to three angle points. The next parts explain these three steps in detail.<br />
This method was chosen because it could be implemented easily and efficiently in our program. However, it also generates paths that cross the protein. We therefore put considerable effort into sorting out the paths.<br />
<br />
===Step 1===<br />
As a simple rigid linker with no angle would be easier to design and likely more thermostable than one containing angles, the software first checks whether this simple solution is possible ###figure needed###. For this, we took into account the fact that proteins have some flexible amino acids at their extremities. This flexible part may come from the protein itself, but also from the 2 glycines that are included at the N-terminal end and from the extein at the C-terminal end. These two latter parts come from our linkers. They have no preferred angles and offer a large number of possibilities for inserting fitting linkers. But this flexibility is also a drawback, as we have to include this large number of possible angles and lengths in our path search.<br />
In this first step, the software explicitly takes these flexible parts into account when checking for possible straight linkers. As the angles and the length of the flexible parts are variable, the software positions their extremities on a sphere centered on the last fixed position of the structure, as explained above. The radius of this sphere is incremented discretely, in 4 steps, from 5.25 &Aring; to the maximum length of the flexible part. ###figure###<br />
Then all possible straight segments between the points generated at the two extremities are tested. If a segment comes closer than 5 &Aring; to the protein, or crosses it, it is rejected. Otherwise, the software checks whether the length of the segment is compatible with the feasible alpha helices: if it equals one of the 8 alpha helix lengths plus or minus 0.75 &Aring;, the path is saved.<br />
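The two tests applied to each candidate segment can be sketched as below. The rod lengths in <code>HELIX_LENGTHS</code> are hypothetical placeholders, since the 8 actual values come from the modeling and are not listed here; the function names are likewise chosen for the example.<br />

```python
import math

# Hypothetical rod lengths in angstroms; the 8 real values are not given in the text.
HELIX_LENGTHS = [10.0, 14.0, 18.0, 22.0, 26.0, 30.0, 34.0, 38.0]

def point_segment_distance(p, a, b):
    """Shortest distance from point p to the segment between a and b."""
    ab = [b[i] - a[i] for i in range(3)]
    ap = [p[i] - a[i] for i in range(3)]
    denom = sum(c * c for c in ab)
    # Clamp the projection parameter so the closest point stays on the segment.
    t = 0.0 if denom == 0 else max(0.0, min(1.0, sum(ap[i] * ab[i] for i in range(3)) / denom))
    closest = [a[i] + t * ab[i] for i in range(3)]
    return math.dist(p, closest)

def matching_helix(length, tol=0.75):
    """Return the rod length within +/- tol of `length`, or None if there is none."""
    for rod in HELIX_LENGTHS:
        if abs(length - rod) <= tol:
            return rod
    return None

def straight_linker(start, end, atoms, min_dist=5.0):
    """Return the rod length of a feasible straight linker, or None if rejected."""
    if any(point_segment_distance(atom, start, end) < min_dist for atom in atoms):
        return None  # the segment touches or crosses the protein
    return matching_helix(math.dist(start, end))
```

A segment that crosses the protein necessarily passes within 5 &Aring; of some atom, so the single distance test covers both rejection conditions in this sketch.<br />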
===Step 2===<br />
The next possibility for designing more complex rigid linkers, while still taking the flexible ends into account within a reasonable calculation time, was to reduce the number of possible angles. As we originally thought that a 90° angle would be practically feasible, the software was designed to generate linkers with flexible ends and one 90° angle. ###figure, protein_one_90° angle### This choice was made notably because the edge lengths of a right triangle are simple to calculate. We already saw in Step 1 that the length of an edge can only take 8 different values. As the linkers have to start from the extremities of the protein, and as we impose a right angle, the number of possible paths is low, making them easy to compute. In practice, the extremities of the protein are positioned flexibly, as in Step 1. From each of the positions allowed by this flexibility, the software searches for all the allowed right triangles. The right angle was imposed mainly because the degrees of freedom had to be restricted to keep the calculations feasible. <br />
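As an illustration of why the right angle keeps the search small, the following sketch enumerates the pairs of rod lengths that fit between two end points and places the corner of the triangle. The rod lengths are hypothetical placeholders, and applying the 0.75 &Aring; tolerance to the hypotenuse is an assumption of this example rather than a documented detail of the software.<br />

```python
import math
from itertools import product

# Hypothetical rod lengths in angstroms; the 8 real values are not given in the text.
HELIX_LENGTHS = [10.0, 14.0, 18.0, 22.0, 26.0, 30.0, 34.0, 38.0]

def right_angle_pairs(d, lengths=HELIX_LENGTHS, tol=0.75):
    """Ordered pairs (a, b) of rod lengths whose right triangle has hypotenuse d +/- tol."""
    return [(a, b) for a, b in product(lengths, repeat=2)
            if abs(math.hypot(a, b) - d) <= tol]

def cross(u, v):
    """Cross product of two 3D vectors."""
    return (u[1] * v[2] - u[2] * v[1], u[2] * v[0] - u[0] * v[2], u[0] * v[1] - u[1] * v[0])

def corner_points(p, q, a, b, step_deg=5):
    """Positions of the right-angle corner, at distance a from p and b from q.

    Exact when |pq| equals hypot(a, b); otherwise the small mismatch is
    assumed to be absorbed by the flexible ends.
    """
    d = math.dist(p, q)
    axis = [(q[i] - p[i]) / d for i in range(3)]
    # Build two unit vectors perpendicular to the p-q axis.
    helper = (1.0, 0.0, 0.0) if abs(axis[0]) < 0.9 else (0.0, 1.0, 0.0)
    u = cross(axis, helper)
    norm = math.sqrt(sum(c * c for c in u))
    u = [c / norm for c in u]
    v = cross(axis, u)
    # Foot and height of the altitude of a right triangle on its hypotenuse.
    foot, height = a * a / d, a * b / d
    corners = []
    for phi in range(0, 360, step_deg):
        c, s = math.cos(math.radians(phi)), math.sin(math.radians(phi))
        corners.append(tuple(p[i] + foot * axis[i] + height * (c * u[i] + s * v[i])
                             for i in range(3)))
    return corners
```

For a given pair of end points, the candidate corners lie on a circle around the line connecting them, so only one rotation angle remains to be screened.<br />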
===Step 3===<br />
Finally, the software can also find paths with up to 4 edges, meaning 4 alpha helices and 3 angles. Thanks to the modularity of the possible linkers, such paths can in theory circularize any kind of protein. ###figure of torus###<br />
To keep the calculation feasible in a reasonable time, we designed the search strategy so that the flexible part at each extremity is oriented in the same direction as the adjacent alpha helix. This obviously restricts the search, but as these orientations are allowed for the flexible part, the approach remains valid. <br />
First, the potential end points of the first alpha-helical rod are calculated from the N-terminal point of the protein. The orientation is chosen discretely, with an increment of 5 degrees for the two angles of the spherical coordinates. The distance from the origin takes one of the 8 lengths allowed for the alpha helices, as already seen in Step 2, or a length of 0, which mimics a linker with 3 instead of 4 edges. Exactly the same procedure is repeated to define all potential end points of the second alpha-helical rod, starting from every possible end point of the first rod. Because a length of 0 is allowed for both the first and the second rod, the software also calculates paths with 2 edges. Then the same is done only once from the C-terminal point of the protein, defining 1 edge. The final step consists in checking whether the points originating from the N- and C-terminal points can be linked by a potential alpha helix, i.e. whether they are separated by an appropriate distance. If any of the potential alpha helix lengths lies within the distance between two points plus or minus 0.75 &Aring;, the path is saved. Likewise, if two points are closer than 0.75 &Aring; to each other, the path is also saved.<br />
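The search just described can be sketched as a brute-force enumeration. The rod lengths are hypothetical placeholders, the function names are invented for the example, and excluding the length 0 on the C-terminal side is an assumption. With the 5-degree increment the number of combinations becomes very large, so the sketch is meant to be read, or run with a coarse increment.<br />

```python
import math

# Hypothetical rod lengths in angstroms; the 8 real values are not given in the text.
HELIX_LENGTHS = [10.0, 14.0, 18.0, 22.0, 26.0, 30.0, 34.0, 38.0]
TOL = 0.75

def directions(step_deg):
    """Unit vectors on a grid of the two spherical angles."""
    dirs = []
    for theta in range(0, 181, step_deg):
        for phi in ([0] if theta in (0, 180) else range(0, 360, step_deg)):
            t, p = math.radians(theta), math.radians(phi)
            dirs.append((math.sin(t) * math.cos(p), math.sin(t) * math.sin(p), math.cos(t)))
    return dirs

def expand(points, lengths, dirs):
    """End points of one more rod from every point; a length of 0 keeps the point."""
    out = []
    for p in points:
        for length in lengths:
            if length == 0:
                out.append(p)
            else:
                out.extend(tuple(p[i] + length * d[i] for i in range(3)) for d in dirs)
    return out

def closing_rod(p, q):
    """Rod length that closes the gap, 0.0 if the points coincide, else None."""
    gap = math.dist(p, q)
    if gap < TOL:
        return 0.0
    for rod in HELIX_LENGTHS:
        if abs(gap - rod) <= TOL:
            return rod
    return None

def find_paths(n_term, c_term, step_deg=5):
    """Pairs of N-side and C-side angle points that one rod (or none) can connect."""
    dirs = directions(step_deg)
    first = expand([n_term], [0.0] + HELIX_LENGTHS, dirs)   # first rod, may be absent
    second = expand(first, [0.0] + HELIX_LENGTHS, dirs)     # second rod, may be absent
    c_side = expand([c_term], HELIX_LENGTHS, dirs)          # one rod from the C-terminus
    paths = []
    for p in second:
        for q in c_side:
            rod = closing_rod(p, q)
            if rod is not None:
                paths.append((p, q, rod))
    return paths
```

For example, <code>find_paths((0, 0, 0), (0, 0, 30), step_deg=90)</code> runs quickly and already returns several candidate connections. A spatial index could replace the double loop, but that is beyond this sketch.<br />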
<br />
==Sorting out of paths==<br />
The previous part described the generation of paths that can connect the two extremities of the protein irrespective of their position relative to the protein. While this allows fast computation of the geometrical paths, it also implies that the paths that are not practically feasible need to be sorted out. This is the most time-consuming part of the computation, as about 1 billion paths are generated. Three criteria are considered for the sorting. The first is the feasibility of the linker: can the software find angle patterns that correspond to the angles defined by the geometrical path? This question was part of the motivation for a large modeling effort (link) to determine the possible angles between consecutive alpha helices. This was achieved by analyzing the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling distribution of angles] between alpha helices found in the ArchDB database. As nearly any angle between 20 and 170 degrees could be found, only a few paths were actually rejected at this step. The next criterion is the position of the angle points: if one of them lies inside the protein, the path is rejected. Finally, the software checks whether any atom of the protein is less than 5&Aring; away from any of the alpha helices; if so, the path is also rejected.<br />
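The first criterion can be illustrated with a short sketch. It assumes that the angle is measured at the angle point between the two adjacent segments, so that 180 degrees corresponds to a straight continuation; the convention used in the actual analysis may differ, and the function names are chosen for the example.<br />

```python
import math

def corner_angle(prev_pt, corner, next_pt):
    """Angle in degrees at `corner` between the segments to its two neighbours."""
    u = [prev_pt[i] - corner[i] for i in range(3)]
    v = [next_pt[i] - corner[i] for i in range(3)]
    cos_a = sum(u[i] * v[i] for i in range(3)) / (math.hypot(*u) * math.hypot(*v))
    # Clamp against rounding errors before taking the arc cosine.
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_a))))

def angles_feasible(points, min_angle=20.0, max_angle=170.0):
    """True if every interior angle point of the path lies in the allowed range."""
    return all(min_angle <= corner_angle(points[i - 1], points[i], points[i + 1]) <= max_angle
               for i in range(1, len(points) - 1))
```

The remaining two criteria need the protein atoms: the clearance test is a point-to-segment distance check against the 5 &Aring; limit, applied to every rod of the path.<br />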
<br />
==Shifting paths to the patterns==<br />
<br />
The strategy described in Step 3 gives a certain freedom to the rod that connects the last two angle points generated from the N- and C-terminal points. As this freedom is not actually permitted by the alpha helix and the angle pattern, but is permitted by the flexible part, for example at the C-terminal end, the software slightly refines the path by rotating the segment that originates from the C-terminal point.<br />
<br />
==Weighting of paths==<br />
Before translating the paths into sequences, and thus into linkers that can be expressed, each path needs to be evaluated for its potential capacity to enhance heat stability. For this we identified different contributions that are combined into one value, which defines how good the linker may be and hence its ranking among all possible linkers. The smaller this value, the more we expect the linker to enhance thermostability. An important step for all these contributions is normalization, as explained in the next paragraphs.<br />
The first contribution we considered is the linker length. We assumed that a short linker constrains the protein extremities better, and that a long helix might give more flexibility. Because we wanted this value to be independent of the size of the protein, the length of the linker is normalized to the distance between the two termini.<br />
The second contribution relates to the angles used in the linker. We learned from the [[modeling ###link###]] that the angles formed by a given angle pattern follow a certain distribution. First, we assumed that the narrower the distribution, the more likely the alpha helices are to actually form this angle. Second, the angles found by the software should be as close as possible to these well-defined angles. When both conditions hold, the weight value from this contribution is low.<br />
Then the distance of the linker to the protein is taken into account. Because the linkers should not disturb the protein in its normal environment, linkers that pass close to the protein surface are considered better. The distance is defined as the minimal distance between the linker and all the atoms of the protein. As already mentioned for the sorting of the paths, a linker cannot come closer than 5 &Aring;, and this distance is used to normalize the calculated distances.<br />
After this, the places a linker should avoid are taken into account. Each protein can interact with other molecules on some parts of its surface. The user can specify where these parts are and how big they are. If a linker passes in front of a potential binding domain, the value of the corresponding path goes to infinity, so that the linker is discarded. Conversely, the farther a linker is from a potential ligand-binding domain, the smaller its weighting value. The user can also specify the importance of certain regions. In the end, the total weighting is normalized to the number of binding domains.<br />
<br />
===Calibrating the weighting function===<br />
Every contribution has its own distribution. An example is shown in figure ### [[figure histogram_length_lys.png]], but all of them have different shapes. The aim is to find the paths with low values for all of these contributions. Therefore, in the weighting function the four contributions mentioned above are combined linearly:<br />
\[ W(p) = \alpha L(p) + \beta A(p) + \gamma D(p) + \delta u(p) \]<br />
where W is the final weighting, p the path, L the length contribution, A the angle contribution, D the distance contribution and u the contribution from the forbidden regions. $\alpha, \beta, \gamma, \delta$ are the weighting constants that needed to be determined. Each contribution was normalized so that it is dimensionless and so that all contributions have reasonably similar values.<br />
The weighting constants were obtained from the [[linker-screening ###lin]] performed with lysozyme and the [[enzyme-modeling ###link]].<br />
Please see [[below ###link]] for a detailed explanation of how the values were obtained.<br />
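The linear combination can be written down in a few lines. In this sketch the constants default to 1.0 purely as placeholders, since the calibrated values are not given here, and the class and function names are invented for the example.<br />

```python
import math
from dataclasses import dataclass

@dataclass
class Contributions:
    length: float     # L(p): linker length divided by the distance between the termini
    angle: float      # A(p): mismatch with the preferred angles of the patterns
    distance: float   # D(p): minimal linker-protein distance divided by 5 angstroms
    forbidden: float  # u(p): math.inf if the path passes in front of a forbidden region

def weight(c, alpha=1.0, beta=1.0, gamma=1.0, delta=1.0):
    """W(p) = alpha*L + beta*A + gamma*D + delta*u; smaller is better."""
    if math.isinf(c.forbidden):
        return math.inf  # discard the path, and avoid 0 * inf if delta is 0
    return alpha * c.length + beta * c.angle + gamma * c.distance + delta * c.forbidden

def rank(paths):
    """Sort a {name: Contributions} mapping by increasing weight."""
    return sorted(((name, weight(c)) for name, c in paths.items()), key=lambda item: item[1])
```

Paths that cross a forbidden region end up at the bottom of the ranking with an infinite weight and can simply be dropped.<br />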
<br />
==Translating paths to sequence==<br />
As mentioned before, the software is provided with two databases, one for the possible angle patterns and one for the helix patterns. The choice of patterns was inspired by known crystal structures extracted from databases and described in different papers.<br />
A large in silico screening to refine the preferences of the patterns was then set up using the [[iGEM@home|###Link]] system. For a complete description of the search for suitable patterns, see the [[modeling|###Link to patternspart]] page. <br />
All the possible paths are now split at the angle points and compared with the possible patterns in the databases. ###Figure needed to explain, how the path is translated### The most suitable patterns are identified and joined to build the path's sequence. It is important to note that this is only possible because of the modularity of the linker patterns used as building blocks: each block, whether an alpha helix or an angle pattern, is not affected by the others. Thus, for each possible path, one sequence is produced.<br />
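The modular translation can be pictured as a lookup in the two databases followed by concatenation. The sketch below uses dummy tokens instead of amino acid sequences and picks the nearest entry by length or angle; both the database contents and the selection rule are assumptions made for illustration.<br />

```python
# Placeholder databases: keys are rod lengths (angstroms) or pattern angles (degrees).
# The values are dummy tokens, not real amino acid sequences.
HELIX_DB = {10.0: "[helix-10]", 14.0: "[helix-14]", 18.0: "[helix-18]"}
ANGLE_DB = {60.0: "[angle-60]", 90.0: "[angle-90]", 120.0: "[angle-120]"}

def nearest(db, value):
    """Entry of `db` whose key is closest to `value`."""
    key = min(db, key=lambda k: abs(k - value))
    return db[key]

def translate(rod_lengths, angles):
    """Alternate helix and angle blocks: rod, angle, rod, ..., rod."""
    if len(rod_lengths) != len(angles) + 1:
        raise ValueError("a path with n rods needs n - 1 angles")
    blocks = []
    for i, length in enumerate(rod_lengths):
        blocks.append(nearest(HELIX_DB, length))
        if i < len(angles):
            blocks.append(nearest(ANGLE_DB, angles[i]))
    return "".join(blocks)
```

Because each block is chosen independently of its neighbours, the sequence of a path is simply the concatenation of the blocks found for its rods and angles.<br />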
==Clustering of paths==<br />
Many different paths are represented by the same sequence [[###Figure that shows, different paths have same properties, already before in the text ###]], and we therefore clustered such paths. The weight of each cluster was then calculated by averaging the weights of the paths that compose it.<br />
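A minimal sketch of this clustering, assuming each path is available as a pair of its sequence and its weight (the function name is chosen for the example):<br />

```python
from collections import defaultdict
from statistics import mean

def cluster_by_sequence(paths):
    """Group (sequence, weight) pairs by sequence and average the weights."""
    groups = defaultdict(list)
    for sequence, weight in paths:
        groups[sequence].append(weight)
    return {sequence: mean(weights) for sequence, weights in groups.items()}
```

Sorting the resulting dictionary by value then gives the final ranking of distinct linker sequences.<br />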
<br />
=Results=<br />
IMPORTANT!<br />
paths-> geometrical<br />
linkers-> helices...<br />
sequences-> seq of linkers<br />
insist on huge number of data produced<br />
We definitely need to cite [0]!!!<br />
=Abstract=<br />
=Overview=<br />
As already introduced [###link to circularization idea in the toolbox], artificially circularized proteins may gain heat stability because their C- and N-termini are restrained from moving around freely. Circularization may be trivial when the protein termini are very close to each other, which seems to be reasonably common [[#References|[1]]]. However, if the ends are too far apart, a long linker is needed to connect them. This linker should not change the natural conformation of the protein and should constrain the relative position of the ends, restricting their degrees of freedom and thus stabilizing the structure even when it is heated. In addition, the linker should not affect any of the protein's functions. Consequently, it is important to prevent linkers from passing through the active site or from covering binding domains for other molecules, for example. One therefore needs to be able to define the shape of possible linkers. This section describes the software we developed to design such linkers. We would like to stress that this work was made possible by the feedback between computer modeling and experimental work: we could first design linkers in silico, test them experimentally and use the results to further calibrate the software. To our knowledge, this is the first time that such an approach has been used to custom-design linkers to connect protein extremities.<br />
In short, the software provides a weighted list of linkers to circularize any protein of interest with a known structure. These linkers are made of rigid alpha-helical segments connected at defined angles. In contrast to flexible linkers, these rigid linkers were expected to constrain the protein extremities and to confer better heat stability. To generate the linkers, we first defined the geometrical paths, with segments and angles, that they should follow. The geometrical paths that are biologically feasible are then translated into amino acid sequences. Both the check of paths against possible structures and the translation were made possible by our modeling approaches. The first approach consisted of a statistical analysis of more than 17000 known non-homologous structures containing alpha helices connected at angles. For the second approach, we modeled the conformation of linkers circularizing proteins of known structure and analyzed them for certain properties. This second approach was run for a large number of proteins thanks to our distributed computing system igemathome. The software outputs the possible linkers with weights that rank them according to their expected capacity to maintain protein activity at higher temperatures. These weights were obtained from an extensive [[linker screening| ### ]] on the target protein lambda-lysozyme, using the first modeling approach.<br />
The running stability of the software was checked in a large-scale test on [[iGEM@home]]. <br />
Furthermore, our software provided linkers for circularizing [[DNMT1| ###]], which could be made more heat-stable by circularization???.<br />
For detailed information on the implementation and the practical use of the software, please see the [[documentation software-docu]] page.<br />
==What does the software do==<br />
still missing [[figure 0, graph abstract]]<br />
=Background=<br />
What are linkers, why should ours work better?<br />
Linkers, References to the modeling, <br />
Classically, protein linkers have been designed in three different ways. ###REFERENCE### The easiest is to determine the length that the linker should cover and then simply use a flexible glycine-serine peptide with the number of amino acids needed to match this length. Glycine is used for flexibility, as it has no side chain and does not produce any steric hindrance, while serine is used for solubility, as it has a small polar side chain. This solubility is important, as the linker should not pass through the hydrophobic core of the protein but should be dissolved in the surrounding medium. Such flexible linkers have typically been used for circularization, but also for connecting different proteins when the main goal is that the parts are connected rather than how they are connected, or when the flexibility of the linker is required for a specific application.<br />
A second strategy consists in using rigid helical linkers to keep proteins or protein domains at a certain distance from each other. This is especially important for signalling proteins and fluorescent proteins. ###TODO: Reference### One major property of alpha helices is that they always fold in a defined way, with well-defined angles and lengths. There are also many different helical patterns that differ in stability and solubility. One big disadvantage of this strategy is that one can only build straight linkers with helices. So, in the context of circularization, if the straight line connecting the protein extremities crosses the protein, this strategy is not an option.<br />
The third option, which served as a basis for our approach, consists in designing custom-tailored linkers for each specific application. These linkers can be obtained from protein structure prediction. First, one defines the path that the linker should take to connect two amino acids. Then one designs a possible linker sequence that might fit well. Next, one makes a structure prediction of the linker attached to the protein to check whether it follows the intended path. Several different linkers, with slight changes, can be compared. This is repeated until the linker effectively follows the expected path. ###TODO: Reference, WADE paper### This method is time-consuming, as it is not only computationally intensive but also requires strong knowledge of protein folding and protein structure prediction. On the other hand, the benefit can be substantial, as the interaction of the linker with the protein surface can be taken into account and the path taken by the linker can be defined accurately, down to the resolution of the protein structure.<br />
We have set up a completely new strategy to design rigid linkers. As further detailed in the [[###link###|modelling]] part, it is possible to define the shape of a linker by combining rigid alpha-helical rods with well-defined angle patterns. Therefore, by defining geometrically the possible paths of the circularizing linkers for a given protein, we can propose potential linkers. Defining these geometrical paths can be very difficult, especially for large proteins with complex shapes. Moreover, the definition is further constrained by the fact that linkers must avoid hiding the active sites of the protein of interest. Finally, the paths have rotational degrees of freedom at the extremities of the protein and, depending on their orientation, they may or may not match the geometry of the protein. The tool we present here covers both steps: defining geometrical paths with weights, and translating them into feasible linkers, also with weights. The tool is universal, as it can design circularizing linkers for any protein with a known structure. Moreover, it is modular: thanks to our modeling approach [link], we have designed linkers as exchangeable blocks, namely rods of different lengths and angle patterns. The following sections detail the steps followed by our software to design suitable linkers.<br />
==PDB analysis==<br />
At first, the PDB file containing the structure of the target protein is parsed and the coordinates of the atoms are stored, in the metric unit.After this, some initial tests are made with the protein structure. First, we checked whether the C- and N-termini lie on the surface of the protein and are accessible to the solvent, which is crucial for circularization. We defined a line originating from an extremity of the protein with the two angles of the spherical coordinates around the z-axis. From that, we could determine the accessible angles by rejecting all the lines that are too close to the protein. As the future linker will be made of alpha helices and will therefore have a radius of 5 &Aring;, we used this length as the minimal allowed distance.<br />
Those allowed angles are stored for the coming linker generation. . ###fig needed### <br />
==Generation of geometric paths==<br />
As our strategy consists in building linkers with helical rods and connecting angles, a path is completely defined by the coordinates of the angle points. Advancing one step from an existing point is always done by adding a displacement vector on this point. This vector is defined by the two spherical angles, chosen here in a discrete manner with an increment of 5 degrees, and by a length, also chosen in a discrete manner. This discrete length was used in two different contexts: it may correspond to the length of an alpha helix or to the length of the flexible part that appears at the extremity of the protein. The coordinates that are reached thanks to this vector defines the new coordinates of an angle point, no matter if the vector corresponds to an alpha helix or to a flexible part. ###figure, points + vectors###. As we screen for all possible angles in a discrete manner, those angle points coordinates are regularly distributed on a sphere. As further detailed in the next sections, those spheres are defined from both ends, either once or in several steps. Then the software checks for possible straight connections of given lengths for each pair of angle points originated from both extremities. ###figure needed, with all these possible steps, one sphere around each point and checking for the connections, 2D, different lengths in two different figures, ###<br />
The linkers are built in a modular way, from blocks of well-defined size. From the modeling of potential linkers [link], we could derive 8 different alpha helical rods, all with different lengths. In addition, the length of the two segments inside an angle block was always 8 &Aring;, so exchanging angle blocks does not affect the length of the linker. This means that the distance between the angle points is well defined, an essential aspect of our strategy of linker design.<br />
The software proceeds in three steps. First, it checks whether a direct linker made of a single alpha helix is possible. For this, it applies the procedure just described, with spheres whose radius reasonably corresponds to the length of the short flexible parts at the extremities of the protein. Second, it tests whether a linker containing two alpha helices connected by a right angle allows the circularization. Finally, it searches for possible linkers with three angle points. The next parts explain these three steps in detail.<br />
This method was chosen because it could be implemented easily and efficiently in our program. However, this strategy also generates paths that cross the protein. We therefore put considerable effort into sorting out such paths.<br />
===Step 1===<br />
As a simple rigid linker with no angle would be easier to design and likely more thermostable than one containing angles, the software first checks if this simple solution is possible ###figure needed###. For this, we took into account the fact that proteins have some flexible amino acids at their extremities. This flexible part may come from the protein itself, but also from the 2 glycines that are included at the N-terminal part and from the extein at the C-terminal part. These two latter parts come from our linkers. Those parts have no preferential angles and offer a large number of possibilities to insert fitting linkers. But this flexibility is also a drawback, as we have to include this large number of possible angles and lengths in our path search.<br />
In this first step, the software explicitly takes these flexible parts into account to check for the possibility of straight linkers. As the angles and the length of the flexible parts are variable, the software positions their extremity on a sphere centered on the last fixed position of the structure, as explained above. The radius of this sphere is incremented in a discrete manner, in 4 steps, from 5.25 &Aring; to the maximum length of the flexible part. ###figure###<br />
Then all possible straight segments between the points generated around one extremity (the first points) and those generated around the other (the last points) are tested. If they come closer than 5 &Aring; to the protein, or if they cross it, they are rejected. If they are kept, the software checks whether the length of the segment is compatible with the feasible alpha helices: if the length of a given segment equals one of the 8 alpha helix lengths plus or minus 0.75 &Aring;, the path is saved.<br />
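The length test at the end of Step 1 can be illustrated as follows. This is a minimal sketch under stated assumptions: the 8 rod lengths in `HELIX_LENGTHS` are placeholder values chosen for the example (the real values come from the modeling), and the function name `matching_helix` is hypothetical. Only the tolerance of 0.75 &Aring; is taken from the text.<br />

```python
import math

# Placeholder rod lengths in angstroms (assumption, NOT the values used by the software).
HELIX_LENGTHS = [10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0, 45.0]
TOLERANCE = 0.75  # angstroms, as stated in the text


def matching_helix(first_point, last_point, lengths=HELIX_LENGTHS, tol=TOLERANCE):
    """Return the rod length that fits the segment between the two points, or None."""
    d = math.dist(first_point, last_point)
    for length in lengths:
        if abs(d - length) <= tol:
            return length
    return None
```

A segment of 20.5 &Aring; would be accepted with the placeholder rod of 20 &Aring;, whereas a segment of 22 &Aring; would match none of the placeholder rods and the pair of points would be dropped.<br />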
===Step 2===<br />
The next possibility to design more complex rigid linkers, while still taking flexible ends into account within a reasonable calculation time, was to reduce the number of possible angles. As we originally thought that a 90° angle would be practically feasible, the software was designed to generate linkers with flexible ends and one 90° angle. ###figure, protein_one_90° angle### This choice was notably made because the edge lengths of a right triangle are simple to calculate. We already saw in Step 1 that the length of an edge can only take 8 different values. As the linkers have to start from the extremities of the protein, and as we impose a right angle, the number of possible paths is low, making them easy to compute. In practice, the extremities of the protein are positioned in a flexible way as in Step 1. From each of the positions allowed by this flexibility, the software searches for all the allowed right triangles. This restriction was mainly introduced because the degrees of freedom had to be limited to keep the calculations feasible.<br />
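The search for allowed right triangles relies on the Pythagorean relation between the two rods and the distance they have to bridge. The sketch below only illustrates that idea: the function name `right_angle_pairs` is hypothetical, the rod lengths passed in are up to the caller, and reusing the tolerance of 0.75 &Aring; from Step 1 for the hypotenuse is an assumption.<br />

```python
import math
from itertools import product


def right_angle_pairs(distance, lengths, tol=0.75):
    """Ordered pairs (a, b) of rod lengths whose right triangle spans `distance`.

    The two rods are the legs of the triangle; the straight line between the
    two end points is the hypotenuse, so sqrt(a^2 + b^2) must match `distance`.
    """
    return [
        (a, b)
        for a, b in product(lengths, repeat=2)
        if abs(math.hypot(a, b) - distance) <= tol
    ]
```

For example, with rods of 15 and 20 &Aring; available, two end points that are 25 &Aring; apart can be connected by either ordering of these two rods.<br />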
===Step 3===<br />
Finally, the software also provides the possibility to find paths with up to 4 edges, meaning 4 alpha helices and 3 angles. Thanks to the modularity of the possible linkers, such paths make it possible, in theory, to circularize any kind of protein. ###figure of torus###<br />
To keep the calculation feasible in a reasonable time, we designed the search strategy so that the flexible parts at the extremities are oriented in the same direction as the adjacent alpha helix. This obviously restricts the search, but as these orientations are allowed for the flexible part, the approach remains valid.<br />
First, potential ending points of the first alpha helical rod are calculated from the N-terminal point of the protein. The orientation is chosen in a discrete manner, with an increment of 5 degrees for the two angles of the spherical coordinates. The distance from the origin corresponds to the 8 possible lengths allowed by the alpha helices, as already seen in Step 2, plus a length of 0, which mimics a linker with 3 instead of 4 edges. The exact same procedure is repeated to define all the potential ending points of the second alpha helical rod, starting from all the possible ending points of the first alpha helical rod. Thanks to the possibility of a length of 0 for the first and the second rods, the software also calculates paths with 2 edges. Then, the same is done only once from the C-terminal point of the protein, defining 1 edge. The final step consists in checking whether the points originating from the N- and C-terminal points can be linked by a potential alpha helix, i.e. whether they are separated by the appropriate distance. If any of the potential alpha helix lengths lies within the distance between the two points plus or minus 0.75 &Aring;, the path is saved. In the same way, if two points are closer than 0.75 &Aring; to each other, the path is also saved.<br />
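The Step 3 search can be summarized by the following sketch, which grows two rods from the N-terminal point, one rod from the C-terminal point, and then tests whether a last rod closes the gap. It is a simplified illustration and not the implementation of the software: all function names are hypothetical, the rod lengths are supplied by the caller, and no check against the protein is made here.<br />

```python
import math


def directions(step_deg):
    """Unit vectors on a regular grid of the two spherical angles."""
    dirs = []
    for t in range(0, 181, step_deg):
        phis = [0] if t in (0, 180) else range(0, 360, step_deg)
        for p in phis:
            th, ph = math.radians(t), math.radians(p)
            dirs.append((math.sin(th) * math.cos(ph),
                         math.sin(th) * math.sin(ph),
                         math.cos(th)))
    return dirs


def extend(point, lengths, dirs):
    """End points of one rod started at `point`; a length of 0 keeps the point itself."""
    ends = []
    for length in lengths:
        if length == 0:
            ends.append(point)
            continue
        for d in dirs:
            ends.append(tuple(c + length * u for c, u in zip(point, d)))
    return ends


def closing_rod(a, b, rod_lengths, tol=0.75):
    """Rod length bridging a and b, 0 if they already coincide within tol, else None."""
    dist = math.dist(a, b)
    if dist < tol:
        return 0
    for length in rod_lengths:
        if abs(dist - length) <= tol:
            return length
    return None


def find_paths(n_term, c_term, rod_lengths, step_deg=5):
    """Geometric paths (tuples of angle points) from n_term to c_term."""
    dirs = directions(step_deg)
    with_zero = [0] + list(rod_lengths)  # length 0 mimics a missing rod
    from_c = extend(c_term, rod_lengths, dirs)  # done only once from the C-terminus
    paths = []
    for p1 in extend(n_term, with_zero, dirs):
        for p2 in extend(p1, with_zero, dirs):
            for p3 in from_c:
                if closing_rod(p2, p3, rod_lengths) is not None:
                    paths.append((n_term, p1, p2, p3, c_term))
    return paths
```

The three nested loops make the cost of this step explicit: every additional rod multiplies the number of candidates by the number of directions times the number of rod lengths, which is why the search was limited to 4 edges.<br />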
==Sorting out of paths==<br />
The previous part described the generation of paths that can connect the two extremities of the protein, irrespective of the position of these paths relative to the protein. While this allows fast computation of the geometrical paths, it also implies that the paths that are not practically feasible need to be sorted out. This is the most time-consuming part of the computation, as about 1 billion paths are generated. Three criteria are considered for the sorting. The first one is the feasibility of the linker: can the software find angle patterns that correspond to the ones defined by the geometrical path? This question was part of the motivation for a large modeling effort (link) to determine the possible angles between consecutive alpha helices. This was achieved by analyzing the distribution of angles between alpha helices found in the ArchDB database (link). As nearly any angle between 20 and 170 degrees could be found, only a few paths were actually rejected at that step. The second criterion is the position of the angle points: if one of them lies inside the protein, the path is rejected. Finally, the software checks whether any atom of the protein is less than 5 &Aring; away from any of the alpha helices; if so, the path is also rejected.<br />
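The third criterion amounts to computing the shortest distance between each atom and each straight rod of the path. A minimal sketch of such a test is given below; it treats a rod as a line segment and an atom as a point, the function names are hypothetical, and it is not necessarily the distance function used in the software.<br />

```python
import math


def point_segment_distance(p, a, b):
    """Shortest distance from point p to the segment between a and b."""
    ab = [bc - ac for ac, bc in zip(a, b)]
    ap = [pc - ac for ac, pc in zip(a, p)]
    ab_len2 = sum(c * c for c in ab)
    if ab_len2 == 0.0:
        # Degenerate segment: both ends coincide.
        return math.dist(p, a)
    # Projection of p onto the line, clamped to the segment ends.
    t = sum(x * y for x, y in zip(ap, ab)) / ab_len2
    t = max(0.0, min(1.0, t))
    closest = [ac + t * c for ac, c in zip(a, ab)]
    return math.dist(p, closest)


def rod_is_clear(a, b, atoms, min_distance=5.0):
    """True if no atom lies closer than `min_distance` angstroms to the rod a-b."""
    return all(point_segment_distance(atom, a, b) >= min_distance for atom in atoms)
```

Because this test has to be repeated for every rod of every path against all atoms of the protein, it is easy to see why the sorting dominates the computing time.<br />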
==Shifting paths to the patterns==<br />
The strategy described in Step 3 gives a certain freedom to the rod that connects the last two angle points generated from the N- and C-terminal points. As this freedom is actually not permitted by the alpha helix and the angle pattern, but is permitted by the flexible part, for example at the C-terminal end, the software slightly refines the path by rotating the segment that originates from the C-terminal point.<br />
==Weighting of paths==<br />
Before translating the paths into sequences and thus into linkers that can be expressed, each path needs to be evaluated for its potential capacity to enhance heat stability. For this we have identified different contributions that should be combined into one value that defines how good the linker may be and that consequently defines its ranking among all possible linkers. The smaller this value, the more we expect that the linker will enhance thermostability. An important step for all these contributions is the normalization, as explained in the next paragraphs.<br />
The first contribution we considered was the linker length. We assumed that a short linker is better to constrain the protein extremities, and that a long helix might give more flexibility. Because we wanted this value to be independent of the size of the protein, the length of the linker is normalized to the distance between the two termini.<br />
The second contribution relates to the angles used in the linker. We learned from the [[modeling ###link###]] that angles formed by a certain angle pattern follow a certain distribution. First, we assumed that the narrower the distribution, the more likely the alpha helices would actually produce this angle. Second, the angles found by the software should be as close as possible to those well-defined angles. In this case, the weight value from this contribution should be low.<br />
Then the distance of the linker to the protein is taken into account. Because the linkers should not disturb the protein in its normal environment, linkers that pass close to the protein surface are considered better linkers. The distance was defined as the minimal distance between the linker and all the atoms of the protein. As already mentioned for the sorting of the paths, a linker cannot come closer than 5 &Aring; and this distance was used for normalization of calculated distances.<br />
After this, the places a linker should avoid are calculated. Each protein can interact with other molecules on some parts of its surface. The user can specify where and how big those parts are. If a linker passes in front of a potential molecule-binding domain, the value of the corresponding path goes to infinity, so that the linker is discarded. Conversely, the farther a linker is from a potential ligand-binding domain, the smaller its weighting value. The user can also specify the importance of certain regions. In the end, the total weighting is normalized to the number of binding domains.<br />
===Calibrating the weighting function===<br />
Every contribution has its own distribution. You can see an example in figure ### [[figure histogram_length_lys.png]], but all of them have different shapes. The aim is to find the paths that minimize all of these contributions. Therefore, in the weighting function, the four contributions mentioned above are combined in a linear manner:<br />
\[ W(p) = \alpha L(p) + \beta A(p) + \gamma D(p) + \delta u(p) \]<br />
where W is the final weighting, p the path, L the length contribution, A the angle contribution, D the distance contribution and u the contribution from the forbidden regions. $\alpha, \beta, \gamma, \delta$ are the weighting constants that needed to be found. The normalization of each contribution was chosen so that each of them is dimensionless and all of them have reasonably similar values.<br />
The weighting constants were obtained from the [[linker-screening ###lin]] performed with lysozyme and the [[enzyme-modeling ###link]]<br />
Please see [[below ###link]] for a detailed explanation of how the values were obtained.<br />
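The linear combination W(p) and the resulting ranking can be written down directly. In the sketch below the default constants of 1.0 are placeholders only (the calibrated values are not reproduced here), the function names are hypothetical, and each path is assumed to be already reduced to its four normalized contributions.<br />

```python
import math


def path_weight(contrib, alpha=1.0, beta=1.0, gamma=1.0, delta=1.0):
    """W(p) for one path; `contrib` holds the normalized (L, A, D, u) values."""
    L, A, D, u = contrib
    return alpha * L + beta * A + gamma * D + delta * u


def rank_paths(contribs, **constants):
    """Indices of the paths, smallest W first; non-finite weights are discarded."""
    weights = [path_weight(c, **constants) for c in contribs]
    keep = [i for i, w in enumerate(weights) if math.isfinite(w)]
    return sorted(keep, key=lambda i: weights[i])
```

A path that crosses a forbidden region has an infinite u contribution, so its weight is infinite and it drops out of the ranking, as described above.<br />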
==Translating paths to sequence==<br />
As mentioned before, the software is provided with two databases, one for the possible angle patterns and one for the helix patterns. The choice of the patterns was inspired by known crystal structures extracted from databases and described in different papers.<br />
A huge in silico screening for refining the preferences of the patterns was then set up using the [[iGEM@home|###Link]] system. For a complete description of the search for suitable patterns, see the [[modeling|###Link to patternspart]] page.<br />
All the possible paths are now split up at the angles and compared with the possible patterns in the databases. ###Figure needed to explain, how the path is translated### The most suitable patterns are identified and joined together to build the path's sequence. It is important to note that this is only possible because of the modularity of the linker patterns used as building blocks: each block, whether an alpha helix or an angle pattern, is not affected by the others. Thus, for each possible path, one sequence is produced.<br />
==Clustering of paths==<br />
Many different paths are represented by the same sequence [[###Figure that shows, different paths have same properties, already before in the text ###]] and we therefore clustered such paths. The weights for those clustered paths were then calculated by averaging the weights of the different paths that compose a cluster.<br />
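The clustering step is a grouping by sequence followed by an average. A minimal sketch, assuming each path has already been reduced to a pair of its sequence and its weight (the function name and the example sequences are hypothetical):<br />

```python
from collections import defaultdict


def cluster_by_sequence(paths):
    """Map each linker sequence to the mean weight of all paths that share it.

    `paths` is an iterable of (sequence, weight) pairs.
    """
    groups = defaultdict(list)
    for sequence, weight in paths:
        groups[sequence].append(weight)
    return {seq: sum(ws) / len(ws) for seq, ws in groups.items()}
```

Each linker sequence thus ends up with a single weighting value that reflects all the geometric paths it can follow.<br />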
=Results=<br />
==DNMT1==<br />
A major motivation for our effort to design rigid linkers with angles was the circularization of the DNA methyltransferase Dnmt1 (link). The truncated form used in our project is composed of 900 amino acids, and the N- and C-terminal extremities are well separated. To circularize it, two linkers were designed: a flexible one made of glycine and serine, and a rigid one designed by the software. The rigid linkers for DNMT1 were obtained from an early state of the software. At that time the calculation took 11 days on a laptop computer with an Intel i5 processor and 8 GB of RAM, which shows the importance of a distributed computing system for large proteins. Since then, the software has improved a lot, reducing the calculation time for DNMT1 to about 1 day.<br />
IMPORTANT!<br />
paths-> geometrical<br />
linkers-> helices...<br />
sequences-> seq of linkers<br />
insist on huge number of data produced<br />
We definitely need to cite [0]!!!<br />
=Abstract=<br />
=Overview=<br />
As already introduced [###link to circularization idea in the toolbox], artificially circularized proteins may gain some heat stability by restraining the C- and N-terminus from moving around freely. This circularization may be trivial when the protein termini are very close to each other, which seems to be reasonably common [[#References|[1]]]. However, if the ends are too far from each other, a long linker is needed to connect them. This linker should not change the natural conformation of the protein and should constrain the relative position of the ends to restrict the degrees of freedom and thus stabilize the structure even when heated. In addition, these linkers should not affect any of the protein functions. Consequently, it is important to prevent linkers from passing through the active site or from covering domains that bind other molecules, for example. Therefore one needs to be able to define the shape of possible linkers. This section describes the software we developed to design such linkers. We would like to stress that this work has been made possible by the feedback between computer modeling and experimental work: we could first design linkers in silico, test them experimentally and use the results to further calibrate the software. To our knowledge, this is the first time that such an approach has been used to custom-design linkers that connect protein extremities.<br />
In short, the software can provide a weighted list of linkers to circularize any protein of interest with a known structure. Those linkers are made of rigid alpha-helical segments connected by defined angles. Contrary to flexible linkers, those rigid linkers were expected to constrain the protein extremities and to confer better heat stability. To generate those linkers, we first defined the geometrical paths, with segments and angles, that they should follow. The geometrical paths that are biologically feasible are afterwards translated into amino acid sequences. Both the compatibility of paths with possible structures and the translation were made possible by our modeling approaches. The first approach consisted of a statistical analysis of more than 17000 known non-homologous structures containing alpha helices connected by angles. For the second approach, we modeled the conformation of linkers circularizing proteins of known structure and analyzed them for certain properties. This second approach was run for a large number of proteins thanks to our distributed computing system igemathome. The software provides different possible linkers with weights that rank the linkers according to their capacity to maintain protein activity at higher temperatures. They were generated in an extensive [[linker screening| ### ]] on the target protein lambda-lysozyme, using the first modeling approach.<br />
The running stability of the software was checked in a large-scale test on [[iGEM@home]].<br />
Furthermore our software provided linkers for circularizing [[DNMT1| ###]], that could be made more heatstable due to circularization???.<br />
For detailed information on the implementation and the practical use of the software, please see the [[documentation software-docu]] page.<br />
==What does the software do==<br />
still missing [[figure 0, graph abstract]]<br />
=Background=<br />
What are linkers, why should ours work better?<br />
Linkers, References to the modeling, <br />
Classically, protein linkers were designed in three different manners. ###REFERENCE### The easiest way is to define the length that a linker should cover and then simply use a flexible glycine-serine peptide with the right number of amino acids to match this length. Glycine is used for flexibility, as it has no side chain and does not produce any steric hindrance, while serine is used for solubility, as it has a small polar side chain. This solubility is important, as the linkers should not pass through the hydrophobic core of the protein, but should be dissolved in the surrounding medium. These flexible linkers were normally used for circularization but also for connecting different proteins, when the main point is that the different parts are connected, but not how they are connected, or when the flexibility of the linker is required for specific applications.<br />
A second strategy consists in using rigid helical linkers to keep proteins or protein domains at a certain distance from each other. This is especially important for signalling proteins and fluorescent proteins. ###TODO: Reference### One major property of alpha helices is that they always fold in a defined way with well-defined angles and lengths. There are also many different helical patterns that differ in stability and solubility. One big disadvantage of this strategy is that one can only build straight linkers with helices. So in the context of circularization, if the straight line connecting the protein extremities crosses the protein, this strategy is not an option.<br />
The third option, which served as a basis for our approach, consists in designing custom-tailored linkers for each specific application. These linkers can be obtained from protein structure prediction. First, one needs to define the path that the linker should take to connect two amino acids. Afterwards, one designs a possible linker sequence that might fit well. Next, one makes a structure prediction of the linker attached to the proteins to validate the design. Several different linkers, with slight changes, can be compared. This is repeated several times until the linker effectively follows the expected path. ###TODO: Reference, WADE paper### This method is time-consuming, as it is not only computationally intensive but also requires a strong knowledge of protein folding and protein structure prediction. On the other hand, the benefit can be important, as the interaction of the linker with the protein's surface can be taken into account and as one can accurately define the path taken by the linker down to the resolution of the protein structure.<br />
We have set up a completely new strategy to design rigid linkers. As further detailed in the [[###link###|modelling]] part, it is possible to define the shape of a linker by combining rigid alpha helical rods with well-defined angle patterns. Therefore, by defining in a geometrical way the possible paths of the circularizing linkers for a given protein, we can then propose potential linkers. This definition of the geometrical path can be very difficult, especially for large proteins with complex shapes. Moreover, this definition is further constrained by the fact that linkers must avoid hiding active sites of the protein of interest. Finally, the paths have rotational degrees of freedom at the extremities of the protein and, depending on their orientation, they may or may not match the geometry of the protein. The tool we present here covers the two steps: defining geometrical paths with some weights and translating them into feasible linkers, also with weights. This tool is universal, as it has the capacity to design circularizing linkers for any protein with a known structure. Moreover, it is modular: thanks to our modeling approach [link], we have designed linkers as exchangeable blocks of rods of different lengths and of angle patterns. The following sections detail the different steps followed by our software to design proper linkers.<br />
But even when taking all this into account, one could never account for all the alternative paths that the same linker is also able to take because of the rotational degrees of freedom inherent in the linker model. ###figure needed###<br />
Thus we decided to provide the scientific community with a powerful open-source software that, for every protein with a given structure, can calculate the sequence of the linker needed to circularize the protein with a rigid linker and with minimal inhibition of the protein's function.<br />
Until now each scientist had to estimate the length of the linker by hand, ###check for flexlinker### so our software is a completely novel approach to circularization.<br />
=General procedure=<br />
Our general approach to finding: rods and angles in a discrete manner, only finite possibilities to connect the ends, these are all checked.<br />
I think this paragraph is just too much, as anyway there was already an overview before.<br />
First, the protein structure is analysed in a geometrical way: paths composed of no more than four straight segments connected by angles are generated computationally. A path is always represented by straight lines and the connecting angles between them. ###In the end all these paths should be sorted by how well it would be, if we circularize the protein using this path.### But as the final weighting is quite computationally intensive, the paths first need to be sorted out. A path is only sorted out if it breaches a rule for linkers. For example, paths should never pass through the protein. After all the paths have been generated, they are improved by shifting the angle points according to the underlying linker model, so that no paths are taken into account that could not be built with our building blocks.<br />
After all paths have been generated, they are weighted by several factors. From these, one weighting is calculated for each path, which corresponds to the quality of this particular path. Then the path is translated into an amino acid sequence using our amino acid patterns. But as one sequence can follow more than one path, all the paths built by this sequence are clustered. In the end, the average of the weightings within each path cluster is calculated, so that for every linker only one weighting value is produced, with the contributions of all paths possibly taken by this linker.<br />
==PDB analysis==<br />
First, the PDB file containing the structure of the target protein is parsed and the coordinates of the atoms are stored, in metric units. After this, some initial tests are made with the protein structure. First, we checked whether the C- and N-termini lie on the surface of the protein and are accessible to the solvent, which is crucial for circularization. We defined a line originating from an extremity of the protein with the two angles of the spherical coordinates around the z-axis. From that, we could determine the accessible angles by rejecting all the lines that are too close to the protein. As the future linker will be made of alpha helices and will therefore have a radius of 5 &Aring;, we used this length as the minimal allowed distance.<br />
Those allowed angles are stored for the coming linker generation. ###fig needed###<br />
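Reading the atom coordinates relies on the fixed-column layout of the PDB format, in which the x, y and z coordinates of an ATOM record occupy columns 31-38, 39-46 and 47-54. The following sketch only illustrates this first parsing step; the function name is hypothetical, HETATM records are deliberately ignored here, and the parser of the software may differ.<br />

```python
def parse_atom_coordinates(pdb_text):
    """Coordinates (in angstroms) of all ATOM records of a PDB-format string."""
    coords = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            # Fixed-width columns of the PDB format (1-based 31-38, 39-46, 47-54).
            x = float(line[30:38])
            y = float(line[38:46])
            z = float(line[46:54])
            coords.append((x, y, z))
    return coords
```

The resulting list of coordinates is all that the geometric tests on the protein require, since they only compare distances between atoms and candidate lines or rods.<br />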
==Generation of geometric paths==<br />
In the [[###link###|modelling]] part we have already described the patterns we use to build our linkers. The modularity was crucial for the success of the software. Therefore the angle patterns were always chosen so that the end of the angle pattern was 8 &Aring; away from the turning point ###figure needed###. Thus only displacements of a certain length had to be made, independent of the direction from which the linker arrives and of the direction in which it continues. This modularity makes the calculations more efficient than they would be if points were simply generated randomly.<br />
We still have to rephrase that.<br />
===Step 1===<br />
" Therefore here another function for calculating the distance from the protein to the connection is used, that is more time-consuming, but also more accurate, than the ones used normally. " ??<br />
"This is done with a higher accuracy, because none of these linkers should be lost by error". ??<br />
Two functions have so far been included for generating linkers that take the flexibility of the ends into account.<br />
The only variability we have in the custom-tailored linker here is the length of the helix. Most likely no suitable linker can be found in this case, because if there is an obstacle between the ends, the linker cannot bend around it. But if such a linker is possible, it should be found, because it will be one of the best linkers predicted.<br />
So for all accessible angles that were calculated before, we calculate all points from the N-terminus and from the C-terminus that lie in certain distances from the terminus. The points are spread over varying distances from ; to the .<br />
===Step 2===<br />
The next possibility to design more complex rigid linkers, while still taking flexible ends into account within a reasonable calculation time, was to reduce the number of possible angles. As we originally thought that a 90° angle would be practically feasible, the software was designed to generate linkers with flexible ends and one 90° angle. ###figure### This choice was notably made because the edge lengths of a right triangle are simple to calculate. We already saw in Step 1 that the length of an edge can only take 8 different values. As the linkers have to start from the extremities of the protein, and as we impose a right angle, the number of possible paths is low, making them easy to compute. In practice, the extremities of the protein are positioned in a flexible way as in Step 1. From each of the positions allowed by this flexibility, the software searches for all the allowed right triangles.<br />
some words on degree of freedom...<br />
###newly written###<br />
Here again, we only have discrete possibilities to build right triangles with our helical patterns. Therefore, all combinations of two helical patterns that could build a right triangle whose hypotenuse has the length of the distance between the termini are searched first.<br />
Now we shift back to 3D and apply Thales's theorem, which says that if A, B and C lie on a circle and the line between A and C is a diameter of the circle, then the angle at B is a right angle. ### fig, thales### Thus in 3D we can discretize the positions where the angle point (B) can lie relative to the starting point (A). This set is cross-checked against the set of possible right triangles found before, so that we only keep paths that can be built with our patterns.<br />
Like in Step 1 the software creates spheres of points around the start (first points) and around the end (last points). Then for each point from the first points<br />
===Step 3===<br />
Finally the software also provides the possibility to find paths with up to 4 edges, meaning 4 alpha helices and 3 angles. Thanks to the modularity of the possible linkers, such paths can offer the possibility to circularize theoretically any kind of protein. ###figure of torus###<br />
To keep the calculation feasible in a reasonable time, we designed the search strategy so that the flexible parts at the extremities are oriented in the same direction as the adjacent alpha helix. This obviously restricts the search, but as these orientations are allowed for the flexible parts, the paths found this way remain valid. <br />
First, potential end points of the first alpha helical rod are calculated from the N-terminal point of the protein. The orientation is chosen in a discrete manner, with an increment of 5 degrees for the two angles of the spherical coordinates. The distance from the origin corresponds to the 8 possible lengths allowed by the alpha helices, as already seen in Step 2, plus a length of 0, which mimics a linker with 3 instead of 4 edges. The exact same procedure is repeated to define all the potential end points of the second alpha helical rod, starting from all the possible end points of the first alpha helical rod. Thanks to the possibility of a length of 0 for the first and the second rods, the software also calculates paths with 2 edges. Then, the same is done only once from the C-terminal point of the protein, defining 1 edge. The final step consists in checking whether the points originating from the N- and C-terminal points can be linked by a potential alpha helix, i.e. whether they are separated by an appropriate distance. If any of the possible alpha helix lengths lies within 0.75 &Aring; of the distance between two points, the path is saved. In the same way, if two points are closer than 0.75 &Aring; to each other, the path is also saved.<br />
one starts from omega, 2 from alpha<br />
Thus even one of the worst shapes for circularization, a torus with the two termini in the pits, could be circularized with our linker system without needing infinitely long straight segments. On the other hand, this was the largest search space that was still feasible to calculate.<br />
Here the flexible parts of the protein are assumed to point in the same direction as the following helix. In this way the number of possibilities is kept fixed, but of course this is a rather rough approximation.<br />
Let's check the next paragraph together in detail<br />
First, points at the right distances from each end are generated. ###figure needed, explains point names### Then, from each of the first points, next points are generated in all directions with an angular increment of 5 degrees. All of these points that do not fit are immediately sorted out. Next, all connections from the second points to the last points are checked for the right distance. After this, the normal sorting steps are applied to these new connections.<br />
==Sorting out of paths==<br />
The previous part described the generation of paths that can connect the two extremities of the protein irrespective of the position of these paths relative to the protein. While this allows fast computation of the geometrical paths, it also implies that the paths that are not practically feasible need to be sorted out. This is the most time-consuming part of the computation, as about 1 billion paths are generated. Three criteria are considered for the sorting. The first one is the feasibility of the linker: can the software find angle patterns that correspond to the ones defined by the geometrical path? This question was part of the motivation for a large modeling effort (link) to determine the possible angles between consecutive alpha helices. This was achieved by analyzing the distribution of angles between alpha helices found in the ArchDB database (link). As nearly any angle between 20 and 170 degrees could be found, only few paths were actually rejected at that step. The next criterion is the position of the angle points: if they lie inside the protein, the path is rejected. Finally, if any atom of the protein is less than 5 &Aring; away from any of the alpha helices, the path is also rejected.<br />
==Shifting paths to the patterns==<br />
The strategy described in Step 3 gives a certain freedom to the rod that connects the last two angle points generated from the N- and C-terminal points. As this freedom is actually not permitted by the alpha helix and the angle pattern, but is permitted by the flexible part, for example at the C-terminal end, the software slightly refines the path by rotating the segment that originates from the C-terminal point.<br />
In particular, because of the discretization, the generated paths may not fit the helical and angle patterns perfectly. Therefore, before they are translated into an amino acid sequence, the paths need to be refined.<br />
Every path that has survived is analyzed step by step, starting from the starting point and always advancing by one angle point. For each step from point to point, the length of the step is calculated and compared to the lengths we can build with the helix patterns. If the step is too long, the next point is shifted towards the previous one until it fits. If it is too short, the next point is shifted away. These shifts do not exceed a certain length, so that the paths do not move too much and suddenly pass through restricted areas.<br />
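The shifting step can be sketched as follows. This is only an illustration under stated assumptions: the helix lengths in <code>HELIX_LENGTHS</code>, the cap <code>max_shift</code> and the function name are hypothetical and not taken from the software.<br />

```python
import math

# Hypothetical helix lengths in Angstrom; the real values come from the helix patterns.
HELIX_LENGTHS = [10.0, 15.0, 20.0]

def snap_point(prev, nxt, lengths=HELIX_LENGTHS, max_shift=1.5):
    """Shift nxt along the line prev->nxt so the step matches the closest helix length.

    Returns the shifted point, or None if the required shift exceeds max_shift
    (an assumed cap that keeps paths from drifting into restricted areas).
    """
    d = math.dist(prev, nxt)
    if d == 0:
        return None  # no direction to shift along
    target = min(lengths, key=lambda length: abs(length - d))
    if abs(target - d) > max_shift:
        return None
    scale = target / d  # < 1 pulls the point closer, > 1 pushes it away
    return tuple(p + (n - p) * scale for p, n in zip(prev, nxt))
```

A step that is slightly too long is pulled towards the previous point, and one that is slightly too short is pushed away, as described above.<br />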
This is done before the weighting, so that the paths no longer change after they have been weighted.<br />
==Weighting of paths==<br />
Before translating the paths into sequences and thus into linkers that can be expressed, each path needs to be evaluated for its potential capacity to enhance heat stability. For this we have identified different contributions that should be combined into one value that defines how good the linker may be and that consequently defines its ranking among all possible linkers. The smaller this value, the more we expect that the linker will enhance thermostability. An important step for all these contributions is the normalization, as explained in the next paragraphs.<br />
The first contribution we considered was the linker length. We assumed that a short linker is better to constrain the protein extremities, and that a long helix might give more flexibility. Because we wanted this value to be independent of the size of the protein, the length of the linker is normalized to the distance between the two termini.<br />
The second contribution relates to the angles used in the linker. We learned from the [[modeling ###link###]] that angles formed by a certain angle pattern follow a certain distribution. First, we assumed that the narrower the distribution, the more likely the alpha helices would actually produce this angle. Second, the angles found by the software should be as close as possible to those well-defined angles. In this case, the weight value from this contribution should be low.<br />
Then the distance of the linker to the protein is taken into account. Because the linkers should not disturb the protein in its normal environment, linkers that pass close to the protein surface are considered better linkers. The distance was defined as the minimal distance between the linker and all the atoms of the protein. As already mentioned for the sorting of the paths, a linker cannot come closer than 5 &Aring; and this distance was used for normalization of calculated distances.<br />
As linkers should also not be too close to the protein surface, this value is normalized by the minimal distance a linker should keep from the surface. The distance is calculated as the minimal distance between the atoms of the protein and the connection.<br />
After this, the places a linker should avoid are taken into account. Each protein can interact with other molecules on some parts of its surface. The user can specify where and how big those parts are. If a linker passes in front of a potential molecule binding domain, the value of the corresponding path goes to infinity, so that the linker is discarded. Conversely, the farther a linker is from a potential ligand binding domain, the smaller its weighting value. The user can also specify the importance of certain regions. In the end the total weighting is normalized by the number of binding domains.<br />
===Calibrating the weighting function===<br />
???keep?###<br />
Every contribution has its own distribution. You can see an example in figure ### [[figure histogram_length_lys.png]], but all of them have different shapes. The aim is to find the paths that minimize all of these contributions. Therefore, in the weighting function the four mentioned contributions are combined in a linear manner:<br />
\[ W(p) = \alpha L(p) + \beta A(p) + \gamma D(p) + u(p) \]<br />
where W is the final weighting, p the path, L the length contribution, A the angle contribution, D the distance contribution and u the contribution from the forbidden regions. $\alpha, \beta, \gamma$ are the weighting constants that needed to be found. The normalization of each contribution was chosen so that each of them is dimensionless and all have reasonably similar values.<br />
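The linear combination above can be written as a short sketch. The function and argument names are assumptions for illustration; the four contributions are taken as already normalized numbers, and a path crossing a forbidden region is represented by an infinite <code>forbidden</code> value so that it ends up last in the ranking.<br />

```python
import math

def path_weight(length, angle, distance, forbidden, alpha, beta, gamma):
    """W(p) = alpha*L(p) + beta*A(p) + gamma*D(p) + u(p); lower is better."""
    return alpha * length + beta * angle + gamma * distance + forbidden

def rank_paths(paths, alpha, beta, gamma):
    """Sort (name, L, A, D, u) tuples by increasing weight and drop infinite ones."""
    weighted = [
        (name, path_weight(l, a, d, u, alpha, beta, gamma))
        for name, l, a, d, u in paths
    ]
    finite = [item for item in weighted if math.isfinite(item[1])]
    return sorted(finite, key=lambda item: item[1])
```

Because u(p) enters without a coefficient, a single infinite value is enough to discard a path regardless of the other three contributions.<br />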
The weighting constants were obtained from the [[linker-screening ###lin]] performed with lysozyme and the [[enzyme-modeling ###link]]<br />
Please see [[below ###link]] for the detailed explanation, how the values were obtained.<br />
==Translating paths to sequence==<br />
As already mentioned before the software is provided with two databases, one for the possible angle patterns and one for the helix patterns. The choice of the patterns was inspired by known crystal structures extracted from databases and described in different papers.<br />
A huge in silico screening for refining the preferences of the patterns was then set up using the [[iGEM@home|###Link]] system. For the complete description of search for suitable patterns, one can read the [[modeling|###Link to patternspart]] page. <br />
All the possible paths are now split up at the angles and compared with the possible patterns in the databases. ###Figure needed to explain, how the path is translated### The most suitable patterns are identified and added together to build the paths sequence. It is important to notice that this is only possible because of the modularity of our linker patterns used as building blocks: each block, being an alpha helix or an angle pattern, is not affected by the other. Thus for each possible path, one sequence is produced.<br />
==Clustering of paths==<br />
Many different paths are represented by the same sequence [[###Figure that shows, different paths have same properties, already before in the text ###]] and we therefore clustered such paths. The weights for those clustered paths were then calculated by averaging the weights of the different paths that compose a cluster.<br />
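A minimal sketch of this clustering, assuming each path is available as a pair of its translated sequence and its weight (the function name and the data layout are assumptions, not the actual data structures of the software):<br />

```python
from collections import defaultdict

def cluster_by_sequence(paths):
    """Group (sequence, weight) pairs by sequence and average the weights per cluster."""
    groups = defaultdict(list)
    for sequence, weight in paths:
        groups[sequence].append(weight)
    return {seq: sum(ws) / len(ws) for seq, ws in groups.items()}
```

Each distinct sequence then appears once in the output, with the mean weight of all geometrical paths that translate to it.<br />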
=Results=<br />
==DNMT1==<br />
A major motivation of our effort to design rigid linkers with angles was the circularization of the DNA methyltransferase Dnmt1 (link). The truncated form used in our project is composed of 900 amino acids and the N- and C-terminal extremities are well separated. To circularize it, two linkers were designed: a flexible one made of glycine and serine, and a rigid one designed by the software. The rigid linkers for DNMT1 were obtained from an early state of the software. At that time the calculation took 11 days on a laptop computer with an Intel i5 processor and 8 GB of RAM, which shows the importance of a distributed computing system for large proteins. Since then, the software has improved considerably, reducing the calculation time for DNMT1 to about 1 day.<br />
###DNMT1 could be made heatstable, still missing###<br />
==Feedback from wet lab==<br />
The results from the software for lysozyme were tested as described in the [[linker-screening]] part and evaluated as described in the [[enzyme modeling]] part. We have performed a large linker screening on 10 different lysozymes with different linkers. As the purpose of the lysozyme screen was the calibration of the software, those linkers were designed according to the four contributions previously mentioned. One of them was the shortest possible, one had the best possible angle, and so on.<br />
{| class="table table-hover"<br />
|+ '''Table 1''': Linkers and their amino acid sequences. Green: attachment sequences to prevent the flexible regions from being perturbed; Blue: angle; Purple: extein.<br />
! Linker<br />
! Amino acid sequence<br />
|- <br />
|<br />
|<br />
|-<br />
|'''Very good linkers'''<br />
|<br />
|-<br />
| sgt2<br />
| <span style="color:#3ADF00;">GG</span>AEAAAK<span style="color:#00BFFF;">AAAHPEA</span>AEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| rigid<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAKAA<span style="color:#3ADF00;">P</span><span style="color:#A901DB;">RGKCWE</span><br />
|-<br />
| <br />
| <br />
|-<br />
| '''Average linkers'''<br />
|<br />
|-<br />
| may1<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAKA<span style="color:#00BFFF;">AAAHPEA</span>AEAAAKEAAAKA<span style="color:#00BFFF;">KTA</span>AEAAAKEAAAKA<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| ord1<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAK<span style="color:#00BFFF;">ATGDLA</span>AEAAAKAA<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| ord3<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAK<span style="color:#00BFFF;">ASLPAA</span>AEAAAKEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
|<br />
| <br />
|-<br />
| '''Short linkers'''<br />
| <br />
|-<br />
| sho1<br />
| <span style="color:#3ADF00;">GG</span><span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| sho2<br />
| <span style="color:#3ADF00;">GG</span>AEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
|}<br />
[[circ_lam_lys_nils.png]] comparison with [[nice_linker_lysozyme_flexible_ends.png]]<br />
###About predictions of software###<br />
In the end we obtained a ranking of the linkers tested in vitro from the [[linker-screening]] and chose the parameters $\alpha, \beta, \gamma$ of the weighting function such that the ranking from the software reproduced the ranking from the assays. The final values were: $\alpha = 1.85 \cdot 10^{-6}, \beta = 0.57, \gamma = 50.8 \cdot 10^{6}$.<br />
To this end, we set up a function that is minimized when the order of the weighting output is the same as the order from the wet lab.<br />
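The exact form of this function is not given here. One possible sketch of an objective with the stated property, assuming software weights where lower is better and wet-lab ranks where 1 is best, simply counts the pairs of linkers that the two orderings rank differently (the function name and the data layout are hypothetical):<br />

```python
from itertools import combinations

def rank_disagreement(weights, lab_rank):
    """Count linker pairs ordered differently by software weight and by assay rank.

    weights:  dict name -> software weight (lower is better)
    lab_rank: dict name -> rank from the assays (1 is best)
    The result is 0 exactly when no pair is ordered in opposite ways.
    """
    count = 0
    for a, b in combinations(weights, 2):
        if (weights[a] - weights[b]) * (lab_rank[a] - lab_rank[b]) < 0:
            count += 1
    return count
```

Minimizing such a count over the weighting constants would favor parameter values for which the software ranking follows the assay ranking.<br />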
<br />
=Discussion=<br />
Will always be refined with more data from i@h...<br />
=References=<br />
<br />
[0] Wang, C.K.L., Kaas, Q., Chiche, L. & Craik, D.J. CyBase: A database of cyclic protein sequences and structures, with applications in protein discovery and engineering. Nucleic Acids Research 36, (2008).<br />
<br />
[1] Thornton, J.M. & Sibanda, B.L. Amino and carboxy-terminal regions in globular proteins. Journal of molecular biology 167, 443-460 (1983).</div>Igemnilshttp://2014.igem.org/Team:Heidelberg/pages/Linker_SoftwareTeam:Heidelberg/pages/Linker Software2014-10-17T22:40:40Z<p>Igemnils: /* Shifting paths to the patterns */</p>
<hr />
<div>=Abstract=<br />
As already [https://2014.igem.org/Team:Heidelberg/Toolbox/Circularization introduced], artificially circularized proteins may gain some heat stability by restraining the C- and N-terminus from moving around freely. This circularization may be trivial when the protein termini are very close to each other, which seems to be reasonably common [[#References|[1]]]. However, if the ends are too far from each other, a long linker is needed to connect them. This linker should not change the natural conformation of the protein and should constrain the relative position of the ends to restrict the degrees of freedom and thus stabilize the structure even when heated up. In addition, these linkers should not affect any of the protein functions. Consequently, it is important to prevent linkers from passing through the active site or from covering binding domains for other molecules, for example. Therefore one needs to be able to define the shape of possible linkers. This section describes the software we developed to design such linkers. We would like to stress that this work has been made possible thanks to the feedback between computer modeling and experimental work: we could first design linkers in silico, test them experimentally and use the results to further calibrate the software. To our knowledge, this is the first time that such an approach has been used to custom-design rigid linkers with angles to connect protein extremities.<br />
<br />
=General procedure=<br />
In short, the software can provide a weighted list of linkers to circularize any protein of interest with a known structure. Those linkers are made of rigid alpha-helical segments connected with defined angles. Contrary to flexible linkers, those rigid linkers were expected to constrain the protein extremities and to confer better heat stability. Such an idea was already developed [[#References|[2]]] but only with alpha helices defining simple rods, and without any possibility to introduce angles. To generate those linkers, we first defined the geometrical paths, with segments and angles, that they should follow. The geometrical paths that are biologically feasible are afterwards translated into amino acid sequences. Both the compatibility of paths with possible structures and the translation were made possible thanks to our [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling approaches]. The first approach consisted in performing a statistical analysis of more than 17000 known non-homologous structures containing alpha helices connected with angles. For the second approach, we modeled the conformation of linkers circularizing proteins of known structure and analyzed them for certain properties. This second approach was run for a large number of proteins thanks to our distributed computing system [https://2014.igem.org/Team:Heidelberg/Software/igemathome igemathome]. The software provides different possible linkers with weights that rank the linkers according to their capacity to maintain protein activity at higher temperatures. They were generated thanks to an extensive [https://2014.igem.org/Team:Heidelberg/Project/Linker_Screening linker screening] on the target protein lambda-lysozyme, using the first modeling approach.<br />
<br />
=Background=<br />
Classically, protein linkers have been designed in three different ways. The easiest way is to define the length that a linker should cover and then simply use a flexible glycine-serine peptide with the right number of amino acids to match this length. Glycine is used for flexibility, as its side chain is a single hydrogen atom and does not produce any steric hindrance, while serine is used for solubility, as it has a small polar side chain. This solubility is important, as the linkers should not pass through the hydrophobic core of the protein, but should be dissolved in the surrounding medium. These flexible linkers are normally used for circularization but also for connecting different proteins, when the main goal is that the different parts are connected, but not how they are connected, or when the flexibility of the linker is required for specific applications. <br />
<br />
A second strategy consists in using rigid helical linkers to keep proteins or protein domains at a certain distance from each other. This is especially important for signalling proteins and fluorescent proteins. One major property of alpha helices is that they always fold in a defined way with well-defined angles and lengths. There are also many different helical patterns that differ in stability and solubility. Although they have been used to design circularizing linkers [[#References|[2]]], one big disadvantage of this strategy is that one can only build straight linkers with helices. So in the context of circularization, if a straight line connecting the protein extremities crosses the protein, this strategy is not an option.<br />
<br />
The third option, which served as a basis for our approach and which came from discussions with the group of Rebecca Wade in Heidelberg, Germany, consists in designing custom-tailored linkers for each specific application. These linkers can be obtained from protein structure prediction. At first one needs to define the path that the linker should take to connect the protein ends. Afterwards, one designs a possible linker sequence that might fit well. Next one makes a structure prediction of the linker attached to the proteins to validate the prediction. Several different linkers, with slight changes, can be compared. This is repeated several times until the linker effectively follows the expected path. This method requires strong knowledge of protein folding and protein structure prediction and is computationally intensive. On the other hand, the benefit can be important, as the interaction of the linker with the protein surface can be taken into account and as one can accurately define the path taken by the linker down to the resolution of the protein structure.<br />
<br />
We have set up a completely new strategy to design rigid linkers. As further detailed in the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling] part, it is possible to define the shape of a linker by combining rigid alpha helical rods with well-defined angle patterns. Therefore, by defining, in a geometrical way, the possible paths of the circularizing linkers for a given protein, we can then propose potential linkers. This definition of the geometrical path can be very difficult, especially for large proteins with complex shapes. Moreover, this definition is further constrained by the fact that linkers must avoid hiding active sites of the protein of interest. Finally, the paths have rotational degrees of freedom at the extremities of the protein, and depending on their orientation, they may or may not match the geometry of the protein. The tool we present here covers the two steps: defining geometrical paths with some weights and translating them into feasible linkers, also with weights. This tool is universal, as it has the capacity to design circularizing linkers for any protein with a known structure. Moreover, it is modular as, thanks to our [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling approach], we have designed linkers as exchangeable blocks of rods of different lengths and of angle patterns. The following sections detail the different steps followed by our software to design proper linkers.<br />
<br />
<br />
==PDB parsing==<br />
At first, the PDB file containing the structure of the target protein is parsed and the coordinates of the atoms are stored, in the metric unit. After this, some initial tests are made with the protein structure. First, we checked whether the C- and N-termini lie on the surface of the protein and are accessible to the solvent, which is crucial for circularization. We defined a line originating from an extremity of the protein with the two angles of the spherical coordinates around the z-axis. From that, we could determine the accessible angles by rejecting all the lines that are too close to the protein. As the future linker will be made of alpha helices and will therefore have a radius of 5 &Aring;, we used this length as the minimal allowed distance.<br />
Those allowed angles are stored for the coming linker generation. ###fig needed### <br />
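As a minimal sketch of the parsing step (not the parser used by the software; the function name is an assumption), the atom coordinates can be read from the fixed-width ATOM records of a PDB file, where x, y and z occupy columns 31-38, 39-46 and 47-54 and are given in &Aring;:<br />

```python
def parse_pdb_atoms(lines):
    """Return a list of (x, y, z) tuples, in Angstrom, for all ATOM records."""
    atoms = []
    for line in lines:
        if line.startswith("ATOM"):
            # Fixed-width columns of the PDB format (0-based slices).
            x = float(line[30:38])
            y = float(line[38:46])
            z = float(line[46:54])
            atoms.append((x, y, z))
    return atoms
```

It could be used, for example, as <code>parse_pdb_atoms(open(path))</code>, where <code>path</code> is the location of a PDB file of the target protein.<br />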
<br />
==Generation of geometric paths==<br />
As our strategy consists in building linkers with helical rods and connecting angles, a path is completely defined by the coordinates of the angle points. Advancing one step from an existing point is always done by adding a displacement vector to this point. This vector is defined by the two spherical angles, chosen here in a discrete manner with an increment of 5 degrees, and by a length, also chosen in a discrete manner. This discrete length is used in two different contexts: it may correspond to the length of an alpha helix or to the length of the flexible part that appears at the extremity of the protein. The coordinates that are reached thanks to this vector define the new coordinates of an angle point, no matter if the vector corresponds to an alpha helix or to a flexible part. ###figure, points + vectors### As we screen all possible angles in a discrete manner, the coordinates of those angle points are regularly distributed on a sphere. As further detailed in the next sections, those spheres are defined from both ends, either once or in several steps. Then the software checks for possible straight connections of given lengths for each pair of angle points originating from both extremities. ###figure needed, with all these possible steps, one sphere around each point and checking for the connections, 2D, different lengths in two different figures, ###<br />
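The displacement step can be sketched as follows. The convention used here (polar angle measured from the z-axis, azimuth in the xy-plane) and the function names are assumptions for illustration; only the 5 degree increment comes from the text above.<br />

```python
import math

def step(point, theta_deg, phi_deg, length):
    """Add a displacement vector given by two spherical angles and a length."""
    theta, phi = math.radians(theta_deg), math.radians(phi_deg)
    return (
        point[0] + length * math.sin(theta) * math.cos(phi),
        point[1] + length * math.sin(theta) * math.sin(phi),
        point[2] + length * math.cos(theta),
    )

def sphere_points(origin, length, increment_deg=5):
    """All points reachable from origin with the given length on the angular grid."""
    return [
        step(origin, theta, phi, length)
        for theta in range(0, 181, increment_deg)   # polar angle 0..180
        for phi in range(0, 360, increment_deg)     # azimuth 0..355
    ]
```

With a 5 degree increment this gives 37 x 72 = 2664 grid points per length, with the two poles repeated for every azimuth in this simple version.<br />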
The linkers are built in a modular way, with blocks of well-defined size. From the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling] of potential linkers, we could derive 8 different alpha helical rods, all with different lengths. In addition, the length of the two segments inside an angle block was always 8&Aring;, so exchanging angle blocks does not affect the length of the linker. This means that the distance between the angle points is well defined, an essential aspect of our strategy of linker design.<br />
The software proceeds in three steps. First, it checks for the possibility of a direct single alpha helix linker. For this, it applies the procedure just mentioned, with spheres whose radius reasonably corresponds to the length of the short parts at the extremities of the protein. Second, it tests whether a linker containing two alpha helices connected with a right angle allows the circularization. Finally, it searches for possible linkers with three angle points. The next parts explain those three steps in detail.<br />
This method has been chosen because it could be implemented easily and efficiently in our program. However, this strategy generates paths that cross the protein. Therefore we put considerable effort into sorting out the paths.<br />
<br />
===Step 1===<br />
As a simple rigid linker with no angle would be easier to design and likely more thermostable than the ones containing angles, the software first checks if this simple solution is possible ###figure needed###. For this, we took into account the fact that proteins have some flexible amino acids at their extremities. This flexible part may come from the protein itself, but also from the 2 glycines that are included at the N-terminal part and from the extein at the C-terminal part. The latter two parts come from our linkers. Those parts have no preferential angles and offer a large number of possibilities to insert fitting linkers. But this flexibility is also a drawback, as we have to include this large number of possible angles and lengths in our path search.<br />
In this first step, the software explicitly takes these flexible parts into account to check for the possibility of straight linkers. As the angles and the length of the flexible parts are variable, the software positions their extremities on a sphere centered on the last fixed position of the structure, as explained above. The radius of this sphere is incremented in a discrete manner, in 4 steps, from 5.25 &Aring; to the maximum length of the flexible part. ###figure###<br />
Then all possible straight segments between the first points and the last points are tested. If they come closer than 5 &Aring; to the protein, or if they cross it, they are rejected. If they are kept, the software checks whether the length of the segments is compatible with the feasible alpha helices: if the length of a given segment equals one of the 8 alpha helix lengths plus or minus 0.75 &Aring;, the path is saved.<br />
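The length test at the end of this step can be sketched as below. The tolerance of 0.75 &Aring; is taken from the text; the function name and the helix lengths used in any call are hypothetical placeholders, since the 8 actual lengths are not listed here.<br />

```python
def matching_helix(distance, helix_lengths, tolerance=0.75):
    """Return the helix length closest to distance within the tolerance, else None."""
    candidates = [l for l in helix_lengths if abs(l - distance) <= tolerance]
    if not candidates:
        return None
    return min(candidates, key=lambda l: abs(l - distance))
```

A segment is kept only if this function returns a length; the returned value identifies which helix block would later be used for that segment.<br />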
===Step 2===<br />
The next way to design more complex rigid linkers, while still taking the flexible ends into account within a reasonable calculation time, was to reduce the number of possible angles. As we originally thought that a 90° angle would be practically feasible, the software was designed to generate linkers with flexible ends and one 90° angle. ###figure, protein_one_90° angle### This choice was made notably because the edge lengths of a right triangle are simple to calculate. We already saw in Step 1 that the length of an edge can only take 8 different values. As the linkers have to start from the extremities of the protein, and as we impose a right angle, the number of possible paths is low, making them easy to compute. In practice, the extremities of the protein are positioned in a flexible way as in Step 1. From each of the positions allowed by this flexibility, the software searches for all the allowed right triangles. This was mainly done because the degrees of freedom needed to be restricted to keep the calculations feasible. <br />
===Step 3===<br />
Finally the software also provides the possibility to find paths with up to 4 edges, meaning 4 alpha helices and 3 angles. Thanks to the modularity of the possible linkers, such paths can offer the possibility to circularize theoretically any kind of protein. ###figure of torus###<br />
To keep the calculation feasible in a reasonable time, we designed the search strategy so that the flexible parts at the extremities are oriented in the same direction as the adjacent alpha helix. This obviously restricts the search, but as these orientations are allowed for the flexible parts, the paths found this way remain valid. <br />
First, potential end points of the first alpha helical rod are calculated from the N-terminal point of the protein. The orientation is chosen in a discrete manner, with an increment of 5 degrees for the two angles of the spherical coordinates. The distance from the origin corresponds to the 8 possible lengths allowed by the alpha helices, as already seen in Step 2, plus a length of 0, which mimics a linker with 3 instead of 4 edges. The exact same procedure is repeated to define all the potential end points of the second alpha helical rod, starting from all the possible end points of the first alpha helical rod. Thanks to the possibility of a length of 0 for the first and the second rods, the software also calculates paths with 2 edges. Then, the same is done only once from the C-terminal point of the protein, defining 1 edge. The final step consists in checking whether the points originating from the N- and C-terminal points can be linked by a potential alpha helix, i.e. whether they are separated by an appropriate distance. If any of the possible alpha helix lengths lies within 0.75 &Aring; of the distance between two points, the path is saved. In the same way, if two points are closer than 0.75 &Aring; to each other, the path is also saved.<br />
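The final connection check of this step can be sketched as follows, assuming the candidate points from the N-terminal side and from the C-terminal side are already available as coordinate tuples. The function name and the return format are assumptions; a connecting length of 0.0 marks the case where the two points practically coincide.<br />

```python
import math
from itertools import product

def connectable(n_points, c_points, helix_lengths, tolerance=0.75):
    """Return (n_point, c_point, helix_length) for every pair that can be joined."""
    hits = []
    for p, q in product(n_points, c_points):
        d = math.dist(p, q)
        if d <= tolerance:
            hits.append((p, q, 0.0))  # points coincide: no connecting helix needed
            continue
        for length in helix_lengths:
            if abs(length - d) <= tolerance:
                hits.append((p, q, length))
                break
    return hits
```

Since every pair of points is tested, the cost of this check grows with the product of the two point counts, which is consistent with restricting the C-terminal side to a single edge.<br />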
<br />
==Sorting out of paths==<br />
The previous part described the generation of paths that can connect the two extremities of the protein irrespective of the position of these paths relative to the protein. While this allows fast computation of the geometrical paths, it also implies that the paths that are not practically feasible need to be sorted out. This is the most time-consuming part of the computation, as about 1 billion paths are generated. Three criteria are considered for the sorting. The first one is the feasibility of the linker: can the software find angle patterns that correspond to the ones defined by the geometrical path? This question was part of the motivation for a large modeling effort (link) to determine the possible angles between consecutive alpha helices. This was achieved by analyzing the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling distribution of angles] between alpha helices found in the ArchDB database. As nearly any angle between 20 and 170 degrees could be found, only few paths were actually rejected at that step. The next criterion is the position of the angle points: if they lie inside the protein, the path is rejected. Finally, if any atom of the protein is less than 5 &Aring; away from any of the alpha helices, the path is also rejected.<br />
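The last criterion amounts to a point-to-segment distance test. A minimal sketch, assuming each helix is represented by the straight segment between two angle points and atoms are coordinate tuples (the function names are assumptions; the 5 &Aring; threshold is from the text):<br />

```python
import math

def point_segment_distance(p, a, b):
    """Shortest distance from point p to the segment a-b."""
    ab = [bi - ai for ai, bi in zip(a, b)]
    ap = [pi - ai for ai, pi in zip(a, p)]
    denom = sum(x * x for x in ab)
    # Projection parameter, clamped so the closest point stays on the segment.
    t = 0.0 if denom == 0 else max(0.0, min(1.0, sum(x * y for x, y in zip(ap, ab)) / denom))
    closest = [ai + t * x for ai, x in zip(a, ab)]
    return math.dist(p, closest)

def helix_clashes(a, b, atoms, min_distance=5.0):
    """True if any atom is closer than min_distance to the helix segment a-b."""
    return any(point_segment_distance(atom, a, b) < min_distance for atom in atoms)
```

This brute-force version tests every atom against every segment; with about 1 billion paths, an implementation would likely need some spatial pre-filtering, which is not shown here.<br />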
<br />
==Shifting paths to the patterns==<br />
<br />
The strategy described in Step 3 gives a certain freedom to the rod that connects the last two angle points generated from the N- and C-terminal points. As this freedom is actually not permitted by the alpha helix and the angle pattern, but is permitted by the flexible part, for example at the C-terminal end, the software slightly refines the path by rotating the segment that originates from the C-terminal point.<br />
<br />
==Weighting of paths==<br />
Before translating the paths to sequences and thus to expressible linkers, the value of each path needs to be assessed. We have identified four contributions that determine how well a path will enhance the heat stability of a target protein by circularization. In the end, all these contributions are combined to get one final value for the goodness of the connection. For each single contribution, a lower weighting value always refers to a better path. All contributions need to be normalized by some property of the protein, so that each contribution on its own is independent of the target protein and generality is achieved.<br />
First, the length of the linker is tested. We assumed that a shorter linker is normally better for restraining the ends, as a longer helix might give more flexibility to the ends. Because we want this value to be independent of the size of the protein, the length of the linker is normalized by the distance between the two termini. This normalization prevents the length contribution from exceeding all other contributions to the weighting function merely because the two termini are far from each other and a longer linker is therefore needed.<br />
The next value is a measure for the goodness of the angles used in the linker. Each of our potential angle patterns is distributed around its mean with a certain standard deviation. The narrower this distribution is, the likelier it is that the pattern will in the end produce the assumed angle between the flanking helices. Therefore path angles that fit perfectly with the angle patterns provided get a lower weighting value.<br />
Then the distance from the protein is taken into account. Because the linkers should not disturb the protein in its normal environment, linkers that pass near the protein's surface are considered better linkers. Of course, no linker comes so close to the surface that it would interact with the protein too much. This value is normalized by the minimal distance a linker should have from the surface.<br />
After this, the places a linker should avoid are calculated. Each protein binds some ligands or substrates. The user can specify how big they are and where they attach. If a linker passes through a potential ligand, this value goes to infinity, so that the linker is discarded. Otherwise, the farther a linker is from a potential ligand, the better this value gets. The user can also specify the importance of certain regions. In the end the total weighting is independent of the number of binding sites.<br />
<br />
===Calibrating the weighting function===<br />
Every contribution has its own distribution. You can see an example in figure ### [[figure histogram_length_lys.png]], but all of them have different shapes. The aim is to find the paths that minimize all of these contributions. Therefore, in the weighting function, the four contributions mentioned above are combined in a linear manner:<br />
\[ W(p) = \alpha L(p) + \beta A(p) + \gamma D(p) + \delta u(p) \]<br />
where W is the final weighting, p the path, L the length contribution, A the angle contribution, D the distance contribution and u the contribution from the forbidden regions. $\alpha, \beta, \gamma, \delta$ are the weighting constants that needed to be found. The normalization of each contribution was chosen so that each of them is dimensionless and all have reasonably similar values.<br />
The weighting constants were obtained from the [[linker-screening ###lin]] performed with lysozyme and the [[enzyme-modeling ###link]].<br />
Please see [[below ###link]] for a detailed explanation of how the values were obtained.<br />
==Translating paths to sequence==<br />
As mentioned before, the software is provided with two databases, one for the possible angle patterns and one for the helix patterns. The choice of the patterns was inspired by known crystal structures extracted from databases and described in different papers.<br />
A large in silico screening for refining the preferences of the patterns was then set up using the [[iGEM@home|###Link]] system. For the complete description of the search for suitable patterns, see the [[modeling|###Link to patternspart]] page. <br />
All the possible paths are now split up at the angles and compared with the possible patterns in the databases. ###Figure needed to explain, how the path is translated### The most suitable patterns are identified and joined together to build the path's sequence. It is important to note that this is only possible because of the modularity of our linker patterns used as building blocks: each block, be it an alpha helix or an angle pattern, is not affected by the others. Thus for each possible path, one sequence is produced.<br />
==Clustering of paths==<br />
Many different paths are represented by the same sequence [[###Figure that shows, different paths have same properties, already before in the text ###]] and we therefore clustered such paths. The weights for those clustered paths were then calculated by averaging the weights of the different paths that compose a cluster.<br />
<br />
=Results=<br />
IMPORTANT!<br />
paths-> geometrical<br />
linkers-> helices...<br />
sequences-> seq of linkers<br />
insist on huge number of data produced<br />
We definitely need to cite [0]!!!<br />
=Abstract=<br />
=Overview=<br />
As already introduced [###link to circularization idea in the toolbox], artificially circularized proteins may gain some heat stability by restraining the C- and N-terminus from moving around freely. This circularization may be trivial when the protein termini are very close to each other, which seems to be reasonably common [[#References|[1]]]. However, if the ends are too far from each other, a long linker is needed to connect them. This linker should not change the natural conformation of the protein and should constrain the relative position of the ends to restrict the degrees of freedom and thus to stabilize the structure even when heated up. In addition, these linkers should not affect any of the protein functions. Consequently, it is important to prevent linkers from passing through the active site or from covering binding domains for other molecules, for example. Therefore one needs to be able to define the shape of possible linkers. This section describes the software we developed to design such linkers. We would like to stress that this work has been made possible thanks to the feedback between computer modeling and experimental work: we could first design linkers in silico, test them experimentally and use the results to further calibrate the software. To our knowledge, this is the first time that such an approach has been used to custom-design linkers to connect protein extremities.<br />
In short, the software can provide a weighted list of linkers to circularize any protein of interest with a known structure. Those linkers are made of rigid alpha helical segments connected with defined angles. Contrary to flexible linkers, those rigid linkers were expected to constrain the protein extremities and to confer better heat stability. To generate those linkers, we first defined the geometrical paths, with segments and angles, that they should follow. The geometrical paths that are biologically feasible are afterwards translated into amino acid sequences. Both the compatibility of paths with possible structures and the translation were made possible thanks to our modeling approaches. The first approach consisted in performing a statistical analysis of more than 17000 known non-homologous structures containing alpha helices connected with angles. For the second approach, we modeled the conformation of linkers circularizing proteins of known structure and analyzed them for certain properties. This second approach was run for a large number of proteins thanks to our distributed computing system igemathome. The software provides different possible linkers with weights that rank the linkers according to their capacity to maintain protein activity at higher temperatures. These weights were generated in an extensive [[linker screening| ### ]] on the target protein lambda-lysozyme, using the first modeling approach.<br />
The software was checked for running stability in a large test on [[iGEM@home]]. <br />
Furthermore, our software provided linkers for circularizing [[DNMT1| ###]], which could be made more heat-stable by circularization???.<br />
For detailed information on the implementation and the practical use of the software, please see the [[documentation software-docu]] page.<br />
==What does the software do==<br />
still missing [[figure 0, graph abstract]]<br />
=Background=<br />
What are linkers, why should ours work better?<br />
Linkers, References to the modeling, <br />
Classically, protein linkers were designed in three different manners. ###REFERENCE### The easiest way is to define the length that a linker should cover and then simply use a flexible glycine-serine peptide with the right number of amino acids to match this length. Glycine is used for flexibility, as it has no side chain and does not produce any steric hindrance, while serine is used for solubility, as it has a small polar side chain. This solubility is important, as the linkers should not pass through the hydrophobic core of the protein, but should be dissolved in the surrounding medium. These flexible linkers were normally used for circularization, but also for connecting different proteins when the most important aspect is that the different parts are connected, not how they are connected, or when the flexibility of the linker was required for specific applications.<br />
A second strategy consists in using rigid helical linkers to keep proteins or protein domains at a certain distance from each other. This is especially important for signalling proteins and fluorescent proteins. ###TODO: Reference### One major property of alpha helices is that they always fold in a defined way with well-defined angles and lengths. There are also many different helical patterns that differ in stability and solubility. One big disadvantage of this strategy is that one can only build straight linkers with helices. So in the context of circularization, if the straight line connecting the protein extremities crosses the protein, this strategy is not an option.<br />
The third option, which served as a base to develop our approach, consists in designing custom-tailored linkers for each specific application. These linkers can be obtained from protein structure prediction. First, one needs to define the path that the linker should take to connect two amino acids. Afterwards one designs a possible linker sequence that might fit well. Next one makes a structure prediction of the linker attached to the proteins to validate the design. Several different linkers, with slight changes, can be compared. This is repeated several times until the linker effectively follows the expected path. ###TODO: Reference, WADE paper### This method is time-consuming, as it is not only computationally intensive but also requires a strong knowledge of protein folding and protein structure prediction. On the other hand, the benefit can be important, as the interaction of the linker with the protein's surface can be taken into account and one can accurately define the path taken by the linker down to the resolution of the protein structure.<br />
We have set up a completely new strategy to design rigid linkers. As further detailed in the [[###link###|modelling]] part, it is possible to define the shape of a linker by combining rigid alpha helical rods with well-defined angle patterns. Therefore, by defining, in a geometrical way, the possible paths of the circularizing linkers for a given protein, we can then propose potential linkers. This definition of the geometrical path can be very difficult, especially for large proteins with complex shapes. Moreover, this definition is further constrained by the fact that linkers must avoid hiding active sites of the protein of interest. Finally, the paths have rotational degrees of freedom at the extremities of the protein, and depending on their orientation, they may or may not match the geometry of the protein. The tool we present here covers the two steps: defining geometrical paths with some weights and translating them into feasible linkers, also with weights. This tool is universal, as it has the capacity to design circularizing linkers for any protein with a known structure. Moreover, it is modular, as, thanks to our modeling approach [link], we have designed linkers as exchangeable blocks of rods of different lengths and of angle patterns. The following sections detail the different steps followed by our software to design proper linkers.<br />
==PDB analysis==<br />
At first, the PDB file containing the structure of the target protein is parsed and the coordinates of the atoms are stored, in the metric unit. After this, some initial tests are made with the protein structure. First, we checked whether the C- and N-termini lie on the surface of the protein and are accessible to the solvent, which is crucial for circularization. We defined a line originating from an extremity of the protein with the two angles of the spherical coordinates around the z-axis. From that, we could determine the accessible angles by rejecting all the lines that are too close to the protein. As the future linker will be made of alpha helices and will therefore have a radius of 5 &Aring;, we used this length as the minimal allowed distance.<br />
Those allowed angles are stored for the coming linker generation. ###fig needed### <br />
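The parsing step can be illustrated with a minimal sketch. It is not the code of our software: the function name <code>parse_atom_coordinates</code> is hypothetical, only <code>ATOM</code> records are read, and the coordinates are kept in &Aring; as they appear in the fixed-width columns 31-54 of the PDB format.<br />

```python
def parse_atom_coordinates(pdb_text):
    """Return a list of (x, y, z) tuples, in angstroms, for all ATOM records.

    Hypothetical helper for illustration; it relies on the fixed-width
    PDB layout (x: columns 31-38, y: 39-46, z: 47-54, 1-indexed).
    """
    coords = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            x = float(line[30:38])
            y = float(line[38:46])
            z = float(line[46:54])
            coords.append((x, y, z))
    return coords
```

A unit conversion, if needed, could be applied to the returned tuples afterwards.<br />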
==Generation of geometric paths==<br />
As our strategy consists in building linkers with helical rods and connecting angles, a path is completely defined by the coordinates of the angle points. Advancing one step from an existing point is always done by adding a displacement vector to this point. This vector is defined by the two spherical angles, chosen here in a discrete manner with an increment of 5 degrees, and by a length, also chosen in a discrete manner. This discrete length is used in two different contexts: it may correspond to the length of an alpha helix or to the length of the flexible part that appears at the extremity of the protein. The coordinates reached with this vector define the coordinates of a new angle point, no matter whether the vector corresponds to an alpha helix or to a flexible part. ###figure, points + vectors###. As we screen all possible angles in a discrete manner, the coordinates of those angle points are regularly distributed on a sphere. As further detailed in the next sections, those spheres are defined from both ends, either once or in several steps. Then the software checks for possible straight connections of given lengths for each pair of angle points originating from the two extremities. ###figure needed, with all these possible steps, one sphere around each point and checking for the connections, 2D, different lengths in two different figures, ###<br />
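The discrete displacement step can be sketched as follows. This is an illustrative sketch only: the function name <code>sphere_points</code> is hypothetical, and collapsing the two poles to a single point each is our assumption for the example, not necessarily what the software does.<br />

```python
import math

def sphere_points(origin, radius, step_deg=5):
    """Candidate angle points reached from `origin` by a vector of length
    `radius`, with both spherical angles sampled every `step_deg` degrees."""
    ox, oy, oz = origin
    points = []
    for theta_deg in range(0, 181, step_deg):  # polar angle
        theta = math.radians(theta_deg)
        # At the poles the azimuth is degenerate, so keep a single point there.
        phis = [0] if theta_deg in (0, 180) else range(0, 360, step_deg)
        for phi_deg in phis:  # azimuthal angle
            phi = math.radians(phi_deg)
            points.append((
                ox + radius * math.sin(theta) * math.cos(phi),
                oy + radius * math.sin(theta) * math.sin(phi),
                oz + radius * math.cos(theta),
            ))
    return points
```

With the 5 degree increment, this sampling yields 2522 candidate points per sphere and per length, which gives an idea of how quickly the number of paths grows when several steps are chained.<br />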
The linkers are built in a modular way, with blocks of well-defined size. From the modeling of potential linkers [link], we could derive 8 different alpha helical rods, all with different lengths. In addition, the length of the two segments inside an angle block was always 8 &Aring;, so exchanging angle blocks does not affect the length of the linker. This means that the distance between the angle points is well defined, an essential aspect of our strategy of linker design.<br />
The software proceeds in three steps. First, it checks for the possibility of a direct single alpha helix linker. For this, it applies the procedure just mentioned with spheres whose radius corresponds to the length of the short flexible parts at the extremities of the protein. Second, it tests whether a linker containing two alpha helices connected with a right angle allows the circularization. Finally, it searches for possible linkers with three angle points. The next parts explain those three steps in detail.<br />
This method was chosen because it could be implemented easily and efficiently in our program. However, this strategy generates paths that cross the protein. Therefore we put considerable effort into sorting out the paths.<br />
===Step 1===<br />
As a simple rigid linker with no angle would be easier to design and likely more thermostable than the ones containing angles, the software first checks if this simple solution is possible ###figure needed###. For this, we took into account the fact that proteins have some flexible amino acids at their extremities. This flexible part may come from the protein itself, but also from the 2 glycines that are included at the N-terminal part and from the extein at the C-terminal part. The two latter parts come from our linkers. Those parts have no preferential angles and offer a large number of possibilities to insert fitting linkers. But this flexibility is also a drawback, as we have to include this large number of possible angles and lengths in our path search.<br />
In this first step, the software explicitly takes these flexible parts into account to check for the possibility of straight linkers. As the angles and the length of the flexible parts are variable, the software positions their extremities on a sphere centered on the last fixed position of the structure, as explained above. The radius of this sphere is incremented in a discrete manner, in 4 steps, from 5.25 &Aring; to the maximum length of the flexible part. ###figure###<br />
Then all possible straight segments between the points generated at the two extremities are tested. If they are closer than 5 &Aring; to the protein, or if they cross it, then they are rejected. If they are kept, the software checks whether the length of the segment is compatible with the feasible alpha helices: if the length of a given segment equals one of the 8 alpha helix lengths plus or minus 0.75 &Aring;, then the path is saved.<br />
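The length test of this step can be sketched as below. The 8 values in <code>HELIX_LENGTHS</code> are placeholders chosen for the example, not the rod lengths derived from our modeling, and the function name is hypothetical; only the tolerance of 0.75 &Aring; is taken from the text.<br />

```python
# Placeholder values in angstroms, NOT the real rod lengths of the software.
HELIX_LENGTHS = (10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0, 45.0)
TOLERANCE = 0.75  # angstroms, as stated in the text

def matching_helix_length(distance, helix_lengths=HELIX_LENGTHS, tol=TOLERANCE):
    """Return the first helix length within `tol` of `distance`, or None."""
    for length in helix_lengths:
        if abs(distance - length) <= tol:
            return length
    return None
```

A segment is kept only if this function returns a length; otherwise no single rod can span it.<br />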
===Step 2===<br />
The next possibility to design more complex rigid linkers, while still taking flexible ends into account within a reasonable calculation time, was to reduce the number of possible angles. As we originally thought that a 90° angle would be practically feasible, the software was designed to generate linkers with flexible ends and one 90° angle. ###figure, protein_one_90° angle### This choice was notably made because of the simplicity of calculating the edge lengths of a right triangle. We already saw in Step 1 that the length of an edge can only take 8 different values. As the linkers have to start from the extremities of the protein, and as we impose a right angle, the number of possible paths is low, making them easy to compute. Practically, the extremities of the protein are positioned in a flexible way as in Step 1. From each of the positions allowed by this flexibility, the software searches for all the allowed right triangles. This was mainly done because the degrees of freedom needed to be restricted to keep the calculations feasible. <br />
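One possible reading of the right-triangle search is sketched below: two rods of allowed lengths form the legs, and the distance between the two flexible end positions is the hypotenuse. The function name and the reuse of the 0.75 &Aring; tolerance in this step are assumptions made for the illustration.<br />

```python
import math
from itertools import product

def right_angle_pairs(distance, helix_lengths, tol=0.75):
    """Ordered pairs (a, b) of helix lengths that, joined at a right angle,
    span `distance` within `tol` (Pythagoras: a^2 + b^2 = distance^2)."""
    return [
        (a, b)
        for a, b in product(helix_lengths, repeat=2)
        if abs(math.hypot(a, b) - distance) <= tol
    ]
```

With 8 rod lengths there are only 64 ordered pairs to test per pair of end positions, which is why this step is cheap compared with a free angle search.<br />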
===Step 3===<br />
Finally the software also provides the possibility to find paths with up to 4 edges, meaning 4 alpha helices and 3 angles. Thanks to the modularity of the possible linkers, such paths can theoretically circularize any kind of protein. ###figure of torus###<br />
To keep the calculation feasible in a reasonable time, we designed the search strategy so that the flexible parts at the extremities are oriented in the same direction as the adjacent alpha helix. This obviously restricts the search, but as these orientations are allowed for the flexible parts, the approach remains valid. <br />
First, potential end points of the first alpha helical rod are calculated from the N-terminal point of the protein. The orientation is chosen in a discrete manner, with an increment of 5 degrees for the two angles of the spherical coordinates. The distance from the origin corresponds to one of the 8 possible lengths allowed by the alpha helices, as already seen in Step 2, plus a length of 0, which mimics a linker with 3 instead of 4 edges. The exact same procedure is repeated to define all the potential end points of the second alpha helical rod, starting from every possible end point of the first rod. Thanks to the possibility of a length of 0 for both the first and the second rod, the software also calculates paths with 2 edges. Then, the same is done only once from the C-terminal point of the protein, defining 1 edge. The final step consists in checking whether the points originating from the N- and C-terminal points can be linked by a potential alpha helix, i.e. whether they are separated by an appropriate distance. If any of the possible alpha helix lengths lies within 0.75 &Aring; of the distance between two points, then the path is saved. In the same way, if two points are closer than 0.75 &Aring; to each other, then the path is also saved.<br />
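The final connection test between the points grown from the two termini can be sketched as a simple double loop. The function name and the returned tuple layout are hypothetical, and a real implementation would need a more efficient search than this brute-force comparison.<br />

```python
import math

def connectable_pairs(n_points, c_points, helix_lengths, tol=0.75):
    """Pair points grown from the N-terminus with points grown from the
    C-terminus when a helix (or nothing at all) can close the gap.

    Returns (n_point, c_point, helix_length) tuples; a helix_length of 0.0
    means the two points already coincide within `tol`.
    """
    pairs = []
    for p in n_points:
        for q in c_points:
            d = math.dist(p, q)
            if d < tol:
                pairs.append((p, q, 0.0))
                continue
            for length in helix_lengths:
                if abs(d - length) <= tol:
                    pairs.append((p, q, length))
                    break
    return pairs
```

Growing points from both ends and meeting in the middle keeps each side to at most two chained spheres instead of enumerating all four edges from one end.<br />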
==Sorting out of paths==<br />
The previous part described the generation of paths that can connect the two extremities of the protein irrespective of the position of these paths relative to the protein. While this allows fast computation of the geometrical paths, it also implies that the paths that are not practically feasible need to be sorted out. This is the most time-consuming part of the computation, as about 1 billion paths are generated. Three criteria are considered for the sorting. The first one is the feasibility of the linker: can the software find angle patterns that correspond to the ones defined by the geometrical path? This question was part of the motivation for a large modeling effort (link) to determine the possible angles between consecutive alpha helices. This was achieved by analyzing the distribution of angles between alpha helices found in the ArchDB database (link). As nearly any angle between 20 and 170 degrees could be found, only few paths were actually rejected at that step. The next criterion is the position of the angle points: if any of them lies inside the protein, then the path is rejected. Finally, the software checks whether any atom of the protein is less than 5 &Aring; away from any of the alpha helices; if so, the path is also rejected.<br />
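The third criterion amounts to a point-to-segment distance test, sketched here under the assumption that each alpha helix is represented by the straight segment between its two angle points. Both function names are hypothetical.<br />

```python
import math

def point_segment_distance(p, a, b):
    """Shortest distance from point p to the segment from a to b (3D)."""
    ab = [b[i] - a[i] for i in range(3)]
    ap = [p[i] - a[i] for i in range(3)]
    ab_len_sq = sum(c * c for c in ab)
    if ab_len_sq == 0.0:  # degenerate segment: a and b coincide
        return math.dist(p, a)
    # Projection parameter of p onto the line, clamped to the segment.
    t = sum(ap[i] * ab[i] for i in range(3)) / ab_len_sq
    t = max(0.0, min(1.0, t))
    closest = [a[i] + t * ab[i] for i in range(3)]
    return math.dist(p, closest)

def helix_clashes(atoms, a, b, min_distance=5.0):
    """True if any atom lies closer than `min_distance` to the helix axis."""
    return any(point_segment_distance(atom, a, b) < min_distance for atom in atoms)
```

Because this test runs over every atom for every helix of every path, it is a natural candidate for the most expensive part of the sorting.<br />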
==Shifting paths to the patterns==<br />
The strategy described in Step 3 gives a certain freedom to the rod that connects the last two angle points generated from the N- and C-terminal points. As this freedom is actually not permitted by the alpha helix and the angle pattern, but is permitted by the flexible part, for example at the C-terminal end, the software slightly refines the path by rotating the segment that originates from the C-terminal point.<br />
==Weighting of paths==<br />
Before translating the paths into sequences and thus into linkers that can be expressed, each path needs to be evaluated for its potential capacity to enhance heat stability. For this we have identified different contributions that should be combined into one value that defines how good the linker may be and that consequently defines its ranking among all possible linkers. The smaller this value, the more we expect that the linker will enhance thermostability. An important step for all these contributions is the normalization, as explained in the next paragraphs.<br />
The first contribution we considered was the linker length. We assumed that a short linker is better to constrain the protein extremities, and that a long helix might give more flexibility. Because we wanted this value to be independent of the size of the protein, the length of the linker is normalized to the distance between the two termini.<br />
The second contribution relates to the angles used in the linker. We learned from the [[modeling ###link###]] that angles formed by a certain angle pattern follow a certain distribution. First, we assumed that the narrower the distribution, the more likely the alpha helices would actually produce this angle. Second, the angles found by the software should be as close as possible to those well-defined angles. In this case, the weight value from this contribution should be low.<br />
Then the distance of the linker to the protein is taken into account. Because the linkers should not disturb the protein in its normal environment, linkers that pass close to the protein surface are considered better linkers. The distance was defined as the minimal distance between the linker and all the atoms of the protein. As already mentioned for the sorting of the paths, a linker cannot come closer than 5 &Aring; and this distance was used for normalization of calculated distances.<br />
After this, the places a linker should avoid are calculated. Each protein can interact with other molecules on some parts of its surface. The user can specify where and how big those parts are. If a linker passes in front of a potential molecule binding domain, the value of the corresponding path goes to infinity, so that the linker is discarded. Conversely, the farther a linker is from a potential ligand binding domain, the smaller its weighting value. The user can also specify the importance of certain regions. In the end the total weighting is normalized to the number of binding domains.<br />
===Calibrating the weighting function===<br />
Every contribution has its own distribution. You can see an example in figure ### [[figure histogram_length_lys.png]], but all of them have different shapes. The aim is to find the paths that minimize all of these contributions. Therefore, in the weighting function, the four contributions mentioned above are combined in a linear manner:<br />
\[ W(p) = \alpha L(p) + \beta A(p) + \gamma D(p) + \delta u(p) \]<br />
where W is the final weighting, p the path, L the length contribution, A the angle contribution, D the distance contribution and u the contribution from the forbidden regions. $\alpha, \beta, \gamma, \delta$ are the weighting constants that needed to be found. The normalization of each contribution was chosen so that each of them is dimensionless and all have reasonably similar values.<br />
The weighting constants were obtained from the [[linker-screening ###lin]] performed with lysozyme and the [[enzyme-modeling ###link]].<br />
Please see [[below ###link]] for a detailed explanation of how the values were obtained.<br />
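The linear combination W(p) can be written as a short function. The default constants of 1.0 are placeholders for the example and are not the calibrated values of $\alpha, \beta, \gamma, \delta$; the function name is hypothetical.<br />

```python
import math

def path_weight(length, angle, distance, forbidden,
                alpha=1.0, beta=1.0, gamma=1.0, delta=1.0):
    """W(p) = alpha*L(p) + beta*A(p) + gamma*D(p) + delta*u(p).

    All four contributions are assumed to be already normalized and
    dimensionless. The default constants are placeholders, not the
    calibrated values.
    """
    return alpha * length + beta * angle + gamma * distance + delta * forbidden

def is_discarded(weight):
    """A path crossing a forbidden region has u(p) = inf, hence W(p) = inf."""
    return math.isinf(weight)
```

Paths can then be ranked by sorting on this value, the smallest weight first, after dropping those for which <code>is_discarded</code> is true.<br />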
==Translating paths to sequence==<br />
As mentioned before, the software is provided with two databases, one for the possible angle patterns and one for the helix patterns. The choice of the patterns was inspired by known crystal structures extracted from databases and described in different papers.<br />
A large in silico screening for refining the preferences of the patterns was then set up using the [[iGEM@home|###Link]] system. For the complete description of the search for suitable patterns, see the [[modeling|###Link to patternspart]] page. <br />
All the possible paths are now split up at the angles and compared with the possible patterns in the databases. ###Figure needed to explain, how the path is translated### The most suitable patterns are identified and joined together to build the path's sequence. It is important to note that this is only possible because of the modularity of our linker patterns used as building blocks: each block, be it an alpha helix or an angle pattern, is not affected by the others. Thus for each possible path, one sequence is produced.<br />
==Clustering of paths==<br />
Many different paths are represented by the same sequence [[###Figure that shows, different paths have same properties, already before in the text ###]] and we therefore clustered such paths. The weights for those clustered paths were then calculated by averaging the weights of the different paths that compose a cluster.<br />
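The clustering by sequence can be sketched as a grouping followed by an average. The function name and the input format, a list of (sequence, weight) pairs, are assumptions for the illustration, and the sequences in the usage example are arbitrary strings.<br />

```python
from collections import defaultdict

def cluster_weights(paths):
    """Group (sequence, weight) pairs by sequence and average the weights."""
    grouped = defaultdict(list)
    for sequence, weight in paths:
        grouped[sequence].append(weight)
    return {seq: sum(ws) / len(ws) for seq, ws in grouped.items()}
```

For example, <code>cluster_weights([("AEAAAK", 1.0), ("AEAAAK", 3.0), ("GG", 2.0)])</code> gives one entry per sequence, each with a mean weight of 2.0.<br />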
=Results=<br />
==DNMT1==<br />
A major motivation of our effort to design rigid linkers with angles was the circularization of the DNA methyltransferase Dnmt1 (link). The truncated form used in our project is composed of 900 amino acids and the N- and C-terminal extremities are well separated. To circularize it, two linkers were designed: a flexible one made of glycine and serine, and a rigid one designed by the software. The rigid linkers for DNMT1 were obtained from an early version of the software. At that time the calculation took 11 days on a laptop computer with an Intel i5 processor and 8 GB of RAM, which shows the importance of a distributed computing system for large proteins. Since then, the software has improved considerably, reducing the calculation time to about 1 day for DNMT1.<br />
IMPORTANT!<br />
paths-> geometrical<br />
linkers-> helices...<br />
sequences-> seq of linkers<br />
insist on huge number of data produced<br />
We definitely need to cite [0]!!!<br />
=Abstract=<br />
=Overview=<br />
As already introduced [###link to circularization idea in the toolbox], artificially circularized proteins may gain some heat stability by restraining the C- and N-terminus from moving around freely. This circularization may be trivial when the protein termini are very close to each other, which seems to be reasonably common [[#References|[1]]]. However, if the ends are too far from each other, a long linker is needed to connect them. This linker should not change the natural conformation of the protein and should constrain the relative position of the ends to restrict the degrees of freedom and thus to stabilize the structure even when heated up. In addition, these linkers should not affect any of the protein functions. Consequently, it is important to prevent linkers from passing through the active site or from covering binding domains for other molecules, for example. Therefore one needs to be able to define the shape of possible linkers. This section describes the software we developed to design such linkers. We would like to stress that this work has been made possible thanks to the feedback between computer modeling and experimental work: we could first design linkers in silico, test them experimentally and use the results to further calibrate the software. To our knowledge, this is the first time that such an approach has been used to custom-design linkers to connect protein extremities.<br />
In short, the software can provide a weighted list of linkers to circularize any protein of interest with a known structure. Those linkers are made of rigid alpha helical segments connected with defined angles. Contrary to flexible linkers, those rigid linkers were expected to constrain the protein extremities and to confer better heat stability. To generate those linkers, we first defined the geometrical paths, with segments and angles, that they should follow. The geometrical paths that are biologically feasible are afterwards translated into amino acid sequences. Both the compatibility of paths with possible structures and the translation were made possible thanks to our modeling approaches. The first approach consisted in performing a statistical analysis of more than 17000 known non-homologous structures containing alpha helices connected with angles. For the second approach, we modeled the conformation of linkers circularizing proteins of known structure and analyzed them for certain properties. This second approach was run for a large number of proteins thanks to our distributed computing system igemathome. The software provides different possible linkers with weights that rank the linkers according to their capacity to maintain protein activity at higher temperatures. These weights were generated in an extensive [[linker screening| ### ]] on the target protein lambda-lysozyme, using the first modeling approach.<br />
The software was checked for running stability in a large test on [[iGEM@home]]. <br />
Furthermore, our software provided linkers for circularizing [[DNMT1| ###]], which could be made more heat-stable by circularization???.<br />
For detailed information on the implementation and the practical use of the software, please see the [[documentation software-docu]] page.<br />
==What does the software do==<br />
still missing [[figure 0, graph abstract]]<br />
=Background=<br />
What are linkers, why should ours work better?<br />
Linkers, References to the modeling, <br />
Classically, protein linkers have been designed in three different ways. ###REFERENCE### The easiest way is to define the length that a linker should cover and then simply use a flexible glycine-serine peptide with the right number of amino acids to match this length. Glycine is used for flexibility, as it has no side chain and does not produce any steric hindrance, while serine is used for solubility, as it has a small polar side chain. This solubility is important, as the linkers should not pass through the hydrophobic core of the protein, but should be dissolved in the surrounding medium. These flexible linkers are normally used for circularization, but also for connecting different proteins when the main requirement is that the parts are connected, not how they are connected, or when the flexibility of the linker is required for a specific application.<br />
A second strategy consists of using rigid helical linkers to keep proteins or protein domains at a certain distance from each other. This is especially important for signalling proteins and fluorescent proteins. ###TODO: Reference### One major property of alpha helices is that they always fold in a defined way with well-defined angles and lengths. There are also many different helical patterns that differ in stability and solubility. One big disadvantage of this strategy is that helices can only form straight linkers. So, in the context of circularization, if the straight line connecting the protein extremities crosses the protein, this strategy is not an option.<br />
The third option, which served as the basis for our approach, consists of designing custom-tailored linkers for each specific application. These linkers can be obtained from protein structure prediction. First, one needs to define the path that the linker should take to connect two amino acids. Afterwards, one designs a possible linker sequence that might fit well. Next, one runs a structure prediction of the linker attached to the proteins to validate the design. Several different linkers, with slight changes, can be compared. This is repeated until the linker effectively follows the expected path. ###TODO: Reference, WADE paper### This method is time-consuming, as it is not only computationally intensive but also requires a strong knowledge of protein folding and protein structure prediction. On the other hand, the benefit can be substantial, as the interaction of the linker with the protein's surface can be taken into account and the path taken by the linker can be defined accurately, down to the resolution of the protein structure.<br />
We have set up a completely new strategy to design rigid linkers. As further detailed in the [[###link###|modelling]] part, it is possible to define the shape of a linker by combining rigid alpha-helical rods with well-defined angle patterns. Therefore, by defining the possible paths of the circularizing linkers for a given protein geometrically, we can propose potential linkers. Defining the geometrical path can be very difficult, especially for large proteins with complex shapes. Moreover, this definition is further constrained by the fact that linkers must avoid hiding the active sites of the protein of interest. Finally, the paths have rotational degrees of freedom at the extremities of the protein and, depending on their orientation, they may or may not match the geometry of the protein. The tool we present here covers both steps: defining weighted geometrical paths and translating them into feasible linkers, also with weights. This tool is universal, as it can design circularizing linkers for any protein with a known structure. Moreover, it is modular: thanks to our modeling approach [link], we have designed linkers as exchangeable blocks of rods of different lengths and of angle patterns. The following sections detail the steps followed by our software to design suitable linkers.<br />
But even when taking all of this into account, one could never also consider all the alternative paths that the same linker is able to take because of the rotational degrees of freedom inherent in the linker model. ###figure needed###<br />
We therefore decided to provide the scientific community with powerful open-source software that, for every protein with a given structure, can calculate the sequence of a rigid linker that circularizes the protein with minimal inhibition of the protein's function. <br />
Until now, each scientist had to estimate the length of the linker themselves, ###check for flexlinker### so our software is a completely novel approach to circularization.<br />
=General procedure=<br />
Our general approach to finding linkers: rods and angles are chosen in a discrete manner, so there are only finitely many possibilities to connect the ends, and all of these are checked. <br />
I think this paragraph is just too much, as anyway there was already an overview before.<br />
First, the protein structure is analysed geometrically: paths composed of no more than four straight segments connected by angles are generated computationally. A path is always represented by straight lines and the connecting angles between them. ###In the end all these paths should be sorted by how well it would be, if we circularize the protein using this path.### But as the final weighting is quite computationally intensive, the paths first need to be filtered. A path is only discarded if it breaks one of the rules for linkers. For example, paths should never pass through the protein. After all the paths have been generated, they are refined by shifting the angle points according to the underlying linker model, so that no path is taken into account that could not be built with our building blocks.<br />
Once all paths have been generated, they are weighted by several factors, which are then combined into a single weight per path that reflects how good that path is. The path is then translated into an amino acid sequence using our amino acid patterns. But as one sequence can follow more than one path, all the paths built by the same sequence are clustered. In the end, the weights within each path cluster are averaged, so that every linker receives a single weighting value with contributions from all the paths it can possibly take.<br />
==PDB analysis==<br />
First, the PDB file containing the structure of the target protein is parsed and the coordinates of the atoms are stored in metric units. After this, some initial tests are made on the protein structure. First, we checked whether the C- and N-termini lie on the surface of the protein and are accessible to the solvent, which is crucial for circularization. We defined a line originating from an extremity of the protein by the two angles of the spherical coordinates around the z-axis. From that, we could determine the accessible angles by rejecting all the lines that are too close to the protein. As the future linker will be made of alpha helices and will therefore have a radius of 5 &Aring;, we used this length as the minimal allowed distance.<br />
Those allowed angles are stored for the subsequent linker generation. ###fig needed### <br />
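The accessibility test described above can be sketched as follows. This is a minimal illustration, not the actual implementation: the function names, the 5 degree scanning step and the treatment of each direction as an infinite ray are assumptions made for this sketch, and atoms bonded directly to the terminus would have to be excluded from <code>atoms</code> in practice.<br />

```python
import math

def direction(theta_deg, phi_deg):
    """Unit vector for polar angle theta and azimuth phi (both in degrees)."""
    t, p = math.radians(theta_deg), math.radians(phi_deg)
    return (math.sin(t) * math.cos(p), math.sin(t) * math.sin(p), math.cos(t))

def ray_clearance(origin, d, atom):
    """Distance from an atom to the ray origin + s*d with s >= 0 (d is a unit vector)."""
    v = [a - o for a, o in zip(atom, origin)]
    s = max(0.0, sum(vi * di for vi, di in zip(v, d)))  # projection onto the ray
    closest = [o + s * di for o, di in zip(origin, d)]
    return math.dist(atom, closest)

def accessible_angles(terminus, atoms, min_dist=5.0, step=5):
    """Return the (theta, phi) pairs whose ray keeps every atom at least min_dist away.

    min_dist = 5 Angstrom is the helix radius given in the text;
    step = 5 degrees is an assumption of this sketch.
    """
    allowed = []
    for theta in range(0, 181, step):
        for phi in range(0, 360, step):
            d = direction(theta, phi)
            if all(ray_clearance(terminus, d, a) >= min_dist for a in atoms):
                allowed.append((theta, phi))
    return allowed
```

With a single atom placed 10 &Aring; below the terminus, the direction pointing straight at the atom is rejected while the opposite direction is kept.<br />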
==Generation of geometric paths==<br />
As our strategy consists of building linkers from helical rods and connecting angles, a path is completely defined by the coordinates of its angle points. Advancing one step from an existing point is always done by adding a displacement vector to this point. This vector is defined by the two spherical angles, chosen here in a discrete manner with an increment of 5 degrees, and by a length, also chosen in a discrete manner. This discrete length is used in two different contexts: it may correspond to the length of an alpha helix or to the length of the flexible part that appears at the extremity of the protein. The coordinates reached by this vector define a new angle point, no matter whether the vector corresponds to an alpha helix or to a flexible part. ###figure, points + vectors###. As we screen all possible angles in a discrete manner, the coordinates of those angle points are regularly distributed on a sphere. As further detailed in the next sections, those spheres are defined from both ends, either once or in several steps. Then the software checks for possible straight connections of given lengths for each pair of angle points originating from the two extremities. ###figure needed, with all these possible steps, one sphere around each point and checking for the connections, 2D, different lengths in two different figures, ###<br />
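As an illustration of the displacement step, the following sketch generates all angle points reachable from one point for a given length. The function name is hypothetical; only the 5 degree increment comes from the text.<br />

```python
import math

def sphere_points(center, length, step_deg=5):
    """Angle points reached from center by displacement vectors of the given length.

    The two spherical angles are scanned in discrete steps (5 degrees in the text),
    so the resulting points lie on a sphere of radius `length` around `center`.
    """
    points = []
    for theta in range(0, 181, step_deg):      # polar angle
        for phi in range(0, 360, step_deg):    # azimuth
            t, p = math.radians(theta), math.radians(phi)
            points.append((
                center[0] + length * math.sin(t) * math.cos(p),
                center[1] + length * math.sin(t) * math.sin(p),
                center[2] + length * math.cos(t),
            ))
    return points
```

Note that this simple scan produces repeated points at the two poles; a real implementation might deduplicate them.<br />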
The linkers are built in a modular way, from blocks of well-defined size. From the modeling of potential linkers [link], we could derive 8 different alpha-helical rods, all with different lengths. In addition, the length of the two segments inside an angle block was always 8 &Aring;, so exchanging angle blocks does not affect the length of the linker. This means that the distance between the angle points is well defined, an essential aspect of our linker design strategy.<br />
The software proceeds in three steps. First, it checks whether a direct linker made of a single alpha helix is possible. For this, it applies the procedure just mentioned, with sphere radii that correspond to the length of the short flexible parts at the extremities of the protein. Second, it tests whether a linker containing two alpha helices connected with a right angle allows the circularization. Finally, it searches for possible linkers with three angle points. The next parts explain those three steps in detail.<br />
This method was chosen because it could be implemented easily and efficiently in our program. However, this strategy also generates paths that cross the protein. We therefore put considerable effort into filtering the paths.<br />
In the [[###link###|modelling]] part we have already described the patterns we use to build our linkers. The modularity was crucial for the success of the software. The angle patterns were therefore always chosen so that the end of the angle pattern was 8 &Aring; away from the turning point ###figure needed###. Thus displacements only had to be made with certain lengths, independent of the direction of the displacement and of the direction in which the linker continues. This modularity makes the calculations more efficient than generating points randomly.<br />
We still have to rephrase that.<br />
===Step 1===<br />
As a simple rigid linker with no angle would be easier to design and likely more thermostable than linkers containing angles, the software first checks whether this simple solution is possible ###figure needed###. For this, we took into account the fact that proteins have some flexible amino acids at their extremities. This flexible part may come from the protein itself, but also from the 2 glycines that are included at the N-terminal part and from the extein at the C-terminal part. The latter two parts come from our linkers. Those parts have no preferential angles and offer a large number of possibilities for inserting fitting linkers. But this flexibility is also a drawback, as we have to include this large number of possible angles and lengths in our path search.<br />
In this first step, the software explicitly takes these flexible parts into account to check for the possibility of straight linkers. As the angles and the length of the flexible parts are variable, the software positions their extremity on a sphere centered on the last fixed position of the structure, as explained above. The radius of this sphere is incremented in a discrete manner, in 4 steps, from 5.25 &Aring; to the maximum length of the flexible part. ###figure###<br />
Then all possible straight segments between the first points and the last points are tested. If they come closer than 5 &Aring; to the protein, or if they cross it, they are rejected. If they are kept, the software checks whether the length of the segment is compatible with the feasible alpha helices: if the length of a given segment equals one of the 8 alpha helix lengths plus or minus 0.75 &Aring;, then the path is saved.<br />
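The length test at the end of Step 1 might look like the sketch below. The 0.75 &Aring; tolerance is taken from the text; the eight helix lengths are not listed on this page, so the values used here are purely hypothetical placeholders.<br />

```python
# Hypothetical placeholder values (in Angstrom): the text states that there are
# 8 possible helix lengths but does not list them.
HELIX_LENGTHS = (10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0, 45.0)
TOLERANCE = 0.75  # Angstrom, as given in the text

def matching_helix(segment_length, lengths=HELIX_LENGTHS, tol=TOLERANCE):
    """Return the helix length that lies within tol of segment_length, or None."""
    best = min(lengths, key=lambda h: abs(h - segment_length))
    return best if abs(best - segment_length) <= tol else None
```

A segment of 20.5 &Aring; would be matched to the 20 &Aring; placeholder helix, whereas a segment of 22 &Aring; would be rejected.<br />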
" Therefore here another function for calculating the distance from the protein to the connection is used, that is more time-consuming, but also more accurate, than the ones used normally. " ??<br />
"This is done with a higher accuracy, because none of these linkers should be lost by error". ??<br />
So far, two functions have been included for generating linkers that take the flexibility of the ends into account.<br />
The only variability we have in the custom-tailored linker here is the length of the helix. Most likely no suitable linker can be found in this step, because if there is an obstacle between the ends, the linker cannot bend around it. But if such a linker is possible, it should be found, because it will be one of the best linkers predicted. <br />
So for all accessible angles that were calculated before, we calculate all points from the N-terminus and from the C-terminus that lie in certain distances from the terminus. The points are spread over varying distances from ; to the .<br />
===Step 2===<br />
The next possibility to design more complex rigid linkers, while still taking flexible ends into account within a reasonable calculation time, was to reduce the number of possible angles. As we originally thought that a 90° angle would be practically feasible, the software was designed to generate linkers with flexible ends and one 90° angle. ###figure### This choice was notably made because the edge lengths of a right triangle are simple to calculate. We already saw in Step 1 that the length of an edge can only take 8 different values. As the linkers have to start from the extremities of the protein, and as we impose a right angle, the number of possible paths is low, making them easy to compute. In practice, the extremities of the protein are positioned in a flexible way as in Step 1. From each of the positions allowed by this flexibility, the software searches for all the allowed right triangles. <br />
some words on degree of freedom...<br />
###newly written###<br />
Here again we only have discrete possibilities to build right triangles from our helical patterns. Therefore, we first search for all combinations of two helical patterns that can form a right triangle whose hypotenuse has the length of the distance between the termini.<br />
Now we shift back to 3D and apply Thales's theorem, which states that if A, B and C lie on a circle and the line between A and C is a diameter of that circle, then the angle at B is a right angle. ### fig, thales### Thus, in 3D, we can discretize the possible positions of the angle point (B) relative to the starting point (A). This set is cross-checked against the possible right triangles found before, so that we only keep paths that can be built with our patterns.<br />
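The two checks of Step 2 can be sketched as follows. Both function names are hypothetical, and reusing the 0.75 &Aring; tolerance from Step 1 here is an assumption of this sketch rather than something stated in the text.<br />

```python
import math
from itertools import combinations_with_replacement

def right_triangle_pairs(hypotenuse, lengths, tol=0.75):
    """Pairs (a, b) of edge lengths whose right-triangle hypotenuse matches within tol."""
    return [(a, b) for a, b in combinations_with_replacement(lengths, 2)
            if abs(math.hypot(a, b) - hypotenuse) <= tol]

def on_thales_sphere(a, b, c, tol=0.75):
    """True if b lies on the sphere whose diameter is the segment a-c.

    By Thales's theorem, the angle at b is then a right angle.
    """
    mid = [(x + y) / 2 for x, y in zip(a, c)]
    return abs(math.dist(b, mid) - math.dist(a, c) / 2) <= tol
```

For termini 25 &Aring; apart and candidate edge lengths of 15, 20, 25 and 30 &Aring;, only the pair (15, 20) forms a right triangle, and the point (9, 12, 0) is a valid angle point between (0, 0, 0) and (25, 0, 0).<br />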
Like in Step 1 the software creates spheres of points around the start (first points) and around the end (last points). Then for each point from the first points<br />
===Step 3===<br />
Finally, the software can also find paths with up to 4 edges, meaning 4 alpha helices and 3 angles. Thanks to the modularity of the possible linkers, such paths can theoretically circularize any kind of protein. ###figure of torus###<br />
To keep the calculation time reasonable, we designed the search strategy so that the flexible parts at the extremities are oriented in the same direction as the adjacent alpha helix. This obviously restricts the search, but as these orientations are allowed for the flexible part, the approach remains valid. <br />
First, potential end points of the first alpha-helical rod are calculated from the N-terminal point of the protein. The orientation is chosen in a discrete manner, with an increment of 5 degrees for the two angles of the spherical coordinates. The distance from the origin corresponds to the 8 possible lengths allowed by the alpha helices, as already seen in Step 2, plus a length of 0, which mimics a linker with 3 instead of 4 edges. The exact same procedure is repeated to define all the potential end points of the second alpha-helical rod, starting from all the possible end points of the first alpha-helical rod. Thanks to the possibility of a length of 0 for the first and the second rods, the software also calculates paths with 2 edges. Then the same is done only once from the C-terminal point of the protein, defining 1 edge. The final step consists of checking whether the points originating from the N- and C-terminal points can be linked by a potential alpha helix, i.e. whether they are separated by an appropriate distance. If any of the possible alpha helix lengths lies within the distance between two points plus or minus 0.75 &Aring;, the path is saved. In the same way, if two points are closer to each other than 0.75 &Aring;, the path is also saved.<br />
one starts from omega, 2 from alpha<br />
Thus even one of the worst shapes for circularization, a torus with the two termini in the pits, could be circularized with our linker system without needing infinitely long straight segments. On the other hand, this was the largest number of possibilities that was still feasible to calculate.<br />
Here the flexible parts of the protein are assumed to point in the same direction as the following helix. In this way the number of possibilities is kept fixed, but of course this is a rather rough approximation.<br />
Let's check the next paragraph together in detail<br />
First, points at the right distances from each end are generated. ###figure needed, explains point names### Then, for each of the first points, next points are generated in all directions with an angle increment of 5 degrees. Of these, all points that do not fit are immediately discarded. Then all connections from the second points to the last points are checked for whether they have the right length. After this, the normal filtering steps are applied to these new connections.<br />
==Sorting out of paths==<br />
The previous part described the generation of paths that can connect the two extremities of the protein, irrespective of the position of these paths relative to the protein. While this allows fast computation of the geometrical paths, it also implies that paths that are not practically feasible need to be filtered out. This is the most time-consuming part of the computation, as about 1 billion paths are generated. Three criteria are considered for the filtering. The first one is the feasibility of the linker: can the software find angle patterns that correspond to the angles defined by the geometrical path? This question was part of the motivation for a large modeling effort (link) to determine the possible angles between consecutive helices. This was achieved by analyzing the distribution of angles between alpha helices found in the ArchDB database (link). As nearly any angle between 20 and 170 degrees could be found, only a few paths were actually rejected at that step. The next criterion is the position of the angle points: if one of them lies inside the protein, the path is rejected. Finally, if any atom of the protein is less than 5 &Aring; away from any of the alpha helices, the path is also rejected.<br />
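The third criterion amounts to a point-to-segment distance test, sketched below. The function names are hypothetical, and the sketch ignores the fact that the atoms at the termini themselves necessarily lie close to the first and last segment and would have to be excluded in practice.<br />

```python
import math

def point_segment_distance(p, a, b):
    """Shortest distance from point p to the straight segment a-b."""
    ab = [y - x for x, y in zip(a, b)]
    ap = [y - x for x, y in zip(a, p)]
    denom = sum(c * c for c in ab)
    # Clamp the projection parameter so the closest point stays on the segment.
    t = 0.0 if denom == 0 else max(0.0, min(1.0, sum(u * v for u, v in zip(ap, ab)) / denom))
    closest = [x + t * c for x, c in zip(a, ab)]
    return math.dist(p, closest)

def path_is_clear(path, atoms, min_dist=5.0):
    """True if every atom is at least min_dist away from every segment of the path."""
    return all(point_segment_distance(atom, a, b) >= min_dist
               for a, b in zip(path, path[1:])
               for atom in atoms)
```

Since this test runs over every atom for every segment of every candidate path, it is easy to see why the filtering dominates the computing time.<br />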
==Shifting paths to the patterns==<br />
The strategy described in Step 3 gives a certain freedom to the rod that connects the last two angle points generated from the N- and C-terminal points. As this freedom is not actually permitted by the alpha helix and the angle pattern, but is permitted by the flexible part, for example at the C-terminal end, the software slightly refines the path by rotating the segment that originates from the C-terminal point.<br />
In particular, because of the discretization, it can always happen that the generated paths do not fit the helical and angle patterns perfectly. Therefore the paths need to be refined before they are translated into amino acid sequences.<br />
Every path that has survived is analyzed step by step, starting from the starting point and always advancing by one angle point. For each step from point to point, the length of the step is calculated and compared to the lengths we can build with the helix patterns. If the step is too long, the next point is shifted towards the previous one until it fits. If it is too short, it is shifted away. These shifts do not exceed a certain length, so that the paths do not shift too much and suddenly pass through restricted areas.<br />
This is also done before the weighting, so that the paths don't change after weighting anymore.<br />
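One refinement step could be sketched as below. The function name and the maximum shift of 0.75 &Aring; are assumptions; the text only says that shifts do not exceed "a certain length".<br />

```python
import math

def snap_next_point(prev, nxt, lengths, max_shift=0.75):
    """Move nxt along the line prev->nxt so that the step matches the nearest allowed length.

    Returns the shifted point, or None if the required shift exceeds max_shift
    (max_shift is a hypothetical value chosen for this sketch).
    """
    d = math.dist(prev, nxt)
    target = min(lengths, key=lambda h: abs(h - d))
    if d == 0 or abs(target - d) > max_shift:
        return None
    scale = target / d  # < 1 pulls the point closer, > 1 pushes it away
    return tuple(p + (n - p) * scale for p, n in zip(prev, nxt))
```

A step of 20.5 &Aring; would be shortened to an allowed 20 &Aring;, while a step of 24 &Aring; with allowed lengths of 10, 20 and 30 &Aring; would be left unresolved.<br />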
==Weighting of paths==<br />
Before translating the paths into sequences and thus into linkers that can be expressed, each path needs to be evaluated for its potential capacity to enhance heat stability. For this we have identified different contributions that should be combined into one value that defines how good the linker may be and that consequently defines its ranking among all possible linkers. The smaller this value, the more we expect that the linker will enhance thermostability. An important step for all these contributions is the normalization, as explained in the next paragraphs.<br />
The first contribution we considered was the linker length. We assumed that a short linker is better to constrain the protein extremities, and that a long helix might give more flexibility. Because we wanted this value to be independent of the size of the protein, the length of the linker is normalized to the distance between the two termini.<br />
The second contribution relates to the angles used in the linker. We learned from the [[modeling ###link###]] that angles formed by a certain angle pattern follow a certain distribution. First, we assumed that the narrower the distribution, the more likely the alpha helices would actually produce this angle. Second, the angles found by the software should be as close as possible to those well-defined angles. In this case, the weight value from this contribution should be low.<br />
Then the distance of the linker to the protein is taken into account. Because the linkers should not disturb the protein in its normal environment, linkers that pass close to the protein surface are considered better linkers. The distance was defined as the minimal distance between the linker and all the atoms of the protein. As already mentioned for the sorting of the paths, a linker cannot come closer than 5 &Aring; and this distance was used for normalization of calculated distances.<br />
As linkers should also not be too close to the protein surface, this value is normalized with the minimal distance a linker should have from the surface. The distance is calculated as the minimal distance between the atoms of the protein and the connection.<br />
After this, the regions a linker should avoid are taken into account. Each protein can interact with other molecules on some parts of its surface. The user can specify where those parts are and how big they are. If a linker passes in front of a potential binding domain, the value of the corresponding path goes to infinity, so that the linker is discarded. Conversely, the farther a linker is from a potential ligand-binding domain, the smaller its weighting value. The user can also specify the importance of certain regions. In the end, the total weighting is normalized to the number of binding domains.<br />
===Calibrating the weighting function===<br />
???keep?###<br />
Every contribution has its own distribution. You can see an example in figure ### [[figure histogram_length_lys.png]], but all of them have different shapes. The aim is to find the paths that minimize all of these contributions. Therefore, in the weighting function, the four contributions mentioned above are combined linearly:<br />
\[ W(p) = \alpha L(p) + \beta A(p) + \gamma D(p) + u(p) \]<br />
where W is the final weighting, p the path, L the length contribution, A the angle contribution, D the distance contribution and u the contribution from the forbidden regions. $\alpha, \beta, \gamma$ are the weighting constants that needed to be found. The normalization of each contribution was chosen so that each of them is dimensionless and all have reasonably similar values.<br />
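Written out as code, the weighting function is a plain linear combination. The sketch below follows the formula above term by term; the function and argument names are hypothetical, and the contributions are assumed to be already normalized as described.<br />

```python
import math

def length_contribution(segment_lengths, termini_distance):
    """Linker length normalized to the distance between the two termini."""
    return sum(segment_lengths) / termini_distance

def path_weight(length, angle, distance, forbidden, alpha, beta, gamma):
    """W(p) = alpha*L(p) + beta*A(p) + gamma*D(p) + u(p); smaller is better.

    forbidden is u(p): it is math.inf when the path passes in front of a
    binding region, which discards the path.
    """
    return alpha * length + beta * angle + gamma * distance + forbidden
```

Because u(p) is added unscaled, a path with an infinite forbidden-region term always ends up at the bottom of the ranking regardless of the other contributions.<br />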
The weighting constants were obtained from the [[linker-screening ###lin]] performed with lysozyme and the [[enzyme-modeling ###link]]<br />
Please see [[below ###link]] for a detailed explanation of how the values were obtained.<br />
==Translating paths to sequence==<br />
As mentioned before, the software is provided with two databases, one for the possible angle patterns and one for the helix patterns. The choice of the patterns was inspired by known crystal structures extracted from databases and described in different papers.<br />
A large in silico screening to refine the preferences of the patterns was then set up using the [[iGEM@home|###Link]] system. For a complete description of the search for suitable patterns, see the [[modeling|###Link to patternspart]] page. <br />
All the possible paths are now split up at the angles and compared with the possible patterns in the databases. ###Figure needed to explain, how the path is translated### The most suitable patterns are identified and concatenated to build the path's sequence. It is important to note that this is only possible because of the modularity of the linker patterns used as building blocks: each block, whether an alpha helix or an angle pattern, is not affected by the others. Thus, one sequence is produced for each possible path.<br />
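The translation step can be illustrated with a small lookup sketch. The amino acid patterns, the two leading glycines and the extein sequence are taken from Table 1 in the Results; the helix lengths and angles used as dictionary keys are illustrative placeholders, since the real pattern databases are not reproduced on this page.<br />

```python
# Hypothetical lookup tables: the sequences appear in Table 1, but the lengths
# (Angstrom) and angles (degrees) used as keys are illustrative placeholders.
HELIX_PATTERNS = {10.0: "AEAAAK", 17.5: "AEAAAKEAAAK"}
ANGLE_PATTERNS = {60.0: "AAAHPEA", 120.0: "ASLPAA"}

def nearest(table, value):
    """Pattern whose key is closest to the requested value."""
    return table[min(table, key=lambda k: abs(k - value))]

def path_to_sequence(helix_lengths, angles, start="GG", extein="RGTCWE"):
    """Concatenate helix and angle blocks in path order, framed by the end sequences."""
    if len(angles) != len(helix_lengths) - 1:
        raise ValueError("a path with n helices needs n - 1 angles")
    parts = [start]
    for i, length in enumerate(helix_lengths):
        parts.append(nearest(HELIX_PATTERNS, length))
        if i < len(angles):
            parts.append(nearest(ANGLE_PATTERNS, angles[i]))
    parts.append(extein)
    return "".join(parts)
```

With these placeholder keys, a path of two short helices joined by one angle yields the same sequence as the linker sgt2 in Table 1.<br />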
==Clustering of paths==<br />
Many different paths are represented by the same sequence [[###Figure that shows, different paths have same properties, already before in the text ###]] and we therefore clustered such paths. The weights for those clustered paths were then calculated by averaging the weights of the different paths that compose a cluster.<br />
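A minimal sketch of this clustering, assuming each path has already been reduced to a (sequence, weight) pair; the function name is hypothetical.<br />

```python
from collections import defaultdict

def cluster_weights(paths):
    """Group (sequence, weight) pairs by sequence and average the weights per cluster."""
    groups = defaultdict(list)
    for sequence, weight in paths:
        groups[sequence].append(weight)
    return {sequence: sum(ws) / len(ws) for sequence, ws in groups.items()}
```

Each linker sequence thus receives exactly one weighting value, however many geometrical paths it can follow.<br />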
=Results=<br />
==DNMT1==<br />
A major motivation for our effort to design rigid linkers with angles was the circularization of the DNA methyltransferase Dnmt1 (link). The truncated form used in our project is composed of 900 amino acids, and its N- and C-terminal extremities are well separated. To circularize it, two linkers were designed: a flexible one made of glycine and serine, and a rigid one designed by the software. The rigid linkers for DNMT1 were obtained from an early version of the software. At that time the calculation took 11 days on a laptop computer with an Intel i5 processor and 8 GB of RAM, which shows the importance of a distributed computing system for large proteins. Since then, the software has improved considerably, reducing the calculation time for DNMT1 to about 1 day.<br />
###DNMT1 could be made heatstable, still missing###<br />
==Feedback from wet lab==<br />
The results from the software for lysozyme were tested as described in the [[linker-screening]] part and evaluated as described in the [[enzyme modeling]] part. We performed a large linker screening on 10 different lysozymes with different linkers. As the purpose of the lysozyme screen was the calibration of the software, those linkers were designed according to the four contributions previously mentioned: one of them was the shortest possible, one had the best possible angle, and so on.<br />
{| class="table table-hover"<br />
|+ '''Table 1''': Linkers and their amino acid sequences. Green: attachment sequences to prevent the flexible regions from being perturbed; Blue: angle; Purple: extein.<br />
! Linker<br />
! Amino acid sequence<br />
|- <br />
|<br />
|<br />
|-<br />
|'''Very good linkers'''<br />
|<br />
|-<br />
| sgt2<br />
| <span style="color:#3ADF00;">GG</span>AEAAAK<span style="color:#00BFFF;">AAAHPEA</span>AEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| rigid<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAKAA<span style="color:#3ADF00;">P</span><span style="color:#A901DB;">RGKCWE</span><br />
|-<br />
| <br />
| <br />
|-<br />
| '''Average linkers'''<br />
|<br />
|-<br />
| may1<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAKA<span style="color:#00BFFF;">AAAHPEA</span>AEAAAKEAAAKA<span style="color:#00BFFF;">KTA</span>AEAAAKEAAAKA<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| ord1<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAK<span style="color:#00BFFF;">ATGDLA</span>AEAAAKAA<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| ord3<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAK<span style="color:#00BFFF;">ASLPAA</span>AEAAAKEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
|<br />
| <br />
|-<br />
| '''Short linkers'''<br />
| <br />
|-<br />
| sho1<br />
| <span style="color:#3ADF00;">GG</span><span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| sho2<br />
| <span style="color:#3ADF00;">GG</span>AEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
|}<br />
[[circ_lam_lys_nils.png]] comparison with [[nice_linker_lysozyme_flexible_ends.png]]<br />
###About predictions of software###<br />
In the end, we obtained a ranking of the in vitro tested linkers from the [[linker-screening]] and chose the parameters $\alpha, \beta, \gamma$ of the weighting function such that the ranking from the software reproduced the ranking from the assays. The final values were: $\alpha = 1.85 \times 10^{-6}, \beta = 0.57, \gamma = 50.8 \times 10^{6}$.<br />
To this end, we set up a function that is minimized when the order of the weighting output is the same as the order from the wet lab.<br />
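The exact objective function is not given here. One simple function with the stated property, shown purely as an assumption, counts the linker pairs that the software and the wet lab order differently; it reaches its minimum of 0 when both rankings agree.<br />

```python
from itertools import combinations

def ranking_disagreement(software_weights, wetlab_rank):
    """Number of linker pairs ordered differently by software weight and wet-lab rank.

    Both inputs map linker names to numbers, with smaller meaning better.
    This pair-counting objective is a hypothetical stand-in, not the function
    actually used for the calibration.
    """
    names = list(wetlab_rank)
    return sum(
        1 for a, b in combinations(names, 2)
        if (software_weights[a] - software_weights[b]) * (wetlab_rank[a] - wetlab_rank[b]) < 0
    )
```

The weighting constants would then be varied until this count is as small as possible for the screened linkers.<br />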
<br />
=Discussion=<br />
Will always be refined with more data from i@h...<br />
=References=<br />
[0] Wang, C.K.L., Kaas, Q., Chiche, L. & Craik, D.J. CyBase: A database of cyclic protein sequences and structures, with applications in protein discovery and engineering. Nucleic Acids Research 36, (2008).<br />
[1] Thornton, J.M. & Sibanda, B.L. Amino and carboxy-terminal regions in globular proteins. Journal of molecular biology 167, 443-460 (1983).<br />
<br />
</div>Igemnilshttp://2014.igem.org/Team:Heidelberg/pages/Linker_SoftwareTeam:Heidelberg/pages/Linker Software2014-10-17T22:39:57Z<p>Igemnils: /* Sorting out of paths */</p>
<hr />
<div>=Abstract=<br />
As already [https://2014.igem.org/Team:Heidelberg/Toolbox/Circularization introduced], artificially circularized proteins may gain some heat stability because the C- and N-termini are restrained from moving around freely. This circularization may be trivial when the protein termini are very close to each other, which seems to be reasonably common [[#References|[1]]]. However, if the ends are too far from each other, a long linker is needed to connect them. This linker should not change the natural conformation of the protein, and it should constrain the relative position of the ends so as to restrict their degrees of freedom and thus stabilize the structure even when heated. In addition, these linkers should not affect any of the protein's functions. Consequently, it is important to prevent linkers from passing through the active site or from covering binding domains for other molecules, for example. Therefore one needs to be able to define the shape of possible linkers. This section describes the software we developed to design such linkers. We would like to stress that this work was made possible by the feedback between computer modeling and experimental work: we could first design linkers in silico, test them experimentally and use the results to further calibrate the software. To our knowledge, this is the first time that such an approach has been used to custom-design rigid linkers with angles that connect protein extremities.<br />
<br />
=General procedure=<br />
In short, the software provides a weighted list of linkers to circularize any protein of interest with a known structure. These linkers are made of rigid alpha-helical segments connected at defined angles. In contrast to flexible linkers, these rigid linkers were expected to constrain the protein extremities and to confer better heat stability. A similar idea has already been developed [[#References|[2]]], but only with alpha helices forming simple rods and without any possibility of introducing angles. To generate the linkers, we first define the geometrical paths, made of segments and angles, that they should follow. The geometrical paths that are biologically feasible are then translated into amino acid sequences. Both the check of compatibility between paths and possible structures and the translation were made possible by our [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling approaches]. The first approach consisted of a statistical analysis of more than 17000 known non-homologous structures containing alpha helices connected at angles. For the second approach, we modeled the conformation of linkers circularizing proteins of known structure and analyzed them for certain properties. This second approach was run for a large number of proteins thanks to our distributed computing system [https://2014.igem.org/Team:Heidelberg/Software/igemathome igemathome]. The software outputs the possible linkers together with weights that rank them according to their capacity to maintain protein activity at higher temperatures. These weights were generated from an extensive [https://2014.igem.org/Team:Heidelberg/Project/Linker_Screening linker screening] on the target protein lambda-lysozyme, using the first modeling approach.<br />
<br />
=Background=<br />
Classically, protein linkers have been designed in three different ways. The easiest way is to define the length that a linker should cover and then simply use a flexible glycine-serine peptide with the right number of amino acids to match this length. Glycine is used for flexibility, as it has no side chain and does not produce any steric hindrance, while serine is used for solubility, as it has a small polar side chain. This solubility is important, as the linkers should not pass through the hydrophobic core of the protein but should be dissolved in the surrounding medium. Such flexible linkers are normally used for circularization, but also for connecting different proteins when the main goal is that the parts are connected, not how they are connected, or when the flexibility of the linker is required for a specific application. <br />
<br />
A second strategy consists in using rigid helical linkers to keep proteins or protein domains at a certain distance from each other. This is especially important for signalling proteins and fluorescent proteins. One major property of alpha helices is that they always fold in a defined way, with well-defined angles and lengths. There are also many different helical patterns that differ in stability and solubility. Although helices have been used to design circularizing linkers [[#References|[2]]], one big disadvantage of this strategy is that only straight linkers can be built with them. So, in the context of circularization, if the straight line connecting the protein extremities crosses the protein, this strategy is not an option.<br />
<br />
The third option, which served as the basis for our approach and which came out of discussions with the group of Rebecca Wade in Heidelberg, Germany, consists in designing custom-tailored linkers for each specific application. These linkers can be obtained from protein structure prediction. First, one needs to define the path that the linker should take to connect the protein ends. Then one designs a possible linker sequence that might fit well. Next, one predicts the structure of the linker attached to the protein to validate the design. Several different linkers, with slight changes, can be compared. This is repeated until the linker effectively follows the expected path. This method requires a strong knowledge of protein folding and protein structure prediction and is computationally intensive. On the other hand, the benefit can be substantial, as the interaction of the linker with the protein surface can be taken into account and as the path taken by the linker can be defined accurately, down to the resolution of the protein structure.<br />
<br />
We have set up a completely new strategy to design rigid linkers. As further detailed in the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling] part, it is possible to define the shape of a linker by combining rigid alpha-helical rods with well-defined angle patterns. Therefore, by defining geometrically the possible paths of the circularizing linkers for a given protein, we can then propose potential linkers. Defining the geometrical path can be very difficult, especially for large proteins with complex shapes. Moreover, the definition is further constrained by the fact that linkers must avoid hiding the active sites of the protein of interest. Finally, the paths have rotational degrees of freedom at the extremities of the protein and, depending on their orientation, they may or may not match the geometry of the protein. The tool we present here covers both steps: defining weighted geometrical paths and translating them into feasible linkers, also with weights. This tool is universal, as it can design circularizing linkers for any protein with a known structure. Moreover, it is modular, since thanks to our [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling approach] we have designed linkers as exchangeable blocks of rods of different lengths and of angle patterns. The following sections detail the different steps followed by our software to design proper linkers.<br />
<br />
<br />
==PDB parsing==<br />
First, the PDB file containing the structure of the target protein is parsed and the coordinates of the atoms are stored in metric units. After this, some initial tests are made on the protein structure. We first check whether the C- and N-termini lie on the surface of the protein and are accessible to the solvent, which is crucial for circularization. To do so, we define a line originating from an extremity of the protein by the two angles of spherical coordinates around the z-axis. We then determine the accessible angles by rejecting all the lines that come too close to the protein. As the future linker will be made of alpha helices and will therefore have a radius of 5 &Aring;, we use this length as the minimal allowed distance.<br />
Those allowed angles are stored for the coming linker generation. ###fig needed### <br />
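The parsing and the accessibility test can be illustrated with a minimal sketch. It is not the team's implementation: the function names, the 5 degree angular grid and the <code>skip_radius</code> parameter (which ignores atoms right next to the terminus) are assumptions made for illustration. Only the 5 &Aring; clearance and the fixed-column layout of PDB ATOM records are taken as given.<br />

```python
import math

def parse_pdb_atoms(pdb_text):
    """Return (x, y, z) tuples for every ATOM record, using PDB fixed columns."""
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            # Columns 31-38, 39-46 and 47-54 hold x, y and z in angstroms.
            atoms.append((float(line[30:38]), float(line[38:46]), float(line[46:54])))
    return atoms

def direction(theta_deg, phi_deg):
    """Unit vector for polar angle theta (from the z-axis) and azimuth phi."""
    t, p = math.radians(theta_deg), math.radians(phi_deg)
    return (math.sin(t) * math.cos(p), math.sin(t) * math.sin(p), math.cos(t))

def ray_is_free(origin, d, atoms, min_dist=5.0, skip_radius=6.0):
    """True if no atom lies closer than min_dist to the ray origin + s*d, s >= 0.

    Atoms within skip_radius of the origin are ignored (hypothetical choice),
    because the terminus is always surrounded by atoms of its own residue.
    """
    for a in atoms:
        v = tuple(a[i] - origin[i] for i in range(3))
        if math.sqrt(sum(c * c for c in v)) < skip_radius:
            continue
        s = max(0.0, sum(v[i] * d[i] for i in range(3)))  # projection onto the ray
        closest = tuple(origin[i] + s * d[i] for i in range(3))
        if math.dist(a, closest) < min_dist:
            return False
    return True

def accessible_angles(origin, atoms, step=5):
    """All (theta, phi) pairs on a step-degree grid whose ray is unobstructed."""
    return [(t, p) for t in range(0, 181, step) for p in range(0, 360, step)
            if ray_is_free(origin, direction(t, p), atoms)]
```

The returned list of angle pairs plays the role of the stored allowed angles used in the later path generation.<br />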
<br />
==Generation of geometric paths==<br />
As our strategy consists in building linkers from helical rods and connecting angles, a path is completely defined by the coordinates of its angle points. Advancing one step from an existing point is always done by adding a displacement vector to this point. This vector is defined by the two spherical angles, chosen here in a discrete manner with an increment of 5 degrees, and by a length, also chosen in a discrete manner. This discrete length is used in two different contexts: it may correspond to the length of an alpha helix or to the length of the flexible part at the extremity of the protein. The coordinates reached with this vector define a new angle point, regardless of whether the vector corresponds to an alpha helix or to a flexible part. ###figure, points + vectors###. As we screen all possible angles in a discrete manner, the coordinates of these angle points are regularly distributed on a sphere. As further detailed in the next sections, such spheres are defined from both ends, either once or in several steps. The software then checks, for each pair of angle points originating from the two extremities, whether a straight connection of one of the given lengths is possible. ###figure needed, with all these possible steps, one sphere around each point and checking for the connections, 2D, different lengths in two different figures, ###<br />
The linkers are built in a modular way, from blocks of well-defined size. From the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling] of potential linkers, we derived 8 different alpha-helical rods, all of different lengths. In addition, the length of the two segments inside an angle block is always 8 &Aring;, so exchanging angle blocks does not affect the length of the linker. This means that the distance between the angle points is well defined, an essential aspect of our linker design strategy.<br />
The software proceeds in three steps. First, it checks whether a direct linker made of a single alpha helix is possible. For this, it applies the procedure just described, with sphere radii that correspond to plausible lengths of the short flexible parts at the extremities of the protein. Second, it tests whether a linker containing two alpha helices connected at a right angle allows circularization. Finally, it searches for possible linkers with up to three angle points. The next parts explain these three steps in detail.<br />
This method was chosen because it could be implemented easily and efficiently in our program. However, it also generates paths that cross the protein. We therefore put considerable effort into sorting out the paths.<br />
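As an illustration of the displacement-vector construction, the following sketch enumerates the candidate angle points on one sphere. The function name and the handling of the poles are assumptions made for this example; only the 5 degree increment is taken from the description above.<br />

```python
import math

def sphere_points(center, radius, step_deg=5):
    """Candidate angle points at distance radius from center.

    The two spherical angles are scanned on a regular grid, theta in
    [0, 180] and phi in [0, 360), in increments of step_deg degrees.
    """
    points = []
    for theta in range(0, 181, step_deg):
        for phi in range(0, 360, step_deg):
            t, p = math.radians(theta), math.radians(phi)
            points.append((center[0] + radius * math.sin(t) * math.cos(p),
                           center[1] + radius * math.sin(t) * math.sin(p),
                           center[2] + radius * math.cos(t)))
            if theta in (0, 180):
                break  # at the poles every phi gives the same point
    return points
```

With a 5 degree grid this yields 2522 points per sphere and per length, which gives an idea of how quickly the number of paths grows when several steps are chained. Note that such a grid is regular in the angles, so the points are denser near the poles than at the equator.<br />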
<br />
===Step 1===<br />
As a simple rigid linker without any angle would be easier to design and likely more thermostable than one containing angles, the software first checks whether this simple solution is possible ###figure needed###. For this, we take into account the fact that proteins have some flexible amino acids at their extremities. This flexible part may come from the protein itself, but also from the 2 glycines that are included at the N-terminal end and from the extein at the C-terminal end. The latter two parts come from our linkers. These parts have no preferred angles and offer a large number of possibilities for inserting fitting linkers. But this flexibility is also a drawback, as we have to include this large number of possible angles and lengths in our path search.<br />
In this first step, the software explicitly takes these flexible parts into account when checking for the possibility of straight linkers. As the angles and the length of the flexible parts are variable, the software positions their extremities on a sphere centered on the last fixed position of the structure, as explained above. The radius of this sphere is incremented in a discrete manner, in 4 steps, from 5.25 &Aring; to the maximum length of the flexible part. ###figure###<br />
Then all possible straight segments between the sphere points around one terminus and those around the other are tested. If they come closer than 5 &Aring; to the protein, or if they cross it, they are rejected. If they are kept, the software checks whether their length is compatible with the feasible alpha helices: if the length of a given segment equals one of the 8 alpha helix lengths plus or minus 0.75 &Aring;, the path is saved.<br />
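The two tests of Step 1 can be sketched as follows. This is only an illustration: the 8 rod lengths in <code>HELIX_LENGTHS</code> are hypothetical placeholders, since the actual values come from the linker modeling and are not listed here, and the function names are invented for this example. The 5 &Aring; clearance and the 0.75 &Aring; tolerance are the values quoted above.<br />

```python
import math

# Hypothetical rod lengths in angstroms (placeholders, not the modeled values).
HELIX_LENGTHS = [9.0, 12.0, 15.0, 18.0, 21.0, 24.0, 27.0, 30.0]

def point_segment_distance(p, a, b):
    """Shortest distance between point p and the segment from a to b."""
    ab = [b[i] - a[i] for i in range(3)]
    ap = [p[i] - a[i] for i in range(3)]
    denom = sum(c * c for c in ab)
    # Parameter of the closest point, clamped to the ends of the segment.
    t = 0.0 if denom == 0 else max(0.0, min(1.0, sum(ap[i] * ab[i] for i in range(3)) / denom))
    closest = [a[i] + t * ab[i] for i in range(3)]
    return math.dist(p, closest)

def matching_helix(length, tol=0.75):
    """Return the rod length that fits within tol, or None if there is none."""
    for h in HELIX_LENGTHS:
        if abs(h - length) <= tol:
            return h
    return None

def straight_linker(start, end, atoms, min_dist=5.0):
    """Rod length for a direct start-end connection, or None if it is rejected."""
    if any(point_segment_distance(atom, start, end) < min_dist for atom in atoms):
        return None  # the segment crosses the protein or passes too close to it
    return matching_helix(math.dist(start, end))
```

In this sketch <code>start</code> and <code>end</code> would be one sphere point from each terminus, and the function would be called for every such pair.<br />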
===Step 2===<br />
The next possibility for designing more complex rigid linkers, while still taking the flexible ends into account within a reasonable calculation time, was to reduce the number of possible angles. As we originally thought that a 90° angle would be practically feasible, the software was designed to generate linkers with flexible ends and one 90° angle. ###figure, protein_one_90° angle### This choice was made notably because the edge lengths of a right triangle are simple to calculate. We already saw in Step 1 that the length of an edge can only take 8 different values. As the linkers have to start from the extremities of the protein, and as we impose a right angle, the number of possible paths is low, which makes them easy to compute. In practice, the extremities of the protein are positioned in a flexible way as in Step 1. From each of the positions allowed by this flexibility, the software searches for all the allowed right triangles. This restriction of the degrees of freedom was mainly needed to keep the calculations feasible.<br />
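The right-triangle search can be sketched with the Pythagorean relation: two rods of lengths a and b meeting at 90° connect two points whose distance d satisfies \( a^2 + b^2 = d^2 \). The code below is an illustrative sketch, not the original implementation. The rod lengths are hypothetical placeholders, and applying the 0.75 &Aring; tolerance to the hypotenuse is an assumption.<br />

```python
import math
from itertools import product

# Hypothetical rod lengths in angstroms (placeholders, not the modeled values).
HELIX_LENGTHS = [9.0, 12.0, 15.0, 18.0, 21.0, 24.0, 27.0, 30.0]

def right_angle_pairs(d, tol=0.75):
    """Pairs (a, b) of rod lengths whose right triangle has a hypotenuse close to d."""
    return [(a, b) for a, b in product(HELIX_LENGTHS, repeat=2)
            if abs(math.hypot(a, b) - d) <= tol]

def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1], u[2] * v[0] - u[0] * v[2], u[0] * v[1] - u[1] * v[0])

def unit(v):
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

def corner_points(start, end, a, b, step_deg=5):
    """Possible positions of the 90 degree angle point between start and end.

    For exact legs a and b the corner lies on a circle around the start-end
    axis, at a*a/h along the axis and with radius a*b/h, where h is the
    idealized hypotenuse. If the real distance differs from h (within the
    tolerance), the second leg is only approximately of length b.
    """
    h = math.hypot(a, b)
    u = unit(tuple(end[i] - start[i] for i in range(3)))
    helper = (1.0, 0.0, 0.0) if abs(u[0]) < 0.9 else (0.0, 1.0, 0.0)
    n1 = unit(cross(u, helper))  # first axis perpendicular to u
    n2 = cross(u, n1)            # second axis perpendicular to u and n1
    along, radius = a * a / h, a * b / h
    corners = []
    for psi in range(0, 360, step_deg):
        c, s = math.cos(math.radians(psi)), math.sin(math.radians(psi))
        corners.append(tuple(start[i] + along * u[i] + radius * (c * n1[i] + s * n2[i])
                             for i in range(3)))
    return corners
```

Each corner returned here would still have to pass the clearance tests before the corresponding path is kept.<br />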
===Step 3===<br />
Finally, the software can also find paths with up to 4 edges, meaning 4 alpha helices and 3 angles. Thanks to the modularity of the possible linkers, such paths can in theory circularize any kind of protein. ###figure of torus###<br />
To keep the calculation feasible in a reasonable time, we designed the search strategy so that the flexible parts at the extremities are oriented in the same direction as the adjacent alpha helix. This obviously restricts the search, but as these orientations are allowed for the flexible parts, the approach remains valid. <br />
First, potential end points of the first alpha-helical rod are calculated from the N-terminal point of the protein. The orientation is chosen in a discrete manner, with an increment of 5 degrees for the two angles of the spherical coordinates. The distance from the origin takes one of the 8 possible alpha helix lengths, as already seen in Step 2, or a length of 0, which mimics a linker with 3 instead of 4 edges. The exact same procedure is repeated to define all the potential end points of the second alpha-helical rod, starting from all the possible end points of the first one. Thanks to the possibility of a length of 0 for the first and the second rods, the software also calculates paths with 2 edges. Then, the same is done only once from the C-terminal point of the protein, defining 1 edge. The final step consists in checking whether the points originating from the N- and C-terminal points can be linked by a potential alpha helix, i.e. whether they are separated by an appropriate distance. If any of the potential alpha helix lengths lies within the distance between two points plus or minus 0.75 &Aring;, the path is saved. In the same way, if two points are closer than 0.75 &Aring; to each other, the path is also saved.<br />
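The search from both ends can be sketched as a small enumeration. This is an illustration under stated assumptions, not the original code: the rod lengths are hypothetical placeholders, the function names are invented, and a brute-force double loop is used for the final matching. With the real 5 degree grid this naive version would be far too slow, so the usage example below uses a coarse grid.<br />

```python
import math

# Hypothetical rod lengths in angstroms (placeholders, not the modeled values).
HELIX_LENGTHS = [9.0, 12.0, 15.0, 18.0, 21.0, 24.0, 27.0, 30.0]

def directions(step_deg):
    """Unit vectors on a regular grid of the two spherical angles."""
    dirs = []
    for theta in range(0, 181, step_deg):
        for phi in range(0, 360, step_deg):
            t, p = math.radians(theta), math.radians(phi)
            dirs.append((math.sin(t) * math.cos(p), math.sin(t) * math.sin(p), math.cos(t)))
            if theta in (0, 180):
                break  # the poles do not depend on phi
    return dirs

def grow(paths, dirs, lengths):
    """Extend every path by one rod; a zero length leaves the end point in place."""
    out = []
    for path in paths:
        last = path[-1]
        for length in lengths:
            if length == 0.0:
                out.append(path + [last])
                continue
            for d in dirs:
                out.append(path + [tuple(last[i] + length * d[i] for i in range(3))])
    return out

def find_paths(n_term, c_term, step_deg=5, tol=0.75):
    """Paths of up to 4 edges: two rods from the N-terminus, one from the C-terminus."""
    dirs = directions(step_deg)
    with_zero = [0.0] + HELIX_LENGTHS
    from_n = grow(grow([[n_term]], dirs, with_zero), dirs, with_zero)
    from_c = grow([[c_term]], dirs, HELIX_LENGTHS)
    found = []
    for pn in from_n:
        for pc in from_c:
            gap = math.dist(pn[-1], pc[-1])
            # Keep the pair if the gap fits a rod, or if the points coincide.
            if gap <= tol or any(abs(h - gap) <= tol for h in HELIX_LENGTHS):
                found.append(pn + pc[::-1])
    return found
```

For example, <code>find_paths((0.0, 0.0, 0.0), (30.0, 0.0, 0.0), step_deg=90)</code> runs quickly and returns paths that start at the N-terminal point and end at the C-terminal point. None of these paths has been checked against the protein yet, which is the purpose of the sorting step.<br />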
<br />
==Sorting out of paths==<br />
The previous part described the generation of paths that can connect the two extremities of the protein irrespective of the position of these paths relative to the protein. While this allows fast computation of the geometrical paths, it also implies that the paths that are not practically feasible need to be sorted out. This is the most time-consuming part of the computation, as about 1 billion paths are generated. Three criteria are considered for the sorting. The first one is the feasibility of the linker: can the software find angle patterns that correspond to the ones defined by the geometrical path? This question was part of the motivation for a large modeling effort to determine the possible angles between consecutive helices. This was achieved by analyzing the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling distribution of angles] between alpha helices found in the ArchDB database. As nearly any angle between 20 and 170 degrees could be found, only a few paths were actually rejected at this step. The next criterion is the position of the angle points: if one of them lies inside the protein, the path is rejected. Finally, if any atom of the protein is less than 5 &Aring; away from any of the alpha helices, the path is also rejected.<br />
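The first two sorting criteria can be sketched as a simple filter. This is a simplified illustration, not the original code. It assumes that the relevant angle is the interior angle at each angle point, and it approximates "inside the protein" by "closer than 5 &Aring; to any atom". Both are assumptions, as are the function names. The helix clearance test works like the segment distance test of Step 1 and is omitted here.<br />

```python
import math

def interior_angle(a, b, c):
    """Angle in degrees at point b between the segments b->a and b->c."""
    u = [a[i] - b[i] for i in range(3)]
    v = [c[i] - b[i] for i in range(3)]
    dot = sum(u[i] * v[i] for i in range(3))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(x * x for x in v))
    # Clamp to [-1, 1] to guard against rounding before taking the arc cosine.
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))

def path_is_feasible(points, atoms, min_angle=20.0, max_angle=170.0, min_dist=5.0):
    """Check the angle range and the position of the angle points of one path.

    points must be distinct consecutive points (no zero-length segments).
    """
    for a, b, c in zip(points, points[1:], points[2:]):
        if not min_angle <= interior_angle(a, b, c) <= max_angle:
            return False
    for p in points[1:-1]:  # angle points only, not the two termini
        if any(math.dist(p, atom) < min_dist for atom in atoms):
            return False
    return True
```

Because about 1 billion paths have to pass through such a filter, cheap tests like the angle range would sensibly be evaluated before the more expensive distance tests.<br />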
<br />
==Shifting paths to the patterns==<br />
<br />
Because of rounding errors and other inaccuracies, the generated paths may not fit the helix and angle patterns perfectly. Therefore, the paths need to be refined before they are translated into amino acid sequences. This is also done before the weighting, so that the paths no longer change after weighting.<br />
To do so, the distances between the different points are calculated and the points are shifted just far enough to fit the patterns. The shifts never exceed a certain length, so that no path passes through the protein after refinement if it did not before.<br />
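The refinement procedure is not described in more detail, so the following is only one possible reading of it: a greedy sketch that rescales each segment in turn to the nearest allowed rod length and gives up if any point would move too far. The rod lengths and the <code>max_shift</code> bound of 1 &Aring; are hypothetical placeholders.<br />

```python
import math

# Hypothetical rod lengths in angstroms (placeholders, not the modeled values).
HELIX_LENGTHS = [9.0, 12.0, 15.0, 18.0, 21.0, 24.0, 27.0, 30.0]

def snap_path(points, max_shift=1.0):
    """Shift the points so every segment has an allowed length, or return None.

    Segments are processed from the first point onwards. Each keeps its
    direction and is rescaled to the nearest rod length. The path is rejected
    if any point ends up more than max_shift away from where it started.
    """
    snapped = [points[0]]
    for prev_old, cur_old in zip(points, points[1:]):
        seg = [cur_old[i] - prev_old[i] for i in range(3)]
        length = math.sqrt(sum(c * c for c in seg))
        target = min(HELIX_LENGTHS, key=lambda h: abs(h - length))
        new = tuple(snapped[-1][i] + seg[i] / length * target for i in range(3))
        if math.dist(new, cur_old) > max_shift:
            return None
        snapped.append(new)
    return snapped
```

Bounding the shift is what ensures that a refined path stays close to the path that has already been checked against the protein.<br />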
==Weighting of paths==<br />
Before the paths are translated into sequences and thus into expressible linkers, the quality of each path needs to be assessed. We have identified four contributions that determine how well a path will enhance the heat stability of a target protein by circularization. In the end, all these contributions are summed to obtain one final value for the quality of the connection. For each single contribution, a lower weighting value always refers to a better path. All contributions need to be normalized by some property of the protein, so that each contribution on its own is independent of the target protein and generality is achieved.<br />
The first contribution is the length of the linker. We assumed that a shorter linker is normally better at restraining the ends, as a longer helix might give them more flexibility. Because we want this value to be independent of the size of the protein, the length of the linker is normalized by the distance between the two termini. This prevents the length contribution from exceeding all other contributions to the weighting function merely because the two termini are far from each other and a longer linker is clearly needed.<br />
The next value is a measure of the quality of the angles used in the linker. Each of our potential angle patterns produces angles that are distributed around a mean with some standard deviation. The narrower this distribution, the likelier it is that the pattern will actually produce the assumed angle between the flanking helices. Path angles that fit the provided angle patterns perfectly therefore get a lower weighting value.<br />
Then the distance from the protein is taken into account. Because the linkers should not disturb the protein in its normal environment, linkers that pass close to the protein's surface are considered better. Of course, no linker may come so close to the surface that it interacts too strongly with the protein. This value is normalized by the minimal distance a linker should keep from the surface.<br />
Finally, the regions a linker should avoid are taken into account. Each protein binds some ligands or substrates. The user can specify how large they are and where they attach. If a linker passes through a potential ligand, this value goes to infinity, so that the linker is discarded. Otherwise, the farther a linker is from a potential ligand, the better this value gets. The user can also specify the importance of certain regions. In the end, the total weighting is independent of the number of binding sites.<br />
<br />
===Calibrating the weighting function===<br />
Every contribution has its own distribution. An example is shown in figure ### [[figure histogram_length_lys.png]]; the distributions of the other contributions have different shapes. The aim is to find the paths that minimize all of these contributions. Therefore, in the weighting function, the four contributions mentioned above are combined linearly:<br />
\[ W(p) = \alpha L(p) + \beta A(p) + \gamma D(p) + \delta u(p) \]<br />
where W is the final weighting, p the path, L the length contribution, A the angle contribution, D the distance contribution and u the contribution from the forbidden regions. $\alpha, \beta, \gamma, \delta$ are the weighting constants that needed to be determined. The normalization of each contribution was chosen so that all of them are dimensionless and have reasonably similar values.<br />
The weighting constants were obtained from the [[linker-screening ###lin]] performed with lysozyme and from the [[enzyme-modeling ###link]].<br />
Please see [[below ###link]] for a detailed explanation of how the values were obtained.<br />
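The linear combination \( W(p) = \alpha L(p) + \beta A(p) + \gamma D(p) + \delta u(p) \) can be illustrated with a short sketch. The class and function names are invented for this example, the default constants of 1.0 are placeholders and not the calibrated values, and the length contribution is assumed to be a simple ratio.<br />

```python
import math
from dataclasses import dataclass

@dataclass
class Contributions:
    """The four normalized, dimensionless contributions of one path."""
    length: float     # L(p)
    angle: float      # A(p)
    distance: float   # D(p)
    forbidden: float  # u(p), infinite if the path crosses a ligand region

def length_contribution(path_length, termini_distance):
    """Linker length normalized by the distance between the termini (assumed ratio)."""
    return path_length / termini_distance

def weight(c, alpha=1.0, beta=1.0, gamma=1.0, delta=1.0):
    """W(p) as a linear combination; lower values mean better paths."""
    return (alpha * c.length + beta * c.angle
            + gamma * c.distance + delta * c.forbidden)

def rank(paths):
    """Sort (name, Contributions) pairs by weight and drop discarded paths."""
    scored = [(name, weight(c)) for name, c in paths]
    return sorted((s for s in scored if math.isfinite(s[1])), key=lambda s: s[1])
```

A path that crosses a forbidden region gets an infinite weight and therefore never appears in the ranking, whatever its other contributions are.<br />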
==Translating paths to sequence==<br />
As mentioned above, the software is provided with two databases, one for the possible angle patterns and one for the helix patterns. The choice of the patterns was inspired by known crystal structures extracted from databases and described in different papers.<br />
A large in silico screening to refine the preferences of the patterns was then set up using the [[iGEM@home|###Link]] system. For a complete description of the search for suitable patterns, see the [[modeling|###Link to patternspart]] page. <br />
All the possible paths are now split at the angles and compared with the possible patterns in the databases. ###Figure needed to explain, how the path is translated### The most suitable patterns are identified and concatenated to build the path's sequence. It is important to note that this is only possible because of the modularity of the linker patterns we use as building blocks: each block, whether an alpha helix or an angle pattern, is unaffected by the others. Thus, for each possible path, one sequence is produced.<br />
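The block-wise translation can be sketched as a lookup followed by a concatenation. The two dictionaries below stand in for the pattern databases. Their keys and the bracketed tokens are purely hypothetical placeholders, not real amino acid patterns, and choosing the nearest entry is an assumed selection rule.<br />

```python
# Hypothetical stand-ins for the two pattern databases: rod length in
# angstroms -> helix block, and angle in degrees -> angle block.
HELIX_PATTERNS = {9.0: "[H9]", 12.0: "[H12]", 15.0: "[H15]"}
ANGLE_PATTERNS = {60.0: "[A60]", 90.0: "[A90]", 120.0: "[A120]"}

def translate(segment_lengths, angles):
    """Concatenate the best-fitting helix and angle blocks along a path."""
    if len(angles) != len(segment_lengths) - 1:
        raise ValueError("a path with n segments has n - 1 angles")
    blocks = []
    for i, length in enumerate(segment_lengths):
        nearest = min(HELIX_PATTERNS, key=lambda h: abs(h - length))
        blocks.append(HELIX_PATTERNS[nearest])
        if i < len(angles):
            nearest_angle = min(ANGLE_PATTERNS, key=lambda a: abs(a - angles[i]))
            blocks.append(ANGLE_PATTERNS[nearest_angle])
    return "".join(blocks)
```

Because each block is looked up independently of its neighbours, exchanging one block never changes the others, which mirrors the modularity described above.<br />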
==Clustering of paths==<br />
Many different paths are represented by the same sequence [[###Figure that shows, different paths have same properties, already before in the text ###]], and we therefore clustered such paths. The weights of the clustered paths were then calculated by averaging the weights of the individual paths that make up a cluster.<br />
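The clustering amounts to grouping paths by their sequence and averaging the weights, as in this minimal sketch. The function name and the (sequence, weight) input format are assumptions made for this example.<br />

```python
from collections import defaultdict

def cluster_by_sequence(paths):
    """Group (sequence, weight) pairs by sequence and average the weights.

    Returns (sequence, mean weight, number of paths) tuples, best first.
    """
    groups = defaultdict(list)
    for sequence, w in paths:
        groups[sequence].append(w)
    clusters = [(seq, sum(ws) / len(ws), len(ws)) for seq, ws in groups.items()]
    return sorted(clusters, key=lambda c: c[1])
```

Keeping the number of paths per cluster is an addition of this sketch. It shows how many geometrical paths support a given sequence.<br />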
<br />
=Results=<br />
=Overview=<br />
The running stability of the software was checked in a large-scale test on [[iGEM@home]]. <br />
Furthermore, our software provided linkers for circularizing [[DNMT1| ###]], which could be made more heat-stable by circularization???.<br />
For detailed information on the implementation and the practical use of the software, please see the [[documentation software-docu]] page.<br />
==What does the software do==<br />
still missing [[figure 0, graph abstract]]<br />
==PDB analysis==<br />
At first, the PDB file containing the structure of the target protein is parsed and the coordinates of the atoms are stored, in the metric unit.After this, some initial tests are made with the protein structure. First, we checked whether the C- and N-termini lie on the surface of the protein and are accessible to the solvent, which is crucial for circularization. We defined a line originating from an extremity of the protein with the two angles of the spherical coordinates around the z-axis. From that, we could determine the accessible angles by rejecting all the lines that are too close to the protein. As the future linker will be made of alpha helices and will therefore have a radius of 5 &Aring;, we used this length as the minimal allowed distance.<br />
Those allowed angles are stored for the coming linker generation. . ###fig needed### <br />
==Generation of geometric paths==<br />
As our strategy consists in building linkers with helical rods and connecting angles, a path is completely defined by the coordinates of the angle points. Advancing one step from an existing point is always done by adding a displacement vector on this point. This vector is defined by the two spherical angles, chosen here in a discrete manner with an increment of 5 degrees, and by a length, also chosen in a discrete manner. This discrete length was used in two different contexts: it may correspond to the length of an alpha helix or to the length of the flexible part that appears at the extremity of the protein. The coordinates that are reached thanks to this vector defines the new coordinates of an angle point, no matter if the vector corresponds to an alpha helix or to a flexible part. ###figure, points + vectors###. As we screen for all possible angles in a discrete manner, those angle points coordinates are regularly distributed on a sphere. As further detailed in the next sections, those spheres are defined from both ends, either once or in several steps. Then the software checks for possible straight connections of given lengths for each pair of angle points originated from both extremities. ###figure needed, with all these possible steps, one sphere around each point and checking for the connections, 2D, different lengths in two different figures, ###<br />
The linkers are built in a modular way, with blocks of well-defined size. From the modeling of potential linkers [link], we could derive 8 different alpha helical rods, all with different lengths. In addition, the length of the two segments inside an angle block was always 8 &Aring;, so exchanging angle blocks does not affect the length of the linker. This means that the distance between the angle points is well defined, an essential aspect of our strategy of linker design.<br />
The software proceeds in three steps. First, it checks for the possibility of a direct linker made of a single alpha helix. For this, it applies the procedure just mentioned with spheres whose radius reasonably corresponds to the length of the short parts at the extremities of the protein. Second, it tests whether a linker containing two alpha helices connected by a right angle allows the circularization. Finally, it searches for possible linkers with three angle points. The next parts explain those three steps in detail.<br />
This method was chosen because it could be implemented easily and efficiently in our program. However, this strategy also generates paths that cross the protein. We therefore put considerable effort into sorting out the paths.<br />
===Step 1===<br />
As a simple rigid linker with no angle would be easier to design and likely more thermostable than the ones containing angles, the software first checks if this simple solution is possible ###figure needed###. For this, we took into account the fact that proteins have some flexible amino acids at their extremities. This flexible part may come from the protein itself, but also from the 2 glycines that are included at the N-terminal part and from the extein at the C-terminal part. These two latter parts come from our linkers. The flexible parts have no preferential angles and offer a large number of possibilities to insert fitting linkers. But this flexibility is also a drawback, as we have to include this large number of possible angles and lengths in our path search.<br />
In this first step, the software explicitly takes these flexible parts into account to check for the possibility of straight linkers. As the angles and the length of the flexible parts are variable, the software positions their extremity on a sphere centered on the last fixed position of the structure, as explained above. The radius of this sphere is incremented in a discrete manner, in 4 steps, from 5.25 &Aring; to the maximum length of the flexible part. ###figure###<br />
Then all possible straight segments between the first points and the last points, i.e. the points generated around the two extremities, are tested. If they are closer than 5 &Aring; to the protein, or if they cross it, they are rejected. If they are kept, the software checks whether the length of the segments is compatible with the feasible alpha helices: if the length of a given segment equals one of the 8 alpha helix lengths plus or minus 0.75 &Aring;, the path is saved.<br />
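The length test at the end of Step 1 can be sketched like this. The 8 rod lengths in <code>HELIX_LENGTHS</code> are placeholder values chosen only for illustration (the real lengths come from the modeling), and the function name is hypothetical; only the tolerance of 0.75 &Aring; is taken from the text.<br />

```python
import math

# Hypothetical rod lengths in angstroms, NOT the values used by the software.
HELIX_LENGTHS = [10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0, 45.0]
TOLERANCE = 0.75  # angstroms, as stated in the text


def matching_helices(first_point, last_point):
    """Return the helix lengths that fit the segment between the two points."""
    segment_length = math.dist(first_point, last_point)
    return [length for length in HELIX_LENGTHS
            if abs(segment_length - length) <= TOLERANCE]
```

A segment is kept as a straight path only if this list is not empty and the segment also passes the distance check against the protein.<br />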
===Step 2===<br />
The next possibility to design more complex rigid linkers, while still taking flexible ends into account within a reasonable calculation time, was to reduce the number of possible angles. As we originally thought that a 90° angle would be practically feasible, the software was designed to generate linkers with flexible ends and one 90° angle. ###figure, protein_one_90° angle### This choice was notably made because the edge lengths of a right triangle are simple to calculate. We already saw in Step 1 that the length of an edge can only take 8 different values. As the linkers have to start from the extremities of the protein, and as we impose a right angle, the number of possible paths is low, making them easy to compute. In practice, the extremities of the protein are positioned in a flexible way as in Step 1. From each of the positions allowed by this flexibility, the software searches for all the allowed right triangles. This restriction was mainly introduced because the degrees of freedom needed to be limited to keep the calculations feasible. <br />
===Step 3===<br />
Finally, the software can also find paths with up to 4 edges, meaning 4 alpha helices and 3 angles. Thanks to the modularity of the possible linkers, such paths can theoretically circularize any kind of protein. ###figure of torus###<br />
To keep the calculation feasible in a reasonable time, we designed the search strategy so that the flexible parts at the extremities are oriented in the same direction as the adjacent alpha helix. This obviously restricts the search, but as these orientations are allowed for the flexible parts, the approach remains valid. <br />
First, potential ending points of the first alpha helical rod are calculated from the N-terminal point of the protein. The orientation is chosen in a discrete manner, with an increment of 5 degrees for the two angles of the spherical coordinates. The distance from the origin corresponds to the 8 possible lengths allowed by the alpha helices, as already seen in Step 2, plus a length of 0, which mimics a linker with 3 instead of 4 edges. The exact same procedure is repeated to define all the potential ending points of the second alpha helical rod, starting from all the possible ending points of the first alpha helical rod. Thanks to the possibility of a length of 0 for the first and the second rods, the software also calculates paths with 2 edges. Then, the same is done only once from the C-terminal point of the protein, defining 1 edge. The final step consists in checking whether the points originating from the N- and C-terminal points can be linked by a potential alpha helix, i.e. whether they are separated by an appropriate distance. If any of the potential alpha helix lengths lies within the distance between two points plus or minus 0.75 &Aring;, the path is saved. In the same way, if two points are closer than 0.75 &Aring; to each other, the path is also saved.<br />
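The search of Step 3 can be sketched as a brute-force enumeration. This is an illustration under assumptions, not our implementation: the rod lengths are placeholder values, all function names are hypothetical, and the plain nested loops ignore every optimization a real run would need.<br />

```python
import math

# Hypothetical rod lengths in angstroms; the real set of 8 comes from the modeling.
HELIX_LENGTHS = [10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0, 45.0]
TOLERANCE = 0.75  # angstroms, as stated in the text


def directions(step_deg):
    """Unit vectors on a grid of the two spherical angles."""
    result = []
    for theta_deg in range(0, 181, step_deg):
        # At the poles every azimuth gives the same direction, so keep one.
        phi_values = [0] if theta_deg in (0, 180) else range(0, 360, step_deg)
        for phi_deg in phi_values:
            theta, phi = math.radians(theta_deg), math.radians(phi_deg)
            result.append((math.sin(theta) * math.cos(phi),
                           math.sin(theta) * math.sin(phi),
                           math.cos(theta)))
    return result


def rod_ends(start, lengths, dirs):
    """All end points of one rod leaving `start`; a length of 0 keeps the point."""
    ends = []
    for length in lengths:
        if length == 0:
            ends.append(start)
            continue
        for d in dirs:
            ends.append(tuple(s + length * c for s, c in zip(start, d)))
    return ends


def connects(p, q):
    """True if p and q nearly coincide or are one helix length apart."""
    gap = math.dist(p, q)
    return gap < TOLERANCE or any(
        abs(gap - length) <= TOLERANCE for length in HELIX_LENGTHS)


def search_paths(n_term, c_term, step_deg=5):
    """Two rods from the N-terminus, one from the C-terminus, then the link."""
    dirs = directions(step_deg)
    paths = []
    for p1 in rod_ends(n_term, [0.0] + HELIX_LENGTHS, dirs):
        for p2 in rod_ends(p1, [0.0] + HELIX_LENGTHS, dirs):
            for q in rod_ends(c_term, HELIX_LENGTHS, dirs):
                if connects(p2, q):
                    paths.append((n_term, p1, p2, q, c_term))
    return paths
```

For a quick trial a coarse grid such as <code>step_deg=90</code> keeps the run short; at 5 degrees the nested loops of this sketch become far too large to run naively, which illustrates why the sorting and the distributed computing matter.<br />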
==Sorting out of paths==<br />
The previous part described the generation of paths that can connect the two extremities of the protein irrespective of the position of these paths relative to the protein. While this allows fast computation of the geometrical paths, it also implies that the paths that are not practically feasible need to be sorted out. This is the most time-consuming part of the computation, as about 1 billion paths are generated. Three criteria are considered for the sorting. The first one is the feasibility of the linker: can the software find angle patterns that correspond to the ones defined by the geometrical path? This question was part of the motivation for a large modeling effort (link) to determine the possible angles between consecutive alpha helices. This was achieved by analyzing the distribution of angles between alpha helices found in the ArchDB database (link). As nearly any angle between 20 and 170 degrees could be found, only few paths were actually rejected at that step. The next criterion was the position of the angle points: if they appear inside the protein, the path is rejected. Finally, the software checks whether any of the atoms of the protein is less than 5 &Aring; away from any of the alpha helices; if so, the path is also rejected.<br />
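The last criterion is a point-to-segment distance test. A minimal sketch, assuming each alpha helix is treated as the straight segment between two consecutive angle points and that every atom is tested against every segment (the function names and this brute-force loop are illustrative assumptions):<br />

```python
import math


def point_segment_distance(p, a, b):
    """Shortest distance between point p and the segment from a to b."""
    ab = [bi - ai for ai, bi in zip(a, b)]
    ap = [pi - ai for ai, pi in zip(a, p)]
    ab_squared = sum(c * c for c in ab)
    if ab_squared == 0.0:
        return math.dist(p, a)  # degenerate segment of length 0
    # Projection parameter of p on the line, clamped to stay on the segment.
    t = sum(x * y for x, y in zip(ap, ab)) / ab_squared
    t = max(0.0, min(1.0, t))
    closest = [ai + t * c for ai, c in zip(a, ab)]
    return math.dist(p, closest)


def path_is_clear(angle_points, atoms, min_distance=5.0):
    """True if no atom lies closer than min_distance to any segment of the path."""
    segments = zip(angle_points, angle_points[1:])
    return all(point_segment_distance(atom, a, b) >= min_distance
               for a, b in segments for atom in atoms)
```

Paths for which <code>path_is_clear</code> returns False would be rejected in this sketch.<br />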
==Shifting paths to the patterns==<br />
The strategy described in Step 3 gives a certain freedom to the rod that connects the last two angle points generated from the N- and C-terminal points. As this freedom is actually not permitted by the alpha helix and the angle pattern, but is permitted by the flexible part, for example at the C-terminal end, the software slightly refines the path by rotating the segment that originates from the C-terminal point.<br />
==Weighting of paths==<br />
Before translating the paths into sequences and thus into linkers that can be expressed, each path needs to be evaluated for its potential capacity to enhance heat stability. For this we have identified different contributions that should be combined into one value that defines how good the linker may be and that consequently defines its ranking among all possible linkers. The smaller this value, the more we expect that the linker will enhance thermostability. An important step for all these contributions is the normalization, as explained in the next paragraphs.<br />
The first contribution we considered was the linker length. We assumed that a short linker is better to constrain the protein extremities, and that a long helix might give more flexibility. Because we wanted this value to be independent of the size of the protein, the length of the linker is normalized to the distance between the two termini.<br />
The second contribution relates to the angles used in the linker. We learned from the [[modeling ###link###]] that angles formed by a certain angle pattern follow a certain distribution. First, we assumed that the narrower the distribution, the more likely the alpha helices would actually produce this angle. Second, the angles found by the software should be as close as possible to those well-defined angles. In this case, the weight value from this contribution should be low.<br />
Then the distance of the linker to the protein is taken into account. Because the linkers should not disturb the protein in its normal environment, linkers that pass close to the protein surface are considered better linkers. The distance was defined as the minimal distance between the linker and all the atoms of the protein. As already mentioned for the sorting of the paths, a linker cannot come closer than 5 &Aring; and this distance was used for normalization of calculated distances.<br />
After this, the places a linker should avoid are calculated. Each protein can interact with other molecules on some parts of its surface. The user can specify where and how big those parts are. If a linker passes in front of a potential binding domain, the value of the corresponding path goes to infinity, so that the linker is discarded. Conversely, the farther a linker is from a potential ligand binding domain, the smaller its weighting value. The user can also specify the importance of certain regions. In the end, the total weighting is normalized to the number of binding domains.<br />
===Calibrating the weighting function===<br />
Every contribution has its own distribution. An example is shown in figure ### [[figure histogram_length_lys.png]], but all of them have different shapes. The aim is to find the paths that minimize all of these contributions. Therefore, in the weighting function the four contributions mentioned above are combined in a linear manner:<br />
\[ W(p) = \alpha L(p) + \beta A(p) + \gamma D(p) + \delta u(p) \]<br />
where W is the final weighting, p the path, L the length contribution, A the angle contribution, D the distance contribution and u the contribution from the forbidden regions. $\alpha, \beta, \gamma, \delta$ are the weighting constants that needed to be found. The normalization of each contribution was chosen so that all of them are dimensionless and have reasonably similar values.<br />
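The linear combination and the resulting ranking can be sketched as follows. The default constants of 1.0 are placeholders and not the calibrated values, and the function names are hypothetical; a path crossing a forbidden region carries an infinite u(p) and is dropped.<br />

```python
import math


def total_weight(length, angle, distance, forbidden,
                 alpha=1.0, beta=1.0, gamma=1.0, delta=1.0):
    """W(p) = alpha*L(p) + beta*A(p) + gamma*D(p) + delta*u(p)."""
    return alpha * length + beta * angle + gamma * distance + delta * forbidden


def rank_paths(contributions):
    """Sort path names by increasing weight and discard infinite weights.

    `contributions` maps a path name to its tuple (L, A, D, u).
    """
    weights = {name: total_weight(*values)
               for name, values in contributions.items()}
    finite = {name: w for name, w in weights.items() if math.isfinite(w)}
    return sorted(finite, key=finite.get)
```

The smallest weight comes first, in line with the convention that a smaller value means a better expected thermostability.<br />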
The weighting constants were obtained from the [[linker-screening ###lin]] performed with lysozyme and the [[enzyme-modeling ###link]].<br />
Please see [[below ###link]] for a detailed explanation of how the values were obtained.<br />
==Translating paths to sequence==<br />
As already mentioned, the software is provided with two databases, one for the possible angle patterns and one for the helix patterns. The choice of the patterns was inspired by known crystal structures extracted from databases and described in different papers.<br />
A large in silico screening for refining the preferences of the patterns was then set up using the [[iGEM@home|###Link]] system. For the complete description of the search for suitable patterns, see the [[modeling|###Link to patternspart]] page. <br />
All the possible paths are now split up at the angles and compared with the possible patterns in the databases. ###Figure needed to explain, how the path is translated### The most suitable patterns are identified and joined together to build the path's sequence. It is important to note that this is only possible because of the modularity of the linker patterns used as building blocks: each block, be it an alpha helix or an angle pattern, is not affected by the others. Thus, for each possible path, one sequence is produced.<br />
==Clustering of paths==<br />
Many different paths are represented by the same sequence [[###Figure that shows, different paths have same properties, already before in the text ###]] and we therefore clustered such paths. The weights for those clustered paths were then calculated by averaging the weights of the different paths that compose a cluster.<br />
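The clustering amounts to grouping paths by their sequence and averaging. A minimal sketch, assuming each path has already been reduced to a pair of sequence and weight (the function name is illustrative):<br />

```python
from collections import defaultdict


def cluster_weights(weighted_paths):
    """Group (sequence, weight) pairs by sequence and average the weights."""
    clusters = defaultdict(list)
    for sequence, weight in weighted_paths:
        clusters[sequence].append(weight)
    return {sequence: sum(weights) / len(weights)
            for sequence, weights in clusters.items()}
```

Each linker sequence thus ends up with a single weighting value, whatever the number of paths it can follow.<br />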
=Results=<br />
==DNMT1==<br />
A major motivation of our effort to design rigid linkers with angles was the circularization of the DNA methyltransferase Dnmt1 (link). The truncated form used in our project is composed of 900 amino acids, and the N- and C-terminal extremities are well separated. To circularize it, two linkers were designed: a flexible one made of glycine and serine, and a rigid one designed by the software. The rigid linkers for DNMT1 were obtained from an early state of the software. At that time the calculation took 11 days on a laptop computer with an Intel i5 processor and 8 GB of RAM, which shows the importance of a distributed computing system for large proteins. Since then, the software has improved considerably, reducing the calculation time to about 1 day for DNMT1.<br />
IMPORTANT!<br />
paths-> geometrical<br />
linkers-> helices...<br />
sequences-> seq of linkers<br />
insist on huge number of data produced<br />
We definitely need to cite [0]!!!<br />
=Abstract=<br />
=Overview=<br />
As already introduced [###link to circularization idea in the toolbox], artificially circularized proteins may gain some heat stability because the C- and N-termini are restrained from moving around freely. This circularization may be trivial when the protein termini are very close to each other, which seems to be reasonably common [[#References|[1]]]. However, if the ends are too far from each other, a long linker is needed to connect them. This linker should not change the natural conformation of the protein and should constrain the relative position of the ends to restrict the degrees of freedom and thus stabilize the structure even when heated. In addition, these linkers should not affect any of the protein functions. Consequently, it is important to prevent linkers from passing through the active site or from covering binding domains for other molecules, for example. Therefore one needs to be able to define the shape of possible linkers. This section describes the software we developed to design such linkers. We would like to stress that this work has been made possible by the feedback between computer modeling and experimental work: we could first design linkers in silico, test them experimentally and use the results to further calibrate the software. To our knowledge, this is the first time that such an approach has been used to custom-design linkers connecting protein extremities.<br />
In short, the software provides a weighted list of linkers to circularize any protein of interest with a known structure. Those linkers are made of rigid alpha-helical segments connected with defined angles. Contrary to flexible linkers, those rigid linkers were expected to constrain the protein extremities and to confer better heat stability. To generate those linkers, we first defined the geometrical paths, with segments and angles, that they should follow. The geometrical paths that are biologically feasible are afterwards translated into amino acid sequences. Both the compatibility of paths with possible structures and the translation were made possible by our modeling approaches. The first approach consisted in performing a statistical analysis of more than 17000 known non-homologous structures containing alpha helices connected with angles. For the second approach, we modeled the conformation of linkers circularizing proteins of known structure and analyzed them for certain properties. This second approach was run for a large number of proteins thanks to our distributed computing system igemathome. The software provides different possible linkers with weights that rank the linkers according to their capacity to maintain protein activity at higher temperatures. These weights were generated in an extensive [[linker screening| ### ]] on the target protein lambda-lysozyme, using the first modeling approach.<br />
The software was checked for running stability in a large-scale test on [[iGEM@home]]. <br />
Furthermore our software provided linkers for circularizing [[DNMT1| ###]], that could be made more heatstable due to circularization???.<br />
For detailed information on the implementation and the practical use of the software, please see the [[documentation software-docu]] page.<br />
==What does the software do==<br />
still missing [[figure 0, graph abstract]]<br />
=Background=<br />
What are linkers, why should ours work better?<br />
Linkers, References to the modeling, <br />
Classically, protein linkers have been designed in three different ways. ###REFERENCE### The easiest way is to define the length that a linker should cover and then simply use a flexible glycine-serine peptide with the right number of amino acids to match this length. Glycine is used for flexibility, as its side chain is only a hydrogen atom and does not produce any steric hindrance, while serine is used for solubility, as it has a small polar side chain. This solubility is important, as the linkers should not pass through the hydrophobic core of the protein, but should be dissolved in the surrounding medium. These flexible linkers are normally used for circularization but also for connecting different proteins when the important aspect is that the different parts are connected, not how they are connected, or when the flexibility of the linker is required for specific applications.<br />
A second strategy consists in using rigid helical linkers to keep proteins or protein domains at a certain distance from each other. This is especially important for signalling proteins and fluorescent proteins. ###TODO: Reference### One major property of alpha helices is that they always fold in a defined way with well-defined angles and lengths. There are also many different helical patterns that differ in stability and solubility. One big disadvantage of this strategy is that one can only build straight linkers with helices. So in the context of circularization, if the straight line connecting the protein extremities crosses the protein, this strategy is not an option.<br />
The third option, which served as a basis for our approach, consists in designing custom-tailored linkers for each specific application. These linkers can be obtained from protein structure prediction. First, one needs to define the path that the linker should take to connect two amino acids. Afterwards, one designs a possible linker sequence that might fit well. Next, one makes a structure prediction of the linker attached to the proteins to validate the design. Several different linkers, with slight changes, can be compared. This is repeated several times until the linker effectively follows the expected path. ###TODO: Reference, WADE paper### This method is time-consuming, as it is not only computationally intensive but also requires a strong knowledge of protein folding and protein structure prediction. On the other hand, the benefit can be significant, as the interaction of the linker with the protein's surface can be taken into account and as one can accurately define the path taken by the linker down to the resolution of the protein structure.<br />
We have set up a completely new strategy to design rigid linkers. As further detailed in the [[###link###|modelling]] part, it is possible to define the shape of a linker by combining rigid alpha helical rods with well-defined angle patterns. Therefore, by defining, in a geometrical way, the possible paths of the circularizing linkers for a given protein, we can then propose potential linkers. This definition of the geometrical path can be very difficult, especially for large proteins with complex shapes. Moreover, this definition is further constrained by the fact that linkers must avoid hiding active sites of the protein of interest. Finally, the paths have rotational degrees of freedom at the extremities of the protein, and depending on their orientation, they may or may not match the geometry of the protein. The tool we present here covers the two steps: defining geometrical paths with some weights and translating them into feasible linkers, also with weights. This tool is universal, as it has the capacity to design circularizing linkers for any protein with a known structure. Moreover, it is modular, as, thanks to our modeling approach [link], we have designed linkers as exchangeable blocks of rods of different lengths and of angle patterns. The following sections detail the different steps followed by our software to design proper linkers.<br />
But even taking all this into account, one could never also consider all the alternative paths that the same sort of linker is able to take because of the rotational degrees of freedom inherent in the linker model. ###figure needed###<br />
Thus we decided to provide the scientific community with a powerful open-source software that, for every protein with a given structure, can calculate the sequence of the linker needed to circularize the protein with a rigid linker and with minimal inhibition of the protein's function. <br />
Until now, each scientist had to estimate the length of the linker themselves, ###check for flexlinker### so our software is a completely novel approach to circularization.<br />
=General procedure=<br />
Our general approach: rods and angles are chosen in a discrete manner, so there are only finitely many possibilities to connect the ends, and all of these are checked. <br />
I think this paragraph is just too much, as anyway there was already an overview before.<br />
First, the protein structure is analysed in a geometrical way: paths composed of no more than four straight segments connected by angles are computationally generated. A path is always represented by straight lines and the connecting angles between them. ###In the end all these paths should be sorted by how good it would be to circularize the protein using this path.### But as the final weighting is quite computationally intensive, the paths first need to be sorted out. A path is only sorted out if it breaches any rule for linkers. For example, paths should never pass through the protein. After all the paths have been generated, they are improved by shifting the angle points according to the underlying linker model, so that no paths are taken into account that could not be built with our building blocks.<br />
After all paths have been correctly generated, they are weighted by several factors. From these, one weighting is calculated for each path, which corresponds to the goodness of this path. Then the path is translated, using our amino acid patterns, into an amino acid sequence. But as one sequence can follow more than one path, all the paths built up by this sequence are clustered. In the end, the average of the weightings within each path cluster is calculated, so that for every linker only one weighting value is produced, with the contributions of all paths possibly taken by this linker.<br />
==PDB analysis==<br />
First, the PDB file containing the structure of the target protein is parsed and the coordinates of the atoms are stored in metric units. After this, some initial tests are run on the protein structure. We first checked whether the C- and N-termini lie on the surface of the protein and are accessible to the solvent, which is crucial for circularization. To do so, we defined a line originating from an extremity of the protein by the two angles of the spherical coordinates around the z-axis. From that, we could determine the accessible angles by rejecting all the lines that are too close to the protein. As the future linker will be made of alpha helices and will therefore have a radius of 5 &Aring;, we used this length as the minimal allowed distance.<br />
Those allowed angles are stored for the coming linker generation. ###fig needed### <br />
==Generation of geometric paths==<br />
As our strategy consists in building linkers with helical rods and connecting angles, a path is completely defined by the coordinates of the angle points. Advancing one step from an existing point is always done by adding a displacement vector to this point. This vector is defined by the two spherical angles, chosen here in a discrete manner with an increment of 5 degrees, and by a length, also chosen in a discrete manner. This discrete length is used in two different contexts: it may correspond to the length of an alpha helix or to the length of the flexible part that appears at the extremity of the protein. The coordinates reached with this vector define the new coordinates of an angle point, no matter whether the vector corresponds to an alpha helix or to a flexible part. ###figure, points + vectors### As we screen all possible angles in a discrete manner, the coordinates of those angle points are regularly distributed on a sphere. As further detailed in the next sections, those spheres are defined from both ends, either once or in several steps. Then the software checks for possible straight connections of given lengths for each pair of angle points originating from both extremities. ###figure needed, with all these possible steps, one sphere around each point and checking for the connections, 2D, different lengths in two different figures, ###<br />
The linkers are built in a modular way, with blocks of well-defined size. From the modeling of potential linkers [link], we could derive 8 different alpha helical rods, all with different lengths. In addition, the length of the two segments inside an angle block was always 8 &Aring;, so exchanging angle blocks does not affect the length of the linker. This means that the distance between the angle points is well defined, an essential aspect of our strategy of linker design.<br />
The software proceeds in three steps. First, it checks for the possibility of a direct linker made of a single alpha helix. For this, it applies the procedure just mentioned with spheres whose radius reasonably corresponds to the length of the short parts at the extremities of the protein. Second, it tests whether a linker containing two alpha helices connected by a right angle allows the circularization. Finally, it searches for possible linkers with three angle points. The next parts explain those three steps in detail.<br />
This method was chosen because it could be implemented easily and efficiently in our program. However, this strategy also generates paths that cross the protein. We therefore put considerable effort into sorting out the paths.<br />
In the [[###link###|modelling]] part we have already described the patterns we use to build up our linkers. The modularity was crucial for the success of the software. Therefore the angle patterns were always chosen so that the end of the angle pattern was 8 &Aring; away from the turning point ###figure needed###. Thus only displacements of a certain length had to be made, independent of the direction in which the linker arrives and of the direction in which it continues. This modularity makes the calculations more efficient than simply generating points randomly.<br />
We still have to rephrase that.<br />
===Step 1===<br />
As a simple rigid linker with no angle would be easier to design and likely more thermostable than the ones containing angles, the software first checks if this simple solution is possible ###figure needed###. For this, we took into account the fact that proteins have some flexible amino acids at their extremities. This flexible part may come from the protein itself, but also from the 2 glycines that are included at the N-terminal part and from the extein at the C-terminal part. These two latter parts come from our linkers. The flexible parts have no preferential angles and offer a large number of possibilities to insert fitting linkers. But this flexibility is also a drawback, as we have to include this large number of possible angles and lengths in our path search.<br />
In this first step, the software explicitly takes these flexible parts into account to check for the possibility of straight linkers. As the angles and the length of the flexible parts are variable, the software positions their extremity on a sphere centered on the last fixed position of the structure, as explained above. The radius of this sphere is incremented in a discrete manner, in 4 steps, from 5.25 &Aring; to the maximum length of the flexible part. ###figure###<br />
Then all possible straight segments between the first points and the last points, i.e. the points generated around the two extremities, are tested. If they are closer than 5 &Aring; to the protein, or if they cross it, they are rejected. If they are kept, the software checks whether the length of the segments is compatible with the feasible alpha helices: if the length of a given segment equals one of the 8 alpha helix lengths plus or minus 0.75 &Aring;, the path is saved.<br />
" Therefore here another function for calculating the distance from the protein to the connection is used, that is more time-consuming, but also more accurate, than the ones used normally. " ??<br />
"This is done with a higher accuracy, because none of these linkers should be lost by error". ??<br />
To generate linkers that take the flexibility of the ends into account, two functions have been included so far.<br />
The only variability we have in this custom-tailored linker is the length of the helix. Most likely no suitable linker can be found here, because if there is some obstacle between the ends, the linker cannot bend around it. But if such a linker is possible, it should be found, because it will be one of the best linkers predicted. <br />
So for all accessible angles that were calculated before, we calculate all points from the N-terminus and from the C-terminus that lie in certain distances from the terminus. The points are spread over varying distances from ; to the .<br />
===Step 2===<br />
The next possibility to design more complex rigid linkers, while still taking flexible ends into account within a reasonable calculation time, was to reduce the number of possible angles. As we originally thought that a 90° angle would be practically feasible, the software was designed to generate linkers with flexible ends and one 90° angle. ###figure### This choice was notably made because the edge lengths of a right triangle are simple to calculate. We already saw in Step 1 that the length of an edge can only take 8 different values. As the linkers have to start from the extremities of the protein, and as we impose a right angle, the number of possible paths is low, making them easy to compute. In practice, the extremities of the protein are positioned in a flexible way as in Step 1. From each of the positions allowed by this flexibility, the software searches for all the allowed right triangles. <br />
some words on degree of freedom...<br />
###newly written###<br />
Again, we only have discrete possibilities to build right triangles with our helical patterns. Therefore, all combinations of two helical patterns are first searched that could build a right triangle whose hypotenuse has the length of the distance between the termini.<br />
Now we shift back to 3D and apply Thales's theorem, which states that if A, B and C lie on a circle and the line between A and C is a diameter of the circle, then the angle at B is a right angle. ### fig, thales### Thus in 3D we can discretize the possible positions of the angle point (B) relative to the starting point (A). This set is cross-checked against the set of possible right triangles from before, so that we only keep paths that can be built with our patterns.<br />
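The search for admissible right triangles can be sketched as follows. The rod lengths are placeholder values, the function name is hypothetical, and reusing the tolerance of 0.75 &Aring; from the other steps for the hypotenuse is an assumption of this sketch.<br />

```python
import math
from itertools import combinations_with_replacement

# Hypothetical rod lengths in angstroms, NOT the values used by the software.
HELIX_LENGTHS = [10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0, 45.0]
TOLERANCE = 0.75  # angstroms; applying it to the hypotenuse is an assumption


def right_triangle_pairs(termini_distance, lengths=HELIX_LENGTHS,
                         tolerance=TOLERANCE):
    """Pairs of rod lengths (a, b) with sqrt(a^2 + b^2) close to the distance
    between the two (flexibly positioned) termini."""
    pairs = []
    for a, b in combinations_with_replacement(lengths, 2):
        if abs(math.hypot(a, b) - termini_distance) <= tolerance:
            pairs.append((a, b))
    return pairs
```

Each pair is returned once with the shorter rod first; for a path, either rod may be the one attached to the starting point. By Thales's theorem the corresponding angle point then lies on the sphere whose diameter is the segment between the two end positions.<br />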
Like in Step 1 the software creates spheres of points around the start (first points) and around the end (last points). Then for each point from the first points<br />
===Step 3===<br />
Finally the software also provides the possibility to find paths with up to 4 edges, meaning 4 alpha helices and 3 angles. Thanks to the modularity of the possible linkers, such paths can offer the possibility to circularize theoretically any kind of protein. ###figure of torus###<br />
To keep the calculation feasible in a reasonable time, we designed the search strategy so that the flexible parts at the extremities are oriented in the same direction as the adjacent alpha helix. This obviously restricts the search, but as these orientations are allowed for the flexible parts, the paths found this way remain valid. <br />
First, potential end points of the first alpha helical rod are calculated from the N-terminal point of the protein. The orientation is chosen in a discrete manner, with an increment of 5 degrees for the two angles of the spherical coordinates. The distance from the origin takes one of the 8 possible lengths allowed by the alpha helices, as already seen in Step 2, plus a length of 0, which mimics a linker with 3 instead of 4 edges. The exact same procedure is repeated to define all the potential end points of the second alpha helical rod, starting from all the possible end points of the first alpha helical rod. Thanks to the possibility of a length of 0 for the first and the second rods, the software also calculates paths with 2 edges. Then the same is done only once from the C-terminal point of the protein, defining 1 edge. The final step consists in checking whether the points originating from the N- and C-terminal points can be linked by a potential alpha helix, i.e. whether they are separated by an appropriate distance. If the distance between two points matches any of the possible alpha helix lengths to within plus or minus 0.75 &Aring;, the path is saved. In the same way, if two points are closer than 0.75 &Aring; to each other, the path is also saved.<br />
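The final connection test can be sketched as follows. This is only an illustration: the helix lengths are hypothetical placeholders, and the function name and return convention are assumptions, not the software's actual interface.<br />

```python
import math

# Hypothetical helix lengths in angstroms; placeholders for the 8 real values.
HELIX_LENGTHS = [10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0, 45.0]
TOLERANCE = 0.75  # angstroms, as stated in the text


def connecting_helix(p, q, lengths=HELIX_LENGTHS, tol=TOLERANCE):
    """Return the helix length that can join points p and q, 0.0 if the
    points already coincide within tol, or None if no helix fits."""
    d = math.dist(p, q)
    if d < tol:
        return 0.0  # the two points meet directly, no extra helix is needed
    for length in lengths:
        if abs(length - d) <= tol:
            return length
    return None
```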
one starts from omega, 2 from alpha<br />
Thus even one of the worst shapes for circularization, a torus with the two termini in the pits, could be circularized with our linker system without needing infinitely long straight segments. On the other hand, this was the maximum number of possibilities that was still feasible to calculate.<br />
Here the flexible parts of the protein are assumed to point in the same direction as the following helix. In this way the number of possibilities is kept fixed, but of course this is a rather rough approximation.<br />
Let's check the next paragraph together in detail<br />
At first, points at the right distances from each end are generated. ###figure needed, explains point names### Now, for each of the first points, next points are generated in all directions, with an angle increment of 5 degrees. Of these, all points that do not fit are immediately sorted out. Then all connections from the second points to the last points are checked for whether they have the right length. After this, the normal sorting steps are applied to these new connections.<br />
==Sorting out of paths==<br />
The previous part described the generation of paths that can connect the two extremities of the protein irrespective of the position of these paths relative to the protein. While this allows fast computation of the geometrical paths, it also implies that the paths that are not practically feasible need to be sorted out. This is the most time-consuming part of the computation, as about 1 billion paths are generated. Three criteria are considered for the sorting. The first one is the feasibility of the linker: can the software find angle patterns that correspond to the ones defined by the geometrical path? This question was part of the motivation for a large modeling effort (link) to determine the possible angles between consecutive alpha helices. This was achieved by analyzing the distribution of angles between alpha helices found in the ArchDB database (link). As nearly any angle between 20 and 170 degrees could be found, only few paths were actually rejected at that step. The next criterion is the position of the angle points: if one of them lies inside the protein, the path is rejected. Finally, the software checks whether any atom of the protein is less than 5 &Aring; away from any of the alpha helices; if so, the path is also rejected.<br />
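The last criterion amounts to a point-to-segment distance test. The sketch below is a simplified illustration: it treats each helix as a line segment and each atom as a point, and it does not exclude the atoms next to the termini, which a real implementation would presumably have to handle. The function names are hypothetical.<br />

```python
import math

MIN_DISTANCE = 5.0  # angstroms, the clearance stated in the text


def point_segment_distance(p, a, b):
    """Shortest distance from point p to the segment from a to b (3D tuples)."""
    ab = [b[i] - a[i] for i in range(3)]
    ap = [p[i] - a[i] for i in range(3)]
    ab2 = sum(c * c for c in ab)
    if ab2 == 0.0:
        return math.dist(p, a)  # degenerate segment of zero length
    # Project p onto the line and clamp the parameter to the segment.
    t = sum(x * y for x, y in zip(ap, ab)) / ab2
    t = max(0.0, min(1.0, t))
    closest = [a[i] + t * ab[i] for i in range(3)]
    return math.dist(p, closest)


def segment_clashes(a, b, atoms, min_distance=MIN_DISTANCE):
    """True if any atom lies closer than min_distance to the segment a-b."""
    return any(point_segment_distance(atom, a, b) < min_distance for atom in atoms)
```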
==Shifting paths to the patterns==<br />
The strategy described in Step 3 gives a certain freedom to the rod that connects the last two angle points, which were generated from the N- and C-terminal points. As this freedom is not actually permitted by the alpha helix and angle patterns, but is permitted by the flexible part, for example at the C-terminal end, the software slightly refines the path by rotating the segment that originates from the C-terminal point.<br />
In particular because of the discretization, the generated paths may not fit the helical and angle patterns perfectly. Therefore, before they are translated into amino acid sequences, the paths need to be refined.<br />
Every path that has survived is analyzed step by step, starting from the starting point and always advancing by one angle point. For each step from point to point, the length of the step is calculated and compared to the lengths we can build with the helix patterns. If the step is too long, the next point is shifted towards the previous one until it fits. If it is too short, the point is shifted away. These shifts do not exceed a certain length, so that the paths do not move too far and suddenly pass through restricted areas.<br />
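One refinement step can be sketched as follows. The maximum shift is not specified in the text, so `max_shift` is a hypothetical parameter, and returning `None` when the required shift is too large is an assumption about how such paths are handled.<br />

```python
import math


def snap_point(prev, nxt, lengths, max_shift):
    """Move nxt along the line through prev so that the step length equals
    the nearest buildable helix length. Returns the shifted point, or None
    if the required shift would exceed max_shift (assumed behaviour)."""
    d = math.dist(prev, nxt)
    if d == 0.0:
        return None  # no direction to shift along
    target = min(lengths, key=lambda length: abs(length - d))
    if abs(target - d) > max_shift:
        return None
    scale = target / d
    return tuple(prev[i] + (nxt[i] - prev[i]) * scale for i in range(3))
```

A step that is too long is pulled towards the previous point (scale below 1), while a step that is too short is pushed away (scale above 1).<br />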
This is also done before the weighting, so that the paths don't change after weighting anymore.<br />
==Weighting of paths==<br />
Before translating the paths into sequences and thus into linkers that can be expressed, each path needs to be evaluated for its potential capacity to enhance heat stability. For this we have identified different contributions that should be combined into one value that defines how good the linker may be and that consequently defines its ranking among all possible linkers. The smaller this value, the more we expect that the linker will enhance thermostability. An important step for all these contributions is the normalization, as explained in the next paragraphs.<br />
The first contribution we considered was the linker length. We assumed that a short linker is better to constrain the protein extremities, and that a long helix might give more flexibility. Because we wanted this value to be independent of the size of the protein, the length of the linker is normalized to the distance between the two termini.<br />
The second contribution relates to the angles used in the linker. We learned from the [[modeling ###link###]] that angles formed by a certain angle pattern follow a certain distribution. First, we assumed that the narrower the distribution, the more likely the alpha helices would actually produce this angle. Second, the angles found by the software should be as close as possible to those well-defined angles. In this case, the weight value from this contribution should be low.<br />
Then the distance of the linker to the protein is taken into account. Because the linkers should not disturb the protein in its normal environment, linkers that pass close to the protein surface are considered better linkers. The distance was defined as the minimal distance between the linker and all the atoms of the protein. As already mentioned for the sorting of the paths, a linker cannot come closer than 5 &Aring; and this distance was used for normalization of calculated distances.<br />
As linkers should also not be too close to the protein surface, this value is normalized by the minimal distance a linker should keep from the surface. The distance is calculated as the minimal distance between the atoms of the protein and the connection.<br />
After this, the places a linker should avoid are calculated. Each protein can interact with other molecules on some parts of its surface. The user can specify where and how big those parts are. If a linker passes in front of a potential molecule binding domain, the value of the corresponding path goes to infinity, so that the linker is discarded. Conversely, the farther a linker is from a potential ligand binding domain, the smaller its weighting value. The user can also specify the importance of certain regions. In the end, the total weighting is normalized to the number of binding domains.<br />
===Calibrating the weighting function===<br />
???keep?###<br />
Every contribution has its own distribution. You can see an example in figure ### [[figure histogram_length_lys.png]], but all of them have different shapes. The aim is to find the paths that minimize all of these contributions. Therefore, in the weighting function the four mentioned contributions are combined in a linear manner:<br />
\[ W(p) = \alpha L(p) + \beta A(p) + \gamma D(p) + u(p) \]<br />
where W is the final weighting, p the path, L the length contribution, A the angle contribution, D the distance contribution and u the contribution from the forbidden regions. $\alpha, \beta, \gamma$ are the weighting constants that needed to be found. The normalization performed for each contribution was chosen so that each of them is dimensionless and all have reasonably similar values.<br />
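As a minimal sketch, the linear combination and the length normalization described above could look like this. The function names are hypothetical, and the constants are passed in explicitly rather than hard-coded.<br />

```python
import math


def length_contribution(linker_length, termini_distance):
    """Linker length normalized by the distance between the two termini."""
    return linker_length / termini_distance


def path_weight(length, angle, distance, forbidden, alpha, beta, gamma):
    """W(p) = alpha*L(p) + beta*A(p) + gamma*D(p) + u(p); lower is better.
    A path crossing a forbidden region has forbidden = math.inf and is
    therefore ranked last."""
    return alpha * length + beta * angle + gamma * distance + forbidden
```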
The weighting constants were obtained from the [[linker-screening ###lin]] performed with lysozyme and the [[enzyme-modeling ###link]]<br />
Please see [[below ###link]] for the detailed explanation, how the values were obtained.<br />
==Translating paths to sequence==<br />
As already mentioned before the software is provided with two databases, one for the possible angle patterns and one for the helix patterns. The choice of the patterns was inspired by known crystal structures extracted from databases and described in different papers.<br />
A huge in silico screening for refining the preferences of the patterns was then set up using the [[iGEM@home|###Link]] system. For the complete description of search for suitable patterns, one can read the [[modeling|###Link to patternspart]] page. <br />
All the possible paths are now split up at the angles and compared with the possible patterns in the databases. ###Figure needed to explain, how the path is translated### The most suitable patterns are identified and joined together to build the path's sequence. It is important to note that this is only possible because of the modularity of the linker patterns used as building blocks: each block, whether an alpha helix or an angle pattern, is not affected by the others. Thus for each possible path, one sequence is produced.<br />
==Clustering of paths==<br />
Many different paths are represented by the same sequence [[###Figure that shows, different paths have same properties, already before in the text ###]] and we therefore clustered such paths. The weights for those clustered paths were then calculated by averaging the weights of the different paths that compose a cluster.<br />
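The clustering can be sketched as a grouping by sequence followed by an average. The input format, a list of (sequence, weight) pairs, is an assumption made for illustration.<br />

```python
from collections import defaultdict


def cluster_by_sequence(paths):
    """Group (sequence, weight) pairs by sequence and return a dict that
    maps each sequence to the mean weight of its paths."""
    groups = defaultdict(list)
    for sequence, weight in paths:
        groups[sequence].append(weight)
    return {seq: sum(ws) / len(ws) for seq, ws in groups.items()}
```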
=Results=<br />
==DNMT1==<br />
A major motivation of our effort to design rigid linkers with angles was the circularization of the DNA methyltransferase Dnmt1 (link). The truncated form used in our project is composed of 900 amino acids, and the N- and C-terminal extremities are well separated. To circularize it, two linkers were designed: a flexible one made of glycine and serine, and a rigid one designed by the software. The rigid linkers for DNMT1 were obtained from an early version of the software. At that time the calculation took 11 days on a laptop computer with an Intel i5 processor and 8 GB of RAM, which shows the importance of a distributed computing system for large proteins. Since then the software has improved a lot, reducing the calculation time for DNMT1 to about 1 day.<br />
###DNMT1 could be made heatstable, still missing###<br />
==Feedback from wet lab==<br />
The results from the software for lysozyme were tested as described in the [[linker-screening]] part and evaluated as described in the [[enzyme modeling]] part. We have performed a large linker screening on 10 different lysozymes with different linkers. As the purpose of the lysozyme screen was the calibration of the software, those linkers were designed according to the four contributions previously mentioned. One of them was the shortest possible, one had the best possible angle, and so on.<br />
{| class="table table-hover"<br />
|+ '''table 1''': Linker and their amino acid sequence. Green: attachment sequences to prevent the flexible regions from being perturbed; Blue: angle; Purple: extein.<br />
! Linker<br />
! Amino acid sequence<br />
|- <br />
|<br />
|<br />
|-<br />
|'''Very good linkers'''<br />
|<br />
|-<br />
| sgt2<br />
| <span style="color:#3ADF00;">GG</span>AEAAAK<span style="color:#00BFFF;">AAAHPEA</span>AEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| rigid<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAKAA<span style="color:#3ADF00;">P</span><span style="color:#A901DB;">RGKCWE</span><br />
|-<br />
| <br />
| <br />
|-<br />
| '''Average linkers'''<br />
|<br />
|-<br />
| may1<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAKA<span style="color:#00BFFF;">AAAHPEA</span>AEAAAKEAAAKA<span style="color:#00BFFF;">KTA</span>AEAAAKEAAAKA<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| ord1<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAK<span style="color:#00BFFF;">ATGDLA</span>AEAAAKAA<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| ord3<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAK<span style="color:#00BFFF;">ASLPAA</span>AEAAAKEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
|<br />
| <br />
|-<br />
| '''Short linkers'''<br />
| <br />
|-<br />
| sho1<br />
| <span style="color:#3ADF00;">GG</span><span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| sho2<br />
| <span style="color:#3ADF00;">GG</span>AEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
|}<br />
[[circ_lam_lys_nils.png]] comparison with [[nice_linker_lysozyme_flexible_ends.png]]<br />
###About predictions of software###<br />
In the end we obtained a ranking of the in vitro tested linkers from the [[linker-screening]] and chose the parameters $\alpha, \beta, \gamma$ of the weighting function such that the ranking from the software reproduced the ranking from the assays. The final values were: $\alpha = 1.85 \times 10^{-6}, \beta = 0.57, \gamma = 50.8 \times 10^{6}$.<br />
To this end we set up a function that is minimized when the order of the weighting output is the same as the order from the wet lab.<br />
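The exact form of this function is not given here. One simple stand-in with the stated property is the number of linker pairs that the software ranks in the opposite order to the wet lab, which is zero exactly when both rankings agree. The sketch below uses this pair count as an assumption, not as the function actually used.<br />

```python
from itertools import combinations


def rank_disagreement(weights, wetlab_rank):
    """Count the linker pairs ordered differently by the software and the
    wet lab. weights maps linker name to software weight (lower is better);
    wetlab_rank lists the linker names from best to worst."""
    discordant = 0
    for better, worse in combinations(wetlab_rank, 2):
        if weights[better] >= weights[worse]:
            discordant += 1
    return discordant
```

Minimizing this count over the weighting constants would then drive the software ranking towards the experimental one.<br />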
<br />
=Discussion=<br />
Will always be refined with more data from i@h...<br />
=References=<br />
[0] Wang, C.K.L., Kaas, Q., Chiche, L. & Craik, D.J. CyBase: A database of cyclic protein sequences and structures, with applications in protein discovery and engineering. Nucleic Acids Research 36, (2008).<br />
[1] Thornton, J.M. & Sibanda, B.L. Amino and carboxy-terminal regions in globular proteins. Journal of molecular biology 167, 443-460 (1983).<br />
<br />
</div>Igemnilshttp://2014.igem.org/Team:Heidelberg/pages/Linker_SoftwareTeam:Heidelberg/pages/Linker Software2014-10-17T22:37:58Z<p>Igemnils: /* Background */</p>
<hr />
<div>=Abstract=<br />
As already [https://2014.igem.org/Team:Heidelberg/Toolbox/Circularization introduced], artificially circularized proteins may gain some heat stability because their C- and N-termini are restrained from moving around freely. This circularization may be trivial when the protein termini are very close to each other, which seems to be reasonably common [[#References|[1]]]. However, if the ends are too far from each other, a long linker is needed to connect them. This linker should not change the natural conformation of the protein and should constrain the relative position of the ends, to restrict the degrees of freedom and thus stabilize the structure even when heated up. On top of that, these linkers should not affect any of the protein functions. Consequently, it is important to prevent linkers from passing through the active site or from covering domains that bind other molecules, for example. Therefore one needs to be able to define the shape of possible linkers. This section describes the software we developed to design such linkers. We would like to stress that this work was made possible by the feedback between computer modeling and experimental work: we could first design linkers in silico, test them experimentally and use the results to further calibrate the software. To our knowledge, this is the first time that such an approach has been used to custom-design rigid linkers with angles to connect protein extremities.<br />
<br />
=General procedure=<br />
In short, the software can provide a weighted list of linkers to circularize any protein of interest with a known structure. Those linkers are made of rigid alpha helical segments connected at defined angles. Contrary to flexible linkers, those rigid linkers were expected to constrain the protein extremities and to confer better heat stability. Such an idea was already developed [[#References|[2]]], but only with alpha helices defining simple rods and without any possibility to introduce angles. To generate those linkers, we first defined the geometrical paths, with segments and angles, that they should follow. The geometrical paths that are biologically feasible are afterwards translated into amino acid sequences. Both the compatibility of paths with possible structures and the translation were made possible by our [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling approaches]. The first approach consisted in performing a statistical analysis of more than 17000 known non-homologous structures containing alpha helices connected with angles. For the second approach, we modeled the conformation of linkers circularizing proteins of known structure and analyzed them for certain properties. This second approach was run for a large number of proteins thanks to our distributed computing system [https://2014.igem.org/Team:Heidelberg/Software/igemathome igemathome]. The software provides different possible linkers with weights that rank the linkers according to their capacity to maintain protein activity at higher temperatures. These weights were generated thanks to an extensive [https://2014.igem.org/Team:Heidelberg/Project/Linker_Screening linker screening] on the target protein lambda-lysozyme, using the first modeling approach.<br />
<br />
=Background=<br />
Classically, protein linkers were designed in three different manners. The easiest way is to define the length that a linker should cover and then simply use a flexible glycine-serine peptide with the right number of amino acids to match this length. Glycine is used for flexibility, as it has no side chain and does not produce any steric hindrance, while serine is used for solubility, as it has a small polar side chain. This solubility is important, as the linkers should not pass through the hydrophobic core of the protein, but should be dissolved in the surrounding medium. These flexible linkers were normally used for circularization, but also for connecting different proteins when the main goal is that the different parts are connected, not how they are connected, or when the flexibility of the linker was required for specific applications. <br />
<br />
A second strategy consists in using rigid helical linkers to keep proteins or protein domains at a certain distance from each other. This is especially important for signalling proteins and fluorescent proteins. One major property of alpha helices is that they always fold in a defined way with well-defined angles and lengths. There are also many different helical patterns that differ in stability and solubility. Although helices have been used to design circularizing linkers [[#References|[2]]], one big disadvantage of this strategy is that one can only build straight linkers with them. So in the context of circularization, if a straight line connecting the protein extremities would cross the protein, this strategy is not an option.<br />
<br />
The third option, which served as a basis for developing our approach and which came from discussions with the group of Rebecca Wade in Heidelberg, Germany, consists in designing custom-tailored linkers for each specific application. These linkers can be obtained from protein structure prediction. At first one needs to define the path that the linker should take to connect the protein ends. Afterwards, one designs a possible linker sequence that might fit well. Next, one makes a structure prediction of the linker attached to the protein to validate the design. Several different linkers, with slight changes, can be compared. This is repeated several times until the linker effectively follows the expected path. This method requires a strong knowledge of protein folding and protein structure prediction and is computationally intensive. On the other hand, the benefit can be important, as the interaction of the linker with the protein surface can be taken into account and as one can accurately define the path taken by the linker to the resolution of the protein structure.<br />
<br />
We have set up a completely new strategy to design rigid linkers. As further detailed in the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling] part, it is possible to define the shape of a linker by combining rigid alpha helical rods with well-defined angle patterns. Therefore, by defining in a geometrical way the possible paths of the circularizing linkers for a given protein, we can then propose potential linkers. This definition of the geometrical path can be very difficult, especially for large proteins with complex shapes. Moreover, this definition is further constrained by the fact that linkers must avoid hiding active sites of the protein of interest. Finally, the paths have rotational degrees of freedom at the extremities of the protein, and depending on their orientation, they may or may not match the geometry of the protein. The tool we present here covers the two steps: defining geometrical paths with some weights and translating them into feasible linkers, also with weights. This tool is universal, as it has the capacity to design circularizing linkers for any protein with a known structure. Moreover it is modular as, thanks to our [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling approach], we have designed linkers as exchangeable blocks of rods of different lengths and of angle patterns. The following sections detail the different steps followed by our software to design proper linkers.<br />
<br />
<br />
==PDB parsing==<br />
At first, the PDB file containing the structure of the target protein is parsed and the coordinates of the atoms are stored, in metric units. After this, some initial tests are made with the protein structure. First, we check whether the C- and N-termini lie on the surface of the protein and are accessible to the solvent, which is crucial for circularization. We define a line originating from an extremity of the protein by the two angles of the spherical coordinates around the z-axis. From that, we determine the accessible angles by rejecting all the lines that are too close to the protein. As the future linker will be made of alpha helices and will therefore have a radius of 5 &Aring;, we use this length as the minimal allowed distance.<br />
Those allowed angles are stored for the coming linker generation. ###fig needed### <br />
<br />
==Generation of geometric paths==<br />
As our strategy consists in building linkers with helical rods and connecting angles, a path is completely defined by the coordinates of the angle points. Advancing one step from an existing point is always done by adding a displacement vector to this point. This vector is defined by the two spherical angles, chosen here in a discrete manner with an increment of 5 degrees, and by a length, also chosen in a discrete manner. This discrete length is used in two different contexts: it may correspond to the length of an alpha helix or to the length of the flexible part that appears at the extremity of the protein. The coordinates reached with this vector define the new coordinates of an angle point, no matter whether the vector corresponds to an alpha helix or to a flexible part. ###figure, points + vectors###. As we screen all possible angles in a discrete manner, those angle point coordinates are regularly distributed on a sphere. As further detailed in the next sections, those spheres are defined from both ends, either once or in several steps. Then the software checks for possible straight connections of given lengths for each pair of angle points originating from the two extremities. ###figure needed, with all these possible steps, one sphere around each point and checking for the connections, 2D, different lengths in two different figures, ###<br />
The linkers are built in a modular way, with blocks of well-defined size. From the [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling] of potential linkers, we could derive 8 different alpha helical rods, all with different lengths. On top of that, the length of the two segments inside an angle block was always 8&Aring;, so exchanging angle blocks does not affect the length of the linker. This means that the distance between the angle points is well defined, an essential aspect of our strategy of linker design.<br />
The software proceeds in three steps. First, it checks for the possibility of a direct single alpha helix linker. For this, it applies the procedure just mentioned, with spheres whose radii reasonably correspond to the length of the short parts at the extremities of the protein. Second, it tests whether a linker containing two alpha helices connected at a right angle allows the circularization. Finally, it searches for the possible linkers with three angle points. The next parts will explain those three steps in detail.<br />
This method was chosen because it could be implemented easily and efficiently in our program. However, this strategy generates paths that cross the protein. Therefore we put considerable effort into sorting out the paths.<br />
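The generation of angle points on a sphere can be sketched as follows. The function name is hypothetical, and collapsing the duplicate points at the two poles is a design choice of this sketch rather than something stated in the text.<br />

```python
import math


def sphere_points(origin, radius, step_deg=5):
    """Points reached from origin by displacement vectors of the given
    length, with both spherical angles sampled every step_deg degrees."""
    points = []
    for theta_deg in range(0, 181, step_deg):
        theta = math.radians(theta_deg)
        # At the poles every azimuth gives the same point, so keep only one.
        phis = [0] if theta_deg in (0, 180) else range(0, 360, step_deg)
        for phi_deg in phis:
            phi = math.radians(phi_deg)
            points.append((
                origin[0] + radius * math.sin(theta) * math.cos(phi),
                origin[1] + radius * math.sin(theta) * math.sin(phi),
                origin[2] + radius * math.cos(theta),
            ))
    return points
```

With a 5 degree increment this gives 35 x 72 + 2 = 2522 points per sphere and per length, which illustrates why the number of candidate paths grows so quickly when several steps are chained.<br />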
<br />
===Step 1===<br />
As a simple rigid linker with no angle would be easier to design and likely more thermostable than the ones containing angles, the software first checks whether this simple solution is possible ###figure needed###. For this, we took into account the fact that proteins have some flexible amino acids at their extremities. This flexible part may come from the protein itself, but also from the 2 glycines that are included at the N-terminal part and from the extein at the C-terminal part. The latter two parts come from our linkers. Those parts have no preferential angles and offer a large number of possibilities for inserting fitting linkers. But this flexibility is also a drawback, as we have to include this large number of possible angles and lengths in our path search.<br />
In this first step, the software explicitly takes these flexible parts into account to check for the possibility of straight linkers. As the angles and the length of the flexible parts are variable, the software positions their extremity on a sphere centered on the last fixed position of the structure, as explained above. The radius of this sphere is incremented in a discrete manner, in 4 steps, from 5.25 &Aring; to the maximum length of the flexible part. ###figure###<br />
Then all possible straight segments between the first points and the last points are tested. If they are closer than 5 &Aring; to the protein, or if they cross it, they are rejected. If they are kept, the software checks whether the length of the segments is compatible with the feasible alpha helices: if the length of a given segment equals one of the 8 alpha helix lengths plus or minus 0.75 &Aring;, the path is saved.<br />
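The sphere radii can be sketched as below. The text only states that there are 4 steps between 5.25 &Aring; and the maximum length, so the linear spacing and the reading of "4 steps" as 4 radii are assumptions.<br />

```python
def flexible_radii(max_length, min_length=5.25, steps=4):
    """Radii (in angstroms) of the spheres on which the end of a flexible
    part is placed, assuming linear spacing from min_length to max_length."""
    if max_length <= min_length:
        return [min_length]  # flexible part too short to vary
    increment = (max_length - min_length) / (steps - 1)
    return [min_length + i * increment for i in range(steps)]
```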
===Step 2===<br />
The next way to design more complex rigid linkers, while still taking the flexible ends into account within a reasonable calculation time, was to reduce the number of possible angles. As we originally thought that a 90° angle would be practically feasible, the software was designed to generate linkers with flexible ends and one 90° angle. ###figure, protein_one_90° angle### This choice was made notably because the edge lengths of a right triangle are simple to calculate. We already saw in Step 1 that the length of an edge can only take 8 different values. As the linkers have to start from the extremities of the protein, and as we impose a right angle, the number of possible paths is low, making them easy to compute. In practice, the extremities of the protein are positioned in a flexible way as in Step 1. From each of the positions allowed by this flexibility, the software searches for all the allowed right triangles. This restriction was mainly introduced because the degrees of freedom needed to be limited to keep the calculations feasible. <br />
===Step 3===<br />
Finally the software also provides the possibility to find paths with up to 4 edges, meaning 4 alpha helices and 3 angles. Thanks to the modularity of the possible linkers, such paths can offer the possibility to circularize theoretically any kind of protein. ###figure of torus###<br />
To keep the calculation feasible in a reasonable time, we designed the search strategy so that the flexible parts at the extremities are oriented in the same direction as the adjacent alpha helix. This obviously restricts the search, but as these orientations are allowed for the flexible parts, the paths found this way remain valid. <br />
First, potential end points of the first alpha helical rod are calculated from the N-terminal point of the protein. The orientation is chosen in a discrete manner, with an increment of 5 degrees for the two angles of the spherical coordinates. The distance from the origin takes one of the 8 possible lengths allowed by the alpha helices, as already seen in Step 2, plus a length of 0, which mimics a linker with 3 instead of 4 edges. The exact same procedure is repeated to define all the potential end points of the second alpha helical rod, starting from all the possible end points of the first alpha helical rod. Thanks to the possibility of a length of 0 for the first and the second rods, the software also calculates paths with 2 edges. Then the same is done only once from the C-terminal point of the protein, defining 1 edge. The final step consists in checking whether the points originating from the N- and C-terminal points can be linked by a potential alpha helix, i.e. whether they are separated by an appropriate distance. If the distance between two points matches any of the possible alpha helix lengths to within plus or minus 0.75 &Aring;, the path is saved. In the same way, if two points are closer than 0.75 &Aring; to each other, the path is also saved.<br />
<br />
==Sorting out of paths==<br />
Most of the calculation time is consumed while sorting out unsuitable paths. This is because every path needs to be checked: with the brute-force approach, the number of possible paths is about 10^9.<br />
There are three main functions that sort out paths that do not fit. The simplest one sorts out paths that would require an angle we cannot produce with our angle patterns. But as we can produce nearly any angle from ???20 -170??? degrees, only very few paths are sorted out by this function.<br />
The next function checks whether the end point lies inside the protein. If so, the path is deleted. Otherwise, the connection from the previous point to this point is checked for whether it passes too close to the protein.<br />
<br />
==Shifting paths to the patterns==<br />
<br />
Because of rounding errors and other inaccuracies, the generated paths may not fit the helix and angle patterns perfectly. The paths therefore need to be refined before they are translated into an amino acid sequence. This refinement is done before the weighting, so that the paths no longer change once they have been weighted.<br />
To this end, the distances between the points are calculated and the points are shifted just far enough to fit the patterns. The shifts never exceed a certain length, so that no path that was clear of the protein before refinement passes through it afterwards.<br />
==Weighting of paths==<br />
Before translating the paths into sequences and thus into expressible linkers, the quality of each path needs to be assessed. We identified four contributions that determine how well a path will enhance the heat stability of a target protein by circularization. In the end, these contributions are summed to give one final value for the quality of the connection. For each contribution, a lower weighting value corresponds to a better path. Each contribution is normalized by some property of the protein, so that it is independent of the target protein and the weighting remains general.<br />
First, the length of the linker is assessed. We assumed that a shorter linker is normally better at restraining the ends, as a longer helix might give them more flexibility. Because we want this value to be independent of the size of the protein, the length of the linker is normalized by the distance between the two termini. Without this normalization, the length contribution would exceed all other contributions to the weighting function simply because the termini are far apart and a longer linker is needed.<br />
The next value measures the quality of the angles used in the linker. Each of our potential angle patterns produces angles that are distributed around a mean with a certain standard deviation. The narrower this distribution, the more likely the pattern will actually produce the assumed angle between the flanking helices. Therefore path angles that fit the provided angle patterns perfectly get a lower weighting value.<br />
Then the distance from the protein is taken into account. Because the linkers should not disturb the protein in its normal environment, linkers that pass close to the protein's surface are considered better. Of course, no linker passes so close to the surface that it would interact with the protein too much. This value is normalized by the minimal distance a linker should keep from the surface.<br />
After this, the places a linker should avoid are calculated. Each protein binds some ligands or substrates. The user can specify how big they are and where they attach. If a linker passes through a potential ligand, this value goes to infinity, so that the linker is discarded. Otherwise, the farther a linker is from a potential ligand, the better this value gets. The user can also specify the importance of certain regions. In the end, the total weighting is independent of the number of binding sites.<br />
<br />
===Calibrating the weighting function===<br />
???keep?###<br />
Every contribution has its own distribution. You can see an example in figure ### [[figure histogram_length_lys.png]], but all of them have different shapes. The aim is to find the paths that minimize all of these contributions. Therefore, the weighting function combines the four contributions in a linear manner:<br />
\[ W(p) = \alpha L(p) + \beta A(p) + \gamma D(p) + \delta u(p) \]<br />
where W is the final weighting, p the path, L the length contribution, A the angle contribution, D the distance contribution and u the contribution from the forbidden regions. $\alpha, \beta, \gamma, \delta$ are the weighting constants that needed to be found. The normalization of each contribution was chosen so that each of them is dimensionless and all have reasonably similar values.<br />
The weighting constants were obtained from the [[linker-screening ###lin]] performed with lysozyme and the [[enzyme-modeling ###link]].<br />
Please see [[below ###link]] for a detailed explanation of how the values were obtained.<br />
==Translating paths to sequence==<br />
As mentioned before, the software is provided with two databases, one for the possible angle patterns and one for the helix patterns. The choice of patterns was inspired by known crystal structures extracted from databases and described in different papers.<br />
A large in silico screening to refine the preferences among the patterns was then set up using the [[iGEM@home|###Link]] system. For a complete description of the search for suitable patterns, see the [[modeling|###Link to patternspart]] page. <br />
All the possible paths are now split up at the angles and compared with the possible patterns in the databases. ###Figure needed to explain, how the path is translated### The most suitable patterns are identified and concatenated to build the path's sequence. It is important to note that this is only possible because of the modularity of the linker patterns used as building blocks: each block, whether an alpha helix or an angle pattern, is not affected by the others. Thus, for each possible path, one sequence is produced.<br />
==Clustering of paths==<br />
Many different paths are represented by the same sequence [[###Figure that shows, different paths have same properties, already before in the text ###]] and we therefore clustered such paths. The weight of each cluster was then calculated by averaging the weights of the paths that compose it.<br />
<br />
=Results=<br />
IMPORTANT!<br />
paths-> geometrical<br />
linkers-> helices...<br />
sequences-> seq of linkers<br />
insist on huge number of data produced<br />
We definitely need to cite [0]!!!<br />
=Abstract=<br />
=Overview=<br />
As already introduced [###link to circularization idea in the toolbox], artificially circularized proteins may gain some heat stability because the C- and N-termini are restrained from moving around freely. This circularization may be trivial when the protein termini are very close to each other, which seems to be reasonably common [[#References|[1]]]. However, if the ends are too far from each other, a long linker is needed to connect them. This linker should not change the natural conformation of the protein and should constrain the relative position of the ends, restricting their degrees of freedom and thus stabilizing the structure even when heated. In addition, these linkers should not affect any of the protein's functions. Consequently, it is important to prevent linkers from, for example, passing through the active site or covering domains that bind other molecules. Therefore one needs to be able to define the shape of possible linkers. This section describes the software we developed to design such linkers. We would like to stress that this work was made possible by the feedback between computer modeling and experimental work: we could first design linkers in silico, test them experimentally and use the results to further calibrate the software. To our knowledge, this is the first time that such an approach has been used to custom-design linkers that connect protein extremities.<br />
In short, the software provides a weighted list of linkers to circularize any protein of interest with a known structure. Those linkers are made of rigid alpha helical segments connected at defined angles. In contrast to flexible linkers, those rigid linkers were expected to constrain the protein extremities and to confer better heat stability. To generate those linkers, we first defined the geometrical paths, with segments and angles, that they should follow. The geometrical paths that are biologically feasible are afterwards translated into amino acid sequences. Both the compatibility of paths with possible structures and the translation were made possible by our modeling approaches. The first approach consisted in performing a statistical analysis of more than 17000 known non-homologous structures containing alpha helices connected with angles. For the second approach, we modeled the conformation of linkers circularizing proteins of known structure and analyzed them for certain properties. This second approach was run for a large number of proteins thanks to our distributed computing system igemathome. The software provides different possible linkers with weights that rank the linkers according to their capacity to maintain protein activity at higher temperatures. These weights were generated in an extensive [[linker screening| ### ]] on the target protein lambda-lysozyme, using the first modeling approach.<br />
The runtime stability of the software was checked in a large test run on [[iGEM@home]]. <br />
Furthermore, our software provided linkers for circularizing [[DNMT1| ###]], which could be made more heat-stable by circularization???.<br />
For detailed information on the implementation and the practical use of the software, please see the [[documentation software-docu]] page.<br />
==What does the software do==<br />
still missing [[figure 0, graph abstract]]<br />
=Background=<br />
What are linkers, why should ours work better?<br />
Linkers, References to the modeling, <br />
Classically, protein linkers have been designed in three different ways. ###REFERENCE### The easiest way is to define the length that a linker should cover and then simply use a flexible glycine-serine peptide with the right number of amino acids to match this length. Glycine is used for flexibility, as its side chain is a single hydrogen atom and produces hardly any steric hindrance, while serine is used for solubility, as it has a small polar side chain. This solubility is important, as the linkers should not pass through the hydrophobic core of the protein, but should be dissolved in the surrounding medium. These flexible linkers have typically been used for circularization, but also for connecting different proteins when what matters is that the parts are connected rather than how they are connected, or when the flexibility of the linker is required for specific applications.<br />
A second strategy consists in using rigid helical linkers to keep proteins or protein domains at a certain distance from each other. This is especially important for signalling proteins and fluorescent proteins. ###TODO: Reference### One major property of alpha helices is that they always fold in a defined way with well-defined angles and lengths. There are also many different helical patterns that differ in stability and solubility. One big disadvantage of this strategy is that one can only build straight linkers with helices. So in the context of circularization, if the straight line connecting the protein extremities crosses the protein, this strategy is not an option.<br />
The third option, which served as a basis for our approach, consists in designing custom-tailored linkers for each specific application. These linkers can be obtained from protein structure prediction. First, one needs to define the path that the linker should take to connect two amino acids. Afterwards, one designs a possible linker sequence that might fit well. Next, one makes a structure prediction of the linker attached to the proteins to validate the design. Several different linkers, with slight changes, can be compared. This is repeated several times until the linker effectively follows the expected path. ###TODO: Reference, WADE paper### This method is time consuming, as it is not only computationally intensive but also requires a strong knowledge of protein folding and protein structure prediction. On the other hand, the benefit can be substantial, as the interaction of the linker with the protein's surface can be taken into account and the path taken by the linker can be defined accurately, down to the resolution of the protein structure.<br />
We have set up a completely new strategy to design rigid linkers. As further detailed in the [[###link###|modelling]] part, it is possible to define the shape of a linker by combining rigid alpha helical rods with well-defined angle patterns. Therefore, by defining the possible paths of the circularizing linkers for a given protein in a geometrical way, we can then propose potential linkers. Defining the geometrical path can be very difficult, especially for large proteins with complex shapes. Moreover, the definition is further constrained by the fact that linkers must avoid hiding active sites of the protein of interest. Finally, the paths have rotational degrees of freedom at the extremities of the protein and, depending on their orientation, may or may not match the geometry of the protein. The tool we present here covers both steps: defining weighted geometrical paths and translating them into feasible linkers, also with weights. This tool is universal, as it can design circularizing linkers for any protein with a known structure. Moreover, it is modular: thanks to our modeling approach [link], we have designed linkers as exchangeable blocks, namely rods of different lengths and angle patterns. The following sections detail the different steps followed by our software to design proper linkers.<br />
==PDB analysis==<br />
At first, the PDB file containing the structure of the target protein is parsed and the coordinates of the atoms are stored, in the metric unit. After this, some initial tests are made on the protein structure. First, we check whether the C- and N-termini lie on the surface of the protein and are accessible to the solvent, which is crucial for circularization. We define a line originating from an extremity of the protein, with its direction given by the two angles of the spherical coordinates around the z-axis. From that, we determine the accessible angles by rejecting all the lines that are too close to the protein. As the future linker will be made of alpha helices and will therefore have a radius of 5 &Aring;, we use this length as the minimal allowed distance.<br />
Those allowed angles are stored for the subsequent linker generation. ###fig needed### <br />
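The accessibility test above can be sketched as follows. This is only an illustration under stated assumptions, not the code of our software: the function names are hypothetical, the coordinates are read from the fixed columns of PDB ATOM records, and the caller is assumed to have removed the atoms of the terminus itself, which would otherwise block every direction.<br />

```python
import math

def parse_pdb_atoms(pdb_text):
    """Return (x, y, z) tuples, as written in the file, for all ATOM records."""
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            # PDB fixed-column format: x, y, z occupy columns 31-38, 39-46, 47-54.
            atoms.append((float(line[30:38]), float(line[38:46]), float(line[46:54])))
    return atoms

def direction(theta_deg, phi_deg):
    """Unit vector for polar angle theta (from the z-axis) and azimuth phi."""
    t, p = math.radians(theta_deg), math.radians(phi_deg)
    return (math.sin(t) * math.cos(p), math.sin(t) * math.sin(p), math.cos(t))

def distance_to_ray(point, origin, unit_dir):
    """Shortest distance from point to the half-line starting at origin."""
    v = tuple(a - b for a, b in zip(point, origin))
    s = max(0.0, sum(a * b for a, b in zip(v, unit_dir)))
    closest = tuple(o + s * d for o, d in zip(origin, unit_dir))
    return math.dist(point, closest)

def accessible_directions(origin, atoms, min_dist=5.0, step=5):
    """Keep the (theta, phi) pairs whose half-line stays min_dist away from all atoms."""
    allowed = []
    for theta in range(0, 181, step):
        for phi in range(0, 360, step):
            d = direction(theta, phi)
            if all(distance_to_ray(a, origin, d) >= min_dist for a in atoms):
                allowed.append((theta, phi))
    return allowed
```

The default of 5 &Aring; for `min_dist` and the 5 degree step are taken from the text; everything else is a design choice of the sketch.<br />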
==Generation of geometric paths==<br />
As our strategy consists in building linkers from helical rods and connecting angles, a path is completely defined by the coordinates of the angle points. Advancing one step from an existing point is always done by adding a displacement vector to this point. This vector is defined by the two spherical angles, chosen here in a discrete manner with an increment of 5 degrees, and by a length, also chosen in a discrete manner. This discrete length is used in two different contexts: it may correspond to the length of an alpha helix or to the length of the flexible part at the extremity of the protein. The coordinates reached with this vector define a new angle point, no matter whether the vector corresponds to an alpha helix or to a flexible part. ###figure, points + vectors###. As we screen all possible angles in a discrete manner, the coordinates of those angle points are regularly distributed on a sphere. As further detailed in the next sections, those spheres are defined from both ends, either once or in several steps. Then the software checks for possible straight connections of given lengths between each pair of angle points originating from the two extremities. ###figure needed, with all these possible steps, one sphere around each point and checking for the connections, 2D, different lengths in two different figures, ###<br />
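As a minimal sketch of this step, the candidate angle points around an existing point can be enumerated by adding one displacement vector per pair of spherical angles. The function name `sphere_points` is hypothetical and the code is not taken from our software.<br />

```python
import math

def sphere_points(origin, length, step_deg=5):
    """Candidate angle points reached from origin with a vector of the given length.

    Theta (polar angle) and phi (azimuth) are sampled every step_deg degrees,
    so the points lie on a sphere of radius `length` around `origin`.
    """
    points = []
    for theta in range(0, 181, step_deg):
        for phi in range(0, 360, step_deg):
            t, p = math.radians(theta), math.radians(phi)
            points.append((
                origin[0] + length * math.sin(t) * math.cos(p),
                origin[1] + length * math.sin(t) * math.sin(p),
                origin[2] + length * math.cos(t),
            ))
    return points
```

With a 5 degree increment this gives 37 x 72 = 2664 points per sphere and per length. Note that with this simple grid the two poles are generated once for every value of phi, so a real implementation may want to remove such repeated points.<br />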
The linkers are built in a modular way, from blocks of well-defined size. From the modeling of potential linkers [link], we could derive 8 different alpha helical rods, all with different lengths. In addition, the length of the two segments inside an angle block was always 8 &Aring;, so exchanging angle blocks does not affect the length of the linker. This means that the distance between the angle points is well defined, an essential aspect of our linker design strategy.<br />
The software proceeds in three steps. First, it checks whether a direct linker consisting of a single alpha helix is possible. For this, it applies the procedure just described, with spheres whose radii correspond to the length of the short flexible parts at the extremities of the protein. Second, it tests whether a linker containing two alpha helices connected at a right angle allows the circularization. Finally, it searches for possible linkers with three angle points. The next parts explain those three steps in detail.<br />
This method was chosen because it could be implemented easily and efficiently in our program. However, this strategy generates paths that cross the protein. Therefore we put considerable effort into sorting out such paths.<br />
===Step 1===<br />
As a simple rigid linker with no angle would be easier to design and likely more thermostable than one containing angles, the software first checks whether this simple solution is possible ###figure needed###. For this, we took into account the fact that proteins have some flexible amino acids at their extremities. This flexible part may come from the protein itself, but also from the 2 glycines that are included at the N-terminal end and from the extein at the C-terminal end. The latter two parts come from our linkers. These flexible parts have no preferred angles and offer a large number of possibilities to insert fitting linkers. But this flexibility is also a drawback, as we have to include this large number of possible angles and lengths in our path search.<br />
In this first step, the software explicitly takes these flexible parts into account to check for the possibility of straight linkers. As the angles and the lengths of the flexible parts are variable, the software positions their extremities on a sphere centered on the last fixed position of the structure, as explained above. The radius of this sphere is incremented in a discrete manner, in 4 steps, from 5.25 &Aring; to the maximum length of the flexible part. ###figure###<br />
Then all possible straight segments between the points generated from the two extremities are tested. If they are closer than 5 &Aring; to the protein, or if they cross it, they are rejected. Otherwise, the software checks whether the length of the segment is compatible with the feasible alpha helices: if the length of a given segment equals one of the 8 alpha helix lengths plus or minus 0.75 &Aring;, the path is saved.<br />
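The length test at the end of Step 1 can be illustrated with a short sketch. The 8 values in `HELIX_LENGTHS` are hypothetical placeholders, since the real rod lengths come from the modeling and are not listed here; only the tolerance of 0.75 &Aring; is taken from the text.<br />

```python
import math

# Hypothetical placeholder values in angstroms, NOT the real rod lengths.
HELIX_LENGTHS = [15.0, 18.0, 21.0, 24.0, 27.0, 30.0, 33.0, 36.0]
TOLERANCE = 0.75  # angstroms, as stated in the text

def matching_helix_lengths(p, q, helix_lengths=HELIX_LENGTHS, tol=TOLERANCE):
    """Return the helix lengths that fit the straight segment from p to q."""
    d = math.dist(p, q)
    return [h for h in helix_lengths if abs(h - d) <= tol]
```

A segment is kept as a candidate path when the returned list is not empty. The check for collisions with the protein is a separate step and is not part of this sketch.<br />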
===Step 2===<br />
The next possibility for designing more complex rigid linkers, while still taking the flexible ends into account within a reasonable calculation time, was to reduce the number of possible angles. As we originally thought that a 90° angle would be practically feasible, the software was designed to generate linkers with flexible ends and one 90° angle. ###figure, protein_one_90° angle### This choice was notably made because the edge lengths of a right triangle are simple to calculate. We already saw in Step 1 that the length of an edge can only take 8 different values. As the linkers have to start from the extremities of the protein, and as we impose a right angle, the number of possible paths is low, making them easy to compute. In practice, the extremities of the protein are positioned in a flexible way as in Step 1. From each of the positions allowed by this flexibility, the software searches for all the allowed right triangles. The main reason for this restriction was that the degrees of freedom had to be limited to keep the calculations feasible. <br />
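To illustrate why the right angle keeps the search small: for two end positions separated by a given gap, only those pairs of rod lengths whose hypotenuse matches the gap need to be considered. The sketch below assumes hypothetical placeholder rod lengths and reuses the 0.75 &Aring; tolerance of Step 1, which is an assumption for this step.<br />

```python
import math

# Hypothetical placeholder values in angstroms, NOT the real rod lengths.
HELIX_LENGTHS = [12.0, 15.0, 16.0, 20.0, 24.0, 28.0, 32.0, 36.0]

def right_angle_pairs(gap, helix_lengths=HELIX_LENGTHS, tol=0.75):
    """Return (a, b) rod length pairs that span `gap` when joined at 90 degrees.

    By Pythagoras, two rods of lengths a and b meeting at a right angle
    connect two points separated by sqrt(a**2 + b**2).
    """
    return [(a, b) for a in helix_lengths for b in helix_lengths
            if abs(math.hypot(a, b) - gap) <= tol]
```

With at most 8 x 8 = 64 pairs to test per gap, this enumeration is cheap. Placing the corner point in space so that both rods stay clear of the protein is not covered here.<br />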
===Step 3===<br />
Finally, the software can also find paths with up to 4 edges, i.e. 4 alpha helices and 3 angles. Thanks to the modularity of the linkers, such paths can in theory circularize any kind of protein. ###figure of torus###<br />
To keep the calculation time reasonable, we designed the search strategy so that the flexible parts at the extremities are oriented in the same direction as the adjacent alpha helix. This obviously restricts the search, but as these orientations are allowed for the flexible parts, the approach remains valid. <br />
First, the potential end points of the first alpha helical rod are calculated from the N-terminal point of the protein. The orientation is chosen in a discrete manner, with an increment of 5 degrees for the two angles of the spherical coordinates. The distance from the origin takes one of the 8 lengths allowed by the alpha helices, as already seen in Step 2, or a length of 0, which mimics a linker with 3 instead of 4 edges. The same procedure is repeated to define all the potential end points of the second alpha helical rod, starting from every possible end point of the first rod. Because both the first and the second rod may have a length of 0, the software also calculates paths with 2 edges. Then, the same is done only once from the C-terminal point of the protein, defining 1 edge. The final step consists in checking whether the points originating from the N- and C-terminal points can be linked by a potential alpha helix, i.e. whether they are separated by an appropriate distance. If any of the possible alpha helix lengths lies within 0.75 &Aring; of the distance between two points, the path is saved. Likewise, if two points are less than 0.75 &Aring; apart, the path is also saved.<br />
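The final joining test of Step 3 can be sketched as a pairwise comparison of the points grown from the N-terminal side with those grown from the C-terminal side. The function name `join_paths` and the example rod lengths are hypothetical; the sketch also assumes the rod lengths differ by more than 1.5 &Aring;, so that at most one length can match a given distance.<br />

```python
import math

def join_paths(n_points, c_points, helix_lengths, tol=0.75):
    """Pair up points from both termini that a last rod (or nothing) can connect.

    Returns (n_point, c_point, rod_length) triples; rod_length is 0.0 when the
    two points are already closer than `tol` and no connecting helix is needed.
    """
    paths = []
    for p in n_points:
        for q in c_points:
            d = math.dist(p, q)
            if d < tol:
                paths.append((p, q, 0.0))
                continue
            for h in helix_lengths:
                if abs(h - d) <= tol:
                    paths.append((p, q, h))
                    break
    return paths
```

This brute-force double loop is quadratic in the number of points, which is consistent with the large number of paths mentioned in the text; a spatial index would be one way to speed it up, but that is beyond this sketch.<br />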
==Sorting out of paths==<br />
The previous part described the generation of paths that can connect the two extremities of the protein irrespective of the position of these paths relative to the protein. While this allows fast computation of the geometrical paths, it also implies that paths that are not practically feasible need to be sorted out. This is the most time-consuming part of the computation, as about 1 billion paths are generated. Three criteria are considered for the sorting. The first one is the feasibility of the linker: can the software find angle patterns that correspond to the angles defined by the geometrical path? This question was part of the motivation for a large modeling effort (link) to determine the possible angles between consecutive alpha helices. This was achieved by analyzing the distribution of angles between alpha helices found in the ArchDB database (link). As nearly any angle between 20 and 170 degrees could be found, only few paths were actually rejected at that step. The next criterion is the position of the angle points: if one of them lies inside the protein, the path is rejected. Finally, the software checks whether any atom of the protein is less than 5 &Aring; away from any of the alpha helices; if so, the path is also rejected.<br />
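The third criterion amounts to a point-to-segment distance test for every rod of the path. The following is a minimal sketch with hypothetical function names; how the real software treats the atoms right next to the termini, which are necessarily close to the first and last rod, is not described here and is left out.<br />

```python
import math

def point_segment_distance(p, a, b):
    """Shortest distance between point p and the segment from a to b."""
    ab = [bi - ai for ai, bi in zip(a, b)]
    ap = [pi - ai for ai, pi in zip(a, p)]
    denom = sum(c * c for c in ab)
    # Clamp the projection parameter to [0, 1] so we stay on the segment.
    t = 0.0 if denom == 0 else max(0.0, min(1.0, sum(x * y for x, y in zip(ap, ab)) / denom))
    closest = [ai + t * c for ai, c in zip(a, ab)]
    return math.dist(p, closest)

def path_is_clear(angle_points, atoms, min_dist=5.0):
    """True if no atom is closer than min_dist to any rod of the path."""
    for a, b in zip(angle_points, angle_points[1:]):
        if any(point_segment_distance(atom, a, b) < min_dist for atom in atoms):
            return False
    return True
```

Returning as soon as one rod fails avoids testing the remaining rods, which matters when about 10^9 paths have to be checked.<br />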
==Shifting paths to the patterns==<br />
The strategy described in Step 3 gives a certain freedom to the rod that connects the last two angle points generated from the N- and C-terminal points. As this freedom is actually not permitted by the alpha helix and the angle pattern, but is permitted by the flexible part, for example at the C-terminal end, the software slightly refines the path by rotating the segment that originates from the C-terminal point.<br />
==Weighting of paths==<br />
Before translating the paths into sequences and thus into linkers that can be expressed, each path needs to be evaluated for its potential capacity to enhance heat stability. For this we have identified different contributions that should be combined into one value that defines how good the linker may be and that consequently defines its ranking among all possible linkers. The smaller this value, the more we expect that the linker will enhance thermostability. An important step for all these contributions is the normalization, as explained in the next paragraphs.<br />
The first contribution we considered was the linker length. We assumed that a short linker is better to constrain the protein extremities, and that a long helix might give more flexibility. Because we wanted this value to be independent of the size of the protein, the length of the linker is normalized to the distance between the two termini.<br />
The second contribution relates to the angles used in the linker. We learned from the [[modeling ###link###]] that angles formed by a certain angle pattern follow a certain distribution. First, we assumed that the narrower the distribution, the more likely the alpha helices would actually produce this angle. Second, the angles found by the software should be as close as possible to those well-defined angles. In this case, the weight value from this contribution should be low.<br />
Then the distance of the linker to the protein is taken into account. Because the linkers should not disturb the protein in its normal environment, linkers that pass close to the protein surface are considered better linkers. The distance was defined as the minimal distance between the linker and all the atoms of the protein. As already mentioned for the sorting of the paths, a linker cannot come closer than 5 &Aring; and this distance was used for normalization of calculated distances.<br />
After this, the places a linker should avoid are calculated. Each protein can interact with other molecules on some parts of its surface. The user can specify where those parts are and how big they are. If a linker passes in front of a potential molecule binding domain, the value of the corresponding path goes to infinity, so that the linker is discarded. Conversely, the farther a linker is from a potential ligand binding domain, the smaller its weighting value. The user can also specify the importance of certain regions. In the end, the total weighting is normalized to the number of binding domains.<br />
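Two of these contributions are simple enough to write down directly. The sketch below assumes that "normalized" means a plain ratio, which is our reading of the text rather than a confirmed detail of the implementation; the function names are hypothetical. The angle and forbidden-region contributions are not sketched, because their exact functional form is not given here.<br />

```python
import math

def length_contribution(rod_lengths, n_term, c_term):
    """Total linker length divided by the distance between the two termini."""
    return sum(rod_lengths) / math.dist(n_term, c_term)

def distance_contribution(min_linker_atom_distance, min_allowed=5.0):
    """Smallest linker-to-atom distance divided by the 5 angstrom minimum.

    A linker that hugs the surface at exactly 5 angstroms scores 1.0;
    linkers passing farther away score higher, i.e. worse.
    """
    return min_linker_atom_distance / min_allowed
```

Both values are dimensionless and smaller is better, in line with the convention used for the final weight.<br />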
===Calibrating the weighting function===<br />
Every contribution has its own distribution. You can see an example in figure ### [[figure histogram_length_lys.png]], but all of them have different shapes. The aim is to find the paths that minimize all of these contributions. Therefore, the weighting function combines the four contributions in a linear manner:<br />
\[ W(p) = \alpha L(p) + \beta A(p) + \gamma D(p) + \delta u(p) \]<br />
where W is the final weighting, p the path, L the length contribution, A the angle contribution, D the distance contribution and u the contribution from the forbidden regions. $\alpha, \beta, \gamma, \delta$ are the weighting constants that needed to be found. The normalization of each contribution was chosen so that each of them is dimensionless and all have reasonably similar values.<br />
The weighting constants were obtained from the [[linker-screening ###lin]] performed with lysozyme and the [[enzyme-modeling ###link]].<br />
Please see [[below ###link]] for a detailed explanation of how the values were obtained.<br />
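The linear combination W(p) and the resulting ranking can be sketched in a few lines. The default constants of 1.0 are placeholders only, since the calibrated values of $\alpha, \beta, \gamma, \delta$ are not given in this section, and the function names are hypothetical.<br />

```python
import math

def path_weight(L, A, D, u, alpha=1.0, beta=1.0, gamma=1.0, delta=1.0):
    """W(p) = alpha*L(p) + beta*A(p) + gamma*D(p) + delta*u(p); lower is better."""
    return alpha * L + beta * A + gamma * D + delta * u

def rank_paths(contributions, **constants):
    """Return path names sorted by increasing weight.

    `contributions` maps a path name to its (L, A, D, u) tuple. Paths whose
    weight is infinite (linker crossing a forbidden region) are dropped.
    """
    weights = {name: path_weight(*c, **constants) for name, c in contributions.items()}
    finite = {name: w for name, w in weights.items() if math.isfinite(w)}
    return sorted(finite, key=finite.get)
```

For example, `rank_paths({"a": (1.5, 0.2, 1.0, 0.1), "b": (1.2, 0.1, 1.0, 0.0)})` puts path "b" first because its summed weight is smaller.<br />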
==Translating paths to sequence==<br />
As mentioned before, the software is provided with two databases, one for the possible angle patterns and one for the helix patterns. The choice of patterns was inspired by known crystal structures extracted from databases and described in different papers.<br />
A large in silico screening to refine the preferences among the patterns was then set up using the [[iGEM@home|###Link]] system. For a complete description of the search for suitable patterns, see the [[modeling|###Link to patternspart]] page. <br />
All the possible paths are now split up at the angles and compared with the possible patterns in the databases. ###Figure needed to explain, how the path is translated### The most suitable patterns are identified and concatenated to build the path's sequence. It is important to note that this is only possible because of the modularity of the linker patterns used as building blocks: each block, whether an alpha helix or an angle pattern, is not affected by the others. Thus, for each possible path, one sequence is produced.<br />
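The translation step can be pictured as a lookup followed by a concatenation. In the sketch below, both databases are hypothetical stand-ins with obvious placeholder strings instead of amino acid sequences, and "most suitable" is simplified to "nearest length" and "nearest mean angle"; the real selection also uses the pattern preferences from the screening.<br />

```python
# Hypothetical stand-ins for the two pattern databases (placeholders, not sequences).
HELIX_PATTERNS = {15.0: "<helix-15>", 18.0: "<helix-18>", 21.0: "<helix-21>"}
ANGLE_PATTERNS = {60.0: "<angle-60>", 90.0: "<angle-90>", 120.0: "<angle-120>"}

def translate_path(rod_lengths, angles):
    """Concatenate helix and angle blocks in the order they occur along the path.

    rod_lengths: lengths of the consecutive rods (angstroms)
    angles: angles between consecutive rods (degrees), one fewer than the rods
    """
    if len(angles) != len(rod_lengths) - 1:
        raise ValueError("need exactly one angle between each pair of rods")
    parts = []
    for i, length in enumerate(rod_lengths):
        nearest_length = min(HELIX_PATTERNS, key=lambda h: abs(h - length))
        parts.append(HELIX_PATTERNS[nearest_length])
        if i < len(angles):
            nearest_angle = min(ANGLE_PATTERNS, key=lambda a: abs(a - angles[i]))
            parts.append(ANGLE_PATTERNS[nearest_angle])
    return "".join(parts)
```

Because each block is independent of its neighbours, the translation needs no context beyond the block itself, which is what makes a simple concatenation sufficient.<br />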
==Clustering of paths==<br />
Many different paths are represented by the same sequence [[###Figure that shows, different paths have same properties, already before in the text ###]] and we therefore clustered such paths. The weight of each cluster was then calculated by averaging the weights of the paths that compose it.<br />
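A minimal sketch of this clustering, assuming each path is available as a (sequence, weight) pair; the function name `cluster_paths` is hypothetical.<br />

```python
from collections import defaultdict

def cluster_paths(paths):
    """Group (sequence, weight) pairs by sequence and average the weights."""
    groups = defaultdict(list)
    for sequence, weight in paths:
        groups[sequence].append(weight)
    return {sequence: sum(ws) / len(ws) for sequence, ws in groups.items()}
```

Each key of the returned dictionary is one candidate linker sequence, and its value is the mean weight of all geometrical paths that translate into it.<br />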
=Results=<br />
==DNMT1==<br />
A major motivation of our effort to design rigid linkers with angles was the circularization of the DNA methyltransferase Dnmt1 (link). The truncated form used in our project is composed of 900 amino acids, and its N- and C-terminal extremities are well separated. To circularize it, two linkers were designed: a flexible one made of glycine and serine, and a rigid one designed by the software. The rigid linkers for DNMT1 were obtained from an early version of the software. At that time, the calculation took 11 days on a laptop computer with an Intel i5 processor and 8 GB of RAM, which shows the importance of a distributed computing system for large proteins. Since then, the software has improved considerably, reducing the calculation time for DNMT1 to about 1 day.<br />
IMPORTANT!<br />
paths-> geometrical<br />
linkers-> helices...<br />
sequences-> seq of linkers<br />
insist on huge number of data produced<br />
We definitely need to cite [0]!!!<br />
=Abstract=<br />
=Overview=<br />
As already introduced [###link to circularization idea in the toolbox], artificially circularized proteins may gain some heat stability because the C- and N-termini are restrained from moving around freely. This circularization may be trivial when the protein termini are very close to each other, which seems to be reasonably common [[#References|[1]]]. However, if the ends are too far from each other, a long linker is needed to connect them. This linker should not change the natural conformation of the protein and should constrain the relative position of the ends, restricting their degrees of freedom and thus stabilizing the structure even when heated. In addition, these linkers should not affect any of the protein's functions. Consequently, it is important to prevent linkers from, for example, passing through the active site or covering domains that bind other molecules. Therefore one needs to be able to define the shape of possible linkers. This section describes the software we developed to design such linkers. We would like to stress that this work was made possible by the feedback between computer modeling and experimental work: we could first design linkers in silico, test them experimentally and use the results to further calibrate the software. To our knowledge, this is the first time that such an approach has been used to custom-design linkers that connect protein extremities.<br />
In short, the software provides a weighted list of linkers to circularize any protein of interest with a known structure. Those linkers are made of rigid alpha helical segments connected at defined angles. In contrast to flexible linkers, those rigid linkers were expected to constrain the protein extremities and to confer better heat stability. To generate those linkers, we first defined the geometrical paths, with segments and angles, that they should follow. The geometrical paths that are biologically feasible are afterwards translated into amino acid sequences. Both the compatibility of paths with possible structures and the translation were made possible by our modeling approaches. The first approach consisted in performing a statistical analysis of more than 17000 known non-homologous structures containing alpha helices connected with angles. For the second approach, we modeled the conformation of linkers circularizing proteins of known structure and analyzed them for certain properties. This second approach was run for a large number of proteins thanks to our distributed computing system igemathome. The software provides different possible linkers with weights that rank the linkers according to their capacity to maintain protein activity at higher temperatures. These weights were generated in an extensive [[linker screening| ### ]] on the target protein lambda-lysozyme, using the first modeling approach.<br />
The runtime stability of the software was checked in a large test run on [[iGEM@home]]. <br />
Furthermore, our software provided linkers for circularizing [[DNMT1| ###]], which could be made more heat-stable by circularization???.<br />
For detailed information on the implementation and the practical use of the software, please see the [[documentation software-docu]] page.<br />
==What does the software do==<br />
still missing [[figure 0, graph abstract]]<br />
=Background=<br />
What are linkers, why should ours work better?<br />
Linkers, References to the modeling, <br />
Classically, protein linkers were designed in three different manners. ###REFERENCE### The easiest way is to define the length that a linker should cover and then simply use a flexible glycine-serine peptide with the right number of amino acids to match this length. Glycine is used for flexibility, as it has no side chain and does not produce any steric hindrance, while serine is used for solubility, as it has a small polar side chain. This solubility is important, as the linkers should not pass through the hydrophobic core of the protein, but should be dissolved in the surrounding medium. These flexible linkers were normally used for circularization but also for connecting different proteins, when what matters is that the different parts are connected rather than how they are connected, or when the flexibility of the linker was required for specific applications.<br />
A second strategy consists in using rigid helical linkers to keep proteins or protein domains at a certain distance from each other. This is especially important for signalling proteins and fluorescent proteins. ###TODO: Reference### One major property of alpha helices is that they always fold in a defined way with well-defined angles and lengths. There are also many different helical patterns that differ in stability and solubility. One big disadvantage of this strategy is that one can only build straight linkers with helices. So in the context of circularization, if the straight line connecting the protein extremities crosses the protein, this strategy is not an option.<br />
The third option, which served as a basis for our approach, consists in designing custom-tailored linkers for each specific application. These linkers can be obtained from protein structure prediction. First, one needs to define the path that the linker should take to connect two amino acids. Afterwards one designs a possible linker sequence that might fit well. Next one makes a structure prediction of the linker attached to the proteins to validate the design. Several different linkers, with slight changes, can be compared. This is repeated several times until the linker effectively follows the expected path. ###TODO: Reference, WADE paper### This method is time-consuming, as it is not only computationally intensive but also requires a strong knowledge of protein folding and protein structure prediction. On the other hand, the benefit can be substantial, as the interaction of the linker with the protein's surface can be taken into account and as one can define the path taken by the linker accurately, down to the resolution of the protein structure.<br />
We have set up a completely new strategy to design rigid linkers. As further detailed in the [[###link###|modelling]] part, it is possible to define the shape of a linker by combining rigid alpha-helical rods with well-defined angle patterns. Therefore, by defining, in a geometrical way, the possible paths of the circularizing linkers for a given protein, we can then propose potential linkers. This definition of the geometrical path can be very difficult, especially for large proteins with complex shapes. Moreover, this definition is further constrained by the fact that linkers must avoid hiding active sites of the protein of interest. Finally, the paths have rotational degrees of freedom at the extremities of the protein, and depending on their orientation, they may or may not match the geometry of the protein. The tool we present here covers the two steps: defining geometrical paths with some weights and translating them into feasible linkers, also with weights. This tool is universal, as it has the capacity to design circularizing linkers for any protein with a known structure. Moreover, it is modular as, thanks to our modeling approach [link], we have designed linkers as exchangeable blocks of rods of different lengths and of angle patterns. The following sections detail the different steps followed by our software to design proper linkers.<br />
But even taking all this into account by hand, one could never consider all the paths that the same sort of linker is also able to take because of the rotational degrees of freedom inherent in the linker model. ###figure needed###<br />
Thus we decided to provide the scientific community with a powerful open-source software that, for every protein with a given structure, can calculate the sequence of the rigid linker needed to circularize the protein with minimal inhibition of the protein's function. <br />
Until now each scientist had to estimate the length of the linker themselves, ###check for flexlinker### so our software is a completely novel approach to circularization.<br />
=General procedure=<br />
Our general approach: rods and angles are treated in a discrete manner, so there are only finitely many possibilities to connect the ends, and all of these are checked. <br />
I think this paragraph is just too much, as anyway there was already an overview before.<br />
At first the protein structure is analysed in a geometrical way: paths composed of no more than four straight segments connected by angles are computationally generated. A path is always represented by straight lines and connecting angles between them. ###In the end all these paths should be sorted by how well it would be, if we circularize the protein using this path.### But as the final weighting is quite computationally intensive, unsuitable paths first need to be sorted out. A path is only sorted out if it breaches any rule for linkers. For example, paths should never pass through the protein. After all the paths have been generated, they are refined by shifting the angle points according to the underlying linker model, so that no paths are taken into account that could not be built with our building blocks.<br />
Once all paths have been correctly generated, they are weighted by several factors. These factors are then combined into a single weight per path that reflects how good this particular path is. Then the path is translated, using our amino acid patterns, into an amino acid sequence. But as one sequence can follow more than one path, all the paths that can be built by the same sequence are clustered. In the end the average of the weights within each cluster is calculated, so that every linker receives only one weighting value, with contributions from all the paths this linker can possibly take.<br />
==PDB analysis==<br />
At first, the PDB file containing the structure of the target protein is parsed and the coordinates of the atoms are stored in metric units. After this, some initial tests are made on the protein structure. First, we checked whether the C- and N-termini lie on the surface of the protein and are accessible to the solvent, which is crucial for circularization. We defined a line originating from an extremity of the protein by the two angles of the spherical coordinates around the z-axis. From that, we could determine the accessible angles by rejecting all the lines that are too close to the protein. As the future linker will be made of alpha helices and will therefore have a radius of 5 &Aring;, we used this length as the minimal allowed distance.<br />
Those allowed angles are stored for the subsequent linker generation. ###fig needed### <br />
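The accessibility test described above can be sketched in a few lines of Python. This is only a minimal illustration and not necessarily the code used in the software: the function names, the ray length of 20 &Aring; and the representation of atoms as coordinate tuples are assumptions made for this example, and atoms belonging to the terminus itself are assumed to have been removed from the list beforehand.<br />

```python
import math

def direction(theta_deg, phi_deg):
    """Unit vector for polar angle theta (from the z-axis) and azimuth phi."""
    t, p = math.radians(theta_deg), math.radians(phi_deg)
    return (math.sin(t) * math.cos(p), math.sin(t) * math.sin(p), math.cos(t))

def ray_clearance(origin, unit, atom, ray_length):
    """Distance from an atom to the closest point of a finite ray."""
    rel = [a - o for a, o in zip(atom, origin)]
    # Project the atom onto the ray and clamp to the extent of the ray.
    s = max(0.0, min(ray_length, sum(r * u for r, u in zip(rel, unit))))
    closest = [o + s * u for o, u in zip(origin, unit)]
    return math.dist(atom, closest)

def accessible_angles(terminus, atoms, min_dist=5.0, ray_length=20.0, step=5):
    """Return the (theta, phi) pairs whose ray stays min_dist away from all atoms.

    ray_length is a hypothetical cut-off chosen for this sketch.
    """
    allowed = []
    for theta in range(0, 181, step):
        for phi in range(0, 360, step):
            unit = direction(theta, phi)
            if all(ray_clearance(terminus, unit, atom, ray_length) >= min_dist
                   for atom in atoms):
                allowed.append((theta, phi))
    return allowed
```

With a single atom placed 10 &Aring; above the terminus on the z-axis, the direction pointing straight at the atom is rejected while the opposite direction is kept.<br />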
==Generation of geometric paths==<br />
As our strategy consists in building linkers with helical rods and connecting angles, a path is completely defined by the coordinates of the angle points. Advancing one step from an existing point is always done by adding a displacement vector to this point. This vector is defined by the two spherical angles, chosen here in a discrete manner with an increment of 5 degrees, and by a length, also chosen in a discrete manner. This discrete length was used in two different contexts: it may correspond to the length of an alpha helix or to the length of the flexible part that appears at the extremity of the protein. The coordinates that are reached with this vector define the new coordinates of an angle point, no matter whether the vector corresponds to an alpha helix or to a flexible part. ###figure, points + vectors###. As we screen all possible angles in a discrete manner, those angle point coordinates are regularly distributed on a sphere. As further detailed in the next sections, those spheres are defined from both ends, either once or in several steps. Then the software checks for possible straight connections of given lengths for each pair of angle points originating from the two extremities. ###figure needed, with all these possible steps, one sphere around each point and checking for the connections, 2D, different lengths in two different figures, ###<br />
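The displacement step can be illustrated with the following sketch, which assumes that points are plain (x, y, z) tuples in &Aring; and that angles are given in degrees. The names displace and sphere_points are hypothetical and chosen only for this example.<br />

```python
import math

def displace(point, length, theta_deg, phi_deg):
    """Add a displacement vector given by a length and two spherical angles."""
    t, p = math.radians(theta_deg), math.radians(phi_deg)
    return (point[0] + length * math.sin(t) * math.cos(p),
            point[1] + length * math.sin(t) * math.sin(p),
            point[2] + length * math.cos(t))

def sphere_points(center, length, step=5):
    """All candidate angle points at a given distance, on a 5-degree grid.

    Note that at the two poles (theta = 0 and 180) every phi gives the same
    point; this sketch does not remove those duplicates.
    """
    return [displace(center, length, theta, phi)
            for theta in range(0, 181, step)
            for phi in range(0, 360, step)]
```

Calling sphere_points once per allowed length from each terminus gives the two sets of candidate points between which straight connections are then tested.<br />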
The linkers are built in a modular way, with blocks of well-defined size. From the modeling of potential linkers [link], we could derive 8 different alpha helical rods, all with different lengths. In addition, the length of the two segments inside an angle block was always 8 &Aring;, so exchanging angle blocks does not affect the length of the linker. This means that the distance between the angle points is well defined, an essential aspect of our strategy of linker design.<br />
The software proceeds in three steps. First, it checks for the possibility of a direct single alpha helix linker. For this, it applies the procedure just mentioned with spheres whose radius corresponds to the length of the short flexible parts at the extremity of the protein. Second, it tests whether a linker containing two alpha helices connected with a right angle allows the circularization. Finally, it searches for possible linkers with three angle points. The next parts explain those three steps in detail.<br />
This method has been chosen because it could be implemented easily and efficiently in our program. However, this strategy generates paths that cross the protein. Therefore we put considerable effort into sorting out the paths.<br />
In the [[###link###|modelling]] part we have already described the patterns we are using for building up our linkers. The modularity was crucial for the success of the software. Therefore the angle patterns were always chosen so that the end of the angle pattern was 8 &Aring; away from the turning point ###figure needed###. Thus displacements only have to be made with a certain length, which depends neither on the direction in which the linker arrives nor on the direction in which it will continue. This modularity makes the calculations more efficient than generating points randomly would be.<br />
We still have to rephrase that.<br />
===Step 1===<br />
As a simple rigid linker with no angle would be easier to design and likely more thermostable than the ones containing angles, the software first checks if this simple solution is possible ###figure needed###. For this, we took into account the fact that proteins have some flexible amino acids at their extremities. This flexible part may come from the protein itself, but also from the 2 glycines that are included at the N-terminal part and from the extein at the C-terminal part. Those two latter parts come from our linkers. These parts have no preferential angles and offer a large number of possibilities to insert fitting linkers. But this flexibility is also a drawback, as we have to include this large number of possible angles and lengths in our path search.<br />
In this first step, the software explicitly takes these flexible parts into account to check for the possibility of straight linkers. As the angles and the length of the flexible parts are variable, the software positions their extremity on a sphere centered on the last fixed position of the structure, as explained above. The radius of this sphere is incremented in a discrete manner, in 4 steps, from 5.25 &Aring; to the maximum length of the flexible part. ###figure###<br />
Then all possible straight segments between the first points and the last points are tested. If they are closer than 5 &Aring; to the protein, or if they cross it, they are rejected. If they are kept, the software checks whether the length of the segment is compatible with the feasible alpha helices: if the length of a given segment equals one of the 8 alpha helix lengths plus or minus 0.75 &Aring;, then the path is eventually saved.<br />
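The two tests applied to a straight segment (clearance from the protein and compatibility with a helix length) might look as follows. This is a sketch under stated assumptions: the rod lengths in HELIX_LENGTHS are invented placeholders, since the 8 modeled values are not listed here, and all function names are hypothetical.<br />

```python
import math

# Placeholder rod lengths in angstrom, invented for this example only.
HELIX_LENGTHS = [10.0, 15.0, 20.0, 25.0]

def point_segment_distance(p, a, b):
    """Shortest distance from point p to the segment between a and b."""
    ab = [y - x for x, y in zip(a, b)]
    ap = [y - x for x, y in zip(a, p)]
    denom = sum(c * c for c in ab)
    t = 0.0 if denom == 0 else max(0.0, min(1.0, sum(u * v for u, v in zip(ap, ab)) / denom))
    closest = [x + t * c for x, c in zip(a, ab)]
    return math.dist(p, closest)

def matching_helix(length, helix_lengths=HELIX_LENGTHS, tol=0.75):
    """Return the rod length within +/- tol of the segment length, else None."""
    best = min(helix_lengths, key=lambda h: abs(h - length))
    return best if abs(best - length) <= tol else None

def keep_segment(a, b, atoms, helix_lengths=HELIX_LENGTHS, min_dist=5.0, tol=0.75):
    """Reject segments that come too close to any atom, then match a rod length."""
    if any(point_segment_distance(atom, a, b) < min_dist for atom in atoms):
        return None
    return matching_helix(math.dist(a, b), helix_lengths, tol)
```

A segment that crosses the protein necessarily comes closer than 5 &Aring; to at least one atom, so in this sketch the clearance test also covers the crossing case.<br />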
" Therefore here another function for calculating the distance from the protein to the connection is used, that is more time-consuming, but also more accurate, than the ones used normally. " ??<br />
"This is done with a higher accuracy, because none of these linkers should be lost by error". ??<br />
Two functions have so far been included for generating linkers that take the flexibility of the ends into account.<br />
The only variable in such a custom-tailored linker is the length of the helix. Most likely no suitable linker can be found here, because if there is some obstacle between the ends, the linker cannot bend around it. But if such a linker is possible, it should be found, because it will be one of the best linkers predicted. <br />
So for all accessible angles that were calculated before, we calculate all points from the N-terminus and from the C-terminus that lie in certain distances from the terminus. The points are spread over varying distances from ; to the .<br />
===Step 2===<br />
The next possibility for designing more complex rigid linkers, while still taking flexible ends into account within a reasonable calculation time, was to reduce the number of possible angles. As we originally thought that a 90° angle would be practically feasible, the software was designed to generate linkers with flexible ends and one 90° angle. ###figure### This choice was notably made because of the simplicity of calculating the edge lengths of right triangles. We already saw in Step 1 that the length of an edge can only take 8 different values. As the linkers have to start from the extremities of the protein, and as we impose a right angle, the number of possible paths is low, making them easy to compute. In practice, the extremities of the protein are positioned in a flexible way as in Step 1. From each of the positions allowed by this flexibility, the software searches for all the allowed right triangles. <br />
some words on degree of freedom...<br />
###newly written###<br />
Again we only have discrete possibilities to build right triangles with our helical patterns. Therefore, at first all combinations of two helical patterns are searched for that could build a right triangle whose hypotenuse has the length of the distance between the termini.<br />
Now we shift back to 3D and apply Thales's theorem, which says that when A, B and C lie on a circle and the line between A and C is a diameter of the circle, then the angle at B is a right angle. ### fig, thales### Thus in 3D we can discretize the positions where the angle point (B) can lie relative to the starting point (A). This set is cross-checked against the possible right triangles found before, so that we only keep paths that can be built with our patterns.<br />
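As an illustration of this search, the sketch below lists the pairs of rod lengths that close a right triangle over a given hypotenuse and, for one such pair, gives the circle on which the angle point B must lie. The function names are hypothetical, and reusing the 0.75 &Aring; tolerance of Step 1 for the hypotenuse is an assumption of this example.<br />

```python
import math
from itertools import product

def right_triangle_pairs(hypotenuse, helix_lengths, tol=0.75):
    """Ordered pairs (a, b) of rod lengths with sqrt(a^2 + b^2) close to the hypotenuse."""
    return [(a, b) for a, b in product(helix_lengths, repeat=2)
            if abs(math.hypot(a, b) - hypotenuse) <= tol]

def apex_circle(a, b):
    """Locus of the right-angle vertex B for legs a (from A) and b (from C).

    Returns (offset, radius): B lies on a circle around the axis AC whose
    centre is `offset` away from A along AC and whose radius is the
    altitude of the triangle.
    """
    c = math.hypot(a, b)
    return a * a / c, a * b / c
```

Every point of that circle lies on the Thales sphere whose diameter is AC, so the candidate positions of B can be obtained by sampling the circle in discrete rotation steps.<br />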
Like in Step 1 the software creates spheres of points around the start (first points) and around the end (last points). Then for each point from the first points<br />
===Step 3===<br />
Finally the software also provides the possibility to find paths with up to 4 edges, meaning 4 alpha helices and 3 angles. Thanks to the modularity of the possible linkers, such paths can offer the possibility to circularize theoretically any kind of protein. ###figure of torus###<br />
To keep the calculation feasible in a reasonable time, we designed the search strategy so that the flexible parts at the extremities are oriented in the same direction as the adjacent alpha helix. This obviously restricts the search, but as these orientations are allowed for the flexible part, the approach remains valid. <br />
First, potential ending points of the first alpha helical rod are calculated from the N-terminal point of the protein. The orientation is chosen in a discrete manner, with an increment of 5 degrees for the two angles of the spherical coordinates. The distance from the origin corresponds to the 8 possible lengths allowed by the alpha helices, as already seen in Step 2, plus a length of 0, which mimics a linker with 3 instead of 4 edges. The exact same procedure is repeated to define all the potential ending points of the second alpha helical rod, starting from all the possible ending points of the first alpha helical rod. Thanks to the possibility of a length of 0 for the first and the second rods, the software also calculates paths with 2 edges. Then, the same is done only once from the C-terminal point of the protein, defining 1 edge. The final step consists in checking whether the points originating from the N- and C-terminal points can be linked by a potential alpha helix, i.e. whether they are separated by the appropriate distance. If any of the potential alpha helix lengths lies within the distance between two points plus or minus 0.75 &Aring;, then the path is eventually saved. In the same way, if two points are closer than 0.75 &Aring; to each other, then the path is also saved.<br />
one starts from omega, 2 from alpha<br />
Thus even one of the worst shapes for circularization, a torus with the two termini in the pits, could be circularized with our linker system without needing infinitely long straight segments. On the other hand, this was the maximum number of possibilities that was still feasible to calculate.<br />
Here the flexible parts of the protein are assumed to point in the same direction as the following helix. In this way the number of possibilities is kept fixed, but of course this is quite a rough estimate.<br />
Let's check the next paragraph together in detail<br />
At first, points at the right distances from each end are generated. ###figure needed, explains point names### Then, for each of the first points, next points are generated in all directions in angle steps of 5 degrees. Of these, all points that do not fit are immediately sorted out. Then all connections from the second points to the last points are checked for whether they have the right length. After this, the normal sorting steps are applied to these new connections.<br />
==Sorting out of paths==<br />
The previous part described the generation of paths that can connect the two extremities of the protein irrespective of the position of these paths relative to the protein. While this allows fast computation of the geometrical paths, it also implies that the paths that are not practically feasible need to be sorted out. This is the most time-consuming part of the computation, as about 1 billion paths are generated. Three criteria are considered for the sorting. The first one is the feasibility of the linker: can the software find angle patterns that correspond to the ones defined by the geometrical path? This question was part of the motivation for a large modeling effort (link) to determine the possible angles between consecutive alpha helices. This was achieved by analyzing the distribution of angles between alpha helices found in the ArchDB database (link). As nearly any angle between 20 and 170 degrees could be found, only few paths were actually rejected at that step. The next criterion is the position of the angle points: if they lie inside the protein, the path is rejected. Finally, the software checks whether any of the atoms of the protein is less than 5 &Aring; away from any of the alpha helices; if so, the path is also rejected.<br />
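The first sorting criterion can be sketched as a check on the angle at every angle point of a path. The convention used here, the interior angle between the two segments meeting at the point, is an assumption of this example, as are the function names; only the range of 20 to 170 degrees is taken from the text above.<br />

```python
import math

def vertex_angle(prev, vertex, nxt):
    """Interior angle in degrees between the two segments meeting at a vertex."""
    u = [p - v for p, v in zip(prev, vertex)]
    w = [n - v for n, v in zip(nxt, vertex)]
    cos = sum(x * y for x, y in zip(u, w)) / (math.hypot(*u) * math.hypot(*w))
    # Clamp against rounding errors before taking the arc cosine.
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))

def angles_feasible(path, lo=20.0, hi=170.0):
    """True if every angle point of the path has an angle inside [lo, hi]."""
    return all(lo <= vertex_angle(path[i - 1], path[i], path[i + 1]) <= hi
               for i in range(1, len(path) - 1))
```

Because this test only needs the coordinates of the angle points, it is cheap and can be run before the more expensive distance checks against all atoms of the protein.<br />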
==Shifting paths to the patterns==<br />
The strategy described in Step 3 gives a certain freedom to the rod that connects the last two angle points generated from the N- and C-terminal points. As this freedom is actually not permitted by the alpha helix and the angle pattern, but is permitted by the flexible part, for example at the C-terminal end, the software slightly refines the path by rotating the segment that originates from the C-terminal point.<br />
In particular because of the discretization, it can always happen that the generated paths do not fit the helical and angle patterns perfectly. Therefore, before translating them to the amino acid sequence, the paths need to be refined.<br />
Every path that has survived is analyzed step by step, starting from the starting point and always advancing by one angle point. For each step from point to point, the length of the step is calculated and compared to the lengths we can build with the helix patterns. If the step is too long, the next point is shifted towards the previous one until it fits. If it is too short, it is shifted away. These shifts do not exceed a certain length, so that the paths do not move too far and suddenly pass through restricted areas.<br />
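One refinement step could be written as below. The sketch assumes that the point is moved along the line joining it to the previous point, that the nearest buildable length is the target, and that a step needing a shift larger than max_shift is reported as None; the value of 2 &Aring; for max_shift and the function name are hypothetical.<br />

```python
import math

def snap_to_helix(prev, nxt, helix_lengths, max_shift=2.0):
    """Move nxt along the line prev -> nxt so the step matches a rod length.

    Returns the shifted point, or None if the nearest rod length would
    require a shift larger than max_shift (hypothetical limit).
    """
    d = math.dist(prev, nxt)
    target = min(helix_lengths, key=lambda h: abs(h - d))
    if d == 0 or abs(target - d) > max_shift:
        return None
    scale = target / d
    return tuple(p + (n - p) * scale for p, n in zip(prev, nxt))
```

A step that is too long is pulled towards the previous point and a step that is too short is pushed away, as in the description above.<br />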
This is also done before the weighting, so that the paths don't change after weighting anymore.<br />
==Weighting of paths==<br />
Before translating the paths into sequences and thus into linkers that can be expressed, each path needs to be evaluated for its potential capacity to enhance heat stability. For this we have identified different contributions that should be combined into one value that defines how good the linker may be and that consequently defines its ranking among all possible linkers. The smaller this value, the more we expect that the linker will enhance thermostability. An important step for all these contributions is the normalization, as explained in the next paragraphs.<br />
The first contribution we considered was the linker length. We assumed that a short linker is better to constrain the protein extremities, and that a long helix might give more flexibility. Because we wanted this value to be independent of the size of the protein, the length of the linker is normalized to the distance between the two termini.<br />
The second contribution relates to the angles used in the linker. We learned from the [[modeling ###link###]] that angles formed by a certain angle pattern follow a certain distribution. First, we assumed that the narrower the distribution, the more likely the alpha helices would actually produce this angle. Second, the angles found by the software should be as close as possible to those well-defined angles. In this case, the weight value from this contribution should be low.<br />
Then the distance of the linker to the protein is taken into account. Because the linkers should not disturb the protein in its normal environment, linkers that pass close to the protein surface are considered better linkers. The distance was defined as the minimal distance between the linker and all the atoms of the protein. As already mentioned for the sorting of the paths, a linker cannot come closer than 5 &Aring; and this distance was used for normalization of calculated distances.<br />
As linkers should also not be too close to the protein surface, this value is normalized with the minimal distance a linker should have from the surface. The distance is calculated as the minimal distance between the atoms of the protein and the connection.<br />
After this, the places a linker should avoid are calculated. Each protein can interact with other molecules on some parts of its surface. The user can specify where and how big those parts are. If a linker passes in front of a potential molecule binding domain, the value of the corresponding path goes to infinity, so that the linker is discarded. Conversely, the farther a linker is from a potential ligand binding domain, the smaller its weighting value. The user can also specify the importance of certain regions. In the end the total weighting is normalized to the number of binding domains.<br />
===Calibrating the weighting function===<br />
???keep?###<br />
Every contribution has its own distribution. You can see an example in figure ### [[figure histogram_length_lys.png]], but all of them have different shapes. The aim is to find the paths that minimize all of these contributions. Therefore in the weighting function the four mentioned contributions are combined in a linear manner:<br />
\[ W(p) = \alpha L(p) + \beta A(p) + \gamma D(p) + u(p) \]<br />
where W is the final weighting, p the path, L the length contribution, A the angle contribution, D the distance contribution and u the contribution from the forbidden regions. $\alpha, \beta, \gamma$ are the weighting constants that needed to be found. The normalization performed for each of the contributions was chosen so that each of them is dimensionless and that all have reasonably similar values.<br />
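The linear combination can be written directly as a small function. In this sketch the contributions are assumed to be already normalized numbers, and length_contribution shows one possible reading of the length normalization (path length divided by the distance between the two end points of the path); both function names are hypothetical.<br />

```python
import math

def length_contribution(path):
    """Total path length divided by the straight end-to-end distance."""
    total = sum(math.dist(a, b) for a, b in zip(path, path[1:]))
    return total / math.dist(path[0], path[-1])

def path_weight(L, A, D, u, alpha, beta, gamma):
    """W(p) = alpha*L(p) + beta*A(p) + gamma*D(p) + u(p); smaller is better.

    u is expected to be math.inf for a path that passes in front of a
    forbidden region, which makes the whole weight infinite.
    """
    return alpha * L + beta * A + gamma * D + u
```

Since an infinite u dominates the sum, forbidden paths automatically end up at the bottom of any ranking by ascending weight.<br />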
The weighting constants were obtained from the [[linker-screening ###lin]] performed with lysozyme and the [[enzyme-modeling ###link]]<br />
Please see [[below ###link]] for the detailed explanation, how the values were obtained.<br />
==Translating paths to sequence==<br />
As already mentioned before the software is provided with two databases, one for the possible angle patterns and one for the helix patterns. The choice of the patterns was inspired by known crystal structures extracted from databases and described in different papers.<br />
A huge in silico screening for refining the preferences of the patterns was then set up using the [[iGEM@home|###Link]] system. For the complete description of search for suitable patterns, one can read the [[modeling|###Link to patternspart]] page. <br />
All the possible paths are now split up at the angles and compared with the possible patterns in the databases. ###Figure needed to explain, how the path is translated### The most suitable patterns are identified and concatenated to build the path's sequence. It is important to note that this is only possible because of the modularity of our linker patterns used as building blocks: each block, be it an alpha helix or an angle pattern, is not affected by the others. Thus for each possible path, one sequence is produced.<br />
==Clustering of paths==<br />
Many different paths are represented by the same sequence [[###Figure that shows, different paths have same properties, already before in the text ###]] and we therefore clustered such paths. The weights for those clustered paths were then calculated by averaging the weights of the different paths that compose a cluster.<br />
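The clustering amounts to grouping paths by their sequence and averaging the weights in each group. A minimal sketch, assuming each path is given as a (sequence, weight) pair and using a hypothetical function name:<br />

```python
from collections import defaultdict

def cluster_weights(paths):
    """Average the weights of all paths that translate to the same sequence.

    paths: iterable of (sequence, weight) pairs.
    Returns a dict mapping each sequence to its mean weight.
    """
    groups = defaultdict(list)
    for sequence, weight in paths:
        groups[sequence].append(weight)
    return {seq: sum(ws) / len(ws) for seq, ws in groups.items()}
```

Sorting the resulting dictionary by value then gives the final ranking of linkers, with one entry per sequence.<br />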
=Results=<br />
==DNMT1==<br />
A major motivation of our effort to design rigid linkers with angles was the circularization of the DNA methyltransferase Dnmt1 (link). The truncated form used in our project is composed of 900 amino acids and the N- and C-terminal extremities are well separated. To circularize it, two linkers were designed: a flexible one made of glycine and serine, and a rigid one designed by the software. The rigid linkers for DNMT1 were obtained from an early state of the software. At that time the calculation took 11 days on a laptop computer with an Intel i5 processor and 8 GB of RAM, which shows the importance of a distributed computing system for large proteins. Since then the software has improved considerably, reducing the calculation time for DNMT1 to about 1 day.<br />
###DNMT1 could be made heatstable, still missing###<br />
==Feedback from wet lab==<br />
The results from the software for lysozyme were tested as described in the [[linker-screening]] part and evaluated as described in the [[enzyme modeling]] part. We have performed a large linker screening on 10 different lysozymes with different linkers. As the purpose of the lysozyme screen was the calibration of the software, those linkers were designed according to the four contributions previously mentioned. One of them was the shortest possible, one had the best possible angle, and so on.<br />
{| class="table table-hover"<br />
|+ '''Table 1''': Linkers and their amino acid sequences. Green: attachment sequences to prevent the flexible regions from being perturbed; Blue: angle; Purple: extein.<br />
! Linker<br />
! Amino acid sequence<br />
|- <br />
|<br />
|<br />
|-<br />
|'''Very good linkers'''<br />
|<br />
|-<br />
| sgt2<br />
| <span style="color:#3ADF00;">GG</span>AEAAAK<span style="color:#00BFFF;">AAAHPEA</span>AEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| rigid<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAKAA<span style="color:#3ADF00;">P</span><span style="color:#A901DB;">RGKCWE</span><br />
|-<br />
| <br />
| <br />
|-<br />
| '''Average linkers'''<br />
|<br />
|-<br />
| may1<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAKA<span style="color:#00BFFF;">AAAHPEA</span>AEAAAKEAAAKA<span style="color:#00BFFF;">KTA</span>AEAAAKEAAAKA<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| ord1<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAK<span style="color:#00BFFF;">ATGDLA</span>AEAAAKAA<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| ord3<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAK<span style="color:#00BFFF;">ASLPAA</span>AEAAAKEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
|<br />
| <br />
|-<br />
| '''Short linkers'''<br />
| <br />
|-<br />
| sho1<br />
| <span style="color:#3ADF00;">GG</span><span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
| sho2<br />
| <span style="color:#3ADF00;">GG</span>AEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
|-<br />
|}<br />
[[circ_lam_lys_nils.png]] comparison with [[nice_linker_lysozyme_flexible_ends.png]]<br />
###About predictions of software###<br />
In the end we obtained a ranking of the in vitro tested linkers from the [[linker-screening]] and chose the parameters $\alpha, \beta, \gamma$ of the weighting function in such a way that the ranking from the software reproduced the ranking from the assays. The final values were: $\alpha = 1.85 \times 10^{-6}, \beta = 0.57, \gamma = 50.8 \times 10^{6}$.<br />
To this end we set up a function that is minimized when the order of the weighting output is the same as the order from the wet lab.<br />
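The exact form of this function is not given here. One simple choice with the stated property, shown purely as an assumption, is to count the pairs of linkers that the software orders differently from the wet lab; this count is zero exactly when both rankings agree. The function name and the example data in the test are invented for this sketch.<br />

```python
from itertools import combinations

def rank_disagreement(software_weights, lab_ranking):
    """Number of linker pairs ordered differently by software and wet lab.

    software_weights: dict mapping linker name to weight (smaller is better).
    lab_ranking: list of linker names, best first.
    """
    bad = 0
    for better, worse in combinations(lab_ranking, 2):
        # The linker that performed better in the lab should get the smaller weight.
        if software_weights[better] >= software_weights[worse]:
            bad += 1
    return bad
```

The weighting constants could then be searched for, for example on a grid, by recomputing the weights for each candidate set of constants and keeping the set with the smallest count.<br />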
<br />
=Discussion=<br />
Will always be refined with more data from i@h...<br />
=References=<br />
[0] Wang, C.K.L., Kaas, Q., Chiche, L. & Craik, D.J. CyBase: A database of cyclic protein sequences and structures, with applications in protein discovery and engineering. Nucleic Acids Research 36, (2008).<br />
[1] Thornton, J.M. & Sibanda, B.L. Amino and carboxy-terminal regions in globular proteins. Journal of molecular biology 167, 443-460 (1983).<br />
<br />
</div>Igemnilshttp://2014.igem.org/Team:Heidelberg/pages/Linker_SoftwareTeam:Heidelberg/pages/Linker Software2014-10-17T22:16:51Z<p>Igemnils: /* Background */</p>
<hr />
<div>=Abstract=<br />
As already [https://2014.igem.org/Team:Heidelberg/Toolbox/Circularization introduced], artificially circularized proteins may gain some heat stability because the C- and N-termini are restrained from moving around freely. This circularization may be trivial when the protein termini are very close to each other, which seems to be reasonably common [[#References|[1]]]. However, if the ends are too far from each other, a long linker is needed to connect them. This linker should not change the natural conformation of the protein and should constrain the relative position of the ends, restricting their degrees of freedom and thus stabilizing the structure even when it is heated. In addition, these linkers should not affect any of the protein's functions. Consequently, it is important to prevent linkers from passing through the active site or from covering binding domains for other molecules. Therefore one needs to be able to define the shape of possible linkers. This section describes the software we developed to design such linkers. We would like to stress that this work was made possible by the feedback between computer modeling and experimental work: we could first design linkers in silico, test them experimentally and use the results to further calibrate the software. To our knowledge, this is the first time that such an approach has been used to custom-design rigid linkers with angles to connect protein extremities.<br />
<br />
=General procedure=<br />
In short, the software provides a weighted list of linkers to circularize any protein of interest with a known structure. These linkers are made of rigid alpha-helical segments connected at defined angles. In contrast to flexible linkers, these rigid linkers were expected to constrain the protein extremities and to confer better heat stability. A similar idea has already been developed [[#References|[2]]], but only with alpha helices forming simple rods and without any possibility of introducing angles. To generate the linkers, we first define the geometrical paths, consisting of segments and angles, that they should follow. The geometrical paths that are biologically feasible are then translated into amino acid sequences. Both the check of whether paths are compatible with possible structures and the translation were made possible by our [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling approaches]. The first approach was a statistical analysis of more than 17000 known non-homologous structures containing alpha helices connected at angles. For the second approach, we modeled the conformation of linkers circularizing proteins of known structure and analyzed them for certain properties. This second approach was run for a large number of proteins using our distributed computing system [https://2014.igem.org/Team:Heidelberg/Software/igemathome igemathome]. The software returns several possible linkers, each with a weight that ranks the linker by its capacity to maintain protein activity at higher temperatures. The weights were calibrated with an extensive [https://2014.igem.org/Team:Heidelberg/Project/Linker_Screening linker screening] on the target protein lambda-lysozyme, using the first modeling approach.<br />
<br />
=Background=<br />
Classically, protein linkers have been designed in three different ways. The easiest way is to define the length that a linker should cover and then simply use a flexible glycine-serine peptide with the number of amino acids needed to match this length. Glycine is used for flexibility, as its side chain is a single hydrogen atom and causes no steric hindrance, while serine is used for solubility, as it has a small polar side chain. This solubility is important, as the linkers should not pass through the hydrophobic core of the protein but should be dissolved in the surrounding medium. Such flexible linkers are normally used for circularization, but also for connecting different proteins when the main goal is that the parts are connected, not how they are connected, or when the flexibility of the linker is required for a specific application.<br />
A second strategy consists in using rigid helical linkers to keep proteins or protein domains at a certain distance from each other. This is especially important for signalling proteins and fluorescent proteins. One major property of alpha helices is that they always fold in a defined way, with well-defined angles and lengths. There are also many different helical patterns that differ in stability and solubility. Although helices have been used to design circularizing linkers [[#References|[2]]], one big disadvantage of this strategy is that only straight linkers can be built with them. In the context of circularization, this strategy is therefore not an option if the straight line connecting the protein extremities crosses the protein.<br />
The third option, which served as the basis for our approach, consists in designing custom-tailored linkers for each specific application. These linkers can be obtained from protein structure prediction. First, one needs to define the path that the linker should take to connect the protein ends. Then one designs a possible linker sequence that might fit well. Next, one predicts the structure of the linker attached to the protein to validate the design. Several different linkers, with slight changes, can be compared. This is repeated until the linker effectively follows the expected path ###TODO: Reference, WADE paper###. This method requires extensive knowledge of protein folding and protein structure prediction and is computationally intensive. On the other hand, the benefit can be substantial, as the interaction of the linker with the protein's surface can be taken into account and the path taken by the linker can be defined accurately, down to the resolution of the protein structure.<br />
We have set up a completely new strategy to design rigid linkers. As further detailed in the [[###link###|modelling]] part, the shape of a linker can be defined by combining rigid alpha-helical rods with well-defined angle patterns. Therefore, by defining geometrically the possible paths of the circularizing linkers for a given protein, we can propose potential linkers. Defining the geometrical path can be very difficult, especially for large proteins with complex shapes. Moreover, the definition is further constrained by the fact that linkers must avoid hiding active sites of the protein of interest. Finally, the paths have rotational degrees of freedom at the extremities of the protein, and depending on their orientation they may or may not match the geometry of the protein. The tool we present here covers both steps: defining geometrical paths with weights and translating them into feasible linkers, also with weights. The tool is universal, as it can design circularizing linkers for any protein with a known structure. Moreover, it is modular: thanks to our modeling approach [link], we have designed linkers as exchangeable blocks of rods of different lengths and of angle patterns. The following sections detail the steps followed by our software to design suitable linkers.<br />
<br />
First, the protein structure is analyzed, and all possible paths that connect the two ends with fewer than three additional edges are found. Ultimately, these paths should be ranked by how well the protein would be circularized along each path. Because the final weighting is computationally intensive, unsuitable paths are discarded first. A path is discarded only if it violates a rule for linkers; for example, a path must never pass through the protein. After all paths have been generated, they are refined by shifting their points according to the underlying linker model, so that no path is considered that could not be built with our building blocks.<br />
After all paths have been generated correctly, each path is weighted by several factors, which are then combined into a single weight that expresses how good this path is. The path is then translated into an amino acid sequence using our amino acid patterns. Because one sequence can follow more than one path, all paths built from the same sequence are clustered. Finally, the weights within each path cluster are averaged, so that every linker receives a single weighting value that includes the contributions of all paths it represents.<br />
==PDB parsing==<br />
First, the PDB file containing the structure of the target protein is parsed and the atomic coordinates are stored. From these data, the coordinate system of the PDB file is calibrated to metric units: the distance between certain atoms is measured in all glycine residues. Because this distance is well known, it yields a calibration that normally amounts to 100 pm (1 &Aring;) per PDB unit.<br />
Next, some initial tests are run on the protein structure. First it is tested whether the C- and N-terminus lie on the solvent-accessible surface of the protein, which is crucial for circularization. The angles from which the ends are accessible are stored for the subsequent linker generation. An end is accessible at a certain angle if the axis obtained by rotating the z-axis by that angle, anchored at the end, does not pass too close to any atom position. The minimal distance that a connection must keep from the protein is set to the radius of an alpha helix, 5 &Aring;. ###fig needed### <br />
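The parsing and calibration step can be sketched as follows. This is a minimal illustration and not the code of our software: the function names are hypothetical, and we assume here that the calibration uses the N to CA bond of glycine with a reference length of about 146 pm, whereas the text above only speaks of "certain atoms".<br />

```python
import math

def parse_atoms(pdb_text):
    """Read ATOM records using the fixed column layout of the PDB format."""
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            name = line[12:16].strip()       # atom name, columns 13-16
            res_name = line[17:20].strip()   # residue name, columns 18-20
            chain = line[21]                 # chain identifier, column 22
            res_seq = int(line[22:26])       # residue number, columns 23-26
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            atoms.append((name, res_name, chain, res_seq, xyz))
    return atoms

def calibrate_pm_per_unit(atoms, reference_pm=146.0):
    """Picometres per PDB unit, from the mean N-CA distance of all glycines.

    reference_pm is an assumed textbook value for the N-CA bond length.
    """
    n_pos, ca_pos = {}, {}
    for name, res_name, chain, res_seq, xyz in atoms:
        if res_name != "GLY":
            continue
        if name == "N":
            n_pos[(chain, res_seq)] = xyz
        elif name == "CA":
            ca_pos[(chain, res_seq)] = xyz
    dists = [math.dist(n_pos[k], ca_pos[k]) for k in n_pos if k in ca_pos]
    if not dists:
        raise ValueError("no glycine with both N and CA atoms found")
    return reference_pm / (sum(dists) / len(dists))
```

For a standard PDB file with coordinates in &Aring;, such a calibration should return a value close to 100 pm per unit.<br />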
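The accessibility test can be illustrated with a short sketch. It assumes that a direction is described by a polar angle theta (tilt away from the z-axis) and an azimuth phi, and that atoms in the immediate neighbourhood of the terminus are ignored; both the parametrization and the `ignore_radius` value are assumptions made for this example, not details taken from our software.<br />

```python
import math

HELIX_RADIUS = 5.0  # minimal clearance in Angstrom, as stated in the text

def direction(theta_deg, phi_deg):
    """Unit vector obtained by tilting the z-axis by theta and rotating it by phi around z."""
    theta, phi = math.radians(theta_deg), math.radians(phi_deg)
    return (math.sin(theta) * math.cos(phi),
            math.sin(theta) * math.sin(phi),
            math.cos(theta))

def distance_to_ray(point, origin, unit_dir):
    """Shortest distance from point to the ray origin + t * unit_dir with t >= 0."""
    rel = [p - o for p, o in zip(point, origin)]
    t = max(0.0, sum(r * d for r, d in zip(rel, unit_dir)))
    closest = [o + t * d for o, d in zip(origin, unit_dir)]
    return math.dist(point, closest)

def is_accessible(terminus, atoms, theta_deg, phi_deg, ignore_radius=6.0):
    """True if the ray leaving the terminus keeps HELIX_RADIUS away from all atoms.

    Atoms within ignore_radius of the terminus are skipped (hypothetical choice),
    because the terminus itself is part of the protein.
    """
    unit_dir = direction(theta_deg, phi_deg)
    for atom in atoms:
        if math.dist(atom, terminus) <= ignore_radius:
            continue
        if distance_to_ray(atom, terminus, unit_dir) < HELIX_RADIUS:
            return False
    return True
```

Scanning theta and phi over a grid and keeping the accessible pairs gives the list of angles that the later path generation starts from.<br />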
==Generation of paths==<br />
Because our model builds linkers from helical rods and connecting angles, a path is completely defined by the coordinates of its angle points. We therefore generate only the angle points and subsequently discard the unsuitable ones. Since shifting existing points is very efficient in our programming style, this was easier than generating only those points that represent good connections.<br />
In the [[###link###|modelling]] part we have already described the patterns we use to build our linkers. Their modularity was crucial for the success of the software. The angle patterns were therefore always chosen so that the end of the angle pattern lies ???8??? &Aring; away from the turning point ###figure needed###. As a result, only displacements of a fixed length have to be made, independent of the direction the linker comes from and of the direction in which it continues. This modularity makes the calculations more efficient than generating points at random.<br />
===Flexible ends===<br />
Most proteins have flexible regions at their ends that do not point in a defined direction. Often these flexible ends are even missing from the structure files, yet our software still estimates how they could behave. Furthermore, circularization leaves non-helical sequences at the ends of the protein. This opens up many possibilities for inserting fitting linkers, but it is also a major problem, as estimating flexible parts is not easy with our brute-force approach.<br />
So far, two functions have been included to generate linkers that take the flexibility of the ends into account.<br />
====One helix at flexible ends====<br />
The only variable in this custom-tailored linker is the length of the helix. Most likely no suitable linker can be found this way, because the linker cannot bend around an obstacle between the ends. If such a linker is possible, however, it should be found, because it will be one of the best linkers predicted. ###figure needed###<br />
For all accessible angles calculated before, we therefore compute all points that lie at certain distances from the N-terminus and from the C-terminus. The distances range from 4.5 &Aring; to the maximum length of the flexible part. ###figure###<br />
Then all possible connections between the first points and the last points are tested for whether they come too close to the protein or even pass through it. This test is done with higher accuracy, because none of these linkers should be lost by mistake. Therefore, a different function is used here to calculate the distance between the protein and the connection; it is more time-consuming, but also more accurate, than the one used normally.<br />
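The generation of candidate end points and of the straight connections between them might look like the following sketch. The step width of 1.5 &Aring; and the function names are assumptions for illustration only; the directions are expected to be unit vectors of the accessible angles.<br />

```python
import math

def end_points(terminus, unit_directions, max_flex_length, min_length=4.5, step=1.5):
    """Points reachable from a terminus along each accessible direction.

    Distances run from min_length up to max_flex_length (Angstrom) in increments of step.
    """
    if max_flex_length < min_length:
        return []
    count = int((max_flex_length - min_length) / step + 1e-9) + 1
    points = []
    for unit_dir in unit_directions:
        for i in range(count):
            r = min_length + i * step
            points.append(tuple(t + r * c for t, c in zip(terminus, unit_dir)))
    return points

def straight_connections(first_points, last_points):
    """Every pairing of an N-terminal with a C-terminal point and the helix length it needs."""
    return [(p, q, math.dist(p, q)) for p in first_points for q in last_points]
```

Each returned connection would then be passed to the accurate clearance test described above before it is kept as a one-helix linker.<br />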
<br />
====Linker with one angle at flexible ends====<br />
The other possibility for taking flexible ends into account without blowing up the calculation time was to restrict the helical pattern to exactly one angle. <br />
Linkers with flexible ends and one angle are built as right triangles. ###figure### Right triangles are chosen because a right angle can be built well with the angle patterns we have, and the right angle greatly simplifies the subsequent calculations. <br />
First, Pythagoras' theorem is used to determine which triangles can be built from all possible linker parts. It is important that both legs can be built with our linker patterns and that the hypotenuse has the correct length to fit between the ends.<br />
Afterwards, all right triangles whose hypotenuse endpoints lie on the N- and C-terminus are constructed. Thales' theorem is used for this purpose: all possible right-angle vertices lie on a sphere whose diameter is the hypotenuse, i.e. whose radius is half the hypotenuse length. These right triangles are then checked for whether they can be built with our linker parts, by comparing them with the set of feasible triangles obtained from Pythagoras' theorem.<br />
These triangles are now all shifted by each displacement of the possible angles at the C-terminus, which yields the right-angle points ###figure needed###. Next, the connections to the possible points from the N-terminus are generated. At this step the triangles no longer need to be exactly right-angled but may have slightly different angles; the following steps check whether the paths still fit. First the lengths of the legs of the triangles are checked, and then it is checked that the paths do not interfere with the protein. Paths that would interfere with the protein are simply deleted.<br />
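The triangle construction described above can be sketched as follows. This is an illustrative reconstruction under assumptions: the sampling grid of 30 degrees, the tolerance and the minimal leg length are hypothetical parameters, and the list of buildable leg lengths would come from the helix pattern database.<br />

```python
import math

def feasible_leg_pairs(leg_lengths, hypotenuse, tolerance=0.5):
    """Pairs of buildable legs (a, b) with sqrt(a^2 + b^2) matching the hypotenuse (Pythagoras)."""
    return [(a, b) for a in leg_lengths for b in leg_lengths
            if abs(math.hypot(a, b) - hypotenuse) <= tolerance]

def thales_points(n_term, c_term, step_deg=30, min_leg=1.0):
    """Candidate right-angle vertices on the sphere whose diameter is the N-C segment (Thales)."""
    center = tuple((a + b) / 2 for a, b in zip(n_term, c_term))
    radius = math.dist(n_term, c_term) / 2
    points = []
    for theta_deg in range(step_deg, 180, step_deg):   # polar angle, poles skipped
        for phi_deg in range(0, 360, step_deg):        # azimuth
            theta, phi = math.radians(theta_deg), math.radians(phi_deg)
            p = (center[0] + radius * math.sin(theta) * math.cos(phi),
                 center[1] + radius * math.sin(theta) * math.sin(phi),
                 center[2] + radius * math.cos(theta))
            # Skip degenerate triangles whose vertex coincides with a terminus.
            if math.dist(p, n_term) >= min_leg and math.dist(p, c_term) >= min_leg:
                points.append(p)
    return points

def matching_vertices(n_term, c_term, leg_lengths, tolerance=0.5, step_deg=30):
    """Vertices on the Thales sphere whose two legs are both close to a buildable pair."""
    pairs = feasible_leg_pairs(leg_lengths, math.dist(n_term, c_term), tolerance)
    hits = []
    for p in thales_points(n_term, c_term, step_deg):
        leg_n, leg_c = math.dist(p, n_term), math.dist(p, c_term)
        if any(abs(leg_n - a) <= tolerance and abs(leg_c - b) <= tolerance
               for a, b in pairs):
            hits.append(p)
    return hits
```

Every point returned by `thales_points` forms a right angle with the two termini by construction, so only the leg lengths remain to be compared with the buildable pairs.<br />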
===Rigid paths===<br />
If the first two strategies for finding suitable paths do not succeed, the software can also search for paths with up to three edges. Thus even one of the worst shapes for circularization, a torus with the two termini in the pits, could be circularized with our linker system without needing infinitely long straight segments. On the other hand, this was the maximum number of possibilities that was still feasible to calculate.<br />
Here the flexible parts of the protein are assumed to point in the same direction as the following helix. In this way the number of possibilities is kept fixed, but this is of course a rather rough estimate.<br />
First, points at the right distances from each end are generated. ###figure needed, explains point names### Then, for each of the first points, next points are generated in all directions in steps of 5 degrees. Of these, all points that do not fit are discarded immediately. Next, all connections from the second points to the last points are checked for whether they span a suitable distance. Finally, the normal filtering steps are applied to these new connections.<br />
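The generation of next points in steps of 5 degrees can be sketched like this. The spherical grid and the simple clearance filter are assumptions made for illustration; the function names are hypothetical.<br />

```python
import math

def next_points(start, length, step_deg=5):
    """All points at distance `length` from `start` on a theta/phi grid with step_deg spacing."""
    points = []
    for theta_deg in range(0, 181, step_deg):
        # At the poles every azimuth gives the same point, so one value is enough.
        phi_values = [0] if theta_deg in (0, 180) else range(0, 360, step_deg)
        theta = math.radians(theta_deg)
        for phi_deg in phi_values:
            phi = math.radians(phi_deg)
            points.append((start[0] + length * math.sin(theta) * math.cos(phi),
                           start[1] + length * math.sin(theta) * math.sin(phi),
                           start[2] + length * math.cos(theta)))
    return points

def keep_clear(points, atoms, min_dist=5.0):
    """Discard points that lie closer than min_dist (Angstrom) to any atom."""
    return [p for p in points
            if all(math.dist(p, atom) >= min_dist for atom in atoms)]
```

With a 5 degree grid, each first point produces 2522 candidate directions, which illustrates why the number of paths grows so quickly with every additional edge.<br />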
==Sorting out of paths==<br />
Most of the calculation time is spent on discarding unsuitable paths, because every path needs to be checked. With the brute-force approach, the number of possible paths is about 10^9.<br />
There are three main functions that discard paths that do not fit. The simplest one discards paths that would require an angle we cannot produce with our angle patterns. As we can produce nearly all angles from ???20 -170??? degrees, only very few paths are discarded by this function.<br />
The next function checks whether the endpoint lies inside the protein. If so, the path is deleted. Otherwise, the connection from the preceding point to this point is checked for whether it passes too close to the protein.<br />
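The two kinds of checks described above, on angles and on clearance, can be sketched as follows. The angle limits of 20 and 170 degrees are the provisional values from the text, the clearance of 5 &Aring; is the helix radius mentioned earlier, and the function names are hypothetical. The atom list passed in is assumed to exclude the atoms at the termini that the path necessarily touches.<br />

```python
import math

def bend_angle(prev_pt, vertex, next_pt):
    """Angle in degrees between the two rods meeting at vertex (180 means straight)."""
    u = [a - v for a, v in zip(prev_pt, vertex)]
    w = [a - v for a, v in zip(next_pt, vertex)]
    cos_angle = sum(p * q for p, q in zip(u, w)) / (math.hypot(*u) * math.hypot(*w))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))

def point_segment_distance(p, a, b):
    """Shortest distance from point p to the segment between a and b."""
    ab = [y - x for x, y in zip(a, b)]
    ap = [y - x for x, y in zip(a, p)]
    denom = sum(c * c for c in ab)
    t = 0.0 if denom == 0 else max(0.0, min(1.0, sum(x * y for x, y in zip(ap, ab)) / denom))
    closest = [x + t * c for x, c in zip(a, ab)]
    return math.dist(p, closest)

def path_is_valid(path, atoms, min_angle=20.0, max_angle=170.0, clearance=5.0):
    """Reject paths with unbuildable angles or rods that pass too close to any atom."""
    for i in range(1, len(path) - 1):
        if not min_angle <= bend_angle(path[i - 1], path[i], path[i + 1]) <= max_angle:
            return False
    for a, b in zip(path, path[1:]):
        if any(point_segment_distance(atom, a, b) < clearance for atom in atoms):
            return False
    return True
```

The cheap angle test runs first so that the more expensive distance test is only evaluated for paths that could be built at all.<br />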
<br />
==Shifting paths to the patterns==<br />
<br />
Because of rounding errors and other inaccuracies, the generated paths may not fit the helix and angle patterns perfectly. Therefore the paths need to be refined before they are translated into amino acid sequences. This is also done before the weighting, so that the paths no longer change after they have been weighted.<br />
To this end, the distances between the points are calculated and the points are shifted until they fit the patterns. The shifts never exceed a certain length, so that a path that did not pass through the protein before refinement cannot do so afterwards.<br />
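For a single rod, the refinement can be sketched as below. This simplified example moves only the end point of one segment along the segment direction; how our software adjusts all points of a path jointly is not shown, and the allowed lengths and the maximal shift are hypothetical inputs.<br />

```python
import math

def snap_length(a, b, allowed_lengths, max_shift):
    """Move b along the line from a so that |ab| equals the nearest allowed rod length.

    Returns the shifted point, or None if the required shift exceeds max_shift.
    """
    length = math.dist(a, b)
    if length == 0:
        return None
    target = min(allowed_lengths, key=lambda candidate: abs(candidate - length))
    if abs(target - length) > max_shift:
        return None
    scale = target / length
    return tuple(x + (y - x) * scale for x, y in zip(a, b))
```

Returning `None` for shifts that are too large mirrors the rule that refinement must never move a path into the protein.<br />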
==Weighting of paths==<br />
Before the paths are translated into sequences and thus into expressible linkers, the value of each path needs to be assessed. We have identified four contributions that determine how well a path will work to enhance the heat stability of a target protein by circularization. In the end, all these contributions are summed to obtain one final value for the quality of the connection. For each contribution, a lower weighting value always indicates a better path. All contributions need to be normalized by some property of the protein, so that each contribution on its own is independent of the target protein and generality is achieved.<br />
First, the length of the linker is evaluated. We assumed that a shorter linker is normally better at restraining the ends, as a longer helix might give the ends more flexibility. Because we want this value to be independent of the size of the protein, the length of the linker is normalized by the distance between the two termini. This prevents the length contribution from exceeding all other contributions to the weighting function merely because the two termini are far apart and a longer linker is unavoidable.<br />
The next value measures the quality of the angles used in the linker. Each of our candidate angle patterns produces angles that are distributed around a mean with a certain standard deviation. The narrower this distribution, the likelier it is that the pattern will actually produce the intended angle between the flanking helices. Therefore, path angles that fit the provided angle patterns perfectly get a lower weighting value.<br />
Then the distance from the protein is taken into account. Because the linkers should not disturb the protein in its normal environment, linkers that pass close to the protein's surface are considered better. Of course, no linker may come so close to the surface that it would interact with the protein too strongly. This value is normalized by the minimal distance a linker should keep from the surface.<br />
Finally, the regions a linker should avoid are evaluated. Each protein binds some ligands or substrates. The user can specify how big they are and where they attach. If a linker passes through a potential ligand, this value goes to infinity, so that the linker is discarded. Otherwise, the farther a linker is from a potential ligand, the better this value becomes. The user can also specify the importance of certain regions. In the end, the total weighting is independent of the number of binding sites.<br />
<br />
===Calibrating the weighting function===<br />
Every contribution has its own distribution. An example is shown in figure ### [[figure histogram_length_lys.png]]; the distributions all have different shapes. The aim is to find the paths that minimize all of these contributions. Therefore, the weighting function combines the four contributions linearly:<br />
\[ W(p) = \alpha L(p) + \beta A(p) + \gamma D(p) + \delta u(p) \]<br />
where W is the final weighting, p the path, L the length contribution, A the angle contribution, D the distance contribution and u the contribution from the forbidden regions. $\alpha, \beta, \gamma, \delta$ are the weighting constants that needed to be found. The contributions were normalized so that each of them is dimensionless and all of them have reasonably similar values.<br />
The weighting constants were obtained from the [[linker-screening ###lin]] performed with lysozyme and the [[enzyme-modeling ###link]].<br />
Please see [[below ###link]] for a detailed explanation of how the values were obtained.<br />
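The linear weighting function, together with the normalized length contribution, can be written down as a short sketch. The function names are hypothetical, and the weighting constants are left as arguments because their calibrated values are not given here.<br />

```python
import math

def length_contribution(path):
    """Total rod length divided by the straight-line distance between the termini."""
    total = sum(math.dist(a, b) for a, b in zip(path, path[1:]))
    return total / math.dist(path[0], path[-1])

def path_weight(L, A, D, u, alpha, beta, gamma, delta):
    """W(p) = alpha*L(p) + beta*A(p) + gamma*D(p) + delta*u(p); lower values are better.

    u is infinite for a path through a forbidden region, which makes W infinite as well
    (assuming delta > 0), so such a path is always ranked last.
    """
    return alpha * L + beta * A + gamma * D + delta * u
```

Because each contribution is dimensionless, the constants directly express how strongly each property influences the ranking.<br />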
==Translating paths to sequence==<br />
As mentioned before, the software is provided with two databases, one for the possible angle patterns and one for the helix patterns. The choice of patterns was inspired by known crystal structures extracted from databases and described in different papers.<br />
A large in silico screen to refine the preferences for the patterns was then set up using the [[iGEM@home|###Link]] system. The search for suitable patterns is described in full on the [[modeling|###Link to patternspart]] page. <br />
All possible paths are now split at the angles and compared with the patterns in the databases. ###Figure needed to explain, how the path is translated### The most suitable patterns are identified and concatenated to build the sequence of the path. It is important to note that this is only possible because of the modularity of the linker patterns used as building blocks: each block, whether an alpha helix or an angle pattern, is not affected by the others. Thus, one sequence is produced for each possible path.<br />
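The translation step can be sketched as a nearest-match lookup. This is an assumption about how "most suitable" might be decided; the two databases are passed in as plain dictionaries (sequence to rod length, and sequence to angle in degrees), and no actual pattern values from our databases are reproduced here.<br />

```python
import math

def vertex_angle(prev_pt, vertex, next_pt):
    """Angle in degrees between the two rods meeting at vertex."""
    u = [a - v for a, v in zip(prev_pt, vertex)]
    w = [a - v for a, v in zip(next_pt, vertex)]
    cos_angle = sum(p * q for p, q in zip(u, w)) / (math.hypot(*u) * math.hypot(*w))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))

def nearest_key(database, target):
    """Key of the database entry whose stored value is closest to target."""
    return min(database, key=lambda key: abs(database[key] - target))

def translate_path(path, helix_db, angle_db):
    """Concatenate the best-matching helix and angle blocks along a path of 3D points."""
    blocks = []
    for i in range(len(path) - 1):
        blocks.append(nearest_key(helix_db, math.dist(path[i], path[i + 1])))
        if i + 2 < len(path):
            angle = vertex_angle(path[i], path[i + 1], path[i + 2])
            blocks.append(nearest_key(angle_db, angle))
    return "".join(blocks)
```

Since every block is chosen independently of its neighbours, the sketch relies on exactly the modularity of the building blocks emphasized above.<br />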
==Clustering of paths==<br />
Many different paths are represented by the same sequence [[###Figure that shows, different paths have same properties, already before in the text ###]], and we therefore clustered such paths. The weight of each cluster was then calculated by averaging the weights of the paths that compose it.<br />
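Clustering by sequence and averaging the weights amounts to a simple grouping, as in this sketch (function names are hypothetical):<br />

```python
from collections import defaultdict

def cluster_by_sequence(weighted_paths):
    """Map each linker sequence to the mean weight of all paths that translate to it.

    weighted_paths is an iterable of (sequence, weight) pairs, one per path.
    """
    clusters = defaultdict(list)
    for sequence, weight in weighted_paths:
        clusters[sequence].append(weight)
    return {sequence: sum(ws) / len(ws) for sequence, ws in clusters.items()}

def rank_linkers(weighted_paths):
    """Linkers sorted from best (lowest mean weight) to worst."""
    return sorted(cluster_by_sequence(weighted_paths).items(), key=lambda item: item[1])
```

The ranked list corresponds to the weighted list of linkers that the software returns to the user.<br />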
<br />
=Results=<br />
==DNMT1==<br />
A major motivation of our effort to design rigid linkers with angles was the circularization of the DNA methyltransferase Dnmt1 (link). The truncated form used in our project is composed of 900 amino acids, and its N- and C-terminal extremities are well separated. To circularize it, two linkers were designed: a flexible one made of glycine and serine, and a rigid one designed by the software. The rigid linkers for DNMT1 were obtained from an early version of the software. At that time the calculation took 11 days on a laptop computer with an Intel i5 processor and 8 GB of RAM, which shows the importance of a distributed computing system for large proteins. Since then, the software has improved considerably, reducing the calculation time for DNMT1 to about 1 day.<br />
<br />
###DNMT1 could be made heatstable, still missing###<br />
<br />
<br />
From the [[Project/Linker_Screening | linker-screening]] and the [[Modeling/Enzyme_Modeling | enzyme_modeling]] we obtained the activities of the different linkers after heat shock. These were used to calibrate the weighting function; see Table 1 for the results.<br />
<br />
{| class="table table-hover" style="text-align: center;"<br />
|+ '''Table 1''': Linkers and their amino acid sequences. Green: attachment sequences to prevent the flexible regions from being perturbed; Blue: angle; Purple: extein.<br />
! Linker<br />
! Amino acid sequence<br />
! Activity<br />
! Length contribution<br />
! Angle contribution<br />
! Binding site contribution<br />
! Distance from surface<br />
! Weighting value after calibration<br />
|- <br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
|'''Very good linkers'''<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
| sgt2<br />
| <span style="color:#3ADF00;">GG</span>AEAAAK<span style="color:#00BFFF;">AAAHPEA</span>AEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
| 0.7477<br />
| 1.9205<br />
| 6.7789<br />
| 0.002259<br />
| 10.525<br />
| 114912<br />
|-<br />
| rigid<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAKAA<span style="color:#3ADF00;">P</span><span style="color:#A901DB;">RGKCWE</span><br />
| 0.9447<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
| '''Average linkers'''<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
| may1<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAKA<span style="color:#00BFFF;">AAAHPEA</span>AEAAAK EAAAKA<span style="color:#00BFFF;">KTA</span>AEAAAKEAAAKA<span style="color:#A901DB;">RGTCWE</span><br />
| 0.7489<br />
| 6.2225<br />
| 13.19<br />
| 0.00384<br />
| 1095.2<br />
| 196414<br />
|-<br />
| ord1<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAK<span style="color:#00BFFF;">ATGDLA</span>AEAAAKAA<span style="color:#A901DB;">RGTCWE</span><br />
| 0.956<br />
| 4.936<br />
| 4.639<br />
| 0.00055708<br />
| 220.8<br />
| 27985<br />
|-<br />
| ord3<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAK<span style="color:#00BFFF;">ASLPAA</span>AEAAAKEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
| 1.390<br />
| 4.949<br />
| 7.116<br />
| 0.000545<br />
| 261.2<br />
| 28557<br />
|-<br />
|<br />
| <br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
| '''Short linkers'''<br />
| <br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
| sho1<br />
| <span style="color:#3ADF00;">GG</span><span style="color:#A901DB;">RGTCWE</span><br />
| 0.7087<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
| sho2<br />
| <span style="color:#3ADF00;">GG</span>AEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
| 0.5743<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
| flexible linker<br />
| GGSGGGSGRGKCWE<br />
| 0.6851<br />
| <br />
|<br />
|<br />
|<br />
|<br />
|-<br />
| linear lysozyme<br />
| no linker<br />
| 0.7039<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
|}<br />
[[circ_lam_lys_nils.png]] comparison with [[nice_linker_lysozyme_flexible_ends.png]]<br />
###About predictions of software###<br />
In the end we obtained a ranking of the linkers tested in vitro from the [[linker-screening]] and chose the parameters $\alpha, \beta, \gamma, \delta$ of the weighting function such that the ranking produced by the software reproduced the ranking from the assays. The final values were: ...<br />
<br />
=Discussion=<br />
Will always be refined with more data from i@h...<br />
=References=<br />
<br />
[0] Wang, C.K.L., Kaas, Q., Chiche, L. & Craik, D.J. CyBase: A database of cyclic protein sequences and structures, with applications in protein discovery and engineering. Nucleic Acids Research 36, (2008).<br />
<br />
[1] Thornton, J.M. & Sibanda, B.L. Amino and carboxy-terminal regions in globular proteins. Journal of molecular biology 167, 443-460 (1983).</div>Igemnilshttp://2014.igem.org/Team:Heidelberg/pages/Linker_SoftwareTeam:Heidelberg/pages/Linker Software2014-10-17T22:09:52Z<p>Igemnils: /* Background */</p>
<hr />
<div><br />
=Abstract=<br />
As already [https://2014.igem.org/Team:Heidelberg/Toolbox/Circularization introduced], artificially circularized proteins may gain heat stability because their C- and N-termini are restrained from moving freely. Circularization can be trivial when the protein termini are very close to each other, which seems to be reasonably common [[#References|[1]]]. However, if the ends are too far apart, a long linker is needed to connect them. This linker should not change the natural conformation of the protein, and it should constrain the relative position of the ends so that their degrees of freedom are restricted and the structure is stabilized even when heated. In addition, the linker should not affect any of the protein's functions. It is therefore important to prevent linkers from passing through the active site or from covering domains that bind other molecules, for example. Consequently, one needs to be able to define the shape of possible linkers. This section describes the software we developed to design such linkers. We would like to stress that this work was made possible by the feedback between computer modeling and experimental work: we first designed linkers in silico, tested them experimentally and used the results to further calibrate the software. To our knowledge, this is the first time that such an approach has been used to custom-design rigid linkers with angles to connect protein extremities.<br />
<br />
=General procedure=<br />
In short, the software provides a weighted list of linkers to circularize any protein of interest with a known structure. These linkers are made of rigid alpha-helical segments connected at defined angles. In contrast to flexible linkers, these rigid linkers were expected to constrain the protein extremities and to confer better heat stability. A similar idea has already been developed [[#References|[2]]], but only with alpha helices forming simple rods and without any possibility of introducing angles. To generate the linkers, we first define the geometrical paths, consisting of segments and angles, that they should follow. The geometrical paths that are biologically feasible are then translated into amino acid sequences. Both the check of whether paths are compatible with possible structures and the translation were made possible by our [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling approaches]. The first approach was a statistical analysis of more than 17000 known non-homologous structures containing alpha helices connected at angles. For the second approach, we modeled the conformation of linkers circularizing proteins of known structure and analyzed them for certain properties. This second approach was run for a large number of proteins using our distributed computing system [https://2014.igem.org/Team:Heidelberg/Software/igemathome igemathome]. The software returns several possible linkers, each with a weight that ranks the linker by its capacity to maintain protein activity at higher temperatures. The weights were calibrated with an extensive [https://2014.igem.org/Team:Heidelberg/Project/Linker_Screening linker screening] on the target protein lambda-lysozyme, using the first modeling approach.<br />
<br />
=Background=<br />
Classically, protein linkers have been designed in three different ways. The easiest way is to define the length that a linker should cover and then simply use a flexible glycine-serine peptide with the number of amino acids needed to match this length. Glycine is used for flexibility, as its side chain is a single hydrogen atom and causes no steric hindrance, while serine is used for solubility, as it has a small polar side chain. This solubility is important, as the linkers should not pass through the hydrophobic core of the protein but should be dissolved in the surrounding medium. Such flexible linkers are normally used for circularization, but also for connecting different proteins when the main goal is that the parts are connected, not how they are connected, or when the flexibility of the linker is required for a specific application.<br />
A second strategy consists in using rigid helical linkers to keep proteins or protein domains at a certain distance from each other. This is especially important for signalling proteins and fluorescent proteins. One major property of alpha helices is that they always fold in a defined way, with well-defined angles and lengths. There are also many different helical patterns that differ in stability and solubility. Although helices have been used to design circularizing linkers [[#References|[2]]], one big disadvantage of this strategy is that only straight linkers can be built with them. In the context of circularization, this strategy is therefore not an option if the straight line connecting the protein extremities crosses the protein.<br />
The third option, which served as the basis for our approach, consists in designing custom-tailored linkers for each specific application. These linkers can be obtained from protein structure prediction. First, one needs to define the path that the linker should take to connect two amino acids. Then one designs a possible linker sequence that might fit well. Next, one predicts the structure of the linker attached to the protein to validate the design. Several different linkers, with slight changes, can be compared. This is repeated until the linker effectively follows the expected path. ###TODO: Reference, WADE paper### This method is time-consuming, as it is not only computationally intensive but also requires extensive knowledge of protein folding and protein structure prediction. On the other hand, the benefit can be substantial, as the interaction of the linker with the protein's surface can be taken into account and the path taken by the linker can be defined accurately, down to the resolution of the protein structure.<br />
We have set up a completely new strategy to design rigid linkers. As further detailed in the [[###link###|modelling]] part, the shape of a linker can be defined by combining rigid alpha-helical rods with well-defined angle patterns. Therefore, by defining geometrically the possible paths of the circularizing linkers for a given protein, we can propose potential linkers. Defining the geometrical path can be very difficult, especially for large proteins with complex shapes. Moreover, the definition is further constrained by the fact that linkers must avoid hiding active sites of the protein of interest. Finally, the paths have rotational degrees of freedom at the extremities of the protein, and depending on their orientation they may or may not match the geometry of the protein. The tool we present here covers both steps: defining geometrical paths with weights and translating them into feasible linkers, also with weights. The tool is universal, as it can design circularizing linkers for any protein with a known structure. Moreover, it is modular: thanks to our modeling approach [link], we have designed linkers as exchangeable blocks of rods of different lengths and of angle patterns. The following sections detail the steps followed by our software to design suitable linkers.<br />
<br />
First, the protein structure is analysed and all possible paths that connect the two ends with fewer than three additional edges are found. Ultimately, these paths should be ranked by how well the protein would be circularized using each of them. But as the final weighting is quite computationally intensive, unsuitable paths are discarded first. A path is discarded only if it breaches one of the rules for linkers; for example, paths must never pass through the protein. After all the paths have been generated, they are refined by shifting their points according to the underlying linker model, so that no path is considered that could not be built with our building blocks.<br />
Once all paths have been generated, each path is weighted by several factors, which are combined into a single weighting that reflects the quality of the path. The path is then translated into an amino acid sequence using our amino acid patterns. As one sequence can follow more than one path, all paths built from the same sequence are clustered. Finally, the weightings within each path cluster are averaged, so that every linker receives a single weighting value with contributions from all the paths it represents.<br />
==PDB parsing==<br />
First, the PDB file containing the structure of the target protein is parsed and the coordinates of the atoms are stored. From these data, the coordinate system of the PDB file is calibrated to metric units. For this purpose, the distance between certain atoms in all glycines is measured. As this distance is well known, it can serve as a calibration standard, which normally yields 100 pm per unit in the PDB file.<br />
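The parsing and calibration step can be illustrated with a minimal Python sketch. It is not the code of our software: the function names are hypothetical, and since the text does not state which glycine atoms are measured, the backbone N-CA bond (about 146 pm) is assumed here as the reference distance.<br />

```python
import math

# Assumption: the "well known distance" is taken here to be the backbone
# N-CA bond (about 146 pm); the text does not say which atom pair is used.
REFERENCE_N_CA_PM = 146.0


def parse_atoms(pdb_text):
    """Read ATOM records using the fixed column layout of the PDB format."""
    atoms = []
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        atoms.append({
            "name": line[12:16].strip(),      # atom name, columns 13-16
            "res_name": line[17:20].strip(),  # residue name, columns 18-20
            "chain": line[21],                # chain identifier, column 22
            "res_seq": int(line[22:26]),      # residue number, columns 23-26
            "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
        })
    return atoms


def picometres_per_unit(atoms):
    """Calibrate the coordinate unit from the mean N-CA distance in glycines."""
    glycines = {}
    for atom in atoms:
        if atom["res_name"] == "GLY" and atom["name"] in ("N", "CA"):
            key = (atom["chain"], atom["res_seq"])
            glycines.setdefault(key, {})[atom["name"]] = atom["xyz"]
    distances = [math.dist(g["N"], g["CA"]) for g in glycines.values() if len(g) == 2]
    if not distances:
        raise ValueError("no complete glycine found for calibration")
    return REFERENCE_N_CA_PM / (sum(distances) / len(distances))
```

For a standard PDB file, whose coordinates are given in &Aring;, this calibration should return a value close to 100 pm per unit.<br />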
Some initial tests are then performed on the protein structure. First, it is tested whether the C- and N-terminus lie on the solvent-accessible surface of the protein, which is crucial for circularization. The angles from which the ends are accessible are stored for the subsequent linker generation. An end is accessible at a certain angle if the axis obtained by rotating the z-axis by that angle does not come too close to any of the atoms. The minimal distance a connection must keep from the protein is set to the radius of an alpha helix, 5 &Aring;. ###fig needed### <br />
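A minimal Python sketch of such an accessibility test is given below. It is an illustration under stated assumptions, not our implementation: the angular step, the function names and the rule of ignoring atoms that lie within 5 &Aring; of the terminus itself are all hypothetical choices made for the example.<br />

```python
import math

MIN_DISTANCE = 5.0  # angstrom; helix radius quoted in the text


def direction(theta_deg, phi_deg):
    """Unit vector obtained by tilting the z-axis by theta and rotating by phi."""
    theta, phi = math.radians(theta_deg), math.radians(phi_deg)
    return (math.sin(theta) * math.cos(phi),
            math.sin(theta) * math.sin(phi),
            math.cos(theta))


def distance_to_ray(point, origin, unit_dir):
    """Shortest distance from a point to the half-line origin + t * unit_dir, t >= 0."""
    rel = [p - o for p, o in zip(point, origin)]
    t = max(0.0, sum(r * d for r, d in zip(rel, unit_dir)))
    closest = [o + t * d for o, d in zip(origin, unit_dir)]
    return math.dist(point, closest)


def accessible_directions(terminus, atoms, step_deg=30, ignore_within=MIN_DISTANCE):
    """Return the (theta, phi) pairs along which the terminus can be left freely.

    Atoms closer to the terminus than `ignore_within` are skipped (hypothetical
    rule): they belong to the terminal residues and would block every direction.
    """
    far_atoms = [a for a in atoms if math.dist(a, terminus) >= ignore_within]
    result = []
    for theta in range(0, 181, step_deg):
        for phi in range(0, 360, step_deg):
            axis = direction(theta, phi)
            if all(distance_to_ray(a, terminus, axis) >= MIN_DISTANCE for a in far_atoms):
                result.append((theta, phi))
    return result
```

A terminus with an empty list of accessible directions would be treated as buried, in which case circularization through this end is not attempted.<br />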
==Generation of paths==<br />
As our model builds linkers from helical rods and connecting angles, a path is completely defined by the coordinates of its angle points. Therefore only the angle points are generated, and the suitable ones are then selected. As shifting existing points is very efficient in our programming style, this was easier than generating only those points that represent good connections.<br />
In the [[###link###|modelling]] part we have already described the patterns we use to build our linkers. Their modularity was crucial for the success of the software. The angle patterns were therefore always chosen so that the end of the angle pattern is ???8??? &Aring; away from the turning point ###figure needed###. Thus only displacements of a fixed length have to be made, independent of the direction from which the linker arrives and of the direction in which it continues. This modularity makes the calculations more efficient than generating points randomly.<br />
===Flexible ends===<br />
Most proteins have flexible regions at their ends that do not point in a defined direction. Often these flexible ends are even missing from the structure files, but our software still estimates how they could behave. Furthermore, circularization leaves non-helical sequences at the ends of the protein. This flexibility greatly widens the range of linkers that can fit. But it is also a major problem, as flexible parts are not easy to estimate with our brute-force approach.<br />
So far, two functions have been included for generating linkers that take the flexibility of the ends into account.<br />
====One helix at flexible ends====<br />
The only variable in the custom-tailored linker here is the length of the helix. Most likely no suitable linker can be found, because if there is an obstacle between the ends, the linker cannot bend around it. But if such a linker is possible, it should be found, because it will be one of the best linkers predicted. ###figure needed###<br />
So for all accessible angles calculated before, we compute all points from the N-terminus and from the C-terminus that lie at certain distances from the respective terminus. The points are spread over distances ranging from 4.5 &Aring; to the maximum length of the flexible part. ###figure###<br />
Then all possible connections between the first points and the last points are tested for whether they come too close to the protein or even pass through it. This is done with higher accuracy, because none of these linkers should be lost by error. Therefore a different function is used here to calculate the distance from the protein to the connection, one that is more time-consuming but also more accurate than the functions used elsewhere.<br />
<br />
====Linker with one angle at flexible ends====<br />
The other possibility for taking flexible ends into account without inflating the calculation time was to restrict the number of angles in the helical pattern to exactly one. <br />
Linkers with flexible ends and one angle are built as right triangles. ###figure### Right triangles are chosen because a right angle can be built well with the angle patterns we have and because it greatly simplifies the subsequent calculations. <br />
First, Pythagoras' theorem is used to determine which triangles can be built from all possible linker parts. Both legs must be buildable with our linker patterns, and the hypotenuse must have the correct length to fit between the ends.<br />
Afterwards, all possible right triangles are constructed whose hypotenuse has its two endpoints on the N- and C-terminus. For this purpose Thales' theorem is used, as all possible right-angle vertices lie on a sphere whose radius is half the hypotenuse. These right triangles are then checked for whether they can be built with our linker parts by comparing them with the set of possible triangles obtained from Pythagoras' theorem.<br />
These triangles are now shifted by each displacement of the possible angles at the C-terminus, resulting in the right-angle points ###figure needed###. Next, the connections to the possible points from the N-terminus are generated. At this step the triangles no longer need to be exactly right-angled and may have slightly different angles, but in the following steps the paths are checked for whether they still fit. First the lengths of the legs of the triangles are checked, and then it is checked that the paths do not interfere with the protein. Paths that would interfere with the protein in any way are deleted.<br />
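The use of Pythagoras' and Thales' theorems in this construction can be sketched in Python as follows. This is a hedged illustration only: the function names, the tolerance parameter and the Fibonacci-spiral sampling of the sphere are assumptions made for the example and are not taken from our software.<br />

```python
import math


def feasible_leg_pairs(buildable_lengths, hypotenuse, tolerance):
    """Pairs of buildable leg lengths whose right triangle spans the hypotenuse."""
    return [(a, b)
            for a in buildable_lengths
            for b in buildable_lengths
            if abs(math.hypot(a, b) - hypotenuse) <= tolerance]


def thales_points(n_term, c_term, n_samples=200):
    """Sample the sphere whose diameter is the N-to-C segment.

    By Thales' theorem, every point on this sphere sees the two termini
    under a right angle. Points are spread with a Fibonacci spiral.
    """
    center = [(n + c) / 2 for n, c in zip(n_term, c_term)]
    radius = math.dist(n_term, c_term) / 2
    golden_angle = math.pi * (3 - math.sqrt(5))
    points = []
    for i in range(n_samples):
        z = 1 - 2 * (i + 0.5) / n_samples
        ring = math.sqrt(1 - z * z)
        phi = i * golden_angle
        points.append((center[0] + radius * ring * math.cos(phi),
                       center[1] + radius * ring * math.sin(phi),
                       center[2] + radius * z))
    return points


def buildable_vertices(n_term, c_term, buildable_lengths, tolerance, n_samples=200):
    """Keep sphere points whose two legs both match a buildable length."""
    def matches(length):
        return any(abs(length - b) <= tolerance for b in buildable_lengths)
    return [p for p in thales_points(n_term, c_term, n_samples)
            if matches(math.dist(p, n_term)) and matches(math.dist(p, c_term))]
```

Sampling the whole sphere first and filtering afterwards mirrors the generate-then-discard style described above; a finer sampling increases the chance of hitting a buildable vertex at the cost of more checks.<br />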
===Rigid paths===<br />
If the first two strategies for finding suitable paths fail, the software can also search for paths with up to three edges. Thus even one of the worst shapes for circularization, a torus with the two termini in the pits, could be circularized with our linker system without needing infinitely long straight segments. On the other hand, this was the largest number of possibilities that was still feasible to calculate.<br />
Here the flexible parts of the protein are assumed to point in the same direction as the following helix. In this way the number of possibilities is kept fixed, but of course this is a rather rough estimate.<br />
First, points at the right distances from each end are generated. ###figure needed, explains point names### Then, from each of the first points, next points are generated in all directions in steps of 5 degrees. Of these, all points that do not fit are immediately discarded. Next, all connections from the second points to the last points are checked for whether they lie at the right distance. After this, the normal filtering steps are applied to these new connections.<br />
==Sorting out of paths==<br />
Most of the calculation time is spent discarding unsuitable paths, because every path needs to be checked. With the brute-force approach, the number of possible paths is about 10^9.<br />
Three main functions discard paths that do not fit. The simplest one discards paths that would require an angle we cannot produce with our angle patterns. But as we can produce angles over nearly the whole range of ???20 -170??? degrees, only very few paths are discarded by this function.<br />
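The angle filter can be sketched in a few lines of Python. The function names are hypothetical, and the limits of the buildable range are passed in as parameters because the exact range is still marked as open in the text above.<br />

```python
import math


def angle_at(prev_point, vertex, next_point):
    """Angle in degrees enclosed by the two segments meeting at `vertex`."""
    u = [p - v for p, v in zip(prev_point, vertex)]
    w = [n - v for n, v in zip(next_point, vertex)]
    cos_angle = sum(a * b for a, b in zip(u, w)) / (math.hypot(*u) * math.hypot(*w))
    # Clamp to [-1, 1] to guard against rounding errors before acos.
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))


def angles_buildable(points, min_deg, max_deg):
    """True if every interior angle of the path lies in the buildable range."""
    return all(min_deg <= angle_at(points[i - 1], points[i], points[i + 1]) <= max_deg
               for i in range(1, len(points) - 1))
```

A path consisting of a single segment has no interior angle and therefore always passes this filter.<br />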
The next function checks whether the endpoint lies inside the protein. If so, the path is deleted. Otherwise the connection from the preceding point to this point is checked for whether it passes too close to the protein.<br />
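The clearance check for a single connection can be sketched as a point-to-segment distance test in Python. This is an assumption about how such a check might look, not our actual function; in particular, atoms next to the termini would need special handling in practice, which is left out here.<br />

```python
import math


def point_segment_distance(point, start, end):
    """Shortest distance from `point` to the segment from `start` to `end`."""
    seg = [e - s for s, e in zip(start, end)]
    rel = [p - s for s, p in zip(start, point)]
    seg_len_sq = sum(c * c for c in seg)
    if seg_len_sq == 0.0:
        return math.dist(point, start)
    # Parameter of the closest point, clamped to the segment.
    t = max(0.0, min(1.0, sum(r * c for r, c in zip(rel, seg)) / seg_len_sq))
    closest = [s + t * c for s, c in zip(start, seg)]
    return math.dist(point, closest)


def segment_is_clear(start, end, atoms, min_distance=5.0):
    """True if no atom comes closer to the connection than `min_distance`."""
    return all(point_segment_distance(a, start, end) >= min_distance for a in atoms)
```

The default of 5 &Aring; corresponds to the minimal distance stated in the PDB parsing section.<br />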
<br />
==Shifting paths to the patterns==<br />
<br />
Because of rounding errors and other inaccuracies, the generated paths may not fit the helical and angle patterns perfectly. Therefore the paths need to be refined before they are translated into an amino acid sequence. This is also done before the weighting, so that the paths no longer change after weighting.<br />
To this end, the distances between the points are calculated and the points are shifted until they fit the patterns. The shifts never exceed a certain length, so that no path that stayed clear of the protein before refinement passes through it afterwards.<br />
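One simple way to picture this refinement is to move a point along its segment until the segment length matches the nearest buildable length. The Python sketch below shows this idea; the function name, the list of allowed lengths and the shift limit are hypothetical, and the real refinement may adjust several points at once.<br />

```python
import math


def snap_point(anchor, point, allowed_lengths, max_shift):
    """Move `point` along the anchor-to-point line to the nearest allowed length.

    Returns the shifted point, or None if the required shift exceeds `max_shift`.
    """
    distance = math.dist(anchor, point)
    if distance == 0.0:
        return None  # direction undefined, nothing sensible to snap to
    target = min(allowed_lengths, key=lambda length: abs(length - distance))
    if abs(target - distance) > max_shift:
        return None
    scale = target / distance
    return tuple(a + (p - a) * scale for a, p in zip(anchor, point))
```

Rejecting a path when the shift would be too large is what keeps a refined path from ending up inside the protein.<br />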
==Weighting of paths==<br />
Before the paths are translated into sequences and thus into expressible linkers, the value of each path needs to be assessed. We have identified four contributions that determine how well a path will enhance the heat stability of a target protein by circularization. In the end, these contributions are summed to obtain one final value for the quality of the connection. For each contribution, lower weighting values indicate a better path. Each contribution is normalized by some property of the protein, so that it is independent of the target protein and the weighting remains general.<br />
First, the length of the linker is evaluated. We assumed that a shorter linker is normally better at restraining the ends, as a longer helix might give them more flexibility. Because this value should be independent of the size of the protein, the length of the linker is normalized by the distance between the two termini. Otherwise this contribution would exceed all others merely because the two termini are far apart and a longer linker is unavoidable.<br />
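As a small Python illustration of this normalization, the sketch below assumes that a path is given as a list of points whose first and last entries are the two termini. The function names are hypothetical.<br />

```python
import math


def path_length(points):
    """Total length of the polyline through `points`."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


def length_contribution(points):
    """Path length divided by the straight-line distance between its ends."""
    return path_length(points) / math.dist(points[0], points[-1])
```

A straight connection yields the minimal value of 1, and any detour around the protein increases the contribution regardless of how large the protein is.<br />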
The next value measures the quality of the angles used in the linker. Each of our potential angle patterns produces angles that are distributed around a mean with a certain standard deviation. The narrower this distribution, the more likely it is that the pattern will actually produce the assumed angle between the flanking helices. Therefore path angles that fit the provided angle patterns perfectly receive a lower weighting value.<br />
Then the distance from the protein is taken into account. Because the linkers should not disturb the protein in its normal environment, linkers that pass close to the protein's surface are considered better. Of course, no linker comes so close to the surface that it would interact with the protein too strongly. This value is normalized by the minimal distance a linker should keep from the surface.<br />
Finally, the regions a linker should avoid are evaluated. Each protein binds some ligands or substrates. The user can specify how large they are and where they attach. If a linker passes through a potential ligand, this value goes to infinity, so that the linker is discarded. Otherwise, the farther a linker is from a potential ligand, the better this value becomes. The user can also specify the importance of certain regions. In the end, the total weighting is independent of the number of binding sites.<br />
<br />
===Calibrating the weighting function===<br />
Every contribution has its own distribution. An example is shown in figure ### [[figure histogram_length_lys.png]]; the distributions all have different shapes. The aim is to find the paths that minimize all of these contributions. Therefore the four contributions are combined linearly in the weighting function:<br />
\[ W(p) = \alpha L(p) + \beta A(p) + \gamma D(p) + \delta u(p) \]<br />
where W is the final weighting, p the path, L the length contribution, A the angle contribution, D the distance contribution and u the contribution from the forbidden regions. $\alpha, \beta, \gamma, \delta$ are the weighting constants that had to be determined. The normalization of each contribution was chosen so that all contributions are dimensionless and have reasonably similar values.<br />
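The linear combination W(p) can be written down directly in Python. The sketch below is illustrative: the function names are hypothetical, and returning infinity as soon as any contribution is infinite is an assumption made so that a path through a forbidden region is always ranked last.<br />

```python
import math


def total_weight(length, angle, distance, forbidden, alpha, beta, gamma, delta):
    """W(p) = alpha*L(p) + beta*A(p) + gamma*D(p) + delta*u(p); lower is better."""
    if any(math.isinf(term) for term in (length, angle, distance, forbidden)):
        return math.inf  # e.g. the path crosses a ligand binding region
    return alpha * length + beta * angle + gamma * distance + delta * forbidden


def rank_paths(paths, coefficients):
    """Sort path names by W(p); `paths` maps a name to its (L, A, D, u) tuple."""
    return sorted(paths, key=lambda name: total_weight(*paths[name], *coefficients))
```

The early return also avoids the undefined product of a zero coefficient with an infinite contribution.<br />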
The weighting constants were obtained from the [[linker-screening ###lin]] performed with lysozyme and the [[enzyme-modeling ###link]]<br />
Please see [[below ###link]] for the detailed explanation, how the values were obtained.<br />
==Translating paths to sequence==<br />
As mentioned before, the software is provided with two databases, one for the possible angle patterns and one for the helix patterns. The choice of patterns was inspired by known crystal structures extracted from databases and described in different papers.<br />
A large in silico screening to refine the preferences for the patterns was then set up using the [[iGEM@home|###Link]] system. For the complete description of the search for suitable patterns, see the [[modeling|###Link to patternspart]] page. <br />
All possible paths are now split at the angles and compared with the patterns in the databases. ###Figure needed to explain, how the path is translated### The most suitable patterns are identified and concatenated to build the path's sequence. Note that this is only possible because of the modularity of the linker patterns used as building blocks: each block, whether an alpha helix or an angle pattern, is not affected by the others. Thus one sequence is produced for each possible path.<br />
==Clustering of paths==<br />
Many different paths are represented by the same sequence [[###Figure that shows, different paths have same properties, already before in the text ###]], and we therefore clustered such paths. The weight of each cluster was then calculated by averaging the weights of the paths that compose it.<br />
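The clustering step amounts to grouping paths by their sequence and averaging the weights within each group. A minimal Python sketch, with a hypothetical function name and input format, could look like this:<br />

```python
from collections import defaultdict


def cluster_by_sequence(scored_paths):
    """Average the weights of all paths that translate to the same sequence.

    `scored_paths` is an iterable of (sequence, weight) pairs; the result is a
    list of (sequence, mean weight) pairs sorted from best (lowest) to worst.
    """
    clusters = defaultdict(list)
    for sequence, weight in scored_paths:
        clusters[sequence].append(weight)
    averaged = [(seq, sum(w) / len(w)) for seq, w in clusters.items()]
    return sorted(averaged, key=lambda item: item[1])
```

Each linker sequence thus appears only once in the final list, carrying the contributions of all paths it represents.<br />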
<br />
=Results=<br />
==DNMT1==<br />
A major motivation for our effort to design rigid linkers with angles was the circularization of the DNA methyltransferase Dnmt1 (link). The truncated form used in our project is composed of 900 amino acids, and its N- and C-terminal extremities are well separated. To circularize it, two linkers were designed: a flexible one made of glycine and serine, and a rigid one designed by the software. The rigid linkers for DNMT1 were obtained from an early version of the software. At that time the calculation took 11 days on a laptop computer with an Intel i5 processor and 8 GB of RAM, which shows the importance of a distributed computing system for large proteins. Since then the software has improved considerably, reducing the calculation time for DNMT1 to about 1 day.<br />
<br />
###DNMT1 could be made heatstable, still missing###<br />
<br />
<br />
From the [[Project/Linker_Screening | linker-screening]] and the [[Modeling/Enzyme_Modeling | enzyme_modeling]] we obtained the activities of the different linkers after heat shock. These were used to calibrate the weighting function; see table 1 for the results.<br />
<br />
{| class="table table-hover" style="text-align: center;"<br />
|+ '''Table 1''': Linkers and their amino acid sequences. Green: attachment sequences to prevent the flexible regions from being perturbed; Blue: angle; Purple: extein.<br />
! Linker<br />
! Amino acid sequence<br />
! Activity<br />
! Length contribution<br />
! Angle contribution<br />
! Binding site contribution<br />
! Distance from surface<br />
! Weighting value after calibration<br />
|- <br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
|'''Very good linkers'''<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
| sgt2<br />
| <span style="color:#3ADF00;">GG</span>AEAAAK<span style="color:#00BFFF;">AAAHPEA</span>AEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
| 0.7477<br />
| 1.9205<br />
| 6.7789<br />
| 0.002259<br />
| 10.525<br />
| 114912<br />
|-<br />
| rigid<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAKAA<span style="color:#3ADF00;">P</span><span style="color:#A901DB;">RGKCWE</span><br />
| 0.9447<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
| '''Average linkers'''<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
| may1<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAKA<span style="color:#00BFFF;">AAAHPEA</span>AEAAAK EAAAKA<span style="color:#00BFFF;">KTA</span>AEAAAKEAAAKA<span style="color:#A901DB;">RGTCWE</span><br />
| 0.7489<br />
| 6.2225<br />
| 13.19<br />
| 0.00384<br />
| 1095.2<br />
| 196414<br />
|-<br />
| ord1<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAK<span style="color:#00BFFF;">ATGDLA</span>AEAAAKAA<span style="color:#A901DB;">RGTCWE</span><br />
| 0.956<br />
| 4.936<br />
| 4.639<br />
| 0.00055708<br />
| 220.8<br />
| 27985<br />
|-<br />
| ord3<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAK<span style="color:#00BFFF;">ASLPAA</span>AEAAAKEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
| 1.390<br />
| 4.949<br />
| 7.116<br />
| 0.000545<br />
| 261.2<br />
| 28557<br />
|-<br />
|<br />
| <br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
| '''Short linkers'''<br />
| <br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
| sho1<br />
| <span style="color:#3ADF00;">GG</span><span style="color:#A901DB;">RGTCWE</span><br />
| 0.7087<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
| sho2<br />
| <span style="color:#3ADF00;">GG</span>AEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
| 0.5743<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
| flexible linker<br />
| GGSGGGSGRGKCWE<br />
| 0.6851<br />
| <br />
|<br />
|<br />
|<br />
|<br />
|-<br />
| linear lysozyme<br />
| no linker<br />
| 0.7039<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
|}<br />
[[circ_lam_lys_nils.png]] comparison with [[nice_linker_lysozyme_flexible_ends.png]]<br />
###About predictions of software###<br />
In the end we obtained a ranking of the in vitro tested linkers from the [[linker-screening]] and chose the parameters $\alpha, \beta, \gamma, \delta$ of the weighting function such that the ranking from the software reproduced the ranking from the assays. The final values were: ...<br />
<br />
=Discussion=<br />
Will always be refined with more data from i@h...<br />
=References=<br />
<br />
[0] Wang, C.K.L., Kaas, Q., Chiche, L. & Craik, D.J. CyBase: A database of cyclic protein sequences and structures, with applications in protein discovery and engineering. Nucleic Acids Research 36, (2008).<br />
<br />
[1] Thornton, J.M. & Sibanda, B.L. Amino and carboxy-terminal regions in globular proteins. Journal of molecular biology 167, 443-460 (1983).</div>Igemnilshttp://2014.igem.org/Team:Heidelberg/pages/Linker_SoftwareTeam:Heidelberg/pages/Linker Software2014-10-17T22:04:26Z<p>Igemnils: /* General procedure */</p>
<hr />
<div><br />
=Abstract=<br />
As already [https://2014.igem.org/Team:Heidelberg/Toolbox/Circularization introduced], artificially circularized proteins may gain heat stability because their C- and N-termini are restrained from moving freely. Circularization may be trivial when the protein termini are very close to each other, which seems to be reasonably common [[#References|[1]]]. However, if the ends are too far apart, a long linker is needed to connect them. This linker should not change the natural conformation of the protein and should constrain the relative position of the ends, restricting their degrees of freedom and thus stabilizing the structure even when heated. In addition, the linker should not affect any of the protein's functions. Consequently, it is important to prevent linkers from passing through the active site or from covering domains that bind other molecules, for example. Therefore one needs to be able to define the shape of possible linkers. This section describes the software we developed to design such linkers. We would like to stress that this work was made possible by the feedback between computer modeling and experimental work: we could first design linkers in silico, test them experimentally and use the results to further calibrate the software. To our knowledge, this is the first time that such an approach has been used to custom-design rigid linkers with angles to connect protein extremities.<br />
<br />
=General procedure=<br />
In short, the software provides a weighted list of linkers to circularize any protein of interest with a known structure. These linkers are made of rigid alpha-helical segments connected at defined angles. In contrast to flexible linkers, these rigid linkers were expected to constrain the protein extremities and to confer better heat stability. Such an idea has been developed before [[#References|[2]]], but only with alpha helices forming simple rods and without any possibility of introducing angles. To generate the linkers, we first defined the geometrical paths, consisting of segments and angles, that they should follow. The geometrical paths that are biologically feasible are afterwards translated into amino acid sequences. Both checking the compatibility of paths with possible structures and the translation were made possible by our [https://2014.igem.org/Team:Heidelberg/Modeling/Linker_Modeling modeling approaches]. The first approach consisted of a statistical analysis of more than 17000 known non-homologous structures containing alpha helices connected by angles. For the second approach, we modeled the conformation of linkers circularizing proteins of known structure and analyzed them for certain properties. This second approach was run for a large number of proteins thanks to our distributed computing system [https://2014.igem.org/Team:Heidelberg/Software/igemathome igemathome]. The software provides different possible linkers with weights that rank the linkers by their capacity to maintain protein activity at higher temperatures. These weights were derived from an extensive [https://2014.igem.org/Team:Heidelberg/Project/Linker_Screening linker screening] on the target protein lambda-lysozyme, using the first modeling approach.<br />
<br />
=Background=<br />
Classically, protein linkers have been designed in three different ways. The simplest is to define the length that a linker should span and then use a flexible glycine-serine peptide with the number of amino acids needed to match this length. Glycine provides flexibility, as it has no side chain and causes no steric hindrance, while serine provides solubility, as it has a small polar side chain. This solubility is important because the linkers should not pass through the hydrophobic core of the protein but should remain dissolved in the surrounding medium. These flexible linkers have typically been used for circularization, but also for connecting different proteins when the main goal is that the parts are connected, regardless of how, or when the flexibility of the linker is required for a specific application.<br />
A second strategy consists of using rigid helical linkers to keep proteins or protein domains at a certain distance from each other. This is especially important for signalling proteins and fluorescent proteins. ###TODO: Reference### One major property of alpha helices is that they always fold in a defined way with well-defined angles and lengths. There are also many different helical patterns that differ in stability and solubility. One big disadvantage of this strategy is that only straight linkers can be built from helices. So in the context of circularization, if the straight line connecting the protein extremities crosses the protein, this strategy is not an option.<br />
The third option, which served as the basis for our approach, consists of designing custom-tailored linkers for each specific application. These linkers can be obtained from protein structure prediction. First, one defines the path that the linker should take to connect two amino acids. Then one designs a candidate linker sequence that might fit this path. Next, one predicts the structure of the linker attached to the protein to validate the design. Several different linkers, with slight changes, can be compared. This is repeated until the linker effectively follows the expected path. ###TODO: Reference, WADE paper### This method is time-consuming, as it is not only computationally intensive but also requires a strong knowledge of protein folding and protein structure prediction. On the other hand, the benefit can be substantial, as the interaction of the linker with the protein's surface can be taken into account and the path taken by the linker can be defined to the resolution of the protein structure.<br />
We have set up a completely new strategy to design rigid linkers. As further detailed in the [[###link###|modelling]] part, it is possible to define the shape of a linker by combining rigid alpha-helical rods with well-defined angle patterns. Therefore, by defining geometrically the possible paths of the circularizing linkers for a given protein, we can propose potential linkers. Defining the geometrical path can be very difficult, especially for large proteins with complex shapes. Moreover, the definition is further constrained by the fact that linkers must avoid hiding active sites of the protein of interest. Finally, the paths have rotational degrees of freedom at the extremities of the protein and, depending on their orientation, they may or may not match the geometry of the protein. The tool we present here covers both steps: defining geometrical paths with weights and translating them into feasible linkers, also with weights. This tool is universal, as it can design circularizing linkers for any protein with a known structure. Moreover, it is modular: thanks to our modeling approach [link], we have designed linkers as exchangeable blocks of rods of different lengths and of angle patterns. The following sections detail the steps followed by our software to design suitable linkers.<br />
<br />
First, the protein structure is analysed and all possible paths that connect the two ends with fewer than three additional edges are found. Ultimately, these paths should be ranked by how well the protein would be circularized using each of them. But as the final weighting is quite computationally intensive, unsuitable paths are discarded first. A path is discarded only if it breaches one of the rules for linkers; for example, paths must never pass through the protein. After all the paths have been generated, they are refined by shifting their points according to the underlying linker model, so that no path is considered that could not be built with our building blocks.<br />
Once all paths have been generated, each path is weighted by several factors, which are combined into a single weighting that reflects the quality of the path. The path is then translated into an amino acid sequence using our amino acid patterns. As one sequence can follow more than one path, all paths built from the same sequence are clustered. Finally, the weightings within each path cluster are averaged, so that every linker receives a single weighting value with contributions from all the paths it represents.<br />
==PDB parsing==<br />
First, the PDB file containing the structure of the target protein is parsed and the coordinates of the atoms are stored. From these data, the coordinate system of the PDB file is calibrated to metric units. For this purpose, the distance between certain atoms in all glycines is measured. As this distance is well known, it can serve as a calibration standard, which normally yields 100 pm per unit in the PDB file.<br />
Some initial tests are then performed on the protein structure. First, it is tested whether the C- and N-terminus lie on the solvent-accessible surface of the protein, which is crucial for circularization. The angles from which the ends are accessible are stored for the subsequent linker generation. An end is accessible at a certain angle if the axis obtained by rotating the z-axis by that angle does not come too close to any of the atoms. The minimal distance a connection must keep from the protein is set to the radius of an alpha helix, 5 &Aring;. ###fig needed### <br />
==Generation of paths==<br />
As our model builds linkers from helical rods and connecting angles, a path is completely defined by the coordinates of its angle points. Therefore only the angle points are generated, and the suitable ones are then selected. As shifting existing points is very efficient in our programming style, this was easier than generating only those points that represent good connections.<br />
In the [[###link###|modelling]] part we have already described the patterns we use to build our linkers. Their modularity was crucial for the success of the software. The angle patterns were therefore always chosen so that the end of the angle pattern is ???8??? &Aring; away from the turning point ###figure needed###. Thus only displacements of a fixed length have to be made, independent of the direction from which the linker arrives and of the direction in which it continues. This modularity makes the calculations more efficient than generating points randomly.<br />
===Flexible ends===<br />
Most proteins have flexible regions at their ends that do not point in a fixed direction. These flexible ends are often even missing from the structure files, but our software still estimates how they could behave. Furthermore, circularization leaves non-helical sequences at the ends of the protein. This offers many possibilities for inserting fitting linkers. It is also a major difficulty, however, because flexible parts are hard to estimate with our brute-force approach.<br />
So far, two functions have been included that take the flexibility of the ends into account when generating linkers.<br />
====One helix at flexible ends====<br />
Here the only variable in the custom-tailored linker is the length of the helix. In most cases no suitable linker can be found, because the linker cannot bend around an obstacle between the ends. If such a linker is possible, however, it should be found, because it will be one of the best linkers predicted. ###figure needed###<br />
For all accessible angles calculated before, we compute all points that lie at certain distances from the N-terminus and from the C-terminus. The distances range from 4.5 &Aring; to the maximum length of the flexible part. ###figure###<br />
Then all possible connections between the first points and the last points are tested for whether they come too close to the protein or even pass through it. This is done with higher accuracy, because none of these linkers should be lost by error. For this purpose a different function calculates the distance from the protein to the connection; it is more time-consuming but also more accurate than the ones normally used.<br />
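The text does not specify how the accurate distance function works. One exact option, shown here purely as an assumption, is the point-to-segment distance evaluated for every atom; the function names are hypothetical and the 5 &Aring; default is the clearance mentioned in the PDB parsing section.<br />

```python
import math


def distance_to_segment(point, start, end):
    """Exact shortest distance from a point to the segment start-end."""
    seg = tuple(e - s for s, e in zip(start, end))
    rel = tuple(p - s for s, p in zip(start, point))
    seg_len2 = sum(c * c for c in seg)
    if seg_len2 == 0.0:
        return math.dist(point, start)
    # Projection parameter, clamped so the closest point stays on the segment.
    t = sum(r * c for r, c in zip(rel, seg)) / seg_len2
    t = min(1.0, max(0.0, t))
    closest = tuple(s + t * c for s, c in zip(start, seg))
    return math.dist(point, closest)


def connection_is_clear(start, end, atoms, clearance=5.0):
    """True if no atom lies closer to the connection than the clearance."""
    return all(distance_to_segment(a, start, end) >= clearance
               for a in atoms)
```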
<br />
====Linker with one angle at flexible ends====<br />
The other way to take flexible ends into account without inflating the calculation time is to restrict the linker to a single angle of a specified size. <br />
Linkers with flexible ends and one angle are built as right triangles. ###figure### Right triangles were chosen because a right angle can be built well with the angle patterns we have, and because it greatly simplifies the subsequent calculations. <br />
First, Pythagoras' theorem is used to determine which triangles can be built from all possible linker parts. Both legs must be buildable with our linker patterns, and the hypotenuse must have the correct length to fit between the ends.<br />
Afterwards all possible right triangles are constructed whose hypotenuse has its two end points on the N- and C-terminus. For this purpose Thales' theorem is used: all possible right-angle vertices lie on a sphere whose radius is half the hypotenuse. These right triangles are then checked for whether they can be built with our linker parts, by comparing their angles with the set of possible triangles obtained from Pythagoras' theorem.<br />
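The two geometric steps can be sketched as follows. The leg lengths, the tolerance and all function names are illustrative assumptions; the sketch only demonstrates that leg pairs are selected with Pythagoras' theorem and that any point on the Thales sphere over the segment between the termini sees that segment under a right angle.<br />

```python
import math
from itertools import combinations_with_replacement


def buildable_leg_pairs(lengths, hypotenuse, tol=0.5):
    """Pairs of available leg lengths whose hypotenuse fits between the ends."""
    return [(a, b)
            for a, b in combinations_with_replacement(sorted(lengths), 2)
            if abs(math.hypot(a, b) - hypotenuse) <= tol]


def thales_apex(n_term, c_term, theta, phi):
    """Point on the sphere whose diameter is the segment N-C (angles in radians)."""
    center = tuple((n + c) / 2 for n, c in zip(n_term, c_term))
    radius = math.dist(n_term, c_term) / 2
    return (center[0] + radius * math.sin(theta) * math.cos(phi),
            center[1] + radius * math.sin(theta) * math.sin(phi),
            center[2] + radius * math.cos(theta))


def angle_at(vertex, a, b):
    """Angle a-vertex-b in degrees."""
    u = tuple(x - v for x, v in zip(a, vertex))
    w = tuple(x - v for x, v in zip(b, vertex))
    cos_angle = (sum(i * j for i, j in zip(u, w))
                 / (math.hypot(*u) * math.hypot(*w)))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))
```

The leg lengths of a sampled apex can then be compared with the pairs returned by `buildable_leg_pairs` to decide whether the triangle can be built.<br />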
These triangles are now all shifted by each displacement of the possible angles at the C-terminus, resulting in the right-angle points ###figure needed###. Then the connections to the possible points from the N-terminus are generated. At this step the triangles no longer need to be exactly right-angled and can have slightly different angles; the following steps check whether the paths still fit. First the lengths of the legs of the triangles are checked, and then it is checked that the paths do not disturb the protein. Paths that would disturb the protein are deleted.<br />
===Rigid paths===<br />
If the first two approaches do not yield suitable paths, the software can also search for paths with up to three edges. Thus even one of the worst shapes for circularization, a torus with the two termini in the pits, could be circularized with our linker system without needing infinitely long straight segments. On the other hand, this was the largest number of possibilities that was still feasible to calculate.<br />
Here the flexible parts of the protein are assumed to point in the same direction as the following helix. This keeps the number of possibilities fixed, but it is of course a rough estimate.<br />
First, points at the right distances from each end are generated. ###figure needed, explains point names### Then, for each of the first points, next points are generated in all directions in steps of 5 degrees. Of these, all points that do not fit are sorted out immediately. Next, all connections from the second points to the last points are checked for whether their length lies in the right range. After this, the normal sorting steps are applied to these new connections.<br />
==Sorting out of paths==<br />
Most of the calculation time is spent sorting out unsuitable paths, because every path needs to be checked. With the brute-force approach the number of possible paths is about 10^9.<br />
There are three main functions that sort out paths that do not fit. The simplest one sorts out paths that would require an angle we cannot produce with our angle patterns. But as we can produce nearly all angles from ???20 -170??? degrees, only very few paths are sorted out by this function.<br />
The next function checks whether the end point lies inside the protein. If so, the path is deleted. Otherwise the connection from the preceding point to this point is checked for whether it passes too close to the protein.<br />
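The angle filter and the end-point filter can be sketched as below. The angle bounds and the clearance are left as parameters because the text still marks the angle range as tentative; treating a point as "inside the protein" when it lies closer than the clearance to any atom is an assumption, as are all function names. A path is taken to be a list of points that starts and ends at the termini.<br />

```python
import math


def vertex_angle(prev_pt, vertex, next_pt):
    """Angle in degrees between the two segments meeting at a vertex."""
    u = tuple(p - v for p, v in zip(prev_pt, vertex))
    w = tuple(n - v for n, v in zip(next_pt, vertex))
    cos_angle = (sum(i * j for i, j in zip(u, w))
                 / (math.hypot(*u) * math.hypot(*w)))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))


def angles_buildable(path, min_deg, max_deg):
    """True if every interior angle lies in the range the patterns can produce."""
    return all(
        min_deg <= vertex_angle(path[i - 1], path[i], path[i + 1]) <= max_deg
        for i in range(1, len(path) - 1))


def point_outside_protein(point, atoms, clearance):
    """True if the point keeps at least the clearance from every atom."""
    return all(math.dist(point, a) >= clearance for a in atoms)


def keep_path(path, atoms, min_deg, max_deg, clearance):
    """Cheap checks first: angles, then the interior points of the path."""
    if not angles_buildable(path, min_deg, max_deg):
        return False
    # The termini themselves sit on the protein, so only angle points are tested.
    return all(point_outside_protein(p, atoms, clearance)
               for p in path[1:-1])
```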
<br />
==Shifting paths to the patterns==<br />
<br />
Because of rounding errors and other inaccuracies, the generated paths may not fit the helix and angle patterns perfectly. Therefore the paths need to be refined before they are translated into an amino acid sequence. This is also done before the weighting, so that the paths no longer change after weighting.<br />
To this end the distances between the points are calculated, and the points are shifted until they fit the patterns. The shifts never exceed a certain length, so that no path passes through the protein after refinement if it did not do so before.<br />
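A strongly simplified version of this refinement, for a single segment, could look like the sketch below. It moves only the end point along the segment until the length matches the nearest available pattern length and rejects the segment if that would require a larger shift than allowed. The pattern lengths passed in, the shift limit and the function name are hypothetical, and a real refinement has to treat consecutive points of a path jointly.<br />

```python
import math


def snap_segment(start, end, allowed_lengths, max_shift):
    """Move `end` along the segment so its length matches a pattern length.

    Returns the shifted end point, or None if the required shift is too large.
    """
    current = math.dist(start, end)
    if current == 0.0:
        return None
    target = min(allowed_lengths, key=lambda length: abs(length - current))
    if abs(target - current) > max_shift:
        return None
    scale = target / current
    return tuple(s + (e - s) * scale for s, e in zip(start, end))
```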
==Weighting of paths==<br />
Before the paths are translated into sequences and thus into expressible linkers, the quality of each path needs to be assessed. We have identified four contributions that determine how well a path will work to enhance the heat stability of a target protein by circularization. In the end all these contributions are summed to obtain one final value for the goodness of the connection. For each single contribution, a lower weighting value always corresponds to a better path. All contributions are normalized by some property of the protein, so that each contribution on its own is independent of the target protein and generality is achieved.<br />
First the length of the linker is evaluated. We assumed that a shorter linker is normally better for restraining the ends, as a longer helix might give the ends more flexibility. Because we want this value to be independent of the size of the protein, the length of the linker is normalized by the distance between the two termini. This ensures that the contribution does not exceed all other contributions to the weighting function merely because the two termini are far apart and a longer linker is clearly needed.<br />
The next value measures the goodness of the angles used in the linker. The angle produced by each of our potential angle patterns is distributed around its mean with a certain standard deviation. The narrower this distribution, the more likely the pattern is to produce the assumed angle between the adjoining helices. Therefore path angles that fit the provided angle patterns perfectly get a lower weighting value.<br />
Then the distance from the protein is taken into account. Because the linkers should not disturb the protein in its normal environment, linkers that pass close to the protein's surface are considered better. Of course no linker comes so close to the surface that it would interact with the protein too much. This value is normalized by the minimal distance a linker should keep from the surface.<br />
Finally the regions a linker should avoid are evaluated. Each protein binds some ligands or substrates. The user can specify how big they are and where they attach. If a linker passes through a potential ligand, this value goes to infinity, so that the linker is discarded. Otherwise, the farther a linker is from a potential ligand, the better this value gets. The user can also specify the importance of certain regions. In the end the total weighting is independent of the number of binding sites.<br />
<br />
===Calibrating the weighting function===<br />
Every contribution has its own distribution. An example is shown in figure ### [[figure histogram_length_lys.png]]; all of them have different shapes. The aim is to find the paths that minimize all of these contributions. Therefore the weighting function combines the four contributions linearly:<br />
\[ W(p) = \alpha L(p) + \beta A(p) + \gamma D(p) + \delta u(p) \]<br />
where W is the final weighting, p the path, L the length contribution, A the angle contribution, D the distance contribution and u the contribution from the forbidden regions. $\alpha, \beta, \gamma, \delta$ are the weighting constants that had to be determined. The normalization of each contribution was chosen so that each of them is dimensionless and all have reasonably similar values.<br />
The weighting constants were obtained from the [[linker-screening ###lin]] performed with lysozyme and the [[enzyme-modeling ###link]].<br />
Please see [[below ###link]] for a detailed explanation of how the values were obtained.<br />
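The linear combination and the length normalization described in this section can be written down directly. The sketch below is only an illustration: the function names are hypothetical, the angle, distance and forbidden-region contributions are taken as already computed numbers, and the explicit check for an infinite forbidden-region value is a design choice of the sketch that keeps such paths discarded even if $\delta$ is zero.<br />

```python
import math


def length_contribution(path, n_term, c_term):
    """Total path length normalized by the distance between the termini."""
    total = sum(math.dist(a, b) for a, b in zip(path, path[1:]))
    return total / math.dist(n_term, c_term)


def total_weight(L, A, D, u, alpha, beta, gamma, delta):
    """W(p) = alpha*L + beta*A + gamma*D + delta*u; lower is better."""
    if math.isinf(u):
        # A path through a forbidden region is always discarded.
        return math.inf
    return alpha * L + beta * A + gamma * D + delta * u
```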
==Translating paths to sequence==<br />
As mentioned before, the software is provided with two databases, one for the possible angle patterns and one for the helix patterns. The choice of patterns was inspired by known crystal structures extracted from databases and described in different papers.<br />
A large in silico screening to refine the preferences for the patterns was then set up using the [[iGEM@home|###Link]] system. For a complete description of the search for suitable patterns, see the [[modeling|###Link to patternspart]] page. <br />
All possible paths are now split at the angles and compared with the patterns in the databases. ###Figure needed to explain, how the path is translated### The most suitable patterns are identified and concatenated to build the path's sequence. Note that this is only possible because of the modularity of the linker patterns used as building blocks: each block, whether an alpha helix or an angle pattern, is not affected by the others. Thus one sequence is produced for each possible path.<br />
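The translation step can be pictured as a lookup in two small tables followed by concatenation. In the sketch below the amino acid blocks are taken from the linkers listed in Table 1, but the lengths and angles assigned to them are placeholders chosen only for illustration (the helix lengths assume a rise of 1.5 &Aring; per residue); they are not the values of the actual pattern databases, and the function names are hypothetical.<br />

```python
# Placeholder databases: the keys are illustrative, not measured values.
HELIX_PATTERNS = {        # segment length in Angstrom -> helix block
    9.0: "AEAAAK",
    16.5: "AEAAAKEAAAK",
}
ANGLE_PATTERNS = {        # angle in degrees -> angle block
    60.0: "AAAHPEA",
    90.0: "ATGDLA",
    120.0: "ASLPAA",
}


def nearest(table, value):
    """Entry of the table whose key is closest to the requested value."""
    key = min(table, key=lambda k: abs(k - value))
    return table[key]


def path_to_sequence(segment_lengths, vertex_angles):
    """Alternate helix and angle blocks along the path and join them."""
    if len(vertex_angles) != len(segment_lengths) - 1:
        raise ValueError("need exactly one angle between consecutive segments")
    parts = []
    for i, length in enumerate(segment_lengths):
        parts.append(nearest(HELIX_PATTERNS, length))
        if i < len(vertex_angles):
            parts.append(nearest(ANGLE_PATTERNS, vertex_angles[i]))
    return "".join(parts)
```

Because every block is chosen independently of its neighbours, the lookup for one segment or angle never has to be repeated when another part of the path changes.<br />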
==Clustering of paths==<br />
Many different paths are represented by the same sequence [[###Figure that shows, different paths have same properties, already before in the text ###]], and we therefore clustered such paths. The weights of the clustered paths were then calculated by averaging the weights of the paths that compose a cluster.<br />
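Clustering by sequence amounts to grouping the weighted paths by their sequence and averaging each group. A minimal sketch, with hypothetical function names and paths represented as (sequence, weight) pairs:<br />

```python
from collections import defaultdict


def cluster_by_sequence(scored_paths):
    """Average the weights of all paths that translate to the same sequence."""
    groups = defaultdict(list)
    for sequence, weight in scored_paths:
        groups[sequence].append(weight)
    return {seq: sum(ws) / len(ws) for seq, ws in groups.items()}


def ranked(clusters):
    """Linkers sorted from best (lowest weight) to worst."""
    return sorted(clusters.items(), key=lambda item: item[1])
```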
<br />
=Results=<br />
==DNMT1==<br />
A major motivation for our effort to design rigid linkers with angles was the circularization of the DNA methyltransferase Dnmt1 (link). The truncated form used in our project is composed of 900 amino acids, and its N- and C-terminal extremities are well separated. To circularize it, two linkers were designed: a flexible one made of glycine and serine, and a rigid one designed by the software. The rigid linkers for DNMT1 were obtained from an early version of the software. At that time the calculation took 11 days on a laptop computer with an Intel i5 processor and 8 GB of RAM, which shows the importance of a distributed computing system for large proteins. Since then the software has improved considerably, reducing the calculation time for DNMT1 to about 1 day.<br />
<br />
###DNMT1 could be made heatstable, still missing###<br />
<br />
<br />
From the [[Project/Linker_Screening | linker-screening]] and the [[Modeling/Enzyme_Modeling | enzyme_modeling]] we obtained the activities of the different linkers after heat shock. These were used to calibrate the weighting function; see Table 1 for the results.<br />
<br />
{| class="table table-hover" style="text-align: center;"<br />
|+ '''Table 1''': Linkers and their amino acid sequences. Green: attachment sequences to prevent the flexible regions from being perturbed; Blue: angle; Purple: extein.<br />
! Linker<br />
! Amino acid sequence<br />
! Activity<br />
! Length contribution<br />
! Angle contribution<br />
! Binding site contribution<br />
! Distance from surface<br />
! Weighting value after calibration<br />
|- <br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
|'''Very good linkers'''<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
| sgt2<br />
| <span style="color:#3ADF00;">GG</span>AEAAAK<span style="color:#00BFFF;">AAAHPEA</span>AEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
| 0.7477<br />
| 1.9205<br />
| 6.7789<br />
| 0.002259<br />
| 10.525<br />
| 114912<br />
|-<br />
| rigid<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAKAA<span style="color:#3ADF00;">P</span><span style="color:#A901DB;">RGKCWE</span><br />
| 0.9447<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
| '''Average linkers'''<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
| may1<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAKA<span style="color:#00BFFF;">AAAHPEA</span>AEAAAK EAAAKA<span style="color:#00BFFF;">KTA</span>AEAAAKEAAAKA<span style="color:#A901DB;">RGTCWE</span><br />
| 0.7489<br />
| 6.2225<br />
| 13.19<br />
| 0.00384<br />
| 1095.2<br />
| 196414<br />
|-<br />
| ord1<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAK<span style="color:#00BFFF;">ATGDLA</span>AEAAAKAA<span style="color:#A901DB;">RGTCWE</span><br />
| 0.956<br />
| 4.936<br />
| 4.639<br />
| 0.00055708<br />
| 220.8<br />
| 27985<br />
|-<br />
| ord3<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAK<span style="color:#00BFFF;">ASLPAA</span>AEAAAKEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
| 1.390<br />
| 4.949<br />
| 7.116<br />
| 0.000545<br />
| 261.2<br />
| 28557<br />
|-<br />
|<br />
| <br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
| '''Short linkers'''<br />
| <br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
| sho1<br />
| <span style="color:#3ADF00;">GG</span><span style="color:#A901DB;">RGTCWE</span><br />
| 0.7087<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
| sho2<br />
| <span style="color:#3ADF00;">GG</span>AEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
| 0.5743<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
| flexible linker<br />
| GGSGGGSGRGKCWE<br />
| 0.6851<br />
| <br />
|<br />
|<br />
|<br />
|<br />
|-<br />
| linear lysozyme<br />
| no linker<br />
| 0.7039<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
|}<br />
[[circ_lam_lys_nils.png]] comparison with [[nice_linker_lysozyme_flexible_ends.png]]<br />
###About predictions of software###<br />
Finally we obtained a ranking of the in vitro tested linkers from the [[linker-screening]] and chose the parameters $\alpha, \beta, \gamma, \delta$ of the weighting function so that the ranking produced by the software reproduced the ranking from the assays. The final values were: ...<br />
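How the parameters were fitted is not described here. One simple possibility, given purely as an assumption, is a coarse grid search that counts for how many linker pairs the software ranking (lower weight is better) agrees with the assay ranking (higher activity is better). The grid values and function names are hypothetical.<br />

```python
from itertools import product


def concordant_pairs(weights, activities):
    """Number of linker pairs ranked the same way by weight and by activity."""
    count = 0
    for i in range(len(weights)):
        for j in range(i + 1, len(weights)):
            # Agreement: the lower weight goes with the higher activity.
            if (weights[i] - weights[j]) * (activities[i] - activities[j]) < 0:
                count += 1
    return count


def calibrate(contributions, activities, grid=(0.0, 0.5, 1.0, 2.0)):
    """Grid search for (alpha, beta, gamma, delta) maximizing rank agreement.

    contributions: one (L, A, D, u) tuple per tested linker.
    """
    best, best_score = None, -1
    for coeffs in product(grid, repeat=4):
        if not any(coeffs):
            continue  # all-zero weights cannot rank anything
        weights = [sum(c * x for c, x in zip(coeffs, row))
                   for row in contributions]
        score = concordant_pairs(weights, activities)
        if score > best_score:
            best, best_score = coeffs, score
    return best, best_score
```

With only a handful of tested linkers many parameter sets reproduce the same ranking, so a search like this returns one of several equally good solutions.<br />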
<br />
=Discussion=<br />
Will always be refined with more data from i@h...<br />
=References=<br />
<br />
[0] Wang, C.K.L., Kaas, Q., Chiche, L. & Craik, D.J. CyBase: A database of cyclic protein sequences and structures, with applications in protein discovery and engineering. Nucleic Acids Research 36, (2008).<br />
<br />
[1] Thornton, J.M. & Sibanda, B.L. Amino and carboxy-terminal regions in globular proteins. Journal of molecular biology 167, 443-460 (1983).</div>Igemnilshttp://2014.igem.org/Team:Heidelberg/pages/Linker_SoftwareTeam:Heidelberg/pages/Linker Software2014-10-17T21:45:31Z<p>Igemnils: /* Abstract */</p>
<hr />
<div><br />
=Abstract=<br />
As already [https://2014.igem.org/Team:Heidelberg/Toolbox/Circularization introduced], artificially circularized proteins may gain heat stability because their C- and N-termini are restrained from moving freely. Circularization may be trivial when the protein termini are very close to each other, which seems to be reasonably common [[#References|[1]]]. However, if the ends are too far apart, a long linker is needed to connect them. This linker should not change the natural conformation of the protein, and it should constrain the relative position of the ends to restrict the degrees of freedom and thus stabilize the structure even when heated. In addition, the linker should not affect any of the protein's functions. It is therefore important to prevent linkers from passing through the active site or from covering domains that bind other molecules, for example. One thus needs to be able to define the shape of possible linkers. This section describes the software we developed to design such linkers. We would like to stress that this work was made possible by the feedback between computer modeling and experimental work: we first designed linkers in silico, tested them experimentally and used the results to further calibrate the software. To our knowledge, this is the first time that such an approach has been used to custom-design rigid linkers with angles to connect protein extremities.<br />
<br />
=General procedure=<br />
First the protein structure is analyzed, and then all possible paths that connect the two ends and have fewer than three additional edges are found. In the end all these paths should be ranked by how well the protein would be circularized using each path. But as the final weighting is quite computationally intensive, unsuitable paths need to be sorted out first. A path is only sorted out if it breaks a rule for linkers; for example, paths should never pass through the protein. After all paths have been generated, they are refined by shifting the points according to the underlying linker model, so that no paths are considered that could not be built with our building blocks.<br />
After all paths have been correctly generated, they are weighted by several factors. From these, one weighting is calculated for each path that corresponds to the goodness of that path. Then the path is translated back into an amino acid sequence using our amino acid patterns. But as one sequence can follow more than one path, all paths built by the same sequence are clustered. Finally the average weighting of each path cluster is calculated, so that every linker receives only one weighting value, with contributions from all paths represented by this linker.<br />
==PDB parsing==<br />
First, the PDB file containing the structure of the target protein is parsed and the atom coordinates are stored. From these data the coordinate system of the PDB file is calibrated to metric units. For this purpose the distance between certain atoms is measured in all glycines. Because this distance is well known, it yields a conversion factor, which is normally 100 pm per unit in the PDB file.<br />
Next, some initial tests are run on the protein structure. First it is tested whether the C- and N-terminus lie on the solvent-accessible surface of the protein, which is crucial for circularization. The angles from which the ends are accessible are stored for the subsequent linker generation. An end is accessible at a certain angle if the axis obtained by rotating the z-axis by that angle does not pass too close to any atom position. The minimal distance a connection must keep from the protein is set to the radius of an alpha helix, 5 &Aring;. ###fig needed### <br />
==Generation of paths==<br />
Because our model builds linkers from helical rods and connecting angles, a path is completely defined by the coordinates of its angle points. Therefore only the angle points are generated, and the unsuitable ones are sorted out afterwards. As shifting existing points is very efficient in our programming style, this was easier than generating only those points that represent good connections.<br />
In the [[###link###|modelling]] part we have already described the patterns we use to build our linkers. Their modularity was crucial for the success of the software. The angle patterns were therefore always chosen so that the end of the angle pattern lies ???8??? &Aring; away from the turning point ###figure needed###. Thus only displacements of a fixed length have to be made, independent of the direction from which the linker arrives and of the direction in which it continues. This modularity makes the calculation more efficient than generating points randomly.<br />
===Flexible ends===<br />
Most proteins have flexible regions at their ends that do not point in a fixed direction. These flexible ends are often even missing from the structure files, but our software still estimates how they could behave. Furthermore, circularization leaves non-helical sequences at the ends of the protein. This offers many possibilities for inserting fitting linkers. It is also a major difficulty, however, because flexible parts are hard to estimate with our brute-force approach.<br />
So far, two functions have been included that take the flexibility of the ends into account when generating linkers.<br />
====One helix at flexible ends====<br />
Here the only variable in the custom-tailored linker is the length of the helix. In most cases no suitable linker can be found, because the linker cannot bend around an obstacle between the ends. If such a linker is possible, however, it should be found, because it will be one of the best linkers predicted. ###figure needed###<br />
For all accessible angles calculated before, we compute all points that lie at certain distances from the N-terminus and from the C-terminus. The distances range from 4.5 &Aring; to the maximum length of the flexible part. ###figure###<br />
Then all possible connections between the first points and the last points are tested for whether they come too close to the protein or even pass through it. This is done with higher accuracy, because none of these linkers should be lost by error. For this purpose a different function calculates the distance from the protein to the connection; it is more time-consuming but also more accurate than the ones normally used.<br />
<br />
====Linker with one angle at flexible ends====<br />
The other way to take flexible ends into account without inflating the calculation time is to restrict the linker to a single angle of a specified size. <br />
Linkers with flexible ends and one angle are built as right triangles. ###figure### Right triangles were chosen because a right angle can be built well with the angle patterns we have, and because it greatly simplifies the subsequent calculations. <br />
First, Pythagoras' theorem is used to determine which triangles can be built from all possible linker parts. Both legs must be buildable with our linker patterns, and the hypotenuse must have the correct length to fit between the ends.<br />
Afterwards all possible right triangles are constructed whose hypotenuse has its two end points on the N- and C-terminus. For this purpose Thales' theorem is used: all possible right-angle vertices lie on a sphere whose radius is half the hypotenuse. These right triangles are then checked for whether they can be built with our linker parts, by comparing their angles with the set of possible triangles obtained from Pythagoras' theorem.<br />
These triangles are now all shifted by each displacement of the possible angles at the C-terminus, resulting in the right-angle points ###figure needed###. Then the connections to the possible points from the N-terminus are generated. At this step the triangles no longer need to be exactly right-angled and can have slightly different angles; the following steps check whether the paths still fit. First the lengths of the legs of the triangles are checked, and then it is checked that the paths do not disturb the protein. Paths that would disturb the protein are deleted.<br />
===Rigid paths===<br />
If the first two approaches do not yield suitable paths, the software can also search for paths with up to three edges. Thus even one of the worst shapes for circularization, a torus with the two termini in the pits, could be circularized with our linker system without needing infinitely long straight segments. On the other hand, this was the largest number of possibilities that was still feasible to calculate.<br />
Here the flexible parts of the protein are assumed to point in the same direction as the following helix. This keeps the number of possibilities fixed, but it is of course a rough estimate.<br />
First, points at the right distances from each end are generated. ###figure needed, explains point names### Then, for each of the first points, next points are generated in all directions in steps of 5 degrees. Of these, all points that do not fit are sorted out immediately. Next, all connections from the second points to the last points are checked for whether their length lies in the right range. After this, the normal sorting steps are applied to these new connections.<br />
==Sorting out of paths==<br />
Most of the calculation time is spent sorting out unsuitable paths, because every path needs to be checked. With the brute-force approach the number of possible paths is about 10^9.<br />
There are three main functions that sort out paths that do not fit. The simplest one sorts out paths that would require an angle we cannot produce with our angle patterns. But as we can produce nearly all angles from ???20 -170??? degrees, only very few paths are sorted out by this function.<br />
The next function checks whether the end point lies inside the protein. If so, the path is deleted. Otherwise the connection from the preceding point to this point is checked for whether it passes too close to the protein.<br />
<br />
==Shifting paths to the patterns==<br />
<br />
Because of rounding errors and other inaccuracies, the generated paths may not fit the helix and angle patterns perfectly. Therefore the paths need to be refined before they are translated into an amino acid sequence. This is also done before the weighting, so that the paths no longer change after weighting.<br />
To this end the distances between the points are calculated, and the points are shifted until they fit the patterns. The shifts never exceed a certain length, so that no path passes through the protein after refinement if it did not do so before.<br />
==Weighting of paths==<br />
Before the paths are translated into sequences and thus into expressible linkers, the quality of each path needs to be assessed. We have identified four contributions that determine how well a path will work to enhance the heat stability of a target protein by circularization. In the end all these contributions are summed to obtain one final value for the goodness of the connection. For each single contribution, a lower weighting value always corresponds to a better path. All contributions are normalized by some property of the protein, so that each contribution on its own is independent of the target protein and generality is achieved.<br />
First the length of the linker is evaluated. We assumed that a shorter linker is normally better for restraining the ends, as a longer helix might give the ends more flexibility. Because we want this value to be independent of the size of the protein, the length of the linker is normalized by the distance between the two termini. This ensures that the contribution does not exceed all other contributions to the weighting function merely because the two termini are far apart and a longer linker is clearly needed.<br />
The next value measures the goodness of the angles used in the linker. The angle produced by each of our potential angle patterns is distributed around its mean with a certain standard deviation. The narrower this distribution, the more likely the pattern is to produce the assumed angle between the adjoining helices. Therefore path angles that fit the provided angle patterns perfectly get a lower weighting value.<br />
Then the distance from the protein is taken into account. Because the linkers should not disturb the protein in its normal environment, linkers that pass close to the protein's surface are considered better. Of course no linker comes so close to the surface that it would interact with the protein too much. This value is normalized by the minimal distance a linker should keep from the surface.<br />
Finally the regions a linker should avoid are evaluated. Each protein binds some ligands or substrates. The user can specify how big they are and where they attach. If a linker passes through a potential ligand, this value goes to infinity, so that the linker is discarded. Otherwise, the farther a linker is from a potential ligand, the better this value gets. The user can also specify the importance of certain regions. In the end the total weighting is independent of the number of binding sites.<br />
<br />
===Calibrating the weighting function===<br />
Every contribution has its own distribution. An example is shown in figure ### [[figure histogram_length_lys.png]]; all of them have different shapes. The aim is to find the paths that minimize all of these contributions. Therefore the weighting function combines the four contributions linearly:<br />
\[ W(p) = \alpha L(p) + \beta A(p) + \gamma D(p) + \delta u(p) \]<br />
where W is the final weighting, p the path, L the length contribution, A the angle contribution, D the distance contribution and u the contribution from the forbidden regions. $\alpha, \beta, \gamma, \delta$ are the weighting constants that had to be determined. The normalization of each contribution was chosen so that each of them is dimensionless and all have reasonably similar values.<br />
The weighting constants were obtained from the [[linker-screening ###lin]] performed with lysozyme and the [[enzyme-modeling ###link]].<br />
Please see [[below ###link]] for a detailed explanation of how the values were obtained.<br />
==Translating paths to sequence==<br />
As mentioned before, the software is provided with two databases, one for the possible angle patterns and one for the helix patterns. The choice of patterns was inspired by known crystal structures extracted from databases and described in different papers.<br />
A large in silico screening to refine the preferences for the patterns was then set up using the [[iGEM@home|###Link]] system. For a complete description of the search for suitable patterns, see the [[modeling|###Link to patternspart]] page. <br />
All possible paths are now split at the angles and compared with the patterns in the databases. ###Figure needed to explain, how the path is translated### The most suitable patterns are identified and concatenated to build the path's sequence. Note that this is only possible because of the modularity of the linker patterns used as building blocks: each block, whether an alpha helix or an angle pattern, is not affected by the others. Thus one sequence is produced for each possible path.<br />
==Clustering of paths==<br />
Many different paths are represented by the same sequence [[###Figure that shows, different paths have same properties, already before in the text ###]], so we clustered such paths. The weight of each cluster was then calculated by averaging the weights of the paths that compose it.<br />
=Results=<br />
==DNMT1==<br />
A major motivation for our effort to design rigid linkers with angles was the circularization of the DNA methyltransferase Dnmt1 (link). The truncated form used in our project is composed of 900 amino acids, and its N- and C-terminal extremities are well separated. To circularize it, two linkers were designed: a flexible one made of glycine and serine, and a rigid one designed by the software. The rigid linkers for DNMT1 were obtained from an early version of the software. At that time the calculation took 11 days on a laptop computer with an Intel i5 processor and 8 GB of RAM, which shows the importance of a distributed computing system for large proteins. Since then the software has improved considerably, reducing the calculation time for DNMT1 to about 1 day.<br />
<br />
###DNMT1 could be made heatstable, still missing###<br />
<br />
<br />
From the [[Project/Linker_Screening | linker screening]] and the [[Modeling/Enzyme_Modeling | enzyme modeling]] we obtained the activities of the different linkers after heat shock. These were used to calibrate the weighting function; see table 1 for the results.<br />
<br />
{| class="table table-hover" style="text-align: center;"<br />
|+ '''Table 1''': Linkers and their amino acid sequences. Green: attachment sequences to prevent the flexible regions from being perturbed; Blue: angle; Purple: extein.<br />
! Linker<br />
! Amino acid sequence<br />
! Activity<br />
! Length contribution<br />
! Angle contribution<br />
! Binding site contribution<br />
! Distance from surface<br />
! Weighting value after calibration<br />
|- <br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
|'''Very good linkers'''<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
| sgt2<br />
| <span style="color:#3ADF00;">GG</span>AEAAAK<span style="color:#00BFFF;">AAAHPEA</span>AEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
| 0.7477<br />
| 1.9205<br />
| 6.7789<br />
| 0.002259<br />
| 10.525<br />
| 114912<br />
|-<br />
| rigid<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAKAA<span style="color:#3ADF00;">P</span><span style="color:#A901DB;">RGKCWE</span><br />
| 0.9447<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
| '''Average linkers'''<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
| may1<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAKA<span style="color:#00BFFF;">AAAHPEA</span>AEAAAK EAAAKA<span style="color:#00BFFF;">KTA</span>AEAAAKEAAAKA<span style="color:#A901DB;">RGTCWE</span><br />
| 0.7489<br />
| 6.2225<br />
| 13.19<br />
| 0.00384<br />
| 1095.2<br />
| 196414<br />
|-<br />
| ord1<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAK<span style="color:#00BFFF;">ATGDLA</span>AEAAAKAA<span style="color:#A901DB;">RGTCWE</span><br />
| 0.956<br />
| 4.936<br />
| 4.639<br />
| 0.00055708<br />
| 220.8<br />
| 27985<br />
|-<br />
| ord3<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAK<span style="color:#00BFFF;">ASLPAA</span>AEAAAKEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
| 1.390<br />
| 4.949<br />
| 7.116<br />
| 0.000545<br />
| 261.2<br />
| 28557<br />
|-<br />
|<br />
| <br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
| '''Short linkers'''<br />
| <br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
| sho1<br />
| <span style="color:#3ADF00;">GG</span><span style="color:#A901DB;">RGTCWE</span><br />
| 0.7087<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
| sho2<br />
| <span style="color:#3ADF00;">GG</span>AEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
| 0.5743<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
| flexible linker<br />
| GGSGGGSGRGKCWE<br />
| 0.6851<br />
| <br />
|<br />
|<br />
|<br />
|<br />
|-<br />
| linear lysozyme<br />
| no linker<br />
| 0.7039<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
|}<br />
[[circ_lam_lys_nils.png]] comparison with [[nice_linker_lysozyme_flexible_ends.png]]<br />
###About predictions of software###<br />
In the end we obtained a ranking of the linkers tested in vitro in the [[linker-screening]] and chose the parameters $\alpha, \beta, \gamma, \delta$ of the weighting function in such a way that the ranking from the software reproduced the ranking from the assays. The final values were: ...<br />
<br />
=Discussion=<br />
Will always be refined with more data from i@h...<br />
=References=<br />
<br />
[0] Wang, C.K.L., Kaas, Q., Chiche, L. & Craik, D.J. CyBase: A database of cyclic protein sequences and structures, with applications in protein discovery and engineering. Nucleic Acids Research 36, (2008).<br />
<br />
[1] Thornton, J.M. & Sibanda, B.L. Amino and carboxy-terminal regions in globular proteins. Journal of molecular biology 167, 443-460 (1983).</div>Igemnilshttp://2014.igem.org/Team:Heidelberg/pages/Linker_SoftwareTeam:Heidelberg/pages/Linker Software2014-10-17T21:42:20Z<p>Igemnils: /* Abstract */</p>
<hr />
<div><br />
=Abstract=<br />
As already [https://2014.igem.org/Team:Heidelberg/Toolbox/Circularization introduced], artificially circularized proteins may gain heat stability because their C- and N-termini are restrained from moving freely. Circularization may be trivial when the protein termini are very close to each other, which seems to be reasonably common [[#References|[1]]]. However, if the ends are far apart, a long linker is needed to connect them. This linker should not change the natural conformation of the protein, and it should constrain the relative position of the ends to restrict their degrees of freedom and thus stabilize the structure even when heated. In addition, the linker should not affect any of the protein's functions. Consequently it is important to prevent linkers from passing through the active site or from covering domains that bind other molecules, for example. One therefore needs to be able to define the shape of possible linkers. This section describes the software we developed to design such linkers. We would like to stress that this work was made possible by the feedback between computer modeling and experimental work: we first designed linkers in silico, tested them experimentally and used the results to further calibrate the software. To our knowledge, this is the first time that such an approach has been used to custom-design linkers connecting protein extremities.<br />
<br />
=General procedure=<br />
First the protein structure is analysed, and then all possible paths that connect the two ends with fewer than three additional edges are found. Ultimately these paths are to be ranked by how well each of them would circularize the protein. Because the final weighting is quite computationally intensive, unsuitable paths are filtered out first. A path is only filtered out if it violates a rule for linkers; for example, paths must never pass through the protein. After all paths have been generated, they are refined by shifting their points according to the underlying linker model, so that no path is considered that could not be built with our building blocks.<br />
After all paths have been generated correctly, they are weighted by several factors, which are combined into a single weighting value that expresses the quality of the path. Each path is then translated into an amino acid sequence using our amino acid patterns. Because one sequence can follow more than one path, all paths built from the same sequence are clustered. Finally, the weightings within each cluster are averaged, so that every linker receives a single weighting value that includes the contributions of all paths it represents.<br />
==PDB parsing==<br />
First the PDB file containing the structure of the target protein is parsed and the atom coordinates are stored. From these data the coordinate system of the PDB file is calibrated to metric units. For this purpose the distance between certain atoms in all glycines is measured. Because this distance is well known, it yields a calibration, which normally comes out at 100 pm per unit in the PDB file.<br />
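The parsing and calibration step could be sketched roughly as follows. This is not the team's code: the function names are hypothetical, and the choice of the backbone N to C-alpha bond of glycine with a reference length of about 146 pm is an assumption, since the text only speaks of "certain atoms". The column positions follow the fixed-width ATOM record layout of the PDB format.<br />

```python
import math

def parse_atoms(pdb_text):
    """Collect name, residue and coordinates from the fixed-width ATOM records."""
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            atoms.append({
                "name": line[12:16].strip(),    # atom name, columns 13-16
                "res": line[17:20].strip(),     # residue name, columns 18-20
                "chain": line[21],              # chain identifier, column 22
                "resseq": int(line[22:26]),     # residue number, columns 23-26
                "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
            })
    return atoms

def pm_per_unit(atoms, reference_pm=146.0):
    """Estimate picometres per coordinate unit from glycine N-CA distances.

    reference_pm is an assumed textbook value for the N-CA bond length.
    """
    by_residue = {}
    for a in atoms:
        if a["res"] == "GLY" and a["name"] in ("N", "CA"):
            by_residue.setdefault((a["chain"], a["resseq"]), {})[a["name"]] = a["xyz"]
    dists = [math.dist(r["N"], r["CA"]) for r in by_residue.values() if len(r) == 2]
    if not dists:
        raise ValueError("no complete glycine found for calibration")
    return reference_pm / (sum(dists) / len(dists))
```

For a standard PDB file, whose coordinates are given in &Aring;, such a calibration is expected to return a value close to 100 pm per unit, in line with the figure quoted above.<br />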
After this, some initial tests are performed on the protein structure. First it is tested whether the C- and N-terminus lie on the solvent-accessible surface of the protein, which is crucial for circularization. The angles from which the ends are accessible are stored for the subsequent linker generation. An end is accessible at a certain angle if the axis obtained by rotating the z-axis by that angle does not come too close to any atom. The minimal distance a connection must keep from the protein is set to the radius of an alpha helix, 5 &Aring;. ###fig needed### <br />
==Generation of paths==<br />
Because our model builds linkers from helical rods and connecting angles, a path is completely defined by the coordinates of its angle points. Therefore only the angle points are generated, and the suitable ones are then selected. Because shifting existing points is very efficient in our programming style, this was easier than generating only those points that represent good connections.<br />
In the [[###link###|modelling]] part we have already described the patterns we use to build our linkers. Their modularity was crucial for the success of the software. The angle patterns were therefore always chosen so that the end of the angle pattern was ???8??? &Aring; away from the turning point ###figure needed###. Thus only displacements of a certain length had to be made, independent of the direction from which the linker arrives and of the direction in which it continues. This modularity makes the calculations more efficient than generating points at random.<br />
===Flexible ends===<br />
Most proteins have flexible regions at their ends that do not point in a defined direction. Often these flexible ends are even missing from the structure files, but our software still estimates how they could behave. Furthermore, circularization leaves non-helical sequences at the ends of the protein. This opens up many possibilities for inserting fitting linkers, but it is also a major problem, because flexible parts are not easy to estimate with our brute-force approach.<br />
So far, two functions have been included for generating linkers that take the flexibility of the ends into account.<br />
====One helix at flexible ends====<br />
Here the only variable in the custom-tailored linker is the length of the helix. Most likely no suitable linker can be found this way, because the linker cannot bend around an obstacle between the ends. If such a linker is possible, however, it should be found, because it will be one of the best linkers predicted. ###figure needed###<br />
For all accessible angles calculated before, we therefore compute all points that lie at certain distances from the N-terminus and from the C-terminus. The distances range from 4.5 &Aring; to the maximum length of the flexible part. ###figure###<br />
Then all possible connections between the first points and the last points are tested for whether they come too close to the protein or even pass through it. This is done with higher accuracy, because none of these linkers should be lost by error. For this reason a different function is used here to calculate the distance between the protein and the connection; it is more time-consuming but also more accurate than the ones normally used.<br />
<br />
====Linker with one angle at flexible ends====<br />
The other way of taking flexible ends into account without blowing up the calculation time was to restrict the number of angles in the helical pattern to exactly one. <br />
Linkers with flexible ends and one angle are built as right triangles. ###figure### Right triangles are chosen because a right angle can be built well with the angle patterns we have and greatly simplifies the further calculation. <br />
First, the Pythagorean theorem is used to determine which triangles can be built from all possible linker parts. It is important that both legs can be built with our linker patterns and that the hypotenuse has the correct length to fit between the ends.<br />
Afterwards all possible right triangles are constructed whose hypotenuse has its two endpoints at the N- and C-terminus. For this purpose Thales' theorem is used: all possible right-angle vertices lie on a sphere whose radius is half the hypotenuse. These right triangles are then checked for whether they can be built with our linker parts, by comparing them with the set of possible triangles obtained from the Pythagorean theorem.<br />
These triangles are now all shifted by each possible angle displacement at the C-terminus, which yields the right-angle points ###figure needed###. Next, the connections to the possible points from the N-terminus are generated. At this step the triangles no longer need to be exactly right-angled but may have slightly different angles; the following steps check whether the paths still fit. First the lengths of the triangle legs are checked, and then it is checked that the paths do not disturb the protein. Paths that would disturb the protein in any way are simply deleted.<br />
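The two geometric steps of this subsection can be illustrated with a small sketch. It is only an illustration under assumptions: the function names, the tolerance, the angular grid and the minimal leg length are hypothetical and not taken from the software. The first function lists the pairs of buildable leg lengths that satisfy the Pythagorean theorem for a given hypotenuse; the second samples candidate right-angle vertices on the Thales sphere whose diameter is the segment between the two termini.<br />

```python
import math
from itertools import product

def buildable_right_triangles(leg_lengths, hypotenuse, tolerance=0.5):
    """Ordered leg pairs (a, b) with sqrt(a^2 + b^2) within tolerance of the hypotenuse."""
    return [(a, b) for a, b in product(leg_lengths, repeat=2)
            if abs(math.hypot(a, b) - hypotenuse) <= tolerance]

def thales_points(n_term, c_term, n_theta=6, n_phi=12, min_leg=1.0):
    """Sample right-angle vertices on the sphere with diameter n_term - c_term."""
    center = [(n + c) / 2 for n, c in zip(n_term, c_term)]
    radius = math.dist(n_term, c_term) / 2
    points = []
    for j in range(1, n_theta):              # polar angle, poles skipped
        theta = math.pi * j / n_theta
        for k in range(n_phi):               # azimuthal angle
            phi = 2 * math.pi * k / n_phi
            p = (center[0] + radius * math.sin(theta) * math.cos(phi),
                 center[1] + radius * math.sin(theta) * math.sin(phi),
                 center[2] + radius * math.cos(theta))
            # drop degenerate vertices that coincide with one of the termini
            if math.dist(p, n_term) >= min_leg and math.dist(p, c_term) >= min_leg:
                points.append(p)
    return points
```

Every point returned by `thales_points` sees the two termini under a right angle, so its two leg lengths can be compared directly with the pairs returned by `buildable_right_triangles`.<br />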
===Rigid paths===<br />
If the first two approaches for finding suitable paths did not succeed, the software can also search for paths with up to three edges. Thus even one of the worst shapes for circularization, a torus with the two termini in the pits, could be circularized with our linker system without needing infinitely long straight segments. On the other hand, this was the largest number of possibilities that was still feasible to calculate.<br />
Here the flexible parts of the protein are assumed to point in the same direction as the following helix. In this way the number of possibilities is kept fixed, although this is of course a rather rough estimate.<br />
First, points at the right distances from each end are generated. ###figure needed, explains point names### Then, from each of the first points, next points are generated in all directions in steps of 5 degrees. Of these, all points that do not fit are discarded immediately. Next, all connections from the second points to the last points are checked for whether they have the right length. After this, the normal filtering steps are applied to these new connections.<br />
==Sorting out of paths==<br />
Most of the calculation time is spent filtering out unsuitable paths, because every single path needs to be checked. With the brute-force approach the number of possible paths is about 10^9.<br />
There are three main functions that filter out paths that do not fit. The simplest one discards paths that would require an angle we cannot produce with our angle patterns. But as we can produce nearly all angles from ???20 -170??? degrees, only very few paths are discarded by this function.<br />
The next function checks whether the endpoint of a segment lies inside the protein. If it does, the path is deleted. Otherwise the connection from the preceding point to this point is checked for whether it passes too close to the protein.<br />
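The clearance test for a single connection can be sketched as a point-to-segment distance check against all atoms. This is a minimal sketch, not the optimized function used in the software; the function names are hypothetical, and the default of 5 &Aring; is the helix radius mentioned in the PDB parsing section.<br />

```python
import math

def point_segment_distance(p, a, b):
    """Shortest distance from point p to the segment between a and b."""
    ab = [b[i] - a[i] for i in range(3)]
    ap = [p[i] - a[i] for i in range(3)]
    denom = sum(c * c for c in ab)
    # parameter of the closest point, clamped to the segment
    t = 0.0 if denom == 0 else max(0.0, min(1.0, sum(ap[i] * ab[i] for i in range(3)) / denom))
    closest = [a[i] + t * ab[i] for i in range(3)]
    return math.dist(p, closest)

def segment_is_clear(a, b, atoms, min_distance=5.0):
    """True if no atom is closer than min_distance to the connection a-b."""
    return all(point_segment_distance(atom, a, b) >= min_distance for atom in atoms)
```

Checking every atom for every segment is the brute-force variant; it shows why this step dominates the run time when about 10^9 paths have to be examined.<br />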
<br />
==Shifting paths to the patterns==<br />
<br />
Because of rounding errors and other inaccuracies, the generated paths may not fit the helix and angle patterns perfectly. Therefore the paths need to be refined before they are translated into an amino acid sequence. This is also done before the weighting, so that the paths no longer change after they have been weighted.<br />
To this end the distances between consecutive points are calculated, and the points are then shifted until the distances fit the patterns. The shifts never exceed a certain length, so that a path that did not pass through the protein before refinement cannot do so afterwards.<br />
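One way to picture this refinement is to snap each segment length to the nearest length that can be built from the patterns and to reject the path if the required shift is too large. The sketch below assumes a list of buildable lengths and a maximal shift; both, like the function names, are hypothetical and only serve as illustration.<br />

```python
import math

def snap_length(length, allowed_lengths, max_shift):
    """Nearest buildable length, or None if the shift would exceed max_shift."""
    best = min(allowed_lengths, key=lambda candidate: abs(candidate - length))
    return best if abs(best - length) <= max_shift else None

def shift_point(anchor, point, new_length):
    """Move point along the line from anchor so that their distance becomes new_length."""
    d = math.dist(anchor, point)
    return tuple(anchor[i] + (point[i] - anchor[i]) * new_length / d for i in range(3))
```

Bounding the shift is what keeps a refined path from moving into the protein after it has already passed the clearance checks.<br />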
==Weighting of paths==<br />
Before the paths are translated into sequences, and thus into expressible linkers, the quality of each path needs to be assessed. We have identified four contributions that determine how well a path will enhance the heat stability of a target protein by circularization. In the end all of these contributions are combined into one final value for the quality of the connection. For every single contribution, a lower weighting value means a better path. Each contribution is normalized by some property of the protein, so that it is independent of the target protein and the weighting remains general.<br />
First the length of the linker is evaluated. We assumed that a shorter linker is normally better at restraining the ends, as a longer helix might give them more flexibility. Because this value should be independent of the size of the protein, the length of the linker is normalized by the distance between the two termini. This ensures that the length contribution does not outweigh all other contributions to the weighting function merely because the two termini are far apart and a longer linker is unavoidable.<br />
The next value is a measure of the quality of the angles used in the linker. Each of our potential angle patterns produces angles that are distributed around a mean with some standard deviation. The narrower this distribution, the likelier it is that the pattern actually produces the intended angle between the flanking helices. Therefore path angles that fit the provided angle patterns well receive a lower weighting value.<br />
Then the distance from the protein is taken into account. Because the linkers should not disturb the protein in its normal environment, linkers that pass close to the protein's surface are considered better. No linker comes too close to the surface, however, since that would make it interact with the protein too much. This value is normalized by the minimal distance a linker should keep from the surface.<br />
After this, the regions a linker should avoid are evaluated. Each protein binds ligands or substrates; the user can specify how big they are and where they attach. If a linker passes through a potential ligand, this value goes to infinity, so that the linker is discarded. Otherwise, the farther a linker is from a potential ligand, the better this value gets. The user can also specify the importance of certain regions. In the end, the total weighting is independent of the number of binding sites.<br />
<br />
===Calibrating the weighting function===<br />
Every contribution has its own distribution. An example is shown in figure ### [[figure histogram_length_lys.png]]; the distributions all have different shapes. The aim is to find the paths that minimize all of these contributions at once. Therefore the weighting function combines the four contributions linearly:<br />
\[ W(p) = \alpha L(p) + \beta A(p) + \gamma D(p) + \delta u(p) \]<br />
where W is the final weighting, p the path, L the length contribution, A the angle contribution, D the distance contribution and u the contribution from the forbidden regions. $\alpha, \beta, \gamma, \delta$ are the weighting constants that had to be determined. Each contribution is normalized so that it is dimensionless and all contributions take reasonably similar values.<br />
The weighting constants were obtained from the [[linker-screening ###lin]] performed with lysozyme and the [[enzyme-modeling ###link]].<br />
Please see [[below ###link]] for a detailed explanation of how the values were obtained.<br />
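In code, the linear combination $W(p)$ amounts to a few lines. The sketch below assumes that the four contributions have already been computed and normalized; the function name and argument names are hypothetical. The explicit check for an infinite forbidden-region value is a design choice of this sketch: it keeps a path through a ligand discarded even if $\delta$ were set to zero, where the product would otherwise be undefined.<br />

```python
import math

def path_weight(length, angle, distance, forbidden, alpha, beta, gamma, delta):
    """W(p) = alpha*L + beta*A + gamma*D + delta*u; lower values mean better paths."""
    if math.isinf(forbidden):
        # the path passes through a ligand region and is discarded in any case
        return math.inf
    return alpha * length + beta * angle + gamma * distance + delta * forbidden
```

Sorting paths by this value in ascending order puts the best candidates first, and discarded paths with an infinite weight end up last.<br />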
==Translating paths to sequence==<br />
As mentioned above, the software comes with two databases, one for the possible angle patterns and one for the helix patterns. The choice of patterns was inspired by known crystal structures extracted from databases and described in various papers.<br />
A large in silico screen to refine the preferences among the patterns was then set up using the [[iGEM@home|###Link]] system. For a complete description of the search for suitable patterns, see the [[modeling|###Link to patternspart]] page. <br />
All possible paths are now split at the angles and compared with the patterns in the databases. ###Figure needed to explain, how the path is translated### The most suitable patterns are identified and concatenated to build the path's sequence. Note that this is only possible because of the modularity of the linker patterns used as building blocks: each block, whether an alpha helix or an angle pattern, is unaffected by the others. Thus one sequence is produced for each possible path.<br />
==Clustering of paths==<br />
Many different paths are represented by the same sequence [[###Figure that shows, different paths have same properties, already before in the text ###]], so we clustered such paths. The weight of each cluster was then calculated by averaging the weights of the paths that compose it.<br />
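The clustering described here is a grouping by sequence followed by an average. A minimal sketch, assuming each path is available as a pair of its translated sequence and its weight (the function name and this data layout are hypothetical):<br />

```python
from collections import defaultdict

def cluster_weights(paths):
    """Map each linker sequence to the mean weight of all paths that translate to it."""
    groups = defaultdict(list)
    for sequence, weight in paths:
        groups[sequence].append(weight)
    return {sequence: sum(ws) / len(ws) for sequence, ws in groups.items()}
```

Each linker sequence thus receives exactly one weighting value, regardless of how many geometric paths it can follow.<br />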
=Results=<br />
==DNMT1==<br />
A major motivation for our effort to design rigid linkers with angles was the circularization of the DNA methyltransferase Dnmt1 (link). The truncated form used in our project is composed of 900 amino acids, and its N- and C-terminal extremities are well separated. To circularize it, two linkers were designed: a flexible one made of glycine and serine, and a rigid one designed by the software. The rigid linkers for DNMT1 were obtained from an early version of the software. At that time the calculation took 11 days on a laptop computer with an Intel i5 processor and 8 GB of RAM, which shows the importance of a distributed computing system for large proteins. Since then the software has improved considerably, reducing the calculation time for DNMT1 to about 1 day.<br />
<br />
###DNMT1 could be made heatstable, still missing###<br />
<br />
<br />
From the [[Project/Linker_Screening | linker screening]] and the [[Modeling/Enzyme_Modeling | enzyme modeling]] we obtained the activities of the different linkers after heat shock. These were used to calibrate the weighting function; see table 1 for the results.<br />
<br />
{| class="table table-hover" style="text-align: center;"<br />
|+ '''Table 1''': Linkers and their amino acid sequences. Green: attachment sequences to prevent the flexible regions from being perturbed; Blue: angle; Purple: extein.<br />
! Linker<br />
! Amino acid sequence<br />
! Activity<br />
! Length contribution<br />
! Angle contribution<br />
! Binding site contribution<br />
! Distance from surface<br />
! Weighting value after calibration<br />
|- <br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
|'''Very good linkers'''<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
| sgt2<br />
| <span style="color:#3ADF00;">GG</span>AEAAAK<span style="color:#00BFFF;">AAAHPEA</span>AEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
| 0.7477<br />
| 1.9205<br />
| 6.7789<br />
| 0.002259<br />
| 10.525<br />
| 114912<br />
|-<br />
| rigid<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAKAA<span style="color:#3ADF00;">P</span><span style="color:#A901DB;">RGKCWE</span><br />
| 0.9447<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
| '''Average linkers'''<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
| may1<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAKA<span style="color:#00BFFF;">AAAHPEA</span>AEAAAK EAAAKA<span style="color:#00BFFF;">KTA</span>AEAAAKEAAAKA<span style="color:#A901DB;">RGTCWE</span><br />
| 0.7489<br />
| 6.2225<br />
| 13.19<br />
| 0.00384<br />
| 1095.2<br />
| 196414<br />
|-<br />
| ord1<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAK<span style="color:#00BFFF;">ATGDLA</span>AEAAAKAA<span style="color:#A901DB;">RGTCWE</span><br />
| 0.956<br />
| 4.936<br />
| 4.639<br />
| 0.00055708<br />
| 220.8<br />
| 27985<br />
|-<br />
| ord3<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAK<span style="color:#00BFFF;">ASLPAA</span>AEAAAKEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
| 1.390<br />
| 4.949<br />
| 7.116<br />
| 0.000545<br />
| 261.2<br />
| 28557<br />
|-<br />
|<br />
| <br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
| '''Short linkers'''<br />
| <br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
| sho1<br />
| <span style="color:#3ADF00;">GG</span><span style="color:#A901DB;">RGTCWE</span><br />
| 0.7087<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
| sho2<br />
| <span style="color:#3ADF00;">GG</span>AEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
| 0.5743<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
| flexible linker<br />
| GGSGGGSGRGKCWE<br />
| 0.6851<br />
| <br />
|<br />
|<br />
|<br />
|<br />
|-<br />
| linear lysozyme<br />
| no linker<br />
| 0.7039<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
|}<br />
[[circ_lam_lys_nils.png]] comparison with [[nice_linker_lysozyme_flexible_ends.png]]<br />
###About predictions of software###<br />
In the end we obtained a ranking of the linkers tested in vitro in the [[linker-screening]] and chose the parameters $\alpha, \beta, \gamma, \delta$ of the weighting function in such a way that the ranking from the software reproduced the ranking from the assays. The final values were: ...<br />
<br />
=Discussion=<br />
Will always be refined with more data from i@h...<br />
=References=<br />
<br />
[0] Wang, C.K.L., Kaas, Q., Chiche, L. & Craik, D.J. CyBase: A database of cyclic protein sequences and structures, with applications in protein discovery and engineering. Nucleic Acids Research 36, (2008).<br />
<br />
[1] Thornton, J.M. & Sibanda, B.L. Amino and carboxy-terminal regions in globular proteins. Journal of molecular biology 167, 443-460 (1983).</div>Igemnilshttp://2014.igem.org/Team:Heidelberg/pages/Linker_SoftwareTeam:Heidelberg/pages/Linker Software2014-10-17T21:21:17Z<p>Igemnils: </p>
<hr />
<div><br />
=Abstract=<br />
Artificially circularized proteins gain heat stability because their C- and N-termini are restrained from moving freely. If the ends are too far from each other, a linker is needed to connect them, so that the natural conformation of the protein is not changed too much while the relative position of the ends is constrained and their degrees of freedom are restricted. These linkers must not hinder the protein's function in any way. Consequently it is important to prevent linkers from passing through the active site or from covering a cavity of the protein, for example.<br />
In the [[###link###|modelling]] part we have shown that the shape of our linkers can be defined by applying our model of rigid helical rods connected by well-defined angle regions. But even with the ability to define the path a linker takes, one still needs to know which path it should take.<br />
Especially for larger proteins with complex shapes this can be very difficult. Furthermore, one would like to make sure that the active sites are avoided. But even when all of this is considered, one could never also account by hand for the other paths that the same linker is able to take because of a rotational degree of freedom inherent in the linker model ###figure needed###<br />
We therefore decided to provide the scientific community with powerful open-source software that, for every protein with a known structure, can calculate the sequence of a rigid linker that circularizes the protein with minimal inhibition of its function. <br />
Until now each scientist had to estimate the length of the linker themselves,###check for flexlinker### so our software is a completely novel approach to circularization.<br />
=General procedure=<br />
First the protein structure is analysed, and then all possible paths that connect the two ends with fewer than three additional edges are found. Ultimately these paths are to be ranked by how well each of them would circularize the protein. Because the final weighting is quite computationally intensive, unsuitable paths are filtered out first. A path is only filtered out if it violates a rule for linkers; for example, paths must never pass through the protein. After all paths have been generated, they are refined by shifting their points according to the underlying linker model, so that no path is considered that could not be built with our building blocks.<br />
After all paths have been generated correctly, they are weighted by several factors, which are combined into a single weighting value that expresses the quality of the path. Each path is then translated into an amino acid sequence using our amino acid patterns. Because one sequence can follow more than one path, all paths built from the same sequence are clustered. Finally, the weightings within each cluster are averaged, so that every linker receives a single weighting value that includes the contributions of all paths it represents.<br />
==PDB parsing==<br />
First the PDB file containing the structure of the target protein is parsed and the atom coordinates are stored. From these data the coordinate system of the PDB file is calibrated to metric units. For this purpose the distance between certain atoms in all glycines is measured. Because this distance is well known, it yields a calibration, which normally comes out at 100 pm per unit in the PDB file.<br />
After this, some initial tests are performed on the protein structure. First it is tested whether the C- and N-terminus lie on the solvent-accessible surface of the protein, which is crucial for circularization. The angles from which the ends are accessible are stored for the subsequent linker generation. An end is accessible at a certain angle if the axis obtained by rotating the z-axis by that angle does not come too close to any atom. The minimal distance a connection must keep from the protein is set to the radius of an alpha helix, 5 &Aring;. ###fig needed### <br />
==Generation of paths==<br />
Because our model builds linkers from helical rods and connecting angles, a path is completely defined by the coordinates of its angle points. Therefore only the angle points are generated, and the suitable ones are then selected. Since shifting existing points is very efficient in our programming style, this was easier than generating only those points that represent good connections.<br />
In the [[###link###|modelling]] part we have already described the patterns we use to build our linkers. Their modularity was crucial for the success of the software. The angle patterns were therefore always chosen so that the end of the angle pattern is ???8??? &Aring; away from the turning point ###figure needed###. Thus only displacements of a fixed length have to be made, independent of the direction from which the linker arrives and of the direction in which it continues. This modularity makes the calculations more efficient than generating points randomly.<br />
===Flexible ends===<br />
Most proteins have flexible regions at their ends that do not point in a defined direction. These flexible ends are often even missing from the structure files, but our software still estimates how they could behave. Furthermore, circularization leaves non-helical sequences at the ends of the protein. This offers many possibilities for inserting fitting linkers, but it is also a major problem, because estimating flexible parts is not easy with our brute-force approach.<br />
So far, two functions have been included for generating linkers that take the flexibility of the ends into account.<br />
====One helix at flexible ends====<br />
The only variable in this custom-tailored linker is the length of the helix. Most likely no suitable linker can be found, because if there is an obstacle between the ends, the linker cannot bend around it. But if such a linker is possible, it should be found, because it will be one of the best linkers predicted. ###figure needed###<br />
For all accessible angles calculated before, we compute all points that lie at certain distances from the N-terminus and from the C-terminus. The distances range from 4.5 &Aring; to the maximum length of the flexible part. ###figure###<br />
Then all possible connections between the first points and the last points are tested for whether they come too close to the protein or even pass through it. This is done with higher accuracy, because none of these linkers should be lost by error. Therefore a different function is used here to calculate the distance between the protein and the connection; it is more time-consuming but also more accurate than the one normally used.<br />
<br />
====Linker with one angle at flexible ends====<br />
The other possibility for taking flexible ends into account without blowing up the calculation time was to restrict the number of angles in the helical pattern to exactly one. <br />
Linkers with flexible ends and one angle are built as right triangles. ###figure### Right triangles are chosen because a right angle can be built well with the angle patterns we have and because the right angle simplifies the further calculations considerably. <br />
First, Pythagoras' theorem is used to determine which triangles can be built from all possible linker parts. Both legs must be buildable with our linker patterns, and the hypotenuse must have the correct length to fit between the ends.<br />
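The selection of buildable triangles can be sketched as a simple enumeration. The list of buildable leg lengths and the tolerance are hypothetical inputs chosen for this example; they are not values from our pattern databases.<br />

```python
import math

def right_triangle_legs(buildable_lengths, hypotenuse, tol=0.5):
    """Return all pairs (a, b) with a <= b and |sqrt(a^2 + b^2) - hypotenuse| <= tol."""
    lengths = sorted(buildable_lengths)
    pairs = []
    for i, a in enumerate(lengths):
        for b in lengths[i:]:
            if abs(math.hypot(a, b) - hypotenuse) <= tol:
                pairs.append((a, b))
    return pairs
```

Each pair can be used in two orientations, with either leg attached to the N-terminus.<br />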
Afterwards all possible right triangles are constructed whose hypotenuse has its two endpoints on the N- and C-terminus. For this purpose Thales' theorem is used: all possible right-angle vertices lie on a sphere that is centred on the midpoint between the termini and whose radius is half the hypotenuse. These right triangles are then checked for whether they can be built with our linker parts, by comparing them with the set of possible triangles obtained from Pythagoras' theorem.<br />
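A possible construction of the right-angle vertices is sketched below. For a given leg length at the N-terminus, the vertices form a circle on the Thales sphere around the axis through both termini. The function names and the number of sampled points are assumptions of this example.<br />

```python
import math

def _cross(u, v):
    return (u[1] * v[2] - u[2] * v[1], u[2] * v[0] - u[0] * v[2], u[0] * v[1] - u[1] * v[0])

def _unit(v):
    n = math.sqrt(sum(x * x for x in v))
    return tuple(x / n for x in v)

def right_angle_vertices(n_term, c_term, leg_n, steps=72):
    """Points P with |P - N| = leg_n that see the segment N-C under a right angle."""
    nc = tuple(c - n for c, n in zip(c_term, n_term))
    c_len = math.sqrt(sum(x * x for x in nc))
    if not 0.0 < leg_n < c_len:
        raise ValueError("the leg must be shorter than the hypotenuse")
    leg_c = math.sqrt(c_len ** 2 - leg_n ** 2)   # Pythagoras gives the second leg
    u = _unit(nc)
    # any vector not parallel to u yields a basis of the plane perpendicular to u
    helper = (1.0, 0.0, 0.0) if abs(u[0]) < 0.9 else (0.0, 1.0, 0.0)
    e1 = _unit(_cross(u, helper))
    e2 = _cross(u, e1)
    foot = leg_n ** 2 / c_len        # distance of the altitude foot from N
    radius = leg_n * leg_c / c_len   # altitude of the right triangle
    points = []
    for k in range(steps):
        phi = 2.0 * math.pi * k / steps
        points.append(tuple(
            n + foot * ui + radius * (math.cos(phi) * a + math.sin(phi) * b)
            for n, ui, a, b in zip(n_term, u, e1, e2)))
    return points
```

Every returned point lies at distance half the hypotenuse from the midpoint of the termini, as Thales' theorem requires.<br />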
These triangles are now all shifted by each displacement of the possible angles at the C-terminus, resulting in the right-angle points ###figure needed###. Then the connections to the possible points from the N-terminus are generated. At this step the triangles no longer need to be exactly right-angled and may have slightly different angles, so in the next steps the paths are checked for whether they still fit. First the lengths of the legs of the triangles are checked, and then it is checked that the paths do not disturb the protein. Paths that would disturb the protein in any way are simply deleted.<br />
===Rigid paths===<br />
If the first two approaches did not yield suitable paths, the software can also search for paths with up to three edges. Thus even one of the worst shapes for circularization, a torus with the two termini in the pits, could be circularized with our linker system without requiring infinitely long straight segments. On the other hand, this was the maximum number of possibilities that was still feasible to calculate.<br />
Here the flexible parts of the protein are assumed to point in the same direction as the following helix. This keeps the number of possibilities fixed, but it is of course a rough estimate.<br />
First, points at the right distances from each end are generated. ###figure needed, explains point names### Then, for each of the first points, next points are generated in all directions, with an angular spacing of 5 degrees. All of these points that do not fit are discarded immediately. Next, all connections from the second points to the last points are checked for whether they have the right length. After this, the normal filtering steps are applied to these new connections.<br />
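The generation of next points can be sketched as follows. Interpreting "all directions with a spacing of 5 degrees" as a regular grid in the polar and azimuthal angle is an assumption of this example, as are the function names.<br />

```python
import math

def grid_directions(step_deg=5):
    """Unit vectors on a polar/azimuthal grid with the given angular step."""
    dirs = [(0.0, 0.0, 1.0)]                      # north pole, counted once
    for theta in range(step_deg, 180, step_deg):
        t = math.radians(theta)
        for phi in range(0, 360, step_deg):
            p = math.radians(phi)
            dirs.append((math.sin(t) * math.cos(p),
                         math.sin(t) * math.sin(p),
                         math.cos(t)))
    dirs.append((0.0, 0.0, -1.0))                 # south pole, counted once
    return dirs

def next_points(point, segment_length, step_deg=5):
    """Candidate points at segment_length from point, one per grid direction."""
    return [tuple(c + segment_length * d for c, d in zip(point, direction))
            for direction in grid_directions(step_deg)]
```

Such a grid is denser near the poles than at the equator, which is acceptable for a brute-force search but means that directions are not sampled uniformly.<br />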
==Sorting out of paths==<br />
Most of the calculation time is consumed by filtering out unsuitable paths, because every path needs to be checked. With the brute-force approach the number of possible paths is about 10^9.<br />
There are three main functions that filter out paths that do not fit. The simplest one discards paths that would require an angle we cannot produce with our angle patterns. But since we can produce nearly all angles from ???20 -170??? degrees, only very few paths are discarded by this function.<br />
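The angle filter can be sketched as follows. The function names are hypothetical, and the limits of the buildable angle range are left as parameters because they depend on the angle patterns. The angle is measured at each interior point between the two adjoining segments, so that 180 degrees corresponds to a straight continuation; this convention is an assumption of the example.<br />

```python
import math

def bend_angles(points):
    """Angles in degrees between the two segments meeting at each interior point."""
    angles = []
    for a, b, c in zip(points, points[1:], points[2:]):
        v1 = [x - y for x, y in zip(a, b)]
        v2 = [x - y for x, y in zip(c, b)]
        cos_t = sum(p * q for p, q in zip(v1, v2)) / (math.hypot(*v1) * math.hypot(*v2))
        # clamp against rounding errors before taking the arccosine
        angles.append(math.degrees(math.acos(max(-1.0, min(1.0, cos_t)))))
    return angles

def angles_buildable(points, min_deg, max_deg):
    """True if every bend of the path lies within the buildable angle range."""
    return all(min_deg <= a <= max_deg for a in bend_angles(points))
```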
The next function checks whether the endpoint lies inside the protein. If so, the path is deleted. Otherwise the connection from the preceding point to this point is checked for whether it passes too close to the protein.<br />
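A minimal sketch of such a clash test is given below. It treats every atom as a point and uses the clearance of 5 &Aring; mentioned in the PDB parsing section; the function names and the exhaustive loop over all atoms are assumptions of this example and not the optimised routines of our software.<br />

```python
import math

def distance_to_segment(point, start, end):
    """Shortest distance from point to the straight segment between start and end."""
    seg = [e - s for s, e in zip(start, end)]
    rel = [p - s for s, p in zip(start, point)]
    seg_len2 = sum(x * x for x in seg)
    if seg_len2 == 0:
        t = 0.0
    else:
        # position of the perpendicular foot, clamped to the segment
        t = max(0.0, min(1.0, sum(r * s for r, s in zip(rel, seg)) / seg_len2))
    closest = tuple(s + t * d for s, d in zip(start, seg))
    return math.dist(point, closest)

def edge_is_clear(start, end, atoms, clearance=5.0):
    """True if no atom is closer than clearance to the connection start-end."""
    return all(distance_to_segment(a, start, end) >= clearance for a in atoms)
```

Checking the whole segment also covers the endpoint itself, so a point lying inside the protein fails the same test.<br />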
<br />
==Shifting paths to the patterns==<br />
<br />
Because of rounding errors and other inaccuracies, the generated paths may not fit the helix patterns and the angle patterns perfectly. Therefore the paths need to be refined before they are translated into an amino acid sequence. This is also done before the weighting, so that the paths no longer change after weighting.<br />
To this end, the distances between the points are calculated and the points are shifted just far enough that they fit the patterns. The shifts never exceed a certain length, so that no path passes through the protein after refinement if it did not do so before.<br />
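One way to picture the refinement is to snap each segment to the nearest buildable length. In the sketch below the point is moved along the direction of its segment; this choice, the function name and the parameters are assumptions of the example, since the text above does not fix how the shift is carried out.<br />

```python
import math

def snap_point(prev_point, point, allowed_lengths, max_shift):
    """Shift point along the segment from prev_point to the nearest allowed length.

    Returns the shifted point, or None if the required shift exceeds max_shift.
    """
    length = math.dist(prev_point, point)
    if length == 0:
        return None
    target = min(allowed_lengths, key=lambda l: abs(l - length))
    if abs(target - length) > max_shift:
        return None
    scale = target / length
    return tuple(p + scale * (q - p) for p, q in zip(prev_point, point))
```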
==Weighting of paths==<br />
Before the paths are translated into sequences and thus into expressible linkers, the quality of each path needs to be assessed. We have identified four contributions that determine how well a path will enhance the heat stability of a target protein by circularization. In the end these contributions are combined into one final value for the goodness of the connection. For each contribution, a lower weighting value always corresponds to a better path. All contributions need to be normalized by some property of the protein, so that each contribution on its own is independent of the target protein and generality is achieved.<br />
First the length of the linker is evaluated. We assumed that a shorter linker is normally better at restraining the ends, as a longer helix might give the ends more flexibility. Because we want this value to be independent of the size of the protein, the length of the linker is normalized by the distance between the two termini. This ensures that the contribution does not exceed all other contributions to the weighting function merely because the two termini are far apart and a longer linker is clearly needed.<br />
The next value measures the goodness of the angles used in the linker. Each of our potential angle patterns produces angles that are distributed around a mean with some standard deviation. The narrower this distribution, the more likely it is that the pattern will actually produce the assumed angle between the flanking helices. Therefore path angles that fit the provided angle patterns perfectly get a lower weighting value.<br />
Then the distance from the protein is taken into account. Because the linkers should not disturb the protein in its normal environment, linkers that pass close to the protein's surface are considered better. Of course no linker is so close to the surface that it would interact with the protein too much. This value is normalized by the minimal distance a linker should keep from the surface.<br />
Finally the regions a linker should avoid are evaluated. Each protein binds some ligands or substrates. The user can specify how big they are and where they attach. If a linker passes through a potential ligand, this value goes to infinity, so that the linker is discarded. Otherwise, the farther a linker is from a potential ligand, the better this value gets. The user can also specify the importance of certain regions. In the end the total weighting is independent of the number of binding sites.<br />
<br />
===Calibrating the weighting function===<br />
Every contribution has its own distribution. An example is shown in figure ### [[figure histogram_length_lys.png]], but all of them have different shapes. The aim is to find the paths that minimize all of these contributions. Therefore the four contributions are combined linearly in the weighting function:<br />
\[ W(p) = \alpha L(p) + \beta A(p) + \gamma D(p) + \delta u(p) \]<br />
where W is the final weighting, p the path, L the length contribution, A the angle contribution, D the distance contribution and u the contribution from the forbidden regions. $\alpha, \beta, \gamma, \delta$ are the weighting constants that needed to be determined. The normalization of each contribution was chosen so that each of them is dimensionless and all have reasonably similar values.<br />
The weighting constants were obtained from the [[linker-screening ###lin]] performed with lysozyme and the [[enzyme-modeling ###link]].<br />
Please see [[below ###link]] for a detailed explanation of how the values were obtained.<br />
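The linear combination W(p) and the normalized length contribution can be written down in a few lines. This is a sketch under assumptions: the function names are hypothetical, the path is taken to start and end exactly at the two termini, and the constants are passed in as arguments because their final values are not fixed here.<br />

```python
import math

def length_contribution(points):
    """Path length divided by the direct distance between its two end points."""
    total = sum(math.dist(a, b) for a, b in zip(points, points[1:]))
    return total / math.dist(points[0], points[-1])

def weighting(length, angle, distance, forbidden, alpha, beta, gamma, delta):
    """W(p) = alpha*L(p) + beta*A(p) + gamma*D(p) + delta*u(p); lower is better."""
    if math.isinf(forbidden):
        # a path through a forbidden region is discarded regardless of delta
        return math.inf
    return alpha * length + beta * angle + gamma * distance + delta * forbidden
```

The explicit check for an infinite u(p) avoids an undefined product when the constant delta is zero.<br />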
==Translating paths to sequence==<br />
As mentioned before, the software is provided with two databases, one for the possible angle patterns and one for the helix patterns. The choice of patterns was inspired by known crystal structures extracted from databases and described in different papers.<br />
A large in silico screening for refining the preferences among the patterns was then set up using the [[iGEM@home|###Link]] system. For a complete description of the search for suitable patterns, see the [[modeling|###Link to patternspart]] page. <br />
All possible paths are now split up at the angles and compared with the possible patterns in the databases. ###Figure needed to explain, how the path is translated### The most suitable patterns are identified and concatenated to build the path's sequence. Note that this is only possible because of the modularity of the linker patterns used as building blocks: each block, whether an alpha helix or an angle pattern, is not affected by the others. Thus one sequence is produced for each possible path.<br />
==Clustering of paths==<br />
Many different paths are represented by the same sequence [[###Figure that shows, different paths have same properties, already before in the text ###]], and we therefore clustered such paths. The weights of the clustered paths were then calculated by averaging the weights of the different paths that compose a cluster.<br />
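The clustering amounts to grouping the paths by their sequence and averaging the weights per group. A minimal sketch, with a hypothetical function name and input format:<br />

```python
from collections import defaultdict

def cluster_by_sequence(paths):
    """paths: iterable of (sequence, weight) pairs. Returns {sequence: mean weight}."""
    groups = defaultdict(list)
    for sequence, weight in paths:
        groups[sequence].append(weight)
    return {seq: sum(ws) / len(ws) for seq, ws in groups.items()}
```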
=Results=<br />
==DNMT1==<br />
A major motivation for our effort to design rigid linkers with angles was the circularization of the DNA methyltransferase Dnmt1 (link). The truncated form used in our project comprises 900 amino acids, and its N- and C-termini are well separated. To circularize it, two linkers were designed: a flexible one made of glycine and serine, and a rigid one designed by the software. The rigid linkers for DNMT1 were obtained from an early version of the software. At that time the calculation took 11 days on a laptop computer with an Intel i5 processor and 8 GB of RAM, which shows the importance of a distributed computing system for large proteins. Since then the software has improved considerably, reducing the calculation time for DNMT1 to about 1 day.<br />
<br />
###DNMT1 could be made heatstable, still missing###<br />
<br />
<br />
From the [[Project/Linker_Screening | linker-screening]] and the [[Modeling/Enzyme_Modeling | enzyme_modeling]] we obtained the activities of the different linkers after heat shock. These were used to calibrate the weighting function; see Table 1 for the results.<br />
<br />
{| class="table table-hover" style="text-align: center;"<br />
|+ '''Table 1''': Linkers and their amino acid sequences. Green: attachment sequences to prevent the flexible regions from being perturbed; Blue: angle; Purple: extein.<br />
! Linker<br />
! Amino acid sequence<br />
! Activity<br />
! Length contribution<br />
! Angle contribution<br />
! Binding site contribution<br />
! Distance from surface<br />
! Weighting value after calibration<br />
|- <br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
|'''Very good linkers'''<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
| sgt2<br />
| <span style="color:#3ADF00;">GG</span>AEAAAK<span style="color:#00BFFF;">AAAHPEA</span>AEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
| 0.7477<br />
| 1.9205<br />
| 6.7789<br />
| 0.002259<br />
| 10.525<br />
| 114912<br />
|-<br />
| rigid<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAKAA<span style="color:#3ADF00;">P</span><span style="color:#A901DB;">RGKCWE</span><br />
| 0.9447<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
| '''Average linkers'''<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
| may1<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAKA<span style="color:#00BFFF;">AAAHPEA</span>AEAAAKEAAAKA<span style="color:#00BFFF;">KTA</span>AEAAAKEAAAKA<span style="color:#A901DB;">RGTCWE</span><br />
| 0.7489<br />
| 6.2225<br />
| 13.19<br />
| 0.00384<br />
| 1095.2<br />
| 196414<br />
|-<br />
| ord1<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAK<span style="color:#00BFFF;">ATGDLA</span>AEAAAKAA<span style="color:#A901DB;">RGTCWE</span><br />
| 0.956<br />
| 4.936<br />
| 4.639<br />
| 0.00055708<br />
| 220.8<br />
| 27985<br />
|-<br />
| ord3<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAK<span style="color:#00BFFF;">ASLPAA</span>AEAAAKEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
| 1.390<br />
| 4.949<br />
| 7.116<br />
| 0.000545<br />
| 261.2<br />
| 28557<br />
|-<br />
|<br />
| <br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
| '''Short linkers'''<br />
| <br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
| sho1<br />
| <span style="color:#3ADF00;">GG</span><span style="color:#A901DB;">RGTCWE</span><br />
| 0.7087<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
| sho2<br />
| <span style="color:#3ADF00;">GG</span>AEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
| 0.5743<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
| flexible linker<br />
| GGSGGGSGRGKCWE<br />
| 0.6851<br />
| <br />
|<br />
|<br />
|<br />
|<br />
|-<br />
| linear lysozyme<br />
| no linker<br />
| 0.7039<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
|}<br />
[[circ_lam_lys_nils.png]] comparison with [[nice_linker_lysozyme_flexible_ends.png]]<br />
###About predictions of software###<br />
In the end we obtained a ranking of the linkers tested in vitro in the [[linker-screening]] and chose the parameters $\alpha, \beta, \gamma, \delta$ of the weighting function such that the ranking from the software reproduced the ranking from the assays. The final values were: ...<br />
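Matching two rankings can be illustrated with a simple pair count and a grid search. Both the agreement measure and the search strategy are assumptions of this sketch; they only show the principle and are not the procedure that produced our constants.<br />

```python
from itertools import combinations, product

def concordant_pairs(weights, activities):
    """Count linker pairs in which the lower weight goes with the higher activity."""
    return sum(1 for i, j in combinations(range(len(weights)), 2)
               if (weights[i] - weights[j]) * (activities[i] - activities[j]) < 0)

def calibrate(contributions, activities, grid):
    """Grid search over (alpha, beta, gamma, delta) for the best rank agreement.

    contributions: one (L, A, D, u) tuple per linker; grid: candidate constants.
    """
    best, best_score = None, -1
    for consts in product(grid, repeat=4):
        weights = [sum(c * x for c, x in zip(consts, row)) for row in contributions]
        score = concordant_pairs(weights, activities)
        if score > best_score:
            best, best_score = consts, score
    return best, best_score
```

With only a handful of tested linkers many constant sets reproduce the ranking equally well, so such a calibration benefits from every additional assay result.<br />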
<br />
=Discussion=<br />
The software will be refined continuously with more data from iGEM@home...<br />
=References=<br />
<br />
[0] Wang, C.K.L., Kaas, Q., Chiche, L. & Craik, D.J. CyBase: A database of cyclic protein sequences and structures, with applications in protein discovery and engineering. Nucleic Acids Research 36, (2008).<br />
<br />
[1] Thornton, J.M. & Sibanda, B.L. Amino and carboxy-terminal regions in globular proteins. Journal of molecular biology 167, 443-460 (1983).</div>Igemnilshttp://2014.igem.org/Team:Heidelberg/pages/Linker_SoftwareTeam:Heidelberg/pages/Linker Software2014-10-17T21:20:43Z<p>Igemnils: </p>
<hr />
<div>IMPORTANT!<br />
paths-> geometrical<br />
linkers-> helices...<br />
sequences-> seq of linkers<br />
insist on huge number of data produced<br />
We definitely need to cite [0]!!!<br />
=Abstract=<br />
Artificially circularized proteins gain heat stability because their C- and N-termini are restrained from moving around freely. If the ends are too far apart, a linker is needed to connect them, so that the natural conformation of the protein is not changed too much while the relative position of the ends is fixed and their degrees of freedom are restricted. These linkers must not hinder the protein's function in any way. Consequently it is important that linkers do not pass through the active site or cover a cavity of the protein, for example.<br />
In the [[###link###|modelling]] part we have shown that it is possible to define the shape of our linkers by applying our model of rigid helical rods connected by well-defined angle regions. But even with the possibility to define the path the linker should take, one still needs to know which path to choose.<br />
Especially for larger proteins with complex shapes this can be very difficult. Furthermore, one would like to make sure that the active sites are avoided. But even taking all this into account, one could never also consider all the other paths that the same linker is able to take because of the rotational degree of freedom inherent in the linker model ###figure needed###<br />
Thus we decided to provide the scientific community with a powerful open-source software that, for every protein with a known structure, can calculate the sequence of the linker needed to circularize the protein with a rigid linker and with minimal inhibition of the protein's function. <br />
Until now each scientist had to estimate the length of the linker themselves,###check for flexlinker### so our software is a completely novel approach to circularization.<br />
=General procedure=<br />
First the protein structure is analysed, and all possible paths that connect the two ends with fewer than three additional edges are generated. Ultimately these paths are ranked by how well the protein would be circularized along each of them. Because the final weighting is computationally intensive, unsuitable paths are filtered out first. A path is discarded only if it breaks a rule for linkers; for example, a path must never pass through the protein. Once all paths have been generated, they are refined by shifting their points according to the underlying linker model, so that no path is considered that could not be built from our building blocks.<br />
After all paths have been generated correctly, each path is weighted by several factors, which are combined into a single weighting value that expresses how good the path is. The path is then translated into an amino acid sequence using our amino acid patterns. Because one sequence can follow more than one path, all paths built from the same sequence are clustered. Finally the weightings within each cluster are averaged, so that every linker receives a single weighting value that includes the contributions of all paths it represents.<br />
==PDB parsing==<br />
First the PDB file containing the structure of the target protein is parsed and the coordinates of the atoms are stored. From these data the coordinate system of the PDB file is calibrated to metric units. For this purpose the distance between certain atoms is measured in all glycines. Because this distance is well known, it yields a calibration, which normally comes out at 100 pm per unit in the PDB file.<br />
Next, some initial tests are run on the protein structure. First it is tested whether the C- and N-terminus lie on the solvent-accessible surface of the protein, which is crucial for circularization. The angles from which the ends are accessible are stored for the subsequent linker generation. An end is accessible at a certain angle if the axis obtained by rotating the z-axis by that angle at the terminus does not come too close to any atom position. The minimal distance a connection must keep from the protein is set to the radius of an alpha helix, 5 &Aring;. ###fig needed### <br />
==Generation of paths==<br />
Because our model builds linkers from helical rods and connecting angles, a path is completely defined by the coordinates of its angle points. Therefore only the angle points are generated, and the suitable ones are then selected. Since shifting existing points is very efficient in our programming style, this was easier than generating only those points that represent good connections.<br />
In the [[###link###|modelling]] part we have already described the patterns we use to build our linkers. Their modularity was crucial for the success of the software. The angle patterns were therefore always chosen so that the end of the angle pattern is ???8??? &Aring; away from the turning point ###figure needed###. Thus only displacements of a fixed length have to be made, independent of the direction from which the linker arrives and of the direction in which it continues. This modularity makes the calculations more efficient than generating points randomly.<br />
===Flexible ends===<br />
Most proteins have flexible regions at their ends that do not point in a defined direction. These flexible ends are often even missing from the structure files, but our software still estimates how they could behave. Furthermore, circularization leaves non-helical sequences at the ends of the protein. This offers many possibilities for inserting fitting linkers, but it is also a major problem, because estimating flexible parts is not easy with our brute-force approach.<br />
So far, two functions have been included for generating linkers that take the flexibility of the ends into account.<br />
====One helix at flexible ends====<br />
The only variable in this custom-tailored linker is the length of the helix. Most likely no suitable linker can be found, because if there is an obstacle between the ends, the linker cannot bend around it. But if such a linker is possible, it should be found, because it will be one of the best linkers predicted. ###figure needed###<br />
For all accessible angles calculated before, we compute all points that lie at certain distances from the N-terminus and from the C-terminus. The distances range from 4.5 &Aring; to the maximum length of the flexible part. ###figure###<br />
Then all possible connections between the first points and the last points are tested for whether they come too close to the protein or even pass through it. This is done with higher accuracy, because none of these linkers should be lost by error. Therefore a different function is used here to calculate the distance between the protein and the connection; it is more time-consuming but also more accurate than the one normally used.<br />
<br />
====Linker with one angle at flexible ends====<br />
The other possibility for taking flexible ends into account without blowing up the calculation time was to restrict the number of angles in the helical pattern to exactly one. <br />
Linkers with flexible ends and one angle are built as right triangles. ###figure### Right triangles are chosen because a right angle can be built well with the angle patterns we have and because the right angle simplifies the further calculations considerably. <br />
First, Pythagoras' theorem is used to determine which triangles can be built from all possible linker parts. Both legs must be buildable with our linker patterns, and the hypotenuse must have the correct length to fit between the ends.<br />
Afterwards all possible right triangles are constructed whose hypotenuse has its two endpoints on the N- and C-terminus. For this purpose Thales' theorem is used: all possible right-angle vertices lie on a sphere that is centred on the midpoint between the termini and whose radius is half the hypotenuse. These right triangles are then checked for whether they can be built with our linker parts, by comparing them with the set of possible triangles obtained from Pythagoras' theorem.<br />
These triangles are now all shifted by each displacement of the possible angles at the C-terminus, resulting in the right-angle points ###figure needed###. Then the connections to the possible points from the N-terminus are generated. At this step the triangles no longer need to be exactly right-angled and may have slightly different angles, so in the next steps the paths are checked for whether they still fit. First the lengths of the legs of the triangles are checked, and then it is checked that the paths do not disturb the protein. Paths that would disturb the protein in any way are simply deleted.<br />
===Rigid paths===<br />
If the first two approaches did not yield suitable paths, the software can also search for paths with up to three edges. Thus even one of the worst shapes for circularization, a torus with the two termini in the pits, could be circularized with our linker system without requiring infinitely long straight segments. On the other hand, this was the maximum number of possibilities that was still feasible to calculate.<br />
Here the flexible parts of the protein are assumed to point in the same direction as the following helix. This keeps the number of possibilities fixed, but it is of course a rough estimate.<br />
First, points at the right distances from each end are generated. ###figure needed, explains point names### Then, for each of the first points, next points are generated in all directions, with an angular spacing of 5 degrees. All of these points that do not fit are discarded immediately. Next, all connections from the second points to the last points are checked for whether they have the right length. After this, the normal filtering steps are applied to these new connections.<br />
==Sorting out of paths==<br />
Most of the calculation time is consumed by filtering out unsuitable paths, because every path needs to be checked. With the brute-force approach the number of possible paths is about 10^9.<br />
There are three main functions that filter out paths that do not fit. The simplest one discards paths that would require an angle we cannot produce with our angle patterns. But since we can produce nearly all angles from ???20 -170??? degrees, only very few paths are discarded by this function.<br />
The next function checks whether the endpoint lies inside the protein. If so, the path is deleted. Otherwise the connection from the preceding point to this point is checked for whether it passes too close to the protein.<br />
<br />
==Shifting paths to the patterns==<br />
<br />
Because of rounding errors and other inaccuracies, the generated paths may not fit the helix patterns and the angle patterns perfectly. Therefore the paths need to be refined before they are translated into an amino acid sequence. This is also done before the weighting, so that the paths no longer change after weighting.<br />
To this end, the distances between the points are calculated and the points are shifted just far enough that they fit the patterns. The shifts never exceed a certain length, so that no path passes through the protein after refinement if it did not do so before.<br />
==Weighting of paths==<br />
Before the paths are translated into sequences and thus into expressible linkers, the quality of each path needs to be assessed. We have identified four contributions that determine how well a path will enhance the heat stability of a target protein by circularization. In the end these contributions are combined into one final value for the goodness of the connection. For each contribution, a lower weighting value always corresponds to a better path. All contributions need to be normalized by some property of the protein, so that each contribution on its own is independent of the target protein and generality is achieved.<br />
First the length of the linker is evaluated. We assumed that a shorter linker is normally better at restraining the ends, as a longer helix might give the ends more flexibility. Because we want this value to be independent of the size of the protein, the length of the linker is normalized by the distance between the two termini. This ensures that the contribution does not exceed all other contributions to the weighting function merely because the two termini are far apart and a longer linker is clearly needed.<br />
The next value measures the goodness of the angles used in the linker. Each of our potential angle patterns produces angles that are distributed around a mean with some standard deviation. The narrower this distribution, the more likely it is that the pattern will actually produce the assumed angle between the flanking helices. Therefore path angles that fit the provided angle patterns perfectly get a lower weighting value.<br />
Then the distance from the protein is taken into account. Because the linkers should not disturb the protein in its normal environment, linkers that pass close to the protein's surface are considered better. Of course no linker is so close to the surface that it would interact with the protein too much. This value is normalized by the minimal distance a linker should keep from the surface.<br />
Finally the regions a linker should avoid are evaluated. Each protein binds some ligands or substrates. The user can specify how big they are and where they attach. If a linker passes through a potential ligand, this value goes to infinity, so that the linker is discarded. Otherwise, the farther a linker is from a potential ligand, the better this value gets. The user can also specify the importance of certain regions. In the end the total weighting is independent of the number of binding sites.<br />
<br />
===Calibrating the weighting function===<br />
Every contribution has its own distribution. You can see an example in figure ### [[figure histogram_length_lys.png]], but all of them have different shapes. The aim is to find the paths that minimize all of these contributions. Therefore the four contributions are combined linearly in the weighting function:<br />
\[ W(p) = \alpha L(p) + \beta A(p) + \gamma D(p) + \delta u(p) \]<br />
where W is the final weighting, p the path, L the length contribution, A the angle contribution, D the distance contribution and u the contribution from the forbidden regions. $\alpha, \beta, \gamma, \delta$ are the weighting constants that had to be determined. The normalization of each contribution was chosen so that each of them is dimensionless and all have reasonably similar values.<br />
The weighting constants were obtained from the [[linker-screening ###lin]] performed with lysozyme and the [[enzyme-modeling ###link]].<br />
Please see [[below ###link]] for a detailed explanation of how the values were obtained.<br />
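The linear combination defined above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the names <code>PathContributions</code> and <code>weighting</code> and the default constants of 1.0 are hypothetical and are not taken from the software, and the four contributions are assumed to be already computed and normalized.<br />

```python
import math
from dataclasses import dataclass


@dataclass
class PathContributions:
    """Hypothetical container for the four normalized contributions of one path."""
    length: float     # L(p): linker length divided by the distance between the termini
    angle: float      # A(p): quality of the angle patterns used
    distance: float   # D(p): distance of the linker from the protein surface
    forbidden: float  # u(p): forbidden regions, math.inf if a ligand site is crossed


def weighting(c, alpha=1.0, beta=1.0, gamma=1.0, delta=1.0):
    """W(p) = alpha*L(p) + beta*A(p) + gamma*D(p) + delta*u(p); lower is better."""
    if math.isinf(c.forbidden):
        # A path through a ligand is always discarded (also avoids 0 * inf = nan).
        return math.inf
    return (alpha * c.length + beta * c.angle
            + gamma * c.distance + delta * c.forbidden)


# Illustrative values only: pick the path with the lowest weighting.
paths = [PathContributions(1.9, 6.8, 10.5, 0.002),
         PathContributions(1.2, 4.0, 12.0, math.inf)]
best = min(paths, key=weighting)
```

Because lower values are better for every contribution, the best path is simply the one that minimizes the combined value.<br />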
==Translating paths to sequence==<br />
As mentioned before, the software is provided with two databases, one for the possible angle patterns and one for the helix patterns. The choice of the patterns was inspired by known crystal structures extracted from databases and described in different papers.<br />
A large ''in silico'' screen to refine the preferences of the patterns was then set up using the [[iGEM@home|###Link]] system. For the complete description of the search for suitable patterns, see the [[modeling|###Link to patternspart]] page.<br />
All the possible paths are now split up at the angles and compared with the possible patterns in the databases. ###Figure needed to explain, how the path is translated### The most suitable patterns are identified and joined together to build the path's sequence. It is important to note that this is only possible because of the modularity of the linker patterns we use as building blocks: each block, whether an alpha helix or an angle pattern, is not affected by the others. Thus for each possible path, one sequence is produced.<br />
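The translation step can be illustrated with a small Python sketch. It assumes that a path is given as a list of segment lengths and the angles between consecutive segments, and that "most suitable" means the block whose nominal value is closest to the requested one; both are assumptions made for illustration, not a description of the actual implementation. The two dictionaries contain only an illustrative subset of helix and angle blocks, and all function names are hypothetical.<br />

```python
# Illustrative subset of building blocks (assumed values for this sketch).
# Helix blocks: sequence -> approximate length in angstrom.
HELIX_BLOCKS = {"AEAAAK": 9.0, "AEAAAKA": 10.5, "AEAAAKAA": 12.0}
# Angle blocks: sequence -> mean angle in degrees.
ANGLE_BLOCKS = {"NVL": 29.7, "AADGTL": 60.0, "VNLTA": 74.5,
                "AAAHPEA": 117.0, "ASLPAA": 140.0, "ATGDLA": 160.0}


def closest(blocks, target):
    """Return the sequence whose nominal value is nearest to the target."""
    return min(blocks, key=lambda seq: abs(blocks[seq] - target))


def path_to_sequence(segment_lengths, angles):
    """Alternate helix and angle blocks along the path and join them."""
    if len(angles) != len(segment_lengths) - 1:
        raise ValueError("need exactly one angle between two consecutive segments")
    parts = []
    for i, length in enumerate(segment_lengths):
        parts.append(closest(HELIX_BLOCKS, length))
        if i < len(angles):
            parts.append(closest(ANGLE_BLOCKS, angles[i]))
    return "".join(parts)


# Two short rods joined by an angle of roughly 115 degrees.
sequence = path_to_sequence([9.2, 8.8], [115.0])
```

The concatenation is only meaningful because the blocks are modular, i.e. each block is assumed to keep its geometry regardless of its neighbours.<br />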
==Clustering of paths==<br />
Many different paths are represented by the same sequence [[###Figure that shows, different paths have same properties, already before in the text ###]], and we therefore clustered such paths. The weights of the clustered paths were then calculated by averaging the weights of the different paths that compose a cluster.<br />
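A minimal sketch of this clustering, assuming each path is available as a (sequence, weight) pair; the function name <code>cluster_by_sequence</code> is hypothetical:<br />

```python
from collections import defaultdict
from statistics import mean


def cluster_by_sequence(paths):
    """Group (sequence, weight) pairs by sequence and average the weights."""
    groups = defaultdict(list)
    for sequence, weight in paths:
        groups[sequence].append(weight)
    return {sequence: mean(weights) for sequence, weights in groups.items()}


# Illustrative input: two geometric paths map to the same sequence.
clusters = cluster_by_sequence([("AEAAAK", 1.0), ("AEAAAK", 3.0), ("AEAAAKA", 5.0)])
```

Each cluster then carries a single averaged weight, so sequences rather than individual paths can be ranked.<br />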
=Results=<br />
==DNMT1==<br />
A major motivation of our effort to design rigid linkers with angles was the circularization of the DNA methyltransferase Dnmt1 (link). The truncated form used in our project is composed of 900 amino acids, and its N- and C-terminal extremities are well separated. To circularize it, two linkers were designed: a flexible one made of glycine and serine, and a rigid one designed by the software. The rigid linkers for DNMT1 were obtained from an early version of the software. At that time the calculation took 11 days on a laptop computer with an Intel i5 processor and 8 GB of RAM, which shows the importance of a distributed computing system for large proteins. Since then the software has improved considerably, reducing the calculation time for DNMT1 to about 1 day.<br />
<br />
###DNMT1 could be made heatstable, still missing###<br />
<br />
<br />
From the [[Project/Linker_Screening | linker-screening]] and the [[Modeling/Enzyme_Modeling | enzyme_modeling]] we obtained the activities of the different linkers after heat shock. These were used to calibrate the weighting function; see Table 1 for the results.<br />
<br />
{| class="table table-hover" style="text-align: center;"<br />
|+ '''Table 1''': Linkers and their amino acid sequences. Green: attachment sequences to prevent the flexible regions from being perturbed; Blue: angle; Purple: extein.<br />
! Linker<br />
! Amino acid sequence<br />
! Activity<br />
! Length contribution<br />
! Angle contribution<br />
! Binding site contribution<br />
! Distance from surface<br />
! Weighting value after calibration<br />
|- <br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
|'''Very good linkers'''<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
| sgt2<br />
| <span style="color:#3ADF00;">GG</span>AEAAAK<span style="color:#00BFFF;">AAAHPEA</span>AEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
| 0.7477<br />
| 1.9205<br />
| 6.7789<br />
| 0.002259<br />
| 10.525<br />
| 114912<br />
|-<br />
| rigid<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAKAA<span style="color:#3ADF00;">P</span><span style="color:#A901DB;">RGKCWE</span><br />
| 0.9447<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
| '''Average linkers'''<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
| may1<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAKA<span style="color:#00BFFF;">AAAHPEA</span>AEAAAK EAAAKA<span style="color:#00BFFF;">KTA</span>AEAAAKEAAAKA<span style="color:#A901DB;">RGTCWE</span><br />
| 0.7489<br />
| 6.2225<br />
| 13.19<br />
| 0.00384<br />
| 1095.2<br />
| 196414<br />
|-<br />
| ord1<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAK<span style="color:#00BFFF;">ATGDLA</span>AEAAAKAA<span style="color:#A901DB;">RGTCWE</span><br />
| 0.956<br />
| 4.936<br />
| 4.639<br />
| 0.00055708<br />
| 220.8<br />
| 27985<br />
|-<br />
| ord3<br />
| <span style="color:#3ADF00;">GG</span>AEAAAKEAAAK<span style="color:#00BFFF;">ASLPAA</span>AEAAAKEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
| 1.390<br />
| 4.949<br />
| 7.116<br />
| 0.000545<br />
| 261.2<br />
| 28557<br />
|-<br />
|<br />
| <br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
| '''Short linkers'''<br />
| <br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
| sho1<br />
| <span style="color:#3ADF00;">GG</span><span style="color:#A901DB;">RGTCWE</span><br />
| 0.7087<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
| sho2<br />
| <span style="color:#3ADF00;">GG</span>AEAAAK<span style="color:#A901DB;">RGTCWE</span><br />
| 0.5743<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
| flexible linker<br />
| GGSGGGSGRGKCWE<br />
| 0.6851<br />
| <br />
|<br />
|<br />
|<br />
|<br />
|-<br />
| linear lysozyme<br />
| no linker<br />
| 0.7039<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
|}<br />
[[circ_lam_lys_nils.png]] comparison with [[nice_linker_lysozyme_flexible_ends.png]]<br />
###About predictions of software###<br />
In the end we obtained a ranking of the linkers tested ''in vitro'' from the [[linker-screening]] and chose the parameters $\alpha, \beta, \gamma, \delta$ of the weighting function such that the ranking from the software reproduced the ranking from the assays. The final values were: ...<br />
<br />
=Discussion=<br />
Will always be refined with more data from i@h...<br />
=References=<br />
<br />
[0] Wang, C.K.L., Kaas, Q., Chiche, L. & Craik, D.J. CyBase: A database of cyclic protein sequences and structures, with applications in protein discovery and engineering. Nucleic Acids Research 36, (2008).<br />
<br />
[1] Thornton, J.M. & Sibanda, B.L. Amino and carboxy-terminal regions in globular proteins. Journal of molecular biology 167, 443-460 (1983).</div>Igemnilshttp://2014.igem.org/Team:Heidelberg/pages/Linker_ModelingTeam:Heidelberg/pages/Linker Modeling2014-10-17T21:15:24Z<p>Igemnils: /* In silico refinement */</p>
<hr />
<div>=Background=<br />
Primary, secondary, tertiary and quaternary structures are the main levels of protein structure characterization. Primary structure designates the amino acid sequence, while the secondary structure describes the arrangement of consecutive amino acids through their two dihedral angles $\phi$ and $\psi$. The Ramachandran plot, which represents the amino acid position in the space of those two angles, shows two particular arrangements commonly found in proteins: alpha helices and beta sheets. The next level of protein organization is the tertiary structure, which describes how the protein is organized in the three spatial dimensions, whereas the quaternary structure describes how different subunits of proteins cluster.<br />
Finally, closely related to these standard structures, the supersecondary structure describes how secondary structure elements are connected to each other. While these connections look undefined at first sight, further analysis revealed that this wide variety of supersecondary structure motifs can be clustered to certain patterns [[#References|[5]]].<br />
<br />
==Supersecondary structure==<br />
<br />
When the properties of supersecondary structures were first described, only very few patterns were identified, mainly due to the lack of highly resolved protein structures. At that time, the structures were mainly classified by the Ramachandran plot regions where the amino acids could be found [[#References|[6]]]. With the growing number of known crystal structures, the analysis of supersecondary structure improved and led to databases with about 150 000 classified loop structures and elaborate clustering [[#References|[7]]]. Nowadays supersecondary structures are defined as the structures built when two secondary structure elements are connected by a short peptide that does not itself belong to one of the secondary structures. These loop peptides range from 1 to 9 amino acids.<br />
Our aim was to build reliable stable linkers out of alpha helices connected by supersecondary structure motifs that produce certain angles. To achieve that, we searched for the most reliable alpha helix patterns that would form rigid rods and angle patterns covering the whole range of angles from 0 to 180 degrees.<br />
<br />
=Linker building block design=<br />
<br />
===Helix patterns===<br />
Various patterns have been used to build helical linkers that connect protein ends [[#References | [8]]]. Moreover, in known protein structures, linkers between subdomains can be identified, and their properties have already been analyzed [[#References | [9]]]. Two main criteria were used to build the alpha helix patterns: they should robustly form alpha helices, and they should be soluble in aqueous solution. Therefore, we could not simply use linkers built of alanine alone. We decided to add charged amino acids to the pattern and to position them physically close to each other, so that they could stabilize each other by Coulomb interaction. These amino acids needed to be separated by 3 amino acids, as a helical turn takes about 3.6 amino acids. The pattern we chose as most suitable for our purpose was also described as one of the most stable [[#References | [3]]].<br />
8 alpha helix building blocks were eventually chosen: AEAAAK, AEAAAKA, AEAAAKAA, AEAAAKEAAAK, AEAAAKEAAAKA, AEAAAKEAAAKEAAAKA, AEAAAKEAAAKEAAAKEAAAKA, AEAAAKEAAAKEAAAKEAAAKEAAAKA, with a respective estimated length of 9, 10.5, 12, 16.5, 18, 25.5 and 33&Aring;.<br />
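The spacing argument can be checked with a short calculation. On an idealized helix with 3.6 residues per turn, each residue is rotated by 100 degrees relative to the previous one, so two residues separated by 3 amino acids (positions i and i+4) end up only 40 degrees apart around the helix axis. The sketch below assumes this idealized geometry; the function names are hypothetical.<br />

```python
# Idealized alpha helix: 3.6 residues per turn, i.e. 360 / 3.6 = 100 degrees per residue.
DEG_PER_RESIDUE = 100.0


def angular_offset(i, j):
    """Smallest angle (degrees) between residues i and j seen along the helix axis."""
    raw = (abs(j - i) * DEG_PER_RESIDUE) % 360.0
    return min(raw, 360.0 - raw)


def charged_pairs(seq):
    """Index pairs (i, i+4) with a glutamate at i and a lysine at i+4."""
    return [(i, i + 4) for i in range(len(seq) - 4)
            if seq[i] == "E" and seq[i + 4] == "K"]


pairs = charged_pairs("AEAAAKEAAAK")             # E/K pairs in one helix block
offsets = [angular_offset(i, j) for i, j in pairs]
```

With three alanines between each glutamate and lysine, both charged side chains point to nearly the same side of the helix, which is the geometric basis for the Coulomb stabilization described above.<br />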
<br />
===Angle patterns===<br />
<br />
The angle patterns for our model were obtained from the ArchDB database [[#References | [7]]], which classifies loops from known protein structures. About 17 000 non-homologous proteins from the PDB database were analyzed, and from them 150 000 loop structures, i.e. regions connecting two secondary structure elements, were identified. The classification took into account not only the length of the loop and its conformation, i.e. the φ and ψ backbone dihedral angles of the residues in the loop, but also the distance between the attachments of the loop to the surrounding secondary structures. Furthermore, the secondary structures surrounding the loop and the geometry defined by the supersecondary structure motifs can be found in the database.<br />
<br />
To extract from ArchDB the relevant supersecondary structure motifs for our linker design, the complete database was downloaded and helix-loop-helix motifs were extracted using a [https://github.com/igemsoftware/Heidelberg_2014 self-written script] in the Python programming language. Of these, we only took into account loops composed of 1 and 2 amino acids, because the longer the loops, the less frequent and therefore the less reliable they are, and the further the ends are from each other.<br />
The interesting information for us was (1) the angle produced between the vectors defining the surrounding alpha helices, (2) the distance between the ends of the loop, and (3) the type of amino acids surrounding the loop. Furthermore we analyzed the statistical significance of the conformation. For each amino acid combination in the loop region, the angle distribution between the embracing alpha helices, the loop length distribution and a 2D heatmap of the surrounding amino acids were plotted automatically (Figures 1-4).<br />
<br />
These distributions were then visually analyzed to identify loops of interest for linker design. We focused on loops that showed a narrow angle distribution and appeared frequently in the database, and we extended the distribution analysis to the surrounding amino acids. For example, once we had identified that the amino acid T in a turn produces an interesting distribution (Figure 1), we would extend it with the amino acid K in front (Figure 2), with the amino acid A at the end (Figure 3), and with both (Figure 4), and investigate how the distributions behave. By restricting the possibilities, the number of occurrences goes down tremendously, but the properties become more interesting.<br />
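The effect of restricting the flanking residues can be sketched as a simple filter followed by summary statistics. The record layout (keys <code>before</code>, <code>loop</code>, <code>after</code>, <code>angle</code>) and the numbers are hypothetical and do not reproduce the ArchDB format or our extracted data; the sketch only shows the kind of subsetting described above.<br />

```python
from statistics import mean, stdev


def angle_stats(records, loop, before=None, after=None):
    """Count, mean and standard deviation of the angles of all matching loops.

    `before` / `after` optionally restrict the residue preceding / following the loop.
    """
    angles = [r["angle"] for r in records
              if r["loop"] == loop
              and (before is None or r["before"] == before)
              and (after is None or r["after"] == after)]
    if len(angles) < 2:
        # Too few occurrences for a meaningful spread.
        return len(angles), None, None
    return len(angles), mean(angles), stdev(angles)


# Made-up records, for illustration only.
records = [
    {"before": "K", "loop": "T", "after": "A", "angle": 38.0},
    {"before": "K", "loop": "T", "after": "A", "angle": 40.0},
    {"before": "G", "loop": "T", "after": "L", "angle": 90.0},
]
all_t = angle_stats(records, "T")                       # every loop made of T
kta = angle_stats(records, "T", before="K", after="A")  # only K-T-A
```

As in the figures below, the restricted subset has fewer occurrences but a narrower angle distribution.<br />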
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of__T_.png|<br />
descr= Figure 1) A loop composed of the amino acid T produces an interesting, well-defined angle distribution (left panel). The middle panel represents the distribution of lengths found with this loop. The right panel represents the distribution of amino acids found before and after the amino acid T in all the corresponding loops. One can see that the amino acid K is the most frequent before and A the most frequent after.}} <br />
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of_K_T_.png|<br />
descr= Figure 2) These distributions are subsets of the ones of Figure 1, with only the loops composed of the amino acid T preceded by the amino acid K. This constraint considerably narrows down the angle and length distributions.}} <br />
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of__T_A.png|<br />
descr= Figure 3) Same procedure as in Figure 2, with the loops composed of the amino acid T followed by the amino acid A. This also narrows down the angle and length distributions compared to Figure 1, although not as much as in Figure 2.}} <br />
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of_K_T_A.png|<br />
descr= Figure 4) The distributions correspond to loops composed of the amino acids KTA and are therefore subsets of all the preceding ones. One can see that the angle distribution becomes very narrow, but also that the frequency is reduced compared to the preceding ones.}}<br />
<br />
This step had two main goals: narrowing down the angle distribution and finding loops with no preference for the amino acids surrounding them. This last point was important for the modularity of our approach: the angle blocks should not be affected by the surrounding alpha helices. Using this approach, 10 different angle motifs (Table 1) could be identified, producing different angles. Importantly, all these motifs were chosen so that the length of the two segments starting from the turning point was the same for all of them (put figure turning point). It should also be noted that 3 of the patterns, KTA, LVA and AAIAP, produce nearly the same angle. They were nevertheless all kept for the ''in silico'' refinement described below, but only one was used for the [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software CRAUT software].<br />
<br />
<br />
{| class="table table-hover" style="text-align: center;"<br />
|+'''Table 1''': The span of parameters.<br />
!colspan="10"|Angle Patterns<br />
|-<br />
|Pattern <br />
| NVL <br />
| KTA <br />
| LVA <br />
| AAIAP <br />
| AADGTL <br />
| VNLTA<br />
| AAAHPEA<br />
| ASLPAA <br />
| ATGDLA<br />
|-<br />
|Mean angle (°) <br />
| 29.7 <br />
| 38.7 <br />
| 35 <br />
| 36.5 <br />
| 60 <br />
| 74.5 <br />
| 117 <br />
| 140 <br />
| 160 <br />
|-<br />
| Variation (°) <br />
| 8.5 <br />
| 30 <br />
| 29 <br />
| 27 <br />
| 12 <br />
| 27 <br />
| 12 <br />
| 15 <br />
| 5 <br />
|}<br />
<br />
===Sequences to connect the alpha helix to the protein extremity===<br />
<br />
Helical patterns often affect the folding of the attached sequences. To prevent them from affecting the structure of the protein of interest, we analyzed the effects of various amino acids on helix formation ''in silico'' using online tools like [http://bioserv.rpbs.univ-paris-diderot.fr/services/PEP-FOLD/ pep-fold]. We identified glycines and prolines as reliable amino acids to interrupt helix formation. We then decided to use glycine pairs to connect the protein of interest to the linker, because they give more flexibility to the orientation of the first helix.<br />
This only concerns the protein end that is connected to the linker through the coding sequence. The other end is ligated through exteins or a sortase scar, both treated as unstructured flexible regions.<br />
<br />
===''Conclusion''===<br />
<br />
The last three parts show how we could design alpha helix and angle pattern blocks and connect them to each other and to the protein of interest. They provide the material that allowed us to transform a linker defined as a geometrical path into a real amino acid sequence in our [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software software].<br />
<br />
Thanks to this, we could design linkers to circularize [https://2014.igem.org/Team:Heidelberg/Project/PCR_2.0 DNMT1] and [https://2014.igem.org/Team:Heidelberg/Project/Linker_Screening lysozyme]. Additionally, it was shown for lysozyme that a custom-tailored linker enhances heat stability compared to a badly designed linker and even to the flexible linker.<br />
<br />
=''In silico'' refinement=<br />
<br />
As some of the interesting patterns could not be found often enough to be statistically significant, we decided to make a further refinement ''in silico'' by modeling the structure of proteins with circularizing linkers. To perform this for realistic situations, we selected, from the RCSB database, structures of non-homologous target proteins whose extremities are far enough apart to require a linker for circularization. First, the [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software CRAUT software] generates possible fitting linkers for the various proteins. From these possible linkers, the 100 shortest were taken. Among the 3 possible angle patterns that generate the same angle, the software provides only one, but the linker refinement developed here used all three for comparison. We assume that the linkers connect the ends of the protein without putting tension on the protein, so that the protein can fold in its natural way.<br />
<br />
After this, the circularized proteins with the specific linkers are modelled using a software called Modeller [[#References|[10]]]. This software is widely used for comparative structure prediction. It is well established in the scientific community and should be well suited for the prediction of loop regions attached to existing structures [[#References|[11]]]. It is freely available for academic use from the [http://salilab.org/modeller/ salilab] webpages. The program is able to predict the 3D structure of a given sequence based on an alignment with a given structure. In our script, Modeller needed to be provided with the sequence including the attached linker and with the PDB file of the protein of interest. This latter file is the only structural information used here. Modeller is recommended when at least 30% sequence similarity exists between the provided structure and the one that the user wishes to model. Here, only the linker is different, so this similarity is around 90% in our case. At first, Modeller makes an alignment between the provided structure and the sequence of our linked protein, identifying the regions that cannot be found in the structure. Based on that, Modeller generates 4 initial models. One of the strengths of the software is its capacity to further refine only certain parts of the protein. Thus we let Modeller refine the loops, which are defined as any part of the protein that could not be found in the structure file. The ''ab initio'' modelling of our linkers is done by minimizing energy functions with different methods such as conjugate gradients and molecular dynamics. Eventually, each modelled structure is provided with energy values, with which different models of the same structure can be compared. From Modeller we received about 8 different models and chose the one with the best energy scores to proceed with. For these refinement steps, one can choose different levels of optimization. 
We always chose accuracy over speed. The result of a prediction for the lysozyme of bacteriophage lambda can be seen in Figure 5.<br />
<br />
{{:Team:Heidelberg/templates/image-quarter|<br />
align=right|<br />
caption=Figure 5) Circular lysozyme|<br />
descr=The structure of circular lysozyme (yellow) was predicted and aligned to the linear structure (purple). The linker calculated by the software was marked in green. |<br />
file=circ_lam_lys_nils.png}}<br />
<br />
Modeller was run by distributing the calculations via the [https://2014.igem.org/Team:Heidelberg/Software/igemathome iGEM@home system]. The modelling of one linker took about 10 hours of calculation time on average, a value that strongly depends on the size of the protein. The best model is then evaluated by another self-written program to analyze the behaviour of the linker patterns in their natural surroundings.<br />
Finally, all the models for the different structures and the different linkers are analyzed for properties such as the length of the helical patterns, the shape of the attachment structures of the linker and the angles produced by the angle patterns [[ Figure helix_winkel_messung.png]]. First the modelled structure and the natural structure are superimposed to see how large the differences between them are. If the protein has been disturbed too much, the model is discarded. We verify that the alpha helix patterns really form helices by measuring the distance between the first and the last amino acid of each of them and checking whether it is compatible with an alpha helix. The first and last amino acids of each rod define a vector that is then used to calculate the angles between consecutive helical patterns. The frequencies of all the determined lengths and angles are then further analyzed, using a strategy similar to the ArchDB analysis presented above.<br />
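The geometric measurements described above can be sketched as follows. The sketch assumes that each amino acid is represented by a single 3D coordinate (for example its C-alpha atom) and that the angle is taken between the direction vectors of two consecutive rods; both choices, as well as the function names and coordinates, are assumptions made for illustration.<br />

```python
import math


def helix_vector(first, last):
    """Vector from the first to the last amino acid of one helical rod."""
    return tuple(b - a for a, b in zip(first, last))


def helix_length(first, last):
    """End-to-end distance of a rod, in the units of the coordinates."""
    return math.dist(first, last)


def angle_between(v1, v2):
    """Angle in degrees between two direction vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    cos = dot / (math.hypot(*v1) * math.hypot(*v2))
    # Clamp to [-1, 1] to guard against rounding errors before acos.
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))


# Made-up coordinates of two consecutive rods joined by an angle pattern.
rod1 = helix_vector((0.0, 0.0, 0.0), (9.0, 0.0, 0.0))
rod2 = helix_vector((10.0, 1.0, 0.0), (10.0, 10.0, 0.0))
angle = angle_between(rod1, rod2)
```

Collecting these lengths and angles over many models yields distributions that can be compared with those extracted from ArchDB.<br />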
<br />
The whole process for the verification of the different linker patterns was set up on the [https://2014.igem.org/Team:Heidelberg/Software/igemathome distributed computing system]. Due to lack of time, only a few results could be analyzed, resulting in distributions for the different helices; see for example Figures 6 and 7.<br />
This led to a refinement of the length of the 8 different alpha helix blocks presented above to 8.7, 10, 10.8, 15.6, 16.8, 24.8, 32.3 &Aring;.<br />
<br />
{{:Team:Heidelberg/templates/image-quarter|<br />
align=right|<br />
descr=|<br />
caption=Fig. 7. Length distribution of AEAAAKEAAAK|<br />
file=plot_of_AEAAAKEAAAK.png}}<br />
{{:Team:Heidelberg/templates/image-quarter|<br />
align=right|<br />
descr=|<br />
caption=Fig. 6. Length distribution of AEAAAKA |<br />
file=plot_of_AEAAAKA.png}}<br />
<br />
=Conclusion=<br />
The patterns that we identified by analyzing structure databases provide an easy and fast tool to build custom-shaped peptides. The main achievement is the identification of the angle patterns. These are designed as building blocks for enhanced applicability. The shapes were identified from a database of non-homologous proteins, and the patterns were refined until the distribution of surrounding amino acids looked random. Thus we can conclude that the angle distributions we observed are due not to the surrounding sequences but to the identified patterns. We have not observed any evidence for certain helix patterns being preferred in the database.<br />
From figures 6 and 7 we have learned that the lengths we have assumed for the helices needed to be refined. For example we had assumed the AEAAAKA motif to span a distance of 10.5 &Aring; but have observed it to be only 10 &Aring; long. Our [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software CRAUT software] was accordingly corrected.<br />
<br />
=References=<br />
<br />
[1] Vieille, C. & Zeikus, G.J. Hyperthermophilic enzymes: sources, uses, and molecular mechanisms for thermostability. Microbiology and molecular biology reviews : MMBR 65, 1-43 (2001).<br />
<br />
[2] Yu, Y. & Lutz, S. Circular permutation: a different way to engineer enzyme structure and function. Trends Biotechnol. 29, 18-25 (2011).<br />
<br />
[3] Arai, R., Ueda, H., Kitayama, A., Kamiya, N. & Nagamune, T. Design of the linkers which effectively separate domains of a bifunctional fusion protein. Protein Eng. 14, 529-532 (2001).<br />
<br />
[4] Wang, C.K.L., Kaas, Q., Chiche, L. & Craik, D.J. CyBase: A database of cyclic protein sequences and structures, with applications in protein discovery and engineering. Nucleic Acids Research 36, (2008).<br />
<br />
[5] Efimov, A. V. Standard structures in proteins. Prog. Biophys. Mol. Biol. 60, 201-239 (1993).<br />
<br />
[6] Donate, L. E., Rufino, S. D., Canard, L. H. & Blundell, T. L. Conformational analysis and clustering of short and medium size loops connecting regular secondary structures: a database for modeling and prediction. Protein Sci. 5, 2600-2616 (1996).<br />
<br />
[7] Bonet, J. et al. ArchDB 2014: structural classification of loops in proteins. Nucleic Acids Res. 42, D315-D319 (2014).<br />
<br />
[8] George, R. A. & Heringa, J. An analysis of protein domain linkers: their classification and role in protein folding. Protein Eng. 15, 871-879 (2002).<br />
<br />
[9] Chen, X., Zaro, J. L. & Shen, W.-C. Fusion protein linkers: property, design and functionality. Adv. Drug Deliv. Rev. 65, 1357-1369 (2013).<br />
<br />
[10] Fiser, A. et al. Modeling of loops in protein structures. Protein science : a publication of the Protein Society 9, 1753-73 (2000).<br />
<br />
[11] Fiser, A. & Sali, A. ModLoop: Automated modeling of loops in protein structures. Bioinformatics 19, 2500-2501 (2003).</div>Igemnilshttp://2014.igem.org/Team:Heidelberg/pages/Linker_ModelingTeam:Heidelberg/pages/Linker Modeling2014-10-17T21:14:41Z<p>Igemnils: /* In silico refinement */</p>
<hr />
<div>=Background=<br />
Primary, secondary, tertiary and quaternary structures are the main levels of protein structure characterization. Primary structure designates the amino acid sequence, while the secondary structure describes the arrangement of consecutive amino acids through their two dihedral angles $\phi$ and $\psi$. The Ramachandran plot, which represents the amino acid position in the space of those two angles, shows two particular arrangements commonly found in proteins: alpha helices and beta sheets. The next level of protein organization is the tertiary structure, which describes how the protein is organized in the three spatial dimensions, whereas the quaternary structure describes how different subunits of proteins cluster.<br />
Finally, closely related to these standard structures, the supersecondary structure describes how secondary structure elements are connected to each other. While these connections look undefined at first sight, further analysis revealed that this wide variety of supersecondary structure motifs can be clustered to certain patterns [[#References|[5]]].<br />
<br />
==Supersecondary structure==<br />
<br />
When the properties of supersecondary structures were first described, only very few patterns were identified, mainly due to the lack of highly resolved protein structures. At that time, the structures were mainly classified by the Ramachandran plot regions where the amino acids could be found [[#References|[6]]]. With the growing number of known crystal structures, the analysis of supersecondary structure improved and led to databases with about 150 000 classified loop structures and elaborate clustering [[#References|[7]]]. Nowadays supersecondary structures are defined as the structures built when two secondary structure elements are connected by a short peptide that does not itself belong to one of the secondary structures. These loop peptides range from 1 to 9 amino acids.<br />
Our aim was to build reliable stable linkers out of alpha helices connected by supersecondary structure motifs that produce certain angles. To achieve that, we searched for the most reliable alpha helix patterns that would form rigid rods and angle patterns covering the whole range of angles from 0 to 180 degrees.<br />
<br />
=Linker building block design=<br />
<br />
===Helix patterns===<br />
Various patterns have been used to build helical linkers that connect protein ends [[#References | [8]]]. Moreover, in known protein structures, linkers between subdomains can be identified, and their properties have already been analyzed [[#References | [9]]]. Two main criteria were used to build the alpha helix patterns: they should robustly form alpha helices, and they should be soluble in aqueous solution. Therefore, we could not simply use linkers built of alanine alone. We decided to add charged amino acids to the pattern and to position them physically close to each other, so that they could stabilize each other by Coulomb interaction. These amino acids needed to be separated by 3 amino acids, as a helical turn takes about 3.6 amino acids. The pattern we chose as most suitable for our purpose was also described as one of the most stable [[#References | [3]]].<br />
8 alpha helix building blocks were eventually chosen: AEAAAK, AEAAAKA, AEAAAKAA, AEAAAKEAAAK, AEAAAKEAAAKA, AEAAAKEAAAKEAAAKA, AEAAAKEAAAKEAAAKEAAAKA, AEAAAKEAAAKEAAAKEAAAKEAAAKA, with a respective estimated length of 9, 10.5, 12, 16.5, 18, 25.5 and 33&Aring;.<br />
<br />
===Angle patterns===<br />
<br />
The angle patterns for our model were obtained from the ArchDB database [[#References | [7]]], which classifies loops from known protein structures. About 17 000 non-homologous proteins from the PDB database were analyzed, and from them 150 000 loop structures, i.e. regions connecting two secondary structure elements, were identified. The classification took into account not only the length of the loop and its conformation, i.e. the φ and ψ backbone dihedral angles of the residues in the loop, but also the distance between the attachments of the loop to the surrounding secondary structures. Furthermore, the secondary structures surrounding the loop and the geometry defined by the supersecondary structure motifs can be found in the database.<br />
<br />
To extract from ArchDB the relevant supersecondary structure motifs for our linker design, the complete database was downloaded and helix-loop-helix motifs were extracted using a [https://github.com/igemsoftware/Heidelberg_2014 self-written script] in the Python programming language. Of these, we only took into account loops composed of 1 and 2 amino acids, because the longer the loops, the less frequent and therefore the less reliable they are, and the further the ends are from each other.<br />
The interesting information for us was (1) the angle produced between the vectors defining the surrounding alpha helices, (2) the distance between the ends of the loop, and (3) the type of amino acids surrounding the loop. Furthermore we analyzed the statistical significance of the conformation. For each amino acid combination in the loop region, the angle distribution between the embracing alpha helices, the loop length distribution and a 2D heatmap of the surrounding amino acids were plotted automatically (Figures 1-4).<br />
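<br />
The angle in point (1) can be computed from the dot product of the two helix vectors. The following is a minimal sketch, not the code of our script: the function name <code>angle_between</code> is hypothetical, and we assume that each helix is represented by a single 3D direction vector.<br />

```python
import math


def angle_between(u, v):
    """Return the angle in degrees between two 3D vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] so that rounding errors cannot break acos.
    cos_theta = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cos_theta))


# Two perpendicular helix axes (toy vectors, not database values).
print(angle_between((1.0, 0.0, 0.0), (0.0, 2.0, 0.0)))
```

Whether a straight continuation of the helix counts as 0 or as 180 degrees depends on the direction chosen for the two vectors, so this convention has to be fixed before angles are compared.<br />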
<br />
These distributions were then analyzed visually to identify loops of interest for linker design. We focused on loops that showed a narrow angle distribution and appeared frequently in the database, and we extended the distribution analysis to the surrounding amino acids. For example, once we had identified that the amino acid T in a turn produces an interesting distribution (Figure 1), we extended it with the amino acid K in front (Figure 2), with the amino acid A at the end (Figure 3), and with both (Figure 4), and investigated how the distributions behaved. Restricting the possibilities in this way greatly reduces the number of occurrences, but the properties become more interesting.<br />
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of__T_.png|<br />
descr= Figure 1) A loop composed of the amino acid T produces an interesting, well-defined angle distribution (left panel). The middle panel represents the distribution of lengths found with this loop. The right panel represents the distribution of amino acids found before and after the amino acid T in all the corresponding loops. One can see that the amino acid K is the most frequent before the loop and A the most frequent after it.}} <br />
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of_K_T_.png|<br />
descr= Figure 2) These distributions are subsets of those in Figure 1, containing only the loops composed of the amino acid T preceded by the amino acid K. This constraint considerably narrows down the angle and length distributions.}} <br />
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of__T_A.png|<br />
descr= Figure 3) Same procedure as in Figure 2, with the loops composed of the amino acid T followed by the amino acid A. This also narrows down the angle and length distributions compared to Figure 1, although not as much as in Figure 2.}} <br />
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of_K_T_A.png|<br />
descr= Figure 4) The distributions correspond to loops composed of the amino acids KTA and are therefore subsets of all the preceding ones. The angle distribution becomes very narrow, but the frequency is also reduced compared to the preceding figures.}}<br />
<br />
This step had two main goals: narrowing down the angle distribution and finding loops with no preference for the amino acids surrounding them. The latter point was important for the modularity of our approach: the angle blocks should not be affected by the surrounding alpha helices. Using this approach, 10 different angle motifs producing different angles could be identified (Table 1). Importantly, all these motifs were chosen so that the length of the two segments starting from the turning point was the same for all of them. It should also be noted that 3 of the patterns, KTA, LVA and AAIAP, produce nearly the same angle. They were nevertheless all kept for the ''in silico'' refinement described below, but only one of them was used for the [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software CRAUT software].<br />
<br />
<br />
{| class="table table-hover" style="text-align: center;"<br />
|+'''Table 1''': The span of parameters.<br />
!colspan="10"|Angle Patterns<br />
|-<br />
|Pattern <br />
| NVL <br />
| KTA <br />
| LVA <br />
| AAIAP <br />
| AADGTL <br />
| VNLTA<br />
| AAAHPEA<br />
| ASLPAA <br />
| ATGDLA<br />
|-<br />
|Mean angle (°) <br />
| 29.7 <br />
| 38.7 <br />
| 35 <br />
| 36.5 <br />
| 60 <br />
| 74.5 <br />
| 117 <br />
| 140 <br />
| 160 <br />
|-<br />
| Variation (°) <br />
| 8.5 <br />
| 30 <br />
| 29 <br />
| 27 <br />
| 12 <br />
| 27 <br />
| 12 <br />
| 15 <br />
| 5 <br />
|}<br />
<br />
===Sequences to connect the alpha helix to the protein extremity===<br />
<br />
Helical patterns often affect the folding of the attached sequences. To prevent them from affecting the structure of the protein of interest, we analyzed the effects of various amino acids on helix formation ''in silico'' using online tools such as [http://bioserv.rpbs.univ-paris-diderot.fr/services/PEP-FOLD/ pep-fold]. We identified glycines and prolines as reliable amino acids to interrupt helix formation. We then decided to use glycine pairs to connect the protein of interest to the linker, because they give more flexibility to the orientation of the first helix.<br />
This only concerns the protein end that is connected to the linker through the coding sequence. The other end is ligated through exteins or a sortase scar, both of which are treated as unstructured flexible regions.<br />
<br />
===''Conclusion''===<br />
<br />
The last three parts show how we could design alpha helix and angle pattern blocks and connect them to each other and to the protein of interest. They provide the material that allowed us to transform a linker defined as a geometrical path into a real amino acid sequence in our [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software software].<br />
<br />
Thanks to this, we could design linkers to circularize [https://2014.igem.org/Team:Heidelberg/Project/PCR_2.0 DNMT1] and [https://2014.igem.org/Team:Heidelberg/Project/Linker_Screening lysozyme]. Additionally, it was shown for lysozyme that a custom-tailored linker enhances heat stability compared to a badly designed linker and even to the flexible linker.<br />
<br />
=''In silico'' refinement=<br />
<br />
As some of the interesting patterns could not be found often enough to be statistically significant, we decided to make a further refinement ''in silico'' by modeling the structure of proteins with circularizing linkers. To do this for realistic situations, we selected from the RCSB database structures of non-homologous target proteins whose extremities are far enough apart to require a linker for circularization. First, the [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software CRAUT software] generates possible fitting linkers for the various proteins. From these possible linkers, the 100 shortest were taken. Among the 3 angle patterns that generate the same angle, the software provides only one, whereas the linker refinement developed here used all three for comparison. We assume that the linkers connect the ends of the protein without putting tension on it, so that the protein can fold in its natural way.<br />
<br />
After this, the circularized proteins with the specific linkers are modeled using a software called Modeller [[#References|[10]]]. This software is widely used for comparative structure prediction. It is well established in the scientific community and should be well suited to the prediction of loop regions attached to existing structures [[#References|[11]]]. It is freely available for academic use from the [http://salilab.org/modeller/ salilab] webpages. The program is able to predict the 3D structure of a given sequence based on an alignment with a given structure. In our script, Modeller needed to be provided with a sequence with the linker attached and with the PDB file of the protein of interest. The latter file is the only structural information used here. Modeller is recommended when at least 30% sequence similarity exists between the provided structure and the one that the user wishes to model. Here, only the linker is different, so this similarity is around 90% in our case. Modeller first makes an alignment between the provided structure and the sequence of our linked protein, identifying the regions that cannot be found in the structure. Based on that, Modeller generates 4 initial models. One of the strengths of the software is its capacity to further refine only certain parts of the protein. We therefore let Modeller refine the loops, which are defined as any part of the protein that could not be found in the structure file. The ''ab initio'' modeling of our linkers is done by minimizing energy functions with different methods such as conjugate gradients and molecular dynamics. Finally, each modeled structure is provided with energy values, which allow different models of the same structure to be compared. From Modeller we received about 8 different models and chose the one with the best energy scores to proceed with. For these refinement steps, one can choose different levels of optimization. We always favored accuracy over the speed of the program. The result of a prediction for the lysozyme of bacteriophage lambda can be seen in Figure 5.<br />
<br />
{{:Team:Heidelberg/templates/image-quarter|<br />
align=right|<br />
caption=Figure 5) Circular lysozyme|<br />
descr=The structure of circular lysozyme (yellow) was predicted and aligned to the linear structure (purple). The linker calculated by the software was marked green. |<br />
file=circ_lam_lys_nils.png}}<br />
<br />
Modeller was run by distributing the calculations via the [https://2014.igem.org/Team:Heidelberg/Software/igemathome iGEM@home system]. The modeling of one linker took about 10 hours of calculation time on average, a value that depends strongly on the size of the protein. The best model is then evaluated by another self-written program to analyze the behaviour of the linker patterns in their natural surroundings.<br />
Finally, all the models for the different structures and the different linkers are analyzed for properties such as the length of the helical patterns, the shape of the attachment structures of the linker and the angles produced by the angle patterns [[ Figure helix_winkel_messung.png]]. First, the modeled structure and the natural structure are superimposed to see how large the differences between them are. If the protein has been disturbed too much, the model is discarded. We verify that the alpha helix patterns really generate the expected structure by measuring the distance between the first and the last amino acid of each of them and checking whether it is compatible with an alpha helix. The first and last amino acids of each rod define a vector that is then used to calculate the angles between consecutive helical patterns. The frequency of all the determined lengths and angles is then analyzed further, using a strategy similar to the ArchDB analysis presented above.<br />
<br />
The whole process for the verification of the different linker patterns was set up on the [https://2014.igem.org/Team:Heidelberg/Software/igemathome distributed computing system]. Due to lack of time, only a few results could be analyzed, resulting in distributions for the different helices; see for example Figures 6 and 7.<br />
This led to a refinement of the lengths of the alpha helix blocks presented above to 8.7, 10, 10.8, 15.6, 16.8, 24.8 and 32.3 &Aring;.<br />
<br />
{{:Team:Heidelberg/templates/image-quarter|<br />
align=right|<br />
descr=|<br />
caption=Fig. 7. Length distribution of AEAAAKEAAAK|<br />
file=plot_of_AEAAAKEAAAK.png}}<br />
{{:Team:Heidelberg/templates/image-quarter|<br />
align=right|<br />
descr=|<br />
caption=Fig. 6. Length distribution of AEAAAKA |<br />
file=plot_of_AEAAAKA.png}}<br />
<br />
=Conclusion=<br />
The patterns that we identified by analyzing structure databases provide an easy and fast tool to build custom-shaped peptides. The main achievement is the identification of the angle patterns. These are designed as building blocks for enhanced applicability. The shapes were identified from a database of non-homologous proteins, and the patterns were refined until the surrounding amino acids appeared randomly distributed. We can thus conclude that the angle distributions we observed are due not to the surrounding sequences but to the identified patterns. We have not observed any evidence for certain helix patterns being preferred in the database.<br />
From Figures 6 and 7 we learned that the lengths we had assumed for the helices needed to be refined. For example, we had assumed that the AEAAAKA motif spans a distance of 10.5 &Aring;, but observed it to be only 10 &Aring; long. Our [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software CRAUT software] was corrected accordingly.<br />
<br />
=References=<br />
<br />
[1] Vieille, C. & Zeikus, G.J. Hyperthermophilic enzymes: sources, uses, and molecular mechanisms for thermostability. Microbiology and molecular biology reviews : MMBR 65, 1-43 (2001).<br />
<br />
[2] Yu, Y. & Lutz, S. Circular permutation: a different way to engineer enzyme structure and function. Trends Biotechnol. 29, 18-25 (2011).<br />
<br />
[3] Arai, R., Ueda, H., Kitayama, A., Kamiya, N. & Nagamune, T. Design of the linkers which effectively separate domains of a bifunctional fusion protein. Protein Eng. 14, 529-532 (2001).<br />
<br />
[4] Wang, C.K.L., Kaas, Q., Chiche, L. & Craik, D.J. CyBase: A database of cyclic protein sequences and structures, with applications in protein discovery and engineering. Nucleic Acids Research 36, (2008).<br />
<br />
[5] Efimov, A. V. Standard structures in proteins. Prog. Biophys. Mol. Biol. 60, 201-239 (1993).<br />
<br />
[6] Donate, L. E., Rufino, S. D., Canard, L. H. & Blundell, T. L. Conformational analysis and clustering of short and medium size loops connecting regular secondary structures: a database for modeling and prediction. Protein Sci. 5, 2600-2616 (1996).<br />
<br />
[7] Bonet, J. et al. ArchDB 2014: structural classification of loops in proteins. Nucleic Acids Res. 42, D315-D319 (2014).<br />
<br />
[8] George, R. A. & Heringa, J. An analysis of protein domain linkers: their classification and role in protein folding. Protein Eng. 15, 871-879 (2002).<br />
<br />
[9] Chen, X., Zaro, J. L. & Shen, W.-C. Fusion protein linkers: property, design and functionality. Adv. Drug Deliv. Rev. 65, 1357-1369 (2013).<br />
<br />
[10] Fiser, A. et al. Modeling of loops in protein structures. Protein science : a publication of the Protein Society 9, 1753-73 (2000).<br />
<br />
[11] Fiser, A. & Sali, A. ModLoop: Automated modeling of loops in protein structures. Bioinformatics 19, 2500-2501 (2003).</div>Igemnilshttp://2014.igem.org/Team:Heidelberg/pages/Linker_ModelingTeam:Heidelberg/pages/Linker Modeling2014-10-17T21:13:19Z<p>Igemnils: /* Angle patterns */</p>
<hr />
<div>=Background=<br />
Primary, secondary, tertiary and quaternary structures are the main levels of protein structure characterization. The primary structure designates the amino acid sequence, while the secondary structure describes the arrangement of consecutive amino acids through their two dihedral angles $\phi$ and $\psi$. The Ramachandran plot, which represents the position of the amino acids in the space of those two angles, shows two particular arrangements commonly found in proteins: alpha helices and beta sheets. The next level of protein organization is the tertiary structure, which describes how the protein is organized in the three spatial dimensions, whereas the quaternary structure describes how different protein subunits assemble.<br />
Finally, closely related to these standard levels, the supersecondary structure describes how secondary structure elements are connected to each other. While these connections look undefined at first sight, further analysis revealed that the wide variety of supersecondary structure motifs can be clustered into certain patterns [[#References|[5]]].<br />
<br />
==Supersecondary structure==<br />
<br />
When the properties of supersecondary structures were first described, only very few patterns were identified, mainly due to the lack of highly resolved protein structures. At that time, the structures were mainly classified by the Ramachandran plot regions in which the amino acids could be found [[#References|[6]]]. With the growing number of known crystal structures, the analysis of supersecondary structures improved and led to databases with about 150 000 classified loop structures and elaborate clustering [[#References|[7]]]. Nowadays, supersecondary structures are defined as the structures built when two secondary structure elements are connected by a small peptide that is not itself assigned to one of the secondary structures. These loop peptides range from 1 to 9 amino acids.<br />
Our aim was to build reliable, stable linkers out of alpha helices connected by supersecondary structure motifs that produce defined angles. To achieve that, we searched for the most reliable alpha helix patterns that would form rigid rods, and for angle patterns covering the whole range of angles from 0 to 180 degrees.<br />
<br />
=Linker building block design=<br />
<br />
===Helix patterns===<br />
Various patterns have been used to build helical linkers that connect protein ends [[#References | [8]]]. Moreover, in known protein structures, linkers between subdomains can be identified, and their properties have already been analyzed [[#References | [9]]]. We used two main criteria to build the alpha helix patterns: they should robustly form alpha helices, and they should be soluble in aqueous solution. We therefore could not simply use linkers built of alanine alone. Instead, we added charged amino acids to the pattern and positioned them physically close to each other so that they could stabilize each other through Coulomb interaction. These amino acids needed to be separated by 3 amino acids, as a helical turn takes about 3.6 amino acids. The pattern we chose as most suitable for our purpose has also been described as one of the most stable [[#References | [3]]].<br />
Eight alpha helix building blocks were eventually chosen: AEAAAK, AEAAAKA, AEAAAKAA, AEAAAKEAAAK, AEAAAKEAAAKA, AEAAAKEAAAKEAAAKA, AEAAAKEAAAKEAAAKEAAAKA, AEAAAKEAAAKEAAAKEAAAKEAAAKA, with respective estimated lengths of 9, 10.5, 12, 16.5, 18, 25.5 and 33 &Aring;.<br />
<br />
===Angle patterns===<br />
<br />
The angle patterns for our model were obtained from the ArchDB database [[#References | [7]]], which classifies loops from known protein structures. About 17 000 non-homologous proteins from the PDB database were analyzed, and 150 000 loop structures, i.e. regions connecting two secondary structure elements, were identified in them. The classification took into account not only the length of the loop and its conformation, i.e. the φ and ψ backbone dihedral angles of the residues in the loop, but also the distance between the attachment points of the loop on the surrounding secondary structures. Furthermore, the database reports the secondary structures surrounding the loop and the geometry defined by the supersecondary structure motifs.<br />
<br />
To extract from ArchDB the supersecondary structure motifs relevant for our linker design, the complete database was downloaded and helix-loop-helix motifs were extracted using a [https://github.com/igemsoftware/Heidelberg_2014 self-written script] in Python. Of these, we only took into account loops composed of 1 or 2 amino acids, because longer loops are less frequent and therefore less reliable, and their ends lie further apart.<br />
The information of interest to us was (1) the angle between the vectors defining the surrounding alpha helices, (2) the distance between the ends of the loop, and (3) the type of amino acids surrounding the loop. Furthermore, we analyzed the statistical significance of the conformation. For each amino acid combination in the loop region, the angle distribution between the flanking alpha helices, the loop length distribution and a 2D heatmap of the surrounding amino acids were plotted automatically (Figures 1-4).<br />
<br />
These distributions were then analyzed visually to identify loops of interest for linker design. We focused on loops that showed a narrow angle distribution and appeared frequently in the database, and we extended the distribution analysis to the surrounding amino acids. For example, once we had identified that the amino acid T in a turn produces an interesting distribution (Figure 1), we extended it with the amino acid K in front (Figure 2), with the amino acid A at the end (Figure 3), and with both (Figure 4), and investigated how the distributions behaved. Restricting the possibilities in this way greatly reduces the number of occurrences, but the properties become more interesting.<br />
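<br />
The successive restriction described above (T, then K before T, then T followed by A, then KTA) amounts to filtering the loop records by their flanking residues and summarizing the remaining angles. The following is a minimal sketch, not our actual script: the record layout with the keys <code>before</code>, <code>loop</code>, <code>after</code> and <code>angle</code>, the function names and the toy values are all invented for illustration and are not ArchDB data.<br />

```python
from statistics import mean, pstdev

# Hypothetical toy records; the angle values are invented, not taken from ArchDB.
records = [
    {"before": "K", "loop": "T", "after": "A", "angle": 38.0},
    {"before": "K", "loop": "T", "after": "A", "angle": 40.0},
    {"before": "K", "loop": "T", "after": "G", "angle": 55.0},
    {"before": "L", "loop": "T", "after": "A", "angle": 70.0},
    {"before": "E", "loop": "T", "after": "S", "angle": 120.0},
    {"before": "K", "loop": "G", "after": "A", "angle": 90.0},
]


def select_loops(records, loop, before=None, after=None):
    """Keep records with the given loop residue(s) and optional flanking residues."""
    return [
        r for r in records
        if r["loop"] == loop
        and (before is None or r["before"] == before)
        and (after is None or r["after"] == after)
    ]


def angle_summary(selected):
    """Return (count, mean angle, population standard deviation) of a selection."""
    angles = [r["angle"] for r in selected]
    if not angles:
        return 0, None, None
    return len(angles), mean(angles), pstdev(angles)


# Each added constraint lowers the count and, in this toy data, narrows the spread.
for before, after in [(None, None), ("K", None), (None, "A"), ("K", "A")]:
    print(before, after, angle_summary(select_loops(records, "T", before, after)))
```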
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of__T_.png|<br />
descr= Figure 1) A loop composed of the amino acid T produces an interesting, well-defined angle distribution (left panel). The middle panel represents the distribution of lengths found with this loop. The right panel represents the distribution of amino acids found before and after the amino acid T in all the corresponding loops. One can see that the amino acid K is the most frequent before the loop and A the most frequent after it.}} <br />
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of_K_T_.png|<br />
descr= Figure 2) These distributions are subsets of those in Figure 1, containing only the loops composed of the amino acid T preceded by the amino acid K. This constraint considerably narrows down the angle and length distributions.}} <br />
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of__T_A.png|<br />
descr= Figure 3) Same procedure as in Figure 2, with the loops composed of the amino acid T followed by the amino acid A. This also narrows down the angle and length distributions compared to Figure 1, although not as much as in Figure 2.}} <br />
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of_K_T_A.png|<br />
descr= Figure 4) The distributions correspond to loops composed of the amino acids KTA and are therefore subsets of all the preceding ones. The angle distribution becomes very narrow, but the frequency is also reduced compared to the preceding figures.}}<br />
<br />
This step had two main goals: narrowing down the angle distribution and finding loops with no preference for the amino acids surrounding them. The latter point was important for the modularity of our approach: the angle blocks should not be affected by the surrounding alpha helices. Using this approach, 10 different angle motifs producing different angles could be identified (Table 1). Importantly, all these motifs were chosen so that the length of the two segments starting from the turning point was the same for all of them. It should also be noted that 3 of the patterns, KTA, LVA and AAIAP, produce nearly the same angle. They were nevertheless all kept for the ''in silico'' refinement described below, but only one of them was used for the [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software CRAUT software].<br />
<br />
<br />
{| class="table table-hover" style="text-align: center;"<br />
|+'''Table 1''': The span of parameters.<br />
!colspan="10"|Angle Patterns<br />
|-<br />
|Pattern <br />
| NVL <br />
| KTA <br />
| LVA <br />
| AAIAP <br />
| AADGTL <br />
| VNLTA<br />
| AAAHPEA<br />
| ASLPAA <br />
| ATGDLA<br />
|-<br />
|Mean angle (°) <br />
| 29.7 <br />
| 38.7 <br />
| 35 <br />
| 36.5 <br />
| 60 <br />
| 74.5 <br />
| 117 <br />
| 140 <br />
| 160 <br />
|-<br />
| Variation (°) <br />
| 8.5 <br />
| 30 <br />
| 29 <br />
| 27 <br />
| 12 <br />
| 27 <br />
| 12 <br />
| 15 <br />
| 5 <br />
|}<br />
<br />
===Sequences to connect the alpha helix to the protein extremity===<br />
<br />
Helical patterns often affect the folding of the attached sequences. To prevent them from affecting the structure of the protein of interest, we analyzed the effects of various amino acids on helix formation ''in silico'' using online tools such as [http://bioserv.rpbs.univ-paris-diderot.fr/services/PEP-FOLD/ pep-fold]. We identified glycines and prolines as reliable amino acids to interrupt helix formation. We then decided to use glycine pairs to connect the protein of interest to the linker, because they give more flexibility to the orientation of the first helix.<br />
This only concerns the protein end that is connected to the linker through the coding sequence. The other end is ligated through exteins or a sortase scar, both of which are treated as unstructured flexible regions.<br />
<br />
===''Conclusion''===<br />
<br />
The last three parts show how we could design alpha helix and angle pattern blocks and connect them to each other and to the protein of interest. They provide the material that allowed us to transform a linker defined as a geometrical path into a real amino acid sequence in our [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software software].<br />
<br />
Thanks to this, we could design linkers to circularize [https://2014.igem.org/Team:Heidelberg/Project/PCR_2.0 DNMT1] and [https://2014.igem.org/Team:Heidelberg/Project/Linker_Screening lysozyme]. Additionally, it was shown for lysozyme that a custom-tailored linker enhances heat stability compared to a badly designed linker and even to the flexible linker.<br />
<br />
=''In silico'' refinement=<br />
<br />
As some of the interesting patterns could not be found often enough to be statistically significant, we decided to make a further refinement ''in silico'' by modeling the structure of proteins with circularizing linkers. To do this for realistic situations, we selected from the RCSB database structures of non-homologous target proteins whose extremities are far enough apart to require a linker for circularization. First, the [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software CRAUT software] generates possible fitting linkers for the various proteins. From these possible linkers, the 100 shortest were taken. Among the 3 angle patterns that generate the same angle, the software provides only one, whereas the linker refinement developed here used all three for comparison. We assume that the linkers connect the ends of the protein without putting tension on it, so that the protein can fold in its natural way.<br />
<br />
After this, the circularized proteins with the specific linkers are modeled using a software called Modeller [[#References|[10]]]. This software is widely used for comparative structure prediction. It is well established in the scientific community and should be well suited to the prediction of loop regions attached to existing structures [[#References|[11]]]. It is freely available for academic use from the [http://salilab.org/modeller/ salilab] webpages. The program is able to predict the 3D structure of a given sequence based on an alignment with a given structure. In our script, Modeller needed to be provided with a sequence with the linker attached and with the PDB file of the protein of interest. The latter file is the only structural information used here. Modeller is recommended when at least 30% sequence similarity exists between the provided structure and the one that the user wishes to model. Here, only the linker is different, so this similarity is around 90% in our case. Modeller first makes an alignment between the provided structure and the sequence of our linked protein, identifying the regions that cannot be found in the structure. Based on that, Modeller generates 4 initial models. One of the strengths of the software is its capacity to further refine only certain parts of the protein. We therefore let Modeller refine the loops, which are defined as any part of the protein that could not be found in the structure file. The ''ab initio'' modeling of our linkers is done by minimizing energy functions with different methods such as conjugate gradients and molecular dynamics. Finally, each modeled structure is provided with energy values, which allow different models of the same structure to be compared. From Modeller we received about 8 different models and chose the one with the best energy scores to proceed with. For these refinement steps, one can choose different levels of optimization. We always favored accuracy over the speed of the program. The result of a prediction for the lysozyme of bacteriophage lambda can be seen in Figure 5.<br />
<br />
{{:Team:Heidelberg/templates/image-quarter|<br />
align=right|<br />
caption=Figure 5) Circular lysozyme|<br />
descr=The structure of circular lysozyme (yellow) was predicted and aligned to the linear structure (purple). The linker calculated by the software was marked green. |<br />
file=circ_lam_lys_nils.png}}<br />
<br />
<br />
Modeller was run by distributing the calculations via the [https://2014.igem.org/Team:Heidelberg/Software/igemathome iGEM@home system]. The modeling of one linker took about 10 hours of calculation time on average, a value that depends strongly on the size of the protein. The best model is then evaluated by another self-written program to analyze the behaviour of the linker patterns in their natural surroundings.<br />
Finally, all the models for the different structures and the different linkers are analyzed for properties such as the length of the helical patterns, the shape of the attachment structures of the linker and the angles produced by the angle patterns [[ Figure helix_winkel_messung.png]]. First, the modeled structure and the natural structure are superimposed to see how large the differences between them are. If the protein has been disturbed too much, the model is discarded. We verify that the alpha helix patterns really generate the expected structure by measuring the distance between the first and the last amino acid of each of them and checking whether it is compatible with an alpha helix. The first and last amino acids of each rod define a vector that is then used to calculate the angles between consecutive helical patterns. The frequency of all the determined lengths and angles is then analyzed further, using a strategy similar to the ArchDB analysis presented above.<br />
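<br />
The length check and the rod vectors described above can be sketched as follows. This is an illustration rather than the code of our evaluation program: the function names, the tolerance of 2 &Aring; and the choice of one 3D coordinate per residue (for example the C-alpha atom) are assumptions, and the coordinates are toy values.<br />

```python
import math


def end_to_end(first, last):
    """Distance (Å) between the first and last residue coordinates of a rod."""
    return math.dist(first, last)


def is_helix_compatible(first, last, expected_length, tolerance=2.0):
    """Check whether a rod's end-to-end distance matches the expected helix length.

    The tolerance (in Å) is a hypothetical value chosen for illustration.
    """
    return abs(end_to_end(first, last) - expected_length) <= tolerance


def rod_vector(first, last):
    """Vector pointing from the first to the last residue of a rod."""
    return tuple(b - a for a, b in zip(first, last))


def angle_between_rods(rod_a, rod_b):
    """Angle in degrees between the vectors of two consecutive rods."""
    u, v = rod_vector(*rod_a), rod_vector(*rod_b)
    dot = sum(a * b for a, b in zip(u, v))
    cos_theta = dot / (math.hypot(*u) * math.hypot(*v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_theta))))


# Toy coordinates: two rods meeting at a right angle.
rod_1 = ((0.0, 0.0, 0.0), (10.0, 0.0, 0.0))
rod_2 = ((10.0, 0.0, 0.0), (10.0, 5.0, 0.0))
print(end_to_end(*rod_1), angle_between_rods(rod_1, rod_2))
```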
<br />
The whole process for the verification of the different linker patterns was set up on the [https://2014.igem.org/Team:Heidelberg/Software/igemathome distributed computing system]. Due to lack of time, only a few results could be analyzed, resulting in distributions for the different helices; see for example Figures 6 and 7.<br />
This led to a refinement of the lengths of the alpha helix blocks presented above to 8.7, 10, 10.8, 15.6, 16.8, 24.8 and 32.3 &Aring;.<br />
<br />
{{:Team:Heidelberg/templates/image-quarter|<br />
align=right|<br />
descr=|<br />
caption=Fig. 7. Length distribution of AEAAAKEAAAK|<br />
file=plot_of_AEAAAKEAAAK.png}}<br />
{{:Team:Heidelberg/templates/image-quarter|<br />
align=right|<br />
descr=|<br />
caption=Fig. 6. Length distribution of AEAAAKA |<br />
file=plot_of_AEAAAKA.png}}<br />
=Conclusion=<br />
The patterns that we identified by analyzing structure databases provide an easy and fast tool to build custom-shaped peptides. The main achievement is the identification of the angle patterns. These are designed as building blocks for enhanced applicability. The shapes were identified from a database of non-homologous proteins, and the patterns were refined until the surrounding amino acids appeared randomly distributed. We can thus conclude that the angle distributions we observed are due not to the surrounding sequences but to the identified patterns. We have not observed any evidence for certain helix patterns being preferred in the database.<br />
From Figures 6 and 7 we learned that the lengths we had assumed for the helices needed to be refined. For example, we had assumed that the AEAAAKA motif spans a distance of 10.5 &Aring;, but observed it to be only 10 &Aring; long. Our [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software CRAUT software] was corrected accordingly.<br />
<br />
=References=<br />
<br />
[1] Vieille, C. & Zeikus, G.J. Hyperthermophilic enzymes: sources, uses, and molecular mechanisms for thermostability. Microbiology and molecular biology reviews : MMBR 65, 1-43 (2001).<br />
<br />
[2] Yu, Y. & Lutz, S. Circular permutation: a different way to engineer enzyme structure and function. Trends Biotechnol. 29, 18-25 (2011).<br />
<br />
[3] Arai, R., Ueda, H., Kitayama, A., Kamiya, N. & Nagamune, T. Design of the linkers which effectively separate domains of a bifunctional fusion protein. Protein Eng. 14, 529-532 (2001).<br />
<br />
[4] Wang, C.K.L., Kaas, Q., Chiche, L. & Craik, D.J. CyBase: A database of cyclic protein sequences and structures, with applications in protein discovery and engineering. Nucleic Acids Research 36, (2008).<br />
<br />
[5] Efimov, A. V. Standard structures in proteins. Prog. Biophys. Mol. Biol. 60, 201-239 (1993).<br />
<br />
[6] Donate, L. E., Rufino, S. D., Canard, L. H. & Blundell, T. L. Conformational analysis and clustering of short and medium size loops connecting regular secondary structures: a database for modeling and prediction. Protein Sci. 5, 2600-2616 (1996).<br />
<br />
[7] Bonet, J. et al. ArchDB 2014: structural classification of loops in proteins. Nucleic Acids Res. 42, D315-D319 (2014).<br />
<br />
[8] George, R. A. & Heringa, J. An analysis of protein domain linkers: their classification and role in protein folding. Protein Eng. 15, 871-879 (2002).<br />
<br />
[9] Chen, X., Zaro, J. L. & Shen, W.-C. Fusion protein linkers: property, design and functionality. Adv. Drug Deliv. Rev. 65, 1357-1369 (2013).<br />
<br />
[10] Fiser, A. et al. Modeling of loops in protein structures. Protein science : a publication of the Protein Society 9, 1753-73 (2000).<br />
<br />
[11] Fiser, A. & Sali, A. ModLoop: Automated modeling of loops in protein structures. Bioinformatics 19, 2500-2501 (2003).</div>Igemnilshttp://2014.igem.org/Team:Heidelberg/pages/Linker_ModelingTeam:Heidelberg/pages/Linker Modeling2014-10-17T21:12:00Z<p>Igemnils: </p>
<hr />
<div>=Background=<br />
Primary, secondary, tertiary and quaternary structures are the main levels of protein structure characterization. Primary structure designates the amino acid sequence, while the secondary structure describes the arrangement of consecutive amino acids through their two dihedral angles $\phi$ and $\psi$. The Ramachandran plot, which represents the amino acid position in the space of those two angles, shows two particular arrangements commonly found in proteins: alpha helices and beta sheets. The next level of protein organization is the tertiary structure, which describes how the protein is organized in the three spatial dimensions, whereas the quaternary structure describes how different subunits of proteins cluster.<br />
Finally, closely related to these standard structures, the supersecondary structure describes how secondary structure elements are connected to each other. While these connections look undefined at first sight, further analysis revealed that this wide variety of supersecondary structure motifs can be clustered to certain patterns [[#References|[5]]].<br />
<br />
==Supersecondary structure==<br />
<br />
When the properties of supersecondary structures were first described, only very few patterns were identified, mainly due to the lack of highly resolved protein structures. At that time, the structures were mainly classified by the Ramachandran plot regions where the amino acids could be found [[#References|[6]]]. With the growing number of known crystal structures, the analysis of supersecondary structure improved and led to databases with about 150 000 classified loop structures and elaborate clustering [[#References|[7]]]. Nowadays, supersecondary structures are defined as the structures built when two secondary structure elements are joined by a short peptide that does not itself belong to one of the secondary structures. These loop peptides range from 1 to 9 amino acids.<br />
Our aim was to build reliable stable linkers out of alpha helices connected by supersecondary structure motifs that produce certain angles. To achieve that, we searched for the most reliable alpha helix patterns that would form rigid rods and angle patterns covering the whole range of angles from 0 to 180 degrees.<br />
<br />
=Linker building block design=<br />
<br />
===Helix patterns===<br />
Various patterns have been used to build helical linkers to connect protein ends [[#References | [8]]]. Moreover, in known protein structures, linkers between subdomains can be identified and their properties have already been analyzed [[#References | [9]]]. Two main criteria were used to build the alpha helix patterns: they should robustly form alpha helices, and they should be soluble in aqueous solution. Therefore, we could not simply use linkers built of alanine alone. Instead, we decided to add some charged amino acids to the pattern, and to position them physically close to each other so that they could stabilize each other by Coulomb interaction. These amino acids needed to be separated by 3 amino acids, as a helical turn takes about 3.6 amino acids. The pattern we chose as most suitable for our purpose was also described to be one of the most stable [[#References | [3]]].<br />
Eight alpha helix building blocks were eventually chosen: AEAAAK, AEAAAKA, AEAAAKAA, AEAAAKEAAAK, AEAAAKEAAAKA, AEAAAKEAAAKEAAAKA, AEAAAKEAAAKEAAAKEAAAKA, AEAAAKEAAAKEAAAKEAAAKEAAAKA, with respective estimated lengths of 9, 10.5, 12, 16.5, 18, 25.5 and 33&Aring;.<br />
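The estimated lengths listed above are consistent with a rise of about 1.5 &Aring; per residue, a commonly quoted value for an ideal alpha helix. The following Python sketch assumes that rule; the constant RISE_PER_RESIDUE and the function estimated_length are illustrative names and are not taken from the CRAUT code. Since the text lists seven lengths for eight blocks, the value printed for the longest block is only an extrapolation of the same rule, not a value from the original design.<br />

```python
# Assumption: each residue advances an ideal alpha helix by about 1.5 Angstrom.
RISE_PER_RESIDUE = 1.5

# The eight helix building blocks named in the text.
HELIX_BLOCKS = [
    "AEAAAK",
    "AEAAAKA",
    "AEAAAKAA",
    "AEAAAKEAAAK",
    "AEAAAKEAAAKA",
    "AEAAAKEAAAKEAAAKA",
    "AEAAAKEAAAKEAAAKEAAAKA",
    "AEAAAKEAAAKEAAAKEAAAKEAAAKA",
]


def estimated_length(block, rise=RISE_PER_RESIDUE):
    """Return the estimated length of a helical block in Angstrom."""
    return len(block) * rise


if __name__ == "__main__":
    for block in HELIX_BLOCKS:
        print(f"{block:<28} {len(block):>2} residues  ~{estimated_length(block):.1f} A")
```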
<br />
===Angle patterns===<br />
<br />
The angle patterns for our model were obtained from the ArchDB database [[#References | [7]]], which classifies loops from known protein structures. About 17 000 non-homologous proteins from the PDB database were analyzed and from them, 150 000 loop structures, i.e. regions connecting two secondary structure elements, were identified. The classification took into account not only the length of the loop and its conformation, meaning the φ and ψ backbone dihedral angles of the residues in the loop, but also the distance between the attachments of the loop to the surrounding secondary structures. Furthermore, the secondary structures surrounding the loop and the geometry defined by the supersecondary structure motifs can be found in the database.<br />
<br />
To extract the relevant supersecondary structure motifs for our linker design from ArchDB, the complete database was downloaded and helix-loop-helix motifs were extracted using a [https://github.com/igemsoftware/Heidelberg_2014 self-written script] in the Python programming language. From them we only took into account loops composed of 1 and 2 amino acids, because the longer the loops, the less frequent and therefore the less reliable they are, and the further the ends are from each other.<br />
The interesting information for us was (1) the angle produced between the vectors defining the surrounding alpha helices, (2) the distance between the ends of the loop, and (3) the type of amino acids surrounding the loop. Furthermore we analyzed the statistical significance of the conformation. For each amino acid combination in the loop region, the angle distribution between the embracing alpha helices, the loop length distribution and a 2D heatmap of the surrounding amino acids were plotted automatically (Figures 1-4).<br />
<br />
These distributions were then visually analyzed to identify loops of interest for linker design. We focused on loops that showed a narrow angle distribution and that appeared frequently in the database, and we extended the distribution analysis to the surrounding amino acids. For example, once we had identified that the amino acid T in a turn produces an interesting distribution (Figure 1), we would extend it with the amino acid K in front (Figure 2), with the amino acid A at the end (Figure 3), and with both (Figure 4), and investigate how the distributions behave. By restricting the possibilities, the number of occurrences goes down tremendously, but the properties become more interesting.<br />
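The stepwise restriction described above (loop residue alone, then with a fixed residue before, after, or both) can be sketched in Python as follows. The record layout, the example angles and the function names select and summarize are all hypothetical and only illustrate the idea; they are not taken from ArchDB or from our extraction script.<br />

```python
from statistics import mean, pstdev

# Hypothetical records: (residue before loop, loop sequence, residue after loop, angle in degrees).
LOOPS = [
    ("K", "T", "A", 37.0),
    ("K", "T", "A", 41.0),
    ("K", "T", "G", 52.0),
    ("E", "T", "A", 30.0),
    ("E", "T", "L", 95.0),
    ("K", "G", "A", 120.0),
]


def select(loops, loop_seq, before=None, after=None):
    """Return the angles of loops matching the loop sequence and the optional flanking residues."""
    return [
        angle
        for b, seq, a, angle in loops
        if seq == loop_seq
        and (before is None or b == before)
        and (after is None or a == after)
    ]


def summarize(angles):
    """Return (count, mean, population standard deviation) for a list of angles."""
    return len(angles), mean(angles), pstdev(angles)


if __name__ == "__main__":
    cases = [
        ("T", {}),
        ("K-T", {"before": "K"}),
        ("T-A", {"after": "A"}),
        ("K-T-A", {"before": "K", "after": "A"}),
    ]
    for label, flanks in cases:
        n, mu, sd = summarize(select(LOOPS, "T", **flanks))
        print(f"{label:<6} n={n}  mean={mu:.1f}  sd={sd:.1f}")
```

Each added constraint selects a subset of the previous records, so the count can only drop while the spread may narrow, which is the trade-off discussed in the text.<br />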
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of__T_.png|<br />
descr= Figure 1) A loop composed of the amino acid T produces an interesting, well-defined angle distribution (left panel). The middle panel represents the distribution of lengths found with this loop. The right panel represents the distribution of amino acids found before and after the amino acid T in all the corresponding loops. One can see that the amino acid K is the most frequent before and A the most frequent after.}} <br />
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of_K_T_.png|<br />
descr= Figure 2) These distributions are subsets of the ones of Figure 1, with only the loops composed of the amino acid T preceded by the amino acid K. This constraint considerably narrows down the angle and length distributions.}} <br />
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of__T_A.png|<br />
descr= Figure 3) Same procedure as in Figure 2, with the loops composed of the amino acid T followed by the amino acid A. This also narrows down the angle and length distributions compared to Figure 1, although not as much as in Figure 2.}} <br />
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of_K_T_A.png|<br />
descr= Figure 4) The distributions correspond to loops composed of the amino acids KTA and therefore to subsets of all the preceding ones. One can notice that the angle distribution becomes very narrow, but also that the frequency is reduced compared to the preceding ones.}}<br />
<br />
This step had two main goals: narrowing down the angle distribution and finding loops with no preference for the amino acids surrounding them. This last point was important for the modularity of our approach: the angle blocks should not be affected by the surrounding alpha helices. Using this approach, 10 different angle motifs (table 1) could be identified producing different angles. Importantly, all these motifs were chosen so that the length of the two segments starting from the turning point was the same for all of them (put figure turning point). It should also be noted that 3 of the patterns, KTA, LVA and AAIAP, produce the same angles. They were nevertheless all kept for the ''in silico'' refinement described below, but only one was used for the [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software CRAUT software].<br />
<br />
<br />
{| class="table table-hover" style="text-align: center;"<br />
|+'''table 1''': The span of parameters.<br />
!colspan="10"|Angle Patterns<br />
|-<br />
|Pattern <br />
| NVL <br />
| KTA <br />
| LVA <br />
| AAIAP <br />
| AADGTL <br />
| VNLTA<br />
| AAAHPEA<br />
| ASLPAA <br />
| ATGDLA<br />
|-<br />
|Mean <br />
| 29.7 <br />
| 38.7 <br />
| 35 <br />
| 36.5 <br />
| 60 <br />
| 74.5 <br />
| 117 <br />
| 140 <br />
| 160 <br />
|-<br />
| Variation <br />
| 8.5 <br />
| 30 <br />
| 29 <br />
| 27 <br />
| 12 <br />
| 27 <br />
| 12 <br />
| 15 <br />
| 5 <br />
|}<br />
<br />
===Sequences to connect the alpha helix to the protein extremity===<br />
<br />
Helical patterns often affect the folding of the attached sequences. To prevent them from affecting the structure of the protein of interest, we analyzed the effects of various amino acids on helix formation ''in silico'' using online tools like [http://bioserv.rpbs.univ-paris-diderot.fr/services/PEP-FOLD/ pep-fold]. We identified glycines and prolines as reliable amino acids to interrupt helix formation. We then decided to use glycine pairs to connect the protein of interest to the linker, because they give more flexibility to the orientation of the initial helix.<br />
This only concerns the protein end that was connected to the linker through the coding sequence. The other end is ligated through exteins or a sortase scar, both treated as unstructured flexible regions.<br />
<br />
===''Conclusion''===<br />
<br />
The last three parts show how we could design alpha helix and angle pattern blocks and connect them to each other and to the protein of interest. They provide the material that allowed us to transform a linker defined as a geometrical path into a real amino acid sequence in our [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software software].<br />
<br />
Thanks to this we could design linkers to circularize [https://2014.igem.org/Team:Heidelberg/Project/PCR_2.0 DNMT1] and [https://2014.igem.org/Team:Heidelberg/Project/Linker_Screening lysozyme]. Additionally, it was shown for lysozyme that a custom-tailored linker enhances heat stability compared to a badly designed linker and even to the flexible linker.<br />
<br />
=''In silico'' refinement=<br />
<br />
As some of the interesting patterns could not be found often enough to be statistically significant, we decided to make a further refinement ''in silico'' by modeling the structure of proteins with circularizing linkers. To perform this for realistic situations, we selected, from the RCSB database, structures of non-homologous target proteins with extremities that are separated enough to require a linker for circularization. First, the [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software CRAUT software] generates possible fitting linkers for various proteins. From these possible linkers, the 100 shortest were taken. Among the 3 possible angle patterns that generate the same angle, the software provides only one. But the linker refinement developed here used the three of them for comparison. We assume that the linkers connect the ends of the protein without setting tension on the protein, so that the protein can fold in its natural way.<br />
<br />
After this, the circularized proteins with the specific linkers are modelled using the software Modeller [[#References|[10]]]. This software is widely used for comparative structure prediction. It is well established in the scientific community and should be well suited for predicting loop regions attached to existing structures [[#References|[11]]]. It is freely available for academic use from the [http://salilab.org/modeller/ salilab] webpages. The program is able to predict the 3D structure of a given sequence based on an alignment with a given structure. In our script, Modeller needed to be provided with a sequence with the linker attached and with the PDB file of the protein of interest. This latter file is the only structural information used here. Modeller is recommended when at least 30% sequence similarity exists between the provided structure and the one that the user wishes to model. Here, only the linker is different, so this similarity is around 90% in our case. At first, Modeller makes an alignment between the provided structure and the sequence of our linked protein, identifying the regions that cannot be found in the structure. Based on that, Modeller generates 4 initial models. One of the strengths of the software is its capacity to further refine only certain parts of the protein. Thus we let Modeller refine the loops, which are defined as any part of the protein that could not be found in the structure file. The ''ab initio'' modelling of our linkers is done by minimizing energy functions with different methods such as conjugate gradients and molecular dynamics. Eventually, each modeled structure is provided with energy values, thanks to which different models of the same structure can be compared. From Modeller we received about 8 different models and chose the one with the best energy scores to proceed further. For these refinement steps, one can choose different levels of optimization. 
We always favored accuracy over the speed of the program. The result of a prediction for lysozyme of bacteriophage lambda can be seen in figure 5.<br />
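The final selection step, picking one model out of the candidates by its energy score, can be sketched in plain Python. The file names and scores below are invented for illustration, and the sketch assumes that a lower energy score indicates a better model; it does not call Modeller itself.<br />

```python
# Hypothetical (model file, energy score) pairs; none of these values come from a real run.
# Assumption: lower energy scores indicate better models.
CANDIDATES = [
    ("model_1.pdb", -10234.5),
    ("model_2.pdb", -10511.2),
    ("model_3.pdb", -9876.0),
    ("model_4.pdb", -10498.7),
]


def best_model(candidates):
    """Return the (name, score) pair with the lowest energy score."""
    if not candidates:
        raise ValueError("no models to choose from")
    return min(candidates, key=lambda item: item[1])


if __name__ == "__main__":
    name, score = best_model(CANDIDATES)
    print(f"selected {name} with score {score:.1f}")
```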
<br />
{{:Team:Heidelberg/templates/image-quarter|<br />
align=right|<br />
caption=Figure 5) Circular lysozyme|<br />
descr=The structure of circular lysozyme (yellow) was predicted and aligned to the linear structure (purple). The linker calculated by the software was marked green. |<br />
file=circ_lam_lys_nils.png}}<br />
<br />
<br />
Modeller was run by distributing the calculations via the [https://2014.igem.org/Team:Heidelberg/Software/igemathome iGEM@home system]. The modelling of one linker took about 10 hours of calculation time on average, a value that depends strongly on the size of the protein. Then the best model is evaluated by another self-written program to analyze the behaviour of the linker patterns in their natural surroundings.<br />
Finally, all the models for the different structures and the different linkers are analyzed for their properties, such as the length of the helical patterns, the shape of the attachment structures of the linker and the angles produced by the angle patterns [[ Figure helix_winkel_messung.png]]. First, the modeled structure and the natural structure are fitted to see how large the differences between them are. If the protein has been disturbed too much, the model is discarded. We verify that the alpha helix patterns really generate the expected structure by measuring the distance between the first and the last amino acids of each of them and checking whether it is compatible with an alpha helix. The first and last amino acids of each rod define a vector that is then used to calculate the angles between consecutive helical patterns. The frequency of all the determined lengths and angles is then further analyzed, using a strategy similar to the ArchDB analysis presented above.<br />
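The geometric checks described above can be sketched with a few lines of Python. The coordinates, the 20% tolerance and the function names are assumptions made for illustration; in particular, the angle is taken here simply as the angle between the two end-to-end vectors, which may differ from the convention used in our evaluation program.<br />

```python
import math


def vector(start, end):
    """Return the vector pointing from start to end."""
    return tuple(e - s for s, e in zip(start, end))


def norm(v):
    """Return the Euclidean length of a vector."""
    return math.sqrt(sum(c * c for c in v))


def angle_deg(u, v):
    """Return the angle between two vectors in degrees."""
    cos = sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))
    # Clamp to [-1, 1] to guard against rounding errors before acos.
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))


def is_helix_like(first, last, expected_length, tolerance=0.2):
    """Check that the end-to-end distance lies within a relative tolerance of the expected length."""
    distance = norm(vector(first, last))
    return abs(distance - expected_length) <= tolerance * expected_length


if __name__ == "__main__":
    # Hypothetical coordinates (in Angstrom) of the first and last residue of two consecutive rods.
    rod1 = ((0.0, 0.0, 0.0), (10.0, 0.0, 0.0))
    rod2 = ((11.0, 1.0, 0.0), (11.0, 11.0, 0.0))
    print(is_helix_like(*rod1, expected_length=10.5))
    print(angle_deg(vector(*rod1), vector(*rod2)))
```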
<br />
The whole process for the verification of the different linker patterns was set up on the [https://2014.igem.org/Team:Heidelberg/Software/igemathome distributed computing system]. Due to lack of time, only a few results could be analyzed, resulting in distributions for the different helices; see for example figures 6 and 7.<br />
This led to a refinement of the lengths of the 8 different alpha helix blocks presented above to 8.7, 10, 10.8, 15.6, 16.8, 24.8, 32.3 &Aring;.<br />
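To make the size of the correction concrete, the short Python sketch below compares the initially estimated lengths with the refined ones. It assumes that the two lists correspond to each other in the order given; the function name relative_change is illustrative.<br />

```python
# Lengths in Angstrom, assumed to be paired in the order in which the text lists them.
ASSUMED = [9.0, 10.5, 12.0, 16.5, 18.0, 25.5, 33.0]
REFINED = [8.7, 10.0, 10.8, 15.6, 16.8, 24.8, 32.3]


def relative_change(assumed, refined):
    """Return the change from the assumed to the refined length in percent."""
    return 100.0 * (refined - assumed) / assumed


if __name__ == "__main__":
    for a, r in zip(ASSUMED, REFINED):
        print(f"assumed {a:5.1f} A  refined {r:5.1f} A  change {relative_change(a, r):+.1f} %")
```

Under this pairing, every refined length is shorter than the initial estimate.<br />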
<br />
{{:Team:Heidelberg/templates/image-quarter|<br />
align=right|<br />
descr=|<br />
caption=Fig. 7. Length distribution of AEAAAKEAAAK|<br />
file=plot_of_AEAAAKEAAAK.png}}<br />
{{:Team:Heidelberg/templates/image-quarter|<br />
align=right|<br />
descr=|<br />
caption=Fig. 6. Length distribution of AEAAAKA |<br />
file=plot_of_AEAAAKA.png}}<br />
=Conclusion=<br />
The patterns that we identified by analyzing structure databases provide an easy and fast tool to build custom-shaped peptides. The main achievement is the identification of the angle patterns. These are designed as building blocks for enhanced applicability. The shapes were identified from a database of non-homologous proteins and the patterns were refined until the distribution of surrounding amino acids looked random. Thus we can conclude that the angle distributions we observed are not due to the surrounding sequences, but to the identified patterns. We have not observed any evidence for certain helix patterns being preferred in the database.<br />
From figures 6 and 7 we have learned that the lengths we had assumed for the helices needed to be refined. For example, we had assumed the AEAAAKA motif to span a distance of 10.5 &Aring; but observed it to be only 10 &Aring; long. Our [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software CRAUT software] was corrected accordingly.<br />
<br />
=References=<br />
<br />
[1] Vieille, C. & Zeikus, G.J. Hyperthermophilic enzymes: sources, uses, and molecular mechanisms for thermostability. Microbiology and molecular biology reviews : MMBR 65, 1-43 (2001).<br />
<br />
[2] Yu, Y. & Lutz, S. Circular permutation: a different way to engineer enzyme structure and function. Trends Biotechnol. 29, 18-25 (2011).<br />
<br />
[3] Arai, R., Ueda, H., Kitayama, A., Kamiya, N. & Nagamune, T. Design of the linkers which effectively separate domains of a bifunctional fusion protein. Protein Eng. 14, 529-532 (2001).<br />
<br />
[4] Wang, C.K.L., Kaas, Q., Chiche, L. & Craik, D.J. CyBase: A database of cyclic protein sequences and structures, with applications in protein discovery and engineering. Nucleic Acids Research 36, (2008).<br />
<br />
[5] Efimov, A. V. Standard structures in proteins. Prog. Biophys. Mol. Biol. 60, 201-239 (1993).<br />
<br />
[6] Donate, L. E., Rufino, S. D., Canard, L. H. & Blundell, T. L. Conformational analysis and clustering of short and medium size loops connecting regular secondary structures: a database for modeling and prediction. Protein Sci. 5, 2600-2616 (1996).<br />
<br />
[7] Bonet, J. et al. ArchDB 2014: structural classification of loops in proteins. Nucleic Acids Res. 42, D315-D319 (2014).<br />
<br />
[8] George, R. A. & Heringa, J. An analysis of protein domain linkers: their classification and role in protein folding. Protein Eng. 15, 871-879 (2002).<br />
<br />
[9] Chen, X., Zaro, J. L. & Shen, W.-C. Fusion protein linkers: property, design and functionality. Adv. Drug Deliv. Rev. 65, 1357-1369 (2013).<br />
<br />
[10] Fiser, A. et al. Modeling of loops in protein structures. Protein science : a publication of the Protein Society 9, 1753-73 (2000).<br />
<br />
[11] Fiser, A. & Sali, A. ModLoop: Automated modeling of loops in protein structures. Bioinformatics 19, 2500-2501 (2003).</div>Igemnilshttp://2014.igem.org/Team:Heidelberg/pages/Linker_ModelingTeam:Heidelberg/pages/Linker Modeling2014-10-17T21:10:42Z<p>Igemnils: /* linker building block design */</p>
<hr />
<div>=Background=<br />
Primary, secondary, tertiary and quaternary structures are the main levels of protein structure characterization. Primary structure designates the amino acid sequence, while the secondary structure describes the arrangement of consecutive amino acids through their two dihedral angles $\phi$ and $\psi$. The Ramachandran plot, which represents the amino acid position in the space of those two angles, shows two particular arrangements commonly found in proteins: alpha helices and beta sheets. The next level of protein organization is the tertiary structure, which describes how the protein is organized in the three spatial dimensions, whereas the quaternary structure describes how different subunits of proteins cluster.<br />
Finally, closely related to these standard structures, the supersecondary structure describes how secondary structure elements are connected to each other. While these connections look undefined at first sight, further analysis revealed that this wide variety of supersecondary structure motifs can be clustered to certain patterns [[#References|[5]]].<br />
<br />
==Supersecondary structure==<br />
<br />
When the properties of supersecondary structures were first described, only very few patterns were identified, mainly due to the lack of highly resolved protein structures. At that time, the structures were mainly classified by the Ramachandran plot regions where the amino acids could be found [[#References|[6]]]. With the growing number of known crystal structures, the analysis of supersecondary structure improved and led to databases with about 150 000 classified loop structures and elaborate clustering [[#References|[7]]]. Nowadays, supersecondary structures are defined as the structures built when two secondary structure elements are joined by a short peptide that does not itself belong to one of the secondary structures. These loop peptides range from 1 to 9 amino acids.<br />
Our aim was to build reliable stable linkers out of alpha helices connected by supersecondary structure motifs that produce certain angles. To achieve that, we searched for the most reliable alpha helix patterns that would form rigid rods and angle patterns covering the whole range of angles from 0 to 180 degrees.<br />
<br />
=Linker building block design=<br />
<br />
===Helix patterns===<br />
Various patterns have been used to build helical linkers to connect protein ends [[#References | [8]]]. Moreover, in known protein structures, linkers between subdomains can be identified and their properties have already been analyzed [[#References | [9]]]. Two main criteria were used to build the alpha helix patterns: they should robustly form alpha helices, and they should be soluble in aqueous solution. Therefore, we could not simply use linkers built of alanine alone. Instead, we decided to add some charged amino acids to the pattern, and to position them physically close to each other so that they could stabilize each other by Coulomb interaction. These amino acids needed to be separated by 3 amino acids, as a helical turn takes about 3.6 amino acids. The pattern we chose as most suitable for our purpose was also described to be one of the most stable [[#References | [3]]].<br />
Eight alpha helix building blocks were eventually chosen: AEAAAK, AEAAAKA, AEAAAKAA, AEAAAKEAAAK, AEAAAKEAAAKA, AEAAAKEAAAKEAAAKA, AEAAAKEAAAKEAAAKEAAAKA, AEAAAKEAAAKEAAAKEAAAKEAAAKA, with respective estimated lengths of 9, 10.5, 12, 16.5, 18, 25.5 and 33&Aring;.<br />
<br />
===Angle patterns===<br />
<br />
The angle patterns for our model were obtained from the ArchDB database [[#References | [7]]], which classifies loops from known protein structures. About 17 000 non-homologous proteins from the PDB database were analyzed and from them, 150 000 loop structures, i.e. regions connecting two secondary structure elements, were identified. The classification took into account not only the length of the loop and its conformation, meaning the φ and ψ backbone dihedral angles of the residues in the loop, but also the distance between the attachments of the loop to the surrounding secondary structures. Furthermore, the secondary structures surrounding the loop and the geometry defined by the supersecondary structure motifs can be found in the database.<br />
<br />
To extract the relevant supersecondary structure motifs for our linker design from ArchDB, the complete database was downloaded and helix-loop-helix motifs were extracted using a [https://github.com/igemsoftware/Heidelberg_2014 self-written script] in the Python programming language. From them we only took into account loops composed of 1 and 2 amino acids, because the longer the loops, the less frequent and therefore the less reliable they are, and the further the ends are from each other.<br />
The interesting information for us was (1) the angle produced between the vectors defining the surrounding alpha helices, (2) the distance between the ends of the loop, and (3) the type of amino acids surrounding the loop. Furthermore we analyzed the statistical significance of the conformation. For each amino acid combination in the loop region, the angle distribution between the embracing alpha helices, the loop length distribution and a 2D heatmap of the surrounding amino acids were plotted automatically (Figures 1-4).<br />
<br />
These distributions were then visually analyzed to identify loops of interest for linker design. We focused on loops that showed a narrow angle distribution and that appeared frequently in the database, and we extended the distribution analysis to the surrounding amino acids. For example, once we had identified that the amino acid T in a turn produces an interesting distribution (Figure 1), we would extend it with the amino acid K in front (Figure 2), with the amino acid A at the end (Figure 3), and with both (Figure 4), and investigate how the distributions behave. By restricting the possibilities, the number of occurrences goes down tremendously, but the properties become more interesting.<br />
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of__T_.png|<br />
descr= Figure 1) A loop composed of the amino acid T produces an interesting, well-defined angle distribution (left panel). The middle panel represents the distribution of lengths found with this loop. The right panel represents the distribution of amino acids found before and after the amino acid T in all the corresponding loops. One can see that the amino acid K is the most frequent before and A the most frequent after.}} <br />
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of_K_T_.png|<br />
descr= Figure 2) These distributions are subsets of the ones of Figure 1, with only the loops composed of the amino acid T preceded by the amino acid K. This constraint considerably narrows down the angle and length distributions.}} <br />
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of__T_A.png|<br />
descr= Figure 3) Same procedure as in Figure 2, with the loops composed of the amino acid T followed by the amino acid A. This also narrows down the angle and length distributions compared to Figure 1, although not as much as in Figure 2.}} <br />
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of_K_T_A.png|<br />
descr= Figure 4) The distributions correspond to loops composed of the amino acids KTA and therefore to subsets of all the preceding ones. One can notice that the angle distribution becomes very narrow, but also that the frequency is reduced compared to the preceding ones.}}<br />
<br />
This step had two main goals: narrowing down the angle distribution and finding loops with no preference for the amino acids surrounding them. This last point was important for the modularity of our approach: the angle blocks should not be affected by the surrounding alpha helices. Using this approach, 10 different angle motifs (table 1) could be identified producing different angles. Importantly, all these motifs were chosen so that the length of the two segments starting from the turning point was the same for all of them (put figure turning point). It should also be noted that 3 of the patterns, KTA, LVA and AAIAP, produce the same angles. They were nevertheless all kept for the ''in silico'' refinement described below, but only one was used for the [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software CRAUT software].<br />
<br />
<br />
{| class="table table-hover" style="text-align: center;"<br />
|+'''table 1''': The span of parameters.<br />
!colspan="10"|Angle Patterns<br />
|-<br />
|Pattern <br />
| NVL <br />
| KTA <br />
| LVA <br />
| AAIAP <br />
| AADGTL <br />
| VNLTA<br />
| AAAHPEA<br />
| ASLPAA <br />
| ATGDLA<br />
|-<br />
|Mean <br />
| 29.7 <br />
| 38.7 <br />
| 35 <br />
| 36.5 <br />
| 60 <br />
| 74.5 <br />
| 117 <br />
| 140 <br />
| 160 <br />
|-<br />
| Variation <br />
| 8.5 <br />
| 30 <br />
| 29 <br />
| 27 <br />
| 12 <br />
| 27 <br />
| 12 <br />
| 15 <br />
| 5 <br />
|}<br />
<br />
===Sequences to connect the alpha helix to the protein extremity===<br />
<br />
Helical patterns often affect the folding of the attached sequences. To prevent them from affecting the structure of the protein of interest, we analyzed the effects of various amino acids on helix formation ''in silico'' using online tools like [http://bioserv.rpbs.univ-paris-diderot.fr/services/PEP-FOLD/ pep-fold]. We identified glycines and prolines as reliable amino acids to interrupt helix formation. We then decided to use glycine pairs to connect the protein of interest to the linker, because they give more flexibility to the orientation of the initial helix.<br />
This only concerns the protein end that was connected to the linker through the coding sequence. The other end is ligated through exteins or a sortase scar, both treated as unstructured flexible regions.<br />
<br />
===''Conclusion''===<br />
<br />
The last three parts show how we could design alpha helix and angle pattern blocks and connect them to each other and to the protein of interest. They provide the material that allowed us to transform a linker defined as a geometrical path into a real amino acid sequence in our [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software software].<br />
<br />
Thanks to this, we could design linkers to circularize [https://2014.igem.org/Team:Heidelberg/Project/PCR_2.0 DNMT1] and [https://2014.igem.org/Team:Heidelberg/Project/Linker_Screening lysozyme]. Additionally, it was shown for lysozyme that a custom-tailored linker enhances heat stability compared to a badly designed linker and even to a flexible linker.<br />
<br />
=''In silico'' refinement=<br />
<br />
As some of the interesting patterns could not be found often enough to be statistically significant, we decided to make a further refinement ''in silico'' by modeling the structure of proteins with circularizing linkers. To do this for realistic situations, we selected, from the RCSB database, structures of non-homologous target proteins whose extremities are far enough apart to require a linker for circularization. First, the [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software CRAUT software] generates possible fitting linkers for the various proteins. From these possible linkers, the 100 shortest were taken. Among the 3 angle patterns that generate nearly the same angle, the software provides only one, but the linker refinement developed here used all three for comparison. We assume that the linkers connect the ends of the protein without putting tension on it, so that the protein can fold in its natural way.<br />
<br />
After this, the circularized proteins with the specific linkers are modelled using the software Modeller [[#References|[10]]]. This software is widely used for comparative structure prediction. It is well established in the scientific community and should be well suited for the prediction of loop regions attached to existing structures [[#References|[11]]]. It is freely available for academic use from the [http://salilab.org/modeller/ salilab] webpages. The program predicts the 3D structure of a given sequence based on an alignment with a given structure. In our script, Modeller needed to be provided with the sequence with the linker attached and with the PDB file of the protein of interest. This latter file is the only structural information used here. Modeller is recommended when at least 30% sequence similarity exists between the provided structure and the one that the user wishes to model. Here, only the linker is different, so this similarity is around 90% in our case. First, Modeller makes an alignment between the provided structure and the sequence of our linked protein, identifying the regions that cannot be found in the structure. Based on that, Modeller generates 4 initial models. One of the strengths of the software is its capacity to further refine only certain parts of the protein. Thus we let Modeller refine the loops, which are defined as any part of the protein that could not be found in the structure file. The ''ab initio'' modelling of our linkers is done by minimizing energy functions with different methods such as conjugate gradients and molecular dynamics. Eventually, each modelled structure is provided with energy values, thanks to which different models of the same structure can be compared. From Modeller we received about 8 different models and chose the one with the best energy scores to proceed further. For these refinement steps, one can choose different levels of optimization. 
We always favored accuracy over speed. The result of a prediction for the lysozyme of bacteriophage lambda can be seen in figure 5.<br />
<br />
{{:Team:Heidelberg/templates/image-quarter|<br />
align=right|<br />
caption=Figure 5) Circular lysozyme|<br />
descr=The structure of circular lysozyme (yellow) was predicted and aligned to the linear structure (purple). The linker calculated by the software was marked green. |<br />
file=circ_lam_lys_nils.png}}<br />
<br />
<br />
Modeller was run by distributing the calculations via the [https://2014.igem.org/Team:Heidelberg/Software/igemathome iGEM@home] system. The modelling of one linker took about 10 hours of calculation time on average, a value that depends strongly on the size of the protein. The best model is then evaluated by another self-written program to analyze the behaviour of the linker patterns in their natural surroundings.<br />
Finally, all the models for the different structures and the different linkers are analyzed for properties such as the length of the helical patterns, the shape of the attachment structures of the linker and the angles produced by the angle patterns [[ Figure helix_winkel_messung.png]]. First, the modelled structure and the natural structure are superimposed to see how large the differences between them are. If the protein has been disturbed too much, the model is discarded. We verify that the alpha helix patterns really form helices by measuring the distance between the first and the last amino acid of each of them and checking whether it is compatible with an alpha helix. The first and last amino acids of each rod define a vector that is then used to calculate the angles between consecutive helical patterns. The frequency of all the determined lengths and angles is then further analyzed, using a strategy similar to the ArchDB analysis presented above.<br />
<br />
The whole process for the verification of the different linker patterns was set up on the [https://2014.igem.org/Team:Heidelberg/Software/igemathome distributed computing system]. Due to lack of time, only a few results could be analyzed, resulting in distributions for the different helices; see for example figures 6 and 7.<br />
This led to a refinement of the lengths of the 8 different alpha helix blocks presented above to 8.7, 10, 10.8, 15.6, 16.8, 24.8 and 32.3 &Aring;.<br />
<br />
{{:Team:Heidelberg/templates/image-quarter|<br />
align=right|<br />
descr=|<br />
caption=Fig. 7. Length distribution of AEAAAKEAAAK|<br />
file=plot_of_AEAAAKEAAAK.png}}<br />
{{:Team:Heidelberg/templates/image-quarter|<br />
align=right|<br />
descr=|<br />
caption=Fig. 6. Length distribution of AEAAAKA |<br />
file=plot_of_AEAAAKA.png}}<br />
<br />
=Conclusion=<br />
The patterns that we identified by analyzing structure databases provide an easy and fast tool to build custom-shaped peptides. The main achievement is the identification of the angle patterns. These are designed as building blocks for enhanced applicability. The shapes were identified from a database of non-homologous proteins and the patterns were refined until the distribution of surrounding amino acids looked random. Thus we can conclude that the angle distributions we observed are due to the identified patterns and not to the surrounding sequences. We have not observed any evidence for certain helix patterns being preferred in the database.<br />
From figures 6 and 7 we learned that the lengths we had assumed for the helices needed to be refined. For example, we had assumed the AEAAAKA motif to span a distance of 10.5 &Aring; but observed it to be only 10 &Aring; long. Our [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software CRAUT software] was corrected accordingly.<br />
<br />
=References=<br />
<br />
[1] Vieille, C. & Zeikus, G.J. Hyperthermophilic enzymes: sources, uses, and molecular mechanisms for thermostability. Microbiology and molecular biology reviews : MMBR 65, 1-43 (2001).<br />
<br />
[2] Yu, Y. & Lutz, S. Circular permutation: a different way to engineer enzyme structure and function. Trends Biotechnol. 29, 18-25 (2011).<br />
<br />
[3] Arai, R., Ueda, H., Kitayama, A., Kamiya, N. & Nagamune, T. Design of the linkers which effectively separate domains of a bifunctional fusion protein. Protein Eng. 14, 529-532 (2001).<br />
<br />
[4] Wang, C.K.L., Kaas, Q., Chiche, L. & Craik, D.J. CyBase: A database of cyclic protein sequences and structures, with applications in protein discovery and engineering. Nucleic Acids Research 36, (2008).<br />
<br />
[5] Efimov, A. V. Standard structures in proteins. Prog. Biophys. Mol. Biol. 60, 201-239 (1993).<br />
<br />
[6] Donate, L. E., Rufino, S. D., Canard, L. H. & Blundell, T. L. Conformational analysis and clustering of short and medium size loops connecting regular secondary structures: a database for modeling and prediction. Protein Sci. 5, 2600-2616 (1996).<br />
<br />
[7] Bonet, J. et al. ArchDB 2014: structural classification of loops in proteins. Nucleic Acids Res. 42, D315-D319 (2014).<br />
<br />
[8] George, R. A. & Heringa, J. An analysis of protein domain linkers: their classification and role in protein folding. Protein Eng. 15, 871-879 (2002).<br />
<br />
[9] Chen, X., Zaro, J. L. & Shen, W.-C. Fusion protein linkers: property, design and functionality. Adv. Drug Deliv. Rev. 65, 1357-1369 (2013).<br />
<br />
[10] Fiser, A. et al. Modeling of loops in protein structures. Protein Sci. 9, 1753-1773 (2000).<br />
<br />
[11] Fiser, A. & Sali, A. ModLoop: Automated modeling of loops in protein structures. Bioinformatics 19, 2500-2501 (2003).</div>Igemnilshttp://2014.igem.org/Team:Heidelberg/pages/Linker_ModelingTeam:Heidelberg/pages/Linker Modeling2014-10-17T21:10:21Z<p>Igemnils: /* Helix patterns */</p>
<hr />
<div>=Background=<br />
Primary, secondary, tertiary and quaternary structures are the main levels of protein structure characterization. Primary structure designates the amino acid sequence, while the secondary structure describes the arrangement of consecutive amino acids through their two dihedral angles $\phi$ and $\psi$. The Ramachandran plot, which represents the amino acid positions in the space of those two angles, shows two particular arrangements commonly found in proteins: alpha helices and beta sheets. The next level of protein organization is the tertiary structure, which describes how the protein is organized in the three spatial dimensions, whereas the quaternary structure describes how the different subunits of a protein assemble.<br />
Finally, closely related to these standard structures, the supersecondary structure describes how secondary structure elements are connected to each other. While these connections look undefined at first sight, further analysis revealed that the wide variety of supersecondary structure motifs can be clustered into certain patterns [[#References|[5]]].<br />
<br />
==Supersecondary structure==<br />
<br />
When the properties of supersecondary structures were first described, only very few patterns were identified, mainly due to the lack of high-resolution protein structures. At that time, the structures were mainly classified by the Ramachandran plot regions in which the amino acids were found [[#References|[6]]]. With the growing number of known crystal structures, the analysis of supersecondary structure improved and led to databases with about 150 000 classified loop structures and elaborate clustering [[#References|[7]]]. Nowadays, supersecondary structures are defined as the structures formed when two secondary structure elements are connected by a short peptide that is not assigned to either of the secondary structures. These loop peptides range from 1 to 9 amino acids in length.<br />
Our aim was to build reliable, stable linkers out of alpha helices connected by supersecondary structure motifs that produce defined angles. To achieve that, we searched for the most reliable alpha helix patterns that would form rigid rods, and for angle patterns covering the whole range of angles from 0 to 180 degrees.<br />
<br />
=Linker building block design=<br />
<br />
===Helix patterns===<br />
Various patterns have been used to build helical linkers to connect protein ends [[#References | [8]]]. Moreover, in known protein structures, linkers between subdomains can be identified and their properties have already been analyzed [[#References | [9]]]. Two main criteria were used to build the alpha helix patterns: they should robustly form alpha helices, and they should be soluble in aqueous solution. Therefore, we could not simply use linkers built only of alanine. We decided to add some charged amino acids to the pattern and to position them physically close to each other so that they could stabilize each other by Coulomb interaction. These amino acids needed to be separated by 3 amino acids, as a helical turn takes about 3.6 amino acids. The pattern we chose as most suitable for our purpose was also described as one of the most stable [[#References | [3]]].<br />
8 alpha helix building blocks were eventually chosen: AEAAAK, AEAAAKA, AEAAAKAA, AEAAAKEAAAK, AEAAAKEAAAKA, AEAAAKEAAAKEAAAKA, AEAAAKEAAAKEAAAKEAAAKA, AEAAAKEAAAKEAAAKEAAAKEAAAKA, with respective estimated lengths of 9, 10.5, 12, 16.5, 18, 25.5 and 33 &Aring;.<br />
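<br />
The estimated lengths listed above are consistent with a rise of about 1.5 &Aring; per residue, the textbook value for an ideal alpha helix. The following minimal sketch makes that arithmetic explicit. The constant and the function name are illustrative assumptions and are not taken from the CRAUT software; for the longest block, for which no estimate is listed above, the same rule would give 40.5 &Aring;.<br />

```python
# Assumption: an ideal alpha helix rises about 1.5 Angstrom per residue.
RISE_PER_RESIDUE = 1.5

# The eight helix building blocks named in the text.
HELIX_BLOCKS = [
    "AEAAAK",
    "AEAAAKA",
    "AEAAAKAA",
    "AEAAAKEAAAK",
    "AEAAAKEAAAKA",
    "AEAAAKEAAAKEAAAKA",
    "AEAAAKEAAAKEAAAKEAAAKA",
    "AEAAAKEAAAKEAAAKEAAAKEAAAKA",
]


def estimated_length(block, rise=RISE_PER_RESIDUE):
    """Return the estimated length (in Angstrom) of a helical block."""
    return len(block) * rise


# Map every block to its estimated length.
estimated_lengths = {block: estimated_length(block) for block in HELIX_BLOCKS}
```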
<br />
===Angle patterns===<br />
<br />
The angle patterns for our model were obtained from the ArchDB database [[#References | [7]]], which classifies loops from known protein structures. About 17 000 non-homologous proteins from the PDB database were analyzed and, from them, 150 000 loop structures, i.e. regions connecting two secondary structure elements, were identified. The classification took into account not only the length of the loop and its conformation, meaning the φ and ψ backbone dihedral angles of the residues in the loop, but also the distance between the attachments of the loop to the surrounding secondary structures. Furthermore, the secondary structures surrounding the loop and the geometry defined by the supersecondary structure motifs can be found in the database.<br />
<br />
To extract the supersecondary structure motifs relevant for our linker design from ArchDB, the complete database was downloaded and helix-loop-helix motifs were extracted using a [https://github.com/igemsoftware/Heidelberg_2014 self-written script] written in Python. From these, we only considered loops composed of 1 or 2 amino acids, because longer loops are less frequent, and therefore less reliable, and their ends lie further apart.<br />
The information of interest to us was (1) the angle between the vectors defining the surrounding alpha helices, (2) the distance between the ends of the loop, and (3) the type of amino acids surrounding the loop. Furthermore, we analyzed the statistical significance of the conformation. For each amino acid combination in the loop region, the angle distribution between the flanking alpha helices, the loop length distribution and a 2D heatmap of the surrounding amino acids were plotted automatically (Figures 1-4).<br />
<br />
These distributions were then visually analyzed to identify loops of interest for linker design. We focused on loops that showed a narrow angle distribution and that appeared frequently in the database, and we extended the distribution analysis to the surrounding amino acids. For example, once we had identified that the amino acid T in a turn produces an interesting distribution (Figure 1), we would extend it with the amino acid K in front (Figure 2), with the amino acid A at the end (Figure 3), and with both (Figure 4), and investigate how the distributions behave. Restricting the possibilities in this way reduces the number of occurrences tremendously, but the properties become more interesting.<br />
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of__T_.png|<br />
descr= Figure 1) A loop composed of the amino acid T produces an interesting, well-defined angle distribution (left panel). The middle panel represents the distribution of lengths found with this loop. The right panel represents the distribution of amino acids found before and after the amino acid T in all the corresponding loops. One can see that the amino acid K is the most frequent before and A the most frequent after.}} <br />
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of_K_T_.png|<br />
descr= Figure 2) These distributions are subsets of the ones of Figure 1, with only the loops composed of the amino acid T preceded by the amino acid K. This constraint considerably narrows down the angle and length distributions.}} <br />
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of__T_A.png|<br />
descr= Figure 3) Same procedure as in Figure 2, with the loops composed of the amino acid T followed by the amino acid A. This also narrows down the angle and length distributions compared to Figure 1, although not as much as in Figure 2.}} <br />
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of_K_T_A.png|<br />
descr= Figure 4) The distributions correspond to loops composed of the amino acids KTA and therefore to subsets of all the preceding ones. One can notice that the angle distribution becomes very narrow, but also that the frequency is reduced compared to the preceding ones.}}<br />
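<br />
The stepwise restriction illustrated in Figures 1 to 4 can be sketched as a simple filter over loop records. This is only an illustration under stated assumptions: the record layout, the function name and the angle values below are hypothetical placeholders, not data from ArchDB and not our extraction script.<br />

```python
from statistics import mean, stdev

# Hypothetical records: (residue before loop, loop sequence,
# residue after loop, angle in degrees). Placeholder values only.
loops = [
    ("K", "T", "A", 36.0),
    ("K", "T", "A", 41.0),
    ("K", "T", "G", 55.0),
    ("E", "T", "A", 80.0),
    ("L", "T", "S", 120.0),
]


def angle_stats(records, loop, before=None, after=None):
    """Return (count, mean angle, standard deviation) for matching loops.

    Passing `before` and/or `after` restricts the flanking residues,
    which mirrors going from Figure 1 to Figures 2, 3 and 4.
    """
    angles = [
        angle
        for b, seq, a, angle in records
        if seq == loop
        and (before is None or b == before)
        and (after is None or a == after)
    ]
    if not angles:
        return 0, None, None
    spread = stdev(angles) if len(angles) > 1 else 0.0
    return len(angles), mean(angles), spread


all_t = angle_stats(loops, "T")                       # as in Figure 1
k_t_a = angle_stats(loops, "T", before="K", after="A")  # as in Figure 4
```

With these placeholder records, the restricted subset contains fewer loops but has a smaller spread, which is the trade-off described above.<br />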
<br />
This step had two main goals: narrowing down the angle distribution and finding loops with no preference for the amino acids surrounding them. This last point was important for the modularity of our approach: the angle blocks should not be affected by the surrounding alpha helices. Using this approach, 10 different angle motifs (table 1) producing different angles could be identified. Importantly, all these motifs were chosen so that the length of the two segments starting from the turning point was the same for all of them. It should also be noted that 3 of the patterns, KTA, LVA and AAIAP, produce nearly the same angle. They were nevertheless all kept for the ''in silico'' refinement described below, but only one was used for the [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software CRAUT software].<br />
<br />
<br />
{| class="table table-hover" style="text-align: center;"<br />
|+'''table 1''': Mean angle and variation of each angle pattern, in degrees.<br />
!colspan="10"|Angle Patterns<br />
|-<br />
|Pattern <br />
| NVL <br />
| KTA <br />
| LVA <br />
| AAIAP <br />
| AADGTL <br />
| VNLTA<br />
| AAAHPEA<br />
| ASLPAA <br />
| ATGDLA<br />
|-<br />
|Mean <br />
| 29.7 <br />
| 38.7 <br />
| 35 <br />
| 36.5 <br />
| 60 <br />
| 74.5 <br />
| 117 <br />
| 140 <br />
| 160 <br />
|-<br />
| Variation <br />
| 8.5 <br />
| 30 <br />
| 29 <br />
| 27 <br />
| 12 <br />
| 27 <br />
| 12 <br />
| 15 <br />
| 5 <br />
|}<br />
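<br />
As an illustration of how the values of table 1 can be used as a lookup, the sketch below picks the pattern whose mean angle lies closest to a desired angle. The selection rule and the function name are assumptions made for this example; this is not necessarily how the CRAUT software chooses its angle blocks.<br />

```python
# Mean angles in degrees, copied from table 1.
ANGLE_PATTERNS = {
    "NVL": 29.7,
    "KTA": 38.7,
    "LVA": 35.0,
    "AAIAP": 36.5,
    "AADGTL": 60.0,
    "VNLTA": 74.5,
    "AAAHPEA": 117.0,
    "ASLPAA": 140.0,
    "ATGDLA": 160.0,
}


def closest_pattern(target_angle, patterns=ANGLE_PATTERNS):
    """Return the pattern whose mean angle is nearest to target_angle."""
    return min(patterns, key=lambda name: abs(patterns[name] - target_angle))
```

A more careful selection would also weigh the variation listed in table 1, since a pattern with a slightly worse mean but a much narrower distribution may be preferable.<br />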
<br />
===Sequences to connect the alpha helix to the protein extremity===<br />
<br />
Helical patterns often affect the folding of the attached sequences. To prevent them from affecting the structure of the protein of interest, we analyzed the effects of various amino acids on helix formation ''in silico'' using online tools like [http://bioserv.rpbs.univ-paris-diderot.fr/services/PEP-FOLD/ pep-fold]. We identified glycines and prolines as reliable amino acids to interrupt helix formation. We then decided to use glycine pairs to connect the protein of interest to the linker, because they give more flexibility to the orientation of the first helix.<br />
This only concerns the protein end that was connected to the linker through the coding sequence. The other end is ligated through an extein or sortase scar, both of which are treated as unstructured flexible regions.<br />
<br />
===''Conclusion''===<br />
<br />
The last three parts show how we could design alpha helix and angle pattern blocks and connect them to each other and to the protein of interest. They provide the material that allowed us to transform a linker defined as a geometrical path into a real amino acid sequence in our [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software software].<br />
<br />
Thanks to this, we could design linkers to circularize [https://2014.igem.org/Team:Heidelberg/Project/PCR_2.0 DNMT1] and [https://2014.igem.org/Team:Heidelberg/Project/Linker_Screening lysozyme]. Additionally, it was shown for lysozyme that a custom-tailored linker enhances heat stability compared to a badly designed linker and even to a flexible linker.<br />
<br />
=''In silico'' refinement=<br />
<br />
As some of the interesting patterns could not be found often enough to be statistically significant, we decided to make a further refinement ''in silico'' by modeling the structure of proteins with circularizing linkers. To do this for realistic situations, we selected, from the RCSB database, structures of non-homologous target proteins whose extremities are far enough apart to require a linker for circularization. First, the [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software CRAUT software] generates possible fitting linkers for the various proteins. From these possible linkers, the 100 shortest were taken. Among the 3 angle patterns that generate nearly the same angle, the software provides only one, but the linker refinement developed here used all three for comparison. We assume that the linkers connect the ends of the protein without putting tension on it, so that the protein can fold in its natural way.<br />
<br />
After this, the circularized proteins with the specific linkers are modelled using the software Modeller [[#References|[10]]]. This software is widely used for comparative structure prediction. It is well established in the scientific community and should be well suited for the prediction of loop regions attached to existing structures [[#References|[11]]]. It is freely available for academic use from the [http://salilab.org/modeller/ salilab] webpages. The program predicts the 3D structure of a given sequence based on an alignment with a given structure. In our script, Modeller needed to be provided with the sequence with the linker attached and with the PDB file of the protein of interest. This latter file is the only structural information used here. Modeller is recommended when at least 30% sequence similarity exists between the provided structure and the one that the user wishes to model. Here, only the linker is different, so this similarity is around 90% in our case. First, Modeller makes an alignment between the provided structure and the sequence of our linked protein, identifying the regions that cannot be found in the structure. Based on that, Modeller generates 4 initial models. One of the strengths of the software is its capacity to further refine only certain parts of the protein. Thus we let Modeller refine the loops, which are defined as any part of the protein that could not be found in the structure file. The ''ab initio'' modelling of our linkers is done by minimizing energy functions with different methods such as conjugate gradients and molecular dynamics. Eventually, each modelled structure is provided with energy values, thanks to which different models of the same structure can be compared. From Modeller we received about 8 different models and chose the one with the best energy scores to proceed further. For these refinement steps, one can choose different levels of optimization. 
We always favored accuracy over speed. The result of a prediction for the lysozyme of bacteriophage lambda can be seen in figure 5.<br />
<br />
{{:Team:Heidelberg/templates/image-quarter|<br />
align=right|<br />
caption=Figure 5) Circular lysozyme|<br />
descr=The structure of circular lysozyme (yellow) was predicted and aligned to the linear structure (purple). The linker calculated by the software was marked green. |<br />
file=circ_lam_lys_nils.png}}<br />
<br />
<br />
Modeller was run by distributing the calculations via the [https://2014.igem.org/Team:Heidelberg/Software/igemathome iGEM@home] system. The modelling of one linker took about 10 hours of calculation time on average, a value that depends strongly on the size of the protein. The best model is then evaluated by another self-written program to analyze the behaviour of the linker patterns in their natural surroundings.<br />
Finally, all the models for the different structures and the different linkers are analyzed for properties such as the length of the helical patterns, the shape of the attachment structures of the linker and the angles produced by the angle patterns [[ Figure helix_winkel_messung.png]]. First, the modelled structure and the natural structure are superimposed to see how large the differences between them are. If the protein has been disturbed too much, the model is discarded. We verify that the alpha helix patterns really form helices by measuring the distance between the first and the last amino acid of each of them and checking whether it is compatible with an alpha helix. The first and last amino acids of each rod define a vector that is then used to calculate the angles between consecutive helical patterns. The frequency of all the determined lengths and angles is then further analyzed, using a strategy similar to the ArchDB analysis presented above.<br />
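<br />
The two geometric checks described above, the end-to-end distance of a helical block and the angle between consecutive rods, can be sketched as follows. This is a minimal illustration and not our evaluation program: the function names, the 20% tolerance and the convention that 0 degrees means a straight continuation are all assumptions, and coordinates are taken as plain (x, y, z) tuples in &Aring;.<br />

```python
import math


def vector(start, end):
    """Vector pointing from the first to the last residue of a rod."""
    return tuple(e - s for s, e in zip(start, end))


def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(c * c for c in v))


def angle_between(v1, v2):
    """Angle in degrees between two rod vectors (0 = same direction)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    cos_theta = dot / (norm(v1) * norm(v2))
    # Clamp against rounding errors before calling acos.
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.degrees(math.acos(cos_theta))


def is_helix_like(start, end, expected_length, tolerance=0.2):
    """Check the end-to-end distance against the expected helix length.

    `tolerance` is a hypothetical relative cut-off (20% by default).
    """
    measured = norm(vector(start, end))
    return abs(measured - expected_length) <= tolerance * expected_length


# Two consecutive rods forming a right angle (placeholder coordinates).
rod1 = vector((0.0, 0.0, 0.0), (10.0, 0.0, 0.0))
rod2 = vector((10.0, 0.0, 0.0), (10.0, 10.0, 0.0))
kink = angle_between(rod1, rod2)
```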
<br />
The whole process for the verification of the different linker patterns was set up on the [https://2014.igem.org/Team:Heidelberg/Software/igemathome distributed computing system]. Due to lack of time, only a few results could be analyzed, resulting in distributions for the different helices; see for example figures 6 and 7.<br />
This led to a refinement of the lengths of the 8 different alpha helix blocks presented above to 8.7, 10, 10.8, 15.6, 16.8, 24.8 and 32.3 &Aring;.<br />
<br />
{{:Team:Heidelberg/templates/image-quarter|<br />
align=right|<br />
descr=|<br />
caption=Fig. 7. Length distribution of AEAAAKEAAAK|<br />
file=plot_of_AEAAAKEAAAK.png}}<br />
{{:Team:Heidelberg/templates/image-quarter|<br />
align=right|<br />
descr=|<br />
caption=Fig. 6. Length distribution of AEAAAKA |<br />
file=plot_of_AEAAAKA.png}}<br />
<br />
=Conclusion=<br />
The patterns that we identified by analyzing structure databases provide an easy and fast tool to build custom-shaped peptides. The main achievement is the identification of the angle patterns. These are designed as building blocks for enhanced applicability. The shapes were identified from a database of non-homologous proteins and the patterns were refined until the distribution of surrounding amino acids looked random. Thus we can conclude that the angle distributions we observed are due to the identified patterns and not to the surrounding sequences. We have not observed any evidence for certain helix patterns being preferred in the database.<br />
From figures 6 and 7 we learned that the lengths we had assumed for the helices needed to be refined. For example, we had assumed the AEAAAKA motif to span a distance of 10.5 &Aring; but observed it to be only 10 &Aring; long. Our [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software CRAUT software] was corrected accordingly.<br />
<br />
=References=<br />
<br />
[1] Vieille, C. & Zeikus, G.J. Hyperthermophilic enzymes: sources, uses, and molecular mechanisms for thermostability. Microbiology and molecular biology reviews : MMBR 65, 1-43 (2001).<br />
<br />
[2] Yu, Y. & Lutz, S. Circular permutation: a different way to engineer enzyme structure and function. Trends Biotechnol. 29, 18-25 (2011).<br />
<br />
[3] Arai, R., Ueda, H., Kitayama, A., Kamiya, N. & Nagamune, T. Design of the linkers which effectively separate domains of a bifunctional fusion protein. Protein Eng. 14, 529-532 (2001).<br />
<br />
[4] Wang, C.K.L., Kaas, Q., Chiche, L. & Craik, D.J. CyBase: A database of cyclic protein sequences and structures, with applications in protein discovery and engineering. Nucleic Acids Research 36, (2008).<br />
<br />
[5] Efimov, A. V. Standard structures in proteins. Prog. Biophys. Mol. Biol. 60, 201-239 (1993).<br />
<br />
[6] Donate, L. E., Rufino, S. D., Canard, L. H. & Blundell, T. L. Conformational analysis and clustering of short and medium size loops connecting regular secondary structures: a database for modeling and prediction. Protein Sci. 5, 2600-2616 (1996).<br />
<br />
[7] Bonet, J. et al. ArchDB 2014: structural classification of loops in proteins. Nucleic Acids Res. 42, D315-D319 (2014).<br />
<br />
[8] George, R. A. & Heringa, J. An analysis of protein domain linkers: their classification and role in protein folding. Protein Eng. 15, 871-879 (2002).<br />
<br />
[9] Chen, X., Zaro, J. L. & Shen, W.-C. Fusion protein linkers: property, design and functionality. Adv. Drug Deliv. Rev. 65, 1357-1369 (2013).<br />
<br />
[10] Fiser, A. et al. Modeling of loops in protein structures. Protein Sci. 9, 1753-1773 (2000).<br />
<br />
[11] Fiser, A. & Sali, A. ModLoop: Automated modeling of loops in protein structures. Bioinformatics 19, 2500-2501 (2003).</div>Igemnilshttp://2014.igem.org/Team:Heidelberg/pages/Linker_ModelingTeam:Heidelberg/pages/Linker Modeling2014-10-17T21:07:36Z<p>Igemnils: /* Angle patterns */</p>
<hr />
<div>=Background=<br />
Primary, secondary, tertiary and quaternary structures are the main levels of protein structure characterization. Primary structure designates the amino acid sequence, while the secondary structure describes the arrangement of consecutive amino acids through their two dihedral angles $\phi$ and $\psi$. The Ramachandran plot, which represents the amino acid positions in the space of those two angles, shows two particular arrangements commonly found in proteins: alpha helices and beta sheets. The next level of protein organization is the tertiary structure, which describes how the protein is organized in the three spatial dimensions, whereas the quaternary structure describes how the different subunits of a protein assemble.<br />
Finally, closely related to these standard structures, the supersecondary structure describes how secondary structure elements are connected to each other. While these connections look undefined at first sight, further analysis revealed that the wide variety of supersecondary structure motifs can be clustered into certain patterns [[#References|[5]]].<br />
<br />
==Supersecondary structure==<br />
<br />
When the properties of supersecondary structures were first described, only very few patterns were identified, mainly due to the lack of high-resolution protein structures. At that time, the structures were mainly classified by the Ramachandran plot regions in which the amino acids were found [[#References|[6]]]. With the growing number of known crystal structures, the analysis of supersecondary structure improved and led to databases with about 150 000 classified loop structures and elaborate clustering [[#References|[7]]]. Nowadays, supersecondary structures are defined as the structures formed when two secondary structure elements are connected by a short peptide that is not assigned to either of the secondary structures. These loop peptides range from 1 to 9 amino acids in length.<br />
Our aim was to build reliable, stable linkers out of alpha helices connected by supersecondary structure motifs that produce defined angles. To achieve that, we searched for the most reliable alpha helix patterns that would form rigid rods, and for angle patterns covering the whole range of angles from 0 to 180 degrees.<br />
<br />
=Linker building block design=<br />
<br />
===Helix patterns===<br />
Various different patterns have been used to build helical linkers to connect protein ends [[References | [8]]]. Moreover, in known protein structures, linkers between subdomains can be identified and their properties have already been analyzed [[#References | [9]]]. Two main criteria were used to build the alpha helix patterns: they should robustly for alpha helices, and they should be soluble in aqueous solution. Therefore, we could not just use linkers built of Alanine. So we decided to add some charged aminoacids to the pattern, and to position them physically close to each other so that they could stabilize themselves by Coulomb interaction. These amino acids needed to be separated by 3 amino acids as a helical turn takes about 3.6 aminoacids. The pattern we chose as most suitable for our purpose was also described to be one of the most stable [[#References | [3]]].<br />
8 alpha helix building blocks were eventually chosen: AEAAAK, AEAAAKA, AEAAAKAA, AEAAAKEAAAK, AEAAAKEAAAKA, AEAAAKEAAAKEAAAKA, AEAAAKEAAAKEAAAKEAAAKA, AEAAAKEAAAKEAAAKEAAAKEAAAKA, with a respective estimated length of 9, 10.5, 12, 16.5, 18, 25.5 and 33 &Aring;.<br />
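The estimated lengths listed above are consistent with a rise of about 1.5 &Aring; per residue along an ideal alpha helix. The following minimal sketch recomputes them under that assumption; the constant and the function name are illustrative and are not taken from the CRAUT code. Only seven lengths are listed in the text, so only the first seven blocks can be compared against it.<br />

```python
# Assumed rise per residue along an ideal alpha helix axis, in angstroms.
# This value is an assumption of this sketch, not a number stated in the text.
RISE_PER_RESIDUE = 1.5

# The eight helix building blocks named in the text.
HELIX_BLOCKS = [
    "AEAAAK",
    "AEAAAKA",
    "AEAAAKAA",
    "AEAAAKEAAAK",
    "AEAAAKEAAAKA",
    "AEAAAKEAAAKEAAAKA",
    "AEAAAKEAAAKEAAAKEAAAKA",
    "AEAAAKEAAAKEAAAKEAAAKEAAAKA",
]


def estimated_length(sequence, rise=RISE_PER_RESIDUE):
    """Estimate the length of a helical block as residue count times rise."""
    return len(sequence) * rise


if __name__ == "__main__":
    for block in HELIX_BLOCKS:
        print(f"{block}: {len(block)} residues, {estimated_length(block):.1f} A")
```

Under this assumption the length grows linearly with the number of residues, which is what makes the helix blocks usable as rigid rods of predictable size.<br />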
<br />
===Angle patterns===<br />
<br />
The angle patterns for our model were obtained from the ArchDB database [[#References | [7]]], which classifies loops from known protein structures. About 17 000 non-homologous proteins from the PDB database were analyzed and, from them, 150 000 loop structures, i.e. regions connecting two secondary structure elements, were identified. The classification took into account not only the length of the loop and its conformation, meaning the φ and ψ backbone dihedral angles of the residues in the loop, but also the distance between the attachment points of the loop to the surrounding secondary structures. Furthermore, the secondary structures surrounding the loop and the geometry defined by the supersecondary structure motifs can be found in the database.<br />
<br />
To extract the supersecondary structure motifs relevant for our linker design from ArchDB, the complete database was downloaded and helix-loop-helix motifs were extracted using a [https://github.com/igemsoftware/Heidelberg_2014 self-written script] in Python. Of these, we only took into account loops composed of 1 or 2 amino acids, because longer loops are less frequent and therefore less reliable, and their ends lie further apart.<br />
The information of interest to us was (1) the angle between the vectors defining the surrounding alpha helices, (2) the distance between the ends of the loop, and (3) the type of amino acids surrounding the loop. Furthermore, we analyzed the statistical significance of the conformation. For each amino acid combination in the loop region, the angle distribution between the flanking alpha helices, the loop length distribution and a 2D heatmap of the surrounding amino acids were plotted automatically (Figures 1-4).<br />
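The angle in point (1) can be illustrated with a short sketch. It assumes that each helix is represented by a vector from its first to its last residue and that the angle is the plain angle between those two vectors; the exact convention used by ArchDB or by our extraction script may differ, and the function names are hypothetical.<br />

```python
import math


def helix_vector(start, end):
    """Vector from the first to the last residue position of a helix."""
    return tuple(e - s for s, e in zip(start, end))


def angle_between(v1, v2):
    """Angle in degrees (0 to 180) between two 3D vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    # Clamp to [-1, 1] to guard against rounding errors before acos.
    cosine = max(-1.0, min(1.0, dot / (norm1 * norm2)))
    return math.degrees(math.acos(cosine))


if __name__ == "__main__":
    first_helix = helix_vector((0.0, 0.0, 0.0), (9.0, 0.0, 0.0))
    second_helix = helix_vector((10.0, 1.0, 0.0), (10.0, 10.0, 0.0))
    print(angle_between(first_helix, second_helix))
```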
<br />
These distributions were then visually analyzed to identify loops of interest for linker design. We focused on loops that showed a narrow angle distribution and that appeared frequently in the database, and we extended the distribution analysis to the surrounding amino acids. For example, once we had identified that the amino acid T in a turn produces an interesting distribution (Figure 1), we would extend it with the amino acid K in front (Figure 2), with the amino acid A at the end (Figure 3), and with both (Figure 4), and investigate how the distributions behave. Restricting the possibilities in this way reduces the number of occurrences tremendously, but the properties become more interesting.<br />
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of__T_.png|<br />
descr= Figure 1) A loop composed of the amino acid T produces an interesting, well-defined angle distribution (left panel). The middle panel represents the distribution of lengths found with this loop. The right panel represents the distribution of amino acids found before and after the amino acid T in all the corresponding loops. One can see that the amino acid K is the most frequent before and A the most frequent after.}} <br />
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of_K_T_.png|<br />
descr= Figure 2) These distributions are subsets of the ones of Figure 1, with only the loops composed of the amino acid T preceded by the amino acid K. This constraint considerably narrows down the angle and length distributions.}} <br />
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of__T_A.png|<br />
descr= Figure 3) Same procedure as in Figure 2, with the loops composed of the amino acid T followed by the amino acid A. This also narrows down the angle and length distributions compared to Figure 1, although not as much as in Figure 2.}} <br />
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of_K_T_A.png|<br />
descr= Figure 4) The distributions correspond to loops composed of the amino acids KTA and therefore to subsets of all the preceding ones. One can notice that the angle distribution becomes very narrow, but also that the frequency is reduced compared to the preceding ones.}}<br />
<br />
This step had two main goals: narrowing down the angle distribution and finding loops with no preference for the amino acids surrounding them. This last point was important for the modularity of our approach: the angle blocks should not be affected by the surrounding alpha helices. Using this approach, 10 different angle motifs (table 1) producing different angles could be identified. Importantly, all these motifs were chosen so that the length of the two segments starting from the turning point was the same for all of them. It should also be noted that 3 of the patterns, KTA, LVA and AAIAP, produce the same angles. They were nevertheless all kept for the ''in silico'' refinement described below, but only one was used for the [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software CRAUT software].<br />
<br />
<br />
{| class="table table-hover" style="text-align: center;"<br />
|+'''table 1''': The span of parameters.<br />
!colspan="10"|Angle Patterns<br />
|-<br />
|Pattern <br />
| NVL <br />
| KTA <br />
| LVA <br />
| AAIAP <br />
| AADGTL <br />
| VNLTA<br />
| AAAHPEA<br />
| ASLPAA <br />
| ATGDLA<br />
|-<br />
|Mean <br />
| 29.7 <br />
| 38.7 <br />
| 35 <br />
| 36.5 <br />
| 60 <br />
| 74.5 <br />
| 117 <br />
| 140 <br />
| 160 <br />
|-<br />
| Variation <br />
| 8.5 <br />
| 30 <br />
| 29 <br />
| 27 <br />
| 12 <br />
| 27 <br />
| 12 <br />
| 15 <br />
| 5 <br />
|}<br />
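To illustrate how table 1 can be used as a lookup, the following sketch picks the angle pattern whose mean angle is closest to a desired angle and breaks ties by the smaller variation. The selection rule and the function name are assumptions made for this example and do not describe the actual logic of the CRAUT software; the numbers are copied from table 1.<br />

```python
# (pattern, mean angle, variation) as listed in table 1.
ANGLE_PATTERNS = [
    ("NVL", 29.7, 8.5),
    ("KTA", 38.7, 30.0),
    ("LVA", 35.0, 29.0),
    ("AAIAP", 36.5, 27.0),
    ("AADGTL", 60.0, 12.0),
    ("VNLTA", 74.5, 27.0),
    ("AAAHPEA", 117.0, 12.0),
    ("ASLPAA", 140.0, 15.0),
    ("ATGDLA", 160.0, 5.0),
]


def closest_pattern(target_angle, patterns=ANGLE_PATTERNS):
    """Return the pattern whose mean angle is nearest to target_angle.

    Ties in angular distance are resolved in favour of the smaller variation.
    This rule is a hypothetical choice for illustration only.
    """
    best = min(patterns, key=lambda p: (abs(p[1] - target_angle), p[2]))
    return best[0]


if __name__ == "__main__":
    for angle in (30, 90, 150):
        print(angle, closest_pattern(angle))
```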
<br />
===Sequences to connect the alpha helix to the protein extremity===<br />
<br />
Helical patterns often affect the folding of the attached sequences. To prevent them from affecting the structure of the protein of interest, we analyzed the effects of various amino acids on helix formation ''in silico'' using online tools like [http://bioserv.rpbs.univ-paris-diderot.fr/services/PEP-FOLD/ PEP-FOLD]. We identified glycines and prolines as reliable amino acids to interrupt helix formation. We then decided to use glycine pairs to connect the protein of interest to the linker, because they give more flexibility to the orientation of the first helix.<br />
This only concerns the protein end that was connected to the linker through the coding sequence. The other end is ligated through exteins or a sortase scar, both treated as unstructured flexible regions.<br />
<br />
===''Conclusion''===<br />
<br />
The last three parts show how we could design alpha helix and angle pattern blocks and connect them to each other and to the protein of interest. They provide the material that allowed us to transform a linker defined as a geometrical path into a real amino acid sequence in our [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software software].<br />
<br />
Thanks to this we could design linkers to circularize [https://2014.igem.org/Team:Heidelberg/Project/PCR_2.0 DNMT1] and [https://2014.igem.org/Team:Heidelberg/Project/Linker_Screening lysozyme]. Additionally, it was shown for lysozyme that a custom-tailored linker enhances heat stability compared to a badly designed linker and even to the flexible linker.<br />
<br />
=''In silico'' refinement=<br />
<br />
As some of the interesting patterns could not be found often enough to be statistically significant, we decided to make a further refinement ''in silico'' by modeling the structure of proteins with circularizing linkers. To perform this for realistic situations, we selected, from the RCSB database, structures of non-homologous target proteins whose extremities are far enough apart to require a linker for circularization. First, the [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software CRAUT software] generates possible fitting linkers for various proteins. From these possible linkers, the 100 shortest were taken. Among the 3 possible angle patterns that generate the same angle, the software provides only one, but the linker refinement developed here used all three of them for comparison. We assume that the linkers connect the ends of the protein without putting tension on the protein, so that the protein can fold in its natural way.<br />
<br />
After this, the circularized proteins with the specific linkers are modelled using the software Modeller [[#References|[10]]]. This software is widely used for comparative structure prediction. It is well established in the scientific community and should be well suited to predicting loop regions attached to existing structures [[#References|[11]]]. It is freely available for academic use from the [http://salilab.org/modeller/ salilab] webpages. The program predicts the 3D structure of a given sequence based on an alignment with a given structure. In our script, Modeller was provided with the sequence including the attached linker and with the PDB file of the protein of interest. The latter file is the only structural information used here. Modeller is recommended when at least 30% sequence similarity exists between the provided structure and the one that the user wishes to model. Here, only the linker is different, so this similarity is around 90% in our case. First, Modeller makes an alignment between the provided structure and the sequence of our linked protein, identifying the regions that cannot be found in the structure. Based on that, Modeller generates 4 initial models. One of the strengths of the software is its capacity to further refine only certain parts of the protein. Thus we let Modeller refine the loops, which are defined as any part of the protein that could not be found in the structure file. The ''ab initio'' modelling of our linkers is done by minimizing energy functions with different methods such as conjugate gradients and molecular dynamics. Finally, each modeled structure is assigned energy values, with which different models of the same structure can be compared. From Modeller we received about 8 different models and chose the one with the best energy scores to proceed with. For these refinement steps, one can choose different levels of optimization. 
We always favoured accuracy over speed. The result of a prediction for the lysozyme of bacteriophage lambda can be seen in figure 5.<br />
<br />
{{:Team:Heidelberg/templates/image-quarter|<br />
align=right|<br />
caption=Figure 5) Circular lysozyme|<br />
descr=The structure of circular lysozyme (yellow) was predicted and aligned to the linear structure (purple). The linker calculated by the software was marked green. |<br />
file=circ_lam_lys_nils.png}}<br />
<br />
<br />
Modeller was run by distributing the calculations via the [https://2014.igem.org/Team:Heidelberg/Software/igemathome iGEM@home] system. The modelling of one linker took about 10 hours of calculation time on average, a value that depends strongly on the size of the protein. The best model is then evaluated by another self-written program to analyze the behaviour of the linker patterns in their natural surroundings.<br />
Finally, all the models for the different structures and the different linkers are analyzed for properties such as the length of the helical patterns, the shape of the attachment structures of the linker and the angles produced by the angle patterns [[ Figure helix_winkel_messung.png]]. First the modeled structure and the natural structure are fitted to each other to see how large the differences between them are. If the protein has been disturbed too much, the model is discarded. We verify that the alpha helix patterns really form helices by measuring the distance between the first and the last amino acid of each pattern and checking whether it is compatible with an alpha helix. The first and last amino acids of each rod define a vector that is then used to calculate the angles between consecutive helical patterns. The frequency of all the determined lengths and angles is then further analyzed, using a strategy similar to the ArchDB analysis presented above.<br />
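The helix check described above can be sketched as follows. The sketch assumes that each residue is represented by a single 3D coordinate (for example its C-alpha atom) and that a fixed tolerance decides compatibility; both the tolerance and the function names are hypothetical and are not taken from our evaluation program.<br />

```python
import math


def end_to_end_distance(first_residue, last_residue):
    """Euclidean distance between the first and last residue positions."""
    return math.dist(first_residue, last_residue)


def is_helix_compatible(measured, expected, tolerance=1.0):
    """True if the measured length is within tolerance of the expected one.

    The default tolerance of 1.0 angstrom is an arbitrary illustrative choice.
    """
    return abs(measured - expected) <= tolerance


if __name__ == "__main__":
    # Made-up coordinates for the two ends of a modelled AEAAAKA block.
    measured = end_to_end_distance((0.0, 0.0, 0.0), (0.0, 6.0, 8.0))
    print(measured, is_helix_compatible(measured, expected=10.5))
```

Collecting the measured distances of all accepted models per pattern gives length distributions that can be compared with the initial estimates.<br />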
<br />
The whole process for the verification of the different linker patterns was set up on the [https://2014.igem.org/Team:Heidelberg/Software/igemathome distributed computing system]. Due to lack of time, only a few results could be analyzed, resulting in distributions for the different helices; see for example figures 6 and 7.<br />
This led to a refinement of the length of the 8 different alpha helix blocks presented above to 8.7, 10, 10.8, 15.6, 16.8, 24.8, 32.3 &Aring;.<br />
<br />
{{:Team:Heidelberg/templates/image-quarter|<br />
align=right|<br />
descr=|<br />
caption=Fig. 7. Length distribution of AEAAAKEAAAK|<br />
file=plot_of_AEAAAKEAAAK.png}}<br />
{{:Team:Heidelberg/templates/image-quarter|<br />
align=right|<br />
descr=|<br />
caption=Fig. 6. Length distribution of AEAAAKA |<br />
file=plot_of_AEAAAKA.png}}<br />
<br />
=Conclusion=<br />
The patterns that we identified by analyzing structure databases provide an easy and fast tool to build custom-shaped peptides. The main achievement is the identification of the angle patterns. These are designed as building blocks for enhanced applicability. The shapes were identified from a database of non-homologous proteins and the patterns were refined until the distribution of surrounding amino acids looked random. Thus we can exclude that the angle distributions we observed are due to the surrounding sequences rather than to the identified patterns. We have not observed any evidence for certain helix patterns being preferred in the database.<br />
From figures 6 and 7 we have learned that the lengths we had assumed for the helices needed to be refined. For example, we had assumed the AEAAAKA motif to span a distance of 10.5 &Aring; but observed it to be only 10 &Aring; long. Our [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software CRAUT software] was corrected accordingly.<br />
<br />
=References=<br />
<br />
[1] Vieille, C. & Zeikus, G.J. Hyperthermophilic enzymes: sources, uses, and molecular mechanisms for thermostability. Microbiology and molecular biology reviews : MMBR 65, 1-43 (2001).<br />
<br />
[2] Yu, Y. & Lutz, S. Circular permutation: a different way to engineer enzyme structure and function. Trends Biotechnol. 29, 18-25 (2011).<br />
<br />
[3] Arai, R., Ueda, H., Kitayama, A., Kamiya, N. & Nagamune, T. Design of the linkers which effectively separate domains of a bifunctional fusion protein. Protein Eng. 14, 529-532 (2001).<br />
<br />
[4] Wang, C.K.L., Kaas, Q., Chiche, L. & Craik, D.J. CyBase: A database of cyclic protein sequences and structures, with applications in protein discovery and engineering. Nucleic Acids Research 36, (2008).<br />
<br />
[5] Efimov, A. V. Standard structures in proteins. Prog. Biophys. Mol. Biol. 60, 201-239 (1993).<br />
<br />
[6] Donate, L. E., Rufino, S. D., Canard, L. H. & Blundell, T. L. Conformational analysis and clustering of short and medium size loops connecting regular secondary structures: a database for modeling and prediction. Protein Sci. 5, 2600-2616 (1996).<br />
<br />
[7] Bonet, J. et al. ArchDB 2014: structural classification of loops in proteins. Nucleic Acids Res. 42, D315-D319 (2014).<br />
<br />
[8] George, R. A. & Heringa, J. An analysis of protein domain linkers: their classification and role in protein folding. Protein Eng. 15, 871-879 (2002).<br />
<br />
[9] Chen, X., Zaro, J. L. & Shen, W.-C. Fusion protein linkers: property, design and functionality. Adv. Drug Deliv. Rev. 65, 1357-1369 (2013).<br />
<br />
[10] Fiser, A. et al. Modeling of loops in protein structures. Protein science : a publication of the Protein Society 9, 1753-73 (2000).<br />
<br />
[11] Fiser, A. & Sali, A. ModLoop: Automated modeling of loops in protein structures. Bioinformatics 19, 2500-2501 (2003).</div>Igemnilshttp://2014.igem.org/Team:Heidelberg/pages/Linker_ModelingTeam:Heidelberg/pages/Linker Modeling2014-10-17T21:03:01Z<p>Igemnils: /* In silico refinement */</p>
<hr />
<div>=Background=<br />
Primary, secondary, tertiary and quaternary structures are the main levels of protein structure characterization. Primary structure designates the amino acid sequence, while the secondary structure describes the arrangement of consecutive amino acids through their two dihedral angles $\phi$ and $\psi$. The Ramachandran plot, which represents the position of each amino acid in the space of those two angles, shows two particular arrangements commonly found in proteins: alpha helices and beta sheets. The next level of protein organization is the tertiary structure, which describes how the protein is organized in the three spatial dimensions, whereas the quaternary structure describes how different subunits of proteins cluster.<br />
Finally, closely related to these standard structures, the supersecondary structure describes how secondary structure elements are connected to each other. While these connections look undefined at first sight, further analysis revealed that this wide variety of supersecondary structure motifs can be clustered into certain patterns [[#References|[5]]].<br />
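The dihedral angles $\phi$ and $\psi$ mentioned above are each defined by four consecutive backbone atoms. As a minimal sketch, a dihedral angle can be computed from four 3D points as shown below; the function name is illustrative, and the sign of the result depends on the chosen convention.<br />

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by four points."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    # Normalise the central bond so the projections below are well scaled.
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)
    # Components of b0 and b2 perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


if __name__ == "__main__":
    print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```

For $\phi$ the four atoms are C(i-1), N(i), C-alpha(i) and C(i); for $\psi$ they are N(i), C-alpha(i), C(i) and N(i+1).<br />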
<br />
==Supersecondary structure==<br />
<br />
When the properties of supersecondary structures were first described, only very few patterns were identified, mainly due to the lack of high-resolution protein structures. At that time, the structures were mainly classified by the Ramachandran plot regions where the amino acids could be found [[#References|[6]]]. With the growing number of known crystal structures, the analysis of supersecondary structure improved and led to databases with about 150 000 classified loop structures and elaborate clustering [[#References|[7]]]. Nowadays supersecondary structures are defined as the structures built when two secondary structure elements are joined by a short peptide that is not assigned to either of the secondary structures. These loop peptides range from 1 to 9 amino acids.<br />
Our aim was to build reliable stable linkers out of alpha helices connected by supersecondary structure motifs that produce certain angles. To achieve that, we searched for the most reliable alpha helix patterns that would form rigid rods and angle patterns covering the whole range of angles from 0 to 180 degrees.<br />
<br />
=linker building block design=<br />
<br />
===Helix patterns===<br />
Various patterns have been used to build helical linkers to connect protein ends [[#References | [8]]]. Moreover, in known protein structures, linkers between subdomains can be identified and their properties have already been analyzed [[#References | [9]]]. Two main criteria were used to build the alpha helix patterns: they should robustly form alpha helices, and they should be soluble in aqueous solution. Therefore, we could not simply use linkers built of alanine alone. Instead, we decided to add charged amino acids to the pattern and to position them physically close to each other so that they could stabilize each other by Coulomb interaction. These amino acids needed to be separated by 3 amino acids, as a helical turn takes about 3.6 amino acids. The pattern we chose as most suitable for our purpose has also been described as one of the most stable [[#References | [3]]].<br />
8 alpha helix building blocks were eventually chosen: AEAAAK, AEAAAKA, AEAAAKAA, AEAAAKEAAAK, AEAAAKEAAAKA, AEAAAKEAAAKEAAAKA, AEAAAKEAAAKEAAAKEAAAKA, AEAAAKEAAAKEAAAKEAAAKEAAAKA, with a respective estimated length of 9, 10.5, 12, 16.5, 18, 25.5 and 33 &Aring;.<br />
<br />
===Angle patterns===<br />
<br />
The angle patterns for our model were obtained from the ArchDB database [[#References | [7]]], which classifies loops from known protein structures. About 17 000 non-homologous proteins from the PDB database were analyzed and, from them, 150 000 loop structures, i.e. regions connecting two secondary structure elements, were identified. The classification took into account not only the length of the loop and its conformation, meaning the φ and ψ backbone dihedral angles of the residues in the loop, but also the distance between the attachment points of the loop to the surrounding secondary structures. Furthermore, the secondary structures surrounding the loop and the geometry defined by the supersecondary structure motifs can be found in the database.<br />
<br />
To extract the supersecondary structure motifs relevant for our linker design from ArchDB, the complete database was downloaded and helix-loop-helix motifs were extracted using a [https://github.com/igemsoftware/Heidelberg_2014 self-written script] in Python. Of these, we only took into account loops composed of 1 or 2 amino acids, because longer loops are less frequent and therefore less reliable, and their ends lie further apart.<br />
The information of interest to us was (1) the angle between the vectors defining the surrounding alpha helices, (2) the distance between the ends of the loop, and (3) the type of amino acids surrounding the loop. Furthermore, we analyzed the statistical significance of the conformation. For each amino acid combination in the loop region, the angle distribution between the flanking alpha helices, the loop length distribution and a 2D heatmap of the surrounding amino acids were plotted automatically (Figures 1-4).<br />
<br />
These distributions were then visually analyzed to identify loops of interest for linker design. We focused on loops that showed a narrow angle distribution and that appeared frequently in the database, and we extended the distribution analysis to the surrounding amino acids. For example, once we had identified that T in a turn produces an interesting distribution (Fig. 1), we would extend it with K in front and with A at the end and see how the distribution behaves. Restricting the possibilities in this way reduces the number of occurrences tremendously, but the properties become more interesting.<br />
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of__T_.png|<br />
descr= Figure 1) A loop composed of the amino acid T produces an interesting, well-defined angle distribution (left panel). The middle panel represents the distribution of lengths found with this loop. The right panel represents the distribution of amino acids found before and after the amino acid T in all the corresponding loops. One can see that the amino acid K is the most frequent before and A the most frequent after.}} <br />
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of_K_T_.png|<br />
descr= Figure 2) These distributions are subsets of the ones of Figure 1, with only the loops composed of the amino acid T preceded by the amino acid K. This constraint considerably narrows down the angle and length distributions.}} <br />
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of__T_A.png|<br />
descr= Figure 3) Same procedure as in Figure 2, with the loops composed of the amino acid T followed by the amino acid A. This also narrows down the angle and length distributions compared to Figure 1, although not as much as in Figure 2.}} <br />
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of_K_T_A.png|<br />
descr= Figure 4) The distributions correspond to loops composed of the amino acids KTA and therefore to subsets of all the preceding ones. One can notice that the angle distribution becomes very narrow, but also that the frequency is reduced compared to the preceding ones.}}<br />
<br />
This step had two main goals: narrowing down the angle distribution and finding loops with no preference for the amino acids surrounding them. This last point was important for the modularity of our approach: the angle blocks should not be affected by the surrounding alpha helices. Using this approach, 10 different angle motifs (table 1) producing different angles could be identified. Importantly, all these motifs were chosen so that the length of the two segments starting from the turning point was the same for all of them. It should also be noted that 3 of the patterns, KTA, LVA and AAIAP, produce the same angles. They were nevertheless all kept for the ''in silico'' refinement described below, but only one was used for the [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software CRAUT software].<br />
<br />
<br />
{| class="table table-hover" style="text-align: center;"<br />
|+'''table 1''': The span of parameters.<br />
!colspan="10"|Angle Patterns<br />
|-<br />
|Pattern <br />
| NVL <br />
| KTA <br />
| LVA <br />
| AAIAP <br />
| AADGTL <br />
| VNLTA<br />
| AAAHPEA<br />
| ASLPAA <br />
| ATGDLA<br />
|-<br />
|Mean <br />
| 29.7 <br />
| 38.7 <br />
| 35 <br />
| 36.5 <br />
| 60 <br />
| 74.5 <br />
| 117 <br />
| 140 <br />
| 160 <br />
|-<br />
| Variation <br />
| 8.5 <br />
| 30 <br />
| 29 <br />
| 27 <br />
| 12 <br />
| 27 <br />
| 12 <br />
| 15 <br />
| 5 <br />
|}<br />
<br />
===Sequences to connect the alpha helix to the protein extremity===<br />
<br />
Helical patterns often affect the folding of the attached sequences. To prevent them from affecting the structure of the protein of interest, we analyzed the effects of various amino acids on helix formation ''in silico'' using online tools like [http://bioserv.rpbs.univ-paris-diderot.fr/services/PEP-FOLD/ PEP-FOLD]. We identified glycines and prolines as reliable amino acids to interrupt helix formation. We then decided to use glycine pairs to connect the protein of interest to the linker, because they give more flexibility to the orientation of the first helix.<br />
This only concerns the protein end that was connected to the linker through the coding sequence. The other end is ligated through exteins or a sortase scar, both treated as unstructured flexible regions.<br />
<br />
===''Conclusion''===<br />
<br />
The last three parts show how we could design alpha helix and angle pattern blocks and connect them to each other and to the protein of interest. They provide the material that allowed us to transform a linker defined as a geometrical path into a real amino acid sequence in our [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software software].<br />
<br />
Thanks to this we could design linkers to circularize [https://2014.igem.org/Team:Heidelberg/Project/PCR_2.0 DNMT1] and [https://2014.igem.org/Team:Heidelberg/Project/Linker_Screening lysozyme]. Additionally, it was shown for lysozyme that a custom-tailored linker enhances heat stability compared to a badly designed linker and even to the flexible linker.<br />
<br />
=''In silico'' refinement=<br />
<br />
As some of the interesting patterns could not be found often enough to be statistically significant, we decided to make a further refinement ''in silico'' by modeling the structure of proteins with circularizing linkers. To perform this for realistic situations, we selected, from the RCSB database, structures of non-homologous target proteins whose extremities are far enough apart to require a linker for circularization. First, the [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software CRAUT software] generates possible fitting linkers for various proteins. From these possible linkers, the 100 shortest were taken. Among the 3 possible angle patterns that generate the same angle, the software provides only one, but the linker refinement developed here used all three of them for comparison. We assume that the linkers connect the ends of the protein without putting tension on the protein, so that the protein can fold in its natural way.<br />
<br />
After this, the circularized proteins with the specific linkers are modelled using the software Modeller [[#References|[10]]]. This software is widely used for comparative structure prediction. It is well established in the scientific community and should be well suited to predicting loop regions attached to existing structures [[#References|[11]]]. It is freely available for academic use from the [http://salilab.org/modeller/ salilab] webpages. The program predicts the 3D structure of a given sequence based on an alignment with a given structure. In our script, Modeller was provided with the sequence including the attached linker and with the PDB file of the protein of interest. The latter file is the only structural information used here. Modeller is recommended when at least 30% sequence similarity exists between the provided structure and the one that the user wishes to model. Here, only the linker is different, so this similarity is around 90% in our case. First, Modeller makes an alignment between the provided structure and the sequence of our linked protein, identifying the regions that cannot be found in the structure. Based on that, Modeller generates 4 initial models. One of the strengths of the software is its capacity to further refine only certain parts of the protein. Thus we let Modeller refine the loops, which are defined as any part of the protein that could not be found in the structure file. The ''ab initio'' modelling of our linkers is done by minimizing energy functions with different methods such as conjugate gradients and molecular dynamics. Finally, each modeled structure is assigned energy values, with which different models of the same structure can be compared. From Modeller we received about 8 different models and chose the one with the best energy scores to proceed with. For these refinement steps, one can choose different levels of optimization. 
We always favoured accuracy over speed. The result of a prediction for the lysozyme of bacteriophage lambda can be seen in figure 5.<br />
<br />
{{:Team:Heidelberg/templates/image-quarter|<br />
align=right|<br />
caption=Figure 5) Circular lysozyme|<br />
descr=The structure of circular lysozyme (yellow) was predicted and aligned to the linear structure (purple). The linker calculated by the software was marked green. |<br />
file=circ_lam_lys_nils.png}}<br />
<br />
<br />
Modeller was run by distributing the calculations via the [https://2014.igem.org/Team:Heidelberg/Software/igemathome iGEM@home] system. The modelling of one linker took about 10 hours of calculation time on average, a value that depends strongly on the size of the protein. The best model is then evaluated by another self-written program to analyze the behaviour of the linker patterns in their natural surroundings.<br />
Finally, all the models for the different structures and the different linkers are analyzed for properties such as the length of the helical patterns, the shape of the attachment structures of the linker and the angles produced by the angle patterns [[ Figure helix_winkel_messung.png]]. First the modeled structure and the natural structure are fitted to each other to see how large the differences between them are. If the protein has been disturbed too much, the model is discarded. We verify that the alpha helix patterns really form helices by measuring the distance between the first and the last amino acid of each pattern and checking whether it is compatible with an alpha helix. The first and last amino acids of each rod define a vector that is then used to calculate the angles between consecutive helical patterns. The frequency of all the determined lengths and angles is then further analyzed, using a strategy similar to the ArchDB analysis presented above.<br />
<br />
The whole process for the verification of the different linker patterns was set up on the [https://2014.igem.org/Team:Heidelberg/Software/igemathome distributed computing system]. Due to lack of time, only a few results could be analyzed, resulting in distributions for the different helices; see for example figures 6 and 7.<br />
This led to a refinement of the length of the 8 different alpha helix blocks presented above to 8.7, 10, 10.8, 15.6, 16.8, 24.8, 32.3 &Aring;.<br />
<br />
{{:Team:Heidelberg/templates/image-quarter|<br />
align=right|<br />
descr=|<br />
caption=Fig. 7. Length distribution of AEAAAKEAAAK|<br />
file=plot_of_AEAAAKEAAAK.png}}<br />
{{:Team:Heidelberg/templates/image-quarter|<br />
align=right|<br />
descr=|<br />
caption=Fig. 6. Length distribution of AEAAAKA |<br />
file=plot_of_AEAAAKA.png}}<br />
<br />
=Conclusion=<br />
The patterns that we identified by analyzing structure databases provide an easy and fast tool to build custom-shaped peptides. The main achievement is the identification of the angle patterns. These are designed as building blocks for enhanced applicability. The shapes were identified from a database of non-homologous proteins, and the patterns were refined until the distribution of surrounding amino acids looked random. Thus we can exclude that the angle distributions we observed are due to the surrounding sequences rather than to the identified patterns. We have not observed any evidence for certain helix patterns being preferred in the database.<br />
From figures 6 and 7 we learned that the lengths we had assumed for the helices needed to be refined. For example, we had assumed the AEAAAKA motif to span a distance of 10.5 &Aring; but observed it to be only 10 &Aring; long. Our [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software CRAUT software] was corrected accordingly.<br />
<br />
=References=<br />
<br />
[1] Vieille, C. & Zeikus, G.J. Hyperthermophilic enzymes: sources, uses, and molecular mechanisms for thermostability. Microbiology and molecular biology reviews : MMBR 65, 1-43 (2001).<br />
<br />
[2] Yu, Y. & Lutz, S. Circular permutation: a different way to engineer enzyme structure and function. Trends Biotechnol. 29, 18-25 (2011).<br />
<br />
[3] Arai, R., Ueda, H., Kitayama, A., Kamiya, N. & Nagamune, T. Design of the linkers which effectively separate domains of a bifunctional fusion protein. Protein Eng. 14, 529-532 (2001).<br />
<br />
[4] Wang, C.K.L., Kaas, Q., Chiche, L. & Craik, D.J. CyBase: A database of cyclic protein sequences and structures, with applications in protein discovery and engineering. Nucleic Acids Research 36, (2008).<br />
<br />
[5] Efimov, A. V. Standard structures in proteins. Prog. Biophys. Mol. Biol. 60, 201-239 (1993).<br />
<br />
[6] Donate, L. E., Rufino, S. D., Canard, L. H. & Blundell, T. L. Conformational analysis and clustering of short and medium size loops connecting regular secondary structures: a database for modeling and prediction. Protein Sci. 5, 2600-2616 (1996).<br />
<br />
[7] Bonet, J. et al. ArchDB 2014: structural classification of loops in proteins. Nucleic Acids Res. 42, D315-D319 (2014).<br />
<br />
[8] George, R. A. & Heringa, J. An analysis of protein domain linkers: their classification and role in protein folding. Protein Eng. 15, 871-879 (2002).<br />
<br />
[9] Chen, X., Zaro, J. L. & Shen, W.-C. Fusion protein linkers: property, design and functionality. Adv. Drug Deliv. Rev. 65, 1357-1369 (2013).<br />
<br />
[10] Fiser, A. et al. Modeling of loops in protein structures. Protein science : a publication of the Protein Society 9, 1753-73 (2000).<br />
<br />
[11] Fiser, A. & Sali, A. ModLoop: Automated modeling of loops in protein structures. Bioinformatics 19, 2500-2501 (2003).</div>Igemnilshttp://2014.igem.org/Team:Heidelberg/pages/Linker_ModelingTeam:Heidelberg/pages/Linker Modeling2014-10-17T21:01:36Z<p>Igemnils: /* Angle patterns */</p>
<hr />
<div>=Background=<br />
Primary, secondary, tertiary and quaternary structures are the main levels of protein structure characterization. Primary structure designates the amino acid sequence, while the secondary structure describes the arrangement of consecutive amino acids through their two dihedral angles $\phi$ and $\psi$. The Ramachandran plot, which represents the amino acid position in the space of those two angles, shows two particular arrangements commonly found in proteins: alpha helices and beta sheets. The next level of protein organization is the tertiary structure, which describes how the protein is organized in the three spatial dimensions, whereas the quaternary structure describes how different subunits of proteins cluster.<br />
Finally, closely related to these standard structures, the supersecondary structure describes how secondary structure elements are connected to each other. While these connections look undefined at first sight, further analysis revealed that this wide variety of supersecondary structure motifs can be clustered into certain patterns [[#References|[5]]].<br />
<br />
==Supersecondary structure==<br />
<br />
When the properties of supersecondary structures were first described, only very few patterns were identified, mainly due to the lack of high-resolution protein structures. At that time, the structures were mainly classified by the Ramachandran plot regions where the amino acids could be found [[#References|[6]]]. With the growing number of known crystal structures, the analysis of supersecondary structure improved and led to databases with about 150 000 classified loop structures and elaborate clustering [[#References|[7]]]. Nowadays supersecondary structures are defined as the structures built when two secondary structure elements are joined by a short peptide that is not itself part of a secondary structure element. These loop peptides range from 1 to 9 amino acids.<br />
Our aim was to build reliable stable linkers out of alpha helices connected by supersecondary structure motifs that produce certain angles. To achieve that, we searched for the most reliable alpha helix patterns that would form rigid rods and angle patterns covering the whole range of angles from 0 to 180 degrees.<br />
<br />
=Linker building block design=<br />
<br />
===Helix patterns===<br />
Various patterns have been used to build helical linkers to connect protein ends [[#References | [8]]]. Moreover, in known protein structures, linkers between subdomains can be identified and their properties have already been analyzed [[#References | [9]]]. Two main criteria were used to build the alpha helix patterns: they should robustly form alpha helices, and they should be soluble in aqueous solution. Therefore, we could not simply use linkers built only of alanine. Instead, we decided to add some charged amino acids to the pattern and to position them physically close to each other so that they could stabilize each other by Coulomb interaction. These amino acids needed to be separated by 3 amino acids, as a helical turn takes about 3.6 amino acids. The pattern we chose as most suitable for our purpose has also been described as one of the most stable [[#References | [3]]].<br />
8 alpha helix building blocks were eventually chosen: AEAAAK, AEAAAKA, AEAAAKAA, AEAAAKEAAAK, AEAAAKEAAAKA, AEAAAKEAAAKEAAAKA, AEAAAKEAAAKEAAAKEAAAKA, AEAAAKEAAAKEAAAKEAAAKEAAAKA, with a respective estimated length of 9, 10.5, 12, 16.5, 18, 25.5 and 33 &Aring;.<br />
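The estimated lengths listed above are consistent with a rise of about 1.5 &Aring; per residue, the usual textbook value for an ideal alpha helix. The following minimal sketch works under that assumption; the names <code>RISE_PER_RESIDUE</code>, <code>estimated_length</code> and <code>charged_pairs</code> are illustrative and are not taken from the CRAUT source code. It reproduces the seven listed values for the first seven blocks and shows that each glutamate is followed by a lysine four positions later, i.e. separated by 3 amino acids.

```python
# Assumption: an ideal alpha helix rises about 1.5 Angstrom per residue.
RISE_PER_RESIDUE = 1.5

# The eight helix building blocks named in the text.
HELIX_BLOCKS = [
    "AEAAAK",
    "AEAAAKA",
    "AEAAAKAA",
    "AEAAAKEAAAK",
    "AEAAAKEAAAKA",
    "AEAAAKEAAAKEAAAKA",
    "AEAAAKEAAAKEAAAKEAAAKA",
    "AEAAAKEAAAKEAAAKEAAAKEAAAKA",
]


def estimated_length(sequence, rise=RISE_PER_RESIDUE):
    """Estimated length of a helical block in Angstrom (residues x rise)."""
    return len(sequence) * rise


def charged_pairs(sequence):
    """Index pairs (i, i + 4) where a glutamate (E) is followed by a lysine (K).

    Residues i and i + 4 are separated by 3 amino acids, which places them
    roughly one helical turn (about 3.6 residues) apart.
    """
    return [
        (i, i + 4)
        for i in range(len(sequence) - 4)
        if sequence[i] == "E" and sequence[i + 4] == "K"
    ]


if __name__ == "__main__":
    for block in HELIX_BLOCKS:
        print(block, estimated_length(block), charged_pairs(block))
```

The measured lengths reported after the ''in silico'' refinement are slightly shorter than these idealized estimates, so the constant should be read as a starting guess rather than a fitted value.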
<br />
===Angle patterns===<br />
<br />
The angle patterns for our model were obtained from the ArchDB database [[#References | [7]]], which classifies loops from known protein structures. About 17 000 non-homologous proteins from the PDB database were analyzed, and 150 000 loop structures, i.e. regions connecting two secondary structure elements, were identified in them. The classification took into account not only the length of the loop and its conformation, meaning the φ and ψ backbone dihedral angles of the residues in the loop, but also the distance between the attachment points of the loop to the surrounding secondary structures. Furthermore, the secondary structures surrounding the loop and the geometry defined by the supersecondary structure motifs can be found in the database.<br />
<br />
To extract the supersecondary structure motifs relevant for our linker design from ArchDB, the complete database was downloaded and helix-loop-helix motifs were extracted using a [https://github.com/igemsoftware/Heidelberg_2014 self-written script] in the Python programming language. Of these, we only took into account loops composed of 1 or 2 amino acids, because longer loops are less frequent and therefore less reliable, and their ends lie further apart.<br />
The information of interest to us was (1) the angle between the vectors defining the surrounding alpha helices, (2) the distance between the ends of the loop, and (3) the type of amino acids surrounding the loop. Furthermore, we analyzed the statistical significance of the conformation. For each amino acid combination in the loop region, the angle distribution between the flanking alpha helices, the loop length distribution and a 2D heatmap of the surrounding amino acids were automatically plotted (Figures 1-4).<br />
<br />
These distributions were then visually analyzed to identify loops of interest for linker design. We focused on loops that showed a narrow angle distribution and that appeared frequently in the database, and we extended the distribution analysis to the surrounding amino acids. For example, once we had identified that T in a turn produces an interesting distribution (Fig. 1), we extended it by K in front and by A at the end and examined how the distribution behaved. Restricting the possibilities reduces the number of occurrences tremendously, but the angle and length distributions become narrower (Figures 2-4).<br />
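The stepwise restriction described above (T alone, then K before T, then K before T and A after it) can be sketched as a simple filter over loop records. This is a minimal illustration and not the team's script: the <code>LoopRecord</code> layout, the function names and the sample values are all hypothetical and do not reflect the ArchDB file format or real measurements.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass(frozen=True)
class LoopRecord:
    """Hypothetical record for one helix-loop-helix motif."""
    before: str   # residue directly preceding the loop
    loop: str     # loop sequence (1 or 2 residues)
    after: str    # residue directly following the loop
    angle: float  # angle between the flanking helices, in degrees


def select(records, loop, before=None, after=None):
    """Keep records with the given loop and, optionally, flanking residues."""
    return [
        r for r in records
        if r.loop == loop
        and (before is None or r.before == before)
        and (after is None or r.after == after)
    ]


def summarize(records):
    """Return (count, mean angle, spread); mean and spread are None if empty."""
    angles = [r.angle for r in records]
    if not angles:
        return (0, None, None)
    return (len(angles), mean(angles), pstdev(angles))


# Made-up sample values, for illustration only.
SAMPLE = [
    LoopRecord("K", "T", "A", 38.0),
    LoopRecord("K", "T", "A", 40.0),
    LoopRecord("G", "T", "A", 70.0),
    LoopRecord("K", "T", "S", 20.0),
    LoopRecord("K", "S", "A", 90.0),
]

if __name__ == "__main__":
    print(summarize(select(SAMPLE, "T")))
    print(summarize(select(SAMPLE, "T", before="K")))
    print(summarize(select(SAMPLE, "T", before="K", after="A")))
```

Each added constraint lowers the count while, in a case like the one shown in the figures, tightening the spread; this is the trade-off between occurrence and narrowness discussed in the text.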
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of__T_.png|<br />
descr= Figure 1) A loop composed of the amino acid T produces an interesting, well-defined angle distribution (left panel). The middle panel represents the distribution of lengths found with this loop. The right panel represents the distribution of amino acids found before and after the amino acid T in all the corresponding loops. One can see that the amino acid K is the most frequent before and A the most frequent after.}} <br />
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of_K_T_.png|<br />
descr= Figure 2) These distributions are subsets of the ones of Figure 1, with only the loops composed of the amino acid T preceded by the amino acid K. This constraint considerably narrows down the angle and length distributions.}} <br />
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of__T_A.png|<br />
descr= Figure 3) Same procedure as in Figure 2, with the loops composed of the amino acid T followed by the amino acid A. This also narrows down the angle and length distributions compared to Figure 1, although not as much as in Figure 2.}} <br />
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of_K_T_A.png|<br />
descr= Figure 4) The distributions correspond to loops composed of the amino acids KTA and therefore to subsets of all the preceding ones. One can notice that the angle distribution becomes very narrow, but also that the frequency is reduced compared to the preceding ones.}}<br />
<br />
This step had two main goals: narrowing down the angle distribution and finding loops with no preference for the amino acids surrounding them. This last point was important for the modularity of our approach: the angle blocks should not be affected by the surrounding alpha helices. Using this approach, 10 different angle motifs (table 1) could be identified producing different angles. Importantly, all these motifs were chosen so that the length of the two segments starting from the turning point was the same for all of them (put figure turning point). It should also be noted that 3 of the patterns, KTA, LVA and AAIAP, produce the same angle. They were nevertheless all kept for the ''in silico'' refinement described below, but only one was used for the [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software CRAUT software].<br />
<br />
<br />
{| class="table table-hover" style="text-align: center;"<br />
|+'''table 1''': The span of parameters.<br />
!colspan="10"|Angle Patterns<br />
|-<br />
|Pattern <br />
| NVL <br />
| KTA <br />
| LVA <br />
| AAIAP <br />
| AADGTL <br />
| VNLTA<br />
| AAAHPEA<br />
| ASLPAA <br />
| ATGDLA<br />
|-<br />
|Mean <br />
| 29.7 <br />
| 38.7 <br />
| 35 <br />
| 36.5 <br />
| 60 <br />
| 74.5 <br />
| 117 <br />
| 140 <br />
| 160 <br />
|-<br />
| Variation <br />
| 8.5 <br />
| 30 <br />
| 29 <br />
| 27 <br />
| 12 <br />
| 27 <br />
| 12 <br />
| 15 <br />
| 5 <br />
|}<br />
<br />
===Sequences to connect the alpha helix to the protein extremity===<br />
<br />
Helical patterns often affect the folding of the attached sequences. To prevent them from affecting the structure of the protein of interest, we analyzed the effects of various amino acids on helix formation ''in silico'' using online tools like [http://bioserv.rpbs.univ-paris-diderot.fr/services/PEP-FOLD/ pep-fold]. We identified glycines and prolines as reliable amino acids to interrupt helix formation. We then decided to use glycine pairs to connect the protein of interest to the linker, because they give more flexibility to the orientation of the first helix.<br />
This only concerns the protein end that was connected to the linker through the coding sequence. The other end is ligated through exteins or a sortase scar, both treated as unstructured flexible regions.<br />
<br />
===''Conclusion''===<br />
<br />
The last three parts show how we could design alpha helix and angle pattern blocks and connect them to each other and to the protein of interest. They provide the material that allowed us to transform a linker defined as a geometrical path into a real amino acid sequence in our [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software software].<br />
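As an illustration of how a geometrical path could be turned into a sequence from these building blocks, the sketch below picks, for each segment, the helix block whose estimated length is closest to the requested one and, for each turn, the angle pattern whose mean angle (table 1) is closest. This is only a simplified, hypothetical stand-in and not the algorithm used by the CRAUT software: the nearest-value rule, the function names, the choice of KTA as the single representative of the three equivalent patterns, and the placement of the glycine pair at the start are all assumptions.

```python
# Estimated lengths in Angstrom for the helix blocks that have a listed value.
HELIX_LENGTHS = {
    "AEAAAK": 9.0,
    "AEAAAKA": 10.5,
    "AEAAAKAA": 12.0,
    "AEAAAKEAAAK": 16.5,
    "AEAAAKEAAAKA": 18.0,
    "AEAAAKEAAAKEAAAKA": 25.5,
    "AEAAAKEAAAKEAAAKEAAAKA": 33.0,
}

# Mean angles in degrees from table 1. KTA is kept as the only representative
# of KTA, LVA and AAIAP (an arbitrary choice made for this sketch).
ANGLE_MEANS = {
    "NVL": 29.7,
    "KTA": 38.7,
    "AADGTL": 60.0,
    "VNLTA": 74.5,
    "AAAHPEA": 117.0,
    "ASLPAA": 140.0,
    "ATGDLA": 160.0,
}


def closest(table, target):
    """Return the key of `table` whose value is nearest to `target`."""
    return min(table, key=lambda key: abs(table[key] - target))


def path_to_sequence(segment_lengths, angles, connector="GG"):
    """Translate segment lengths (Angstrom) and turn angles (degrees) into a
    linker sequence, starting with a flexible glycine pair."""
    if len(angles) != len(segment_lengths) - 1:
        raise ValueError("need exactly one angle between consecutive segments")
    parts = [connector]
    for index, length in enumerate(segment_lengths):
        parts.append(closest(HELIX_LENGTHS, length))
        if index < len(angles):
            parts.append(closest(ANGLE_MEANS, angles[index]))
    return "".join(parts)


if __name__ == "__main__":
    print(path_to_sequence([10.0, 17.0], [62.0]))
```

A real implementation would also have to account for the spread of each angle pattern and for the geometry of the protein ends, which this sketch ignores.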
<br />
Thanks to this we could design linkers to circularize [https://2014.igem.org/Team:Heidelberg/Project/PCR_2.0 DNMT1] and [https://2014.igem.org/Team:Heidelberg/Project/Linker_Screening lysozyme]. Additionally, it was shown for lysozyme that a custom-tailored linker enhances heat stability compared to a badly designed linker and even to the flexible linker.<br />
<br />
=''In silico'' refinement=<br />
<br />
As some of the interesting patterns could not be found often enough to be statistically significant, we decided to make a further refinement ''in silico'' by modeling the structure of proteins with circularizing linkers. To perform this for realistic situations, we selected, from the RCSB database, structures of non-homologous target proteins whose extremities are far enough apart to require a linker for circularization. First, the [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software CRAUT software] generates possible fitting linkers for various proteins. From these possible linkers, the 100 shortest were taken. Among the 3 angle patterns that generate the same angle, the software provides only one, but the linker refinement developed here used all three for comparison. We assume that the linkers connect the ends of the protein without putting tension on the protein, so that the protein can fold in its natural way.<br />
<br />
After this, the circularized proteins with the specific linkers are modelled using a software called Modeller [[#References|[10]]]. This software is widely used for comparative structure prediction. It is well established in the scientific community and should be most suitable for the prediction of loop regions attached to existing structures [[#References|[11]]]. It is freely available for academic use from the [http://salilab.org/modeller/ salilab] webpages. The program is able to predict the 3D structure of a given sequence based on an alignment with a given structure. In our script, Modeller needed to be provided with the sequence with the linker attached and with the PDB file of the protein of interest. This latter file is the only structural information used here. Modeller is recommended when at least 30% sequence similarity exists between the provided structure and the one that the user wishes to model. Here, only the linker is different, so this similarity is around 90% in our case. At first, Modeller makes an alignment between the provided structure and the sequence of our linked protein, identifying the regions that cannot be found in the structure. Based on that, Modeller generates 4 initial models. One of the strengths of the software is its capacity to further refine only certain parts of the protein. Thus we let Modeller refine the loops, which are defined as any part of the protein that could not be found in the structure file. The ''ab initio'' modelling of our linkers is made by minimizing energy functions with different methods like conjugate gradients and molecular dynamics. Eventually, each modeled structure is provided with energy values, thanks to which different models of the same structure can be compared. From Modeller we received about 8 different models and chose the one with the best energy scores to proceed further. For these refinement steps, one can choose different levels of optimization. 
We always favored accuracy over speed of the program. The result of a prediction for lysozyme of bacteriophage lambda can be seen in figure 5.<br />
<br />
{{:Team:Heidelberg/templates/image-quarter|<br />
align=right|<br />
caption=Figure 5) Circular lysozyme|<br />
descr=The structure of circular lysozyme (yellow) was predicted and aligned to the linear structure (purple). The linker calculated by the software was marked green. |<br />
file=circ_lam_lys_nils.png}}<br />
<br />
<br />
Modeller was run by distributing the calculations via the [https://2014.igem.org/Team:Heidelberg/Software/igemathome iGEM@home system]. Modelling one linker took about 10 hours of calculation time on average, a value that depends strongly on the size of the protein. The best model is then evaluated by another self-written program to analyze the behaviour of the linker patterns in their natural surroundings.<br />
Finally, all the models for the different structures and the different linkers are analyzed for properties such as the length of the helical patterns, the shape of the attachment structures of the linker and the angles produced by the angle patterns [[ Figure helix_winkel_messung.png]]. First, the modeled structure and the natural structure are superimposed to see how large the differences between them are. If the protein has been disturbed too much, the model is discarded. We check that the alpha helix patterns really form helices by measuring the distance between the first and the last amino acid of each pattern and checking whether it is compatible with an alpha helix. The first and last amino acids of each rod define a vector that is then used to calculate the angles between consecutive helical patterns. The frequency of all the determined lengths and angles is then further analyzed, using a strategy similar to the ArchDB analysis presented above.<br />
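The two geometric checks described above, the end-to-end distance of each helical block and the angle between the vectors of consecutive blocks, can be sketched as follows. This is a minimal illustration and not the team's evaluation program: the function names, the 1.5 &Aring; rise per residue, the 20% tolerance and the convention of measuring the angle directly between the two direction vectors are all assumptions. Coordinates are taken to be the 3D positions, in &Aring;, of the first and last residue of each block.

```python
import math


def vector(start, end):
    """Vector pointing from `start` to `end` (3D coordinate tuples)."""
    return tuple(e - s for s, e in zip(start, end))


def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(c * c for c in v))


def angle_between(u, v):
    """Angle in degrees between two non-zero vectors."""
    cosine = sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))
    cosine = max(-1.0, min(1.0, cosine))  # guard against rounding errors
    return math.degrees(math.acos(cosine))


def is_helix_like(start, end, n_residues, rise=1.5, tolerance=0.2):
    """Check whether the end-to-end distance of a block matches the length
    expected for an alpha helix of `n_residues` (within a relative tolerance)."""
    expected = n_residues * rise
    return abs(norm(vector(start, end)) - expected) <= tolerance * expected


if __name__ == "__main__":
    # Hypothetical coordinates of two consecutive 7-residue blocks.
    first = ((0.0, 0.0, 0.0), (10.0, 0.0, 0.0))
    second = ((11.0, 1.0, 0.0), (11.0, 11.0, 0.0))
    print(is_helix_like(*first, n_residues=7))
    print(angle_between(vector(*first), vector(*second)))
```

Collecting these distances and angles over all accepted models yields length and angle distributions of the kind analyzed for the ArchDB loops.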
<br />
The whole process for the verification of the different linker patterns was set up on the [https://2014.igem.org/Team:Heidelberg/Software/igemathome distributed computing system]. Due to lack of time, only a few results could be analyzed, resulting in distributions for the different helices; see for example figures 6 and 7.<br />
This led to a refinement of the lengths of the 8 different alpha helix blocks presented above to 8.7, 10, 10.8, 15.6, 16.8, 24.8, 32.3 &Aring;.<br />
<br />
{{:Team:Heidelberg/templates/image-quarter|<br />
align=right|<br />
descr=|<br />
caption=Fig. 7. Length distribution of AEAAAKEAAAK|<br />
file=plot_of_AEAAAKEAAAK.png}}<br />
{{:Team:Heidelberg/templates/image-quarter|<br />
align=right|<br />
descr=|<br />
caption=Fig. 6. Length distribution of AEAAAKA |<br />
file=plot_of_AEAAAKA.png}}<br />
<br />
=Conclusion=<br />
The patterns that we identified by analyzing structure databases provide an easy and fast tool to build custom-shaped peptides. The main achievement is the identification of the angle patterns. These are designed as building blocks for enhanced applicability. The shapes were identified from a database of non-homologous proteins, and the patterns were refined until the distribution of surrounding amino acids looked random. Thus we can exclude that the angle distributions we observed are due to the surrounding sequences rather than to the identified patterns. We have not observed any evidence for certain helix patterns being preferred in the database.<br />
From figures 6 and 7 we learned that the lengths we had assumed for the helices needed to be refined. For example, we had assumed the AEAAAKA motif to span a distance of 10.5 &Aring; but observed it to be only 10 &Aring; long. Our [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software CRAUT software] was corrected accordingly.<br />
<br />
=References=<br />
<br />
[1] Vieille, C. & Zeikus, G.J. Hyperthermophilic enzymes: sources, uses, and molecular mechanisms for thermostability. Microbiology and molecular biology reviews : MMBR 65, 1-43 (2001).<br />
<br />
[2] Yu, Y. & Lutz, S. Circular permutation: a different way to engineer enzyme structure and function. Trends Biotechnol. 29, 18-25 (2011).<br />
<br />
[3] Arai, R., Ueda, H., Kitayama, A., Kamiya, N. & Nagamune, T. Design of the linkers which effectively separate domains of a bifunctional fusion protein. Protein Eng. 14, 529-532 (2001).<br />
<br />
[4] Wang, C.K.L., Kaas, Q., Chiche, L. & Craik, D.J. CyBase: A database of cyclic protein sequences and structures, with applications in protein discovery and engineering. Nucleic Acids Research 36, (2008).<br />
<br />
[5] Efimov, A. V. Standard structures in proteins. Prog. Biophys. Mol. Biol. 60, 201-239 (1993).<br />
<br />
[6] Donate, L. E., Rufino, S. D., Canard, L. H. & Blundell, T. L. Conformational analysis and clustering of short and medium size loops connecting regular secondary structures: a database for modeling and prediction. Protein Sci. 5, 2600-2616 (1996).<br />
<br />
[7] Bonet, J. et al. ArchDB 2014: structural classification of loops in proteins. Nucleic Acids Res. 42, D315-D319 (2014).<br />
<br />
[8] George, R. A. & Heringa, J. An analysis of protein domain linkers: their classification and role in protein folding. Protein Eng. 15, 871-879 (2002).<br />
<br />
[9] Chen, X., Zaro, J. L. & Shen, W.-C. Fusion protein linkers: property, design and functionality. Adv. Drug Deliv. Rev. 65, 1357-1369 (2013).<br />
<br />
[10] Fiser, A. et al. Modeling of loops in protein structures. Protein science : a publication of the Protein Society 9, 1753-73 (2000).<br />
<br />
[11] Fiser, A. & Sali, A. ModLoop: Automated modeling of loops in protein structures. Bioinformatics 19, 2500-2501 (2003).</div>Igemnilshttp://2014.igem.org/Team:Heidelberg/pages/Linker_ModelingTeam:Heidelberg/pages/Linker Modeling2014-10-17T21:00:42Z<p>Igemnils: /* Angle patterns */</p>
<hr />
<div>=Background=<br />
Primary, secondary, tertiary and quaternary structures are the main levels of protein structure characterization. Primary structure designates the amino acid sequence, while the secondary structure describes the arrangement of consecutive amino acids through their two dihedral angles $\phi$ and $\psi$. The Ramachandran plot, which represents the amino acid position in the space of those two angles, shows two particular arrangements commonly found in proteins: alpha helices and beta sheets. The next level of protein organization is the tertiary structure, which describes how the protein is organized in the three spatial dimensions, whereas the quaternary structure describes how different subunits of proteins cluster.<br />
Finally, closely related to these standard structures, the supersecondary structure describes how secondary structure elements are connected to each other. While these connections look undefined at first sight, further analysis revealed that this wide variety of supersecondary structure motifs can be clustered into certain patterns [[#References|[5]]].<br />
<br />
==Supersecondary structure==<br />
<br />
When the properties of supersecondary structures were first described, only very few patterns were identified, mainly due to the lack of high-resolution protein structures. At that time, the structures were mainly classified by the Ramachandran plot regions where the amino acids could be found [[#References|[6]]]. With the growing number of known crystal structures, the analysis of supersecondary structure improved and led to databases with about 150 000 classified loop structures and elaborate clustering [[#References|[7]]]. Nowadays supersecondary structures are defined as the structures built when two secondary structure elements are joined by a short peptide that is not itself part of a secondary structure element. These loop peptides range from 1 to 9 amino acids.<br />
Our aim was to build reliable stable linkers out of alpha helices connected by supersecondary structure motifs that produce certain angles. To achieve that, we searched for the most reliable alpha helix patterns that would form rigid rods and angle patterns covering the whole range of angles from 0 to 180 degrees.<br />
<br />
=Linker building block design=<br />
<br />
===Helix patterns===<br />
Various patterns have been used to build helical linkers to connect protein ends [[#References | [8]]]. Moreover, in known protein structures, linkers between subdomains can be identified and their properties have already been analyzed [[#References | [9]]]. Two main criteria were used to build the alpha helix patterns: they should robustly form alpha helices, and they should be soluble in aqueous solution. Therefore, we could not simply use linkers built only of alanine. Instead, we decided to add some charged amino acids to the pattern and to position them physically close to each other so that they could stabilize each other by Coulomb interaction. These amino acids needed to be separated by 3 amino acids, as a helical turn takes about 3.6 amino acids. The pattern we chose as most suitable for our purpose has also been described as one of the most stable [[#References | [3]]].<br />
8 alpha helix building blocks were eventually chosen: AEAAAK, AEAAAKA, AEAAAKAA, AEAAAKEAAAK, AEAAAKEAAAKA, AEAAAKEAAAKEAAAKA, AEAAAKEAAAKEAAAKEAAAKA, AEAAAKEAAAKEAAAKEAAAKEAAAKA, with a respective estimated length of 9, 10.5, 12, 16.5, 18, 25.5 and 33 &Aring;.<br />
<br />
===Angle patterns===<br />
<br />
The angle patterns for our model were obtained from the ArchDB database [[#References | [7]]], which classifies loops from known protein structures. About 17 000 non-homologous proteins from the PDB database were analyzed, and 150 000 loop structures, i.e. regions connecting two secondary structure elements, were identified in them. The classification took into account not only the length of the loop and its conformation, meaning the φ and ψ backbone dihedral angles of the residues in the loop, but also the distance between the attachment points of the loop to the surrounding secondary structures. Furthermore, the secondary structures surrounding the loop and the geometry defined by the supersecondary structure motifs can be found in the database.<br />
<br />
To extract the supersecondary structure motifs relevant for our linker design from ArchDB, the complete database was downloaded and helix-loop-helix motifs were extracted using a [https://github.com/igemsoftware/Heidelberg_2014 self-written script] in the Python programming language. Of these, we only took into account loops composed of 1 or 2 amino acids, because longer loops are less frequent and therefore less reliable, and their ends lie further apart.<br />
The information of interest to us was (1) the angle between the vectors defining the surrounding alpha helices, (2) the distance between the ends of the loop, and (3) the type of amino acids surrounding the loop. Furthermore, we analyzed the statistical significance of the conformation. For each amino acid combination in the loop region, the angle distribution between the flanking alpha helices, the loop length distribution and a 2D heatmap of the surrounding amino acids were automatically plotted (Figures 1-4).<br />
<br />
These distributions were then visually analyzed to identify loops of interest for linker design. We focused on loops that showed a narrow angle distribution and that appeared frequently in the database, and we extended the distribution analysis to the surrounding amino acids. For example, once we had identified that T in a turn produces an interesting distribution (Fig. 1), we extended it by K in front and by A at the end and examined how the distribution behaved. Restricting the possibilities reduces the number of occurrences tremendously, but the angle and length distributions become narrower (Figures 2-4).<br />
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of__T_.png|<br />
descr= Figure 1) A loop composed of the amino acid T produces an interesting, well-defined angle distribution (left panel). The middle panel represents the distribution of lengths found with this loop. The right panel represents the distribution of amino acids found before and after the amino acid T in all the corresponding loops. One can see that the amino acid K is the most frequent before and A the most frequent after.}} <br />
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of_K_T_.png|<br />
descr= Figure 2) These distributions are subsets of the ones of Figure 1, with only the loops composed of the amino acid T preceded by the amino acid K. This constraint considerably narrows down the angle and length distributions.}} <br />
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of__T_A.png|<br />
descr= Figure 3) Same procedure as in Figure 2, with the loops composed of the amino acid T followed by the amino acid A. This also narrows down the angle and length distributions compared to Figure 1, although not as much as in Figure 2.}} <br />
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of_K_T_A.png|<br />
descr= Figure 4) The distributions correspond to loops composed of the amino acids KTA and therefore to subsets of all the preceding ones. One can notice that the angle distribution becomes very narrow, but also that the frequency is reduced compared to the preceding ones.}}<br />
<br />
This step had two main goals: narrowing down the angle distribution and finding loops with no preference for the amino acids surrounding them. This last point was important for the modularity of our approach: the angle blocks should not be affected by the surrounding alpha helices. Using this approach, 10 different angle motifs (table 1) could be identified producing different angles. Importantly, all these motifs were chosen so that the length of the two segments starting from the turning point was the same for all of them (put figure turning point). It should also be noted that 3 of the patterns, KTA, LVA and AAIAP, produce the same angle. They were nevertheless all kept for the ''in silico'' refinement described below, but only one was used for the [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software CRAUT software].<br />
<br />
<br />
{| class="table table-hover" style="text-align: center;"<br />
|+'''table 1''': The span of parameters.<br />
!colspan="10"|Angle Patterns<br />
|-<br />
|Pattern <br />
| NVL <br />
| KTA <br />
| LVA <br />
| AAIAP <br />
| AADGTL <br />
| VNLTA<br />
| AAAHPEA<br />
| ASLPAA <br />
| ATGDLA<br />
|-<br />
|Mean <br />
| 29.7 <br />
| 38.7 <br />
| 35 <br />
| 36.5 <br />
| 60 <br />
| 74.5 <br />
| 117 <br />
| 140 <br />
| 160 <br />
|-<br />
| Variation <br />
| 8.5 <br />
| 30 <br />
| 29 <br />
| 27 <br />
| 12 <br />
| 27 <br />
| 12 <br />
| 15 <br />
| 5 <br />
|}<br />
<br />
===Sequences to connect the alpha helix to the protein extremity===<br />
<br />
Helical patterns often affect the folding of the attached sequences. To prevent them from affecting the structure of the protein of interest, we analyzed the effects of various amino acids on helix formation ''in silico'' using online tools like [http://bioserv.rpbs.univ-paris-diderot.fr/services/PEP-FOLD/ pep-fold]. We identified glycines and prolines as reliable amino acids to interrupt helix formation. We then decided to use glycine pairs to connect the protein of interest to the linker, because they give more flexibility to the orientation of the first helix.<br />
This only concerns the protein end that was connected to the linker through the coding sequence. The other end is ligated through an extein or sortase scar, both treated as unstructured flexible regions.<br />
<br />
===''Conclusion''===<br />
<br />
The last three parts show how we could design alpha helix and angle pattern blocks and connect them to each other and to the protein of interest. They provide the material that allowed us to transform a linker defined as a geometrical path into a real amino acid sequence in our [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software software].<br />
<br />
Thanks to this we could design linkers to circularize [https://2014.igem.org/Team:Heidelberg/Project/PCR_2.0 DNMT1] and [https://2014.igem.org/Team:Heidelberg/Project/Linker_Screening lysozyme]. Additionally, it was shown for lysozyme that a custom-tailored linker enhances heat stability compared to a badly designed linker and even to the flexible linker.<br />
<br />
=''In silico'' refinement=<br />
<br />
As some of the interesting patterns could not be found often enough to be statistically significant, we decided to make a further refinement ''in silico'' by modeling the structure of proteins with circularizing linkers. To perform this for realistic situations, we selected, from the RCSB database, structures of non-homologous target proteins with extremities that are separated enough to require a linker for circularization. First, the [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software CRAUT software] generates possible fitting linkers for various proteins. From these possible linkers, the 100 shortest were taken. Among the 3 possible angle patterns that generate the same angle, the software provides only one. But the linker refinement developed here used the three of them for comparison. We assume that the linkers connect the ends of the protein without setting tension on the protein, so that the protein can fold in its natural way.<br />
<br />
After this, the circularized proteins with the specific linkers are modelled using the software Modeller [[#References|[10]]]. This software is widely used for comparative structure prediction. It is well established in the scientific community and should be most suitable for prediction of loop regions attached to existing structures [[#References|[11]]]. It is freely available for academic use from the [http://salilab.org/modeller/ salilab] webpages. The program is able to predict the 3D structure of a given sequence based on an alignment with a given structure. In our script, Modeller needed to be provided with the sequence with the linker attached and with the PDB file of the protein of interest. This latter file is the only structural information used here. Modeller is recommended when at least 30% sequence similarity exists between the provided structure and the one that the user wishes to model. Here, only the linker is different, so this similarity is around 90% in our case. Modeller first makes an alignment between the provided structure and the sequence of our linked protein, identifying the regions that cannot be found in the structure. Based on that, Modeller generates 4 initial models. One of the strengths of the software is its capacity to further refine only certain parts of the protein. Thus we let Modeller refine the loops, which are defined as any part of the protein that could not be found in the structure file. The ''ab initio'' modelling of our linkers is done by minimizing energy functions with different methods like conjugate gradients and molecular dynamics. Eventually, each modeled structure is provided with energy values, thanks to which different models of the same structure can be compared. From Modeller we received about 8 different models and chose the one with the best energy scores to proceed further. For these refinement steps, one can choose different levels of optimization. 
We always favoured accuracy over speed of the program. The result of a prediction for lysozyme of bacteriophage lambda can be seen in figure 5.<br />
<br />
{{:Team:Heidelberg/templates/image-quarter|<br />
align=right|<br />
caption=Figure 5) Circular lysozyme|<br />
descr=The structure of circular lysozyme (yellow) was predicted and aligned to the linear structure (purple). The linker calculated by the software was marked green. |<br />
file=circ_lam_lys_nils.png}}<br />
<br />
<br />
Modeller was run by distributing the calculations via the [https://2014.igem.org/Team:Heidelberg/Software/igemathome iGEM@home system]. The modelling of one linker took about 10 hours of calculation time on average, a value that depends strongly on the size of the protein. The best model is then evaluated by another self-written program to analyze the behaviour of the linker patterns in their natural surroundings.<br />
Finally, all the models for the different structures and the different linkers are analyzed for properties like the length of the helical patterns, the shape of the attachment structures of the linker and the angles produced by the angle patterns [[ Figure helix_winkel_messung.png]]. First the modeled structure and the natural structure are superimposed to see how large the differences between them are. If the protein has been disturbed too much, the model is discarded. We check that the alpha helix patterns really generate the expected structure by measuring the distance between the first and the last amino acid of each of them and checking whether it is compatible with an alpha helix. The first and last amino acids of each rod define a vector that is then used to calculate the angles between consecutive helical patterns. The frequency of all the determined lengths and angles is then further analyzed, using a strategy similar to the ArchDB analysis presented above.<br />
<br />
The whole process for the verification of the different linker patterns was set up on the [https://2014.igem.org/Team:Heidelberg/Software/igemathome distributed computing system]. Due to lack of time, only a few results could be analyzed, resulting in distributions for the different helices; see for example figures 6 and 7.<br />
This led to a refinement of the lengths of the 8 different alpha helix blocks presented above to 8.7, 10, 10.8, 15.6, 16.8, 24.8 and 32.3 &Aring;.<br />
<br />
{{:Team:Heidelberg/templates/image-quarter|<br />
align=right|<br />
descr=|<br />
caption=Fig. 7. Length distribution of AEAAAKEAAAK|<br />
file=plot_of_AEAAAKEAAAK.png}}<br />
{{:Team:Heidelberg/templates/image-quarter|<br />
align=right|<br />
descr=|<br />
caption=Fig. 6. Length distribution of AEAAAKA |<br />
file=plot_of_AEAAAKA.png}}<br />
<br />
=Conclusion=<br />
The patterns that we identified by analyzing structure databases provide an easy and fast tool to build custom-shaped peptides. The main achievement is the identification of the angle patterns. These are designed as building blocks for enhanced applicability. The shapes were identified from a database of non-homologous proteins and the patterns were refined until the surrounding amino acids looked randomly distributed. Thus we can exclude that the angle distributions we observed are due to the surrounding sequences rather than to the identified patterns. We have not observed any evidence for certain helix patterns being preferred in the database.<br />
From figures 6 and 7 we have learned that the lengths we have assumed for the helices needed to be refined. For example we had assumed the AEAAAKA motif to span a distance of 10.5 &Aring; but have observed it to be only 10 &Aring; long. Our [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software CRAUT software] was accordingly corrected.<br />
<br />
=References=<br />
<br />
[1] Vieille, C. & Zeikus, G.J. Hyperthermophilic enzymes: sources, uses, and molecular mechanisms for thermostability. Microbiology and molecular biology reviews : MMBR 65, 1-43 (2001).<br />
<br />
[2] Yu, Y. & Lutz, S. Circular permutation: a different way to engineer enzyme structure and function. Trends Biotechnol. 29, 18-25 (2011).<br />
<br />
[3] Arai, R., Ueda, H., Kitayama, A., Kamiya, N. & Nagamune, T. Design of the linkers which effectively separate domains of a bifunctional fusion protein. Protein Eng. 14, 529-532 (2001).<br />
<br />
[4] Wang, C.K.L., Kaas, Q., Chiche, L. & Craik, D.J. CyBase: A database of cyclic protein sequences and structures, with applications in protein discovery and engineering. Nucleic Acids Research 36, (2008).<br />
<br />
[5] Efimov, A. V. Standard structures in proteins. Prog. Biophys. Mol. Biol. 60, 201–239 (1993).<br />
<br />
[6] Donate, L. E., Rufino, S. D., Canard, L. H. & Blundell, T. L. Conformational analysis and clustering of short and medium size loops connecting regular secondary structures: a database for modeling and prediction. Protein Sci. 5, 2600–2616 (1996).<br />
<br />
[7] Bonet, J. et al. ArchDB 2014: structural classification of loops in proteins. Nucleic Acids Res. 42, D315–D319 (2014).<br />
<br />
[8] George, R. A. & Heringa, J. An analysis of protein domain linkers: their classification and role in protein folding. Protein Eng. 15, 871-879 (2002).<br />
<br />
[9] Chen, X., Zaro, J. L. & Shen, W.-C. Fusion protein linkers: property, design and functionality. Adv. Drug Deliv. Rev. 65, 1357-1369 (2013).<br />
<br />
[10] Fiser, A. et al. Modeling of loops in protein structures. Protein science : a publication of the Protein Society 9, 1753-73 (2000).<br />
<br />
[11] Fiser, A. & Sali, A. ModLoop: Automated modeling of loops in protein structures. Bioinformatics 19, 2500-2501 (2003).</div>Igemnilshttp://2014.igem.org/Team:Heidelberg/pages/Linker_ModelingTeam:Heidelberg/pages/Linker Modeling2014-10-17T20:59:10Z<p>Igemnils: /* Angle patterns */</p>
<hr />
<div>=Background=<br />
Primary, secondary, tertiary and quaternary structures are the main levels of protein structure characterization. Primary structure designates the amino acid sequence, while the secondary structure describes the arrangement of consecutive amino acids through their two dihedral angles $\phi$ and $\psi$. The Ramachandran plot, which represents the amino acid position in the space of those two angles, shows two particular arrangements commonly found in proteins: alpha helices and beta sheets. The next level of protein organization is the tertiary structure, which describes how the protein is organized in the three spatial dimensions, whereas the quaternary structure describes how different subunits of proteins cluster.<br />
Finally, closely related to these standard structures, the supersecondary structure describes how secondary structure elements are connected to each other. While these connections look undefined at first sight, further analysis revealed that this wide variety of supersecondary structure motifs can be clustered to certain patterns [[#References|[5]]].<br />
<br />
==Supersecondary structure==<br />
<br />
When the properties of supersecondary structures were first described, only very few patterns were identified, mainly due to the lack of highly resolved protein structures. At that time, the structures were mainly classified by the Ramachandran plot regions where the amino acids could be found [[#References|[6]]]. With the growing number of known crystal structures, the analysis of supersecondary structure improved and led to databases with about 150 000 classified loop structures and elaborate clustering [[#References|[7]]]. Nowadays supersecondary structures are defined as the structures built when two secondary structure elements are joined by a short peptide that is not assigned to either of the secondary structures. These loop peptides range from 1 to 9 amino acids.<br />
Our aim was to build reliable stable linkers out of alpha helices connected by supersecondary structure motifs that produce certain angles. To achieve that, we searched for the most reliable alpha helix patterns that would form rigid rods and angle patterns covering the whole range of angles from 0 to 180 degrees.<br />
<br />
=linker building block design=<br />
<br />
===Helix patterns===<br />
Various patterns have been used to build helical linkers to connect protein ends [[#References | [8]]]. Moreover, in known protein structures, linkers between subdomains can be identified and their properties have already been analyzed [[#References | [9]]]. Two main criteria were used to build the alpha helix patterns: they should robustly form alpha helices, and they should be soluble in aqueous solution. Therefore, we could not just use linkers built of alanine. So we decided to add some charged amino acids to the pattern, and to position them physically close to each other so that they could stabilize each other by Coulomb interaction. These amino acids needed to be separated by 3 amino acids, as a helical turn takes about 3.6 amino acids. The pattern we chose as most suitable for our purpose was also described to be one of the most stable [[#References | [3]]].<br />
8 alpha helix building blocks were eventually chosen: AEAAAK, AEAAAKA, AEAAAKAA, AEAAAKEAAAK, AEAAAKEAAAKA, AEAAAKEAAAKEAAAKA, AEAAAKEAAAKEAAAKEAAAKA, AEAAAKEAAAKEAAAKEAAAKEAAAKA, with respective estimated lengths of 9, 10.5, 12, 16.5, 18, 25.5 and 33 &Aring;.<br />
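The estimated lengths listed above are consistent with a rise of 1.5 &Aring; per residue along the helix axis, the textbook value for an ideal alpha helix. The following minimal sketch reproduces the seven listed values from the first seven blocks; the constant and the function name are illustrative assumptions and are not taken from the CRAUT source code.<br />

```python
# Rise per residue along the axis of an ideal alpha helix, in angstrom.
# Assumption: the length estimates in the text were derived from this value.
RISE_PER_RESIDUE = 1.5

# The first seven helix building blocks, in the order given in the text.
HELIX_BLOCKS = [
    "AEAAAK",
    "AEAAAKA",
    "AEAAAKAA",
    "AEAAAKEAAAK",
    "AEAAAKEAAAKA",
    "AEAAAKEAAAKEAAAKA",
    "AEAAAKEAAAKEAAAKEAAAKA",
]


def estimated_length(sequence, rise=RISE_PER_RESIDUE):
    """Estimate the length of a helical block as residues times rise."""
    return len(sequence) * rise


lengths = [estimated_length(block) for block in HELIX_BLOCKS]
for block, length in zip(HELIX_BLOCKS, lengths):
    print(f"{block}: {length} A")
```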
<br />
===Angle patterns===<br />
<br />
The angle patterns for our model were obtained from the ArchDB database [[#References | [7]]], which classifies loops from known protein structures. About 17 000 non-homologous proteins from the PDB database were analyzed and from them, 150 000 loop structures, i.e. regions connecting two secondary structure elements, were identified. The classification took into account not only the length of the loop and its conformation, meaning the φ and ψ backbone dihedral angles of the residues in the loop, but also the distance between the attachments of the loop to the surrounding secondary structures. Furthermore, the secondary structures surrounding the loop and the geometry defined by the super-secondary structure motifs can be found in the database.<br />
<br />
To extract from ArchDB the relevant supersecondary structure motifs for our linker design, the complete database was downloaded and helix-loop-helix motifs were extracted using a [https://github.com/igemsoftware/Heidelberg_2014 self-written script] in Python. From them we only took into account loops composed of 1 and 2 amino acids, because the longer the loops, the less frequent and therefore the less reliable they are, and the further the ends are from each other.<br />
The interesting information for us was (1) the angle produced between the vectors defining the surrounding alpha helices, (2) the distance between the ends of the loop, and (3) the type of amino acids surrounding the loop. Furthermore we analyzed the statistical significance of the conformation. For each amino acid combination in the loop region, the angle distribution between the flanking alpha helices, the loop length distribution and a 2D heatmap of the surrounding amino acids were automatically plotted (Fig. 1-4).<br />
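Quantity (1), the angle between the vectors defining the two helices, can be computed from the dot product. The sketch below is a generic illustration and not the team's script; how ArchDB orients the helix vectors (and therefore whether a value corresponds to the angle or its supplement) is not specified here and is left as an assumption.<br />

```python
import math


def angle_between(u, v):
    """Return the angle in degrees between two 3D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("helix vectors must have non-zero length")
    # Clamp to [-1, 1] to guard against rounding errors before acos.
    cos_theta = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cos_theta))


# Two perpendicular helix axes (made-up vectors for illustration).
print(angle_between((1.0, 0.0, 0.0), (0.0, 2.0, 0.0)))
```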
<br />
These distributions were then visually analyzed to identify loops of interest for linker design. We focused on loops that showed a narrow angle distribution and that appeared frequently in the database, and we extended the distribution analysis to the surrounding amino acids. For example, once we had identified that T in a turn produces an interesting distribution (Fig. 1), we would extend it by K in front and by A at the end and see how the distribution behaves. By restricting the possibilities, the number of occurrences goes down tremendously, but the properties become more interesting.<br />
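The stepwise restriction described here (T, then K before T, then A after T) amounts to filtering the loop records by their flanking residues and summarising the angles that remain. A minimal sketch follows; the record layout, the field names and the example values are hypothetical and do not reflect the ArchDB file format.<br />

```python
import statistics


def select(records, loop, before=None, after=None):
    """Keep loops with the given sequence and, optionally, flanking residues."""
    return [
        r for r in records
        if r["loop"] == loop
        and (before is None or r["before"] == before)
        and (after is None or r["after"] == after)
    ]


def summarise(records):
    """Return (count, mean angle, population standard deviation)."""
    angles = [r["angle"] for r in records]
    if not angles:
        return (0, None, None)
    return (len(angles), statistics.mean(angles), statistics.pstdev(angles))


# Made-up records, only to show how the subsets shrink and narrow.
records = [
    {"before": "K", "loop": "T", "after": "A", "angle": 38.0},
    {"before": "K", "loop": "T", "after": "A", "angle": 40.0},
    {"before": "K", "loop": "T", "after": "G", "angle": 60.0},
    {"before": "L", "loop": "T", "after": "A", "angle": 90.0},
    {"before": "K", "loop": "S", "after": "A", "angle": 120.0},
]

print(summarise(select(records, "T")))
print(summarise(select(records, "T", before="K")))
print(summarise(select(records, "T", before="K", after="A")))
```

Each additional constraint reduces the count, which is the trade-off between narrow distributions and statistical support mentioned above.<br />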
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of__T_.png|<br />
descr= Figure 1) A loop composed of the amino acid T produces an interesting well-defined angle distribution (left panel). The middle panel represents the distribution of lengths found with this loop. The right panel represents the distribution of amino acids found before and after the amino acid T in all the corresponding loops. One can see that the amino acid K is the most frequent before.}} <br />
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of_K_T_.png|<br />
descr= Figure 2) These distributions are subsets of the ones of Figure 1, with only the loops composed of the amino acid T preceded by the amino acid K. This constraint considerably narrows down the angle and length distributions.}} <br />
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of__T_A.png|<br />
descr= Figure 3) Same procedure as in Figure 2, with the loops composed of the amino acid T followed by the amino acid A. This also narrows down the angle and length distributions compared to Figure 1, although not as much as in Figure 2.}} <br />
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of_K_T_A.png|<br />
descr= Figure 4) The distributions correspond to loops composed of the amino acids KTA and are therefore subsets of all the preceding ones. The angle distribution becomes very narrow, but the frequency is also reduced compared to the preceding ones.}}<br />
<br />
This step had two main goals: narrowing down the angle distribution and finding loops with no preference for the amino acids surrounding them. This last point was important for the modularity of our approach: the angle blocks should not be affected by the surrounding alpha helices. Using this approach, 10 different angle motifs (table 1) could be identified producing different angles. Importantly, all these motifs were chosen so that the length of the two segments starting from the turning point was the same for all of them. It should also be noted that 3 of the patterns, KTA, LVA and AAIAP, produce nearly the same angle. They were nevertheless all kept for the ''in silico'' refinement described below, but only one was used for the [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software CRAUT software].<br />
<br />
<br />
{| class="table table-hover" style="text-align: center;"<br />
|+'''table 1''': The span of parameters.<br />
!colspan="10"|Angle Patterns<br />
|-<br />
|Pattern <br />
| NVL <br />
| KTA <br />
| LVA <br />
| AAIAP <br />
| AADGTL <br />
| VNLTA<br />
| AAAHPEA<br />
| ASLPAA <br />
| ATGDLA<br />
|-<br />
|Mean <br />
| 29.7 <br />
| 38.7 <br />
| 35 <br />
| 36.5 <br />
| 60 <br />
| 74.5 <br />
| 117 <br />
| 140 <br />
| 160 <br />
|-<br />
| Variation <br />
| 8.5 <br />
| 30 <br />
| 29 <br />
| 27 <br />
| 12 <br />
| 27 <br />
| 12 <br />
| 15 <br />
| 5 <br />
|}<br />
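As an illustration of how such a table can be used, the sketch below picks the pattern whose mean angle is closest to a desired angle. The pattern names and mean values are copied from table 1; the selection rule itself is an assumption made for this example and is not necessarily how the CRAUT software chooses its angle blocks.<br />

```python
# Mean angles from table 1, keyed by angle pattern.
MEAN_ANGLE = {
    "NVL": 29.7,
    "KTA": 38.7,
    "LVA": 35.0,
    "AAIAP": 36.5,
    "AADGTL": 60.0,
    "VNLTA": 74.5,
    "AAAHPEA": 117.0,
    "ASLPAA": 140.0,
    "ATGDLA": 160.0,
}


def closest_pattern(target_angle, table=MEAN_ANGLE):
    """Return the pattern whose mean angle is nearest to target_angle."""
    return min(table, key=lambda pattern: abs(table[pattern] - target_angle))


print(closest_pattern(120.0))
print(closest_pattern(30.0))
```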
<br />
===Sequences to connect the alpha helix to the protein extremity===<br />
<br />
Helical patterns often affect the folding of the attached sequences. To prevent them from affecting the structure of the protein of interest, we analyzed the effects of various amino acids on helix formation ''in silico'' using online tools like [http://bioserv.rpbs.univ-paris-diderot.fr/services/PEP-FOLD/ pep-fold]. We identified glycines and prolines as reliable amino acids to interrupt helix formation. We then decided to use glycine pairs to connect the protein of interest to the linker, because they give more flexibility to the orientation of the first helix.<br />
This only concerns the protein end that was connected to the linker through the coding sequence. The other end is ligated through an extein or sortase scar, both treated as unstructured flexible regions.<br />
<br />
===''Conclusion''===<br />
<br />
The last three parts show how we could design alpha helix and angle pattern blocks and connect them to each other and to the protein of interest. They provide the material that allowed us to transform a linker defined as a geometrical path into a real amino acid sequence in our [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software software].<br />
<br />
Thanks to this we could design linkers to circularize [https://2014.igem.org/Team:Heidelberg/Project/PCR_2.0 DNMT1] and [https://2014.igem.org/Team:Heidelberg/Project/Linker_Screening lysozyme]. Additionally, it was shown for lysozyme that a custom-tailored linker enhances heat stability compared to a badly designed linker and even to the flexible linker.<br />
<br />
=''In silico'' refinement=<br />
<br />
As some of the interesting patterns could not be found often enough to be statistically significant, we decided to make a further refinement ''in silico'' by modeling the structure of proteins with circularizing linkers. To perform this for realistic situations, we selected, from the RCSB database, structures of non-homologous target proteins with extremities that are separated enough to require a linker for circularization. First, the [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software CRAUT software] generates possible fitting linkers for various proteins. From these possible linkers, the 100 shortest were taken. Among the 3 possible angle patterns that generate the same angle, the software provides only one. But the linker refinement developed here used the three of them for comparison. We assume that the linkers connect the ends of the protein without setting tension on the protein, so that the protein can fold in its natural way.<br />
<br />
After this, the circularized proteins with the specific linkers are modelled using the software Modeller [[#References|[10]]]. This software is widely used for comparative structure prediction. It is well established in the scientific community and should be most suitable for prediction of loop regions attached to existing structures [[#References|[11]]]. It is freely available for academic use from the [http://salilab.org/modeller/ salilab] webpages. The program is able to predict the 3D structure of a given sequence based on an alignment with a given structure. In our script, Modeller needed to be provided with the sequence with the linker attached and with the PDB file of the protein of interest. This latter file is the only structural information used here. Modeller is recommended when at least 30% sequence similarity exists between the provided structure and the one that the user wishes to model. Here, only the linker is different, so this similarity is around 90% in our case. Modeller first makes an alignment between the provided structure and the sequence of our linked protein, identifying the regions that cannot be found in the structure. Based on that, Modeller generates 4 initial models. One of the strengths of the software is its capacity to further refine only certain parts of the protein. Thus we let Modeller refine the loops, which are defined as any part of the protein that could not be found in the structure file. The ''ab initio'' modelling of our linkers is done by minimizing energy functions with different methods like conjugate gradients and molecular dynamics. Eventually, each modeled structure is provided with energy values, thanks to which different models of the same structure can be compared. From Modeller we received about 8 different models and chose the one with the best energy scores to proceed further. For these refinement steps, one can choose different levels of optimization. 
We always favoured accuracy over speed of the program. The result of a prediction for lysozyme of bacteriophage lambda can be seen in figure 5.<br />
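The selection of the model with the best energy score can be sketched as follows. This is a minimal illustration that assumes a lower energy value indicates a better model; the file names and energy values are made up, and the code does not call Modeller itself.<br />

```python
def best_model(models):
    """Return the model entry with the lowest energy score."""
    if not models:
        raise ValueError("no models to compare")
    return min(models, key=lambda model: model["energy"])


# Hypothetical output of one modelling run (values invented for illustration).
candidates = [
    {"name": "model_1.pdb", "energy": -1520.4},
    {"name": "model_2.pdb", "energy": -1498.9},
    {"name": "model_3.pdb", "energy": -1611.2},
]

print(best_model(candidates)["name"])
```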
<br />
{{:Team:Heidelberg/templates/image-quarter|<br />
align=right|<br />
caption=Figure 5) Circular lysozyme|<br />
descr=The structure of circular lysozyme (yellow) was predicted and aligned to the linear structure (purple). The linker calculated by the software was marked green. |<br />
file=circ_lam_lys_nils.png}}<br />
<br />
<br />
Modeller was run by distributing the calculations via the [https://2014.igem.org/Team:Heidelberg/Software/igemathome iGEM@home system]. The modelling of one linker took about 10 hours of calculation time on average, a value that depends strongly on the size of the protein. The best model is then evaluated by another self-written program to analyze the behaviour of the linker patterns in their natural surroundings.<br />
Finally, all the models for the different structures and the different linkers are analyzed for properties like the length of the helical patterns, the shape of the attachment structures of the linker and the angles produced by the angle patterns [[ Figure helix_winkel_messung.png]]. First the modeled structure and the natural structure are superimposed to see how large the differences between them are. If the protein has been disturbed too much, the model is discarded. We check that the alpha helix patterns really generate the expected structure by measuring the distance between the first and the last amino acid of each of them and checking whether it is compatible with an alpha helix. The first and last amino acids of each rod define a vector that is then used to calculate the angles between consecutive helical patterns. The frequency of all the determined lengths and angles is then further analyzed, using a strategy similar to the ArchDB analysis presented above.<br />
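The length check on a helical block can be sketched as below: the end-to-end distance is measured between the coordinates of the first and last residue and compared with the expected block length. The tolerance of 2 &Aring; and the example coordinates are assumptions for illustration; the threshold used in the actual evaluation program is not stated here.<br />

```python
import math


def end_to_end(first, last):
    """Distance between the first and last residue of a helical block."""
    return math.dist(first, last)


def rod_vector(first, last):
    """Vector from the first to the last residue, used as the rod axis."""
    return tuple(b - a for a, b in zip(first, last))


def is_helix_like(measured, expected, tolerance=2.0):
    """True if the measured length is within tolerance of the expected one.

    The tolerance (in angstrom) is a hypothetical value for this sketch.
    """
    return abs(measured - expected) <= tolerance


# Made-up coordinates for the two ends of an AEAAAKA block (expected 10.5 A).
first, last = (0.0, 0.0, 0.0), (0.0, 0.0, 10.0)
measured = end_to_end(first, last)
print(measured, is_helix_like(measured, 10.5), rod_vector(first, last))
```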
<br />
The whole process for the verification of the different linker patterns was set up on the [https://2014.igem.org/Team:Heidelberg/Software/igemathome distributed computing system]. Due to lack of time, only a few results could be analyzed, resulting in distributions for the different helices; see for example figures 6 and 7.<br />
This led to a refinement of the lengths of the 8 different alpha helix blocks presented above to 8.7, 10, 10.8, 15.6, 16.8, 24.8 and 32.3 &Aring;.<br />
<br />
{{:Team:Heidelberg/templates/image-quarter|<br />
align=right|<br />
descr=|<br />
caption=Fig. 7. Length distribution of AEAAAKEAAAK|<br />
file=plot_of_AEAAAKEAAAK.png}}<br />
{{:Team:Heidelberg/templates/image-quarter|<br />
align=right|<br />
descr=|<br />
caption=Fig. 6. Length distribution of AEAAAKA |<br />
file=plot_of_AEAAAKA.png}}<br />
<br />
=Conclusion=<br />
The patterns that we identified by analyzing structure databases provide an easy and fast tool to build custom-shaped peptides. The main achievement is the identification of the angle patterns. These are designed as building blocks for enhanced applicability. The shapes were identified from a database of non-homologous proteins and the patterns were refined until the surrounding amino acids looked randomly distributed. Thus we can exclude that the angle distributions we observed are due to the surrounding sequences rather than to the identified patterns. We have not observed any evidence for certain helix patterns being preferred in the database.<br />
From figures 6 and 7 we have learned that the lengths we have assumed for the helices needed to be refined. For example we had assumed the AEAAAKA motif to span a distance of 10.5 &Aring; but have observed it to be only 10 &Aring; long. Our [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software CRAUT software] was accordingly corrected.<br />
<br />
=References=<br />
<br />
[1] Vieille, C. & Zeikus, G.J. Hyperthermophilic enzymes: sources, uses, and molecular mechanisms for thermostability. Microbiology and molecular biology reviews : MMBR 65, 1-43 (2001).<br />
<br />
[2] Yu, Y. & Lutz, S. Circular permutation: a different way to engineer enzyme structure and function. Trends Biotechnol. 29, 18-25 (2011).<br />
<br />
[3] Arai, R., Ueda, H., Kitayama, A., Kamiya, N. & Nagamune, T. Design of the linkers which effectively separate domains of a bifunctional fusion protein. Protein Eng. 14, 529-532 (2001).<br />
<br />
[4] Wang, C.K.L., Kaas, Q., Chiche, L. & Craik, D.J. CyBase: A database of cyclic protein sequences and structures, with applications in protein discovery and engineering. Nucleic Acids Research 36, (2008).<br />
<br />
[5] Efimov, A. V. Standard structures in proteins. Prog. Biophys. Mol. Biol. 60, 201–239 (1993).<br />
<br />
[6] Donate, L. E., Rufino, S. D., Canard, L. H. & Blundell, T. L. Conformational analysis and clustering of short and medium size loops connecting regular secondary structures: a database for modeling and prediction. Protein Sci. 5, 2600–2616 (1996).<br />
<br />
[7] Bonet, J. et al. ArchDB 2014: structural classification of loops in proteins. Nucleic Acids Res. 42, D315–D319 (2014).<br />
<br />
[8] George, R. A. & Heringa, J. An analysis of protein domain linkers: their classification and role in protein folding. Protein Eng. 15, 871-879 (2002).<br />
<br />
[9] Chen, X., Zaro, J. L. & Shen, W.-C. Fusion protein linkers: property, design and functionality. Adv. Drug Deliv. Rev. 65, 1357-1369 (2013).<br />
<br />
[10] Fiser, A. et al. Modeling of loops in protein structures. Protein science : a publication of the Protein Society 9, 1753-73 (2000).<br />
<br />
[11] Fiser, A. & Sali, A. ModLoop: Automated modeling of loops in protein structures. Bioinformatics 19, 2500-2501 (2003).</div>Igemnilshttp://2014.igem.org/Team:Heidelberg/pages/Linker_ModelingTeam:Heidelberg/pages/Linker Modeling2014-10-17T20:42:50Z<p>Igemnils: /* Angle patterns */</p>
<hr />
<div>=Background=<br />
Primary, secondary, tertiary and quaternary structures are the main levels of protein structure characterization. Primary structure designates the amino acid sequence, while the secondary structure describes the arrangement of consecutive amino acids through their two dihedral angles $\phi$ and $\psi$. The Ramachandran plot, which represents the amino acid position in the space of those two angles, shows two particular arrangements commonly found in proteins: alpha helices and beta sheets. The next level of protein organization is the tertiary structure, which describes how the protein is organized in the three spatial dimensions, whereas the quaternary structure describes how different subunits of proteins cluster.<br />
Finally, closely related to these standard structures, the supersecondary structure describes how secondary structure elements are connected to each other. While these connections look undefined at first sight, further analysis revealed that this wide variety of supersecondary structure motifs can be clustered to certain patterns [[#References|[5]]].<br />
<br />
==Supersecondary structure==<br />
<br />
When the properties of supersecondary structures were first described, only very few patterns were identified, mainly due to the lack of highly resolved protein structures. At that time, the structures were mainly classified by the Ramachandran plot regions where the amino acids could be found [[#References|[6]]]. With the growing number of known crystal structures, the analysis of supersecondary structure improved and led to databases with about 150 000 classified loop structures and elaborate clustering [[#References|[7]]]. Nowadays supersecondary structures are defined as the structures built when two secondary structure elements are joined by a short peptide that is not assigned to either of the secondary structures. These loop peptides range from 1 to 9 amino acids.<br />
Our aim was to build reliable stable linkers out of alpha helices connected by supersecondary structure motifs that produce certain angles. To achieve that, we searched for the most reliable alpha helix patterns that would form rigid rods and angle patterns covering the whole range of angles from 0 to 180 degrees.<br />
<br />
=Linker building block design=<br />
<br />
===Helix patterns===<br />
Various patterns have been used to build helical linkers to connect protein ends [[#References | [8]]]. Moreover, in known protein structures, linkers between subdomains can be identified and their properties have already been analyzed [[#References | [9]]]. Two main criteria were used to build the alpha helix patterns: they should robustly form alpha helices, and they should be soluble in aqueous solution. Therefore, we could not just use linkers built of alanine alone. So we decided to add some charged amino acids to the pattern, and to position them physically close to each other so that they could stabilize each other by Coulomb interaction. These amino acids needed to be separated by 3 amino acids, as a helical turn takes about 3.6 amino acids. The pattern we chose as most suitable for our purpose was also described to be one of the most stable [[#References | [3]]].<br />
8 alpha helix building blocks were eventually chosen: AEAAAK, AEAAAKA, AEAAAKAA, AEAAAKEAAAK, AEAAAKEAAAKA, AEAAAKEAAAKEAAAKA, AEAAAKEAAAKEAAAKEAAAKA and AEAAAKEAAAKEAAAKEAAAKEAAAKA. The estimated lengths given for them are 9, 10.5, 12, 16.5, 18, 25.5 and 33 &Aring;; these seven values match the first seven blocks at 1.5 &Aring; per residue.<br />
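<br />
The estimated lengths above can be reproduced with a short calculation. The following is a minimal sketch, assuming that the estimates come from the rise of 1.5 &Aring; per residue of an ideal alpha helix; the names <code>RISE_PER_RESIDUE</code>, <code>HELIX_BLOCKS</code> and <code>estimated_length</code> are illustrative and not taken from the project code.<br />

```python
# Rise per residue along the axis of an ideal alpha helix, in Angstrom.
# Assumption: this textbook value underlies the estimated lengths in the text.
RISE_PER_RESIDUE = 1.5

# The eight helix building blocks listed in the text.
HELIX_BLOCKS = [
    "AEAAAK",
    "AEAAAKA",
    "AEAAAKAA",
    "AEAAAKEAAAK",
    "AEAAAKEAAAKA",
    "AEAAAKEAAAKEAAAKA",
    "AEAAAKEAAAKEAAAKEAAAKA",
    "AEAAAKEAAAKEAAAKEAAAKEAAAKA",
]


def estimated_length(sequence, rise=RISE_PER_RESIDUE):
    """Return the estimated length of a helical block in Angstrom."""
    return len(sequence) * rise


if __name__ == "__main__":
    for block in HELIX_BLOCKS:
        print(f"{block}: {estimated_length(block):.1f} A")
```

Under this assumption the first seven blocks give exactly the seven lengths listed in the text.<br />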
<br />
===Angle patterns===<br />
<br />
The angle patterns for our model were obtained from the ArchDB database [[#References | [7]]], which classifies loops from known protein structures. About 17 000 non-homologous proteins from the PDB database were analyzed and, from them, 150 000 loop structures, i.e. regions connecting two secondary structure elements, were identified. The classification took into account not only the length of the loop and its conformation, meaning the φ and ψ backbone dihedral angles of the residues in the loop, but also the distance between the attachments of the loop to the surrounding secondary structures. Furthermore, the secondary structures surrounding the loop and the geometry defined by the supersecondary structure motifs can be found in the database.<br />
<br />
To extract from ArchDB the relevant supersecondary structure motifs for our linker design, the complete database was downloaded and helix-loop-helix motifs were extracted using a [https://github.com/igemsoftware/Heidelberg_2014 self-written script] in Python. Among them, we only took into account loops composed of 1 or 2 amino acids, because the longer the loops are, the less frequent and therefore the less reliable they are, and the further their ends are from each other.<br />
The information of interest to us was (1) the angle produced between the vectors defining the surrounding alpha helices, (2) the distance between the ends of the loop, and (3) the type of amino acids surrounding the loop. Furthermore, we analyzed the statistical significance of the conformation. For each amino acid combination in the loop region, the distribution of angles between the flanking alpha helices, the loop length distribution and a 2D heatmap of the surrounding amino acids were automatically plotted (Fig. 1-4).<br />
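<br />
The angle between the two helices flanking a loop can be computed from the vectors that define them. The following is a minimal sketch, assuming each helix is represented by the coordinates of its first and last residue; the function names and the direction convention are illustrative assumptions and do not reproduce the ArchDB definitions.<br />

```python
import math


def helix_vector(start, end):
    """Vector pointing from the first to the last residue of a helix."""
    return tuple(e - s for s, e in zip(start, end))


def angle_between(v1, v2):
    """Angle between two 3D vectors in degrees (0 to 180)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        raise ValueError("helix vectors must have non-zero length")
    # Clamp to [-1, 1] to guard against rounding errors before acos.
    cos_angle = max(-1.0, min(1.0, dot / (norm1 * norm2)))
    return math.degrees(math.acos(cos_angle))


if __name__ == "__main__":
    # Hypothetical coordinates in Angstrom: two helices at a right angle.
    first = helix_vector((0.0, 0.0, 0.0), (10.0, 0.0, 0.0))
    second = helix_vector((12.0, 0.0, 0.0), (12.0, 10.0, 0.0))
    print(f"{angle_between(first, second):.1f} degrees")
```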
<br />
These distributions were then visually analyzed to identify loops of interest for linker design. We focused on loops that showed a narrow angle distribution and that appeared frequently in the database, and we extended the distribution analysis to the surrounding amino acids. For example, once we had identified that T in a turn produces an interesting distribution (Fig. 1), we would extend it by K in front and by A at the end and see how the distribution behaves. Restricting the possibilities in this way reduces the number of occurrences tremendously, but the properties become more interesting.<br />
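<br />
The stepwise restriction described above can be sketched as a filter over loop records. This is a minimal illustration with invented data; the record layout (residue before the loop, loop residues, residue after the loop, angle) and the function names are assumptions and not the format of ArchDB or of our script.<br />

```python
from statistics import mean, pstdev

# Hypothetical records: (residue before, loop residues, residue after, angle in degrees).
LOOPS = [
    ("K", "T", "A", 36.0),
    ("K", "T", "A", 40.0),
    ("K", "T", "L", 55.0),
    ("E", "T", "A", 20.0),
    ("A", "G", "A", 110.0),
]


def select(loops, loop, before=None, after=None):
    """Angles of all loops matching the loop residues and optional flanking residues."""
    return [
        angle
        for b, l, a, angle in loops
        if l == loop
        and (before is None or b == before)
        and (after is None or a == after)
    ]


def summarize(angles):
    """Return (number of hits, mean angle, standard deviation)."""
    return len(angles), mean(angles), pstdev(angles)


if __name__ == "__main__":
    steps = [
        ("_T_", {}),
        ("KT_", {"before": "K"}),
        ("_TA", {"after": "A"}),
        ("KTA", {"before": "K", "after": "A"}),
    ]
    for label, flanks in steps:
        hits, avg, spread = summarize(select(LOOPS, "T", **flanks))
        print(f"{label}: {hits} hits, mean {avg:.1f}, spread {spread:.1f}")
```

Each added flanking residue lowers the number of hits, which is the trade-off between a narrow distribution and statistical support mentioned in the text.<br />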
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of__T_.png|<br />
descr= Fig. 1. A loop composed of the amino acid T produces an interesting, well-defined angle distribution (left panel). The middle panel represents the distribution of lengths found with this loop. The right panel represents the distribution of amino acids found before and after the amino acid T in all the corresponding loops. One can see that K is the most frequent amino acid before the loop.}} <br />
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of_K_T_.png|<br />
descr= Fig. 2. With K before the turn, the distribution narrows considerably; K seems to be a good choice before the turn.}} <br />
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of__T_A.png|<br />
descr= Fig. 3. With A after the loop, the distribution also stays well defined.}} <br />
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of_K_T_A.png|<br />
descr= Fig. 4. The whole sequence still produces a nicely shaped distribution, but the number of hits is much lower than before.}}<br />
<br />
This step had two main goals: narrowing down the angle distribution and finding loops with no preference for the amino acids surrounding them. This last point was important for the modularity of our approach: the angle blocks should not be affected by the surrounding alpha helices. Using this approach, 10 different angle motifs (Table 1) could be identified, producing different angles. Importantly, all these motifs were chosen so that the length of the two segments starting from the turning point was the same for all of them. It should also be noted that 3 of the patterns, KTA, LVA and AAIAP, produce nearly the same angle. They were nevertheless all kept for the ''in silico'' refinement described below, but only one was used for the [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software CRAUT software].<br />
<br />
<br />
{| class="table table-hover" style="text-align: center;"<br />
|+'''Table 1''': Mean angle and variation (in degrees) of the selected angle patterns.<br />
!colspan="10"|Angle Patterns<br />
|-<br />
|Pattern <br />
| NVL <br />
| KTA <br />
| LVA <br />
| AAIAP <br />
| AADGTL <br />
| VNLTA<br />
| AAAHPEA<br />
| ASLPAA <br />
| ATGDLA<br />
|-<br />
|Mean <br />
| 29.7 <br />
| 38.7 <br />
| 35 <br />
| 36.5 <br />
| 60 <br />
| 74.5 <br />
| 117 <br />
| 140 <br />
| 160 <br />
|-<br />
| Variation <br />
| 8.5 <br />
| 30 <br />
| 29 <br />
| 27 <br />
| 12 <br />
| 27 <br />
| 12 <br />
| 15 <br />
| 5 <br />
|}<br />
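<br />
With the mean angles from Table 1, choosing an angle block for a desired angle reduces to a nearest-value lookup. The following is a minimal sketch of that idea; it is not the selection logic of the CRAUT software, and the names <code>ANGLE_PATTERNS</code> and <code>closest_pattern</code> are illustrative.<br />

```python
# Mean angles in degrees, copied from Table 1.
ANGLE_PATTERNS = {
    "NVL": 29.7,
    "KTA": 38.7,
    "LVA": 35.0,
    "AAIAP": 36.5,
    "AADGTL": 60.0,
    "VNLTA": 74.5,
    "AAAHPEA": 117.0,
    "ASLPAA": 140.0,
    "ATGDLA": 160.0,
}


def closest_pattern(target_angle, patterns=ANGLE_PATTERNS):
    """Return the pattern whose mean angle is closest to the target angle."""
    return min(patterns, key=lambda name: abs(patterns[name] - target_angle))


if __name__ == "__main__":
    for target in (30, 65, 120, 155):
        name = closest_pattern(target)
        print(f"target {target}: {name} (mean {ANGLE_PATTERNS[name]})")
```

A fuller version would also weigh the variation row of Table 1, since a pattern with a close mean but a broad distribution is less predictable.<br />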
<br />
===Sequences to connect the alpha helix to the protein extremity===<br />
<br />
Helical patterns often affect the folding of the attached sequences. To prevent them from affecting the structure of the protein of interest, we analyzed the effects of various amino acids on helix formation ''in silico'' using online tools like [http://bioserv.rpbs.univ-paris-diderot.fr/services/PEP-FOLD/ PEP-FOLD]. We identified glycines and prolines as reliable amino acids to interrupt helix formation. We then decided to use glycine pairs to connect the protein of interest to the linker, because they give more flexibility to the orientation of the first helix.<br />
This only concerns the protein end that is connected to the linker through the coding sequence. The other end is ligated through an extein or sortase scar, both of which are treated as unstructured flexible regions.<br />
<br />
===''Conclusion''===<br />
<br />
The last three parts show how we could design alpha helix and angle pattern blocks and connect them to each other and to the protein of interest. They provide the material that allowed us to transform a linker defined as a geometrical path into a real amino acid sequence in our [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software software].<br />
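<br />
How such building blocks could be joined into one amino acid sequence can be sketched as follows. This is a minimal illustration, assuming a simple concatenation of a glycine pair, helix blocks and angle patterns; the function <code>build_linker</code> is hypothetical and the actual assembly rules of the software may differ.<br />

```python
def build_linker(helix_blocks, angle_patterns, connector="GG"):
    """Join helix blocks with one angle pattern between consecutive helices.

    The connector (a glycine pair by default) is placed at the end that is
    attached to the protein through the coding sequence.
    """
    if len(angle_patterns) != len(helix_blocks) - 1:
        raise ValueError("need exactly one angle pattern between consecutive helix blocks")
    parts = [connector]
    for index, helix in enumerate(helix_blocks):
        parts.append(helix)
        if index < len(angle_patterns):
            parts.append(angle_patterns[index])
    return "".join(parts)


if __name__ == "__main__":
    # Two helical rods joined by one angle block.
    print(build_linker(["AEAAAK", "AEAAAKA"], ["KTA"]))
```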
<br />
Thanks to this, we could design linkers to circularize [https://2014.igem.org/Team:Heidelberg/Project/PCR_2.0 DNMT1] and [https://2014.igem.org/Team:Heidelberg/Project/Linker_Screening lysozyme]. Additionally, it was shown for lysozyme that a custom-tailored linker enhances heat stability compared to a badly designed linker and even to a flexible linker.<br />
<br />
=''In silico'' refinement=<br />
<br />
As some of the interesting patterns could not be found often enough to be statistically significant, we decided to make a further refinement ''in silico'' by modeling the structure of proteins with circularizing linkers. To do this for realistic situations, we selected, from the RCSB database, structures of non-homologous target proteins with extremities that are far enough apart to require a linker for circularization. First, the [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software CRAUT software] generates possible fitting linkers for various proteins. From these possible linkers, the 100 shortest were taken. Among the 3 possible angle patterns that generate the same angle, the software provides only one, but the linker refinement developed here used all three of them for comparison. We assume that the linkers connect the ends of the protein without putting tension on the protein, so that the protein can fold in its natural way.<br />
<br />
After this, the circularized proteins with the specific linkers are modelled using the software Modeller [[#References|[10]]]. This software is widely used for comparative structure prediction. It is well established in the scientific community and should be well suited for the prediction of loop regions attached to existing structures [[#References|[11]]]. It is freely available for academic use from the [http://salilab.org/modeller/ salilab] webpages. The program is able to predict the 3D structure of a given sequence based on an alignment with a given structure. In our script, Modeller needed to be provided with a sequence with the linker attached and with the PDB file of the protein of interest. This latter file is the only structural information used here. Modeller is recommended when at least 30% sequence similarity exists between the provided structure and the one that the user wishes to model. Here, only the linker is different, so this similarity is around 90% in our case. At first, Modeller makes an alignment between the provided structure and the sequence of our linked protein, identifying the regions that cannot be found in the structure. Based on that, Modeller generates 4 initial models. One of the strengths of the software is its capacity to further refine only certain parts of the protein. Thus we let Modeller refine the loops, which are defined as any part of the protein that could not be found in the structure file. The ''ab initio'' modelling of our linkers is done by minimizing energy functions with different methods like conjugate gradients and molecular dynamics. Eventually, each modeled structure is provided with energy values, thanks to which different models of the same structure can be compared. From Modeller we received about 8 different models and chose the one with the best energy scores to proceed with. For these refinement steps, one can choose different levels of optimization. 
We always favored accuracy over speed of the program. The result of a prediction for lysozyme of bacteriophage lambda can be seen in figure 5.<br />
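<br />
The final selection step can be illustrated without Modeller itself. The following is a minimal sketch, assuming that each model comes with a single energy score and that a lower score is better; the file names and values are invented and <code>best_model</code> is a hypothetical helper, not part of Modeller.<br />

```python
def best_model(models):
    """Return the (file name, energy) pair with the lowest energy score."""
    if not models:
        raise ValueError("no models to choose from")
    return min(models, key=lambda model: model[1])


if __name__ == "__main__":
    # Hypothetical results of one modelling run: (file name, energy score).
    models = [
        ("linker_model_1.pdb", -10212.4),
        ("linker_model_2.pdb", -10388.9),
        ("linker_model_3.pdb", -10105.0),
    ]
    name, energy = best_model(models)
    print(f"selected {name} with energy {energy}")
```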
<br />
{{:Team:Heidelberg/templates/image-quarter|<br />
align=right|<br />
caption=Fig. 5. Circular lysozyme|<br />
descr=The structure of circular lysozyme (yellow) was predicted and aligned to the linear structure (purple). The linker calculated by the software was marked green. |<br />
file=circ_lam_lys_nils.png}}<br />
<br />
<br />
Modeller was run by distributing the calculations via the [https://2014.igem.org/Team:Heidelberg/Software/igemathome iGEM@home system]. The modelling of one linker took about 10 hours of calculation time on average, a value that strongly depends on the size of the protein. Then the best model is evaluated by another self-written program to analyze the behaviour of the linker patterns in their natural surroundings.<br />
Finally, all the models for the different structures and the different linkers are analyzed for their properties, like the length of the helical patterns, the shape of the attachment structures of the linker and the angles produced by the angle patterns [[ Figure helix_winkel_messung.png]]. First the modeled structure and the natural structure are fitted to each other, to see how big the differences between them are. If the protein has been disturbed too much, the model is discarded. We verify that the alpha helix patterns really adopt the expected helical conformation by measuring the distance between the first and the last amino acid of each of them and checking whether it is compatible with an alpha helix. The first and last amino acids of each rod define a vector that is then used to calculate the angles between consecutive helical patterns. The frequency of all the determined lengths and angles is then further analyzed, using a strategy similar to the ArchDB analysis presented above.<br />
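<br />
The helix check described above can be sketched as a comparison between the measured end-to-end distance and the distance expected for an ideal alpha helix. This is a minimal illustration; the rise of 1.5 &Aring; per residue, the relative tolerance and the function name <code>is_helix_like</code> are assumptions and not the criteria of our evaluation program.<br />

```python
import math

# Rise per residue of an ideal alpha helix in Angstrom (assumed textbook value).
RISE_PER_RESIDUE = 1.5


def is_helix_like(first_ca, last_ca, n_residues, tolerance=0.15):
    """Check whether an end-to-end distance is compatible with an alpha helix.

    first_ca and last_ca are the 3D coordinates of the first and last residue.
    The tolerance is a relative deviation and is a hypothetical choice.
    """
    expected = (n_residues - 1) * RISE_PER_RESIDUE
    measured = math.dist(first_ca, last_ca)
    return abs(measured - expected) <= tolerance * expected


if __name__ == "__main__":
    # A 7-residue block spanning 9 Angstrom passes; a collapsed one does not.
    print(is_helix_like((0.0, 0.0, 0.0), (0.0, 0.0, 9.0), 7))
    print(is_helix_like((0.0, 0.0, 0.0), (0.0, 0.0, 4.0), 7))
```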
<br />
The whole process for the verification of the different linker patterns was set up on the [https://2014.igem.org/Team:Heidelberg/Software/igemathome distributed computing system]. Due to lack of time, only a few results could be analyzed, resulting in distributions for the different helices; see for example figures 6 and 7.<br />
This led to a refinement of the lengths of the alpha helix blocks presented above to 8.7, 10, 10.8, 15.6, 16.8, 24.8 and 32.3 &Aring;.<br />
<br />
{{:Team:Heidelberg/templates/image-quarter|<br />
align=right|<br />
descr=|<br />
caption=Fig. 7. Length distribution of AEAAAKEAAAK|<br />
file=plot_of_AEAAAKEAAAK.png}}<br />
{{:Team:Heidelberg/templates/image-quarter|<br />
align=right|<br />
descr=|<br />
caption=Fig. 6. Length distribution of AEAAAKA |<br />
file=plot_of_AEAAAKA.png}}<br />
<br />
=Conclusion=<br />
The patterns that we identified by analyzing structure databases provide an easy and fast tool to build custom-shaped peptides. The main achievement is the identification of the angle patterns. These are designed as building blocks for enhanced applicability. The shapes were identified from a database of non-homologous proteins and the patterns were refined until the distribution of surrounding amino acids looked random. Thus we can conclude that the angle distributions we observed are due not to the surrounding sequences but to the identified patterns. We have not observed any evidence for certain helix patterns being preferred in the database.<br />
From figures 6 and 7 we learned that the lengths we had assumed for the helices needed to be refined. For example, we had assumed the AEAAAKA motif to span a distance of 10.5 &Aring; but observed it to be only 10 &Aring; long. Our [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software CRAUT software] was corrected accordingly.<br />
<br />
=References=<br />
<br />
[1] Vieille, C. & Zeikus, G.J. Hyperthermophilic enzymes: sources, uses, and molecular mechanisms for thermostability. Microbiol. Mol. Biol. Rev. 65, 1-43 (2001).<br />
<br />
[2] Yu, Y. & Lutz, S. Circular permutation: a different way to engineer enzyme structure and function. Trends Biotechnol. 29, 18-25 (2011).<br />
<br />
[3] Arai, R., Ueda, H., Kitayama, A., Kamiya, N. & Nagamune, T. Design of the linkers which effectively separate domains of a bifunctional fusion protein. Protein Eng. 14, 529-532 (2001).<br />
<br />
[4] Wang, C.K.L., Kaas, Q., Chiche, L. & Craik, D.J. CyBase: A database of cyclic protein sequences and structures, with applications in protein discovery and engineering. Nucleic Acids Research 36, (2008).<br />
<br />
[5] Efimov, A. V. Standard structures in proteins. Prog. Biophys. Mol. Biol. 60, 201-239 (1993).<br />
<br />
[6] Donate, L. E., Rufino, S. D., Canard, L. H. & Blundell, T. L. Conformational analysis and clustering of short and medium size loops connecting regular secondary structures: a database for modeling and prediction. Protein Sci. 5, 2600-2616 (1996).<br />
<br />
[7] Bonet, J. et al. ArchDB 2014: structural classification of loops in proteins. Nucleic Acids Res. 42, D315-D319 (2014).<br />
<br />
[8] George, R. A. & Heringa, J. An analysis of protein domain linkers: their classification and role in protein folding. Protein Eng. 15, 871-879 (2002).<br />
<br />
[9] Chen, X., Zaro, J. L. & Shen, W.-C. Fusion protein linkers: property, design and functionality. Adv. Drug Deliv. Rev. 65, 1357-1369 (2013).<br />
<br />
[10] Fiser, A. et al. Modeling of loops in protein structures. Protein Sci. 9, 1753-1773 (2000).<br />
<br />
[11] Fiser, A. & Sali, A. ModLoop: Automated modeling of loops in protein structures. Bioinformatics 19, 2500-2501 (2003).</div>Igemnilshttp://2014.igem.org/Team:Heidelberg/pages/Linker_ModelingTeam:Heidelberg/pages/Linker Modeling2014-10-17T20:38:37Z<p>Igemnils: /* In silico refinement */</p>
<hr />
<div>=Background=<br />
Primary, secondary, tertiary and quaternary structures are the main levels of protein structure characterization. Primary structure designates the amino acid sequence, while the secondary structure describes the arrangement of consecutive amino acids through their two dihedral angles $\phi$ and $\psi$. The Ramachandran plot, which represents the amino acid position in the space of those two angles, shows two particular arrangements commonly found in proteins: alpha helices and beta sheets. The next level of protein organization is the tertiary structure, which describes how the protein is organized in the three spatial dimensions, whereas the quaternary structure describes how different subunits of proteins cluster.<br />
Finally, closely related to these standard structures, the supersecondary structure describes how secondary structure elements are connected to each other. While these connections look undefined at first sight, further analysis revealed that this wide variety of supersecondary structure motifs can be clustered into certain patterns [[#References|[5]]].<br />
<br />
==Supersecondary structure==<br />
<br />
When the properties of supersecondary structures were first described, only very few patterns were identified, mainly due to the lack of high-resolution protein structures. At that time, the structures were mainly classified by the Ramachandran plot regions where the amino acids could be found [[#References|[6]]]. With the growing number of known crystal structures, the analysis of supersecondary structure improved and led to databases with about 150 000 classified loop structures and elaborate clustering [[#References|[7]]]. Nowadays supersecondary structures are defined as the structures built when two secondary structure elements are joined by a short peptide that is not itself assigned to one of the secondary structures. These loop peptides range from 1 to 9 amino acids.<br />
Our aim was to build reliable, stable linkers out of alpha helices connected by supersecondary structure motifs that produce certain angles. To achieve that, we searched for the most reliable alpha helix patterns that would form rigid rods and for angle patterns covering the whole range of angles from 0 to 180 degrees.<br />
<br />
=Linker building block design=<br />
<br />
===Helix patterns===<br />
Various patterns have been used to build helical linkers to connect protein ends [[#References | [8]]]. Moreover, in known protein structures, linkers between subdomains can be identified and their properties have already been analyzed [[#References | [9]]]. Two main criteria were used to build the alpha helix patterns: they should robustly form alpha helices, and they should be soluble in aqueous solution. Therefore, we could not just use linkers built of alanine alone. So we decided to add some charged amino acids to the pattern, and to position them physically close to each other so that they could stabilize each other by Coulomb interaction. These amino acids needed to be separated by 3 amino acids, as a helical turn takes about 3.6 amino acids. The pattern we chose as most suitable for our purpose was also described to be one of the most stable [[#References | [3]]].<br />
8 alpha helix building blocks were eventually chosen: AEAAAK, AEAAAKA, AEAAAKAA, AEAAAKEAAAK, AEAAAKEAAAKA, AEAAAKEAAAKEAAAKA, AEAAAKEAAAKEAAAKEAAAKA and AEAAAKEAAAKEAAAKEAAAKEAAAKA. The estimated lengths given for them are 9, 10.5, 12, 16.5, 18, 25.5 and 33 &Aring;; these seven values match the first seven blocks at 1.5 &Aring; per residue.<br />
<br />
===Angle patterns===<br />
<br />
The angle patterns for our model were obtained from the ArchDB database [[#References | [7]]], which classifies loops from known protein structures. About 17 000 non-homologous proteins from the PDB database were analyzed and, from them, 150 000 loop structures, i.e. regions connecting two secondary structure elements, were identified. The classification took into account not only the length of the loop and its conformation, meaning the φ and ψ backbone dihedral angles of the residues in the loop, but also the distance between the attachments of the loop to the surrounding secondary structures. Furthermore, the secondary structures surrounding the loop and the geometry defined by the supersecondary structure motifs can be found in the database.<br />
<br />
To extract from ArchDB the relevant supersecondary structure motifs for our linker design, the complete database was downloaded and helix-loop-helix motifs were extracted using a [https://github.com/igemsoftware/Heidelberg_2014 self-written script] in Python. Among them, we only took into account loops composed of 1 or 2 amino acids, because the longer the loops are, the less frequent and therefore the less reliable they are, and the further their ends are from each other.<br />
The information of interest to us was (1) the angle produced between the vectors defining the surrounding alpha helices, (2) the distance between the ends of the loop, and (3) the type of amino acids surrounding the loop. Furthermore, we analyzed the statistical significance of the conformation. For each amino acid combination in the loop region, the distribution of angles between the flanking alpha helices, the loop length distribution and a 2D heatmap of the surrounding amino acids were automatically plotted (Fig. 1-4).<br />
<br />
These distributions were then visually analyzed to identify loops of interest for linker design. We focused on loops that showed a narrow angle distribution and that appeared frequently in the database, and we extended the distribution analysis to the surrounding amino acids. For example, once we had identified that T in a turn produces an interesting distribution (Fig. 1), we would extend it by K in front and by A at the end and see how the distribution behaves. Restricting the possibilities in this way reduces the number of occurrences tremendously, but the properties become more interesting.<br />
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of__T_.png|<br />
descr= Fig. 1. A loop composed of T produces an interesting, well-defined angle distribution (left panel). K seems to be the most frequent amino acid before the loop.}} <br />
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of_K_T_.png|<br />
descr= Fig. 2. With K before the turn, the distribution narrows considerably; K seems to be a good choice before the turn.}} <br />
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of__T_A.png|<br />
descr= Fig. 3. With A after the loop, the distribution also stays well defined.}} <br />
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of_K_T_A.png|<br />
descr= Fig. 4. The whole sequence still produces a nicely shaped distribution, but the number of hits is much lower than before.}}<br />
<br />
This step had two main goals: narrowing down the angle distribution and finding loops with no preference for the amino acids surrounding them. This last point was important for the modularity of our approach: the angle blocks should not be affected by the surrounding alpha helices. Using this approach, 10 different angle motifs (Table 1) could be identified, producing different angles. Importantly, all these motifs were chosen so that the length of the two segments starting from the turning point was the same for all of them. It should also be noted that 3 of the patterns, KTA, LVA and AAIAP, produce nearly the same angle. They were nevertheless all kept for the ''in silico'' refinement described below, but only one was used for the [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software CRAUT software].<br />
<br />
<br />
{| class="table table-hover" style="text-align: center;"<br />
|+'''Table 1''': Mean angle and variation (in degrees) of the selected angle patterns.<br />
!colspan="10"|Angle Patterns<br />
|-<br />
|Pattern <br />
| NVL <br />
| KTA <br />
| LVA <br />
| AAIAP <br />
| AADGTL <br />
| VNLTA<br />
| AAAHPEA<br />
| ASLPAA <br />
| ATGDLA<br />
|-<br />
|Mean <br />
| 29.7 <br />
| 38.7 <br />
| 35 <br />
| 36.5 <br />
| 60 <br />
| 74.5 <br />
| 117 <br />
| 140 <br />
| 160 <br />
|-<br />
| Variation <br />
| 8.5 <br />
| 30 <br />
| 29 <br />
| 27 <br />
| 12 <br />
| 27 <br />
| 12 <br />
| 15 <br />
| 5 <br />
|}<br />
<br />
===Sequences to connect the alpha helix to the protein extremity===<br />
<br />
Helical patterns often affect the folding of the attached sequences. To prevent them from affecting the structure of the protein of interest, we analyzed the effects of various amino acids on helix formation ''in silico'' using online tools like [http://bioserv.rpbs.univ-paris-diderot.fr/services/PEP-FOLD/ PEP-FOLD]. We identified glycines and prolines as reliable amino acids to interrupt helix formation. We then decided to use glycine pairs to connect the protein of interest to the linker, because they give more flexibility to the orientation of the first helix.<br />
This only concerns the protein end that is connected to the linker through the coding sequence. The other end is ligated through an extein or sortase scar, both of which are treated as unstructured flexible regions.<br />
<br />
===''Conclusion''===<br />
<br />
The last three parts show how we could design alpha helix and angle pattern blocks and connect them to each other and to the protein of interest. They provide the material that allowed us to transform a linker defined as a geometrical path into a real amino acid sequence in our [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software software].<br />
<br />
Thanks to this, we could design linkers to circularize [https://2014.igem.org/Team:Heidelberg/Project/PCR_2.0 DNMT1] and [https://2014.igem.org/Team:Heidelberg/Project/Linker_Screening lysozyme]. Additionally, it was shown for lysozyme that a custom-tailored linker enhances heat stability compared to a badly designed linker and even to a flexible linker.<br />
<br />
=''In silico'' refinement=<br />
<br />
As some of the interesting patterns could not be found often enough to be statistically significant, we decided to make a further refinement ''in silico'' by modeling the structure of proteins with circularizing linkers. To do this for realistic situations, we selected, from the RCSB database, structures of non-homologous target proteins with extremities that are far enough apart to require a linker for circularization. First, the [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software CRAUT software] generates possible fitting linkers for various proteins. From these possible linkers, the 100 shortest were taken. Among the 3 possible angle patterns that generate the same angle, the software provides only one, but the linker refinement developed here used all three of them for comparison. We assume that the linkers connect the ends of the protein without putting tension on the protein, so that the protein can fold in its natural way.<br />
<br />
After this, the circularized proteins with the specific linkers are modelled using the software Modeller [[#References|[10]]]. This software is widely used for comparative structure prediction. It is well established in the scientific community and should be well suited for the prediction of loop regions attached to existing structures [[#References|[11]]]. It is freely available for academic use from the [http://salilab.org/modeller/ salilab] webpages. The program is able to predict the 3D structure of a given sequence based on an alignment with a given structure. In our script, Modeller needed to be provided with a sequence with the linker attached and with the PDB file of the protein of interest. This latter file is the only structural information used here. Modeller is recommended when at least 30% sequence similarity exists between the provided structure and the one that the user wishes to model. Here, only the linker is different, so this similarity is around 90% in our case. At first, Modeller makes an alignment between the provided structure and the sequence of our linked protein, identifying the regions that cannot be found in the structure. Based on that, Modeller generates 4 initial models. One of the strengths of the software is its capacity to further refine only certain parts of the protein. Thus we let Modeller refine the loops, which are defined as any part of the protein that could not be found in the structure file. The ''ab initio'' modelling of our linkers is done by minimizing energy functions with different methods like conjugate gradients and molecular dynamics. Eventually, each modeled structure is provided with energy values, thanks to which different models of the same structure can be compared. From Modeller we received about 8 different models and chose the one with the best energy scores to proceed with. For these refinement steps, one can choose different levels of optimization. 
We always favored accuracy over speed of the program. The result of a prediction for lysozyme of bacteriophage lambda can be seen in figure 5.<br />
<br />
{{:Team:Heidelberg/templates/image-quarter|<br />
align=right|<br />
caption=Fig. 5. Circular lysozyme|<br />
descr=The structure of circular lysozyme (yellow) was predicted and aligned to the linear structure (purple). The linker calculated by the software was marked green. |<br />
file=circ_lam_lys_nils.png}}<br />
<br />
<br />
Modeller was run by distributing the calculations via the [https://2014.igem.org/Team:Heidelberg/Software/igemathome iGEM@home system]. The modelling of one linker took about 10 hours of calculation time on average, a value that strongly depends on the size of the protein. Then the best model is evaluated by another self-written program to analyze the behaviour of the linker patterns in their natural surroundings.<br />
Finally, all the models for the different structures and the different linkers are analyzed for their properties, like the length of the helical patterns, the shape of the attachment structures of the linker and the angles produced by the angle patterns [[ Figure helix_winkel_messung.png]]. First the modeled structure and the natural structure are fitted to each other, to see how big the differences between them are. If the protein has been disturbed too much, the model is discarded. We verify that the alpha helix patterns really adopt the expected helical conformation by measuring the distance between the first and the last amino acid of each of them and checking whether it is compatible with an alpha helix. The first and last amino acids of each rod define a vector that is then used to calculate the angles between consecutive helical patterns. The frequency of all the determined lengths and angles is then further analyzed, using a strategy similar to the ArchDB analysis presented above.<br />
<br />
The whole process for the verification of the different linker patterns was set up on the [https://2014.igem.org/Team:Heidelberg/Software/igemathome distributed computing system]. Due to lack of time, only a few results could be analyzed, resulting in distributions for the different helices; see for example figures 6 and 7.<br />
This led to a refinement of the lengths of the alpha helix blocks presented above to 8.7, 10, 10.8, 15.6, 16.8, 24.8 and 32.3 &Aring;.<br />
<br />
{{:Team:Heidelberg/templates/image-quarter|<br />
align=right|<br />
descr=|<br />
caption=Fig. 7. Length distribution of AEAAAKEAAAK|<br />
file=plot_of_AEAAAKEAAAK.png}}<br />
{{:Team:Heidelberg/templates/image-quarter|<br />
align=right|<br />
descr=|<br />
caption=Fig. 6. Length distribution of AEAAAKA |<br />
file=plot_of_AEAAAKA.png}}<br />
<br />
=Conclusion=<br />
The patterns that we identified by analyzing structure databases provide an easy and fast tool to build custom-shaped peptides. The main achievement is the identification of the angle patterns. These are designed as building blocks for enhanced applicability. The shapes were identified from a database of non-homologous proteins and the patterns were refined until the distribution of surrounding amino acids looked random. Thus we can conclude that the angle distributions we observed are due not to the surrounding sequences but to the identified patterns. We have not observed any evidence for certain helix patterns being preferred in the database.<br />
From figures 6 and 7 we have learned that the lengths we have assumed for the helices needed to be refined. For example we had assumed the AEAAAKA motif to span a distance of 10.5 &Aring; but have observed it to be only 10 &Aring; long. Our [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software CRAUT software] was accordingly corrected.<br />
<br />
=References=<br />
<br />
[1] Vieille, C. & Zeikus, G.J. Hyperthermophilic enzymes: sources, uses, and molecular mechanisms for thermostability. Microbiology and molecular biology reviews : MMBR 65, 1-43 (2001).<br />
<br />
[2] Yu, Y. & Lutz, S. Circular permutation: a different way to engineer enzyme structure and function. Trends Biotechnol. 29, 18-25 (2011).<br />
<br />
[3] Arai, R., Ueda, H., Kitayama, A., Kamiya, N. & Nagamune, T. Design of the linkers which effectively separate domains of a bifunctional fusion protein. Protein Eng. 14, 529-532 (2001).<br />
<br />
[4] Wang, C.K.L., Kaas, Q., Chiche, L. & Craik, D.J. CyBase: A database of cyclic protein sequences and structures, with applications in protein discovery and engineering. Nucleic Acids Research 36, (2008).<br />
<br />
[5] Efimov, A. V. Standard structures in proteins. Prog. Biophys. Mol. Biol. 60, 201-239 (1993).<br />
<br />
[6] Donate, L. E., Rufino, S. D., Canard, L. H. & Blundell, T. L. Conformational analysis and clustering of short and medium size loops connecting regular secondary structures: a database for modeling and prediction. Protein Sci. 5, 2600-2616 (1996).<br />
<br />
[7] Bonet, J. et al. ArchDB 2014: structural classification of loops in proteins. Nucleic Acids Res. 42, D315-D319 (2014).<br />
<br />
[8] George, R. A. & Heringa, J. An analysis of protein domain linkers: their classification and role in protein folding. Protein Eng. 15, 871-879 (2002).<br />
<br />
[9] Chen, X., Zaro, J. L. & Shen, W.-C. Fusion protein linkers: property, design and functionality. Adv. Drug Deliv. Rev. 65, 1357-1369 (2013).<br />
<br />
[10] Fiser, A. et al. Modeling of loops in protein structures. Protein science : a publication of the Protein Society 9, 1753-1773 (2000).<br />
<br />
[11] Fiser, A. & Sali, A. ModLoop: Automated modeling of loops in protein structures. Bioinformatics 19, 2500-2501 (2003).</div>Igemnilshttp://2014.igem.org/Team:Heidelberg/pages/Linker_ModelingTeam:Heidelberg/pages/Linker Modeling2014-10-17T20:36:32Z<p>Igemnils: /* In silico refinement */</p>
<hr />
<div>=Background=<br />
Primary, secondary, tertiary and quaternary structures are the main levels of protein structure characterization. Primary structure designates the amino acid sequence, while the secondary structure describes the arrangement of consecutive amino acids through their two dihedral angles $\phi$ and $\psi$. The Ramachandran plot, which represents the amino acid position in the space of those two angles, shows two particular arrangements commonly found in proteins: alpha helices and beta sheets. The next level of protein organization is the tertiary structure, which describes how the protein is organized in the three spatial dimensions, whereas the quaternary structure describes how different subunits of proteins cluster.<br />
Finally, closely related to these standard structures, the supersecondary structure describes how secondary structure elements are connected to each other. While these connections look undefined at first sight, further analysis revealed that this wide variety of supersecondary structure motifs can be clustered to certain patterns [[#References|[5]]].<br />
<br />
==Supersecondary structure==<br />
<br />
When the properties of supersecondary structures were first described, only very few patterns were identified, mainly due to the lack of highly resolved protein structures. At that time, the structures were mainly classified by the Ramachandran plot regions where the amino acids could be found [[#References|[6]]]. With the growing number of known crystal structures, the analysis of supersecondary structure improved and led to databases with about 150 000 classified loop structures and elaborate clustering [[#References|[7]]]. Nowadays supersecondary structures are defined as the structures built when two secondary structure elements are joined by a small peptide that is not assigned to either of the secondary structures. These loop peptides range from 1 to 9 amino acids.<br />
Our aim was to build reliable stable linkers out of alpha helices connected by supersecondary structure motifs that produce certain angles. To achieve that, we searched for the most reliable alpha helix patterns that would form rigid rods and angle patterns covering the whole range of angles from 0 to 180 degrees.<br />
<br />
=Linker building block design=<br />
<br />
===Helix patterns===<br />
Various patterns have been used to build helical linkers to connect protein ends [[#References | [8]]]. Moreover, in known protein structures, linkers between subdomains can be identified and their properties have already been analyzed [[#References | [9]]]. Two main criteria were used to build the alpha helix patterns: they should robustly form alpha helices, and they should be soluble in aqueous solution. Therefore, we could not just use linkers built of alanine. Instead we decided to add some charged amino acids to the pattern and to position them physically close to each other so that they could stabilize each other by Coulomb interaction. These amino acids needed to be separated by 3 amino acids, as a helical turn takes about 3.6 amino acids. The pattern we chose as most suitable for our purpose was also described as one of the most stable [[#References | [3]]].<br />
8 alpha helix building blocks were eventually chosen: AEAAAK, AEAAAKA, AEAAAKAA, AEAAAKEAAAK, AEAAAKEAAAKA, AEAAAKEAAAKEAAAKA, AEAAAKEAAAKEAAAKEAAAKA, AEAAAKEAAAKEAAAKEAAAKEAAAKA, with respective estimated lengths of 9, 10.5, 12, 16.5, 18, 25.5 and 33 &Aring; for the first seven, i.e. 1.5 &Aring; per residue.<br />
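The estimated lengths listed above are consistent with a rise of 1.5 &Aring; per residue along the helix axis. The following is a minimal sketch of that arithmetic, not the code used in the project; the constant and the function name are assumptions chosen for illustration.<br />

```python
# Assumed rise per residue along the helix axis, in angstroms.
# This value reproduces the estimates given in the text (e.g. 6 residues -> 9 A).
RISE_PER_RESIDUE = 1.5

# The eight helix building blocks named in the text.
HELIX_BLOCKS = [
    "AEAAAK",
    "AEAAAKA",
    "AEAAAKAA",
    "AEAAAKEAAAK",
    "AEAAAKEAAAKA",
    "AEAAAKEAAAKEAAAKA",
    "AEAAAKEAAAKEAAAKEAAAKA",
    "AEAAAKEAAAKEAAAKEAAAKEAAAKA",
]


def estimated_length(block: str) -> float:
    """Return the estimated end-to-end length of a helical block in angstroms."""
    return RISE_PER_RESIDUE * len(block)


if __name__ == "__main__":
    for block in HELIX_BLOCKS:
        print(f"{block}: {estimated_length(block):.1f}")
```

Under the same assumption, the eighth block (27 residues) would come out at 40.5 &Aring;; this value is derived here and is not stated in the original list.<br />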
<br />
===Angle patterns===<br />
<br />
The angle patterns for our model were obtained from the ArchDB database [[#References | [7]]], which classifies loops from known protein structures. About 17 000 non-homologous proteins from the PDB database were analyzed, and from them 150 000 loop structures, i.e. regions connecting two secondary structure elements, were identified. The classification took into account not only the length of the loop and its conformation, meaning the φ and ψ backbone dihedral angles of the residues in the loop, but also the distance between the attachments of the loop to the surrounding secondary structures. Furthermore, the secondary structures surrounding the loop and the geometry defined by the supersecondary structure motifs can be found in the database.<br />
<br />
To extract the supersecondary structure motifs relevant to our linker design from ArchDB, the complete database was downloaded and helix-loop-helix motifs were extracted using a [https://github.com/igemsoftware/Heidelberg_2014 self-written script] in Python. Of these, we only took into account loops composed of 1 or 2 amino acids, because the longer the loops, the less frequent and therefore the less reliable they are, and the further their ends are from each other.<br />
The interesting information for us was (1) the angle produced between the vectors defining the surrounding alpha helices, (2) the distance between the ends of the loop, and (3) the type of amino acids surrounding the loop. Furthermore we analyzed the statistical significance of the conformation. For each amino acid combination in the loop region, the angle distribution between the embracing alpha helices, the loop length distribution and a 2d heatmap of the surrounding amino acids were automatically plotted (Fig. 1-4).<br />
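The angle in item (1) can be illustrated with a short sketch. It assumes that each helix is represented by a vector from its first to its last residue position (as is done later for the modeled rods) and that the angle is the one between those two vectors; the function names and coordinates are hypothetical, and the sign convention used in ArchDB may differ.<br />

```python
import math


def helix_axis(first_pos, last_pos):
    """Vector from the first to the last residue position of a helix."""
    return tuple(b - a for a, b in zip(first_pos, last_pos))


def angle_between(u, v):
    """Angle between two 3D vectors in degrees (0 = parallel, 180 = antiparallel)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against rounding errors before acos.
    cos_theta = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cos_theta))


if __name__ == "__main__":
    # Hypothetical coordinates: one helix along x, the next one along y.
    helix_1 = helix_axis((0.0, 0.0, 0.0), (10.0, 0.0, 0.0))
    helix_2 = helix_axis((12.0, 0.0, 0.0), (12.0, 10.0, 0.0))
    print(angle_between(helix_1, helix_2))
```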
<br />
These distributions were then visually analyzed to identify loops of interest for linker design. We focused on loops that showed a narrow angle distribution and that appeared frequently in the database, and we extended the distribution analysis to the surrounding amino acids. For example, once we had identified that T in a turn produces an interesting distribution (Fig. 1), we extended it by K in front and by A at the end and checked how the distribution behaved. Restricting the possibilities in this way reduces the number of occurrences tremendously, but the angle distribution becomes better defined.<br />
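The narrowing step can be sketched as a filter over loop records followed by summary statistics. The records below are invented for illustration and are not ArchDB data; the tuple layout and function names are assumptions, and the original analysis was done visually on plots rather than with these statistics.<br />

```python
from statistics import mean, stdev

# Hypothetical records: (residue before loop, loop sequence,
# residue after loop, inter-helix angle in degrees).
RECORDS = [
    ("K", "T", "A", 36.0),
    ("K", "T", "A", 41.0),
    ("K", "T", "L", 44.0),
    ("E", "T", "A", 30.0),
    ("E", "T", "S", 95.0),
    ("K", "G", "A", 120.0),
]


def select(records, loop, before=None, after=None):
    """Angles of all loops with the given sequence and optional flanking residues."""
    return [
        angle
        for b, l, a, angle in records
        if l == loop
        and (before is None or b == before)
        and (after is None or a == after)
    ]


def summarize(angles):
    """Return (number of hits, mean angle, sample standard deviation)."""
    spread = stdev(angles) if len(angles) > 1 else 0.0
    return len(angles), mean(angles), spread


if __name__ == "__main__":
    print(summarize(select(RECORDS, "T")))
    print(summarize(select(RECORDS, "T", before="K")))
    print(summarize(select(RECORDS, "T", before="K", after="A")))
```

With these made-up records, each added constraint lowers the number of hits and narrows the spread, which mirrors the trade-off described above.<br />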
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of__T_.png|<br />
descr= Fig 1. A loop composed of T produces an interesting well-defined angle distribution (left panel). K seems to be the most frequent residue before the turn}} <br />
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of_K_T_.png|<br />
descr= The distribution is narrowing a lot, K seems to be good before the turn }} <br />
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of__T_A.png|<br />
descr= Also with A after the loop the distribution stays well defined}} <br />
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of_K_T_A.png|<br />
descr= The whole sequence still produces a nicely shaped distribution, but the number of hits is much lower than before}}<br />
<br />
This step had two main goals: narrowing down the angle distribution and finding loops with no preference for the amino acids surrounding them. This last point was important for the modularity of our approach: the angle blocks should not be affected by the surrounding alpha helices. Using this approach, 10 different angle motifs (table 1) producing different angles could be identified. Importantly, all these motifs were chosen so that the length of the two segments starting from the turning point was the same for all of them (put figure turning point). It should also be noted that 3 of the patterns, KTA, LVA and AAIAP, produce the same angle. They were nevertheless all kept for the ''in silico'' refinement described below, but only one was used for the [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software CRAUT software].<br />
<br />
<br />
{| class="table table-hover" style="text-align: center;"<br />
|+'''table 1''': The span of parameters.<br />
!colspan="10"|Angle Patterns<br />
|-<br />
|Pattern <br />
| NVL <br />
| KTA <br />
| LVA <br />
| AAIAP <br />
| AADGTL <br />
| VNLTA<br />
| AAAHPEA<br />
| ASLPAA <br />
| ATGDLA<br />
|-<br />
|Mean <br />
| 29.7 <br />
| 38.7 <br />
| 35 <br />
| 36.5 <br />
| 60 <br />
| 74.5 <br />
| 117 <br />
| 140 <br />
| 160 <br />
|-<br />
| Variation <br />
| 8.5 <br />
| 30 <br />
| 29 <br />
| 27 <br />
| 12 <br />
| 27 <br />
| 12 <br />
| 15 <br />
| 5 <br />
|}<br />
<br />
===Sequences to connect the alpha helix to the protein extremity===<br />
<br />
Helical patterns often affect the folding of the attached sequences. To prevent them from affecting the structure of the protein of interest, we analyzed the effects of various amino acids on helix formation ''in silico'' using online tools like [http://bioserv.rpbs.univ-paris-diderot.fr/services/PEP-FOLD/ pep-fold]. We identified glycines and prolines as reliable amino acids to interrupt helix formation. We then decided to use glycine pairs to connect the protein of interest to the linker, because they give more flexibility to the orientation of the initial helix.<br />
This only concerns the protein end that was connected to the linker through the coding sequence. The other end is ligated through exteins or sortase scar, both treated as unstructured flexible regions.<br />
<br />
===''Conclusion''===<br />
<br />
The last three parts show how we could design alpha helix and angle pattern blocks and connect them to each other and to the protein of interest. They provide the material that allowed us to transform a linker defined as a geometrical path into a real amino acid sequence in our [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software software].<br />
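As a rough illustration of how a geometrical path could be turned into a sequence from these blocks, the sketch below picks, for each segment, the helix block whose estimated length is closest and, for each bend, the angle pattern from table 1 whose mean angle is closest. This is not the CRAUT algorithm; the selection rule, the 1.5 &Aring; per residue estimate, the leading glycine pair and all function names are assumptions made for this example.<br />

```python
# Mean angles in degrees, copied from table 1.
ANGLE_PATTERNS = {
    "NVL": 29.7, "KTA": 38.7, "LVA": 35.0, "AAIAP": 36.5, "AADGTL": 60.0,
    "VNLTA": 74.5, "AAAHPEA": 117.0, "ASLPAA": 140.0, "ATGDLA": 160.0,
}

HELIX_BLOCKS = [
    "AEAAAK", "AEAAAKA", "AEAAAKAA", "AEAAAKEAAAK", "AEAAAKEAAAKA",
    "AEAAAKEAAAKEAAAKA", "AEAAAKEAAAKEAAAKEAAAKA",
    "AEAAAKEAAAKEAAAKEAAAKEAAAKA",
]

RISE_PER_RESIDUE = 1.5  # assumed, in angstroms


def nearest_helix(length):
    """Helix block whose estimated length is closest to the requested length."""
    return min(HELIX_BLOCKS, key=lambda b: abs(RISE_PER_RESIDUE * len(b) - length))


def nearest_angle(angle):
    """Angle pattern whose mean angle is closest to the requested angle."""
    return min(ANGLE_PATTERNS, key=lambda p: abs(ANGLE_PATTERNS[p] - angle))


def build_linker(segment_lengths, angles, connector="GG"):
    """Concatenate connector, helix blocks and angle patterns along a path."""
    if len(angles) != len(segment_lengths) - 1:
        raise ValueError("need exactly one angle between consecutive segments")
    parts = [connector, nearest_helix(segment_lengths[0])]
    for angle, length in zip(angles, segment_lengths[1:]):
        parts.append(nearest_angle(angle))
        parts.append(nearest_helix(length))
    return "".join(parts)


if __name__ == "__main__":
    # Hypothetical path: a 10 A rod, a 60 degree bend, then a 17 A rod.
    print(build_linker([10.0, 17.0], [60.0]))
```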
<br />
Thanks to this we could design linkers to circularize [https://2014.igem.org/Team:Heidelberg/Project/PCR_2.0 DNMT1] and [https://2014.igem.org/Team:Heidelberg/Project/Linker_Screening lysozyme]. Additionally, it was shown for lysozyme that a custom-tailored linker enhances heat stability compared to a badly designed linker and even to the flexible linker.<br />
<br />
=''In silico'' refinement=<br />
<br />
As some of the interesting patterns could not be found often enough to be statistically significant, we decided to make a further refinement ''in silico'' by modeling the structure of proteins with circularizing linkers. To perform this for realistic situations, we selected, from the RCSB database, structures of non-homologous target proteins with extremities that are separated enough to require a linker for circularization. First, the [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software CRAUT software] generates possible fitting linkers for various proteins. From these possible linkers, the 100 shortest were taken. Among the 3 possible angle patterns that generate the same angle, the software provides only one. But the linker refinement developed here used the three of them for comparison. We assume that the linkers connect the ends of the protein without setting tension on the protein, so that the protein can fold in its natural way.<br />
<br />
After this, the circularized proteins with the specific linkers are modelled using a software package called Modeller [[#References|[10]]]. This software is widely used for comparative structure prediction. It is well established in the scientific community and should be well suited to the prediction of loop regions attached to existing structures [[#References|[11]]]. It is freely available for academic use from the [http://salilab.org/modeller/ salilab] webpages. The program is able to predict the 3D structure of a given sequence based on an alignment with a given structure. In our script, Modeller needed to be provided with a sequence with the linker attached and with the PDB file of the protein of interest. This latter file is the only structural information used here. Modeller is recommended when at least 30% sequence similarity exists between the provided structure and the one that the user wishes to model. Here, only the linker is different, so this similarity is around 90% in our case. First, Modeller makes an alignment between the provided structure and the sequence of our linked protein, identifying the regions that cannot be found in the structure. Based on that, Modeller generates 4 initial models. One of the strengths of the software is its capacity to further refine only certain parts of the protein. Thus we let Modeller refine the loops, which are defined as any part of the protein that could not be found in the structure file. The ''ab initio'' modelling of our linkers is done by minimizing energy functions with different methods such as conjugate gradients and molecular dynamics. Eventually, each modeled structure is provided with energy values, thanks to which different models of the same structure can be compared. From Modeller we received about 8 different models and chose the one with the best energy scores to proceed with. For these refinement steps, one can choose different levels of optimization. 
We always chose accuracy over speed. The result of a prediction for lysozyme of bacteriophage lambda can be seen in figure 5.<br />
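Two of the small calculations mentioned above can be sketched as follows: the fraction of the linked sequence that is covered by the template structure, and the choice of the model with the best score. The residue counts, file names and scores are invented, and the sketch assumes that a lower score is better for the energy value being compared; it does not call Modeller itself.<br />

```python
def template_coverage(n_template_residues: int, n_linker_residues: int) -> float:
    """Fraction of the linked sequence that is present in the template structure."""
    total = n_template_residues + n_linker_residues
    return n_template_residues / total


def pick_best(models):
    """Return the (file name, score) pair with the lowest score."""
    return min(models, key=lambda model: model[1])


# Hypothetical output of one modelling run: (file name, energy score).
MODELS = [
    ("model_loop_1.pdb", -15210.4),
    ("model_loop_2.pdb", -15398.7),
    ("model_loop_3.pdb", -15102.9),
]

if __name__ == "__main__":
    # Hypothetical sizes: a 150-residue protein with a 17-residue linker.
    print(round(template_coverage(150, 17), 2))
    print(pick_best(MODELS)[0])
```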
<br />
{{:Team:Heidelberg/templates/image-quarter|<br />
align=right|<br />
caption=Figure 5) Circular lysozyme|<br />
descr=The structure of circular lysozyme (yellow) was predicted and aligned to the linear structure (purple). The linker calculated by the software was marked green. |<br />
file=circ_lam_lys_nils.png}}<br />
<br />
<br />
Modeller was run by distributing the calculations via the [https://2014.igem.org/Team:Heidelberg/Software/igemathome iGEM@home system]. Modelling one linker took about 10 hours of calculation time on average, a value that depends strongly on the size of the protein. The best model is then evaluated by another self-written program to analyze the behaviour of the linker patterns in their natural surroundings.<br />
Finally, all the models for the different structures and the different linkers are analyzed for properties such as the length of the helical patterns, the shape of the attachment structures of the linker and the angles produced by the angle patterns [[ Figure helix_winkel_messung.png]]. First the modeled structure and the natural structure are superimposed to see how large the differences between them are. If the protein has been disturbed too much, the model is discarded. We verify that the alpha helix patterns really form the expected structure by measuring the distance between the first and the last amino acid of each of them and checking whether it is compatible with an alpha helix. The first and last amino acids of each rod define a vector that is then used to calculate the angles between consecutive helical patterns. The frequency of all the determined lengths and angles is then further analyzed, using a strategy similar to the ArchDB analysis presented above.<br />
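The helix check described above can be sketched as a comparison between the measured end-to-end distance of a block and the length expected for an alpha helix. The 1.5 &Aring; per residue estimate, the 20% tolerance and the function name are assumptions for this illustration, not the criteria of the original evaluation program.<br />

```python
import math

RISE_PER_RESIDUE = 1.5  # assumed helical rise per residue, in angstroms


def is_helix_compatible(first_pos, last_pos, n_residues, tolerance=0.2):
    """Check whether the end-to-end distance of a block matches a helix.

    The block passes if the measured distance deviates from the expected
    helical length by at most `tolerance` (a hypothetical relative cutoff).
    """
    expected = RISE_PER_RESIDUE * n_residues
    measured = math.dist(first_pos, last_pos)
    return abs(measured - expected) <= tolerance * expected


if __name__ == "__main__":
    # Hypothetical coordinates for the 7-residue block AEAAAKA.
    print(is_helix_compatible((0.0, 0.0, 0.0), (10.0, 0.0, 0.0), 7))
    # A much longer end-to-end distance, as for an extended chain, fails.
    print(is_helix_compatible((0.0, 0.0, 0.0), (23.0, 0.0, 0.0), 7))
```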
<br />
The whole process for the verification of the different linker patterns was set up on the [https://2014.igem.org/Team:Heidelberg/Software/igemathome distributed computing system]. Due to lack of time, only a few results could be analyzed, resulting in length distributions for the different helices; see for example figures 6 and 7.<br />
This led to a refinement of the lengths of the alpha helix blocks presented above to 8.7, 10, 10.8, 15.6, 16.8, 24.8 and 32.3 &Aring;.<br />
<br />
{{:Team:Heidelberg/templates/image-quarter|<br />
align=right|<br />
descr=|<br />
caption=Figure 6) Length distribution of AEAAAKA|<br />
file=plot_of_AEAAAKA.png}}<br />
{{:Team:Heidelberg/templates/image-quarter|<br />
align=right|<br />
descr=|<br />
caption=Figure 7) Length distribution of AEAAAKEAAAK|<br />
file=plot_of_AEAAAKEAAAK.png}}<br />
<br />
=Conclusion=<br />
The patterns that we identified by analyzing structure databases provide an easy and fast tool to build custom-shaped peptides. The main achievement is the identification of the angle patterns. These are designed as building blocks for enhanced applicability. The shapes were identified from a database of non-homologous proteins and the patterns were refined until the distribution of surrounding amino acids looked random. Thus we can conclude that the angle distributions we observed are due not to the surrounding sequences but to the identified patterns. We have not observed any evidence for certain helix patterns being preferred in the database.<br />
From figures 6 and 7 we have learned that the lengths we have assumed for the helices needed to be refined. For example we had assumed the AEAAAKA motif to span a distance of 10.5 &Aring; but have observed it to be only 10 &Aring; long. Our [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software CRAUT software] was accordingly corrected.<br />
<br />
=References=<br />
<br />
[1] Vieille, C. & Zeikus, G.J. Hyperthermophilic enzymes: sources, uses, and molecular mechanisms for thermostability. Microbiology and molecular biology reviews : MMBR 65, 1-43 (2001).<br />
<br />
[2] Yu, Y. & Lutz, S. Circular permutation: a different way to engineer enzyme structure and function. Trends Biotechnol. 29, 18-25 (2011).<br />
<br />
[3] Arai, R., Ueda, H., Kitayama, A., Kamiya, N. & Nagamune, T. Design of the linkers which effectively separate domains of a bifunctional fusion protein. Protein Eng. 14, 529-532 (2001).<br />
<br />
[4] Wang, C.K.L., Kaas, Q., Chiche, L. & Craik, D.J. CyBase: A database of cyclic protein sequences and structures, with applications in protein discovery and engineering. Nucleic Acids Research 36, (2008).<br />
<br />
[5] Efimov, A. V. Standard structures in proteins. Prog. Biophys. Mol. Biol. 60, 201-239 (1993).<br />
<br />
[6] Donate, L. E., Rufino, S. D., Canard, L. H. & Blundell, T. L. Conformational analysis and clustering of short and medium size loops connecting regular secondary structures: a database for modeling and prediction. Protein Sci. 5, 2600-2616 (1996).<br />
<br />
[7] Bonet, J. et al. ArchDB 2014: structural classification of loops in proteins. Nucleic Acids Res. 42, D315-D319 (2014).<br />
<br />
[8] George, R. A. & Heringa, J. An analysis of protein domain linkers: their classification and role in protein folding. Protein Eng. 15, 871-879 (2002).<br />
<br />
[9] Chen, X., Zaro, J. L. & Shen, W.-C. Fusion protein linkers: property, design and functionality. Adv. Drug Deliv. Rev. 65, 1357-1369 (2013).<br />
<br />
[10] Fiser, A. et al. Modeling of loops in protein structures. Protein science : a publication of the Protein Society 9, 1753-1773 (2000).<br />
<br />
[11] Fiser, A. & Sali, A. ModLoop: Automated modeling of loops in protein structures. Bioinformatics 19, 2500-2501 (2003).</div>Igemnilshttp://2014.igem.org/Team:Heidelberg/pages/Linker_ModelingTeam:Heidelberg/pages/Linker Modeling2014-10-17T20:35:34Z<p>Igemnils: /* In silico refinement */</p>
<hr />
<div>=Background=<br />
Primary, secondary, tertiary and quaternary structures are the main levels of protein structure characterization. Primary structure designates the amino acid sequence, while the secondary structure describes the arrangement of consecutive amino acids through their two dihedral angles $\phi$ and $\psi$. The Ramachandran plot, which represents the amino acid position in the space of those two angles, shows two particular arrangements commonly found in proteins: alpha helices and beta sheets. The next level of protein organization is the tertiary structure, which describes how the protein is organized in the three spatial dimensions, whereas the quaternary structure describes how different subunits of proteins cluster.<br />
Finally, closely related to these standard structures, the supersecondary structure describes how secondary structure elements are connected to each other. While these connections look undefined at first sight, further analysis revealed that this wide variety of supersecondary structure motifs can be clustered to certain patterns [[#References|[5]]].<br />
<br />
==Supersecondary structure==<br />
<br />
When the properties of supersecondary structures were first described, only very few patterns were identified, mainly due to the lack of highly resolved protein structures. At that time, the structures were mainly classified by the Ramachandran plot regions where the amino acids could be found [[#References|[6]]]. With the growing number of known crystal structures, the analysis of supersecondary structure improved and led to databases with about 150 000 classified loop structures and elaborate clustering [[#References|[7]]]. Nowadays supersecondary structures are defined as the structures built when two secondary structure elements are joined by a small peptide that is not assigned to either of the secondary structures. These loop peptides range from 1 to 9 amino acids.<br />
Our aim was to build reliable stable linkers out of alpha helices connected by supersecondary structure motifs that produce certain angles. To achieve that, we searched for the most reliable alpha helix patterns that would form rigid rods and angle patterns covering the whole range of angles from 0 to 180 degrees.<br />
<br />
=Linker building block design=<br />
<br />
===Helix patterns===<br />
Various patterns have been used to build helical linkers to connect protein ends [[#References | [8]]]. Moreover, in known protein structures, linkers between subdomains can be identified and their properties have already been analyzed [[#References | [9]]]. Two main criteria were used to build the alpha helix patterns: they should robustly form alpha helices, and they should be soluble in aqueous solution. Therefore, we could not just use linkers built of alanine. Instead we decided to add some charged amino acids to the pattern and to position them physically close to each other so that they could stabilize each other by Coulomb interaction. These amino acids needed to be separated by 3 amino acids, as a helical turn takes about 3.6 amino acids. The pattern we chose as most suitable for our purpose was also described as one of the most stable [[#References | [3]]].<br />
8 alpha helix building blocks were eventually chosen: AEAAAK, AEAAAKA, AEAAAKAA, AEAAAKEAAAK, AEAAAKEAAAKA, AEAAAKEAAAKEAAAKA, AEAAAKEAAAKEAAAKEAAAKA, AEAAAKEAAAKEAAAKEAAAKEAAAKA, with respective estimated lengths of 9, 10.5, 12, 16.5, 18, 25.5 and 33 &Aring; for the first seven, i.e. 1.5 &Aring; per residue.<br />
<br />
===Angle patterns===<br />
<br />
The angle patterns for our model were obtained from the ArchDB database [[#References | [7]]], which classifies loops from known protein structures. About 17 000 non-homologous proteins from the PDB database were analyzed, and from them 150 000 loop structures, i.e. regions connecting two secondary structure elements, were identified. The classification took into account not only the length of the loop and its conformation, meaning the φ and ψ backbone dihedral angles of the residues in the loop, but also the distance between the attachments of the loop to the surrounding secondary structures. Furthermore, the secondary structures surrounding the loop and the geometry defined by the supersecondary structure motifs can be found in the database.<br />
<br />
To extract the supersecondary structure motifs relevant to our linker design from ArchDB, the complete database was downloaded and helix-loop-helix motifs were extracted using a [https://github.com/igemsoftware/Heidelberg_2014 self-written script] in Python. Of these, we only took into account loops composed of 1 or 2 amino acids, because the longer the loops, the less frequent and therefore the less reliable they are, and the further their ends are from each other.<br />
The interesting information for us was (1) the angle produced between the vectors defining the surrounding alpha helices, (2) the distance between the ends of the loop, and (3) the type of amino acids surrounding the loop. Furthermore we analyzed the statistical significance of the conformation. For each amino acid combination in the loop region, the angle distribution between the embracing alpha helices, the loop length distribution and a 2d heatmap of the surrounding amino acids were automatically plotted (Fig. 1-4).<br />
<br />
These distributions were then visually analyzed to identify loops of interest for linker design. We focused on loops that showed a narrow angle distribution and that appeared frequently in the database, and we extended the distribution analysis to the surrounding amino acids. For example, once we had identified that T in a turn produces an interesting distribution (Fig. 1), we extended it by K in front and by A at the end and checked how the distribution behaved. Restricting the possibilities in this way reduces the number of occurrences tremendously, but the angle distribution becomes better defined.<br />
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of__T_.png|<br />
descr= Fig 1. A loop composed of T produces an interesting well-defined angle distribution (left panel). K seems to be the most frequent residue before the turn}} <br />
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of_K_T_.png|<br />
descr= The distribution is narrowing a lot, K seems to be good before the turn }} <br />
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of__T_A.png|<br />
descr= Also with A after the loop the distribution stays well defined}} <br />
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of_K_T_A.png|<br />
descr= The whole sequence still produces a nicely shaped distribution, but the number of hits is much lower than before}}<br />
<br />
This step had two main goals: narrowing down the angle distribution and finding loops with no preference for the amino acids surrounding them. This last point was important for the modularity of our approach: the angle blocks should not be affected by the surrounding alpha helices. Using this approach, 10 different angle motifs (table 1) producing different angles could be identified. Importantly, all these motifs were chosen so that the length of the two segments starting from the turning point was the same for all of them (put figure turning point). It should also be noted that 3 of the patterns, KTA, LVA and AAIAP, produce the same angle. They were nevertheless all kept for the ''in silico'' refinement described below, but only one was used for the [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software CRAUT software].<br />
<br />
<br />
{| class="table table-hover" style="text-align: center;"<br />
|+'''table 1''': The span of parameters.<br />
!colspan="10"|Angle Patterns<br />
|-<br />
|Pattern <br />
| NVL <br />
| KTA <br />
| LVA <br />
| AAIAP <br />
| AADGTL <br />
| VNLTA<br />
| AAAHPEA<br />
| ASLPAA <br />
| ATGDLA<br />
|-<br />
|Mean <br />
| 29.7 <br />
| 38.7 <br />
| 35 <br />
| 36.5 <br />
| 60 <br />
| 74.5 <br />
| 117 <br />
| 140 <br />
| 160 <br />
|-<br />
| Variation <br />
| 8.5 <br />
| 30 <br />
| 29 <br />
| 27 <br />
| 12 <br />
| 27 <br />
| 12 <br />
| 15 <br />
| 5 <br />
|}<br />
<br />
===Sequences to connect the alpha helix to the protein extremity===<br />
<br />
Helical patterns often affect the folding of the attached sequences. To prevent them from affecting the structure of the protein of interest, we analyzed the effects of various amino acids on helix formation ''in silico'' using online tools like [http://bioserv.rpbs.univ-paris-diderot.fr/services/PEP-FOLD/ pep-fold]. We identified glycines and prolines as reliable amino acids to interrupt helix formation. We then decided to use glycine pairs to connect the protein of interest to the linker, because they give more flexibility to the orientation of the initial helix.<br />
This only concerns the protein end that was connected to the linker through the coding sequence. The other end is ligated through exteins or sortase scar, both treated as unstructured flexible regions.<br />
<br />
===''Conclusion''===<br />
<br />
The last three parts show how we could design alpha helix and angle pattern blocks and connect them to each other and to the protein of interest. They provide the material that allowed us to transform a linker defined as a geometrical path into a real amino acid sequence in our [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software software].<br />
<br />
Thanks to this we could design linkers to circularize [https://2014.igem.org/Team:Heidelberg/Project/PCR_2.0 DNMT1] and [https://2014.igem.org/Team:Heidelberg/Project/Linker_Screening lysozyme]. Additionally, it was shown for lysozyme that a custom-tailored linker enhances heat stability compared to a badly designed linker and even to the flexible linker.<br />
<br />
=''In silico'' refinement=<br />
<br />
As some of the interesting patterns did not occur often enough to be statistically significant, we decided to refine them further ''in silico'' by modeling the structure of proteins with circularizing linkers. To do this for realistic situations, we selected from the RCSB database structures of non-homologous target proteins whose extremities are far enough apart to require a linker for circularization. First, the [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software CRAUT software] generates possible fitting linkers for the various proteins. From these possible linkers, the 100 shortest were taken. Among the 3 angle patterns that generate nearly the same angle, the software provides only one, but the linker refinement developed here used all three for comparison. We assume that the linkers connect the ends of the protein without putting tension on it, so that the protein can fold in its natural way.<br />
<br />
After this, the circularized proteins with the specific linkers are modelled using Modeller [[#References|[10]]], a program widely used for comparative structure prediction. It is well established in the scientific community and should be well suited to predicting loop regions attached to existing structures [[#References|[11]]]. It is freely available for academic use from the [http://salilab.org/modeller/ salilab] webpages. The program predicts the 3D structure of a given sequence based on an alignment with a given structure. In our script, Modeller is provided with the sequence including the attached linker and with the PDB file of the protein of interest. This file is the only structural information used here. Modeller is recommended when at least 30% sequence similarity exists between the provided structure and the one that the user wishes to model. Here, only the linker differs, so this similarity is around 90% in our case. Modeller first aligns the provided structure with the sequence of our linked protein, identifying the regions that cannot be found in the structure. Based on that, Modeller generates 4 initial models. One of the strengths of the software is its capacity to further refine only certain parts of the protein. We therefore let Modeller refine the loops, which are defined as any part of the protein that could not be found in the structure file. The ''ab initio'' modelling of our linkers is done by minimizing energy functions with methods such as conjugate gradients and molecular dynamics. Finally, each modelled structure is assigned energy values, with which different models of the same structure can be compared. From Modeller we received about 8 different models and chose the one with the best energy score to proceed with. For these refinement steps, one can choose different levels of optimization. 
We always favored accuracy over speed. The result of a prediction for lysozyme of bacteriophage lambda can be seen in figure 5.<br />
<br />
{{:Team:Heidelberg/templates/image-quarter|<br />
align=right|<br />
caption=Figure 5) Circular lysozyme|<br />
descr=The structure of circular lysozyme (yellow) was predicted and aligned to the linear structure (purple). The linker calculated by the software was marked green. |<br />
file=circ_lam_lys_nils.png}}<br />
<br />
<br />
Modeller was run by distributing the calculations via the [https://2014.igem.org/Team:Heidelberg/Software/igemathome iGEM@home system]. The modelling of one linker took about 10 hours of calculation time on average, a value that depends strongly on the size of the protein. The best model is then evaluated by another self-written program to analyze the behaviour of the linker patterns in their natural surroundings.<br />
Finally, all the models for the different structures and the different linkers are analyzed for properties such as the length of the helical patterns, the shape of the attachment structures of the linker and the angles produced by the angle patterns [[ Figure helix_winkel_messung.png]]. First the modelled structure and the natural structure are fitted to each other to see how large the differences between them are. If the protein has been disturbed too much, the model is discarded. We verify that the alpha helix patterns really form helices by measuring the distance between the first and the last amino acid of each of them and checking whether it is compatible with an alpha helix. The first and last amino acids of each rod define a vector that is then used to calculate the angles between consecutive helical patterns. The frequency of all the determined lengths and angles is then further analyzed, using a strategy similar to the ArchDB analysis presented above.<br />
<br />
The whole process for the verification of the different linker patterns was set up on the [https://2014.igem.org/Team:Heidelberg/Software/igemathome distributed computing system]. Due to lack of time, only a few results could be analyzed, resulting in distributions for the different helices; see for example figures 6 and 7.<br />
This led to a refinement of the lengths of the alpha helix blocks presented above to 8.7, 10, 10.8, 15.6, 16.8, 24.8, 32.3 &Aring;.<br />
<br />
{{:Team:Heidelberg/templates/image-quarter|<br />
align=right|<br />
descr=|<br />
caption=Figure 6) Length distribution of AEAAAKA|<br />
file=plot_of_AEAAAKA.png}}<br />
<br />
{{:Team:Heidelberg/templates/image-quarter|<br />
align=right|<br />
descr=|<br />
caption=Figure 7) Length distribution of AEAAAKEAAAK|<br />
file=plot_of_AEAAAKEAAAK.png}}<br />
<br />
=Conclusion=<br />
The patterns that we identified by analyzing structure databases provide an easy and fast tool to build custom-shaped peptides. The main achievement is the identification of the angle patterns. These are designed as building blocks for enhanced applicability. The shapes were identified from a database of non-homologous proteins and the patterns were refined until the surrounding amino acids appeared randomly distributed. Thus we can conclude that the angle distributions we observed are due not to the surrounding sequences but to the identified patterns. We have not observed any evidence for certain helix patterns being preferred in the database.<br />
From figures 6 and 7 we have learned that the lengths we had assumed for the helices needed to be refined. For example, we had assumed the AEAAAKA motif to span a distance of 10.5 &Aring; but observed it to be only 10 &Aring; long. Our [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software CRAUT software] was corrected accordingly.<br />
<br />
=References=<br />
<br />
[1] Vieille, C. & Zeikus, G.J. Hyperthermophilic enzymes: sources, uses, and molecular mechanisms for thermostability. Microbiology and molecular biology reviews : MMBR 65, 1-43 (2001).<br />
<br />
[2] Yu, Y. & Lutz, S. Circular permutation: a different way to engineer enzyme structure and function. Trends Biotechnol. 29, 18-25 (2011).<br />
<br />
[3] Arai, R., Ueda, H., Kitayama, A., Kamiya, N. & Nagamune, T. Design of the linkers which effectively separate domains of a bifunctional fusion protein. Protein Eng. 14, 529-532 (2001).<br />
<br />
[4] Wang, C.K.L., Kaas, Q., Chiche, L. & Craik, D.J. CyBase: A database of cyclic protein sequences and structures, with applications in protein discovery and engineering. Nucleic Acids Research 36, (2008).<br />
<br />
[5] Efimov, A. V. Standard structures in proteins. Prog. Biophys. Mol. Biol. 60, 201-239 (1993).<br />
<br />
[6] Donate, L. E., Rufino, S. D., Canard, L. H. & Blundell, T. L. Conformational analysis and clustering of short and medium size loops connecting regular secondary structures: a database for modeling and prediction. Protein Sci. 5, 2600-2616 (1996).<br />
<br />
[7] Bonet, J. et al. ArchDB 2014: structural classification of loops in proteins. Nucleic Acids Res. 42, D315-D319 (2014).<br />
<br />
[8] George, R. A. & Heringa, J. An analysis of protein domain linkers: their classification and role in protein folding. Protein Eng. 15, 871-879 (2002).<br />
<br />
[9] Chen, X., Zaro, J. L. & Shen, W.-C. Fusion protein linkers: property, design and functionality. Adv. Drug Deliv. Rev. 65, 1357-1369 (2013).<br />
<br />
[10] Fiser, A. et al. Modeling of loops in protein structures. Protein Sci. 9, 1753-1773 (2000).<br />
<br />
[11] Fiser, A. & Sali, A. ModLoop: Automated modeling of loops in protein structures. Bioinformatics 19, 2500-2501 (2003).</div>Igemnilshttp://2014.igem.org/Team:Heidelberg/pages/Linker_ModelingTeam:Heidelberg/pages/Linker Modeling2014-10-17T20:31:48Z<p>Igemnils: /* In silico refinement */</p>
<hr />
<div>=Background=<br />
Primary, secondary, tertiary and quaternary structures are the main levels of protein structure characterization. Primary structure designates the amino acid sequence, while the secondary structure describes the arrangement of consecutive amino acids through their two dihedral angles $\phi$ and $\psi$. The Ramachandran plot, which represents the amino acid position in the space of those two angles, shows two particular arrangements commonly found in proteins: alpha helices and beta sheets. The next level of protein organization is the tertiary structure, which describes how the protein is organized in the three spatial dimensions, whereas the quaternary structure describes how different protein subunits assemble.<br />
Finally, closely related to these standard structures, the supersecondary structure describes how secondary structure elements are connected to each other. While these connections look undefined at first sight, further analysis revealed that this wide variety of supersecondary structure motifs can be clustered to certain patterns [[#References|[5]]].<br />
<br />
==Supersecondary structure==<br />
<br />
When the properties of supersecondary structures were first described, only very few patterns were identified, mainly due to the lack of highly resolved protein structures. At that time, the structures were mainly classified by the Ramachandran plot regions where the amino acids could be found [[#References|[6]]]. With the growing number of known crystal structures, the analysis of supersecondary structure improved and led to databases with about 150 000 classified loop structures and elaborate clustering [[#References|[7]]]. Nowadays supersecondary structures are defined as the structures built when two secondary structure elements are joined by a small peptide that is not itself assigned to one of the secondary structures. These loop peptides range from 1 to 9 amino acids.<br />
Our aim was to build reliable stable linkers out of alpha helices connected by supersecondary structure motifs that produce certain angles. To achieve that, we searched for the most reliable alpha helix patterns that would form rigid rods and angle patterns covering the whole range of angles from 0 to 180 degrees.<br />
<br />
=Linker building block design=<br />
<br />
===Helix patterns===<br />
Various patterns have been used to build helical linkers to connect protein ends [[#References | [8]]]. Moreover, in known protein structures, linkers between subdomains can be identified and their properties have already been analyzed [[#References | [9]]]. Two main criteria were used to build the alpha helix patterns: they should robustly form alpha helices, and they should be soluble in aqueous solution. Therefore, we could not just use linkers built of alanine. We decided to add some charged amino acids to the pattern, and to position them physically close to each other so that they could stabilize each other by Coulomb interaction. These amino acids needed to be separated by 3 amino acids, as a helical turn takes about 3.6 amino acids. The pattern we chose as most suitable for our purpose was also described to be one of the most stable [[#References | [3]]].<br />
8 alpha helix building blocks were eventually chosen: AEAAAK, AEAAAKA, AEAAAKAA, AEAAAKEAAAK, AEAAAKEAAAKA, AEAAAKEAAAKEAAAKA, AEAAAKEAAAKEAAAKEAAAKA, AEAAAKEAAAKEAAAKEAAAKEAAAKA, with respective estimated lengths of 9, 10.5, 12, 16.5, 18, 25.5 and 33 &Aring;.<br />
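<br />
The estimated lengths quoted above are consistent with a rise of 1.5 &Aring; per residue, which is the textbook value for an ideal alpha helix. The following minimal sketch reproduces them under that assumption; the constant and function names are chosen here for illustration and are not taken from the CRAUT software.<br />

```python
# Assumption: an ideal alpha helix rises 1.5 Angstrom per residue along its axis.
RISE_PER_RESIDUE = 1.5

# The eight helix building blocks listed in the text.
HELIX_BLOCKS = [
    "AEAAAK",
    "AEAAAKA",
    "AEAAAKAA",
    "AEAAAKEAAAK",
    "AEAAAKEAAAKA",
    "AEAAAKEAAAKEAAAKA",
    "AEAAAKEAAAKEAAAKEAAAKA",
    "AEAAAKEAAAKEAAAKEAAAKEAAAKA",
]


def estimated_length(block: str) -> float:
    """Estimated length in Angstrom: number of residues times the rise per residue."""
    return len(block) * RISE_PER_RESIDUE


if __name__ == "__main__":
    for block in HELIX_BLOCKS:
        print(f"{block}: {estimated_length(block):.1f} A")
```

These are only starting estimates; the ''in silico'' refinement described further down replaces them with measured values.<br />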
<br />
===Angle patterns===<br />
<br />
The angle patterns for our model were obtained from the ArchDB database [[#References | [7]]], which classifies loops from known protein structures. About 17 000 non-homologous proteins from the PDB database were analyzed and from them, 150 000 loop structures, i.e. regions connecting two secondary structure elements, were identified. The classification took into account not only the length of the loop and its conformation, meaning the φ and ψ backbone dihedral angles of the residues in the loop, but also the distance between the attachments of the loop to the surrounding secondary structures. Furthermore, the secondary structures surrounding the loop and the geometry defined by the supersecondary structure motifs can be found in the database.<br />
<br />
To extract from ArchDB the relevant supersecondary structure motifs for our linker design, the complete database was downloaded and helix-loop-helix motifs were extracted using a [https://github.com/igemsoftware/Heidelberg_2014 self-written script] written in Python. From them we only took into account loops composed of 1 or 2 amino acids, because the longer the loops, the less frequent and therefore the less reliable they are, and the further the ends are from each other.<br />
The interesting information for us was (1) the angle produced between the vectors defining the surrounding alpha helices, (2) the distance between the ends of the loop, and (3) the type of amino acids surrounding the loop. Furthermore we analyzed the statistical significance of the conformation. For each amino acid combination in the loop region, the angle distribution between the embracing alpha helices, the loop length distribution and a 2d heatmap of the surrounding amino acids were automatically plotted (Fig. 1-4).<br />
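<br />
As an illustration of quantities (1) and (2), the sketch below computes the angle between two helix axis vectors and the distance between the two ends of a loop. It assumes that each helix is summarized by the 3D coordinates of its first and last residue; the function names and the input format are hypothetical and do not describe the actual script.<br />

```python
import math


def helix_axis(start, end):
    """Vector pointing from the first to the last residue of a helix."""
    return tuple(e - s for s, e in zip(start, end))


def angle_between(u, v):
    """Angle in degrees between two 3D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against rounding errors before acos.
    cosine = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cosine))


def loop_end_distance(first_residue, last_residue):
    """Euclidean distance between the two ends of a loop."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(first_residue, last_residue)))
```

Whether the angle is measured between the helix directions as shown here or with one vector reversed is a convention that has to match the one used in the database.<br />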
<br />
These distributions were then visually analyzed to identify loops of interest for linker design. We focused on loops that showed a narrow angle distribution and that appeared frequently in the database, and we extended the distribution analysis to the surrounding amino acids. For example, having identified that T in a turn produces an interesting distribution (Fig. 1), we would extend it by K in front and by A at the end and see how the distribution behaves. Restricting the possibilities in this way reduces the number of occurrences tremendously, but the properties become more interesting.<br />
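<br />
The elongation step described above can be sketched as a simple filter over loop records. The record layout and the toy values below are invented purely for illustration; they are not ArchDB data and do not reflect the format of the actual script.<br />

```python
from statistics import mean, stdev

# Hypothetical toy records: (residue_before, loop_sequence, residue_after, angle_degrees).
LOOPS = [
    ("K", "T", "A", 38.0),
    ("K", "T", "A", 40.0),
    ("K", "T", "L", 35.0),
    ("E", "T", "A", 70.0),
    ("G", "T", "S", 120.0),
    ("K", "V", "A", 33.0),
]


def select(loops, loop_seq, before=None, after=None):
    """Angles of all loops with the given sequence and optional flanking residues."""
    return [
        angle
        for b, seq, a, angle in loops
        if seq == loop_seq
        and (before is None or b == before)
        and (after is None or a == after)
    ]


def summarize(angles):
    """Return (number of hits, mean angle, standard deviation)."""
    spread = stdev(angles) if len(angles) > 1 else 0.0
    return len(angles), mean(angles), spread


if __name__ == "__main__":
    for before, after in [(None, None), ("K", None), (None, "A"), ("K", "A")]:
        print(before, "T", after, summarize(select(LOOPS, "T", before, after)))
```

Each added constraint lowers the number of hits, which is the trade-off between a narrow distribution and statistical support mentioned in the text.<br />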
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of__T_.png|<br />
descr= Fig 1. A loop composed of T produces an interesting, well-defined angle distribution (left panel). K appears most frequently before the loop}} <br />
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of_K_T_.png|<br />
descr= The distribution is narrowing a lot, K seems to be good before the turn }} <br />
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of__T_A.png|<br />
descr= Also with A after the loop the distribution stays well defined}} <br />
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of_K_T_A.png|<br />
descr= The whole sequence still produces a nicely shaped distribution, but the amount of hits is much less than before}}<br />
<br />
This step had two main goals: narrowing down the angle distribution and finding loops with no preference for the amino acids surrounding them. This last point was important for the modularity of our approach: the angle blocks should not be affected by the surrounding alpha helices. Using this approach, 9 different angle motifs (table 1) producing different angles could be identified. Importantly, all these motifs were chosen so that the length of the two segments starting from the turning point was the same for all of them. It should also be noted that 3 of the patterns, KTA, LVA and AAIAP, produce nearly the same angle (means of 35 to 38.7 degrees). They were nevertheless all kept for the ''in silico'' refinement described below, but only one was used for the [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software CRAUT software].<br />
<br />
<br />
{| class="table table-hover" style="text-align: center;"<br />
|+'''Table 1''': Mean angle (in degrees) and variation for each angle pattern.<br />
!colspan="10"|Angle Patterns<br />
|-<br />
|Pattern <br />
| NVL <br />
| KTA <br />
| LVA <br />
| AAIAP <br />
| AADGTL <br />
| VNLTA<br />
| AAAHPEA<br />
| ASLPAA <br />
| ATGDLA<br />
|-<br />
|Mean <br />
| 29.7 <br />
| 38.7 <br />
| 35 <br />
| 36.5 <br />
| 60 <br />
| 74.5 <br />
| 117 <br />
| 140 <br />
| 160 <br />
|-<br />
|Variation <br />
| 8.5 <br />
| 30 <br />
| 29 <br />
| 27 <br />
| 12 <br />
| 27 <br />
| 12 <br />
| 15 <br />
| 5 <br />
|}<br />
<br />
===Sequences to connect the alpha helix to the protein extremity===<br />
<br />
Helical patterns often affect the folding of the attached sequences. To prevent them from affecting the structure of the protein of interest, we analyzed the effects of various amino acids on helix formation ''in silico'' using online tools like [http://bioserv.rpbs.univ-paris-diderot.fr/services/PEP-FOLD/ pep-fold]. We identified glycines and prolines as reliable amino acids to interrupt helix formation. We then decided to use glycine pairs to connect the protein of interest to the linker, because they give more flexibility to the orientation of the first helix.<br />
This only concerns the protein end that is connected to the linker through the coding sequence. The other end is ligated through exteins or a sortase scar, both treated as unstructured flexible regions.<br />
<br />
===''Conclusion''===<br />
<br />
The last three parts show how we could design alpha helix and angle pattern blocks and connect them to each other and to the protein of interest. They provide the material that allowed us to transform a linker defined as a geometrical path into a real amino acid sequence in our [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software software].<br />
<br />
Thanks to this we could design linkers to circularize [https://2014.igem.org/Team:Heidelberg/Project/PCR_2.0 DNMT1] and [https://2014.igem.org/Team:Heidelberg/Project/Linker_Screening lysozyme]. Additionally, it was shown for lysozyme that a custom-tailored linker enhances heat stability compared to a badly designed linker and even to the flexible linker.<br />
<br />
=''In silico'' refinement=<br />
<br />
As some of the interesting patterns did not occur often enough to be statistically significant, we decided to refine them further ''in silico'' by modeling the structure of proteins with circularizing linkers. To do this for realistic situations, we selected from the RCSB database structures of non-homologous target proteins whose extremities are far enough apart to require a linker for circularization. First, the [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software CRAUT software] generates possible fitting linkers for the various proteins. From these possible linkers, the 100 shortest were taken. Among the 3 angle patterns that generate nearly the same angle, the software provides only one, but the linker refinement developed here used all three for comparison. We assume that the linkers connect the ends of the protein without putting tension on it, so that the protein can fold in its natural way.<br />
<br />
After this, the circularized proteins with the specific linkers are modelled using Modeller [[#References|[10]]], a program widely used for comparative structure prediction. It is well established in the scientific community and should be well suited to predicting loop regions attached to existing structures [[#References|[11]]]. It is freely available for academic use from the [http://salilab.org/modeller/ salilab] webpages. The program predicts the 3D structure of a given sequence based on an alignment with a given structure. In our script, Modeller is provided with the sequence including the attached linker and with the PDB file of the protein of interest. This file is the only structural information used here. Modeller is recommended when at least 30% sequence similarity exists between the provided structure and the one that the user wishes to model. Here, only the linker differs, so this similarity is around 90% in our case. Modeller first aligns the provided structure with the sequence of our linked protein, identifying the regions that cannot be found in the structure. Based on that, Modeller generates 4 initial models. One of the strengths of the software is its capacity to further refine only certain parts of the protein. We therefore let Modeller refine the loops, which are defined as any part of the protein that could not be found in the structure file. The ''ab initio'' modelling of our linkers is done by minimizing energy functions with methods such as conjugate gradients and molecular dynamics. Finally, each modelled structure is assigned energy values, with which different models of the same structure can be compared. From Modeller we received about 8 different models and chose the one with the best energy score to proceed with. For these refinement steps, one can choose different levels of optimization. 
We always favored accuracy over speed. The result of a prediction for lysozyme of bacteriophage lambda can be seen in figure 5.<br />
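<br />
The selection of the best model can be sketched as follows. The sketch assumes that each model has already been reduced to a file name and a single energy score, and that a lower score is better; the dictionary layout, file names and values are hypothetical and do not describe the actual output handling of our script.<br />

```python
def best_model(models):
    """Return the model with the lowest energy score (assumed to be the best one)."""
    if not models:
        raise ValueError("no models to choose from")
    return min(models, key=lambda model: model["energy"])


if __name__ == "__main__":
    # Hypothetical scores for a handful of refined models of one linker.
    candidates = [
        {"file": "model_1.pdb", "energy": -1210.5},
        {"file": "model_2.pdb", "energy": -1302.7},
        {"file": "model_3.pdb", "energy": -1288.1},
    ]
    print(best_model(candidates)["file"])
```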
<br />
{{:Team:Heidelberg/templates/image-quarter|<br />
align=right|<br />
caption=Figure 5) Circular lysozyme|<br />
descr=The structure of circular lysozyme (yellow) was predicted and aligned to the linear structure (purple). The linker calculated by the software was marked green. |<br />
file=circ_lam_lys_nils.png}}<br />
<br />
<br />
Modeller was run by distributing the calculations via the [https://2014.igem.org/Team:Heidelberg/Software/igemathome iGEM@home system]. The modelling of one linker took about 10 hours of calculation time on average, a value that depends strongly on the size of the protein. The best model is then evaluated by another self-written program to analyze the behaviour of the linker patterns in their natural surroundings.<br />
Finally, all the models for the different structures and the different linkers are analyzed for properties such as the length of the helical patterns, the shape of the attachment structures of the linker and the angles produced by the angle patterns [[ Figure helix_winkel_messung.png]]. First the modelled structure and the natural structure are fitted to each other to see how large the differences between them are. If the protein has been disturbed too much, the model is discarded. We verify that the alpha helix patterns really form helices by measuring the distance between the first and the last amino acid of each of them and checking whether it is compatible with an alpha helix. The first and last amino acids of each rod define a vector that is then used to calculate the angles between consecutive helical patterns. The frequency of all the determined lengths and angles is then further analyzed, using a strategy similar to the ArchDB analysis presented above.<br />
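<br />
The helix check described above can be sketched as follows. The sketch assumes that residues are represented by their C-alpha coordinates, that an ideal helix rises 1.5 &Aring; per residue, and that a relative deviation of 20% is still acceptable; the tolerance and all names are illustrative assumptions, not the values used in our evaluation program.<br />

```python
import math

RISE_PER_RESIDUE = 1.5  # Angstrom per residue, ideal alpha helix (assumption)


def end_to_end(ca_coords):
    """Distance between the first and the last C-alpha atom of a segment."""
    return math.dist(ca_coords[0], ca_coords[-1])


def looks_helical(ca_coords, tolerance=0.2):
    """True if the end-to-end distance is close to that of an ideal alpha helix."""
    expected = RISE_PER_RESIDUE * (len(ca_coords) - 1)
    return abs(end_to_end(ca_coords) - expected) <= tolerance * expected


def ideal_helix(n_residues, radius=2.3, turn_degrees=100.0):
    """C-alpha trace of an ideal alpha helix along the z axis, used here as test input."""
    return [
        (
            radius * math.cos(math.radians(turn_degrees * i)),
            radius * math.sin(math.radians(turn_degrees * i)),
            RISE_PER_RESIDUE * i,
        )
        for i in range(n_residues)
    ]


if __name__ == "__main__":
    helix = ideal_helix(7)                          # e.g. a 7-residue block such as AEAAAKA
    strand = [(0.0, 0.0, 3.4 * i) for i in range(7)]  # fully extended chain
    print(looks_helical(helix), looks_helical(strand))
```

Segments that fail the check would be excluded before the vectors between first and last residues are used to compute the angles between consecutive helices.<br />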
<br />
The whole process for the verification of the different linker patterns was set up on the [https://2014.igem.org/Team:Heidelberg/Software/igemathome distributed computing system]. Due to lack of time, only a few results could be analyzed, resulting in distributions for the different helices; see for example figures 6 and 7.<br />
This led to a refinement of the lengths of the alpha helix blocks presented above to 8.7, 10, 10.8, 15.6, 16.8, 24.8, 32.3 &Aring;.<br />
<br />
{{:Team:Heidelberg/templates/image-quarter|<br />
align=left|<br />
descr=|<br />
caption=Figure 6) Length distribution of AEAAAKA|<br />
file=plot_of_AEAAAKA.png}}<br />
<br />
{{:Team:Heidelberg/templates/image-quarter|<br />
align=right|<br />
descr=|<br />
caption=Figure 7) Length distribution of AEAAAKEAAAK|<br />
file=plot_of_AEAAAKEAAAK.png}}<br />
<br />
=Conclusion=<br />
The patterns that we identified by analyzing structure databases provide an easy and fast tool to build custom-shaped peptides. The main achievement is the identification of the angle patterns. These are designed as building blocks for enhanced applicability. The shapes were identified from a database of non-homologous proteins and the patterns were refined until the surrounding amino acids appeared randomly distributed. Thus we can conclude that the angle distributions we observed are due not to the surrounding sequences but to the identified patterns. We have not observed any evidence for certain helix patterns being preferred in the database.<br />
From figures 6 and 7 we have learned that the lengths we had assumed for the helices needed to be refined. For example, we had assumed the AEAAAKA motif to span a distance of 10.5 &Aring; but observed it to be only 10 &Aring; long. Our [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software CRAUT software] was corrected accordingly.<br />
<br />
=References=<br />
<br />
[1] Vieille, C. & Zeikus, G.J. Hyperthermophilic enzymes: sources, uses, and molecular mechanisms for thermostability. Microbiology and molecular biology reviews : MMBR 65, 1-43 (2001).<br />
<br />
[2] Yu, Y. & Lutz, S. Circular permutation: a different way to engineer enzyme structure and function. Trends Biotechnol. 29, 18-25 (2011).<br />
<br />
[3] Arai, R., Ueda, H., Kitayama, A., Kamiya, N. & Nagamune, T. Design of the linkers which effectively separate domains of a bifunctional fusion protein. Protein Eng. 14, 529-532 (2001).<br />
<br />
[4] Wang, C.K.L., Kaas, Q., Chiche, L. & Craik, D.J. CyBase: A database of cyclic protein sequences and structures, with applications in protein discovery and engineering. Nucleic Acids Research 36, (2008).<br />
<br />
[5] Efimov, A. V. Standard structures in proteins. Prog. Biophys. Mol. Biol. 60, 201-239 (1993).<br />
<br />
[6] Donate, L. E., Rufino, S. D., Canard, L. H. & Blundell, T. L. Conformational analysis and clustering of short and medium size loops connecting regular secondary structures: a database for modeling and prediction. Protein Sci. 5, 2600-2616 (1996).<br />
<br />
[7] Bonet, J. et al. ArchDB 2014: structural classification of loops in proteins. Nucleic Acids Res. 42, D315-D319 (2014).<br />
<br />
[8] George, R. A. & Heringa, J. An analysis of protein domain linkers: their classification and role in protein folding. Protein Eng. 15, 871-879 (2002).<br />
<br />
[9] Chen, X., Zaro, J. L. & Shen, W.-C. Fusion protein linkers: property, design and functionality. Adv. Drug Deliv. Rev. 65, 1357-1369 (2013).<br />
<br />
[10] Fiser, A. et al. Modeling of loops in protein structures. Protein Sci. 9, 1753-1773 (2000).<br />
<br />
[11] Fiser, A. & Sali, A. ModLoop: Automated modeling of loops in protein structures. Bioinformatics 19, 2500-2501 (2003).</div>Igemnilshttp://2014.igem.org/Team:Heidelberg/pages/Linker_ModelingTeam:Heidelberg/pages/Linker Modeling2014-10-17T20:30:03Z<p>Igemnils: /* In silico refinement */</p>
<hr />
<div>=Background=<br />
Primary, secondary, tertiary and quaternary structures are the main levels of protein structure characterization. Primary structure designates the amino acid sequence, while the secondary structure describes the arrangement of consecutive amino acids through their two dihedral angles $\phi$ and $\psi$. The Ramachandran plot, which represents the amino acid position in the space of those two angles, shows two particular arrangements commonly found in proteins: alpha helices and beta sheets. The next level of protein organization is the tertiary structure, which describes how the protein is organized in the three spatial dimensions, whereas the quaternary structure describes how different protein subunits assemble.<br />
Finally, closely related to these standard structures, the supersecondary structure describes how secondary structure elements are connected to each other. While these connections look undefined at first sight, further analysis revealed that this wide variety of supersecondary structure motifs can be clustered to certain patterns [[#References|[5]]].<br />
<br />
==Supersecondary structure==<br />
<br />
When the properties of supersecondary structures were first described, only very few patterns were identified, mainly due to the lack of highly resolved protein structures. At that time, the structures were mainly classified by the Ramachandran plot regions where the amino acids could be found [[#References|[6]]]. With the growing number of known crystal structures, the analysis of supersecondary structure improved and led to databases with about 150 000 classified loop structures and elaborate clustering [[#References|[7]]]. Nowadays supersecondary structures are defined as the structures built when two secondary structure elements are joined by a small peptide that is not itself assigned to one of the secondary structures. These loop peptides range from 1 to 9 amino acids.<br />
Our aim was to build reliable stable linkers out of alpha helices connected by supersecondary structure motifs that produce certain angles. To achieve that, we searched for the most reliable alpha helix patterns that would form rigid rods and angle patterns covering the whole range of angles from 0 to 180 degrees.<br />
<br />
=Linker building block design=<br />
<br />
===Helix patterns===<br />
Various patterns have been used to build helical linkers to connect protein ends [[#References | [8]]]. Moreover, in known protein structures, linkers between subdomains can be identified and their properties have already been analyzed [[#References | [9]]]. Two main criteria were used to build the alpha helix patterns: they should robustly form alpha helices, and they should be soluble in aqueous solution. Therefore, we could not just use linkers built of alanine. We decided to add some charged amino acids to the pattern, and to position them physically close to each other so that they could stabilize each other by Coulomb interaction. These amino acids needed to be separated by 3 amino acids, as a helical turn takes about 3.6 amino acids. The pattern we chose as most suitable for our purpose was also described to be one of the most stable [[#References | [3]]].<br />
8 alpha helix building blocks were eventually chosen: AEAAAK, AEAAAKA, AEAAAKAA, AEAAAKEAAAK, AEAAAKEAAAKA, AEAAAKEAAAKEAAAKA, AEAAAKEAAAKEAAAKEAAAKA, AEAAAKEAAAKEAAAKEAAAKEAAAKA, with respective estimated lengths of 9, 10.5, 12, 16.5, 18, 25.5 and 33 &Aring;.<br />
<br />
===Angle patterns===<br />
<br />
The angle patterns for our model were obtained from the ArchDB database [[#References | [7]]], which classifies loops from known protein structures. About 17 000 non-homologous proteins from the PDB database were analyzed and from them, 150 000 loop structures, i.e. regions connecting two secondary structure elements, were identified. The classification took into account not only the length of the loop and its conformation, meaning the φ and ψ backbone dihedral angles of the residues in the loop, but also the distance between the attachments of the loop to the surrounding secondary structures. Furthermore, the secondary structures surrounding the loop and the geometry defined by the supersecondary structure motifs can be found in the database.<br />
<br />
To extract the supersecondary structure motifs relevant for our linker design, we downloaded the complete ArchDB database and extracted the helix-loop-helix motifs using a [https://github.com/igemsoftware/Heidelberg_2014 self-written script] in Python. Among them we only considered loops composed of 1 or 2 amino acids, because longer loops are less frequent and therefore less reliable, and their ends lie further apart.<br />
The information of interest to us was (1) the angle between the vectors defining the surrounding alpha helices, (2) the distance between the ends of the loop, and (3) the type of amino acids surrounding the loop. Furthermore, we analyzed the statistical significance of each conformation. For each amino acid combination in the loop region, the distribution of the angle between the flanking alpha helices, the loop length distribution and a 2D heat map of the surrounding amino acids were plotted automatically (Fig. 1-4).<br />
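The two geometric quantities named above can be sketched as follows. This is a minimal illustration, assuming that each helix axis is represented by a start and an end point in &Aring;; how ArchDB itself defines the axes and the angle is not reproduced here, and all names are hypothetical.<br />

```python
import math


def vector(start, end):
    """Vector pointing from start to end."""
    return tuple(e - s for s, e in zip(start, end))


def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(c * c for c in v))


def angle_deg(u, v):
    """Angle between two vectors in degrees."""
    cos = sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))
    cos = max(-1.0, min(1.0, cos))  # guard against rounding outside [-1, 1]
    return math.degrees(math.acos(cos))


def loop_geometry(helix1_start, helix1_end, helix2_start, helix2_end):
    """Return (angle between the helix axes, distance between the loop ends).

    The loop is assumed to run from the end of helix 1 to the start of helix 2.
    """
    axis1 = vector(helix1_start, helix1_end)
    axis2 = vector(helix2_start, helix2_end)
    loop_distance = norm(vector(helix1_end, helix2_start))
    return angle_deg(axis1, axis2), loop_distance


# Made-up coordinates: two perpendicular helices joined by a 2 Angstrom loop.
angle, distance = loop_geometry((0, 0, 0), (10, 0, 0), (12, 0, 0), (12, 10, 0))
```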
<br />
These distributions were then analyzed visually to identify loops of interest for linker design. We focused on loops that showed a narrow angle distribution and appeared frequently in the database, and we extended the distribution analysis to the surrounding amino acids. For example, once we had identified that T in a turn produces an interesting distribution (Fig. 1), we extended it with K in front and with A at the end and observed how the distribution behaved. Restricting the possibilities in this way reduces the number of occurrences considerably, but the properties become more interesting.<br />
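The narrowing step can be sketched as a filter over loop records followed by a summary of the remaining angles. The record layout and the angle values below are invented for illustration only and are not taken from ArchDB; the sketch assumes one record per loop with the flanking residues and the measured angle.<br />

```python
from statistics import mean, pstdev


def select_angles(records, loop, before=None, after=None):
    """Angles of all records with the given loop sequence, optionally
    restricted to a residue directly before and/or after the loop."""
    return [
        r["angle"]
        for r in records
        if r["loop"] == loop
        and (before is None or r["before"] == before)
        and (after is None or r["after"] == after)
    ]


def summarize(angles):
    """Number of hits, mean angle and spread; None if nothing matched."""
    if not angles:
        return None
    return {"count": len(angles), "mean": mean(angles), "spread": pstdev(angles)}


# Hypothetical records, made up to show the effect of restricting the context.
records = [
    {"before": "K", "loop": "T", "after": "A", "angle": 38.0},
    {"before": "K", "loop": "T", "after": "A", "angle": 40.0},
    {"before": "K", "loop": "T", "after": "L", "angle": 36.0},
    {"before": "E", "loop": "T", "after": "A", "angle": 90.0},
    {"before": "K", "loop": "G", "after": "A", "angle": 120.0},
]

loop_only = summarize(select_angles(records, "T"))
with_k = summarize(select_angles(records, "T", before="K"))
with_k_and_a = summarize(select_angles(records, "T", before="K", after="A"))
```

In this toy data set each added restriction lowers the number of hits and the spread, which mirrors the trade-off described above.<br />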
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of__T_.png|<br />
descr= Fig 1. A loop composed of T produces an interesting, well-defined angle distribution (left panel). K appears most frequently before the loop.}} 
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of_K_T_.png|<br />
descr= Fig 2. With K before the turn, the distribution narrows considerably, so K seems to be a good choice at this position.}} 
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of__T_A.png|<br />
descr= Fig 3. With A after the loop, the distribution also stays well defined.}} 
<br />
{{:Team:Heidelberg/templates/image-full|<br />
caption = |<br />
file = plot_of_K_T_A.png|<br />
descr= Fig 4. The whole sequence still produces a nicely shaped distribution, but the number of hits is much lower than before.}}
<br />
This step had two main goals: narrowing down the angle distribution and finding loops with no preference for the amino acids surrounding them. The latter was important for the modularity of our approach: the angle blocks should not be affected by the surrounding alpha helices. Using this approach, 10 different angle motifs producing different angles could be identified (table 1). Importantly, all these motifs were chosen so that the length of the two segments starting from the turning point was the same for all of them. It should also be noted that 3 of the patterns, KTA, LVA and AAIAP, produce the same angle. They were nevertheless all kept for the ''in silico'' refinement described below, but only one of them was used for the [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software CRAUT software].<br />
<br />
<br />
{| class="table table-hover" style="text-align: center;"<br />
|+'''Table 1''': Angle patterns with the mean and variation of the angles they produce.<br />
!colspan="10"|Angle Patterns<br />
|-<br />
|Pattern <br />
| NVL <br />
| KTA <br />
| LVA <br />
| AAIAP <br />
| AADGTL <br />
| VNLTA<br />
| AAAHPEA<br />
| ASLPAA <br />
| ATGDLA<br />
|-<br />
|Mean <br />
| 29.7 <br />
| 38.7 <br />
| 35 <br />
| 36.5 <br />
| 60 <br />
| 74.5 <br />
| 117 <br />
| 140 <br />
| 160 <br />
|-<br />
| Variation <br />
| 8.5 <br />
| 30 <br />
| 29 <br />
| 27 <br />
| 12 <br />
| 27 <br />
| 12 <br />
| 15 <br />
| 5 <br />
|}<br />
<br />
===Sequences to connect the alpha helix to the protein extremity===<br />
<br />
Helical patterns often affect the folding of the attached sequences. To prevent them from affecting the structure of the protein of interest, we analyzed the effects of various amino acids on helix formation ''in silico'' using online tools such as [http://bioserv.rpbs.univ-paris-diderot.fr/services/PEP-FOLD/ PEP-FOLD]. We identified glycines and prolines as reliable amino acids to interrupt helix formation. We then decided to use glycine pairs to connect the protein of interest to the linker, because they give more flexibility to the orientation of the first helix.<br />
This only concerns the protein end that is connected to the linker through the coding sequence. The other end is ligated through an extein or sortase scar, both of which are treated as unstructured flexible regions.<br />
<br />
===''Conclusion''===<br />
<br />
The three parts above show how we designed alpha helix and angle pattern blocks and how we connect them to each other and to the protein of interest. They provide the material that allowed us to transform a linker defined as a geometrical path into a real amino acid sequence in our [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software software].<br />
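The translation from a geometrical path into a sequence can be sketched as a nearest-match lookup over the building blocks. This is a simplified illustration and not the algorithm of the CRAUT software: it assumes that the path is given as segment lengths in &Aring; and as angles between consecutive segments, it uses the estimated helix lengths and the mean angles of table 1, and it ignores the length contributed by the angle blocks themselves. The function names are hypothetical.<br />

```python
# Estimated lengths (Angstrom) of the helix blocks, as listed above.
HELIX_LENGTHS = {
    "AEAAAK": 9.0,
    "AEAAAKA": 10.5,
    "AEAAAKAA": 12.0,
    "AEAAAKEAAAK": 16.5,
    "AEAAAKEAAAKA": 18.0,
    "AEAAAKEAAAKEAAAKA": 25.5,
    "AEAAAKEAAAKEAAAKEAAAKA": 33.0,
}

# Mean angles of the angle patterns, copied from table 1.
ANGLE_MEANS = {
    "NVL": 29.7,
    "KTA": 38.7,
    "LVA": 35.0,
    "AAIAP": 36.5,
    "AADGTL": 60.0,
    "VNLTA": 74.5,
    "AAAHPEA": 117.0,
    "ASLPAA": 140.0,
    "ATGDLA": 160.0,
}


def closest(table, target):
    """Name of the block whose tabulated value is nearest to target."""
    return min(table, key=lambda name: abs(table[name] - target))


def build_linker(segment_lengths, angles, start="GG"):
    """Alternate helix and angle blocks along the path, after a glycine pair."""
    if len(angles) != len(segment_lengths) - 1:
        raise ValueError("need exactly one angle between consecutive segments")
    parts = [start]
    for i, length in enumerate(segment_lengths):
        parts.append(closest(HELIX_LENGTHS, length))
        if i < len(angles):
            parts.append(closest(ANGLE_MEANS, angles[i]))
    return "".join(parts)


# Made-up path: a 10 Angstrom segment, a 60 degree turn, a 17 Angstrom segment.
linker = build_linker([10.0, 17.0], [60.0])
```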
<br />
Thanks to this, we could design linkers to circularize [https://2014.igem.org/Team:Heidelberg/Project/PCR_2.0 DNMT1] and [https://2014.igem.org/Team:Heidelberg/Project/Linker_Screening lysozyme]. Additionally, it was shown for lysozyme that a custom-tailored linker enhances heat stability compared to a badly designed linker and even to the flexible linker.<br />
<br />
=''In silico'' refinement=<br />
<br />
As some of the interesting patterns could not be found often enough to be statistically significant, we decided to refine them further ''in silico'' by modeling the structure of proteins with circularizing linkers. To do this for realistic situations, we selected from the RCSB database structures of non-homologous target proteins whose extremities are far enough apart to require a linker for circularization. First, the [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software CRAUT software] generates possible fitting linkers for the various proteins. From these possible linkers, the 100 shortest were taken. Among the 3 angle patterns that generate the same angle, the software provides only one, but the linker refinement developed here used all three of them for comparison. We assume that the linkers connect the ends of the protein without putting tension on it, so that the protein can fold in its natural way.<br />
<br />
After this, the circularized proteins with the specific linkers are modeled using the software Modeller [[#References|[10]]]. This software is widely used for comparative structure prediction. It is well established in the scientific community and should be most suitable for the prediction of loop regions attached to existing structures [[#References|[11]]]. It is freely available for academic use from the [http://salilab.org/modeller/ salilab] webpages. The program predicts the 3D structure of a given sequence based on an alignment with a given structure. In our script, Modeller needed to be provided with the sequence including the attached linker and with the PDB file of the protein of interest. The latter file is the only structural information used here. Modeller is recommended when at least 30% sequence similarity exists between the provided structure and the one that the user wishes to model. Here, only the linker is different, so this similarity is around 90% in our case. Modeller first makes an alignment between the provided structure and the sequence of our linked protein, identifying the regions that cannot be found in the structure. Based on that, Modeller generates 4 initial models. One of the strengths of the software is its capacity to further refine only certain parts of the protein. We therefore let Modeller refine the loops, which are defined as any part of the protein that could not be found in the structure file. The ''ab initio'' modeling of our linkers is done by minimizing energy functions with different methods such as conjugate gradients and molecular dynamics. Eventually, each modeled structure is provided with energy values, which allow different models of the same structure to be compared. From Modeller we received about 8 different models and chose the one with the best energy scores to proceed. For these refinement steps, one can choose different levels of optimization. We always favored accuracy over speed. The result of a prediction for lysozyme of bacteriophage lambda can be seen in figure 5.<br />
<br />
{{:Team:Heidelberg/templates/image-quarter|<br />
align=right|<br />
caption=Figure 5) Circular lysozyme|<br />
descr=The structure of circular lysozyme (yellow) was predicted and aligned to the linear structure (purple). The linker calculated by the software was marked green. |<br />
file=circ_lam_lys_nils.png}}<br />
<br />
<br />
Modeller was run by distributing the calculations via the [https://2014.igem.org/Team:Heidelberg/Software/igemathome iGEM@home system]. The modeling of one linker took about 10 hours of calculation time on average, a value that depends strongly on the size of the protein. The best model is then evaluated by another self-written program to analyze the behaviour of the linker patterns in their natural surroundings.<br />
Finally, all the models for the different structures and the different linkers are analyzed for properties such as the length of the helical patterns, the shape of the structures attaching the linker, and the angles produced by the angle patterns [[ Figure helix_winkel_messung.png]]. First, the modeled structure is fitted to the natural structure to see how much they differ. If the protein has been disturbed too much, the model is discarded. We then check that the alpha helix patterns really form helices by measuring the distance between the first and the last amino acid of each pattern and checking whether it is compatible with an alpha helix. The first and last amino acids of each rod define a vector that is then used to calculate the angles between consecutive helical patterns. The frequencies of all the determined lengths and angles are then analyzed using a strategy similar to the ArchDB analysis presented above.<br />
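The length check and the angle measurement described above can be sketched as follows. This is a minimal illustration and not our evaluation program: it assumes that each rod is given by the coordinates (in &Aring;) of its first and last amino acid, and the tolerance is an arbitrary illustrative value. It requires Python 3.8 or later for <code>math.dist</code> and multi-argument <code>math.hypot</code>.<br />

```python
import math


def is_helix_like(first, last, expected_length, tolerance=1.0):
    """True if the end-to-end distance of a rod is within `tolerance`
    Angstrom of the length expected for an alpha helix."""
    return abs(math.dist(first, last) - expected_length) <= tolerance


def consecutive_angles(rods):
    """Angles in degrees between the vectors of consecutive rods.

    Each rod is a (first_point, last_point) pair of 3D coordinates.
    """
    vectors = [tuple(b - a for a, b in zip(first, last)) for first, last in rods]
    angles = []
    for u, v in zip(vectors, vectors[1:]):
        cos = sum(a * b for a, b in zip(u, v)) / (math.hypot(*u) * math.hypot(*v))
        cos = max(-1.0, min(1.0, cos))  # guard against rounding outside [-1, 1]
        angles.append(math.degrees(math.acos(cos)))
    return angles


# Made-up coordinates for three rods joined by two turns.
rods = [
    ((0, 0, 0), (10, 0, 0)),
    ((11, 1, 0), (11, 11, 0)),
    ((11, 12, 1), (11, 12, 16)),
]
angles = consecutive_angles(rods)
first_rod_ok = is_helix_like(*rods[0], expected_length=10.5)
```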
<br />
<br />
The whole process for the verification of the different linker patterns was set up on the [https://2014.igem.org/Team:Heidelberg/Software/igemathome distributed computing system]. Due to lack of time, only a few results could be analyzed, resulting in distributions for the different helices; see for example figures 6 and 7.<br />
<br />
{{:Team:Heidelberg/templates/image-quarter|<br />
align=left|<br />
descr=|<br />
caption=Figure 6) Length distribution of AEAAAKA|<br />
file=plot_of_AEAAAKA.png}}<br />
<br />
{{:Team:Heidelberg/templates/image-quarter|<br />
align=right|<br />
descr=|<br />
caption=Figure 7) Length distribution of AEAAAKEAAAK|<br />
file=plot_of_AEAAAKEAAAK.png}}<br />
<br />
This led to a refinement of the lengths of the alpha helix blocks presented above; the refined lengths of the first seven are 8.7, 10, 10.8, 15.6, 16.8, 24.8 and 32.3 &Aring;.<br />
<br />
=Conclusion=<br />
The patterns that we identified by analyzing structure databases provide an easy and fast tool to build custom-shaped peptides. The main achievement is the identification of the angle patterns. These are designed as building blocks for enhanced applicability. The shapes were identified from a database of non-homologous proteins, and the patterns were refined until the surrounding amino acids appeared randomly distributed. Thus we can exclude that the angle distributions we observed are due to the surrounding sequences rather than to the identified patterns. We have not observed any evidence for certain helix patterns being preferred in the database.<br />
From figures 6 and 7 we learned that the lengths we had assumed for the helices needed to be refined. For example, we had assumed that the AEAAAKA motif spans a distance of 10.5 &Aring;, but observed it to be only 10 &Aring; long. Our [https://2014.igem.org/Team:Heidelberg/Software/Linker_Software CRAUT software] was corrected accordingly.<br />
<br />
=References=<br />
<br />
[1] Vieille, C. & Zeikus, G.J. Hyperthermophilic enzymes: sources, uses, and molecular mechanisms for thermostability. Microbiology and molecular biology reviews : MMBR 65, 1-43 (2001).<br />
<br />
[2] Yu, Y. & Lutz, S. Circular permutation: a different way to engineer enzyme structure and function. Trends Biotechnol. 29, 18-25 (2011).<br />
<br />
[3] Arai, R., Ueda, H., Kitayama, A., Kamiya, N. & Nagamune, T. Design of the linkers which effectively separate domains of a bifunctional fusion protein. Protein Eng. 14, 529-532 (2001).<br />
<br />
[4] Wang, C.K.L., Kaas, Q., Chiche, L. & Craik, D.J. CyBase: A database of cyclic protein sequences and structures, with applications in protein discovery and engineering. Nucleic Acids Research 36, (2008).<br />
<br />
[5] Efimov, A. V. Standard structures in proteins. Prog. Biophys. Mol. Biol. 60, 201-239 (1993).<br />
<br />
[6] Donate, L. E., Rufino, S. D., Canard, L. H. & Blundell, T. L. Conformational analysis and clustering of short and medium size loops connecting regular secondary structures: a database for modeling and prediction. Protein Sci. 5, 2600-2616 (1996).<br />
<br />
[7] Bonet, J. et al. ArchDB 2014: structural classification of loops in proteins. Nucleic Acids Res. 42, D315-D319 (2014).<br />
<br />
[8] George, R. A. & Heringa, J. An analysis of protein domain linkers: their classification and role in protein folding. Protein Eng. 15, 871-879 (2002).<br />
<br />
[9] Chen, X., Zaro, J. L. & Shen, W.-C. Fusion protein linkers: property, design and functionality. Adv. Drug Deliv. Rev. 65, 1357-1369 (2013).<br />
<br />
[10] Fiser, A. et al. Modeling of loops in protein structures. Protein Sci. 9, 1753-1773 (2000).<br />
<br />
[11] Fiser, A. & Sali, A. ModLoop: Automated modeling of loops in protein structures. Bioinformatics 19, 2500-2501 (2003).</div>Igemnils