Team:Heidelberg/pages/Linker-Software Docu

From 2014.igem.org

(Difference between revisions)
(Created page with "=Abstract= Peptidic linkers are widely used tools in protein modification, not only for connecting protein subdomains but also for connecting their extremities to circularize the...")
Line 1: Line 1:
-
=Abstract=
+
=General=
-
Peptidic linkers are widely used tools in protein modification, not only for connecting protein subdomains but also for connecting their extremities to circularize them. Traditionally,  flexible linkers like Glycin-Serine peptides are used for this purpose. ### Reference needed ###. However, to keep domains of chimeric proteins in a certain distance, rigid peptides built of helical patterns are also often applied.
+
During the iGEM competetion we have written a software, that can predict the best linker to circularize a protein. Therefore at first connections between the ends are found, these are weighted on their goodness for the linker and then these paths are retranslated to biological sequences.
-
In the following we show a novel approach to build customized rigid linkers that follow a desired shape. This is achieved by connecting peptides forming rigid helices with amino acids that produce a certain angle between those. To design these linkers structures databases normally used for protein-structure prediction are analyzed and used the other way around for finding fitting building blocks. The potential linkers were tested in a large screening in silico, showing that the patterns produce the predicted structures. Additionally they were tested in vitro for circularizing lysozyme from bacteriophage lambda as a model enzyme. Not only the modularity but also the reliability of our linkers are huge advantages compared to designing linkers classically.
+
The software is mainly made possible by python's numpy package for easily handling and processing huge amount of data. Numpy is one of the most used python packages in scientific computing, providing a powerful N-dimensional array object and fast C/C++ written functions to process them. Thus we were able to handle the amount of different linker paths (in the scale of 10^9 paths).
-
=Introduction=
+
Python as a high level programming language with it's various packages enabled us within the short time period of iGEM to write such a powerfull software. On the other hand, being an interpreter language, python's runtime of course is much higher. As python is able to integrate fast C-Code natively, the runtime of Numpy calculations is not that much higher than compared to classical precompiled C-code. This is achieved because all the entries in an array must be of the same type so that always arrays can be processed completely. On the other hand this consumes much more memory, because always the whole array has to be loaded in the RAM, which was one of the major problems for us. But the software should run on Computers with 1 GB of free RAM.
-
Artificially circularized proteins can gain heat stability due to the constrain of the relative positioning of the C- and the N-termini. If the ends are too far from each other, circularization requires a linker that does not change the natural conformation of the protein but restrains the relative position of the ends and thus restricts the degrees of freedom. On top, these linkers should not affect any of the protein's functions. Consequently it is important to prevent linkers from passing through the active site or from covering binding domain to other molecules for example. Therefore one needs to be able to define the shape of possible linkers.
+
==Path storage==
-
Classically, protein linkers were designed in three different manners. ###REFERENCE### The easiest way is to define the length a linker should span and then simply use a flexible glycine-serine peptide with the right amount of amino acids to match this length. Glycine is used for flexibility, as it has no sidechain and does not produce any steric hindrance, while serine is used for solubility, as it has a small polar side chain. This solubility is important, as normally the linkers should not pass through the hydrophobic core of the protein, but should be dissolved in the surrounding medium. These flexible linkers were normally used in circularization but also for connecting different proteins, when the main important aspect is that the different parts are connected.
+
The possible paths are always stored as the angle points of the paths under the variables: firstpointsflexible, secondpointsflexible, thirdpointsflexible, firstpointstriangle, secondpointstriangle, thirdpointstriangle, erstepunkte, zweitepunkte and drittepunkte. As each point in 3d has three coordinates, all of these variables are n*3 arrays. The first index identifies the path then. This makes it possible, that a path is just deleted, by deleting the line of all the arrays. This way also the arrays can be easily sliced, making it possible to process parts of an array in a fast way.
-
A second strategy consists in using rigid helical linkers to keep proteins or protein domains at a certain distance from each other. This is important especially for signalling proteins and fluorescent proteins . ###TODO: Reference### One major property of alpha helices is that they always fold in a well defined way with well defined angles and lengths. There are also many different helical patterns that differ in stability and solubility. One big disadvantage of this strategy is that one can only build straight linkers with helices.
+
==Protein data==
-
The third option consists in designing customly tailored linkers for each specific application. These linkers can be obtained from protein structure prediction. At first one needs to define the path the linker should take to connect two amino acids. Afterwards one designs a possible linker sequence that might fit well. Next one makes a structure prediction of the linker attached to the proteins to validate the prediction. Several different linkers, with slight changes, can be compared. This is repeated several times until the linker is taking the path it should. ###TODO: Reference, WADE paper### This method is time consuming as it is not only computation intensive, but also requires a strong knowledge on protein folding and protein structure prediction. On the other hand the benefit can be important as the interaction of the linker with the proteins surface can be taken into account and as one can nearly completely define the path taken by the linker up to the accuracy of protein structure prediction.
+
In the PDB file, all coordinates of non-hydrogen atoms are stored. These are then loaded into arrays x, y and z, just containing one coordinate of a point. these are arrays of length n.  As they are not good to handle, the information is restored in different point arrays of shape n*3.
-
By combining the advantages of these different approaches for linker design, we have set up a model to build rigid linkers with alpha helices following a certain path. The main achievement is the modularity of our system for building.
+
*PointsOfAllSubunits: These are all points from the PDB file, that should not be ignored. The user has to tell the program, which parts should not be taken into account.
-
§§§don't know yet where to put this information§§§Also for artificial protein engineering it is most important being able to define the conformation of the single helical building blocks by defining a supersecondary structure.
+
*pkte: These are all the points of part, that should be circularized, so this are all the atom-coordinates between N- and C-terminus
-
=Background=
+
*OtherPoints: These are all points of PointsOfAllSubunits that are not in pkte.
-
Primary, secondary, tertiary and quaternary structure are the main levels of protein structure characterization. Primary structure designates the amino acid sequence, while the secondary structure describes the arrangement of consecutive amino acids  through their two dihedral angles $\phi$ and $\psi$. The Ramachandran plot, which represents the amino acid position in the space of those two angles, shows two particular arrangement commonly found in proteins: alpha helices and beta sheets. The next level of protein organization is the tertiary structure, which describes how the protein is organized in the three spatial dimensions, whereas the quaternary structure describes how different subunits of proteins cluster.
+
==Unpreferable places==
-
Finally, closely related to these standard structures is the supersecondary structure, that describes how secondary structure elements are connected to each other by on first sight undefined conformations. Further analysis revealed that this wide variety of supersecondary structure motifs can be clustered to certain patterns. [[#References|[5]]]
+
==List of angles and rods==
-
==Supersecondary structure==
+
==General definitions==
-
When the properties of supersecondary structures were first described, only very few patterns were identified, mainly due to the lack of highly resolved protein structures. At that time the structures were mainly classified by the Ramachandran plot regions ($\alpha, \beta, \gamma$ etc. ) where the amino acids could be found. [[#References|[6]]] With growing amount of known crystal structures, the analysis of supersecondary structure became better and better leading to databases with about 300 000 classified loop structures and elaborate clustering. [[#References|[7]]] Nowadays supersecondary structures are just the structures built when two secondary structure elements are combined by a small peptide that is not clustered into one of the secondary structures. These loop peptides range from 1 to 9 amino acids.
+
*minabstand: the radius of an alpha-helix, also the minimal distance an atom needs from a connection
-
=Defining the structure=
+
*LengthOfAngle:
-
The aim was to build reliable stable linkers out of alpha helices connected by supersecondary structure motifs that produce certain angles. To achieve that, we searched for the most reliable alpha helix patterns and angle patterns covering the whole range of angles from 0 to 180 degrees.  
+
*LengthOfFlexibleAA:
-
==Helix patterns==
+
*Flexatstartseq and Flexatendseq:
-
To be done.
+
=Biggest problems=
-
==Angle patterns==
+
We encountered many bigger or smaller problems while programming. Some are quite serious issues and are mainly due to the brute-forcing ansatz we made, but were mainly solved to an acceptable extent.  
-
###Figure for explanation needed###
+
==RAM usage==
-
The angle patterns for our model were obtained from the ArchDB database, which classifies loops from known proteins structures. About 17 000 non-homologous proteins from PDB database were analyzed and thus over 300 000 loop structures, i.e. regions connecting two secondary structure elements, were identified. ###Numbers to be checked###. The classification took into account not only the length of the loop, its conformation, meaning φ and ψ backbone dihedral angles of the residues in the loop, but also the distance between the attachments of the loop to the surrounding secondary structures. Furthermore the surrounding secondary structures of the loop and the geometry defined by the super-secondary structure motifs can be found in the database.
+
For example when the distance from the connection in a path to the protein is calculated, always the distance of all of the atoms of the protein to this connection is calculated in one array. Thus easily arrays of 100 000 000 * 3 * 6000 * 3 shape can occur, which is just too much for normal RAMsize. On the other hand, using numpy arrays, the bigger the arrays, that are processed are, the faster the program is in total, because the functions don't have to be loaded again and again. Therefore in each step the arrays are sliced in a way, that the following procedure can take place in the RAM of the computer, see [[#make_small_generator_offset(listofarraysinRAM, PointArray, repetition, RAM, tobesplitlength,  ProteinArray = None):|functions]] part.
-
To extract from ArchDB the relevant supersecondary structure motifs for our linker design, the complete database was downloaded and helix-loop-helix motifs were extracted using a self-written script in python programming language. Furthermore we only took into account loops composed of 1 and 2 amino acids, because the longer the loops, the less frequent and therefore the less reliable a single loop is, and the further the ends are from each other.
+
===Array Size===
-
The interesting information for us was the angle produced between the vectors defining the bracing alpha helices, the distance between the ends of the loop, and the type of amino acids surrounding the loop. Furthermore we analyzed the statistical significance of the conformation. ###still to be done###
+
The next issue was, that with arrays with  about 300 000 000 float 16 entries, like they occur for large proteins  while [[software-representation#Point generation|point generation]] they  could even not be kept in RAM in total. Thus we had to manually store  them on the harddrive for processing. Of course also this slowed the procedure down. But fortunately we could use python's [[#h5py|h5py]] package, which allowed us, just taking out certain lines from the array stored on  the disk.
-
For each amino acid combination in the loop region, the angle distribution between the embracing  alpha helices, the loop length distribution and a 2d heatmap of the embracing amino acids were automatically plotted.
+
==Runtime==
-
These distributions were then visually analyzed to identify loops of interest for linker design. We focused on loops that show a narrow angle distribution and that appear frequently in the database. The corresponding  amino acids were further analyzed, by enlarging the amino acid pattern with the amino acids occurring the most next to them§§§ In practical terms this means..§§§ ###figure needed###. This allowed us to narrow down the angle distribution, and also to select loops where no preference for the surrounding amino acids could be seen anymore. Thus we can claim, that the angle distribution is not due to the surrounding structure, but because of the identified pattern itself.
+
The longest calculation took about 11days on a 2.6 GHz intel i5 CPU with 8GB RAM and an SSD harddrive of continuous calculation. Therefore results are frequently stored on the harddisk using python's [[#cpickle|cpickle]] package, which allowed a fast storage of complete numpy objects. Thus calculations can be stopped and restarted after certain points.
-
==In silico refinement==
+
On the other hand, calculation time increases with the number of points for the connections and for the protein. Reducing the points of the protein to the atoms on the surface would be the next step to take, which could reduce calculation time to one sixth of the time now.
-
As some of the interesting patterns could not be found often enough to be statistically significant, we decided to make a further refinement in silico by modeling the structure of proteins with circularizing linkers. To perform this for realistic situations, we selected from the RCSB database structures of non-homologous target proteins with extremities that are separated enough to require a linker for circularization. For setting up an environment as close as possible to the application of the patterns were we designed the following workflow.
+
Due to lack of time, we often had to make the tradeoff between fast programming and fast calculation. Having the huge resources of [[iGEM@home|###i@h###]] we clearly decided for fast programming and not optimizing the code yet for velocity.
-
First, the ###linker software### generates possible fitting linkers for various proteins. From these possible linkers, the 100 shortest were taken and the angle patterns that should produce the same angles §§§Be more clear. For producing one angle different patterns can be used. We want to know which of them is best then, so we tested them against each other. §§§§ are exchanged in the linkers to get more variety. The linkers connect the ends of the protein without setting tension on the protein, so that the protein can fold in its natural way. For further information on the generation of the linkersequences please follow to the [[Linkergenerator|###link###]]
+
==Flexible ends==
-
After this the circularized proteins with the specific linkers are modelled using a software called Modeller.[[#References|[8]]] This software is widely used for comparative structure prediction. It is well established in the scientific community and is most suitable for prediction of loop regions attached to existing structures. [[#References|[9]]]
+
Long flexible (non helical) regions until now are kind of an issue, but we have implemented several functions that should handle this. The problem is, that the number of possible conformations already for two or three exceedes our capabilities and could not be easily handeled in the building-block system we chose to implement. Therefore also helical regions are handeled as straight connections, but with varying length. After the flexible regions there are no angular restraints given to the attached helical block.
-
Modeller is a program that is able to predict the 3d structure of a given sequence based on the restraints from an
+
=Functions=
-
§§§What do you mean? Explain:
+
==imported python modules==
-
Modeller needs two things, a sequence and a structure, then finds the sequence in the structure and predicts the folding.
+
===necessary===
-
§§§§
+
numpy: the basic module for numerical calculations in large scale
-
alignment to existing structures. It is freely available for academical usage from the salilab webpages. ###[[link]]###As we just want to determine the properties of our linker patterns attached to proteins, it perfectly fitted our purpose. ???We provided it with the crystal structure of a protein of interest and a linker sequence attached to it and the software returned a model of the structure of the circularized protein through the linker.??? Most important for our purpose is that Modeller does not rely on structural databases like ArchDB database, but does an ab initio modelling of our linkers by minimizing energy functions with different methods like conjugate gradients and molecular dynamics.
+
h5py: Storing arrays on the harddisk, allowes slicing of the arrays on the disk.
-
Each modeled structure is provided with energy values, thanks to which different models of the same structure can be compared. From Modeller we received about ###30### different models and choose the one with the best energy scores to further proceed. ###To calculate the models we at first make an alignment between the structure and the sequence and starting from this, modeller generates ???3??? initial models. On each of these models a loop refinement is made, resulting in about ???12??? models per run. For the loop refinement all parts of the sequence, that can not be found in the structure file, are further calculated.§§§should appear before§§§### For these refinement steps, one can choose different levels of optimization. We always decided for accuracy instead of velocity of the program.
+
os: Used for reading and writing files
-
  Modeller was run  via the [[iGEM@home|###link to i@h###]] system. ###further explanation needed###. The modelling of one linker took about ???10??? hours of calculation time on average via the iGEM@home system. But this value is highly depending on the size of the protein. Each run of modeller is made twice so that calculation errors can be handeled better §§§Rewrite again no errors, manipulations, variety§§§. Then the best model is determined and is analyzed by another self-written program to analyze the behaviour of the linker patterns in the natural surroundings.
+
sys: module used for exiting the program at certain points.
 +
===recommended===
 +
matplotlib.pyplot: can be used
 +
from mpl_toolkits.mplot3d,  Axes3D:
 +
time: so that one can see the progress of the calculations and observe the calculation times.
 +
fnmatch: allows wildcard search in strings, important for finding specific linker patterns
 +
cPickle: used for intermediate storing of the arrays, so that calculations coul be continued after restarting the program.
 +
==selfwritten functions==
 +
===angle_between_connections_array(startingarray, middlearray, endingarray):===
 +
   
 +
::    calculates the angles between the vectors from startingarray to middlearray and middlearray to endingarray. If there is no displacement between the arrays it returns zero as angle. startingarray and endingarray can be only one single point, middlearray should always be an array of points in 3d space.
 +
::  returns values between [0,pi] in an array of size Startarray.
 +
---
 +
===angle_between_vectors(vect1, vect2):===
 +
   
 +
:: calculates the angle between two arrays of vectors. If one of the vectors  is 0, the angle is set to 0. The result is based on arccos.
 +
:: returns the angles between two vectors.
 +
---
 +
===distance_from_connection(Startarray, Endarray, Points):===
 +
::takes a connection from Startarray to Endarray and calculates the perpendicular distance of the points from the connection. Startarray or Endarray can also be single points.
 +
::Returns an array of size (Startarray * points) with all perpendicular distances or distances of the endpoints.
 +
---
 +
===punktebeigerade(minabstand, pkte, gerade, aufpunkt, laenge):===
 +
::checks whether there are points too close to a straight line coming from aufpunkt with in direction of gerade with length laenge.
 +
::returns True if no point of pkte is closer to the straight line than
 +
    minabstand
 +
--- 
 +
===test_accessible_angles(winkelarray, length, anfangspunkt, proteinpoints, gerade=np.array([0, 0, 1])):===
 +
:: winkelarray is an array of angles that should be checked, whether they are accessible from anfangspunkt. Accessible means that no point of proteinpoints is too close to the straight line, which is produced by rotating gerade with the angles of winkelarray. Gerade always starts at anfangspunkt angles are measured from z-axis, if gerade is not defined else.
 +
::returns a boolean array with which winkelarray can be sliced.
 +
---
 +
===reduce_angles_from_redundancies(winkelarray):===
 +
::takes an array of angles in the format [phi, theta] and looks which angles produce the same result in the vector.
 +
::Returns an array with all indices, that can be deleted along the 0 axis of winkelarray.
 +
---
 +
===make_displacements(lengtharray, displacementarray):===
 +
::generates all possible displacements from displacementarray (an array of vectors) and lengtharray (array of different lengths)
 +
   
 +
::returns an array with displacementvectors in different lengths
 +
---
 +
===sort_out_by_protein(startingarray, endingarray, proteinpoints, mindist, beforearray = None):===
 +
                       
 +
::sorts out the connections between startingarray and endingarray with proteinpoints. A connection is sorted out, if one point of the proteinpoints is nearer to the connection, than mindist.
 +
   
 +
::Returns only the points for the connections, that are good. If beforearray is set, returns also beforearray
 +
---
 +
===naechstepunkte(anfangsarray, verschiebungsarray):===
 +
::generates for each point of anfangsarray, all points that are made by displacements of that point with verschiebungsarray.
 +
   
 +
::Returns two arrays of equal size, the enlarged anfangsarray and the array resulting from verschiebungsarray.
 +
---
 +
===aussortierennachpunken(punktearray, proteinpunkte, minabstand, maxabstand):===
 +
::sorts all the points of punktearray out, that are nearer than minabstand to one of the points from proteinpunkte, or farther away than maxabstand.
 +
   
 +
::returns a boolean array, with which one can slice punktearray.
 +
---
 +
===angle_weighing(anglearray, angletosequence=angletosequence):===
 +
::weighting of the angles form anglearray. The better an angle fits to the angles provided by angletosequence, the lower the value is. The best angle gets a weighing of 1, the worst angle of 2.
 +
   
 +
::Returns a weighingarray for the angles of anglearray. Each weighing is in the range between 1 and 2. The weighing is based on gaussian distributions.
 +
---
 +
===angle_function(StartingArray, MiddleArray, EndingArray):===
 +
::makes a weighing of the connection from Startingarray, over Middlearray to Endingarray based on the weighing of the angles.
 +
   
 +
::Returns an angle weighting for each connection.
 +
---
 +
===unpreferable_places(Start, End, ProteinPoints, AminoacidNumberArray, ToBeWeighedAAInput, WeighingofAA, substratelist):===
 +
::Calculates a weighting for the connection from the points of Start to the points of End based on the distance from regions that should be omitted. These aminoacids should be defined in the ToBeWeighedAAInput array and the WeighingofAA array defines how important this region is. If one wants whole substrates to be omitted, they should be added in the substratelist. The total returned number is normalized, so the weighting of the regions is independent of the number of places, that should be omitted.
 +
   
 +
  ::Parameters:
 +
::::Start: The points where the rod starts,
 +
::::End: The points where the rod ends
 +
::::ProteinPoints: The points of the protein, that should be taken into account.
 +
::::AminoacidNumberArray: The array, that tells, to which amino acid one atom belongs
 +
::::ToBeWeighedAAInput: One output of make_weighing_arrays
 +
::::substratelist: A list of tuples (amino acid nr, size of substrate). Amino acid nr, is the amino acid, where the substrate binds to.
   
   
-
All these models for the different structures and the different linkers are then analyzed for their properties like the length of the helical patterns, the shape of the attachment structures of the linker and the angles produced by the angle patterns. ###Figure missing###. First the modeled structure and the natural structure are fitted together, to see how big the differences between those are. If the protein has been disturbed too much, the model is discarded ###still needs to be done###.  Otherwise length of attachment sequences are calculated just by calculating the distance of the atoms. For the helical patterns a vector is fitted to the C$\alpha$s. For these vectors always distance between the ends, the length and the angles are calculated. Furthermore a possible crossing point is estimated. Afterwards for each helical pattern and for each angle-pattern we obtain a distribution for the different properties, so that we can refine our assumptions on the behaviour of the patterns. With the coordinates of the estimation of crossing points, on can furthermore see, whether the linker really follows a software predicted path and thus verifying the results of the linker-software.
+
::returns the weighing of the connections, because of the regions, where the linker passes through.
-
=Results=
+
---
-
Out of this, we decided to make set up a modular system for our linkers. All linkers start with two amino acids, that guarantee some flexibility to the ends of the protein and that prohibit the attached helix to continue into the protein and thus making non-helical regions helical. The next building block is one of the alpha helix forming patterns AEAAAK, AEAAAKA, AEAAAKAA, AEAAAKEAAAK, ###... with a well-defined length and shape. Then an angle pattern is attached. All the angle patterns chosen by us, have the same distance from the actual turning point. Thus one can easily exchange different angle patterns and easily calculate the distances between the different turning points, like it is used in our [[###software###]]. To this angle pattern, another helix pattern can easily be attached again. ###figure needed###. All our linkers end with the two exteins because of circularization or the sortase scar, treated both as rather unstructured flexible regions.
+
===distance_from_surface(beforearray, testarray, ProteinPoints, Afterpoint = None):===
-
==Application==
+
::calculates the distances of the testarrays points from the surface as just the minimum of the distances to all proteinpoints. It doesn't calculate the points that are equal to the points of the beforearray, so that these are not taken double. And it checks that the points don't lie on the endpoint.
-
-DNMT1
+
   
-
-Lysozyme
+
::Returns the weighting of the distance by subtracting mindist, dividing it through mindist for making it unitless and then squaring, so that the values are better distributed.  
-
==Verification of patterns==
+
---
-
=Conclusion=
+
===weighing_function_rigids(StartPoint, FirstArray, SecondArray, ThirdArray, EndPoint, ProteinPoints,AminoacidNumberArray, ToBeWeighedAA, WeighingofAA=None, substratelist=None):
-
-Statistical evidence
+
::makes the weighting of rigid linkers, with angle, distance, length and regions distribution.
-
-No preference for helix types
+
   
-
-Many different non-homologous proteins.
+
::returns a list of 5 arrays: weighedvalue, normed lenghtweighing, Angleweighing, Siteinfluence and the distances
-
=References=
+
---
-
[5] Efimov, a V. Standard structures in proteins. Prog. Biophys. Mol. Biol. 60, 201-2–39 (1993).
+
===weighing_function_flex(StartPoint, FirstArray, SecondArray, ThirdArray,  EndPoint, ProteinPoints, AminoacidNumberArray,  ToBeWeighedAA, WeighingofAA = None, substratelist = None):===
-
[6] Donate, L. E., Rufino, S. D., Canard, L.   H. & Blundell, T. L. Conformational analysis and clustering of   short and medium size loops connecting regular secondary structures: a   database for modeling and prediction. Protein Sci. 5, 2600-26–16  (1996).
+
::makes the weighting of flexible linkers, with angle, distance, length and regions distribution.
-
[7] Bonet, J. et al. ArchDB 2014: structural classification of loops in proteins. Nucleic Acids Res. 42, D315-D31–9 (2014).
+
   
-
[8] Fiser, a et al. Modeling of loops in protein structures. Protein science : a publication of the Protein Society 9, 1753-73 (2000).
+
::returns a list of 5 arrays: weighedvalue, normed lenghtweighing, Angleweighing, Siteinfluence and the distances
-
[9] Fiser, A. & Sali, A. ModLoop: Automated modeling of loops in protein structures. Bioinformatics 19, 2500-2501 (2003).
+
---
 +
===make_weighingarrays(Userstring):===
 +
::Userstring is of the shape: 273,10 280-290,5 298,7,35.6  etc. (spaces separate entries, "," is for single residues "-" for anges, second "," for the diameter of the substrate)
 +
::If nothing should be weighted, insert ""
 +
::returns the information in arrayform (Shouldbeweighed and Weighingarray) and a substratelist
 +
---
 +
===sort_out_by_angle (startingarray, middlearray, endingarray, angletosequence):===
 +
::sorts out the paths from startingarray over middlearray to endingarray. A path is sorted out, when the angle it would need is too far away from the possible angles in angletosequence
 +
   
 +
::returns a boolian array which paths to keep, middle and endingarray must have same dimension
 +
::If startingarray is only one point, it returns only middlearray and endingarray, else all three arrays are returned
 +
---
 +
===make_small_generator(PointArray, repetition, RAM, tobesplitlength, ProteinArray = None):===
 +
::calculates how often PointArray needs to be split so that the following calculations still fit into the RAM.
 +
::RAM in GByte,
 +
::repetition means how often is the largest array repeated. Repetition must be manually found and adjusted as the real amount of repetitions is only a hint.
 +
::returns MakeSmall and teiler
 +
---
 +
===make_small_generator_offset(listofarraysinRAM, PointArray, repetition, RAM, tobesplitlength,  ProteinArray = None):===
 +
::calculates how often PointArray needs to be split so that the following calculations still fit into the RAM.
 +
::In the listofarraysinRAM can be either just the arrays or the size of the arrays, same for PointArray
 +
::RAM in GByte,
 +
::repetition means how often is the largest  array repeated. Repetition must be manually found and adjusted as the  real amount of repetitions is only a hint.
 +
::returns MakeSmall and teiler
 +
---
 +
===sort_out_by_distance(startingpoints, endingpoints, firstpoints, distance, variation):===
 +
::generates all possible connections from startingpoints to endingpoints, that lie in one of the distances plus minus the variation.
 +
::returns three arrays with all possible paths, made out of all possible combinations startingpoints to endingpoints that are in a certain distance
 +
---
 +
===sort_out_by_length (comefrompoints, gotopoints, linkerlaengen):===
 +
::sorts out the connections between comefrompoints and gotopoints, when they don't fit to the linkerlengths from linkerlaengen.
 +
::Either comefrompoints or gotopoints can be only one point, but never both of them can.
 +
::returns a boolean array, with which you can slice the points, True means the values are kept
 +
---
 +
===length_to_sequence(lengtharray, linkerdatenbank, linkerlaengen):===
 +
::translates the lengthes from lengtharray to sequences according to the different linkerpieces in linkerdatenbank.
 +
   
 +
::returns an array of the sequences that reproduce the length
 +
---
 +
===angle_to_sequence(anglearray, angletosequence, angleseparators):===
 +
::translates the angles from anglearray to sequences according to the different angletosequence data.
 +
   
 +
::returns an array of the sequences that reproduce the angles
 +
---
 +
===translate_paths_to_sequences(startpoint, firstflex, secondflex, thirdflex, firstrig, secondrig, thirdrig, endpoint, linkerdb, linkerlKO, angletosequence, angleseparators, weightflex, weightrig):===
 +
                                 
 +
::translates all paths to sequences according to the patterns provided in angleosequence and linkerdb
 +
   
 +
::returns an array with sequences for each path
 +
---

Revision as of 10:31, 12 October 2014

Contents

General

During the iGEM competetion we have written a software, that can predict the best linker to circularize a protein. Therefore at first connections between the ends are found, these are weighted on their goodness for the linker and then these paths are retranslated to biological sequences. The software is mainly made possible by python's numpy package for easily handling and processing huge amount of data. Numpy is one of the most used python packages in scientific computing, providing a powerful N-dimensional array object and fast C/C++ written functions to process them. Thus we were able to handle the amount of different linker paths (in the scale of 10^9 paths). Python as a high level programming language with it's various packages enabled us within the short time period of iGEM to write such a powerfull software. On the other hand, being an interpreter language, python's runtime of course is much higher. As python is able to integrate fast C-Code natively, the runtime of Numpy calculations is not that much higher than compared to classical precompiled C-code. This is achieved because all the entries in an array must be of the same type so that always arrays can be processed completely. On the other hand this consumes much more memory, because always the whole array has to be loaded in the RAM, which was one of the major problems for us. But the software should run on Computers with 1 GB of free RAM.

Path storage

The possible paths are always stored as the angle points of the paths under the variables: firstpointsflexible, secondpointsflexible, thirdpointsflexible, firstpointstriangle, secondpointstriangle, thirdpointstriangle, erstepunkte, zweitepunkte and drittepunkte. As each point in 3d has three coordinates, all of these variables are n*3 arrays. The first index identifies the path then. This makes it possible, that a path is just deleted, by deleting the line of all the arrays. This way also the arrays can be easily sliced, making it possible to process parts of an array in a fast way.

Protein data

In the PDB file, all coordinates of non-hydrogen atoms are stored. These are then loaded into arrays x, y and z, just containing one coordinate of a point. these are arrays of length n. As they are not good to handle, the information is restored in different point arrays of shape n*3.

  • PointsOfAllSubunits: These are all points from the PDB file, that should not be ignored. The user has to tell the program, which parts should not be taken into account.
  • pkte: These are all the points of part, that should be circularized, so this are all the atom-coordinates between N- and C-terminus
  • OtherPoints: These are all points of PointsOfAllSubunits that are not in pkte.

Unpreferable places

List of angles and rods

General definitions

  • minabstand: the radius of an alpha-helix, also the minimal distance an atom needs from a connection
  • LengthOfAngle:
  • LengthOfFlexibleAA:
  • Flexatstartseq and Flexatendseq:

Biggest problems

We encountered many bigger or smaller problems while programming. Some are quite serious issues and are mainly due to the brute-forcing ansatz we made, but were mainly solved to an acceptable extent.

RAM usage

For example when the distance from the connection in a path to the protein is calculated, always the distance of all of the atoms of the protein to this connection is calculated in one array. Thus easily arrays of 100 000 000 * 3 * 6000 * 3 shape can occur, which is just too much for normal RAMsize. On the other hand, using numpy arrays, the bigger the arrays, that are processed are, the faster the program is in total, because the functions don't have to be loaded again and again. Therefore in each step the arrays are sliced in a way, that the following procedure can take place in the RAM of the computer, see functions part.

Array Size

The next issue was, that with arrays with about 300 000 000 float 16 entries, like they occur for large proteins while point generation they could even not be kept in RAM in total. Thus we had to manually store them on the harddrive for processing. Of course also this slowed the procedure down. But fortunately we could use python's h5py package, which allowed us, just taking out certain lines from the array stored on the disk.

Runtime

The longest calculation took about 11days on a 2.6 GHz intel i5 CPU with 8GB RAM and an SSD harddrive of continuous calculation. Therefore results are frequently stored on the harddisk using python's cpickle package, which allowed a fast storage of complete numpy objects. Thus calculations can be stopped and restarted after certain points. On the other hand, calculation time increases with the number of points for the connections and for the protein. Reducing the points of the protein to the atoms on the surface would be the next step to take, which could reduce calculation time to one sixth of the time now. Due to lack of time, we often had to make the tradeoff between fast programming and fast calculation. Having the huge resources of ###i@h### we clearly decided for fast programming and not optimizing the code yet for velocity.

Flexible ends

Long flexible (non helical) regions until now are kind of an issue, but we have implemented several functions that should handle this. The problem is, that the number of possible conformations already for two or three exceedes our capabilities and could not be easily handeled in the building-block system we chose to implement. Therefore also helical regions are handeled as straight connections, but with varying length. After the flexible regions there are no angular restraints given to the attached helical block.

Functions

imported python modules

necessary

numpy: the basic module for numerical calculations in large scale h5py: Storing arrays on the harddisk, allowes slicing of the arrays on the disk. os: Used for reading and writing files sys: module used for exiting the program at certain points.

recommended

matplotlib.pyplot: can be used from mpl_toolkits.mplot3d, Axes3D: time: so that one can see the progress of the calculations and observe the calculation times. fnmatch: allows wildcard search in strings, important for finding specific linker patterns cPickle: used for intermediate storing of the arrays, so that calculations coul be continued after restarting the program.

selfwritten functions

angle_between_connections_array(startingarray, middlearray, endingarray):

calculates the angles between the vectors from startingarray to middlearray and middlearray to endingarray. If there is no displacement between the arrays it returns zero as angle. startingarray and endingarray can be only one single point, middlearray should always be an array of points in 3d space.
returns values between [0,pi] in an array of size Startarray.

---

angle_between_vectors(vect1, vect2):

calculates the angle between two arrays of vectors. If one of the vectors is 0, the angle is set to 0. The result is based on arccos.
returns the angles between two vectors.

---

distance_from_connection(Startarray, Endarray, Points):

takes a connection from Startarray to Endarray and calculates the perpendicular distance of the points from the connection. Startarray or Endarray can also be single points.
Returns an array of size (Startarray * points) with all perpendicular distances or distances of the endpoints.

---

punktebeigerade(minabstand, pkte, gerade, aufpunkt, laenge):

checks whether there are points too close to a straight line coming from aufpunkt with in direction of gerade with length laenge.
returns True if no point of pkte is closer to the straight line than
   minabstand

---

test_accessible_angles(winkelarray, length, anfangspunkt, proteinpoints, gerade=np.array([0, 0, 1])):

winkelarray is an array of angles that should be checked, whether they are accessible from anfangspunkt. Accessible means that no point of proteinpoints is too close to the straight line, which is produced by rotating gerade with the angles of winkelarray. Gerade always starts at anfangspunkt angles are measured from z-axis, if gerade is not defined else.
returns a boolean array with which winkelarray can be sliced.

---

reduce_angles_from_redundancies(winkelarray):

takes an array of angles in the format [phi, theta] and looks which angles produce the same result in the vector.
Returns an array with all indices, that can be deleted along the 0 axis of winkelarray.

---

make_displacements(lengtharray, displacementarray):

generates all possible displacements from displacementarray (an array of vectors) and lengtharray (array of different lengths)
returns an array with displacementvectors in different lengths

---

sort_out_by_protein(startingarray, endingarray, proteinpoints, mindist, beforearray = None):

sorts out the connections between startingarray and endingarray with proteinpoints. A connection is sorted out, if one point of the proteinpoints is nearer to the connection, than mindist.
Returns only the points for the connections, that are good. If beforearray is set, returns also beforearray

---

naechstepunkte(anfangsarray, verschiebungsarray):

generates for each point of anfangsarray, all points that are made by displacements of that point with verschiebungsarray.
Returns two arrays of equal size, the enlarged anfangsarray and the array resulting from verschiebungsarray.

---

aussortierennachpunken(punktearray, proteinpunkte, minabstand, maxabstand):

sorts all the points of punktearray out, that are nearer than minabstand to one of the points from proteinpunkte, or farther away than maxabstand.
returns a boolean array, with which one can slice punktearray.

---

angle_weighing(anglearray, angletosequence=angletosequence):

weighting of the angles form anglearray. The better an angle fits to the angles provided by angletosequence, the lower the value is. The best angle gets a weighing of 1, the worst angle of 2.
Returns a weighingarray for the angles of anglearray. Each weighing is in the range between 1 and 2. The weighing is based on gaussian distributions.

---

angle_function(StartingArray, MiddleArray, EndingArray):

makes a weighing of the connection from Startingarray, over Middlearray to Endingarray based on the weighing of the angles.
Returns an angle weighting for each connection.

---

unpreferable_places(Start, End, ProteinPoints, AminoacidNumberArray, ToBeWeighedAAInput, WeighingofAA, substratelist):

Calculates a weighting for the connection from the points of Start to the points of End based on the distance from regions that should be omitted. These aminoacids should be defined in the ToBeWeighedAAInput array and the WeighingofAA array defines how important this region is. If one wants whole substrates to be omitted, they should be added in the substratelist. The total returned number is normalized, so the weighting of the regions is independent of the number of places, that should be omitted.
::Parameters:
Start: The points where the rod starts,
End: The points where the rod ends
ProteinPoints: The points of the protein, that should be taken into account.
AminoacidNumberArray: The array, that tells, to which amino acid one atom belongs
ToBeWeighedAAInput: One output of make_weighing_arrays
substratelist: A list of tuples (amino acid nr, size of substrate). Amino acid nr, is the amino acid, where the substrate binds to.
::returns the weighing of the connections, because of the regions, where the linker passes through.

---

distance_from_surface(beforearray, testarray, ProteinPoints, Afterpoint = None):

calculates the distances of the testarrays points from the surface as just the minimum of the distances to all proteinpoints. It doesn't calculate the points that are equal to the points of the beforearray, so that these are not taken double. And it checks that the points don't lie on the endpoint.
Returns the weighting of the distance by subtracting mindist, dividing it through mindist for making it unitless and then squaring, so that the values are better distributed.

--- ===weighing_function_rigids(StartPoint, FirstArray, SecondArray, ThirdArray, EndPoint, ProteinPoints,AminoacidNumberArray, ToBeWeighedAA, WeighingofAA=None, substratelist=None):

makes the weighting of rigid linkers, with angle, distance, length and regions distribution.
returns a list of 5 arrays: weighedvalue, normed lenghtweighing, Angleweighing, Siteinfluence and the distances

---

weighing_function_flex(StartPoint, FirstArray, SecondArray, ThirdArray, EndPoint, ProteinPoints, AminoacidNumberArray, ToBeWeighedAA, WeighingofAA = None, substratelist = None):

makes the weighting of flexible linkers, with angle, distance, length and regions distribution.
returns a list of 5 arrays: weighedvalue, normed lenghtweighing, Angleweighing, Siteinfluence and the distances

---

make_weighingarrays(Userstring):

Userstring is of the shape: 273,10 280-290,5 298,7,35.6 etc. (spaces separate entries, "," is for single residues "-" for anges, second "," for the diameter of the substrate)
If nothing should be weighted, insert ""
returns the information in arrayform (Shouldbeweighed and Weighingarray) and a substratelist

---

sort_out_by_angle (startingarray, middlearray, endingarray, angletosequence):

sorts out the paths from startingarray over middlearray to endingarray. A path is sorted out, when the angle it would need is too far away from the possible angles in angletosequence
returns a boolian array which paths to keep, middle and endingarray must have same dimension
If startingarray is only one point, it returns only middlearray and endingarray, else all three arrays are returned

---

make_small_generator(PointArray, repetition, RAM, tobesplitlength, ProteinArray = None):

calculates how often PointArray needs to be split so that the following calculations still fit into the RAM.
RAM in GByte,
repetition means how often is the largest array repeated. Repetition must be manually found and adjusted as the real amount of repetitions is only a hint.
returns MakeSmall and teiler

---

make_small_generator_offset(listofarraysinRAM, PointArray, repetition, RAM, tobesplitlength, ProteinArray = None):

calculates how often PointArray needs to be split so that the following calculations still fit into the RAM.
In the listofarraysinRAM can be either just the arrays or the size of the arrays, same for PointArray
RAM in GByte,
repetition means how often is the largest array repeated. Repetition must be manually found and adjusted as the real amount of repetitions is only a hint.
returns MakeSmall and teiler

---

sort_out_by_distance(startingpoints, endingpoints, firstpoints, distance, variation):

generates all possible connections from startingpoints to endingpoints, that lie in one of the distances plus minus the variation.
returns three arrays with all possible paths, made out of all possible combinations startingpoints to endingpoints that are in a certain distance

---

sort_out_by_length (comefrompoints, gotopoints, linkerlaengen):

sorts out the connections between comefrompoints and gotopoints, when they don't fit to the linkerlengths from linkerlaengen.
Either comefrompoints or gotopoints can be only one point, but never both of them can.
returns a boolean array, with which you can slice the points, True means the values are kept

---

length_to_sequence(lengtharray, linkerdatenbank, linkerlaengen):

translates the lengthes from lengtharray to sequences according to the different linkerpieces in linkerdatenbank.
returns an array of the sequences that reproduce the length

---

angle_to_sequence(anglearray, angletosequence, angleseparators):

translates the angles from anglearray to sequences according to the different angletosequence data.
returns an array of the sequences that reproduce the angles

---

translate_paths_to_sequences(startpoint, firstflex, secondflex, thirdflex, firstrig, secondrig, thirdrig, endpoint, linkerdb, linkerlKO, angletosequence, angleseparators, weightflex, weightrig):

translates all paths to sequences according to the patterns provided in angleosequence and linkerdb
returns an array with sequences for each path

---