Team:Heidelberg/pages/Linker-Software Docu

General

During the iGEM competition we have written a piece of software that can predict the best linker to circularize a protein. First, connections between the ends of the protein are found; these paths are then weighted by their quality as a linker and finally translated back into biological sequences. The software is made possible mainly by python's numpy package, which lets us handle and process huge amounts of data easily. Numpy is one of the most widely used python packages in scientific computing, providing a powerful N-dimensional array object together with fast functions written in C/C++ to process it. This enabled us to handle the huge number of different linker paths (on the scale of 10^9 paths). Python, as a high-level programming language with its many packages, allowed us to write such a powerful piece of software within the short time frame of this year's iGEM competition. Being an interpreted language, plain python is of course much slower than compiled code; but because python integrates fast C code natively, the runtime of numpy calculations is not much higher than that of classical precompiled C code. This is achieved by using a single data type for all entries of an array, so that the whole array can be processed at once. On the other hand this consumes much more memory, since the whole array has to be loaded into RAM in every processing step, which caused one of our major problems. The software therefore requires at least 2 GB of free RAM.
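As a minimal sketch (not code from CRAUT itself), the following shows the kind of vectorized numpy operation this refers to: the lengths of a large batch of candidate connections are computed in a single call on an n*3 array instead of looping over every path in python. The array sizes and names are illustrative.

import numpy as np

n_paths = 10**6                      # illustrative; real runs reach the 10^9 scale
starts = np.random.rand(n_paths, 3)  # hypothetical start points of the connections (n*3)
ends = np.random.rand(n_paths, 3)    # hypothetical end points of the connections (n*3)

# one homogeneous float array lets numpy process all paths at once in C code
lengths = np.linalg.norm(ends - starts, axis=1)   # shape (n_paths,)
print(lengths.shape, lengths.dtype)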

Usage

Two different versions can be downloaded. One is a bundled version with reduced features but enhanced usability; the other is the source code, in which a few definitions have to be made at the beginning of the code, giving the user the full functionality. If not all features are needed, we recommend using the reduced version. The bundled version can only be used on Linux PCs. Both versions require at least 2 GB of free RAM.

Bundled version

Please download CRAUT from https://github.com/igemsoftware/Heidelberg_2014. You only need to add the protein structure file in PDB format and an instructions file. For detailed information please see the README in the folder.

The bundled version is more user friendly, but does not provide the full functionality. Mainly, it lacks adjustability. Features that cannot be used are:

Weighting of different amino acids: a function that lets you define amino acids and regions the linker should avoid.

Checkpointing: the program has to run in one go and cannot be stopped during the calculation, because no intermediate files are stored during computation. Calculations can take up to several days.

Size of linkers: as no checkpointing is enabled in this version, the linker is also limited to a maximum of two angles. This should be sufficient for most proteins; if it does not produce results, please use the python version from the website.

Circularizing only a part of the protein: the complete protein provided by the PDB file is circularized, in contrast to the complete version, where the user can define the amino acid sites at which the circularization should occur.

Ignoring different parts of proteins: it is not possible to ignore additional parts of the PDB file. Only the defined subunit will be processed; other subunits are ignored.

Attachment sequences: the exchange of attachment sequences (normally GG at the beginning of the linker) has to be done manually.

The python file

For this version you need a working python 2.7 environment; see below for the packages used. Please download the .py file from ###here### and insert your data into the variables in the header of the code.

Two folders have to be created in the same directory from which the python program is run: one named "files", in which checkpoints are saved, and one called "protein", which contains the PDB file.

Several choices can be made as to which parts of the protein should be taken into account for circularization and with which sequences this should be done. The extein sequence or the sortase scar should be inserted at the ScarsAtEnd variable. A weighting of certain amino acids can also be defined at the beginning.
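To illustrate, a header filled in for a hypothetical run might look like the sketch below. Only ScarsAtEnd, FlexAtStartSeq and FlexAtEndSeq are variable names taken from this documentation; the other names and all example values are illustrative, not the exact code.

PDBFileName = "protein/1XYZ.pdb"   # hypothetical PDB file placed in the "protein" folder
ScarsAtEnd = "GG"                  # extein sequence or sortase scar (example value)
FlexAtStartSeq = "GG"              # flexible residues at the start of the linker (example)
FlexAtEndSeq = "GG"                # flexible residues at the end of the linker (example)
# weighting string in the format expected by make_weighingarrays (see the functions part):
Userstring = "273,10 280-290,5 298,7,35.6"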

The software stores temporary data after reaching a processing checkpoint, so that the calculation can be resumed later. The code is split into different sections that can be run separately in IDEs such as spyder. After an interruption, the calculation can simply be resumed at one of the various pickle.load checkpoints.
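The general pattern is the one sketched below, assuming the "files" folder from the setup step; CRAUT's actual checkpoint names differ.

import os
try:
    import cPickle as pickle   # python 2, as used by the project
except ImportError:
    import pickle              # fallback for python 3

CHECKPOINT = os.path.join("files", "paths_checkpoint.pkl")   # hypothetical file name

def save_checkpoint(obj):
    # dump the current numpy arrays (or any python object) to disk
    with open(CHECKPOINT, "wb") as fh:
        pickle.dump(obj, fh, protocol=2)

def load_checkpoint():
    # resume here after restarting the program
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as fh:
            return pickle.load(fh)
    return None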

For most proteins, not too many paths need to be checked, so the software runs within several hours on a 2.6 GHz Intel i5; in unfavourable cases the linker calculation can take several days. For big proteins and long linkers, about 8 GB of RAM are needed to run the program completely.

In the end a single results file is saved to disk, containing the sequences of the linkers together with their weightings.

Biggest problems we had

We encountered many bigger and smaller problems while programming. Some were quite serious issues, mainly due to the brute-force approach we chose, but most of them were solved to an acceptable extent.

RAM usage

The calculation of the distance between a connection in a path and the protein always encompasses the distances of all atoms of the protein to this connection. This easily results in arrays of shape 100 000 000 * 3 * 6000 * 3, which is far too much for a computer with a normal amount of RAM.

On the other hand, when using numpy arrays, the bigger the processed arrays are, the faster the program is in total, because each operation only has to be dispatched once for the whole array. Therefore, in each step the arrays are sliced in such a way that the following procedure still fits into the RAM of the computer; see the make_small_generator functions in the functions part.
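A much simplified sketch of this slicing idea (the real make_small_generator functions are more elaborate): the candidate points are processed in chunks whose distance matrix against the protein atoms still fits into a given amount of RAM. Names and the RAM estimate are illustrative.

import numpy as np

def chunked_min_distances(points, protein, ram_gb=2.0):
    # points: (n, 3) candidate points, protein: (m, 3) atom coordinates
    bytes_per_entry = 8                               # float64
    max_entries = int(ram_gb * 1e9 / bytes_per_entry)
    chunk = max(1, max_entries // max(len(protein), 1))
    mins = np.empty(len(points))
    for start in range(0, len(points), chunk):
        block = points[start:start + chunk]           # slice that fits into RAM
        # (chunk, m) distance matrix, built one block at a time
        d = np.linalg.norm(block[:, None, :] - protein[None, :, :], axis=2)
        mins[start:start + chunk] = d.min(axis=1)
    return mins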

Array Size

Arrays with about 300 000 000 float16 entries, as they occur during point generation for large proteins, are too large for the RAM. We therefore had to store them manually on the hard drive for processing, which of course slowed the procedure down. Fortunately we could use python's h5py package, which allowed us to read only certain lines of an array stored on disk.
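In practice this looks roughly like the following sketch (dataset name and shapes are illustrative): the array lives in an HDF5 file on disk and only the requested rows are read into memory.

import numpy as np
import h5py

# write a large point array to disk once
with h5py.File("points.h5", "w") as fh:
    fh.create_dataset("points", data=np.random.rand(1000000, 3).astype("float16"))

# later, read only a slice of the rows instead of the whole array
with h5py.File("points.h5", "r") as fh:
    some_rows = fh["points"][10000:20000]   # only these lines are loaded into RAM
print(some_rows.shape)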

Runtime

The longest calculation took about 11 days on a 2.6 GHz Intel i5 CPU with 8 GB RAM and an SSD hard drive. Results are therefore frequently stored on the hard disk using python's cPickle package, which allows fast storage of complete numpy objects; calculations can thus be stopped and restarted at certain points. On the other hand, calculation time increases with the number of points for the connections and for the protein. Reducing the protein points to the atoms on its surface would be the next step to take, which could cut the calculation time to roughly one sixth of its current value. Due to lack of time, we often had to trade off fast programming against fast calculation. Having the huge resources of iGEM@home, we clearly decided for fast programming and have not yet optimized the code for speed.

Flexible ends

Long flexible (non-helical) regions are still somewhat of an issue, but we have implemented several functions that should handle them. The problem is that the number of possible conformations of even a few flexible residues exceeds what the building-block system we chose to implement can handle. Therefore, flexible regions are treated like the helical ones, as straight connections, but with varying length. After a flexible region, no angular restraints are imposed on the attached helical block.

Path storage

The possible paths are always stored as the angle points of the paths in the following variables: firstpointsflexible, secondpointsflexible, thirdpointsflexible, firstpointstriangle, secondpointstriangle, thirdpointstriangle, erstepunkte, zweitepunkte and drittepunkte. As each point in 3D has three coordinates, all of these variables are n*3 arrays, and the first index identifies the path. This makes it possible to erase a path simply by deleting the corresponding line in all of the arrays. The arrays can also be sliced easily this way, making it possible to process parts of an array quickly.
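A small sketch of this bookkeeping (array contents are random placeholders): a boolean mask from one of the sorting-out steps removes the same row, i.e. the same path, from every coordinate array at once.

import numpy as np

n_paths = 5
firstpoints = np.random.rand(n_paths, 3)     # one row per path
secondpoints = np.random.rand(n_paths, 3)
thirdpoints = np.random.rand(n_paths, 3)

keep = np.array([True, False, True, True, False])   # e.g. result of a sorting-out step
firstpoints, secondpoints, thirdpoints = (a[keep] for a in
                                          (firstpoints, secondpoints, thirdpoints))
print(firstpoints.shape)   # (3, 3): paths 1 and 4 are gone from all arrays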

Differences in proteins

Proteins can differ widely in their size and sometimes even in how they are defined in the PDB files. Setting up a stable version that runs with all these different PDBs, which vary in size by a factor of 20, was a big issue.

Different definitions

Protein data

In the PDB file, the coordinates of all non-hydrogen atoms are stored. These are first loaded into arrays x, y and z of length n, each containing one coordinate per point. Since this kind of array is complicated to handle, the information is then reorganized into point arrays of shape n*3.

  • PointsOfAllSubunits: all points from the PDB file that should not be ignored. The user has to decide which parts of the protein should not be taken into account.
  • pkte: all points of the part of the protein that is to be circularized, i.e. all atom coordinates between the N- and the C-terminus.
  • OtherPoints: all points of PointsOfAllSubunits that are not in pkte.
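A minimal sketch of the conversion described above (coordinate values are made up): the three length-n arrays are combined into one n*3 point array, from which subsets like pkte and OtherPoints can then be taken.

import numpy as np

x = np.array([1.0, 2.0, 3.0])        # illustrative coordinates from the PDB file
y = np.array([4.0, 5.0, 6.0])
z = np.array([7.0, 8.0, 9.0])

points = np.column_stack((x, y, z))  # shape (n, 3), one row per atom
print(points.shape)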


List of angles and rods

Rods: "AEAAAK", "AEAAAKA", "AEAAAKAA", "AEAAAKEAAAK", "AEAAAKEAAAKA", "AEAAAKEAAAKEAAAKA", "AEAAAKEAAAKEAAAKEAAAKA", "AEAAAKEAAAKEAAAKEAAAKEAAAKA"

Angles, always given as mean, then standard deviation, then the pattern: [(29.7, 8.5, "NVL"), (38.7, 30., "KTA"), (60., 12., "AADGTL"), (74.5, 27., "VNLTA"), (117., 12., "AAAHPEA"), (140., 15., "ASLPAA"), (160., 5., "ATGDLA")]

General definitions

  • minabstand: the radius of an alpha helix, which is also the minimal distance an atom must keep from a connection
  • LengthOfAngle: the distance between the end of an angle pattern and the geometrical turning point.
  • LengthOfFlexibleAA: 3.5 Å
  • FlexAtStartSeq and FlexAtEndSeq: these are not only the regions missing from the PDB on both sides, but also the flexible ends of the linker and the extein on the other side.


Functions

imported python modules

necessary

  • numpy: the basic module for large-scale numerical calculations
  • h5py: stores arrays on the hard disk and allows slicing of the arrays on disk.
  • os: used for reading and writing files
  • sys: used for exiting the program at certain points.


recommended

  • matplotlib.pyplot: used for plotting, e.g. to visualize intermediate results
  • mpl_toolkits.mplot3d Axes3D: used for 3D plots of points and paths
  • time: needed to follow the progression of the calculations and to observe the calculation times.
  • fnmatch: allows wildcard searches in strings, important for finding specific linker patterns
  • cPickle: used for intermediate storage of the arrays, so that calculations can be continued after restarting the program.


self-written functions

angle_between_connections_array(startingarray, middlearray, endingarray):

calculates the angles between the vectors from startingarray to middlearray and from middlearray to endingarray. If there is no displacement between the arrays, it returns zero as the angle. startingarray and endingarray can be single points; middlearray should always be an array of points in 3D space.
returns values in [0, pi] in an array of the size of startingarray.

angle_between_vectors(vect1, vect2):

calculates the angle between two arrays of vectors. If one of the vectors is 0, the angle is set to 0. The result is based on arccos.
returns the angles between two vectors.
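A possible implementation of the behaviour described here, as a sketch rather than the project's exact code:

import numpy as np

def angle_between_vectors(vect1, vect2):
    # vect1, vect2: arrays of shape (n, 3); returns angles in [0, pi]
    n1 = np.linalg.norm(vect1, axis=1)
    n2 = np.linalg.norm(vect2, axis=1)
    denom = n1 * n2
    angles = np.zeros(len(vect1))
    ok = denom > 0                                   # zero vectors keep angle 0
    cosang = np.einsum('ij,ij->i', vect1, vect2)[ok] / denom[ok]
    angles[ok] = np.arccos(np.clip(cosang, -1.0, 1.0))
    return angles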


distance_from_connection(Startarray, Endarray, Points):

takes a connection from Startarray to Endarray and calculates the perpendicular distance of the points from the connection. Startarray or Endarray can also be single points.
returns an array of size (Startarray * points) with all perpendicular distances or distances of the endpoints.
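The underlying geometry, sketched for a single connection against many points (the real function is vectorized over many connections at once):

import numpy as np

def distances_to_connection(start, end, points):
    # start, end: (3,) endpoints of the connection; points: (m, 3) atom coordinates
    d = end - start
    length2 = np.dot(d, d)
    if length2 == 0:
        return np.linalg.norm(points - start, axis=1)
    # projection of every point onto the line, clamped to the segment
    t = np.clip(np.dot(points - start, d) / length2, 0.0, 1.0)
    foot = start + t[:, None] * d            # closest point on the connection
    return np.linalg.norm(points - foot, axis=1)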

punktebeigerade(minabstand, pkte, gerade, aufpunkt, laenge):

checks whether there are points too close to a straight line starting at aufpunkt in the direction of gerade with length laenge.
returns True if no point of pkte is closer to the straight line than minabstand

test_accessible_angles(winkelarray, length, anfangspunkt, proteinpoints, gerade=np.array([0, 0, 1])):

winkelarray is an array of angles to be checked for accessibility from anfangspunkt. Accessible means that no point of proteinpoints is too close to the straight line produced by rotating gerade by the angles of winkelarray. gerade always starts at anfangspunkt; angles are measured from the z-axis unless gerade is defined otherwise.
returns a boolean array with which winkelarray can be sliced.

reduce_angles_from_redundancies(winkelarray):

takes an array of angles in the format [phi, theta] and determines which angles produce the same vector.
returns an array with all indices, that can be deleted along the 0 axis of winkelarray.

make_displacements(lengtharray, displacementarray):

generates all possible displacements from displacementarray (an array of vectors) and lengtharray (array of different lengths)
returns an array with displacementvectors in different lengths

sort_out_by_protein(startingarray, endingarray, proteinpoints, mindist, beforearray = None):

sorts out the connections between startingarray and endingarray using proteinpoints. A connection is sorted out if one of the proteinpoints is closer to the connection than mindist.
returns only the points for the connections, that are good. If beforearray is set, returns also beforearray

naechstepunkte(anfangsarray, verschiebungsarray):

generates for each point of anfangsarray, all points that are made by displacements of that point with verschiebungsarray.
returns two arrays of equal size, the enlarged anfangsarray and the array resulting from verschiebungsarray.

aussortierennachpunken(punktearray, proteinpunkte, minabstand, maxabstand):

sorts all the points of punktearray out, that are nearer than minabstand to one of the points from proteinpunkte, or farther away than maxabstand.
returns a boolean array, with which one can slice punktearray.

angle_weighing(anglearray, angletosequence=angletosequence):

weighting of the angles from anglearray. The better an angle fits the angles provided by angletosequence, the lower the value. The best angle gets a weighing of 1, the worst a weighing of 2.
returns a weighingarray for the angles of anglearray. Each weighing is in the range between 1 and 2. The weighing is based on gaussian distributions.
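One way such a gaussian-based weighting can be written, as an illustrative sketch (the exact formula used by the software may differ):

import numpy as np

angletosequence = [(29.7, 8.5, "NVL"), (38.7, 30., "KTA"), (60., 12., "AADGTL"),
                   (74.5, 27., "VNLTA"), (117., 12., "AAAHPEA"),
                   (140., 15., "ASLPAA"), (160., 5., "ATGDLA")]

def angle_weighing(anglearray, angletosequence=angletosequence):
    anglearray = np.asarray(anglearray, dtype=float)
    # gaussian fit of every angle to every pattern angle (mean, std)
    fits = np.array([np.exp(-(anglearray - mean) ** 2 / (2 * std ** 2))
                     for mean, std, _seq in angletosequence])
    best = fits.max(axis=0)          # best fit per angle, in (0, 1]
    return 2.0 - best                # 1 for a perfect fit, approaching 2 otherwise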

angle_function(StartingArray, MiddleArray, EndingArray):

makes a weighing of the connection from Startingarray, over Middlearray to Endingarray based on the weighing of the angles.
returns an angle weighting for each connection.

unpreferable_places(Start, End, ProteinPoints, AminoacidNumberArray, ToBeWeighedAAInput, WeighingofAA, substratelist):

Calculates a weighting for the connection from the points of Start to the points of End based on the distance from regions that should be omitted. These aminoacids should be defined in the ToBeWeighedAAInput array and the WeighingofAA array defines how important this region is. If one wants whole substrates to be omitted, they should be added in the substratelist. The total returned number is normalized, so the weighting of the regions is independent of the number of places, that should be omitted.
Parameters:
Start: The points where the rod starts,
End: The points where the rod ends
ProteinPoints: The points of the protein, that should be taken into account.
AminoacidNumberArray: The array, that tells, to which amino acid one atom belongs
ToBeWeighedAAInput: One output of make_weighing_arrays
substratelist: a list of tuples (amino acid nr, size of substrate). The amino acid nr is the amino acid to which the substrate binds.
returns the weighing of the connections, because of the regions, where the linker passes through.

distance_from_surface(beforearray, testarray, ProteinPoints, Afterpoint = None):

calculates the distances of the testarray points from the surface as just the minimum of the distances to all proteinpoints. It doesn't calculate the points that are equal to the points of the beforearray, so that these are not taken twice. Additionally, it checks that the points don't lie on the endpoint.
Returns the weighting of the distance, computed by subtracting mindist, dividing by mindist to make it unitless and then squaring, so that the values are better distributed.

weighing_function_rigids(StartPoint, FirstArray, SecondArray, ThirdArray, EndPoint, ProteinPoints,AminoacidNumberArray, ToBeWeighedAA, WeighingofAA=None, substratelist=None):

makes the weighting of rigid linkers, with angle, distance, length and regions distribution.
returns a list of 5 arrays: weighedvalue, normed lenghtweighing, Angleweighing, Siteinfluence and the distances

weighing_function_flex(StartPoint, FirstArray, SecondArray, ThirdArray, EndPoint, ProteinPoints, AminoacidNumberArray, ToBeWeighedAA, WeighingofAA = None, substratelist = None):

makes the weighting of flexible linkers, with angle, distance, length and regions distribution.
returns a list of 5 arrays: weighedvalue, normed lenghtweighing, Angleweighing, Siteinfluence and the distances

make_weighingarrays(Userstring):

Userstring is of the shape: 273,10 280-290,5 298,7,35.6 etc. (spaces separate entries, "," separates residue and weight, "-" is used for residue ranges, and a second "," gives the diameter of the substrate)
If nothing should be weighted, insert ""
returns the information in arrayform (Shouldbeweighed and Weighingarray) and a substratelist
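A sketch of the parsing implied by this format (the real function returns numpy arrays and handles more cases; names here are illustrative):

def parse_userstring(userstring):
    residues, weights, substratelist = [], [], []
    for entry in userstring.split():
        fields = entry.split(",")
        head, weight = fields[0], float(fields[1])
        if "-" in head:                                  # a residue range, e.g. "280-290"
            lo, hi = (int(x) for x in head.split("-"))
            span = list(range(lo, hi + 1))
        else:
            span = [int(head)]
        if len(fields) == 3:                             # third field: substrate diameter
            substratelist.append((span[0], float(fields[2])))
        residues.extend(span)
        weights.extend([weight] * len(span))
    return residues, weights, substratelist

print(parse_userstring("273,10 280-290,5 298,7,35.6"))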

sort_out_by_angle (startingarray, middlearray, endingarray, angletosequence):

sorts out the paths from startingarray over middlearray to endingarray. A path is sorted out, when the angle it would need is too far away from the possible angles in angletosequence
returns a boolean array telling which paths to keep. middlearray and endingarray must have the same dimension; if startingarray is only one point, only middlearray and endingarray are returned, otherwise all three arrays are returned

make_small_generator(PointArray, repetition, RAM, tobesplitlength, ProteinArray = None):

calculates how often PointArray needs to be split so that the following calculations still fit into the RAM.
RAM in GByte
repetition means how often the largest array is repeated. It must be found and adjusted manually, as the computed number of repetitions is only a hint.
returns MakeSmall and teiler

make_small_generator_offset(listofarraysinRAM, PointArray, repetition, RAM, tobesplitlength, ProteinArray = None):

calculates how often PointArray needs to be split so that the following calculations still fit into the RAM.
In the listofarraysinRAM can be either just the arrays or the size of the arrays, same for PointArray
RAM in GByte,
repetition means how often the largest array is repeated. It must be found and adjusted manually, as the computed number of repetitions is only a hint.
returns MakeSmall and teiler

sort_out_by_distance(startingpoints, endingpoints, firstpoints, distance, variation):

generates all possible connections from startingpoints to endingpoints that lie within one of the distances plus or minus the variation.
returns three arrays with all possible paths, made out of all possible combinations startingpoints to endingpoints that are in a certain distance

sort_out_by_length (comefrompoints, gotopoints, linkerlaengen):

sorts out the connections between comefrompoints and gotopoints, when they don't fit to the linkerlengths from linkerlaengen.
Either comefrompoints or gotopoints can be only one point, but never both of them.
returns a boolean array, with which you can slice the points, True means the values are kept

length_to_sequence(lengtharray, linkerdatenbank, linkerlaengen):

translates the length from lengtharray to sequences according to the different linkerpieces in linkerdatenbank.
returns an array of the sequences that reproduce the length

angle_to_sequence(anglearray, angletosequence, angleseparators):

translates the angles from anglearray to sequences according to the different angletosequence data.
returns an array of the sequences that reproduce the angles

translate_paths_to_sequences(startpoint, firstflex, secondflex, thirdflex, firstrig, secondrig, thirdrig, endpoint, linkerdb, linkerlKO, angletosequence, angleseparators, weightflex, weightrig):

translates all paths to sequences according to the patterns provided in angletosequence and linkerdb
returns an array with sequences for each path