Team:Heidelberg/pages/Linker-Software Docu

From 2014.igem.org

(Difference between revisions)
(distance_from_surface(beforearray, testarray, ProteinPoints, Afterpoint = None):)
(Bundeled version)
 
Line 12: Line 12:
===Bundeled version===
===Bundeled version===
-
Please download [[CRAUT ### link]] from here. You only need to add the protein structure file in PDB format and an instructions file. For detailed information please see the README in the folder.
+
Please download CRAUT from [https://github.com/igemsoftware/Heidelberg_2014 here ]here. You only need to add the protein structure file in PDB format and an instructions file. For detailed information please see the README in the folder.
The bundeled version is nicer accessible more user friendly, butdoes not provide full usability. Mainly, it is missing adjustability. Some functions that can not be used are:
The bundeled version is nicer accessible more user friendly, butdoes not provide full usability. Mainly, it is missing adjustability. Some functions that can not be used are:

Latest revision as of 03:58, 18 October 2014

Contents

General

During the iGEM competition we have written a software, that can predict the best linker to circularize a protein. At first, connections between the ends of proteins are found, these are weighted on their quality for the linker and then these paths are retranslated to biological sequences. The software is mainly made possible by python's numpy package for easily handling and processing huge amount of data. Numpy is one of the most used python packages in scientific computing, providing a powerful N-dimensional array object and fast C/C++ written functions to process them. Thus we were able to handle the huge amount of different linker paths (in the scale of 10^9 paths). Python, as a high level programming language with its various packages, enabled us to write such a powerfull software within the short time period of this years iGEM competition. Being an interpreter language, python's runtime of course is much higher. As python is able to integrate fast C-Code natively, the runtime of Numpy calculations is not that much higher than compared to classical precompiled C-code. This is achieved by using the same type of entries in an array, so that the whole array can be processed at once. On the other hand this consumes much more memory, since the whole array has to be loaded in the RAM in every processing step, causing one of the major problems for us. Therefore, the software requires at least 2 GB of free RAM.

Usage

Two different versions can be downloaded. One is a bundeled version with reduced features, but enhanced usability. The other is the source code, where in the beginning definitions need to be made in the code, giving the full usability at hand of the user. If not all features are needed, we recommend using the reduced version. The bundeled version can be used only on linux PCs. Both versions require minimum 2GB of free RAM.

Bundeled version

Please download CRAUT from here here. You only need to add the protein structure file in PDB format and an instructions file. For detailed information please see the README in the folder.

The bundeled version is nicer accessible more user friendly, butdoes not provide full usability. Mainly, it is missing adjustability. Some functions that can not be used are:

Weighting of different aminoacids: A function, which allows you to define aminoacids and regions that should be omitted by the linker.

Checkpointing: The program has to run in total at once, it can not be stopped during the calculation, because no files are stored during computation. Calculations can take up to days.

Size of linkers: As no checkpointing is enabled in this version, also the maximum angles are reduced to two angles in the linker. This shold be sufficient for most proteins. but If this does not produce results, please refer to the python version from the website.

Circularizing only a part of the protein: The complete protein, provided by the PDB file is circularized, in contrast to the the complete version, where the user can define the amino acid site where the protein circularization should occure.

Ignoring different parts of proteins: It is not possible to ignore additional parts of the PDB. Only the defined subunit will be processed, other subunis are ignored.

Attachment-Sequences: The exchange of attachment-sequences (normally GG in the beginning of the linker) has to be done manually.

The python file

For this version you need a running python 2.7 environment, please see below on the used packages. Please download the .py file from ###here### and insert the data in the header of the code to the different variables.

Two folders have to be creaded in the same directory as the python program is running, one named "files", in which checkpoints are saved, and one called "protein", which contains the PDB file.

There can be made several choices on which parts of the protein should be taken into account for circularization and with which sequences this should be done. The extein sequence or the sortase scar should be inserted at ScarsAtEnd variable. Also a weighting of certain amino acids can be defined in the beginning.

The software will store temporary data after reaching a processing checkpoint so that calculation can be resumed afterwards again. The code is split in different procedures, that can be run in IDEs like spyder separately. Each of these sections defines a piece of code,

One Afterwards the calculations can just be resumed at one of the various pickle.load checkpoints.

For most proteins, not too many paths need to be checked, which results to a runtime of so it runs in several hours on a 2.6GHz intel i5. The software might calculate several days for the linkers. For big proteins and long linkers, about 8 GB of RAM are needed to run completely. But for

In the end a single results file is saved on the disk, containing the sequnces of the linkers and the weightings of them.

Biggest problems we had

We encountered many bigger or smaller problems while programming. Some are quite serious issues and are mainly due to the brute force approach ansatz we made, but were mainly solved to an acceptable extent.

RAM usage

For example w The calculation of the distance from the connection in a path to the protein is calculated, encompasses always the distance of all of the atoms of the protein to this connection. This easily results in arrays of 100 000 000 * 3 * 6000 * 3 shape can occur, which is just too much for normal RAM sized Computer.

On the other hand, using numpy arrays, the bigger the processed arrays are , the faster the program is in total, because every functions only have to be loaded once . Therefore in each step the arrays are sliced in a way, that the following procedure can take place in the RAM of the computer, see functions part.

Array Size

Arrays with about 300 000 000 float 16 entries, like they occur for large proteins, while point generation are too large in size for the RAM. Thus we had to manually store them on the harddrive for processing. Of course also this slowed the procedure down. But fortunately we could use python's [http://www.h5py.org/ |h5py] package, which allowed us, just taking out certain lines from the array stored on the disk.

Runtime

The longest calculation took about 11days on a 2.6 GHz intel i5 CPU with 8GB RAM and an SSD harddrive. Therefore, results are frequently stored on the harddisk using python's cpickle package, which allows a fast storage of complete numpy objects. Thus calculations can be stopped and restarted after certain points. On the other hand, calculation time increases with the number of points for the connections and for the protein. Reducing the points of the protein to the atoms on the surface would be the next step to take, which could reduce calculation time to one sixth of the time now. Due to lack of time, we often had to make the tradeoff between fast programming and fast calculation. Having the huge resources of ###i@h### we clearly decided for fast programming and not optimizing the code yet for velocity.

Flexible ends

Long flexible (non helical) regions might cause problems until now are kind of an issue, but we have implemented several functions that should handle those. The problem is, that tThe number of possible conformations of two or three OF WHAT? exceedes our capabilities and could not be easily handeled in the building-block system we choose to implement. Therefore also helical regions are handeled as straight connections, but with varying length. After the flexible regions there are no angular restraints given to the attached helical block.

Path storage

The possible paths are always stored as the angle points of the paths under following variables: firstpointsflexible, secondpointsflexible, thirdpointsflexible, firstpointstriangle, secondpointstriangle, thirdpointstriangle, erstepunkte, zweitepunkte and drittepunkte. As each point in 3d has three coordinates, all of these variables are n*3 arrays. The first index identifies the path then. This creates the possibility it possible, that a path is just deleted, to erase a path by deleting the line of all the arrays. This way also the arrays can be easily sliced, making it possible to process parts of an array in a fast way.

Differences in proteins

Proteins can differ widely in their size and sometimes even in their definition in the PDB files. It was a big issue we encountered in setting up to set up a stable version running with all these different PDBs varying in size by the factor of 20.

Different definitions

Protein data

In the PDB file, all coordinates of non-hydrogen atoms are stored. These are then loaded into arrays x, y and z, just containing one coordinate of each a point. these are arrays of length n. Since this kind of array is complicated to handle, the information is restored in different point arrays of shape n*3.

  • PointsOfAllSubunits: These are all points from the PDB file, that should not be ignored. The user has to decide, which parts of the protein should not be taken into account.
  • pkte: These are all the points of part ???, that should be circularized, so this are all the atom-coordinates between N- and C-terminus
  • OtherPoints: These are all points of PointsOfAllSubunits that are not in pkte.


List of angles and rods

Rods: "AEAAAK", "AEAAAKA", "AEAAAKAA", "AEAAAKEAAAK", "AEAAAKEAAAKA", "AEAAAKEAAAKEAAAKA", "AEAAAKEAAAKEAAAKEAAAKA", "AEAAAKEAAAKEAAAKEAAAKEAAAKA"

Angles, alwys mean, then std and the pattern: [(29.7, 8.5, "NVL"), (38.7, 30., "KTA"), (60., 12., "AADGTL"), (74.5, 27., "VNLTA"), (117., 12., "AAAHPEA"), (140., 15., "ASLPAA"), (160., 5., "ATGDLA")]

General definitions

  • minabstand: the radius of an alpha-helix, also the minimal distance an atom needs from a connection
  • LengthOfAngle: The distance between the end of an angle pattern and the geometrical turning point.
  • LengthOfFlexibleAA: 3.5 Å
  • FlexAtStartSeq and FlexAtEndSeq: This is not only the regions missing from the PDB on both sides, but are also the flexible endsof the linker and the extein on the other side.


Functions

imported python modules

necessary

numpy: the basic module for numerical calculations in large scale h5py: Storing arrays on the harddisk, allowes slicing of the arrays on the disk. os: Used for reading and writing files sys: module used for exiting the program at certain points.


recommended

matplotlib.pyplot: can be used for what from mpl_toolkits.mplot3d, Axes3D: for what time: is nedd to see the progression of the calculations and observe the calculation times. fnmatch: allows wildcard search in strings, important for finding specific linker patterns cPickle: used for intermediate storing of the arrays, so that calculations coul be continued after restarting the program.


selfwritten functions

angle_between_connections_array(startingarray, middlearray, endingarray):

calculates the angles between the vectors from startingarray to middlearray and middlearray to endingarray. If there is no displacement between the arrays it returns zero as angle. startingarray and endingarray can be only one single point, middlearray should always be an array of points in 3d space.
returns values between [0,pi] in an array of size Startarray.

angle_between_vectors(vect1, vect2):

calculates the angle between two arrays of vectors. If one of the vectors is 0, the angle is set to 0. The result is based on arccos.
returns the angles between two vectors.


distance_from_connection(Startarray, Endarray, Points):

takes a connection from Startarray to Endarray and calculates the perpendicular distance of the points from the connection. Startarray or Endarray can also be single points.
returns an array of size (Startarray * points) with all perpendicular distances or distances of the endpoints.

punktebeigerade(minabstand, pkte, gerade, aufpunkt, laenge):

checks whether there are points too close to a straight line coming from aufpunkt with in direction of gerade with length laenge.
returns True if no point of pkte is closer to the straight line than minabstand

test_accessible_angles(winkelarray, length, anfangspunkt, proteinpoints, gerade=np.array([0, 0, 1])):

winkelarray is an array of angles that should be checked, whether they are accessible from anfangspunkt. Accessible means that no point of proteinpoints is too close to the straight line, which is produced by rotating gerade with the angles of winkelarray. Gerade always starts at anfangspunkt angles are measured from z-axis, if gerade is not defined else.
returns a boolean array with which winkelarray can be sliced.

reduce_angles_from_redundancies(winkelarray):

takes an array of angles in the format [phi, theta] and looks which angles produce the same result in the vector.
returns an array with all indices, that can be deleted along the 0 axis of winkelarray.

make_displacements(lengtharray, displacementarray):

generates all possible displacements from displacementarray (an array of vectors) and lengtharray (array of different lengths)
returns an array with displacementvectors in different lengths

sort_out_by_protein(startingarray, endingarray, proteinpoints, mindist, beforearray = None):

sorts out the connections between startingarray and endingarray with proteinpoints. A connection is sorted out, if one point of the proteinpoints is nearer to the connection, than mindist.
returns only the points for the connections, that are good. If beforearray is set, returns also beforearray

naechstepunkte(anfangsarray, verschiebungsarray):

generates for each point of anfangsarray, all points that are made by displacements of that point with verschiebungsarray.
returns two arrays of equal size, the enlarged anfangsarray and the array resulting from verschiebungsarray.

aussortierennachpunken(punktearray, proteinpunkte, minabstand, maxabstand):

sorts all the points of punktearray out, that are nearer than minabstand to one of the points from proteinpunkte, or farther away than maxabstand.
returns a boolean array, with which one can slice punktearray.

angle_weighing(anglearray, angletosequence=angletosequence):

weighting of the angles form anglearray. The better an angle fits to the angles provided by angletosequence, the lower the value is. The best angle gets a weighing of 1, the worst angle of 2.
returns a weighingarray for the angles of anglearray. Each weighing is in the range between 1 and 2. The weighing is based on gaussian distributions.

angle_function(StartingArray, MiddleArray, EndingArray):

makes a weighing of the connection from Startingarray, over Middlearray to Endingarray based on the weighing of the angles.
returns an angle weighting for each connection.

unpreferable_places(Start, End, ProteinPoints, AminoacidNumberArray, ToBeWeighedAAInput, WeighingofAA, substratelist):

Calculates a weighting for the connection from the points of Start to the points of End based on the distance from regions that should be omitted. These aminoacids should be defined in the ToBeWeighedAAInput array and the WeighingofAA array defines how important this region is. If one wants whole substrates to be omitted, they should be added in the substratelist. The total returned number is normalized, so the weighting of the regions is independent of the number of places, that should be omitted.
Parameters:
Start: The points where the rod starts,
End: The points where the rod ends
ProteinPoints: The points of the protein, that should be taken into account.
AminoacidNumberArray: The array, that tells, to which amino acid one atom belongs
ToBeWeighedAAInput: One output of make_weighing_arrays
substratelist: A list of tuples (amino acid nr, size of substrate). Amino acid nr, is the amino acid, where the substrate binds to.
returns the weighing of the connections, because of the regions, where the linker passes through.

distance_from_surface(beforearray, testarray, ProteinPoints, Afterpoint = None):

calculates the distances of the testarray points from the surface as just the minimum of the distances to all proteinpoints. It doesn't calculate the points that are equal to the points of the beforearray, so that these are not taken twice. Additionally, it checks that the points don't lie on the endpoint.
Returns the weighting of the distance by subtracting mindist, dividing it through mindist for making it unitless and then squaring, so that the values are better distributed.

weighing_function_rigids(StartPoint, FirstArray, SecondArray, ThirdArray, EndPoint, ProteinPoints,AminoacidNumberArray, ToBeWeighedAA, WeighingofAA=None, substratelist=None):

makes the weighting of rigid linkers, with angle, distance, length and regions distribution.
returns a list of 5 arrays: weighedvalue, normed lenghtweighing, Angleweighing, Siteinfluence and the distances

weighing_function_flex(StartPoint, FirstArray, SecondArray, ThirdArray, EndPoint, ProteinPoints, AminoacidNumberArray, ToBeWeighedAA, WeighingofAA = None, substratelist = None):

makes the weighting of flexible linkers, with angle, distance, length and regions distribution.
returns a list of 5 arrays: weighedvalue, normed lenghtweighing, Angleweighing, Siteinfluence and the distances

make_weighingarrays(Userstring):

Userstring is of the shape: 273,10 280-290,5 298,7,35.6 etc. (spaces separate entries, "," is for single residues "-" for anges, second "," for the diameter of the substrate)
If nothing should be weighted, insert ""
returns the information in arrayform (Shouldbeweighed and Weighingarray) and a substratelist

sort_out_by_angle (startingarray, middlearray, endingarray, angletosequence):

sorts out the paths from startingarray over middlearray to endingarray. A path is sorted out, when the angle it would need is too far away from the possible angles in angletosequence
returns a boolian array which paths to keep, middle and endingarray must have same dimension, if startingarray is only one point, it returns only middlearray and endingarray, else all three arrays are returned

make_small_generator(PointArray, repetition, RAM, tobesplitlength, ProteinArray = None):

calculates how often PointArray needs to be split so that the following calculations still fit into the RAM.
RAM in GByte
repetition means how often is the largest array repeated. Repetition must be manually found and adjusted as the real amount of repetitions is only a hint.
returns MakeSmall and teiler

make_small_generator_offset(listofarraysinRAM, PointArray, repetition, RAM, tobesplitlength, ProteinArray = None):

calculates how often PointArray needs to be split so that the following calculations still fit into the RAM.
In the listofarraysinRAM can be either just the arrays or the size of the arrays, same for PointArray
RAM in GByte,
repetition means how often is the largest array repeated. Repetition must be manually found and adjusted as the real amount of repetitions is only a hint.
returns MakeSmall and teiler

sort_out_by_distance(startingpoints, endingpoints, firstpoints, distance, variation):

generates all possible connections from startingpoints to endingpoints, that lie in one of the distances plus minus the variation.
returns three arrays with all possible paths, made out of all possible combinations startingpoints to endingpoints that are in a certain distance

sort_out_by_length (comefrompoints, gotopoints, linkerlaengen):

sorts out the connections between comefrompoints and gotopoints, when they don't fit to the linkerlengths from linkerlaengen.
Either comefrompoints or gotopoints can be only one point, but never both of them.
returns a boolean array, with which you can slice the points, True means the values are kept

length_to_sequence(lengtharray, linkerdatenbank, linkerlaengen):

translates the length from lengtharray to sequences according to the different linkerpieces in linkerdatenbank.
returns an array of the sequences that reproduce the length

angle_to_sequence(anglearray, angletosequence, angleseparators):

translates the angles from anglearray to sequences according to the different angletosequence data.
returns an array of the sequences that reproduce the angles

translate_paths_to_sequences(startpoint, firstflex, secondflex, thirdflex, firstrig, secondrig, thirdrig, endpoint, linkerdb, linkerlKO, angletosequence, angleseparators, weightflex, weightrig):

translates all paths to sequences according to the patterns provided in angleosequence and linkerdb
returns an array with sequences for each path