Team:Heidelberg/pages/Linker-Software Docu
From 2014.igem.org
General
During the iGEM competition we have written a software, that can predict the best linker to circularize a protein. Therefore at first connections between the ends are found, these are weighted on their goodness for the linker and then these paths are retranslated to biological sequences. The software is mainly made possible by python's numpy package for easily handling and processing huge amount of data. Numpy is one of the most used python packages in scientific computing, providing a powerful N-dimensional array object and fast C/C++ written functions to process them. Thus we were able to handle the amount of different linker paths (in the scale of 10^9 paths). Python as a high level programming language with it's various packages enabled us within the short time period of iGEM to write such a powerfull software. On the other hand, being an interpreter language, python's runtime of course is much higher. As python is able to integrate fast C-Code natively, the runtime of Numpy calculations is not that much higher than compared to classical precompiled C-code. This is achieved because all the entries in an array must be of the same type so that always arrays can be processed completely. On the other hand this consumes much more memory, because always the whole array has to be loaded in the RAM, which was one of the major problems for us. But the software should run on Computers with 1 GB of free RAM.
Usage
Two different versions can be downloaded. One is a bundeled version with reduced possibilites, but enhanced usability. The other is the pure python code, where in the beginning definitions need to be made in the code, giving the full usability at hand of the user. But if not needed the bundeled version is recommended. The bundeled version can be used only on linux PCs. Both versions require minimum 2GB of free RAM.
Bundeled version
Please download ###something### from here. You only need to add the protein structure file and an instructions file. For detailed information please see the README in the folder. The bundeled version is nicer accessible, but on the other hand doesn't provide full usability. Mainly it is just missing adjustability in this version. Some functions that can't be used here are: Weighting of different aminoacids: A function, with which aminoacids and regions can be defined that should be omitted by the linker, can't be used in this version. Checkpointing: The program has to run in total at once, it can't be stopped meanwhile the calculation, because no files are stored during computation. Calculations can take up to days also in this version. Size of linkers: As no checkpointing is enabled in this version, also the maximum angles are reduced to two angles in the linker. This shold suffice for most of the proteins, but if this doesn't produce results, please refer to the python version from the website. Circularizing only a part of the protein: Always the complete protein as included in the PDB file is circularized. In the python version the user can define the atoms from when on the protein should be circularized. Ignoring different parts of proteins: Also no additional parts of the PDB can be ignored. On the other hand all other subunits than the defined one are ignored completely. Attachmentsequences: The exchange of attachmentsequences (normally GG in the beginning of the linker) would need to be done by hand by just exchanging the GG with some other 2 aminoacid sequence.
The python file
For this version you need a running python 2.7 environment with different packages in the latest versions installed, please see below on the used packages. Please download the .py file from ###here### and insert the data in the beginning of the code to the different variables. In the same directory as the program is running there should be two folders, one called files for saving checkpoints and one called proteine, in which the PDB files should be found. There can be made several choices on which parts of the protein should be taken into account for circularization and with which sequences this should be done. The extein sequence or the sortase scar should be inserted at ScarsAtEnd variable. Also a weighting of certain aminoacids can be defined in the beginning. After each bigger calculation step the software will store the data on the harddrive so that calculation can be resumed afterwards again. The code is sliced in different sections, that can be run in IDEs like [| spyder ] separately. It is defined a part in the code, until that all the relevant functions and definitions have been loaded. Afterwards the calculations can just be resumed at one of the various pickle.load points. The software may run several days calculating the linkers. For big proteins and long linkers about 8 GB of RAM are needed to run completely. But for most proteins, not that many paths need to be checked so it runs in several hours on a 2.6GHz intel i5. In the end a single resultsfile is saved on the disk containing the sequnces of the linkers and the weightings of them.
Biggest problems we had
We encountered many bigger or smaller problems while programming. Some are quite serious issues and are mainly due to the brute-forcing ansatz we made, but were mainly solved to an acceptable extent.
RAM usage
For example when the distance from the connection in a path to the protein is calculated, always the distance of all of the atoms of the protein to this connection is calculated in one array. Thus easily arrays of 100 000 000 * 3 * 6000 * 3 shape can occur, which is just too much for normal RAMsize. On the other hand, using numpy arrays, the bigger the arrays, that are processed are, the faster the program is in total, because the functions don't have to be loaded again and again. Therefore in each step the arrays are sliced in a way, that the following procedure can take place in the RAM of the computer, see functions part.
Array Size
The next issue was, that with arrays with about 300 000 000 float 16 entries, like they occur for large proteins while point generation they could even not be kept in RAM in total. Thus we had to manually store them on the harddrive for processing. Of course also this slowed the procedure down. But fortunately we could use python's h5py package, which allowed us, just taking out certain lines from the array stored on the disk.
Runtime
The longest calculation took about 11days on a 2.6 GHz intel i5 CPU with 8GB RAM and an SSD harddrive of continuous calculation. Therefore results are frequently stored on the harddisk using python's cpickle package, which allowed a fast storage of complete numpy objects. Thus calculations can be stopped and restarted after certain points. On the other hand, calculation time increases with the number of points for the connections and for the protein. Reducing the points of the protein to the atoms on the surface would be the next step to take, which could reduce calculation time to one sixth of the time now. Due to lack of time, we often had to make the tradeoff between fast programming and fast calculation. Having the huge resources of ###i@h### we clearly decided for fast programming and not optimizing the code yet for velocity.
Flexible ends
Long flexible (non helical) regions until now are kind of an issue, but we have implemented several functions that should handle this. The problem is, that the number of possible conformations already for two or three exceedes our capabilities and could not be easily handeled in the building-block system we chose to implement. Therefore also helical regions are handeled as straight connections, but with varying length. After the flexible regions there are no angular restraints given to the attached helical block.
Path storage
The possible paths are always stored as the angle points of the paths under the variables: firstpointsflexible, secondpointsflexible, thirdpointsflexible, firstpointstriangle, secondpointstriangle, thirdpointstriangle, erstepunkte, zweitepunkte and drittepunkte. As each point in 3d has three coordinates, all of these variables are n*3 arrays. The first index identifies the path then. This makes it possible, that a path is just deleted, by deleting the line of all the arrays. This way also the arrays can be easily sliced, making it possible to process parts of an array in a fast way.
Differences in proteins
Proteins can differ widely in their size and sometimes even in their definition in the PDB files. These were large issues we encountered to set up a stable version running with all these different PDBs varying in size by the factor of 20.
Different definitions
Protein data
In the PDB file, all coordinates of non-hydrogen atoms are stored. These are then loaded into arrays x, y and z, just containing one coordinate of a point. these are arrays of length n. As they are not good to handle, the information is restored in different point arrays of shape n*3.
- PointsOfAllSubunits: These are all points from the PDB file, that should not be ignored. The user has to tell the program, which parts should not be taken into account.
- pkte: These are all the points of part, that should be circularized, so this are all the atom-coordinates between N- and C-terminus
- OtherPoints: These are all points of PointsOfAllSubunits that are not in pkte.
List of angles and rods
Rods: "AEAAAK", "AEAAAKA", "AEAAAKAA", "AEAAAKEAAAK", "AEAAAKEAAAKA", "AEAAAKEAAAKEAAAKA",
"AEAAAKEAAAKEAAAKEAAAKA", "AEAAAKEAAAKEAAAKEAAAKEAAAKA"
Angles, alwys mean, then std and the pattern: (29.7, 8.5, "NVL"), (38.7, 30., "KTA"), (60., 12., "AADGTL"), (74.5, 27., "VNLTA"), (117., 12., "AAAHPEA"), (140., 15., "ASLPAA"), (160., 5., "ATGDLA")
General definitions
- minabstand: the radius of an alpha-helix, also the minimal distance an atom needs from a connection
- LengthOfAngle: The distance between the end of an angle pattern and the geometrical turning point.
- LengthOfFlexibleAA: 3.5 Å
- FlexAtStartSeq and FlexAtEndSeq: This is not only the regions missing from the PDB on both sides, but is also the flexible ends of the linker and the extein on the other side.
Functions
imported python modules
necessary
numpy: the basic module for numerical calculations in large scale h5py: Storing arrays on the harddisk, allowes slicing of the arrays on the disk. os: Used for reading and writing files sys: module used for exiting the program at certain points.
recommended
matplotlib.pyplot: can be used from mpl_toolkits.mplot3d, Axes3D: time: so that one can see the progress of the calculations and observe the calculation times. fnmatch: allows wildcard search in strings, important for finding specific linker patterns cPickle: used for intermediate storing of the arrays, so that calculations could be continued after restarting the program.
selfwritten functions
angle_between_connections_array(startingarray, middlearray, endingarray):
- calculates the angles between the vectors from startingarray to middlearray and middlearray to endingarray. If there is no displacement between the arrays it returns zero as angle. startingarray and endingarray can be only one single point, middlearray should always be an array of points in 3d space.
- returns values between [0,pi] in an array of size Startarray.
angle_between_vectors(vect1, vect2):
- calculates the angle between two arrays of vectors. If one of the vectors is 0, the angle is set to 0. The result is based on arccos.
- returns the angles between two vectors.
distance_from_connection(Startarray, Endarray, Points):
- takes a connection from Startarray to Endarray and calculates the perpendicular distance of the points from the connection. Startarray or Endarray can also be single points.
- Returns an array of size (Startarray * points) with all perpendicular distances or distances of the endpoints.
punktebeigerade(minabstand, pkte, gerade, aufpunkt, laenge):
- checks whether there are points too close to a straight line coming from aufpunkt with in direction of gerade with length laenge.
- returns True if no point of pkte is closer to the straight line than
minabstand
test_accessible_angles(winkelarray, length, anfangspunkt, proteinpoints, gerade=np.array([0, 0, 1])):
- winkelarray is an array of angles that should be checked, whether they are accessible from anfangspunkt. Accessible means that no point of proteinpoints is too close to the straight line, which is produced by rotating gerade with the angles of winkelarray. Gerade always starts at anfangspunkt angles are measured from z-axis, if gerade is not defined else.
- returns a boolean array with which winkelarray can be sliced.
reduce_angles_from_redundancies(winkelarray):
- takes an array of angles in the format [phi, theta] and looks which angles produce the same result in the vector.
- Returns an array with all indices, that can be deleted along the 0 axis of winkelarray.
make_displacements(lengtharray, displacementarray):
- generates all possible displacements from displacementarray (an array of vectors) and lengtharray (array of different lengths)
- returns an array with displacementvectors in different lengths
sort_out_by_protein(startingarray, endingarray, proteinpoints, mindist, beforearray = None):
- sorts out the connections between startingarray and endingarray with proteinpoints. A connection is sorted out, if one point of the proteinpoints is nearer to the connection, than mindist.
- Returns only the points for the connections, that are good. If beforearray is set, returns also beforearray
naechstepunkte(anfangsarray, verschiebungsarray):
- generates for each point of anfangsarray, all points that are made by displacements of that point with verschiebungsarray.
- Returns two arrays of equal size, the enlarged anfangsarray and the array resulting from verschiebungsarray.
aussortierennachpunken(punktearray, proteinpunkte, minabstand, maxabstand):
- sorts all the points of punktearray out, that are nearer than minabstand to one of the points from proteinpunkte, or farther away than maxabstand.
- returns a boolean array, with which one can slice punktearray.
angle_weighing(anglearray, angletosequence=angletosequence):
- weighting of the angles form anglearray. The better an angle fits to the angles provided by angletosequence, the lower the value is. The best angle gets a weighing of 1, the worst angle of 2.
- Returns a weighingarray for the angles of anglearray. Each weighing is in the range between 1 and 2. The weighing is based on gaussian distributions.
angle_function(StartingArray, MiddleArray, EndingArray):
- makes a weighing of the connection from Startingarray, over Middlearray to Endingarray based on the weighing of the angles.
- Returns an angle weighting for each connection.
unpreferable_places(Start, End, ProteinPoints, AminoacidNumberArray, ToBeWeighedAAInput, WeighingofAA, substratelist):
- Calculates a weighting for the connection from the points of Start to the points of End based on the distance from regions that should be omitted. These aminoacids should be defined in the ToBeWeighedAAInput array and the WeighingofAA array defines how important this region is. If one wants whole substrates to be omitted, they should be added in the substratelist. The total returned number is normalized, so the weighting of the regions is independent of the number of places, that should be omitted.
::Parameters:
- Start: The points where the rod starts,
- End: The points where the rod ends
- ProteinPoints: The points of the protein, that should be taken into account.
- AminoacidNumberArray: The array, that tells, to which amino acid one atom belongs
- ToBeWeighedAAInput: One output of make_weighing_arrays
- substratelist: A list of tuples (amino acid nr, size of substrate). Amino acid nr, is the amino acid, where the substrate binds to.
::returns the weighing of the connections, because of the regions, where the linker passes through.
distance_from_surface(beforearray, testarray, ProteinPoints, Afterpoint = None):
- calculates the distances of the testarrays points from the surface as just the minimum of the distances to all proteinpoints. It doesn't calculate the points that are equal to the points of the beforearray, so that these are not taken double. And it checks that the points don't lie on the endpoint.
- Returns the weighting of the distance by subtracting mindist, dividing it through mindist for making it unitless and then squaring, so that the values are better distributed.
===weighing_function_rigids(StartPoint, FirstArray, SecondArray, ThirdArray, EndPoint, ProteinPoints,AminoacidNumberArray, ToBeWeighedAA, WeighingofAA=None, substratelist=None):
- makes the weighting of rigid linkers, with angle, distance, length and regions distribution.
- returns a list of 5 arrays: weighedvalue, normed lenghtweighing, Angleweighing, Siteinfluence and the distances
weighing_function_flex(StartPoint, FirstArray, SecondArray, ThirdArray, EndPoint, ProteinPoints, AminoacidNumberArray, ToBeWeighedAA, WeighingofAA = None, substratelist = None):
- makes the weighting of flexible linkers, with angle, distance, length and regions distribution.
- returns a list of 5 arrays: weighedvalue, normed lenghtweighing, Angleweighing, Siteinfluence and the distances
make_weighingarrays(Userstring):
- Userstring is of the shape: 273,10 280-290,5 298,7,35.6 etc. (spaces separate entries, "," is for single residues "-" for anges, second "," for the diameter of the substrate)
- If nothing should be weighted, insert ""
- returns the information in arrayform (Shouldbeweighed and Weighingarray) and a substratelist
sort_out_by_angle (startingarray, middlearray, endingarray, angletosequence):
- sorts out the paths from startingarray over middlearray to endingarray. A path is sorted out, when the angle it would need is too far away from the possible angles in angletosequence
- returns a boolian array which paths to keep, middle and endingarray must have same dimension
- If startingarray is only one point, it returns only middlearray and endingarray, else all three arrays are returned
make_small_generator(PointArray, repetition, RAM, tobesplitlength, ProteinArray = None):
- calculates how often PointArray needs to be split so that the following calculations still fit into the RAM.
- RAM in GByte,
- repetition means how often is the largest array repeated. Repetition must be manually found and adjusted as the real amount of repetitions is only a hint.
- returns MakeSmall and teiler
make_small_generator_offset(listofarraysinRAM, PointArray, repetition, RAM, tobesplitlength, ProteinArray = None):
- calculates how often PointArray needs to be split so that the following calculations still fit into the RAM.
- In the listofarraysinRAM can be either just the arrays or the size of the arrays, same for PointArray
- RAM in GByte,
- repetition means how often is the largest array repeated. Repetition must be manually found and adjusted as the real amount of repetitions is only a hint.
- returns MakeSmall and teiler
sort_out_by_distance(startingpoints, endingpoints, firstpoints, distance, variation):
- generates all possible connections from startingpoints to endingpoints, that lie in one of the distances plus minus the variation.
- returns three arrays with all possible paths, made out of all possible combinations startingpoints to endingpoints that are in a certain distance
sort_out_by_length (comefrompoints, gotopoints, linkerlaengen):
- sorts out the connections between comefrompoints and gotopoints, when they don't fit to the linkerlengths from linkerlaengen.
- Either comefrompoints or gotopoints can be only one point, but never both of them can.
- returns a boolean array, with which you can slice the points, True means the values are kept
length_to_sequence(lengtharray, linkerdatenbank, linkerlaengen):
- translates the lengthes from lengtharray to sequences according to the different linkerpieces in linkerdatenbank.
- returns an array of the sequences that reproduce the length
angle_to_sequence(anglearray, angletosequence, angleseparators):
- translates the angles from anglearray to sequences according to the different angletosequence data.
- returns an array of the sequences that reproduce the angles
translate_paths_to_sequences(startpoint, firstflex, secondflex, thirdflex, firstrig, secondrig, thirdrig, endpoint, linkerdb, linkerlKO, angletosequence, angleseparators, weightflex, weightrig):
- translates all paths to sequences according to the patterns provided in angleosequence and linkerdb
- returns an array with sequences for each path