Team:Heidelberg/pages/Linker Modeling

From 2014.igem.org

Revision as of 10:11, 12 October 2014 by Igemnils (Talk | contribs)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)

Contents

General

During the iGEM competetion we have written a software, that can predict the best linker to circularize a protein. Therefore at first connections between the ends are found, these are weighted on their goodness for the linker and then these paths are retranslated to biological sequences. The software is mainly made possible by python's numpy package for easily handling and processing huge amount of data. Numpy is one of the most used python packages in scientific computing, providing a powerful N-dimensional array object and fast C/C++ written functions to process them. Thus we were able to handle the amount of different linker paths (in the scale of 10^9 paths). Python as a high level programming language with it's various packages enabled us within the short time period of iGEM to write such a powerfull software. On the other hand, being an interpreter language, python's runtime of course is much higher. As python is able to integrate fast C-Code natively, the runtime of Numpy calculations is not that much higher than compared to classical precompiled C-code. This is achieved because all the entries in an array must be of the same type so that always arrays can be processed completely. On the other hand this consumes much more memory, because always the whole array has to be loaded in the RAM, which was one of the major problems for us. But the software should run on Computers with 1 GB of free RAM.

Path storage

The possible paths are always stored as the angle points of the paths under the variables: firstpointsflexible, secondpointsflexible, thirdpointsflexible, firstpointstriangle, secondpointstriangle, thirdpointstriangle, erstepunkte, zweitepunkte and drittepunkte. As each point in 3d has three coordinates, all of these variables are n*3 arrays. The first index identifies the path then. This makes it possible, that a path is just deleted, by deleting the line of all the arrays. This way also the arrays can be easily sliced, making it possible to process parts of an array in a fast way.

Protein data

In the PDB file, all coordinates of non-hydrogen atoms are stored. These are then loaded into arrays x, y and z, just containing one coordinate of a point. these are arrays of length n. As they are not good to handle, the information is restored in different point arrays of shape n*3.

  • PointsOfAllSubunits: These are all points from the PDB file, that should not be ignored. The user has to tell the program, which parts should not be taken into account.
  • pkte: These are all the points of part, that should be circularized, so this are all the atom-coordinates between N- and C-terminus
  • OtherPoints: These are all points of PointsOfAllSubunits that are not in pkte.

Unpreferable places

List of angles and rods

General definitions

  • minabstand: the radius of an alpha-helix, also the minimal distance an atom needs from a connection
  • LengthOfAngle:
  • LengthOfFlexibleAA:
  • Flexatstartseq and Flexatendseq:

Biggest problems

We encountered many bigger or smaller problems while programming. Some are quite serious issues and are mainly due to the brute-forcing ansatz we made, but were mainly solved to an acceptable extent.

RAM usage

For example when the distance from the connection in a path to the protein is calculated, always the distance of all of the atoms of the protein to this connection is calculated in one array. Thus easily arrays of 100 000 000 * 3 * 6000 * 3 shape can occur, which is just too much for normal RAMsize. On the other hand, using numpy arrays, the bigger the arrays, that are processed are, the faster the program is in total, because the functions don't have to be loaded again and again. Therefore in each step the arrays are sliced in a way, that the following procedure can take place in the RAM of the computer, see functions part.

Array Size

The next issue was, that with arrays with about 300 000 000 float 16 entries, like they occur for large proteins while point generation they could even not be kept in RAM in total. Thus we had to manually store them on the harddrive for processing. Of course also this slowed the procedure down. But fortunately we could use python's h5py package, which allowed us, just taking out certain lines from the array stored on the disk.

Runtime

The longest calculation took about 11days on a 2.6 GHz intel i5 CPU with 8GB RAM and an SSD harddrive of continuous calculation. Therefore results are frequently stored on the harddisk using python's cpickle package, which allowed a fast storage of complete numpy objects. Thus calculations can be stopped and restarted after certain points. On the other hand, calculation time increases with the number of points for the connections and for the protein. Reducing the points of the protein to the atoms on the surface would be the next step to take, which could reduce calculation time to one sixth of the time now. Due to lack of time, we often had to make the tradeoff between fast programming and fast calculation. Having the huge resources of ###i@h### we clearly decided for fast programming and not optimizing the code yet for velocity.

Flexible ends

Long flexible (non helical) regions until now are kind of an issue, but we have implemented several functions that should handle this. The problem is, that the number of possible conformations already for two or three exceedes our capabilities and could not be easily handeled in the building-block system we chose to implement. Therefore also helical regions are handeled as straight connections, but with varying length. After the flexible regions there are no angular restraints given to the attached helical block.

Functions

imported python modules

necessary

numpy: the basic module for numerical calculations in large scale h5py: Storing arrays on the harddisk, allowes slicing of the arrays on the disk. os: Used for reading and writing files sys: module used for exiting the program at certain points.

recommended

matplotlib.pyplot: can be used from mpl_toolkits.mplot3d, Axes3D: time: so that one can see the progress of the calculations and observe the calculation times. fnmatch: allows wildcard search in strings, important for finding specific linker patterns cPickle: used for intermediate storing of the arrays, so that calculations coul be continued after restarting the program.

selfwritten functions

angle_between_connections_array(startingarray, middlearray, endingarray):

calculates the angles between the vectors from startingarray to middlearray and middlearray to endingarray. If there is no displacement between the arrays it returns zero as angle. startingarray and endingarray can be only one single point, middlearray should always be an array of points in 3d space.
returns values between [0,pi] in an array of size Startarray.

---

angle_between_vectors(vect1, vect2):

calculates the angle between two arrays of vectors. If one of the vectors is 0, the angle is set to 0. The result is based on arccos.
returns the angles between two vectors.

---

distance_from_connection(Startarray, Endarray, Points):

takes a connection from Startarray to Endarray and calculates the perpendicular distance of the points from the connection. Startarray or Endarray can also be single points.
Returns an array of size (Startarray * points) with all perpendicular distances or distances of the endpoints.

---

punktebeigerade(minabstand, pkte, gerade, aufpunkt, laenge):

checks whether there are points too close to a straight line coming from aufpunkt with in direction of gerade with length laenge.
returns True if no point of pkte is closer to the straight line than minabstand

---

test_accessible_angles(winkelarray, length, anfangspunkt, proteinpoints, gerade=np.array([0, 0, 1])):

winkelarray is an array of angles that should be checked, whether they are accessible from anfangspunkt. Accessible means that no point of proteinpoints is too close to the straight line, which is produced by rotating gerade with the angles of winkelarray. Gerade always starts at anfangspunkt angles are measured from z-axis, if gerade is not defined else.
returns a boolean array with which winkelarray can be sliced.

---

reduce_angles_from_redundancies(winkelarray):

takes an array of angles in the format [phi, theta] and looks which angles produce the same result in the vector.
Returns an array with all indices, that can be deleted along the 0 axis of winkelarray.

---

make_displacements(lengtharray, displacementarray):

generates all possible displacements from displacementarray (an array of vectors) and lengtharray (array of different lengths)
returns an array with displacementvectors in different lengths

---

sort_out_by_protein(startingarray, endingarray, proteinpoints, mindist, beforearray = None):

sorts out the connections between startingarray and endingarray with proteinpoints. A connection is sorted out, if one point of the proteinpoints is nearer to the connection, than mindist.
Returns only the points for the connections, that are good. If beforearray is set, returns also beforearray

---

naechstepunkte(anfangsarray, verschiebungsarray):

generates for each point of anfangsarray, all points that are made by displacements of that point with verschiebungsarray.
Returns two arrays of equal size, the enlarged anfangsarray and the array resulting from verschiebungsarray.

---

aussortierennachpunken(punktearray, proteinpunkte, minabstand, maxabstand):

sorts all the points of punktearray out, that are nearer than minabstand to one of the points from proteinpunkte, or farther away than maxabstand.
returns a boolean array, with which one can slice punktearray.

---

angle_weighing(anglearray, angletosequence=angletosequence):

weighting of the angles form anglearray. The better an angle fits to the angles provided by angletosequence, the lower the value is. The best angle gets a weighing of 1, the worst angle of 2.
Returns a weighingarray for the angles of anglearray. Each weighing is in the range between 1 and 2. The weighing is based on gaussian distributions.

---

angle_function(StartingArray, MiddleArray, EndingArray):

makes a weighing of the connection from Startingarray, over Middlearray to Endingarray based on the weighing of the angles.
Returns an angle weighting for each connection.

---

unpreferable_places(Start, End, ProteinPoints, AminoacidNumberArray, ToBeWeighedAAInput, WeighingofAA, substratelist):

Calculates a weighting for the connection from the points of Start to the points of End based on the distance from regions that should be omitted. These aminoacids should be defined in the ToBeWeighedAAInput array and the WeighingofAA array defines how important this region is. If one wants whole substrates to be omitted, they should be added in the substratelist. The total returned number is normalized, so the weighting of the regions is independent of the number of places, that should be omitted.
::Parameters:
Start: The points where the rod starts,
End: The points where the rod ends
ProteinPoints: The points of the protein, that should be taken into account.
AminoacidNumberArray: The array, that tells, to which amino acid one atom belongs
ToBeWeighedAAInput: One output of make_weighing_arrays
substratelist: A list of tuples (amino acid nr, size of substrate). Amino acid nr, is the amino acid, where the substrate binds to.
::returns the weighing of the connections, because of the regions, where the linker passes through.

---

distance_from_surface(beforearray, testarray, ProteinPoints, Afterpoint = None):

calculates the distances of the testarrays points from the surface as just the minimum of the distances to all proteinpoints. It doesn't calculate the points that are equal to the points of the beforearray, so that these are not taken double. And it checks that the points don't lie on the endpoint.
Returns the weighting of the distance by subtracting mindist, dividing it through mindist for making it unitless and then squaring, so that the values are better distributed.

--- ===weighing_function_rigids(StartPoint, FirstArray, SecondArray, ThirdArray, EndPoint, ProteinPoints,AminoacidNumberArray, ToBeWeighedAA, WeighingofAA=None, substratelist=None):

makes the weighting of rigid linkers, with angle, distance, length and regions distribution.
returns a list of 5 arrays: weighedvalue, normed lenghtweighing, Angleweighing, Siteinfluence and the distances

---

weighing_function_flex(StartPoint, FirstArray, SecondArray, ThirdArray, EndPoint, ProteinPoints, AminoacidNumberArray, ToBeWeighedAA, WeighingofAA = None, substratelist = None):

makes the weighting of flexible linkers, with angle, distance, length and regions distribution.
returns a list of 5 arrays: weighedvalue, normed lenghtweighing, Angleweighing, Siteinfluence and the distances

---

make_weighingarrays(Userstring):

Userstring is of the shape: 273,10 280-290,5 298,7,35.6 etc. (spaces separate entries, "," is for single residues "-" for anges, second "," for the diameter of the substrate)
If nothing should be weighted, insert ""
returns the information in arrayform (Shouldbeweighed and Weighingarray) and a substratelist

---

sort_out_by_angle (startingarray, middlearray, endingarray, angletosequence):

sorts out the paths from startingarray over middlearray to endingarray. A path is sorted out, when the angle it would need is too far away from the possible angles in angletosequence
returns a boolian array which paths to keep, middle and endingarray must have same dimension
If startingarray is only one point, it returns only middlearray and endingarray, else all three arrays are returned

---

make_small_generator(PointArray, repetition, RAM, tobesplitlength, ProteinArray = None):

calculates how often PointArray needs to be split so that the following calculations still fit into the RAM.
RAM in GByte,
repetition means how often is the largest array repeated. Repetition must be manually found and adjusted as the real amount of repetitions is only a hint.
returns MakeSmall and teiler

---

make_small_generator_offset(listofarraysinRAM, PointArray, repetition, RAM, tobesplitlength, ProteinArray = None):

calculates how often PointArray needs to be split so that the following calculations still fit into the RAM.
In the listofarraysinRAM can be either just the arrays or the size of the arrays, same for PointArray
RAM in GByte,
repetition means how often is the largest array repeated. Repetition must be manually found and adjusted as the real amount of repetitions is only a hint.
returns MakeSmall and teiler

---

sort_out_by_distance(startingpoints, endingpoints, firstpoints, distance, variation):

generates all possible connections from startingpoints to endingpoints, that lie in one of the distances plus minus the variation.
returns three arrays with all possible paths, made out of all possible combinations startingpoints to endingpoints that are in a certain distance

---

sort_out_by_length (comefrompoints, gotopoints, linkerlaengen):

sorts out the connections between comefrompoints and gotopoints, when they don't fit to the linkerlengths from linkerlaengen.
Either comefrompoints or gotopoints can be only one point, but never both of them can.
returns a boolean array, with which you can slice the points, True means the values are kept

---

length_to_sequence(lengtharray, linkerdatenbank, linkerlaengen):

translates the lengthes from lengtharray to sequences according to the different linkerpieces in linkerdatenbank.
returns an array of the sequences that reproduce the length

---

angle_to_sequence(anglearray, angletosequence, angleseparators):

translates the angles from anglearray to sequences according to the different angletosequence data.
returns an array of the sequences that reproduce the angles

---

translate_paths_to_sequences(startpoint, firstflex, secondflex, thirdflex, firstrig, secondrig, thirdrig, endpoint, linkerdb, linkerlKO, angletosequence, angleseparators, weightflex, weightrig):

translates all paths to sequences according to the patterns provided in angleosequence and linkerdb
returns an array with sequences for each path

---