Modeling and Simulation

"All models are wrong, but some are useful." When we decided to use TAL effectors building CROWN, our project, there were three main challenges concerning the efficiency of this system. First, allowing some DNA mutations, can the CROWN be as efficient as before? Second, given that CROWN can be successfully distributed on certain area of single cell, can it make sense? Third, how to design the sequence of Golden Gate? The following three parts focus on the three questions.

Part I Single Cell

Our project is about the system involving various enzymes, mostly the series enzymes, combining into certain area. This area can be more efficient when it comes to synthesizing or degrading chemicals. So the first question is, whether this system can be so useful when distributing multiple similar areas in a single cell.

Four Types of Distribution

Type 1: Membrane & Random The position of enzyme is distributed randomly on the cell membrane.

Type 2: Membrane & Polymerization Certain enzymes are polymerized on the cell membrane.

Type 3: Matrix & Random The position of enzyme is distributed randomly inside the cell.

Type 4: Matrix & Polymerization The polymerization of certain enzymes is distributed randomly inside the cell.

Hypothesis of Simulation

1. Metabolism

Figure2.2.1 The process of the metabolism: s0, s1, s2, s3 are the substrates and E0, E1,E2 are the enzymes

Enzymes: E0, E1,E2


Figure2.2.2 the simulation of the CROWN

2. Initial Distribution of Substrates

All substrates are randomly distributed OUTSIDE the cell in all four simulations.

3. Movement of Substrates

The motion of molecules is random, including the rate and orientation.

4. Catalytic reaction

The time period of reaction is neglected. When the type of chemical match the type of enzyme, distance is less than threshold, then the enzyme reaction is recognized and recorded.

5. Other Hypothesis

Other physical and chemical parameters are under the scaling rule. The whole modeling combined with periodic boundary condition(PBC) to show the real performance of substrates and enzyme system.


All Results

Click to watch the video
Figure2.2.3 All the results of the four types.

Type 1

Click to watch the video Youtube Youku

Figure2.2.4 The extent of reaction of type 1.

Type 2

Click to watch the video Youtube Youku

Figure2.2.5 The extent of reaction of type 2.

Type 3

Click to watch the video Youtube Youku

Figure2.2.6 The extent of reaction of type 3

Type 4

Click to watch the video Youtube Youku

Figure2.2.7 The extent of reaction of type 4

Part II Docking

Why do we need Docking?

Biobrick designers and users want to understand the characteristics of a particular biobrick, especially the performance and accuracy. Because they need to answer a question, that is, were there to be a certain mutation, whether a huge change would happen to the protein function. We hope to introduce evaluation methods of bioinformatics, to evaluate binding of protein and DNA.


TAL (transcription activator-like) effectors, secreted by phytopathogenic bacteria, recognize host DNA sequences through a central domain of tandem repeats. Each repeat consists of 33 to 35 conserved amino acids and targets a specific base pair by using two hypervariable residues [known as repeat variable diresidues (RVDs)] at positions 12 and 13.



We designed fifteen sequences derived from raw sequence. These mutated sequences contain different mutations, ranging from one to five. Through a series of calculations, we obtained scores to represent the binding of TAL effectors and DNA.

  • The highlighted Letters represent the mutation site.
  • The white DNA sequences on the graph is the originated position and orange one represents the possible binding DNA.
  • The higher Docking scores, the better Docking

    • mutation-1
    • Score:1164.128

    • mutation-2
    • Score:1170.910

    • mutation-3
    • Score:1153.537

    • mutation-4
    • Score:1377.231

    • mutation-5
    • Score:1169.283

    • mutation-6
    • Score:1179.122

    • mutation-7
    • Score:1482.902

    • mutation-8
    • Score:1161.824

    • mutation-9
    • Score:1482.897

    • mutation-10
    • Score:1174.229

    • mutation-11
    • Score:1237.449

    • mutation-12
    • Score:1482.896

    • mutation-13
    • Score:1483.352

    • mutation-14
    • Score:1482.048

    • mutation-15
    • Score:1164.128



    Scatter Diagram

    Figure 2.2.8 The correlation between the number of mutation sites and the docking scores.The higher docking scores indicates the better combination of TAL and target sequence.

    From the docking scores, we can see that in the event of single nucleotide mutation, binding of TAL effectors and DNA differs greatly from normal. However, when there are more than two mutation sites, the difference becomes less drastic.

    From the PDB document, we can find that mutation at certain sites may lead to huge conformational distortions of TAL-DNA complex. With as many as five mutations, the binding site changes greatly.

    In conclusion, we strongly recommend that TAL designers and users ensure the accuracy of DNA binding sequence. If not, the specificity of binding site will not be guaranteed.

    Part Ⅲ K-clique Problem

    Problem Identification

    Before designing the TAL sequence, what we need to do first is to design a set of sequences, which are consist of seven fragments containing four nucleotides. In order to accomplish this, we offer a new sequence alignment algorithm for these fragmental sequences comparison. By using Loose Algorithm and Strict Algorithm, two fragmental sequences can be scored, and when the score is less than or equal to a specific value (eg 2), we can accept that they are a valid combination.

    As we can see, the sequence database contains a total of 4 * 4 * 4 * 4 = 256 kinds of fragmental sequences. Now, what we should do is to find out some seven sequences whose each two fragments can be a valid combination, based on a large (256*256) score table.

    Assumption and Model Formulation

    We try to solve this problem by making full use of graph theory. In our model, each fragmental sequence can be treated as a node, and two “node” is connected, if they are a valid combination (the score less than or equal to 2). After doing that, we just need to find such seven nodes which are consist of a complete graph from a graph with 256 nodes. (Note: not every two nodes in original graph is connected)

    Brief introduction to graph theory

    For an undirected graph G = (V, E),if U⊆V, and for any u, v ⊆ U, (u, v) ⊆ E, U is called complete subgraph of G.

    A clique in an undirected graph G = (V, E) is a subset of the vertex set C ⊆ V, such that for every two vertices in C, there exists an edge connecting the two. A maximal clique is a clique that cannot be extended by including one more adjacent vertex, that is, a clique which does not exist exclusively within the vertex set of a larger clique.

    U is the maximal complete subgraph of G, if and only if U is a clique of U and U is not contained in a larger subgraph.

    The k-clique problem is to find the complete graph with k nodes in a specific graph. What’s more, k-clique algorithm is defined in the paper "Uncovering the overlapping community structure of complex networks in nature and society" - G. Palla, I. Derényi, I. Farkas, and T. Vicsek - Nature 435, 814–818 (2005).

    Although a deterministic algorithm for this problem with an O(n*2^n) algorithm time complexity, fortunately, in our experiment, the problem we should solve is just with 256 nodes, so a non-deterministic algorithm can be applied.


    Early in 2005, a scientist have tried to use some kinds of Backtracking Algorithm to solve this problem, which have published in Science. Based on this excellent work, we have gone further to offer two more efficient algorithms.

    Algorithm 1

    In the solution space tree containing all available solutions, we search the solution space tree from the root according to the depth-first strategy. When reaching a node, we always firstly determine whether the node include any solutions or not. If not, we don't need to search the subtree whose root is such node, and then backtrack the ancestor nodes step-by-step; otherwise, we should continue our depth-first search.

    Algorithm 2

    1.Sort the degree of each node.

    2.In the current data set, from the first degree to the last, we remain the nodes which is relative to such degree, and delete the unconnected one.

    3.In the set, containing the nodes connected to the previous one, for each node, determine whether it is connected to others, and then sort the nodes based on their degree.

    4.Divide the problem into some much smaller size problems, repeat above method.


    Based on our efficient algorithm, we have found a possible solution:


    Clique problem play an important role in graph theory as well as is quite complex. However, we have provided some valid combinations for the TAL users here.


    1. Pierce, Brian G., Yuichiro Hourai, and Zhiping Weng. "Accelerating protein docking in ZDOCK using an advanced 3D convolution library." PloS one 6.9 (2011): e24657.
    2. Mintseris, Julian, et al. "Integrating statistical pair potentials into protein complex prediction." Proteins: Structure, Function, and Bioinformatics 69.3 (2007): 511-520.
    3. G. Palla, I. Derényi, I. Farkas, and T. Vicsek. "Uncovering the overlapping community structure of complex networks in nature and society" Nature 435, 814–818 (2005)