Team:SYSU-China/file/Project/Model/BLOSUM62.html
From 2014.igem.org
Template:Team:SYSU-China/Home/Mainhead
Contents
|
Stochastic Sequence Evolution Model
Introduction
In our experiment, the replication of DNA is error prone. If the sequence encoding antibody is mutated, the structure of antibody may change and so does its affinity with the antigen. Inspired by this, we try to model for the evolution of a specific antibody.
First, a basic model is created for simulating the evolution of antibody’s amino acid (AA) sequence. By defining the similarity of evolved sequences, mutated sequences are selected according to their similarity judged by local alignment of AA sequences. Several parameters, such as library size, length of evolved sequence, are scanned to investigate how they influence the evolution process. Though crude, this model is sufficient to reveal essential property of molecular evolution, such as the emergence of plateau in evolution process and the advantage of gene diversity etc.
Then BLOSUM62 is introduced as a more discriminative score criterion. More non-obvious properties are unraveled by visualization evolution process of AA sequences, such as the suboptimal evolution end due to extreme selection pressure and the limitation of single original evolved sequence.
From protein evolution perspective, this model answers the question of” How will sequence evolve if the system we design works well”. This model helps us to gain deeper understanding of our design and provides us with some advices for experiment.
Symbols and Definition
<a class="fancybox" rel="group" href=""><img src="" alt="" /></a>
Model Analysis and Assumptions
<p> Our experiment aims to select the antibody with higher affinity, which can be represented by its AA sequence. However, what mutation directly effect is its DNA sequence. Thus, a basic model is build to simulate the effect of mutation and the evolution of DNA sequence.
Firstly, we need to obtain an initial DNA sequence for evolution. Our design aims to accelerate the evolution of many kinds of antibody, rather than a specific one. Considering the extremely diversity of antibody, modeling for a single specific DNA sequence may cause the loss of generality. Thus, the initial DNA sequence is generated randomly. Denote the length of DNA sequence as loD, we randomly choice loD elements from the set {A,T,C,G}, join them together as the initial DNA sequence.
Secondly, we need to simulate the mutation. Assuming libS(const. ) mutated sequences are generated in each round of experiment (denoted as one generation) and all of them with nM bases mutated. For simplicity, assuming the mutation probability between each kind of base are the same. The mutated sequence can be obtained by randomly choice nM bases in loD and randomly choice another 3 basis to replace the original one.
Thirdly, the mutated sequence should be evaluated. As a result, an evaluation reference and an evaluation criterion are necessary. For reference, we assume the existence of one ideal amino acid sequence that has the highest affinity with the antigen used in our experiment; denote it as “target sequence” and its alignment with the antigen as the “best alignment”. In practice, for the same reason, target sequence is generated randomly with length loD. By comparing the similarity between the corresponding protein of the target sequence and the mutated sequences, the affinity between mutated sequences and the antigen can be estimated.
In reality, it’s proteins rather than peptide that interact with each other. However, it’s impossible for us to evaluate this antibody-antigen interaction accurately for arbitrary sequence. Thus, we have to assume that the interaction between the protein of the mutated sequence and the antigen can to some extent be estimated by comparing the similarity between their AA sequences. Before comparison, the sequences are translated into their corresponding amino acid sequences. By local “Dot-to-Dot” alignment, the amount of sites (N_S) shares the same kind of AA can be obtain and the similarity (S) is defined as : <math>S=\frac{N_s}{loD}</math>
<a class="fancybox" rel="group" href=""><img src="" alt="" /></a>
In this model, phages are characterized by their antibody, and thus their similarity, S.
Finally, mutated sequences should be selected to enter the next round of experiment. In reality, all the mutated sequences have certain possibility to get into the next round. However, simulation that follows the reality results in enormously large load for computation, which becomes impossible to continue after a few generation. To solve this problem, we assume that: among the mutated sequences and the non-mutated sequence (denoted as parent sequence), only the one with highest similarity S can become the parent sequence of next generation.
Results of Basic model
Evolution Process of Single Sequence in One Experiment
The size of antibody’s DNA is rather large. However, in order to reduce the load of computation while without losing generality, only simulate a small piece of DNA is considered and the parameters are set as follows:
<a class="fancybox" rel="group" href=""><img src="" alt="" /></a>
By recording the similarity of each generation, the evolution process is shown in Fig. 1.
<a class="fancybox" rel="group" href=""><img src="" alt="" /></a>
Fig. 1 Evolution process of single sequence in one simulation
As excepted, the similarity S increase with generation G, and reach 1 after about 300 round of experiment. Besides, the increment of each generation decreases with the similarity goes up. Several platform periods are shown, suggesting the phages take several generations to wait for a “right mutation”, which increase the similarity.
Parameter Scan of Library Size
libS can be interpreted as the effective amount of mutated phages. Considering the mutated progenies are generated with the same mutation rate (P_M), the number of mutated progenies follows the Binomial Distribution and thus libS can be estimated as the exception of the Binomial Distribution, which is proportionate to the population of phages (pP):
<a class="fancybox" rel="group" href=""><img src="" alt="" /></a>
So how the evolution process is influenced by the population of phages can be answered by the parameter scan of libS, with other parameters hold as follow:
<a class="fancybox" rel="group" href=""><img src="" alt="" /></a>
For the reason of stability, each process is the average of 20 times simulation with the same parameters.
<a class="fancybox" rel="group" href=""><img src="" alt="" /></a> <p1> Fig. 2 Parameter Scan of Library Size </p1>
As libS can be interpreted as the population of phages, Fig. 2 suggests that large population speeds up evolution. Its root can be traced back to the assumptions of this model as well as probability. Since the mutation rate P_M is independent of pP, larger population is more likely get progeny with “right mutation”. Meanwhile, note the tiny difference between the result of libS=100 and libS=250, which derives from the limitation of loD. If the population is large enough to traverse all the possible sequences, its further increase becomes dispensable. As a result, small protein can be evolved in relative small population while large protein had better evolved in large population.
Parameter Scan of loD
As to answer how the length of evolved sequence influences the evolution process, parameter loD is scanned with others hold as follows: <a class="fancybox" rel="group" href=""><img src="" alt="" /></a>
<a class="fancybox" rel="group" href=""><img src="" alt="" /></a> </p>
<p1>Fig. 3 Parameter Scan of the length of DNA</p1>
As Fig. 3 shows, processes with different loD share the same tendency and finally become exactly the same with the target sequence. But it slows down with the length of DNA sequence increases and more time is required for the evolution of larger protein.
Model Modification
Though basic model is based on some strict and even unrealistic assumptions, it helps us to gain some non-obvious properties of our design, such as the appearance of platform period and the benefit of large population. Here, some modifications to the basic model are applied and more properties of our design will be investigated.
1) BLOSUM62 Matrix is introduced as new criterion of similarity comparison.
In reality, different amino acids have distinct physical and chemical properties, which will obviously result in different affinity. The method previously used for estimating similarity is oversimplified. To simulate the evolution more precisely, BLOSUM62 Matrix is introduced as a modification to the defect mentioned above. The value for each pair of AA given by BLOSUM62 reveals the likelihood of their substitution in nature protein evolution, which to some extent reflect the similarity between them. In our model, the score for each site is obtained by looking-up BLOSUM62 for the corresponding AA in evolved sequence and target sequence. Summing up the score for each sites of evolved sequences (N_e) and the full score of target sequence(N_t), the similarity is defined as: S=N_e/N_t 3
<a class="fancybox" rel="group" href=""><img src="" alt="" /></a> </p>
2) Termination Codon is taken into consideration by adding the score for termination codon in BLOSUM62.
According to simulation, the mRNA of mutated DNA sequences sometimes contains termination codon. The occurrence of termination codon may result in various consequences, most of which are always fatal. Thus, the scores for termination codon with respect to all amino acid is supposed to be -100, which is large enough to reveal the fatal effect of termination codon and eliminate these mutated sequence from the selection pool. Other words, among all kinds of mutation, only base substitution is taken into consideration in our model.
Results of Modified Model
The modified model gives a similar process with the basic model, as shown in Fig. 4.
<a class="fancybox" rel="group" href=""><img src="" alt="" /></a> <p1> Fig. 4 The evolution process given by modified model. Each gray line stands for one evolution while the blue line is the average of 20 evolution processes with same parameters. </p1>
Apart from similar tendency, one important difference can be observed in Fig. 4: the final similarity of the evolution. In basic model, all the sequence will finally be evolved to a sequence exactly the same with the target sequence. In contract, modified model suggests that the evolution will terminate with final similarity S≈0.8, which is so surprising that further investigation is necessary.
Evolution Process Visualization
Previously, the information extracted from the simulation is merely the change of similarity with respect to generation. To get more information, the parent sequences of each generation and the score for each AA on them are recorded and the evolution process is visualized. As a result, more important properties of AA evolution are unraveled:
[gif]
<p1> Fig. 5 Amino acid sequences evolution </p1>
According to Fig. 5, new properties can be observed: Some AA sites are able to be mutated to target AA directly while most of them have to be mutated into “suboptimum” AA (lighter color) first and the optimum one (white) latter. Obviously, the introduction of BLOSUM62 increases the discrimination of different kinds of AA in our model and makes the evolution process more realistic. Besides, starting from the same original AA sequence, 10 sequences go through different evolution path, which is also observed by some related experiments.[1]
If we focus on the final similarity, the reason for non-ideal final similarity emerges:
1) One original AA sequence may go through different evolution process and evolved into different sequences with approximate similarity with the target sequence.
2) Non-ideal evolution end frequently occurs under extreme selection pressure.
3) Evolution process dependent on the specific sequence of antibody and antigen. For the same target, final similarity is limited by single original AA sequence.
[Fig 6]
<p1> Fig. 6 Score for each AA sequence of 500th generation. </p1>
500 cycles of experiments is long enough to obtain a stable result, but differences between evolved sequences can still be observed. As shown in Fig. 6, some AA sites that different from the corresponding sites on the target sequence are shared by all the evolved sequences while some are not. The not shared ones may come from diversity of evolution process. As for those shared, more simulations are done and most of them give similar result. Thus we infer that some sites of AA on the evolved sequence are almost impossible to be mutated to its target AA. Here, we try to give a possible explanation.
Selection Pressure and Evolution End
The modified model is based on some assumptions and two of them are both important and strict:
1) Since the mutation is rare, each mutated sequence are assumed to has one and only one base mutated.
2) Among the mutated sequences and the parent sequence, only the one with highest similarity S can become the parent sequence of next generation. In other words, extreme selection pressure is applied in our model ------only one survival in each generation.
The assumptions above are hard to meet in our experiment. Meanwhile, for the following reasons, it’s this two assumptions that result in unexpected imperfect evolution result.
First, mutation happens to base on the DNA sequence and 3 sequential bases determine the corresponding amino acid. Thus, there exist two kind of amino acid, their substitution require the mutation of 1, 2, 3 base. For those “better” substitutions require 1 base mutated, they are more likely to survive under the extreme selection pressure. In contract, for those 2 or 3 mutations are require for a “better” substitution, any mutation on this site in this generation will gain a lower score and thus be eliminated by the extreme selection pressure. For amino acid sites that require 2 or 3 bases mutated to evolve to the closest “better” AA according to the target sequence, we denote the site as “meta-stable site”. Once the site evolved to be a meta-stable site, it can hardly be changed in the later evolution. In conclusion, under extreme selection pressure, neutral mutation and long-term-beneficial while short-term-harmful mutation can hardly be selected.
Considering the randomness of evolution, meta-stable site are very likely to appear during the long evolution process. Evolution process ends when the sites of evolved sequence are either best alignment sites or meta-stable site. As a result, the evolved sequence is likely to stop as a sub-optimal one.
Multi-sequences Scan
Since the evolved sequences and target sequence are generated randomly, the result of simulation inevitably has some randomicity. To gain more general result, some original sequences and target sequence are generated and applied to simulation.
First, different original sequences and one target sequence are used in simulation. The evolution processes are shown below, illustrating that their similarity differ from sequences to sequences evidently.[Fig.7]
<a class="fancybox" rel="group" href=""><img src="" alt="" /></a> <p1> Fig. 7 Evolution process of 10 different original sequences </p1>
Now we focus on the final similarity distribution. With the parameters shown below, 673 original sequences and 1 target sequence are generated randomly. For each sequences, the final similarity is the average of 10 simulations. Distribution of final similarity is shown in Fig. 7.
<a class="fancybox" rel="group" href=""><img src="" alt="" /></a> <a class="fancybox" rel="group" href=""><img src="" alt="" /></a> <p1> [Fig.8] Fig. 8 Final similarity distribution for 673 different original sequences with 1 target sequence. </p1>
As suggested by Fig. 8, different sequences have very different final similarity. For one target sequence, final similarity of 673 distinct original sequences range from 0.7 to 0.996, revealing that the evolution process dependents on the “distance” between original sequences and the targeted one. The mean of the final similarity is 0.82, which also shows the imperfect evolution end.
Consequently, in order to avoid evolving a sequence that is right “difficult” to evolve, some distinct original sequences can be evolved in our system in parallel.
Conclusion
The stochastic model is built to simulate the evolution process of amino acid sequence. Based on some assumptions, our model assists us in unraveling the non-obvious properties of the system we designed:
1) Large population evolves faster. To accelerate the evolution, small peptide can be evolved in relatively small population, while complex protein had better be evolved in large population.
2) Under extreme selection pressure, neutral mutation and long-term-beneficial while short-term-harmful mutation can hardly survive in the evolution. As a result, sequences evolve fastest always fail to evolve into the target sequence.
3) Under extreme selection pressure, the evolution end dependent of the feature of original AA sequence and target AA sequence. Having several significantly different original sequences evolved in experiment can increase the final similarity effectively.
Though the assumptions set by the model are hard to meet in experiment, some inferences about non-extreme selection can be made according to the properties above. This model helps us to gain deeper understanding of our design and provides us with some advices for further experiments. Besides, some properties of evolution given by this model have been confirmed by some relative experiments, or previously discovered in evolutionary biology. Further confirmation of this model can be carried by testing all the testable property presented by this model.