Team:SYSU-China/file/Project/Model/BLOSUM62.html

From 2014.igem.org

(Difference between revisions)
Line 1: Line 1:
{{CSS/Main}}
{{CSS/Main}}
 +
{{Template:Team:SYSU-China/Home/Mainhead}}
<h1>BLOSUM62</h1>
<h1>BLOSUM62</h1>
<h2>Introduction</h2>
<h2>Introduction</h2>

Revision as of 07:10, 17 October 2014

Template:Team:SYSU-China/Home/Mainhead

Contents

BLOSUM62

Introduction

In our experiment, the replication of DNA is error prone. If the sequence encoding antibody is mutated, the structure of antibody may change and so does its affinity with the antigen. Inspired by this, we try to model for the evolution of a specific antibody.

First, a basic model based on the local alignment of amino acid (AA) sequence is created to simulate the evolutionary process of the antibody. Then BLOSUM62 is introduced as a score criterion. Though crude, our model is sufficient in revealing some non-obvious properties of molecular evolution, such as the appearance of platform period, the benefit of large population, the suboptimal evolution result due to extreme selection pressure and the limitation of single original evolved sequence. Besides, it helps us to gain deeper understanding of our design and provides us with some advices for further experiments.

Symbols and Definition

<a class="fancybox" rel="group" href="Sysuchina_44.png"><img src="Sysuchina_44.png" alt="" /></a>

Analysis and Assumptions

<p> Our experiment aims to select the antibody with higher affinity, which can be represented by its AA sequence. However, what mutation directly effect is its DNA sequence. Thus, a basic model is build to simulate the effect of mutation and the evolution of DNA sequence.

Firstly, we need to obtain an initial DNA sequence for evolution. Our design aims to accelerate the evolution of many kinds of antibody, rather than a specific one. Considering the extremely diversity of antibody, modeling for a single specific DNA sequence may cause the loss of generality. Thus, the initial DNA sequence is generated randomly. Denote the length of DNA sequence as loD, we randomly choice loD elements from the set {A,T,C,G}, join them together as the initial DNA sequence.

Secondly, we need to simulate the mutation. Assuming libS(const. ) mutated sequences are generated in each round of experiment (denoted as one generation) and all of them with nM bases mutated. For simplicity, assuming the mutation probability between bases are the same. The mutated sequence can be obtained by randomly choice nM bases in loD and randomly choice another 3 basis to replace the original one.

Thirdly, the mutated sequence should be evaluated. As a result, an evaluation reference and an evaluation criterion are necessary. For reference, we assume the existence of one ideal amino acid sequence that has the highest affinity with the antigen used in our experiment; denote it as “target sequence” and its alignment with the antigen as the “best alignment”. In practice, for the same reason, target sequence is generated randomly with length loD. By comparing the similarity between the corresponding protein of the target sequence and the mutated sequences, the affinity between mutated sequences and the antigen can be estimated.

In reality, it’s proteins rather than peptide that interact with each other. However, it’s impossible for us to evaluate this antibody-antigen interaction accurately for arbitrary sequence. Thus, we have to assume that the interaction between the protein of the mutated sequence and the antigen can to some extent be estimated by comparing the similarity between their AA sequences. Before comparison, the sequences are translated into their corresponding amino acid sequences. By local “Dot-to-Dot” alignment, the amount of sites (N_S) shares the same kind of AA can be obtain and the similarity (S) is defined as : S=N_s/loD

<a class="fancybox" rel="group" href="Sysuchina_42.png"><img src="Sysuchina_42.png" alt="" /></a>

In this model, phages are characterized by their antibody, and thus their similarity, S.

Finally, mutated sequences should be selected to enter the next round of experiment. In reality, all the mutated sequences have certain possibility to get into the next round. However, simulation that follows the reality results in enormously large load for computation, which becomes impossible to continue after a few generation. To solve this problem, we assume that: among the mutated sequences and the non-mutated sequence (denoted as parent sequence), only the one with highest similarity S can become the parent sequence of next generation.

Results of Basic model

Evolution Process of Single Sequence in One Experiment

The size of antibody’s DNA is rather large, while the part that antibody interact with antigen has about [?????], denoted as effective length. In simulation, in order to reduce the load of computation, only the effective part is considered and the parameters are set as follows:

<a class="fancybox" rel="group" href="Sysuchina_20.png"><img src="Sysuchina_20.png" alt="" /></a>

By recording the similarity of each generation, the evolution process is shown in Fig. 1.

<a class="fancybox" rel="group" href="Sysuchina_21.png"><img src="Sysuchina_21.png" alt="" /></a>

Fig. 1 Evolution process of single sequence in one simulation

As excepted, the similarity S increase with generation G, and reach 1 after about 300 round of experiment. Besides, the increment of each generation decreases with the similarity goes up. Several platform periods are shown, suggesting the phages take several generations to wait for a “right mutation”, which increase the similarity.

Parameter Scan of libS

libS can be interpreted as the effective amount of mutated phages. Considering the mutated progenies are generated with the same mutation rate (P_M), the number of mutated progenies follows the Binomial Distribution and thus libS can be estimated as the exception of the Binomial Distribution, which is proportionate to the population of phages (pP):

<a class="fancybox" rel="group" href="Sysuchina_44.png"><img src="Sysuchina_44.png" alt="" /></a>

So how the evolution process is influenced by the population of phages can be answered by the parameter scan of libS, with other parameters hold as follow:

<a class="fancybox" rel="group" href="Sysuchina_22.png"><img src="Sysuchina_22.png" alt="" /></a>

For the reason of stability, each process is the average of 20 times simulation with the same parameters.

<a class="fancybox" rel="group" href="Sysuchina_23.png"><img src="Sysuchina_23.png" alt="" /></a> <p1> Fig. 2 Parameter Scan of Library Size </p1>

As libS can be interpreted as the population of phages, Fig. 2 suggests that large population speeds up evolution. Its root can be traced back to the assumptions of this model as well as probability. Since the mutation rate P_M is independent of pP, larger population is more likely get progeny with “right mutation”. Meanwhile, note the tiny difference between the result of libS=100 and libS=250, which derives from the limitation of loD. If the population is large enough to traverse all the possible sequences, its further increase becomes dispensable. As a result, small protein can be evolved in relative small population while large protein had better evolved in large population.

Parameter Scan of loD

As to answer how the length of evolved sequence influences the evolution process, parameter loD is scanned with others hold as follows: <a class="fancybox" rel="group" href="Sysuchina_24.png"><img src="Sysuchina_24.png" alt="" /></a>

<a class="fancybox" rel="group" href="Sysuchina_25.png"><img src="Sysuchina_25.png" alt="" /></a> </p>

<p1>Fig. 3 Parameter Scan of the length of DNA</p1>

As Fig. 3 shows, processes with different loD share the same tendency and finally become exactly the same with the target sequence. But it slows down with the length of DNA sequence increases and more time is required for the evolution of larger protein.

Model Modification

Though basic model is based on some strict and even unrealistic assumptions, it helps us to gain some non-obvious properties of our design, such as the appearance of platform period and the benefit of large population. Here, some modifications to the basic model are applied and more properties of our design will be investigated.

1) BLOSUM62 Matrix is introduced as new criterion of similarity comparison.

In reality, different amino acids have distinct physical and chemical properties, which will obviously result in different affinity. The method previously used for estimating similarity is oversimplified. To simulate the evolution more precisely, BLOSUM62 Matrix is introduced as a modification to the defect mentioned above. The value for each pair of AA given by BLOSUM62 reveals the likelihood of their substitution in nature protein evolution, which to some extent reflect the similarity between them. In our model, the score for each site is obtained by looking-up BLOSUM62 for the corresponding AA in evolved sequence and target sequence. Summing up the score for each sites of evolved sequences (N_e) and the full score of target sequence(N_t), the similarity is defined as: S=N_e/N_t 3

<a class="fancybox" rel="group" href="Sysuchina_45.png"><img src="Sysuchina_45.png" alt="" /></a> </p>


2) Termination Codon is taken into consideration by adding the score for termination codon in BLOSUM62.

According to simulation, the mRNA of mutated DNA sequences sometimes contains termination codon. The occurrence of termination codon may result in various consequences, most of which are always fatal. Thus, the scores for termination codon with respect to all amino acid is supposed to be -100, which is large enough to reveal the fatal effect of termination codon and eliminate these mutated sequence from the selection pool. Other words, among all kinds of mutation, only base substitution is taken into consideration in our model.

Results of Modified Model

The modified model gives a similar process with the basic model, as shown in Fig. 4.

<a class="fancybox" rel="group" href="Sysuchina_26.png"><img src="Sysuchina_26.png" alt="" /></a> <p1> Fig. 4 The evolution process given by modified model. Each gray line stands for one evolution while the blue line is the average of 20 evolution processes with same parameters. </p1>

Apart from similar tendency, one important difference can be observed in Fig. 4: the final similarity of the evolution. In basic model, all the sequence will finally be evolved to a sequence exactly the same with the target sequence. In contract, modified model suggests that the evolution will terminate with final similarity S≈0.8, which is so surprising that further investigation is necessary.

Evolution of AA sequence

Previously, the information extracted from the simulation is merely the change of similarity with respect to generation. To get more information, the parent sequences of each generation and the score for each AA on them are recorded. As a result, more important properties of AA evolution are unraveled:

1) One original AA sequence may go through different evolution process and evolved into different sequences with approximate similarity with the target sequence.
2) Non-ideal evolution end frequently occurs under extreme selection pressure.
3) Evolution process dependent on the specific sequence of antibody and antigen. For the same target, final similarity is limited by single original AA sequence.

Run the simulation with the same parameters, original AA sequence and target AA sequence, record the AA sequence selected in each generation. Set the score of each site in target AA sequence as 0 and the score for each site in evolved AA sequence can be obtained. By some data analysis, the evolution processes of 10 experiments are shown in the following GIF: 【gif图】

<p1> Fig. 5 AA sequence evolution </p1>

According to Fig. 5, new properties can be observed:

1) Some AA sites are able to be mutated to target AA directly while most of them have to be mutated into “suboptimum” AA (lighter color) first and the optimum one (white) latter. Obviously, the introduction of BLOSUM62 increases the discrimination of different kinds of AA in our model and makes the evolution process more realistic.
2) Starting from the same original AA sequence, 10 sequences go through different evolution process but have approximate similarity at the end of experiments. The final AA sequences in 10 experiments are similar but not the same. This inference has been observed by experimentalists.[Dr.Liu]

If we focus on the final similarity,

<a class="fancybox" rel="group" href="Sysuchina_27.png"><img src="Sysuchina_27.png" alt="" /></a> <p1> Fig. 6 Score for each AA sequence of 100th generation.</p1>

The 100th generation of Fig. 5 is shown in Fig. 5中的第100代如上图所示。由Fig. 6我们可以推断: 1)存在某些位点的氨基酸在较长时间内(100代为实验条件和目的允许的代数)保持不变,或无法突变为目标序列对应位点上的氨基酸。 从上述推断出发,我们寻找造成该现象的原因:

Selection Pressure and Evolution End

The modified model is based on some assumptions and two of them are both important and strict:

1) Since the mutation is rare, each mutated sequence are assumed to has one and only one base mutated.
2) Among the mutated sequences and the parent sequence, only the one with highest similarity S can become the parent sequence of next generation. In other words, extreme selection pressure is applied in our model ------only one survival in each generation.

The assumptions above are hard to meet in our experiment. Meanwhile, for the following reasons, it’s this two assumptions that result in unexpected imperfect evolution result.

First, mutation happens to base on the DNA sequence and 3 sequential bases determine the corresponding amino acid. Thus, there exist two kind of amino acid, their substitution require the mutation of 1, 2, 3 base. For those “better” substitutions require 1 base mutated, they are more likely to survive under the extreme selection pressure. In contract, for those 2 or 3 mutations are require for a “better” substitution, any mutation on this site in this generation will gain a lower score and thus be eliminated by the extreme selection pressure. For amino acid sites that require 2 or 3 bases mutated to evolve to the closest “better” AA according to the target sequence, we denote the site as “metastable site”. Once the site evolved to be a metastable site, it can hardly be changed in the later evolution. In conclusion, under extreme selection pressure, neutral mutation and long-term-beneficial while short-term-harmful mutation can hardly be selected.

Considering the randomness of evolution, metastable site are very likely to appear during the long evolution process. Evolution process ends when the sites of evolved sequence are either best alignment sites or metastable site. As a result, the evolved sequence is likely to stop as a suboptimal one.

Multi-sequences Scan

Since the evolved sequences and target sequence are generated randomly, the result of simulation inevitably has some randomicity. To gain more general result, some original sequences and target sequence are generated and applied to simulation. Here, we focus on the end of evolution.

<a class="fancybox" rel="group" href="Sysuchina_28.png"><img src="Sysuchina_28.png" alt="" /></a>

With the parameters shown below, 500 original sequences and 1 target sequence are generated randomly. The final similarity is the average of 10 simulations. Distribution of final similarity is shown in Fig. 7.

<a class="fancybox" rel="group" href="Sysuchina_29.png"><img src="Sysuchina_29.png" alt="" /></a> <a class="fancybox" rel="group" href="Sysuchina_30.png"><img src="Sysuchina_30.png" alt="" /></a> <p1> Fig. 7 Final similarity distribution for 500 different original sequences with 1 target sequence. </p1>

上图说明,起始序列不同,辅助进化循环结束后的最终相似度不同。这说明不同起始序列对于同一目标序列而言,在分子进化上具有不同的性质。这一性质与起始序列的DNA序列有关。不同起始序列向同一目标序列进化,达到同一相似度S的难度也不同。在极端选择压力下,进化后的序列相似度平均值为66%。此外,上图还呈现出明显的非对称性。这说明高相似度的突变后代更难产生。这与基本模型中DNA出现正确突变的概率分布给出的性质是一致的,尽管这里是基于氨基酸位点突变的统计。上图揭示的性质提示着,在实验中准备若干基因组成差异较大的候选起始序列可以有效地增加分子进化的成功率和最终相似度。

Conclusion

The stochastic model is built to simulate the evolution process of amino acid sequence. Based on some assumptions, our model assists us in unraveling the non-obvious properties of the system we designed:

1) Large population evolves faster. To accelerate the evolution, small peptide can be evolved in relatively small population, while complex protein had better be evolved in large population.
2) Under extreme selection pressure, neutral mutation and long-term-beneficial while short-term-harmful mutation can hardly survive in the evolution. As a result, sequences evolve fastest always fail to evolve into the target sequence.
3) Under extreme selection pressure, the evolution end dependent of the feature of original AA sequence and target AA sequence. Having several significantly different original sequences evolved in experiment can increase the final similarity effectively.

Though the assumptions set by the model are hard to meet in experiment, some inferences about non-extreme selection can be made according to the properties above. This model helps us to gain deeper understanding of our design and provides us with some advices for further experiments. Besides, some properties of evolution given by this model have been confirmed by some relative experiments, or previously discovered in evolutionary biology. Further confirmation of this model can be carried by testing all the testable property presented by this model.

From protein evolution perspective, this model answers the question of”How will sequence evolve if the system we design work well”. However, a very important question has been neglected by us: weather our design can selective the mutated sequence effectively. In this model, its answer is just assumed to be “Yes”. Later, we will answer this question from the perspective of population growth.