Team:StanfordBrownSpelman/Modelling
From 2014.igem.org
(Difference between revisions)
(43 intermediate revisions not shown) | |||
Line 1: | Line 1: | ||
- | |||
<html class="no-js" lang="en"> | <html class="no-js" lang="en"> | ||
<head> | <head> | ||
Line 51: | Line 50: | ||
</h3> | </h3> | ||
<center><div class="boxedmenu"> | <center><div class="boxedmenu"> | ||
- | <h7><center><a href="#" id="data">DoubleOptimizer</a> ● <a href="#" id="methods">CompositionSearch</a> | + | <h7><center><a href="#" id="data">DoubleOptimizer</a> ● <a href="#" id="methods">CompositionSearch</a><br><a href="#" id="pics">Flux Balance Analysis</a></h7> |
+ | </div> | ||
- | <h6 id=" | + | <h6 id="subheader">While working on our synthetic biology projects for this year's iGEM competition, we found ourselves in need of some computational tools for synthetic biology that did not yet exist. Therefore, we developed our own software packages to meet these needs. In particular, we have developed two pieces of software, specifically designed for the needs of synthetic biologists:<b></b> |
- | < | + | <div class="sub5">● DoubleOptimizer: a tool that facilitates synthesis of repetitive genes by optimizing codon usage to both match a codon usage distribution for a desired organism, and to reduce and avoid repetitive nucleotide sequences, allowing for easier synthesis.</div> |
+ | <div class="sub5">● CompositionSearch: a tool that quickly ranks all proteins in a proteome by their similarity to a given amino acid distribution.</div> | ||
</h6> | </h6> | ||
</div> | </div> | ||
</div> | </div> | ||
+ | <div class="row"> | ||
+ | <div id="results" class="small-3 small-centered columns cells2"> | ||
+ | </div> | ||
+ | </div> | ||
<!-- ====== DOUBLE OPTIMIZER ====== --> | <!-- ====== DOUBLE OPTIMIZER ====== --> | ||
Line 64: | Line 69: | ||
<div class="row"> | <div class="row"> | ||
<div id="subheader" class="small-8 small-centered columns"> | <div id="subheader" class="small-8 small-centered columns"> | ||
- | < | + | <h5><center> |
- | DoubleOptimizer </ | + | DoubleOptimizer </h5> |
- | < | + | <h6><center>A utility for simultaneous codon and <br>gene synthesis optimization<br></h6> |
+ | </div></div> | ||
+ | <div class="small-7 small-centered columns"><br><center><img src="https://static.igem.org/mediawiki/2014/0/0b/Double_optimizer_graphic.jpg"> | ||
+ | </div> | ||
+ | <div class="row"> | ||
+ | <div id="subheader" class="small-8 small-centered columns"> | ||
+ | <h6> | ||
- | |||
- | |||
- | |||
- | |||
- | |||
- | |||
- | |||
<h6> | <h6> | ||
- | Gene synthesis as a tool for biological engineering presents both opportunities and challenges. One opportunity presented is the ability to optimize codon usage in a gene to match that of a host organism. Compared to traditional cloning methods, this can increase protein yields in the host organism by several fold.[1] However, while there exist a large number of freely-usable programs that perform codon optimization, there is no guarantee that the sequences these programs provide will be able to be synthesized. Specifically, in the case of genes with repetitive amino acid sequences, these programs will often generate outputs that contain too many repeated short DNA sequences to be synthesized commercially. <br></br> | + | <h5>Background</h5> |
+ | <h6>Gene synthesis as a tool for biological engineering presents both opportunities and challenges. One opportunity presented is the ability to optimize codon usage in a gene to match that of a host organism. Compared to traditional cloning methods, this can increase protein yields in the host organism by several fold.[1] However, while there exist a large number of freely-usable programs that perform codon optimization, there is no guarantee that the sequences these programs provide will be able to be synthesized. Specifically, in the case of genes with repetitive amino acid sequences, these programs will often generate outputs that contain too many repeated short DNA sequences to be synthesized commercially. <br></br> | ||
- | As an example, the hypothetical protein X777_06170 from the ant species <i>Cerapachys biroi</i> has an amino acid sequence that appears to be somewhat repetitive: <br>< | + | As an example, the hypothetical protein X777_06170 from the ant species <i>Cerapachys biroi</i> has an amino acid sequence that appears to be somewhat repetitive: <br><br> |
<div class="sub6">001 mklfkclvpv vvlllikdss arpglirdfv ggtvgsilep fqilkpkdsy adanshasah <br></br> | <div class="sub6">001 mklfkclvpv vvlllikdss arpglirdfv ggtvgsilep fqilkpkdsy adanshasah <br></br> | ||
Line 90: | Line 95: | ||
421 ggisgphggh gqaganagan ananagatvg ssggvlggvg dhggyhgyng hdgssglnlg<br></br> | 421 ggisgphggh gqaganagan ananagatvg ssggvlggvg dhggyhgyng hdgssglnlg<br></br> | ||
481 gygggsnana qassnalass ggsssatsda lsnahssggs alanssskas angsgsanan<br></br> | 481 gygggsnana qassnalass ggsssatsda lsnahssggs alanssskas angsgsanan<br></br> | ||
- | 541 ahassnassg shglgsktsa ssqasasadt rdmlifs[2]</div> | + | 541 ahassnassg shglgsktsa ssqasasadt rdmlifs[2]</div><br> |
</h6> | </h6> | ||
- | |||
- | |||
- | + | <h6> Note that this sequence is not simply the same sequence repeated multiple times, but instead contains several motifs on the order of 10 - 20 amino acids in length that occur several times. When this sequence was run through the codon optimization program for expression in ''<i>E. coli</i>'' provided by a major DNA synthesis firm, the resulting output could not be synthesized by the very same firm: the "optimized" DNA sequence contained too many recurring short (> 8 nucleotide) DNA sequences to allow for synthesis.<br><br> | |
- | + | ||
- | <h6> Note that this sequence is not simply the same sequence repeated multiple times, but instead contains several motifs on the order of 10 - 20 amino acids in length that occur several times. When this sequence was run through the codon optimization program for expression in ''E. coli'' provided by a major DNA synthesis firm, the resulting output could not be synthesized by the very same firm: the "optimized" DNA sequence contained too many recurring short (> 8 nucleotide) DNA sequences to allow for synthesis.<br>< | + | |
- | Manually correcting for repeats in the codon-optimized DNA sequence is a sub-optimal solution: not only is this process time-consuming, but it has the tendency to undo the codon-optimization: if a sequence of amino acids occurs several times, one may be forced to use all possible codon-combinations to represent this sequence to avoid nucleotide-sequence repetition. Unless corrected for by skewing codon usage elsewhere in the sequence, this will tend to make the codon usage more uniform than is optimal for the expression vector. Additionally, any changes made in either correcting for repeats or re-correcting for codon usage may in turn introduce additional repeats. | + | Manually correcting for repeats in the codon-optimized DNA sequence is a sub-optimal solution: not only is this process time-consuming, but it has the tendency to undo the codon-optimization: if a sequence of amino acids occurs several times, one may be forced to use all possible codon-combinations to represent this sequence to avoid nucleotide-sequence repetition. Unless corrected for by skewing codon usage elsewhere in the sequence, this will tend to make the codon usage more uniform than is optimal for the expression vector. Additionally, any changes made in either correcting for repeats or re-correcting for codon usage may in turn introduce additional repeats. |
</h6> | </h6> | ||
</div> | </div> | ||
Line 109: | Line 110: | ||
<div class="row"> | <div class="row"> | ||
<div id="subheader" class="small-8 small-centered columns"> | <div id="subheader" class="small-8 small-centered columns"> | ||
- | <h5><center>Solution</h5> | + | <h5><center>Solution: DoubleOptimizer</h5> |
- | <h6>DoubleOptimizer is a software tool we have created to optimize codon usage in a gene both to match a given codon usage distribution and to avoid repetition of nucleotide sequences. Given a DNA or amino acid sequence and a desired codon distribution, DoubleOptimizer will produce, within a matter of | + | <h6>DoubleOptimizer is a software tool we have created to optimize codon usage in a gene both to match a given codon usage distribution and to avoid repetition of nucleotide sequences. Given a DNA or amino acid sequence and a desired codon distribution, DoubleOptimizer will produce, within a matter of seconds, an equivalent sequence that has substantially reduced DNA sequence repetition, while also closely matching the desired codon usage. |
</h6> | </h6> | ||
</div> | </div> | ||
Line 121: | Line 122: | ||
<div id="subheader" class="small-8 small-centered columns"> | <div id="subheader" class="small-8 small-centered columns"> | ||
<h5><center>Availability and Usage</h5> | <h5><center>Availability and Usage</h5> | ||
- | <h6>DoubleOptimizer may be downloaded <a href = " | + | <h6>DoubleOptimizer may be downloaded <a href = "https://static.igem.org/mediawiki/2014/a/aa/SBSDoubleOptimizer.zip"><u>here</u></a>. DoubleOptimizer is a command line utility, provided as a Java jar file. It can be invoked from command line on any system with Java 7 or later installed, with the following syntax:<br></br> |
- | + | ||
- | DoubleOptimizer is a command line utility, provided as a Java jar file. It can be invoked from command line on any system with Java 7 or later installed, with the following syntax:<br></br> | + | |
<div class="sub6">java -jar DoubleOptimizer.jar seq.txt codons.txt <br></br> </div> | <div class="sub6">java -jar DoubleOptimizer.jar seq.txt codons.txt <br></br> </div> | ||
- | where "seq.txt" is a DNA sequence, stored as a plain text file, and "codons.txt" is a file containing the desired codon distribution to match. It should be formatted as plain text, according to the following example template: <br>< | + | where "seq.txt" is a DNA sequence, stored as a plain text file, and "codons.txt" is a file containing the desired codon distribution to match. It should be formatted as plain text, according to the following example template: <br><br> |
<div class="sub6"> | <div class="sub6"> | ||
Line 231: | Line 230: | ||
This allows for an amino-acid sequence, specified in single-letter code, to be used as input instead of a DNA sequence. The initial sequence statistics displayed will be for a uniform random reverse translation of the given amino acid sequence. <br></br> | This allows for an amino-acid sequence, specified in single-letter code, to be used as input instead of a DNA sequence. The initial sequence statistics displayed will be for a uniform random reverse translation of the given amino acid sequence. <br></br> | ||
- | Example: < | + | Example: <br> |
<div class="sub6">java -jar DoubleOptimizer.jar aaseq.txt codons.txt -A <br></br> | <div class="sub6">java -jar DoubleOptimizer.jar aaseq.txt codons.txt -A <br></br> | ||
Line 240: | Line 239: | ||
This allows the user to specify, in seconds, a different run-time for the program other than the default 10 seconds. While 10 seconds should be sufficient to produce a well-optimized result for most genes of moderate length on modern desktop computers, longer times may produce better-optimized results on slower machines or on longer sequences. <br></br> | This allows the user to specify, in seconds, a different run-time for the program other than the default 10 seconds. While 10 seconds should be sufficient to produce a well-optimized result for most genes of moderate length on modern desktop computers, longer times may produce better-optimized results on slower machines or on longer sequences. <br></br> | ||
- | Example:< | + | Example:<br> |
<div class="sub6">java -jar DoubleOptimizer.jar aaseq.txt codons.txt -A -T30 <br></br> | <div class="sub6">java -jar DoubleOptimizer.jar aaseq.txt codons.txt -A -T30 <br></br> | ||
Line 249: | Line 248: | ||
This allows the user to specify a different minimum length for what is considered a repeat, other than the default 8 nucleotides.<br></br> | This allows the user to specify a different minimum length for what is considered a repeat, other than the default 8 nucleotides.<br></br> | ||
- | Example:< | + | Example:<br> |
<div class="sub6"> java -jar DoubleOptimizer.jar seq.txt codons.txt -L7 -T15 <br></br> | <div class="sub6"> java -jar DoubleOptimizer.jar seq.txt codons.txt -L7 -T15 <br></br> | ||
- | |||
- | |||
+ | </div> | ||
+ | <div><h5><ul><li>-S##, -E##</ul></h5></div> | ||
These allow the user to specify the starting and ending nucleotide, respectively, of the coding region in a construct sequence to be sythesized. Nucleotides outside this frame will be ignored for codon usage optimization, and will never be modified. This option is useful for preventing repetitions, within the coding region, of fixed sequences that occur at the ends of a construct to be sythesized, outside of the coding region. The default values are the beginning of the sequence, and the end of the last complete codon. These options may be used together or independently. Values are one-indexed. If used with ''-A'', these will be interpreted as amino acid indices.<br></br> | These allow the user to specify the starting and ending nucleotide, respectively, of the coding region in a construct sequence to be sythesized. Nucleotides outside this frame will be ignored for codon usage optimization, and will never be modified. This option is useful for preventing repetitions, within the coding region, of fixed sequences that occur at the ends of a construct to be sythesized, outside of the coding region. The default values are the beginning of the sequence, and the end of the last complete codon. These options may be used together or independently. Values are one-indexed. If used with ''-A'', these will be interpreted as amino acid indices.<br></br> | ||
- | Example:< | + | Example:<br> |
<div class="sub6"> java -jar DoubleOptimizer.jar seq.test codons.txt -S121 -E1853 -R1000 -D100 | <div class="sub6"> java -jar DoubleOptimizer.jar seq.test codons.txt -S121 -E1853 -R1000 -D100 | ||
Line 266: | Line 265: | ||
- | Example:< | + | Example:<br> |
<div class="sub6"> java -jar DoubleOptimizer.jar seq.test codons.txt -S121 -E1853 -R1000 -D100 | <div class="sub6"> java -jar DoubleOptimizer.jar seq.test codons.txt -S121 -E1853 -R1000 -D100 | ||
Line 274: | Line 273: | ||
This option will make the program halt optimization after a certain number of optimization cycles. This may be used with a large value of -T to standardize optimization quality between computers of different speeds. Because this option is mostly only useful for testing the efficiency of this program itself, it may be removed from future releases.<br></br> | This option will make the program halt optimization after a certain number of optimization cycles. This may be used with a large value of -T to standardize optimization quality between computers of different speeds. Because this option is mostly only useful for testing the efficiency of this program itself, it may be removed from future releases.<br></br> | ||
- | Example:< | + | Example:<br> |
<div class="sub6"> java -jar DoubleOptimizer.jar seq.test codons.txt -S121 -E1853 -T1000 -R1000 | <div class="sub6"> java -jar DoubleOptimizer.jar seq.test codons.txt -S121 -E1853 -T1000 -R1000 | ||
Line 290: | Line 289: | ||
<h6>As an example of the normal operation of the program on an amino acid sequence, consider the above-listed example of a repetitive amino acid sequence, hypothetical protein X777_06170 from the ant species <i>Cerapachys biroi</i>. At one point in our waterproofing project, this protein was considered for sythesis as a possible homologue of the waterproofing <i>Polistes</i> protein. DoubleOptimizer was originally written to overcome difficulties in synthesising this protein for expression in <i> E. coli</i>. (Due to further advances in the <i>Polistes</i> project, this gene was never synthesized). <b></b> | <h6>As an example of the normal operation of the program on an amino acid sequence, consider the above-listed example of a repetitive amino acid sequence, hypothetical protein X777_06170 from the ant species <i>Cerapachys biroi</i>. At one point in our waterproofing project, this protein was considered for sythesis as a possible homologue of the waterproofing <i>Polistes</i> protein. DoubleOptimizer was originally written to overcome difficulties in synthesising this protein for expression in <i> E. coli</i>. (Due to further advances in the <i>Polistes</i> project, this gene was never synthesized). <b></b> | ||
- | A screenshot of a run of DoubleOptimizer on the amino acid sequence for this protein (listed above), with the <i> E. coli</i> codon usage table also listed above, with default settings for 10 seconds on a laptop computer (2011 model Macbook Pro) is shown below: | + | A screenshot of a run of DoubleOptimizer on the amino acid sequence for this protein (listed above), with the <i> E. coli</i> codon usage table also listed above, with default settings for 10 seconds on a laptop computer (2011 model Macbook Pro) is shown below: |
- | + | ||
+ | </h6> | ||
+ | </div></div> | ||
+ | |||
+ | <div class="small-10 small-centered columns"><br><center><img src="https://static.igem.org/mediawiki/2014/8/85/AADoubleOptExample.png"> | ||
</div> | </div> | ||
Line 298: | Line 300: | ||
<div id="subheader" class="small-8 small-centered columns"> | <div id="subheader" class="small-8 small-centered columns"> | ||
<h6> | <h6> | ||
- | |||
- | |||
- | |||
- | |||
- | |||
- | |||
+ | Note that the number of repeated sequences is significantly reduced, and the closeness-of-fit to the desired codon sequence is much improved, compared to the initial random reverse-translation. As an example of DoubleOptimizer run on a DNA sequence, a somewhat-repetitive <i>Homo sapiens</i> gene was arbitrarily selected from the NCBI database. Specifically, the exon sequence for <a href="http://www.ncbi.nlm.nih.gov/nuccore/18028443"><u>Ca+2-binding protein CBP86 form VII</u></a> was optimized for synthesis, again in <i> E. coli</i>. The same computer setup was used, and DoubleOptimizer was run for the default 10 seconds: | ||
+ | </h6></div></div> | ||
+ | |||
+ | <div class="small-10 small-centered columns"><br><center><img src="https://static.igem.org/mediawiki/2014/0/09/DNADoubleOptExample.png"> | ||
</div> | </div> | ||
+ | |||
<div class="row"> | <div class="row"> | ||
Line 311: | Line 312: | ||
<h6> | <h6> | ||
- | + | In this case, the sequence went from being over 47% composed of repetitive sequences 8 nucleotides or more in length to being completely non-repetitive. Also, the codon usage distribution is now a near-perfect match to the <i> E. coli</i> codon usage distribution. | |
- | + | ||
- | In this case, the sequence went from being over 47% composed of repetitive sequences 8 nucleotides or more in length to being completely non-repetitive. Also, the codon usage distribution is now a near-perfect match to the <i> E. coli</i> codon usage distribution. | + | |
</h6> | </h6> | ||
Line 324: | Line 323: | ||
<h5><center>Algorithm</h5> | <h5><center>Algorithm</h5> | ||
<h6>DoubleOptimizer uses a genetic optimization algorithm.[4] This algorithm mimics natural selection by taking the original gene sequence and generating several variants by copying the sequence and creating random silent point mutations in each copy. Each variant is then assigned a fitness score based on its repetitiveness and closeness-of-fit to the desired codon usage distribution, and the most fit variants are kept. These are then re-copied, re-mutated and re-scored, and again only the most fit variants are kept, through several thousand rounds of selection. When optimization time runs out, the most fit member of the population after the final round of selection is displayed. <br></br> | <h6>DoubleOptimizer uses a genetic optimization algorithm.[4] This algorithm mimics natural selection by taking the original gene sequence and generating several variants by copying the sequence and creating random silent point mutations in each copy. Each variant is then assigned a fitness score based on its repetitiveness and closeness-of-fit to the desired codon usage distribution, and the most fit variants are kept. These are then re-copied, re-mutated and re-scored, and again only the most fit variants are kept, through several thousand rounds of selection. When optimization time runs out, the most fit member of the population after the final round of selection is displayed. <br></br> | ||
- | One challenge in implementing this algorithm efficiently is that scoring for repetitiveness is a time-consuming process (the time belongs to O(n<sup>2</sup>) in the length of the sequence). Because many thousands of sequences are scored in a single optimization, finding a way to reduce the computational burden of this step was a task of paramount importance. Our solution was to store for each sequence a matrix representing the locations in the sequence where repetitions are already known to occur, and to only attempt to add or remove repetitions that could by affected by the point mutations that were made. This significantly improved run-time, compared to the | + | One challenge in implementing this algorithm efficiently is that scoring for repetitiveness is a time-consuming process (the time belongs to O(n<sup>2</sup>) in the length of the sequence). Because many thousands of sequences are scored in a single optimization, finding a way to reduce the computational burden of this step was a task of paramount importance. Our solution was to store for each sequence a matrix representing the locations in the sequence where repetitions are already known to occur, and to only attempt to add or remove repetitions that could by affected by the point mutations that were made. This significantly improved run-time, compared to the unsophisticated technique of independently re-scoring each modified sequence. |
- | < | + | <br> |
</h6> | </h6> | ||
</div> | </div> | ||
</div> | </div> | ||
+ | |||
<!-- ====== References ====== --> | <!-- ====== References ====== --> | ||
Line 334: | Line 334: | ||
<div id="subheader" class="small-8 small-centered columns"> | <div id="subheader" class="small-8 small-centered columns"> | ||
<h5><center>References</h5> | <h5><center>References</h5> | ||
- | <h6>1. Several references exist that establish the link between codon usage and expression. To give some highly-cited examples:<br></br> | + | <h6>1. Several references exist that establish the link between codon usage and expression. To give some highly-cited examples:<br> </br> |
+ | |||
+ | Ikemura, T. (1985) Codon usage and tRNA content in unicellular and multicellular organisms. <i>Mol Biol Evol. </i> 2 (1): 13-34. PMID: <a href="http://www.ncbi.nlm.nih.gov/pubmed/3916708" target="_blank">3916708 </a> <br></br> | ||
+ | |||
+ | Gouy, M. and Gautier, C. (1982) Codon usage in bacteria: correlation with gene expressivity. <i>Nucl. Acids Res.</i> 10 (22): 7055-7074. PMCID: <a href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC326988/" target="_blank"> PMC326988 </a> <br></br> | ||
+ | |||
+ | Sharp, PM. and Li, W.-H. (1987) The codon adaptation index-a measure of directional synonymous codon usage bias, and its potential applications <i>Nucl. Acids Res.</i> 15 (3): 1281-1295. PMID: <a href="http://www.ncbi.nlm.nih.gov/pubmed/3547335/" target="_blank"> 3547335 </a> <br></br> | ||
+ | |||
+ | 2. Protein sequence from NCBI, <a href="http://www.ncbi.nlm.nih.gov/protein/607359946" target="_blank"><u> here.</u></a><br> | ||
+ | Original paper: Oxley PR, <i>et al.</i> (2014) The genome of the clonal raider ant Cerapachys biroi. <i> Curr Biol. </i> 24(4):451-8. PMID: <a href="http://www.ncbi.nlm.nih.gov/pubmed/24508170" target="_blank"> 24508170 </a> <br></br> | ||
- | + | 3. Nakamura, Y. (2007) Codon Usage Database: Escherichia coli W3110. <a href="http://www.kazusa.or.jp/codon/cgi-bin/showcodon.cgi?species=316407"" target="_blank"> http://www.kazusa.or.jp/codon/cgi-bin/showcodon.cgi?species=316407</a>. <br></br> | |
- | + | ||
- | + | ||
- | + | ||
- | + | ||
4. For a good introduction to genetic algorithms, see:<br></br> | 4. For a good introduction to genetic algorithms, see:<br></br> | ||
- | Goldberg, | + | Goldberg, DE. (1989) <i>Genetic Algorithms in Search, Optimization, and Machine Learning.</i> Reading, Mass: Addison-Wesley Pub. Co. |
- | < | + | <br> |
</h6> | </h6> | ||
</div> | </div> | ||
Line 354: | Line 359: | ||
<div class="row"> | <div class="row"> | ||
- | <div id=" | + | <div id="process" class="small-3 small-centered columns cells8"> |
</div> | </div> | ||
- | |||
</div> | </div> | ||
Line 364: | Line 368: | ||
<div class="row"> | <div class="row"> | ||
<div id="subheader" class="small-8 small-centered columns"> | <div id="subheader" class="small-8 small-centered columns"> | ||
- | < | + | <h5 ><center> CompositionSearch</h5> |
- | < | + | <h6><center>A fast, local search of protein sequences<br>against a specified amino acid distribution</h6><br> </br><br> |
+ | <h5><center>Background</h5> | ||
<h6>When the relative proportions of amino acids in an unknown protein product have been chemically determined, it is often useful to search a proteome for proteins that have similar amino acid distributions, in order to identify this protein product. While at least one online utility for performing this task already exists (provided by the <a href = "http://web.expasy.org/aacompident/aacompfree.html"> <u>Swiss Institute of Bioinformatics</u> </a>), the web-based nature of this program creates some limitations. Firstly, the SIB provides computational resources for the calculation, resulting in slower turnaround for the user (searches take about 15 minutes). Secondly, this program will only search for proteins already in the Swiss-Prot / TrEMBL databases. (At last check, only the smaller Swiss-Prot search was functional.) If an organism is being newly studied and has just been sequenced, its predicted proteome will not be in these databases. Thirdly, due to limited resources, only the top matches to a given search are provided. This does not allow for statistical comparison to the "typical" protein within a given proteome. Fourthly, for very-high-security tasks, uploading data to a third party may be undesirable. | <h6>When the relative proportions of amino acids in an unknown protein product have been chemically determined, it is often useful to search a proteome for proteins that have similar amino acid distributions, in order to identify this protein product. While at least one online utility for performing this task already exists (provided by the <a href = "http://web.expasy.org/aacompident/aacompfree.html"> <u>Swiss Institute of Bioinformatics</u> </a>), the web-based nature of this program creates some limitations. Firstly, the SIB provides computational resources for the calculation, resulting in slower turnaround for the user (searches take about 15 minutes). Secondly, this program will only search for proteins already in the Swiss-Prot / TrEMBL databases. (At last check, only the smaller Swiss-Prot search was functional.) If an organism is being newly studied and has just been sequenced, its predicted proteome will not be in these databases. Thirdly, due to limited resources, only the top matches to a given search are provided. This does not allow for statistical comparison to the "typical" protein within a given proteome. Fourthly, for very-high-security tasks, uploading data to a third party may be undesirable. | ||
</h6> | </h6> | ||
Line 375: | Line 380: | ||
<div class="row"> | <div class="row"> | ||
<div id="subheader" class="small-8 small-centered columns"> | <div id="subheader" class="small-8 small-centered columns"> | ||
- | <h5><center>Solution</h5> | + | <h5><center>Solution: CompositionSearch</h5> |
<h6>CompositionSearch is a software tool we have created to address these issues by allowing an individual to rank all proteins in a proteome by similarity (minimum Euclidian distance) to a reference amino acid distribution locally on one's own computer. This ranking can be generated in a matter of seconds, rather than taking several minutes. Because it ranks all proteins in a proteome, CompositionSearch can also generate a figure for the significance of the similarity of a given protein to a given amino acid distribution, using the similarity of the rest of the proteome as a statistical distribution function. | <h6>CompositionSearch is a software tool we have created to address these issues by allowing an individual to rank all proteins in a proteome by similarity (minimum Euclidian distance) to a reference amino acid distribution locally on one's own computer. This ranking can be generated in a matter of seconds, rather than taking several minutes. Because it ranks all proteins in a proteome, CompositionSearch can also generate a figure for the significance of the similarity of a given protein to a given amino acid distribution, using the similarity of the rest of the proteome as a statistical distribution function. | ||
</h6> | </h6> | ||
Line 387: | Line 392: | ||
<div id="subheader" class="small-8 small-centered columns"> | <div id="subheader" class="small-8 small-centered columns"> | ||
<h5><center>Availability and Usage</h5> | <h5><center>Availability and Usage</h5> | ||
- | <h6>CompositionSearch may be downloaded <a href= " | + | <h6>CompositionSearch may be downloaded <a href= "https://static.igem.org/mediawiki/2014/c/ca/SBSCompositionSearch.zip"><u>here</u></a>. <br></br> |
CompositionSearch is a command line utility, provided as a Java jar file. It can be invoked from command line on any system with Java 7 or later installed, with the following syntax:<br></br> | CompositionSearch is a command line utility, provided as a Java jar file. It can be invoked from command line on any system with Java 7 or later installed, with the following syntax:<br></br> | ||
Line 393: | Line 398: | ||
<div class="sub6"> java -jar CompositionSearch.jar prot.fasta freqs.txt out.csv <br></br> </div> | <div class="sub6"> java -jar CompositionSearch.jar prot.fasta freqs.txt out.csv <br></br> </div> | ||
- | where "prot.fasta" is a FASTA-formatted proteome, and "freqs.txt" is a file containing the desired amino distribution to match. It should be formatted as plain text, according to the following example template:< | + | where "prot.fasta" is a FASTA-formatted proteome, and "freqs.txt" is a file containing the desired amino distribution to match. It should be formatted as plain text, according to the following example template:<br> |
<div class="sub6"> | <div class="sub6"> | ||
Line 410: | Line 415: | ||
</div> | </div> | ||
- | Note that this may contain as many or as few amino acids as desired. The frequencies, however, are interpreted as absolute, so if all amino acids are represented, they should add 1. (See below for discussion of how the optional -N and -X flags affect this interpretation). <br></br> | + | Note that this may contain as many or as few amino acids as desired. The frequencies, however, are interpreted as absolute, so if all amino acids are represented, they should add up to 1. (See below for discussion of how the optional -N and -X flags affect this interpretation). <br></br> |
- | "out.csv" in the above example line is the destination path to store the results of the calculation. The output will be a spreadsheet in csv format, which may be imported into your favorite desktop spreadsheet application ( | + | "out.csv" in the above example line is the destination path to store the results of the calculation. The output will be a spreadsheet in csv format, which may be imported into your favorite desktop spreadsheet application (e.g. Microsoft Excel, LibreOffice Calc, etc.).<br></br> |
After execution, out.csv will contain a spreadsheet showing the reference distribution, and, in order by similarity to the reference distribution, the amino acid distributions of all proteins in the proteome. Protein name, similarity ranking, similarity score (Euclidian distance between amino acid distributions; lower is better), and similarity p-value (on the curve of other proteins in the proteome) will be listed for each protein. <br></br> | After execution, out.csv will contain a spreadsheet showing the reference distribution, and, in order by similarity to the reference distribution, the amino acid distributions of all proteins in the proteome. Protein name, similarity ranking, similarity score (Euclidian distance between amino acid distributions; lower is better), and similarity p-value (on the curve of other proteins in the proteome) will be listed for each protein. <br></br> | ||
Line 422: | Line 427: | ||
This will cause the program to ignore amino acids in the proteins that are not in the distribution list. In other words, this means that the frequencies given refer to frequencies relative to only the the other amino acids listed, instead of all amino acids. To clarify:<br></br> | This will cause the program to ignore amino acids in the proteins that are not in the distribution list. In other words, this means that the frequencies given refer to frequencies relative to only the the other amino acids listed, instead of all amino acids. To clarify:<br></br> | ||
- | Without the -N flag, the line in the above example distribution list: < | + | Without the -N flag, the line in the above example distribution list: <br> |
<div class="sub6"> P 0.08 <br></br> </div> | <div class="sub6"> P 0.08 <br></br> </div> | ||
means that 8% of all amino acid residues in the matching protein are expected to be proline residues. <br></br> | means that 8% of all amino acid residues in the matching protein are expected to be proline residues. <br></br> | ||
- | With the -N flag, the line in the above example distribution list:< | + | With the -N flag, the line in the above example distribution list:<br> |
<div class="sub6"> P 0.08 <br></br> </div> | <div class="sub6"> P 0.08 <br></br> </div> | ||
means that 8% of amino acid residues that belong in the set {ADEGIKLPRSTV} (the amino acids with defined frequences) in the matching protein are expected to be proline residues. <br></br> | means that 8% of amino acid residues that belong in the set {ADEGIKLPRSTV} (the amino acids with defined frequences) in the matching protein are expected to be proline residues. <br></br> | ||
- | Example:< | + | Example:<br> |
<div class="sub6"> java -jar CompositionSearch.jar prot.fasta freqs.txt out.csv -N <br></br> | <div class="sub6"> java -jar CompositionSearch.jar prot.fasta freqs.txt out.csv -N <br></br> | ||
Line 439: | Line 444: | ||
The default value of this set of characters is 'X,' often used to represent unknown amino acid residues. Therefore the flag -XX is equivalent to normal behavior. Note that the set of characters being ignored is replaced by the -X flag, so it is always advisable to list X when using this flag.<br></br> | The default value of this set of characters is 'X,' often used to represent unknown amino acid residues. Therefore the flag -XX is equivalent to normal behavior. Note that the set of characters being ignored is replaced by the -X flag, so it is always advisable to list X when using this flag.<br></br> | ||
- | Example:< | + | Example:<br> |
<div class="sub6"> java -jar CompositionSearch.jar prot.fasta freqs.txt out.csv -XX*-<br></br> </div> | <div class="sub6"> java -jar CompositionSearch.jar prot.fasta freqs.txt out.csv -XX*-<br></br> </div> | ||
Line 460: | Line 465: | ||
<div class="row"> | <div class="row"> | ||
<div id="subheader" class="small-8 small-centered columns"> | <div id="subheader" class="small-8 small-centered columns"> | ||
- | < | + | <h5 id="results"><center> |
- | Cellulose Pathway Modeling </ | + | Cellulose Pathway Modeling </h5> |
<!-- ====== Flux Balance Analysis ====== --> | <!-- ====== Flux Balance Analysis ====== --> | ||
- | < | + | <h6><center>Flux Balance Analysis</h6><br> |
<h6> | <h6> | ||
+ | Modeling allows for efficient experimental design by optimizing specific conditions before testing with in vitro methods. Flux Balance Analysis (FBA) is used to optimize the growth medium for the bioengineered <i>Gluconacetobacter hansenii</i> to yield bacterial cellulose to be used as biomaterial for a biodegradable drone. FBA is a mathematical method used to examine how metabolites relate to each network and makes predictions for the growth of an organism and product output. It is a direct application of linear programming to biological systems that used the stoichiometric coefficient for each reaction in the system as the set of constraints for the optimization. | ||
- | </ | + | </h6> |
+ | </div></div> | ||
+ | |||
+ | <div class="small-8 small-centered columns"><br><center><img src="https://static.igem.org/mediawiki/2014/9/97/SBS_ModelingFigure.png"><br> | ||
+ | <h6><center>Sample pathway, stoichiometric matrix, and constraints vectors.</center></h6> | ||
+ | </div> | ||
+ | <div class="row"> | ||
+ | <div id="subheader" class="small-8 small-centered columns"> | ||
- | < | + | <h6> |
- | </br></br>FBA will be used to optimize the growth conditions of | + | Flux Balance Analysis is perfomed under steady state conditions and requires information about the stoichiometry of metabolic pathways, metabolic demands, and strain specific parameters. At steady state, there is no accumulation or depletion of metabolites in a metabolic network, so the production rate of each metabolite in the network must equal its rate of consumption [3]. |
- | + | ||
+ | </br></br>FBA will be used to optimize the growth conditions of <i>G. hansenii</i> | ||
+ | in order to maximize the product output of cellulose. The exchange reactions determine the metabolites that are beneficial to the medium. Changing the composition will allow us to be able to determine the effect the composition has on the efficiency of the production media. | ||
+ | </h6><br><br> | ||
<!-- ====== Model SEED- ModelView ====== --> | <!-- ====== Model SEED- ModelView ====== --> | ||
Line 480: | Line 496: | ||
<h5><center>Model SEED- ModelView</h5> | <h5><center>Model SEED- ModelView</h5> | ||
- | <h6>Model SEED is an online analysis software with its own genome and media formulation database. An analysis of | + | <h6>Model SEED is an online analysis software with its own genome and media formulation database. An analysis of <i>G. hansenii</i> is done with the curated model including a biomass reaction set for optimal organismal growth and a control of complete media. The biomass reaction describes the rate, which all the biomass precursors are made in correct proportions [2]. A biomass reaction requires a knowledge of the composition of a cell and the energy requirements that allow for the cell to grow. The biomass equation consists of three levels: macromolecular level (RNA, protein), intermediate level (biosynthetic energy), advanced level (vitamins, elements, and cofactors). The biomass equation also list to stochiometric coefficients for each part of the biomass equation. Using this biomass equation the growth rate was determined to be 66.48 gm per biomass CDW/hr was obtained. Since the flux is positive this explains that the flux of the analysis is indeed towards growth a negative or zero flux would have explained that the pathway is not used or it is impossible. |
- | </h6> | + | </h6><br><br> |
Line 488: | Line 504: | ||
<h5><center>Future Directions/Modifications</h5> | <h5><center>Future Directions/Modifications</h5> | ||
- | <h6>Although a growth rate is obtained using Model SEED, this rate is for an environment where a complete medium is used. The growth medium being used contains dextrose, peptone, yeast extract, Na2HPO4, citric acid, agar and water. Model SEED does not have the capabilities to create a media file, so all further analysis will be done using KBase because it has the ability to customize the media used for analysis. Kbase is a software that allows for online analysis of models, where models and the corresponding media for organismal growth can be optimized. The genome database contains all genomes including the G. hansenii genome (as a JSON file). A full database for | + | <h6>Although a growth rate is obtained using Model SEED, this rate is for an environment where a complete medium is used. The growth medium being used contains dextrose, peptone, yeast extract, Na2HPO4, citric acid, agar and water. Model SEED does not have the capabilities to create a media file, so all further analysis will be done using KBase because it has the ability to customize the media used for analysis. Kbase is a software that allows for online analysis of models, where models and the corresponding media for organismal growth can be optimized. The genome database contains all genomes including the <i>G. hansenii</i> genome (as a JSON file). A full database for media that exist and capabilities to create the media being used in experimentation through the IRIS interface, which is the command line environment used by Kbase. Using IRIS and the subsequent FBA tutorial, a flux balance analysis was attempted but several issue were observed because of command line issues. Kbase will allow for the majority of modifications to be done in order to obtain the correct model and medium, further help will be sought out in order to conquer the error messages seen. Once the error messages are fixed, further literature searches and consultations will be done in order to change the biomass reaction to one that will target the product growth of cellulose instead of organismal growth of <i>G. hansenii</i>. |
- | </h6> | + | Once an accurate growth rate with the proper medium is found, it will then be possible to plot the organism growth rate vs. product growth rate. Plotting this data will determine if there is a direct or inverse relationship between the two quantities. |
+ | </h6><br><br> | ||
Line 542: | Line 559: | ||
<div id="closing" class="small-8 small-centered columns"> | <div id="closing" class="small-8 small-centered columns"> | ||
<h6> | <h6> | ||
- | Built atop Foundation. Content & Development © Stanford–Brown–Spelman iGEM 2014. | + | Built atop Foundation. Content & Development © Stanford–Brown–Spelman iGEM 2014. |
</h6> | </h6> | ||
</div> | </div> |
Latest revision as of 04:01, 9 January 2015
Bioinformatics & Modeling
While working on our synthetic biology projects for this year's iGEM competition, we found ourselves in need of some computational tools for synthetic biology that did not yet exist. Therefore, we developed our own software packages to meet these needs. In particular, we have developed two pieces of software, specifically designed for the needs of synthetic biologists:
● DoubleOptimizer: a tool that facilitates synthesis of repetitive genes by optimizing codon usage to both match a codon usage distribution for a desired organism, and to reduce and avoid repetitive nucleotide sequences, allowing for easier synthesis.
● CompositionSearch: a tool that quickly ranks all proteins in a proteome by their similarity to a given amino acid distribution.
DoubleOptimizer
A utility for simultaneous codon and
gene synthesis optimization
gene synthesis optimization
Background
Gene synthesis as a tool for biological engineering presents both opportunities and challenges. One opportunity presented is the ability to optimize codon usage in a gene to match that of a host organism. Compared to traditional cloning methods, this can increase protein yields in the host organism by several fold.[1] However, while there exist a large number of freely-usable programs that perform codon optimization, there is no guarantee that the sequences these programs provide will be able to be synthesized. Specifically, in the case of genes with repetitive amino acid sequences, these programs will often generate outputs that contain too many repeated short DNA sequences to be synthesized commercially.
As an example, the hypothetical protein X777_06170 from the ant species Cerapachys biroi has an amino acid sequence that appears to be somewhat repetitive:
001 mklfkclvpv vvlllikdss arpglirdfv ggtvgsilep fqilkpkdsy adanshasah
061 nlggtfslgp vslggglssa sasssasang gglasasska daqaggygyg gsnanaqasa
121 sanaqgggyg nggihgiypg qqgvhggnpf lggagsnana naiananaqa naggnngglg
181 syggyqqggn ypidsstgpi gnnpflsggh gdgnanaaan anagasaign gggpidvnnp
241 flhggaansg agginyqpgn aggiilsekp lglptiypgq hppayldsig spgansnaga
301 napcsecgss gatilgyegq glggikesgs sgatilgyeg qglggikesg ssgatilgye
361 gqglggikes gssgatilgs ydgqgpsgat ilgdyngqgl ggikessgvt vlgdyegqgl
421 ggisgphggh gqaganagan ananagatvg ssggvlggvg dhggyhgyng hdgssglnlg
481 gygggsnana qassnalass ggsssatsda lsnahssggs alanssskas angsgsanan
541 ahassnassg shglgsktsa ssqasasadt rdmlifs[2]
Note that this sequence is not simply the same sequence repeated multiple times, but instead contains several motifs on the order of 10 - 20 amino acids in length that occur several times. When this sequence was run through the codon optimization program for expression in ''E. coli'' provided by a major DNA synthesis firm, the resulting output could not be synthesized by the very same firm: the "optimized" DNA sequence contained too many recurring short (> 8 nucleotide) DNA sequences to allow for synthesis.
Manually correcting for repeats in the codon-optimized DNA sequence is a sub-optimal solution: not only is this process time-consuming, but it has the tendency to undo the codon-optimization: if a sequence of amino acids occurs several times, one may be forced to use all possible codon-combinations to represent this sequence to avoid nucleotide-sequence repetition. Unless corrected for by skewing codon usage elsewhere in the sequence, this will tend to make the codon usage more uniform than is optimal for the expression vector. Additionally, any changes made in either correcting for repeats or re-correcting for codon usage may in turn introduce additional repeats.
Background
Gene synthesis as a tool for biological engineering presents both opportunities and challenges. One opportunity presented is the ability to optimize codon usage in a gene to match that of a host organism. Compared to traditional cloning methods, this can increase protein yields in the host organism by several fold.[1] However, while there exist a large number of freely-usable programs that perform codon optimization, there is no guarantee that the sequences these programs provide will be able to be synthesized. Specifically, in the case of genes with repetitive amino acid sequences, these programs will often generate outputs that contain too many repeated short DNA sequences to be synthesized commercially.
As an example, the hypothetical protein X777_06170 from the ant species Cerapachys biroi has an amino acid sequence that appears to be somewhat repetitive:
001 mklfkclvpv vvlllikdss arpglirdfv ggtvgsilep fqilkpkdsy adanshasah
061 nlggtfslgp vslggglssa sasssasang gglasasska daqaggygyg gsnanaqasa
121 sanaqgggyg nggihgiypg qqgvhggnpf lggagsnana naiananaqa naggnngglg
181 syggyqqggn ypidsstgpi gnnpflsggh gdgnanaaan anagasaign gggpidvnnp
241 flhggaansg agginyqpgn aggiilsekp lglptiypgq hppayldsig spgansnaga
301 napcsecgss gatilgyegq glggikesgs sgatilgyeg qglggikesg ssgatilgye
361 gqglggikes gssgatilgs ydgqgpsgat ilgdyngqgl ggikessgvt vlgdyegqgl
421 ggisgphggh gqaganagan ananagatvg ssggvlggvg dhggyhgyng hdgssglnlg
481 gygggsnana qassnalass ggsssatsda lsnahssggs alanssskas angsgsanan
541 ahassnassg shglgsktsa ssqasasadt rdmlifs[2]
061 nlggtfslgp vslggglssa sasssasang gglasasska daqaggygyg gsnanaqasa
121 sanaqgggyg nggihgiypg qqgvhggnpf lggagsnana naiananaqa naggnngglg
181 syggyqqggn ypidsstgpi gnnpflsggh gdgnanaaan anagasaign gggpidvnnp
241 flhggaansg agginyqpgn aggiilsekp lglptiypgq hppayldsig spgansnaga
301 napcsecgss gatilgyegq glggikesgs sgatilgyeg qglggikesg ssgatilgye
361 gqglggikes gssgatilgs ydgqgpsgat ilgdyngqgl ggikessgvt vlgdyegqgl
421 ggisgphggh gqaganagan ananagatvg ssggvlggvg dhggyhgyng hdgssglnlg
481 gygggsnana qassnalass ggsssatsda lsnahssggs alanssskas angsgsanan
541 ahassnassg shglgsktsa ssqasasadt rdmlifs[2]
Note that this sequence is not simply the same sequence repeated multiple times, but instead contains several motifs on the order of 10 - 20 amino acids in length that occur several times. When this sequence was run through the codon optimization program for expression in ''E. coli'' provided by a major DNA synthesis firm, the resulting output could not be synthesized by the very same firm: the "optimized" DNA sequence contained too many recurring short (> 8 nucleotide) DNA sequences to allow for synthesis.
Manually correcting for repeats in the codon-optimized DNA sequence is a sub-optimal solution: not only is this process time-consuming, but it has the tendency to undo the codon-optimization: if a sequence of amino acids occurs several times, one may be forced to use all possible codon-combinations to represent this sequence to avoid nucleotide-sequence repetition. Unless corrected for by skewing codon usage elsewhere in the sequence, this will tend to make the codon usage more uniform than is optimal for the expression vector. Additionally, any changes made in either correcting for repeats or re-correcting for codon usage may in turn introduce additional repeats.
Solution: DoubleOptimizer
DoubleOptimizer is a software tool we have created to optimize codon usage in a gene both to match a given codon usage distribution and to avoid repetition of nucleotide sequences. Given a DNA or amino acid sequence and a desired codon distribution, DoubleOptimizer will produce, within a matter of seconds, an equivalent sequence that has substantially reduced DNA sequence repetition, while also closely matching the desired codon usage.
Availability and Usage
DoubleOptimizer may be downloaded here. DoubleOptimizer is a command line utility, provided as a Java jar file. It can be invoked from command line on any system with Java 7 or later installed, with the following syntax:
java -jar DoubleOptimizer.jar seq.txt codons.txt
where "seq.txt" is a DNA sequence, stored as a plain text file, and "codons.txt" is a file containing the desired codon distribution to match. It should be formatted as plain text, according to the following example template:
A
GCG .36
GCC .27
GCA .21
GCT .16
R
CGC .40
CGT .38
CGG .10
CGA .06
AGA .04
AGG .02
N
AAC .55
AAT .45
D
GAT .63
GAC .37
C
TGC .55
TGT .45
E
GAA .69
GAG .31
Q
CAG .65
CAA .35
G
GGC .40
GGT .34
GGG .15
GGA .11
H
CAT .57
CAC .43
I
ATT .51
ATC .42
ATA .07
L
CTG .50
TTG .13
TTA .13
CTT .10
CTC .10
CTA .04
K
AAA .77
AAG .23
M
ATG 1
F
TTT .57
TTC .43
P
CCG .52
CCA .19
CCT .16
CCC .12
S
AGC .28
AGT .15
TCG .15
TCT .15
TCC .15
TCA .12
*
TAA .64
TGA .29
TAG .07
T
ACC .44
ACG .27
ACT .17
ACA .13
W
TGG 1
Y
TAT .57
TAC .43
V
GTG .37
GTT .26
GTC .22
GTA .15
(Note that the above example is actually the codon usage distribution of E. coli.[3])
DoubleOptimizer supports non-canonical codon assignments: the amino acid-codon groupings can by specified in whatever way the user wants in the codon distribution file.
When executed, DoubleOptimizer will first display the input sequence with repetitive regions highlighted. It will also give the fraction of the sequence that initially consists of repetitive regions (defined by default as regions of 8 nucleotides or more that occur more than once in the sequence, including as their reverse complement), and a chi-squared value for the goodness-of-fit to the desired codon distribution (lower is better).
DoubleOptimizer will then compute and display the optimized sequence (By default, it will produce the best sequence it can find after 10 seconds of computation time). Again, repetitive regions will be highlighted, and the same measurements of repetitiveness and codon fit will be given.
The following optional flags may be used to change the program's behavior:
- -A
This allows for an amino-acid sequence, specified in single-letter code, to be used as input instead of a DNA sequence. The initial sequence statistics displayed will be for a uniform random reverse translation of the given amino acid sequence.
Example:
java -jar DoubleOptimizer.jar aaseq.txt codons.txt -A
- -T##
This allows the user to specify, in seconds, a different run-time for the program other than the default 10 seconds. While 10 seconds should be sufficient to produce a well-optimized result for most genes of moderate length on modern desktop computers, longer times may produce better-optimized results on slower machines or on longer sequences.
Example:
java -jar DoubleOptimizer.jar aaseq.txt codons.txt -A -T30
- -L##
This allows the user to specify a different minimum length for what is considered a repeat, other than the default 8 nucleotides.
Example:
java -jar DoubleOptimizer.jar seq.txt codons.txt -L7 -T15
- -S##, -E##
These allow the user to specify the starting and ending nucleotide, respectively, of the coding region in a construct sequence to be sythesized. Nucleotides outside this frame will be ignored for codon usage optimization, and will never be modified. This option is useful for preventing repetitions, within the coding region, of fixed sequences that occur at the ends of a construct to be sythesized, outside of the coding region. The default values are the beginning of the sequence, and the end of the last complete codon. These options may be used together or independently. Values are one-indexed. If used with ''-A'', these will be interpreted as amino acid indices.
Example:
java -jar DoubleOptimizer.jar seq.test codons.txt -S121 -E1853 -R1000 -D100
- -D##
This option will make the program periodically display the current best sequence, and associated statistics, as it runs. The number given is the number of optoimization cycles the program will perform between each round of displaying the sequence. This provides a continuous measure of progress on long optimization runs. Note that, when given 10 seconds, the program may execute several thousand cycles of optimization, so an argument on the order of 100 may be reasonable.
Example:
java -jar DoubleOptimizer.jar seq.test codons.txt -S121 -E1853 -R1000 -D100
- -R##
This option will make the program halt optimization after a certain number of optimization cycles. This may be used with a large value of -T to standardize optimization quality between computers of different speeds. Because this option is mostly only useful for testing the efficiency of this program itself, it may be removed from future releases.
Example:
java -jar DoubleOptimizer.jar seq.test codons.txt -S121 -E1853 -T1000 -R1000
GCG .36
GCC .27
GCA .21
GCT .16
R
CGC .40
CGT .38
CGG .10
CGA .06
AGA .04
AGG .02
N
AAC .55
AAT .45
D
GAT .63
GAC .37
C
TGC .55
TGT .45
E
GAA .69
GAG .31
Q
CAG .65
CAA .35
G
GGC .40
GGT .34
GGG .15
GGA .11
H
CAT .57
CAC .43
I
ATT .51
ATC .42
ATA .07
L
CTG .50
TTG .13
TTA .13
CTT .10
CTC .10
CTA .04
K
AAA .77
AAG .23
M
ATG 1
F
TTT .57
TTC .43
P
CCG .52
CCA .19
CCT .16
CCC .12
S
AGC .28
AGT .15
TCG .15
TCT .15
TCC .15
TCA .12
*
TAA .64
TGA .29
TAG .07
T
ACC .44
ACG .27
ACT .17
ACA .13
W
TGG 1
Y
TAT .57
TAC .43
V
GTG .37
GTT .26
GTC .22
GTA .15
- -A
- -T##
- -L##
- -S##, -E##
- -D##
- -R##
Examples of Use
As an example of the normal operation of the program on an amino acid sequence, consider the above-listed example of a repetitive amino acid sequence, hypothetical protein X777_06170 from the ant species Cerapachys biroi. At one point in our waterproofing project, this protein was considered for sythesis as a possible homologue of the waterproofing Polistes protein. DoubleOptimizer was originally written to overcome difficulties in synthesising this protein for expression in E. coli. (Due to further advances in the Polistes project, this gene was never synthesized). A screenshot of a run of DoubleOptimizer on the amino acid sequence for this protein (listed above), with the E. coli codon usage table also listed above, with default settings for 10 seconds on a laptop computer (2011 model Macbook Pro) is shown below:
Note that the number of repeated sequences is significantly reduced, and the closeness-of-fit to the desired codon sequence is much improved, compared to the initial random reverse-translation. As an example of DoubleOptimizer run on a DNA sequence, a somewhat-repetitive Homo sapiens gene was arbitrarily selected from the NCBI database. Specifically, the exon sequence for Ca+2-binding protein CBP86 form VII was optimized for synthesis, again in E. coli. The same computer setup was used, and DoubleOptimizer was run for the default 10 seconds:
In this case, the sequence went from being over 47% composed of repetitive sequences 8 nucleotides or more in length to being completely non-repetitive. Also, the codon usage distribution is now a near-perfect match to the E. coli codon usage distribution.
Algorithm
DoubleOptimizer uses a genetic optimization algorithm.[4] This algorithm mimics natural selection by taking the original gene sequence and generating several variants by copying the sequence and creating random silent point mutations in each copy. Each variant is then assigned a fitness score based on its repetitiveness and closeness-of-fit to the desired codon usage distribution, and the most fit variants are kept. These are then re-copied, re-mutated and re-scored, and again only the most fit variants are kept, through several thousand rounds of selection. When optimization time runs out, the most fit member of the population after the final round of selection is displayed.
One challenge in implementing this algorithm efficiently is that scoring for repetitiveness is a time-consuming process (the time belongs to O(n2) in the length of the sequence). Because many thousands of sequences are scored in a single optimization, finding a way to reduce the computational burden of this step was a task of paramount importance. Our solution was to store for each sequence a matrix representing the locations in the sequence where repetitions are already known to occur, and to only attempt to add or remove repetitions that could by affected by the point mutations that were made. This significantly improved run-time, compared to the unsophisticated technique of independently re-scoring each modified sequence.
References
1. Several references exist that establish the link between codon usage and expression. To give some highly-cited examples:
Ikemura, T. (1985) Codon usage and tRNA content in unicellular and multicellular organisms. Mol Biol Evol. 2 (1): 13-34. PMID: 3916708
Gouy, M. and Gautier, C. (1982) Codon usage in bacteria: correlation with gene expressivity. Nucl. Acids Res. 10 (22): 7055-7074. PMCID: PMC326988
Sharp, PM. and Li, W.-H. (1987) The codon adaptation index-a measure of directional synonymous codon usage bias, and its potential applications Nucl. Acids Res. 15 (3): 1281-1295. PMID: 3547335
2. Protein sequence from NCBI, here.
Original paper: Oxley PR, et al. (2014) The genome of the clonal raider ant Cerapachys biroi. Curr Biol. 24(4):451-8. PMID: 24508170
3. Nakamura, Y. (2007) Codon Usage Database: Escherichia coli W3110. http://www.kazusa.or.jp/codon/cgi-bin/showcodon.cgi?species=316407.
4. For a good introduction to genetic algorithms, see:
Goldberg, DE. (1989) Genetic Algorithms in Search, Optimization, and Machine Learning. Reading, Mass: Addison-Wesley Pub. Co.
CompositionSearch
A fast, local search of protein sequences
against a specified amino acid distribution
against a specified amino acid distribution
Background
When the relative proportions of amino acids in an unknown protein product have been chemically determined, it is often useful to search a proteome for proteins that have similar amino acid distributions, in order to identify this protein product. While at least one online utility for performing this task already exists (provided by the Swiss Institute of Bioinformatics ), the web-based nature of this program creates some limitations. Firstly, the SIB provides computational resources for the calculation, resulting in slower turnaround for the user (searches take about 15 minutes). Secondly, this program will only search for proteins already in the Swiss-Prot / TrEMBL databases. (At last check, only the smaller Swiss-Prot search was functional.) If an organism is being newly studied and has just been sequenced, its predicted proteome will not be in these databases. Thirdly, due to limited resources, only the top matches to a given search are provided. This does not allow for statistical comparison to the "typical" protein within a given proteome. Fourthly, for very-high-security tasks, uploading data to a third party may be undesirable.
Solution: CompositionSearch
CompositionSearch is a software tool we have created to address these issues by allowing an individual to rank all proteins in a proteome by similarity (minimum Euclidian distance) to a reference amino acid distribution locally on one's own computer. This ranking can be generated in a matter of seconds, rather than taking several minutes. Because it ranks all proteins in a proteome, CompositionSearch can also generate a figure for the significance of the similarity of a given protein to a given amino acid distribution, using the similarity of the rest of the proteome as a statistical distribution function.
Availability and Usage
CompositionSearch may be downloaded here.
CompositionSearch is a command line utility, provided as a Java jar file. It can be invoked from command line on any system with Java 7 or later installed, with the following syntax:
java -jar CompositionSearch.jar prot.fasta freqs.txt out.csv
where "prot.fasta" is a FASTA-formatted proteome, and "freqs.txt" is a file containing the desired amino distribution to match. It should be formatted as plain text, according to the following example template:
A 0.134
D 0.044
E 0.04
G 0.228
I 0.033
K 0.021
L 0.04
P 0.08
R 0.009
S 0.151
T 0.038
V 0.083
Note that this may contain as many or as few amino acids as desired. The frequencies, however, are interpreted as absolute, so if all amino acids are represented, they should add up to 1. (See below for discussion of how the optional -N and -X flags affect this interpretation).
"out.csv" in the above example line is the destination path to store the results of the calculation. The output will be a spreadsheet in csv format, which may be imported into your favorite desktop spreadsheet application (e.g. Microsoft Excel, LibreOffice Calc, etc.).
After execution, out.csv will contain a spreadsheet showing the reference distribution, and, in order by similarity to the reference distribution, the amino acid distributions of all proteins in the proteome. Protein name, similarity ranking, similarity score (Euclidian distance between amino acid distributions; lower is better), and similarity p-value (on the curve of other proteins in the proteome) will be listed for each protein.
The following optional flags may be used to change the program's behavior:
- -N
This will cause the program to ignore amino acids in the proteins that are not in the distribution list. In other words, this means that the frequencies given refer to frequencies relative to only the the other amino acids listed, instead of all amino acids. To clarify:
Without the -N flag, the line in the above example distribution list:
P 0.08
means that 8% of all amino acid residues in the matching protein are expected to be proline residues.
With the -N flag, the line in the above example distribution list:
P 0.08
means that 8% of amino acid residues that belong in the set {ADEGIKLPRSTV} (the amino acids with defined frequences) in the matching protein are expected to be proline residues.
Example:
java -jar CompositionSearch.jar prot.fasta freqs.txt out.csv -N
- -X$$
This will cause the program to completely disregard certain amino acid residue symbols in the proteome, regardless of use of the N flag.
The default value of this set of characters is 'X,' often used to represent unknown amino acid residues. Therefore the flag -XX is equivalent to normal behavior. Note that the set of characters being ignored is replaced by the -X flag, so it is always advisable to list X when using this flag.
Example:
java -jar CompositionSearch.jar prot.fasta freqs.txt out.csv -XX*-
This will cause the symbols X,*, and - to be ignored in the proteome.
D 0.044
E 0.04
G 0.228
I 0.033
K 0.021
L 0.04
P 0.08
R 0.009
S 0.151
T 0.038
V 0.083