Team:StanfordBrownSpelman/Modelling

From 2014.igem.org

Revision as of 18:59, 10 October 2014 by Jotthek

Stanford–Brown–Spelman iGEM 2014

As an example, the hypothetical protein X777_06170 from the ant species Cerapachys biroi has an amino acid sequence that appears to be somewhat repetitive:
001 mklfkclvpv vvlllikdss arpglirdfv ggtvgsilep fqilkpkdsy adanshasah

061 nlggtfslgp vslggglssa sasssasang gglasasska daqaggygyg gsnanaqasa

121 sanaqgggyg nggihgiypg qqgvhggnpf lggagsnana naiananaqa naggnngglg

181 syggyqqggn ypidsstgpi gnnpflsggh gdgnanaaan anagasaign gggpidvnnp

241 flhggaansg agginyqpgn aggiilsekp lglptiypgq hppayldsig spgansnaga

301 napcsecgss gatilgyegq glggikesgs sgatilgyeg qglggikesg ssgatilgye

361 gqglggikes gssgatilgs ydgqgpsgat ilgdyngqgl ggikessgvt vlgdyegqgl

421 ggisgphggh gqaganagan ananagatvg ssggvlggvg dhggyhgyng hdgssglnlg

481 gygggsnana qassnalass ggsssatsda lsnahssggs alanssskas angsgsanan

541 ahassnassg shglgsktsa ssqasasadt rdmlifs
Note that this sequence is not simply the same sequence repeated multiple times, but instead contains several motifs, on the order of 10 to 20 amino acids in length, that each occur several times. When this sequence was run through the codon optimization program for expression in ''E. coli'' provided by a major DNA synthesis firm, the resulting output could not be synthesized by the very same firm: the "optimized" DNA sequence contained too many recurring short (> 8 nucleotides) DNA sequences to allow for synthesis.

Manually correcting for repeats in the codon-optimized DNA sequence is a sub-optimal solution. Not only is this process time-consuming, but it also tends to undo the codon optimization: if a sequence of amino acids occurs several times, one may be forced to use all possible codon combinations to represent this sequence in order to avoid nucleotide-sequence repetition. Unless corrected for by skewing codon usage elsewhere in the sequence, this will tend to make the codon usage more uniform than is optimal for the expression host. Additionally, any changes made in either correcting for repeats or re-correcting for codon usage may in turn introduce additional repeats.

Solution: DoubleOptimizer
DoubleOptimizer is a software tool we have created to optimize codon usage in a gene both to match a given codon usage distribution and to avoid repetition of nucleotide sequences. Given a DNA or amino acid sequence and a desired codon distribution, DoubleOptimizer will produce, within a matter of minutes, an equivalent sequence that has substantially reduced DNA sequence repetition, while also closely matching the desired codon usage.
Availability and Usage
DoubleOptimizer may be downloaded here

DoubleOptimizer is a command line utility, provided as a Java jar file. It can be invoked from the command line on any system with Java installed, with the following syntax:
java -jar DoubleOptimizer.jar seq.txt codons.txt [Optional flags]
where "seq.txt" is a plain text file containing a DNA sequence, and "codons.txt" is a plain text file containing the desired codon distribution to match. The codon distribution file should be formatted according to the following example template:

A GCG .36 GCC .27 GCA .21 GCT .16
R CGC .40 CGT .38 CGG .10 CGA .06 AGA .04 AGG .02
N AAC .55 AAT .45
D GAT .63 GAC .37
C TGC .55 TGT .45
E GAA .69 GAG .31
Q CAG .65 CAA .35
G GGC .40 GGT .34 GGG .15 GGA .11
H CAT .57 CAC .43
I ATT .51 ATC .42 ATA .07
L CTG .50 TTG .13 TTA .13 CTT .10 CTC .10 CTA .04
K AAA .77 AAG .23
M ATG 1
F TTT .57 TTC .43
P CCG .52 CCA .19 CCT .16 CCC .12
S AGC .28 AGT .15 TCG .15 TCT .15 TCC .15 TCA .12
* TAA .64 TGA .29 TAG .07
T ACC .44 ACG .27 ACT .17 ACA .13
W TGG 1
Y TAT .57 TAC .43
V GTG .37 GTT .26 GTC .22 GTA .15
(Note that the above example is actually the codon usage distribution of E. coli.)

DoubleOptimizer supports non-canonical codon assignments: the amino acid-codon groupings can be specified in whatever way the user wants in the codon distribution file.
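To make the file layout concrete, here is a minimal Python sketch of how a codon distribution in the template format might be read. It is not DoubleOptimizer's own parser: the function names parse_codon_table and unnormalized_groups are hypothetical, and the sketch assumes that any whitespace-separated token that is not a three-letter codon is a group label, and that each codon is followed by its frequency.

```python
import re

# A codon is exactly three DNA letters; any other token is treated as a group label.
CODON_RE = re.compile(r"^[ACGT]{3}$")


def parse_codon_table(text):
    """Return {label: {codon: frequency}} from whitespace-separated tokens.

    Assumed layout (hypothetical, modelled on the template above):
    a label such as "K" or "*", followed by codon/frequency pairs.
    """
    table = {}
    current = None
    tokens = text.split()
    i = 0
    while i < len(tokens):
        token = tokens[i].upper()
        if CODON_RE.match(token):
            if current is None:
                raise ValueError("codon %s appears before any group label" % token)
            if i + 1 >= len(tokens):
                raise ValueError("codon %s has no frequency" % token)
            table[current][token] = float(tokens[i + 1])
            i += 2
        else:
            # Start (or resume) a group; labels are kept exactly as written.
            current = tokens[i]
            table.setdefault(current, {})
            i += 1
    return table


def unnormalized_groups(table, tol=0.02):
    """List labels whose frequencies do not sum to 1 within a rounding tolerance."""
    return [label for label, freqs in table.items()
            if abs(sum(freqs.values()) - 1.0) > tol]


sample = "M ATG 1 F TTT .57 TTC .43 P CCG .52 CCA .19 CCT .16 CCC .12"
table = parse_codon_table(sample)
```

Because the grouping is taken from the file rather than from a built-in genetic code, a sketch like this handles non-canonical assignments without special cases. The tolerance in unnormalized_groups allows for rounded frequencies that sum to 0.99 or 1.01.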

When executed, DoubleOptimizer will first display the input sequence with repetitive regions highlighted. It will also give the fraction of the sequence that initially consists of repetitive regions (defined by default as regions of 8 nucleotides or more that occur more than once in the sequence, including as their reverse complement), and a chi-squared value for the goodness-of-fit to the desired codon distribution (lower is better).
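The repetitiveness measure described above can be sketched in a few lines of Python. This is an illustration of the stated definition, not DoubleOptimizer's code: the names reverse_complement, repeat_mask and repeat_fraction are hypothetical, and the sketch assumes that a window which is its own reverse complement but occurs only once does not count as a repeat.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def repeat_mask(seq, min_len=8):
    """Flag each nucleotide covered by a window of min_len nucleotides that
    occurs more than once, counting reverse-complement occurrences.

    Any repeat longer than min_len is made of repeated min_len windows,
    so the union of flagged windows covers longer repeats as well.
    """
    seq = seq.upper()
    starts_by_key = defaultdict(list)
    for i in range(len(seq) - min_len + 1):
        window = seq[i:i + min_len]
        # A window and its reverse complement share one canonical key.
        key = min(window, reverse_complement(window))
        starts_by_key[key].append(i)
    mask = [False] * len(seq)
    for starts in starts_by_key.values():
        if len(starts) > 1:
            for start in starts:
                for j in range(start, start + min_len):
                    mask[j] = True
    return mask


def repeat_fraction(seq, min_len=8):
    """Fraction of the sequence that lies inside repetitive regions."""
    mask = repeat_mask(seq, min_len)
    return sum(mask) / len(mask) if mask else 0.0
```

The min_len parameter plays the role of the default 8-nucleotide threshold. The mask could also drive the highlighting of repetitive regions, although how the program renders them is not specified here.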

DoubleOptimizer will then compute and display the optimized sequence (by default, the best sequence it can find after 10 seconds of computation time). Again, repetitive regions will be highlighted, and the same measurements of repetitiveness and codon fit will be given.
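The exact form of the chi-squared statistic is not given here. One plausible reading is a Pearson chi-squared computed within each amino acid group, comparing the observed codon counts to the counts expected from the desired frequencies. The Python sketch below implements that reading as an assumption; codon_chi_squared is a hypothetical name, and the table argument is assumed to be a mapping of the form {label: {codon: frequency}}.

```python
from collections import Counter


def codon_chi_squared(dna, table):
    """Pearson chi-squared of observed codon counts against a target table.

    Assumption: the expected count of a codon is its target frequency
    multiplied by the number of codons observed for its group.
    """
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # drop any trailing partial codon
    observed = Counter(dna[i:i + 3] for i in range(0, usable, 3))
    chi2 = 0.0
    for freqs in table.values():
        group_total = sum(observed[codon] for codon in freqs)
        if group_total == 0:
            continue  # this group does not occur in the sequence
        for codon, freq in freqs.items():
            expected = freq * group_total
            if expected > 0:
                chi2 += (observed[codon] - expected) ** 2 / expected
    return chi2


# Hypothetical two-codon table, chosen so that the arithmetic is easy to follow.
lysine_only = {"K": {"AAA": 0.75, "AAG": 0.25}}
perfect_fit = codon_chi_squared("AAAAAAAAAAAG", lysine_only)
poor_fit = codon_chi_squared("AAGAAGAAGAAG", lysine_only)
```

In the second call all four lysine codons are AAG, where three AAA and one AAG would be expected, so the statistic is (0 - 3)^2 / 3 + (4 - 1)^2 / 1 = 12. In the first call the counts match the target exactly and the statistic is 0, consistent with lower values indicating a better fit.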

The following optional flags may be used to change the program's behavior:

*'''''-A'''''

This allows for an amino-acid sequence, specified in single-letter code, to be used as input instead of a DNA sequence. The initial sequence statistics displayed will be for a uniform random reverse translation of the given amino acid sequence.

Example: java -jar DoubleOptimizer.jar aaseq.txt codons.txt -A

*'''''-T##'''''

This allows the user to specify, in seconds, a run time other than the default 10 seconds. While 10 seconds should be sufficient to produce a well-optimized result for most genes of moderate length on modern desktop computers, longer times may produce better-optimized results on slower machines or on longer sequences.

Example:

java -jar DoubleOptimizer.jar aaseq.txt codons.txt -A -T30

*'''''-L##'''''

This allows the user to specify a minimum length for what is considered a repeat, other than the default 8 nucleotides.

Example:

java -jar DoubleOptimizer.jar seq.txt codons.txt -L7 -T15

*'''''-S##, -E##'''''

These allow the user to specify the starting and ending nucleotide, respectively, of the coding region in a construct sequence to be synthesized. Nucleotides outside this frame will be ignored for codon usage optimization, and will never be modified. This option is useful for preventing repetitions, within the coding region, of fixed sequences that occur at the ends of a construct to be synthesized, outside of the coding region. The default values are the beginning of the sequence, and the end of the last complete codon. These options may be used together or independently. Values are one-indexed. If used with ''-A'', these will be interpreted as amino acid indices.

Example:

java -jar DoubleOptimizer.jar seq.test ecoli2.txt -S121 -E1853 -R1000 -D100

*'''''-D##'''''

This option will make the program periodically display the current best sequence, and associated statistics, as it runs. The number given is the number of optimization cycles the program will perform between each round of displaying the sequence. This provides a continuous measure of progress on long optimization runs. Note that, when given 10 seconds, the program may execute several thousand cycles of optimization, so an argument on the order of 100 may be reasonable.

Example:

java -jar DoubleOptimizer.jar seq.test ecoli2.txt -S121 -E1853 -R1000 -D100

*'''''-R##'''''

This option will make the program halt optimization after a certain number of optimization cycles. This may be used with a large value of -T to standardize optimization quality between computers of different speeds. Because this option is mostly only useful for testing the efficiency of this program itself, it may be removed from future releases.
Example:

java -jar DoubleOptimizer.jar seq.test ecoli2.txt -S121 -E1853 -T1000 -R1000

'''Case Studies and Performance Data'''

*To be added

'''References'''

*To be added

== CompositionSearch: a simple utility for local, fast search of protein sequences for matches to a specified amino acid distribution ==

'''Background'''

When the relative proportions of amino acids in an unknown protein product have been chemically determined, it is often useful to search a proteome for proteins that have similar amino acid distributions, in order to identify this protein product. While at least one online utility for performing this task already exists (provided by the Swiss Institute of Bioinformatics, [http://web.expasy.org/aacompident/aacompfree.html here]), the web-based nature of this program creates some limitations. First, the SIB provides the computational resources for the calculation, which results in slower turnaround for the user (searches take about 15 minutes). Second, this program will only search for proteins already in the Swiss-Prot / TrEMBL databases. (At last check, only the smaller Swiss-Prot search was functional.) If an organism is being newly studied and has just been sequenced, its predicted proteome will not be in these databases. Third, due to limited resources, only the top matches to a given search are provided. This does not allow for statistical comparison to the "typical" protein within a given proteome. Fourth, for very-high-security tasks, uploading data to a third party may be undesirable.

'''Solution: CompositionSearch'''

CompositionSearch is a software tool we have created to address these issues by allowing an individual to rank all proteins in a proteome by similarity (minimum Euclidean distance) to a reference amino acid distribution, locally on one's own computer. This ranking can be generated in a matter of seconds, rather than taking several minutes.
Because it ranks all proteins in a proteome, CompositionSearch can also generate a figure for the significance of the similarity of a given protein to a given amino acid distribution, using the similarity of the rest of the proteome as a statistical distribution function.

'''Availability and Usage'''

CompositionSearch may be downloaded [https://drive.google.com/a/brown.edu/file/d/0B6Q5Eo65G4cPbnpQYlNlcU1QUE0/view?usp=sharing here]. (Note: this link is not public yet, but it will be converted to public before Wiki Freeze Day.)

CompositionSearch is a command line utility, provided as a Java jar file. It can be invoked from the command line on any system with Java installed, with the following syntax:

java -jar CompositionSearch.jar prot.fasta freqs.txt out.csv [Optional flags]

where "prot.fasta" is a FASTA-formatted proteome, and "freqs.txt" is a plain text file containing the desired amino acid distribution to match. The distribution file should be formatted according to the following example template:

A 0.134
D 0.044
E 0.04
G 0.228
I 0.033
K 0.021
L 0.04
P 0.08
R 0.009
S 0.151
T 0.038
V 0.083

Note that this may contain as many or as few amino acids as desired. The frequencies, however, are interpreted as absolute, so if all amino acids are represented, they should sum to 1. (See below for discussion of how the optional -N and -X flags affect this interpretation.)

"out.csv" in the above example line is the destination path to store the results of the calculation. The output will be a spreadsheet in csv format, which may be imported into your favorite desktop spreadsheet application (e.g. Microsoft Excel, LibreOffice Calc, etc.).

After execution, out.csv will contain a spreadsheet showing the reference distribution, and, in order by similarity to the reference distribution, the amino acid distributions of all proteins in the proteome.
Protein name, similarity ranking, similarity score (Euclidean distance between amino acid distributions; lower is better), and similarity p-value (on the curve of other proteins in the proteome) will be listed for each protein.

The following optional flags may be used to change the program's behavior:

*'''''-N'''''

This will cause the program to ignore amino acids in the proteins that are not in the distribution list. In other words, the frequencies given are interpreted relative to only the other amino acids listed, instead of all amino acids. To clarify: without the -N flag, the line in the above example distribution list:

P 0.08

means that 8% of all amino acid residues in the matching protein are expected to be proline residues. With the -N flag, the same line means that 8% of the amino acid residues that belong to the set {ADEGIKLPRSTV} (the amino acids with defined frequencies) in the matching protein are expected to be proline residues.

Example:

java -jar CompositionSearch.jar prot.fasta freqs.txt out.csv -N

*'''''-X$$$'''''

This will cause the program to completely disregard certain amino acid residue symbols in the proteome, regardless of use of the -N flag. The default value of this set of characters is 'X', often used to represent unknown amino acid residues. Therefore the flag -XX is equivalent to normal behavior. Note that the set of characters being ignored is replaced by the -X flag, so it is always advisable to list X when using this flag.

Example:

java -jar CompositionSearch.jar prot.fasta freqs.txt out.csv -XX*-

This will cause the symbols X, *, and - to be ignored in the proteome.
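The scoring and the effect of the two flags can be illustrated with a short Python sketch. It is a reading of the description above rather than CompositionSearch's actual code. The function names and the normalize and ignore parameters (standing in for -N and -X) are hypothetical. The distance is assumed to be taken over the listed amino acids only, and the p-value is assumed to be the empirical fraction of the proteome ranked at or above a protein, which may differ from what the program reports.

```python
import math
from collections import Counter


def composition(protein, reference, normalize=False, ignore="X"):
    """Frequencies of the reference amino acids in one protein sequence.

    ignore    -- symbols dropped entirely (the role of the -X flag)
    normalize -- if True, count only residues listed in the reference
                 (the role of the -N flag)
    """
    residues = [r for r in protein.upper() if r not in ignore]
    if normalize:
        residues = [r for r in residues if r in reference]
    counts = Counter(residues)
    total = len(residues)
    return {aa: (counts[aa] / total if total else 0.0) for aa in reference}


def distance(protein, reference, normalize=False, ignore="X"):
    """Euclidean distance to the reference distribution (lower is better)."""
    comp = composition(protein, reference, normalize, ignore)
    return math.sqrt(sum((comp[aa] - reference[aa]) ** 2 for aa in reference))


def rank_proteome(proteome, reference, normalize=False, ignore="X"):
    """Return (name, rank, distance, empirical p-value) tuples, best match first."""
    scored = sorted((distance(seq, reference, normalize, ignore), name)
                    for name, seq in proteome.items())
    n = len(scored)
    return [(name, rank, dist, rank / n)
            for rank, (dist, name) in enumerate(scored, start=1)]


# Hypothetical toy data: a two-residue reference and three short "proteins".
reference = {"A": 0.5, "G": 0.5}
proteome = {"p1": "AGAG", "p2": "AAAG", "p3": "AGLLLL"}
ranking = rank_proteome(proteome, reference)
```

In this toy proteome, p3 ranks last without normalization, because its leucine residues dilute the A and G frequencies. With normalize=True the leucines are ignored and p3 matches the reference exactly, which is the behavior described for the -N flag.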