Team:Penn State/CodonOptimization

From 2014.igem.org



WELCOME TO PENN STATE iGEM 2014!


CODON OPTIMIZATION: ENGINEERING A MORE USEFUL GENE AT THE CODON LEVEL


Project Summary


Background


Many important bioproducts are composed of proteins, and proteins are made up of amino acids, each specified by codons in an organism's DNA. This DNA is transcribed and then translated to create these proteins. Since multiple codons can specify the same amino acid in E. coli, it is possible to use multiple coding sequences to produce the same chain of amino acids.


The Big Question


Even though genes with different codon preferences can code for the same protein, they may not necessarily do so at the same rate, or lead to expression of that protein at the same level [1]. The question we asked is "How and why are they different?"

To biologists, the answer to this question may shed light on the nature of translation itself, and on the reasons that some codons are naturally preferred by the genome of E. coli. To engineers, it could act as an additional point of control over tricky genetic systems, and perhaps enable more efficient production of useful bioproducts. To undergrads, the pursuit of these answers was the summer of a lifetime.



Degenerate codons do not necessarily lead to equally efficient expression of an amino acid.


Our Objective


Codon optimization is not a new idea, but what makes our project special is that it used several novel criteria for optimization. By picking specific codons for a synthetic gene, we can determine the effect of certain types of codons on translation and ultimately draw conclusions that may be very useful to the biotech community.


Our Design

To optimize genes, we first had to choose criteria. We chose codons that were generally rare in the overall genome (rare, G1), common in the overall genome (common, G2), abundant in regions of the genome with known fast translation initiation (fast, G3), abundant in regions of the genome with known slow translation initiation (slow, G4), or predicted by biophysical modeling software to have a slow insertion time (slow insertion time, G5).

Next, we chose a gene to optimize: superfolder GFP, a well-studied reporter gene that was described by the 2008 Cambridge iGEM team (part BBa_I746916). We optimized the gene using our five criteria, then assembled plasmids containing all the parts necessary for the gene to be expressed. The "fast" optimized GFP has been submitted as a part to the registry, as it is the only variant that we have finished characterizing (part BBa_K1506002).

To find out more about the translational efficiency of our GFPs, we designed a degenerate ribosome binding site (dRBS) to be inserted in our construct upstream of all five variants. By measuring the expression plateaus at high translation initiation rates (TIR), we will be able to see how efficiently our GFPs are translated. Raising or eliminating plateaus at high TIR will be a marker of how effectively we optimized the gene.


Our final design looked like this:


Includes a promoter, RBS, leader sequence, variant gene, and terminator

Our Results


Unfortunately, we were unable to progress beyond the cloning phase with G4, and beyond the insertion of the dRBS with G1 and G5. Of the two variants that we characterized, only the results for G3 have been sequence verified and analyzed at this point.


Shows the adjusted average fluorescence of the cells that we measured graphed with the strength of the ribosome binding site in that strain.

The results show only a vague indication that expression of GFP increased as TIR increased. Unfortunately, only 14 strains were successfully characterized and sequenced. Characterization of the other GFPs will allow us to determine the success of our optimization.

Future Plans


First, the cloning will be completed. Once all five variants are present in the backbone with the degenerate ribosome binding site inserted, fluorescence data will be collected. Once each measured colony is sequenced, the fluorescence will be graphed alongside the predicted strength of the ribosome binding site. This will give us a good idea of the translational efficiency of each GFP. By comparing fluorescence data for a single ribosome binding site across multiple GFPs, we will be able to determine the effects of our optimization criteria.

References:

1. Subramaniam, Arvind R, Tao Pan, and Philippe Cluzel. “Environmental Perturbations Lift the Degeneracy of the Genetic Code to Regulate Protein Levels in Bacteria.” Proceedings of the National Academy of Sciences of the United States of America 110.6 (2013): 2419–24. Web. 26 May 2014.











Complete Project Information- Beyond the Summary

Abstract


Codons are groups of three nucleotides that specify a single amino acid, which is then added to a growing polypeptide chain during translation. Even though each codon specifies only one amino acid, some amino acids are coded by multiple codons. It has been demonstrated that the genome of E. coli shows a statistical preference for some of these degenerate codons over others, and it is hypothesized that these codons translate more efficiently than non-preferred degenerate codons. We constructed synthetic reporter genes entirely from codons hypothesized to be fast or slow, and characterized them in E. coli. So far, we have characterized the level of GFP expression for G3, the fast coding sequence for superfolder GFP. G2, the common coding sequence, has been successfully cloned into the plasmid, and data for that variant are expected to be analyzed by the Giant Jamboree. There are still issues with inserting a dRBS into G1, the rare variant, and G5, the variant with calculated slow insertion times. G4, the slow variant, has not been successfully introduced into the backbone, so we decided to forgo this variant due to time constraints.




Illustrates the principle of codon redundancy. The number in each degenerate codon refers to the criterion that would lead to it being chosen.

Why is this important?


Examples of bioproducts vital to our lives are medicines, fuels, and even industrial chemicals. Codon optimization is important because it gives engineers an additional point of control over protein synthesis.

Our codon optimization research is also important because it will help future researchers develop more comprehensive models of translation. A better understanding of translation is a foundational advance that will lead to faster, more efficient research in many areas of biology. If, for example, our research shows clearly that certain degenerate codons are preferred because they can be translated more efficiently, this will allow scientists to search for a mechanism that predicts these effects, and will invite engineers to redesign genes to be translated more efficiently.



Metaphor Alert: Codon optimization can be thought of as an extra dial that can be tuned to rationally control output of genetic systems.

Background


Codon optimization refers to the idea that the individual codons of a gene in a specific organism can be changed in order to alter the behavior of that organism. This relies on an understanding of the central dogma of biology, which states that an organism produces proteins by first transcribing genetic material from DNA to RNA, which is then "read" by ribosomes that produce proteins based on the sequence of nucleotides in that RNA. The RNA is read three nucleotides at a time, and these three-letter series of nucleotides are called codons. Codons specify to the ribosome which amino acid to add to a growing amino acid chain.

There are 4 nucleotides, so 4^3 = 64 codons are possible. Since there are only 20 amino acids, there is redundancy in the codons; that is, some amino acids are specified by multiple codons. There is no ambiguity, however, meaning that each codon specifies only one amino acid. Codons that code for the same amino acid are called degenerate codons, and even though these degenerate codons code for the same amino acid, they do not necessarily lead to the same expression level of the protein that contains it.
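The counting argument above can be checked with a short sketch. It builds the standard genetic code from its conventional U, C, A, G ordering and groups codons by amino acid; the variable names are illustrative and not part of our project software.

```python
from collections import defaultdict
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon, with codons ordered
# U, C, A, G at each of the three positions ("*" marks a stop codon).
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)

CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Group synonymous (degenerate) codons by the amino acid they specify.
synonyms = defaultdict(list)
for codon, aa in CODON_TABLE.items():
    synonyms[aa].append(codon)

n_codons = len(CODON_TABLE)
n_amino_acids = sum(1 for aa in synonyms if aa != "*")
single_codon = sorted(aa for aa, codons in synonyms.items() if len(codons) == 1)

print(n_codons, n_amino_acids, single_codon)
```

Only amino acids with more than one entry in `synonyms` offer any freedom for codon optimization.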



Degenerate codons do not necessarily lead to equally efficient expression of an amino acid.

Our Objectives


1) Find Criteria for Optimizing Genes in E. coli


All coding sequences were designed so that there would be no difference between the amino acid profile of the variant GFP and the original superfolder GFP. This ensured that each gene led to the expression of the same protein.


Previous researchers have determined through statistical analysis of the entire genome that some degenerate codons occur more often in protein coding sequences than others. These are referred to as common and rare codons. This matters because protein expression in cells is limited by either the translation initiation rate (TIR) or the translation elongation rate, and it is theorized that common codons have faster elongation rates than their rare synonyms. TIR can be artificially controlled by varying the strength of the ribosome binding site (RBS), the genetic sequence that precedes the protein coding sequence (CDS) of a gene. This is done with the RBS Calculator, which in previous research was used to steadily increase the RBS strength of a gene, GFPmut3b, whose expression was then characterized. Unexpectedly, the expression level plateaued even as the RBS strength (and thus TIR) was increased. By using the RBS Calculator to increase the TIR, we can detect where the plateau occurs; this level is called the "maximum translation rate capacity." Since the plateau occurs independently of TIR, it is theorized to be due solely to translation elongation becoming the rate-limiting step. The design of the GFPs using only common and rare codons was based on the data in this table.


Codon Frequency


The number in the % column shows the percentage of the time that the amino acid is coded for by a specific codon. More frequent codons have higher percentages.

Modified from Maloy, S., V. Stewart, and R. Taylor. 1996. Genetic analysis of pathogenic bacteria. Cold Spring Harbor Laboratory Press, NY.



To optimize for common degenerate codons, the most frequent codon for each amino acid was taken. To optimize for rare degenerate codons, the least frequent codon was taken. For example, if a codon in the original superfolder GFP coded for phenylalanine, the codons UUU and UUC were available. The frequencies of these were taken from the table (UUU: 0.51, UUC: 0.49). For common GFP, UUU was used whenever phenylalanine was required, because it had the higher frequency. For rare GFP, UUC was used. Because every codon with a synonymous alternative was classified as either common or rare for E. coli, the common and rare optimized genes share no codons at those positions. Methionine and tryptophan are each coded by a single codon, so positions holding those amino acids cannot be varied.
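The common/rare selection rule can be sketched in a few lines. This is a minimal illustration, not our actual design program: the frequency values below are placeholders chosen for the example (they are not the values from the Maloy et al. table), and the function names are hypothetical.

```python
# Placeholder frequencies (fraction of uses per amino acid).
# These are illustrative only, NOT the published codon usage values.
CODON_FREQUENCY = {
    "F": {"UUU": 0.60, "UUC": 0.40},
    "K": {"AAA": 0.70, "AAG": 0.30},
    "L": {"CUG": 0.50, "UUA": 0.15, "UUG": 0.12,
          "CUU": 0.10, "CUC": 0.10, "CUA": 0.03},
}
CODON_TO_AA = {
    codon: aa for aa, freqs in CODON_FREQUENCY.items() for codon in freqs
}


def split_codons(cds):
    """Split a coding sequence into consecutive three-letter codons."""
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]


def translate(cds):
    """Translate a coding sequence using the small table above."""
    return "".join(CODON_TO_AA[codon] for codon in split_codons(cds))


def recode(cds, pick):
    """Replace every codon with a synonym chosen by `pick`.

    Pass `max` for the most frequent (common) synonym or `min` for
    the least frequent (rare) synonym.
    """
    recoded = []
    for codon in split_codons(cds):
        freqs = CODON_FREQUENCY[CODON_TO_AA[codon]]
        recoded.append(pick(freqs, key=freqs.get))
    return "".join(recoded)


original = "UUCAAGCUUUUU"          # encodes F-K-L-F
common = recode(original, max)
rare = recode(original, min)
print(common, rare, translate(common) == translate(rare))
```

Because only synonymous substitutions are made, the translated protein is unchanged, which is the property every variant in this project had to satisfy.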



Result of common/rare optimization is two coding sequences with zero commonality.

In another recent project, all the genes (coding DNA sequences) of E. coli are divided into five groups based on the naturally occurring TIR, from lowest to highest. Then, the codon usage profile of each group of genes is statistically analyzed to determine whether a codon is slow or fast. A fast codon is defined as one with high correlation between TIR and its frequency. Otherwise, it is a slow codon. It is hypothesized that the groups of CDS with high TIR will hold more “fast” codons, which will lead to higher translation elongation rate and thus higher protein expression, whereas the slow regions will hold more “slow” codons leading to lower expression. This data is summarized in the following figure.



Codon Frequency in Fast and Slow regions of the Genome

Fast codons show a positive correlation between frequency and TIR, slow codons show a negative correlation

Ng, C. Y., Farasat, I., Zomorrodi, A. R., Maranas, C. D. & Salis, H. M. Model-guided construction and optimization of synthetic metabolism for chemical product synthesis. Synthetic Biology Engineering Research Center Spring Retreat (2013), Berkeley, CA.



Codons whose frequency increases with TIR are defined as fast codons. Those with declining frequency in relation to TIR are slow codons, and those with no correlation are defined as independent of TIR. This can be seen in the figure above as the slope of the graph of ratio against TIR for each codon. If the ratio increases with TIR, the codon is fast, and the graph displays a positive slope. Slow codons show a negative slope, and TIR-independent codons show essentially no slope.
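The slope-based classification can be illustrated with a small sketch. The usage ratios and the slope threshold are made-up assumptions for demonstration; the original analysis may have used a different statistic or cutoff.

```python
def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def classify_codon(ratios, threshold=0.01):
    """Label a codon from its usage ratio in TIR groups ordered low to high.

    `threshold` is an assumed cutoff below which the slope is treated as flat.
    """
    groups = list(range(1, len(ratios) + 1))
    slope = least_squares_slope(groups, ratios)
    if slope > threshold:
        return "fast"
    if slope < -threshold:
        return "slow"
    return "independent"


# Hypothetical usage ratios across five TIR groups (lowest to highest TIR).
examples = {
    "codon_a": [0.30, 0.35, 0.40, 0.45, 0.50],
    "codon_b": [0.50, 0.45, 0.40, 0.35, 0.30],
    "codon_c": [0.40, 0.41, 0.40, 0.39, 0.40],
}
labels = {name: classify_codon(r) for name, r in examples.items()}
print(labels)
```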

Example of a fast codon. Notice that the codon is more frequent in higher TIR regions of the genome.

In another related research project, researchers developed a program which models the process of translation elongation. This program takes into account the chemical binding of individual codons to ribosomes as well as numerous other relevant biological criteria and is able to predict the time taken by a ribosome to add an amino acid to a growing polypeptide chain. This is known as the “insertion time” for that codon. Using this software, a list of the insertion times for each codon was compiled. It is theorized that codons with longer insertion times will have lower translation elongation rates and thus lower the expression of the protein of the particular CDS that contains the slow codons. This data is summarized in the table below.




Codons with faster insertion times are theorized to lead to higher protein expression.

2) Apply These Criteria to a Reporter Gene (GFP)


It is important to understand the difference between the slow, fast, rare, common, and insertion time criteria. Common and rare codons are based on the frequency of particular degenerate codons in the entire E. coli genome. The hypothesis that common codons will lead to higher expression of proteins is based on the idea that cells have become optimized through evolution to efficiently translate the proteins necessary for their survival. Based on this assumption, the most efficient codons will appear more frequently in the overall genome. The fast and slow codon distinction relies on a very similar analysis. Fast codons are defined as those with a high correlation between frequency and the high-TIR regions of the genome, whereas slow codons are those with a high correlation between frequency and the low-TIR regions of the genome. This is an extension of the common/rare distinction, but it is more specific, as parts of the genome with low TIR could code for proteins where high expression (and thus fast translation elongation) is not necessary. By analyzing the codon usage profile of regions of the genome with different TIR, it is theorized that fast and slow codons could be used to artificially control the expression of a particular gene through codon optimization. In some instances, the same codon is chosen by multiple optimization plans (fast, common, etc.) to specify a certain amino acid. Because of this, some genes have instances of similarity, where the same codon is used in the same position.




Result of common/fast optimization is two coding sequences that may be similar, as in some instances a codon may be both common and fast.
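Similarity between two variants can be quantified as the fraction of codon positions at which both genes use the same codon. The sketch below is one simple way to compute it; the function name and the short example sequences are hypothetical.

```python
def codon_identity(cds_a, cds_b):
    """Fraction of codon positions where two coding sequences match exactly."""
    if len(cds_a) != len(cds_b):
        raise ValueError("sequences must have the same length")
    if len(cds_a) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    codons_a = [cds_a[i:i + 3] for i in range(0, len(cds_a), 3)]
    codons_b = [cds_b[i:i + 3] for i in range(0, len(cds_b), 3)]
    matches = sum(a == b for a, b in zip(codons_a, codons_b))
    return matches / len(codons_a)


# Two short hypothetical variants encoding the same three amino acids.
print(codon_identity("UUUAAACUG", "UUUAAGCUG"))  # two of three codons match
```

A value of 0 corresponds to the "zero commonality" case, and a value near 1 means the two criteria mostly selected the same codons.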

The slow insertion time design was based solely on recently designed software that analyzes the biophysical phenomena underlying translation elongation. Thus, the difference between the slow insertion time GFP and the others is that it is optimized based on the results of biophysical modeling instead of codon usage profiles. In other words, it relies on an understanding of the physics of translation elongation versus the understanding that evolution optimizes organisms for high efficiency. The overall hypothesis of this research can now be fully understood.

This hypothesis is that the maximum translation rate capacity is due to translation elongation becoming the rate limiting step of protein synthesis, and that it could be controlled by increasing the translation elongation rate, through codon optimization of the CDS. Essentially, the presence of more common or fast codons is hypothesized to raise the maximum translation rate capacity, while the presence of more slow, rare, or slow insertion time codons will lower it.
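To make the hypothesis concrete, here is a deliberately simple toy model, not a model we fitted or validated: expression rises in proportion to TIR until it hits an elongation-limited ceiling. The parameter names `capacity` and `gain` and all numbers are assumptions for illustration.

```python
def predicted_expression(tir, capacity, gain=1.0):
    """Toy model: expression is limited by initiation or by elongation capacity."""
    return min(gain * tir, capacity)


def plateau_onset(tirs, capacity, gain=1.0):
    """Smallest sampled TIR at which the toy model reaches the plateau, else None."""
    for tir in sorted(tirs):
        if predicted_expression(tir, capacity, gain) >= capacity:
            return tir
    return None


sampled_tirs = [1, 10, 100, 1000, 10000, 100000]

# A gene with low elongation capacity plateaus early; raising the capacity
# (the hoped-for effect of fast or common codons) pushes the plateau to higher TIR.
print(plateau_onset(sampled_tirs, capacity=500))
print(plateau_onset(sampled_tirs, capacity=50000))
```

Under this picture, a successfully optimized gene is one whose plateau appears at a higher TIR, or not at all within the sampled range.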

To test this hypothesis, five variants of the Green Fluorescent Protein gene (GFP), one each of purely fast, slow, rare, and common codons, as well as one with codons having the slowest insertion time, were designed and constructed. The TIR of each variant GFP was then varied by attaching ribosome binding sites (RBS) of varying strength to the synthetic genes. The genes were then expressed in E. coli cells.



Shows the similarity between GFPs. Genes with opposite criteria, such as slow/fast and common/rare, show very little similarity.

To design the genes, a custom program was created that replaces all degenerate codons in a gene with the desired codons, for example, replacing all rare codons with common degenerate codons or all slow codons with fast degenerate codons. The variants were synthesized by a commercial laboratory (Integrated DNA Technologies) and then inserted into a plasmid vector, which was transformed into the cells and maintained by their existing replication machinery. This was accomplished through basic cloning, and the expression of the fluorescent protein was then characterized using flow cytometry, a quantitative method of measuring fluorescence. Using this data, the maximum translation rate capacity of each variant GFP can be determined and used to distinguish between rare, slow, common, and fast codons.


3) Introduce the Synthetic Genes Optimized Using Our Criteria into E. coli


Our general plan for expressing our variant GFPs in living cells was to ligate the genes into a vector, transform the cells with the vector, then sequence to confirm the presence of our variant genes. After this we ligated in a dRBS, measured the fluorescence of the cells, and then sequenced again to determine which colonies were using which RBS.

The figures below show the construction process that we used to insert an RBS library and variant GFP into plasmid pFTV.


Inverse PCR



Through inverse PCR, we cut away the existing superfolder GFP while amplifying the rest of the plasmid. By adding "tails" to our primers outside the annealing sites we were able to introduce new restriction sites into the plasmid.



Inverse PCR Products

Inverse PCR cuts out the pre-existing coding sequence while simultaneously amplifying the plasmid backbone with the new restriction sites.

Inserting GFP


Each GFP variant is inserted separately


We decided to use a leader sequence to homogenize the first 60 base pairs of each GFP.


Why use a leader sequence?

The leader sequence is heavily optimized to ensure that it does not become the rate-limiting step in translation. Because the first 60 base pairs of a coding sequence can affect TIR, making them identical in every variant ensures that the same range of translation initiation rates is sampled across all variants.


Our construct before insertion of the dRBS


The dRBS will be inserted between SacI and PstI


pFTV carrying a variant GFP. The dRBS will be inserted between sites SacI and PstI.

Insertion of the dRBS

The spacer that had held a place for the dRBS is cut out by the enzymes SacI and PstI.


The dRBS is flanked by the restriction sites SacI and PstI and is manufactured by annealing two complementary oligos that contain the dRBS.
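Designing the second oligo amounts to taking the reverse complement of the first, including any degenerate letters. The sketch below uses the standard IUPAC complement pairs; the example sequence is hypothetical (it is not our dRBS), and the sketch ignores restriction-site overhangs.

```python
# IUPAC nucleotide codes and their complements (e.g. R = A/G pairs with Y = C/T).
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence that may contain IUPAC codes."""
    return seq.upper().translate(COMPLEMENT)[::-1]


top_oligo = "ARGGWSKAAHCA"          # hypothetical degenerate sequence
bottom_oligo = reverse_complement(top_oligo)
print(top_oligo, bottom_oligo)
```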


The Ribosome Binding Site

The degenerate ribosome binding site (dRBS) is a sequence that contains a ribosome binding site library. Using software developed by the Salis Lab, we calculated the range of translation initiation rates (TIR) expected for this sequence: 0.5 to 157,000 au.

Check out the cool software that allowed us to accomplish this.

The dRBS sequence is the location where the ribosome binds, and a higher TIR will allow more ribosomes to bind to the mRNA. It was essential to measure the performance of our synthetic GFPs over a wide range of TIR in order to see if expression plateaued at high TIR, indicating that translation elongation was the rate limiting step, or whether it climbed along with TIR, indicating that elongation had been made sufficiently more efficient as to avoid any plateau.


The sequence carries five degenerate letters. Four of these specify one of two possible bases, while the other specifies one of three. Because of this, there are 2*2*2*2*3 = 48 possible sequences in our dRBS.
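The library size can be verified by expanding a degenerate sequence into every concrete sequence it encodes. The sequence used here is a hypothetical stand-in with four two-base letters and one three-base letter, not our actual dRBS; the letter meanings follow the standard IUPAC nucleotide codes.

```python
from itertools import product

# Standard IUPAC nucleotide codes mapped to the bases they allow.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_degenerate(seq):
    """List every concrete sequence encoded by a degenerate sequence."""
    choices = [IUPAC[letter] for letter in seq.upper()]
    return ["".join(bases) for bases in product(*choices)]


# Hypothetical example: R, W, S, K allow two bases each and H allows three.
HYPOTHETICAL_DRBS = "ARGGWSKAAHCA"
library = expand_degenerate(HYPOTHETICAL_DRBS)
print(len(library))
```

Each member of the expanded list would then be scored separately for its predicted TIR.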


Graph showing the sequences in our library and their calculated TIR

Each number on the x axis corresponds to one of the sequences out of the total 48. TIR is graphed on the y axis.


Salis, Howard M., Ethan A. Mirsky, and Christopher A. Voigt. “Automated design of synthetic ribosome binding sites to control protein expression.” Nature Biotechnology 27 (2009): 946–950. Web.



The final construct




Includes a promoter, RBS, leader sequence, variant gene, and terminator

4) Characterize the GFPs by Measuring Fluorescence of the Cells



We were able to characterize the fluorescence of the fast codon variant of superfolder GFP.




This graph shows the absorbance of our colonies versus time.


It is clear that several strains exceed the rest.

The strongest strains did not bear the highest TIR ribosome binding sites.


Unexpectedly, the most prolific strains did not contain the strongest ribosome binding sites. It is possible that at very high TIR the cells experienced toxicity or too much metabolic strain from the production of GFP. With further characterization of the GFPs it will be possible to compare the expression of strains containing the same RBS but different variant GFPs. Through this analysis we will be able to determine the strength of our optimization criteria and hopefully add to the ability of future engineers to optimize their genetic systems.