Team:Penn State/CodonOptimization

From 2014.igem.org

Revision as of 15:59, 12 October 2014 by Clayswack (Talk | contribs)



WELCOME TO PENN STATE iGEM 2014!

(Page under construction)



CODON OPTIMIZATION PROJECT


Codon Optimization: Engineering a More Useful Gene at the Codon Level

Project Summary

Codons are groups of three nucleotides that specify a single amino acid, which is then added to a growing polypeptide chain during translation. Even though each codon specifies only one amino acid, some amino acids are encoded by multiple codons. The genome of E. coli shows a statistical preference for some of these degenerate codons over others, and it is hypothesized that the preferred codons are translated more efficiently than the non-preferred ones. We constructed synthetic reporter genes entirely from codons hypothesized to be fast or slow, and characterized them in E. coli, demonstrating that...

Illustrates the principle of codon redundancy. The number in each degenerate codon refers to the criterion that would lead to it being chosen.

Why is this important?

Numerous bioproducts are important in our lives, including medicines, fuels, and industrial chemicals. All of these are derived from biological sources, and the ability to engineer their production is vital to a wide variety of industries. Codon optimization is an important area of research because it has the potential to give engineers an additional point of control over protein synthesis, and proteins (a broad class of macromolecules that includes enzymes) are vital components of countless bioproducts.

Our codon optimization research is also important because it will help future researchers develop more comprehensive models of translation. A better understanding of translation is a foundational advance that will lead to faster, more efficient research in many areas of biology. If, for example, our research shows clearly that certain degenerate codons are preferred because they can be translated more efficiently, scientists can search for a mechanism that predicts these effects, and engineers can redesign genes to be translated more efficiently.

Background

Codon optimization refers to the idea that the individual codons of a gene in a specific organism can be changed in order to alter the behavior of that organism. This relies on an understanding of the central dogma of biology, which states that an organism produces proteins by first transcribing genetic material from DNA into RNA, which is then “read” by ribosomes that build proteins based on the sequence of nucleotides in that RNA. The RNA is read three nucleotides at a time, and these three-letter series of nucleotides are called codons. Each codon tells the ribosome which amino acid to add to a growing amino acid chain.

There are 4 nucleotides, so 4^3, or 64, codons are possible. Since there are only 20 amino acids, the code is redundant: some amino acids are specified by multiple codons. There is no ambiguity, however, because each codon specifies only one amino acid. Codons that code for the same amino acid are called degenerate codons, and even though they code for the same amino acid, they do not necessarily lead to the same expression level of the encoded protein.
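The counting argument above can be checked directly. The following is a minimal Python sketch that enumerates all 64 codons and groups them by amino acid using the standard genetic code. The variable names and the compact table layout (bases in the order T, C, A, G, written as DNA) are choices made for this illustration and are not taken from the project's own software.

```python
from collections import defaultdict
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon, with codons ordered
# TTT, TTC, TTA, TTG, TCT, ... (third base varies fastest). '*' is a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each of the 4^3 codons to its amino acid.
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Group degenerate (synonymous) codons by the amino acid they encode.
synonymous = defaultdict(list)
for codon, aa in CODON_TABLE.items():
    synonymous[aa].append(codon)

print(len(CODON_TABLE))   # total number of codons
print(synonymous["F"])    # codons for phenylalanine
print(synonymous["M"])    # codons for methionine
```

Amino acids with more than one entry in `synonymous` are the ones for which a choice between degenerate codons exists; methionine and tryptophan have only one codon each, so they cannot be recoded.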

Our Objectives

1) Find Criteria for Optimizing Genes in E. coli

All coding sequences were designed so that there would be no difference between the amino acid profile of the variant GFP and the original superfolder GFP. This ensured that each gene led to the expression of the same protein.

Previous researchers have determined through a statistical analysis of the entire genome that some degenerate codons occur more often in protein coding sequences than others. These are referred to as common and rare codons. This matters because protein expression in cells is limited by either the translation initiation rate (TIR) or the translation elongation rate, and it is theorized that common codons have faster elongation rates than their rare degenerate counterparts. The translation initiation rate can be controlled artificially by varying the strength of the ribosome binding site (RBS), the genetic sequence that precedes the protein coding sequence (CDS) of a gene. This is done with the RBS Calculator, which in previous research was used to steadily increase the RBS strength of a gene, GFP mut3b, whose expression was then characterized. Unexpectedly, the expression level plateaued even as the RBS strength (and thus the TIR) was increased. By using the RBS Calculator to increase the translation initiation rate, we can detect the level at which this plateau occurs, which is called the "maximum translation rate capacity." Since the plateau occurs independently of TIR, it is theorized to result solely from translation elongation becoming the rate-limiting step. The design of the GFPs using only common and rare codons was based on the data in this table.

Codon Frequency

The number in the % column shows the percentage of the time that the amino acid is encoded by a specific codon. More frequent codons have higher percentages.

Modified from Maloy, S., V. Stewart, and R. Taylor. 1996. Genetic analysis of pathogenic bacteria. Cold Spring Harbor Laboratory Press, NY.

To optimize for common degenerate codons, the most frequent codon for each amino acid was chosen. To optimize for rare degenerate codons, the least frequent codon was chosen. For example, if a codon in the original superfolder GFP coded for phenylalanine, the codons UUU and UUC were available. Their frequencies were taken from the table (UUU: 0.51, UUC: 0.49). For the common GFP, UUU was used wherever phenylalanine was required, because it had the higher frequency. For the rare GFP, UUC was used.
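The selection rule described above can be sketched in a few lines of Python. The frequency values below are illustrative placeholders, chosen only so that UUU is the more frequent phenylalanine codon as the text states; they are not a copy of the table, and the function and variable names are hypothetical.

```python
# Illustrative fractions only (assumed for this sketch, not the table's data):
# amino acid -> {codon: fraction of the time that codon encodes the amino acid}
CODON_FRACTIONS = {
    "Phe": {"UUU": 0.51, "UUC": 0.49},
    "Lys": {"AAA": 0.76, "AAG": 0.24},
    "Trp": {"UGG": 1.00},  # only one codon, so no choice is possible
}


def pick_codons(fractions):
    """Return (common, rare): the most and least frequent codon per amino acid."""
    common = {aa: max(codons, key=codons.get) for aa, codons in fractions.items()}
    rare = {aa: min(codons, key=codons.get) for aa, codons in fractions.items()}
    return common, rare


common, rare = pick_codons(CODON_FRACTIONS)
print(common["Phe"], rare["Phe"])
print(common["Trp"], rare["Trp"])
```

For amino acids with a single codon, such as tryptophan, the common and rare choices coincide, so those positions are identical in every variant gene.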

In another recent project, all the genes (coding DNA sequences) of E. coli were divided into five groups based on their naturally occurring TIR, from lowest to highest. The codon usage profile of each group was then statistically analyzed to determine whether a codon is slow or fast. A fast codon is defined as one whose frequency correlates positively with TIR, and a slow codon as one whose frequency correlates negatively with TIR. It is hypothesized that the groups of CDS with high TIR will hold more “fast” codons, which will lead to a higher translation elongation rate and thus higher protein expression, whereas the groups with low TIR will hold more “slow” codons, leading to lower expression. This data is summarized in the following figure.

Codon Frequency in Fast and Slow regions of the Genome

Fast codons show a positive correlation between frequency and TIR, slow codons show a negative correlation

Ng, C. Y., Farasat, I., Zomorrodi, A. R., Maranas, C. D. & Salis, H. M. Model-guided construction and optimization of synthetic metabolism for chemical product synthesis. Synthetic Biology Engineering Research Center Spring Retreat (2013), Berkeley, CA.

Codons whose frequency increases with TIR are defined as fast codons. Those whose frequency declines with TIR are slow codons, and those with no correlation are defined as independent of TIR. In the figure above, this appears as the slope of each codon's graph of usage ratio against TIR. If the ratio increases with TIR, the codon is fast and the graph has a positive slope. Slow codons show a negative slope, and TIR-independent codons show essentially no slope.
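As a rough illustration of this slope-based reading of the figure, the sketch below fits a least-squares line to a codon's usage ratio across the five TIR groups and labels the codon by the sign of the slope. The ratios, the codon names, and the `threshold` value are all assumptions made for the example; the original analysis is described only as a statistical correlation, and its exact procedure is not given here.

```python
def slope(xs, ys):
    """Least-squares slope of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


def classify(ratios, threshold=0.01):
    """Label a codon from its usage ratio in each TIR group (lowest to highest).

    `threshold` is an arbitrary cutoff assumed for this sketch.
    """
    groups = list(range(1, len(ratios) + 1))
    s = slope(groups, ratios)
    if s > threshold:
        return "fast"
    if s < -threshold:
        return "slow"
    return "TIR-independent"


# Hypothetical usage ratios across the five TIR groups.
examples = {
    "codon_A": [0.30, 0.35, 0.41, 0.46, 0.52],
    "codon_B": [0.70, 0.65, 0.59, 0.54, 0.48],
    "codon_C": [0.50, 0.51, 0.49, 0.50, 0.50],
}
for name, ratios in examples.items():
    print(name, classify(ratios))
```

With only five groups per codon, a fixed slope cutoff is a crude stand-in for a proper significance test, so the labels should be read as a visual aid rather than a statistical result.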

In another related research project, researchers developed a program which models the process of translation elongation. This program takes into account the chemical binding of individual codons to ribosomes as well as numerous other relevant biological criteria and is able to predict the time taken by a ribosome to add an amino acid to a growing polypeptide chain. This is known as the “insertion time” for that codon. Using this software, a list of the insertion times for each codon was compiled. It is theorized that codons with longer insertion times will have lower translation elongation rates and thus lower the expression of the protein of the particular CDS that contains the slow codons. This data is summarized in the table below.

Codons with faster insertion times are theorized to lead to higher protein expression.

2) Apply These Criteria to a Reporter Gene (GFP)

It is important to understand the difference between the slow, fast, rare, common, and insertion time criteria. Common and rare codons are based on the frequency of particular degenerate codons in the entire E. coli genome. The hypothesis that common codons will lead to higher expression of proteins is based on the idea that cells have become optimized through evolution to efficiently translate the proteins necessary for their survival. Under this assumption, the most efficient codons will appear most frequently in the overall genome.

The fast and slow codon distinction relies on a very similar analysis. Fast codons are defined as those whose frequency correlates strongly with the high-TIR regions of the genome, whereas slow codons are those whose frequency correlates strongly with the low-TIR regions. This is an extension of the common/rare distinction, but it is more specific, because parts of the genome with low TIR may code for proteins for which high expression (and thus fast translation elongation) is not necessary. By analyzing the codon usage profile of regions of the genome with different TIR, it is theorized that fast and slow codons could be used to artificially control the expression of a particular gene through codon optimization.

The slow insertion time design was based solely on recently designed software that models the biophysical phenomena underlying translation elongation. The slow insertion time GFP therefore differs from the others in that it is optimized using the results of biophysical modeling instead of a codon usage profile. It rests on an understanding of the physics of translation elongation, versus the understanding that evolution optimizes organisms for high efficiency. The overall hypothesis of this research can now be fully understood.

This hypothesis is that the maximum translation rate capacity is due to translation elongation becoming the rate limiting step of protein synthesis, and that it could be controlled by increasing the translation elongation rate, through codon optimization of the CDS. Essentially, the presence of more common or fast codons is hypothesized to raise the maximum translation rate capacity, while the presence of more slow, rare, or slow insertion time codons will lower it.

To test this hypothesis, five variants of the Green Fluorescent Protein (GFP) gene were designed and constructed: four built purely from fast, slow, rare, or common codons, and one built from the codons with the slowest insertion times. The TIR of each variant GFP was then varied by attaching ribosome binding sites (RBS) of varying strength to the synthetic genes. The genes were then expressed in E. coli cells.

To design the genes, a custom program was created that replaces all degenerate codons in a gene with the desired codons, for example replacing all rare codons with common degenerate codons, or all slow codons with fast degenerate codons. The variants were sent for construction at a commercial laboratory (Integrated DNA Technologies), and then inserted into viral DNA vectors which were incorporated into the cells’ plasmids through existing replication machinery. This was accomplished through basic cloning, and the expression of the fluorescent protein was then characterized using flow cytometry, a quantitative method of measuring fluorescence. Using this data, the maximum translation rate capacity of each variant GFP was determined, and that data was used to distinguish between rare, slow, common, and fast codons.
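The team's custom program is not reproduced on this page, so the following is only a minimal Python sketch of the kind of codon replacement the paragraph above describes. The `recode` and `translate` functions, the `PREFERRED` map, and the short test sequence are hypothetical. The check that the protein is unchanged mirrors the stated requirement that every variant encode the same amino acid sequence.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (DNA alphabet), third base varying fastest; '*' is stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate a coding sequence codon by codon into one-letter amino acids."""
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds), 3))


def recode(cds, preferred):
    """Swap each codon for the preferred synonymous codon of its amino acid.

    Codons whose amino acid has no entry in `preferred` are left unchanged.
    """
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    out = []
    for i in range(0, len(cds), 3):
        codon = cds[i:i + 3]
        out.append(preferred.get(CODON_TABLE[codon], codon))
    recoded = "".join(out)
    # The encoded protein must be identical before and after recoding.
    if translate(recoded) != translate(cds):
        raise ValueError("preferred codons changed the amino acid sequence")
    return recoded


# Hypothetical preference map: amino acid -> codon to use everywhere.
PREFERRED = {"F": "TTT", "K": "AAA", "L": "CTG"}

original = "ATGTTCAAGTTATAA"  # Met-Phe-Lys-Leu-stop, made up for illustration
recoded = recode(original, PREFERRED)
print(recoded)
print(translate(original) == translate(recoded))
```

Building one preference map per criterion (common, rare, fast, slow, slowest insertion time) and running the same function with each would yield one variant gene per criterion from a single starting sequence.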

3) Introduce the Synthetic Genes Optimized Using Our Criteria into E. coli

Our general plan for expressing our variant GFPs in living cells was to ligate the genes into a vector, transform the cells with the vector, and then sequence to confirm the presence of our variant genes. After this, we ligated in a dRBS, measured the fluorescence of the cells, and then sequenced again to determine which colonies were using which RBS.

The figure below shows the vector pFTV that was altered using inverse PCR.

Inverse PCR

The figure below shows the products of inverse PCR.

Inverse PCR Products

The figure below shows the constructs used to insert each variant GFP.

Inserting GFP

A leader sequence was used to homogenize the first 60 base pairs of each GFP.

Why use a leader Sequence?

Our construct before insertion of the dRBS

The dRBS will be inserted between SacI and PstI

Insertion of the dRBS

The spacer that held a place for the dRBS is cut out by the enzymes SacI and PstI

The final construct

Includes a promoter, RBS, leader sequence, variant gene, and terminator

The dRBS

Translation Initiation

4) Characterize the GFPs by Measuring Fluorescence of the Cells

5) Compare Protein Expression Levels from the Various Genes

Design Methods
