From 2014.igem.org

Codon Optimization: Engineering a More Useful Gene at the Codon Level

Project Summary

Codons are groups of three nucleotides that specify a single amino acid, which is then added to a growing polypeptide chain during translation. Even though each codon specifies only one amino acid, some amino acids are coded by multiple codons. It has been demonstrated that the genome of E. coli shows a statistical preference for some of these degenerate codons over others, and it is hypothesized that the preferred codons translate more efficiently than non-preferred degenerate codons. We constructed synthetic reporter genes entirely from codons hypothesized to be fast or slow, and characterized them in E. coli, demonstrating that...

Why is this important?

Bioproducts vital to our lives include medicines, fuels, and even industrial chemicals, and producing them in engineered cells depends on controlling protein synthesis. Codon optimization is important because it gives engineers an additional point of control over protein synthesis.

Our codon optimization research is important for the additional reason that it will help future researchers to develop more comprehensive models of translation. A better understanding of translation is a foundational advance in biology that will lead to faster, more efficient research in many areas of biology. If, for example, our research shows clearly that certain degenerate codons are preferred because they can be translated more efficiently, this will allow scientists to search for a mechanism that predicts these effects, and will invite engineers to redesign genes to be translated more efficiently.

Background

Codon optimization refers to the idea that the individual codons of a gene in a specific organism can be changed in order to alter the behavior of that organism. This relies on an understanding of the central dogma of biology, which states that any organism produces proteins by first transcribing genetic material in the form of DNA to RNA, which is then “read” by ribosomes that produce proteins based on the sequence of nucleotides in that RNA. The reading of the RNA is done three nucleotides at a time, and these three-letter series of nucleotides are called codons. Codons specify to the ribosome which amino acid to add to a growing amino acid chain.

There are 4 nucleotides, thus 4^3, or 64, codons are possible. Since there are only 20 amino acids, there is redundancy in the codons; that is, some amino acids are specified by multiple codons. There is no ambiguity, however, meaning that each codon specifies only one amino acid. Codons that code for the same amino acid are called degenerate codons, and even though these degenerate codons code for the same amino acid, they do not necessarily lead to the same expression levels of the protein that contains that amino acid.
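The counting argument above can be checked with a short script. This is a minimal sketch in Python, assuming the standard genetic code written in RNA letters; the variable names are illustrative and are not taken from the project.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon, with codons enumerated in
# UCAG order (third base varying fastest). '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Every possible three-letter combination of the four bases.
codons = ["".join(c) for c in product(BASES, repeat=3)]
codon_table = dict(zip(codons, AMINO_ACIDS))

# Number of degenerate (synonymous) codons per amino acid, stops excluded.
degeneracy = Counter(aa for aa in codon_table.values() if aa != "*")

print(len(codons))      # number of possible codons, 4**3
print(len(degeneracy))  # number of distinct amino acids
for aa, n in sorted(degeneracy.items()):
    print(aa, n)
```

The per-amino-acid counts make the redundancy concrete: most amino acids have two, four, or six codons to choose from, which is the freedom that codon optimization exploits.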

Our Objectives

1) Find Criteria for Optimizing Genes in E. coli

All coding sequences were designed so that there would be no difference between the amino acid profile of the variant GFP and the original superfolder GFP. This ensured that each gene led to the expression of the same protein.

Previous researchers have determined through a statistical analysis of the entire genome that some degenerate codons occur more often in protein coding sequences while others occur less often. These are referred to as common and rare codons, respectively. This matters because protein expression in cells is limited by either the translation initiation rate (TIR) or the translation elongation rate, and it is theorized that commonly occurring codons will have faster elongation rates than rare degenerate codons. Translation initiation rate can be artificially controlled by varying the strength of the ribosome binding site (RBS), which consists of the genetic sequence that precedes the protein coding sequence (CDS) of a gene. This is accomplished through the use of the RBS Calculator, which in previous research was used to steadily increase the RBS strength of a gene, GFP mut3b, the expression of which was then characterized. Unexpectedly, the expression level of the protein plateaued even as the RBS strength (and thus TIR) was increased. By using the RBS Calculator to increase the translation initiation rate, we can detect when the plateau occurs; the expression level at this plateau is called the "maximum translation rate capacity." Since this plateau occurs independently of TIR, it is theorized that it is due solely to translation elongation becoming a rate limiting step. The design of the GFPs using only common and rare codons was based on the data in this table.

Codon Frequency

Modified from Maloy, S., V. Stewart, and R. Taylor. 1996. Genetic analysis of pathogenic bacteria. Cold Spring Harbor Laboratory Press, NY.

To optimize for common degenerate codons, the most frequent codon for a specific amino acid was taken. To optimize for rare degenerate codons, the least frequent codon was taken. For example, if a codon in the original superfolder GFP coded for phenylalanine, the codons UUU and UUC were available. The frequencies of these were taken from the table (UUU: 0.51, UUC: 0.49). For common GFP, UUU was used whenever phenylalanine was desired, because it had the highest frequency. For rare GFP, UUC was used. Because all codons were found to be either common or rare for E. coli, the common and rare optimized genes had zero commonality. Tryptophan is the only amino acid coded by one codon, but it does not appear in superfolder GFP.
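The selection rule described above can be sketched in a few lines of Python. This is an illustration only: the table holds a single hypothetical entry for phenylalanine, with the two fractions assumed to sum to 1 (0.51 and 0.49), and the names `CODON_FREQUENCY` and `pick_codon` are invented for this example.

```python
# Hypothetical excerpt of a codon frequency table:
# amino acid -> {codon: fraction of that amino acid's codons}.
# Only phenylalanine is filled in; the fractions are assumed to sum to 1.
CODON_FREQUENCY = {
    "Phe": {"UUU": 0.51, "UUC": 0.49},
}


def pick_codon(amino_acid, mode):
    """Return the most frequent ('common') or least frequent ('rare') codon."""
    if mode not in ("common", "rare"):
        raise ValueError("mode must be 'common' or 'rare'")
    freqs = CODON_FREQUENCY[amino_acid]
    chooser = max if mode == "common" else min
    return chooser(freqs, key=freqs.get)


print(pick_codon("Phe", "common"))
print(pick_codon("Phe", "rare"))
```

Extending the dictionary to every amino acid in the frequency table would give the full mapping used to build a common or a rare variant.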

In another recent project, all the genes (coding DNA sequences) of E. coli were divided into five groups based on the naturally occurring TIR, from lowest to highest. Then, the codon usage profile of each group of genes was statistically analyzed to determine whether a codon is slow or fast. A fast codon is defined as one with high correlation between TIR and its frequency. Otherwise, it is a slow codon. It is hypothesized that the groups of CDS with high TIR will hold more “fast” codons, which will lead to higher translation elongation rate and thus higher protein expression, whereas the groups with low TIR will hold more “slow” codons, leading to lower expression. This data is summarized in the following figure.

Codon Frequency in Fast and Slow regions of the Genome

Ng, C. Y., Farasat, I., Zomorrodi, A. R., Maranas, C. D. & Salis, H. M. Model-guided construction and optimization of synthetic metabolism for chemical product synthesis. Synthetic Biology Engineering Research Center Spring Retreat (2013), Berkeley, CA.

Codons whose frequency increases with TIR are defined as fast codons. Those with declining frequency in relation to TIR are slow codons, and those with no correlation are defined as independent of TIR. This can be viewed in the figure above as the slope of the graphs for each codon showing ratio and TIR. If the ratio increases with TIR, the codon is fast, and the graph displays positive slope. Slow codons show negative slope, and TIR independent codons show essentially no slope.
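The slope-based reading of the figure can be expressed as a small Python sketch. It assumes each codon has one usage ratio per TIR group, ordered from lowest to highest TIR, and it uses an arbitrary slope threshold to decide what counts as "essentially no slope". Both the threshold and the example ratios are hypothetical and are not values from the cited study.

```python
def slope(xs, ys):
    """Least-squares slope of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


def classify_codon(ratios, threshold=0.01):
    """Classify a codon from its usage ratio in TIR groups ordered low to high.

    The threshold is an assumed cutoff, chosen only for illustration.
    """
    groups = list(range(1, len(ratios) + 1))
    s = slope(groups, ratios)
    if s > threshold:
        return "fast"
    if s < -threshold:
        return "slow"
    return "independent"


# Hypothetical usage ratios across five TIR groups.
print(classify_codon([0.30, 0.35, 0.40, 0.45, 0.50]))
print(classify_codon([0.50, 0.45, 0.40, 0.35, 0.30]))
print(classify_codon([0.40, 0.40, 0.40, 0.40, 0.40]))
```

A real analysis would likely also test whether the trend is statistically significant before labeling a codon, which this sketch does not attempt.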

In another related research project, researchers developed a program which models the process of translation elongation. This program takes into account the chemical binding of individual codons to ribosomes as well as numerous other relevant biological criteria and is able to predict the time taken by a ribosome to add an amino acid to a growing polypeptide chain. This is known as the “insertion time” for that codon. Using this software, a list of the insertion times for each codon was compiled. It is theorized that codons with longer insertion times will have lower translation elongation rates and thus lower the expression of the protein of the particular CDS that contains the slow codons. This data is summarized in the table below.

2) Apply These Criteria to a Reporter Gene (GFP)

It is important to understand the difference between the slow, fast, rare, common, and insertion time criteria. Common and rare codons are based on the frequency of particular degenerate codons in the entire E. coli genome. The hypothesis that common codons will lead to higher expression of proteins is based on the idea that cells have become optimized through evolution to efficiently translate proteins necessary to their survival. Based on this assumption, the most efficient codons will appear more frequently in the overall genome. The fast and slow codon differentiation relies on a very similar analysis. Fast codons are defined as those with a high correlation between frequency and the high TIR regions of the genome, whereas slow codons are those with a high correlation between frequency and the low TIR regions of the genome. This is an extension of the common/rare distinction, but is more specific, as certain parts of the genome with low TIR could code for proteins where high expression (and thus fast translation elongation) is not necessary. By analyzing the codon usage profile of individual regions of the genome with different TIR, it is theorized that the fast and slow codons could be used to artificially control the expression of a particular gene through codon optimization. In some instances, the same codon is chosen by multiple optimization plans (fast, common, etc.) to specify a certain amino acid. Because of this, some of the variant genes share the same codon at the same position.

The slow insertion time design was based solely on recently designed software which analyzes the biophysical phenomena that underlie translation elongation. Thus, the slow insertion time GFP differs from the others in that it is optimized based on the results of biophysical modeling instead of codon usage profiles; that is, on an understanding of the physics of translation elongation versus the assumption that evolution optimizes organisms for high efficiency. The overall hypothesis of this research can now be fully understood.

This hypothesis is that the maximum translation rate capacity is due to translation elongation becoming the rate limiting step of protein synthesis, and that it could be controlled by increasing the translation elongation rate, through codon optimization of the CDS. Essentially, the presence of more common or fast codons is hypothesized to raise the maximum translation rate capacity, while the presence of more slow, rare, or slow insertion time codons will lower it.
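One way to picture this hypothesis is a toy model in which expression follows TIR until elongation caps it. The Python sketch below is an assumption made purely for illustration: the function, the `gain` parameter, and the capacity values are hypothetical, and this is not the model used by the team or by the RBS Calculator.

```python
def predicted_expression(tir, capacity, gain=1.0):
    """Toy model: expression tracks TIR until the elongation capacity caps it.

    tir      -- translation initiation rate (arbitrary units)
    capacity -- assumed maximum translation rate capacity of the CDS
    gain     -- assumed proportionality between TIR and expression
    """
    return min(gain * tir, capacity)


# Hypothetical capacities: a "fast" CDS is assumed to plateau higher than a "slow" one.
for tir in (10, 100, 1000, 10000):
    fast = predicted_expression(tir, capacity=5000)
    slow = predicted_expression(tir, capacity=500)
    print(tir, fast, slow)
```

Under this picture, two variants are indistinguishable at low TIR and separate only once the TIR is high enough for elongation to become limiting, which is why a wide range of RBS strengths is needed to compare them.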

To test this hypothesis, five variants of the Green Fluorescent Protein gene (GFP), one each of purely fast, slow, rare, and common codons, as well as one with codons having the slowest insertion time, were designed and constructed. The TIR of each variant GFP was then varied by attaching ribosome binding sites (RBS) of varying strength to the synthetic genes. The genes were then expressed in E. coli cells.

To design the genes, a custom program was created which replaces all degenerate codons in a gene with the desired codons, for example, replacing all rare codons with common degenerate codons or all slow codons with fast degenerate codons. The variants were sent for construction at a commercial laboratory (Integrated DNA Technologies), and then inserted into viral DNA vectors which were incorporated into the cells’ plasmids through existing replication machinery. This was accomplished through basic cloning, and the expression of the fluorescent protein was then characterized using flow cytometry, a quantitative method of measuring fluorescence. Using this data, the maximum translation rate capacity of each variant GFP was determined, and that data was used to distinguish between rare, slow, common, and fast codons.
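The team's program is not reproduced on this page, so the following is only a sketch of how such a codon replacement tool might work. It is written in Python, uses the standard genetic code in DNA letters, and checks that the recoded gene still translates to the same protein. The function names, the short test sequence, and the `preferred` mapping are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order (third base varying fastest); '*' is stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip(("".join(c) for c in product(BASES, repeat=3)), AMINO_ACIDS))


def translate(cds):
    """Translate a DNA coding sequence into one-letter amino acids."""
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds), 3))


def recode(cds, preferred):
    """Replace each codon with the preferred synonymous codon for its amino acid.

    preferred maps a one-letter amino acid to the codon to use; amino acids
    missing from the mapping keep their original codon.
    """
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    out = []
    for i in range(0, len(cds), 3):
        codon = cds[i:i + 3]
        amino_acid = CODON_TABLE[codon]
        new_codon = preferred.get(amino_acid, codon)
        if CODON_TABLE[new_codon] != amino_acid:
            raise ValueError(f"{new_codon} does not code for {amino_acid}")
        out.append(new_codon)
    return "".join(out)


# Hypothetical codon choices, for illustration only.
preferred = {"F": "TTT", "K": "AAG"}
original = "ATGTTCAAATAA"
variant = recode(original, preferred)
print(variant)
print(translate(original) == translate(variant))
```

The final comparison reflects the design constraint stated earlier: every variant must encode exactly the same amino acid sequence as the original superfolder GFP.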

3) Introduce the Synthetic Genes Optimized Using Our Criteria into E. coli

Our general plan for expressing our variant GFPs in living cells was to ligate the genes into a vector, transform the cells with the vector, then sequence to confirm the presence of our variant genes. After this we ligated in a dRBS, measured the fluorescence of the cells, and then sequenced again to determine which colonies were using which RBS.

The figures below show the construction process that we used to insert an RBS library and variant GFP into plasmid pFTV.

Inverse PCR

The figure below shows the products of inverse PCR.

Inverse PCR Products

The figures below show the constructs used to insert each variant GFP.

Inserting GFP

A decision was reached to use a leader sequence to homogenize the first 60 base pairs of each GFP.

Why use a leader Sequence?

Our construct before insertion of the dRBS

The dRBS will be inserted between the SacI and PstI sites.

Insertion of the dRBS

The spacer which had held a place for the dRBS is cut out by the enzymes SacI and PstI.

The Ribosome Binding Site

The degenerate ribosome binding site (dRBS) is a sequence that contains a ribosome binding site library. Using software developed by the Salis Lab, we calculated the range of translation initiation rates (TIR) that would be expected for this sequence, from 0.5 to 157,000 au.

Check out the cool software that allowed us to accomplish this.

The dRBS sequence is the location where the ribosome binds, and a higher TIR will allow more ribosomes to bind to the mRNA. It was essential to measure the performance of our synthetic GFPs over a wide range of TIR in order to see whether expression plateaued at high TIR, indicating that translation elongation was the rate limiting step, or whether it climbed along with TIR, indicating that elongation had been made efficient enough to avoid any plateau.

The sequence carries five degenerate letters. Four of these specify one of two possible bases, while the other specifies one of three. Because of this, there are 2*2*2*2*3 = 48 possible sequences in our dRBS.
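The library size can be confirmed by expanding a degenerate sequence into every concrete sequence it represents. The Python sketch below uses the standard IUPAC nucleotide codes; the example sequence is hypothetical (the actual dRBS sequence is not given here) and was chosen only to have four two-base letters and one three-base letter.

```python
from itertools import product

# Standard IUPAC nucleotide codes and the bases each one allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand(degenerate_seq):
    """List every concrete sequence encoded by a degenerate sequence."""
    choices = (IUPAC[letter] for letter in degenerate_seq)
    return ["".join(bases) for bases in product(*choices)]


# Hypothetical sequence: R, Y, S, W allow two bases each and D allows three.
example = "ARGYAGSAWTD"
library = expand(example)
print(len(library))  # 2*2*2*2*3
print(library[:3])
```

Each expanded sequence could then be scored individually for its predicted TIR, which is how a plot of TIR against sequence number can be produced.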

Graph showing the sequences in our library and their calculated TIR

Each number on the x axis corresponds to one of the sequences out of the total 48. TIR is graphed on the y axis.

The final construct

4) Characterize the GFPs by Measuring Fluorescence of the Cells

5) Compare Protein Expression Levels from the Various Genes


Team:Penn State/CodonOptimization

Revision as of 22:00, 16 October 2014

WELCOME TO PENN STATE iGEM 2014!

CODON OPTIMIZATION PROJECT
