Team:Cambridge-JIC/Informatics

optimise codon usage in our registry parts and to facilitate future synthetic biology work on Marchantia;
identify and characterise promoters, in particular looking for strong, inducible, tissue-specific or early development stage promoters.

Codon usage optimisation

Our starting point was the Marchantia genome and the mRNA transcriptome predicted with the Geneious software (http://www.geneious.com/). The data were given to us by Bernardo from Jim's lab. 99 000 ORFs were predicted, a number that seems too large to be realistic; half of these were only 100 aa long. We therefore set the length threshold for candidate genes at 300 aa, which gave the expected normal distribution of lengths.
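The length cut-off described above can be sketched in a few lines of Python. This is only an illustration, not the script we actually ran: the function name `filter_orfs_by_length`, the dictionary input and the handling of a trailing `*` stop symbol are all assumptions made for the example.

```python
def filter_orfs_by_length(orfs, min_aa=300):
    """Keep predicted ORFs whose protein is at least `min_aa` residues long.

    `orfs` is assumed to map an ORF name to its translated protein sequence.
    A trailing '*' (stop symbol, if the prediction tool emits one) is not
    counted towards the length.
    """
    return {
        name: protein
        for name, protein in orfs.items()
        if len(protein.rstrip("*")) >= min_aa
    }


# Toy input: three made-up ORFs of 100, 300 and 299 residues.
example_orfs = {
    "orf_short": "M" * 100,
    "orf_ok": "M" * 300,
    "orf_just_under": "M" * 299 + "*",
}
candidates = filter_orfs_by_length(example_orfs)
```

With the toy input only `orf_ok` survives, since the other two fall below the 300 aa threshold.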

Using the BLAST software (http://blast.ncbi.nlm.nih.gov/Blast.cgi), we compared the proteins encoded by these candidate genes with the Arabidopsis proteins available on Araport (www.araport.org). Some of the sequences showed incomplete matches, indicating that our predicted ORFs should be treated with some caution. The longest sequence showed a 40% match, a low value, as expected.
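One way to inspect such matches is to export the BLAST results as tab-separated text and keep the best hit per query. The sketch below assumes the standard 12-column tabular layout (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore); we do not claim this is the format our own run produced, and the names `best_hits` and `query_coverage` are hypothetical.

```python
import csv
import io

# Column names of the standard 12-column BLAST tabular layout (assumed here).
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(tabular_text):
    """Return, for each query, the hit with the highest bit score."""
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        hit = dict(zip(BLAST_COLUMNS, row))
        hit["pident"] = float(hit["pident"])
        hit["bitscore"] = float(hit["bitscore"])
        hit["qstart"] = int(hit["qstart"])
        hit["qend"] = int(hit["qend"])
        query = hit["qseqid"]
        if query not in best or hit["bitscore"] > best[query]["bitscore"]:
            best[query] = hit
    return best


def query_coverage(hit, query_length):
    """Fraction of the query covered by the aligned region of one hit."""
    return (abs(hit["qend"] - hit["qstart"]) + 1) / query_length


# Two made-up hits for one made-up query of 300 residues.
example = (
    "orf1\tAT1G01010\t40.0\t150\t90\t0\t1\t150\t10\t159\t1e-20\t120\n"
    "orf1\tAT2G02020\t35.0\t80\t52\t0\t5\t84\t1\t80\t1e-05\t60\n"
)
hits = best_hits(example)
coverage = query_coverage(hits["orf1"], query_length=300)
```

A low coverage value, as in this toy case where only half of the query aligns, is the kind of incomplete match that suggests a predicted ORF deserves a closer look.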

We extracted the relevant DNA sequences from the candidate mRNAs using the BLAST output. The codons in these genes were then counted to obtain a codon table for Marchantia. While the table is not strikingly similar to that of Arabidopsis, we note a similar slight preference for C over other bases at the end of codons, and a similar preference for G-p-C sites.

codon	amino acid	per thousand	frequency
aaa	K	15.5903	0.386139
aag	K	24.7846	0.613861
aac	N	12.1702	0.485816
aat	N	12.8809	0.514184
aga	R	21.4977	0.204651
agg	R	21.7198	0.206765
agc	S	22.2528	0.211839
agt	S	11.6816	0.111205
aca	T	16.2121	0.249317
acg	T	17.4114	0.26776
acc	T	12.4811	0.19194
act	T	10.5268	0.161885
ata	I	8.83894	0.253827
atg	M	20.7426	1
atc	I	13.858	0.304094
att	I	12.1258	0.284969
gaa	E	16.4342	0.4474
gag	E	20.2985	0.5526
gac	D	13.4139	0.521589
gat	D	12.3035	0.478411
gga	G	15.4126	0.282343
ggg	G	13.325	0.244101
ggc	G	15.768	0.288853
ggt	G	10.0826	0.184703
gca	A	18.5218	0.287388
gcg	A	17.8111	0.276361
gcc	A	13.4139	0.208132
gct	A	14.702	0.228119
gta	V	8.17269	0.173749
gtg	V	15.457	0.328612
gtc	V	13.3695	0.28423
gtt	V	10.0382	0.213409
caa	Q	16.4786	0.39135
cag	Q	25.6285	0.60865
cac	H	13.7248	0.463268
cat	H	15.9012	0.536732
cga	R	17.7223	0.16871
cgg	R	15.7235	0.149683
cgc	R	16.0789	0.153066
cgt	R	12.3035	0.117125
cca	P	21.5422	0.319921
ccg	P	18.9216	0.281003
ccc	P	10.5712	0.156992
cct	P	16.301	0.242084
cta	L	6.88461	0.0693202
ctg	L	24.4292	0.245975
ctc	L	20.565	0.207066
ctt	L	17.3226	0.174419
taa	S	17.3226	0.174419
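
The two numeric columns can be reproduced from raw codon counts. "Per thousand" is the count of a codon per 1000 codons overall, and "frequency" is its share among the codons for the same amino acid: for lysine, for example, 15.5903 / (15.5903 + 24.7846) ≈ 0.386, which matches the aaa row. The Python below is a minimal sketch of that calculation, not the script we used; the function name `codon_usage` and the use of the standard genetic code are assumptions.

```python
from collections import Counter, defaultdict

# Standard genetic code, built from the conventional t, c, a, g ordering.
BASES = "tcag"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def codon_usage(cds_sequences):
    """Map codon -> (amino acid, per thousand, frequency among synonymous codons).

    Sequences are read in frame from their first base; codons containing
    characters other than a, c, g, t are ignored.
    """
    counts = Counter()
    for seq in cds_sequences:
        seq = seq.lower()
        for start in range(0, len(seq) - len(seq) % 3, 3):
            codon = seq[start:start + 3]
            if codon in CODON_TO_AA:
                counts[codon] += 1

    total = sum(counts.values())
    per_amino_acid = defaultdict(int)
    for codon, n in counts.items():
        per_amino_acid[CODON_TO_AA[codon]] += n

    return {
        codon: (
            CODON_TO_AA[codon],
            1000 * n / total,
            n / per_amino_acid[CODON_TO_AA[codon]],
        )
        for codon, n in counts.items()
    }


# Toy coding sequence: atg aaa aag aaa
usage = codon_usage(["atgaaaaagaaa"])
```

In the toy sequence, aaa accounts for two of the four codons (500 per thousand) and for two of the three lysine codons (frequency 2/3).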

@@ Line 33: / Line 33: @@
 <table class="reference" style="width:300px">
-<tbody>
+<tr>
 <th> codon </th>
 <th> amino acid </th>
 <th> per thousand </th>
 <th> frequency </th>
+</tr>
 <tr>
 	<td>aaa</td>