Team:UFMG Brazil/Project/Modelling


Revision as of 02:12, 18 October 2014 by Daniela fc


Beautiful models!

Protein models

To obtain the three-dimensional structure ...

... of our conditional sensor, designed to bind repetitive DNA sequences, we employed comparative modelling. We began by searching for appropriate templates for the selected BioBrick sequences: TALE (BBa_K747027, BBa_K747043, BBa_K747059 and BBa_K747075, obtained from the Registry, plus BBa_K1514002 and BBa_K1514003, which we synthesized) + linker + hemiCherry1 (BBa_K1514000) or hemiCherry2 (BBa_K1514001). These sequences were submitted to a PSI-BLAST similarity search against the Protein Data Bank (PDB). Templates for each domain were selected based on percentage of residue identity, e-value, alignment score and sequence coverage.
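The selection criteria above can be sketched as a simple filter-and-sort over search hits. This is a minimal illustration, not the procedure the team actually ran: the `Hit` record layout, the `rank_templates` helper, the three cut-off values and the example hits are all assumptions made for the example, and the alignment score is left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One PSI-BLAST hit against the PDB (hypothetical record layout)."""
    pdb_id: str
    identity: float  # percent identical residues in the alignment
    coverage: float  # percent of the query covered by the alignment
    evalue: float    # expectation value reported by the search


def rank_templates(hits, min_identity=30.0, min_coverage=80.0, max_evalue=1e-5):
    """Drop weak hits, then sort the rest from best to worst.

    The three cut-offs are illustrative assumptions, not values from the text.
    """
    kept = [
        h for h in hits
        if h.identity >= min_identity
        and h.coverage >= min_coverage
        and h.evalue <= max_evalue
    ]
    # Higher identity and coverage first; a lower e-value breaks ties.
    return sorted(kept, key=lambda h: (-h.identity, -h.coverage, h.evalue))


# Made-up hits, only to show the call.
example_hits = [
    Hit("hitA", identity=95.0, coverage=100.0, evalue=1e-80),
    Hit("hitB", identity=45.0, coverage=60.0, evalue=1e-20),
    Hit("hitC", identity=98.0, coverage=100.0, evalue=1e-90),
]
ranked = rank_templates(example_hits)
print([h.pdb_id for h in ranked])
```

In this toy input, `hitB` is discarded for low coverage and the remaining hits are ordered by identity.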

To start modelling, Chimera 1.9 (Pettersen et al., 2004) was used for sequence alignment. Alignments were generated with default parameter values and then manually curated using DNATagger (Scherer and Basso, 2008). Because our proteins are chimeras, a different template was chosen for each protein region. A set of at least 100 models was then generated with Modeller 9.10 (Eswar et al., 2006), and the structural characteristics of each protein part were analysed in the best models. Torsional angles in the linker region were afterwards adjusted manually in Swiss-PdbViewer (Johansson et al., 2012), and the quality of the final models was validated using the QMEAN Z-score (Benkert et al., 2008).

After obtaining our final models, we performed a structural alignment of both mCherry parts against the active mCherry structure (PDB 2H5Q). This alignment allowed us to estimate the final structure of our models bound to DNA and the distance between the two TALE domains on the DNA, which we then used in our mathematical modelling.


Two PDB structures were selected as templates for model building. For the N-terminal part of our molecule, the crystal structure of a TAL effector (3UGM) was selected, and for the C-terminal part, mCherry (2H5Q) was used. Except for the linker region, the templates had 100% coverage and close to 100% identity against our sequences (99.4% to mCherry1, 100% to mCherry2 and 92.9% for both TALE parts). After modelling, we selected the models with the best Z-DOPE scores for each protein (Figure 1). Our model consists of six concatenated TALEs, self-associated in the shape of a right-handed superhelix wrapped around the DNA major groove, connected by a linker to a hemiCherry beta-barrel structure.
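Picking the best model by Z-DOPE amounts to taking the minimum over the generated set, since a more negative normalized DOPE score indicates a more native-like model. The sketch below assumes the scores have already been collected into a dictionary; the `best_model` helper, the file names and the score values are hypothetical and are not the team's data.

```python
def best_model(scores):
    """Return the (name, z_dope) pair with the lowest normalized DOPE score.

    `scores` maps a model file name to its Z-DOPE value; lower is better.
    """
    if not scores:
        raise ValueError("no models to choose from")
    return min(scores.items(), key=lambda item: item[1])


# Made-up values, only to show the call.
example_scores = {
    "model_001.pdb": -0.42,
    "model_002.pdb": -1.10,
    "model_003.pdb": -0.87,
}
name, z_dope = best_model(example_scores)
print(name, z_dope)
```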

Assessment of model quality for each protein with the QMEAN server showed that our models are of high quality, with |Z-score| less than 1. QMEAN is a composite scoring function that derives both global and local error estimates from a single model. The QMEAN Z-score indicates how many standard deviations the model's score differs from the values expected for experimental structures. This is illustrated in the two graphs in Figure 2, where points closer to black reflect a low Z-score and a low standard deviation.
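The definition of a Z-score used here can be written out in a few lines. This only illustrates the arithmetic: the QMEAN server derives its reference mean and standard deviation from experimental structures, whereas the numbers below, and the helper names `z_score` and `within_one_sd`, are invented for the example.

```python
def z_score(value, reference_mean, reference_sd):
    """Number of standard deviations between `value` and the reference mean."""
    if reference_sd <= 0:
        raise ValueError("reference standard deviation must be positive")
    return (value - reference_mean) / reference_sd


def within_one_sd(z):
    """True when |Z| < 1, the criterion quoted in the text."""
    return abs(z) < 1.0


# Hypothetical model score and reference distribution.
z = z_score(0.70, reference_mean=0.75, reference_sd=0.10)
print(round(z, 2), within_one_sd(z))
```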

To estimate whether our models would be able to bind DNA while maintaining the reconstituted mCherry conformation, we aligned both parts to the structure of the active form and kept their TALE domains in a linear spatial configuration. This showed that our models are compatible with both DNA binding and mCherry reconstitution. We also calculated the inner distance between the two linked TALEs, which was 35 Å, suggesting that there must be approximately 10 DNA base pairs between the two (GT)6 binding regions (Figure 3).
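The step from 35 Å to roughly 10 base pairs can be checked with a one-line conversion. The sketch assumes ideal B-form DNA with an axial rise of about 3.4 Å per base pair; the text does not state which value was used, so that constant and the `spacer_base_pairs` helper are assumptions.

```python
B_DNA_RISE_ANGSTROM = 3.4  # approximate axial rise per base pair in B-form DNA


def spacer_base_pairs(distance_angstrom, rise=B_DNA_RISE_ANGSTROM):
    """Convert a distance along the DNA axis into a number of base pairs."""
    if rise <= 0:
        raise ValueError("rise per base pair must be positive")
    return distance_angstrom / rise


bp = spacer_base_pairs(35.0)
print(f"{bp:.1f} bp, rounded to {round(bp)}")  # 10.3 bp, rounded to 10
```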

Probabilistic models

To estimate the chance of our protein binding to human DNA ...

... we built a repeat library using human chromosome 1 as the template, with the programs RepeatScout v1.0.5 (Price et al., 2005) and RepeatMasker 3.0 (Smit et al., 2010). The repeat library initially comprised all possible repeat patterns. We then selected only the (GT)n and (CA)n repeats, which are recognized by our TALE protein domains (Figure 1). In both cases, the most frequent tandem repeat sizes were between 15 and 24.
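To make the (GT)n and (CA)n selection concrete, the sketch below finds tandem runs of a dinucleotide with a regular expression and reports how many units each run contains. It is a stand-in for illustration only: the analysis described here used RepeatScout and RepeatMasker, and the `repeat_unit_counts` helper and the toy sequence are assumptions.

```python
import re
from collections import Counter


def repeat_unit_counts(sequence, unit, min_units=2):
    """Return the number of repeat units in every tandem run of `unit`.

    Only perfect, uninterrupted runs with at least `min_units` copies count.
    """
    pattern = re.compile(f"(?:{re.escape(unit)}){{{min_units},}}")
    return [
        len(match.group(0)) // len(unit)
        for match in pattern.finditer(sequence.upper())
    ]


# Toy sequence: one (GT)3 run and one (CA)4 run.
toy = "AA" + "GT" * 3 + "CC" + "CA" * 4 + "TT"
gt_runs = repeat_unit_counts(toy, "GT")
ca_runs = repeat_unit_counts(toy, "CA")
print(Counter(gt_runs), Counter(ca_runs))
```

Tallying the returned counts over a whole chromosome would give a size distribution of the kind summarised above.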

After keeping only repetitive sequences with at least 12 tandem repetitions, we calculated the chance of finding these elements in DNA fragments of different sizes from chromosome 1. Our results show a higher chance of finding these elements in fragments larger than 2000 bp than in smaller fragments (Figure 2). Considering the binding of a hemiCherry sensor to a DNA strand, the signal intensity from a 2000 bp fragment can be 10 times higher than that from a fragment of only 200 bp.
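One simple way to estimate such a chance is to cut a sequence into fixed-size fragments and count the fraction that contain a qualifying repeat. The sketch below does this for (GT)n or (CA)n runs of at least 12 units on a synthetic sequence. The non-overlapping windowing scheme, the `fraction_with_repeat` helper and the toy input are assumptions; this does not reproduce the chromosome 1 analysis.

```python
import re

# Perfect (GT)n or (CA)n runs with at least 12 repeat units.
REPEAT = re.compile(r"(?:GT){12,}|(?:CA){12,}")


def fraction_with_repeat(sequence, fragment_size):
    """Fraction of non-overlapping fragments that contain a qualifying repeat.

    A trailing partial fragment is ignored, and a repeat that straddles a
    fragment boundary is not counted unless one side still has 12 units.
    """
    fragments = [
        sequence[start:start + fragment_size]
        for start in range(0, len(sequence) - fragment_size + 1, fragment_size)
    ]
    if not fragments:
        raise ValueError("sequence is shorter than one fragment")
    hits = sum(1 for fragment in fragments if REPEAT.search(fragment))
    return hits / len(fragments)


# Synthetic 2000 bp sequence with a single (GT)12 run; not real genome data.
toy = "A" * 1000 + "GT" * 12 + "A" * 976
small = fraction_with_repeat(toy, 200)
large = fraction_with_repeat(toy, 2000)
print(small, large)
```

With only one repeat present, the larger fragment size necessarily captures it more often; on real data the ratio depends on how the repeats are distributed.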


Scherer, N.M. and Basso, D.M. (2008) DNATagger, colors for codons. Genet. Mol. Res. 7(3): 853-860.

Eswar, N., Marti-Renom, M.A., Webb, B., Madhusudhan, M.S., Eramian, D., Shen, M., Pieper, U. and Sali, A. (2006) Comparative protein structure modeling with MODELLER. Current Protocols in Bioinformatics, John Wiley & Sons, Inc., Supplement 15: 5.6.1-5.6.30.

Johansson, M.U., Zoete, V., Michielin, O. and Guex, N. (2012) Defining and searching for structural motifs using DeepView/Swiss-PdbViewer. BMC Bioinformatics 13: 173.

Benkert, P., Tosatto, S.C.E. and Schomburg, D. (2008) QMEAN: A comprehensive scoring function for model quality assessment. Proteins: Structure, Function, and Bioinformatics 71(1): 261-277.

Price, A.L., Jones, N.C. and Pevzner, P.A. (2005) De novo identification of repeat families in large genomes. Proceedings of the 13th Annual International Conference on Intelligent Systems for Molecular Biology (ISMB-05), Detroit, Michigan.

Smit, A.F.A., Hubley, R. and Green, P. (1996-2010) RepeatMasker Open-3.0.