Team:TU Darmstadt/Results/Modeling/Open Software

From 2014.igem.org

(Difference between revisions)
Line 14: Line 14:
<div id="wikicontent" class="grid_19">
<div id="wikicontent" class="grid_19">
-
<!--TYPO3SEARCH_begin--><div id="c381" class="csc-default"><div class="csc-header csc-header-n1"><h1 class="csc-firstHeader">Open Software</h1></div><div class="csc-textpic-text"></div></div><div id="c382" class="csc-default"><article><header><header><article><p>According to our Open Hardware approach, we would like to contribute an automated version of a general sequence and structure file analysis. Furthermore it could be used as a corporate design data visualization tool during iGEM Projects.</p></article></header></header><p>Our preferred programming language is&nbsp;<i>R</i>, due to its user friendly interface. Changing code is easy and intuitive even for beginners. We implemented the following automated functions, which are free to use or to modify.</p></article></div><div id="c383" class="csc-default"><div class="csc-textpic-text"><table summary="" style="margin: 0px auto;" class="contenttable"><thead><tr><th scope="col"><header><h2>Sequence Analysis</h2></header></th><th><h2>&nbsp; &nbsp; &nbsp;&nbsp;</h2></th><th><h2><span style="text-align: start;">Structure Analysis</span></h2></th><th><h2>&nbsp; &nbsp; &nbsp;&nbsp;</h2></th><th scope="col"><h2>Plotting</h2></th></tr></thead><tbody><tr><td><p>&nbsp;</p><ul><li>Clustal Omega</li></ul><ul><li>Consensus&nbsp; Seqeuence</li></ul><ul><li>Conservative Side Detection</li></ul><ul><li>Shannon Entropy</li></ul><ul><li>Mutual Information</li></ul><p>&nbsp;</p></td><td></td><td><p>&nbsp;</p><ul><li>&nbsp;Normal mode Analysis</li></ul><ul><li>Comparison of models (NMA)</li></ul><ul><li>Trajectory Analysis</li></ul><ul><li>RMSD</li></ul><ul><li>RMSF</li></ul><ul><li>Binding estimation</li></ul><ul><li>Torison/Dihedral analysis</li></ul><ul><li>Distance matrix calculation</li></ul><p>&nbsp;</p></td><td></td><td><p>&nbsp;</p><ul><li>ggplot</li></ul><ul><li>HeatMap</li></ul><ul><li>Volcano Plot</li></ul><ul><li>Wireframe</li></ul><ul><li>'Fancy' 3D Scatter 
Plot</li></ul><p>&nbsp;</p><p>&nbsp;</p></td></tr></tbody></table></div></div><div id="c384" class="csc-default"><div class="csc-header csc-header-n4"><h1>Sequence Analysis</h1></div><div class="csc-textpic-text"><p>Bioinformatics relies essentially on sequences and their corresponding alignment. Bad sequence alignment will worse results received from calculation of Shannon Entropy and Mutual Information.&nbsp;<br />After using the Basic Local Alignment Search Tool (<i>BLAST</i>) you will have to align all your sequences by using a distribution of Clustal Omega for instance. When using any Linux system you can use this function after installing the needed software package.<sup>1</sup>&nbsp;<br />'<i>MSA_File</i>' will enable pre-aligning of your sequences by using the <i>tcltk</i> interface. After finished calculation you should rework output for an optimal solution. &nbsp;<br />The next function '<i>Analyse_Start</i>' is an automated version of sequence analysis. Per default it will calculate Shannon Entropy, two sets of mutual information (<i>SUMI</i>&nbsp;&amp; <i>ORMI</i> of the <i>BioPhysConnectoR</i> package) and a mutual information based contact map. General information like consensus sequence and potential conservative sites will also be computed and plotted automatically. Modifying your scope by using&nbsp; other default options like MI-Treshold for contact map objects and choosing nullmod counter for different calculation of mutual information can be chosen at start. Another option of your choice would be the change of used amino acid alphabet instead of the common set. 
Therefore you could gain knowledge about the distributed amino acids on a specific position and relevant chemical properties.</p></div></div><div id="c385" class="csc-default"><div class="csc-header csc-header-n5"><h1>Structure Analysis</h1></div><div class="csc-textpic-text"><p>Not only distribution of amino acid at a specific position inside an alignment is important but also knowledge about the three-dimensional structure and their implication on the function is crucial.&nbsp;<br />As written in our theory section, we used a normal mode analysis based on the <i>bio3d </i>package developed by <i>The Grant Lab</i>. Using<i>&nbsp;'igem_NMA'&nbsp;</i>we can validate motion of protein by using different force fields described in the corresponding R documentation. These will be automatically compared and relative residual cross correlation matrix will be plotted indicating a positive or negative correlation. Atomic fluctuations and deformation energy will also be quantified and saved as a pdb-file. Using the provided trajectory analysis calculation will enable calculation of RMSD and RMSF. Another interesting option would be computation of distance calculation between two different chains, ligand or chain and the absolute distance between all atoms inside a pdb as a distance matrix. General structural information like Torsion/Dihedral analysis can also be plotted easily.</p></div></div><div id="c386" class="csc-default"><div class="csc-header csc-header-n6"><h1>Plotting</h1></div><div class="csc-textpic-text"><footer><aside><p class="align-center"><i>Different data need different plots.</i></p>
+
<!--TYPO3SEARCH_begin--><div id="c381" class="csc-default"><div class="csc-header csc-header-n1"><h1 class="csc-firstHeader">Open Software</h1></div><div class="csc-textpic-text"></div></div><div id="c382" class="csc-default"><article><header><header><article><p>According to our Open Hardware approach, we would like to contribute an automated version of a general sequence and structure file analysis. Furthermore it could be used as a corporate design data visualization tool during iGEM Projects.</p></article></header></header><p>Our preferred programming language is&nbsp;<i>R</i>, due to its user friendly interface. Changing code is easy and intuitive even for beginners. We implemented the following automated functions, which are free to use or to modify.</p></article></div><div id="c383" class="csc-default"><div class="csc-textpic-text"><br><table summary="" style="margin: 0px auto;" class="contenttable"><thead><tr><th scope="col"><header><h2>Sequence Analysis</h2></header></th><th><h2>&nbsp; &nbsp; &nbsp;&nbsp;</h2></th><th><h2><span style="text-align: start;">Structure Analysis</span></h2></th><th><h2>&nbsp; &nbsp; &nbsp;&nbsp;</h2></th><th scope="col"><h2>Plotting</h2></th></tr></thead><tbody><tr><td><p>&nbsp;</p><ul><li>Clustal Omega</li></ul><ul><li>Consensus&nbsp; Seqeuence</li></ul><ul><li>Conservative Side Detection</li></ul><ul><li>Shannon Entropy</li></ul><ul><li>Mutual Information</li></ul><p>&nbsp;</p></td><td></td><td><p>&nbsp;</p><ul><li>&nbsp;Normal mode Analysis</li></ul><ul><li>Comparison of models (NMA)</li></ul><ul><li>Trajectory Analysis</li></ul><ul><li>RMSD</li></ul><ul><li>RMSF</li></ul><ul><li>Binding estimation</li></ul><ul><li>Torison/Dihedral analysis</li></ul><ul><li>Distance matrix calculation</li></ul><p>&nbsp;</p></td><td></td><td><p>&nbsp;</p><ul><li>ggplot</li></ul><ul><li>HeatMap</li></ul><ul><li>Volcano Plot</li></ul><ul><li>Wireframe</li></ul><ul><li>'Fancy' 3D Scatter 
Plot</li></ul><p>&nbsp;</p><p>&nbsp;</p></td></tr></tbody></table></div></div><div id="c384" class="csc-default"><div class="csc-header csc-header-n4"><h1>Sequence Analysis</h1></div><div class="csc-textpic-text"><p>Bioinformatics relies essentially on sequences and their corresponding alignment. Bad sequence alignment will worse results received from calculation of Shannon Entropy and Mutual Information.&nbsp;<br />After using the Basic Local Alignment Search Tool (<i>BLAST</i>) you will have to align all your sequences by using a distribution of Clustal Omega for instance. When using any Linux system you can use this function after installing the needed software package.<sup>1</sup>&nbsp;<br />'<i>MSA_File</i>' will enable pre-aligning of your sequences by using the <i>tcltk</i> interface. After finished calculation you should rework output for an optimal solution. &nbsp;<br />The next function '<i>Analyse_Start</i>' is an automated version of sequence analysis. Per default it will calculate Shannon Entropy, two sets of mutual information (<i>SUMI</i>&nbsp;&amp; <i>ORMI</i> of the <i>BioPhysConnectoR</i> package) and a mutual information based contact map. General information like consensus sequence and potential conservative sites will also be computed and plotted automatically. Modifying your scope by using&nbsp; other default options like MI-Treshold for contact map objects and choosing nullmod counter for different calculation of mutual information can be chosen at start. Another option of your choice would be the change of used amino acid alphabet instead of the common set. 
Therefore you could gain knowledge about the distributed amino acids on a specific position and relevant chemical properties.</p></div></div><div id="c385" class="csc-default"><div class="csc-header csc-header-n5"><h1>Structure Analysis</h1></div><div class="csc-textpic-text"><p>Not only distribution of amino acid at a specific position inside an alignment is important but also knowledge about the three-dimensional structure and their implication on the function is crucial.&nbsp;<br />As written in our theory section, we used a normal mode analysis based on the <i>bio3d </i>package developed by <i>The Grant Lab</i>. Using<i>&nbsp;'igem_NMA'&nbsp;</i>we can validate motion of protein by using different force fields described in the corresponding R documentation. These will be automatically compared and relative residual cross correlation matrix will be plotted indicating a positive or negative correlation. Atomic fluctuations and deformation energy will also be quantified and saved as a pdb-file. Using the provided trajectory analysis calculation will enable calculation of RMSD and RMSF. Another interesting option would be computation of distance calculation between two different chains, ligand or chain and the absolute distance between all atoms inside a pdb as a distance matrix. General structural information like Torsion/Dihedral analysis can also be plotted easily.</p></div></div><div id="c386" class="csc-default"><div class="csc-header csc-header-n6"><h1>Plotting</h1></div><div class="csc-textpic-text"><footer><aside><p class="align-center"><i>Different data need different plots.</i></p>
<p>Therefore we are willing to provide standardized plotting functions in R.<sup>2</sup>&nbsp; Although though of an corporate design, these can be modified by the user easily by adding new layers onto an existing graph. Although you do minor changes like another main title or using other fonts the output will be the same.<br />Most data visualization will be in a two dimensional space but can be achieved - in R - with different input classes like 'data.frame','vector' and 'matrix', although latter must be converted into class 'data.frame' before plot initialization.<sup>3 '</sup><i>Save2D_Vec</i>' and '<i>Save2D_DF'</i>&nbsp;need different input information as written in each function name. Both displayed dot plots connected via respective coloured line. Auxilliary it will create a corresponding bar or density plot, due to its input information. Width of bar plot can be calculated during runtime or taken from command line. Other plots describe a three dimensional space as shown in HeatMap and Volcano Plot. The last two function create plots best used for short fragments, due to automatic highlightning of data as text inside plot. 
Setting ticks manually inside all plots is preferred, while using short sequences because of possible overplotting.&nbsp;</p></aside></footer></div></div><div id="c389" class="csc-default"><div></div><div><ol><li><a href="http://www.clustal.org/omega/" target="_blank">http://www.clustal.org/omega/</a></li><li><a href="http://www.r-project.org/" target="_blank">http://www.r-project.org/</a></li><li><a href="http://docs.ggplot2.org/current/" target="_blank">http://docs.ggplot2.org/current/</a></li></ol></div><div></div></div><div id="c387" class="csc-default csc-space-before-10"><div class="csc-header csc-header-n8"><h1>Downloadable Content</h1></div><ul class="csc-uploads csc-uploads-0"><li class="li-odd li-first csc-uploads-element csc-uploads-element-r"><span class="csc-uploads-fileName"><a href="fileadmin/files/iGEM_Analysis.R" target="_blank">iGEM Analysis.R</a></span></li><li class="li-even csc-uploads-element csc-uploads-element-r"><span class="csc-uploads-fileName"><a href="fileadmin/files/iGEM_Calc.R" target="_blank">iGEM Calc.R</a></span></li><li class="li-odd csc-uploads-element csc-uploads-element-r"><span class="csc-uploads-fileName"><a href="fileadmin/files/iGEM_Clustalo.R" target="_blank">iGEM Clustalo.R</a></span></li><li class="li-even csc-uploads-element csc-uploads-element-r"><span class="csc-uploads-fileName"><a href="fileadmin/files/iGEM_Plot.R" target="_blank">iGEM Plot.R</a></span></li><li class="li-odd csc-uploads-element csc-uploads-element-r"><span class="csc-uploads-fileName"><a href="fileadmin/files/iGEM_ReducingAlphabet.R" target="_blank">iGEM ReducingAlphabet.R</a></span></li><li class="li-even csc-uploads-element csc-uploads-element-r"><span class="csc-uploads-fileName"><a href="fileadmin/files/iGEM_StatFunc.R" target="_blank">iGEM StatFunc.R</a></span></li></ul></div><div id="c388" class="csc-default"><div class="csc-header csc-header-n9"><h1>Reference</h1></div><footer><footer><footer><footer><ol><li>Hoffgaard, F., Weil, P., &amp; 
Hamacher, K. (2010). BioPhysConnectoR: Connecting sequence information and biophysical models. BMC Bioinformatics, 11, 199. doi:10.1186/1471-2105-11-199.&nbsp;</li><li>Sievers, F., Wilm, A., Dineen, D., Gibson, T. J., Karplus, K., Li, W., Lopez, R., et al. (2011). Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Molecular Systems Biology, 7(539), 539. Nature Publishing Group. doi:10.1038/msb.2011.75&nbsp;</li><li>Shanon, C. E. (1948). A Mathematical Theory of Communication. The Bell System Technical Journal, 27, 379-423.&nbsp;</li><li>Bio3D: An R package for the comparative analysis of protein structures. Grant, Rodrigues, ElSawy, McCammon, Caves, (2006) Bioinformatics 22, 2695-2696.org/bio3d/index.php</li><li>Timischl, Werner, Biostatistik, Eine Einführung für Biologen und Mediziner, Springer, 3. Auflage 2012</li><li><a href="http://bio.math-inf.uni-greifswald.de/viscose/html/alphabets.html" target="_blank">bio.math-inf.uni-greifswald.de/viscose/html/alphabets.html</a></li></ol></footer></footer></footer></footer></div><!--TYPO3SEARCH_end-->
<p>Therefore we are willing to provide standardized plotting functions in R.<sup>2</sup>&nbsp; Although though of an corporate design, these can be modified by the user easily by adding new layers onto an existing graph. Although you do minor changes like another main title or using other fonts the output will be the same.<br />Most data visualization will be in a two dimensional space but can be achieved - in R - with different input classes like 'data.frame','vector' and 'matrix', although latter must be converted into class 'data.frame' before plot initialization.<sup>3 '</sup><i>Save2D_Vec</i>' and '<i>Save2D_DF'</i>&nbsp;need different input information as written in each function name. Both displayed dot plots connected via respective coloured line. Auxilliary it will create a corresponding bar or density plot, due to its input information. Width of bar plot can be calculated during runtime or taken from command line. Other plots describe a three dimensional space as shown in HeatMap and Volcano Plot. The last two function create plots best used for short fragments, due to automatic highlightning of data as text inside plot. 
Setting ticks manually inside all plots is preferred, while using short sequences because of possible overplotting.&nbsp;</p></aside></footer></div></div><div id="c389" class="csc-default"><div></div><div><ol><li><a href="http://www.clustal.org/omega/" target="_blank">http://www.clustal.org/omega/</a></li><li><a href="http://www.r-project.org/" target="_blank">http://www.r-project.org/</a></li><li><a href="http://docs.ggplot2.org/current/" target="_blank">http://docs.ggplot2.org/current/</a></li></ol></div><div></div></div><div id="c387" class="csc-default csc-space-before-10"><div class="csc-header csc-header-n8"><h1>Downloadable Content</h1></div><ul class="csc-uploads csc-uploads-0"><li class="li-odd li-first csc-uploads-element csc-uploads-element-r"><span class="csc-uploads-fileName"><a href="fileadmin/files/iGEM_Analysis.R" target="_blank">iGEM Analysis.R</a></span></li><li class="li-even csc-uploads-element csc-uploads-element-r"><span class="csc-uploads-fileName"><a href="fileadmin/files/iGEM_Calc.R" target="_blank">iGEM Calc.R</a></span></li><li class="li-odd csc-uploads-element csc-uploads-element-r"><span class="csc-uploads-fileName"><a href="fileadmin/files/iGEM_Clustalo.R" target="_blank">iGEM Clustalo.R</a></span></li><li class="li-even csc-uploads-element csc-uploads-element-r"><span class="csc-uploads-fileName"><a href="fileadmin/files/iGEM_Plot.R" target="_blank">iGEM Plot.R</a></span></li><li class="li-odd csc-uploads-element csc-uploads-element-r"><span class="csc-uploads-fileName"><a href="fileadmin/files/iGEM_ReducingAlphabet.R" target="_blank">iGEM ReducingAlphabet.R</a></span></li><li class="li-even csc-uploads-element csc-uploads-element-r"><span class="csc-uploads-fileName"><a href="fileadmin/files/iGEM_StatFunc.R" target="_blank">iGEM StatFunc.R</a></span></li></ul></div><div id="c388" class="csc-default"><div class="csc-header csc-header-n9"><h1>Reference</h1></div><footer><footer><footer><footer><ol><li>Hoffgaard, F., Weil, P., &amp; 
Hamacher, K. (2010). BioPhysConnectoR: Connecting sequence information and biophysical models. BMC Bioinformatics, 11, 199. doi:10.1186/1471-2105-11-199.&nbsp;</li><li>Sievers, F., Wilm, A., Dineen, D., Gibson, T. J., Karplus, K., Li, W., Lopez, R., et al. (2011). Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Molecular Systems Biology, 7(539), 539. Nature Publishing Group. doi:10.1038/msb.2011.75&nbsp;</li><li>Shanon, C. E. (1948). A Mathematical Theory of Communication. The Bell System Technical Journal, 27, 379-423.&nbsp;</li><li>Bio3D: An R package for the comparative analysis of protein structures. Grant, Rodrigues, ElSawy, McCammon, Caves, (2006) Bioinformatics 22, 2695-2696.org/bio3d/index.php</li><li>Timischl, Werner, Biostatistik, Eine Einführung für Biologen und Mediziner, Springer, 3. Auflage 2012</li><li><a href="http://bio.math-inf.uni-greifswald.de/viscose/html/alphabets.html" target="_blank">bio.math-inf.uni-greifswald.de/viscose/html/alphabets.html</a></li></ol></footer></footer></footer></footer></div><!--TYPO3SEARCH_end-->
</div>
</div>

Revision as of 19:00, 17 October 2014

Open Software

In line with our Open Hardware approach, we would like to contribute an automated workflow for general sequence and structure file analysis. It can also serve as a data visualization tool with a consistent corporate design during iGEM projects.

Our preferred programming language is R because of its user-friendly interface: changing code is easy and intuitive, even for beginners. We implemented the following automated functions, which are free to use and modify.


Sequence Analysis
  • Clustal Omega
  • Consensus Sequence
  • Conserved Site Detection
  • Shannon Entropy
  • Mutual Information

Structure Analysis
  • Normal Mode Analysis
  • Comparison of models (NMA)
  • Trajectory Analysis
  • RMSD
  • RMSF
  • Binding estimation
  • Torsion/Dihedral analysis
  • Distance matrix calculation

Plotting
  • ggplot
  • HeatMap
  • Volcano Plot
  • Wireframe
  • 'Fancy' 3D Scatter Plot

Sequence Analysis

Bioinformatics relies essentially on sequences and their alignment. A poor sequence alignment will degrade the results obtained from the Shannon entropy and mutual information calculations.
After running the Basic Local Alignment Search Tool (BLAST), you will have to align all your sequences, for instance with a distribution of Clustal Omega. On any Linux system you can use this function after installing the required software package.1
'MSA_File' lets you pre-align your sequences through the tcltk interface. Once the calculation has finished, you should refine the output by hand to obtain an optimal alignment.
The next function, 'Analyse_Start', automates the sequence analysis. By default it calculates the Shannon entropy, two sets of mutual information (SUMI & ORMI from the BioPhysConnectoR package) and a mutual-information-based contact map. General information such as the consensus sequence and potentially conserved sites is also computed and plotted automatically. Options that change the scope of the analysis, such as the MI threshold for contact map objects and the nullmod counter for the different mutual information calculations, can be set at start. You can also replace the common amino acid alphabet with a different one. This gives insight into the distribution of amino acids at a specific position and their relevant chemical properties.
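
The per-column quantities named above can be illustrated with a minimal base-R sketch. It is not the code of 'Analyse_Start': the toy alignment and the helper names (shannon_entropy, mutual_information, consensus) are assumptions made for illustration, and the plain mutual information shown here does not reproduce the SUMI or ORMI null models of BioPhysConnectoR.

```r
# Toy alignment, one aligned sequence per element (hypothetical data).
aln <- c("ACDA", "ACDA", "GCEA", "GCEA")

# Character matrix: rows = sequences, columns = alignment positions.
msa <- do.call(rbind, strsplit(aln, ""))

# Shannon entropy (in bits) of the symbols in one column.
shannon_entropy <- function(x) {
  p <- table(x) / length(x)
  -sum(p * log2(p))
}

# Plain mutual information of two columns: H(X) + H(Y) - H(X, Y).
mutual_information <- function(x, y) {
  shannon_entropy(x) + shannon_entropy(y) - shannon_entropy(paste(x, y))
}

# Consensus: most frequent symbol per column (ties resolved alphabetically).
consensus <- function(m) {
  paste(apply(m, 2, function(col) names(which.max(table(col)))), collapse = "")
}

entropy_per_column <- apply(msa, 2, shannon_entropy)
mi_matrix <- outer(
  seq_len(ncol(msa)), seq_len(ncol(msa)),
  Vectorize(function(i, j) mutual_information(msa[, i], msa[, j]))
)
cons <- consensus(msa)
```

In this toy alignment, columns 1 and 3 vary together, so their mutual information is high, while the fully conserved columns 2 and 4 have zero entropy.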

Structure Analysis

Not only the distribution of amino acids at a specific position in an alignment is important; knowledge of the three-dimensional structure and its implications for function is also crucial.
As described in our theory section, we used a normal mode analysis based on the bio3d package developed by the Grant Lab. With 'igem_NMA' we can validate the motion of a protein using the different force fields described in the corresponding R documentation. The results are compared automatically, and the residue cross-correlation matrix is plotted, indicating positive or negative correlation. Atomic fluctuations and deformation energies are also quantified and saved as a PDB file. The provided trajectory analysis calculates RMSD and RMSF. Another useful option is the calculation of distances between two different chains, or between a ligand and a chain, as well as the absolute distances between all atoms in a PDB file as a distance matrix. General structural information such as torsion/dihedral angles can also be plotted easily.
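
To make the trajectory quantities concrete, here is a minimal base-R sketch of RMSD, RMSF and a distance matrix. It is an illustration only, not the team's functions: the toy trajectory and the helper names (rmsd_to_ref, rmsf_per_atom, distance_matrix) are assumptions, and the frames are assumed to be superposed already. bio3d provides its own rmsd(), rmsf() and dm() for real structures.

```r
# Hypothetical toy trajectory: one row per frame, columns x1, y1, z1, x2, y2, z2.
traj <- rbind(
  c(0, 0, 0,  1, 0, 0),
  c(0, 0, 0,  2, 0, 0),
  c(0, 0, 0,  3, 0, 0)
)

# RMSD of every frame to a reference frame (no superposition is performed).
rmsd_to_ref <- function(traj, ref = traj[1, ]) {
  n_atoms <- length(ref) / 3
  apply(traj, 1, function(frame) sqrt(sum((frame - ref)^2) / n_atoms))
}

# RMSF per atom: root mean square deviation from the time-averaged position.
rmsf_per_atom <- function(traj) {
  dev2 <- sweep(traj, 2, colMeans(traj))^2      # squared deviation per coordinate
  per_atom <- sapply(seq_len(ncol(traj) / 3), function(a) {
    rowSums(dev2[, (3 * a - 2):(3 * a), drop = FALSE])
  })                                            # frames x atoms
  sqrt(colMeans(per_atom))
}

# Pairwise Euclidean distances between the atoms of a single frame.
distance_matrix <- function(frame) {
  as.matrix(dist(matrix(frame, ncol = 3, byrow = TRUE)))
}

rmsd_values <- rmsd_to_ref(traj)
rmsf_values <- rmsf_per_atom(traj)
dm_last     <- distance_matrix(traj[3, ])
```

Here only the second atom moves, so its RMSF is non-zero while the first atom's is zero, and the RMSD grows with each frame as the atom moves away from its starting position.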

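Plotting

Different data need different plots.

We therefore provide standardized plotting functions in R.2 Although they were built around a corporate design, the user can easily modify them by adding new layers onto an existing graph. Minor changes, such as a different main title or other fonts, leave the rest of the output unchanged.
Most data are visualized in two dimensions, and in R the input may be of class 'data.frame', 'vector' or 'matrix', although the latter must be converted to class 'data.frame' before the plot is initialized.3 'Save2D_Vec' and 'Save2D_DF' expect the input type given in their function names. Both draw dot plots whose points are connected by a line of the respective colour. In addition, each creates a corresponding bar or density plot, depending on the input. The bar width can be calculated at runtime or taken from the command line. Other plots, such as HeatMap and Volcano Plot, represent three-dimensional data. These last two functions create plots best suited to short fragments, because data are automatically highlighted as text inside the plot. Setting ticks manually is preferred in all plots when working with short sequences, because of possible overplotting.

  1. http://www.clustal.org/omega/
  2. http://www.r-project.org/
  3. http://docs.ggplot2.org/current/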
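
A matrix has to be converted into a data.frame before it can be handed to ggplot. The following is a minimal sketch of that conversion using base R only; the helper name matrix_to_long and the toy matrix are assumptions for illustration and are not taken from the provided plotting functions. The ggplot2 part runs only if the package is installed.

```r
# Hypothetical toy matrix with named rows and columns.
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("r1", "r2"), c("c1", "c2", "c3")))

# Convert a matrix to a long data.frame with one row per cell.
matrix_to_long <- function(mat) {
  long <- as.data.frame(as.table(mat), responseName = "value",
                        stringsAsFactors = FALSE)
  names(long)[1:2] <- c("row", "col")
  long
}

long <- matrix_to_long(m)

# Optional: draw a simple heat map and add one more layer to the existing graph.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(long, ggplot2::aes(x = col, y = row, fill = value)) +
    ggplot2::geom_tile()
  p <- p + ggplot2::ggtitle("Toy heat map")  # new layer on top of the base plot
}
```

Keeping the conversion in its own small function means the same long data.frame can be reused by several plot types.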

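Downloadable Content

  • iGEM Analysis.R
  • iGEM Calc.R
  • iGEM Clustalo.R
  • iGEM Plot.R
  • iGEM ReducingAlphabet.R
  • iGEM StatFunc.R

Reference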

  1. Hoffgaard, F., Weil, P., & Hamacher, K. (2010). BioPhysConnectoR: Connecting sequence information and biophysical models. BMC Bioinformatics, 11, 199. doi:10.1186/1471-2105-11-199
  2. Sievers, F., Wilm, A., Dineen, D., Gibson, T. J., Karplus, K., Li, W., Lopez, R., et al. (2011). Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Molecular Systems Biology, 7, 539. doi:10.1038/msb.2011.75
  3. Shannon, C. E. (1948). A Mathematical Theory of Communication. The Bell System Technical Journal, 27, 379-423.
  4. Grant, Rodrigues, ElSawy, McCammon, & Caves (2006). Bio3D: An R package for the comparative analysis of protein structures. Bioinformatics, 22, 2695-2696. org/bio3d/index.php
  5. Timischl, W. (2012). Biostatistik: Eine Einführung für Biologen und Mediziner (3rd ed.). Springer.
  6. bio.math-inf.uni-greifswald.de/viscose/html/alphabets.html