Team:Heidelberg/pages/midnightdoc

From 2014.igem.org


Latest revision as of 03:55, 18 October 2014

Introduction

Documentation is one of the most important aspects of any scientific project. Nevertheless, maintaining a detailed log of all experiments performed can be a daunting task, especially when consistency and standardization across multiple people are required. This consistency is particularly important in the context of iGEM, where group work is paramount. Also, as many iGEM projects build upon the results of preceding teams, a standardized way of documenting would further enhance this interaction and increase the reproducibility of projects. Thus, a consistent, semi-automated documentation method could help solve many of the problems faced not only by iGEM teams, but also by researchers in general.

For these reasons, we created software that does just that. MidnightDoc enables reproducibility by making sure that documenting experimental methods becomes a fun activity and a natural extension of the normal lab routine. In the following, we describe the main design principles behind this fine piece of software.

Lay out your plan, then record what you did

The most important design principle behind MidnightDoc is that it accompanies lab work and in fact makes it easier. To illustrate this, consider the simple example of a restriction or ligation experiment. The user first looks up the concentrations of their fragments, as measured for example by a spectrophotometer. Then, based on these concentration measurements and the manufacturer's guidelines, which might specify the required amount of each fragment (e.g. in micromoles or nanograms) or the ratio between them, the user calculates the necessary volumes. Only after these calculations have been done can the user mix the components and start the reactions. Once this has been done, the user documents some of the aforementioned numbers (e.g. the volumes used) in their lab notebook for future reference.

Now, notice that there are several ways in which a software tool can assist in this process. First of all, an integrated database with information on all available plasmids, PCR-amplified DNA fragments, etc. would make retrieving the concentration information a lot easier. This information can then be used immediately for the subsequent calculations, which should also be done by the software.

To achieve this, you have to pre-specify your plan: the experimental guidelines and procedures that you want to follow. The rules, such as the ratios of fragments, are then used to automate the calculations. Thus, MidnightDoc helps with your calculations and at the same time documents what you did. Also, once you have created the protocol for an experimental procedure, such as a ligation of different fragments, you can reuse it with new details, such as new fragments or different concentrations.
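To make the idea of rule-based calculations concrete, here is a minimal Python sketch of a ligation rule. It is an illustration only and not taken from the MidnightDoc source code: the names `Fragment` and `ligation_volumes`, the default of 50 ng of vector and the 3:1 insert-to-vector molar ratio are all assumptions. It relies on the standard relation that, for double-stranded DNA, mass is proportional to length times molar amount.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """Hypothetical database entry for a DNA fragment."""
    name: str
    length_bp: int          # fragment length in base pairs
    conc_ng_per_ul: float   # measured concentration, e.g. by spectrophotometer


def ligation_volumes(vector, insert, vector_ng=50.0, molar_ratio=3.0):
    """Return the volumes (in µl) of vector and insert to pipette.

    molar_ratio is insert:vector. Because mass scales with length times
    molar amount, the insert mass needed for a given molar ratio is
    vector_ng * (insert length / vector length) * molar_ratio.
    """
    insert_ng = vector_ng * insert.length_bp / vector.length_bp * molar_ratio
    return {
        vector.name: vector_ng / vector.conc_ng_per_ul,
        insert.name: insert_ng / insert.conc_ng_per_ul,
    }


# Example values are made up for illustration.
backbone = Fragment("backbone", length_bp=3000, conc_ng_per_ul=25.0)
insert = Fragment("insert", length_bp=1000, conc_ng_per_ul=10.0)
volumes = ligation_volumes(backbone, insert)
for name, volume in volumes.items():
    print(f"{name}: {volume:.1f} µl")
```

Once such a rule is stored with the protocol, reusing it for a new pair of fragments only requires selecting different database entries.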

Propagation and Backtracking

Of course, storing all details in the software also has many other advantages, such as "propagation" and "backtracking". We already touched on the idea of propagation in the previous design principle: details about biological samples that have already been entered into the database can be used for calculations in subsequent experiments. Propagation also helps with error analysis. Assume you repeat a concentration measurement of a DNA sample and determine that the originally measured value was incorrect, and that you have already used this DNA for downstream work. If you update the value now, all values depending on it are updated as well. For example, MidnightDoc will automatically reflect the change by showing the true amount (e.g. in nanomoles) that was used downstream. Thus, the user can determine whether the wrong concentration measurement has had implications for other experiments.
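One way to obtain this kind of propagation is to store only what was actually measured or pipetted and to derive everything else on demand. The following Python sketch illustrates that design under stated assumptions: `Sample`, `Usage` and their fields are hypothetical names, and the conversion uses the common approximation of about 650 g/mol per base pair of double-stranded DNA. It is not a description of how MidnightDoc is implemented.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """Hypothetical database entry for a DNA sample."""
    name: str
    length_bp: int
    conc_ng_per_ul: float


@dataclass
class Usage:
    """One documented use of a sample; only the pipetted volume is stored."""
    sample: Sample
    volume_ul: float

    @property
    def mass_ng(self):
        # Derived from the sample's current concentration on every access.
        return self.volume_ul * self.sample.conc_ng_per_ul

    @property
    def amount_pmol(self):
        # Approximation: ~650 g/mol per base pair of double-stranded DNA.
        return self.mass_ng * 1000 / (self.sample.length_bp * 650)


fragment = Sample("PCR product", length_bp=1000, conc_ng_per_ul=40.0)
used = Usage(fragment, volume_ul=2.0)
before = used.amount_pmol

# A repeated measurement shows the real concentration was only half as high.
fragment.conc_ng_per_ul = 20.0
after = used.amount_pmol
print(f"before correction: {before:.3f} pmol, after: {after:.3f} pmol")
```

Because the downstream amount is never stored, correcting the concentration in one place is enough for every dependent value to change with it.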

By "backtracking", we describe the other side of the same coin: you should be able to follow the complete history of any sample in your lab. For example, for a plasmid, you should be able to trace back the components from which it was assembled, and in turn the history of each of these components. If a component was amplified by PCR, you should also be able to access the primers that were used, the template DNA (or host species), and the reaction conditions. In this way, backtracking also assists with error analysis and turns a task that is very hard and time-consuming with a classical lab notebook into a very easy one.
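Backtracking can be pictured as walking a provenance graph from a sample to its sources. The sketch below is a minimal Python illustration, assuming each entry simply keeps a list of the materials it was made from; `Material`, `sources` and `backtrack` are hypothetical names and do not come from the MidnightDoc code base.

```python
from dataclasses import dataclass, field


@dataclass
class Material:
    """Hypothetical entry: a sample plus the materials it was made from."""
    name: str
    method: str = "stock"
    sources: list = field(default_factory=list)


def backtrack(material, depth=0):
    """Yield (depth, material) for the material and all of its ancestors."""
    yield depth, material
    for source in material.sources:
        yield from backtrack(source, depth + 1)


# Made-up example: a plasmid assembled from a PCR product and a backbone.
forward = Material("forward primer")
reverse = Material("reverse primer")
template = Material("template DNA")
pcr_product = Material("PCR product", method="PCR",
                       sources=[forward, reverse, template])
backbone = Material("backbone", method="restriction digest")
plasmid = Material("plasmid", method="ligation",
                   sources=[pcr_product, backbone])

for depth, material in backtrack(plasmid):
    print("  " * depth + f"{material.name} ({material.method})")
```

The depth-first traversal prints the plasmid first, followed by each component indented beneath the product it went into.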

You decide the level of detail

When starting to use MidnightDoc, the user first needs to describe the protocols they wish to perform. These protocols can be entered in a very flexible way, and the user can specify whatever parameters they want to record for each type of experiment. While it is certainly possible to add a multitude of value inputs to each and every method, filling in all the information might become too much of a burden for day-to-day lab work. The user may therefore choose to record only a few values for standard protocols (e.g. cloning), and more values for complex assays that require meaningful statistical analysis.

The same applies to the source materials of experiments, which can be backtracked: it would certainly be possible to document every bottle of lysis buffer or even every flask of growth medium ever produced. Generally, however, the user will not want to check which database entry corresponds to the flask of medium used for their miniprep cultures, so it may not be advisable to include this as a traceable source material. By contrast, it is far more important to be able to trace the lot of a manually purified protein used in an assay.

This abstraction principle also allows MidnightDoc to adapt to the requirements of different labs and even of different sciences.
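As a rough illustration of a user-chosen level of detail, a protocol can be modelled as a name plus the list of values its author decided to record. This Python sketch is an assumption about how such a template could look, not the actual MidnightDoc data model; `Protocol`, `fields` and `record` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Protocol:
    """Hypothetical protocol template with user-defined fields."""
    name: str
    fields: tuple  # names of the values the user chose to record

    def record(self, **values):
        """Return one documented run, rejecting missing or unknown values."""
        missing = [f for f in self.fields if f not in values]
        unknown = [v for v in values if v not in self.fields]
        if missing or unknown:
            raise ValueError(f"missing: {missing}, unknown: {unknown}")
        return {"protocol": self.name, **values}


# A routine protocol records little, a complex assay records more.
miniprep = Protocol("miniprep", fields=("culture", "elution_volume_ul"))
assay = Protocol("activity assay",
                 fields=("protein_lot", "substrate_mM", "temperature_C",
                         "replicate", "absorbance"))

entry = miniprep.record(culture="clone 3", elution_volume_ul=50)
print(entry)
```

The miniprep template stays short enough for daily use, while the assay template records the protein lot so that it remains traceable.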

Version control

The biostatistics community has recently embraced the idea of [http://cran.r-project.org/web/views/ReproducibleResearch.html Reproducible research] by introducing standards that should be followed when reporting any computational task. Thus, other researchers are able to verify the methods used in diverse publications, but also to modify and improve upon them. We believe that documentation of wet lab work should also try to make use of some of these concepts (and vice versa!).

The main concept of the reproducible research community that inspired us for MidnightDoc is version control, as provided by systems such as [http://git-scm.com/ git]. In computational tasks and programming in general, the basic idea is that you should be able to look at the code you had previously written, which over many iterations led to the final production code. Similarly, in the context of wet lab documentation, you should be able to easily revert to the state of the documentation at a particular date. This would allow one to explore protocols that were previously used by the lab and were replaced or modified over time. In addition, it would allow one to explore the PCR fragments, plasmids, etc. that were available at a particular point in time.
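The "state of the documentation at a particular date" can be sketched with an append-only history per entry and a lookup by date. The Python example below is a simplified illustration under that assumption. It is much simpler than git and is not how MidnightDoc necessarily stores revisions, and `VersionedRecord`, `save` and `as_of` are hypothetical names.

```python
import bisect
from datetime import date


class VersionedRecord:
    """Append-only history of one documentation entry (e.g. a protocol)."""

    def __init__(self):
        self._dates = []     # revision dates, kept in ascending order
        self._contents = []  # content saved at the matching date

    def save(self, on, content):
        if self._dates and on < self._dates[-1]:
            raise ValueError("revisions must be saved in chronological order")
        self._dates.append(on)
        self._contents.append(content)

    def as_of(self, when):
        """Return the content valid on the given date, or None if none existed."""
        index = bisect.bisect_right(self._dates, when)
        return self._contents[index - 1] if index else None


# Made-up revisions of a ligation protocol.
protocol = VersionedRecord()
protocol.save(date(2014, 7, 1), "Ligation: 1:3 ratio, 1 h at room temperature")
protocol.save(date(2014, 9, 15), "Ligation: 1:5 ratio, overnight at 16 °C")

print(protocol.as_of(date(2014, 8, 1)))
```

Nothing is ever overwritten, so a replaced protocol stays available for any experiment that was documented while it was in use.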

In summary, version control is very exciting indeed, even if it is not distributed, and MidnightDoc will take care of it for you!

It's on the web

MidnightDoc is implemented as a standard web application. Therefore, there is no need to install any software on client computers: just open your browser and point it to the copy deployed on your lab's IT infrastructure. MidnightDoc is designed as a multi-user system with a shared database, so the propagation and backtracking features continue to work when you base your work upon the previous experiments of others, and all of the aforementioned features make it astonishingly easy to understand every aspect of how those experiments were performed. This makes it ideal for teamwork, as in an iGEM setting.

Outlook

Conceptually, MidnightDoc could also harness the plethora of available web resources. It goes without saying that being able to access all documentation details from everywhere is already a big advantage. But we also envisage other applications of this web integration. In the future, MidnightDoc will be tightly linked to synthetic biology registries, such as the iGEM/BBF parts registry, but also others, such as the [https://public-registry.jbei.org/ JBEI registry]. Thus, information about the different parts used will be easily coupled to experimental methods. Another application would be the integrated design of primers, e.g. by use of [https://2011.igem.org/Team:Cambridge/Project/Gibthon#/Project/Gibthon Gibthon], an automated tool for the design of primers for Gibson assembly developed by the Cambridge iGEM teams of 2010 and 2011.

Right now, MidnightDoc is still in the pre-beta stage, but the source code is [https://github.com/sschmitz/midnightdoc available on GitHub]!