Team:UCSD Software/Project

From 2014.igem.org

Revision as of 01:41, 17 October 2014 by Kspuri (Talk | contribs)


Description

Abstract

Genetic circuits are often difficult to engineer, requiring months to design, build, and test each individual genetic device involved in the circuit. SBiDer, a web tool developed by the UCSD Software iGEM team, will leverage existing devices to construct a database with consideration for the function of each device interpreted as boolean logic. The data can be queried by the user through SBiDer's visual interface to explore circuit designs. The displayed circuit's literature reference, characterization data, and images of included devices can be viewed through the built-in table. Basic validation of the circuit performance is also provided within in the interface. SBiDer's web of information can be expanded through user-generated additions to the database to improve the efficiency of the application and the accuracy of the models. (Partial STUB PAGE)

Future Directions

Synthetic genetic circuits created by synthetic biologists have yielded exciting applications such as biofuels production and cancer killing bacteria. These circuits are often difficult to engineer, requiring months to design, build, and test each individual genetic device involved in the circuit. Although there are many genetic devices that have been built, re-using these devices often requires a time-consuming review of the literature. The UCSD Software iGEM team will address this challenge by creating a web-tool that leverages existing genetic devices to create complex genetic circuits. (STUB PAGE)

Web Application

Framework of the SBiDer Interface

Structure Overview

SBiDer was developed off of CyNetShare, a HTML5 application designed to make utilizing and visualizing CytoscapeJS graphs simple and intuitive through use of the javascript library AngularJS. The object of the SBiDer web tool is to visually present genetic circuits and possible interactions between them in addition to providing relative information about each device’s construction and function through reference material. Images of devices and image data, in Systems Biology Markup Language (SBML), are provided by a Synthetic Biology Open Language (SBOL) visual rendered through use of the popular PigeonCad for intuitive understanding and universal sharing. Users are also provided with the PubMed ID’s of all genetic devices so that they may be able to access and verify entries information in the PubMed journal service.

Visualizing the Network

CytoscapeJS is a widely used visualization library written in Javascript that presents complex relationships in more easily understood figures of nodes connected by edges. The library’s popularity and supportive community make it an ideal framework from which the SBiDer team can reach a large audience of synthetic biologists. Characterizing information about the genetic devices can be easily stored inside CytoscapeJS’s nodes as data. Unfortunately, manipulating, editing, and updating this data autonomously without further means poses a difficult challenge that is resolved with the use of AngularJS and CyNetShare.

Dynamic Interface and Data Handling

The large portion of the SBiDer framework relies on the data manipulation and two-way databinding provided by the AngularJS javascript library which provides the interactivity to the web application. Key to this is the two-way binding that allows information to be changed through the User Interface (UI) and updated between the scripting code and html without need to constantly resync data with the server, allowing for quicker response times for users and cleaner source code for further development. The ability to pass data from the UI to the SBiDer application also simplified the procedure for User-generated entries to the SBiDer database. The major alternative library was JQuery, an older and popular method of achieving a dynamic interface that doesn’t have the crucial feature of two-way databinding. Data is passed from server to client through a JSON string which is parsed by AngularJS for the SBiDer application to visualize the network using CytoscapeJS as well as updating the web page without need to reload the window unlike when employing JQuery. Additionally, AngularJS is compatible with many JQuery functionalities which allows for a combination of use for both libraries with negligible drawbacks. Effectively, AngularJS allows SBiDer to effectively and dynamically communicate with CytoscapeJS and other components of the application without the clutter of JQuery’s syntax.

Modular Interface

CyNetShare is a new innovation from the creators of CytoscapeJS which is used to share visualizations made with CytoscapeJS without the need to manually exchange files, instead rendering the files through encoded url’s. CyNetShare provides the basic framework for SBiDer as it has much of the core foundation in its source code including the CytoscapeJS visualization, table of node information, and a basic user interface layout and styling. SBiDer developed upon this framework to include the ability to retrieve results from the SBiDer database, display images of genetic devices, and easily access and download data.

Other Utilized Technologies

Apart from the aforementioned technologies that were used to create SBiDer itself, there were two other major technologies/languages involved, namely Boostrap, to design the website & the wiki, and Java, for the server.
Bootstrap is a front-end framework that contains several HTML & CSS-based design templates for various aspects of web pages. It's an easy to use framework that's used widely all over the world since it makes websites look more modern, cleaner and aesthetically pleasing.
The language chosen for the server was Java since it's easier for other developers to use, is extremely common and is compatible with a lot of languages including but not limited to Python, JavaScript and SQL. In essence, all of the languages that we made use of are very well documented, modern and easy to use languages.

Future Prospects

The current SBiDer web tool has been constructed with consideration for expansion in all its components from the structure of the source code to the allowance of users to contribute to the database. It is the latter improvement that allows SBiDer to become more efficient as time progresses. As more information becomes available, the SBiDer application can begin to transition from a qualitative description of how its proposed genetic circuits behave to a more quantitative model.

The Search Algorithm

Introduction to the SBiDer Search Algorithm

Purpose of the Algorithm

SBiDer is fundamentally constructed on a manually curated database of existing genetic devices, or operons. This SBiDer database stores operons, species, and most importantly the biochemical interactions of these elements. Also, the database can be represented as an operon interaction network via species. Using this network representation of the database, SBiDer’s search algorithm searches for feasible operon paths connecting a set of species to another set of species. As a result, given a set of input species and a set of output species, the search algorithm returns a subnetwork of operon paths connecting the input species to the output species. Each path within the subnetwork is an operon path that can be used to produce the output species from the input species. In other words, placing operons from a returned operon path as well as the input species into a system can produce the output species.

Biology Captured by the Algorithm

The search algorithm conducts a complex, effective, and robust operon path search over the network representation of the database, ultimately generating single or multiple operon paths connecting a set of input species to a set of output species. A resulting path represents a system that can produce the output species given the input species. Also, each path attempts to capture biological reality by considering two major mechanisms of operon activation (detailed description below). Finally, SBiDer represents a resulting path using Petri net model, such as chemical reactions (C.A. Petri and W. Reisig). The Petri net generated by the search algorithm consists of all of the input species, input transitions, input transition logic, operons, output transitions, and output species in a path. As a result, the Petri net model captures the fundamental components of operon interactions, allowing the biological phenomena to be better understood, modeled, and analyzed.

Robustness and Modularity of the Algorithm

Due to the current number of operons in the database, the search algorithm should scale with the increasing size of the database, which will expand as the synthetic biology community adds more species, operons, and plasmids to the database (see complexity analysis of the algorithm is located below). In addition to this robustness, the search algorithm is easily modifiable since it has been implemented independently from the database and the SBiDer web application. Therefore, the search algorithm can be applied to a broad range of networks. Furthermore, the search algorithm was developed on an open source platform to encourage global collaboration in further optimizing the search algorithm and minimizing the barrier for algorithm improvement. As a result, the database as well as the search algorithm will be easily, effectively, and accurately expanded, extended, and optimized by the global synthetic biology community - together.

Algorithm

Overview of the Algorithm

SBiDer's search algorithm is fundamentally based on multiple breadth first searches (BFS) on the SBiDer database. From a set of input species, the algorithm searches for any operon that can be activated by the input species (see Tutorial for how to format the input query). If such an operon is found, the output species of the activated operon and/or other existing species in the system (details below) is used to search for the next operon. In the process, the search algorithm constructs an operon paths by successively adding activated operons to the path. There are two ways of activating an operon: linear successional activation and nonlinear successional activation (more details below), and the search algorithm can create an operon path using either or both of the activation methods. This search algorithm continues until either: the search algorithm concludes that there is no operon path that can lead to the production of the output species; or an operon path producing the output species is found.

Linear Successional Activation Search (LSAS)

The search algorithm uses linear successional activation search (LSAS) to generate an operon path in which the output species of a preceding operon produces all of the required input species for the succeeding operon. LSAS extends an operon path by adding only operons that can be activated by the species produced by the last operon in the path. Thus, for each potential activated operon by the input species, LSAS uses a tree data structure in searching an operon path. From an input species (or multiple input species with an AND relationship), LSAS builds a tree, and if an output species is found in the process of growing the tree, the path from the root operon to the leaf operon that has produced the output species is returned as a resulting operon path. Finally, this resulting set operon paths is used to construct a Petri net path including all of the input species, input transitions, input transition logic, operons, output transitions, output transition logic, and output species.

Nonlinear Successional Activation Search (NLSAS)

An operon may not produce all of the input species required for an activation of another operon. Yet, multiple operons may together produce all of the required input species for the other operon. Here, each of the activating operons produces some, but not all, of the required species for the activated operon. The search algorithm captures such operon activation mechanism using nonlinear successional search (NLSAS) in building an operon path. As LSAS, NLSAS also uses a tree data structure, but appends an operon whose input species, or activation requirements, are produced by any of the preceding operons in the path. While NLSAS assumes no degradation of the species in the system, it allows for a wider range of operon sets that could produce our desired output from our given input. As a result, each resulting operon path can be linear or nonlinear. Since a NLSAS result captures operon activation in which a single or multiple preceding operons produce species required for the activation of a succeeding operon, LSAS result is a subset of a corresponding NLSAS result.

Overview of LSAS and NLSAS

Overall, LSAS generates a single or multiple operon paths in which a preceding operon directly activates a succeeding operon, and NLSAS paths in which any of the preceding operons can activate a succeeding operon. In general, a resulting LSAS operon path includes operons mined from same or similarly conducted experimental protocols, where the operons are optimized to coexist in a system. As a result, LSAS can return well optimized operon paths and interactions, which may be used in biochemical reactions with higher confidence. The SBiDer web application can further facilitate the use of such operons paths by providing literature and other published resources. NLSAS, on the other hand, can generate an operon path by including operons from dissimilar experimental protocols. So, using NLSAS can potentially lead to the discovery of novel operon paths and thus new biochemical systems. By providing LSAS and NLSAS variations of the search algorithm, SBiDer assists a user in capturing the biological reality underlying an operon activation as accurately as possible, and provides a means to evaluate, analyze, and apply the results.

Complexity Analysis

The network that is traversed by the SBiDer search algorithm is based upon Petri Nets. The linearity of the network representation of the database is such that one input transition leads to one operon, which leads to one output transition, and one list containing all species produced by that operon, as shown in Figure 1. This entire structure can be collapsed into one node since there is no branching, and allows us to represent the network as a directed graph, as shown in Figure 2. By condensing the network in this fashion, complex searches can be conducted in a more efficient manner. Figure 3 shows the Petri Net representation a three operon circuit by ???????????????????????? (ADD FIGURES!)

LSAS

Assume the network has m transitions, with depth d and breadth b from a specified root, and V vertices which represent the aforementioned condensed nodes (centered on operons), and E directed edges connecting said vertices.
The SBiDer search algorithm is a modified version of breadth first search where at each transition m that is activated by the input species set a breadth first search for the output species set is performed. For a standard breadth first search, the complexity is
BFS Complexity = O(|b|^|d|)
This is another way of saying that in the worst case, every vertex and every edge is explored, which can be rewritten in terms of the number of vertices and edges. Since all edges have weights of one, V and E will always be positive.
BFS Complexity = O(|V| + |E|) => O(V + E).
For this standard BFS, assuming a large network where every node is connected to every other node, for V significantly large, V ≈ V - 1, meaning the number of edges searched after the selection of each successive node is approximately equal to V, meaning the V nodes are searched V times. Additionally, consider that the total number of edges (E) in a completely connected graph is bound by V^2. In a completely connected graph (with no self or duplicate edges) with V nodes, the total number of edges is (V-1) + (V-2) + (V-3) + … + 2 + 1. It has been shown that the (V-1)th partial sum of this divergent series sums to [V*(V-1)]/2, which is bound by V^2. Therefore the upper bound for the time complexity for a single breadth first search from a single activated transition from an input species is O(V + E) ≤ O(V+V^2) => O(V^2).
For our network, a breadth first search of time complexity O(V^2) happens at every activated transition m.
Therefore, the LSAS complexity = O(mV^2)

NLSAS

For the NLSAS, a similar tree from LSAS is created for each activated input transition by the input species. However, the difference here lies in the construction of the path from the input species to the output species. A special traceback algorithm is used to tie the output species to the input species creating a path that may resemble a subnetwork rather than a traditional linear path. This traceback algorithm is where the true complexity of this algorithm arises from.
Similar to LSAS, this algorithm also begins by constructing a subnetwork of paths that can lead to the output of interest from a specific input. This generates a list of edges to be used in the traceback algorithm. Assuming every edge could be explored, the algorithm runs in O(E) time. As shown for the LSAS algorithm, O(E) <= O(V^2), and also returns a list that will be explored also in O(E) size or O(V^2). Again, a new list is constructed for each potentially activated transition, thus giving this entire first step a run time complexity of O(m*V^2).
Now, for each of the O(m) list of edges constructed from the last step, a traceback algorithm is run for this path from the output species set to return a subnetwork of operons that could form a system that leads to the production of the desired output. From the desired output species, the traceback algorithm searches through the list of edges E to find the transitions for each operon contained in the path that produces the output species of interest. Since every transition in the set E must be checked, the algorithm runs in O(E) or O(V^2) time. At worst, every operon must be checked, resulting in a complexity of O(m*V^2) for this step.
Now, in order to determine which species were required to activate the operons now connected to the output species, the database of transitions is searched to identify which transitions will have to be activated. This search happens across an O(m) space for each of the O(V) operons that could be activating the output. Therefore, this next time step happens in O(mV) time. The first two steps together now will happen as O(m*(V2+mV)). At the end of these steps, the path now has O(m*V) species.
However, it is important to note here that at the step when possible operons are being explored, the search will have a time complexity of O(E), and will never happen more than O(m*V) times, which is the bound for the number of unique species in the system. For the following step, the search for species activating all operons on this increment will have a time complexity of O(m), but it will never happen more than O(V) times, which is the bound for the number of unique operons in the system. Therefore, for a full iteration of these two steps in the traceback algorithm, it will be bounded by the time complexity of O(V^2*mV + m*V). This traceback step will happen until the depth of the entire path generated from the LSAS step has been explored, which is bound by O(V). Thus, the time complexity for the traceback of a single path is O(mV^4+mV^2). Then complexity of the path generation is added for a complexity of O(mV^4+mV2+V^2). Now consider that this path generation and traceback happens for every single activated transition by the input species, which gives O(m*(mV^4+mV^2+V^2))=>O(m^2*V^4).

Future Directions and Optimizations

SBiDer’s three main pillars - database, search algorithm, and web application - intertwine effectively and make SBiDer a robust tool for the synthetic biology community. Furthermore, the database, search algorithm, and web application are modularly developed so that each component can be easily updated, extended, and optimized by the global synthetic biology community. In the development, there was a focus on minimizing the barrier of collaboration by using an open source platform and generating a comprehensive usage and implementation documents. SBiDer was developed on the consideration to encourage the synthetic biology community to collaborate and drive the field forward. In the course of the development, various tools that could further develop SBiDer were discovered. In future iterations of SBiDer, the following functionalities should be considered for extending SBiDer:

Web Ontology Language (OWL)

What OWL is

The Web Ontology Language (OWL) is a family of knowledge representation languages or ontology languages for authoring ontologies or knowledge bases (W3C). The data described by an ontology in the OWL family is interpreted as a set of "individuals" and a set of "property assertions" which relate these individuals to each other. An ontology consists of a set of axioms which place constraints on sets of individuals (called "classes") and the types of relationships permitted between them. These axioms provide semantics by allowing systems to infer additional information based on the data explicitly provided.

Integration of OWL into SBiDer

SBiDer can describe all of the components in the database using OWL by creating a unique ontological hierarchy that suits the field of synthetic biology. Using OWL relationships, SBiDer can better connect its components through more complex fashion that better captures the biological reality underlying synthetic biology. For example, an ontology describing components used by SBiDer, or individuals in the semantics of ontology, might include axioms stating that a "activatePromotor" property is only present between two individuals when "isBiochecmicalSpecies" is also present, and individuals of class "activatePromotor" are never related via "isBiochemicalSpecies" to of the "transcribeRNA" class. If it is stated that the individual green fluorescent protein (GFP) is related via "produceLight" to the individual yellow fluorescent protein (YFP), and that GFP is a member of the "isProtein" class, then it can be inferred that YFP is not a member of "isPromotor" class. In this fashion, OWL allows the use of more sophisticated relationships amongst database components. Furthermore, there already exists various ontological hierarchy optimized for biology such as genetic interaction ontology, protein interaction ontology, and cancer genomics ontology. And OWL allows potential interaction of multiple ontologies. So, the integration of OWL into SBiDer allows SBiDer to potentially interact with many other aspect of biology such as genetic interaction, protein interaction, and RNA processing. As a result of OWL integration, SBiDer can represent the biological reality of synthetic biology in a much more complex, sophisticated, and realistic way.

Factor Graph

What Factor Graph is

In probability theory and its applications, a factor graph is a particular type of graphical model, with applications in Bayesian inference, that enables efficient computation of marginal distributions through the sum-product algorithm (H. Loeliger). Factor graph can capture a system as Markov networks or Bayesian networks, and represent mathematical relationships of elements of a system.

Integration of Factor Graph into SBiDer

SBiDer currently implements its network representation of the database as a Petri-Net, which allows modeling of our complex biological system. SBiDer can also capture the network using factor graph, which provides additional statistical tools for analyzing, interpreting, and prediction of the behaviour of our network. Integration of the factor graph can also provide us with machine learning tools to further advance our tools quantitatively and qualitatively.

References

Carl Adam Petri and Wolfgang Reisig (2008) Petri net. Scholarpedia, 3(4):6477

"OWL 2 Web Ontology Language Document Overview". W3C. 2009-10-27.

An Introduction to Factor Graphs by Hans-Andrea Loeliger, IEEE Signal Processing Magazine,January 2004, pp. 28–41.

Database (the foundation)

We will mine the scientific literature for existing genetic devices and then construct a database that captures device characteristics such as:
  • composition of devices
  • function
  • characterization data
  • literature reference
We will design our database by rigorously constructing an entity relationship diagram and then normalizing these relationships to construct tables for a relational database. (another stub)

Modeling

  1. What is this program?
  2. Synthetic biology has been developed for a decade and is being expected to evolve the academic society because of its standardization and convenience.1, 2 This novel technology has already yielded bunches of applications on the field of pharmacy1, 2 and therapeutics3 while also showing its potential on energy industry.5 However, a standardized model is eagerly required to be established because it will facilitate the exploration of synthetic biology. Therefore, we developed the software – AutoModel, which integrates dozens of models of simple devices and stipulates the I/O interface then provides user standardized modeling path with high degree of freedom.

    This python script is being expected to help noviciate to construct their model and decide how much experimental data should be collected. In the coming section, we will demonstrate how this software works and how it help wet-lab experiments. Meanwhile, some instance are listed for illumination.

  3. How does it work?
  4. During the synthetic biological procedure, substantial biochemical reaction happens in vivo or in vitro. Generally speaking, those gene expression molecular level events could be represented by mass equation:4

    k1[A]aq + k2[B]aq k3[A B]aq

    where k1, k2 and k3 are reaction coefficients. One of best mathematical representation for this reaction formula is differential equation. The differential equation is a mathematical equation about some continuously varying quantities and their rates of change in time that relates some function of one or more variables with its derivatives. Therefore, differential model is widely employed on the chemical modeling. However, constructing a precise model for any specific reaction is always complex because dozens of material and formulae has to be considered.

    But now, a novel technique has been developed to help you figure out this problem. Inspired by the primary idea of synthetic biology, the AutoModel separates existed simple device as individual parts and integrates them into the device file which is a plug-in for main program. There are hundreds of mathematical expressions in the device file that represents those relationship between input(reactant) and output(resultant). Customers could add, adapt or ampu- tate any functions, parameters and terms to optimally fitting their experimental data and theoretical model. Besides, preseted device functions could be called for predict your experimental result.

  5. How to use it?

  6. There are 3 kinds of documents included in the program package.

    3.1 Device

    All of differential equations are stored in the ’DEVICE.py’ file which is a list of import subfunction of python. Those subfunction are defined in the same format that allowing user to recall them by the same method. The format is

    def device’s name (od , input_1, input_2, output, dt = 0.1) :
         parameters
         input1 = input_1 + (differential equation of input_1)* od* d t
         input2 = input_2 + (differential equation of input_2)* od* d t
         output1 = output + (differential equation of output)* od* d t
         return (input1 , input2, output)
    where od is optical density and dt is integral time step.

    3.2 Network Document

    The ’network.txt’ is the file of network which describes the connection among simulated model. This text file is a table. The first column is device’s name which corresponds to the name in ’DEVICE.py’. The next 2 columns are input_1 and its initial value. It could be either inducer’s name or device’s name from first column. The 4th and 5th represent input_2. Last 2 columns are output’s name and its value. If the word in any row is output, it means that the program will return the output value of this device.

    3.3 Main Program

    The main code for simulation is written in ’demo python.py’ which contains calculating function for growing curve, integrating function for ODE and a plotting part. Here we use an instance for illustration how to use this software.
    First, there is a default growing curve provided by demo. The initial cell density(OD) is 0.2 while growing rate is
    0.3 per 6 mins. You can change the maximum value of density by change the parameter k1 in the function grow. Second, function ’Calculate’ is an integral function that calculating the output value while reading the network from network file from top to bottom and then loop this process until the end of simulation time. In this case, the demo network is invested by David L. Shis and Matthew R. Bennett.6

    Final, ’Calculate’ will return a matrix of time scalar, optical density and output to our plotting script.

  7. Demonstration
  8. It is reasonable to assume that our host is incubated in ideal environment that the curve of optical density (OD) perfectly matches the ’S’ shape model. In other word, the growing curve is integrated by logistic model