Team:UCSD Software/Project

From 2014.igem.org

Topics

Description

Human Practices

Neil Gershenfeld, a physicist who is the director of the Center for Bits and Atoms, said that the improvement in the capacity to read and write biological genes has given rise to the possibility of “spectacular advances,” like the ability to use a computer to design a complete genome, output it, insert it in a cell and in effect create life from scratch. -NY Times

It is clear that the field of synthetic biology has made leaps and bounds in recent years, from the introduction of genetically modified food to the cellular production of polymers with incredible tensile strength. What Gershenfeld is stating in the above quote is that previously, to make a significant contribution to the field of synthetic biology, there was a certain element of trial and error. What does this gene do? How does it interact with other species within the cell? How can I make my output more efficient and effective? And the biology community has come up with significant answers for each of these: gene sequencing and indexing to understand gene presence and function, documentation of chemical interactions, and investigations into crosstalk and modeling of the in vivo cellular environment.

SBiDer is the spectacular advance to which Gershenfeld is referring. Our application allows synthetic biologists of all levels to contribute in a significant way and to develop their ideas and streamline the synthetic design process. SBiDer is built upon the established body of work in the field of synthetic biology to not only simplify the design of biological circuits, but to increase the ability of users to visualize their circuits in silico. By simplifying the design and visualization process, our application allows users to explore the myriad of papers that compose our database simultaneously, without having to traverse typical scientific jargon that usually precludes young scientists from fully fleshing out their ideas. In this way, SBiDer has opened the field of synthetic biology for users of all levels, from high school freshmen to private investigators.

By opening the field to aspiring synthetic biologists and decreasing the time between idea conceptualization and wet lab production, SBiDer has removed the barriers of entry into the world of synthetic genetic production, thereby not only establishing a comprehensive database of operons and their interactions with one another, but also enabling new, novel discoveries between operons in new, novel ways.

Enabling young scientists to enter the field of synthetic biology means nothing if they are unable to generate data of some significance, but it is important to understand that SBiDer enables just that.

The Defense Advanced Research Projects Agency (DARPA) started the Living Foundries project to fund many scientific endeavors to aid in the transformation of biology into an engineering endeavor by pursuing biological projects with efficiency and efficacy and establishing the technical means for other scientists to do so. DARPA funds projects in the hope that some will aid in the government in its continual pursuit of national security. One recent project to emerge from the Living Foundries project is called 1,000 Molecules, aimed at building a “scalable, integrated, rapid design and prototyping infrastructure for the facile engineering of biology.” SBiDer itself meets these criteria by aiding users in the pursuit of novel circuits by combining operons in previously unimagined ways, but it also allows others to complete projects that fall within this criteria as well. One such project that would have benefitted from the service that the SBiDer application provides is the Spider Silk project from the Lewis lab. The project focused on genetically engineering E. Coli to produce spider silk in mass quantities to be potentially used in protective equipment similar to kevlar. Their polymer relies highly on arabinose, which is a species with many connections within the web of SBiDer database.

The SBiDer application has a variety of potential uses within the synthetic biology community. The cultivation of algae for biofuels is an inefficient process that could be aided by genetically engineered colonies that produce high lipid content but also grow efficiently in culture could drastically reduce international dependence upon fossil fuels.

Synthetic biology also has the potential for significant impacts in the field of agriculture. Drought resistant wheat has made the year-round cultivation of wheat a possibility, and has made cheap food for many available. Of the potential inhibitors to proper and sustained crop growth, drought is the most prohibitive. By reducing the dependence of crops on water, food production could be expanded to previously unexplored regions. The ability of our application to accept new operons to our database means that the wide acceptance and use of the SBiDer application will enable projects such as the drought tolerance of plants to build on the success of previous endeavors and to integrate information from other synthetic biology projects.

In addition to defense, biofuels, and agriculture, SBiDer also has the potential to advance health care. The ability to genetically engineer cells to produce specific outputs in response to specific small molecules cannot be understated. One recent development in this field is the production of glucose in genetically engineered cells. These implantable cells have the potential to reduce the prevalence of diabetes, which has become a significant health challenge for adults and adolescents alike.

Many more interesting and current uses of synthetic biology in various industries are also described by the Biotechnology Industry Organization. We hope that as our database increases in size and the more operons we implement, SBiDer will be involved more frequently in the discovery of new and novel uses of existing biology to help drive the field of synthetic biology and society forward.

Validation

We first validated the SBiDer application by searching for circuits described in the literature sources we used to construct the SBiDer database. As expected we were able to successfully retrieve all the circuits used to for the database. We also used this set of devices during the development of our web application; minimally our application should be able to find known circuits used to construct our database. The full list is below:

Inputs	Outputs
LacI,IPTG	gfp
OHHL,luxR	gfp
araC,Lara	yfp
na	araC
tetR,aTc	yfp
na	tetR
lasR,PAI-1	yfp
na	lasR
aTc,tetR	yfp
na	araC,tetR
na	araC,lasR
lasR,	PAI-1
na	tetR,lasR
aTc,tetR	cI
araC,Lara	cI
lasR,PAI-1	cI
tetR,aTc	cI
tetR,aTc	gfp
cI	LacI
LacI,IPTG	cI
LacI,IPTG	tetR
LacI,IPTG	cI,gfp
tetR,aTc	LacI
LacI,IPTG	tetR,gfp
LacIq,IPTG	gfp
na	lacIq
araC,Lara	gfp
na	luxR
LuxR,AHL	gfp
LacIq,IPTG	hrpR
LuxR,AHL	hrpR
LacIq,IPTG	cI
hrpR,hrpS	gfp
hrpR,hrpS	cI
Sal	supD
na	nahR
tetR	gfp
araC,Lara	T7ptag
T7ptag,supD	gfp
AHL,luxR	lacI
AHL,lacI	rfp
IPTG,lacI	ecfp
na	lacI
IPTG,lacI	rfp
IPTG,LacI	gfp
IPTG,LacI	lacI
araC,Lara	lacI
tetR,aTc	luxR
LacI,IPTG	AHL_luxR
araC,Lara	AHL_luxR
AHL,luxR	lacZalpha-ccdB
aTc,	tetR
LuxR,	PAI-2
lacI,IPTG	cI
lacI,IPTG	yfp
araC,Lara	rfp
lacI,IPTG	rfp
luxR,AHL	rfp
tetR,aTc	rfp
tetR,aTc	invF
araC,Lara	sicA
tetR,aTc	mxiE
araC,Lara	ipgC
ipgC,mxiE	rfp
tetR,aTc	exsDA
luxR,AHL	exsC
exsC,exsDA	rfp
exsC,exsDA	invF
ipgC,mxiE	sicA
LacI,IPTG	mxiE
LacI,IPTG	luxR,luxI
luxR,AHL	lacZalpha-ccdB

While many circuits are contained within the SBiDer database, there were new circuits published in the literature that we could not include in our database due to time constraints. However, supposing that circuits contained within the SBiDer database represent the majority of mechanisms by which synthetic circuits can be constructed (which they do), then SBiDer can predict novel circuits that may eventually be constructed and published. Validating predictions of novel circuits described by our database constituted a search of relevant publications. The list of relevant publications that may have explored and validated at least one of our predicted circuits is described by the set of all papers citing more than 1 paper cited by our database. SBiDer gathered circuitry from n=17 publications. The most recent circuit contained in our database was published in 2012. Interestingly, SBiDer was able to discover several aTc and AHL based circuit recently validated and published by the Piro et al in 2014 in our database. We are excited to find that we have predicted novel circuits that are relevant to active research projects that have been published and submit this as justification of the validity of SBiDer’s capacity to predict novel genetic circuitry.

Supported Standards

Synthetic Biology Open Language

Synthetic Biology Open Language (SBOL) Visua; is an open source graphical notation that allows the user to communicate sequences of genetic parts visually using a set of standard symbols which represent common features of operons. The specific symbols used in our application are the promoter, coding sequence, and terminator. A promoter is a DNA segment that regulates the transcription of a gene. There are two types of promoters that appear in the database: regulated and constitutive. A regulated promoter must be triggered, usually by a small molecule. A constitutive promoter is always activated. In this case, both regulated and constitutive promoters are represented the same way: with a bent, right-facing arrow. Coding sequences are DNA regions that produce some protein when transcribed and are regulated by promoters, and are represented by elongated pentagonal arrowheads. Lastly, there are terminators, which are DNA segments that signal for transcription to cease, and are represented with capital T. Promoter, coding sequence, and terminator symbols which appear on the same line are all part of the same operon. To indicate sequences that are oriented in the 3’ to 5’ direction, arrows point in the opposite direction. The symbols are generated using a language called Pigeon[link], which is specific to SBOL Visual.

Using SBOL Visual, it is possible for us to communicate the exact structure of the design generated by the SBiDer application, thereby enhancing user understanding. This way, the user can design and visualize their desired output in silico, prior to wet lab production.

Systems Biology Markup Language

Systems Biology Markup Language (SBML) captures qualitative models using standardized syntax and semantics established by the SBML community and developers. By specifying the "Level”, "Qualitative Species”, and "Transitions" of a qualitative model, practitioners are capable of conveying the entirety of their model to a large community without sacrificing any nuance in their modeling of a biological network. Level is an integer value that refers to the current state of the network entities that are being model [ref]. In our Petri-Boolean model, network entities, chemical species and operons, have a basal level of 0 with a maximum capacity of 1. Qualitative Species refer to entities that are compartmentalized and characterized by the qualitative influence of their components; compartments allow one to pool entities of similar composition [ref]. In our Petri-Boolean model, chemical species and operons are referred to as qualitative species with species and plasmid as their respective compartments. Transitions alter the level of entities in a network at given conditions by a fixed amount such conditions involve the fluctuation between levels falling below or above an established threshold [ref]. Transitions in our network govern the binary state of its entities thus a network entity can assume a level of 0 or 1 depending on current state conditions. By converting every network entity, qualitative species or transition, in our database to its corresponding SBML description we not only make our qualitative model accessible to the public but also our entire database.

Network Analysis

Gene circuit design often involves the laborious extension of existing circuitry through the development of new devices. Device development is a taxing biological endeavor to which SBiDer offers an alternative. SBIDer facilitates circuit design by integrating knowledge across existing genetic modules. We provide evidence for an increase in functional motifs relative to the circuitry input into our database by examining common network attributes and executing more sophisticated network analysis.

In analyzing our database, we demonstrated and quantified the increased number of genetic circuits relative to the input circuits. Connectivity served as a preliminary indicator of the increase in potential circuits. We compared the connectivity, measured by “in-degree” and “out-degree,” of the input circuitry to the degree measure of the integrated database. The increase in degree indicates a combinatorial increase in potential circuits.

This increase is observed as the rightward shift of the In-Degree(v) and Out-Degree(v) distributions. The degree comparison between the input circuitry and the database demonstrated an increase in modulators and modulations in the database. Theoretically, every in-degree increase represents a novel modulation of that node, every out-degree increase represents a novel opportunity for that node to modulate another. The increase in modulators and modulations is important because every combination of these novel inputs and outputs constitutes a new circuit. Further, each ancestor and descendant of a node increase the count of novel circuits exponentially, as a function of the degree. In addition to analyzing the connectivity of our network, we implemented network motif analysis techniques in attempt to identify topological motifs.

Topological motifs can map to functional motifs which are the subunits of genetic circuitry [Alon, 2007]. We considered this correspondence and applied CytoKavosh [Masoudi-Nejad et.al., 2012] to identify topological motifs that occurred significantly relative to a random graph. CytoKavosh is a visual implementation of the Kavosh motif finding algorithm [Kashani, et.al, 2009] constructed on the Cytoscape platform. We report an increase in the number of topological motifs after analysis of the input circuitry. The bi-fan network motif 1 (Kavosh motif 204) showed a 1.98 fold increase from the input circuitry to the integrated database. In our most extreme cases, motifs 540 and 16920 (not included in the graph) had a 7.89 and 6.75 fold increase, respectively. The cumulative increase in fold change across discussed motifs is 24.57. Increases in the counts of selected motifs are shown below:

A display of counts of selected motifs, grouped by motif IDs generated by CytoKavosh

A comparison of CytoKavosh analyzed motifs between the input circuitry and the databases shows a clear increase in the number of known motifs. The observed increase in network motifs validates our expectation that integrating related circuits would increase their potential.

Genetic circuits are often difficult to engineer, requiring months to design, build, and test each individual genetic device involved in the circuit. SBiDer, a web tool developed by the UCSD Software iGEM team, will leverage existing devices to construct a database with consideration for the function of each device interpreted as boolean logic. The data can be queried by the user through SBiDer's visual interface to explore circuit designs. Users can search for existing circuits that can be used to assemble a complex circuit. The displayed circuit's literature reference, characterization data, and images of included devices can be viewed through the built-in table. We also provide a standalone modelling Python package that can be used to model circuits given by our online webtool. SBiDer's web of information can be expanded through user-generated additions to the database to improve the efficiency of the application and the accuracy of the models.

List of definitions for terms used in various sections of this page:

Nodes by context
- Search Algorithm
  - Operon node: A combination of the input transition, operon, and output transition nodes as described in the Petri Net model.
- SBiDer Petri Net Model
  - Operon node: A reciever and producer of species through input and output transitions.
- Web Application
  - Operon node: A set of genes that are regulated by DNA sequences known as cis or trans regulating elements. Under the sbider Petri net adaptation operons are places
  - Species node: A node that contains only the species information which is similar to its usage in the algorithm.
  - Transition node (see Input Transition and Output Transition): Graphical representation of the Petri Net Transitions.
Input (for search)
- A set of desired initial species to start the search algorithm.
Output (for search)
- A set of desired final species produced by the search algorithm.
Paths:
- Operon paths:
  - composed of operon nodes as defined within the algorithm representation of the network and species nodes. The network condenses to a directed tree graph from a Petri Net representation.
- Petri Net paths:
  - The cyclic movement from a place (species) to an input transition to another place (operon) to an output transition to an ultimate place(species).
Petri net:
- Consists of operon nodes, species nodes, and transitions. All logic required to activate an operon is encompassed within the transition. Species nodes always point to an input transition, which points to one operon, which points to a singular output transition, which points to a species node.
Places
- Conditions to be met in order to enable a transition in the Petri Net model.
Transitions
- Activated events that allow movement from Place to Place in a Petri Net model.
SBML
- a free and open interchange format for computer models of biological processes, which is based on the XML standard. SBML is useful for models of metabolism, cell signaling, and more. It continues to be evolved and expanded by an international community.
SBOL
- Synthetic Biology Open Language Visual is an open source graphical notation that allows the user to communicate sequences of genetic parts visually using a set of standard symbols which represent common features of operons.
Plasmids
- The structure of bacterial DNA. Plasmids are circular, closed loops of DNA that contain functioning genes used by the organism.
Species
- Any chemicals and stimuli affecting gene production. Under the SBiDer Petri net adaptation species are places.
Input transitions
- Species-regulated regions of the DNA which are formalized into boolean logic operations under the sbider Petri net adaptation.
Operons
- are a set of genes that are regulated by DNA sequences known as cis or trans regulating elements.
Output transitions
- Describe the events and species produced after operon activation namely, RNA and protein production.

Future Directions

Database - Integrating NDEx

NDEx is a breeding ground for network sharing, discussion, analysis and storage. By acting like the Facebook of networks, NDEx serves a medium for the discovery, sharing and discussion of networks. Networks are commonly used for their ability to capture the biology as a system of particular molecular interactions with interaction types varying between the type of networks built by scientist. The discussion of networks offered by NDEx provides a means for researchers to share their ideas behind network construction and analysis as well as sharing a copy of the network. NDEx also provides what is know as provenance history which means that history of the network construction is saved and the development of the network can be seen. As a means of further connection the biological network community NDEx offers an API and has plans for further integration with the network visualizer, Cytoscape (www.ndexbio.org).

SBiDer was developed with the intention to involve the biology community with the SBiDer Network. Discussion of genetic circuits is imperative for the proper use of a circuit as well as to give the proper context of the circuit. NDex would give a discussion board to the SBiDer network where biologist can discuss topics regarding optimal genetic circuits, the potential to experimentally create new circuits from the existing network or a range of many other topics. As said before, genetic device building is not an easy task and provenance history provides useful backtracking to understand the development of genetic circuit networks and due to the ability of network sharing it becomes possible for the SBiDer network to exist in different versions with the purpose of displaying genetic circuit networks with specific functionalities. NDEx API functionality then provides the means to create tools with which SBiDer networks can be better analyzed.

Traversal

SBiDer’s three main pillars - database, search algorithm, and web application - intertwine effectively and make SBiDer a robust tool for the synthetic biology community. Furthermore, the database, search algorithm, and web application are modularly developed so that each component can be easily updated, extended, and optimized by the global synthetic biology community. In the development, there was a focus on minimizing the barrier of collaboration by using an open source platform and generating a comprehensive usage and implementation documents. SBiDer was developed on the consideration to encourage the synthetic biology community to collaborate and drive the field forward. In the course of the development, various tools that could further develop SBiDer were discovered. In future iterations of SBiDer, the following functionalities should be considered for extending SBiDer:

Web Ontology Language (OWL)

What OWL is

The Web Ontology Language (OWL) is a family of knowledge representation languages or ontology languages for authoring ontologies or knowledge bases (W3C). The data described by an ontology in the OWL family is interpreted as a set of "individuals" and a set of "property assertions" which relate these individuals to each other. An ontology consists of a set of axioms which place constraints on sets of individuals (called "classes") and the types of relationships permitted between them. These axioms provide semantics by allowing systems to infer additional information based on the data explicitly provided.

Integration of OWL into SBiDer

SBiDer can describe all of the components in the database using OWL by creating a unique ontological hierarchy that suits the field of synthetic biology. Using OWL relationships, SBiDer can better connect its components through more complex fashion that better captures the biological reality underlying synthetic biology. For example, an ontology describing components used by SBiDer, or individuals in the semantics of ontology, might include axioms stating that a "activatePromotor" property is only present between two individuals when "isBiochecmicalSpecies" is also present, and individuals of class "activatePromotor" are never related via "isBiochemicalSpecies" to of the "transcribeRNA" class. If it is stated that the individual green fluorescent protein (GFP) is related via "produceLight" to the individual yellow fluorescent protein (YFP), and that GFP is a member of the "isProtein" class, then it can be inferred that YFP is not a member of "isPromotor" class. In this fashion, OWL allows the use of more sophisticated relationships amongst database components. Furthermore, there already exists various ontological hierarchy optimized for biology such as genetic interaction ontology, protein interaction ontology, and cancer genomics ontology. And OWL allows potential interaction of multiple ontologies. So, the integration of OWL into SBiDer allows SBiDer to potentially interact with many other aspect of biology such as genetic interaction, protein interaction, and RNA processing. As a result of OWL integration, SBiDer can represent the biological reality of synthetic biology in a much more complex, sophisticated, and realistic manner.

Factor Graph

What Factor Graph is

In probability theory and its applications, a factor graph is a particular type of graphical model, with applications in Bayesian inference, that enables efficient computation of marginal distributions through the sum-product algorithm (H. Loeliger). Factor graph can capture a system as Markov networks or Bayesian networks, and represent mathematical relationships of elements of a system.

Integration of Factor Graph into SBiDer

SBiDer currently implements its network representation of the database as a Petri-Net, which allows modeling of our complex biological system. SBiDer can also capture the network using factor graph, which provides additional statistical tools for analyzing, interpreting, and prediction of the behaviour of our network. Integration of the factor graph can also provide us with machine learning tools to further advance our tools quantitatively and qualitatively.

Web App

The current SBiDer web tool has been constructed with consideration for expansion in all its components from the structure of the source code to the allowance of users to contribute to the database. It is the latter improvement that allows SBiDer to become more efficient as time progresses. As more information becomes available, the SBiDer application can begin to transition from a qualitative description of how its proposed genetic circuits behave to a more quantitative model.

Web Application

Framework of the SBiDer Interface

Structure Overview

SBiDer was developed off of CyNetShare, a HTML5 application designed to make utilizing and visualizing CytoscapeJS graphs simple and intuitive through use of the javascript library AngularJS. The object of the SBiDer web tool is to visually present genetic circuits and possible interactions between them in addition to providing relative information about each device’s construction and function through reference material. Images of devices and image data, in Systems Biology Markup Language (SBML), are provided by a Synthetic Biology Open Language (SBOL) visual rendered through use of the popular PigeonCad for intuitive understanding and universal sharing. Users are also provided with the PubMed ID’s of all genetic devices so that they may be able to access and verify entries information in the PubMed journal service. SBiDer utilizes and extends the web-application architecture used by the Owl Datasheet generator project from iGEM 2013.

Visualizing the Network

CytoscapeJS is a widely used visualization library written in Javascript that presents complex relationships in more easily understood figures of nodes connected by edges. The library’s popularity and supportive community make it an ideal framework from which the SBiDer team can reach a large audience of synthetic biologists. Characterizing information about the genetic devices can be easily stored inside CytoscapeJS’s nodes as data. Unfortunately, manipulating, editing, and updating this data autonomously without further means poses a difficult challenge that is resolved with the use of AngularJS and CyNetShare.

Dynamic Interface and Data Handling

The large portion of the SBiDer framework relies on the data manipulation and two-way databinding provided by the AngularJS javascript library which provides the interactivity to the web application. Key to this is the two-way binding that allows information to be changed through the User Interface (UI) and updated between the scripting code and html without need to constantly resync data with the server, allowing for quicker response times for users and cleaner source code for further development. The ability to pass data from the UI to the SBiDer application also simplified the procedure for User-generated entries to the SBiDer database. The major alternative library was JQuery, an older and popular method of achieving a dynamic interface that doesn’t have the crucial feature of two-way databinding. Data is passed from server to client through a JSON string which is parsed by AngularJS for the SBiDer application to visualize the network using CytoscapeJS as well as updating the web page without need to reload the window unlike when employing JQuery. Additionally, AngularJS is compatible with many JQuery functionalities which allows for a combination of use for both libraries with negligible drawbacks. Effectively, AngularJS allows SBiDer to effectively and dynamically communicate with CytoscapeJS and other components of the application without the clutter of JQuery’s syntax.

Modular Interface

CyNetShare is a new innovation from the creators of CytoscapeJS which is used to share visualizations made with CytoscapeJS without the need to manually exchange files, instead rendering the files through encoded url’s. CyNetShare provides the basic framework for SBiDer as it has much of the core foundation in its source code including the CytoscapeJS visualization, table of node information, and a basic user interface layout and styling. SBiDer developed upon this framework to include the ability to retrieve results from the SBiDer database, display images of genetic devices, and easily access and download data.

Other Utilized Technologies

Apart from the aforementioned technologies that were used to create SBiDer itself, there were two other major technologies/languages involved, namely Boostrap, to design the website & the wiki, and Java, for the server.
Bootstrap is a front-end framework that contains several HTML & CSS-based design templates for various aspects of web pages. It's an easy to use framework that's used widely all over the world since it makes websites look more modern, cleaner and aesthetically pleasing.
The language chosen for the server was Java since it's easier for other developers to use, is extremely common and is compatible with a lot of languages including but not limited to Python, JavaScript and SQL. In essence, all of the languages that we made use of are very well documented, modern and easy to use languages.

The Search Algorithm

Introduction to the SBiDer Search Algorithm

Purpose of the Algorithm

SBiDer is fundamentally constructed on a manually curated database of existing genetic devices, or operons. This SBiDer database stores operons, species, and most importantly the biochemical interactions of these elements. Also, the database can be represented as an operon interaction network via species. Using this network representation of the database, SBiDer’s search algorithm searches for feasible operon paths connecting a set of species to another set of species. As a result, given a set of input species and a set of output species, the search algorithm returns a subnetwork of operon paths connecting the input species to the output species. Each path within the subnetwork is an operon path that can be used to produce the output species from the input species. In other words, placing operons from a returned operon path as well as the input species into a system can produce the output species.

Biology Captured by the Algorithm

The search algorithm conducts a complex, effective, and robust operon path search over the network representation of the database, ultimately generating single or multiple operon paths connecting a set of input species to a set of output species. A resulting path represents a system that can produce the output species given the input species. Also, each path attempts to capture biological reality by considering two major mechanisms of operon activation (detailed description below).

Robustness and Modularity of the Algorithm

Due to the current number of operons in the database, the search algorithm should scale with the increasing size of the database, which will expand as the synthetic biology community adds more species, operons, and plasmids to the database (see complexity analysis of the algorithm is located below). In addition to this robustness, the search algorithm is easily modifiable since it has been implemented independently from the database and the SBiDer web application. Therefore, the search algorithm can be applied to a broad range of networks. Furthermore, the search algorithm was developed on an open source platform to encourage global collaboration in further optimizing the search algorithm and minimizing the barrier for algorithm improvement. As a result, the database as well as the search algorithm will be easily, effectively, and accurately expanded, extended, and optimized by the global synthetic biology community - together.

Algorithm

Overview of the Algorithm

SBiDer's search algorithm is fundamentally based on multiple breadth first searches (BFS) on the SBiDer database. From a set of input species, the algorithm searches for any operon that can be activated by the input species (see Tutorial for how to format the input query). If such an operon is found, the output species of the activated operon and/or other existing species in the system (details below) is used to search for the next operon. In the process, the search algorithm constructs an operon paths by successively adding activated operons to the path. There are two ways of activating an operon: linear successional activation and nonlinear successional activation (more details below), and the search algorithm can create an operon path using either or both of the activation methods. This search algorithm continues until either: the search algorithm concludes that there is no operon path that can lead to the production of the output species; or an operon path producing the output species is found.

Linear Successional Activation Search (LSAS)

The search algorithm uses linear successional activation search (LSAS) to generate an operon path in which the output species of a preceding operon produces all of the required input species for the succeeding operon. LSAS extends an operon path by adding only operons that can be activated by the species produced by the last operon in the path. Thus, for each potential activated operon by the input species, LSAS uses a tree data structure in searching an operon path. From an input species (or multiple input species with an AND relationship), LSAS builds a tree, and if an output species is found in the process of growing the tree, the path from the root operon to the leaf operon that has produced the output species is returned as a resulting operon path. Finally, this resulting set operon paths is used to construct a Petri net path including all of the input species, input transitions, input transition logic, operons, output transitions, output transition logic, and output species.

Nonlinear Successional Activation Search (NLSAS)

An operon may not produce all of the input species required for an activation of another operon. Yet, multiple operons may together produce all of the required input species for the other operon. Here, each of the activating operons produces some, but not all, of the required species for the activated operon. The search algorithm captures such operon activation mechanism using nonlinear successional search (NLSAS) in building an operon path. As LSAS, NLSAS also uses a tree data structure, but appends an operon whose input species, or activation requirements, are produced by any of the preceding operons in the path. While NLSAS assumes no degradation of the species in the system, it allows for a wider range of operon sets that could produce our desired output from our given input. As a result, each resulting operon path can be linear or nonlinear. Since a NLSAS result captures operon activation in which a single or multiple preceding operons produce species required for the activation of a succeeding operon, LSAS result is a subset of a corresponding NLSAS result.

Overview of LSAS and NLSAS

Overall, LSAS generates a single or multiple operon paths in which a preceding operon directly activates a succeeding operon, and NLSAS paths in which any of the preceding operons can activate a succeeding operon. In general, a resulting LSAS operon path includes operons mined from same or similarly conducted experimental protocols, where the operons are optimized to coexist in a system. As a result, LSAS can return well optimized operon paths and interactions, which may be used in biochemical reactions with higher confidence. The SBiDer web application can further facilitate the use of such operons paths by providing literature and other published resources. NLSAS, on the other hand, can generate an operon path by including operons from dissimilar experimental protocols. So, using NLSAS can potentially lead to the discovery of novel operon paths and thus new biochemical systems. By providing LSAS and NLSAS variations of the search algorithm, SBiDer assists a user in capturing the biological reality underlying an operon activation as accurately as possible, and provides a means to evaluate, analyze, and apply the results.

Complexity Analysis

The SBiDer network is based upon the Petri Net model, which allows for the capturing of boolean logic. The linearity of the network representation of the database is such that one input transition leads to one operon, which leads to one output transition, and one list containing all species produced by that operon, as shown in Figure 1.

Figure 1 - Petri Net representation of an input species node, an input transition, an operon node, an output transition, and a species node consisting of only the species produced by the current operon respectively.This is the representation used for the construction of the SBiDer database since they allow for the incorporation of boolean logic.

This entire structure can be collapsed into one node since there is no branching, and allows us to represent the network as a directed graph, as shown in Figure 2. By condensing the network in this fashion, complex searches can be conducted in a more efficient manner.

Figure 2 - Operon representation of the network where the input and output transitions have been collapsed into one operon node to decrease the complexity of the search algorithm. This is possible because an input transition always points to one operon and one output transition always points to one species node.

LSAS

Assume the network has m transitions, with depth d and breadth b from a specified root, and V vertices which represent the aforementioned condensed nodes (centered on operons), and E directed edges connecting said vertices.

The SBiDer search algorithm is a modified version of breadth first search where at each transition m that is activated by the input species set a breadth first search for the output species set is performed. For a standard breadth first search, the complexity is

BFS Complexity = O(|b|^|d|)

This is another way of saying that in the worst case, every vertex and every edge is explored, which can be rewritten in terms of the number of vertices and edges. Since all edges have weights of one, V and E will always be positive.

BFS Complexity = O(|V| + |E|) => O(V + E).

For this standard BFS, assuming a large network where every node is connected to every other node, for V significantly large, V ≈ V - 1, meaning the number of edges searched after the selection of each successive node is approximately equal to V, meaning the V nodes are searched V times. Additionally, consider that the total number of edges (E) in a completely connected graph is bound by V^2. In a completely connected graph (with no self or duplicate edges) with V nodes, the total number of edges is (V-1) + (V-2) + (V-3) + … + 2 + 1. It has been shown that the (V-1)th partial sum of this divergent series sums to [V*(V-1)]/2, which is bound by V^2. Therefore the upper bound for the time complexity for a single breadth first search from a single activated transition from an input species is O(V + E) ≤ O(V+V^2) => O(V^2).

For our network, a breadth first search of time complexity O(V^2) happens at every activated transition m.
Therefore, the LSAS complexity = O(mV^2)

Paths generated using the LSAS search would resemble Figure 2 where a species node directly activates an operon, which produces a singular species node. The LSAS algorithm would not be able to capture the logic encompassed within Figure 3 where multiple species nodes are required to activate the middle input transition.

NLSAS

For the NLSAS, a similar tree from LSAS is created for each activated input transition by the input species. However, the difference here lies in the construction of the path from the input species to the output species. A special traceback algorithm is used to tie the output species to the input species creating a path that may resemble a subnetwork rather than a traditional linear path. This traceback algorithm is where the true complexity of this algorithm arises from.
Similar to LSAS, this algorithm also begins by constructing a subnetwork of paths that can lead to the output of interest from a specific input. This generates a list of edges to be used in the traceback algorithm. Assuming every edge could be explored, the algorithm runs in O(E) time. As shown for the LSAS algorithm, O(E) <= O(V^2), and also returns a list that will be explored also in O(E) size or O(V^2). Again, a new list is constructed for each potentially activated transition, thus giving this entire first step a run time complexity of O(m*V^2).

Now, for each of the O(m) list of edges constructed from the last step, a traceback algorithm is run for this path from the output species set to return a subnetwork of operons that could form a system that leads to the production of the desired output. From the desired output species, the traceback algorithm searches through the list of edges E to find the transitions for each operon contained in the path that produces the output species of interest. Since every transition in the set E must be checked, the algorithm runs in O(E) or O(V^2) time. At worst, every operon must be checked, resulting in a complexity of O(m*V^2) for this step.

Now, in order to determine which species were required to activate the operons now connected to the output species, the database of transitions is searched to identify which transitions will have to be activated. This search happens across an O(m) space for each of the O(V) operons that could be activating the output. Therefore, this next time step happens in O(mV) time. The first two steps together now will happen as O(m*(V2+mV)). At the end of these steps, the path now has O(m*V) species.

However, it is important to note here that at the step when possible operons are being explored, the search will have a time complexity of O(E), and will never happen more than O(m*V) times, which is the bound for the number of unique species in the system. For the following step, the search for species activating all operons on this increment will have a time complexity of O(m), but it will never happen more than O(V) times, which is the bound for the number of unique operons in the system. Therefore, for a full iteration of these two steps in the traceback algorithm, it will be bounded by the time complexity of O(V^2*mV + m*V). This traceback step will happen until the depth of the entire path generated from the LSAS step has been explored, which is bound by O(V). Thus, the time complexity for the traceback of a single path is O(mV^4+mV^2). Then complexity of the path generation is added for a complexity of O(mV^4+mV2+V^2). Now consider that this path generation and traceback happens for every single activated transition by the input species, which gives O(m*(mV^4+mV^2+V^2))=>O(m^2*V^4).

Figure 4 - Petri Net representation Figure 3 where multiple species nodes are required to activate one input transition. First two species nodes activate each of the three initial input transitions, which in turn activate three separate operons. The output of each operon is an output transition which points to a species node that contains all output species produced by that operon. The three species nodes in the center of the graph trigger the final input transition, triggering the final operon, and producing the final species node. This expansion would add an additional layer of complexity to the NLSAS algorithm that is unnecessary given the previously explained linearity of the operon nodes.

References
[1]Carl Adam Petri and Wolfgang Reisig (2008) Petri net. Scholarpedia, 3(4):6477

[2]"OWL 2 Web Ontology Language Document Overview". W3C. 2009-10-27.

[3]An Introduction to Factor Graphs by Hans-Andrea Loeliger, IEEE Signal Processing Magazine,January 2004, pp. 28–41.

Database

Data Collection

The database was built to meet the needs of synthetic biology and genetic circuit building by offering literature references, storage, maintenance and upload of genetic devices as well as SBML text files for network data exchange.1 Genetic circuit information is not easily extracted from papers making it difficult to begin any genetic device construction. SBiDer provides a centralized library of relevant genetic circuit references in the form of pubmed id’s so that synthetic biologists can quickly and easily access experimentally backed information. Outsourcing the literature search to SBiDer gives synthetic biologists an interface with reliably relevant information. As well as providing literature references, SBiDer extracts the necessary details of each genetic device so that synthetic biologists can build devices using essential genetic building blocks. As a means of further involving the biology community, SBML data exchange files for our genetic parts and SBiDer network are accessible by using the SBiDer API. Using SBML provides further data analysis of the SBiDer network by taking advantage of approximately 200 SBML supported softwares such as Cytoscape, Systems Biology Workbench and more (www.cytoscape.org) (http://sbw.kgi.edu/research/sbwintro.htm) (www.sbml.org).

Figure 1. Plasmids used to create a genetic circuit know as μBE (Hoon Hong, 2012).

Petri Net and Schema Design

This project integrated the Petri net model by describing interactions between species and operons. Species act as input for operon activation and output. In the context of biology, a simple genetic circuit is broken down into operons, and operons can be broken down into their components: promoters, regulating elements, and coding sequences. Combining the biology and Petri net formalism, genetic device information is captured and makes for a database schema that innately captures a network of genetic device interactions. The Petri net model inspired schema works by atomizing the necessary genetic circuit information so the parts necessary for genetic circuit design are represented. Operons and species are represented as equivalents of Petri net chemical species while the interaction of species with operons and the production of species by operons is represent by input and output transitions namely. Genetic circuit design is now formalized for the biology community under the Petri net model and stored inside the SBiDer database.

Community Based Support

A single genetic circuit has a set of responses based on the chemical or environmental factors present in a biological system. By introducing new genetic circuits into the SBiDer network the biology community benefits from new devices with new functionalities or improved functionality, and when introduced into the SBiDer network creates a multitude of new genetic device combinations. The SBiDer website includes a device upload page to encourage the collaboration of the synthetic biology community to keep expanding the database and network for improved and up to date genetic devices. The database also maintains updated SBML forms of all genetic device parts and the SBiDer network so the biology community can continue working on the most up to date information.

Technologies

The database was implemented using SQLite, a widely used database management system, and the SQLite3 Python module. SQLite is developed and maintained by multiple companies including Oracle and Adobe (www.sqlite.com). With it’s widespread use and fast queries, SQLite is a powerful tool to use for collaboration with the biological community. Python’s success within the biology community, it’s ease of use and compatibility with major operating systems also makes for an effective combination with SQLite and computational work in biology (“Programming for Biologists”, 2014). Using these two technologies, a Python based API was create that encourages the synthetic biology community to participate in the database expansion and operation as well as SBML file access.

Database Schema in Depth

The plasmid table stores the name and pubmed literature references allowing users to easily find out more about their given device. The plasmid is then broken down into operons where multiple operons can be mapped to the same plasmid. In order to capture this relationship, a table of plasmid id and operon id allows for multiple operons relating to a single plasmid. A direction column in the plasmid-operon relationship table also allows users to specify the directionality of the operon inside of the given plasmid as multiple operon sequences can can be found with presence in either strands of the DNA. Operons are also mapped exclusively to their plasmid, meaning that identical operons from different plasmids require insertion into the database with a mapping to the corresponding “parent” plasmid. The operon table itself contains information on the operon sbol and sbml files representations.

The input transition table explains the type of logic that is being used for operon activation as either AND or OR logic. AND logic transitions are transitions which must take in all corresponding inputs species for the activation of the operon. OR logic transitions are transitions which require a single input species or multiple input species but do not require all. More complex relations are made between the input species and the input transitions table.

Input transition and species interactions are captured using a separate relationship table. Using this separate input transition species relationship table allows for more flexibility when describing the boolean logic at a given input transition. The same species may interact as a repressor or as activator depending on what input transition it encounters. Therefore, the table attribute repression gives a true or false statement under the repression column describing the role of that species in a given operon activation. All complex logic types, NAND; NOR; XOR; and XNOR, are captured solely by using the atomic logic types, AND; OR; and NOT. So a NAND input transition would be represented as an AND inside the input transition table with repressor molecules as the interacting species and TRUE values in the repression columns of these species.

Operons also have associated output transitions which describe the events and species produced after operon activation. These transitions unlike input transitions have very simple logic, namely IF. The reason for this is, that once an operon is activated then the production of the protein will inevitably occur. Just like input transitions and species, output transitions and species also have a relationship table with multiple species mapping to a single output transition when applicable. This illustrates the idea that operons are a sequence of genes that when activated, transcribe mRNA and protein species.

Figure 2. An entity relational diagram displaying the sbider database. Entities and their corresponding attributes are like-colored. Relationships represented by the grey diamonds may also store attributes tied to entities to illustrate that certain information is the result of a relationship.

References
[1]Hucka, M., 25 Oct. 2002., "The Systems Biology Markup Language (SBML): A Medium for Representation and Exchange of Biochemical Network Models". Oxford Journals, www.http://bioinformatics.oxfordjournals.org (October 15, 2014)

[2]"Cytoscape Core Features." What Is Cytoscape?, www.cytoscape.org (October 15, 2014)

[3]"Download SBW File A Bug Browse Forum." http://sbw.kgi.edu/research/sbwintro.htm (October 15, 2014)

[4]"SBML Software Showcase." 10 June 2014., The System Biology Markup Language, www.sbml.org (October 15, 2014)

[5]Baez, John C. "Petri Nets." www.johncarlosbaez.wordpress.com (October 15, 2014)

[6]"SQLite Home Page." SQLite Home Page. www.sqlite.org (October 15, 2014)

[7]"Why Python?" www.programmingforbiologists.org (15 October 2014)

[8]"NDEx About." www.ndexbio.org (October 15, 2014)

[9]Hoon Hong, Seok. 03 Jan. 2012, "Synthetic Quorum-sensing Circuit to Control Consortial Biogilm Formation and Dispersal in a Microfluidic Device." Nature.com. Nature Publishing Group, (October 17, 2014)

Modeling

What is this program?

1, 2

3

5

wet-lab

How does it work?

in vivo

in vitro

4

k1[A]aq + k2[B]aq k3[A B]aq

k

1

k

2

k

3

plug-in

How to use it?

3.1

def device’s name (od , input_1, input_2, output, dt = 0.1) :
     parameters
     input1 = input_1 + (differential equation of input_1)* od* d t
     input2 = input_2 + (differential equation of input_2)* od* d t
     output1 = output + (differential equation of output)* od* d t
     return (input1 , input2, output)
where od is optical density and dt is integral time step.

3.2

3.3

k

1

6

Demonstration

k

grow

k

max

OD

max

6

x

1

y

References
[1]Ellis, Tom, Xiao Wang, and James J. Collins. ”Diversity-based, model-guided construction of synthetic gene networks with predicted functions.” Nature biotechnology 27.5 (2009): 465-471.

[2]Andrianantoandro, Ernesto, et al. ”Synthetic biology: new engineering rules for an emerging discipline.” Molecular systems biology 2.1 (2006).

[3]Lu, Timothy K., and James J. Collins. ”Engineered bacteriophage targeting gene networks as adjuvants for an- tibiotic therapy.” Proceedings of the National Academy of Sciences 106.12 (2009): 4629-4634.

[4]Kaznessis, Yiannis N. ”Models for synthetic biology.” BMC systems biology 1.1 (2007): 47.

[5]Lee, Sung Kuk, et al. ”Metabolic engineering of microorganisms for biofuels production: from bugs to synthetic biology to fuels.” Current opinion in biotechnology 19.6 (2008): 556-563.

[6]Shis, David L., and Matthew R. Bennett. ”Library of synthetic transcriptional AND gates built with split T7 RNA polymerase mutants.” Proceedings of the National Academy of Sciences 110.13 (2013): 5028-5033.