Team:Vanderbilt Software/Novel Approach
From 2014.igem.org
(Created page with "{{CSS/Main}} <html> <style type="text/css"> body { position: relative; width: 850px //100%; margin: 0;...") |
|||
Line 61: | Line 61: | ||
<tr> | <tr> | ||
<td width="45%" valign="top"> | <td width="45%" valign="top"> | ||
- | + | <p>git, svn, and other version control systems compare files line by line. Because most | |
+ | DNA file formats wrap sequences into fixed-length lines, even a small insertion shifts every | ||
+ | line after it, so many lines register as changed at once. darwin avoids this by producing a | ||
+ | formatted file that places each ORF on its own line of text, so each edit modifies only a | ||
+ | single line of the output text.</p> | ||
+ | <figure> | ||
+ | <img src="https://static.igem.org/mediawiki/2014/f/f9/Editing_single_lines.png"> | ||
+ | <figcaption>Fig1. - darwin eliminates extra lines in the output file</figcaption> | ||
+ | </figure> | ||
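<p>A minimal sketch of the effect described above, not darwin's actual code. It assumes the ORFs have already been identified, uses three made-up ORFs, and counts changed lines by position rather than running a real diff. The names <code>wrap_fixed</code> and <code>count_changed_lines</code> are hypothetical.</p>

```python
def wrap_fixed(seq, width=10):
    """Wrap a sequence into fixed-length lines, as FASTA-style formats do."""
    return [seq[i:i + width] for i in range(0, len(seq), width)]


def count_changed_lines(old_lines, new_lines):
    """Count line positions that differ, plus any lines added or removed.

    This is a positional comparison, a simplification of what a real
    line-based diff tool computes.
    """
    differing = sum(1 for a, b in zip(old_lines, new_lines) if a != b)
    return differing + abs(len(old_lines) - len(new_lines))


# Three toy ORFs (illustrative data only).
old_orfs = [
    "ATGAAACCCGGGTTTTAA",
    "ATGCCCGGGAAATTTTGA",
    "ATGGGGTTTCCCAAATAG",
]

# Insert one codon into the first ORF, right after the start codon.
new_orfs = [old_orfs[0][:3] + "GCT" + old_orfs[0][3:]] + old_orfs[1:]

# Fixed-width layout: the insertion shifts every following line.
wrapped_old = wrap_fixed("".join(old_orfs))
wrapped_new = wrap_fixed("".join(new_orfs))
print("fixed-width lines changed:", count_changed_lines(wrapped_old, wrapped_new))

# One ORF per line: only the edited ORF's line differs.
print("one-ORF-per-line lines changed:", count_changed_lines(old_orfs, new_orfs))
```

<p>With this toy data every wrapped line differs after the insertion, while the one-ORF-per-line layout reports a single changed line.</p>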
+ | |||
+ | <p>Genes can be very long, so recording a whole ORF for every small change would be costly. | |
+ | To combat this, darwin samples a section of every newly inserted ORF and compares it to | ||
+ | nearby ORFs; if the new ORF is similar to another ORF, it is counted as “edited,” and darwin | ||
+ | records only the character-by-character changes required to transform the old ORF into the | ||
+ | new one.</p> | ||
+ | <figure> | ||
+ | <img src="https://static.igem.org/mediawiki/2014/5/50/Editing_characters_in_lines.png"> | ||
+ | <figcaption>Fig2. - darwin's unique method of parsing ORFs</figcaption> | ||
+ | </figure> | ||
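<p>The sample-and-compare step could be sketched roughly as follows. The source does not say how darwin measures similarity or how large a sample it takes, so this sketch substitutes Python's <code>difflib.SequenceMatcher</code> as a stand-in. The sample length, the threshold, and the function names are all assumptions made for illustration.</p>

```python
from difflib import SequenceMatcher

SAMPLE_LEN = 12   # hypothetical: how much of each ORF to sample
THRESHOLD = 0.8   # hypothetical: minimum similarity to count as "edited"


def find_similar(new_orf, nearby_orfs, sample_len=SAMPLE_LEN, threshold=THRESHOLD):
    """Return the index of the first nearby ORF whose leading sample
    resembles the new ORF's leading sample, or None if none does."""
    sample = new_orf[:sample_len]
    for idx, old_orf in enumerate(nearby_orfs):
        matcher = SequenceMatcher(None, sample, old_orf[:sample_len], autojunk=False)
        if matcher.ratio() >= threshold:
            return idx
    return None


def char_edits(old_orf, new_orf):
    """List (tag, start, end, replacement) operations on old_orf that
    transform it into new_orf, skipping the unchanged stretches."""
    matcher = SequenceMatcher(None, old_orf, new_orf, autojunk=False)
    return [
        (tag, i1, i2, new_orf[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def apply_edits(old_orf, edits):
    """Rebuild the new ORF from the old ORF and its recorded edits."""
    pieces, pos = [], 0
    for _tag, start, end, replacement in edits:
        pieces.append(old_orf[pos:start])  # unchanged text before the edit
        pieces.append(replacement)         # inserted or replacing text
        pos = end                          # skip deleted or replaced text
    pieces.append(old_orf[pos:])
    return "".join(pieces)


nearby = ["ATGTCTTCTTCTTCTTGA", "ATGAAACCCGGGTTTTAA"]
new_orf = "ATGAAACCCGGGTTTTAGTAA"

match = find_similar(new_orf, nearby)
if match is not None:
    edits = char_edits(nearby[match], new_orf)
    print("edited ORF", match, "with", len(edits), "recorded change(s)")
    assert apply_edits(nearby[match], edits) == new_orf
```

<p>Storing only the edit list keeps the record small when a long ORF changes by a few characters; an ORF with no similar neighbor would instead be recorded in full.</p>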
+ | |||
+ | <p>Finally, darwin uses concurrency to speed up the process. File I/O is typically extremely slow, | |
+ | much slower than processing file data already in memory. Running the reading and the processing concurrently helps to relieve that bottleneck.</p> | ||
+ | <figure> | ||
+ | <img src="https://static.igem.org/mediawiki/2014/b/bc/Pipeline_diagram_concurrency.png"> | ||
+ | <figcaption>Fig3. - Representation of darwin's block processor increasing processing speed</figcaption> | ||
+ | </figure> | ||
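<p>One way to overlap reading and processing is a small producer/consumer pipeline. This is a sketch under assumptions, not darwin's block processor: the source does not describe its internals, so the block size, queue depth, and the names <code>read_blocks</code> and <code>process_stream</code> are hypothetical.</p>

```python
import io
import queue
import threading

BLOCK_SIZE = 1 << 16   # hypothetical block size, in characters
_DONE = object()       # sentinel marking the end of the input


def read_blocks(stream, out_queue, block_size):
    """Producer: read fixed-size blocks and hand them to the consumer."""
    while True:
        block = stream.read(block_size)
        if not block:
            break
        out_queue.put(block)
    out_queue.put(_DONE)


def process_stream(stream, process_block, block_size=BLOCK_SIZE, max_pending=4):
    """Consumer: process each block while the reader fetches the next one.

    The bounded queue keeps memory use limited if reading outpaces processing.
    """
    blocks = queue.Queue(maxsize=max_pending)
    reader = threading.Thread(
        target=read_blocks, args=(stream, blocks, block_size), daemon=True
    )
    reader.start()

    results = []
    while True:
        block = blocks.get()
        if block is _DONE:
            break
        results.append(process_block(block))
    reader.join()
    return results


# Usage with an in-memory stream; a real run would pass an open file instead.
counts = process_stream(io.StringIO("ATGC" * 5), lambda b: b.count("G"), block_size=8)
print(counts)
```

<p>With a real file, the reader thread can wait on the disk while the main thread works on blocks that are already in memory, which is the overlap the paragraph above describes.</p>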
</td> | </td> | ||
</tr> | </tr> |
Latest revision as of 18:09, 18 January 2015
git, svn, and other version control systems compare files line by line. Because most DNA file formats wrap sequences into fixed-length lines, even a small insertion shifts every line after it, so many lines register as changed at once. darwin avoids this by producing a formatted file that places each ORF on its own line of text, so each edit modifies only a single line of the output text.

Genes can be very long, so recording a whole ORF for every small change would be costly. To combat this, darwin samples a section of every newly inserted ORF and compares it to nearby ORFs; if the new ORF is similar to another ORF, it is counted as “edited,” and darwin records only the character-by-character changes required to transform the old ORF into the new one.

Finally, darwin uses concurrency to speed up the process. File I/O is typically extremely slow, much slower than processing file data already in memory. Running the reading and the processing concurrently helps to relieve that bottleneck.