Monday 2nd November, 2009
A key problem in the field of Bioinformatics is referring to biological sequences. A single sequence may be known by many names across many databases on the web, with distinct ‘IDs’ on different sites and little to tie them together.
This is where BioSeq comes in. BioSeq is a technology I’m currently working on to leverage Content-Addressable Storage (CAS), powered by Bitcache, to store and refer to biological sequence data. Sequences are identified with a URI like the following:
The scheme-specific part of a bioseq: URI is that sequence’s unique identifier, generated by applying a cryptographic hash algorithm (currently SHA-1) to the sequence. The sequence itself can then be retrieved by querying a Bitcache server for the hash, usually via a simple HTTP GET.
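To make the identifier scheme concrete, here is a minimal sketch of how such a URI could be derived. The helper name `bioseq_uri`, the hex encoding of the digest, and the normalization step (stripping whitespace and upper-casing) are all assumptions of mine for illustration; the text above only fixes the hash algorithm, SHA-1.

```python
import hashlib


def bioseq_uri(sequence: str) -> str:
    """Return a bioseq: URI for a sequence (hypothetical helper)."""
    # Assumption: strip whitespace and upper-case the residues so that
    # formatting differences (line wraps, case) do not change the hash.
    normalized = "".join(sequence.split()).upper()
    # Assumption: the scheme-specific part is the hex-encoded SHA-1 digest.
    digest = hashlib.sha1(normalized.encode("ascii")).hexdigest()
    return "bioseq:" + digest


uri = bioseq_uri("ATGGCC\nattgta")
print(uri)
```

Whatever normalization is chosen, every client has to apply exactly the same one, otherwise the same sequence would end up with different identifiers.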
Using cryptographic hashes as identifiers gives these URIs several useful properties:
A cryptographic hash algorithm always produces a fixed-length output regardless of the size of the input, so the URIs have a fixed length.
The hashes of two identical sequences will always be identical, and the hashes of two different sequences have such a small probability of collision that we may safely ignore this risk. This means bioseq: URIs are truly universal, since a given URI will always refer unambiguously to one sequence.
It also means that bioseq: URIs are independent of the server on which the sequences themselves are stored; it is up to the client that wants to fetch the sequence to resolve the URI into a URL. This can be done with URI prefixes: setting bioseq: as a prefix pointing to a Bitcache server (in an RDF document, for example) will make URI-to-URL resolution occur automatically.
Since two identical sequences will have the same address, redundancy (on the sequence level) is eradicated entirely.
Because changing a sequence will cause its hash to change, the fetching/updating of sequences will also verify data integrity.
CAS makes incremental server-to-server replication very easy, since the dataset is append-only. See Chris Anderson’s blog post on CouchDB replication to see how replication could be implemented on top of Bitcache.
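The properties listed above can be sketched with a toy in-memory store. This is not Bitcache's API: `SequenceStore`, `IntegrityError`, `resolve`, and the example.org prefix are hypothetical names chosen for illustration, and the URL layout (prefix followed by the digest) is an assumption.

```python
import hashlib


class IntegrityError(Exception):
    """Raised when stored bytes no longer match their address."""


class SequenceStore:
    """Toy in-memory content-addressable store (hypothetical, not Bitcache)."""

    def __init__(self):
        self.blobs = {}  # hex SHA-1 digest -> sequence bytes

    def put(self, data: bytes) -> str:
        key = hashlib.sha1(data).hexdigest()
        # Identical sequences map to the same key, so nothing is duplicated.
        self.blobs.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        data = self.blobs[key]
        # Re-hashing on fetch detects corrupted or tampered data.
        if hashlib.sha1(data).hexdigest() != key:
            raise IntegrityError(key)
        return data

    def replicate_to(self, other: "SequenceStore") -> int:
        """Copy only the blobs the other store lacks; return the count."""
        # Entries are never modified in place, so a key set difference
        # is all that incremental replication needs.
        missing = self.blobs.keys() - other.blobs.keys()
        for key in missing:
            other.blobs[key] = self.get(key)
        return len(missing)


def resolve(uri: str, prefix_url: str) -> str:
    """Expand a bioseq: URI against a server prefix (URL layout is assumed)."""
    scheme, _, digest = uri.partition(":")
    if scheme != "bioseq" or not digest:
        raise ValueError("not a bioseq: URI: " + uri)
    return prefix_url + digest


a, b = SequenceStore(), SequenceStore()
k1 = a.put(b"ATGGCC")
k2 = a.put(b"ATGGCC")              # same sequence, same address, stored once
a.put(b"TTGACA")
copied_first = a.replicate_to(b)   # copies both blobs
copied_second = a.replicate_to(b)  # nothing new to copy
url = resolve("bioseq:" + k1, "http://bitcache.example.org/")
```

A real deployment would exchange keys and blobs over HTTP rather than between two Python objects, but the logic stays the same: compare key sets, transfer what is missing, and verify each blob against its address on arrival.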
The whole concept is still in the planning stages, but the technology to realize it is all present. In a following blog post, I’ll describe how BioSeq will integrate with other semantic web technologies, allowing for the creation of a scalable, distributed infrastructure for storing and querying biological sequence data.