BioSeq

Monday 2nd November, 2009

A key problem in the field of Bioinformatics is referring to biological sequences. A single sequence may be known by many names across many databases on the web, with distinct ‘IDs’ on different sites and little to tie them together.

This is where BioSeq comes in. BioSeq is a technology I’m currently working on to leverage Content-Addressable Storage (CAS), powered by Bitcache, to store and refer to biological sequence data. Sequences are identified with a URI like the following:

  bioseq:a21268b77c91c67973efa8289cc42a62772d8c33

The scheme-specific part of a bioseq: URI is the sequence’s unique identifier, generated by applying a cryptographic hash algorithm (currently SHA-1) to the sequence. The sequence itself can then be retrieved by querying a Bitcache server for the hash, usually via a simple HTTP GET to http://bitcache.example.com/-identifier-.
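As a rough illustration of how such an identifier could be derived, here is a minimal Python sketch. The function name bioseq_uri is hypothetical, and the choice to hash the sequence's raw ASCII bytes with no normalization (case, whitespace, line breaks) is an assumption; the exact canonical form that gets hashed is not settled here.

```python
import hashlib


def bioseq_uri(sequence: str) -> str:
    """Return a bioseq: URI for a sequence (hypothetical helper).

    Assumption: the sequence is hashed as its raw ASCII bytes, with no
    normalization of case or whitespace applied beforehand.
    """
    digest = hashlib.sha1(sequence.encode("ascii")).hexdigest()
    return "bioseq:" + digest


if __name__ == "__main__":
    # Identical input always yields the same URI; any change yields a new one.
    print(bioseq_uri("ACGT"))
    print(bioseq_uri("ACGTT"))
```

Under this assumption, two clients only agree on a URI if they hash exactly the same bytes, so a real implementation would need to fix a canonical form first.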

The use of cryptographic hashes as identifiers causes the URI to have several useful properties:

  • A cryptographic hash algorithm will always produce a fixed-length output regardless of the size of the input, so URIs have a fixed length (40 hexadecimal characters for SHA-1).

  • The hashes of two identical sequences will always be identical, and the hashes of two different sequences have such a small probability of collision that we may safely ignore this risk. This means bioseq: URIs are truly universal, since a given URI will always refer unambiguously to one biological sequence.

  • Because the identifier is derived from the sequence alone, bioseq: URIs are independent of the server on which the sequences themselves are stored; it is up to the client that wants to fetch a sequence to resolve the URI into a URL. This can be done with URI prefixes: setting bioseq: as a prefix pointing to a Bitcache server (in an RDF document, for example) will make URI-to-URL resolution occur automatically.

  • Since two identical sequences will have the same address, redundancy (at the sequence level) is eliminated entirely.

  • Because any change to a sequence changes its hash, a client can verify data integrity whenever it fetches or stores a sequence, simply by rehashing the data and comparing the result with the identifier.

  • CAS makes incremental server-to-server replication very easy, since the dataset is append-only. See Chris Anderson’s blog post on CouchDB replication to see how replication could be implemented on top of Bitcache.
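Two of the properties above, client-side resolution and integrity checking, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names resolve and verify are hypothetical, the base URL is the placeholder host used earlier in this post, and the URL layout (base plus hash) simply mirrors the HTTP GET pattern described above rather than any confirmed Bitcache API.

```python
import hashlib

# Placeholder server from this post; a real client would configure its own.
BITCACHE_BASE = "http://bitcache.example.com/"

HEX_DIGITS = set("0123456789abcdef")


def _digest_of(uri: str) -> str:
    """Extract and validate the SHA-1 hex digest from a bioseq: URI."""
    scheme, _, digest = uri.partition(":")
    if scheme != "bioseq" or len(digest) != 40 or not set(digest) <= HEX_DIGITS:
        raise ValueError("not a valid bioseq: URI: %r" % uri)
    return digest


def resolve(uri: str, base: str = BITCACHE_BASE) -> str:
    """Turn a bioseq: URI into a URL on the chosen server (assumed layout)."""
    return base.rstrip("/") + "/" + _digest_of(uri)


def verify(uri: str, data: bytes) -> bool:
    """Check fetched bytes against the hash embedded in the URI."""
    return hashlib.sha1(data).hexdigest() == _digest_of(uri)


if __name__ == "__main__":
    uri = "bioseq:a21268b77c91c67973efa8289cc42a62772d8c33"
    print(resolve(uri))
    print(resolve(uri, base="http://mirror.example.org/cache"))
```

The same URI resolves against any server the client picks, and verify lets the client reject a response whose bytes do not hash back to the identifier it asked for.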

The whole concept is still in the planning stages, but the technology to realize it is all present. In a following blog post, I’ll describe how BioSeq will integrate with other semantic web technologies, allowing for the creation of a scalable, distributed infrastructure for storing and querying biological sequence data.