Thursday, October 23, 2008

Globally distributed object identification for biological knowledgebases

Citation: Tim Clark, Sean Martin, Ted Liefeld. Globally distributed object identification for biological knowledgebases. Briefings in Bioinformatics 2004 5(1):59-70; doi:10.1093/bib/5.1.59
Link: Oxford Journals

Summary

The Web provides a globally distributed communication framework that is essential for almost all scientific collaboration, including bioinformatics. However, several limits and inadequacies have become apparent, one of which is the inability to programmatically identify locally named objects that may be widely distributed over the network. This shortcoming limits our ability to integrate multiple knowledgebases, each of which gives partial information about a shared domain, as is commonly seen in bioinformatics. Fundamentally, we must be able to solve the following problems to fully integrate web-distributed databases: (a) define the link interfaces formally so that they may be understood programmatically; (b) encapsulate the link interfaces so that they are not addresses, but names; (c) locally specify and control object identifiers while guaranteeing them to be globally unique; (d) describe the object attributes using a formal ontology.

Life Science Identifiers (LSIDs) and the LSID Resolution System (LSRS) form the most useful system developed to date for meeting the above-mentioned requirements for biological databases. The system is based on existing IETF and W3C technologies, with some judicious extensions, and is compatible with Web services and the Semantic Web.

LSIDs are a special form of Uniform Resource Names (URNs). They have their own resolution protocol and are persistent, global, location-independent object names. LSIDs can be used to persistently name resources such as individual proteins or genes, transcripts, experimental data sets, annotations, ontologies, publications, biological knowledgebases or objects within them. The syntax of an LSID is: <LSID> ::= 'urn:' 'lsid:' <Authority ID> ':' <Authority Namespace ID> ':' <Object ID> [':' <Revision ID>]. Once assigned, an LSID is permanent and is never reassigned. As a unique identifier, each LSID specifies exactly one object for all time.
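To make the syntax concrete, here is a minimal parsing sketch in Python. It assumes the colon-separated layout urn:lsid:authority:namespace:object[:revision] and does no validation beyond checking that the fields are present. The names LSID and parse_lsid, and the example.org identifiers, are made up for illustration and do not come from the paper or from any LSID toolkit.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    """The components of an LSID (hypothetical helper type)."""
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text: str) -> LSID:
    """Split an LSID string into its components.

    Assumes no component contains a colon; raises ValueError otherwise.
    """
    parts = text.split(":")
    # 'urn', 'lsid', authority, namespace, object, and an optional revision
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"missing 'urn:lsid:' prefix: {text!r}")
    authority, namespace, object_id = parts[2:5]
    revision = parts[5] if len(parts) == 6 else None
    if not (authority and namespace and object_id) or revision == "":
        raise ValueError(f"empty component in: {text!r}")
    return LSID(authority, namespace, object_id, revision)


# Illustrative identifiers only; example.org is not a real LSID authority.
unrevised = parse_lsid("urn:lsid:example.org:proteins:P12345")
revised = parse_lsid("urn:lsid:example.org:proteins:P12345:2")
```

Here unrevised.revision is None while revised.revision is "2"; all other fields are the same.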

Any organization assigning LSIDs has several responsibilities: (1) It must identify itself with an Authority ID string, which must be globally unique. Typically this is an Internet domain name that the organization owns. (2) It must ensure the uniqueness of the string created from the namespace, object and revision identifications within its own authority's domain.
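The second responsibility can be sketched as a small in-memory registry that refuses to hand out the same identifier twice. This is only an illustration under simple assumptions: the class name LsidAuthority, its assign method, and the example.org authority are hypothetical, and a real authority would need persistent storage rather than a dictionary.

```python
class LsidAuthority:
    """Toy naming authority that never reassigns an LSID (illustrative only)."""

    def __init__(self, authority_id):
        # Typically an Internet domain name owned by the organization.
        self.authority_id = authority_id
        self._assigned = {}  # LSID string -> the object it names

    def assign(self, namespace, object_id, obj, revision=None):
        """Build an LSID for obj and record it; reject any reuse."""
        parts = ["urn", "lsid", self.authority_id, namespace, object_id]
        if revision is not None:
            parts.append(revision)
        lsid = ":".join(parts)
        if lsid in self._assigned:
            raise ValueError(f"{lsid} is already assigned and cannot be reused")
        self._assigned[lsid] = obj
        return lsid


authority = LsidAuthority("example.org")
first = authority.assign("proteins", "P12345", {"sequence": "MKT"})
# A changed object gets a new revision instead of taking over the old name.
second = authority.assign("proteins", "P12345", {"sequence": "MKTA"}, revision="2")
```

Because the authority ID is globally unique and the rest of the string is unique within the authority, the full LSID is globally unique without any central coordination.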

Resolution of LSIDs works as follows: (1) A client has an LSID and either knows the appropriate resolution service or contacts the LSID Resolution Discovery Service to find it. (2) The client contacts the resolution service to learn which services are available for the LSID, where they are located, how to call them, etc. (3) With this information, the client calls the desired data retrieval services and submits its requests. (4) The service executes the requests and sends the results back to the client.
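The four steps above can be mimicked with plain in-memory objects. The sketch below involves no network calls and does not use the actual LSID protocol or any real toolkit API: the discovery step is a dictionary keyed by authority ID, and ResolutionService, resolve, and the "metadata" service name are all hypothetical stand-ins.

```python
from typing import Callable, Dict


class ResolutionService:
    """Stand-in for an authority's resolution service (illustrative only)."""

    def __init__(self) -> None:
        # LSID -> {service name -> callable that returns data for that LSID}
        self._services: Dict[str, Dict[str, Callable[[str], str]]] = {}

    def register(self, lsid: str, name: str, service: Callable[[str], str]) -> None:
        self._services.setdefault(lsid, {})[name] = service

    def get_available_services(self, lsid: str) -> Dict[str, Callable[[str], str]]:
        return dict(self._services.get(lsid, {}))


def resolve(lsid: str, discovery: Dict[str, ResolutionService], wanted: str) -> str:
    authority_id = lsid.split(":")[2]
    resolver = discovery[authority_id]                 # step 1: find the resolution service
    services = resolver.get_available_services(lsid)   # step 2: ask what is offered for this LSID
    data_service = services[wanted]                    # step 3: pick the service and call it
    return data_service(lsid)                          # step 4: the service returns its result


resolver = ResolutionService()
resolver.register(
    "urn:lsid:example.org:proteins:P12345",
    "metadata",
    lambda lsid: f"metadata for {lsid}",
)
discovery = {"example.org": resolver}
result = resolve("urn:lsid:example.org:proteins:P12345", discovery, "metadata")
```

The point of the indirection is that the client only ever holds a name: where the data lives is decided at resolution time, so an authority can move or add services without changing any LSID.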

The current LSID specification allows LSID services to be accessed using simple protocols such as HTTP and FTP. It also uses important Web services standards such as SOAP and WSDL to allow Web-service-like access. The current open source implementations of LSID Resolution Discovery make use of DNS.

Definitions

Ontology: An ontology is the specification of a conceptualization of a knowledge domain. It is a controlled vocabulary that describes objects and the relations between them in a formal way, and has a grammar for using the vocabulary terms to express something meaningful within a specified domain of interest. The vocabulary is used to make queries and assertions.
Gene Ontology: The Gene Ontology, or GO, is a trio of controlled vocabularies that are being developed to aid the description of the molecular functions of gene products, their placement in and as cellular components, and their participation in biological processes. Terms in each of the vocabularies are related to one another within a vocabulary in a polyhierarchical (or directed acyclic graph) manner; terms are mutually exclusive across the three vocabularies.
