Thursday, October 23, 2008

From XML to RDF: how semantic web technologies will change the design of 'omic' standards

Citation: Xiaoshu Wang, Robert Gorlitsky, Jonas S Almeida. From XML to RDF: how semantic web technologies will change the design of 'omic' standards. In Nature Biotechnology, Vol 23, No 9, pp 1099-1103, Sep 2005.
Link: NCBI PubMed

Summary

Developing a data standard addresses two major concerns: (1) content - what should be standardized, and (2) methodology - how the standard should be formatted. A data standard is more than just a medium for uniform data representation. By laying out the overall structure of relationships among the encoded data, a data standard effectively defines a schema for a particular area of domain knowledge. The purpose of this article is to discuss how these issues affect the choice of methodology for establishing data standards. More specifically, it discusses why there is a need to go beyond XML as the standard technology for representing biological data.

Examples of XML documents show that compatibility cannot be achieved by XML alone, because the language can be used in more than one way to encode the same information. Different parties can agree on a single XML format, but in most scientific disciplines data relationships are bound to change as new experimental methods are developed. When such a change occurs, the standard must be adjusted to reflect the newly established relationships. Unfortunately, this is very difficult to achieve using XML.
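The point about multiple encodings can be illustrated with a minimal Python sketch. The element and attribute names (`gene`, `name`, `expression`) and the values are hypothetical and are not taken from the article; they only show how a reader written against one layout silently fails on another layout that carries the same information.

```python
import xml.etree.ElementTree as ET

# Two hypothetical encodings of the same fact (names and values are assumptions):
# the gene "BRCA1" has an expression level of 2.5.
DOC_A = '<gene name="BRCA1" expression="2.5"/>'
DOC_B = '<gene><name>BRCA1</name><expression>2.5</expression></gene>'


def read_attribute_style(xml_text):
    """Reader written against the attribute-based layout."""
    root = ET.fromstring(xml_text)
    # Element.get returns None when the attribute is absent.
    return root.get("name"), root.get("expression")


def read_element_style(xml_text):
    """Reader written against the nested-element layout."""
    root = ET.fromstring(xml_text)
    # Element.findtext returns None when no matching child exists.
    return root.findtext("name"), root.findtext("expression")


if __name__ == "__main__":
    print(read_attribute_style(DOC_A))  # matches the layout it was written for
    print(read_attribute_style(DOC_B))  # same information, but nothing is found
    print(read_element_style(DOC_B))
```

Both documents are well-formed XML, yet each reader only understands one of them, so well-formedness alone does not make the two producers compatible.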

The problems with XML-based standards are:
  • XML restricts what type of data can and cannot appear in what places. Although this allows an XML-encoded message to be validated easily, it makes it harder to extend the vocabulary of the system beyond the original specification. Techniques for allowing extensibility exist in XML, but they make schema design harder to manage.
  • Since no rules can be specified to manage the extension of the system, separately developed applications are very likely to develop different dialects of extension. Also, since XML-based applications rely on the correct document structure to operate properly, any change in structure may break the application.
  • Newly extended features can be grouped into a new namespace, but they are unlikely to be structurally cohesive with the existing structure. This brings the problem of schema integration along with it.
The problems with XML originate from the limited expressiveness of the language. XML, designed as a language for message encoding, is self-descriptive only about the following structural relationships: containment, adjacency, co-occurrence, attribute and opaque reference. These relationships are useful for serialization, but are not optimal for describing the objects of a problem domain.

Meaningful data exchange involves communication both at the message level (data encoded in a standard format so that applications know how to convert bytes into objects) and at the algorithmic level (relationships between objects specified explicitly so that applications can process the data accordingly). XML is designed to standardize message-level communication. What is missing is a description of the semantic relationships between nested content holders, which is required to invoke the appropriate algorithms.

Solving this interoperability issue requires a knowledge representation technology that can explicitly describe data semantics. The foundational Semantic Web technology is RDF. The semantics of an RDF model are obtained by reference to RDFS and OWL, which are layered on top of RDF to offer support for inference and axioms.
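As a rough illustration of what "support for inference" means, the sketch below applies two standard RDFS entailment rules (transitivity of rdfs:subClassOf, and propagation of rdf:type to superclasses) to a handful of triples. It uses only plain Python tuples rather than an RDF library, and the `ex:` class and instance names are hypothetical; the prefixed strings stand in for full URIs.

```python
# Prefixed names are shorthand for full URIs; the ex: terms are hypothetical.
RDF_TYPE = "rdf:type"
RDFS_SUBCLASS = "rdfs:subClassOf"


def rdfs_closure(triples):
    """Return the input triples plus those inferred by two RDFS rules.

    Rule A (subclass transitivity): C subClassOf D, D subClassOf E => C subClassOf E
    Rule B (type propagation):      x type C, C subClassOf D       => x type D
    """
    inferred = set(triples)
    while True:
        new = set()
        for s, p, o in inferred:
            if p != RDFS_SUBCLASS:
                continue
            for s2, p2, o2 in inferred:
                if p2 == RDFS_SUBCLASS and s2 == o:
                    new.add((s, RDFS_SUBCLASS, o2))  # Rule A
                if p2 == RDF_TYPE and o2 == s:
                    new.add((s2, RDF_TYPE, o))       # Rule B
        if new <= inferred:  # fixed point reached: nothing new was derived
            return inferred
        inferred |= new


DATA = {
    ("ex:sample42", RDF_TYPE, "ex:MicroarrayExperiment"),
    ("ex:MicroarrayExperiment", RDFS_SUBCLASS, "ex:GeneExpressionExperiment"),
    ("ex:GeneExpressionExperiment", RDFS_SUBCLASS, "ex:Experiment"),
}

if __name__ == "__main__":
    for triple in sorted(rdfs_closure(DATA) - DATA):
        print(triple)
```

An application that only knows how to process `ex:Experiment` can therefore handle `ex:sample42` without any code specific to microarrays, because the relationship is stated in the data rather than implied by document structure.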

Comparing RDF to XML reveals three important differences:
  • Unlike the namespaces in XML, which ultimately are unique character strings for grouping related concepts, the ontology URI in RDF must be retrievable.
  • Semantic relationships are described explicitly in RDF.
  • The unique identifier attribute used in XML is no longer needed. Since RDF uses URIs, the uniqueness is ensured globally instead of just within the document.
Three distinct features of RDF make it very helpful to omic sciences:
  1. The data structure for RDF is a directed labeled graph (DLG). Adding nodes and edges does not change the structure of any existing subgraph, so RDF does not suffer from the problem of unpredictable, extension-induced changes.
  2. RDF has an open world assumption in that it 'allows anyone to make statements about any resource'. It is monotonic in that new statements do not negate the validity of previous assertions, making it particularly suitable for an academic environment, where consensus and disagreement usefully coexist and need to be formally recorded.
  3. All RDF terms share a global naming scheme in URI, making distributed data and ontologies possible.
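These three properties can be sketched with a graph modeled as a plain Python set of (subject, predicate, object) triples. This is a simplification that ignores blank nodes and typed literals, and every `http://example.org/` URI below is hypothetical.

```python
# All URIs are hypothetical placeholders used for illustration only.
EX = "http://example.org/"
GENE = EX + "gene/BRCA1"
LOCATED_ON = EX + "vocab/locatedOn"
EXPRESSION = EX + "vocab/expressionLevel"

# Two groups describe the same gene independently, using the same global URI.
lab_a = {(GENE, LOCATED_ON, EX + "chromosome/17")}
lab_b = {(GENE, EXPRESSION, "2.5")}


def merge_graphs(*graphs):
    """Merging triple sets is a set union: no existing edge is altered."""
    merged = set()
    for graph in graphs:
        merged |= graph
    return merged


def objects(graph, subject, predicate):
    """Return every object asserted for the given subject and predicate."""
    return {o for s, p, o in graph if s == subject and p == predicate}


if __name__ == "__main__":
    merged = merge_graphs(lab_a, lab_b)
    # A query that worked on lab_a alone gives the same answer after the merge.
    print(objects(lab_a, GENE, LOCATED_ON) == objects(merged, GENE, LOCATED_ON))
    # Because both labs used the same URI, their statements attach to one node.
    print(sorted(p for s, p, o in merged if s == GENE))
```

Because merging only ever adds triples, code written against `lab_a` keeps working on the merged graph, which is the property that structure-dependent XML readers lack.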
Just like any other emerging technology, RDF is not without issues.
  • One particular problem is the vagueness of the definition of 'resource'. When URLs, instead of URIs, are used to represent resources of multiple dimensionalities, an identity crisis occurs. In practice, the problem can be conveniently avoided by using LSIDs, which couple a naming scheme with a data-retrieval framework.
  • Many competing ontologies may emerge at the initial stage of ontology development for a particular domain. But as the field matures, ontology usage is expected to converge on the most comprehensive subset. The use of URIs helps in this regard: by assigning each concept a globally unique URI, RDF becomes immune to the dialects that vex XML. Whether an ontology becomes a standard is usually determined by its usefulness to a community.
