Thursday, October 23, 2008

QIS: A Framework for Biomedical Database Federation

Citation: Luis Marenco, Tzuu-Yi Wang, Gordon Shepherd, Perry L Miller, Prakash Nadkarni. QIS: A Framework for Biomedical Database Federation. In Journal of the American Medical Informatics Association, Vol 11 No 6, pp 523-534, Nov/Dec 2004.
Link: NCBI PubMed

Summary

The Query Integrator System (QIS) is a database mediator framework intended to provide robust data integration across continuously changing, heterogeneous data sources in the biosciences. This paper outlines barriers to interoperability of bioscience databases, summarizes previous interoperation approaches, and describes QIS.

The QIS architecture is based on a set of distributed, network-based servers (data source servers, integrator servers, and ontology servers) that exchange metadata as well as mappings of both metadata and data elements to elements in an ontology. When the schema of a data source changes, the differences between metadata versions are determined and, coupled with the decomposition of stored queries, used as the basis for partial query recovery.
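
As a rough illustration of partial query recovery, the sketch below diffs two versions of a source's metadata and reports which parts of a decomposed stored query still resolve. Everything in it is an assumption made for illustration: the function names `schema_diff` and `recover`, the representation of a schema as a mapping from table names to column sets, and the example tables are hypothetical and are not taken from the paper or from QIS itself.

```python
from typing import Dict, List, Set, Tuple

Schema = Dict[str, Set[str]]          # table name -> set of column names
SubQuery = Tuple[str, str, Set[str]]  # (label, table, columns the sub-query needs)


def schema_diff(old: Schema, new: Schema) -> Dict[str, Set[str]]:
    """Return, per table, the columns present in `old` but missing in `new`."""
    removed: Dict[str, Set[str]] = {}
    for table, cols in old.items():
        # A table that vanished entirely loses all of its columns.
        missing = cols - new.get(table, set())
        if missing:
            removed[table] = missing
    return removed


def recover(subqueries: List[SubQuery],
            removed: Dict[str, Set[str]]) -> Tuple[List[str], List[str]]:
    """Split sub-queries into those untouched by the diff and those broken by it."""
    usable, broken = [], []
    for label, table, cols in subqueries:
        if cols & removed.get(table, set()):
            broken.append(label)
        else:
            usable.append(label)
    return usable, broken


# Hypothetical metadata versions: the "region" column was renamed.
old = {"neuron": {"id", "name", "region"},
       "receptor": {"id", "neuron_id", "type"}}
new = {"neuron": {"id", "name", "brain_region"},
       "receptor": {"id", "neuron_id", "type"}}

# A stored query, decomposed into two hypothetical sub-queries.
stored_query = [
    ("neurons_by_region", "neuron", {"id", "region"}),
    ("receptors_of_neuron", "receptor", {"neuron_id", "type"}),
]

removed = schema_diff(old, new)
usable, broken = recover(stored_query, removed)
print("still usable:", usable)
print("needs repair:", broken)
```

Because the stored query is kept in decomposed form, only the sub-query that touches the renamed column has to be repaired; the rest can continue to run.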

Bioscience schemas evolve significantly and rapidly because of scientific progress, changing research goals, and the discovery of better data representation techniques. As a result, federation of bioscience databases faces the following major barriers:
  • Databases do not usually accept arbitrary (ad hoc) SQL, for performance and safety reasons, so predefined parameterized queries are generally used instead. Any alteration to the database schema can cause these queries to break.
  • Interoperation between databases becomes difficult in the presence of synonymy and polysemy. Handling these issues requires a controlled vocabulary, with both data and metadata mapped to concepts in that vocabulary.
  • Federated search mechanisms must appropriately exclude data that are still preliminary and not available for public access beyond the research group creating an individual database.
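
The second barrier can be made concrete with a small sketch of mapping source-specific terms to concepts in a controlled vocabulary. The source names, terms, and concept identifiers below are invented for illustration, as are the helpers `to_concept` and `same_concept`; the paper's actual ontology and mapping format are not shown here.

```python
from typing import Dict, Optional, Tuple

# (source, local term) -> concept identifier in a hypothetical controlled vocabulary
TERM_TO_CONCEPT: Dict[Tuple[str, str], str] = {
    # Synonymy: different terms, same concept.
    ("db_a", "nerve cell"): "C:neuron",
    ("db_b", "neuron"): "C:neuron",
    # Polysemy: same term, different concepts depending on the source.
    ("db_a", "channel"): "C:ion_channel",
    ("db_c", "channel"): "C:imaging_channel",
}


def to_concept(source: str, term: str) -> Optional[str]:
    """Look up the concept for a term as used by a given source, if mapped."""
    return TERM_TO_CONCEPT.get((source, term.lower()))


def same_concept(a: Tuple[str, str], b: Tuple[str, str]) -> bool:
    """True only if both (source, term) pairs map to the same known concept."""
    ca, cb = to_concept(*a), to_concept(*b)
    return ca is not None and ca == cb


print(same_concept(("db_a", "nerve cell"), ("db_b", "neuron")))
print(same_concept(("db_a", "channel"), ("db_c", "channel")))
```

Keying the mapping on the source as well as the term is what lets the same word resolve to different concepts in different databases.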

Two major approaches to database federation already exist. The global schema approach is based on an agreed-upon, infrequently revised standard definition of domain-specific data types and classes and their interrelationships. Anyone using this standard must not deviate from it, so its applicability in rapidly evolving areas seems doubtful. In contrast, mediator systems allow a single query to be translated into the languages recognized by heterogeneous databases, extracting their information and integrating the results into a single dataset. Managing such systems becomes enormously complex when the underlying databases are autonomous and change frequently. In the bioscience domain, the goal is to address data mediation and schema adaptation together in order to achieve robust, evolvable data integration.
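
The mediator idea described above can be sketched in a few lines: one query, phrased in common field names, is translated into each source's local field names, and the results are merged into a single list. This is a minimal sketch under stated assumptions, not how QIS is implemented; the `Source` class, the `mediate` function, the field maps, and the sample rows are all hypothetical.

```python
from typing import Dict, List

Record = Dict[str, str]


class Source:
    """An in-memory stand-in for one autonomous database (hypothetical)."""

    def __init__(self, name: str, rows: List[Record], field_map: Dict[str, str]):
        self.name = name
        self.rows = rows
        self.field_map = field_map  # common field name -> local field name

    def query(self, field: str, value: str) -> List[Record]:
        local = self.field_map[field]
        inverse = {v: k for k, v in self.field_map.items()}
        results = []
        for row in self.rows:
            if row.get(local) == value:
                # Translate the matching row back into common field names.
                record = {inverse[k]: v for k, v in row.items() if k in inverse}
                record["source"] = self.name
                results.append(record)
        return results


def mediate(sources: List[Source], field: str, value: str) -> List[Record]:
    """Send one query to every source that knows the field and merge the results."""
    merged: List[Record] = []
    for src in sources:
        if field in src.field_map:
            merged.extend(src.query(field, value))
    return merged


db_a = Source(
    "db_a",
    [{"cell_name": "mitral", "area": "olfactory bulb"},
     {"cell_name": "purkinje", "area": "cerebellum"}],
    {"name": "cell_name", "region": "area"},
)
db_b = Source(
    "db_b",
    [{"label": "granule", "brain_region": "olfactory bulb"}],
    {"name": "label", "region": "brain_region"},
)

results = mediate([db_a, db_b], "region", "olfactory bulb")
for record in results:
    print(record)
```

The fragile part is the per-source field map: every schema change in a source invalidates it, which is the maintenance burden the summary attributes to mediator systems.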

The objectives underlying the creation of QIS are: (1) to integrate data sources, (2) to devise a scalable approach to schema evolution in a loosely coupled database federation, (3) to devise robust mechanisms for metadata exchange, (4) to address the breakage of federated queries through automatic detection of schema evolution, (5) to support interoperation with a separation between public and private data, (6) to facilitate the recording of system semantics, and (7) to devise an open-source, low-cost, lightweight system to facilitate research work.

QIS uses a distributed architecture that is composed of three main functional units: integrator servers (ISs), data source servers (DSSs), and the ontology server (OS). These units form the system's middle tier, connecting data consumers with data providers and knowledge sources.

Definitions

Synonymy: The semantic relation that holds between two words that can, in a given context, express the same meaning.
Polysemy: The ambiguity of an individual word or phrase that can be used, in different contexts, to express two or more different meanings.
