Showing posts with label biomedicine. Show all posts

Saturday, February 13, 2010

Ontologies: formalising biological knowledge for bioinformatics

Citation: Jonathan Bard. Ontologies: formalising biological knowledge for bioinformatics. Bioessays, May 2003, 25(5):501-506.
Link: NCBI PubMed

Summary

Ontologies are becoming increasingly important in bioinformatics because they can be linked to the information in databases, and the knowledge they encode can then be used to query those databases. This direct connection allows faster searching and less ambiguity than string-based searches. In addition, much biological data contains hierarchical relationships, which relational databases do not handle well. The result is rich ontologies that are independent of their associated databases and linked to them through term IDs.
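As a rough illustration of linking an ontology to database records through term IDs, the sketch below selects every record annotated with a term or with any of its descendants. The term IDs, labels, records, and function names are all made up for the example and are not taken from the paper.

```python
# Hypothetical ontology: term ID -> (label, list of parent term IDs).
ONTOLOGY = {
    "T:0001": ("cell part", []),
    "T:0002": ("organelle", ["T:0001"]),
    "T:0003": ("mitochondrion", ["T:0002"]),
    "T:0004": ("nucleus", ["T:0002"]),
}

# Hypothetical database rows: (record name, annotated term ID).
RECORDS = [
    ("protein_a", "T:0003"),
    ("protein_b", "T:0004"),
    ("protein_c", "T:0001"),
]


def descendants(term_id, ontology):
    """Return term_id plus every term that reaches it through parent links."""
    found = {term_id}
    changed = True
    while changed:
        changed = False
        for tid, (_label, parents) in ontology.items():
            if tid not in found and any(p in found for p in parents):
                found.add(tid)
                changed = True
    return found


def query_by_term(term_id, ontology, records):
    """Select records annotated with term_id or any of its descendants."""
    wanted = descendants(term_id, ontology)
    return [name for name, tid in records if tid in wanted]
```

Querying for the hypothetical "organelle" term (`T:0002`) returns the records annotated with "mitochondrion" and "nucleus", which a plain string match on "organelle" would miss.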

The Gene Ontology (GO) is used to integrate genetic data about gene products with our knowledge of their properties. The GO catalogues gene products in three essentially non-overlapping ways: by their location within cells, the processes to which they contribute, and the functions they fulfill.

Tuesday, April 21, 2009

Requirements of Phylogenetic Databases

Citation: Luay Nakhleh, Daniel Miranker, Francois Barbancon, William H. Piel, Michael Donoghue. Requirements of Phylogenetic Databases, Third IEEE International Symposium on Bioinformatics and Bioengineering, vol. 0, no. 0, pp. 141, 2003.
Link: IEEE CS Digital Library

Summary

This work examines the impact of phylogenetic databases on the need for and use of phylogenetic data. It evaluates the drawbacks of the unnormalized Newick format used in existing databases such as TreeBASE, and proposes a normalized data model, motivated by a list of applications and queries that a biologist may wish to see integrated into a phylogenetic DBMS.

There are two major drawbacks of the unnormalized Newick format:
  • The database cannot directly support queries concerning the relationships between the taxa and the structure of the phylogeny.
  • Some processes (e.g. hybridization, horizontal gene transfer) result in graph structures, which are not supported by the Newick format.
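To make the first drawback concrete, here is a minimal sketch of what normalizing a tree could look like: a Newick string is unpacked into one row per node, after which a structural question (the most recent common ancestor of two taxa) becomes a simple lookup over rows. The row layout and function names are assumptions for illustration, not the data model proposed in the paper, and the parser handles only plain labels without branch lengths or quoting.

```python
def newick_to_rows(newick):
    """Parse a simple Newick string into rows of (node_id, parent_id, label)."""
    text = newick.strip().rstrip(";")
    rows = []
    pos = 0

    def parse_node(parent_id):
        nonlocal pos
        node_id = len(rows)
        rows.append([node_id, parent_id, ""])
        if pos < len(text) and text[pos] == "(":
            pos += 1  # consume "(" and read the comma-separated children
            while True:
                parse_node(node_id)
                if pos >= len(text):
                    raise ValueError("unbalanced parentheses")
                if text[pos] == ",":
                    pos += 1
                elif text[pos] == ")":
                    pos += 1
                    break
                else:
                    raise ValueError(f"unexpected character at position {pos}")
        # The label (possibly empty) follows the optional child list.
        start = pos
        while pos < len(text) and text[pos] not in "(),":
            pos += 1
        rows[node_id][2] = text[start:pos].strip()

    parse_node(None)
    return [tuple(row) for row in rows]


def path_to_root(node_id, rows):
    """Return node_id followed by its ancestors, ending at the root."""
    parent = {nid: pid for nid, pid, _label in rows}
    path = []
    current = node_id
    while current is not None:
        path.append(current)
        current = parent[current]
    return path


def most_recent_common_ancestor(label_a, label_b, rows):
    """Return the node_id of the deepest node that is an ancestor of both taxa."""
    by_label = {label: nid for nid, _pid, label in rows if label}
    on_path_b = set(path_to_root(by_label[label_b], rows))
    for nid in path_to_root(by_label[label_a], rows):
        if nid in on_path_b:
            return nid
    return None
```

For the second drawback, a table with a single `parent_id` column would not be enough either; a network with hybridization events would presumably need a separate edge table so that a node can have more than one parent.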

The authors identify six categories of users of phylogenetic databases: (1) casual users, (2) visualization, (3) study development, (4) super-tree algorithms, (5) simulation studies, and (6) comparative genomics.

Definitions

Phylogeny: A phylogeny is a rooted, leaf-labeled tree whose leaves represent a set of operational taxa and whose internal nodes represent the (hypothetical) ancestral taxa. A phylogeny on a set of taxa represents the evolutionary history of the taxa from their most recent common ancestor (at the root of the tree).
Tree of Life: A phylogenetic tree that represents the evolutionary history of all species in the world. It is expected that, when finished, the Tree of Life will contain millions of species.

Thursday, October 23, 2008

Globally distributed object identification for biological knowledgebases

Citation: Tim Clark, Sean Martin, Ted Liefeld. Globally distributed object identification for biological knowledgebases. Briefings in Bioinformatics 2004 5(1):59-70; doi:10.1093/bib/5.1.59
Link: Oxford Journals

Summary

The Web provides a globally distributed communication framework that is essential for almost all scientific collaboration, including bioinformatics. However, several limits and inadequacies have become apparent, one of which is the inability to programmatically identify locally named objects that may be widely distributed over the network. This shortcoming limits our ability to integrate multiple knowledgebases, each of which gives partial information about a shared domain, as is commonly seen in bioinformatics. Fundamentally, we must be able to solve the following problems to fully integrate web-distributed databases: (a) define the link interfaces formally so that they may be understood programmatically; (b) encapsulate the link interfaces so that they are not addresses, but names; (c) locally specify and control object identifiers while guaranteeing them to be globally unique; (d) describe the object attributes using a formal ontology.

Life Science Identifiers (LSIDs) and the LSID Resolution System (LSRS) form the most useful system developed to date for meeting the above-mentioned requirements for biological databases. This system is based on existing IETF and W3C technologies, with some judicious extension, while being compatible with Web services and the Semantic Web.

LSIDs are a special form of Uniform Resource Names (URNs). They have their own resolution protocol, and are persistent, global, location-independent object names. LSIDs can be used to persistently name resources such as individual proteins or genes, transcripts, experimental data sets, annotations, ontologies, publications, biological knowledgebases or objects within them. The syntax of an LSID is: <LSID> ::= 'urn:' 'lsid:' <Authority ID> ':' <Authority Namespace ID> ':' <Object ID> [':' <Revision ID>]. Once assigned, an LSID is permanent and is never reassigned. As unique identifiers, LSIDs each specify exactly one object for all time.
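The syntax can be checked mechanically. The sketch below assumes the form urn:lsid:authority:namespace:object with an optional trailing revision, and assumes that no component contains a colon; the class name, function name, and example identifiers are invented for illustration and are not part of any LSID library.

```python
import re
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


# Assumed form: urn:lsid:<authority>:<namespace>:<object>[:<revision>]
_LSID_PATTERN = re.compile(
    r"^urn:lsid:([^:]+):([^:]+):([^:]+)(?::([^:]+))?$",
    re.IGNORECASE,
)


def parse_lsid(text):
    """Split an LSID string into its components, or raise ValueError."""
    match = _LSID_PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a well-formed LSID: {text!r}")
    authority, namespace, object_id, revision = match.groups()
    return LSID(authority, namespace, object_id, revision)
```

For example, `parse_lsid("urn:lsid:example.org:proteins:P12345:2")` yields the authority `example.org`, the namespace `proteins`, the object `P12345`, and the revision `2`; here `example.org` stands in for an organization's domain name.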

Any organization assigning LSIDs has several responsibilities: (1) It must identify itself with an Authority ID string, which must be globally unique. Typically it is an Internet domain name it owns. (2) It must ensure the uniqueness of the string created from the namespace, object and revision identifications within any given authority's domain.

Resolution of LSIDs works as follows: (1) A client has an LSID and either knows the appropriate resolution service or contacts the LSID Resolution Discovery Service to find it. (2) The client contacts the resolution service to learn which services are available for the LSID, where they are located, how to call them, and so on. (3) With this information, the client calls the desired data retrieval services to submit requests. (4) The service executes the requests and sends the results back to the client.
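The four steps can be traced with a toy, in-memory model. Everything in this sketch is a stand-in: the dictionaries replace what would be network calls in a real deployment, and the service names, the LSID, and the stored value are invented.

```python
# Hypothetical discovery table: authority ID -> resolution service name.
DISCOVERY = {"example.org": "resolver-1"}

# Hypothetical resolution services: for each LSID, the data retrieval
# services that can serve it.
RESOLVERS = {
    "resolver-1": {
        "urn:lsid:example.org:proteins:P12345": ["data-service-a"],
    },
}

# Hypothetical data retrieval services: LSID -> stored bytes.
DATA_SERVICES = {
    "data-service-a": {
        "urn:lsid:example.org:proteins:P12345": b"example protein record",
    },
}


def resolve(lsid):
    """Walk the discovery, resolution, and retrieval steps for one LSID."""
    authority = lsid.split(":")[2]
    # Step 1: find the resolution service responsible for this authority.
    resolver_name = DISCOVERY[authority]
    # Step 2: ask the resolution service which services exist for the LSID.
    service_names = RESOLVERS[resolver_name][lsid]
    # Steps 3 and 4: call a data retrieval service and return its result.
    return DATA_SERVICES[service_names[0]][lsid]
```

Because the client only ever holds the name, the data could move to a different service without the LSID changing; only the tables consulted in steps 1 and 2 would need updating.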

The current LSID specification allows LSID services to be accessed using simple protocols such as HTTP and FTP. It also uses important Web services standards such as SOAP and WSDL to allow Web-service-like access. The current open source implementations of LSID Resolution Discovery make use of DNS.

Definitions

Ontology: An ontology is the specification of a conceptualization of a knowledge domain. It is a controlled vocabulary that describes objects and the relations between them in a formal way, and has a grammar for using the vocabulary terms to express something meaningful within a specified domain of interest. The vocabulary is used to make queries and assertions.
Gene Ontology: The Gene Ontology, or GO, is a trio of controlled vocabularies that are being developed to aid the description of the molecular functions of gene products, their placement in and as cellular components, and their participation in biological processes. Terms in each of the vocabularies are related to one another within a vocabulary in a polyhierarchical (or directed acyclic graph) manner; terms are mutually exclusive across the three vocabularies.

Ontologies in Biology: Design, Applications and Future Challenges

Citation: Jonathan B L Bard, Seung Y Rhee. Ontologies in Biology: Design, Applications and Future Challenges. In Nature Reviews: Genetics, 2004.
Link: Nature Reviews
Status: Incomplete

Summary

Until recently, the most important task of bioinformatics was thought to be the storage, retrieval and analysis of molecular data. However, as experimental technologies move from producing relatively simple data to more complex data, we need comparable advances in bioinformatics to manage and relate these data. There is also a great deal of sophisticated biological knowledge, often hierarchical in nature, that needs to be integrated with other data. One way to represent such biological knowledge is by using ontologies. The resulting biological ontologies are formal representations of areas of knowledge in which the essential terms are combined with structuring rules that describe the relationship between the terms. Knowledge that is structured within a biological ontology can then be linked to the molecular databases.

For any ontology to be of public value, it has to be widely disseminated and accepted by the field that it aims to summarize. Sociological factors are important in ontology production and acceptance, and a strong community involvement is also crucial.

Definitions

Phenotype: The observable traits or characteristics of an organism, for example hair color, weight, or the presence or absence of a disease. Phenotypic traits are not necessarily genetic.
Systematics: An umbrella term for the processes used to describe species. It unites three disciplines: the description of species (identification), the naming of species (taxonomy), and the description of the relationships among and between taxa (phylogenetics).

QIS: A Framework for Biomedical Database Federation

Citation: Luis Marenco, Tzuu-Yi Wang, Gordon Shepherd, Perry L Miller, Prakash Nadkarni. QIS: A Framework for Biomedical Database Federation. In Journal of the American Medical Informatics Association, Vol 11 No 6, pp 523-534, Nov/Dec 2004.
Link: NCBI PubMed

Summary

Query Integrator System (QIS) is a database mediator framework intended to address robust data integration from continuously changing heterogeneous data sources in the biosciences. This paper outlines barriers to interoperability of bioscience databases, summarizes previous interoperation approaches, and describes QIS.

The QIS architecture is based on a set of distributed network-based servers (data source servers, integration servers, and ontology servers) that exchange metadata as well as mappings of both metadata and data elements to elements in an ontology. When the schema of a data source changes, QIS determines the differences between metadata versions and, because stored queries are decomposed into parts, can recover the parts of a query that are unaffected.
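The idea of partial query recovery can be sketched as follows. This is only an illustration of the general principle under simplifying assumptions (a schema is a mapping from table name to column names, and a stored query is already decomposed into named parts that each list the columns they need); it is not the QIS algorithm, and all names are hypothetical.

```python
def removed_columns(old_schema, new_schema):
    """Return the (table, column) pairs present in the old schema only."""
    old_cols = {(t, c) for t, cols in old_schema.items() for c in cols}
    new_cols = {(t, c) for t, cols in new_schema.items() for c in cols}
    return old_cols - new_cols


def recover(stored_query, removed):
    """Split a decomposed query into still-usable and broken part names."""
    usable, broken = [], []
    for part in stored_query:
        needed = {(part["table"], column) for column in part["columns"]}
        if needed & removed:
            broken.append(part["name"])
        else:
            usable.append(part["name"])
    return usable, broken


# Hypothetical schema versions of one data source.
OLD_SCHEMA = {
    "neuron": ["id", "name", "region"],
    "receptor": ["id", "neuron_id", "type"],
}
NEW_SCHEMA = {
    "neuron": ["id", "name"],  # the "region" column was dropped
    "receptor": ["id", "neuron_id", "type"],
}

# Hypothetical stored query, decomposed into two parts.
STORED_QUERY = [
    {"name": "neurons_by_region", "table": "neuron", "columns": ["id", "region"]},
    {"name": "receptors_for_neuron", "table": "receptor", "columns": ["neuron_id", "type"]},
]
```

Calling `recover(STORED_QUERY, removed_columns(OLD_SCHEMA, NEW_SCHEMA))` reports that the receptor part still runs while the part that relied on the dropped column needs attention, instead of the whole federated query failing.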

Bioscience schemas evolve significantly and rapidly because of scientific progress, changing research goals and the discovery of better data representation techniques. As a result, federation of bioscience databases faces these major barriers:
  • Databases do not usually accept arbitrary (ad hoc) SQL, for performance and safety reasons. Therefore, predefined parametrized queries are generally used. Any alteration to the database schema can cause these queries to break.
  • Interoperation between databases becomes difficult in the presence of synonymy and polysemy. Addressing these issues requires a controlled vocabulary, with data and metadata mapped to concepts in that vocabulary.
  • Federated search mechanisms must appropriately exclude data that are still preliminary and not available for public access beyond the research group creating an individual database.

Major existing approaches to database federation are the following. The global schema approach is based on an agreed-upon and infrequently revised standard definition of domain-specific data types and classes and their interrelationships. Anyone using this standard must not deviate from it. Therefore, its applicability in rapidly evolving areas seems doubtful. In contrast, mediator systems allow a single query to be translated into the language recognized by heterogeneous databases, extracting their information and integrating the results in a single dataset. Managing such systems becomes enormously complex in the presence of autonomous and frequently changing heterogeneous databases. In the bioscience domain, we would like to address data mediation and schema adaptation together in order to achieve robust, evolvable data integration.

The objectives underlying the creation of QIS are: (1) to integrate data sources, (2) to devise a scalable approach to schema evolution in a loosely coupled database federation, (3) to devise robust mechanisms for metadata exchange, (4) to address the breakage of federated queries by automatically detecting schema evolution, (5) to support interoperation with a separation between public and private data, (6) to facilitate recording of system semantics, and (7) to devise an open-source, low-cost, lightweight system to facilitate research work.

QIS uses a distributed architecture that is composed of three main functional units: integrator servers (ISs), data source servers (DSSs), and the ontology server (OS). These units form the system's middle tier, connecting data consumers with data providers and knowledge sources.

Definitions

Synonymy: It is the semantic relation that holds between two words that can, in a given context, express the same meaning.
Polysemy: It is defined as the ambiguity of an individual word or phrase that can be used, in different contexts, to express two or more different meanings.
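A small sketch may help show how a controlled vocabulary deals with both problems: synonyms collapse onto one concept ID, while a polysemous term needs a context to pick the intended concept. The concept IDs, terms, and context labels below are invented for illustration.

```python
# Hypothetical controlled vocabulary: concept ID -> preferred label.
CONCEPTS = {
    "C1": "neoplasm",
    "C2": "common cold",
    "C3": "low temperature",
}

# Synonymy: several terms map to the same concept.
# Polysemy: one term maps to several concepts, selected by context.
TERM_TO_CONCEPTS = {
    "tumor": {None: "C1"},
    "tumour": {None: "C1"},
    "neoplasm": {None: "C1"},
    "cold": {"disease": "C2", "environment": "C3"},
}


def to_concept(term, context=None):
    """Map a free-text term to a concept ID, using context if it is ambiguous."""
    senses = TERM_TO_CONCEPTS[term.lower()]
    if len(senses) == 1:
        return next(iter(senses.values()))
    if context not in senses:
        raise ValueError(f"{term!r} is ambiguous; a known context is required")
    return senses[context]
```

Here "tumor" and "tumour" both resolve to `C1`, whereas "cold" resolves to `C2` or `C3` depending on the context supplied, and is rejected when no context is given.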