Wednesday, June 27, 2001

"Suppose that in order to use the appliances in your kitchen, an electrical engineer needs to tear down the wall and splice in wiring for the appropriate voltage converter."

"Then suppose you're running a restaurant out of that kitchen.

That, according to those in the know, is what it's like to work with life-science information today, where the absence of a common programming language makes coordination between researchers a dicey prospect."

Enter IBM, which is spearheading an effort to write that common language using XML.

"The Interoperable Informatics Infrastructure Consortium (I3C) unveiled its first demonstration of a working protocol Tuesday at the BIO 2001 Conference."

"The XML-driven format allowed the exchange and analysis of sequence data across 10 different organizations' products. I3C views the demonstration as a bridge to the next step to defining components needed for a more general open architecture."

"The new coalition, led by the Biotechnology Industry Organization (BIO), a Washington trade group, plans to spend the next year or so creating a detailed specification for biological data. This specification would be available without fee to any company or scientist that wanted to use it to help organize and mine information."

"The project has been dubbed the Interoperable Informatics Infrastructure Consortium, or I3C."

"Sun Microsystems said Wednesday it would partner with the Biotechnology Industry Organization, the National Cancer Institute, and several commercial bioinformatics vendors to support a collaborative effort to develop an open platform for the life sciences based on Java and XML.

The proposed initiative, temporarily referred to as Life Force or LI4 (Lifescience Informatics Interoperability Infrastructure Initiative), aims to develop an open platform to support data integration and interoperability and to focus the growing number of standards efforts."

"Sun intends to contribute the underlying infrastructure for the open platform, which the company hopes will form the eventual hub for a broad variety of life science computing needs, including bioinformatics, cheminformatics, genomics, proteomics, pharmacogenomics, metabolomics, and clinical informatics."

"Small DNA-laden wafers have transformed biology. Using these DNA chips, geneticists can see which genes are turned on, or expressed, in a cell at a particular time. Such gene expression experiments allow bioscientists to diagnose different diseases, quickly screen thousands of drug candidates for efficacy and safety and even learn the functions of newly discovered genes.

Sharing this information over the Web could lead to an explosion in biological knowledge. But each experiment generates gigabytes of data written in one of several formats, depending on the type of chip used. And with dozens of chips on the market and hundreds of ways to analyze the data, the Web is in danger of becoming a genetic Tower of Babel."

"Companies and academics have begun creating uniform formats for representing gene expression data, designed to work on any computer."

"As more and more genomes are sequenced, it is becoming clear that deciphering the clues latent in these sequences is anything but trivial. In this Techview, Attwood analyzes the current state of the art in sequence-structure-function bioinformatics. She highlights the need for precise terminology, and argues that a holistic view of complex biological systems will be an essential next step for bioinformatics."

"Once the world had a single language and not too many words, but then clarity deteriorated into clamor. Today in the small but prolific world of bioinformatics, another Tower of Babel is rising up, with the miscommunication due as much to the rapid expansion of information as to basic changes in how it is processed. "Horrible problems" crop up as more information is computed on instead of read by a human researcher, according to Ewan Birney, a group leader in the Ensembl genome annotation project at the European Bioinformatics Institute (EBI) in Cambridge, England.

In the early days of bioinformatics, human-readable data exchange formats such as ASN.1, the format adopted for GenBank by the National Center for Biotechnology Information (NCBI) 10 years ago, were the norm. Easily editable with a text utility, ASN.1's syntactic looseness makes it congenial to the human user, but not to the machine, which likes its inputs defined with dictatorial rigidity."

"We may rehearse this fundamental axiom of descriptive markup in terms of a classical SGML polemic: the doubly-delimited information objects in an SGML/XML document are described by markup in a meaningful, self-documenting way through the use of names which are carefully selected by domain experts for element type names, attribute names, and attribute values. This is true of XML in 1998, was true of SGML in 1986, and was true of Brian Reid's Scribe system in 1976. However, of itself, descriptive markup proves to be of limited relevance as a mechanism to enable information interchange at the level of the machine.

As enchanting as it is to contemplate the apparent 'semantic' clarity, flexibility, and extensibility of XML vis-à-vis HTML (e.g., how wonderfully perspicuous XML <bookTitle> seems when compared to HTML <i>), we must reckon with the cold fact that XML does not of itself enable blind interchange or information reuse. XML may help humans predict what information might lie "between the tags" in the case of <trunk> </trunk>, but XML can only help. For an XML processor, <trunk> and <i> and <booktitle> are all equally (and totally) meaningless. Yes, meaningless.

Just like its parent metalanguage (SGML), XML has no formal mechanism to support the declaration of semantic integrity constraints, and XML processors have no means of validating object semantics even if these are declared informally in an XML DTD. XML processors will have no inherent understanding of document object semantics because XML (meta-)markup languages have no predefined application-level processing semantics. XML thus formally governs syntax only - not semantics."
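
A rough illustration of the point in the passage above, using Python's standard-library XML parser. The sample documents and the helper name `parse_summary` are made up for this sketch; they are not from the quoted article. The parser accepts any well-formed element name and treats all of them alike, while a purely syntactic error is rejected.

```python
import xml.etree.ElementTree as ET

# Three documents with identical structure; only the element names differ.
# (Sample data invented for this sketch.)
docs = [
    "<trunk>oak</trunk>",
    "<i>oak</i>",
    "<booktitle>oak</booktitle>",
]


def parse_summary(text):
    """Return (tag, content) if the text is well-formed XML, else None."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return None
    return root.tag, root.text


summaries = [parse_summary(d) for d in docs]

# The parser accepts all three and reports the same content for each.
# Nothing tells it whether "trunk" belongs to a tree, a car, or an elephant.
contents = {content for _, content in summaries}

# Syntax, by contrast, is enforced: a mismatched closing tag is rejected.
malformed = parse_summary("<trunk>oak</i>")

print(summaries)
print(contents)
print(malformed)
```

Any meaning attached to `trunk` or `booktitle` has to come from an agreement outside the parser, which is the gap the consortia in the excerpts here are trying to fill.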

"With XML has come a proliferation of consortia from every industry imaginable to populate structured material with standard terms (see Appendix B). By one estimate, a new industry consortium is founded every week, perhaps one in four of which can collect serious membership dues. Rising in concert are intermediary groups to provide a consistent dictionary in cyberspace, in which each consortium's words are registered and catalogued.

Having come so far with a syntactic standard, XML, will E-commerce and knowledge organization stall out in semantic confusion?"

"How are semantic standards to come about?"

"There is an increasing demand for formalized knowledge on the Web. Several communities (e.g. in bioinformatics and educational media) are getting ready to offer semiformal or formal Web content. XML-based markup languages provide a 'universal' storage and interchange format for such Web-distributed knowledge representation. This tutorial introduces techniques for knowledge markup: we show how to map AI representations (e.g., logics and frames) to XML (incl. RDF and RDF Schema), discuss how to specify XML DTDs and RDF (Schema) descriptions for various representations, survey existing XML extensions for knowledge bases/ontologies, deal with the acquisition and processing of such representations, and detail selected applications. After the tutorial, participants will have absorbed the theoretical foundation and practical use of knowledge markup and will be able to assess XML applications and extensions for AI. Besides bringing to bear existing AI techniques for a Web-based knowledge markup scenario, the tutorial will identify new AI research directions for further developing this scenario."

Bioinformatics will be at the core of biology in the 21st century. In fields ranging from structural biology to genomics to biomedical imaging, ready access to data and analytical tools is fundamentally changing the way investigators in the life sciences conduct research and approach problems. Complex, computationally intensive biological problems are now being addressed and promise to significantly advance our understanding of biology and medicine. No biological discipline will be unaffected by these technological breakthroughs.


