XTech 2006 news


Future-proofing your XML data

C. M. Sperberg-McQueen (World Wide Web Consortium (W3C)), Eric Miller (World Wide Web Consortium)
Core technologies St. John 1

Future-proofing your information isn’t as easy as it sometimes looks. Using XML is sometimes said to guarantee that information can survive hardware and software obsolescence and thus protect investment in crucial data. That’s partly true: XML does guarantee that the syntax of the data remains clear, and if the XML vocabulary is well designed (and the element names and/or documentation are in one’s native tongue) it may be possible to see the structure of the information. But there are two ways that data can become inaccessible, and undocumented syntax which is no longer supported is only one. Information also becomes inaccessible if the meaning of the data is no longer clear, because it was not documented, or not documented in a way that is understood any longer.

Documenting XML vocabularies is thus not just a good idea – it’s essential to the long-term preservation of the data. Despite frequent claims that XML is “self-documenting” (this applies only to the syntactic structure, if at all), documentation means writing down clearly the agreed-upon structure of the information and the usage and meaning of all elements and attributes.

Human-readable documentation is crucial, but with the development of Semantic Web technologies it is possible to go a step further, toward documentation that is not only human- but also machine-processable. This paper will describe one way this may be done.

The crucial first step is to make explicit decisions about the information structure, the properties of each thing (entity, in philosophical jargon), and how the things interrelate. With these decisions made, it’s easy to record the meaning of the markup in a well-known notation like that of symbolic logic or that of the Resource Description Framework (RDF).
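
As a minimal sketch of what recording the meaning of markup in an RDF-like notation might look like, the Python below reads a small invented sale record and emits subject-predicate-object statements. The element names, the `http://example.org/` URIs, and the function `sale_to_triples` are all assumptions made for illustration; none of them come from the paper.

```python
import xml.etree.ElementTree as ET

# Hypothetical source document; the vocabulary is invented for illustration.
SALE_XML = """
<sale id="s1">
  <customer>c42</customer>
  <product>p7</product>
  <date>2006-05-16</date>
</sale>
"""

EX = "http://example.org/"  # assumed namespace, not taken from the paper


def sale_to_triples(xml_text):
    """Return (subject, predicate, object) statements for one <sale> element."""
    sale = ET.fromstring(xml_text)
    subject = EX + "sale/" + sale.get("id")
    return [
        (subject, EX + "soldTo", EX + "customer/" + sale.findtext("customer")),
        (subject, EX + "includesProduct", EX + "product/" + sale.findtext("product")),
        (subject, EX + "dateOfSale", sale.findtext("date")),
    ]


for s, p, o in sale_to_triples(SALE_XML):
    print(s, p, o)
```

Each statement makes explicit a decision that the markup alone leaves implicit: the sale is the thing being described, and the customer, product, and date are its properties.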

Several proposals have been made for this kind of translation from a specific vocabulary into another vocabulary which is, or is expected to be, well known in a wider community and for a longer time. Architectural forms are perhaps the best known, followed perhaps by the Schema Adjunct Framework. Many of these proposals posit a one-to-one mapping between individual items of the source vocabulary and individual items of the target vocabulary. “The ‘date’ element in this XML vocabulary,” one may hear, “maps to the ‘date_of_sale’ column in this relational database over here.” While a machine-processable means of expressing this relationship is an important step, it is not generally enough. The correct unit of documentation and translation is not the individual datum (element content or attribute value, RDBMS column value) but the cluster of inter-related data. It is not enough for each ‘date’ element to be mapped to the ‘date_of_sale’ column in the right table; the value must also be correctly associated with the right customer id, product ids, and so on. That is, it has to go in the right row, or else the translation is worse than useless.
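
The difference between mapping individual data and mapping clusters can be sketched in a few lines of Python. The order vocabulary, the column names `customer_id` and `product_id`, and both functions below are hypothetical and chosen only to mirror the ‘date’ and ‘date_of_sale’ example; this is an illustration under those assumptions, not the mechanism the paper proposes.

```python
import xml.etree.ElementTree as ET

# Hypothetical order data; element and column names are invented for illustration.
ORDERS_XML = """
<orders>
  <order customer="c1"><item product="p1"/><date>2006-05-16</date></order>
  <order customer="c2"><item product="p2"/><item product="p3"/><date>2006-05-17</date></order>
</orders>
"""


def datum_mapping(root):
    """One-to-one mapping: each <date> becomes a date_of_sale value, with no row context."""
    return [d.text for d in root.iter("date")]


def cluster_mapping(root):
    """Cluster mapping: each <order> yields rows that keep date, customer and product together."""
    rows = []
    for order in root.iter("order"):
        date = order.findtext("date")
        for item in order.iter("item"):
            rows.append({
                "customer_id": order.get("customer"),
                "product_id": item.get("product"),
                "date_of_sale": date,
            })
    return rows


root = ET.fromstring(ORDERS_XML)
print(datum_mapping(root))
for row in cluster_mapping(root):
    print(row)
```

The first function yields a bare list of dates with nothing to say which sale each belongs to. The second treats the whole order as the unit of translation, so every date lands in a row alongside the right customer and product.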

The assumption of a one-to-one mapping makes all of the proposals we know for semantic mapping cumbersome and fragile. A cluster-based mapping can be more flexible. The ability of schema creators or trusted third-party services to associate these mappings in a machine-processable manner is an important step in making one’s information available and useful beyond the application that created it or the manner in which it was created.

In the paper we will show examples, starting with simple vocabularies with simple mappings into logical form and RDF, and concluding with more elaborate vocabularies which present more difficult technical problems.

Chair: Eric Prud'hommeaux