
Chameleon XML models

Overview

People don't quite like to agree on things. There are often multiple initiatives to express a given domain model in XML, whether the domain is shared among departments within an organization, or across interests within an industry. It's often very difficult to minimize such duplication because even when people share a problem space, they often differ in their viewpoint of that problem space, and furthermore they differ in how they translate elements in that space into computer abstractions such as XML schemata. These subtle differences tend to be complicated to a surprising degree by group dynamics, which is why governance is such a prominent issue among data architects regardless of precise discipline.

XML is built to oil the wheels of governance. XML was of course named for its supposed extensibility, but the real strength of XML shines through not in extending it, but rather in transforming it. XML is designed for ready transformation from one syntactic form to another, and this should be communicated as the cornerstone of XML's business proposition. XML makes it easier for different interests to agree to disagree, so that everyone can get on with solving problems beyond the data representation. XML is good for creating chameleon models, which have the same substance, but can offer a different appearance as needed in different environments.

Unfortunately, the XML community has been sluggish in developing and communicating the tools and best practices for such chameleon models. For the most part, different interests within a domain develop XML formats in isolation, then wait for some incident requiring interchange before hammering together some odd script, perhaps in XSLT, for translation from one form to another. This approach can be made to work, but it has several problematic effects. The transforms are often built under exigency and are thus hastily designed and harder to maintain. If there are multiple model variations, transforms tend to be one-to-one, with little opportunity for modularization and reuse. The biggest problem, however, is that this approach always exaggerates the differences between the models, which further impedes communication and compounds the difficulties whenever there's a further modification or interchange.

Architectures

The SGML community recognized this problem and developed a solution called architectural forms. At a very high level, the point of an architecture is that interests can figure out what they do have in common and express this common ground as an architecture. They can then address the remaining disagreements in specialized models that are expressed as declared transformations from the architecture. In this way the architecture becomes the fulcrum for the transforms, which fall naturally out of the declared differences from the architecture. The transforms are therefore developed early enough in the lifecycle, and because they are built on a declarative foundation they gain much transparency and modularity. The architecture also becomes the conduit for communication between interests, helping to alleviate problems in governance (many of which result from unclear understanding of actual model differences).

This paper shall not cover classic SGML architectural forms, but several people have tried to bring the power of architectural forms to XML. John Cowan in 2002 came up with a basic specification named Architectural Forms: A New Generation (AF:NG) [COW02]. He never took the specification beyond draft form, but it does cover a significant cross-section of the sorts of model variations one expects to see within a domain, and there is a draft implementation in XSLT contributed by Jeni Tennison. It may be worthwhile for the community to complete this work, although Schematron's abstract patterns, which will be covered in a later section, do render much of it redundant. There are fundamental differences in approach between AF:NG and Schematron's abstract patterns, so it's instructive to have the former as a template for extended or alternative techniques when developing chameleon models.

Let's say a group of departments comes together to decide on a common model for product information. The following is a very simplified version of data model negotiations common in real-world governance.

The marketing department cares about the following information for each product:

  • SKU

  • trademark name

  • product Web page

  • brief description

  • long marketing description

  • base unit price

Operations (fulfilment in particular) cares about:

  • SKU

  • simple name

  • brief description

  • base unit price

  • unit weight

  • freight class

The obvious overlap between the two is:

  • SKU

  • brief description

  • base unit price

Further discussion around the two varieties of names reveals that they almost always correspond, except that trademark names include trademark and service mark designations. A given product might have the trademark name "Python Perfect™ IDE" whereas the simple name would be "Python Perfect IDE". The departments reconcile this by deciding to indicate within the name text any words that need to be specially marked, and leaving it to the display whether or not to actually render those marks.

The model the working group comes up with is, in RELAX NG compact syntax:

element product {
  attribute sku { text },
  name,
  element description { text },
  element price { attribute currency { text }, text }
}

name = element name { mixed { mark* } }

mark = element trademark { text } | element servicemark { text }

This schema serves as the core data model, which is to be adapted for the derivative needs of other parties. A corresponding sample of a product record is as follows:

<product sku='438-AX'>
  <name>Python <trademark>Perfect</trademark> IDE</name>
  <description>Integrated Development environment software for Python
  </description>
  <price currency='USD'>250</price>
</product>
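
One way the display side might honour the naming convention is sketched below in Python. The render_name function and the choice of symbols are assumptions made for illustration; the working group's agreement only says that marked words are tagged within the name and that rendering is left to the display.

```python
import xml.etree.ElementTree as ET

# Symbols appended after marked words; this mapping is an illustrative assumption.
MARK_SYMBOLS = {"trademark": "\u2122", "servicemark": "\u2120"}


def render_name(name_elem, show_marks=True):
    """Flatten a <name> element to display text, optionally rendering marks."""
    parts = [name_elem.text or ""]
    for child in name_elem:
        parts.append(child.text or "")
        if show_marks:
            parts.append(MARK_SYMBOLS.get(child.tag, ""))
        # Text following the marked word belongs to the child's tail.
        parts.append(child.tail or "")
    return "".join(parts)


name = ET.fromstring("<name>Python <trademark>Perfect</trademark> IDE</name>")
marketing_name = render_name(name, show_marks=True)
simple_name = render_name(name, show_marks=False)
print(marketing_name)
print(simple_name)
```

With marks rendered, the result corresponds to the marketing form of the name; with marks suppressed, it corresponds to the simple name that operations prefers.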

A sample of a product information record used in the fulfilment department is as follows:

<line-item is-a='product' sku='438-AX'>
  <prod-name>Python Perfect IDE</prod-name>
  <price units='USD'>250</price>
  <detail>Integrated Development environment software for Python
  </detail>
  <freight class='C'>
    <weight units='Kg'>0.8</weight>
  </freight>
</line-item>

A sample of a product information record used in the marketing department is as follows:

<product sku='438-AX'
         info-page='http://example.com/product-info/ppi'>
  <name>Python <trademark>Perfect</trademark> IDE</name>
  <description>Integrated Development environment software for Python
  </description>
  <long-description>
  Uses mind-reading technology to anticipate and 
  accommodate all user needs in Python development.
  </long-description>
  <price currency='USD'>250</price>
</product>

In Architectural Forms the core model is the architecture, and the other models can be defined in terms of transforms to the architecture. An Architectural Form for relating the fulfilment format to the core model is as follows.

<archmap form-att='is-a'>
  <form name='name' source-elem='prod-name'/>
  <form name='brief-description' source-elem='detail'
        form-elem='description'/>
  <form name='price' source-elem='price'>
    <attmap arch-att='currency' source='units'/>
  </form>
  <form name='freight' source-elem='freight'
        arch-elem='#NONE' children='skip'/>
</archmap>

The archmap element embodies the architectural form, which can be applied to a source document to create an architecture instance. The form-att attribute is mandatory and specifies an attribute to be watched for in the source document. In this case, the processor looks for any element with an is-a attribute and considers that element a match for the form given as the value of that attribute. <line-item is-a='product'... is thus matched with the product form, regardless of the generic identifier line-item. Matching a form means that an element follows a transform procedure. By default it's just renamed, its attributes and character content are copied to the output, and processing descends to its children. All these actions can be modified by using explicit form elements.

The first such element in the above example effects a rename of the prod-name element to name. The next element is also a simple renaming operation, but this time the output element name is given in form-elem. Technically the name attribute is just an informational name for a form, and it's only assumed to be the output element name if form-elem is omitted. The next element demonstrates renaming attributes (the output element name price is the same as the source element name). attmap effects a renaming of the attribute specified in source to the name specified in arch-att, with the same value. Attribute mapping can also create output attributes with data taken from element content, create attributes with fixed content, or manipulate the value of attributes of token type.

The final form element specifies that there is to be no corresponding output element (#NONE) for the source element freight, and that its children are not to be processed. The effect is to completely ignore that source element. If it could also have character content, one would also need to specify data='ignore'.
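
To make the default procedure and the overrides concrete, the following is a minimal Python sketch of the same mapping applied to the fulfilment record. It is not an AF:NG processor: the FORMS table and the to_architecture function are hypothetical names invented for this illustration, the mapping is transcribed by hand from the archmap above rather than parsed from it, and only element renaming, attribute renaming and element skipping are covered.

```python
import xml.etree.ElementTree as ET

# Hand-written equivalent of the archmap above: source element name ->
# (architecture element name, or None to drop it; attribute renames).
FORMS = {
    "prod-name": ("name", {}),
    "detail": ("description", {}),
    "price": ("price", {"units": "currency"}),
    "freight": (None, {}),  # no output element, children skipped
}


def to_architecture(elem, form_att="is-a"):
    """Return the architecture-instance element for elem, or None if dropped."""
    if form_att in elem.attrib:
        # An element carrying the form attribute takes the name of its form.
        out_name, att_map = elem.attrib[form_att], {}
    else:
        # Elements with no entry keep their own name (a simplification).
        out_name, att_map = FORMS.get(elem.tag, (elem.tag, {}))
    if out_name is None:
        return None
    out = ET.Element(out_name)
    for key, value in elem.attrib.items():
        if key != form_att:
            out.set(att_map.get(key, key), value)
    out.text = elem.text
    for child in elem:
        mapped = to_architecture(child, form_att)
        if mapped is not None:
            mapped.tail = child.tail
            out.append(mapped)
    return out


SOURCE = """<line-item is-a='product' sku='438-AX'>
  <prod-name>Python Perfect IDE</prod-name>
  <price units='USD'>250</price>
  <detail>Integrated Development environment software for Python</detail>
  <freight class='C'><weight units='Kg'>0.8</weight></freight>
</line-item>"""

product = to_architecture(ET.fromstring(SOURCE))
print(ET.tostring(product, encoding="unicode"))
```

The sketch keeps children in source order, so price still precedes description in its output; it performs renaming only and makes no attempt to validate the result against the core schema.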

One problem with such transforms is that they can lose information if care is not taken. If the organization had a central repository of product information and fulfilment happened to be the first department to enter information for a new product, it could do so by simple transforms from its specialized records. In this case the product's name would not have any trademark information, which might make it unsuitable for use by marketing. This problem is not at all particular to chameleon data models. Even with ad-hoc models and ad-hoc transforms any transform from data with lower information content to a format with higher content will involve such a loss. Chameleon data formats do underscore the problem because of the ease of transform and sharing. In a closed world such as an organization this can be solved with collaboration and workflow rules. In less centralized environments this is a very difficult problem along the lines of all such data quality considerations in loosely coupled systems.

AF:NG is not really ready for use. Neither the specification draft nor the experimental XSLT implementation is complete, and neither has been updated since 2002. It does, however, serve to illustrate the basic approach of architectural forms, inherited from SGML. Classic SGML architectural forms used parameterized DTDs rather than a specialized instance language as used in AF:NG, but the approach to transformation is similar. The supported transforms are limited, mostly involving renaming elements and attributes, omitting elements, controlling the copying of character data and child elements, or turning element content into attributes. This covers a lot of territory, but Schematron abstract patterns, which are presently ready for use with several available implementations, offer much more power.

Schematron abstract patterns

ISO Schematron is a unique XML schema language, offering extraordinary power in conjunction with other schema languages or on its own. Schematron is not just a schema language but a full-blown reporting facility for XML. Schematron's abstract patterns are a particularly powerful mechanism that supports chameleon data models.

An abstract pattern is a special Schematron pattern that is treated essentially as a template. It contains one or more parameters, using the same syntax as XPath variables ('$param'), but having a very different nature. The Schematron processor performs a preprocessor pass during which it does a literal string substitution of each parameter with the given value. Going back to our product information example, the following is a Schematron pattern that covers some of the specification provided by the core product schema.

<pattern name='product-structure'>
  <rule context='product'>
    <assert test='@sku'>A product must have a sku attribute</assert>
    <assert test='name'>A product must have a name element</assert>
    <assert test='description'>A product must have a description element</assert>
    <assert test='price'>A product must have a price element</assert>
  </rule>
  <rule context='price'>
    <assert test='@currency'>A price must have a currency attribute</assert>
  </rule>
  <rule context='name'>
    <assert test='trademark|servicemark'>A name element uses optional trademark or servicemark elements</assert>
  </rule>
</pattern>

In an abstract pattern many of the element and attribute names used in contexts and tests can be parameterized. The following example is an abstract pattern that offers a lot of control over derivative formats.

<pattern abstract='true' id='product' name='product-structure'>
  <rule context='$product'>
    <assert test='$sku'>A product must have a sku attribute</assert>
    <assert test='$name'>A product must have a name element</assert>
    <assert test='$description'>A product must have a description element</assert>
    <assert test='$price'>A product must have a price element</assert>
  </rule>
  <rule context='$price'>
    <assert test='$currency'>A price must have a currency attribute</assert>
  </rule>
  <rule context='$name'>
    <assert test='trademark|servicemark'>A name element uses optional trademark or servicemark elements</assert>
  </rule>
</pattern>

The attribute abstract='true' establishes that this is an abstract pattern. The constructs in the pattern (chiefly within the context and test attributes) can now have parameter references such as $product. These are very different from XPath variable references; they are purely string substitutions to be computed before run time, and effectively allow one to rewrite the queries using the given parameters. Other than the use of parameter references, abstract patterns look just like non-abstract ones. The following listing puts the abstract pattern to use, creating concrete versions by providing values for the parameters. In this case the parameter values correspond to the conventions discussed above for the fulfilment department.

<pattern name='fulfilment-product' is-a='product'>
  <param formal='product' actual='line-item'/>
  <param formal='sku' actual='@sku'/>
  <param formal='name' actual='prod-name'/>
  <param formal='description' actual='detail'/>
  <param formal='price' actual='price'/>
  <param formal='currency' actual='@units'/>
</pattern>

This pattern is a concrete instance of the abstract pattern named by the is-a attribute (in this case is-a='product'). The param elements provide a value for each parameter reference used in the abstract pattern. For example, the first param in this pattern has an attribute formal='product' to indicate that it is providing a value for parameter references of the form $product in the abstract pattern. The value is given by the actual attribute (actual='line-item' in this case). The value is simple replacement text, and is not parsed or pre-processed in any special way. Only patterns with an is-a attribute can specify parameters, and such patterns cannot also contain rules or variable assignments.
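
The preprocessor pass can be sketched in a few lines of Python. This is an illustration of literal parameter substitution, not the code of any Schematron implementation; the dictionaries and the instantiate function are hypothetical, and the attribute parameters are written here with a leading @ on the assumption that the substituted text must itself select an attribute.

```python
import re

# Tests of the abstract pattern, keyed by rule context (transcribed by hand).
ABSTRACT_RULES = {
    "$product": ["$sku", "$name", "$description", "$price"],
    "$price": ["$currency"],
    "$name": ["trademark|servicemark"],
}

# Formal parameter name -> replacement text for the fulfilment conventions.
FULFILMENT_PARAMS = {
    "product": "line-item",
    "sku": "@sku",
    "name": "prod-name",
    "description": "detail",
    "price": "price",
    "currency": "@units",
}


def instantiate(text, params):
    """Replace each $name reference with its replacement text, verbatim."""
    def substitute(match):
        # A KeyError here flags a parameter with no supplied value.
        return params[match.group(1)]
    return re.sub(r"\$(\w+)", substitute, text)


concrete_rules = {
    instantiate(context, FULFILMENT_PARAMS): [
        instantiate(test, FULFILMENT_PARAMS) for test in tests
    ]
    for context, tests in ABSTRACT_RULES.items()
}
for context, tests in concrete_rules.items():
    print(context, tests)
```

Because the replacement is purely textual, nothing in this step checks that the resulting strings are sensible queries; that only becomes apparent when the concrete pattern is evaluated.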

The use of opaque replacement text opens up a wide range of possibilities for instantiating abstract patterns. Suppose the fulfilment department needed to express the SKU within a very ungainly key/value pair construct, as in the following snippet:

<line-item>
  <item-data key='sku' value='438-AX'/>
  <prod-name>Python Perfect IDE</prod-name>
  ...
</line-item>

It can do so using a parameter substitution such as:

<param formal='sku'
       actual='item-data[@key="sku"]/@value'/>
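
As a rough illustration of what happens after substitution, the following Python sketch evaluates the substituted test against the key/value form of the record. The select helper is hypothetical and understands only a small subset of XPath (child steps with simple predicates, optionally ending in an attribute step), which is enough for the parameter values used in this paper; a real Schematron processor would hand the full expression to its XPath engine.

```python
import xml.etree.ElementTree as ET


def select(context, path):
    """Evaluate a tiny XPath subset: 'a/b', '@att', or 'a[@k="v"]/@att'."""
    steps, _, last = path.rpartition("/")
    if last.startswith("@"):
        # Trailing attribute step: read that attribute from the selected
        # elements, or from the context element if there are no other steps.
        owners = context.findall(steps) if steps else [context]
        name = last[1:]
        return [e.get(name) for e in owners if e.get(name) is not None]
    return context.findall(path)


LINE_ITEM = ET.fromstring("""<line-item>
  <item-data key='sku' value='438-AX'/>
  <prod-name>Python Perfect IDE</prod-name>
</line-item>""")

# The test attribute '$sku' after substitution of the parameter above.
sku_test = 'item-data[@key="sku"]/@value'
sku_values = select(LINE_ITEM, sku_test)

# An assertion holds when its test selects at least one node.
assertion_holds = bool(sku_values)
print(sku_values, assertion_holds)
```

The same helper applied to a line-item with no matching item-data child selects nothing, which is the case in which the assertion about the sku would be reported as failed.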

As you can see, the mechanism of Schematron abstract patterns is very simple, but it offers a great deal of flexibility in expressing constraints while allowing for syntactic variation. Furthermore, Schematron is more than just a constraints language, and you can use its report capabilities to help automate the transformation and processing of concrete instances based on abstract patterns. You could use report clauses to generate XSLT directly from the Schematron processing, or to generate a transformation hints file that's enough for XSLT to complete the work. Schematron contexts and tests are themselves expressed in a query language, which in practice is almost always XPath 1.0 with XSLT 1.0 patterns, but could easily include EXSLT extensions, or even XPath and XSLT 2.0, or XQuery.

With abstract patterns as discussed so far, data models are expressed through Schematron constraints. This is a perfectly useful way to express and validate models, but it has its limitations, and another of Schematron's strengths lies in its usefulness in connection with other languages such as RELAX NG and W3C XML Schema. One might also need to combine approaches if tool limitations or other requirements stipulate grammar-based models. Certainly most schema-driven tools are limited to grammar-based languages at present.

Conclusion

It will always be common to have a set of basic ideas for a data model widely agreed upon, with disagreement over precise form. In this paper's running example everyone agrees that a product has a SKU, a simple name and a simple description, but there are all sorts of variations within and beyond that. All too often in such cases the result is multiple schemata developed in isolation, with ad-hoc scripts for point to point transforms. This wastes resources and compromises many of XML's benefits, especially in regard to information transparency. By adopting and formalizing the idea of chameleon data models developers can manage much of the complexity that comes with inevitable disagreements within a domain. Unfortunately the XML community has spent much less effort on such workflow considerations than on monolithic schema and transformation systems. It would be nice to see continued work and evangelism along the lines of model expression techniques discussed in this paper.

Bibliography

[COW02] J. Cowan, “Architectural Forms: A New Generation (Draft 2.3)”, 2002. http://home.ccil.org/~cowan/XML/afng.html