RSS and Atom feed icon News feeds

Markup for Flat-XML Processing

Introduction

As XML technologies get adopted in mainstream data processing, the need grows for markup languages that convert legacy data to and from XML. For SQL database management systems, the need is addressed by the ANSI and ISO SQL/XML standard. But there does not appear to be a comparable effort that covers off the many varieties of flat files, including delimited files, positional files, files of records with repeating groups, EDI files, etc. Flat files may often be regarded as a serial form of other data sources, and in general are less regular in format than strictly tabular data.

The major database and middleware vendors offer proprietary approaches, but there is no standard solution. Proprietary approaches tend to emphasize interactive tooling rather than scripting markup.

[Kay04] suggests that new features in XSLT 2.0 - the unparsed-text() function, regular expression processing, grouping features, and schema-aware processing - make XSLT 2.0 suitable for a wide range of flat to XML processing. Recipes for performing various kinds of flat-XML transformations with XSLT can be found at flat file transformations. The claim, however, seems overstated. XSLT simply does not provide a natural way for processing a sequence of records, particularly not if the number of records is in the hundreds of thousands, or the number of columns is in the hundreds, or there is a requirement to validate each record and discard failing ones.

Two prior efforts have made significant contributions to a practical treatment of flat-XML processing. The first is the XFlat markup language described in [LYONS99], and its commercial implementation, XML Convert. The second is the work on legacy data conversion by [RAWLINS03a], and the author’s attempts to start a related open source project, Babel Blaster.

Lyons’ XFLAT markup language defines a vocabulary for expressing flat-XML and XML-flat conversions. It supports markup to describe records, including groups of records and nested record structures, but not recursive structures. It includes markup to express constraints for validating field values, including constraints on data type (string, integer, float), minimum data width, and maximum data value. It defines a mapping to a target XML document that is roughly isomorphic to the source flat file. The fact that the mapping preserves the information in the flat file means that the flat file definition suffices to also define the inverse mapping, from the XML document back to the flat file. If users want a result tree that’s a little different from what XFLAT produces, perhaps omitting some fields or inserting some grouping tags, they need to use XSLT.

Rawlins’ book provides a virtual treatise on CSV, varying record type, and EDI files. His approach is to define separate EBNF grammars for describing these three classes of legacy data. He defines an XML vocabulary for expressing these grammars. Like Lyons, he is concerned with defining a mapping to an XML document that is roughly isomorphic to the source flat file, so permitting conversion in the other direction. If a different result tree is desired, XSLT is required. His open source project provides six command-line utilities for performing the “to” and “from” conversions for the three classes of flat files.

Other significant prior work includes the early XML pipeline efforts, including the Apache Cocoon project, and the authors of the Java TrAX API. Flat-XML processing involves performing a sequence of steps on a stream of records. It is convenient to give an XML representation to the records, and to structure the processing as an XML pipeline. Doing so allows XML technologies, such as XML Schema and XSLT, to be used to validate and transform individual records.

A new pipeline based language for flat-XML processing is emerging in the open source project ServingXML. Unlike [RAWLINS03a], the ServingXML vocabulary attempts to support a unified approach to defining record structures, without special distinctions between CSV files, varying record files, and EDI files. Unlike both [RAWLINS03a] and [LYONS99], it separates the description of the flat file from the mapping of the records, and abandons the idea of an isomorphic mapping as too restrictive. Consequently, it requires different markup for the “to” conversion and the “from” conversion.

This paper looks at some of the themes coming out of this latter work. It begins with an overview of the problem of aggregating flat records into XML, and of flattening XML into flat records. It then focuses on two key issues:

  • The adaptation of individual records to a canonical XML representation.

  • The need for specialized markup to group adjacent records in target XML.

The goal is to leverage existing XML technologies like XML Schema and XSLT where appropriate, but not to lose sight of the fact that, with records, the only numbers that count are zero, one, and “as many as you like.” This means “record at a time” processing, not building huge trees in memory.

Aggregation and flattening

Figure 1, “Aggregating records”, shows a record pipe that reads a stream of records, adapts them to a canonical XML representation, validates them with XML Schema, discards bad records, transforms good ones, and aggregates them to produce an XML document.

Figure 1. Aggregating records

Note that by supporting an XML representation of the record, we allow XML Schema to be used to validate individual records. This means that we do not need to introduce specialized markup for record validation.

We can use XSLT to transform the record as it is aggregated into a result tree. This allows us the full flexibility of XSLT for conditional processing. If, for example, the record contains both a home and a mailing address, and we want to suppress the mailing address if it is missing a zip code, we can do that with xsl:if or xsl:choose. Since these XSLT stylesheets can be compiled once and are applied to relatively small documents, they are reasonably efficient. As we’ll see below, though, we’ll still want the option of using specialized markup to express the most common mappings.

Figure 2, “Flattening XML”, shows an XML pipe that reads an XML document, produces a stream of SAX events, extracts a sequence of subtrees, transforms them to the canonical XML representation for the output records, validates these documents, fits them to a flat structure, and writes them to a flat file or database.

Figure 2. Flattening XML

Note the “restricted XSLT match pattern”, which is used to match on a stack of startElement SAX events. It’s basically restricted to matching on an absolute or relative path of elements, allowing for a predicate referring to attributes of the current element. This match pattern identifies subtrees to be extracted and subsequently transformed into one or more records.

Aggregation example

Below is a simple example of converting a CSV file into an XML document.

code,name
ATF,"FRENCH SOUTHERN TERRITORIES, D.R. OF"
WLF,WALLIS & FUTUNA ISLANDS

Table 1. CSV file

This data may be represented as a sequence of virtual XML documents:

<country>
  <code>ATF</code>
  <name>FRENCH SOUTHERN TERRITORIES, D.R. OF</name>
<country>
...

Now, suppose we want to transform this sequence into the following XML document:

<Countries>
  <Country Code="ATF">
    <Name>FRENCH SOUTHERN TERRITORIES, D.R. OF</Name>
  </Country>
  <Country Code="WLF">
    <Name>WALLIS &amp; FUTUNA ISLANDS</Name>
  </Country>
</Countries>

We can use the markup below to describe the aggregation of the records into the result tree, using XSLT to transform the individual records:

<sx:recordMapping>
  <Countries>
    <sx:onRecord>
      <sx:transformRecord>
        <sx:xslt>
          <xsl:transform version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
            <xsl:template match="country">
              <xsl:element name="Country">
                <xsl:attribute name="Code">
                  <xsl:value-of select="code"/>
                </xsl:attribute>
                <Name>
                  <xsl:value-of select="name"/>
                </Name>
              </xsl:element>
            </xsl:template>
          </xsl:transform>
        </sx:xslt>
      </sx:transformRecord>
    </sx:onRecord>
  </Countries>
</sx:recordMapping>

Here, the XSLT xsl:value-of element is used to insert values from the XMLized record into the result tree. But while having XSLT as an option is a good thing, for the complex cases, it’s clearly excessive in this instance. The following specialized markup is proposed as an alternative.

<sx:recordMapping>
  <Countries>
    <sx:onRecord>
      <Country>
        <sx:fieldElementMap select="name" element="Name"/>
        <sx:fieldAttributeMap select="code" attribute="Code"/>
      </Country>
    </sx:onRecord>
  </Countries>
</sx:recordMapping>

In this particular case the processing engine need not ever perform an XSLT transformation, since it can take advantage of the simpler structure of the record. But in general the select attribute is allowed to take full XPATH expressions, so except for the easily optimized case where the expression is the field name, the processing engine would internally create one JAXP Templates object for all the mappings, compiled once, and applied on each record.

Flattening example

Below is a simple example of converting an XML document into a flat file. We start with the output XML document of the previous example.

<Countries>
  <Country Code="ATF">
    <Name>FRENCH SOUTHERN TERRITORIES, D.R. OF</Name>
  </Country>
  <Country Code="WLF">
    <Name>WALLIS &amp; FUTUNA ISLANDS</Name>
  </Country>
</Countries>

The following markup describes the flattening of the XML into records:

<sx:inverseRecordMapping>
  <sx:onSubtree path="/Countries/Country">
    <sx:transformSubtree>
      <sx:xslt>
          <xsl:transform version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
            <xsl:template match="/">
              <country>
                <code>
                  <xsl:value-of select="@Code"/>
                </code>
                <name>
                  <xsl:value-of select="Name"/>
                </Name>
              </country>
            </xsl:template>
          </xsl:transform>
        </sx:xslt>
    </sx:transformSubtree>
  </sx:onSubtree>
</sx:inverseRecordMapping>

Again, the XSLT version, while affording generality, is a little on the verbose side. We propose the specialized markup below as an alternative.

<sx:inverseRecordMapping>
  <sx:onSubtree path="/Countries/Country">
    <sx:subtreeRecordMap recordType="country">
      <sx:subtreeFieldMap select="Name" field="name"/>
      <sx:subtreeFieldMap select="@Code" field="code"/>
    </sx:subtreeRecordMap>
  </sx:onSubtree>
</sx:inverseRecordMapping>

This markup reduces, in the processing engine, to a single JAXP Templates object, compiled once, and applied on each subtree.

XMLizing records

In the ServingXML vocabulary, a record is represented in EBNF notation as follows:

Record ::= name (Field+)
Field ::= name (value*) |
  name (Record+)

This representation is general enough to accommodate practically any legacy data format.

The name of the record denotes the record type. If a file contains records for employees, and all of the records in the file have the same format, then we’ll give all of the records the same name, say “employee”. If the file contains different kinds of records, perhaps address as well as employee records, then we’ll give a different name to the address records, say “address”.

A record has fields. A field has a name. A field may have a value, or multiple values, or it may have one or more records of its own.

This section shows markup for adapting some interesting record formats to a canonical XML representation. These include:

A CSV file.

A file containing records with multi-valued fields.

A file containing records of different types.

A file containing repeating groups of records, each group consisting of records of the same type.

A file containing a repeating group of records that are of different types.

A CSV file

Table 1 below shows a simple CSV file.

code,name
ATF,"FRENCH SOUTHERN TERRITORIES, D.R. OF"
WLF,WALLIS & FUTUNA ISLANDS

Table 1. CSV file

We want the flat file processing engine to produce the following sequence of (virtual) XML documents:

<country>
    <code>ATF</code>
    <name>FRENCH SOUTHERN TERRITORIES, D.R. OF</name>
</country>
<country>
    <code>WLF</code>
    <name>WALLIS &amp; FUTUNA ISLANDS</name>
</country>

The markup below describes the physical properties of the file – the comma delimiter, the fact that the second field may be escaped with quote marks. In addition, it assigns names to the record type and to the fields.

<sx:flatRecordType name="country">          
  <sx:fieldDelimiter value=","/>
  <sx:delimitedField name="code"/>
  <sx:delimitedField name="name" quote="yes"/>
</sx:flatRecordType>

This is enough information for the processing engine to produce the desired sequence.

Multi-valued fields

Table 3 shows a record with three pipe-delimited fields, employee-no, employee-name, and children. The children field is multi-valued, with subfields delimited by semi-colons.

employee-no|employee-name|children
0001|Matthew|Joe;Julia;Dave

Table 3. Flat file with delimited fields and subfields

This record will be represented as follows.

<employee>
    <employee-no>001</employee-no>
    <employee-name>Matthew</employee-name>
    <children>Joe</children>
    <children>Julia</children>
    <children>Dave</children>
</employee>

This is similar to our first example, except we add a subfield delimiter for children.

<sx:flatRecordType name="employee">
  <sx:fieldDelimiter value="|"/>
  <sx:delimitedField name="employee-no"/>
  <sx:delimitedField name="employee-name"/>
  <sx:delimitedField name="children">
    <sx:subfieldDelimiter value=";"/>
  </sx:delimitedField>
</sx:flatRecordType>

Records of different types

Table 2 below shows a flat file containing different types of records, with the type depending on two fields, product and style.

product   trade-id  style     
SWAP      9235      FLOAT     BA
SWAP      9235      FIX       0.03

Table 2. Positional flat file with varying record types.

The proposed XML representation of the records is as follows.

<swap-float-leg>
    <product>SWAP</swap>
    <trade-id>9235</trade-id>
    <style>FLOAT</style>
    <index>BA</index>
</swap-float-leg>
<swap-fix-leg>
    <product>SWAP</swap>
    <trade-id>9235</trade-id>
    <style>FIX</style>
    <rate>0.03</rate>
</swap-fix-leg>

The markup below defines the mapping to the proposed representation. The three fields appearing immediately beneath the sx:flatRecordTypeChoice element represent the common part of the record, and are given an XML representation much as an ordinary record. The sx:when instructions specify XPath boolean expressions that are evaluated against that XML instance. The first test expression to evaluate as true determines the type of the record, and consequently the XML representation.

<sx:flatRecordTypeChoice>
  <sx:positionalField name="product" width="10"/>
  <sx:positionalField name="trade-id" width="10"/>
  <sx:positionalField name="style" width="10"/>
  <sx:when test="recordType='SWAP' and style='FLOAT'">
    <sx:flatRecordType name='swap-float-leg'>
      <sx:positionalField name="product" width="10"/>
      <sx:positionalField name="trade-id" width="10"/>
      <sx:positionalField name="style" width="10"/>
      <sx:positionalField name="index" width="10"/>
    </sx:flatRecordType>
  </sx:when>
  <sx:when test="recordType='SWAP' and style='FIX'">
    <sx:flatRecordType name='swap-fix-leg'>
      <sx:positionalField name="product" width="10"/>
      <sx:positionalField name="trade-id" width="10"/>
      <sx:positionalField name="style" width="10"/>
      <sx:positionalField name="rate" width="10"/>
    </sx:flatRecordType>
  </sx:when>
</sx:flatRecordTypeChoice> 

Repeating groups of records of the same type

Table 4. shows a delimited record consisting of a name (“Jane”), followed by two sub-records containing course and grade information, followed by Jane’s birth year and favourite colour, followed by two sub-records containing address information. Here, fields are separated by a carat (^), records within repeating groups are separated by a tilde (~), and the whole repeating group is terminated by a pipe (|).

Jane^ENGL^C-~MATH^A+|1972^BLUE^CHICAGO^IL^USD~ATLANTA^GA^USD

Table 4. Delimited flat file with repeating groups

The proposed XML representation of this record is as follows.

<student>
    <name>Jane</name>
    <grades>
    <subject-grade>
        <subject>ENGL</subject>
        <grade>C-</grade>
    </subject-grade>
    <subject-grade>
        <subject>MATH</subject>
        <grade>A+</grade>
    </subject-grade>
    </grades>
    <year-born></year-born>
    <favourite-color>BLUE</favourite-color>
    <addresses>
        <address>
            <city>CHICAGO</city>
            <state>IL</state>
        </address>
        <address>
            <city>ATLANTA</city>
            <state>GA</state>
        </address>
    </addresses>
</student>

The markup below defines the mapping of the record to the proposed representation.

<sx:flatRecordType name="student">
  <sx:fieldDelimiter value="^"/>
  <sx:repeatDelimiter value="~"/>
  <sx:segmentDelimiter value="|"/>
  <sx:delimitedField name="name"/>
  <sx:repeatingGroup name="grades">
    <sx:flatRecordType name="subject-grade">
      <sx:fieldDelimiter value="^"/>
      <sx:delimitedField name="subject"/>
      <sx:delimitedField name="grade"/>
    </sx:flatRecordType>
  </sx:repeatingGroup>
  <sx:delimitedField name="year-born"/>
  <sx:delimitedField name="favorite-color"/>
  <sx:repeatingGroup name="addresses">
    <sx:flatRecordType name="address">
      <sx:fieldDelimiter value="^"/>
      <sx:delimitedField name="city"/>
      <sx:delimitedField name="state"/>
    </sx:flatRecordType>
  </sx:repeatingGroup>
</sx:flatRecordType>

A repeating groups of records of different types

Table 5 shows a record consisting of a name (“John”) followed by a group of four records: a course record, followed by an address record, followed by a course record, followed by an address record.

John01SCIEB02CHICAGO   ILUS 02TORONTO   ONCAN01HISTB

Table 5. Positional flat file with a repeating group of varying record types

The proposed XML representation of this record is as follows.

<student>
    <name>John</name>
    <student-info>
        <subject-grade>
            <tag>01</tag>
            <subject>SCIE</subject>
            <grade>B</grade>
        </subject-grade>
        <address>
            <tag>02</tag>
            <city>CHICAGO</city>
            <state>IL</state>
            <country>USA</country>
        </address>
        <address>
            <tag>02</tag>
            <city>TORONTO</city>
            <state>ON</state>
            <country>CAN</country>
        </address>
        <subject-grade>
            <tag>01</tag>
            <subject>HIST</subject>
            <grade>B</grade>
        </subject-grade>
    </student-info>
</student>

The markup below defines the mapping of the record to the proposed representation.

<sx:flatRecordType name="student">
  <sx:positionalField name="name" width="4"/>
  <sx:repeatingGroup name="student-info">
    <sx:flatRecordTypeChoice>
      <sx:positionalField name="tag" width="2"/>
      <sx:when test="tag='01'">
        <sx:flatRecordType name="subject-grade">
          <sx:positionalField name="tag" width="2"/>
          <sx:positionalField name="subject" width="4"/>
          <sx:positionalField name="grade" width="2"/>
        </sx:flatRecordType>
      </sx:when>
      <sx:when test="tag='02'">
        <sx:flatRecordType name="address">
          <sx:positionalField name="tag" width="2"/>
          <sx:positionalField name="city" width="10"/>
          <sx:positionalField name="state" width="2"/>
          <sx:positionalField name="country" width="3"/>
        </sx:flatRecordType>
      </sx:otherwise>
    </sx:flatRecordTypeChoice>
  </sx:repeatingGroup>
</sx:flatRecordType>

Grouping records

This section looks at the problem of mapping a sequence of “XMLized” records to a grouped or hierarchical structure.

[LYONS99] and [RAWLINS03a] leave any augmentation of the result XML tree, including the addition of grouping elements, to XSLT. While XSLT 1.0 lacks built in support for grouping sequences, various recipes have been proposed, see [TENNISON]. XSLT 2.0 introduces a new element xsl:for-each-group, which supports grouping based on values, and grouping based on the position of elements in a sequence, see [KAY04].

The ServingXML vocabulary, while allowing for XSLT transformations on the result tree, introduces specialized markup to support grouping. The reasons are threefold. First, with source files sometimes containing hundreds of thousands of records, the result trees may be large, and XSLT transforms inefficient or impractical. Second, records coming from a flat data source are typically already sorted, so grouping can generally be expressed with local references to adjacent records, without need for the entire tree. Third, simple value-based grouping can be expressed naturally as break logic on field values of records flowing one at a time, which is a familiar idea to anyone who has ever used a SQL query or reporting tool.

Simple value-based grouping

We’ll start with the common case of grouping records based on changes in field values from one record to another. This form of grouping requires comparing only two records, the previous with the current.

The financial plan file below has records for cost by project, detail-name, and period.

project

detail-name

period

cost

1767

AD_Sales_SDT_SVA

2003

150

1767

AD_Sales_SDT_SVA

2004

24750

1767

OPS_SQA

2004

113

1785

AD_Sales_SDT_SVA

2004

7920

We want to group the financial information by project and detail-name, like this:

<Plans>
  <Plan project="1767">
    <Details>
      <Detail detailName="AD_Sales_SDT_SVA">
        <PlanData period="2003" cost="150"/>
        <PlanData period="2004" cost="24750"/>
      </Detail>
      <Detail detailName="OPS_SQA">
        <PlanData period="2004" cost="113"/>
      </Detail>
    </Details>
  </Plan>
  <Plan project="1785">  ...

The sx:groupBy element is used to group multiple adjacent records by one or more fields, emitting summary tags around the grouped records.

<sx:recordMapping>
  <Plans>
    <sx:groupBy fields="project">
      <Plan>
        <sx:fieldAttributeMap select="project" attribute="project"/>
        <Details>
          <sx:groupBy fields="project detail-name">
            <Detail>
              <sx:fieldAttributeMap select="detail-name" 
                                    attribute="detailName"/>
              <sx:onRecord>
                <PlanData>
                  <sx:fieldAttributeMap select="period" 
                                        attribute="period"/> ...
                  <sx:fieldAttributeMap select="cost" attribute="cost"/> ...
                  <sx:fieldAttributeMap select="revenue" attribute="revenue"/> ...

The sx:groupBy instruction works somewhat analogously to "group by" in SQL, except that it only applies to adjacent records. It can be nested to any depth.

Generalizing grouping

Some users have requirements for gouping based on the specific value of a field, as opposed to breaks in the value. Consider, for example, the data below.

tag

value

A

value01

B

value02

D

value03

C

value04

C

value05

X

value06

Y

value07

B

value08

D

value09

C

value10

X

value11

Y

value12

Z

value13

Here, the A record type indicates the beginning of a group of level 1, the Z indicates the end. The B indicates the beginning of a group of level 2, the Y indicates the end. The C indicates the beginning of a group of level 3, the X indicates the end.

The result tree that we want is as follows:

<agent-issues>
  <agents>
    <agent>
      <D code=”value03”/>
      <issues>
        <C code=”value04”/>
        <C code=”value05”/>
      </issues> 
    </agent>
    <agent>
      <D code=”value09”/>
      <issues>
        <C code=”value10”/>
      </issues>
    </agent>
  </agents>  
</agent-issues>

Two record mapping elements have been introduced to address these cases: sx:innerGroup and sx:outerGroup.

<sx:recordMapping>
  <agent-issues>
    <sx:innerGroup>
      <sx:startGroup test="sx:previous//tag='A'"/>
      <sx:endGroup test="sx:next//tag='Z'"/>
      <agents>
        <sx:innerGroup>
          <sx:startGroup test="sx:current//tag='B'"/>
          <sx:endGroup test="sx:current//tag='Y'"/>
          <agent>
            <sx:outerGroup>
              <sx:startGroup test="sx:current//tag='C'"/>
              <sx:endGroup test="sx:current//tag='X'"/>
              <issues>
                <sx:onRecord>
                  <sx:elementMap element=”{tag}”>
                    <sx:fieldAttributeMap select="value"  
                                       attribute="code"/>
                  </elementMap>
                </sx:onRecord>
              <issues>
            </sx:outerGroup>
          </agent>
        </sx:innerGroup>
      </agents>
    </sx:innerGroup>
  </agent-issues>
</sx:recordMapping>

The sx:innerGroup and sx:outerGroup elements contain an sx:startGroup, then (optionally) an sx:endGroup, then record mapping elements. The sx:startGroup and sx:endGroup elements define the beginning and end of a group through XPath boolean expressions applied to adjacent records. The sx:innerGroup instruction will always skip a record if the record does not satisfy its grouping criteria. The sx:outerGroup instruction, in contrast, will always pass the record down to the next nested grouping instruction (if any), even if the record does not satisy its own grouping criteria, but in that case it will not emit any tags inbetween.

The sx:innerGroup and sx:outerGroup elements are the most general grouping elements, and sx:groupBy becomes a special case. The fragment below will emit the same tags as the corresponding sx:groupBy section in the earlier example.

<sx:recordMapping>
  <Plans>
    <sx:innerGroup>
      <sx:startGroup test="not(sx:previous//project) 
            or sx:current//project != sx:previous//project"/>
      <Plan>
      ...

Conclusions

This paper discusses markup for flat-XML processing, focusing on themes worked out in the open source project ServingXML. It identifies the notion of a record pipe with “XMLized” records flowing though it, each capable of being validated with XML Schema or transformed with XSLT.

For flat to XML, it proposes a unified representation of records with a canonical XML form, and gives examples of markup for adapting a variety of flat file formats to that form. It suggests specialized markup for handling simple yet common field-element mappings. It suggests specialized markup for grouping records.

For XML to flat, it proposes a restricted XSLT match pattern to extract a sequence of subtrees, each to be further transformed to one or more “XMLized” records, these to be adapted to physical flat records. It suggests XSLT for handling the most complex transformations, and specialized markup for the more common cases.

Bibliography

[KAY04] Michael Kay, Up-conversion using XSLT 2.0, XML 2004, Washington, D.C., November 2004

[LYONS99] Bob Lyons, Converting Flat File Content into XML and Vice Versa, XML 99, Philadelphia, December 1999.

[Parker05] Daniel A. Parker Records, Tags and Pipelines XML 2005, Atlanta, Ga., 2005

[RAWLINS03a] Michael C. Rawlins, Using XML with Legacy Business Applications, Addison-Wesley Professional, 2003

[RAWLINS03b] Michael C. Rawlins, Processing Legacy File Formats with XML and the DOM XML 2003, Philadelphia, Pa., 2003

[TENNISON] Jeni Tennison, Jeni's XSLT Pages: Grouping.