Efficient implementation of content models with numerical occurrence constraints
XML content models are a form of extended context-free grammar, in which simple regular expressions are allowed on the right-hand side of productions. The extensions allow optionality and unbounded repetition to be concisely expressed, using ’?’ for optionality and the Kleene star (*) and plus (+) for repetition.
W3C XML Schema added a further extension, by allowing arbitrary numerical upper and/or lower bounds to be specified, so that e.g.
<xs:element name="foo" minOccurs="2" maxOccurs="10"/>
in a content model requires at least two but no more than ten occurrences of <foo>.
Although such numeric occurrence ranges are not new (Perl has had them for many years, for example), implementating a parser which enforces them correctly and efficiently is not easy.
Most (all?) programming language regexp implementations use counters and pseudo-parallelism or backtracking. Most W3C XML Schema parsers use a published algorithm which avoids counters and non-determinism by converting the content model regular expressions into deterministic finite-state automata, at the cost of an exponential blow-up in the number of states required in the worst case.
I present a solution to this problem which is both deterministic and linear in the number of states required, by adding counters to a finite state automaton which can be incremented on transitions and bounds on which act as guards on transitions.
The new solution is not complete—that is, there are content models with nested numeric ranges for which it gives the wrong answer. Although this sounds bad, in fact the situations in which it arises are very rare, and existing implementations actually also don’t cover all possible uses of numeric ranges.
The problem cases (which are why e.g. Perl uses a backtracking parser) are exemplified by the following regexp:
(a{1,2}){2,2}
That is, two repetitions of one or two ‘a’s, which amounts to two, three or four ‘a’s.
(Note that the XML Schema equivalent of the above regexp does not violate the infamous Unambiguous Particle Attribution constraint, because there is only one ‘a’ particle involved. . .)
However, any attempt to parse this using counters and a simple greedy or conservative approach to shift/reduce conflicts will miss one valid string—the greedy approach will miss ‘aa’, because it consumes both ‘a’s on the first go-around the outer loop, and so has nothing left for the second, while the conservative approach will miss ‘aaaa’, because having consumed only one ‘a’ on the first go-around, it has too many left for the second.
The new solution described here is greedy, and so cannot correctly implement some cases of nested numeric ranges (although nesting numerics inside * or +, or vice versa, is just fine).
In the full paper, a step-by-step explanation of the construction and interpretation of the counting automata is given, along with a complete characterisation of the circumstances in which a correct implementation isn’t possible.
A brief discussion of checking subsumption of content models with respect to such automata is also given.




