Going Horizontal: Comparing Open Data Vocabularies Across Domains
The Dutch government is actively promoting the use of digital exchanges. XML-based open standard vocabularies are rapidly appearing in Healthcare, Criminal Justice, Social Security and Financial Reporting. Currently the developments are taking place within – not across – those separate domains. It’s about time to compare the different approaches. What is different and what’s the same? Where is the common ground, and what can be gained by standardization? What works and what doesn’t in building vocabularies?
A research project comparing the vocabularies across these domains is nearly complete. Using a loose tagging approach, we calculate the overlap between vocabularies without requiring semantic equivalence relationships across domains. Instead, tagging the terms allowed us to establish areas of ‘semantic kinship’ and quickly identify several related fields. Two findings stand out: the amount of overlap, currently estimated at about 25% of terms used, and the areas of overlap, mainly person, address, location, contact and financial data. This suggests a strong opportunity to develop cross-domain microformats for exactly those common areas, and several are already being developed. The vocabularies will also be compared in their approaches to datatyping, (XML) serialization, and identity and reference.
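To make the tagging idea concrete, here is a minimal sketch of how tag-based overlap might be computed. It is not the project's actual tooling: the domain names, terms, tags, and the function names `shared_tags` and `overlap_ratio` are all hypothetical, and the rule that a term "overlaps" when it carries at least one tag also used in another domain is an assumption made for illustration.

```python
from collections import defaultdict

# Hypothetical sample data: domain -> {term: set of loose tags}.
# Real vocabularies would be far larger and tagged by domain experts.
vocabularies = {
    "healthcare": {
        "PatientName": {"person"},
        "PatientAddress": {"address", "location"},
        "Diagnosis": {"clinical"},
    },
    "justice": {
        "SuspectName": {"person"},
        "CourtLocation": {"location"},
        "Verdict": {"legal"},
    },
}


def shared_tags(vocabs):
    """Return the tags that occur in more than one domain."""
    domains_by_tag = defaultdict(set)
    for domain, terms in vocabs.items():
        for tags in terms.values():
            for tag in tags:
                domains_by_tag[tag].add(domain)
    return {tag for tag, domains in domains_by_tag.items() if len(domains) > 1}


def overlap_ratio(vocabs):
    """Fraction of all terms that carry at least one cross-domain tag.

    Assumption: sharing a tag signals 'semantic kinship', not equivalence.
    """
    shared = shared_tags(vocabs)
    total = sum(len(terms) for terms in vocabs.values())
    overlapping = sum(
        1
        for terms in vocabs.values()
        for tags in terms.values()
        if tags & shared
    )
    return overlapping / total if total else 0.0


if __name__ == "__main__":
    print(sorted(shared_tags(vocabularies)))
    print(round(overlap_ratio(vocabularies), 2))
```

The toy data is far too small to say anything about the 25% estimate; the point is only that terms never need to be mapped to each other, since a shared tag is enough to place them in the same area of kinship.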
Further, the methodologies used to develop the vocabularies are compared. Healthcare uses HL7, with its strong emphasis on a metamodel (the RIM). Criminal justice uses UBL, with strengths in vocabulary composition and access. Both have their roots in UML, though the results differ significantly. The methodologies will be contrasted at a high level, highlighting the strengths of each approach and providing a useful overview of best practices. Last, we will compare the different approaches to maintenance and versioning. With exchanges springing up, the move to ‘version 2.0’ and the pitfalls involved are currently the main issue in several domains.




