Content modelling at the BBC using RDF and OWL
Abstract
The British Broadcasting Corporation’s “Content Management Culture” department is rolling out a Content Management System to produce the content for all bbc.co.uk sites, adhering to structured content management principles; notably atomised content described with controlled vocabularies.
To create a new site in our CMS, information architects analyse the content on the site and create a Content Object Model (COM). In the past, the COM was represented as a series of documents describing how content is structured and how different types of content can relate to each other (for example, how an instance of an Article object can relate to an instance of an Image object).
The COM is then implemented in the CMS through XML configuration files that define the content creation and editing interface.
In order to enable the maximum degree of content sharing and re-use both within and outside of the BBC, we have undertaken to share as much of our content models as possible between departments. This creates a management problem, as keeping track of the differences between content models, and creating new models based on existing ones, is difficult to do with simple schemas or documentation.
To solve this problem we created a “metamodel” that describes our content modelling structure, and set about finding a tool that would let us create and compare models according to various features. After evaluating and discarding UML and other standard modelling tools, we decided Protégé-OWL’s flexible RDF/OWL-based modelling environment suited our needs, and its Java-based extension framework allowed us to create plug-in modules to automatically generate our XML templates and interface definition files.
This paper discusses how RDF and OWL helped us to model our system, how we used Object Constraint Language-like constructs to define the constraints of our metamodel, how the open-world model of RDF/OWL can make life difficult for tool developers, the difference between rules languages and description logics, and whether the use of RDF reasoners is appropriate for these tools.
Background
Content management at the BBC
We expect that the future of the BBC, and the broadcasting and publishing industry in general, will rely heavily on new technology and the way in which it can provide context, navigation and gateways to different types of content. To enable this, we need to create content in such a way that it can be reused any time, anywhere, on any device. Our strategy is to make our content digitised at the point of creation, atomised so that it can be repurposed, and findable through having the correct metadata. The BBC’s “on-demand” strategy aims to cross all media, all content types and all devices.
“Rather than creating TV, radio or online content, the BBC needs to think in terms of creating media assets for the future, flexible enough to be used across all devices and in all environments. […] We will invest in making our content available on demand in ways that will benefit both the public and the industry, grouping them under one banner so that they are more easily recognisable and usable to the user and more manageable, in terms of resource and direction, for the BBC. We must also ensure that the correct rights are in place for use and re-use.” - BBC’s internet strategy
The “Content Management Culture” group within BBC New Media Central was formed to address this issue. The group is facilitating the move from many disparate, hand-crafted “content production systems” to a small number of interoperable content management systems that “speak a common language”, to allow re-use and adaptability of content, and ease of implementation and maintenance.
When the Content Management Culture group was in its initial planning stages, it analysed the BBC’s content management and re-use problems and the way in which industry experts and other organisations have approached the issue. Our main influences were the work of content management consultants Ann Rockley [ROCKLEY02], [ROCKLEY06] and Bob Boiko [BOIKO04], Information Architecture consultant Louis Rosenfeld [ROSENFELD], and others in the content management and information architecture fields.
Based on this analysis the group decided that the best approach would be to create a shared content object model that describes the BBC’s content in an atomised, re-usable way, and a shared metadata model that allows the content to be described in a way that lets us find it again in the future.
For example, we can break down a page on a web site into separate, atomised, reusable components: an “article text” object, one or more images, web links, audio and video links, perhaps some objective information such as a person profile, event description, or sport result. These individual “content objects” can be pulled together in a different way to be published to another platform such as interactive television or mobile phones, or syndicated to news readers, or re-published as a PDF file for printing, to name a few examples. By breaking the content down in a sensible way, the optimum balance between ease of content creation and practical re-use can be struck. By annotating the content objects with a known, re-usable metadata structure, based on controlled vocabularies of descriptive metadata terms, the content can be found and re-purposed easily. For more information on our semi-automated metadata system, see [HARVEY05].
A shared content and metadata model
The goals behind a shared metadata and content object model are similar to the goals of most XML-based publishing projects, particularly those that span multiple departments and multiple content management systems. These include:
We can avoid re-modelling each site/project separately, with different teams independently re-modelling the same content and reinventing wheels of content as they go. Actually it is more likely that we would end up re-modelling “by accident,” by re-inventing our content structure to suit the quirks of each tool as we implement new systems. This is a common problem with organisations that don’t take a “model-oriented” approach and dive into CMS implementation without first addressing the modelling issues.
Having a common description for our content types gives us scope to aggregate content across all of bbc.co.uk much more easily. For example, if we have a common set of attributes to describe a “person”, be it an actor, writer, newsreader, sports personality or historical figure, then we could create a rich vein of content that links together all people referred to by any bbc.co.uk content, to allow serendipitous content browsing and to link our sites together through a common navigation “hook”.
Using shared descriptive metadata terms allows for content repurposing and re-use, and minimises content duplication. For example, the easier it is to find images of the cheese rolling competition in Gloucestershire, the less likely it is that our journalists will need to go to an image bureau and purchase a new copy of an image for which we already own publishing rights.
In fact, using a common approach to rights across all of our content, through common fields in a content object model, also benefits rights management in the sense that rights can be checked and enforced in an easier way.
Our “model-first” approach ideally creates a content object and metadata model that is more abstract than any particular implementation of that model in a CMS. Therefore we should be able to translate the model into implementations in any CMS that can handle a model in the format we describe here. At the moment, that means the system must use XML-based (or otherwise hierarchical) discrete objects that can handle inter-object relations, and must support querying on those objects based on metadata terms. This aspect of the approach requires more investigation before it can be proven that our model is truly interoperable; this will be an area of future work.
Implementing the shared model approach
To create a shared content object and metadata model, first a team of information architects and business analysts work with the project stakeholders (content creators such as journalists, their managers such as editors, graphic designers, senior management, business people, and of course site users) to determine what are the most important parts of content on the site (or other property), what is no longer needed, and what broad content types are required. The output from this stage of the project is known as a content audit. They may also be able to draw up some wireframes and a site map if the business side knows what they want their site to look like. In addition the business analysts gather a requirements specification that describes any high-level requirements from the business side.
At this point, the technical and information architecture teams should join together to decide which direction to take for the next stage, content analysis, which dives deeper into determining what individual content objects, property fields and business rules should be created by the CMS to implement the pages desired by the business users. The output of this phase is the first iteration of the content object model, which in the first generations of our project was implemented as a Word document looking something like the following:
Figure 1: example content object description in Microsoft Word
Problems with early implementations
We soon discovered that a tabular format in a word processor was not the best way to describe the content objects, partly because a two-dimensional table couldn’t simply represent a multi-levelled hierarchical model (e.g. a repeatable set of properties within another repeatable set, for instance a player within a team in a sports result object) and relationships between objects were very hard to manage.
To avoid some of these problems, we included some information in appendices to the “COM” to handle controlled lists of terms used in objects (e.g. the types of image that could be created: map, photo, cartoon, etc.) and metadata terms. Other information was included in “notes” columns in the tables (e.g. the dimensions of images that could be associated with articles in different places) and some in completely separate documents, such as the soon-to-become-infamous “cardinality spreadsheet,” which determined how many of each object type could relate to each other object type.
In addition to this complexity, when it came time to implement two “clients” in our content management system, we had two copies of each of these documents, describing how the two content object models used the same objects but with different slants, such as different image sizes or numbers of articles in an index.
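The kind of rule held in the cardinality spreadsheet can be sketched as a small lookup table. This is only an illustration of the idea: the object types, the limits and the function name `check_cardinality` are hypothetical and are not taken from the actual spreadsheet.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical cardinality table:
# (source object type, related object type) -> (minimum, maximum)
CARDINALITY: Dict[Tuple[str, str], Tuple[int, int]] = {
    ("Article", "Image"): (0, 3),
    ("Index", "Article"): (1, 10),
}


def check_cardinality(source_type: str, related_types: List[str]) -> List[str]:
    """Return a message for every cardinality rule that the relations break."""
    counts = Counter(related_types)
    problems = []
    # Check every rule defined for this source type, including unmet minimums.
    for (src, target), (low, high) in CARDINALITY.items():
        if src != source_type:
            continue
        n = counts[target]
        if not low <= n <= high:
            problems.append(
                f"{source_type} relates to {n} {target} object(s); allowed {low}-{high}"
            )
    # Any relation with no rule at all is treated as disallowed in this sketch.
    for target in counts:
        if (source_type, target) not in CARDINALITY:
            problems.append(f"{source_type} may not relate to {target}")
    return problems


print(check_cardinality("Article", ["Image"] * 4))
print(check_cardinality("Index", []))
```

Keeping rules like these in a separate spreadsheet meant that they could drift away from the object definitions they constrained.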
Having so many documents describing the same content made implementation and maintenance of the CMS extremely error-prone and time consuming, and testing took far longer than it should have, because simple errors and inconsistencies in the documentation couldn’t be spotted by looking over the two 120-page word processor documents and additional spreadsheets and other material.
To make matters worse the documentation set we used was incomplete -- some elements of the business logic, and editor-level presentation rules, were only described in the actual XML configuration files that were created for the particular CMS we were using. This caused two problems: firstly, our developers were left to define parts of the system that should have been modelled much earlier in the design process; and secondly we fell into the trap of designing for a particular system rather than being led by the model.
It took a great amount of discipline within the information architecture and technical teams to ensure that we didn’t introduce inconsistencies in the shared model (e.g. accidentally different article types for different clients) – we were able to mitigate most of these problems by taking care in our modelling approach and cross-checking everything by hand, but this was slow and error-prone itself.
As a direct result of these problems, the project’s bug resolution cycle was overly complex: testers would spot a bug, raise it to the developers, who might then need to go back to the IAs for confirmation on a problem or inconsistency. Client-side developers might not hear that the model had slightly changed, so they might build XSLT templates against the wrong version of the spec, which would mean that further bugs would arise, not to be found until later in the testing process, etc.
So after the second client implementation, we knew that we had to step back and solve the content modelling problem.
The COM Management Tool project
The problems outlined above all pointed to the fact that we needed some kind of a tool to help with creating and maintaining our Content Object Model. At a high level, the tool needed to:
Allow Information Architects (IAs) to maintain the model/s directly, and to some degree, allow them to see the implications of a change in the design
Empower the IAs to be more accountable for design decisions that were originally more developer-led: decisions like which differences in object would necessitate a new “object subtype”, such as different types of sports results with different variations of input fields, and what could be implemented with some kind of select list inside a single object type, such as different types of images like maps and photos
Reduce the manual, complex, error-prone effort involved in turning a paper model into a set of configuration files for the particular CMS we were using, and also reduce the testing and regression cycles around making sure that the configuration files matched the paper spec
Provide an interface or output that we could show to clients from the content and/or business side to explain the content object models that we were creating on their behalf, and navigate through content relationships and business rules to show the implications of design decisions and how the model related to their content
A model-driven architecture approach
When we embarked on the project to “model our models” in a new tool, we realised that we were entering a world of multiple abstraction layers and separating design from implementation. At the time the project began, we didn’t know much about model-driven design and MDA, but as we learned more about the approach we realised that it fit quite well with what we had originally set out to do – that is, to be “model-driven” rather than “tool-driven” in our approach.
The model-driven architecture approach [MDA] divides a modelling project into layers of models which, with a small stretch of the imagination, can be seen to closely match our layers. The MDA system was mostly created for software design projects, but it seemed to apply quite well to solving our data design problems.
If one thinks about our problem as having object instances (e.g. actual articles and images) at the lowest level, then object definitions for a particular client a level up (e.g. what fields and business rules define an Article object for BBC News or Radio & Music), then general system-level definitions (e.g. what makes article objects from News or R&M shareable with each other), then the actual system that stores those object definitions, we have something like a classic tiered MDA pyramid:

Figure 2: MDA-like view of the COM Management Tool (with apologies to the Object Management Group)
Another way of describing our model in an MDA-like way would be to say that the model maintained in the tool was the “platform-independent model” and the configuration files generated for a particular CMS product out of the standard model were the “platform-specific model”. This would probably be closer to the original MDA specification.
Creating the metamodel
Before we started thinking about what software we would use to implement the COM Management Tool, as part of the requirements gathering process we created a “metamodel” that defined how our content object models were structured, and what we needed a tool to support. The idea is that the tool should allow users to create content models that adhere to our metamodel, and prompt the user with warnings and errors if she tries to stray from the model.
The core parts of the metamodel are objects, containers and properties. These non-system-specific “classes” each have “subclasses”, such as BinaryObject and ContentObject, which can then be instantiated to create the model for a given content object.
This is a variation on the standard hierarchical metamodel that underlies document-style XML models. A relational model, implemented in a simple relational database such as Oracle 7 or Microsoft Access, would differ from this in that it would only include objects and properties; the “container” class would not exist. Using “container” classes allows us to model XML more closely, and also to allow for repeatable elements without creating new objects for a person’s appearances in a programme, for repeatable sets of telephone numbers, etc.
It is important to note that the metamodel only describes the content structure side of what was in the Content Object Model document: properties, containers, and objects, plus how objects can relate to each other (through the ContentObjectReferenceContainer), but no details on how the interface is presented to CMS users, or how the content is displayed on a web site or interactive TV platform, for example.

Figure 3: Simplified UML model of Content Object Model Management Tool
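As a rough illustration of the object, container and property structure described above, the following Python sketch expresses the core metamodel classes as dataclasses. Only the names BinaryObject, ContentObject and ContentObjectReferenceContainer come from the metamodel. The base class `ModelObject`, every field name, the `paths` helper and the sports result example are assumptions made for illustration, as is the choice to treat BinaryObject and ContentObject as subclasses of the object class.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Union


@dataclass
class Property:
    """A single named field, such as a headline."""
    name: str


@dataclass
class Container:
    """A grouping of properties and nested containers, optionally repeatable."""
    name: str
    repeatable: bool = False
    members: List[Union["Property", "Container"]] = field(default_factory=list)


@dataclass
class ContentObjectReferenceContainer(Container):
    """A container holding references to other object types."""
    allowed_targets: List[str] = field(default_factory=list)


@dataclass
class ModelObject:
    """Hypothetical base class for a discrete object type in the model."""
    name: str
    members: List[Union[Property, Container]] = field(default_factory=list)


@dataclass
class ContentObject(ModelObject):
    """Structured editorial content, such as an article or sports result."""


@dataclass
class BinaryObject(ModelObject):
    """Content backed by a binary file, such as an image."""


def paths(node, prefix: str = "") -> Iterator[str]:
    """Yield a slash-separated path for every property beneath a node."""
    here = f"{prefix}/{node.name}"
    if isinstance(node, Property):
        yield here
        return
    for member in node.members:
        yield from paths(member, here)


# A repeatable set of players inside a repeatable set of teams.
sports_result = ContentObject("SportsResult", [
    Property("competition"),
    Container("team", repeatable=True, members=[
        Property("teamName"),
        Container("player", repeatable=True, members=[Property("playerName")]),
    ]),
])

for path in paths(sports_result):
    print(path)
```

The nested `Container` is what lets the model express a repeatable group inside another repeatable group without introducing a separate object type for each level.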
We feel that this “content object model model” is quite generic and may be useful for other content creators in similar situations. To help the community create similar models in the future and to help to set a convention or best practice, we plan to release our content models to the public in the future.
Investigating solutions
UML-based tools
As this was a modelling tool project, the first and most obvious solution area was UML-based modelling tools. We looked long and hard, but couldn’t find any tools that would allow us to create a “metamodel” in UML, and then force users to create models only using that metamodel. Some tools could come close with a lot of customisation, but nothing would do it out of the box.
In addition, representing our complex model instances, including many relationships between object types, containers and properties, meant that any purely diagrammatic representation of the model would be next-to-impossible to read and digest.
XML Schema-based tools
We could have possibly created a version of the metamodel as a W3C XML Schema document (or series of documents) and then created objects as instance documents conforming to this schema. This may have worked but would have involved contorting our model into something that would work as a W3C XML Schema.
In addition the tools we know of would not have provided a flexible way of relating objects to other objects according to the metamodel.
However, further work may open up some options in this area.
RDF/OWL based tools
At the time of investigation in late 2005, Protégé-OWL (actually, the OWL set of plug-ins for the Protégé ontology editor) was the only RDF/OWL tool with a graphical interface that we could find. However it quickly became the favourite, for the simple reason that we could implement the metamodel in the tool quite quickly and the user interface could scale to our needs: that is, the user interface did not become more complicated as the number of modelled objects grew.
Implementing the metamodel in Protégé
Early designs led to something that worked very quickly, and through Protégé-OWL’s support for refactoring, we were quickly able to create an implementation of the metamodel that allowed us to test theories and quickly change the way we handled different types of object and relationships between them. With little knowledge of RDF or OWL, we could simply create “classes” and “instances” (also known as individuals in Protégé-OWL) to implement our metamodel and instance models.
Classes, instances and restrictions
In particular, we changed the class/instance model quite often, settling on a hybrid model where all of the “metamodel” but also some of the “model” was expressed as classes in our Protégé-OWL model, rather than implementing the metamodel only as classes and the model only as instances, which might seem more logical at first.
Figure 4 is an example of a “feature subtype” class, which denotes a particular type of article that can be created in our content model. Created through the Class Editor in Protégé and represented as an owl:Class in the underlying RDF/OWL file, each class represents a type in our content model.
Figure 4: Example of a class (“Feature: Organisation (musical group)”), implemented in Protégé-OWL
Once the class is created in Protégé, instance objects are made in the “individuals” tab, which automatically include constituent properties and containers based on the restrictions created as part of the class (see next section).
Figure 5: Instance example based on above class
Conditions and restrictions
To implement the “constraint” side of the metamodel that UML-based tools couldn’t provide, i.e. the ability to restrict users of the tool to only create models that adhered to our metamodel, we used two features of RDF/OWL and Protégé: restrictions and ontology tests.
Restrictions are created when making a class, so for example when we created the “Feature” class, we created a restriction on hasOptionalProperty which can be written as “hasOptionalProperty someValuesFrom BodyText”. This means that any instance of the Feature class, or any of its subclasses, can have an optional property called BodyText.
Our ontology tests enforce these restrictions, so that a user that assigns an OptionalProperty to a Feature.organisation instance that is not defined through restrictions on the class will be presented with an error.
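The interplay between class restrictions and ontology tests can be sketched in plain Python. This is not the Protégé-OWL API and it does not reproduce OWL’s formal semantics (in OWL itself, someValuesFrom is an existential restriction); it only mirrors the reading given above, in which a restriction on a class lists the optional properties that its instances, and instances of its subclasses, may carry. The names `ModelClass` and `ontology_test` and the property “Scoreline” are hypothetical.

```python
from typing import List, Optional, Set


class ModelClass:
    """A model class carrying 'hasOptionalProperty'-style restrictions."""

    def __init__(self, name: str, parent: Optional["ModelClass"] = None,
                 optional_properties: Optional[Set[str]] = None):
        self.name = name
        self.parent = parent
        self.optional_properties = set(optional_properties or ())

    def allowed_optional_properties(self) -> Set[str]:
        """Collect restrictions from this class and all of its ancestors."""
        allowed = set(self.optional_properties)
        if self.parent is not None:
            allowed |= self.parent.allowed_optional_properties()
        return allowed


def ontology_test(cls: ModelClass, assigned: List[str]) -> List[str]:
    """Return one error per assigned property that no restriction allows."""
    allowed = cls.allowed_optional_properties()
    return [
        f"{prop} is not an allowed optional property of {cls.name}"
        for prop in assigned
        if prop not in allowed
    ]


feature = ModelClass("Feature", optional_properties={"BodyText"})
feature_organisation = ModelClass("Feature.organisation", parent=feature)

# BodyText is inherited from Feature; Scoreline is not defined anywhere.
for error in ontology_test(feature_organisation, ["BodyText", "Scoreline"]):
    print(error)
```

In this sketch the subclass inherits the BodyText restriction from Feature, so only the undefined property is reported.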
Custom extensions and plugins to Protégé
We made several customisations to Protégé, using the extensible Java plugin API, to implement some custom requirements for the COM Management Tool project.
These include:
A set of locked-down selector widgets, such as “RestrictedMultiChoiceSelectorWidget”, to replace existing widgets in our forms. We want to ensure that users cannot accidentally choose related properties or objects that are disallowed by the restrictions defined in the class, so this enforces the constraints defined in the metamodel
A “Content Template” generator, which uses the model to create empty XML instance files for each object type, to be uploaded directly to our Content Management System
Similarly, a “Rules File” generator, which creates configuration files that tell our CMS how to present objects to CMS users for editing and the business rules around how objects can be selected and manipulated
We also built a custom CVS integration plugin, to connect with our corporate version control system. Protégé doesn’t provide CVS compatibility by default as it is assumed that concurrent and large-scale use will be facilitated through using the Protégé group server, which we did not need at this stage of the project.
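To give a feel for what a “Content Template” generator does, here is a minimal Python sketch that turns a simple description of an object type into an empty XML instance. Our actual plug-ins are written in Java against the Protégé API, and the real template format is not shown in this paper, so the function `build_template`, the input structure and the element names are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple, Union

# A field is either a property name or a (container name, child fields) pair.
FieldSpec = Union[str, Tuple[str, list]]


def build_template(object_type: str, fields: List[FieldSpec]) -> ET.Element:
    """Build an empty XML element tree for one object type."""
    root = ET.Element(object_type)
    for entry in fields:
        if isinstance(entry, str):
            # A plain property becomes an empty child element.
            ET.SubElement(root, entry)
        else:
            # A container becomes a nested element built recursively.
            name, children = entry
            root.append(build_template(name, children))
    return root


# Hypothetical article definition with one container of image references.
template = build_template(
    "Article",
    ["Headline", "BodyText", ("Images", ["ImageRef"])],
)
print(ET.tostring(template, encoding="unicode"))
```

A “Rules File” generator could walk the same structure and emit editing-interface configuration instead of empty elements.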
Issues with using RDF/OWL to describe content object models
Why RDF/OWL was useful for us
Even though we did not need to know that we were using RDF/OWL for the majority of this project, the fact that Protégé is built using the RDF/OWL model means that we benefited from its structure and versatility.
In particular, the very open metamodel of RDF and OWL meant that we could create arbitrary object and relationship types through relatively simple configuration and definition, using generic concepts such as classes, properties, instances and restrictions, and that these definitions could then be enforced by the system. As we discovered, tools using UML and other metamodels are either not as broad as RDF, or are too broad in that they let you do anything and don’t provide a way to restrict what users can do.
Open-world assumption
A core assumption of the RDF/OWL system is inherited from its Artificial Intelligence and Knowledge Representation roots, and is called the “open-world assumption”. Under this assumption, a statement that has not been asserted (or cannot be inferred) is not taken to be false; its truth is simply unknown. In a closed-world system we say “everything we don’t know is false”, but in an open-world system we say “everything we don’t know is undefined”. RDF/OWL, and therefore Protégé-OWL, uses the open-world model.
This means that we have to create restrictions on classes, our own custom widgets, etc, to avoid letting users create anything using the interface and only being told that there was an inconsistency when validating the model at a late stage in model development.
For more discussion on this topic and how it underpins RDF/OWL, see [MAZZOCCHI05]
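The difference between the two assumptions can be shown with a toy fact store in Python. This is a simplified sketch, not a reasoner: facts are plain triples, no inference is performed, and the function names and the “Scoreline” property are hypothetical.

```python
from enum import Enum
from typing import Set, Tuple

Fact = Tuple[str, str, str]  # (subject, predicate, object)


class Truth(Enum):
    TRUE = "true"
    FALSE = "false"
    UNKNOWN = "unknown"


def closed_world(asserted: Set[Fact], fact: Fact) -> Truth:
    """Closed world: anything that is not asserted is false."""
    return Truth.TRUE if fact in asserted else Truth.FALSE


def open_world(asserted: Set[Fact], denied: Set[Fact], fact: Fact) -> Truth:
    """Open world: a fact is false only if it has been explicitly denied."""
    if fact in asserted:
        return Truth.TRUE
    if fact in denied:
        return Truth.FALSE
    return Truth.UNKNOWN


asserted = {("Feature", "hasOptionalProperty", "BodyText")}
denied: Set[Fact] = set()
query = ("Feature", "hasOptionalProperty", "Scoreline")

print(closed_world(asserted, query))        # Truth.FALSE
print(open_world(asserted, denied, query))  # Truth.UNKNOWN
```

A modelling tool usually wants the closed-world answer (“that property is not allowed here”), which is why extra restrictions and locked-down widgets are needed on top of an open-world system.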
Interface complexity and documentation
While Protégé is a fairly straightforward system for developers and information modelling professionals to understand, it was far beyond the capabilities of average business users to understand restrictions, subclasses, instances and ontological rules. Most non-technical users were intimidated by the interface of Protégé and how it described the content that they knew very well.
In response to this, we had to write custom code to generate business-user-level documentation in HTML that ignored the RDF and OWL constructs that helped to create the metamodel, but focussed on our classes and instances for objects created by our information architects to be implemented in the content management system.
Figure 6: Example output of HTML documentation from COM Management Tool
It is worth noting that the documentation we have created still may not be quite useful enough; it is yet to be fully tested with business-level users.
Terminology
As in many “meta-level” projects where one talks about the way types of things can interact with other types of things, nomenclature has been an issue from the very start, and has continued to be an issue throughout the process.
For example, what should we mean when we say “a class”? Do we mean a Java class? An XML complexType? An RDFS Class? An OWL Class? Obviously it depends on the context of the particular conversation.
Or how about “object”? The object created in the CMS when an editor says “I want to make a new image”? Or the Java object? Or the object type modelled in the COM Management Tool? Even “metamodel” is a loaded term: are we talking about the OWL metamodel? Or the RDFS one? Or the underlying Protégé model of slots and properties? Or our COM Management Tool “metamodel” which describes how objects can be created?
The only way to handle these issues is to be very careful when describing systems and to keep a reference document close to hand when discussing issues around a whiteboard.
Benefits of the new approach
Generally we are very happy with the new COM Management Tool. Here are some specific ways in which we have benefited as a result of our use of RDF/OWL and Protégé-OWL:
Speed to implementation
Content Object Model design and implementation projects will be much faster now as we have removed a large chunk of tedious work in implementing a paper model using somewhat fragile and error-prone XML configuration files. We now produce virtually all server-side object model configuration information automatically from the RDF/OWL model. Previously most of these files were created and tested by hand – over 10,000 lines of XML per deployment.
In addition, consistency errors in the COM documents were hard to notice, whereas now, with re-use of properties and containers amongst objects, it is much easier to detect and fix inconsistencies when they occur, and it is harder to introduce inconsistencies as the system encourages the use of existing properties rather than creating new, duplicate and perhaps inconsistent properties.
Transparency of the model and the solution
The nature of our new system is that it brings together the two sides of the content modelling process: our developers, who create the XML configuration files for the CMS, write supporting custom code, and define more technical aspects of the COM such as what queries will help to find related content for a particular object, and our information architects, who analyse the content and design the content objects.
The fact that the developers and IAs work more closely and can see each other’s work in progress means that issues are raised and resolved much earlier in the project lifecycle, before any code is written, and before any bugs are introduced. This should also serve to reduce implementation and bug-fixing time.
In addition, the transparency of the model should mean that bugs and omissions in the model implementation (e.g. the CMS editing screens) can be remedied by IAs easily using the COM Management Tool, rather than using the previous lengthy process of raising the bug, going through developers, checking with IAs and editors, implementing a fix, and then having it tested again by all parties.
It ensures that everyone has the same understanding of what an object is and does (everyone who understands the Protégé model, at least).
Automatic documentation
As described above, we built an extension module for Protégé that generates HTML-based Content Object Model documentation that can be presented to business users. This also removes the need to manually write documentation for models and the business rules around them.
Remaining issues / future directions
Some options for how we could move the system forward over the short- and medium-term future are listed below:
We would like to test managing and exporting content object models for different content management systems, handling the models for systems that aren’t directly linked to specific content management products (e.g. the “Programme Information Platform” developed in-house within BBC New Media). We haven’t yet tried to output a model that would be usable by a different content management tool than the one we use; although we designed the system to handle such work, it has not yet been tested. One factor is what we call the “administrative metadata” around content objects, which are handled differently by each CMS tool: things like “last modified by”, “created date”, “last published date” which would be very useful and should be modelled in a shareable way, but don’t need to be modelled on an object-by-object basis so have been left out until now.
The presentation/layout subsystem, which lets IAs define where on the CMS editing screens each property should be positioned, is quite rudimentary at present. As it is very specific to the way our CMS and its “rules files” work it may be worth thinking about how to generalise a presentation structure around the COM editing interface before building such a tool.
Look at formalising our RDF schema and OWL rules with a view to publishing them as a “best practice” or at least something that other organisations could build upon or take ideas from – or just tell us where we went wrong!
Re-think the XML files generated by the system. Right now we use an intermediate XML format internally to describe the file format in something simpler than RDF/OWL or our rules files. This format could have wider use. In addition we could model the XML instance files better, using namespaces and re-using other models such as ISO address details and geographic information schemas [BS7666], and TV-Anytime [TVAnytime] for programme schedule information. We could even model the actual content objects in RDF. Using a tool such as the one described here will make it much easier to consider refactoring our XML and making it more standards-compliant.
Developers are still needed in the process of creating new content objects and importing them into the system. Partly that is because the tools aren’t fully integrated with the CMS, so the upload process is still manual. But it is also partly because the objects themselves are not purely “information-based”: in a complex web site or other interactive offering, the definition of many object types will be based on their technical capabilities within the published system, so developers still need to work closely with IAs to determine what the objects should look like, and how they should function.
References
[BOIKO04] Bob Boiko, “Content Management Bible, 2nd edition”, Hungry Minds, 2004 http://www.metatorial.com/
[BS7666] UK Online - Information Architecture - BS7666 Address and Geographic Location Structures Fragment, 21 March 2005, http://www.govtalk.gov.uk/schemasstandards/schemalibrary_schema.asp?schemaid=244
[CMPROS] The Content Management Professionals organisation. http://www.cmprofessionals.org/
[LOASBY05] Karen Loasby, The growing pains of a controlled vocabulary: an analysis of what happens when you let your users contribute to your CVs, IA Summit, March 2005, http://www.iasummit.org/2005/conferencedescrip.htm#66
[MAZZOCCHI05] Stefano Mazzocchi, blog post, 2005: http://www.betaversion.org/~stefano/linotype/news/91/
[MDA] “Model-Driven Architecture”, Object Management Group, http://www.omg.org/mda/
[ROCKLEY02] Ann Rockley et al, “Managing Enterprise Content: A Unified Content Strategy”, New Riders, 2002 http://www.rockley.com/
[ROCKLEY06] Ann Rockley, “Object-oriented design”, IA Summit March 2006, http://www.iasummit.org/2006/conferencedescrip.htm#67
[ROSENFELD] Lou Rosenfeld & Peter Morville, “Information Architecture for the World Wide Web, 2nd edition”, O’Reilly, 2002 http://www.louisrosenfeld.com/
[TVAnytime] TV-Anytime XML Schema spec, TV Anytime forum. ftp://ftp.bbc.co.uk/pub/Specifications/0-Specifications.html




