Building and Managing a Massive Triple Store: An Experience Report

Abstract

The aim of the Ingenta MetaStore project is to build a flexible and scalable repository for the storage of bibliographic metadata spanning 17 million articles and 20,000 publications.

The repository replaces several existing data stores and will act as a focal point for integration of a number of existing applications and future projects. Scalability, replication and robustness were important considerations in the repository design.

After introducing the benefits of using RDF as the data model for this repository, the paper will focus on the practical challenges involved in creating and managing a very large triple store.

The repository currently contains over 200 million triples from a range of vocabularies including FOAF, Dublin Core and PRISM.

The challenges faced range from schema design and data loading to SPARQL query performance. Load testing of the repository provided some insights into the tuning of SPARQL queries.

The paper will introduce the solutions developed to meet these challenges with the goal of helping others seeking to deploy a large triple store in a production environment. The paper will also suggest some avenues for further research and development.

Introduction

Project Scope and Goals

The IngentaConnect website currently contains metadata from 17 million articles sourced from 20 thousand publications, primarily from scientific, technical and medical publishers. The website supports about 2 million sessions per month, aiding researchers and students to find and access relevant content.

Prior to the MetaStore project, this data was stored in a number of “legacy” databases. These ranged from specialised bibliographic search engines through to a mixture of relational database platforms, and a large collection of XML documents storing individual article metadata. For example, references (citations) between articles were stored in a separate database from article abstracts. Maintaining and synchronizing this amount of data was an onerous task.

In addition, as the range of features offered by Ingenta grows, there is a requirement to store arbitrary additional data about existing content, e.g. supplementary material such as research results that annotate academic papers, and to create new relationships between content items, e.g. “related articles”. These needs placed challenges on both the data model and database platforms.

Furthermore, Ingenta has distribution agreements with partner organisations such as CrossRef [Crossref] and NLM [NLM]. Storing data in an industry-standard format would ease the data distribution process.

Recognising the limitations of the current architecture and data models, the MetaStore project was initiated to investigate the replacement of these existing data silos with an RDF-based triplestore. The primary aim of the project was to build a flexible and scalable repository for storing bibliographic metadata.

A key initial goal of the project was to determine whether current RDF triple store technologies could scale to the needs of the IngentaConnect application, and ultimately to select a suitable platform for building the complete system.

In order to achieve integration across heterogeneous existing and future metadata, the data model had to be highly flexible, and support the following:

  1. Integration of existing data (e.g. handling sparse and duplicate data).

  2. Extensibility - to cover new types of data in the future.

  3. Flexible cross-linking between resources, e.g. citations, related articles.

  4. Industry standard vocabularies.

It was believed that RDF would meet all of these requirements.

Strategy

The basic strategy adopted during the process was as follows:

First, the existing data was converted into an RDF format. In doing so, the decision was made to rely, as far as possible, on existing published vocabularies. These vocabularies were supplemented, where necessary, with bespoke schemas. Please refer to the Modelling section below for a fuller discussion of the model.

Having collated a large amount of metadata, several engines were selected for testing. RDF repositories were created using each technology and then load-tested. A technology was selected based on a number of factors, primarily scalability, but also ease of management etc. Please refer to the Choice of Technology section for further discussion of the selection process and findings.

Having selected a technology, another round of data modelling was undertaken to refine the model based on the initial experiences of working with and querying the data. Additional data sources were also introduced to scale up the size of the test data set.

Subsequent phases of the project involved an iterative process of refining the model, load testing and tuning.

The Data Set

The core data set is metadata for academic journal articles. This includes abstracts, keywords, authors, references (cited articles) and citations (citing articles). The repository also contains resources representing publishers, journals, books etc. A typical Article resource is shown in Figure 1.

Figure 1. RDF Graph of a typical Article

Scale

The repository currently contains metadata for approximately 4.3 million articles from over 6 thousand publications, and continues to grow on a daily basis.

Loading the initial data generated over 200 million triples in the main table and took up 65 GB of disk space.

Modelling

Standard vocabularies

Where available, the data model was composed using industry standard vocabularies: Dublin Core [DC03], PRISM [PRISM04] and FOAF [FOAF06].

Example 1: Many bibliographic literals and resources are in the Dublin Core namespace:

<struct:Article rdf:about="http://metastore.ingenta.com/content/articles/4" >
  <dc:title>Schedule-induced hours-of-service and ... </dc:title>
  <dc:description>Driver fatigue is well recognized as an important causational factor in accidents involving long-distance truck drivers.....</dc:description>
  <dc:identifier rdf:resource="http://metastore.ingenta.com/content/00013575/v27n1/p33" />
  <dc:subject rdf:resource="http://metastore.ingenta.com/content/subjects/1" />
  ...(other properties)...
</struct:Article>

Example 2: Other bibliographic literals and resources in the PRISM and FOAF namespaces:

<struct:Article rdf:about="http://metastore.ingenta.com/content/articles/4">
  <prism:endingPage>42</prism:endingPage>
  <prism:startingPage>33</prism:startingPage>
  <foaf:maker rdf:resource="http://metastore.ingenta.com/content/authors/9"/>
  (...)
</struct:Article>

The isPartOf [PRISM04b] property from the PRISM vocabulary was used extensively in the model because the data is naturally hierarchical; articles are part of issues, which are part of journals, and chapters are part of books.

Example 3: PRISM:isPartOf is a crucial relationship

<struct:Article rdf:about="http://metastore.ingenta.com/content/articles/">
  <prism:isPartOf rdf:resource="http://metastore.ingenta.com/content/parts/1"/>
  (...)
</struct:Article>

<struct:Issue rdf:about="http://metastore.ingenta.com/content/parts/1">
  <prism:isPartOf rdf:resource="http://metastore.ingenta.com/content/titles/1"/>
  (...)
</struct:Issue>

<struct:Journal rdf:about="http://metastore.ingenta.com/content/titles/1">
  <prism:isPartOf rdf:resource="http://metastore.ingenta.com/content/pubs/28"/>
  (...)
</struct:Journal>
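The hierarchy in Example 3 can be traversed by following prism:isPartOf links upwards. The following is a minimal Python sketch of that traversal, not the project's code: a plain dictionary stands in for the triple store, the function name is invented for illustration, and the article URI is a hypothetical placeholder because Example 3 does not give a complete one.

```python
# Minimal sketch: follow prism:isPartOf links from a resource up to its root.
# The dict stands in for the triple store; it is not the Jena API.
BASE = "http://metastore.ingenta.com/content/"

# child -> parent, taken from the isPartOf statements in Example 3.
# The article URI is a hypothetical placeholder.
IS_PART_OF = {
    BASE + "articles/EXAMPLE": BASE + "parts/1",  # Article -> Issue
    BASE + "parts/1": BASE + "titles/1",          # Issue -> Journal
    BASE + "titles/1": BASE + "pubs/28",          # Journal -> its container
}


def ancestors(resource, is_part_of):
    """Return the chain of containers above `resource`, nearest first."""
    chain = []
    seen = {resource}
    while resource in is_part_of:
        resource = is_part_of[resource]
        if resource in seen:  # guard against accidental cycles in the data
            break
        seen.add(resource)
        chain.append(resource)
    return chain


article_chain = ancestors(BASE + "articles/EXAMPLE", IS_PART_OF)
```

Under these assumptions the article is two links away from its journal, which is why queries that identify an article by a journal-level property have to cross two isPartOf statements.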

Custom vocabularies

The legacy data stores contained some data which did not correspond to any available standard vocabularies. In some cases, this was an extra level of detail within the data, in others the data was system-dependent or internal. Therefore, the next stage in modelling was to create some custom vocabularies using RDFS [RDFS].

These extended standard vocabularies where possible. For example, a custom Author class was developed: the Person [FOAF06b] class from the FOAF vocabulary was found to be nearly adequate to represent an author. However, authors also have affiliations with an institution. Therefore, the custom Author class was designed to extend Person, and to have, inter alia, a literal affiliations property.

Other non-standard properties included branding properties such as logos, and an internal 'status' property.

Example 4: Examples of properties in Ingenta custom vocabularies

<struct:Publisher rdf:about="http://metastore.ingenta.com/content/pubs/1">
  <branding:bigLogo rdf:resource="http://www.ingentaconnect.com/provider-logos/ben.gif"/>
  <struct:status rdf:resource="http://metastore.ingenta.com/content/status/live"/>
  (...)
</struct:Publisher>

The custom vocabularies were modularised to promote reuse, and also to organise the properties into manageable groups. Application-specific vocabularies were defined in their own namespace, separating bespoke requirements from key items like “structure”.

Practical Considerations

Due to the scale of the repository, modelling not only involved identifying the logical classes and properties inherent in the metadata – there were also practical considerations and compromises.

An audit trail – minimising bloat

The flexibility and tolerance of an RDF store make it particularly susceptible to loading mistakes. As a growing set of programs (and programmers) began interacting with the repository, it became clear that an audit trail was needed.

In a relational system it would be easy to track changes made to every row in every table, using createdOn/updatedOn columns. An elegant version of this in RDF would require the use of reification, modelling each change as a statement along with timestamps and an identifier for who changed the data, etc.

However, such a comprehensive audit trail would have come at a cost: it would have radically increased the size of the store (by an estimated 200 GB). The impact on query, backup and replication performance would have been measurable. It was decided that the advantages did not justify this cost.
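The cost of a per-statement audit trail can be illustrated by counting triples. The sketch below assumes the standard RDF reification vocabulary (rdf:Statement, rdf:subject, rdf:predicate, rdf:object) plus two audit properties whose names are borrowed from Example 5. The function name and blank-node label are hypothetical, and the sketch does not attempt to reproduce the size estimate quoted above.

```python
# Sketch: what a per-statement audit trail costs under standard RDF reification.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def reify_with_audit(triple, statement_id, modified, touched_by):
    """Return the extra triples needed to audit one original triple."""
    subj, pred, obj = triple
    return [
        (statement_id, RDF + "type", RDF + "Statement"),
        (statement_id, RDF + "subject", subj),
        (statement_id, RDF + "predicate", pred),
        (statement_id, RDF + "object", obj),
        # Audit properties; names borrowed from Example 5 for illustration.
        (statement_id, "dcterms:modified", modified),
        (statement_id, "struct:lastTouchedBy", touched_by),
    ]


original = (
    "http://metastore.ingenta.com/content/articles/4",
    "prism:startingPage",
    "33",
)
extra = reify_with_audit(
    original,
    "_:stmt1",  # hypothetical blank-node label
    "2006-02-01T13:13:55",
    "http://applications.ingenta.com/loader/1.0",
)
overhead_per_triple = len(extra)  # extra triples for every audited triple
```

In this sketch every audited triple brings six more with it, which gives a sense of why a full trail was considered too expensive for a store of this size.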

Nevertheless, an audit trail was clearly essential for debugging and tracking manual changes.

Therefore, a compromise was established: three simple audit triples were introduced, namely dcterms:created and dcterms:modified timestamps and a struct:lastTouchedBy property. There is only one set per significant resource (for example, per article) rather than per triple.

Example 5: Rudimentary audit trail

<struct:Article rdf:about="http://metastore.ingenta.com/content/articles/2009875">
  <dcterms:created rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2005-12-14T08:51:19</dcterms:created>
  <dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2006-02-01T13:13:55</dcterms:modified>
  <struct:lastTouchedBy  rdf:resource="http://applications.ingenta.com/loader/1.0" />
</struct:Article>

Clearly this coarse-grained solution leaves ambiguity: it is not possible to determine which of the article's properties was changed on the dcterms:modified date. However, the information has proved useful and sufficient so far. The compromise was a good practical solution.
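The coarse-grained scheme can be sketched as a single "touch" operation applied whenever a program changes a resource. This is an illustrative Python sketch rather than the loader's actual code: the function name and the dictionary standing in for the store are assumptions, while the property names, URIs and timestamps come from Example 5.

```python
from datetime import datetime


def touch(audit, resource, agent, now):
    """Record one set of audit values per resource (not per triple)."""
    stamp = now.strftime("%Y-%m-%dT%H:%M:%S")
    entry = audit.setdefault(resource, {})
    entry.setdefault("dcterms:created", stamp)  # written once, on first load
    entry["dcterms:modified"] = stamp           # overwritten on every change
    entry["struct:lastTouchedBy"] = agent       # overwritten on every change
    return entry


audit = {}
article = "http://metastore.ingenta.com/content/articles/2009875"
loader = "http://applications.ingenta.com/loader/1.0"
touch(audit, article, loader, datetime(2005, 12, 14, 8, 51, 19))
touch(audit, article, loader, datetime(2006, 2, 1, 13, 13, 55))
```

However many times the resource is touched, it carries exactly three audit values, which keeps the overhead proportional to the number of resources rather than the number of triples.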

Multiple languages - minimising bloat

The great majority of Ingenta abstracts are available only in English. However, for a small number, there is also a version in one or more other languages.

First attempts to model this involved multiple dc:description properties, each with the appropriate xml:lang attribute, as shown in Example 6:

Example 6: One option for handling multiple languages: xml:lang

<struct:Article rdf:about="http://metastore.ingenta.com/content/articles/33">
  <dc:description xml:lang="es">
    Objetivos: En los últimos 20 años, las medidas para evitar las epidemias de dengue se han centrado en...
  </dc:description>
  <dc:description xml:lang="en">
    Objective: In the past 20 years, the emphasis for avoiding dengue epidemics has focused on...
  </dc:description>
  (...)
</struct:Article>

However, abstracts also regularly contain XHTML markup, such as bold tags, so the dc:description would also need to carry an rdf:parseType='Literal' attribute. The RDF Spec [Klyne04] does not allow both to be applied simultaneously.

An elegant, theoretical option to solve this problem would have been to define a class called Abstract, and construct a bNode [bNode] to link a dc:language property with the (typed) literal, as shown in Example 7.

Example 7: Elegant option for handling multiple languages: a bNode and dc:language tag

<struct:Article rdf:about="http://metastore.ingenta.com/content/articles/33">
  <dc:description>
    <struct:Abstract>
      <dc:description rdf:parseType="Literal">
        <b>Objective:</b> In the past 20 years, the emphasis for avoiding...
      </dc:description>
      <dc:language>en</dc:language>
    </struct:Abstract>
  </dc:description>
</struct:Article>

The bNode approach would have been the ideal option with a small repository containing clean data.

However, for this project, it would have bloated the repository by at least 3 more triples for each of the 4.3 million articles: roughly 13 million extra triples (3 × 4.3 million = 12.9 million). As discussed above, this would have affected query performance. Furthermore, the great majority of articles had only a single language, so the extra triples would have added no value in 99% of cases.

Therefore the decision was taken to follow advice from the Jena developers [Carroll05] and the RDF Spec [Klyne04] and use an extra <xhtml:span> tag to hold the xml:lang attribute. The obvious downside to this decision was that it prevents query by language.
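The span-based compromise, and its downside, can be sketched in a few lines of Python. This is an illustration only: it assumes the span wraps the whole abstract, the function names are invented, and the exact form of the literal stored by Ingenta is not shown in this paper.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def wrap_abstract(xhtml_fragment, lang):
    """Wrap an XHTML abstract in a span that carries its language."""
    return '<xhtml:span xmlns:xhtml="%s" xml:lang="%s">%s</xhtml:span>' % (
        XHTML_NS,
        lang,
        xhtml_fragment,
    )


def language_of(literal):
    """Recover the language by parsing the literal itself."""
    return ET.fromstring(literal).get(XML_LANG)


literal = wrap_abstract(
    "<b>Objective:</b> In the past 20 years, the emphasis for avoiding...",
    "en",
)
abstract_lang = language_of(literal)
```

Because the language now lives inside the literal, it can only be recovered by parsing each value after retrieval. A graph pattern cannot select on it, which is the loss of query by language noted above.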

Again, this compromise between theoretical modelling and practical performance was necessary because of the size of this triplestore.

Generous Adoption of Identifiers

The new repository would exist within an established framework of other systems: “client” applications within Ingenta such as the IngentaConnect website, the Full Text Document Server, the Reference Resolver, the Distribution Engine and the Search Indexer.

As discussed by Fowler [Fowler04], implementing the “Shared Database” pattern is a technical and political challenge in itself. One problem is that client systems are unlikely to have easy access to the new repository identifier, and usually have their own identifiers in their old databases. Extending the model to simply include all such identifiers provided a simple, powerful integration hook:

Example 8: Legacy secondary identifiers

<struct:Article rdf:about="http://metastore.ingenta.com/content/articles/42">
  <linking:genlinkerRefId rdf:resource="genlinker://refid/5518325"/>
  <ident:infobike rdf:resource="infobike://maney/mint/2003/00000112/00000003/art00001"/>
  <dc:identifier rdf:resource="http://metastore.ingenta.com/content/maney/03717844/v112n3/s1"/>
  <ident:doi>10.1179/037178403225003582</ident:doi>
  <ident:sici>0371-7844(20031201)112:3L.141;1-</ident:sici>
</struct:Article>

Some of these identifiers have the potential to be used by multiple clients and have been modelled using the basic dc:identifier property.

Other identifiers are relevant only to particular clients, e.g. linking:genlinkerRefId. Application-specific namespaces have been developed for such properties, although each is a sub-property of dc:identifier.

In addition, there are industry standard identifiers, e.g. DOI [DOI] and SICI [SICI], aimed at external clients.

It has become a feature of this repository that every significant resource has multiple identifiers. This aspect of the model was not planned, but it has proved to be expedient in a number of ways. The ability to easily store extra and duplicate identifiers is a significant advantage of a flexible RDF model.
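The integration hook can be pictured as a reverse index from any identifier a client already holds to the repository's own resource URI. The Python sketch below uses the values from Example 8. The dictionary-based index and the function names are assumptions made for illustration and do not describe how the repository resolves identifiers internally.

```python
ARTICLE = "http://metastore.ingenta.com/content/articles/42"

# (property, value) pairs copied from Example 8.
IDENTIFIERS = {
    ARTICLE: [
        ("linking:genlinkerRefId", "genlinker://refid/5518325"),
        ("ident:infobike", "infobike://maney/mint/2003/00000112/00000003/art00001"),
        ("dc:identifier", "http://metastore.ingenta.com/content/maney/03717844/v112n3/s1"),
        ("ident:doi", "10.1179/037178403225003582"),
        ("ident:sici", "0371-7844(20031201)112:3L.141;1-"),
    ],
}


def build_index(identifiers):
    """Map every known identifier value to the set of resources carrying it."""
    index = {}
    for resource, pairs in identifiers.items():
        for _prop, value in pairs:
            index.setdefault(value, set()).add(resource)
    return index


def resolve(index, value):
    """Return the single resource for an identifier, or None if missing or ambiguous."""
    matches = index.get(value, set())
    return next(iter(matches)) if len(matches) == 1 else None


INDEX = build_index(IDENTIFIERS)
```

A client that knows only a DOI, or only its own legacy reference id, reaches the same article resource without needing to learn the repository's URI scheme first.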

Choice of technology

Only Java engines were considered, because Ingenta has in-house expertise in Java.

The developers experimented with: Jena [Jena05] with a PostgreSQL [Postgres05] backend, Sesame [Sesame05] with a PostgreSQL backend, and Kowari [Kowari05] with a native Kowari backend.

Jena proved to be the most suitable in terms of scalability, although it should be emphasised that testing was in no way exhaustive. Sesame became slow when opening the model (20 minutes) once the repository grew beyond 50 million triples. With Kowari, "Out of memory" errors began to occur once the repository grew beyond 25 million triples.

Also, an RDBMS backend was preferred over a native store for the following reasons:

  • Replication. An RDBMS has mature built-in replication, whereas with the native store, there would have been a need to develop a custom solution.

  • Maintenance. Ingenta system administrators have experience with maintaining and backing up RDBMS - in particular PostgreSQL.

Another advantage of Jena was that the main triples table was relatively human-readable with a standard SQL client. It is simple to understand, with “subj”, “prop” and “obj” columns corresponding to the subject-property-object triple, and the URIs and literals can be viewed unencoded. This was useful for debugging and investigation throughout the development process.

Loading – performance problems and workarounds

Creating the initial repository involved transferring data from 2 legacy systems: a file based data store of SGML headers (4.3 million), and a relational database of references (8 million).

The loading process was not a direct import of data into the repository, since there was a need to de-duplicate data and establish cross references. For example, where two articles had the same author, just a single Author resource was created in the repository. Therefore, for every author of every new article, an author query had to be performed. Similarly, for every reference (cited article), an article query had to be performed.
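The de-duplication step amounts to a get-or-create lookup per author. The following Python sketch shows the shape of that logic under stated assumptions: the class, the name-normalisation rule and the author names are all hypothetical, and a dictionary stands in for the query against the repository.

```python
class AuthorRegistry:
    """Get-or-create lookup so each distinct author becomes one resource."""

    def __init__(self, base="http://metastore.ingenta.com/content/authors/"):
        self.base = base
        self.by_key = {}
        self.lookups = 0

    def resource_for(self, name):
        self.lookups += 1                     # one query per author per article
        key = " ".join(name.lower().split())  # hypothetical matching rule
        if key not in self.by_key:
            self.by_key[key] = self.base + str(len(self.by_key) + 1)
        return self.by_key[key]


# Two hypothetical articles sharing one author.
articles = [["A. Author", "B. Writer"], ["a. author"]]
registry = AuthorRegistry()
links = [[registry.resource_for(name) for name in authors] for authors in articles]
```

The counter makes the cost visible: the number of lookups grows with the number of author mentions, not the number of distinct authors, so lookup speed dominates total loading time.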

An initial loading program was created, using SPARQL [SPARQL] to query existing data, and then converting the new data into triples and inserting it using standard Jena calls. The problem with this approach was performance. Based on the first few thousand articles, estimates of total loading time were around 2000 hours (3 months); clearly this was not a practical option.

It was quickly discovered that batching insertions improved performance [PPKP06]. The second enhancement was to improve query performance by minimising the use of SPARQL:

Due to their complexity and the size of the store, the SPARQL queries commonly used in loading were not fast. As shown in Example 9, a SPARQL query based on bibliographic literals was quite complicated. For example, the identifying data includes the ISSN [ISSN], but ISSN was modelled as a property of a Journal resource, which is two isPartOf links away from the article itself.

Example 9: SPARQL query for an Article

PREFIX  branding: <http://metastore.ingenta.com/ns/branding/> 
PREFIX  dcterms: <http://purl.org/dc/terms/>
PREFIX  struct: <http://metastore.ingenta.com/ns/structure/>
PREFIX  dc:    <http://purl.org/dc/elements/1.1/>
PREFIX  linking: <http://metastore.ingenta.com/ns/linking/>
PREFIX  rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX  prism: <http://prismstandard.org/namespaces/1.2/basic/>
PREFIX  ident: <http://metastore.ingenta.com/ns/identifiers/> 
SELECT  ?pub ?journal ?issue ?article 
WHERE {
  ?journal     rdf:type        struct:Journal .
  ?journal     dc:identifier   <http://metastore.ingenta.com/content/issn/02670836> .
  ?issue  prism:isPartOf   ?journal  .
  ?issue  prism:volume  ?volumeLiteral . 
  ?issue    prism:number    ?issueLiteral .
  ?article  prism:isPartOf   ?issue .
  ?article  prism:startingPage  ?firstPageLiteral . 
  FILTER ( ?volumeLiteral = "20" ) 
  FILTER ( ?issueLiteral = "4" ) 
  FILTER ( ?firstPageLiteral = "539" ) 
}

However, preliminary performance testing also suggested a solution. Retrieval via dc:identifier was very fast, even with a large repository. Therefore, predictable identifiers were created by concatenating the identifying bibliographic literals.

Example 10: Examples of the extra, predictable, dc:identifiers

<dc:identifier rdf:resource="http://metastore.ingenta.com/content/02670836/v20n4/s2"/> 
<dc:identifier rdf:resource="http://metastore.ingenta.com/content/02670836/v20/p539"/>
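Composing such identifiers is simple string concatenation. The Python sketch below reproduces the two URI shapes in Example 10 from the literals used in Example 9. The function names are invented, and because this paper does not explain the final "s2" segment, it is passed through as an opaque suffix rather than derived.

```python
BASE = "http://metastore.ingenta.com/content/"


def volume_page_identifier(issn, volume, first_page):
    """ISSN + volume + first page; None when a required literal is missing."""
    if not (issn and volume and first_page):
        return None
    return "%s%s/v%s/p%s" % (BASE, issn, volume, first_page)


def volume_issue_identifier(issn, volume, issue, suffix):
    """ISSN + volume + issue + an opaque suffix (its meaning is not assumed)."""
    if not (issn and volume and issue and suffix):
        return None
    return "%s%s/v%sn%s/%s" % (BASE, issn, volume, issue, suffix)


by_page = volume_page_identifier("02670836", "20", "539")
by_issue = volume_issue_identifier("02670836", "20", "4", "s2")
incomplete = volume_page_identifier("02670836", None, "539")
```

Returning None when a literal is missing lets a caller detect that no identifier could be composed for a given legacy record.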

The identifier formats were carefully designed to maximise the chances that each legacy system would contain appropriate data to compose one for any given resource. The loader was modified to resort to the heavy SPARQL query only when the predictable identifier could not be created by the loader, or returned ambiguous results.
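The loader's lookup order can be sketched as follows. This is an illustration rather than the loader's code: the two callables stand in for the identifier lookup and the heavy SPARQL query, the sample resource names are hypothetical, and treating "no match" as "not yet loaded" is an assumption that the paper does not state.

```python
def find_article(identifier, lookup_by_identifier, heavy_query):
    """Try the cheap identifier lookup first; fall back only when it cannot decide."""
    if identifier is not None:
        matches = lookup_by_identifier(identifier)
        if len(matches) == 1:
            return matches[0], "identifier"
        if not matches:
            # Assumption: no match means the article is not yet in the store.
            return None, "identifier"
    # Identifier could not be composed, or matched more than one resource.
    return heavy_query(), "sparql"


UNIQUE = "http://metastore.ingenta.com/content/02670836/v20/p539"
AMBIGUOUS = "http://metastore.ingenta.com/content/02670836/v20/p1"
INDEX = {UNIQUE: ["article-A"], AMBIGUOUS: ["article-B", "article-C"]}

heavy_calls = []


def lookup(identifier):
    return INDEX.get(identifier, [])


def heavy_query():
    heavy_calls.append(1)  # count how often the expensive path is taken
    return "article-from-sparql"


found = find_article(UNIQUE, lookup, heavy_query)
ambiguous = find_article(AMBIGUOUS, lookup, heavy_query)
uncomposable = find_article(None, lookup, heavy_query)
```

Of the three lookups, only the ambiguous and uncomposable cases reach the expensive query.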

There were obvious costs to this approach. First, it bloated the repository with extra triples. Second, the identifiers were composed of derived, duplicate data, and now have to be updated to track the literal values on which they are based – a significant maintenance burden.

However, despite these costs, it was absolutely essential to improve query performance while loading. The “predictable identifiers” solution was another example of a compromise introduced into the model in order to overcome performance problems inherent in such a massive store.

Querying

SPARQL vs RDQL

Initially, queries were developed in RDQL [RDQL]. However, as the project progressed, it became clear that a richer query language was needed. In order to produce queries on the triplestore which would be analogous to complex SQL queries in existing applications, RDQL was exchanged for SPARQL [SPARQL].

For example, a reference-resolution application polls partner organisations for articles which Ingenta does not host; it must identify articles which are suitable for polling. The SPARQL query is shown in Example 11.

Example 11: Real-application SPARQL query – including negation (as failure) [Prud06]

PREFIX  rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX  prism: <http://prismstandard.org/namespaces/1.2/basic/>
PREFIX  dc:    <http://purl.org/dc/elements/1.1/>
PREFIX  dcterms: <http://purl.org/dc/terms/>
PREFIX  struct: <http://metastore.ingenta.com/ns/structure/>
PREFIX  linking: <http://metastore.ingenta.com/ns/linking/>
SELECT  ?article ?date ?hostingDesc
WHERE{ 
  ?article  rdf:type  struct:Article .
  ?article  struct:status <http://metastore.ingenta.com/content/status/bare> .
  ?article  dcterms:created  ?date .
  OPTIONAL {   
    ?hostingDesc rdf:type  linking:HostingDescription .
    ?hostingDesc linking:hostedArticle  ?article .
    ?hostingDesc linking:linkingPartner  <http://www.crossref.org> .
  }
  FILTER ( ?date > "2004-10-06T00:00:00.109+01:00"^^<http://www.w3.org/2001/XMLSchema#dateTime> )
  FILTER ( ! bound(?hostingDesc) )
}

This query would not have been possible in RDQL.
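The negation idiom in Example 11 (an OPTIONAL block followed by a filter on an unbound variable) can be emulated over plain Python lists to show what the query computes. The sample rows below are hypothetical, and the timestamps are simplified to a single format without time zones so that string comparison orders them correctly. A real SPARQL engine compares typed dateTime values instead.

```python
CROSSREF = "http://www.crossref.org"

articles = [  # (article, status, created) -- hypothetical sample rows
    ("article-1", "bare", "2005-01-10T00:00:00"),
    ("article-2", "bare", "2005-03-02T00:00:00"),
    ("article-3", "live", "2005-03-02T00:00:00"),
    ("article-4", "bare", "2003-07-01T00:00:00"),
]
hosting = [  # (hostingDesc, hostedArticle, linkingPartner)
    ("hosting-1", "article-2", CROSSREF),
]


def unresolved(articles, hosting, partner, since):
    """Articles with status 'bare', created after `since`, with no hosting record."""
    rows = []
    for article, status, created in articles:
        if status != "bare" or not created > since:
            continue
        # OPTIONAL: try to bind ?hostingDesc; leave it unbound (None) if no match.
        bound = [h for h, hosted, p in hosting if hosted == article and p == partner]
        hosting_desc = bound[0] if bound else None
        # FILTER(!bound(?hostingDesc)): keep only rows where the OPTIONAL failed.
        if hosting_desc is None:
            rows.append(article)
    return rows


result = unresolved(articles, hosting, CROSSREF, "2004-10-06T00:00:00")
```

Only the first article survives. The second has a hosting description for the partner, the third has the wrong status, and the fourth is older than the cut-off date.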

SPARQL vs SQL

The query in Example 11 approximates the following outer-join SQL query from the legacy system:

Example 12: SQL query equivalent to Example 11.

SELECT refs.ref_id 
FROM sources, refs LEFT OUTER JOIN matches ON refs.ref_id=matches.ref_id 
WHERE sources.ref_id=refs.ref_id 
      AND tags_doi IS NULL 
      AND date_loaded > 20041006000000;

As can be seen by comparing Examples 11 and 12, the SPARQL versions of queries were more verbose and complex than the SQL versions in the old systems.

The development team found that creating SPARQL implementations of realistic queries involved significant time and expertise.

Performance testing

The following sections summarise some performance testing carried out against the repository. A suite of 3 tests was developed:

  1. Identifier based retrieval of an article

  2. SPARQL query for an article, based on bibliographic literal values.

  3. A variation of the above query, with a less explicit graph pattern. This query demonstrates the potential sensitivity of SPARQL implementations to SQL query optimisers. See [PPKP06] for more discussion.

Test 1. Identifier-based Retrieval

String id="http://metastore.ingenta.com/content/articles/4";
Resource ires = model.getResource(id);  

Test 2. SPARQL Query

PREFIX  branding: <http://metastore.ingenta.com/ns/branding/> 
PREFIX  dcterms: <http://purl.org/dc/terms/>
PREFIX  struct: <http://metastore.ingenta.com/ns/structure/>
PREFIX  dc:    <http://purl.org/dc/elements/1.1/>
PREFIX  linking: <http://metastore.ingenta.com/ns/linking/>
PREFIX  rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX  prism: <http://prismstandard.org/namespaces/1.2/basic/>
PREFIX  ident: <http://metastore.ingenta.com/ns/identifiers/> 
SELECT  ?pub ?title ?issue ?article 
WHERE {
  ?title    rdf:type        struct:Title .
  ?title    dc:identifier   <http://metastore.ingenta.com/content/issn/02670836> .
  ?issue  prism:isPartOf   ?title .
  ?issue  prism:volume  ?volumeLiteral . 
  ?issue    prism:number    ?issueLiteral .
  ?article  prism:isPartOf   ?issue .
  ?article  prism:startingPage  ?firstPageLiteral . 
  FILTER ( ?volumeLiteral = "20" ) 
  FILTER ( ?issueLiteral = "4" ) 
  FILTER ( ?firstPageLiteral = "539" ) 
}

Test 3. Naive version of Test 2.

(prefixes)
SELECT  ?pub ?title ?issue ?article 
WHERE {
  ?title    dc:identifier   <http://metastore.ingenta.com/content/issn/02670836> .
  ?issue  prism:isPartOf   ?title .
  ?issue  prism:volume  ?volumeLiteral . 
  ?issue    prism:number    ?issueLiteral .
  ?article  prism:isPartOf   ?issue .
  ?article  prism:startingPage  ?firstPageLiteral . 
  FILTER ( ?volumeLiteral = "16" ) 
  FILTER ( ?issueLiteral = "11-12" ) 
  FILTER ( ?firstPageLiteral = "1309" ) 
}

Results

Size of store (millions of triples)        2      10      77      99     152
DCIDENT (ms)                              17      15      25      32      23
SPARQL (ms)                              385     987    1048    1128    1465
SPARQL naive (ms)                       1404   21237  123547  161112  212844

Test conditions were:

  • Jena 2.3

  • PostgreSQL 7

  • Debian 3.1

  • Intel(R) Xeon(TM) CPU 3.20GHz

  • 6 SCSI Drives in RAID5 - Ultra320 (15,000 rpm)

  • 4G RAM

Figure 2: Query Performance / Repository size, for a range of query strategies

Analysis

  1. Identifier based retrieval times were very fast: 23ms even with 152 million triples.

  2. Identifier retrieval times were only negligibly affected by repository size.

  3. SPARQL query times were around 1.4 seconds for a very large store (152 million triples), up from under 0.4 seconds at 2 million triples. They degraded gradually as repository size increased.

  4. SPARQL/Jena did scale: performance was acceptable even with a very large store.

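  5. There was high variability in response times for logically similar SPARQL queries: at 152 million triples the naive query of Test 3 took 212,844 ms (about 3.5 minutes), compared with 1,465 ms for Test 2.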

Therefore, these results suggest that identifier retrieval should be used wherever possible, for example in user-facing applications such as the IngentaConnect website. However, with performance of the order of 1.4 seconds, SPARQL can also be used where suitable identifiers are not available.
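The relative differences behind these observations can be checked directly from the results table. The short Python sketch below recomputes them from the published figures. Only the helper names are invented.

```python
SIZES = [2, 10, 77, 99, 152]  # millions of triples
TIMES_MS = {
    "DCIDENT": [17, 15, 25, 32, 23],
    "SPARQL": [385, 987, 1048, 1128, 1465],
    "SPARQL naive": [1404, 21237, 123547, 161112, 212844],
}


def growth(series):
    """How many times slower the largest store is than the smallest."""
    return series[-1] / series[0]


def ratio_at_largest(slow, fast):
    """Ratio between two strategies at the largest store size."""
    return TIMES_MS[slow][-1] / TIMES_MS[fast][-1]


sparql_growth = growth(TIMES_MS["SPARQL"])
ident_growth = growth(TIMES_MS["DCIDENT"])
sparql_vs_ident = ratio_at_largest("SPARQL", "DCIDENT")
naive_vs_sparql = ratio_at_largest("SPARQL naive", "SPARQL")
```

At 152 million triples the tuned SPARQL query is roughly 64 times slower than identifier retrieval, and the naive query is roughly 145 times slower again. While the store grew 76-fold, the tuned query slowed by a factor of about 3.8.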

Conclusions

  • RDF has enabled the development of a centralised, cross-linked data store. The task of synching data between multiple legacy databases has been eliminated.

  • The flexibility of RDF modelling has enabled the new repository to handle heterogeneous existing and future types of metadata.

  • Industry-standard vocabularies were available for most of the model, but some custom vocabularies and extensions were needed.

  • While designing the data model for a massive triplestore, compromises have to be made - often elegance must be weighed against bloat. In a commercial context, the model may also need to include integration data.

  • The Jena-PostgreSQL combination used in this project proved to be suitable with regard to scalability and usability.

  • Querying the datastore is an essential part of every application, even the initial loading process.

  • Predictable identifiers are useful to improve query performance, but their cost must also be recognised.

  • SPARQL was found to be more suitable for real-life applications than RDQL, but can be a complex language for SQL programmers.

  • Identifier-based querying is very fast: 23 ms even with 152 million triples (under our particular test conditions). This strategy should be used where possible.

  • SPARQL query performance is also acceptable: 1.4s. SPARQL can be used for those applications for which identifiers cannot be introduced easily.

Future directions

The current stage of this project is the ongoing redevelopment of client applications, and demonstration of the benefits of RDF to customers.

The next step will be to perform validation in the loader using an OWL [OWL04] ontology.

A future goal is to introduce inferencing using OWL. Performance and memory considerations prevented the use of inferencing during this first loading stage [PPKP06].

Summary

This discussion has focused on one team's practical experience of designing and deploying a very large triple store.

First, the scope and purpose of the repository were introduced. Ingenta's core data set is academic journal article metadata, and core use cases include real time browsing and linking.

The process of modelling in RDFS was then explained, including the use of standard and custom vocabularies, and a number of compromises made due to scale.

The decision process in choosing a triplestore technology was outlined. The process of loading the store was then discussed in depth, including performance problems and workarounds.

The experience of working with SPARQL on realistic applications was discussed, before the presentation of quantitative performance testing figures.

The paper concluded with some observations and lessons learned while working on this project. It is hoped that this experience report will help others looking to deploy a large triple store within a commercial environment.

References

[Carroll05] Carroll J (2005) Re: [jena-dev] rdf:parseType with xml:lang (Jena-dev message) <http://groups.yahoo.com/group/jena-dev/message/18744>

[Dodds05] Dodds L, (2005) Introducing SPARQL: Querying the Semantic Web <http://www.xml.com/pub/a/2005/11/16/introducing-sparql-querying-semantic-web-tutorial.html>

[DC03] The Dublin Core Metadata Initiative (2003) The Dublin Core Element Set v1.1 <http://purl.org/dc/elements/1.1/>

[FOAF06] Brickley D, Miller L (2006) FOAF Vocabulary Specification <http://xmlns.com/foaf/0.1/>

[FOAF06b] Brickley D, Miller L (2006) FOAF Vocabulary Specification:Class: foaf:Person. <http://xmlns.com/foaf/0.1/#term_Person>

[Fowler04] Fowler M (2004) Shared Database Pattern <http://www.awprofessional.com/articles/article.asp?p=169483&seqNum=3&rl=1>

[Jena05] HP Labs Semantic Web Research Group (2005) Jena – A Semantic Web Framework for Java <http://jena.sourceforge.net/>

[Klyne04] Klyne G and Carroll J (2004) Resource Description Framework (RDF): Concepts and Abstract Syntax. W3C Recommendation. Section 3.4: Literals <http://www.w3.org/TR/rdf-concepts/#section-Literals>

[Kowari05] Tucana / Northrop Grumman (2005) Kowari RDF Database <http://www.kowari.org/>

[Lee04] Lee, R. (2004) Scalability Report on Triple Store Applications <http://simile.mit.edu/repository/site/reports/stores/stores.pdf>

[Manola04] Manola F and Miller E (2004) RDF Primer, W3C Recommendation <http://www.w3.org/TR/rdf-primer/>

[OWL04] McGuinness D, van Harmelen F (eds) (2004) OWL Web Ontology Language, W3C Recommendation <http://www.w3.org/TR/owl-features/>

[PRISM04] IDE Alliance (2004) PRISM: Publishing Requirements for Industry Standard Metadata 1.2. <http://www.prismstandard.org/specifications/Prism1%5B1%5D.2.pdf>

[PRISM04b] IDE Alliance (2004) PRISM:isPartOf: The described resource is a physical or logical part of the referenced resource. <http://www.prismstandard.org/specifications/Prism1%5B1%5D.2.pdf#page=73>

[Sesame05] Kampman A and Broekstra J (2005) Sesame RDF Database <http://www.openrdf.org>

[Postgres05] PostgreSQL Global Development Group (2005) PostgreSQL “The world's most advanced open source database” <http://www.postgresql.org/docs/>

[Powers03] Powers, S. (2003) Practical RDF Sebastopol, CA: O'Reilly and Associates Inc.

[PPKP06] Parvatikar P, Portwin K (2006) Scaling Jena in a commercial environment: The Ingenta MetaStore Project (Preprint)

[Prud06] Prud'hommeaux E and Seaborne A (2006) SPARQL Query Language for RDF, W3C Working Draft; Section 11.4: Operators Definitions <http://www.w3.org/TR/rdf-sparql-query/#SparqlOps>

Glossary

[bNode] A node in an RDF graph which does not itself have an identifier, but is used to aggregate other statements. RDF Primer <http://www.w3.org/TR/rdf-primer/#structuredproperties>

[Crossref] (Publishers International Linking Association) The official DOI registration agency for scholarly and professional publications. <http://www.crossref.org/>

[DOI] Digital Object Identifier. DOI is a system for identifying content objects in the digital environment. <http://www.doi.org/>

[ISSN] International Standard Serial Number. An eight-digit number which identifies all periodical publications. <http://www.issn.org>

[NLM] The National Library of Medicine <http://www.nlm.nih.gov/>

[RDFS] RDF Vocabulary Description Language 1.0: RDF Schema <http://www.w3.org/TR/rdf-schema/>

[RDQL] RDF Data Query Language <http://www.w3.org/Submission/RDQL/>

[Reification] The reification vocabulary is used to make statements about RDF statements. RDF Primer: <http://www.w3.org/TR/rdf-primer/#reification>

[SICI] Serial Item and Contribution Identifier. A variable length code which can be used to identify both print and electronic serial publications. <http://www.niso.org/standards/standard_detail.cfm?std_id=530>

[SPARQL] SPARQL Protocol And RDF Query Language <http://www.w3.org/TR/rdf-sparql-query/>

Acknowledgements

Leigh Dodds, Jonathan Greenstreet, Charlie Rapple.