
SPARQLing Services

Backed by the flexibility of the RDF data model, and consisting of both a query language and a data access protocol, SPARQL has the potential to become a key component in Web 2.0 applications. SPARQL could provide a common query language for all Web 2.0 applications.

Developing and enhancing a web service API involves dealing with the conflicting needs of end users. The push towards ever finer grained access to data must be balanced against the performance and efficiency costs of returning redundant data. By supporting a query end-point a service may let clients effectively design their own API, with corresponding performance improvements.

This paper will review the SPARQL specifications and their potential benefits to Web 2.0 applications. Focusing on the SPARQL protocol for RDF, the paper will provide implementation guidance for developers interested in adding SPARQL support to their applications.

Issues such as mapping existing data sources onto the RDF model will be introduced alongside other fundamental concerns such as efficiency and security.

Trivial but useful extensions to the SPARQL protocol, including multiple output options and integration with AJAX applications, will also be demonstrated.

Drawing on services (e.g. GovTrack, Opera Community) that already support SPARQL querying, the examples used in this paper will be grounded in real deployed applications and services.

The State of Web Services

Web services have become commonplace. RSS feeds and web services exposing both functionality and underlying data are an essential component of any site. By way of illustration, at the time of writing, ProgrammableWeb.com is tracking nearly 200 web services.

With the WS-* Wars won (simplicity prevailed), a growing community of practice is flocking to the Web 2.0 banner. That community has naturally begun looking to consolidate on common approaches and identify best practices. This activity is occurring from both the client and service provider perspectives.

On the one hand, developers are beginning to share experiences about how to build and maintain scalable web services, with an emphasis on cheap hardware and open source components. Joshua Schachter has shared his experiences with maintaining the del.icio.us API; Cal Henderson has similarly provided insights into how flickr has been built and scaled: "normalised data is for sissies".

API designers are beginning to borrow not only design elements from one another (e.g. means of authentication, or usage limits) but also entire API designs. Driven by the need to enable client-side developers to transfer skills and code between services, the ultimate aim is to reduce or eliminate "switching costs" between services that offer similar functionality or data sets.

Other concerns such as the need for stable, well-designed URLs are also well documented. The current wisdom of offering hierarchical, predictable, and hackable URLs is another subtle side-effect of minimising development overheads and enabling easy programmatic creation of deep links into a site's functionality.

An interesting phenomenon has been the rapid adoption of JSON for delivering data to browser-based clients. While XML still reigns supreme as the principal format for data transfer, the need to deliver ever more complex data structures to the browser is encouraging the serving of "ready to run" object models rather than just raw data. Avoiding the need to parse XML and build an object model on the client side not only allows some processing to be off-loaded back to the server, it also further simplifies client-side development.

With the emphasis of application development shifting towards dynamic in-browser interfaces, the web is increasingly being treated like a global database. Web services expose specialised data services and islands; folksonomies act as user-created indexes over the available content. If the web is a database then we can legitimately consider: does it need a query language?

Why Does Web 2.0 Need a Query Language?

Query languages simplify client development. They provide a useful level of abstraction that encapsulates bespoke interfaces, hides storage details, and enables easier substitution of alternative data stores. Query languages reduce switching costs.

Even with a well-designed API a client may need to make multiple requests to assemble the desired data. Client developers have little or no control over the granularity of data returned from a given service, nor can they optimise to reduce the number of requests.

From the opposite perspective, in general an API designer can only optimise for commonly requested resources, and not (as easily) for data items of interest. There is an inevitable tension between adding more data to each response and adding further specialised interfaces. More data means more overheads for all clients. More specialisation means increasing the “surface area” of an API making it more complex and adding to maintenance.

With a query language a client can design their own interface.

A simple example will illustrate these tensions. Consider the following categorisation of types of web page in an application, as suggested by Tom Coates in a recent presentation (“Native to a Web of Data”):

  • Destination Page - A core first order concept, complete with its sub-ordinate information. E.g. a film and its basic metadata;

  • List View Pages - A slice of data used to navigate between first order concepts. E.g. search results, or an actor's list of films;

  • Manipulation Interfaces - A means of manipulating one or more first order concepts. E.g. submitting a review of a film.

These page types map naturally to the core features of a web service interface. However, except for the most trivial cases, a client will need more information than is presented by a single resource. First order concepts often need context, e.g. where do they sit within (one or more) lists? Lists often need further annotation, e.g. date of publication or average review for each film. The more data there is presented in each view the greater the overheads in its generation and consumption.

This hyper-linked structure is undoubtedly the best way to organise and present information on the web of data, but it's not sufficient for all machine-machine data transfer, and it's not automatically the most efficient.

If a service offers a query interface then developers can optimise their requests to extract the relevant data of interest in a smaller number of requests, enabling clients to construct, for example, their own “list views” based on arbitrary criteria.

Service providers can benefit from monitoring common queries (and hence the importance of individual data elements) rather than commonly accessed resources. This allows an alternate and more targeted form of optimisation.

While an individual service could implement a bespoke query protocol, there will be selection pressures that encourage convergence on common mechanisms. Just as service providers are moving to conform as to the design of their endpoints, there will be similar pressure to conform to a common query language.

Implementing a query language will require a formal model, a "relational model for the web". Happily there already is one: RDF. Ignoring the oft cited, but largely unimportant issues with its syntax, the basic RDF model is very simple: There are Resources; Resources have associated data; Some data is simple literal values, the rest are relationships between resources. If you can understand a relational database or Object-Oriented Programming, you can understand the RDF model. Forget the syntax.

The RDF model also has a query language: SPARQL. And to use it, one need only understand the RDF model, and not the syntax.

Introducing SPARQL

SPARQL draws on a long history of research on query languages for semi-structured data and recent work on RDF query languages. SPARQL actually consists of three separate specifications. The core is the SPARQL query language itself, which defines the fundamentals of the language: the syntax, operators, and processing model. In addition to this the Data Access Working Group (DAWG) have also defined an XML format for serializing the results of SPARQL queries, and a protocol to transmit queries and their results across the web. This protocol is defined using WSDL and has bindings for both SOAP and HTTP; it's the latter binding that is of most relevance to this paper.

The following sections provide a basic overview of each of these specifications.

The SPARQL Query Language

In its first version SPARQL is purely a query language. The specification does not attempt to define a way to create or update RDF data sources; the emphasis is solely on data retrieval.

SPARQL is made up of four distinct query types: SELECT, ASK, DESCRIBE and CONSTRUCT. Each of these query types is specialised for a particular purpose.

DESCRIBE and CONSTRUCT queries are used to query data and extract it in the RDF/XML format. In a DESCRIBE query it's left up to the query engine to decide which information is relevant to return about each resource, while in a CONSTRUCT query the author determines the data of interest. These query forms are of most relevance when a client application is attempting to extract a subset of the available data for further local processing using RDF tools. As they are more specialised, these query forms are less likely to be relevant for Web 2.0 services.

SELECT queries will be familiar to anyone with SQL experience. SPARQL purposefully shares a lot of syntactic similarities with SQL. The following example SELECT query lists the names of all people in an RDF graph:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name
WHERE
{
  ?person foaf:name ?name.
}

The PREFIX keyword provides an XML namespace style binding of prefixes to URIs; the SELECT clause specifies the data to return in the results and the WHERE clause describes the data of interest. See "Introducing SPARQL: Querying the Semantic Web" for a fuller introduction to the SPARQL query language.

ASK queries are a specialised version of a SELECT query: they pose "true or false" questions that test whether a particular data item or data patterns are present in an RDF graph. They are a concise way to determine whether a particular service holds data of interest, e.g. are there any documents written by Leigh Dodds?

PREFIX dc: <http://purl.org/dc/elements/1.1/>
ASK WHERE
{
  ?document dc:creator "Leigh Dodds".
}

The combination of ASK, to discover whether a service holds data of interest, and SELECT to extract that data, provides a powerful toolkit for working with web services. This is particularly true given that the results of SELECT and ASK queries can be serialised using the SPARQL XML results format.

The SPARQL Query Results Format

The SPARQL Query Results XML Format is a very simple XML vocabulary that standardises the representation of the results of a SPARQL query. The format consists of a handful of elements in a single namespace. Unlike RDF syntaxes, which offer many ways to serialize the same data, the SPARQL results format is much more regular. This means it's much easier to manipulate using XSLT or other standard XML processing tools. SPARQL queries can therefore be used in contexts that support plain XML processing; no specialised RDF toolkit is required.

The following simple example illustrates the basics of the format:

<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head>
    <variable name="name"/>
  </head>
  <results ordered="false" distinct="false">
    <result>
      <binding name="name">
        <literal datatype="http://www.w3.org/2001/XMLSchema#string">sodium</literal>
      </binding>
    </result>
    <!-- more results -->
  </results>
</sparql>

Like the results of a SQL query, a SPARQL result set is tabular, consisting of a number of “rows” and “columns”. The format can be trivially transformed into an HTML table using a simple XSLT style sheet.

The root sparql element contains a head element that lists the variables ("columns") returned in the results. Its sibling, the results element contains one result ("row") for each query result. The additional elements (literal, etc.) distinguish between simple literal and resource values which are basic concepts in the RDF model.
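The structure described above can also be read with nothing more than a standard XML parser. The following is a minimal Python sketch, not part of any SPARQL toolkit; the function name parse_results is hypothetical, and it assumes the namespace shown in the example above.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the example results document above.
NS = {"sr": "http://www.w3.org/2005/sparql-results#"}


def parse_results(xml_text):
    """Return (variables, rows) for a SELECT result document.

    Each row is a dict mapping a variable name to the text of its value.
    Values are whitespace-stripped, which is an assumption made for
    convenience and would alter literals with significant whitespace.
    """
    root = ET.fromstring(xml_text)
    variables = [v.get("name") for v in root.findall("sr:head/sr:variable", NS)]
    rows = []
    for result in root.findall("sr:results/sr:result", NS):
        row = {}
        for binding in result.findall("sr:binding", NS):
            value = binding[0]  # the single child element, e.g. literal or uri
            row[binding.get("name")] = (value.text or "").strip()
        rows.append(row)
    return variables, rows
```

Each returned row corresponds to one result element, and each key to a variable listed in the head element.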

Much of the criticism levelled at RDF actually focuses on the variability of its syntax. It can be difficult to process with simple XML tools without normalization. As the above example shows, the SPARQL results format doesn't suffer from this problem.

Issuing requests to a remote query service is also trivial, as the next section illustrates.

The SPARQL Protocol for RDF

The SPARQL Protocol for RDF describes an abstract protocol that can be used to submit SPARQL queries to a remote query service, and retrieve the results. Described using WSDL 2.0, the protocol has both a SOAP and a simple HTTP binding.

The protocol itself is quite simple:

The only required parameter is "query", whose value is the SPARQL query passed to the service.

An optional (and repeatable) "default-graph-uri" parameter can be used to describe the RDF data set against which the query should be run. Services may assume a default data set, so the parameter isn't required. RDF graphs are identified by URI and may correspond to directly accessible resources on the Internet.

An optional (and repeatable) "named-graph-uri" parameter can be used to specify additional data to be referenced by the query. SPARQL defines separate processing rules for these data sources, allowing the origin or “provenance” of data to be checked in a query.

The results from a protocol request are either serialized as RDF, in the case of CONSTRUCT or DESCRIBE queries, or using the Query Results format in the case of SELECT and ASK.
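As an illustration of the HTTP binding, the following Python sketch assembles a GET request URL from the three parameters described above. The endpoint address and the function name build_query_url are hypothetical; only the parameter names come from the protocol.

```python
from urllib.parse import urlencode


def build_query_url(endpoint, query, default_graphs=(), named_graphs=()):
    """Build a SPARQL protocol GET URL.

    'query' is required; the two graph parameters are optional and may
    each be repeated, so they are passed as sequences of URIs.
    """
    params = [("query", query)]
    params += [("default-graph-uri", uri) for uri in default_graphs]
    params += [("named-graph-uri", uri) for uri in named_graphs]
    # urlencode percent-encodes the query text and graph URIs.
    return endpoint + "?" + urlencode(params)


# Hypothetical endpoint and graph, for illustration only.
url = build_query_url(
    "http://example.org/sparql",
    "SELECT ?name WHERE { ?person <http://xmlns.com/foaf/0.1/name> ?name. }",
    default_graphs=["http://example.org/people.rdf"],
)
```

A client would then issue an ordinary HTTP GET against the resulting URL and read the response in whichever results format the service returns.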

A Review of Some Existing SPARQL Services

GOVTRACK.US

GovTrack.us is a site that organises and publishes information about the United States Congress, making information on the status of federal legislation, voting records, and campaign contributions available to its users. The information is fully cross referenced, allowing easy navigation and monitoring of key legislative events. The site makes all of its data available as RDF and there is a small developer community collaborating to extend the range of data available to the state level.

Alongside the RDF exports, the site provides a SPARQL interface that can be used to query the underlying data store. At the time of writing the store holds roughly 25 million triples. A range of different vocabularies is used, including Dublin Core, FOAF, Vcard, plus a number of bespoke vocabularies that describe votes, bills, census data, etc. A fixed limit of 1000 results is applied to avoid long-running queries hogging resources.

The following query extracts a list of the titles and dates of all House Bills that are categorised under the topic "Greenhouse gases", along with the name and homepage of their sponsor.

PREFIX dc:  <http://purl.org/dc/elements/1.1/>
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX bill: <tag:govshare.info,2005:rdf/usbill/>
PREFIX foaf:  <http://xmlns.com/foaf/0.1/>
SELECT ?title ?date ?name ?homepage
WHERE {
  ?bill rdf:type bill:HouseBill.
  ?bill dc:subject "Greenhouse gases".
  ?bill dc:title ?title.
  ?bill dc:date ?date.
  ?bill bill:sponsor ?sponsor.
  ?sponsor foaf:name ?name.
  ?sponsor foaf:homepage ?homepage.
}


Opera Community

Opera Community is a website where users may sign up for free web hosting including a blog, file sharing and photo galleries. The site currently exposes a selection of its metadata as RDF, e.g. a basic FOAF profile and a photo gallery.

Using the SPARQL interface it is possible to answer questions such as show the name, blog URL and (optionally) the blog title for everyone known by Kjetil Kjernsmo (the creator of the service):

PREFIX foaf:  <http://xmlns.com/foaf/0.1/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT ?name ?blog ?blogtitle
WHERE
{
  ?user foaf:name "Kjetil Kjernsmo".
  ?user foaf:knows ?friend.
  ?friend foaf:name ?name.
  ?friend foaf:weblog ?blog.
  OPTIONAL { ?blog dc:title ?blogtitle. }
}


SPARQL.org, XMLArmyKnife, RASQAL DEMONSTRATOR

The SPARQLer, Redland Rasqal, and XMLArmyKnife SPARQL interfaces all allow queries to be executed against any web accessible RDF resource. So, unlike the My Opera and GovTrack services which query a fixed database, these services can be used to make queries against arbitrary RDF content.

The services all support the returning of results using the SPARQL XML format, JSON output, as well as delivery as a human-readable HTML document.

The SPARQL.org and librdf.org services are primarily technology demonstrators created by, respectively, the authors of the ARQ and Rasqal SPARQL query engines. The XMLArmyKnife service has been created by the author of this paper to experiment with a number of additional extensions to the SPARQL protocol. For example the service offers a SPARQL AJAX client for directly integrating queries into AJAX environments.

All of these services can easily act as SPARQL front-ends for not only existing RDF data sources but any other XML format that can be converted to RDF via XSLT. The services will load data from all URIs specified in the request, and those URIs may have been "pipelined" via the W3C XSLT query service (for example) to convert the data to RDF on the fly.

Implementing a SPARQL Service

There are a number of issues to consider when designing and implementing a SPARQL query service. The following sections review a number of the key issues in more detail. A later section introduces some useful extensions to the SPARQL query protocol that developers may also choose to support.

What Type of Service?

From the review of existing SPARQL services it's possible to derive two distinct types of services:

Fixed Data Source (FDS) Services
SPARQL services that operate on a fixed RDF graph, e.g. the database underlying an existing web service. Examples include the Opera Community and GovTrack.us services.
Arbitrary Data Source (ADS) Services
SPARQL services that do not specify the RDF graph against which queries will be executed. The service will rely on clients defining the data of interest on a per-query basis. This is achieved using either query terms (FROM, FROM NAMED) or protocol parameters (default-graph-uri, named-graph-uri). These parameters will specify data sources using URIs which will be dereferenced to fetch the RDF data. Examples include the SPARQLer and XMLArmyKnife services.

FDS Services are likely to be the most common, with the SPARQL protocol endpoint acting as a natural extension to the rest of the application's public data services.

ADS Services potentially offer a lot more power for querying across multiple resources but some additional issues become important. For example allowing clients to trigger network accesses to fetch arbitrary resources has implications for both efficiency and security.

Of course the two styles of service are extremes of a possible spectrum of behaviour. Individual services may offer a mixture of options, e.g. selecting between and combining a number of local rather than remote data sources, or allowing a query to operate against a local persistent RDF store while also mixing in remote data that is fetched “on demand”.

Rejecting Inappropriate Requests

The SPARQL protocol has some fixed guidelines that describe an "order of precedence" for deciding how to assemble the data set for a query.

If the protocol request references a data set then this must be used in preference to that described in the query, or a fixed data source preferred by the processor.

If the query itself references one or more data sources (e.g. using a FROM clause) then this must be used in preference to any service default.

Attempting to "mix and match" a data set using different methods of description is prohibited.

On the surface this may suggest that a processor cannot enforce a fixed data set. However a query processor is free to reject any request based on arbitrary criteria, including "inappropriate" data set descriptions.

To avoid ambiguity, service implementers should reject outright any request that does not refer to an acceptable range of RDF data sources. Simply applying the query to an alternate preferred data set opens the potential for confusing results.
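The precedence rules and the rejection advice above could be combined along the following lines. This is only a sketch under stated assumptions: the names choose_dataset and DatasetRejected are hypothetical, the graph URIs named in the query are assumed to have been extracted already by the query parser, and is_acceptable stands in for whatever policy the service applies.

```python
class DatasetRejected(Exception):
    """Raised when a request names a graph the service will not query."""


def choose_dataset(protocol_graphs, query_graphs, service_default, is_acceptable):
    """Pick the data set for a request using the order of precedence.

    1. Graphs named in the protocol request win.
    2. Otherwise graphs named in the query (FROM / FROM NAMED) are used.
    3. Otherwise the service default applies.
    The sources are never mixed, and a request naming an unacceptable
    graph is rejected outright rather than silently redirected.
    """
    if protocol_graphs:
        requested = list(protocol_graphs)
    elif query_graphs:
        requested = list(query_graphs)
    else:
        return list(service_default)
    refused = [uri for uri in requested if not is_acceptable(uri)]
    if refused:
        raise DatasetRejected("unacceptable data sources: " + ", ".join(refused))
    return requested
```

A Fixed Data Source service could pass an is_acceptable function that only accepts its own graph URI, so any other data set description results in an error response.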

Synthesising the Dataset

Implementing a SPARQL query service against an RDF data store is trivial. There are already a number of stable open source toolkits available. The most popular semantic web frameworks (e.g. Jena, Sesame, Redland) already provide SPARQL query engines, so the only effort required is implementing an appropriate protocol wrapper.

For services based on an existing relational back end there are several options available to convert or expose this data as RDF so that it can be queried using SPARQL.

Converting the entire database to RDF is an option but generally not feasible in practice without a large amount of additional effort to create and maintain a duplicate data set in RDF.

More promising options come from two other areas. Firstly there are signs of relational database vendors extending their products to include SPARQL query support. Oracle has recently added support for storing RDF data and SPARQL querying cannot be far behind. A project to add support for SPARQL querying to MySQL is also in progress.

The second promising avenue is from "gateway" products that provide middleware features that can map a relational schema to RDF. A good example of this kind of product is D2R. This open source application consists of D2R Map which can map a relational schema to the RDF model, and D2R Query which provides SPARQL query support. A server component capable of executing SPARQL queries against a relational back end by applying suitable mapping rules completes the functionality. As a standalone tool this provides a useful sandbox for playing with SPARQL support.

As many SPARQL implementations have been designed to work against persistent RDF storage implemented in relational databases, a final approach to synthesising existing application data as RDF is to implement appropriate "SPARQL to SQL" rewriting rules for the bespoke application schema.

In his XML 2005 presentation, “SQL, XQuery, and SPARQL”, Jim Melton concludes that:

...the design goals of SQL and SPARQL are sufficiently different that there is adequate justification for the creation of a special-purpose language for querying RDF collections. We are comforted by the belief that it is possible to translate SPARQL expressions into SQL expressions, allowing users to store their RDF collections in relational databases if they wish to do so, and to write their queries in either SQL or in SPARQL, as they see fit. While predicting that it will be similarly possible to serialize RDF collections into XML documents and transform SPARQL expressions into XQuery expressions, we do not believe that most users would take that direction.

Converting an entire application to use an RDF back end is another obvious option but may be too great a step for many developers as the ultimate benefits of SPARQL querying have yet to be conclusively demonstrated. However projects such as ActiveRDF may be able to marry the productivity of web frameworks such as Ruby on Rails with an RDF back end. The potential for this kind of architecture is enormous, enabling rapid development against a highly flexible data model.

Efficiency Considerations: Caching

If a query service is allowing queries to operate on arbitrary data sets then caching is an important pre-requisite to ensure that performance is acceptable. Several levels of caching can usefully be applied to optimise network accesses.

Short-term, in-memory caches are an effective way to reduce repeated fetches of the same resources. Use of Conditional GET HTTP requests to only refresh resources when they have actually changed is also recommended. Some RDF data sources, e.g. RSS 1.0 feeds, are more ephemeral than others, so short-term caching coupled with Conditional GET requests allows changed sources to be identified while avoiding unnecessary fetches.
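A short-term cache combined with Conditional GET might look like the following Python sketch. The class name ConditionalCache and the 60 second freshness window are assumptions chosen for illustration; the HTTP headers used (ETag, Last-Modified, If-None-Match, If-Modified-Since) are standard.

```python
import time
import urllib.error
import urllib.request


class ConditionalCache:
    """In-memory cache that revalidates stale entries with Conditional GET."""

    def __init__(self, max_age_seconds=60, clock=time.monotonic):
        self.max_age = max_age_seconds
        self.clock = clock
        self.entries = {}  # url -> (fetched_at, etag, last_modified, body)

    def conditional_headers(self, url):
        """Headers asking the server to reply 304 if nothing has changed."""
        headers = {}
        entry = self.entries.get(url)
        if entry:
            _, etag, last_modified, _ = entry
            if etag:
                headers["If-None-Match"] = etag
            if last_modified:
                headers["If-Modified-Since"] = last_modified
        return headers

    def get(self, url):
        entry = self.entries.get(url)
        now = self.clock()
        if entry and now - entry[0] < self.max_age:
            return entry[3]  # still fresh: no network access at all
        request = urllib.request.Request(url, headers=self.conditional_headers(url))
        try:
            with urllib.request.urlopen(request) as response:
                body = response.read()
                self.entries[url] = (
                    now,
                    response.headers.get("ETag"),
                    response.headers.get("Last-Modified"),
                    body,
                )
                return body
        except urllib.error.HTTPError as error:
            if error.code == 304 and entry:
                # Not modified: keep the cached body, reset its age.
                self.entries[url] = (now,) + entry[1:]
                return entry[3]
            raise
```

The freshness window keeps repeated queries against the same feed from touching the network, while the conditional request keeps the cost of revalidation low once the window has passed.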

Caching HTTP proxies provide a drop-in caching framework that can transparently improve service performance.

Longer term, disk based caching is particularly suitable when clients may access data sources that are changed very infrequently (e.g. schemas and ontologies) or where those data sources are too large to fetch on-demand.

A wide range of existing infrastructure can be deployed to implement this type of caching. E.g. a semantic web crawler may be used to build and maintain a local cache of resources to be used by the query service in preference to a remote copy. Substituting locally cached versions of resources can be achieved using proprietary mechanisms offered by a particular RDF toolkit (e.g. Jena LocationMapper), or standard XML or programming language features such as XML Catalogs and URI Resolvers.

Caching can of course also be applied to the results of commonly requested queries, avoiding the need to execute the SPARQL query until the data is known to have changed.

Efficiency Considerations: Query Costs

Even though there are a number of stable implementations of SPARQL, optimising SPARQL queries is still something of a research topic. When implementing a SPARQL service, consideration should be given to the costs that queries can impose on the underlying system. For example, it's possible to construct a simple SELECT query that will extract all triples from the underlying data set:

SELECT ?subject ?predicate ?object
WHERE
{
  ?subject ?predicate ?object.
}

This kind of query will not only hog system resources but also provides a simple way for clients to extract the entire data set from a service; this is unlikely to be desirable! More subtle problems can also be encountered, e.g. queries that attempt to run regular expression matches across large literal values may take a long time to process.

There are several simple options for imposing some limits on the resources consumed by each query.

Firstly, limits on the execution time for each query can easily be applied using time-outs. Systems backed by a relational database may support these at the database API level, but native RDF stores may not offer this feature and so bespoke coding may be required, e.g. running each query in a separate thread.
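The separate-thread approach can be sketched in Python as follows. The names run_with_timeout and QueryTimeout are hypothetical, and execute_query stands in for whatever call the chosen query engine provides. Note that Python cannot forcibly stop a running thread, so this only stops the caller waiting; the abandoned query keeps consuming resources unless the engine offers its own cancellation.

```python
import concurrent.futures


class QueryTimeout(Exception):
    """Raised when a query exceeds its allotted execution time."""


def run_with_timeout(execute_query, query, timeout_seconds):
    """Run execute_query(query) in a worker thread, giving up after a time-out."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(execute_query, query)
    try:
        return future.result(timeout=timeout_seconds)
    except concurrent.futures.TimeoutError:
        # The worker thread is left to finish in the background.
        raise QueryTimeout("query exceeded %s seconds" % timeout_seconds)
    finally:
        # Do not block the caller waiting for an abandoned query.
        executor.shutdown(wait=False)
```

A service would catch QueryTimeout and return an error response to the client rather than a partial result.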

Secondly, imposing an upper limit on the number of results returned by a single query avoids the need to both process and serialise a large amount of data. SPARQL offers the LIMIT keyword to impose an upper limit on the number of results returned by a query. Implementing this mechanism therefore simply involves manipulating the incoming query to add a suitable LIMIT if not already specified.
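A rough sketch of this query manipulation is shown below, in Python. It is text based and intended for SELECT queries only; it assumes the solution modifiers follow the final closing brace and that no comment or string literal after that brace contains the word LIMIT. A production service would more safely use its query engine's parser. The function name enforce_limit and the default ceiling of 1000 are illustrative choices.

```python
import re

LIMIT_PATTERN = re.compile(r"\bLIMIT\s+(\d+)", re.IGNORECASE)


def enforce_limit(query, max_results=1000):
    """Append a LIMIT, or cap an existing one, on a SELECT query."""
    # Only inspect the solution modifiers after the last closing brace.
    tail_start = query.rfind("}") + 1
    head, tail = query[:tail_start], query[tail_start:]
    match = LIMIT_PATTERN.search(tail)
    if match is None:
        return head + tail.rstrip() + "\nLIMIT " + str(max_results)
    if int(match.group(1)) <= max_results:
        return query  # the client already asked for an acceptable number
    # Replace only the number, leaving any OFFSET clause untouched.
    capped = tail[:match.start(1)] + str(max_results) + tail[match.end(1):]
    return head + capped
```

Capping rather than rejecting over-large limits keeps the service usable for clients that simply did not know the ceiling.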

Experience at Ingenta has also shown that even for semantically equivalent SPARQL queries, which differ only in some minor aspects of their query terms, a relational database query optimiser may choose radically different query plans. Experimenting with alternate query forms and monitoring incoming queries will be essential to expose these kinds of issues.

Security and Privacy

Privacy is an issue no matter how data is published onto the web, and providing direct query access to a database makes it a particular concern. For example: could a client extract users' email addresses or other personal details? A traditional web service may choose to obfuscate email addresses, or simply omit key information in the responses to its API calls. But a SPARQL query can potentially fetch any item of data. A query can even probe the kinds of properties held in the store, so it is quite simple to determine the range of data available before accessing it.

With this in mind implementers have several options. The first is to store the private data in a separate RDF data store, or otherwise avoid mapping it into RDF so it is not accessible to SPARQL queries.

Alternatively, incoming requests can be "sniffed" to determine whether they involve expressions or triple patterns that match against the sensitive data. Such queries can then be rejected as unsuitable.
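One crude way to "sniff" a query is to expand its prefixed names and compare every URI it mentions against a list of sensitive properties, as in the Python sketch below. The property list and function names are hypothetical, and the regular expressions are a simplification of the SPARQL grammar. Importantly, this check does not catch patterns that use a variable in the predicate position (such as ?s ?p ?o), so on its own it is not a complete defence.

```python
import re

# Hypothetical list of properties the service does not want to expose.
SENSITIVE_PROPERTIES = {
    "http://xmlns.com/foaf/0.1/mbox",
    "http://xmlns.com/foaf/0.1/mbox_sha1sum",
}

PREFIX_PATTERN = re.compile(r"PREFIX\s+([A-Za-z][\w-]*)?:\s*<([^>]*)>", re.IGNORECASE)
IRI_PATTERN = re.compile(r"<([^<>\s]*)>")
QNAME_PATTERN = re.compile(r"(?<![\w<])([A-Za-z][\w-]*)?:([A-Za-z_][\w-]*)")


def referenced_iris(query):
    """Collect the full URIs a query mentions, expanding prefixed names."""
    prefixes = {prefix: iri for prefix, iri in PREFIX_PATTERN.findall(query)}
    iris = set(IRI_PATTERN.findall(query))
    # Remove bracketed URIs first so their contents are not read as prefixed names.
    without_iris = IRI_PATTERN.sub(" ", query)
    for prefix, local in QNAME_PATTERN.findall(without_iris):
        if prefix in prefixes:
            iris.add(prefixes[prefix] + local)
    return iris


def is_query_allowed(query):
    """Reject queries that explicitly name a sensitive property."""
    return referenced_iris(query).isdisjoint(SENSITIVE_PROPERTIES)
```

Because of the variable-predicate gap, a check like this is best treated as a complement to keeping private data out of the queryable store, not a replacement for it.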

Arbitrary Data Source services have another potential security loop-hole, as noted in Section 3.1 of the SPARQL Protocol specification.

To assemble the data source for a given query, a service may have to dereference one or more URIs. This means that a single SPARQL query may result in one or more additional HTTP requests to other servers. A SPARQL service could therefore potentially be used as a means of carrying out a denial of service attack against another system. The same is true for any web service that allows clients to trigger arbitrary network requests. Care should be taken to close this loophole. The most secure option is to restrict queries to locally cached data only, i.e. implement a Fixed Data Source service.

Alternatives include monitoring the rate and origin of incoming requests to attempt to identify abuses. Tools such as mod_throttle can be used to implement throttling of requests at the Apache level, thereby limiting the ability of clients to abuse the service. Tracking the number of requests made from a given IP address can be an effective measure. Web services now implement a range of mechanisms, e.g. requiring API users to register for a "key" to use the service, to track the origin of requests and perhaps impose limits on the number of requests per hour or day. See “7 Ways to Limit API Use” for a fuller list of alternatives.
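Where throttling has to be done in the application rather than at the web server, a per-client sliding window is one simple option. The following Python sketch is illustrative only; the class name RequestThrottle is hypothetical, and the client identifier could be an IP address or an API key.

```python
import collections
import time


class RequestThrottle:
    """Allow at most max_requests per client within a sliding time window."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock
        self.history = collections.defaultdict(collections.deque)

    def allow(self, client_id):
        """Record a request and report whether it is within the limit."""
        now = self.clock()
        recent = self.history[client_id]
        # Forget requests that have fallen out of the window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_requests:
            return False
        recent.append(now)
        return True
```

For example, RequestThrottle(100, 3600) would permit each client 100 queries per hour; refused requests are not counted against the client.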

The SPARQL protocol specification suggests that services may wish to record the URIs which they are asked to retrieve. This allows some simple usage tracking to identify commonly requested resources. As well as identifying potential abuses or security issues, this information will help identify resources that might usefully be added to a local cache.

Extending the SPARQL Protocol

XSLT Post-Processing

An option to apply an XSLT transformation to retrieved data is a useful addition to any web service. As already noted, the SPARQL Query Results Format is very amenable to processing with XSLT to generate alternate formats. The results specification itself references a transformation to a simple HTML table. Conversions to any equivalent format should be trivial.

Other sample style sheets have been produced by the community, e.g. Morten Frederiksen's conversion to an RSS 1.0 feed. It's likely that the number of generic reusable style sheets will grow. They are relatively simple to produce, relying not only on the simplicity of the results format, but also on common naming practices for variables (or "columns") returned by a query.

For example, Frederiksen's conversion to RSS extracts the link and title for each RSS item by inspecting the results for variables named "rsslink" and "rsstitle". Query authors need only "shape" their queries to match the expectations of the style sheet in order to generate RSS. A similar approach can be applied to create other formats.

Services that decide to support this post-processing option should allow clients to specify not only the XSLT style sheet to apply, but also the output format (i.e. mime type) for the results. Style sheets may also require additional parameters to produce correct output (e.g. the URI and title of the RSS channel as used in Frederiksen's style sheet). Services should therefore also make any non-protocol request parameters available as parameters to the XSLT style sheet.

Applying XSLT to the results of a query could also help deal with one of the primary limitations of SPARQL: the lack of aggregate functions. Currently SPARQL lacks the ability to count results, calculate minimums, maximums, etc. XSLT 2.0 provides the necessary features that would allow post-processing of results to generate aggregated data. The final results can still be returned using the SPARQL standard response format. The process may not be as efficient as a proprietary SPARQL extension, but there is as yet no organised effort to standardise extension functions across query engines.

The JSON Output Format

Not surprisingly there has been significant early interest in integrating SPARQL querying with AJAX environments. Damian Steer and Libby Miller were the first to implement client-side AJAX processing of SPARQL query results, and others have quickly followed suit. Early implementations relied on client-side XML processing of the SPARQL results.

However, several services have been experimenting with JSON serialisations of SPARQL results, which allow much easier AJAX integration. As a result, a draft specification has been produced to standardise JSON as an alternative serialisation of SPARQL results, including registration of an appropriate mime type. Both the Jena and Redland frameworks have already been extended to support this serialisation. A recent SPARQL JavaScript library from Lee Feigenbaum and Elias Torres uses the JSON format to simplify the parsing of results.
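To show why the JSON serialisation is easier to consume, here is a minimal sketch that flattens a result document into plain rows. It assumes the head/vars and results/bindings layout used by the W3C JSON results format; the draft discussed above may have differed in detail. The sample data and the function name rows are illustrative only, and Python is used here in place of browser-side JavaScript.

```python
import json

# Hypothetical JSON results for a query selecting ?name and ?homepage.
SAMPLE_JSON_RESULTS = """
{
  "head": {"vars": ["name", "homepage"]},
  "results": {"bindings": [
    {"name": {"type": "literal", "value": "Alice"},
     "homepage": {"type": "uri", "value": "http://example.org/alice"}},
    {"name": {"type": "literal", "value": "Bob"}}
  ]}
}
"""


def rows(results_json):
    """Flatten each binding set into a variable -> value mapping.

    Variables left unbound in a row are reported as None.
    """
    doc = json.loads(results_json)
    variables = doc["head"]["vars"]
    return [
        {v: (binding[v]["value"] if v in binding else None) for v in variables}
        for binding in doc["results"]["bindings"]
    ]
```

No XML parsing or namespace handling is involved, which is the main attraction for script-based clients.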

The combination of SPARQL querying with an AJAX environment has already given birth to a number of interesting demonstration applications, e.g. the SPARQL calendar demo. The combination of these technologies has a lot of potential, placing the power of semantic web technology within reach of Web 2.0 developers in a technology environment with which they are already familiar.

Query By Reference

At present the SPARQL protocol requires that the query be transmitted within the request itself; a client cannot publish the query as a separate web resource and then indicate it "by reference" in a protocol request. This option was available in early drafts of the specification but was subsequently removed. The DAWG are awaiting further implementation experience before revisiting this decision in a later specification revision.

There are several advantages to being able to query by reference:

  • as queries can be identified by URI, query processors may more easily recognise commonly used queries and cache them locally, avoiding the need to transmit and re-parse the queries.

  • publishing queries as web resources allows the community to share common queries more easily. In many cases queries will be generic so there is a high potential for reuse.

Querying by reference also opens up some interesting possibilities, as it allows queries to be generated on the fly rather than being pre-computed. Queries may then be tailored for particular users or tweaked based on known characteristics of the data.

It is suggested that services that decide to offer query by reference should standardise on the query parameter "query-uri" to follow the general naming convention used in the SPARQL protocol.
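A request using the suggested parameter might be constructed as in the sketch below. The "query-uri" parameter is the extension proposed above and is not part of the SPARQL protocol; the endpoint and query URIs are hypothetical.

```python
from urllib.parse import urlencode


def query_by_reference_url(endpoint, query_uri, default_graph_uri=None):
    """Build a GET request URL that names the query by URI.

    "query-uri" is the non-standard parameter suggested in the text;
    "default-graph-uri" is a standard SPARQL protocol parameter.
    """
    params = [("query-uri", query_uri)]
    if default_graph_uri:
        params.append(("default-graph-uri", default_graph_uri))
    return endpoint + "?" + urlencode(params)


# Hypothetical endpoint and published query.
EXAMPLE_URL = query_by_reference_url(
    "http://example.org/sparql",
    "http://example.org/queries/recent-posts.rq",
)
```

A service receiving such a request would fetch the query text from the given URI (or from its own cache) before executing it as usual.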

Active Query Services

Arbitrary Data Source services may offer additional value-added features for clients. One useful option is to maintain the data being used as server-side state, e.g. within a session, or using a temporary URI from which the data may be retrieved. The data source may then be manipulated or otherwise enhanced by the query service. For example RDF inferencing may be applied to the data set prior to executing queries. Clients may also wish to specify rules or ontologies to be applied to data before it is used.

Services may also wish to expand the available data by inspecting it for known properties such as rdfs:seeAlso links to include additional relevant data. This technique has been used successfully within the SPARQL calendar demo to follow references to calendaring data from FOAF documents. The same technique can be generalised to any data source.
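The link-following technique can be sketched in a few lines. This is not the implementation used by the SPARQL calendar demo; it is a minimal illustration that represents triples as plain tuples and takes the document-fetching step as a caller-supplied function, so that no particular RDF toolkit is assumed. The sample documents, the max_documents limit and the function name are all hypothetical.

```python
RDFS_SEE_ALSO = "http://www.w3.org/2000/01/rdf-schema#seeAlso"


def expand_see_also(start_uri, fetch, max_documents=10):
    """Collect triples from start_uri and any documents it links to.

    fetch(uri) must return an iterable of (subject, predicate, object)
    tuples. Each document is loaded at most once, and max_documents
    bounds how far the crawl may spread.
    """
    seen = set()
    queue = [start_uri]
    triples = []
    while queue and len(seen) < max_documents:
        uri = queue.pop(0)
        if uri in seen:
            continue
        seen.add(uri)
        for s, p, o in fetch(uri):
            triples.append((s, p, o))
            if p == RDFS_SEE_ALSO and o not in seen:
                queue.append(o)
    return triples


# Hypothetical documents: a FOAF file pointing at calendar data, which
# in turn points back at the FOAF file.
SAMPLE_DOCUMENTS = {
    "http://example.org/foaf.rdf": [
        ("http://example.org/alice", RDFS_SEE_ALSO, "http://example.org/cal.rdf"),
    ],
    "http://example.org/cal.rdf": [
        ("http://example.org/event1", "http://example.org/summary", "Meeting"),
        ("http://example.org/event1", RDFS_SEE_ALSO, "http://example.org/foaf.rdf"),
    ],
}
```

Tracking visited documents keeps circular seeAlso references from looping, and the document limit is one simple way for a service to bound the cost of a request.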

Not all data is available as RDF, but services may wish to allow SPARQL querying against this data by automatically converting known formats to RDF. Examples include converting iCal data to RDF Calendar or Atom to RSS 1.0, and extracting microformat data from XHTML documents by applying GRDDL transformations.

Conclusions

The current trends in web application development have created selection pressures that are driving an increase in uniformity across services. By offering a SPARQL query service interface onto their data, application developers can offer a highly flexible and standard mechanism for querying data across the web.

Its ease of integration with AJAX environments makes SPARQL a strong candidate to become the standard query language of the web.

Bibliography

The full set of references for this paper, along with other relevant additional reading material can be found on del.icio.us, under the tag “xtech-06-sparqling-services”.