RSS and Atom feed icon News feeds

Ditching the Database: XML and the PHP web app

  • David Megginson, Megginson Technologies Ltd.

PHP web applications often start as self-contained pages. Each page is its own script, processing GET or POST parameters, opening database connections, executing queries and updates setting response headers, and generating the response payload. As much as this might make a Java or Ruby developer's skin crawl, it's still hard not to admire the sheer simplicity. Think of all the files you have to configure and all the directories you have to create to get going with big frameworks like J2EE, Struts, Hibernate, Spring, Rails, etc. – by contrast, with PHP you simply upload a file with the extension .PHP and magic happens. You can mix PHP files in with your regular static HTML files instead of segregating them in a special hierarchy or a CGI directory. You don't have to partition your site into separate applications as with Rails or J2EE – instead, you just drop PHP files in here and there like a bit of spice.

Illustration 1: A simple PHP web application: every page view causes database activity.

For most PHP applications – in fact, for most web applications of any kind – there's no catch. Granted, a naive PHP script like the one in Illustration 1 will trigger database activity with every page view, but if we're talking about a few views each hour (or even each minute) , who cares? Still, a very small number of PHP applications will get lucky enough that their performance does start to matter. Once a site gets up to multiple hits each second, opening a new database connection with each page view becomes a huge problem; even with a connection pool, executing four or five queries to construct each page will generate a lot of activity, and that's especially serious when you consider the fact that most PHP applications run on shared servers, where other applications are competing for resources.

If the web application is write-heavy – in other words, if it accepts user input and updates its database almost constant, as is the case with a chat room or email application – then there's not much choice left but to start adding hardware, moving to a dedicated server, then to separate database and application/web servers, and then on to clusters, load balancers, and the whole kit. If the application is read-heavy, however – in other words, if users spend most of their time viewing information without adding or modifying anything, as is the case with a weblog or news site – then introducing caching can postpone or even eliminate the need to gear up to handle more traffic.

Some kinds of caching are almost free. As long as you serve your side-effect-free pages using HTTP GET with unique, persistent URLs and intelligent time-outs, systems all over the web, from big ISPs to the user's desktop web browser, will save copies of your pages and serve up the copies for you – this process is most effective, of course, for pages that do not require authentication and authorization. At the next level, you yourself can save copies of the HTML pages you generate with PHP and serve out only the copies until something changes – PHP packages like Smarty have built-in support for this kind of caching, but it's worth mentioning that as soon as you start to use one of these packages, you're moving back towards creating lots of special directories, configuration files, and other artefacts that PHP originally let you avoid. Other tactics include automatically compiling PHP pages to a faster format to save CPU time, and caching database query results – these can both be transparent or even invisible to PHP scripts, making them much more attractive.

Now it's time to introduce XML into the problem. Caching HTML saves a copy of the final rendition of a page. That rendition might pull together information from many different sources, such as a user's account record, lists of new or popular items, visitor comments, primary content, and so on. Some of these change frequently, while others change very rarely – since they're all combined on the HTML page, however, all of the information has to be refreshed with yet another set of queries to the database whenever any of it changes. One solution to this problem is to use XML to cache information that does not change very often. For example, you can recalculate the top ten items every five minutes and save the result in XML, bringing a huge reduction in database activity. Caching HTML fragments would also reduce database activity if the application needed the information only for rendering; however, where XML really comes into its own is for caching operational information that the application needs to process as well as render, such as options in a user's profile or a list of subscriptions. SQL query caching can bring some of the same benefits, but SQL query optimization is already baroque enough without trying to fine-tune caching, and the XML saves the application the trouble of repeated reconstructing a data record from several sets of query results.

Illustration 2: PHP web application with an XML caching layer added.

If you take this idea further – and let yourself wander away from the simplicity that attracted you to PHP in the first place – you could isolate the application from the database completely by adding an XML component in the middle, as shown in Illustration 2.

The XML layer adds inefficiencies by requiring a double transformation: first relational data to XML, then XML to HTML. However, if the SQL queries are sufficiently expensive and repetitive, then caching their results as XML data structures could save far more time than the additional transformation step takes.

At this point, however, it's worth asking why the database is still in the diagram. If XML is capable of representing all of the application's backend information, and if the application already writes all its information out as XML and reads it back in as XML, why not just store it as XML and ditch the database, as shown in Illustration 3? After all, it's not the PHP way to add layers and abstractions.

Illustration 3: PHP web application without a database.

Depending on the kind of information your application is storing, and on how it needs to manipulate that information, using XML might even be easier than querying and update SQL databases. Relational databases handle tabular information simply and elegantly, and they handle unordered, repeating information reasonably well (provided that the developer has advanced database skills), but when you move to nested, ordered information, SQL queries can become absurdly complex and performance becomes unpredictable, requiring long sessions of performance measurement and query optimization. Consider even a simple user profile in XML:

<user>
  <id>dpm</id>
  <fullname>David Megginson</fullname>
  <permissions>
    <section ref="culture">
      <permission>read</permission>
      <permission>write</permission>
    </section>
    <section ref="technology">
      <permission>read</permission>
      <permission>moderate</permission>
    </section>
  </permissions>
</user>

Storing this information in a relational database will require about five tables (maybe more if you're strict about using third normal form):

  1. A table to hold basic, non-repeating user information (id and full name).

  2. A table to hold basic section information (id).

  3. A table to associate users and sections in a one-to-many relationship.

  4. A table to list all possible types of permissions.

  5. A table to associate sections and permissions in a one-to-many relationship.

It will be necessary to execute several queries to join all of these tables, and then merge the query results to build a single data record for the user profile. Real-life examples are typically much more complicated than this, and real-life production databases often end up heavily denormalized to reduce the amount of searching and joining required.

Of course, using XML instead of a database, rather than together with a database, does not make every problem go away; in fact, it introduces some significant new problems, some of them huge. Instead of simply giving a hypothetical list, however, I decided to try an informal experiment by building a real (if simple) web application in PHP to see how far I could get without a database, and what the biggest problems (and benefits) might be. The application is called Sheets: it allows users to define simple data structures, and then use a forms interface to fill in those data structures. A sheet's maintainer may authorize other people to help fill in the sheet, and anyone interested can view the sheet (in HTML or XML) and monitor changes through an RSS interface.

The first problem that came up in working with XML in PHP was, quite simply, working with XML in PHP. There are several XML interfaces available for PHP5. Some are streaming APIs, which I did not want to use for this application (I expect XML documents to be small, and would rather hold them in memory). Some try to make XML look like regular PHP data structures, but they are not read-write, and since XML allows multiple elements with the same name (as well as element attributes), it does not map all that cleanly and naturally to PHP arrays. In the end, I decided to work with the venerable DOM interface, which has the advantage (in PHP) of including XPath support for easy addressing. The problem is that using the DOM is fiendishly awkward and verbose. Consider the following XML fragment:

<player>
  <name>David</name>
  <score>10</score>
</player>

Now, assume that you have access to this XML document through a DOM interface in PHP, and you want to perform a simple operation: increase the score by one. How hard can that be?

// Get the 'score' element (assume that there's only one)
$nodeList = $dom->getElementsByTagName('score');
$scoreNode = $nodeList->item(0);
$score = '';

// Concatenate all child text nodes to get the element's content.
$child = $scoreNode->firstChild;
while ($child) {
  if ($child->nodeType == XML_TEXT_NODE) {
    $score += $child->nodeValue;
  }
  $child = $child->nextSibling;
}

// Increase the score by one.
$score = $score + 1;

// Remove all existing child nodes.
$child = $scoreNode->firstChild;
while ($child) {
  $nextChild = $child->nextSibling;
  $scoreNode->removeChild($child);
  $child = $nextChild;
}

// Create a new text node with the new score.
$textNode = $scoreNode->ownerDocument->createTextNode($score);
$scoreNode->appendChild($textNode);

That's a bit excessive, especially when you compare it to the PHP code to perform the same operation on a SQL database:

$row = $connection->getRow('select score from Players where id=?',
                           array($id), DB_FETCHMODE_ASSOC);
$score = $row['score'];
$score++;
$connection->execute('update Players set score=? where id=?',
                     array($score, $id));

(Note that neither of these examples includes error checking.) Can using XML in PHP be this easy? Yes, but to accomplish that, we'll need to provide a few helper functions to automate the tedious tasks:

$scoreNode = domGetNode($dom, '//score');
$score = domGetValue($scoreNode);
$score++;
domSetValue($scoreNode, $score);

Or even more succinctly, at the cost of an extra XPath query,

domSet($dom, '//score', domGet($dom, '//score') + 1);

If working with XML can be so simple, why is the DOM so complicated? XML structures for storing data are generally fairly simple, but the DOM has to be able to deal with all aspects of XML, including features designed for document publishing (such as mixed content), fossils from SGML (such as entities and notations), and extra lexical information that might need to be preserved by authoring tools (such as comments and document type declarations). A XML interface designed only for data records can dispense with most of this and present a vastly simpler view of an XML document; to do so, however, it's necessary to make a few assumptions:

  • All that matters in an XML document is elements, attributes, and text.

  • There is no mixed content: each element in a document is either a leaf node (containing only text), or a branch node (containing other elements).

  • Each element node has a simple attribute map attached.

  • All entity references are fully expanded, and do not have to be preserved.

And that's it. These assumptions can form the basis for a set of simple DOM helper functions like the ones shown earlier, for an object-oriented wrapper around a DOM interface, or for an entirely independent interface. Whatever the approach, reading and writing XML data in a PHP application should be no harder than this:

$user = xdataLoad("../../data/users/$userId.xml");
$user->set('//score', $user->get('//score') + 1);
$user->save("../../data/users/$userId.xml");

It turns out that this approach works well for Sheets, at least in a single-server environment – when a user creates a new account, the details are POSTed to a PHP script do-account-create.php, and that script creates a file u/userId.xml containing the user's information (and another file outside of the web hierarchy for private information). The page describing a user loads the XML profile once and uses it to populate an HTML page. The PHP script that generates a form for editing a user profile pre-populates an HTML form with the values from the XML file and POSTs updates to the do-account-update.php script, which modifies and replaces the XML file. The same thing works for sheet definitions, sheet data, tags, and all other application information. Anything that would be looked up by a single primary key in a database can just as easily be retrieved using an XML file name – the code is shorter and simpler, and there is no database overhead (and minimal processor overhead for such short documents).

Sheets shows that it is possible to build a complete, working web application without a database, using XML to save and manage the application's data. Given an appropriate DOM helper library, the code is generally simpler than the equivalent code for working with a database. It also seems to run fast. So was the experiment a success? Is there any downside to ditching the database? It's worth taking a look at what I've had to give up:

  • transaction and locking support (but the XML is the transaction, and file locking is good enough)

  • denormalized information (but third normal form is inefficient anyway, and even database-driven applications tend to use a lot of extra code)

  • no search ability (but in real life, most database fields do not need to be searched – they exist only to allow joins; for the few that are worth searching, I could use a hash file or a B-tree file)

  • no support for clustering (but I could use file replication or a shared file server/SAN, as long as it's fast and supports file locking)

  • no easy way to perform ad-hoc searches looking for bugs or collecting statistics (but I can search the whole thing as a collection of XML documents)

  • no data typing (XML can support data typing, but why bother when PHP doesn't really support it anyway?)

  • no two-phase commit (but Sheet's isn't a bank)

There's an answer to every problem in the list, but every answer means a bit more work – more code means more bugs, and at some point using XML can become harder and slower than using a database. Sheets isn't there now – and I'm a firm believer in building for today – but I do expect that soon the application will need features where a database will be the most sensible solution. For example, more sophisticated support for tags, including tag clouds, would be much simpler with a relational database. Right now, adding a new field to every user profile will require taking down the system and running a transformation on every XML file – if I had hundreds or thousands of users, that would be unpleasant. Adding a new field in a relational database using SQL is almost trivially easy by comparison.

For real applications, XML makes sense as an alternative to the database for representing data structures that do not need to be searched, broken apart, and recombined – for example, it's not likely that your application will have a pressing need to show pages listing every user with the first name “Susan” and a yellow background colour on her home page – but it might not make sense for data that is truly relational in spirit.

There are other good arguments for keeping XML as a middle layer, even when the database is the final information store, and these go beyond the simple performance benefits of caching. Consider what happens when I make a tiny change to the application by moving my cached XML files from a private directory to one visible over the web, allowing people and other web applications to download the XML data directly, bypassing my HTML application. With only a little extra work, I can also accept new or updated XML documents via HTTP POST. Illustration 4 shows how the architecture has changed.

Illustration 4: Exposing your XML to the web.

Of course, some discretion is necessary. For example, it's probably OK to publish this XML data document online:

<user>
  <id>jsmith</id>
  <name>Jane Smith</name>
  <home-page>http://www.example.org/jsmith/homepage</home-page>
</user>

But I doubt that it's OK to publish this one:

<user>
  <id>jsmith</id>
  <password>pokemon2</password>
  <email>jsmith@example.org</email>
  <credit-card>
    <type>VISA</type>
    <number>1234 5678 9012</number>
    <expiry>06/06</expiry>
  </credit-card>
</user>

Sensitive information will have to either stay in the database or be segregated into a private directory that's not served over the web (or that requires authentication to download). Getting user information will then require loading two XML documents rather than one. For the most part, however, as soon as you expose your (sanitized) XML cache to the web, you get Web 2.0 support for free – other sites can now do clever things with your data, such as search and index it, mash it up with Google or Yahoo maps, and so on.

It is possible to imagine a future where your PHP application serves XML exclusively, without worrying at all about HTML rendering, browser sniffing, session management, and all the other joys that go with the presentation layer. Other applications – possibly written using other technologies – can worry about the details of rendering, while the presentation applications do not have to worry about the details of storage and querying, as shown in Illustration 5.

Illustration 5: Ditching the presentation layer.

One interesting variation of this pattern moves the layout and presentation entirely onto the user's desktop, using JavaScript, DOM, and CSS in the web browser. The web application serves out XML, together with some static HTML, CSS, and JavaScript pages that contain rendering instructions. The user's browser does all the rendering on the desktop, using the user's computer, and sends information back as XML when needed. Some people would call this approach “AJAX”, but the A-part – “asynchronous” – is not necessary here. The browser application starts with a synchronous load of any needed XML data from your application, then renders the page, handles user interaction, and returns XML to the application, much like the load-edit-save cycle in a traditional desktop application. The XML server-side application can serve out mostly static XML files, resulting in an extremely low load even under heavy traffic, and the user's computers distribute the work of combining and rendering the XML data structures.

The trick is to maintain two or more DOM trees in the browser, one for the HTML rendering tree, and one for each of the XML source documents (or at least, for the ones that need to be modified and returned). Working with the DOM in JavaScript is just as awkward and verbose as working with the DOM in PHP, some some helper functions will be needed to simplify the work.

How likely is it that the future of the web will look like this, with fat browser applications handling rendering, while simple, high-volume server-side applications send and receive XML data? There are reasons to be sceptical – a neat separation of concerns like this one almost never works in real life, and a whole new generation of browsers in wireless devices (especially in the developing world) does not support this kind of client-side work. Still, if we do start to see more of a division of labour between applications that manage data and applications that provide user interfaces, then some of the big elaborate server-side frameworks currently in use may be dead ends. Maybe it's not so much a question of ditching the database as ditching all the other dead wood.