A high performance RDFS store using a Generic Object Model
Introduction
The ability to consider multiple data sources simultaneously greatly enhances the ability to perform threat analysis and disruption of terrorist activities. The simultaneous consideration of multiple data sources with disparate schemas requires the ability to perform semantic alignment, often at a very large scale. In addition to concept merging, merging must be possible at the instance level, since individuals and other entities often occur in many or all data sources being considered. This merging, or “federation,” must maintain provenance information for the data so that information can be traced to its source even after being brought into federation. These requirements can be best met using Semantic Web technologies, specifically RDF Schema, specific constructs in OWL, and support in RDF query languages for provenance tracking and maintaining trust boundaries.
RDF Schema (RDFS) [BRI04] is a semantic extension of the Resource Description Framework (RDF) [KLY04] that allows for subclass/subproperty relations and domain/range typing on predicates. The RDF Semantics [HAY04] document gives a model-theoretic account of the semantics of RDFS and also provides a set of entailment rules that are sound and complete for that semantics. These entailment rules are often used to guide the implementation of RDFS reasoners, especially those based on rule engines, whether forward chaining, backward chaining, or mixed. The choice of chaining strategy has a noticeable impact both on overall performance and on when the cost of reasoning is paid: on insert for forward chaining and on query for backward chaining.
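To make the forward chaining case concrete, the following is a minimal sketch in Python (chosen for brevity; it is not the reasoner described in this paper). It applies a subset of the RDFS entailment rules (rdfs2, rdfs3, rdfs5, rdfs7, rdfs9, and rdfs11 in the numbering of [HAY04]) until no new triples appear. The triple representation, the prefixed-name strings, and the function name `rdfs_closure` are assumptions made for illustration, and the restrictions on literals in the full rule set are ignored.

```python
# Illustrative sketch only: naive forward chaining over a subset of the RDFS rules.
RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"
SUBPROP = "rdfs:subPropertyOf"
DOMAIN = "rdfs:domain"
RANGE = "rdfs:range"


def rdfs_closure(triples):
    """Return the closure of `triples` under rdfs2, 3, 5, 7, 9 and 11."""
    closure = set(triples)
    while True:
        new = set()
        for (s, p, o) in closure:
            for (s2, p2, o2) in closure:
                # rdfs11: subClassOf is transitive.
                if p == SUBCLASS and p2 == SUBCLASS and o == s2:
                    new.add((s, SUBCLASS, o2))
                # rdfs5: subPropertyOf is transitive.
                if p == SUBPROP and p2 == SUBPROP and o == s2:
                    new.add((s, SUBPROP, o2))
                # rdfs9: (x type C), (C subClassOf D) => (x type D).
                if p == RDF_TYPE and p2 == SUBCLASS and o == s2:
                    new.add((s, RDF_TYPE, o2))
                # rdfs7: (x p y), (p subPropertyOf q) => (x q y).
                if p2 == SUBPROP and p == s2:
                    new.add((s, o2, o))
                # rdfs2: (x p y), (p domain C) => (x type C).
                if p2 == DOMAIN and p == s2:
                    new.add((s, RDF_TYPE, o2))
                # rdfs3: (x p y), (p range C) => (y type C).
                if p2 == RANGE and p == s2:
                    new.add((o, RDF_TYPE, o2))
        if new <= closure:
            # Fixed point reached: nothing further can be inferred.
            return closure
        closure |= new
```

Here the whole cost of reasoning is paid when the closure is computed, which corresponds to paying on insert; a backward chainer would instead leave the store unchanged and evaluate the same rules at query time.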
These strategies are present in various architectures being explored within the Semantic Web community for building RDF and RDFS databases with secondary storage. These architectures have emerged from a variety of backgrounds including Prolog (SWIPL), datalog (Kaon-1, Kaon-2), relational (Sesame RDBMS SAIL, 3-store, and the Oracle database), and “custom” RDF store architectures (YARS, Kowari).
This paper reports on an approach to building an RDFS database using the Generic Object Model (GOM) [CUT04]. The GOM was created to support general purpose application programming and scalable orthogonal persistence. Although originally intended to enhance developer productivity, the GOM also exhibits excellent performance and scaling characteristics. It provides a good contrast to existing approaches and helps us to explore the suitability of various architectures.
Finally, we present some results from a study testing the impact of using federated data sets to detect threat actors in synthetic data.
Generic Object Model
The Generic Object Model (GOM) is an OODBMS application framework. It sits above a persistence layer and provides an API which is directly useful for writing applications. GOM supports persistent objects that are extensible in both data and behaviour; one-to-one, many-to-one, and many-to-many associations among persistent objects; and indexed access to collection members. The basic linking mechanism is a bi-directional many-to-one association. One-to-one links are simply constrained many-to-one links. Many-to-many links are constructed using an intermediate linking object. In UML models [OMG06a] the multiplicity of the association ends is always indicated. Composition is used to indicate a life cycle constraint. Access from either association end is achieved using the association name. Names of association ends are ignored, but may play a role later in Model Driven Architecture [OMG06b] tooling for the GOM.
GOM is well-suited to modelling graph data. The ability to define arbitrary associations between objects in an ad-hoc fashion and have the index structures of these link sets managed automatically frees the application developer to concentrate on architecture and strategy instead of worrying about mundane tactical considerations such as the impedance mismatch between a graph model and a set of relational tables.
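The linking mechanism described above can be illustrated with a small in-memory sketch. It is written in Python for brevity and is not the GOM API: the class `GenericObject`, its methods, and the helper `relate` are hypothetical names introduced here. The sketch shows a bi-directional many-to-one association accessed by association name from either end, and a many-to-many relation built from an intermediate linking object.

```python
from collections import defaultdict


class GenericObject:
    """Hypothetical stand-in for a persistent object with ad-hoc associations."""

    def __init__(self, name):
        self.name = name
        self._targets = {}                  # association name -> object at the "one" end
        self._link_sets = defaultdict(set)  # association name -> objects at the "many" end

    def link(self, association, target):
        """Point this object at `target`, keeping both ends of the link in step."""
        old = self._targets.get(association)
        if old is not None:
            old._link_sets[association].discard(self)
        self._targets[association] = target
        target._link_sets[association].add(self)

    def target(self, association):
        """Follow the association from the "many" end to the "one" end."""
        return self._targets.get(association)

    def link_set(self, association):
        """Follow the association from the "one" end to the "many" end."""
        return frozenset(self._link_sets[association])


def relate(left, right, left_association, right_association):
    """Many-to-many: an intermediate object holds one many-to-one link per side."""
    linker = GenericObject(f"{left.name}-{right.name}")
    linker.link(left_association, left)
    linker.link(right_association, right)
    return linker
```

For example, after `relate(alice, paper, "author", "paper")`, the papers for `alice` can be collected with `{l.target("paper") for l in alice.link_set("author")}`. In a real store the link sets would be persistent, indexed collections rather than in-memory sets.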
Object Cache
The GOM guarantees a one-to-one relationship between live objects in a virtual machine and the corresponding persistent objects. This guarantee is realized by a weak reference cache for live objects. Each operation directed to the persistence layer by the application touches the cache. The cache is structured as an inner hard reference cache of capacity N with an LRU eviction policy and an outer weak reference cache. Operations touching the outer cache also touch the inner cache, causing the LRU ordering to be updated, but do not write through to secondary storage. Objects are buffered in memory until (a) they are no longer referenced by the application; and (b) they are evicted from the LRU cache. Keys (object identifiers) are removed from the weak cache once the garbage collector has cleared the weak reference, whereas eviction notices are generated by the inner cache. In practice, this means that newly created objects, strongly reachable objects, and recently used objects do not require any I/O. In our experience, GOM applications may be CPU bound, since access patterns result in a significant percentage of relevant persistent objects living in cache. This makes it possible to write code, such as the reasoner, that touches a large number of objects without becoming I/O bound.
The guidance for the Java garbage collector is that weak references should be cleared as if no reference exists while soft references should be maintained until the garbage collector determines that it needs more memory (the policy is fairly flexible). This means that soft references cause objects to be retained longer. Based on observation, the use of a soft reference outer cache policy leads to unnecessary retention of objects and longer commit processing since fewer objects are incrementally written to stable storage before the transaction commits.
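The two-tier cache described in this section can be sketched as follows. This is an illustration in Python rather than the Java implementation: `weakref.WeakValueDictionary` stands in for the outer weak reference cache, an `OrderedDict` stands in for the inner hard reference LRU cache, and the names `ObjectCache`, `PersistentObject`, and `on_evict` are assumptions. The `on_evict` callback plays the role of the eviction notice, at which point an implementation could write the object to stable storage.

```python
import weakref
from collections import OrderedDict


class PersistentObject:
    """Hypothetical persistent object, identified by an object identifier."""

    def __init__(self, oid, state=None):
        self.oid = oid
        self.state = state


class ObjectCache:
    """Outer weak reference cache plus inner hard reference LRU cache."""

    def __init__(self, capacity, on_evict):
        self._capacity = capacity
        self._on_evict = on_evict                   # receives eviction notices
        self._lru = OrderedDict()                   # inner cache: hard references
        self._weak = weakref.WeakValueDictionary()  # outer cache: weak references

    def put(self, obj):
        self._weak[obj.oid] = obj
        self._touch(obj)

    def get(self, oid):
        """Return the live object for `oid`, or None if it must be re-read."""
        obj = self._weak.get(oid)
        if obj is not None:
            self._touch(obj)
        return obj

    def _touch(self, obj):
        # Every access updates the LRU ordering; nothing is written through.
        self._lru[obj.oid] = obj
        self._lru.move_to_end(obj.oid)
        while len(self._lru) > self._capacity:
            _, evicted = self._lru.popitem(last=False)
            self._on_evict(evicted)
```

An object evicted from the inner cache remains reachable through the outer cache for as long as the application holds a reference to it, which preserves the one-to-one guarantee; once the last reference is dropped, the entry disappears from the outer cache and a later `get` returns `None`.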
RDFS Store Architecture
The GOM RDFS SAIL [PER06] uses Sesame’s generic architecture for storing and querying RDF [BRO02] and implements the Storage And Inference Layer (SAIL). The Sesame architecture makes it relatively easy to develop and test new RDFS storage architectures. Developers benefit from the reuse of components for high level query languages bundled with the Sesame platform. Applications benefit from a common interface to semantic web technology while retaining choice about the specifics of the storage and inference solution. The GOM RDFS SAIL may be used as a plug and play alternative for the back ends bundled with the Sesame 1.x distribution.
The implementation of the GOM RDFS SAIL consists of an implementation of the Graph API (manipulation of RDF graphs) and an RDFS reasoner used to compute and maintain the closure of the store according to the model-theoretic semantics specified by RDF Model Theory [HAY04]. The persistent state defined by the Graph API was modelled directly in UML (see Figures 1 and 2). The behaviours defined by the Graph and SAIL APIs were implemented by skins that delegate to the GOM layer. The GOM layer provides a mechanism for lookup of skins by interface or implementation class name. The use of the skin delegation pattern makes it possible to run the same code over multiple GOM implementations. To date, GOM integrations have been developed for the CTC Technology, LTD GOM store [CUT06] and a GOM implementation based on an open source stack [THO06].

Figure 1. UML Model of the Statements in the RDFS Store.

Figure 2. UML Model of Collections in the RDFS Store.
Truth Maintenance
The strategy used for inference and truth maintenance is similar to the strategy described in “Inferencing and Truth Maintenance in RDF Schema” [BRO03]. The basic proof maintenance architecture (Figure 3) maintains explicit proof chains. Each Statement is classified as either an axiom (always true), an explicit statement (one that was asserted by the SAIL API), or an inferred statement. Inferences are computed during insert and maintained during statement removal. This approach demonstrates reasonable performance and is extremely fast for stores that anticipate frequent statement removal.

Figure 3. UML Model of Basic Proof Maintenance Strategy for the RDFS Store.
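The idea of explicit proof chains can be illustrated with a deliberately small sketch. It is written in Python, is restricted to a single rule (rdfs9), and omits axioms; the class name `ProofStore` and its layout are assumptions, not the object model of Figure 3. Each inferred statement records the sets of premises that justify it. On removal, the statements still supported are recomputed from the explicit statements through the stored proofs, so that statements which only support each other in a cycle are also retracted. This is one simple way to handle removal and is not necessarily the algorithm used in the store.

```python
RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"


class ProofStore:
    """Illustrative store that keeps explicit proofs for rdfs9 inferences."""

    def __init__(self):
        self.explicit = set()  # statements asserted by the application
        self.proofs = {}       # statement -> set of proofs (frozensets of premises)

    def statements(self):
        return self.explicit | set(self.proofs)

    def add(self, stmt):
        self.explicit.add(stmt)
        self._infer()

    def remove(self, stmt):
        self.explicit.discard(stmt)
        self._retract_unsupported()

    def _infer(self):
        # rdfs9: (x type C), (C subClassOf D) => (x type D), recording each proof.
        changed = True
        while changed:
            changed = False
            current = self.statements()
            for (x, p, c) in current:
                if p != RDF_TYPE:
                    continue
                for (c2, p2, d) in current:
                    if p2 == SUBCLASS and c2 == c:
                        proof = frozenset({(x, p, c), (c2, p2, d)})
                        known = self.proofs.setdefault((x, RDF_TYPE, d), set())
                        if proof not in known:
                            known.add(proof)
                            changed = True

    def _retract_unsupported(self):
        # A statement is supported if it is explicit or has at least one proof
        # whose premises are all supported (least fixed point over the proofs).
        supported = set(self.explicit)
        grew = True
        while grew:
            grew = False
            for stmt, proofs in self.proofs.items():
                if stmt not in supported and any(pr <= supported for pr in proofs):
                    supported.add(stmt)
                    grew = True
        self.proofs = {
            stmt: {pr for pr in proofs if pr <= supported}
            for stmt, proofs in self.proofs.items()
            if stmt in supported
        }
```

Removal never re-runs the rules; it only consults the stored proofs. The price is one stored proof per distinct justification, which is why proof objects can dominate the store.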
Proof objects are the single most common object in the store, and I/O operations account for a significant portion of the time required when bulk loading new data. If we can reduce the number of proofs stored, then we can both (a) increase the speed at which data may be loaded into the store and (b) reduce the size of the store on disk. We are therefore exploring alternative architectures for reasoning and truth maintenance. Clark and Parsia, LTD have recently developed a backward chainer for the GOM RDFS SAIL. We hope to report at the conference on an option which would persist only a single proof in the store and use the backward chainer to re-prove support for an inferred statement iff that proof is falsified.
Federation Study
An experimental hypothesis was posed to examine the benefits of federation: that by federating multiple data sources in various intelligence community locations, analysts would be able to improve the comprehensiveness, completeness, and timeliness of analysis as compared to analysis on the data sources individually. The goal was to compare the results of a group detection algorithm on a federation of data sets to group detection on the data sets that make up the federation. The experiment was designed such that:
- Synthetic data sets were generated having characteristics similar to real world data at smaller scale. The data model consisted of entities (corresponding to hypothetical persons or organizations), their attributes, and links between those entities.
- There were three categories of entities: members of groups as dictated by a synthetic ground truth (that is, it was ground truth for the synthetic data set), intermediate coupling sets (communities in which group members participated), and background population.
- Group detection was run on three data sets (non-federated condition) and the results were combined into a single results set of groups and scored against the synthetic ground truth.
- Group detection was run on the three data sets in federation and the results were then also scored against the synthetic ground truth.
- The scores were compared to quantify the effects of federation.
| Scoring Type | Overall Membership (Federated) | Overall Membership (Non-federated) | Group Structure (Federated) | Group Structure (Non-federated) |
|---|---|---|---|---|
| Recall | 0.93 | 0.63 | 0.86 | 0.18 |
| Precision | 1.00 | 1.00 | 0.74 | 0.85 |
| Preservation Index | 0.93 | 0.63 | 0.64 | 0.15 |
Overall Membership refers to group member identification independent of specific group structure, while Group Structure refers to overall assessment of preservation of group structure as reflected in the group co-membership relationship of all pairs of entities. Recall is defined as the ratio of detected true positives to corresponding ground-truth count. Precision is defined as the ratio of true positives to the total detected. Preservation Index is simply Precision * Recall.
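The three measures defined above can be computed as in the following sketch (Python; the function names and the toy groups in the usage note are illustrative and are not data from the study). Overall membership is scored on the set of entities assigned to any group. Group structure is scored here on the set of unordered co-member pairs, which is one straightforward reading of the pairwise definition above and is an assumption about how the scoring was carried out.

```python
from itertools import combinations


def score(detected, truth):
    """Return (recall, precision, preservation index) for two sets."""
    true_positives = len(detected & truth)
    recall = true_positives / len(truth) if truth else 0.0
    precision = true_positives / len(detected) if detected else 0.0
    return recall, precision, precision * recall


def members(groups):
    """All entities that belong to at least one group."""
    return set().union(*groups)


def co_member_pairs(groups):
    """All unordered pairs of entities that share at least one group."""
    return {
        frozenset(pair)
        for group in groups
        for pair in combinations(sorted(group), 2)
    }
```

For a toy ground truth `[{"a", "b", "c"}, {"d", "e"}]` and a detection result `[{"a", "b"}, {"c", "d", "e"}]`, every member is found, so the membership scores are all 1.0, while only two of the four true co-member pairs are recovered and two of the four detected pairs are correct, giving recall 0.5, precision 0.5, and a preservation index of 0.25.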
With respect to overall membership, detection of threat actors was higher in the federated view (93%) than in the individual runs (63%). Both strategies yielded no false positives. For group structure, detection of threat groups was much higher in the federated view (86%) than in the individual runs (18%). Because the gap in group structure detection was so large, federation created slightly more false positives (26%) than the individual runs (15%). Overall, the lack of a full picture in the non-federated runs dramatically reduced the effectiveness of the group detection algorithm in finding group members and uncovering group structure.
References
[BRI04] D. Brickley and R.V. Guha. RDF Vocabulary Description Language 1.0: RDF Schema. Recommendation, World Wide Web Consortium, February 10, 2004. See http://www.w3.org/TR/rdf-schema/.
[BRO02] J. Broekstra, A. Kampman, and F. van Harmelen. Sesame: A Generic Architecture for Storing and Querying RDF and RDF Schema. Published at the International Semantic Web Conference 2002, Sardinia, Italy. See http://openrdf.org/doc/papers/Sesame-ISWC2002.pdf.
[BRO03] J. Broekstra, A. Kampman. Inferencing and Truth Maintenance in RDF Schema: Exploring a Naive Practical Approach. In Workshop on Practical and Scalable Semantic Systems (PSSS) 2003, Second International Semantic Web Conference, Sanibel Island, Florida, USA. See http://openrdf.org/doc/papers/inferencing.pdf.
[CUT04] The Generic Object Model is part of a suite of Java-based technology created by CTC Technology Ltd, a UK company. See http://www.cutthecrap.biz/software/whitepapers/theproject.html
[CUT06] CTC Technology Ltd, a UK company, sells a commercial GOM database. See http://www.ctc-tech.biz.
[HAY04] P. Hayes. RDF Semantics. Recommendation, World Wide Web Consortium, February 10, 2004. See http://www.w3.org/TR/rdf-mt/.
[KLY04] G. Klyne and J.K. Carroll. Resource Description Framework (RDF): Concepts and Abstract Syntax. Recommendation, World Wide Web Consortium, February 10, 2004. See http://www.w3.org/TR/rdf-concepts/.
[OMG06a] Unified Modeling Language Copyright © 1997-2006 Object Management Group, Inc. See http://www.omg.org/technology/documents/formal/uml.htm.
[OMG06b] Model Driven Architecture Copyright © 1997-2006 Object Management Group, Inc. See http://www.omg.org/mda/.
[PER06] Mike Personick, Bryan Thompson. © 2006, CognitiveWeb. GOM RDFS Sail. See http://proto.cognitiveweb.org/projects/cweb/multiproject/cweb-rdf-generic/index.html.
[PRU06] SPARQL Query Language for RDF. Copyright © 2006 W3C®. See http://www.w3.org/TR/rdf-sparql-query/.
[THO06a] Bryan Thompson. © 2006, CognitiveWeb. GOM for CTC. See http://proto.cognitiveweb.org/projects/cweb/multiproject/cweb-generic-ctc/index.html.
[THO06] Bryan Thompson. © 2006, CognitiveWeb. GOM (Native). See http://proto.cognitiveweb.org/projects/cweb/multiproject/cweb-generic-native/index.html.




