Semantic Web @ NASA
NASA has a data problem
NASA is a unique institution with a unique mission, namely, to explore our planet, solar system, galaxy, and universe in order to expand the frontiers of human knowledge. As one of the largest scientific institutions in the world, NASA generates mountains of data every single day.
As we will see below, NASA’s data problem has some of the characteristics of the World Wide Web itself, which is one of the reasons we’ve been trying to respond to NASA’s data problem by using some of the principles, techniques, and hacks from the Semantic Web.
In what follows we discuss some aspects of the data problem at NASA; then we discuss some of the ways in which the NASA scientific and IT communities are using Semantic Web technologies to respond to this problem; and, finally, we discuss a Semantic Web application, POPS, in greater detail.
Size
There's a practically unimaginable amount of NASA data: unknown terabytes of data across disparate sites and in every data format possible. And, as near as anyone can tell, the rate at which NASA is generating new data is not only increasing, but it’s increasing at a faster rate than ever before, although this is only a tenuous judgment at best, given the inherent problems in reaching it.
Decentralized
NASA itself is composed of at least eleven geographically dispersed centers or sites, each one of which is a complex, multi-disciplinary, multi-goaled institution that rivals, in both size and complexity, a large corporation. No centralized, robust data plan or structure could conceivably be imposed upon such a complex organization, not least because of historical reasons. NASA evolved into its present state, and it effectively predates the period of modern information technology.
Scope
Not only does NASA have a lot of data, it has every conceivable type of data:
scientific, financial, and organizational
structured, semistructured, and unstructured
sensor and experimental data output from every conceivable kind of device, including thousands of bespoke, ad hoc, and one-off devices
metadata embedded in every conceivable kind of imagery
all the usual "enterprise data" associated with an organization of NASA's size and scope
data about people, their skills, and ongoing projects, including teams across dozens of sites, thousands of experts, and hundreds of thousands of existing and historical projects
Integratable, but not Integrated
Further, consider the promise of the integration task for the NASA data universe. Just as with the Web, where the promise of data and information integration is always present and potent, but where the reality often falls short, NASA, in both its scientific and its enterprise IT missions, needs to do data and information integration often and well. One can imagine some of the payoffs of data integration in NASA, but the data isn’t integrated and often resists integration because of complexity and other problems.
Consider an example: When planning to return to the Moon, NASA mission and planning analysts have to use lunar planetology data to find potential landing sites. But what makes a landing site suitable is a complex question, involving seemingly (but not really) disparate data, not only about lunar topography and geology but also about people, their skills, historic projects, and experimental and other mission requirements, including how those requirements vary under different scenarios.
NASA needs to be able to integrate data that is, in theory, capable of being integrated, but which is not, in fact, very easy to discover, much less to integrate.
Importance
NASA's data matters: it's vital to helping humanity understand its place in the universe. Few human projects matter more than figuring out where we came from, what our world and its place in the universe is like, and how far we can push back the boundaries of ignorance.
Internationalization
Much, but not all, of NASA's data is very public and belongs not only to the US taxpayer but also, in some sense, to the world. Internationalization (I18N) issues matter, as do accessibility, privacy, and security issues.
Semantic Web Efforts
Of course the scientists, engineers, and IT professionals who work at NASA understand their data situation as well as anyone. And they’ve been working at solving it. Some of the noteworthy efforts to date include metadata and knowledge management councils and best practice groups, as well as communities of practitioners of particularly strategic technologies like XML.
The library and information science disciplines are also well-represented inside NASA, which means there are catalogues, libraries, data dictionaries, and dictionaries of data collections across the Agency. And since scientists really understand the importance of data management, there are lots of well-curated data collections inside the science directorates in NASA.
And on the IT front, significant efforts have been made to modernize and improve the infrastructure necessary for data integration, including an Enterprise Service Bus built from open source web service components like Apache’s AXIS, employing W3C standards: XML, SOAP, WSDL, etc.
But, alas, a lot of NASA’s data remains in forms that aren’t really machine-readable or are insufficiently expressive. Thus there are the first steps across the Agency to move to more expressive Semantic Web formalisms like RDF and OWL. Groups and projects across NASA are using RDF to manage agile data integration and OWL ontologies to describe and reason about, for example, complex planetary data ([SweetJPL]). NASA is also using best-of-breed Semantic Web tools like mSpace (http://mspace.fm) and JSpace (http://clarkparsia.com/projects/code/jspace/) to browse complex knowledge bases.
In what remains of this paper we’ll introduce and discuss POPS, a Semantic Web project inside NASA to integrate information about people, organizations, projects, and skills.
POPS
The POPS project focuses on one subset of the data problem, namely, expertise location. The Agency is full of bright, well-trained people with interesting, relevant skill sets and experience. But since there are anywhere from 70,000 to 80,000 such people in the Agency at any one time, how do managers and other staffing planners find the right person for any particular job? The problem is exacerbated by the fact that the relevant data (for people, projects, and competencies) are spread out over at least three distinct databases in three separate NASA centers.
Working in partnership with NASA technical leadership, Clark+Parsia, a small Semantic Web firm in Washington, DC, with a focus on research & development, started working on the expertise locator problem with two key insights:
First, it makes sense to conceive of expertise location as an information integration problem very similar to the kind of lightweight information integration that happens in Web 2.0 mashups.
Second, that the primary user activity in expertise location is browsing, not searching.
Federated RDF, Info Integration, and Mashups
The data necessary to build an expertise location service for NASA existed in three different databases at three different NASA centers: project data lives in a database at Langley; skills information lives in a database at Kennedy; and people and organizational data lives in a database at Marshall.
Rather than build yet another traditional database that combined these three databases, we chose to build an RDF federation of the three databases instead. So the three source databases that feed this RDF federation continue to be the authoritative source for the data they host. They continue to be the source for changing the data they host. And they continue their operation and service unchanged.
Employing a Sesame open source RDF database as the basis of an RDF integration platform, we worked to convert the existing data from Langley, Kennedy, and Marshall into RDF using simple Python programs running over database dumps. We used the SWOOP ontology editor to create an OWL ontology for the RDF integration, which provided a basis for formal agreement among the development partners about the problem domain, as well as a logical formalism to prove that the proposed integration of the problem domain was consistent and satisfiable.
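As a rough illustration of this kind of conversion, the following sketch turns a CSV-style database dump into N-Triples lines. It is not the code used for POPS: the namespace `http://example.org/pops/`, the column names, and the sample row are all assumptions made for the example, and a real conversion would map columns onto terms from the OWL ontology rather than onto generated property names.

```python
import csv
import io

# Hypothetical namespace; the real POPS URIs are not given in the text.
BASE = "http://example.org/pops/"


def escape_literal(value):
    """Escape a string for use inside an N-Triples literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def rows_to_ntriples(dump_text, id_column, kind):
    """Emit one N-Triples line per non-empty cell of a CSV dump."""
    lines = []
    for row in csv.DictReader(io.StringIO(dump_text)):
        # The row's unique identifier becomes part of the subject URI.
        subject = f"<{BASE}{kind}/{row[id_column]}>"
        for column, value in row.items():
            # Skip the identifier itself, empty cells, and surplus fields.
            if column is None or column == id_column or not value:
                continue
            predicate = f"<{BASE}schema/{column}>"
            lines.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return lines


# Invented sample standing in for a dump of a project database.
sample_dump = "project_id,title,center\nP1,Crew Exploration Vehicle Spiral,GRC\n"
triples = rows_to_ntriples(sample_dump, "project_id", "project")
```

The resulting lines could then be loaded into any RDF store that accepts N-Triples. Identifiers used directly in URIs would need encoding if they contain characters that are not URI-safe.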
We quickly had three disjoint databases integrated into one RDF federation, using unique identifiers in each database to relate relevant information within the federated store, including unique identifiers for NASA personnel and projects.
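The role of shared identifiers can be sketched with plain Python sets standing in for the RDF store. Every identifier, property name, and value below is invented for illustration. The point is only that once the three sources use the same identifier for the same person or project, a simple union of their triples already behaves like one connected graph.

```python
# Triples from three hypothetical sources, as (subject, predicate, object) tuples.
# All identifiers and property names are invented for illustration.
people = {("person:42", "name", "A. Engineer"),
          ("person:42", "center", "GRC")}
skills = {("person:42", "hasSkill", "Engineering and Technology Knowledge")}
projects = {("project:P1", "title", "Crew Exploration Vehicle Spiral"),
            ("project:P1", "member", "person:42")}

# Federation here is just set union: shared identifiers connect the graphs.
store = people | skills | projects


def objects(store, subject, predicate):
    """All objects for a given subject and predicate, sorted."""
    return sorted(o for s, p, o in store if s == subject and p == predicate)


def skills_on_project(store, project):
    """Follow project -> member -> skill across what were separate sources."""
    found = set()
    for person in objects(store, project, "member"):
        found.update(objects(store, person, "hasSkill"))
    return sorted(found)


project_skills = skills_on_project(store, "project:P1")
```

Here `skills_on_project` answers a question that no single source database could answer on its own, because the project membership and the skill records live in different sources.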
Since this data does not change rapidly, we are able to periodically repopulate the RDF federated store without interrupting service at the host databases and without incurring the performance penalty of live, dynamic querying of the host databases.
Browsing as Visual Query Building
The next step to building POPS was to create a functional, appropriate client to access the information within the federated RDF store, which leads to the second insight, namely: some applications are search-oriented, but others are browse-oriented. Searching and browsing a data set are fundamentally different activities. Typically a user cares only about the most relevant search results, while users just as typically care about proximate and contextual nodes in a browse application. Browsing is especially relevant when users need to have a sense for the overall features and contours of a data set, and when finding a node of interest may spark interest in other, related or contiguous nodes.
Since the RDF federation that served as the data repository for POPS was an RDF database, it was queryable by several different RDF query languages. We knew that given the appropriate query, a user could locate relevant expertise rapidly and efficiently, but we also knew that most users are not prepared for or especially interested in the task of learning a new data model and a query language for that model.
Thus we focused on using a GUI to build RDF queries dynamically, based on user input, and to seamlessly integrate the results of each new query with the results of all other queries. Since we were focused particularly on developing a form of user interaction with POPS that supported browsing the data set, as opposed to searching it, we wanted to find a visual or UI metaphor that was suitable for browsing or navigating through a complex data set (in this case, multiply intersecting trees or hierarchies) in a way that was responsive to user needs as well as easy to learn.
We quickly hit upon the idea of using m.c. schraefel’s polyarchical Semantic Web browser, mSpace, as a way to present to the end user of POPS a visually useful way to navigate a polyarchy (that is, a multiply intersecting set of hierarchies), where navigation choices were really, implementationally, RDF queries against a federated store.
Polyarchical Browsing
The first step was to port the JavaScript, in-browser mSpace to a more robust Java client, which we dubbed JSpace. The POPS browsers, mSpace and JSpace, are built for browsing polyarchies, that is, multiple intersecting hierarchies. In the POPS RDF federation each data source to be integrated is a hierarchy with some attribute or property that intersects with the other data sources. Thus the data sources for people, organizations, projects, and skills are each disjoint data hierarchies in which each node is uniquely identified by some property value that overlaps or intersects with another of the hierarchies.
In HCI and UI research, there are several novel approaches to interacting with polyarchies, including Microsoft's Visual Pivots and m.c. schraefel's mSpace. POPS employs the latter approach, giving an intuitive paned interface for end-user exploratory browsing of intersecting hierarchies. See Figure 1 below.

Figure 1. POPS browser with social network graph and source annotations
The most interesting feature of the JSpace browser in POPS is that users can choose any number and any ordering of columns to filter data down to a goal or match column. Each different reordering of the columns or addition or deletion of new columns simply changes the parameters of the RDF query being sent to the RDF federated store.
People can be filtered by their skills or by their affiliation with some NASA center or by their participation in a project. Likewise projects can be filtered by people or skill or NASA center, etc. This approach obviously includes the possibility of integrating additional data sources, like other sources of project information, or historical project information, and using those new data sources as filters or constraints on the queries, and hence the results of those queries, to the federated store.
In Figure 1 above a user has navigated to one person in the database by filtering on NASA center, choosing “GRC”, the label for the Glenn Research Center, and then also filtering by one of the projects located at Glenn, something called the “Crew Exploration Vehicle Spiral”, and then by a list of all the competencies possessed by all the people who work at Glenn on that project. The user has chosen the “Engineering and Technology Knowledge” competency, and then asked that any matching person be displayed in the final people column. By navigating in a very free-form manner using JSpace, the user has essentially built up a rather complex RDF query that asks for all people who work at Glenn, on the Crew Exploration Vehicle Spiral project, with the Engineering and Technology Knowledge competency. Building that query by hand is non-trivial and certainly beyond the interest or skill level of most staffing planners.
In effect the POPS browsers are domain-specific visual query builders: each reordering of the columns and each path through the hierarchies represents an RDF query transmitted from the browser to an RDF-based federated data store via HTTP; the visible results of each path or column reordering are the browser's rendition of the results of the previous queries.
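To make the idea of a visual query builder concrete, here is a minimal sketch of how an ordered list of column selections could be turned into query text. It uses SPARQL syntax purely for illustration; the prototype described here ran on Sesame, and SPARQL support is mentioned only as a future iteration. The `pops:` prefix, the property names in `COLUMN_PROPERTY`, and the modeling of each column as a literal-valued property of a person are all assumptions.

```python
# Hypothetical mapping from a browser column to the property that links
# a person to a value in that column.
COLUMN_PROPERTY = {
    "center": "pops:center",
    "project": "pops:worksOn",
    "competency": "pops:hasCompetency",
}


def build_query(selections):
    """Build a SPARQL SELECT from an ordered list of (column, label) pairs."""
    patterns = []
    for column, label in selections:
        # Escape backslashes and quotes so the label is a valid literal.
        escaped = label.replace("\\", "\\\\").replace('"', '\\"')
        patterns.append(f'  ?person {COLUMN_PROPERTY[column]} "{escaped}" .')
    body = "\n".join(patterns)
    return ("PREFIX pops: <http://example.org/pops/schema/>\n"
            "SELECT DISTINCT ?person WHERE {\n" + body + "\n}")


# The selections from the Figure 1 walkthrough, in the order chosen.
query = build_query([
    ("center", "GRC"),
    ("project", "Crew Exploration Vehicle Spiral"),
    ("competency", "Engineering and Technology Knowledge"),
])
```

Adding, removing, or reordering columns only changes the list passed to `build_query`, which mirrors the way each change in the browser changes the parameters of the query sent to the store.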
Social Network Visualization
The other piece of the POPS puzzle was to enrich or enhance NASA culture. The most successful knowledge management and information integration projects enhance an institutional culture, rather than demanding it change in ways that don’t make sense to users.
Having located a person who might fit some set of staffing criteria, the next step is to make a connection with that person. But how best to facilitate that connection is often a matter of institutional culture. In NASA the rolodex is still a vital part of this culture, and we wanted the POPS service to respect that culture. It’s often the case that a staffing manager reaches out to potential project candidates by calling someone whom the staffing manager knows at the same research center where the candidate works.
Since the RDF federated store contains most of the information one needs to know to build a social network graph, we worked with Semantic Web researcher Dr. Jen Golbeck, who’s focused on social networking and small world graph research in the context of trust and the Semantic Web, to design an algorithm for computing a social network graph between the person trying to locate human expertise and a likely candidate.
This social network visualization feature supports two modes.
First, it will display other nodes in the RDF federated store that have the same properties as a given node. That is, it will display other people in the same department, at the same facility, or with the same skills as a candidate. It will also display people who have the same department and facility and skills as the candidate, on the reasonable assumption that the more of these properties are shared, the more likely it is that these people already know each other and could facilitate introductions or provide further information.
Second, the social network visualization algorithm will also find the shortest path from the person looking for some expertise to a person who has that expertise. This feature suggests that POPS will support use cases beyond expertise location: it may also help NASA personnel find relevantly similar colleagues for other needs or purposes.
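The two modes can be sketched as follows. The algorithm actually designed for POPS is not described in this paper, so the code below substitutes two standard techniques: counting shared property values for the first mode, and breadth-first search over an unweighted graph for the second. All names, properties, and links are invented for the example.

```python
from collections import deque

# Hypothetical records; property names and values are invented.
staff = {
    "candidate": {"department": "D1", "facility": "GRC", "skill": "propulsion"},
    "alice": {"department": "D1", "facility": "GRC", "skill": "propulsion"},
    "bob": {"department": "D2", "facility": "GRC", "skill": "avionics"},
    "carol": {"department": "D3", "facility": "JSC", "skill": "avionics"},
}

# Hypothetical undirected "works with" links between people.
links = {
    "seeker": ["dana"],
    "dana": ["seeker", "eli"],
    "eli": ["dana", "candidate"],
    "candidate": ["eli"],
}


def likely_contacts(staff, person):
    """Mode 1: rank other people by how many property values they share."""
    scored = []
    for other, props in staff.items():
        if other == person:
            continue
        shared = sorted(k for k, v in staff[person].items() if props.get(k) == v)
        if shared:
            scored.append((other, shared))
    # Most shared properties first; ties broken alphabetically by name.
    return sorted(scored, key=lambda item: (-len(item[1]), item[0]))


def shortest_path(links, start, goal):
    """Mode 2: breadth-first search; returns a shortest path or None."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the predecessor chain back to the start.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in links.get(node, []):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


contacts = likely_contacts(staff, "candidate")
path = shortest_path(links, "seeker", "candidate")
```

In this toy data, `contacts` ranks the person who shares all three properties ahead of the person who shares only a facility, and `path` gives the chain of introductions from the seeker to the candidate.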
Microsoft’s IIS Approach
An interesting coincidence which we discovered only after creating the first POPS prototype was the similarity between this approach and a Microsoft product known as the Identity Integration Service, which also uses a kind of polyarchical user interface modality known as Visual Pivots, as well as an extension of Microsoft’s SQL Server to accommodate queries over polyarchical structures.
Significant differences exist between the POPS service we built and this Microsoft product, not least of which is our reliance on open source software (especially the Sesame RDF framework) and W3C standards like XML, RDF, OWL, and (in a future iteration) SPARQL. From a design point of view, we believe that lightweight, RDF-based data federation is a more agile approach to information integration than the Microsoft approach, which requires a significantly more onerous integration process at the RDBMS-level. We also believe that our use of plain old HTTP makes the POPS service more loosely coupled with NASA’s existing enterprise architecture.
Future Work
We will begin work on a production-grade implementation of POPS in the summer of 2006 to support NASA-wide expertise location services. Some of the challenges to that work include robust RDF databases and clever uses of HTTP to ease network traffic, etc. But these are mostly solved problems from the long history of building robust, scalable Web applications, and being able to piggyback on that work is one advantage to standards-based Semantic Web development.
We also anticipate adding new features to the POPS infrastructure, including the integration of additional disjoint data sources, which will expand the utility of JSpace and POPS dramatically. For advanced users, we plan to expose a raw SPARQL query interface and a client-side search facility for winnowing large match sets. We also plan to do additional work with OWL in the role of database alignment and integration support in order to begin integrating more difficult, less disjoint data sources. We plan to add query-by-example features to JSpace in order to support the use case where a browser has located a node in the data set that is exactly like the required node except different in one or another property value. We also plan to add property filters to introduce a kind of blended browse-and-search capacity.
JSpace and mSpace both require an RDF document as input, which is called, confusingly, a model. The model provides some basic support to the browser for how to build the queries and to display the results. Once we have integrated a dozen or so data sources, we anticipate that users will be clamoring for arbitrary mashups between combinations of data sources for which JSpace models haven’t yet been written. We will be adding an editor to JSpace to allow users to create arbitrary models and, thus, arbitrary mashups based on their specific needs. This will require better infrastructure support for flexible, on-the-fly RDF federations of source databases. We’re hopeful that the new W3C standard for RDF query language and data access protocol will be a new spur to increasing commercialization of the RDF database market.
We believe that POPS demonstrates the utility of Semantic Web technologies as the basis for interesting and useful information integration projects, both at NASA and in other complex institutions.
Bibliography
[SweetJPL] Semantic Web for Earth and Environmental Terminology (SWEET), http://sweet.jpl.nasa.gov




