I am currently working with Priya on a giant RDF triple store. The aims of the "MetaStore" project are to:

1. Pin down all of our metadata in one place. (Currently we have article headers in XML files, references in a relational database, other databases mapping legacy article identifiers to other (legacy) article identifiers, and so on - in short, an integration nuisance for the content team!)

2. Model it properly - using an ultra-flexible database model and industry-standard vocabularies. (For example, it is nice to model books as books, and supplementary data as itself, rather than shoehorning them into the journal/article model.)
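To give a flavour of what that looks like in practice - the prefixes, class names, and URIs below are purely illustrative, not our actual MetaStore schema - a book and a piece of supplementary data can each be typed as what they actually are:

```turtle
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dc:  <http://purl.org/dc/elements/1.1/> .
@prefix ex:  <http://example.org/vocab#> .    # hypothetical in-house vocabulary

<http://example.org/books/12345>
    rdf:type      ex:Book ;                   # a book modelled as a book...
    dc:title      "An Example Monograph" ;
    dc:identifier "isbn:0-00-000000-0" .

<http://example.org/supp/67890>
    rdf:type      ex:SupplementaryData ;      # ...not an honorary journal article
    dc:identifier "supp:67890" .
```

Nothing about the store itself forces this - it is just triples either way - but picking the right classes up front saves painful remodelling later.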

The first stage of the process was some initial RDFS modelling, followed by converting a big chunk of test data into RDF/XML.

Next, we experimented with a few RDF engines, and in the end went for Jena with a PostgreSQL back-end. Jena has a solid Java API, good support, and scaled well in our initial query performance tests. The PostgreSQL back-end gives us a tried-and-tested master/slave replication mechanism.
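For the curious, opening a database-backed model looks roughly like this. This is a sketch against Jena 2's relational back-end (which is what produces tables named like jena_g1t1_stmt); the connection URL, credentials, and graph name are placeholders, and you need the Jena and PostgreSQL JDBC jars on the classpath.

```java
// Sketch: connecting Jena to a PostgreSQL-backed triple store.
// All connection details below are placeholders.
import com.hp.hpl.jena.db.DBConnection;
import com.hp.hpl.jena.db.IDBConnection;
import com.hp.hpl.jena.db.ModelRDB;

public class MetaStoreConnect {
    public static void main(String[] args) throws Exception {
        Class.forName("org.postgresql.Driver");      // PostgreSQL JDBC driver
        IDBConnection conn = new DBConnection(
                "jdbc:postgresql://localhost/metastore",  // placeholder URL
                "user", "password", "PostgreSQL");
        // Open (or create) a named graph; Jena stores its triples
        // in the jena_gNtN_* tables behind the scenes.
        ModelRDB model = ModelRDB.createModel(conn, "g1");
        System.out.println("Triples: " + model.size());
        model.close();
        conn.close();
    }
}
```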

We had 4.5 million articles to load, so we had to develop strategies for optimising load performance.
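The usual PostgreSQL bulk-load tactics apply at this scale. A sketch of the general shape - the index name and column choice are hypothetical; only the jena_g1t1_stmt table name comes from our store:

```sql
-- Sketch: standard PostgreSQL bulk-load tactics (index names are illustrative).

-- 1. Drop indexes on the statement table before the big load,
--    so each insert doesn't pay for index maintenance.
DROP INDEX jena_g1t1_stmt_subj_idx;

-- 2. Load in large batches inside a single transaction,
--    rather than committing one INSERT per triple.
BEGIN;
-- ... many batched INSERTs ...
COMMIT;

-- 3. Recreate indexes and refresh planner statistics afterwards.
CREATE INDEX jena_g1t1_stmt_subj_idx ON jena_g1t1_stmt (subj);
ANALYZE jena_g1t1_stmt;
```

Batching commits and rebuilding indexes after the fact, rather than maintaining them row by row, made the difference between weeks and days for us.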

After a January spent watching load logs scroll past, all of the backdata headers are finally loaded, and a daily process keeps the store updated with new arrivals.

The interesting thing about this project (in terms of RDF research) is its scale. This is a BIG triplestore:


metastore=# select count(subj) from jena_g1t1_stmt;
   count
-----------
 158267598
(1 row)

(And the references aren't in there yet.)

Because of the scale, we haven't been able to use our OWL ontology to infer extra triples - Facilities won't buy the planet-sized amount of memory we'd need. This is a big, dumb(ish) store.

One of the interesting features of the store is the number and variety of identifiers we have ended up with for each resource (we've been using dc:identifier from Dublin Core). But more on that later...
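As a taste of what that looks like - every value below is made up for illustration - a single article can end up carrying several dc:identifier literals side by side:

```turtle
@prefix dc: <http://purl.org/dc/elements/1.1/> .

<http://example.org/articles/abc123>
    dc:identifier "doi:10.1000/example.123" ;    # a DOI (made up)
    dc:identifier "pii:S0000-0000(00)00000-0" ;  # a legacy publisher identifier
    dc:identifier "pmid:00000000" .              # a PubMed-style identifier
```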