I am currently working with Priya on a giant RDF triple store. The aims of the "MetaStore" project are to:
1. Pin down all of our metadata in one place. (Currently we have article headers in XML files, references in a relational database, other databases which map legacy article identifiers to other (legacy) article identifiers, and so on. Basically, an integration nuisance for the content team!)
2. Model it properly - using an ultra-flexible database model, and industry standard vocabularies. (For example, it is nice to model books as books, and supplementary data as itself - rather than shoehorning them into the journal/article model.)
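To make the second aim concrete, here is a minimal sketch in Python (plain tuples rather than the Jena API) of what typing each resource as what it is might look like. The `ex:` prefix, the class names `ex:Book`, `ex:Article` and `ex:SupplementaryData`, and the resource names are hypothetical placeholders, not our actual schema.

```python
# Hypothetical vocabulary and resource names, for illustration only.
RDF_TYPE = "rdf:type"

# Each statement is a (subject, predicate, object) triple.
triples = [
    ("ex:book/1", RDF_TYPE, "ex:Book"),
    ("ex:book/1", "dc:title", "An Example Book"),
    ("ex:article/1", RDF_TYPE, "ex:Article"),
    ("ex:article/1", "dc:title", "An Example Article"),
    ("ex:supp/1", RDF_TYPE, "ex:SupplementaryData"),
    ("ex:supp/1", "dc:relation", "ex:article/1"),
]


def resources_of_type(triples, rdf_class):
    """Return the sorted subjects declared to be instances of rdf_class."""
    return sorted(s for s, p, o in triples if p == RDF_TYPE and o == rdf_class)


books = resources_of_type(triples, "ex:Book")
```

The point is that a book gets its own class instead of being disguised as an article, so a query for books never has to filter out journal content.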
The first stage of the process was to do some initial RDFS modelling and convert a big chunk of test data into RDF/XML.
Next, we experimented with a few RDF engines. In the end we went for Jena, with a PostgreSQL back-end. Jena has a solid Java API, good support, and scaled well in initial query performance tests. The Postgres back-end gives us a tried and tested master-slave replication mechanism.
We had 4.5 million articles to load, so we had to develop strategies for optimising load performance.
After a January spent watching load logs scroll past, all of the backdata headers are finally loaded, and a daily process keeps the store updated with new arrivals.
The interesting thing about this project (in terms of RDF research) is its scale:
This is a BIG triplestore:
metastore=# select count(subj) from jena_g1t1_stmt;
(And the references aren't in there yet.)
Because of the scale, we haven't been able to use our OWL ontology to infer extra triples, as Facilities won't buy the planet-sized amount of memory we'd need... This is a big, dumb(ish) store.
One of the interesting features of the store is the number and variety of identifiers we have ended up with for each resource. (We've been using dc:identifier from Dublin Core.) But more on that later...
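In the meantime, a rough sketch in Python of the idea: several dc:identifier values hanging off one resource, with a reverse index so that any of them resolves back to it. The identifier values, the `ex:` resource names, and the helper functions `build_identifier_index` and `resolve` are all made up for illustration; this is not how the store itself does lookups.

```python
from collections import defaultdict

# Hypothetical sample data: one resource can carry several identifiers.
triples = [
    ("ex:article/1", "dc:identifier", "doi:10.0000/example.1"),
    ("ex:article/1", "dc:identifier", "legacy:A0001"),
    ("ex:article/2", "dc:identifier", "legacy:B0002"),
]


def build_identifier_index(triples):
    """Map each dc:identifier value to the set of resources that carry it."""
    index = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == "dc:identifier":
            index[obj].add(subject)
    return index


def resolve(index, identifier):
    """Return the sorted resources known under the given identifier."""
    return sorted(index.get(identifier, ()))


index = build_identifier_index(triples)
```

With this, `resolve(index, "legacy:A0001")` and `resolve(index, "doi:10.0000/example.1")` both lead to the same resource, which is the property that matters when legacy systems each know an article by a different name.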