Can industrial scale triplestores be made to perform?
Is breaking the "triple table" model the answer?
Scaling SPARQL performance is something I worry about. Mike Bergman worries about it too.
In our XTech paper, we showed that even a simple, bread and butter sample query, is taking 1.5 seconds on our 200million triplestore. (Scroll down for the colourful graph). In our Jena paper, we showed that with a triplestore in a relational database, you're at the mercy of your query planner - some configurations of SPARQL query perform reaaaaly badly.
Both these facts make sense when you think that your SPARQL doesn't just get done by magic - of course, it gets turned into a ma-hoosive SQL JOIN statement, across the main triples table again and again, like this:
Select A0.Subj, A2.Subj, A3.Obj, A4.Obj, A5.Subj, A6.Obj From jena_g1t1_stmt A0, jena_g1t1_stmt A1,
jena_g1t1_stmt A2, jena_g1t1_stmt A3, jena_g1t1_stmt A4, jena_g1t1_stmt A5, jena_g1t1_stmt A6 Where
Databases weren't really designed to do queries like this - it's not surprising that they aren't very fast at it...
So I was really interested by Kevin Wilkinson's paper on property tables - also at the JUC last week. Kevin's idea is basically this:
Say, in your triplestore, you have this:
| s | p | o |
| X | p1 | o1 |
| X | p2 | o2 |
| X | p3 | o3 |
| Y | p1y| o1y|
(This means that X has three properties attached to it. )
Kevin points out that often, you get "groups" - patterns in the data. Often, X's always have exactly one o1, one o2 and one o3.
For example, consider X to be an article, and articles always have one title, one doi, one volume, issue, etc.
In this case, you could have a "property" table like this:
s | p1 | p2 | p3 |
X | o1 | o2 | o3 |
Y | o1y| o2y| o3y|
You could still keep other unsuitable-for-grouping stuff in the main triples table, but drag out groups into 'property tables'. This would make them fast to query. (There's a lot more to it than this - eg how properties with cardinality >1 go in their own table too - but you'll have to read the paper!)
This made sense to me. There are definitely patterns in the data. We identify articles by their issn/volume/issue/page data. They always(?) have these properties. How sensible it would seem to put them in a nice normal table, so we can at least occasionally have sql like:
"select * from articles where articles.vol=...."
- rather than scary mega-joins, to throw at our poor old database. Sounds totally common sense!
Consider this: our 'articles' table would look like this:
id | title | doi | volume | .....
1 | "bla" | 123.456..| 42 | .......
2 | "foo" | 789.101..| 45 | .......
Er.... does this look rather familiar? Isn't this exactly what we started with?
Hang on - what was wrong with what we started with again?
Well, for us, the problem was that we found that we had to keep re-modelling the database - initial assumptions were correct for the first hundred thousand articles.. and then suddenly weren't. For example, we recently discovered that sometimes Articles have 2 DOIs. Horrible, but true. The great thing about the triplestore is that we don't have to bake assumptions about the data into the database - we can have as many whatevers as we like.
I'm still very keen to try out Kevin's architecture - I hope to nag him into solving this undecided-cardinality-problem within his app by changing the schema dynamically and shufting the data around. I just hope that could scale. Much looking forward to first release and finding out more!