Apache Solr and Join Queries
The ins and outs of join queries in Apache Solr.
Experimenting with Solr Join Query Performance
We recently had a client who wanted some up-front sense of whether Apache Solr provided join support and, if so, how it performed. Naturally, the client wanted to use a join query in the most painful way, so I set out to build a prototype. Of course I ran into some issues, but one of the delights of working for Lucidworks is that I have ready access to many of the people who wrote the code, something to treasure! Being able to ask these folks makes me look waaaay smarter than I am….
Anyway, on my 2009 MacBook Pro I ran some rather unscientific experiments, but enough to give me a sense of how join queries perform in one particular case. I’ll outline what I did and what the results were.
The Setup
For this experiment, I created an index of 26M documents, divided into groups of six: one text document and five metadata documents. Each text document contained 1K semi-random English words (chosen from “some list I got from the internet”). Semi-random because I weighted them a bit to favor common words over rare ones, but it turns out that the searching part of the process isn’t where the time is spent, so we can pretty much ignore that.
The five metadata documents are related to the text doc by Solr’s <uniqueKey> (analogous to an RDBMS primary key). Think of the metadata documents as having a foreign key into the text doc’s <uniqueKey>. The metadata documents also had an integer field in the range 0-10,000. The whole purpose of this setup was to form queries that returned the text docs for which a metadata doc granting access existed. The complexity of granting access is…er…low; I just did a range query. “Not realistic,” you say. You’re right. I didn’t want any complex processing to get in the way of looking at joins, so I kept all this simple.
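To make the shape of the data concrete, here’s a minimal sketch of how such a corpus might be generated. The field names (text_all, join_id, access) mirror the query below, but the word list and the generator itself are stand-ins for illustration, not my original code:

```python
import random
import uuid

# A hypothetical sketch of the test corpus described above. The word
# list is a placeholder (the original was weighted toward common
# words; uniform here for brevity).
WORDS = ["apple", "banana", "cherry", "delta", "echo", "foxtrot"]

def make_group():
    """Build one text doc plus five metadata docs that point back at it."""
    text_id = str(uuid.uuid4())
    text_doc = {
        "id": text_id,  # Solr's <uniqueKey>
        "text_all": " ".join(random.choices(WORDS, k=1000)),  # ~1K words
    }
    metadata_docs = [
        {
            "id": str(uuid.uuid4()),
            "join_id": text_id,                  # "foreign key" to the text doc
            "access": random.randint(0, 10000),  # integer field for the range fq
        }
        for _ in range(5)
    ]
    return [text_doc] + metadata_docs
```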
The form of the join query was:
q=text_all:(1 to 3 random words)&fl=id,score&sort=score desc&fq={!join from=join_id to=id}access:[7434 TO 7514]
See: http://wiki.apache.org/solr/Join. I wrote a little harness to fire HTTP queries at the Solr instance (a 4.x build from a couple of months back), with a configurable number of simultaneous query threads. Note that I was testing this form because it applied to the customer, but I suspect that the other forms have the same issue.
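I no longer have the original harness handy, but a minimal sketch of the idea looks something like this. The Solr URL, collection name, and query terms are assumptions; substitute your own:

```python
import threading
import time
import urllib.parse
import urllib.request

# A rough sketch of the kind of harness described above: N client
# threads, each firing the join query at Solr over HTTP.
SOLR_URL = "http://localhost:8983/solr/collection1/select"  # assumption

def run_queries(count, qtimes):
    for _ in range(count):
        params = urllib.parse.urlencode({
            "q": "text_all:(some random words)",
            "fl": "id,score",
            "sort": "score desc",
            "fq": "{!join from=join_id to=id}access:[7434 TO 7514]",
            "wt": "json",
        })
        start = time.time()
        urllib.request.urlopen(SOLR_URL + "?" + params).read()
        qtimes.append(time.time() - start)  # list.append is thread-safe in CPython

num_threads = 5          # simultaneous client threads
queries_per_thread = 20
qtimes = []
wall_start = time.time()
threads = [threading.Thread(target=run_queries, args=(queries_per_thread, qtimes))
           for _ in range(num_threads)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.time() - wall_start
print(f"avg response: {sum(qtimes) / len(qtimes):.1f}s, "
      f"throughput: {len(qtimes) / elapsed:.2f} queries/sec")
```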
Small Disclaimer
As I mentioned, one of the pleasures of working for Lucidworks is having access to people who deeply understand the code. So I chatted with the join author (Yonik Seeley) and discovered, of course, that the scenario I was testing was the worst one, performance-wise. Joins are O(num_terms_in_fields), and using the <uniqueKey> as my join field guarantees that there are lots and lots and lots of terms. So these results are worst-case. Unfortunately, they’re also one of the most common cases.
| Threads | Queries (total) | Avg QTime (seconds) | Elapsed clock time (seconds) | Queries/second |
|---------|-----------------|---------------------|------------------------------|----------------|
| 1       | 20              | 4.9                 | 98                           | 0.2            |
| 2       | 40              | 5.9                 | 123                          | 0.3            |
| 5       | 100             | 15.3                | 310                          | 0.3            |
| 10      | 200             | 31.5                | 649                          | 0.3            |
A note about these rather counter-intuitive numbers: once the CPU maxes out, the QTime starts to increase, but the QPS rate stays roughly constant. On a dual-core machine, that happens at 2 threads. The 5- and 10-thread (client) rows simply show that each individual request takes longer, end to end, but more queries are being served by Solr simultaneously.
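A quick back-of-the-envelope calculation from the table makes the latency-vs.-throughput point:

```python
# Throughput computed from the table above: once the CPU is saturated
# (2 threads on my dual-core box), QPS barely moves even though the
# per-query latency balloons from ~5.9s to ~31.5s.
runs = [(1, 20, 98), (2, 40, 123), (5, 100, 310), (10, 200, 649)]
for threads, queries, clock_secs in runs:
    print(f"{threads:>2} thread(s): {queries / clock_secs:.2f} queries/sec")
# Prints ~0.20, 0.33, 0.32, 0.31 -- essentially flat after saturation.
```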
When I took the join part out, performance went up about 15x. I was monitoring the CPU, and it was pegged with 2 threads, which makes sense. I had jConsole running and didn’t see anything odd with memory/garbage collection, but it was just a cursory examination.
But What Does It Mean?
Well, the take-away is that you really, really should experiment with join performance in your situation before deciding on it as a solution for all your problems. I’d expect the numbers to be much better for fields with fewer unique values. But Solr makes a lousy RDBMS, and every time you think of using it as one, you should make an effort to re-think your problem in a way that doesn’t require Solr to behave like a database. These numbers, assuming they are representative of your particular situation, could well be killers. On the other hand, they may be fine if you’re serving a small community of users for whom the time spent waiting for a query to return is well spent. It Depends ™.
It might also mean that the case the Solr join functionality was meant to solve takes an unnecessarily restrictive approach for this particular problem. I suspect it’s quite possible that specializing the join code for the case where the “to” field is a <uniqueKey> might change the performance radically. One of the characteristics of open-source code is that solutions for the immediate problem get implemented first and then refined for other cases as necessary.
“Rethinking” often involves at least four phases:
- Think hard about the problem. Can it be solved by clever indexing? DB folks really don’t like to flatten data, but that’s often a viable approach (see the sketch after this list).
- Ask if the functionality is really and truly something that’ll help the user experience. Often faceting and filtering will be “good enough”. There’s nothing particularly “natural” about RDBMS concepts as far as your users are concerned, so ask your UI design experts what would really help the user.
- Prototype in your situation and talk to your product managers before irrevocably deciding to go down this route. The eXtreme Programming people emphasize over and over that making your PMs aware of the costs of a feature they’re insisting on will help them make better decisions. Asking “What you want will require 5 times as much hardware and take an additional month to implement; will XXX be good enough?” gives them some information.
- Ask „Is Solr the right solution?“ I love Solr/Lucene. Working with this ecosystem pays my bills. I admire the work that people do in the nitty-gritty parts of the code. But Solr and Lucene aren’t suited for some tasks. It may be that the problem you’re trying to solve would be better served by an RDBMS. It may be that some kind of hybrid between Solr and an <insert your favorite solution here> works better. It may be that Solr shouldn’t be part of the solution to this problem at all. Not all nails should be hammered with Solr.
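As promised in the first bullet, here’s a minimal sketch of the flattening idea, building on the hypothetical corpus generator above. The access_flat field name is an assumption for illustration:

```python
# A sketch of flattening: at index time, copy the metadata docs'
# access values onto the text doc itself as a multiValued field
# (access_flat is a hypothetical name), so the filter becomes a plain
# range query with no {!join} at all.
def flatten(text_doc, metadata_docs):
    text_doc["access_flat"] = [m["access"] for m in metadata_docs]
    return text_doc

# The fq then becomes access_flat:[7434 TO 7514]. The trade-off: any
# change to a metadata doc means re-indexing its text doc.
```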
As I understand it, this behavior is inherent in how the join code is implemented and the number of matching documents isn’t the limiting factor (and this was borne out in my experiments). I wonder if one could make use of the fact that the join field is a <uniqueKey> to implement a specialization. Hmmm, I’ll have to talk to Yonik, but I suspect it’s one of those things that seems simple but quickly gets untenable. And, here we go again trying to make Solr behave like a DB…..
You Can Still Use Joins!
Don’t interpret this as saying “don’t use joins”. Rather, be aware that they were implemented to solve a specific problem, not the general many-to-many relationship. The algorithm does what it needs to do to solve that problem, but when applied to different problems it may not be performant enough for your situation. Test, test, test!!!
Addendum
Looking at the results above made me wonder what would happen if I restricted the number of unique values in the join field, since joining on a field with lots and lots of unique values is just asking for trouble. So I ran the 5-thread example above against a new join field containing just 200 unique values (instead of 5,000,000) and got roughly a 10-fold increase in the QPS rate:
- threads: 5
- queries: 110
- avg QTime: 0.5 seconds
- clock time: 53 seconds
- queries/sec: 2
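For reference, the only change was the join field itself. A sketch of the two filter queries side by side (the low-cardinality field names are hypothetical):

```python
# The original fq joins on join_id, which has ~5M unique values; the
# variant joins on a hypothetical field with only ~200 unique values.
# Since join cost scales with the number of unique terms in the join
# fields, shrinking the term count is what buys the ~10x QPS gain.
fq_high_cardinality = "{!join from=join_id to=id}access:[7434 TO 7514]"
fq_low_cardinality = "{!join from=join_group to=group}access:[7434 TO 7514]"
```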
This underscores the caution above: before deciding whether joins are the answer to your question, you really need to test against a realistic dataset.
This post originally published on June 20, 2012.