Solr’s DateRangeField, How Does It Perform?
Techniques for optimizing performance of the DateRangeField for date range queries in Apache Solr.
Solr’s DateRangeField
I have to credit David Smiley as co-author here. First of all, he’s largely responsible for the spatial functionality, and second, he’s been very generous explaining some of the details. Mistakes are my responsibility, of course.
Solr has had a new DateRangeField for quite some time (see SOLR-6103). DateRangeFields are based on more of the magic of Solr Spatial and allow some very interesting ways of working with dates.
Here are a couple of references to get you started: Working with Dates (Solr Reference Guide) and Spatial for Time Durations.
About the DateRangeField:
- It is a fieldType that indexes a date range, i.e. a beginning and end date in a single field
- It’s used just like a typical range query
- The query parser is a little more lenient in how you specify the range values. Unlike the typical date format in Apache Solr, it supports friendlier date specifications. That is, you can form queries like
q=field:[2000-11-01 TO 2014-12-01]
or
q=field:2000-11
- It supports indexing a date range in a single field. For instance, a range could be added to a document in SolrJ as solrInputDocument.addField("dateRange", "[2000 TO 2014-05-21]") or in XML format as <field name="dateRange">[2000 TO 2014-05-21]</field>
- It supports multi-valued date ranges. This has always been a difficult thing to do with Apache Solr. To index a range, one had to have two fields, say “date_s” and “date_e”. It was straightforward to perform a query that found docs spanning some date; it looked something like
q=date_s:[* TO target] AND date_e:[target TO *]
This worked fine if the document had only one range, but with two or more ranges the approach falls down: if date_s and date_e have multiValued="true", the query above would match the doc if any entry in date_s was earlier than the target date and any entry in date_e was later than the target date, even when those two entries belong to different ranges.
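The multi-valued pitfall described above can be reproduced without Solr at all. Here is a minimal plain-Java sketch, not Solr code: the class, record, and method names are made up for illustration, and the dates are invented. It models a document with two validity ranges and compares the "two multiValued fields" check against a check that keeps each start paired with its own end.

```java
import java.time.LocalDate;
import java.util.List;

// Illustration only: models the matching logic, it does not call Solr.
class MultiRangeDemo {
    /** One validity range with inclusive start and end dates (hypothetical type). */
    record Range(LocalDate start, LocalDate end) {}

    /**
     * Mimics date_s:[* TO target] AND date_e:[target TO *] on two multiValued
     * fields: each clause may be satisfied by a value from a different range.
     */
    static boolean twoFieldMatch(List<Range> ranges, LocalDate target) {
        boolean anyStartOnOrBefore = ranges.stream().anyMatch(r -> !r.start().isAfter(target));
        boolean anyEndOnOrAfter = ranges.stream().anyMatch(r -> !r.end().isBefore(target));
        return anyStartOnOrBefore && anyEndOnOrAfter;
    }

    /** What we actually want: a single range that contains the target date. */
    static boolean pairedMatch(List<Range> ranges, LocalDate target) {
        return ranges.stream()
                .anyMatch(r -> !r.start().isAfter(target) && !r.end().isBefore(target));
    }

    public static void main(String[] args) {
        // A document valid 2000-2002 and again 2010-2012.
        List<Range> doc = List.of(
                new Range(LocalDate.of(2000, 1, 1), LocalDate.of(2002, 12, 31)),
                new Range(LocalDate.of(2010, 1, 1), LocalDate.of(2012, 12, 31)));
        LocalDate target = LocalDate.of(2005, 6, 15); // falls in the gap

        System.out.println("two-field query matches: " + twoFieldMatch(doc, target));
        System.out.println("paired ranges match:     " + pairedMatch(doc, target));
    }
}
```

The two-field check reports a match for a date that falls in the gap between the ranges, because the start comes from the first range and the end from the second. Keeping each range as one indexed value, which is what DateRangeField does, avoids that cross-matching.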
Minor rant: I really approve of Solr requiring full date specifications in UTC date format, but I do admit it is sometimes a bit awkward, so the ability to specify partial dates is pretty cool.
DateRangeField more naturally expresses some of the concepts we often need to support with dates in documents. For instance, “this document is valid from dates A to B, C to D and M to N”. There are other very interesting things that can be done with this “spatial” stuff, see: Hossman’s Spatial for Non Spatial.
Enough of the introduction. In the Reference Guide, there’s the comment “Consider using this [DateRangeField] even if it’s just for date instances, particularly when the queries typically fall on UTC year/month/day/hour etc. boundaries.” The follow-on question is “well, how does it perform?” I recently had to try to answer that question and realized I had no references, so I set out to make some. The result is this blog post.
Methodology:
For this test, there are a few things to be aware of.
- This test does not use the fancy range capabilities. There are some problems that are much easier if you can index a range, but this is intended to compare the “just for date instances” from the quote above. Thus it is somewhat apples-to-oranges. What it is intended to help evaluate is the consequences of using DateRangeField as a direct substitute for TrieDate (with or without DocValues)
- David has a series of improvements in mind that will change some of these measurements, particularly the JVM heap necessary. These will probably not require re-indexing.
- The setup has 25M documents in the index. A series of 1,000 different queries is sent to the server and the results tallied. Measurements aren’t taken until after 100 warmup queries are executed. Each group of 1,000 queries follows one of these patterns:
- q=field:date. These are removed from the results since they aren’t interesting; the response times are all near 0 milliseconds after warmup.
- simple q=field:[date1 TO date2]. These are not included in the graph as they’re not interesting; they are all satisfied too quickly to be of consequence.
- interval facets, facet=true&...facet.range.start=date1&facet.range.end=date2&facet.range.gap=+1DAY (or MINUTE or..).
- 1-5 facet.query clauses where q=*:*
- The setup is not SolrCloud as it shouldn’t really impact the results.
- The queries were run with 1, 10, 20, and 50 threads to see if there was some weirdness when the Solr instances got really busy. There wasn’t; the results produced essentially the same graphs, so the graph below is for the 10 thread version.
- The DateRangeField was compared to:
- TrieDate, indexed="true" docValues="false" (TrieDate for the rest of this document)
- TrieDate, indexed="true" docValues="true" (DocValues in the rest of this document)
- I had three cores, one for each type. Each core held identical, very simple documents: basically the ID field and the dateRange field (well, the _version_ field was defined too). For each test:
- Only the core under test was active; the other two were not loaded (trickery with core.properties if you must know).
- At the end of each test I measured the memory consumption, but the scale is too small to draw firm conclusions. What I _can_ report is that DateRangeField is not wildly different at this point. That said, see the filterCache notes in David’s comments below.
- Statistics were gathered on an external client where QTimes were recorded.
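To make the query patterns above concrete, here is a small sketch of how a test client might assemble the request parameters. It is not the client used for these measurements. The field name and dates are placeholders, the helper names are hypothetical, and because the original facet URL elides some parameters, this sketch assumes q=*:* plus the standard facet.range parameter naming the field. It uses only the JDK, so it builds query strings without sending them.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: builds URL-encoded parameter strings for the test patterns.
class BenchmarkQueries {
    private static String enc(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    /** Simple range query: q=field:[date1 TO date2]. */
    static String rangeQuery(String field, String date1, String date2) {
        return "q=" + enc(field + ":[" + date1 + " TO " + date2 + "]");
    }

    /** Interval (range) facet over the whole index; q=*:* is an assumption here. */
    static String rangeFacet(String field, String start, String end, String gap) {
        return "q=" + enc("*:*")
                + "&facet=true"
                + "&facet.range=" + enc(field)
                + "&facet.range.start=" + enc(start)
                + "&facet.range.end=" + enc(end)
                + "&facet.range.gap=" + enc(gap); // "+1DAY" is sent as "%2B1DAY"
    }

    /** q=*:* with one facet.query clause per {start, end} pair. */
    static String facetQueries(String field, String[][] ranges) {
        StringBuilder sb = new StringBuilder("q=" + enc("*:*") + "&facet=true");
        for (String[] r : ranges) {
            sb.append("&facet.query=").append(enc(field + ":[" + r[0] + " TO " + r[1] + "]"));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(rangeQuery("dateRange", "2000-11-01", "2014-12-01"));
        System.out.println(rangeFacet("dateRange",
                "2000-01-01T00:00:00Z", "2001-01-01T00:00:00Z", "+1DAY"));
        System.out.println(facetQueries("dateRange",
                new String[][] {{"2000", "2001"}, {"2001", "2002"}}));
    }
}
```

One detail worth noticing is the gap value: a literal + in a URL is decoded as a space, so +1DAY has to be percent-encoded when the request is built by hand.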
Results
- As the graph a bit later shows, DateRangeField out-performed both TrieDate and DocValues in general.
- The number of threads made very little difference in the relative performance of DateRangeField vs. the other two. Of course, the absolute response time will increase once enough threads are executing at once that the CPU gets saturated.
- DateRangeFields have a fairly constant improvement when measured against TrieDate fields and TrieDate+DocValues.
- The facet.range.method=dv option was not enabled on these tests. For small numbers of hits, specifying this value may well significantly improve performance, but this particular test uses a minimum bucket size of 1M which empirically is beyond the number of matches where specifying that parameter is beneficial. I’ll try to put together a follow-on blog with smaller numbers of hits in the future.
The Graph
This will take a little explanation. Important notes:
- The interval and query facets are over the entire 25M documents. These are the points on the extreme right of the graph. These really show that for interval and query facets, in terms of query time, the difference isn’t huge.
- The rest of the marks (0-24 on the X axis) are performance over hits of that many million docs for day, minute and millisecond ranges. So a value of 10 on the x axis is the column for result sets of 10M documents for TrieDate and TrieDate+DocValues.
- The few marks above 1 (100%) are instances where DateRangeFields were measured as a bit slower. This may be a test artifact.
- The Y-axis is the time the DateRangeField queries took as a fraction of the time taken by TrieDate (green Xs) and TrieDate+DocValues (red Xs), so values below 1 (100%) mean DateRangeField was faster.
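For readers who want to produce a comparable number from their own runs, here is a rough sketch of how one Y-axis value could be computed from recorded QTimes. The post does not say how the timings were aggregated, so using the mean is an assumption, as are the class and method names and the sample numbers. The warmup discard mirrors the methodology described earlier.

```java
import java.util.Arrays;

// Hypothetical aggregation: mean QTime after warmup, expressed as a ratio.
class RelativeTiming {
    /** Mean of the samples after discarding the first `warmup` entries. */
    static double meanAfterWarmup(long[] qTimesMs, int warmup) {
        return Arrays.stream(qTimesMs)
                .skip(warmup)
                .average()
                .orElseThrow(() -> new IllegalArgumentException("no samples left after warmup"));
    }

    /** DateRangeField time as a fraction of the baseline time; below 1.0 means faster. */
    static double ratio(long[] dateRangeMs, long[] baselineMs, int warmup) {
        return meanAfterWarmup(dateRangeMs, warmup) / meanAfterWarmup(baselineMs, warmup);
    }

    public static void main(String[] args) {
        // Invented QTimes in milliseconds; the first entry of each array is warmup.
        long[] dateRange = {50, 8, 6, 7};
        long[] trieDate = {60, 10, 10, 10};
        System.out.println("ratio = " + ratio(dateRange, trieDate, 1));
    }
}
```

With real data the warmup count would be 100 and each array would hold the QTimes for one group of 1,000 queries.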
Index And Memory Size
The scale is too small to report on index and memory differences. At this size (25M date fields), the difference between index-only and docValues in both memory and disk size (not counting DateRangeField) was small enough to be buried in the noise. So even though I looked at it, it’s misleading at best to report, say, a difference of 1%. See David’s comments below. We do know that DocValues will increase the on-disk index size and decrease the JVM memory required by roughly the size of the *.dvd files on disk.
David Smiley’s Enrichment
Again, thanks to David for his tutorials. Here are some things to keep in mind:
- The expectation is that DateRangeField should be faster than TrieDateField for ranges that are aligned to units of a second or coarser than that; but perhaps not any coarser than a hundred years apart.
- So if you expect to do range queries from some random millisecond to another, you should continue to use TrieDate; otherwise consider DateRangeField.
- [EOE] I have to emphasize again that the DateRangeField has applications to types of queries that were not exercised by this test. There are simply problems that are much easier to solve using DateRangeField. This exercise was just to stack up DateRangeField against the other variants.
- TrieDate+DocValues does not use the filterCache at all, whereas TrieDate-only and DateRangeField do. At present there isn’t a way to tell DateRangeField not to use the filterCache. That said, one of the future enhancements is to enable facet.range with DateRangeField to use the low-level facet implementation in common with the spatial heatmap faceting, which would result in DateRangeField not using the filterCache.
- [EOE] If you want more details on this, ask David, he’s the wizard. I’ll add that the heatmap stuff is very cool, I saw someone put this in their application in 2 hours one day (not DateRangeField, just a heatmap). Admittedly the browser display bits were an off-the-shelf bit of code.
- Another thing on the radar worth mentioning is the advent of “PointValues” (formerly known as DimensionalValues) in Lucene 6. It would stack up like a much faster TrieDateField (without DocValues).
- Discussions pertaining to memory use or realtime search mostly just apply to facet.range. For doing a plain ol’ range query search, the DV doesn’t even apply and there are no memory requirements or concerns.
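The rule of thumb about second-aligned ranges can be written as a tiny check. This is only a sketch of that one heuristic, with hypothetical names. It ignores the hundred-year caveat and every other reason to pick a field type, such as needing to index real ranges.

```java
import java.time.Instant;

// Hypothetical helper encoding only the "aligned to a second or coarser" heuristic.
class FieldTypeHint {
    /** True when the instant has no sub-second component. */
    static boolean isSecondAligned(Instant t) {
        return t.getNano() == 0;
    }

    /** Suggests a field type for range queries whose endpoints look like these. */
    static String suggestFieldType(Instant start, Instant end) {
        return isSecondAligned(start) && isSecondAligned(end)
                ? "DateRangeField"
                : "TrieDateField";
    }

    public static void main(String[] args) {
        System.out.println(suggestFieldType(
                Instant.parse("2016-02-01T00:00:00Z"), Instant.parse("2016-03-01T00:00:00Z")));
        System.out.println(suggestFieldType(
                Instant.parse("2016-02-01T10:15:30.123Z"), Instant.parse("2016-02-01T10:15:31.456Z")));
    }
}
```

In practice you would run this over a sample of the date ranges your application actually queries, rather than a single pair.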
Closing Remarks
As always, your mileage may vary when it comes to using DateRangeFields.
- For realtime searches, docValues are preferred for both TrieDate-only and DateRangeFields, although in the future that may change.
- As more work is done here more functionality will be pushed down into the OS’s memory so the JVM usage by DateRangeField will be reduced.
- If your problem maps more fully into the enhanced capabilities of DateRangeField, it should be preferentially used. Performance will not suffer (at least as measured by these tests), but you will pay a memory cost over TrieDate+DocValues.
- I had the filterCache turned off for this exercise. This is likely a closer simulation of NRT setups, but in a relatively static index, DateRangeField using the filterCache needs to be evaluated in your specific situation to determine the consequences.
Over-Analysis
Originally, I wanted to compare memory usage, disk space etc. There’s a tendency to try to pull information that just isn’t there out of a limited test. After I dutifully gathered many of those bits of information I realized that… there wasn’t enough information there to extract any generalizations from. Anything I could say based on this data other than what David provided and what I know to be true (e.g. docValues increase index size on disk but reduce JVM memory) would not be particularly relevant…
As always, please post any comments you have, especially Mr. Smiley!
This post originally published on February 13, 2016.