Optimize operations and expungeDeletes may no longer be as bad for you as they once were. They’re still expensive and should not be used casually.

That said, these operations are no longer as susceptible to the issues listed in my previous article. If you’re not familiar with Solr/Lucene’s segment merging process, that blog provides some background that may be useful.

Executive Summary

  • expungeDeletes and optimize/forceMerge implemented by the default TieredMergePolicy (TMP) behave quite differently starting with Apache Solr 7.5.
  • TieredMergePolicy will soon have additional options for controlling the percentage of deleted documents in an index. See: LUCENE-8263 for the current status.
  • TMP now respects the configuration parameter maxMergedSegmentMB for forceMerge and expungeDeletes by default.
  • If you require the old behavior for forceMerge (optimize), you can get it by specifying maxSegments on the optimize command.
  • expungeDeletes has no option to exceed maxMergedSegmentMB.
  • If you have created very large segments and deleted documents accumulate in them, those segments will be “singleton merged” to purge the deleted documents. NOTE: currently this will only happen when your index approaches around 50% deleted docs, although a follow-on JIRA may make that tunable.

Introduction

A while ago, I wrote a blog about a “gotcha” when using Solr’s optimize command and the expungeDeletes commit option. As of Apache Solr 7.5, the worst-case scenario outlined in that post is no longer valid. If you want to see all of the gory details, see LUCENE-7976 and related JIRAs. WARNING: When Solr/Lucene devs get to discussing something like this, it can make your eyes glaze over.

As of Solr 7.5, optimize (aka forceMerge) and expungeDeletes respect the maxMergedSegmentMB configuration parameter when using TieredMergePolicy, which is both the default and recommended merge policy to use.

For such a simple statement, there are some fairly significant ramifications, thus this blog post.
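
For reference, maxMergedSegmentMB is set on the merge policy factory in solrconfig.xml. Here is a minimal sketch assuming the stock TieredMergePolicyFactory; the 5000 value simply spells out the long-standing ~5GB default rather than recommending a change:

    <!-- solrconfig.xml, inside the <indexConfig> section -->
    <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
      <!-- ceiling, in MB, on segments produced by merging; roughly 5GB is the default -->
      <double name="maxMergedSegmentMB">5000</double>
    </mergePolicyFactory>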

Quick Review of forceMerge and expungeDeletes Prior to Apache Solr 7.5

First, a quick review. The default behavior when optimize was run, or when expungeDeletes was specified on the commit command, was that any segments selected for merging were merged into a single segment, regardless of how large the resulting segment became. (Example commands for both operations follow the list below; the syntax is unchanged in 7.5, only the behavior differs.)

  • For optimize, the entire index was merged into the number of segments specified by the maxSegments parameter (default 1).
  • For expungeDeletes, all the segments that had more than 10% deleted documents were combined into a single segment.
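
For concreteness, both operations are issued against the update handler; the host and collection name below are made up, and the request syntax is the same before and after 7.5, only the resulting segment sizes differ:

    # prior to 7.5: merged the entire index down to maxSegments segments (default 1)
    curl "http://localhost:8983/solr/mycollection/update?optimize=true"

    # prior to 7.5: rewrote every segment with more than 10% deleted docs into one segment
    curl "http://localhost:8983/solr/mycollection/update?commit=true&expungeDeletes=true"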

For “natural” merging as an index is being updated, each hard commit initiated a process as follows:

  • all segments whose “live” (non-deleted) documents totaled less than 50% of maxMergedSegmentMB were examined and selected segments were merged.
  • “selected segments” means that heuristics were applied to choose the merges that required the least work while still respecting maxMergedSegmentMB.

The critical difference here is that optimize/forceMerge and expungeDeletes did not respect maxMergedSegmentMB. In both cases, merged segments have all data associated with deleted documents in the original segments removed. This reduces the amount of disk space occupied by the index and reduces the number of segments in the index.

Why Was maxMergedSegmentMB Implemented in the First Place?

There’s a long discussion here, but I’m going to skip much of it and say that keeping an index up to date involves a number of competing priorities, and maxMergedSegmentMB was part of resolving them. The various bits that need to be balanced include:

  • Keeping I/O under control as indexing and searching can be sensitive to I/O bottlenecks.
  • Keeping the segment count under control to prevent running out of file handles and the like.
  • Keeping memory consumption under control; the idea of requiring, say, 5G of heap just for indexing is unacceptable.
  • When this code was originally written, there were significant speed gains to be had by merging down to one segment; later versions of Solr don’t show the same level of improvement.

As Lucene has evolved, the utility of forceMerge/optimize has lessened, but the underlying merge policy needed to catch up.

The New Way

As of Apache Solr 7.5, optimize (aka forceMerge) and expungeDeletes now use the same algorithm that “natural” merges use. The relevant difference between “natural”, “forceMerge/optimize”, and “expungeDeletes” is which segments are candidates for merging.

There are three cases:

  • natural: All segments are considered for merging. This is the normal operation when indexing documents to Solr/Lucene. The various possibilities are scored and the cheapest ones are chosen as measured by estimates of computation and I/O. Large segments with few deletions are unlikely to be considered cheap and thus rarely merged.
  • expungeDeletes: Segments with more than 10% deleted documents, no matter how large, are considered for merging.
  • optimize: Siiiigh, there are two sub-cases here, depending on whether maxSegments is specified (example invocations follow this list):
    • maxSegments is specified: all segments are eligible.
    • maxSegments not specified: all segments with less than maxMergedSegmentMB of “live” documents, plus all segments with deleted documents, are eligible. Thus segments larger than maxMergedSegmentMB that have no deleted docs are not eligible.
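
To make the two optimize sub-cases concrete (the collection name is again hypothetical):

    # no maxSegments: respects maxMergedSegmentMB; over-sized segments with no deletes are left alone
    curl "http://localhost:8983/solr/mycollection/update?optimize=true"

    # maxSegments specified: explicitly asks for one segment, which may far exceed maxMergedSegmentMB
    curl "http://localhost:8983/solr/mycollection/update?optimize=true&maxSegments=1"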

“Wait!” you cry! “You’ve told us that maxMergedSegmentMB is respected for expungeDeletes and optimize/forceMerge, yet you can specify maxSegments=1 and have segments waaaaaay over maxMergedSegmentMB! How does that work?”

I’m so glad you asked (I love providing both sides of the argument. While I can disagree with myself, I never lose the argument! Yes you do. No I don’t. You’re a big stupid-head… Excuse me, my therapist says I should perform calming exercises when that starts happening).

Ok, I’m back now.

TMP in Solr 7.5 introduces a “singleton merge”. Whenever a segment qualifies for merging, if it’s “too big” it can be re-written into a new segment, removing deleted documents in the process.

This has some interesting consequences. Say you have optimized down to 1 segment and start indexing more docs that cause deletions to occur. The blog post linked at the top of this article expounds on the negatives there, namely that the single large segment won’t be merged away until the vast majority of it consists of deleted documents. This is no longer true. When certain other conditions are met, a “singleton merge” will be performed on that one overly-large segment, essentially rewriting it to exactly 1 new segment and removing deleted documents. It will gradually shrink back to under maxMergedSegmentMB, at which point it’s treated just like any other segment.

WARNING: This comes at a cost of course, that cost being increased I/O. Let’s say you have a segment 200GB in size. Let’s further say that it consists of 20% deleted documents and is selected for a singleton merge. You’ll re-write 160GB at some point determined by the merging algorithm. This gives you a way, without re-indexing, to recover from the conditions outlined in the blog linked at the beginning of this article, but it’s best by far not to get into that situation in the first place.
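
If you want to know whether you are anywhere near this situation, you can inspect per-segment sizes and deleted-document counts before reaching for any of these commands. One way to do that, assuming a recent Solr where the Segments API is available as an implicit handler (core/collection name illustrative):

    # report each segment's size on disk, document count, and deleted-document count
    curl "http://localhost:8983/solr/mycollection/admin/segments?wt=json"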

I’ll repeat this several times:

Do not assume optimize/forceMerge and/or expungeDeletes are A Good Thing; measure first.

If you can show evidence that it’s valuable in your situation, then only do these operations under controlled conditions as they’re expensive.

You’re Still Talking About 50% Deleted Documents, and That’s Too Much

I’m so glad you asked (reprise).

A follow-on JIRA, LUCENE-8263, discusses the approach used to control this. I’ll update this blog post when the code is committed to Solr. You’ll be able to specify that your index should consist of no more than a defined percentage of deleted documents.

WARNING: TANSTAAFL (There Ain’t No Such Thing As A Free Lunch). This reduction in deleted documents will come at the cost of increased I/O as well as CPU utilization. If the percentage of deleted docs matters to you, the preferred approach is to issue an expungeDeletes during off hours instead.

Why expungeDeletes rather than forceMerge/optimize? Well, it’s a judgement call, the consideration being whether you’re willing to expend the resources to rewrite a segment that’s 4.999G in size to reclaim 1 document’s worth of resources.

What Do You Recommend?

In order of preference:

  1. Don’t worry, be happy! Unless you have good reason to require that deleted docs are purged, just don’t worry about it. Let the default settings control it all.
  2. When LUCENE-8263 is available (probably Solr 7.5), assign a new target percentage of deleted documents to TMP (in solrconfig.xml for Solr users), and measure, measure, measure. This will increase your I/O and CPU utilization during regular indexing. If you only test in a development environment, that increased load may not seem significant, but it can become significant in production.
  3. Periodically execute a commit with expungeDeletes. Don’t fiddle with the 10% default; it represents a reasonable compromise between wasted space and out-of-control I/O. Lucene is very good at skipping deleted docs, so the main expense is disk space and memory. If those aren’t in short supply, leave it alone (or even increase it). A sketch of an off-hours cron job for this follows the list.
  4. Optimize/forceMerge periodically. This is not nearly as “fraught” as before since maxMergedSegmentMB is respected, so you won’t automatically create huge segments. But it will consume more I/O and CPU than an expungeDeletes would.
  5. Optimize/forceMerge with maxSegments=1. This is OK if (and only if) you can tolerate re-running the command regularly. One typical pattern is when an index is updated only once a day during off hours and you can follow that up with an optimize/forceMerge.
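
If you land on option 3 (or 5), run the command from a scheduler rather than from your indexing client. A sketch, with the URL and the 2:30 AM schedule as assumptions for illustration:

    # crontab entry: purge deleted docs nightly, outside peak query and indexing hours
    30 2 * * * curl -s "http://localhost:8983/solr/mycollection/update?commit=true&expungeDeletes=true" > /dev/null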

Conclusion

Optimize/forceMerge are better behaved, but still expensive operations. We strongly advise that you do not do these at all without seriously considering the consequences. A horrible anti-pattern is to do these operations from a client program on each commit. In fact we discourage even issuing basic commits from a client program.

If you’ve tested rather than assumed that optimize/forceMerge and/or expungeDeletes is beneficial, run them periodically from a cron job during off hours.


This post was originally published on June 20, 2018.
