Getting Started: Spell Checking with Apache Lucene and Solr
Introduction
Recently, I did some minor work on improving the usability of the Lucene spell checker (see LUCENE-2479, LUCENE-2608 and the associated Solr work) and it got me thinking that a post on spell checking in Solr would be useful.
For those who aren’t familiar, spell checking in search (often called Did You Mean?) is slightly different from simply correcting spelling errors. It’s not that we don’t want to correct misspelled words; rather, we want to suggest words that will lead to better results, based on the way things are spelled in the index as well as other factors such as past user behavior, the “correct” spelling of the word and any other a priori information, such as business goals, we might have. For instance, a word may be misspelled so often by writers in your corpus that the best suggestion just might be an incorrectly spelled word, even if the user’s original query was properly spelled! For some background on building the foundation of a spell checker, see Peter Norvig’s excellent post.
Background
To understand spell checking in Solr, it is helpful to know a bit more about what is going on underneath the hood. There are several working parts to the spell checker, some in Solr and some in Lucene.
Starting with Solr, the primary mechanism for delivering spelling corrections is a Search Component called the SpellCheckComponent. It is a highly configurable component that allows an application designer to plug in multiple spell checkers (more in a moment) at configuration time and then receive spelling suggestions from those spell checkers at query time as part of the Solr query response. A spell checker is an implementation of SolrSpellChecker that, given inputs like a query and other parameters, returns suggestions along with other metadata. There are several spell checkers provided, including ones based on the Lucene spell checker and file based ones. The most commonly used one is the Lucene spell checker, but I’ve implemented others and have seen others do likewise.
Since the Lucene spell checker is the most commonly used one, it is worth digging into a little bit more. The Lucene spell checker comes with the Lucene release and is located in contrib/spellchecker. At a high level, it works by taking an existing field from the main Lucene index and building a secondary index designed specifically for rapid lookup of candidate suggestions. This index, for those who are curious, is built by creating character-based n-grams of the words from the original field. At query time, the word to be checked is appropriately analyzed and then searched against this secondary index. Assuming one or more hits are returned, each candidate word is then compared to the original word using a String distance measure (see org.apache.lucene.search.spell.StringDistance). The distance measure is completely pluggable. There are currently three different measures available: LevensteinDistance, JaroWinklerDistance and NGramDistance. Each one has its own merits, so I recommend you try them out to see which gives the best results. Generally speaking, JaroWinkler does a good job of accounting for the fact that most people get the first few characters of a word right. (Coincidentally, my Taming Text co-author Tom Morton has written a full chapter on fuzzy string matching, including spell checking, for our book. It is chapter 4 and is currently available in MEAP.) Once the candidates are scored, they are added to a Priority Queue and then the top X results are returned, where X is an input parameter to the method call.
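To make those steps a bit more concrete, here is a minimal, self-contained Java sketch of the same idea. To be clear, this is not Lucene's code: the class name NGramSpellSketch, the fixed gram size of 2 and the normalized edit-distance score are all assumptions I made for illustration, and the real spell checker is more elaborate. It simply shows the shape of the approach: build an n-gram lookup, gather candidates that share a gram with the input, score them with a string distance, and pull the top X from a priority queue.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.Set;

/** Toy illustration only; NOT Lucene's SpellChecker implementation. */
class NGramSpellSketch {
    private static final int N = 2; // assumed gram size for this sketch

    // Secondary "index": n-gram -> dictionary words containing that gram.
    private final Map<String, Set<String>> gramIndex = new HashMap<>();

    /** Split a word into overlapping character n-grams, e.g. "solr" -> so, ol, lr. */
    static List<String> grams(String word) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + N <= word.length(); i++) {
            out.add(word.substring(i, i + N));
        }
        return out;
    }

    /** Add one word from the source field to the secondary index. */
    void add(String word) {
        for (String g : grams(word)) {
            gramIndex.computeIfAbsent(g, k -> new HashSet<>()).add(word);
        }
    }

    /** Classic dynamic-programming edit distance (insert, delete, substitute). */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    /** Normalized similarity in [0, 1]; 1.0 means the strings are identical. */
    static double similarity(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        return maxLen == 0 ? 1.0 : 1.0 - (double) editDistance(a, b) / maxLen;
    }

    /** Return up to topX candidates, most similar first (ties broken alphabetically). */
    List<String> suggest(String word, int topX) {
        Set<String> candidates = new HashSet<>();
        for (String g : grams(word)) {
            candidates.addAll(gramIndex.getOrDefault(g, Collections.<String>emptySet()));
        }
        candidates.remove(word); // do not suggest the input itself
        Comparator<String> bySimilarity = (x, y) -> {
            int c = Double.compare(similarity(word, y), similarity(word, x));
            return c != 0 ? c : x.compareTo(y);
        };
        PriorityQueue<String> queue = new PriorityQueue<>(bySimilarity);
        queue.addAll(candidates);
        List<String> best = new ArrayList<>();
        while (!queue.isEmpty() && best.size() < topX) best.add(queue.poll());
        return best;
    }
}
```

Swapping the similarity method for a different measure is the toy equivalent of plugging in a different StringDistance, which is why it is worth experimenting with more than one.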
With the background out of the way, let’s take a look at it in action.
Setup
Setup of the SpellCheckComponent is pretty easy. In the solrconfig.xml, we need to declare a <searchComponent> and then configure it. The Solr tutorial, for instance, has:
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">textSpell</str>
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">name</str>
    <str name="spellcheckIndexDir">./spellchecker</str>
  </lst>
  <!-- a spellchecker that uses a different distance measure
  <lst name="spellchecker">
    <str name="name">jarowinkler</str>
    <str name="field">spell</str>
    <str name="distanceMeasure">org.apache.lucene.search.spell.JaroWinklerDistance</str>
    <str name="spellcheckIndexDir">./spellchecker2</str>
  </lst>
  -->
  <!-- Use an alternate comparator -->
  <!--<lst name="spellchecker">
    <str name="name">freq</str>
    <str name="field">lowerfilt</str>
    <str name="spellcheckIndexDir">spellcheckerFreq</str>
    <!– comparatorClass be one of:
         1. score (default)
         2. freq (Frequency first, then score)
         3. A fully qualified class name
    –>
    <str name="comparatorClass">freq</str>
    <str name="buildOnCommit">true</str>
  </lst>
  -->
  <!-- a file based spell checker
  <lst name="spellchecker">
    <str name="classname">solr.FileBasedSpellChecker</str>
    <str name="name">file</str>
    <str name="sourceLocation">spellings.txt</str>
    <str name="characterEncoding">UTF-8</str>
    <str name="spellcheckIndexDir">./spellcheckerFile</str>
  </lst>
  -->
</searchComponent>
While this setup shows a number of different ways to set up the spell checker, I’m going to focus on the key moving parts. The first thing to notice is the queryAnalyzerFieldType. This tells the spell checker how to tokenize and otherwise analyze the incoming query to prep it for spell checking. Generally speaking, it should be a FieldType that can produce tokens that match the analysis used to create tokens in your spelling index/dictionary. If you are using the Lucene spell checker, it should match the analysis of the source Field used to generate the spelling index (in this case the “name” field). The other thing to notice is the declarations of the spell checkers (the <lst name="spellchecker"> elements). In this case, we have one declared spell checker. It is a Lucene based one (which is the default) and it is being built from the “name” field in the schema. The other spell checkers are all commented out, but showcase the various configuration options available.
The second piece of configuration, and the one that most commonly trips people up, is the addition of the Search Component to a Request Handler. The reason why it commonly trips people up is that they add the SpellCheckComponent to a different Request Handler than their primary search request handler, thus requiring them to make two separate requests to Solr, one for the search results and one for the spelling suggestions. Instead, the SpellCheckComponent should be hooked directly into the main Request Handler, thus saving one round trip to Solr. The configuration should look something like:
<requestHandler name="/myMainRequestHandler" class="solr.SearchHandler" lazy="true">
  <lst name="defaults">
    <str name="spellcheck.onlyMorePopular">false</str>
    <str name="spellcheck.extendedResults">false</str>
    <str name="spellcheck.count">1</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>
Again, I can’t stress it enough, the SpellCheckComponent should not be placed in a separate Request Handler that thus requires two calls to Solr, despite the fact that the Solr tutorial does this for demonstration purposes (see the very large comment right above it).
Once the spell checkers are set up and Solr is up and running, you can issue queries to it. If you are using the Lucene spell checker or others, you may first need to build the underlying index. See http://wiki.apache.org
Best Practices
Once built, usage of the spell checker is pretty straightforward. In your Solr request or as part of your Request Handler configuration, you need to turn on the component (&spellcheck=true) and specify various other parameters to tell it how you want your results.
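As a rough illustration, the Java sketch below assembles such a request URL by hand. The host and port (http://localhost:8983/solr), the class name SpellcheckRequestSketch and the choice of spellcheck.count=5 are assumptions for the example; the handler name is the one from the configuration above. The spellcheck.build=true parameter is only added when you ask for it, since it is meant for (re)building the spelling index and not for every query.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that builds a Solr query URL with spell checking turned on. */
class SpellcheckRequestSketch {

    /** URL-encode a single parameter name or value as UTF-8. */
    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    /**
     * @param solrBase  assumed base URL, e.g. http://localhost:8983/solr
     * @param handler   request handler path, e.g. /myMainRequestHandler
     * @param userQuery the raw user query
     * @param build     true to also ask Solr to build the spelling index
     */
    static String buildUrl(String solrBase, String handler, String userQuery, boolean build) {
        Map<String, String> params = new LinkedHashMap<>(); // keeps insertion order
        params.put("q", userQuery);
        params.put("spellcheck", "true");     // turn the component on
        params.put("spellcheck.count", "5");  // assumed value: ask for up to 5 suggestions
        if (build) {
            params.put("spellcheck.build", "true");
        }
        StringBuilder url = new StringBuilder(solrBase).append(handler).append('?');
        boolean first = true;
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (!first) url.append('&');
            url.append(encode(e.getKey())).append('=').append(encode(e.getValue()));
            first = false;
        }
        return url.toString();
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("http://localhost:8983/solr", "/myMainRequestHandler", "lucine serch", false));
    }
}
```

Anything you set the same way on every request (the count, for instance) is probably better placed in the defaults section of the Request Handler, leaving only the per-request parameters in the URL.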
Based on my experience, the spell checker does a decent job out of the box, but not great, so you should be prepared to spend some time tuning it. First off, make sure you are doing effective analysis of the source content. See http://wiki.apache.org
Also note, the current collate functionality in the SpellCheckComponent has some warts that may prevent its effective use. However, the community is working through a fix to this right now, so keep an eye on SOLR-2010.
Finally, I still have a lot to learn about spell checking in search, so I’d appreciate your feedback on what worked and didn’t work for you in your applications. Please provide your tips below so we can all learn.