Exploring Query Parsers
There are a surprising number of query parser options in the Lucene/Solr world – not something I realized very quickly in my early Lucene days. I thought I might highlight a few of the options out there.
The Default Lucene QueryParser [Lucene]
This parser must be the most well known and handles a healthy syntax that spans most of Lucene’s underlying Query objects. It offers most of what you would expect from a default query parser, except that it lacks support for Span queries and handles operator precedence in a very unintuitive manner. The support for field selection is also rather weak. It’s been described as a kitchen sink in the past, but it actually works quite well for many use cases. Syntax errors could be handled much more gracefully. On the other hand, this parser gets the most attention, and you’ve likely already found it useful.
Example Syntax:
mod_date:[20020101 TO 20030101]
title:(-return +"pink panther")
(jakarta OR apache) AND website
"jakarta apache" -"Apache Lucene"
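To make the precedence complaint concrete, here is a toy model in Java. It is not Lucene code. It assumes the default operator is OR and that AND simply marks the clauses on either side of it as required in one flat pass, which is a simplified reading of how the classic parser behaves. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only (hypothetical class, not part of Lucene).
// Assumes bare terms separated by AND/OR and a default operator of OR.
final class FlatPrecedenceSketch {

    // Rewrites "a AND b OR c" style input into +/- clause notation.
    static String rewrite(String query) {
        List<String> clauses = new ArrayList<>();
        boolean afterAnd = false;
        for (String token : query.trim().split("\\s+")) {
            if (token.equals("AND")) { afterAnd = true; continue; }
            if (token.equals("OR")) { afterAnd = false; continue; }
            if (afterAnd && !clauses.isEmpty()) {
                // AND promotes the clause before it to "required".
                int last = clauses.size() - 1;
                String previous = clauses.get(last);
                if (!previous.startsWith("+")) {
                    clauses.set(last, "+" + previous);
                }
            }
            // A clause introduced by AND is required; otherwise it is optional.
            clauses.add(afterAnd ? "+" + token : token);
            afterAnd = false;
        }
        return String.join(" ", clauses);
    }
}
```

Under this model, `FlatPrecedenceSketch.rewrite("a AND b OR c")` gives `+a +b c`, so `c` is merely optional rather than an alternative to `(a AND b)`. Adding explicit parentheses, as in the `(jakarta OR apache) AND website` example, avoids the surprise.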
MultiFieldQueryParser [Lucene]
A take on the default Lucene parser that allows you to supply an array of fields to search. This makes it easier to query multiple fields using a shorter syntax.
Example Syntax:
sear?h roam~0.8 wor*
In code you might pass the fields: [title, body, keywords]
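The effect of supplying a field array can be sketched without Lucene on the classpath. The Java below is a rough illustration, assuming simple whitespace-separated terms with an optional leading + or -. The class `MultiFieldExpansionSketch` and its `expand` method are hypothetical and only mimic the shape of the expanded query; they are not the parser's API.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only (hypothetical class, not part of Lucene).
final class MultiFieldExpansionSketch {

    // Expands each term across every field, keeping any leading + or -.
    static String expand(String query, List<String> fields) {
        List<String> clauses = new ArrayList<>();
        for (String token : query.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            String prefix = "";
            String term = token;
            if (token.startsWith("+") || token.startsWith("-")) {
                prefix = token.substring(0, 1);
                term = token.substring(1);
            }
            List<String> perField = new ArrayList<>();
            for (String field : fields) {
                perField.add(field + ":" + term);
            }
            clauses.add(prefix + "(" + String.join(" ", perField) + ")");
        }
        return String.join(" ", clauses);
    }
}
```

With the fields `title` and `body`, the input `+term1 +term2` expands to `+(title:term1 body:term1) +(title:term2 body:term2)`, which is the longer query you would otherwise have to type by hand.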
PrecedenceQueryParser [Lucene]
Contrib contains another variation on the default QueryParser called PrecedenceQueryParser. This parser attempts to make the precedence support in QueryParser more intuitive and useful – otherwise it’s pretty much the same as the default parser. The Javadoc for this parser says that it’s still experimental and somewhat incomplete.
Example Syntax:
Generally the same syntax as the default Lucene QueryParser, but an attempt is made to handle precedence in a more sensible manner.
Surround [Lucene]
Surround is a cool parser contributed a while back by Paul Elschot. Surround provides a syntax that works with almost the full range of Lucene Query objects. Most notably, it supports the family of Span queries rather nicely. Surround recognizes either prefix or infix notation.
Example Syntax:
aa NOT bb NOT cc – same effect as: (aa NOT bb) NOT cc
and(aa,bb,cc) – aa and bb and cc
99w(aa,bb,cc) – ordered span query with slop 98
99n(aa,bb,cc) – unordered span query with slop 98
20n(aa*,bb*)
3w(a?a or bb?, cc+)
title:text: aa not bb
cc 3w dd – infix: dual.
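The distance operators in these examples follow a simple pattern: a number followed by `w` (ordered) or `n` (unordered), with the slop being one less than the number. A small Java sketch of that reading is below. It is based only on the examples above, requires the number to be present, and uses a hypothetical class name; it is not Surround's own parsing code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only (hypothetical class, not part of Surround).
final class SurroundDistanceSketch {

    // A distance operator as it appears in the examples: digits, then w or n.
    private static final Pattern OPERATOR = Pattern.compile("(\\d+)([wn])");

    // Describes an operator such as "99w" in terms of ordering and slop.
    static String describe(String operator) {
        Matcher m = OPERATOR.matcher(operator);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a distance operator: " + operator);
        }
        int distance = Integer.parseInt(m.group(1));
        boolean ordered = m.group(2).equals("w");
        // Slop is the distance minus one, as in "99w ... slop 98".
        return (ordered ? "ordered" : "unordered") + " span, slop " + (distance - 1);
    }
}
```

For example, `SurroundDistanceSketch.describe("3w")` returns `ordered span, slop 2`.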
Xml-Query-Parser [Lucene]
A cool contribution from Mark Harwood, the xml-query-parser can parse XML files that specify which Query objects to construct (making it easy to support the full Lucene Query family). The xml-query-parser is easily extensible to new Query types and even handles simple Filter caching for you. Throw in a little XSLT, and the coolness multiplies.
Example syntax:
<BooleanQuery fieldName="contents">
  <Clause occurs="should">
    <TermQuery>merger</TermQuery>
  </Clause>
  <Clause occurs="mustnot">
    <TermQuery>sumitomo</TermQuery>
  </Clause>
  <Clause occurs="must">
    <TermQuery>bank</TermQuery>
  </Clause>
</BooleanQuery>
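To show how such a document maps onto clauses, here is a small Java sketch that uses only the JDK's DOM parser. It walks the `Clause` elements and prints the equivalent +/- notation. It assumes a flat `BooleanQuery` whose clauses each wrap a single `TermQuery` and inherit the `fieldName` from the root. The class `XmlQuerySketch` is hypothetical and builds a string, not real Lucene Query objects.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Illustration only (hypothetical class, not the contrib parser).
final class XmlQuerySketch {

    // Renders a flat BooleanQuery document as "field:term" clauses with +/- flags.
    static String toQueryString(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Element root = doc.getDocumentElement();
            String field = root.getAttribute("fieldName");
            NodeList clauses = root.getElementsByTagName("Clause");
            StringBuilder out = new StringBuilder();
            for (int i = 0; i < clauses.getLength(); i++) {
                Element clause = (Element) clauses.item(i);
                String occurs = clause.getAttribute("occurs");
                // must -> "+", mustnot -> "-", anything else is treated as optional.
                String flag = occurs.equals("must") ? "+" : occurs.equals("mustnot") ? "-" : "";
                // Assumes every Clause contains exactly one TermQuery.
                String term = clause.getElementsByTagName("TermQuery")
                        .item(0).getTextContent().trim();
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(flag).append(field).append(':').append(term);
            }
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not read query XML", e);
        }
    }
}
```

Fed the example above, this yields `contents:merger -contents:sumitomo +contents:bank`.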
Qsol [Lucene/Solr?]
I wrote this query parser a few years back with the goal of creating something that could handle sentence/paragraph within n proximity searches, somewhat mimic legacy query syntaxes, allow a mix of proximity and boolean clauses (e.g. {mark | miller} within3wordsOf toast), and properly handle precedence of operators in a configurable manner. Qsol can be pretty powerful in that regard, but the code base is rather intense for a QueryParser and I haven’t waded into it for some time now – other than for the occasional bug fix. Most of the users of Qsol have rather niche requirements when it comes to a QueryParser. Qsol is configurable up the wazoo.
I saw that someone ported part of Qsol to Solr a while back, but I’m not sure how far along that patch is.
Example syntax:
bush & chain | cheney
basketball love ! hate
(bill | william) ~3 clinton
field1,field2(horse | dog)
*:* ! hor*
SolrQueryParser [Solr]
This is essentially the standard Lucene QueryParser, extended so that it supports a tighter integration with Solr. Certain differences exist at different times though – for example, SolrQueryParser used constant score multi-term queries before Lucene’s QueryParser.
Example syntax:
The syntax is the same as the default Lucene syntax – the behavior may be tweaked for certain cases though.
DisMax [Solr]
DisMax heavily clamps down on the allowed syntax (quotes and +/- only) and allows you to easily specify multiple fields to search over with different boosts for each field. Effectively, this is a nice solution to the ungraceful syntax error handling of the default Lucene QueryParser. DisMax gives you something much closer to how your favorite web search engine might work.
Example Syntax:
a grand search that I might use to find
+horse "cattle and pigs" -rats
Gaz{sdf*?verhinon
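One way to picture why input like the last example does not blow up is to escape everything that looks like query syntax except quotes, + and -. The Java sketch below does just that. The list of special characters and the class name are assumptions made for illustration, and it does not reproduce everything DisMax does with user input.

```java
// Illustration only (hypothetical helper, not Solr's implementation).
final class DisMaxEscapeSketch {

    // Characters treated as query syntax in this sketch (an assumed list).
    // Quotes, '+' and '-' are deliberately left out so they keep their meaning.
    private static final String SPECIAL = "\\!():^[]{}~*?";

    // Backslash-escapes every special character in the user's input.
    static String escapeAllButBasicOperators(String userInput) {
        StringBuilder out = new StringBuilder();
        for (char c : userInput.toCharArray()) {
            if (SPECIAL.indexOf(c) >= 0) {
                out.append('\\');
            }
            out.append(c);
        }
        return out.toString();
    }
}
```

With this helper, `Gaz{sdf*?verhinon` becomes `Gaz\{sdf\*\?verhinon`, while `+horse "cattle and pigs" -rats` passes through unchanged.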