[Update] Accessing Words Around a Positional Match in Lucene
Way back when, I posted a blurb on how to access words around a positional match in Lucene, and a friend of mine asked me how to do similar things in Lucene 4. Since Lucene 4 has a lot of API changes, I thought it would be worthwhile to update the example for the new APIs. While I am keeping with the spirit of my original post, do note that there may be alternative ways to do this in Lucene 4 (a post for another day), since it is now a lot easier to store things like offsets and payloads directly in the postings list, thereby avoiding the more expensive term vectors used here. For instance, see the new PostingsHighlighter for an example of how to do this; a quick sketch follows. (I have not benchmarked the difference in performance, if any.)
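For the curious, here is a rough sketch of that postings-based approach. This is my own illustration, not code from the original post: the variable names (offsetsType, fragments) are made up, and PostingsHighlighter lives in the lucene-highlighter module, which you would need on the classpath.

// Index time: store offsets in the postings list instead of term vectors.
// PostingsHighlighter also requires the field to be stored.
FieldType offsetsType = new FieldType(TextField.TYPE_STORED);
offsetsType.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
doc.add(new Field("content", "some text", offsetsType));

// Search time: the highlighter reads those offsets straight from the postings.
PostingsHighlighter highlighter = new PostingsHighlighter();
TopDocs topDocs = searcher.search(query, 10);
String[] fragments = highlighter.highlight("content", query, searcher, topDocs);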
As a rehash, the goal of the code below is to create a window of terms around a specific query term (in this case, "fleece") in order to do some downstream analysis of that window. Here's the basic code for creating the docs I'm using:
public static String[] DOCS = {
    "The quick red fox jumped over the lazy brown dogs.",
    "Mary had a little lamb whose fleece was white as snow.",
    "Moby Dick is a story of a whale and a man obsessed.",
    "The robber wore a black fleece jacket and a baseball cap.",
    "The English Springer Spaniel is the best of all dogs.",
    "The fleece was green and red",
    "History looks fondly upon the story of the golden fleece, but most people don't agree"
};
In updating my code for Lucene 4 (4.3, specifically), the key differences from the previous examples are: using an AtomicReader instance (see Uwe Schindler's excellent talk on the subject for more details), passing some new parameters to the SpanTermQuery.getSpans() method, and accessing term vectors differently. Here are the relevant bits for creating the Spans object (the full code is at the bottom); I'll cover the term vector parts later:
SpanTermQuery fleeceQ = new SpanTermQuery(new Term("content", "fleece"));
...
IndexReader reader = searcher.getIndexReader();
// This is not the best way of doing this, but it works for the example.
// See http://www.slideshare.net/lucenerevolution/is-your-index-reader-really-atomic-or-maybe-slow
// for higher performance approaches.
AtomicReader wrapper = SlowCompositeReaderWrapper.wrap(reader);
Map<Term, TermContext> termContexts = new HashMap<Term, TermContext>();
Spans spans = fleeceQ.getSpans(wrapper.getContext(),
    new Bits.MatchAllBits(reader.numDocs()), termContexts);
In the getSpans() method, the first parameter provides access to the reader (via its context), the second is a Bits filter that can be used to screen out documents (deleted docs, for instance), and the third, the termContexts map, can be used to cache term lookups for better performance. A per-segment variant is sketched below.
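As the comment in the snippet above hints, SlowCompositeReaderWrapper trades speed for convenience. A rough sketch of the faster, per-segment alternative (mine, not from the original post), assuming the same fleeceQ and termContexts as above; AtomicReaderContext comes from org.apache.lucene.index:

for (AtomicReaderContext leaf : reader.leaves()) {
  // getLiveDocs() may return null, which getSpans() treats as "all docs accepted"
  Spans perLeafSpans = fleeceQ.getSpans(leaf, leaf.reader().getLiveDocs(), termContexts);
  while (perLeafSpans.next()) {
    // Doc ids here are segment-relative; add leaf.docBase for the index-wide id
    int globalDoc = leaf.docBase + perLeafSpans.doc();
    System.out.println("Doc: " + globalDoc + " Start: " + perLeafSpans.start()
        + " End: " + perLeafSpans.end());
  }
}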
The other big change is how you access term vectors in the span lookup loop. You no longer need TermVectorMapper instances; instead, you simply use Terms, TermsEnum and DocsAndPositionsEnum, as in:
Terms content = reader.getTermVector(spans.doc(), "content");
TermsEnum termsEnum = content.iterator(null);
BytesRef term;
while ((term = termsEnum.next()) != null) {
  // Could store the BytesRef here, but a String is easier for this example.
  // utf8ToString() decodes the term bytes as UTF-8, which is how Lucene stores them.
  String s = term.utf8ToString();
  DocsAndPositionsEnum positionsEnum = termsEnum.docsAndPositions(null, null);
  // A term-vector-backed enum covers exactly one document, so a single nextDoc() positions it.
  if (positionsEnum.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
    int i = 0;
    int position = -1;
    while (i < positionsEnum.freq() && (position = positionsEnum.nextPosition()) != -1) {
      if (position >= start && position <= end) {
        entries.put(position, s);
      }
      i++;
    }
  }
}
Running this code should yield:
Score Doc: doc=5 score=0.4725143 shardIndex=-1
Score Doc: doc=3 score=0.35438573 shardIndex=-1
Score Doc: doc=1 score=0.29532143 shardIndex=-1
Score Doc: doc=6 score=0.23625715 shardIndex=-1
Doc: 1 Start: 6 End: 7
Entries:{4=lamb, 5=whose, 6=fleece, 8=white}
Doc: 3 Start: 5 End: 6
Entries:{4=black, 5=fleece, 6=jacket}
Doc: 5 Start: 1 End: 2
Entries:{1=fleece, 3=green}
Doc: 6 Start: 9 End: 10
Entries:{8=golden, 9=fleece, 11=most, 12=people}
Notice the gaps in the positions (for instance, nothing at position 7 in doc 1): StandardAnalyzer removes stopwords like "was" but preserves their position increments. The complete code is listed below. As you can see, it prints some info about the actual query and then shows the context around each match. You could, of course, extend this to access things like offsets, payloads and more; a sketch of that follows the listing.
package com.lucidworks.noodles;

/**
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.AtomicReader;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.DocsAndPositionsEnum;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.SlowCompositeReaderWrapper;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermContext;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.search.spans.Spans;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Bits;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.Version;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/**
 * This class is for demonstration purposes only. No warranty, guarantee, etc. is implied.
 * <p/>
 * This is not production quality code!
 */
public class TermVectorFun {
  public static String[] DOCS = {
      "The quick red fox jumped over the lazy brown dogs.",
      "Mary had a little lamb whose fleece was white as snow.",
      "Moby Dick is a story of a whale and a man obsessed.",
      "The robber wore a black fleece jacket and a baseball cap.",
      "The English Springer Spaniel is the best of all dogs.",
      "The fleece was green and red",
      "History looks fondly upon the story of the golden fleece, but most people don't agree"
  };

  public static void main(String[] args) throws IOException {
    RAMDirectory ramDir = new RAMDirectory();
    // Index some made up content
    IndexWriter writer = new IndexWriter(ramDir,
        new IndexWriterConfig(Version.LUCENE_43, new StandardAnalyzer(Version.LUCENE_43)));
    // Store both position and offset information
    FieldType type = new FieldType();
    type.setStoreTermVectors(true);
    type.setStoreTermVectorOffsets(true);
    type.setStoreTermVectorPositions(true);
    type.setIndexed(true);
    type.setTokenized(true);
    for (int i = 0; i < DOCS.length; i++) {
      Document doc = new Document();
      Field id = new StringField("id", "doc_" + i, Field.Store.YES);
      doc.add(id);
      Field text = new Field("content", DOCS[i], type);
      doc.add(text);
      writer.addDocument(doc);
    }
    writer.close();
    // Get a searcher
    IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(ramDir));
    // Do a search using SpanQuery
    SpanTermQuery fleeceQ = new SpanTermQuery(new Term("content", "fleece"));
    TopDocs results = searcher.search(fleeceQ, 10);
    for (int i = 0; i < results.scoreDocs.length; i++) {
      ScoreDoc scoreDoc = results.scoreDocs[i];
      System.out.println("Score Doc: " + scoreDoc);
    }
    IndexReader reader = searcher.getIndexReader();
    // This is not the best way of doing this, but it works for the example.
    // See http://www.slideshare.net/lucenerevolution/is-your-index-reader-really-atomic-or-maybe-slow
    // for higher performance approaches.
    AtomicReader wrapper = SlowCompositeReaderWrapper.wrap(reader);
    Map<Term, TermContext> termContexts = new HashMap<Term, TermContext>();
    Spans spans = fleeceQ.getSpans(wrapper.getContext(),
        new Bits.MatchAllBits(reader.numDocs()), termContexts);
    int window = 2; // get the words within two positions of the match
    while (spans.next()) {
      Map<Integer, String> entries = new TreeMap<Integer, String>();
      System.out.println("Doc: " + spans.doc() + " Start: " + spans.start()
          + " End: " + spans.end());
      int start = spans.start() - window;
      int end = spans.end() + window;
      Terms content = reader.getTermVector(spans.doc(), "content");
      TermsEnum termsEnum = content.iterator(null);
      BytesRef term;
      while ((term = termsEnum.next()) != null) {
        // Could store the BytesRef here, but a String is easier for this example
        String s = term.utf8ToString();
        DocsAndPositionsEnum positionsEnum = termsEnum.docsAndPositions(null, null);
        if (positionsEnum.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
          int i = 0;
          int position = -1;
          while (i < positionsEnum.freq() && (position = positionsEnum.nextPosition()) != -1) {
            if (position >= start && position <= end) {
              entries.put(position, s);
            }
            i++;
          }
        }
      }
      System.out.println("Entries:" + entries);
    }
  }
}
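Finally, here is the extension sketch promised above. This is my own illustration, not code from the original post: the inner positions loop can pull offsets, and payloads too, from the same enum. Offsets come back because the FieldType stores term vector offsets; payloads only come back if you also call setStoreTermVectorPayloads(true) (added in Lucene 4.1) and your analysis chain actually produces them.

// Replacement for the inner positions loop above, pulling offsets and payloads.
while (i < positionsEnum.freq() && (position = positionsEnum.nextPosition()) != -1) {
  if (position >= start && position <= end) {
    entries.put(position, s);
    // Offsets are valid here because term vector offsets were stored at index time.
    System.out.println(s + " at offsets " + positionsEnum.startOffset()
        + "-" + positionsEnum.endOffset());
    BytesRef payload = positionsEnum.getPayload(); // null unless payloads were indexed
    if (payload != null) {
      System.out.println("  payload: " + payload.utf8ToString()); // assumes UTF-8 payload bytes
    }
  }
  i++;
}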