A Short Introduction to Indexing / Search using Lucene
At times I need an indexing tool that behaves something like an embedded database: an embedded index. The need comes up when running filters over data in a large visual table, or over some other visualization.
From a coding point of view, the initial attempt at filtering might look something like this:
public List<String> filter(String userFilterText) {
    List<String> ret = new LinkedList<String>();
    // Linear scan: every entity is checked on every filter request.
    for (Entity e : entities) {
        if (e.containsText(userFilterText)) {
            ret.add(e.getEntityId());
        }
    }
    return ret;
}
At some point the number of rows, or data elements, grows past the point where this scan can answer the user's request in a timely manner. Collecting the data into some in-memory structure only postpones the problem; eventually that breaks down as well.
The solution is to build an index, embedded into the application, which manages the filtering. This means filtering becomes:
public List<String> filter(String userFilterText) {
    List<String> ret = index.query(userFilterText);
    return ret;
}
Although this is a bit slower for a small number of items, it is never slow enough to hurt the user's perception of responsiveness. When there is a lot of data, users expect things to be slightly slower, and that is acceptable. In addition, the filter can now match by field instead of relying on something like String.contains() or even regular expressions.
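To make the idea of an embedded index concrete, here is a minimal sketch of an inverted index in plain Java. It is not Lucene and it is not the code from the example project: the class name TinyIndex and its add method are hypothetical, and only query mirrors the index.query call above. The sketch maps each lowercased term to the ids of the entities that contain it, so a filter becomes a few map lookups instead of a scan over every entity.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for the "index" object; a real application
// would delegate to Lucene instead.
final class TinyIndex {
    // term -> sorted ids of the entities whose text contains that term
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Split the text into lowercased word tokens and record the entity id
    // under each token.
    void add(String entityId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(entityId);
            }
        }
    }

    // Return the ids of entities containing ANY of the query terms,
    // sorted by id. Cost depends on the number of query terms and hits,
    // not on the total number of entities.
    List<String> query(String userFilterText) {
        Set<String> hits = new TreeSet<>();
        for (String term : userFilterText.toLowerCase(Locale.ROOT).split("\\W+")) {
            hits.addAll(postings.getOrDefault(term, Collections.emptySet()));
        }
        return new ArrayList<>(hits);
    }
}
```

The work moves to add(), which runs once per entity when the data is loaded. Lucene adds a great deal on top of this (analysis, scoring, persistence to disk, fields), but the lookup-instead-of-scan trade is the same.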
Building one of these indexes is quite simple. You add data with Document.add(Field). You query with searcher.search(Query, Collector). It is really just that simple. A fairly useful module can be had in fewer than 1000 lines of code.
The class IndexProvider.java is at the heart of the example. You call IndexProvider.index(data) for every object you want to index, and then call IndexProvider.search(String) to query the index you have built up.
The entry point is Example.main(), which has one artificial requirement: it must be run twice. The first time the example is run, it creates a directory named index and indexes example.csv. The second time it is run, it queries the content for 'the'.
Other, more complicated, queries are possible. To get all of the Lorem text:
ut eu
To restrict a term to a specific field:
+ut +f1:two
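The two queries above rely on Lucene's classic query syntax, where a leading + marks a clause as required and field:term limits a term to one field. The sketch below is a toy evaluator for just those two features, written in plain Java to show the matching rules; it is not how Lucene implements them. The class QuerySketch, the matches method, and the default field name "text" are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Map;

final class QuerySketch {
    // Assumption: unqualified terms are looked up in a field called "text".
    static final String DEFAULT_FIELD = "text";

    // A document is modeled as field name -> field text.
    static boolean matches(String query, Map<String, String> doc) {
        boolean hasRequired = false;
        boolean anyOptionalHit = false;
        for (String clause : query.trim().split("\\s+")) {
            if (clause.isEmpty()) {
                continue;
            }
            boolean required = clause.startsWith("+");
            String body = required ? clause.substring(1) : clause;
            int colon = body.indexOf(':');
            String field = colon < 0 ? DEFAULT_FIELD : body.substring(0, colon);
            String term = body.substring(colon + 1).toLowerCase(Locale.ROOT);
            boolean hit = containsTerm(doc.get(field), term);
            if (required) {
                if (!hit) {
                    return false; // a required clause failed
                }
                hasRequired = true;
            } else {
                anyOptionalHit |= hit;
            }
        }
        // With required clauses, optional ones do not affect matching;
        // without them, at least one optional clause must hit.
        return hasRequired || anyOptionalHit;
    }

    private static boolean containsTerm(String value, String term) {
        if (value == null) {
            return false;
        }
        return Arrays.asList(value.toLowerCase(Locale.ROOT).split("\\W+")).contains(term);
    }
}
```

Under these rules, ut eu matches any row containing either word, while +ut +f1:two matches only rows that contain ut in the default field and two in the field f1.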
This allows the visualization filtering to be as rich as any query. More importantly, the filtering can be tied to whatever the data happens to be, without any code changes.
The lucene-starter download is a .tgz file with a pom and sources.