
Introduction to Apache Lucene/Solr

by Lucidworks
September 3, 2009

Lucene and Solr are state-of-the-art search technologies available for free as open source from The Apache Software Foundation. Lucene is the underlying search library, and Solr is a platform built on top of Lucene that makes it easy to build Lucene-based applications. Both are full-featured and offer excellent performance, relevancy ranking and scalability. These technologies are used today by thousands of organizations and power substantial search applications at AOL, Comcast Interactive Media, IBM, Netflix, LinkedIn and MySpace.

By Marc Krellenstein, CTO, Lucid Imagination

Choosing a Search Engine

In the last decade a single search engine technology has sometimes been the dominant choice for enterprises that want to build their own search capability for a web site, a product, or internal or extranet use. No one product can meet all needs. But a single technology was recognized as the default choice, and users could most easily start their search evaluation by asking if there were reasons not to use it. Today, I believe Apache Lucene and Solr are the default full-text search technology for organizations.

What is Lucene?

Lucene is a Java-based search library. It was initially written over 10 years ago by Doug Cutting, who had worked on two search engines before that, including the once popular Excite Internet service. Lucene was one of the first third-generation search capabilities. Like Google and Microsoft’s recently acquired Fast, Lucene has an architecture that employs best-practice relevancy ranking and querying, as well as state-of-the-art text compression and a partitioned index strategy to optimize both query performance and indexing flexibility.

Unlike those other products, however, Lucene is available for free as open source under the liberal Apache License. This license allows users to modify or embed the technology as they see fit, and to keep proprietary, sell and/or redistribute any resulting product. Lucene is written entirely in Java, though ports to .NET and other platforms are available today. The source code is not merely freely available but actually practical and relatively simple to use or modify. Finally, and perhaps most importantly for an open source project, Lucene has stood the test of time. Today, Lucene has a large number of active contributors and thousands of installations, including production applications at AOL, Apple, CNET, Comcast Interactive Media, IBM, LinkedIn, Monster, MySpace, Netflix, Technorati and Wikipedia. And while there are other open source search projects, none has more than a fraction of Lucene’s installed base and contributors.

Lucene is full-featured and provides:

Speed: sub-second query performance for most queries
Strong out-of-the-box relevancy ranking: as good as or better than the best commercial competitors
Complete query capabilities: keyword, Boolean and +/- queries, proximity operators, wildcards, fielded searching, term/field/document weights, find-similar, spell-checking, multilingual search and more
Full results processing: sorting by relevancy, date or any field, dynamic summaries and hit highlighting
Portability: runs on any platform supporting Java, and indexes are portable across platforms; you can build an index on Linux, copy it to a Microsoft Windows machine and search it there
Scalability: there are production applications with hundreds of millions and even billions of documents/records
Low-overhead indexes and rapid incremental indexing, especially with versions 2.3 and later
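
To make the idea of a search library more concrete, here is a toy inverted index in Python that supports the +/- (required/prohibited) query style mentioned in the list above. It is only a sketch of the general technique: the class name TinyIndex and its methods are hypothetical, it does no relevancy ranking, and it is not Lucene's API or implementation.

```python
from collections import defaultdict


class TinyIndex:
    """Toy inverted index: maps each term to the set of documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of document ids

    def add(self, doc_id, text):
        # Tokenize naively on whitespace; a real analyzer does far more.
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, query):
        """Bare terms are optional, +term is required, -term is prohibited."""
        required, prohibited, optional = [], [], []
        for token in query.lower().split():
            if token.startswith("+") and len(token) > 1:
                required.append(token[1:])
            elif token.startswith("-") and len(token) > 1:
                prohibited.append(token[1:])
            else:
                optional.append(token)

        def lookup(term):
            return self.postings.get(term, set())

        if required:
            # Every required term must be present.
            hits = set.intersection(*(lookup(t) for t in required))
        else:
            # Otherwise any optional term is enough.
            hits = set().union(*(lookup(t) for t in optional))
        for term in prohibited:
            hits = hits - lookup(term)
        return sorted(hits)


index = TinyIndex()
index.add(1, "the quick brown fox")
index.add(2, "the lazy dog")
index.add(3, "quick dog chases fox")

print(index.search("+fox -dog"))    # documents with "fox" but without "dog"
print(index.search("+quick +dog"))  # documents containing both terms
```

Because lookups go through the term-to-documents map rather than scanning every document, query cost depends on the size of the matching posting lists, which is the basic reason inverted indexes make fast queries possible.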

What is Solr?

Solr is a layer of code on top of Lucene that transforms Lucene into a search platform for building search applications. Solr was created by Yonik Seeley while at CNET and contributed to Apache by CNET. Solr provides the following capabilities:

Web service: Solr places Lucene over HTTP, allowing programs written in any language to invoke Lucene
XML-based schema for managing indexed fields and their characteristics
System administration tools for configuration, data loading, index replication, statistics, logging and cache management
Large scale distributed search
Fixed/paid result list placement
Faceting — the dynamic clustering of items or search results into categories that lets users drill into search results (or even skip searching entirely) by any value in any field, as seen on popular ecommerce sites such as Amazon
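
As a rough illustration of two items in the list above, the following Python sketch builds a Solr query URL with faceting enabled and shows, on in-memory data, what a facet count amounts to. The request parameters q, rows, wt, facet and facet.field are standard Solr parameters; the host, port, base path, field names and sample records are assumptions made for the example, and the exact URL layout depends on the Solr version and deployment. The helper names are hypothetical, and the code only constructs the URL without sending a request.

```python
from collections import Counter
from urllib.parse import urlencode


def build_solr_query_url(base_url, query, facet_fields=(), rows=10):
    """Build a Solr select URL; any HTTP client in any language could then call it."""
    params = [("q", query), ("rows", str(rows)), ("wt", "json")]
    if facet_fields:
        params.append(("facet", "true"))
        # facet.field may be repeated, once per field to facet on.
        params.extend(("facet.field", field) for field in facet_fields)
    return base_url.rstrip("/") + "/select?" + urlencode(params)


def facet_counts(records, field):
    """Count how many records carry each value of `field` (what a facet displays)."""
    return Counter(record[field] for record in records if field in record)


# Assumed local Solr instance and an assumed "category" field.
url = build_solr_query_url(
    "http://localhost:8983/solr", "title:lucene", facet_fields=["category"]
)
print(url)

# Hypothetical search results, used only to show the counting step.
results = [
    {"title": "Lucene in Action", "category": "books"},
    {"title": "Solr Cookbook", "category": "books"},
    {"title": "Search Conference Talk", "category": "videos"},
]
print(facet_counts(results, "category"))
```

A user interface would typically render each facet value with its count as a link that adds a filter to the next query, which is how users drill into results by category.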

Most users building Lucene-based search applications will find they can do so more quickly if they start with Solr since it contains many of the capabilities needed to turn a core search capability into a full-fledged search application. Most of the more recent large Lucene-based installations mentioned above use Solr, including AOL, Comcast Interactive Media and Netflix, and of course CNET. However, as in any open layered environment, users can still choose to work directly with the underlying Lucene library, perhaps to manipulate or exploit lower level Lucene capabilities.

Advantages and disadvantages of Lucene/Solr being open source

The fact that Lucene/Solr are Apache open source software provides some significant advantages:

Free to use: no license fees whatsoever
Complete source code, providing the independence and control you normally get only by writing your own software. The Lucene/Solr Apache license allows users to produce or distribute derivative or proprietary works without restrictions.
Code developed by programmers who are themselves end-users trying to solve pressing end-user needs
Community: a large, active and helpful community of developers and end-users, with forums and mailing lists for discussing and resolving problems, and independent consultants offering more specialized assistance

Open source software — Apache-licensed or otherwise — also has some limitations as compared to the best commercial software:

No formal support contracts
No assured availability of training or other professional services to fulfill specific software needs or assist with building an application
No formalized release testing program, release schedule or assurance of upgrade compatibility, though Lucene/Solr contributions must have unit tests before they are committed to the code, and releases receive integrated testing

Building your search application with Lucene/Solr

Building good full-text search is a demanding undertaking, and having the best technology is only part of the solution. Search engines such as Lucene/Solr have good default settings and tools that help make applications not only work but be effective. But the best search applications require understanding both the data and the users. Information must be aggregated and indexed from file systems, databases or web sites and normalized for search. For example, one set of documents may refer to a document name as a title and another as a heading; a search for ‘fox’ should probably find items containing ‘foxes’ as well. Potential users’ level of expertise and familiarity with the data must also be considered in the design, and the use of synonyms may be needed (e.g., heart attack = myocardial infarction). Relevancy ranking will generally require tuning, based on what users are actually doing, to improve an initial application’s effectiveness. More advanced features such as ‘automatic feedback’ may be useful (and, on the other hand, many oft-attempted efforts at improving search can be ignored in favor of current best practices).
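
The normalization steps described above can be sketched in a few lines of Python. This is a deliberately crude illustration, assuming a hand-written field alias table, a one-entry synonym table and a naive plural-stripping rule; all names here are hypothetical, and a real application would rely on the analyzers, stemmers and synonym handling that Lucene/Solr provide rather than anything like this.

```python
# Assumed mappings, invented for this example only.
FIELD_ALIASES = {"heading": "title"}  # different sources, one field name
SYNONYMS = {"myocardial infarction": "heart attack"}  # phrase -> canonical form


def normalize_fields(record):
    """Rename source-specific field names to the index's canonical names."""
    return {FIELD_ALIASES.get(name, name): value for name, value in record.items()}


def naive_stem(token):
    """Strip a plural suffix so 'foxes' and 'fox' match. Far cruder than a real stemmer."""
    for suffix in ("es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def normalize_text(text):
    """Lowercase, map synonym phrases to a canonical form, then stem each token."""
    text = text.lower()
    for phrase, canonical in SYNONYMS.items():
        text = text.replace(phrase, canonical)
    return [naive_stem(token) for token in text.split()]


print(normalize_fields({"heading": "Annual Report", "body": "..."}))
print(normalize_text("Foxes"))
print(normalize_text("Myocardial infarction"))
```

Applying the same normalization to both the indexed text and the query is what lets a search for one form find documents that use the other.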

A great search application such as Google is only partly a success of raw technology. It also reflects an expert appreciation of the data and users of that particular application. With more than enough good answers for a search on the Internet and even more bad answers, a popularity-weighted ranking will screen out the bad data and find more than enough good data for Google’s typical users. But any particular search application may have very different data and users. Bad data usually does not exceed good data for most search applications, and finding the best results might be more important than finding good enough results. The security and privacy requirements of a typical application may also be very different from those of a public Internet service (or those of an intelligence agency). The art of good search is to be able to transform good generic technology into a good specific application.

The skills for building a great search application come mostly from having built other ones, but for most users, building a search application is an infrequent occurrence. For that reason it can be useful to seek out expert and experienced resources to assist with application design, development and/or deployment, just as it may be valuable to secure expert support resources for ongoing maintenance. Commercial companies such as Lucid Imagination are based on open source but can provide such formal support and assistance for people using those open source tools.

Basic Lucene/Solr Resources

http://lucene.apache.org/java – The Apache Lucene website
http://lucene.apache.org/solr – The Apache Solr website
