Schemaless Solr: Part 1
This is the first part in a series about adding schemaless features to Apache Solr.
Solr has dynamic fields, which allow for a kind of schemaless operation: you map field name prefixes or suffixes to field types, so that field names don’t have to be fully known up-front. However, it would be nice if Solr supported a schemaless mode that didn’t require any up-front schema configuration. This schemaless mode should determine field type mappings for unknown fields based on the field’s content rather than its name. Work is underway to make that happen.
What does “schemaless” mean?
Schemaless is usually understood to mean zero configuration required. A schemaless search engine should index—that is, process and store—your data without having to be told what structure to expect, or how to process the data.
You might want schemaless capabilities if you’re a new user and don’t want to have to figure out how to configure the engine before getting started, or if you have data sources whose structure you can’t predict. See below for a discussion of the appropriate scope for schemaless capabilities.
What’s a schema?
A schema is a collection of a) constraints on data record structure and b) data processing instructions associated with elements of the record structure.
Records stored in Solr are called documents. Input documents are a collection of fields, which are name/value pairs. Searching in Solr requires that when you query against indexed fields, you process each field in both the query and the input document in a compatible fashion. Solr’s schema is where this compatibility is enabled. The schema hosts user configuration of input document structure – a menu of possible field names and dynamic field name patterns (more on dynamic fields below) – as well as some aspects of index-time and query-time processing.
The Solr schema maps explicitly named fields and dynamic fields to field types, which are collections of field properties, including: whether a field will be searchable; whether it will be stored in its original form; whether it’s required to be present in an input document; its data type (e.g. integer, date, text); and for text data types, text processing pipeline specifications. See http://wiki.apache.org/solr/SchemaXml for details.
Lucene, the search engine library used by Solr under the hood, has no schema facility. As a result, all applications using Lucene, including Solr, must maintain some form of a schema in order to execute queries, to ensure that query field processing is performed consistently with indexed field processing.
Solr’s dynamic fields
Solr’s dynamic field capability reduces up-front configuration requirements for fields with predictable naming patterns. For example, the following dynamic field definition maps any field name with suffix “_i” to the “int” field type, which is declared elsewhere in the schema:
<dynamicField name="*_i" type="int" indexed="true" stored="true" />
Dynamic fields are a kind of Hungarian Notation for document fields: naming fields using configurable suffixes or prefixes triggers mappings to field types.
Dynamic fields do require that you configure them (or use a preconfigured set, e.g. from the Solr example schema), so they aren’t exactly zero-configuration, but they can go a long way toward reducing schema size, and field names don’t have to be fully spelled out up-front.
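To make the name-to-type mapping concrete, here is a minimal Python sketch of how a field name could be resolved against explicit fields and dynamic field patterns. It is an illustration, not Solr's implementation: the function names and sample patterns are hypothetical, and the "longest matching pattern wins" tie-break is an assumption of this sketch.

```python
def matches(pattern: str, field_name: str) -> bool:
    """Return True if field_name matches a pattern with a leading or trailing '*'."""
    if pattern.startswith("*"):
        return field_name.endswith(pattern[1:])
    if pattern.endswith("*"):
        return field_name.startswith(pattern[:-1])
    return pattern == field_name


def resolve_field_type(field_name, explicit_fields, dynamic_fields):
    """Look up a field type: explicitly named fields first, then dynamic patterns."""
    if field_name in explicit_fields:
        return explicit_fields[field_name]
    candidates = [p for p in dynamic_fields if matches(p, field_name)]
    if not candidates:
        return None  # unknown field: no explicit or dynamic mapping
    # Assumption of this sketch: prefer the longest matching pattern.
    return dynamic_fields[max(candidates, key=len)]


# Hypothetical sample schema contents.
explicit_fields = {"id": "string"}
dynamic_fields = {"*_i": "int", "*_s": "string", "attr_*": "text_general"}

print(resolve_field_type("price_i", explicit_fields, dynamic_fields))
print(resolve_field_type("startDate", explicit_fields, dynamic_fields))
```

A name such as startDate matches nothing here, which is exactly the case that a schemaless mode would have to handle by inspecting the field's value instead of its name.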
Steps toward schemaless
Several Solr features have been added in recent releases in preparation for enabling a form of schemalessness, in which Solr determines the field type for unknown fields on the fly, based on input document data type (e.g. JSON boolean and numeric types), or based on guessing the field type from the field value. Work is underway to finish the task of enabling a Schemaless Solr mode.
For example, adding a document with field name/value {"startDate":"10/10/2010",...}, where field startDate is not already in the schema, would cause this field first to be added to the schema, with a mapping to a date type already defined in the schema, followed by adding the document itself.
Three features supporting Schemaless Solr mode have been committed to the Solr code base:
- Read access to the live schema via a new REST API
- Managed schema: make Solr the schema owner
- Add new fields to the live schema
Schema REST API: read access
Solr 4.2 included read-only access via a REST API to most elements of the live schema, and Solr 4.3 completed this API to cover all schema elements.
For example, to get the dynamic field definition for "*_i", you can send a request via curl; the response follows the command:
PROMPT$ curl 'http://localhost:8983/solr/schema/dynamicfields/*_i'
{
  "responseHeader":{
    "status":0,
    "QTime":1},
  "dynamicfield":{
    "name":"*_i",
    "type":"int",
    "indexed":true,
    "stored":true}}
By default, the Schema REST API formats its response in indented JSON, but you can specify the response format using the “wt” parameter, e.g. ?wt=xml.
The full live schema in schema.xml format is also available, using a “wt” parameter designed for this purpose:
PROMPT$ curl 'http://localhost:8983/solr/schema?wt=schema.xml'
<?xml version="1.0" encoding="UTF-8"?>
<schema name="example" version="1.5">
  <uniqueKey>id</uniqueKey>
  <defaultSearchField>text</defaultSearchField>
  <solrQueryParser defaultOperator="OR"/>
  <similarity class="solr.SchemaSimilarityFactory"/>
  <types>
    <fieldType name="alphaOnlySort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
      <analyzer>
        [...]
See the Schema REST API page on the Solr wiki for a full listing of URLs to access each schema element: http://wiki.apache.org/solr/SchemaRESTAPI.
Managed schema
As a prerequisite to dynamically adding fields to the live schema, Solr will take control of the schema. A configuration setting in solrconfig.xml enables managed schema mode, in which a new schema file, named managed-schema by default, will contain the persisted schema. On first startup, an existing schema.xml file is parsed and then persisted to the managed schema file, and the original schema.xml file is renamed to schema.xml.bak.
It’s possible to move between managed schema and user-editable schema modes by removing the managed schema configuration from solrconfig.xml (user-editable schema mode is the default) and providing a schema.xml file, which can be bootstrapped from the live schema available via the Schema REST API.
The managed schema feature is included in Solr 4.3.
Add new fields to the live schema
This feature will be part of Solr 4.4, which has not yet been released as of this writing.
Internal and REST API methods were added to allow adding one or more previously unknown fields to the live schema. The schema is persisted after adding the unknown fields. In SolrCloud mode, when managed schema is enabled, each replica keeps a watch on the schema it uses, and automatically reloads the schema when any other replica persists the schema to ZooKeeper.
For example, to add a single new field named "newfield" via the Schema REST API:
PROMPT$ curl -X PUT -H 'Content-type:application/json' -d '{"type":"text_general","stored":"false"}' http://localhost:8983/solr/schema/fields/newfield
See the Schema REST API wiki page for the full details.
By combining the read and write methods of the Schema REST API, an application will be able to incrementally build a schema based on input document structure, by comparing the fields in each input document to the fields in the live schema, adding unknown fields, then adding the input document.
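The compare-then-add step of that loop can be sketched in Python as follows. This is only a planning function under stated assumptions: it takes the set of field names already read from the live schema, and returns the PUT requests that would add the missing ones, using the URL layout from the curl example above. The function name, the choose_type callback, and the decision to send only a "type" property are hypothetical; actually issuing the requests and then indexing the document is left out.

```python
import json
from urllib.parse import quote

# Base URL taken from the curl example; adjust for your own Solr host and core.
SCHEMA_FIELDS_URL = "http://localhost:8983/solr/schema/fields"


def plan_schema_updates(document, known_fields, choose_type):
    """Return (url, json_body) pairs for PUT requests that add unknown fields.

    document     -- dict of field name to value for one input document
    known_fields -- set of field names already present in the live schema
    choose_type  -- hypothetical callback mapping a field value to a field type name
    """
    requests = []
    for name, value in document.items():
        if name in known_fields:
            continue  # already in the schema, nothing to add
        body = json.dumps({"type": choose_type(value)})
        requests.append((f"{SCHEMA_FIELDS_URL}/{quote(name, safe='')}", body))
    return requests


known = {"id"}
doc = {"id": "1", "newfield": "hello"}
plan = plan_schema_updates(doc, known, lambda value: "text_general")
for url, body in plan:
    print("PUT", url, body)
```

Keeping the planning step free of network calls makes it easy to inspect what would change in the schema before anything is sent.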
What’s next?
Work is ongoing under Solr JIRA issue SOLR-3250 to complete the Schemaless Solr mode described above.
Support for guessing the following field types based on field values is planned: date/time (including a few basic formats), double, long, boolean, and text. Types for fields with multiple values will be guessed as the lowest common denominator among the guesses for each value.
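As a rough illustration of value-based guessing, the Python sketch below classifies single values and then reduces several guesses to a common type. Everything in it is an assumption made for the example: the date formats, the order in which types are tried, and the widening rules (long and double combine to double, any other mix falls back to text) are hypothetical and may differ from what Solr ends up doing.

```python
import re
from datetime import datetime

# Hypothetical date formats, chosen only for this example.
DATE_FORMATS = ("%Y-%m-%dT%H:%M:%SZ", "%Y-%m-%d", "%m/%d/%Y")
LONG_RE = re.compile(r"[+-]?\d+")
DOUBLE_RE = re.compile(r"[+-]?(\d+\.\d*|\.\d+|\d+)([eE][+-]?\d+)?")


def guess_type(value) -> str:
    """Guess one of: boolean, long, double, date, text."""
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    text = str(value).strip()
    if text.lower() in ("true", "false"):
        return "boolean"
    if LONG_RE.fullmatch(text):
        return "long"
    if DOUBLE_RE.fullmatch(text):
        return "double"
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(text, fmt)
            return "date"
        except ValueError:
            continue
    return "text"


def common_type(values) -> str:
    """Reduce the guesses for several values to one type (assumed widening rules)."""
    guesses = {guess_type(v) for v in values}
    if len(guesses) == 1:
        return guesses.pop()
    if guesses == {"long", "double"}:
        return "double"
    return "text"


print(guess_type("10/10/2010"))
print(common_type(["1", "2.5"]))
```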
A new schemaless mode example configuration is also planned.
Schemaless: What is it good for?
It’s important to recognize that Solr’s schemaless mode, once it’s available, will be incrementally building a schema for you in the background, rather than operating without a schema. As described above, Solr can’t function properly without a schema.
Schemaless mode will help developers bootstrap their schema, but the result will not be tuned for production use. Users of schemaless mode should consider it an initial schema design tool, rather than a production-ready solution.
Some schemaless mode limitations:
- Allowing new schema fields to be created in production may not be a good idea
  - Solr allows you to customize what happens when unknown fields are encountered – this can help to guard against typos and other unwanted sources of new fields. If you use schemaless mode in production, you won’t get this benefit. It’s very likely that previously unseen fields will fit a pattern matchable using dynamic fields. You should be able to look at the set of newly created fields after adding a set of documents in schemaless mode, and then craft dynamic fields that cover new fields in the future.
- Field type detection can’t know about the full range of a field’s values
  - Schemaless mode will choose longs rather than ints, and doubles rather than floats, even if the initial values would fit in the smaller data types, to guard against the possibility of overflow. This could result in larger than necessary disk and memory usage.
- Limited data processing discrimination
  - The field types to which schemaless mode will map unknown fields are limited to a fairly small subset of what Solr is capable of. For example, numeric and date field types can be tuned to perform faster range queries, at the cost of a larger index.
  - For text fields, schemaless mode won’t be able to tell if you want the text broken up into individual terms, or left as a single term, e.g. SKUs or other strings you need to be able to match exactly. So schemaless mode will always break up text fields into individual terms.
  - Documents with per-field language differences won’t be handled ideally using schemaless mode – all text fields will get the same treatment.
  - Solr can process the same field’s data multiple ways, and aggregate fields, using the <copyField> schema directive. By contrast, schemaless mode will only process each field in one way, though it will include a <copyField> directive to copy all other fields’ content into a single catch-all field, to enable single-field queries against all document content.
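The point about value ranges can be shown with a few lines of Python. The sample values are made up: the first batch of values fits in a 32-bit signed integer, so a detector that only saw those could pick an int type, and a later document would then overflow it.

```python
# Range of a 32-bit signed integer.
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def fits_int32(values):
    """Return True if every value fits in a 32-bit signed integer."""
    return all(INT32_MIN <= v <= INT32_MAX for v in values)


first_batch = [12, 4500, 2_000_000_000]  # hypothetical values seen first
later_value = 3_000_000_000              # hypothetical value arriving later

print(fits_int32(first_batch))
print(fits_int32(first_batch + [later_value]))
```

Choosing the wider type up front avoids this failure at the cost of extra storage, which is the trade-off described in the list above.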
Don’t let the above limitations discourage you—schemaless mode will be an effective and appropriate tool for the schema exploration/definition phase of new projects.