Sorting, Faceting and Schema Design in Solr
I was recently with a client doing a Best Practices assessment when I came across a common source of confusion related to sorting, faceting and schema design.
As background, Solr provides a schema that describes the Fields and Field Types (FT) that are used by an application. Field Types describe how Solr should handle the information contained in a Field. For instance, the integer FT tells Solr to treat the contents of any Field of type integer as, you guessed it, an integer. By integer here, I mean good old-fashioned Java ints. Solr provides other FTs like long, double, float, string, date, as well as Text (which can be associated with Lucene's analysis process). Additionally, Solr provides several "sortable" FTs such as sint, slong, sdouble and sfloat.

Therein lies the confusion. I think what happens is developers hear the word "sortable" and think they should use the sortable FT for any field they want to sort results by. However, there is some subtlety here. Namely, "sortable" FTs manipulate the content so that the lexicographic order is the same as the numeric order for use during search. Sortables are thus really meant to be used when doing things like range queries (i.e. price:[2 TO 100]) and not for sorting as it relates to returning results. Due to these required changes, sortables take up more space in the index (and in memory) than their non-sortable compadres.
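To make the distinction concrete, here is a sketch of how the two integer types are declared, based on the Solr 1.x-era example schema (the type and class names below follow that schema; check your own schema.xml for the exact names in use):

```xml
<!-- Plain int: stored as a real int, compact on disk and in the
     FieldCache. Fine for sorting returned results. -->
<fieldType name="integer" class="solr.IntField" omitNorms="true"/>

<!-- Sortable int: values are encoded as strings whose lexicographic
     order matches numeric order. That encoding is what makes range
     queries like price:[2 TO 100] work, but it costs extra space. -->
<fieldType name="sint" class="solr.SortableIntField"
           omitNorms="true" sortMissingLast="true"/>
```

The short version: `sint` buys you correct range queries at the price of a bulkier index; if you only ever sort, plain `integer` is enough.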
What’s this got to do with schema design? Well, this client had three fields, all defined as sortable integer FTs, as in:
- fieldOriginal – The source of the content. This was the main field used for sorting
- fieldSearch – Copy field of Original, but rounded to the nearest 100. This was the main field for searching.
- fieldFacet – Copy field of Original, but rounded based on a percentage of the original value so as to provide a sliding scale for faceting. This was the main field used for faceting.
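The three-field pattern above might have looked something like this in schema.xml. Note this is a reconstruction from the description, not the client's actual schema; also, `copyField` copies values verbatim, so the rounding described above would have been done by the client at index time rather than by `copyField` itself:

```xml
<!-- All three declared as sortable ints: the source of the confusion -->
<field name="fieldOriginal" type="sint" indexed="true" stored="true"/>
<field name="fieldSearch"   type="sint" indexed="true" stored="false"/>
<field name="fieldFacet"    type="sint" indexed="true" stored="false"/>
```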
In this case, the client was using the Original for sorting, Search for searching, and Facet for faceting. They were not doing any range queries, so they did not need fieldSearch to be "sortable". Furthermore, the Original field had over 1 million unique terms, so sorting on it was taking up a good chunk of memory and disk space. The other two fields were smaller, so the cost of sortables was not that big of a deal. Finally, this field "pattern" was replicated for several other fields as well, some of which also had a significant number of unique terms.
Thus, simply by changing the Fields to use integers where appropriate, we significantly reduced the memory footprint and the disk space required in this client application.
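The fix can be sketched as a one-word change per field. Since no range queries were being run against any of these fields, the sortable encoding was pure overhead (again, field names here are from the description above, and the type names follow the Solr 1.x example schema):

```xml
<!-- Plain ints: same sorting and faceting behavior for these use
     cases, but without the string-encoded values that sortable
     types keep in the index and in memory -->
<field name="fieldOriginal" type="integer" indexed="true" stored="true"/>
<field name="fieldSearch"   type="integer" indexed="true" stored="false"/>
<field name="fieldFacet"    type="integer" indexed="true" stored="false"/>
```

If any of these fields had needed range queries, that field alone could have stayed `sint` while the others moved to plain `integer`.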
So, as is always the case, pay close attention to your schema design. While the Solr example schema is pretty good out of the box, you shouldn't just take it as gospel, either. Spend some time thinking about your needs during design and it will likely save you much time later when debugging and testing your application.
**UPDATE**: Note that making these changes will require you to re-index.