Solr Statistics and Fields Facets
Note: While the features and functionality discussed in this blog post are still available and supported in Solr, new users are encouraged to use the JSON Facet API instead to achieve similar results. Although its accuracy in distributed collections was somewhat limited when first introduced in Solr 5.0, the JSON Facet API supports a broader set of features (including the ability to sort on nested stats functions). With the addition of (two-phase) refinement support in Solr 7.0, and configurable overrefine added in 7.5, there is virtually no reason for users to start using facet.pivot or stats.field.
Solr has supported basic “Field Facets” for a very long time. Solr has also supported “Field Stats” over numeric fields for (almost) as long. But starting with Solr 5.0 (building off of the great work done to support Distributed Pivot Faceting in Solr) it will now be possible to compute Field Stats for each Constraint of a Pivot Facet. Today I’d like to explain what the heck that means, and how it might be useful to you.
Facets
“Field Faceting” is hopefully a fairly straightforward concept to most Solr users. For any query, you can also ask Solr’s FacetComponent to compute the top “terms” from a field of your choice, and return those terms along with the cardinality of the subset of documents that match each term.
To consider a trivial little example: if you have a bunch of documents representing “Books” and you do a query for books about “Crime”, you can then tell Solr to facet on the author field, and Solr might tell you that it found 1024 books matching the query q=Crime and that of those books the most commonly found author is “Kaiser Soze”, who has written 42 of them. If you then subsequently filter your results with fq=author:"Kaiser Soze" you should only get 42 results.
http://localhost:8983/solr/books/select?q=Crime&facet=true&facet.field=author
...
"facet_counts":{
  "facet_queries":{},
  "facet_fields":{
    "author":[
      "Kaiser Soze",42,
      "James Moriarty",37,
      "Carmine Falcone",25,
      ...
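Note that the author entry in the response above is a flat list that alternates between term and count. Here is a minimal Python sketch for turning that list into (term, count) pairs, assuming the response shape shown above. The function name pair_facet_counts is purely illustrative, and the sample data is just the three terms visible in the example.

```python
def pair_facet_counts(flat):
    """Turn a flat [term, count, term, count, ...] list into (term, count) pairs."""
    if len(flat) % 2 != 0:
        raise ValueError("expected an even number of entries (term/count pairs)")
    # Even positions hold the terms, odd positions hold the matching counts.
    return list(zip(flat[0::2], flat[1::2]))


# Sample data copied from the truncated response above.
author_facet = ["Kaiser Soze", 42, "James Moriarty", 37, "Carmine Falcone", 25]
top_authors = pair_facet_counts(author_facet)

for name, count in top_authors:
    print(f"{name}: {count}")
```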
Stats
“Field Stats” is a feature of Solr many users may not be very familiar with. It’s a way to instruct Solr to use the StatsComponent to compute some aggregate statistics against a numeric field for all documents matching a query. The supported statistics are:
- min
- mean
- max
- sum
- count (number of values found in the field for these docs)
- missing (number of documents in the result set that have no value in this field)
- stddev (standard deviation)
- sumOfSquares (Intermediate result used to compute stddev, not useful for most users)
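To make the list above concrete, here is a rough Python sketch that computes the same aggregates over a handful of made-up prices. It is an illustration of what the numbers mean, not Solr's implementation: the function name field_stats is hypothetical, count is taken to be the number of values found, None stands in for a document with no value, and the stddev formula (a sample standard deviation derived from sum and sumOfSquares) is an assumption about how such a value is typically computed.

```python
import math


def field_stats(values):
    """Compute StatsComponent-style aggregates; None marks a doc with no value."""
    present = [v for v in values if v is not None]
    count = len(present)
    total = sum(present)
    sum_of_squares = sum(v * v for v in present)
    stats = {
        "min": min(present) if present else None,
        "max": max(present) if present else None,
        "sum": total,
        "count": count,
        "missing": len(values) - count,
        "sumOfSquares": sum_of_squares,
        "mean": total / count if count else None,
        "stddev": 0.0,
    }
    if count > 1:
        # Assumed formula: sample variance expressed via sum and sumOfSquares,
        # which is why sumOfSquares is useful as an intermediate result.
        variance = (count * sum_of_squares - total * total) / (count * (count - 1))
        stats["stddev"] = math.sqrt(max(variance, 0.0))
    return stats


# Made-up prices for four documents; the last one has no price at all.
prices = [10.0, 20.0, 30.0, None]
result = field_stats(prices)
print(result)
```

Because stddev can be derived from count, sum, and sumOfSquares alone, partial results of this kind can be added together before the final value is computed.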
So to continue our previous example: when doing your search for q=Crime you can tell Solr you want to compute stats over the price field and look at the min, mean, max, and stddev values to get an idea of how expensive books about Crime are.
http://localhost:8983/solr/books/select?q=Crime&stats=true&stats.field=price
...
"stats":{
  "stats_fields":{
    "price":{
      "min":12.34,
      "max":57.65,
      "mean":34.56,
      ...
You Got Your Facets In My Stats!
From the very beginning of its existence, the StatsComponent has offered rudimentary support for generating “sub-facets” over a field using the stats.facet param. This generated a simplistic list of facet terms, and computed the stats over each subset. To continue our earlier example, the results might look something like this:
http://localhost:8983/solr/books/select?q=Crime&stats=true&stats.field=price&stats.facet=author
...
"stats":{
  "stats_fields":{
    "price":{
      "min":12.34,
      "max":57.65,
      "mean":34.56,
      ...
      "facets":{
        "author":{
          "Carmine Falcone":{
            "min":22.50,
            "max":37.50,
            ... },
          ...
          "James Moriarty":{
            "min":19.95,
            "max":39.95,
            ...
But this stats.facet approach has always been plagued with problems:
- Completely different code from FacetComponent that was hard to maintain, and doesn’t support distributed search (see EDIT#1 below)
- Always returns every term from the stats.facet field, without any support for facet.limit, facet.sort, etc.
- Lots of problems with multivalued facet fields and/or non-string facet fields.
You Got Your Stats In My Facets!
One of the new features available in Solr 5.0 will be the ability to “link” a stats.field to a facet.pivot param. This inverts the relationship that stats.facet used to offer (nesting the stats under the facets, so to speak, instead of putting the facets under the stats) so that the FacetComponent does all the heavy lifting of determining the facet constraints, and delegates to the StatsComponent only as needed to compute stats over the subset of documents for each constraint. (Having the Peanut-Butter on the inside of the Chocolate is much less messy than the alternative.)
With our previous example, this means that you could get results like so…
http://localhost:8983/solr/books/select?q=crime&facet=true&stats=true&stats.field={!tag=t1}price&facet.pivot={!stats=t1}author
...
"facet_pivot":{
  "author":[{
      "field":"author",
      "value":"Kaiser Soze",
      "count":42,
      "stats":{
        "stats_fields":{
          "price":{
            "min":12.95,
            "max":29.95,
            ...}}}},
    {
      "field":"author",
      "value":"James Moriarty",
      "count":37,
      "stats":{
        "stats_fields":{
          "price":{
            "min":19.95,
            "max":39.95,
            ...
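If you are consuming a response like this from client code, a small recursive walk is usually enough to pull out the stats for each constraint. The Python sketch below assumes the response shape shown above, and additionally assumes that deeper pivot levels (when requested) are nested under a "pivot" key on each entry. The function name walk_pivot and the hand-written sample dict are illustrative only.

```python
def walk_pivot(entries, path=()):
    """Yield (path, count, stats_fields) for every constraint at every pivot level."""
    for entry in entries:
        current = path + ((entry["field"], entry["value"]),)
        stats_fields = entry.get("stats", {}).get("stats_fields", {})
        yield current, entry["count"], stats_fields
        # Assumption: sub-pivots, if any, are nested under the "pivot" key.
        yield from walk_pivot(entry.get("pivot", []), current)


# Hand-written sample shaped like the response above (values copied from it).
facet_pivot = {
    "author": [
        {"field": "author", "value": "Kaiser Soze", "count": 42,
         "stats": {"stats_fields": {"price": {"min": 12.95, "max": 29.95}}}},
        {"field": "author", "value": "James Moriarty", "count": 37,
         "stats": {"stats_fields": {"price": {"min": 19.95, "max": 39.95}}}},
    ]
}

rows = list(walk_pivot(facet_pivot["author"]))
for path, count, stats_fields in rows:
    label = " > ".join(f"{field}={value}" for field, value in path)
    price = stats_fields.get("price", {})
    print(label, count, price.get("min"), price.get("max"))
```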
The linkage mechanism is via a tag Local Param specified on the stats.field. This allows multiple facet.pivot params to refer to the same stats.field, or a single facet.pivot to refer to multiple different stats.field params over different fields/functions that all use the same tag, etc. And because this functionality is built on top of Pivot Facets, multiple levels of Pivots can be computed, and the stats will be computed at each level. See the Solr Reference Guide for more details.
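As a rough illustration of the tag linkage, the following Python sketch assembles the request parameters for one facet.pivot that refers to one or more stats.field params sharing a tag. The helper name pivot_stats_params is hypothetical; the parameter names and Local Param syntax are the ones used in the example URL above, and only the standard library is used to encode them.

```python
from urllib.parse import urlencode, parse_qs


def pivot_stats_params(query, pivot_fields, stats_fields, tag="t1"):
    """Build params linking every stats.field to one facet.pivot via a shared tag."""
    params = [("q", query), ("facet", "true"), ("stats", "true")]
    for field in stats_fields:
        # Each stats.field carries the tag as a Local Param, e.g. {!tag=t1}price
        params.append(("stats.field", "{!tag=" + tag + "}" + field))
    # The pivot refers back to the tag; several pivot levels are comma separated.
    params.append(("facet.pivot", "{!stats=" + tag + "}" + ",".join(pivot_fields)))
    return params


params = pivot_stats_params("crime", ["author"], ["price"])
query_string = urlencode(params)  # percent-encodes the braces and "!" safely
print(query_string)
```

The encoded string can be appended to a select URL such as the one in the example above. Passing more than one entry in pivot_fields produces a multi-level pivot, and passing more than one entry in stats_fields attaches several stats to the same pivot through the shared tag.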
Putting the Pieces Together: CitiBike
The examples I’ve mentioned so far have been fairly simple and contrived, but if you are interested in checking out some very cool applications of the new pivot+stats functionality, you should take a look at the “Solr For DataScience” repo Grant Ingersoll put together for a recent presentation using the NYC CitiBike usage data.
With the small sample data subset (bike usage from July-Oct 2013) indexed in the citi_py collection (see ./index-py.sh), you can use the following queries to find the answers to some non-trivial questions. For example…
Most Popular Trips for Subscribers, with Average Duration
For all trips made by Subscribers, find the top 5 start stations with the most common end station for each, as well as the top 5 end stations and the most common start station for each. Compute stats on the trip duration (in seconds) for each of these pairs of stations.
101202 Total Trips Taken by Subscribers
Top Destinations For Male & Female Subscribers departing from NYU, with Average Age of Rider
For all trips made by subscribers originating at one of the 5 stations adjacent to NYU, find the top 5 destination stations for each gender, as well as the average age of the rider to each of these stations.
2189 Total Trips Leaving NYU, Top 5 Destination Stations by Gender

| Male Trips (1658 total) | Male Destination | Male Mean Age | Female Trips (531 total) | Female Destination | Female Mean Age |
| --- | --- | --- | --- | --- | --- |
| 54 | University Pl & E 14 St | 35 Years | 27 | University Pl & E 14 St | 34 Years |
| 39 | E 12 St & 3 Ave | 29 Years | 11 | Broadway & E 14 St | 39 Years |
| 31 | Lafayette St & E 8 St | 39 Years | 9 | E 10 St & Avenue A | 37 Years |
| 29 | Mercer St & Bleecker St | 37 Years | 9 | E 17 St & Broadway | 35 Years |
| 28 | LaGuardia Pl & W 3 St | 39 Years | 9 | Washington Square E | 41 Years |
Where We Go From Here
There are still a lot of cool improvements in the pipeline for linking Field Stats with Faceting (e.g. combining Stats with Range Faceting, combining Range Faceting with Pivot Faceting, etc.) as well as plans to support more options for the statistics (e.g. limiting the stats computed, generating percentile histograms, etc.). All of this work is being tracked in SOLR-6348 and the associated sub-tasks, so please watch those issues in Jira to keep track of future development. We can always use more folks testing out patches!
EDIT#1: A previous version of this post said that stats.facet did not support distributed search. That was incorrect. The problem I was thinking of is that the way the StatsComponent handles distributed requests depends on all of the data from each shard being returned in a single pass, which relates to the second bullet (“Always returns every term from the stats.facet field…”). Fixing stats.facet to support those params, or to delegate to the existing Facet code (which uses refinement requests to get accurate counts), was/is virtually impossible in a way that would still support accurate stats.facet counts in distributed search.