Note: While the features and functionality discussed in this blog post are still available and supported in Solr, new users are encouraged to use the JSON Facet API instead to achieve similar results. Although its accuracy in distributed collections was somewhat limited when it was first introduced in Solr 5.0, the JSON Facet API supports a broader set of features (including the ability to sort on nested stats functions). With the addition of (two-phase) refinement support in Solr 7.0 and the configurable overrefine option in 7.5, there is virtually no reason for new users to start with facet.pivot or stats.field.

Solr has supported basic “Field Facets” for a very long time. Solr has also supported “Field Stats” over numeric fields for (almost) as long. But starting with Solr 5.0 (building off of the great work done to support Distributed Pivot Faceting in Solr) it will now be possible to compute Field Stats for each Constraint of a Pivot Facet. Today I’d like to explain what the heck that means, and how it might be useful to you.

Facets

“Field Faceting” is hopefully a fairly straightforward concept to most Solr users. For any query, you can also ask Solr’s FacetComponent to compute the top “terms” from a field of your choice, and return those terms along with the number of matching documents that contain each term.

To consider a trivial little example: if you have a bunch of documents representing “Books” and you do a query for books about “Crime”, you can then tell Solr to Facet on the author field, and Solr might tell you that it found 1024 books matching the query q=Crime and that, of those books, the most commonly found author is “Kaiser Soze”, who has written 42 of them. If you then subsequently filter your results with fq=author:"Kaiser Soze" you should only get 42 results.

http://localhost:8983/solr/books/select?q=Crime&facet=true&facet.field=author
  ...
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{
      "author":[
        "Kaiser Soze",42,
        "James Moriarty",37,
        "Carmine Falcone",25,
        ...
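
As the sample response shows, each entry under facet_fields is a flat list that alternates term and count. Here is a minimal Python sketch for pairing them up, assuming the JSON response has already been parsed into a dict. The helper name facet_pairs is made up for illustration, and the data is just the sample values from above with the elided entries left out.

```python
def facet_pairs(flat):
    """Turn ["term1", count1, "term2", count2, ...] into [(term, count), ...]."""
    if len(flat) % 2 != 0:
        raise ValueError("expected alternating term/count entries")
    # Even positions hold the terms, odd positions hold the counts.
    return list(zip(flat[0::2], flat[1::2]))


# Sample values copied from the response above (elided entries omitted).
response = {
    "facet_counts": {
        "facet_queries": {},
        "facet_fields": {
            "author": ["Kaiser Soze", 42, "James Moriarty", 37, "Carmine Falcone", 25]
        },
    }
}

authors = facet_pairs(response["facet_counts"]["facet_fields"]["author"])
for name, count in authors:
    print(f"{name}: {count}")
```

Each count is the number of documents you would expect back if you added the matching fq for that term.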

Stats

“Field Stats” is a feature of Solr that many users may not be very familiar with. It’s a way to instruct Solr to use the StatsComponent to compute some aggregate statistics against a numeric field for all documents matching a query. The supported statistics are:

  • min
  • mean
  • max
  • sum
  • count (number of values found in the field for these docs)
  • missing (number of documents in the result set that have no value in this field)
  • stddev (standard deviation)
  • sumOfSquares (Intermediate result used to compute stddev, not useful for most users)

So to continue our previous example: When doing your search for q=Crime you can tell Solr you want to compute stats over the price field and look at the min, mean, max, and stddev values to get an idea of how expensive books about Crime are.

http://localhost:8983/solr/books/select?q=Crime&stats=true&stats.field=price
  ...
  "stats":{
    "stats_fields":{
      "price":{
        "min":12.34,
        "max":57.65,
        "mean":34.56,
        ...
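
To make the list of statistics a bit more concrete, here is a rough Python sketch that computes the same aggregates over a plain list of values, where None stands in for a document with no value in the field. It is not Solr's code: the function name field_stats and the prices are invented, and the use of the sample (n - 1) form of the standard deviation is an assumption on my part. It does show why sumOfSquares is kept as an intermediate result: stddev can be derived from count, sum and sumOfSquares alone, and those three can simply be added together across subsets of documents.

```python
import math


def field_stats(values):
    """Compute StatsComponent-style aggregates; None marks a doc with no value."""
    present = [v for v in values if v is not None]
    count = len(present)
    total = sum(present)
    sum_of_squares = sum(v * v for v in present)
    stats = {
        "min": min(present) if present else None,
        "max": max(present) if present else None,
        "sum": total,
        "count": count,
        "missing": len(values) - count,
        "sumOfSquares": sum_of_squares,
        "mean": total / count if count else None,
    }
    # Assumption: sample standard deviation, derived only from
    # count, sum and sumOfSquares (no second pass over the values).
    if count > 1:
        variance = (count * sum_of_squares - total * total) / (count * (count - 1))
        stats["stddev"] = math.sqrt(max(variance, 0.0))
    else:
        stats["stddev"] = 0.0
    return stats


prices = [10.0, 20.0, 30.0, None]  # made-up prices; None = book with no price
result = field_stats(prices)
print(result)
```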

You Got Your Facets In My Stats!

From the very beginning of its existence, the StatsComponent has had some rudimentary support for generating “sub-facets” over a field using the stats.facet param. This generated a simplistic list of facet terms, and computed the stats over each subset. To continue our earlier example, the results might look something like this…

http://localhost:8983/solr/books/select?q=Crime&stats=true&stats.field=price&stats.facet=author
  ...
  "stats":{
    "stats_fields":{
      "price":{
        "min":12.34,
        "max":57.65,
        "mean":34.56,
        ...
        "facets":{
          "author":{
            "Carmine Falcone":{
              "min":22.50,
              "max":37.50,
              ...
            },
            ...
            "James Moriarty":{
              "min":19.95,
              "max":39.95,
              ...

But this stats.facet approach has always been plagued with problems:

  • Completely different code from the FacetComponent that was hard to maintain, with a single-pass approach to distributed search that made the other problems very hard to fix (see EDIT#1 below)
  • Always returns every term from the stats.facet field, w/o any support for facet.limit, facet.sort, etc…
  • Lots of problems with multivalued facet fields and/or non string facet fields.

You Got Your Stats In My Facets!

One of the new features available in Solr 5.0 will be the ability to “link” a stats.field to a facet.pivot param. This inverts the relationship that stats.facet used to offer (nesting the stats under the facets, so to speak, instead of putting the facets under the stats) so that the FacetComponent does all the heavy lifting of determining the facet constraints, and delegates to the StatsComponent only as needed to compute stats over the subset of documents for each constraint. (Having the Peanut-Butter on the inside of the Chocolate is much less messy than the alternative.)

With our previous example, this means that you could get results like so…

http://localhost:8983/solr/books/select?q=Crime&facet=true&stats=true&stats.field={!tag=t1}price&facet.pivot={!stats=t1}author
    ...
    "facet_pivot":{
      "author":[{
          "field":"author",
          "value":"Kaiser Soze",
          "count":42,
          "stats":{
            "stats_fields":{
              "price":{
                "min":12.95,
                "max":29.95,
                ...}}}},
        {
          "field":"author",
          "value":"James Moriarty",
          "count":37,
          "stats":{
            "stats_fields":{
              "price":{
                "min":19.95,
                "max":39.95,
        ...
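
If you want to pull those numbers out of a response like this, a small recursive walk is enough. The Python sketch below assumes the parsed JSON has the shape shown above and that deeper pivot levels, when requested, appear nested under a "pivot" key on each entry. The function name pivot_stats is hypothetical, and the sample data only repeats the values visible in the response above.

```python
def pivot_stats(pivots, stat_field, path=()):
    """Yield (path_of_values, count, stats_dict) for every pivot constraint."""
    for entry in pivots:
        here = path + (entry["value"],)
        stats = entry.get("stats", {}).get("stats_fields", {}).get(stat_field, {})
        yield here, entry["count"], stats
        # Assumption: deeper pivot levels are nested under a "pivot" key.
        yield from pivot_stats(entry.get("pivot", []), stat_field, here)


# Values copied from the sample response above (elided parts omitted).
facet_pivot = {
    "author": [
        {"field": "author", "value": "Kaiser Soze", "count": 42,
         "stats": {"stats_fields": {"price": {"min": 12.95, "max": 29.95}}}},
        {"field": "author", "value": "James Moriarty", "count": 37,
         "stats": {"stats_fields": {"price": {"min": 19.95, "max": 39.95}}}},
    ]
}

rows = list(pivot_stats(facet_pivot["author"], "price"))
for path, count, stats in rows:
    print(" > ".join(path), count, stats.get("min"), stats.get("max"))
```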

The linkage mechanism is via a tag Local Param specified on the stats.field. This allows multiple facet.pivot params to refer to the same stats.field, or a single facet.pivot to refer to multiple different stats.field params over different fields/functions that all use the same tag, etc. And because this functionality is built on top of Pivot Facets, multiple levels of Pivots can be computed, and the stats will be computed at each level. See the Solr Reference Guide for more details.
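
Here is a rough Python sketch of how such a request could be assembled, with two stats.field params sharing one tag and a two-level pivot referring to it. The stats.field and facet.pivot local params follow the example URL above; the genre and pages fields and the helper name pivot_stats_params are assumptions made up for this illustration.

```python
from urllib.parse import parse_qs, urlencode


def pivot_stats_params(query, pivot_fields, stats_fields, tag="t1"):
    """Build request params linking every stats.field to one facet.pivot via a tag."""
    params = [("q", query), ("facet", "true"), ("stats", "true")]
    # Every stats.field carries the same tag ...
    for field in stats_fields:
        params.append(("stats.field", "{!tag=%s}%s" % (tag, field)))
    # ... and the pivot refers to that tag. Comma-separated pivot fields
    # request nested levels, with stats at each level.
    params.append(("facet.pivot", "{!stats=%s}%s" % (tag, ",".join(pivot_fields))))
    return params


# "genre" and "pages" are hypothetical fields, used only for illustration.
params = pivot_stats_params("Crime", ["genre", "author"], ["price", "pages"])
query_string = urlencode(params)
print("http://localhost:8983/solr/books/select?" + query_string)

# Round-trip check: the local params survive URL encoding intact.
decoded = parse_qs(query_string)
print(decoded["facet.pivot"])
```

Using a list of pairs rather than a dict keeps repeated params such as stats.field intact when the query string is encoded.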

Putting the Pieces Together: CitiBike

The examples I’ve mentioned so far have been fairly simple and contrived, but if you are interested in checking out some very cool applications of the new pivot+stats functionality, you should take a look at the “Solr For DataScience” repo Grant Ingersoll put together for a recent presentation using the NYC CitiBike usage data.

With the small sample data subset (bike usage from July-Oct 2013) indexed in the citi_py collection (see ./index-py.sh), you can use the following queries to find the answers to some non-trivial questions. For example…

Most Popular Trips for Subscribers, with Average Duration

For all trips made by Subscribers, find the top 5 start stations with the most common end station for each, as well as the top 5 end stations and the most common start station for each. Compute stats on the trip duration (in seconds) for each of these pairs of stations.
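
A question like this maps onto one pivot+stats request per direction. The Python sketch below is only a guess at what such requests could look like, not the queries from the repo: the field names usertype, start_station_name, end_station_name and tripduration are assumptions about the schema, and it also assumes that per-field facet.limit overrides apply at each pivot level. The helper names are invented.

```python
from urllib.parse import urlencode


def top_trips_params(outer, inner, top_n=5):
    """Params for: top_n values of `outer`, each with its single most common `inner`."""
    return [
        ("q", "*:*"),
        ("fq", "usertype:Subscriber"),  # assumed field name and value
        ("rows", "0"),                  # only facets and stats are needed
        ("facet", "true"),
        ("stats", "true"),
        ("stats.field", "{!tag=dur}tripduration"),  # assumed field, in seconds
        ("facet.pivot", "{!stats=dur}%s,%s" % (outer, inner)),
        ("f.%s.facet.limit" % outer, str(top_n)),
        ("f.%s.facet.limit" % inner, "1"),
    ]


def minutes(seconds):
    """Convert a mean duration in seconds to minutes, rounded for display."""
    return round(seconds / 60.0, 1)


# The per-field limits differ by direction, so each direction gets its own request.
by_start = top_trips_params("start_station_name", "end_station_name")
by_end = top_trips_params("end_station_name", "start_station_name")

base = "http://localhost:8983/solr/citi_py/select?"
print(base + urlencode(by_start))
print(base + urlencode(by_end))
print(minutes(810), "Minutes")
```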

101202 Total Trips Taken by Subscribers

Top 5 Starting Stations With Most Popular Destination For Each

Station                        Trips   Mean Duration
Pershing Square N              1075    13.5 Minutes
  ⇒ Broadway & W 32 St         32      8.4 Minutes
Lafayette St & E 8 St          984     13.2 Minutes
  ⇒ E 17 St & Broadway         35      5.5 Minutes
E 17 St & Broadway             971     12.2 Minutes
  ⇒ W 21 St & 6 Ave            16      6.1 Minutes
W 20 St & 11 Ave               957     13.3 Minutes
  ⇒ W 17 St & 8 Ave            25      5.3 Minutes
8 Ave & W 31 St                930     12.8 Minutes
  ⇒ 8 Ave & W 52 St            24      10.0 Minutes

Top 5 Destination Stations With Most Popular Start For Each

Station                        Trips   Mean Duration
  Lafayette St & E 8 St ⇒      35      5.5 Minutes
E 17 St & Broadway             1103    11.7 Minutes
  W 17 St & 8 Ave ⇒            30      6.0 Minutes
8 Ave & W 31 St                973     13.2 Minutes
  W 17 St & 8 Ave ⇒            24      7.3 Minutes
W 20 St & 11 Ave               960     12.2 Minutes
  E 10 St & Avenue A ⇒         23      6.5 Minutes
Lafayette St & E 8 St          930     10.8 Minutes
  E 30 St & Park Ave S ⇒       21      6.0 Minutes
Pershing Square N              840     14.3 Minutes

Top Destinations For Male & Female Subscribers departing from NYU, with Average Age of Rider

For all trips made by subscribers originating at one of the 5 stations adjacent to NYU, find the top 5 destination stations for each gender, as well as the average age of the rider to each of these stations.

2189 Total Trips Leaving NYU, Top 5 Destination Stations by Gender

Male (1658 trips)
Trips   Destination                 Mean Age
54      University Pl & E 14 St     35 Years
39      E 12 St & 3 Ave             29 Years
31      Lafayette St & E 8 St       39 Years
29      Mercer St & Bleecker St     37 Years
28      LaGuardia Pl & W 3 St       39 Years

Female (531 trips)
Trips   Destination                 Mean Age
27      University Pl & E 14 St     34 Years
11      Broadway & E 14 St          39 Years
9       E 10 St & Avenue A          37 Years
9       E 17 St & Broadway          35 Years
9       Washington Square E         41 Years

Where We Go From Here

There are still a lot of cool improvements in the pipeline for linking Field Stats with Faceting (eg: combining Stats with Range Faceting, combining Range Faceting with Pivot Faceting, etc.) as well as plans to support more options for the statistics (eg: limiting the stats computed, generating percentile histograms, etc.). All of this work is being tracked in SOLR-6348 and the associated Sub-Tasks, so please watch those issues in Jira to keep track of future development. We can always use more folks testing out patches!


EDIT#1: A previous version of this post said that stats.facet did not support distributed search. That was incorrect. The problem I was thinking of is that the way the StatsComponent handles distributed requests depends on all of the data from each shard being returned in a single pass, which relates to the second bullet (“Always returns every term from the stats.facet field…”). Fixing stats.facet to support those params, or to delegate to the existing Facet code (which uses refinement requests to get accurate counts), was/is virtually impossible in a way that would still support accurate stats.facet counts in distributed search.

About Hoss
