
Solr Statistics and Fields Facets

by Hoss
January 29, 2015

Note: While the features and functionality discussed in this blog post are still available and supported in Solr, new users are encouraged to use the JSON Facet API instead to achieve similar results. Although its accuracy in distributed collections was somewhat limited when first introduced in Solr 5.0, the JSON Facet API supports a broader set of features (including the ability to sort on nested stats functions). With the addition of (two-phase) refinement support in Solr 7.0, and the configurable overrefine option added in 7.5, there is virtually no reason for users to start using facet.pivot or stats.field.
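
For readers who want to follow that advice, here is a rough sketch of what a comparable JSON Facet API request body might look like for the books example used later in this post. The bucket name "authors" and the stat names (min_price, max_price, avg_price) are labels chosen here for illustration, and the author and price fields are assumed to exist as in the examples below. The body would typically be POSTed to a collection's /query or /select handler.

```python
import json

def json_facet_request(query, bucket_field, stat_field, limit=5):
    """Build a JSON Request API body: a terms facet with nested stats.

    The facet and stat names ("authors", "min_price", ...) are arbitrary
    labels picked for this sketch; they are not required by Solr.
    """
    return {
        "query": query,
        "limit": 0,  # only the facet data is wanted, not the documents
        "facet": {
            "authors": {
                "type": "terms",
                "field": bucket_field,
                "limit": limit,
                # sorting buckets by a nested stat, as mentioned in the note
                "sort": "avg_price desc",
                "facet": {
                    "min_price": "min(%s)" % stat_field,
                    "max_price": "max(%s)" % stat_field,
                    "avg_price": "avg(%s)" % stat_field,
                },
            }
        },
    }

request = json_facet_request("Crime", "author", "price")
body = json.dumps(request, indent=2)
print(body)
```

Each bucket in the response would carry its own count plus the three named stats, which is roughly the same shape of information the pivot+stats approach in the rest of this post produces.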

Solr has supported basic “Field Facets” for a very long time. Solr has also supported “Field Stats” over numeric fields for (almost) as long. But starting with Solr 5.0 (building off of the great work done to support Distributed Pivot Faceting in Solr) it will now be possible to compute Field Stats for each Constraint of a Pivot Facet. Today I’d like to explain what the heck that means, and how it might be useful to you.

Facets

“Field Faceting” is hopefully a fairly straightforward concept to most Solr users. For any query, you can also ask Solr’s FacetComponent to compute the top “terms” from a field of your choice, and return those terms along with the cardinality of the subset of documents that match that term.

To consider a trivial little example: if you have a bunch of documents representing “Books” and you do a query for books about “Crime”, you can then tell Solr to Facet on the author field, and Solr might tell you that it found 1024 books matching the query q=Crime and of those books the most commonly found author is “Kaiser Soze” who has written “42” of those books. If you then subsequently filter your results with fq=author:"Kaiser Soze" you should only get 42 results.

http://localhost:8983/solr/books/select?q=Crime&facet=true&facet.field=author
  ...
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{
      "author":[
        "Kaiser Soze",42,
        "James Moriarty",37,
        "Carmine Falcone",25,
        ...
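
As a small illustration of consuming that response from client code, the sketch below builds the request URL and turns the flat "term, count, term, count" list shown above into pairs. It assumes the response is requested as JSON (wt=json); the helper names facet_url and pair_up are made up for this example, and the sample list simply repeats the values from the response above.

```python
from urllib.parse import urlencode

def facet_url(base, query, field, filters=()):
    """Build a select URL that asks for field facets on one field."""
    params = [("q", query), ("facet", "true"),
              ("facet.field", field), ("wt", "json")]
    params += [("fq", fq) for fq in filters]
    return base + "?" + urlencode(params)

def pair_up(flat):
    """Turn ["a", 1, "b", 2] into [("a", 1), ("b", 2)]."""
    return list(zip(flat[0::2], flat[1::2]))

base = "http://localhost:8983/solr/books/select"
url = facet_url(base, "Crime", "author")

# The flat list as it appears under facet_counts/facet_fields/author
sample = ["Kaiser Soze", 42, "James Moriarty", 37, "Carmine Falcone", 25]
top_authors = pair_up(sample)

# Drilling down on the top author with a filter query
top_name = top_authors[0][0]
drill_url = facet_url(base, "Crime", "author",
                      filters=['author:"%s"' % top_name])
print(url)
print(drill_url)
```

The second URL corresponds to the fq=author:"Kaiser Soze" follow-up described above, which should match exactly as many documents as the facet count reported.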

Stats

“Field Stats” is a feature of Solr many users may not be very familiar with. It’s a way to instruct Solr to use the StatsComponent to compute some aggregate statistics against a numeric field for all documents matching a query. The supported statistics are:

min
mean
max
sum
count (number of values found in the field for these docs)
missing (number of documents in the result set that have no value in this field)
stddev (standard deviation)
sumOfSquares (Intermediate result used to compute stddev, not useful for most users)

So to continue our previous example: When doing your search for q=Crime you can tell Solr you want to compute stats over the price field and look at the min, mean, max, and stddev values to get an idea of how expensive books about Crime are.

http://localhost:8983/solr/books/select?q=Crime&stats=true&stats.field=price
  ...
  "stats":{
    "stats_fields":{
      "price":{
        "min":12.34,
        "max":57.65,
        "mean":34.56,
        ...
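
To make the meaning of each statistic concrete, here is a minimal local sketch that computes the same set of numbers over a plain Python list, where None stands in for a document with no value in the field. It assumes a single-valued field and a sample (n-1) standard deviation derived from sum and sumOfSquares; that formula is an assumption of this sketch, so treat it as an illustration rather than a description of Solr's internals.

```python
import math

def field_stats(values):
    """Compute StatsComponent-style aggregates over one value per doc.

    `values` holds one entry per matching document; None means the
    document has no value in the field.
    """
    present = [v for v in values if v is not None]
    count = len(present)
    total = sum(present)
    sum_sq = sum(v * v for v in present)
    stats = {
        "min": min(present) if present else None,
        "max": max(present) if present else None,
        "sum": total,
        "count": count,                    # values found
        "missing": len(values) - count,    # docs without a value
        "sumOfSquares": sum_sq,
        "mean": total / count if count else None,
    }
    # Sample standard deviation from the intermediate sums (assumption)
    if count > 1:
        variance = (count * sum_sq - total * total) / (count * (count - 1))
        stats["stddev"] = math.sqrt(max(variance, 0.0))
    else:
        stats["stddev"] = 0.0
    return stats

prices = [10.0, 20.0, None, 30.0]
result = field_stats(prices)
print(result)
```

This also shows why sumOfSquares is listed at all: together with sum and count it is enough to derive stddev without revisiting the individual values.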

You Got Your Facets In My Stats!

From the very beginning of its existence, the StatsComponent has offered some rudimentary support for generating “sub-facets” over a field using the stats.facet param. This generated a simplistic list of facet terms, and computed the stats over each subset. To continue our earlier example, the results might look something like this…

http://localhost:8983/solr/books/select?q=Crime&stats=true&stats.field=price&stats.facet=author
  ...
  "stats":{
    "stats_fields":{
      "price":{
        "min":12.34,
        "max":57.65,
        "mean":34.56,
        ...
        "facets":{
          "author":{
            "Carmine Falcone":{
              "min":22.50,
              "max":37.50,
              ...
            },
            ...
            "James Moriarty":{
              "min":19.95,
              "max":39.95,
              ...

But this stats.facet approach has always been plagued with problems:

Completely different code from FacetComponent that was hard to maintain, and doesn’t support distributed search (see EDIT#1 below)
Always returns every term from the stats.facet field, without any support for facet.limit, facet.sort, etc.
Lots of problems with multivalued facet fields and/or non string facet fields.

You Got Your Stats In My Facets!

One of the new features available in Solr 5.0 will be the ability to “link” a stats.field to a facet.pivot param. This inverts the relationship that stats.facet used to offer (nesting the stats under the facets, so to speak, instead of putting the facets under the stats), so that the FacetComponent does all the heavy lifting of determining the facet constraints and delegates to the StatsComponent only as needed to compute stats over the subset of documents for each constraint. (Having the Peanut-Butter on the inside of the Chocolate is much less messy than the alternative.)

With our previous example, this means that you could get results like so…

http://localhost:8983/solr/books/select?q=Crime&facet=true&stats=true&stats.field={!tag=t1}price&facet.pivot={!stats=t1}author
    ...
    "facet_pivot":{
      "author":[{
          "field":"author",
          "value":"Kaiser Soze",
          "count":42,
          "stats":{
            "stats_fields":{
              "price":{
                "min":12.95,
                "max":29.95,
                ...}}}},
        {
          "field":"author",
          "value":"James Moriarty",
          "count":37,
          "stats":{
            "stats_fields":{
              "price":{
                "min":19.95,
                "max":39.95,
        ...

The linkage mechanism is via a tag Local Param specified on the stats.field. This allows multiple facet.pivot params to refer to the same stats.field, or a single facet.pivot to refer to multiple different stats.field params over different fields/functions that all use the same tag, etc. And because this functionality is built on top of Pivot Facets, multiple levels of Pivots can be computed, and the stats will be computed at each level. See the Solr Reference Guide for more details.
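
The tag linkage can be sketched from the client side as follows: one helper builds the request parameters so that every stats.field carries the same tag and the facet.pivot refers to it, and another walks the facet_pivot response recursively (nested levels appear under a "pivot" key) to pull out the stats at each level. The helper names are made up for this example, and the sample response just repeats the values shown above.

```python
def pivot_stats_params(query, pivot_fields, stats_fields, tag="t1"):
    """Link one or more stats.field params to a facet.pivot via a tag."""
    params = [("q", query), ("facet", "true"),
              ("stats", "true"), ("wt", "json")]
    for field in stats_fields:
        params.append(("stats.field", "{!tag=%s}%s" % (tag, field)))
    pivot = ",".join(pivot_fields)
    params.append(("facet.pivot", "{!stats=%s}%s" % (tag, pivot)))
    return params

def walk_pivot(nodes, depth=0):
    """Yield (depth, field, value, count, stats_fields) for each node."""
    for node in nodes:
        stats = node.get("stats", {}).get("stats_fields", {})
        yield depth, node["field"], node["value"], node["count"], stats
        # deeper pivot levels, if any, are nested under "pivot"
        yield from walk_pivot(node.get("pivot", []), depth + 1)

params = pivot_stats_params("Crime", ["author"], ["price"])

# The "author" entry of facet_pivot, as in the response above
sample = [
    {"field": "author", "value": "Kaiser Soze", "count": 42,
     "stats": {"stats_fields": {"price": {"min": 12.95, "max": 29.95}}}},
    {"field": "author", "value": "James Moriarty", "count": 37,
     "stats": {"stats_fields": {"price": {"min": 19.95, "max": 39.95}}}},
]
rows = list(walk_pivot(sample))
for depth, field, value, count, stats in rows:
    price = stats.get("price", {})
    print("  " * depth, value, count, price.get("min"), price.get("max"))
```

Passing several entries in stats_fields produces several stats.field params sharing one tag, which is the "single facet.pivot referring to multiple stats.field params" case described above.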

Putting the Pieces Together: CitiBike

The examples I’ve mentioned so far have been fairly simple and contrived, but if you are interested in checking out some very cool applications of the new pivot+stats functionality, you should take a look at the “Solr For DataScience” repo Grant Ingersoll put together for a recent presentation using the NYC CitiBike usage data.

With the small sample data subset (bike usage from July-Oct 2013) indexed in the citi_py collection (see ./index-py.sh), you can use the following queries to find the answers to some non-trivial questions. For example…

Most Popular Trips for Subscribers, with Average Duration

For all trips made by Subscribers, find the top 5 start stations with the most common end station for each, as well as the top 5 end stations and the most common start station for each. Compute stats on the trip duration (in seconds) for each of these pairs of stations.
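
One way a request like this could be put together is sketched below. The field names (user_type, start_station_name, end_station_name, tripduration) and the filter value are hypothetical placeholders, since the actual schema lives in the repo mentioned above. The sketch issues one two-level pivot per direction and uses per-field facet.limit overrides to keep 5 outer constraints and 1 inner constraint.

```python
def top_trips_params(outer_field, inner_field, duration_field, outer_limit=5):
    """Params for a 2-level pivot with duration stats at each level.

    All field names passed in, and the user_type filter, are
    hypothetical; adjust them to the real schema.
    """
    return [
        ("q", "*:*"),
        ("fq", "user_type:Subscriber"),
        ("rows", "0"),
        ("facet", "true"),
        ("stats", "true"),
        ("stats.field", "{!tag=dur}" + duration_field),
        ("facet.pivot", "{!stats=dur}%s,%s" % (outer_field, inner_field)),
        # top N outer stations, single most common inner station for each
        ("f.%s.facet.limit" % outer_field, str(outer_limit)),
        ("f.%s.facet.limit" % inner_field, "1"),
    ]

def minutes(seconds):
    """Convert a mean duration in seconds to minutes, one decimal place."""
    return round(seconds / 60.0, 1)

by_start = top_trips_params("start_station_name", "end_station_name",
                            "tripduration")
by_end = top_trips_params("end_station_name", "start_station_name",
                          "tripduration")
print(dict(by_start)["facet.pivot"])
print(dict(by_end)["facet.pivot"])
print(minutes(810))  # 810 seconds is 13.5 minutes
```

Two separate requests are used here because the per-field limits differ depending on which station field is the outer level; the mean stat in each pivot node is in seconds and would be converted with minutes() for display.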

101202 Total Trips Taken by Subscribers

Top 5 Starting Stations With Most Popular Destination For Each
Station	Trips	Mean Duration
Pershing Square N	1075	13.5 Minutes
⇒ Broadway & W 32 St	32	8.4 Minutes
Lafayette St & E 8 St	984	13.2 Minutes
⇒ E 17 St & Broadway	35	5.5 Minutes
E 17 St & Broadway	971	12.2 Minutes
⇒ W 21 St & 6 Ave	16	6.1 Minutes
W 20 St & 11 Ave	957	13.3 Minutes
⇒ W 17 St & 8 Ave	25	5.3 Minutes
8 Ave & W 31 St	930	12.8 Minutes
⇒ 8 Ave & W 52 St	24	10.0 Minutes

Top 5 Destination Stations With Most Popular Start For Each
Station	Trips	Mean Duration
Lafayette St & E 8 St ⇒	35	5.5 Minutes
E 17 St & Broadway	1103	11.7 Minutes
W 17 St & 8 Ave ⇒	30	6.0 Minutes
8 Ave & W 31 St	973	13.2 Minutes
W 17 St & 8 Ave ⇒	24	7.3 Minutes
W 20 St & 11 Ave	960	12.2 Minutes
E 10 St & Avenue A ⇒	23	6.5 Minutes
Lafayette St & E 8 St	930	10.8 Minutes
E 30 St & Park Ave S ⇒	21	6.0 Minutes
Pershing Square N	840	14.3 Minutes

Top Destinations For Male & Female Subscribers departing from NYU, with Average Age of Rider

For all trips made by subscribers originating at one of the 5 stations adjacent to NYU, find the top 5 destination stations for each gender, as well as the average age of the rider to each of these stations.

2189 Total Trips Leaving NYU, Top 5 Destination Stations by Gender
(1658) Male		(531) Female
Trips	Destination	Mean Age	Trips	Destination	Mean Age
54	University Pl & E 14 St	35 Years	27	University Pl & E 14 St	34 Years
39	E 12 St & 3 Ave	29 Years	11	Broadway & E 14 St	39 Years
31	Lafayette St & E 8 St	39 Years	9	E 10 St & Avenue A	37 Years
29	Mercer St & Bleecker St	37 Years	9	E 17 St & Broadway	35 Years
28	LaGuardia Pl & W 3 St	39 Years	9	Washington Square E	41 Years

Where We Go From Here

There are still a lot of cool improvements in the pipeline for linking Field Stats with Faceting (e.g., combining Stats with Range Faceting, combining Range Faceting with Pivot Faceting, etc.) as well as plans to support more options for the statistics (e.g., limiting the stats computed, generating percentile histograms, etc.). All of this work is being tracked in SOLR-6348 and the associated sub-tasks, so please watch those issues in Jira to keep track of future development. We can always use more folks testing out patches!

EDIT#1: A previous version of this post said that stats.facet did not support distributed search; that was incorrect. The problem I was thinking of is that the way the StatsComponent works and deals with distributed requests depends on all of the data from each shard being returned in a single pass, which relates to the second bullet (“Always returns every term from the stats.facet field…”). Fixing stats.facet to support those params, or to delegate to the existing Facet code (which uses refinement requests to get accurate counts), was/is virtually impossible in a way that would still support accurate stats.facet counts in distributed search.
