
How Solr’s Post Filter Works

In this post, I’m going to provide a concrete example of custom post filtering for the case of filtering documents based on access control lists.

by Erik Hatcher
February 22, 2012

May 13, 2015: A code update was made for Solr 5.x – Full details in a new blog post.

Dec. 6, 2012: A code update was made for Solr 4.0 (see commented section in AccessControlQParserPlugin.java below)

Yonik recently wrote about “Advanced Filter Caching in Solr” where he talked about expensive and custom filters; it was left as an exercise to the reader on the implementation details. In this post, I’m going to provide a concrete example of custom post filtering for the case of filtering documents based on access control lists.

Recap of Solr’s filtering and caching

First let’s review Solr’s filtering and caching capabilities. Queries to Solr involve a full-text, relevancy-scored query (the infamous q parameter). As users navigate, they browse into facets, and the search application generates filter query (fq) parameters for faceted navigation (e.g. fq=color:red, as in the article referenced above). Filter queries are not involved in document scoring; they serve only to reduce the search space. Solr sports a filter cache that caches the document set of each unique filter query. These document sets are generated in advance, cached, and used to reduce the documents considered by the main query. Caching can be turned off on a per-filter basis. When filters are not cached, they are used in parallel with the main query to “leap frog” to documents for consideration, and a cost can be associated with each filter to prioritize the leap-frogging (smallest set first minimizes the documents considered for matching).
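To make the caching idea concrete, here is a minimal toy sketch in plain Java. It is not Solr's implementation: the class name ToyFilterCache, the bits() helper, and the sample document ids are all assumptions made for illustration. It only shows the principle that each unique fq string maps to one cached document set, which is then intersected with the main query's matches.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Toy model of a filter cache: one cached document set per unique fq string. */
class ToyFilterCache {
  private final Map<String, BitSet> cache = new HashMap<>();
  private int computations = 0; // how many times a filter set had to be built

  /** Returns the cached set for fq, building it with the given function on a miss. */
  BitSet docSet(String fq, Function<String, BitSet> compute) {
    return cache.computeIfAbsent(fq, key -> {
      computations++;
      return compute.apply(key);
    });
  }

  int computations() {
    return computations;
  }

  /** Narrows the main query's matches to documents present in every filter set. */
  static BitSet applyFilters(BitSet queryMatches, BitSet... filterSets) {
    BitSet result = (BitSet) queryMatches.clone();
    for (BitSet f : filterSets) {
      result.and(f);
    }
    return result;
  }

  /** Convenience helper to build a document set from ids. */
  static BitSet bits(int... docs) {
    BitSet b = new BitSet();
    for (int d : docs) {
      b.set(d);
    }
    return b;
  }

  public static void main(String[] args) {
    ToyFilterCache cache = new ToyFilterCache();
    // Hypothetical index in which docs 1, 3 and 5 are red
    Function<String, BitSet> red = fq -> bits(1, 3, 5);
    BitSet first = cache.docSet("color:red", red);
    cache.docSet("color:red", red); // second lookup is served from the cache
    BitSet matches = applyFilters(bits(1, 2, 3, 4), first);
    System.out.println(matches + " computations=" + cache.computations());
    // prints "{1, 3} computations=1"
  }
}
```

The point of the sketch is the cost model: the filter set is built once per unique fq, so a filter that is expensive to build but reused often is a good fit for caching, while a filter that depends on each request (like the access control case below) is not.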

Post filtering

Even without caching, filter sets are generated in advance by default. In some cases it can be prohibitively expensive to generate a filter set. One example is access control filtering that needs to take the user’s query context into account to know which documents may be returned. Ideally only matching documents (those that match the query and the straightforward filters) should be evaluated for access control; it’s wasteful to evaluate documents that wouldn’t otherwise match anyway. So let’s run through an example… a contrived example for the sake of showing how Solr’s post filtering works. Here’s the design:

Documents have an “access control list” associated with them, specifying allowed and disallowed users as well as allowed and disallowed groups.
The access control list is an ordered list of allowed/disallowed users and groups. Order matters, such that the first matching rule determines access.
If no rule allowing access matches, the document is not allowed.

For example, a document could have an access control string specified as “+u:user1 +g:group1 -g:group2 +u:user2 -u:user3”. Query requests to Solr will include the user name and the user’s group membership. Given this example access control string, here’s how this contrived design should respond:

user='user1', groups=null: allowed
user='user2', groups=null: allowed
user='user1', groups=[group1]: allowed
user='user2', groups=[group2]: NOT ALLOWED
user='user3', groups=[group1]: allowed
user='user3', groups=[group2]: NOT ALLOWED
user='user3', groups=[group1, group2]: allowed

That’s to say, if user2 searches as a member of group2, he should not be allowed to find this particular document (-g:group2 precedes +u:user2 in the rules, and order matters). I know, I know, this is pretty contrived, but not wholly unrealistic given some customer work we’ve recently done.

Because these rules are dependent on order and the query request, it’s not possible to do a straightforward Lucene query to filter allowed documents. Play along with me on this example; I tried to make it sufficiently complicated to support this point. Solr has a relatively new PostFilter capability that allows this last check on filtering documents on the fly. It takes some know-how to implement a PostFilter appropriately, so the code example here will be a nice starting point for your own custom post filtering. The way a PostFilter gets leveraged is through a Solr QParserPlugin. Here’s my custom AccessControlQParserPlugin:

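// Imports for the Lucene/Solr 3.x codebase
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.search.Query;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.QParser;
import org.apache.solr.search.QParserPlugin;

public class AccessControlQParserPlugin extends QParserPlugin {
  public static final String NAME = "acl";

  @Override
  public void init(NamedList args) {
  }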

  @Override
  public QParser createParser(String qstr, SolrParams localParams,
                              SolrParams params, SolrQueryRequest req) {
    return new QParser(qstr, localParams, params, req) {

      @Override
      public Query parse() throws ParseException {
        return new AccessControlQuery(localParams.get("user"), localParams.get("groups"));
      }
    };
  }
}

And then this is wired into solrconfig.xml as follows:

<queryParser name="acl" class="AccessControlQParserPlugin"/>

All of that is just necessary glue in order to hook in a PostFilter implementation. Here’s my example implementation:

/**
 * Note that this Query implementation can _only_ be used as an fq, not as a q (it would need to implement createWeight).
 */
class AccessControlQuery extends ExtendedQueryBase implements PostFilter {

  private String user;
  private String[] groups;

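  public AccessControlQuery(String user, String groups) {
    this.user = user;
    // the groups local param is optional; avoid a NullPointerException when it is absent
    this.groups = groups == null ? null : groups.split(",");
  }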

  public static boolean isAllowed(String acl, String user, String[] groups) {
    // acl is in the form of a series of whitespace separated [+|-][u|g]:name
    // allowed is determined by any explicit user or group mentions, plus or minus
    // order matters
    // if nothing matches, it is not allowed

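    // documents without an acl value can never be allowed
    if (acl == null || acl.trim().isEmpty()) return false;
    if (user == null && groups == null) return false;

    // split on any run of whitespace so stray extra spaces do not produce empty rules
    String[] permissions = acl.trim().split("\\s+");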

    for(String p : permissions) {
      boolean allowed = p.charAt(0) == '+';
      String name = p.substring(3);
      if (p.charAt(1) == 'u') { // user
        if (user != null && user.equals(name)) return allowed;
      } else { // group
        if (groups != null) {
          for (String g : groups) {
             if (g.equals(name)) return allowed;
          }
        }
      }
    }

    return false;
  }

  @Override
  public boolean getCache() {
    return false;  // never cache
  }

  @Override
  public int getCost() {
    return Math.max(super.getCost(), 100);  // never return less than 100 since we only support post filtering
  }

  public DelegatingCollector getFilterCollector(IndexSearcher searcher) {
    return new DelegatingCollector() {
      String[] acls;

      @Override
      public void collect(int doc) throws IOException {
        if (isAllowed(acls[doc], user, groups)) super.collect(doc);
      }

      @Override
      public void setNextReader(IndexReader reader, int docBase) throws IOException {
        acls = FieldCache.DEFAULT.getStrings(reader, "acl");  
        super.setNextReader(reader, docBase);
      }
    };
  }

  // For Solr 4.0, replace getFilterCollector with this one, adjusting
  //public DelegatingCollector getFilterCollector(IndexSearcher searcher) {
  //  return new DelegatingCollector() {
  //    FieldCache.DocTerms acls;

  //    @Override
  //    public void collect(int doc) throws IOException {
  //      final BytesRef br = new BytesRef();
  //      if (isAllowed(acls.getTerm(doc, br).utf8ToString(), user, groups)) super.collect(doc);
  //    }
  //
  //    @Override
  //    public void setNextReader(AtomicReaderContext context) throws IOException {
  //      acls = FieldCache.DEFAULT.getTerms(context.reader(), "acl");   // may be better to use the StringIndex version
  //      super.setNextReader(context);
  //    }
  //
  //
  //  };
  //}

  // NOTE: it is very important to implement proper equals and hashCode methods for this class, as it is used with
  // *result* caching (not filter caching, which is explicitly disabled here).

  @Override
  public String toString() {
    return "AccessControlQuery{" +
        "user='" + user + '\'' +
        ", groups=" + (groups == null ? null : Arrays.asList(groups)) +
        '}';
  }

  @Override
  public boolean equals(Object o) {
    if (this == o) return true;
    if (o == null || getClass() != o.getClass()) return false;
    if (!super.equals(o)) return false;

    AccessControlQuery that = (AccessControlQuery) o;

    if (!Arrays.equals(groups, that.groups)) return false;
    if (user != null ? !user.equals(that.user) : that.user != null) return false;

    return true;
  }

  @Override
  public int hashCode() {
    int result = super.hashCode();
    result = 31 * result + (user != null ? user.hashCode() : 0);
    result = 31 * result + (groups != null ? Arrays.hashCode(groups) : 0);
    return result;
  }

  public static void main(String[] args) {
    String acl = "+u:user1 +g:group1 -g:group2 +u:user2 -u:user3";

    System.out.println("acl = " + acl);

    test(acl, "user1", null);
    test(acl, "user2", null);
    test(acl, "user1", new String[] {"group1"});
    test(acl, "user2", new String[] {"group2"});
    test(acl, "user3", new String[] {"group1"});
    test(acl, "user3", new String[] {"group2"});
    test(acl, "user3", new String[] {"group1","group2"});
  }

  private static void test(String acl, String user, String[] groups) {
    System.out.println("user='" + user + '\'' +
        ", groups=" + (groups == null ? null : Arrays.asList(groups)) +
        ": " + (isAllowed(acl, user, groups) ? "allowed" : "NOT ALLOWED"));
  }
}

The main() method was used to generate the above rule processing results. A few notes to emphasize from this code:

This implementation can only be used as a filter query (fq) parameter, not a q parameter.
hashCode/equals are very important to get right, otherwise unexpected/incorrect results can occur.
Caching is explicitly disabled, so no need to set cache=false.
Solr has logic that only kicks in PostFilters when the cost is >= 100; that’s why the getCost method is the way it is.
The custom filtering logic is all within the single isAllowed() method.
This example was built using the Lucene/Solr 3.x codebase. Some slight adjustments are necessary to tweak this for the 4.x codebase.
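The note about hashCode/equals is easier to appreciate with a toy result cache. The sketch below is plain Java and only a stand-in for how a cache keyed by query objects behaves; the names ResultCacheKeyDemo, GoodKey, and BadKey are hypothetical and nothing here is Solr code. It shows how a key with sloppy equality can hand one user's cached results to another user.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

class ResultCacheKeyDemo {
  /** Stand-in for a filter query object with value-based equality on user and groups. */
  static final class GoodKey {
    final String user;
    final String[] groups;

    GoodKey(String user, String... groups) {
      this.user = user;
      this.groups = groups;
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof GoodKey)) return false;
      GoodKey that = (GoodKey) o;
      return Objects.equals(user, that.user) && Arrays.equals(groups, that.groups);
    }

    @Override
    public int hashCode() {
      return 31 * Objects.hashCode(user) + Arrays.hashCode(groups);
    }
  }

  /** Broken variant: equality ignores the user, so every instance looks the same. */
  static final class BadKey {
    final String user;

    BadKey(String user) {
      this.user = user;
    }

    @Override
    public boolean equals(Object o) {
      return o instanceof BadKey;
    }

    @Override
    public int hashCode() {
      return 1;
    }
  }

  public static void main(String[] args) {
    Map<Object, List<Integer>> resultCache = new HashMap<>();

    resultCache.put(new GoodKey("bob", "hr"), Arrays.asList(1, 3, 4, 5, 7, 10));
    System.out.println(resultCache.get(new GoodKey("bob", "hr")));   // [1, 3, 4, 5, 7, 10]
    System.out.println(resultCache.get(new GoodKey("alice", "hr"))); // null (cache miss)

    resultCache.put(new BadKey("bob"), Arrays.asList(1));
    System.out.println(resultCache.get(new BadKey("alice")));        // [1] (bob's ids leak)
  }
}
```

For an access control filter, that last line is the failure mode to worry about: a wrong cache hit is not just a stale result, it is a security leak.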

In this implementation, the access control rules are entirely specified on each document, in the acl field. In order to efficiently filter by these rules at query time, Lucene’s FieldCache is used. There is an upfront cost in time and RAM to build the FieldCache data structure, which is what makes it rapid to access at query time. Whenever FieldCache is used (sorting, some faceting implementations, function queries, and this custom query parser), it is wise to put in appropriate warming queries so the FieldCache entries are built at commit time rather than making end users wait longer at query time.
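Stripped of the Lucene types, the pattern is a collector that looks up a pre-loaded per-document value and forwards the document to its delegate only when the value passes a check. The following is a minimal plain-Java sketch of that pattern under stated assumptions: the Collector interface, FilteringCollector, and the sample values are hypothetical stand-ins, and the array plays the role of the FieldCache data (index = document id).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

class ToyPostFilter {
  /** Minimal stand-in for a collector: receives each matching doc id. */
  interface Collector {
    void collect(int doc);
  }

  /** Passes a doc to the delegate only if its pre-loaded field value is accepted. */
  static final class FilteringCollector implements Collector {
    private final String[] valuesByDoc; // pre-loaded values, index = doc id
    private final Predicate<String> allowed;
    private final Collector delegate;
    int evaluated = 0;                  // how many docs reached this collector

    FilteringCollector(String[] valuesByDoc, Predicate<String> allowed, Collector delegate) {
      this.valuesByDoc = valuesByDoc;
      this.allowed = allowed;
      this.delegate = delegate;
    }

    @Override
    public void collect(int doc) {
      evaluated++;
      String value = valuesByDoc[doc];  // array lookup, no document fetch
      if (value != null && allowed.test(value)) {
        delegate.collect(doc);
      }
    }
  }

  public static void main(String[] args) {
    String[] acls = {"+u:bob", "+g:hr", null, "+g:sales"};
    List<Integer> hits = new ArrayList<>();
    // Deliberately simplistic check, standing in for the real rule evaluation
    FilteringCollector collector =
        new FilteringCollector(acls, acl -> acl.contains("g:hr"), doc -> hits.add(doc));

    // Only docs that already matched the query and ordinary filters arrive here
    for (int doc : new int[] {1, 2, 3}) {
      collector.collect(doc);
    }
    System.out.println(hits + " evaluated=" + collector.evaluated);
    // prints "[1] evaluated=3"
  }
}
```

Document 0 is never evaluated in this run because it never reached the collector, which is the whole appeal of post filtering: the expensive check runs only on documents that survived everything else.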

So, with all that implementation behind us, here’s how we finally use it: index some documents, make queries that filter using the “acl” query parser. Here are the documents, in CSV format:

id,acl
1,+u:bob
2,-g:sales +g:engineering
3,+g:hr -g:engineering
4,-u:alice +g:hr
5,+g:hr -u:alice
6,+g:sales +g:engineering -u:bob
7,+g:hr -u:alice +g:sales
8,+g:sales
9,+g:engineering
10,+g:hr

This was indexed using Solr’s example post.jar:

java -Dtype=text/csv -Durl=http://localhost:8983/solr/update/csv -jar post.jar example_docs.csv

where the acl field is defined as <field name="acl" type="string" indexed="true" stored="true" multiValued="false"/>.

To make it easy to present, a quick and dirty Velocity template, ids.vm, was added to the conf/velocity directory:

Matching ids:
#if($page.results_found > 0)
  #foreach($doc in $response.results)
    $doc.id
  #end
#else
  None
#end

And finally let’s see the results, using the base request of http://localhost:8983/solr/select?q=*:*&wt=velocity&v.template=ids, which by itself yields “Matching ids: 1 2 3 4 5 6 7 8 9 10”. Appending an fq parameter using the syntax &fq={!acl user='username' groups='group1,group2'} applies the security filter. Here are several variations in user and groups and the results:

&fq={!acl user='alice' groups=''}: Matching ids: None

&fq={!acl user='bob' groups=''}: Matching ids: 1

&fq={!acl user='alice' groups='hr'}: Matching ids: 3 5 7 10

&fq={!acl user='alice' groups='hr,sales'}: Matching ids: 3 5 6 7 8 10

&fq={!acl user='alice' groups='hr,sales,engineering'}: Matching ids: 3 5 6 7 8 9 10

&fq={!acl user='bob' groups='hr'}: Matching ids: 1 3 4 5 7 10
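These results can be cross-checked without a running Solr instance. The sketch below is a standalone Java program (the class name AclExampleCheck and the matchingIds helper are made up for this check) that applies the same first-match-wins rule to the ten example documents. The groups argument is split on commas, mirroring how the groups local param is handled.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class AclExampleCheck {
  // The ten example documents: id -> acl
  static final Map<Integer, String> DOCS = new LinkedHashMap<>();
  static {
    DOCS.put(1, "+u:bob");
    DOCS.put(2, "-g:sales +g:engineering");
    DOCS.put(3, "+g:hr -g:engineering");
    DOCS.put(4, "-u:alice +g:hr");
    DOCS.put(5, "+g:hr -u:alice");
    DOCS.put(6, "+g:sales +g:engineering -u:bob");
    DOCS.put(7, "+g:hr -u:alice +g:sales");
    DOCS.put(8, "+g:sales");
    DOCS.put(9, "+g:engineering");
    DOCS.put(10, "+g:hr");
  }

  // First matching rule wins; no match means not allowed
  static boolean isAllowed(String acl, String user, String[] groups) {
    for (String p : acl.split(" ")) {
      boolean allowed = p.charAt(0) == '+';
      String name = p.substring(3);
      if (p.charAt(1) == 'u') {
        if (name.equals(user)) return allowed;
      } else {
        for (String g : groups) {
          if (g.equals(name)) return allowed;
        }
      }
    }
    return false;
  }

  // groupsParam mirrors the comma-separated groups local param
  static List<Integer> matchingIds(String user, String groupsParam) {
    String[] groups = groupsParam.split(",");
    List<Integer> ids = new ArrayList<>();
    for (Map.Entry<Integer, String> doc : DOCS.entrySet()) {
      if (isAllowed(doc.getValue(), user, groups)) ids.add(doc.getKey());
    }
    return ids;
  }

  public static void main(String[] args) {
    System.out.println(matchingIds("alice", ""));                     // []
    System.out.println(matchingIds("bob", ""));                       // [1]
    System.out.println(matchingIds("alice", "hr"));                   // [3, 5, 7, 10]
    System.out.println(matchingIds("alice", "hr,sales"));             // [3, 5, 6, 7, 8, 10]
    System.out.println(matchingIds("alice", "hr,sales,engineering")); // [3, 5, 6, 7, 8, 9, 10]
    System.out.println(matchingIds("bob", "hr"));                     // [1, 3, 4, 5, 7, 10]
  }
}
```

Document 2 is a useful one to trace by hand: even with engineering in her groups, alice never sees it once sales is also present, because -g:sales comes first in its rule list.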

Fine print

It’s important to note that PostFilter is a last resort for implementing document filtering. Don’t make the solution more complicated than it needs to be. More often than not, even access control filtering can be implemented using plain ol’ search techniques, by indexing allowed users and groups onto documents and using the lucene (or another) query parser to do the trick. Only when the rules are too complicated, or external information is needed, does a custom PostFilter make sense. Performance is key here: the internal #collect() method is called for every matching document. A *:* query was used in this example, causing every document in the index to be post-filter evaluated, which may be prohibitive on a large index. Your application may therefore need to require a narrowing query or another filter constraint before kicking in a PostFilter. What happens in #collect needs to be highly optimized. DO NOT, I repeat, DO NOT fetch the document from the Lucene index in that method (I won’t even mention what methods to steer clear of)! If you need to get at field data, have it on single-valued indexed fields and use the FieldCache (and eventually the 4.x doc-values feature will be handy here as well). And if you use FieldCache, add a representative query using your PostFilter to your warming queries in solrconfig.xml.
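For the simpler case, where documents carry only allow lists and order does not matter, the filter can be an ordinary query string. Here is a minimal sketch of building one in Java. The field names allowed_users and allowed_groups are assumptions (they would have to exist in your schema as multi-valued string fields), and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SimpleAclFilter {
  /** Wraps a term in double quotes, escaping backslashes and embedded quotes. */
  static String quote(String term) {
    return "\"" + term.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
  }

  /**
   * Builds a filter query matching documents that list the user or any of the groups.
   * allowed_users and allowed_groups are hypothetical field names.
   */
  static String buildFq(String user, List<String> groups) {
    List<String> clauses = new ArrayList<>();
    if (user != null) {
      clauses.add("allowed_users:" + quote(user));
    }
    for (String g : groups) {
      clauses.add("allowed_groups:" + quote(g));
    }
    if (clauses.isEmpty()) {
      // Refuse to build an unrestricted filter for an anonymous request
      throw new IllegalArgumentException("a user or at least one group is required");
    }
    return String.join(" OR ", clauses);
  }

  public static void main(String[] args) {
    System.out.println(buildFq("alice", Arrays.asList("hr", "sales")));
    // prints: allowed_users:"alice" OR allowed_groups:"hr" OR allowed_groups:"sales"
  }
}
```

A filter like this goes through the normal fq path, so it can be cached in the filter cache per user and group combination. It cannot express the ordered allow/deny rules from the contrived example, which is exactly the situation where a PostFilter earns its keep.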
