How Solr’s Post Filter Works
May 13, 2015: A code update was made for Solr 5.x – Full details in a new blog post.
Dec. 6, 2012: A code update was made for Solr 4.0 (see commented section in AccessControlQParserPlugin.java below)
Yonik recently wrote about "Advanced Filter Caching in Solr", where he talked about expensive and custom filters; the implementation details were left as an exercise for the reader. In this post, I’m going to provide a concrete example of custom post filtering for the case of filtering documents based on access control lists.
Recap of Solr’s filtering and caching
First, let’s review Solr’s filtering and caching capabilities. Queries to Solr involve a full-text, relevancy-scored query (the infamous q parameter). As users navigate, they will browse into facets. The search application generates filter query (fq) parameters for faceted navigation (e.g. fq=color:red, as in the article referenced above). The filter queries are not involved in document scoring, serving only to reduce the search space. Solr sports a filter cache, caching the document sets of each unique filter query. These document sets are generated in advance, cached, and used to reduce the documents considered by the main query. Caching can be turned off on a per-filter basis; when filters are not cached, they are used in parallel with the main query to "leapfrog" to documents for consideration, and a cost can be associated with each filter in order to prioritize the leapfrogging (smallest set first would minimize the documents being considered for matching).
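As a rough illustration of the per-filter controls described above, here is a small Java sketch that assembles fq values carrying the cache and cost local parameters. The class and method names (FilterParams, uncachedFilter) are made up for this example, and the example field names are placeholders; only the {!cache=false cost=N} local-parameter syntax comes from Solr.

```java
final class FilterParams {
    /**
     * Builds an fq value that opts out of the filter cache and carries a
     * relative cost. Hypothetical helper, not part of Solr or SolrJ.
     */
    public static String uncachedFilter(String query, int cost) {
        return "{!cache=false cost=" + cost + "}" + query;
    }

    public static void main(String[] args) {
        // A cached filter needs no local params at all.
        System.out.println("fq=color:red");
        // Uncached filters: lower cost values are meant to be consulted first.
        System.out.println("fq=" + uncachedFilter("inStock:true", 5));
        System.out.println("fq=" + uncachedFilter("price:[0 TO 100]", 50));
    }
}
```

In a real request the fq value would still need to be URL-encoded before being appended to the query string.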
Post filtering
Even without caching, filter sets are generated in advance by default. In some cases it can be prohibitively expensive to generate a filter set. One example is access control filtering that needs to take the user’s query context into account in order to know which documents are allowed to be returned. Ideally, only matching documents (those that match the query and the straightforward filters) should be evaluated for security access control. It’s wasteful to evaluate documents that wouldn’t otherwise match anyway. So let’s run through an example… a contrived example for the sake of showing how Solr’s post filtering works. Here’s the design:
- Documents have an "access control list" associated with them, specifying allowed and disallowed users as well as allowed and disallowed groups.
- The access control list is an ordered list of allowed/disallowed users and groups. Order matters, such that the first matching rule determines access.
- If no rule matches, the document is not allowed.
For example, a document could have an access control string specified as "+u:user1 +g:group1 -g:group2 +u:user2 -u:user3". Query requests to Solr will include the user name and the user’s group membership. Given this example access control string, here’s how this contrived design should respond:
user='user1', groups=null: allowed
user='user2', groups=null: allowed
user='user1', groups=[group1]: allowed
user='user2', groups=[group2]: NOT ALLOWED
user='user3', groups=[group1]: allowed
user='user3', groups=[group2]: NOT ALLOWED
user='user3', groups=[group1, group2]: allowed
That’s to say, if user2 searches as a member of group2, he should not be allowed to find this particular document (-g:group2 precedes +u:user2 in the rules, and order matters). I know, I know, this is pretty contrived, but not wholly unrealistic given some customer work we’ve recently done.
Because these rules depend on order and on the query request, it’s not possible to filter allowed documents with a straightforward Lucene query. Play along with me on this example; I tried to make it sufficiently complicated to support this point. Solr has a relatively new PostFilter capability that allows this last check to filter documents on the fly. It takes some know-how to implement a PostFilter appropriately, so the code example here will be a nice starting point for your own custom post filtering. A PostFilter is hooked in through a Solr QParserPlugin. Here’s my custom AccessControlQParserPlugin:
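// Imports use the Lucene/Solr 3.x package names.
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.search.Query;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.QParser;
import org.apache.solr.search.QParserPlugin;

public class AccessControlQParserPlugin extends QParserPlugin {
  public static final String NAME = "acl";

  public void init(NamedList args) {
    // no configuration needed
  }

  @Override
  public QParser createParser(String qstr, SolrParams localParams,
                              SolrParams params, SolrQueryRequest req) {
    return new QParser(qstr, localParams, params, req) {
      @Override
      public Query parse() throws ParseException {
        // user and groups arrive as local params: {!acl user='...' groups='...'}
        return new AccessControlQuery(localParams.get("user"), localParams.get("groups"));
      }
    };
  }
}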
And then this is wired into solrconfig.xml as follows:
<queryParser name="acl" class="AccessControlQParserPlugin"/>
All of that is just necessary glue in order to hook in a PostFilter implementation. Here’s my example implementation:
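// Additional imports for this class (place them at the top of the file), Lucene/Solr 3.x:
import java.io.IOException;
import java.util.Arrays;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.IndexSearcher;
import org.apache.solr.search.DelegatingCollector;
import org.apache.solr.search.ExtendedQueryBase;
import org.apache.solr.search.PostFilter;

/**
 * Note that this Query implementation can _only_ be used as an fq, not as a q (it would need to implement createWeight).
 */
class AccessControlQuery extends ExtendedQueryBase implements PostFilter {
  private String user;
  private String[] groups;

  public AccessControlQuery(String user, String groups) {
    this.user = user;
    // the groups local param is optional; isAllowed() copes with a null array
    this.groups = (groups == null) ? null : groups.split(",");
  }

  public static boolean isAllowed(String acl, String user, String[] groups) {
    // acl is in the form of a series of whitespace separated [+|-][u|g]:name
    // allowed is determined by any explicit user or group mentions, plus or minus
    // order matters
    // if nothing matches, it is not allowed
    if (acl == null) return false; // document has no acl value, so nothing can match
    if (user == null && groups == null) return false;
    String[] permissions = acl.split(" ");
    for (String p : permissions) {
      boolean allowed = p.charAt(0) == '+';
      String name = p.substring(3); // skip the "+u:" / "-g:" style prefix
      if (p.charAt(1) == 'u') { // user
        if (user != null && user.equals(name)) return allowed;
      } else { // group
        if (groups != null) {
          for (String g : groups) {
            if (g.equals(name)) return allowed;
          }
        }
      }
    }
    return false;
  }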

  @Override
  public boolean getCache() {
    return false; // never cache
  }

  @Override
  public int getCost() {
    return Math.max(super.getCost(), 100); // never return less than 100 since we only support post filtering
  }

  public DelegatingCollector getFilterCollector(IndexSearcher searcher) {
    return new DelegatingCollector() {
      String[] acls;

      @Override
      public void collect(int doc) throws IOException {
        if (isAllowed(acls[doc], user, groups)) super.collect(doc);
      }

      @Override
      public void setNextReader(IndexReader reader, int docBase) throws IOException {
        acls = FieldCache.DEFAULT.getStrings(reader, "acl");
        super.setNextReader(reader, docBase);
      }
    };
  }

  // For Solr 4.0, replace getFilterCollector with this one, adjusting
  //public DelegatingCollector getFilterCollector(IndexSearcher searcher) {
  //  return new DelegatingCollector() {
  //    FieldCache.DocTerms acls;
  //
  //    @Override
  //    public void collect(int doc) throws IOException {
  //      final BytesRef br = new BytesRef();
  //      if (isAllowed(acls.getTerm(doc, br).utf8ToString(), user, groups)) super.collect(doc);
  //    }
  //
  //    @Override
  //    public void setNextReader(AtomicReaderContext context) throws IOException {
  //      acls = FieldCache.DEFAULT.getTerms(context.reader(), "acl"); // may be better to use the StringIndex version
  //      super.setNextReader(context);
  //    }
  //  };
  //}

  // NOTE: it is very important to implement proper equals and hashCode methods for this class, as it is used with
  // *result* caching (not filter caching, which is explicitly disabled here).
  @Override
  public String toString() {
    return "AccessControlQuery{" +
        "user='" + user + '\'' +
        ", groups=" + (groups == null ? null : Arrays.asList(groups)) +
        '}';
  }

  @Override
  public boolean equals(Object o) {
    if (this == o) return true;
    if (o == null || getClass() != o.getClass()) return false;
    if (!super.equals(o)) return false;
    AccessControlQuery that = (AccessControlQuery) o;
    if (!Arrays.equals(groups, that.groups)) return false;
    if (user != null ? !user.equals(that.user) : that.user != null) return false;
    return true;
  }

  @Override
  public int hashCode() {
    int result = super.hashCode();
    result = 31 * result + (user != null ? user.hashCode() : 0);
    result = 31 * result + (groups != null ? Arrays.hashCode(groups) : 0);
    return result;
  }

  public static void main(String[] args) {
    String acl = "+u:user1 +g:group1 -g:group2 +u:user2 -u:user3";
    System.out.println("acl = " + acl);
    test(acl, "user1", null);
    test(acl, "user2", null);
    test(acl, "user1", new String[] {"group1"});
    test(acl, "user2", new String[] {"group2"});
    test(acl, "user3", new String[] {"group1"});
    test(acl, "user3", new String[] {"group2"});
    test(acl, "user3", new String[] {"group1","group2"});
  }

  private static void test(String acl, String user, String[] groups) {
    System.out.println("user='" + user + '\'' +
        ", groups=" + (groups == null ? null : Arrays.asList(groups)) +
        ": " + (isAllowed(acl, user, groups) ? "allowed" : "NOT ALLOWED"));
  }
}
The main() method was used to generate the above rule processing results. A few notes to emphasize from this code:
- This implementation can only be used as a filter query (fq) parameter, not a q parameter.
- hashCode/equals are very important to get right, otherwise unexpected/incorrect results can occur.
- Caching is explicitly disabled, so no need to set cache=false.
- Solr only applies a filter as a PostFilter when its cost is >= 100, which is why getCost() never returns less than 100.
- The custom filtering logic is all within the single isAllowed() method.
- This example was built using the Lucene/Solr 3.x codebase. Some slight adjustments are necessary to tweak this for the 4.x codebase.
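To make the caching and cost notes above concrete, here is a simplified Java model of how a filter ends up being applied. It is only a sketch of the rule as described in this post, not Solr’s actual code, and every name in it (FilterRouting, FilterSpec, Route) is hypothetical.

```java
final class FilterRouting {
    enum Route { FILTER_CACHE, LEAPFROG, POST_FILTER }

    /** Hypothetical stand-in for a parsed fq; this is not a Solr type. */
    static final class FilterSpec {
        final boolean cache;                // what getCache() would return
        final int cost;                     // what getCost() would return
        final boolean implementsPostFilter; // whether the query implements PostFilter

        FilterSpec(boolean cache, int cost, boolean implementsPostFilter) {
            this.cache = cache;
            this.cost = cost;
            this.implementsPostFilter = implementsPostFilter;
        }
    }

    /** Models the routing rule from the notes above. */
    public static Route route(FilterSpec f) {
        if (f.cache) return Route.FILTER_CACHE;
        if (f.cost >= 100 && f.implementsPostFilter) return Route.POST_FILTER;
        return Route.LEAPFROG; // uncached, evaluated alongside the main query
    }

    public static void main(String[] args) {
        System.out.println(route(new FilterSpec(true, 0, false)));   // ordinary cached fq
        System.out.println(route(new FilterSpec(false, 50, true)));  // cost too low for post filtering
        System.out.println(route(new FilterSpec(false, 100, true))); // what AccessControlQuery reports
    }
}
```

Under this model, clamping the cost to at least 100 and always returning false from getCache() is what keeps the access control query on the post-filter path regardless of the local params a caller supplies.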
In this implementation, the access control rules are entirely specified on each document, in the acl field. In order to efficiently filter by these rules at query time, Lucene’s FieldCache is used. Building the FieldCache data structure has an upfront cost in time and RAM, but it makes the values fast to access at query time. Whenever FieldCache is used (sorting, some faceting implementations, function queries, and this custom query parser), it is wise to configure appropriate warming queries so that the FieldCache entries are built at commit time rather than making end users wait longer at query time.
So, with all that implementation behind us, here’s how we finally use it: index some documents, then make queries that filter using the "acl" query parser. Here are the documents, in CSV format:
id,acl
1,+u:bob
2,-g:sales +g:engineering
3,+g:hr -g:engineering
4,-u:alice +g:hr
5,+g:hr -u:alice
6,+g:sales +g:engineering -u:bob
7,+g:hr -u:alice +g:sales
8,+g:sales
9,+g:engineering
10,+g:hr
This was indexed using Solr’s example post.jar:
java -Dtype=text/csv -Durl=http://localhost:8983/solr/update/csv -jar post.jar example_docs.csv
where the acl field is defined as <field name="acl" type="string" indexed="true" stored="true" multiValued="false"/>.
To make it easy to present, a quick and dirty Velocity template, ids.vm, was added to the conf/velocity directory:
Matching ids:
#if($page.results_found > 0)
  #foreach($doc in $response.results)
    $doc.id
  #end
#else
  None
#end
And finally let’s see the results, using the base request of http://localhost:8983/solr/select?q=*:*&wt=velocity&v.template=ids, which by itself yields "Matching ids: 1 2 3 4 5 6 7 8 9 10". Appending an fq parameter using the syntax &fq={!acl user='username' groups='group1,group2'} applies the security filter. Here are several variations in user and groups and the results:
&fq={!acl user='alice' groups=''}: Matching ids: None
&fq={!acl user='bob' groups=''}: Matching ids: 1
&fq={!acl user='alice' groups='hr'}: Matching ids: 3 5 7 10
&fq={!acl user='alice' groups='hr,sales'}: Matching ids: 3 5 6 7 8 10
&fq={!acl user='alice' groups='hr,sales,engineering'}: Matching ids: 3 5 6 7 8 9 10
&fq={!acl user='bob' groups='hr'}: Matching ids: 1 3 4 5 7 10
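The listed ids can also be checked without a running Solr. The sketch below copies the first-match-wins rule from isAllowed() and applies it to the ten example ACL strings; the AclResultsCheck class and its matchingIds helper are hypothetical names introduced only for this check.

```java
import java.util.ArrayList;
import java.util.List;

final class AclResultsCheck {
    // ACL strings of the example documents; index i holds the document with id i + 1.
    static final String[] ACLS = {
        "+u:bob", "-g:sales +g:engineering", "+g:hr -g:engineering",
        "-u:alice +g:hr", "+g:hr -u:alice", "+g:sales +g:engineering -u:bob",
        "+g:hr -u:alice +g:sales", "+g:sales", "+g:engineering", "+g:hr"
    };

    // Same first-match-wins rule as isAllowed() in the post.
    static boolean isAllowed(String acl, String user, String[] groups) {
        for (String p : acl.split(" ")) {
            boolean allowed = p.charAt(0) == '+';
            String name = p.substring(3);
            if (p.charAt(1) == 'u') {
                if (user != null && user.equals(name)) return allowed;
            } else if (groups != null) {
                for (String g : groups) {
                    if (g.equals(name)) return allowed;
                }
            }
        }
        return false;
    }

    /** Returns the ids a user would see; groups is comma separated, as in the fq local param. */
    public static List<Integer> matchingIds(String user, String groups) {
        String[] groupArray = groups.split(",");
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < ACLS.length; i++) {
            if (isAllowed(ACLS[i], user, groupArray)) ids.add(i + 1);
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println("alice / hr: " + matchingIds("alice", "hr"));
        System.out.println("bob / hr: " + matchingIds("bob", "hr"));
    }
}
```

Document 4 shows why order matters: alice is rejected by its leading -u:alice rule even though she belongs to hr, while bob in hr falls through to +g:hr and is allowed.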
Fine print
It’s important to note that PostFilter is a last resort for implementing document filtering. Don’t make the solution more complicated than it needs to be. More often than not, even access control filtering can be implemented using plain ol’ search techniques, by indexing allowed users and groups onto documents and using the lucene (or another) query parser to do the trick. Only when the rules are too complicated, or external information is needed, does a custom PostFilter make sense.
Performance is key here: the internal #collect() method will be called for every matching document. A *:* query was used in this example, causing every document in the index to be post-filter evaluated, which may be prohibitive on a large index. Your application may therefore need to require a narrowing query or another filter constraint before kicking in a PostFilter. What happens in #collect needs to be highly optimized. DO NOT, I repeat, DO NOT fetch the document from the Lucene index in that method (I won’t even mention what methods to steer clear of)! If you need to get at field data, have it on single-valued indexed fields and use the FieldCache (and eventually the 4.x doc-values feature will be handy here as well). And if you use FieldCache, add a representative query using your PostFilter to your warming queries in solrconfig.xml.
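For comparison, the "plain ol’ search techniques" route can be as small as building an ordinary filter query. The sketch below assumes allowed users and groups are indexed in two fields named allowed_users and allowed_groups; those field names and the SimpleAclFilter helper are hypothetical, and the values are assumed to be simple tokens that need no query-syntax escaping.

```java
final class SimpleAclFilter {
    /**
     * Builds a standard (cacheable) filter query matching documents that list
     * the user or any of the user's groups. Field names are assumptions.
     */
    public static String filterFor(String user, String... groups) {
        StringBuilder fq = new StringBuilder("allowed_users:").append(user);
        if (groups.length > 0) {
            fq.append(" OR allowed_groups:(")
              .append(String.join(" OR ", groups))
              .append(")");
        }
        return fq.toString();
    }

    public static void main(String[] args) {
        System.out.println("fq=" + filterFor("alice", "hr", "sales"));
        System.out.println("fq=" + filterFor("bob"));
    }
}
```

A filter like this cannot express ordered allow/deny rules, which is exactly the case where the PostFilter approach earns its complexity.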