Solr sparse faceting

06/02/2015 - 14:30 to 15:10

Stage 1

long talk (40 min)

Advanced

Session abstract:

netarchive.dk maintains a historical archive of Danish net resources. We are indexing its 500TB of raw data into Solr. One of the requirements is to provide faceting on several fields, the largest having billions of unique String values. Stock Solr cannot do that with satisfactory performance on our hardware. Inspecting Solr's core faceting code has led to multiple performance improvements for high-cardinality faceting:

Less memory overhead, using packed counters
Less garbage collection, reusing counters
Better performance for small result sets, using sparse counters
Better performance overall with distribution, rewriting fine-counting logic
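
To illustrate the sparse and reusable counter ideas listed above, here is a minimal Java sketch. It is not the code from the sparse faceting patch; the class name SparseCounter, its methods and the maxTracked parameter are hypothetical and chosen only for illustration. The assumption is that each unique term has an integer ordinal, and that a small side list records which counters were touched, so that small result sets can be extracted and cleared without scanning every counter.

```java
import java.util.Arrays;

/** Hypothetical sketch of a sparse facet counter; not the actual Solr sparse faceting code. */
final class SparseCounter {
    private final int[] counts;   // one counter per unique term ordinal
    private final int[] touched;  // ordinals whose counter went from 0 to 1
    private int touchedSize = 0;
    private boolean overflowed = false; // true once more ordinals were touched than we track

    SparseCounter(int uniqueTerms, int maxTracked) {
        counts = new int[uniqueTerms];
        touched = new int[maxTracked];
    }

    /** Counts one occurrence of the term with the given ordinal. */
    void increment(int ordinal) {
        if (counts[ordinal]++ == 0 && !overflowed) {
            if (touchedSize == touched.length) {
                overflowed = true; // too many distinct terms: fall back to full scans
            } else {
                touched[touchedSize++] = ordinal;
            }
        }
    }

    int get(int ordinal) {
        return counts[ordinal];
    }

    /** Returns the ordinal with the highest count, or -1 if nothing was counted. */
    int top() {
        int best = -1;
        if (overflowed) {
            // Dense path: scan every counter, as a non-sparse implementation would.
            for (int ord = 0; ord < counts.length; ord++) {
                if (counts[ord] > 0 && (best == -1 || counts[ord] > counts[best])) {
                    best = ord;
                }
            }
        } else {
            // Sparse path: visit only the counters that were actually touched.
            for (int i = 0; i < touchedSize; i++) {
                int ord = touched[i];
                if (best == -1 || counts[ord] > counts[best]) {
                    best = ord;
                }
            }
        }
        return best;
    }

    /** Resets the structure for reuse, avoiding a new allocation per request. */
    void clear() {
        if (overflowed) {
            Arrays.fill(counts, 0);
        } else {
            for (int i = 0; i < touchedSize; i++) {
                counts[touched[i]] = 0;
            }
        }
        touchedSize = 0;
        overflowed = false;
    }
}
```

In this sketch the cost of top() and clear() is proportional to the number of touched ordinals when the result set is small, and falls back to a full scan only when the tracker overflows. Packing the counters into fewer bits per entry, the first item in the list, is left out here to keep the example short.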

Performance gains relative to stock Solr vary with result size. A rule of thumb is 2x for single-shard indexes and 4x for multi-shard indexes. The principles behind the improvements will be presented, and their influence on the faceting performance curve will be discussed and visualized with data from tests and production systems.

Sparse faceting is Open Source and available at http://tokee.github.io/lucene-solr/