netarchive.dk maintains a historical archive of Danish net resources. We are indexing its 500 TB of raw data into Solr. One of the requirements is to provide faceting on several fields, the largest having billions of unique String values. Stock Solr cannot do that with satisfactory performance on our hardware. Inspection of Solr's core faceting code has led to multiple performance improvements for high-cardinality faceting:
- Less memory overhead, using packed counters
- Less garbage collection, reusing counters
- Better performance for small result sets, using sparse counters
- Better performance overall with distribution, rewriting fine-counting logic
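As a rough illustration of the sparse and reusable counter ideas listed above, the following Python sketch keeps a list of the term ordinals that were actually incremented, so that extracting results and clearing the counters costs time proportional to the number of hits rather than to the field's cardinality. It is not the Solr implementation (which is written in Java); the class name `SparseCounter`, the `max_tracked` cutoff and the fallback to a full scan are assumptions made for this example.

```python
from typing import List, Tuple


class SparseCounter:
    """Hypothetical sketch: per-ordinal counters plus a list of touched ordinals."""

    def __init__(self, cardinality: int, max_tracked: int) -> None:
        self.counts = [0] * cardinality   # one counter per unique term ordinal
        self.touched: List[int] = []      # ordinals whose counter is non-zero
        self.max_tracked = max_tracked    # assumed cutoff for sparse tracking
        self.overflowed = False           # True once tracking has been abandoned

    def increment(self, ordinal: int) -> None:
        # Only the first hit on an ordinal needs to be tracked.
        if self.counts[ordinal] == 0 and not self.overflowed:
            if len(self.touched) < self.max_tracked:
                self.touched.append(ordinal)
            else:
                # Too many distinct hits: fall back to scanning all counters.
                self.overflowed = True
                self.touched.clear()
        self.counts[ordinal] += 1

    def top(self, n: int) -> List[Tuple[int, int]]:
        # Small result sets only visit the touched ordinals.
        candidates = range(len(self.counts)) if self.overflowed else self.touched
        pairs = [(o, self.counts[o]) for o in candidates if self.counts[o] > 0]
        return sorted(pairs, key=lambda p: (-p[1], p[0]))[:n]

    def clear(self) -> None:
        # Reuse the structure between requests instead of allocating a new one.
        if self.overflowed:
            for i in range(len(self.counts)):
                self.counts[i] = 0
        else:
            for o in self.touched:
                self.counts[o] = 0
        self.touched.clear()
        self.overflowed = False


counter = SparseCounter(cardinality=1_000_000, max_tracked=1_000)
for ordinal in (42, 7, 42, 999_999, 42, 7):
    counter.increment(ordinal)
result = counter.top(2)  # the two most frequent ordinals with their counts
```

With only three distinct ordinals touched, `top` and `clear` visit three entries instead of one million, which is the intuition behind the gain for small result sets.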
Performance gains relative to stock Solr vary with result size. A rule of thumb is 2x for single-shard indexes and 4x for multi-shard indexes. The principles behind the improvements will be presented, and their influence on the faceting performance curve will be discussed and visualized with data from tests and production systems.
Sparse faceting is Open Source and available at http://tokee.github.io/lucene-solr/