Facets

Xapian can dynamically generate complete lists of the values that occur in matching documents. For example, colour, manufacturer and size values are good candidates for faceting.

This has many potential uses, but a common one is to let users narrow down their search by filtering it to only include documents with a particular value in a particular category. This is often referred to as faceted search.

Todo

string and numeric facets

Todo

grouping

Todo

selecting which facets to show

Implementation

Faceting works against information stored in document value slots and, when executed, provides a list of the unique values for that slot together with a count of the number of times each value occurs.
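To make that concrete, here is a minimal pure-Python sketch of the computation. It is not Xapian's implementation: the count_slot_values name and the sample documents are invented for illustration, and each document is modelled as a plain dict mapping slot number to value.

```python
from collections import Counter

def count_slot_values(matching_docs, slot):
    """Tally the distinct values found in one slot across matching documents."""
    counts = Counter()
    for doc in matching_docs:
        value = doc.get(slot, '')
        if value:  # in this sketch, an empty slot contributes no facet
            counts[value] += 1
    # Return (value, count) pairs in sorted value order.
    return sorted(counts.items())

# Hypothetical matching documents: slot 0 = collection, slot 1 = maker.
docs = [
    {0: 'Time Measurement', 1: 'Bain, Alexander'},
    {0: 'Time Measurement', 1: 'EXA'},
    {0: 'Telecommunications', 1: 'Bain, Alexander'},
    {0: 'Time Measurement'},
]
maker_facets = count_slot_values(docs, 1)
```

Here maker_facets lists each maker once, paired with the number of sample documents that carry it.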

Indexing

No additional work is needed to implement faceted searching, except to ensure that the values you wish to use in facets are stored as document values.

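import json

import xapian
# parse_csv_file is assumed to come from the support module that
# accompanies these examples.
from support import parse_csv_file

def index(datapath, dbpath):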
    # Create or open the database we're going to be writing to.
    db = xapian.WritableDatabase(dbpath, xapian.DB_CREATE_OR_OPEN)

    # Set up a TermGenerator that we'll use in indexing.
    termgenerator = xapian.TermGenerator()
    termgenerator.set_stemmer(xapian.Stem("en"))

    for fields in parse_csv_file(datapath):
        # 'fields' is a dictionary mapping from field name to value.
        # Pick out the fields we're going to index.
        description = fields.get('DESCRIPTION', u'')
        title = fields.get('TITLE', u'')
        identifier = fields.get('id_NUMBER', u'')
        collection = fields.get('COLLECTION', u'')
        maker = fields.get('MAKER', u'')

        # We make a document and tell the term generator to use this.
        doc = xapian.Document()
        termgenerator.set_document(doc)

        # Index each field with a suitable prefix.
        termgenerator.index_text(title, 1, 'S')
        termgenerator.index_text(description, 1, 'XD')

        # Index fields without prefixes for general search.
        termgenerator.index_text(title)
        termgenerator.increase_termpos()
        termgenerator.index_text(description)

        # Add the collection as a value in slot 0.
        doc.add_value(0, collection)

        # Add the maker as a value in slot 1.
        doc.add_value(1, maker)

        # Store all the fields for display purposes.
        doc.set_data(json.dumps(fields))

        # We use the identifier to ensure each object ends up in the
        # database only once no matter how many times we run the
        # indexer.
        idterm = u"Q" + identifier
        doc.add_boolean_term(idterm)
        db.replace_document(idterm, doc)

Here we’re using two value slots: 0 contains the collection, and 1 contains the name of whoever made the object. We know from the documentation of the dataset that both come from fixed and curated lists, so we don’t have to worry about normalising the values before using them as facets. Let’s run that to build a database with document values suitable for faceting:

$ python2 code/python/index_facets.py data/100-objects-v1.csv db

Querying

At query time, Xapian uses the concept of spies to observe the value slots of matched documents during a search.

The procedure has three steps: first, create a spy (an instance of xapian.ValueCountMatchSpy) for each slot you want facets for; second, register each spy with the xapian.Enquire object using add_matchspy(spy); third, after the search has been performed, retrieve the results that each spy observed. Here is an example:

    # Set up a spy to inspect the MAKER value at slot 1
    spy = xapian.ValueCountMatchSpy(1)
    enquire.add_matchspy(spy)

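    # checkatleast=100 makes Xapian examine the whole 100-object dataset,
    # so the facet counts are exact even though we only display one page.
    for match in enquire.get_mset(offset, pagesize, 100):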
        fields = json.loads(match.document.get_data())
        print(u"%(rank)i: #%(docid)3.3i %(title)s" % {
            'rank': match.rank + 1,
            'docid': match.docid,
            'title': fields.get('TITLE', u''),
            })
        matches.append(match.docid)

    # Fetch and display the spy values
    for facet in spy.values():
        print("Facet: %(term)s; count: %(count)i" % {
            'term' : facet.term,
            'count' : facet.termfreq
        })

    # Finally, make sure we log the query and displayed results
    support.log_matches(querystring, offset, pagesize, matches)

Here we’re faceting on value slot 1, which is the object maker. After you get the MSet, you can ask the spy for the facets it found, together with the frequency of each. Note that although we’re generally only showing ten matches, we pass the checkatleast parameter to get_mset() so that the entire dataset is considered and the facet frequencies are correct. See Limitations for some discussion of the implications of this. Here’s the output:

$ python2 code/python/search_facets.py db clock
1: #044 Two-dial clock by the Self-Winding Clock Co; as used on the
2: #096 Clock with Hipp pendulum (an electric driven clock with Hipp
3: #012 Assembled and unassembled EXA electric clock kit
4: #098 'Pond' electric clock movement (no dial)
5: #083 Harrison's eight-day wooden clock movement, 1715.
6: #005 "Ever Ready" ceiling clock
7: #039 Electric clock of the Bain type
8: #061 Van der Plancke master clock
9: #064 Morse electrical clock, dial mechanism
10: #052 Reconstruction of Dondi's Astronomical Clock, 1974
Facet: Bain, Alexander; count: 3
Facet: Bloxam, J. M.; count: 1
Facet: Braun (maker); count: 1
Facet: British Horo-Electric Ltd. (maker); count: 1
Facet: British Vacuum Cleaner and Engineering Co. Ltd., Magneto Time division (maker); count: 1
Facet: EXA; count: 1
Facet: Ever Ready Co. (maker); count: 2
Facet: Ferranti Ltd.; count: 1
Facet: Galilei, Galileo, 1564-1642; Galilei, Vincenzio, 1606-1649; count: 1
Facet: Harrison, John (maker); count: 1
Facet: Hipp, M.; count: 1
Facet: La Précision Cie; count: 1
Facet: Lund, J.; count: 1
Facet: Morse, J. S.; count: 1
Facet: Self Winding Clock Company; count: 1
Facet: Self-Winding Clock Co. (maker); count: 1
Facet: Synchronome Co. Ltd. (maker); count: 2
Facet: Thwaites and Reed Ltd.; count: 1
Facet: Thwaites and Reed Ltd. (maker); count: 1
Facet: Viviani, Vincenzo; count: 1
Facet: Vulliamy, Benjamin, 1747-1811; count: 1
Facet: Whitefriars Glass Ltd. (maker); count: 1
INFO:xapian.search:'clock'[0:10] = 44 96 12 98 83 5 39 61 64 52

Note that the spy will give you facets in alphabetical order, not in order of frequency; if you want to show the most frequent first you should use the top_values iterator (begin_top_values() in C++ and some other languages).
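If you have already copied the facets into (value, count) pairs, for instance with [(f.term, f.termfreq) for f in spy.values()], you can also reorder them yourself. The following is a minimal sketch of that client-side alternative, not a replacement for the top_values iterator; the facets_by_frequency name and the sample data (taken from the output above) are for illustration only.

```python
def facets_by_frequency(facets, limit=None):
    """Order (value, count) pairs by descending count, breaking ties by value."""
    ordered = sorted(facets, key=lambda pair: (-pair[1], pair[0]))
    return ordered if limit is None else ordered[:limit]

# A few (value, count) pairs as they might be read from a spy.
facets = [
    ('Bain, Alexander', 3),
    ('EXA', 1),
    ('Ever Ready Co. (maker)', 2),
    ('Synchronome Co. Ltd. (maker)', 2),
]
top_three = facets_by_frequency(facets, limit=3)
```

Sorting on (-count, value) keeps the display stable when several facets share the same count.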

If you want to work with multiple facets, you can register multiple xapian.ValueCountMatchSpy objects before running get_mset(), although each additional one will have some performance impact.
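One way to organise this is sketched below, assuming the slot layout from the indexer above (0 for the collection, 1 for the maker). The attach_spies and collect_facets helpers are hypothetical names, not part of Xapian; they only rely on add_matchspy(), values(), term and termfreq as used in the example earlier.

```python
def attach_spies(enquire, slot_names, make_spy):
    """Create one spy per slot, register it, and return the spies keyed by name."""
    spies = {}
    for slot, name in sorted(slot_names.items()):
        spy = make_spy(slot)
        enquire.add_matchspy(spy)
        spies[name] = spy
    return spies

def collect_facets(spies):
    """After get_mset() has run, read each spy's (value, count) pairs."""
    return {
        name: [(facet.term, facet.termfreq) for facet in spy.values()]
        for name, spy in spies.items()
    }
```

With the real bindings this might be called as attach_spies(enquire, {0: 'collection', 1: 'maker'}, xapian.ValueCountMatchSpy) before get_mset(), and collect_facets(spies) afterwards.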

Restricting by Facets

If you’re using the facets to offer the user choices for narrowing down their search results, you then need to be able to apply a suitable filter.

For a single value, you could use xapian.Query.OP_VALUE_RANGE with the same start and end, or xapian.MatchDecider, but it’s probably most efficient to also index the categories as suitably prefixed boolean terms and use those for filtering.
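A minimal sketch of the boolean-term approach follows. The 'XM' prefix, the lowercasing of the value and the helper names are all assumptions made for this illustration; the only requirement is that the indexer and the search code build the term in the same way.

```python
MAKER_PREFIX = 'XM'  # hypothetical prefix chosen for this sketch

def maker_term(maker):
    """Build the boolean filter term for a maker facet value."""
    return MAKER_PREFIX + maker.lower()

def index_maker(doc, maker):
    """At index time: store the facet value and a matching boolean term."""
    doc.add_value(1, maker)
    if maker:
        doc.add_boolean_term(maker_term(maker))

def restrict_to_maker(query, maker):
    """At query time: keep only documents carrying the maker's boolean term."""
    import xapian  # imported here so the helpers above work without it
    return xapian.Query(xapian.Query.OP_FILTER, query,
                        xapian.Query(maker_term(maker)))
```

OP_FILTER restricts the matches without letting the filter term affect their weights, which is usually what you want when the user clicks on a facet.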

Limitations

The accuracy of Xapian’s faceting capability is determined by the number of records that Xapian examines while it is searching. You can control this number by specifying the checkatleast parameter to get_mset(); however, be aware that increasing this number may slow down the query, although a typically sized database is unlikely to see adverse effects.
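As a sketch of how this trade-off might be expressed in code, the hypothetical helper below asks Xapian to examine every document by default, and lets the caller cap that number when speed matters more than exact counts. The function name and the max_checked parameter are invented here; get_doccount() and the three-argument form of get_mset() are the calls assumed from the bindings.

```python
def get_mset_for_facets(db, enquire, offset, pagesize, max_checked=None):
    """Fetch one page of matches, examining enough documents for the facets."""
    # Checking every document gives exact facet counts.
    checkatleast = db.get_doccount()
    if max_checked is not None:
        # A cap trades facet accuracy for query speed on large databases.
        checkatleast = min(checkatleast, max_checked)
    return enquire.get_mset(offset, pagesize, checkatleast)
```

When a cap is in force, the counts reported by the spies only reflect the documents actually examined, so they should be presented as approximate.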

In Development

Some additional features currently in development may benefit users of facets. These are:

  • Multiple values in slots: this will allow you to have a single value slot (e.g. colour) which contains multiple values (e.g. red, blue). This will also allow you to create a facet by colour which is aware of these multiple values, giving counts for both red and blue.

Todo

This is misleading - it’s already possible to deal with a facet with multiple values like this. We should document how rather than seeming to imply you can’t currently.

  • Bucketing: this provides a means to group together numeric facets, so that a single facet can contain a range of values (e.g. price ranges).
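Until such features are available, both effects can be approximated in application code by post-processing the (value, count) pairs read from a spy. The sketch below is one possible approach under stated assumptions, not the planned Xapian feature: it assumes multiple values were joined with a tab before being stored in the slot, and that numeric values have already been decoded to numbers (for example with xapian.sortable_unserialise() if they were stored with xapian.sortable_serialise()). Both function names are hypothetical.

```python
from collections import Counter

def split_multivalue_facets(facets, separator='\t'):
    """Split joined slot values so each component gets its own count."""
    counts = Counter()
    for value, count in facets:
        # set() stops a value repeated within one slot being counted twice.
        for part in set(value.split(separator)):
            if part:
                counts[part] += count
    return sorted(counts.items())

def bucket_numeric_facets(facets, boundaries):
    """Group numeric (value, count) pairs into half-open ranges [low, high)."""
    counts = Counter()
    for value, count in facets:
        for low, high in zip(boundaries, boundaries[1:]):
            if low <= value < high:
                counts[(low, high)] += count
                break  # values outside every range are simply dropped
    return sorted(counts.items())

colours = split_multivalue_facets([('red\tblue', 2), ('red', 3), ('blue', 1)])
prices = bucket_numeric_facets(
    [(5.0, 2), (12.5, 1), (19.99, 4), (40.0, 1)], [0, 10, 20, 50])
```

Because the grouping happens after the search, the counts are only as accurate as the spy's own counts, so the checkatleast considerations above still apply.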