Facets
Xapian can dynamically generate complete lists of the values which feature in matching documents. For example, colour, manufacturer and size values are good candidates for faceting.
This has many potential uses, but a common one is to let users narrow down their search by filtering it to only include documents with a particular value in a particular category. This is often referred to as faceted search.
Todo
string and numeric facets
Todo
grouping
Todo
selecting which facets to show
Implementation
Faceting works against information stored in document value slots and, when executed, provides a list of the unique values for that slot together with a count of the number of times each value occurs.
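Conceptually, the result for one slot is a tally of the values seen across the matching documents. The following is a plain-Python illustration of that tally, not Xapian's implementation; the `count_slot_values` helper and the sample documents are hypothetical.

```python
from collections import Counter


def count_slot_values(matching_docs, slot):
    """Tally the values stored in `slot` across the given documents.

    Each document is modelled as a dict mapping slot number to value;
    this stands in for Xapian's document values purely for illustration.
    """
    counts = Counter()
    for doc_values in matching_docs:
        value = doc_values.get(slot)
        if value:  # documents with no value in the slot are skipped
            counts[value] += 1
    # Return (value, count) pairs sorted by value.
    return sorted(counts.items())


# Hypothetical sample data: slot 0 is the collection, slot 1 the maker.
docs = [
    {0: "Clocks", 1: "Bain, Alexander"},
    {0: "Clocks", 1: "Hipp, M."},
    {0: "Clocks", 1: "Bain, Alexander"},
    {0: "Telegraphy"},
]
print(count_slot_values(docs, 1))
```

The last sample document has nothing in slot 1, so it contributes to the collection facet but not to the maker facet.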
Indexing
No additional work is needed to implement faceted searching, except to ensure that the values you wish to use in facets are stored as document values.
import json

import xapian

# parse_csv_file() is a helper from the guide's example code; it is
# assumed here to be importable from the same support module used below.


def index(datapath, dbpath):
    # Create or open the database we're going to be writing to.
    db = xapian.WritableDatabase(dbpath, xapian.DB_CREATE_OR_OPEN)

    # Set up a TermGenerator that we'll use in indexing.
    termgenerator = xapian.TermGenerator()
    termgenerator.set_stemmer(xapian.Stem("en"))

    for fields in parse_csv_file(datapath):
        # 'fields' is a dictionary mapping from field name to value.
        # Pick out the fields we're going to index.
        description = fields.get('DESCRIPTION', u'')
        title = fields.get('TITLE', u'')
        identifier = fields.get('id_NUMBER', u'')
        collection = fields.get('COLLECTION', u'')
        maker = fields.get('MAKER', u'')

        # We make a document and tell the term generator to use this.
        doc = xapian.Document()
        termgenerator.set_document(doc)

        # Index each field with a suitable prefix.
        termgenerator.index_text(title, 1, 'S')
        termgenerator.index_text(description, 1, 'XD')

        # Index fields without prefixes for general search.
        termgenerator.index_text(title)
        termgenerator.increase_termpos()
        termgenerator.index_text(description)

        # Add the collection as a value in slot 0.
        doc.add_value(0, collection)

        # Add the maker as a value in slot 1.
        doc.add_value(1, maker)

        # Store all the fields for display purposes.
        doc.set_data(json.dumps(fields))

        # We use the identifier to ensure each object ends up in the
        # database only once no matter how many times we run the
        # indexer.
        idterm = u"Q" + identifier
        doc.add_boolean_term(idterm)
        db.replace_document(idterm, doc)
Here we’re using two value slots: 0 contains the collection, and 1 contains the name of whoever made the object. We know from the documentation of the dataset that both come from fixed and curated lists, so we don’t have to worry about normalising the values before using them as facets. Let’s run that to build a database with document values suitable for faceting:
$ python3 code/python3/index_facets.py data/100-objects-v1.csv db
Querying
To query, Xapian uses the concept of spies to observe slots of matched documents during a search.
The procedure works in three steps: first, you create a spy (an instance of xapian.ValueCountMatchSpy) for each slot you want facets for; second, you bind each spy to the xapian.Enquire object using add_matchspy(spy); third, after the search has been performed, you retrieve the results that each spy observed. Here is an example of how this is done:
# Set up a spy to inspect the MAKER value at slot 1
spy = xapian.ValueCountMatchSpy(1)
enquire.add_matchspy(spy)

for match in enquire.get_mset(offset, pagesize, 100):
    fields = json.loads(match.document.get_data().decode('utf8'))
    print(u"%(rank)i: #%(docid)3.3i %(title)s" % {
        'rank': match.rank + 1,
        'docid': match.docid,
        'title': fields.get('TITLE', u''),
        })
    matches.append(match.docid)

# Fetch and display the spy values
for facet in spy.values():
    print("Facet: %(term)s; count: %(count)i" % {
        'term' : facet.term.decode('utf-8'),
        'count' : facet.termfreq
    })

# Finally, make sure we log the query and displayed results
support.log_matches(querystring, offset, pagesize, matches)
Here we’re faceting on value slot 1, which is the object maker. After you get the MSet, you can ask the spy for the facets it found, including the frequency of each. Note that although we’re generally only showing ten matches, we pass a third parameter to get_mset(), called checkatleast (100 here), so that the entire dataset is considered and the facet frequencies are correct. See Limitations for some discussion of the implications of this. Here’s the output:
$ python3 code/python3/search_facets.py db clock
1: #044 Two-dial clock by the Self-Winding Clock Co; as used on the
2: #096 Clock with Hipp pendulum (an electric driven clock with Hipp
3: #012 Assembled and unassembled EXA electric clock kit
4: #098 'Pond' electric clock movement (no dial)
5: #083 Harrison's eight-day wooden clock movement, 1715.
6: #005 "Ever Ready" ceiling clock
7: #039 Electric clock of the Bain type
8: #061 Van der Plancke master clock
9: #064 Morse electrical clock, dial mechanism
10: #052 Reconstruction of Dondi's Astronomical Clock, 1974
Facet: Bain, Alexander; count: 3
Facet: Bloxam, J. M.; count: 1
Facet: Braun (maker); count: 1
Facet: British Horo-Electric Ltd. (maker); count: 1
Facet: British Vacuum Cleaner and Engineering Co. Ltd., Magneto Time division (maker); count: 1
Facet: EXA; count: 1
Facet: Ever Ready Co. (maker); count: 2
Facet: Ferranti Ltd.; count: 1
Facet: Galilei, Galileo, 1564-1642; Galilei, Vincenzio, 1606-1649; count: 1
Facet: Harrison, John (maker); count: 1
Facet: Hipp, M.; count: 1
Facet: La Précision Cie; count: 1
Facet: Lund, J.; count: 1
Facet: Morse, J. S.; count: 1
Facet: Self Winding Clock Company; count: 1
Facet: Self-Winding Clock Co. (maker); count: 1
Facet: Synchronome Co. Ltd. (maker); count: 2
Facet: Thwaites and Reed Ltd.; count: 1
Facet: Thwaites and Reed Ltd. (maker); count: 1
Facet: Viviani, Vincenzo; count: 1
Facet: Vulliamy, Benjamin, 1747-1811; count: 1
Facet: Whitefriars Glass Ltd. (maker); count: 1
'clock'[0:10] = 44 96 12 98 83 5 39 61 64 52
Note that the spy will give you facets in alphabetical order, not in order of frequency; if you want to show the most frequent first you should use the top_values iterator (begin_top_values() in C++ and some other languages).
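As a minimal sketch of listing the most frequent facets first, the helper below wraps the spy's top_values iterator. The `top_facets` name and the default limit of 5 are hypothetical, and the sketch assumes that top_values() takes the maximum number of values to return and yields items with the same term and termfreq attributes as values() does.

```python
def top_facets(spy, limit=5):
    """Return up to `limit` (value, count) pairs, most frequent first.

    `spy` is expected to be a xapian.ValueCountMatchSpy that was
    registered on the Enquire object before get_mset() was called.
    """
    # Assumption: top_values(limit) yields at most `limit` items, ordered
    # by descending frequency, each with .term (bytes) and .termfreq.
    return [(facet.term.decode('utf-8'), facet.termfreq)
            for facet in spy.top_values(limit)]
```

After running the search, something like `for value, count in top_facets(spy, 5): print(value, count)` would then show a short list of the most frequent values rather than the full alphabetical list.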
If you want to work with multiple facets, you can register multiple xapian.ValueCountMatchSpy objects before running get_mset(), although each additional one will have some performance impact.
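Registering several spies could look roughly like the sketch below, which creates one spy per slot and keeps them in a dictionary keyed by a facet name. The `attach_spies` helper, its `spy_factory` parameter and the facet names are hypothetical; the slot numbers follow the indexing example above.

```python
def attach_spies(enquire, slots, spy_factory=None):
    """Create one spy per slot, register each on `enquire`, and return them.

    `slots` maps a facet name to a value slot number, for example
    {'collection': 0, 'maker': 1}.
    """
    if spy_factory is None:
        # Imported lazily so the helper can be exercised without Xapian
        # installed by passing a stand-in factory.
        import xapian
        spy_factory = xapian.ValueCountMatchSpy

    spies = {}
    for name, slot in slots.items():
        spy = spy_factory(slot)
        enquire.add_matchspy(spy)
        spies[name] = spy
    return spies
```

The spies must be attached before get_mset() is called; afterwards each one can be read independently, for example with `spies['maker'].values()`.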
Restricting by Facets
If you’re using the facets to offer the user choices for narrowing down their search results, you then need to be able to apply a suitable filter.
For a single value, you could use xapian.Query.OP_VALUE_RANGE with the same start and end, or xapian.MatchDecider, but it’s probably most efficient to also index the categories as suitably prefixed boolean terms and use those for filtering.
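The boolean-term approach might be sketched as follows. The 'XM' prefix and both helper names are hypothetical choices for illustration, and the colon rule in `facet_term` reflects the term-prefix convention as understood here, so check it against the Xapian term prefix documentation before relying on it.

```python
def facet_term(prefix, value):
    """Build a boolean term for a facet value under a term prefix."""
    # Assumed convention: a multi-character prefix is separated by ':'
    # from a term starting with an ASCII capital letter or a colon, so
    # the boundary between prefix and term stays unambiguous.
    first = value[:1]
    if len(prefix) > 1 and ('A' <= first <= 'Z' or first == ':'):
        return prefix + ':' + value
    return prefix + value


def restrict_to_facet(query, prefix, value):
    """Return `query` filtered to documents carrying the facet's term."""
    import xapian  # imported lazily; only needed when building queries
    # OP_FILTER keeps the matches and weights of the left-hand query but
    # drops documents that do not match the right-hand one.
    return xapian.Query(xapian.Query.OP_FILTER,
                        query,
                        xapian.Query(facet_term(prefix, value)))
```

At indexing time the same helper would be used alongside the value, for example `doc.add_boolean_term(facet_term('XM', maker))`, so that the term used for filtering matches the one stored in the database.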
Limitations
The accuracy of Xapian’s faceting capability is determined by the number of records that Xapian examines whilst it is searching. You can control this number by specifying the checkatleast parameter to get_mset(); however, it is important to be aware that increasing this number may affect overall query performance, although a typically sized database is unlikely to see adverse effects.
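One way to ask for exact counts, sketched here under the assumption that examining every document is affordable for your database, is to pass the document count as checkatleast. The `get_mset_with_full_facets` helper name is hypothetical.

```python
def get_mset_with_full_facets(db, enquire, offset, pagesize):
    """Fetch one page of matches while asking Xapian to check every document.

    `db` is the xapian.Database being searched and `enquire` the
    xapian.Enquire object with the query and any spies already set up.
    """
    # Using the document count as checkatleast means no match is skipped,
    # so attached spies see every matching document. On large databases
    # this trades query speed for exact facet counts.
    return enquire.get_mset(offset, pagesize, db.get_doccount())
```

A smaller fixed value gives approximate counts more cheaply, which may be an acceptable trade-off when the counts are only used as a rough guide.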
In Development
Some additional features currently in development may benefit users of facets. These are:
- Multiple values in slots: this will allow you to have a single value slot (e.g. colour) which contains multiple values (e.g. red, blue). This will also allow you to create a facet by colour which is aware of these multiple values, giving counts for both red and blue.
Todo
This is misleading - it’s already possible to deal with a facet with multiple values like this. We should document how rather than seeming to imply you can’t currently.
- Bucketing: this provides a means to group together numeric facets, so that a single facet can contain a range of values (e.g. price ranges).
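Until bucketing is available, similar grouping can be approximated in application code once the facet counts have been collected. The sketch below is a plain-Python illustration, not the Xapian feature described above; the `bucket_counts` helper, the boundaries and the sample prices are hypothetical, and it assumes the values have already been converted to numbers (for example with xapian.sortable_unserialise if they were stored using xapian.sortable_serialise).

```python
from bisect import bisect_right


def bucket_counts(value_counts, boundaries):
    """Group (numeric value, count) pairs into ranges.

    `boundaries` must be sorted in ascending order. Each bucket includes
    its lower edge and excludes its upper edge; None marks an open end.
    Returns a list of ((lower, upper), total) pairs, one per bucket.
    """
    totals = [0] * (len(boundaries) + 1)
    for value, count in value_counts:
        # bisect_right puts a value equal to a boundary in the bucket
        # that starts at that boundary.
        totals[bisect_right(boundaries, value)] += count

    edges = [None] + list(boundaries) + [None]
    return [((edges[i], edges[i + 1]), total)
            for i, total in enumerate(totals)]


# Hypothetical price facet: (price, number of matching documents).
prices = [(4.99, 3), (10.0, 2), (25.0, 1), (120.0, 4)]
for (low, high), total in bucket_counts(prices, [10, 50, 100]):
    print(low, high, total)
```

Empty buckets are kept in the result so that the caller can decide whether to display or hide them.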