Table of contents
In the museums dataset we used in our earlier examples, there is a field DATE_MADE that tells us when the object in question was made, so one of the natural things people might want to do is to only search for objects made in a particular time period. Suppose we want to extend our original system to allow that, we’re going to have to do a number of things.
- Parse the field from the data set to turn it into something consistent; at the moment it includes years, year ranges (“1671-1700”), approximate years (“c. 1936”) and commentary (“patented 1885”, or “1642-1649 (original); 1883 (model)”). Additionally, some records have no information about when the object was made.
- Store that information in the Xapian database.
- Provide a way during search of specifying a date range to constrain to.
If we look through the other fields in the data set, there are more that could be useful for range queries: we could extract the longest dimension from MEASUREMENTS to enable people to restrict to only very large or very small objects, for instance.
We’ll see how to perform range searches across both those dimensions, and then we’ll look at how to cope with full dates rather than just years.
If you think back to when we introduced the query concepts behind
Xapian, you’ll remember that one group of query operators is
responsible for handling value ranges:
So we’ll be tackling range queries by using document values, and constructing
queries using these operators to restrict matches suitably.
Since we want to expose this functionality generally to users, we want
them to be able to type in a query that will include one or more range
restrictions; the QueryParser contains support for doing this, using
range processors, subclasses of
Xapian comes with some standard ones itself, or you can write your own.
Since document values are stored as strings in Xapian, and the operators provided perform string comparisons, we need a way of converting numbers to strings to store them. For this, Xapian provides a pair of utility functions: sortable_serialise and sortable_unserialise, which convert between floating point numbers (strictly, each works with a double) and a string that will sort in the same way and so can be compared easily.
We need a new version of our indexer. This one is index_ranges.py, and creates document values from both MEASUREMENTS and DATE_MADE. We’ll put the largest dimension in value slot 0 (fortunately the data is stored in millimetres and kilograms, so we can cheat a little and assume that dimensions will always be larger than weights), and a year taken from DATE_MADE into value slot 1 (we choose the first year we can parse, since it can contain such a variety of date formats).
# parse the two values we need measurements = fields.get('MEASUREMENTS', u'') if len(measurements) > 0: numbers = numbers_from_string(measurements) if len(numbers) > 0: doc.add_value(0, xapian.sortable_serialise(max(numbers))) date_made = fields.get('DATE_MADE', u'') years = numbers_from_string(date_made) if len(years) > 0: doc.add_value(1, xapian.sortable_serialise(years))
We run this like so:
$ python2 code/python/index_ranges.py data/100-objects-v1.csv db
We can check this has created document values using xapian-delve:
$ xapian-delve -V0 db | cat -v Value 0 for each document: 5:M-@M-@ 8:M-HV 9:M-EM-p 10:M-MF 11:M-AM-0 12:M-AP 15:M-8^P 19:M-Dt 20:M-GM-P 21:M-E 24:M-O: 25:M-BM-@ 26:M-AM- 27:M-BX 29:M-DD 30:M-BM-^P 31:M-6@ 33:M-;` 34:M-A0 35:M-LM-l 36:M-C^P 37:M-9M-p 38:M-A( 39:M-FT 42:M-H2 45:M-N@ 46:M-AP 50:M-:M-^P 51:M-9P 52:M-LM-! 54:M-CM-( 55:M-9M-P 56:M-@P 59:M-D` 61:M-A( 62:M-;@ 64:M-:M-^P 66:M-AM-H 67:M-8` 68:M-@D33333@ 69:M-D^P 70:M-@M-H 71:M-KM-( 72:M-8^P 73:M-5M-^NfffffM-^@ 74:M-5M-^NfffffM-^@ 75:M-C$M-LM-LM-LM-LM-LM-@ 76:M-BM-?33333@ 77:M-C>33333@ 78:M-;M-^@ 79:M-E^T 80:M-9P 81:M-A@ 84:M-9M-t 86:M-L~ 87:M-BM-@ 88:M-9(M-LM-LM-LM-LM-LM-@ 89:M-:M-?33333@ 90:M-8M-C33333@ 91:M-E| 93:M-A( 94:M-@` 97:M-EM-\ 98:M-Bh 100:M-9^P
All the odd characters are because xapian-delve doesn’t know to run sortable_unserialise to turn the strings back into numbers.
All we need to do once we’ve got the document values in place is to
tell the QueryParser about them. The simplest range processor is
xapian.RangeProcessor itself, but here we need two
To distinguish between the two different ranges, we’ll require that dimensions must be specified with the suffix ‘mm’, but years are just numbers. For this to work, we have to tell QueryParser about the value range with a suffix first:
queryparser.add_rangeprocessor( xapian.NumberRangeProcessor(0, 'mm', xapian.RP_SUFFIX) ) queryparser.add_rangeprocessor( xapian.NumberRangeProcessor(1) )
The first call has a final parameter of False to say that ‘mm’ is a suffix (the default is for it to be a prefix). When using the empty string, as in the second call, it doesn’t matter whether you say it’s a suffix or prefix, so it’s convenient to skip that parameter.
This is implemented in search_ranges.py, which also modifies the output to show the measurements and date made fields as well as the title.
We can now restrict across dimensions using queries like ‘..50mm’ (everything at most 50mm in its longest dimension), and across years using ‘1980..1989’:
$ python2 code/python/search_ranges.py db ..50mm 1: #031 (1588) overall diameter: 50 mm Portable universal equinoctial sundial, in brass, signed "A 2: #073 (1701-1721) overall: 15 mm x 44.45 mm, weight: 0.055kg Universal pocket sundial 3: #074 (1596) overall: 13 mm x 44.45 mm x 44.45 mm, weight: 0.095kg Sundial, made as a locket, gilt metal, part silver '..50mm'[0:10] = 31 73 74
$ python2 code/python/search_ranges.py db 1980..1989 1: #050 (1984) overall: 105 mm x 75 mm x 57 mm, Quartz Analogue "no battery" wristwatch by Pulsar Quartz (CA 2: #051 (1984) overall: 85 mm x 65 mm x 38 mm, Analogue quartz clock with voice controlled alarm by Braun, '1980..1989'[0:10] = 50 51
You can of course combine this with ‘normal’ search terms, such as all clocks made from 1960 onwards:
$ python2 code/python/search_ranges.py db clock 1960.. 1: #052 (1974) clock: 1185 x 780 mm, 122 kg; rewind unit: 460 x 640 x 350 mm Reconstruction of Dondi's Astronomical Clock, 1974 2: #051 (1984) overall: 85 mm x 65 mm x 38 mm, Analogue quartz clock with voice controlled alarm by Braun, 3: #009 (1973) overall: 380 mm x 300 mm x 192 mm, weight: 6.45kg Copy of a Dwerrihouse skeleton clock with coup-perdu escape 'clock 1960..'[0:10] = 52 51 9
and even combining both ranges at once, such as all large objects from the 19th century:
$ python2 code/python/search_ranges.py db 1000..mm 1800..1899 1: #024 (1845-1855) overall: 1850 mm x 350 mm x 250 mm Regulator Clock with Gravity Escapement '1000..mm 1800..1899'[0:10] = 24
Note the slightly awkward syntax 1000..mm. The suffix must always go on the end of the entire range; it may also go on the beginning (so you can do 1000mm..mm). Similarly, you can have 100mm..200mm or 100..200mm but not 100mm..200. These rules are reversed for prefixes.
If you get the rules wrong, the QueryParser will raise a QueryParserError, which in production code you could catch and either signal to the user or perhaps try the query again without the RangeProcessor that tripped up.
To restrict to a date range, we need to decide how to both store the
date in a document value, and how we want users to input the date
range in their query.
xapian.DateRangeProcessor, which is part of
Xapian, works by storing the date as a string in the form ‘YYYYMMDD’,
and can take dates in either US style (month/day/year) or European
To show how this works, we’re going to need to use a different dataset, because the museums data only gives years the objects were made in; we’ve built one using data on the fifty US states, taken from Wikipedia infoboxes on 5th November 2011 and then tidied up a small amount. The CSV file is states.csv, and the code that did most of the work is from_wikipedia.py, using a list of Wikipedia page titles in us_states_on_wikipedia. The CSV is licensed as Creative Commons Attribution-Share Alike 3.0, as per Wikipedia.
We need a new indexer for this as well, which is index_ranges2.py. It stores two numbers using sortable_serialise: year of admission in value slot 1 and population in slot 3. It also stores the date of admission as ‘YYYYMMDD’ in slot 2. Here’s the code which does this:
# Index each field with a suitable prefix. termgenerator.index_text(name, 1, 'S') termgenerator.index_text(description, 1, 'XD') termgenerator.index_text(motto, 1, 'XM') # Index fields without prefixes for general search. termgenerator.index_text(name) termgenerator.increase_termpos() termgenerator.index_text(description) termgenerator.increase_termpos() termgenerator.index_text(motto) # Add document values. if admitted is not None: doc.add_value(1, xapian.sortable_serialise(int(admitted[:4]))) doc.add_value(2, admitted) # YYYYMMDD if population is not None: doc.add_value(3, xapian.sortable_serialise(int(population)))
We’ll look at just the date ones for now, and come back to the others in a minute.
We use the indexer in the same way as previous ones:
$ python2 code/python/index_ranges2.py data/states.csv statesdb
With this done, we can change the set of value range processors we give to the QueryParser.
queryparser.add_rangeprocessor( xapian.DateRangeProcessor(2, xapian.RP_DATE_PREFER_MDY, 1860) ) queryparser.add_rangeprocessor( xapian.NumberRangeProcessor(1) )
xapian.DateRangeProcessor is working on value slot 2, with an
“epoch” of 1860 (so two digit years will be considered as starting at
1860 and going forward as far 1959). The second parameter is whether
it should prefer US style dates or not; since we’re looking at US
states, we’ve gone for US dates. The
is as we saw before, which means that it can’t cope with two digit years.
This enables us to search for any state that talks about the Spanish in its description:
$ python2 code/python/search_ranges2.py statesdb spanish 1: #004 State of Montana November 8, 1889 Population 989,415 2: #019 State of Texas December 29, 1845 Population 25,145,561 'spanish'[0:10] = 4 19
or for all states admitted in the 19th century:
$ python2 code/python/search_ranges2.py statesdb 1800..1899 1: #001 State of Washington November 11, 1889 Population 6,744,496 2: #002 State of Arkansas June 15, 1836 Population 2,915,918 3: #003 State of Oregon February 14, 1859 Population 3,831,074 4: #004 State of Montana November 8, 1889 Population 989,415 5: #005 Idaho July 3, 1890 Population 1,567,582 6: #006 State of Nevada October 31, 1864 Population 2,700,551 7: #007 State of California September 9, 1850 Population 37,253,956 8: #009 State of Utah January 4, 1896 Population 2,763,885 9: #010 State of Wyoming July 10, 1890 Population 563,626 10: #011 State of Colorado August 1, 1876 Population 5,029,196 '1800..1899'[0:10] = 1 2 3 4 5 6 7 9 10 11
That uses the
xapian.NumberRangeProcessor on value slot 1, as in
our previous example. Let’s be more specific and ask for only those
between November 8th 1889, when Montana became part of the Union, and
July 10th 1890, when Wyoming joined:
$ python2 code/python/search_ranges2.py statesdb 11/08/1889..07/10/1890 1: #001 State of Washington November 11, 1889 Population 6,744,496 2: #004 State of Montana November 8, 1889 Population 989,415 3: #005 Idaho July 3, 1890 Population 1,567,582 4: #010 State of Wyoming July 10, 1890 Population 563,626 '11/08/1889..07/10/1890'[0:10] = 1 4 5 10
That uses the
xapian.DateRangeProcessor on value slot 2; it can’t
cope with year ranges, which is why we indexed to both slots 1 and 2.
We haven’t yet done anything with population. What we want is
something that behaves like
xapian.NumberRangeProcessor, but knows
what reason possible values are. If we insert it before the
xapian.NumberRangeProcessor on slot 1 (year), it can pick up
anything that should be treated as a population, and let everything else be
treated as a year range.
To do this, we need to know how a
called by the QueryParser. What happens is that each processor in turn is
passed the start and end of the range. If it doesn’t understand the range, it
xapian.BAD_VALUENO. If it does understand
the range, it should return the value number to use with
xapian.Query.OP_VALUE_RANGE and if it wants to, it can
modify the start and end values (to convert them to the correct format for
the string comparison which
What we’re going to do is to write a custom
that accepts numbers in the range 500,000 to 50,000,000; these can’t
possibly be years in our data set, and encompass the full range of
populations. If either number is outside that range, we will return
xapian.BAD_VALUENO and the QueryParser will move on.
class PopulationRangeProcessor(xapian.RangeProcessor): def __init__(self, slot, low, high): super(PopulationRangeProcessor, self).__init__() self.nrp = xapian.NumberRangeProcessor(slot) self.low = low self.high = high def __call__(self, begin, end): if len(begin) > 0: try: _begin = int(begin) if _begin < self.low or _begin > self.high: raise ValueError() except: return xapian.Query(xapian.Query.OP_INVALID) if len(end) > 0: try: _end = int(end) if _end < self.low or _end > self.high: raise ValueError() except: return xapian.Query(xapian.Query.OP_INVALID) return self.nrp(begin, end) queryparser.add_rangeprocessor( PopulationRangeProcessor(3, 500000, 50000000) )
Most of the work is in __call__ (python’s equivalent of operator()
in C++), which gets called with the two strings at either end of the
range in the query string; either but not both can be the empty
string, which indicates an open-ended range. This method returns a
xapian.Query object - if the object doesn’t want to handle
the range, then this should use operator
it doesn’t want to handle it; otherwise this query is the range that
is matched - typically using
xapian.Query objects are supported.
Rather than re-implement
xapian.NumberRangeProcessor, we wrap it to do
the serialisation (due to the way python interacts with the API it’s currently
not possible to subclass it successfully here).
Range processors are called in the order they’re added, so our custom one gets a chance to look at all ranges, but will only ‘claim’ ranges which use integer numbers within the 500 thousand to 50 million range.
We can then search for states by population, such as all over 10 million:
$ python2 code/python/search_ranges2.py statesdb 10000000.. 1: #007 State of California September 9, 1850 Population 37,253,956 2: #019 State of Texas December 29, 1845 Population 25,145,561 3: #027 State of Illinois December 3, 1818 Population 12,830,632 4: #030 State of Ohio March 1, 1803 Population 11,536,504 5: #035 State of Florida March 3, 1845 Population 18,801,310 6: #040 Commonwealth of Pennsylvania December 12, 1787 Population 12,702,379 7: #041 State of New York July 26, 1788 Population 19,378,102 '10000000..'[0:10] = 7 19 27 30 35 40 41
Or all that joined the union in the 1780s and have a population now over 10 million:
$ python2 code/python/search_ranges2.py statesdb 1780..1789 10000000.. 1: #040 Commonwealth of Pennsylvania December 12, 1787 Population 12,702,379 2: #041 State of New York July 26, 1788 Population 19,378,102 '1780..1789 10000000..'[0:10] = 40 41
With a little more work, we could support ranges such as ‘..5m’ to mean up to 5 million, or ‘..750k’ for up to 750 thousand.
Similarly, it would be possible to use the same approach to create a custom
xapian.RangeProcessor that could restrict to a range of years, and
cope with two digit years, as our
xapian.DateRangeProcessor did for
Without other terms in a query, a
xapian.RangeProcessor can cause
a value operation to be performed across the whole database, which means
loading all the values in a given slot. On a small database, this
isn’t a problem, but for a large one it can have performance
implications: you may end up with very slow queries.