Sorting
By default, Xapian orders search results by decreasing relevance score. However, it also allows results to be ordered by other criteria, or a combination of other criteria and relevance score.
If two or more results compare equal by the sorting criteria, then their
order is decided by their document ids. By default, the document ids sort
in ascending order (so a lower document id is “better”), but descending
order can be chosen using enquire.set_docid_order(enquire.DESCENDING).
If you have no preference, you can tell Xapian to use whatever order is
most efficient using enquire.set_docid_order(enquire.DONT_CARE).
It is also possible to change the way that the relevance scores are calculated; for details, see the document on weighting schemes and scoring.
Sorting by Value
You can order documents by comparing a specified document value. Note that the
comparison used compares the byte values in the value (i.e. it’s a string sort
ignoring locale), so 1 < 10 < 2. If you want to encode the value
such that it sorts numerically, use sortable_serialise() to encode
values at index time - this works equally well on integers and floating point
values:
doc.add_value(0, xapian.sortable_serialise(price))
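To see why the encoding matters, the byte-wise comparison can be reproduced with plain Python byte strings. This is a minimal sketch that does not use Xapian. The helper `pad` is a hypothetical stand-in that only handles non-negative integers of bounded size; it is not how sortable_serialise() encodes values, it merely shows the effect of an order-preserving encoding.

```python
# Byte-wise comparison, as used for document values: b"10" sorts before b"2".
raw = [b"2", b"10", b"1"]
raw_sorted = sorted(raw)

def pad(n, width=10):
    """Hypothetical stand-in encoding: fixed-width, zero-padded decimal."""
    return str(n).zfill(width).encode("ascii")

# With a fixed width, byte order and numeric order agree.
padded_sorted = sorted([pad(2), pad(10), pad(1)])
decoded = [int(v) for v in padded_sorted]

print(raw_sorted)
print(decoded)
```

Here `raw_sorted` keeps the string order 1, 10, 2, while `decoded` comes back in numeric order.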
There are three methods which specify how the value is used to sort, depending on whether and how you want relevance to affect the ordering:
xapian.Enquire.set_sort_by_value() specifies that relevance doesn’t affect the ordering at all.
xapian.Enquire.set_sort_by_value_then_relevance() specifies that relevance is used for ordering any groups of documents for which the value is the same.
xapian.Enquire.set_sort_by_relevance_then_value() specifies that documents are ordered by relevance, and the value is only used to order groups of documents with identical relevance values (note: the weight has to be exactly the same for values to determine the order, so this method isn’t very useful when using BM25 with the default parameters, as that will rarely give identical scores to different documents).
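The difference between the three orderings can be modelled in plain Python. This is only a sketch of the semantics described above, not Xapian's implementation: the `Match` tuple and its sample data are made up, values are sorted ascending, and ties are broken by ascending document id.

```python
from collections import namedtuple

# Hypothetical match data: 'value' stands for an already sortable value,
# 'weight' for the relevance score.
Match = namedtuple("Match", "docid value weight")

matches = [
    Match(1, b"1889", 0.5),
    Match(2, b"1845", 0.5),
    Match(3, b"1845", 0.9),
    Match(4, b"1889", 0.9),
]

# Each key ends with the docid, so ties fall back to ascending document id.
# Weights are negated because higher relevance ranks first.
def by_value(m):
    return (m.value, m.docid)

def by_value_then_relevance(m):
    return (m.value, -m.weight, m.docid)

def by_relevance_then_value(m):
    return (-m.weight, m.value, m.docid)

order_value = [m.docid for m in sorted(matches, key=by_value)]
order_value_rel = [m.docid for m in sorted(matches, key=by_value_then_relevance)]
order_rel_value = [m.docid for m in sorted(matches, key=by_relevance_then_value)]

print(order_value, order_value_rel, order_rel_value)
```

Note that the third ordering only consults the value because the sample weights are exactly equal in pairs, which real BM25 scores rarely are.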
We’ll use the states dataset to demonstrate this, and the code from dealing with dates in the range queries HOWTO:
$ python3 code/python3/index_ranges2.py data/states.csv statesdb
This has three document values: slot 1 has the year of admission to the union, slot 2 the full date (as “YYYYMMDD”), and slot 3 the latest population estimate. So if we want to sort by year of entry to the union and then within that by relevance, we want to add the following before we call get_mset:
enquire.set_sort_by_value_then_relevance(1, False)
The final parameter is False for ascending order, True for descending. We can then run sorted searches like this:
$ python3 code/python3/search_sorting.py statesdb spanish
1: #019 State of Texas December 29, 1845
Population 25,145,561
2: #004 State of Montana November 8, 1889
Population 989,415
'spanish'[0:10] = 19 4
Generated Sort Keys
To allow more elaborate sorting schemes, Xapian allows you to provide a
functor object subclassed from xapian.KeyMaker which generates a sort
key for each matching document which is under consideration. This is
called at most once for each document, and then the generated sort keys are
ordered by comparing byte values (i.e. with a string sort ignoring locale).
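The general pattern (generate one key per document, then compare the keys as bytes) can be sketched without Xapian. Everything here is illustrative: `make_key` is a hypothetical key generator that lower-cases a name so the byte-wise sort becomes case-insensitive, and the `calls` list only exists to show that each key is generated once.

```python
calls = []  # records every key generation

def make_key(name):
    """Hypothetical key generator: lower-cased name as UTF-8 bytes."""
    calls.append(name)
    return name.lower().encode("utf-8")

names = ["texas", "Montana", "delaware"]

# A plain byte-wise sort puts upper-case letters before lower-case ones.
plain = sorted(names, key=lambda n: n.encode("utf-8"))

# Generate each key exactly once, then order by comparing the key bytes.
keyed = [(make_key(n), n) for n in names]
ordered = [n for _, n in sorted(keyed)]

print(plain)
print(ordered)
```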
Sorting by Multiple Values
There’s a standard subclass xapian.MultiValueKeyMaker which allows
sorting on more than one document value (so the first document value
specified determines the order; amongst groups of documents where that’s
the same, the second document value determines the order, and so on).
We’ll use this to change our sorted search above to order by year of entry to the union and then by decreasing population.
keymaker = xapian.MultiValueKeyMaker()
keymaker.add_value(1, False)
keymaker.add_value(3, True)
enquire.set_sort_by_key_then_relevance(keymaker, False)
As with the Enquire methods, add_value has a second parameter that controls whether it uses an ascending or descending sort. So now we can run a search with a more complex sort:
$ python3 code/python3/search_sorting2.py statesdb State
1: #040 Commonwealth of Pennsylvania December 12, 1787
Population 12,702,379
2: #043 State of New Jersey December 18, 1787
Population 8,791,894
3: #049 State of Delaware December 7, 1787
Population 897,934
4: #041 State of New York July 26, 1788
Population 19,378,102
5: #034 State of Georgia January 2, 1788
Population 9,687,653
6: #038 Commonwealth of Virginia June 25, 1788
Population 8,001,024
7: #046 Commonwealth of Massachusetts February 6, 1788
Population 6,547,629
8: #050 State of Maryland April 28, 1788
Population 5,773,552
9: #036 State of South Carolina May 23, 1788
Population 4,625,384
10: #048 State of Connecticut January 9, 1788
Population 3,574,097
'State'[0:10] = 40 43 49 41 34 38 46 50 36 48
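As a cross-check, the same ordering can be reproduced in plain Python from the figures in the output above. This is a sketch of the sort semantics only (ascending year, then descending population); it does not use Xapian, and the tuple layout is an assumption made for the example.

```python
# (docid, year of admission, population), copied from the output above
# and deliberately listed out of order.
states = [
    (48, 1788, 3574097),
    (40, 1787, 12702379),
    (36, 1788, 4625384),
    (49, 1787, 897934),
    (41, 1788, 19378102),
    (50, 1788, 5773552),
    (43, 1787, 8791894),
    (34, 1788, 9687653),
    (46, 1788, 6547629),
    (38, 1788, 8001024),
]

# Ascending year first, then descending population (hence the negation).
ranked = sorted(states, key=lambda s: (s[1], -s[2]))
docids = [s[0] for s in ranked]

print(docids)
```

The resulting document ids match the final line of the search output.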
Other Uses for Generated Keys
xapian.KeyMaker can also be subclassed to sort based on a calculation.
For example, “sort by geographical distance”, where a subclass could take
the latitude and longitude of the user’s location, and coordinates of the
document from a value slot, and sort results so that those closest to the
user are ranked highest.
For this, we’re going to want the geographical coordinates of each state stored in a value. We can use the approximate middle of the state for this purpose, which is calculated for us when parsing the states.csv file:
midlat = fields['midlat']
midlon = fields['midlon']
if midlat and midlon:
    doc.add_value(4, "%f,%f" % (float(midlat), float(midlon)))
We don’t need to sort on these values directly, so we’ve put both into a single slot from which they are easy to read back:
$ python3 code/python3/index_values_with_geo.py data/states.csv statesdb
Now we need a KeyMaker; let’s have it return a key that sorts by distance from Washington, DC.
class DistanceKeyMaker(xapian.KeyMaker):
    def __call__(self, doc):
        # we want to return a sortable string which represents
        # the distance from Washington, DC to the middle of this
        # state.
        value = doc.get_value(4).decode('utf8')
        x, y = map(float, value.split(','))
        washington = (38.012, -77.037)
        return xapian.sortable_serialise(
            support.distance_between_coords((x, y), washington)
        )
enquire.set_sort_by_key_then_relevance(DistanceKeyMaker(), False)
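The helper support.distance_between_coords is not shown here. As a rough idea of what such a function might do, the following sketch computes a great-circle distance with the haversine formula. This is an assumption, not the helper's actual implementation: the function body, the `parse_coords` helper, the Earth radius and the sample point are all chosen for illustration.

```python
import math

def distance_between_coords(a, b, radius_km=6371.0):
    """Great-circle distance in kilometres between two (lat, lon) pairs
    given in degrees, using the haversine formula (assumed implementation)."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(h))

def parse_coords(value):
    """Parse a 'lat,lon' string in the format stored in value slot 4."""
    lat, lon = map(float, value.split(","))
    return (lat, lon)

washington = (38.012, -77.037)  # coordinates as used in the KeyMaker above
here = parse_coords("39.012000,-77.037000")  # made-up point one degree north
d = distance_between_coords(here, washington)

print(round(d, 1))
```

Because the key is a plain distance, passing it through sortable_serialise() and sorting ascending ranks the nearest documents first.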
And running it is as simple as before:
$ python3 code/python3/search_sorting3.py statesdb State
1: #050 State of Maryland April 28, 1788
Population 5,773,552
2: #049 State of Delaware December 7, 1787
Population 897,934
3: #040 Commonwealth of Pennsylvania December 12, 1787
Population 12,702,379
4: #043 State of New Jersey December 18, 1787
Population 8,791,894
5: #039 State of West Virginia June 20, 1863
Population 1,859,815
6: #037 State of North Carolina November 21, 1789
Population 9,535,483
7: #041 State of New York July 26, 1788
Population 19,378,102
8: #038 Commonwealth of Virginia June 25, 1788
Population 8,001,024
9: #048 State of Connecticut January 9, 1788
Population 3,574,097
10: #036 State of South Carolina May 23, 1788
Population 4,625,384
'State'[0:10] = 50 49 40 43 39 37 41 38 48 36