Sorting

By default, Xapian orders search results by decreasing relevance score. However, it also allows results to be ordered by other criteria, or a combination of other criteria and relevance score.

If two or more results compare equal by the sorting criteria, then their order is decided by their document ids. By default, the document ids sort in ascending order (so a lower document id is “better”), but descending order can be chosen using enquire.set_docid_order(enquire.DESCENDING). If you have no preference, you can tell Xapian to use whatever order is most efficient using enquire.set_docid_order(enquire.DONT_CARE).
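The tie-break rule can be modelled in a few lines of plain Python. This is only a sketch of the behaviour described above and does not call Xapian; the (sort key, document id) pairs and the order() helper are made up for illustration.

```python
# Plain-Python model of the docid tie-break; no Xapian calls are made.
# Each result is a hypothetical (sort_key, docid) pair.
results = [(b"1845", 19), (b"1787", 49), (b"1787", 40), (b"1787", 43)]


def order(results, docid_descending=False):
    """Sort by key ascending, breaking ties by document id."""
    sign = -1 if docid_descending else 1
    return sorted(results, key=lambda r: (r[0], sign * r[1]))


ascending = [docid for _, docid in order(results)]
descending = [docid for _, docid in order(results, docid_descending=True)]
print(ascending)   # [40, 43, 49, 19]
print(descending)  # [49, 43, 40, 19]
```

Only the three documents that share the key b"1787" change places; document 19 has a different key, so the docid order never affects it.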

It is also possible to change the way that the relevance scores are calculated; for details, see the document on weighting schemes and scoring.

Sorting by Value

You can order documents by comparing a specified document value. Values are compared byte by byte (i.e. it’s a string sort ignoring locale), so 1 < 10 < 2. To make a value sort numerically, encode it with sortable_serialise() at index time; this works equally well on integers and floating point values:

doc.add_value(0, xapian.sortable_serialise(price))
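The "1 < 10 < 2" effect is easy to reproduce without Xapian. The sketch below sorts decimal strings directly, then sorts a zero-padded encoding. The pad_key() helper is hypothetical and is not how sortable_serialise() works internally; it only shows why an order-preserving encoding is needed.

```python
prices = [2, 10, 1]

# Plain decimal strings sort character by character, which for ASCII
# digits is the same as comparing bytes.
naive = sorted(str(p) for p in prices)
print(naive)  # ['1', '10', '2']


def pad_key(n, width=10):
    """Hypothetical fixed-width key for non-negative integers."""
    return str(n).zfill(width)


# With every key the same width, string order matches numeric order.
padded = sorted(pad_key(p) for p in prices)
decoded = [int(k) for k in padded]
print(decoded)  # [1, 2, 10]
```

Zero-padding only handles non-negative integers of bounded size, which is one reason to prefer sortable_serialise() for real data.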

Three methods specify how the value is used to sort, depending on whether and how you want relevance to affect the ordering:

  • xapian.Enquire.set_sort_by_value() specifies that relevance doesn’t affect the ordering at all.
  • xapian.Enquire.set_sort_by_value_then_relevance() specifies that relevance is used for ordering any groups of documents for which the value is the same.
  • xapian.Enquire.set_sort_by_relevance_then_value() specifies that documents are ordered by relevance, and the value is only used to order groups of documents with identical relevance values (note: the weight has to be exactly the same for values to determine the order, so this method isn’t very useful when using BM25 with the default parameters, as that will rarely give identical scores to different documents).
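To make the difference between the three methods concrete, here is a plain-Python model of the three orderings. It is a sketch only: the matches are hypothetical (docid, value, weight) triples, values are sorted ascending, and ties fall back to ascending document id as described earlier.

```python
# Hypothetical matches: (docid, value, weight). No Xapian calls are made.
matches = [
    (1, b"1889", 0.9),
    (2, b"1845", 0.4),
    (3, b"1845", 0.7),
    (4, b"1845", 0.9),
]


def docids(ms):
    return [m[0] for m in ms]


# Models set_sort_by_value(): value only, ties broken by docid.
by_value = sorted(matches, key=lambda m: (m[1], m[0]))

# Models set_sort_by_value_then_relevance(): value first, then higher weight.
by_value_then_rel = sorted(matches, key=lambda m: (m[1], -m[2], m[0]))

# Models set_sort_by_relevance_then_value(): higher weight first; the value
# only separates documents whose weights are exactly equal.
by_rel_then_value = sorted(matches, key=lambda m: (-m[2], m[1], m[0]))

print(docids(by_value))           # [2, 3, 4, 1]
print(docids(by_value_then_rel))  # [4, 3, 2, 1]
print(docids(by_rel_then_value))  # [4, 1, 3, 2]
```

In the last ordering the value only matters for documents 1 and 4, whose weights are identical, which mirrors the caveat about BM25 rarely producing exact ties.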

We’ll use the states dataset to demonstrate this, along with the indexing code from the “dealing with dates” part of the range queries HOWTO:

$ python2 code/python/index_ranges2.py data/states.csv statesdb

This has three document values: slot 1 has the year of admission to the union, slot 2 the full date (as “YYYYMMDD”), and slot 3 the latest population estimate. So if we want to sort by year of entry to the union and then within that by relevance, we want to add the following before we call get_mset:

    enquire.set_sort_by_value_then_relevance(1, False)

The final parameter is False for ascending order, True for descending. We can then run sorted searches like this:

$ python2 code/python/search_sorting.py statesdb spanish
1: #019 State of Texas December 29, 1845
        Population 25,145,561
2: #004 State of Montana November 8, 1889
        Population 989,415
'spanish'[0:10] = 19 4

Generated Sort Keys

To allow more elaborate sorting schemes, Xapian lets you provide a functor object subclassed from xapian.KeyMaker, which generates a sort key for each matching document under consideration. It is called at most once for each document, and the generated sort keys are then ordered by comparing byte values (i.e. with a string sort ignoring locale).
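The "generate a key once, then compare bytes" pattern can be sketched in plain Python. The class below is a stand-in for a KeyMaker subclass, not a real one: it takes hypothetical dictionary documents instead of xapian.Document objects, and counts its calls to show that each document is keyed only once.

```python
class LowercaseTitleKey(object):
    """Stand-in for a KeyMaker: maps a document to a byte-string key."""

    def __init__(self):
        self.calls = 0

    def __call__(self, doc):
        self.calls += 1
        # Lowercasing makes the byte-wise comparison case-insensitive.
        return doc["title"].lower().encode("utf-8")


docs = [{"title": "texas"}, {"title": "Montana"}, {"title": "alaska"}]
make_key = LowercaseTitleKey()

# Generate each key once, then order by plain byte comparison.
keyed = [(make_key(d), d) for d in docs]
keyed.sort(key=lambda pair: pair[0])

titles = [d["title"] for _, d in keyed]
print(titles)          # ['alaska', 'Montana', 'texas']
print(make_key.calls)  # 3
```

Any ordering logic has to be baked into the bytes of the key, because the comparison itself is fixed.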

Sorting by Multiple Values

There’s a standard subclass xapian.MultiValueKeyMaker which allows sorting on more than one document value (so the first document value specified determines the order; amongst groups of documents where that’s the same, the second document value determines the order, and so on).

We’ll use this to change our sorted search above to order by year of entry to the union and then by decreasing population.

    keymaker = xapian.MultiValueKeyMaker()
    keymaker.add_value(1, False)
    keymaker.add_value(3, True)
    enquire.set_sort_by_key_then_relevance(keymaker, False)

As with the Enquire methods, add_value has a second parameter that controls whether it uses an ascending or descending sort. So now we can run a search with a more complex sort:

$ python2 code/python/search_sorting2.py statesdb State
1: #040 Commonwealth of Pennsylvania December 12, 1787
        Population 12,702,379
2: #043 State of New Jersey December 18, 1787
        Population 8,791,894
3: #049 State of Delaware December 7, 1787
        Population 897,934
4: #041 State of New York July 26, 1788
        Population 19,378,102
5: #034 State of Georgia January 2, 1788
        Population 9,687,653
6: #038 Commonwealth of Virginia June 25, 1788
        Population 8,001,024
7: #046 Commonwealth of Massachusetts February 6, 1788
        Population 6,547,629
8: #050 State of Maryland April 28, 1788
        Population 5,773,552
9: #036 State of South Carolina May 23, 1788
        Population 4,625,384
10: #048 State of Connecticut January 9, 1788
        Population 3,574,097
'State'[0:10] = 40 43 49 41 34 38 46 50 36 48
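The ordering in this output can be checked with a plain-Python model of the two-level sort, using five of the rows shown above. This is a sketch of the resulting order only. It negates the population to get a descending sort, which is a shortcut for numbers and not necessarily how MultiValueKeyMaker builds its byte-string keys.

```python
# (docid, year of admission, population), taken from the output above
# and deliberately shuffled.
states = [
    (49, 1787, 897934),    # Delaware
    (41, 1788, 19378102),  # New York
    (40, 1787, 12702379),  # Pennsylvania
    (34, 1788, 9687653),   # Georgia
    (43, 1787, 8791894),   # New Jersey
]

# Year ascending, then population descending within each year.
ranked = sorted(states, key=lambda s: (s[1], -s[2]))
order = [docid for docid, _, _ in ranked]
print(order)  # [40, 43, 49, 41, 34]
```

The result matches the first five document ids reported by the search.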

Other Uses for Generated Keys

xapian.KeyMaker can also be subclassed to sort based on a calculation. For example, to sort by geographical distance, a subclass could take the latitude and longitude of the user’s location and the coordinates of the document from a value slot, and sort results so that those closest to the user are ranked highest.

For this, we’re going to want the geographical coordinates of each state stored in a value. We can use the approximate middle of the state for this purpose, which is calculated for us when parsing the states.csv file:

        midlat = fields['midlat']
        midlon = fields['midlon']
        if midlat and midlon:
            doc.add_value(4, "%f,%f" % (float(midlat), float(midlon)))

We don’t need to sort on these values directly, so we’ve put both into a single slot from which they are easy to read back. Index the data again with:

$ python2 code/python/index_values_with_geo.py data/states.csv statesdb
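As a quick check that the "%f,%f" format round-trips, the sketch below encodes a pair of coordinates and parses it back. The helper names and the sample coordinates are made up for illustration. Depending on the bindings, a value read back from Xapian may arrive as bytes and need decoding before it is split.

```python
def encode_coords(lat, lon):
    """Pack latitude and longitude into one comma-separated string."""
    return "%f,%f" % (lat, lon)


def decode_coords(value):
    """Parse the string produced by encode_coords() back into floats."""
    lat, lon = (float(part) for part in value.split(","))
    return lat, lon


stored = encode_coords(38.9, -77.0)  # hypothetical coordinates
print(stored)                 # 38.900000,-77.000000
print(decode_coords(stored))  # (38.9, -77.0)
```

Note that "%f" keeps six decimal places, so finer-grained coordinates are rounded when stored.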

Now we need a KeyMaker; let’s have it return a key that sorts by distance from Washington, DC.

    class DistanceKeyMaker(xapian.KeyMaker):
        def __call__(self, doc):
            # we want to return a sortable string which represents
            # the distance from Washington, DC to the middle of this
            # state.
            coords = map(float, doc.get_value(4).split(','))
            washington = (38.012, -77.037)
            return xapian.sortable_serialise(
                support.distance_between_coords(coords, washington)
                )
    enquire.set_sort_by_key_then_relevance(DistanceKeyMaker(), False)
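The helper support.distance_between_coords() is not shown here. One plausible way to compute such a distance is the haversine formula, sketched below as an assumption and not as the helper's actual implementation. The function name, the Earth radius constant and the state midpoints are all hypothetical; the Washington coordinates are the ones used in the KeyMaker above.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1 = math.radians(a[0]), math.radians(a[1])
    lat2, lon2 = math.radians(b[0]), math.radians(b[1])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    # Clamp to guard against rounding pushing h slightly above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, h)))


washington = (38.012, -77.037)
# Hypothetical state midpoints, for illustration only.
midpoints = {"A": (39.0, -76.7), "B": (31.0, -100.0), "C": (41.0, -77.5)}

ranked = sorted(midpoints, key=lambda name: haversine_km(midpoints[name], washington))
print(ranked)  # ['A', 'C', 'B']
```

Inside a real KeyMaker the resulting float would still be passed through xapian.sortable_serialise(), as in the code above, so that the byte-wise key comparison matches numeric order.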

And running it is as simple as before:

$ python2 code/python/search_sorting3.py statesdb State
1: #050 State of Maryland April 28, 1788
        Population 5,773,552
2: #049 State of Delaware December 7, 1787
        Population 897,934
3: #040 Commonwealth of Pennsylvania December 12, 1787
        Population 12,702,379
4: #043 State of New Jersey December 18, 1787
        Population 8,791,894
5: #039 State of West Virginia June 20, 1863
        Population 1,859,815
6: #037 State of North Carolina November 21, 1789
        Population 9,535,483
7: #041 State of New York July 26, 1788
        Population 19,378,102
8: #038 Commonwealth of Virginia June 25, 1788
        Population 8,001,024
9: #048 State of Connecticut January 9, 1788
        Population 3,574,097
10: #036 State of South Carolina May 23, 1788
        Population 4,625,384
'State'[0:10] = 50 49 40 43 39 37 41 38 48 36