Custom Weighting Schemes

You can also implement your own weighting scheme, provided it can be expressed in the form of a sum over the matching terms, plus an extra term which depends on term-independent statistics (such as the normalised document length).
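To make the shape of that formula concrete, here's a small standalone sketch (plain C++ with hypothetical names, not part of the Xapian API) of how a document's score decomposes into per-term contributions plus a term-independent extra component:

```cpp
#include <vector>

// Hypothetical illustration: the final score for a document is the sum of
// one contribution per matching term, plus a single extra component which
// depends only on term-independent statistics (such as document length).
double document_score(const std::vector<double>& term_contributions,
                      double extra) {
    double score = extra;                 // term-independent part
    for (double part : term_contributions)
        score += part;                    // one part per matching term
    return score;
}
```

In the real API, the per-term contributions come from get_sumpart() and the extra component from get_sumextra().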

Currently it is only possible to implement custom weighting schemes in C++. The API could probably be wrapped with a bit of effort, but performance is likely to be disappointing as the get_sumpart() method gets called a lot (approximately once per matching term in each considered document), so the overhead of routing a method call from C++ to the wrapped language will matter.

For example, here’s an implementation of “coordinate matching”, in which each matching term scores one point:

class CoordinateWeight : public Xapian::Weight {
  public:
    CoordinateWeight() { }
    ~CoordinateWeight() { }

    CoordinateWeight * clone() const { return new CoordinateWeight; }

    std::string name() const { return "Coord"; }

    // There are no parameters to serialise.
    std::string serialise() const { return std::string(); }
    CoordinateWeight * unserialise(const std::string &) const {
        return new CoordinateWeight;
    }

    // Each matching term contributes exactly one point.
    double get_sumpart(Xapian::termcount, Xapian::doclength) const {
        return 1;
    }
    double get_maxpart() const { return 1; }

    // There's no term-independent contribution.
    double get_sumextra(Xapian::doclength) const { return 0; }
    double get_maxextra() const { return 0; }

    // The weight doesn't depend on the document length.
    bool get_sumpart_needs_doclength() const { return false; }
};
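To see what this scheme computes, here's a standalone sketch (plain C++ with hypothetical names, no Xapian dependency) of coordinate matching: each query term present in the document scores one point, however often it occurs:

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical illustration of coordinate matching: the score is simply
// the number of query terms which occur in the document.
int coordinate_score(const std::vector<std::string>& query_terms,
                     const std::set<std::string>& document_terms) {
    int score = 0;
    for (const auto& term : query_terms)
        if (document_terms.count(term))
            ++score;    // one point per matching term
    return score;
}
```

To use the real class with Xapian, pass an instance to Xapian::Enquire::set_weighting_scheme() before calling get_mset().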

Implementing a custom weighting scheme that requires statistics

The coordinate matching scheme above doesn't require any statistics. However, you can also implement weighting schemes which require statistics such as the average document length in the database, the query length, or the total number of documents in the collection.

To do this, the subclass needs to tell Xapian::Weight which statistics it requires, by calling the need_stat() method in its constructor for each one. Note, however, that you should only ask for statistics which are actually required, as collecting statistics is expensive. For a full list of the statistics currently available from Xapian::Weight and the enumerators used to request them, please refer to the API documentation.

The subclass can then obtain each statistic by calling the corresponding method of the Xapian::Weight class. For example, the term frequency of the term (the number of documents it occurs in) can be obtained by calling get_termfreq(). For a full list of the methods for obtaining statistics, refer to the xapian/weight.h header file.

As an example, consider a simple pseudo tf-idf weighting scheme which returns the document weight as the product of the within-document frequency (wdf) of the term and the inverse of the term's document frequency (the number of documents the term appears in).

The implementation looks like this:

class TfIdfWeight : public Xapian::Weight {
  public:
    TfIdfWeight() {
        // Ask for the statistics this scheme uses.
        need_stat(WDF);
        need_stat(TERMFREQ);
        need_stat(WDF_MAX);
    }
    ~TfIdfWeight() { }

    TfIdfWeight * clone() const { return new TfIdfWeight; }

    std::string name() const { return "TfIdf"; }

    // There are no parameters to serialise.
    std::string serialise() const { return std::string(); }
    TfIdfWeight * unserialise(const std::string &) const {
        return new TfIdfWeight;
    }

    // Weight for one matching term: its wdf divided by its
    // document frequency.
    double get_sumpart(Xapian::termcount wdf, Xapian::doclength) const {
        Xapian::doccount df = get_termfreq();
        return double(wdf) / df;
    }

    // Upper bound: the maximum wdf gives the largest possible weight.
    double get_maxpart() const {
        Xapian::doccount df = get_termfreq();
        return double(get_wdf_upper_bound()) / df;
    }

    // There's no term-independent contribution.
    double get_sumextra(Xapian::doclength) const { return 0; }
    double get_maxextra() const { return 0; }
};

Note: The get_maxpart() method returns an upper bound on the weight returned by get_sumpart(). In order to do that, it requires the WDF_MAX statistic (the maximum wdf of the term among all documents).
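The arithmetic behind this scheme, and the bound relationship between the two methods, can be sketched standalone (plain C++ with hypothetical names, no Xapian dependency):

```cpp
// Hypothetical illustration of the pseudo tf-idf weight above:
// the per-term weight is wdf / df, and the upper bound returned by
// get_maxpart() corresponds to using the maximum wdf of the term
// across all documents.
double tfidf_weight(double wdf, double df) {
    return wdf / df;
}

double tfidf_max_weight(double max_wdf, double df) {
    return max_wdf / df;
}
```

Since every document's wdf is at most max_wdf, tfidf_weight(wdf, df) never exceeds tfidf_max_weight(max_wdf, df), which is exactly the property the matcher relies on for optimisation.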