Custom Weighting Schemes

You can also implement your own weighting scheme, provided it can be expressed in the form of a sum over the matching terms, plus an extra term which depends on term-independent statistics (such as the normalised document length).
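To make the required form concrete, here is a minimal standalone sketch of "a sum over the matching terms, plus an extra term". It does not use the Xapian API: TermMatch, CoordLikeScheme and score_document are hypothetical names invented for this illustration, and the real matcher is considerably more involved.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for one query term that matched a document.
struct TermMatch {
    unsigned wdf;  // within-document frequency of the term
};

// Hypothetical scheme of the required form: a per-term part and a
// term-independent extra part.
struct CoordLikeScheme {
    // Per-term contribution: one point for each matching term.
    double sumpart(const TermMatch&) const { return 1.0; }
    // Term-independent contribution: nothing for this scheme.
    double sumextra(unsigned /*doclen*/) const { return 0.0; }
};

// Document weight = sum of the per-term parts + the extra part.
template <typename Scheme>
double score_document(const Scheme& scheme,
                      const std::vector<TermMatch>& matches,
                      unsigned doclen) {
    double total = 0.0;
    for (const TermMatch& m : matches) {
        assert(m.wdf > 0);  // a matching term occurs at least once
        total += scheme.sumpart(m);
    }
    return total + scheme.sumextra(doclen);
}
```

Any scheme whose document weight can be computed by a loop of this shape fits the requirement; a scheme that needs, say, the product of per-term values does not.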

Currently it is only possible to implement custom weighting schemes in C++. The API could probably be wrapped with a bit of effort, but performance is likely to be disappointing as the get_sumpart() method gets called a lot (approximately once per matching term in each considered document), so the overhead of routing a virtual method call from C++ to the wrapped language will matter.

For example, here’s an implementation of “coordinate matching”, in which each matching term scores one point (the API already provides this as Xapian::CoordWeight, but it makes an illustrative example of implementing a simple weighting scheme):

class CoordWeight : public Xapian::Weight {
    double factor = 1.0;

  public:
    CoordWeight() { }

    ~CoordWeight() { }

    CoordWeight* clone() const override { return new CoordWeight; }

    void init(double factor_) override { factor = factor_; }

    std::string name() const override { return "coord"; }

    // No parameters to serialise.
    std::string serialise() const override { return std::string(); }

    CoordWeight* unserialise(const std::string&) const override {
        return new CoordWeight;
    }

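    // Called roughly once per matching term per document. The unnamed
    // parameters are the wdf, the document length and the number of
    // unique terms, none of which this scheme needs.
    double get_sumpart(Xapian::termcount,
                       Xapian::termcount,
                       Xapian::termcount) const override {
        return factor;
    }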
    double get_maxpart() const override { return factor; }

    double get_sumextra(Xapian::termcount,
                        Xapian::termcount) const override {
        return 0;
    }
    double get_maxextra() const override { return 0; }
};

Implement a custom weighting scheme that requires various statistics

The coordinate matching scheme given above does not require any statistics. However, custom weighting schemes that need statistics such as the average document length in the database, the query length, or the total number of documents in the collection can also be implemented.

For that, the weighting scheme subclassed from Xapian::Weight needs to tell Xapian::Weight which statistics it will need. This is done by calling the need_stat() method, passing the enumerator for the required statistic, in the constructor of the subclass. Request only those statistics which the scheme actually requires, as collecting statistics is expensive. For the full list of statistics currently available from Xapian::Weight and the enumerators used to request them, please refer to the API documentation.

The subclass can then obtain each statistic by calling the corresponding method of the Xapian::Weight class. For example, the term frequency (the number of documents the term occurs in, sometimes called the document frequency) can be obtained by calling get_termfreq(). For the full list of methods for obtaining statistics, refer to the xapian/weight.h header file.

For example, consider a simple pseudo Tf-Idf weighting scheme in which the weight contributed by each matching term is the product of the within-document frequency (wdf) of the term and the inverse of its term frequency (one divided by the number of documents the term appears in).

The implementation will be as follows:

class PseudoTfIdfWeight : public Xapian::Weight {
    double factor = 1.0;

  public:
    PseudoTfIdfWeight() {
        need_stat(WDF);
        need_stat(TERMFREQ);
        need_stat(WDF_MAX);
    }

    ~PseudoTfIdfWeight() { }

    PseudoTfIdfWeight* clone() const override {
        return new PseudoTfIdfWeight;
    }

    void init(double factor_) override { factor = factor_; }

    std::string name() const override { return "pseudotfidf"; }

    // No parameters to serialise.
    std::string serialise() const override { return std::string(); }

    PseudoTfIdfWeight* unserialise(const std::string&) const override {
        return new PseudoTfIdfWeight;
    }

    double get_sumpart(Xapian::termcount wdf,
                       Xapian::termcount,
                       Xapian::termcount) const override {
        // Weight is wdf divided by the term frequency, scaled by factor.
        Xapian::doccount df = get_termfreq();
        double wdf_double(wdf);
        double wt = wdf_double / df;
        return wt * factor;
    }

    double get_maxpart() const override {
        Xapian::doccount df = get_termfreq();
        double max_wdf(get_wdf_upper_bound());
        double max_weight = max_wdf / df;
        return max_weight * factor;
    }

    double get_sumextra(Xapian::termcount,
                        Xapian::termcount) const override { return 0; }

    double get_maxextra() const override { return 0; }
};

Note: The get_maxpart() method returns an upper bound on the weight returned by get_sumpart(). To compute that bound it needs the WDF_MAX statistic (the maximum wdf of the term among all documents), which is why the constructor requests it.
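The relationship between the two methods can be checked with plain arithmetic. The following standalone sketch mirrors the two calculations above without using the Xapian API; the function names are hypothetical, and it assumes the term frequency is at least one and the factor is non-negative.

```cpp
#include <cassert>

// Mirrors get_sumpart() above: wdf / termfreq, scaled by factor.
double pseudo_tfidf_sumpart(unsigned wdf, unsigned termfreq, double factor) {
    assert(termfreq > 0);  // a term in the query results occurs somewhere
    return static_cast<double>(wdf) / termfreq * factor;
}

// Mirrors get_maxpart() above: the same formula evaluated at the
// largest wdf the term can have in any document.
double pseudo_tfidf_maxpart(unsigned wdf_upper_bound, unsigned termfreq,
                            double factor) {
    assert(termfreq > 0);
    return static_cast<double>(wdf_upper_bound) / termfreq * factor;
}

// Returns true if no permitted wdf gives a weight above the bound.
// Assumes factor >= 0; a negative factor would reverse the ordering.
bool bound_holds(unsigned wdf_upper_bound, unsigned termfreq, double factor) {
    const double bound =
        pseudo_tfidf_maxpart(wdf_upper_bound, termfreq, factor);
    for (unsigned wdf = 0; wdf <= wdf_upper_bound; ++wdf) {
        if (pseudo_tfidf_sumpart(wdf, termfreq, factor) > bound) return false;
    }
    return true;
}
```

For instance, a term with wdf 3 that appears in 4 documents scores 3 / 4 = 0.75 with a factor of 1.0; if the largest wdf of that term in any document is 8, the bound is 8 / 4 = 2.0. Because the weight grows with wdf, evaluating the formula at the maximum wdf is enough to bound it.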