The easiest way to change document scoring is to change, or tune,
the weighting scheme in use. Xapian provides a number of weighting schemes
(the default is BM25Weight), and you can also implement your own.
The BM25 weighting formula which Xapian uses by default has a number of parameters. We have picked some default parameter values which do a good job in general. The optimal values of these parameters depend on the data being indexed and the type of queries being run, so you may be able to improve the effectiveness of your search system by adjusting these values, but it’s a fiddly process to tune them so people tend not to bother.
Say something more useful about tuning the parameters!
BM25 may not properly reward occurrences of a query term in very long documents, so those documents can be overly penalized. In such cases, the BM25+ weighting formula is a useful improvement over BM25. In BM25, the term frequency (TF) normalization has a strict upper bound of k1 + 1. However, the other interesting direction, lower-bounding TF, is not well addressed.
BM25+ was originally proposed by Lv and Zhai in their CIKM 2011 paper, "Lower-Bounding Term Frequency Normalization". BM25+ is derived from BM25 by lower-bounding TF, and uses all of the parameters of BM25 plus one additional parameter, delta (δ). Experiments by Lv and Zhai have shown that BM25+ works very well with δ = 1.
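The upper and lower bounds described above can be illustrated with a minimal sketch of the TF component only. This is not Xapian's implementation: the function names are hypothetical, the parameter values are illustrative rather than Xapian's defaults, and the rest of the formula (the idf-like term weight and the k2, k3 and min_normlen parameters) is left out.

```python
def bm25_tf(wdf, doclen, avg_doclen, k1=1.0, b=0.5):
    """TF component of BM25 (hypothetical helper, illustrative defaults).

    As wdf grows this approaches, but never reaches, k1 + 1.
    For very long documents it shrinks towards 0.
    """
    normlen = doclen / avg_doclen
    return (k1 + 1) * wdf / (k1 * (1 - b + b * normlen) + wdf)


def bm25plus_tf(wdf, doclen, avg_doclen, k1=1.0, b=0.5, delta=1.0):
    """TF component of BM25+: BM25's component plus delta for matching terms."""
    if wdf == 0:
        return 0.0  # a term which doesn't occur still contributes nothing
    return bm25_tf(wdf, doclen, avg_doclen, k1, b) + delta


if __name__ == "__main__":
    # Upper bound: even a huge wdf stays below k1 + 1 = 2.
    print(bm25_tf(10**6, 100, 100))
    # A single match in a document 1000 times the average length:
    # BM25 gives it almost no credit, BM25+ gives it at least delta.
    print(bm25_tf(1, 100000, 100), bm25plus_tf(1, 100000, 100))
```

With δ = 1, a matching term in a very long document always scores at least 1 in this component, which is the lower bound that BM25 lacks.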
PL2Weight implements the representative scheme of the Divergence from Randomness Framework. This weighting scheme is useful for tasks that require early precision. It uses the Poisson approximation of the binomial distribution (P), the Laplace method to find the after-effect of sampling (L), and the second wdf normalization to normalize the wdf in the document to the length of the document (H2).
Document weight is controlled by parameter c. The default value of 1 for c is suitable for longer queries but it may need to be changed for shorter queries.
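To show where c enters, here is a rough sketch of the H2 normalization step on its own, using the form usually given in the Divergence from Randomness literature: wdfn = wdf * log2(1 + c * avg_len / doclen). The function name is hypothetical and this is only one part of PL2, not the full weight that PL2Weight computes.

```python
import math


def h2_normalized_wdf(wdf, doclen, avg_doclen, c=1.0):
    """Second (H2) wdf normalization (hypothetical helper).

    Scales the wdf according to how the document's length compares to the
    average document length; c controls how strong that effect is.
    """
    return wdf * math.log2(1 + c * avg_doclen / doclen)


if __name__ == "__main__":
    # A document of exactly average length with c = 1 keeps its wdf.
    print(h2_normalized_wdf(3, 100, 100))
    # The same wdf in a document four times as long counts for less.
    print(h2_normalized_wdf(3, 400, 100))
    # Raising c raises the normalized wdf for the same document.
    print(h2_normalized_wdf(3, 400, 100, c=5.0))
```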
Proposed by Lv and Zhai, PL2PlusWeight is the modified, lower-bounded PL2 retrieval function of the Divergence from Randomness Framework. It takes an additional parameter, delta, as well as the parameter c from the PL2 weighting function.
Parameter delta is the pseudo tf value which controls the scale of the tf lower bound. It can be tuned, for example from 0.1 to 1.5 in increments of 0.1 or so, although PL2+ works effectively across collections with a fixed default value of 0.8.
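A tuning sweep over delta could be organised along the following lines. This is only a sketch: `evaluate` is a hypothetical callback which you would have to write yourself, for example one which runs a set of test queries using the candidate value and returns an effectiveness measure for which higher is better.

```python
def delta_candidates(start=0.1, stop=1.5, step=0.1):
    """Candidate values from start to stop inclusive (hypothetical helper)."""
    count = int(round((stop - start) / step)) + 1
    # Round to hide floating point noise such as 0.30000000000000004.
    return [round(start + i * step, 10) for i in range(count)]


def pick_best(candidates, evaluate):
    """Return the candidate for which evaluate() gives the highest score."""
    return max(candidates, key=evaluate)


if __name__ == "__main__":
    # Stand-in for a real evaluation: this toy function simply peaks at 0.8.
    def toy_score(delta):
        return -abs(delta - 0.8)

    print(delta_candidates())
    print(pick_best(delta_candidates(), toy_score))
```

Results from a sweep like this only mean something if the test queries and relevance judgements are representative of your real workload.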
An important aspect of language model-based weighting is that, since not all terms appear in all documents (and hence the wdf of some terms is zero with respect to a given document), we have to employ smoothing to avoid problems.
Xapian provides four different smoothing types, which take further parameters to control the effects of smoothing; we have picked some default parameter values which do a good job, using two stage smoothing.
The UnigramLM weighting formula is based on an original approach by Bruce Croft. It uses statistical language modelling; ‘unigram’ in this case means that words are considered to occur independently.
The Dirichlet prior method is one of the best performing language modeling approaches. Xapian now provides support for a modified Dirichlet prior method, Dir+, which improves on the original and is particularly effective across web collections with very long documents (where the document length is much larger than the average document length).
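The reason smoothing is needed can be seen in a small sketch using the textbook form of Dirichlet prior smoothing. This is not Xapian's code: the function names are hypothetical, and the value of mu is purely illustrative.

```python
def unsmoothed_prob(wdf, doclen):
    """Maximum likelihood estimate of P(term | document)."""
    return wdf / doclen


def dirichlet_smoothed_prob(wdf, doclen, collection_prob, mu=2000.0):
    """Textbook Dirichlet prior smoothing (hypothetical helper).

    collection_prob is P(term | collection); mu is the smoothing parameter
    (the value here is illustrative only).
    """
    return (wdf + mu * collection_prob) / (doclen + mu)


if __name__ == "__main__":
    # A query term which doesn't occur in the document: without smoothing
    # its probability is 0, which would zero the product over query terms.
    print(unsmoothed_prob(0, 100))
    # With smoothing the term gets a small non-zero probability instead.
    print(dirichlet_smoothed_prob(0, 100, 0.001))
```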
TfIdfWeight implements support for a number of SMART normalization variants of the tf-idf weighting scheme. These normalizations are specified by a three character string, with one character each for the wdf, idf and weight normalizations.
Each normalization is specified by the first character of its name:
- wdf normalization (first character):
  - “n” (none): wdfn = wdf
  - “b” (boolean, or sometimes binary): wdfn = 1 if term is present in document else 0
  - “s” (square): wdfn = wdf * wdf
  - “l” (log): wdfn = 1 + ln (wdf)
  - “P” (pivoted): wdfn = (1+log(1+log(wdf)))*(1/(1-slope+(slope*doclen/avg_len)))+delta [not in 1.4.x]
- idf normalization (second character):
  - “n” (none): idfn = 1
  - “t” (tfidf): idfn = log (N / Termfreq) where N is the number of documents in collection and Termfreq is the number of documents indexed by the term
  - “p” (prob): idfn = log ((N - Termfreq) / Termfreq)
  - “f” (freq): idfn = 1 / Termfreq
  - “s” (squared): idfn = log (N / Termfreq) ^ 2
  - “P” (pivoted): idfn = log ((N + 1) / Termfreq) [not in 1.4.x]
- weight normalization (third character):
  - “n” (none): wtn = wdfn * idfn
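As a rough illustration of how a three character string selects the formulae listed above, here is a pure Python sketch. It is not how TfIdfWeight is implemented; the function names are hypothetical, the pivoted variants are left out, and "s" for idf is assumed to mean the square of the logarithm.

```python
import math


def wdf_normalization(code, wdf):
    """wdfn for the given wdf normalization character."""
    if code == "n":
        return float(wdf)
    if code == "b":
        return 1.0 if wdf > 0 else 0.0
    if code == "s":
        return float(wdf * wdf)
    if code == "l":
        # Assumption: a term which isn't present gets 0 rather than ln(0).
        return 1.0 + math.log(wdf) if wdf > 0 else 0.0
    raise ValueError("unsupported wdf normalization: " + code)


def idf_normalization(code, termfreq, n_docs):
    """idfn for the given idf normalization character."""
    if code == "n":
        return 1.0
    if code == "t":
        return math.log(n_docs / termfreq)
    if code == "p":
        # Only defined when termfreq < n_docs.
        return math.log((n_docs - termfreq) / termfreq)
    if code == "f":
        return 1.0 / termfreq
    if code == "s":
        return math.log(n_docs / termfreq) ** 2
    raise ValueError("unsupported idf normalization: " + code)


def tfidf_weight(normalizations, wdf, termfreq, n_docs):
    """Weight of one term in one document for a string such as "ntn"."""
    wdf_code, idf_code, wt_code = normalizations
    if wt_code != "n":
        raise ValueError("unsupported weight normalization: " + wt_code)
    return wdf_normalization(wdf_code, wdf) * idf_normalization(
        idf_code, termfreq, n_docs)


if __name__ == "__main__":
    # A term occurring 3 times in a document, indexing 10 of 1000 documents.
    for scheme in ("ntn", "bnn", "ltn", "nfn"):
        print(scheme, tfidf_weight(scheme, 3, 10, 1000))
```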
A more recently supported normalization in TfIdfWeight is the pivoted (piv+) retrieval function, which represents one of the best performing vector space models. Piv+ takes two parameters, slope and delta, which are set by default to their optimal values. You may want to pass different candidate values ranging from 0.1 to 1.5 and choose the one which best fits your system, based on the corpus being used. Piv+ isn’t supported by 1.4.x; it’s only in git master (and will be in the next release series). It’s hard to backport because the two new parameters need to be stored by the TfIdfWeight class.
TradWeight implements the original probabilistic weighting formula, which is essentially a special case of BM25 (it’s BM25 with k2 = 0, k3 = 0, b = 1, and min_normlen = 0, except that all the weights are scaled by a constant factor).
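The "scaled by a constant factor" relationship can be checked numerically with a small sketch of just the wdf-dependent part of each formula. The forms used here are assumptions made for illustration, not Xapian's code: the traditional formula is taken to be wdf / (k * normlen + wdf), and its k is assumed to equal BM25's k1. The function names are hypothetical.

```python
def bm25_wdf_part(wdf, normlen, k1=1.0, b=1.0):
    """wdf-dependent part of BM25 (with k2 = 0 and k3 = 0 the rest drops out)."""
    return (k1 + 1) * wdf / (k1 * (1 - b + b * normlen) + wdf)


def trad_wdf_part(wdf, normlen, k=1.0):
    """wdf-dependent part of the traditional formula (assumed form)."""
    return wdf / (k * normlen + wdf)


if __name__ == "__main__":
    # normlen is the document length divided by the average document length.
    for wdf, normlen in [(1, 0.5), (3, 1.0), (10, 4.0)]:
        ratio = bm25_wdf_part(wdf, normlen) / trad_wdf_part(wdf, normlen)
        # Under the assumptions above, the ratio is k1 + 1 in every case.
        print(wdf, normlen, ratio)
```

Since the ratio doesn't depend on wdf or on the document length, the two formulae rank documents identically under these assumptions.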
This needs writing; it’s also somewhat esoteric, and perhaps should be an advanced document or at least down-played.